
4: Deep Learning for Computer Vision – Transfer Learning and Fine-Tuning; Intro to HuggingFace

MIT OpenCourseWare · May 11, 2026
Transcript ~13861 words · 1:16:22
0:16
Right folks, good morning.
0:19
Welcome back. I hope you all had a nice
0:21
weekend.
0:22
Uh, and I hope you had a chance to watch
0:24
the video walk-through I posted
0:26
yesterday. Um, it's going to save us
0:28
some time today. So, let's get right in.
0:31
Today is going to be super packed. Um,
0:33
you're going to go from not knowing
0:35
anything about convolutions perhaps for
0:36
some of you to actually knowing how
0:38
convolutional networks work and actually
0:39
building one and demoing it in class, okay?
0:42
And uh, this demo has actually worked
0:44
pretty well for the last few years that
0:45
I've taught the class, but you never
0:47
know because it's a live demo, it may
0:48
not work. We'll see.
0:50
Um,
0:51
Valentine's Day gods, may they
0:53
be with us.
0:54
Okay, so let's get going. So, Fashion
0:56
MNIST we saw previously, um, as in,
1:00
you know, in the walk-through,
1:01
the video walk-through, that a neural
1:03
network with a single hidden
1:05
layer can get us to an accuracy in
1:08
the high 80s, okay? Uh, and that
1:11
network actually didn't know that
1:14
what was coming in was an image, right?
1:16
It literally took this table of numbers
1:18
and just took each row and then
1:19
concatenated all the rows into one giant
1:21
long vector and then sent it in. So, the
1:23
neural network didn't exploit the fact that
1:25
the input data was sort of known to be
1:27
of a certain type, okay? Which is the
1:30
clue for how we can do better?
1:32
Right? So, let's just spend a few
1:35
minutes on what it is about images
1:38
that we have to really pay attention to,
1:40
okay? As opposed to any arbitrary vector
1:42
of numbers that's coming in.
1:44
Okay? So, when we flatten the image into
1:47
a long vector and feed it into a dense
1:49
layer,
1:50
several undesirable things can actually
1:52
happen.
1:55
What are some of them? Any any guesses?
2:00
Uh, yeah.
2:02
I think you lose the proximity of one
2:04
pixel to other ones that would be around
2:06
it.
2:07
Right. So, if you take a particular
2:08
pixel, then let's say that the picture
2:11
shows a t-shirt, um, if there's a little
2:13
pixel at in the center of the t-shirt,
2:15
knowing that the surrounding pixels are
2:17
related to the pixel in a way because
2:19
they are all part of this concept called
2:21
a t-shirt, would certainly be helpful,
2:23
right? So, so to put it more
2:25
technically, spatial adjacency
2:28
information is very important. And we
2:30
need to somehow take that into account.
2:32
Okay? Um, all right. What else? What
2:34
else might be going on here?
2:38
Uh,
2:40
Yeah,
2:41
you have some metadata about it like the
2:43
resolution and things like that.
2:46
Oh, I see. So, if you actually had
2:47
structured data about the image such as,
2:50
you know, various characteristics, that
2:51
might be helpful. True. Now, but let's
2:54
just focus on the case where you only
2:55
have the raw image and nothing else.
2:57
And under that constraint, what else
3:00
might go wrong?
3:02
Or what else might be suboptimal?
3:08
Okay. Well, the first thing that might
3:10
happen is that
3:12
we may have too many parameters.
3:15
So, let's take an example. These
3:17
numbers are from my, you know,
3:18
older iPhone. Uh, I noticed that
3:21
when I take a color picture with my
3:22
phone, it's roughly a 3,000 by 3,000
3:27
grid, right? So, the picture is actually
3:30
3,024 pixels on this axis, 3,024 on that
3:34
axis, okay? So, that gets us to roughly
3:37
9 million pixels, but remember it's a
3:40
color picture, which means there are
3:41
three channels,
3:43
which means there are 27 million
3:45
numbers,
3:46
each of which is between 0 and 255 from
3:49
that little picture, okay? And now let's
3:51
say we connect it to a single
3:54
100 neuron dense layer.
3:57
A single 100 neuron dense layer. How
3:59
many parameters are we going to have?
4:00
Just in that one little part of the
4:01
network.
4:07
Could the mumbling be louder?
4:10
Yes, roughly 2.7 billion because 27
4:13
million inputs times 100,
4:15
right? Roughly, of course. Forget about
4:17
the biases for a moment, right? It's 2.7
4:19
billion.
4:21
2.7 billion parameters,
4:23
right? Do you think we can actually get
4:25
2.7 billion images to train any of these
4:27
things?
4:29
So, then you're going to overfit.
4:32
Right? Too many parameters. We have to
4:33
be smarter about this.
4:35
It's not going to work.
4:36
Right? That's the first problem.
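A quick sanity check of that arithmetic (a minimal sketch; the 3,024-pixel dimensions are the ones quoted above):

```python
# Back-of-the-envelope: flattened iPhone photo into a 100-neuron dense layer.
height, width, channels = 3024, 3024, 3    # ~9 million pixels, 3 channels
inputs = height * width * channels         # ~27.4 million input numbers
weights = inputs * 100                     # one weight per input per neuron,
print(f"{weights:,}")                      # ignoring biases: 2,743,372,800
```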
4:39
So, this is clearly computationally
4:41
demanding, very data hungry, and
4:43
increases the risk of overfitting.
4:45
Okay?
4:46
Next,
4:49
we lose spatial adjacency.
4:51
Right? We literally are ignoring what's
4:52
nearby.
4:55
So, that's a huge factor. There's a
4:57
third factor,
4:58
right? That we have to worry about,
5:01
which is that
5:02
let's say that, you know, the picture
5:04
has a vertical line
5:06
on the top left side and it has
5:08
some other vertical line on the bottom
5:09
right side.
5:12
What this sort of dumb approach is going
5:14
to do
5:15
is it's going to learn to
5:16
detect that vertical line on the top
5:18
left and, independent of
5:20
that, it's going to learn to detect the
5:21
vertical line on the bottom right.
5:24
Okay? Which doesn't make any sense.
5:26
A vertical line is a vertical
5:27
line. So, you want to be able to detect
5:29
it wherever it happens.
5:31
Detect once, reuse everywhere.
5:33
That's what you need to do.
5:35
So, this, by the way, is called
5:36
translation invariance.
5:38
Translation is math speak for move stuff
5:40
around.
5:41
Right? You take a line and it moves
5:42
around,
5:43
it doesn't matter, it's still a line.
5:45
Let's figure it out.
5:47
So, these are the three things we
5:48
need to worry about. So, we want to
5:50
learn once and use all over the place.
5:53
We want to take spatial adjacency into
5:55
account, number two. And number three,
5:56
let's just find a way to make sure that
5:58
we don't have billions of parameters for
5:59
simple toy problems.
6:02
Any questions?
6:05
Yep.
6:07
Um, is this a problem
6:09
just because we are compressing the
6:11
image or would it have happened anyway?
6:14
It would have happened anyway. So, the question
6:15
was is it a problem because we are
6:16
compressing the image, or would it
6:18
have happened anyway? The
6:19
answer is it would have happened anyway.
6:20
You can take any picture, this is going
6:22
to happen, right? Because I'm not making
6:24
any assumptions about how the image is
6:26
coming in to me,
6:27
whether it's compressed or not and so on
6:28
and so forth.
6:31
Okay. All right.
6:33
So, convolutional layers
6:36
were developed to precisely address
6:38
these shortcomings, and they're an amazing
6:40
solution, as you will see. Very elegant.
6:45
All right.
6:45
So, the next, I don't know, half an hour
6:49
is going to be me defining a whole bunch
6:51
of stuff
6:52
before we actually get to the fun
6:53
Colabs and so on and so forth.
6:55
Um, so just to put in perspective, I I
6:57
have a PowerPoint,
6:59
two Colabs,
7:01
and an Excel spreadsheet, and maybe even
7:03
a Notability file to cover today.
7:06
Okay? So, but hang on for the next 30
7:08
minutes because it's going to be a
7:09
little concept heavy
7:10
before we get to the fun stuff. So, stop
7:12
me, ask me questions because we do have
7:14
time.
7:15
All right. A convolutional layer is made
7:17
up of something called a convolutional
7:18
filter.
7:20
Okay? That's the atomic building block.
7:22
A convolutional filter is nothing but
7:24
a small matrix of numbers like this.
7:28
It's just a small square matrix of
7:29
numbers. That's a convolutional filter,
7:31
okay? Now,
7:33
a layer is just composed of one or more
7:35
of these filters.
7:38
All right?
7:39
Filters and layers.
7:41
Now,
7:42
the thing about the convolutional filter
7:44
that makes it really magical
7:46
is that if you choose the numbers in a
7:48
filter carefully
7:50
and then you apply the filter to an
7:52
image, and I'll get to what I mean by
7:53
applying the filter,
7:56
if you choose the numbers carefully and
7:57
you apply to that image,
7:59
this little humble thing has the ability
8:02
to detect features in your image.
8:04
It can detect lines, curves, gradations
8:07
in color, circles, things like that,
8:09
okay? It's pretty cool.
8:11
And so,
8:12
I'm going to claim and I'm going to
8:14
prove shortly that this little humble
8:15
filter with the ones and zeros, it can
8:17
detect horizontal lines in any picture
8:19
you give it.
8:21
Okay?
8:22
This thing here has the
8:23
ability to detect vertical lines.
8:27
All right? So, I will demonstrate how
8:28
this thing actually detects all these
8:30
things and then we will ask the big
8:33
question that's probably in your minds
8:34
already, where are we going to get these
8:35
numbers from?
8:37
That all sounds great, Rama. Where are
8:39
we going to get the numbers from? Okay?
8:41
And we have a beautiful answer to that
8:42
question.
8:43
All right. So, let's go. Um, now I'm
8:46
going to first explain to you what I
8:47
mean by applying a filter to an image
8:50
and then I'm going to give you examples
8:52
of how the filter works for detecting
8:54
vertical and horizontal lines. So, all
8:56
right. So, let's say that this is the
8:58
image we have.
9:00
Okay? Again, an image. Assume it's a
9:02
grayscale image. So, you just have a
9:04
bunch of numbers between 0 and 255,
9:06
okay? So, this is the image
9:07
we have. It's a little tiny image.
9:09
And this is the filter that's been
9:10
magically given to us by somebody.
9:13
And what we are trying to do now is to
9:14
apply it, okay? So, what we do is that
9:17
we literally take this filter,
9:19
the little one, and then we superimpose
9:22
it on the top left part of the image.
9:24
So, you have the image here, you take
9:26
this little filter, and then you move it
9:28
to the top left so that they are sort of
9:30
right on top of each other.
9:32
Okay?
9:33
Once you have it right on top of each
9:34
other,
9:35
you have these matching numbers. You
9:37
have nine numbers in the image, there
9:39
are nine numbers in the filter, and
9:41
they're all matching each other right on
9:42
top of each other, right? So, you have
9:44
nine pairs of numbers.
9:46
And then what we do, once we overlay it,
9:48
we literally just multiply all the
9:50
matching numbers and add them up.
9:53
Okay? You just multiply all the matching
9:55
numbers and add them up, and you can confirm
9:57
later on that, you know, the
9:58
arithmetic I'm doing is actually
9:59
accurate. Okay?
10:01
And once you do that, you'll get some
10:03
number.
10:04
Right?
10:05
Um
10:06
once you get that number
10:09
what we do is we go to our good old
10:11
friend the ReLU
10:12
and then we just run it through a ReLU.
10:15
Now, in this case all that effort comes
10:16
to nothing because it's zero. It's okay.
10:19
Okay? So, zero and this number becomes
10:22
the top left cell of your output.
10:26
So, this is called the convolution
10:28
operation.
10:29
Okay?
10:30
And we won't get into why it's called
10:31
that and so on and so forth. There's a
10:32
long and rich and storied history of
10:34
these things.
10:35
But this is the convolution operation.
10:38
And once we do that you sort of can now
10:40
predict what's going to happen, right?
10:42
We take the same exact operation and we
10:44
just move it to the right.
10:46
We move this little 3 by 3 thing to the
10:48
right and repeat the exact same process.
10:51
Matching numbers
10:53
you know, multiply all
10:54
the matching numbers together, add them
10:55
up, run them through a ReLU.
10:58
Okay?
10:59
And then boom, you get the
11:01
second number here.
11:03
And you keep doing that till you reach
11:05
the very end. You fill up all these
11:07
numbers, then you come to
11:08
the top of the second row.
11:11
Okay?
11:12
And you keep on doing that till you
11:14
reach the very bottom.
11:16
So, this is what I mean when I say apply
11:18
a filter to an image.
11:21
Okay?
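As a minimal sketch of the mechanics just described (not the lecture's code; stride 1, no padding, assuming a 2D grayscale image):

```python
import numpy as np

def conv2d_relu(image, kernel):
    """Slide a k x k filter over a 2D image: at each position,
    multiply the matching numbers, add them up, run through a ReLU."""
    H, W = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            window = image[i:i + k, j:j + k]               # superimpose
            out[i, j] = max(0.0, np.sum(window * kernel))  # multiply, add, ReLU
    return out
```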
11:22
Any questions?
11:25
Okay.
11:29
Microphone, please.
11:31
Microphone.
11:35
What happens when
11:36
you reach the edge of the image
11:38
and you have to stop,
11:39
but the remaining part of
11:42
the filter doesn't perfectly match?
11:44
Yeah, so you start from the left and
11:46
then you keep on going. At some point
11:47
the right edge of the filter is going to
11:49
match the right edge of the image and
11:51
then you stop.
11:52
Yeah. Now, there are some nuances here.
11:55
So, for example, you can actually pad
11:58
the whole image
11:59
on its borders so that you can actually
12:01
go outside the image and it'll still
12:03
work.
12:04
Okay? Number one. Number two, nuance.
12:08
Instead of just moving one step to the
12:10
right every time you finish, you can
12:11
move two steps to the right.
12:13
Right? And that's something called a
12:15
stride. Okay? So, there are a bunch of
12:17
pesky details here. But I'm just
12:20
ignoring them because this basic default
12:22
approach works amazingly well
12:24
almost all the time.
12:27
Okay? All right. So, that's
12:29
the mechanics of how this
12:31
operation works. Um all right. Now, I'm
12:33
going to switch to a spreadsheet which
12:35
shows this really beautifully
12:37
courtesy of the fast.ai people.
12:41
All right. So, it's a big spreadsheet, so
12:43
I'll upload the spreadsheet after class
12:44
so you can
12:45
see it. So, all I have done here, rather,
12:48
all they have done here
12:50
thanks to them, is that they have
12:51
essentially created a table of numbers
12:53
in Excel as you can tell.
12:55
And they have just put some numbers.
12:57
Most of the numbers are zero. But
12:59
some of these numbers are more than
13:01
zero. They're like 0.8, 0.9 and so on.
13:03
Basically, all they have done is instead
13:04
of working with numbers between zero and
13:06
255, they're just dividing all the
13:08
numbers by 255 so you get fractions and
13:10
they just put the fractions in the
13:11
table. Okay? And then then they have
13:13
used Excel's very cool conditional
13:15
formatting
13:16
to essentially mark in red all the
13:19
values that are high. Right? The closer the
13:21
number is to one, the more
13:23
reddish it gets.
13:24
Okay? And when you do that the three
13:26
obviously pops out.
13:28
So, there is a three in the image. Yes?
13:31
Okay, good. So, now
13:33
what we're going to do is we're going to
13:35
move to our little filter here.
13:37
You can see the filter.
13:39
Right? And I'm claiming this detects
13:41
horizontal lines. And so and this table
13:44
here
13:47
Sorry.
13:51
This table here is the result of
13:53
applying that filter to the three.
13:56
Okay? And you can see here I'm looking
13:58
at the top left cell here.
14:01
Um
14:03
This is
14:03
Look at this top left cell. The formula
14:05
is nothing more than
14:07
you know, multiply all those things and
14:08
add them up. And then once you add it
14:10
up, run it through a max of zero comma
14:12
that, which is just the ReLU.
14:15
Okay? Basic arithmetic.
14:18
So, we do that.
14:19
And this is the output and the output is
14:21
also conditionally formatted to show you
14:24
where things are lighting up.
14:26
And you can see only the horizontal
14:30
lines of the three are lighting up.
14:34
Everyone see that?
14:35
Right?
14:36
So, now you understand the
14:38
filter in fact is living up to the claim
14:41
I made for it.
14:42
Right? Similarly,
14:44
if you look at what's going on here,
14:46
this is a vertical filter, the same
14:47
thing, you apply it, only the vertical
14:50
line is lighting up.
14:53
Right? Now, what you can do is
14:56
uh I would encourage you to do this, you
14:57
know, um after class, is you can look at
15:00
all these numbers here, for example, and
15:02
then ask yourself, "Okay, why is that
15:04
lighting up?"
15:06
Right? And you will discover that what's
15:08
actually going on is that it's looking
15:11
for edges.
15:12
It's looking for, you know,
15:14
rows in the table where
15:16
there is some nonzero thing in the first
15:18
row and zeros in the second row.
15:21
And by choosing the numbers carefully,
15:23
you multiply the ones with positive
15:25
numbers and you multiply the zeros with
15:27
zeros and then you'll come up with a
15:29
positive number and thereby you detect
15:31
an edge.
15:32
Right? So, what I would encourage you to
15:34
do is use this Excel thing here.
15:39
All right. So, here is a cell we
15:41
have. So, let's uh trace its
15:48
provenance.
15:49
Okay.
15:51
So, you can see here
15:53
these numbers
15:56
Right? This is what it's processing.
15:59
Right? This grid is being
16:00
processed to come up with that big
16:01
number. And you can see here in this
16:04
grid, these numbers are
16:06
here, and then these numbers are a lot
16:08
lower than those numbers because
16:11
there is an edge.
16:13
Right? The numbers are a lot lower.
16:14
That's why you can see the horizontal
16:16
part of the three.
16:17
And so, what this filter is doing, it's
16:19
basically saying, "Well,
16:22
the row that I'm catching here has the
16:24
ones, the middle has zeros, the rest are
16:26
all minus ones."
16:27
Right? So, the small values are going to
16:29
get very small.
16:31
The big values are going to get very big
16:33
and the overall thing is going to be
16:34
emphasized.
16:35
So, that's the basic idea of edge
16:37
detection.
16:38
Spend some time with the Excel
16:39
and it'll become clear to you
16:41
what I'm talking about here.
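For instance (a sketch using the conv2d_relu helper sketched earlier; the filter values are the classic horizontal-edge pattern described here, not the spreadsheet's exact numbers):

```python
horizontal_edge = np.array([[ 1,  1,  1],
                            [ 0,  0,  0],
                            [-1, -1, -1]])
img = np.zeros((6, 6))
img[2, :] = 1.0                            # one bright horizontal line
print(conv2d_relu(img, horizontal_edge))   # lights up only along that edge:
                                           # the bright row meets the +1s, the
                                           # dark row below meets the -1s, so
                                           # the sum comes out large and positive
```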
16:43
All right, cool. So, that's that.
16:46
All right. Uh, by the way, there is also
16:48
a very cool little site
16:49
here
16:50
in which you can actually go in and
16:52
punch in your own numbers and see what
16:53
it detects.
16:55
Right? Lots of edges and curves and this
16:56
and that. It's very cool. So, I
16:58
encourage you to try it out.
17:00
So, the key thing here I want to say is
17:06
by choosing the numbers in a filter
17:08
carefully and applying this operation
17:10
different different features can be
17:12
detected. All right.
17:13
Now,
17:14
I mentioned earlier that a convolution
17:16
layer is composed of one or more of
17:18
these filters. So, one or more of these
17:20
filters. And so, you can think of each
17:23
filter as a sort of a specialist for a
17:25
particular feature.
17:27
Okay? So, it's a specialist. Maybe it it
17:30
specializes in detecting vertical lines,
17:32
horizontal lines, you know, uh
17:34
semicircles, quarter circles, you don't
17:35
know. Right? You can imagine them
17:38
as being specialists.
17:39
And given that modern images could be
17:42
very complicated, they may have lots of
17:43
interesting features going on, you
17:45
probably want to have lots of these
17:46
filters.
17:48
Okay? But the key is that you
17:52
don't have to decide up front, "Hey, you
17:54
filter, you better specialize in
17:56
detecting vertical lines, and you, on the
17:57
other hand, stay in your lane and do
18:00
horizontal lines." Right? You're not going
18:01
to do that.
18:02
You will let the system figure out what
18:04
it wants to figure out.
18:06
Okay? So, there is no human bottleneck
18:08
in doing this.
18:10
And I mentioned this because there used
18:11
to be a human bottleneck, you know,
18:13
before deep learning happened.
18:15
And so,
18:17
Now, let's just um make sure we
18:19
understand the mechanics of what happens
18:20
when you have two of these filters, not
18:22
one. So, this is the input image as
18:24
before. This is the filter we saw
18:26
earlier and this is another filter we
18:28
have.
18:29
The thing is we just run them in
18:30
parallel. We take each filter, do the
18:32
operation, come up with an output. Take
18:33
the other filter, do the operation, come
18:34
up with its output. And then when you do
18:36
that, the first one gives you that, the
18:38
second one gives you that. And this
18:40
output is a table of... well, it's
18:42
actually not a table. What
18:44
is it?
18:49
Louder, please.
18:51
It's a tensor. Thank you. It's a tensor.
18:54
And so, these two 5 by 5 matrices can be
18:56
represented as a tensor of what shape?
19:02
And there are two right answers.
19:04
5 by 5
19:06
into two, correct. So, you can
19:08
either think of it as 5 by 5 * 2 or 2 *
19:11
5 by 5. They're both fine.
19:14
Which one you go with actually ends
19:15
up being a matter of convention.
19:18
Okay? So, now you begin to see why we
19:20
care about tensors.
19:22
Imagine if instead of having two
19:24
filters, we have 103 filters.
19:27
The resulting tensor is going to be 5 by
19:29
5 by 103.
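You can verify this shape bookkeeping directly in Keras (a sketch; the 7 by 7 input size is just for illustration, and Keras uses the channels-last convention, i.e. 5 by 5 by 103 rather than 103 by 5 by 5):

```python
import keras
from keras import layers

inputs = keras.Input(shape=(7, 7, 1))                  # tiny grayscale image
x = layers.Conv2D(filters=103, kernel_size=3)(inputs)  # 103 filters, 3x3 each
print(x.shape)                                         # (None, 5, 5, 103)
```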
19:33
Okay.
19:34
Good.
19:35
Um all right. Now,
19:37
let's now look at the slightly more
19:39
complex situation where you have not a
19:42
black and white image, a grayscale image
19:44
with just a little table, but an actual
19:46
color image.
19:48
Okay? So, we know how to apply a
19:51
filter to a 2D tensor like this and to
19:54
get that. But let's say we have
19:56
something like this where it has
19:58
three, right? It's got three channels,
20:00
red, green, blue, RGB. It's got three
20:02
tables of numbers.
20:03
So, this is a tensor of shape 6 * 6 *
20:06
3, let's say, and you want to apply this
20:08
3 by 3 filter just like before to this
20:11
thing. You want to apply the convolution
20:12
operation. How's that going to work?
20:18
Do we just like apply this to each
20:21
We first apply it to the red, then we
20:23
apply it to the green, then we
20:25
apply it to the blue. Should we do that?
20:30
Or is there a
20:31
a problem with that approach?
20:36
Yeah.
20:39
Could you use the microphone, please?
20:42
Uh the problem with the approach, I
20:43
think, would be the same as what you
20:45
said earlier, that it would learn the
20:47
lines probably the same each channel,
20:49
right?
20:50
Like the location of the lines are
20:51
probably the same each channel.
20:54
Yes, the location of the line is going
20:55
to be the same thing because that line,
20:57
if you will, is sort of the
20:59
aggregation of information from the
21:00
three different channels. Right. But the
21:03
problem here
21:05
is sort of slightly different,
21:07
which is that
21:09
If you do them independently,
21:12
the network has not been informed that
21:15
these things are all part of the same
21:17
underlying concept.
21:19
As far as it's concerned, it's just like
21:21
three things. It's just going to process
21:22
them independently. So, we need to
21:23
somehow change the filter so that it
21:25
understands like what is at this pixel
21:27
location, the three numbers under it,
21:29
RGB, they're actually the same part of
21:31
the same thing, underlying thing.
21:35
So, what we do is actually very simple.
21:37
We just take this filter and make it 3D.
21:42
So, we take this filter, instead of
21:44
having just one of them, we just make it
21:45
a cube like that. Three times.
21:49
And once we do that, you can imagine
21:51
taking this thing here and essentially
21:53
doing that.
21:56
Okay. Now, instead of having, you know,
21:58
nine numbers in the image and nine
22:00
numbers in the filter,
22:01
you have 27 numbers in the image, 27
22:04
numbers in the filter.
22:05
But you still match them up, multiply
22:07
them, add them up, run them through a
22:09
ReLU.
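At a single position, the arithmetic for the color case looks like this (a sketch with made-up numbers):

```python
import numpy as np

patch  = np.random.rand(3, 3, 3)   # one 3x3 window of an RGB image: 27 numbers
kernel = np.random.rand(3, 3, 3)   # the filter is now 3x3x3: 27 numbers too
value  = max(0.0, np.sum(patch * kernel))   # match up, multiply, add, ReLU
```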
22:14
By the way, I tried to get ChatGPT to
22:16
give me a picture like that.
22:19
It just completely bombed.
22:21
I tried like three, four, five different
22:22
variations. It just gave up. And then I
22:24
found this nice picture in the
22:25
deeplearning.ai and I used it.
22:28
So, then if you put different numbers in
22:30
each of the layers, is that like color
22:32
processing? Like it could be doing a
22:33
different thing to green and blue. I'm
22:36
sorry, say that again. If you put
22:37
different numbers in each of the layers
22:39
of your kernel, in each of the
22:42
different like depth dimensions of your
22:43
convolution filter, would that be like
22:45
color processing?
22:47
Uh, yeah.
22:49
Yeah, you will put different numbers. In
22:50
fact, you have 27 numbers now,
22:53
but we haven't gotten to the question of
22:54
where these numbers are coming from. So,
22:55
just hold the thought till we get there.
22:58
Okay. Um so, any questions on this?
23:02
Okay. You literally take the 2D thing
23:04
and make it 3D.
23:05
You basically give it depth and the
23:08
depth just matches the depth of the
23:10
input.
23:11
So, if the input is like, you know, 10
23:13
deep, your filter is going to get 10
23:15
deep.
23:18
Okay?
23:20
Yes.
23:22
Rather than
23:24
increasing the rank order of the tensor
23:26
by one, is there any instance where you
23:27
would create a subtraction layer where
23:29
you would run an operation across the
23:30
different layers to come up with a
23:33
intermediary layer that you would run a
23:35
lower rank tensor of a filter over?
23:38
Yeah, so there is a lot of stuff in the
23:40
research literature which tries to do
23:42
things like that. Uh I'm just describing
23:45
like the the the most basic approach to
23:48
doing this. And as it turns out, this
23:50
basic approach is actually extremely
23:51
powerful, right? And of course, uh
23:54
researchers try to, you know, go from
23:56
the 95% thing to 95.1%.
23:59
So, they invent like all sorts of crazy
24:01
complicated stuff, which is all good for
24:02
us, humanity, but for practical use,
24:04
this is good enough.
24:08
How do you convert the 3 by 3 layer into
24:10
a single 4 by 4 layer? 4 by 4 is
24:12
understood, but what about the 3 layers?
24:14
How do they work?
24:15
Yeah. Um so, we are coming to that. I
24:17
think we have a slide here. Actually, we
24:19
don't. Never mind. We'll answer that. Um
24:20
so, so here you have one filter, right?
24:23
You have one 3 by 3 by 3 filter, which
24:26
plugs into this thing here, and then it
24:28
gives you the 4 by 4 at the end.
24:30
Right? So, for one filter, we know that
24:33
by doing this operation, we get
24:37
we get this 4 by 4.
24:38
Let's say that you have another filter,
24:40
which is also 3D.
24:41
You do that thing, you'll get another 4
24:43
by 4.
24:45
And if you have 10 filters, you'll get
24:46
10 of these 4 by 4s, which then gets
24:48
packaged up into a 4 by 4 by 10 tensor.
24:54
Remember, whether the filter is 2D, 3D, 10D,
24:57
what is coming out is always 2D.
25:02
Because ultimately, when you apply all
25:03
this operation, at each position, you
25:05
just have one number.
25:06
And then ultimately, you just do all
25:07
those things, you just come up with a
25:08
table of numbers always. So, what's
25:10
coming out is always a 2D number table
25:13
like that.
25:14
But when you have lots of filters, you
25:16
have lots of these 2D tables one after
25:18
the other, and therefore, they get
25:20
packaged up into a tensor.
25:25
All right.
25:26
Um so,
25:28
textbook chapter 8.1 has a lot of detail
25:30
and intuition, which I think is really
25:32
good. So, please uh try it out. Okay.
25:35
And folks, by the way, this convolution
25:37
stuff, um it's sort of it grows in the
25:40
telling. So, I would encourage you to
25:41
revisit it, revisit it
25:43
a few times, and then it slowly becomes
25:45
part of your muscle memory.
25:48
Don't expect to just understand all the
25:49
nuances like one shot.
25:51
Do it a few times.
25:52
And it will become, you know, wired into
25:54
your into your head.
25:56
Okay. So, all right. The big question.
25:59
These seem excellent, but how are we
26:00
supposed to come up with these numbers?
26:02
Now, in fact, traditionally,
26:04
uh these filters actually used to be
26:05
designed by hand.
26:07
Uh computer vision researchers would
26:08
invest, you know, prodigious amounts of
26:10
time and effort and talent to figure
26:12
out, you know, the right kinds
26:14
of filters to use for various specific
26:17
applications. So, if you wanted to build
26:19
an application which would look at, say,
26:20
MRI images and figure out, okay, what
26:22
kind of features should I extract from
26:24
this MRI thing to be able to say, you
26:27
know, predict the evidence for a
26:28
stroke, they would actually, you know,
26:30
hand design the filter. They'd try lots
26:32
of different values and then come up
26:34
with, "Ah, I got the perfect filter for
26:35
this thing here." Right? So, that's the
26:37
way it used to be done.
26:39
Um, and now,
26:41
as we figured out how to train
26:42
deep networks with lots of parameters,
26:45
right? We figured out things like ReLU
26:47
activation, stochastic gradient descent,
26:49
GPUs, backprop, things like that, you
26:51
know, uh this big idea emerged. Why
26:54
don't we think of the numbers in the
26:55
filter as just weights?
26:57
And why don't we just simply learn them
26:59
from the data using backprop?
27:01
Right? Just like we learn all the other
27:03
weights. What's the big deal?
27:06
And this simple idea,
27:08
and it feels a bit, I don't know,
27:09
blindingly obvious in hindsight.
27:12
I'm sure it was not obvious in
27:13
foresight.
27:14
Um right? This was the breakthrough.
27:16
This was the key breakthrough. And now,
27:18
it's actually possible to do this
27:20
because a convolutional filter that we
27:22
have seen is actually just a neuron.
27:25
And the underlying arithmetic of it is
27:27
just neuronal arithmetic. And so, it
27:31
just happens to be a slightly special
27:32
one. It's actually even simpler than a
27:34
regular neuron. And in the interest of
27:37
time, I have one or two slides in the
27:39
appendix which tells you exactly why
27:40
it's a neuron. So, check it out. But
27:42
just take my word for it. It's just a
27:44
particular kind of neuron. And because
27:46
it's a particular kind of neuron, and we
27:48
know how to work with neurons,
27:50
right? You know how to work with
27:51
neurons, which means that our entire
27:53
machinery,
27:55
layers, loss functions, gradient
27:57
descent, SGD, blah, blah, everything is
27:59
immediately applicable.
28:01
We don't have to invent any new stuff to
28:03
make it work.
28:06
Okay?
28:08
All right.
28:09
Do you initialize the layers differently
28:12
in applications or just because the
28:14
network has different sizes? Like
28:16
computer vision versus uh medical
28:18
imaging. Is it just because the network
28:20
has different numbers in them?
28:23
Yeah, so the initialization
28:25
So, let's It's a good question. Let's
28:27
come back to it when we get to something
28:29
called transfer learning, which I'm
28:30
going to get to by about 9:30.
28:34
All right. So,
28:36
that's it. All right. So, this turned
28:37
out to be a huge turning point in the
28:38
computer vision field, and this was the
28:40
massive unlock in the year 2012. This
28:43
computer vision system that used this
28:44
technology called AlexNet burst out onto
28:47
the world stage because it crushed the
28:49
competition in a, you know, in a
28:51
competition called ImageNet, and uh the
28:53
previous best score was 26% error rate,
28:56
and this thing came in and had 16% error
28:59
rate. Right? It's the kind of thing
29:01
where if you see it, you'll be like,
29:01
"Oh, that must be a typo."
29:04
Right? Because every year, the
29:05
improvements in error rate were like
29:06
very little, half a percent, 1%, and
29:07
then this year it was 10 points, and that
29:09
was because of this approach.
29:12
And so, all right. Now, one other thing
29:14
I want to talk about is that with
29:16
every succeeding convolutional layer,
29:19
uh any
29:21
particular convolutional filter, it's
29:23
basically implicitly seeing much more of
29:25
the input image as we go along.
29:28
Right? Which means that if in the very
29:29
beginning, if this is the input, right?
29:31
This little convolutional filter, this
29:33
number here
29:34
in the first layer, let's say, only sees
29:37
like the top of the chimney or whatever
29:38
of this house.
29:40
But then the next layer, remember, the
29:42
next layer is input is this particular
29:44
layer.
29:45
And so,
29:47
this particular little thing here is
29:49
getting information from this whole
29:50
square here.
29:52
And every one of the points in that
29:53
square is actually something big in the
29:55
original picture.
29:57
So, with every additional layer, you're
29:59
seeing more and more and more of the
30:00
image.
30:03
All right? And this is a key part of why
30:04
these things work because you're
30:06
essentially hierarchically building a
30:08
better and better understanding of the
30:09
image.
30:10
It is the hierarchical understanding,
30:12
the hierarchical learning, that's a very
30:14
key part of the unlock.
30:17
And so, if you look at networks and what
30:20
they're visualizing, this is actually, you
30:21
know, what a face detection deep network
30:23
visualizes of what it's learning, you'll
30:25
see that the first layer is just
30:26
learning lines and edges and so on.
30:29
And the second layer is actually
30:30
learning edges. Look at this thing,
30:32
right?
30:33
It's it's learning to put these lines
30:36
together
30:37
to get some sort of an edge here,
30:38
another edge here. This looks like
30:40
three quarters of somebody's ears.
30:43
And then, these things are now being
30:45
assembled
30:46
to get whole faces out.
30:49
Can you imagine the researchers who did
30:50
this work? They built the network, it's
30:52
doing really well on detecting faces,
30:53
and they turn around, "Okay, let's see
30:54
what it's actually doing."
30:56
And then, this picture pops up.
30:58
I mean, goosebumps.
31:00
Okay, so pooling layers, the next one.
31:03
So,
31:04
so far we've talked about convolutional
31:05
layers, this is the second thing, second
31:07
building block, and then we'll again
31:09
go to the Colabs. So, pooling layers
31:11
are also called subsampling or
31:12
downsampling layers.
31:15
So, the idea is that every time a tensor
31:17
is coming out of these convolutional um
31:19
layers,
31:20
we try to make it slightly smaller
31:23
because the act of making it smaller
31:25
will force the network to try to
31:27
summarize and learn what's going on in
31:29
this complicated thing that's coming into
31:30
it, okay? So, I will describe the
31:32
mechanics first. Um
31:35
So, let's say that this is the output of
31:37
a convolutional layer.
31:39
Okay?
31:40
It's a 4 by 4.
31:42
So, what we do is that there are two
31:45
kinds of pooling, max pooling and
31:47
average pooling. This is called max
31:48
pooling, and the idea is really simple.
31:51
In this max pooling layer, there are no
31:52
weights or parameters to be learned. It's
31:53
just a simple arithmetic operation. We
31:56
basically take
31:57
this, and we superimpose a
32:00
2 by 2 empty grid
32:02
on the top left, and then we say, "Hey,
32:04
what's the biggest number among
32:06
these four numbers?" Well, the biggest
32:08
number is 43. Boom. Okay, I'm going to
32:09
stick a 43 here.
32:11
Then I move my 2 by 2 to the right
32:13
so that it overlaps with these numbers
32:15
in blue, and I say, "Hey, what's the
32:17
biggest number here?" Okay, that's 109.
32:19
And I move it down, what's the biggest
32:20
number here? 105. Stick it in here.
32:23
Biggest number here, 35, and I stick it
32:25
in there. That's it. This is max
32:26
pooling.
32:29
Similarly, there's this thing called
32:30
average pooling, but instead of taking
32:32
the maximum of these four numbers, we
32:33
just average the four numbers.
32:35
Okay, the average of these four things
32:36
in yellow,
32:38
am I done?
32:41
Average of these four numbers is 32.2.
32:43
The average of the blue numbers is 25.5, you
32:45
get the idea.
32:46
That's it. Max pooling and average
32:48
pooling. Now,
32:50
as you can see, when you go when you
32:51
apply pooling, the number of entries
32:53
drops significantly.
32:55
Right? The number of entries drops
32:56
significantly.
32:58
And the output from this layer is just
32:59
fed to the next layer as usual.
33:02
Okay? There's nothing, you know, crazy
33:04
going on.
33:05
So, it's a way to shrink the output from
33:07
one convolutional layer before it passes
33:10
on to the next convolutional layer:
33:11
you interject a pooling layer.
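A minimal NumPy sketch of 2 by 2 max pooling with stride 2; the grid values are hypothetical, chosen so the block maxima match the ones quoted above (43, 109, 105, 35):

```python
import numpy as np

def max_pool_2x2(t):
    """Keep the biggest number in each non-overlapping 2x2 block."""
    H, W = t.shape
    return t[:H // 2 * 2, :W // 2 * 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

t = np.array([[ 12,  20,  30,   0],
              [  8,  43,   2, 109],
              [ 34,  70,  27,   4],
              [105,  33,  23,  35]])
print(max_pool_2x2(t))   # [[ 43 109]
                         #  [105  35]]
# Average pooling is the same idea with .mean(axis=(1, 3)) instead of .max.
```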
33:13
Now, I actually have,
33:15
even if I say so myself, a very nice
33:18
handwritten explanation of what pooling
33:20
does, the effect of pooling.
33:23
And unfortunately, I can't get my iPad
33:25
to actually show up on my laptop.
33:27
So, I'm not going to be able to do it,
33:28
but I will record a walk-through.
33:31
Yeah, and once I post it, check it out, okay?
33:33
But the intuition that I tried to convey
33:35
with that thing is that oh, um Sorry,
33:38
I'll come back to this.
33:39
So, max pooling acts like an OR
33:41
condition. It basically says, "I have
33:43
this big picture.
33:44
So, in the four things that I'm looking
33:46
at, if there's any number which is
33:48
really high,
33:50
that means that some feature is being
33:51
detected, right?
33:54
If the number is really high coming out of
33:55
a convolutional layer, that means that
33:57
something somewhere fired up,
33:59
lit up.
34:00
And so, I'm just looking to see if
34:01
anything lit up in that part. If it did,
34:04
I'm going to say, "Yep, something lit
34:05
up."
34:06
If nothing lit up, then I'm going to
34:08
say, "Oh, nothing lit up."
34:09
So, in that sense, I
34:11
think you can imagine it's acting
34:13
like an OR condition.
34:15
Anything fired up? Anything fired up?
34:16
Anything fired up? Anything up? Yes,
34:17
okay. Otherwise, no.
34:19
And so,
34:22
sadly, I can't switch to Notability.
34:24
So, it acts like a feature detector. So,
34:27
if you have lots of things going on in a
34:28
particular picture, you want to be able
34:30
to summarize and aggregate all the
34:32
things that are going on so that you can
34:33
say. You may have a big picture
34:35
with lots of things lighting up here and
34:36
there, but you want to step back and
34:38
say, "You know what? In this picture,
34:40
the top left, nothing lit up. The top
34:42
right, something lit up. Bottom left,
34:45
something lit up. And the bottom right,
34:46
nothing lit up."
34:48
So, you're operating at a higher level
34:49
of abstraction.
34:51
That's the effect of pooling.
34:55
But don't you lose spatial information?
34:59
Uh, you don't, because
35:02
what you're actually saying is the top
35:04
left has this thing.
35:06
You already know it is in the top left.
35:08
And you already moved up to that level
35:10
of abstraction.
35:12
So, for example, if in the top
35:13
left there is a human eye,
35:15
and there is a circle detector, it's
35:18
going to fire up and say, "Hey, in
35:19
the top left there is an eye."
35:21
Yep, lit up. So, you're not looking at
35:23
the pixels anymore, you're already
35:24
operating at a higher level of
35:25
abstraction, and that's how we get
35:27
around it. But this proceeds slowly and
35:29
incrementally, which is why you have
35:31
these big networks.
35:34
All right.
35:35
So, now as we saw, as successive
35:38
convolution layers can see more and more
35:40
of the original image,
35:41
the max pooling layers that follow them
35:43
can detect if a feature exists in more
35:45
and more of the original input as well.
35:47
So, by the time you get to like the
35:48
seventh, eighth, ninth layers and
35:50
so on, this thing is actually really
35:52
smart. It's operating at a very high
35:53
level of abstraction.
35:55
Right? You can think of it: it
35:56
has basically tagged all the
35:58
features in that image at various
36:00
resolutions, and it can work with it.
36:04
Is there a trade-off between doing
36:06
pre-processing as opposed to adding
36:08
additional convolutional layers? I'm
36:11
thinking of turning a video
36:12
into black and white static images in
36:15
a sequence as opposed to
36:17
shoving in a color video with a ton of
36:19
noise.
36:20
The greater the time expanse, is there a
36:22
trade-off element? There is a trade-off.
36:24
Um, if for your particular data set and input
36:27
there is some very
36:29
important domain knowledge that you want
36:31
to encode
36:33
into the network so that the network
36:35
doesn't waste its capacity learning
36:37
things that you know have to be true,
36:39
then yeah, modify the input.
36:41
But if you're not sure,
36:43
right? Then you want to just let the network
36:45
learn whatever it can. As long as it's
36:47
focused on predicting as accurately
36:49
as possible, just let it be.
36:55
Uh all right. So, that's the basic idea.
36:57
And again, I'm sorry, this
36:59
Notability thing is not working.
37:01
Uh but take a look to really understand
37:03
um how this max pooling business
37:05
works. Okay. Oh, uh I think I skipped
37:08
over this.
37:09
So, when you have something like this,
37:12
so this, let's say, is a tensor coming
37:13
out of some convolutional layer, and its
37:15
size is 224 by 224 by 64, then you apply
37:18
something like a pooling. The thing I
37:20
want to point out is that the pooling
37:22
will work with every slice of the
37:23
tensor.
37:25
Okay? So, if the tensor is 224 by 224 by
37:27
64, it has a depth of 64,
37:30
which is basically like saying it's got
37:31
64 tables of 224 by 224, and the pooling
37:35
will work on every one of those tables.
37:38
Which means that
37:40
the 64 will stay: you'll still have 64
37:42
things at the very end. It's just that
37:43
every one of the things of the 64, the
37:45
224 by 224, will shrink to 112 by 112.
37:49
So, each table shrinks due to pooling,
37:52
but the number of tables does not
37:53
change.
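In Keras terms (a sketch):

```python
import keras
from keras import layers

x = keras.Input(shape=(224, 224, 64))
y = layers.MaxPooling2D(pool_size=2)(x)
print(y.shape)   # (None, 112, 112, 64): each table halves, depth stays 64
```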
37:57
Okay. So,
37:59
uh by the way, this
38:01
link here
38:03
has a beautiful explanation of all these
38:05
things with a little bit more complexity
38:06
as well from a course taught at Stanford
38:08
in like 2018 or 2019 or something, I
38:10
forget. Uh so, just check it out if
38:12
you're curious about this stuff. It's
38:13
really good.
38:15
Okay. Um
38:18
All right. So, that brings us to the
38:19
architecture of a basic CNN.
38:21
Um and so, what we do is we have an
38:23
input.
38:25
Okay? We take that input, we run it
38:27
through a bunch of convolutional and
38:29
pooling layers. So, there's a
38:30
convolutional layer, and then we pool
38:33
it, which is why it has shrunk
38:35
in size,
38:37
and then it goes through another
38:38
convolutional layer, then we pool it,
38:40
which is shrunk again,
38:42
and then it keeps on doing it. So, we
38:44
have a series of these;
38:45
these are called convolutional blocks.
38:47
So, a convolutional block is typically,
38:49
you know, one to two convolutional
38:50
layers followed by a pooling layer.
38:52
Okay.
38:54
So, you have a series of convolutional
38:55
blocks.
38:57
Okay? And the thing to notice is that
38:59
as you go further and further in the
39:01
network,
39:03
the blocks will actually get smaller and
39:05
smaller because of
39:07
uh max pooling, right? They'll get
39:09
smaller and smaller, but they'll get
39:10
deeper and deeper.
39:14
Okay.
39:14
And we have empirically figured out
39:16
that that model of reducing the
39:18
size, the height and the
39:20
width, but then making it deeper, tends
39:22
to work really well in practice.
39:25
And so,
39:27
in fact, uh, and apologies to the live
39:29
stream that I can't use the iPad, I'm going
39:31
to do it on the board.
39:35
So, let's say that you have a picture
39:38
which is
39:39
coming in as 224
39:43
224
39:44
and then you have
39:46
say three of them
39:48
because it's a color picture, so you
39:49
have three of them.
39:52
Can you folks see this okay?
39:54
All right. So, right? Let's say this is
39:56
the input coming in. And ResNet, which
39:59
is a very famous network that we're
40:00
actually going to work with in a few
40:02
minutes,
40:03
then it actually gets done with all this
40:05
convolution pooling business.
40:07
The final tensor that it has is
40:11
actually of shape
40:13
7 by 7.
40:16
But it is 2048 long.
40:22
Okay? So, it has
40:24
processed something which is 224 by 224 by 3
40:26
to much smaller height and width just 7
40:28
by 7, but it's gotten much deeper, 2048
40:31
channels deep.
40:32
This is a this is a numerical example of
40:34
what I'm talking about there in terms of
40:36
as you go along, things get smaller but
40:39
deeper.
40:41
All right.
40:43
Uh
40:44
Yes?
40:45
Is the reason that it gets deeper
40:47
because each
40:49
Like it gets deeper because each
40:50
layer has a single feature that is
40:52
picked up and then it gets stacked on
40:54
top
40:55
It's not so much that each layer is
40:57
picking up a single feature, it's more
40:58
that
40:59
uh
41:00
basically
41:01
the way I think about it is that
41:04
the number of atomic
41:06
features that you may want to detect are
41:07
probably not that many, right? Lines,
41:10
curves, gradations in color and things
41:11
like that. But the way in which you can
41:13
combine these atomic features
41:16
to depict real world things
41:18
is combinatorial.
41:20
It's sort of like I have 10 kinds of
41:22
atoms, how many molecules can I make
41:23
from it?
41:25
You can make a lot of molecules from
41:26
those 10 atoms, which means that you
41:28
better give the network more the ability
41:30
to capture more and more of these
41:32
possible things that the real world can
41:33
come up with.
41:35
And so, as the depth increases, you
41:38
have more filters, and every filter
41:40
now has the ability to pick up some
41:42
combinatorial combination of what's
41:43
coming in.
41:49
Uh sorry, quick question related to
41:51
this. So, right now like our model is
41:53
being trained to detect certain specific
41:55
features like a line, a color, or
41:56
something of this sort. But still it
41:58
doesn't have meaning to this, right?
42:00
Like, it still doesn't know if that
42:02
arc is a sun or an eye, right?
42:06
So, yeah. So, we don't tell it what
42:08
to learn, it just learns.
42:10
All we tell it is make sure that you
42:12
minimize the loss function. Now, once it
42:14
is finished learning, if it's a good
42:16
network, it has good accuracy, then we
42:18
can introspect. We can peek into the
42:21
internals and try to understand what is
42:23
it learning,
42:24
right? And sometimes you like you saw in
42:26
the face detection example, it's
42:27
actually learning interesting things
42:28
like basic lines and edges and then
42:30
slowly, you know, more complicated
42:32
shapes and then finally like entire
42:34
human faces. Sometimes it may not be
42:36
understandable.
42:37
And the way it's doing this is by
42:39
constructing features, like my brain?
42:42
Like how do you figure out what it's
42:44
learning?
42:44
>> Yeah. Oh, oh, I see. So, I'm going to
42:46
give a reference in just a few minutes.
42:49
Read the paper. That was one of the
42:50
first ones to actually visualize what it
42:52
what these things are learning and
42:53
that'll give you an idea of how it
42:54
actually works. And I'm also happy to
42:56
talk about it offline. It's a bit of a
42:58
tangent, but it's a really rich tangent,
43:00
so if I keep talking about it, I'll
43:02
end up spending 10 minutes on it, so I'm
43:03
going to back off.
43:06
Okay.
43:08
Um all right.
43:09
So, now once we do that,
43:12
okay? Now we are back in familiar
43:13
territory where we take whatever tensor
43:16
is coming out from these convolutional
43:18
operations and pooling operations and
43:20
then we just flatten them, only now, into
43:22
a long vector. And once we flatten them,
43:25
we can connect them to some good old
43:27
dense layers
43:29
like we know how to do and then we
43:30
finally connect them with whatever, you
43:32
know, output layer you want, right? In
43:34
this case, this example is using some
43:36
multi-class classification,
43:39
classifying images into what kind of
43:41
automobile or whatever it is. So, it's
43:42
like a softmax. So, this is a general
43:44
framework.
43:48
Okay?
43:50
Any questions?
43:54
Yeah.
43:55
Can you explain again how the depth
43:57
increases exactly like Oh, the depth
44:00
increases because you decide what the
44:01
depth is.
44:03
So, when you add a convolutional layer,
44:05
you decide how many filters it has. So,
44:07
you just keep adding more and more
44:09
filters the later on you go in the
44:11
network.
44:13
So, it's in your control. So, remember
44:14
the number of neurons in a hidden layer
44:16
is in your control, right? Similarly,
44:18
the number of filters is in your
44:19
control. It's a design choice.
44:22
And we design it so that the later we
44:24
go, the more depth we have. So, you
44:26
stack
44:28
layers, with each of those layers having
44:31
a different filter applied at the end?
44:35
Yeah, a layer is made up of filters and
44:37
so the depth just comes from having lots
44:39
and lots and lots of filters. And you
44:40
get to choose what they are.
44:44
All right. So, now let's go to the
44:46
Fashion MNIST Colab um that I did the
44:49
video walk-through on and then actually
44:51
solve it using a convolutional network.
44:56
All right, cool. So, uh at this point
44:58
I'm going to zip through some of the
44:59
stuff because you know the preliminaries
45:00
have to be done. Import all these
45:02
packages, set the random seed here.
45:05
Great. And then we will load the
45:07
MNIST data set just like I did in the
45:09
Colab yesterday. Uh we create these
45:11
little labels.
45:13
Uh and then we just have these standard
45:14
functions to plot accuracy and loss that
45:17
we've been using so far. All right. Now
45:19
we come to the convolutional thing and
45:21
so as before, we're going to um
45:24
we're going to divide it by 255 to
45:25
normalize everything to a zero to one
45:27
range. Uh let's confirm to make sure
45:29
that the data nothing has gotten
45:31
tampered with. Yep, we have 60,000
45:33
images, each one is 28 by 28 in the
45:35
training set. Now,
45:37
convolutional networks um they expect
45:40
the input to have
45:42
three channels, or they expect to have
45:44
like an additional thing which is like
45:46
a channel,
45:47
right? Uh the color images have three
45:49
channels,
45:50
but black and white images have only one
45:52
channel, right? One table of numbers.
45:54
So, instead of saying 28 by 28, we tell
45:56
the convolutional layer to expect 28
45:59
by 28 by one.
46:01
It's the same thing conceptually, but
46:03
that's the sort of the format that it
46:04
expects.
46:05
And so,
46:06
uh we go here and then we say, all
46:09
right, there's a thing called expand
46:11
dimension. I'm just telling it to expand
46:12
its dimension and once I do that, you
46:14
can see here it's still 60,000, but
46:17
instead of 28 by 28, it has become 28 by
46:19
28 by one. Same thing.
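The same step in code (a sketch, not the notebook itself):

```python
import numpy as np

x_train = np.zeros((60000, 28, 28))      # grayscale: no channel axis yet
x_train = np.expand_dims(x_train, -1)    # add the trailing channel dimension
print(x_train.shape)                     # (60000, 28, 28, 1)
```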
46:21
Okay? Now, let's define our very first
46:24
CNN.
46:25
So, all right.
46:27
As before, the input is just
46:30
keras.Input as before, no difference
46:32
here and we tell it the shape and the
46:34
shape is of course just 28 by 28 by one.
46:37
Okay? That's what I have here.
46:39
And then we come to the first
46:40
convolutional block.
46:43
So, and this is the key thing.
46:45
If you want to tell Keras to use a
46:47
convolutional layer,
46:49
you use this keyword layers.Conv2D.
46:53
And from this you can probably also
46:54
figure out that there's a Conv1D and
46:56
there's a Conv3D and so on and so forth,
46:58
which, you know, uh explore. It's really
47:00
good stuff.
47:01
But for image processing, Conv2D is all
47:04
you need. And now we tell it how many
47:06
filters you want. Okay. So, uh we decide
47:09
on the number of filters. So, I've
47:10
decided to have 32 filters. Okay? And
47:13
then we also have to decide the size
47:15
of the filter, right? The simplest size
47:18
is 2 by 2. So, I'm just going to go with
47:19
that.
47:20
Right? Kernel size is 2 by 2.
47:22
And then the activation is of course
47:23
ReLU. I give it a name, convolution one,
47:26
and then I feed it the input. And then
47:27
once I do that, I follow it up with a
47:29
little pooling layer where I use
47:31
MaxPooling2D.
47:33
And MaxPooling2D, you just literally
47:35
pass the input, you get the output back.
47:36
It just
47:37
shrinks everything using pooling.
47:39
So, that is the first convolutional
47:40
block.
47:41
And you know what?
47:43
I know how to cut and paste. Boom, cut
47:45
and paste, I get the second
47:46
convolutional block.
47:48
Okay? Here is the second convolutional
47:49
block. And I know in lecture I just
47:52
mentioned that as you go deeper, you get
47:54
more depth to it, but this is
47:56
just a starting point. I'm just going to
47:58
use the same depth. Not a big deal. It's
47:59
a simple problem. So, which is why in
48:01
the second convolutional block I'm still
48:03
using only 32.
48:04
But you can totally go to 64 for
48:06
instance to make it much deeper.
48:07
Okay?
48:08
Uh and once I do that,
48:10
I finally come to the point where I
48:12
flatten everything to a long vector,
48:14
then I connect it to one dense layer of
48:17
256 neurons.
48:19
And then finally, I come to the softmax
48:22
where I have 10 outputs, right? 10
48:23
categories of clothing, softmax, and
48:26
then I tell Keras, okay, take this input
48:27
and the output, string them up together,
48:30
define a model for me.
48:32
So, that's it. That's a convolutional
48:33
network. The new concepts we are seeing
48:35
here are Conv2D for the convolutional
48:38
layer and then MaxPooling2D for the max
48:40
pooling layer.
48:42
Okay? That's it.
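Reconstructed from the walkthrough above (a sketch of the notebook's model as described, not its verbatim code; the dense layer's ReLU activation and the second block's name are assumptions):

```python
import keras
from keras import layers

inputs = keras.Input(shape=(28, 28, 1))

# Block 1: 32 filters of size 2x2, ReLU, followed by max pooling
x = layers.Conv2D(32, kernel_size=2, activation="relu", name="convolution1")(inputs)
x = layers.MaxPooling2D()(x)

# Block 2: copy-pasted, same 32 filters (could be 64 to go deeper)
x = layers.Conv2D(32, kernel_size=2, activation="relu", name="convolution2")(x)
x = layers.MaxPooling2D()(x)

# Flatten, one dense layer, then a 10-way softmax over the clothing classes
x = layers.Flatten()(x)
x = layers.Dense(256, activation="relu")(x)
outputs = layers.Dense(10, activation="softmax")(x)

model = keras.Model(inputs, outputs)
```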
48:43
Uh
48:44
coming. So, let me just run this thing.
48:46
It runs. Okay, good. Yeah.
48:49
Uh how do you decide when to flatten and
48:52
would there ever be a situation in which
48:54
we just kind of use the method that we
48:56
used before and not use a CNN?
48:59
Well, we already tried it with MNIST,
49:00
right? We didn't use a CNN. We just
49:02
flattened right away.
49:03
>> It worked. It's not bad, but we are
49:05
like, you know, can we do better than 85
49:06
or 88 or whatever the percent was,
49:08
right? So, when we are working with
49:09
images, it's typically a good idea to
49:11
just start with a CNN right out of the
49:13
gate because you're not losing anything.
49:14
You're not giving up anything.
49:16
So, uh in terms of how many uh layers
49:19
you should have, my philosophy is start
49:20
simple and if it works, stop working on
49:23
it. If it doesn't, add more layers.
49:27
Uh yeah.
49:28
Yeah, just to uh is it the architecture
49:30
design, the number of filters, kernel
49:32
size, number of layers, convolution
49:34
pooling, is that just all based on trial
49:36
and error, or what? Yeah, so
49:37
typically it's based on trial and error,
49:39
Um to answer your question. But as you
49:41
will see in the transfer learning
49:42
discussion we're going to have soon,
49:44
you can actually, instead of doing
49:46
anything from scratch, it's much better
49:48
to just download a pre-trained model and
49:50
just adapt it for your particular
49:51
problem. That is actually the norm by
49:54
which people do these things. The reason
49:55
I'm doing it from scratch is because you
49:57
should know how it was done.
50:00
Like you it should not be a black box to
50:01
you. That's my goal.
50:03
Yeah.
50:05
Just from a notation perspective, I
50:07
noticed you named all of these layers X.
50:09
Is that a habit we should get into
50:11
naming them all the same or is that just
50:12
a
50:12
>> Actually, I'm not naming the layers as
50:15
X. What what's going on here is I'm
50:17
feeding it X.
50:19
And whatever is coming out of it, I'm
50:21
just calling it X.
50:22
That's all. It's just a notational
50:23
convenience. I'm just
50:25
calling the input and the output X, and
50:27
Keras under the hood will track
50:28
everything and make sure the right thing
50:29
happens. Otherwise, I'd have to be like
50:31
X1, X2, X3, X4 and then if I want to add
50:33
a new layer somewhere in the middle
50:35
between X3 and X4, I have to call that
50:37
X4 and then I'll change everything to 5,
50:39
6, 7. Complete pain in the neck. That's
50:41
why I do this.
50:42
All right. So, model.summary
50:46
It has 302,000 parameters. I'll
50:51
just plot it.
50:53
Great. And I encourage you to hand
50:56
calculate it later on and make sure the
50:58
numbers tally, okay?
51:00
For now, let's just go. So, as before,
51:03
we'll just use the same compilation.
51:06
We'll use Adam and then we'll train it
51:08
for, you know, just 10 epochs. We'll use
51:11
a validation split again, as usual, of
51:13
20%. So, let's just run it.
51:15
So, it's actually going to run. And as
51:17
you will see,
51:18
convolutional networks there's a lot
51:19
more going on, so it's going to be a bit
51:20
slower to run. Hopefully not too much
51:23
slower.
51:25
While it's doing, other questions?
51:31
So, if we have a task other than image
51:32
classification, say
51:34
segmentation, do we still flatten
51:35
like this first?
51:37
Yeah, so this is for image
51:39
classification. For other kinds of
51:41
applications,
51:42
typically you run it through a bunch of
51:44
convolutional layers and so on and so
51:45
forth.
51:46
But the output side of the equation gets
51:48
much more complicated because if instead
51:51
of classifying just
51:53
the whole picture into, you know, dog or
51:56
cat, if you have to take every pixel and
51:58
classify it, right? Then, well, you
52:01
better have an output shape that is the
52:03
same dimensions as the input shape.
52:06
So, for that we use a different
52:07
architecture. It's called U-Net
52:09
and so on, which unfortunately I won't
52:11
be able to get into. But I know I am
52:13
planning to post another video
52:14
walk-through where I show you how to use
52:17
the Hugging Face Hub
52:19
to very quickly build models for the
52:22
other applications like segmentation and
52:23
so on. I'm hoping to post that tomorrow.
52:26
It's an optional viewing thing that
52:27
might help with that.
52:29
Okay. So, is it done? Okay, good. It's
52:32
done. All right, let's plot the
52:35
thing here.
52:36
All right, so it seems like training is
52:38
going down nicely. Validation
52:40
is sort of flattening out somewhere here
52:42
around the eighth epoch. Let's look at
52:45
the accuracy.
52:47
Same situation here. The accuracy is in
52:48
the 90s. The final question,
52:51
of course, is how it does
52:52
on the test set.
52:55
Whoa, 90.5%.
52:58
Pretty good.
52:59
By the way, if you're not impressed that
53:00
we went from 88 to 90,
53:04
These applications are the
53:05
proverbial sort of diminishing-returns
53:07
problems, okay? So, what you should
53:09
always think of is look at the amount of
53:11
error that's left and ask yourself how
53:13
much of that error am I able to reduce?
53:16
So, we had roughly 12% error left
53:20
when we did the simple Colab yesterday.
53:22
From that 12%, we have knocked off two
53:24
points to get to over 90, which is
53:26
amazing.
53:27
Okay?
53:28
And in fact, I think the state of the
53:29
art on this
53:31
data set
53:32
is 97%.
53:34
So, I invite you
53:36
to take this thing and try different
53:39
filters and so on and so forth to see if
53:40
you can get to the the mid-90s.
53:42
It's not easy, but try it. Yeah.
53:45
Does the number of epochs have to be
53:48
related to the number of batches?
53:50
Because you did a batch size of 64 and 10 epochs? >> No,
53:52
the epochs are independent;
53:55
an epoch is just one pass
53:56
through the whole training data.
53:58
But within each pass, within each epoch,
54:01
the batch size tells you how
54:03
many batches you're going to process.
54:05
So, it is basically the number of
54:06
examples you have in your training data
54:08
divided by the batch size that you have
54:10
chosen,
54:11
right? That number rounded up is the
54:13
number of batches within each epoch.
54:16
And here I'm just choosing 10 because,
54:18
you know,
54:20
(Siri, interrupting: "I found something on the web.") Okay.
54:23
I chose 10 because it's going to be fast
54:24
for me to do in class. And 10
54:26
is actually more than enough because you
54:27
can see it's already beginning to
54:28
overfit.
54:31
Yeah.
54:33
This is more of a conceptual question,
54:35
but is it always the case that a neural
54:37
network will have better accuracy than
54:39
a classical machine learning algorithm?
54:42
I'm asking more about cases like
54:44
the heart disease problem. >> Oh, yeah,
54:45
yeah.
54:46
Great question. So, neural networks are
54:49
really good for unstructured data like
54:50
the images we have here. But if you
54:52
have structured data like the heart
54:53
disease problem, sometimes a neural network
54:55
works really well. Sometimes
54:57
things like gradient boosting, XGBoost,
54:59
work really well. So, if I am actually
55:01
working on a structured data problem,
55:03
I'll try both.
55:04
I'm not going to axiomatically assume
55:06
that the DNN is going to be the best
55:07
thing. But if you have unstructured data,
55:09
it's the best game in town.
55:11
All right. Um
55:13
I'm just going to
55:14
By the way, I have a whole section here
55:15
on once you build a model, how do you
55:16
actually improve it?
55:17
Right? Check it out. It's an optional
55:19
thing.
55:20
All right, I'm going to stop this here.
55:22
All right. So, the next thing I want to
55:23
do is
55:25
So, we went from 88 to 90 plus percent,
55:27
right? Using convolutional networks.
55:29
Now, let's work with color images. Let's
55:31
kick it up a notch.
55:33
So, um
55:34
I actually
55:36
web scraped
55:38
all these pictures for you folks, for
55:40
your enjoyment. I web scraped about 100
55:42
color images each of handbags and shoes:
55:44
roughly 100 handbags and 100
55:46
shoes. So, the question is, with these
55:48
essentially 200 images,
55:51
can we build a really good neural
55:52
network to classify handbags and shoes?
55:54
Right? It seems kind of absurd, right?
55:56
Because 200 examples, I mean, it's not
55:58
that much, right? It doesn't feel like a
55:59
lot. The Fashion MNIST data has 60,000
56:02
images.
56:04
Right? And, you know, even
56:06
with that we were overfitting in like 5,
56:07
6, 7, 8 epochs.
56:09
With 200 images, maybe, you know, is
56:10
there any hope? Obviously, there is
56:11
hope, otherwise it wouldn't be in the
56:13
lecture. So, yeah. So, we're going to
56:15
take this data set and let's see what we
56:16
can do with it. So, we'll first actually
56:18
build a convolutional network from
56:19
scratch to solve this problem. Okay?
56:22
All right.
56:24
I'm actually going to run through the
56:25
code because at the end of it we'll have
56:27
a live demo. So, I would like one
56:29
volunteer to give me a handbag and one
56:31
volunteer to give me their footwear.
56:34
Boy, in class.
56:37
Okay. So, all right. Unlike the previous
56:40
data set, this one I web
56:42
scraped myself, and
56:44
I've stuck it in this Dropbox
56:46
folder.
56:47
Let's just download it and unzip it. And
56:49
once we do that, we have to now organize
56:51
it with these 200 images. So,
56:54
I have to do some
56:57
boring-ish Python housekeeping here.
57:00
So, here what we're doing is that we
57:02
have 100 handbags, roughly 100 shoes.
57:04
And what this code is doing is it's
57:06
actually creating a directory structure:
57:08
it's splitting stuff into train and
57:10
validation and test. And then for each
57:12
of the splits it's doing the handbags
57:13
and the shoes folder. Okay? So, once we
57:16
do that, basically this directory
57:18
structure is created.
57:20
Okay? Training, validation folder, test
57:23
folder, handbags and shoes. In fact,
57:25
I think you can see it
57:26
here.
57:27
See here, handbags and shoes. And within
57:29
that, there is, you know, train, test,
57:31
validation. And within each of these,
57:33
there's handbags and shoes. So, the idea
57:34
is that when you're working with images,
57:36
right? What you can do is you can just
57:37
create folders for each kind of image,
57:40
right? Let's say dogs, cats,
57:42
two folders with cat images and dog
57:43
images and then just point Keras at it.
57:46
It'll automatically figure out those are
57:47
the labels.
57:49
It makes it easy for you. So, it's very
57:50
convenient when you're working with
57:51
images.
57:52
And the book explains this thing in
57:53
great detail.
57:55
All right. So, when working with these
57:56
images, color images, we'll follow this
57:58
process. We'll read in the JPEGs. We'll
58:00
convert them to tensors. And then since
58:02
I'm web scraping it, they all come in
58:03
different shapes and sizes. So, I need
58:05
to like bring it all to the same size.
58:06
Okay? I resize it and then I'm going to
58:08
batch it into whatever. I'm going to
58:10
batch it using a batch size of 32 here.
58:13
So, and this utility from Keras will do
58:16
all that for you, right? Very quickly.
58:19
So, basically what it says is that it
58:20
found 98 images in the
58:23
training data belonging to two classes,
58:25
49 in the validation and 38 in the test.
58:28
So, less than 100 examples in the
58:29
training set. That's what we have here.
58:31
All right. What's the time? 9:30. Okay.
58:33
So, all right. Now, let us check the
58:35
dimensions to make sure. Good. So, 224 by
58:38
224 by 3. And why did I pick
58:40
224 by 224? As you will see later, we're
58:43
going to use something called ResNet,
58:45
and ResNet expects the input to be 224 by
58:47
224 by 3. That's why I resized everything to
58:49
224 by 224. Let's look at a few examples of my
58:52
wonderful web scraping in action.
59:01
It's pretty wild, right?
59:02
Okay. Now, let's do a
59:04
simple convolutional network.
59:07
Before, we took all the X
59:09
values in Fashion MNIST and divided them
59:10
manually by 255 to normalize them to [0, 1].
59:13
Well, you know what? We are actually
59:14
graduating to the higher levels of Keras
59:16
now. So, let's not do that, right?
59:17
Manual stuff is bad. So, we'll do it
59:19
within Keras by using something called
59:21
the rescaling layer where we just tell
59:22
it how much to rescale and boom, it'll
59:24
do it for you. The first convolution
59:26
block, just like the Fashion MNIST 32,
59:28
second block, again 32, max pool,
59:31
flatten. And then here we only have
59:33
handbags versus shoes, so just a sigmoid
59:35
is enough, right? It's just a binary
59:36
classification problem. So, I'm just
59:38
using one output layer with a sigmoid,
59:40
and that's our model. So, let's do the
59:42
model.
59:43
All right, model summary.
59:48
About 103,000 parameters in this little
59:52
model. Okay, let's compile it and run
59:54
it. Uh, and note here because it's a
59:56
binary
59:57
classification problem, I'm using binary
59:59
cross entropy.
1:00:02
Same Adam.
1:00:03
And accuracy, compile, and then boom,
1:00:05
let's run it. We'll run it for 20
1:00:07
epochs.
1:00:08
Hopefully.
1:00:12
Okay, while it's doing this business,
1:00:13
I'm going to shift to the PowerPoint.
1:00:17
So, we'll go back to see how well it
1:00:19
did. But whatever
1:00:21
it did, we built it from scratch. So
1:00:23
the question is, can we do better than
1:00:23
that? Okay? Because we only have 100
1:00:26
examples of each class, and which brings
1:00:28
us to something very cool and very
1:00:29
powerful called transfer learning. And
1:00:31
the idea, so the key thing is there are
1:00:33
two research trends that are going on
1:00:34
that we take advantage of. The first one
1:00:36
is that researchers have defined, you
1:00:38
know, designed architectures which
1:00:40
exploit the kind of input you have. So,
1:00:42
Olivia asked the question, if you have a
1:00:43
particular kind of input images, do you
1:00:45
actually change the input, or do you
1:00:47
actually change the network? As it turns
1:00:49
out, here, for example, if it's images,
1:00:50
we know that we should use convolutional
1:00:52
layers because convolutional layers were
1:00:53
designed to exploit the image-ness of
1:00:55
the input.
1:00:57
Okay? Similarly, if you have sequences
1:00:59
of information, like obviously natural
1:01:01
language, audio, video, gene sequences,
1:01:03
and so on, so forth, these things called
1:01:05
transformers were invented
1:01:07
to exploit them, and we're going to
1:01:08
spend a lot of time on transformers
1:01:09
starting next week. So, that's the first
1:01:11
trend. The second trend is that
1:01:13
researchers have used these innovations
1:01:15
to actually create and train models on
1:01:19
vast data sets, and thankfully, they've
1:01:21
made them publicly available for us to
1:01:23
use. So, transfer learning is the idea
1:01:26
that if you have a particular problem,
1:01:28
let's just take a pre-trained network
1:01:30
somebody may have already created,
1:01:32
and then let's just customize it to our
1:01:33
problem, rather than actually build
1:01:35
anything from scratch.
1:01:37
Okay, that's the basic idea. So,
1:01:39
so here we have this basically we have
1:01:41
to build a classifier which takes in an
1:01:43
arbitrary image and figures out if it's
1:01:45
a handbag or a shoe, right? That's our
1:01:46
goal.
1:01:47
And so, now handbags and shoes are
1:01:49
everyday objects, and so what you can do
1:01:51
is look around and see
1:01:53
if there are any networks that have been
1:01:55
trained by other people which actually
1:01:57
have been trained on everyday images.
1:02:00
Right? As opposed to like MRI or X-rays,
1:02:02
right? Specialized images, everyday
1:02:04
images. Of course, the first thing you
1:02:05
should probably do is to see if anybody
1:02:07
has built the specific thing you want,
1:02:08
handbag shoes classifier on GitHub.
1:02:10
Assuming it's not, then you do transfer
1:02:12
learning. Okay? So, now it turns out
1:02:15
that there's this thing called ImageNet,
1:02:17
which is a database of millions of
1:02:19
images of everyday objects in a thousand
1:02:22
different categories, furniture,
1:02:24
animals, automobiles, you get the idea.
1:02:26
Okay? And so, we can look for the
1:02:28
networks that have been trained on
1:02:29
ImageNet.
1:02:31
Okay, let me just go back to the Colab
1:02:33
just to make sure it doesn't time out.
1:02:37
All right, so it has finished doing it.
1:02:40
Um, let's just plot these things.
1:02:48
Okay, so
1:02:49
uh, there is some overfitting that
1:02:51
happens around here
1:02:52
around the 10th epoch. Let's
1:02:55
look at the
1:02:59
So, the training accuracy is
1:03:01
actually getting almost to 100%. But
1:03:03
we're not interested in training
1:03:04
accuracy, right? We care about
1:03:06
validation and test accuracy, and that
1:03:08
seems to be kind of hovering around in
1:03:10
the 80s. Um, so let's just evaluate it
1:03:13
anyway to see what happens.
1:03:15
Okay, so it gets to 87% accuracy
1:03:19
on this data set.
1:03:20
It's actually pretty good given that we
1:03:22
only have 100 examples. So, 87%
1:03:24
accuracy, and we trained the whole
1:03:26
thing from
1:03:28
scratch. Okay? Now, then
1:03:31
Now, there's this whole section
1:03:32
about data augmentation. You
1:03:35
know what? Do we have time?
1:03:38
So,
1:03:40
so the idea of augmentation is that when
1:03:42
you have an image,
1:03:44
let's say you take this image, and you
1:03:45
just rotate it slightly by 10°.
1:03:49
If it's a handbag before you rotated it,
1:03:51
it sure as hell is a handbag after you
1:03:52
rotated it.
1:03:54
Right?
1:03:55
The meaning of the
1:03:56
image doesn't change just because you
1:03:57
rotated it slightly. Or maybe you zoom
1:04:00
in slightly, you zoom out slightly, you
1:04:01
crop it slightly, nothing happens.
1:04:03
So, what you can do is you can take any
1:04:05
image you have, and you just perturb it
1:04:07
slightly,
1:04:08
like right there, and then add it as a
1:04:10
new example to your training data.
1:04:14
This is an unbelievable free lunch,
1:04:15
frankly.
1:04:16
And the same thing actually, same kinds
1:04:19
of techniques actually work for text
1:04:20
also, which we'll cover later on.
1:04:22
Right? This broad area is called data
1:04:24
augmentation.
1:04:26
It's a great way when you don't have a
1:04:27
lot of data to artificially bolster the
1:04:30
amount of data you have.
1:04:31
Okay?
1:04:32
Um, and so, and of course, Keras makes
1:04:34
it very easy for you to do all these
1:04:36
things. It has already predefined a
1:04:38
whole bunch of data augmentation layers
1:04:40
for you. So, here's a little example
1:04:43
where I basically take a picture and
1:04:45
then I randomly flip it. So, if it looks
1:04:47
like this, I flip it this way,
1:04:48
horizontal. Okay? Uh, and then I
1:04:50
randomly rotate it by a factor of 0.1. (Per the
1:04:53
Keras documentation, that factor is a fraction of a full
1:04:55
circle, so 0.1 means up to ±36°.) And then random zoom,
1:04:57
right? Zoom in and out a little bit. Uh,
1:05:00
but it won't do this for every picture.
1:05:02
It will only do it randomly. Okay? So,
1:05:04
that only some pictures will get
1:05:06
perturbed in some ways. And that's how
1:05:07
you make sure there's enough diversity
1:05:09
of pictures that you have.
1:05:10
So, once you do that,
1:05:12
you can actually take a picture and see
1:05:13
what it does.
1:05:15
I just randomly grab a picture, so it
1:05:17
keeps changing every time.
1:05:21
Yeah, look at this handbag.
1:05:22
Handbag slightly rotated this way,
1:05:24
rotated that way.
1:05:26
Some more. Maybe a little bit of zooming
1:05:28
going on, and so on. You get the idea,
1:05:30
right? And there's a whole list of these
1:05:31
things you can do. But when you do those
1:05:33
things, make sure
1:05:35
that what you're doing doesn't actually
1:05:37
change the underlying meaning of the
1:05:38
picture.
1:05:39
It's really important.
1:05:41
Okay? So, for example, if you're working
1:05:43
with satellite data,
1:05:45
yes, be very careful not to do
1:05:47
crazy flips.
1:05:49
Right? Or even if you're working with
1:05:50
everyday images, horizontal flips are
1:05:51
okay. Don't do vertical flips.
1:05:54
Right? How many times will you have an
1:05:55
upside-down dog picture that you need to
1:05:57
classify?
1:05:59
Make sure your augmentation doesn't go
1:06:00
nuts.
1:06:02
All right.
1:06:05
Once you do that, you can actually just
1:06:07
insert the data augmentation layers in
1:06:09
your model right there, right after the
1:06:11
input. The rest of it can stay
1:06:12
unchanged.
1:06:14
So, this is a great way to increase the
1:06:15
size of your training data, and here is
1:06:17
a model, and then I invite you to
1:06:19
actually just play with it and
1:06:21
train it. In the interest
1:06:23
of time, we won't actually train this
1:06:23
model, but it's in the Colab; you can
1:06:24
just try it. It also figures prominently
1:06:27
in homework one, by the way, data
1:06:28
augmentation. So, you'll get more
1:06:30
experience with this. Okay. So, uh, back
1:06:32
to the PPT.
1:06:34
So, this is what we have. Um, and so,
1:06:37
any network that has been trained on
1:06:38
this ImageNet thing, it turns out,
1:06:41
learns all kinds of interesting features
1:06:42
in every one of its layers. So, here
1:06:44
this is the first layer, and you can see
1:06:46
it's picking up sort of gradations of
1:06:48
color, sort of line-ish kind of
1:06:49
behavior. Layer two, um, it's actually
1:06:52
picking up Hey, look, it's picking up an
1:06:54
edge. Can you see that edge?
1:06:56
Right? Like like that.
1:06:59
And then layer three is picking up these
1:07:01
interesting honeycomb shapes, uh, and so
1:07:04
on. Oh, this one is
1:07:05
already picking up the
1:07:07
shape of a human torso.
1:07:12
Yeah, this layer is actually picking up
1:07:13
what looks like a Labrador retriever.
1:07:16
Okay.
1:07:17
Isn't that cute?
1:07:19
Come on, even if you're not a dog
1:07:20
person.
1:07:22
All right. So, this is the
1:07:24
visualization I was referring to
1:07:25
earlier,
1:07:26
um, to figure out what are these
1:07:28
networks actually learning.
1:07:30
This paper was one of the first ones to
1:07:31
actually visualize what's going on
1:07:32
inside. So, if you folks are curious how
1:07:34
these pictures are actually produced, I
1:07:36
would encourage you to check this out.
1:07:38
Okay, yep.
1:07:40
So, we spoke about images, and you
1:07:42
referred to classes,
1:07:44
and to
1:07:46
text
1:07:47
next week with transformers. But
1:07:49
what about, say, an email which has both
1:07:52
text and images, and maybe white
1:07:54
space depending on who has written it?
1:07:56
Does that get put in as an input
1:07:58
as an image, or...
1:08:01
So, we'll revisit this great question a
1:08:03
bit later on in the course.
1:08:04
So, the answer is a bit complicated, and
1:08:06
I want to do it justice,
1:08:07
so we'll come back to it.
1:08:09
All right, so
1:08:10
so it turns out this thing called ResNet
1:08:12
is a family of networks which
1:08:14
were trained on this ImageNet data set,
1:08:16
and they did really well in the
1:08:18
competition associated with the
1:08:19
ImageNet data set, the ILSVRC challenge. And
1:08:21
so, this is an example of such a
1:08:22
network. So, we would expect the
1:08:24
weights and the parameters of ResNet,
1:08:27
given that it's been trained on
1:08:28
ImageNet, to sort of have some knowledge
1:08:30
about lines and shapes and curves and
1:08:32
things like that. So, maybe we can just
1:08:34
use that, right?
1:08:37
But the thing is, we can't use ResNet as
1:08:39
is because remember, it was trained to
1:08:40
classify an incoming image into a
1:08:42
thousand possibilities.
1:08:44
Here we only have two possibilities,
1:08:45
handbags and shoes. So, what we do is
1:08:47
very simple and elegant. We do just a
1:08:50
little bit of surgery.
1:08:51
We take ResNet and stop just before the
1:08:54
final layer. So, take my word for it,
1:08:57
this thing here, what it says is "fully
1:08:59
connected, 1000."
1:09:01
Because it's a thousand-way classifier, right?
1:09:02
A thousand objects. So, what we do is we
1:09:04
take everything else, and we stop
1:09:06
just before that last layer.
1:09:08
And then what comes out of that layer,
1:09:10
hopefully, will be like a very smart
1:09:11
representation of the images that it has
1:09:13
been trained on.
1:09:14
And so, what we do is we can think of
1:09:16
sort of headless ResNet
1:09:19
as our model.
1:09:21
And we can take all our
1:09:23
data and run it through ResNet up to but
1:09:26
not including the last layer.
1:09:28
Okay, you get some tensor and that
1:09:30
tensor is probably like a very has a
1:09:31
very rich understanding of what's going
1:09:33
on in that image, all the objects and
1:09:35
features and things like that. And then
1:09:36
we can just simply connect that, which we can
1:09:40
think of as a smart
1:09:42
representation of the input. We can
1:09:42
connect it to just a little hidden layer
1:09:44
and then we have a little sigmoid which
1:09:46
then tells you handbag or shoe. We can
1:09:47
just run this network.
1:09:50
Okay? Um, and since the inputs to the
1:09:53
hidden layer now are not raw images
1:09:54
anymore, but this much higher level of
1:09:57
abstraction that ResNet has learned,
1:09:59
hopefully it can get the job done with
1:10:00
hardly any examples.
1:10:02
Okay? And now you can get fancier.
1:10:04
That's the basic idea, but you can get
1:10:05
much fancier. You can connect up
1:10:07
headless ResNet directly with our little
1:10:09
network with a hidden layer and the
1:10:10
final thing and the whole thing can be
1:10:12
trained.
1:10:14
End to end. Uh but when you do that you
1:10:16
must start the training with the weights
1:10:18
that you downloaded with ResNet because
1:10:20
that is the crown jewel that's been
1:10:21
learned so you want to start from there.
1:10:23
Uh and you will do this in homework one.
1:10:26
Okay? All right. Uh by the way, these
1:10:28
pre-trained models are available all
1:10:29
over the internet. There is the
1:10:30
TensorFlow hub, the PyTorch hub and then
1:10:32
there's the Hugging Face hub. When I
1:10:34
checked it on the 13th yesterday, it had
1:10:36
over half a million models available
1:10:39
for download. Half a million.
1:10:41
I think last year it was like 50,000
1:10:42
when I taught the course. Uh so yes.
1:10:46
I was just wondering, doesn't this make
1:10:49
your neural network susceptible to
1:10:50
adversarial attacks because the weights
1:10:52
have been
1:10:53
pre-trained elsewhere? >> Yes, there is
1:10:55
some adversarial risk. I'm happy to talk
1:10:57
about it offline.
1:10:59
All right. So that's what we have. So
1:11:01
back to Colab. Okay. So that's what we
1:11:03
have. This is ResNet. So what we do is
1:11:06
and ResNet is all packaged up. It's
1:11:07
available for download. So we download
1:11:09
it here.
1:11:13
And you see here that I'm saying use
1:11:16
include top equals false.
1:11:19
So basically you are telling Keras
1:11:21
uh the top the very final layer of the
1:11:23
thing, don't give it to me. Just give me
1:11:25
everything up to but not including that.
1:11:27
And of course I think of it as left to
1:11:28
right. People think of it as bottom to
1:11:30
top. So for them, it's the very top
1:11:32
layer: don't give it to me. You're
1:11:34
telling it so that you don't have to
1:11:35
manually go and remove it.
1:11:37
Okay? And then I'll
1:11:39
just summarize
1:11:40
some of it, just to show you how big it is.
1:11:44
Okay?
1:11:45
23 million parameters.
1:11:48
ResNet. Okay? And I won't plot it
1:11:50
because then I'll be scrolling for 5
1:11:52
minutes. Uh
1:11:53
so let's just do this now. So what we're
1:11:55
now going to do is we're going to run
1:11:56
all the data through this thing and
1:11:58
whatever comes out at that penultimate
1:11:59
layer, I'm going to just grab it and
1:12:00
store it. So that's what this thing
1:12:02
does.
1:12:04
All right. And now we create a
1:12:07
handy little function to do all these
1:12:08
things.
1:12:09
And once I do that,
1:12:11
uh every image has been sent through
1:12:12
ResNet up to but not including the final layer, and
1:12:15
whatever would have gone into that final
1:12:16
layer, we're storing it. And then we're
1:12:18
going to build a simple network
1:12:19
and feed it only that
1:12:21
stored information.
1:12:23
Okay?
1:12:24
So what is coming out of ResNet, you can
1:12:26
see here 98 examples in the training
1:12:28
data and each example is now a 7 by 7 by
1:12:31
2048 tensor.
1:12:33
That's what came out of ResNet and you
1:12:35
saw that's what I did there.
1:12:37
Okay?
1:12:37
All right. So that's what it looks like.
1:12:39
Now let's just create our actual model
1:12:41
now. Right? We have our input which is
1:12:43
just a 7 by 7 by 2048.
1:12:46
We flatten it immediately.
1:12:48
Then we run it through a dense layer
1:12:50
with 256 ReLU neurons and then we use
1:12:52
dropout which I haven't talked about yet
1:12:54
which I will talk about early next week.
1:12:56
Uh but I will come back to it. Don't
1:12:58
worry about this detail for the moment.
1:13:00
Uh and then we just run through a
1:13:01
sigmoid.
1:13:03
Okay? And that's our model.
1:13:05
Finished. Plot the model. This is what
1:13:08
we have. Okay? Model summary.
1:13:13
That's it so far. All right, good. Now
1:13:15
let's actually train this thing.
1:13:18
I'm just going to run it for 10 epochs
1:13:20
because I tried running it uh previously
1:13:22
and it seems to do a fine job in just an
1:13:24
epoch. Okay, it's already done. It's so
1:13:26
fast because we ran everything through
1:13:28
this monster ResNet thing and basically
1:13:31
took all the output values and used them
1:13:33
as a starting point. Right? We don't
1:13:34
have to run it every single time. So you
1:13:36
can see here the accuracy is
1:13:40
quite high.
1:13:44
Wow, interesting. So on the 10th epoch
1:13:45
something bad happened.
1:13:48
So maybe I should have stopped at the
1:13:49
ninth epoch. I didn't see this yesterday
1:13:51
when I was running. So much for random
1:13:53
reproducibility. Uh
1:13:55
So let's just run this. Oh wow, look. On
1:13:57
the test set it's achieving 100%
1:13:58
accuracy.
1:14:02
It's unbelievable. Okay folks, now for
1:14:04
the moment of truth. Um all right, I
1:14:06
have a little code snippet here to
1:14:08
capture stuff from the webcam.
1:14:10
Because that last epoch it went down,
1:14:12
I'm a little worried that the demo is
1:14:13
going to flunk.
1:14:14
But you know what? We all have to live
1:14:16
dangerously. So
1:14:18
So here's a little function to predict
1:14:20
what's going to happen.
1:14:21
Okay. Now I tried it at home yesterday
1:14:23
by the way.
1:14:24
I did, and it's like, "Yay, it's a
1:14:26
handbag."
1:14:27
So okay. Now let's just do something
1:14:29
else.
1:14:30
Okay. Any volunteers?
1:14:32
I want a a piece of footwear
1:14:34
or a handbag.
1:14:37
It's like a backpack, right?
1:14:39
I don't know. It feels like an
1:14:40
adversarial example, but yeah, let's
1:14:42
just try it.
1:14:43
Okay.
1:14:45
No disrespect. Let me go
1:14:47
with the shoe first. I have a better
1:14:48
chance of it working.
1:14:50
So
1:14:51
it's a pretty big shoe. If it can't get
1:14:53
this shoe, I'm worried about this model.
1:14:55
All right. So
1:15:05
Okay. Hold on. Hold on. Hold on.
1:15:07
All right.
1:15:10
Please don't get distracted by my hand.
1:15:14
Capture.
1:15:16
It's a shoe! Look at that.
1:15:21
Phew. All right. Thanks.
1:15:25
Okay. Now let's try that. I'm feeling
1:15:26
kind of brave now.
1:15:28
Thank you. All right. Let's do this.
1:15:32
All right.
1:15:34
Camera capture.
1:15:40
Okay.
1:15:44
Let me put its better side forward.
1:15:54
It's a handbag! Look at that.
1:15:59
I swear every time I do the demo I age a
1:16:01
few years. So
1:16:03
All right folks, I'm done. Thank you.
— end of transcript —