
4: Deep Learning for Computer Vision – Transfer Learning and Fine-Tuning; Intro to HuggingFace

MIT OpenCourseWare · May 11, 2026
Transcript ~13861 words · 1:16:22
0:16
Right folks, good morning.
0:19
Welcome back. I hope you all had a nice
0:21
weekend.
0:22
Uh, and I hope you had a chance to watch
0:24
the video walk-through I posted
0:26
yesterday. Um, it's going to save us
0:28
some time today. So, let's get right in.
0:31
Today is going to be super packed. Um,
0:33
you're going to go from not knowing
0:35
anything about convolutions perhaps for
0:36
some of you to actually knowing how
0:38
convolutional networks work and actually
0:39
building one and demoing it in class, okay?
0:42
And uh, this demo has actually worked
0:44
pretty well for the last few years that
0:45
I've taught the class, but you never
0:47
know because it's a live demo, it may
0:48
not work. We'll see.
0:50
Um,
0:51
Valentine's Day gods, may they
0:53
be with us.
0:54
Okay, so let's get going. So, Fashion
0:56
MNIST we saw previously, um, as in,
1:00
you know, in the walk-through,
1:01
the video walk-through, that a neural
1:03
network with a single hidden
1:05
layer can get us to an accuracy in
1:08
the high 80s, okay? Uh, and that
1:11
network actually didn't know that
1:14
what was coming in was an image, right?
1:16
It literally took this table of numbers
1:18
and just took each row and then
1:19
concatenated all the rows into one giant
1:21
long vector and then sent it in. So, the
1:23
neural network didn't exploit the fact that
1:25
the input data was sort of known to be
1:27
of a certain type, okay? Which is the
1:30
clue for how we can do better?
1:32
Right? So, let's just spend a few
1:35
minutes on what it is about images
1:38
that we have to really pay attention to,
1:40
okay? As opposed to any arbitrary vector
1:42
of numbers that's coming in.
1:44
Okay? So, when we flatten the image into
1:47
a long vector and feed it into a dense
1:49
layer,
1:50
several undesirable things can actually
1:52
happen.
1:55
What are some of them? Any any guesses?
2:00
Uh, yeah.
2:02
I think you lose the proximity of one
2:04
pixel to other ones that would be around
2:06
it.
2:07
Right. So, if you take a particular
2:08
pixel, then let's say that the picture
2:11
shows a t-shirt, um, if there's a little
2:13
pixel at in the center of the t-shirt,
2:15
knowing that the surrounding pixels are
2:17
related to the pixel in a way because
2:19
they are all part of this concept called
2:21
a t-shirt, would certainly be helpful,
2:23
right? So, so to put it more
2:25
technically, spatial adjacency
2:28
information is very important. And we
2:30
need to somehow take that into account.
2:32
Okay? Um, all right. What else? What
2:34
else might be going on here?
2:38
Uh,
2:40
Yeah,
2:41
you have some metadata about it like the
2:43
resolution and things like that.
2:46
Oh, I see. So, if you actually had
2:47
structured data about the image such as,
2:50
you know, various characteristics, that
2:51
might be helpful. True. Now, but let's
2:54
just focus on the case where you only
2:55
have the raw image and nothing else.
2:57
And under that constraint, what else
3:00
might go wrong?
3:02
Or what else might be suboptimal?
3:08
Okay. Well, the first thing that might
3:10
happen is that
3:12
we may have too many parameters.
3:15
So, let's take an example. These
3:17
numbers are from my, you know,
3:18
older iPhone. Uh, I noticed that
3:21
when I take a color picture with my
3:22
phone, it's roughly a 3,000 by 3,000
3:27
grid, right? So, the picture is actually
3:30
3,024 pixels on this axis, 3,024 on that
3:34
axis, okay? So, that gets us to roughly
3:37
9 million pixels, but remember it's a
3:40
color picture, which means there are
3:41
three channels,
3:43
which means there are 27 million
3:45
numbers,
3:46
each of which is between 0 and 255 from
3:49
that little picture, okay? And now let's
3:51
say we connect it to a single
3:54
100 neuron dense layer.
3:57
A single 100 neuron dense layer. How
3:59
many parameters are we going to have?
4:00
Just in that one little part of the
4:01
network.
4:07
Could the mumbling be louder?
4:10
Yes, roughly 2.7 billion because 27
4:13
million inputs times 100,
4:15
right? Roughly, of course. Forget about
4:17
the biases for a moment, right? It's 2.7
4:19
billion.
4:21
2.7 billion parameters,
4:23
right? Do you think we can actually get
4:25
2.7 billion images to train any of these
4:27
things?
4:29
So, then you're going to overfit.
4:32
Right? Too many parameters. We have to
4:33
be smarter about this.
4:35
It's not going to work.
4:36
Right? That's the first problem.
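A quick sanity check of that arithmetic (a minimal sketch; the 3,024-pixel dimensions are the ones quoted above):

```python
# Back-of-the-envelope: flattened iPhone photo into a 100-neuron dense layer.
height, width, channels = 3024, 3024, 3    # ~9 million pixels, 3 channels
inputs = height * width * channels         # ~27.4 million input numbers
weights = inputs * 100                     # one weight per input per neuron,
print(f"{weights:,}")                      # ignoring biases: 2,743,372,800
```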
4:39
So, this is clearly computationally
4:41
demanding, very data hungry, and
4:43
increases the risk of overfitting.
4:45
Okay?
4:46
Next,
4:49
we lose spatial adjacency.
4:51
Right? We literally are ignoring what's
4:52
nearby.
4:55
So, that's a huge factor. There's a
4:57
third factor,
4:58
right? That we have to worry about,
5:01
which is that
5:02
let's say that, you know, the picture
5:04
has a vertical line
5:06
on the top left side and it has
5:08
some other vertical line on the bottom
5:09
right side.
5:12
What this sort of dumb approach is going
5:14
to do
5:15
is it's going to learn to
5:16
detect that vertical line on the top
5:18
left and, independent of
5:20
that, it's going to learn to detect the
5:21
vertical line on the bottom right.
5:24
Okay? Which doesn't make any sense.
5:26
A vertical line is a vertical
5:27
line. So, you want to be able to detect
5:29
it wherever it happens.
5:31
Detect once, reuse everywhere.
5:33
That's what you need to do.
5:35
So, this, by the way, is called
5:36
translation invariance.
5:38
Translation is math speak for move stuff
5:40
around.
5:41
Right? You take a line and it moves
5:42
around,
5:43
it doesn't matter, it's still a line.
5:45
Let's figure it out.
5:47
So, these are the three things we
5:48
need to worry about. So, we want to
5:50
learn once and use all over the place.
5:53
We want to take spatial adjacency into
5:55
account, number two. And number three,
5:56
let's just find a way to make sure that
5:58
we don't have billions of parameters for
5:59
simple toy problems.
6:02
Any questions?
6:05
Yep.
6:07
Um, is this a problem
6:09
just because we are compressing the
6:11
image or would it have happened anyway?
6:14
It would have happened anyway. So, the question
6:15
was is it a problem because we are
6:16
compressing the image, or would it
6:18
have happened anyway? The
6:19
answer is it would have happened anyway.
6:20
You can take any picture, this is going
6:22
to happen, right? Because I'm not making
6:24
any assumptions about how the image is
6:26
coming in to me,
6:27
whether it's compressed or not and so on
6:28
and so forth.
6:31
Okay. All right.
6:33
So, convolutional layers
6:36
were developed to precisely address
6:38
these shortcomings, and they're an amazing
6:40
solution, as you will see. Very elegant.
6:45
All right.
6:45
So, the next, I don't know, half an hour
6:49
is going to be me defining a whole bunch
6:51
of stuff
6:52
before we actually get to the fun
6:53
Colabs and so on and so forth.
6:55
Um, so just to put in perspective, I I
6:57
have a PowerPoint,
6:59
two Colabs,
7:01
and an Excel spreadsheet, and maybe even
7:03
a Notability file to cover today.
7:06
Okay? So, but hang on for the next 30
7:08
minutes because it's going to be a
7:09
little concept heavy
7:10
before we get to the fun stuff. So, stop
7:12
me, ask me questions because we do have
7:14
time.
7:15
All right. A convolutional layer is made
7:17
up of something called a convolutional
7:18
filter.
7:20
Okay? That's the atomic building block.
7:22
A convolutional filter is nothing but
7:24
a small matrix of numbers like this.
7:28
It's just a small square matrix of
7:29
numbers. That's a convolutional filter,
7:31
okay? Now,
7:33
a layer is just composed of one or more
7:35
of these filters.
7:38
All right?
7:39
Filters and layers.
7:41
Now,
7:42
the thing about the convolutional filter
7:44
that makes it really magical
7:46
is that if you choose the numbers in a
7:48
filter carefully
7:50
and then you apply the filter to an
7:52
image, and I'll get to what I mean by
7:53
applying the filter,
7:56
if you choose the numbers carefully and
7:57
you apply to that image,
7:59
this little humble thing has the ability
8:02
to detect features in your image.
8:04
It can detect lines, curves, gradations
8:07
in color, circles, things like that,
8:09
okay? It's pretty cool.
8:11
And so,
8:12
I'm going to claim and I'm going to
8:14
prove shortly that this little humble
8:15
filter with the ones and zeros, it can
8:17
detect horizontal lines in any picture
8:19
you give it.
8:21
Okay?
8:22
This thing here has the
8:23
ability to detect vertical lines.
8:27
All right? So, I will demonstrate how
8:28
this thing actually detects all these
8:30
things and then we will ask the big
8:33
question that's probably in your minds
8:34
already, where are we going to get these
8:35
numbers from?
8:37
That all sounds great, Rama. Where are
8:39
we going to get the numbers from? Okay?
8:41
And we have a beautiful answer to that
8:42
question.
8:43
All right. So, let's go. Um, now I'm
8:46
going to first explain to you what I
8:47
mean by applying a filter to an image
8:50
and then I'm going to give you examples
8:52
of how the filter works for detecting
8:54
vertical and horizontal lines. So, all
8:56
right. So, let's say that this is the
8:58
image we have.
9:00
Okay? Again, an image. Assume it's a
9:02
grayscale image. So, you just have a
9:04
bunch of numbers between 0 and 255,
9:06
okay? So, this is the image
9:07
we have. It's a little tiny image.
9:09
And this is the filter that's been
9:10
magically given to us by somebody.
9:13
And what we are trying to do now is to
9:14
apply it, okay? So, what we do is that
9:17
we literally take this filter,
9:19
the little one, and then we superimpose
9:22
it on the top left part of the image.
9:24
So, you have the image here, you take
9:26
this little filter, and then you move it
9:28
to the top left so that they are sort of
9:30
right on top of each other.
9:32
Okay?
9:33
Once you have it right on top of each
9:34
other,
9:35
you have these matching numbers. You
9:37
have nine numbers in the image, there
9:39
are nine numbers in the filter, and
9:41
they're all matching each other right on
9:42
top of each other, right? So, you have
9:44
nine pairs of numbers.
9:46
And then what we do, once we overlay it,
9:48
we literally just multiply all the
9:50
matching numbers and add them up.
9:53
Okay? You just multiply all the matching
9:55
numbers and add them up, and you can confirm
9:57
later on that, you know, the
9:58
arithmetic I'm doing is actually
9:59
accurate. Okay?
10:01
And once you do that, you'll get some
10:03
number.
10:04
Right?
10:05
Um
10:06
once you get that number
10:09
what we do is we go to our good old
10:11
friend the ReLU
10:12
and then we just run it through a ReLU.
10:15
Now, in this case all that effort comes
10:16
to nothing because it's zero. It's okay.
10:19
Okay? So, zero and this number becomes
10:22
the top left cell of your output.
10:26
So, this is called the convolution
10:28
operation.
10:29
Okay?
10:30
And we won't get into why it's called
10:31
that and so on and so forth. There's a
10:32
long and rich and storied history of
10:34
these things.
10:35
But this is the convolution operation.
10:38
And once we do that you sort of can now
10:40
predict what's going to happen, right?
10:42
We take the same exact operation and we
10:44
just move it to the right.
10:46
We move this little 3 by 3 thing to the
10:48
right and repeat the exact same process.
10:51
Matching numbers
10:53
you know, multiply all
10:54
the matching numbers together, add them
10:55
up, run them through a ReLU.
10:58
Okay?
10:59
And then boom, you get the
11:01
second number here.
11:03
And you keep doing that till you reach
11:05
the very end. You fill up all these
11:07
numbers, then you come to
11:08
the top of the second row.
11:11
Okay?
11:12
And you keep on doing that till you
11:14
reach the very bottom.
11:16
So, this is what I mean when I say apply
11:18
a filter to an image.
11:21
Okay?
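As a minimal sketch of the mechanics just described (not the lecture's code; stride 1, no padding, assuming a 2D grayscale image):

```python
import numpy as np

def conv2d_relu(image, kernel):
    """Slide a k x k filter over a 2D image: at each position,
    multiply the matching numbers, add them up, run through a ReLU."""
    H, W = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            window = image[i:i + k, j:j + k]               # superimpose
            out[i, j] = max(0.0, np.sum(window * kernel))  # multiply, add, ReLU
    return out
```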
11:22
Any questions?
11:25
Okay.
11:29
Microphone, please.
11:31
Microphone.
11:35
What happens when
11:36
you reach the edge of the image
11:38
and you have to stop,
11:39
but the remaining part of
11:42
the filter doesn't perfectly match?
11:44
Yeah, so you start from the left and
11:46
then you keep on going. At some point
11:47
the right edge of the filter is going to
11:49
match the right edge of the image and
11:51
then you stop.
11:52
Yeah. Now, there are some nuances here.
11:55
So, for example, you can actually pad
11:58
the whole image
11:59
on its borders so that you can actually
12:01
go outside the image and it'll still
12:03
work.
12:04
Okay? Number one. Number two, nuance.
12:08
Instead of just moving one step to the
12:10
right every time you finish, you can
12:11
move two steps to the right.
12:13
Right? And that's something called a
12:15
stride. Okay? So, there are a bunch of
12:17
pesky details here. But I'm just
12:20
ignoring them because this basic default
12:22
approach works amazingly well
12:24
almost all the time.
12:27
Okay? All right. So, that's
12:29
the mechanics of how this
12:31
operation works. Um all right. Now, I'm
12:33
going to switch to a spreadsheet which
12:35
shows this really beautifully
12:37
courtesy of the fast.ai people.
12:41
All right. So, it's a big spreadsheet, so
12:43
I'll upload the spreadsheet after class
12:44
so you can
12:45
see it. So, all I have done here, rather,
12:48
all they have done here
12:50
thanks to them, is that they have
12:51
essentially created a table of numbers
12:53
in Excel as you can tell.
12:55
And they have just put some numbers.
12:57
Most of the numbers are zero. But
12:59
some of these numbers are more than
13:01
zero. They're like 0.8, 0.9 and so on.
13:03
Basically, all they have done is instead
13:04
of working with numbers between zero and
13:06
255, they're just dividing all the
13:08
numbers by 255 so you get fractions and
13:10
they just put the fractions in the
13:11
table. Okay? And then then they have
13:13
used Excel's very cool conditional
13:15
formatting
13:16
to essentially mark in red all the
13:19
values that are high. Right? The closer the
13:21
number is to one, the more
13:23
reddish it gets.
13:24
Okay? And when you do that the three
13:26
obviously pops out.
13:28
So, there is a three in the image. Yes?
13:31
Okay, good. So, now
13:33
what we're going to do is we're going to
13:35
move to our little filter here.
13:37
You can see the filter.
13:39
Right? And I'm claiming this detects
13:41
horizontal lines. And so and this table
13:44
here
13:47
Sorry.
13:51
This table here is the result of
13:53
applying that filter to the three.
13:56
Okay? And you can see here I'm looking
13:58
at the top left cell here.
14:01
Um
14:03
This is
14:03
Look at this top left cell. The formula
14:05
is nothing more than
14:07
you know, multiply all those things and
14:08
add them up. And then once you add it
14:10
up, run it through a max of zero comma
14:12
that, which is just the ReLU.
14:15
Okay? Basic arithmetic.
14:18
So, we do that.
14:19
And this is the output and the output is
14:21
also conditionally formatted to show you
14:24
where things are lighting up.
14:26
And you can see only the horizontal
14:30
lines of the three are lighting up.
14:34
Everyone see that?
14:35
Right?
14:36
So, now you understand the
14:38
filter in fact is living up to the claim
14:41
I made for it.
14:42
Right? Similarly,
14:44
if you look at what's going on here,
14:46
this is a vertical filter, the same
14:47
thing, you apply it, only the vertical
14:50
line is lighting up.
14:53
Right? Now, what you can do is
14:56
uh I would encourage you to do this, you
14:57
know, um after class, is you can look at
15:00
all these numbers here, for example, and
15:02
then ask yourself, "Okay, why is that
15:04
lighting up?"
15:06
Right? And you will discover that what's
15:08
actually going on is that it's looking
15:11
for edges.
15:12
It's looking for, you know,
15:14
rows in the table where
15:16
there is some nonzero thing in the first
15:18
row and zeros in the second row.
15:21
And by choosing the numbers carefully,
15:23
you multiply the ones with positive
15:25
numbers and you multiply the zeros with
15:27
zeros and then you'll come up with a
15:29
positive number and thereby you detect
15:31
an edge.
15:32
Right? So, what I would encourage you to
15:34
do is use this Excel thing here.
15:39
All right. So, here is a cell we
15:41
have. So, let's uh trace its
15:48
provenance.
15:49
Okay.
15:51
So, you can see here
15:53
these numbers
15:56
Right? This is what it's processing.
15:59
Right? This grid is being
16:00
processed to come up with that big
16:01
number. And you can see here in this
16:04
grid, these numbers are
16:06
here, and then these numbers are a lot
16:08
lower than those numbers because
16:11
there is an edge.
16:13
Right? The numbers are a lot lower.
16:14
That's why you can see the horizontal
16:16
part of the three.
16:17
And so, what this filter is doing, it's
16:19
basically saying, "Well,
16:22
the row that I'm catching here has the
16:24
ones, the middle has zeros, the rest are
16:26
all minus ones."
16:27
Right? So, the small values are going to
16:29
get very small.
16:31
The big values are going to get very big
16:33
and the overall thing is going to be
16:34
emphasized.
16:35
So, that's the basic idea of edge
16:37
detection.
16:38
Spend some time with the Excel
16:39
and it'll become clear to you
16:41
what I'm talking about here.
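For instance (a sketch using the conv2d_relu helper sketched earlier; the filter values are the classic horizontal-edge pattern described here, not the spreadsheet's exact numbers):

```python
horizontal_edge = np.array([[ 1,  1,  1],
                            [ 0,  0,  0],
                            [-1, -1, -1]])
img = np.zeros((6, 6))
img[2, :] = 1.0                            # one bright horizontal line
print(conv2d_relu(img, horizontal_edge))   # lights up only along that edge:
                                           # the bright row meets the +1s, the
                                           # dark row below meets the -1s, so
                                           # the sum comes out large and positive
```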
16:43
All right, cool. So, that's that.
16:46
All right. Uh, by the way, there is also
16:48
a very cool little site
16:49
here
16:50
in which you can actually go in and
16:52
punch in your own numbers and see what
16:53
it detects.
16:55
Right? Lots of edges and curves and this
16:56
and that. It's very cool. So, I
16:58
encourage you to try it out.
17:00
So, the key thing here I want to say is
17:06
by choosing the numbers in a filter
17:08
carefully and applying this operation
17:10
different different features can be
17:12
detected. All right.
17:13
Now,
17:14
I mentioned earlier that a convolution
17:16
layer is composed of one or more of
17:18
these filters. So, one or more of these
17:20
filters. And so, you can think of each
17:23
filter as a sort of a specialist for a
17:25
particular feature.
17:27
Okay? So, it's a specialist. Maybe it it
17:30
specializes in detecting vertical lines,
17:32
horizontal lines, you know, uh
17:34
semicircles, quarter circles, you don't
17:35
know. Right? You can imagine them
17:38
as being specialists.
17:39
And given that modern images could be
17:42
very complicated, they may have lots of
17:43
interesting features going on, you
17:45
probably want to have lots of these
17:46
filters.
17:48
Okay? But the key is that you
17:52
don't have to decide up front, "Hey, you
17:54
filter, you better specialize in
17:56
detecting vertical lines, and you, on the
17:57
other hand, stay in your lane and do
18:00
horizontal lines." Right? You're not going
18:01
to do that.
18:02
You will let the system figure out what
18:04
it wants to figure out.
18:06
Okay? So, there is no human bottleneck
18:08
in doing this.
18:10
And I mentioned this because there used
18:11
to be a human bottleneck, you know,
18:13
before deep learning happened.
18:15
And so,
18:17
Now, let's just um make sure we
18:19
understand the mechanics of what happens
18:20
when you have two of these filters, not
18:22
one. So, this is the input image as
18:24
before. This is the filter we saw
18:26
earlier and this is another filter we
18:28
have.
18:29
The thing is we just run them in
18:30
parallel. We take each filter, do the
18:32
operation, come up with an output. Take
18:33
the other filter, do the operation, come
18:34
up with its output. And then when you do
18:36
that, the first one gives you that, the
18:38
second one gives you that. And this
18:40
output is a table of... well, it's
18:42
actually not a table. What
18:44
is it?
18:49
Louder, please.
18:51
It's a tensor. Thank you. It's a tensor.
18:54
And so, these two 5 by 5 matrices can be
18:56
represented as a tensor of what shape?
19:02
And there are two right answers.
19:04
5 by 5
19:06
into two, correct. So, you can
19:08
either think of it as 5 by 5 * 2 or 2 *
19:11
5 by 5. They're both fine.
19:14
Which one you go with actually ends
19:15
up being a matter of convention.
19:18
Okay? So, now you begin to see why we
19:20
care about tensors.
19:22
Imagine if instead of having two
19:24
filters, we have 103 filters.
19:27
The resulting tensor is going to be 5 by
19:29
5 by 103.
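You can verify this shape bookkeeping directly in Keras (a sketch; the 7 by 7 input size is just for illustration, and Keras uses the channels-last convention, i.e. 5 by 5 by 103 rather than 103 by 5 by 5):

```python
import keras
from keras import layers

inputs = keras.Input(shape=(7, 7, 1))                  # tiny grayscale image
x = layers.Conv2D(filters=103, kernel_size=3)(inputs)  # 103 filters, 3x3 each
print(x.shape)                                         # (None, 5, 5, 103)
```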
19:33
Okay.
19:34
Good.
19:35
Um all right. Now,
19:37
let's now look at the slightly more
19:39
complex situation where you have not a
19:42
black and white image, a grayscale image
19:44
with just a little table, but an actual
19:46
color image.
19:48
Okay? So, we know how to apply a
19:51
filter to a 2D tensor like this and to
19:54
get that. But let's say we have
19:56
something like this where it has
19:58
three, right? It's got three channels,
20:00
red, green, blue, RGB. It's got three
20:02
tables of numbers.
20:03
So, this is a tensor of shape 6 * 6 *
20:06
3, let's say, and you want to apply this
20:08
3 by 3 filter just like before to this
20:11
thing. You want to apply the convolution
20:12
operation. How's that going to work?
20:18
Do we just like apply this to each
20:21
We first apply it to the red, then we
20:23
apply it to the green, then we
20:25
apply it to the blue. Should we do that?
20:30
Or is there a
20:31
a problem with that approach?
20:36
Yeah.
20:39
Could you use the microphone, please?
20:42
Uh the problem with the approach, I
20:43
think, would be the same as what you
20:45
said earlier, that it would learn the
20:47
lines probably the same each channel,
20:49
right?
20:50
Like the location of the lines are
20:51
probably the same each channel.
20:54
Yes, the location of the line is going
20:55
to be the same thing because that line,
20:57
if you will, is sort of the
20:59
aggregation of information from the
21:00
three different channels. Right. But the
21:03
problem here
21:05
is sort of slightly different,
21:07
which is that
21:09
If you do them independently,
21:12
the network has not been informed that
21:15
these things are all part of the same
21:17
underlying concept.
21:19
As far as it's concerned, it's just like
21:21
three things. It's just going to process
21:22
them independently. So, we need to
21:23
somehow change the filter so that it
21:25
understands like what is at this pixel
21:27
location, the three numbers under it,
21:29
RGB, they're actually the same part of
21:31
the same thing, underlying thing.
21:35
So, what we do is actually very simple.
21:37
We just take this filter and make it 3D.
21:42
So, we take this filter, instead of
21:44
having just one of them, we just make it
21:45
a cube like that. Three times.
21:49
And once we do that, you can imagine
21:51
taking this thing here and essentially
21:53
doing that.
21:56
Okay. Now, instead of having, you know,
21:58
nine numbers in the image and nine
22:00
numbers in the filter,
22:01
you have 27 numbers in the image, 27
22:04
numbers in the filter.
22:05
But you still match them up, multiply
22:07
them, add them up, run them through a
22:09
ReLU.
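At a single position, the arithmetic for the color case looks like this (a sketch with made-up numbers):

```python
import numpy as np

patch  = np.random.rand(3, 3, 3)   # one 3x3 window of an RGB image: 27 numbers
kernel = np.random.rand(3, 3, 3)   # the filter is now 3x3x3: 27 numbers too
value  = max(0.0, np.sum(patch * kernel))   # match up, multiply, add, ReLU
```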
22:14
By the way, I tried to get ChatGPT to
22:16
give me a picture like that.
22:19
It just completely bombed.
22:21
I tried like three, four, five different
22:22
variations. It just gave up. And then I
22:24
found this nice picture in the
22:25
deeplearning.ai and I used it.
22:28
So, then if you put different numbers in
22:30
each of the layers, is that like color
22:32
processing? Like it could be doing a
22:33
different thing to green and blue. I'm
22:36
sorry, say that again. If you put
22:37
different numbers in each of the layers
22:39
of your kernel, in each of the
22:42
different like depth dimensions of your
22:43
convolution filter, would that be like
22:45
color processing?
22:47
Uh, yeah.
22:49
Yeah, you will put different numbers. In
22:50
fact, you have 27 numbers now,
22:53
but we haven't gotten to the question of
22:54
where these numbers are coming from. So,
22:55
just hold the thought till we get there.
22:58
Okay. Um so, any questions on this?
23:02
Okay. You literally take the 2D thing
23:04
and make it 3D.
23:05
You basically give it depth and the
23:08
depth just matches the depth of the
23:10
input.
23:11
So, if the input is like, you know, 10
23:13
deep, your filter is going to get 10
23:15
deep.
23:18
Okay?
23:20
Yes.
23:22
Rather than
23:24
increasing the rank order of the tensor
23:26
by one, is there any instance where you
23:27
would create a subtraction layer where
23:29
you would run an operation across the
23:30
different layers to come up with a
23:33
intermediary layer that you would run a
23:35
lower rank tensor of a filter over?
23:38
Yeah, so there is a lot of stuff in the
23:40
research literature which tries to do
23:42
things like that. Uh I'm just describing
23:45
like the the the most basic approach to
23:48
doing this. And as it turns out, this
23:50
basic approach is actually extremely
23:51
powerful, right? And of course, uh
23:54
researchers try to, you know, go from
23:56
the 95% thing to 95.1%.
23:59
So, they invent like all sorts of crazy
24:01
complicated stuff, which is all good for
24:02
us, humanity, but for practical use,
24:04
this is good enough.
24:08
How do you convert the 3 by 3 layer into
24:10
a single 4 by 4 layer? 4 by 4 is
24:12
understood, but what about the 3 layers?
24:14
How do they work?
24:15
Yeah. Um so, we are coming to that. I
24:17
think we have a slide here. Actually, we
24:19
don't. Never mind. We'll answer that. Um
24:20
so, so here you have one filter, right?
24:23
You have one 3 by 3 by 3 filter, which
24:26
plugs into this thing here, and then it
24:28
gives you the 4 by 4 at the end.
24:30
Right? So, for one filter, we know that
24:33
by doing this operation, we get
24:37
we get this 4 by 4.
24:38
Let's say that you have another filter,
24:40
which is also 3D.
24:41
You do that thing, you'll get another 4
24:43
by 4.
24:45
And if you have 10 filters, you'll get
24:46
10 of these 4 by 4s, which then gets
24:48
packaged up into a 4 by 4 by 10 tensor.
24:54
Remember, whether the filter is 2D, 3D, 10D,
24:57
what is coming out is always 2D.
25:02
Because ultimately, when you apply all
25:03
this operation, at each position, you
25:05
just have one number.
25:06
And then ultimately, you just do all
25:07
those things, you just come up with a
25:08
table of numbers always. So, what's
25:10
coming out is always a 2D number table
25:13
like that.
25:14
But when you have lots of filters, you
25:16
have lots of these 2D tables one after
25:18
the other, and therefore, they get
25:20
packaged up into a tensor.
25:25
All right.
25:26
Um so,
25:28
textbook chapter 8.1 has a lot of detail
25:30
and intuition, which I think is really
25:32
good. So, please uh try it out. Okay.
25:35
And folks, by the way, this convolution
25:37
stuff, um it's sort of it grows in the
25:40
telling. So, I would encourage you to
25:41
revisit it, revisit it
25:43
a few times, and then it slowly becomes
25:45
part of your muscle memory.
25:48
Don't expect to just understand all the
25:49
nuances like one shot.
25:51
Do it a few times.
25:52
And it will become, you know, wired into
25:54
your into your head.
25:56
Okay. So, all right. The big question.
25:59
These seem excellent, but how are we
26:00
supposed to come up with these numbers?
26:02
Now, in fact, traditionally,
26:04
uh these filters actually used to be
26:05
designed by hand.
26:07
Uh computer vision researchers would
26:08
invest, you know, prodigious amounts of
26:10
time and effort and talent to figure
26:12
out, you know, the right kinds
26:14
of filters to use for various specific
26:17
applications. So, if you wanted to build
26:19
an application which would look at, say,
26:20
MRI images and figure out, okay, what
26:22
kind of features should I extract from
26:24
this MRI thing to be able to say, you
26:27
know, predict the evidence for a
26:28
stroke, they would actually, you know,
26:30
hand design the filter. They'd try lots
26:32
of different values and then come up
26:34
with, "Ah, I got the perfect filter for
26:35
this thing here." Right? So, that's the
26:37
way it used to be done.
26:39
Um, and now,
26:41
as we figured out how to train
26:42
deep networks with lots of parameters,
26:45
right? We figured out things like ReLU
26:47
activation, stochastic gradient descent,
26:49
GPUs, backprop, things like that, you
26:51
know, uh this big idea emerged. Why
26:54
don't we think of the numbers in the
26:55
filter as just weights?
26:57
And why don't we just simply learn them
26:59
from the data using backprop?
27:01
Right? Just like we learn all the other
27:03
weights. What's the big deal?
27:06
And this simple idea,
27:08
and it feels a bit, I don't know,
27:09
blindingly obvious in hindsight.
27:12
I'm sure it was not obvious in
27:13
foresight.
27:14
Um right? This was the breakthrough.
27:16
This was the key breakthrough. And now,
27:18
it's actually possible to do this
27:20
because a convolutional filter that we
27:22
have seen is actually just a neuron.
27:25
And the underlying arithmetic of it is
27:27
just neuronal arithmetic. And so, it
27:31
just happens to be a slightly special
27:32
one. It's actually even simpler than a
27:34
regular neuron. And in the interest of
27:37
time, I have one or two slides in the
27:39
appendix which tells you exactly why
27:40
it's a neuron. So, check it out. But
27:42
just take my word for it. It's just a
27:44
particular kind of neuron. And because
27:46
it's a particular kind of neuron, and we
27:48
know how to work with neurons,
27:50
right? You know how to work with
27:51
neurons, which means that our entire
27:53
machinery,
27:55
layers, loss functions, gradient
27:57
descent, SGD, blah, blah, everything is
27:59
immediately applicable.
28:01
We don't have to invent any new stuff to
28:03
make it work.
28:06
Okay?
28:08
All right.
28:09
Do you initialize the layers differently
28:12
in applications or just because the
28:14
network has different sizes? Like
28:16
computer vision versus uh medical
28:18
imaging. Is it just because the network
28:20
has different numbers in them?
28:23
Yeah, so the initialization
28:25
So, let's It's a good question. Let's
28:27
come back to it when we get to something
28:29
called transfer learning, which I'm
28:30
going to get to by about 9:30.
28:34
All right. So,
28:36
that's it. All right. So, this turned
28:37
out to be a huge turning point in the
28:38
computer vision field, and this was the
28:40
massive unlock in the year 2012. This
28:43
computer vision system that used this
28:44
technology called AlexNet burst out onto
28:47
the world stage because it crushed the
28:49
competition in a, you know, in a
28:51
competition called ImageNet, and uh the
28:53
previous best score was 26% error rate,
28:56
and this thing came in and had 16% error
28:59
rate. Right? It's the kind of thing
29:01
where if you see it, you'll be like,
29:01
"Oh, that must be a typo."
29:04
Right? Because every year, the
29:05
improvements in error rate were like
29:06
very little, half a percent, 1%, and
29:07
then this year it was 10 points, and that
29:09
was because of this approach.
29:12
And so, all right. Now, one other thing
29:14
I want to talk about is that with
29:16
every succeeding convolutional layer,
29:19
uh any
29:21
particular convolutional filter, it's
29:23
basically implicitly seeing much more of
29:25
the input image as we go along.
29:28
Right? Which means that if in the very
29:29
beginning, if this is the input, right?
29:31
This little convolutional filter, this
29:33
number here
29:34
in the first layer, let's say, only sees
29:37
like the top of the chimney or whatever
29:38
of this house.
29:40
But then the next layer, remember, the
29:42
next layer is input is this particular
29:44
layer.
29:45
And so,
29:47
this particular little thing here is
29:49
getting information from this whole
29:50
square here.
29:52
And every one of the points in that
29:53
square is actually something big in the
29:55
original picture.
29:57
So, with every additional layer, you're
29:59
seeing more and more and more of the
30:00
image.
30:03
All right? And this is a key part of why
30:04
these things work because you're
30:06
essentially hierarchically building a
30:08
better and better understanding of the
30:09
image.
30:10
It is the hierarchical understanding,
30:12
the hierarchical learning, that's a very
30:14
key part of the unlock.
30:17
And so, if you look at networks and what
30:20
they're visualizing, this is actually, you
30:21
know, what a face detection deep network
30:23
visualizes of what it's learning, you'll
30:25
see that the first layer is just
30:26
learning lines and edges and so on.
30:29
And the second layer is actually
30:30
learning edges. Look at this thing,
30:32
right?
30:33
It's it's learning to put these lines
30:36
together
30:37
to get some sort of an edge here,
30:38
another edge here. This looks like
30:40
three quarters of somebody's ears.
30:43
And then, these things are now being
30:45
assembled
30:46
to get whole faces out.
30:49
Can you imagine the researchers who did
30:50
this work? They built the network, it's
30:52
doing really well on detecting faces,
30:53
and they turn around, "Okay, let's see
30:54
what it's actually doing."
30:56
And then, this picture pops up.
30:58
I mean, goosebumps.
31:00
Okay, so pooling layers, the next one.
31:03
So,
31:04
so far we've talked about convolutional
31:05
layers, this is the second thing, second
31:07
building block, and then we'll again
31:09
go to the Colabs. So, pooling layers
31:11
are also called subsampling or
31:12
downsampling layers.
31:15
So, the idea is that every time a tensor
31:17
is coming out of these convolutional um
31:19
layers,
31:20
we try to make it slightly smaller
31:23
because the act of making it smaller
31:25
will force the network to try to
31:27
summarize and learn what's going on in
31:29
this complicated thing that's coming into
31:30
it, okay? So, I will describe the
31:32
mechanics first. Um
31:35
So, let's say that this is the output of
31:37
a convolutional layer.
31:39
Okay?
31:40
It's a 4 by 4.
31:42
So, what we do is that there are two
31:45
kinds of pooling, max pooling and
31:47
average pooling. This is called max
31:48
pooling, and the idea is really simple.
31:51
In this max pooling layer, there are no
31:52
weights or parameters to be learned. It's
31:53
just a simple arithmetic operation. We
31:56
basically take
31:57
this, and we superimpose a
32:00
2 by 2 empty grid
32:02
on the top left, and then we say, "Hey,
32:04
what's the biggest number among
32:06
these four numbers?" Well, the biggest
32:08
number is 43. Boom. Okay, I'm going to
32:09
stick a 43 here.
32:11
Then I move my 2 by 2 to the right
32:13
so that it overlaps with these numbers
32:15
in blue, and I say, "Hey, what's the
32:17
biggest number here?" Okay, that's 109.
32:19
And I move it down, what's the biggest
32:20
number here? 105. Stick it in here.
32:23
Biggest number here, 35, and I stick it
32:25
in there. That's it. This is max
32:26
pooling.
32:29
Similarly, there's this thing called
32:30
average pooling, but instead of taking
32:32
the maximum of these four numbers, we
32:33
just average the four numbers.
32:35
Okay, the average of these four things
32:36
in yellow,
32:38
am I done?
32:41
Average of these four numbers is 32.2.
32:43
The average of the blue numbers is 25.5, you
32:45
get the idea.
32:46
That's it. Max pooling and average
32:48
pooling. Now,
32:50
as you can see, when you go when you
32:51
apply pooling, the number of entries
32:53
drops significantly.
32:55
Right? The number of entries drops
32:56
significantly.
32:58
And the output from this layer is just
32:59
fed to the next layer as usual.
33:02
Okay? There's nothing, you know, crazy
33:04
going on.
33:05
So, it's a way to shrink the output from
33:07
one convolutional layer before it passes
33:10
on to the next convolutional layer:
33:11
you interject a pooling layer.
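A minimal NumPy sketch of 2 by 2 max pooling with stride 2; the grid values are hypothetical, chosen so the block maxima match the ones quoted above (43, 109, 105, 35):

```python
import numpy as np

def max_pool_2x2(t):
    """Keep the biggest number in each non-overlapping 2x2 block."""
    H, W = t.shape
    return t[:H // 2 * 2, :W // 2 * 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

t = np.array([[ 12,  20,  30,   0],
              [  8,  43,   2, 109],
              [ 34,  70,  27,   4],
              [105,  33,  23,  35]])
print(max_pool_2x2(t))   # [[ 43 109]
                         #  [105  35]]
# Average pooling is the same idea with .mean(axis=(1, 3)) instead of .max.
```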
33:13
Now, I actually have,
33:15
even if I say so myself, a very nice
33:18
handwritten explanation of what pooling
33:20
does, the effect of pooling.
33:23
And unfortunately, I can't get my iPad
33:25
to actually show up on my laptop.
33:27
So, I'm not going to be able to do it,
33:28
but I will record a walk-through.
33:31
Yeah, and once I post it, check it out, okay?
33:33
But the intuition that I tried to convey
33:35
with that thing is that oh, um Sorry,
33:38
I'll come back to this.
33:39
So, max pooling acts like an OR
33:41
condition. It basically says, "I have
33:43
this big picture.
33:44
So, in the four things that I'm looking
33:46
at, if there's any number which is
33:48
really high,
33:50
that means that some feature is being
33:51
detected, right?
33:54
If the number is really high coming out of
33:55
a convolutional layer, that means that
33:57
something somewhere fired up,
33:59
lit up.
34:00
And so, I'm just looking to see if
34:01
anything lit up in that part. If it did,
34:04
I'm going to say, "Yep, something lit
34:05
up."
34:06
If nothing lit up, then I'm going to
34:08
say, "Oh, nothing lit up."
34:09
So, in that sense, I
34:11
think you can imagine it's acting
34:13
like an OR condition.
34:15
Anything fired up? Anything fired up?
34:16
Anything fired up? Anything up? Yes,
34:17
okay. Otherwise, no.
34:19
And so,
34:22
sadly, I can't switch to Notability.
34:24
So, it acts like a feature detector. So,
34:27
if you have lots of things going on in a
34:28
particular picture, you want to be able
34:30
to summarize and aggregate all the
34:32
things that are going on so that you can
34:33
say. You may have a big picture
34:35
with lots of things lighting up here and
34:36
there, but you want to step back and
34:38
say, "You know what? In this picture,
34:40
the top left, nothing lit up. The top
34:42
right, something lit up. Bottom left,
34:45
something lit up. And the bottom right,
34:46
nothing lit up."
34:48
So, you're operating at a higher level
34:49
of abstraction.
34:51
That's the effect of pooling.
34:55
But don't you lose spatial information?
34:59
Uh, you don't, because
35:02
what you're actually saying is the top
35:04
left has this thing.
35:06
You already know it is in the top left.
35:08
And you already moved up to that level
35:10
of abstraction.
35:12
So, for example, if in the top
35:13
left there is a human eye,
35:15
and there is a circle detector, it's
35:18
going to fire up and say, "Hey, in
35:19
the top left there is an eye."
35:21
Yep, lit up. So, you're not looking at
35:23
the pixels anymore, you're already
35:24
operating at a higher level of
35:25
abstraction, and that's how we get
35:27
around it. But this proceeds slowly and
35:29
incrementally, which is why you have
35:31
these big networks.
35:34
All right.
35:35
So, now as we saw, as successive
35:38
convolution layers can see more and more
35:40
of the original image,
35:41
the max pooling layers that follow them
35:43
can detect if a feature exists in more
35:45
and more of the original input as well.
35:47
So, by the time you get to like the
35:48
seventh, eighth, ninth layers and
35:50
so on, this thing is actually really
35:52
smart. It's operating at a very high
35:53
level of abstraction.
35:55
Right? You can think of it: it
35:56
has basically tagged all the
35:58
features in that image at various
36:00
resolutions, and it can work with it.
36:04
Is there a trade-off between doing
36:06
pre-processing as opposed to adding
36:08
additional convolutional layers? I'm
36:11
thinking of turning a video
36:12
into black and white static images in
36:15
a sequence as opposed to
36:17
shoving in a color video with a ton of
36:19
noise.
36:20
The greater the time expanse, is there a
36:22
trade-off element? There is a trade-off.
36:24
Um, if for your particular data set and input
36:27
there is some very
36:29
important domain knowledge that you want
36:31
to encode
36:33
into the network so that the network
36:35
doesn't waste its capacity learning
36:37
things that you know have to be true,
36:39
then yeah, modify the input.
36:41
But if you're not sure,
36:43
right? Then you want to just let the network
36:45
learn whatever it can. As long as it's
36:47
focused on predicting as accurately
36:49
as possible, just let it be.
36:55
Uh all right. So, that's the basic idea.
36:57
And again, I'm sorry, this
36:59
Notability thing is not working.
37:01
Uh but take a look to really understand
37:03
um how this max pooling business
37:05
works. Okay. Oh, uh I think I skipped
37:08
over this.
37:09
So, when you have something like this,
37:12
so this, let's say, is a tensor coming
37:13
out of some convolutional layer, and its
37:15
size is 224 by 224 by 64, then you apply
37:18
something like a pooling. The thing I
37:20
want to point out is that the pooling
37:22
will work with every slice of the
37:23
tensor.
37:25
Okay? So, if the tensor is 224 by 224 by
37:27
64, it has a depth of 64,
37:30
which is basically like saying it's got
37:31
64 tables of 224 by 224, and the pooling
37:35
will work on every one of those tables.
37:38
Which means that
37:40
the 64 will stay: you'll still have 64
37:42
things at the very end. It's just that
37:43
every one of the things of the 64, the
37:45
224 by 224, will shrink to 112 by 112.
37:49
So, each table shrinks due to pooling,
37:52
but the number of tables does not
37:53
change.
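In Keras terms (a sketch):

```python
import keras
from keras import layers

x = keras.Input(shape=(224, 224, 64))
y = layers.MaxPooling2D(pool_size=2)(x)
print(y.shape)   # (None, 112, 112, 64): each table halves, depth stays 64
```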
37:57
Okay. So,
37:59
uh by the way, this
38:01
link here
38:03
has a beautiful explanation of all these
38:05
things with a little bit more complexity
38:06
as well from a course taught at Stanford
38:08
in like 2018 or 2019 or something, I
38:10
forget. Uh so, just check it out if
38:12
you're curious about this stuff. It's
38:13
really good.
38:15
Okay. Um
38:18
All right. So, that brings us to the
38:19
architecture of a basic CNN.
38:21
Um and so, what we do is we have an
38:23
input.
38:25
Okay? We take that input, we run it
38:27
through a bunch of convolutional and
38:29
pooling layers. So, there's a
38:30
convolutional layer, and then we pool
38:33
it, which is why it has shrunk
38:35
in size,
38:37
and then it goes through another
38:38
convolutional layer, then we pool it,
38:40
which is shrunk again,
38:42
and then it keeps on doing it. So, we
38:44
have a series of these;
38:45
these are called convolutional blocks.
38:47
So, a convolutional block is typically,
38:49
you know, one to two convolutional
38:50
layers followed by a pooling layer.
38:52
Okay.
38:54
So, you have a series of convolutional
38:55
blocks.
38:57
Okay? And the thing to notice is that
38:59
as you go further and further in the
39:01
network,
39:03
the blocks will actually get smaller and
39:05
smaller because of
39:07
uh max pooling, right? They'll get
39:09
smaller and smaller, but they'll get
39:10
deeper and deeper.
39:14
Okay.
39:14
And we have empirically figured out
39:16
that that model of reducing the
39:18
size, the height and the
39:20
width, but then making it deeper, tends
39:22
to work really well in practice.
39:25
And so,
39:27
in fact, uh, and apologies to the live
39:29
stream that I can't use the iPad, I'm going
39:31
to do it on the board.
39:35
So, let's say that you have a picture
39:38
which is
39:39
coming in as 224
39:43
224
39:44
and then you have
39:46
say three of them
39:48
because it's a color picture, so you
39:49
have three of them.
39:52
Can you folks see this okay?
39:54
All right. So, right? Let's say this is
39:56
the input coming in. And ResNet, which
39:59
is a very famous network that we're
40:00
actually going to work with in a few
40:02
minutes,
40:03
then it actually gets done with all this
40:05
convolution pooling business.
40:07
The final tensor that it has is
40:11
actually of shape
40:13
7 by 7.
40:16
But it is 2048 long.
40:22
Okay? So, it has
40:24
processed something which is 224 by 224 by 3
40:26
to much smaller height and width just 7
40:28
by 7, but it's gotten much deeper, 2048
40:31
channels deep.
40:32
This is a this is a numerical example of
40:34
what I'm talking about there in terms of
40:36
as you go along, things get smaller but
40:39
deeper.
40:41
All right.
40:43
Uh
40:44
Yes?
40:45
Is the reason that it gets deeper
40:47
because each
40:49
Like it gets deeper because each
40:50
layer has a single feature that is
40:52
picked up and then it gets stacked on
40:54
top
40:55
It's not so much that each layer is
40:57
picking up a single feature, it's more
40:58
that
40:59
uh
41:00
basically
41:01
the way I think about it is that
41:04
the number of atomic
41:06
features that you may want to detect are
41:07
probably not that many, right? Lines,
41:10
curves, gradations in color and things
41:11
like that. But the way in which you can
41:13
combine these atomic features
41:16
to depict real world things
41:18
is combinatorial.
41:20
It's sort of like I have 10 kinds of
41:22
atoms, how many molecules can I make
41:23
from it?
41:25
You can make a lot of molecules from
41:26
those 10 atoms, which means that you
41:28
better give the network more the ability
41:30
to capture more and more of these
41:32
possible things that the real world can
41:33
come up with.
41:35
And so, as the depth increases, you
41:38
have more filters, and every filter
41:40
now has the ability to pick up some
41:42
combinatorial combination of what's
41:43
coming in.
41:49
Uh sorry, quick question related to
41:51
this. So, right now like our model is
41:53
being trained to detect certain specific
41:55
features like a line, a color, or
41:56
something of this sort. But still it
41:58
doesn't have meaning to this, right?
42:00
Like, it still doesn't know if that
42:02
arc is a sun or an eye, right?
42:06
So, yeah. So, we don't tell it what
42:08
to learn, it just learns.
42:10
All we tell it is make sure that you
42:12
minimize the loss function. Now, once it
42:14
is finished learning, if it's a good
42:16
network, it has good accuracy, then we
42:18
can introspect. We can peek into the
42:21
internals and try to understand what is
42:23
it learning,
42:24
right? And sometimes you like you saw in
42:26
the face detection example, it's
42:27
actually learning interesting things
42:28
like basic lines and edges and then
42:30
slowly, you know, more complicated
42:32
shapes and then finally like entire
42:34
human faces. Sometimes it may not be
42:36
understandable.
42:37
And the way it's doing this is by
42:39
constructing features, like my brain?
42:42
Like how do you figure out what it's
42:44
learning?
42:44
>> Yeah. Oh, oh, I see. So, I'm going to
42:46
give a reference in just a few minutes.
42:49
Read the paper. That was one of the
42:50
first ones to actually visualize what it
42:52
what these things are learning and
42:53
that'll give you an idea of how it
42:54
actually works. And I'm also happy to
42:56
talk about it offline. It's a bit of a
42:58
tangent, but it's a really rich tangent,
43:00
so if I keep talking about it, I'll
43:02
end up spending 10 minutes on it, so I'm
43:03
going to back off.
43:06
Okay.
43:08
Um all right.
43:09
So, now once we do that,
43:12
okay? Now we are back in familiar
43:13
territory where we take whatever tensor
43:16
is coming out from these convolutional
43:18
operations and pooling operations and
43:20
then we just flatten them, only now, into
43:22
a long vector. And once we flatten them,
43:25
we can connect them to some good old
43:27
dense layers
43:29
like we know how to do and then we
43:30
finally connect them with whatever, you
43:32
know, output layer you want, right? In
43:34
this case, this example is using some
43:36
multi-class classification,
43:39
classifying images into what kind of
43:41
automobile or whatever it is. So, it's
43:42
like a softmax. So, this is a general
43:44
framework.
43:48
Okay?
43:50
Any questions?
43:54
Yeah.
43:55
Can you explain again how the depth
43:57
increases exactly like Oh, the depth
44:00
increases because you decide what the
44:01
depth is.
44:03
So, when you add a convolutional layer,
44:05
you decide how many filters it has. So,
44:07
you just keep adding more and more
44:09
filters the later on you go in the
44:11
network.
44:13
So, it's in your control. So, remember
44:14
the number of neurons in a hidden layer
44:16
is in your control, right? Similarly,
44:18
the number of filters is in your
44:19
control. It's a design choice.
44:22
And we design it so that the later we
44:24
go, the more depth we have. So, you
44:26
stack
44:28
layers, with each of those layers having
44:31
a different filter applied at the end?
44:35
Yeah, a layer is made up of filters and
44:37
so the depth just comes from having lots
44:39
and lots and lots of filters. And you
44:40
get to choose what they are.
44:44
All right. So, now let's go to the
44:46
Fashion MNIST Colab um that I did the
44:49
video walk-through on and then actually
44:51
solve it using a convolutional network.
44:56
All right, cool. So, uh at this point
44:58
I'm going to zip through some of the
44:59
stuff because you know the preliminaries
45:00
have to be done. Import all these
45:02
packages, set the random seed here.
45:05
Great. And then we will load the
45:07
MNIST data set just like I did in the
45:09
Colab yesterday. Uh we create these
45:11
little labels.
45:13
Uh and then we just have these standard
45:14
functions to plot accuracy and loss that
45:17
we've been using so far. All right. Now
45:19
we come to the convolutional thing and
45:21
so as before, we're going to um
45:24
we're going to divide it by 255 to
45:25
normalize everything to a zero to one
45:27
range. Uh let's confirm to make sure
45:29
that the data nothing has gotten
45:31
tampered with. Yep, we have 60,000
45:33
images, each one is 28 by 28 in the
45:35
training set. Now,
45:37
convolutional networks um they expect
45:40
the input to have
45:42
three channels, or they expect to have
45:44
like an additional thing which is like
45:46
a channel,
45:47
right? Uh the color images have three
45:49
channels,
45:50
but black and white images have only one
45:52
channel, right? One table of numbers.
45:54
So, instead of saying 28 by 28, we tell
45:56
the convolutional layer to expect 28
45:59
by 28 by one.
46:01
It's the same thing conceptually, but
46:03
that's the sort of the format that it
46:04
expects.
46:05
And so,
46:06
uh we go here and then we say, all
46:09
right, there's a thing called expand
46:11
dimension. I'm just telling it to expand
46:12
its dimension and once I do that, you
46:14
can see here it's still 60,000, but
46:17
instead of 28 by 28, it has become 28 by
46:19
28 by one. Same thing.
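The same step in code (a sketch, not the notebook itself):

```python
import numpy as np

x_train = np.zeros((60000, 28, 28))      # grayscale: no channel axis yet
x_train = np.expand_dims(x_train, -1)    # add the trailing channel dimension
print(x_train.shape)                     # (60000, 28, 28, 1)
```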
46:21
Okay? Now, let's define our very first
46:24
CNN.
46:25
So, all right.
46:27
As before, the input is just
46:30
keras.Input as before, no difference
46:32
here and we tell it the shape and the
46:34
shape is of course just 28 by 28 by one.
46:37
Okay? That's what I have here.
46:39
And then we come to the first
46:40
convolutional block.
46:43
So, and this is the key thing.
46:45
If you want to tell Keras to use a
46:47
convolutional layer,
46:49
you use this keyword layers.Conv2D.
46:53
And from this you can probably also
46:54
figure out that there's a Conv1D and
46:56
there's a Conv3D and so on and so forth,
46:58
which, you know, uh explore. It's really
47:00
good stuff.
47:01
But for image processing, Conv2D is all
47:04
you need. And now we tell it how many
47:06
filters you want. Okay. So, uh we decide
47:09
on the number of filters. So, I've
47:10
decided to have 32 filters. Okay? And
47:13
then we also have to decide the size
47:15
of the filter, right? The simplest size
47:18
is 2 by 2. So, I'm just going to go with
47:19
that.
47:20
Right? Kernel size is 2 by 2.
47:22
And then the activation is of course
47:23
ReLU. I give it a name, convolution one,
47:26
and then I feed it the input. And then
47:27
once I do that, I follow it up with a
47:29
little pooling layer where I use
47:31
MaxPooling2D.
47:33
And MaxPooling2D, you just literally
47:35
pass the input, you get the output back.
47:36
It just
47:37
shrinks everything using pooling.
47:39
So, that is the first convolutional
47:40
block.
47:41
And you know what?
47:43
I know how to cut and paste. Boom, cut
47:45
and paste, I get the second
47:46
convolutional block.
47:48
Okay? Here is the second convolutional
47:49
block. And I know in lecture I just
47:52
mentioned that as you go deeper, you get
47:54
more depth to it, but this is
47:56
just a starting point. I'm just going to
47:58
use the same depth. Not a big deal. It's
47:59
a simple problem. So, which is why in
48:01
the second convolutional block I'm still
48:03
using only 32.
48:04
But you can totally go to 64 for
48:06
instance to make it much deeper.
48:07
Okay?
48:08
Uh and once I do that,
48:10
I finally come to the point where I
48:12
flatten everything to a long vector,
48:14
then I connect it to one dense layer of
48:17
256 neurons.
48:19
And then finally, I come to the softmax
48:22
where I have 10 outputs, right? 10
48:23
categories of clothing, softmax, and
48:26
then I tell Keras, okay, take this input
48:27
and the output, string them up together,
48:30
define a model for me.
48:32
So, that's it. That's a convolutional
48:33
network. The new concepts we are seeing
48:35
here are Conv2D for the convolutional
48:38
layer and then MaxPooling2D for the max
48:40
pooling layer.
48:42
Okay? That's it.
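Reconstructed from the walkthrough above (a sketch of the notebook's model as described, not its verbatim code; the dense layer's ReLU activation and the second block's name are assumptions):

```python
import keras
from keras import layers

inputs = keras.Input(shape=(28, 28, 1))

# Block 1: 32 filters of size 2x2, ReLU, followed by max pooling
x = layers.Conv2D(32, kernel_size=2, activation="relu", name="convolution1")(inputs)
x = layers.MaxPooling2D()(x)

# Block 2: copy-pasted, same 32 filters (could be 64 to go deeper)
x = layers.Conv2D(32, kernel_size=2, activation="relu", name="convolution2")(x)
x = layers.MaxPooling2D()(x)

# Flatten, one dense layer, then a 10-way softmax over the clothing classes
x = layers.Flatten()(x)
x = layers.Dense(256, activation="relu")(x)
outputs = layers.Dense(10, activation="softmax")(x)

model = keras.Model(inputs, outputs)
```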
48:43
Uh
48:44
coming. So, let me just run this thing.
48:46
It runs. Okay, good. Yeah.
48:49
Uh how do you decide when to flatten and
48:52
would there ever be a situation in which
48:54
we just kind of use the method that we
48:56
used before and not use a CNN?
48:59
Well, we already tried it with MNIST,
49:00
right? We didn't use a CNN. We just
49:02
flattened right away.
49:03
>> It worked. It's not bad, but we are
49:05
like, you know, can we do better than 85
49:06
or 88 or whatever the percent was,
49:08
right? So, when we are working with
49:09
images, it's typically a good idea to
49:11
just start with a CNN right out of the
49:13
gate because you're not losing anything.
49:14
You're not giving up anything.
49:16
So, uh in terms of how many uh layers
49:19
you should have, my philosophy is start
49:20
simple and if it works, stop working on
49:23
it. If it doesn't, add more layers.
49:27
Uh yeah.
49:28
Yeah, just to uh is it the architecture
49:30
design, the number of filters, kernel
49:32
size, number of layers, convolution
49:34
pooling, is that just all based on trial
49:36
and error, or what? Yeah, so
49:37
typically it's based on trial and error,
49:39
Um to answer your question. But as you
49:41
will see in the transfer learning
49:42
discussion we're going to have soon,
49:44
you can actually, instead of doing
49:46
anything from scratch, it's much better
49:48
to just download a pre-trained model and
49:50
just adapt it for your particular
49:51
problem. That is actually the norm by
49:54
which people do these things. The reason
49:55
I'm doing it from scratch is because you
49:57
should know how it was done.
50:00
Like you it should not be a black box to
50:01
you. That's my goal.
50:03
Yeah.
50:05
Just from a notation perspective, I
50:07
noticed you named all of these layers X.
50:09
Is that a habit we should get into
50:11
naming them all the same or is that just
50:12
a
50:12
>> Actually, I'm not naming the layers as
50:15
X. What what's going on here is I'm
50:17
feeding it X.
50:19
And whatever is coming out of it, I'm
50:21
just calling it X.
50:22
That's all. It's just a notational
50:23
convenience. I'm just
50:25
calling the input and the output X, and
50:27
Keras under the hood will track
50:28
everything and make sure the right thing
50:29
happens. Otherwise, I'd have to be like
50:31
X1, X2, X3, X4 and then if I want to add
50:33
a new layer somewhere in the middle
50:35
between X3 and X4, I have to call that
50:37
X4 and then I'll change everything to 5,
50:39
6, 7. Complete pain in the neck. That's
50:41
why I do this.
50:42
All right. So, model.summary
50:46
It has 302,000 parameters. I'll
50:51
just plot it.
50:53
Great. And I encourage you to hand
50:56
calculate it later on and make sure the
50:58
numbers tally, okay?
51:00
For now, let's just go. So, as before,
51:03
we'll just use the same compilation.
51:06
We'll use Adam and then we'll train it
51:08
for, you know, just 10 epochs. We'll use
51:11
a validation split again, as usual, of
51:13
20%. So, let's just run it.
51:15
So, it's actually going to run. And as
51:17
you will see,
51:18
convolutional networks there's a lot
51:19
more going on, so it's going to be a bit
51:20
slower to run. Hopefully not too much
51:23
slower.
51:25
While it's doing, other questions?
51:31
So, if we have a task other than image
51:32
classification, say
51:34
segmentation, do we still flatten
51:35
like this first?
51:37
Yeah, so this is for image
51:39
classification. For other kinds of
51:41
applications,
51:42
typically you run it through a bunch of
51:44
convolutional layers and so on and so
51:45
forth.
51:46
But the output side of the equation gets
51:48
much more complicated because if instead
51:51
of classifying just
51:53
the whole picture into, you know, dog or
51:56
cat, if you have to take every pixel and
51:58
classify it, right? Then, well, you
52:01
better have an output shape that is the
52:03
same dimensions as the input shape.
52:06
So, for that we use a different
52:07
architecture. It's called U-Net
52:09
and so on, which unfortunately I won't
52:11
be able to get into. But I know I am
52:13
planning to post another video
52:14
walk-through where I show you how to use
52:17
the Hugging Face Hub
52:19
to very quickly build models for the
52:22
other applications like segmentation and
52:23
so on. I'm hoping to post that tomorrow.
52:26
It's an optional viewing thing that
52:27
might help with that.
52:29
Okay. So, is it done? Okay, good. It's
52:32
done. All right, let's plot the
52:35
thing here.
52:36
All right, so it seems like training is
52:38
going down nicely. Validation
52:40
is sort of flattening out somewhere here
52:42
around the eighth epoch. Let's look at
52:45
the accuracy.
52:47
Same situation here. The accuracy is in
52:48
the 90s. The final question,
52:51
of course, is how it does
52:52
on the test set.
52:55
Whoa, 90.5%.
52:58
Pretty good.
52:59
By the way, if you're not impressed that
53:00
we went from 88 to 90,
53:04
These applications are the
53:05
proverbial sort of diminishing-returns
53:07
problems, okay? So, what you should
53:09
always think of is look at the amount of
53:11
error that's left and ask yourself how
53:13
much of that error am I able to reduce?
53:16
So, we had roughly 12% error left
53:20
when we did the simple Colab yesterday.
53:22
From that 12%, we have knocked off two
53:24
points to get to over 90, which is
53:26
amazing.
53:27
Okay?
53:28
And in fact, I think the state of the
53:29
art on this
53:31
data set
53:32
is 97%.
53:34
So, I invite you
53:36
to take this thing and try different
53:39
filters and so on and so forth to see if
53:40
you can get to the the mid-90s.
53:42
It's not easy, but try it. Yeah.
53:45
Does the number of epochs have to be
53:48
related to the number of batches?
53:50
Because you did a batch size of 64 and 10 epochs? >> No,
53:52
the epochs are independent;
53:55
an epoch is just one pass
53:56
through the whole training data.
53:58
But within each pass, within each epoch,
54:01
the batch size tells you how
54:03
many batches you're going to process.
54:05
So, it is basically the number of
54:06
examples you have in your training data
54:08
divided by the batch size that you have
54:10
chosen,
54:11
right? That number rounded up is the
54:13
number of batches within each epoch.
54:16
And here I'm just choosing 10 because,
54:18
you know,
54:20
(Siri, interrupting: "I found something on the web.") Okay.
54:23
I chose 10 because it's going to be fast
54:24
for me to do in class. And 10
54:26
is actually more than enough because you
54:27
can see it's already beginning to
54:28
overfit.
54:31
Yeah.
54:33
This is more of a conceptual question,
54:35
but is it always the case that a neural
54:37
network will have better accuracy than
54:39
a classical machine learning algorithm?
54:42
I'm asking more about cases like
54:44
the heart disease problem. >> Oh, yeah,
54:45
yeah.
54:46
Great question. So, neural networks are
54:49
really good for unstructured data like
54:50
the images we have here. But if you
54:52
have structured data like the heart
54:53
disease problem, sometimes a neural network
54:55
works really well. Sometimes
54:57
things like gradient boosting, XGBoost,
54:59
work really well. So, if I am actually
55:01
working on a structured data problem,
55:03
I'll try both.
55:04
I'm not going to axiomatically assume
55:06
that the DNN is going to be the best
55:07
thing. But if you have unstructured data,
55:09
it's the best game in town.
55:11
All right. Um
55:13
I'm just going to
55:14
By the way, I have a whole section here
55:15
on once you build a model, how do you
55:16
actually improve it?
55:17
Right? Check it out. It's an optional
55:19
thing.
55:20
All right, I'm going to stop this here.
55:22
All right. So, the next thing I want to
55:23
do is
55:25
So, we went from 88 to 90 plus percent,
55:27
right? Using convolutional networks.
55:29
Now, let's work with color images. Let's
55:31
kick it up a notch.
55:33
So, um
55:34
I actually
55:36
web scraped
55:38
all these pictures for you folks, for
55:40
your enjoyment. I web scraped about 100
55:42
color images each of handbags and shoes:
55:44
roughly 100 handbags and 100
55:46
shoes. So, the question is, with these
55:48
essentially 200 images,
55:51
can we build a really good neural
55:52
network to classify handbags and shoes?
55:54
Right? It seems kind of absurd, right?
55:56
Because 200 examples, I mean, it's not
55:58
that much, right? It doesn't feel like a
55:59
lot. The Fashion MNIST data has 60,000
56:02
images.
56:04
Right? And, you know, even
56:06
with that we were overfitting in like 5,
56:07
6, 7, 8 epochs.
56:09
With 200 images, maybe, you know, is
56:10
there any hope? Obviously, there is
56:11
hope, otherwise it wouldn't be in the
56:13
lecture. So, yeah. So, we're going to
56:15
take this data set and let's see what we
56:16
can do with it. So, we'll first actually
56:18
build a convolutional network from
56:19
scratch to solve this problem. Okay?
56:22
All right.
56:24
I'm actually going to run through the
56:25
code because at the end of it we'll have
56:27
a live demo. So, I would like one
56:29
volunteer to give me a handbag and one
56:31
volunteer to give me their footwear.
56:34
Boy, in class.
56:37
Okay. So, all right. Unlike the previous
56:40
data set, this one I web
56:42
scraped myself, and
56:44
I've stuck it in this Dropbox
56:46
folder.
56:47
Let's just download it and unzip it. And
56:49
once we do that, we have to now organize
56:51
it with these 200 images. So,
56:54
I have to do some
56:57
boring-ish Python housekeeping here.
57:00
So, here what we're doing is that we
57:02
have 100 handbags, roughly 100 shoes.
57:04
And what this code is doing is it's
57:06
actually creating a directory structure:
57:08
it's splitting stuff into train and
57:10
validation and test. And then for each
57:12
of the splits it's doing the handbags
57:13
and the shoes folder. Okay? So, once we
57:16
do that, basically this directory
57:18
structure is created.
57:20
Okay? Training, validation folder, test
57:23
folder, handbags and shoes. In fact,
57:25
I think you can see it
57:26
here.
57:27
See here, handbags and shoes. And within
57:29
that, there is, you know, train, test,
57:31
validation. And within each of these,
57:33
there's handbags and shoes. So, the idea
57:34
is that when you're working with images,
57:36
right? What you can do is you can just
57:37
create folders for each kind of image,
57:40
right? Let's say dogs, cats,
57:42
two folders with cat images and dog
57:43
images and then just point Keras at it.
57:46
It'll automatically figure out those are
57:47
the labels.
57:49
It makes it easy for you. So, it's very
57:50
convenient when you're working with
57:51
images.
57:52
And the book explains this thing in
57:53
great detail.
57:55
All right. So, when working with these
57:56
images, color images, we'll follow this
57:58
process. We'll read in the JPEGs. We'll
58:00
convert them to tensors. And then since
58:02
I'm web scraping it, they all come in
58:03
different shapes and sizes. So, I need
58:05
to like bring it all to the same size.
58:06
Okay? I resize it and then I'm going to
58:08
batch it into whatever. I'm going to
58:10
batch it using a batch size of 32 here.
58:13
So, and this utility from Keras will do
58:16
all that for you, right? Very quickly.
58:19
So, basically what it says is that it
58:20
found 98 images in the
58:23
training data belonging to two classes,
58:25
49 in the validation and 38 in the test.
58:28
So, less than 100 examples in the
58:29
training set. That's what we have here.
58:31
All right. What's the time? 9:30. Okay.
58:33
So, all right. Now, let us check the
58:35
dimensions to make sure. Good. So, 224 by
58:38
224 by 3. And why did I pick
58:40
224 by 224? As you will see later, we're
58:43
going to use something called ResNet,
58:45
and ResNet expects the input to be 224 by
58:47
224 by 3. That's why I resized everything to
58:49
224 by 224. Let's look at a few examples of my
58:52
wonderful web scraping in action.
59:01
It's pretty wild, right?
59:02
Okay. Now, let's do a
59:04
simple convolutional network.
59:07
Before, we took all the X
59:09
values in Fashion MNIST and divided them
59:10
manually by 255 to normalize them to [0, 1].
59:13
Well, you know what? We are actually
59:14
graduating to the higher levels of Keras
59:16
now. So, let's not do that, right?
59:17
Manual stuff is bad. So, we'll do it
59:19
within Keras by using something called
59:21
the rescaling layer where we just tell
59:22
it how much to rescale and boom, it'll
59:24
do it for you. The first convolution
59:26
block, just like the Fashion MNIST 32,
59:28
second block, again 32, max pool,
59:31
flatten. And then here we only have
59:33
handbags versus shoes, so just a sigmoid
59:35
is enough, right? It's just a binary
59:36
classification problem. So, I'm just
59:38
using one output layer with a sigmoid,
59:40
and that's our model. So, let's do the
59:42
model.
59:43
All right, model summary.
59:48
About 103,000 parameters in this little
59:52
model. Okay, let's compile it and run
59:54
it. Uh, and note here because it's a
59:56
binary
59:57
classification problem, I'm using binary
59:59
cross entropy.
1:00:02
Same Adam.
1:00:03
And accuracy, compile, and then boom,
1:00:05
let's run it. We'll run it for 20
1:00:07
epochs.
1:00:08
Hopefully.
1:00:12
Okay, while it's doing this business,
1:00:13
I'm going to shift to the PowerPoint.
1:00:17
So, we'll go back to see how well it
1:00:19
did. But whatever
1:00:21
it did, we built it from scratch. So
1:00:23
the question is, can we do better than
1:00:23
that? Okay? Because we only have 100
1:00:26
examples of each class, and which brings
1:00:28
us to something very cool and very
1:00:29
powerful called transfer learning. And
1:00:31
the idea, so the key thing is there are
1:00:33
two research trends that are going on
1:00:34
that we take advantage of. The first one
1:00:36
is that researchers have defined, you
1:00:38
know, designed architectures which
1:00:40
exploit the kind of input you have. So,
1:00:42
Olivia asked the question, if you have a
1:00:43
particular kind of input images, do you
1:00:45
actually change the input, or do you
1:00:47
actually change the network? As it turns
1:00:49
out, here, for example, if it's images,
1:00:50
we know that we should use convolutional
1:00:52
layers because convolutional layers were
1:00:53
designed to exploit the image-ness of
1:00:55
the input.
1:00:57
Okay? Similarly, if you have sequences
1:00:59
of information, like obviously natural
1:01:01
language, audio, video, gene sequences,
1:01:03
and so on, so forth, these things called
1:01:05
transformers were invented
1:01:07
to exploit them, and we're going to
1:01:08
spend a lot of time on transformers
1:01:09
starting next week. So, that's the first
1:01:11
trend. The second trend is that
1:01:13
researchers have used these innovations
1:01:15
to actually create and train models on
1:01:19
vast data sets, and thankfully, they've
1:01:21
made them publicly available for us to
1:01:23
use. So, transfer learning is the idea
1:01:26
that if you have a particular problem,
1:01:28
let's just take a pre-trained network
1:01:30
somebody may have already created,
1:01:32
and then let's just customize it to our
1:01:33
problem, rather than actually build
1:01:35
anything from scratch.
1:01:37
Okay, that's the basic idea. So,
1:01:39
so here we have this basically we have
1:01:41
to build a classifier which takes in an
1:01:43
arbitrary image and figures out if it's
1:01:45
a handbag or a shoe, right? That's our
1:01:46
goal.
1:01:47
And so, now handbags and shoes are
1:01:49
everyday objects, and so what you can do
1:01:51
is look around and see
1:01:53
if there are any networks that have been
1:01:55
trained by other people which actually
1:01:57
have been trained on everyday images.
1:02:00
Right? As opposed to like MRI or X-rays,
1:02:02
right? Specialized images, everyday
1:02:04
images. Of course, the first thing you
1:02:05
should probably do is to see if anybody
1:02:07
has built the specific thing you want,
1:02:08
handbag shoes classifier on GitHub.
1:02:10
Assuming it's not, then you do transfer
1:02:12
learning. Okay? So, now it turns out
1:02:15
that there's this thing called ImageNet,
1:02:17
which is a database of millions of
1:02:19
images of everyday objects in a thousand
1:02:22
different categories, furniture,
1:02:24
animals, automobiles, you get the idea.
1:02:26
Okay? And so, we can look for the
1:02:28
networks that have been trained on
1:02:29
ImageNet.
1:02:31
Okay, let me just go back to the Colab
1:02:33
just to make sure it doesn't time out.
1:02:37
All right, so it has finished doing it.
1:02:40
Um, let's just plot these things.
1:02:48
Okay, so
1:02:49
uh, there is some overfitting that
1:02:51
happens around here
1:02:52
around the 10th epoch. Let's
1:02:55
look at the
1:02:59
So, the training accuracy is
1:03:01
actually getting almost to 100%. But
1:03:03
we're not interested in training
1:03:04
accuracy, right? We care about
1:03:06
validation and test accuracy, and that
1:03:08
seems to be kind of hovering around in
1:03:10
the 80s. Um, so let's just evaluate it
1:03:13
anyway to see what happens.
1:03:15
Okay, so it gets to 87% accuracy
1:03:19
on this data set.
1:03:20
It's actually pretty good given that we
1:03:22
only have 100 examples. So, 87%
1:03:24
accuracy, and we trained the whole
1:03:26
thing from
1:03:28
scratch. Okay? Now, then
1:03:31
Now, there's this whole section
1:03:32
about data augmentation. You
1:03:35
know what? Do we have time?
1:03:38
So,
1:03:40
so the idea of augmentation is that when
1:03:42
you have an image,
1:03:44
let's say you take this image, and you
1:03:45
just rotate it slightly by 10°.
1:03:49
If it's a handbag before you rotated it,
1:03:51
it sure as hell is a handbag after you
1:03:52
rotated it.
1:03:54
Right?
1:03:55
The meaning of the
1:03:56
image doesn't change just because you
1:03:57
rotated it slightly. Or maybe you zoom
1:04:00
in slightly, you zoom out slightly, you
1:04:01
crop it slightly, nothing happens.
1:04:03
So, what you can do is you can take any
1:04:05
image you have, and you just perturb it
1:04:07
slightly,
1:04:08
like right there, and then add it as a
1:04:10
new example to your training data.
1:04:14
This is an unbelievable free lunch,
1:04:15
frankly.
1:04:16
And the same thing actually, same kinds
1:04:19
of techniques actually work for text
1:04:20
also, which we'll cover later on.
1:04:22
Right? This broad area is called data
1:04:24
augmentation.
1:04:26
It's a great way when you don't have a
1:04:27
lot of data to artificially bolster the
1:04:30
amount of data you have.
1:04:31
Okay?
1:04:32
Um, and so, and of course, Keras makes
1:04:34
it very easy for you to do all these
1:04:36
things. It has already predefined a
1:04:38
whole bunch of data augmentation layers
1:04:40
for you. So, here's a little example
1:04:43
where I basically take a picture and
1:04:45
then I randomly flip it. So, if it looks
1:04:47
like this, I flip it this way,
1:04:48
horizontal. Okay? Uh, and then I
1:04:50
randomly rotate it by a factor of 0.1. (Per the
1:04:53
Keras documentation, that factor is a fraction of a full
1:04:55
circle, so 0.1 means up to ±36°.) And then random zoom,
1:04:57
right? Zoom in and out a little bit. Uh,
1:05:00
but it won't do this for every picture.
1:05:02
It will only do it randomly. Okay? So,
1:05:04
that only some pictures will get
1:05:06
perturbed in some ways. And that's how
1:05:07
you make sure there's enough diversity
1:05:09
of pictures that you have.
1:05:10
So, once you do that,
1:05:12
you can actually take a picture and see
1:05:13
what it does.
1:05:15
I just randomly grab a picture, so it
1:05:17
keeps changing every time.
1:05:21
Yeah, look at this handbag.
1:05:22
Handbag slightly rotated this way,
1:05:24
rotated that way.
1:05:26
Some more. Maybe a little bit of zooming
1:05:28
going on, and so on. You get the idea,
1:05:30
right? And there's a whole list of these
1:05:31
things you can do. But when you do those
1:05:33
things, make sure
1:05:35
that what you're doing doesn't actually
1:05:37
change the underlying meaning of the
1:05:38
picture.
1:05:39
It's really important.
1:05:41
Okay? So, for example, if you're working
1:05:43
with satellite data,
1:05:45
yes, be very careful not to do
1:05:47
crazy flips.
1:05:49
Right? Or even if you're working with
1:05:50
everyday images, horizontal flips are
1:05:51
okay. Don't do vertical flips.
1:05:54
Right? How many times will you have an
1:05:55
upside-down dog picture that you need to
1:05:57
classify?
1:05:59
Make sure your augmentation doesn't go
1:06:00
nuts.
1:06:02
All right.
1:06:05
Once you do that, you can actually just
1:06:07
insert the data augmentation layers in
1:06:09
your model right there, right after the
1:06:11
input. The rest of it can stay
1:06:12
unchanged.
1:06:14
So, this is a great way to increase the
1:06:15
size of your training data, and here is
1:06:17
a model, and then I invite you to
1:06:19
actually just play with it and
1:06:21
train it. In the interest
1:06:23
of time, we won't actually train this
1:06:23
model, but it's in the Colab; you can
1:06:24
just try it. It also figures prominently
1:06:27
in homework one, by the way, data
1:06:28
augmentation. So, you'll get more
1:06:30
experience with this. Okay. So, uh, back
1:06:32
to the PPT.
1:06:34
So, this is what we have. Um, and so,
1:06:37
any network that has been trained on
1:06:38
this ImageNet thing, it turns out,
1:06:41
learns all kinds of interesting features
1:06:42
in every one of its layers. So, here
1:06:44
this is the first layer, and you can see
1:06:46
it's picking up sort of gradations of
1:06:48
color, sort of line-ish kind of
1:06:49
behavior. Layer two, um, it's actually
1:06:52
picking up Hey, look, it's picking up an
1:06:54
edge. Can you see that edge?
1:06:56
Right? Like like that.
1:06:59
And then layer three is picking up these
1:07:01
interesting honeycomb shapes, uh, and so
1:07:04
on. Oh, this one is
1:07:05
already picking up the
1:07:07
shape of a human torso.
1:07:12
Yeah, this layer is actually picking up
1:07:13
what looks like a Labrador retriever.
1:07:16
Okay.
1:07:17
Isn't that cute?
1:07:19
Come on, even if you're not a dog
1:07:20
person.
1:07:22
All right. So, this is the
1:07:24
visualization I was referring to
1:07:25
earlier,
1:07:26
um, to figure out what are these
1:07:28
networks actually learning.
1:07:30
This paper was one of the first ones to
1:07:31
actually visualize what's going on
1:07:32
inside. So, if you folks are curious how
1:07:34
these pictures are actually produced, I
1:07:36
would encourage you to check this out.
1:07:38
Okay, yep.
1:07:40
So, we spoke about images, and you
1:07:42
referred to classes,
1:07:44
and to
1:07:46
text
1:07:47
next week with transformers. But
1:07:49
what about, say, an email which has both
1:07:52
text and images, and maybe white
1:07:54
space depending on who has written it?
1:07:56
Does that get put in as an input
1:07:58
as an image, or...
1:08:01
So, we'll revisit this great question a
1:08:03
bit later on in the course.
1:08:04
So, the answer is a bit complicated, and
1:08:06
I want to do it justice,
1:08:07
so we'll come back to it.
1:08:09
All right, so
1:08:10
so it turns out this thing called ResNet
1:08:12
is a family of networks which
1:08:14
were trained on this ImageNet data set,
1:08:16
and they did really well in the
1:08:18
competition associated with the
1:08:19
ImageNet data set, the ILSVRC challenge. And
1:08:21
so, this is an example of such a
1:08:22
network. So, we would expect the
1:08:24
weights and the parameters of ResNet,
1:08:27
given that it's been trained on
1:08:28
ImageNet, to sort of have some knowledge
1:08:30
about lines and shapes and curves and
1:08:32
things like that. So, maybe we can just
1:08:34
use that, right?
1:08:37
But the thing is, we can't use ResNet as
1:08:39
is because remember, it was trained to
1:08:40
classify an incoming image into a
1:08:42
thousand possibilities.
1:08:44
Here we only have two possibilities,
1:08:45
handbags and shoes. So, what we do is
1:08:47
very simple and elegant. We do just a
1:08:50
little bit of surgery.
1:08:51
We take ResNet and stop just before the
1:08:54
final layer. So, take my word for it,
1:08:57
this thing here, what it says is "fully
1:08:59
connected, 1000."
1:09:01
Because it's a thousand-way classifier, right?
1:09:02
A thousand objects. So, what we do is we
1:09:04
take everything else, and we stop
1:09:06
just before that last layer.
1:09:08
And then what comes out of that layer,
1:09:10
hopefully, will be like a very smart
1:09:11
representation of the images that it has
1:09:13
been trained on.
1:09:14
And so, what we do is we can think of
1:09:16
sort of headless ResNet
1:09:19
as our model.
1:09:21
And we can take all our
1:09:23
data and run it through ResNet up to but
1:09:26
not including the last layer.
1:09:28
Okay, you get some tensor and that
1:09:30
tensor is probably like a very has a
1:09:31
very rich understanding of what's going
1:09:33
on in that image, all the objects and
1:09:35
features and things like that. And then
1:09:36
we can just simply connect that, which we can
1:09:40
think of as a smart
1:09:42
representation of the input. We can
1:09:42
connect it to just a little hidden layer
1:09:44
and then we have a little sigmoid which
1:09:46
then tells you handbag or shoe. We can
1:09:47
just run this network.
1:09:50
Okay? Um, and since the inputs to the
1:09:53
hidden layer now are not raw images
1:09:54
anymore, but this much higher level of
1:09:57
abstraction that ResNet has learned,
1:09:59
hopefully it can get the job done with
1:10:00
hardly any examples.
1:10:02
Okay? And now you can get fancier.
1:10:04
That's the basic idea, but you can get
1:10:05
much fancier. You can connect up
1:10:07
headless ResNet directly with our little
1:10:09
network with a hidden layer and the
1:10:10
final thing and the whole thing can be
1:10:12
trained.
1:10:14
End to end. Uh but when you do that you
1:10:16
must start the training with the weights
1:10:18
that you downloaded with ResNet because
1:10:20
that is the crown jewel that's been
1:10:21
learned so you want to start from there.
1:10:23
Uh and you will do this in homework one.
1:10:26
Okay? All right. Uh by the way, these
1:10:28
pre-trained models are available all
1:10:29
over the internet. There is the
1:10:30
TensorFlow hub, the PyTorch hub and then
1:10:32
there's the Hugging Face hub. When I
1:10:34
checked it on the 13th yesterday, it had
1:10:36
over half a million models available
1:10:39
for download. Half a million.
1:10:41
I think last year it was like 50,000
1:10:42
when I taught the course. Uh so yes.
1:10:46
I was just wondering, doesn't this make
1:10:49
your neural network susceptible to
1:10:50
adversarial attacks because the weights
1:10:52
have been
1:10:53
pre-trained elsewhere? >> Yes, there is
1:10:55
some adversarial risk. I'm happy to talk
1:10:57
about it offline.
1:10:59
All right. So that's what we have. So
1:11:01
back to Colab. Okay. So that's what we
1:11:03
have. This is ResNet. So what we do is
1:11:06
and ResNet is all packaged up. It's
1:11:07
available for download. So we download
1:11:09
it here.
1:11:13
And you see here that I'm saying use
1:11:16
include top equals false.
1:11:19
So basically you are telling Keras
1:11:21
uh the top the very final layer of the
1:11:23
thing, don't give it to me. Just give me
1:11:25
everything up to but not including that.
1:11:27
And of course I think of it as left to
1:11:28
right. People think of it as bottom to
1:11:30
top. So for them, it's the very top
1:11:32
layer: don't give it to me. You're
1:11:34
telling it so that you don't have to
1:11:35
manually go and remove it.
1:11:37
Okay? And then I'll
1:11:39
just summarize
1:11:40
some of it, just to show you how big it is.
1:11:44
Okay?
1:11:45
23 million parameters.
1:11:48
ResNet. Okay? And I won't plot it
1:11:50
because then I'll be scrolling for 5
1:11:52
minutes. Uh
1:11:53
so let's just do this now. So what we're
1:11:55
now going to do is we're going to run
1:11:56
all the data through this thing and
1:11:58
whatever comes out at that penultimate
1:11:59
layer, I'm going to just grab it and
1:12:00
store it. So that's what this thing
1:12:02
does.
1:12:04
All right. And now we create a
1:12:07
handy little function to do all these
1:12:08
things.
1:12:09
And once I do that,
1:12:11
uh every image has been sent through
1:12:12
ResNet up to but not including the final layer, and
1:12:15
whatever would have gone into that final
1:12:16
layer, we're storing it. And then we're
1:12:18
going to build a simple network
1:12:19
and feed it only that
1:12:21
stored information.
1:12:23
Okay?
1:12:24
So what is coming out of ResNet, you can
1:12:26
see here 98 examples in the training
1:12:28
data and each example is now a 7 by 7 by
1:12:31
2048 tensor.
1:12:33
That's what came out of ResNet and you
1:12:35
saw that's what I did there.
1:12:37
Okay?
1:12:37
All right. So that's what it looks like.
1:12:39
Now let's just create our actual model
1:12:41
now. Right? We have our input which is
1:12:43
just a 7 by 7 by 2048.
1:12:46
We flatten it immediately.
1:12:48
Then we run it through a dense layer
1:12:50
with 256 ReLU neurons and then we use
1:12:52
dropout which I haven't talked about yet
1:12:54
which I will talk about early next week.
1:12:56
Uh but I will come back to it. Don't
1:12:58
worry about this detail for the moment.
1:13:00
Uh and then we just run through a
1:13:01
sigmoid.
1:13:03
Okay? And that's our model.
1:13:05
Finished. Plot the model. This is what
1:13:08
we have. Okay? Model summary.
1:13:13
That's it so far. All right, good. Now
1:13:15
let's actually train this thing.
1:13:18
I'm just going to run it for 10 epochs
1:13:20
because I tried running it uh previously
1:13:22
and it seems to do a fine job in just an
1:13:24
epoch. Okay, it's already done. It's so
1:13:26
fast because we ran everything through
1:13:28
this monster ResNet thing and basically
1:13:31
took all the output values and used them
1:13:33
as a starting point. Right? We don't
1:13:34
have to run it every single time. So you
1:13:36
can see here the accuracy is
1:13:40
quite high.
1:13:44
Wow, interesting. So on the 10th epoch
1:13:45
something bad happened.
1:13:48
So maybe I should have stopped at the
1:13:49
ninth epoch. I didn't see this yesterday
1:13:51
when I was running. So much for random
1:13:53
reproducibility. Uh
1:13:55
So let's just run this. Oh wow, look. On
1:13:57
the test set it's achieving 100%
1:13:58
accuracy.
1:14:02
It's unbelievable. Okay folks, now for
1:14:04
the moment of truth. Um all right, I
1:14:06
have a little code snippet here to
1:14:08
capture stuff from the webcam.
1:14:10
Because that last epoch it went down,
1:14:12
I'm a little worried that the demo is
1:14:13
going to flunk.
1:14:14
But you know what? We all have to live
1:14:16
dangerously. So
1:14:18
So here's a little function to predict
1:14:20
what's going to happen.
1:14:21
Okay. Now I tried it at home yesterday
1:14:23
by the way.
1:14:24
I did, and it's like, "Yay, it's a
1:14:26
handbag."
1:14:27
So okay. Now let's just do something
1:14:29
else.
1:14:30
Okay. Any volunteers?
1:14:32
I want a a piece of footwear
1:14:34
or a handbag.
1:14:37
It's like a backpack, right?
1:14:39
I don't know. It feels like an
1:14:40
adversarial example, but yeah, let's
1:14:42
just try it.
1:14:43
Okay.
1:14:45
No disrespect. Let me go
1:14:47
with the shoe first. I have a better
1:14:48
chance of it working.
1:14:50
So
1:14:51
it's a pretty big shoe. If it can't get
1:14:53
this shoe, I'm worried about this model.
1:14:55
All right. So
1:15:05
Okay. Hold on. Hold on. Hold on.
1:15:07
All right.
1:15:10
Please don't get distracted by my hand.
1:15:14
Capture.
1:15:16
It's a shoe! Look at that.
1:15:21
Phew. All right. Thanks.
1:15:25
Okay. Now let's try that. I'm feeling
1:15:26
kind of brave now.
1:15:28
Thank you. All right. Let's do this.
1:15:32
All right.
1:15:34
Camera capture.
1:15:40
Okay.
1:15:44
Let me put its better side forward.
1:15:54
It's a handbag! Look at that.
1:15:59
I swear every time I do the demo I age a
1:16:01
few years. So
1:16:03
All right folks, I'm done. Thank you.
— end of transcript —