4: Deep Learning for Computer Vision – Transfer Learning and Fine-Tuning; Intro to HuggingFace
MIT OpenCourseWare · May 11, 2026
Transcript
0:16
Right folks, good morning.
0:19
Welcome back. I hope you all had a nice
0:21
weekend.
0:22
Uh, and I hope you had a chance to watch
0:24
the video walk-through I posted
0:26
yesterday. Um, it's going to save us
0:28
some time today. So, let's get right in.
0:31
Today is going to be super packed. Um,
0:33
you're going to go from not knowing
0:35
anything about convolutions perhaps for
0:36
some of you to actually knowing how
0:38
convolutional networks work, and actually
0:39
to build one and demo it in class, okay?
0:42
And uh, this demo has actually worked
0:44
pretty well for the last few years that
0:45
I've taught the class, but you never
0:47
know because it's a live demo, it may
0:48
not work. We'll see.
0:50
Um,
0:51
Valentine's Day gods, may they
0:53
be with us.
0:54
Okay, so let's get going. So, Fashion
0:56
MNIST we saw previously, um, i.e. as in,
1:00
you know, in the video
1:01
walk-through, that a neural
1:03
network with a single hidden
1:05
layer can get us to an accuracy in
1:08
the high 80s, okay? Uh, and that
1:11
network actually didn't know that
1:14
what was coming in was an image, right?
1:16
It literally took this table of numbers
1:18
and just took each row and then
1:19
concatenated all the rows into one giant
1:21
long vector and then sent it in. So, the
1:23
neural network did not exploit the fact that
1:25
the input data was sort of known to be
1:27
of a certain type, okay? Which is the
1:30
clue for how we can do better.
1:32
Right? So, let's just spend a few
1:35
minutes on what it is about images
1:38
that we have to really pay attention to,
1:40
okay? As opposed to any arbitrary vector
1:42
of numbers that's coming in.
1:44
Okay? So, when we flatten the image into
1:47
a long vector and feed it into a dense
1:49
layer,
1:50
several undesirable things can actually
1:52
happen.
1:55
What are some of them? Any guesses?
2:00
Uh, yeah.
2:02
I think you lose the proximity of one
2:04
pixel to other ones that would be around
2:06
it.
2:07
Right. So, if you take a particular
2:08
pixel, then let's say that the picture
2:11
shows a t-shirt, um, if there's a little
2:13
pixel in the center of the t-shirt,
2:15
knowing that the surrounding pixels are
2:17
related to the pixel in a way because
2:19
they are all part of this concept called
2:21
a t-shirt, would certainly be helpful,
2:23
right? So, to put it more
2:25
technically, spatial adjacency
2:28
information is very important. And we
2:30
need to somehow take that into account.
2:32
Okay? Um, all right. What else? What
2:34
else might be going on here?
2:38
Uh,
2:40
Yeah,
2:41
you might have some metadata about the image,
2:43
like its resolution.
2:46
Oh, I see. So, if you actually had
2:47
structured data about the image such as,
2:50
you know, various characteristics, that
2:51
might be helpful. True. Now, but let's
2:54
just focus on the case where you only
2:55
have the raw image and nothing else.
2:57
And under that constraint, what else
3:00
might go wrong?
3:02
Or what else might be suboptimal?
3:08
Okay. Well, the first thing that might
3:10
happen is that
3:12
we may have too many parameters.
3:15
So, let's take an example. These numbers,
3:17
you know, are from my
3:18
older iPhone. Uh, I noticed that
3:21
when I take a color picture with my
3:22
phone, it's roughly a 3,000 by 3,000
3:27
grid, right? So, the picture is actually
3:30
3,024 pixels on this axis, 3,024 on that
3:34
axis, okay? So, that gets us to roughly
3:37
9 million pixels, but remember there's a
3:40
color picture, which means there are
3:41
three channels,
3:43
which means there are 27 million
3:45
numbers,
3:46
each of which is between 0 and 255 from
3:49
that little picture, okay? And now let's
3:51
say we connect it to a single
3:54
100 neuron dense layer.
3:57
A single 100 neuron dense layer. How
3:59
many parameters are we going to have?
4:00
Just in that one little part of the
4:01
network.
4:07
Could the mumbling be louder?
4:10
Yes, roughly 2.7 billion because 27
4:13
million parameters times 100,
4:15
right? Roughly, of course. Forget about
4:17
the biases for a moment, right? It's 2.7
4:19
billion.
4:21
2.7 billion parameters,
4:23
right? Do you think we can actually get
4:25
2.7 billion images to train any of these
4:27
things?
4:29
So, then you're going to overfit.
4:32
Right? Too many parameters. We have to
4:33
be smarter about this.
4:35
It's not going to work.
4:36
Right? That's the first problem.
4:39
So, this clearly is computationally
4:41
demanding, very data hungry, and
4:43
increases the risk of overfitting.
4:45
Okay?
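(As a quick back-of-the-envelope check of that parameter count, here is a plain Python sketch using the rough numbers from the lecture, biases ignored:)

```python
pixels = 3024 * 3024 * 3    # ~27 million input numbers (3 color channels)
neurons = 100               # a single 100-neuron dense layer
weights = pixels * neurons  # one weight per input number, per neuron
print(f"{weights:,}")       # 2,743,372,800 -- roughly 2.7 billion
```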
4:46
Next,
4:49
we lose spatial adjacency.
4:51
Right? We literally are ignoring what's
4:52
nearby.
4:55
So, that's a huge huge factor. There's a
4:57
third factor,
4:58
right? That we have to worry about,
5:01
which is that
5:02
let's say that, you know, the picture
5:04
has a vertical line
5:06
on the on the top left side and it has
5:08
some other vertical line on the bottom
5:09
right side.
5:12
What this sort of dumb approach is going
5:14
to do
5:15
is it's going to learn to
5:16
detect that vertical line on the top
5:18
left and it's going to independent of
5:20
that, it's going to learn to detect the
5:21
vertical line on the bottom right.
5:24
Okay? Which doesn't make any sense. A
5:26
vertical line is a vertical
5:27
line. So, you want to be able to detect
5:29
it wherever it happens.
5:31
Detect once, reuse everywhere.
5:33
That's what you need to do.
5:35
So, this, by the way, is called
5:36
translation invariance.
5:38
Translation is math speak for move stuff
5:40
around.
5:41
Right? You take a line and it moves
5:42
around,
5:43
it doesn't matter, it's still a line.
5:45
Let's figure it out.
5:47
So, these are the three things we
5:48
need to worry about. So, we want to
5:50
learn once and use all over the place.
5:53
We want to take spatial adjacency into
5:55
account, number two. And number three,
5:56
let's just find a way to make sure that
5:58
we don't have billions of parameters for
5:59
simple toy problems.
6:02
Any questions?
6:05
Yep.
6:07
Um, is this a problem
6:09
just because we are compressing the
6:11
image or would it have happened anyway?
6:14
It would have happened So, the question
6:15
was is it a problem because we are
6:16
compressing the image uh, or would it
6:18
would it have happened anyway? The
6:19
answer is it would have happened anyway.
6:20
You can take any picture, this is going
6:22
to happen, right? Because I'm not making
6:24
any assumptions about how the image is
6:26
coming in to me,
6:27
whether it's compressed or not and so on
6:28
and so forth.
6:31
Okay. All right.
6:33
So, convolutional layers
6:36
were developed to precisely address
6:38
these shortcomings, and they're an amazing
6:40
solution, as you will see. Very elegant.
6:45
All right.
6:45
So, the next, I don't know, half an hour
6:49
is going to be me defining a whole bunch
6:51
of stuff
6:52
before we actually get to the fun
6:53
Colabs and so on and so forth.
6:55
Um, so just to put it in perspective, I
6:57
have a PowerPoint,
6:59
two Colabs,
7:01
and an Excel spreadsheet, and maybe even
7:03
a Notability file to cover today.
7:06
Okay? So, but hang on for the next 30
7:08
minutes because it's going to be a
7:09
little concept heavy
7:10
before we get to the fun stuff. So, stop
7:12
me, ask me questions because we do have
7:14
time.
7:15
All right. A convolutional layer is made
7:17
up of something called a convolutional
7:18
filter.
7:20
Okay? That's the atomic building block.
7:22
A convolutional filter is nothing but
7:24
a small matrix of numbers like this.
7:28
It's just a small square matrix of
7:29
numbers. That's a convolutional filter,
7:31
okay? Now,
7:33
a layer is just composed of one or more
7:35
of these filters.
7:38
All right?
7:39
Filters and layers.
7:41
Now,
7:42
the thing about the convolutional filter
7:44
that makes it really magical
7:46
is that if you choose the numbers in a
7:48
filter carefully
7:50
and then you apply the filter to an
7:52
image, and I'll get to what I mean by
7:53
applying the filter,
7:56
if you choose the numbers carefully and
7:57
you apply it to that image,
7:59
this little humble thing has the ability
8:02
to detect features in your image.
8:04
It can detect lines, curves, gradations
8:07
in color, circles, things like that,
8:09
okay? It's pretty cool.
8:11
And so,
8:12
I'm going to claim and I'm going to
8:14
prove shortly that this little humble
8:15
filter with the ones and zeros, it can
8:17
detect horizontal lines in any picture
8:19
you give it.
8:21
Okay?
8:22
This thing here has the
8:23
ability to detect vertical lines.
8:27
All right? So, I will demonstrate how
8:28
this thing actually detects all these
8:30
things and then we will ask the big
8:33
question that's probably in your minds
8:34
already, where are we going to get these
8:35
numbers from?
8:37
That all sounds great, Rama. Where are
8:39
we going to get the numbers from? Okay?
8:41
And we have a beautiful answer to that
8:42
question.
8:43
All right. So, let's go. Um, now I'm
8:46
going to first explain to you what I
8:47
mean by applying a filter to an image
8:50
and then I'm going to give you examples
8:52
of how the filter works for detecting
8:54
vertical and horizontal lines. So, all
8:56
right. So, let's say that this is the
8:58
image we have.
9:00
Okay? Again, an image. Assume it's a
9:02
grayscale image. So, you just have a
9:04
bunch of numbers between 0 and 255,
9:06
okay? So, this is the image
9:07
we have. It's a little tiny image.
9:09
And this is the filter that's been
9:10
magically given to us by somebody.
9:13
And what we are trying to do now is to
9:14
apply it, okay? So, what we do is that
9:17
we literally take this filter,
9:19
the little one, and then we superimpose
9:22
it on the top left part of the image.
9:24
So, you have the image here, you take
9:26
this little filter, and then you move it
9:28
to the top left so that they are sort of
9:30
right on top of each other.
9:32
Okay?
9:33
Once you have it right on top of each
9:34
other,
9:35
you have these matching numbers. You
9:37
have nine numbers in the image patch, there
9:39
are nine numbers in the filter, and
9:41
they're all matching each other right on
9:42
top of each other, right? So, you have
9:44
nine pairs of numbers.
9:46
And then what we do, once we overlay it,
9:48
we literally just multiply all the
9:50
matching numbers and add them up.
9:53
Okay? You just multiply all the matching
9:55
numbers and add them up, and you can confirm
9:57
later on that, you know, the
9:58
arithmetic I'm doing is actually
9:59
accurate. Okay?
10:01
And once you do that you'll get some
10:03
number.
10:04
Right?
10:05
Um
10:06
once you get that number
10:09
what we do is we go to our good old
10:11
friend the ReLU
10:12
and then we just run it through a ReLU.
10:15
Now, in this case all that effort comes
10:16
to nothing because it's zero. That's okay.
10:19
Okay? So, zero and this number becomes
10:22
the top left cell of your output.
10:26
So, this is called the convolution
10:28
operation.
10:29
Okay?
10:30
And we won't get into why it's called
10:31
that and so on and so forth. There's a
10:32
long and rich and storied history of
10:34
these things.
10:35
But this is the convolution operation.
10:38
And once we do that you sort of can now
10:40
predict what's going to happen, right?
10:42
We take the same exact operation and we
10:44
just move it to the right.
10:46
We move this little 3 by 3 thing to the
10:48
right and repeat the exact same process.
10:51
Matching numbers
10:53
uh, you know, multiply all
10:54
the matching numbers together, add them
10:55
up, run them through a ReLU.
10:58
Okay?
10:59
And then boom, you get the
11:01
second number here.
11:03
And you keep doing that till you reach
11:05
the very end. You fill up all these
11:07
numbers, then you come to
11:08
the top of the second row.
11:11
Okay?
11:12
And you keep on doing that till you
11:14
reach the very bottom.
11:16
So, this is what I mean when I say apply
11:18
a filter to an image.
11:21
Okay?
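(A minimal NumPy sketch of that sliding multiply-add-ReLU operation; illustrative code of my own, not from the lecture materials:)

```python
import numpy as np

def apply_filter(image, kernel):
    # Slide the kernel over the image: at each position, multiply the
    # matching numbers, add them up, and run the sum through a ReLU.
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = image[r:r+kh, c:c+kw]
            out[r, c] = max(0.0, np.sum(patch * kernel))  # multiply, add, ReLU
    return out
```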
11:22
Any questions?
11:25
Okay.
11:29
Microphone, please.
11:31
Microphone.
11:35
What happens when you reach the edge of
11:38
the image and the filter doesn't
11:42
perfectly match the remaining pixels?
11:44
Yeah, so you start from the left and
11:46
then you keep on going. At some point
11:47
the right edge of the filter is going to
11:49
match the right edge of the image and
11:51
then you stop.
11:52
Yeah. Now, there are some nuances here.
11:55
So, for example, you can actually pad
11:58
the whole image
11:59
on its borders so that you can actually
12:01
go outside the image and it'll still
12:03
work.
12:04
Okay? Number one. Number two, nuance.
12:08
Instead of just moving one step to the
12:10
right every time you finish, you can
12:11
move two steps to the right.
12:13
Right? And that's something called a
12:15
stride. Okay? So, there are a bunch of
12:17
pesky details here. But I'm just
12:20
ignoring them because this basic default
12:22
approach works amazingly well
12:24
almost all the time.
12:27
Okay? All right. So, that's
12:29
the mechanics of how this
12:31
operation works. Um all right. Now, I'm
12:33
going to switch to a spreadsheet which
12:35
shows this really beautifully
12:37
courtesy of the fast.ai people.
12:41
All right. So, what I'm going to do here
12:43
because it's a big spreadsheet, I'll upload
12:44
the spreadsheet after class so you can
12:45
see it. So, all I have done here, rather
12:48
all they have done here
12:50
thanks to them, is that they have
12:51
essentially created a table of numbers
12:53
in Excel as you can tell.
12:55
And they have just put some numbers.
12:57
Most of the numbers are zero. But
12:59
some of these numbers are more than
13:01
zero. They're like 0.8, 0.9 and so on.
13:03
Basically, all they have done is instead
13:04
of working with numbers between zero and
13:06
255, they're just dividing all the
13:08
numbers by 255 so you get fractions and
13:10
they just put the fractions in the
13:11
table. Okay? And then they have
13:13
used Excel's very cool conditional
13:15
formatting
13:16
to essentially mark in red all the
13:19
values that are high. Right? The closer the
13:21
number is to one, the more
13:23
reddish it gets.
13:24
Okay? And when you do that the three
13:26
obviously pops out.
13:28
So, there is a three in the image. Yes?
13:31
Okay, good. So, now
13:33
what we're going to do is we're going to
13:35
move to our little filter here.
13:37
You can see the filter.
13:39
Right? And I'm claiming this detects
13:41
horizontal lines. And so and this table
13:44
here
13:47
Sorry.
13:51
This table here is the result of
13:53
applying that filter to the three.
13:56
Okay? And you can see here I'm looking
13:58
at the top left cell here.
14:01
Um
14:03
This is
14:03
Look at this top left cell. The formula
14:05
is nothing more than
14:07
you know, multiply all those things and
14:08
add them up. And then once you add it
14:10
up, run it through a max of zero comma
14:12
that, which is just the ReLU.
14:15
Okay? Basic arithmetic.
14:18
So, we do that.
14:19
And this is the output and the output is
14:21
also conditionally formatted to show you
14:24
where things are lighting up.
14:26
And you can see only the horizontal
14:30
lines of the three are lighting up.
14:34
Everyone see that?
14:35
Right?
14:36
So, now you understand that the
14:38
filter in fact is living up to the claim
14:41
I made for it.
14:42
Right? Similarly,
14:44
if you look at what's going on here,
14:46
this is a vertical filter, the same
14:47
thing, you apply it, only the vertical
14:50
line is lighting up.
14:53
Right? Now, what you can do is
14:56
uh I would encourage you to do this, you
14:57
know, um after class, is you can look at
15:00
all these numbers here, for example, and
15:02
then ask yourself, "Okay, why is that
15:04
lighting up?"
15:06
Right? And you will discover that what's
15:08
actually going on is that it's looking
15:11
for edges.
15:12
It's, you know,
15:14
looking for rows in the table where
15:16
there is some nonzero thing in the first
15:18
row and zeros in the second row.
15:21
And by choosing the numbers carefully,
15:23
you multiply the ones with positive
15:25
numbers and you multiply the zeros with
15:27
zeros and then you'll come up with a
15:29
positive number and thereby you detect
15:31
an edge.
15:32
Right? So, what I would encourage you to
15:34
do is use this Excel thing here.
15:39
All right. So, here is a cell we
15:41
have. So, let's uh trace its
15:48
provenance.
15:49
Okay.
15:51
So, you can see here
15:53
these numbers
15:56
Right? This is what it's processing.
15:59
Right? This grid is being
16:00
processed to come up with that big
16:01
number. And you can see here in this
16:04
grid, these numbers are
16:06
here and then these numbers are a lot
16:08
lower than those numbers because
16:11
there is an edge.
16:13
Right? The numbers are a lot lower.
16:14
That's why you can see the horizontal
16:16
part of the three.
16:17
And so, what this filter is doing, it's
16:19
basically saying, "Well, the stuff
16:22
the row that I'm catching here has the
16:24
ones, the middle has zeros, the rest are
16:26
all minus ones."
16:27
Right? So, the small values are going to
16:29
get very small.
16:31
The big values are going to get very big
16:33
and the overall thing is going to be
16:34
emphasized.
16:35
So, that's the basic idea of edge
16:37
detection.
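(To make that concrete, here is a sketch of filters of the kind described, a row of ones, a row of zeros, a row of minus ones, reusing apply_filter from the earlier sketch; the exact spreadsheet values may differ:)

```python
import numpy as np

# Hypothetical 3x3 edge filters of the kind described above
horizontal = np.array([[ 1.,  1.,  1.],
                       [ 0.,  0.,  0.],
                       [-1., -1., -1.]])
vertical = horizontal.T   # the same idea, rotated 90 degrees

image = np.zeros((6, 6))
image[2, :] = 1.0                       # a toy horizontal line
print(apply_filter(image, horizontal))  # the line's row lights up
print(apply_filter(image, vertical))    # all zeros: no vertical lines here
```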
16:38
Spend some time with it with the Excel
16:39
and it'll become clear to you
16:41
what I'm talking about here.
16:43
All right, cool. So, that's that.
16:46
All right. Uh by the way, there is also
16:48
a little very cool site
16:49
here
16:50
in which you can actually go in and
16:52
punch in your own numbers and see what
16:53
it detects.
16:55
Right? Lots of edges and curves and this
16:56
and that. It's very cool. So, I
16:58
encourage you to try it out.
17:00
So, the key thing here I want to say is
17:06
by choosing the numbers in a filter
17:08
carefully and applying this operation
17:10
different features can be
17:12
detected. All right.
17:13
Now,
17:14
I mentioned earlier that a convolution
17:16
layer is composed of one or more of
17:18
these filters. So, one or more of these
17:20
filters. And so, you can think of each
17:23
filter as a sort of a specialist for a
17:25
particular feature.
17:27
Okay? So, it's a specialist. Maybe it
17:30
specializes in detecting vertical lines,
17:32
horizontal lines, you know, uh
17:34
semicircles, quarter circles, you name
17:35
it. Right? You can imagine them
17:38
as being specialists.
17:39
And given that modern images could be
17:42
very complicated, they may have lots of
17:43
interesting features going on, you
17:45
probably want to have lots of these
17:46
filters.
17:48
Okay? But the key the key is that you
17:52
don't have to decide up front, "Hey, you
17:54
filter, you better specialize in
17:56
detecting vertical lines and you on the
17:57
other hand, stay in your lane. Do
18:00
vertical lines." Right? You're not going
18:01
to do that.
18:02
You will let the system figure out what
18:04
it wants to figure out.
18:06
Okay? So, there is no human bottleneck
18:08
in doing this.
18:10
And I mentioned this because there used
18:11
to be a human bottleneck, you know,
18:13
before deep learning happened.
18:15
And so,
18:17
Now, let's just um make sure we
18:19
understand the mechanics of what happens
18:20
when you have two of these filters, not
18:22
one. So, this is the input image as
18:24
before. This is the filter we saw
18:26
earlier and this is another filter we
18:28
have.
18:29
The thing is we just run them in
18:30
parallel. We take each filter, do the
18:32
operation, come up with an output. Take
18:33
the other filter, do the operation, come
18:34
up with its output. And then when you do
18:36
that, the first one gives you that, the
18:38
second one gives you that. And this
18:40
output is a table of some... well,
18:42
it's a it's actually not a table. What
18:44
is it?
18:49
Louder, please.
18:51
It's a tensor. Thank you. It's a tensor.
18:54
And so, these two 5 by 5 matrices can be
18:56
represented as a tensor of what shape?
19:02
And there are two right answers.
19:04
5 by 5
19:06
by two, correct. So, you can
19:08
either think of it as 5 by 5 by 2 or 2 by
19:11
5 by 5. They're both fine.
19:14
Which one you go with actually ends
19:15
up being a matter of convention.
19:18
Okay? So, now you begin to see why we
19:20
care about tensors.
19:22
Imagine if instead of having two
19:24
filters, we have 103 filters.
19:27
The resulting tensor is going to be 5 by
19:29
5 by 103.
19:33
Okay.
19:34
Good.
19:35
Um all right. Now,
19:37
let's now look at the slightly more
19:39
complex situation where you have not a
19:42
black and white image, a grayscale image
19:44
with just a little table, but an actual
19:46
color image.
19:48
Okay? So, So, we know how to apply a
19:51
filter to a 2D tensor like this and to
19:54
get that. But let's say we have
19:56
something like this where it has
19:58
three, right? It's got three channels,
20:00
red, green, blue, RGB. It's got three
20:02
tables of numbers.
20:03
So, this is a tensor of shape 6 by 6 by
20:06
3, let's say, and you want to apply this
20:08
3 by 3 filter just like before to this
20:11
thing. You want to apply the convolution
20:12
operation. How's that going to work?
20:18
Do we just like apply this to each
20:21
We first apply it to the red, then we
20:23
apply it to the to the green, then we
20:25
apply to the blue. Should we do that?
20:30
Or is there a
20:31
a problem with that approach?
20:36
Yeah.
20:39
Could you use the microphone, please?
20:42
Uh the problem with the approach, I
20:43
think, would be the same as what you
20:45
said earlier, that it would learn the
20:47
lines probably the same each channel,
20:49
right?
20:50
Like the location of the lines are
20:51
probably the same each channel.
20:54
Yes, the location of the line is going
20:55
to be the same thing because that line,
20:57
if you will, is sort of the the
20:59
aggregation of information from the
21:00
three different channels. Right. But the
21:03
problem here
21:05
is sort of slightly different,
21:07
which is that
21:09
If you do them independently,
21:12
the network has not been informed that
21:15
these things are all part of the same
21:17
underlying concept.
21:19
As far as it's concerned, it's just like
21:21
three things. It's just going to process
21:22
them independently. So, we need to
21:23
somehow change the filter so that it
21:25
understands like what is at this pixel
21:27
location, the three numbers under it,
21:29
RGB, they're actually the same part of
21:31
the same thing, underlying thing.
21:35
So, what we do is actually very simple.
21:37
We just take this filter and make it 3D.
21:42
So, we take this filter, instead of
21:44
having just one of them, we just make it
21:45
a cube like that. Three times.
21:49
And once we do that, you can imagine
21:51
taking this thing here and essentially
21:53
doing that.
21:56
Okay. Now, instead of having, you know,
21:58
nine numbers in the image and nine
22:00
numbers in the filter,
22:01
you have 27 numbers in the image, 27
22:04
numbers in the filter.
22:05
But you still match them up, multiply
22:07
them, add them up, run them through a
22:09
ReLU.
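(Here's how that 3D version might look in NumPy, extending the earlier sketch; again illustrative, with made-up toy data:)

```python
import numpy as np

def apply_filter_3d(image, kernel):
    # image: (H, W, C), kernel: (kh, kw, C). The filter's depth matches
    # the input's channels, and the output is a single 2D table.
    ih, iw, _ = image.shape
    kh, kw, _ = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = image[r:r+kh, c:c+kw, :]  # 27 numbers for a 3x3x3 kernel
            out[r, c] = max(0.0, np.sum(patch * kernel))  # match, multiply, add, ReLU
    return out

rgb = np.random.rand(6, 6, 3)      # a toy 6x6 color image
kernel = np.random.rand(3, 3, 3)   # one 3x3x3 filter
print(apply_filter_3d(rgb, kernel).shape)   # (4, 4): the output is 2D
```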
22:14
By the way, I tried to get ChatGPT to
22:16
give me a picture like that.
22:19
It just completely bombed.
22:21
I tried like three, four, five different
22:22
variations. It just gave up. And then I
22:24
found this nice picture in the
22:25
deeplearning.ai and I used it.
22:28
So, then if you put different numbers in
22:30
each of the layers, is that like color
22:32
processing? Like it could be doing a
22:33
different thing to green and blue. I'm
22:36
sorry, say that again. If you put
22:37
different numbers in each of the layers
22:39
of your knowledge, in each of the
22:42
different like depth dimensions of your
22:43
convolution filter, would that be like
22:45
color processing?
22:47
Uh yeah.
22:49
Yeah, you will put different numbers. In
22:50
fact, you have 27 numbers now,
22:53
but we haven't gotten to the question of
22:54
where these numbers are coming from. So,
22:55
just hold the thought till we get there.
22:58
Okay. Um so, any questions on this?
23:02
Okay. You literally take the 2D thing
23:04
and make it 3D.
23:05
You basically give it depth and the
23:08
depth just matches the depth of the
23:10
input.
23:11
So, if the input is like, you know, 10
23:13
deep, your filter is going to get 10
23:15
deep.
23:18
Okay?
23:20
Yes.
23:22
Rather than
23:24
increasing the rank order of the tensor
23:26
by one, is there any instance where you
23:27
would create a subtraction layer where
23:29
you would run an operation across the
23:30
different layers to come up with a
23:33
intermediary layer that you would run a
23:35
lower rank tensor of a filter over?
23:38
Yeah, so there is a lot of stuff in the
23:40
research literature which tries to do
23:42
things like that. Uh I'm just describing
23:45
like the the the most basic approach to
23:48
doing this. And as it turns out, this
23:50
basic approach is actually extremely
23:51
powerful, right? And of course, uh
23:54
researchers try to, you know, go from
23:56
95% accuracy to 95.1%.
23:59
So, they invent like all sorts of crazy
24:01
complicated stuff, which is all good for
24:02
us, humanity, but for practical use,
24:04
this is good enough.
24:08
How do you convert the 3 by 3 by 3 filter into
24:10
a single 4 by 4 output? 4 by 4 is
24:12
understood, but what about the 3 layers?
24:14
How do they work?
24:15
Yeah. Um so, we are coming to that. I
24:17
think we have a slide here. Actually, we
24:19
don't. Never mind. We'll answer that. Um
24:20
so, here you have one filter, right?
24:23
You have one 3 by 3 by 3 filter, which
24:26
plugs into this thing here, and then it
24:28
gives you the 4 by 4 at the end.
24:30
Right? So, for one filter, we know that
24:33
by doing this operation, we get
24:37
we get this 4 by 4.
24:38
Let's say that you have another filter,
24:40
which is also 3D.
24:41
You do that thing, you'll get another 4
24:43
by 4.
24:45
And if you have 10 filters, you'll get
24:46
10 of these 4 by 4s, which then gets
24:48
packaged up into a 4 by 4 by 10 tensor.
24:54
Remember, whether it's 2D, 3D, 10D,
24:57
what is coming out is always 2D.
25:02
Because ultimately, when you apply all
25:03
this operation, at each position, you
25:05
just have one number.
25:06
And then ultimately, you just do all
25:07
those things, you just come up with a
25:08
table of numbers always. So, what's
25:10
coming out is always a 2D number table
25:13
like that.
25:14
But when you have lots of filters, you
25:16
have lots of these 2D tables one after
25:18
the other, and therefore, they get
25:20
packaged up into a tensor.
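(Continuing the sketch above: ten 3D filters applied to the same input give ten 2D tables, which stack into a tensor:)

```python
import numpy as np

# Reusing rgb and apply_filter_3d from the previous sketch:
filters = [np.random.rand(3, 3, 3) for _ in range(10)]
maps = [apply_filter_3d(rgb, k) for k in filters]  # ten 4x4 tables
stacked = np.stack(maps, axis=-1)
print(stacked.shape)   # (4, 4, 10): packaged up into a 4 by 4 by 10 tensor
```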
25:25
All right.
25:26
Um so,
25:28
textbook chapter 8.1 has a lot of detail
25:30
and intuition, which I think is really
25:32
good. So, please uh try it out. Okay.
25:35
And folks, by the way, this convolution
25:37
stuff, um it's sort of it grows in the
25:40
telling. So, I would encourage you to
25:41
revisit it, revisit it
25:43
a few times, and then it slowly becomes
25:45
part of your muscle memory.
25:48
Don't expect to just understand all the
25:49
nuances like one shot.
25:51
Do it a few times.
25:52
And it will become, you know, wired into
25:54
your into your head.
25:56
Okay. So, all right. The big question.
25:59
These seem excellent, but how are we
26:00
supposed to come up with these numbers?
26:02
Now, in fact, traditionally,
26:04
uh these filters actually used to be
26:05
designed by hand.
26:07
Uh computer vision researchers would
26:08
invest, you know, prodigious amounts of
26:10
time and effort and talent to figure
26:12
out, you know, the kind the right kinds
26:14
of filters to use for various specific
26:17
applications. So, if you wanted to build
26:19
an application which would look at, say,
26:20
MRI images and figure out, okay, what
26:22
kind of features should I extract from
26:24
this MRI thing to be able to say, you
26:27
know, predict the evidence for a
26:28
stroke, they would actually, you know,
26:30
hand design the filter. They'd try lots
26:32
of different values and then come up
26:34
with, "Ah, I got the perfect filter for
26:35
this thing here." Right? So, that's the
26:37
way it used to be done.
26:39
Um and now,
26:41
as we figured out how to train
26:42
deep networks with lots of parameters,
26:45
right? We figured out things like ReLU
26:47
activation, stochastic gradient descent,
26:49
GPUs, backprop, things like that, you
26:51
know, uh this big idea emerged. Why
26:54
don't we think of the numbers in the
26:55
filter as just weights?
26:57
And why don't we just simply learn them
26:59
from the data using backprop?
27:01
Right? Just like we learn all the other
27:03
weights. What's the big deal?
27:06
And this simple idea,
27:08
and it feels a bit, I don't know,
27:09
blindingly obvious in hindsight.
27:12
I'm sure it was not obvious in
27:13
foresight.
27:14
Um right? This was the breakthrough.
27:16
This was the key breakthrough. And now,
27:18
it's actually possible to do this
27:20
because a convolutional filter that we
27:22
have seen is actually just a neuron.
27:25
And the underlying arithmetic of it is
27:27
just neuronal arithmetic. And so, it
27:31
just happens to be a slightly special
27:32
one. It's actually even simpler than a
27:34
regular neuron. And in the interest of
27:37
time, I have a one or two slides in the
27:39
appendix which tells you exactly why
27:40
it's a neuron. So, check it out. But
27:42
just take my word for it. It's just a
27:44
particular kind of neuron. And because
27:46
it's a particular kind of neuron, and we
27:48
know how to work with neurons,
27:50
right? You know how to work with
27:51
neurons, which means that our entire
27:53
machinery,
27:55
layers, loss functions, gradient
27:57
descent, SGD, blah, blah, everything is
27:59
immediately applicable.
28:01
We don't have to invent any new stuff to
28:03
make it work.
28:06
Okay?
28:08
All right.
28:09
Do you initialize the layers differently
28:12
in applications or just because the
28:14
network has different sizes? Like
28:16
computer vision versus uh medical
28:18
imaging. Is it just because the network
28:20
has different numbers in them?
28:23
Yeah, so the initialization
28:25
So, let's It's a good question. Let's
28:27
come back to it when we get to something
28:29
called transfer learning, which I'm
28:30
going to get to by about 9:30.
28:34
All right. So,
28:36
that's it. All right. So, this turned
28:37
out to be a huge turning point in the
28:38
computer vision field, and this was the
28:40
massive unlock in the year 2012. This
28:43
computer vision system that used this
28:44
technology called AlexNet burst out onto
28:47
the world stage because it crushed the
28:49
competition in a, you know, in in a
28:51
competition called ImageNet, and uh the
28:53
previous best score was 26% error rate,
28:56
and this thing came in and had 16% error
28:59
rate. Right? It's the kind of thing
29:01
where if you see it, you'll be like,
29:01
"Oh, that must be a typo."
29:04
Right? Because every year, the
29:05
improvements in error rate were like
29:06
very little, half a percent, 1%, and
29:07
then this year was 10%, and that that
29:09
was because of this approach.
29:12
And so, all right. Now, one other thing
29:14
I want to talk about is that with
29:16
every succeeding convolutional layer,
29:19
uh any
29:21
particular convolutional filter, it's
29:23
basically implicitly seeing much more of
29:25
the input image as we go along.
29:28
Right? Which means that if in the very
29:29
beginning, if this is the input, right?
29:31
This little convolutional filter this
29:33
number here
29:34
in the first layer, let's say, only sees
29:37
like the top of the chimney or whatever
29:38
of this house.
29:40
But then the next layer, remember, the
29:42
next layer is input is this particular
29:44
layer.
29:45
And so,
29:47
this particular little thing here is
29:49
getting information from this whole
29:50
square here.
29:52
And every one of the points in that
29:53
square is actually something big in the
29:55
original picture.
29:57
So, with every additional layer, you're
29:59
seeing more and more and more of the
30:00
image.
30:03
All right? And this is a key part of why
30:04
these things work because you're
30:06
essentially hierarchically building a
30:08
better and better understanding of the
30:09
image.
30:10
It is the hierarchical understanding,
30:12
the hierarchical learning, that's a very
30:14
key part of the unlock.
30:17
And so, if you look at networks and what
30:20
they're visualizing, here is what a, you
30:21
know, face detection deep network
30:23
visualizes of what it's learning, you'll
30:25
see that the first layer is just
30:26
learning lines and edges and so on.
30:29
And the second layer is actually
30:30
learning edges. Look at this thing,
30:32
right?
30:33
It's it's learning to put these lines
30:36
together
30:37
to get some sort of an edge here,
30:38
another edge here. This looks like three
30:40
quarters of somebody's ears.
30:43
And then, these things are now being
30:45
assembled
30:46
to get whole faces out.
30:49
Can you imagine the researchers who did
30:50
this work? They built the network, it's
30:52
doing really well on detecting faces,
30:53
and they turn around, "Okay, let's see
30:54
what it's actually doing."
30:56
And then, this picture pops up.
30:58
I mean, goosebumps.
31:00
Okay, so pooling layers, the next one.
31:03
So,
31:04
so far we've talked about convolutional
31:05
layers. This is the second
31:07
building block, and then we'll again go
31:09
to the Colabs. So, pooling layers
31:11
are also called subsampling or
31:12
downsampling layers.
31:15
So, the idea is that every time a tensor
31:17
is coming out of these convolutional um
31:19
layers,
31:20
we try to make it slightly smaller
31:23
because the act of making it smaller
31:25
will force the network to try to
31:27
summarize and learn what's going on in
31:29
this complicated thing it's coming into
31:30
it, okay? So, I will describe the
31:32
mechanics first. Um
31:35
So, let's say that this is the output of
31:37
a convolutional layer.
31:39
Okay?
31:40
It's a 4 by 4.
31:42
So, what we do is that there are two
31:45
kinds of pooling, max pooling and
31:47
average pooling. This is called max
31:48
pooling, and the idea is really simple.
31:51
In this max pooling layer, there are no
31:52
weights or parameters to be learned. It's
31:53
just a simple arithmetic operation. We
31:56
31:57
basically superimpose a
32:00
2 by 2 empty grid
32:02
on the top left, and then we say, "Hey,
32:04
what's the biggest number among
32:06
these four numbers?" Well, the biggest
32:08
number is 43. Boom. Okay, I'm going to
32:09
stick a 43 here.
32:11
Then I move my 2 by 2 to the right
32:13
so that it overlaps with these numbers
32:15
in blue, and I say, "Hey, what's the
32:17
biggest number here?" Okay, that's 109.
32:19
And I move it down, what's the biggest
32:20
number here? 105. Stick it in here.
32:23
Biggest number here, 35, and I stick it
32:25
in there. That's it. This is max
32:26
pooling.
32:29
Similarly, there's this thing called
32:30
average pooling, but instead of taking
32:32
the maximum of these four numbers, we
32:33
just average the four numbers.
32:35
Okay, the average of these four things
32:36
in yellow,
32:38
am I done?
32:41
Average of these four numbers is 32.2.
32:43
The average of blue numbers is 25.5, you
32:45
get the idea.
32:46
That's it. Max pooling and average
32:48
pooling. Now,
32:50
as you can see, when you go when you
32:51
apply pooling, the number of entries
32:53
drops significantly.
32:55
Right? The number of entries drops
32:56
significantly.
32:58
And the output from this layer is just
32:59
fed to the next layer as usual.
33:02
Okay? There's nothing, you know, crazy
33:04
going on.
33:05
So, it's a way to shrink the output from
33:07
one convolutional layer before it passes
33:10
on to the next convolutional layer, you
33:11
interject with a pooling layer.
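(A minimal sketch of both pooling flavors; the toy numbers below are chosen so the maxima match the ones read out in the lecture, not copied from the actual slide:)

```python
import numpy as np

def pool2x2(table, mode="max"):
    # 2x2 pooling with stride 2: keep the biggest (or average) of each block.
    h, w = table.shape
    out = np.zeros((h // 2, w // 2))
    for r in range(0, h - 1, 2):
        for c in range(0, w - 1, 2):
            block = table[r:r+2, c:c+2]
            out[r // 2, c // 2] = block.max() if mode == "max" else block.mean()
    return out

t = np.array([[ 43.,  1.,   9., 109.],
              [  2., 12.,  33.,   7.],
              [105.,  3.,  35.,  20.],
              [  8.,  6.,   4.,  11.]])
print(pool2x2(t))          # [[43, 109], [105, 35]]
print(pool2x2(t, "avg"))   # averages of the same 2x2 blocks
```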
33:13
Now, I actually have,
33:15
even if I say so myself, a very nice
33:18
handwritten explanation of what pooling
33:20
does, the effect of pooling.
33:23
And unfortunately, I can't get my iPad
33:25
to actually show up on my laptop.
33:27
So, I'm not going to be able to do it,
33:28
but I will record a walk-through.
33:31
Yeah, and I posted check it out, okay?
33:33
But the intuition that I tried to convey
33:35
with that thing is that oh, um Sorry,
33:38
I'll come back to this.
33:39
So, max pooling acts like an OR
33:41
condition. It basically says, "I have
33:43
this big picture.
33:44
So, in the four things that I'm looking
33:46
at, if there's any number which is
33:48
really high,
33:50
that means that some feature is being
33:51
detected, right?
33:54
The number is really high coming out of
33:55
a convolutional layer, that means that
33:57
something somewhere fired up,
33:59
lit up.
34:00
And so, I'm just looking to see if
34:01
anything lit up in that part. If it did,
34:04
I'm going to say, "Yep, something lit
34:05
up."
34:06
If nothing lit up, then I'm going to
34:08
say, "Oh, nothing lit up."
34:09
So, in that sense, I
34:11
think you can imagine it's like acting
34:13
like an OR condition.
34:15
Anything fired up? Anything fired up?
34:16
Anything fired up? Anything up? Yes,
34:17
okay. Otherwise, no.
34:19
And so,
34:22
sadly, I can't switch to Notability.
34:24
So, it acts like a feature detector. So,
34:27
if you have lots of things going on in a
34:28
particular picture, you want to be able
34:30
to summarize and aggregate all the
34:32
things that are going on so that you can
34:33
say you if you may have a big picture
34:35
with lots of things lighting up here and
34:36
there, but you want to step back and
34:38
say, "You know what? In this picture,
34:40
the top left, nothing lit up. The top
34:42
right, something lit up. Bottom left,
34:45
something lit up. And the bottom right,
34:46
nothing lit up."
34:48
So, you're operating at a higher level
34:49
of abstraction.
34:51
That's the effect of pooling.
34:55
But don't you lose spatial information?
34:59
Uh, you don't, because
35:02
what you're actually saying is the top
35:04
left has this thing.
35:06
You already know it is in the top left.
35:08
And you already moved up to that level
35:10
of abstraction.
35:12
So, for example, if in the top
35:13
left there is a human eye,
35:15
and there is a circle detector, it's
35:18
going to fire up, saying, "Hey, in
35:19
the top left there is an eye."
35:21
Yep, lit up. So, you're not looking at
35:23
the pixels anymore, you're already
35:24
operating at a higher level of
35:25
abstraction, and that's how we get
35:27
around it. But this proceeds slowly and
35:29
incrementally, which is why you have
35:31
these big networks.
35:34
All right.
35:35
So, now, as we saw, as successive
35:38
convolution layers can see more and more
35:40
of the original image,
35:41
the max pooling layers that follow them
35:43
can detect if a feature exists in more
35:45
and more of the original input as well.
35:47
So, by the time you get to like the
35:48
seventh, eighth, ninth layers and
35:50
so on, this thing is actually really
35:52
smart. It's operating at a very high
35:53
level of abstraction.
35:55
Right? You can think of it: it
35:56
has basically tagged all the
35:58
features in that image at various
36:00
resolutions, and it can work with it.
36:04
Is there a trade-off between doing
36:06
pre-processing as opposed to adding
36:08
additional convolutional layers? I'm
36:11
thinking if you have a video turning
36:12
into a black and white static images in
36:15
a sequence as opposed to
36:17
shoving in a color video with a ton of
36:19
noise.
36:20
The greater the time expanse, is there a
36:22
trade-off element? There is a trade-off.
36:24
Um if your particular data set and input
36:27
has has some there is some very
36:29
important domain knowledge that you want
36:31
to encode
36:33
into the network so that the network
36:35
doesn't waste its capacity learning
36:37
things that you know have to be true,
36:39
then yeah, modify the input.
36:41
But if you're not sure,
36:43
right? Then you want to just let network
36:45
learn whatever it can as long as it's
36:47
focused on predicting accuracy as well
36:49
as possible, then just let it be.
36:55
Uh all right. So, that's the basic idea.
36:57
And again, I'm sorry, this
36:59
Notability thing is not working.
37:01
Uh but take a look to really understand
37:03
um how this max pooling business
37:05
works. Okay. Oh, uh I think I skipped
37:08
over this.
37:09
So, when you have something like this,
37:12
so this, let's say, is a tensor coming
37:13
out of some convolutional layer, and its
37:15
size is 224 by 224 by 64, then you apply
37:18
something like a pooling. The thing I
37:20
want to point out is that the pooling
37:22
will work with every slice of the
37:23
tensor.
37:25
Okay? So, if the tensor is 224 by 224 by
37:27
64, it has a depth of 64,
37:30
which is basically like saying it's got
37:31
64 tables of 224 by 224, and the pooling
37:35
will work on every one of those tables.
37:38
Which means that
37:40
you'll still have 64
37:42
things at the very end. It's just that
37:43
every one of the things of the 64, the
37:45
224 by 224, will shrink to 112 by 112.
37:49
So, each table shrinks due to pooling,
37:52
but the number of tables does not
37:53
change.
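(A one-liner to confirm that shape behavior in Keras; a sketch with random data:)

```python
import numpy as np
from tensorflow import keras

x = np.random.rand(1, 224, 224, 64)            # batch of one 224x224x64 tensor
y = keras.layers.MaxPooling2D(pool_size=2)(x)
print(y.shape)                                 # (1, 112, 112, 64): still 64 slices
```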
37:57
Okay. So,
37:59
uh by the way, this
38:01
link here
38:03
has a beautiful explanation of all these
38:05
things with a little bit more complexity
38:06
as well from a course taught at Stanford
38:08
in like 2018 or 2019 or something, I
38:10
forget. Uh so, just check it out if
38:12
you're curious about this stuff. It's
38:13
really good.
38:15
Okay. Um
38:18
All right. So, that brings us to the
38:19
architecture of a basic CNN.
38:21
Um and so, what we do is we have an
38:23
input.
38:25
Okay? We take that input, we run it
38:27
through a bunch of convolutional and
38:29
pooling layers. So, there's a
38:30
convolutional layer, and then we pool
38:33
it, which is why it has shrunk
38:35
in size,
38:37
and then it goes through another
38:38
convolutional layer, then we pool it,
38:40
which is shrunk again,
38:42
and then it keeps on doing it. So, we
38:44
have a series of these;
38:45
these are called convolutional blocks.
38:47
So, a convolutional block is typically,
38:49
you know, one to two convolutional
38:50
layers followed by a pooling layer.
38:52
Okay.
38:54
So, you have a series of convolutional
38:55
blocks.
38:57
Okay? And the thing to notice is that
38:59
as you go further and further in the
39:01
network,
39:03
the blocks will actually get smaller and
39:05
smaller because of
39:07
uh max pooling, right? They'll get
39:09
smaller and smaller, but they'll get
39:10
deeper and deeper.
39:14
Okay.
39:14
And we have empirically figured out that
39:16
that model of reducing the
39:18
size, the height and the
39:20
width, but then making it deeper, tends
39:22
to work really well in practice.
39:25
And so,
39:27
in fact, uh, apologies to the live
39:29
stream that I can't use iPad, I'm going
39:31
to do it on the the board.
39:35
So, let's say that you have a picture
39:38
which is
39:39
coming in as 224
39:43
224
39:44
and then you have
39:46
say three of them
39:48
because it's a color picture, so you
39:49
have three of them.
39:52
Can you folks see this okay?
39:54
All right. So, right? Let's say this is
39:56
the input coming in. And ResNet, which
39:59
is a very famous network that we're
40:00
actually going to work with in a few
40:02
minutes,
40:03
then it actually gets done with all this
40:05
convolution pooling business.
40:07
The final tensor that it has is
40:11
actually of shape
40:13
7 by 7.
40:16
But it is 2048 deep.
40:22
Okay? So, it has
40:24
processed something which is 224 by 224 by 3
40:26
down to a much smaller height and width, just 7
40:28
by 7, but it's gotten much deeper, 2048
40:31
channels.
40:32
This is a numerical example of
40:34
what I'm talking about there in terms of
40:36
as you go along, things get smaller but
40:39
deeper.
40:41
All right.
40:43
Uh
40:44
Yes?
40:45
Is the reason that it gets deeper
40:47
because each
40:49
Like, it gets deeper because each
40:50
layer has a single feature that is
40:52
picked up and then it gets stacked on
40:54
top
40:55
It's not so much that each layer is
40:57
picking up a single feature, it's more
40:58
that
40:59
uh
41:00
basically
41:01
the way I think about it is that
41:04
the number of atomic
41:06
features that you may want to detect are
41:07
probably not that many, right? Lines,
41:10
curves, gradations in color and things
41:11
like that. But the way in which you can
41:13
combine these atomic features
41:16
to depict real world things
41:18
is combinatorial.
41:20
It's sort of like I have 10 kinds of
41:22
atoms, how many molecules can I make
41:23
from it?
41:25
You can make a lot of molecules from
41:26
those 10 atoms, which means that you
41:28
better give the network the ability
41:30
to capture more and more of these
41:32
possible things that the real world can
41:33
come up with.
41:35
And so, as the depth increases, you
41:38
have more filters, and every filter
41:40
now has the ability to pick up some
41:42
combination of what's
41:43
coming in.
41:49
Uh sorry, quick question related to
41:51
this. So, right now like our model is
41:53
being trained to detect certain specific
41:55
features like a line, a color, or
41:56
something of this sort. But still it
41:58
doesn't have meaning to this, right?
42:00
Like still they don't know if that
42:02
arc is a sun or is an eye, right?
42:06
So, yeah. So, we we don't tell it what
42:08
to learn, it just learns.
42:10
All we tell it is make sure that you
42:12
minimize the loss function. Now, once it
42:14
is finished learning, if it's a good
42:16
network, it has good accuracy, then we
42:18
can introspect. We can peek into the
42:21
internals and try to understand what is
42:23
it learning,
42:24
right? And sometimes you like you saw in
42:26
the face detection example, it's
42:27
actually learning interesting things
42:28
like basic lines and edges and then
42:30
slowly, you know, more complicated
42:32
shapes and then finally like entire
42:34
human faces. Sometimes it may not be
42:36
understandable.
42:37
And the way it's doing this is by
42:39
constructing features on its own?
42:42
Like how do you figure out what it's
42:44
learning?
42:44
>> Yeah. Oh, I see. So, I'm going to
42:46
give a reference in just a few minutes.
42:49
Read the paper. That was one of the
42:50
first ones to actually visualize what it
42:52
what these things are learning and
42:53
that'll give you an idea of how it
42:54
actually works. And I'm also happy to
42:56
talk about it offline. It's a bit of a
42:58
tangent, but it's a really rich tangent,
43:00
so if if I keep talking about it, I'll
43:02
end up spending 10 minutes on it, so I'm
43:03
going to back off.
43:06
Okay.
43:08
Um all right.
43:09
So, now once we do that,
43:12
okay? Now we are back in familiar
43:13
territory where we take whatever tensor
43:16
is coming out from these convolutional
43:18
operations and pooling operations and
43:20
then we just flatten it into
43:22
a long vector. And once we flatten them,
43:25
we can connect them to some good old
43:27
dense layers
43:29
like we know how to do and then we
43:30
finally connect them with whatever, you
43:32
know, output layer you want, right? In
43:34
this case, this example is using some
43:36
multi-class classification of
43:39
classifying images by what kind of
43:41
automobile or whatever it is. So, it's
43:42
like a softmax. So, this is a general
43:44
framework.
43:48
Okay?
43:50
Any questions?
43:54
Yeah.
43:55
Can you explain again how the depth
43:57
increases exactly like Oh, the depth
44:00
increases because you decide what the
44:01
depth is.
44:03
So, when you add a convolutional layer,
44:05
you decide how many filters it has. So,
44:07
you just keep adding more and more
44:09
filters the later on you go in the
44:11
network.
44:13
So, it's in your control. So, remember
44:14
the number of neurons in a hidden layer
44:16
is in your control, right? Similarly,
44:18
the number of filters is in your
44:19
control. It's a design choice.
44:22
And we design it so that the later we
44:24
go, the more depth we have. So, you have
44:26
you stack
44:28
um layers with each of those layers has
44:31
a different filter applied to the end
44:35
Yeah, a layer is made up of filters and
44:37
so the depth just comes from having lots
44:39
and lots and lots of filters. And you
44:40
get to choose what they are.
44:44
All right. So, now let's go to the
44:46
Fashion MNIST Colab um that I did the
44:49
video walk-through on and then actually
44:51
solve it using a convolutional network.
44:56
All right, cool. So, uh at this point
44:58
I'm going to zip through some of the
44:59
stuff because you know the preliminaries
45:00
have to be done. Import all these
45:02
packages, set the random seed here.
45:05
Great. And then the we will load the
45:07
MNIST data set just like I did in the
45:09
Colab yesterday. Uh we create these
45:11
little labels.
45:13
Uh and then we just have these standard
45:14
functions to plot accuracy and loss that
45:17
we've been using so far. All right. Now
45:19
we come to the convolutional thing and
45:21
so as before,
45:24
we're going to divide it by 255 to
45:25
normalize everything to a zero to one
45:27
range. Uh let's confirm to make sure
45:29
that the data nothing has gotten
45:31
tampered with. Yep, we have 60,000
45:33
images, each one is 28 by 28 in the
45:35
training set. Now,
45:37
convolutional networks expect
45:40
the input to have
45:42
an explicit channel dimension,
45:47
right? Uh the color images have three
45:49
channels,
45:50
but black and white images have only one
45:52
channel, right? One table of numbers.
45:54
So, instead of saying 28 by 28, we tell
45:56
the convolutional layer to expect 28
45:59
by 28 by one.
46:01
It's the same thing conceptually, but
46:03
that's the sort of the format that it
46:04
expects.
46:05
And so,
46:06
uh we go here and then we say, all
46:09
right, there's a thing called expand
46:11
dimension. I'm just telling it to expand
46:12
its dimension and once I do that, you
46:14
can see here it's still 60,000, but
46:17
instead of 28 by 28, it has become 28 by
46:19
28 by one. Same thing.
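(The preprocessing as described, sketched out; array names like train_images are my assumption about the Colab, not verbatim from it:)

```python
import numpy as np

train_images = train_images / 255.0              # normalize 0-255 down to 0-1
print(train_images.shape)                        # (60000, 28, 28)
train_images = np.expand_dims(train_images, -1)  # add the channel axis
print(train_images.shape)                        # (60000, 28, 28, 1)
```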
46:21
Okay? Now, let's define our very first
46:24
CNN.
46:25
So, all right.
46:27
As as before, the the input is just
46:30
Keras.input as before, no difference
46:32
here and we tell it the shape and the
46:34
shape is of course just 28 by 28 by one.
46:37
Okay? That's what I have here.
46:39
And then we come to the first
46:40
convolutional block.
46:43
So, and this is the key thing.
46:45
If you want to tell Keras to use a
46:47
convolutional a layer,
46:49
you use this keyword layers.Conv2D.
46:53
And from this you can probably also
46:54
figure out that there's a Conv1D and
46:56
there's a Conv3D and so on and so forth,
46:58
which, you know, you should explore. It's really
47:00
good stuff.
47:01
But for image processing, Conv2D is all
47:04
you need. And now we tell it how many
47:06
filters you want. Okay. So, uh we decide
47:09
on the number of filters. So, I've
47:10
decided to have 32 filters. Okay? And
47:13
then I I we also have to decide the size
47:15
of the filter, right? The simplest size
47:18
is 2 by 2. So, I'm just going to go with
47:19
that.
47:20
Right? Kernel size is 2 by 2.
47:22
And then the activation is of course
47:23
ReLU. I give it a name, convolution one,
47:26
and then I feed it the input. And then
47:27
once I do that, I follow it up with a
47:29
little pooling layer which I where I use
47:31
MaxPooling2D.
47:33
And MaxPooling2D, you just literally
47:35
pass the input, you get the output back.
47:36
It just
47:37
shrinks everything using pooling.
47:39
So, that is the first convolutional
47:40
block.
47:41
And you know what?
47:43
I know how to cut and paste. Boom, cut
47:45
and paste, I get the second
47:46
convolutional block.
47:48
Okay? Here is the second convolutional
47:49
block. And I know in the lecture I just
47:52
mentioned that as you go deeper, you get
47:54
more depth to it, but this is
47:56
just a starting point. I'm just going to
47:58
use the same depth. Not a big deal. It's
47:59
a simple problem. So, which is why in
48:01
the second convolutional block I'm still
48:03
using only 32.
48:04
But you can totally go to 64 for
48:06
instance to make it much deeper.
48:07
Okay?
48:08
Uh and once I do that,
48:10
I finally come to the point where I
48:12
flatten everything to a long vector,
48:14
then I connect it to one dense layer of
48:17
256 neurons.
48:19
And then finally, I come to the softmax
48:22
where I have 10 outputs, right? 10
48:23
categories of clothing, softmax, and
48:26
then I tell Keras, okay, take this input
48:27
and the output, string them up together,
48:30
define a model for me.
48:32
So, that's it. That's a convolutional
48:33
network. The new concepts we are seeing
48:35
here are Conv2D for the convolutional
48:38
layer and then MaxPooling2D for the max
48:40
pooling layer.
48:42
Okay? That's it.
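(Putting the whole model together as described; a sketch that should match the Colab in spirit, though names and small details may differ:)

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(28, 28, 1))
x = layers.Conv2D(32, kernel_size=2, activation="relu", name="conv1")(inputs)
x = layers.MaxPooling2D()(x)      # first convolutional block
x = layers.Conv2D(32, kernel_size=2, activation="relu", name="conv2")(x)
x = layers.MaxPooling2D()(x)      # second convolutional block
x = layers.Flatten()(x)
x = layers.Dense(256, activation="relu")(x)
outputs = layers.Dense(10, activation="softmax")(x)  # 10 clothing categories
model = keras.Model(inputs, outputs)
model.summary()   # roughly 302 thousand parameters
```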
48:43
Uh
48:44
So, let me just run this thing.
48:46
It runs. Okay, good. Yeah.
48:49
Uh how do you decide when to flatten and
48:52
would there ever be a situation in which
48:54
we just kind of use the method that we
48:56
used before and not use a CNN?
48:59
Well, we already tried it with MNIST,
49:00
right? We didn't use a CNN. We just
49:02
flattened right away.
49:03
>> It did work. It's not bad, but we are
49:05
like, you know, can we do better than 85
49:06
or 88 or whatever the percent was,
49:08
right? So, when we are working with
49:09
images, it's typically a good idea to
49:11
just start with a CNN straight out the
49:13
gate, because you're not losing anything.
49:14
You're not giving up anything.
49:16
So, uh in terms of how many uh layers
49:19
you should have, my philosophy is start
49:20
simple and if it works, stop working on
49:23
it. If it doesn't, add more layers.
49:27
Uh yeah.
49:28
Yeah, just to uh is it the architecture
49:30
design, the number of filters, kernel
49:32
size, number of layers, convolution
49:34
pooling, is that just all based on trial
49:36
and error or what's sometimes? Yeah, so
49:37
typically it's based on trial and error,
49:39
Um to answer your question. But as you
49:41
will see in the transfer learning
49:42
discussion we're going to have soon,
49:44
you can actually, instead of doing
49:46
anything from scratch, it's much better
49:48
to just download a pre-trained model and
49:50
just adapt it for your particular
49:51
problem. That is actually the norm by
49:54
which people do these things. The reason
49:55
I'm doing it from scratch is because you
49:57
should know how it was done.
50:00
It should not be a black box to
50:01
you. That's my goal.
50:03
Yeah.
50:05
Just from a notation perspective, I
50:07
noticed you named all of these layers X.
50:09
Is that a habit we should get into
50:11
naming them all the same or is that just
50:12
a
50:12
>> Actually, I'm not naming the layers as
50:15
X. What's going on here is I'm
50:17
feeding it X.
50:19
And whatever is coming out of it, I'm
50:21
just calling it X.
50:22
That's all. It's just a notational
50:23
convenience for me. I'm just
50:25
calling both the input and the output X, and
50:27
Keras under the hood will track
50:28
everything and make sure the right thing
50:29
happens. Otherwise, I'd have to be like
50:31
X1, X2, X3, X4 and then if I want to add
50:33
a new layer somewhere in the middle
50:35
between X3 and X4, I have to call that
50:37
X4 and then I'll change everything to 5,
50:39
6, 7. Complete pain in the neck. That's
50:41
why I do this.
50:42
All right. So, model.summary
50:46
It's got 302,000 parameters. I'll
50:51
just plot it.
50:53
Great. And I encourage you to hand
50:56
calculate it later on and make sure the
50:58
numbers tally, okay?
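For the hand calculation, the per-layer formulas look like this; the 3x3 kernel size here is an assumption, so check it against the Colab:

```python
# Conv2D parameters: (kernel_h * kernel_w * input_channels + 1) * filters
conv1 = (3 * 3 * 1 + 1) * 32    # first block on a 1-channel image -> 320
conv2 = (3 * 3 * 32 + 1) * 32   # second block, 32 channels in     -> 9,248
# Dense parameters: (inputs + 1) * units, where inputs is the flattened size.
# Pooling layers have no parameters. Sum every layer and compare to the summary.
```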
51:00
For now, let's just go. So, as before,
51:03
we'll just use the same compilation.
51:06
We'll use Adam and then we'll train it
51:08
for, you know, just 10 epochs. We'll use
51:11
a validation split again, as usual, of
51:13
20%. So, let's just run it.
51:15
So, it's actually going to run. And as
51:17
you will see,
51:18
with convolutional networks there's a lot
51:19
more going on, so it's going to be a bit
51:20
slower to run. Hopefully not too much
51:23
slower.
51:25
While it's doing that, other questions?
51:31
So, if we have a task other than image
51:32
classification, like segmentation, do we
51:34
still flatten
51:35
first?
51:37
Yeah, so this is for image
51:39
classification. For other kinds of
51:41
applications,
51:42
typically you run it through a bunch of
51:44
convolutional layers and so on and so
51:45
forth.
51:46
But the output side of the equation gets
51:48
much more complicated because if instead
51:51
of classifying just
51:53
the whole picture into, you know, dog or
51:56
cat, if you have to take every pixel and
51:58
classify it, right? Then, well, you
52:01
better have an output shape that is the
52:03
same dimensions as the input shape.
52:06
So, for that we use a different
52:07
architecture. It's called U-Net
52:09
and so on, which unfortunately I won't
52:11
be able to get into. But I know I am
52:13
planning to post another video
52:14
walk-through where I show you how to use
52:17
the Hugging Face Hub
52:19
to very quickly build models for the
52:22
other applications like segmentation and
52:23
so on. I'm hoping to post that tomorrow.
52:26
It's an optional viewing thing that
52:27
might help with that.
52:29
Okay. So, is it done? Okay, good. It's
52:32
done. All right, let's plot the
52:35
thing here.
52:36
All right, so it seems like training is
52:38
going down nicely. Validation
52:40
is sort of flattening out somewhere here
52:42
around the eighth epoch. Let's look at
52:45
the accuracy.
52:47
Same situation here. The accuracy is in
52:48
the 90s. Of course, the final question
52:51
is how it will do
52:52
on the test set.
52:55
Whoa, 90.5%.
52:58
Pretty good.
52:59
By the way, if you're not impressed that
53:00
we went from 88 to 90,
53:04
These applications are the
53:05
proverbial sort of diminishing returns
53:07
problems, okay? So, what you should
53:09
always think of is look at the amount of
53:11
error that's left and ask yourself how
53:13
much of that error am I able to reduce?
53:16
So, we had roughly 12% error left
53:20
when we did the simple Colab yesterday.
53:22
From that 12%, we have knocked off two
53:24
points to get to over 90, which is
53:26
amazing.
53:27
Okay?
53:28
And in fact, I think the state of the
53:29
art on this
53:31
um
53:32
is 97%.
53:34
So, I invite you
53:36
to take this thing and try different
53:39
filters and so on and so forth to see if
53:40
you can get to the mid-90s.
53:42
It's not easy, but try it. Yeah.
53:45
Does the number of epochs have to be
53:48
related to the number of batches?
53:50
Because you did a batch size of 64 and 10 epochs? No,
53:52
the epochs are independent;
53:55
the number of epochs is just the number of passes
53:56
through the whole data.
53:58
But within each pass, within each epoch,
54:01
the batch size determines how
54:03
many batches you're going to process.
54:05
So, it is basically the number of
54:06
examples you have in your training data
54:08
divided by the batch size that you have
54:10
chosen,
54:11
right? That number rounded up is the
54:13
number of batches within each epoch.
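A quick worked example, assuming the batch size of 64 from the question and Fashion MNIST's 60,000 training images with the 20% validation split:

```python
import math

train_examples = 60_000 * 0.8                        # 20% held out -> 48,000
batches_per_epoch = math.ceil(train_examples / 64)   # 750 batches per epoch
# 10 epochs then means 10 full passes of 750 batches each;
# the two numbers are chosen independently.
```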
54:16
And here I'm just choosing 10 because,
54:18
you know,
54:20
Siri found something on the web. Okay.
54:23
I chose 10 because it's going to be fast
54:24
to do for me to do it in class. And 10
54:26
is actually more than enough because you
54:27
can see it's already beginning to
54:28
overfit.
54:31
Yeah.
54:33
This is more of a conceptual question,
54:35
but is it always the case that a neural
54:37
network will have better accuracy than
54:39
a classical machine learning algorithm?
54:42
And I'm asking more on the case of like
54:44
the heart disease problem. Oh, yeah,
54:45
yeah.
54:46
Great question. So, neural networks are
54:49
really good for unstructured data like
54:50
the ones we have here. But if you
54:52
have structured data like the heart
54:53
disease problem, sometimes it actually
54:55
works really well. Sometimes
54:57
things like gradient boosting, XGBoost,
54:59
work really well. So, if I am actually
55:01
working on a structured data problem,
55:03
I'll try both.
55:04
I'm not going to axiomatically assume
55:06
that the DNN is going to be the best
55:07
thing. But if you have unstructured data,
55:09
it's the best game in town.
55:11
All right. Um
55:13
I'm just going to
55:14
By the way, I have a whole section here
55:15
on once you build a model, how do you
55:16
actually improve it?
55:17
Right? Check it out. It's an optional
55:19
thing.
55:20
All right, I'm going to stop this here.
55:22
All right. So, the next thing I want to
55:23
do is
55:25
So, we went from 88 to 90 plus percent,
55:27
right? Using convolutional networks.
55:29
Now, let's work with color images. Let's
55:31
kick it up a notch.
55:33
So, um
55:34
I actually
55:36
web scraped
55:38
all these pictures for you folks, for
55:40
your enjoyment. I web scraped about 100
55:42
color images of handbags and shoes.
55:44
Roughly 100 each: 100 handbags, 100
55:46
shoes. So, the question is with these
55:48
essentially 200 images,
55:51
can we build a really good neural
55:52
network to classify handbags and shoes?
55:54
Right? It seems kind of absurd, right?
55:56
Because 200 examples, I mean, it's not
55:58
that much, right? It doesn't feel like a
55:59
lot. The Fashion MNIST data set has 60,000
56:02
images.
56:04
Right? And, you know, even
56:06
with that we are overfitting in like 5,
56:07
6, 7, 8 epochs.
56:09
With 200 images, maybe, you know, is
56:10
there any hope? Obviously, there is
56:11
hope, otherwise it wouldn't be in the
56:13
lecture. So, yeah. So, we're going to
56:15
take this data set and let's see what we
56:16
can do with it. So, we'll first actually
56:18
build a convolutional network from
56:19
scratch to solve this problem. Okay?
56:22
All right.
56:24
I'm actually going to run through the
56:25
code because at the end of it we'll have
56:27
a live demo. So, I would like one
56:29
volunteer to give me a handbag and one
56:31
volunteer to give me their footwear.
56:34
Boy, in class.
56:37
Okay. So, all right. Unlike the previous
56:40
data set, this one I actually just web
56:42
scraped. So, you know,
56:44
I've stuck it in this Dropbox
56:46
folder.
56:47
Let's just download it and unzip it. And
56:49
once we do that, we have to now organize
56:51
it with these 200 images. So,
56:54
I have to do some
56:57
sort of boring-ish Python stuff here.
57:00
So, here what we're doing is that we
57:02
have 100 handbags, roughly 100 shoes.
57:04
And what this code is doing is it's
57:06
actually creating a directory structure:
57:08
it's splitting stuff into train and
57:10
validation and test. And then for each
57:12
of the splits, it's creating the handbags
57:13
and the shoes folders. Okay? So, once we
57:16
do that, basically this directory
57:18
structure is created.
57:20
Okay? Training, validation folder, test
57:23
folder, handbags and shoes. In fact,
57:25
I think you can actually see it
57:26
here.
57:27
See here, handbags and shoes. And within
57:29
that, there is, you know, train, test,
57:31
validation. And within each of these,
57:33
there's handbags and shoes. So, the idea
57:34
is that when you're working with images,
57:36
right? What you can do is you can just
57:37
create folders for each kind of image,
57:40
right? Let's say dogs, cats,
57:42
two folders with cat images and dog
57:43
images and then just point Keras at it.
57:46
It'll automatically figure out those are
57:47
the labels.
57:49
It makes it easy for you. So, it's very
57:50
convenient when you're working with
57:51
images.
57:52
And the book explains this thing in
57:53
great detail.
57:55
All right. So, when working with these
57:56
images, color images, we'll follow this
57:58
process. We'll read in the JPEGs. We'll
58:00
convert them to tensors. And then since
58:02
I'm web scraping it, they all come in
58:03
different shapes and sizes. So, I need
58:05
to like bring it all to the same size.
58:06
Okay? I resize it and then I'm going to
58:08
batch it. I'm going to
58:10
use a batch size of 32 here.
58:13
So, and this utility from Keras will do
58:16
all that for you, right? Very quickly.
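A sketch of that utility call; the folder paths are hypothetical stand-ins for the directory structure created above:

```python
from tensorflow import keras

train_ds = keras.utils.image_dataset_from_directory(
    "data/train",              # contains handbags/ and shoes/ subfolders
    image_size=(224, 224),     # resize everything to one common size
    batch_size=32)             # batch while loading
val_ds = keras.utils.image_dataset_from_directory(
    "data/validation", image_size=(224, 224), batch_size=32)
test_ds = keras.utils.image_dataset_from_directory(
    "data/test", image_size=(224, 224), batch_size=32)
# Keras infers the two class labels from the subfolder names.
```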
58:19
So, basically what it says is that it
58:20
found 98 files in the
58:23
training data belonging to two classes,
58:25
49 in the validation set, and 38 in the test set.
58:28
So, less than 100 examples in the
58:29
training set. That's what we have here.
58:31
All right. What's the time? 9:30. Okay.
58:33
So, all right. Now, let us check the
58:35
dimensions to make sure. Good. So, 224
58:38
by 224 by 3. And why did I pick
58:40
224 by 224? As you will see later, we're
58:43
going to use something called ResNet,
58:45
and ResNet expects it to be 224 by
58:47
224 by 3. That's why I resized it to 224 by
58:49
224. Let's look at a few examples of my
58:52
wonderful web scraping in action.
59:01
It's pretty wild, right?
59:02
Okay. So, now let's do a
59:04
simple convolutional network. Um,
59:07
before, we would take all the X
59:09
values in Fashion MNIST and divide them
59:10
manually by 255 to normalize them to [0, 1].
59:13
Well, you know what? We are actually
59:14
graduating to the higher levels of Keras
59:16
now. So, let's not do that, right?
59:17
Manual stuff is bad. So, we'll do it
59:19
within Keras by using something called
59:21
the Rescaling layer where we just tell
59:22
it how much to rescale and boom, it'll
59:24
do it for you. The first convolution
59:26
block, just like the Fashion MNIST 32,
59:28
second block, again 32, max pool,
59:31
flatten. And then here we only have
59:33
handbags versus shoes, so just a sigmoid
59:35
is enough, right? It's just a binary
59:36
classification problem. So, I'm just
59:38
using one output layer with a sigmoid,
59:40
and that's our model. So, let's do the
59:42
model.
59:43
All right, model summary.
59:48
About 103,000 parameters in this little
59:52
model. Okay, let's compile it and run
59:54
it. Uh, and note here because it's a
59:56
binary
59:57
classification problem, I'm using binary
59:59
cross entropy.
1:00:02
Same Adam.
1:00:03
And accuracy, compile, and then boom,
1:00:05
let's run it. We'll run it for 20
1:00:07
epochs.
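Putting those pieces together, a sketch of the model and the training call. The 3x3 kernels and ReLU are assumptions, and train_ds/val_ds are the datasets loaded above:

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(224, 224, 3))
x = layers.Rescaling(1.0 / 255)(inputs)              # normalize inside the model
x = layers.Conv2D(32, (3, 3), activation="relu")(x)  # first block
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Conv2D(32, (3, 3), activation="relu")(x)  # second block, again 32
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Flatten()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)   # handbag vs. shoe
model = keras.Model(inputs, outputs)

model.compile(optimizer="adam",
              loss="binary_crossentropy",            # binary classification
              metrics=["accuracy"])
history = model.fit(train_ds, validation_data=val_ds, epochs=20)
```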
1:00:08
Hopefully.
1:00:12
Okay, while it's doing this business,
1:00:13
I'm going to shift to the PowerPoint.
1:00:17
So, we'll go back to see how well it
1:00:19
did, but, uh, whatever
1:00:21
it did, we built it from scratch. So,
1:00:23
the question is, can we do better than
1:00:23
that? Okay? Because we only have 100
1:00:26
examples of each class, and which brings
1:00:28
us to something very cool and very
1:00:29
powerful called transfer learning. And
1:00:31
the idea, so the key thing is there are
1:00:33
two research trends that are going on
1:00:34
that we take advantage of. The first one
1:00:36
is that researchers have defined, you
1:00:38
know, designed architectures which
1:00:40
exploit the kind of input you have. So,
1:00:42
Olivia asked the question, if you have a
1:00:43
particular kind of input images, do you
1:00:45
actually change the input, or do you
1:00:47
actually change the network? As it turns
1:00:49
out, here, for example, if it's images,
1:00:50
we know that we should use convolutional
1:00:52
layers because convolutional layers were
1:00:53
designed to exploit the image-ness of
1:00:55
the input.
1:00:57
Okay? Similarly, if you have sequences
1:00:59
of information, like obviously natural
1:01:01
language, audio, video, gene sequences,
1:01:03
and so on, so forth, these things called
1:01:05
transformers were invented
1:01:07
to exploit them, and we're going to
1:01:08
spend a lot of time on transformers
1:01:09
starting next week. So, that's the first
1:01:11
trend. The second trend is that
1:01:13
researchers have used these innovations
1:01:15
to actually create and train models on
1:01:19
vast data sets, and thankfully, they've
1:01:21
made them publicly available for us to
1:01:23
use. So, transfer learning is the idea
1:01:26
that if you have a particular problem,
1:01:28
let's just take a pre-trained network
1:01:30
somebody may have already created,
1:01:32
and then let's just customize it to our
1:01:33
problem, rather than actually build
1:01:35
anything from scratch.
1:01:37
Okay, that's the basic idea. So,
1:01:39
so here we have this basically we have
1:01:41
to build a classifier which takes in an
1:01:43
arbitrary image and figures out if it's
1:01:45
a handbag or a shoe, right? That's our
1:01:46
goal.
1:01:47
And so, now handbags and shoes are
1:01:49
everyday objects, and so what you can do
1:01:51
is, hmm, you can look around and see
1:01:53
if there are any networks that have been
1:01:55
trained by other people which actually
1:01:57
have been trained on everyday images.
1:02:00
Right? As opposed to like MRI or X-rays,
1:02:02
right? Specialized images, everyday
1:02:04
images. Of course, the first thing you
1:02:05
should probably do is to see if anybody
1:02:07
has built the specific thing you want,
1:02:08
handbag shoes classifier on GitHub.
1:02:10
Assuming it's not, then you do transfer
1:02:12
learning. Okay? So, now it turns out
1:02:15
that there's this thing called ImageNet,
1:02:17
which is a database of millions of
1:02:19
images of everyday objects in a thousand
1:02:22
different categories, furniture,
1:02:24
animals, automobiles, you get the idea.
1:02:26
Okay? And so, we can look for the
1:02:28
networks that have been trained on
1:02:29
ImageNet.
1:02:31
Okay, let me just go back to the Colab
1:02:33
just to make sure it doesn't time out.
1:02:37
All right, so it has finished doing it.
1:02:40
Um, let's just plot these things.
1:02:48
Okay, so
1:02:49
uh, there is some overfitting that
1:02:51
happens around here,
1:02:52
around the 10th epoch. Let's
1:02:55
look at the
1:02:59
So, the training accuracy is
1:03:01
actually getting to almost to 100%. But
1:03:03
we're not interested in training
1:03:04
accuracy, right? We care about
1:03:06
validation and test accuracy, and that
1:03:08
seems to be kind of hovering around in
1:03:10
the 80s. Um, so let's just evaluate it
1:03:13
anyway to see what happens.
1:03:15
Okay, so it gets to 87% accuracy
1:03:19
on this data set.
1:03:20
It's actually pretty good given that we
1:03:22
only have 100 examples. So, 87%
1:03:24
accuracy, and we trained the whole
1:03:26
thing ourselves, everything from
1:03:28
scratch. Okay? Now, then
1:03:31
Now, there's this whole section
1:03:32
about data augmentation, which, um, you
1:03:35
know what? Do we have time?
1:03:38
So,
1:03:40
so the idea of augmentation is that when
1:03:42
you have an image,
1:03:44
let's say you take this image, and you
1:03:45
just rotate it slightly by 10°.
1:03:49
If it's a handbag before you rotated it,
1:03:51
it sure as hell is a handbag after you
1:03:52
rotated it.
1:03:54
Right?
1:03:55
The meaning of the
1:03:56
image doesn't change just because you
1:03:57
rotated it slightly. Or maybe you zoom
1:04:00
in slightly, you zoom out slightly, you
1:04:01
crop it slightly, nothing happens.
1:04:03
So, what you can do is you can take any
1:04:05
image you have, and you just perturb it
1:04:07
slightly,
1:04:08
like right there, and then add it as a
1:04:10
new example to your training data.
1:04:14
This is an unbelievable free lunch,
1:04:15
frankly.
1:04:16
And the same thing actually, same kinds
1:04:19
of techniques actually work for text
1:04:20
also, which we'll cover later on.
1:04:22
Right? This broad area is called data
1:04:24
augmentation.
1:04:26
It's a great way when you don't have a
1:04:27
lot of data to artificially bolster the
1:04:30
amount of data you have.
1:04:31
Okay?
1:04:32
Um, and so, and of course, Keras makes
1:04:34
it very easy for you to do all these
1:04:36
things. It has already predefined a
1:04:38
whole bunch of data augmentation layers
1:04:40
for you. So, here's a little example
1:04:43
where I basically take a picture and
1:04:45
then I randomly flip it. So, if it looks
1:04:47
like this, I flip it this way,
1:04:48
horizontal. Okay? Uh, and then I
1:04:50
randomly rotate it by 0.1. I forget if
1:04:53
it's 0.1° or radians, you can look up
1:04:55
the documentation. And then random zoom,
1:04:57
right? Zoom in and out a little bit. Uh,
1:05:00
but it won't do this for every picture.
1:05:02
It will only do it randomly. Okay? So,
1:05:04
that only some pictures will get
1:05:06
perturbed in some ways. And that's how
1:05:07
you make sure there's enough diversity
1:05:09
of pictures that you have.
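The three layers just described might look like this; the factors are the ones mentioned, and note that in Keras, RandomRotation's factor is a fraction of a full turn (2*pi), so 0.1 means roughly plus or minus 36 degrees:

```python
from tensorflow import keras
from tensorflow.keras import layers

data_augmentation = keras.Sequential([
    layers.RandomFlip("horizontal"),  # horizontal flips only
    layers.RandomRotation(0.1),       # fraction of a full turn, about +/-36 degrees
    layers.RandomZoom(0.1),           # zoom in or out by up to 10%
])
# Each layer perturbs images randomly, so only some pictures change each pass.
```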
1:05:10
So, once you do that,
1:05:12
you can actually take a picture and see
1:05:13
what it does.
1:05:15
I just randomly grab a picture, so it
1:05:17
keeps changing every time.
1:05:21
Yeah, look at this handbag.
1:05:22
Handbag slightly rotated this way,
1:05:24
rotated that way.
1:05:26
Some more. Maybe a little bit of zooming
1:05:28
going on, and so on. You get the idea,
1:05:30
right? And there's a whole list of these
1:05:31
things you can do. But when you do those
1:05:33
things, make sure
1:05:35
that what you're doing doesn't actually
1:05:37
change the underlying meaning of the
1:05:38
picture.
1:05:39
It's really important.
1:05:41
Okay? So, for example, if you're working
1:05:43
with satellite data,
1:05:45
yes, be very careful not to do
1:05:47
crazy flips.
1:05:49
Right? Or even if you're working with
1:05:50
everyday images, horizontal flips are
1:05:51
okay. Don't do vertical flips.
1:05:54
Right? How many times will you have an
1:05:55
upside-down dog picture that you need to
1:05:57
classify?
1:05:59
Make sure your augmentation doesn't go
1:06:00
nuts.
1:06:02
All right.
1:06:05
Once you do that, you can actually just
1:06:07
insert the data augmentation layers in
1:06:09
your model right there, right after the
1:06:11
input. The rest of it can stay
1:06:12
unchanged.
1:06:14
So, this is a great way to increase the
1:06:15
size of your training data, and here is
1:06:17
a model, and then I invite you to
1:06:19
actually just play with it and
1:06:21
train it. In the interest
1:06:23
of time, we won't actually train this
1:06:23
model, but it's in the Colab, you can
1:06:24
just try it. It also figures prominently
1:06:27
in homework one, by the way, data
1:06:28
augmentation. So, you'll get more
1:06:30
experience with this. Okay. So, uh, back
1:06:32
to the PPT.
1:06:34
So, this is what we have. Um, and so,
1:06:37
any network that has been trained on
1:06:38
this ImageNet thing, uh, turns out
1:06:41
learns all kinds of interesting features
1:06:42
in every one of its layers. So, here
1:06:44
this is the first layer, and you can see
1:06:46
it's picking up sort of gradations of
1:06:48
color, sort of line-ish kind of
1:06:49
behavior. Layer two, um, it's actually
1:06:52
picking up Hey, look, it's picking up an
1:06:54
edge. Can you see that edge?
1:06:56
Right? Like like that.
1:06:59
And then layer three is picking up these
1:07:01
interesting honeycomb shapes, uh, and so
1:07:04
on. Oh, it's actually this thing is
1:07:05
already picking up like the
1:07:07
shape of a human torso.
1:07:12
Yeah, this layer is actually picking up
1:07:13
what looks like a Labrador retriever.
1:07:16
Okay.
1:07:17
Isn't that cute?
1:07:19
Come on, even if you're not a dog
1:07:20
person.
1:07:22
All right. So, this is the
1:07:24
visualization I was referring to
1:07:25
earlier,
1:07:26
um, to figure out what are these
1:07:28
networks actually learning.
1:07:30
This paper was one of the first ones to
1:07:31
actually visualize what's going on
1:07:32
inside. So, if you folks are curious how
1:07:34
these pictures are actually produced, I
1:07:36
would encourage you to check this out.
1:07:38
Okay, yep.
1:07:40
So, we spoke about images and you
1:07:44
referred to classes, and
1:07:46
you mentioned
1:07:47
text
1:07:49
next week with transformers, but
1:07:52
what about, say, an email which has both
1:07:54
text and images, and maybe white
1:07:56
space depending on who has written it?
1:07:58
Does that get put in as an input
1:08:01
as an image, or...
1:08:01
So, we'll revisit this great question a
1:08:03
bit later on in the course.
1:08:04
So, the answer is a bit complicated, and
1:08:06
I want to do it justice,
1:08:07
so we'll come back to it.
1:08:09
All right, so
1:08:10
so it turns out this thing called ResNet
1:08:12
is a family of networks which
1:08:14
were trained on this ImageNet data set,
1:08:16
and they did really well in the
1:08:18
competition that's associated with the
1:08:19
ImageNet data set, the ILSVRC challenge. And
1:08:21
so, this is an example of such a
1:08:22
network. So, we would expect the
1:08:24
weights and the parameters of ResNet,
1:08:27
given that it's been trained on
1:08:28
ImageNet, to sort of have some knowledge
1:08:30
about lines and shapes and curves and
1:08:32
things like that. So, maybe we can just
1:08:34
use that, right? So, that's the idea.
1:08:39
But the thing is, we can't use ResNet as
1:08:39
is because remember, it was trained to
1:08:40
classify an incoming image into a
1:08:42
thousand possibilities.
1:08:44
Here we only have two possibilities,
1:08:45
handbags and shoes. So, what we do is
1:08:47
very simple and elegant. We do just a
1:08:50
little bit of surgery.
1:08:51
We take ResNet and stop just before the
1:08:54
final layer. So, take my word for it,
1:08:57
this thing here, what it says is fully
1:08:59
connected, one thousand.
1:09:01
Because it's a thousand-way classifier, right?
1:09:02
A thousand objects. So, what we do is we
1:09:04
just take everything else and stop
1:09:06
just before that last layer.
1:09:08
And then what comes out of that layer,
1:09:10
hopefully, will be like a very smart
1:09:11
representation of the images that it has
1:09:13
been trained on.
1:09:14
And so, what we do is we can think of
1:09:16
sort of headless ResNet
1:09:19
as our model.
1:09:21
And we can take that we can take all our
1:09:23
data and run it through ResNet up to but
1:09:26
not including the last layer.
1:09:28
Okay, you get some tensor and that
1:09:30
tensor probably has a
1:09:31
very rich understanding of what's going
1:09:33
on in that image, all the objects and
1:09:35
features and things like that. And then
1:09:36
we can simply connect that. We can
1:09:39
think of it as like a smart
1:09:40
representation of an input. We can
1:09:42
connect it to just a little hidden layer
1:09:44
and then we have a little sigmoid which
1:09:46
then tells you handbag or shoe. We can
1:09:47
just run this network.
1:09:50
Okay? Um and so since the inputs to the
1:09:53
hidden layer now are not raw images
1:09:54
anymore, but this much higher level of
1:09:57
abstraction that ResNet has learned,
1:09:59
hopefully it can get the job done with
1:10:00
hardly any examples.
1:10:02
Okay? And now you can get fancier.
1:10:04
That's the basic idea, but you can get
1:10:05
much fancier. You can connect up
1:10:07
headless ResNet directly with our little
1:10:09
network with a hidden layer and the
1:10:10
final thing and the whole thing can be
1:10:12
trained.
1:10:14
End to end. Uh but when you do that you
1:10:16
must start the training with the weights
1:10:18
that you downloaded with ResNet because
1:10:20
that is the crown jewel that's been
1:10:21
learned so you want to start from there.
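A sketch of that end-to-end setup; the small learning rate and the 256-unit head are assumptions, not the homework solution:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Start from the downloaded ImageNet weights, the crown jewels
conv_base = keras.applications.ResNet50(weights="imagenet",
                                        include_top=False,
                                        input_shape=(224, 224, 3))
conv_base.trainable = True                  # train the whole thing end to end

inputs = keras.Input(shape=(224, 224, 3))
x = keras.applications.resnet50.preprocess_input(inputs)   # ResNet's preprocessing
x = conv_base(x)
x = layers.Flatten()(x)
x = layers.Dense(256, activation="relu")(x)                # the little hidden layer
outputs = layers.Dense(1, activation="sigmoid")(x)         # handbag or shoe
model = keras.Model(inputs, outputs)

# A tiny learning rate so the pre-trained weights are only nudged
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-5),
              loss="binary_crossentropy", metrics=["accuracy"])
```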
1:10:23
Uh and you will do this in homework one.
1:10:26
Okay? All right. Uh by the way, these
1:10:28
pre-trained models are available all
1:10:29
over the internet. There is the
1:10:30
TensorFlow hub, the PyTorch hub and then
1:10:32
there's the Hugging Face hub. When I
1:10:34
checked it on the 13th yesterday, it had
1:10:36
over half a million models available
1:10:39
for download. Half a million.
1:10:41
I think last year it was like 50,000
1:10:42
when I taught the course. Uh so yes.
1:10:46
I was just wondering, doesn't this make
1:10:49
your neural network susceptible to
1:10:50
adversarial attacks because the weights
1:10:52
have been
1:10:53
pre-trained on a... Yes. There is
1:10:55
some adversarial risk. I'm happy to talk
1:10:57
about it offline.
1:10:59
All right. So that's what we have. So
1:11:01
back to Colab. Okay. So that's what we
1:11:03
have. This is ResNet. So what we do is
1:11:06
and ResNet is all packaged up. It's
1:11:07
available for download. So we download
1:11:09
it here.
1:11:13
And you see here that I'm saying use
1:11:16
include top equals false.
1:11:19
So basically you are telling Keras
1:11:21
uh the top the very final layer of the
1:11:23
thing, don't give it to me. Just give me
1:11:25
everything up to but not including that.
1:11:27
And of course I think of it as left to
1:11:28
right. People think of it as bottom to
1:11:30
top. So the very top
1:11:32
layer, don't give it to me. You're
1:11:34
telling it so that you don't have to
1:11:35
manually go and remove it.
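In code, something like this; ResNet50, the 50-layer variant, is an assumption, but it matches the 23-million-parameter count and the 7 by 7 by 2048 output mentioned below:

```python
from tensorflow import keras

conv_base = keras.applications.ResNet50(
    weights="imagenet",         # the pre-trained ImageNet weights
    include_top=False,          # drop the very top: the final 1000-way layer
    input_shape=(224, 224, 3))
conv_base.summary()             # roughly 23 million parameters
```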
1:11:37
Okay? And then I'm not going to
1:11:39
summarize uh well, I'll just summarize
1:11:40
some of it. Just show you how big it is.
1:11:44
Okay?
1:11:45
23 million parameters.
1:11:48
ResNet. Okay? And I won't plot it
1:11:50
because then I'll be scrolling for 5
1:11:52
minutes. Uh
1:11:53
so let's just do this now. So what we're
1:11:55
now going to do is we're going to run
1:11:56
all the data through this thing and
1:11:58
whatever comes out in that penultimate
1:11:59
thing, I'm going to just grab it and
1:12:00
store it. So that's what this thing
1:12:02
does.
1:12:04
All right. And now we create a
1:12:07
little handy function to do all these
1:12:08
things.
1:12:09
And once I do that,
1:12:11
uh every image has been sent through
1:12:12
ResNet up to but not including the final layer, and
1:12:15
then whatever comes into the final
1:12:16
layer, we're storing it. And then we're
1:12:18
going to create a network where we'll
1:12:19
only feed that information to
1:12:21
a simple network.
1:12:23
Okay?
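A sketch of that handy function; preprocess_input is the standard companion call for Keras's ResNet50 (not read out in the lecture), and train_ds, val_ds, and conv_base are the names assumed above:

```python
import numpy as np
from tensorflow import keras

def extract_features(dataset):
    """Run every batch through headless ResNet and store what comes out."""
    all_features, all_labels = [], []
    for images, labels in dataset:
        x = keras.applications.resnet50.preprocess_input(images)
        all_features.append(conv_base.predict(x, verbose=0))
        all_labels.append(labels)
    return np.concatenate(all_features), np.concatenate(all_labels)

train_features, train_labels = extract_features(train_ds)
val_features, val_labels = extract_features(val_ds)
# train_features.shape -> (98, 7, 7, 2048): one rich tensor per training image
```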
1:12:24
So what is coming out of ResNet, you can
1:12:26
see here 98 examples in the training
1:12:28
data and each example is now a 7 by 7 by
1:12:31
2048 tensor.
1:12:33
That's what came out of ResNet and you
1:12:35
saw that's what I did there.
1:12:37
Okay?
1:12:37
All right. So that's what it looks like.
1:12:39
Now let's just create our actual model
1:12:41
now. Right? We have our input which is
1:12:43
just a 7 by 7 by 2048.
1:12:46
We flatten it immediately.
1:12:48
Then we run it through a dense layer
1:12:50
with 256 ReLU neurons and then we use
1:12:52
dropout which I haven't talked about yet
1:12:54
which I will talk about early next week.
1:12:56
Uh but I will come back to it. Don't
1:12:58
worry about this detail for the moment.
1:13:00
Uh and then we just run through a
1:13:01
sigmoid.
1:13:03
Okay? And that's our model.
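The little head model, roughly; the dropout rate is an assumption since it wasn't stated:

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(7, 7, 2048))            # what came out of ResNet
x = layers.Flatten()(inputs)                        # flatten it immediately
x = layers.Dense(256, activation="relu")(x)         # 256 ReLU neurons
x = layers.Dropout(0.5)(x)                          # rate 0.5 is an assumption
outputs = layers.Dense(1, activation="sigmoid")(x)  # handbag or shoe
head = keras.Model(inputs, outputs)
head.compile(optimizer="adam", loss="binary_crossentropy",
             metrics=["accuracy"])
head.fit(train_features, train_labels, epochs=10,
         validation_data=(val_features, val_labels))
```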
1:13:05
Finished. Plot the model. This is what
1:13:08
we have. Okay? Model summary.
1:13:13
It's one so far. All right, good. Now
1:13:15
let's actually train this thing.
1:13:18
I'm just going to run it for 10 epochs
1:13:20
because I tried running it uh previously
1:13:22
and it seems to do a fine job in just an
1:13:24
epoch. Okay, it's already done. It's so
1:13:26
fast because we ran everything through
1:13:28
this monster ResNet thing and basically
1:13:31
took all the output values and used them
1:13:33
as a starting point. Right? We don't
1:13:34
have to run it every single time. So you
1:13:36
can see here the accuracy is
1:13:40
quite high.
1:13:44
Wow, interesting. So the 10th epoch
1:13:45
something bad happened.
1:13:48
So maybe I should have stopped at the
1:13:49
ninth epoch. I didn't see this yesterday
1:13:51
when I was running it. So much for
1:13:53
reproducibility. Uh
1:13:55
So let's just run this. Oh wow, look. On
1:13:57
the test set it's achieving 100%
1:13:58
accuracy.
1:14:02
It's unbelievable. Okay folks, now for
1:14:04
the moment of truth. Um all right, I
1:14:06
have a little code snippet here to
1:14:08
capture stuff from the webcam.
1:14:10
Because that last epoch it went down,
1:14:12
I'm a little worried that the demo is
1:14:13
going to flunk.
1:14:14
But you know what? We all have to live
1:14:16
dangerously. So
1:14:18
So here's a little function to predict
1:14:20
what's going to happen.
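A simplified stand-in for that function: it reads a photo from a file rather than the webcam, and the handbag/shoe mapping assumes the alphabetical label order Keras derives from the folder names:

```python
import numpy as np
from tensorflow import keras

def predict_item(path):
    """Classify one photo as handbag or shoe with the two-stage pipeline."""
    img = keras.utils.load_img(path, target_size=(224, 224))
    x = keras.utils.img_to_array(img)[np.newaxis, ...]   # shape (1, 224, 224, 3)
    x = keras.applications.resnet50.preprocess_input(x)
    features = conv_base.predict(x, verbose=0)           # headless ResNet
    p = float(head.predict(features, verbose=0)[0, 0])   # sigmoid output
    return "shoe" if p > 0.5 else "handbag"              # handbags=0, shoes=1
```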
1:14:21
Okay. Now I tried it at home yesterday
1:14:23
by the way.
1:14:24
I did, and it's like, "Yay, it's a
1:14:26
handbag."
1:14:27
So okay. Now let's just do something
1:14:29
else.
1:14:30
Okay. Any volunteers?
1:14:32
I want a piece of footwear
1:14:34
or a handbag.
1:14:37
It's like a backpack, right?
1:14:39
I don't know. It feels like an
1:14:40
adversarial example, but yeah, let's
1:14:42
just try it.
1:14:43
Okay.
1:14:45
No disrespect. Let me go
1:14:47
with the shoe first. I have a better
1:14:48
chance of it working.
1:14:50
So
1:14:51
it's a pretty big shoe. If it can't get
1:14:53
this shoe, I'm worried about this model.
1:14:55
All right. So
1:15:05
Okay. Hold on. Hold on. Hold on.
1:15:07
All right.
1:15:10
Please don't get distracted by my hand.
1:15:14
Capture.
1:15:16
It's a shoe! Look at that.
1:15:21
Phew. All right. Thanks.
1:15:25
Okay. Now let's try that. I'm feeling
1:15:26
kind of brave now.
1:15:28
Thank you. All right. Let's do this.
1:15:32
All right.
1:15:34
Camera capture.
1:15:40
Okay.
1:15:44
Put its better side.
1:15:54
It's a handbag! Look at that.
1:15:59
I swear every time I do the demo I age a
1:16:01
few years. So
1:16:03
All right folks, I'm done. Thank you.
— end of transcript —