WEBVTT

00:00:16.719 --> 00:00:19.799
Right folks, good morning.

00:00:19.960 --> 00:00:22.880
Welcome back. I hope you all had a nice

00:00:21.600 --> 00:00:24.800
weekend.

00:00:22.879 --> 00:00:26.919
Uh, and I hope you had a chance to watch

00:00:24.800 --> 00:00:28.920
the video walk-through I posted

00:00:26.920 --> 00:00:31.080
yesterday. Um, it's going to save us

00:00:28.920 --> 00:00:33.400
some time today. So, let's get right in.

00:00:31.079 --> 00:00:35.159
Today is going to be super packed. Um,

00:00:33.399 --> 00:00:36.759
you're going to go from not knowing

00:00:35.159 --> 00:00:38.439
anything about convolutions perhaps for

00:00:36.759 --> 00:00:39.839
some of you to actually knowing how

00:00:38.439 --> 00:00:42.839
convolution networks work and actually

00:00:39.840 --> 00:00:44.240
to build one and demo it in class, okay?

00:00:42.840 --> 00:00:45.720
And uh, this demo has actually worked

00:00:44.240 --> 00:00:47.240
pretty well for the last few years that

00:00:45.719 --> 00:00:48.439
I've taught the class, but you never

00:00:47.240 --> 00:00:50.039
know because it's a live demo, it may

00:00:48.439 --> 00:00:51.879
not work. We'll see.

00:00:50.039 --> 00:00:53.519
Um,

00:00:51.880 --> 00:00:54.760
Valentine's Day gods, may they

00:00:53.520 --> 00:00:56.800
be with us.

00:00:54.759 --> 00:01:00.439
Okay, so let's get going. So, Fashion

00:00:56.799 --> 00:01:01.599
MNIST we saw previously, um, i.e. as in,

00:01:00.439 --> 00:01:03.839
you know, in the walk-through,

00:01:01.600 --> 00:01:05.760
the video walk-through, that a neural

00:01:03.840 --> 00:01:08.200
network with a single hidden

00:01:05.760 --> 00:01:11.280
layer can get us to an accuracy in

00:01:08.200 --> 00:01:14.200
the high 80s, okay? Uh, and that

00:01:11.280 --> 00:01:16.239
network actually didn't know that

00:01:14.200 --> 00:01:18.280
what was coming in was an image, right?

00:01:16.239 --> 00:01:19.759
It literally took this table of numbers

00:01:18.280 --> 00:01:21.519
and just took each row and then

00:01:19.760 --> 00:01:23.719
concatenated all the rows into one giant

00:01:21.519 --> 00:01:25.799
long vector and then sent it in. So, the

00:01:23.719 --> 00:01:27.760
neural network did not exploit the fact that

00:01:25.799 --> 00:01:30.280
the input data was sort of known to be

00:01:27.760 --> 00:01:32.760
of a certain type, okay? Which is the

00:01:30.280 --> 00:01:35.159
clue for how can we do better?

00:01:32.760 --> 00:01:38.480
Right? So, let's just spend a few

00:01:35.159 --> 00:01:40.479
minutes on what it is about images

00:01:38.480 --> 00:01:42.719
that we have to really pay attention to,

00:01:40.480 --> 00:01:44.359
okay? As opposed to any arbitrary vector

00:01:42.719 --> 00:01:47.599
of numbers that's coming in.

00:01:44.359 --> 00:01:49.519
Okay? So, when we flatten the image into

00:01:47.599 --> 00:01:50.519
a long vector and feed it into a dense

00:01:49.519 --> 00:01:52.719
layer,

00:01:50.519 --> 00:01:55.119
several undesirable things can actually

00:01:52.719 --> 00:01:59.039
happen.

00:01:55.120 --> 00:01:59.040
What are some of them? Any guesses?

00:02:00.400 --> 00:02:04.560
Uh, yeah.

00:02:02.640 --> 00:02:06.560
I think you lose the proximity of one

00:02:04.560 --> 00:02:07.400
pixel to other ones that would be around

00:02:06.560 --> 00:02:08.719
it.

00:02:07.400 --> 00:02:11.039
Right. So, if you take a particular

00:02:08.719 --> 00:02:13.560
pixel, then let's say that the picture

00:02:11.038 --> 00:02:15.759
shows a t-shirt, um, if there's a little

00:02:13.560 --> 00:02:17.479
pixel in the center of the t-shirt,

00:02:15.759 --> 00:02:19.239
knowing that the surrounding pixels are

00:02:17.479 --> 00:02:21.159
related to the pixel in a way because

00:02:19.240 --> 00:02:23.439
they are all part of this concept called

00:02:21.159 --> 00:02:25.799
a t-shirt, would certainly be helpful,

00:02:23.439 --> 00:02:28.079
right? So, to put it more

00:02:25.800 --> 00:02:30.480
technically, spatial adjacency

00:02:28.080 --> 00:02:32.560
information is very important. And we

00:02:30.479 --> 00:02:34.759
need to somehow take that into account.
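A tiny sketch (my own toy example, not from the lecture materials) of why flattening throws away that spatial adjacency:

```python
import numpy as np

# A toy 4x4 "image". Flattening concatenates the rows into one long
# 16-number vector, exactly what the dense network saw for Fashion-MNIST
# (where 28x28 becomes a 784-long vector).
image = np.arange(16).reshape(4, 4)
flat = image.flatten()

# Vertically adjacent pixels (same column, neighboring rows) end up
# 4 positions apart in the vector; the dense layer is never told
# that they were neighbors.
print(image[0, 0], image[1, 0])   # vertically adjacent in the image
print(flat[0], flat[4])           # 4 positions apart after flattening
```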

00:02:32.560 --> 00:02:37.759
Okay? Um, all right. What else? What

00:02:34.759 --> 00:02:37.759
else might be going on here?

00:02:38.120 --> 00:02:41.439
Uh,

00:02:40.159 --> 00:02:43.000
Yeah,

00:02:41.439 --> 00:02:46.439
you have some metadata about it like the

00:02:43.000 --> 00:02:47.719
resolution, that might be relevant

00:02:46.439 --> 00:02:50.240
Oh, I see. So, if you actually had

00:02:47.719 --> 00:02:51.560
structured data about the image such as,

00:02:50.240 --> 00:02:54.040
you know, various characteristics, that

00:02:51.560 --> 00:02:55.560
might be helpful. True. Now, but let's

00:02:54.039 --> 00:02:57.799
just focus on the case where you only

00:02:55.560 --> 00:03:00.400
have the raw image and nothing else.

00:02:57.800 --> 00:03:02.480
And under that constraint, what else

00:03:00.400 --> 00:03:06.039
might go wrong?

00:03:02.479 --> 00:03:06.039
Or what else might be suboptimal?

00:03:08.199 --> 00:03:12.079
Okay. Well, the first thing that might

00:03:10.000 --> 00:03:15.240
happen is that

00:03:12.080 --> 00:03:17.400
we may have too many parameters.

00:03:15.240 --> 00:03:18.760
So, let's take an example. These

00:03:17.400 --> 00:03:21.439
numbers are from my, you

00:03:18.759 --> 00:03:22.959
know, older iPhone. Uh, I noticed that

00:03:21.439 --> 00:03:27.879
when I take a color picture with my

00:03:22.960 --> 00:03:30.200
phone, it's roughly a 3,000 by 3,000 uh,

00:03:27.879 --> 00:03:34.039
grid, right? So, the picture is actually

00:03:30.199 --> 00:03:37.839
3,024 pixels on this axis, 3,024 on that

00:03:34.039 --> 00:03:40.280
axis, okay? So, that gets us to roughly

00:03:37.840 --> 00:03:41.680
9 million pixels, but remember it's a

00:03:40.280 --> 00:03:43.479
color picture, which means there are

00:03:41.680 --> 00:03:45.360
three channels,

00:03:43.479 --> 00:03:46.959
which means there are 27 million

00:03:45.360 --> 00:03:49.240
numbers,

00:03:46.960 --> 00:03:51.879
each of which is between 0 and 255 from

00:03:49.240 --> 00:03:54.080
that little picture, okay? And now let's

00:03:51.879 --> 00:03:57.319
say we connect it to a single

00:03:54.080 --> 00:03:59.080
100 neuron dense layer.

00:03:57.319 --> 00:04:00.319
A single 100 neuron dense layer. How

00:03:59.080 --> 00:04:01.719
many parameters are we going to have?

00:04:00.319 --> 00:04:04.239
Just in that one little part of the

00:04:01.719 --> 00:04:04.240
network.

00:04:07.000 --> 00:04:13.319
Could the mumbling be louder?

00:04:10.280 --> 00:04:15.919
Yes, roughly 2.7 billion because 27

00:04:13.319 --> 00:04:17.439
million inputs times 100,

00:04:15.919 --> 00:04:19.839
right? Roughly, of course. Forget about

00:04:17.439 --> 00:04:21.000
the biases for a moment, right? It's 2.7

00:04:19.839 --> 00:04:23.479
billion.

00:04:21.000 --> 00:04:25.199
2.7 billion parameters,

00:04:23.480 --> 00:04:27.920
right? Do you think we can actually get

00:04:25.199 --> 00:04:29.680
2.7 billion images to train any of these

00:04:27.920 --> 00:04:32.280
things?

00:04:29.680 --> 00:04:33.920
So, then you're going to overfit.

00:04:32.279 --> 00:04:35.439
Right? Too many parameters. We have to

00:04:33.920 --> 00:04:36.800
be smarter about this.

00:04:35.439 --> 00:04:39.519
It's not going to work.

00:04:36.800 --> 00:04:41.240
Right? That's the first problem.

00:04:39.519 --> 00:04:43.079
So, this clearly is computationally

00:04:41.240 --> 00:04:45.120
demanding, very data hungry, and

00:04:43.079 --> 00:04:46.359
increases the risk of overfitting.

00:04:45.120 --> 00:04:48.920
Okay?
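The arithmetic from the lecture, as a quick sanity check (3,024 by 3,024 pixels, 3 color channels, a 100-neuron dense layer, biases ignored as in the lecture):

```python
# Parameter count for flattening one 3,024 x 3,024 color photo into a
# dense layer of 100 neurons (biases ignored).
height, width, channels = 3024, 3024, 3
inputs = height * width * channels   # ~27 million input numbers
neurons = 100
weights = inputs * neurons           # one weight per input, per neuron

print(f"{inputs:,} inputs")          # 27,433,728 -- roughly 27 million
print(f"{weights:,} weights")        # 2,743,372,800 -- roughly 2.7 billion
```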

00:04:46.360 --> 00:04:48.920
Next,

00:04:49.000 --> 00:04:52.800
we lose spatial adjacency.

00:04:51.279 --> 00:04:55.279
Right? We literally are ignoring what's

00:04:52.800 --> 00:04:55.280
nearby.

00:04:55.480 --> 00:04:58.879
So, that's a huge huge factor. There's a

00:04:57.519 --> 00:05:01.000
third factor,

00:04:58.879 --> 00:05:02.319
right? That we have to worry about,

00:05:01.000 --> 00:05:04.120
which is that

00:05:02.319 --> 00:05:06.199
let's say that, you know, the picture

00:05:04.120 --> 00:05:08.120
has a vertical line

00:05:06.199 --> 00:05:09.599
on the top left side and it has

00:05:08.120 --> 00:05:12.160
some other vertical line on the bottom

00:05:09.600 --> 00:05:12.160
right side.

00:05:12.240 --> 00:05:15.280
What this sort of dumb approach is going

00:05:14.160 --> 00:05:16.640
to do

00:05:15.279 --> 00:05:18.079
is that it's going to learn to

00:05:16.639 --> 00:05:20.000
detect that vertical line on the top

00:05:18.079 --> 00:05:21.159
left and, independent of

00:05:20.000 --> 00:05:24.079
that, it's going to learn to detect the

00:05:21.160 --> 00:05:26.200
vertical line on the bottom right.

00:05:24.079 --> 00:05:27.599
Okay? Which doesn't make any sense.

00:05:26.199 --> 00:05:29.479
A vertical line is a vertical

00:05:27.600 --> 00:05:31.360
line. So, you want to be able to detect

00:05:29.480 --> 00:05:33.879
it wherever it happens.

00:05:31.360 --> 00:05:35.520
Detect once, reuse everywhere.

00:05:33.879 --> 00:05:36.879
That's what you need to do.

00:05:35.519 --> 00:05:38.680
So, this, by the way, is called

00:05:36.879 --> 00:05:40.279
translation invariance.

00:05:38.680 --> 00:05:41.720
Translation is math speak for move stuff

00:05:40.279 --> 00:05:42.959
around.

00:05:41.720 --> 00:05:43.960
Right? You take a line and it moves

00:05:42.959 --> 00:05:45.159
around,

00:05:43.959 --> 00:05:47.239
it doesn't matter, it's still a line.

00:05:45.160 --> 00:05:48.880
Let's figure it out.

00:05:47.240 --> 00:05:50.960
So, these are the three things we

00:05:48.879 --> 00:05:53.199
need to worry about. So, we want to

00:05:50.959 --> 00:05:55.079
learn once and use all over the place.

00:05:53.199 --> 00:05:56.920
We want to take spatial adjacency into

00:05:55.079 --> 00:05:58.199
account, number two. And number three,

00:05:56.920 --> 00:05:59.720
let's just find a way to make sure that

00:05:58.199 --> 00:06:02.319
we don't have billions of parameters for

00:05:59.720 --> 00:06:04.920
simple toy problems.
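A hypothetical one-dimensional illustration of "detect once, reuse everywhere": a single shared detector, slid across the whole input, fires wherever the pattern occurs, no matter where it has been translated to.

```python
# One shared pattern detector applied at every offset of a 1-D signal.
# The same three weights find the pattern at both locations; no second,
# independently learned detector is needed.
signal = [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0]
pattern = [1, 1, 1]

hits = [i for i in range(len(signal) - len(pattern) + 1)
        if signal[i:i + len(pattern)] == pattern]
print(hits)  # the detector fires at positions 2 and 8
```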

00:06:02.319 --> 00:06:04.920
Any questions?

00:06:05.480 --> 00:06:09.280
Yep.

00:06:07.279 --> 00:06:11.879
Um, is this a problem

00:06:09.279 --> 00:06:14.119
just because we are compressing the

00:06:11.879 --> 00:06:15.279
image or would it have happened anyway?

00:06:14.120 --> 00:06:16.439
It would have happened So, the question

00:06:15.279 --> 00:06:18.279
was is it a problem because we are

00:06:16.439 --> 00:06:19.839
compressing the image uh, or would it

00:06:18.279 --> 00:06:20.839
would it have happened anyway? The

00:06:19.839 --> 00:06:22.239
answer is it would have happened anyway.

00:06:20.839 --> 00:06:24.399
You can take any picture, this is going

00:06:22.240 --> 00:06:26.199
to happen, right? Because I'm not making

00:06:24.399 --> 00:06:27.560
any assumptions about how the image is

00:06:26.199 --> 00:06:28.839
coming in to me,

00:06:27.560 --> 00:06:31.240
whether it's compressed or not and so on

00:06:28.839 --> 00:06:31.239
and so forth.

00:06:31.639 --> 00:06:36.240
Okay. All right.

00:06:33.519 --> 00:06:38.599
So, convolutional layers

00:06:36.240 --> 00:06:40.240
were developed to precisely address

00:06:38.600 --> 00:06:44.400
these shortcomings and they're an amazing

00:06:40.240 --> 00:06:44.400
solution, as you will see. Very elegant.

00:06:45.040 --> 00:06:49.080
All right.

00:06:45.800 --> 00:06:51.040
So, the next, I don't know, half an hour

00:06:49.079 --> 00:06:52.359
is going to be me defining a whole bunch

00:06:51.040 --> 00:06:53.560
of stuff

00:06:52.360 --> 00:06:55.439
before we actually get to the fun

00:06:53.560 --> 00:06:57.560
Colabs and so on and so forth.

00:06:55.439 --> 00:06:59.719
Um, so just to put in perspective, I I

00:06:57.560 --> 00:07:01.160
have a PowerPoint,

00:06:59.720 --> 00:07:03.200
two Colabs,

00:07:01.160 --> 00:07:06.040
and an Excel spreadsheet, and maybe even

00:07:03.199 --> 00:07:08.159
a Notability file to cover today.

00:07:06.040 --> 00:07:09.080
Okay? So, but hang on for the next 30

00:07:08.160 --> 00:07:10.600
minutes because it's going to be a

00:07:09.079 --> 00:07:12.279
little concept heavy

00:07:10.600 --> 00:07:14.280
before we get to the fun stuff. So, stop

00:07:12.279 --> 00:07:15.119
me, ask me questions because we do have

00:07:14.279 --> 00:07:17.559
time.

00:07:15.120 --> 00:07:18.920
All right. A convolutional layer is made

00:07:17.560 --> 00:07:20.000
up of something called a convolutional

00:07:18.920 --> 00:07:22.199
filter.

00:07:20.000 --> 00:07:24.720
Okay? That's the atomic building block.

00:07:22.199 --> 00:07:28.159
A convolutional filter is nothing but

00:07:24.720 --> 00:07:29.600
a small matrix of numbers like this.

00:07:28.160 --> 00:07:31.480
It's just a small square matrix of

00:07:29.600 --> 00:07:33.400
numbers. That's a convolutional filter,

00:07:31.480 --> 00:07:35.600
okay? Now,

00:07:33.399 --> 00:07:38.159
a layer is just composed of one or more

00:07:35.600 --> 00:07:39.200
of these filters.

00:07:38.160 --> 00:07:41.400
All right?

00:07:39.199 --> 00:07:42.519
Filters and layers.

00:07:41.399 --> 00:07:44.679
Now,

00:07:42.519 --> 00:07:46.639
the thing about the convolutional filter

00:07:44.680 --> 00:07:48.720
that makes it really magical

00:07:46.639 --> 00:07:50.759
is that if you choose the numbers in a

00:07:48.720 --> 00:07:52.440
filter carefully

00:07:50.759 --> 00:07:53.879
and then you apply the filter to an

00:07:52.439 --> 00:07:56.040
image, and I'll get to what I mean by

00:07:53.879 --> 00:07:57.519
applying the filter,

00:07:56.040 --> 00:07:59.560
if you choose the numbers carefully and

00:07:57.519 --> 00:08:02.399
you apply it to that image,

00:07:59.560 --> 00:08:04.759
this little humble thing has the ability

00:08:02.399 --> 00:08:07.039
to detect features in your image.

00:08:04.759 --> 00:08:09.800
It can detect lines, curves, gradations

00:08:07.040 --> 00:08:11.360
in color, circles, things like that,

00:08:09.800 --> 00:08:12.800
okay? It's pretty cool.

00:08:11.360 --> 00:08:14.080
And so,

00:08:12.800 --> 00:08:15.920
I'm going to claim and I'm going to

00:08:14.079 --> 00:08:17.719
prove shortly that this little humble

00:08:15.920 --> 00:08:19.560
filter with the ones and zeros, it can

00:08:17.720 --> 00:08:21.160
detect horizontal lines in any picture

00:08:19.560 --> 00:08:22.079
you give it.

00:08:21.160 --> 00:08:23.760
Okay?

00:08:22.079 --> 00:08:27.000
This thing here has the

00:08:23.759 --> 00:08:28.959
ability to detect vertical lines.

00:08:27.000 --> 00:08:30.560
All right? So, I will demonstrate how

00:08:28.959 --> 00:08:33.038
this thing actually detects all these

00:08:30.560 --> 00:08:34.360
things and then we will ask the big

00:08:33.038 --> 00:08:35.879
question that's probably in your minds

00:08:34.360 --> 00:08:37.840
already, where are we going to get these

00:08:35.879 --> 00:08:39.038
numbers from?

00:08:37.840 --> 00:08:41.000
That all sounds great, Rama. Where are

00:08:39.038 --> 00:08:42.479
we going to get the numbers from? Okay?

00:08:41.000 --> 00:08:43.879
And we have a beautiful answer to that

00:08:42.479 --> 00:08:46.520
question.

00:08:43.879 --> 00:08:47.919
All right. So, let's go. Um, now I'm

00:08:46.519 --> 00:08:50.919
going to first explain to you what I

00:08:47.919 --> 00:08:52.679
mean by applying a filter to an image

00:08:50.919 --> 00:08:54.120
and then I'm going to give you examples

00:08:52.679 --> 00:08:56.120
of how the filter works for detecting

00:08:54.120 --> 00:08:58.320
vertical and horizontal lines. So, all

00:08:56.120 --> 00:09:00.200
right. So, let's say that this is the

00:08:58.320 --> 00:09:02.280
image we have.

00:09:00.200 --> 00:09:04.280
Okay? Again, an image. Assume it's a

00:09:02.279 --> 00:09:06.079
grayscale image. So, you just have a

00:09:04.279 --> 00:09:07.759
bunch of numbers between 0 and 255,

00:09:06.080 --> 00:09:09.720
okay? So, this is the image

00:09:07.759 --> 00:09:10.919
we have. It's a little tiny image.

00:09:09.720 --> 00:09:13.200
And this is the filter that's been

00:09:10.919 --> 00:09:14.279
magically given to us by somebody.

00:09:13.200 --> 00:09:17.040
And what we are trying to do now is to

00:09:14.279 --> 00:09:19.879
apply it, okay? So, what we do is that

00:09:17.039 --> 00:09:22.719
we literally take this filter,

00:09:19.879 --> 00:09:24.720
the little one, and then we superimpose

00:09:22.720 --> 00:09:26.840
it on the top left part of the image.

00:09:24.720 --> 00:09:28.639
So, you have the image here, you take

00:09:26.840 --> 00:09:30.320
this little filter, and then you move it

00:09:28.639 --> 00:09:32.080
to the top left so that they are sort of

00:09:30.320 --> 00:09:33.240
right on top of each other.

00:09:32.080 --> 00:09:34.879
Okay?

00:09:33.240 --> 00:09:35.840
Once you have it right on top of each

00:09:34.879 --> 00:09:37.439
other,

00:09:35.840 --> 00:09:39.600
you have these matching numbers. You

00:09:37.440 --> 00:09:41.360
have three numbers in the image, there

00:09:39.600 --> 00:09:42.879
are three numbers in the filter, and

00:09:41.360 --> 00:09:44.039
they're all matching each other right on

00:09:42.879 --> 00:09:46.439
top of each other, right? So, you have

00:09:44.039 --> 00:09:48.919
nine pairs of numbers.

00:09:46.440 --> 00:09:50.880
And then what we do, once we overlay it,

00:09:48.919 --> 00:09:53.879
we literally just multiply all the

00:09:50.879 --> 00:09:55.519
matching numbers and add them up.

00:09:53.879 --> 00:09:57.080
Okay? You just multiply the matching

00:09:55.519 --> 00:09:58.600
numbers and add them up, and you can confirm

00:09:57.080 --> 00:09:59.759
later on that, you know, the

00:09:58.600 --> 00:10:01.759
arithmetic I'm doing is actually

00:09:59.759 --> 00:10:03.159
accurate. Okay?

00:10:01.759 --> 00:10:04.559
And once you do that you'll get some

00:10:03.159 --> 00:10:05.559
number.

00:10:04.559 --> 00:10:06.879
Right?

00:10:05.559 --> 00:10:09.039
Um

00:10:06.879 --> 00:10:11.039
once you get that number

00:10:09.039 --> 00:10:12.360
what we do is we go to our good old

00:10:11.039 --> 00:10:15.559
friend the relu

00:10:12.360 --> 00:10:16.759
and then we just run it through a relu.

00:10:15.559 --> 00:10:19.119
Now, in this case all that effort comes

00:10:16.759 --> 00:10:22.159
to nothing because the result is zero. That's okay.

00:10:19.120 --> 00:10:26.000
Okay? So, zero and this number becomes

00:10:22.159 --> 00:10:26.000
the top left cell of your output.

00:10:26.679 --> 00:10:29.639
So, this is called the convolution

00:10:28.120 --> 00:10:30.360
operation.

00:10:29.639 --> 00:10:31.600
Okay?

00:10:30.360 --> 00:10:32.639
And we won't get into why it's called

00:10:31.600 --> 00:10:34.560
that and so on and so forth. There's a

00:10:32.639 --> 00:10:35.840
long and rich and storied history of

00:10:34.559 --> 00:10:38.239
these things.

00:10:35.840 --> 00:10:40.320
But this is the convolution operation.

00:10:38.240 --> 00:10:42.080
And once we do that you sort of can now

00:10:40.320 --> 00:10:44.120
predict what's going to happen, right?

00:10:42.080 --> 00:10:46.920
We take the same exact operation and we

00:10:44.120 --> 00:10:48.360
just move it to the right.

00:10:46.919 --> 00:10:51.439
We move this little 3 by 3 thing to the

00:10:48.360 --> 00:10:53.000
right and repeat the exact same process.

00:10:51.440 --> 00:10:54.640
Matching numbers

00:10:53.000 --> 00:10:55.960
uh, you know, multiply all

00:10:54.639 --> 00:10:58.559
the matching numbers together, add them

00:10:55.960 --> 00:10:59.400
up, run them through a relu.

00:10:58.559 --> 00:11:01.359
Okay?

00:10:59.399 --> 00:11:03.720
And then boom, you get the

00:11:01.360 --> 00:11:05.800
second number here.

00:11:03.720 --> 00:11:07.200
And you keep doing that till you reach

00:11:05.799 --> 00:11:08.559
the very end. You fill up all these

00:11:07.200 --> 00:11:11.480
numbers and then you come to

00:11:08.559 --> 00:11:12.559
the top of the second row.

00:11:11.480 --> 00:11:14.039
Okay?

00:11:12.559 --> 00:11:16.439
And you keep on doing that till you

00:11:14.039 --> 00:11:18.919
reach the very bottom.

00:11:16.440 --> 00:11:21.000
So, this is what I mean when I say apply

00:11:18.919 --> 00:11:22.159
a filter to an image.
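The slide-multiply-add-ReLU procedure just described can be written out in a few lines of numpy. This is a minimal sketch of my own; the filter values below are illustrative, not the exact numbers on the slide.

```python
import numpy as np

def apply_filter(image, filt):
    """Slide the filter over the image; at each position, multiply the
    matching numbers, add them up, and run the sum through a ReLU."""
    fh, fw = filt.shape
    out_h = image.shape[0] - fh + 1   # stop when the filter's bottom edge
    out_w = image.shape[1] - fw + 1   # and right edge meet the image's
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            window = image[r:r + fh, c:c + fw]                  # nine matching pairs
            out[r, c] = max(0.0, float(np.sum(window * filt)))  # multiply-add, ReLU
    return out

# An illustrative horizontal-line filter and a tiny image containing
# one bright horizontal row.
horiz = np.array([[ 1,  1,  1],
                  [ 0,  0,  0],
                  [-1, -1, -1]], dtype=float)
image = np.zeros((6, 6))
image[2, :] = 1.0
result = apply_filter(image, horiz)   # row 2 of the output lights up
```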

00:11:21.000 --> 00:11:24.720
Okay?

00:11:22.159 --> 00:11:24.719
Any questions?

00:11:25.159 --> 00:11:27.399
Okay.

00:11:29.480 --> 00:11:33.519
Microphone, please.

00:11:31.080 --> 00:11:33.520
Microphone.

00:11:35.000 --> 00:11:38.200
What happens when

00:11:36.639 --> 00:11:39.360
you reach the edge of the image

00:11:38.200 --> 00:11:41.839
and you have to stop, but

00:11:39.360 --> 00:11:41.839
the remaining part of

00:11:42.120 --> 00:11:46.159
the filter doesn't perfectly match?

00:11:44.440 --> 00:11:47.839
Yeah, so you start from the left and

00:11:46.159 --> 00:11:49.480
then you keep on going. At some point

00:11:47.839 --> 00:11:51.000
the right edge of the filter is going to

00:11:49.480 --> 00:11:52.080
match the right edge of the image and

00:11:51.000 --> 00:11:55.360
then you stop.

00:11:52.080 --> 00:11:58.160
Yeah. Now, there are some nuances here.

00:11:55.360 --> 00:11:59.879
So, for example, you can actually pad

00:11:58.159 --> 00:12:01.879
the whole image

00:11:59.879 --> 00:12:03.639
on its borders so that you can actually

00:12:01.879 --> 00:12:04.879
go outside the image and it'll still

00:12:03.639 --> 00:12:08.519
work.

00:12:04.879 --> 00:12:10.159
Okay? That's nuance number one. Nuance number two:

00:12:08.519 --> 00:12:11.679
Instead of just moving one step to the

00:12:10.159 --> 00:12:13.879
right every time you finish, you can

00:12:11.679 --> 00:12:15.479
move two steps to the right.

00:12:13.879 --> 00:12:17.879
Right? And that's something called a

00:12:15.480 --> 00:12:20.240
stride. Okay? So, there are a bunch of

00:12:17.879 --> 00:12:22.600
pesky details here. But I'm just

00:12:20.240 --> 00:12:24.919
ignoring them because this basic default

00:12:22.600 --> 00:12:27.639
approach works amazingly well

00:12:24.919 --> 00:12:27.639
almost all the time.
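Those two nuances, padding and stride, only change how far the filter travels. A small helper (my own sketch, using the standard output-size formula) shows their effect:

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Length of one output axis when a size-k filter slides over an
    n-pixel axis, with `padding` zeros added on each side."""
    return (n + 2 * padding - k) // stride + 1

# Default: no padding, stride 1 -- the filter stops at the image's edge.
print(conv_output_size(28, 3))                       # 26
# Pad one pixel per side: a 3x3 filter now preserves the size.
print(conv_output_size(28, 3, padding=1))            # 28
# Stride 2: move two steps at a time, so the output roughly halves.
print(conv_output_size(28, 3, stride=2, padding=1))  # 14
```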

00:12:27.879 --> 00:12:31.039
Okay? All right. So, that's

00:12:29.839 --> 00:12:33.920
the mechanics of how this

00:12:31.039 --> 00:12:35.120
operation works. Um all right. Now, I'm

00:12:33.919 --> 00:12:37.120
going to switch to a spreadsheet which

00:12:35.120 --> 00:12:41.000
shows this really beautifully

00:12:37.120 --> 00:12:43.279
courtesy of the fast.ai people.

00:12:41.000 --> 00:12:44.600
All right. So, what I'm going to do here

00:12:43.279 --> 00:12:45.679
because it's a big spreadsheet, I'll upload

00:12:44.600 --> 00:12:48.399
the spreadsheet after class so you can

00:12:45.679 --> 00:12:50.079
see it. So, all I have done here, rather

00:12:48.399 --> 00:12:51.838
all they have done here

00:12:50.080 --> 00:12:53.360
thanks to them, is that they have

00:12:51.839 --> 00:12:55.320
essentially created a table of numbers

00:12:53.360 --> 00:12:57.399
in Excel as you can tell.

00:12:55.320 --> 00:12:59.280
And they have just put some numbers.

00:12:57.399 --> 00:13:01.720
Most of the numbers are zero, but

00:12:59.279 --> 00:13:03.720
some of these numbers are more than

00:13:01.720 --> 00:13:04.920
zero. They're like 0.8, 0.9 and so on.

00:13:03.720 --> 00:13:06.320
Basically, all they have done is instead

00:13:04.919 --> 00:13:08.159
of working with numbers between zero and

00:13:06.320 --> 00:13:10.080
255, they're just dividing all the

00:13:08.159 --> 00:13:11.199
numbers by 255 so you get fractions and

00:13:10.080 --> 00:13:13.440
they just put the fractions in the

00:13:11.200 --> 00:13:15.680
table. Okay? And then they have

00:13:13.440 --> 00:13:16.920
used Excel's very cool conditional

00:13:15.679 --> 00:13:19.719
formatting

00:13:16.919 --> 00:13:21.759
to essentially mark in red all the

00:13:19.720 --> 00:13:23.200
values that are high. Right? If the

00:13:21.759 --> 00:13:24.679
number is closer to one, the more

00:13:23.200 --> 00:13:26.520
reddish it gets.

00:13:24.679 --> 00:13:28.599
Okay? And when you do that the three

00:13:26.519 --> 00:13:31.039
obviously pops out.

00:13:28.600 --> 00:13:33.320
So, there is a three in the image. Yes?

00:13:31.039 --> 00:13:35.559
Okay, good. So, now

00:13:33.320 --> 00:13:37.920
what we're going to do is we're going to

00:13:35.559 --> 00:13:39.519
move to our little filter here.

00:13:37.919 --> 00:13:41.199
You can see the filter.

00:13:39.519 --> 00:13:44.519
Right? And I'm claiming this detects

00:13:41.200 --> 00:13:47.000
horizontal lines. And this table

00:13:44.519 --> 00:13:47.000
here

00:13:47.159 --> 00:13:49.399
Sorry.

00:13:51.320 --> 00:13:56.040
This table here is the result of

00:13:53.440 --> 00:13:58.120
applying that filter to the three.

00:13:56.039 --> 00:14:01.039
Okay? And you can see here I'm looking

00:13:58.120 --> 00:14:03.080
at the top left cell here.

00:14:01.039 --> 00:14:03.799
Um

00:14:03.080 --> 00:14:05.400
This is

00:14:03.799 --> 00:14:07.199
Look at this top left cell. The formula

00:14:05.399 --> 00:14:08.759
is nothing more than

00:14:07.200 --> 00:14:10.680
you know, multiply all those things and

00:14:08.759 --> 00:14:12.759
add them up. And then once you add it

00:14:10.679 --> 00:14:15.319
up, run it through a max of zero comma

00:14:12.759 --> 00:14:18.078
that, which is just the relu.

00:14:15.320 --> 00:14:19.480
Okay? Basic arithmetic.

00:14:18.078 --> 00:14:21.838
So, we do that.

00:14:19.480 --> 00:14:24.560
And this is the output and the output is

00:14:21.839 --> 00:14:26.680
also conditionally formatted to show you

00:14:24.559 --> 00:14:30.838
where things are lighting up.

00:14:26.679 --> 00:14:34.479
And you can see only the horizontal

00:14:30.839 --> 00:14:35.839
lines of the three are lighting up.

00:14:34.480 --> 00:14:36.720
Everyone see that?

00:14:35.839 --> 00:14:38.720
Right?

00:14:36.720 --> 00:14:41.000
So, now you understand the

00:14:38.720 --> 00:14:42.440
filter in fact is living up to the claim

00:14:41.000 --> 00:14:44.839
I made for it.

00:14:42.440 --> 00:14:46.079
Right? Similarly,

00:14:44.839 --> 00:14:47.839
if you look at what's going on here,

00:14:46.078 --> 00:14:50.159
this is a vertical filter, the same

00:14:47.839 --> 00:14:53.440
thing, you apply it, only the vertical

00:14:50.159 --> 00:14:53.439
line is lighting up.

00:14:53.480 --> 00:14:57.720
Right? Now, what you can do is

00:14:56.159 --> 00:15:00.279
uh I would encourage you to do this, you

00:14:57.720 --> 00:15:02.519
know, um after class, is you can look at

00:15:00.279 --> 00:15:04.759
all these numbers here, for example, and

00:15:02.519 --> 00:15:06.480
then ask yourself, "Okay, why is that

00:15:04.759 --> 00:15:08.759
lighting up?"

00:15:06.480 --> 00:15:11.039
Right? And you will discover that what's

00:15:08.759 --> 00:15:12.519
actually going on is that it's looking

00:15:11.039 --> 00:15:14.639
for edges.

00:15:12.519 --> 00:15:16.319
It's looking for, you know, you're

00:15:14.639 --> 00:15:18.600
looking for rows in the table where

00:15:16.320 --> 00:15:21.760
there is some nonzero thing in the first

00:15:18.600 --> 00:15:23.519
row and zeros in the second row.

00:15:21.759 --> 00:15:25.120
And by choosing the numbers carefully,

00:15:23.519 --> 00:15:27.519
you multiply the ones with positive

00:15:25.120 --> 00:15:29.159
numbers and you multiply the zeros with

00:15:27.519 --> 00:15:31.039
zeros and then you'll come up with a

00:15:29.159 --> 00:15:32.879
positive number and thereby you detect

00:15:31.039 --> 00:15:34.120
an edge.

00:15:32.879 --> 00:15:39.399
Right? So, what I would encourage you to

00:15:34.120 --> 00:15:39.399
do is use the this Excel thing here.

00:15:39.639 --> 00:15:46.159
All right. So, here is a cell we

00:15:41.279 --> 00:15:46.159
have. So, let's uh trace its

00:15:48.240 --> 00:15:51.399
computation.

00:15:49.600 --> 00:15:53.000
Okay.

00:15:51.399 --> 00:15:56.078
So, you can see here

00:15:53.000 --> 00:15:56.078
these numbers

00:15:56.159 --> 00:16:00.639
Right? This is what it's processing.

00:15:59.120 --> 00:16:01.959
Right? That is this grid is being

00:16:00.639 --> 00:16:04.360
processed to come up with that big

00:16:01.958 --> 00:16:06.159
number. And you can see here in this

00:16:04.360 --> 00:16:08.560
grid, all these numbers are high,

00:16:06.159 --> 00:16:11.120
and then these numbers are a lot

00:16:08.559 --> 00:16:13.319
lower than those numbers, because

00:16:11.120 --> 00:16:14.519
there is an edge.

00:16:13.320 --> 00:16:16.360
Right? The numbers are a lot lower.

00:16:14.519 --> 00:16:17.759
That's why you can see the horizontal

00:16:16.360 --> 00:16:19.959
part of the three.

00:16:17.759 --> 00:16:22.559
And so, what this filter is doing, it's

00:16:19.958 --> 00:16:24.399
basically saying, "Well,

00:16:22.559 --> 00:16:26.199
the row that I'm catching here has the

00:16:24.399 --> 00:16:27.720
ones, the middle has zeros, the rest are

00:16:26.200 --> 00:16:29.480
all minus ones."

00:16:27.720 --> 00:16:31.440
Right? So, the small values are going to

00:16:29.480 --> 00:16:33.120
get very small.

00:16:31.440 --> 00:16:34.040
The big values are going to get very big

00:16:33.120 --> 00:16:35.639
and the overall thing is going to be

00:16:34.039 --> 00:16:37.000
emphasized.

00:16:35.639 --> 00:16:38.360
So, that's the basic idea of edge

00:16:37.000 --> 00:16:39.958
detection.

00:16:38.360 --> 00:16:41.480
Spend some time with the Excel file

00:16:39.958 --> 00:16:43.119
and it'll become clear to you

00:16:41.480 --> 00:16:46.079
what I'm talking about here.

00:16:43.120 --> 00:16:48.399
All right, cool. So, that's that.

00:16:46.078 --> 00:16:49.759
All right. Uh, by the way, there is also

00:16:48.399 --> 00:16:50.759
a very cool little site

00:16:49.759 --> 00:16:52.279
here

00:16:50.759 --> 00:16:53.879
in which you can actually go in and

00:16:52.279 --> 00:16:55.039
punch in your own numbers and see what

00:16:53.879 --> 00:16:56.838
it detects.

00:16:55.039 --> 00:16:58.319
Right? Lot of edges and curves and this

00:16:56.839 --> 00:17:00.040
and that. It's very cool. So, I

00:16:58.320 --> 00:17:04.680
encourage you to try it out.

00:17:00.039 --> 00:17:04.680
So, the key thing here I want to say is

00:17:06.640 --> 00:17:10.160
by choosing the numbers in a filter

00:17:08.160 --> 00:17:12.120
carefully and applying this operation

00:17:10.160 --> 00:17:13.720
different features can be

00:17:12.119 --> 00:17:14.639
detected. All right.

00:17:13.720 --> 00:17:16.199
Now,

00:17:14.640 --> 00:17:18.079
I mentioned earlier that a convolution

00:17:16.199 --> 00:17:20.519
layer is composed of one or more of

00:17:18.078 --> 00:17:23.519
these filters. So, one or more of these

00:17:20.519 --> 00:17:25.759
filters. And so, you can think of each

00:17:23.519 --> 00:17:27.959
filter as a sort of a specialist for a

00:17:25.759 --> 00:17:30.279
particular feature.

00:17:27.959 --> 00:17:32.200
Okay? So, it's a specialist. Maybe it

00:17:30.279 --> 00:17:34.079
specializes in detecting vertical lines,

00:17:32.200 --> 00:17:35.720
horizontal lines, you know, uh

00:17:34.079 --> 00:17:38.079
semicircles, quarter circles, you don't

00:17:35.720 --> 00:17:39.799
know. Right? You can imagine each of them

00:17:35.720 --> 00:17:39.799
as being a specialist.

00:17:39.799 --> 00:17:43.799
And given that modern images could be

00:17:42.079 --> 00:17:45.359
very complicated, they may have lots of

00:17:43.799 --> 00:17:46.678
interesting features going on, you

00:17:45.359 --> 00:17:48.359
probably want to have lots of these

00:17:46.679 --> 00:17:52.360
filters.

00:17:48.359 --> 00:17:54.719
Okay? But the key is that you

00:17:52.359 --> 00:17:56.559
don't have to decide up front, "Hey, you

00:17:54.720 --> 00:17:57.880
filter, you better specialize in

00:17:56.559 --> 00:18:00.119
detecting vertical lines, and you, on the

00:17:57.880 --> 00:18:01.320
other hand, stay in your lane and do

00:18:00.119 --> 00:18:02.559
horizontal lines." Right? You're not going

00:18:01.319 --> 00:18:04.039
to do that.

00:18:02.559 --> 00:18:06.559
You will let the system figure out what

00:18:04.039 --> 00:18:08.678
it wants to figure out.

00:18:06.559 --> 00:18:10.200
Okay? So, there is no human bottleneck

00:18:08.679 --> 00:18:11.800
in doing this.

00:18:10.200 --> 00:18:13.600
And I mentioned this because there used

00:18:11.799 --> 00:18:15.799
to be a human bottleneck, you know,

00:18:13.599 --> 00:18:17.559
before deep learning happened.

00:18:15.799 --> 00:18:19.399
And so,

00:18:17.559 --> 00:18:20.599
now let's just make sure we

00:18:19.400 --> 00:18:22.120
understand the mechanics of what happens

00:18:20.599 --> 00:18:24.439
when you have two of these filters, not

00:18:22.119 --> 00:18:26.119
one. So, this is the input image as

00:18:24.440 --> 00:18:28.159
before. This is the filter we saw

00:18:26.119 --> 00:18:29.399
earlier and this is another filter we

00:18:28.159 --> 00:18:30.440
have.

00:18:29.400 --> 00:18:32.120
The thing is we just run them in

00:18:30.440 --> 00:18:33.440
parallel. We take each filter, do the

00:18:32.119 --> 00:18:34.839
operation, come up with an output. Take

00:18:33.440 --> 00:18:36.679
the other filter, do the operation, come

00:18:34.839 --> 00:18:38.279
up with its output. And then when you do

00:18:36.679 --> 00:18:40.480
that, the first one gives you that, the

00:18:38.279 --> 00:18:42.799
second one gives you that. And this

00:18:40.480 --> 00:18:44.799
output is a table of some sort. It's

00:18:42.799 --> 00:18:47.200
actually not a table. What

00:18:44.799 --> 00:18:47.200
is it?

00:18:49.159 --> 00:18:54.040
Louder, please.

00:18:51.359 --> 00:18:56.439
It's a tensor. Thank you. It's a tensor.

00:18:54.039 --> 00:18:59.960
And so, these two 5 by 5 matrices can be

00:18:56.440 --> 00:18:59.960
represented as a tensor of what shape?

00:19:02.079 --> 00:19:06.439
And there are two right answers.

00:19:04.919 --> 00:19:08.600
5 by 5

00:19:06.440 --> 00:19:11.480
by two, correct. So, you can

00:19:08.599 --> 00:19:14.439
either think of it as 5 by 5 by 2 or 2 by

00:19:11.480 --> 00:19:15.799
5 by 5. They're both fine.

00:19:14.440 --> 00:19:18.679
Which one you go with actually ends

00:19:15.799 --> 00:19:20.839
up being a matter of convention.

00:19:18.679 --> 00:19:22.640
Okay? So, now you begin to see why we

00:19:20.839 --> 00:19:24.079
care about tensors.

00:19:22.640 --> 00:19:27.960
Imagine if instead of having two

00:19:24.079 --> 00:19:29.839
filters, we have 103 filters.

00:19:27.960 --> 00:19:32.799
The resulting tensor is going to be 5 by

00:19:29.839 --> 00:19:32.799
5 by 103.
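A quick Python sketch of the two stacking conventions (my illustration, not from the lecture; the two feature maps are placeholders):

```python
# Two hypothetical 5x5 feature maps, one per filter.
map_a = [[1] * 5 for _ in range(5)]
map_b = [[2] * 5 for _ in range(5)]

# "2 by 5 by 5" convention: stack the maps along a new leading axis.
channels_first = [map_a, map_b]

# "5 by 5 by 2" convention: one pair of channel values per position.
channels_last = [[[map_a[i][j], map_b[i][j]] for j in range(5)]
                 for i in range(5)]

print(len(channels_first), len(channels_first[0]), len(channels_first[0][0]))  # 2 5 5
print(len(channels_last), len(channels_last[0]), len(channels_last[0][0]))     # 5 5 2
```

Same numbers either way; only the bookkeeping of which axis holds the filters differs.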

00:19:33.559 --> 00:19:35.480
Okay.

00:19:34.720 --> 00:19:37.400
Good.

00:19:35.480 --> 00:19:39.679
Um all right. Now,

00:19:37.400 --> 00:19:42.600
let's now look at the slightly more

00:19:39.679 --> 00:19:44.720
complex situation where you have not a

00:19:42.599 --> 00:19:46.799
black-and-white, grayscale image

00:19:44.720 --> 00:19:48.440
with just a single table, but an actual

00:19:46.799 --> 00:19:51.119
color image.

00:19:48.440 --> 00:19:54.240
Okay? So, we know how to apply a

00:19:51.119 --> 00:19:56.359
filter to a 2D tensor like this and to

00:19:54.240 --> 00:19:58.400
get that. But let's say we have

00:19:56.359 --> 00:20:00.000
something like this where it has

00:19:58.400 --> 00:20:02.120
three, right? It's got three channels,

00:20:00.000 --> 00:20:03.519
red, blue, green, RGB. It's got three

00:20:02.119 --> 00:20:06.399
tables of numbers.

00:20:03.519 --> 00:20:08.599
So, this is a tensor of shape 6 by 6 by

00:20:06.400 --> 00:20:11.120
3, let's say, and you want to apply this

00:20:08.599 --> 00:20:12.480
3 by 3 filter just like before to this

00:20:11.119 --> 00:20:16.599
thing. You want to apply the convolution

00:20:12.480 --> 00:20:16.599
operation. How's that going to work?

00:20:18.440 --> 00:20:23.200
Do we just apply this to each channel?

00:20:21.640 --> 00:20:25.400
We first apply it to the red, then we

00:20:23.200 --> 00:20:29.519
apply it to the green, then we

00:20:25.400 --> 00:20:29.519
apply it to the blue. Should we do that?

00:20:30.079 --> 00:20:35.199
Or is there a

00:20:31.960 --> 00:20:35.200
problem with that approach?

00:20:36.039 --> 00:20:38.359
Yeah.

00:20:39.960 --> 00:20:43.559
Could you use the microphone, please?

00:20:42.079 --> 00:20:45.279
Uh the problem with the approach, I

00:20:43.559 --> 00:20:47.399
think, would be the same as what you

00:20:45.279 --> 00:20:49.079
said earlier, that it would learn the

00:20:47.400 --> 00:20:50.360
lines probably the same in each channel,

00:20:49.079 --> 00:20:51.599
right?

00:20:50.359 --> 00:20:54.039
Like the location of the lines is

00:20:51.599 --> 00:20:55.319
probably the same in each channel.

00:20:54.039 --> 00:20:57.599
Yes, the location of the line is going

00:20:55.319 --> 00:20:59.399
to be the same thing because that line,

00:20:57.599 --> 00:21:00.879
if you will, is sort of the

00:20:59.400 --> 00:21:03.320
aggregation of information from the

00:21:00.880 --> 00:21:05.080
three different channels. Right. But the

00:21:03.319 --> 00:21:07.200
problem here

00:21:05.079 --> 00:21:09.599
is sort of slightly different,

00:21:07.200 --> 00:21:12.000
which is that

00:21:09.599 --> 00:21:15.279
If you do them independently,

00:21:12.000 --> 00:21:17.599
the network has not been informed that

00:21:15.279 --> 00:21:19.759
these things are all part of the same

00:21:17.599 --> 00:21:21.039
underlying concept.

00:21:19.759 --> 00:21:22.160
As far as it's concerned, it's just like

00:21:21.039 --> 00:21:23.759
three things. It's just going to process

00:21:22.160 --> 00:21:25.800
them independently. So, we need to

00:21:23.759 --> 00:21:27.879
somehow change the filter so that it

00:21:25.799 --> 00:21:29.919
understands like what is at this pixel

00:21:27.880 --> 00:21:31.800
location, the three numbers under it,

00:21:29.920 --> 00:21:35.080
RGB, are actually part of

00:21:31.799 --> 00:21:37.919
the same underlying thing.

00:21:35.079 --> 00:21:42.399
So, what we do is actually very simple.

00:21:37.920 --> 00:21:42.400
We just take this filter and make it 3D.

00:21:42.599 --> 00:21:45.959
So, we take this filter, instead of

00:21:44.240 --> 00:21:49.240
having just one of them, we just make it

00:21:45.960 --> 00:21:51.680
a cube like that. Three times.

00:21:49.240 --> 00:21:53.839
And once we do that, you can imagine

00:21:51.680 --> 00:21:56.279
taking this thing here and essentially

00:21:53.839 --> 00:21:58.480
doing that.

00:21:56.279 --> 00:22:00.119
Okay. Now, instead of having, you know,

00:21:58.480 --> 00:22:01.799
nine numbers in the image and nine

00:22:00.119 --> 00:22:04.159
numbers in the filter,

00:22:01.799 --> 00:22:05.678
you have 27 numbers in the image, 27

00:22:04.160 --> 00:22:07.720
numbers in the filter.

00:22:05.679 --> 00:22:09.400
But you still match them up, multiply

00:22:07.720 --> 00:22:11.759
them, add them up, run them through a

00:22:09.400 --> 00:22:11.759
ReLU.
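Here's that arithmetic at a single position, sketched in Python (my made-up numbers, not the lecture's): 27 image numbers matched with 27 filter numbers, multiplied, summed, then passed through a ReLU.

```python
# Hypothetical 3x3x3 image patch (all ones) and a 3x3x3 filter.
patch = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
filt = [[[0.5] * 3 for _ in range(3)] for _ in range(3)]
filt[1][1] = [-2.0, -2.0, -2.0]  # make the center position negative

def conv_at_position(patch, filt):
    # Match up all 27 numbers, multiply, and add -> one number.
    total = sum(patch[r][c][ch] * filt[r][c][ch]
                for r in range(3) for c in range(3) for ch in range(3))
    return max(0.0, total)  # ReLU

print(conv_at_position(patch, filt))  # 6.0
```

Note that all three channels at a pixel feed into the same sum, which is exactly how the network is told they belong to one underlying thing.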

00:22:14.799 --> 00:22:19.399
By the way, I tried to get ChatGPT to

00:22:16.720 --> 00:22:21.679
give me a picture like that.

00:22:19.400 --> 00:22:22.920
It just completely bombed.

00:22:21.679 --> 00:22:24.400
I tried like three, four, five different

00:22:22.920 --> 00:22:25.800
variations. It just gave up. And then I

00:22:24.400 --> 00:22:28.640
found this nice picture on

00:22:25.799 --> 00:22:30.559
deeplearning.ai and I used it.

00:22:28.640 --> 00:22:32.160
So, then if you put different numbers in

00:22:30.559 --> 00:22:33.519
each of the layers, is that like color

00:22:32.160 --> 00:22:36.279
processing? Like it could be doing a

00:22:33.519 --> 00:22:37.440
different thing to green and blue. I'm

00:22:36.279 --> 00:22:39.920
sorry, say that again. If you put

00:22:37.440 --> 00:22:42.160
different numbers in each of the layers

00:22:39.920 --> 00:22:43.600
of your kernel, in each of the

00:22:42.160 --> 00:22:45.519
different like depth dimensions of your

00:22:43.599 --> 00:22:47.000
convolution filter, would that be like

00:22:45.519 --> 00:22:49.319
color processing?

00:22:47.000 --> 00:22:50.559
Uh, yeah, you will

00:22:49.319 --> 00:22:53.000
put different numbers. In

00:22:50.559 --> 00:22:54.119
fact, you have 27 numbers now,

00:22:53.000 --> 00:22:55.640
but we haven't gotten to the question of

00:22:54.119 --> 00:22:58.759
where these numbers are coming from. So,

00:22:55.640 --> 00:23:02.920
just hold the thought till we get there.

00:22:58.759 --> 00:23:04.640
Okay. Um so, any questions on this?

00:23:02.920 --> 00:23:05.800
Okay. You literally take the 2D thing

00:23:04.640 --> 00:23:08.120
and make it 3D.

00:23:05.799 --> 00:23:10.079
You basically give it depth and the

00:23:08.119 --> 00:23:11.319
depth just matches the depth of the

00:23:10.079 --> 00:23:13.319
input.

00:23:11.319 --> 00:23:15.000
So, if the input is like, you know, 10

00:23:13.319 --> 00:23:17.359
deep, your filter is going to get 10

00:23:15.000 --> 00:23:17.359
deep.

00:23:18.200 --> 00:23:22.519
Okay?

00:23:20.079 --> 00:23:22.519
Yes.

00:23:22.640 --> 00:23:26.000
Rather than

00:23:24.160 --> 00:23:27.679
increasing the rank order of the tensor

00:23:26.000 --> 00:23:29.240
by one, is there any instance where you

00:23:27.679 --> 00:23:30.920
would create a subtraction layer where

00:23:29.240 --> 00:23:33.559
you would run an operation across the

00:23:30.920 --> 00:23:35.920
different layers to come up with an

00:23:33.559 --> 00:23:38.799
intermediary layer that you would run a

00:23:35.920 --> 00:23:40.640
lower rank tensor of a filter over?

00:23:38.799 --> 00:23:42.639
Yeah, so there is a lot of stuff in the

00:23:40.640 --> 00:23:45.440
research literature which tries to do

00:23:42.640 --> 00:23:48.200
things like that. Uh I'm just describing

00:23:45.440 --> 00:23:50.080
like the most basic approach to

00:23:48.200 --> 00:23:51.720
doing this. And as it turns out, this

00:23:50.079 --> 00:23:54.319
basic approach is actually extremely

00:23:51.720 --> 00:23:56.079
powerful, right? And of course, uh

00:23:54.319 --> 00:23:59.399
researchers try to, you know, go from

00:23:56.079 --> 00:24:01.039
95% to 95.1%.

00:23:59.400 --> 00:24:02.840
So, they invent like all sorts of crazy

00:24:01.039 --> 00:24:04.839
complicated stuff, which is all good for

00:24:02.839 --> 00:24:07.399
us, humanity, but for practical use,

00:24:04.839 --> 00:24:07.399
this is good enough.

00:24:08.119 --> 00:24:12.519
How do you convert the 3 by 3 layer into

00:24:10.599 --> 00:24:14.359
a single 4 by 4 layer? 4 by 4 is

00:24:12.519 --> 00:24:15.279
understood, but what about the 3 layers?

00:24:14.359 --> 00:24:17.399
How do they work?

00:24:15.279 --> 00:24:19.079
Yeah. Um so, we are coming to that. I

00:24:17.400 --> 00:24:20.960
think we have a slide here. Actually, we

00:24:19.079 --> 00:24:23.599
don't. Never mind. We'll answer that. Um

00:24:20.960 --> 00:24:26.480
so here you have one filter, right?

00:24:23.599 --> 00:24:28.319
You have one 3 by 3 by 3 filter, which

00:24:26.480 --> 00:24:30.920
plugs into this thing here, and then it

00:24:28.319 --> 00:24:33.119
gives you the 4 by 4 at the end.

00:24:30.920 --> 00:24:37.000
Right? So, for one filter, we know that

00:24:33.119 --> 00:24:37.000
by doing this operation, we get

00:24:37.119 --> 00:24:40.159
this 4 by 4.

00:24:38.720 --> 00:24:41.880
Let's say that you have another filter,

00:24:40.160 --> 00:24:43.120
which is also 3D.

00:24:41.880 --> 00:24:45.080
You do that thing, you'll get another 4

00:24:43.119 --> 00:24:46.399
by 4.

00:24:45.079 --> 00:24:48.240
And if you have 10 filters, you'll get

00:24:46.400 --> 00:24:52.600
10 of these 4 by 4s, which then gets

00:24:48.240 --> 00:24:52.599
packaged up into a 4 by 4 by 10 tensor.
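Putting the pieces together, a minimal Python sketch (mine, not the lecture's code) of a convolutional layer: K 3x3x3 filters over a 6x6x3 input, giving K maps of 4x4, i.e. a 4 by 4 by K result:

```python
def conv3d_layer(image, filters):
    """Apply each 3x3x3 filter over an HxWx3 image (stride 1, no padding).
    Each filter yields one 2D map; K filters yield K maps."""
    h, w = len(image), len(image[0])
    maps = []
    for filt in filters:
        out = []
        for i in range(h - 2):
            row = []
            for j in range(w - 2):
                s = sum(image[i + r][j + c][ch] * filt[r][c][ch]
                        for r in range(3) for c in range(3) for ch in range(3))
                row.append(max(0.0, s))  # ReLU
            out.append(row)
        maps.append(out)
    return maps

# Made-up 6x6x3 input and 10 all-ones filters, just to check the shapes.
image = [[[1.0, 1.0, 1.0] for _ in range(6)] for _ in range(6)]
filters = [[[[1.0] * 3 for _ in range(3)] for _ in range(3)] for _ in range(10)]

maps = conv3d_layer(image, filters)
print(len(maps), len(maps[0]), len(maps[0][0]))  # 10 4 4
```

Ten filters, ten 4x4 maps, packaged together: the 4 by 4 by 10 tensor from the lecture.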

00:24:54.519 --> 00:25:01.839
Remember, whether the filter is 2D, 3D, or 10D,

00:24:57.880 --> 00:25:01.840
what is coming out is always 2D.

00:25:02.039 --> 00:25:05.240
Because ultimately, when you apply all

00:25:03.359 --> 00:25:06.639
this operation, at each position, you

00:25:05.240 --> 00:25:07.799
just have one number.

00:25:06.640 --> 00:25:08.720
And then ultimately, you just do all

00:25:07.799 --> 00:25:10.480
those things, you just come up with a

00:25:08.720 --> 00:25:13.160
table of numbers. So, what's

00:25:10.480 --> 00:25:14.279
coming out is always a 2D number table

00:25:13.160 --> 00:25:16.360
like that.

00:25:14.279 --> 00:25:18.119
But when you have lots of filters, you

00:25:16.359 --> 00:25:20.039
have lots of these 2D tables one after

00:25:18.119 --> 00:25:23.319
the other, and therefore, they get

00:25:20.039 --> 00:25:23.319
packaged up into a tensor.

00:25:25.160 --> 00:25:28.279
All right.

00:25:26.200 --> 00:25:30.559
Um so,

00:25:28.279 --> 00:25:32.119
textbook chapter 8.1 has a lot of detail

00:25:30.559 --> 00:25:35.839
and intuition, which I think is really

00:25:32.119 --> 00:25:37.439
good. So, please try it out. Okay.

00:25:35.839 --> 00:25:40.199
And folks, by the way, this convolution

00:25:37.440 --> 00:25:41.920
stuff, it sort of grows in the

00:25:40.200 --> 00:25:43.960
telling. So, I would encourage you to

00:25:41.920 --> 00:25:45.920
revisit it, revisit it

00:25:43.960 --> 00:25:48.240
a few times, and then it slowly becomes

00:25:45.920 --> 00:25:49.600
part of your muscle memory.

00:25:48.240 --> 00:25:51.559
Don't expect to just understand all the

00:25:49.599 --> 00:25:52.959
nuances like one shot.

00:25:51.559 --> 00:25:54.559
Do it a few times.

00:25:52.960 --> 00:25:56.360
And it will become, you know, wired into

00:25:54.559 --> 00:25:59.159
your head.

00:25:56.359 --> 00:26:00.599
Okay. So, all right. The big question.

00:25:59.160 --> 00:26:02.240
These seem excellent, but how are we

00:26:00.599 --> 00:26:04.079
supposed to come up with these numbers?

00:26:02.240 --> 00:26:05.480
Now, in fact, traditionally,

00:26:04.079 --> 00:26:07.079
uh these filters actually used to be

00:26:05.480 --> 00:26:08.480
designed by hand.

00:26:07.079 --> 00:26:10.079
Uh computer vision researchers would

00:26:08.480 --> 00:26:12.759
invest, you know, prodigious amounts of

00:26:10.079 --> 00:26:14.960
time and effort and talent to figure

00:26:12.759 --> 00:26:17.119
out, you know, the kind the right kinds

00:26:14.960 --> 00:26:19.000
of filters to use for various specific

00:26:17.119 --> 00:26:20.399
applications. So, if you wanted to build

00:26:19.000 --> 00:26:22.799
an application which would look at, say,

00:26:20.400 --> 00:26:24.720
MRI images and figure out, okay, what

00:26:22.799 --> 00:26:27.000
kind of features should I extract from

00:26:24.720 --> 00:26:28.519
this MRI thing to be able to say, you

00:26:27.000 --> 00:26:30.519
know, predict the evidence for a

00:26:28.519 --> 00:26:32.799
stroke, they would actually, you know,

00:26:30.519 --> 00:26:34.359
hand design the filter. They'd try lots

00:26:32.799 --> 00:26:35.960
of different values and then come up

00:26:34.359 --> 00:26:37.959
with, "Ah, I got the perfect filter for

00:26:35.960 --> 00:26:39.440
this thing here." Right? So, that's the

00:26:37.960 --> 00:26:41.559
way it used to be done.

00:26:39.440 --> 00:26:42.920
Um and now,

00:26:41.559 --> 00:26:45.279
but as we figured out how to train

00:26:42.920 --> 00:26:47.160
deep networks with lots of parameters,

00:26:45.279 --> 00:26:49.079
right? We figured out things like ReLU

00:26:47.160 --> 00:26:51.800
activation, stochastic gradient descent,

00:26:49.079 --> 00:26:54.559
GPUs, backprop, things like that, you

00:26:51.799 --> 00:26:55.759
know, uh this big idea emerged. Why

00:26:54.559 --> 00:26:57.839
don't we think of the numbers in the

00:26:55.759 --> 00:26:59.359
filter as just weights?

00:26:57.839 --> 00:27:01.639
And why don't we just simply learn them

00:26:59.359 --> 00:27:03.159
from the data using backprop?

00:27:01.640 --> 00:27:06.160
Right? Just like we learn all the other

00:27:03.160 --> 00:27:06.160
weights. What's the big deal?

00:27:06.279 --> 00:27:09.639
And this simple idea,

00:27:08.160 --> 00:27:12.080
it feels a bit, I don't know,

00:27:09.640 --> 00:27:13.160
blindingly obvious in hindsight.

00:27:12.079 --> 00:27:14.439
I'm sure it was not obvious in

00:27:13.160 --> 00:27:16.560
foresight.

00:27:14.440 --> 00:27:18.960
Um right? This was the breakthrough.

00:27:16.559 --> 00:27:20.399
This was the key breakthrough. And now,

00:27:18.960 --> 00:27:22.840
it's actually possible to do this

00:27:20.400 --> 00:27:25.840
because a convolutional filter that we

00:27:22.839 --> 00:27:27.319
have seen is actually just a neuron.

00:27:25.839 --> 00:27:31.119
And the underlying arithmetic of it is

00:27:27.319 --> 00:27:32.960
just a neuronal arithmetic. And so, it

00:27:31.119 --> 00:27:34.879
just happens to be a slightly special

00:27:32.960 --> 00:27:37.400
one. It's actually even simpler than a

00:27:34.880 --> 00:27:39.400
regular neuron. And in the interest of

00:27:37.400 --> 00:27:40.640
time, I have a one or two slides in the

00:27:39.400 --> 00:27:42.920
appendix which tells you exactly why

00:27:40.640 --> 00:27:44.480
it's a neuron. So, check it out. But

00:27:42.920 --> 00:27:46.480
just take my word for it. It's just a

00:27:44.480 --> 00:27:48.319
particular kind of neuron. And because

00:27:46.480 --> 00:27:50.400
it's a particular kind of neuron, and we

00:27:48.319 --> 00:27:51.359
know how to work with neurons,

00:27:50.400 --> 00:27:53.519
right? You know how to work with

00:27:51.359 --> 00:27:55.559
neurons, which means that our entire

00:27:53.519 --> 00:27:57.279
machinery,

00:27:55.559 --> 00:27:59.480
layers, loss functions, gradient

00:27:57.279 --> 00:28:01.279
descent, SGD, blah, blah, everything is

00:27:59.480 --> 00:28:03.559
immediately applicable.

00:28:01.279 --> 00:28:06.039
We don't have to invent any new stuff to

00:28:03.559 --> 00:28:08.000
make it work.
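The "a filter is just a neuron" claim can be checked in a couple of lines (my sketch; the numbers are made up): the convolution arithmetic at one position is exactly a neuron's weighted sum over the flattened patch (with zero bias here).

```python
patch = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0],
         [7.0, 8.0, 9.0]]
kernel = [[0.1, 0.0, -0.1],
          [0.2, 0.0, -0.2],
          [0.1, 0.0, -0.1]]

# Convolution view: element-wise multiply and sum.
conv_out = sum(patch[r][c] * kernel[r][c]
               for r in range(3) for c in range(3))

# Neuron view: flatten the patch into 9 inputs, the kernel into 9 weights,
# and take the usual weighted sum (zero bias).
inputs = [v for row in patch for v in row]
weights = [w for row in kernel for w in row]
neuron_out = sum(x * w for x, w in zip(inputs, weights))

print(conv_out == neuron_out)  # True: same arithmetic, different bookkeeping
```

Since the filter values are just neuron weights, backprop and gradient descent apply to them unchanged.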

00:28:06.039 --> 00:28:09.839
Okay?

00:28:08.000 --> 00:28:12.119
All right.

00:28:09.839 --> 00:28:14.639
Do you initialize the layers differently

00:28:12.119 --> 00:28:16.239
in applications or just because the

00:28:14.640 --> 00:28:18.400
network has different sizes? Like

00:28:16.240 --> 00:28:20.839
computer vision versus medical

00:28:18.400 --> 00:28:23.120
imaging. Is it just because the network

00:28:20.839 --> 00:28:25.359
has different numbers in them?

00:28:23.119 --> 00:28:27.439
Yeah, so the initialization

00:28:25.359 --> 00:28:29.119
So, let's It's a good question. Let's

00:28:27.440 --> 00:28:30.720
come back to it when we get to something

00:28:29.119 --> 00:28:34.559
called transfer learning, which I'm

00:28:30.720 --> 00:28:34.559
going to get to by about 9:30.

00:28:34.720 --> 00:28:37.480
All right. So,

00:28:36.279 --> 00:28:38.678
that's it. So, this turned

00:28:37.480 --> 00:28:40.599
out to be a huge turning point in the

00:28:38.679 --> 00:28:43.360
computer vision field, and this was the

00:28:40.599 --> 00:28:44.678
massive unlock in the year 2012. This

00:28:43.359 --> 00:28:47.399
computer vision system that used this

00:28:44.679 --> 00:28:49.080
technology called AlexNet burst out onto

00:28:47.400 --> 00:28:51.200
the world stage because it crushed the

00:28:49.079 --> 00:28:53.519
competition in a, you know, in in a

00:28:51.200 --> 00:28:56.919
competition called ImageNet, and uh the

00:28:53.519 --> 00:28:59.679
previous best score was 26% error rate,

00:28:56.919 --> 00:29:01.159
and this thing came in and had 16% error

00:28:59.679 --> 00:29:01.960
rate. Right? It's the kind of thing

00:29:01.159 --> 00:29:04.120
where if you see it, you'll be like,

00:29:01.960 --> 00:29:05.480
"Oh, that must be a typo."

00:29:04.119 --> 00:29:06.439
Right? Because every year, the

00:29:05.480 --> 00:29:07.919
improvements in error rate were like

00:29:06.440 --> 00:29:09.919
very little, half a percent, 1%, and

00:29:07.919 --> 00:29:12.800
then this year was 10%, and that that

00:29:09.919 --> 00:29:14.520
was because of this approach.

00:29:12.799 --> 00:29:16.960
And so, all right. Now, one other thing

00:29:14.519 --> 00:29:19.960
I want to talk about is that with

00:29:16.960 --> 00:29:21.480
every succeeding convolutional layer,

00:29:19.960 --> 00:29:23.440
any

00:29:21.480 --> 00:29:25.519
particular convolutional filter, it's

00:29:23.440 --> 00:29:28.320
basically implicitly seeing much more of

00:29:25.519 --> 00:29:29.839
the input image as we go along.

00:29:28.319 --> 00:29:31.639
Right? Which means that in the very

00:29:29.839 --> 00:29:33.119
beginning, if this is the input, right?

00:29:31.640 --> 00:29:34.360
This little convolutional filter, this

00:29:33.119 --> 00:29:37.119
number here

00:29:34.359 --> 00:29:38.719
in the first layer, let's say, only sees

00:29:37.119 --> 00:29:40.119
like the top of the chimney or whatever

00:29:38.720 --> 00:29:42.120
of this house.

00:29:40.119 --> 00:29:44.839
But then the next layer, remember, the

00:29:42.119 --> 00:29:45.879
next layer's input is this particular

00:29:44.839 --> 00:29:47.240
layer.

00:29:45.880 --> 00:29:49.400
And so,

00:29:47.240 --> 00:29:50.839
this particular little thing here is

00:29:49.400 --> 00:29:52.280
getting information from this whole

00:29:50.839 --> 00:29:53.839
square here.

00:29:52.279 --> 00:29:55.599
And every one of the points in that

00:29:53.839 --> 00:29:57.399
square is actually something big in the

00:29:55.599 --> 00:29:59.480
original picture.

00:29:57.400 --> 00:30:00.680
So, with every additional layer, you're

00:29:59.480 --> 00:30:03.039
seeing more and more and more of the

00:30:00.680 --> 00:30:04.920
image.
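This growth is easy to quantify for a stack of stride-1 3x3 convolutions (a back-of-the-envelope sketch, not from the lecture): each extra layer widens the view by two input pixels.

```python
def receptive_field(num_layers, kernel_size=3):
    """Receptive field of stacked stride-1 convolutions: one output value
    at layer n draws on a (2n+1) x (2n+1) patch of the input for 3x3 kernels."""
    rf = 1
    for _ in range(num_layers):
        rf += kernel_size - 1  # each layer adds (k - 1) pixels of context
    return rf

for n in range(1, 6):
    print(n, receptive_field(n))  # 1 3, 2 5, 3 7, 4 9, 5 11
```

So by the fifth layer, a single value already summarizes an 11x11 patch of the original image, which is the hierarchical widening described above (strided convolutions and pooling widen it even faster).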

00:30:03.039 --> 00:30:06.639
All right? And this is a key part of why

00:30:04.920 --> 00:30:08.519
these things work because you're

00:30:06.640 --> 00:30:09.759
essentially hierarchically building a

00:30:08.519 --> 00:30:10.680
better and better understanding of the

00:30:09.759 --> 00:30:12.879
image.

00:30:10.680 --> 00:30:14.960
It is the hierarchical understanding,

00:30:12.880 --> 00:30:17.880
the hierarchical learning, that's a very

00:30:14.960 --> 00:30:20.240
key part of the unlock.

00:30:17.880 --> 00:30:21.840
And so, if you look at networks and what

00:30:20.240 --> 00:30:23.759
they're visualizing, this is actually what a

00:30:21.839 --> 00:30:25.639
face detection deep network

00:30:23.759 --> 00:30:26.879
visualizes of what it's learning. You'll

00:30:25.640 --> 00:30:28.759
see that the first layer is just

00:30:26.880 --> 00:30:29.920
learning lines and so on, just

00:30:28.759 --> 00:30:30.960
lines.

00:30:29.920 --> 00:30:32.800
And the second layer is actually

00:30:30.960 --> 00:30:33.759
learning edges. Look at this thing,

00:30:32.799 --> 00:30:36.000
right?

00:30:33.759 --> 00:30:37.119
It's learning to put these lines

00:30:36.000 --> 00:30:38.519
together

00:30:37.119 --> 00:30:40.359
to get some sort of an edge here,

00:30:38.519 --> 00:30:43.879
another edge here. This looks like three

00:30:40.359 --> 00:30:45.199
quarters of somebody's ear.

00:30:43.880 --> 00:30:46.360
And then, these things are now being

00:30:45.200 --> 00:30:49.160
assembled

00:30:46.359 --> 00:30:50.279
to get whole faces out.

00:30:49.160 --> 00:30:52.080
Can you imagine the researchers who did

00:30:50.279 --> 00:30:53.720
this work? They built the network, it's

00:30:52.079 --> 00:30:54.599
doing really well on detecting faces,

00:30:53.720 --> 00:30:56.079
and they turn around, "Okay, let's see

00:30:54.599 --> 00:30:58.079
what it's actually doing."

00:30:56.079 --> 00:31:00.480
And then, this picture pops up.

00:30:58.079 --> 00:31:03.039
I mean, goosebumps.

00:31:00.480 --> 00:31:04.440
Okay, so pooling layers, the next one.

00:31:03.039 --> 00:31:05.559
So,

00:31:04.440 --> 00:31:07.519
so far we've talked about convolutional

00:31:05.559 --> 00:31:09.559
layers, this is the second thing, second

00:31:07.519 --> 00:31:11.440
building block, and then we'll again go

00:31:09.559 --> 00:31:12.919
to the Colab. So, pooling layers

00:31:11.440 --> 00:31:15.039
are also called subsampling or

00:31:12.920 --> 00:31:17.120
downsampling layers.

00:31:15.039 --> 00:31:19.480
So, the idea is that every time a tensor

00:31:17.119 --> 00:31:20.639
is coming out of these convolutional

00:31:19.480 --> 00:31:23.440
layers,

00:31:20.640 --> 00:31:25.440
we try to make it slightly smaller

00:31:23.440 --> 00:31:27.519
because the act of making it smaller

00:31:25.440 --> 00:31:29.440
will force the network to try to

00:31:27.519 --> 00:31:30.920
summarize and learn what's going on in

00:31:29.440 --> 00:31:32.840
this complicated thing that's coming into

00:31:30.920 --> 00:31:35.200
it, okay? So, I will describe the

00:31:32.839 --> 00:31:37.599
mechanics first.

00:31:35.200 --> 00:31:39.600
So, let's say that this is the output of

00:31:37.599 --> 00:31:40.559
a convolutional layer.

00:31:39.599 --> 00:31:42.879
Okay?

00:31:40.559 --> 00:31:45.079
It's a 4 by 4.

00:31:42.880 --> 00:31:47.440
So, what we do is that there are two

00:31:45.079 --> 00:31:48.879
kinds of pooling, max pooling and

00:31:47.440 --> 00:31:51.000
average pooling. This is called max

00:31:48.880 --> 00:31:52.480
pooling, and the idea is really simple.

00:31:51.000 --> 00:31:53.799
In this max pooling layer, there are no

00:31:52.480 --> 00:31:56.200
weight parameters to be learned. It's

00:31:53.799 --> 00:31:57.839
just a simple arithmetic operation. We

00:31:56.200 --> 00:32:00.200
basically

00:31:57.839 --> 00:32:02.919
superimpose a

00:32:00.200 --> 00:32:04.920
2 by 2 empty grid

00:32:02.920 --> 00:32:06.519
on the top left, and then we say, "Hey,

00:32:04.920 --> 00:32:08.000
what's the biggest number among

00:32:06.519 --> 00:32:09.720
these four numbers?" Well, the biggest

00:32:08.000 --> 00:32:11.200
number is 43. Boom. Okay, I'm going to

00:32:09.720 --> 00:32:13.600
stick a 43 here.

00:32:11.200 --> 00:32:15.720
Then I move my 2 by 2 to the right

00:32:13.599 --> 00:32:17.039
so that it overlaps with these numbers

00:32:15.720 --> 00:32:19.759
in blue, and I say, "Hey, what's the

00:32:17.039 --> 00:32:20.960
biggest number here?" Okay, that's 109.

00:32:19.759 --> 00:32:23.240
And I move it down, what's the biggest

00:32:20.960 --> 00:32:25.000
number here? 105. Stick it in here.

00:32:23.240 --> 00:32:26.519
Biggest number here, 35, and I stick it

00:32:25.000 --> 00:32:28.839
in there. That's it. This is max

00:32:26.519 --> 00:32:28.839
pooling.

00:32:29.119 --> 00:32:32.199
Similarly, there's this thing called

00:32:30.200 --> 00:32:33.440
average pooling, but instead of taking

00:32:32.200 --> 00:32:35.480
the maximum of these four numbers, we

00:32:33.440 --> 00:32:36.840
just average the four numbers.

00:32:35.480 --> 00:32:38.519
Okay, the average of these four things

00:32:36.839 --> 00:32:40.879
in yellow,

00:32:38.519 --> 00:32:40.879
am I done?

00:32:41.559 --> 00:32:45.639
Average of these four numbers is 32.2.

00:32:43.519 --> 00:32:46.839
The average of blue numbers is 25.5, you

00:32:45.640 --> 00:32:48.200
get the idea.
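Both pooling operations in one short Python sketch (mine, not the lecture's code; the 4x4 feature map below is made up, chosen so that max pooling picks out 43, 109, 105, and 35 as in the walkthrough):

```python
# Made-up 4x4 feature map (not the lecture's actual grid).
fmap = [[ 43,  12,  80, 109],
        [  7,  20,   9,  31],
        [105,   2,  35,   4],
        [ 18,  60,  11,  22]]

def pool2x2(fmap, op):
    """Slide a 2x2 window with stride 2 and reduce each window with op."""
    out = []
    for i in range(0, len(fmap), 2):
        row = []
        for j in range(0, len(fmap[0]), 2):
            window = [fmap[i][j], fmap[i][j + 1],
                      fmap[i + 1][j], fmap[i + 1][j + 1]]
            row.append(op(window))
        out.append(row)
    return out

max_pooled = pool2x2(fmap, max)                        # keep the biggest
avg_pooled = pool2x2(fmap, lambda w: sum(w) / len(w))  # keep the average
print(max_pooled)  # [[43, 109], [105, 35]]
print(avg_pooled)  # [[20.5, 57.25], [46.25, 18.0]]
```

Either way, a 4x4 shrinks to a 2x2: that's the subsampling.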

00:32:46.839 --> 00:32:50.439
That's it. Max pooling and average

00:32:48.200 --> 00:32:51.840
pooling. Now,

00:32:50.440 --> 00:32:53.400
as you can see, when you

00:32:51.839 --> 00:32:55.720
apply pooling, the number of entries

00:32:53.400 --> 00:32:56.880
drops significantly.

00:32:55.720 --> 00:32:58.240
Right? The number of entries drops

00:32:56.880 --> 00:32:59.880
significantly.

00:32:58.240 --> 00:33:02.839
And the output from this layer is just

00:32:59.880 --> 00:33:04.480
fed to the next layer as usual.

00:33:02.839 --> 00:33:05.720
Okay? There's nothing, you know, crazy

00:33:04.480 --> 00:33:07.679
going on.

00:33:05.720 --> 00:33:10.039
So, it's a way to shrink the output from

00:33:07.679 --> 00:33:11.560
one convolutional layer before it passes

00:33:10.039 --> 00:33:13.799
on to the next convolutional layer: you

00:33:11.559 --> 00:33:15.960
interject a pooling layer.

00:33:13.799 --> 00:33:18.039
Now, I actually have,

00:33:15.960 --> 00:33:20.759
even if I say so myself, a very nice

00:33:18.039 --> 00:33:23.319
handwritten explanation of what pooling

00:33:20.759 --> 00:33:25.200
does, the effect of pooling.

00:33:23.319 --> 00:33:27.480
And unfortunately, I can't get my iPad

00:33:25.200 --> 00:33:28.920
to actually show up on my laptop.

00:33:27.480 --> 00:33:31.400
So, I'm not going to be able to do it,

00:33:28.920 --> 00:33:33.519
but I will record a walk-through.

00:33:31.400 --> 00:33:35.519
Yeah, and once it's posted, check it out, okay?

00:33:33.519 --> 00:33:38.240
But the intuition that I tried to convey

00:33:35.519 --> 00:33:39.359
with that thing is that oh, um Sorry,

00:33:38.240 --> 00:33:41.039
I'll come back to this.

00:33:39.359 --> 00:33:43.439
So, max pooling acts like an or

00:33:41.039 --> 00:33:44.879
condition. It basically says, "I have

00:33:43.440 --> 00:33:46.559
this big picture.

00:33:44.880 --> 00:33:48.720
So, in the four things that I'm looking

00:33:46.559 --> 00:33:50.319
at, if there's any number which is

00:33:48.720 --> 00:33:51.880
really high,

00:33:50.319 --> 00:33:54.319
that means that some feature is being

00:33:51.880 --> 00:33:55.720
detected, right?

00:33:54.319 --> 00:33:57.000
The number is really high coming out of

00:33:55.720 --> 00:33:59.200
a convolutional layer, that means that

00:33:57.000 --> 00:34:00.519
something somewhere fired up,

00:33:59.200 --> 00:34:01.799
lit up.

00:34:00.519 --> 00:34:04.200
And so, I'm just looking to see if

00:34:01.799 --> 00:34:05.319
anything lit up in that part. If it did,

00:34:04.200 --> 00:34:06.640
I'm going to say, "Yep, something lit

00:34:05.319 --> 00:34:08.239
up."

00:34:06.640 --> 00:34:09.640
If nothing lit up, then I'm going to

00:34:08.239 --> 00:34:11.559
say, "Oh, nothing lit up."

00:34:09.639 --> 00:34:13.158
So, in that sense, I

00:34:13.159 --> 00:34:16.519
think you can imagine it's acting

00:34:13.159 --> 00:34:16.519
like an or condition.

00:34:15.398 --> 00:34:17.559
Anything fired up? Anything fired up?

00:34:16.519 --> 00:34:19.480
Anything fired up? Anything up? Yes,

00:34:17.559 --> 00:34:22.039
okay. Otherwise, no.

00:34:19.480 --> 00:34:22.039
And so,

00:34:22.280 --> 00:34:27.040
sadly, I can't switch to Notability.

00:34:24.639 --> 00:34:28.440
So, it acts like a feature detector. So,

00:34:27.039 --> 00:34:30.239
if you have lots of things going on in a

00:34:28.440 --> 00:34:32.000
particular picture, you want to be able

00:34:30.239 --> 00:34:33.398
to summarize and aggregate all the

00:34:32.000 --> 00:34:35.519
things that are going on so that you can

00:34:33.398 --> 00:34:36.918
say: you may have a big picture

00:34:35.519 --> 00:34:38.398
with lots of things lighting up here and

00:34:36.918 --> 00:34:40.559
there, but you want to step back and

00:34:38.398 --> 00:34:42.918
say, "You know what? In this picture,

00:34:40.559 --> 00:34:45.440
the top left, nothing lit up. The top

00:34:42.918 --> 00:34:46.719
right, something lit up. Bottom left,

00:34:45.440 --> 00:34:48.320
something lit up. And the bottom right,

00:34:46.719 --> 00:34:49.599
nothing lit up."

00:34:48.320 --> 00:34:51.800
So, you're operating at a higher level

00:34:49.599 --> 00:34:54.839
of abstraction.

00:34:51.800 --> 00:34:54.840
That's the effect of pooling.
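
The OR-style behavior described here can be sketched in a few lines of NumPy. This is a toy 4 by 4 feature map, not data from the actual notebook: one high value in the top-left quadrant "lights up" that whole pooled cell.

```python
import numpy as np

# A toy 4x4 feature map coming out of a (hypothetical) convolutional layer.
# High values mean "the filter's feature fired" at that location.
fmap = np.array([
    [0.1, 0.2, 0.0, 0.1],
    [0.3, 9.0, 0.2, 0.0],
    [0.0, 0.1, 0.2, 0.1],
    [0.2, 0.0, 0.1, 0.3],
])

# 2x2 max pooling with stride 2: each output cell answers
# "did ANYTHING light up in my 2x2 patch?" -- the OR-like behavior.
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[9.  0.2]
#  [0.2 0.3]]
```

The 9.0 survives into the pooled output: the top-left quadrant reports "something lit up," while the quieter quadrants report only small values.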

00:34:55.039 --> 00:34:58.639
But don't you lose spatial information?

00:34:59.920 --> 00:35:04.079
Uh, you don't, because

00:35:02.480 --> 00:35:06.199
what you're actually saying is the top

00:35:04.079 --> 00:35:08.639
left has this thing.

00:35:06.199 --> 00:35:10.599
You already know it is in the top left.

00:35:08.639 --> 00:35:12.119
And you already moved up to that level

00:35:10.599 --> 00:35:13.839
of abstraction.

00:35:12.119 --> 00:35:15.880
So, for example, if in the top

00:35:13.840 --> 00:35:18.480
left there is a human eye,

00:35:15.880 --> 00:35:19.880
and there is a circle detector, it's

00:35:18.480 --> 00:35:21.719
going to fire up and say, "Hey, in

00:35:19.880 --> 00:35:23.599
the top left there is an eye."

00:35:21.719 --> 00:35:24.919
Yep, lit up. So, you're not looking at

00:35:23.599 --> 00:35:25.759
the pixels anymore, you're already

00:35:24.920 --> 00:35:27.159
operating at a higher level of

00:35:25.760 --> 00:35:29.520
abstraction, and that's how we get

00:35:27.159 --> 00:35:31.039
around it. But this proceeds slowly and

00:35:29.519 --> 00:35:34.039
incrementally, which is why you have

00:35:31.039 --> 00:35:34.039
these big networks.

00:35:34.199 --> 00:35:38.159
All right.

00:35:35.679 --> 00:35:40.159
So, now, as we saw, as successive

00:35:38.159 --> 00:35:41.639
convolution layers can see more and more

00:35:40.159 --> 00:35:43.319
of the original image,

00:35:41.639 --> 00:35:45.480
the max pooling layers that follow them

00:35:43.320 --> 00:35:47.640
can detect if a feature exists in more

00:35:45.480 --> 00:35:48.760
and more of the original input as well.

00:35:47.639 --> 00:35:50.279
So, by the time you get to like the

00:35:48.760 --> 00:35:52.320
seventh, eighth, and ninth layers and

00:35:50.280 --> 00:35:53.720
so on, this thing is actually really

00:35:52.320 --> 00:35:55.160
smart. It's operating at a very high

00:35:53.719 --> 00:35:56.959
level of abstraction.

00:35:55.159 --> 00:35:58.480
Right? You can think of it: it

00:35:56.960 --> 00:36:00.280
has basically tagged all the

00:35:58.480 --> 00:36:04.199
features in that image at various

00:36:00.280 --> 00:36:04.200
resolutions, and it can work with it.

00:36:04.880 --> 00:36:08.920
Is there a trade-off between doing

00:36:06.400 --> 00:36:11.160
pre-processing as opposed to adding

00:36:08.920 --> 00:36:12.760
additional convolutional layers? I'm

00:36:11.159 --> 00:36:15.519
thinking if you have a video turning

00:36:12.760 --> 00:36:17.600
into a black and white static images in

00:36:15.519 --> 00:36:19.358
a sequence as opposed to

00:36:17.599 --> 00:36:20.639
shoving in a color video with a ton of

00:36:19.358 --> 00:36:22.400
noise.

00:36:20.639 --> 00:36:24.759
The greater the time expanse, is there a

00:36:22.400 --> 00:36:27.960
trade-off element? There is a trade-off.

00:36:24.760 --> 00:36:29.760
Um if your particular data set and input

00:36:27.960 --> 00:36:31.720
has some very

00:36:29.760 --> 00:36:33.240
important domain knowledge that you want

00:36:31.719 --> 00:36:35.719
to encode

00:36:33.239 --> 00:36:37.839
into the network so that the network

00:36:35.719 --> 00:36:39.719
doesn't waste its capacity learning

00:36:37.840 --> 00:36:41.640
things that you know have to be true,

00:36:39.719 --> 00:36:43.358
then yeah, modify the input.

00:36:41.639 --> 00:36:45.480
But if you're not sure,

00:36:43.358 --> 00:36:47.199
right? Then you want to just let the network

00:36:45.480 --> 00:36:49.679
learn whatever it can as long as it's

00:36:47.199 --> 00:36:53.439
focused on predicting as accurately as

00:36:49.679 --> 00:36:53.440
possible, then just let it be.

00:36:55.800 --> 00:36:59.200
Uh all right. So, that's the basic idea.

00:36:57.880 --> 00:37:01.358
And again, I'm sorry, this

00:36:59.199 --> 00:37:03.799
Notability thing is not working.

00:37:01.358 --> 00:37:05.559
Uh but take a look to really understand

00:37:03.800 --> 00:37:08.039
um how this max pooling business

00:37:05.559 --> 00:37:09.358
works. Okay. Oh, uh I think I skipped

00:37:08.039 --> 00:37:12.000
over this.

00:37:09.358 --> 00:37:13.639
So, when you have something like this,

00:37:12.000 --> 00:37:15.760
so this, let's say, is a tensor coming

00:37:13.639 --> 00:37:18.839
out of some convolutional layer, and its

00:37:15.760 --> 00:37:20.640
size is 224 by 224 by 64, then you apply

00:37:18.840 --> 00:37:22.160
something like a pooling. The thing I

00:37:20.639 --> 00:37:23.839
want to point out is that the pooling

00:37:22.159 --> 00:37:25.839
will work with every slice of the

00:37:23.840 --> 00:37:27.960
tensor.

00:37:25.840 --> 00:37:30.600
Okay? So, if the tensor is 224 by 224 by

00:37:27.960 --> 00:37:31.880
64, it has a depth of 64,

00:37:30.599 --> 00:37:35.239
which is basically like saying it's got

00:37:31.880 --> 00:37:38.200
64 tables of 224 by 224, and the pooling

00:37:35.239 --> 00:37:40.119
will work on every one of those tables.

00:37:38.199 --> 00:37:42.279
Which means that

00:37:40.119 --> 00:37:43.719
you'll still have 64

00:37:42.280 --> 00:37:45.760
things at the very end. It's just that

00:37:43.719 --> 00:37:49.759
every one of the things of the 64, the

00:37:45.760 --> 00:37:52.560
224 by 224, will shrink to 112 by 112.

00:37:49.760 --> 00:37:53.720
So, each table shrinks due to pooling,

00:37:52.559 --> 00:37:56.119
but the number of tables does not

00:37:53.719 --> 00:37:56.119
change.
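
A minimal NumPy sketch of this per-slice behavior (random stand-in data, not real activations): 2 by 2 max pooling shrinks each 224 by 224 table to 112 by 112, while the 64 tables stay 64 tables.

```python
import numpy as np

# Tensor coming out of a convolutional layer: 224 x 224, depth 64,
# i.e. 64 "tables" of 224 x 224 numbers.
x = np.random.rand(224, 224, 64)

# 2x2 max pooling applied independently to every one of the 64 tables.
# The reshape splits each spatial axis into (new_size, 2) so the max
# runs over each 2x2 patch while leaving the depth axis untouched.
pooled = x.reshape(112, 2, 112, 2, 64).max(axis=(1, 3))
print(pooled.shape)  # (112, 112, 64) -- smaller tables, same number of tables
```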

00:37:57.800 --> 00:38:01.880
Okay. So,

00:37:59.440 --> 00:38:03.559
uh by the way, this

00:38:01.880 --> 00:38:05.400
link here

00:38:03.559 --> 00:38:06.599
has a beautiful explanation of all these

00:38:05.400 --> 00:38:08.800
things with a little bit more complexity

00:38:06.599 --> 00:38:10.440
as well from a course taught at Stanford

00:38:08.800 --> 00:38:12.640
in like 2018 or 2019 or something, I

00:38:10.440 --> 00:38:13.800
forget. Uh so, just check it out if

00:38:12.639 --> 00:38:15.039
you're curious about this stuff. It's

00:38:13.800 --> 00:38:18.160
really good.

00:38:15.039 --> 00:38:18.159
Okay. Um

00:38:18.440 --> 00:38:21.760
All right. So, that brings us to the

00:38:19.800 --> 00:38:23.800
architecture of a basic CNN.

00:38:21.760 --> 00:38:25.240
Um and so, what we do is we have an

00:38:23.800 --> 00:38:27.240
input.

00:38:25.239 --> 00:38:29.239
Okay? We take that input, we run it

00:38:27.239 --> 00:38:30.799
through a bunch of convolutional and

00:38:29.239 --> 00:38:33.559
pooling layers. So, there's a

00:38:30.800 --> 00:38:35.840
convolutional layer, and then we pool

00:38:33.559 --> 00:38:37.440
it, which is why it has shrunk

00:38:35.840 --> 00:38:38.440
in size,

00:38:37.440 --> 00:38:40.599
and then it goes through another

00:38:38.440 --> 00:38:42.358
convolutional layer, then we pool it,

00:38:40.599 --> 00:38:44.000
which is shrunk again,

00:38:42.358 --> 00:38:45.559
and then it keeps on doing it. So, we

00:38:44.000 --> 00:38:47.559
have a series of these;

00:38:45.559 --> 00:38:49.400
these are called convolutional blocks.

00:38:47.559 --> 00:38:50.559
So, a convolutional block is typically,

00:38:49.400 --> 00:38:52.920
you know, one to two convolutional

00:38:50.559 --> 00:38:54.358
layers followed by a pooling layer.

00:38:52.920 --> 00:38:55.760
Okay.

00:38:54.358 --> 00:38:57.159
So, you have a series of convolutional

00:38:55.760 --> 00:38:59.960
blocks.

00:38:57.159 --> 00:39:01.559
Okay? And the thing to notice is that

00:38:59.960 --> 00:39:03.320
as you go further and further in the

00:39:01.559 --> 00:39:05.519
network,

00:39:03.320 --> 00:39:07.000
the blocks will actually get smaller and

00:39:05.519 --> 00:39:09.159
smaller because of

00:39:07.000 --> 00:39:10.599
uh max pooling, right? They'll get

00:39:09.159 --> 00:39:14.039
smaller and smaller, but they'll get

00:39:10.599 --> 00:39:14.799
deeper and deeper.

00:39:14.039 --> 00:39:16.519
Okay.

00:39:14.800 --> 00:39:18.880
And we have empirically figured out that

00:39:16.519 --> 00:39:20.639
this model of reducing the

00:39:18.880 --> 00:39:22.519
size, the height and the

00:39:20.639 --> 00:39:25.519
width, but then making it deeper, tends

00:39:22.519 --> 00:39:27.119
to work really well in practice.

00:39:25.519 --> 00:39:29.559
And so,

00:39:27.119 --> 00:39:31.279
in fact, uh, and my apologies to the live

00:39:29.559 --> 00:39:34.480
stream that I can't use the iPad, I'm going

00:39:31.280 --> 00:39:34.480
to do it on the board.

00:39:35.960 --> 00:39:39.639
So, let's say that you have a picture

00:39:38.358 --> 00:39:43.480
which is

00:39:39.639 --> 00:39:44.879
coming in as 224 by

00:39:43.480 --> 00:39:46.199
224

00:39:44.880 --> 00:39:48.000
and then you have

00:39:46.199 --> 00:39:49.719
say three of them

00:39:48.000 --> 00:39:52.360
because it's a color picture, so you

00:39:49.719 --> 00:39:54.399
have three of them.

00:39:52.360 --> 00:39:56.440
Can you folks see this okay?

00:39:54.400 --> 00:39:59.240
All right. So, right? Let's say this is

00:39:56.440 --> 00:40:00.960
the input coming in. And ResNet, which

00:39:59.239 --> 00:40:02.479
is a very famous network that we're

00:40:00.960 --> 00:40:03.679
actually going to work with in a few

00:40:02.480 --> 00:40:05.719
minutes,

00:40:03.679 --> 00:40:07.960
then it actually gets done with all this

00:40:05.719 --> 00:40:11.119
convolution pooling business.

00:40:07.960 --> 00:40:13.400
The final tensor that it it has is

00:40:11.119 --> 00:40:16.239
actually of shape

00:40:13.400 --> 00:40:20.720
7 by 7.

00:40:16.239 --> 00:40:20.719
But it is 2048 deep.

00:40:22.519 --> 00:40:26.719
Okay? So, it has

00:40:24.039 --> 00:40:28.400
processed something which is 224 by 224 by 3

00:40:26.719 --> 00:40:31.439
to much smaller height and width just 7

00:40:28.400 --> 00:40:32.840
by 7, but it's gotten much deeper, 2048

00:40:31.440 --> 00:40:34.920
channels.

00:40:32.840 --> 00:40:36.800
This is a this is a numerical example of

00:40:34.920 --> 00:40:39.320
what I'm talking about there in terms of

00:40:36.800 --> 00:40:41.560
as you go along, things get smaller but

00:40:39.320 --> 00:40:43.039
deeper.
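
As rough back-of-the-envelope arithmetic for the "smaller but deeper" progression: 224 halved five times is 7. This is an illustrative pairing of halvings with ResNet-50's block output depths; the real network places its strided layers slightly differently, but the endpoint shapes match.

```python
# ResNet-style downsampling: the 224x224 spatial size is halved five
# times on the way through the network, while the channel depth grows.
size, depth = 224, 3
stage_depths = [64, 256, 512, 1024, 2048]  # ResNet-50 block output depths
for depth in stage_depths:
    size //= 2  # 224 -> 112 -> 56 -> 28 -> 14 -> 7

print(size, depth)  # 7 2048: the final 7 x 7 x 2048 tensor from the lecture
```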

00:40:41.559 --> 00:40:44.480
All right.

00:40:43.039 --> 00:40:45.880
Uh

00:40:44.480 --> 00:40:47.280
Yes?

00:40:45.880 --> 00:40:49.519
Is the reason that it gets deeper

00:40:47.280 --> 00:40:50.880
because each

00:40:49.519 --> 00:40:52.759
Like it gets deeper because each

00:40:50.880 --> 00:40:54.400
layer has a single feature that is

00:40:52.760 --> 00:40:55.120
picked up and then it gets stacked on

00:40:54.400 --> 00:40:57.039
top

00:40:55.119 --> 00:40:58.559
It's not so much that each layer is

00:40:57.039 --> 00:40:59.480
picking up a single feature, it's more

00:40:58.559 --> 00:41:00.279
that

00:40:59.480 --> 00:41:01.960
uh

00:41:00.280 --> 00:41:04.519
basically

00:41:01.960 --> 00:41:06.159
the way I think about it is that

00:41:04.519 --> 00:41:07.800
the number of atomic

00:41:06.159 --> 00:41:10.199
features that you may want to detect are

00:41:07.800 --> 00:41:11.920
probably not that many, right? Lines,

00:41:10.199 --> 00:41:13.719
curves, gradations in color and things

00:41:11.920 --> 00:41:16.519
like that. But the way in which you can

00:41:13.719 --> 00:41:18.559
combine these atomic features

00:41:16.519 --> 00:41:20.199
to depict real world things

00:41:18.559 --> 00:41:22.279
is combinatorial.

00:41:20.199 --> 00:41:23.879
It's sort of like I have 10 kinds of

00:41:22.280 --> 00:41:25.040
atoms, how many molecules can I make

00:41:23.880 --> 00:41:26.519
from it?

00:41:25.039 --> 00:41:28.279
You can make a lot of molecules from

00:41:26.519 --> 00:41:30.679
those 10 atoms, which means that you

00:41:28.280 --> 00:41:32.080
better give the network the ability

00:41:30.679 --> 00:41:33.719
to capture more and more of these

00:41:32.079 --> 00:41:35.400
possible things that the real world can

00:41:33.719 --> 00:41:38.000
come up with.

00:41:35.400 --> 00:41:40.200
And so, as the depth increases, you

00:41:38.000 --> 00:41:42.320
have more filters, and every filter

00:41:40.199 --> 00:41:43.719
now has the ability to pick up some

00:41:42.320 --> 00:41:46.080
combination of what's

00:41:43.719 --> 00:41:46.079
coming in.

00:41:49.639 --> 00:41:53.239
Uh sorry, quick question related to

00:41:51.320 --> 00:41:55.080
this. So, right now like our model is

00:41:53.239 --> 00:41:56.799
being trained to detect certain specific

00:41:55.079 --> 00:41:58.519
features like a line, a color, or

00:41:56.800 --> 00:42:00.680
something of this sort. But still it

00:41:58.519 --> 00:42:02.880
doesn't have meaning to this, right?

00:42:00.679 --> 00:42:06.239
Like still they don't know if that

00:42:02.880 --> 00:42:08.360
arc is a sun or is an eye, right?

00:42:06.239 --> 00:42:10.639
So, yeah. So, we we don't tell it what

00:42:08.360 --> 00:42:12.280
to learn, it just learns.

00:42:10.639 --> 00:42:14.599
All we tell it is make sure that you

00:42:12.280 --> 00:42:16.240
minimize the loss function. Now, once it

00:42:14.599 --> 00:42:18.679
is finished learning, if it's a good

00:42:16.239 --> 00:42:21.359
network, it has good accuracy, then we

00:42:18.679 --> 00:42:23.480
can introspect. We can peek into the

00:42:21.360 --> 00:42:24.559
internals and try to understand what is

00:42:23.480 --> 00:42:26.480
it learning,

00:42:24.559 --> 00:42:27.759
right? And sometimes you like you saw in

00:42:26.480 --> 00:42:28.840
the face detection example, it's

00:42:27.760 --> 00:42:30.720
actually learning interesting things

00:42:28.840 --> 00:42:32.440
like basic lines and edges and then

00:42:30.719 --> 00:42:34.359
slowly, you know, more complicated

00:42:32.440 --> 00:42:36.320
shapes and then finally like entire

00:42:34.360 --> 00:42:37.640
human faces. Sometimes it may not be

00:42:36.320 --> 00:42:39.200
understandable.

00:42:37.639 --> 00:42:42.879
And the way it's doing this is by

00:42:39.199 --> 00:42:44.039
constructing features of my brain.

00:42:42.880 --> 00:42:44.480
Like how do you figure out what it's

00:42:44.039 --> 00:42:46.800
learning?

00:42:44.480 --> 00:42:49.039
>> Yeah. Oh, oh, I see. So, I'm going to

00:42:46.800 --> 00:42:50.400
give a reference in just a few minutes.

00:42:49.039 --> 00:42:52.199
Read the paper. That was one of the

00:42:50.400 --> 00:42:53.720
first ones to actually visualize what it

00:42:52.199 --> 00:42:54.919
what these things are learning and

00:42:53.719 --> 00:42:56.399
that'll give you an idea of how it

00:42:54.920 --> 00:42:58.079
actually works. And I'm also happy to

00:42:56.400 --> 00:43:00.160
talk about it offline. It's a bit of a a

00:42:58.079 --> 00:43:02.319
tangent, but it's a really rich tangent,

00:43:00.159 --> 00:43:03.399
so if if I keep talking about it, I'll

00:43:02.320 --> 00:43:06.039
end up spending 10 minutes on it, so I'm

00:43:03.400 --> 00:43:06.039
going to back off.

00:43:06.960 --> 00:43:09.679
Okay.

00:43:08.039 --> 00:43:12.320
Um all right.

00:43:09.679 --> 00:43:13.919
So, now once we do that,

00:43:12.320 --> 00:43:16.200
okay? Now we are back in familiar

00:43:13.920 --> 00:43:18.360
territory where we take whatever tensor

00:43:16.199 --> 00:43:20.119
is coming out from these convolutional

00:43:18.360 --> 00:43:22.840
operations and pooling operations and

00:43:20.119 --> 00:43:25.440
then we just flatten them now into

00:43:22.840 --> 00:43:27.720
a long vector. And once we flatten them,

00:43:25.440 --> 00:43:29.240
we can connect them to some good old

00:43:27.719 --> 00:43:30.599
dense layers

00:43:29.239 --> 00:43:32.479
like we know how to do and then we

00:43:30.599 --> 00:43:34.880
finally connect them with whatever, you

00:43:32.480 --> 00:43:36.760
know, output layer you want, right? In

00:43:34.880 --> 00:43:39.480
this case, this example is using some

00:43:36.760 --> 00:43:41.120
multi-class classification of

00:43:39.480 --> 00:43:42.760
classifying images into what kind of

00:43:41.119 --> 00:43:44.719
automobile or whatever it is. So, it's

00:43:42.760 --> 00:43:47.160
like a softmax. So, this is a general

00:43:44.719 --> 00:43:47.159
framework.

00:43:48.639 --> 00:43:52.639
Okay?

00:43:50.039 --> 00:43:52.639
Any questions?

00:43:54.559 --> 00:43:57.639
Yeah.

00:43:55.599 --> 00:44:00.159
Can you explain again how the depth

00:43:57.639 --> 00:44:01.839
increases exactly like Oh, the depth

00:44:00.159 --> 00:44:03.719
increases because you decide what the

00:44:01.840 --> 00:44:05.920
depth is.

00:44:03.719 --> 00:44:07.839
So, when you add a convolutional layer,

00:44:05.920 --> 00:44:09.920
you decide how many filters it has. So,

00:44:07.840 --> 00:44:11.600
you just keep adding more and more

00:44:09.920 --> 00:44:13.320
filters the later on you go in the

00:44:11.599 --> 00:44:14.920
network.

00:44:13.320 --> 00:44:16.600
So, it's in your control. So, remember

00:44:14.920 --> 00:44:18.480
the number of neurons in a hidden layer

00:44:16.599 --> 00:44:19.839
is in your control, right? Similarly,

00:44:18.480 --> 00:44:22.559
the number of filters is in your

00:44:19.840 --> 00:44:24.160
control. It's a design choice.

00:44:22.559 --> 00:44:26.519
And we design it so that the later we

00:44:24.159 --> 00:44:28.159
go, the more depth we have. So, you have

00:44:26.519 --> 00:44:31.800
you stack

00:44:28.159 --> 00:44:35.279
um layers with each of those layers has

00:44:31.800 --> 00:44:37.720
a different filter applied to the end

00:44:35.280 --> 00:44:39.359
Yeah, a layer is made up of filters and

00:44:37.719 --> 00:44:40.919
so the depth just comes from having lots

00:44:39.358 --> 00:44:43.759
and lots and lots of filters. And you

00:44:40.920 --> 00:44:43.760
get to choose what they are.

00:44:44.358 --> 00:44:49.319
All right. So, now let's go to the

00:44:46.639 --> 00:44:51.920
fashion MNIST collab um that I did the

00:44:49.320 --> 00:44:55.559
video walk-through on and then actually

00:44:51.920 --> 00:44:55.559
solve it using a convolutional network.

00:44:56.000 --> 00:44:59.159
All right, cool. So, uh at this point

00:44:58.199 --> 00:45:00.879
I'm going to zip through some of the

00:44:59.159 --> 00:45:02.519
stuff because you know the preliminaries

00:45:00.880 --> 00:45:05.559
have to be done. Import all these

00:45:02.519 --> 00:45:07.320
packages, set the random seed here.

00:45:05.559 --> 00:45:09.320
Great. And then the we will load the

00:45:07.320 --> 00:45:11.519
MNIST data set just like I did in the

00:45:09.320 --> 00:45:13.280
collab yesterday. Uh we create these

00:45:11.519 --> 00:45:14.960
little labels.

00:45:13.280 --> 00:45:17.240
Uh and then we just have these standard

00:45:14.960 --> 00:45:19.320
functions to plot accuracy and loss that

00:45:17.239 --> 00:45:21.159
we've been using so far. All right. Now

00:45:19.320 --> 00:45:24.519
we come to the convolutional thing and

00:45:21.159 --> 00:45:25.960
so as before, we're going to um

00:45:24.519 --> 00:45:27.280
we're going to divide it by 255 to

00:45:25.960 --> 00:45:29.480
normalize everything to a zero to one

00:45:27.280 --> 00:45:31.640
range. Uh let's confirm to make sure

00:45:29.480 --> 00:45:33.599
that the data nothing has gotten

00:45:31.639 --> 00:45:35.679
tampered with. Yep, we have 60,000

00:45:33.599 --> 00:45:37.799
images, each one is 28 by 28 in the

00:45:35.679 --> 00:45:40.559
training set. Now,

00:45:37.800 --> 00:45:42.680
convolutional networks um they expect

00:45:40.559 --> 00:45:44.759
the input to have

00:45:42.679 --> 00:45:46.239
channels, that is, they expect

00:45:44.760 --> 00:45:47.440
an additional dimension which is the

00:45:46.239 --> 00:45:49.679
channel,

00:45:47.440 --> 00:45:50.800
right? Uh the color images have three

00:45:49.679 --> 00:45:52.279
channels,

00:45:50.800 --> 00:45:54.400
but black and white images have only one

00:45:52.280 --> 00:45:56.640
channel, right? One table of numbers.

00:45:54.400 --> 00:45:59.280
So, instead of saying 28 by 28, we tell

00:45:56.639 --> 00:46:01.559
the convolutional layer to expect 28

00:45:59.280 --> 00:46:03.160
by 28 by one.

00:46:01.559 --> 00:46:04.519
It's the same thing conceptually, but

00:46:03.159 --> 00:46:05.639
that's the sort of the format that it

00:46:04.519 --> 00:46:06.679
expects.

00:46:05.639 --> 00:46:09.199
And so,

00:46:06.679 --> 00:46:11.039
uh we go here and then we say, all

00:46:09.199 --> 00:46:12.879
right, there's a thing called expand

00:46:11.039 --> 00:46:14.599
dimension. I'm just telling it to expand

00:46:12.880 --> 00:46:17.200
its dimension and once I do that, you

00:46:14.599 --> 00:46:19.639
can see here it's still 60,000, but

00:46:17.199 --> 00:46:21.799
instead of 28 by 28, it has become 28 by

00:46:19.639 --> 00:46:24.039
28 by one. Same thing.
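
That reshaping step can be sketched with zeros standing in for the real Fashion-MNIST pixels, assuming the standard np.expand_dims call described here:

```python
import numpy as np

# Stand-in for the Fashion-MNIST training set:
# 60,000 grayscale images, 28 x 28 each.
x_train = np.zeros((60000, 28, 28))

# Conv2D expects an explicit channel axis, so add one of size 1 at the end.
x_train = np.expand_dims(x_train, axis=-1)
print(x_train.shape)  # (60000, 28, 28, 1) -- same data, Conv2D-friendly shape
```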

00:46:21.800 --> 00:46:25.920
Okay? Now, let's define our very first

00:46:24.039 --> 00:46:27.440
CNN.

00:46:25.920 --> 00:46:30.240
So, all right.

00:46:27.440 --> 00:46:32.519
As as before, the the input is just

00:46:30.239 --> 00:46:34.239
Keras.input as before, no difference

00:46:32.519 --> 00:46:37.239
here and we tell it the shape and the

00:46:34.239 --> 00:46:39.239
shape is of course just 28 by 28 by one.

00:46:37.239 --> 00:46:40.639
Okay? That's what I have here.

00:46:39.239 --> 00:46:43.839
And then we come to the first

00:46:40.639 --> 00:46:45.679
convolutional block.

00:46:43.840 --> 00:46:47.400
So, and this is the key thing.

00:46:45.679 --> 00:46:49.719
If you want to tell Keras to use a

00:46:47.400 --> 00:46:53.519
convolutional layer,

00:46:49.719 --> 00:46:54.679
you use this keyword layers.Conv2D.

00:46:53.519 --> 00:46:56.519
And from this you can probably also

00:46:54.679 --> 00:46:58.759
figure out that there's a Conv1D and

00:46:56.519 --> 00:47:00.880
there's a Conv3D and so on and so forth,

00:46:58.760 --> 00:47:01.920
which, you know, uh explore. It's really

00:47:00.880 --> 00:47:04.400
good stuff.

00:47:01.920 --> 00:47:06.599
But for image processing, Conv2D is all

00:47:04.400 --> 00:47:09.400
you need. And now we tell it how many

00:47:06.599 --> 00:47:10.920
filters you want. Okay. So, uh we decide

00:47:09.400 --> 00:47:13.240
on the number of filters. So, I've

00:47:10.920 --> 00:47:15.760
decided to have 32 filters. Okay? And

00:47:13.239 --> 00:47:18.199
then we also have to decide the size

00:47:15.760 --> 00:47:19.760
of the filter, right? The simplest size

00:47:18.199 --> 00:47:20.639
is 2 by 2. So, I'm just going to go with

00:47:19.760 --> 00:47:22.760
that.

00:47:20.639 --> 00:47:23.839
Right? Kernel size is 2 by 2.

00:47:22.760 --> 00:47:26.160
And then the activation is of course

00:47:23.840 --> 00:47:27.960
ReLU. I give it a name, convolution one,

00:47:26.159 --> 00:47:29.480
and then I feed it the input. And then

00:47:27.960 --> 00:47:31.679
once I do that, I follow it up with a

00:47:29.480 --> 00:47:33.679
little pooling layer where I use

00:47:31.679 --> 00:47:35.279
MaxPooling2D.

00:47:33.679 --> 00:47:36.639
And MaxPooling2D, you just literally

00:47:35.280 --> 00:47:37.600
pass the input, you get the output back.

00:47:36.639 --> 00:47:39.480
It just

00:47:37.599 --> 00:47:40.719
shrinks everything using pooling.

00:47:39.480 --> 00:47:41.679
So, that is the first convolutional

00:47:40.719 --> 00:47:43.879
block.

00:47:41.679 --> 00:47:45.599
And you know what?

00:47:43.880 --> 00:47:46.440
I know how to cut and paste. Boom, cut

00:47:45.599 --> 00:47:48.119
and paste, I get the second

00:47:46.440 --> 00:47:49.599
convolutional block.

00:47:48.119 --> 00:47:52.358
Okay? Here is the second convolutional

00:47:49.599 --> 00:47:54.199
block. And I know in the lecture I just

00:47:52.358 --> 00:47:56.960
mentioned that as you go deeper, you get

00:47:54.199 --> 00:47:58.199
more depth, but this is

00:47:56.960 --> 00:47:59.480
just a starting point. I'm just going to

00:47:58.199 --> 00:48:01.599
use the same depth. Not a big deal. It's

00:47:59.480 --> 00:48:03.000
a simple problem. So, which is why in

00:48:01.599 --> 00:48:04.559
the second convolutional block I'm still

00:48:03.000 --> 00:48:06.039
using only 32.

00:48:04.559 --> 00:48:07.719
But you can totally go to 64 for

00:48:06.039 --> 00:48:08.639
instance to make it much deeper.

00:48:07.719 --> 00:48:10.679
Okay?

00:48:08.639 --> 00:48:12.159
Uh and once I do that,

00:48:10.679 --> 00:48:14.319
I finally come to the point where I

00:48:12.159 --> 00:48:17.759
flatten everything to a long vector,

00:48:14.320 --> 00:48:19.480
then I connect it to one dense layer of

00:48:17.760 --> 00:48:22.080
256 neurons.

00:48:19.480 --> 00:48:23.559
And then finally, I come to the softmax

00:48:22.079 --> 00:48:26.000
where I have 10 outputs, right? 10

00:48:23.559 --> 00:48:27.880
categories of clothing, softmax, and

00:48:26.000 --> 00:48:30.119
then I tell Keras, okay, take this input

00:48:27.880 --> 00:48:32.160
and the output, string them up together,

00:48:30.119 --> 00:48:33.519
define a model for me.

00:48:32.159 --> 00:48:35.599
So, that's it. That's a convolutional

00:48:33.519 --> 00:48:38.358
network. The new concepts we are seeing

00:48:35.599 --> 00:48:40.960
here are Conv2D for the convolutional

00:48:38.358 --> 00:48:42.440
layer and then MaxPooling2D for the max

00:48:40.960 --> 00:48:43.639
pooling layer.

00:48:42.440 --> 00:48:44.240
Okay? That's it.
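
A sketch of the network being assembled here, reconstructed from the narration (layer names like "convolution_1" are illustrative; the exact notebook code may differ slightly):

```python
import keras
from keras import layers

# Input: 28 x 28 grayscale images with an explicit channel axis.
inputs = keras.Input(shape=(28, 28, 1))

# First convolutional block: Conv2D + max pooling.
x = layers.Conv2D(32, kernel_size=(2, 2), activation="relu",
                  name="convolution_1")(inputs)
x = layers.MaxPooling2D()(x)

# Second convolutional block (same depth here; 64 filters would deepen it).
x = layers.Conv2D(32, kernel_size=(2, 2), activation="relu",
                  name="convolution_2")(x)
x = layers.MaxPooling2D()(x)

# Flatten to a long vector, one dense layer, then a 10-way softmax head.
x = layers.Flatten()(x)
x = layers.Dense(256, activation="relu")(x)
outputs = layers.Dense(10, activation="softmax")(x)

model = keras.Model(inputs=inputs, outputs=outputs)
model.summary()  # roughly 302 thousand trainable parameters
```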

00:48:43.639 --> 00:48:46.839
Uh

00:48:44.239 --> 00:48:49.839
So, let me just run this thing.

00:48:46.840 --> 00:48:52.840
It runs. Okay, good. Yeah.

00:48:49.840 --> 00:48:54.800
Uh how do you decide when to flatten and

00:48:52.840 --> 00:48:56.800
would there ever be a situation in which

00:48:54.800 --> 00:48:59.600
we just kind of use the method that we

00:48:56.800 --> 00:49:00.960
used before and not use a CNN?

00:48:59.599 --> 00:49:02.279
Well, we already tried it with MNIST,

00:49:00.960 --> 00:49:03.039
right? We didn't use a CNN. We just

00:49:02.280 --> 00:49:05.120
flattened right away.

00:49:03.039 --> 00:49:06.719
>> It did work. It's not bad, but we are

00:49:05.119 --> 00:49:08.079
like, you know, can we do better than 85

00:49:06.719 --> 00:49:09.679
or 88 or whatever the percent was,

00:49:08.079 --> 00:49:11.719
right? So, when we are working with

00:49:09.679 --> 00:49:13.239
images, it's typically a good idea to

00:49:11.719 --> 00:49:14.439
just start with a CNN right off the

00:49:13.239 --> 00:49:16.799
bat because you're not losing anything.

00:49:14.440 --> 00:49:19.320
You're not giving up anything.

00:49:16.800 --> 00:49:20.960
So, uh in terms of how many uh layers

00:49:19.320 --> 00:49:23.120
you should have, my philosophy is start

00:49:20.960 --> 00:49:27.079
simple and if it works, stop working on

00:49:23.119 --> 00:49:28.480
it. If it doesn't, add more layers.

00:49:27.079 --> 00:49:30.440
Uh yeah.

00:49:28.480 --> 00:49:32.358
Yeah, just to uh is it the architecture

00:49:30.440 --> 00:49:34.358
design, the number of filters, kernel

00:49:32.358 --> 00:49:36.159
size, number of layers, convolution

00:49:34.358 --> 00:49:37.719
pooling, is that just all based on trial

00:49:36.159 --> 00:49:39.440
and error? Yeah, so

00:49:37.719 --> 00:49:41.359
typically it's based on trial and error,

00:49:39.440 --> 00:49:42.679
Um to answer your question. But as you

00:49:41.360 --> 00:49:44.559
will see in the transfer learning

00:49:42.679 --> 00:49:46.719
discussion we're going to have soon,

00:49:44.559 --> 00:49:48.639
you can actually, instead of doing

00:49:46.719 --> 00:49:50.679
anything from scratch, it's much better

00:49:48.639 --> 00:49:51.839
to just download a pre-trained model and

00:49:50.679 --> 00:49:54.039
just adapt it for your particular

00:49:51.840 --> 00:49:55.680
problem. That is actually the norm by

00:49:54.039 --> 00:49:57.320
which people do these things. The reason

00:49:55.679 --> 00:50:00.319
I'm doing it from scratch is because you

00:49:57.320 --> 00:50:01.800
should know how it was done.

00:50:00.320 --> 00:50:03.880
Like you it should not be a black box to

00:50:01.800 --> 00:50:05.080
you. That's my goal.

00:50:03.880 --> 00:50:07.039
Yeah.

00:50:05.079 --> 00:50:09.719
Just from a notation perspective, I

00:50:07.039 --> 00:50:11.159
noticed you named all of these layers X.

00:50:09.719 --> 00:50:12.639
Is that a habit we should get into

00:50:11.159 --> 00:50:12.759
naming them all the same or is that just

00:50:12.639 --> 00:50:15.199
a

00:50:12.760 --> 00:50:17.880
>> Actually, I'm not naming the layers as

00:50:15.199 --> 00:50:19.719
X. What's going on here is I'm

00:50:17.880 --> 00:50:21.079
feeding it X.

00:50:19.719 --> 00:50:22.679
And whatever is coming out of it, I'm

00:50:21.079 --> 00:50:23.920
just calling it X.

00:50:22.679 --> 00:50:25.679
That's all. It's just a notational

00:50:23.920 --> 00:50:27.280
convenience for me. I'm just

00:50:25.679 --> 00:50:28.679
calling both the input and the output X, and

00:50:27.280 --> 00:50:29.760
Keras under the hood will track

00:50:28.679 --> 00:50:31.319
everything and make sure the right thing

00:50:29.760 --> 00:50:33.920
happens. Otherwise, I'd have to be like

00:50:31.320 --> 00:50:35.360
X1, X2, X3, X4 and then if I want to add

00:50:33.920 --> 00:50:37.320
a new layer somewhere in the middle

00:50:35.360 --> 00:50:39.160
between X3 and X4, I have to call that

00:50:37.320 --> 00:50:41.360
X4 and then I'll change everything to 5,

00:50:39.159 --> 00:50:42.839
6, 7. Complete pain in the neck. That's

00:50:41.360 --> 00:50:46.760
why I do this.
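
A stand-in illustration of that rebinding idiom. Here layer_a and layer_b are made-up placeholder functions, not Keras layers, but the pattern is the same: each output is rebound to the same variable, so inserting a new layer in the middle never forces a rename cascade.

```python
# Toy stand-ins for layers: each takes a value in and returns a value out.
def layer_a(t):
    return t + 1

def layer_b(t):
    return t * 2

x = 10          # the "input"
x = layer_a(x)  # output of layer A, rebound to x
x = layer_b(x)  # output of layer B; in Keras, the framework tracks the graph
print(x)        # 22
```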

00:50:42.840 --> 00:50:51.039
All right. So, model.summary

00:50:46.760 --> 00:50:53.160
It has got 302 thousand parameters. I'll

00:50:51.039 --> 00:50:56.199
just plot it.

00:50:53.159 --> 00:50:58.519
Great. And I encourage you to hand

00:50:56.199 --> 00:51:00.359
calculate it later on and make sure the

00:50:58.519 --> 00:51:03.679
numbers tally, okay?
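
The hand calculation suggested here can be done like this, assuming 2 by 2 kernels with valid padding and default 2 by 2 pooling, as in the walkthrough:

```python
# Hand-tally the parameter count of the CNN from the walkthrough:
# Conv2D(32, 2x2) -> MaxPool -> Conv2D(32, 2x2) -> MaxPool -> Flatten
# -> Dense(256) -> Dense(10, softmax), on 28x28x1 inputs.

def conv_params(k, in_ch, filters):
    # Each filter: k*k*in_ch weights plus 1 bias.
    return (k * k * in_ch + 1) * filters

def dense_params(n_in, n_out):
    return n_in * n_out + n_out

total = conv_params(2, 1, 32)           # conv block 1: 160
side = (28 - 2 + 1) // 2                # 27 after conv, 13 after 2x2 pool
total += conv_params(2, 32, 32)         # conv block 2: 4,128
side = (side - 2 + 1) // 2              # 12 after conv, 6 after pool
total += dense_params(side * side * 32, 256)  # flatten 1,152 -> 295,168
total += dense_params(256, 10)          # softmax head: 2,570
print(total)  # 302026 -- the "302 thousand" reported by model.summary()
```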

00:51:00.360 --> 00:51:06.320
For now, let's just go. So, as before,

00:51:03.679 --> 00:51:08.399
we'll just use the same compilation.

00:51:06.320 --> 00:51:11.080
We'll use Adam and then we'll train it

00:51:08.400 --> 00:51:13.119
for, you know, just 10 epochs. We'll use

00:51:11.079 --> 00:51:15.360
a validation split again, as usual, of

00:51:13.119 --> 00:51:17.519
20%. So, let's just run it.

00:51:15.360 --> 00:51:18.720
So, it's actually going to run. And as

00:51:17.519 --> 00:51:19.759
you will see,

00:51:18.719 --> 00:51:20.959
convolutional networks there's a lot

00:51:19.760 --> 00:51:23.560
more going on, so it's going to be a bit

00:51:20.960 --> 00:51:25.400
slower to run. Hopefully not too much

00:51:23.559 --> 00:51:28.599
slower.

00:51:25.400 --> 00:51:28.599
While it's doing, other questions?

00:51:31.000 --> 00:51:34.679
So, if we have a task other than image

00:51:32.840 --> 00:51:35.880
classification, do we still flatten the

00:51:34.679 --> 00:51:37.399
model first, like if it's

00:51:35.880 --> 00:51:39.000
segmentation?

00:51:37.400 --> 00:51:41.480
Yeah, so this is for image

00:51:39.000 --> 00:51:42.920
classification. For other kinds of

00:51:41.480 --> 00:51:44.240
applications,

00:51:42.920 --> 00:51:45.840
typically you run it through a bunch of

00:51:44.239 --> 00:51:46.639
convolutional layers and so on and so

00:51:45.840 --> 00:51:48.840
forth.

00:51:46.639 --> 00:51:51.759
But the output side of the equation gets

00:51:48.840 --> 00:51:53.880
much more complicated because if instead

00:51:51.760 --> 00:51:56.360
of classifying just

00:51:53.880 --> 00:51:58.800
the whole picture into, you know, dog or

00:51:56.360 --> 00:52:01.280
cat, if you have to take every pixel and

00:51:58.800 --> 00:52:03.320
classify it, right? Then, well, you

00:52:01.280 --> 00:52:06.320
better have an output shape that is the

00:52:03.320 --> 00:52:07.640
same dimensions as the input shape.

00:52:06.320 --> 00:52:09.800
So, for that we use a different

00:52:07.639 --> 00:52:11.119
architecture. It's called U-Net

00:52:09.800 --> 00:52:13.120
and so on, which unfortunately I won't

00:52:11.119 --> 00:52:14.599
be able to get into. But I know I am

00:52:13.119 --> 00:52:17.319
planning to post another video

00:52:14.599 --> 00:52:19.440
walk-through where I show you how to use

00:52:17.320 --> 00:52:22.160
the Hugging Face Hub

00:52:19.440 --> 00:52:23.880
to very quickly build models for the

00:52:22.159 --> 00:52:26.039
other applications like segmentation and

00:52:23.880 --> 00:52:27.280
so on. I'm hoping to post that tomorrow.

00:52:26.039 --> 00:52:29.440
It's an optional viewing thing that

00:52:27.280 --> 00:52:32.400
might help with that.

00:52:29.440 --> 00:52:35.280
Okay. So, is it done? Okay, good. It's

00:52:32.400 --> 00:52:36.760
done. All right, let's plot the

00:52:35.280 --> 00:52:38.240
thing here.

00:52:36.760 --> 00:52:40.480
All right, so it seems like training is

00:52:38.239 --> 00:52:42.639
going down nicely. Validation

00:52:40.480 --> 00:52:45.000
is sort of flattening out somewhere here

00:52:42.639 --> 00:52:47.359
around the eighth epoch. Let's look at

00:52:45.000 --> 00:52:48.840
the accuracy.

00:52:47.360 --> 00:52:51.440
Same situation here. The accuracy is in

00:52:48.840 --> 00:52:52.960
the 90s. Of course, the final question,

00:52:51.440 --> 00:52:55.639
is how well it does

00:52:52.960 --> 00:52:55.639
on the test set.

00:52:55.840 --> 00:52:59.440
Whoa, 90.5%.

00:52:58.360 --> 00:53:00.720
Pretty good.

00:52:59.440 --> 00:53:04.200
By the way, if you're not impressed that

00:53:00.719 --> 00:53:05.959
we went from 88 to 90,

00:53:04.199 --> 00:53:07.599
These applications are the

00:53:05.960 --> 00:53:09.639
proverbial sort of diminishing returns

00:53:07.599 --> 00:53:11.880
problems, okay? So, what you should

00:53:09.639 --> 00:53:13.920
always think of is look at the amount of

00:53:11.880 --> 00:53:16.920
error that's left and ask yourself how

00:53:13.920 --> 00:53:20.119
much of that error am I able to reduce?

00:53:16.920 --> 00:53:22.079
So, we had roughly 12% of error left

00:53:20.119 --> 00:53:24.279
when we did the simple collab yesterday.

00:53:22.079 --> 00:53:26.119
From that 12%, we have knocked off two

00:53:24.280 --> 00:53:27.240
points to get to over 90, which is

00:53:26.119 --> 00:53:28.119
amazing.

00:53:27.239 --> 00:53:29.639
Okay?

00:53:28.119 --> 00:53:31.119
And in fact, I think the state of the

00:53:29.639 --> 00:53:32.279
art on this

00:53:31.119 --> 00:53:34.400
um

00:53:32.280 --> 00:53:36.760
is 97%.

00:53:34.400 --> 00:53:39.039
So, I invite you

00:53:36.760 --> 00:53:40.480
to take this thing and try different

00:53:39.039 --> 00:53:42.800
filters and so on and so forth to see if

00:53:40.480 --> 00:53:45.960
you can get to the the mid-90s.

00:53:42.800 --> 00:53:48.039
It's not easy, but try it. Yeah.

00:53:45.960 --> 00:53:50.159
Does the number of epochs have to be

00:53:48.039 --> 00:53:52.960
related to the number of batches?

00:53:50.159 --> 00:53:55.199
Because you did a batch size of 64 and 10 epochs? >> No,

00:53:52.960 --> 00:53:56.800
the epochs are independent;

00:53:55.199 --> 00:53:58.319
the epochs is just the number of passes

00:53:56.800 --> 00:54:01.320
through the whole data.

00:53:58.320 --> 00:54:03.000
But within each pass, within each epoch,

00:54:01.320 --> 00:54:05.039
the num the batch size tells you how

00:54:03.000 --> 00:54:06.599
many batches you're going to process.

00:54:05.039 --> 00:54:08.079
So, it is basically the number of

00:54:06.599 --> 00:54:10.480
examples you have in your training data

00:54:08.079 --> 00:54:11.679
divided by the batch size that you have

00:54:10.480 --> 00:54:13.960
chosen,

00:54:11.679 --> 00:54:16.879
right? That number rounded up is the

00:54:13.960 --> 00:54:18.480
number of batches within each epoch.
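That calculation, with the numbers used here (60,000 Fashion MNIST images, a 20% validation split, batch size 64), works out like this:

```python
import math

# Batches (steps) per epoch = ceil(training examples / batch size).
def steps_per_epoch(n_examples, batch_size):
    return math.ceil(n_examples / batch_size)

n_train = int(60_000 * 0.8)           # 48,000 after the 20% validation split
print(steps_per_epoch(n_train, 64))   # 750 batches per epoch
print(steps_per_epoch(98, 32))        # 4 (the division rounds up)
```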

00:54:16.880 --> 00:54:20.559
And here I'm just choosing 10 because,

00:54:18.480 --> 00:54:23.119
you know,

00:54:20.559 --> 00:54:24.719
Siri found something on the web. Okay.

00:54:23.119 --> 00:54:26.519
I chose 10 because it's going to be fast

00:54:24.719 --> 00:54:27.439
for me to do in class. And 10

00:54:26.519 --> 00:54:28.320
is actually more than enough because you

00:54:27.440 --> 00:54:30.800
can see it's already beginning to

00:54:28.320 --> 00:54:30.800
overfit.

00:54:31.000 --> 00:54:33.320
Yeah.

00:54:33.599 --> 00:54:37.559
This is more of a conceptual question,

00:54:35.639 --> 00:54:39.920
but is it always the case that a neural

00:54:37.559 --> 00:54:42.400
network will have better accuracy than

00:54:39.920 --> 00:54:44.440
a classical machine learning algorithm?

00:54:42.400 --> 00:54:45.960
And I'm asking more on the case of like

00:54:44.440 --> 00:54:46.720
the heart disease problem. Oh, yeah,

00:54:45.960 --> 00:54:49.000
yeah.

00:54:46.719 --> 00:54:50.519
Great question. So, neural networks are

00:54:49.000 --> 00:54:52.039
really good for unstructured data like

00:54:50.519 --> 00:54:53.159
the ones we're having here. But if you

00:54:52.039 --> 00:54:55.199
have structured data like the heart

00:54:53.159 --> 00:54:57.519
disease problem, sometimes it actually

00:54:55.199 --> 00:54:59.799
works really well. Sometimes

00:54:57.519 --> 00:55:01.840
things like gradient boosting, XGBoost,

00:54:59.800 --> 00:55:03.440
work really well. So, if I am actually

00:55:01.840 --> 00:55:04.600
working on a structured data problem,

00:55:03.440 --> 00:55:06.119
I'll try both.

00:55:04.599 --> 00:55:07.239
I'm not going to axiomatically assume

00:55:06.119 --> 00:55:09.319
that the DNN is going to be the best

00:55:07.239 --> 00:55:11.679
thing. But if you have unstructured data,

00:55:09.320 --> 00:55:13.160
it's the best game in town.

00:55:11.679 --> 00:55:14.319
All right. Um

00:55:13.159 --> 00:55:15.480
I'm just going to

00:55:14.320 --> 00:55:16.480
By the way, I have a whole section here

00:55:15.480 --> 00:55:17.679
on once you build a model, how do you

00:55:16.480 --> 00:55:19.320
actually improve it?

00:55:17.679 --> 00:55:20.440
Right? Check it out. It's an optional

00:55:19.320 --> 00:55:22.559
thing.

00:55:20.440 --> 00:55:23.880
All right, I'm going to stop this here.

00:55:22.559 --> 00:55:25.559
All right. So, the next thing I want to

00:55:23.880 --> 00:55:27.599
do is

00:55:25.559 --> 00:55:29.559
So, we went from 88 to 90 plus percent,

00:55:27.599 --> 00:55:31.639
right? Using convolutional networks.

00:55:29.559 --> 00:55:33.000
Now, let's work with color images. Let's

00:55:31.639 --> 00:55:34.960
kick it up a notch.

00:55:33.000 --> 00:55:36.840
So, um

00:55:34.960 --> 00:55:38.880
I actually

00:55:36.840 --> 00:55:40.120
web scraped

00:55:38.880 --> 00:55:42.680
all these pictures for you folks, for

00:55:40.119 --> 00:55:44.759
your enjoyment. I web scraped about 100

00:55:42.679 --> 00:55:46.599
color images of handbags and shoes.

00:55:44.760 --> 00:55:48.600
Roughly 100 handbags and 100

00:55:46.599 --> 00:55:51.159
shoes. So, the question is with these

00:55:48.599 --> 00:55:52.239
essentially 200 images,

00:55:51.159 --> 00:55:54.679
can we build a really good neural

00:55:52.239 --> 00:55:56.079
network to classify handbags and shoes?

00:55:54.679 --> 00:55:58.039
Right? It seems kind of absurd, right?

00:55:56.079 --> 00:55:59.519
Because 200 examples, I mean, it's not

00:55:58.039 --> 00:56:02.759
that much, right? It doesn't feel like a

00:55:59.519 --> 00:56:04.239
lot. Fashion MNIST has 60,000

00:56:02.760 --> 00:56:06.080
images.

00:56:04.239 --> 00:56:07.639
Right? And, you know, even

00:56:06.079 --> 00:56:09.199
with that we are overfitting in like 5,

00:56:07.639 --> 00:56:10.599
6, 7, 8 epochs.

00:56:09.199 --> 00:56:11.879
With 200 images, maybe, you know, is

00:56:10.599 --> 00:56:13.199
there any hope? Obviously, there is

00:56:11.880 --> 00:56:15.160
hope, otherwise it won't be in the

00:56:13.199 --> 00:56:16.319
lecture. So, yeah. So, we're going to

00:56:15.159 --> 00:56:18.119
take this data set and let's see what we

00:56:16.320 --> 00:56:19.519
can do with it. So, we'll first actually

00:56:18.119 --> 00:56:22.119
build a convolutional network from

00:56:19.519 --> 00:56:24.519
scratch to solve this problem. Okay?

00:56:22.119 --> 00:56:24.519
All right.

00:56:24.679 --> 00:56:27.519
I'm actually going to run through the

00:56:25.599 --> 00:56:29.480
code because at the end of it we'll have

00:56:27.519 --> 00:56:31.519
a live demo. So, I would like one

00:56:29.480 --> 00:56:34.840
volunteer to give me a handbag and one

00:56:31.519 --> 00:56:37.280
volunteer to give me their footwear.

00:56:34.840 --> 00:56:40.880
Boy, in class.

00:56:37.280 --> 00:56:42.400
Okay. So, all right. Unlike the previous

00:56:40.880 --> 00:56:44.760
data set, this one actually I just web

00:56:42.400 --> 00:56:46.280
scraped it. So, you know,

00:56:44.760 --> 00:56:47.359
I've stuck it in this Dropbox

00:56:46.280 --> 00:56:49.120
folder.

00:56:47.358 --> 00:56:51.519
Let's just download it and unzip it. And

00:56:49.119 --> 00:56:54.920
once we do that, we have to now organize

00:56:51.519 --> 00:56:57.119
it with these 200 images. So,

00:56:54.920 --> 00:57:00.519
I have to do some sort of

00:56:57.119 --> 00:57:02.400
boring-ish Python stuff here.

00:57:00.519 --> 00:57:04.639
So, here what we're doing is that we

00:57:02.400 --> 00:57:06.440
have 100 handbags, roughly 100 shoes.

00:57:04.639 --> 00:57:08.599
And what this code is doing is it's

00:57:06.440 --> 00:57:10.280
actually creating a directory structure:

00:57:08.599 --> 00:57:12.400
it's splitting stuff into train and

00:57:10.280 --> 00:57:13.960
validation and test. And then for each

00:57:12.400 --> 00:57:16.480
of the splits it's doing the handbags

00:57:13.960 --> 00:57:18.960
and the shoes folder. Okay? So, once we

00:57:16.480 --> 00:57:20.679
do that, basically this directory

00:57:18.960 --> 00:57:23.199
structure is created.

00:57:20.679 --> 00:57:25.079
Okay? Training, validation folder, test

00:57:23.199 --> 00:57:26.199
folder, handbags and shoes. In fact,

00:57:25.079 --> 00:57:27.039
I think you can see it

00:57:26.199 --> 00:57:29.679
here.

00:57:27.039 --> 00:57:31.559
See here, handbags and shoes. And within

00:57:29.679 --> 00:57:33.119
that, there is, you know, train, test,

00:57:31.559 --> 00:57:34.960
validation. And within each of these,

00:57:33.119 --> 00:57:36.319
there's handbags and shoes. So, the idea

00:57:34.960 --> 00:57:37.840
is that when you're working with images,

00:57:36.320 --> 00:57:40.400
right? What you can do is you can just

00:57:37.840 --> 00:57:42.358
create folders for each kind of image,

00:57:40.400 --> 00:57:43.760
right? Let's say dogs, cats,

00:57:42.358 --> 00:57:46.480
two folders with cat images and dog

00:57:43.760 --> 00:57:47.800
images and then just point Keras at it.

00:57:46.480 --> 00:57:49.559
It'll automatically figure out those are

00:57:47.800 --> 00:57:50.560
the labels.

00:57:49.559 --> 00:57:51.639
It makes it easy for you. So, it's very

00:57:50.559 --> 00:57:52.639
convenient when you're working with

00:57:51.639 --> 00:57:53.960
images.

00:57:52.639 --> 00:57:55.799
And the book explains this thing in

00:57:53.960 --> 00:57:56.920
great detail.

00:57:55.800 --> 00:57:58.600
All right. So, when working with these

00:57:56.920 --> 00:58:00.440
images, color images, we'll follow this

00:57:58.599 --> 00:58:02.279
process. We'll read in the JPEGs. We'll

00:58:00.440 --> 00:58:03.559
convert them to tensors. And then since

00:58:02.280 --> 00:58:05.040
I'm web scraping it, they all come in

00:58:03.559 --> 00:58:06.880
different shapes and sizes. So, I need

00:58:05.039 --> 00:58:08.719
to like bring it all to the same size.

00:58:06.880 --> 00:58:10.599
Okay? I resize it and then I'm going to

00:58:08.719 --> 00:58:13.319
batch it into whatever. I'm going to

00:58:10.599 --> 00:58:16.639
batch it using a batch size of 32 here.

00:58:13.320 --> 00:58:19.640
So, and this utility from Keras will do

00:58:16.639 --> 00:58:20.920
all that for you, right? Very quickly.

00:58:19.639 --> 00:58:23.358
So, basically what it says is that I

00:58:20.920 --> 00:58:25.440
found 98 images in the

00:58:23.358 --> 00:58:28.000
training data belonging to two classes,

00:58:25.440 --> 00:58:29.559
49 in the validation and 38 in the test.

00:58:28.000 --> 00:58:31.679
So, less than 100 examples in the

00:58:29.559 --> 00:58:33.960
training set. That's what we have here.
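The folder-per-class loading described above can be sketched like so. It is runnable, but a fabricated on-disk dataset of random images stands in for the scraped download; the directory names match the lecture's layout.

```python
import os
import tempfile

import numpy as np
import keras

# Fabricate a tiny folder-per-class layout: root/handbags, root/shoes.
root = tempfile.mkdtemp()
for cls in ("handbags", "shoes"):
    os.makedirs(os.path.join(root, cls))
    for i in range(4):
        img = (np.random.rand(64, 64, 3) * 255).astype("uint8")
        keras.utils.save_img(os.path.join(root, cls, f"{i}.png"), img)

# Point Keras at the root folder; it infers the labels from the
# subfolder names, resizes every image, and batches them.
ds = keras.utils.image_dataset_from_directory(
    root, image_size=(224, 224), batch_size=32)

print(ds.class_names)  # ['handbags', 'shoes']
```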

00:58:31.679 --> 00:58:35.879
All right. What's the time? 9:30. Okay.

00:58:33.960 --> 00:58:38.800
So, all right. Now, let us check the

00:58:35.880 --> 00:58:40.480
dimensions to make sure. Good. So, 224 by

00:58:38.800 --> 00:58:43.039
224 by 3. And why did I pick

00:58:40.480 --> 00:58:45.039
224 by 224? As you will see later, we're

00:58:43.039 --> 00:58:47.039
going to use something called ResNet

00:58:45.039 --> 00:58:49.599
and the ResNet expects it to be 224 by

00:58:47.039 --> 00:58:52.719
224 by 3. That's why I resized it to 224 by

00:58:49.599 --> 00:58:56.400
224. Let's look at a few examples of my

00:58:52.719 --> 00:58:56.399
wonderful web scraping in action.

00:59:01.079 --> 00:59:04.519
It's pretty wild, right?

00:59:02.920 --> 00:59:07.000
Okay. So, now let's do a

00:59:04.519 --> 00:59:09.000
simple convolutional network. Um

00:59:07.000 --> 00:59:10.639
And before we would take all the X

00:59:09.000 --> 00:59:13.480
values in Fashion MNIST and divide them

00:59:10.639 --> 00:59:14.559
manually by 255 to normalize it to 0 1.

00:59:13.480 --> 00:59:16.240
Well, you know what? We are actually

00:59:14.559 --> 00:59:17.759
graduating to the higher levels of Keras

00:59:16.239 --> 00:59:19.319
now. So, let's not do that, right?

00:59:17.760 --> 00:59:21.240
Manual stuff is bad. So, we'll do it

00:59:19.320 --> 00:59:22.720
within Keras by using something called

00:59:21.239 --> 00:59:24.479
the rescaling layer where we just tell

00:59:22.719 --> 00:59:26.399
it how much to rescale and boom, it'll

00:59:24.480 --> 00:59:28.559
do it for you. The first convolution

00:59:26.400 --> 00:59:31.519
block, just like the Fashion MNIST 32,

00:59:28.559 --> 00:59:33.440
second block, again 32, max pool,

00:59:31.519 --> 00:59:35.199
flatten. And then here we only have

00:59:33.440 --> 00:59:36.599
handbags which are shoes, just a sigmoid

00:59:35.199 --> 00:59:38.079
is enough, right? It's just a binary

00:59:36.599 --> 00:59:40.440
classification problem. So, I'm just

00:59:38.079 --> 00:59:42.239
using one output layer with a sigmoid,

00:59:40.440 --> 00:59:43.840
and that's our model. So, let's do the

00:59:42.239 --> 00:59:47.279
model.

00:59:43.840 --> 00:59:47.280
All right, model summary.
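Assembled in code, the model just described looks roughly like this. A sketch: the rescaling layer, the two 32-filter conv blocks, and the single sigmoid output follow the lecture's description, while the 3x3 kernels and 2x2 pooling are assumptions on my part.

```python
import keras
from keras import layers

model = keras.Sequential([
    keras.Input(shape=(224, 224, 3)),
    layers.Rescaling(1.0 / 255),            # normalize inside the model
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),  # binary: handbag vs. shoe
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",   # binary problem, so BCE
              metrics=["accuracy"])
```

With these assumed kernel and pool sizes the model comes out to roughly 103,000 parameters, in the same ballpark as the one in class.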

00:59:48.440 --> 00:59:54.360
About 103,000 parameters in this little

00:59:52.079 --> 00:59:56.519
model. Okay, let's compile it and run

00:59:54.360 --> 00:59:57.720
it. Uh, and note here because it's a

00:59:56.519 --> 00:59:59.480
binary

00:59:57.719 --> 01:00:02.000
classification problem, I'm using binary

00:59:59.480 --> 01:00:03.320
cross entropy.

01:00:02.000 --> 01:00:05.880
Same Adam.

01:00:03.320 --> 01:00:07.440
And accuracy, compile, and then boom,

01:00:05.880 --> 01:00:08.519
let's run it. We'll run it for 20

01:00:07.440 --> 01:00:10.800
epochs.

01:00:08.519 --> 01:00:10.800
Hopefully.

01:00:12.320 --> 01:00:17.760
Okay, while it's doing this business,

01:00:13.760 --> 01:00:19.400
I'm going to shift to the PowerPoint.

01:00:17.760 --> 01:00:21.480
So, we'll go back to see how well it

01:00:19.400 --> 01:00:23.039
did, but the question is, uh, whatever

01:00:21.480 --> 01:00:23.960
it did, we built it from scratch. So,

01:00:23.039 --> 01:00:26.440
the question is, can we do better than

01:00:23.960 --> 01:00:28.079
that? Okay? Because we only have 100

01:00:26.440 --> 01:00:29.480
examples of each class, which brings

01:00:28.079 --> 01:00:31.440
us to something very cool and very

01:00:29.480 --> 01:00:33.240
powerful called transfer learning. And

01:00:31.440 --> 01:00:34.519
the idea, so the key thing is there are

01:00:33.239 --> 01:00:36.000
two research trends that are going on

01:00:34.519 --> 01:00:38.199
that we take advantage of. The first one

01:00:36.000 --> 01:00:40.320
is that researchers have designed, you

01:00:38.199 --> 01:00:42.439
know, architectures which

01:00:40.320 --> 01:00:43.840
exploit the kind of input you have. So,

01:00:42.440 --> 01:00:45.639
Olivia asked the question, if you have a

01:00:43.840 --> 01:00:47.320
particular kind of input images, do you

01:00:45.639 --> 01:00:49.079
actually change the input, or do you

01:00:47.320 --> 01:00:50.680
actually change the network? As it turns

01:00:49.079 --> 01:00:52.039
out, here, for example, if it's images,

01:00:50.679 --> 01:00:53.679
we know that we should use convolutional

01:00:52.039 --> 01:00:55.759
layers because convolutional layers were

01:00:53.679 --> 01:00:57.159
designed to exploit the image-ness of

01:00:55.760 --> 01:00:59.680
the input.

01:00:57.159 --> 01:01:01.559
Okay? Similarly, if you have sequences

01:00:59.679 --> 01:01:03.719
of information, like obviously natural

01:01:01.559 --> 01:01:05.320
language, audio, video, gene sequences,

01:01:03.719 --> 01:01:07.119
and so on, so forth, these things called

01:01:05.320 --> 01:01:08.360
transformers were invented

01:01:07.119 --> 01:01:09.480
to exploit them, and we're going to

01:01:08.360 --> 01:01:11.720
spend a lot of time on transformers

01:01:09.480 --> 01:01:13.320
starting next week. So, that's the first

01:01:11.719 --> 01:01:15.959
trend. The second trend is that

01:01:13.320 --> 01:01:19.000
researchers have used these innovations

01:01:15.960 --> 01:01:21.880
to actually create and train models on

01:01:19.000 --> 01:01:23.719
vast data sets, and thankfully, they've

01:01:21.880 --> 01:01:26.760
made them publicly available for us to

01:01:23.719 --> 01:01:28.439
use. So, transfer learning is the idea

01:01:26.760 --> 01:01:30.080
that if you have a particular problem,

01:01:28.440 --> 01:01:32.240
let's just take a pre-trained network

01:01:30.079 --> 01:01:33.840
somebody may have already created,

01:01:32.239 --> 01:01:35.519
and then let's just customize it to our

01:01:33.840 --> 01:01:37.079
problem, rather than actually build

01:01:35.519 --> 01:01:39.559
anything from scratch.

01:01:37.079 --> 01:01:41.599
Okay, that's the basic idea. So,

01:01:39.559 --> 01:01:43.519
so here we have this basically we have

01:01:41.599 --> 01:01:45.079
to build a classifier which takes in an

01:01:43.519 --> 01:01:46.759
arbitrary image and figures out if it's

01:01:45.079 --> 01:01:47.799
a handbag or a shoe, right? That's our

01:01:46.760 --> 01:01:49.800
goal.

01:01:47.800 --> 01:01:51.320
And so, now handbags and shoes are

01:01:49.800 --> 01:01:53.680
everyday objects, and so what you can do

01:01:51.320 --> 01:01:55.200
is, hmm, you can look around and see

01:01:53.679 --> 01:01:57.919
if there are any networks that have been

01:01:55.199 --> 01:02:00.359
trained by other people which actually

01:01:57.920 --> 01:02:02.599
have been trained on everyday images.

01:02:00.360 --> 01:02:04.000
Right? As opposed to like MRI or X-rays,

01:02:02.599 --> 01:02:05.400
right? Specialized images, everyday

01:02:04.000 --> 01:02:07.039
images. Of course, the first thing you

01:02:05.400 --> 01:02:08.960
should probably do is to see if anybody

01:02:07.039 --> 01:02:10.800
has built the specific thing you want,

01:02:08.960 --> 01:02:12.519
handbag shoes classifier on GitHub.

01:02:10.800 --> 01:02:15.800
Assuming it's not, then you do transfer

01:02:12.519 --> 01:02:17.800
learning. Okay? So, now it turns out

01:02:15.800 --> 01:02:19.360
that there's this thing called ImageNet,

01:02:17.800 --> 01:02:22.080
which is a database of millions of

01:02:19.360 --> 01:02:24.079
images of everyday objects in a thousand

01:02:22.079 --> 01:02:26.199
different categories, furniture,

01:02:24.079 --> 01:02:28.719
animals, automobiles, you get the idea.

01:02:26.199 --> 01:02:29.919
Okay? And so, we can look for the

01:02:28.719 --> 01:02:31.599
networks that have been trained on

01:02:29.920 --> 01:02:33.200
ImageNet.

01:02:31.599 --> 01:02:36.360
Okay, let me just go back to the collab

01:02:33.199 --> 01:02:36.359
just to make sure it doesn't time out.

01:02:37.519 --> 01:02:44.039
All right, so it has finished doing it.

01:02:40.079 --> 01:02:44.039
Um, let's just plot these things.

01:02:48.599 --> 01:02:51.199
Okay, so

01:02:49.920 --> 01:02:52.920
uh, there is some overfitting that

01:02:51.199 --> 01:02:55.159
happens around here

01:02:52.920 --> 01:02:57.760
on the 10th epoch. Let's

01:02:55.159 --> 01:02:57.759
look at the

01:02:59.239 --> 01:03:03.919
So, the training accuracy is

01:03:01.039 --> 01:03:04.920
actually getting to almost to 100%. But

01:03:03.920 --> 01:03:06.760
we're not interested in training

01:03:04.920 --> 01:03:08.880
accuracy, right? We care about

01:03:06.760 --> 01:03:10.200
validation and test accuracy, and that

01:03:08.880 --> 01:03:13.000
seems to be kind of hovering around in

01:03:10.199 --> 01:03:15.159
the 80s. Um, so let's just evaluate it

01:03:13.000 --> 01:03:19.360
anyway to see what happens.

01:03:15.159 --> 01:03:20.960
Okay, so it gets to 87% accuracy

01:03:19.360 --> 01:03:22.320
on this data set.

01:03:20.960 --> 01:03:24.760
It's actually pretty good given that we

01:03:22.320 --> 01:03:26.320
only have 100 examples. So, 87%

01:03:24.760 --> 01:03:28.320
accuracy, but we pre-trained the whole

01:03:26.320 --> 01:03:31.280
thing. I'm sorry, we did everything from

01:03:28.320 --> 01:03:32.600
scratch. Okay? Now, then

01:03:31.280 --> 01:03:35.280
I'm going to there's this whole section

01:03:32.599 --> 01:03:38.079
about data augmentation, which, um, you

01:03:35.280 --> 01:03:40.040
know what? Do we have time?

01:03:38.079 --> 01:03:42.799
So,

01:03:40.039 --> 01:03:44.320
so the idea of augmentation is that when

01:03:42.800 --> 01:03:45.800
you have an image,

01:03:44.320 --> 01:03:49.160
let's say you take this image, and you

01:03:45.800 --> 01:03:51.359
just rotate it slightly by 10°.

01:03:49.159 --> 01:03:52.960
If it's a handbag before you rotated it,

01:03:51.358 --> 01:03:54.199
it sure as hell is a handbag after you

01:03:52.960 --> 01:03:55.119
rotated it.

01:03:54.199 --> 01:03:56.679
Right?

01:03:55.119 --> 01:03:57.920
The meaning of the

01:03:56.679 --> 01:04:00.000
image doesn't change just because you

01:03:57.920 --> 01:04:01.358
rotated it slightly. Or maybe you zoom

01:04:00.000 --> 01:04:03.639
in slightly, you zoom out slightly, you

01:04:01.358 --> 01:04:05.358
crop it slightly, nothing happens.

01:04:03.639 --> 01:04:07.440
So, what you can do is you can take any

01:04:05.358 --> 01:04:08.880
image you have, and you just perturb it

01:04:07.440 --> 01:04:10.920
slightly,

01:04:08.880 --> 01:04:14.079
like right there, and then add it as a

01:04:10.920 --> 01:04:15.800
new example to your training data.

01:04:14.079 --> 01:04:16.960
This is an unbelievable free lunch,

01:04:15.800 --> 01:04:19.080
frankly.

01:04:16.960 --> 01:04:20.720
And the same thing actually, same kinds

01:04:19.079 --> 01:04:22.519
of techniques actually work for text

01:04:20.719 --> 01:04:24.599
also, which we'll cover later on.

01:04:22.519 --> 01:04:26.119
Right? This broad area is called data

01:04:24.599 --> 01:04:27.599
augmentation.

01:04:26.119 --> 01:04:30.199
It's a great way when you don't have a

01:04:27.599 --> 01:04:31.799
lot of data to artificially bolster the

01:04:30.199 --> 01:04:32.599
amount of data you have.

01:04:31.800 --> 01:04:34.800
Okay?

01:04:32.599 --> 01:04:36.239
Um, and so, and of course, Keras makes

01:04:34.800 --> 01:04:38.640
it very easy for you to do all these

01:04:36.239 --> 01:04:40.919
things. It has already predefined a

01:04:38.639 --> 01:04:43.079
whole bunch of data augmentation layers

01:04:40.920 --> 01:04:45.000
for you. So, here's a little example

01:04:43.079 --> 01:04:47.239
where I basically take a picture and

01:04:45.000 --> 01:04:48.320
then I randomly flip it. So, if it looks

01:04:47.239 --> 01:04:50.799
like this, I flip it this way,

01:04:48.320 --> 01:04:53.080
horizontal. Okay? Uh, and then I

01:04:50.800 --> 01:04:55.039
randomly rotate it by 0.1. I forget if

01:04:53.079 --> 01:04:57.358
it's 0.1° or radians, you can look up

01:04:55.039 --> 01:05:00.079
the documentation. And then random zoom,

01:04:57.358 --> 01:05:02.920
right? Zoom in and out a little bit. Uh,

01:05:00.079 --> 01:05:04.960
but it won't do this for every picture.

01:05:02.920 --> 01:05:06.000
It will only do it randomly. Okay? So,

01:05:04.960 --> 01:05:07.800
that only some pictures will get

01:05:06.000 --> 01:05:09.000
perturbed in some ways. And that's how

01:05:07.800 --> 01:05:10.600
you make sure there's enough diversity

01:05:09.000 --> 01:05:12.880
of pictures that you have.

01:05:10.599 --> 01:05:13.839
So, once you do that,

01:05:12.880 --> 01:05:15.960
you can actually take a picture and see

01:05:13.840 --> 01:05:17.240
what it does.

01:05:15.960 --> 01:05:20.159
I just randomly grab a picture, so it

01:05:17.239 --> 01:05:20.159
keeps changing every time.
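The augmentation pipeline being previewed here can be sketched with the three layers named in the lecture. On the degrees-versus-radians question: per the Keras docs, RandomRotation's factor is a fraction of a full turn, so 0.1 means rotations of up to roughly ±36°.

```python
import numpy as np
import keras
from keras import layers

# Random perturbations that don't change what the picture is of.
data_augmentation = keras.Sequential([
    layers.RandomFlip("horizontal"),  # mirror left-right only
    layers.RandomRotation(0.1),       # factor is a fraction of 2*pi: ~±36 deg
    layers.RandomZoom(0.1),           # zoom in/out by up to ~10%
])

img = np.random.rand(1, 224, 224, 3).astype("float32")
# training=True forces the random behavior (it is a no-op at inference).
augmented = data_augmentation(img, training=True)
print(tuple(augmented.shape))  # (1, 224, 224, 3)
```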

01:05:21.280 --> 01:05:24.880
Yeah, look at this handbag.

01:05:22.800 --> 01:05:26.440
Handbag slightly rotated this way,

01:05:24.880 --> 01:05:28.320
rotated that way.

01:05:26.440 --> 01:05:30.320
Some more. Maybe a little bit of zooming

01:05:28.320 --> 01:05:31.880
going on, and so on. You get the idea,

01:05:30.320 --> 01:05:33.840
right? And there's a whole list of these

01:05:31.880 --> 01:05:35.640
things you can do. But when you do those

01:05:33.840 --> 01:05:37.358
things, make sure

01:05:35.639 --> 01:05:38.759
that what you're doing doesn't actually

01:05:37.358 --> 01:05:39.679
change the underlying meaning of the

01:05:38.760 --> 01:05:41.480
picture.

01:05:39.679 --> 01:05:43.440
It's really important.

01:05:41.480 --> 01:05:45.679
Okay? So, for example, if you're working

01:05:43.440 --> 01:05:47.679
with satellite data,

01:05:45.679 --> 01:05:49.319
yes, be very careful not to do

01:05:47.679 --> 01:05:50.319
crazy flips.

01:05:49.320 --> 01:05:51.920
Right? Or even if you're working with

01:05:50.320 --> 01:05:54.440
everyday images, horizontal flips are

01:05:51.920 --> 01:05:55.800
okay. Don't do vertical flips.

01:05:54.440 --> 01:05:57.400
Right? How many times will you have an

01:05:55.800 --> 01:05:59.400
upside-down dog picture that you need to

01:05:57.400 --> 01:06:00.639
classify?

01:05:59.400 --> 01:06:02.720
Make sure your augmentation doesn't go

01:06:00.639 --> 01:06:04.839
nuts.

01:06:02.719 --> 01:06:04.839
All right.

01:06:05.760 --> 01:06:09.240
Once you do that, you can actually just

01:06:07.239 --> 01:06:11.000
insert the data augmentation layers in

01:06:09.239 --> 01:06:12.479
your model right there, right after the

01:06:11.000 --> 01:06:14.280
input. The rest of it can stay

01:06:12.480 --> 01:06:15.760
unchanged.

01:06:14.280 --> 01:06:17.600
So, this is a great way to increase the

01:06:15.760 --> 01:06:19.760
size of your training data, and here is

01:06:17.599 --> 01:06:21.880
a model, and then I invite you to

01:06:19.760 --> 01:06:23.120
actually just play with it and uh, and

01:06:21.880 --> 01:06:23.960
train it. In the interest

01:06:23.119 --> 01:06:24.920
of time, we won't actually train this

01:06:23.960 --> 01:06:27.240
model, but it's in the collab, you can

01:06:24.920 --> 01:06:28.599
just try it. It also figures prominently

01:06:27.239 --> 01:06:30.000
in homework one, by the way, data

01:06:28.599 --> 01:06:32.519
augmentation. So, you'll get more

01:06:30.000 --> 01:06:34.800
experience with this. Okay. So, uh, back

01:06:32.519 --> 01:06:37.239
to the PPT.

01:06:34.800 --> 01:06:38.800
So, this is what we have. Um, and so,

01:06:37.239 --> 01:06:41.279
any network that has been trained on

01:06:38.800 --> 01:06:42.920
this ImageNet thing, uh, turns out

01:06:41.280 --> 01:06:44.880
learns all kinds of interesting features

01:06:42.920 --> 01:06:46.320
in every one of its layers. So, here

01:06:44.880 --> 01:06:48.039
this is the first layer, and you can see

01:06:46.320 --> 01:06:49.880
it's picking up sort of gradations of

01:06:48.039 --> 01:06:52.559
color, sort of line-ish kind of

01:06:49.880 --> 01:06:54.320
behavior. Layer two, um, it's actually

01:06:52.559 --> 01:06:56.880
picking up Hey, look, it's picking up an

01:06:54.320 --> 01:06:59.480
edge. Can you see that edge?

01:06:56.880 --> 01:07:01.920
Right? Like like that.

01:06:59.480 --> 01:07:04.400
And then layer three is picking up these

01:07:01.920 --> 01:07:05.960
interesting honeycomb shapes, uh, and so

01:07:04.400 --> 01:07:07.880
on. Oh, it's actually this thing is

01:07:05.960 --> 01:07:11.240
already picking up like the

01:07:07.880 --> 01:07:11.240
shape of a human torso.

01:07:12.599 --> 01:07:16.440
Yeah, this layer is actually picking up

01:07:13.920 --> 01:07:17.240
what looks like a Labrador retriever.

01:07:16.440 --> 01:07:19.400
Okay.

01:07:17.239 --> 01:07:20.399
Isn't that cute?

01:07:19.400 --> 01:07:22.480
Come on, even if you're not a dog

01:07:20.400 --> 01:07:24.480
person.

01:07:22.480 --> 01:07:25.599
All right. So, this is the

01:07:24.480 --> 01:07:26.599
visualization I was referring to

01:07:25.599 --> 01:07:28.319
earlier,

01:07:26.599 --> 01:07:30.039
um, to figure out what are these

01:07:28.320 --> 01:07:31.760
networks actually learning.

01:07:30.039 --> 01:07:32.920
This paper was one of the first ones to

01:07:31.760 --> 01:07:34.920
actually visualize what's going on

01:07:32.920 --> 01:07:36.639
inside. So, if you folks are curious how

01:07:34.920 --> 01:07:38.760
these pictures are actually produced, I

01:07:36.639 --> 01:07:40.719
would encourage you to check this out.

01:07:38.760 --> 01:07:42.560
Okay, yep.

01:07:40.719 --> 01:07:44.879
So, we spoke about images and you

01:07:42.559 --> 01:07:46.599
referred to classes,

01:07:44.880 --> 01:07:47.358
and you mentioned

01:07:46.599 --> 01:07:49.920
text next week

01:07:47.358 --> 01:07:52.480
on transformers, but

01:07:49.920 --> 01:07:54.039
what about say an email which has both

01:07:52.480 --> 01:07:56.280
text and images, and there may be white

01:07:54.039 --> 01:07:58.759
space depending on who has written it

01:07:56.280 --> 01:08:01.359
out. Does that get put in as an input

01:07:58.760 --> 01:08:03.240
for an image or

01:08:01.358 --> 01:08:04.840
So, we'll revisit this great question a

01:08:03.239 --> 01:08:06.439
bit later on in the course.

01:08:04.840 --> 01:08:07.840
So, the answer is a bit complicated, and

01:08:06.440 --> 01:08:09.280
I want to do it justice,

01:08:07.840 --> 01:08:10.800
so we'll come back to it.

01:08:09.280 --> 01:08:12.600
All right, so

01:08:10.800 --> 01:08:14.280
so it turns out this thing called ResNet

01:08:12.599 --> 01:08:16.000
is a family of networks which

01:08:14.280 --> 01:08:18.079
were trained on this ImageNet data set,

01:08:16.000 --> 01:08:19.399
and they did really well in this

01:08:18.079 --> 01:08:21.479
competition that's associated with the

01:08:19.399 --> 01:08:22.519
ImageNet data set, called ILSVRC. And

01:08:21.479 --> 01:08:24.679
so, this is an example of such a

01:08:22.520 --> 01:08:27.400
network. So, we would expect the

01:08:24.680 --> 01:08:28.400
weights and the parameters of ResNet,

01:08:27.399 --> 01:08:30.838
given that it's been trained on

01:08:28.399 --> 01:08:32.719
ImageNet, to sort of have some knowledge

01:08:30.838 --> 01:08:34.719
about lines and shapes and curves and

01:08:32.719 --> 01:08:37.520
things like that. So, maybe we can just

01:08:34.719 --> 01:08:39.039
use that, right? So, that's the idea.

01:08:37.520 --> 01:08:40.920
But the thing is we can't use ResNet as

01:08:39.039 --> 01:08:42.159
is because remember, it was trained to

01:08:40.920 --> 01:08:44.119
classify an incoming image into a

01:08:42.159 --> 01:08:45.439
thousand possibilities.

01:08:44.119 --> 01:08:47.838
Here we only have two possibilities,

01:08:45.439 --> 01:08:50.039
handbags and shoes. So, what we do is

01:08:47.838 --> 01:08:51.759
very simple and elegant. We do just a

01:08:50.039 --> 01:08:54.519
little bit of surgery.

01:08:51.759 --> 01:08:57.439
We take ResNet and stop just before the

01:08:54.520 --> 01:08:59.680
final layer. So, take my word for it,

01:08:57.439 --> 01:09:01.318
this thing here, what it says is fully

01:08:59.680 --> 01:09:02.920
connected thousand.

01:09:01.319 --> 01:09:04.839
Because it's a thousand-way classifier, right?

01:09:02.920 --> 01:09:06.560
Thousand objects. So, what we do is we

01:09:04.838 --> 01:09:08.239
just take everything else, and we stop

01:09:06.560 --> 01:09:10.280
just before that last layer.

01:09:08.239 --> 01:09:11.599
And then what comes out of that layer,

01:09:10.279 --> 01:09:13.239
hopefully, will be like a very smart

01:09:11.600 --> 01:09:14.480
representation of the images that it has

01:09:13.239 --> 01:09:16.960
been trained on.

01:09:14.479 --> 01:09:19.199
And so, what we do is we can think of

01:09:16.960 --> 01:09:21.000
sort of headless ResNet

01:09:19.199 --> 01:09:23.358
as our model.

01:09:21.000 --> 01:09:26.239
And we can take that we can take all our

01:09:23.359 --> 01:09:28.079
data and run it through ResNet up to but

01:09:26.239 --> 01:09:30.358
not including the last layer.

01:09:28.079 --> 01:09:31.920
Okay, you get some tensor and that

01:09:30.359 --> 01:09:33.319
tensor probably has a

01:09:31.920 --> 01:09:35.079
very rich understanding of what's going

01:09:33.319 --> 01:09:36.880
on in that image, all the objects and

01:09:35.079 --> 01:09:39.880
features and things like that. And then

01:09:36.880 --> 01:09:40.759
we can just connect that. We can

01:09:39.880 --> 01:09:42.199
think of it as like a smart

01:09:40.759 --> 01:09:44.359
representation of an input. We can

01:09:42.199 --> 01:09:46.000
connect it to just a little hidden layer

01:09:44.359 --> 01:09:47.798
and then we have a little sigmoid which

01:09:46.000 --> 01:09:50.199
then tells you handbag or shoe. We can

01:09:47.798 --> 01:09:53.039
just run this network.

01:09:50.199 --> 01:09:54.840
Okay? And so since the inputs to the

01:09:53.039 --> 01:09:57.199
hidden layer now are not raw images

01:09:54.840 --> 01:09:59.000
anymore, but this much higher level of

01:09:57.199 --> 01:10:00.279
abstraction that ResNet has learned,

01:09:59.000 --> 01:10:02.399
hopefully it can get the job done with

01:10:00.279 --> 01:10:04.519
hardly any examples.
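The workflow just described can be sketched schematically. This is a minimal numpy sketch, not the real notebook code: `stub_backbone` is a hypothetical stand-in for the pretrained headless ResNet, and the head here is just a logistic unit rather than the hidden layer plus sigmoid used in class.

```python
# Schematic sketch of the idea: a "headless" backbone turns each image into a
# rich feature tensor, and only a tiny head on top needs training.
# `stub_backbone` is a stand-in for pretrained ResNet, NOT the real network.
import numpy as np

def stub_backbone(images):
    """Stand-in for headless ResNet: maps (N, 224, 224, 3) images to
    (N, 7, 7, 2048) feature tensors. A real backbone would apply its
    pretrained convolutional layers here."""
    n = images.shape[0]
    rng = np.random.default_rng(0)
    return rng.standard_normal((n, 7, 7, 2048))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Step 1: run every image through the frozen backbone ONCE and keep the output.
images = np.zeros((4, 224, 224, 3))          # 4 dummy images
features = stub_backbone(images)             # (4, 7, 7, 2048)
flat = features.reshape(len(features), -1)   # (4, 100352)

# Step 2: a tiny trainable head (here just a logistic unit) on top.
w = np.zeros(flat.shape[1])
b = 0.0
probs = sigmoid(flat @ w + b)                # handbag-vs-shoe probabilities

print(features.shape)   # (4, 7, 7, 2048)
print(flat.shape)       # (4, 100352)
print(probs.shape)      # (4,)
```

Only `w` and `b` ever get updated; the backbone's weights stay frozen, which is why so few labeled examples are needed.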

01:10:02.399 --> 01:10:05.679
Okay? And now you can get fancier.

01:10:04.520 --> 01:10:07.440
That's the basic idea, but you can get

01:10:05.680 --> 01:10:09.760
much fancier. You can connect up

01:10:07.439 --> 01:10:10.960
headless ResNet directly with our little

01:10:09.760 --> 01:10:12.720
network with a hidden layer and the

01:10:10.960 --> 01:10:14.960
final thing and the whole thing can be

01:10:12.720 --> 01:10:16.960
trained.

01:10:14.960 --> 01:10:18.680
End to end. Uh but when you do that you

01:10:16.960 --> 01:10:20.159
must start the training with the weights

01:10:18.680 --> 01:10:21.960
that you downloaded with ResNet because

01:10:20.159 --> 01:10:23.639
that is the crown jewel that's been

01:10:21.960 --> 01:10:26.239
learned so you want to start from there.

01:10:23.640 --> 01:10:28.400
Uh and you will do this in homework one.
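Why starting from the downloaded weights matters can be seen in a toy example. This is purely schematic (a 1-D linear fit, nothing to do with ResNet itself): the same gradient descent reaches a much lower loss in the same number of steps when it starts near a good solution, which is what "pretrained" initialization buys you.

```python
# Toy illustration of why fine-tuning starts from the pretrained weights:
# gradient descent from a "pretrained" starting point reaches a low loss much
# faster than the same descent from scratch. (1-D linear fit, schematic only.)
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = 2.0 * x                      # the true weight is 2.0

def run_gd(w, steps=5, lr=0.05):
    """A few steps of gradient descent on mean-squared error; returns final loss."""
    for _ in range(steps):
        grad = (2.0 / len(x)) * np.sum(x * (w * x - y))
        w -= lr * grad
    return np.mean((w * x - y) ** 2)

loss_warm = run_gd(w=1.9)        # "pretrained" init, already close to 2.0
loss_cold = run_gd(w=0.0)        # from-scratch init, far away

print(loss_warm, loss_cold)      # warm start ends with a much smaller loss
```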

01:10:26.239 --> 01:10:29.479
Okay? All right. Uh by the way, these

01:10:28.399 --> 01:10:30.639
pre-trained models are available all

01:10:29.479 --> 01:10:32.599
over the internet. There is the

01:10:30.640 --> 01:10:34.000
TensorFlow hub, the PyTorch hub and then

01:10:32.600 --> 01:10:36.840
there's the Hugging Face hub. When I

01:10:34.000 --> 01:10:39.079
checked it on the 13th yesterday, it had

01:10:36.840 --> 01:10:41.199
over half a million models available

01:10:39.079 --> 01:10:42.760
for download. Half a million.

01:10:41.199 --> 01:10:46.840
I think last year it was like 50,000

01:10:42.760 --> 01:10:49.159
when I taught the course. Uh so yes.

01:10:46.840 --> 01:10:50.880
I was just wondering, doesn't this make

01:10:49.159 --> 01:10:52.199
your neural network susceptible to

01:10:50.880 --> 01:10:53.279
adversarial attacks because the weights

01:10:52.199 --> 01:10:55.639
have been

01:10:53.279 --> 01:10:57.319
pre-trained on... Yes, there is

01:10:55.640 --> 01:10:59.160
some adversarial risk. I'm happy to talk

01:10:57.319 --> 01:11:01.439
about it offline.

01:10:59.159 --> 01:11:03.720
All right. So that's what we have. So

01:11:01.439 --> 01:11:06.319
back to Colab. Okay.

01:11:03.720 --> 01:11:07.720
This is ResNet,

01:11:06.319 --> 01:11:09.519
and it's all packaged up. It's

01:11:07.720 --> 01:11:12.640
available for download. So we download

01:11:09.520 --> 01:11:12.640
it here.

01:11:13.560 --> 01:11:19.360
And you see here that I'm saying use

01:11:16.520 --> 01:11:21.800
include top equals false.

01:11:19.359 --> 01:11:23.799
So basically you are telling Keras

01:11:21.800 --> 01:11:25.279
the top, the very final layer of the

01:11:23.800 --> 01:11:27.239
thing, don't give it to me. Just give me

01:11:25.279 --> 01:11:28.840
everything up to but not including that.

01:11:27.239 --> 01:11:30.920
And of course I think of it as left to

01:11:28.840 --> 01:11:32.960
right. People think of it as bottom to

01:11:30.920 --> 01:11:34.440
top. So to them, the very top

01:11:32.960 --> 01:11:35.480
layer, don't give it to me. You're

01:11:34.439 --> 01:11:37.319
telling it so that you don't have to

01:11:35.479 --> 01:11:39.319
manually go and remove it.

01:11:37.319 --> 01:11:40.880
Okay? And then I'm not going to

01:11:39.319 --> 01:11:44.000
summarize uh well, I'll just summarize

01:11:40.880 --> 01:11:44.000
some of it. Just show you how big it is.

01:11:44.640 --> 01:11:48.800
Okay?

01:11:45.720 --> 01:11:50.920
23 million parameters.

01:11:48.800 --> 01:11:52.039
ResNet. Okay? And I won't plot it

01:11:50.920 --> 01:11:53.520
because then I'll be scrolling for 5

01:11:52.039 --> 01:11:55.399
minutes. Uh

01:11:53.520 --> 01:11:56.400
so let's just do this now. So what we're

01:11:55.399 --> 01:11:58.000
now going to do is we're going to run

01:11:56.399 --> 01:11:59.679
all the data through this thing and

01:11:58.000 --> 01:12:00.880
whatever comes out in that penultimate

01:11:59.680 --> 01:12:02.640
thing, I'm going to just grab it and

01:12:00.880 --> 01:12:04.720
store it. So that's what this thing

01:12:02.640 --> 01:12:07.000
does.

01:12:04.720 --> 01:12:08.640
All right. And now we create a

01:12:07.000 --> 01:12:09.520
little handy function to do all these

01:12:08.640 --> 01:12:11.160
things.

01:12:09.520 --> 01:12:12.760
And once I do that,

01:12:11.159 --> 01:12:15.239
uh every image has been sent through

01:12:12.760 --> 01:12:16.280
ResNet up to but not the final layer and

01:12:15.239 --> 01:12:18.119
then whatever comes into the final

01:12:16.279 --> 01:12:19.479
layer, we're storing it. And then we're

01:12:18.119 --> 01:12:21.800
going to create a network where we'll

01:12:19.479 --> 01:12:23.199
only feed that information to

01:12:21.800 --> 01:12:24.440
a simple network.

01:12:23.199 --> 01:12:26.279
Okay?

01:12:24.439 --> 01:12:28.599
So what is coming out of ResNet, you can

01:12:26.279 --> 01:12:31.719
see here 98 examples in the training

01:12:28.600 --> 01:12:33.840
data and each example is now a 7 by 7 by

01:12:31.720 --> 01:12:35.000
2048 tensor.

01:12:33.840 --> 01:12:37.000
That's what came out of ResNet and you

01:12:35.000 --> 01:12:37.720
saw that's what I did there.

01:12:37.000 --> 01:12:39.479
Okay?

01:12:37.720 --> 01:12:41.199
All right. So that's what it looks like.

01:12:39.479 --> 01:12:43.479
Now let's just create our actual model

01:12:41.199 --> 01:12:46.479
now. Right? We have our input which is

01:12:43.479 --> 01:12:48.559
just a 7 by 7 by 2048.

01:12:46.479 --> 01:12:50.079
We flatten it immediately.

01:12:48.560 --> 01:12:52.600
Then we run it through a dense layer

01:12:50.079 --> 01:12:54.519
with 256 ReLU neurons and then we use

01:12:52.600 --> 01:12:56.920
dropout which I haven't talked about yet

01:12:54.520 --> 01:12:58.720
which I will talk about early next week.

01:12:56.920 --> 01:13:00.720
Uh but I will come back to it. Don't

01:12:58.720 --> 01:13:01.520
worry about this detail for the moment.

01:13:00.720 --> 01:13:03.360
Uh and then we just run through a

01:13:01.520 --> 01:13:05.960
sigmoid.

01:13:03.359 --> 01:13:08.079
Okay? And that that's our model.

01:13:05.960 --> 01:13:12.640
Finished. Plot the model. This is what

01:13:08.079 --> 01:13:12.640
we have. Okay? Model summary.
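The arithmetic behind that model summary is easy to do by hand, assuming the head matches the layers just described (flatten the 7 by 7 by 2048 tensor, a dense ReLU layer of 256 units, dropout, then a single sigmoid unit):

```python
# Parameter count of the little head: flatten 7x7x2048, a 256-unit dense ReLU
# layer, then one sigmoid unit. Dropout has no parameters of its own.
flat_size = 7 * 7 * 2048                 # 100352 inputs after flattening
dense_params = flat_size * 256 + 256     # weights + biases of the hidden layer
out_params = 256 * 1 + 1                 # weights + bias of the sigmoid unit
total = dense_params + out_params

print(flat_size)    # 100352
print(total)        # 25690625 trainable parameters
```

Interestingly, because the flattened input is so wide, this "little" head alone has about 25.7 million parameters, slightly more than the roughly 23 million in the ResNet backbone itself.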

01:13:13.479 --> 01:13:18.519
It's one so far. All right, good. Now

01:13:15.399 --> 01:13:18.519
let's actually train this thing.

01:13:18.640 --> 01:13:22.720
I'm just going to run it for 10 epochs

01:13:20.600 --> 01:13:24.760
because I tried running it uh previously

01:13:22.720 --> 01:13:26.920
and it seems to do a fine job in just an

01:13:24.760 --> 01:13:28.680
epoch. Okay, it's already done. It's so

01:13:26.920 --> 01:13:31.359
fast because we ran everything through

01:13:28.680 --> 01:13:33.640
this monster ResNet thing and basically

01:13:31.359 --> 01:13:34.759
took all the output values and used them

01:13:33.640 --> 01:13:36.880
as a starting point. Right? We don't

01:13:34.760 --> 01:13:40.440
have to run it every single time. So you

01:13:36.880 --> 01:13:43.279
can see here the accuracy is

01:13:40.439 --> 01:13:43.279
quite high.
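The speed difference comes down to how many times the big backbone runs over the data set. A schematic counter makes the point (`expensive_backbone` and `backbone_passes` are illustrative names, not anything from the notebook):

```python
# Why those epochs flew by: with precomputed features the expensive backbone
# runs over the data set exactly once, instead of once per epoch.
backbone_passes = 0

def expensive_backbone(batch):
    global backbone_passes
    backbone_passes += 1          # stand-in for a ~23M-parameter forward pass
    return batch                  # pretend these are the extracted features

data = list(range(10))
epochs = 10

# Precompute strategy: one backbone pass, then cheap epochs on cached features.
features = expensive_backbone(data)
for _ in range(epochs):
    pass                          # train only the small head on `features`
precompute_passes = backbone_passes

# Naive end-to-end strategy: the backbone runs again in every epoch.
backbone_passes = 0
for _ in range(epochs):
    _ = expensive_backbone(data)
naive_passes = backbone_passes

print(precompute_passes, naive_passes)   # 1 10
```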

01:13:44.199 --> 01:13:48.439
Wow, interesting. So the 10th epoch

01:13:45.920 --> 01:13:49.880
something bad happened.

01:13:48.439 --> 01:13:51.439
So maybe I should have stopped at the

01:13:49.880 --> 01:13:53.079
ninth epoch. I didn't see this yesterday

01:13:51.439 --> 01:13:55.159
when I was running. So much for

01:13:53.079 --> 01:13:57.079
reproducibility. Uh

01:13:55.159 --> 01:13:58.800
So let's just run this. Oh wow, look. On

01:13:57.079 --> 01:14:01.319
the test set it's achieving 100%

01:13:58.800 --> 01:14:01.320
accuracy.

01:14:02.159 --> 01:14:06.840
It's unbelievable. Okay folks, now for

01:14:04.439 --> 01:14:08.079
the moment of truth. Um all right, I

01:14:06.840 --> 01:14:10.000
have a little code snippet here to

01:14:08.079 --> 01:14:12.159
capture stuff from the webcam.

01:14:10.000 --> 01:14:13.600
Because that last epoch it went down,

01:14:12.159 --> 01:14:14.920
I'm a little worried that the demo is

01:14:13.600 --> 01:14:16.440
going to flunk.

01:14:14.920 --> 01:14:18.560
But you know what? We all have to live

01:14:16.439 --> 01:14:20.119
dangerously. So

01:14:18.560 --> 01:14:21.560
So here's a little function to predict

01:14:20.119 --> 01:14:23.519
what's going to happen.

01:14:21.560 --> 01:14:24.920
Okay. Now I tried it at home yesterday

01:14:23.520 --> 01:14:26.160
by the way.

01:14:24.920 --> 01:14:27.680
and it's like, "Yay, it's a

01:14:26.159 --> 01:14:29.599
handbag."

01:14:27.680 --> 01:14:30.720
So okay. Now let's just do something

01:14:29.600 --> 01:14:32.560
else.

01:14:30.720 --> 01:14:34.880
Okay. Any volunteers?

01:14:32.560 --> 01:14:37.400
I want a a piece of footwear

01:14:34.880 --> 01:14:39.840
or a handbag.

01:14:37.399 --> 01:14:40.839
It's like a backpack, right?

01:14:39.840 --> 01:14:42.159
I don't know. It feels like an

01:14:40.840 --> 01:14:43.440
adversarial example, but yeah, let's

01:14:42.159 --> 01:14:45.000
just try it.

01:14:43.439 --> 01:14:47.039
Okay.

01:14:45.000 --> 01:14:48.880
No disrespect. I'll let me let me go

01:14:47.039 --> 01:14:50.920
with the shoe first. I have a better

01:14:48.880 --> 01:14:51.880
chance of it working.

01:14:50.920 --> 01:14:53.239
So

01:14:51.880 --> 01:14:55.880
it's a pretty big shoe. If it can't get

01:14:53.239 --> 01:14:59.079
this shoe, I'm worried about this model.

01:14:55.880 --> 01:14:59.079
All right. So

01:15:05.159 --> 01:15:10.159
Okay. Hold on. Hold on. Hold on.

01:15:07.800 --> 01:15:10.159
All right.

01:15:10.680 --> 01:15:14.360
Please don't get distracted by my hand.

01:15:14.479 --> 01:15:20.719
Capture.

01:15:16.880 --> 01:15:20.720
It's a shoe! Look at that.

01:15:21.680 --> 01:15:26.760
Phew. All right. Thanks.

01:15:25.000 --> 01:15:28.319
Okay. Now let's try that. I'm feeling

01:15:26.760 --> 01:15:32.600
kind of brave now.

01:15:28.319 --> 01:15:34.880
Thank you. All right. Let's do this.

01:15:32.600 --> 01:15:38.000
All right.

01:15:34.880 --> 01:15:38.000
Camera capture.

01:15:40.399 --> 01:15:42.559
Okay.

01:15:44.199 --> 01:15:47.519
Put its better side.

01:15:54.960 --> 01:15:58.720
It's a handbag! Look at that.

01:15:59.800 --> 01:16:03.640
I swear every time I do the demo I age a

01:16:01.479 --> 01:16:06.879
few years. So

01:16:03.640 --> 01:16:06.880
All right folks, I'm done. Thank you.
