Right folks, good morning. Welcome back. I hope you all had a nice weekend, and I hope you had a chance to watch the video walk-through I posted yesterday. It's going to save us some time today. So, let's get right in. Today is going to be super packed. You're going to go from not knowing anything about convolutions, perhaps, for some of you, to actually knowing how convolutional networks work, and actually building one and demoing it in class, okay? And this demo has actually worked pretty well for the last few years that I've taught the class, but you never know; because it's a live demo, it may not work. We'll see. May the Valentine's Day gods be with us. Okay, so let's get going. So, Fashion MNIST: we saw previously, as in the video walk-through, that a neural network with a single hidden layer can get us to an accuracy in the high 80s, okay?
And that network actually didn't know that what was coming in was an image, right? It literally took this table of numbers, took each row, concatenated all the rows into one giant long vector, and sent it in. So, the neural network did not exploit the fact that the input data was known to be of a certain type, okay? Which is the clue for how we can do better, right? So, let's just spend a few minutes on what it is about images that we have to really pay attention to, okay? As opposed to any arbitrary vector of numbers that's coming in. Okay? So, when we flatten the image into a long vector and feed it into a dense layer, several undesirable things can happen. What are some of them? Any guesses?
[Student] I think you lose the proximity of one pixel to the other ones that would be around it.
Right.
So, if you take a particular pixel, let's say the picture shows a t-shirt. If there's a little pixel in the center of the t-shirt, knowing that the surrounding pixels are related to that pixel, because they are all part of this concept called a t-shirt, would certainly be helpful, right? So, to put it more technically, spatial adjacency information is very important, and we need to somehow take that into account. Okay? All right. What else? What else might be going on here?
[Student] You have some metadata about the image, like the resolution.
Oh, I see. So, if you actually had structured data about the image, such as various characteristics, that might be helpful. True. But let's just focus on the case where you only have the raw image and nothing else. And under that constraint, what else might go wrong? Or what else might be suboptimal? Okay. Well, the first thing that might happen is that we may have too many parameters.
So, let's take... these numbers are from my older iPhone. I noticed that when I take a color picture with my phone, it's roughly a 3,000 by 3,000 grid, right? The picture is actually 3,024 pixels on this axis and 3,024 on that axis, okay? So, that gets us to roughly 9 million pixels. But remember, it's a color picture, which means there are three channels, which means there are 27 million numbers, each of which is between 0 and 255, from that one little picture, okay? And now let's say we connect it to a single 100-neuron dense layer. A single 100-neuron dense layer: how many parameters are we going to have, just in that one little part of the network?
Could the mumbling be louder? Yes, roughly 2.7 billion, because 27 million inputs times 100 neurons, right? Roughly, of course; forget about the biases for a moment. It's 2.7 billion. 2.7 billion parameters, right?
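The arithmetic here is easy to check for yourself. A quick sketch; the 3,024-pixel grid and the 100-neuron layer are the lecture's example, nothing else is assumed:

```python
# Parameter count for flattening a phone photo into a 100-neuron dense layer.
height, width, channels = 3024, 3024, 3    # roughly a 3,000 x 3,000 RGB grid
inputs = height * width * channels         # one number per pixel per channel
neurons = 100                              # a single 100-neuron dense layer
weights = inputs * neurons                 # one weight per (input, neuron) pair
biases = neurons                           # one bias per neuron (ignored in class)
print(f"{inputs:,}")    # 27,433,728 -> "roughly 27 million numbers"
print(f"{weights:,}")   # 2,743,372,800 -> "roughly 2.7 billion"
```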
Do you think we can actually get 2.7 billion images to train any of these things? So, you're going to overfit, right? Too many parameters. We have to be smarter about this. It's not going to work, right? That's the first problem. So, this clearly is computationally demanding, very data hungry, and increases the risk of overfitting. Okay? Next, we lose spatial adjacency, right? We are literally ignoring what's nearby. So, that's a huge factor. There's a third factor, right, that we have to worry about, which is this: let's say that the picture has a vertical line on the top left side, and it has some other vertical line on the bottom right side. What this sort of dumb approach is going to do is learn to detect that vertical line on the top left and, independent of that, learn to detect the vertical line on the bottom right.
Okay? Which doesn't make any sense. A vertical line is a vertical line. So, you want to be able to detect it wherever it happens. Detect once, reuse everywhere. That's what you need to do. So, this, by the way, is called translation invariance. Translation is math speak for moving stuff around, right? You take a line and it moves around; it doesn't matter, it's still a line. We have to figure that out. So, these are the three things we need to worry about. We want to learn once and use it all over the place, number one. We want to take spatial adjacency into account, number two. And number three, let's just find a way to make sure that we don't have billions of parameters for simple toy problems. Any questions? Yep.
[Student] Um, is this a problem just because we are compressing the image, or would it have happened anyway?
It would have happened... So, the question was: is it a problem because we are compressing the image, or would it have happened anyway? The answer is it would have happened anyway. You can take any picture and this is going to happen, right? Because I'm not making any assumptions about how the image is coming in to me, whether it's compressed or not, and so on and so forth. Okay. All right. So, convolutional layers were developed to precisely address these shortcomings, and they're an amazing solution, as you will see. Very elegant. All right. So, the next, I don't know, half an hour is going to be me defining a whole bunch of stuff before we actually get to the fun Colabs and so on and so forth. So, just to put it in perspective, I have a PowerPoint, two Colabs, an Excel spreadsheet, and maybe even a Notability file to cover today. Okay?
So, hang on for the next 30 minutes, because it's going to be a little concept heavy before we get to the fun stuff. Stop me, ask me questions, because we do have time. All right. A convolutional layer is made up of something called a convolutional filter. Okay? That's the atomic building block. A convolutional filter is nothing but a small matrix of numbers, like this. It's just a small square matrix of numbers. That's a convolutional filter, okay? Now, a layer is just composed of one or more of these filters. All right? Filters and layers. Now, the thing about the convolutional filter that makes it really magical is that if you choose the numbers in the filter carefully and then you apply the filter to an image, and I'll get to what I mean by applying the filter, this little humble thing has the ability to detect features in your image.
It can detect lines, curves, gradations in color, circles, things like that, okay? It's pretty cool. And so, I'm going to claim, and I'm going to prove shortly, that this little humble filter with the ones and zeros can detect horizontal lines in any picture you give it. Okay? And this thing here has the ability to detect vertical lines. All right? So, I will demonstrate how this thing actually detects all these things, and then we will ask the big question that's probably in your minds already: where are we going to get these numbers from? That all sounds great, Rama, but where are we going to get the numbers from? Okay? And we have a beautiful answer to that question. All right, so let's go. Now, I'm going to first explain what I mean by applying a filter to an image, and then I'm going to give you examples of how the filter works for detecting vertical and horizontal lines. All right. So, let's say that this is the image we have.
Okay? Again, an image. Assume it's a grayscale image, so you just have a bunch of numbers between 0 and 255, okay? So, this is the image we have. It's a little tiny image. And this is the filter that's been magically given to us by somebody. And what we are trying to do now is apply it, okay? So, what we do is literally take this filter, the little one, and superimpose it on the top left part of the image. So, you have the image here, you take this little filter, and you move it to the top left so that they are right on top of each other. Okay? Once you have it right on top of each other, you have these matching numbers. You have three-by-three numbers in the image, three-by-three numbers in the filter, and they're all matching each other, right on top of each other, right? So, you have nine pairs of numbers. And then what we do, once we overlay it, is literally just multiply all the matching numbers and add them up. Okay?
You just multiply all the matching numbers and add them up, and you can confirm later on that the arithmetic I'm doing is actually accurate. Okay? And once you do that, you get some number, right? Um, once you get that number, what we do is go to our good old friend the ReLU, and we just run it through the ReLU. Now, in this case all that effort comes to nothing, because it's zero. That's okay. Okay? So, zero, and this number becomes the top left cell of your output. So, this is called the convolution operation. Okay? And we won't get into why it's called that and so on and so forth; there's a long and rich and storied history to these things. But this is the convolution operation. And once we do that, you can sort of predict what's going to happen next, right? We take the same exact operation and we just move it to the right. We move this little 3 by 3 thing to the right and repeat the exact same process.
Matching numbers: multiply all the matching numbers together, add them up, run the sum through a ReLU. Okay? And then, boom, you get the second number here. And you keep doing that till you reach the very end. You fill up all these numbers, and then you come to the start of the second row. Okay? And you keep on doing that till you reach the very bottom. So, this is what I mean when I say apply a filter to an image. Okay? Any questions? Okay. Microphone, please. Microphone.
[Student, partly inaudible] What happens when... the filter doesn't perfectly match?
Yeah, so you start from the left and then you keep on going. At some point the right edge of the filter is going to match the right edge of the image, and then you stop. Yeah. Now, there are some nuances here.
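The procedure just walked through (overlay the filter, multiply the matching numbers, add them up, run the sum through a ReLU, slide one step) can be sketched in a few lines of NumPy. This is a minimal sketch with the default one-step slide and no padding; the tiny image and the horizontal-line filter values are made up for illustration:

```python
import numpy as np

def convolve_relu(image, kernel):
    """Slide the kernel over the image; at each position multiply the
    matching numbers, add them up, and run the sum through a ReLU."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1   # stop when the filter's edge meets the image's
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):            # move down one row at a time
        for j in range(out_w):        # move right one step at a time
            patch = image[i:i + kh, j:j + kw]                    # overlapping window
            out[i, j] = max(0.0, float(np.sum(patch * kernel)))  # multiply, add, ReLU
    return out

image = np.array([[0, 0, 0, 0, 0],
                  [1, 1, 1, 1, 1],    # a horizontal line
                  [0, 0, 0, 0, 0],
                  [0, 0, 0, 0, 0],
                  [0, 0, 0, 0, 0]], dtype=float)
horiz = np.array([[ 1,  1,  1],
                  [ 0,  0,  0],
                  [-1, -1, -1]], dtype=float)   # a horizontal-line filter
print(convolve_relu(image, horiz))    # only the row with the line lights up
```

A real convolutional layer does exactly this, just vectorized and with many filters whose numbers are learned rather than hand-picked.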
So, for example, you can actually pad the whole image on its borders, so that you can actually go outside the image and it'll still work. Okay? That's number one. Nuance number two: instead of just moving one step to the right every time, you can move two steps to the right, right? And that's something called a stride. Okay? So, there are a bunch of pesky details here, but I'm just ignoring them, because this basic default approach works amazingly well almost all the time. Okay? All right. So, that's the mechanics of how this operation works. All right. Now, I'm going to switch to a spreadsheet which shows this really beautifully, courtesy of the fast.ai people. All right. It's a big spreadsheet, so I'll upload it after class so you can see it. So, all I have done here, rather, all they have done here, thanks to them, is essentially create a table of numbers in Excel, as you can tell.
And they have just put in some numbers. Most of the numbers are zero, but some of these numbers are more than zero; they're like 0.8, 0.9, and so on. Basically, all they have done is, instead of working with numbers between 0 and 255, divide all the numbers by 255 so you get fractions, and they just put the fractions in the table. Okay? And then they have used Excel's very cool conditional formatting to essentially mark in red all the values that are high, right? The closer the number is to one, the more reddish it gets. Okay? And when you do that, the three obviously pops out. So, there is a three in the image. Yes? Okay, good. So, now what we're going to do is move to our little filter here. You can see the filter, right? And I'm claiming this detects horizontal lines. And this table here... sorry. This table here is the result of applying that filter to the three. Okay?
And you can see here, I'm looking at the top left cell here. Look at this top left cell. The formula is nothing more than: multiply all those things and add them up. And once you add them up, run the sum through a max of zero comma that, which is just the ReLU. Okay? Basic arithmetic. So, we do that. And this is the output, and the output is also conditionally formatted to show you where things are lighting up. And you can see only the horizontal lines of the three are lighting up. Everyone see that? Right? So, now you understand the filter is in fact living up to the claim I made for it. Right? Similarly, if you look at what's going on here, this is a vertical filter. Same thing: you apply it, and only the vertical line is lighting up.
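To see why a given cell lights up, it's enough to do the arithmetic for a single 3-by-3 patch sitting on a horizontal edge. A sketch; the patch values are made up, and the filter is the ones/zeros/minus-ones pattern from the spreadsheet demo:

```python
import numpy as np

# A 3x3 patch straddling a horizontal edge: bright on top, dark below.
patch = np.array([[0.9, 0.8, 0.9],
                  [0.9, 0.9, 0.8],
                  [0.0, 0.1, 0.0]])
filt = np.array([[ 1,  1,  1],     # rewards bright pixels in the top row
                 [ 0,  0,  0],     # ignores the middle row
                 [-1, -1, -1]])    # penalizes bright pixels in the bottom row
score = max(0.0, float(np.sum(patch * filt)))   # multiply, add, ReLU
print(round(score, 3))   # 2.5: a big value, so this cell "lights up"

# The same filter on a flat region scores zero: nothing to detect.
flat = np.ones((3, 3))
print(max(0.0, float(np.sum(flat * filt))))     # 0.0
```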
Right? Now, what you can do, and I would encourage you to do this after class, is look at all these numbers here, for example, and ask yourself, "Okay, why is that lighting up?" Right? And you will discover that what's actually going on is that it's looking for edges. It's looking for rows in the table where there is some nonzero thing in the first row and zeros in the second row. And by choosing the numbers carefully, you multiply the ones with positive numbers and you multiply the zeros with zeros, and you'll come up with a positive number, and thereby you detect an edge. Right? So, what I would encourage you to do is use this Excel thing here. All right. So, here is a cell we have. So, let's trace its precedents. Okay. So, you can see here these numbers, right? This is what it's processing, right? This grid is being processed to come up with that big number.
And you can see here, in this grid, all these numbers are here, and then these numbers are a lot lower than those numbers, because there is an edge. Right? The numbers are a lot lower. That's why you can see the horizontal part of the three. And so, what this filter is doing, it's basically saying, "Well, the row that I'm catching here has the ones, the middle has the zeros, and the rest are all minus ones." Right? So, the small values are going to stay very small, the big values are going to get very big, and the overall thing is going to be emphasized. So, that's the basic idea of edge detection. Spend some time with the Excel and it'll become clear to you what I'm talking about here. All right, cool. So, that's that. All right. By the way, there is also a very cool little site here in which you can actually go in and punch in your own numbers and see what it detects.
Lots of edges and curves and this and that. It's very cool, so I encourage you to try it out. So, the key thing here I want to say is: by choosing the numbers in a filter carefully and applying this operation, different features can be detected. All right. Now, I mentioned earlier that a convolution layer is composed of one or more of these filters. And so, you can think of each filter as a sort of specialist for a particular feature. Okay? Maybe it specializes in detecting vertical lines, horizontal lines, semicircles, quarter circles, you don't know. Right? You can imagine them as being specialists. And given that modern images can be very complicated, with lots of interesting features going on, you probably want to have lots of these filters. Okay?
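The edge-detection arithmetic described above can be sketched in a few lines of NumPy. Note that the toy image and the filter values here are made-up illustrations, not the exact numbers from the Excel sheet:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1),
    multiplying element-wise and summing at each position."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# Toy image: a bright band on top of a dark band -> a horizontal edge.
image = np.array([
    [9, 9, 9, 9, 9],
    [9, 9, 9, 9, 9],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
], dtype=float)

# Classic horizontal-edge filter: positive on top, zeros in the middle,
# negative below, exactly the ones/zeros/minus-ones pattern from the lecture.
edge_filter = np.array([
    [ 1,  1,  1],
    [ 0,  0,  0],
    [-1, -1, -1],
], dtype=float)

response = convolve2d(image, edge_filter)
print(response)
```

The response is largest exactly where the bright rows sit on top of the dark rows, which is the edge this filter specializes in.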
But the key is that you don't have to decide up front, "Hey, you, filter, you'd better specialize in detecting vertical lines, and you, on the other hand, stay in your lane and do horizontal lines." Right? You're not going to do that. You will let the system figure out what it wants to figure out. Okay? So, there is no human bottleneck in doing this. And I mention this because there used to be a human bottleneck, you know, before deep learning happened. Now, let's just make sure we understand the mechanics of what happens when you have two of these filters, not one. So, this is the input image as before, this is the filter we saw earlier, and this is another filter we have. The thing is, we just run them in parallel. We take each filter, do the operation, come up with an output; take the other filter, do the operation, come up with its output. And when you do that, the first one gives you that, and the second one gives you that.
And this output is a table of... it's actually not a table. What is it? Louder, please. It's a tensor. Thank you, it's a tensor. And so, these two 5 by 5 matrices can be represented as a tensor of what shape? And there are two right answers. 5 by 5 by 2, correct. So, you can either think of it as 5 by 5 by 2 or 2 by 5 by 5. They're both fine. Which one you go with ends up being a matter of convention. Okay? So, now you begin to see why we care about tensors. Imagine if instead of having two filters, we have 103 filters. The resulting tensor is going to be 5 by 5 by 103. Okay. Good. All right. Now, let's look at the slightly more complex situation where you have not a grayscale image with just a single little table, but an actual color image. Okay? So, we know how to apply a filter to a 2D tensor like this to get that.
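As a sketch of the mechanics just described, here is what running two filters in parallel and packaging their outputs looks like; the 7 by 7 image and the two filter patterns are arbitrary placeholders:

```python
import numpy as np

def convolve2d(image, kernel):
    """Apply one filter: element-wise multiply and sum at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    return np.array([[np.sum(image[i:i+kh, j:j+kw] * kernel)
                      for j in range(out_w)] for i in range(out_h)])

rng = np.random.default_rng(0)
image = rng.random((7, 7))               # a 7x7 grayscale image

# Two hypothetical 3x3 filters, e.g. a vertical- and a horizontal-line detector.
vertical = np.array([[1.0, 0.0, -1.0]] * 3)
horizontal = vertical.T

# Run them in parallel: each filter independently produces a 5x5 output.
outputs = [convolve2d(image, f) for f in (vertical, horizontal)]

# Package the two 5x5 tables into one tensor; both conventions work.
stacked = np.stack(outputs, axis=-1)       # shape (5, 5, 2)
stacked_first = np.stack(outputs, axis=0)  # shape (2, 5, 5)
print(stacked.shape, stacked_first.shape)
```

With 103 filters instead of two, the same `np.stack` call would simply produce a 5 by 5 by 103 tensor.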
But let's say we have something like this, where it has three channels: red, green, blue, RGB. It's got three tables of numbers. So, this is a tensor of shape 6 by 6 by 3, let's say, and you want to apply this 3 by 3 filter just like before. You want to apply the convolution operation. How's that going to work? Do we just apply it to each channel? We first apply it to the red, then to the green, then to the blue. Should we do that? Or is there a problem with that approach? Yeah. Could you use the microphone, please? The problem with the approach, I think, would be the same as what you said earlier: it would probably learn the same lines in each channel, right? Like, the location of the lines is probably the same in each channel. Yes, the location of the line is going to be the same, because that line, if you will, is sort of the aggregation of information from the three different channels. Right.
But the problem here is slightly different, which is that if you do them independently, the network has not been informed that these things are all part of the same underlying concept. As far as it's concerned, it's just three things, and it's just going to process them independently. So, we need to somehow change the filter so that it understands that at this pixel location, the three numbers under it, RGB, are actually parts of the same underlying thing. So, what we do is actually very simple. We just take this filter and make it 3D. We take this filter, and instead of having just one of them, we make it a cube like that, three deep. And once we do that, you can imagine taking this thing here and essentially doing that. Okay. Now, instead of having nine numbers in the image and nine numbers in the filter, you have 27 numbers in the image and 27 numbers in the filter.
But you still match them up, multiply them, add them up, and run the result through a ReLU. By the way, I tried to get ChatGPT to give me a picture like that. It just completely bombed. I tried three, four, five different variations; it just gave up. And then I found this nice picture at deeplearning.ai, and I used it. So, then, if you put different numbers in each of the layers, is that like color processing? Like, it could be doing a different thing to green and blue. I'm sorry, say that again. If you put different numbers in each of the different depth dimensions of your convolution filter, would that be like color processing? Yeah, you will put different numbers. In fact, you have 27 numbers now, but we haven't gotten to the question of where these numbers are coming from, so just hold that thought till we get there. Okay. So, any questions on this? Okay. You literally take the 2D thing and make it 3D.
You basically give it depth, and the depth just matches the depth of the input. So, if the input is, say, 10 deep, your filter is going to be 10 deep. Okay? Yes. Rather than increasing the rank of the tensor by one, is there any instance where you would create a subtraction layer, where you would run an operation across the different layers to come up with an intermediary layer that you would run a lower-rank filter over? Yeah, so there is a lot of stuff in the research literature which tries to do things like that. I'm just describing the most basic approach to doing this. And as it turns out, this basic approach is actually extremely powerful, right? And of course, researchers try to go from the 95% thing to 95.1%, so they invent all sorts of crazy, complicated stuff, which is all good for us, for humanity, but for practical use, this is good enough.
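A minimal sketch of this 3D filter operation, with random numbers standing in for the image and the 27 filter weights:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def convolve3d(image, kernel):
    """Slide a 3x3x3 filter over an HxWx3 image: at each position,
    match up all 27 numbers, multiply, and sum to a single value."""
    kh, kw, kc = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw, :] * kernel)
    return out

rng = np.random.default_rng(1)
image = rng.random((6, 6, 3))    # 6x6 RGB image: three tables of numbers
kernel = rng.random((3, 3, 3))   # the 3x3 filter given depth to match

feature_map = relu(convolve3d(image, kernel))
print(feature_map.shape)
```

Even though the filter is 3D, its output is a single 2D table, because all 27 products collapse to one number per position.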
How do you convert the 3-layer thing into a single 4 by 4 layer? The 4 by 4 is understood, but what about the 3 layers? How do they work? Yeah. So, we are coming to that. I think we have a slide here. Actually, we don't. Never mind, we'll answer that. So, here you have one filter, right? You have one 3 by 3 by 3 filter, which plugs into this thing here, and then it gives you the 4 by 4 at the end. Right? So, for one filter, we know that by doing this operation, we get this 4 by 4. Let's say that you have another filter, which is also 3D. You do that thing, and you'll get another 4 by 4. And if you have 10 filters, you'll get 10 of these 4 by 4s, which then get packaged up into a 4 by 4 by 10 tensor. Remember, whether the input is 2D, 3D, or 10D, what is coming out of each filter is always 2D. Because ultimately, when you apply this operation, at each position you just have one number. And then you do all those things, and you come up with a table of numbers, always.
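The packaging step can be sketched the same way: ten hypothetical 3 by 3 by 3 filters over one 6 by 6 by 3 input, each producing its own 4 by 4 table, stacked into one tensor:

```python
import numpy as np

def convolve3d(image, kernel):
    """One 3D filter over a color image -> one 2D table of numbers."""
    kh, kw, _ = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    return np.array([[np.sum(image[i:i+kh, j:j+kw, :] * kernel)
                      for j in range(out_w)] for i in range(out_h)])

rng = np.random.default_rng(2)
image = rng.random((6, 6, 3))        # 6x6x3 input
filters = rng.random((10, 3, 3, 3))  # ten 3x3x3 filters

# Each filter yields a 4x4 table; stacking all ten gives a 4x4x10 tensor.
maps = np.stack([convolve3d(image, f) for f in filters], axis=-1)
print(maps.shape)
```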
So, what's coming out of each filter is always a 2D table of numbers like that. But when you have lots of filters, you have lots of these 2D tables one after the other, and therefore they get packaged up into a tensor. All right. So, textbook chapter 8.1 has a lot of detail and intuition, which I think is really good, so please try it out. Okay. And folks, by the way, this convolution stuff sort of grows in the telling. So, I would encourage you to revisit it a few times, and then it slowly becomes part of your muscle memory. Don't expect to understand all the nuances in one shot. Do it a few times, and it will become wired into your head. Okay. So, all right, the big question. These seem excellent, but how are we supposed to come up with these numbers? In fact, traditionally, these filters actually used to be designed by hand.
Computer vision researchers would invest prodigious amounts of time and effort and talent to figure out the right kinds of filters to use for various specific applications. So, if you wanted to build an application which would look at, say, MRI images and figure out what kind of features to extract to be able to, say, predict the evidence for a stroke, they would actually hand-design the filter. They'd try lots of different values and then come up with, "Ah, I've got the perfect filter for this thing here." Right? So, that's the way it used to be done. And then, as we figured out how to train deep networks with lots of parameters (we figured out things like the ReLU activation, stochastic gradient descent, GPUs, backprop, things like that), this big idea emerged: why don't we think of the numbers in the filter as just weights? And why don't we simply learn them from the data using backprop?
Right? Just like we learn all the other weights. What's the big deal? And this simple idea, and it feels blindingly obvious in hindsight (I'm sure it was not obvious in foresight), this was the breakthrough. This was the key breakthrough. And it's actually possible to do this because a convolutional filter, as we have seen, is actually just a neuron, and its underlying arithmetic is just neuronal arithmetic. It just happens to be a slightly special one; it's actually even simpler than a regular neuron. In the interest of time, I have one or two slides in the appendix which tell you exactly why it's a neuron, so check it out. But just take my word for it: it's a particular kind of neuron. And because it's a particular kind of neuron, and we know how to work with neurons, right?
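A quick numeric check of the filter-is-a-neuron claim (the numbers here are random; the point is only the arithmetic): the convolution response at one position is exactly a dot product of the flattened patch with the flattened filter, that is, a neuron with shared weights and no bias.

```python
import numpy as np

rng = np.random.default_rng(3)
patch = rng.random((3, 3))    # the 3x3 window of the image under the filter
kernel = rng.random((3, 3))   # the filter: its 9 numbers are the weights

# Convolution at this position: match up, multiply, add.
conv_out = np.sum(patch * kernel)

# A plain neuron with the same 9 weights, zero bias, fed the flattened patch.
neuron_out = patch.flatten() @ kernel.flatten()

print(np.isclose(conv_out, neuron_out))   # True: same arithmetic
```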
You know how to work with neurons, which means that our entire machinery (layers, loss functions, gradient descent, SGD, and so on) is immediately applicable. We don't have to invent any new stuff to make it work. Okay? All right. Do you initialize the layers differently in different applications, like computer vision versus medical imaging, or is it just that the networks have different sizes? Yeah, so, the initialization. It's a good question. Let's come back to it when we get to something called transfer learning, which I'm going to get to by about 9:30. All right. So, that's it. This turned out to be a huge turning point in the computer vision field, and this was the massive unlock in the year 2012.
This computer vision system that used this technology, called AlexNet, burst onto the world stage because it crushed the competition in a competition called ImageNet. The previous best score was a 26% error rate, and this thing came in and had a 16% error rate. Right? It's the kind of thing where, if you see it, you'll be like, "Oh, that must be a typo." Right? Because every year, the improvements in error rate were very small, half a percent, 1%, and then this year it was 10%, and that was because of this approach. All right. Now, one other thing I want to talk about is that with every succeeding convolutional layer, any particular convolutional filter is implicitly seeing much more of the input image as we go along. Right? Which means that in the very beginning, if this is the input, right?
This little convolutional filter, this number here in the first layer, let's say, only sees the top of the chimney of this house, or whatever. But then the next layer, remember, the next layer's input is this particular layer. And so, this particular little thing here is getting information from this whole square here. And every one of the points in that square is actually something big in the original picture. So, with every additional layer, you're seeing more and more and more of the image. All right? And this is a key part of why these things work, because you're essentially hierarchically building a better and better understanding of the image. It is the hierarchical understanding, the hierarchical learning, that's a very key part of the unlock.
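This growth can be made concrete with the standard receptive-field recurrence; here is a sketch assuming a stack of stride-1, 3 by 3 convolutions:

```python
def receptive_field(num_layers, kernel_size=3, stride=1):
    """Receptive field of one output unit after stacking conv layers.
    With stride 1, each extra kxk layer adds (k - 1) input pixels."""
    rf, jump = 1, 1   # start: one output unit maps to one input pixel
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

for n in range(1, 5):
    print(n, receptive_field(n))
# 1 layer sees 3x3 of the input, 2 layers see 5x5, 3 see 7x7, 4 see 9x9.
```

So a unit deep in the stack is implicitly looking at a large patch of the original image, even though its own filter is only 3 by 3.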
And so, if you look at what networks are visualizing (this is actually what a face-detection deep network visualizes of what it's learning), you'll see that the first layer is just learning lines. And the second layer is actually learning edges. Look at this thing, right? It's learning to put these lines together to get some sort of an edge here, another edge here. This looks like three quarters of somebody's ear. And then, these things are being assembled to get whole faces out. Can you imagine the researchers who did this work? They built the network, it's doing really well at detecting faces, and they turn around: "Okay, let's see what it's actually doing." And then this picture pops up. I mean, goosebumps.
Okay, so pooling layers, the next one. So far we've talked about convolutional layers; this is the second building block, and then we'll again go to the Colab. So, pooling layers are also called subsampling or downsampling layers. The idea is that every time a tensor comes out of these convolutional layers, we try to make it slightly smaller, because the act of making it smaller will force the network to summarize and learn what's going on in the complicated thing coming into it, okay? So, I will describe the mechanics first. Let's say that this is the output of a convolutional layer, a 4 by 4. Okay? So, there are two kinds of pooling: max pooling and average pooling. This is called max pooling, and the idea is really simple. In this max pooling layer, there are no weights or parameters to be learned; it's just a simple arithmetic operation.
We basically superimpose a 2 by 2 empty grid on the top left, and then we say, "Hey, what's the biggest number among these four numbers?" Well, the biggest number is 43. Boom, I'm going to stick a 43 here. Then I move my 2 by 2 to the right so that it overlaps with these numbers in blue, and I say, "Hey, what's the biggest number here?" Okay, that's 109. And I move it down: what's the biggest number here? 105, stick it in here. Biggest number here, 35, and I stick it in there. That's it. This is max pooling. Similarly, there's this thing called average pooling, where instead of taking the maximum of the four numbers, we just average them. Okay, the average of these four things in yellow... am I done? The average of these four numbers is 32.2. The average of the blue numbers is 25.5. You get the idea. That's it: max pooling and average pooling.
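The mechanics above can be sketched directly. The 4 by 4 grid below is made up, except that its blocks are arranged so the max of each 2 by 2 block matches the numbers called out in class (43, 109, 105, 35):

```python
import numpy as np

def pool2x2(x, reduce_fn):
    """Slide a non-overlapping 2x2 window over x and reduce each block."""
    h, w = x.shape
    out = np.zeros((h // 2, w // 2))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            out[i // 2, j // 2] = reduce_fn(x[i:i+2, j:j+2])
    return out

# A made-up 4x4 output of a convolutional layer.
x = np.array([
    [  1, 43,   7,  2],
    [  5,  8, 109,  4],
    [105,  3,   6,  9],
    [  2,  7,  35, 10],
], dtype=float)

print(pool2x2(x, np.max))    # max pooling: [[43, 109], [105, 35]]
print(pool2x2(x, np.mean))   # average pooling: the mean of each block
```

Swapping `np.max` for `np.mean` is the only difference between the two pooling flavors.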
Now, as you can see, when you apply pooling, the number of entries drops significantly. Right? The number of entries drops significantly. And the output from this layer is just fed to the next layer as usual. Okay? There's nothing crazy going on. So, it's a way to shrink the output from one convolutional layer before it passes on to the next one: you interject a pooling layer. Now, I actually have, even if I say so myself, a very nice handwritten explanation of the effect of pooling. Unfortunately, I can't get my iPad to show up on my laptop, so I'm not going to be able to do it, but I will record a walk-through and post it; check it out, okay? But the intuition that I tried to convey with that thing is... oh, sorry, I'll come back to this. So, max pooling acts like an OR condition. It basically says: I have this big picture.
In the four things I'm looking at, if any number is really high, that means some feature is being detected. A really high number coming out of a convolutional layer means something somewhere fired, lit up. So I'm just looking to see whether anything lit up in that part of the image. If it did, I say, yep, something lit up. If nothing did, I say, nothing lit up. In that sense, you can imagine it acting like an OR condition: anything fired up here? Here? Here? If yes, okay; otherwise, no. So it acts like a feature detector.
If you have lots of things going on in a particular picture, you want to be able to summarize and aggregate them. You may have a big picture with things lighting up here and there, but you want to step back and say: in this picture, in the top left, nothing lit up; in the top right, something lit up; bottom left, something lit up; bottom right, nothing lit up. You're operating at a higher level of abstraction. That's the effect of pooling.

>> But don't you lose spatial information?

You don't, because what you're actually saying is: the top left has this thing. You already know it's in the top left, and you've already moved up to that level of abstraction. For example, if there's a human eye in the top left and there's a circle detector, it's going to fire and say: hey, in the top left there's an eye. Yep, lit up.
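The "anything lit up?" intuition in a few lines of NumPy; the positions and activation values here are made up for illustration, arranged to match the quadrant pattern described above.

```python
import numpy as np

# Max pooling as an OR over each region: a high activation anywhere in a
# quadrant means "this feature fired somewhere in that region".
fmap = np.zeros((4, 4))
fmap[0, 3] = 9.0   # something lit up in the top-right quadrant
fmap[2, 1] = 7.5   # ...and in the bottom-left quadrant

# Group the 4x4 map into four 2x2 quadrants, then take the max of each.
quads = fmap.reshape(2, 2, 2, 2).transpose(0, 2, 1, 3).reshape(2, 2, 4)
lit_up = quads.max(axis=-1) > 0
# lit_up: [[False, True], [True, False]]
# top-left: nothing; top-right: fired; bottom-left: fired; bottom-right: nothing
```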
So you're not looking at pixels anymore; you're already operating at a higher level of abstraction, and that's how we get around it. But this proceeds slowly and incrementally, which is why you have these big networks.

All right. So, as we saw, successive convolutional layers can see more and more of the original image, and the max pooling layers that follow them can detect whether a feature exists in more and more of the original input as well. So by the time you get to the seventh, eighth, ninth layers and so on, this thing is actually really smart. It's operating at a very high level of abstraction. You can think of it as having tagged all the features in the image at various resolutions, and it can work with that.

>> Is there a trade-off between doing pre-processing as opposed to adding additional convolutional layers?
>> I'm thinking of, say, turning a video into a sequence of black-and-white static images, as opposed to shoving in a color video with a ton of noise. Is there a trade-off element?

There is a trade-off. If your particular data set and input carry some very important domain knowledge that you want to encode into the network, so that the network doesn't waste its capacity learning things you know have to be true, then yes, modify the input. But if you're not sure, then you want to just let the network learn whatever it can; as long as it's focused on predicting as accurately as possible, just let it be.

All right, so that's the basic idea. And again, I'm sorry the Notability thing isn't working, but take a look at the walk-through to really understand how this max pooling business works. Oh, I think I skipped over this.
So when you have something like this, say a tensor coming out of some convolutional layer whose size is 224 by 224 by 64, and you apply pooling, the thing I want to point out is that the pooling works on every slice of the tensor. If the tensor is 224 by 224 by 64, it has a depth of 64, which is basically like saying it's got 64 tables of 224 by 224, and the pooling works on every one of those tables. Which means you'll still have 64 things at the very end; it's just that each of the 64 tables of 224 by 224 will shrink to 112 by 112. So each table shrinks due to pooling, but the number of tables does not change.

Okay. By the way, this link here has a beautiful explanation of all these things, with a bit more complexity as well, from a course taught at Stanford around 2018 or 2019, I forget.
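You can check the every-slice behavior directly. Here is a plain-NumPy sketch of what a 2 by 2 max pool does to a 224 by 224 by 64 tensor: height and width halve, but all 64 channels survive.

```python
import numpy as np

# A random tensor standing in for a convolutional layer's output.
t = np.random.rand(224, 224, 64)

# 2x2 max pool applied to every slice at once: split rows and columns
# into pairs, take the max over each pair.
pooled = t.reshape(112, 2, 112, 2, 64).max(axis=(1, 3))

assert pooled.shape == (112, 112, 64)           # 64 tables, each 112x112
assert pooled[0, 0, 0] == t[0:2, 0:2, 0].max()  # each entry is a 2x2 patch max
```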
Check it out if you're curious about this stuff; it's really good.

All right. So that brings us to the architecture of a basic CNN. What we do is we have an input. We take that input and run it through a bunch of convolutional and pooling layers: there's a convolutional layer, then we pool it, which is why it has shrunk in size; then it goes through another convolutional layer, then we pool it, and it shrinks again; and it keeps going. So we have a series of what are called convolutional blocks. A convolutional block is typically one to two convolutional layers followed by a pooling layer. So you have a series of convolutional blocks.
And the thing to notice is that as you go further and further into the network, the blocks get smaller and smaller because of max pooling, but they get deeper and deeper. And we have figured out empirically that this model of reducing the height and width while making things deeper tends to work really well in practice.

In fact, and apologies to the livestream that I can't use the iPad, I'm going to do this on the board. Let's say you have a picture coming in as 224 by 224, and then you have three of them, because it's a color picture. Can you folks see this okay? All right, let's say this is the input coming in.
And ResNet, which is a very famous network that we're actually going to work with in a few minutes, when it gets done with all this convolution and pooling business, the final tensor it has is of shape 7 by 7, but it is 2048 deep. So it has processed something that was 224 by 224 by 3 down to a much smaller height and width, just 7 by 7, but it has gotten much deeper: 2048 channels. This is a numerical example of what I was saying over there: as you go along, things get smaller but deeper. All right. Yes?
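The smaller-but-deeper progression, written out numerically: five halvings take 224 down to 7. The channel depths below are illustrative of ResNet-style growth, not an exact layer-by-layer specification of the real network.

```python
# Track how the spatial size shrinks while the depth grows.
h = w = 224
depths = [3, 64, 256, 512, 1024, 2048]  # illustrative channel counts
shapes = [(h, w, depths[0])]
for d in depths[1:]:
    h, w = h // 2, w // 2          # pooling/striding halves height and width
    shapes.append((h, w, d))       # while the channel depth grows

# shapes: (224, 224, 3) -> (112, 112, 64) -> ... -> (7, 7, 2048)
```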
>> Is the reason it gets deeper that each layer has a single feature that's picked up, and then it gets stacked on top?

It's not so much that each layer picks up a single feature. The way I think about it is that the number of atomic features you may want to detect is probably not that large: lines, curves, gradations in color, and things like that. But the number of ways you can combine these atomic features to depict real-world things is combinatorial. It's sort of like: I have 10 kinds of atoms, how many molecules can I make from them? You can make a lot of molecules from those 10 atoms, which means you'd better give the network the ability to capture more and more of these possible things the real world can come up with.
And so as the depth increases, you have more filters, and every filter now has the ability to pick up some combination of what's coming in.

>> Sorry, a quick question related to this. Right now our model is being trained to detect certain specific features, like a line or a color or something of that sort. But it still doesn't attach meaning to them, right? It still doesn't know whether that arc is a sun or an eye.

Yeah. So we don't tell it what to learn; it just learns. All we tell it is: make sure you minimize the loss function. Now, once it has finished learning, if it's a good network with good accuracy, then we can introspect. We can peek into the internals and try to understand what it is learning.
And sometimes, like you saw in the face detection example, it's actually learning interesting things: basic lines and edges, then slowly more complicated shapes, and finally entire human faces. Sometimes it may not be understandable.

>> And the way it's doing this is by constructing features? Like, how do you figure out what it's learning?

Oh, I see. So, I'm going to give a reference in just a few minutes. Read the paper; it was one of the first to actually visualize what these things are learning, and it'll give you an idea of how this works. I'm also happy to talk about it offline. It's a bit of a tangent, but a really rich one, so if I keep talking about it I'll end up spending ten minutes on it. I'm going to back off.

Okay. All right. So now, once we do that,
we're back in familiar territory. We take whatever tensor is coming out of these convolutional and pooling operations and flatten it, only now, into a long vector. Once we flatten it, we can connect it to some good old dense layers, like we know how to do, and then finally connect that to whatever output layer you want. In this case, the example is doing multi-class classification, classifying images by what kind of automobile it is or whatever, so it's a softmax. So this is the general framework. Any questions? Yeah.

>> Can you explain again how exactly the depth increases?

Oh, the depth increases because you decide what the depth is. When you add a convolutional layer, you decide how many filters it has. So you just keep adding more and more filters the later you go in the network. It's in your control.
Remember, the number of neurons in a hidden layer is in your control, right? Similarly, the number of filters is in your control. It's a design choice, and we design it so that the later we go, the more depth we have.

>> So you stack layers, and each of those layers has a different filter applied to it?

Yeah, a layer is made up of filters, and the depth just comes from having lots and lots of filters. And you get to choose what they are.

All right. So now let's go to the Fashion MNIST colab that I did the video walk-through on and actually solve it using a convolutional network. All right, cool. At this point I'm going to zip through some of the stuff, because you know the preliminaries have to be done: import all these packages, set the random seed. Great. Then we load the Fashion MNIST data set just like in the colab yesterday, and we create these little labels.
Then we have the standard functions to plot accuracy and loss that we've been using so far. All right. Now we come to the convolutional part. As before, we divide by 255 to normalize everything to a zero-to-one range. Let's confirm that nothing has been tampered with: yep, we have 60,000 images in the training set, each 28 by 28. Now, convolutional networks expect the input to have an explicit channel dimension. Color images have three channels, but black-and-white images have only one channel, one table of numbers. So instead of saying 28 by 28, we tell the convolutional layer to expect 28 by 28 by 1. It's the same thing conceptually, but that's the format it expects.
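That preprocessing looks like this in NumPy. As a stand-in, the sketch uses a batch of 4 random images rather than the real 60,000 Fashion MNIST images; the steps are identical.

```python
import numpy as np

# A stand-in batch of 4 random grayscale "images".
x = np.random.randint(0, 256, size=(4, 28, 28)).astype("float32")
x = x / 255.0                  # normalize pixel values to the 0..1 range
x = np.expand_dims(x, -1)      # (4, 28, 28) -> (4, 28, 28, 1): add channel axis
```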
So we go here and use a thing called expand dims. I'm just telling it to expand its dimension, and once I do that, you can see it's still 60,000 images, but instead of 28 by 28, each has become 28 by 28 by 1. Same thing.

Okay? Now let's define our very first CNN. As before, the input is just keras.Input, no difference here, and we tell it the shape, which of course is just 28 by 28 by 1. Then we come to the first convolutional block. And this is the key thing: if you want to tell Keras to use a convolutional layer, you use layers.Conv2D. From this you can probably also figure out that there's a Conv1D and a Conv3D and so on and so forth, which you should explore; it's really good stuff. But for image processing, Conv2D is all you need. And now we tell it how many filters we want.
So we decide on the number of filters; I've decided on 32. We also have to decide the size of the filter. The simplest size is 2 by 2, so I'm just going to go with that: kernel size is 2 by 2. The activation is, of course, ReLU. I give it a name, convolution one, and then I feed it the input. Once I do that, I follow it up with a little pooling layer, where I use MaxPooling2D. With MaxPooling2D you just pass the input and get the output back; it shrinks everything using pooling. So that's the first convolutional block.

And you know what, I know how to cut and paste. Boom, cut and paste, I get the second convolutional block. I know I just mentioned in lecture that as you go deeper you add more depth, but this is just a starting point, so I'm going to use the same depth. Not a big deal.
It's a simple problem, which is why in the second convolutional block I'm still using only 32 filters. But you could totally go to 64, for instance, to make it much deeper. Once I do that, I finally come to the point where I flatten everything into a long vector, then I connect it to one dense layer of 256 neurons. And then finally I come to the softmax, where I have 10 outputs, one per category of clothing. Then I tell Keras: take this input and this output, string them together, and define a model for me. That's it. That's a convolutional network. The new concepts we're seeing here are Conv2D for the convolutional layer and MaxPooling2D for the max pooling layer. Okay? That's it. So let me just run this thing. It runs. Okay, good.

>> How do you decide when to flatten? And would there ever be a situation in which we just use the method we used before and not use a CNN?
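Putting the verbal description together, here is a sketch of what the colab's model definition plausibly looks like. The layer names and the ReLU activation on the dense layer are assumptions; the exact colab code may differ in detail.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Two convolutional blocks (Conv2D + max pooling), then flatten,
# a 256-neuron dense layer, and a 10-way softmax.
inputs = keras.Input(shape=(28, 28, 1))
x = layers.Conv2D(32, kernel_size=2, activation="relu", name="convolution_1")(inputs)
x = layers.MaxPooling2D()(x)                      # first convolutional block
x = layers.Conv2D(32, kernel_size=2, activation="relu", name="convolution_2")(x)
x = layers.MaxPooling2D()(x)                      # second convolutional block
x = layers.Flatten()(x)                           # back to a long vector
x = layers.Dense(256, activation="relu")(x)
outputs = layers.Dense(10, activation="softmax")(x)  # 10 clothing categories
model = keras.Model(inputs, outputs)
```

With the default 'valid' padding and stride 1, this configuration ends up with roughly the 302 thousand parameters mentioned below.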
Well, we already tried that with Fashion MNIST, right? We didn't use a CNN; we just flattened right away. It wasn't bad, but we're asking: can we do better than the 85 or 88 percent or whatever it was? When you're working with images, it's typically a good idea to start with a CNN straight out of the gate, because you're not giving up anything. In terms of how many layers you should have, my philosophy is: start simple, and if it works, stop working on it. If it doesn't, add more layers. Yeah?

>> Is the architecture design, the number of filters, kernel size, number of layers, convolution, pooling, all based on trial and error?

Yeah, typically it's based on trial and error, to answer your question.
But as you will see in the transfer learning discussion we're going to have soon, instead of doing anything from scratch, it's much better to just download a pre-trained model and adapt it to your particular problem. That is actually the norm by which people do these things. The reason I'm doing it from scratch is that you should know how it was done. It should not be a black box to you. That's my goal. Yeah?
>> Just from a notation perspective, I noticed you named all of these layers X. Is that a habit we should get into, naming them all the same, or is that just a...
>> Actually, I'm not naming the layers X. What's going on here is that I'm feeding each layer X, and whatever comes out of it, I'm just calling X again. That's all. It's just a notational convenience: I reuse one name for the input and the output, and Keras under the hood will track everything and make sure the right thing happens.
Otherwise, I'd have to write X1, X2, X3, X4, and then if I wanted to add a new layer somewhere in the middle, between X3 and X4, I'd have to call that X4 and then renumber everything to 5, 6, 7. A complete pain in the neck. That's why I do this.
All right. So, model.summary: it has about 302,000 parameters. I'll just plot it. And I encourage you to hand-calculate that later and make sure the numbers tally. For now, let's just go. As before, we'll use the same compilation: Adam, and then we'll train it for just 10 epochs, with a validation split of 20% as usual. So let's run it. As you will see, there's a lot more going on in a convolutional network, so it's going to be a bit slower to run. Hopefully not too much slower. While it's doing that, other questions?
>> If we have a task other than image classification, say segmentation, do we still flatten first?
Yeah, so this is for image classification. For other kinds of applications, typically you still run the input through a bunch of convolutional layers and so on, but the output side of the equation gets much more complicated. Because if, instead of classifying the whole picture into, you know, dog or cat, you have to take every pixel and classify it, then you'd better have an output whose shape has the same dimensions as the input. For that we use a different architecture, called U-Net, and so on, which unfortunately I won't be able to get into. But I am planning to post another video walk-through where I show you how to use the Hugging Face Hub to very quickly build models for the other applications like segmentation. I'm hoping to post that tomorrow. It's an optional viewing that might help with that.
Okay. So, is it done? Okay, good. It's done. All right, let's plot the thing here.
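For reference, the Fashion MNIST model described above can be sketched with the Keras functional API, using the same X-reassignment convention from the lecture. The exact kernel sizes and padding here are my assumptions (the in-class summary reported about 302,000 parameters, so the instructor's configuration differs slightly); the point is the structure and the hand-calculation of the parameter count.

```python
# A minimal sketch, assuming 3x3 kernels and default ("valid") padding;
# the instructor's exact model reports ~302k parameters, so his layer
# sizes differ a little from this one.
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(28, 28, 1))              # Fashion MNIST images
x = layers.Conv2D(32, 3, activation="relu")(inputs)  # first conv block: 32 filters
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(32, 3, activation="relu")(x)       # second block: still 32 filters
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)                              # flatten to one long vector
x = layers.Dense(256, activation="relu")(x)          # one dense layer of 256 neurons
outputs = layers.Dense(10, activation="softmax")(x)  # softmax over 10 clothing categories

model = keras.Model(inputs, outputs)

# Hand-check the parameter count, layer by layer:
#   Conv2D: filters * (kh*kw*in_channels + 1);  Dense: in*out + out.
expected = (32 * (3 * 3 * 1 + 1)       # conv 1:     320
            + 32 * (3 * 3 * 32 + 1)    # conv 2:   9,248
            + 5 * 5 * 32 * 256 + 256   # dense:  205,056 (feature map is 5x5x32 here)
            + 256 * 10 + 10)           # softmax:  2,570
assert model.count_params() == expected
```

This is how you "make the numbers tally": every conv layer contributes filters times (kernel area times input channels, plus one bias), and every dense layer contributes inputs times outputs plus biases.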
All right, so it seems like the training loss is going down nicely. Validation is flattening out somewhere around the eighth epoch. Let's look at the accuracy. Same situation here: the accuracy is in the 90s. The final question, of course, is how it does on the test set.
Whoa, 90.5%. Pretty good. By the way, if you're not impressed that we went from 88 to 90: these applications are the proverbial diminishing-returns problems. What you should always do is look at the amount of error that's left and ask yourself, how much of that error am I able to reduce? We had roughly 12% error left when we did the simple Colab yesterday. From that 12% we have knocked off about two percentage points to get over 90, which is great. And in fact, I think the state of the art on this is 97%.
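The "look at the error that's left" framing can be made concrete with a line of arithmetic, using the figures from the lecture:

```python
# Relative error reduction: going from 88% to 90.5% accuracy removes
# about a fifth of the error that was left, not just "2 points".
old_acc, new_acc = 0.88, 0.905
old_err, new_err = 1 - old_acc, 1 - new_acc         # 12% -> 9.5% error
relative_reduction = (old_err - new_err) / old_err  # fraction of remaining error removed
print(f"{relative_reduction:.1%}")  # about 20.8%
```

Measured this way, the CNN eliminated roughly a fifth of all the mistakes the dense network was still making.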
So, I invite you to take this thing and try different filters and so on to see if you can get to the mid-90s. It's not easy, but try it. Yeah?
>> Does the number of epochs have to be related to the number of batches? Because you used a batch size of 64 and 10 epochs.
No, the epochs are independent. An epoch is just one pass through the whole data. Within each pass, within each epoch, the batch size determines how many batches you process: it's the number of examples in your training data divided by the batch size you've chosen, rounded up. That's the number of batches within each epoch. And here I'm choosing 10 epochs because... (Siri found something on the web. Okay.) I chose 10 because it's fast enough to do in class, and 10 is actually more than enough, because you can see it's already beginning to overfit.
Yeah?
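The epochs-versus-batches relationship above is just this arithmetic (using Fashion MNIST's 60,000 training examples and the batch size of 64 from the lecture):

```python
import math

# Batches per epoch = training examples / batch size, rounded up.
n_train, batch_size = 60_000, 64
batches_per_epoch = math.ceil(n_train / batch_size)
print(batches_per_epoch)  # 938  (60,000 / 64 = 937.5, rounded up)

# Epochs are independent: 10 epochs means 10 full passes over the data,
# i.e. 10 * 938 = 9,380 gradient updates in total.
total_steps = 10 * batches_per_epoch
```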
>> This is more of a conceptual question, but is it always the case that a neural network will have better accuracy than a classical machine learning algorithm? I'm asking more in the case of, say, the heart disease problem.
Oh yeah, great question. Neural networks are really good for unstructured data like the images we have here. But if you have structured data, like the heart disease problem, sometimes things like gradient boosting, XGBoost, work really well. So if I'm working on a structured data problem, I'll try both. I'm not going to axiomatically assume that the deep network is the best thing. But if you have unstructured data, it's the best game in town.
All right. By the way, I have a whole section here on how, once you've built a model, you actually improve it. Check it out; it's optional. All right, I'm going to stop this here.
All right. So, we went from 88 to 90-plus percent using convolutional networks. Now let's work with color images. Let's kick it up a notch.
I actually web-scraped all these pictures for you folks, for your enjoyment: roughly 100 color images of handbags and roughly 100 of shoes. The question is, with these essentially 200 images, can we build a really good neural network to classify handbags and shoes? It seems kind of absurd, right? 200 examples is not that much; it doesn't feel like a lot. Fashion MNIST has 60,000 images, and even with that we were overfitting within five to eight epochs. With 200 images, is there any hope? Obviously there is hope, otherwise it wouldn't be in the lecture. So, yeah.
So, we're going to take this data set and see what we can do with it. We'll first build a convolutional network from scratch to solve this problem. All right.
I'm going to run through the code, because at the end of it we'll have a live demo. So, I would like one volunteer to give me a handbag and one volunteer to give me their footwear. In class, yes.
Okay. Unlike the previous data set, this one I just web-scraped, and I've stuck it in this Dropbox folder. Let's download it and unzip it. Once we do that, we have to organize these 200 images, so I have to do some boring-ish Python here. With roughly 100 handbags and 100 shoes, what this code does is create a directory structure: it splits everything into train, validation, and test.
And then for each of the splits, it creates a handbags folder and a shoes folder. Once we do that, this directory structure exists: a training folder, a validation folder, and a test folder, and within each of them, handbags and shoes. In fact, you can see it here. The idea is that when you're working with images, you can just create one folder for each kind of image, say dogs and cats: two folders, one with cat images and one with dog images, and then just point Keras at it. It will automatically figure out that those are the labels. It makes things easy for you, so it's very convenient when you're working with images. And the book explains this in great detail.
All right. When working with these color images, we'll follow this process. We'll read in the JPEGs.
We'll convert them to tensors. And then, since I'm web-scraping these, they all come in different shapes and sizes, so I need to bring them to a common size: I resize them, and then I batch them, here with a batch size of 32. This utility from Keras does all of that for you, very quickly. It reports that it found 98 images in the training data belonging to two classes, 49 in the validation set, and 38 in the test set. So, fewer than 100 examples in the training set. That's what we have.
All right. What's the time? 9:30. Okay. Now let's check the dimensions to make sure. Good: 224 by 224 by 3. Why did I pick 224 by 224? As you will see later, we're going to use something called ResNet, and ResNet expects its input to be 224 by 224 by 3. That's why I resized everything to 224 by 224. Let's look at a few examples of my wonderful web scraping in action. Pretty wild, right? Okay.
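The read-tensor-resize-batch pipeline above can be sketched on synthetic stand-ins for the decoded JPEGs (in the actual Colab, the Keras utility being described is `keras.utils.image_dataset_from_directory(..., image_size=(224, 224), batch_size=32)`, which does all of this in one call given the folder structure):

```python
# A sketch of the resize-and-batch step, using random tensors as stand-ins
# for decoded JPEGs of assorted sizes.
import tensorflow as tf

def resize_for_resnet(image):
    # ResNet expects 224 x 224 x 3 inputs, hence this target size.
    return tf.image.resize(image, (224, 224))

# Fake "decoded JPEGs" of assorted heights and widths, all 3-channel color.
raw_images = [tf.random.uniform((h, w, 3))
              for h, w in [(300, 200), (180, 260), (224, 224)] * 32]

ds = (tf.data.Dataset.from_generator(
          lambda: iter(raw_images),
          output_signature=tf.TensorSpec(shape=(None, None, 3), dtype=tf.float32))
      .map(resize_for_resnet)   # bring everything to the same size
      .batch(32))               # batch size used in the lecture

first_batch = next(iter(ds))
print(first_batch.shape)  # (32, 224, 224, 3)
```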
Now, let's do a simple convolutional network. Before, with Fashion MNIST, we took all the X values and manually divided them by 255 to normalize them to the 0-to-1 range. Well, we're graduating to the higher levels of Keras now, so let's not do that; manual stuff is bad. We'll do it within Keras using something called the Rescaling layer, where you just tell it how much to rescale by, and boom, it does it for you. The first convolution block has 32 filters, just like with Fashion MNIST; the second block, again 32; then max pool, then flatten. And since it's only handbags versus shoes, a sigmoid is enough: it's just a binary classification problem, so I'm using one output unit with a sigmoid. That's our model. So let's build the model. All right, model summary: about 101,000 parameters in this little model. Okay, let's compile it and run it. And note that because it's a binary classification problem, I'm using binary cross-entropy.
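A rough sketch of this binary model, including the Rescaling layer and the binary cross-entropy compile step. The layer details here are my assumptions (the in-class summary reported roughly 101,000 parameters, so the instructor's exact configuration differs a little):

```python
# A minimal sketch of the handbags-vs-shoes model, assuming 3x3 kernels
# and default padding; exact sizes in the lecture differ slightly.
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(224, 224, 3))       # color images, resized earlier
x = layers.Rescaling(1.0 / 255)(inputs)         # replaces the manual "divide by 255"
x = layers.Conv2D(32, 3, activation="relu")(x)  # first conv block, 32 filters
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(32, 3, activation="relu")(x)  # second block, again 32
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)  # one sigmoid unit: binary problem

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="binary_crossentropy",  # binary classification -> binary cross-entropy
              metrics=["accuracy"])
```

A single sigmoid unit is enough here because the probability of "shoe" is just one minus the probability of "handbag"; with 10 classes, as in Fashion MNIST, you need a softmax over 10 outputs instead.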
Adam again, and accuracy as the metric. Compile, and then boom, let's run it, for 20 epochs. Hopefully.
Okay, while it's doing this business, I'm going to shift to the PowerPoint. We'll come back to see how well it did, but whatever it did, we built it from scratch. So the question is, can we do better than that? We only have about 100 examples of each class, which brings us to something very cool and very powerful called transfer learning. The key thing is that there are two research trends going on that we take advantage of. The first is that researchers have designed architectures which exploit the kind of input you have. Olivia asked the question: if you have a particular kind of input, say images, do you change the input, or do you change the network?
As it turns out, if the input is images, we know we should use convolutional layers, because convolutional layers were designed to exploit the image-ness of the input. Similarly, if you have sequences of information, like natural language, audio, video, gene sequences, and so on, these things called transformers were invented to exploit them, and we're going to spend a lot of time on transformers starting next week. So that's the first trend. The second trend is that researchers have used these innovations to create and train models on vast data sets, and thankfully they've made them publicly available for us to use. So transfer learning is the idea that when you have a particular problem, you take a pre-trained network somebody has already created and customize it to your problem, rather than building anything from scratch. That's the basic idea.
So here we have to build a classifier which takes in an arbitrary image and figures out whether it's a handbag or a shoe. That's our goal. Now, handbags and shoes are everyday objects, so you can look around and see whether there are networks, trained by other people, that have been trained on everyday images, as opposed to specialized images like MRIs or X-rays. Of course, the first thing you should probably do is check whether anybody has already built the specific thing you want, a handbags-versus-shoes classifier, on GitHub. Assuming not, then you do transfer learning. Now, it turns out there's this thing called ImageNet, which is a database of millions of images of everyday objects in a thousand different categories: furniture, animals, automobiles, you get the idea. So we can look for networks that have been trained on ImageNet.
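The transfer-learning recipe being set up here (covered in detail later in the course) looks roughly like this sketch: take a network pre-trained on ImageNet, such as the ResNet mentioned earlier, freeze it, and put a small handbags-versus-shoes head on top. Note that `weights=None` below is only to avoid the ImageNet weight download in this sketch; in practice you would pass `weights="imagenet"`.

```python
# A sketch of transfer learning with a ResNet50 backbone. weights=None here
# avoids the download; use weights="imagenet" for real transfer learning.
from tensorflow import keras
from tensorflow.keras import layers

base = keras.applications.ResNet50(weights=None,       # "imagenet" in practice
                                   include_top=False,  # drop the 1000-class ImageNet head
                                   input_shape=(224, 224, 3))
base.trainable = False                                 # freeze the pre-trained features

inputs = keras.Input(shape=(224, 224, 3))
x = keras.applications.resnet50.preprocess_input(inputs)  # ResNet's own preprocessing
x = base(x, training=False)                # run frozen backbone in inference mode
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)        # new binary head
model = keras.Model(inputs, outputs)
```

Only the tiny new head is trained, which is why this works even with about 100 examples per class.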
Okay, let me just go back to the Colab to make sure it doesn't time out. All right, it has finished. Let's plot these things. Okay, so there is some overfitting that happens around the 10th epoch. Let's look at the accuracy. The training accuracy is getting to almost 100%, but we're not interested in training accuracy, right? We care about validation and test accuracy, and that seems to be hovering around the 80s. So let's evaluate it anyway and see what happens. Okay, it gets to 87% accuracy on this data set, which is actually pretty good given that we only have about 100 examples per class. So, 87% accuracy, and we built everything from scratch. Now, there's this whole section about data augmentation, which, you know what, do we have time?
So, the idea of augmentation is that when you have an image, say you take this image and rotate it slightly, by 10 degrees. If it was a handbag before you rotated it, it sure as hell is a handbag after you rotated it. The meaning of the image doesn't change just because you rotated it slightly. Or maybe you zoom in slightly, zoom out slightly, crop it slightly: nothing happens. So what you can do is take any image you have, perturb it slightly, and then add it as a new example to your training data. This is an unbelievable free lunch, frankly. And the same kinds of techniques actually work for text too, which we'll cover later on. This broad area is called data augmentation. It's a great way, when you don't have a lot of data, to artificially bolster the amount of data you have.
And of course, Keras makes it very easy to do all these things; it comes with a whole bunch of predefined data augmentation layers. Here's a little example where I take a picture and randomly flip it horizontally; then I randomly rotate it by a factor of 0.1 (per the Keras documentation, that factor is a fraction of a full circle, so 0.1 means up to roughly 36 degrees either way); and then a random zoom, in and out a little bit. But it won't do this to every picture: the perturbations are applied randomly, so only some pictures get perturbed in some ways. That's how you make sure there's enough diversity in the pictures you have. Once you do that, you can take a picture and see what it does; I grab a random picture each time, so it keeps changing. Yeah, look at this handbag: slightly rotated this way, rotated that way. Some more.
Maybe a little bit of zooming going on, and so on. You get the idea, right? And there's a whole list of these things you can do. But when you do them, make sure that what you're doing doesn't actually change the underlying meaning of the picture. That's really important. So for example, if you're working with satellite data, be very careful about doing crazy flips. Or even if you're working with everyday images: horizontal flips are okay, but don't do vertical flips. How many times will you have an upside-down dog picture that you need to classify? Make sure your augmentation doesn't go nuts. All right. Once you do that, you can just insert the data augmentation layers into your model right there, right after the input. The rest of it can stay unchanged.
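For concreteness, a minimal sketch of what such an augmentation pipeline can look like in Keras, assuming Fashion-MNIST-sized 28 by 28 grayscale inputs; the layer names are real Keras layers, but the factors and the little model around them are illustrative, not the Colab's exact code. (To resolve the degrees-versus-radians question: per the Keras documentation, the `RandomRotation` factor is a fraction of a full turn, so 0.1 means up to about 36 degrees either way.)

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# The augmentation layers described above, bundled together.
data_augmentation = keras.Sequential([
    layers.RandomFlip("horizontal"),  # safe for everyday photos; avoid "vertical"
    layers.RandomRotation(0.1),       # factor is a fraction of a full turn (2*pi)
    layers.RandomZoom(0.1),           # zoom in or out by up to 10%
])

# Insert the augmentation right after the input; the rest of the model
# can stay unchanged. These layers are only active during training and
# act as the identity at inference time.
inputs = keras.Input(shape=(28, 28, 1))
x = data_augmentation(inputs)
x = layers.Flatten()(x)
x = layers.Dense(128, activation="relu")(x)   # illustrative hidden layer size
outputs = layers.Dense(10, activation="softmax")(x)
model = keras.Model(inputs, outputs)
```

Because the perturbations are sampled fresh on every pass, calling `data_augmentation` repeatedly on the same picture gives you a slightly different picture each time, which is exactly the diversity effect described above.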
So this is a great way to increase the size of your training data. Here's a model, and I invite you to play with it and train it; in the interest of time we won't actually train this one, but it's in the Colab, so you can just try it. Data augmentation also figures prominently in homework one, by the way, so you'll get more experience with this. Okay, so back to the slides. So this is what we have. It turns out that any network that has been trained on this ImageNet data set learns all kinds of interesting features in every one of its layers. This is the first layer, and you can see it's picking up gradations of color, sort of line-ish kinds of behavior. Layer two is actually picking up... hey, look, it's picking up an edge. Can you see that edge? Like that. And then layer three is picking up these interesting honeycomb shapes, and so on.
Oh, and this one is actually already picking up the shape of a human torso. Yeah, and this layer is picking up what looks like a Labrador retriever. Isn't that cute? Come on, even if you're not a dog person. All right. So this is the visualization I was referring to earlier, for figuring out what these networks are actually learning. This paper was one of the first to actually visualize what's going on inside, so if you folks are curious how these pictures are produced, I'd encourage you to check it out. Okay, yep?

So, we spoke about images, and you referred to classes, and to text next week with transformers. But what about, say, an email, which has both text and images, and maybe white space, depending on who has written it?
Does that get put in as an input as an image, or...?

So, we'll revisit this great question a bit later in the course. The answer is a bit complicated and I want to do it justice, so we'll come back to it. All right. So it turns out this thing called ResNet is a family of networks that were trained on this ImageNet data set, and they did really well in the competition associated with the ImageNet data set, the ImageNet challenge (ILSVRC). So this is an example of such a network. We would expect the weights and parameters of ResNet, given that it's been trained on ImageNet, to have some knowledge about lines and shapes and curves and things like that. So maybe we can just use that, right? But the thing is, we can't use ResNet as is, because remember, it was trained to classify an incoming image into a thousand possibilities. Here we only have two possibilities, handbags and shoes. So what we do is very simple and elegant.
We do just a little bit of surgery. We take ResNet and stop just before the final layer. Take my word for it: this thing here says fully connected, one thousand, because it's a thousand-way classifier, a thousand objects. So we take everything and stop just before that last layer. And what comes out of that layer, hopefully, will be a very smart representation of the images it has been trained on. So we can think of this sort of headless ResNet as our model. We can take all our data and run it through ResNet up to, but not including, the last layer. You get some tensor, and that tensor probably carries a very rich understanding of what's going on in the image: all the objects and features and things like that. So we can think of it as a smart representation of the input.
We can connect that to just a little hidden layer, and then a little sigmoid which tells you handbag or shoe, and we can just run this network. Okay? And since the inputs to the hidden layer are now not raw images anymore, but this much higher level of abstraction that ResNet has learned, hopefully it can get the job done with hardly any examples. Okay? And now you can get fancier. That's the basic idea, but you can get much fancier: you can connect headless ResNet directly to our little network, the hidden layer and the final output, and the whole thing can be trained end to end. But when you do that, you must start the training with the weights you downloaded with ResNet, because those are the crown jewels that have been learned, so you want to start from there. You will do this in homework one. Okay? All right. By the way, these pre-trained models are available all over the internet: there's TensorFlow Hub, there's PyTorch Hub, and then there's the Hugging Face Hub.
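The fancier, end-to-end version can be sketched like this, assuming a ResNet50 backbone initialized from its downloaded ImageNet weights as the starting point; the head sizes, the 224 by 224 input shape, and the small learning rate are illustrative assumptions, not the homework's actual settings.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Headless ResNet, initialized from the downloaded ImageNet weights.
# Starting from these weights (not random ones) is the whole point.
backbone = keras.applications.ResNet50(include_top=False,
                                       weights="imagenet",
                                       input_shape=(224, 224, 3))
backbone.trainable = True  # the whole thing trains end to end

inputs = keras.Input(shape=(224, 224, 3))
x = keras.applications.resnet50.preprocess_input(inputs)  # ResNet's expected scaling
x = backbone(x)
x = layers.Flatten()(x)
x = layers.Dense(256, activation="relu")(x)         # little hidden layer
outputs = layers.Dense(1, activation="sigmoid")(x)  # handbag vs. shoe

model = keras.Model(inputs, outputs)
# A small learning rate (an assumption here) helps avoid wrecking the
# pre-trained weights in the first few gradient updates.
model.compile(optimizer=keras.optimizers.Adam(1e-5),
              loss="binary_crossentropy",
              metrics=["accuracy"])
```

The contrast with the simpler approach: here the backbone's 23 million parameters keep updating during training, whereas in the feature-extraction version they stay frozen and only the little head learns.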
When I checked it yesterday, on the 13th, it had over half a million models available for download. Half a million! I think last year, when I taught the course, it was like 50,000. So, yes?

I was just wondering, doesn't this make your neural network susceptible to adversarial attacks, because the weights have been pre-trained?

Yes, there is some adversarial risk. I'm happy to talk about it offline. All right. So that's what we have. So back to the Colab. Okay, so this is ResNet. ResNet is all packaged up and available for download, so we download it here. And you see that I'm saying include_top=False. Basically you're telling Keras: the top, the very final layer of the thing, don't give it to me; just give me everything up to but not including that. Of course, I think of it as left to right; other people think of it as bottom to top, so: the very top layer, don't give it to me.
You're telling it this so that you don't have to go in and remove the layer manually. Okay? And then I'm not going to summarize it... well, I'll summarize some of it, just to show you how big it is. Okay? 23 million parameters. That's ResNet. And I won't plot it, because then I'd be scrolling for five minutes. So let's just do this now. What we're going to do is run all the data through this thing, and whatever comes out at that penultimate point, I'm going to grab it and store it. That's what this code does. All right. And now we create a handy little function to do all these things. Once I do that, every image has been sent through ResNet up to, but not including, the final layer, and whatever would have gone into the final layer, we're storing. Then we'll create a simple network and feed it only that information.
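As a sketch, the download and feature-extraction step might look like this; `extract_features` is a hypothetical helper name (the Colab's own function may differ), and the 224 by 224 input size is the standard ResNet50 default, which we assume here.

```python
import numpy as np
from tensorflow import keras

# Download ResNet50 without its final 1000-way classification layer.
# weights="imagenet" pulls the pre-trained weights from the internet.
backbone = keras.applications.ResNet50(
    include_top=False,          # chop off the very top (final) layer
    weights="imagenet",
    input_shape=(224, 224, 3),
)
backbone.trainable = False      # feature extraction only; no fine-tuning here

def extract_features(images):
    """Hypothetical helper: run a batch of images through headless ResNet
    once, and return the feature tensors so they can be stored."""
    # ResNet50 expects its own input preprocessing.
    x = keras.applications.resnet50.preprocess_input(images.astype("float32"))
    return backbone.predict(x, verbose=0)

# Each image comes out as a 7 x 7 x 2048 feature tensor.
features = extract_features(np.random.rand(4, 224, 224, 3) * 255.0)
# features.shape is (4, 7, 7, 2048)
```

The key efficiency trick is that this expensive pass through the 23-million-parameter backbone happens once per image; the stored features are then reused for every training epoch of the little head.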
So here's what's coming out of ResNet: you can see 98 examples in the training data, and each example is now a 7 by 7 by 2048 tensor. That's what came out of ResNet; you saw that's what I did there. Okay? All right, so that's what it looks like. Now let's create our actual model. We have our input, which is just a 7 by 7 by 2048 tensor. We flatten it immediately. Then we run it through a dense layer with 256 ReLU neurons, and then we use dropout, which I haven't talked about yet; I'll talk about it early next week, so don't worry about this detail for the moment. And then we just run it through a sigmoid. Okay? And that's our model. Finished. Plot the model: this is what we have. Model summary. All right, good. Now let's actually train this thing. I'm just going to run it for 10 epochs, because I tried running it previously and it seems to do a fine job in just one epoch. Okay, it's already done.
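A minimal sketch of that head model, following the layers as described: flatten, dense ReLU layer of 256, dropout, sigmoid. The dropout rate and the compile settings are assumptions, since they aren't stated here.

```python
from tensorflow import keras
from tensorflow.keras import layers

# The input is the stored ResNet feature tensor, not a raw image.
inputs = keras.Input(shape=(7, 7, 2048))
x = layers.Flatten()(inputs)                    # 7 * 7 * 2048 = 100,352 values
x = layers.Dense(256, activation="relu")(x)     # the little hidden layer
x = layers.Dropout(0.5)(x)                      # rate 0.5 is an assumed value
outputs = layers.Dense(1, activation="sigmoid")(x)  # handbag vs. shoe

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam",                 # assumed optimizer/loss choices
              loss="binary_crossentropy",
              metrics=["accuracy"])
```

Training this is cheap: it's a tiny network sitting on top of precomputed features, which is why each epoch finishes almost instantly.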
It's so fast because we ran everything through this monster ResNet once, took all the output values, and used them as the starting point, right? We don't have to run it every single time. So you can see here the accuracy is quite high. Wow, interesting: in the 10th epoch something bad happened. Maybe I should have stopped at the ninth epoch. I didn't see this yesterday when I was running it. So much for reproducibility. So let's just run this. Oh wow, look: on the test set it's achieving 100% accuracy. It's unbelievable. Okay folks, now for the moment of truth. All right, I have a little code snippet here to capture stuff from the webcam. Because the accuracy went down in that last epoch, I'm a little worried that the demo is going to flunk. But you know what? We all have to live dangerously. So here's a little function to predict what's going to happen. Okay. Now, I tried it at home yesterday, by the way.
And it's like, "Yay, it's a handbag." So, okay. Now let's do something else. Okay, any volunteers? I want a piece of footwear or a handbag. That's more like a backpack, right? I don't know, it feels like an adversarial example, but yeah, let's just try it. Okay. No disrespect, but let me go with the shoe first; I have a better chance of it working. It's a pretty big shoe. If it can't get this shoe, I'm worried about this model. All right. So... okay, hold on, hold on, hold on. All right. Please don't get distracted by my hand. Capture. It's a shoe! Look at that. Phew. All right, thanks. Okay, now let's try that one. I'm feeling kind of brave now. Thank you. All right, let's do this. Camera capture. Okay. Put its better side forward. It's a handbag! Look at that.
I swear, every time I do the demo I age a few years. All right folks, I'm done. Thank you.