Okay. All right. Let's get going. [clears throat] Today is going to be packed. I'm going to spend roughly the first half of the lecture actually building a model, a Keras model, in Colab to solve the heart disease problem we saw earlier, and then switch gears halfway and talk about how to solve image classification. Okay, so we're going to do two Colabs today. I've been talking about Colab, right? I've been teasing you; we'll actually do Colabs today. All right. By the way, I've shut off the lights at the top, because when I switch to Colab it's going to be much better for you folks, particularly the folks in the back, to be able to see it. Okay, but I hope you can see the slide right now. Yes? Okay, great.

So this is just a quick recap of what we did last class. Broadly speaking, training a neural network is essentially no different than training other kinds of models. We have a bunch of parameters, i.e., weights and biases, and we need to use the data to find good values of those weights. And what does "good" mean? Typically it means that we define some measure of discrepancy between what the model predicts for a given set of weights and what the right answer is, the ground truth answer, and then we try to find weights that minimize this discrepancy. That's it. And this notion of a discrepancy is called a loss function.

So, broadly speaking, the overall training flow is that you define some network. It has an input; it goes through a bunch of layers; you come up with some predictions. You take the predictions, you take the true values, and those two go into the loss function, i.e., the discrepancy function, and you come up with the loss score. Then you send it to the optimizer, which proceeds to calculate the gradient of this loss function with respect to all the parameters and then updates all the weights using that gradient, and then this process repeats. That's it. That is the training flow. Okay, quick recap.
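To make that loop concrete, here is a tiny runnable toy with a single weight. All the numbers are made up; the point is only the predict, score, differentiate, update rhythm described above:

```python
# Toy training flow: one weight w, prediction = w * x, loss = (prediction - truth)^2.
w, x, truth, learning_rate = 0.0, 2.0, 6.0, 0.1
for step in range(3):
    prediction = w * x                       # forward pass through the "network"
    loss = (prediction - truth) ** 2         # the discrepancy score
    gradient = 2 * (prediction - truth) * x  # d(loss)/dw by the chain rule
    w = w - learning_rate * gradient         # the optimizer's weight update
    print(step, round(w, 3), round(loss, 3)) # w heads toward 3.0, loss toward 0
```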
Now, we also talked about the optimization algorithm we're going to use, which is called gradient descent. And in gradient descent, as you noticed, in each iteration every data point is used to make predictions, and therefore to calculate the loss, and then to calculate the gradient. And then we pointed out that gradient descent is actually not as good as something called stochastic gradient descent, where instead of taking all the points, we just randomly choose a small number of points, pretend for a moment as if those are the only points we have, make predictions, calculate loss, calculate gradient, and go on. So that was the basic idea behind stochastic gradient descent, right? Two different kinds of things.

Now, what this means is that when we actually start training the model, as we will in a few minutes, because we only take a few points at a time, we have to be a bit careful about what's going on. And I want to make sure you clearly understand what the differences are before we actually get to the Colab. Okay.

All right. So there is the notion of an epoch. An epoch essentially just means that we make one pass through the training data. All the training data, we make one pass through it. Okay. And what is one pass? If you have something like gradient descent, one pass means every data point is sent through the network. We calculate its predictions, calculate the loss, calculate the gradient, right? We run every training sample through it.
We calculate the gradient, which is just this thing here, right? I will sometimes write it as dLoss/dw, the derivative of the loss with respect to w; sometimes I might use the nabla symbol, ∇Loss. These are all interchangeable. Okay, so we calculate the gradient and then we update using some version of this. But we just do it once, at the end of the epoch: if you have 10 billion data points, every one of them flows through, you get 10 billion outputs, and then at the end of all of it we calculate the gradient and update once. One update per epoch. Yes.

Now, in stochastic gradient descent, what we do is process the data in batches, small numbers of points at a time, right? Technically speaking, these are called mini-batches. I don't know about you, but I just get tired of saying "mini-batches"; I'm just going to say "batches" from this point on, and in fact that is widely done in the literature. So we'll process the data in batches. We take the training data and divide it up into batches: batch one, batch two, all the way to the final batch. And for each batch, we basically do gradient descent: we take batch one, run just the training samples in that batch through the network to get predictions, calculate the gradient, update the parameters, and then we go to batch two, then batch three, and so on and so forth.
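Here is a rough numpy sketch of one epoch of that per-batch loop, using a made-up linear model and squared-error loss purely for illustration:

```python
import numpy as np

# Toy setup: linear model y = X @ w_true, squared-error loss; illustrative only.
rng = np.random.default_rng(42)
X = rng.normal(size=(194, 3))               # 194 samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5])          # made-up ground-truth targets
w = np.zeros(3)                             # weights to learn
learning_rate, batch_size = 0.1, 32

for start in range(0, len(X), batch_size):  # one epoch, batch by batch
    Xb = X[start:start + batch_size]        # just this batch's rows
    yb = y[start:start + batch_size]
    preds = Xb @ w                          # forward pass on the batch only
    grad = 2 * Xb.T @ (preds - yb) / len(Xb)  # gradient of mean squared error
    w -= learning_rate * grad               # weights change before the next batch
```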
So pictorially, this is how it's going to look. Let's say the first batch is, say, 32 points. We take those 32 points, run them through the network, get all the stuff out, calculate the gradient, and update the weights. So when we now get to batch two, the weights have changed; they have been updated. And then we do the same thing for batch two, batch three, all the way until we get to the end. And when we are done with this, this whole thing is called a what?

An epoch. [clears throat] This whole thing is an epoch. Okay.

All right. Now, the question of course is: if you have a bunch of data points and you're going to run stochastic gradient descent on them in a particular epoch, how many batches are going to be there? Okay, how many batches are going to be there? Now, Keras is going to calculate all this stuff. You don't have to worry about it, but you just need to understand exactly what happens. Okay. My philosophy, by the way, is that you have to know the details of what's going on. If you don't know the details, if you haven't figured them out at least once, you will not actually be able to think new and creative thoughts for a new problem. Okay? It's because the concepts are not manipulable in your head yet. Okay.

Please use the microphone.

>> So when we talk about SGD, we're talking about only taking some part of it. Are we saying that we only take some variables, or that we only take some part of the data?

>> We are taking some rows.

>> Okay, only some rows, right. So those data points, that means a batch.

>> Exactly. So for example, let's say you have a thousand data points, right? A thousand rows of observations: a thousand patients in the heart disease example, or a thousand images that you're trying to classify.
You take, let's say, 32 of those images, 32 of those patients, and that's a batch. Then you go to the next 32, then the next 32, and so on and so forth, until you run out of patients or run out of images.

>> And each iteration you are updating with the new weights that you've got?

>> You're basically updating the weights as you go.

>> And it means you keep correcting it, or keep moving towards...

>> Updating the weights, yes.

>> And is what we're calling the epoch ultimately the equation of the loss function that we are trying to do?

>> No. An epoch... see, the thing to remember is that here, this whole thing is called an epoch, because we have to do one full pass through the training data. Okay? But within that epoch we update the weights many times. Basically, we update the weights as many times as we have batches.

All right. Um, so to go here: basically the idea is that you take the training set and divide it by the batch size, and you choose the batch size. Okay, you choose the batch size, and we'll talk later about how you choose it. Once you choose the size, just divide and round up. So for example, as you will see in the Colab, the training set is going to be 194 patients, and we're going to choose a batch size of 32. We typically tend to choose batch sizes like 32 or 64, because they align very well with the nature of the parallel hardware we're going to use. Okay. And so here, 32 and so on. So divide 194 by 32: you get 6-point-something. You round it up to seven. Okay. And what that means is that the first six batches will have 32 samples each, and then the final batch has only two samples left. And that's okay; it can be a nice little small batch at the end. There's nothing that says every batch has to be the same size.
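That round-up is just a ceiling division. As a quick sanity check:

```python
import math

n_samples, batch_size = 194, 32
n_batches = math.ceil(n_samples / batch_size)          # 194/32 = 6.06..., rounds up to 7
last_batch = n_samples - (n_batches - 1) * batch_size  # 194 - 6*32 = 2 samples
print(n_batches, last_batch)                           # 7 2
```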
>> That's it: epochs and batches.

>> And for each batch, do you run through the whole network, like all the layers, or is each layer one batch?

>> No. For a batch, you run it through the entire network. The way I think about it is that you take a batch, and momentarily you assume that's all the data you have, and you just run it through the network. Because unless you run it through every layer of the network, you can't get a prediction; and unless you get a prediction, you can't calculate the loss; and unless you calculate the loss, you can't calculate the gradient; and unless you calculate the gradient, you can't update the weights.

>> Last thing: but if you're using all the data, just doing gradient descent, then you just go through the network once, right?

>> Exactly. So in gradient descent, one epoch is one pass and one weight update. In stochastic gradient descent, the number of updates you make is equal to the number of batches you have, which ends up being the training set size divided by the batch size, rounded up.

>> So just to confirm: initially when we introduced the concept of batches, the whole purpose was to not run through all the data and to be able to make some prediction from a subset. So now the advantage is that after batch one, we are using more accurate coefficients to run through batch two, and so on. Is that really the advantage, or is there something else to it?

>> Perfectly said. That's exactly the advantage. We take a small amount of data and we say: hey, we know this is not all the data, it's just a small subset, so it's not going to be super accurate. It's going to be approximate, but that's okay.
We'll still tend to move in the right direction. So instead of waiting for the whole thing to get done and then updating, we're just going to update as we go along.

All right. Uh, yes?

>> Building on her question: is it that doing this process for SGD will render us a better solution, or that it requires less compute power?

>> Both. Both, and the reasons for both are in the previous lecture. I'm not repeating them just because I'm very pressed for time today. All right, cool. So that's what we have. Are we good?

Okay, so now we come to the last step before we actually fire up the Colab, which is overfitting and regularization. If you remember from your machine learning background: when your model gets more and more complex, right, you use a simple model, then a more complex model, and so on and so forth, what happens to the error on the training data? Typically, what happens to the error on the training data? Let's say you have a simple regression model and you get some error. Then you have a regression model in which you use all kinds of interaction terms, you use logarithms and this and that and make it super complicated. What do you think is going to happen to the error on the training data?

Right. Basically, it's going to go down as the model gets more complex. Correct. Now of course comes the punch line, which is: what do you think is going to happen to the error on the test data? I showed you the answer.

Right. Basically, what's going to happen, typically, at least conceptually, is that it's going to get better and better, at some point it's going to bottom out, and then it's going to start climbing again.
And so we typically refer to this phenomenon here, when it starts to climb again, as overfitting, because the model is essentially fitting to the idiosyncrasies of the training data as opposed to generalizing patterns. And in this region we call it underfitting, because there is still a lot of potential to improve. We really are hoping to find the sweet spot in the middle, right? That's the basic idea of overfitting and underfitting.

And to relate this to neural networks: as you've learned so far, you have to learn smart representations of the input data, and to do that, I have argued that you need lots of layers in your network. The more layers you have, the better things get. GPT-3, for example, has 96 layers, if I recall right. More layers the better, but more layers means more parameters, more parameters means more complexity in the model, and therefore more chance of overfitting. Okay?

So it's really important in neural networks that we think about regularization. And regularization, you will recall from your machine learning background, is the way we handle the risk of overfitting and try to find models that fit just right. Okay. Several regularization methods have been developed over the years, and we are going to use only two of them. The first one is called early stopping,
and this has been famously referred to by Geoff Hinton, who is one of the pioneers, or as he's more colorfully known, one of the godfathers of deep learning, and who also won the Turing Award a few years ago, as sort of a beautiful free lunch. That's what he calls it. The idea is very simple. We take the training data and split it into a training set and a validation set, and then we just keep doing gradient descent. The training error will hopefully keep getting better and better, lower and lower, and we keep track of what's going on in the validation set. At some point, if the validation error starts to flatten out and starts to climb, we just say: okay, that's when we stop training. Right?

And what we're going to do in the Colab is actually run it through the whole thing, see where it flattens out, and then say: okay, that's where we should have stopped. But of course, you don't want to go all the way to the end and then go back and say, well, I want to stop at the 10th epoch; there are ways you can use Keras to be very efficient about this. The fundamental idea is: you take the training data, split it into training and validation, and just track what's going on in the validation set to see whether this kind of bottoming out happens. Okay. So this is called early stopping; we're looking for this part.
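In Keras, the efficient version of this looks roughly like the sketch below. The callback and its arguments are the real Keras API, but the surrounding names (a compiled `model`, arrays `X_train` and `y_train`) are assumptions standing in for what the Colab builds:

```python
# Stop training automatically when the validation loss stops improving,
# instead of running to the end and rewinding by hand.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch the held-out validation loss
    patience=10,                 # tolerate 10 epochs with no improvement
    restore_best_weights=True,   # roll back to the best epoch's weights
)
history = model.fit(
    X_train, y_train,
    validation_split=0.2,        # carve out the validation set
    epochs=200, batch_size=32,
    callbacks=[early_stop],
)
```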
The other method is called dropout. And I'm going to come back to dropout in Wednesday's lecture, because that's the first time we're going to use it. So I'll come back to dropout then and tell you exactly how it works. It's a very, very clever strategy, but we will not use it today; we'll use it on Wednesday. Okay.

So in summary, what do we do? We get the data ready. We design the network: number of hidden layers, number of neurons, and so on and so forth. We pick the right output layer. We pick the right loss function. We choose an optimizer. As I mentioned earlier, SGD comes in lots of flavors, lots of variations on the theme, and empirically, much like we tend to use ReLU as the activation function for hidden-layer neurons, for optimization we tend to use a flavor of SGD called Adam as sort of the default, because it's really good. So we'll use Adam. As you'll see, we typically use either early stopping or dropout, and then you just fire it up and start training in Keras and TensorFlow. All right, so that is the training loop.

Now I'm going to switch gears and give you a quick intro to Keras and TensorFlow. Okay, Keras and Tensor... no, TensorFlow and Keras. Thank you. Um, and then we'll actually fire up the Colab. So, first of all, what's a tensor?

>> Yeah, just a quick question on the previous thing. If you're looking at the validation set to avoid overfitting, aren't you actually overfitting anyway, because you're kind of using the validation set as a training set? Or not?

>> Uh, no, no, no. The validation set is never used to calculate any gradients. It's only used to calculate accuracy and loss. Yeah, it's kept aside and only used for evaluation, not for training. That's what keeps you honest.

>> Right.

>> And this will become clear when we actually go to the Colab. So, what's a tensor?

>> All right. Okay. A tensor is the input data which you're giving to the system. It could be in various formats: if it's an image, we call it a 4D tensor; if it's time-series data, it's 3D.
And typically, if you just send numbers in, it becomes a vector, which gives the values of the variables associated with it, as well as the information you want to get to.

>> You're kind of on the right track, but not entirely, right? It's actually a simpler concept than that. So, uh...

>> It's like a matrix, but generalized to higher dimensions.

>> Correct. That's also actually correct, but incomplete. The reason is that it can be simpler than a matrix. It's not matrix-or-higher; it could actually be simpler. In fact, you take a number: that's actually a tensor. All right? The simplest case of a tensor is a number. The next case up is a vector, which is a list. The next higher case is a table.

Okay, so these are all tensors. Tensors basically are a generalization of the notions of a number, a vector, and a table to higher dimensions. Okay? And you can think of every tensor as having something called a rank, right? A number is just a number; it doesn't have any dimensionality to it, so it has rank zero. Okay. While a vector is a list of numbers; you can sort of write it down top to bottom, and it has one dimension. Right? That one dimension is called a rank, so it's called rank one. A table is 2D, two-dimensional, so it's called rank two. And you can have a rank three, which is just a bunch of tables. A bunch of tables is a rank-three tensor; we also think of it as a cube. Okay.

So these things are very useful, because obviously we are all familiar with vectors. And as you will see very shortly later in this class, black-and-white grayscale images are usually represented using tables of numbers like this.
Color images are represented using three tables. Okay. Can you think of what might be representable as, you know, a tensor of rank four? Meaning every element of a tensor of rank four is actually a color picture. Just shout it out. Video. Exactly. What is a video? A video is basically a stream of color images. So for each element of that stream, the first dimension of the tensor is which frame it is, and then everything else is the actual frame.

So the way I always think about these tensors is this: you can think of a tensor as being this array which has all these axes, or dimensions. This is the first one, this is the second one, this is the third one. Right? This is a tensor of rank four. Okay? One, two, three, four. And if you have a vector, right, you can imagine the vector living like this, just a list of numbers, right? But if it is a rank-two tensor, which is just like that, then this thing becomes, you know, like that, and that thing becomes like that. So for example, if this is a 7 by 3, that means there are seven rows and three columns.

So you get the idea. The way you think about a tensor is always: an open square bracket, a bunch of things, a closed square bracket. That's really what a tensor object is. What that means is that any time you have a tensor, however complicated it is, you can always create a more complicated tensor by taking a list of those tensors. Let's say that you have a list of videos. Each video is a rank-four tensor, so a list of videos is what rank?

Exactly. So a tensor of rank, say, ten is just a list of rank-nine tensors.
So that is the most important thing you need to understand about tensors. At any point in time, if I give you a tensor, you can just iterate through the first dimension of it, the first axis, and go through each one of those values. So, for example, here... yeah, that will do it. If you have this tensor here and you want to create a more complicated tensor, no problem: you add another dimension here. Okay. Now, let's say this new dimension has nine values, one through nine. So you put a zero here, and what do you get? This whole thing is a rank-four tensor. You put a one here: another rank-four tensor. You put a two here: another rank-four tensor. So for every tensor, you take the first element; it's just a list, but it's a list of the next lower-rank tensors.

Okay. Now, this tensor concept is actually something Einstein came up with. Um, and so it's simultaneously kind of easy to understand and also slippery. So I would actually encourage you to read the book, which has a really good discussion of tensors, and the more you practice with it, the easier it'll get. Okay. So if you feel you kind of understood, but not quite, you're not alone. It happens to all of us, right? You have to pay the price, go through the crucible. Okay. Okay. All right.

So, to come back to this: that's what we have, and we already talked about a rank-four tensor, it's a video. Section 2.2 of the text has a lot more detail; you should definitely read it.
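As a quick sanity check you can run yourself, here is a small sketch of those ranks in TensorFlow (the values are arbitrary):

```python
import tensorflow as tf

t0 = tf.constant(7.0)                # rank 0: a single number
t1 = tf.constant([1.0, 2.0, 3.0])    # rank 1: a vector (a list)
t2 = tf.constant([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])  # rank 2: a table (2 rows, 3 columns)
t3 = tf.stack([t2, t2, t2, t2])      # rank 3: a list of tables (a "cube")

print(t0.ndim, t1.ndim, t2.ndim, t3.ndim)  # 0 1 2 3
print(t3.shape)                            # (4, 2, 3)
```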
So, TensorFlow is a library, and as you can imagine, in neural networks tensors come in, go through the network, and go out the other end, right? And since tensors capture everything (numbers, lists, tables, and so on and so forth), it's just tensors flowing from input to output. Hence it's called TensorFlow. And it gives you a couple of things which are really, really important, which is why we use it.

The first is that it will automatically calculate gradients for you, of arbitrarily complicated loss functions. You don't have to calculate the gradient, because calculating the gradient is very painful, right? It'll automatically calculate the gradients for you. That's the best part. You don't have to use the chain rule; you don't do anything. The second thing: it gives you all these optimizers, including SGD and all its variations, so you don't have to worry about the optimization itself; you can just pick and choose what you want. Third, if you have a lot of servers, it'll actually take the computational load and distribute it across all those servers. People here with a CS background know that parallelizing computation is actually a very difficult problem, right? There are things which are called embarrassingly parallel, but many things are not, and are actually quite tricky to parallelize. TensorFlow will figure it out. Okay?
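Here is the first of those features in miniature: a tiny sketch of TensorFlow computing a derivative for you, with a made-up toy loss:

```python
import tensorflow as tf

w = tf.Variable(3.0)
with tf.GradientTape() as tape:
    loss = (w - 1.0) ** 2      # a toy loss; could be arbitrarily complicated
grad = tape.gradient(loss, w)  # d(loss)/dw = 2*(w - 1)
print(grad.numpy())            # prints 4.0; no chain rule by hand
```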
And then finally, I talked about the fact that there are these things called GPUs, graphics processing units, which are parallel hardware. Even if you have just one computer, if it has GPUs, there's a particular way in which you have to take your computation and organize it to really exploit the fact that you have a GPU, and TensorFlow will do that for you out of the box, automatically. You don't have to worry about any of that stuff. Okay, so those are all the advantages. By the way, a TPU is a tensor processing unit; you can think of it as Google's GPU, right? They came up with their own variation on the theme.

Okay, now Keras sits on top of TensorFlow, right? This is the hardware you have; TensorFlow sits on top of the hardware; Keras sits on top of TensorFlow, and it basically gives you a whole bunch of convenience features. So, for example, it gives you the notion of a layer, right? We already saw keras.layers.Dense, a dense layer, right? It gives you the notion of a layer. It gives you the notion of activation functions, and so on and so forth. It gives you easy ways to preprocess the data, easy ways to train the model and report on metrics, you know, calculate validation loss, validation accuracy, training loss, all the metrics we care about. And then it also gives you a whole library of pre-trained models that you can just use and adapt for your particular problem. So it gives you a whole bunch of conveniences, and that's why it's very popular.

And by the way, many of you might also be familiar with PyTorch, which is a fantastic framework for deep learning as well. The reason we chose to go with TensorFlow for this course rather than PyTorch is that we wanted to make the course accessible to folks who don't have a ton of programming background before coming to the class, and PyTorch is a bit more demanding from a CS perspective.
It requires more knowledge of object-oriented programming, which is why we decided to go with TensorFlow and Keras: I think it's just as powerful in many ways, and it's a little easier to get going. Okay, so that's what we have here.

One other thing I will mention is that there are three ways in which you can use Keras. There are three kinds of APIs: sequential, functional, and subclassing. We'll almost exclusively use the functional API. Okay? And in fact, the model we built for heart disease prediction uses the functional API. So just read Section 7.2.2 of the textbook to understand in detail how the API works. I find in my own work that the functional API is basically all I need; I don't need anything more complicated than that. And as you will see as you work on the homeworks and on your project, it's sort of a beautifully designed Lego-block environment for doing these things, and you can create very complicated models very easily. Okay. There's a whole bunch of stuff on these websites, so check them out; lots of Colabs are available.

So now, if you go back to the neural model for heart disease prediction, this is what we came up with in the last class, right? We had an input layer, one dense layer with 16 neurons, ReLU neurons, an output layer with the sigmoid, and boom, that was the model.
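In functional-API terms, that model looks roughly like the sketch below. The layer structure matches what we built last class; `num_features` is a hypothetical placeholder, since the real input width depends on the Colab's preprocessing:

```python
from tensorflow import keras

num_features = 13  # hypothetical; the actual count depends on preprocessing

# Functional API: each layer is called on the previous layer's output.
inputs = keras.Input(shape=(num_features,))               # one row of patient features
x = keras.layers.Dense(16, activation="relu")(inputs)     # the hidden layer
outputs = keras.layers.Dense(1, activation="sigmoid")(x)  # probability of disease
model = keras.Model(inputs=inputs, outputs=outputs)
```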
So let's train this model. The training checklist: we have already done this part, a hidden layer of 16 neurons, and the sigmoid output. We need to use an appropriate loss function based on the type of output. What loss function should we use? What is the output here? It's a binary classification problem, so what should the loss function be?

I kind of heard it somewhere. Shout it out.

No, the output is a sigmoid. The loss function is...

>> Cross-entropy.

Okay. Remember: if you're predicting an arbitrary number, you can use something like mean squared error. If you're predicting a probability which has to be compared to a 0/1 output, which is what binary classification is all about, we use binary cross-entropy. Okay, so that's what we do here. We use binary cross-entropy, and then we'll go with Adam, right? And then we'll use early stopping to make sure we don't overfit.
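Wiring that checklist into the model above is one call. This is a sketch, with the loss and optimizer named as the lecture prescribes:

```python
# Loss matches the sigmoid / 0-1 output; Adam is the default flavor of SGD here.
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
```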
Okay, I know, I promise this is literally the last slide before I go to the Colab. I feel like one of those used-car salesmen: but wait, there's more!

So anyway, don't worry if you don't understand every detail of what I'm about to go through. I'm going to link to the Colab as soon as the class is over. But once you get your hands on the Colab, make sure you actually go through every line in it. What I typically do when I'm trying to learn something new is... I'll cut and paste, right? No, I won't do that. I won't cut and paste the code and run it; I will retype the code. If you retype the code as opposed to cutting and pasting, trust me, you'll learn a lot more. Right? So I strongly encourage you to do it that way.

Um, and for all the Colabs we're going to publish in this class, the first thing you should do is make your own copy of the notebook, right? Copy to Drive. And then, if you're using anything other than today's Colab, anything involving natural language processing or vision, you probably should use a GPU. So just go in here and choose the runtime to be a GPU. Um, and then you start your notebook and you're done. From the second time onwards, you can just go directly to this step; you don't have to repeat all this for that particular notebook. And there are numerous tutorials, like five-minute videos and so on, on how to use Colab. Just do that; I'm not going to spend time on it here.

All right. Okay. So, I just ran it a few hours ago. I'm not going to run every cell now, because it would take some time and get in the way of class time, but I'm going to go through it slowly and explain what's going on. Here, this is just an introduction to the data set. We already saw this introduction last week. We have, whatever, 303 patients, heart patients. We have a whole bunch of variables here: age, demographics, and a whole bunch of biomarker information. And this is the target variable, okay? Zero or one: heart disease, yes or no.

And so, by the way, just some technical preliminaries here. Basically, every time we load these things, we're actually going to load these packages. You can see here, these are the two key things we need to do: we import tensorflow first, and then from within tensorflow we import keras. Okay, that's what these two lines do. And then, folks who have done some data science and machine learning before will know this: we will also load the three packages that are most commonly used, which are numpy, pandas, and matplotlib. numpy, because it's very easy for manipulating matrices and arrays and tensors;
pandas because oftentimes you get data in from somewhere and you need to massage it and wrangle it to the point where you can actually feed it into Keras, so you need pandas for that; and matplotlib because you want to plot these loss curves and accuracy curves to see whether early stopping is needed. Okay, so that's why we use them. So we import all these things.
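(The import cell being described looks roughly like this; a sketch, since the actual notebook may differ in detail:)

```python
import tensorflow as tf          # the underlying framework
from tensorflow import keras     # the high-level model-building API

import numpy as np               # manipulating arrays, matrices, tensors
import pandas as pd              # loading and wrangling tabular data
import matplotlib.pyplot as plt  # plotting loss and accuracy curves
```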
And then I guess the other thing you have to remember is that when we are training these deep learning models, there is randomness in the process, and it enters in a few different places. So clearly the starting values for these weights: the weights are going to be randomly initialized, and that's obviously a source of randomness. Now, we talked about how, when you're doing stochastic gradient descent, you take all the data and then you randomly choose batches from it until we finish a whole pass through it. Well, that immediately raises the question: what do you mean by randomly choose? So typically what we do in practice, and Keras will take care of all this for you, is you basically take the data and just shuffle it once randomly, and then you go first 32, next 32, next 32, like that. Okay, but it is a source of randomness. And then when we split the data into train, validation, test and so on, particularly if you want to look for early stopping and overfitting, we need to split the data randomly again, and that's another source of randomness. And then when we do dropout, which we'll talk about on Wednesday, again, dropout has a little bit of a random element to it, so that's another source of randomness.

So all this means is that if you're working with these models and you want to build a model and hand it off to someone so that they can reproduce your results, well, you'd better make sure that you make it easy for them to replicate what you have. And the way you do it is by setting a random seed for all these things, okay? And the way you do that is by having this little handy function here, set random seed. And of course I use 42, just like everybody should, right? Okay, so that's that. By the way, that's just a pop-culture reference to this book called The Hitchhiker's Guide to the Galaxy.

>> Number 42, and you'll know what I mean.

Okay. So, by the way, the question inevitably comes up at this point: if we do exactly this, will you actually get the exact same numbers that I have in my version of the notebook? And the answer is: hopefully, most of the time, but it's not guaranteed. This is called bitwise reproducibility, and it's not guaranteed due to certain hardware things and device drivers and stuff like that, so we won't get into all that. Which is why, as you see here, I have a bit of a fingers-crossed thing. Okay, all right, cool. So that's what we have. Now, as it turns out, François Chollet, who wrote the textbook, actually made this data available as a CSV. So we read the CSV file into a pandas data frame right there. And it's 303 rows, 14 columns, right? And you can see here, we'll take a look at the first few rows. And these are all the columns: age, gender, cholesterol, blah blah blah. And then this is the target variable right there.
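(A sketch of those two steps, seeding and loading; the CSV location is an assumption, it's the copy used in the public Keras structured-data example, so substitute your own path if needed:)

```python
from tensorflow import keras
import pandas as pd

# One call seeds Python's, NumPy's, and TensorFlow's generators at once.
keras.utils.set_random_seed(42)

# Assumed location of the heart-disease CSV; not necessarily the lecture's copy.
csv_url = "http://storage.googleapis.com/download.tensorflow.org/data/heart.csv"
df = pd.read_csv(csv_url)

print(df.shape)  # (303, 14): 303 patients, 13 features plus the target
df.head()        # first few rows: age, sex, cholesterol, ..., target
```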
And one of the first things I always do when I'm working with a binary classification problem is to quickly check whether the positive and negative classes are balanced or not. And so what you can do is quickly check what percent of the data points are zero versus one. And you can see here: 72.6% of the patients don't have heart disease, which is a good thing, of course, and 27.4% have heart disease. So it's not bad. It's not 50/50, or roughly 50/50; it's a little skewed. So, by the way, quick question: what is a good baseline model for this problem? Suppose you couldn't use anything complicated. What's a good baseline model?

>> Yes. Just predict zero.

>> Yeah, and why would you do that?

>> It would give you a 72.6% accuracy.

>> Exactly. Because 72.6% is the class with the higher percentage: if you just predict it, you'll be right on those 72.6% of the cases and wrong on the rest, which means the accuracy of this model is going to be 72.6%. Okay. And so any fancy model we build had better do better than this, otherwise it's not worth its weight in layers. All right, so we'll come back to this later.
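(That check is a one-liner; a sketch, assuming the label column is named `target`:)

```python
# Fraction of each class in the label column.
df["target"].value_counts(normalize=True)
# Roughly:
# 0    0.726   <- no heart disease
# 1    0.274   <- heart disease
```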
So the first thing we want to do is preprocess it, because this dataset has both categorical variables and numeric variables. And so it's usually convenient to group them into two different groups. So I have listed all the categorical variables here and the numeric ones here. And then we have the preprocessing here: we have to take the categorical variables and one-hot encode them. And the reason is that unlike, say, a decision tree model, a neural network cannot handle categorical inputs directly; it can only handle numeric inputs. Which means that we have to numericalize every categorical thing that comes in. There are many ways to do it, but the standard way is one-hot encoding. And for the numeric variables, we need to normalize them, and I'll come to that in a second. So pandas has this get_dummies function here, and you can just run it and it'll one-hot encode the whole thing. So once you do that, this is what you have. You can see here that previously, let's say, thal had three values, fixed, normal, reversible, or something, and then you go to the one-hot encoded version, and now you can see thal_fixed, thal_normal, thal_reversible. That's three columns, right? That's one-hot encoding in action. Okay.
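(A sketch of that step; the exact column lists are assumptions based on the standard version of this dataset:)

```python
# Assumed categorical columns; the remaining columns stay numeric.
categorical_cols = ["sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal"]

# get_dummies replaces each categorical column with one 0/1 column per
# category, e.g. thal -> thal_fixed, thal_normal, thal_reversible.
df_encoded = pd.get_dummies(df, columns=categorical_cols)
```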
Now, the other thing to remember is that neural networks work best when the numeric inputs you send them are all in a relatively small range; they shouldn't have a wide range of variation. And so the standard practice is to standardize the numerical variables. By standardize, I mean typically subtract the mean and divide by the standard deviation. We should do that. But before we do so, we should split the data into a training set and a test set, right? And why do we want to split out a test set? Because at the very end, once we've built the model and done all the things we want to do with it, we finally want to take out the test set and evaluate on it once, so that we get a true measure of how the model is going to perform in the wild after you deploy it. Okay. So you want to divide it, say, 80% training set and 20% test set. So the question is: why should we do the splitting now, before we do the normalization? Why can't we just do the normalization and then do the splitting?

>> Because then your validation set is also somewhat dependent on your test set, as well as the mean of the test set.

>> Correct. Because the test set has then essentially leaked into the modeling process, right? The splitting, and also the standardization, are part of the modeling process, and if the standardization, which is part of that process, uses information about the test set, well, the test set isn't really kept away from anything, is it? That's why we want to split first: lock away the test set somewhere and then proceed with the modeling. Again, this is machine learning 101, which is why I'm going through it pretty fast. Okay, so we use this sampling function: take 20% of the data and make it the test set, and the remainder is going to be the training set. And when we do that, you can see the training set is now 242 rows while the test set is 61 rows. And for any of these data frames, you'll know that the shape attribute gives you the dimensions, the number of rows and columns; that's what we're doing here. And now that we have done the split, we can calculate the mean and the standard deviation. So I calculate the mean here, I calculate the standard deviation, and these are all the means. And once I do that, I just do each column minus the mean, divided by the standard deviation, and I save the results back into the train and test data frames. And you can see here that now all the numbers are smallish, around 0, 1, minus 1, around that range, and that's kind of ideal for network training. Okay, all right.
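(A sketch of the split-then-standardize flow just described, continuing the earlier sketch; the statistics are computed on training rows only, so the test set never leaks into preprocessing. The numeric column names are assumptions:)

```python
# Hold out 20% of the rows as the test set; the seed makes the draw repeatable.
test_df = df_encoded.sample(frac=0.2, random_state=42)
train_df = df_encoded.drop(test_df.index)
print(train_df.shape, test_df.shape)  # roughly (242, 30) and (61, 30)

# Standardize numeric columns using TRAINING statistics only.
numeric_cols = ["age", "trestbps", "chol", "thalach", "oldpeak"]  # assumed names
mean = train_df[numeric_cols].mean()
std = train_df[numeric_cols].std()
train_df[numeric_cols] = (train_df[numeric_cols] - mean) / std
test_df[numeric_cols] = (test_df[numeric_cols] - mean) / std
```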
So at this point the data is entirely numeric, and we are almost ready to feed it into Keras. And the way you do that is: you take a pandas data frame and convert it into a NumPy array, and then Keras is happy to receive it. So we use this method called to_numpy, which I think is about as descriptive as it gets in programming, and we save the results as train and test. Now train and test are two NumPy arrays with exactly the same information, and now we can feed them into Keras. All right. Now, there's one other thing we need to do, which is that in these data frames, train and test, our independent variables, all the features, as well as the target, the 0/1 target, are all in there together. And we need to take the dependent variable, the 0/1 column, split it out, and keep the X and the Y separately, right? That's the whole point, because you need to feed in the X, do the prediction, and then compare it to the actual Y and calculate the loss, and so on and so forth. So the target column is our Y variable, and it's column number six from the left; if you count, you can see it. So we just delete it from the train and test arrays. And now we have 242 rows and 29 columns, 29 features. You will recall that the network we made way back had 29 inputs, right? 29 nodes in the input layer; that's where the 29 is coming from. And so now we just select the sixth column, which is the target, and make it the Y variable, train Y and test Y. And that is of course a vector, which is 242 long in the training set and 61 long in the test set. So at this point, all we have done is, to be honest, boring preprocessing. Okay, we haven't actually gotten to the action yet.
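(A sketch of that conversion and the X/Y separation, continuing the earlier sketch; here the target is selected by name before conversion, rather than by counting to the sixth column, which is a little more robust:)

```python
# Separate the 0/1 label (y) from the features (X), then hand NumPy arrays to Keras.
train_y = train_df["target"].to_numpy()                  # shape (242,)
test_y = test_df["target"].to_numpy()                    # shape (61,)

train_x = train_df.drop(columns=["target"]).to_numpy()   # shape (242, 29)
test_x = test_df.drop(columns=["target"]).to_numpy()     # shape (61, 29)
```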
Finally, let's do something. So we start with a single hidden layer. Since it's a binary classification problem, we'll use sigmoids, as we saw earlier. And this is the model we created last class. Okay? The only difference between that model and this model is that I've actually given names to these layers. And this name thing is totally optional, right? If you want to give a name, give a name; it's just a little easier to interpret later on. It's just cosmetic. Okay? So I've just put it here. And once you build the model, you should immediately run the model summary command, because it gives you a nice overview of the model. For each layer, it tells you what the layer is, what's coming into the layer, meaning the shape of the tensor that's coming in, what's going out, and how many parameters the layer has. And it turns out this network has 497 parameters. Okay. And I have told you repeatedly, at least the first few times, to hand-calculate the number of parameters to make sure it checks out. So we should just make sure that it is in fact 497. So let's hand-calculate it. It's basically what's going on here: 29 inputs times 16, right, all the arrows, 29 × 16 arrows, and then you have a bias of another 16; that's why you have this expression. And then the next one is 16 × 1, plus one bias for the output sigmoid, and you get to 497. Okay? Just make sure you follow this later on when you work with the Colab.
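(A sketch of the model as described, 16 hidden sigmoid units and one sigmoid output, with the optional layer names, plus the hand count of the parameters:)

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(29,)),                               # 29 input features
    layers.Dense(16, activation="sigmoid", name="hidden"),  # hidden layer
    layers.Dense(1, activation="sigmoid", name="output"),   # P(heart disease)
])

model.summary()  # per-layer shapes and parameter counts

# Hand verification: (29 * 16 + 16) + (16 * 1 + 1) = 480 + 17 = 497.
assert 29 * 16 + 16 + 16 * 1 + 1 == 497
```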
We did this in class last week, and you can visualize the network graphically as well by using the plot_model function, so we do that here. And it gives you the same information, but in a slightly easier form to consume. And when we work with larger networks, starting on Wednesday, you will see that being able to visualize the topology of the network is actually quite handy. Okay, we finally come to actually trying to train this thing. So what loss function should we use? We need to use binary cross entropy, right there. What optimizer? Well, as I mentioned earlier, we'll use Adam. All right, Adam. And then the final thing is, you can ask Keras to report out whatever metrics you care about. These metrics are not going to be used in any optimization; it's just reporting for you. And the most common thing people report out for binary classification is accuracy, so we'll just go with that metric. And so what we do is we tell Keras: take the model we just built and compile it with this choice of optimizer, this choice of loss function, and these metrics. And this compilation step, what it does is this: Keras will take this information and the model you have built, and it'll reorganize the model in such a way that it becomes amenable to parallelization and to distributing the computation across many servers and so on. That's what's happening in the compile step, and that's why you actually have to do something called the compile step. Okay.
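(The compile step just described, as a sketch:)

```python
model.compile(
    optimizer="adam",             # the Adam flavor of SGD
    loss="binary_crossentropy",   # the discrepancy being minimized
    metrics=["accuracy"],         # reported only; never used for training
)
```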
And once we do that, we are finally ready to train the model. And to do that, we have to decide what batch size we're going to use. Remember, we're using some flavor of SGD, which means we have to choose the batch size. And typically, 32 is a good default: if you're just getting started with something, just use 32. There's a whole bunch of literature on what the right batch size should be for the number of data points you have, the size of the network, and so on and so forth. My philosophy is to start with 32, and you can always try 32, 64, 128. Oftentimes what researchers tell me is: just use the biggest batch size that doesn't make your machine die, right? If it fits into memory, it's probably good; just try the biggest size. We'll just start with 32; this is a tiny problem, it's not a big deal. Then we also have to decide how many epochs through the data we want to go, right? How many epochs? Usually 20 to 30 epochs is a good starting point. But because this is a tiny problem, just for kicks, I decided to run it for 300 epochs, just to see if any overfitting is going to happen. And then, do we want to use a validation set? Of course we want to use a validation set. So we will use 20% of the data points as a validation set, so that we can look for overfitting and underfitting.

All right. So with these decisions made, we finally use the model.fit command. model.fit is what actually trains the neural network. Okay. And you have to tell it what the X tensor is, and what the dependent-variable Y tensor is. We need to tell it how many epochs to run and what batch size to use. verbose=1 just means: print a lot of descriptive output as you do this. And validation_split means: take 20% of the training data and set it aside as your validation dataset; don't use it for training, because I want to measure overfitting with it. So that's it.
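(The fit call with the choices just made, as a sketch:)

```python
history = model.fit(
    train_x, train_y,
    epochs=300,
    batch_size=32,
    verbose=1,              # print per-epoch progress
    validation_split=0.2,   # hold out 20% of training data for end-of-epoch evaluation
)
```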
So you run that, it'll run for 300 epochs, and this is the reason I decided not to actually run it in class. And so it keeps going, gives you a lot of output, and finally we reach the end.

Okay. Now let's take a moment to understand what's being reported. So I'll just take this one line here. There is a pair of lines for each epoch. And here it's telling you that in this 300th epoch it used seven batches, seven out of seven, right? And you will recall from the math we did in class that it's actually seven batches where the first six batches are 32 examples each and the last batch is just a couple of examples; but we have seven batches, right? This is 193 divided by 32, rounded up. Okay, so that's why we have seven here. And then it tells you how long that took, and then this is the loss value, the binary cross entropy loss value on the training set, on that particular batch, that it calculated. This is the accuracy that you asked it to report: 98.4%, 98.5% accuracy on that batch. And then at the end of this epoch, using whatever weights were available in the network at that point, it actually calculates the loss on the validation set, which is the 20% of the data we set aside, and then this is the accuracy on that validation set. Okay, so that's what each of these numbers means. Now, looking at this wall of numbers is kind of painful, so usually you just plot it. And the way you do that: if you notice here, okay, I'm not going to go back, I said history = model.fit(...), and that history object has a lot of information that we can use for plotting and diagnostics and so on.
And that history object has an attribute also called history, history.history, which is a dictionary with all these values, and that's what we're going to plot. Was there a question here? Yeah.

>> So you told it to keep aside 20% for validation, but didn't we already keep a test set? So that's going to be a secondary validation, right?

>> So basically we have a training set, and then a validation set, and a test set. The role of the validation set is to figure out things like early stopping: should we stop here, should we go back? And as you will see later on, when we have hyperparameters, we'll try different values of the hyperparameters and use the validation set to figure out which one is best. But once we are done with all that, we will finally have a model. At that point, we open the safe, take out the test set, and use it just once, with your final, final model. Not because you want to improve the model, but because you want a realistic idea of how it'll do when you actually deploy it out in the real world.

>> Yeah.

>> Instead of accuracy, could we use other metrics for this, like a confusion matrix, let's say?

>> Yeah, you can do whatever you want.
You can use... like I said, it's not used for training, so there's no mathematical implication in what you choose, right? You can choose error rate, accuracy, F1, F-beta; you can do whatever you want. And Keras, as you will see, has a dizzying list of possible metrics you can use for reporting. The key thing to remember is that you're just reporting these metrics; you're not actually using them for any training. Yeah.

>> My question is with respect to validation. We've got a training dataset, so when we take out 20% for validation, are we taking it out from the training set at that level, or do we go to each batch and take out 20%?

>> No, we're taking it out from the training set.

>> So it means the number of data points available for forming batches will reduce.

>> Correct. And in [snorts] fact, once we take out the validation set, whatever remains is 193.

>> Okay. And then we divide that into batches.

>> Right. Once you take out the validation set at the very beginning, you keep it aside, and then you only evaluate, at the end of each epoch, what your loss and accuracy are on that validation set.

>> So you don't have cross-validation?

>> No, no, we're not doing any of that stuff. We're just taking it out once, and we're just evaluating at the end of every epoch.

>> Okay. So, yeah. Okay. So I know we both asked similar questions, but just to reconfirm: here my training model is giving me, say, a loss of 0.086, and my validation is giving me 0.66. That means I've already crossed the U.
So when I have to actually test the model, it's the model at that midpoint which I take, and that's the model which will get deployed in production.

>> Correct. And as to, okay, what do we do to get that model: do we actually have to go back to the beginning and run it for a few epochs, or can we do something smarter than that? We'll get to that. Yeah.

>> Is the validation set different for each epoch, or is it the same?

>> It's the same. So what you do is: you have a training set, and before you do any training, you take out 20% of it and keep it aside. Whatever is left over, you divide into mini-batches and then start running it through each epoch. But at the end of each epoch, you evaluate the quality of the resulting model using the validation set.

>> What's different between each epoch? Is it just the way...

>> The weights have changed.

>> ...the division into the different batches?

>> No. The difference in each epoch is that the weights have changed. After every mini-batch, the weights have changed. At the end of one epoch, you've gone through all the data points you ever had in the training set, and then you come back to the beginning and you do it again.

>> How do you identify the sweet spot?

>> It's coming. Yeah. All right. So I'm going to keep going. So we have this here. And there's a little bit of matplotlib code: what we do is we just plot the training loss and the validation loss as a function of the number of epochs. Okay?
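(That plotting code is roughly the following sketch:)

```python
import matplotlib.pyplot as plt

# history.history holds per-epoch lists: loss, accuracy, val_loss, val_accuracy.
plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("binary cross entropy")
plt.legend()
plt.show()
```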
And as you can see here, the training loss is these points here, and it's steadily going down, as you would expect. The validation loss goes down here, and then at some point it kind of flattens out, and then maybe gently starts to rise. Okay. So do you think there's overfitting? Right, there seems to be some level of overfitting here. But the thing you always have to remember is that the binary cross entropy loss is a loss function that is convenient for you, partly because it captures the thing you want to capture, the discrepancy, but also because it's mathematically convenient. What you may actually care about in practice is something like accuracy, right? That's why you're reporting out the accuracy when we do these things. So you should also plot the accuracy to see what's going on, and really you should look at the accuracy to figure out overfitting and underfitting and all that stuff. So let's just do that. So I have it here. Okay, so this is how it looks for accuracy. Accuracy, of course, as you do more and more epochs, hopefully gets better and better on the training set. So you can see here, training accuracy actually climbs all the way up to the low-to-mid 90s right there. The validation accuracy gets to this point after maybe 50 epochs, and then it kind of flattens out, and then, strangely, it climbs up again a bit later, right? So now, the fact that the accuracy actually got better at the very end suggests that maybe we can live with this overfitting.

>> Okay.

>> Right? It's not the end of the world. So what you can certainly do is go back and say: you know what, no, I'm going to be a purist about this; around 50 epochs or so, I think, is when it actually flattened out for loss. So you can just go back, restart the model, run it for only 50 epochs, not 300, then stop, and use that model for everything from that point on. Or you can say: you know what, it's okay, I can live with this thing.
And that's what we're going to do here. Let me just stop for a second; there was a question. Yeah?

>> Originally, when we were starting out, we were saying 20 to 30 epochs, but we were going to do 300, and 50 is over 20 to 30. So when it comes to validation, if you run enough epochs, are you doing, like, derivative calculations?

>> Oh, I see. No, that's a great question. So the question is: I said start with 20 to 30 epochs as a rule of thumb, and here I'm just going with 300. And because I'm going with 300, I can actually see some potential evidence of overfitting; if I had done only 20 to 30, maybe I wouldn't have even seen that. What happens next, right? Is that the question? Great question. So what you should do is, when you look at these curves, if at the end of 30 epochs you find that the validation loss continues to drop, then you know maybe there is more room for it to drop, so you continue from that point on. The thing about Keras is that you can actually run the fit command again at that point, and it'll continue where it left off; it won't go back to the beginning. Right? So you run 10: okay, the validation is still getting better and better. Run another 10: getting better and better. Run another 10: getting better and better. Run another 10: oh, it starts to climb up again. Okay, now I'm going to back off. That's what you do.

All right. Now, all this manual stuff I'm going through just to build intuition. There are these things called callbacks in Keras, which we'll get to later on, with which you can actually tell it: hey, when the validation loss stops improving, stop everything; or, when it stops improving, save that model for me somewhere. So you don't have to go back and rerun everything; it'll just have saved it for you, and you can just pick it up and use it.
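(A sketch of the callbacks being alluded to; both are standard Keras callbacks, though the exact arguments here are illustrative:)

```python
from tensorflow import keras

callbacks = [
    # Stop once validation loss has not improved for 10 epochs,
    # and roll back to the best weights seen so far.
    keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=10, restore_best_weights=True
    ),
    # Independently, keep the best model saved to disk as training runs.
    keras.callbacks.ModelCheckpoint(
        "best_model.keras", monitor="val_loss", save_best_only=True
    ),
]

history = model.fit(
    train_x, train_y,
    epochs=300, batch_size=32,
    validation_split=0.2,
    callbacks=callbacks,
)
```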
Yeah?

>> What's the intuition behind the accuracy continuing to improve when the loss is getting higher?

>> Because accuracy and loss are related, but they're not the same thing. It's a really good question, also kind of a profound question. Accuracy is a very discrete measure, right? So if for a particular point we predict its probability to be, say, 0.49, we're going to say: okay, that's a zero, no heart disease. But if it goes to 0.51, we're going to say: oh, that's heart disease. So when you go from 0.49 to 0.51, the binary cross entropy loss changes very, very slightly, but the accuracy goes from zero to one, a dramatic jump. So it's very jumpy and discrete, and that's why it tends to be a proxy, but a crude proxy, for loss. That's part of the reason, and I can talk more offline. Okay. So, yeah.

>> You mentioned that if you were a purist, you could stop at 50, in this case rerun it and stop there. I was wondering: if you could see the history of the model, could you take the weights at epoch 50 and put them into your model, and would it be roughly the same, or would there be certain differences?

>> You could try it. Yeah, you should just try it, because what happens is that ultimately what we care about is how it performs on the validation set, right? Here it appears to perform better on the validation set if you stop at 50, but only for the loss; for accuracy, actually, if you wait till the very end, it gets better. So my thrust tends to be: what is the measure that's closest to the real-world deployment? It's accuracy. So I tend to go with accuracy.
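(A tiny worked example of the 0.49-versus-0.51 point, assuming the true label is 1: the loss barely moves while the accuracy on that point flips from 0 to 1.)

```python
import math

def bce(y_true, p):
    """Binary cross entropy for one example with predicted probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# True label 1 (heart disease); the predicted class flips at threshold 0.5.
print(bce(1, 0.49))  # ~0.713, predicted class 0 -> wrong (accuracy 0 here)
print(bce(1, 0.51))  # ~0.673, predicted class 1 -> right (accuracy 1 here)
```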
Binary cross entropy is a beautiful proxy, but an imperfect proxy, for the thing we actually care about in the real world, which is error rate and accuracy. That's why I tend to plot both, and if accuracy is telling me one thing, I kind of tend to believe it.

All right, so that's what we have. So once we do all this, we have a model, and now we want to evaluate it to see: okay, if you actually deployed it, how good is it going to be? So you use this thing called the model evaluate function. You take the evaluate function, and now we use the test X and test Y dataset, which we split off at the very, very beginning and never used from that point on. We run it, and when I ran it last night, it came up with an 83.6% accuracy for the model. And remember, our baseline model, which just predicts that everybody is a zero, is going to have a 72.6% accuracy, and this little neural network gives you 83.6%, which is pretty good, right? It's beating the baseline model, which is nice.
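(The final evaluation, as a sketch; the test set is used exactly once, at the very end:)

```python
# evaluate returns the loss plus each compiled metric, here accuracy.
test_loss, test_acc = model.evaluate(test_x, test_y)
print(f"test accuracy: {test_acc:.1%}")  # ~83.6% in the lecture's run,
                                         # vs. the 72.6% all-zeros baseline
```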
And I guess there is something to say here about the fact that we did a bunch of preprocessing outside Keras and then sent the results into Keras. You can actually do all this preprocessing inside Keras automatically; there are layers for that, and I have linked to a bunch of material here. So that's it as far as this model is concerned. I know we went through it really fast, but please go through it afterwards and make sure you understand every single line. Change each of these lines, rerun it, see how the output changes; that's how we build some intuition. Okay. All right: computer vision.

>> Just one question before we move on: is there a way to build a model that has fewer false positives, or fewer false negatives?

>> Oh yeah, you can do that. You can report on all those things very easily, but there are also more complex loss functions which will take the asymmetry between false positives and false negatives into account. So the short answer is: it's possible, yeah.

All right. So first, let's just talk about how you represent an image digitally. Okay. And so this is how grayscale images, black and white images, are represented. The basic idea is very simple. In every picture you have, every location in that picture is a pixel, and the pixel basically has a light intensity, the amount of light at that location. And that light level is measured from zero, no light, to blinding white light, which is 255. And so, for all the numbers here, if you take this five, for example, you can see a lot of no light, like all the black regions: those are all zeros. Okay? And then wherever there is white light, there's a number, and the greater the amount of light, the closer it gets to 255. In fact, if you just step back and squint at this, you can actually see the five. Okay? So that's it. That's how a black and white image is represented. Very simple. Okay. Now, yeah?
Now, yeah? 1662 00:59:42,239 --> 00:59:45,838 [microphone] 1663 00:59:43,838 --> 00:59:47,679 >> Just, when you say amount of light, what's 1664 00:59:45,838 --> 00:59:48,239 the unit that's being measured? Like, what 1665 00:59:47,679 --> 00:59:51,039 do you mean? 1666 00:59:48,239 --> 00:59:54,639 >> So here, basically, what we have is: 1667 00:59:51,039 --> 00:59:56,318 when you 1668 00:59:54,639 --> 00:59:58,239 take an analog 1669 00:59:56,318 --> 00:59:59,440 picture, there's a process by 1670 00:59:58,239 --> 01:00:02,000 which you take that analog picture and 1671 00:59:59,440 --> 01:00:04,559 read it in, and it gets mapped to a scale 1672 01:00:02,000 --> 01:00:05,599 between 0 and 255. That's it. 1673 01:00:04,559 --> 01:00:07,119 So you can think of it as like a 1674 01:00:05,599 --> 01:00:10,559 relative scale, a normalized scale, 1675 01:00:07,119 --> 01:00:12,240 between 0 and 255, and so it just 1676 01:00:10,559 --> 01:00:14,720 roughly maps to the amount of light in that 1677 01:00:12,239 --> 01:00:16,318 location. The exact lumens-to-number 1678 01:00:14,719 --> 01:00:18,159 mapping, I don't know how they do 1679 01:00:16,318 --> 01:00:20,798 it; my guess is there are a number of 1680 01:00:18,159 --> 01:00:22,318 variations on that. But for our 1681 01:00:20,798 --> 01:00:24,079 purposes, just think of it as a 1682 01:00:22,318 --> 01:00:26,318 normalized scale which runs from 0 to 1683 01:00:24,079 --> 01:00:28,880 255. 1684 01:00:26,318 --> 01:00:30,798 All right. So that's what's happening: 1685 01:00:28,880 --> 01:00:34,318 every pixel is a 1686 01:00:30,798 --> 01:00:37,119 number between 0 and 255. Okay. So 1687 01:00:34,318 --> 01:00:38,880 if you have a color image, each pixel of 1688 01:00:37,119 --> 01:00:42,400 a color image is represented by three 1689 01:00:38,880 --> 01:00:44,480 numbers. And these numbers measure the 1690 01:00:42,400 --> 01:00:46,480 intensity of red light, blue light, and 1691 01:00:44,480 --> 01:00:47,599 green light, because red, blue, and green, 1692 01:00:46,480 --> 01:00:50,480 if you mix them in the right proportion, 1693 01:00:47,599 --> 01:00:52,559 you can get whatever you want. Okay. So 1694 01:00:50,480 --> 01:00:54,719 each light intensity is still a 1695 01:00:52,559 --> 01:00:56,480 number between 0 and 255, and that's what 1696 01:00:54,719 --> 01:00:58,078 you have. Which means that now you have 1697 01:00:56,480 --> 01:01:00,079 three tables of numbers instead of one 1698 01:00:58,079 --> 01:01:02,240 table of numbers. And by the way, just 1699 01:01:00,079 --> 01:01:05,440 some lingo here: in the deep learning 1700 01:01:02,239 --> 01:01:06,959 world, these colors, RGB, red, green, 1701 01:01:05,440 --> 01:01:10,318 blue, are sometimes referred to as 1702 01:01:06,960 --> 01:01:11,358 channels. Okay. All right. So this is 1703 01:01:10,318 --> 01:01:13,599 what we have here. This is a picture of 1704 01:01:11,358 --> 01:01:16,159 Killian Court, and then if you take that 1705 01:01:13,599 --> 01:01:18,960 little patch here: the red table, the 1706 01:01:16,159 --> 01:01:21,039 green table, and the blue table. So for 1707 01:01:18,960 --> 01:01:23,760 this picture, these three tables form a 1708 01:01:21,039 --> 01:01:26,159 tensor of rank what? 1709 01:01:23,760 --> 01:01:30,520 Good. 1710 01:01:26,159 --> 01:01:30,519 All right. Any questions on this?
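[Editor's note: a short sketch of the color representation just described, assuming NumPy; the image dimensions are arbitrary.]

    import numpy as np

    # A color image carries three channels (red, green, blue) per pixel,
    # so it is a rank-3 tensor: height x width x 3.
    height, width = 480, 640
    img = np.zeros((height, width, 3), dtype=np.uint8)

    img[..., 0] = 255      # max out the red channel: a solid red image
    print(img.shape)       # (480, 640, 3)
    print(img[0, 0])       # one pixel = three numbers, each 0..255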
1711 01:01:33,920 --> 01:01:37,599 So the key task in computer vision, 1712 01:01:35,838 --> 01:01:40,239 the most important thing, is 1713 01:01:37,599 --> 01:01:42,160 image classification, right? The most 1714 01:01:40,239 --> 01:01:43,679 basic task, if you will, when you're 1715 01:01:42,159 --> 01:01:45,358 working with images: you have an 1716 01:01:43,679 --> 01:01:46,719 image, and you 1717 01:01:45,358 --> 01:01:48,078 take the image and figure out, okay, you 1718 01:01:46,719 --> 01:01:49,519 have a list of possible objects the 1719 01:01:48,079 --> 01:01:51,039 image could contain, and you're figuring 1720 01:01:49,519 --> 01:01:53,280 out which of these possible objects 1721 01:01:51,039 --> 01:01:54,960 exists in that image, right? The dog-cat 1722 01:01:53,280 --> 01:01:57,760 classification is like the canonical 1723 01:01:54,960 --> 01:01:59,599 example, right, that we all know and love, 1724 01:01:57,760 --> 01:02:01,280 and that's what we will solve 1725 01:01:59,599 --> 01:02:02,720 later today and on Wednesday. But there 1726 01:02:01,280 --> 01:02:05,680 are many other tasks that you need to 1727 01:02:02,719 --> 01:02:07,358 be aware of. So, when you not 1728 01:02:05,679 --> 01:02:10,318 just classify an image, but you also 1729 01:02:07,358 --> 01:02:11,519 localize where in the image it is, 1730 01:02:10,318 --> 01:02:13,039 right? It's not enough to say 1731 01:02:11,519 --> 01:02:14,639 sheep; you want to figure out where 1732 01:02:13,039 --> 01:02:16,159 the sheep is, right? That's called 1733 01:02:14,639 --> 01:02:18,239 localization. And the way you do 1734 01:02:16,159 --> 01:02:21,118 localization is you put this little box 1735 01:02:18,239 --> 01:02:23,358 around it. And then you output not just 1736 01:02:21,119 --> 01:02:26,000 whether it's a, you know, sheep, yes or 1737 01:02:23,358 --> 01:02:28,159 no, but the coordinates of this box, the 1738 01:02:26,000 --> 01:02:29,760 top left and the bottom right, for 1739 01:02:28,159 --> 01:02:31,598 example. If you output the coordinates, you 1740 01:02:29,760 --> 01:02:33,599 can actually draw a box around it. So 1741 01:02:31,599 --> 01:02:36,079 you output the numbers, the 1742 01:02:33,599 --> 01:02:39,760 coordinates of where this box is in the 1743 01:02:36,079 --> 01:02:42,720 picture. Okay, this is called localization. 1744 01:02:39,760 --> 01:02:45,040 Then there is object detection, where you 1745 01:02:42,719 --> 01:02:47,039 may have lots of objects going on, and 1746 01:02:45,039 --> 01:02:49,759 you want to pick up every one of them, 1747 01:02:47,039 --> 01:02:51,679 and you want to localize it. 1748 01:02:49,760 --> 01:02:53,359 Okay, this is object detection. So here 1749 01:02:51,679 --> 01:02:55,679 we have gone in there and said, okay, 1750 01:02:53,358 --> 01:02:57,519 sheep one, sheep two, sheep three, and 1751 01:02:55,679 --> 01:02:59,598 each of these sheep has a little box 1752 01:02:57,519 --> 01:03:01,440 around it. Okay. 1753 01:02:59,599 --> 01:03:04,000 >> By the way, you know, self-driving 1754 01:03:01,440 --> 01:03:05,358 cars: the camera vision system is 1755 01:03:04,000 --> 01:03:06,960 constantly scanning what's coming in 1756 01:03:05,358 --> 01:03:08,400 through the cameras and doing object 1757 01:03:06,960 --> 01:03:09,039 detection constantly, many times a 1758 01:03:08,400 --> 01:03:09,680 second, 1759 01:03:09,039 --> 01:03:11,599 right? 1760 01:03:09,679 --> 01:03:13,838 Pedestrian box, you know, zebra crossing 1761 01:03:11,599 --> 01:03:16,240 box, doggy box, stroller box, and so on 1762 01:03:13,838 --> 01:03:17,358 and so forth.
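[Editor's note: one common way to write down the classification-plus-localization target described above: the class label plus the box corners. The exact format and the numbers here are illustrative only; real detection systems use many variants.]

    # A localization target: what the object is, plus where its box sits,
    # as (x1, y1) top-left and (x2, y2) bottom-right pixel coordinates.
    target = {"label": "sheep", "box": (42, 18, 131, 97)}

    x1, y1, x2, y2 = target["box"]
    print(f"{target['label']} inside a {x2 - x1} x {y2 - y1} pixel box")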
1763 01:03:16,239 --> 01:03:20,479 And then we have this thing called 1764 01:03:17,358 --> 01:03:22,960 semantic segmentation, where we take 1765 01:03:20,480 --> 01:03:24,880 every pixel in the picture and classify 1766 01:03:22,960 --> 01:03:26,159 every pixel. We are not classifying the 1767 01:03:24,880 --> 01:03:28,880 whole picture; we're classifying every 1768 01:03:26,159 --> 01:03:32,318 pixel. So we are saying, okay, all these 1769 01:03:28,880 --> 01:03:34,798 gray pixels are road, all these pixels are 1770 01:03:32,318 --> 01:03:37,838 sheep, and all these pixels are grass. 1771 01:03:34,798 --> 01:03:39,838 Every pixel is being classified. 1772 01:03:37,838 --> 01:03:42,159 So we are taking an image, and instead of 1773 01:03:39,838 --> 01:03:43,920 giving one classification for the whole image, for every 1774 01:03:42,159 --> 01:03:47,558 pixel we are solving a multiclass 1775 01:03:43,920 --> 01:03:47,559 classification problem. 1776 01:03:48,318 --> 01:03:51,199 Okay, every pixel is classified. And 1777 01:03:49,920 --> 01:03:53,280 just when you think it can't get more 1778 01:03:51,199 --> 01:03:54,480 complicated than this, 1779 01:03:53,280 --> 01:03:56,880 we have something called instance 1780 01:03:54,480 --> 01:03:58,559 segmentation, where not only are we 1781 01:03:56,880 --> 01:03:59,838 classifying every pixel, we are 1782 01:03:58,559 --> 01:04:01,920 distinguishing between the different 1783 01:03:59,838 --> 01:04:04,318 sheep. 1784 01:04:01,920 --> 01:04:06,400 So every pixel is classified, and 1785 01:04:04,318 --> 01:04:09,960 different instances of the same category 1786 01:04:06,400 --> 01:04:09,960 need to be identified. 1787 01:04:10,480 --> 01:04:14,880 Okay. So these are some of the most 1788 01:04:12,318 --> 01:04:16,798 popular, most 1789 01:04:14,880 --> 01:04:18,960 prevalent, and most 1790 01:04:16,798 --> 01:04:20,880 useful categories of image 1791 01:04:18,960 --> 01:04:23,920 processing problems that are amenable to 1792 01:04:20,880 --> 01:04:25,440 a deep learning approach. 1793 01:04:23,920 --> 01:04:27,200 All right. So let's go to image 1794 01:04:25,440 --> 01:04:28,559 classification, and we're going to work 1795 01:04:27,199 --> 01:04:32,598 with this data set called Fashion- 1796 01:04:28,559 --> 01:04:32,599 MNIST. Um, 1797 01:04:33,039 --> 01:04:38,400 so the idea here is that you have 1798 01:04:35,358 --> 01:04:40,960 70,000 images of clothing items across 1799 01:04:38,400 --> 01:04:43,119 10 categories, you know, like boots and 1800 01:04:40,960 --> 01:04:45,760 sweaters and t-shirts, and you get the 1801 01:04:43,119 --> 01:04:48,559 idea: 10 categories of clothing. We 1802 01:04:45,760 --> 01:04:50,559 have 70,000 images like this, and then 1803 01:04:48,559 --> 01:04:52,559 we'll build a network from scratch to 1804 01:04:50,559 --> 01:04:54,559 classify all these things, you know, 1805 01:04:52,559 --> 01:04:55,920 with pretty high accuracy. These 1806 01:04:54,559 --> 01:04:58,000 classes, by the way, you know, this is a 1807 01:04:55,920 --> 01:04:59,838 very balanced data set. So 10% of the 1808 01:04:58,000 --> 01:05:01,920 data is, you know, sweaters, 10% is boots, 1809 01:04:59,838 --> 01:05:03,519 and so on and so forth.
So a naive 1810 01:05:01,920 --> 01:05:06,519 baseline model would give you what 1811 01:05:03,519 --> 01:05:06,519 accuracy? 1812 01:05:07,679 --> 01:05:12,078 10%. Exactly. So we need to build 1813 01:05:10,559 --> 01:05:13,440 something that's better than 10%, and I'm 1814 01:05:12,079 --> 01:05:14,559 glad to report that a simple neural 1815 01:05:13,440 --> 01:05:17,559 network can actually get you close to 1816 01:05:14,559 --> 01:05:17,559 90%. 1817 01:05:18,559 --> 01:05:24,798 Right? So this is the simple network 1818 01:05:21,838 --> 01:05:28,400 that we have. The input in this case is 1819 01:05:24,798 --> 01:05:33,358 a 28 x 28 picture. 1820 01:05:28,400 --> 01:05:36,720 It's a 28 x 28 picture. And 1821 01:05:33,358 --> 01:05:38,318 so far we have been feeding vectors into 1822 01:05:36,719 --> 01:05:40,239 our neural network. Now we have a 1823 01:05:38,318 --> 01:05:43,759 picture which is 28 by 28. It's a tensor 1824 01:05:40,239 --> 01:05:45,919 of rank two, right? It's a table of 1825 01:05:43,760 --> 01:05:49,160 numbers. What do we do? How do we feed 1826 01:05:45,920 --> 01:05:49,159 that in? 1827 01:05:51,199 --> 01:05:54,960 No, each image is a table 1828 01:05:53,599 --> 01:05:57,519 of numbers. Let's just take a single 1829 01:05:54,960 --> 01:05:59,280 image. 1830 01:05:57,519 --> 01:06:01,679 Like, what do we do? What do we 1831 01:05:59,280 --> 01:06:04,079 do with this table? 1832 01:06:01,679 --> 01:06:06,399 Convert it into a vector. Exactly. And 1833 01:06:04,079 --> 01:06:08,079 that's called flattening. So we take 1834 01:06:06,400 --> 01:06:11,440 this table of numbers and we flatten it 1835 01:06:08,079 --> 01:06:13,599 into a vector. And so what we do is, 1836 01:06:11,440 --> 01:06:17,760 let me just... 1837 01:06:13,599 --> 01:06:20,240 Okay. So we have 1838 01:06:17,760 --> 01:06:22,400 28 by 28. 1839 01:06:20,239 --> 01:06:25,598 So what we can do is we can take each 1840 01:06:22,400 --> 01:06:27,838 row, right, take this row, and then write 1841 01:06:25,599 --> 01:06:32,599 it like that. 1842 01:06:27,838 --> 01:06:32,599 We take the second row, oops, 1843 01:06:33,440 --> 01:06:36,639 write it like that. 1844 01:06:38,079 --> 01:06:43,599 The third row is here, 1845 01:06:41,440 --> 01:06:45,358 like that. You get the idea. So you take 1846 01:06:43,599 --> 01:06:47,039 each row, just rotate it and stack it all 1847 01:06:45,358 --> 01:06:49,119 up, right? And string them up. It 1848 01:06:47,039 --> 01:06:51,760 becomes one long vector. So this is called 1849 01:06:49,119 --> 01:06:52,960 flattening. Okay? So that's how you take 1850 01:06:51,760 --> 01:06:55,960 this thing and make it into one long 1851 01:06:52,960 --> 01:06:55,960 vector. 1852 01:06:56,400 --> 01:07:03,400 So when you do that, 28 by 28 is what, is 1853 01:07:00,159 --> 01:07:03,399 it? 1854 01:07:03,599 --> 01:07:09,440 784. So we get a vector. 1855 01:07:07,440 --> 01:07:11,119 This is the flattened input, and you get 1856 01:07:09,440 --> 01:07:15,039 784. 1857 01:07:11,119 --> 01:07:17,358 It's a vector that's 784 long. 1858 01:07:15,039 --> 01:07:18,799 Okay. After the flattening, we have not 1859 01:07:17,358 --> 01:07:19,920 done anything complicated yet. We have 1860 01:07:18,798 --> 01:07:21,679 literally taken the numbers and just 1861 01:07:19,920 --> 01:07:24,318 reorganized them in a different way. 1862 01:07:21,679 --> 01:07:26,000 Okay.
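[Editor's note: the flattening just described, as a NumPy sketch; the stand-in image is synthetic.]

    import numpy as np

    img = np.arange(28 * 28).reshape(28, 28)   # stand-in for one 28 x 28 image

    flat = img.reshape(-1)                     # rows strung end to end
    print(flat.shape)                          # (784,)

    # Inside a Keras model, a layer does the same reorganization:
    # keras.layers.Flatten()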
And once we do that, now we are 1863 01:07:24,318 --> 01:07:27,759 back in our familiar neural network 1864 01:07:26,000 --> 01:07:29,760 territory, right? We know how to work 1865 01:07:27,760 --> 01:07:33,760 with vectors. So we just need to pass 1866 01:07:29,760 --> 01:07:35,520 it through a hidden layer, right? And 1867 01:07:33,760 --> 01:07:37,599 for this hidden layer, we're going to use ReLU 1868 01:07:35,519 --> 01:07:39,119 neurons. And I tried a few different 1869 01:07:37,599 --> 01:07:41,680 values, and it turns out that 256 1870 01:07:39,119 --> 01:07:43,680 neurons does a really good job. 1871 01:07:41,679 --> 01:07:46,480 Okay? And so I'm going to use 256 1872 01:07:43,679 --> 01:07:48,000 neurons here. And then we need to 1873 01:07:46,480 --> 01:07:51,199 think about what the output layer should 1874 01:07:48,000 --> 01:07:54,159 be. Now we run into a problem, 1875 01:07:51,199 --> 01:07:55,759 because the output layer before, as we saw 1876 01:07:54,159 --> 01:07:58,239 in the heart disease example, is just 1877 01:07:55,760 --> 01:08:01,039 zero or one. Right? Here there are 10 1878 01:07:58,239 --> 01:08:02,879 possible outputs. It could be a, you know, 1879 01:08:01,039 --> 01:08:04,799 boot, a sweater, a shirt, and so on and so 1880 01:08:02,880 --> 01:08:06,798 forth: 10 possible categories. So we 1881 01:08:04,798 --> 01:08:09,199 need some way to handle something with 1882 01:08:06,798 --> 01:08:12,960 many more than, you know, one binary 1883 01:08:09,199 --> 01:08:15,038 output: many possible outputs. So, the way 1884 01:08:12,960 --> 01:08:16,880 we do that, 1885 01:08:15,039 --> 01:08:20,079 and by the way, pay attention to this, 1886 01:08:16,880 --> 01:08:24,000 because this is actually how GPT-4 works. 1887 01:08:20,079 --> 01:08:26,880 Okay. So what we do is, here's what we 1888 01:08:24,000 --> 01:08:28,640 have. We know how to output 10 numbers, 1889 01:08:26,880 --> 01:08:30,000 right? If you want to output 10 numbers, 1890 01:08:28,640 --> 01:08:31,440 no problem. We 1891 01:08:30,000 --> 01:08:33,600 can easily output 10 numbers by just 1892 01:08:31,439 --> 01:08:36,559 using a linear activation. We also know 1893 01:08:33,600 --> 01:08:37,838 how to output 10 probabilities, 1894 01:08:36,560 --> 01:08:40,560 right? Each one just needs to be a 1895 01:08:37,838 --> 01:08:44,079 sigmoid. But here we can't use 10 1896 01:08:40,560 --> 01:08:47,839 sigmoids as the output. Why is that? 1897 01:08:44,079 --> 01:08:50,000 Why can't we use 10 sigmoids? 1898 01:08:47,838 --> 01:08:52,798 >> Because the probabilities have to add up to one. 1899 01:08:50,000 --> 01:08:54,640 >> Right. So here, when the output comes, we 1900 01:08:52,798 --> 01:08:56,238 need to figure out, okay, is it a boot, a 1901 01:08:54,640 --> 01:08:59,199 sweater, a shirt, and so on and so forth. 1902 01:08:56,238 --> 01:09:00,479 There's only one right answer. Okay, 1903 01:08:59,198 --> 01:09:01,838 which means that we need to 1904 01:09:00,479 --> 01:09:03,519 figure out which of these 10 is the 1905 01:09:01,838 --> 01:09:05,439 right answer, which means that we need to 1906 01:09:03,520 --> 01:09:07,520 produce probabilities, but they have to 1907 01:09:05,439 --> 01:09:09,599 add up to one, because only one of them 1908 01:09:07,520 --> 01:09:10,719 can be true. 1909 01:09:09,600 --> 01:09:12,159 So that's the key thing. They have to 1910 01:09:10,719 --> 01:09:13,279 add up to one. That's the wrinkle.
If 1911 01:09:12,158 --> 01:09:16,000 not for that, we could just use 10 1912 01:09:13,279 --> 01:09:17,600 sigmoids, right? And the way we handle that 1913 01:09:16,000 --> 01:09:20,079 is by using something called the 1914 01:09:17,600 --> 01:09:22,319 softmax function, or the softmax layer. 1915 01:09:20,079 --> 01:09:25,198 And the idea is actually very simple. We 1916 01:09:22,319 --> 01:09:27,759 have these 10 outputs in the very final 1917 01:09:25,198 --> 01:09:29,759 layer, which is just linear activations. 1918 01:09:27,759 --> 01:09:32,719 And then we take each one of these 1919 01:09:29,759 --> 01:09:34,719 numbers, run it through the 1920 01:09:32,719 --> 01:09:37,279 exponential function, and then divide by 1921 01:09:34,719 --> 01:09:39,279 the total. So when you do that, two 1922 01:09:37,279 --> 01:09:40,560 things happen. The first one is, when you 1923 01:09:39,279 --> 01:09:43,359 take these numbers and run them through, 1924 01:09:40,560 --> 01:09:45,920 say you take a1 and do e raised to a1, 1925 01:09:43,359 --> 01:09:47,039 you now get a positive number. 1926 01:09:45,920 --> 01:09:48,640 And now you have a positive number 1927 01:09:47,039 --> 01:09:50,319 divided by the sum of a bunch of positive 1928 01:09:48,640 --> 01:09:52,079 numbers, and you can see here, 1929 01:09:50,319 --> 01:09:53,920 you can confirm visually, that they will 1930 01:09:52,079 --> 01:09:55,198 add up to one, because you're literally 1931 01:09:53,920 --> 01:09:56,719 taking each number and dividing by 1932 01:09:55,198 --> 01:09:59,439 the total. So they will add up to one; 1933 01:09:56,719 --> 01:10:00,880 there's no other option, right? So this is 1934 01:09:59,439 --> 01:10:02,559 called the softmax function, which means 1935 01:10:00,880 --> 01:10:04,000 that you can take any set of 10 numbers 1936 01:10:02,560 --> 01:10:05,199 that's coming out of the network and 1937 01:10:04,000 --> 01:10:07,198 convert them into probabilities that add 1938 01:10:05,198 --> 01:10:09,919 up to one. 1939 01:10:07,198 --> 01:10:12,639 And so, by the way, the GPT-4 reference: 1940 01:10:09,920 --> 01:10:14,480 when you actually put a prompt into GPT-4 1941 01:10:12,640 --> 01:10:17,760 and it starts giving you the output, 1942 01:10:14,479 --> 01:10:19,359 every word it's emitting, right? It's 1943 01:10:17,760 --> 01:10:21,199 actually a token, but we'll get to that 1944 01:10:19,359 --> 01:10:23,599 later; imagine it's a word. For every 1945 01:10:21,198 --> 01:10:27,599 word it's emitting, it's 1946 01:10:23,600 --> 01:10:28,960 doing a 52,000-way softmax. 1947 01:10:27,600 --> 01:10:31,840 Think of it as every word in the 1948 01:10:28,960 --> 01:10:34,158 language is a possible output. So it's a 1949 01:10:31,840 --> 01:10:36,560 vector which is 52,000 long, but it's 1950 01:10:34,158 --> 01:10:39,839 actually a softmax, and it just picks the 1951 01:10:36,560 --> 01:10:41,440 most probable word and emits that. So 1952 01:10:39,840 --> 01:10:43,360 this notion of a softmax is actually 1953 01:10:41,439 --> 01:10:45,039 very powerful. 1954 01:10:43,359 --> 01:10:49,119 Okay, but we'll come back to that 1955 01:10:45,039 --> 01:10:51,039 later. So, to summarize: if you have 1956 01:10:49,119 --> 01:10:53,519 a single number, you can use a simple 1957 01:10:51,039 --> 01:10:55,519 linear output layer; a single probability, a 1958 01:10:53,520 --> 01:10:57,440 sigmoid; if you have lots of numbers, just 1959 01:10:55,520 --> 01:10:58,719 have a stack of these things. And when 1960 01:10:57,439 --> 01:10:59,839 you have a lot of numbers that have to 1961 01:10:58,719 --> 01:11:03,640 add up to one, that have to be 1962 01:10:59,840 --> 01:11:03,640 probabilities, use softmax, 1963 01:11:03,679 --> 01:11:08,399 right?
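[Editor's note: the softmax computation just described, softmax(a)_i = e^(a_i) / sum_j e^(a_j), as a NumPy sketch. Subtracting the max before exponentiating is a standard numerical-stability trick, not something the lecture covers.]

    import numpy as np

    def softmax(a):
        """Exponentiate each score, then divide by the total."""
        e = np.exp(a - a.max())   # stability trick; does not change the result
        return e / e.sum()

    scores = np.array([2.0, 1.0, 0.1, -1.0])   # raw linear-layer outputs
    p = softmax(scores)
    print(p)           # all positive
    print(p.sum())     # 1.0: a valid set of probabilities
    print(p.argmax())  # index of the most probable class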
So, yeah? 1964 01:11:06,640 --> 01:11:11,360 >> Why do we choose probabilities instead 1965 01:11:08,399 --> 01:11:12,000 of just the number 1966 01:11:11,359 --> 01:11:12,559 one? 1967 01:11:12,000 --> 01:11:14,158 >> Sorry? 1968 01:11:12,560 --> 01:11:15,760 >> Then we know it's only going to be one. 1969 01:11:14,158 --> 01:11:19,399 >> Because you can't force the network to 1970 01:11:15,760 --> 01:11:19,400 give you ones or zeros; 1971 01:11:20,158 --> 01:11:22,639 it's going to produce what it's going to 1972 01:11:21,279 --> 01:11:24,399 produce. 1973 01:11:22,640 --> 01:11:26,239 You can't force it to be exactly one or 1974 01:11:24,399 --> 01:11:28,479 zero; 1975 01:11:26,238 --> 01:11:30,319 it'll give you some number. What you can do is 1976 01:11:28,479 --> 01:11:32,238 tame that number so that it comes 1977 01:11:30,319 --> 01:11:34,639 into a range that you like, like between 1978 01:11:32,238 --> 01:11:38,399 zero and one. 1979 01:11:34,640 --> 01:11:40,000 So here, very quickly: when 1980 01:11:38,399 --> 01:11:41,759 we have a binary classification example, 1981 01:11:40,000 --> 01:11:43,279 like yes or no, this is the one-hot 1982 01:11:41,760 --> 01:11:45,440 encoded version, one or zero. This is what 1983 01:11:43,279 --> 01:11:46,719 we saw in the heart disease example. When 1984 01:11:45,439 --> 01:11:48,639 you have something like this example, 1985 01:11:46,719 --> 01:11:51,039 Fashion-MNIST, where you have all these 1986 01:11:48,640 --> 01:11:52,560 different possibilities, then you can 1987 01:11:51,039 --> 01:11:54,479 encode it in one of two ways. You can 1988 01:11:52,560 --> 01:11:56,560 encode it just using integers, like 0 to 1989 01:11:54,479 --> 01:11:59,519 9, right? This is called the sparse 1990 01:11:56,560 --> 01:12:02,239 encoded version. Or you can do a one-hot 1991 01:11:59,520 --> 01:12:03,760 encoded version of the output, right? You 1992 01:12:02,238 --> 01:12:06,879 can have a one-hot encoded version of 1993 01:12:03,760 --> 01:12:08,960 the output. And depending on how your 1994 01:12:06,880 --> 01:12:11,760 data comes into your 1995 01:12:08,960 --> 01:12:13,840 Colab, right, just pay attention to this, 1996 01:12:11,760 --> 01:12:18,239 depending on what it is, you have to 1997 01:12:13,840 --> 01:12:20,159 pick the right Keras loss function. So if the 1998 01:12:18,238 --> 01:12:21,839 data comes like a one-zero thing, which 1999 01:12:20,158 --> 01:12:24,079 is exactly what we had in the heart disease 2000 01:12:21,840 --> 01:12:26,400 example, we use binary cross entropy. If 2001 01:12:24,079 --> 01:12:28,719 your data comes in this form, where it's 2002 01:12:26,399 --> 01:12:31,279 sparse encoded, you use sparse 2003 01:12:28,719 --> 01:12:32,640 categorical cross entropy. And then if it 2004 01:12:31,279 --> 01:12:34,960 comes in this form, you use 2005 01:12:32,640 --> 01:12:36,640 categorical cross entropy, right? These 2006 01:12:34,960 --> 01:12:38,399 are all equivalent things. It just depends 2007 01:12:36,640 --> 01:12:40,159 on the data that you get, how it happens 2008 01:12:38,399 --> 01:12:42,559 to be encoded by the people who sent it 2009 01:12:40,158 --> 01:12:43,759 to you. If they send it this way, use 2010 01:12:42,560 --> 01:12:46,080 this loss function. If they send it that 2011 01:12:43,760 --> 01:12:47,600 way, use that loss function.
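[Editor's note: a sketch of the network and loss choice described above, assuming TensorFlow/Keras. The layer sizes follow the lecture (flatten, 256 ReLU units, 10-way softmax); the optimizer choice is an assumption, and this is not the lecture's exact Colab code.]

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(28, 28)),
        layers.Flatten(),                        # 28 x 28 table -> 784-long vector
        layers.Dense(256, activation="relu"),    # hidden layer
        layers.Dense(10, activation="softmax"),  # 10 probabilities summing to 1
    ])

    # Labels arrive as integers 0..9 (sparse-encoded), so pick the matching loss;
    # with one-hot labels it would be "categorical_crossentropy" instead.
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )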
2012 01:12:46,079 --> 01:12:49,359 Now, as it turns out, in our example 2013 01:12:47,600 --> 01:12:50,800 here, the data is actually coming in 2014 01:12:49,359 --> 01:12:52,158 this form. So we'll use this thing 2015 01:12:50,800 --> 01:12:54,880 called sparse categorical cross 2016 01:12:52,158 --> 01:12:56,399 entropy. And categorical cross entropy 2017 01:12:54,880 --> 01:12:58,159 is a generalization of binary cross 2018 01:12:56,399 --> 01:12:59,839 entropy. I'm not going to get into 2019 01:12:58,158 --> 01:13:01,359 the mathematical details, but the 2020 01:12:59,840 --> 01:13:04,319 intuition is basically roughly the 2021 01:13:01,359 --> 01:13:07,439 same. 2022 01:13:04,319 --> 01:13:09,198 Okay, so this is what we have. If this 2023 01:13:07,439 --> 01:13:11,599 is your output layer, use mean squared 2024 01:13:09,198 --> 01:13:14,079 error. If this is your output layer, use 2025 01:13:11,600 --> 01:13:15,360 binary cross entropy. And if you 2026 01:13:14,079 --> 01:13:17,039 have a stack of these numbers, you can 2027 01:13:15,359 --> 01:13:19,519 still use mean squared error. And if your 2028 01:13:17,039 --> 01:13:22,000 output is a softmax, use categorical 2029 01:13:19,520 --> 01:13:24,560 cross entropy or sparse categorical 2030 01:13:22,000 --> 01:13:26,479 cross entropy. 2031 01:13:24,560 --> 01:13:30,600 Okay. So let's actually run this in 2032 01:13:26,479 --> 01:13:30,599 Colab. Um, 2033 01:13:32,079 --> 01:13:37,198 right. So this is what we have. Can 2034 01:13:33,679 --> 01:13:40,800 folks see this? Okay. All right. So this 2035 01:13:37,198 --> 01:13:44,399 is the data set we saw earlier. Down 2036 01:13:40,800 --> 01:13:47,039 here, as usual, right, we load 2037 01:13:44,399 --> 01:13:49,198 TensorFlow and Keras, we load our usual 2038 01:13:47,039 --> 01:13:51,119 three packages, and then we set the 2039 01:13:49,198 --> 01:13:53,198 random seed for reproducibility. And it 2040 01:13:51,119 --> 01:13:54,719 turns out that the Fashion-MNIST data is 2041 01:13:53,198 --> 01:13:56,000 actually available in Keras. You don't 2042 01:13:54,719 --> 01:13:57,439 have to go find it somewhere and bring 2043 01:13:56,000 --> 01:13:59,279 it in. It's actually available in Keras; 2044 01:13:57,439 --> 01:14:01,119 it's one of the standard data sets. We 2045 01:13:59,279 --> 01:14:04,079 luck out. So we just load the 2046 01:14:01,119 --> 01:14:05,920 data using this load_data command. 2047 01:14:04,079 --> 01:14:08,399 And when you do that, conveniently 2048 01:14:05,920 --> 01:14:10,399 for us, Keras has not only made the data 2049 01:14:08,399 --> 01:14:12,238 available, it has already split it into a 2050 01:14:10,399 --> 01:14:13,920 training and test set. So we don't have 2051 01:14:12,238 --> 01:14:15,279 to do the splitting. Okay. And the 2052 01:14:13,920 --> 01:14:18,279 reason they do that, why would they do 2053 01:14:15,279 --> 01:14:18,279 that? 2054 01:14:18,640 --> 01:14:21,679 They do that so that different people 2055 01:14:20,238 --> 01:14:23,678 who are building algorithms for that 2056 01:14:21,679 --> 01:14:26,640 particular data set can all be evaluated 2057 01:14:23,679 --> 01:14:28,079 using the same test set. 2058 01:14:26,640 --> 01:14:29,600 Otherwise, if I split it one way and 2059 01:14:28,079 --> 01:14:31,439 say, "Hey, look how well I did," it's like, 2060 01:14:29,600 --> 01:14:32,480 "I don't know, how did you split it?" 2061 01:14:31,439 --> 01:14:36,000 >> That's the reason. 2062 01:14:32,479 --> 01:14:38,158 >> Okay.
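[Editor's note: the loading step described above; fashion_mnist.load_data() is the standard Keras entry point and returns the pre-made train/test split. Variable names are illustrative.]

    from tensorflow import keras

    (train_X, train_y), (test_X, test_y) = keras.datasets.fashion_mnist.load_data()

    print(train_X.shape, train_y.shape)   # (60000, 28, 28) (60000,)
    print(test_X.shape, test_y.shape)     # (10000, 28, 28) (10000,)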
So here you can see that 2063 01:14:36,000 --> 01:14:43,760 we have 2064 01:14:38,158 --> 01:14:47,039 the input data as a tensor of rank 2065 01:14:43,760 --> 01:14:48,239 three. Basically, another 2066 01:14:47,039 --> 01:14:50,158 way to think about a tensor of rank 2067 01:14:48,238 --> 01:14:52,879 three is just a list of rank-two 2068 01:14:50,158 --> 01:14:57,279 tensors, right? So here you have 60,000 2069 01:14:52,880 --> 01:15:02,079 images, 60,000 images, and each image is 2070 01:14:57,279 --> 01:15:04,639 a 28 x 28 square of numbers. Each image 2071 01:15:02,079 --> 01:15:07,279 is a 28 x 28 table. And then, of 2072 01:15:04,640 --> 01:15:09,920 course, the output is just what 2073 01:15:07,279 --> 01:15:11,519 category it is, a number between 0 and 9. 2074 01:15:09,920 --> 01:15:13,840 So you just have 60,000 numbers; it's 2075 01:15:11,520 --> 01:15:15,920 just a vector of 60,000 numbers. Okay. 2076 01:15:13,840 --> 01:15:19,039 So there are 60,000 in the training 2077 01:15:15,920 --> 01:15:21,279 set. Oops. And then there are 10,000 2078 01:15:19,039 --> 01:15:23,519 in the test set, same structure, 28 by 2079 01:15:21,279 --> 01:15:25,039 28. That's what we have. So if you 2080 01:15:23,520 --> 01:15:27,040 look at the first 10 rows of the 2081 01:15:25,039 --> 01:15:29,039 dependent variable y, you get these 2082 01:15:27,039 --> 01:15:31,439 numbers: 9, 0, 3, 3, like that. They are 2083 01:15:29,039 --> 01:15:33,359 numbers from 0 to 9. And if you look at 2084 01:15:31,439 --> 01:15:35,919 the Fashion-MNIST GitHub site, this is 2085 01:15:33,359 --> 01:15:37,839 what each refers to. Zero is a t-shirt, 2086 01:15:35,920 --> 01:15:41,600 one is a trouser, and so on and so 2087 01:15:37,840 --> 01:15:43,760 forth. And nine is an ankle boot. 2088 01:15:41,600 --> 01:15:45,280 All right. So whenever I'm working 2089 01:15:43,760 --> 01:15:47,520 with multiclass classification 2090 01:15:45,279 --> 01:15:49,439 problems, I always, you know, do a 2091 01:15:47,520 --> 01:15:51,120 little thing here to help me figure out 2092 01:15:49,439 --> 01:15:52,319 that nine corresponds to an ankle boot, 2093 01:15:51,119 --> 01:15:53,519 and so on and so forth. It just makes it 2094 01:15:52,319 --> 01:15:56,639 a little easier to work with this 2095 01:15:53,520 --> 01:15:59,679 stuff. So I create this little list.
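[Editor's note: the "little list" described above, using the label names from the Fashion-MNIST GitHub page; assumes train_y from the loading sketch earlier.]

    # Index i of this list is the human-readable name of class i.
    class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
                   "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

    print(class_names[train_y[0]])   # first training example: "Ankle boot"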
2096 01:15:56,640 --> 01:16:01,119 And then, okay, what 2097 01:15:59,679 --> 01:16:02,960 is the very first data point? What is 2098 01:16:01,119 --> 01:16:05,279 its y-value? It turns out to 2099 01:16:02,960 --> 01:16:07,679 be an ankle boot. So you can actually 2100 01:16:05,279 --> 01:16:10,238 look at the raw data for that image, 2101 01:16:07,679 --> 01:16:13,119 which is just a 28 x 28 thing, and these 2102 01:16:10,238 --> 01:16:16,959 are the numbers you have. 2103 01:16:13,119 --> 01:16:19,198 See, all these 250s, 233s, lots of zeros, and 2104 01:16:16,960 --> 01:16:20,960 so on and so forth. So you can actually 2105 01:16:19,198 --> 01:16:22,639 visualize the first 25 2106 01:16:20,960 --> 01:16:24,560 images. I have a little bit of code here 2107 01:16:22,640 --> 01:16:25,920 which visualizes that, just matplotlib 2108 01:16:24,560 --> 01:16:28,719 code, and you can see these are all the 2109 01:16:25,920 --> 01:16:32,319 images. They're kind of smallish. This, 2110 01:16:28,719 --> 01:16:34,560 my friends, is an ankle boot, 2111 01:16:32,319 --> 01:16:35,759 right? It's like, okay, can the network 2112 01:16:34,560 --> 01:16:37,360 really make any sense out of this thing, 2113 01:16:35,760 --> 01:16:39,920 right? It looks very blurry, and I don't 2114 01:16:37,359 --> 01:16:42,158 know. 2115 01:16:39,920 --> 01:16:43,679 This is, 2116 01:16:42,158 --> 01:16:45,359 oh, this is actually a better ankle boot. 2117 01:16:43,679 --> 01:16:47,840 Look at that. Okay, sorry, I'm getting 2118 01:16:45,359 --> 01:16:49,599 distracted. So this is what we have 2119 01:16:47,840 --> 01:16:51,520 here. 2120 01:16:49,600 --> 01:16:53,360 Okay, we are at 9:55. 2121 01:16:51,520 --> 01:16:54,880 I'm going to stop so you folks are 2122 01:16:53,359 --> 01:16:56,399 not late for your next class. We'll 2123 01:16:54,880 --> 01:16:58,079 continue this journey on Wednesday, and 2124 01:16:56,399 --> 01:16:59,599 then we'll go on to color images the 2125 01:16:58,079 --> 01:17:03,000 next class as well. Thank you, folks. 2126 01:16:59,600 --> 01:17:03,000 Have a good one.
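[Editor's note: for reference, a sketch of the kind of matplotlib snippet mentioned above, showing the first 25 training images in a 5 x 5 grid with their labels; it assumes train_X, train_y, and class_names from the earlier sketches, and is not the lecture's exact Colab code.]

    import matplotlib.pyplot as plt

    plt.figure(figsize=(8, 8))
    for i in range(25):
        plt.subplot(5, 5, i + 1)
        plt.imshow(train_X[i], cmap="gray")
        plt.title(class_names[train_y[i]], fontsize=8)
        plt.axis("off")
    plt.show()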