Okay, so let's get going. Today we're going to talk about how you actually train a neural network, because that is the heart of the game here. Just to recap: last class we looked at what it takes to design a neural network, and we made a very important distinction between the things you are handed by your problem and the things you have agency over, that you control. We noticed that the input layer is dictated by your problem: the input is the input, the output is the output, and you have to produce the output that's expected. But everything that happens in the middle is in your hands. In particular, we have to decide how many hidden layers we want, we have to decide how many neurons each layer gets, and we have to decide what activation to use. Although I'm cheating a little when I say that, because I told you very clearly on Monday that for the hidden-layer activation you should just go with the ReLU activation function; you don't have to think deep thoughts about it. But the other things are all choices you have to make, and we'll talk a bit later about how you actually make those choices.

Now, the rule of thumb is always to start with the simplest network you can think of. If it gets the job done, stop working on it. If it's not good enough, make it slightly more complicated. That's the meta-rule to remember whenever you're designing these things.

Okay, so that's what it takes to design a deep neural network. What we'll do in this class is take a real example with real data and think through how we would design a network to solve that problem.
And while doing so, we'll cover a whole bunch of conceptual foundations such as optimization, loss functions, gradient descent, and all that good stuff.

All right. The case study, or scenario, is a dataset of patients made available by the Cleveland Clinic. Essentially, we have a bunch of patients, and the setting is that they came into the Cleveland Clinic, but not for a heart problem — they came in for something else; maybe just a physical. And a whole bunch of things were measured about them. The kinds of things measured are demographic information, like age and gender, whether they had any chest pain at all when they came in, blood pressure, cholesterol, blood sugar, and so on. You get the idea: demographic information plus a bunch of biomarker information. And then what the Cleveland Clinic did was track these people and figure out whether, in the next year, they were diagnosed with heart disease or not. Which means that maybe you can build a model so that when someone comes in — even though they didn't come in for a chest problem — you can predict whether something is going to happen to them in the next year. It's a nice, classic machine learning setup.

So that's the task. We could absolutely solve this problem using decision trees — sorry, random forests — and gradient boosting and all that good stuff you folks have already learned in machine learning. But we'll try to solve it using neural networks. This is an example, of course, of what's called structured data, because it's all data sitting in the columns of a spreadsheet.
Working with structured data is how we warm up our knowledge of neural networks. Then, starting next week, we'll work with unstructured data: first images, and later on text and so forth. Any questions on this?

Student: Just to connect this to last class, where we took the same example and first did logistic regression and then a neural network — the predicted probability was 0.85 in one case and 0.22 in the other. Here as well, how do you know when to use what? In textbooks you know when to use logistic regression versus something else, but in this case, when do I complicate things to a neural network versus maybe just using a random forest?

It's a great question: when do you use what? I think there are two broad dimensions to consider. One is how important it is that you can explain or interpret what's going on inside the model, perhaps to a non-technical consumer. The other is how important sheer predictive accuracy is. In some situations predictive accuracy trumps everything else, in which case just go with whatever predicts best. In other cases explainability becomes a big deal, because if people can't understand the model, they won't use it. In those cases it's probably better to go with simpler models: decision trees, maybe even random forests, certainly logistic regression — those are all a little more amenable to interpretation. That said, even for complex black-box methods like neural networks, there is a whole field called mechanistic interpretability that seeks to get insight into what's going on inside these big black boxes. So the story isn't over.
But that's the first cut: you analyze the problem along those dimensions.

Okay, so let's get going and design the network. We have to choose the number of hidden layers and the number of neurons in each layer, and then pick the right output layer. Now, the simplest thing you can do, of course, is to have no hidden layer at all. And if you have no hidden layers, what is that model called? Yes: logistic regression. But of course we want a neural network, so I'm going to have one hidden layer, because that's the simplest thing I can do. And I'll confess, I tried a few different numbers of neurons in that layer, and with 16 neurons it actually did pretty well. So there was some trial and error before I landed on 16. For some reason people always use powers of two, so we may as well do that; I tried 4, 8, 16, and 16 was really good. As it turns out, when I went above 16 it started to do badly, and it did badly because of something called overfitting, which we'll talk about later. So: 16, and by default I use ReLUs — 16 ReLU neurons. The output here is categorical — heart disease, yes or no, one or zero, a classification problem — which means we want to emit a probability at the very end. Therefore we'll use a sigmoid.

So far, so good? Any questions? All right, let's lay this network out visually. We have an input layer, and as you'll see here, it's X1 through X29.
And you may be wondering where 29 comes from, because there don't seem to be 29 independent variables here. It turns out there are only 13 input variables, but some of them are categorical. So what I ended up doing is taking each categorical variable and one-hot encoding it, and when you do that you get 29 inputs. When we actually do the Colab later on, I'll show you exactly how I one-hot encoded it, but that's what's happening here; that's why there are 29 inputs, not 13. Then, as decided, we have the hidden layer with 16 units and nice ReLUs, and then an output layer with a little sigmoid. And I got bored of trying to draw all the arrows, so I just gave up and said: assume there are arrows between all of these things. Good?

Student: Sorry, I think you already mentioned this, but why 16 units?

I tried a bunch of different numbers of units, and at 16 the resulting model did well, so I went with that.

Student: And the logic of why a ReLU?

Oh, why a ReLU? There's just a mountain of empirical evidence suggesting that ReLU is a really good default option for activations in hidden layers. There's also a really nice set of theoretical results, and I'll allude to some of them when we talk about gradient descent.

Student: Quick question — in the input layer, how did you get to 29 again when you had 13 variables?

Some of those 13 variables are categorical, like cholesterol low, medium, high. So I took them and one-hot encoded them.
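A minimal sketch of that one-hot encoding step, assuming the data sits in a pandas DataFrame; the column names and values here are made up for illustration, not the actual Cleveland columns (the Colab will show the real thing):

```python
import pandas as pd

# Hypothetical stand-ins for two of the categorical columns in the data.
df = pd.DataFrame({
    "age": [63, 51, 45],
    "chest_pain": ["typical", "atypical", "none"],
    "cholesterol_level": ["high", "medium", "low"],
})

# get_dummies replaces each categorical column with one 0/1 column per level,
# which is how 13 raw variables can expand into 29 model inputs.
X = pd.get_dummies(df, columns=["chest_pain", "cholesterol_level"])
print(X.shape)  # (3, 7): age + 3 chest_pain levels + 3 cholesterol levels
```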
So if a variable had, say, five levels, it becomes five columns.

And by the way, folks, please use a microphone, just like that, so the people on the live stream can hear your question. Yeah, go ahead.

Student: Sorry, just one question. Since you didn't draw the arrows, are we assuming every X is connected to all the units?

Correct.

Student: And is that also a parameter we have to decide, or...?

That ends up being the default. We'll see deviations from that assumption when we get to image processing and language processing and so on, but when you're working with structured data like we are now, fully connected is the default.

Okay, let's keep going. This is what we have. Remember what I told you last class: whenever you're working with these networks, get into the habit of very quickly calculating the number of parameters. Just do it the first few times, so that you know cold exactly what's going on. So, how many parameters do we have here — how many weights and biases? You can work through it; you don't have to tell me the final number, you can say x times y plus z, something like that.

Student: 65 — 48 weights and 17 biases.

Okay, and how did he come up with that? For the weights, he counted 2 × 16 for the first layer and 1 × 16 for the second connection, and then the biases are the 16 hidden ones plus the output. Any other views on this?

Student: I think it's 29 times 16, and then 16 times 1, plus 16 biases and one bias.

Right. The way it works is: we have 29 things here and 16 in the middle, so 29 × 16 arrows.
And then for each of those 16 hidden units there's a bias coming in, so that's another 16. Plus you have 16 × 1 weights into the output, plus one bias for the output unit. So the total is 29 × 16 + 16 + 16 × 1 + 1 = 497.

You can see something very interesting going on here: when you go from one layer to the next, the number of weights is roughly on the order of a × b, where a and b are the numbers of units in the two layers. That's a dramatic explosion in the number of parameters, and it's something we'll have to watch later on to prevent overfitting. That explosion comes from the fact that each layer is fully connected to the next. We'll revisit this later.
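A quick sanity check of that count — a tiny sketch; the same number should show up again when we ask Keras to summarize the model later:

```python
n_inputs, n_hidden, n_outputs = 29, 16, 1

# weights into hidden layer + hidden biases + weights into output + output bias
n_params = n_inputs * n_hidden + n_hidden + n_hidden * n_outputs + n_outputs
print(n_params)  # 497
```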
Okay. What I'm going to do now is translate this network — the one we've laid out graphically — into Keras code, to demonstrate how easy it is. I'll give a fuller introduction to Keras and TensorFlow later on, but for now just suspend your disbelief; we'll write it in Keras as if we already know Keras. Later we'll get into all the gory details and train it in Colab and so forth.

The way we typically do this is that once we have a network like this, we start from the left and define each layer in Keras one after the other — we flow left to right. So let's take the input layer. Defining an input layer in Keras is really easy: you literally say keras.Input, and you tell Keras how many nodes are coming in. In this case that happens to be 29, so you tell it the shape: shape equals 29. And the reason we say "shape" rather than "length" is that, as you'll see later, we don't have to send in just vectors; we can send complicated objects into Keras — matrices, 3D cubes, 4D tensors, and so on. So it expects a shape: what is the shape of the thing you're going to send me? In this particular case it's a nice flat vector, so the shape is 29. That's it — we write this down, and it creates the input layer. We also give it a name, and the name means that whatever comes out of this layer is called input.

Good. Next, we go to the following layer, and we'll unpack this. The way you typically define a hidden layer is keras.layers.Dense with its arguments. First, it says I want a dense layer — by dense I mean a layer that fully connects to the prior and the following layers; "fully connected" is what dense means. Second, I want 16 nodes in this layer. Finally, I want to use a ReLU activation. See how compact and parsimonious that is? That's the appeal of Keras: it's very easy to get going. So the moment you write that, you've defined the layer. But what you have not done is tell this layer what input it's going to get, because as far as this layer is concerned, it doesn't know the previous layer exists. So you need to connect them. Yes?

Student: Do we need to define for the ReLU where the bend is — like where you take the max?
No — for the ReLU, the bend is always at zero.

Student: Okay, thank you.

All right, so that's what we have here. Next, we have to tell this layer that we want to feed it the output of the previous layer. You do that by taking whatever comes out of the input layer — which we named input — and sticking it in here. The moment you do that, boom, it receives the input from the previous layer. And because this layer's output needs to go on to the final layer, you give a name to that output too. I'm just calling it h, because it comes out of the hidden layer. It's just a variable; you can call it anything you want.

Now we go to the final output layer, and it's just another dense layer — that's why I use the word dense again — but we say: give me just one unit, because I literally need only one unit here to emit one probability, and the activation I want is a sigmoid. Done. Once you've done that, you have to feed it the output of the second layer, so you stick the h in here; now you've connected the third and second layers. After that, you give a name to what comes out of it — we'll just call it output; you can call it y or whatever you want.

So at this point, we have mapped that picture into those three lines. That's it. But we aren't quite done yet; there's one little thing left to do.
What we have to do is formally define a model, so that Keras can work with the model object — train it, evaluate it, use it for prediction, and so on. So we tell Keras: create a model for me, keras.Model, where the input is this thing here and the output is that thing there, and we'll just call the whole thing model. That's it — we're done. That is the whole model. It sounds really fancy, right? A neural model for heart disease prediction. Pretty cool. Four lines. We'll show how to train this model with real data and use it for prediction after we switch gears and get into some conceptual building blocks.
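Putting those four lines together — a minimal sketch of the network as described, assuming TensorFlow's bundled Keras; the variable names just follow the ones used on the slide:

```python
from tensorflow import keras

# Input layer: 29 features (13 raw variables after one-hot encoding)
inputs = keras.Input(shape=(29,), name="input")

# Hidden layer: 16 fully connected (dense) ReLU units, fed the input
h = keras.layers.Dense(16, activation="relu")(inputs)

# Output layer: one sigmoid unit emitting a single probability, fed the hidden output
output = keras.layers.Dense(1, activation="sigmoid")(h)

# Wrap input and output into a model object Keras can train, evaluate, and predict with
model = keras.Model(inputs=inputs, outputs=output)

model.summary()  # should report 497 trainable parameters, matching the hand count
```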
There was a question?

Student: Can you define a custom activation function that is not in the Keras library?

The question was whether you can define a custom activation function — you totally can. In fact, the kind of flexibility you have here is incredible, and these innocent four lines unfortunately hide the potential of what's possible. But I guarantee you that in two to three weeks you folks will be thinking in building blocks, like Legos. I'm so happy when it happens: students come to my office hours and say, "I want to create a network with a little network going along the top, another along the bottom, they meet in the middle, then they fork and split again." Unbelievable. It's fantastic. And you're going to be doing this in two weeks, I guarantee you.

Student: In the case of a multi-class classification problem, are the output nodes equal to the number of classes?

Correct. This is binary classification; for multi-class classification — say you're classifying an input into one of 10 possibilities — we would have 10 outputs. But the way we define that uses something called the softmax function, which we'll cover on Monday. For now, we'll stay with binary classification.

Student: Is there a default activation in Keras, or do you have to specify something?

Ah, good question. I believe the default might be ReLU for hidden layers, but I'm not 100% sure — let's double-check that.

Student: Just to get a clearer understanding: when you said that performance worsened beyond 16 neurons — is that where you were playing around, starting with two, then maybe four, six, and eight?

Exactly, right. Could you use the mic?

Student: Do we need to define each hidden layer individually when the model gets more complex — when we have more than one layer?

Oh, like if you have 25 layers? Good question. If you have, say, 100 layers, do you actually have to type each one in by hand and copy-paste? No. You can write a little loop that creates them for you automatically. Basically, this little output variable you see here could just as well be the result of a thousand-layer network, with all sorts of complicated transformations going on, before it finally pops out as one little thing called the output.
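A sketch of that idea — stacking many hidden layers with a loop instead of writing each line by hand; the layer count and width here are arbitrary choices for illustration:

```python
from tensorflow import keras

inputs = keras.Input(shape=(29,))

# Build a deep stack programmatically: each iteration adds one dense ReLU layer,
# always feeding it whatever came out of the previous layer.
x = inputs
for _ in range(10):          # 10 hidden layers, chosen arbitrarily for this example
    x = keras.layers.Dense(16, activation="relu")(x)

output = keras.layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs=inputs, outputs=output)
```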
And what Keras will do is say: okay, this model has this input and this output — and sure, this output came from incredible transformations applied to the input — and Keras will process all of that for you very easily. You don't have to worry about it. It's really a beautiful example of the power of abstraction, and you'll see that as we go along.

Okay. So now let's switch gears: once you've written a model like that in Keras, how do you actually train it? Training is something you've already done a lot. For example, with linear regression, you have all these coefficients to estimate: you have the model, you have a bunch of data, you run it through something like lm if you use R, and what it gives you is actual values for those coefficients — 2.8, 0.9, and so on. So the role of the data is to give you the coefficients; or you can think of the coefficients as a compressed version of the data. Similarly, with logistic regression, you have a model like that, you add some data, you run it through an estimation routine like glm, or scikit-learn, or statsmodels — pick your favorite tool — and out come fitted values like those. So training simply means: find the values of the coefficients such that the model's predictions are as close to the actual values as possible. That's it. And to find the values that get you as close as possible, a whole bunch of optimization is involved.
You didn't have to worry about that optimization when you did linear or logistic regression, because it's all done under the hood for you. But for neural networks we actually get to see how it's done, because it's important.

Training a neural network — a deep neural network, even GPT-4 — is basically the same process as for regression. You just have a very complicated function with lots of parameters: a network full of question marks, you add some data, you do some training, and boom, you get numbers.

Student: You may be getting to this, but are we determining the architecture of the network before we train it?

Yes, because if you don't define the architecture, Keras doesn't know how to compute an output from an input, and unless it can relate inputs to outputs, it can't do anything more.

Okay. So the essence of training is to find the best values for the weights and biases. And the way we define "best" is that we set up a little function that measures the discrepancy between the actual and the predicted values. I use the word discrepancy because there is an incredible amount of creativity in the field around how you define it. In fact, a lot of breakthroughs in deep learning happen precisely because someone defines a very clever measure of discrepancy, and it turns out to produce all sorts of interesting behavior. That's why I say discrepancy rather than error: when I say error, you might think only of something like predicted minus actual, and that's too limiting.
So we define a function that captures the discrepancy between the actual and the predicted values, and these functions are called loss functions in the deep learning world. In every paper you read, you'll find interesting loss functions; there are hundreds of them, and enormous research creativity goes into defining them.

So a loss function is a function that quantifies a discrepancy. Say the predictions are really close to the actual values — what would the loss be? Close to zero; very small. And if you had a perfect model, a perfect crystal ball, what would the loss be? Exactly zero.

In linear regression, the loss function we use is called the sum of squared errors. We didn't call it a loss function because we weren't doing deep learning, just linear regression, but that's basically a loss function. Now, the loss function must be matched properly to the kind of output we have. If your output is a number like 23 — say you're predicting demand for a particular product next week, the predicted value is 23 and the actual value is 21 — it's fine to use 23 minus 21, i.e. 2, as the discrepancy, the error. But for other kinds of outputs it's not so obvious what the correct loss function, the correct measure of discrepancy, is.

So here, for the simple case of regression: the superscript i in y^(i) stands for the i-th data point. What this says is that for the i-th data point, y^(i) is the actual value and model(x^(i)) is what the model predicted. I take the difference, square it, and once I've squared it for each point, I average all those numbers to get an average squared error — mean squared error, MSE. This is about the easiest loss function there is.
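A minimal sketch of that computation in NumPy, using the demand example above (the numbers besides 23 and 21 are made up):

```python
import numpy as np

y_actual = np.array([21.0, 40.0, 18.0])   # actual demand for three products (made up)
y_pred   = np.array([23.0, 37.0, 18.5])   # the model's predictions (made up)

# Mean squared error: average of the squared per-point discrepancies
mse = np.mean((y_actual - y_pred) ** 2)
print(mse)  # (4 + 9 + 0.25) / 3 ≈ 4.42
```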
Now let's crank it up a notch. In the heart disease example — the neural prediction model — the prediction is a number between zero and one, because it comes out of the sigmoid; it's a fraction. The actual output is a zero or a one, one of the two; it's binary. So how would we measure the discrepancy between a fraction and the numbers zero and one? What is a good loss function in this situation? That's the key question. So let's build some intuition around it.

And let's see if my little daisy-chained iPad setup works — I'm doing this on the iPad so that people on the live stream can see it; otherwise the blackboard is a little tough for them.

Okay, here's the situation. Say a patient comes in and they do have heart disease, so for that patient Y equals one; the true value is one. And now you have this model, and along this axis is the predicted probability from the model. Can people see my handwriting okay? Good — I could never be a doctor, right? So: the axis runs from zero to one, because it's a probability, and on the other axis is the loss we want the model to incur. So, this patient actually had heart disease: Y equals one.
Now say the predicted probability is pretty close to one. What do you think the loss should be? Small — close to zero, exactly. So if the prediction lands here, you want the loss to be down here. But if the predicted probability is close to zero, even though the patient actually has heart disease, what do you want the loss to be? Really high — because the model is messing up badly, so you want the loss to be up here. Basically, you want a function shaped like that: high values of the predicted probability should have low loss, and low values should have high loss. Yeah?

Student: I understand why it has to be increasing or decreasing, but can you explain why it has to be curved like that?

It can certainly be linear. But basically, the bigger the mistake the model makes, the more harshly you want to penalize it. What you really want is something where, if the model says this person's probability is, say, one in a million — essentially zero — the loss is super high, like a huge rap on the knuckles for the model: don't do that. That's what we're after, and I'm demonstrating that dynamic with a very curved, steep loss function. But you can absolutely use a linear function — it's totally fine, it just won't be as effective for gradient descent later on, for a bunch of technical reasons. Are we good with this?
All right. Now let's look at the case where a patient does not have heart disease: Y equals zero. Same setup — the predicted probability on one axis, from zero to one, and the loss on the other. For this patient, who doesn't have heart disease, if the predicted probability is close to zero, what should the loss be? Close to zero; it should be down here. And the closer the probability gets to one, the more heavily you want to penalize it, which means you want the loss to be up here. So you want a loss that climbs higher and higher as the probability goes up. Are we good? Perfect — because we have a perfect loss function for exactly that.

So, to recap: for points with Y equals one, lower predictions should have higher loss — you want something shaped like that. And it turns out there's a very simple loss function, which literally just uses the logarithm, that gets the job done: you take minus the log of the predicted probability. That's it, and it has exactly the shape we want. You can see it numerically: if the predicted probability is one, the loss is zero; if it's a half, the loss is 1.0; if it's one in a thousand, the loss is almost 10; and if it's one in ten thousand, it's much higher still. Very high losses. So: minus log of the probability — boom, done.

Similarly, this is what we want for patients with Y equals zero, and it turns out that if you take minus the log of one minus the predicted probability, it does the same thing. Okay?
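A quick numerical check of those two curves — a sketch; the numbers quoted above line up with a base-2 logarithm, and the base only rescales the loss by a constant factor (frameworks typically use the natural log):

```python
import numpy as np

# Loss for a patient with Y = 1: -log(predicted probability)
p = np.array([1.0, 0.5, 1e-3, 1e-4])
print(-np.log2(p))          # [ 0.    1.    9.97 13.29]  low near 1, huge near 0

# Loss for a patient with Y = 0: -log(1 - predicted probability)
q = np.array([0.0, 0.5, 0.999])
print(-np.log2(1.0 - q))    # [0.   1.   9.97]           low near 0, huge near 1
```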
Mathematicians once again saved us with a logarithm. So, in summary, this is what we have: for data points where y equals 1, we use minus log of the predicted probability; for data points where y equals 0, we use minus log of one minus the predicted probability. But it feels a little inelegant to say, "Well, if y equals 1, I want to use this; if y equals 0, I want to use that." There's an if-then thing going on here, and I don't know about you folks, but if-then really irks me mathematically, because you can't take derivatives and so on very easily. But no worries. This is MIT; we have our bag of math tricks. So, what we do is combine them both into a single expression, like this:

loss_i = -[ y_i * log(model(x_i)) + (1 - y_i) * log(1 - model(x_i)) ]

Here y_i is again the label of the ith data point; remember, y_i is always either 1 or 0. And model(x_i) is the predicted probability. I've just taken the minus sign from each log term and moved it out front; that's why it looks like this. You can convince yourself that this single expression gets the job done. Say there's a patient for whom y equals 1. When you plug in y equals 1, the second term becomes 0 and the whole thing collapses away, while the first term just becomes minus log of the predicted probability, which is what we want. Conversely, if y equals 0, the first term disappears, and 1 minus 0 is just 1, so it becomes minus log of one minus the predicted probability, which is again what we want. Simple and neat, right? So, in one expression, we have defined the perfect loss.
No if-thens, none of that crap. Good. Now, that was the loss for a single data point, but we obviously have lots of data points. So we just add them all up and take the average; we average across all the data points we have, so that we get an average loss. We call this the binary cross-entropy loss function.
>> Is there a way you can edit the loss function so that you penalize, say, false negatives more strongly than false positives?
>> You can do all of that. Great question. I'm just looking at the basic case where the loss is symmetric, but you can penalize overestimates much more heavily than underestimates and things like that. If you're curious, you can Google something called the pinball loss. Any other questions on this? So, when you see some massive deep neural network built by Google for doing something or other, if it's a binary classification problem, chances are they're using this thing.
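Putting the single expression and the averaging together, a minimal NumPy version of the binary cross-entropy loss might look like the sketch below. This is not the course's code, and the labels and probabilities are made up just to have something to run.

```python
import numpy as np

def binary_cross_entropy(y, p):
    """Average of -[y*log(p) + (1-y)*log(1-p)] over all data points."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    per_point = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return per_point.mean()

# Made-up example: true labels and a model's predicted probabilities
y_true = [1, 0, 1, 0]
p_pred = [0.9, 0.2, 0.7, 0.4]
print(binary_cross_entropy(y_true, p_pred))   # about 0.30: decent predictions, small loss
```

The asymmetric penalties asked about above would amount to weighting the two terms differently before averaging, so that missing a true case costs more than a false alarm.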
All right. So, now let's figure out how to minimize these loss functions, because the name of the game is to find a way to minimize them. Loss functions are just a particular kind of function, so we'll first consider the general problem of minimizing some arbitrary function, and once we develop a little bit of intuition about that, we'll return to the specific task of minimizing loss functions. How's everyone doing? Yes, no, good, bad? You have a bit of a tough-to-interpret head shake there.
>> It's more that I kind of lost you where you said the loss function and the predicted probability, how they were inversely related, because my understanding was that the loss function is supposed to be the sum of the errors; we're averaging the errors. And when you said the heart patient...
>> Sorry, let me just stop there for a second. For each point, you define the loss. That's the whole point of the game. And once you define it, you calculate it for every point and average it, right? So, just focus on a single data point. Now continue.
>> So, when the heart patient has... there is more probability that they...
>> No.
>> So, when there is a person who has heart disease, you said that you want the loss function to be high. I think... I'm going back to the graph.
>> You want the loss function to be high if I'm predicting that they basically don't have heart disease. If the predicted probability is close to zero, then I'm badly wrong, because in reality they do have heart disease. And that's why I want the loss to be really high.
>> Okay, so effectively, the loss is my way of finding out how good my model is...
>> Or rather, how bad your model is. Right? How bad is it? That's really what the loss function is.
>> Got it.
>> And you want to minimize badness. That's the whole point of optimization.
>> I guess I still don't have a fully clear intuition of why exactly a log function, rather than something that's, say, flatter for small mistakes and then really steep later.
>> Those are all fantastic alternatives. You can totally do that. The reason we picked this function is that, A, it's easy to work with; it has good gradients.
It's well-behaved mathematically. But there are many alternatives to it. I don't want you to think that this is the only game in town or the only choice we have. We have many choices. This just happens to be a very easy choice, which also happens to be empirically very effective. And I'm happy to give you pointers to other, crazier loss functions which can do all of those things, too.

All right. So, minimizing a single-variable function. We'll warm up by looking at this little function here, which is a... what do you call a fourth-power polynomial?
>> Quartic.
>> Quartic, right? Thank you. So, it's a quartic function, and this is what it looks like. You can see there's a minimum somewhere between minus one and minus two, maybe around minus 1.5. We want to minimize this function. It's obviously a toy function, a little function with one variable, but the intuition we use here is going to be exactly what we use for GPT-4, so pay attention. How can we go about minimizing this function? What will we do?
>> Take the derivative and set it equal to zero.
>> You take the derivative, exactly. So, let's look at what the derivative does for us. But the second part of what was said, setting it to zero, becomes problematic when you have very complicated functions. It's not clear at all what's going to make them zero, unfortunately. But the idea of taking the derivative is in fact the right idea. So, we can calculate the derivative.
And that's exactly what happens here; you can convince yourself. If you plot the derivative, it looks like that. And as you would hope, wherever the minimum is, the derivative is zero; it's crossing the x-axis right there. In this simple case, you actually could solve for that directly. So, let's say you have the derivative. How can you use it? What is the value of a derivative? What does it tell you?
>> You use a gradient descent algorithm.
>> You are ten steps ahead of me, my friend. I just want the basic answer: what good is a derivative? When you calculate the derivative of something at a particular point, what does it tell you?
>> The rate of change of the function at the place you are.
>> Correct, exactly right. The derivative, the slope, tells us the change in the function for a very small increase in w. This is high school calculus; I'm just doing a quick refresher. What that means is that if the derivative is positive, increasing w slightly will increase the function. So, if you're here and you calculate the derivative, the slope is positive, which means that if you go slightly in this direction, the function is going to get higher. Similarly, if it's negative, say over here, the slope points the other way, which means that if you increase w, if you go in this direction, the function is going to decrease. And if it's kind of close to zero, it means that changing w slightly won't change anything.
So, if you're at a point like this, changing w slightly won't change anything. All right? That's it. And this immediately suggests an algorithm for minimizing g(w). Start with some random point w, and calculate the derivative at that point. Once we do that, there are three possibilities: it could be positive, negative, or close to zero. If it's positive, we know that increasing w will increase the function. But we want to decrease the function, we want to minimize it, which means we should not be increasing w. We should be doing what here?
>> Decrease.
>> Yes. And similarly, if it's negative, what should we do?
>> Increase.
>> Exactly. So, in the first case, you reduce w slightly. In the second case, you increase w slightly. And if the derivative is close to zero, you just stop, because there's nothing else you can do.

This is the basic intuition behind how GPT-4 was built, which is kind of shocking if you think about it. It means that all the heavy-duty optimization machinery people have figured out over the decades is mostly not used. This algorithm is what's being used, with some flavors on top of it. So, back to this: you do that, and if you've run out of time or compute, just stop. Otherwise, go back to step one and try again. Of course, if the derivative is close to zero, you've got to stop anyway.
>> Is there a concern about potentially hitting a local minimum there?
>> It's coming. Okay? So, that's the algorithm.
It's going to find you some point where the derivative is close to zero. Okay? This is called gradient descent, this little algorithm. And this very PowerPoint-y, MBA-style table can be collapsed into one little expression. It basically says: calculate the derivative, multiply it by a small number, which we'll get to in a second, and then the new w is the old w minus that little number times the derivative:

w_new = w_old - alpha * g'(w_old)

This little one-line formula is basically gradient descent. And what you should do, just to build your intuition, is make sure that the three possibilities in the table map nicely onto this expression; this one formula really does capture all three cases.

This is when gradient descent was invented. A bit of historical fun, right?
>> The 19th century?
>> 19th century. Yeah, okay, good. Excellent guess. 1847. It was invented in 1847 by Cauchy, the great mathematician. And in fact, if you're curious, you can check out the paper; I've given it to you here for handy reference. So, 1847. GPT-4 is built using an algorithm invented in 1847, which I find astonishing, frankly, that this little thing is so capable.

Okay. So, that's gradient descent. And this little number alpha is called the learning rate. It's our way of quantifying the idea of: let's not increase or decrease w massively, let's do it slightly. Because the gradient is only valid for small movements around your current point. If you take a big step, all bets are off.
So, this alpha tells you how small a step you should take. Typically, it's set to very small values like 0.1 or 0.001, and so on. In fact, if you read deep learning papers where they've trained some big model, a lot of researchers will go very quickly to the appendix, where the authors describe exactly what learning rates were used, because the learning rate is sort of part of the IP for how the model was built. There's a lot of trial and error that goes into these learning rates.

Okay. So, that is gradient descent. If we apply this algorithm to g(w), our original function, we just keep doing this update a few times. What you'll find is that if the point we randomly pick to start from is 2.5, and we set the alpha to one and run this algorithm, it starts here, then it goes there, then there, and finally ends up here. In four or five iterations, it finds some minimum. This is obviously a very simple, well-behaved, nice little function, so you can easily optimize it. If you want, you can just go to this link; there's a nice animation of it as well.
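The little run just described fits in a few lines of code. The exact quartic and settings from the slide aren't reproduced in the transcript, so the function, the starting point, and the learning rate below are stand-ins; the only thing that matters is the one-line update.

```python
# Sketch of 1-D gradient descent. The quartic on the slide isn't given in the
# transcript, so g below is a stand-in with its single minimum near w = -1.5;
# alpha and the starting point are likewise illustrative, not the slide's values.

def g(w):
    return w**4 + 2 * w**3 + 2 * w**2 + 6 * w

def dg_dw(w):                      # derivative of g, worked out by hand
    return 4 * w**3 + 6 * w**2 + 4 * w + 6

w = 2.5                            # starting point
alpha = 0.03                       # learning rate: take small steps

for step in range(100):
    grad = dg_dw(w)
    if abs(grad) < 1e-6:           # derivative is essentially zero: stop
        break
    w = w - alpha * grad           # the one-line gradient descent update

print(w, g(w))                     # ends up near w = -1.5, the minimum of this stand-in g
```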
Okay. All right. Before we actually go to the multi-variable case, I wanted to get to the question posed earlier about local minima... actually, you know what, I think I may have some slides on it, so I'll come back to that. So, we looked at a toy example where there was only one variable. What if it was GPT-3? GPT-3 has 175 billion parameters. 175 billion; and for GPT-4 they haven't published it, so we don't know, but it's supposed to be about eight times as much. The number of parameters is massive. So, basically, our loss function has billions of variables, billions of w's, that we need to minimize over. For that, we need the notion of a partial derivative.

Let's take baby steps and say: what if you have a two-variable function, something very simple like this? What we can do is calculate the partial derivative of g with respect to each of the w's. The partial derivative, just to quickly refresh your memory: you take the function, you pretend that everything other than w1 is a constant, so the function becomes a function of just the one variable w1, and then you differentiate it like you would anything else. You get something, and that's this thing here. Then you do the same thing for w2, you get this thing here, and you stack them up in a nice list. This is the vector of partial derivatives.

How should we interpret it? The same way as before. For a small change in w1, keeping w2 and everything else fixed, how does the function change? And similarly for w2, and all the way up to w number 175 billion. Same thing. Now, when you have these functions with many variables, many w's, since we have a partial derivative for each one of those w's, we stack them all up into a nice vector of derivatives, and this vector is called the gradient. And it's denoted using this symbol. Anyone know what the symbol is called?
>> Nabla.
>> Yeah?
>> Laplacian?
>> Maybe, maybe that's related. But the one I'm familiar with is nabla. I think the upside-down triangle is called nabla, if I recall, and delta is the other one. Am I right? Thank you. He's my go-to. So, yeah: the gradient, we just call it the gradient, and it's written like this.

All right. So, what we do is simply gradient descent on every one of the w's, each using its own partial derivative. In one gradient step, we update w1 using this formula and w2 using this formula. Finished. We've just generalized gradient descent to an arbitrary number of variables. And of course, as before, this can be summarized compactly as a vector formula. Let me just write this out. What's going on here is that I have

new w1 = old w1 - alpha * (partial derivative of g with respect to w1)
new w2 = old w2 - alpha * (partial derivative of g with respect to w2)

and all we're doing is stacking these up into a vector: the old vector of w's, minus alpha times the vector of partial derivatives. So, this can be written as just: the new vector w equals the old vector w minus alpha times the gradient. Finished. And you can see that for GPT-3, this vector is going to be 175 billion entries long. But whether it's two or 175 billion, who cares? It's the same thing, right?
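In code, the vector version looks almost identical to the single-variable version. The two-variable function from the slide isn't given in the transcript, so the bowl-shaped g below is an illustrative stand-in.

```python
import numpy as np

# Gradient descent with a vector of weights. The function g is a made-up,
# well-behaved stand-in; its minimum sits at w = [1.0, -0.5].

def g(w):                               # w is the vector [w1, w2]
    w1, w2 = w
    return (w1 - 1.0) ** 2 + 2.0 * (w2 + 0.5) ** 2

def grad_g(w):                          # the gradient: one partial derivative per w
    w1, w2 = w
    return np.array([2.0 * (w1 - 1.0),       # dg/dw1
                     4.0 * (w2 + 0.5)])      # dg/dw2

w = np.array([3.0, 2.0])                # starting point
alpha = 0.1

for step in range(200):
    grad = grad_g(w)
    if np.linalg.norm(grad) < 1e-6:     # gradient is essentially the zero vector: stop
        break
    w = w - alpha * grad                # same one-line update, now on a whole vector

print(w)                                # ends up near [1.0, -0.5]
```

Whether w has two entries or 175 billion, the update line is literally the same.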
Okay, so that's what we have here. I'm really thrilled, by the way, with how this whole iPad business is working out; I was a little worried about it. Okay. So, if you look at two dimensions and actually plot the function, with the first w on one axis and the second w on the other, this surface is the function g(w); think of it as the loss function. You're trying to find the minimum here, and this is how gradient descent will progress if you're starting from this point. Or you can look at it top-down into the function, which is what this picture shows: gradient descent starting from there and working its way down, from here all the way to the center.

Okay. All right, local minima. Now, gradient descent will just stop near, hopefully, a minimum. But the problem is it may not be a global minimum. It may not even be a minimum. So, let's see what I'm talking about here. Here are some possibilities. Let's take a simple function: this axis is g(w), this axis is w, and it turns out the function actually looks like this. You can see that this point here is a local minimum. This one is a local minimum. So is this one; there are lots of local minima here, and a lot of local minima over here, too. These are all places where the derivative is going to be zero. So, if you run gradient descent and it stops because the gradient reached zero, you could be in any of these places. There's no guarantee. This one in the picture happens to be maybe the global minimum, because it's the lowest of the lot. Right?
But there's no guarantee you're actually going to get there. There's not even a guarantee you're going to end up in any of these places, because you could literally land in this thing here, where the curve sort of takes a break and then continues on down. That, by the way, is called a saddle point; I drew it badly, but this pattern of coming in, flattening out, and going down again is called a saddle point. So, gradient descent can stop at a saddle point, it can stop at some local minimum, and there's no guarantee it's going to be the global one.

But it turns out this has not mattered, and there are a whole bunch of reasons why. When you have these very complicated neural networks, they're very complex functions, and even finding a decent solution is actually really good for solving the problem. You don't have to get to the best possible solution. In fact, if you do get to the best possible solution, you run the risk of overfitting. So, that's one reason. Empirically, what we have seen is that not worrying about local minima versus global minima has not hurt us, because these things are amazing. With GPT-4, they probably just stopped somewhere; it probably wasn't even a local minimum. They were like, "All right, it's been running for six days, we've spent two million dollars, let's stop." Because these runs are very expensive. And that's still so magical: you don't need to get anywhere close to even a local minimum.
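You can see the "no guarantee" point by running plain gradient descent on a wiggly function from different starting points. The function drawn on the board isn't in the transcript, so the one below is a stand-in; the point is simply that the algorithm stops at whichever nearby flat spot it falls into.

```python
import numpy as np

# Stand-in wiggly function with several local minima. Plain gradient descent
# lands in a different valley depending on where it starts; none of them
# need be the global minimum.

def g(w):
    return np.sin(3 * w) + 0.1 * w**2

def dg_dw(w):
    return 3 * np.cos(3 * w) + 0.2 * w

def gradient_descent(w, alpha=0.02, steps=500):
    for _ in range(steps):
        grad = dg_dw(w)
        if abs(grad) < 1e-8:
            break
        w = w - alpha * grad
    return w

for start in (2.0, -3.0, -1.0):
    w_end = gradient_descent(start)
    print(f"start {start:5.1f} -> stops at w = {w_end:6.2f}, g(w) = {g(w_end):6.2f}")
```

Three starting points, three different stopping points, and only one of them is the lowest valley of this particular stand-in function.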
But there's another interesting point, which I've read about, and by the way this is a very hot area of research to figure out exactly. People basically hypothesize the following. For you to be at a local minimum, think about what that means: you're standing at a particular point and, in every direction you look, things are sloping upward. Only if everything is sloping upward all around you can you be at a local minimum, by definition. But if you have a billion dimensions, what are the odds that you're standing at a point where every one of those billion dimensions is sloping upward? The odds are really low. Chances are some of them are going up, some of them are going down, others are curving off some other way; it's going to be crazy. So, in some sense, the best you can hope for in these very high-dimensional situations is probably a saddle point. And it turns out that's good enough. So, for those reasons, we are content with just running gradient descent, with some tweaks which I'll get to in a second, and it performs really admirably.
>> How does alpha depend on how much compute you have? Would you set the learning rate based on that, or not really?
>> No. The learning rate is really a measure of... it's sort of like this. When you're at a point where you think the gradient is looking nice, meaning that if you take a step in that direction the function will go down, and you further believe it's going to keep going down in that direction for a while, then you're very confident about taking a big step. But if you're thinking, "I don't know, maybe I take a little step and then I have to go this other way, I can't keep going straight," then you don't want to take a big step, because then you'd have to backtrack.
So, those kinds of considerations go into the learning rate. That's the rough answer to your question: it's not so much determined by compute and bandwidth and things like that. But again, it's a complicated thing, because sometimes, with a given amount of compute, if you have a particular kind of data you can get away with very aggressive learning rates. It tends to be a bit jumbled up and complicated, but that's the quick, surface-level idea of what's going on.

Okay, 9:31. Anyway, folks, this lecture is probably one of the driest of the semester, because I have to go through all the concepts. Once we start doing the Colabs, things get a lot more lively.

All right. So, now let's talk about minimizing a loss function with gradient descent. Here is our little binary cross-entropy loss function that we saw before. This is what we want to minimize. So, if you look at this thing, where are the variables we need to change to minimize this function? Folks, don't look at your phones. Laptops and iPads are fine; don't look at your phones.
>> Sorry, we've kind of abstracted the variables w, but just to bring it back, those are actually the weights in the neural network, right?
>> Yeah, the weights and the biases; I'm just calling them all weights.
>> So, the output of this minimization is going to be the actual weights in your model, right?
>> Exactly. Exactly right. The whole name of the game is to find the weights.
And so, for example, when you see in the press that Meta has essentially made the weights of Llama 2 or something available, that's basically what they've done: they've published the weights.
>> The reason that's so valuable is...
>> Microphone, please. Go.
>> Because if you have a billion parameters, the compute time on that is horrendous and expensive. That's why the weights are so valuable.
>> Correct. The weights are the crown jewels, because they are the result of a lot of money and time and smartness being spent. There's a separate question of why they're making them open source, which I'm happy to chat about offline.

All right, cool. So, what are the variables we need to change to minimize the loss? It's basically the parameters, and they're hiding inside the model term. Because what is the model? The model is some function like that. If you look at the simple GPA-and-experience example we went through on Monday, we figured out that the thing that comes out at the end is a complicated function of all the x's and the w's and so on, and that complicated expression is what shows up inside the loss. So, the w's in there are the variables we need to change to minimize the loss function. And it's important for you to note and understand that the values of x and y and so on are just data; you're not optimizing anything there. What you're optimizing is the w's. The weights.
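A tiny sketch of that point, not from the course's Colab: with the model written out, the loss really is just a function of the weights, and the x's and y's sit inside it as fixed data. A single sigmoid neuron on two made-up, standardized features stands in for "the model"; the four patients below are fabricated.

```python
import numpy as np

X = np.array([[63.0, 145.0],     # made-up [age, resting blood pressure]
              [41.0, 120.0],
              [58.0, 138.0],
              [35.0, 110.0]])
X = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize the features
y = np.array([1.0, 0.0, 1.0, 0.0])           # 1 = diagnosed within a year, 0 = not

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b):
    # Binary cross-entropy, viewed purely as a function of the weights w and bias b.
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def grads(w, b):
    # Gradient of the loss with respect to the weights (standard logistic-regression algebra).
    p = sigmoid(X @ w + b)
    return X.T @ (p - y) / len(y), np.mean(p - y)

w, b, alpha = np.zeros(2), 0.0, 0.1
print("loss before:", loss(w, b))
for _ in range(500):
    gw, gb = grads(w, b)
    w, b = w - alpha * gw, b - alpha * gb    # the same gradient descent update as before
print("loss after: ", loss(w, b))
```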
Okay. So, imagine replacing the model term with the mathematical expression above wherever it appears in the loss function. Once you do that, your loss function is just a good old function of the w's. The fact that it's a loss function is kind of irrelevant; it's just a function. And since it's just a good old function of the w's, you can apply gradient descent to it as we normally would. It's no big deal.

Which brings us to something called backpropagation. If you remember nothing else about backpropagation, just remember this: never use the word backpropagation again. Only use the word backprop. Then you're hip and cool to the deep learning community. Backprop.

Okay. All right. So, what is backprop? Backprop is a very efficient way to compute the gradient of the loss function. When you have this loss function, and let's say you have a billion w's and ten million data points, so the little n we saw is ten million, that is a lot of computation, and that's just for one step of gradient descent. So, backprop is a very efficient and clever way to compute the gradient of the loss function, which takes advantage of the fact that what we have here is not some arbitrary model. It's a model that came from a particular kind of neural network, which has layers one after the other and an output at the very end. What backprop does is organize the computation in the form of something called a computational graph, and the book has a good discussion of it. We start at the very end and calculate the gradient of the loss with respect to the output. Then we move left: we calculate the gradient of that output with respect to the output of the prior hidden layer. Step to the left again: calculate the gradient of the current thing with respect to the previous layer. You get the idea, right?
It's iterative, and it moves backwards, and by doing so you never wastefully repeat the same computation twice. That's the big advantage: you calculate something once and reuse it many, many times. The second advantage is that if you organize the computation this way, it just becomes a sequence of matrix multiplications. And because it's a sequence of matrix multiplications that eliminates redundant calculations, and best of all, there are these things called GPUs, graphics processing units, originally invented to accelerate video game rendering. As it turns out, the core math operation in video game rendering is basically matrix multiplication, linear algebra operations. So at some point someone had the bright idea: for deep learning, for calculating gradients and so on, we need to do matrix multiplications, and here is specialized hardware that does a fast job of matrix multiplications; can we use this for that? And they did. And all hell broke loose. That's literally what happened, and that's why Nvidia is valued at, what, 1.5 trillion dollars or something. So, yes, GPUs are really good at this. And backprop, the way you do backprop, plus running it on GPUs, leads to fast calculation of loss function gradients. If this were not true, this class would not exist, because there would not have been a deep learning revolution. This is a fundamental, seminal reason.
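A hand-rolled sketch of the idea, not the course's heart-disease model: a tiny one-hidden-layer network whose data, sizes, and weights are all made up. The things to notice are that each backward step reuses quantities already computed in the forward pass, and that every step is a matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up batch: 5 examples, 2 features, binary labels
X = rng.normal(size=(5, 2))
y = rng.integers(0, 2, size=5).astype(float)

# Tiny network: 2 inputs -> 3 ReLU hidden units -> 1 sigmoid output
W1 = rng.normal(scale=0.5, size=(2, 3)); b1 = np.zeros(3)
w2 = rng.normal(scale=0.5, size=3);      b2 = 0.0

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# ---- forward pass: save the intermediates, the backward pass will reuse them ----
Z1 = X @ W1 + b1           # hidden-layer pre-activations
A1 = np.maximum(Z1, 0)     # ReLU
z2 = A1 @ w2 + b2          # output pre-activation
p  = sigmoid(z2)           # predicted probability
loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# ---- backward pass: walk the graph right to left, one matrix product per step ----
dz2 = (p - y) / len(y)     # dLoss/dz2 (BCE through a sigmoid simplifies to this)
dw2 = A1.T @ dz2           # reuses A1 from the forward pass
db2 = dz2.sum()
dA1 = np.outer(dz2, w2)    # push the gradient back through the output weights
dZ1 = dA1 * (Z1 > 0)       # ReLU gate: gradient flows only where Z1 was positive
dW1 = X.T @ dZ1
db1 = dZ1.sum(axis=0)

# Sanity check one entry against a brute-force finite difference
def loss_at(W1_try):
    A = np.maximum(X @ W1_try + b1, 0)
    q = sigmoid(A @ w2 + b2)
    return -np.mean(y * np.log(q) + (1 - y) * np.log(1 - q))

eps = 1e-6
W1_bump = W1.copy(); W1_bump[0, 0] += eps
print(dW1[0, 0], (loss_at(W1_bump) - loss_at(W1)) / eps)   # the two numbers should be very close
```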
So, the book has a bunch of 1776 00:59:59,760 --> 01:00:01,880 detail, 1777 01:00:00,840 --> 01:00:05,600 um, 1778 01:00:01,880 --> 01:00:07,559 and I actually hand- 1779 01:00:05,599 --> 01:00:09,599 worked out an example 1780 01:00:07,559 --> 01:00:11,679 of calculating a gradient the 1781 01:00:09,599 --> 01:00:13,400 old-fashioned way and calculating it 1782 01:00:11,679 --> 01:00:14,879 using backprop. 1783 01:00:13,400 --> 01:00:17,200 So, take a look at it. I'll post it on 1784 01:00:14,880 --> 01:00:18,519 Canvas and you will understand exactly 1785 01:00:17,199 --> 01:00:21,519 where the savings come from, where the 1786 01:00:18,519 --> 01:00:22,800 efficiency gains come from. Okay? 1787 01:00:21,519 --> 01:00:25,239 Because of time, I'm not going to get 1788 01:00:22,800 --> 01:00:25,240 into it now. 1789 01:00:26,400 --> 01:00:30,400 All right. Any questions so far? 1790 01:00:28,840 --> 01:00:32,600 Yep. 1791 01:00:30,400 --> 01:00:34,559 Sorry, a follow-up: so, we've 1792 01:00:32,599 --> 01:00:36,239 done gradient descent, which is 1793 01:00:34,559 --> 01:00:37,840 different than calculation of the 1794 01:00:36,239 --> 01:00:39,239 gradient of the loss function. What 1795 01:00:37,840 --> 01:00:41,039 is the purpose of the calculation of the 1796 01:00:39,239 --> 01:00:42,519 gradient of the loss function? You 1797 01:00:41,039 --> 01:00:44,159 calculate the gradient because the 1798 01:00:42,519 --> 01:00:47,039 fundamental operation of gradient 1799 01:00:44,159 --> 01:00:48,199 descent is to take your current value of 1800 01:00:47,039 --> 01:00:50,159 W 1801 01:00:48,199 --> 01:00:52,919 and modify it slightly, and the 1802 01:00:50,159 --> 01:00:56,000 modification is old value minus learning 1803 01:00:52,920 --> 01:00:56,000 rate times gradient. 1804 01:01:03,360 --> 01:01:06,280 It'd be cool, right, if I say, "Go 1805 01:01:04,960 --> 01:01:08,400 back five slides to this thing," and 1806 01:01:06,280 --> 01:01:09,880 it just goes back. Product idea. Anyone? 1807 01:01:08,400 --> 01:01:11,840 Startups? 1808 01:01:09,880 --> 01:01:14,320 So. 1809 01:01:11,840 --> 01:01:15,360 So, this one. 1810 01:01:14,320 --> 01:01:16,920 So, this is the fundamental step of 1811 01:01:15,360 --> 01:01:19,280 gradient descent. 1812 01:01:16,920 --> 01:01:20,720 So, this is the current value of W. 1813 01:01:19,280 --> 01:01:22,000 You calculate the gradient at that 1814 01:01:20,719 --> 01:01:24,159 current value, 1815 01:01:22,000 --> 01:01:26,199 multiply it by alpha, do this thing, and 1816 01:01:24,159 --> 01:01:27,440 you get the new value. 1817 01:01:26,199 --> 01:01:29,879 And you keep repeating. 1818 01:01:27,440 --> 01:01:32,240 Right, but G(W), 1819 01:01:29,880 --> 01:01:33,559 that's not the loss function. 1820 01:01:32,239 --> 01:01:34,039 >> It is the loss function. That is the 1821 01:01:33,559 --> 01:01:35,960 loss function. 1822 01:01:34,039 --> 01:01:37,880 >> Yeah, right. Here, I'm just using G as 1823 01:01:35,960 --> 01:01:39,880 an arbitrary function 1824 01:01:37,880 --> 01:01:41,599 just to demonstrate the point. But 1825 01:01:39,880 --> 01:01:42,880 when you're optimizing, when you're 1826 01:01:41,599 --> 01:01:45,519 training a neural network, what you're 1827 01:01:42,880 --> 01:01:46,800 actually doing is minimizing a loss 1828 01:01:45,519 --> 01:01:49,320 function. Right. 1829 01:01:46,800 --> 01:01:51,360 >> Loss of W. Sorry, I got things mixed up. 1830 01:01:49,320 --> 01:01:53,000 Thank you.
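In code, that fundamental step is just the update line repeated over and over; here is a toy sketch with a made-up one-dimensional function standing in for G(W).

```python
def g(w):
    # toy stand-in for G(W): a simple bowl with its minimum at w = 3
    return (w - 3.0) ** 2

def grad_g(w):
    # gradient of g
    return 2.0 * (w - 3.0)

w = 10.0      # current value of W (any starting point)
alpha = 0.1   # learning rate
for step in range(100):
    w = w - alpha * grad_g(w)   # old value minus learning rate times gradient

print(w)      # ends up very close to 3, the minimizer of g
```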
1831 01:01:51,360 --> 01:01:54,680 >> Yeah. 1832 01:01:53,000 --> 01:01:55,639 Uh how do we define the initial weights 1833 01:01:54,679 --> 01:01:57,279 for the neural network? 1834 01:01:55,639 --> 01:02:01,639 >> Ah. 1835 01:01:57,280 --> 01:02:01,640 So, yeah, the initial weights um 1836 01:02:02,199 --> 01:02:04,919 So, there's a there are many ways to So, 1837 01:02:04,000 --> 01:02:06,119 first of all, they are initialized 1838 01:02:04,920 --> 01:02:08,119 randomly. 1839 01:02:06,119 --> 01:02:09,920 Uh but randomly doesn't mean you can 1840 01:02:08,119 --> 01:02:11,839 just pick any random weight. There are 1841 01:02:09,920 --> 01:02:13,519 actually some good ways to randomly pick 1842 01:02:11,840 --> 01:02:16,240 the weights. Uh those are called 1843 01:02:13,519 --> 01:02:18,199 initialization schemes. Um and there are 1844 01:02:16,239 --> 01:02:19,359 a bunch of very effective initialization 1845 01:02:18,199 --> 01:02:21,119 schemes people have figured out over the 1846 01:02:19,360 --> 01:02:22,880 years and those things are baked into 1847 01:02:21,119 --> 01:02:24,880 Keras as the default. 1848 01:02:22,880 --> 01:02:26,079 So, the Keras, I believe, uses something 1849 01:02:24,880 --> 01:02:27,960 called the 1850 01:02:26,079 --> 01:02:31,199 uh He initialization, H E 1851 01:02:27,960 --> 01:02:33,039 initialization, or the Xavier Glorot 1852 01:02:31,199 --> 01:02:33,839 initialization. I wouldn't worry about 1853 01:02:33,039 --> 01:02:36,000 it. Just go with the default 1854 01:02:33,840 --> 01:02:37,519 initialization. 1855 01:02:36,000 --> 01:02:38,679 The reason why they have to be very 1856 01:02:37,519 --> 01:02:40,880 careful about how these weights are 1857 01:02:38,679 --> 01:02:43,039 initialized is because if you have a 1858 01:02:40,880 --> 01:02:45,200 very big network and if you initialize 1859 01:02:43,039 --> 01:02:47,679 badly then 1860 01:02:45,199 --> 01:02:48,919 the gradient will just explode as you 1861 01:02:47,679 --> 01:02:50,440 calculate it. 1862 01:02:48,920 --> 01:02:52,480 The earlier layers, the weights will 1863 01:02:50,440 --> 01:02:53,720 have massive gradients or the gradients 1864 01:02:52,480 --> 01:02:55,119 will vanish. 1865 01:02:53,719 --> 01:02:56,319 So, they're called the exploding 1866 01:02:55,119 --> 01:02:58,239 gradient problem or the vanishing 1867 01:02:56,320 --> 01:02:59,240 gradient problem. To avoid all those 1868 01:02:58,239 --> 01:03:00,719 things, researchers have figured out 1869 01:02:59,239 --> 01:03:03,599 some clever way to initialize so that 1870 01:03:00,719 --> 01:03:05,359 it's well-behaved throughout. 1871 01:03:03,599 --> 01:03:08,400 Yep. 1872 01:03:05,360 --> 01:03:10,360 If using um backprops and GPUs was so 1873 01:03:08,400 --> 01:03:12,440 critical, I'm just curious like who 1874 01:03:10,360 --> 01:03:14,760 first did it and when? Was this like a 1875 01:03:12,440 --> 01:03:15,119 couple years ago? Was it a company? Was 1876 01:03:14,760 --> 01:03:17,520 it a Yeah. 1877 01:03:15,119 --> 01:03:20,199 >> Yeah. Well, GPUs have been used for deep 1878 01:03:17,519 --> 01:03:22,400 learning, I want to say um 1879 01:03:20,199 --> 01:03:26,279 I think the first uh case may have been 1880 01:03:22,400 --> 01:03:27,920 in the mid 2005, 2006 sort of thing. 
1881 01:03:26,280 --> 01:03:30,000 But I would say that it sort of burst 1882 01:03:27,920 --> 01:03:32,800 out onto the world stage and made 1883 01:03:30,000 --> 01:03:35,000 everyone take notice when uh a deep 1884 01:03:32,800 --> 01:03:38,519 learning model called AlexNet 1885 01:03:35,000 --> 01:03:40,440 in 2012 won a very famous 1886 01:03:38,519 --> 01:03:43,320 computer vision competition. 1887 01:03:40,440 --> 01:03:45,079 Uh and it beat the and it set a world 1888 01:03:43,320 --> 01:03:46,200 record for how good it was. 1889 01:03:45,079 --> 01:03:48,039 Uh and that's when everyone was like, 1890 01:03:46,199 --> 01:03:49,119 "Hey, what is this thing?" And that's 1891 01:03:48,039 --> 01:03:50,719 really when it burst onto the world 1892 01:03:49,119 --> 01:03:51,880 stage. I'll talk a bit more about it 1893 01:03:50,719 --> 01:03:54,119 when I get into the computer vision 1894 01:03:51,880 --> 01:03:55,480 segment of the class. 1895 01:03:54,119 --> 01:03:58,759 But you can Google AlexNet and you'll 1896 01:03:55,480 --> 01:03:58,760 find a whole bunch of history around it. 1897 01:03:59,599 --> 01:04:04,920 I believe that if you do this, is it 1898 01:04:00,760 --> 01:04:06,040 true that could get to a global minima 1899 01:04:04,920 --> 01:04:07,840 that would mean there would be no 1900 01:04:06,039 --> 01:04:09,840 hallucinations? 1901 01:04:07,840 --> 01:04:11,920 Aha, good question. 1902 01:04:09,840 --> 01:04:13,120 So, if it is perfect 1903 01:04:11,920 --> 01:04:14,519 if you get to a global minimum. First of 1904 01:04:13,119 --> 01:04:15,880 all, global minima doesn't mean the 1905 01:04:14,519 --> 01:04:17,199 model is perfect, right? It may still 1906 01:04:15,880 --> 01:04:18,400 have some loss. 1907 01:04:17,199 --> 01:04:21,119 Um 1908 01:04:18,400 --> 01:04:24,000 but global minima is going to be on the 1909 01:04:21,119 --> 01:04:24,000 training data. 1910 01:04:24,199 --> 01:04:28,519 You can imagine that the test data, 1911 01:04:26,280 --> 01:04:29,480 future data has its own loss function, 1912 01:04:28,519 --> 01:04:31,000 right? 1913 01:04:29,480 --> 01:04:34,599 So, what is minimum here may not be 1914 01:04:31,000 --> 01:04:34,599 minimum there. That's the problem. 1915 01:04:36,440 --> 01:04:40,280 Is that a comment? No, okay. 1916 01:04:38,800 --> 01:04:42,280 Just saying that 1917 01:04:40,280 --> 01:04:43,240 uh that would mean that also you can be 1918 01:04:42,280 --> 01:04:45,200 over-fitting for 1919 01:04:43,239 --> 01:04:47,119 >> Correct. Exactly. Exactly. So, if you 1920 01:04:45,199 --> 01:04:48,960 overdo, if you find the best thing in 1921 01:04:47,119 --> 01:04:50,960 the training function, chances are it 1922 01:04:48,960 --> 01:04:52,000 doesn't match the best thing of the test 1923 01:04:50,960 --> 01:04:53,358 data. 1924 01:04:52,000 --> 01:04:55,880 So, on the test data, you're actually 1925 01:04:53,358 --> 01:04:55,880 doing badly. 1926 01:04:56,440 --> 01:05:00,880 Okay. So, 1927 01:04:57,960 --> 01:05:00,880 uh come back to this. 1928 01:05:03,800 --> 01:05:08,240 Okay. Now, uh the final uh twist to the 1929 01:05:06,199 --> 01:05:10,039 tail here uh we're going to go from 1930 01:05:08,239 --> 01:05:11,839 something gradient descent to something 1931 01:05:10,039 --> 01:05:14,639 called stochastic gradient descent. And 1932 01:05:11,840 --> 01:05:16,400 stochastic gradient descent or SGD is 1933 01:05:14,639 --> 01:05:17,480 the workhorse for all deep learning. 1934 01:05:16,400 --> 01:05:19,639 Okay? 
1935 01:05:17,480 --> 01:05:20,679 And funnily enough, SGD is simpler than 1936 01:05:19,639 --> 01:05:21,839 GD. 1937 01:05:20,679 --> 01:05:23,799 Okay? Just when you thought it couldn't 1938 01:05:21,840 --> 01:05:25,280 get simpler, right? 1939 01:05:23,800 --> 01:05:27,400 Okay. So, 1940 01:05:25,280 --> 01:05:28,640 So, for large data sets, computing the 1941 01:05:27,400 --> 01:05:31,440 gradient of the loss function can be 1942 01:05:28,639 --> 01:05:32,920 very expensive. Right? Needless to say. 1943 01:05:31,440 --> 01:05:34,519 Because it has to be done at every step 1944 01:05:32,920 --> 01:05:36,760 and the cardinality of the data set is 1945 01:05:34,519 --> 01:05:38,079 really big. Right? And you may have, I 1946 01:05:36,760 --> 01:05:39,480 don't know, billions of parameters. It's 1947 01:05:38,079 --> 01:05:43,119 just very, very 1948 01:05:39,480 --> 01:05:45,679 tough to compute it even with backprop. 1949 01:05:43,119 --> 01:05:47,519 So, the solution is at each iteration, 1950 01:05:45,679 --> 01:05:50,119 when I say iteration, I'm talking about 1951 01:05:47,519 --> 01:05:52,599 this step of gradient descent. 1952 01:05:50,119 --> 01:05:54,599 Instead of using all the data 1953 01:05:52,599 --> 01:05:57,358 instead of calculating the loss function 1954 01:05:54,599 --> 01:05:59,480 by averaging the loss across all N data 1955 01:05:57,358 --> 01:06:01,880 points and then calculating the gradient 1956 01:05:59,480 --> 01:06:04,440 of that thing, what you do is you just 1957 01:06:01,880 --> 01:06:06,480 choose a small sample randomly. You 1958 01:06:04,440 --> 01:06:08,400 choose just a few of the N observations 1959 01:06:06,480 --> 01:06:10,159 and we call it a mini batch. 1960 01:06:08,400 --> 01:06:11,599 So, for example, the number of data 1961 01:06:10,159 --> 01:06:12,639 points you may you may have 10 billion 1962 01:06:11,599 --> 01:06:14,000 data points 1963 01:06:12,639 --> 01:06:16,559 but in every iteration, you may 1964 01:06:14,000 --> 01:06:18,119 literally grab just like 32 or 64, 1965 01:06:16,559 --> 01:06:20,199 something really small. 1966 01:06:18,119 --> 01:06:21,199 Like absurdly small. 1967 01:06:20,199 --> 01:06:23,000 Okay? 1968 01:06:21,199 --> 01:06:24,799 And then you pretend that okay, that's 1969 01:06:23,000 --> 01:06:27,159 all the data I have. You calculate the 1970 01:06:24,800 --> 01:06:30,359 loss, find the gradient and just use 1971 01:06:27,159 --> 01:06:33,199 that here instead. 1972 01:06:30,358 --> 01:06:36,799 Okay? So, this is called stochastic 1973 01:06:33,199 --> 01:06:39,159 gradient descent. So, strictly speaking 1974 01:06:36,800 --> 01:06:40,680 theoretically, SGD uses just one data 1975 01:06:39,159 --> 01:06:42,079 point. 1976 01:06:40,679 --> 01:06:44,599 But in practice, we use what's called a 1977 01:06:42,079 --> 01:06:47,039 mini batch, 32, 64, whatever. 1978 01:06:44,599 --> 01:06:48,319 Uh and so, mini batch gradient descent 1979 01:06:47,039 --> 01:06:51,719 is just loosely called stochastic 1980 01:06:48,320 --> 01:06:51,720 gradient descent, SGD. 1981 01:06:52,719 --> 01:06:57,559 So, and SGD, as it turns out 1982 01:06:55,679 --> 01:06:58,799 you can see it's clearly very efficient, 1983 01:06:57,559 --> 01:07:00,960 right? Because 1984 01:06:58,800 --> 01:07:02,519 it's just processing a few at a time. 
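As a rough sketch of what "grab just 32" means, assuming a toy one-parameter squared loss rather than the course's network: compare the gradient computed from all N points with the gradient computed from a random mini-batch.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000_000                     # pretend this is all the data
x = rng.normal(loc=3.0, size=N)

def full_gradient(w):
    # gradient of the average squared loss (1/N) * sum of (w - x_i)^2 over ALL N points
    return 2.0 * np.mean(w - x)

def minibatch_gradient(w, batch_size=32):
    # grab just 32 points at random and pretend that's all the data you have
    batch = x[rng.integers(0, N, size=batch_size)]
    return 2.0 * np.mean(w - batch)

w = 0.0
print(full_gradient(w))        # the "true" gradient
print(minibatch_gradient(w))   # a noisy estimate of it, from roughly 300,000x less work
```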
1985 01:07:00,960 --> 01:07:03,559 Uh and in fact, if you have a lot of 1986 01:07:02,519 --> 01:07:05,159 data 1987 01:07:03,559 --> 01:07:07,119 and you calculate the full gradient of 1988 01:07:05,159 --> 01:07:09,319 the loss function, it may not even fit 1989 01:07:07,119 --> 01:07:11,319 into memory. 1990 01:07:09,320 --> 01:07:12,880 Right? It's really problematic. But with 1991 01:07:11,320 --> 01:07:14,359 SGD, it says, "I don't care whether you 1992 01:07:12,880 --> 01:07:17,400 have a billion data points or a trillion 1993 01:07:14,358 --> 01:07:19,199 data points. Just give me 32 at a time." 1994 01:07:17,400 --> 01:07:20,720 Okay? And you just keep on doing it. 1995 01:07:19,199 --> 01:07:22,639 And 1996 01:07:20,719 --> 01:07:24,719 turns out, because not all the points 1997 01:07:22,639 --> 01:07:26,679 are used in the calculation this only 1998 01:07:24,719 --> 01:07:27,919 approximates the true gradient. Right? 1999 01:07:26,679 --> 01:07:29,919 It's only an approximation. It's not the 2000 01:07:27,920 --> 01:07:32,079 real thing. It's only an approximation. 2001 01:07:29,920 --> 01:07:33,760 But it works extremely well in practice. 2002 01:07:32,079 --> 01:07:34,960 Extremely well in practice. 2003 01:07:33,760 --> 01:07:37,359 And there's a whole bunch of research 2004 01:07:34,960 --> 01:07:39,079 that goes into why is it so effective? 2005 01:07:37,358 --> 01:07:40,920 And you know, people are discovering 2006 01:07:39,079 --> 01:07:42,599 interesting things about SGD, but we 2007 01:07:40,920 --> 01:07:44,680 don't have like a definitive theory as 2008 01:07:42,599 --> 01:07:46,039 to why it's so good yet. We have some 2009 01:07:44,679 --> 01:07:47,799 interesting, you know, uh research 2010 01:07:46,039 --> 01:07:50,000 threads that have happened. 2011 01:07:47,800 --> 01:07:51,840 And very tantalizingly, very 2012 01:07:50,000 --> 01:07:53,920 tantalizingly 2013 01:07:51,840 --> 01:07:55,640 because it's only an approximation of 2014 01:07:53,920 --> 01:07:59,480 the true gradient 2015 01:07:55,639 --> 01:08:00,480 SGD can actually escape local minima. 2016 01:07:59,480 --> 01:08:02,240 So, 2017 01:08:00,480 --> 01:08:04,159 in the in the true loss function, you're 2018 01:08:02,239 --> 01:08:06,679 at a local minimum 2019 01:08:04,159 --> 01:08:08,519 but in SGD's loss function, when you're 2020 01:08:06,679 --> 01:08:11,440 doing SGD, you're reaching the the 2021 01:08:08,519 --> 01:08:13,159 minimum of the SGD loss function 2022 01:08:11,440 --> 01:08:14,920 which actually may not be the actual 2023 01:08:13,159 --> 01:08:16,798 loss function. So, as you're moving 2024 01:08:14,920 --> 01:08:18,359 around, you're actually jumping from 2025 01:08:16,798 --> 01:08:20,359 local minima to local minima of the 2026 01:08:18,359 --> 01:08:22,039 actual loss function. 2027 01:08:20,359 --> 01:08:24,039 I know that's a mouthful. I'm happy to 2028 01:08:22,039 --> 01:08:25,319 tell you more. It's just a side thing 2029 01:08:24,039 --> 01:08:26,560 that I just wanted you to be aware of. 2030 01:08:25,319 --> 01:08:27,960 Okay? 2031 01:08:26,560 --> 01:08:30,640 One of the reasons why SGD is actually 2032 01:08:27,960 --> 01:08:33,838 effective. It's almost like you work 2033 01:08:30,640 --> 01:08:33,838 less and you do better. 2034 01:08:34,000 --> 01:08:38,159 How many times does it happen in life? 2035 01:08:35,680 --> 01:08:38,159 This is one of them. 2036 01:08:39,520 --> 01:08:44,359 Okay? Now, SGD comes in many flavors. 
2037 01:08:42,798 --> 01:08:45,680 Uh many siblings. It's got a lot of 2038 01:08:44,359 --> 01:08:47,520 siblings and variations. It's a big 2039 01:08:45,680 --> 01:08:49,838 family. Uh and we're going to use a 2040 01:08:47,520 --> 01:08:52,040 particular flavor called Adam 2041 01:08:49,838 --> 01:08:53,159 as our default in this course and I'll 2042 01:08:52,039 --> 01:08:56,000 get back to it when we get into the 2043 01:08:53,159 --> 01:08:57,119 co-labs and things like that. 2044 01:08:56,000 --> 01:08:58,159 All right. 2045 01:08:57,119 --> 01:09:00,039 Um 2046 01:08:58,159 --> 01:09:01,519 By the way 2047 01:09:00,039 --> 01:09:02,600 you know how you know all these pictures 2048 01:09:01,520 --> 01:09:04,600 I've been showing you a nice little 2049 01:09:02,600 --> 01:09:05,440 function like that, a little bowl and so 2050 01:09:04,600 --> 01:09:07,359 on. 2051 01:09:05,439 --> 01:09:08,960 This is a visualization 2052 01:09:07,359 --> 01:09:11,400 of an actual neural network loss 2053 01:09:08,960 --> 01:09:12,838 function. 2054 01:09:11,399 --> 01:09:14,920 You can see like the hills and valleys 2055 01:09:12,838 --> 01:09:16,798 and the cracks and so on and so forth. 2056 01:09:14,920 --> 01:09:18,600 Okay? And you can check out the paper to 2057 01:09:16,798 --> 01:09:19,359 get more insight into how they actually, 2058 01:09:18,600 --> 01:09:21,680 you know, came up with this 2059 01:09:19,359 --> 01:09:24,280 visualization. It's crazy. 2060 01:09:21,680 --> 01:09:25,520 It's complicated. 2061 01:09:24,279 --> 01:09:28,439 Yep. 2062 01:09:25,520 --> 01:09:30,920 So, for for SGD, do you perform the 2063 01:09:28,439 --> 01:09:32,599 iterations until you minimize the loss 2064 01:09:30,920 --> 01:09:34,440 function for each mini batch and then 2065 01:09:32,600 --> 01:09:36,520 move to another mini batch? Yeah, so 2066 01:09:34,439 --> 01:09:37,719 what you do is you take each mini batch 2067 01:09:36,520 --> 01:09:39,440 and then 2068 01:09:37,720 --> 01:09:41,560 you calculate the loss for the mini 2069 01:09:39,439 --> 01:09:43,679 batch, you find the gradient. 2070 01:09:41,560 --> 01:09:45,319 And use the gradient and update the W. 2071 01:09:43,680 --> 01:09:47,119 Then you pick up the next mini batch. So 2072 01:09:45,319 --> 01:09:48,920 you don't you don't pick a mini batch 2073 01:09:47,119 --> 01:09:50,920 and try to perform the iterations on 2074 01:09:48,920 --> 01:09:52,838 that mini batch until you reach the 2075 01:09:50,920 --> 01:09:54,840 You Each mini batch, one iteration. Each 2076 01:09:52,838 --> 01:09:56,359 mini batch, one iteration. Because if 2077 01:09:54,840 --> 01:09:57,600 you do a lot of iterations on one mini 2078 01:09:56,359 --> 01:09:58,759 batch, 2079 01:09:57,600 --> 01:09:59,640 first of all, you'll never be sure that 2080 01:09:58,760 --> 01:10:00,960 you're going to find any optimal 2081 01:09:59,640 --> 01:10:03,079 solution because you're not guaranteed 2082 01:10:00,960 --> 01:10:04,039 of any global minima. And secondly, it's 2083 01:10:03,079 --> 01:10:05,960 much better for you to get new 2084 01:10:04,039 --> 01:10:07,399 information constantly because what you 2085 01:10:05,960 --> 01:10:09,439 can do is you can revisit that mini 2086 01:10:07,399 --> 01:10:10,799 batch later on. 2087 01:10:09,439 --> 01:10:13,039 Right? 
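To make the "one update per mini-batch, then move on" answer concrete, here is a toy sketch of the loop, with a made-up linear model rather than the heart-disease network; an epoch is simply one full pass over all the mini-batches, so each mini-batch does get revisited, just on later passes.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size):
    # shuffle once per pass, then hand out consecutive chunks of rows
    idx = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        take = idx[start:start + batch_size]
        yield X[take], y[take]

# toy data: a linear relationship we want the weights to recover
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=1000)

W = np.zeros(5)
alpha = 0.05
for epoch in range(20):                                    # each epoch revisits every mini-batch
    for X_batch, y_batch in iterate_minibatches(X, y, 32):
        grad = 2.0 * X_batch.T @ (X_batch @ W - y_batch) / len(X_batch)
        W = W - alpha * grad                               # one update per mini-batch, then move on

print(W)   # ends up close to w_true
```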
And that gets into these things 2088 01:10:10,800 --> 01:10:14,239 called epochs and batch size and so on, 2089 01:10:13,039 --> 01:10:16,359 which we'll get into a lot of gory 2090 01:10:14,239 --> 01:10:17,880 detail when we do the collab. 2091 01:10:16,359 --> 01:10:20,359 So let's revisit that question. It's a 2092 01:10:17,880 --> 01:10:20,359 good question. 2093 01:10:20,439 --> 01:10:25,439 Yeah. 2094 01:10:22,520 --> 01:10:26,880 When you do the backprop process, Very 2095 01:10:25,439 --> 01:10:27,960 good. Backprop. Not backpropagation. 2096 01:10:26,880 --> 01:10:29,039 Nice. I made sure. 2097 01:10:27,960 --> 01:10:30,840 >> Yes. 2098 01:10:29,039 --> 01:10:32,760 Well, it's it sounded like you started 2099 01:10:30,840 --> 01:10:35,159 from the layers that were closest to the 2100 01:10:32,760 --> 01:10:36,920 output and you went backward. Okay. And 2101 01:10:35,159 --> 01:10:39,479 um my question is are you doing that 2102 01:10:36,920 --> 01:10:39,760 once or is it looping multiple times and 2103 01:10:39,479 --> 01:10:42,439 then 2104 01:10:39,760 --> 01:10:44,600 >> do it once. Just once. Yeah. So for each 2105 01:10:42,439 --> 01:10:45,960 gradient calculation, you do it once. 2106 01:10:44,600 --> 01:10:47,680 Why does it Why does it want to start 2107 01:10:45,960 --> 01:10:48,560 from the layer that's closest or why do 2108 01:10:47,680 --> 01:10:49,800 you want to start it from the layer 2109 01:10:48,560 --> 01:10:51,280 that's closest to the output? 2110 01:10:49,800 --> 01:10:53,239 >> Yeah. So basically what happens is let's 2111 01:10:51,279 --> 01:10:54,920 say that just for argument that you go 2112 01:10:53,239 --> 01:10:56,800 go in the reverse direction. 2113 01:10:54,920 --> 01:10:58,279 You will discover that a lot of paths to 2114 01:10:56,800 --> 01:10:59,960 go from the left to the right will end 2115 01:10:58,279 --> 01:11:02,439 up calculating certain intermediate 2116 01:10:59,960 --> 01:11:04,720 quantities including the very final 2117 01:11:02,439 --> 01:11:06,559 gradient sort of item 2118 01:11:04,720 --> 01:11:07,760 again and again and again. 2119 01:11:06,560 --> 01:11:09,280 Same thing is going to get calculated 2120 01:11:07,760 --> 01:11:10,520 again and again and again. So by 2121 01:11:09,279 --> 01:11:12,159 starting from the end and working 2122 01:11:10,520 --> 01:11:14,320 backwards, you just reuse stuff you've 2123 01:11:12,159 --> 01:11:15,920 already calculated. 2124 01:11:14,319 --> 01:11:17,960 So that is sort of the rough idea. But 2125 01:11:15,920 --> 01:11:19,440 if you see my PDF, I've actually worked 2126 01:11:17,960 --> 01:11:22,399 out the example and you and that will 2127 01:11:19,439 --> 01:11:22,399 demonstrate what I'm talking about. 2128 01:11:23,359 --> 01:11:28,319 By the way, this gradient the backprop 2129 01:11:25,119 --> 01:11:28,319 is just a sort of a 2130 01:11:28,600 --> 01:11:31,760 Like in calculus, we have something 2131 01:11:29,920 --> 01:11:32,600 called the chain rule. 2132 01:11:31,760 --> 01:11:34,400 To calculate the derivative of a 2133 01:11:32,600 --> 01:11:35,960 complicated function, you calculate the 2134 01:11:34,399 --> 01:11:37,479 calculate derivative of like the outer 2135 01:11:35,960 --> 01:11:39,239 function then the inner function and so 2136 01:11:37,479 --> 01:11:40,799 on and so forth. 
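For reference, here is the chain rule being described, applied to one made-up composite function and checked numerically.

```python
import numpy as np

# f(x) = sin(x**2): outer function sin(u), inner function u = x**2
x = 1.5

outer_grad = np.cos(x ** 2)          # derivative of the outer function, evaluated at the inner value
inner_grad = 2.0 * x                 # derivative of the inner function
analytic = outer_grad * inner_grad   # chain rule: multiply them

# sanity check with a tiny finite difference
h = 1e-6
numeric = (np.sin((x + h) ** 2) - np.sin((x - h) ** 2)) / (2 * h)
print(analytic, numeric)             # the two agree
```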
The backprop is 2137 01:11:39,239 --> 01:11:42,840 essentially a way to organize the chain 2138 01:11:40,800 --> 01:11:46,279 rule to work with the neural network 2139 01:11:42,840 --> 01:11:46,279 layer-by-layer architecture. That's all. 2140 01:11:49,520 --> 01:11:54,120 So is it Is it fair to say that once we 2141 01:11:51,960 --> 01:11:56,560 are finding like the local minimum, we 2142 01:11:54,119 --> 01:11:58,079 are not optimizing to all the GWs 2143 01:11:56,560 --> 01:11:59,400 because like this local minimum is 2144 01:11:58,079 --> 01:12:01,239 coming like from different curves, from 2145 01:11:59,399 --> 01:12:02,920 different lines. So 2146 01:12:01,239 --> 01:12:04,760 Is that fair to say? When we are using 2147 01:12:02,920 --> 01:12:06,640 stochastic gradient descent, yes. So for 2148 01:12:04,760 --> 01:12:09,360 in stochastic gradient descent, when you 2149 01:12:06,640 --> 01:12:10,880 take say 32 data points from a million 2150 01:12:09,359 --> 01:12:12,960 and you're calculating the loss for that 2151 01:12:10,880 --> 01:12:14,880 32 data points, you're basically trying 2152 01:12:12,960 --> 01:12:17,039 to do a gradient step. 2153 01:12:14,880 --> 01:12:20,000 Right? The W equals W minus alpha 2154 01:12:17,039 --> 01:12:22,680 gradient thing. You're doing it for that 2155 01:12:20,000 --> 01:12:24,720 that 32 points loss function. 2156 01:12:22,680 --> 01:12:25,840 Right? Which is not the 1 million points 2157 01:12:24,720 --> 01:12:27,680 loss function. 2158 01:12:25,840 --> 01:12:29,279 That's why it's approximate. 2159 01:12:27,680 --> 01:12:31,640 But the approximation, instead of 2160 01:12:29,279 --> 01:12:33,719 hurting you, actually helps you because 2161 01:12:31,640 --> 01:12:35,640 it helps you escape the local minima of 2162 01:12:33,720 --> 01:12:37,000 the global loss function. 2163 01:12:35,640 --> 01:12:38,640 So it's it's sort of an interesting and 2164 01:12:37,000 --> 01:12:40,159 somewhat technically subtle point, which 2165 01:12:38,640 --> 01:12:41,920 is why I'm not getting into it too much, 2166 01:12:40,159 --> 01:12:44,119 but I'm happy to give pointers if people 2167 01:12:41,920 --> 01:12:45,680 are interested. Yeah? 2168 01:12:44,119 --> 01:12:47,319 Uh when you say you initialize the 2169 01:12:45,680 --> 01:12:50,039 weights, you initialize for the whole 2170 01:12:47,319 --> 01:12:51,119 network or just the end layer and then 2171 01:12:50,039 --> 01:12:52,119 go backwards like you 2172 01:12:51,119 --> 01:12:53,880 >> No, you initialize everything in one 2173 01:12:52,119 --> 01:12:54,840 shot. 2174 01:12:53,880 --> 01:12:55,960 Because if you don't initialize 2175 01:12:54,840 --> 01:12:57,760 everything in one shot, what's going to 2176 01:12:55,960 --> 01:12:58,960 happen is that you can't do like the 2177 01:12:57,760 --> 01:13:00,560 forward computation to find the 2178 01:12:58,960 --> 01:13:02,720 prediction. 2179 01:13:00,560 --> 01:13:05,080 Uh and so they are done independently 2180 01:13:02,720 --> 01:13:07,159 and the initialization schemes will take 2181 01:13:05,079 --> 01:13:08,680 into account, okay, I'm initializing the 2182 01:13:07,159 --> 01:13:10,720 weights between a layer which has 10 2183 01:13:08,680 --> 01:13:12,280 nodes and on one side and 32 on the 2184 01:13:10,720 --> 01:13:13,240 other side and the 10 and the 32 2185 01:13:12,279 --> 01:13:15,800 actually play a role in how you 2186 01:13:13,239 --> 01:13:15,800 initialize. 2187 01:13:15,960 --> 01:13:19,960 Okay. 
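As a rough illustration, using the 10-and-32 sizes from the answer rather than the course network: a He-style initialization scales random weights by the square root of 2 over the fan-in, and in Keras you would normally just accept the layer's default initializer (a Glorot scheme for Dense layers) or name one explicitly.

```python
import numpy as np
from tensorflow import keras

# By hand: He-style initialization for the weights between a 10-node layer and a 32-node layer.
fan_in, fan_out = 10, 32
W = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)   # the fan-in sets the scale

# In Keras, just take the default, or name a scheme explicitly if you want to:
layer = keras.layers.Dense(32, activation="relu", kernel_initializer="he_normal")
```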
So, the summary of the 2188 01:13:18,279 --> 01:13:22,840 overall training flow 2189 01:13:19,960 --> 01:13:24,359 is that, you know, you have an input. 2190 01:13:22,840 --> 01:13:26,079 It goes through a bunch of layers. You 2191 01:13:24,359 --> 01:13:28,319 come up with a prediction. You compare 2192 01:13:26,079 --> 01:13:29,600 it to the true values, and these two 2193 01:13:28,319 --> 01:13:31,679 things go into the loss function 2194 01:13:29,600 --> 01:13:33,600 calculation. You get a loss number. 2195 01:13:31,680 --> 01:13:35,480 Right? And you do it for, say, 10 points 2196 01:13:33,600 --> 01:13:38,000 or 32 points or a million points. And 2197 01:13:35,479 --> 01:13:39,959 this loss thing goes into the optimizer, 2198 01:13:38,000 --> 01:13:41,640 which calculates the gradient. And once 2199 01:13:39,960 --> 01:13:44,159 it calculates the gradient, it updates 2200 01:13:41,640 --> 01:13:45,880 the weights of every layer using the W 2201 01:13:44,159 --> 01:13:47,760 equals W minus alpha times gradient 2202 01:13:45,880 --> 01:13:48,920 formula, the gradient descent formula. And 2203 01:13:47,760 --> 01:13:50,440 then you keep doing this again and 2204 01:13:48,920 --> 01:13:53,000 again and again. 2205 01:13:50,439 --> 01:13:54,439 This is the overall flow. 2206 01:13:53,000 --> 01:13:56,359 This is how our little network is going 2207 01:13:54,439 --> 01:14:00,039 to get built for heart disease 2208 01:13:56,359 --> 01:14:00,039 prediction. This is how GPT-4 was built. 2209 01:14:00,720 --> 01:14:04,240 And this is how AlphaFold was built. 2210 01:14:02,720 --> 01:14:06,720 And AlphaGo was built. 2211 01:14:04,239 --> 01:14:06,719 You get the idea. 2212 01:14:07,359 --> 01:14:10,799 I mean, it's astonishing, frankly. 2213 01:14:09,479 --> 01:14:12,359 If you're not getting goosebumps at the 2214 01:14:10,800 --> 01:14:14,239 thought that this simple thing can do 2215 01:14:12,359 --> 01:14:17,159 all these complicated things, we really 2216 01:14:14,239 --> 01:14:20,359 need to talk offline. 2217 01:14:17,159 --> 01:14:23,119 Uh there was a hand raised here. Yeah. 2218 01:14:20,359 --> 01:14:25,759 Sorry. Just quickly, this is for each 2219 01:14:23,119 --> 01:14:27,159 mini batch, right? So 2220 01:14:25,760 --> 01:14:28,680 my question is, if you came up with a 2221 01:14:27,159 --> 01:14:30,199 different weight for each mini batch, 2222 01:14:28,680 --> 01:14:31,520 how do you 2223 01:14:30,199 --> 01:14:33,800 add it up? 2224 01:14:31,520 --> 01:14:35,400 Like, okay, this weight is the 2225 01:14:33,800 --> 01:14:37,880 perfect combination for this mini batch, 2226 01:14:35,399 --> 01:14:39,559 but you have a different 2227 01:14:37,880 --> 01:14:41,560 weight for another mini batch. How do 2228 01:14:39,560 --> 01:14:43,360 you combine those two? No. 2229 01:14:41,560 --> 01:14:45,400 At each step, what you do is 2230 01:14:43,359 --> 01:14:46,519 you start with 2231 01:14:45,399 --> 01:14:48,000 a weight. 2232 01:14:46,520 --> 01:14:49,320 You run it through for a mini batch. You 2233 01:14:48,000 --> 01:14:50,680 come up with the loss function. You 2234 01:14:49,319 --> 01:14:51,880 calculate the gradient. 2235 01:14:50,680 --> 01:14:53,159 And now, using the gradient, you've 2236 01:14:51,880 --> 01:14:54,159 updated the weight. Now you have a new 2237 01:14:53,159 --> 01:14:55,559 set of weights, right? Which is the 2238 01:14:54,159 --> 01:14:57,680 updated weights.
Call it 2239 01:14:55,560 --> 01:14:59,480 W2 instead of W1. 2240 01:14:57,680 --> 01:15:00,680 Now W2 is is your network and when you 2241 01:14:59,479 --> 01:15:03,559 take the next mini batch, it's going to 2242 01:15:00,680 --> 01:15:05,240 use W2 to calculate the prediction. 2243 01:15:03,560 --> 01:15:08,800 And this this whole flow will become a 2244 01:15:05,239 --> 01:15:11,840 lot clearer when we do the collabs. 2245 01:15:08,800 --> 01:15:13,360 Okay. So we have 3 minutes. 2246 01:15:11,840 --> 01:15:15,720 I don't want to go into 2247 01:15:13,359 --> 01:15:19,039 regularization overfitting in 3 minutes. 2248 01:15:15,720 --> 01:15:19,039 So let's have some more questions. 2249 01:15:19,680 --> 01:15:22,600 Yeah. 2250 01:15:20,640 --> 01:15:25,200 Can you use any activation function as 2251 01:15:22,600 --> 01:15:26,760 long as it gives like positive values? 2252 01:15:25,199 --> 01:15:29,679 For like X squared or mod X or 2253 01:15:26,760 --> 01:15:31,400 something. Um you can use a variety of 2254 01:15:29,680 --> 01:15:33,320 activation functions. 2255 01:15:31,399 --> 01:15:35,519 Um 2256 01:15:33,319 --> 01:15:37,319 There is uh but yeah, there's a whole 2257 01:15:35,520 --> 01:15:38,640 literature on, you know, the pros and 2258 01:15:37,319 --> 01:15:39,840 cons of various activation functions 2259 01:15:38,640 --> 01:15:42,520 that you could use. 2260 01:15:39,840 --> 01:15:44,760 But in general, you have to make sure of 2261 01:15:42,520 --> 01:15:46,880 a couple of things. One is that when you 2262 01:15:44,760 --> 01:15:48,360 do backprop, 2263 01:15:46,880 --> 01:15:49,520 the gradient is going to flow through 2264 01:15:48,359 --> 01:15:50,639 the activation function in the reverse 2265 01:15:49,520 --> 01:15:52,200 direction. 2266 01:15:50,640 --> 01:15:53,720 And the activation function should 2267 01:15:52,199 --> 01:15:55,439 actually sort of make sure the gradient 2268 01:15:53,720 --> 01:15:56,800 doesn't get squished. 2269 01:15:55,439 --> 01:15:58,559 It shouldn't get squished. It shouldn't 2270 01:15:56,800 --> 01:16:00,199 get exploded. 2271 01:15:58,560 --> 01:16:01,280 So those are some considerations and 2272 01:16:00,199 --> 01:16:02,760 these are technical considerations, but 2273 01:16:01,279 --> 01:16:04,239 those all those considerations have to 2274 01:16:02,760 --> 01:16:07,000 be taken into account. If you can take 2275 01:16:04,239 --> 01:16:08,039 those into account, then you're okay. 2276 01:16:07,000 --> 01:16:08,960 That's sort of the key thing to keep in 2277 01:16:08,039 --> 01:16:10,479 mind. 2278 01:16:08,960 --> 01:16:11,920 And that's in fact why the ReLU is 2279 01:16:10,479 --> 01:16:13,319 actually very popular 2280 01:16:11,920 --> 01:16:15,640 because as long as the value is 2281 01:16:13,319 --> 01:16:18,000 positive, the gradient of the ReLU is 2282 01:16:15,640 --> 01:16:20,640 just one. Right? 2283 01:16:18,000 --> 01:16:20,640 Uh because 2284 01:16:22,680 --> 01:16:26,600 So if you look at something 2285 01:16:24,239 --> 01:16:26,599 Oops. 2286 01:16:28,720 --> 01:16:31,920 Was it frozen? 2287 01:16:30,359 --> 01:16:34,880 I jinxed it. 2288 01:16:31,920 --> 01:16:37,399 So sorry, livestream. 2289 01:16:34,880 --> 01:16:39,880 If you have something like this, 2290 01:16:37,399 --> 01:16:41,719 the ReLU is like that, right? 2291 01:16:39,880 --> 01:16:43,480 So the gradient here 2292 01:16:41,720 --> 01:16:44,560 is always going to be one. 
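The ReLU and its gradient, written out as a quick sketch:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    # 1 wherever the input is positive, 0 elsewhere
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.3, 4.0])
print(relu(z))       # 0, 0, 0.3, 4.0
print(relu_grad(z))  # 0, 0, 1, 1
```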
2293 01:16:43,479 --> 01:16:46,279 Which means that as long as the value is 2294 01:16:44,560 --> 01:16:47,960 positive, whatever gradient comes in 2295 01:16:46,279 --> 01:16:49,000 like this, it just like gets multiplied 2296 01:16:47,960 --> 01:16:50,960 by one and gets pushed out the other 2297 01:16:49,000 --> 01:16:52,840 side. So it doesn't get it doesn't get 2298 01:16:50,960 --> 01:16:55,399 harmed or squished or anything like 2299 01:16:52,840 --> 01:16:57,119 that. Um so that's one reason why the 2300 01:16:55,399 --> 01:16:59,239 ReLU is very popular because it 2301 01:16:57,119 --> 01:17:00,640 preserves the gradient while injecting 2302 01:16:59,239 --> 01:17:04,519 almost like the minimum amount of 2303 01:17:00,640 --> 01:17:04,520 non-linearity to do interesting things. 2304 01:17:04,760 --> 01:17:10,280 Um yeah. 2305 01:17:07,520 --> 01:17:13,080 If you have a high number of dimensions, 2306 01:17:10,279 --> 01:17:14,920 can you do mini batching on like 2307 01:17:13,079 --> 01:17:17,119 features dimensions instead of just 2308 01:17:14,920 --> 01:17:19,840 observations and keep the same number of 2309 01:17:17,119 --> 01:17:21,760 observations, but just take a small 2310 01:17:19,840 --> 01:17:24,000 sample of the number of features that 2311 01:17:21,760 --> 01:17:25,760 you're actually using? Oh, I see. I see. 2312 01:17:24,000 --> 01:17:27,039 So you're saying let's say you have 10 2313 01:17:25,760 --> 01:17:28,720 features. 2314 01:17:27,039 --> 01:17:31,000 Um instead of taking all data points of 2315 01:17:28,720 --> 01:17:33,640 10 features, what if you have choose 2316 01:17:31,000 --> 01:17:34,920 five features and just use them and do 2317 01:17:33,640 --> 01:17:36,760 the thing 2318 01:17:34,920 --> 01:17:38,520 as long as you can actually compute the 2319 01:17:36,760 --> 01:17:39,840 prediction. 2320 01:17:38,520 --> 01:17:41,600 To compute the prediction, you may need 2321 01:17:39,840 --> 01:17:43,239 all 10 features. 2322 01:17:41,600 --> 01:17:44,720 Right? Or you need to have some defaults 2323 01:17:43,239 --> 01:17:46,800 for those features. 2324 01:17:44,720 --> 01:17:48,560 And by if you define defaults for those 2325 01:17:46,800 --> 01:17:50,520 other five features, you're basically 2326 01:17:48,560 --> 01:17:51,400 using all all features. 2327 01:17:50,520 --> 01:17:53,400 So that's the key thing. Can you 2328 01:17:51,399 --> 01:17:55,079 actually calculate the prediction 2329 01:17:53,399 --> 01:17:57,399 by manipulating? And typically, you 2330 01:17:55,079 --> 01:17:57,399 can't. 2331 01:17:57,840 --> 01:18:00,960 All right? 2332 01:17:58,960 --> 01:18:02,439 Okay, folks. 9:55. I'm done. Have a 2333 01:18:00,960 --> 01:18:04,800 great rest of your week. I'll see you on 2334 01:18:02,439 --> 01:18:04,799 Monday.