All right. So, today's lecture: introduction to neural networks and deep learning. We'll start with a very quick intro to these topics, and then we'll switch and dive deep into neural networks.

So, the field of AI originated in 1956. Sadly, it didn't originate at MIT; it originated at Dartmouth, because all these people got together at Dartmouth. I guess it's got a nice quad or whatever. They got together and they defined the field. But fortunately for us, MIT was very well represented. We have Marvin Minsky, who founded the MIT AI Lab; John McCarthy, who invented Lisp and then later defected to the West Coast; and Claude Shannon, who invented information theory and was a professor at MIT. So MIT was well represented. These folks founded the field, and they were so bright, they thought that AI was going to be "substantially solved," quote unquote, by that fall. Now, obviously, it turned out a bit differently than what they expected.
It's been, whatever, 67 or 68 years since its founding, and in that time the field has gone through, essentially, in my opinion, three seminal breakthroughs: starting with the traditional approach, then machine learning, deep learning, and generative AI. Let's take a very quick look at each of these breakthroughs and what motivated them.

Let's start with the traditional approach to AI. And so, what is AI? Informally, AI is the ability to imbue computers with the ability to do things that typically only humans can do: cognitive tasks, thinking tasks, and things like that. The most commonsensical way to do that is to say, "Well, if I want the computer to do something complicated like play chess, I'm just going to sit down with a few chess grandmasters, show them a whole bunch of board positions, and ask them how they figure out how to respond, how to play the next move." I'm going to sit down, talk to all these people, and then I'm going to write down a whole bunch of rules.
If this is the board position, move this; if this is that board position, move that; and so on and so forth. Or I might sit down with a cardiologist and ask, "Okay, how do you actually interpret an ECG?" They will similarly give me a bunch of if-then rules. I will take all these rules, put them into the computer, and boom, I have a system that can do what a human can do. Right? Now, even though this approach is commonsensical and kind of makes sense, it had success in only a few areas. So the interesting question is: why was it not pervasively successful? It seems like a pretty good idea to me, right? And the people who came up with these things are smart people, not dumb people. They know what they're doing. So, why did it not work?

>> Because it's time-intensive, since you'd have to run through all the scenarios that could ever exist, and still some new scenarios can come up that you didn't cater for initially.

>> Right.
So, there are two aspects to what you said. The first aspect is that it's time-intensive. That, as it turns out, is not a big deal, because computers are getting faster and faster. The second thing is actually the key thing, which is that it doesn't generalize to new situations very well. The problem is that there are an infinite number of things you're going to see when you deploy these systems in the real world. By definition, what you're training on is a small sample of rules, so these rules are very brittle. But there's actually an even more interesting reason. And that reason is that we know more than we can tell. This is called Polanyi's paradox. The idea is that if I come to you and say, "Hey, here's a picture. Is it a dog or a cat?" you will tell me within, I believe they've measured it, something like 20 milliseconds whether it's a dog or a cat.
And 114 00:03:36,919 --> 00:03:40,039 then if I ask you to explain to me 115 00:03:38,560 --> 00:03:41,520 exactly how you figured that out, you'll 116 00:03:40,039 --> 00:03:43,639 come up with a bunch of sort of reasons, 117 00:03:41,520 --> 00:03:45,120 right? Alleged reasons. Oh, you know, if 118 00:03:43,639 --> 00:03:46,000 it has whiskers, I think it's a cat or 119 00:03:45,120 --> 00:03:47,800 whatever. 120 00:03:46,000 --> 00:03:49,080 But, the problem is that you actually, 121 00:03:47,800 --> 00:03:50,280 first of all, can't really articulate 122 00:03:49,080 --> 00:03:51,840 what's going on in your head, how you do 123 00:03:50,280 --> 00:03:54,000 these things. And number two, even if 124 00:03:51,840 --> 00:03:55,479 you articulate it, often times, your 125 00:03:54,000 --> 00:03:58,000 articulation has no correspondence with 126 00:03:55,479 --> 00:04:01,239 how your brain actually does it. 127 00:03:58,000 --> 00:04:03,360 So, you're incomplete and a liar. 128 00:04:01,240 --> 00:04:04,840 So, this is Polanyi's paradox. So, if 129 00:04:03,360 --> 00:04:06,840 you can't even 130 00:04:04,840 --> 00:04:08,120 tell me how you do something, how the 131 00:04:06,840 --> 00:04:10,080 heck am I supposed to take it and put it 132 00:04:08,120 --> 00:04:11,680 into a computer? Doesn't work. And 133 00:04:10,080 --> 00:04:13,480 second is the fact that we can't write 134 00:04:11,680 --> 00:04:15,760 down these rules for all possible 135 00:04:13,479 --> 00:04:17,279 situations. Edge cases, corner cases, 136 00:04:15,759 --> 00:04:18,759 etc. And the world is full of edge 137 00:04:17,279 --> 00:04:20,199 cases. 138 00:04:18,759 --> 00:04:21,560 So, for these reasons, this approach 139 00:04:20,199 --> 00:04:22,800 didn't work. 
And so a different approach was developed, and this approach basically said, "Hey, instead of explicitly telling the computer what to do, why don't we simply give it lots of examples of inputs and outputs? Chess positions and next moves, right? ECGs and diagnoses. Inputs and outputs. And then, why don't we just use some statistical techniques to learn a mapping, a function, that can go from the input to the output?" That was the idea. And this idea is machine learning. So, machine learning is basically just a fancy way of saying, "Learn from input-output examples using statistical techniques." Good. Now, there are numerous ways to create machine learning models, and if you've ever done linear regression, congratulations, you've been doing machine learning. And only one of those methods happens to be something called neural networks.
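To make the "learn a mapping from input-output examples" idea concrete, here is a minimal sketch using ordinary least squares. The data and feature names are invented for illustration; the point is only that the mapping is fitted from examples, not written down as rules.

```python
import numpy as np

# Inputs: (age, resting heart rate); output: an invented risk score.
X = np.array([[40.0, 60.0],
              [50.0, 70.0],
              [60.0, 80.0],
              [70.0, 90.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

# Add an intercept column and solve for the weights w minimizing ||Xw - y||.
X1 = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(X1, y, rcond=None)

# The learned function can now be applied to any input, seen or unseen.
def predict(x):
    return np.array([*x, 1.0]) @ w
```

Nobody told the computer a rule like "risk rises with age"; the weights encode whatever pattern the examples contain.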
There are many other methods, and in fact you've probably used those other methods if you've taken a course like The Analytics Edge or something similar. Okay. So, machine learning has had tremendous impact around the world, right? At this point it's widely accepted; it's a very, very successful technology. And in fact, whenever people are talking about AI, chances are they're actually talking about machine learning. It's just that AI sounds cooler. The only problem is, for machine learning to work really well, the input data has to be structured. And what I mean by that is data that can essentially be numericalized and stuck into the columns and rows of a spreadsheet. So, for example, let's say I want to put together a data set of patients, their symptoms and their characteristics, and then whether, in the year after they showed up at the doctor's office, they had a cardiac event or not.
I might create a data set like this, with age, smoking status (yes/no), exercise, and so on. Either these values are numerical, or if they're not numerical, they're categorical: smoking, yes or no, things like that. Which means that if you have categorical variables, you can numericalize them pretty easily. You folks have done some machine learning before, so you know that things like one-hot encoding can be done to make them all numerical. So the point is, you can render the data into the columns and rows of a spreadsheet pretty easily, right? That's what I mean by structured data. But the situation is very different if you have unstructured data. Say you have an image of a cute puppy. This is my puppy, by the way, from many years ago. Sadly, he's no more. But his name was Google. So, yeah, anyway, my DMD alums know Google well. So, this is Google, right?
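As an aside, the one-hot encoding just mentioned can be sketched in a few lines of plain Python; the categories here are invented.

```python
def one_hot(values):
    """Map each categorical value to a 0/1 indicator vector."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

# A "smoking status" column becomes purely numerical:
cats, rows = one_hot(["yes", "no", "yes"])
# cats == ["no", "yes"]; rows == [[0, 1], [1, 0], [0, 1]]
```

In practice a library routine (e.g. scikit-learn's `OneHotEncoder`) would do this, but the idea is exactly this small.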
If you want to take Google, this picture, and figure out how to numericalize it, the first thing you need to understand is how this picture is represented digitally, in the computer: basically, every picture like this is represented using three tables of numbers. We'll get to what these numbers mean later on, but the point I'm making is that each number represents the amount of light, on a scale of 0 to 255, in that location, in that pixel. That's all: the amount of light. This table is the amount of red light, this one the amount of green light, this one the amount of blue light. Okay? Now, you will agree with me that if you, for example, look at something like this, 251, you'd say, "Okay, at this location there is a lot of blue light, because it's 251 out of a possible 255." Maybe a lot of blue light somewhere here. There's a lot of blue here.
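The "three tables of numbers" representation just described can be sketched like this (the 2x2 image and its pixel values are invented for illustration):

```python
import numpy as np

# Three tables, one per color channel, each entry the amount of
# light at that pixel on a 0-255 scale.
red   = np.array([[  0,  10], [ 20,  30]], dtype=np.uint8)
green = np.array([[  5,  15], [ 25,  35]], dtype=np.uint8)
blue  = np.array([[251, 200], [180, 160]], dtype=np.uint8)

# Stacking the three tables gives the usual height x width x 3 layout.
image = np.stack([red, green, blue], axis=-1)
# image.shape == (2, 2, 3); image[0, 0, 2] == 251 (lots of blue light here)
```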
Whether that area is blue because of a piece of sky, some water, or a bunch of blue paint, it could be anything, it's going to say 251. So, the underlying reality, the underlying object that's being described, has nothing to do with the 251. Right? So, that's the whole problem: the raw form of the data has no intrinsic connection to the underlying thing. So, given that there's no connection between the number and what it's describing, how the heck can any algorithm do anything with it? It can't. So, what you have to do is something called feature engineering, or feature extraction, where you have to manually take all these things and essentially create a spreadsheet from them. So, basically, let's say that you have a bunch of birds, right?
And you're trying to build a bird classifier to figure out what bird species it is. You might actually have to take this picture and then measure the beak length, the wingspan, the primary color, and so on and so forth. So, you're basically structuring the unstructured data manually, right? And for this process of structuring unstructured data, we use the word representation: we take the raw data and we represent the data in a different form. The reason I'm focusing on the word representation is that it becomes really, really important a bit later on, when we get to deep learning. Okay? So, we have to represent the data in a different way for it to work. That's the basic idea. All right. So, what that means is that, historically, researchers would manually develop these representations.
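A hand-built feature extractor of the kind described might look like the sketch below. Everything here is invented for illustration: the measurements are passed in rather than computed from pixels, which is exactly the hard part a real researcher would spend years on.

```python
def extract_features(bird):
    """Turn one raw record into a fixed-length row of numbers."""
    colors = ["brown", "red", "blue"]          # assumed color vocabulary
    return [
        bird["beak_length_mm"],                # already numerical
        bird["wingspan_cm"],                   # already numerical
        colors.index(bird["primary_color"]),   # categorical -> number
    ]

row = extract_features({"beak_length_mm": 12.5,
                        "wingspan_cm": 25.0,
                        "primary_color": "blue"})
# row == [12.5, 25.0, 2]
```

Each photo becomes one structured row; stack the rows and you have the spreadsheet that classical machine learning needs.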
And once you develop them, once you have representations, you can just use traditional linear regression or logistic regression to get the job done. So, the whole name of the game is the representations. In fact, people doing PhDs in, for example, computer vision would spend something like four years developing amazing representations for solving one particular little problem. Say we have a bunch of CAT scans, and we need to figure out whether there's evidence in the CAT scan for a particular kind of stroke. They might sit and develop all kinds of representations, test them, and so on. And then they'll finally declare victory and say, "Yay, I'm done with my PhD. Here is this amazing representation, and you can build a classifier with it to predict a particular kind of stroke with high accuracy." Okay? So, that's where the world was.
Now, as you can imagine, developing representations, because it's so manual, is this massive human bottleneck, and this sharply limited the reach and applicability of machine learning. As you would expect. To address this problem, a different approach came about, and that's deep learning. So, deep learning sits inside machine learning. Okay? And deep learning can handle unstructured input data without upfront manual processing. Meaning, it will automatically learn the right representations from the raw input. Automatically is the key word. Automatically learn representations, which means that you can give it structured data, you can give it pictures, you can give it text, you can give it anything you want; it will just learn them. Okay?
It can automatically extract these representations, and since they're being automatically extracted, you can imagine a pipeline where the raw data comes in, you have a bunch of stuff in the middle that's learning these representations automatically, without your help, and then boom, you just attach a little linear regression or logistic regression at the end, and the problem is solved. That, in a nutshell, is deep learning. Input, a whole bunch of representations being learned, and then piped into a linear or logistic regression model. Okay? So, the amazing thing is that this simple idea is just incredibly powerful. Right? That idea has led to ChatGPT, to AlphaGo, AlphaFold, and so on and so forth. And, I kid you not, I've been doing deep learning for about ten years now, and every time I look at it, I literally get goosebumps every so often.
That something so simple could be so powerful really boggles the mind. I'm just so lucky to be alive and working during this period. Okay? And, you know, coming from people who have been in the industry a long time, this sort of breathless exclamation is quite rare, particularly because I'm not in marketing. I actually mean it. With all due apologies to various marketing folks. I just realized this is being taped, so, okay. So, this has demolished the human bottleneck for using machine learning with unstructured data, and it comes from the confluence of three forces: new algorithmic ideas, a whole lot of data, and then, very importantly, the fact that we have access to parallel computing hardware in the form of these things called GPUs, graphics processing units. These three forces came together, they were applied to an old idea called neural networks, and that's basically deep learning.
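The pipeline described a moment ago, raw input, then layers that learn a representation, then a small logistic-regression head, can be sketched as a forward pass. This is only the shape of the idea: the weights below are random placeholders, whereas in a real network they would be learned from data, and all the sizes are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Raw input: a flattened 4-pixel "image" (invented numbers).
x = np.array([0.2, 0.9, 0.1, 0.5])

# "Representation" layers: two small fully connected layers.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)
h = relu(W2 @ relu(W1 @ x + b1) + b2)   # the learned features

# Logistic-regression head sitting on top of the learned features.
w, b = rng.normal(size=2), 0.0
p = sigmoid(w @ h + b)                  # probability of the class
```

Swap the hand-crafted feature extractor for the middle layers, train everything end to end, and that is the whole trick.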
And I'll go through it very quickly, because obviously we're going to spend half the semester looking into this thing in detail. So, what's the immediate application of the ability to automatically handle unstructured data? What is, like, the no-brainer application? It's okay if it's obvious; tell me.

>> Image classification.

>> Right. So, image classification, yes. You can take an image, a good example of unstructured data, and do some classification on it. But more generally, what I'm getting at is that every sensor in the world can be given the ability to detect, recognize, and classify what it's sensing. Every sensor. Because remember, what does a sensor do? A sensor is just a receptacle for unstructured data. A camera is a receptacle for unstructured video, or unstructured still images. A microphone: unstructured audio, right?
So, for every sensor, you can imagine taking the sensor and sticking a little deep learning system behind it. And now, suddenly, with what comes out of that sensor plus the deep learning system, you can count, you can classify, you can detect, you can do all kinds of stuff. In short, you can analyze. And you can predict, right? Now, the way I'm describing it right now, you'll be like, "Yeah, duh, obviously." But you know what, this "obviously" is actually not at all obvious in terms of whether it'll help you find interesting applications or not. Okay? So, here's something I literally saw last week. Actually, I have another slide before that, but we're coming to that. So, for instance, every time you use Face ID to unlock your phone, this is the basic principle at work. The camera in the iPhone is the sensor, and they stuck a deep learning system behind it to do image classification: my face versus not my face, right? That's what it's classifying.
And so here you have a breast cancer detection system that works from a mammogram. By the way, this is a very interesting picture. There's a professor in EECS, Regina Barzilay, who's a very well-known expert in this field, and she has actually built a breast cancer detection system, which has been deployed at Mass General Hospital. And it turns out she's actually a breast cancer survivor. She's good now, all good. But after she built her system, I heard that she ran that system against her mammograms from many years prior, from when she went for a mammogram and was told that everything was fine. She ran the system on that mammogram, and it came back and said, "Here is a problem." So, a very interesting example where a deep learning system picked up something that a radiologist could not. So, these things can be quite powerful.
Obviously, any self-driving system has numerous deep learning algorithms running under the hood: pedestrian detection, stoplight detection, zebra-crossing detection, and so on and so forth. It's also being very heavily used in visual inspection in manufacturing. You have various cameras now — instead of people looking at a part and saying, "Okay, there is a dent, or there's a scratch," they have a little system which is a dent detector, a scratch detector, and so on. That's going on right now. And now I come to the example I saw last week. So, this is an example of how you can create dramatically better products if you really internalize this idea. It's almost like you're looking at the world and saying, "Oh, there's a sensor. Can I attach a DL thing behind it?" That's the way you should be looking at the world, okay, for startup ideas. So, here's an example. These, apparently, are the world's first smart binoculars.
Okay? This is the binocular — it came out two weeks ago. You look at the bird, and it tells you what kind of bird it is, right there. It's a simple idea, but imagine, right? Imagine you are the first out of the gate with this feature — you'll have a little bit of an edge till everybody catches up, like 3 months later. Let's be very clear: there are no long-term monopoly windows in the world. There are only short-term windows, so the hunt is always on for a little monopoly window. So, here's an example of that. Right? So, I encourage you to always think about the world as: where are the sensors here? And can I attach something behind the sensor to do something useful with it? Okay? All right. Now, let's turn our attention to the output.
We've been talking about structured data and unstructured data, and how deep learning has sort of unlocked the ability to work with unstructured data, but we've sort of been neglecting the output side of the equation. So, traditionally, we could predict single numbers, or a few numbers, pretty easily, right? You've all done the canonical "should this loan application be approved" exercise in machine learning, right? You just predict the probability that a borrower will repay a loan, based on a whole bunch of data. Or in supply chain, you predict the demand for the product next week. Or you could predict a bunch of numbers: given a picture, you can say, "Okay, which one of the 10 kinds of furniture is it?" Right? You can predict 10 numbers — 10 probabilities that add up to one. Or you can predict a whole bunch of numbers that don't have to add up to one, such as the GPS coordinates of an Uber ride.
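The "10 probabilities that add up to one" output is what a softmax layer produces. A minimal sketch in Python — the raw scores below are made-up numbers purely for illustration:

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability, exponentiate, then normalize
    # so the results sum to one and can be read as probabilities.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for 10 furniture classes.
scores = [2.0, 0.5, -1.0, 0.0, 1.2, -0.3, 0.8, 0.1, -2.0, 0.4]
probs = softmax(scores)
print(sum(probs))  # 1.0, up to floating point
```

The class with the largest raw score always gets the largest probability, which is why the model's "pick" is just the argmax of the softmax output.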
So, these are all simple structured outputs — just a few numbers, right? What we could not do very easily was actually generate pictures like this. We could not generate unstructured data. We could only consume unstructured data, right? With generative AI, that problem is gone: you can generate text, you can generate pictures, and audio, and so on, and so forth. So, generative AI is the ability to actually create unstructured data, all right? And therefore, it sits within deep learning. It still runs on deep learning, but it's just one kind of deep learning. Okay? There's plenty of stuff going on in deep learning that's got nothing to do with generative AI. Nowadays, of course, if you're a self-respecting entrepreneur who wants to ride this craze, you'll probably declare whatever you're doing as generative AI. Right? And some VCs may actually be ready to fund you, who knows?
But the point is, there's plenty of stuff going on in deep learning that's got nothing to do with generative AI. So, this is the overall picture. Now, here, we can produce unstructured outputs, like pictures. You can take this image, and then you can actually come up with a nice description of it. This actually is a very famous picture, by the way, in the world of computer vision. So, we are actually going to be analyzing this picture a little later on in the semester. You can obviously go from a very complicated caption to an image. You can go from text to music. Can people hear it? Okay. Yeah. All right. And of course, we can go from text to text — i.e., ChatGPT. And then, as of a few months ago, things have gotten even more interesting: you can send text and an image in, and you can get text out. Right?
And in fact, as of a few weeks ago, you can send text, image, text, image, text, image — in an arbitrary sequence — into the system, and it'll actually come back to you with text and images. Right? So, things are becoming multimodal. I just want to share with you a really fun example I saw recently. So, this person sends this picture. Can folks see this? It's this very complicated parking sign, apparently in San Francisco. And they're like: it's Wednesday at 4:00 p.m. Can I park here? Tell me in one line. Because you really didn't want GPT-4 to be giving you a big essay about this. Like, you literally want to park. So, GPT-4 comes back and says, "Yes, you can park here for up to 1 hour starting at 4:00 p.m." And folks, I double-checked this thing — it's correct. We all know these things hallucinate, right? Can you imagine getting a parking ticket and telling the judge, "I'm sorry, I didn't realize it was hallucinating"? So, you have to double-check it.
So, yeah. So, things are getting multimodal very quickly. And so, the picture here is that within gen AI, we used to have these separate circles: text to text, text to image, text to music, text to this, text to that, so on and so forth. Those are all beginning to merge now inside gen AI, because multimodal models are going to become the norm this year, right? We already have really good closed models. We actually already have very good open-source multimodal models. And so, my feeling is that by the end of the year, the idea of using a text-only model is going to be like, "Really, you still do that?" Right? It's going to become a quaint, old-fashioned thing. I think multimodality is going to become the norm. So, that's where the world is, and this is the landscape. So, any questions on the landscape? Before we actually start doing some math. Okay. Yeah. You mean the evidence of that being a problem would have been smaller? Yeah.
Yeah. So, the question is: in general, how do you train your models so that they give you the right answers, given that over the passage of time, the amount of evidence in the data could be highly variable? So, in this particular case — the professor I talked about — everything at that point was going through an expert radiologist. So, 5 years ago, this mammogram was seen by a radiologist, and that person concluded there was no problem. So, that was the training label, right? The wrong training label. So, typically what happens is that training labels could be wrong some small fraction of the time. So, you need to have systems that are robust. So, your data needs to be complete, it needs to be comprehensive, it needs to have correct labels. If these conditions are not met, your systems are not going to be that good. But as it turns out, with neural networks, even with some amount of noise in the labels, they still do a pretty good job. Right? So, that's sort of the general idea.
The verification comes from the human. So, remember, when we look at radiology data, the data we're working with is: the input is, let's say, an image, like a mammogram, and then a human radiologist, or a set of radiologists, have said this has a problem or does not have a problem. So, that is called the ground truth. So, it is this ground-truth image-and-label combination that's being used to train these models. Yeah. Embodiment? So, are we going to cover embodiment? So, the embodiment here refers to the fact that if you have robots, right, they need to actually operate in the real world, and so robots are an example of what's called embodied intelligence. So, unfortunately, due to the constraints of time, we're not going to get into robotics at all. But I will say that a lot of the deep learning stuff we're going to talk about — those are all fundamental building blocks in modern robotic systems.
All right. So, in summary: X and Y can be anything, and they can be multimodal. Okay? I literally could not have put up this slide maybe 2 years ago. Right? So, it's very simple in how it looks, but it's very profound. You can learn a mapping from anything to anything at this point, very easily, as long as you have enough data. Okay? So, now, note that all this excitement that we see around us — everything stems from deep learning. Okay? Everything depends on deep learning. And so, if you understand deep learning, a lot of interesting things become possible. So, let's get going. All right. So, we'll start with the very basics. What's a neural network? Now, recall logistic regression from back in the day. So, what is logistic regression? You send in a bunch of numbers, a vector of numbers, and you usually get a probability out, right? Between 0 and 1. What is the probability of something or the other? Okay?
And so, this logistic regression model is also represented in this form, if you will recall. So, basically, what we do is we take all these numbers and run them through a linear function, right? We run them through a linear function, we get a number z, and then we take that number and run it through 1 / (1 + e^(-z)), and that's guaranteed to give you a number between 0 and 1, which can be interpreted as a probability — and that's logistic regression. Okay? And the canonical examples — loan approvals, things like that — all fall into this sort of convenient bucket. Okay? So, this should be super familiar. All right. Now, we're going to actually look at this simple, modest, humble little operation through the lens of a network of mathematical operations, and the reason why we do it will become clear a bit later. So, we'll take this very simple example where we have, let's say, two variables: GPA and experience, right?
This is the GPA of some graduates, and the number of years of work experience, and then this is the dependent variable, which is either 0 or 1 — 0 if they don't get called for an interview, 1 if they get called for an interview. Okay? It's a two-input-variable, one-output-variable problem. Okay? And it's a classification problem, because we're classifying people into: will they get called for an interview, yes or no. Okay? And so, that's the setup for this problem. And let's say that we actually try to fit a logistic regression model to it. So, if you're familiar with R, for example, you would use something like glm to fit this model. If you use something like statsmodels in Python, there's a similar function for it. Scikit-learn — there's another function for it. You get the idea, right? You can use whatever favorite method you have for logistic regression modeling to get this job done.
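Under the hood, glm, statsmodels, and scikit-learn all fit the intercept and coefficients by maximizing the likelihood of the labels. A bare-bones gradient-descent sketch of that fitting process, on a small made-up dataset (the lecture's actual data isn't reproduced here, so the numbers and the fitted coefficients are purely illustrative):

```python
import math

# Made-up (GPA, years of experience, got-interview) rows for illustration only.
data = [(3.9, 2.0, 1), (3.1, 0.5, 0), (3.6, 3.0, 1), (2.8, 1.0, 0),
        (3.4, 2.5, 1), (2.9, 0.0, 0), (3.8, 1.5, 1), (3.0, 2.0, 0)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Fit intercept b and coefficients w1 (GPA), w2 (experience) by plain
# gradient descent on the log-loss -- a stripped-down version of what
# the library routines do with fancier optimizers.
b = w1 = w2 = 0.0
lr = 0.1
for _ in range(5000):
    gb = g1 = g2 = 0.0
    for gpa, yrs, y in data:
        err = sigmoid(b + w1 * gpa + w2 * yrs) - y  # prediction error
        gb += err
        g1 += err * gpa
        g2 += err * yrs
    b -= lr * gb / len(data)
    w1 -= lr * g1 / len(data)
    w2 -= lr * g2 / len(data)

print(b, w1, w2)  # fitted intercept and coefficients
```

With real data you would just call the library function; the point of the sketch is that "fitting" means nudging the weights until the predicted probabilities match the 0/1 labels as closely as possible.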
And if you do that with this little dataset, you're going to get these coefficients, right? The 0.4 is the intercept, 0.2 is the coefficient for GPA, 0.5 for experience. And that is the resulting sigmoid function. Okay? All right. Cool. So, now let's actually rewrite this formula as a network, in the following way. So, first, what we'll do is we'll take GPA and experience and stick them here on the left side, and we'll put little circles next to them, and we'll call them the input nodes. Okay? And so, imagine that somebody writes a GPA into the circle — 3.5, say, or years of experience, 2.0 — and then it flows through this arrow, and as it flows through, it gets multiplied by its coefficient, 0.2. The 0.2 is coming from here. Similarly, experience gets multiplied by 0.5, it comes in here, and this node, as the plus indicates, is adding everything that's coming into it. So, it's adding 0.2 * GPA, 0.5 * experience, plus the intercept, which is the green arrow coming in on its own.
It comes through here, and what comes out of this is just a single number, and that number goes into this little circle, and then out pops a probability. Okay? So, I've written a simple function in a ridiculously long-winded way. Okay? And the reason why I'm doing it will become clear in a second. Okay? So, this is a little network of operations for the simple function. And so, for instance, here's how you would use it to make a prediction. Let's say someone has a 3.8 GPA and 1.2 years of experience. You just plug it in here, do the math, you get 0.76; same thing here; add them all up, you get 1.76; you run 1.76 through the sigmoid, you get 0.85, and that is the probability that that particular individual may get called for an interview. Okay? At this point, we're just doing logistic regression, nothing more complicated. Okay?
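The forward pass just described can be checked in a couple of lines of Python, using the fitted coefficients from the lecture (intercept 0.4, weight 0.2 for GPA, weight 0.5 for experience):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Fitted coefficients from the lecture's example.
b, w_gpa, w_exp = 0.4, 0.2, 0.5

gpa, years = 3.8, 1.2
z = b + w_gpa * gpa + w_exp * years  # 0.4 + 0.76 + 0.6 = 1.76
p = sigmoid(z)
print(round(z, 2), round(p, 2))  # 1.76 0.85
```

Same arithmetic as reading the network left to right: multiply each input by its weight, add the intercept, squash through the sigmoid.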
So, now, if you have many variables — not two, but X1 through XK — the same sort of logic applies. Each one has some coefficient, and then there's an intercept; they all get added up here, run through a sigmoid, and out pops this number. Okay? Notice how the data flows from left to right. Okay? All right. Any questions on this? All right. Good. So, now, terminology. You'll discover that the world of neural networks and deep learning has its own terminology. They have their own ways of referring to things that the rest of the world has been referring to by something else for the longest time. Right? It's kind of annoying sometimes, but it's the way it is. So — remember, in regression, we used to call those numbers next to each variable coefficients, and the constant thing an intercept? Well, guess what? In this world, those coefficients are actually called weights, and the intercepts are called biases.
So, in the neural network world, these are called weights and biases. And sometimes, if you're a little lazy, you may just call the whole thing weights. Okay? So, when you see in the newspaper that, you know, "Oh my god, this amazing model's weights have been leaked on the internet, or on BitTorrent, or something" — that's what's going on, right? All these coefficients have been leaked. Because once you know what the coefficients are and what the architecture is, you can just reconstruct the model. All right. So, that's what's going on here. Now, why did we do this network business? Why did we write it as a network? Yeah, what is the advantage? Any guesses? "When you have multiple functions — it's just easier to see it that way." Right. If you have lots of things going on, it's easier to see if you actually write it in graphical form. Yes, correct. But, so, is it only a usability advantage?
"I mean, the thing is, you want different functions for different layers of that." Uh-huh. Okay. So, maybe we want to use different functions in different layers. But I think there's actually an even larger, more basic point, which is that the moment you write it down, you suddenly realize that you could have lots of things in the middle. I don't have to go from the input to the output directly. I can do lots of things in the middle, right? That's sort of the key idea. So, remember the notion of learning representations of unstructured data, right? Where you take a picture and say beak length, and things like that, right? And remember, I said deep learning actually automatically learns these things. Where is that automatic learning coming from? Well, this is where it's coming from. So, what we do is we take this thing, right? It's just a logistic regression model.
Inputs 978 00:31:41,559 --> 00:31:45,720 get multiplied and added up as a linear 979 00:31:43,480 --> 00:31:46,880 function, run through a sigmoid. 980 00:31:45,720 --> 00:31:48,799 And then 981 00:31:46,880 --> 00:31:51,520 we are like, "Hmm, if we want to learn 982 00:31:48,799 --> 00:31:53,000 representations of the raw input, we 983 00:31:51,519 --> 00:31:54,720 better be doing something in the middle 984 00:31:53,000 --> 00:31:56,759 here." 985 00:31:54,720 --> 00:31:58,720 Because the output is the output. 986 00:31:56,759 --> 00:32:00,039 That is That's not going to change. 987 00:31:58,720 --> 00:32:02,079 You know, it's it's either a dog or a 988 00:32:00,039 --> 00:32:05,440 cat. You don't have any choice 989 00:32:02,079 --> 00:32:07,960 as to what it is. Okay? The only agency 990 00:32:05,440 --> 00:32:09,279 you have at this point is you can take 991 00:32:07,960 --> 00:32:11,079 the raw input and do things in the 992 00:32:09,279 --> 00:32:12,678 middle with it. 993 00:32:11,079 --> 00:32:14,439 You can do a lot of stuff in the middle 994 00:32:12,679 --> 00:32:18,160 and then run it through something to get 995 00:32:14,440 --> 00:32:20,679 the output. Okay? So, in any in in in 996 00:32:18,160 --> 00:32:22,120 any mathematical discipline, 997 00:32:20,679 --> 00:32:23,679 if someone comes to you and says, 998 00:32:22,119 --> 00:32:25,639 "Here's a bunch of data. 999 00:32:23,679 --> 00:32:27,280 I want you to do something with it." 1000 00:32:25,640 --> 00:32:30,759 What should the What is like the big the 1001 00:32:27,279 --> 00:32:30,759 most basic first thing you should do? 1002 00:32:31,720 --> 00:32:36,120 Run it through a linear function. 1003 00:32:34,480 --> 00:32:37,759 The most basic thing in math is a linear 1004 00:32:36,119 --> 00:32:38,559 function. So, given anything, just run 1005 00:32:37,759 --> 00:32:40,039 it through a linear function. See what 1006 00:32:38,559 --> 00:32:42,678 happens.
1007 00:32:40,039 --> 00:32:44,399 So, that's exactly what we can do. So, 1008 00:32:42,679 --> 00:32:46,560 the simplest thing we can do here, we 1009 00:32:44,400 --> 00:32:49,400 can insert a bunch of linear functions. 1010 00:32:46,559 --> 00:32:50,960 So, what we do is we take all this input and 1011 00:32:49,400 --> 00:32:52,759 we just run it we we do a linear 1012 00:32:50,960 --> 00:32:56,079 function on it. So, think of this as 1013 00:32:52,759 --> 00:32:58,879 X1 * 2 + X3 * 4 and all the way to XK * 1014 00:32:56,079 --> 00:33:00,599 9 plus some intercept and boom, it goes 1015 00:32:58,880 --> 00:33:05,200 out the other end. So, this little 1016 00:33:00,599 --> 00:33:05,959 circle here with a plus in it is just 1017 00:33:05,200 --> 00:33:06,600 Thank you. 1018 00:33:05,960 --> 00:33:08,279 Uh 1019 00:33:06,599 --> 00:33:10,359 that is This is just a linear It's a 1020 00:33:08,279 --> 00:33:11,480 shorthand for a linear function. 1021 00:33:10,359 --> 00:33:13,159 So, whenever you see a circle with a 1022 00:33:11,480 --> 00:33:15,360 plus, it's just a shorthand for a linear 1023 00:33:13,160 --> 00:33:16,279 function. Okay? So, you can take this 1024 00:33:15,359 --> 00:33:17,759 whole thing and run through a linear 1025 00:33:16,279 --> 00:33:19,799 function and when you do it, you'll get 1026 00:33:17,759 --> 00:33:21,960 some number right there. You'll get some 1027 00:33:19,799 --> 00:33:23,399 number. So, you've taken these K numbers 1028 00:33:21,960 --> 00:33:25,559 and you've sort of compressed them 1029 00:33:23,400 --> 00:33:26,840 in some way into one number. 1030 00:33:25,559 --> 00:33:28,319 Okay? 1031 00:33:26,839 --> 00:33:30,079 But, you don't have to stop at one 1032 00:33:28,319 --> 00:33:31,599 number. You can do more. 1033 00:33:30,079 --> 00:33:33,439 So, we can have a stack of linear 1034 00:33:31,599 --> 00:33:35,359 functions in the middle. 1035 00:33:33,440 --> 00:33:37,279 Right?
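[Editor's note: written as code, one of those circles with a plus is just a weighted sum. This is a minimal sketch; the coefficients and intercept below are made-up illustration values echoing the 2s, 4s, and 9 mentioned in the lecture, not anything from the slides.]

```python
import numpy as np

# One "circle with a plus": a linear function that collapses K inputs
# into a single number. Weights and intercept are made up for illustration.
x = np.array([1.0, 2.0, 3.0])   # K = 3 inputs
w = np.array([2.0, 4.0, 9.0])   # one coefficient per input
b = 0.5                         # the intercept ("bias")

z = w @ x + b                   # K numbers compressed into one number
```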
There's a linear function here, 1036 00:33:35,359 --> 00:33:40,159 another one here, another one here. At 1037 00:33:37,279 --> 00:33:42,240 this point, the K numbers you have 1038 00:33:40,160 --> 00:33:43,440 K could be, for example, 1,000. 1039 00:33:42,240 --> 00:33:44,400 Right? It's just the size of your input 1040 00:33:43,440 --> 00:33:45,799 data. 1041 00:33:44,400 --> 00:33:47,280 You've taken these K things and you've 1042 00:33:45,799 --> 00:33:48,839 compressed them into three numbers at 1043 00:33:47,279 --> 00:33:50,359 this point. 1044 00:33:48,839 --> 00:33:52,079 Okay? 1045 00:33:50,359 --> 00:33:53,079 So, okay, maybe three is the right 1046 00:33:52,079 --> 00:33:54,039 number, maybe 10 is the right number. We 1047 00:33:53,079 --> 00:33:55,480 don't know. 1048 00:33:54,039 --> 00:33:58,079 And we'll get to know how do we know 1049 00:33:55,480 --> 00:33:59,519 what the right number is later on. 1050 00:33:58,079 --> 00:34:01,159 So, we can stack as many linear 1051 00:33:59,519 --> 00:34:02,720 functions we want. 1052 00:34:01,160 --> 00:34:04,440 So, we have transformed this K thing 1053 00:34:02,720 --> 00:34:06,600 into a three-dimensional vector, right? 1054 00:34:04,440 --> 00:34:07,519 K numbers become three numbers. 1055 00:34:06,599 --> 00:34:10,279 Um 1056 00:34:07,519 --> 00:34:12,280 and now we can flow this three these 1057 00:34:10,280 --> 00:34:13,919 three numbers through some other little 1058 00:34:12,280 --> 00:34:16,359 function. 1059 00:34:13,918 --> 00:34:16,358 Okay? 
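[Editor's note: a vertical stack of linear functions is conveniently written as a matrix: one row per linear function. A minimal sketch, with K = 1,000 and random weights standing in for coefficients the network would learn.]

```python
import numpy as np

# Three stacked linear functions = a 3 x K weight matrix plus a
# 3-vector of intercepts: K numbers in, 3 numbers out.
K = 1000
rng = np.random.default_rng(42)
x = rng.normal(size=K)        # e.g. a flattened input of 1,000 numbers
W = rng.normal(size=(3, K))   # three linear functions, one per row
b = np.zeros(3)

h = W @ x + b                 # the K numbers become 3 numbers
```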
1060 00:34:16,440 --> 00:34:19,559 And as you will see in a few minutes, 1061 00:34:18,039 --> 00:34:20,759 that function is called an activation 1062 00:34:19,559 --> 00:34:22,320 function 1063 00:34:20,760 --> 00:34:23,359 and it's chosen to be a non-linear 1064 00:34:22,320 --> 00:34:24,559 function 1065 00:34:23,358 --> 00:34:26,759 because if you don't choose it to be a 1066 00:34:24,559 --> 00:34:28,719 non-linear function, all the effort we 1067 00:34:26,760 --> 00:34:30,280 are doing is going to be a total waste 1068 00:34:28,719 --> 00:34:32,839 of time. 1069 00:34:30,280 --> 00:34:34,399 Okay? For now, just 1070 00:34:32,840 --> 00:34:36,200 take it on faith that you need to have 1071 00:34:34,398 --> 00:34:39,480 non-linear functions here. 1072 00:34:36,199 --> 00:34:41,039 But, note that the three numbers here 1073 00:34:39,480 --> 00:34:42,079 are still three numbers. They are three 1074 00:34:41,039 --> 00:34:43,398 different numbers, but they're still 1075 00:34:42,079 --> 00:34:45,000 three numbers. 1076 00:34:43,398 --> 00:34:46,440 And once we do this, we'll be like, "You 1077 00:34:45,000 --> 00:34:48,119 know what? This was fun. Let's do it 1078 00:34:46,440 --> 00:34:51,918 again." 1079 00:34:48,119 --> 00:34:51,918 Okay? So, you can do it again. 1080 00:34:52,320 --> 00:34:55,720 And you can keep on doing it. You can 1081 00:34:53,559 --> 00:34:57,400 keep doing it 100 times if you want. 1082 00:34:55,719 --> 00:35:00,639 And the key thing is that every time you 1083 00:34:57,400 --> 00:35:03,079 do it, you're giving this network some 1084 00:35:00,639 --> 00:35:05,159 ability, some capacity to learn 1085 00:35:03,079 --> 00:35:07,799 something interesting from the data. 1086 00:35:05,159 --> 00:35:09,319 To learn an interesting representation. 1087 00:35:07,800 --> 00:35:10,680 Now, of course, you're thinking, "Well, 1088 00:35:09,320 --> 00:35:12,039 how do we know it's interesting?
How do 1089 00:35:10,679 --> 00:35:14,079 you know it's a useful thing?" And we'll 1090 00:35:12,039 --> 00:35:14,840 come to all that later on. 1091 00:35:14,079 --> 00:35:16,840 Right? We're just giving it the 1092 00:35:14,840 --> 00:35:17,960 capacity, the potential to learn 1093 00:35:16,840 --> 00:35:19,240 interesting things from the data. 1094 00:35:17,960 --> 00:35:21,199 Whether it actually lives up to its 1095 00:35:19,239 --> 00:35:23,000 potential, we don't know yet. 1096 00:35:21,199 --> 00:35:24,719 Okay? We'll give it the potential. 1097 00:35:23,000 --> 00:35:26,358 Because the more transformations of the 1098 00:35:24,719 --> 00:35:27,799 input data you make, the more 1099 00:35:26,358 --> 00:35:29,039 opportunity you have to do interesting 1100 00:35:27,800 --> 00:35:30,160 things with it. 1101 00:35:29,039 --> 00:35:31,480 If I don't even give you the opportunity 1102 00:35:30,159 --> 00:35:32,879 to transform it once, you don't have any 1103 00:35:31,480 --> 00:35:34,719 opportunity, right? 1104 00:35:32,880 --> 00:35:36,200 If I give you 10 chances to transform 1105 00:35:34,719 --> 00:35:38,039 things, you have 10 shots at doing 1106 00:35:36,199 --> 00:35:40,239 something useful. 1107 00:35:38,039 --> 00:35:42,159 So, you can you can do this repeatedly 1108 00:35:40,239 --> 00:35:44,759 and once we are done doing these 1109 00:35:42,159 --> 00:35:46,159 transformations, we just pipe it through 1110 00:35:44,760 --> 00:35:49,920 to our good old logistic regression 1111 00:35:46,159 --> 00:35:49,920 sigmoid here and we are done. 1112 00:35:50,440 --> 00:35:53,960 Okay? 1113 00:35:51,480 --> 00:35:55,960 So, this is the basic idea. 
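[Editor's note: the whole idea described so far — transform the input a few times with linear functions plus non-linearities, then hand the result to a plain logistic-regression head — can be sketched end to end. All layer sizes here are made up, and the small random weights stand in for coefficients a real network would learn from data.]

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def relu(a):
    return np.maximum(a, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=8)                               # raw input, K = 8

# Two hidden transformations, then a logistic-regression head.
# 0.1 scaling keeps the made-up weights small.
W1, b1 = 0.1 * rng.normal(size=(5, 8)), np.zeros(5)  # hidden layer 1
W2, b2 = 0.1 * rng.normal(size=(3, 5)), np.zeros(3)  # hidden layer 2
w_out, b_out = 0.1 * rng.normal(size=3), 0.0         # output layer

h1 = relu(W1 @ x + b1)            # transform once...
h2 = relu(W2 @ h1 + b2)           # ...and again ("this was fun, do it again")
p = sigmoid(w_out @ h2 + b_out)   # good old logistic regression at the end
```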
1114 00:35:53,960 --> 00:35:57,800 And so, just to contrast it, this was 1115 00:35:55,960 --> 00:35:59,240 good old logistic regression where we 1116 00:35:57,800 --> 00:36:00,519 take the input, 1117 00:35:59,239 --> 00:36:02,319 we run it through a linear function and 1118 00:36:00,519 --> 00:36:04,599 pop out a number, 1119 00:36:02,320 --> 00:36:06,080 a probability number. But, after we do 1120 00:36:04,599 --> 00:36:08,599 all this stuff, the input stays the 1121 00:36:06,079 --> 00:36:09,679 same, the output stays the same, but in 1122 00:36:08,599 --> 00:36:11,480 the middle you just run through a whole 1123 00:36:09,679 --> 00:36:12,639 bunch of these functions, you know, 1124 00:36:11,480 --> 00:36:14,358 these layers, boop boop boop boop, and 1125 00:36:12,639 --> 00:36:15,239 then we get the output. 1126 00:36:14,358 --> 00:36:16,559 Okay? 1127 00:36:15,239 --> 00:36:19,519 That's all we have done. 1128 00:36:16,559 --> 00:36:21,679 And this is a neural network. 1129 00:36:19,519 --> 00:36:25,079 A neural network is nothing more than 1130 00:36:21,679 --> 00:36:27,519 repeatedly transformed inputs which are 1131 00:36:25,079 --> 00:36:30,159 finally fed to a linear or logistic 1132 00:36:27,519 --> 00:36:30,159 regression model. 1133 00:36:35,400 --> 00:36:38,800 Any questions? 1134 00:36:37,559 --> 00:36:41,799 I have two questions. Could you use the 1135 00:36:38,800 --> 00:36:43,320 thing so that everyone can hear? Yeah. 1136 00:36:41,800 --> 00:36:45,240 I have two questions. Firstly, so when 1137 00:36:43,320 --> 00:36:48,080 we say that there isn't chance of 1138 00:36:45,239 --> 00:36:51,559 explainability, is it that we don't know 1139 00:36:48,079 --> 00:36:53,239 which arrow it went through? That's one. 1140 00:36:51,559 --> 00:36:54,960 Second, 1141 00:36:53,239 --> 00:36:57,239 who's controlling the number of 1142 00:36:54,960 --> 00:36:59,639 iterations or the number of functions? 
1143 00:36:57,239 --> 00:37:01,239 That's up to us or how does that work? 1144 00:36:59,639 --> 00:37:03,960 Right. So, yeah, so the the first 1145 00:37:01,239 --> 00:37:06,879 question, um explainability, we actually 1146 00:37:03,960 --> 00:37:09,119 know exactly for any given input input 1147 00:37:06,880 --> 00:37:10,760 uh data data point, we know exactly how 1148 00:37:09,119 --> 00:37:12,119 it flows through the network. So, there 1149 00:37:10,760 --> 00:37:15,680 is no problem there. 1150 00:37:12,119 --> 00:37:17,599 The problem is in ascribing, "Okay, this 1151 00:37:15,679 --> 00:37:20,159 we we think this person is going to be 1152 00:37:17,599 --> 00:37:21,880 uh repay the loan because 1153 00:37:20,159 --> 00:37:24,159 of this particular attribute." We don't 1154 00:37:21,880 --> 00:37:25,680 know that because those attributes all 1155 00:37:24,159 --> 00:37:27,358 get enmeshed together and goes through 1156 00:37:25,679 --> 00:37:29,119 this complicated thing. So, we know 1157 00:37:27,358 --> 00:37:31,480 exactly what happens. We just can't give 1158 00:37:29,119 --> 00:37:33,319 credit to anyone thing very easily. 1159 00:37:31,480 --> 00:37:35,480 I'm again, I'm just standing on the 1160 00:37:33,320 --> 00:37:36,280 brink of this vast ocean of something 1161 00:37:35,480 --> 00:37:38,519 called explainability and 1162 00:37:36,280 --> 00:37:39,960 interpretability, uh which I'll get to a 1163 00:37:38,519 --> 00:37:42,280 bit later on in the semester. But, 1164 00:37:39,960 --> 00:37:44,280 that's sort of the quick 1165 00:37:42,280 --> 00:37:46,880 kind of right-ish kind of wrong answer. 1166 00:37:44,280 --> 00:37:47,760 Okay? Number two, um 1167 00:37:46,880 --> 00:37:49,559 uh 1168 00:37:47,760 --> 00:37:51,000 we decide the number of layers. 
We 1169 00:37:49,559 --> 00:37:52,880 decide a whole bunch of things and as 1170 00:37:51,000 --> 00:37:53,920 we'll see in a few minutes, uh there is 1171 00:37:52,880 --> 00:37:55,640 something that's given to us and 1172 00:37:53,920 --> 00:37:58,840 something we get to design and I'll make 1173 00:37:55,639 --> 00:37:58,839 it very clear which is which. 1174 00:37:59,320 --> 00:38:01,600 Yeah. 1175 00:38:02,000 --> 00:38:06,320 Did I say your name right? Yeah. 1176 00:38:04,039 --> 00:38:08,840 So, which functions have to be linear 1177 00:38:06,320 --> 00:38:11,960 and also like why does it have to be 1178 00:38:08,840 --> 00:38:15,200 linear? Yeah. So, these functions uh the 1179 00:38:11,960 --> 00:38:16,920 f of x here, they have to be non-linear. 1180 00:38:15,199 --> 00:38:19,439 As to why they have to be non-linear, 1181 00:38:16,920 --> 00:38:22,559 we'll get to that in a few minutes. 1182 00:38:19,440 --> 00:38:23,480 Okay. So, these are called neurons. 1183 00:38:22,559 --> 00:38:25,239 Okay? 1184 00:38:23,480 --> 00:38:27,559 These things where you basically there's 1185 00:38:25,239 --> 00:38:29,358 a linear function followed by uh a 1186 00:38:27,559 --> 00:38:31,000 little non-linear function, 1187 00:38:29,358 --> 00:38:32,679 right? This is a Each one of these 1188 00:38:31,000 --> 00:38:34,239 things is called a neuron. 1189 00:38:32,679 --> 00:38:36,960 Um 1190 00:38:34,239 --> 00:38:39,719 By the way, you know, this is loosely 1191 00:38:36,960 --> 00:38:41,679 inspired by the way how, you know, uh 1192 00:38:39,719 --> 00:38:42,919 neurons work in a human in mammalian 1193 00:38:41,679 --> 00:38:45,599 brains. 1194 00:38:42,920 --> 00:38:47,880 But, the connections between 1195 00:38:45,599 --> 00:38:50,679 neuroscience and deep learning 1196 00:38:47,880 --> 00:38:52,599 are very heavily argued. 1197 00:38:50,679 --> 00:38:55,559 So, I'm going to like stay away from it. 1198 00:38:52,599 --> 00:38:57,559 Okay? 
Uh suffice it to say it's I I just 1199 00:38:55,559 --> 00:38:59,559 think of For for building practical deep 1200 00:38:57,559 --> 00:39:01,880 learning systems in industry, you don't 1201 00:38:59,559 --> 00:39:04,000 you don't worry about this. Okay? 1202 00:39:01,880 --> 00:39:06,880 All right, let's move on. 1203 00:39:04,000 --> 00:39:09,320 Terminology. Uh this vertical stack of 1204 00:39:06,880 --> 00:39:10,760 linear functions or neurons, 1205 00:39:09,320 --> 00:39:12,080 right? This vertical stack is called a 1206 00:39:10,760 --> 00:39:14,080 layer. 1207 00:39:12,079 --> 00:39:15,840 Right? This is a layer, that's a layer. 1208 00:39:14,079 --> 00:39:17,279 Uh and these little non-linear 1209 00:39:15,840 --> 00:39:20,440 functions, which we haven't gotten to 1210 00:39:17,280 --> 00:39:22,280 yet, are called activation functions. 1211 00:39:20,440 --> 00:39:25,240 Uh and we'll get to why they are called 1212 00:39:22,280 --> 00:39:25,240 that in just a second. 1213 00:39:25,320 --> 00:39:29,400 And 1214 00:39:26,920 --> 00:39:31,840 the input 1215 00:39:29,400 --> 00:39:34,079 is called an input layer and I have the 1216 00:39:31,840 --> 00:39:35,640 word layer in double quotes because like 1217 00:39:34,079 --> 00:39:36,759 it's not really doing anything, right? 1218 00:39:35,639 --> 00:39:39,279 It's just the input. 1219 00:39:36,760 --> 00:39:41,480 So, but we call it an input layer. 1220 00:39:39,280 --> 00:39:42,880 And what the very final thing that 1221 00:39:41,480 --> 00:39:45,280 produces outputs is called the output 1222 00:39:42,880 --> 00:39:48,360 layer, right? Obviously. And everything 1223 00:39:45,280 --> 00:39:50,200 in the middle is called a hidden layer. 1224 00:39:48,360 --> 00:39:52,440 Okay? 
1225 00:39:50,199 --> 00:39:54,960 So, the final piece of terminology is 1226 00:39:52,440 --> 00:39:56,240 that when you have a layer like this in 1227 00:39:54,960 --> 00:39:58,240 which say three numbers are coming out 1228 00:39:56,239 --> 00:40:00,799 and there's another another layer, 1229 00:39:58,239 --> 00:40:03,319 right? If every neuron in this layer is 1230 00:40:00,800 --> 00:40:05,280 connected to every neuron in this layer, 1231 00:40:03,320 --> 00:40:07,280 it's called a fully connected or dense 1232 00:40:05,280 --> 00:40:08,880 layer. So, for instance, here 1233 00:40:07,280 --> 00:40:10,360 this arrow that's 1234 00:40:08,880 --> 00:40:11,240 whatever the whatever number is coming 1235 00:40:10,360 --> 00:40:12,720 out. Let's say the number three is 1236 00:40:11,239 --> 00:40:15,239 coming out of this thing here. That 1237 00:40:12,719 --> 00:40:17,399 number three goes flows on this arrow to 1238 00:40:15,239 --> 00:40:19,559 this thing, flows on this arrow to this 1239 00:40:17,400 --> 00:40:21,200 neuron, and flows on this third arrow to 1240 00:40:19,559 --> 00:40:23,239 this neuron. That's what I mean. So, 1241 00:40:21,199 --> 00:40:25,159 every neuron, its output is being sent 1242 00:40:23,239 --> 00:40:27,559 to every neuron in the following layer. 1243 00:40:25,159 --> 00:40:29,319 Okay? That's we call it fully connected 1244 00:40:27,559 --> 00:40:30,599 or dense. 1245 00:40:29,320 --> 00:40:32,559 And then 1246 00:40:30,599 --> 00:40:34,480 if you look at logistic regression, 1247 00:40:32,559 --> 00:40:36,320 right? This is logistic regression. You 1248 00:40:34,480 --> 00:40:40,440 can see basically logistic regression is 1249 00:40:36,320 --> 00:40:40,440 a neural network with no hidden layers. 1250 00:40:41,000 --> 00:40:43,639 So, in some sense, logistic regression 1251 00:40:42,159 --> 00:40:45,440 is like almost the simplest possible 1252 00:40:43,639 --> 00:40:48,359 network you can think of. 
1253 00:40:45,440 --> 00:40:50,280 Like barely a neural network. 1254 00:40:48,360 --> 00:40:51,079 Right? It's got no no hidden layers. 1255 00:40:50,280 --> 00:40:52,440 That's what makes it logistic 1256 00:40:51,079 --> 00:40:54,239 regression. 1257 00:40:52,440 --> 00:40:56,119 And so, as you might have guessed by 1258 00:40:54,239 --> 00:40:58,879 now, deep learning is just neural 1259 00:40:56,119 --> 00:41:00,119 networks with lots and lots of 1260 00:40:58,880 --> 00:41:02,400 of what? 1261 00:41:00,119 --> 00:41:04,319 Yes, layers. 1262 00:41:02,400 --> 00:41:07,079 So, here are a few. 1263 00:41:04,320 --> 00:41:08,480 Uh and by the way, these are not even 1264 00:41:07,079 --> 00:41:10,039 considered all that, you know, 1265 00:41:08,480 --> 00:41:13,039 impressive these days. 1266 00:41:10,039 --> 00:41:16,039 Okay? Uh but I put them up because this 1267 00:41:13,039 --> 00:41:18,119 this thing here is called ResNet. 1268 00:41:16,039 --> 00:41:20,440 And it's famous because the ResNet 1269 00:41:18,119 --> 00:41:21,559 neural network was I think the first 1270 00:41:20,440 --> 00:41:24,039 network 1271 00:41:21,559 --> 00:41:26,799 to surpass human-level performance in 1272 00:41:24,039 --> 00:41:28,920 image classification. 1273 00:41:26,800 --> 00:41:31,039 Sort of it it's sort of like the Skynet 1274 00:41:28,920 --> 00:41:32,960 of image classification. Okay? It 1275 00:41:31,039 --> 00:41:34,159 surpassed human-level performance. And 1276 00:41:32,960 --> 00:41:36,320 I'm putting it up here because we'll 1277 00:41:34,159 --> 00:41:37,759 actually work with ResNet on next next 1278 00:41:36,320 --> 00:41:39,280 Wednesday. And we'll actually take 1279 00:41:37,760 --> 00:41:41,920 ResNet, we'll fine-tune it, and solve a 1280 00:41:39,280 --> 00:41:43,640 real problem in class. 1281 00:41:41,920 --> 00:41:46,000 All right. So, it's got lots and lots of 1282 00:41:43,639 --> 00:41:47,159 layers. 
Uh now, let's turn to these 1283 00:41:46,000 --> 00:41:48,800 activation functions. We've been 1284 00:41:47,159 --> 00:41:49,839 ignoring these little guys, right? So 1285 00:41:48,800 --> 00:41:52,800 far. 1286 00:41:49,840 --> 00:41:54,920 So, the activation function at a node is 1287 00:41:52,800 --> 00:41:56,960 a first of all, it's a function that 1288 00:41:54,920 --> 00:41:58,639 receives a single number and outputs a 1289 00:41:56,960 --> 00:42:00,760 single number, right? It's not very 1290 00:41:58,639 --> 00:42:03,000 complicated, right? It receives 1291 00:42:00,760 --> 00:42:04,560 basically this this is a linear function 1292 00:42:03,000 --> 00:42:06,679 which receives all these inputs. It 1293 00:42:04,559 --> 00:42:07,880 could be 10 inputs, 1,000 inputs, 1294 00:42:06,679 --> 00:42:09,559 runs it through a linear function, 1295 00:42:07,880 --> 00:42:12,200 outputs a number, and that single 1296 00:42:09,559 --> 00:42:14,759 number, a scalar, goes in here, and it 1297 00:42:12,199 --> 00:42:16,599 comes out as another single number. 1298 00:42:14,760 --> 00:42:18,000 Just just just remember that. 1299 00:42:16,599 --> 00:42:19,480 And so, these are some of the most 1300 00:42:18,000 --> 00:42:21,519 common activation functions. In fact, 1301 00:42:19,480 --> 00:42:23,400 the sigmoid we saw, which is actually we 1302 00:42:21,519 --> 00:42:25,639 use for the output, is actually a kind 1303 00:42:23,400 --> 00:42:28,119 of activation function where a single 1304 00:42:25,639 --> 00:42:30,000 number comes in and it gets mapped into 1305 00:42:28,119 --> 00:42:31,799 this curve because of this thing. So, 1306 00:42:30,000 --> 00:42:33,920 the single number that comes in is A, 1307 00:42:31,800 --> 00:42:37,160 and it and it gets transformed as 1 / 1 1308 00:42:33,920 --> 00:42:38,880 + e ^ -A, and you get a shape like this, 1309 00:42:37,159 --> 00:42:40,679 and it's called the sigmoid activation 1310 00:42:38,880 --> 00:42:41,840 function. 
And And And as you can see 1311 00:42:40,679 --> 00:42:44,319 here, 1312 00:42:41,840 --> 00:42:45,920 for very small values, for very negative 1313 00:42:44,320 --> 00:42:47,840 values, 1314 00:42:45,920 --> 00:42:50,280 it's going to be pretty close to zero, 1315 00:42:47,840 --> 00:42:52,559 meaning it won't get activated. 1316 00:42:50,280 --> 00:42:53,680 And for very very large values, it's 1317 00:42:52,559 --> 00:42:55,360 going to be 1318 00:42:53,679 --> 00:42:57,759 pretty close to one. 1319 00:42:55,360 --> 00:42:59,079 All the action happens in the middle. 1320 00:42:57,760 --> 00:43:00,160 When your When your When your values are 1321 00:42:59,079 --> 00:43:03,119 somewhere in this range, there's a 1322 00:43:00,159 --> 00:43:05,079 dramatic increases in what comes out. 1323 00:43:03,119 --> 00:43:06,440 Okay? So, that little thing in the 1324 00:43:05,079 --> 00:43:07,799 middle is a sweet spot for these 1325 00:43:06,440 --> 00:43:08,639 functions. 1326 00:43:07,800 --> 00:43:10,000 Uh 1327 00:43:08,639 --> 00:43:11,440 and this 1328 00:43:10,000 --> 00:43:12,760 I you know, I'm also almost embarrassed 1329 00:43:11,440 --> 00:43:13,880 to call it an activation function 1330 00:43:12,760 --> 00:43:15,520 because it's literally not doing 1331 00:43:13,880 --> 00:43:16,880 anything. It's sort of getting a nice 1332 00:43:15,519 --> 00:43:18,639 label for free. 1333 00:43:16,880 --> 00:43:19,720 Um right? You basically it says you just 1334 00:43:18,639 --> 00:43:20,839 get a number, just pass it straight 1335 00:43:19,719 --> 00:43:22,359 along. 1336 00:43:20,840 --> 00:43:23,720 It's a linear activation function, but 1337 00:43:22,360 --> 00:43:25,599 just for completeness, I want to put it 1338 00:43:23,719 --> 00:43:28,319 here. 
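[Editor's note: the sigmoid's behavior at the extremes is easy to check numerically. A minimal sketch; the probe values -10, 0, and 10 are arbitrary.]

```python
import numpy as np

# The sigmoid activation: one number in, one number out, 1 / (1 + e^-a).
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Very negative inputs land near 0, very positive ones near 1;
# all the action is in the middle, around a = 0.
lo, mid, hi = sigmoid(-10.0), sigmoid(0.0), sigmoid(10.0)
```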
1339 00:43:25,599 --> 00:43:30,920 And then we come to the hero of deep 1340 00:43:28,320 --> 00:43:32,000 learning, which is the rectified linear 1341 00:43:30,920 --> 00:43:34,519 unit, 1342 00:43:32,000 --> 00:43:37,079 right? Rectified linear unit. It's 1343 00:43:34,519 --> 00:43:38,519 called ReLU. Uh and ReLU is going to 1344 00:43:37,079 --> 00:43:41,039 become part of your vocabulary very very 1345 00:43:38,519 --> 00:43:43,000 quickly. Uh and so, ReLU is actually a 1346 00:43:41,039 --> 00:43:44,920 very interesting function. So, you write 1347 00:43:43,000 --> 00:43:46,320 it as maximum of whatever number and 1348 00:43:44,920 --> 00:43:48,360 zero, 1349 00:43:46,320 --> 00:43:50,600 which is another way of saying if the 1350 00:43:48,360 --> 00:43:53,480 number is positive, just send it along 1351 00:43:50,599 --> 00:43:56,639 unchanged. If the number is negative, 1352 00:43:53,480 --> 00:43:57,639 send a zero instead. Squish it to zero. 1353 00:43:56,639 --> 00:43:59,799 So, which means if the number is 1354 00:43:57,639 --> 00:44:03,039 negative, nothing happens. If the number 1355 00:43:59,800 --> 00:44:03,039 is positive, it wakes up. 1356 00:44:03,239 --> 00:44:07,159 So, what happens is that you could have 1357 00:44:04,920 --> 00:44:09,320 a very complicated linear function with 1358 00:44:07,159 --> 00:44:10,519 millions of variables, and then it puts 1359 00:44:09,320 --> 00:44:12,000 a single number, and that number 1360 00:44:10,519 --> 00:44:13,239 unfortunately happens to be negative. 1361 00:44:12,000 --> 00:44:15,199 The ReLU is not impressed. It's going to 1362 00:44:13,239 --> 00:44:17,519 send a zero out. 1363 00:44:15,199 --> 00:44:20,279 Okay? It's a very simple function. 
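[Editor's note: ReLU is short enough to state in one line of code. The sample inputs below are arbitrary.]

```python
import numpy as np

# ReLU: max(a, 0). A negative number is squashed to zero
# ("not impressed"); a positive number passes through unchanged.
def relu(a):
    return np.maximum(a, 0.0)

out = relu(np.array([-2.0, -0.5, 0.0, 0.5, 2.0]))
```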
1364 00:44:17,519 --> 00:44:22,559 And many many folks who've been in deep 1365 00:44:20,280 --> 00:44:23,480 learning for a long long time believe 1366 00:44:22,559 --> 00:44:25,519 that 1367 00:44:23,480 --> 00:44:26,760 the use of the ReLUs is one of the key 1368 00:44:25,519 --> 00:44:28,840 factors 1369 00:44:26,760 --> 00:44:30,440 that led to the amazing success of deep 1370 00:44:28,840 --> 00:44:32,160 learning because it's got some very 1371 00:44:30,440 --> 00:44:33,880 interesting properties, 1372 00:44:32,159 --> 00:44:35,759 uh which we'll get to hopefully on 1373 00:44:33,880 --> 00:44:40,039 Wednesday. 1374 00:44:35,760 --> 00:44:42,000 Okay. So, the shorthand here is that um 1375 00:44:40,039 --> 00:44:43,639 whenever you see this thing, it's just a 1376 00:44:42,000 --> 00:44:44,679 linear activation, linear function 1377 00:44:43,639 --> 00:44:47,319 followed by just sending it straight 1378 00:44:44,679 --> 00:44:49,119 out. If I If you do this this If I put a 1379 00:44:47,320 --> 00:44:51,519 ReLU in here, I'm going to denote it 1380 00:44:49,119 --> 00:44:53,239 like that, which mimics the graph 1381 00:44:51,519 --> 00:44:54,719 uh how it looks. And if I'm going If I 1382 00:44:53,239 --> 00:44:55,839 put a sigmoid, I'm just going to use 1383 00:44:54,719 --> 00:44:56,839 this thing here. 1384 00:44:55,840 --> 00:44:59,941 Okay? 1385 00:44:56,840 --> 00:45:00,240 Just a visual shorthand. 1386 00:44:59,940 --> 00:45:02,358 >> [clears throat] 1387 00:45:00,239 --> 00:45:03,839 >> There are many other functions 1388 00:45:02,358 --> 00:45:05,079 activation functions, by the way. 1389 00:45:03,840 --> 00:45:07,840 There's something called the tan h 1390 00:45:05,079 --> 00:45:10,960 function, the leaky ReLU, the GELU, the 1391 00:45:07,840 --> 00:45:12,640 Swish. 
I mean, it's like a menagerie of 1392 00:45:10,960 --> 00:45:14,280 activation functions because very often 1393 00:45:12,639 --> 00:45:15,799 researchers will be like, "Well, I don't 1394 00:45:14,280 --> 00:45:17,040 like this activation function. Here's a 1395 00:45:15,800 --> 00:45:18,080 little modified version of the function 1396 00:45:17,039 --> 00:45:20,400 which is going to be better for certain 1397 00:45:18,079 --> 00:45:22,480 things." So, you know, people's research 1398 00:45:20,400 --> 00:45:24,400 creativity is sort of on this point has 1399 00:45:22,480 --> 00:45:26,519 gone unhinged. Um so, there's lots of 1400 00:45:24,400 --> 00:45:27,760 options. But if you just stick to the 1401 00:45:26,519 --> 00:45:29,519 ReLU 1402 00:45:27,760 --> 00:45:31,720 for your hidden layers, you can 1403 00:45:29,519 --> 00:45:32,519 basically get anything done practically, 1404 00:45:31,719 --> 00:45:34,039 right? You don't have to worry about 1405 00:45:32,519 --> 00:45:37,280 anything else. So, we'll only focus on 1406 00:45:34,039 --> 00:45:38,559 ReLUs for all the intermediate stuff. Uh 1407 00:45:37,280 --> 00:45:40,400 yeah. 1408 00:45:38,559 --> 00:45:41,840 Yeah, how do you gauge which activation 1409 00:45:40,400 --> 00:45:42,720 function is more suited for your use 1410 00:45:41,840 --> 00:45:45,280 case? 1411 00:45:42,719 --> 00:45:48,000 Yeah. So, the rule of thumb here is that 1412 00:45:45,280 --> 00:45:49,680 for your hidden layers, use ReLUs, 1413 00:45:48,000 --> 00:45:51,880 right? Because empirically we have seen 1414 00:45:49,679 --> 00:45:54,199 that they they do an amazing job. 1415 00:45:51,880 --> 00:45:56,320 For your output layer, your very final 1416 00:45:54,199 --> 00:45:57,960 thing, you actually don't have a choice 1417 00:45:56,320 --> 00:45:59,640 because what you have to use depends on 1418 00:45:57,960 --> 00:46:01,199 what kind of output you have to work 1419 00:45:59,639 --> 00:46:02,679 with. 
If it's an output which is a 1420 00:46:01,199 --> 00:46:04,480 probability number between zero and one, 1421 00:46:02,679 --> 00:46:05,839 you have to use a sigmoid. 1422 00:46:04,480 --> 00:46:07,559 Um if it is 1423 00:46:05,840 --> 00:46:08,960 say 10 numbers, all of which have to be 1424 00:46:07,559 --> 00:46:10,119 probabilities, and they have to add up 1425 00:46:08,960 --> 00:46:10,880 to one, 1426 00:46:10,119 --> 00:46:12,199 you got to use something called the 1427 00:46:10,880 --> 00:46:13,960 softmax, which we'll get to on 1428 00:46:12,199 --> 00:46:15,679 Wednesday. So, it really depends on the 1429 00:46:13,960 --> 00:46:16,760 output, and the nature of the output 1430 00:46:15,679 --> 00:46:18,599 dictates what you use in the output 1431 00:46:16,760 --> 00:46:19,920 layer. 1432 00:46:18,599 --> 00:46:22,000 Okay. 1433 00:46:19,920 --> 00:46:24,880 So, coming back to this. So, if you want 1434 00:46:22,000 --> 00:46:27,280 to design a deep neural network, 1435 00:46:24,880 --> 00:46:29,599 uh the input is the input. 1436 00:46:27,280 --> 00:46:30,960 The output is the output. And so, you 1437 00:46:29,599 --> 00:46:32,880 get to choose everything else. You get 1438 00:46:30,960 --> 00:46:35,320 to choose the number of hidden layers, 1439 00:46:32,880 --> 00:46:37,559 the number of neurons in each layer, the 1440 00:46:35,320 --> 00:46:39,600 activation functions you're going to use 1441 00:46:37,559 --> 00:46:41,119 and uh for the hidden layers, and then 1442 00:46:39,599 --> 00:46:42,759 you have to make sure that the what you 1443 00:46:41,119 --> 00:46:44,279 choose for the output layer matches the 1444 00:46:42,760 --> 00:46:46,840 kind of output you want to generate. 1445 00:46:44,280 --> 00:46:48,680 Okay? So, this is this sort of This is 1446 00:46:46,840 --> 00:46:51,120 all in your hands. You decide what 1447 00:46:48,679 --> 00:46:52,799 happens. 
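[Editor's note: the rule of thumb from the answer above — sigmoid when the output is one probability, softmax when several class probabilities must sum to one — can be sketched as follows. Softmax is only previewed here (the lecture covers it on Wednesday), and the input scores are made-up values.]

```python
import numpy as np

def sigmoid(a):
    # one probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-a))

def softmax(z):
    # several probabilities that add up to 1
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

p_binary = sigmoid(1.3)                          # e.g. a dog-vs-cat output
p_classes = softmax(np.array([2.0, 1.0, 0.1]))   # e.g. a 3-class output
```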
But 1448 00:46:51,119 --> 00:46:53,719 you will there there's a lot of guidance 1449 00:46:52,800 --> 00:46:56,080 for how to do these things, which we'll 1450 00:46:53,719 --> 00:46:57,679 which we'll cover as we go along. 1451 00:46:56,079 --> 00:47:00,519 Did you have a question? 1452 00:46:57,679 --> 00:47:03,279 Kind of, but I guess I'll do it. 1453 00:47:00,519 --> 00:47:05,400 Is Is there also exploration in kind of 1454 00:47:03,280 --> 00:47:07,920 dynamic uh 1455 00:47:05,400 --> 00:47:11,400 setting up layers so that your users 1456 00:47:07,920 --> 00:47:11,400 determine the number of layers 1457 00:47:12,599 --> 00:47:16,719 Yeah. So, there's a whole field called 1458 00:47:14,320 --> 00:47:18,680 neural architecture search, NAS, 1459 00:47:16,719 --> 00:47:20,480 where we can actually try a whole bunch 1460 00:47:18,679 --> 00:47:22,319 of different architectures, 1461 00:47:20,480 --> 00:47:23,800 uh and then use some optimization and in 1462 00:47:22,320 --> 00:47:25,640 fact reinforcement learning, which we 1463 00:47:23,800 --> 00:47:27,160 won't get to in this class, 1464 00:47:25,639 --> 00:47:28,440 as a way to figure out really good 1465 00:47:27,159 --> 00:47:32,199 architectures for any particular 1466 00:47:28,440 --> 00:47:33,760 problem. Uh but the 1467 00:47:32,199 --> 00:47:34,799 the question of okay, 1468 00:47:33,760 --> 00:47:36,480 when I'm training a model with a 1469 00:47:34,800 --> 00:47:37,840 particular kind of data, 1470 00:47:36,480 --> 00:47:39,039 the first pass through the training 1471 00:47:37,840 --> 00:47:40,240 data, I'm going to use two layers. The 1472 00:47:39,039 --> 00:47:42,440 second pass, I'm going to do seven 1473 00:47:40,239 --> 00:47:44,039 layers. That is not done. 
And the reason it's not done is because of certain other constraints we have in how we can do the optimization and the gradient descent and things like that. But what you can do, and we'll look at this thing called dropout: for certain layers, each time you run data through the network, you can decide that in this layer I'm not going to use all the nodes; I'm going to drop out a few of the nodes randomly. And it's a very effective technique to prevent overfitting, and we'll come to that a little later on. Yeah?
>> So, one question regarding neural networks is about the coefficients. Is this something we decide, or do we have to use a predefined coefficient for the weights?
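The dropout idea just mentioned fits in a few lines. This is an illustrative NumPy sketch of the usual "inverted dropout" formulation, not anything specific to this course's implementation:

```python
import numpy as np

def dropout(activations, rate=0.5, training=True, rng=np.random.default_rng(0)):
    # During training, randomly zero a fraction `rate` of the nodes' outputs
    # and rescale the survivors so their expected sum is unchanged
    # ("inverted dropout"). At prediction time, pass everything through.
    if not training:
        return activations
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

a = np.ones(10)
print(dropout(a, rate=0.5))        # some entries zeroed, survivors scaled to 2.0
print(dropout(a, training=False))  # unchanged at prediction time
```

The rescaling is why dropout needs no special handling at prediction time: the surviving activations are boosted during training so the next layer sees the same expected input either way.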
No, the whole trick here, the whole name of the game, is that we use the data, the training data, and something called a loss function, which I'll get to on Wednesday, along with an optimization algorithm, so that the network figures out by itself what the weights need to be, what the coefficients need to be, so as to minimize prediction error. And that's the whole thing. The magic here is that we don't have to do anything. We only have to set it up, sit back, often for many hours, and watch it do its thing. Yeah?
>> Just one quick question. You mentioned nodes just now when you were answering Roland's question. Can you confirm exactly what a node is? I have an idea that it's basically any circle, but...
>> Yeah, yeah, you just asked for a lot more detail. Sure. When I'm referring to a node, I'm literally referring to something like this. Think of it as a linear function followed by a non-linear activation.
So, it reads a bunch of inputs, runs them through a linear function, and passes the result through, say, a ReLU or a sigmoid or something, and out pops a number. So, in general, a node will have many numbers potentially coming in, but only one number going out. Now, that one number may get copied to every node in the next layer, but what comes out of that particular node is just a single number. All right. So, let's use a DNN for our interview example. In this problem we had two inputs, right? GPA and experience. The output variable has to be between zero and one, because you're trying to predict the probability that someone will get called for an interview. So, the input size is fixed and the output is fixed. And since it's really the very first network we're actually playing with, let's just start simple, right? We'll just have one hidden layer, and we'll have three neurons, right?
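A node as just described, many numbers in, one number out, can be sketched like this; the weights and bias below are made-up illustrative values, not anything from the slides:

```python
import numpy as np

def relu(z):
    return max(z, 0.0)

def node(inputs, weights, bias):
    # Linear function of the inputs, then a non-linear activation (ReLU here).
    # Many numbers come in; a single number comes out.
    z = np.dot(inputs, weights) + bias
    return relu(z)

x = np.array([3.6, 2.0])                     # e.g. GPA and years of experience
print(node(x, np.array([0.4, -0.1]), 0.2))   # one number out
```

Whatever the number of inputs, the output of one node is always a single scalar, which is exactly what then gets copied to every node in the next layer.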
And as I mentioned in response to Tommaso's question from before, if you are choosing activation functions for the hidden layers, just go with the ReLU as a default. It usually works really well out of the box. So, we'll just use a ReLU, and since the output has to be between zero and one, we don't have a choice. We have to use a sigmoid for the output layer. Okay? That's it. Those are the design choices, and when we make them, this is what it looks like, right? We have two inputs, X1 and X2, GPA and experience, and they go through these three ReLUs, and out come three numbers, and they pass through a sigmoid, and we get a probability Y at the end. All right, quick question. Concept check. How many parameters, both weights and biases, does this network have? Let's take a moment to count.
All right, any guesses?
>> 12.
I think you're almost there.
Are folks going to be doing a binary search on this now? Okay. Uh, no. Yes?
>> 13.
Yes, very good. So, that's 13, and my guess is that the reason you came up with 12, and I made the same mistake, that's why I know, is that you probably forgot this green thing here. What folks often forget is the bias, right? We all count the weights, right? And the easy way to do it is: two inputs here, three neurons here, so two times three is six; three times one is three; that's nine; and then you have to add up all the intercepts. Right? So, you get 13. And so, when we get to very complicated networks, the first two or three times you work with a very complex network, and we'll do it starting very soon, just get into the habit of hand-calculating the number of parameters, just to make sure you understand what's going on.
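The hand count can also be checked mechanically. This sketch applies the counting rule just described, weights per layer plus one bias per neuron, to the 2-input, 3-hidden-neuron, 1-output network from the example:

```python
def count_parameters(layer_sizes):
    # Each fully connected layer contributes (inputs x neurons) weights
    # plus one bias (intercept) per neuron.
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total

# 2 inputs -> 3 hidden neurons -> 1 output:
# weights: 2*3 + 3*1 = 9, biases: 3 + 1 = 4
print(count_parameters([2, 3, 1]))  # 13
```

The same function works for any stack of fully connected layers, which is handy for the habit of sanity-checking parameter counts on bigger networks.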
Once you get it right a couple of times, you don't have to do it anymore. Okay? The first couple of times, hand-calculate to make sure you get it. Okay. So, let's say that we have trained this network using techniques which we'll cover on Wednesday, and it comes back to you after training and says, "Okay, these are the optimal, the best, values for the weights and the biases that I have found." So, now your network is ready for action. It's ready to be used. And so, let's say that you want to predict with this network: if you have X1 and X2, what comes out of this top neuron, right? Let's call it A1. It's basically this. Okay? That's what's coming out of this thing. For any X1 and X2, this is what's coming out. Similarly for A2 and A3. Okay?
And then what comes out at the very end is basically A1 times that, plus A2 times that, plus A3 times that, plus 0.05, and the whole thing gets run through the sigmoid, and this is what you get. Okay? So, this slide and the one before: just make sure you look at them afterwards, to make sure you totally understand the mechanics, because this is really important. If you don't fully internalize the mechanics, when we get to things like transformers, it's going to get hard. Okay? So, just make sure it's automatic at this point. It should be reflexive. Okay. And so, when you want to predict anything, you just run some numbers through it, you get all these things, and boom, you calculate it. It turns out to be 22.6%. That's the answer. All right. So, let's say that you built this network, and now we're like, "Hey, given any X1 and X2, I can come up with a Y."
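The full forward pass just walked through, three ReLU nodes and then a sigmoid, can be sketched end to end. All of the weight and bias values below are made up for illustration, since the trained values live on the slide:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical trained parameters: 2 inputs -> 3 hidden ReLUs -> 1 sigmoid output.
# Note the count: 6 + 3 + 3 + 1 = 13 parameters.
W1 = np.array([[0.4, -0.3,  0.1],
               [0.2,  0.5, -0.2]])   # 2x3 weights into the hidden layer
b1 = np.array([0.1, -0.1, 0.05])     # 3 hidden biases
W2 = np.array([0.7, -0.6, 0.3])      # 3 weights into the output
b2 = 0.05                            # output bias

def predict(x1, x2):
    a = relu(np.array([x1, x2]) @ W1 + b1)   # A1, A2, A3 from the hidden layer
    return sigmoid(a @ W2 + b2)              # probability of an interview call

print(predict(3.6, 2.0))   # some probability between 0 and 1
```

Running any X1 and X2 through `predict` reproduces exactly the mechanics described: linear step, ReLU, linear step, sigmoid, one probability out.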
But I'm feeling a little mathy. Can we actually write down the function? Yeah, you can write down the function. This is what it looks like. Super interpretable, right? So, this goes to the comment that you made earlier on, Itai, where the act of depicting something using this sort of graphical layout makes it so much easier to reason with and to think about, compared to trying to figure out what this function is doing. Right? The other point I want to make is this: just contrast what we just saw with the logistic regression model we saw earlier, which was this little function. And here, even this simple network, with just three nodes in that single hidden layer, right? It's so much more complicated than the logistic regression model. So much more complicated, right? And it is from this complexity that springs the ability of these networks to do basically magical things. Right? That's where the complexity comes from.
That's where the magic comes from. And here, in this case, the number of variables hasn't even changed. It's still only two. But we can go from the two inputs to the one output in very complicated ways, as long as we know how to train these networks the right way. That's sort of the secret sauce, which we'll spend a lot of time on. So, yeah. To summarize, this is what we have. It's a deep neural network. By the way, this kind of network, where things just flow from left to right, is called a feedforward neural network, in contrast to some other kinds of networks called recurrent networks, which you won't get to in this class, because transformers have actually proven to be much more capable than recurrent networks and have become the norm, so we'll just focus on those instead. And so, this arrangement of neurons into layers and activation functions and all that stuff is called the architecture of the neural network.
And as you will see later on, the transformer, the famous transformer network, is just an example of a particular neural network architecture, much like the convolutional neural networks we'll get to next week for computer vision are another example of a particular architecture. So, we will focus on transformers. They are a particular kind of architecture. All right. So, in summary, this is what we have. You get to choose the hidden layers, the neurons, the activation functions, stuff like that. The inputs and outputs are what you have to work with. And so, we will actually take this idea and use it to solve a problem from start to finish on Wednesday. So, I think I'm done. I give you three minutes back of your day. Thank you.
>> [applause]