Okay. So today we start the natural language processing sequence. To give you a quick idea of where we're headed: we're going to start with what's called vectorization, then the bag-of-words model, and then we'll spend a fair amount of time on a Colab. Then on Wednesday we talk about these things called embeddings, which you'll come to appreciate over the next couple of weeks as the core atomic unit of all modern natural language processing, and for that matter vision processing as well. The following week we'll do transformers, two lectures on transformers: we'll get into the theory and then into a bunch of applications. And then lectures nine and ten will be all about LLMs. So it's going to be a lot of fun. This is one of my favorite segments of the class; of course, truth be told, every segment of the class is my favorite, so don't judge me. All right, so let's get going.

So, why natural language processing? The things I have on the slide here are in some sense obvious, but I think it's worth reminding ourselves of how important text is for everything we do. Human knowledge is mostly encoded as text. The internet is mostly text, or at least this was true until the advent of TikTok and YouTube. Human communication is mostly text, and cultural production, you know, movies, books, the arts and so on, is so text-heavy. So text forms not just a big chunk of all the media that's out there; it also happens to be the way in which we think and communicate. Its primacy is, in my opinion, unparalleled in how we think about the world.
And so the tantalizing possibility is this: imagine if we had an AI system that could just read and, quote unquote, understand all this text. You can imagine such a system reading all of PubMed, reading all the medical literature, and then coming back and saying: for this particular disease, this particular protein is actually the malfunctioning protein, and that small molecule is going to dock into the protein and cure the disease. And you didn't know this; it came back and told you that. Wouldn't that be unbelievable? My feeling is that such things are going to happen. It's just that they're not going to happen soon enough for my lifetime, but perhaps they'll happen in yours. All right. Okay. So, let's continue.

NLP is in action all around us. According to Google, Google autocomplete, which uses a fair bit of NLP, saves 200 years of typing time every day. I actually wasn't very impressed with this number, frankly, because billions of searches are being done every day, and I'm like, only 200 years? But I think the more important point is that it made mobile possible. If you didn't have autocomplete, people would not be typing and pecking on their keyboards; it would be much worse, and it would have had a hugely dampening effect on e-commerce, for instance. So this humble little autocomplete has an incredible impact on the world economy. And the other thing I heard about, I'm not sure if it's 100% true, but it's an interesting example: apparently the very first iPhone keyboard that came out, the soft keyboard, not a hard keyboard, had some very basic word-continuation prediction going on.
And so when you start typing T and H, obviously it's going to guess that an E is going to come next; that part is old news, nothing new there. But apparently the E key on the keyboard would become slightly bigger, so when your finger goes towards it, it has a better shot of actually connecting with it. So these kinds of things are used to change the UI in real time in a whole bunch of applications, and you just don't even realize it. All right. And of course we all know about LLMs at this point. So I asked one to write a limerick about the beauty and power of deep learning yesterday, and it said: in a world where data flows like a stream, deep learning is more than a dream; sifts through the noise with an elegant poise, unveiling insights that gleam. Cool, right? All right, so let's get back to work.

NLP has extraordinary potential for making products and services much, much smarter. And what I want to point out here is that even if you focus on this very simple formalism, a bunch of text comes in, a bunch of text goes out, that's it; this humble little text-in, text-out formalism has just an enormous range of applicability.
So obviously you can send a bunch of text in and ask it to classify it: for sentiment, to route it for customer support, to figure out the intent of what the person is asking in search, or to content-filter it to make sure there's no toxic, abusive stuff going on. The possibilities for just text classification are numerous, but that's a use case we're all familiar with, so no surprise there. Now, text extraction we may be less familiar with. The idea is that you can look at a lot of unstructured textual data and extract all sorts of interesting entities from it. Hedge funds use it very heavily; they will extract all sorts of company information from news articles. And then doctor's notes: there are a whole bunch of NLP startups that will take the doctor-patient conversation, transcribe it, and then extract disease codes, diagnosis codes, medication codes, and things like that. So the possibilities for this are enormous. Of course there's text summarization, which we have all been doing thanks to ChatGPT: take text in, and any kind of summary that comes out of the text is just text out. And then text generation: marketing copy, sales emails, market summaries, and so on, including, troublingly for educators, college application essays.

Code generation is a more subtle example of text out, because code is just text; so text-in, text-out also covers text in, code out. Okay. And question answering.
So you can take a whole bunch of documents, you can add a bit of text to it which is your question, and this whole thing at the end of the day is just text in; and then you can use it to answer questions and therefore create chatbots for all sorts of interesting applications.

And if you look at this example, call centers: that is where a lot of money is being spent right now, building call center chatbots for text-in, text-out question answering. If you drill into this, imagine taking all the call center transcripts and the internal product documentation, service documentation, FAQs, etc., and sticking it in. You can start to answer these kinds of questions: yesterday, what were the top reasons why customers were upset with us? Which interventions made by the agent actually worked, and which did not? What characterizes the best agents from the rest? How should we grade this particular agent's interaction with this particular customer? How should we change the call center script? How should we coach the agent in real time? Every one of these applications is amenable to this very humble text-in, text-out model.

Okay. And of course everybody now knows this potential because of the advent of large language models. By the way, Google released something called Gemini 1.5 Pro a couple of days ago, and it's incredible. It's incredible, right? Anyway, we'll get back to that later. But the point is that the kind of potential we have is just amazing, even for text in, text out. Okay.
And as you would imagine...
>> Though we are calling it language, this is all primarily English, right?
>> Now, there are lots of multilingual models as well. By that I mean models which are specialized to other, non-English languages, and models which are truly multilingual, polyglot models; both kinds are available right now, and many modern LLMs are actually trained from the get-go to be multilingual in a bunch of what are called high-resource languages, languages which are spoken by lots of people. But actually, it's funny you should ask that question, because of this Google Gemini model I just described. There is a language called Kalamang which is spoken by 200 people in the world, and a researcher had created one book which is sort of a grammar manual for Kalamang, because there are no other written works in that language. So what they did is they took a whole bunch of English dialogue and this book, fed it into Google Gemini 1.5 Pro, and it translated into Kalamang at human-level proficiency. It had never seen the language before. So that's an example of this.

Yes. So the question text here is all the things you want to translate from English to Kalamang; the documents here are just one document, singular, the grammar book, the manual; and then what comes out is a translation. So these models, even when they're not explicitly trained on a different language, if you give them enough of, sort of, grammar manuals and stuff like that, may do a pretty decent job from the get-go with no training. It's kind of a shocker. Two years ago people would have said that's impossible. All right, so back to this. All right.
And as you folks may already know, and maybe you're in fact participating in this gold rush already, lots of people are creating lots of really cool companies to take some of these ideas and turn them into really interesting products and services. So if you're not doing it, and if you've been thinking about entrepreneurial stuff, here's a word of advice: take the plunge.

Dismissed. Just kidding. All right. And as you can imagine, enterprise vendors are rushing to add NLP to all their products. Salesforce Einstein now has Einstein GPT, Microsoft has Copilot; the list goes on. Everybody is scrambling and really trying hard to infuse some GPT magic into whatever they're doing. Some of it is real, a lot of it is not. Okay.

So, let's go to the arc of NLP progress. How did we get to the kind of crazy times that we live in? If you look at natural language processing, basically the effort to take language, analyze it, and make predictions with it, the first phase was just handcrafted rules based on linguistics. These were linguists who would really understand the grammar of a language, and they would use their deep knowledge of linguistics to figure out all these rules by which you can process and analyze natural language text. And then this other thing came along, the statistical machine learning approach, which basically said: never mind all that complicated knowledge of linguistics and grammar. Why don't we simply count things? Let's count the number of times these two words co-occur; let's count that, let's count this; basically, just count a lot, and let's see if that works for predicting things, say for classifying text and so on.
And shockingly, those methods ended up being really good. They ended up being really good, and in fact they were better than the lovingly hand-curated, linguistically driven rules. So much so that there is a famous quote which says, "Every time I fire a linguist, the performance of the speech recognizer goes up." Obviously said in jest, but there is a kernel of truth to it.

So that's where we were, and then deep learning happened, roughly in 2012. Then we had these things called recurrent neural networks, which are based on deep learning and which actually moved the ball forward. And then in 2017, something called the transformer was invented, and the transformer replaced everything else across the board. So we're just going to leapfrog directly to transformers; we will not spend any time on recurrent neural networks. That is not to say they are dead: there is some very interesting work that is now trying to revive recurrent neural networks and make them work for these kinds of modern LLM tasks, but it's still very early days. Okay, so for now we'll just focus on transformers.

Okay. So the very high-level view of the problem here is that, like most things in deep learning, it's basically fancy regression. There is some variable X that comes in; it goes through this very complicated function along with W, which is the weights, and out pops an output. That's just the view you've always had. In this case X happens to be text; Y can be text, it could be labels, it could be numbers, it could be anything else; W is the weights; and the function is a deep neural network. Right?
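In symbols, and this is just a restatement of that picture rather than anything new:

\[
\hat{y} = f(x; W)
\]

where x is the input (here, text, once we numericalize it), W is the set of weights we learn, f is the deep neural network, and the output y-hat could be text, labels, or numbers.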
At this point, when you look at this slide, it should be blindingly obvious. So now the key question is: how do you actually represent X? That's the key question. For pictures, for images, we saw that we just took the pixel values, which were light intensity numbers between 0 and 255, and you could use those directly. But when a sentence comes in, like "I love deep learning", what do you do? How do you actually represent it? Because remember, we have to numericalize everything that's coming in. So that's a key question, and it's actually a very subtle, very important question; we'll focus on that today. Then next week, when we look at transformers, we'll look at which neural network architecture is best suited to process these text inputs. Those are the two big questions we're going to look at.

All right, so: processing basics. We're going to follow a very standard process. This is the process by which we take any text that comes in and run it through these four steps, and it's called text vectorization; as the name suggests, we are essentially taking text and creating vectors of numbers out of it. We'll go through each of these steps one after the other. I just find it very useful to have the acronym STIE in my head: standardize, tokenize, index, encode. Just keep that in mind; it may be helpful.

All right. So the setup here is that we have a whole bunch of documents; we call it the training corpus. We have a whole bunch of text documents, text data, and as far as we are concerned, you can just imagine it as lists of long passages. What is a novel? It's just a long passage of text. So whether it's a novel or a sentence doesn't really matter.
We just think of them as a big list of strings, a big list of text. Okay, that's the training corpus. And what we do is we take this training corpus and run it through, applying standardization and tokenization, which I will describe, to the entire training corpus up front.

Okay. So we first do standardization, and the default for most applications tends to be this: we first strip capitalization and make everything lowercase, and then we remove punctuation and accents and so on. That's the first thing we do; I'll talk about why we do it in just a moment, but mechanically, we do this first. Then we look at words like "a", "the", "it", and so on, basically filler words, which we need in order to make complete sentences but which may not have any value for predicting things. So we remove them, and they are called stop words. And then finally we take words which are very similar, which have the same kind of stem or root, and we map them to a common representation: "ate", "eaten", "eating" all just become, let's say, "eat". We do that sometimes. So the first we almost always do, the second we often do, and the third we do sometimes. Okay. Now, why do we do any of these things?
>> I think we want to try to recognize the essential thing in the word, right? Whether it's "eaten" or "eat", the essential thing is the "eat". So we want to abstract from it the more essential thing.
>> Right. So why do we need to abstract? You're absolutely correct, we're trying to abstract; why is there a benefit to doing this abstraction? How about somebody from this side of the room? Oh yes.
>> We want to reduce the library.
>> Why is it a good idea to reduce the library,
the size of the library?
>> Because of the amount of computation needed.
>> So that is part of the answer. There's another part to the answer. All right, let's swing to the right.
>> Is it that it facilitates comparison between different sets of standards?
>> Okay, I will go with that, but I think the key thing to realize here is that you want the model, much like in computer vision, where we said: if there's a vertical line, I want to be able to detect it wherever it happens. I don't want the model to think that the vertical line on the left side is different from the vertical line on the right side and only later realize they are the same thing, because it would have wasted valuable capacity learning things which actually happen to be the same, because it didn't know they were the same. So here, if you for example take a word and lowercase it: clearly the case of it, whether it's uppercase or lowercase, is most of the time not going to matter for anything you want to predict. So you're essentially telling the model that the lowercase version and the uppercase version are not different, they're actually the same, and the easiest way to tell the model they are the same is to just make everything lowercase. That is the key idea. Okay. And similarly, if you look at stop words, the reason is that these stop words may not help you predict anything: whether the word "the" showed up in a movie review probably does not affect the sentiment of the review, and therefore let's remove it. So that's a slightly different reason. Stemming is the same reason as the first, which is that all these words kind of mean the same thing; we don't have to be super precise about it, so let's just collapse them onto the same thing.
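Just to make the mechanics concrete, here is a minimal sketch of that standardization step in Python. It is only an illustration: the tiny stop-word list and the optional stemming hook are placeholders I'm adding here, not the exact setup from the Colab.

```python
# Minimal standardization sketch: lowercase, strip accents and punctuation,
# optionally drop stop words and stem. The stop-word list is illustrative.
import string
import unicodedata

STOP_WORDS = {"a", "an", "the", "it", "is", "of", "and"}

def standardize(text, remove_stop_words=True, stem=None):
    text = text.lower()                                   # "The" and "the" become identical
    text = (unicodedata.normalize("NFKD", text)           # strip accents, e.g. "méxico" -> "mexico"
            .encode("ascii", "ignore").decode("ascii"))
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    words = text.split()
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    if stem is not None:                                   # e.g. pass in a stemmer's stem() function
        words = [stem(w) for w in words]
    return " ".join(words)

print(standardize("Hola! What do you picture when you think of travel? Mexico!"))
# -> "hola what do you picture when you think travel mexico"
```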
Now, these are all the standard things we do, and there are important exceptions to all of them; we'll come back to the exceptions a bit later, but that is the standard thing we do. Make sense? All right.

So if you look at something like this sentence here, "Hola! What do you picture when you think of travel? Mexico?", boom, you can see the standardized version: everything has become lowercase, the H has become a small h, the punctuation has disappeared, that's part of standardization; the M in Mexico has become small; and stemming has kicked in too, so "sipping" has become "sip", and so on and so forth. So that's an example of standardization at work.

Okay. The next thing we do is something very important, and it's called tokenization. So now we have standardized everything and we have a bunch of words; we need to split them into what are called tokens. The most common default is to just treat a word as a token: we split on the whitespace. You take each string, and wherever there is whitespace, meaning actual spaces, carriage returns, and things like that, boom, you split on it and you create words out of it. So for instance, if you have this standardized sentence here, you just split it after every word and you get this list. Each of these is now a token.

Now, this has some disadvantages. What are some disadvantages of just splitting on the space between words? Yeah?
>> I think we lose any context, because we look at each word separately; we don't know what came before or what happens next.
>> Right.
So for example, "the cat sat on the mat" and "the mat sat on the cat" will have the same set of tokens, right? Yeah, so you lose the order. What are some other issues with it?
>> For words that should go together, like a name, you lose the fact that it's one name, because you separated it.
>> Right, exactly. So there are compound words, like "father-in-law" for instance; that's one problem. Another problem is that lots of non-English languages don't have this notion of a space between words: the text actually runs one word after the other, and native speakers know from context how to chunk it and break it up. So what do we do then? Because you would basically end up with one token for the whole passage. The other problem is that there are languages, German perhaps the most notable one, in which you have very long words. I saw a word, which I think I might have on a slide somewhere, about this long, and it means the feeling when you realize that something amazing is happening but the rest of the world hasn't woken up to it yet. There's a word for that. Amazing, right? Or Japanese, for example: there's a word, komorebi. Do people know the meaning of that word? It means the transient beauty of sunlight going through fall foliage. There's a word for that. How cool is that? Anyway, sorry, I love that word.

So, back to this. There are all these reasons why splitting on the space between words is not always going to work. So what do modern large language models do? Well, what we have described so far, despite its shortcomings, is actually really good for lots of NLP use cases.
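For reference, here is a minimal sketch of that default whitespace tokenization; the helper name is just for illustration, not the Colab's actual code.

```python
# Minimal whitespace tokenization sketch: split a standardized string on
# any run of whitespace (spaces, tabs, newlines / carriage returns).
def tokenize(standardized_text):
    return standardized_text.split()

print(tokenize("the cat sat on the mat"))
# -> ['the', 'cat', 'sat', 'on', 'the', 'mat']
```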
This kind of word-level tokenization is good enough if you want to classify text, for instance, but if you want to generate text the way LLMs do, it's not going to work. It's not going to work because when you ask ChatGPT a question, it comes back with perfect punctuation; clearly punctuation was not stripped. It comes back with particular upper and lower case; clearly that wasn't stripped either. You can actually make up new words and ask it to use the new word, and it will use it; therefore it's not like it can only recognize a finite set of words. So there's a very clever scheme called byte pair encoding which was invented to handle all of those things. I have slides at the end, and if we have time, we'll talk about it.

All right, for now let's continue. So when this is done for every sentence, every passage in our training data set, we now have a list of distinct tokens. In this simple case, it happens to be all the distinct words we have seen. That's called the vocabulary. That's called the vocabulary.

So now we move to the third and fourth stages, the indexing and encoding stages, and in these stages we only work with the vocabulary. So what we do first is the indexing: we assign a unique integer to each distinct token in the vocabulary. For instance, let's say you took a whole bunch of English literature as your training corpus and ran it through; you'll basically come up with an English dictionary, starting maybe with "a" and going all the way to "zebra", a whole bunch of words. I'm just putting 50,000 here because it turns out the GPT family uses a vocabulary of about 50,000 tokens, so I'm just using 50,000. It's not the actual number of words in the English language; that's much more than this.
So let's say that we give each of them a number, one through 50,000. And then we also introduce a special token called UNK; it stands for "unknown", and we'll come back to it later. We give unknown the integer zero.

Okay. So this is what we mean by indexing: take the tokens you have identified and map each one to an integer. That's the indexing step. Then what we do is we assign a vector to every one of these integers, and that is the encoding step. We assign a vector to each integer. So you have a bunch of distinct words; we put an integer on each word, and then we take that integer and map it to a vector. Yes?
>> Can you please explain what unknown means?
>> Yeah, I'll come back to that; for now, just assume that we have a token called unknown, and the way we are going to use it will become apparent in a few minutes.
>> Does it mean there's a base to it, though? Like a letter or something?
>> It's a placeholder for something else, which I'll describe shortly.

Okay. So that's what we have. So let's say we want to assign a vector to each integer in our vocabulary, and let's say we have 50,000 possible integers because we have 50,000 possible words. We want to assign the vectors so that if you take the vectors of two different words, they look different; clearly that's the whole point of mapping from integer to vector, they had better be different. What is the simplest way to come up with a vector for each of these tokens?
>> The same as the index.
>> Sorry?
>> The same as the index. It's just a vector, one by one, with the index.
>> So, a vector of zeros and ones, or...?
>> It's just a vector with one dimension.
>> Oh, I see. Well, it's creative, but it's a little bit of a cheat, because you're essentially putting a square bracket around the number and calling it a vector. Good try.
>> You could try one-hot encoding.
>> Right, you can try one-hot encoding. So remember the list of distinct tokens you have: you can just think of them as the distinct levels of a categorical variable, and you can use one-hot encoding for them. So the simplest thing is one-hot encoding, and the way it works is that if you have, say, 50,000 possible values, the vector is going to be 50,000 long, with zeros everywhere except at the index value of whatever that token is. So for instance, since we said UNK is going to be the first one, number zero, it has a one in the zero index position and zeros everywhere else; "a" happens to be the second one, so it has a one in the second position and zeros everywhere else. You get the idea. Okay.

So we can do one-hot encoding, and the dimension of this encoding vector, how long it is, is basically the number of distinct tokens you have seen in the training corpus, plus one for this UNK thing we'll get to. That dimension of the encoding vector is called the vocabulary size. It's called the vocabulary size.

All right. So at this point we have created a vocabulary from the training corpus, every distinct token in the vocabulary has been assigned a one-hot vector, and we are done with basic preprocessing.
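Here is a minimal sketch of those indexing and encoding steps, assuming a tokenized training corpus; the helper names are illustrative, and mapping tokens not found in the vocabulary to the UNK slot is just one common way that token gets used, which we come back to in a moment.

```python
# Minimal indexing + one-hot encoding sketch: build a vocabulary from the
# training corpus, reserve integer 0 for UNK, and map each token to a
# one-hot vector whose length is the vocabulary size.
import numpy as np

def build_vocab(tokenized_corpus):
    # tokenized_corpus: a list of token lists, one per training document.
    distinct = sorted({tok for doc in tokenized_corpus for tok in doc})
    return {"UNK": 0, **{tok: i + 1 for i, tok in enumerate(distinct)}}

def one_hot(token, vocab):
    vec = np.zeros(len(vocab), dtype=np.int64)   # vocabulary_size entries, all zero
    vec[vocab.get(token, vocab["UNK"])] = 1      # a single 1 at the token's index (UNK if unseen)
    return vec

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["i", "love", "deep", "learning"]]
vocab = build_vocab(corpus)
print(vocab["cat"], one_hot("cat", vocab))
```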
Okay, so all the text that has come in, every token, has been mapped to some one-hot, potentially very long one-hot vector. Any questions on the mechanics of this before we continue on?

Now let's see what happens when you get a new input sentence, a new sentence freshly arriving, and we want to feed it into a deep neural network. How will this process apply to the new sentence coming in? Okay, so let's assume that we have completed our STIE on the training corpus, and it turns out we found only 99 distinct tokens, 99 distinct words; then we add this UNK thing to it, so we've got 100. Okay, so this is our vocabulary: it starts with UNK, then "a", and goes all the way to "zebra", but there are only 100 of them in total. And just to be very clear, we didn't bother to do things like stemming and stop word removal and stuff like that, which is why you have words like "the" showing up in this list.

Okay. All right. So let's say this input string arrives, "The cat sat on the mat", and we run it through STIE; the cat sat on the mat goes through this whole pipeline, boom, and the output is going to be a table with a bunch of rows and a bunch of columns. Any guesses how many rows and how many columns? Just raise your hands, I'll call on you.
>> Yeah, use the microphone. Go for it.
>> Yeah, I would guess 100 rows and six columns.
>> All right, we'll take a look. 100 by six, as well as six by 100, are both correct; the way I've done it is six by 100, and that's exactly right. So the idea is that this is your vocabulary, and the phrase "the cat sat on the mat", once you change the case of it, becomes like this: "the" becomes a one-hot vector with a one where "the" sits in the vocabulary and zero everywhere else.
I'm not 811 00:30:48,798 --> 00:30:52,079 showing all the zeros because it'll get 812 00:30:50,159 --> 00:30:55,679 too cluttered. 813 00:30:52,079 --> 00:30:57,519 Similarly, cat has a one where the the 814 00:30:55,679 --> 00:30:59,919 cat position is and zero everywhere else 815 00:30:57,519 --> 00:31:02,079 and so on and so forth. Does that make 816 00:30:59,919 --> 00:31:04,159 sense? So, the the phrase the cat sat on 817 00:31:02,079 --> 00:31:06,319 the mat came in as just whatever six 818 00:31:04,159 --> 00:31:10,200 words and then it became this you know 819 00:31:06,319 --> 00:31:10,200 600 entry table. 820 00:31:12,240 --> 00:31:18,000 Okay. Now, what is the best way to feed 821 00:31:15,679 --> 00:31:21,559 this table to a deep neural network? 822 00:31:18,000 --> 00:31:21,558 What can we do? 823 00:31:23,599 --> 00:31:27,678 It's not a vector. It's a table. 824 00:31:26,319 --> 00:31:29,359 If it's a vector, we know what to do. We 825 00:31:27,679 --> 00:31:30,960 just feed it in. We'll just maybe send 826 00:31:29,359 --> 00:31:34,398 it to some, you know, hidden layer and 827 00:31:30,960 --> 00:31:37,200 declare victory at that point. 828 00:31:34,398 --> 00:31:38,959 >> Yeah. 829 00:31:37,200 --> 00:31:42,840 >> You would like to flatten it. And like 830 00:31:38,960 --> 00:31:42,840 how how might you do it? 831 00:31:43,200 --> 00:31:46,960 Flattening is a reasonable answer by the 832 00:31:45,119 --> 00:31:49,038 way. 833 00:31:46,960 --> 00:31:52,480 I think you mean you just have to like 834 00:31:49,038 --> 00:31:54,798 take each like each column 835 00:31:52,480 --> 00:31:56,319 take the first one each row and each row 836 00:31:54,798 --> 00:31:57,839 each word kind of like 837 00:31:56,319 --> 00:31:59,599 >> yeah so basically you can take all the 838 00:31:57,839 --> 00:32:01,439 first columns and then take the second 839 00:31:59,599 --> 00:32:03,359 column and attach it under the first 840 00:32:01,440 --> 00:32:05,120 column and so on and so forth right so 841 00:32:03,359 --> 00:32:08,158 we can certainly do that and that's very 842 00:32:05,119 --> 00:32:10,319 akin to how we work with images right u 843 00:32:08,159 --> 00:32:13,640 but there is one downside to that what 844 00:32:10,319 --> 00:32:13,639 is that downside 845 00:32:15,759 --> 00:32:20,798 uh Um, 846 00:32:18,480 --> 00:32:23,360 >> it's pretty long. Like I wonder if 847 00:32:20,798 --> 00:32:25,440 instead you could for the first word 848 00:32:23,359 --> 00:32:27,439 it's one, for the second word it's two, 849 00:32:25,440 --> 00:32:30,558 and then you maintain the order, but you 850 00:32:27,440 --> 00:32:33,038 still keep it just as like one row. 851 00:32:30,558 --> 00:32:34,960 >> One row. So one issue, so we'll come 852 00:32:33,038 --> 00:32:36,240 back to what we do about this, but what 853 00:32:34,960 --> 00:32:39,440 you're pointing out is it could be very 854 00:32:36,240 --> 00:32:42,399 long, right? Because if each word is a 855 00:32:39,440 --> 00:32:45,278 50,000 long one vector with just six 856 00:32:42,398 --> 00:32:48,000 words, it becomes a 300,000 long vector. 857 00:32:45,278 --> 00:32:50,798 Imagine take the 300,000 long vector and 858 00:32:48,000 --> 00:32:53,839 sending it into a 100 hidden unit hidden 859 00:32:50,798 --> 00:32:56,158 layer. 300,000 times 100 parameters. Too 860 00:32:53,839 --> 00:32:58,879 much can't learn anything. 861 00:32:56,159 --> 00:33:01,360 So that's one issue. 
The other issue is 862 00:32:58,880 --> 00:33:02,720 that different length texts that are 863 00:33:01,359 --> 00:33:04,398 coming in will have different sized 864 00:33:02,720 --> 00:33:06,319 inputs. 865 00:33:04,398 --> 00:33:08,879 So here the cat sat on the mat has six 866 00:33:06,319 --> 00:33:10,558 times 50,000, but maybe the cat sat on 867 00:33:08,880 --> 00:33:13,200 the mat and the rat ran over to the 868 00:33:10,558 --> 00:33:15,359 cat becomes even longer. We can't handle 869 00:33:13,200 --> 00:33:16,798 variable sized inputs. 870 00:33:15,359 --> 00:33:19,599 The inputs all have to be mapped to the 871 00:33:16,798 --> 00:33:22,158 same length. 872 00:33:19,599 --> 00:33:24,079 That's another problem. 873 00:33:22,159 --> 00:33:26,000 >> So maybe you can sum the 874 00:33:24,079 --> 00:33:27,599 columns basically and count how 875 00:33:26,000 --> 00:33:29,519 many times each word appears, since 876 00:33:27,599 --> 00:33:30,240 you're losing the, like, spatial 877 00:33:29,519 --> 00:33:33,359 relationship. 878 00:33:30,240 --> 00:33:34,880 >> Yes. Yeah. So you're both on 879 00:33:33,359 --> 00:33:37,199 the same sort of trajectory, which is 880 00:33:34,880 --> 00:33:39,120 that uh we need to somehow take this 881 00:33:37,200 --> 00:33:40,960 table and make it into a vector. And 882 00:33:39,119 --> 00:33:42,879 there are many ways, like what you folks 883 00:33:40,960 --> 00:33:46,880 are describing, to make it into a vector, 884 00:33:42,880 --> 00:33:48,159 and it turns out um this addresses all the things 885 00:33:46,880 --> 00:33:50,880 that we've been discussing so far, the 886 00:33:48,159 --> 00:33:53,039 varying lengths and so on. So, so 887 00:33:50,880 --> 00:33:56,720 what we can do is we can aggregate all 888 00:33:53,038 --> 00:33:58,319 these things. If you just add them up, 889 00:33:56,720 --> 00:34:00,720 this is what you described. I believe 890 00:33:58,319 --> 00:34:02,720 it's called sum encoding. 891 00:34:00,720 --> 00:34:04,079 And if instead of adding you just OR 892 00:34:02,720 --> 00:34:05,360 them, meaning if you look at the column 893 00:34:04,079 --> 00:34:07,038 and say, is there any one in this 894 00:34:05,359 --> 00:34:08,878 column? If there's any one, I'll put a 895 00:34:07,038 --> 00:34:12,239 one, otherwise it's a zero. 896 00:34:08,878 --> 00:34:13,918 It's called multi-hot encoding. So, so if 897 00:34:12,239 --> 00:34:15,358 you look at this thing, if you literally 898 00:34:13,918 --> 00:34:17,199 just go column by column and count 899 00:34:15,358 --> 00:34:19,838 everything. Okay, there's a one here, 900 00:34:17,199 --> 00:34:21,519 one here. Oh, wait. There are two 'the's 901 00:34:19,838 --> 00:34:23,039 here. So you put a two. That's 902 00:34:21,519 --> 00:34:26,159 count encoding. Multi-hot encoding 903 00:34:23,039 --> 00:34:28,800 just looks for any ones and puts a one. 904 00:34:26,159 --> 00:34:30,159 Make sense? So by the way there are many 905 00:34:28,800 --> 00:34:32,159 ways to take these tables and make them 906 00:34:30,159 --> 00:34:34,159 into vectors. These two happen to be 907 00:34:32,159 --> 00:34:37,480 very commonly used and they kind of make 908 00:34:34,159 --> 00:34:37,480 common sense. 909 00:34:39,199 --> 00:34:43,039 Okay. 910 00:34:41,039 --> 00:34:44,800 Right. So this aggregation approach that 911 00:34:43,039 --> 00:34:46,800 we just described is called the bag of 912 00:34:44,800 --> 00:34:49,039 words model.
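As a rough sketch of the aggregation just described (reusing the same made-up toy vocabulary, so the numbers are only illustrative): sum/count encoding adds the one-hot rows, while multi-hot encoding only records presence or absence. Either way, the result has the same fixed length no matter how long the input text was.

```python
import numpy as np

vocab = ["[UNK]", "a", "cat", "mat", "on", "sat", "the", "zebra"]
word_to_index = {w: i for i, w in enumerate(vocab)}
tokens = "the cat sat on the mat".split()

one_hot = np.zeros((len(tokens), len(vocab)), dtype=int)
for row, tok in enumerate(tokens):
    one_hot[row, word_to_index[tok]] = 1

# Sum / count encoding: add the rows, so 'the' gets a 2 because it appears twice.
count_vector = one_hot.sum(axis=0)

# Multi-hot encoding: just ask whether each column has any one in it at all.
multi_hot_vector = (count_vector > 0).astype(int)

print(count_vector)      # [0 0 1 1 1 1 2 0]  -- one entry per vocabulary word
print(multi_hot_vector)  # [0 0 1 1 1 1 1 0]  -- same length for any input text
```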
913 00:34:46,800 --> 00:34:51,760 Bag of words model. And the reason is 914 00:34:49,039 --> 00:34:53,918 that, first of all, this bag that we have 915 00:34:51,760 --> 00:34:56,560 has words: either it counts whether a 916 00:34:53,918 --> 00:34:58,000 word exists or not, or it counts how many 917 00:34:56,559 --> 00:35:01,039 times the word has 918 00:34:58,000 --> 00:35:04,000 appeared, right? Multi-hot encoding 919 00:35:01,039 --> 00:35:05,920 versus count (or sum) encoding. But 920 00:35:04,000 --> 00:35:09,199 more importantly, and this goes back to 921 00:35:05,920 --> 00:35:12,320 your observation, is that we have lost 922 00:35:09,199 --> 00:35:14,399 the order of the words. Now whether the 923 00:35:12,320 --> 00:35:18,079 phrase that came in was the cat sat on the 924 00:35:14,400 --> 00:35:19,599 mat or the mat sat on the cat, the count 925 00:35:18,079 --> 00:35:21,440 encoding and the multi-hot encoding 926 00:35:19,599 --> 00:35:23,200 are exactly the same. There's no 927 00:35:21,440 --> 00:35:24,880 difference, because we're just looking 928 00:35:23,199 --> 00:35:27,039 for the presence or absence of 929 00:35:24,880 --> 00:35:29,599 words. That's it. We don't care 930 00:35:27,039 --> 00:35:32,480 in which order they appear, right? That's a 931 00:35:29,599 --> 00:35:34,160 huge limitation, but shockingly for many 932 00:35:32,480 --> 00:35:36,800 applications, it doesn't matter. It's 933 00:35:34,159 --> 00:35:38,960 good enough. So, it's called the bag of 934 00:35:36,800 --> 00:35:40,480 words model. 935 00:35:38,960 --> 00:35:42,720 All right, so this is called the bag of 936 00:35:40,480 --> 00:35:46,320 words model. 937 00:35:42,719 --> 00:35:47,199 Um, now does it have any shortcomings? I 938 00:35:46,320 --> 00:35:48,960 already talked about the first 939 00:35:47,199 --> 00:35:51,279 shortcoming, which is that it loses 940 00:35:48,960 --> 00:35:54,320 sequentiality, the order. We lost this 941 00:35:51,280 --> 00:35:55,680 order information, right? Uh we lose 942 00:35:54,320 --> 00:36:00,280 the meaning inherent in the order of the 943 00:35:55,679 --> 00:36:00,279 words. What are some other issues with it? 944 00:36:04,079 --> 00:36:07,720 What do you mean by that? 945 00:36:12,480 --> 00:36:16,559 >> Right, so there are lots of zeros, not 946 00:36:14,639 --> 00:36:18,239 that many ones, so it's a very 947 00:36:16,559 --> 00:36:19,920 sparse amount of information, but maybe 948 00:36:18,239 --> 00:36:22,000 it is carrying around a lot of information 949 00:36:19,920 --> 00:36:24,159 to make it all work. Now there are 950 00:36:22,000 --> 00:36:26,239 some tricks, CS computer science tricks, 951 00:36:24,159 --> 00:36:29,118 to handle sparsity in some clever ways, 952 00:36:26,239 --> 00:36:30,319 but it is certainly an issue. Now the 953 00:36:29,119 --> 00:36:32,640 other issue is that, let's say the 954 00:36:30,320 --> 00:36:34,960 vocabulary is very long. 955 00:36:32,639 --> 00:36:36,879 Each input sentence, whether it's the 956 00:36:34,960 --> 00:36:39,838 collected works of William Shakespeare 957 00:36:36,880 --> 00:36:42,640 or the phrase I love you, will have the 958 00:36:39,838 --> 00:36:45,519 same length input. 959 00:36:42,639 --> 00:36:48,078 It's the same length input 960 00:36:45,519 --> 00:36:51,440 because ultimately every incoming thing 961 00:36:48,079 --> 00:36:54,480 gets mapped into one vector. Okay, that 962 00:36:51,440 --> 00:36:56,159 feels a little suboptimal.
963 00:36:54,480 --> 00:36:59,280 Clearly the collected works of Shakespeare have 964 00:36:56,159 --> 00:37:02,719 a lot more stuff going on in them. 965 00:36:59,280 --> 00:37:04,480 Right? So that's a problem. In 966 00:37:02,719 --> 00:37:06,239 particular, even for very, very small things that 967 00:37:04,480 --> 00:37:08,159 come in, you'll be spending a lot of 968 00:37:06,239 --> 00:37:10,799 compute on those long vectors and 969 00:37:08,159 --> 00:37:13,039 processing them. Um, now you can 970 00:37:10,800 --> 00:37:14,560 mitigate some of this by choosing only 971 00:37:13,039 --> 00:37:16,000 the most frequent words. You don't have 972 00:37:14,559 --> 00:37:18,000 to take, you know, I think the English 973 00:37:16,000 --> 00:37:20,800 language, I read somewhere, has roughly 974 00:37:18,000 --> 00:37:23,440 500,000 words or so. Uh, but turns out 975 00:37:20,800 --> 00:37:24,640 the top 50,000 most frequent words are 976 00:37:23,440 --> 00:37:27,200 responsible for just about everything 977 00:37:24,639 --> 00:37:29,519 you're going to see ever. And the other 978 00:37:27,199 --> 00:37:31,358 450,000 are what's called the long tail. 979 00:37:29,519 --> 00:37:33,119 They almost never happen, right? You 980 00:37:31,358 --> 00:37:34,639 never see them. So, you can be very 981 00:37:33,119 --> 00:37:36,640 pragmatic and say, "I'm not going to 982 00:37:34,639 --> 00:37:38,559 take every little word that I see into my 983 00:37:36,639 --> 00:37:40,000 vocabulary. I'm going to only take the 984 00:37:38,559 --> 00:37:42,078 most frequent words. I'm just going to 985 00:37:40,000 --> 00:37:44,000 ignore the rest. 986 00:37:42,079 --> 00:37:46,960 I'm just going to ignore the rest." 987 00:37:44,000 --> 00:37:50,079 Okay? 988 00:37:46,960 --> 00:37:52,400 But if you ignore the rest, let's say 989 00:37:50,079 --> 00:37:55,280 there is one word, uh, let's take some 990 00:37:52,400 --> 00:37:57,358 Shakespeare word, Hamlet. Let's let's 991 00:37:55,280 --> 00:37:58,640 assume that you ignore the word Hamlet 992 00:37:57,358 --> 00:38:00,400 from your training corpus. You just 993 00:37:58,639 --> 00:38:02,159 delete it because it's not one of the 994 00:38:00,400 --> 00:38:04,480 top most frequent things you have seen. 995 00:38:02,159 --> 00:38:06,559 And then somebody sends you a text 996 00:38:04,480 --> 00:38:08,240 saying, you know, Hamlet was a bad 997 00:38:06,559 --> 00:38:10,400 prince. 998 00:38:08,239 --> 00:38:12,159 Analyze the sentiment of the sentence. 999 00:38:10,400 --> 00:38:14,160 Well, when you see Hamlet, what is your 1000 00:38:12,159 --> 00:38:15,358 system going to do? 1001 00:38:14,159 --> 00:38:16,799 It's going to look at the Hamlet and 1002 00:38:15,358 --> 00:38:18,480 say, I can't see it in my vocabulary 1003 00:38:16,800 --> 00:38:19,920 anywhere. 1004 00:38:18,480 --> 00:38:22,400 And if it can't see it in the vocabulary, 1005 00:38:19,920 --> 00:38:26,000 what is the only thing it can do? 1006 00:38:22,400 --> 00:38:28,400 Replace it with UNK. So that's where UNK 1007 00:38:26,000 --> 00:38:30,079 comes into the picture. 1008 00:38:28,400 --> 00:38:32,000 So whenever it can't see something in 1009 00:38:30,079 --> 00:38:35,839 the vocabulary in a new input, it just 1010 00:38:32,000 --> 00:38:37,838 replaces it with UNK.
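Here is a small hand-rolled sketch of that pruning idea, not the Keras layer the Colab uses later: build a vocabulary from only the most frequent words and map everything else to UNK. The corpus, the cutoff, and the names are all made up for illustration.

```python
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "hamlet was a bad prince",
]

# Count word frequencies over the training corpus.
counts = Counter(word for line in corpus for word in line.lower().split())

# Keep only the top-k most frequent words; everything else will become [UNK].
top_k = 6
vocab = ["[UNK]"] + [w for w, _ in counts.most_common(top_k)]
word_to_index = {w: i for i, w in enumerate(vocab)}

def encode(sentence):
    # Any word outside the truncated vocabulary is replaced by index 0 ([UNK]).
    return [word_to_index.get(w, 0) for w in sentence.lower().split()]

print(vocab)
print(encode("Hamlet was a bad prince"))  # rare words all collapse to index 0
print(encode("Romeo was a bad prince"))   # ...so Hamlet and Romeo look identical
```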
Which means that 1011 00:38:35,838 --> 00:38:40,880 if you had ignored Romeo, Juliet, and 1012 00:38:37,838 --> 00:38:42,239 Hamlet in the training corpus, 1013 00:38:40,880 --> 00:38:44,079 all of them are going to be replaced by 1014 00:38:42,239 --> 00:38:46,719 the same UNK, which means that we can't 1015 00:38:44,079 --> 00:38:48,960 distinguish between them anymore. 1016 00:38:46,719 --> 00:38:52,159 >> So is this where hallucination 1017 00:38:48,960 --> 00:38:54,880 comes into play here, where it doesn't 1018 00:38:52,159 --> 00:38:56,239 recognize it? 1019 00:38:54,880 --> 00:38:58,400 Hm, interesting question. Is this where 1020 00:38:56,239 --> 00:39:00,799 hallucination comes up? Actually, as it 1021 00:38:58,400 --> 00:39:03,680 turns out, no, as we will see when we 1022 00:39:00,800 --> 00:39:06,480 talk about LLMs later. Uh LLMs actually 1023 00:39:03,679 --> 00:39:08,078 will not have this UNK problem because 1024 00:39:06,480 --> 00:39:09,440 they use a different tokenization scheme 1025 00:39:08,079 --> 00:39:10,960 which can handle anything you throw at 1026 00:39:09,440 --> 00:39:12,480 it, including new stuff you just made 1027 00:39:10,960 --> 00:39:14,800 up. 1028 00:39:12,480 --> 00:39:17,838 So, we'll come back to that. 1029 00:39:14,800 --> 00:39:19,760 All right. Um so, that's what we have. 1030 00:39:17,838 --> 00:39:21,440 And so what we're going to do is, despite 1031 00:39:19,760 --> 00:39:23,599 its shortcomings, bag of words is 1032 00:39:21,440 --> 00:39:26,400 actually a really good default for many 1033 00:39:23,599 --> 00:39:27,599 NLP tasks. Uh and in the spirit of do 1034 00:39:26,400 --> 00:39:28,880 the simple stuff first and do 1035 00:39:27,599 --> 00:39:30,400 complicated things only if the simple 1036 00:39:28,880 --> 00:39:32,079 doesn't work, we'll use a bag of words 1037 00:39:30,400 --> 00:39:36,480 model right now. Okay. So we'll switch 1038 00:39:32,079 --> 00:39:39,440 to a Colab and see how it's done. 1039 00:39:36,480 --> 00:39:40,719 So here the application we're going 1040 00:39:39,440 --> 00:39:43,119 to work with is kind of a fun 1041 00:39:40,719 --> 00:39:46,000 application. Uh we're going to try to 1042 00:39:43,119 --> 00:39:47,599 predict the genre of songs. 1043 00:39:46,000 --> 00:39:50,480 Okay, it's a nice classification use 1044 00:39:47,599 --> 00:39:52,800 case. Um, so we want to take some 1045 00:39:50,480 --> 00:39:55,440 arbitrary song and then classify it into 1046 00:39:52,800 --> 00:39:59,599 either hip-hop, rock or pop. 1047 00:39:55,440 --> 00:40:01,200 Okay. Um, and so for instance, 1048 00:39:59,599 --> 00:40:03,200 right, these are the kind of 1049 00:40:01,199 --> 00:40:04,879 lyrics you're going to see. And as you 1050 00:40:03,199 --> 00:40:07,279 will see in this data set, the data set, 1051 00:40:04,880 --> 00:40:10,320 just a quick word of caution, uh, the 1052 00:40:07,280 --> 00:40:12,720 data set does have lyrics which may not 1053 00:40:10,320 --> 00:40:14,320 be sort of, you know, safe for work as 1054 00:40:12,719 --> 00:40:16,719 it were. So I'm not going to be, like, 1055 00:40:14,320 --> 00:40:18,880 exploring the lyrics in the Colab, but 1056 00:40:16,719 --> 00:40:20,959 I just wanted you to be aware of it. Okay. 1057 00:40:18,880 --> 00:40:22,480 Um, so but it's just some data set that 1058 00:40:20,960 --> 00:40:24,240 we downloaded from somewhere, right? Uh, 1059 00:40:22,480 --> 00:40:25,599 it's got all these lyrics. Okay.
So 1060 00:40:24,239 --> 00:40:27,759 we're going to try to classify each 1061 00:40:25,599 --> 00:40:29,200 verse that we see into one of three 1062 00:40:27,760 --> 00:40:31,680 things: hip-hop, rock or pop. It's a 1063 00:40:29,199 --> 00:40:33,279 multi-class classification problem. 1064 00:40:31,679 --> 00:40:35,039 All right. Actually, what is the 1065 00:40:33,280 --> 00:40:37,760 simplest neural network based classifier 1066 00:40:35,039 --> 00:40:41,119 we can build 1067 00:40:37,760 --> 00:40:42,800 for this problem? 1068 00:40:41,119 --> 00:40:44,880 All right. So what is the simplest 1069 00:40:42,800 --> 00:40:47,519 neural network we can build for this 1070 00:40:44,880 --> 00:40:49,519 problem? So remember, what is the input? 1071 00:40:47,519 --> 00:40:50,719 The input is going to be a bunch of song 1072 00:40:49,519 --> 00:40:52,800 lyrics. It's going to be a really long 1073 00:40:50,719 --> 00:40:54,879 song for all you know, right? And we're 1074 00:40:52,800 --> 00:40:56,560 going to use the bag of words model. Uh 1075 00:40:54,880 --> 00:40:59,680 and let's assume for a moment that we 1076 00:40:56,559 --> 00:41:02,239 will use multi-hot encoding, right? We'll 1077 00:40:59,679 --> 00:41:04,000 create a vocabulary from this, for the 1078 00:41:02,239 --> 00:41:06,559 songs. We'll take all the songs, we'll 1079 00:41:04,000 --> 00:41:08,239 process them, run them through STIE, we'll 1080 00:41:06,559 --> 00:41:10,719 do multi-hot encoding, which means that 1081 00:41:08,239 --> 00:41:14,239 every song that comes in will 1082 00:41:10,719 --> 00:41:17,279 be a vector. And how long 1083 00:41:14,239 --> 00:41:20,719 will it be? As long as the... 1084 00:41:17,280 --> 00:41:24,720 correct, as long as the vocabulary size, right? So um 1085 00:41:20,719 --> 00:41:26,480 so maybe what comes in is this phrase, um, 1086 00:41:24,719 --> 00:41:28,000 since it's supposed to be songs I'll say 1087 00:41:26,480 --> 00:41:30,960 something which is probably common to 1088 00:41:28,000 --> 00:41:34,639 90% of songs: I love you. 1089 00:41:30,960 --> 00:41:38,480 Okay, that goes in. 1090 00:41:34,639 --> 00:41:42,000 It goes into our STIE process, 1091 00:41:38,480 --> 00:41:49,039 and then this STIE process gives us a 1092 00:41:42,000 --> 00:41:50,318 vector which is X1, X2, all the way to XV, 1093 00:41:49,039 --> 00:41:52,639 where V stands for the size of the 1094 00:41:50,318 --> 00:41:54,960 vocabulary. Okay. So that that's our 1095 00:41:52,639 --> 00:41:58,239 input layer, 1096 00:41:54,960 --> 00:42:02,400 all the way. So knowing what we know now 1097 00:41:58,239 --> 00:42:04,959 about deep learning, what can we do next? 1098 00:42:02,400 --> 00:42:07,920 >> Couldn't you, or maybe I'm getting ahead, 1099 00:42:04,960 --> 00:42:10,240 but wouldn't the classifier, just like 1100 00:42:07,920 --> 00:42:11,920 the baseline, would be classify it as the 1101 00:42:10,239 --> 00:42:13,199 most common genre? 1102 00:42:11,920 --> 00:42:14,800 >> That is the baseline. Correct. Correct. 1103 00:42:13,199 --> 00:42:17,039 I'm just saying, and we'll come to the 1104 00:42:14,800 --> 00:42:18,720 baseline a bit later. But here I'm 1105 00:42:17,039 --> 00:42:21,119 saying suppose you wanted to 1106 00:42:18,719 --> 00:42:23,358 build a neural network model for this. 1107 00:42:21,119 --> 00:42:25,280 How would you set it up? 1108 00:42:23,358 --> 00:42:26,078 >> You think about the layers that you 1109 00:42:25,280 --> 00:42:27,359 want, 1110 00:42:26,079 --> 00:42:29,039 >> right?
And what is the simplest thing 1111 00:42:27,358 --> 00:42:30,159 you can do with a neural network? How 1112 00:42:29,039 --> 00:42:33,279 many layers? 1113 00:42:30,159 --> 00:42:35,358 >> Uh no layers. Well, then it becomes 1114 00:42:33,280 --> 00:42:36,800 problematic with even a neural network 1115 00:42:35,358 --> 00:42:37,759 because it could just be logistic 1116 00:42:36,800 --> 00:42:38,800 regression 1117 00:42:37,760 --> 00:42:41,760 >> one hidden layer. 1118 00:42:38,800 --> 00:42:43,119 >> Yes, thank you. I'm being a little 1119 00:42:41,760 --> 00:42:44,800 squishy about this because there are 1120 00:42:43,119 --> 00:42:46,480 some people who be like well even if 1121 00:42:44,800 --> 00:42:48,560 there's no hidden layers if you're using 1122 00:42:46,480 --> 00:42:49,838 relus and this and that and sigma that's 1123 00:42:48,559 --> 00:42:51,519 maybe it's a neural network and I don't 1124 00:42:49,838 --> 00:42:54,400 want to get into that how many ages in 1125 00:42:51,519 --> 00:42:56,079 the tip of a pin argument. So um so yeah 1126 00:42:54,400 --> 00:42:57,358 we need one hidden layer right in this 1127 00:42:56,079 --> 00:42:59,039 course we need at least one hidden layer 1128 00:42:57,358 --> 00:43:01,119 for it to qualify as a neural network. 1129 00:42:59,039 --> 00:43:04,800 Okay, so let's have a hidden layer and 1130 00:43:01,119 --> 00:43:07,680 we'll have a bunch of ReLUS as usual. 1131 00:43:04,800 --> 00:43:09,119 Okay, bunch of ReLULS and I'll ignore 1132 00:43:07,679 --> 00:43:11,519 all the arrows between them. It's kind 1133 00:43:09,119 --> 00:43:13,039 of a pain. U and then we come to the 1134 00:43:11,519 --> 00:43:15,358 output layer. And what should the output 1135 00:43:13,039 --> 00:43:16,960 layer be? 1136 00:43:15,358 --> 00:43:19,519 How many nodes do we have need in the 1137 00:43:16,960 --> 00:43:22,400 output layer? Three, right? Hip-hop, 1138 00:43:19,519 --> 00:43:23,759 rock, whatever. Pop. So we And then that 1139 00:43:22,400 --> 00:43:25,358 layer is called what? What activation 1140 00:43:23,760 --> 00:43:27,520 function? 1141 00:43:25,358 --> 00:43:30,960 Softmax. Perfect. Love it. love this 1142 00:43:27,519 --> 00:43:33,838 class. All right, three things. Uh, 1143 00:43:30,960 --> 00:43:36,880 rock, hip-hop, 1144 00:43:33,838 --> 00:43:39,199 and uh, pop, right? And this is a soft 1145 00:43:36,880 --> 00:43:41,760 max right there. 1146 00:43:39,199 --> 00:43:44,639 And then it's going to give us three 1147 00:43:41,760 --> 00:43:46,400 probabilities that add up to one because 1148 00:43:44,639 --> 00:43:49,679 it's a soft max. So that's our basic 1149 00:43:46,400 --> 00:43:51,039 network, right? Perfect. Yeah. 1150 00:43:49,679 --> 00:43:52,799 >> Why do you need those probabilities? 1151 00:43:51,039 --> 00:43:55,279 Again, if you just want to identify the 1152 00:43:52,800 --> 00:43:56,720 most likely genre, the soft max just 1153 00:43:55,280 --> 00:43:59,359 give you a way to kind of add them all 1154 00:43:56,719 --> 00:44:01,358 up once. Why do you need soft? Why don't 1155 00:43:59,358 --> 00:44:01,759 you just take the max value and say it's 1156 00:44:01,358 --> 00:44:03,679 that? 1157 00:44:01,760 --> 00:44:05,760 >> Oh, interesting question. Why can't we 1158 00:44:03,679 --> 00:44:09,519 just produce three numbers and grab the 1159 00:44:05,760 --> 00:44:11,200 maximum number? 
So, it turns out finding 1160 00:44:09,519 --> 00:44:12,719 the maximum bunch of numbers that 1161 00:44:11,199 --> 00:44:14,960 function 1162 00:44:12,719 --> 00:44:16,959 is not very it's not very friendly for 1163 00:44:14,960 --> 00:44:18,880 differentiation. 1164 00:44:16,960 --> 00:44:20,800 And ultimately you want to take this 1165 00:44:18,880 --> 00:44:23,200 output, run it through a loss function 1166 00:44:20,800 --> 00:44:25,839 like cross entropy and then be able to 1167 00:44:23,199 --> 00:44:27,679 run back prop on it. And so 1168 00:44:25,838 --> 00:44:29,599 fundamentally back propagation is just 1169 00:44:27,679 --> 00:44:31,199 differentiation and it requires 1170 00:44:29,599 --> 00:44:34,160 everything inside of it to have well- 1171 00:44:31,199 --> 00:44:36,239 behaved gradients. And so this little 1172 00:44:34,159 --> 00:44:39,039 max function is actually not well 1173 00:44:36,239 --> 00:44:41,598 behaved and which is why we have a soft 1174 00:44:39,039 --> 00:44:44,318 version of it soft max which makes it 1175 00:44:41,599 --> 00:44:45,760 easy to differentiate. So I can tell you 1176 00:44:44,318 --> 00:44:49,079 more about it offline but that's sort of 1177 00:44:45,760 --> 00:44:49,079 the quick synopsis. 1178 00:44:49,119 --> 00:44:52,640 So a lot of tricks you will see in the 1179 00:44:50,480 --> 00:44:55,440 neural network literature or ways to 1180 00:44:52,639 --> 00:44:57,358 avoid this the problem of having certain 1181 00:44:55,440 --> 00:44:59,200 the like the obvious choice of function 1182 00:44:57,358 --> 00:45:00,400 will not be well behaved for 1183 00:44:59,199 --> 00:45:02,960 differentiation. That's why you need to 1184 00:45:00,400 --> 00:45:05,039 go through all these other mechanisms 1185 00:45:02,960 --> 00:45:06,400 much like we couldn't just say accuracy. 1186 00:45:05,039 --> 00:45:07,679 Why don't you just maximize accuracy 1187 00:45:06,400 --> 00:45:10,880 instead of doing this cross entropy 1188 00:45:07,679 --> 00:45:14,480 business? Same reason. 1189 00:45:10,880 --> 00:45:17,640 All right. So let's come back here. 1190 00:45:14,480 --> 00:45:17,639 All right. 1191 00:45:20,639 --> 00:45:27,279 So that's what we created on the thing. 1192 00:45:23,679 --> 00:45:28,960 Right? Cats out of the mat vocabulary 1193 00:45:27,280 --> 00:45:31,359 thing and so on. And I you know I was 1194 00:45:28,960 --> 00:45:33,519 playing around with it uh earlier and so 1195 00:45:31,358 --> 00:45:35,039 I I found that you know eight relu 1196 00:45:33,519 --> 00:45:36,159 neurons were pretty good to get the job 1197 00:45:35,039 --> 00:45:37,838 done. So I'm just going to go with eight 1198 00:45:36,159 --> 00:45:39,920 rel 1199 00:45:37,838 --> 00:45:44,078 neurons in the hidden layer. 1200 00:45:39,920 --> 00:45:47,039 So I think that brings us to the collab. 1201 00:45:44,079 --> 00:45:49,519 Yeah. So let's switch to the collab. 1202 00:45:47,039 --> 00:45:50,960 All right. So um that's what we have 1203 00:45:49,519 --> 00:45:52,318 here. We you know there's a little bit 1204 00:45:50,960 --> 00:45:54,159 of verbiage here which just describes 1205 00:45:52,318 --> 00:45:56,400 what I just talked about. So we'll do 1206 00:45:54,159 --> 00:45:58,639 the usual things and upload everything 1207 00:45:56,400 --> 00:46:01,280 uh import everything we want. TensorFlow 1208 00:45:58,639 --> 00:46:03,838 and caras and the the holy trinity of 1209 00:46:01,280 --> 00:46:07,040 numpy pandas and mattplot lib. 
Uh set 1210 00:46:03,838 --> 00:46:09,679 the random seed as usual at 42. 1211 00:46:07,039 --> 00:46:11,759 This is our STIE framework here. And the 1212 00:46:09,679 --> 00:46:14,480 nice thing is that all four of these 1213 00:46:11,760 --> 00:46:16,880 STIE things are beautifully implemented 1214 00:46:14,480 --> 00:46:19,440 in Keras as a single simple layer called 1215 00:46:16,880 --> 00:46:22,880 the TextVectorization layer. Okay, which 1216 00:46:19,440 --> 00:46:25,200 is nice. Um, so we have the TextVectorization layer 1217 00:46:22,880 --> 00:46:26,960 right here. And so in our first example, 1218 00:46:25,199 --> 00:46:29,039 what we'll do is we will use a default 1219 00:46:26,960 --> 00:46:31,199 standardization, which will just remove 1220 00:46:29,039 --> 00:46:33,039 punctuation, convert to lowercase. We'll 1221 00:46:31,199 --> 00:46:35,598 use a default tokenization, which just 1222 00:46:33,039 --> 00:46:37,358 means split on the space between words. 1223 00:46:35,599 --> 00:46:39,680 And then we will set the output to 1224 00:46:37,358 --> 00:46:41,039 multi-hot. Right? All the things we 1225 00:46:39,679 --> 00:46:43,598 talked about, Keras will just do for 1226 00:46:41,039 --> 00:46:45,759 you automatically. And so output mode 1227 00:46:43,599 --> 00:46:47,359 multi-hot, standardize, split on whitespace, 1228 00:46:45,760 --> 00:46:49,760 and boom, you run the text 1229 00:46:47,358 --> 00:46:52,000 vectorization thing. And once you do it, 1230 00:46:49,760 --> 00:46:53,599 Keras creates this text vectorization layer 1231 00:46:52,000 --> 00:46:56,159 with these settings and it's now ready 1232 00:46:53,599 --> 00:46:58,480 to swing into action. So what does swing 1233 00:46:56,159 --> 00:46:59,679 into action actually mean? Well, now we 1234 00:46:58,480 --> 00:47:01,920 need to actually feed it a training 1235 00:46:59,679 --> 00:47:02,960 corpus so that it can do all the things 1236 00:47:01,920 --> 00:47:07,039 it's supposed to do and create the 1237 00:47:02,960 --> 00:47:08,800 vocabulary for you, right? So um so and 1238 00:47:07,039 --> 00:47:11,599 that thing is called the adapt method. 1239 00:47:08,800 --> 00:47:14,880 So we create a tiny training corpus for 1240 00:47:11,599 --> 00:47:16,160 us. This is our data set. Um right, this is 1241 00:47:14,880 --> 00:47:18,240 just a bunch of words from some of these 1242 00:47:16,159 --> 00:47:19,920 lyrics. And then what we'll do is we'll 1243 00:47:18,239 --> 00:47:21,838 take this layer that we just defined 1244 00:47:19,920 --> 00:47:24,240 here, that we have set up here. And then 1245 00:47:21,838 --> 00:47:26,078 we will ask this layer to actually 1246 00:47:24,239 --> 00:47:29,679 create the vocabulary using this adapt 1247 00:47:26,079 --> 00:47:31,760 command. Okay. Index the vocabulary. And 1248 00:47:29,679 --> 00:47:34,239 it's done. And once it does it, you can 1249 00:47:31,760 --> 00:47:36,160 actually ask it for the vocabulary. 1250 00:47:34,239 --> 00:47:38,479 Okay, this is the vocabulary, using the 1251 00:47:36,159 --> 00:47:41,679 get vocabulary command. And so first of 1252 00:47:38,480 --> 00:47:45,119 all, how long is the vocab? 17. 17 words, 1253 00:47:41,679 --> 00:47:46,799 17 tokens. What are they? 1254 00:47:45,119 --> 00:47:48,880 And see here, and you can see these are 1255 00:47:46,800 --> 00:47:50,640 all the words, and you can see it has 1256 00:47:48,880 --> 00:47:52,400 stuck UNK in at the very beginning, 1257 00:47:50,639 --> 00:47:54,239 right? It's sort of the default.
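A minimal sketch of what that looks like in code, assuming TensorFlow 2.x; the four example phrases are placeholders rather than the actual strings in the Colab.

```python
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

# Default standardization (lowercase + strip punctuation), default whitespace
# split, and multi-hot output: the whole STIE pipeline in one layer.
vectorizer = TextVectorization(output_mode="multi_hot")

# A tiny stand-in training corpus; adapt() scans it and indexes the vocabulary.
tiny_corpus = tf.constant([
    "Write the verse, rewrite the verse",
    "Arrays of rhymes, arrays of beats",
    "The beat goes on and on",
    "Rewrite it until it rhymes",
])
vectorizer.adapt(tiny_corpus)

vocab = vectorizer.get_vocabulary()
print(len(vocab))   # vocabulary size, with '[UNK]' sitting at index 0
print(vocab[:5])    # the most frequent tokens come first
```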
By the 1258 00:47:52,400 --> 00:47:55,599 way, uh just a little programming tip if 1259 00:47:54,239 --> 00:47:57,118 you're not familiar with if you don't 1260 00:47:55,599 --> 00:47:58,400 have a ton of programming experience. If 1261 00:47:57,119 --> 00:48:00,240 you want to, you know, print these 1262 00:47:58,400 --> 00:48:02,960 Python objects like list and all in a 1263 00:48:00,239 --> 00:48:05,838 pretty way, one trick that often works 1264 00:48:02,960 --> 00:48:08,240 is just stick it into a data frame 1265 00:48:05,838 --> 00:48:09,599 and then print it. Usually, it'll print 1266 00:48:08,239 --> 00:48:11,679 it in a much better way. So, you can see 1267 00:48:09,599 --> 00:48:13,760 it like that. 1268 00:48:11,679 --> 00:48:15,598 So, you can see here ank arrays blah 1269 00:48:13,760 --> 00:48:17,920 blah blah blah blah. And you can see 1270 00:48:15,599 --> 00:48:19,760 integer zero assigned the ank token. By 1271 00:48:17,920 --> 00:48:22,559 the way, how come it picked the word 1272 00:48:19,760 --> 00:48:26,960 arrays as the second entry? Why not 1273 00:48:22,559 --> 00:48:29,839 something like an or um you know why 1274 00:48:26,960 --> 00:48:32,400 not? Why not a how come a is not chosen 1275 00:48:29,838 --> 00:48:36,039 as a second entry? Why why did it pick 1276 00:48:32,400 --> 00:48:36,039 arrays? You think 1277 00:48:40,318 --> 00:48:45,358 >> maybe maybe it tried like the words that 1278 00:48:43,358 --> 00:48:49,119 are most influential on the meaning of 1279 00:48:45,358 --> 00:48:49,119 the sentence to be on the 1280 00:48:49,760 --> 00:48:54,160 But it at this point it doesn't know 1281 00:48:51,280 --> 00:48:56,000 what we're going to use it for. 1282 00:48:54,159 --> 00:48:57,358 So it has no way to know what word is 1283 00:48:56,000 --> 00:48:59,599 useful because we haven't told it how 1284 00:48:57,358 --> 00:49:01,838 we're going to use it. 1285 00:48:59,599 --> 00:49:04,559 But but you're kind of on the right 1286 00:49:01,838 --> 00:49:06,400 track. So what KAS does is it'll 1287 00:49:04,559 --> 00:49:07,680 calculate it'll find all these tokens 1288 00:49:06,400 --> 00:49:09,760 and then it'll actually just sort them 1289 00:49:07,679 --> 00:49:12,239 by frequency. 1290 00:49:09,760 --> 00:49:13,680 So the most frequent as it turns out in 1291 00:49:12,239 --> 00:49:15,838 those four sentences we gave it happen 1292 00:49:13,679 --> 00:49:17,838 to be the word arrays. That's why arrays 1293 00:49:15,838 --> 00:49:19,279 is showing up on top. Um, and you can 1294 00:49:17,838 --> 00:49:21,759 actually confirm this by going to the 1295 00:49:19,280 --> 00:49:23,760 our little data set and you can see here 1296 00:49:21,760 --> 00:49:25,920 array shows up here and was up here 1297 00:49:23,760 --> 00:49:29,920 twice and that's why it came up on top. 1298 00:49:25,920 --> 00:49:32,559 Okay. All right. So that's what we have 1299 00:49:29,920 --> 00:49:34,400 and u and now now that we have populated 1300 00:49:32,559 --> 00:49:36,319 this we can run any sentence through it 1301 00:49:34,400 --> 00:49:37,358 easily. Yeah. 1302 00:49:36,318 --> 00:49:39,599 >> Does [clears throat] it matter that it's 1303 00:49:37,358 --> 00:49:41,199 on the top or is it just 1304 00:49:39,599 --> 00:49:43,519 >> it doesn't matter. It doesn't matter. 
1305 00:49:41,199 --> 00:49:45,598 The reason why it's helpful later on is 1306 00:49:43,519 --> 00:49:48,079 because suppose you tell Kas hey don't 1307 00:49:45,599 --> 00:49:50,559 take every word you see here give me 1308 00:49:48,079 --> 00:49:52,318 only the most frequent 100 words I don't 1309 00:49:50,559 --> 00:49:56,519 want any more than that it can easily do 1310 00:49:52,318 --> 00:49:56,519 that that's the reason yeah 1311 00:50:01,199 --> 00:50:05,679 >> this is just a vocabulary so basically 1312 00:50:03,280 --> 00:50:07,519 you you give it all this phrases it 1313 00:50:05,679 --> 00:50:09,039 happens just four phrases in our example 1314 00:50:07,519 --> 00:50:10,639 and then it finds all the distinct words 1315 00:50:09,039 --> 00:50:12,558 and you know does all that stuff and and 1316 00:50:10,639 --> 00:50:14,480 then it has created a vocabulary. At 1317 00:50:12,559 --> 00:50:17,680 this point the the training corpus you 1318 00:50:14,480 --> 00:50:19,440 fed it will is forgotten and the only 1319 00:50:17,679 --> 00:50:21,838 thing has survived this processing is 1320 00:50:19,440 --> 00:50:23,280 just the vocabulary. That's it. Now we 1321 00:50:21,838 --> 00:50:25,838 have to start applying it to any kind of 1322 00:50:23,280 --> 00:50:28,559 text we want to use it for. 1323 00:50:25,838 --> 00:50:30,159 So here when you come back here u so 1324 00:50:28,559 --> 00:50:32,240 this is what we have and so what you can 1325 00:50:30,159 --> 00:50:33,920 do is you can take any sentence and you 1326 00:50:32,239 --> 00:50:35,039 can just run it through a layer and to 1327 00:50:33,920 --> 00:50:37,039 make sure that actually is doing the 1328 00:50:35,039 --> 00:50:39,119 right thing for you. So we'll take the 1329 00:50:37,039 --> 00:50:40,558 sentence, we will then run it through 1330 00:50:39,119 --> 00:50:42,000 the text vectorization layer by just 1331 00:50:40,559 --> 00:50:45,640 passing that sentence into it and then 1332 00:50:42,000 --> 00:50:45,639 we can just print it. 1333 00:50:46,000 --> 00:50:50,559 So now it's giving you a tensor. This is 1334 00:50:47,838 --> 00:50:54,318 a multihot encoder tensor with all these 1335 00:50:50,559 --> 00:50:56,400 ones and zeros. So note that this tensor 1336 00:50:54,318 --> 00:50:58,079 is 17 units long which is which is a 1337 00:50:56,400 --> 00:51:00,880 good check because our vocabulary is 17 1338 00:50:58,079 --> 00:51:03,519 long. So it's better match that. Uh now 1339 00:51:00,880 --> 00:51:05,680 recall that the ank token is at the 1340 00:51:03,519 --> 00:51:08,159 first location. It's at index zero and 1341 00:51:05,679 --> 00:51:10,558 it says that this encoded sentence does 1342 00:51:08,159 --> 00:51:13,358 have an unk word. 1343 00:51:10,559 --> 00:51:15,920 Okay. So 1344 00:51:13,358 --> 00:51:19,039 why is that? What is this UN word? 1345 00:51:15,920 --> 00:51:21,680 Anyone can guess? 1346 00:51:19,039 --> 00:51:24,400 Well, it turns out to be the word still. 1347 00:51:21,679 --> 00:51:26,480 Um I think yeah still is not in our 1348 00:51:24,400 --> 00:51:28,079 vocabulary because the four sentences 1349 00:51:26,480 --> 00:51:30,240 which is our training corpus used to 1350 00:51:28,079 --> 00:51:32,000 build vocabulary. They had a lot of 1351 00:51:30,239 --> 00:51:33,838 write and rewrite but there was no still 1352 00:51:32,000 --> 00:51:35,920 in it anyway. That's why there's an UN 1353 00:51:33,838 --> 00:51:38,159 ank for it. 
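Continuing the hypothetical sketch above, this is roughly how a new sentence goes through the adapted layer, and how you can confirm that an out-of-vocabulary word such as "still" falls into the UNK slot (the sentence is a stand-in, not the Colab's).

```python
# Vectorize a new sentence with the adapted layer from the sketch above.
encoded = vectorizer(tf.constant(["The verse is still unwritten"]))
print(encoded.shape)       # (1, vocab_size): one multi-hot row per input string
print(encoded.numpy()[0])  # index 0 is 1 because at least one word is unknown

# Membership check, mirroring the "is it in the vocabulary?" test in the Colab.
print("still" in vectorizer.get_vocabulary())   # False -> it maps to [UNK]
```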
Uh we can just double check 1354 00:51:35,920 --> 00:51:40,000 that by asking Python is it is it 1355 00:51:38,159 --> 00:51:41,598 vocabulary? Nope, it's not. Okay. Now, 1356 00:51:40,000 --> 00:51:42,960 in the spirit of making small changes to 1357 00:51:41,599 --> 00:51:45,680 the code to understand what's going on, 1358 00:51:42,960 --> 00:51:46,880 which is a very useful tip for folks who 1359 00:51:45,679 --> 00:51:48,879 don't have a ton of programming 1360 00:51:46,880 --> 00:51:52,960 knowledge. Let's say that you send the 1361 00:51:48,880 --> 00:51:54,480 phrase Sloan Hodddle and DM, DMD. Uh I 1362 00:51:52,960 --> 00:51:55,760 think you will agree with me that none 1363 00:51:54,480 --> 00:51:59,358 of these words is in the training 1364 00:51:55,760 --> 00:52:02,000 corpus, right? So what will this what is 1365 00:51:59,358 --> 00:52:06,199 the multihot encoded vector for this 1366 00:52:02,000 --> 00:52:06,199 phrase sloan hodddle BMD 1367 00:52:07,440 --> 00:52:10,440 three 1368 00:52:11,440 --> 00:52:14,800 it's not count encoding it's multihod 1369 00:52:13,119 --> 00:52:17,358 encoding 1370 00:52:14,800 --> 00:52:19,039 right it's going to be 1 0 0 so you can 1371 00:52:17,358 --> 00:52:21,598 see here or in this case remember the 1372 00:52:19,039 --> 00:52:23,599 vocabulary is 17 1373 00:52:21,599 --> 00:52:27,440 right so each of these words is going to 1374 00:52:23,599 --> 00:52:29,200 be a one followed by 16 zeros 1375 00:52:27,440 --> 00:52:30,880 And then it's going to multih hot encode 1376 00:52:29,199 --> 00:52:34,318 them which means the three ones in the 1377 00:52:30,880 --> 00:52:37,039 column just become a one. So so you 1378 00:52:34,318 --> 00:52:39,599 still have only this one. Okay. All 1379 00:52:37,039 --> 00:52:41,358 right. Good. So now let's see that's now 1380 00:52:39,599 --> 00:52:45,359 let's actually get to the the the data 1381 00:52:41,358 --> 00:52:47,598 set. We have this 90,000 songs. Uh and 1382 00:52:45,358 --> 00:52:49,440 it's in this little thing here. Uh we 1383 00:52:47,599 --> 00:52:50,720 have grabbed the data and cleaned it up. 1384 00:52:49,440 --> 00:52:53,280 Cleaned it up meaning like formatting 1385 00:52:50,719 --> 00:52:55,039 wise not content wise. uh and then we 1386 00:52:53,280 --> 00:52:56,960 stuck it in this uh data frame and it's 1387 00:52:55,039 --> 00:52:58,480 we already have divided into train, test 1388 00:52:56,960 --> 00:53:00,720 and validation for your benefit. So you 1389 00:52:58,480 --> 00:53:03,599 don't have to worry about it. So turns 1390 00:53:00,719 --> 00:53:05,759 out we have 40 almost 49,000 songs in 1391 00:53:03,599 --> 00:53:08,800 the training set, 16,000 songs in the 1392 00:53:05,760 --> 00:53:10,960 validation set and 22 roughly 22,000 in 1393 00:53:08,800 --> 00:53:13,119 the test set. Okay, lot of songs. It's a 1394 00:53:10,960 --> 00:53:15,838 lot. It's a big data set. Um so let's 1395 00:53:13,119 --> 00:53:18,079 just look at the first few. 1396 00:53:15,838 --> 00:53:20,558 So oh girl, I can't get ready. We met on 1397 00:53:18,079 --> 00:53:22,000 rainy evening. Paralysis through 1398 00:53:20,559 --> 00:53:23,599 analysis. 1399 00:53:22,000 --> 00:53:27,599 Okay, that I can relate to as a data 1400 00:53:23,599 --> 00:53:29,280 science person. But anyway, u but uh by 1401 00:53:27,599 --> 00:53:31,440 the way this uh these things are very 1402 00:53:29,280 --> 00:53:33,440 useful for exploration of any uh data 1403 00:53:31,440 --> 00:53:36,720 frames that you might have. 
Collab is a 1404 00:53:33,440 --> 00:53:38,318 collab feature just check it out. Um so 1405 00:53:36,719 --> 00:53:40,159 anyway, that's the first few the first 1406 00:53:38,318 --> 00:53:43,119 few rows. Let's look at the last few 1407 00:53:40,159 --> 00:53:46,118 rows. 1408 00:53:43,119 --> 00:53:46,119 Okay, 1409 00:53:48,800 --> 00:53:56,280 you never listen to me as pop. Beamer 1410 00:53:51,440 --> 00:53:56,280 Benz is hip-hop. Yeah, of course. 1411 00:53:57,599 --> 00:54:01,440 So, okay. Uh, now to go back to the 1412 00:53:59,679 --> 00:54:02,639 question of, okay, um, what could be a 1413 00:54:01,440 --> 00:54:04,559 good baseline model? We need to 1414 00:54:02,639 --> 00:54:07,118 understand the proportion of these three 1415 00:54:04,559 --> 00:54:10,559 classes of songs. So, we'll do a quick 1416 00:54:07,119 --> 00:54:12,480 check. Turns out rock is 55%. So, if you 1417 00:54:10,559 --> 00:54:13,599 had to just guess something just 1418 00:54:12,480 --> 00:54:15,920 naively, you would just guess everything 1419 00:54:13,599 --> 00:54:18,400 to be rock and you'd be right 55% of the 1420 00:54:15,920 --> 00:54:20,159 time. Uh so now uh by the way the the 1421 00:54:18,400 --> 00:54:21,680 the target variable which tells you 1422 00:54:20,159 --> 00:54:24,639 whether which of these three genres it 1423 00:54:21,679 --> 00:54:26,318 is uh is is is a is actually a dummy 1424 00:54:24,639 --> 00:54:29,598 variable. So we need to one hot encode 1425 00:54:26,318 --> 00:54:32,000 that right. Um so we'll just turn that 1426 00:54:29,599 --> 00:54:34,559 this way using the pandas get dummies 1427 00:54:32,000 --> 00:54:35,920 function. And when we do that uh this is 1428 00:54:34,559 --> 00:54:37,200 y train which contains the dependent 1429 00:54:35,920 --> 00:54:40,800 variable. And you can see that is one 1430 00:54:37,199 --> 00:54:42,719 hot encoded now. Uh 0 1 0 0 1 0 0 1 and 1431 00:54:40,800 --> 00:54:44,960 so on and so forth. That's it. So I 1432 00:54:42,719 --> 00:54:46,799 think the first I forget it rock, 1433 00:54:44,960 --> 00:54:48,400 hip-hop, rock, pop or whatever. It's in 1434 00:54:46,800 --> 00:54:50,800 some order. We'll we'll get to that 1435 00:54:48,400 --> 00:54:52,559 later. So it's one hot encoded as well. 1436 00:54:50,800 --> 00:54:54,240 So that is as far as the data 1437 00:54:52,559 --> 00:54:55,680 downloading and setup is concerned. Any 1438 00:54:54,239 --> 00:54:57,439 questions? 1439 00:54:55,679 --> 00:54:58,960 >> Yeah. 1440 00:54:57,440 --> 00:55:01,440 >> Uh this kind of goes back to the 1441 00:54:58,960 --> 00:55:04,000 transfer learning concept. But do you 1442 00:55:01,440 --> 00:55:06,079 always want to build your corpus based 1443 00:55:04,000 --> 00:55:08,000 off of the vocabulary of your training 1444 00:55:06,079 --> 00:55:10,559 data or could you have like a 1445 00:55:08,000 --> 00:55:13,679 pre-ompiled like somebody's already made 1446 00:55:10,559 --> 00:55:15,280 like a list of the 50,000 words? 1447 00:55:13,679 --> 00:55:16,558 >> That's a really good question. 
Uh 1448 00:55:15,280 --> 00:55:20,240 unfortunately I'm going to punt on it 1449 00:55:16,559 --> 00:55:22,240 for the moment because um with modern 1450 00:55:20,239 --> 00:55:25,039 large language models a number of these 1451 00:55:22,239 --> 00:55:27,039 NLP tasks for which you had to sort of 1452 00:55:25,039 --> 00:55:29,759 roll your own and build your own thing 1453 00:55:27,039 --> 00:55:31,838 can now be very easily done using large 1454 00:55:29,760 --> 00:55:33,520 language models without even any further 1455 00:55:31,838 --> 00:55:34,639 training. 1456 00:55:33,519 --> 00:55:35,759 Case you pay for it is that you have to 1457 00:55:34,639 --> 00:55:37,759 use a large language model which means 1458 00:55:35,760 --> 00:55:38,800 you have to pay somebody an API call and 1459 00:55:37,760 --> 00:55:41,760 things like that and there are other 1460 00:55:38,800 --> 00:55:43,920 issues with it. uh but 1461 00:55:41,760 --> 00:55:46,319 we'll talk a lot about transfer learning 1462 00:55:43,920 --> 00:55:48,559 for text when we come to a little later 1463 00:55:46,318 --> 00:55:52,279 in the NLP sequence. So if I forget 1464 00:55:48,559 --> 00:55:52,280 please bring it up again. 1465 00:55:53,358 --> 00:55:58,159 >> Yeah. 1466 00:55:54,880 --> 00:56:00,880 >> Um quick clarification on the encode 1467 00:55:58,159 --> 00:56:03,440 factor. If I post it as floats not ins. 1468 00:56:00,880 --> 00:56:05,599 If it gets incredibly long wouldn't that 1469 00:56:03,440 --> 00:56:06,559 eat into compute time? Is there a reason 1470 00:56:05,599 --> 00:56:09,119 why it's floats? 1471 00:56:06,559 --> 00:56:11,359 >> Yeah. So uh question is that when when I 1472 00:56:09,119 --> 00:56:13,200 showed you that tensor the it is 1473 00:56:11,358 --> 00:56:14,639 actually is written as a continuous 1474 00:56:13,199 --> 00:56:16,399 number right a float floating point 1475 00:56:14,639 --> 00:56:18,159 number but we know these are one zeros 1476 00:56:16,400 --> 00:56:20,240 and ones so why can't we why do we have 1477 00:56:18,159 --> 00:56:21,519 to waste compute capacity by telling the 1478 00:56:20,239 --> 00:56:23,118 computer that these are all big 1479 00:56:21,519 --> 00:56:25,199 continuous numbers when it's just a zero 1480 00:56:23,119 --> 00:56:26,559 one there are ways to optimize that but 1481 00:56:25,199 --> 00:56:28,960 these problems are so small we just 1482 00:56:26,559 --> 00:56:30,319 don't worry about it but when we come to 1483 00:56:28,960 --> 00:56:34,079 something called parameter efficient 1484 00:56:30,318 --> 00:56:35,838 fine-tuning lecture maybe 10ish uh we 1485 00:56:34,079 --> 00:56:38,318 actually exploit that particular fact to 1486 00:56:35,838 --> 00:56:38,318 make things faster 1487 00:56:38,480 --> 00:56:43,519 Okay, so that's what we have. Uh, so 1488 00:56:41,199 --> 00:56:46,000 we'll we'll do the bag of birds model. 1489 00:56:43,519 --> 00:56:47,119 Um, by the way, there's a whole bunch of 1490 00:56:46,000 --> 00:56:49,199 stuff here. It just repeats what I've 1491 00:56:47,119 --> 00:56:50,880 been telling you in the lecture. So feel 1492 00:56:49,199 --> 00:56:54,000 free to read it again, but we can ignore 1493 00:56:50,880 --> 00:56:55,920 it for the moment. And now there's a new 1494 00:56:54,000 --> 00:56:58,159 thing we are doing here. 
So we are 1495 00:56:55,920 --> 00:57:00,159 basically saying, look, instead of 1496 00:56:58,159 --> 00:57:03,519 taking every word you see in these 1497 00:57:00,159 --> 00:57:05,358 49,000 uh songs in the training corpus, 1498 00:57:03,519 --> 00:57:09,119 uh, it's going to be too many words. 1499 00:57:05,358 --> 00:57:11,679 just pick the 5,000 most frequent words 1500 00:57:09,119 --> 00:57:15,039 and that's what this max tokens stands 1501 00:57:11,679 --> 00:57:18,719 for. Okay. And so we tell it uh all 1502 00:57:15,039 --> 00:57:20,798 right do this thing max tokens 5,000 1503 00:57:18,719 --> 00:57:22,318 sorry not 50,000 5,000 and still do 1504 00:57:20,798 --> 00:57:24,318 multihart and we are not explicitly 1505 00:57:22,318 --> 00:57:25,599 saying the standardization and all that 1506 00:57:24,318 --> 00:57:29,119 stuff because the defaults are what 1507 00:57:25,599 --> 00:57:30,960 we're going with. Okay. Yeah. 1508 00:57:29,119 --> 00:57:32,798 This is for making it more efficient. 1509 00:57:30,960 --> 00:57:36,639 Like this is like don't waste your time 1510 00:57:32,798 --> 00:57:39,358 on these thousand sports. Use them more. 1511 00:57:36,639 --> 00:57:40,239 Use them. Just focus on that to make 1512 00:57:39,358 --> 00:57:42,318 more efficient. 1513 00:57:40,239 --> 00:57:44,000 >> Make more efficient. But there is a 1514 00:57:42,318 --> 00:57:46,400 related and important point which is 1515 00:57:44,000 --> 00:57:49,599 that fundamentally the number of tokens 1516 00:57:46,400 --> 00:57:51,760 you allow this layer to have dictates 1517 00:57:49,599 --> 00:57:53,680 the size of your vocabulary and the size 1518 00:57:51,760 --> 00:57:56,079 of your vocabulary dictates the size of 1519 00:57:53,679 --> 00:57:57,358 the vector that you feed in. So shorter 1520 00:57:56,079 --> 00:57:59,039 vectors are better than longer vectors. 1521 00:57:57,358 --> 00:58:00,639 That's the efficiency point. The other 1522 00:57:59,039 --> 00:58:02,719 point is that the longer the input 1523 00:58:00,639 --> 00:58:04,400 vector, the more the number of 1524 00:58:02,719 --> 00:58:06,558 parameters the network has to learn 1525 00:58:04,400 --> 00:58:08,480 because the first layer itself is the 1526 00:58:06,559 --> 00:58:10,000 size of the input times roughly times 1527 00:58:08,480 --> 00:58:11,199 the size of the hidden layer. So this 1528 00:58:10,000 --> 00:58:13,039 thing becomes 10 times as long. You have 1529 00:58:11,199 --> 00:58:15,439 10 times as many parameters to learn and 1530 00:58:13,039 --> 00:58:17,199 given a finite amount of data, right? 1531 00:58:15,440 --> 00:58:18,400 The more parameters you have, the worse 1532 00:58:17,199 --> 00:58:19,679 it's going to do when you actually start 1533 00:58:18,400 --> 00:58:21,200 using it in the real world. It's going 1534 00:58:19,679 --> 00:58:24,000 to overfitit heavily. That's why you 1535 00:58:21,199 --> 00:58:25,679 need to be very careful. 1536 00:58:24,000 --> 00:58:27,519 Okay. 1537 00:58:25,679 --> 00:58:29,440 Yeah. 1538 00:58:27,519 --> 00:58:31,358 So, um, you downloaded the data set, but 1539 00:58:29,440 --> 00:58:33,760 are you still using the vocabulary the 1540 00:58:31,358 --> 00:58:35,598 17 words or did you 1541 00:58:33,760 --> 00:58:36,720 >> No, no, I'm that was just for fun. I'm 1542 00:58:35,599 --> 00:58:38,960 going to actually build a vocabulary 1543 00:58:36,719 --> 00:58:41,838 now. It's coming. Yeah, good question. 1544 00:58:38,960 --> 00:58:43,599 Yeah. So, all right, let's do that. 
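In code, the only change being described is the max_tokens argument. Here is a sketch under the assumption that the training songs sit in a pandas DataFrame with a lyrics column; the DataFrame, its contents, and the variable names are placeholders, not necessarily what the Colab uses.

```python
import pandas as pd
from tensorflow.keras.layers import TextVectorization

# Stand-in for the real training DataFrame of ~49,000 songs (names assumed).
train_df = pd.DataFrame({
    "lyrics": ["i love you and you love me",
               "rock and roll all night",
               "beats and rhymes on the block"],
    "genre": ["pop", "rock", "hip hop"],
})

max_tokens = 5000   # keep only the 5,000 most frequent words; the rest map to [UNK]
vectorize_layer = TextVectorization(max_tokens=max_tokens, output_mode="multi_hot")

# adapt() scans the training lyrics and builds the (capped) vocabulary.
vectorize_layer.adapt(train_df["lyrics"].values)

# Every song, long or short, becomes one fixed-length multi-hot vector.
X_train = vectorize_layer(train_df["lyrics"].values)
print(X_train.shape)   # one row per song, one column per vocabulary slot
```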
Um, 1545 00:58:41,838 --> 00:58:46,000 so I first, you know, I defined this 1546 00:58:43,599 --> 00:58:47,599 layer. Uh, okay. I just defined it. All 1547 00:58:46,000 --> 00:58:49,760 right. Now we actually build the 1548 00:58:47,599 --> 00:58:53,519 vocabulary by essentially telling it to 1549 00:58:49,760 --> 00:58:56,640 adapt the layer using essentially the 1550 00:58:53,519 --> 00:58:58,719 full all 15 basically 49,000 songs in 1551 00:58:56,639 --> 00:59:01,679 the training data set right that's a 1552 00:58:58,719 --> 00:59:02,798 long list of songs as far as kas is 1553 00:59:01,679 --> 00:59:04,879 concerned you're just looking for a list 1554 00:59:02,798 --> 00:59:06,159 of strings so you just give it the list 1555 00:59:04,880 --> 00:59:09,200 of strings instead of four we're giving 1556 00:59:06,159 --> 00:59:11,358 it 49,000 the same uh philosophy applies 1557 00:59:09,199 --> 00:59:12,879 so we run it 1558 00:59:11,358 --> 00:59:15,039 it's obviously going to take a few 1559 00:59:12,880 --> 00:59:17,280 seconds to do that because it's 49,000 1560 00:59:15,039 --> 00:59:19,039 songs 1561 00:59:17,280 --> 00:59:21,519 five seconds. Uh, all right. Let's look 1562 00:59:19,039 --> 00:59:23,759 at the most common 20, 1563 00:59:21,519 --> 00:59:26,318 right? We get the vocabulary from our 1564 00:59:23,760 --> 00:59:27,839 layer. See, once you adapt the layer and 1565 00:59:26,318 --> 00:59:29,358 has built a vocabulary, the layer is 1566 00:59:27,838 --> 00:59:31,279 sort of been populated with all this 1567 00:59:29,358 --> 00:59:34,719 information. So, you can query it. So, 1568 00:59:31,280 --> 00:59:37,040 you can get the vocab top 20 words, the 1569 00:59:34,719 --> 00:59:39,039 most frequent word, no surprise, u, I, 1570 00:59:37,039 --> 00:59:41,039 blah, blah, blah. Uh, let's look at the 1571 00:59:39,039 --> 00:59:43,599 last few. 1572 00:59:41,039 --> 00:59:46,599 Dagger cheddar 1573 00:59:43,599 --> 00:59:46,599 verified 1574 00:59:46,798 --> 00:59:51,199 moving on 1575 00:59:48,880 --> 00:59:52,960 right and then we so once we have done 1576 00:59:51,199 --> 00:59:55,439 that now we actually can vectorize all 1577 00:59:52,960 --> 00:59:57,039 the data sets we have using this and by 1578 00:59:55,440 --> 00:59:59,119 vectorize you mean take every string and 1579 00:59:57,039 --> 01:00:00,400 create the multihot encoded vector from 1580 00:59:59,119 --> 01:00:02,480 it uh yeah 1581 01:00:00,400 --> 01:00:05,358 >> are we doing stie because we're keeping 1582 01:00:02,480 --> 01:00:07,119 stuff like d a etc. 
Yeah, we are not 1583 01:00:05,358 --> 01:00:09,598 strictly doing STI or to put it 1584 01:00:07,119 --> 01:00:12,000 differently the S stands typically S has 1585 01:00:09,599 --> 01:00:14,960 lower case uppercase strip punctuation 1586 01:00:12,000 --> 01:00:16,798 stemming stop word removal here the 1587 01:00:14,960 --> 01:00:18,639 default in KAS happens to not do 1588 01:00:16,798 --> 01:00:20,000 stemming not do stop word removal so 1589 01:00:18,639 --> 01:00:22,078 we're just going with the default thanks 1590 01:00:20,000 --> 01:00:23,519 for the clarification 1591 01:00:22,079 --> 01:00:25,039 and in fact in practice what I find 1592 01:00:23,519 --> 01:00:27,039 these days is that don't even bother to 1593 01:00:25,039 --> 01:00:28,239 stem don't even bother to remove the 1594 01:00:27,039 --> 01:00:31,119 stop words it's going to work well 1595 01:00:28,239 --> 01:00:34,399 enough 1596 01:00:31,119 --> 01:00:36,000 okay so all right uh okay so now Each 1597 01:00:34,400 --> 01:00:38,639 phrase is a vector. How long is this 1598 01:00:36,000 --> 01:00:41,039 vector? Each song is now a vector. How 1599 01:00:38,639 --> 01:00:43,279 long is that vector? 1600 01:00:41,039 --> 01:00:46,920 5,000. Correct. Because that is a size 1601 01:00:43,280 --> 01:00:46,920 vocabulary. Correct. 1602 01:00:47,199 --> 01:00:51,679 It's max tokens long, which is 5,000. So 1603 01:00:49,599 --> 01:00:52,960 if you actually look at X Oh, wait, 1604 01:00:51,679 --> 01:00:56,358 wait, wait, wait, wait. I haven't done 1605 01:00:52,960 --> 01:00:56,358 this thing yet. 1606 01:00:57,838 --> 01:01:02,400 It's going through 49,000. It's going 1607 01:00:59,599 --> 01:01:04,400 through another what? 23,000. Fine. So 1608 01:01:02,400 --> 01:01:06,798 let's run it. 1609 01:01:04,400 --> 01:01:09,200 Okay, now we can see X train which is 1610 01:01:06,798 --> 01:01:12,960 all the training data you have has is a 1611 01:01:09,199 --> 01:01:18,039 tensor is a table with 48 991 rows and 1612 01:01:12,960 --> 01:01:18,039 each row is a 5,000 long vector. 1613 01:01:18,079 --> 01:01:23,280 All right, good. Now we will try the 1614 01:01:20,559 --> 01:01:28,240 simple neural network that we wrote up 1615 01:01:23,280 --> 01:01:31,359 in class. So and now at this point this 1616 01:01:28,239 --> 01:01:34,078 code should be sort of second nature, 1617 01:01:31,358 --> 01:01:36,159 right? Isn't that cool? It's so easy to 1618 01:01:34,079 --> 01:01:39,280 write the write the thing the power of 1619 01:01:36,159 --> 01:01:41,279 abstraction. So uh we take kasin input 1620 01:01:39,280 --> 01:01:42,720 as usual input layer we tell it what is 1621 01:01:41,280 --> 01:01:44,480 the size of each thing that's coming in. 1622 01:01:42,719 --> 01:01:46,480 Well the size of each thing is a 50 max 1623 01:01:44,480 --> 01:01:48,880 tokens long vector. So we tell it the 1624 01:01:46,480 --> 01:01:51,119 shape is max tokens and then we run it 1625 01:01:48,880 --> 01:01:54,160 through a dense layer with eight relus. 1626 01:01:51,119 --> 01:01:56,079 Okay I'm hurrying. 1627 01:01:54,159 --> 01:01:58,000 So we get the outputs then we string the 1628 01:01:56,079 --> 01:01:59,680 inputs and the outputs into a model and 1629 01:01:58,000 --> 01:02:02,239 then we summarize the model. That's it. 
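A minimal sketch of the network just described, using the Keras functional API. The layer sizes follow the lecture (eight ReLUs in the hidden layer, three softmax outputs), while the variable names and the max_tokens value are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

max_tokens = 5000  # length of each multi-hot input vector (assumed)

inputs = keras.Input(shape=(max_tokens,))                  # one 5,000-long vector per song
hidden = layers.Dense(8, activation="relu")(inputs)        # the single hidden layer of 8 ReLUs
outputs = layers.Dense(3, activation="softmax")(hidden)    # hip-hop / rock / pop probabilities

model = keras.Model(inputs=inputs, outputs=outputs)
model.summary()
# Hidden layer: 5,000 weights x 8 units + 8 biases = 40,008 parameters.
# Output layer: 8 x 3 weights + 3 biases = 27 more.
```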
1630 01:01:59,679 --> 01:02:04,639 So we go here and this has about 40,000 1631 01:02:02,239 --> 01:02:08,239 parameters, and you can see here, right, 1632 01:02:04,639 --> 01:02:10,239 when you go from the input, the 5,000 * 8 1633 01:02:08,239 --> 01:02:11,838 that gives you 40,000, plus the eight 1634 01:02:10,239 --> 01:02:15,039 neurons have a bias coming in, that's 1635 01:02:11,838 --> 01:02:17,119 another eight, so you get 40,008. Okay, 1636 01:02:15,039 --> 01:02:20,159 and we compile it as usual, we use Adam 1637 01:02:17,119 --> 01:02:23,760 as usual, and because now the output 1638 01:02:20,159 --> 01:02:27,039 y variable, the y train variable, 1639 01:02:23,760 --> 01:02:29,599 is itself actually one-hot encoded, 1640 01:02:27,039 --> 01:02:31,440 right, 0 1 0 or 0 0 1, depending on pop, rock 1641 01:02:29,599 --> 01:02:33,519 and so on and so forth, we don't use 1642 01:02:31,440 --> 01:02:35,119 sparse categorical cross entropy. We 1643 01:02:33,519 --> 01:02:38,000 just use plain old categorical cross 1644 01:02:35,119 --> 01:02:40,318 entropy here. Okay. And this was 1645 01:02:38,000 --> 01:02:42,400 explained in lecture last week. So you 1646 01:02:40,318 --> 01:02:44,318 can revisit it if uh if it's not 1647 01:02:42,400 --> 01:02:46,400 familiar. We again report accuracy, 1648 01:02:44,318 --> 01:02:48,558 right? So let's compile it. And we've 1649 01:02:46,400 --> 01:02:50,798 got a model. So we just run it for 10 1650 01:02:48,559 --> 01:02:52,640 epochs with a batch size of 32. And 1651 01:02:50,798 --> 01:02:53,838 because we have validation data already 1652 01:02:52,639 --> 01:02:55,679 supplied to us, we don't have to tell 1653 01:02:53,838 --> 01:02:58,159 Keras take the training data and keep 1654 01:02:55,679 --> 01:02:59,519 20% of it aside for validation. We can 1655 01:02:58,159 --> 01:03:04,000 literally tell it what validation set to 1656 01:02:59,519 --> 01:03:06,798 use. That's what we're doing here. Okay. 1657 01:03:04,000 --> 01:03:09,119 All right. So, it's running. 1658 01:03:06,798 --> 01:03:12,599 Um, 1659 01:03:09,119 --> 01:03:12,599 it's pretty fast. 1660 01:03:16,318 --> 01:03:20,480 Any questions so far? 1661 01:03:18,159 --> 01:03:23,519 >> Yes. 1662 01:03:20,480 --> 01:03:25,358 >> The microphone. 1663 01:03:23,519 --> 01:03:27,679 >> How do we decide the max tokens? Like, 1664 01:03:25,358 --> 01:03:29,038 we define the number as 5,000 here, but we 1665 01:03:27,679 --> 01:03:29,919 do not know how many words would be 1666 01:03:29,039 --> 01:03:31,200 there in the entire text. 1667 01:03:29,920 --> 01:03:32,720 >> Yeah. So it's a good question. How do 1668 01:03:31,199 --> 01:03:34,399 you decide on this, the maximum 1669 01:03:32,719 --> 01:03:36,480 vocabulary? What you typically do in 1670 01:03:34,400 --> 01:03:38,240 practice is that you actually do it 1671 01:03:36,480 --> 01:03:40,079 without the max tokens and then you see 1672 01:03:38,239 --> 01:03:41,838 how long the vocabulary is, and then you 1673 01:03:40,079 --> 01:03:43,839 actually get statistics on how 1674 01:03:41,838 --> 01:03:45,279 frequently the very infrequent words 1675 01:03:43,838 --> 01:03:47,279 actually show up. And then you'll 1676 01:03:45,280 --> 01:03:49,599 typically see like a dramatic fall-off 1677 01:03:47,280 --> 01:03:54,119 at some point, and you pick that fall-off 1678 01:03:49,599 --> 01:03:54,119 point and then set that to be the max. 1679 01:03:54,960 --> 01:04:01,599 Uh all right. So perfect. Let's test it.
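And, continuing that sketch, the compile-and-train step being described might look roughly like this, assuming the multi-hot X arrays and one-hot y arrays from the earlier steps already exist (all variable names are placeholders).

```python
# Adam optimizer, plain (not sparse) categorical cross-entropy because y_train
# is already one-hot encoded, and accuracy as the reported metric.
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# The validation set is supplied explicitly instead of carving 20% off the
# training data.
history = model.fit(
    X_train, y_train,
    epochs=10,
    batch_size=32,
    validation_data=(X_val, y_val),
)

test_loss, test_acc = model.evaluate(X_test, y_test)
print(test_acc)  # compare against the ~55% "always predict rock" baseline
```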
Accuracy is pretty good: 87% on the training set and 73% on the validation set. We'll do it on the test set. All right, 72%. We saw earlier that the largest class of the three is rock, with around 50%, so the naive model is going to get 50% accuracy, and this little neural network gets you about 72%, which is pretty nice.

Okay, so now let's kick it up a notch and make it slightly more capable. The key thing here, as has been observed in class already, is that when you go with a bag-of-words model we lose all notion of order. Word order clearly matters, and we're ignoring it. So what do we do to get around that? There's actually a really interesting sentence here. Let's say this is a movie review: "Kate Winslet's performance as a detective trying to solve a terrible crime in a small town is anything but disappointing." Tricky, right? Because if you look at the words separately, "terrible" and "disappointing" look like negative sentiment. But if you know that "terrible" refers to the crime, not to the movie, and that "anything but" changes the meaning of the word "disappointing", you will see it's obviously a positive review. So clearly the words around a word provide valuable clues as to how to interpret that word. And so the question is: how can we make our little model a bit more capable of recognizing the context around every word? The way we do it is something called bigrams.
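Before moving to bigrams, here's a quick sketch of the evaluation and the majority-class comparison just mentioned (assuming `x_test` and `y_test` hold the vectorized test songs and one-hot labels):

```python
import numpy as np

# Trained model on the held-out test set (roughly 0.72 in the lecture run).
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f"test accuracy: {test_acc:.2f}")

# Naive baseline: always predict the most common genre (~50% here).
true_classes = np.argmax(y_test, axis=1)
majority_class = np.bincount(true_classes).argmax()
print(f"baseline accuracy: {np.mean(true_classes == majority_class):.2f}")
```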
For bigrams, what we basically do is this: instead of just taking each word, we take each word and we further take every pair of adjacent words, and those become our tokens. Because we take two adjacent words, they're called bigrams; you can take three adjacent words, trigrams; you get the idea, n-grams. So that's the idea of bigrams. For example, if you had "the cat sat on the mat", you will have "the cat", "cat sat", and so on; you get the idea. So let's do a little example, and Keras makes it very easy: you literally tell it ngrams equals 2. And from this you should immediately know that ngrams equals 1 is the default; that's why we didn't have to specify it before. So you run it, "the cat sat on the mat" is your training corpus, you get the vocabulary, and you can see it has created all these nice bigrams for you. That's it.

All right. Now what we do is go back to the songs, and we actually tell Keras to not just take each word but take all the bigrams as well, and hopefully it'll do a better job of figuring out what the genre is. And now, when you say, okay, take the top 5,000 words, that's great for single words, unigrams as they are called. But when you have bigrams, you have 5,000 possibilities for the first word and maybe 5,000 for the second word; that's a lot of possibilities, 25 million. Most of those 25 million possibilities are not going to show up in the data, so you don't need to make the vocabulary that much larger, but you should make it a bit more than 5,000. So here we go with, say, 20,000. Otherwise it's the same, still multi-hot. So let's run it.
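A minimal version of that toy bigram example (a sketch; `output_mode="multi_hot"` is assumed to match the rest of the notebook):

```python
from keras.layers import TextVectorization

# ngrams=2 keeps single words and adds every pair of adjacent words;
# ngrams=1 (words only) is the default, which is why it wasn't specified earlier.
vectorizer = TextVectorization(ngrams=2, output_mode="multi_hot")
vectorizer.adapt(["the cat sat on the mat"])
print(vectorizer.get_vocabulary())
# expect entries like 'the', 'cat', 'the cat', 'cat sat', 'sat on', ...
```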
Now that the layer has been set up with all the right settings, we'll ask it to create the vocabulary, again by doing exactly what we did before. This takes a few seconds; with bigrams, trigrams, all of them, it gets much more compute-intensive, which is why you're seeing this pause. All right, let's look at the first 10 tokens. The first 10 are all just single words, and that's not surprising, because single words are going to be the most frequent. And then the last few: "your mom", "your god", "you short", "you hell". All right, let's index all the data we have, the training, validation, and test sets, using this vocabulary.

Perfect. Now we come to our second model, where we say the incoming shape is now 20,000 long, because we increased max tokens from 5,000 to 20,000. So each thing is a 20,000-long vector; otherwise it's the same. And now we will use this thing called dropout for the first time, which is a regularization technique I have referred to earlier but never really described; I will describe it today if we have time, but first I'll run through the whole demo. For now, you can think of dropout as just another layer you can insert, and it's essentially a great way to prevent overfitting, so I routinely use it, and I'll talk more about it. You have this dropout layer in the middle: it receives the input from the dense layer and sends it to the output layer. The output layer is unchanged; it's a three-way softmax. Same model as before. All right, we'll come back to dropout. We compile it the same way as before, and then I will just fit it for three epochs.
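Roughly, the bigram pipeline and the second model look like this (a sketch under assumptions: `train_texts` is an assumed name for the raw lyrics, and the 0.5 dropout rate is illustrative, not necessarily the notebook's value):

```python
max_tokens = 20000
text_vectorizer = TextVectorization(ngrams=2, max_tokens=max_tokens, output_mode="multi_hot")
text_vectorizer.adapt(train_texts)        # build the unigram + bigram vocabulary
x_train = text_vectorizer(train_texts)    # index train/val/test the same way

inputs = keras.Input(shape=(max_tokens,))
x = layers.Dense(8, activation="relu")(inputs)
x = layers.Dropout(0.5)(x)                # randomly zero activations during training
outputs = layers.Dense(3, activation="softmax")(x)
bigram_model = keras.Model(inputs, outputs)
bigram_model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
bigram_model.fit(x_train, y_train, epochs=3, batch_size=32, validation_data=(x_val, y_val))
```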
If you're interested, after class you can try it for more epochs and see if it does better. For now, in the interest of time, we'll just do it for three. I think 72% was what we had with the single-word, unigram version.

>> If you rerun this code with the same settings, do you ever expect the accuracy to change?

>> If you ran this code on your machine, you would expect it to be roughly the same, but there are some minute differences due to hardware and device drivers.

>> If you rerun it on your own machine twice, would you expect a change?

>> That's actually a very tricky question, because it depends on what else I have been doing in that notebook. If I start fresh and do nothing but that, typically I get the same numbers. But for some reason I don't get exactly the same result every time.

Okay, so we come to this. Let's evaluate our little model. Okay, 75%. So it went from 72 to 75; that's actually a meaningful jump just from using bigrams. And I ran it only for three epochs; if you run it for 10, maybe it's going to do even better. That is the beauty of this thing. Now let's actually do a little demo: we'll try to predict the genre from some lyrics. Okay, I'll try another one: "Bites the Dust". It says it's a rock song; I think that's correct. Okay, folks, your turn now. Somebody tell me your favorite song.

>> "Dancing Queen" by ABBA.

>> I love ABBA. That's awesome. All right: Dancing Queen lyrics. Hmm, "Verse one", "Intro" — I don't like that; let's go to something without all this metadata. All right, I'll just take the first page. Are we good?
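In code, that little demo is roughly the following (a sketch; the genre names and their order are assumptions, as is the `genres` list itself):

```python
import numpy as np

genres = ["hip-hop", "pop", "rock"]          # assumed label order
lyrics = "paste the song's lyrics here"      # raw text copied from a lyrics page
vec = text_vectorizer([lyrics])              # same bigram multi-hot encoding used in training
probs = bigram_model.predict(vec)[0]
print(genres[int(np.argmax(probs))], probs)
```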
All right, run the model. Let's predict: pop, just about. Yay. So that's basically the model, but we have five minutes and I want to get back to dropout. You can play around and put your own lyrics in. Typically, in the last two years that I've been doing this particular lecture, I've noticed that the songs students pick are always rock songs for some reason. First time I'm getting a pop song, and from a group that I actually like, so thank you.

All right, let's go back to dropout. The idea is that the input comes in, it goes through a hidden layer, and so on and so forth. Dropout is a layer, and you put this layer in just like you use any other layer. What dropout does is take all the numbers coming into it from the previous layer and randomly decide to replace each number with a zero. That's it: it drops that number and replaces it with a zero. But it does it randomly. It basically tosses a coin: if the coin comes up heads, zero it out; if it comes up tails, let it through, pass it through. And the reason this is very effective is that you can imagine all the neurons in a particular layer, when they overfit to a particular data set, the overfitting happens because the neurons essentially collude with each other; they collude to overfit and predict things very accurately. So you want to break any sort of collusion between the neurons. I'm obviously using a sort of game-theoretic way of describing it, but the idea is that if there are any spurious correlations in your data, the neurons can pick them up by being correlated themselves.
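To see the coin-toss mechanism concretely, here is a minimal sketch (not notebook code):

```python
import numpy as np
from keras import layers

drop = layers.Dropout(rate=0.5)
x = np.ones((1, 8), dtype="float32")
print(drop(x, training=True))   # roughly half the entries become 0; Keras rescales the survivors (to 2.0 here)
print(drop(x, training=False))  # at inference time the layer is a pass-through
```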
And so the way you avoid the spurious correlation is by dropping neurons randomly. You just kill a neuron at random, which means that no neuron can depend on another neuron being available. I know it's a bit grim, but that's the basic idea of dropout. And apparently the story goes that the person on the team that invented it, Geoff Hinton, who won the Turing Award for this stuff, not for dropout, just for deep learning generally, said, and I don't know if it's true, that he got the idea when he went to a bank and realized that the people working in the branch he used to go to kept changing. They were never the same; people would be transferring in and out, and he wondered: why can't they just leave these people alone? Why does it keep changing? And then he got the insight that maybe a lot of fraud happens because a person working in the branch colludes with a customer, and by changing the staff constantly you break the risk of fraud happening. That apparently was the genesis of this idea. True or apocryphal? I have no idea, but it's sort of a fun story. Yes?

>> Instead of dropping at random, if we go with the way traditional models are built, concepts of multicollinearity and all of that, would that make it sharper compared to this?

>> The problem is that these networks are massive, right? For you to take each layer and look at its correlation with some other layer, and so on: first of all, investigating multicollinearity is itself a problem. The second thing is, okay, what do you do then? In linear regression you can do things like principal components analysis to get around it. Here everything is nonlinear; there is no easy way to solve the problem.
So instead we just solve the problem in one shot using dropout. All right. So I had some material on something called byte pair encoding, which I will cover when we get to LLMs; I stuck it at the end because I knew we probably wouldn't have enough time to cover it today anyway. It is a very clever tokenization scheme used by, for example, the GPT family, and it allows them to handle punctuation nicely, keep the case intact, and deal with words you just made up, things like that. Okay, we have one more minute; I'm happy to answer any questions you might have.

>> Initially, when we are picking the hidden layer, the number of neurons and so on: so far in all the materials this has been given to us, but initially how do you pick it? Is it more of a trial-and-error type of thing, or...?

>> It tends to be trial and error. That's in fact what I did when I created the Colabs. You can make it a bit more systematic by trying lots of different values, and there is a particular Python package called KerasTuner. Just Google KerasTuner; it comes with very nice Colabs, and if I have a chance maybe I'll record a screen walkthrough of using it. It's a very efficient way to do these things, and it comes under the broad category of something called hyperparameter optimization, where the number of neurons, the activation you use, the learning rate, all those things can be tried; you can try lots of variations, and KerasTuner is a great way to do it in the context of Keras. Other questions?

All right, I give you 30 seconds back. Thank you. See you tomorrow.
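A rough sketch of what such a KerasTuner search could look like for this model (assumptions: the package is installed as `keras-tuner`, and the search ranges and variable names are illustrative, not the lecture's Colab):

```python
import keras
from keras import layers
import keras_tuner as kt

def build_model(hp):
    # Hyperparameters to try: hidden-layer width and dropout rate.
    model = keras.Sequential([
        keras.Input(shape=(20000,)),
        layers.Dense(hp.Int("units", min_value=8, max_value=64, step=8), activation="relu"),
        layers.Dropout(hp.Float("dropout", min_value=0.2, max_value=0.5, step=0.1)),
        layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

tuner = kt.RandomSearch(build_model, objective="val_accuracy", max_trials=10)
tuner.search(x_train, y_train, epochs=3, validation_data=(x_val, y_val))
best_model = tuner.get_best_models(num_models=1)[0]
```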