We'll continue our journey with natural language processing. We looked at the bag-of-words model, one-hot encodings, and so forth. And today we will talk about embeddings, or to be more precise, stand-alone embeddings, and that will tee us up for something called contextual embeddings, which is where the transformer really comes into play.

All right, so let's get going. So far we have encoded input text as one-hot vectors. Just to refresh your memories from Monday: if this is the phrase that's coming into the system, we run it through the STIE process. First we standardize, then we split on white space to get individual words, then we assign words to integers, and then we take each integer and essentially create a one-hot version of that integer. And when we do that, basically we have a vocabulary. Right?
And in this example, we just have 100 words, and you will note that this vocabulary, which you arrive at once you standardize and tokenize, has words like "the", because we decided not to remove stop words like "a" and "the", and so on. So just to be clear about standardization: while it has historically been all about stripping punctuation, lowercasing everything, removing stop words, and stemming, if you look at modern practice, people essentially strip punctuation (maybe) and lowercase, and they often don't even bother to do stemming and things like that, or to remove stop words. Okay? And that's why in Keras, the default standardization is only lowercasing and punctuation stripping.

This detail may actually be handy for homework two, perhaps. That's why I'm pointing it out.

Okay. So that's what we have. And so for each word that's coming in, we have a one-hot vector. Right?
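A minimal sketch of that standardize-tokenize-index-encode pipeline in plain Python. The function names and the two-sentence corpus are made up for illustration; Keras's `TextVectorization` layer performs the same default standardization via `standardize="lower_and_strip_punctuation"`:

```python
import re

def standardize(text):
    # Keras-style default standardization: lowercase, then strip punctuation
    return re.sub(r"[^\w\s]", "", text.lower())

def tokenize(text):
    # split on white space to get individual words
    return standardize(text).split()

def build_vocab(corpus):
    # assign each distinct word an integer index, in order of first appearance
    vocab = {}
    for sentence in corpus:
        for word in tokenize(sentence):
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def one_hot(word, vocab):
    # a vector as long as the vocabulary, with a single 1 at the word's index
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

corpus = ["The movie was great!", "The film was awesome."]
vocab = build_vocab(corpus)
print(vocab)                    # {'the': 0, 'movie': 1, 'was': 2, 'great': 3, 'film': 4, 'awesome': 5}
print(one_hot("movie", vocab))  # [0, 1, 0, 0, 0, 0]
```

Note how small the vocabulary is here; in the lecture's examples it runs to 100,000 or 500,000 entries, which is exactly the problem discussed below.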
But the one-hot vector is just as long as the vocabulary. And then we can either quote-unquote add them up and get a count encoding, or we can just do an OR, looking for any ones in a column, and get a multi-hot encoding.

So that's what we saw last class. But this scheme, while it's quite effective for simple kinds of problems, has some very serious shortcomings. And so we will delve into those shortcomings, and then step back and ask: all right, is there a solution to fix these things?

The problem with one-hot vectors. There are lots of problems. Any volunteers?

"Similar words are understood differently."

Absolutely. What he's pointing out is that if you have two words which are synonyms, let's say "great" and "awesome", we would hope that the way we represent them using these vectors would have some connection to what the words actually mean.
In particular, we would hope that if they mean similar things, they are close by, and if they mean very different things, they are far away. Things like that: common-sense expectations of what you want the vectors to have. Clearly one-hot vectors won't have that, and we'll look into it in detail in a bit. But before we do that, there is also a computational issue, which we covered last class: if the vocabulary is really long, then each token, each word that's coming in, will have a one-hot vector that's as long as the vocabulary. Right? If you have 500,000 words in your vocabulary, every little word that comes in has a vector which is 500,000 long. Which feels like a gross waste.

Now, you can mitigate it somewhat by choosing only the most frequent words, but these long vectors still increase the number of weights the model has to learn, and increase the need for compute and data, and so on. Okay?
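The count and multi-hot encodings mentioned a moment ago can be sketched like this (the four-word vocabulary and the token list are made up for illustration):

```python
def count_encode(words, vocab):
    # "add up" the one-hot vectors: each entry is how often the word appears
    vec = [0] * len(vocab)
    for w in words:
        vec[vocab[w]] += 1
    return vec

def multi_hot_encode(words, vocab):
    # OR the one-hot vectors: each entry is 1 if the word appears at all
    vec = [0] * len(vocab)
    for w in words:
        vec[vocab[w]] = 1
    return vec

vocab = {"the": 0, "movie": 1, "was": 2, "great": 3}
tokens = ["the", "movie", "was", "the", "great"]
print(count_encode(tokens, vocab))      # [2, 1, 1, 1]
print(multi_hot_encode(tokens, vocab))  # [1, 1, 1, 1]
```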
Now let's say that we have created a vocabulary from a training corpus. Okay? We have a bunch of strings, text that's coming in; we have done the standardization and tokenization; we have created a vocabulary from it. And let's say we get the words "movie" and "film".

So the question is, and a sharp observer gets to this immediately: if you look at the words "movie" and "film", are these two vectors close to each other or not? Okay? So if you have two vectors, how would we measure closeness? What's the simplest way to think about closeness?

It's not a trick question.

Distance. Yeah, exactly. So if they are really close distance-wise, we would hope, right, that similar words should be close by. So let's just imagine the vector for "movie". Let's say your vocabulary is, I don't know, 100,000 long.
So your vector is 100,000 long, and the position for "movie" has a one, and everything else is zero. Right? And this second vector is for "film", and maybe this is the position for "film", so that has a one and everything else is zero. Okay? What's the distance between these two vectors?

You just use the Euclidean distance. The Euclidean distance, you will recall: you literally take the difference of these values, square them, add them up, and take the square root. Which means that all the zeros will obviously give you zero. This position is going to give you a one; this comparison is going to give you another one. 1 + 1 = 2; square root of 2. That's the answer. So the distance between these two vectors is root 2.

Now, so the distance between them is root 2. What about the one-hot encoded vectors for "good" and "bad"? Clearly "good" and "bad" mean opposite things. What is the distance between the "good" and "bad" one-hot vectors?
Still root 2. Because the zeros don't contribute anything, and the ones are not in the same place. So when you subtract, you'll get a one and a one; add them up, two; root 2. In fact, take any two words in your vocabulary: what's the distance between the two one-hot vectors for those words? It's root 2.

So if any two words are the same distance apart, does this even have a notion of distance? It doesn't. There's no notion of distance in one-hot vectors. They have no connection to the actual meanings of these words; they're just a way of representing them. Okay?

So that is the big problem with one-hot vectors. The distance between them is the same regardless of the words; it's got nothing to do with the meaning of the words. And this is a huge problem, which we'll have to solve.
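A quick check of that claim, assuming a hypothetical 100,000-word vocabulary (the word positions here are invented for illustration):

```python
import math

def one_hot(index, size):
    vec = [0] * size
    vec[index] = 1
    return vec

def euclidean(u, v):
    # difference of values, squared, summed, square-rooted
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

V = 100_000                                    # vocabulary size
movie, film = one_hot(7, V), one_hot(42, V)    # hypothetical word positions
good, bad = one_hot(13, V), one_hot(99_999, V)

# any pair of distinct one-hot vectors is exactly sqrt(2) apart
print(euclidean(movie, film))  # 1.4142135623730951
print(euclidean(good, bad))    # 1.4142135623730951
```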
So to summarize where we are: if the vocabulary is very long, each token will have a one-hot vector that's as long as the vocabulary. That's a computational and training problem. And then there is a deeper problem: there's no connection between the meaning of a word and its vector.

So wouldn't it be nice if vectors that represent synonyms, movie and film, or related words, apple and banana, were close to each other? And it would be nice if the vectors for things that mean very different things were far from each other.

So let's take a look at a particular example. Okay? Let's assume that we have magically been given vectors that actually carry some notion of meaning. And for convenience, let's say that we take just the first two dimensions of these vectors, so that we can do a scatter plot of them.
So we plot the first dimension of these vectors against the second dimension, and in this little cartoon we have plotted the words for factory, home, and building, and they all happen to be clustered here. Clearly this representation is capturing some notion of what the thing is. Right? Some sort of building. And here we have bicycle, truck, and car: clearly the transportation cluster. And here we have a fruit cluster, and here we have some sports-balls cluster. Okay?

Because it's a cartoon, things are all nicely and cleanly separated. Okay? So now if you take the word apple, where do you think it's going to go? Into A, B, C, or D? C, right? It makes eminent sense that it's going to go to C.

Good.
Now, wouldn't it be nice if, more generally, the geometric relationships between word vectors represented the semantic relationships between the underlying objects that the words stand for? Okay? And I say relationship and not distance, because it's not just distance; it's actually more than that. Okay?

So let's take another one. Here we have the vectors plotted for puppy and dog, and this is calf. And let's say that we need to figure out where the embedding, the word vector, for cow should appear. Where is most logical? Should it be A? Should it be B? Should it be C?

C? Okay, what's the logic? Any volunteers? Just put your hand up. Yes?

"A calf is a baby cow, whereas the cow is an adult. So it should be closer to the dog, which is the adult version of the puppy."
Got it. So you're basically saying: go from the puppy version to the grown-up version. Right? That's sort of what you're getting at, and that's a totally valid way to think about it. But there are a couple of ways to think about this, and that is one of the two. If this is bringing you bad memories of GMAT and GRE analogy questions, I apologize.

But: a puppy is to a dog as a calf is to a cow. Which is exactly what Jay is pointing out: you can go from the baby version to the full-grown version if you go in the horizontal direction. Okay? But maybe if you go in the vertical direction, you're essentially moving across the young animals of different species. You're still moving along the same dimension, the same age level; that is the band here.
So this band is the grown-up version of a whole bunch of animals, and that one is the puppy version of a whole bunch of animals. So the vertical dimension measures some sort of variation across animal species at roughly the same maturity stage.

Okay? So these directions also matter. It's not just the distance. That's what I mean when I say semantic relationship and geometric relationship: relationship is distance and direction. Both have to be involved.

Now, word embeddings, as we will learn soon, are word vectors designed to achieve exactly these requirements. Okay? They will achieve these requirements, and they will fix both these problems very elegantly.

Okay? So let's say that we have word embeddings that solve both these problems. Are we basically done? Can we declare victory?
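Before answering, the direction idea above, puppy is to dog as calf is to cow, can be sketched with toy 2-D vectors. All the numbers here are invented for illustration; real learned embeddings only approximately satisfy this kind of arithmetic:

```python
# Toy 2-D "embeddings": the x-axis encodes maturity (baby -> adult),
# the y-axis separates species. Numbers are made up.
puppy = (1.0, 3.0)
dog   = (4.0, 3.0)
calf  = (1.0, 1.0)

def sub(u, v):
    return (u[0] - v[0], u[1] - v[1])

def add(u, v):
    return (u[0] + v[0], u[1] + v[1])

# puppy is to dog as calf is to cow: start at calf and
# follow the same horizontal "grow up" direction
grow_up = sub(dog, puppy)      # (3.0, 0.0)
cow_guess = add(calf, grow_up)
print(cow_guess)               # (4.0, 1.0): to the right of calf, same species row
```

This is the same vector arithmetic behind the well-known word2vec example, king minus man plus woman is approximately queen.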
Or is there anything that even word vectors which actually capture the meaning of the underlying thing don't fully address? Is there any remaining problem we have to worry about? Yes?

Context. Context? Yes. Context, right. Sure, every word has a meaning, but we know that some words have multiple meanings. And that meaning is really only inferable, you can only make sense of it, if you know the surrounding context. Right? If you see the word bank, b-a-n-k: sure, it could be a financial institution. It could be the side of a river. It could be the act of a plane turning in one direction. It could be someone hoping for something, banking on something. The list of possible meanings of the word bank is basically enormous. And you cannot figure out what it means unless you know what else is going on around that word. So context is super, super important.
And these embeddings, word embeddings, just tell you what the meaning of the word is. So what's going to happen when you have a word which could mean many different things is that the embedding gives you some average version of that meaning. And that average version is not going to be very good. Now, there are some words which only mean one thing, and you'll be okay there. But for the rest, it's going to be tough.

So what we need is some way, we need to find a way, to make word embeddings contextual. Meaning we need to somehow consider the other words in the sentence. Okay? So if we can do that, then we will be in great shape; we can solve all sorts of NLP problems.

Now, as it turns out, there are contextual word embeddings: word vectors that achieve both these requirements. They capture the semantic-geometric relationship I talked about, and they are contextual. Okay? They're really fantastic.
And the key to calculating contextual word embeddings is the transformer. That is why transformers are justifiably famous.

So what's the lay of the land here? Today we are going to look at how to calculate stand-alone, or non-contextual, word embeddings. And then, starting Monday, we will take these stand-alone embeddings and make them contextual using transformers. Okay? That is the plan. Any questions so far?

So now let's think about how we can learn these stand-alone embeddings from data. Now, the naive way to think about it would be: why don't we manually collect a whole bunch of synonyms, antonyms, related words, et cetera, and try to assign embedding vectors to them that satisfy our requirements? As you can imagine, this is going to be a long, painful, and never quite complete exercise. Okay?
And given that we are machine learning people, the question is: can we do it in a better way? Can we just learn it from the data, without doing any of this manual stuff? Okay?

And the key insight that makes it all happen is this humble-looking line on the screen by John Firth, who was a linguist: "You shall know a word by the company it keeps." I wish I could deliver this in a British accent. Know a word by the company it keeps. Okay? It's a very profound statement. And here is the key intuition behind it.

Let's say that you have a sentence like "The acting in the ___ was superb." Okay? What are some words that you folks think are likely to appear in the blank? Shout it out. Play. Play. Movie. Show. Musical. Right? Those are all great candidates: the acting in the movie, the film, the musical, and so on. Okay?
Now, let's say that I ask you: what are some words that are unlikely to appear in the sentence? I think we could all be here for days listing them out. I just listed a few; I love the word tensor, so I have to find a way to use it somewhere. So, all right: "The acting in the banana was superb." Clearly nonsensical, right?

So what we are seeing here is that if certain words are interchangeable in a sentence, meaning you can swap one for the other and the sentence still makes sense, that is, if they appear in the same context very often, then they are probably related.

We don't even have to know what the word is. All we have to know is that you can fill in the blank of a particular sentence with this word or with that word and it actually makes sense. Then we say: oh, wow, okay, these words are related then.
Right? You're inferring their relatedness not by looking at them directly, but by seeing where they live. It's a very, very clever idea, and it'll slowly sink in. Okay?

So that's the first observation: if words appear in the same context very often, they are likely to be related. More generally, related words appear in related contexts.

So all we have to do is figure out a way to calculate context, and then use that to understand what the words are that happen to be living in this context. And there are some beautiful ways to do these things; we'll really dive deep into one such way.

So what we're going to do in this approach is this: since words that appear in related contexts mean related, similar things, first of all you have to define what you mean by context. And there are many ways to define context.
We're going to go with a very 514 00:18:23,359 --> 00:18:26,959 simple definition, 515 00:18:24,759 --> 00:18:29,079 which is that if words happen to appear 516 00:18:26,960 --> 00:18:31,159 in the same sentence a lot, 517 00:18:29,079 --> 00:18:32,480 then we think that, okay, 518 00:18:31,159 --> 00:18:34,440 they are in the same context. So, 519 00:18:32,480 --> 00:18:35,120 context here means sentence. 520 00:18:34,440 --> 00:18:38,200 Okay? 521 00:18:35,119 --> 00:18:40,399 So, what we can do is we can actually 522 00:18:38,200 --> 00:18:41,919 take a whole bunch of text, maybe all of 523 00:18:40,400 --> 00:18:43,519 Wikipedia, 524 00:18:41,919 --> 00:18:46,040 and then break it up into sentences. 525 00:18:43,519 --> 00:18:47,279 We'll have billions of sentences, right? 526 00:18:46,039 --> 00:18:48,879 And then for all these billions of 527 00:18:47,279 --> 00:18:51,639 sentences, we can literally go and count, 528 00:18:48,880 --> 00:18:52,880 for every pair of words, how many times 529 00:18:51,640 --> 00:18:55,280 are both these words showing up in the 530 00:18:52,880 --> 00:18:57,880 same sentence? 531 00:18:55,279 --> 00:18:59,359 Okay? And we call this co-occurrence, 532 00:18:57,880 --> 00:19:00,640 right? The words are co-occurring in the 533 00:18:59,359 --> 00:19:02,000 sentence. 534 00:19:00,640 --> 00:19:02,880 And it doesn't have to be next to each 535 00:19:02,000 --> 00:19:04,759 other, 536 00:19:02,880 --> 00:19:07,280 right? We know that in complicated 537 00:19:04,759 --> 00:19:09,079 sentences, a word at the very end of the 538 00:19:07,279 --> 00:19:10,799 sentence could have 539 00:19:09,079 --> 00:19:11,759 its meaning altered by 540 00:19:10,799 --> 00:19:12,678 a word that happened at the very 541 00:19:11,759 --> 00:19:14,240 beginning of the sentence, and it could 542 00:19:12,679 --> 00:19:16,240 be a really long sentence. 
543 00:19:14,240 --> 00:19:18,079 So, we take the whole sentence and say, 544 00:19:16,240 --> 00:19:19,599 are two words co-occurring in the 545 00:19:18,079 --> 00:19:20,720 sentence, yes or no? And we just count 546 00:19:19,599 --> 00:19:23,799 them up. 547 00:19:20,720 --> 00:19:23,799 And when we do that, 548 00:19:24,119 --> 00:19:27,678 we will get 549 00:19:26,279 --> 00:19:29,519 something like this. 550 00:19:27,679 --> 00:19:30,880 551 00:19:29,519 --> 00:19:32,359 This just captures what I've been 552 00:19:30,880 --> 00:19:34,280 talking about. Identify all the words 553 00:19:32,359 --> 00:19:35,799 that occur, let's say, in Wikipedia. And 554 00:19:34,279 --> 00:19:37,039 then for every sentence, you look at 555 00:19:35,799 --> 00:19:38,759 every word pair and count the number of 556 00:19:37,039 --> 00:19:41,480 times they appear in the same sentence 557 00:19:38,759 --> 00:19:43,839 across all those sentences. Okay? 558 00:19:41,480 --> 00:19:46,440 This is a word-word co-occurrence 559 00:19:43,839 --> 00:19:47,519 matrix. So, for example, 560 00:19:46,440 --> 00:19:48,679 let's assume that you took all of 561 00:19:47,519 --> 00:19:49,918 Wikipedia, looked at all the 562 00:19:48,679 --> 00:19:51,960 distinct words, and you found there are 563 00:19:49,919 --> 00:19:54,360 500,000 words. 564 00:19:51,960 --> 00:19:56,880 Okay? So, there are 500,000 words 565 00:19:54,359 --> 00:20:00,240 in the columns and 566 00:19:56,880 --> 00:20:02,640 500,000 words on the rows. 567 00:20:00,240 --> 00:20:05,599 And then you go 568 00:20:02,640 --> 00:20:08,000 and each cell of this table 569 00:20:05,599 --> 00:20:10,519 has a number that you calculate, which is 570 00:20:08,000 --> 00:20:12,039 the number of times the word in the row 571 00:20:10,519 --> 00:20:14,319 and the word in the column happen to 572 00:20:12,039 --> 00:20:15,680 show up in the same sentence. 
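The counting procedure just described can be sketched in a few lines of Python. This is a toy illustration: the three-sentence corpus below is made up, standing in for something like Wikipedia, and co-occurrence is counted as a yes/no per sentence for each word pair.

```python
from collections import Counter
from itertools import combinations

# A made-up corpus standing in for "all of Wikipedia, broken into sentences".
sentences = [
    "deep learning is fun",
    "deep learning needs data",
    "the acting in the movie was superb",
]

# Vocabulary: every distinct word across all sentences, mapped to an integer.
vocab = sorted({w for s in sentences for w in s.split()})
index = {w: i for i, w in enumerate(vocab)}

# For every unordered word pair, count how many sentences contain both.
cooccur = Counter()
for s in sentences:
    words = sorted(set(s.split()))          # co-occurrence is a yes/no per sentence
    for w1, w2 in combinations(words, 2):
        cooccur[(index[w1], index[w2])] += 1

print(cooccur[(index["deep"], index["learning"])])  # 2: sentences 1 and 2
```

In a real setting you would stream billions of sentences through the same loop; with a 500,000-word vocabulary the resulting matrix is enormous and mostly zeros, so in practice it would be stored sparsely.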
That's it. 573 00:20:14,319 --> 00:20:18,119 So, for instance, 574 00:20:15,680 --> 00:20:20,360 if you look at deep and learning, right? 575 00:20:18,119 --> 00:20:22,519 The word deep and the word learning, 576 00:20:20,359 --> 00:20:24,719 maybe 577 00:20:22,519 --> 00:20:28,319 those two words occurred in the same 578 00:20:24,720 --> 00:20:31,400 sentence maybe 3,025 times, 579 00:20:28,319 --> 00:20:35,200 3,025 sentences across all of Wikipedia. 580 00:20:31,400 --> 00:20:35,200 You put 3,025 right in that cell. 581 00:20:35,240 --> 00:20:37,680 Okay? 582 00:20:36,000 --> 00:20:38,880 Many words are unlikely to appear in the 583 00:20:37,680 --> 00:20:40,360 same sentence. 584 00:20:38,880 --> 00:20:42,720 So, much of this matrix is going to be 585 00:20:40,359 --> 00:20:42,719 zero. 586 00:20:44,319 --> 00:20:47,119 But, we 587 00:20:45,359 --> 00:20:49,639 fundamentally form this co-occurrence 588 00:20:47,119 --> 00:20:49,639 matrix. 589 00:20:49,960 --> 00:20:55,640 This matrix essentially embodies all the 590 00:20:54,119 --> 00:20:58,359 context information that we can work 591 00:20:55,640 --> 00:20:59,840 with in a very compact, 592 00:20:58,359 --> 00:21:02,240 sort of 593 00:20:59,839 --> 00:21:02,240 elegant form. 594 00:21:03,279 --> 00:21:06,039 And using this, we're going to try to 595 00:21:04,640 --> 00:21:07,400 figure out 596 00:21:06,039 --> 00:21:08,440 what the word embeddings actually are 597 00:21:07,400 --> 00:21:09,519 going to be. 598 00:21:08,440 --> 00:21:11,720 Okay? 599 00:21:09,519 --> 00:21:13,480 And 600 00:21:11,720 --> 00:21:15,440 so, by the way, the approach I'm 601 00:21:13,480 --> 00:21:19,240 describing here to calculate standalone 602 00:21:15,440 --> 00:21:19,240 embeddings is called GloVe. 603 00:21:20,200 --> 00:21:24,799 It's called GloVe, and 604 00:21:23,039 --> 00:21:27,519 standalone embeddings first sort of came 605 00:21:24,799 --> 00:21:29,720 onto the NLP deep learning scene. 
606 00:21:27,519 --> 00:21:32,519 There were two sort of ways of doing it. 607 00:21:29,720 --> 00:21:34,400 One was called word2vec. 608 00:21:32,519 --> 00:21:35,879 The other one is GloVe. 609 00:21:34,400 --> 00:21:36,960 And they're both comparable, right? They 610 00:21:35,880 --> 00:21:38,520 use slightly different mechanisms of 611 00:21:36,960 --> 00:21:40,559 doing this. 612 00:21:38,519 --> 00:21:42,279 We went with GloVe for this lecture 613 00:21:40,559 --> 00:21:44,359 because I think it's actually a little 614 00:21:42,279 --> 00:21:45,759 easier to understand and equally 615 00:21:44,359 --> 00:21:47,199 effective. 616 00:21:45,759 --> 00:21:49,480 Okay? 617 00:21:47,200 --> 00:21:50,880 So, this is what we have. And so, what 618 00:21:49,480 --> 00:21:52,880 we want to do is 619 00:21:50,880 --> 00:21:54,120 we want to learn these embedding vectors 620 00:21:52,880 --> 00:21:56,200 that can be used to essentially 621 00:21:54,119 --> 00:21:59,319 approximate this matrix. 622 00:21:56,200 --> 00:22:01,720 Right? If you can find vectors that can 623 00:21:59,319 --> 00:22:03,279 actually approximate this matrix, then 624 00:22:01,720 --> 00:22:04,519 hopefully those vectors do in fact 625 00:22:03,279 --> 00:22:06,519 capture some notion of what the words 626 00:22:04,519 --> 00:22:07,440 actually mean. Okay? So, let me put it 627 00:22:06,519 --> 00:22:10,119 differently. 628 00:22:07,440 --> 00:22:12,759 You come to me with this matrix. Okay? 629 00:22:10,119 --> 00:22:14,359 And you say, okay, Rama, do you have 630 00:22:12,759 --> 00:22:15,679 embeddings for me? 631 00:22:14,359 --> 00:22:17,319 And I'm like, yeah, I reach into my bag 632 00:22:15,679 --> 00:22:19,160 and I'm like, okay, for every one of those 633 00:22:17,319 --> 00:22:20,119 500,000 words, I have an embedding. 634 00:22:19,160 --> 00:22:21,440 Right? 
635 00:22:20,119 --> 00:22:23,039 Let's ignore for a moment how I actually 636 00:22:21,440 --> 00:22:24,000 calculated the embeddings. I have the 637 00:22:23,039 --> 00:22:25,839 embeddings. 638 00:22:24,000 --> 00:22:28,400 How will you know if my embeddings are 639 00:22:25,839 --> 00:22:28,399 any good? 640 00:22:28,720 --> 00:22:31,559 How will you know? 641 00:22:30,279 --> 00:22:34,440 How can you actually assess if those 642 00:22:31,559 --> 00:22:34,440 embeddings are any good? 643 00:22:34,559 --> 00:22:37,440 Well, you can certainly say, okay, give 644 00:22:35,799 --> 00:22:39,240 me the embeddings for movie and film, and 645 00:22:37,440 --> 00:22:40,440 you can see if they're really close by. 646 00:22:39,240 --> 00:22:42,160 You can look at the 647 00:22:40,440 --> 00:22:43,920 embeddings for movie and tensor, and 648 00:22:42,160 --> 00:22:46,600 hopefully they're far away. 649 00:22:43,920 --> 00:22:47,360 But, you'll never get done. 650 00:22:46,599 --> 00:22:49,199 Right? 651 00:22:47,359 --> 00:22:51,159 How can you systematically evaluate 652 00:22:49,200 --> 00:22:53,720 this? 653 00:22:51,160 --> 00:22:55,840 Well, what 654 00:22:53,720 --> 00:22:57,400 if I come to you and say, not only 655 00:22:55,839 --> 00:22:59,079 am I going to give you an embedding, 656 00:22:57,400 --> 00:23:00,480 here is a procedure 657 00:22:59,079 --> 00:23:02,279 which you can use with these embeddings 658 00:23:00,480 --> 00:23:04,400 to validate how good they are. 659 00:23:02,279 --> 00:23:07,160 What you can do is you 660 00:23:04,400 --> 00:23:09,960 can use the embeddings to recreate the 661 00:23:07,160 --> 00:23:11,600 co-occurrence matrix. 
662 00:23:09,960 --> 00:23:14,400 And if the recreated co-occurrence 663 00:23:11,599 --> 00:23:15,319 matrix actually matches the real matrix 664 00:23:14,400 --> 00:23:17,519 well, these embeddings probably are 665 00:23:15,319 --> 00:23:18,559 pretty good. 666 00:23:17,519 --> 00:23:20,079 Remember, the whole point of the 667 00:23:18,559 --> 00:23:21,720 co-occurrence matrix is to capture this context 668 00:23:20,079 --> 00:23:23,960 information. So, if my embeddings can 669 00:23:21,720 --> 00:23:25,640 actually recreate it, reconstruct it 670 00:23:23,960 --> 00:23:27,400 pretty closely, right? It'll never be 671 00:23:25,640 --> 00:23:28,200 perfect. But if it comes pretty close, 672 00:23:27,400 --> 00:23:29,759 then we're like, wow, okay, these 673 00:23:28,200 --> 00:23:31,400 embeddings do mean something. 674 00:23:29,759 --> 00:23:33,839 So, if it turns out, for instance, that 675 00:23:31,400 --> 00:23:36,600 the matrix has 676 00:23:33,839 --> 00:23:40,159 a value of 3,000 for deep and learning 677 00:23:36,599 --> 00:23:40,959 and a value of, 678 00:23:40,160 --> 00:23:43,519 say, 679 00:23:40,960 --> 00:23:45,200 50 for extreme and learning, 680 00:23:43,519 --> 00:23:48,480 and our embedding comes in and says 681 00:23:45,200 --> 00:23:49,360 3,002 for the first one and 48 for the 682 00:23:48,480 --> 00:23:51,440 second one, we'll be 683 00:23:49,359 --> 00:23:53,279 pretty impressed. 684 00:23:51,440 --> 00:23:54,320 Whoa, it couldn't be that close 685 00:23:53,279 --> 00:23:55,480 unless it was actually capturing 686 00:23:54,319 --> 00:23:57,519 something. 687 00:23:55,480 --> 00:23:59,000 Okay? So, that's what we're going to do. 688 00:23:57,519 --> 00:24:00,240 And so, we're going to take this logic 689 00:23:59,000 --> 00:24:03,200 of saying: 690 00:24:00,240 --> 00:24:05,960 find embeddings that can approximate 691 00:24:03,200 --> 00:24:07,880 what we actually see in Wikipedia. 
692 00:24:05,960 --> 00:24:09,240 Right? And we're going to use that idea 693 00:24:07,880 --> 00:24:10,440 to actually build the model and learn 694 00:24:09,240 --> 00:24:12,559 the embeddings 695 00:24:10,440 --> 00:24:14,759 using nothing more than basically linear 696 00:24:12,559 --> 00:24:14,759 regression. 697 00:24:16,480 --> 00:24:18,839 And here you were thinking that linear 698 00:24:17,759 --> 00:24:22,160 regression is useless now that you've 699 00:24:18,839 --> 00:24:22,159 graduated machine learning, right? 700 00:24:22,319 --> 00:24:24,759 701 00:24:23,240 --> 00:24:26,599 So, we can think of the embedding 702 00:24:24,759 --> 00:24:28,879 vectors that we want to figure out as 703 00:24:26,599 --> 00:24:31,319 just the weights in a model, 704 00:24:28,880 --> 00:24:33,120 in a linear regression. 705 00:24:31,319 --> 00:24:35,200 We can think of the co-occurrence matrix 706 00:24:33,119 --> 00:24:37,759 as just the data we're going to use in 707 00:24:35,200 --> 00:24:39,799 this model to estimate these weights. 708 00:24:37,759 --> 00:24:42,200 And the model we're going to use 709 00:24:39,799 --> 00:24:43,799 is something like this. 710 00:24:42,200 --> 00:24:45,080 So, first I have to inflict some 711 00:24:43,799 --> 00:24:46,559 notation on you. 712 00:24:45,079 --> 00:24:50,000 We will denote the co-occurrence count 713 00:24:46,559 --> 00:24:51,759 of, say, words i and j as Xij. 714 00:24:50,000 --> 00:24:53,079 Xij is just data. 715 00:24:51,759 --> 00:24:55,079 It's just data. Okay? It's not a 716 00:24:53,079 --> 00:24:55,639 variable, it's data. 717 00:24:55,079 --> 00:24:57,399 718 00:24:55,640 --> 00:24:59,160 And then we will denote an embedding 719 00:24:57,400 --> 00:25:01,080 vector for each word. Remember, we need 720 00:24:59,160 --> 00:25:03,840 to have a vector for each word. So, we 721 00:25:01,079 --> 00:25:06,199 call it Wi, right? Wi is the embedding 722 00:25:03,839 --> 00:25:09,119 vector for word i. 
723 00:25:06,200 --> 00:25:10,559 And we will also assume that 724 00:25:09,119 --> 00:25:11,639 some words are just inherently very 725 00:25:10,559 --> 00:25:13,440 popular. They're going to show up all 726 00:25:11,640 --> 00:25:15,920 the time like the word the. 727 00:25:13,440 --> 00:25:18,320 Okay? So, we'll assume that every word 728 00:25:15,920 --> 00:25:20,160 has some natural frequency of occurring 729 00:25:18,319 --> 00:25:22,919 like movie versus flick. 730 00:25:20,160 --> 00:25:24,480 The versus tensor. So, we want the 731 00:25:22,920 --> 00:25:27,279 vectors to capture the co-occurrence 732 00:25:24,480 --> 00:25:28,880 patterns independent of how naturally 733 00:25:27,279 --> 00:25:29,639 frequent the words are. 734 00:25:28,880 --> 00:25:30,920 Okay? 735 00:25:29,640 --> 00:25:33,600 And so, to capture this natural 736 00:25:30,920 --> 00:25:34,600 frequency, we will assign a bias or Bi 737 00:25:33,599 --> 00:25:36,359 to each word that we're going to 738 00:25:34,599 --> 00:25:39,319 calculate. And all this will become 739 00:25:36,359 --> 00:25:41,000 clear in just a moment. Okay? So 740 00:25:39,319 --> 00:25:42,480 with this setup, basically what we're 741 00:25:41,000 --> 00:25:44,679 saying is something very simple. We're 742 00:25:42,480 --> 00:25:45,960 saying, look, this co-occurrence matrix 743 00:25:44,679 --> 00:25:48,000 that we have 744 00:25:45,960 --> 00:25:51,240 that we're able to compute, it came 745 00:25:48,000 --> 00:25:53,400 about because in in truth, in reality, 746 00:25:51,240 --> 00:25:55,559 in nature, there are these embedding 747 00:25:53,400 --> 00:25:58,120 vectors for every word. 
748 00:25:55,559 --> 00:26:00,240 There are these biases Bi for every word, 749 00:25:58,119 --> 00:26:03,000 and every co-occurrence number that you 750 00:26:00,240 --> 00:26:05,079 see just came about because, you know, 751 00:26:03,000 --> 00:26:07,839 under the hood, mother nature grabbed 752 00:26:05,079 --> 00:26:09,720 the bias number for word i, the bias 753 00:26:07,839 --> 00:26:11,639 number for word j, took the two 754 00:26:09,720 --> 00:26:13,799 embedding vectors, which only mother 755 00:26:11,640 --> 00:26:15,200 nature knows at this point, did the dot 756 00:26:13,799 --> 00:26:16,919 product of them, added them up, and that's 757 00:26:15,200 --> 00:26:19,080 how we get this number. 758 00:26:16,920 --> 00:26:21,560 So, it basically says the number you see 759 00:26:19,079 --> 00:26:23,039 is the sum of the inherent popularity of 760 00:26:21,559 --> 00:26:25,159 the first word plus the inherent 761 00:26:23,039 --> 00:26:26,799 popularity of the second word plus the 762 00:26:25,160 --> 00:26:29,000 way in which these two words connect to 763 00:26:26,799 --> 00:26:29,960 each other. 764 00:26:29,000 --> 00:26:30,839 That's it. 765 00:26:29,960 --> 00:26:32,440 And 766 00:26:30,839 --> 00:26:33,599 you will agree with me 767 00:26:32,440 --> 00:26:34,799 that it literally can't get simpler than 768 00:26:33,599 --> 00:26:36,759 this. 769 00:26:34,799 --> 00:26:38,200 If I tell you, hey, here are two things, 770 00:26:36,759 --> 00:26:39,799 I want you to tell me how connected they 771 00:26:38,200 --> 00:26:42,360 are, you'll be like, well, let's take 772 00:26:39,799 --> 00:26:44,200 the first one, figure out how inherently 773 00:26:42,359 --> 00:26:45,039 popular it is, do the same for the second, and 774 00:26:44,200 --> 00:26:46,319 then of course you've got to worry about 775 00:26:45,039 --> 00:26:47,678 the connection. So, we do a dot 776 00:26:46,319 --> 00:26:49,720 product. 777 00:26:47,679 --> 00:26:50,440 That's it. Those three things. 
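As a tiny numeric illustration of that model, the predicted co-occurrence score for a word pair is just two biases plus a dot product. All the embedding values and biases below are made-up toy numbers:

```python
import numpy as np

# Hypothetical embedding vectors and biases for two words i and j:
w_i = np.array([0.2, -0.5, 0.7])   # embedding vector of word i
w_j = np.array([0.1,  0.4, 0.6])   # embedding vector of word j
b_i, b_j = 1.5, 0.8                # inherent-popularity biases

# The number you see = popularity of i + popularity of j + how they connect.
prediction = b_i + b_j + w_i @ w_j
print(prediction)  # close to 2.54 (= 1.5 + 0.8 + 0.24)
```

With a 500,000-word vocabulary this same three-term sum is what the model would produce for every cell of the matrix.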
778 00:26:49,720 --> 00:26:52,360 Right? 779 00:26:50,440 --> 00:26:53,840 So, this is what we have. Now, you may 780 00:26:52,359 --> 00:26:54,599 have seen 781 00:26:53,839 --> 00:26:56,839 782 00:26:54,599 --> 00:27:00,079 from your, you know, good old linear 783 00:26:56,839 --> 00:27:02,039 regression that whenever your 784 00:27:00,079 --> 00:27:05,119 dependent variable happens to be 785 00:27:02,039 --> 00:27:08,279 positive, guaranteed to be positive, 786 00:27:05,119 --> 00:27:10,519 and it ends up having a big range, 787 00:27:08,279 --> 00:27:12,599 we always advise you folks 788 00:27:10,519 --> 00:27:14,839 to take the logarithmic transformation 789 00:27:12,599 --> 00:27:16,480 to squash it into a narrow range, because 790 00:27:14,839 --> 00:27:18,319 that will make these models much more 791 00:27:16,480 --> 00:27:20,319 well-behaved. 792 00:27:18,319 --> 00:27:22,240 Regression misbehaves if the Y value has a huge 793 00:27:20,319 --> 00:27:23,159 range. The canonical example is 794 00:27:22,240 --> 00:27:24,960 that, you know, if you are trying to 795 00:27:23,160 --> 00:27:27,560 model the net worth of 796 00:27:24,960 --> 00:27:29,120 people, right? It's going to have a long 797 00:27:27,559 --> 00:27:30,879 right tail with people like Elon and 798 00:27:29,119 --> 00:27:33,279 Jeff and so on on the right side, right? 799 00:27:30,880 --> 00:27:34,880 And the rest of us on the left. So, 800 00:27:33,279 --> 00:27:35,920 to model this big long-tail 801 00:27:34,880 --> 00:27:37,360 distribution, you just take the 802 00:27:35,920 --> 00:27:39,120 logarithm, just squash everything to a 803 00:27:37,359 --> 00:27:41,479 very narrow range. And that will make 804 00:27:39,119 --> 00:27:42,559 regression much better behaved. Okay? 805 00:27:41,480 --> 00:27:45,400 Here, 806 00:27:42,559 --> 00:27:47,000 most of the counts are going to be zero. 
807 00:27:45,400 --> 00:27:48,440 But, some of the counts could be very 808 00:27:47,000 --> 00:27:49,160 high. 809 00:27:48,440 --> 00:27:51,000 Right? 810 00:27:49,160 --> 00:27:52,960 And therefore, if you take 811 00:27:51,000 --> 00:27:54,839 the logarithm, it makes it much better 812 00:27:52,960 --> 00:27:56,440 behaved, so we take the logarithm here. 813 00:27:54,839 --> 00:27:57,439 So, this is actually our model. That's 814 00:27:56,440 --> 00:27:58,720 it. 815 00:27:57,440 --> 00:28:00,759 And I know that many of the numbers are 816 00:27:58,720 --> 00:28:02,600 zero and log of zero is not defined. So, 817 00:28:00,759 --> 00:28:03,960 we can just add one to 818 00:28:02,599 --> 00:28:06,240 all the numbers 819 00:28:03,960 --> 00:28:08,360 to avoid that kind of 820 00:28:06,240 --> 00:28:09,559 technical arithmetic problem. 821 00:28:08,359 --> 00:28:10,319 But, this conceptually is what's going 822 00:28:09,559 --> 00:28:11,519 on. This is the model we want to 823 00:28:10,319 --> 00:28:14,079 calculate. 824 00:28:11,519 --> 00:28:16,759 So, given that we have essentially 825 00:28:14,079 --> 00:28:17,839 postulated this model 826 00:28:16,759 --> 00:28:19,519 and we have this data, this 827 00:28:17,839 --> 00:28:21,240 co-occurrence matrix, how can we 828 00:28:19,519 --> 00:28:24,279 actually find the weights? How can we 829 00:28:21,240 --> 00:28:25,679 actually find the Bs and the Ws? What 830 00:28:24,279 --> 00:28:26,960 should we do? 831 00:28:25,679 --> 00:28:29,320 Go back to the fundamentals of 832 00:28:26,960 --> 00:28:30,519 regression. Think about it conceptually. 833 00:28:29,319 --> 00:28:31,879 You have some model which has some 834 00:28:30,519 --> 00:28:33,519 weights. 835 00:28:31,880 --> 00:28:35,320 There's some data you can use to train 836 00:28:33,519 --> 00:28:36,960 the model. 837 00:28:35,319 --> 00:28:38,240 Right? 
And you need to find the best set 838 00:28:36,960 --> 00:28:40,079 of weights. What does the best mean 839 00:28:38,240 --> 00:28:42,279 here? 840 00:28:40,079 --> 00:28:43,879 The lowest 841 00:28:42,279 --> 00:28:46,119 The lowest error. Exactly. There are 842 00:28:43,880 --> 00:28:47,280 many ways to measure error, right? What 843 00:28:46,119 --> 00:28:48,759 would be What is the simplest thing we 844 00:28:47,279 --> 00:28:50,240 could use? So, what you do is you would 845 00:28:48,759 --> 00:28:52,079 actually do mean squared error. Right? 846 00:28:50,240 --> 00:28:53,240 Which is what you're getting at. 847 00:28:52,079 --> 00:28:54,359 You could take the actual thing, you 848 00:28:53,240 --> 00:28:55,839 could take the predicted thing, take the 849 00:28:54,359 --> 00:28:57,119 difference, square it, and minimize the 850 00:28:55,839 --> 00:28:59,759 sum of it. 851 00:28:57,119 --> 00:29:00,839 Okay? If your model exactly nails every 852 00:28:59,759 --> 00:29:02,799 number in the co-occurrence matrix, the 853 00:29:00,839 --> 00:29:04,879 error is going to be zero. 854 00:29:02,799 --> 00:29:07,759 Okay? So 855 00:29:04,880 --> 00:29:09,240 what we do is we literally just do that. 856 00:29:07,759 --> 00:29:11,200 This is the data. 857 00:29:09,240 --> 00:29:13,319 This is the actual predicted value. 858 00:29:11,200 --> 00:29:14,880 Predicted value, actual value, 859 00:29:13,319 --> 00:29:17,439 difference squared, add them all up, 860 00:29:14,880 --> 00:29:17,440 minimize. 861 00:29:17,839 --> 00:29:21,039 Okay? 862 00:29:19,200 --> 00:29:23,200 Uh yes. 863 00:29:21,039 --> 00:29:25,720 And in the loss function, how is this 864 00:29:23,200 --> 00:29:28,679 capturing the context? Because unless my 865 00:29:25,720 --> 00:29:31,120 input data is having that context 866 00:29:28,679 --> 00:29:33,120 how will this actually differentiate 867 00:29:31,119 --> 00:29:34,239 based on where the particular word is 868 00:29:33,119 --> 00:29:36,359 used? 
869 00:29:34,240 --> 00:29:37,079 The way the word is 870 00:29:36,359 --> 00:29:38,559 used... 871 00:29:37,079 --> 00:29:41,559 so, let's take two words like deep and 872 00:29:38,559 --> 00:29:42,918 learning. Now, let's take this word and 873 00:29:41,559 --> 00:29:44,839 change it according to the context. 874 00:29:42,919 --> 00:29:46,280 Okay. 875 00:29:44,839 --> 00:29:47,359 Sorry, go ahead. Yeah, so basically, 876 00:29:46,279 --> 00:29:49,759 let's say I'm talking about the word 877 00:29:47,359 --> 00:29:50,919 banana. So it's a fruit in some context, 878 00:29:49,759 --> 00:29:53,119 and I could be saying he's going 879 00:29:50,920 --> 00:29:55,240 bananas. That's 880 00:29:53,119 --> 00:29:57,039 something else, right? So now these are two 881 00:29:55,240 --> 00:29:59,079 different contexts in my understanding, 882 00:29:57,039 --> 00:30:01,000 and my same model needs to be able to 883 00:29:59,079 --> 00:30:02,720 tell me that banana is the right word in 884 00:30:01,000 --> 00:30:04,400 this context but the wrong word in that 885 00:30:02,720 --> 00:30:06,600 context, or 886 00:30:04,400 --> 00:30:08,440 correct in both contexts. Yeah, very 887 00:30:06,599 --> 00:30:10,359 good question. So let's actually spend a 888 00:30:08,440 --> 00:30:13,360 minute on that. I'm going 889 00:30:10,359 --> 00:30:15,439 to swap to my iPad. 890 00:30:13,359 --> 00:30:18,000 So let's assume that this is our 891 00:30:15,440 --> 00:30:20,160 co-occurrence matrix. 892 00:30:18,000 --> 00:30:23,160 Right? And then we have words going from 893 00:30:20,160 --> 00:30:24,600 A all the way to, let's say, zebra, right? 894 00:30:23,160 --> 00:30:25,800 These are all the words in our 895 00:30:24,599 --> 00:30:29,439 vocabulary, 896 00:30:25,799 --> 00:30:32,680 and we have A through zebra here. 
897 00:30:29,440 --> 00:30:34,480 And now what we have is, 898 00:30:32,680 --> 00:30:36,519 we have 899 00:30:34,480 --> 00:30:39,079 apple 900 00:30:36,519 --> 00:30:39,079 and banana. 901 00:30:39,559 --> 00:30:42,279 Right? 902 00:30:40,279 --> 00:30:44,079 So basically what's going on at this 903 00:30:42,279 --> 00:30:48,240 point is that 904 00:30:44,079 --> 00:30:50,559 every number here measures, 905 00:30:48,240 --> 00:30:51,960 for every word here, how many times that 906 00:30:50,559 --> 00:30:53,559 word and apple show up in the same 907 00:30:51,960 --> 00:30:56,400 sentence, okay? 908 00:30:53,559 --> 00:30:57,960 It is not measuring, to your point, 909 00:30:56,400 --> 00:30:59,880 how many times apple and banana are 910 00:30:57,960 --> 00:31:01,240 showing up together. It's measuring how 911 00:30:59,880 --> 00:31:03,680 many times apple is showing up with each 912 00:31:01,240 --> 00:31:06,480 other word, right? Now, if apple and 913 00:31:03,680 --> 00:31:09,799 banana are sort of interchangeable, 914 00:31:06,480 --> 00:31:11,880 what do we expect these 915 00:31:09,799 --> 00:31:13,319 two rows of numbers to look like? Let's 916 00:31:11,880 --> 00:31:14,560 assume that apple and banana are perfect 917 00:31:13,319 --> 00:31:15,799 synonyms. 918 00:31:14,559 --> 00:31:17,240 Just for argument, okay? Let's say they're 919 00:31:15,799 --> 00:31:19,839 perfect synonyms. 920 00:31:17,240 --> 00:31:21,359 What do we expect these two 921 00:31:19,839 --> 00:31:23,839 rows of numbers 922 00:31:21,359 --> 00:31:25,599 to look like? 923 00:31:23,839 --> 00:31:27,720 Very similar. 924 00:31:25,599 --> 00:31:30,240 So if two words are related, their 925 00:31:27,720 --> 00:31:31,120 row vectors in the 926 00:31:30,240 --> 00:31:32,599 co-occurrence matrix are going to be 927 00:31:31,119 --> 00:31:34,479 very very similar. 
928 00:31:32,599 --> 00:31:36,079 So that is how the context comes into 929 00:31:34,480 --> 00:31:37,960 the co-occurrence matrix. 930 00:31:36,079 --> 00:31:40,559 So what we want to find out is this: 931 00:31:37,960 --> 00:31:42,840 if embeddings can recreate the same 932 00:31:40,559 --> 00:31:45,000 pattern of numbers in these two 933 00:31:42,839 --> 00:31:47,919 rows, they're actually 934 00:31:45,000 --> 00:31:49,880 capturing the underlying context. 935 00:31:47,920 --> 00:31:51,560 So words which are similar will sort of 936 00:31:49,880 --> 00:31:53,280 zig and zag together the same way 937 00:31:51,559 --> 00:31:56,039 through the co-occurrence matrix. 938 00:31:53,279 --> 00:31:56,039 And that's where it comes in. 939 00:31:57,440 --> 00:32:00,440 Yeah. 940 00:31:58,440 --> 00:32:01,960 What's up with the diagonal of the 941 00:32:00,440 --> 00:32:05,240 co-occurrence matrix, where you have 942 00:32:01,960 --> 00:32:07,200 apple showing up twice? Oh, I see. So 943 00:32:05,240 --> 00:32:08,799 yeah, here you can just ignore the 944 00:32:07,200 --> 00:32:10,480 diagonal, typically, 945 00:32:08,799 --> 00:32:13,519 because all the action is in the 946 00:32:10,480 --> 00:32:13,519 off-diagonal entries. 947 00:32:15,319 --> 00:32:20,319 So that's basically the idea: 948 00:32:18,720 --> 00:32:22,519 words which are very similar will 949 00:32:20,319 --> 00:32:24,039 have a very similar pattern of numbers, 950 00:32:22,519 --> 00:32:25,720 and then any 951 00:32:24,039 --> 00:32:27,759 embeddings that can actually recreate 952 00:32:25,720 --> 00:32:28,920 the same pattern of numbers are capturing 953 00:32:27,759 --> 00:32:29,720 the underlying reality of what's going 954 00:32:28,920 --> 00:32:32,240 on. 
955 00:32:29,720 --> 00:32:34,799 If words are kind of unrelated, those 956 00:32:32,240 --> 00:32:38,000 two vectors, let's say that 957 00:32:34,799 --> 00:32:38,000 the word you have is, 958 00:32:40,400 --> 00:32:45,640 well, of course 959 00:32:42,880 --> 00:32:48,080 you know what I'm going to say, tensor. 960 00:32:45,640 --> 00:32:49,440 Right? These two vectors 961 00:32:48,079 --> 00:32:50,799 won't have any connection 962 00:32:49,440 --> 00:32:51,920 to each other. 963 00:32:50,799 --> 00:32:53,119 Which means if you look at something 964 00:32:51,920 --> 00:32:54,679 like the correlation of those two 965 00:32:53,119 --> 00:32:55,919 vectors, it's going to be around 966 00:32:54,679 --> 00:32:56,600 zero. 967 00:32:55,920 --> 00:32:57,960 Right? 968 00:32:56,599 --> 00:32:59,719 Words which are 969 00:32:57,960 --> 00:33:01,559 interchangeable will have a 970 00:32:59,720 --> 00:33:03,720 very high correlation. 971 00:33:01,559 --> 00:33:05,519 Words which are antonyms and never show 972 00:33:03,720 --> 00:33:07,240 up in the same place together may have a 973 00:33:05,519 --> 00:33:09,079 highly negative correlation, close to 974 00:33:07,240 --> 00:33:10,640 minus one, for instance. So that's sort 975 00:33:09,079 --> 00:33:11,919 of the intuition behind what's going on 976 00:33:10,640 --> 00:33:12,920 in these two row vectors. 977 00:33:11,920 --> 00:33:14,560 978 00:33:12,920 --> 00:33:16,120 And so the point is, given that this 979 00:33:14,559 --> 00:33:19,879 co-occurrence matrix is capturing all 980 00:33:16,119 --> 00:33:22,039 this word-word correlational structure, 981 00:33:19,880 --> 00:33:25,200 any embedding that can recreate it must 982 00:33:22,039 --> 00:33:26,879 have captured the structure as well. 
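That correlation intuition is easy to check numerically. The counts below are invented for illustration: apple and banana are given nearly proportional rows, and tensor gets a very different pattern:

```python
import numpy as np

# Invented co-occurrence counts against six shared context words:
apple  = np.array([40.0, 35.0,  2.0, 50.0,  1.0, 20.0])
banana = np.array([38.0, 33.0,  3.0, 48.0,  2.0, 19.0])  # near-synonym: zigs and zags with apple
tensor = np.array([ 0.0,  1.0, 60.0,  0.0, 55.0,  0.0])  # unrelated: a very different pattern

print(np.corrcoef(apple, banana)[0, 1])  # very close to +1
print(np.corrcoef(apple, tensor)[0, 1])  # negative here: high where apple is low
```

Rows that "zig and zag together" have correlation near one; rows with opposite patterns come out negative, just as described.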
983 00:33:25,200 --> 00:33:28,759 Because you can't recreate something 984 00:33:26,880 --> 00:33:30,080 like this with great fidelity unless you 985 00:33:28,759 --> 00:33:31,799 have some notion of what's going on 986 00:33:30,079 --> 00:33:33,599 under the hood. 987 00:33:31,799 --> 00:33:34,519 That's the basic idea. 988 00:33:33,599 --> 00:33:36,599 Yeah. 989 00:33:34,519 --> 00:33:39,160 So just connecting to Sophie's question. 990 00:33:36,599 --> 00:33:40,879 So in that example, then, 991 00:33:39,160 --> 00:33:42,800 banana is a fruit and apple is a fruit 992 00:33:40,880 --> 00:33:44,160 as well. Banana and apple are synonyms, 993 00:33:42,799 --> 00:33:47,039 and you're going mad, you're going 994 00:33:44,160 --> 00:33:48,040 bananas. How does that come together? 995 00:33:47,039 --> 00:33:50,399 Oh, I see. You're going mad, you're 996 00:33:48,039 --> 00:33:52,319 going bananas, yeah. So those will 997 00:33:50,400 --> 00:33:53,720 also have some correlational structure 998 00:33:52,319 --> 00:33:57,000 to them, which the embeddings will 999 00:33:53,720 --> 00:33:59,440 hopefully catch. But with words like banana, 1000 00:33:57,000 --> 00:34:01,160 the thing is, 1001 00:33:59,440 --> 00:34:03,400 it's called polysemy, where 1002 00:34:01,160 --> 00:34:04,880 the word looks the 1003 00:34:03,400 --> 00:34:06,080 same way but means different things. It's like the word bank, 1004 00:34:04,880 --> 00:34:07,520 right? It can mean very different things 1005 00:34:06,079 --> 00:34:09,319 in very different contexts. So the 1006 00:34:07,519 --> 00:34:11,800 embedding is going to be some average 1007 00:34:09,320 --> 00:34:13,280 representation of it, right? But we are 1008 00:34:11,800 --> 00:34:15,000 not happy with that average, and we'll 1009 00:34:13,280 --> 00:34:18,280 get around that average 1010 00:34:15,000 --> 00:34:19,159 next week when we do contextual stuff. 1011 00:34:18,280 --> 00:34:20,320 All right. 
Um, so that's what we have here. So, to go back to this thing, what we can do is... yeah?
I didn't understand how we get the mean squared error in this, because we haven't estimated anything from the data set we got.
We haven't calculated the embeddings; we are trying to calculate them. It's sort of like regression, where you have beta one times X1 plus beta two times X2, that kind of thing. The betas are what the regression produces for us, right? The embeddings are exactly that: they're just coefficients that we're trying to figure out. The data is only the X's, the Xij. And so this is what we're trying to calculate, right? And what you can do is start with some random values for these things, and then keep trying to improve them to minimize the error, starting from those random values.
Are you folks aware of any algorithm which allows us to take a random starting point and then minimize some notion of error?
Well, how do you know it's actually random?
Oh. So that's actually a very deep question, and a tough one, right? Because ultimately the random number is coming from a computer, and we know how the computer runs: it's deterministic at the end of the day. So we actually use something called pseudo-random numbers, and there's a whole specialized field of math which essentially asks, "How can I get numbers that are sufficiently random even though they come from a deterministic, non-random process?" We can talk offline about it, but fundamentally all these systems have random number generators built in. We just cross our fingers, hope for the best, and use them.
So, to come back to this, right?
We can start with random values for these weights, and then we can try to minimize the squared error. Are you folks aware of any algorithm that can help us do that?
Gradient descent.
Yes, gradient descent again comes to the rescue. And since we are cool, we'll do stochastic gradient descent. Okay? So that's it. Gradient descent actually doesn't care what the function is, as long as you can calculate a derivative from it. As long as you can calculate a gradient, you're good. Right? So we can just run gradient descent on this thing. One key point here: gradient descent and stochastic gradient descent work for any model, as long as you can calculate good gradients from it. It doesn't have to be a neural network. Any mathematical function, as long as it's differentiable and gives you a good gradient. Okay? So this is not a neural network per se, but we can still use gradient descent for it.
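As a minimal illustration that gradient descent needs nothing more than a derivative, here is a sketch minimizing a plain one-variable squared error from a random starting point (the target value 3 is arbitrary, not from the lecture):

```python
import random

# Gradient descent on a plain function, no neural network involved:
# minimize f(w) = (w - 3)^2, whose derivative is f'(w) = 2 * (w - 3).
w = random.uniform(-10.0, 10.0)  # random starting point
lr = 0.1                         # learning rate

for _ in range(200):
    grad = 2.0 * (w - 3.0)  # gradient of the squared error
    w -= lr * grad          # step against the gradient

print(w)  # converges to roughly 3.0, the minimizer
```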
So we do that. And when we are done, we will have calculated some nice embeddings. We will also have calculated all these biases, but we don't need the biases anymore; we can just throw them out, because we only care about the embeddings and how they connect to each other. Okay? Yeah?
So when you're doing that regression, are you predicting the co-occurrence matrix?
Exactly.
So let me just show a very quick numerical example here. Let's say this is W1 and this is W2, the two word vectors, and let's assume for a moment that each has two dimensions. Okay? Two dimensions. And we also need to calculate B1 and B2, each of which is just a number, okay? And let's say the pair "deep" and "learning" happens to have occurred 104 times in the co-occurrence matrix.
So all we are doing is to say: log of 104, that is the actual value, minus our prediction, which is B1, which we don't know, plus B2, which we don't know, plus the dot product of the two vectors. If we call the components W11 and W12 for the first vector, and W21 and W22 for the second, the dot product is W11 times W21 plus W12 times W22. Okay? So the prediction is W1 dot W2 plus B1 plus B2, and log of 104 is the actual. So all we do is take that difference, actual minus prediction, and square it. And then we do the same exact thing for every other word pair. Okay? And when we are done with all of that, we take the whole sum and say: gradient descent, minimize. So then it has to find the B's and the W's for every word. So that's actually what's going on.
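Here is a rough numerical sketch of that whole procedure in NumPy, a simplified GloVe-style fit: toy invented counts, a squared error between the log count and W1 dot W2 plus B1 plus B2 for each co-occurring pair, minimized by plain gradient descent. (Real GloVe also weights each pair's error by a function of the count, which is omitted here.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy symmetric co-occurrence counts for a 4-word vocabulary
# (all numbers, including the 104, are invented for illustration).
X = np.array([[  0., 104.,   3.,   2.],
              [104.,   0.,   4.,   1.],
              [  3.,   4.,   0.,  90.],
              [  2.,   1.,  90.,   0.]])

V, d = 4, 2                        # vocabulary size, embedding dimension
W = rng.normal(0.0, 0.1, (V, d))   # random starting embeddings
b = np.zeros(V)                    # one bias per word
lr = 0.02

mask = X > 0                           # only fit pairs that actually co-occur
logX = np.log(np.where(mask, X, 1.0))  # target: log of each count

for _ in range(20000):
    pred = W @ W.T + b[:, None] + b[None, :]   # w_i . w_j + b_i + b_j
    err = np.where(mask, pred - logX, 0.0)     # prediction minus actual
    W -= lr * 2.0 * err @ W                    # gradient step (X is symmetric)
    b -= lr * 2.0 * err.sum(axis=1)            # gradient step on the biases

final = np.abs(np.where(mask, W @ W.T + b[:, None] + b[None, :] - logX, 0.0))
print(final.max())  # worst remaining error over the fitted pairs
```

The remaining error shrinks toward zero, meaning the little model has reproduced the log co-occurrence matrix, which is exactly the training signal the lecture describes.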
Make sense?
All right. So, by the way, here I said let's assume that the embeddings are just vectors of dimension two. Well, that's an arbitrary decision that I made just to show you how it works, because I was doing it by hand. More generally, we get to choose how long these vectors are. Right? And the longer the vector, the more interesting ways it can reproduce the co-occurrence matrix; it has more flexibility. But the longer the vector, what is the risk that you run?
Overfitting.
Because these are all parameters at the end of the day. The more parameters you have, the more risk of overfitting. Okay? So you get to choose how big these things can be. Yes?
Don't you find it surprising that we're able to fit a model where we have a lot more parameters than we have data? Because usually in machine learning you would like to not have a lot of parameters, but here we're going to have, as you said, the number of dimensions times more parameters than we have data points.
Well, in this particular case, as it turns out... let's assume that you only have 10 words, right? And for each word, just to keep the math simple, let's assume you have a two-dimensional vector. So, 10 words times 2, that's 20. Plus you have 10 biases for the words, so that's another 10, that's 30. But 10 times 10, the matrix has 100 entries. So, because the co-occurrence matrix is an order n squared object, you'll have a lot more numbers than parameters. In this particular case, you have more data than parameters. So that particular problem doesn't apply in this case.
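That parameter-versus-data bookkeeping is easy to write down (the 10-word, 2-dimensional numbers are the ones from the example above; the 300-dimensional case is an added hypothetical to show when the balance flips):

```python
# Count parameters (embeddings plus one bias per word) versus data points
# (entries of the V-by-V co-occurrence matrix).
def counts(vocab_size, dim):
    params = vocab_size * dim + vocab_size  # V*d embedding numbers + V biases
    data = vocab_size * vocab_size          # V^2 matrix entries
    return params, data

print(counts(10, 2))    # (30, 100): more data than parameters
print(counts(10, 300))  # (3010, 100): now parameters would dominate
```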
But it does show up in other cases, and there is some very interesting research in neural networks which suggests that oftentimes the traditional assumptions about data and overfitting can be called into question in some situations. Happy to tell you more offline, but if you're curious, just Google something called double descent. But in this case, it's not a problem.
Okay. So what that means is that we can choose how big these things are. So, if you look at one-hot word vectors, where there's a one and everything else is zero depending on the position of the word, these are long vectors, as long as the vocabulary, as we saw earlier. Word embeddings, on the other hand, can be very dense. The numbers that make up these embeddings we're actually going to figure out from the data, so each number can be anything.
The first dimension may stand for some combination of, you know, brightness plus speed plus animal-ness or something. We have no idea what it means. All we know is that it's able to reproduce the co-occurrence matrix really well, so it has probably figured something out. Okay? And we can keep these vectors really short. So word embeddings tend to be very dense, meaning not zeros and ones but arbitrary numbers; they're much lower dimensional; and of course they're learned from data.
Right? So, once you do this, once you actually run GloVe on this data and do gradient descent and so on and so forth, you will come up with embeddings, and then you can actually plot them. Here, they're not literally plotting the first two dimensions.
They're using a particular technique called t-SNE, which is a way to take long vectors and project them into 2D space for visualization purposes. And you can see some very interesting things showing up here. They plotted the embedding for brother, nephew, uncle, sister, niece, aunt, and so on; the embedding for man, the embedding for woman, sir, madam, empress, heir, duke, emperor, king. You get the idea. Right? So clearly there are patterns here, where things which are sort of similar in their nature are all hanging out together in the same part of the space. Which is comforting, which is good to know. Right? Now, as I mentioned earlier, it's not just about the fact that similar things happen to be near each other. The direction also actually matters. And beautiful things happen when you look at directions.
So, for instance, let's say you want to go from man to brother. To go from man to brother, you have to start at man and then travel along this arrow to get to brother. So this arrow has some notion of a person becoming a sibling. Right? So you would hope that if you take that same arrow and start at woman, hopefully the woman will become a sister. And sure enough, that's exactly what happens. So this is called word vector algebra, or embedding algebra. And these relationships are actually showing up in the data. We didn't tell it any of these things. We just literally gave it the co-occurrence matrix and asked it to reproduce it. So I find it pretty shocking that these things are actually true. And it gives us evidence and comfort that whatever has been learned does have some deep connection to the underlying nature of what's going on.
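The arrow arithmetic can be sketched with tiny hand-made vectors (these are not real GloVe values; the two dimensions are contrived so that one axis roughly encodes gender and the other sibling-ness):

```python
import numpy as np

# Hand-made toy vectors, invented for illustration only.
vocab = {
    "man":     np.array([ 1.0, 0.0]),
    "woman":   np.array([-1.0, 0.0]),
    "brother": np.array([ 1.0, 1.0]),
    "sister":  np.array([-1.0, 1.0]),
    "aunt":    np.array([-1.0, 0.8]),
    "uncle":   np.array([ 1.0, 0.8]),
}

def nearest(v, exclude=()):
    """Word whose vector has the highest cosine similarity to v."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vocab if w not in exclude),
               key=lambda w: cos(vocab[w], v))

# Take the man -> brother arrow and apply the same arrow starting at woman.
arrow = vocab["brother"] - vocab["man"]   # "person becomes a sibling"
result = vocab["woman"] + arrow

print(nearest(result, exclude={"man", "woman", "brother"}))  # sister
```

With real pretrained embeddings the same arithmetic, done in the full embedding space, lands near "sister" in just this way.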
It's not some statistically fluky artifact. Yeah?
So, you're saying similarity is captured by context, by adjacency to other words, and not by appearing in the same place, right? Because synonyms won't appear in the same sentence together.
Right. They won't appear in the same sentence, but the pattern of co-occurrence will be the same for them, which is what we've been able to reproduce with these embeddings. So that's the key idea.
Um, so my question is, how are we able to capture all these directions in a 2D plot versus the full multi-dimensional space? Because this relationship is kind of confirmed, that you're moving toward family or blood relationship or something of the sort, but how does it not mess up the other parts of that space?
No, this is just a visualization thing.
So we're basically taking, as you will see, GloVe embeddings, which come in lots of different sizes. This plot, I think, uses the 100-dimensional embedding and just projects it into 2D space using a particular technique, and then looks to see what's going on.
Um, yeah?
If the input data, the co-occurrence matrix, is biased, aren't we amplifying that bias?
Yes, we are. It's a great observation. Any sort of data you scrape from the internet and use for this kind of modeling exercise will be subject to all the biases that produced the data in the first place. And the model will faithfully learn those biases, and if you're not careful, it'll perpetuate them. That's a whole, very important topic that we unfortunately won't cover in this course because of time constraints, but it's something you always have to worry about when you're building these models.
How do you think about the dimensionality of the embeddings themselves, not the 2D representation?
The one that we choose? That's in our hands, so you should think of it as a hyperparameter. Much like the number of hidden units to use in a particular hidden layer, it's a hyperparameter. So, you know, I would again start small, and if it solves the problem that you're trying to solve with these embeddings, great. If not, keep increasing it. And at some point there might be a flattening out, an overfitting sort of dynamic, and then you stop. So just think of it as a hyperparameter. Yeah?
Do you see any benefit in practice to using something like penalized regression to do this, to make the embeddings more sparse, or just to lower their magnitude?
Yes. So, there are lots of techniques for applying regularization in the estimation itself of all these numbers. Happy to give you pointers; I'm just going with the simplest version possible here. Yeah?
Am I understanding why overfitting is a problem in this case? Because we're not doing any out-of-sample prediction. So wouldn't you want the embeddings to be high dimensional, so you can capture more relationships?
Interesting question. So the question is: given that there's no notion of an out-of-sample test set that we're going to evaluate these things on, why do we really care about overfitting? Shouldn't we do the best we can to capture everything in the data? Well, the thing is, even when you're not trying to use it for out-of-sample prediction, you do want to make sure that your model captures only the true patterns and not the noise. In every data set there's always noise, right? And you want it to capture the signal but not the noise, regardless of what you use it for. Because if it captures the noise, then the insights you draw from the word embeddings may be flawed.
That's the reason.
Okay. All right, so let's keep going. So here the algebra is: brother minus man plus woman is sister. That's it, human biology reduced to a single sentence.
All right. So now, the pros and cons of these things. You should use something like a GloVe embedding if you don't have enough data to learn a task-specific embedding for your own vocabulary. As I'll show you in the Colab, you can actually learn these things just for your own data set if you want; you don't have to use the GloVe embeddings. But the reason to use these pretrained embeddings is that if you're working with natural language, you know, the word is the word, right? It means something. And so there's no reason for your model, for your little use case, to somehow learn all the fundamentals of English. The fundamentals of English are the fundamentals of English. May as well learn them once and then piggyback on them.
So that's the whole idea of using pre-trained embeddings. Because these things are all common aspects of language, you may as well learn them using all the data you can throw at the problem, and then fine-tune, tweak, and adapt to your particular use case. Right? This is particularly useful when you don't have a lot of data in your own use case. That's one big advantage. Now, it does have the drawback that the embedding will not be customized to your data. Right? For example, if you're trying to build an application for a medical or legal use, it's going to have a lot of jargon. And this pre-trained embedding, trained on all of Wikipedia, may not capture enough of the jargon and know its meaning really accurately. So what you may still want to do is take this thing, and then adapt and fine-tune it using your jargon-packed, domain-specific data set.
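A minimal sketch of that adaptation step, using made-up stand-ins for pretrained GloVe rows (a real glove.6B file stores one word followed by its floats per line, and "glioblastoma" is just a hypothetical piece of medical jargon): rows for words covered by the pretrained table get copied into your embedding matrix, while missing jargon starts at zero and must be learned from your own data.

```python
import numpy as np

# Made-up 3-dimensional stand-ins for pretrained GloVe rows.
pretrained = {
    "the": np.array([0.1, 0.2, 0.3]),
    "cat": np.array([0.5, 0.1, 0.9]),
    "sat": np.array([0.4, 0.8, 0.2]),
}

vocab = ["", "[UNK]", "the", "cat", "sat", "glioblastoma"]  # your own index
dim = 3

# Copy in rows the pretrained table knows; jargon it is missing stays at
# zero and has to be picked up during fine-tuning on your own corpus.
emb = np.zeros((len(vocab), dim))
for i, word in enumerate(vocab):
    if word in pretrained:
        emb[i] = pretrained[word]

print(emb[vocab.index("the")], emb[vocab.index("glioblastoma")])
```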
Okay, those are some of the things to keep in mind. And of course, we can also learn embeddings from scratch if we want; the Colab demonstrates all these options.

So, when you're working with embeddings in Keras, remember STI: we standardize, tokenize, and index. At that point we go from integers to vectors, and so far we have been mapping integers to one-hot vectors. Here, we're going to use embedding vectors instead, which we will either learn ourselves or reuse from GloVe. So what we do is tell Keras's TextVectorization layer to do only STI, and then we use a new layer, called the embedding layer, to do the encoding. That's how we divide it up.

We'll take a look at this first, before we switch to the Colab. Before, we told Keras that this layer's output mode should be multi-hot or whatever, right? Here, we don't want it to encode anything as multi-hot; we just want it to give us the integers back. So we tell it: give me "int". That's the first change. If you say "int", it will stop with STI and just give you the integers.

Then, because all the incoming sentences are going to have different lengths, we want to normalize them so they are all the same length. The way we do that is to choose a maximum length for the sentences. If a sentence exactly fits that length, perfect: say we want a max length of five, and "cat sat on the mat" is exactly five, so it fits perfectly. But if something is shorter, say "I love you", which is only three tokens, we pad it with something called the pad token. Much like the unk token, the pad token is a special token we use for padding, and you will see that Keras uses zeros for this padding, filling the sequence up all the way to the end. And if you have something much longer than five, you just truncate everything else and keep the first five tokens. That's what we do to get all the sentences to be the same length.

Okay? And once we do that, we go to the embedding layer. The embedding layer is actually very simple. What is an embedding? It's just a vector, and we need a vector for every token; of course, we're going to learn these vectors. So in this case, let's say these are all the tokens in our vocabulary after the STI process, maybe 5,000 tokens. For each token we have an embedding vector, right?
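The pad-or-truncate step can be sketched in a few lines of plain Python (a minimal sketch of the idea only; in the Colab, Keras's TextVectorization layer does this internally, and the function name and example ids here are made up):

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Force a token-id sequence to exactly max_length:
    truncate if too long, pad with pad_id (zero) if too short."""
    if len(token_ids) >= max_length:
        return token_ids[:max_length]
    return token_ids + [pad_id] * (max_length - len(token_ids))

print(pad_or_truncate([23, 9, 5, 2, 7], 5))  # exactly five: untouched -> [23, 9, 5, 2, 7]
print(pad_or_truncate([23, 9, 5], 5))        # three tokens: padded    -> [23, 9, 5, 0, 0]
print(pad_or_truncate(list(range(8)), 5))    # too long: truncated     -> [0, 1, 2, 3, 4]
```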
And we choose what the dimension of that embedding vector is. So we can set it up by saying keras.layers.Embedding, and we tell it max tokens, which is how many rows we have here, that is, the vocabulary size we're working with, and then we tell it how long we want each embedding vector to be. So: the number of rows, and the width of the columns, and that's the embedding layer. We'll use it in a second; I just want to show it to you here because it's slightly clearer.

So when an input sentence arrives, the text vectorization layer runs STI on it and truncates or pads it to max length as needed. Say this phrase comes in: STI gives you the same tokens plus pad, pad, because the max length is five, and then these are the corresponding integers. The embedding layer then just looks up the corresponding vector for each integer. For example, here we need to look up the vectors for 23, 9, 5, 0, and 0. So we just go to the table and look up 23, 9, 5, and 0, and once we have that, boom, this is the resulting output. Whatever input sentence comes in, we now have five embedding vectors that have been looked up from the embedding layer.

And once we do that, this is a table. So "I love you" comes in and becomes this table. As we have seen before, neural networks can only accept vectors as inputs, so we need to turn this into a vector. And as we have done before, we can either concatenate all these vectors into one long vector, or we can average or sum them. The simplest thing is probably just to average them, so that's what we'll do here. That's called the GlobalAveragePooling1D layer, and all it does is take whatever table you give it and average each dimension.
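That lookup-then-average pipeline is easy to mimic in NumPy (a sketch with random stand-in vectors rather than real learned embeddings; the variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 5000, 100
embedding_table = rng.normal(size=(vocab_size, embed_dim))
embedding_table[0] = 0.0                 # row 0 is the pad token

token_ids = np.array([23, 9, 5, 0, 0])   # a padded five-token sentence
looked_up = embedding_table[token_ids]   # embedding lookup -> a (5, 100) table
pooled = looked_up.mean(axis=0)          # GlobalAveragePooling1D: average each dimension

print(looked_up.shape)  # (5, 100)
print(pooled.shape)     # (100,)
```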
The first dimension averaged, the second dimension averaged, and so on. And once that's done, that's the whole pipeline: the phrase comes in, STI gives you the tokens, padding or truncating as needed; we look up the embeddings from the embedding layer; we get this table; we do global average pooling on it; and it's done. The resulting thing is a vector that can then be passed into hidden layers just like we normally do.

I'm going over this a little fast, but make sure you look at it afterwards and understand every step; the Colab will mirror this exactly.

All right, so let's switch to the Colab. Okay. Can folks see this okay? All right, so we'll do the usual.
We'll import all the stuff we need, and then, because I want to plot some of these loss and accuracy curves to see what's going on, I'll just bring in the plotting functions from the previous Colabs. And then, I think I've already downloaded the data set; let me just make sure I have it. It's not there. Okay, we'll do it again. This is the same songs data set that we looked at on Monday. Okay. So, roughly 49,000 examples, as we saw before. We'll one-hot encode them.

All right, so there's a bunch of stuff here that we already covered in class. This URL has all the GloVe vectors available for download. I downloaded it before class because it takes a few minutes, and I've also unzipped it. So let's just look at the first few. All right, these are the first few. We'll create an easier-to-view version of these GloVe vectors.
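Reading a GloVe file into a word-to-vector dictionary is a simple split per line, since each line is a word followed by its vector components. A sketch using a tiny in-memory stand-in (the real file, such as glove.6B.100d.txt, has 400,000 lines of 100 numbers each):

```python
import numpy as np

# Tiny in-memory stand-in for a GloVe file; real vectors are 100-dimensional.
sample = """the 0.1 0.2 0.3
movie -0.4 0.5 0.6"""

glove = {}
for line in sample.splitlines():
    word, *values = line.split()          # first field is the word, rest are numbers
    glove[word] = np.asarray(values, dtype="float32")

print(glove["movie"])   # [-0.4  0.5  0.6]
```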
So, I'm going to use the vectors which are 100 long; GloVe comes in several different sizes. So we have 400,000 word vectors, each of dimension 100, and these have all been computed from Wikipedia using the model we described, trained with gradient descent. Okay? All right, so this is the vector for the word "movie". Now, I don't know what these dimensions mean, but there is something going on; it has figured stuff out. The proof is in the pudding, though. So, all right, now we'll first set up the text vectorization and embedding layers like we saw before. I'm going to use a max length of 300 for the songs, because all the sequences have to be the same length. And you might be wondering: okay, why did you pick 300 and not, say, 400 or 200?

Typically, what you do is look at the length distribution of the songs you have, looking for something like an 80/20 cutoff. In this case, it turns out that 90% of the songs in our data set have 300 words or fewer, so I'm just going to go with 300. That's pretty good. The problem with instead using the length of the longest song is that the longest song might be 3,000 words, and there would be hardly any songs that long; you would just be wasting a lot of capacity. So we're being a little pragmatic here.

Okay. And then, as before, for the vocabulary itself, we tell Keras to use the most frequent 5,000 words when doing STI. So we do that, and we tell it the output mode is int, like we saw before. There we go. Okay, perfect. (Okay, this is a very dangerous thing, where somebody is remotely changing the notebook in another tab somewhere.)
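That 90% cutoff for the max length can be sketched with a percentile over the word counts (the song_lengths data here is made up for illustration; in the Colab you would compute the counts from the actual lyrics):

```python
import numpy as np

# Made-up word counts per song; in the Colab you'd compute these from the lyrics.
rng = np.random.default_rng(1)
song_lengths = rng.integers(50, 600, size=1000)
song_lengths[0] = 3000                    # one extreme outlier

print(song_lengths.max())                 # 3000: sizing to the longest song wastes capacity
max_length = int(np.percentile(song_lengths, 90))
print(max_length)                         # a far smaller cutoff that still covers 90% of songs
```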
Fingers crossed. Okay.

Okay. So, we have this, and this is what we did with all this stuff, as I've covered. So now we will adapt this layer, as we have seen before, using all the lyrics we have. And once we do that, we'll take a look at the first few entries.

And here's a very important thing. Before, on Monday, when we asked it to do multi-hot encoding and so on, unk was at position zero. But here, unk actually has index one. The reason is that the zeroth position is going to be used for what you can think of as the empty string; that's how Keras will print the pad token. So the zeroth position is the pad token, and the first position is the unk token. Okay? That's an important thing here.

So, let's say that we vectorize "HODL you're the best." Do you think HODL is going to be part of those 400,000 word vectors from Wikipedia? Not yet.
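That index layout (0 for pad, 1 for unk, real words from 2 upward) can be mimicked with a toy lookup table before we run the real thing (a sketch with a made-up five-word vocabulary and simple whitespace splitting, not Keras's actual standardization):

```python
vocab = ["", "[UNK]", "you", "are", "the", "best"]   # index 0 = pad, index 1 = unk
word_to_id = {w: i for i, w in enumerate(vocab)}

def vectorize(text, max_length=10):
    """Lowercase, split on whitespace, map unknown words to 1, pad with zeros."""
    ids = [word_to_id.get(word, 1) for word in text.lower().split()]
    return (ids + [0] * max_length)[:max_length]

print(vectorize("HODL you are the best"))
# [1, 2, 3, 4, 5, 0, 0, 0, 0, 0]: HODL maps to unk (1), zeros pad out the rest
```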
So, all right, let's try that.

Okay, and as you can tell, HODL is an unknown word; that's why it's showing up here as a one. The index value one is unknown, and zero is pad. So this is unk, for HODL, then the tokens for the rest of the phrase, and then everything from that point on is a zero, because we are padding all the way out to 300. That's why you see all these zeros here.

All right. Now let's just run everything through the vectorization layer, and then we'll get to the embedding layer.

Okay. Now, there's just a bit of Python housekeeping to create a nice, easy-to-look-at matrix. What we're going to do is create a matrix which holds the GloVe embeddings for our words. So here, this is the embedding matrix.
And this matrix has only 5,000 words, each 100 long. Why does this embedding matrix have only 5,000 rows even though we downloaded 400,000 vectors?

Right. So clearly the 5,000 we used earlier has some bearing on this, but what is that 5,000? We told Keras to take the most frequent 5,000 words in our corpus, so we only have 5,000 words in the vocabulary. That's why there are 5,000 rows: we grab just the GloVe vectors for those 5,000 words that Keras has chosen to be in the vocabulary. Okay? And that's our embedding matrix.

And then, if you look at the first few rows, the first two rows should be all zeros, because they are pad and unk, which GloVe obviously doesn't know about. So you can see all those zeros here, and then from the third row on you start getting real numbers. Okay? All right. Next, we'll set up the embedding layer.
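Before that, the matrix construction just described can be sketched in NumPy: copy the GloVe vector for each vocabulary word into the row given by that word's index, and let pad and unk stay all zero since GloVe has no vector for them (toy dimensions and stand-in vectors here, not the Colab's actual code):

```python
import numpy as np

embed_dim = 4                             # 100 in the Colab; 4 here for readability
vocab = ["", "[UNK]", "the", "love"]      # toy version of the Keras-chosen vocabulary
glove = {"the": np.ones(embed_dim),       # stand-ins for real GloVe vectors
         "love": np.full(embed_dim, 2.0)}

embedding_matrix = np.zeros((len(vocab), embed_dim))
for idx, word in enumerate(vocab):
    if word in glove:                     # pad and unk miss the lookup, so stay zero
        embedding_matrix[idx] = glove[word]

print(embedding_matrix[0])  # [0. 0. 0. 0.]  (pad row)
print(embedding_matrix[2])  # [1. 1. 1. 1.]  (row for "the")
```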
So basically, what's going on here is that we tell the embedding layer how many rows it has, which is just the vocab size, max tokens, and what the embedding dimension is; that's going to be 100, because the GloVe vectors are 100-dimensional. And then, here's the thing: you can tell this embedding layer, just use this matrix I'm giving you as the embeddings, because we already know what the embeddings are; we downloaded them from GloVe. So we initialize the layer with that embedding matrix as its weights. And then we tell it: don't train. When we do backpropagation later on, don't change any of these weights, because somebody, Stanford, spent a lot of money creating these weights for us. We don't want to change them further; just freeze them and use them as they are. Okay? (And this mask_zero business I'll come back to later; don't worry about it for the moment.)

All right. So once we do that, we are ready to set up our model. This model is pretty simple. A Keras Input whose length is, of course, the length of the sequence, which is 300, and then the input runs through the embedding layer, and out comes a 300-by-100 table. Then we global average pool it, and that becomes a 100-element vector, and then we are back on familiar ground: we run it through a dense layer with eight ReLU neurons, and then through the final output layer, which is a three-way softmax as before: hip hop, rock, pop. And then we tell Keras that's our model, and we summarize it.

Okay, so this is what we have. And you can see here that the total parameters are 500,835, but the trainable parameters are only 835. That's because the total parameters include all the GloVe embeddings plus the things we added on top of them, like the hidden layer and so on.
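These parameter counts can be verified by hand with the numbers from the lecture: a 5,000-word vocabulary with frozen 100-dimensional embeddings, a dense layer of 8 ReLU units, and a 3-way softmax:

```python
vocab_size, embed_dim = 5000, 100
hidden_units, num_classes = 8, 3

frozen = vocab_size * embed_dim                    # GloVe table: 500,000 frozen weights
dense1 = embed_dim * hidden_units + hidden_units   # 100 -> 8 ReLU: weights + biases = 808
dense2 = hidden_units * num_classes + num_classes  # 8 -> 3 softmax: weights + biases = 27

print(dense1 + dense2)            # 835 trainable parameters
print(frozen + dense1 + dense2)   # 500835 total parameters
```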
But we have told Keras to freeze the GloVe embeddings, do not train them, which means only the rest of the network is trainable. That's the 835. Yeah?

>> So, when we do the global average pooling, don't we lose the sense of meaning that we gain from the embedding, since we average very different embeddings together?

Sorry, say that again; I missed the first part.

>> If we average the embeddings of "apple" and "learning", for instance: they are very different words used with different meanings, so they have different embeddings, but we average them, so don't we lose that?

We will lose a bunch of stuff, yeah. Any time you average anything, you're going to lose some nuance. So the real question is: despite that averaging, is it good enough for you? And sometimes it's good enough; very often it's good enough, as it turns out. But as you will see when we get to contextual embeddings, there's just a better way to do it. It requires bigger models and more powerful machinery, though, and that's where you go from the foundations to the advanced stuff. Yeah?

>> When we're doing optimization, it's often better to optimize everything together than to optimize one part of the system and then the other. So in that case, why wouldn't we also want to change the embeddings? I understand why we'd want to keep the weights that people spent a lot of money finding, but wouldn't we find embeddings more specific to our problem if we let everything be trainable?

Yeah, absolutely. Absolutely. And in fact, you will see in the Colab that we do exactly that next. I just want to show people that you don't have to. You start by not training the embeddings, because it's much faster, and then you train everything and see if it gets better. Sometimes it will, in which case great; sometimes it won't. And I will also show you, though I'll probably run out of time, so I'll do it on Monday, what to do if you want to learn your own embeddings from scratch without using GloVe. So all the possibilities will be covered.

Yeah. So, to come back to this: this is the model we have. All right. Now let's take a look at the first few embedding vectors. By the way, model.layers gives you a list of all the layers, and you can just grab any layer you want and look at its weights. It's very handy.
1978 01:06:14,079 --> 01:06:16,559 So, we're looking at the weights, and 1979 01:06:15,159 --> 01:06:19,399 you can see here 1980 01:06:16,559 --> 01:06:21,519 the first two vectors are all zeros 1981 01:06:19,400 --> 01:06:22,920 because that stands for unk and pad, and 1982 01:06:21,519 --> 01:06:24,880 then we have everything else. So, 1983 01:06:22,920 --> 01:06:26,400 everything looks fine so far. And now, 1984 01:06:24,880 --> 01:06:28,800 we just, you know, compile and fit it. 1985 01:06:26,400 --> 01:06:30,039 So, as usual, Adam, cross entropy, 1986 01:06:28,800 --> 01:06:33,080 accuracy. 1987 01:06:30,039 --> 01:06:34,880 Um and then, we'll just fit the model. 1988 01:06:33,079 --> 01:06:36,000 All right. 1989 01:06:34,880 --> 01:06:38,599 It's going to take 1990 01:06:36,000 --> 01:06:38,599 a few minutes. 1991 01:06:39,000 --> 01:06:43,519 And while it's running, so what what you 1992 01:06:41,358 --> 01:06:44,960 will see in this collab is that 1993 01:06:43,519 --> 01:06:46,440 uh in this particular case, the 1994 01:06:44,960 --> 01:06:47,519 embeddings actually don't help a whole 1995 01:06:46,440 --> 01:06:50,440 lot. 1996 01:06:47,519 --> 01:06:50,440 Why do you think that is? 1997 01:06:51,920 --> 01:06:54,639 What if it could be because we're 1998 01:06:52,920 --> 01:06:57,079 averaging a lot of stuff? Maybe that's 1999 01:06:54,639 --> 01:06:58,400 hurting us. 2000 01:06:57,079 --> 01:06:59,840 Yeah. 2001 01:06:58,400 --> 01:07:01,960 Um I mean, I think that the embeddings 2002 01:06:59,840 --> 01:07:03,559 were pre-trained on some corpus, right? 2003 01:07:01,960 --> 01:07:05,358 Like Wikipedia or something like that 2004 01:07:03,559 --> 01:07:06,599 that is different from the a little bit 2005 01:07:05,358 --> 01:07:08,599 different from the language we tend to 2006 01:07:06,599 --> 01:07:09,599 use in song lyrics. 
Student: So maybe its ability to extract the meaning of a word like "candy" from a song lyric is limited, because it's thinking of all the other ways that word could be used.

Right. So there could be a mismatch between the corpus the pre-trained embeddings were trained on and the corpus you're working with right now. That's one big reason. The other reason is that we have 50,000 examples, which is a lot of data. When you have a lot of data, you may not need any of these things. Pre-trained embeddings tend to do really well when you don't have a lot of data, because you get to piggyback on what they have learned from all of Wikipedia. So the rule of thumb is: when your data is really small, try to use a pre-trained model. And that's what you saw with the handbags-and-shoes classifier, right?
We had 100 examples of handbags and shoes, and we used ResNet to get to basically 100% accuracy. The same logic applies here. All right, let's see what's happening. Okay, it's done, so we'll plot. Look at that, a very well-behaved loss curve. There doesn't seem to be any massive overfitting going on; training and validation are moving really nicely in lockstep. Let's see what the accuracy is. Okay, 63%, which is not great. It's not as good as what we saw before, when we used all 50,000 examples and just trained something from scratch, and that's because when you have lots of examples, these pre-trained embeddings aren't as helpful as they could be. But if you have a small data set, they can be very helpful. And now we come to what he pointed out earlier.
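The averaging hypothesis a student raised a moment ago can be made concrete. A minimal numpy sketch (random vectors standing in for real learned embeddings, sizes invented) of what global average pooling over word embeddings does to a document:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for learned 100-dim word embeddings (5,000-word vocab).
embeddings = rng.normal(size=(5000, 100)).astype("float32")

def average_embedding(token_ids, embeddings):
    """Mean of the word vectors: this is what global average pooling does.

    Word order is lost, and averaging many vectors pulls the result
    toward zero, which can wash out distinctive words in a long lyric.
    """
    return embeddings[token_ids].mean(axis=0)

short_doc = average_embedding([10, 42], embeddings)          # 2 words
long_doc = average_embedding(list(range(500)), embeddings)   # 500 words
print(short_doc.shape, long_doc.shape)   # both (100,)
print(np.linalg.norm(long_doc), np.linalg.norm(short_doc))
```

Both documents collapse to a single 100-dim point, and the 500-word average has a much smaller norm than the 2-word one: exactly the "averaging a lot of stuff" effect.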
Why can't we just optimize these embeddings too? Why do we have to treat them as sacred? Let's just unleash backprop on them and see what happens. So we'll do that. Here, what we do is retrain the model, but we set trainable equal to true for the embedding layer. This is the key step: trainable equals true. Otherwise, it's unchanged. We'll run it and see what happens. Before, it was around 63% accuracy; we'll see if it gets better when you train the whole thing. And the thing is, you can never be sure, because it may start to overfit, which is why you just have to see empirically what's going on. There are no guarantees. Any questions while it's training? Yeah.
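What that trainable flag actually changes can be sketched without any framework at all: a frozen parameter group is simply skipped by the gradient update. Everything here (the parameter names, shapes, and numbers) is invented for illustration.

```python
import numpy as np

def sgd_step(weights, grads, trainable, lr=0.1):
    """Apply one gradient-descent step, skipping frozen parameter groups."""
    return {
        name: w - lr * grads[name] if trainable[name] else w
        for name, w in weights.items()
    }

weights = {"embedding": np.ones((3, 2)), "dense": np.ones((2, 1))}
grads = {"embedding": np.full((3, 2), 0.5), "dense": np.full((2, 1), 0.5)}

# Frozen embeddings (trainable=False): only the dense layer moves.
frozen = sgd_step(weights, grads, {"embedding": False, "dense": True})
print(frozen["embedding"][0, 0], frozen["dense"][0, 0])   # 1.0 0.95

# Unfrozen (trainable=True): backprop now also updates the embeddings.
tuned = sgd_step(weights, grads, {"embedding": True, "dense": True})
print(tuned["embedding"][0, 0])                           # 0.95
```

With the flag off, the GloVe rows stay exactly as loaded; with it on, they drift toward whatever helps the task.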
Student: In that first graph, the training accuracy was still increasing. Might that suggest you could train it even more?

Correct, exactly. In that curve, we saw the training accuracy continuing to increase. Typically, training accuracy will keep getting better the longer you train. The key question is whether the validation accuracy is also improving. If the validation continues to improve, there is a little more gas left in the tank and you can keep training. If it starts to flatten, or even worse, starts to go down, then you want to pull back.

Student: You capped the vocabulary at the most common 5,000 words, and the width of that was 100. What is the 100?

The 100 is just the length of the GloVe vector.

Student: Does that mean it can only capture how that word is related to 100 other words?

No, no.
We're basically saying that every word's intrinsic meaning can be captured by a vector of 100 dimensions. Those dimensions mean something; we just don't know what. The first dimension could mean color, the second some sort of location, the third some notion of time of year. We have no idea.

Student: And the pre-trained model already has those. We're not going to learn them; we don't know what they are, but it has them.

The people who created it don't know what they are either. All they know is that for each word they learned a 100-long vector, and that 100-long vector was able to roughly recreate the co-occurrence matrix. Then they probed it using that visualization of man, woman, sister, brother, all that stuff, and it seems to fit what you would expect.

Student: Can you think of it as analogous to the convolutional networks, where you have the number of kernels?
Student: So in this case, if you have 32 kernels, it's sort of like 32 things it can learn.

I think that's actually a great analogy; I love it. That's a great way to think about it. Much like we got to decide how many filters to have, here we get to decide how long the embedding dimension needs to be, and our hope is that the more dimensions we accommodate, the more complicated things it will pick up. At the same time, you don't want too many of them, because it will start picking up noise, and that's never a good thing. Another question on this side? Yeah, go ahead.

Student: Why do we use embeddings and not the rows of the actual co-occurrence matrix to represent words? Why do we need the abstraction?

That's actually a good question.
One immediate reason is that that row is 500,000 entries long, so you want a compact, dense representation of a word. The second is that the row is subject to all the raw counts of the Wikipedia corpus; it's not normalized. You would need to normalize it so that if you take any two rows and compute a dot product, you get a number in a narrow range; otherwise things aren't comparable. Now, both of these objections can be handled: you can normalize, you can reduce the size of the corpus, and so on. In fact, that used to be a very common way of doing it. But what people have discovered is that the way we learn embeddings now tends to be much more effective in practice.

Student: So what this process does is create an n-dimensional, incomprehensible matrix that captures, in essence, a summarized version of these relationships.

Correct.
A compact representation of relationships that is not tied to the size of your vocabulary. You have 500,000 words today; tomorrow somebody comes up with a word like "selfie" that didn't exist five years ago, and your corpus has grown a little. The embedding, by contrast, is very compact and tends to have a much longer shelf life. All right, so let's see where we are. Okay, evaluate: almost 69%. It went from 63 to 69, so clearly, training the whole thing, including the GloVe embeddings, actually helps. And that raises the question: if training GloVe helps, maybe we should train the whole thing from scratch. Like, why the heck not? So what we'll do is create our own embeddings and just train them. And here we don't have to worry about co-occurrence matrices and so on, because we have a very specific objective.
We want to be very accurate in predicting the genre of these songs. The people who worked on GloVe didn't have any such objective; they just wanted to create embeddings that were generally useful. Here, we want to be specifically useful for genre prediction. And so we can train the whole thing ourselves. We just put an embedding layer in; we arbitrarily chose 64 as the dimension instead of 100, so it will run faster. And then it's the same thing: global average pooling, activation, and so on. And then you run it. We'll see if it finishes in the next minute, and whether it actually does better than the pre-trained embeddings, or the pre-trained embeddings that were further fine-tuned. I don't remember what I saw when I ran it yesterday. While it's running, other questions?
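The from-scratch architecture just described, an embedding layer feeding global average pooling feeding a dense classifier, can be sketched as a single numpy forward pass. The sizes, the random initialization, and the choice of 10 genres are all illustrative assumptions, and real training would of course fit these weights with backprop.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim, n_genres = 5000, 64, 10   # sizes assumed for illustration

# Randomly initialized, to be learned end-to-end for genre prediction
# (no GloVe and no co-occurrence matrix involved).
embedding = rng.normal(0, 0.05, size=(vocab_size, embed_dim))
dense_w = rng.normal(0, 0.05, size=(embed_dim, n_genres))
dense_b = np.zeros(n_genres)

def forward(token_ids):
    """Embedding lookup -> global average pooling -> softmax over genres."""
    pooled = embedding[token_ids].mean(axis=0)   # (64,)
    logits = pooled @ dense_w + dense_b          # (10,)
    exp = np.exp(logits - logits.max())          # stable softmax
    return exp / exp.sum()                       # class probabilities

probs = forward([7, 123, 4999])
print(probs.shape)                    # (10,)
print(round(float(probs.sum()), 6))   # 1.0
```

Because the embedding rows are now just another weight matrix of the model, the training objective shapes them for genre prediction specifically.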
Student: My question is about embeddings. When we create an embedding for a particular word, we specify a certain number of parameters; say we defined 100, so there will be 100 weights for each word. When we take a pre-trained model like GloVe, each word already comes with that number of parameters. So how do we redefine them if we want only 100, or only 10?

The GloVe vectors actually come pre-packaged at 100 long. I think they have 200 and 300 as well, if I recall; we just happened to use the one with 100.

Student: The one that's available in Google?

Yeah, and there are many available. We just get to pick and choose, and I happened to pick 100. Oh, okay, it's a bit slow, but it's actually looking promising.
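Coming back to the earlier question of why we don't just use raw co-occurrence rows: a toy sketch of the normalization point. The counts below are invented, and a real row would be on the order of 500,000 entries long; the issue is that raw dot products between count rows are dominated by word frequency, while the normalized dot product (cosine) always lands in a fixed range.

```python
import numpy as np

# Toy co-occurrence rows (raw counts over 3 context words).
row_the = np.array([9000.0, 8000.0, 7000.0])   # very frequent word
row_cat = np.array([12.0, 3.0, 1.0])           # rare word
row_dog = np.array([11.0, 4.0, 1.0])           # rare word, similar contexts

def cosine(a, b):
    """Normalized dot product: always lands in [-1, 1]."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Raw dot products: the scale is set by the counts themselves,
# so numbers from different word pairs aren't comparable.
print(row_the @ row_cat)   # 139000.0
print(row_cat @ row_dog)   # 145.0

# Cosine puts every pair on the same scale; close to 1 here because
# cat and dog appear in similar contexts.
print(cosine(row_cat, row_dog))
```

Normalization fixes comparability, but as noted, learned dense embeddings go further: they compress those huge rows into a short vector that still reproduces the co-occurrence structure.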
It's 9:55, yeah.

Student: During the CNN training in our assignments, changing the filters gave us more depth than improvement in performance. So here, would I be right in concluding that it's actually training the embeddings that gives us the real gain, assuming the epoch and batch settings aren't changed much? If I really want a genuine change in performance, do we go to the level of retraining the embeddings?

Yeah. So what we saw was that using GloVe as-is was okay, and using GloVe and then training it helped a lot. And now we're basically asking: what if we just abandon GloVe and train our own embeddings for our particular problem? See, GloVe is a general-purpose tool. A general-purpose tool is a really good starting point if you don't have a lot of data. But when you have a lot of data, you should always try to do your own thing and see if it's any better. And in this case, well, whoa.
Okay, I think it's... come on, it's 9:55. It should finish any moment now. Right, let's just look at the result. Okay, folks: 74%, 72%. So you can actually train your own embeddings, because we have 50,000 examples, and get an even better result. Thanks a lot. Have a good rest of the week.