So, all right. Transformers, even though they were originally invented for machine translation, going from English to German, German to French, and so on, have turned out to be an incredibly effective deep neural network architecture for a vast array of domains. It has reached the point where, if you're working on a particular problem, you will almost reflexively try a transformer first, because it's probably going to be pretty darn good. Okay? They have just taken over everything.

Obviously they have transformed translation, which was the original target. Google search, really information retrieval. Completely transformed speech recognition, text-to-speech, even computer vision. Even the stuff we learned with convolutional neural networks: there are now transformers for computer vision problems that are actually quite good, which is kind of shocking because they were not even designed for that. Then reinforcement learning. And of course all the crazy stuff going on with generative AI: large language models, multimodal models, everything runs on a transformer.

And then there are numerous special-purpose systems, and I find these to be even more interesting. AlphaFold, the protein-folding AI, runs on a transformer stack. I could just list examples one after the other. It's an incredibly flexible architecture, and I think we are lucky to be alive during a time when such a thing was invented. And I'm not getting paid to tell you any of this. All right, it's just amazing. Okay, so let's get going.
We will use search, or more broadly information retrieval, as a motivating use case. These are all examples where people type natural language queries, or utter them into a phone, and we need to make sense of what they want. And it's not like "write me a limerick about deep learning," where there could be many possible right answers. It's more like, "tell me all the flights leaving Boston for LaGuardia tomorrow morning between 8:00 and 9:00." You had better get that right. Accuracy is a high bar. Or: how many customers abandoned their shopping cart? Find all contracts that are up for renewal next month. Tell me all the customers who ended their call to the call center yesterday not entirely pleased with the transaction. The list goes on and on. In particular, we'll focus on this travel-related example today: find me all flights from Boston to LaGuardia tomorrow morning. That kind of query.

In these sorts of use cases, a very common approach historically has been to take the natural language query and convert it into a structured query. By that I mean we parse the query and extract the key things in it. Once we extract those key things, we reassemble them into a structured query, like a SQL query. SQL is just one example of a possible structured query; there are many, many ways to structure queries, but SQL is familiar to lots of people, so I'm using that. Once you have the SQL query, you're in very comfortable structured land: you just run the query against some database you have, get the results back, format them nicely, and show them to the user. That's the flow.
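To make that last step concrete, here's a minimal Python sketch of what "reassemble the extracted pieces into a structured query" might look like. The slot names and the `flights` table schema are hypothetical, made up purely for illustration:

```python
# Minimal sketch: turning extracted travel-related entities ("slots") into a
# structured query. The slot names and the "flights" schema are hypothetical.
slots = {
    "fromloc.city_name": "BOS",
    "toloc.city_name": "LGA",
    "depart_date.relative": "tomorrow",        # would be resolved to an absolute date
    "depart_time.period_of_day": "morning",
}

sql = (
    "SELECT * FROM flights "
    "WHERE origin = ? AND destination = ? AND depart_date = ? AND depart_period = ?"
)
params = (
    slots["fromloc.city_name"],
    slots["toloc.city_name"],
    slots["depart_date.relative"],
    slots["depart_time.period_of_day"],
)

# Run it against whatever database you have, e.g. cursor.execute(sql, params)
print(sql)
print(params)
```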
So, the question becomes: how do we automatically extract all the travel-related entities from this query? We want to be able to extract BOS, LGA, tomorrow, morning, flights, and so on and so forth. Those are the travel-related entities we want to pull out. That's the problem.

We will use a really cool data set called the airline travel information system (ATIS) data set, and I'll explain it in just a bit. We'll use this as the basis for the example. The way to think about it is that we have a whole bunch of queries in this data set, several thousand of them. And fortunately for us, the researchers who compiled it went through every one of those queries and manually tagged each word with what kind of travel entity it is, or none of them. They call these slots: they take each word in the query and assign it to a particular kind of slot, and I'll explain what a slot means in just a second. That's the basic idea.

So, for example, suppose you have something like "I want to fly from". This is a flight database, so you can assume that everything is related to flying. Each of those five words, "I want to fly from", gets mapped to something called O, which means other. It's the "other" slot; we don't really care about it. And then we come to Boston. Oh, Boston is very special, right? Because it's clearly a departure city.
So we actually tag it: we assign it a label. Think of it as just a classification problem, a multi-class classification problem. We assign it B-fromloc.city_name. That is the label it gets.

Then you go to "at". You don't care about "at"; it's O, other. You come to "7:00 a.m." Okay, that is a departure time. So, depart time, and then another depart time. And here you see there is a B and then there is an I. What we are saying here is that there can be entities that are described using more than one word, like "7:00 a.m.", which is two tokens. For that, we need to be able to figure out that the second token is really part of the first one; together they describe the notion of a departure time. The B means this is the token at which we begin the idea of a departure time, and the I means we are in the middle of that description. B is for beginning, I for intermediate, in the middle. Then "at", we don't care. "11:00": B, arrive time. And so on. "Morning": arrive time period.

So this is an example of how you can take a sentence and manually label every word in it with something that's relevant to your particular problem. And it turns out these researchers classified every word into one of 123 possibilities. Aircraft code, airline code, airline name, airport code, airport name, arrival date, relative date; you get the idea. Whether they want a round trip versus a one-way.
Or dates relative to today: if somebody says "tomorrow morning", it's relative to today, so you need a notion of absolute time and a notion of relative time. These researchers basically thought of every possibility. And so every word in every one of these queries is assigned one of these 123 labels.

Any questions on the setup?

Did they have to contextualize what comes before, say, Boston? If someone says "from Boston", the "from" gives the context for "Boston". Because they did it manually, they could just read it and figure out what it means, that Boston is the departure city and not the arrival city. So do they keep two tags for Boston, something like departure city as well as arrival city?

In that particular phrase, it's clear from the context, to a human reading it, that Boston is a departure city. So it only gets that tag, in that sentence. In some other sentence where people are coming into Boston, it will have a different tag.

What about compound queries? For example, if my query was "give me flights from Boston at 7:00 a.m. and flights from Denver at 11:00 a.m."?

You mean like a compound query? This data set only takes single queries into account, because most people ask things like "give me a flight from here to there" or "what is the cheapest flight from here to there?" We'll see examples of the queries later on.

Okay. All right, so that's the deal.
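To make the tagging scheme concrete, here's a minimal sketch of one ATIS-style query with one tag per word. The query and the exact label strings are illustrative, not copied from the data set:

```python
# A minimal sketch of B/I/O slot tags for one ATIS-style query.
# One tag per word; the full ATIS label inventory has 123 classes.
tokens = ["i", "want", "to", "fly", "from", "boston", "at", "838", "am",
          "and", "arrive", "in", "denver", "at", "1110", "in", "the", "morning"]
tags = ["O", "O", "O", "O", "O",
        "B-fromloc.city_name", "O",
        "B-depart_time.time", "I-depart_time.time",
        "O", "O", "O",
        "B-toloc.city_name", "O",
        "B-arrive_time.time",
        "O", "O",
        "B-arrive_time.period_of_day"]

assert len(tokens) == len(tags)   # one label per word: output length equals input length
for token, tag in zip(tokens, tags):
    print(f"{token:>10}  {tag}")
```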
So, basically, the problem we have here is really a word-to-slot multi-class classification problem. Okay? Because if you look at that input, we want a really good model to take that input and give us this as the output, because this is what a human would have done. That is our problem.

The key thing here is that each of the 18 words in this particular example must be assigned to one of 123 slot types. Each word. It's not like we take the entire query and classify the whole thing into one of 123 possibilities; every word in the query has to be classified. That is the wrinkle.

So now, suppose we could run the query through a deep neural network and generate 18 output nodes. It goes through some unspecified deep neural network, and when it comes out the other end, the output layer has 18 nodes, because that is the dimension of the output we care about: 18 in, 18 out. And then, for each of those 18 nodes, maybe we could attach a 123-way softmax to each of the 18 outputs.

By the way, isn't it cool that we can just casually talk about sticking a 123-way softmax onto each one of the 18 nodes? Folks, wake up. You're not easily impressed; I'm impressed by that.

So here's the key thing: we want to generate an output that has the same length as the input. But the problem is that the inputs could be of different lengths as they come in. They could be short sentences or long sentences; we don't know.
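Here's a minimal PyTorch-flavored sketch of that idea: a shared linear layer plus a 123-way softmax applied at every token position. The vector size and the "unspecified network" feeding it are placeholders, not the actual model:

```python
# Minimal sketch: a 123-way classification head attached to every token's output.
import torch
import torch.nn as nn

NUM_SLOTS = 123   # ATIS slot labels
D_MODEL = 256     # size of each token's vector (a made-up placeholder)

head = nn.Linear(D_MODEL, NUM_SLOTS)

# Pretend some unspecified deep network already produced one vector per word.
token_vectors = torch.randn(18, D_MODEL)     # 18 words in...
logits = head(token_vectors)                 # ...18 rows of 123 scores out
probs = torch.softmax(logits, dim=-1)        # a 123-way softmax per token
predicted_slots = probs.argmax(dim=-1)       # one slot id per word

print(probs.shape)             # torch.Size([18, 123])
print(predicted_slots.shape)   # torch.Size([18])
```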
Yet we need to accommodate this variable-size input, and the key thing is that the output has to have the same cardinality as the input. That's one big requirement.

In addition, we want to take the surrounding context of each word into account. To go back to Ronak's question: when you see the word Boston, you can't conclude whether it's a departure city or an arrival city. You have to look at what else is going on around it. Is there a "from"? Is there a "to"? Things like that, to figure out how to tag it. So clearly the context matters.

And then we clearly have to take the order of the words into account. Going from Boston to LaGuardia is very different from going from LaGuardia to Boston. So clearly the order matters. The context matters, the order matters, and the output has to be the same length as the input.

So, context matters. Just a few fun examples. Remember from last week that the meaning of a word can change dramatically depending on the context. We also saw that the standalone, uncontextual embeddings we looked at last week, like GloVe, don't take context into account, because they give a single, fixed embedding vector to every word. If a word ends up having lots of different meanings, that vector is some mushy average of all those meanings.

Take the word "see": I will see you soon. I will see this project to its end. I see what you mean. Very different meanings of the word "see". This is my favorite: "bank". I went to the bank to apply for a loan. I'm banking on the job. I'm standing on the left bank. And so on.
And then "it". Oh, this is actually a good one. "The animal didn't cross the street because it was too tired." "The animal didn't cross the street because it was too wide." Can you imagine a deep neural network looking at the word "it" and trying to figure out what on earth "it" means? What is it referring to? Tricky, right?

And then take the word "station", and I use the station example here because we're going to use it more for the rest of the lecture. A station could be a radio station, a train station, being stationed somewhere, the International Space Station; the list goes on. So clearly context matters, and clearly order matters. You can come up with your own examples. Let's keep moving.

Okay. So, the Transformer architecture is a very elegant architecture which checks these three boxes beautifully. It takes the context into account, it takes the order into account, and whatever comes out is the same length as whatever went in. And the reason it's called the Transformer is that if 10 things come in, 10 things go out, but the 10 things that go out are a transformed version of the 10 things that came in. That's why it's called the Transformer. If 10 things came in and only one thing went out, well, sure, it's been transformed, but what is it? Some weird thing. But when 10 come in and 10 go out, the 10 is preserved, and each one gets transformed in an interesting way. That's why it's called the Transformer.

It was developed in 2017, and it has had just dramatic impact.
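To see the "10 in, 10 out" behavior concretely, here's a tiny sketch using PyTorch's built-in encoder layer (assuming a reasonably recent PyTorch); the sizes are arbitrary placeholders:

```python
# Minimal sketch: a transformer encoder layer preserves the sequence length.
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=16, nhead=2, batch_first=True)

x = torch.randn(1, 10, 16)   # 10 "things" (token vectors) come in...
y = layer(x)                 # ...and 10 transformed "things" go out
print(x.shape, y.shape)      # torch.Size([1, 10, 16]) torch.Size([1, 10, 16])
```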
By the way, on the effect of the Transformer: Google had spent a lot of research effort on machine translation and obviously on search. When the Transformer was invented, they took a model called BERT, which we will see in detail on Wednesday, introduced BERT into their search, and the results were dramatic. From what I've read, typically when you make an improvement to search, the improvement is very, very marginal, because it's already a very heavily optimized system. But when the Transformer came along, there was actually a significant jump in search quality.

For example, and you can read the blog post that came out when they introduced BERT into search for a bit more detail: if you queried something like "Brazil traveler to USA needs a visa," you would think it should give you information about how to get a visa if you're a Brazilian who wants to come to the US. But it turns out the first result was about how US citizens going to Brazil can get a visa. Clearly it was not taking the order into account. Once they introduced BERT, boom, the first result was the US Embassy in Brazil and a page on how to get a visa. The effect was dramatic.

And so this is a seminal paper, and it's actually worth reading. This picture is an iconic picture at this point in the deep learning community, and we will actually understand this picture by the end of Wednesday.
The funny thing is that when the researchers came up with it, they didn't quite realize, in some sense, what they had stumbled on, because they were really focused on machine translation. It's the rest of the research community that took it, started applying it to everything else, and found it to be really, really effective.

Okay. So we're going to take each of these requirements, figure out how to address them, and thereby build up the architecture. Any questions before I continue? Yeah.

Is there any benefit to discarding some of those unclassified nodes before the output, rather than keeping all of them? Like, you have 18 words as input; could you discard all the ones that don't actually matter and just output, say, eight?

Yeah, I think that's a totally fine way to think about it. Basically, what you're saying is: can we have a two-stage model? The first-stage model is an O versus non-O classifier, and the second-stage model only goes after the non-Os. That's a totally fine way to do it. But as you will see, even if you go with just a simple one-stage model, if you use a Transformer, you get fantastic accuracy. And we'll do the Colab in a bit.

All right. So let's take the first thing: how do you take the context of everything around a word into account?

Let's say this is the sentence we have: "The train slowly left the station." For each of these words, we can calculate a standalone embedding, say something like GloVe. I'm depicting these standalone embeddings with these little thingies here. Please appreciate them, because it took me a while to get them to work in PowerPoint.
These are W1 through W6; these are the vectors standing up. We can easily compute those. Now, what we want to do is focus on the word "station". Since "station" could mean very different things in different contexts, we want to figure out how to take station's embedding and contextualize it using all the other words going on in that sentence. Clearly, here it's a train station, so we need to use the fact that there is a train involved to alter the embedding of the word "station". That's what taking context into account actually means. So: how can we modify station's embedding so that it incorporates all the other words? That's the question.

When you look at it this way, imagine just for a moment that some of the other words in the sentence don't matter; the word "the" probably doesn't matter. But some of the other words, like "train", "slowly", "left", probably do matter. And suppose, just magically, we have been told, for every other word in the sentence, how much weight to give it: these get no weight, those get a lot of weight. Suppose we are told that. Or, to put it another way, and this is the word that's heavily used in the literature, someone tells you how much attention to pay to the other words, whether to pay a lot of attention or very little attention. And this "how much attention to pay" is given in the form of a weight that you can use.
So, if you look at it that way, from this notion of which words should get a lot of weight and which should get very little: in this example, intuitively, which words do you think should get the most weight and which should get the least?

Train. Right. One at a time, please. Okay, others? Slowly. Right, that also seems to have some bearing on it. What about words that we don't think are going to help at all? "The". Exactly. It probably doesn't do much here. In some contexts it might actually make a difference, but in this sentence, maybe not. So, intuitively, we should probably give a lot of weight to "train", maybe a little to "slowly" and "left", and hardly anything to "the".

And this intuition can be written numerically as a bunch of weights that add up to one. Maybe something like this: 30% weight to "train", maybe 8% to "left", maybe 12% to "slowly". And then, as you'll see here, station's own embedding also plays a role, because we want to take its standalone embedding and just move it slightly, change it slightly, which means it has to be the starting point. So it gets a lot of weight; we can't ignore the word itself, in other words. We give it maybe 40% weight. By the way, I just made these numbers up.

Yeah, a quick question.
So, these weights: are they standalone for the context of the entire sentence, or are they tied to "station", the word we started with?

These six numbers are only pertinent to "station". For each other word, we're going to do something similar.

And at this point, does the model understand order? I'm thinking of "left": we gave it a very low weight, but depending on where it appears relative to "station", its weight might need to be higher.

Correct. At this point we are not worrying about order; we are only worrying about context. Later, we'll take order into account.

But how does the model know that "left" here is of lesser importance, say because it's a verb rather than a noun?

It has to figure that out. We are just giving the model a whole bunch of capabilities; how it uses those capabilities is all going to emerge from training.

Okay. So let's say we have weights like this. We'll get to the all-important question of where these numbers come from in just a moment. But suppose you had the numbers: how can we use them to contextualize W6? You have W6, and you want to make it a new W6 which is contextual, which is aware of what else is going on. What can we do? What is the simplest thing you can do?

We can take a weighted average. Exactly. When you have a bunch of things and a bunch of weights, and you have to somehow modify one of those things using those weights, the simplest thing you can do is take a weighted average.
So, that's exactly what we're 685 00:22:33,000 --> 00:22:35,279 going to do. 686 00:22:34,359 --> 00:22:37,119 So, we're going to take all these 687 00:22:35,279 --> 00:22:39,678 weights 688 00:22:37,119 --> 00:22:40,639 and just like move them up. 689 00:22:39,679 --> 00:22:42,720 Okay? 690 00:22:40,640 --> 00:22:44,120 Move them up. 691 00:22:42,720 --> 00:22:46,319 Don't even get me started on how long it 692 00:22:44,119 --> 00:22:47,439 took me to get this arrow to run. 693 00:22:46,319 --> 00:22:49,439 I don't know about you, folks. Is it 694 00:22:47,440 --> 00:22:51,160 It's extremely painful to get the U-turn 695 00:22:49,440 --> 00:22:52,039 arrows to work in PowerPoint. 696 00:22:51,160 --> 00:22:54,960 Okay? 697 00:22:52,039 --> 00:22:57,159 Anyway, uh back to work. So, 698 00:22:54,960 --> 00:23:01,400 so we just move these up here, okay? So, 699 00:22:57,160 --> 00:23:03,679 now we can do 0.05 * this vector + 0.3 * 700 00:23:01,400 --> 00:23:06,679 that vector and so on and so forth. 701 00:23:03,679 --> 00:23:08,640 And the result is just another vector. 702 00:23:06,679 --> 00:23:11,400 Right? 703 00:23:08,640 --> 00:23:13,440 And that vector, folks, 704 00:23:11,400 --> 00:23:15,320 is the contextual embedding vector of 705 00:23:13,440 --> 00:23:17,759 station. 706 00:23:15,319 --> 00:23:19,759 Okay? That was the standalone embedding. 707 00:23:17,759 --> 00:23:21,119 And now we did the We multiplied this by 708 00:23:19,759 --> 00:23:24,759 that that by whoop whoop whoop, add them 709 00:23:21,119 --> 00:23:24,759 all up, and then you get a new vector. 710 00:23:24,799 --> 00:23:29,519 And contextual embeddings have this 711 00:23:27,839 --> 00:23:30,959 bluish kind of color. 712 00:23:29,519 --> 00:23:32,400 Okay? 713 00:23:30,960 --> 00:23:33,559 And I'll maintain that color scheme as 714 00:23:32,400 --> 00:23:36,320 we go along. 715 00:23:33,559 --> 00:23:38,440 So, that's it. 716 00:23:36,319 --> 00:23:41,079 That's it. That's the idea. 717 00:23:38,440 --> 00:23:41,080 Any questions? 718 00:23:41,679 --> 00:23:44,800 Yeah. 719 00:23:43,039 --> 00:23:46,960 How did you come up with the original 720 00:23:44,799 --> 00:23:49,359 weights again? You just kind of guessed? 721 00:23:46,960 --> 00:23:51,559 No, these weights I just I just 722 00:23:49,359 --> 00:23:53,279 hand typed them in manually just to make 723 00:23:51,559 --> 00:23:54,319 the point. And And now I'm going to talk 724 00:23:53,279 --> 00:23:57,039 about how we are actually going to 725 00:23:54,319 --> 00:23:57,039 calculate them. 726 00:23:57,599 --> 00:24:00,959 Okay. 727 00:23:58,640 --> 00:24:03,080 Uh all right, cool. So, now I'm going to 728 00:24:00,960 --> 00:24:05,400 uh okay, enough pictures. Let's switch 729 00:24:03,079 --> 00:24:07,319 to some math. So, 730 00:24:05,400 --> 00:24:08,759 so basically what I'm So, let's write it 731 00:24:07,319 --> 00:24:11,279 a bit more formally. 732 00:24:08,759 --> 00:24:12,920 So, we have these W1 through W6, which 733 00:24:11,279 --> 00:24:14,240 are the standalone embeddings. 734 00:24:12,920 --> 00:24:16,080 And then for station, we want to 735 00:24:14,240 --> 00:24:17,359 calculate, you know, W6 with a little 736 00:24:16,079 --> 00:24:19,599 hat on it, which is the contextual 737 00:24:17,359 --> 00:24:22,359 embedding. And the way we do it is to 738 00:24:19,599 --> 00:24:25,000 say we calculate some weights for each 739 00:24:22,359 --> 00:24:27,159 of these words. 
Any questions? How did I come up with the original weights, did I just guess? No, those weights I just typed in by hand to make the point. Now I'm going to talk about how we are actually going to calculate them.

All right, enough pictures; let's switch to some math and write it a bit more formally. We have W1 through W6, which are the standalone embeddings. For "station", we want to calculate W6 with a little hat on it, the contextual embedding. The way we do it is to calculate a weight for each of the words. This weight s16 means the weight of the first word on the sixth word, which happens to be "station"; s26 is the weight of the second word on the sixth word, and so on and so forth. And what we are saying is that the contextual embedding is just

W6-hat = s16 W1 + s26 W2 + ... + s66 W6.

That's it. I have to inflict all these subscripts on you because we need them. All right, so that's what we have.

Now, any questions on the mechanics of it before I get to where these weights come from? Yeah.

With something like Google, for example, how does it understand the context of new words? Is that picked up immediately through the training data, or, basically, what about a totally new word that didn't exist before?

A new word, or a new context for a word that already exists? The context is supplied because the query coming into something like Google is a full sentence, and we take only that sentence into account as the context. So the context is always present when we get the input. But the other question you had, what if there's a brand new word you've never seen before, for which there isn't even a standalone embedding, what do you do then? Let's punt on that until Wednesday, because I have to talk about something called byte pair encoding before I can answer it.

And really quickly, does that immediately translate to their predictive search queries? Say, for a new word?
Does that automatically get applied to the predictive search queries, like when we type "how to" and it suggests completions?

Oh, you mean the autocomplete? Autocomplete uses a slightly different mechanism. They had a very complicated non-transformer system for a long time. I'm sure they have a transformer version now, but I'm not privy to exactly how they've done it, so I don't quite know. But what you're proposing is a reasonable way to think about it.

Another question: we have six words and some number of weights, and we have calculated the contextual version of W6. Is that stored separately, or does it replace the original?

It replaces it. W6 becomes W6-hat.

And we are expecting this contextual version to be really good?

Right. That's what we want.

Do we lose the original, or retain it?

No, we lose it. And as you will see, as it flows through the transformer, it is getting more and more and more contextualized. It's a left-to-right flow.

All right, great. By the way, this thing that we did for "station", we will do for each word in the sentence, with the same exact logic. Obviously, the weights are going to change, but W1 through W6 will become W1-hat through W6-hat. The same exact logic holds; I just don't have slides for every word because that would be a waste of time.
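As a sketch, doing it for every word at once just means one weighted average per word; with a full matrix of weights (one row per word, each row summing to one) it becomes a single matrix product. The weight values below are random placeholders, not learned or meaningful:

```python
# Minimal sketch: contextualizing every word at once. S[i, j] is the weight of
# word j when contextualizing word i; each row of S sums to 1.
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 8                                # 6 words, embedding size 8 (placeholder)
W = rng.normal(size=(n, d))                # standalone embeddings W1..W6 (stand-ins)

S = rng.random(size=(n, n))
S = S / S.sum(axis=1, keepdims=True)       # normalize each row to sum to 1

W_hat = S @ W                              # row i is the contextual embedding of word i
print(W_hat.shape)                         # (6, 8): W1..W6 have become W1-hat..W6-hat
```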
All right. Now let's switch gears and answer the all-important question: where are the weights going to come from? The intuition here is really, really interesting and elegant.

Clearly, the weight of a word should be proportional to how related it is to the word "station". The word "train" is clearly very related to "station". The word "the" is probably not all that related. So relatedness matters to the weight: the more related, the higher the weight. Just intuitive.

One way to quantify how related two words are is to take their standalone embeddings and calculate the dot product. In case folks have forgotten about the dot product: say this is the vector for "train" and this is the vector for "station". The dot product of these two vectors, which I'll write as <train, station>, equals the length of the vector for "train", times the length of the vector for "station", times the cosine of the angle between them:

<train, station> = |train| x |station| x cos(angle between them)

How long each vector is, the product of the two, and then the angle between them. Now, let's assume for simplicity that these lengths are roughly the same, just one unit each, roughly. If you assume that, those two length terms become one, and all the action is in the cosine. So basically the dot product of these two vectors is really the cosine of the angle between them.
So, now, the question is: if you have two vectors that are very close to each other, what is the cosine of that very small angle? Well, the cosine of zero is one. So, if the angle is really, really small, the cosine is going to be very close to one.

If you have two vectors that are 90 degrees apart, what is the cosine? Zero. They're orthogonal, which matches the everyday sense of "unrelated."

And if you have two vectors that are literally pointing in opposite directions, what is the cosine of 180 degrees? Minus one.

So, that's it. If these two vectors are very close to each other, the cosine of the angle between them is going to be very close to one. If they are really kind of unrelated, it's going to be zero. If they're anti-related, it's going to be minus one. That's how dot products capture this notion of closeness or relatedness. So, we can use the dot product of these embeddings to capture relatedness.
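Here is a tiny sketch of those three cases with made-up 2-dimensional unit vectors:

    import numpy as np

    a = np.array([1.0, 0.0])

    same     = np.array([1.0, 0.0])    # 0 degrees apart
    ortho    = np.array([0.0, 1.0])    # 90 degrees apart
    opposite = np.array([-1.0, 0.0])   # 180 degrees apart

    # For unit vectors, the dot product IS the cosine of the angle between them.
    print(np.dot(a, same), np.dot(a, ortho), np.dot(a, opposite))   # 1.0  0.0  -1.0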
But we can't use these dot products as-is, because we need to do one more thing to make them proper weights. By proper weights I mean that we want the weights to be, first of all, non-negative, and we want them to add up to one; that's what a weighted average actually means. But these cosines could be negative. So, we need to adjust them so that every one of them is guaranteed to be non-negative and they will add up to one.

When was the last time you had to take a bunch of numbers, which could be anything, and somehow make sure they end up non-negative and add up to one? Yeah, softmax. Exactly. So, we'll do the same trick.

What we'll simply do is exponentiate them. This angle-bracket thing, like <W1, W6>, is the dot product; that's the notation I'm using. EXP of that just means e raised to that. Once you exponentiate them, they all become non-negative, and then we just divide each one by the sum of everything. So the whole thing becomes like a probability distribution; it adds up to one. Make sense? That's how we take arbitrary numbers and make them proper weights.

All right. So, to summarize, from embeddings to contextual embeddings, that's what we do: we take all the standalone embeddings, we calculate these weights using this formula, and then we just do the weighted average, and we arrive at the contextual embedding. Boom, done.

And by choosing the weights in this manner, the embedding of a word gets dragged closer to the embeddings of the other words in proportion to how related they are. So, just imagine for a second: station obviously has many contexts, but let's assume for a moment that it has only the train context and the radio-station context. In the current sentence, train is closely related to station, and therefore exerts a strong pull on it.
Now, radio is also related to station, but it doesn't appear in the sentence. So, effectively, it has a weight of zero.

Okay? And that's the beauty of it. And please do not ask me things like, "I was listening to a great song on the radio station and the train pulled out of the station." Transformers can deal with stuff like that. But you get the main idea.

So, by moving station closer to train — by paying more attention to train — we are contextualizing the embedding of station to the context of trains, platforms, departures, tickets, and so on. It's like a portal into the whole train world. It's beautiful. This simple idea will get you there.

So this, folks, is called self-attention. What we just described is called self-attention, and it's the key building block of transformers. To summarize: standalone embeddings come in, contextual embeddings go out.

Any questions? Yeah.

I'm still struggling a little bit with the intuition of the contextual embedding. Like, the weight of station in the station embedding — how should I think about that? It seems intuitive that it would be high for all contextual embeddings, but I assume that's not the case.

It'll typically be a high number, because the cosine of a vector with itself is one, right? So it's going to be pretty high, but there's no guarantee it's going to be the highest, because the lengths don't actually have to be one.
We try to keep them kind of smallish, but they don't have to be. So, the way I would think about it is: imagine that you take an average of everything else first, and then you average that with the old embedding. Effectively, it's the same as calculating the different weights and averaging the whole thing together.

Sure. But then why would the embedding of a word stay where it was? Is that the reason you need a contextual embedding — even when the other word is not related? That's what I'm saying.

Correct. Correct. Exactly. And the other thing to remember is that by keeping the size of the input — the number of things — intact as you move through the transformer stack, when you finally come out the other end, there is essentially no loss of information. At the very end, you can choose to aggregate, simplify, summarize, and so on. It preserves your optionality for as long as possible.

Do you know how long the contextual embedding is? Is that a factor between the two?

Yeah, so what we do is: the sentence comes in, and there's a whole notion of something called a context window — the maximum length of input that will be handled — and that's a parameter you can set. We'll come to that when you actually look at the Colab.

All right. So, that is self-attention.
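Here is a minimal Python sketch of exactly the recipe described so far — all pairwise dot products, softmax to get proper weights, then a weighted average. The embeddings are random made-up numbers, and this is the simplified, parameter-free version from today's discussion, not the full transformer attention:

    import numpy as np

    def simple_self_attention(E):
        """E: (num_words, dim) standalone embeddings, one row per word.
        Returns contextual embeddings of the same shape:
        dot products -> softmax weights -> weighted average.
        Note: no learned parameters in here (that refinement comes later)."""
        scores = E @ E.T                                        # all pairwise dot products
        scores = scores - scores.max(axis=1, keepdims=True)     # for numerical stability
        weights = np.exp(scores)
        weights = weights / weights.sum(axis=1, keepdims=True)  # each row sums to one
        return weights @ E                                      # each W_i becomes W_i hat

    # Six made-up 4-d embeddings standing in for "the train slowly left the station"
    E = np.random.randn(6, 4)
    E_hat = simple_self_attention(E)
    print(E.shape, E_hat.shape)   # same number of words in, same shape out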
And now, because that felt too easy, we're going to do a little tweak called multi-head attention.

So, this is the self-attention we just saw. What we can do is say: why can't we have more than one of these? This is called an attention head — a self-attention head — and we'll have multiple self-attention heads.

Now, I'll come back to the thing at the top in a second, but the question is: why should we have multiple self-attention heads? Because a particular attention head is going to pick up certain patterns, and multiple heads will help us attend to the multiple patterns that may be present in a single sentence. So far, when I've been explaining this, I've basically been looking only at what these words mean. But in any complicated sentence, you have to worry about grammar, you have to worry about tense, you have to worry about tone, you have to worry about facts versus opinions. There could be any number of complicated patterns sitting in a simple sentence. Which means there is not just one way to pay attention — there are many different needs to pay attention.

So, let's have many of these attention heads, and each one can learn something different. It's exactly like having lots of filters in a convolutional network: one filter might learn a line, another might learn a curve, and so on, and we don't decide a priori, "oh, you're going to learn a line." Similarly here, we're not telling any of these heads what to learn. They just have to learn, based on the training process.
So, here's an example from the original transformer paper, where the sentence is "the law will never be perfect, but its application should be just." Which is what we are missing, in my opinion. A complicated sentence, right?

In the first attention head, look at the pattern of what it attends to. For example, the contextual embedding of the word "perfect" draws heavily on the word "law" in this example. If you look at another attention head, the contextual embedding for "perfect" is drawing heavily from just "perfect" and nothing else. And if you look at other words, the patterns of what they pay attention to are subtly different.

So, these are two different attention heads, and they're learning different kinds of attention. In reality, trying to make sense of why they pay attention the way they do is usually quite difficult; you can't really interpret it. But when you have lots of attention heads, the performance on the task you care about gets much better. And then you say, okay, I can use that. Yeah?

I think that's the idea behind this. Is that the idea behind this?

Right. Exactly. Same logic. Same logic. Yeah.

Actually, in the convolutional case, the ones and zeros I had were just example numbers to show that a particular filter could detect a vertical line or a horizontal line. You will recall that when we actually train a convolutional network, we don't specify the numbers.
We start with randomly initialized weights and then let backpropagation figure it out. Similarly here, we don't decide any of these things. We just let backprop figure it out. Now, the question of what weights are actually going to be learned — we'll come to that in a bit. Yeah?

I was wondering how come we have different attention heads, even though it seems like they're only a function of a dot product, and we have the same dot product for the same embeddings.

Great question. Great question. And I literally have a note in my slide saying, "If a student asks this good question, tell them to wait till Wednesday." So, great question, and we'll come back to it on Wednesday and spend a fair amount of time on it. The point being made here is this: when we looked at self-attention, the embeddings came in, we did all these dot products, and the contextual things popped out the other end. Note that inside the self-attention box, there are no parameters. There are no parameters. So the question being raised is: what are we learning, really? If there is nothing inside to be learned — no parameters, no coefficients — what are we learning? And by extension, if we have two of these and neither of them is learning anything, what's the point?

Sadly, you have to wait till Wednesday. But we have a great answer to the question, so it'll be worth it. And if you can't stand the suspense, read the book.

All right. So, that is why we need multiple heads. And now, to come back to this: what we do is, the input goes through this head and you get these W's, right?
And it goes through the other head, and we get another set of W's. Then what we do at the very end is we concatenate them. We concatenate them and we do a projection. And this is what I mean by that.

So, we have one self-attention head, self-attention one, and another, self-attention two. Let's say W1 hat comes out of the first head; for the second head, I'm just going to call its output Z1 so that there's no name clash. W2 through W6 and Z2 through Z6 all come out the same way, but let's focus on W1 and Z1. W1 hat and Z1 are both contextual embeddings for the same word, word one.

What I mean when I say concatenate is that we literally take this embedding and that embedding and make one long vector. But now this vector has become twice as long, right? And remember, we always want to preserve the number of inputs and the lengths of these vectors as we go along. So, what we do at this point is run it through a single dense layer, which takes this long thing and brings it back to the same small shape as before. This vector comes in, and it gets compressed back to the original shape that came out of each head.

So, you could have, say, 20 of these attention heads; the concatenation would be 20 times as long, and then, boom, one dense layer brings it back to the original shape. So, that is the projection step.
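A minimal sketch of that concatenate-and-project step for a single word, assuming two heads; the per-head outputs and the projection matrix are random made-up numbers here, whereas in a real transformer the projection is a learned dense layer:

    import numpy as np

    dim = 4                      # length of each contextual embedding (e.g. 100 in the lecture)
    num_heads = 2

    # Made-up outputs of two attention heads for the same word ("word one")
    w1_hat = np.random.randn(dim)    # from self-attention head 1
    z1     = np.random.randn(dim)    # from self-attention head 2

    concat = np.concatenate([w1_hat, z1])        # now twice as long: shape (8,)

    # The "projection": a single dense (linear) layer mapping 2*dim back to dim.
    # Its weights would be learned by backprop; random numbers here just illustrate shapes.
    W_proj = np.random.randn(dim, num_heads * dim)
    b_proj = np.zeros(dim)

    projected = W_proj @ concat + b_proj         # back to the original shape: (4,)
    print(concat.shape, projected.shape)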
And that's what I mean here when I say concatenate and project.

So, at this point, what we have is: things come in, we contextualize them using these different attention heads, and when they come out of the attention heads, we take them all, concatenate them, and then compress them back to the same original starting shape. If these vectors are 100 dimensions long, whatever comes out is still 100. And preserving this size as we go along is very important, for reasons that will become apparent a bit later.

Okay. So, that is the multi-head attention thing.

Now, a final tweak for today is that we will inject some non-linearity with some dense ReLU layers at the very end. We went through a bunch of attention heads and came up with a bunch of contextual embeddings. But since there are no parameters inside those attention boxes — there are only some parameters in the projection — nothing so far has been non-linear. So, here we actually send it through one or more ReLU layers; typically they use just one. And what I mean by that is: we take what we had, and we run it through a dense layer with ReLUs. The rule of thumb, as you will see, is that if this vector is, say, 100 dimensions long, they will typically choose a ReLU layer that is about 400 units wide, and then it just gets projected back out to 100 again.
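A minimal sketch of that expand-then-project pattern for one embedding; the weights below are random placeholders, whereas in the real model they are learned by backprop:

    import numpy as np

    dim, hidden = 100, 400                 # the 100 -> 400 -> 100 rule of thumb

    # Made-up weights; in a real transformer these are learned.
    W1, b1 = np.random.randn(hidden, dim) * 0.01, np.zeros(hidden)
    W2, b2 = np.random.randn(dim, hidden) * 0.01, np.zeros(dim)

    def feed_forward(x):
        """x: (dim,) one contextual embedding. ReLU layer 4x as wide, then back to dim."""
        h = np.maximum(0.0, W1 @ x + b1)   # dense + ReLU: the injected non-linearity
        return W2 @ h + b2                 # dense projection back to the original size

    x = np.random.randn(dim)
    print(feed_forward(x).shape)           # (100,): same size out as in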
So, this is just a simple thing: the input comes in, goes through a single hidden layer with four times as many units, and then another dense layer brings it back to 100 again. And since there are ReLUs in there, we have injected some non-linearity into the processing.

Now, a lot of this stuff, when it came out, felt very ad hoc. It didn't come from deep theoretical motivations, but people had strong intuitions as to why these things were helpful. And as it turns out, since the transformer came out, people have tried to optimize every aspect of this thing, and it's actually pretty difficult to beat the starting architecture. Improvements have been made, but it's a very robust architecture.

So, that's what's going on here. And when we come out of this thing, here is the story so far. We start with standalone embeddings — these could be GloVe embeddings or random weights, it doesn't matter. They go through a bunch of self-attention heads. We concatenate the outputs when they come out the other end, and then we project them back to the same size as before. Then we run that through a ReLU layer followed by a linear layer, and we get these things again. So, in this whole process, if six things came in, six things will come out. And if those six things that came in were standalone embedding vectors of 100 dimensions, what comes out is also 100 dimensions. So, in that sense, you could think of this whole thing as a black box: whatever you send in, the same number of things come out, and of the same length.
The numbers will be different, of course, because they will have been heavily contextualized. The numbers are much smarter, in other words.

So far, we have satisfied two of the three requirements. We have taken the context of each word into account, by using these dot products in the self-attention layer, and we can generate an output that is the same length as the input. But we have ignored word order completely. Whether I had said "the train slowly left the station" or "the station slowly left the train," this thing won't know the difference. Because dot products operate on sets, not on sequences. You should convince yourself of this: regardless of the order, the dot-product calculation doesn't change anything, because we are computing every pair.

So, the question is: how do we take the order of the words into account? As I was saying, we can scramble the order of the words in a sentence and we'll get the exact same contextual embeddings at the end. By the way, if you're working on a problem in which order doesn't matter, then you can stop right now and use the transformer. And there are many problems in that category. If you take traditional structured, tabular data — blood pressure, cholesterol level, and so on; does it predict heart disease? — there is no order in that. You can use the transformer as-is without doing anything more. So, transformers work for both sets and for sequences where order matters.
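To convince yourself of that order-invariance point, here is a quick sketch that runs the same parameter-free attention recipe on a shuffled copy of the same made-up embeddings:

    import numpy as np

    def simple_self_attention(E):
        scores = E @ E.T
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)
        return weights @ E

    E = np.random.randn(6, 4)                 # six made-up word embeddings
    perm = np.random.permutation(6)           # scramble the word order

    out_original  = simple_self_attention(E)
    out_scrambled = simple_self_attention(E[perm])

    # Each word ends up with exactly the same contextual embedding either way;
    # the output is simply listed in the scrambled order.
    print(np.allclose(out_original[perm], out_scrambled))   # True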
So, the fix for this is something called the positional encoding.

What we do is very simple. Many schemes have been invented to give the transformer some information about the order of the things coming in. I'm going to go with the simplest possible one, which actually works pretty well in practice. For each possible position in the input, from the first position all the way through the last position, we imagine that the position itself is a categorical variable. If a sentence can be at most 30 words long, say, then the position of each word is a number between 0 and 29, and we can just treat that as a categorical variable. And because it's a categorical variable, we can imagine an embedding for each potential value. It'll become clear in just a moment, because I have a numerical example. So, what we do is take the standalone embedding, take this position embedding, which represents the position of the word in the sentence, and just add them up. Yeah?

So, if the initial sentence itself has a mistake — say I just write it as "the train slowly the station" — that means my output is actually going to be wrong?

Yes. Now, transformers, since they're trained on lots of data, will be quite robust to things like that. But strictly, arithmetically speaking, yes.

Okay. So, let's look at an example. Let's assume your standalone embeddings are as follows, right?
This is your vocabulary, okay? Unknown, cat, mat, I, sit, love, the, you, on. That's it. That's our vocabulary. And for this vocabulary, we have these standalone embeddings. Just for argument, let's assume these embeddings are only two long — dimension two. If you recall, the GloVe embeddings we used last week were, what, 100 long? And the ones we're using in the homework are even longer than that. But here we are assuming they're only two long. So, the embedding for cat is 0.5, 7.1.

All right. Now, let's assume that we can have at most 10 words in any sentence that's coming in. Obviously, a particular word could be in position 0 all the way through position 9. And we will learn embeddings for each of these positions, and these embeddings are also two long — dimension two.

Now, where will these embeddings come from? What's the answer to that question? What is the answer to the general question of where these weights come from? We will learn them with backprop. We will start with random numbers initially, and then make them better and better over the course of training.

So, we have these two tables of embeddings: the standalone embedding for the word and the position embedding. And then we literally add them up. For example, let's say the sentence that came in is "cat sat mat." That's the sentence; it's got three words: cat, sat, mat. So, we say, well, the embedding for cat is this thing here, 0.5, 7.1. So, I write it down here: 0.5, 7.1.
Cat happens to be in the zeroth position. So, I grab the embedding for position zero, which is 1.3, 3.9, I stick it there, and then I literally add them up: 0.5 + 1.3 is 1.8, and 7.1 + 3.9 is 11.0. That's it. So, now the positionally encoded embedding for the word cat is 1.8, 11.0 — not 0.5, 7.1.

If cat happens to show up in another part of the sentence — say, instead of "cat sat mat" we had "mat sat cat" — now cat is in the third position, which is index 2 (positions 0, 1, 2). Its word embedding doesn't change; it's still just the embedding for cat. But now, instead of picking position zero's embedding, we pick position two's, which is 0.6, 8.1, and add that instead.

So, this is the idea of the positional encoding. This is how we inject position knowledge into the transformer. Yes?

The positional embedding would be different for each sentence, right? How do you...

No, this is just one table which tells you what the position embedding is. It says: for a word that appears in the seventh position of any input sentence that you're feeding in, this is the embedding that you need to use for that position.

If the word appears twice in the same sentence, how do you handle that?

Great question. Let's say, just for argument, the sentence was "cat cat cat." For each one of those cats, this word embedding will be the same, 0.5, 7.1, because that happens to be just the embedding for cat regardless of position. But then, for the first cat we will add 1.3, 3.9; for the second cat, 6.3, 3.7; and for the third cat, 0.6, 8.1.
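Here is a minimal sketch of that lookup-and-add, using the cat numbers from the slide; the embeddings for sat and mat are made-up placeholders, and in a real model both tables would start random and be learned by backprop:

    import numpy as np

    # Standalone word embeddings (dimension two). "cat" is from the slide;
    # "sat" and "mat" are placeholder values for illustration.
    word_emb = {
        "cat": np.array([0.5, 7.1]),
        "sat": np.array([2.0, 0.3]),   # placeholder
        "mat": np.array([4.4, 1.2]),   # placeholder
    }

    # Position embeddings for positions 0, 1, 2 (also dimension two), from the slide.
    pos_emb = [
        np.array([1.3, 3.9]),   # position 0
        np.array([6.3, 3.7]),   # position 1
        np.array([0.6, 8.1]),   # position 2
    ]

    sentence = ["cat", "sat", "mat"]
    encoded = [word_emb[w] + pos_emb[i] for i, w in enumerate(sentence)]

    print(encoded[0])   # [ 1.8 11. ]  -- cat's embedding plus position 0's embedding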
So, only the positional part of the sum changes — the positional embedding. The resulting sum is going to be different for each of those three words, even though they're exactly the same word.

Is that position embedding table specific to the standalone embedding table? Like, if you were to add or remove some words from the standalone table...

It's independent. Independent. It only depends on your assumption about how long the sentences can be. That's it. It doesn't really care what words are coming in; that's a whole different thing. These are two independent tables that are just learned as part of this process.

So, yeah, I have the same thing for sat and mat. Sat and mat, that's what we have. Just make sure you understand these two slides, to really make sure the mechanics are clear. Yeah?

How do you control for filler words? For example, if you're taking NLP output from transcription and you're trying to run a transformer, and you have a lot of "um"s and "like"s that are disproportionately frequent and have these random assignments or really deep embeddings — are there ways to look through the noise?

Typically, what they do is — we'll talk about this thing called byte pair encoding, in which individual characters, fragments of words, and whole words are all taken into account as tokens. So, when you have stuff like "uh" and so on, it gets mapped to these small tokens, and then we treat them as just any other token.

Yeah — is the aggregation just a simple sum? Wouldn't the actual semantic meaning of the standalone word be more important than its relative position in the sentence?

It could be.
We just don't know a priori 1691 00:56:42,199 --> 00:56:45,399 whether it's going to be important or 1692 00:56:43,400 --> 00:56:46,960 not for any particular sentence. 1693 00:56:45,400 --> 00:56:48,880 We when we train the transformer with a 1694 00:56:46,960 --> 00:56:50,358 lot of textual data, 1695 00:56:48,880 --> 00:56:51,880 right? It'll just figure out the right 1696 00:56:50,358 --> 00:56:53,719 values for these things so that on 1697 00:56:51,880 --> 00:56:55,280 average, the accuracy is as high as 1698 00:56:53,719 --> 00:56:56,879 possible. 1699 00:56:55,280 --> 00:56:58,120 So, in many of these things, there's 1700 00:56:56,880 --> 00:57:00,480 always a tension between our human 1701 00:56:58,119 --> 00:57:01,559 intuition as to how it should work and 1702 00:57:00,480 --> 00:57:02,960 whether you should just throw it into 1703 00:57:01,559 --> 00:57:04,079 the meat grinder of backprop and see 1704 00:57:02,960 --> 00:57:05,280 what happens. 1705 00:57:04,079 --> 00:57:06,400 And so, here it does it turns out you 1706 00:57:05,280 --> 00:57:08,840 can just throw it into backprop, it'll 1707 00:57:06,400 --> 00:57:10,920 actually do a pretty good job. 1708 00:57:08,840 --> 00:57:13,000 Uh yeah. 1709 00:57:10,920 --> 00:57:15,960 For the positional encoding, we would 1710 00:57:13,000 --> 00:57:18,199 just be as using the sum vector, we 1711 00:57:15,960 --> 00:57:20,720 would be using like this 2 by 3 matrix 1712 00:57:18,199 --> 00:57:21,719 that you have for our right? 1713 00:57:20,719 --> 00:57:23,559 Uh oh yeah, this is just for 1714 00:57:21,719 --> 00:57:24,679 demonstration. Basically, this is the 1715 00:57:23,559 --> 00:57:26,279 thing that will actually go into the 1716 00:57:24,679 --> 00:57:28,358 transformer. Correct. 1717 00:57:26,280 --> 00:57:28,359 Yeah. 1718 00:57:28,559 --> 00:57:31,679 That was just me being overly verbose in 1719 00:57:30,079 --> 00:57:33,199 the slides. 1720 00:57:31,679 --> 00:57:35,239 Uh yeah. 1721 00:57:33,199 --> 00:57:36,919 I can see sentences in the input. At 1722 00:57:35,239 --> 00:57:38,279 this point, are we still parsing out 1723 00:57:36,920 --> 00:57:40,039 punctuation or if we have like a 1724 00:57:38,280 --> 00:57:41,760 multi-sentence input, is there a 1725 00:57:40,039 --> 00:57:44,119 positional embedding vector for each of 1726 00:57:41,760 --> 00:57:47,120 the sentences? Yeah, so here um 1727 00:57:44,119 --> 00:57:48,799 basically, the starting point is tokens. 1728 00:57:47,119 --> 00:57:50,239 Right? And in our example, because we're 1729 00:57:48,800 --> 00:57:51,760 working with the idea of simple 1730 00:57:50,239 --> 00:57:53,039 standardization and stripping and things 1731 00:57:51,760 --> 00:57:54,000 like that, I'm just showing actual 1732 00:57:53,039 --> 00:57:56,000 words. 1733 00:57:54,000 --> 00:57:58,199 If you go to something like GPT-4, since 1734 00:57:56,000 --> 00:58:01,159 it uses a different tokenization scheme, 1735 00:57:58,199 --> 00:58:02,319 uh each token might be part of a word. 1736 00:58:01,159 --> 00:58:03,559 It might be it might be an individual 1737 00:58:02,320 --> 00:58:06,240 character, it might be a punctuation 1738 00:58:03,559 --> 00:58:08,440 mark, it could be in fact um the GPT 1739 00:58:06,239 --> 00:58:10,439 family doesn't strip out punctuation. 1740 00:58:08,440 --> 00:58:12,480 Which is why when you ask a question, it 1741 00:58:10,440 --> 00:58:13,920 comes back with intact punctuation in 1742 00:58:12,480 --> 00:58:15,840 its response. 
1743 00:58:13,920 --> 00:58:17,400 Uh and so, we'll get we'll revisit this 1744 00:58:15,840 --> 00:58:19,760 when you look at BPE, byte pair encoding 1745 00:58:17,400 --> 00:58:19,760 later on. 1746 00:58:19,840 --> 00:58:22,800 But the key thing to remember is that 1747 00:58:21,119 --> 00:58:24,679 all the stuff we're talking about starts 1748 00:58:22,800 --> 00:58:26,560 from the notion of a token. 1749 00:58:24,679 --> 00:58:28,559 As to how you define a token given a 1750 00:58:26,559 --> 00:58:30,719 bunch of text, that's the tokenizer's 1751 00:58:28,559 --> 00:58:33,519 job. And we just assumed a simple 1752 00:58:30,719 --> 00:58:36,759 tokenizer for the time being. 1753 00:58:33,519 --> 00:58:38,960 Okay? So, at this point, folks, we have 1754 00:58:36,760 --> 00:58:40,680 satisfied all the requirements. 1755 00:58:38,960 --> 00:58:42,480 Uh we have taken the surrounding context 1756 00:58:40,679 --> 00:58:43,839 of each word, we have taken the order, 1757 00:58:42,480 --> 00:58:45,480 and so on and so forth, because what's 1758 00:58:43,840 --> 00:58:47,519 coming in here is the positional 1759 00:58:45,480 --> 00:58:49,639 embeddings. Okay? And it runs through 1760 00:58:47,519 --> 00:58:51,440 the whole transformer stack. 1761 00:58:49,639 --> 00:58:54,799 So, 1762 00:58:51,440 --> 00:58:55,920 this is called a transformer encoder. 1763 00:58:54,800 --> 00:58:57,840 Okay? 1764 00:58:55,920 --> 00:58:59,039 This is the transformer encoder. 1765 00:58:57,840 --> 00:59:01,039 And you can see here, this is the 1766 00:58:59,039 --> 00:59:03,239 original picture from the paper. 1767 00:59:01,039 --> 00:59:04,719 It's an iconic picture at this point. 1768 00:59:03,239 --> 00:59:06,239 So, it says here this is these are the 1769 00:59:04,719 --> 00:59:07,599 input This is like the cat sat on the 1770 00:59:06,239 --> 00:59:09,519 mat. 1771 00:59:07,599 --> 00:59:11,400 It comes in here, gets transferred to 1772 00:59:09,519 --> 00:59:12,679 transformed into embeddings, standalone 1773 00:59:11,400 --> 00:59:14,639 embeddings. 1774 00:59:12,679 --> 00:59:17,319 And then, based on the position of each 1775 00:59:14,639 --> 00:59:20,679 word, we add that's why you see a plus 1776 00:59:17,320 --> 00:59:22,120 sign here, we add the positional 1777 00:59:20,679 --> 00:59:24,358 embedding to that. 1778 00:59:22,119 --> 00:59:26,799 And the resulting thing goes into this 1779 00:59:24,358 --> 00:59:30,599 transformer block. And here, 1780 00:59:26,800 --> 00:59:30,600 we go through multi-head attention. 1781 00:59:30,800 --> 00:59:34,480 And things come out the other end. 1782 00:59:32,800 --> 00:59:36,160 Then there is this thing called add and 1783 00:59:34,480 --> 00:59:37,440 norm, which we'll visit we'll revisit on 1784 00:59:36,159 --> 00:59:38,759 Wednesday. 1785 00:59:37,440 --> 00:59:40,800 And then it goes through a feed forward 1786 00:59:38,760 --> 00:59:42,480 network, another add and norm, which 1787 00:59:40,800 --> 00:59:43,640 we'll revisit on Wednesday. 1788 00:59:42,480 --> 00:59:46,360 And then it comes out the other end. 1789 00:59:43,639 --> 00:59:47,519 That's it. That's a transformer encoder. 1790 00:59:46,360 --> 00:59:48,360 Okay? 
1791 00:59:47,519 --> 00:59:51,759 Um 1792 00:59:48,360 --> 00:59:51,760 and so if you look at this 1793 00:59:52,320 --> 00:59:55,160 just to point out a couple of things, 1794 00:59:53,719 --> 00:59:56,359 the input embeddings can be random 1795 00:59:55,159 --> 00:59:57,519 weights or it could be pre-trained 1796 00:59:56,360 --> 00:59:58,440 embeddings. 1797 00:59:57,519 --> 01:00:00,119 Um 1798 00:59:58,440 --> 01:00:01,000 we add in a position-dependent embedding 1799 01:00:00,119 --> 01:00:02,799 to represent the position of each word 1800 01:00:01,000 --> 01:00:04,000 in the sentence. That's the plus. 1801 01:00:02,800 --> 01:00:05,800 Then we pass it through multi-headed 1802 01:00:04,000 --> 01:00:07,199 attention to get a contextual uh 1803 01:00:05,800 --> 01:00:09,000 representation. 1804 01:00:07,199 --> 01:00:10,639 Then we finally we pass all this through 1805 01:00:09,000 --> 01:00:12,480 a simple 1806 01:00:10,639 --> 01:00:13,879 typically it's a two-layer network. A 1807 01:00:12,480 --> 01:00:16,039 one hidden layer with relus and then a 1808 01:00:13,880 --> 01:00:20,079 linear layer after that and boom. Uh and 1809 01:00:16,039 --> 01:00:21,840 then we do it. This is the encoder. And 1810 01:00:20,079 --> 01:00:23,799 here is the perhaps the most important 1811 01:00:21,840 --> 01:00:25,600 point to keep in mind. 1812 01:00:23,800 --> 01:00:26,840 Because we have taken inordinate care to 1813 01:00:25,599 --> 01:00:28,159 make sure that the things that are 1814 01:00:26,840 --> 01:00:30,200 coming in and the things that are going 1815 01:00:28,159 --> 01:00:32,159 out have the same size 1816 01:00:30,199 --> 01:00:34,199 both in terms of the number of tokens as 1817 01:00:32,159 --> 01:00:37,319 well as the length of each vector. 1818 01:00:34,199 --> 01:00:39,079 We can then stack them up like pancakes. 1819 01:00:37,320 --> 01:00:41,480 We can have lots of transformers stacked 1820 01:00:39,079 --> 01:00:43,679 one on top of each other. 1821 01:00:41,480 --> 01:00:45,679 Right? Because it's the perfect API. 1822 01:00:43,679 --> 01:00:47,879 It's the simplest possible API. The same 1823 01:00:45,679 --> 01:00:49,639 thing comes in, same thing goes out. 1824 01:00:47,880 --> 01:00:51,200 In terms of size. So you can have a 1825 01:00:49,639 --> 01:00:53,239 transformer encoder, another one top, 1826 01:00:51,199 --> 01:00:55,799 boom, boom, boom, boom, boom, one after 1827 01:00:53,239 --> 01:00:58,239 the other. GPT-3 has 96 transformer 1828 01:00:55,800 --> 01:00:58,240 stacks. 1829 01:00:58,719 --> 01:01:02,919 And like in all things deep learning 1830 01:01:00,440 --> 01:01:04,360 related, the more layers you have, the 1831 01:01:02,920 --> 01:01:05,400 more complicated things we can do with 1832 01:01:04,360 --> 01:01:06,760 it. 1833 01:01:05,400 --> 01:01:10,559 As long as you have enough data to keep 1834 01:01:06,760 --> 01:01:10,560 the model happy so it doesn't overfit. 1835 01:01:11,760 --> 01:01:15,920 Okay? 1836 01:01:13,400 --> 01:01:17,920 All right. So, what we haven't covered, 1837 01:01:15,920 --> 01:01:20,079 which we'll cover on Wednesday 1838 01:01:17,920 --> 01:01:22,400 uh is is the question that 1839 01:01:20,079 --> 01:01:23,440 he had posed about how 1840 01:01:22,400 --> 01:01:24,680 uh you know, since there are no 1841 01:01:23,440 --> 01:01:26,760 parameters inside the self-attention 1842 01:01:24,679 --> 01:01:27,879 block, what are we actually learning? 
1843 01:01:26,760 --> 01:01:29,120 And then there is these things called 1844 01:01:27,880 --> 01:01:31,000 residual connections and layer 1845 01:01:29,119 --> 01:01:32,400 normalization. We'll talk about all 1846 01:01:31,000 --> 01:01:35,159 those things on Wednesday. Those are all 1847 01:01:32,400 --> 01:01:38,559 like, you know, refinements to the idea. 1848 01:01:35,159 --> 01:01:39,719 So, all right, 9:39. Um let's apply the 1849 01:01:38,559 --> 01:01:40,920 transformer encoder to an actual 1850 01:01:39,719 --> 01:01:43,319 problem. 1851 01:01:40,920 --> 01:01:45,119 Any questions? 1852 01:01:43,320 --> 01:01:46,760 Uh yeah. 1853 01:01:45,119 --> 01:01:48,839 My question is regarding like you said 1854 01:01:46,760 --> 01:01:50,400 you could have multiple transformers. 1855 01:01:48,840 --> 01:01:53,200 What is the difference with having 1856 01:01:50,400 --> 01:01:54,840 multiple self-attention heads uh and 1857 01:01:53,199 --> 01:01:57,519 rather than that having multiple When I 1858 01:01:54,840 --> 01:01:59,400 say a transformer block within the block 1859 01:01:57,519 --> 01:02:01,599 there could be multiple heads. So, if 1860 01:01:59,400 --> 01:02:04,680 you're if the accuracy is the same, why 1861 01:02:01,599 --> 01:02:06,039 would you use this rather 1862 01:02:04,679 --> 01:02:08,199 Yeah, you can have a lot of attention 1863 01:02:06,039 --> 01:02:10,559 heads. And that's totally fine. And 1864 01:02:08,199 --> 01:02:12,079 typically I forget how many GPT-3 and 4 1865 01:02:10,559 --> 01:02:13,799 have. They have a whole bunch of them. 1866 01:02:12,079 --> 01:02:15,360 But you can So you can go wide and you 1867 01:02:13,800 --> 01:02:18,320 can go deep. 1868 01:02:15,360 --> 01:02:19,599 Both are done in practice. 1869 01:02:18,320 --> 01:02:20,559 But the thing is if 1870 01:02:19,599 --> 01:02:22,119 The one thing you have to remember is 1871 01:02:20,559 --> 01:02:24,480 that if you if you go wide, you have a 1872 01:02:22,119 --> 01:02:26,239 lot of attention heads then given the 1873 01:02:24,480 --> 01:02:28,440 particular input that's coming into that 1874 01:02:26,239 --> 01:02:29,439 block, it'll learn different patterns 1875 01:02:28,440 --> 01:02:31,039 from it. 1876 01:02:29,440 --> 01:02:32,440 While if you stack them all up, it's 1877 01:02:31,039 --> 01:02:33,800 going to learn different ways to 1878 01:02:32,440 --> 01:02:35,200 contextualize the things that are coming 1879 01:02:33,800 --> 01:02:36,760 in. It operates at higher levels of 1880 01:02:35,199 --> 01:02:38,279 abstraction. So the analogy would be 1881 01:02:36,760 --> 01:02:40,520 that like the seventh layer of a 1882 01:02:38,280 --> 01:02:42,640 convolutional net may take the sixth 1883 01:02:40,519 --> 01:02:44,960 layer's output and say, "Oh, I'm seeing 1884 01:02:42,639 --> 01:02:46,839 a lot of edges here. I'm going to take 1885 01:02:44,960 --> 01:02:48,519 an edge like this, two circles like that 1886 01:02:46,840 --> 01:02:49,480 and call it a face." 1887 01:02:48,519 --> 01:02:52,000 So it'll operate at a higher level of 1888 01:02:49,480 --> 01:02:52,000 abstraction. 1889 01:02:52,400 --> 01:02:55,440 Okay. 1890 01:02:53,360 --> 01:02:55,440 Um 1891 01:02:58,320 --> 01:03:02,840 All right, let's go to the collab. 
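Before the Colab, a quick sketch of the wide-versus-deep point from that exchange, using Keras MultiHeadAttention directly; the residual connections and feed-forward parts of a full encoder block are omitted here for brevity.
```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 5, 512))   # five token vectors going into a block

# Going wide: one block with many heads, each free to pick up a different
# pattern from the same input.
wide = layers.MultiHeadAttention(num_heads=16, key_dim=32)(x, x)

# Going deep: several blocks stacked, each one re-contextualizing the
# previous block's output at a higher level of abstraction.
deep = x
for _ in range(4):
    deep = layers.MultiHeadAttention(num_heads=4, key_dim=32)(deep, deep)

print(wide.shape, deep.shape)        # both remain (1, 5, 512)
```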
1892 01:03:01,800 --> 01:03:04,080 So what we're going to do is we're going 1893 01:03:02,840 --> 01:03:05,360 to take the transformer that we just 1894 01:03:04,079 --> 01:03:07,599 learned about and we're going to apply 1895 01:03:05,360 --> 01:03:09,320 it to solve the the travel uh slot 1896 01:03:07,599 --> 01:03:12,079 problem. Okay? 1897 01:03:09,320 --> 01:03:14,320 Uh all right. So 1898 01:03:12,079 --> 01:03:16,199 Okay, so we'll start with the usual 1899 01:03:14,320 --> 01:03:18,600 preliminaries. 1900 01:03:16,199 --> 01:03:20,319 And then we have taken the ATIS data set 1901 01:03:18,599 --> 01:03:23,960 I talked about and we have stuck them in 1902 01:03:20,320 --> 01:03:26,480 raw box for easy consumption. 1903 01:03:23,960 --> 01:03:26,480 It's here. 1904 01:03:29,880 --> 01:03:33,400 Okay. 1905 01:03:30,800 --> 01:03:35,160 So if you look at to the top view 1906 01:03:33,400 --> 01:03:37,960 you can see here, for example, I want to 1907 01:03:35,159 --> 01:03:39,599 fly from Boston 8:30 a.m. And then this 1908 01:03:37,960 --> 01:03:42,880 is the output. The slot filling is the 1909 01:03:39,599 --> 01:03:43,880 output. Um and so as it turns out here 1910 01:03:42,880 --> 01:03:46,000 there is 1911 01:03:43,880 --> 01:03:47,358 this these people also gave it a another 1912 01:03:46,000 --> 01:03:49,440 They took the whole query and gave it an 1913 01:03:47,358 --> 01:03:51,199 intent as to is it it's a flight query, 1914 01:03:49,440 --> 01:03:52,480 it's a something else query and so on, 1915 01:03:51,199 --> 01:03:54,559 which we're not going to use. Are you 1916 01:03:52,480 --> 01:03:56,599 kidding me? 1917 01:03:54,559 --> 01:03:57,519 I want to fly from Boston at 8:30 a.m. 1918 01:03:56,599 --> 01:03:59,239 and arrive in Denver at 11:00 in the 1919 01:03:57,519 --> 01:04:01,239 morning. What kind of ground 1920 01:03:59,239 --> 01:04:03,759 transportations are available in Denver? 1921 01:04:01,239 --> 01:04:06,079 What's the airport at Orlando? 1922 01:04:03,760 --> 01:04:08,480 Um how much does the limo service cost 1923 01:04:06,079 --> 01:04:09,799 within Pittsburgh? Okay. 1924 01:04:08,480 --> 01:04:11,480 And so on and so forth. So you get So 1925 01:04:09,800 --> 01:04:13,760 you get the idea. It's a very wide range 1926 01:04:11,480 --> 01:04:16,440 of queries that are in this data set. 1927 01:04:13,760 --> 01:04:18,960 Um okay. So let's just ignore that for a 1928 01:04:16,440 --> 01:04:22,240 sec. Um okay. So what we're now going to 1929 01:04:18,960 --> 01:04:24,960 do is we are going to take only 1930 01:04:22,239 --> 01:04:27,799 um this column, right? The query column. 1931 01:04:24,960 --> 01:04:29,559 That's going to be our input text. Okay? 1932 01:04:27,800 --> 01:04:31,359 And then the slot filling column is 1933 01:04:29,559 --> 01:04:32,599 going to be our dependent variable, the 1934 01:04:31,358 --> 01:04:34,880 output. 1935 01:04:32,599 --> 01:04:37,440 So we'll just gather them all up 1936 01:04:34,880 --> 01:04:38,840 uh here. 1937 01:04:37,440 --> 01:04:40,599 Let it run. We'll do it for the training 1938 01:04:38,840 --> 01:04:42,559 data and the test data. 1939 01:04:40,599 --> 01:04:45,759 And so what we have done is that we have 1940 01:04:42,559 --> 01:04:47,840 taken um the transformer related code in 1941 01:04:45,760 --> 01:04:49,480 Keras and we have packaged it into a 1942 01:04:47,840 --> 01:04:50,640 little hardel library for easy 1943 01:04:49,480 --> 01:04:53,240 consumption. 
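For reference, one ATIS training pair looks roughly like the sketch below: the query is the input text, the slot string is the target, one tag per token. The exact tag strings shown are illustrative, taken from the public ATIS annotation scheme rather than from the Colab.
```python
# Illustrative only: roughly what one ATIS training pair looks like.
# One BIO-style tag per input token (tag names follow the public ATIS scheme).
query = ("i want to fly from boston at 838 am and arrive in denver "
         "at 1110 in the morning")
slots = ("O O O O O B-fromloc.city_name O B-depart_time.time "
         "I-depart_time.time O O O B-toloc.city_name O B-arrive_time.time "
         "O O B-arrive_time.period_of_day")

# One tag per token: this alignment is what lets us treat slot filling
# as a per-token classification problem later on.
assert len(query.split()) == len(slots.split())
```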
1944 01:04:50,639 --> 01:04:55,279 Um and so that thing is here. You can 1945 01:04:53,239 --> 01:04:56,719 download it. 1946 01:04:55,280 --> 01:04:57,680 Calling it a library is like overstating 1947 01:04:56,719 --> 01:04:59,679 it. We literally just collected a bunch 1948 01:04:57,679 --> 01:05:00,719 of code and stuck it in a file. Okay? 1949 01:04:59,679 --> 01:05:02,039 So 1950 01:05:00,719 --> 01:05:03,639 and so what we'll do is from hardel 1951 01:05:02,039 --> 01:05:04,960 we'll import the transformer 1952 01:05:03,639 --> 01:05:06,679 encoder. 1953 01:05:04,960 --> 01:05:08,039 And we'll import this positional 1954 01:05:06,679 --> 01:05:09,239 embedding layer. 1955 01:05:08,039 --> 01:05:11,039 Because what we're going to do is we are 1956 01:05:09,239 --> 01:05:12,519 going to take the input, do the 1957 01:05:11,039 --> 01:05:14,199 positional encoding business, and then 1958 01:05:12,519 --> 01:05:15,400 send it into the transformer. 1959 01:05:14,199 --> 01:05:18,559 Okay? 1960 01:05:15,400 --> 01:05:21,119 Um so but first let's vectorize the 1961 01:05:18,559 --> 01:05:24,920 input uh queries that are coming in. 1962 01:05:21,119 --> 01:05:26,559 So we'll define a thing here. 1963 01:05:24,920 --> 01:05:28,440 Oh, it uses this uh 1964 01:05:26,559 --> 01:05:30,320 max query length, which is not defined. That's 1965 01:05:28,440 --> 01:05:32,079 what happens when you 1966 01:05:30,320 --> 01:05:34,480 don't run everything. 1967 01:05:32,079 --> 01:05:34,480 All right. 1968 01:05:38,599 --> 01:05:44,839 Okay. So now we have this thing here. So 1969 01:05:41,719 --> 01:05:47,319 turns out that there are 8,888 tokens, 1970 01:05:44,840 --> 01:05:49,320 right? 8,888 words in the input queries 1971 01:05:47,320 --> 01:05:52,359 that we have in the data. Uh so I 1972 01:05:49,320 --> 01:05:54,200 take a look at the first few. 1973 01:05:52,358 --> 01:05:56,799 And you can see here, you know, there is 1974 01:05:54,199 --> 01:05:58,759 unk. Uh and because the output mode here 1975 01:05:56,800 --> 01:06:00,280 is, you just want integers to come out, 1976 01:05:58,760 --> 01:06:01,000 not multi-hot encoding or anything, 1977 01:06:00,280 --> 01:06:02,600 because we're going to take these 1978 01:06:01,000 --> 01:06:04,920 integers and then do embeddings from 1979 01:06:02,599 --> 01:06:07,880 them. So it'll 1980 01:06:04,920 --> 01:06:10,280 reserve this empty string as the pad 1981 01:06:07,880 --> 01:06:11,119 token. This should be familiar from last 1982 01:06:10,280 --> 01:06:13,200 week. 1983 01:06:11,119 --> 01:06:14,679 And then the unk for unknown tokens, and 1984 01:06:13,199 --> 01:06:17,039 then "to", "from", "flights", these are all some 1985 01:06:14,679 --> 01:06:18,559 of the most frequent. Um turns out 1986 01:06:17,039 --> 01:06:20,119 Boston is actually the most frequent. I 1987 01:06:18,559 --> 01:06:22,358 don't know what's up with that. 1988 01:06:20,119 --> 01:06:24,279 It is what it is. Then we'll do the same 1989 01:06:22,358 --> 01:06:25,319 vectorization to the train and test data 1990 01:06:24,280 --> 01:06:28,160 sets. 1991 01:06:25,320 --> 01:06:30,480 Now uh we need to do the same thing for the output 1992 01:06:28,159 --> 01:06:31,799 side of the problem, because the slots, 1993 01:06:30,480 --> 01:06:33,800 the dependent variable here, 1994 01:06:31,800 --> 01:06:36,519 remember, are all sentences as well with 1995 01:06:33,800 --> 01:06:38,200 the B, O, things like that, right? So we 1996 01:06:36,519 --> 01:06:40,840 need to vectorize those.
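A minimal sketch of that vectorization step, assuming Keras's TextVectorization with integer output; the argument values and the toy queries are assumptions, not the Colab's exact cell. The output side is handled the same way but with standardization turned off, which is explained next.
```python
import tensorflow as tf
from tensorflow.keras import layers

# Input-side vectorization (values and toy queries are assumptions).
max_query_length = 30

train_queries = [
    "i want to fly from boston at 838 am",
    "what flights leave denver in the morning",
]  # stand-in for the real ATIS training queries

query_vectorizer = layers.TextVectorization(
    output_mode="int",                        # integer ids, ready for an Embedding layer
    output_sequence_length=max_query_length,  # pad or truncate every query to 30 tokens
)
query_vectorizer.adapt(train_queries)         # builds the vocabulary: '' is pad, '[UNK]' is unknown

print(query_vectorizer.get_vocabulary()[:8])
print(query_vectorizer([train_queries[0]]))   # shape (1, 30): token ids padded to length 30

# The output side gets its own TextVectorization, but with standardize=None
# so tags like B-fromloc.city_name survive intact.
```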
1997 01:06:38,199 --> 01:06:42,039 So we do the same thing on them. 1998 01:06:40,840 --> 01:06:43,280 So let's take a look at some of these 1999 01:06:42,039 --> 01:06:44,519 slots. 2000 01:06:43,280 --> 01:06:45,800 And you can see here all this stuff is 2001 01:06:44,519 --> 01:06:48,280 going on. 2002 01:06:45,800 --> 01:06:49,760 Now here is an example where you 2003 01:06:48,280 --> 01:06:51,440 have to be very careful when you do the 2004 01:06:49,760 --> 01:06:52,800 standardization. 2005 01:06:51,440 --> 01:06:54,440 Typically in standardization you will 2006 01:06:52,800 --> 01:06:56,120 remove punctuation and, you know, do 2007 01:06:54,440 --> 01:06:57,358 things like that and lowercase, right? 2008 01:06:56,119 --> 01:07:00,400 But here 2009 01:06:57,358 --> 01:07:01,559 these things have a specific meaning. 2010 01:07:00,400 --> 01:07:03,400 We can't just go in there and remove the 2011 01:07:01,559 --> 01:07:04,880 period and the underscore and then 2012 01:07:03,400 --> 01:07:06,559 make the B into lowercase b and stuff 2013 01:07:04,880 --> 01:07:07,880 like that. That'll just harm it. 2014 01:07:06,559 --> 01:07:10,239 Right? We need to be able to preserve 2015 01:07:07,880 --> 01:07:12,559 the nomenclature of the output in terms 2016 01:07:10,239 --> 01:07:13,639 of all those tags. So 2017 01:07:12,559 --> 01:07:15,119 um so we don't want the standardization 2018 01:07:13,639 --> 01:07:17,000 to strip all those out. So what we do is we 2019 01:07:15,119 --> 01:07:18,358 say standardization none. 2020 01:07:17,000 --> 01:07:20,039 Look at that. 2021 01:07:18,358 --> 01:07:22,319 We tell Keras, do not standardize this. 2022 01:07:20,039 --> 01:07:23,239 Do not do your usual thing. 2023 01:07:22,320 --> 01:07:25,280 Okay? 2024 01:07:23,239 --> 01:07:26,919 Um so 2025 01:07:25,280 --> 01:07:29,080 we do that 2026 01:07:26,920 --> 01:07:30,960 for the output side. And then let's look 2027 01:07:29,079 --> 01:07:33,358 at the vocabulary. 2028 01:07:30,960 --> 01:07:34,440 Yeah, so this looks pretty good. 2029 01:07:33,358 --> 01:07:35,880 These are all the things that we would 2030 01:07:34,440 --> 01:07:37,599 expect to see. 2031 01:07:35,880 --> 01:07:39,800 These are the distinct tokens in the 2032 01:07:37,599 --> 01:07:42,759 output strings. 2033 01:07:39,800 --> 01:07:42,760 Um all right. 2034 01:07:43,320 --> 01:07:48,359 Okay, we get it. 2035 01:07:45,880 --> 01:07:50,400 So we have 125 of them. In the 2036 01:07:48,358 --> 01:07:54,279 lecture I said there are 123 slots, 2037 01:07:50,400 --> 01:07:57,240 possible slots. Why is it 125 here? 2038 01:07:54,280 --> 01:07:59,519 Yes, unk and pad. Correct. 2039 01:07:57,239 --> 01:08:02,279 Um okay. Now we'll set up a transformer 2040 01:07:59,519 --> 01:08:05,119 encoder, right? Uh this Oh, wait, wait, 2041 01:08:02,280 --> 01:08:07,280 wait. I forgot about um doing this. My 2042 01:08:05,119 --> 01:08:09,519 bad. Um 2043 01:08:07,280 --> 01:08:09,519 All right. 2044 01:08:11,519 --> 01:08:15,639 I just realized when I saw the slide that 2045 01:08:12,880 --> 01:08:16,560 we went to the collab 2046 01:08:15,639 --> 01:08:18,880 without giving you a bit more 2047 01:08:16,560 --> 01:08:20,240 background. No problem. So 2048 01:08:18,880 --> 01:08:21,119 So 2049 01:08:20,239 --> 01:08:22,318 the way we're going to model this 2050 01:08:21,119 --> 01:08:23,479 problem is that we're going to have 2051 01:08:22,319 --> 01:08:24,839 something like this, right?
Fly from 2052 01:08:23,479 --> 01:08:26,239 Boston to Denver. 2053 01:08:24,838 --> 01:08:28,600 That's the input that's coming in and 2054 01:08:26,239 --> 01:08:31,439 that is the correct answer. 2055 01:08:28,600 --> 01:08:32,798 0 0 some B something or others I mean O 2056 01:08:31,439 --> 01:08:34,479 and then something else, right? That's 2057 01:08:32,798 --> 01:08:36,399 the correct answer. That's the that's 2058 01:08:34,479 --> 01:08:38,718 the input and that is the right answer. 2059 01:08:36,399 --> 01:08:40,559 So what we'll do is we will 2060 01:08:38,719 --> 01:08:42,640 create these positional input embeddings 2061 01:08:40,560 --> 01:08:45,359 like we have discussed before. 2062 01:08:42,640 --> 01:08:47,719 We will run it through a transformer. 2063 01:08:45,359 --> 01:08:49,120 It gives us contextual embeddings. 2064 01:08:47,719 --> 01:08:50,680 So if we send five in, it's going to 2065 01:08:49,119 --> 01:08:51,960 send us five out except the color is now 2066 01:08:50,680 --> 01:08:54,319 blue. 2067 01:08:51,960 --> 01:08:57,520 Right? And then what we do is 2068 01:08:54,319 --> 01:08:59,400 we will run it through a relu. 2069 01:08:57,520 --> 01:09:01,080 Okay, we'll run it through a relu. 2070 01:08:59,399 --> 01:09:02,639 We will still have 2071 01:09:01,079 --> 01:09:04,039 you know, five vectors here, five 2072 01:09:02,640 --> 01:09:05,920 vectors will come in. 2073 01:09:04,039 --> 01:09:07,960 And then for each of the things that 2074 01:09:05,920 --> 01:09:10,759 comes in, we will stick a 123-way 2075 01:09:07,960 --> 01:09:10,759 softmax. 2076 01:09:11,838 --> 01:09:15,838 Okay, for each thing that comes out 2077 01:09:13,279 --> 01:09:16,838 we'll have a 123-way softmax and that's 2078 01:09:15,838 --> 01:09:19,239 the classification problem we're going 2079 01:09:16,838 --> 01:09:19,239 to solve. 2080 01:09:20,439 --> 01:09:23,639 Okay? 2081 01:09:21,719 --> 01:09:25,759 So 2082 01:09:23,640 --> 01:09:28,280 the weights in all these layers will get 2083 01:09:25,759 --> 01:09:29,279 optimized by backprop. 2084 01:09:28,279 --> 01:09:30,798 All these weights are going to get 2085 01:09:29,279 --> 01:09:33,200 optimized. 2086 01:09:30,798 --> 01:09:33,199 Uh yeah. 2087 01:09:34,119 --> 01:09:36,399 Sorry? 2088 01:09:40,798 --> 01:09:44,798 Oh no, the that's a layer. The weights 2089 01:09:43,680 --> 01:09:46,920 in the layer will still need to be 2090 01:09:44,798 --> 01:09:48,159 learned. 2091 01:09:46,920 --> 01:09:50,199 It's sort of like the text vectorization 2092 01:09:48,159 --> 01:09:51,880 layer is a bunch of code and then you 2093 01:09:50,199 --> 01:09:53,439 actually run it on a particular corpus 2094 01:09:51,880 --> 01:09:54,480 to adapt it and fill our vocabulary out 2095 01:09:53,439 --> 01:09:55,679 of it. 2096 01:09:54,479 --> 01:09:57,879 So, it's like an empty shell that needs 2097 01:09:55,680 --> 01:09:59,320 to get populated. 2098 01:09:57,880 --> 01:10:00,680 Okay, so with the weights and all these 2099 01:09:59,319 --> 01:10:02,239 things are going to get updated when we 2100 01:10:00,680 --> 01:10:03,600 when we train the model 2101 01:10:02,239 --> 01:10:06,399 by backprop. 2102 01:10:03,600 --> 01:10:07,600 Uh and that's it. That's the setup. 2103 01:10:06,399 --> 01:10:09,639 Does this make sense before I switch 2104 01:10:07,600 --> 01:10:11,560 back to the collab? 2105 01:10:09,640 --> 01:10:14,320 In particular, does this make sense? 2106 01:10:11,560 --> 01:10:14,320 This part of it. 
2107 01:10:15,920 --> 01:10:18,440 Bunch of things come out and then for 2108 01:10:17,319 --> 01:10:20,439 each one of those things we need to 2109 01:10:18,439 --> 01:10:22,119 figure out a classification of a 123-way 2110 01:10:20,439 --> 01:10:23,479 classification. And that's where we 2111 01:10:22,119 --> 01:10:25,319 stick a softmax on every one of those 2112 01:10:23,479 --> 01:10:27,599 output nodes. 2113 01:10:25,319 --> 01:10:27,599 Yeah. 2114 01:10:32,800 --> 01:10:35,440 Oh oh, I see. 2115 01:10:36,000 --> 01:10:38,439 Yeah, so 2116 01:10:40,239 --> 01:10:43,279 It could be whatever or to put it 2117 01:10:41,560 --> 01:10:45,600 another way, it is your choice as the 2118 01:10:43,279 --> 01:10:47,880 user as the modeler. Correct? The thing 2119 01:10:45,600 --> 01:10:49,400 is at this point with the blue stuff the 2120 01:10:47,880 --> 01:10:51,359 transformer is basically saying, my job 2121 01:10:49,399 --> 01:10:52,639 is done. 2122 01:10:51,359 --> 01:10:54,639 It has given you these valuable 2123 01:10:52,640 --> 01:10:56,720 contextual embeddings at some high-level 2124 01:10:54,640 --> 01:10:58,480 abstraction. What you do with it depends 2125 01:10:56,720 --> 01:11:00,680 on your particular problem. And so that 2126 01:10:58,479 --> 01:11:01,959 the best practice would be to take it 2127 01:11:00,680 --> 01:11:03,280 and then maybe, you know, if these 2128 01:11:01,960 --> 01:11:04,279 embeddings are embeddings are really 2129 01:11:03,279 --> 01:11:07,159 long, maybe you make them a little 2130 01:11:04,279 --> 01:11:09,079 smaller, right? Using a ReLU. And using 2131 01:11:07,159 --> 01:11:10,239 a ReLU is always a good idea because 2132 01:11:09,079 --> 01:11:11,640 when in doubt, throw in a bit of 2133 01:11:10,239 --> 01:11:13,519 non-linearity. 2134 01:11:11,640 --> 01:11:15,440 Right? Uh and then once you're done with 2135 01:11:13,520 --> 01:11:17,040 that, well, at this point you need to 2136 01:11:15,439 --> 01:11:20,079 actually classify it. So, you stick an 2137 01:11:17,039 --> 01:11:20,079 output softmax on it. 2138 01:11:20,560 --> 01:11:24,120 Okay. So, that's what we have. 2139 01:11:24,680 --> 01:11:26,960 Um 2140 01:11:27,680 --> 01:11:32,119 All right, back to this picture. 2141 01:11:29,640 --> 01:11:34,280 So, what we're going to do is we 2142 01:11:32,119 --> 01:11:36,119 we also get to decide how long are these 2143 01:11:34,279 --> 01:11:37,199 embedding vectors. How long because here 2144 01:11:36,119 --> 01:11:37,920 we're not going to use Glove embeddings. 2145 01:11:37,199 --> 01:11:39,800 We're just going to learn everything 2146 01:11:37,920 --> 01:11:40,800 from scratch. 2147 01:11:39,800 --> 01:11:42,880 Right? We're going to learn everything 2148 01:11:40,800 --> 01:11:45,360 from scratch. So, and we can decide how 2149 01:11:42,880 --> 01:11:46,440 long these embedding vectors are. So, um 2150 01:11:45,359 --> 01:11:47,519 these embedding vectors I'm going to 2151 01:11:46,439 --> 01:11:49,359 decide 2152 01:11:47,520 --> 01:11:52,880 uh I have decided that I want them to be 2153 01:11:49,359 --> 01:11:54,839 512 long, right? I want these actually 2154 01:11:52,880 --> 01:11:57,000 to be 512 long. So, that's what I have 2155 01:11:54,840 --> 01:11:58,880 here, 512. 
2156 01:11:57,000 --> 01:12:00,000 And then inside the transformer, 2157 01:11:58,880 --> 01:12:01,239 remember 2158 01:12:00,000 --> 01:12:02,920 when we 2159 01:12:01,239 --> 01:12:04,679 concatenate everything and then we have 2160 01:12:02,920 --> 01:12:07,600 something, we run it through a final 2161 01:12:04,680 --> 01:12:08,960 ReLU layer, how big should that layer 2162 01:12:07,600 --> 01:12:11,079 be? 2163 01:12:08,960 --> 01:12:13,279 That's what it here what I mean by dense 2164 01:12:11,079 --> 01:12:15,039 dim. I want it to be 64. 2165 01:12:13,279 --> 01:12:17,519 And then I, you know, for fun I'm going 2166 01:12:15,039 --> 01:12:20,399 to use five attention heads. 2167 01:12:17,520 --> 01:12:20,400 Because why not? 2168 01:12:20,439 --> 01:12:27,399 Okay. And then in the final thing here 2169 01:12:24,319 --> 01:12:29,199 to go to Ali's question here these 2170 01:12:27,399 --> 01:12:32,079 things are all 512 long as I mentioned 2171 01:12:29,199 --> 01:12:34,479 earlier, right? These are all 512. 2172 01:12:32,079 --> 01:12:36,760 But this thing here I'm going to make it 2173 01:12:34,479 --> 01:12:38,799 just 128. 2174 01:12:36,760 --> 01:12:41,199 Okay, that's what I mean by units here. 2175 01:12:38,800 --> 01:12:43,119 And so if you look at the actual model 2176 01:12:41,199 --> 01:12:45,679 okay, whatever comes in has a max query 2177 01:12:43,119 --> 01:12:47,239 length of I think 30 if I recall. 2178 01:12:45,680 --> 01:12:50,240 Um actually let's just make sure of 2179 01:12:47,239 --> 01:12:50,239 that. What did I assume? 2180 01:12:51,439 --> 01:12:55,759 30, correct? Max query length 30. So, 2181 01:12:53,079 --> 01:12:57,319 each sentence is 30. So, if a sentence 2182 01:12:55,760 --> 01:12:59,680 has 35 words in it, what's going to 2183 01:12:57,319 --> 01:12:59,679 happen? 2184 01:12:59,840 --> 01:13:03,760 The last five will get chopped, 2185 01:13:01,159 --> 01:13:05,359 truncated. If it comes in at 22, we're 2186 01:13:03,760 --> 01:13:06,840 going to pad it with eight more tokens 2187 01:13:05,359 --> 01:13:09,559 with a pad token. Okay? That's how we 2188 01:13:06,840 --> 01:13:12,159 make sure everything uh gets to 30. 2189 01:13:09,560 --> 01:13:14,039 All right. So, we come back here. 2190 01:13:12,159 --> 01:13:16,720 So, the input is still sentences which 2191 01:13:14,039 --> 01:13:18,960 are 30 long, tokens which are 30 long. 2192 01:13:16,720 --> 01:13:20,520 And then we run it through a positional 2193 01:13:18,960 --> 01:13:23,119 embedding layer. 2194 01:13:20,520 --> 01:13:25,160 Okay? This positional embedding layer 2195 01:13:23,119 --> 01:13:27,319 has the the actual embedding for each 2196 01:13:25,159 --> 01:13:29,279 word, that table and it has the 2197 01:13:27,319 --> 01:13:31,639 positional table, positional embedding 2198 01:13:29,279 --> 01:13:34,119 table. So, just to be clear, this 2199 01:13:31,640 --> 01:13:37,119 positional embedding layer is basically 2200 01:13:34,119 --> 01:13:38,800 it's basically this. 2201 01:13:37,119 --> 01:13:41,199 So, this table 2202 01:13:38,800 --> 01:13:43,720 and this table together are packaged up 2203 01:13:41,199 --> 01:13:45,279 into the positional encoding layer. 2204 01:13:43,720 --> 01:13:47,400 But they are two distinct tables. They 2205 01:13:45,279 --> 01:13:49,479 just happen to be packaged up. 2206 01:13:47,399 --> 01:13:51,119 So, 2207 01:13:49,479 --> 01:13:52,839 so this is what we have here. 
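A minimal sketch of what that packaging looks like: a token-embedding table and a position-embedding table inside one layer, looked up and added. The hardel layer's actual code may differ in details.
```python
import tensorflow as tf
from tensorflow.keras import layers

# A sketch of what a positional-embedding layer bundles together: a token
# embedding table plus a position embedding table, looked up and added.
class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(vocab_size, embed_dim)          # e.g. 8888 x 512
        self.position_embeddings = layers.Embedding(sequence_length, embed_dim)  # e.g. 30 x 512

    def call(self, token_ids):
        length = tf.shape(token_ids)[-1]
        positions = tf.range(start=0, limit=length, delta=1)   # 0, 1, ..., 29
        # Standalone word embedding plus the embedding of the word's position.
        return self.token_embeddings(token_ids) + self.position_embeddings(positions)

# A batch of one query, 30 token ids in, (1, 30, 512) out.
demo = PositionalEmbedding(sequence_length=30, vocab_size=8888, embed_dim=512)
print(demo(tf.zeros((1, 30), dtype=tf.int32)).shape)
```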
2208 01:13:51,119 --> 01:13:55,000 And then we get a nice positional 2209 01:13:52,840 --> 01:13:57,480 embedding out and then boom, we run it 2210 01:13:55,000 --> 01:13:59,640 through the transformer. And you know, 2211 01:13:57,479 --> 01:14:01,559 this transformer encoder object we have 2212 01:13:59,640 --> 01:14:02,800 to tell it obviously, hey, this is the 2213 01:14:01,560 --> 01:14:04,640 embedding dimension that's going to come 2214 01:14:02,800 --> 01:14:06,880 out. This is the dense dimension you're 2215 01:14:04,640 --> 01:14:09,000 going to use in that final feedforward 2216 01:14:06,880 --> 01:14:10,159 layer inside each attention block and 2217 01:14:09,000 --> 01:14:11,640 this is the number of attention heads I 2218 01:14:10,159 --> 01:14:13,519 want you to use. That's it. 2219 01:14:11,640 --> 01:14:14,800 Very, right? Only three things have to 2220 01:14:13,520 --> 01:14:16,840 be specified. 2221 01:14:14,800 --> 01:14:18,039 And then whatever comes out of the 2222 01:14:16,840 --> 01:14:19,159 transformer encoder are these blue 2223 01:14:18,039 --> 01:14:20,960 vectors. 2224 01:14:19,159 --> 01:14:22,720 And then we are back into good old sort 2225 01:14:20,960 --> 01:14:24,560 of, you know, traditional DNN stuff 2226 01:14:22,720 --> 01:14:27,880 where we take this thing, run it through 2227 01:14:24,560 --> 01:14:30,880 a ReLU with 128 units, we add a little 2228 01:14:27,880 --> 01:14:33,279 dropout uh and then we run it through a 2229 01:14:30,880 --> 01:14:35,600 dense layer which the the vocab size 2230 01:14:33,279 --> 01:14:37,359 here is 125, which is the 125-way 2231 01:14:35,600 --> 01:14:39,840 softmax. 2232 01:14:37,359 --> 01:14:41,239 Okay? Activation softmax. 2233 01:14:39,840 --> 01:14:42,720 Connect up everything into model input 2234 01:14:41,239 --> 01:14:44,399 and output and boom, that's the whole 2235 01:14:42,720 --> 01:14:47,440 model. 2236 01:14:44,399 --> 01:14:48,519 So, that's what we have here. 2237 01:14:47,439 --> 01:14:50,839 Okay? 2238 01:14:48,520 --> 01:14:50,840 Now, 2239 01:14:51,079 --> 01:14:54,680 this for the you know, after Wednesday's 2240 01:14:53,399 --> 01:14:56,679 class 2241 01:14:54,680 --> 01:14:59,320 for extra credit and for your personal 2242 01:14:56,680 --> 01:15:00,880 edification 2243 01:14:59,319 --> 01:15:03,000 try to work through this thing to come 2244 01:15:00,880 --> 01:15:04,800 up with this number. 2245 01:15:03,000 --> 01:15:06,960 53 million 2246 01:15:04,800 --> 01:15:10,039 um sorry, 5.3 million. 2247 01:15:06,960 --> 01:15:12,600 Right? Uh and see if it matches this 2248 01:15:10,039 --> 01:15:13,920 number here. 2249 01:15:12,600 --> 01:15:15,520 It should match. 2250 01:15:13,920 --> 01:15:17,840 Hand calculate the number of parameters 2251 01:15:15,520 --> 01:15:19,720 inside the transformer. Okay? For fame 2252 01:15:17,840 --> 01:15:20,520 and fortune. That's an optional thing. 2253 01:15:19,720 --> 01:15:22,240 So, 2254 01:15:20,520 --> 01:15:23,480 uh do it after Wednesday's class, not 2255 01:15:22,239 --> 01:15:24,920 right now. 2256 01:15:23,479 --> 01:15:26,799 And I have actually listed the exact 2257 01:15:24,920 --> 01:15:28,560 math that goes into it here. Okay? All 2258 01:15:26,800 --> 01:15:30,159 right. So, by the way, you can peek into 2259 01:15:28,560 --> 01:15:31,960 any layers' weights using its weight 2260 01:15:30,159 --> 01:15:33,319 attribute. 
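Putting the pieces together, here is a minimal sketch of the slot-filling model just described. The hardel import, its constructor arguments (the three things quoted: embedding dimension, dense dimension, number of heads), the dropout rate, and the loss are all assumptions based on the walkthrough; running it requires the course's hardel file on the path.
```python
import tensorflow as tf
from tensorflow.keras import layers

# Assumed import: the course's hardel file, with the constructor arguments
# following the three things quoted in the walkthrough.
from hardel import PositionalEmbedding, TransformerEncoder

max_query_length = 30
input_vocab_size = 8888     # distinct tokens in the ATIS queries
slot_vocab_size = 125       # 123 slot tags plus pad and [UNK]
embed_dim, dense_dim, num_heads = 512, 64, 5

inputs = layers.Input(shape=(max_query_length,), dtype="int64")        # 30 token ids per query
x = PositionalEmbedding(max_query_length, input_vocab_size, embed_dim)(inputs)
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)             # contextual embeddings
x = layers.Dense(128, activation="relu")(x)                            # per-token ReLU layer
x = layers.Dropout(0.3)(x)                                             # "a little dropout" (rate assumed)
outputs = layers.Dense(slot_vocab_size, activation="softmax")(x)       # 125-way softmax per token

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam",                                        # optimizer and loss assumed
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()   # compare the total against the 5.3 million quoted above

# Peek at the positional-embedding layer's weights: two tables,
# roughly (8888, 512) for tokens and (30, 512) for positions.
print([w.shape for w in model.layers[1].weights])
```
Note that the final Dense layer is applied to each of the 30 token positions independently, which is exactly the per-token softmax described earlier.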
This is the embedding 2261 01:15:31,960 --> 01:15:34,640 uh the positional embedding thing we 2262 01:15:33,319 --> 01:15:36,759 had. So, 2263 01:15:34,640 --> 01:15:39,440 we can click it and you can see here it 2264 01:15:36,760 --> 01:15:40,840 has two tables. There's the first table, 2265 01:15:39,439 --> 01:15:41,799 which is just the embedding table, which 2266 01:15:40,840 --> 01:15:43,560 says 2267 01:15:41,800 --> 01:15:45,840 there are 8,888 tokens in my 2268 01:15:43,560 --> 01:15:47,880 vocabulary and each of those tokens is 2269 01:15:45,840 --> 01:15:49,880 an embedding vector which is 512 long. 2270 01:15:47,880 --> 01:15:51,520 That is the first table here. And then 2271 01:15:49,880 --> 01:15:53,880 it has the second object, which is the 2272 01:15:51,520 --> 01:15:56,480 positional embedding, and it says here, 2273 01:15:53,880 --> 01:15:58,640 well, my sentences can be 30 long and 2274 01:15:56,479 --> 01:16:02,079 for each position of the 30-long 2275 01:15:58,640 --> 01:16:04,079 sentence, I will have a 512-long embedding. 2276 01:16:02,079 --> 01:16:05,439 Both these tables, as I mentioned earlier, 2277 01:16:04,079 --> 01:16:06,800 are packaged up inside, and you can 2278 01:16:05,439 --> 01:16:08,159 actually see what the weights are before 2279 01:16:06,800 --> 01:16:09,560 you do any training. 2280 01:16:08,159 --> 01:16:11,319 Okay? 2281 01:16:09,560 --> 01:16:13,400 So, all right. So, I'm going to stop 2282 01:16:11,319 --> 01:16:14,359 here uh because the model is going to 2283 01:16:13,399 --> 01:16:16,079 take a few minutes to run and we're 2284 01:16:14,359 --> 01:16:17,519 already at 9:45. 2285 01:16:16,079 --> 01:16:19,479 Um so, we will continue the journey on 2286 01:16:17,520 --> 01:16:20,560 Wednesday. If some of it is not super 2287 01:16:19,479 --> 01:16:21,799 clear, don't worry about it. It will 2288 01:16:20,560 --> 01:16:22,960 become much clearer on Wednesday. All 2289 01:16:21,800 --> 01:16:23,640 right? All right, folks, have a good 2290 01:16:22,960 --> 01:16:26,000 couple of days. I'll see you on 2291 01:16:23,640 --> 01:16:26,000 Wednesday.