Um, so let's start with a quick review. Last week we looked at BERT, how BERT was created, and we learned about this technique called masking, which is a kind of self-supervised learning. And the idea of masking was very simple. We asked ourselves the question: we have seen ways in which people can take images and pre-train models like ResNet on a vast body of images, but then for each image somebody had to go and label it, right? So for text we asked the question: well, what does it mean to label a piece of text when we don't actually have a clearly defined end goal in mind, except the general goal of pre-training things? And then we said, oh, well, what we can do is replace some of the words in every sentence with what you'd call a mask token, and then we just train the network to recover the blanks, to fill in the blanks. And this technique, which is one of many ways of doing what's called self-supervised learning, is called masking. And we described how if you essentially take all of Wikipedia, and for every sentence you mask it like this and then train a network to fill in the blanks, the resulting network becomes really good at doing all kinds of interesting things. In fact, the first such network, or one of the first such networks, was called BERT, and in your homework you've been looking at BERT and so on and so forth. That's masking.

Now we're going to switch gears and talk about a different kind of self-supervised learning, which is different from masking and which turns out to be weirdly more interesting and powerful. Okay, so we are going to look at another technique, and this technique is called next word prediction. Now, it is actually in some sense a special case of masking, where you're basically saying: take a sentence, and instead of randomly picking a word to blank out,
you're saying, "I'm just going to take the last word and make it a blank." Okay? And then you send the sentence in, and you have the machine just fill in the blank on the last word. Predict the next word. Okay? And you don't have to use full sentences for it. You can use parts of sentences, sentence fragments, as well. So if you take the same sentence as before, "The mission of the MIT Sloan School...", you can literally divide it up: you can give it "The" and ask it to predict "mission." You can give it "The mission" and ask it to predict "of." You give it "The mission of" and ask it to predict "the." You get the idea. So you can take every sentence fragment and literally just give it the first few words and have it predict the next one. First few, next one; first few, next one. Okay. So this is next word prediction.

And so what we're going to do now is actually take the transformer encoder architecture that we used to build BERT in the last class, and we're going to try to use it to solve next word prediction, to build a model that can do next word prediction. Okay. So this is what we have. So what we're going to do is take the phrase "the cat sat on the mat." So the phrase was, let's say, "the cat sat on the mat." So what you might want to do is say: okay, this is the input, this is the output. Input "the," output "cat." Then maybe you have input "the cat," and the output is "sat." "The cat sat," output "on," and so on. Right, you get the idea. And then finally, we have "the cat sat on the," output "mat." Right, this is basically what we have: all these inputs and outputs. But we're going to very compactly express it as if it's just coming in as one data point in one batch. And that's what we're doing here.
So what we're going to do is stack it up like this, where we have "the cat sat on the" on the left, meaning everything but the last word, and then we're going to take that same sentence and just shift it to the left by one, right? So "the cat sat on the mat": we cut off "mat," and that becomes the input. Then we cut off the first word, and that becomes the output. So when you look at it that way, you can see, right, you will want "the" to be used to predict "cat," you will want "the cat" to be used to predict "sat," and so on and so forth.

Okay, so this is just a little manipulation so that we don't have to have, you know, dozens of sentence examples just for one starting sentence.

So if you have something like this, what you can do is run it through positional input embeddings, like we have done before with BERT. Then we can run it through a whole bunch of transformer blocks, right? It's like a transformer stack. Then we get these contextual embeddings. Then we run them through maybe one or more ReLUs if you want, because it's always a good idea to stick some ReLUs at the very end. And then we basically attach a softmax to every one of the things that are coming out. Okay. And that softmax is actually going to be a softmax whose range is the entire vocabulary.

Okay. For now, let's assume that the vocabulary is just a vocabulary of words, not tokens. We'll get into tokens a bit later on in the class. For now, just assume it's words. And roughly speaking, let's say there are 50,000 words in our vocabulary. So each of these softmaxes, and this is exactly what we did for BERT, by the way, each of these softmaxes is like a 50,000-way softmax.

Okay.
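Concretely, that input/output shift can be set up in a few lines. A minimal sketch in PyTorch, using a toy word-level vocabulary (the lecture assumes a ~50,000-word one):

```python
import torch

# Toy word-level vocabulary; a real model would use ~50,000 entries.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
sentence = ["the", "cat", "sat", "on", "the", "mat"]
ids = torch.tensor([vocab[w] for w in sentence])

# One compact training example: the input is everything but the last word,
# and the target is the same sequence shifted left by one position.
inputs = ids[:-1]    # "the cat sat on the"
targets = ids[1:]    # "cat sat on the mat"
print(inputs.tolist(), targets.tolist())
```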
But here's what we're going to do when we look at it this way: since we fundamentally care about next word prediction, as you will see later on, we are actually going to ignore all these predictions, because who cares? We are only going to look at the last one to figure out: okay, what is the last prediction? What is it? Because the last prediction is going to be based on everything that came before it. So this is really the next word that's actually being predicted. All the things before it we don't care about so much.

Okay. And all this will become slightly clearer because you're going to make a couple of passes through it. Yeah?

>> How do we...?

>> So, um, the notion of a sentence has disappeared at this point. What we're going to do is, when we look at how we tokenize the input for these kinds of models, we're actually going to take punctuation into account. So we're going to take periods into account, exclamation marks into account, and so on and so forth. And that'll answer your question, and we'll come back to that. Okay, so this is what we have. So, all right. So just to be clear: the embedding that's coming out of the final dense layer is passed through its own softmax, with the number of softmax categories equal to the vocab size. Okay.

All right. So first of all, let's say we train a model like this with lots of inputs and outputs. Okay, this just looks like BERT, right? It's not that different, except that there's no notion of a mask.

Do you notice any problems with the way this thing has been set up?

>> Uh, like for some words, like "the," you're going to have a lot of potential output pairs that come out of that.

>> True. Which means that if you have a word like "the," the next word is...

>> ...hard to predict.

>> It's true.
So some words may be hard to predict, depending on the last word of the input sentence. Right, that's what you're getting at. Yeah. Any other concerns? Yeah?

>> Since you're using contextual [embeddings], the output of the first word is going to have access to the second word, and so it's kind of like cheating.

>> Bingo. So remember, "bingo" is a technical term in deep learning which means "great." So if you go to this, right, as she points out: look at the self-attention layer. Remember, the self-attention layer is the key building block of the transformer block, right? And in the self-attention layer, for every word we calculate its contextual embedding by weighted averaging over its relationship to all the other words in the sentence. So the last word can see the first word, the first word can see the last word, and so on and so forth, right? But when you're doing next word prediction, this feels problematic, because you're peeking into the future, right?

So let's say that you want to predict the next word. If you look at this architecture, what it can simply do is copy it from the input, because it can see the whole sentence. So if I tell you, hey, "the cat sat on the mat," and then I just give you "the cat sat on the" — can you predict the next word for me? You'll be like, yeah, duh, it's "mat."

The whole thing becomes challenging only if I say "the cat sat on the ___" — now predict the blank.

So to put it another way: let's say that you have fed in the first two words and you want to predict this. This is the right answer for the prediction. The network should only use the first two.
However, because self-attention can see "sat" — it can see this next word — it'll trivially learn to predict the next word to be "sat," right? There is no challenge for it.

So this is the key problem, right? This is the key problem with just using the transformer as-is.

>> What's our loss function here?

>> The loss function in all these things is actually the same as before, which is that it's applied to every output that's coming out. So imagine you have just a traditional classification problem, in which you have one output — let's say you're classifying things into 10 categories, like we did with Fashion-MNIST — so you have 10 outputs, right? And that goes through a softmax, and then you have 10 probabilities, and there we use cross-entropy, right? So here, for every one of these outputs, we use cross-entropy. So we take this output, and there's a cross-entropy just for that, plus a cross-entropy for the next one, and so on and so forth. So we still minimize cross-entropy, but the sum of all these cross-entropies.

>> And does it get complicated at all by the fact that we have a large vocabulary size now?

>> I mean, it gets complicated just because there are more things to worry about — compute and so on and so forth. But conceptually, no difference; whether you have 10 or 50,000, it's the same thing. It's just that instead of classifying one input into one of 10 categories, the inputs themselves are as long as the number of words in your sentence. So each word that comes into your sentence is being classified in one of 50,000 ways, right? So essentially you have as many classification problems as you have words in the sentence. But at the end of the day, the loss function is just the sum of all those things, or, to be more precise, the average of all those things.
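As a rough sketch of that loss, assuming word-level IDs and a model that emits one 50,000-way logit vector per position (the logits here are random stand-ins, just to show the shape of the computation):

```python
import torch
import torch.nn.functional as F

vocab_size = 50_000
seq_len = 5  # positions for the input "the cat sat on the"

# Stand-in logits: one vocab-sized score vector per input position.
logits = torch.randn(seq_len, vocab_size)
targets = torch.tensor([1, 2, 3, 0, 4])  # next-word IDs: "cat sat on the mat"

# Per-position cross-entropy is -log p(correct next word); F.cross_entropy
# applies the softmax internally and averages over positions by default.
loss = F.cross_entropy(logits, targets)
print(loss.item())
```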
Actually, I think I may have a slide about this, which I may have hidden because I wasn't sure if I would have time. Let's unhide it. And, by the way, I did not agree ahead of time that we were going to set this up like this. Okay. So, all right. So, yeah, we still use the cross-entropy loss function. So, for each word that comes in, the cross-entropy is minus the log probability of the right answer — and you may recall this from earlier in the class. So we just do the same thing for "cat," "sat," "on," "the," everything. And then we just take the average, 1/7. Boom. That's it.

So, to go back to this problem. So this is the issue. The issue is that we can't allow words to be predicted knowing the future. They should only know about the past words. Okay. So what do we do? Right? We have to make a change to the transformer to make it work for next word prediction. So what we're going to do is this: when we are calculating the contextual embedding for a word — remember, the contextual embedding for a word is a weighted average of all the other words' embeddings — we will simply give zero weight to future words.

If you give zero weight to future words, it's almost as if they don't exist.

Okay? And this will become clear in a second. So imagine that this is the thing we are going to calculate. For every word in the sentence, we are calculating the pairwise attention weights — and you will remember I went through this, you know, with the iPad last week — we calculate all the weights.
So, for example, all these weights in every row will add up to one. And so you take the embeddings of "the cat sat on the," multiply them by the respective weights that add up to one — which is the first row of this table — and that gives you the contextual embedding for the word "the," and so on and so forth. And since we can't look at the future words, all we do is go to this table and just zero out everything in red.

Okay, we just zero everything here out, and then we renormalize so that the remaining cells, the non-zeroed cells, will still add up to one in each row. So what that means is that only this part is going to play a role; for "cat," only this part is going to play a role. So let's give an example. To predict "on," you'll only look at the words "the cat sat."

Okay. The rest of it will not be considered at all. Now, by the way, this tweak is called causal self-attention. It is also called masked self-attention. Right? Just different labels for the same thing. And so what that means is that when you're looking at the input, for "the," only "the" is going to be used to predict "cat." When you look at "the cat," only these two are going to be used to predict "sat," and so on and so forth.

Okay.
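Here is a small sketch of that zero-out-and-renormalize step on a matrix of raw attention scores. The usual trick is to set the future positions' scores to negative infinity before the softmax, which gives them exactly zero weight while the softmax renormalizes each row:

```python
import torch
import torch.nn.functional as F

seq_len = 6  # "the cat sat on the mat"
scores = torch.randn(seq_len, seq_len)  # raw pairwise attention scores

# Causal mask: position i may only attend to positions j <= i. Setting the
# future positions' scores to -inf makes their post-softmax weight exactly
# zero, and the softmax renormalizes each row so it still sums to 1.
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
weights = F.softmax(scores.masked_fill(future, float("-inf")), dim=-1)

print(weights)  # upper triangle is all zeros; each row sums to 1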
So this thing here this so all we 378 00:14:28,159 --> 00:14:32,559 do is we go into a transformer and we 379 00:14:30,240 --> 00:14:36,360 just change each attention head to be a 380 00:14:32,559 --> 00:14:36,359 causal attention head 381 00:14:38,559 --> 00:14:42,399 and the way it's actually done under the 382 00:14:40,078 --> 00:14:44,399 hood is actually very elegant for 383 00:14:42,399 --> 00:14:46,399 computational efficiency purposes but I 384 00:14:44,399 --> 00:14:49,600 won't get into it because it gets a bit 385 00:14:46,399 --> 00:14:52,559 you know involved but the key idea is 386 00:14:49,600 --> 00:14:54,959 replace basic plain vanilla attention 387 00:14:52,559 --> 00:14:57,119 with causal attention aka pay mass 388 00:14:54,958 --> 00:14:59,359 attention 389 00:14:57,120 --> 00:15:01,120 and you do that boom suddenly it it 390 00:14:59,360 --> 00:15:04,079 starts you know working for an expert 391 00:15:01,120 --> 00:15:06,000 prediction it can't cheat anymore 392 00:15:04,078 --> 00:15:10,198 and when we do that we get the 393 00:15:06,000 --> 00:15:10,198 transformer causal encoder 394 00:15:11,440 --> 00:15:15,360 and by the way the word causal here 395 00:15:13,519 --> 00:15:19,440 there's no connection to causality so 396 00:15:15,360 --> 00:15:20,800 it's just a it's just a term 397 00:15:19,440 --> 00:15:24,240 so if you look at the original 398 00:15:20,799 --> 00:15:26,319 transformer paper um 399 00:15:24,240 --> 00:15:28,000 it was created for translation for 400 00:15:26,320 --> 00:15:30,560 machine translation you know English to 401 00:15:28,000 --> 00:15:32,480 German right those kinds of use cases so 402 00:15:30,559 --> 00:15:34,399 it had something called an encoder which 403 00:15:32,480 --> 00:15:35,839 we are very familiar with from last week 404 00:15:34,399 --> 00:15:38,000 and then it had something called a 405 00:15:35,839 --> 00:15:40,480 decoder right and it is called the 406 00:15:38,000 --> 00:15:42,000 encoder decoder architecture and we are 407 00:15:40,480 --> 00:15:43,278 not going to cover the encoder decoder 408 00:15:42,000 --> 00:15:45,679 architecture because we are not covering 409 00:15:43,278 --> 00:15:48,958 machine translation in this class but 410 00:15:45,679 --> 00:15:51,439 I'm mentioning this because the this 411 00:15:48,958 --> 00:15:52,559 part of the the architecture is called a 412 00:15:51,440 --> 00:15:55,360 decoder 413 00:15:52,559 --> 00:15:57,758 because it uses see here there is a 414 00:15:55,360 --> 00:15:59,199 masked attention business going on here 415 00:15:57,759 --> 00:16:02,959 because it is using this masked 416 00:15:59,198 --> 00:16:05,278 attention it's called a decoder so 417 00:16:02,958 --> 00:16:06,799 the transformer causal encoder is also 418 00:16:05,278 --> 00:16:09,360 referred to sometimes as a transformer 419 00:16:06,799 --> 00:16:11,039 decoder but the word decoder has two 420 00:16:09,360 --> 00:16:12,560 meanings 421 00:16:11,039 --> 00:16:14,319 right it's a synonym for the causal 422 00:16:12,559 --> 00:16:17,359 encoder like we have seen today it's 423 00:16:14,320 --> 00:16:19,040 also used to refer to sequencetosequence 424 00:16:17,360 --> 00:16:21,519 translation problems for the second part 425 00:16:19,039 --> 00:16:23,198 of its architecture so you just have 426 00:16:21,519 --> 00:16:25,120 keep it it'll become clear from context 427 00:16:23,198 --> 00:16:26,399 what we're talking about in this course 428 00:16:25,120 --> 00:16:27,278 of course there is no confusion 
because we're not going to be looking at translation, right? We may say "decoder" or "causal encoder" — it's the same thing.

>> So I thought there were some transformers that use bidirectional [attention]. Is that different from this?

>> No. All "bidirectional" means is: I can see everything. So the encoder we looked at last week — the basic self-attention thing — is bidirectional. Basically, all it means is that I can look in both directions to see what other words are there. In causal attention, you're not using the ones in the future. Correct.

All right. So, to summarize where we are: this is what we looked at last week for BERT, and this is a transformer encoder. We take the same thing, and instead of multi-head attention we do causal multi-head attention, and we get the decoder, a.k.a. causal encoder.

Okay. And we use the left one for masked prediction; we use the right one for next word prediction.

All right. So now, instead of having an encoder, if you have a causal encoder — a TCE here — we can train models for next word prediction using the same exact approach as before, right? We set up the inputs and the outputs like I described earlier. We run it through a stack of causal encoders, dense layers, ReLU, softmax, and so on and so forth, right? Otherwise the details don't change, but the all-important changes go into the attention layer, to make it masked, or causal.

Any questions so far?

>> Uh, yeah. This would only apply when we're training the model, not when we're validating and testing, right?

>> Uh, so if you give me a sentence after training, right, the final prediction is the only thing you care about, and by definition the final prediction will use everything that came before it. So we are okay. Was that your question?
>> No, I think the fact that we're zeroing out the weights on the future words — I thought that would apply more when we're training the model and we're trying to minimize the loss, as opposed to when we're asking it for the next [word].

>> Right, but the point is, when we actually use them, what is the objective? Like, what do we want to do when we actually use them for inference? Once we finish training, our objective is: given a particular string, get me the next word, right? And to find the next word, you can in fact use everything that came before it. And therefore, without any change to this model, it'll just work for your intended purpose. You don't have to go in there and change it; you don't have to unmask it for inference, because you don't need to.

>> Yes.

>> Uh, I have one question regarding, like, when we do the causal transformers: we are putting certain weights to zero for the words which are to be predicted, and then we—

>> No — the words that are in the future.

>> The future, yeah. And then we normalize it.

>> Correct.

>> And we had trained a transformer earlier on all the words packed together. So won't there be a difference in weights between both the things?

>> Between the two ways of training? The weights are going to be very different, and they are two different models. BERT is used for certain things, and this kind of model, which is the basis of GPT, is going to be used for other things.

>> We are training it as well like that, I mean, while putting some of the weights to zero.

>> Correct, correct.
So what I'm talking about here is: what we're trying to do is say, let's take next word prediction as the self-supervised learning task, and we want to train such a model on a vast amount of text data, right? Well, we can't just use what we did last week, because it's not going to work, because of the fact that it can see the future. Therefore we make a tweak, and then we build this model. Now the question becomes: okay, what can you do with such a model? Right? We have basically trained two different kinds of models: the one that can see everything, BERT, and the one that can't see the future, which is actually GPT. So what can you do with it? We're going to come to that.

Okay. All right. So now, once you train such a model, given any input sentence — let's say that the sentence is "it was a dark and" — right, it goes through all these things. And remember what I said earlier: the fact that it's predicting something after just seeing each earlier word — we don't really care. All we're really curious about is: what is the next thing it's going to say? And the next thing it's going to say is basically going to be what's coming out of this last softmax.

Does that make sense? We don't care about anything that came before it, because we already have, like, a half-formed sentence, and we just want to find the next thing here. So we only care about this. I mean, those other outputs will come out of the architecture of the model, but we throw them out. We don't even pay any attention to them. Okay, we only look at what's coming out of this last one here. And what comes out of the softmax, remember, is a 50,000-way table of probabilities. That's what a softmax is, right? It's a whole bunch of probabilities that add up to one.
And so, let's say, for example, that you have a table starting with "aardvark" all the way to "zebra," right? And these are the probabilities. So, "it was a dark and..." — you know, just for kicks I put "stormy" as the highest-probability entry, but these numbers will add up to one. We have this table. Okay. And then what we do is choose a token from this table. We get to choose, right? There's a whole bunch of numbers in this table, and we get to choose a token. The simplest thing one can think of is to just choose the word that is the most likely, right? And we choose the word that's most likely here. And we're going to have a whole section on how to choose these things coming up. Okay, for now let's go with the simple option: we're going to just choose the one that's most likely, "stormy" at 0.6. And then we attach it to the input. So now the input has become "it was a dark and stormy." We run it through, and again we only care about the last softmax.

Okay. We do that, we get another table — and it turns out the table keeps changing, because the softmax is different each time you run it through, because the input has changed. So you get a new table, and it turns out the most likely word is "night." Okay. So "night" comes out the other end, we attach "night" here, and we keep on going, right? We can keep on going, maybe until we tell the model: okay, generate up to 100 tokens and stop. It might stop after 100, or the model may decide, in fact, that when it sees a punctuation mark, like a period or an exclamation mark or something, it's going to stop. Okay. And we have control over when it stops and how it stops.
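That predict-append-repeat loop is only a few lines. A minimal greedy-decoding sketch, assuming a hypothetical `model` callable that maps a sequence of token IDs to one probability vector per position:

```python
import torch

def generate(model, ids, max_new_tokens=100, stop_ids=()):
    """Greedy autoregressive loop: predict the next word, append it, repeat."""
    for _ in range(max_new_tokens):
        probs = model(ids)[-1]              # only the last position's softmax matters
        next_id = int(torch.argmax(probs))  # greedy: take the most likely token
        ids = torch.cat([ids, torch.tensor([next_id])])
        if next_id in stop_ids:             # e.g. the IDs for '.' or '!'
            break
    return ids
```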
But this is sort of the basic process, and you folks are all very used to it, because you've all been playing with ChatGPT and the like, right? So the basic building block is: next word prediction, feed it back to the input, next word prediction, keep on doing it. Right? You keep on doing it, and suddenly, you know, it's writing entire novels for you.

Um, yeah?

>> Does that mean that the longer the initial input is, the better a prediction you get?

>> Um, it depends on your objective. So fundamentally, you have some task you want the thing to do for you, right? And you need to give it all the information it can possibly find useful. Yeah. So the more helpful the input, the better — maybe that's how I would say it.

Uh, yeah?

>> Would this also apply to something like Google Search? Do they also do next-letter prediction, or would this just be a deeper [model]?

>> Yeah. So Google autocomplete, for example — I don't know if they actually use this kind of model under the hood or not. I just don't know. Um, these things tend to be kept tightly under wraps. So I don't know if you folks have seen, recently, over the last few months, there is a generative AI panel that opens up when you do a Google search. That panel, I suspect, uses this. But I don't know if the default Google autocomplete actually uses it or not, because it's very compute-heavy, right? So I don't know what they do.

Um, so yeah, this is what you do. Other questions on this, on the mechanics of it? Yeah?

>> For our vocabulary list, I'm assuming it's static?

>> Yeah, correct. And as you will see here, it's not really a word vocabulary.
It's a token vocabulary. But yes, it is static for a given model.

>> And so, I guess I'm assuming, for Google or any other sort of search engine, that wouldn't necessarily be static, because the model would be different. I'm sort of thinking about what happens with new words that get formed, and how it handles them if the vocabulary is static.

>> There's a very elegant solution to that coming up.

Okay. Um, all right. So now, in other words, we have learned how to do sequence generation. We already saw that we can do classification with BERT; we can do labeling with BERT-like models, which are trained on masked prediction. And for generating sequences, now we know how to do it: we just need to use a transformer causal encoder.

Okay.

Now, these kinds of models — sequence generation models trained on text sequences using next word prediction — are called autoregressive language models, or causal language models. Okay. And of course the GPT family is perhaps the most well-known example of an autoregressive, causal language model. "Autoregressive" because people who have done econometrics and some regression know the notion of autoregression: it means that you predict something, and then you use the past predictions as inputs the next time you predict, right? So this is the notion of autoregression: you predict, you feed the prediction back, get the next prediction, and keep on cycling through. Yes?
>> So when you're putting an input into GPT, for example, and it, you know, shows you the next words as they're coming — is that an indication of it doing this recalculation that you described here?

>> Correct. That's exactly what's going on. Uh, in fact, if you use the API, there is a thing called the streaming API, where it'll actually stream each token that's coming out of every pass, and you can see everything very clearly. But when you actually work with the web interface and you see the thing almost as if it's typing like a human — what I've heard from people, and I don't know if this is true, is that they can actually do it much faster. They slow it down intentionally to give you the feeling that it's actually coming from a human.

So it's like a UX trick: slow it down to make it feel as if someone is actually typing something on the other end. So when you're interacting with a chatbot, for example, sometimes you see it typing slowly — you can see the bubble and you can see the typing. It's intentionally slowed down, because, you know, it's obviously a bot otherwise, right? So there's a little bit of UX creepiness maybe going on. I don't know to what extent this is 100% true and how pervasive it is, but folks who work in the field have told me that this actually is not uncommon.

So, okay, that's what's going on here. These are language models, and of course GPT-3 is an autoregressive language model. And the reason why we have an "L" in front of the "LM" is because it was trained on lots of data with lots of parameters, right? Someone does this at some point, and it's not a small language model anymore; it's a large language model. So, yeah, that's "LLM" — nothing more momentous than that.

So, as it turns out, GPT-3 uses 96 transformer blocks — 96 blocks — and each block has 96 causal attention heads.

Okay. And you can read the GPT-3 paper; it gives you all the details of the architecture.
That is interesting, because for GPT-4 they didn't publish the architecture. After GPT-3, everything became closed. So we actually don't know what the architecture is, even though there's a lot of speculation on Twitter. But for GPT-3, we know exactly what happened, right? 96 blocks, each with 96 causal attention heads. Um, and then the data: they scraped 30 billion sentences from a whole bunch of sources — web text, Wikipedia, a bunch of book databases. And then they basically just took those 30 billion sentences and trained it on exactly next word prediction. That's it.

Now, when they trained GPT-3, I think it cost them a lot of money, because we hadn't yet figured out how to do things as efficiently as we know now. But it was still pretty amazing, and I'll talk about what is so special about GPT-3 in just a minute or two. So this is what we have here. And as you folks have seen, the notion of generating text is very powerful, right? Because we can obviously generate text, but we can also generate code, because code is just text. We can generate documentation for code, we can summarize text, we can answer questions, we can do chat — I mean, the list goes on. All the excitement we've seen around GenAI from the time ChatGPT came out is precisely because the simple idea of text in, text out is just so flexible. It's so versatile. It can handle all sorts of use cases. That's why there's so much excitement.

Um, by the way, if you're really curious, I would actually recommend seeing this video where this guy Andrej Karpathy builds GPT from scratch. Okay, it's a fantastic video. If you have even a little bit of curiosity about how these things are actually built, I would strongly recommend checking it out.
Um and 820 00:30:38,000 --> 00:30:41,519 there's also a little blog post where 821 00:30:39,440 --> 00:30:43,519 this person you know basically if you 822 00:30:41,519 --> 00:30:46,079 know numpy you can actually create GPD3 823 00:30:43,519 --> 00:30:50,240 GPD using numpy without any using any 824 00:30:46,079 --> 00:30:52,319 frameworks and things like that. So um 825 00:30:50,240 --> 00:30:53,759 I I found it super interesting and 826 00:30:52,319 --> 00:30:55,439 helpful to understand what exactly is 827 00:30:53,759 --> 00:30:57,759 going on. So if you would like to do 828 00:30:55,440 --> 00:31:00,320 this. Okay. So now we're going to talk 829 00:30:57,759 --> 00:31:03,679 about um decoding sampling strategies 830 00:31:00,319 --> 00:31:05,278 which is I said that when we produce uh 831 00:31:03,679 --> 00:31:07,759 when when when we come up with the 832 00:31:05,278 --> 00:31:10,398 softmax for that last token right we 833 00:31:07,759 --> 00:31:13,278 have 50,000 choices. What do we pick 834 00:31:10,398 --> 00:31:15,759 right as it turns out to actually get 835 00:31:13,278 --> 00:31:17,839 really good performance out of uh genai 836 00:31:15,759 --> 00:31:19,919 systems like charge you need to be quite 837 00:31:17,839 --> 00:31:21,678 thoughtful about the how to decode right 838 00:31:19,919 --> 00:31:25,278 how to actually sample from that table. 839 00:31:21,679 --> 00:31:27,600 So we'll talk about that for a bit. So, 840 00:31:25,278 --> 00:31:29,119 so the first of all definition the 841 00:31:27,599 --> 00:31:30,639 process of choosing a token from the 842 00:31:29,119 --> 00:31:32,479 probability distribution from the coming 843 00:31:30,640 --> 00:31:34,399 out of the softmax right I'm sticking 844 00:31:32,480 --> 00:31:36,640 this table right here this is the 845 00:31:34,398 --> 00:31:38,798 softmax right this process of choosing 846 00:31:36,640 --> 00:31:40,720 it is called decoding that's a technical 847 00:31:38,798 --> 00:31:42,480 term for it right we have to we get this 848 00:31:40,720 --> 00:31:44,480 table we have to decode meaning we have 849 00:31:42,480 --> 00:31:48,079 to pick something from this table okay 850 00:31:44,480 --> 00:31:51,038 that's called decoding now 851 00:31:48,079 --> 00:31:53,359 there are two sort of extreme cases of 852 00:31:51,038 --> 00:31:55,038 very highly simple ways to do 853 00:31:53,359 --> 00:31:56,558 The first thing of course is just pick 854 00:31:55,038 --> 00:31:58,798 the one just pick the word with the 855 00:31:56,558 --> 00:32:02,240 highest probability. 856 00:31:58,798 --> 00:32:03,918 This is called greedy decoding. 857 00:32:02,240 --> 00:32:06,640 Okay. 858 00:32:03,919 --> 00:32:08,240 So in this case for example if stommy is 859 00:32:06,640 --> 00:32:10,880 6 the highest probability in this whole 860 00:32:08,240 --> 00:32:14,558 table we just pick stommy. Okay. So that 861 00:32:10,880 --> 00:32:15,760 is the obvious extreme simple case. The 862 00:32:14,558 --> 00:32:18,240 other thing we can do which is also 863 00:32:15,759 --> 00:32:20,480 super simple is that because we have a 864 00:32:18,240 --> 00:32:22,319 probability table here, we can just 865 00:32:20,480 --> 00:32:24,880 reach into the table and sample a word 866 00:32:22,319 --> 00:32:27,519 out of it, right? 
In proportion to its 867 00:32:24,880 --> 00:32:28,640 probability, which means that if you if 868 00:32:27,519 --> 00:32:30,960 if you have this table and you're 869 00:32:28,640 --> 00:32:33,519 sampling from it, if you sample from it 870 00:32:30,960 --> 00:32:36,480 100 times, 60 times you probably get 871 00:32:33,519 --> 00:32:38,079 Stormy because the probability is 6. But 872 00:32:36,480 --> 00:32:39,919 some small fraction of the time you may 873 00:32:38,079 --> 00:32:42,798 get strange things like oddwark and 874 00:32:39,919 --> 00:32:44,080 zebra and so on and so forth, 875 00:32:42,798 --> 00:32:46,558 right? you're just literally doing 876 00:32:44,079 --> 00:32:48,960 random sampling. 877 00:32:46,558 --> 00:32:50,558 That's a fine way to do it too, right? 878 00:32:48,960 --> 00:32:53,200 There's nothing wrong with that. So 879 00:32:50,558 --> 00:32:56,158 these these are both options. So the key 880 00:32:53,200 --> 00:32:58,080 thing you need to remember is that the 881 00:32:56,159 --> 00:32:59,600 which one you pick and there are some 882 00:32:58,079 --> 00:33:01,519 variations on it which we'll get to in a 883 00:32:59,599 --> 00:33:03,278 moment. What you pick, which way to 884 00:33:01,519 --> 00:33:05,519 decode you pick really depends on what 885 00:33:03,278 --> 00:33:08,558 your task is, what you're trying to use 886 00:33:05,519 --> 00:33:10,880 the the system for, right? The LLM for. 887 00:33:08,558 --> 00:33:13,839 So the the the broad thing to remember 888 00:33:10,880 --> 00:33:16,559 is that if you're working on questions 889 00:33:13,839 --> 00:33:19,678 for which the factual accuracy of the 890 00:33:16,558 --> 00:33:22,000 response is really important 891 00:33:19,679 --> 00:33:24,480 and or you want the output to be 892 00:33:22,000 --> 00:33:26,159 deterministic meaning every time you ask 893 00:33:24,480 --> 00:33:28,720 it a particular question you really want 894 00:33:26,159 --> 00:33:31,120 the same answer back right you can 895 00:33:28,720 --> 00:33:33,120 imagine a customer call support agent 896 00:33:31,119 --> 00:33:34,639 where there two different customers ask 897 00:33:33,119 --> 00:33:37,678 the same question and they get different 898 00:33:34,640 --> 00:33:40,000 answers right you don't want that so you 899 00:33:37,679 --> 00:33:41,679 want determinist IC outputs. So in those 900 00:33:40,000 --> 00:33:43,759 situations, you should use greedy 901 00:33:41,679 --> 00:33:45,519 decoding is a good starting point 902 00:33:43,759 --> 00:33:48,879 because you will get you know you won't 903 00:33:45,519 --> 00:33:51,679 get any random stuff because for any 904 00:33:48,880 --> 00:33:53,120 given input sentence the softmax that 905 00:33:51,679 --> 00:33:55,600 comes out of that table is not going to 906 00:33:53,119 --> 00:33:57,119 change. It's the same table and if 907 00:33:55,599 --> 00:33:58,398 you're always picking the highest number 908 00:33:57,119 --> 00:34:03,038 in the table that's not going to change 909 00:33:58,398 --> 00:34:05,199 either. So guaranteed determinism 910 00:34:03,038 --> 00:34:07,359 and I found that for reasoning questions 911 00:34:05,200 --> 00:34:08,960 and things where you know you're asking 912 00:34:07,359 --> 00:34:10,878 questions, math questions, reasoning 913 00:34:08,960 --> 00:34:12,878 questions, logic questions, you should 914 00:34:10,878 --> 00:34:15,598 really sort of keep it as sort of greedy 915 00:34:12,878 --> 00:34:18,319 as possible in my experience. Okay. 
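To make the two extremes concrete, here's a minimal sketch. The tiny vocabulary and the probabilities are made up for illustration; this shows the mechanics only, not what any production system literally runs:

```python
import numpy as np

# Toy softmax output over a tiny vocabulary (made-up numbers).
vocab = ["aardvark", "night", "stormy", "zebra"]
probs = np.array([0.02, 0.30, 0.60, 0.08])   # sums to 1

# Greedy decoding: always take the highest-probability token.
greedy = vocab[int(np.argmax(probs))]        # -> "stormy", every single time

# Random sampling: draw a token in proportion to its probability.
rng = np.random.default_rng(0)
sampled = vocab[rng.choice(len(vocab), p=probs)]  # "stormy" about 60% of the time

print(greedy, sampled)
```

Run the sampling line many times and the empirical frequencies approach the table's probabilities, which is exactly the behavior described above.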
Now, there are other situations where random sampling is actually the better option. If you're doing creative things (write a poem, write a haiku, write a screenplay), you do want a lot of creativity, in which case randomness is your friend: you get a variety of responses, a diversity of responses, and all of that is really good. The price you pay is that you lose determinism. The outputs are going to be stochastic; ask the same question again and again and the answer will vary. But in many cases that's okay; you don't care. So that's roughly how to think about it.

The other thing I want to say is that diversity of response is also important in its own right. If you imagine a chatbot that always responds in the same stilted, robotic fashion, it starts to get annoying. You want some variation in the output, because a human would never give you the same thing back. Though I must say that when I interact with call center agents, I think they're just cutting and pasting from a text library, so it already looks kind of robotic; maybe we're already used to this. Anyway, those are some of the things to keep in mind. Yeah?

>> If you're using random sampling, do you end up with a better estimate of the uncertainty? Are the probabilities more calibrated, in the sense that the table you end up with reflects the real probabilities you'd observe for the words in your corpus?

>> The table doesn't change regardless of how you sample from it. The table is the starting point for sampling. All of decoding is about which token you're going to pull out of the table.

>> Oh, so it doesn't impact the loss function.

>> No. All of that is fixed. You literally get the table, you can then forget how you got the table, and now decoding starts.

>> Is the reason it generates a different answer given the same prompt, if we run it again and again, that they're using random sampling?

>> Correct. That's exactly why. And I'll do a demo of it very shortly, because you can actually manipulate this.

>> If you do the prediction word by word, is there a way to make it resilient to mistakes? Like, if it says 'the night was dark and hard work', that can mess up the next word, right?

>> It can totally mess it up.

>> So how does it get itself back on track?

>> It cannot. Great question, and we'll look at an example of things going off the rails in just a second. Yep?

>> Is this how Bing works, where you can slide between being more creative and more accurate?

>> Yeah, exactly. Bing has creative, balanced, precise, something like that. Under the hood they're basically manipulating some of the parameters we're going to look at in just a moment; they're manipulating them for you. But if you use the API, you can manipulate them directly.

Okay. So here's the basic thing to remember about random sampling. Our hope is that for any given sentence there is, intuitively, some set of good answers for the next word and a whole bunch of bad answers. So we want the probability mass on the good stuff. Imagine sorting the distribution from high to low probability: there's the head of the distribution, the first few words, and then there's the long tail of irrelevant words. Our hope is that the model is so good that for any given input phrase it concentrates the output probability in the softmax on just a few good words and more or less zeros out everything else. That's the ideal scenario, because then, if you do random sampling, by definition you'll pick something from the high-quality head of the distribution, and life is good.

So we want random sampling to sample from the head and not from the tail; that's the key point. And what do I mean by head and tail? Let's be very clear. Take the table we looked at, the softmax table that went from 'aardvark' to 'zebra', and sort it from high to low probability. Maybe 'stormy' has a probability of, I don't know, 0.6, and, if I remember right, 'night' had a probability of 0.3, and then there's a whole bunch of other words, all the way down to the 50,000th word, from highest to lowest probability. You can think of this as a probability distribution. The first few words are the head of the distribution, and the long run of words after them is the tail. We want our system to grab something from the head and not from the tail, because the head is the stuff that's actually relevant, useful, and good. That's really what we're trying to do here. Does it make sense? Okay.

So, to come back to this, here's the most important point to remember about this slide: while the probability of choosing any individual word in the long tail is pretty small, the probability of choosing some word from the tail is high. In this particular example, 0.6 plus 0.3 means there's a 0.9 probability the next word is either 'stormy' or 'night', but there's a 10% probability it's going to be one of the tail words, and who knows which one; it might be some random nonsense word. What that means, and this goes back to the point from before, is that if the LLM happens to sample a bad token from the tail, it won't be able to recover from its mistake; it'll just go off the rails. Which is why every word that gets generated is really important to get right: very often, it can't recover.

>> Is there a technical way to define the difference between the head and the tail?

>> No. It's just a common term people use, and the reason there's no formal definition is that it's so problem dependent. For one question the right number of words in the head is maybe 20; for a different question maybe it's 40; for a totally different model on the same question, maybe 10. Because of that variability, we just can't pin it down.

Okay. I'll show you how to do this in just a moment, but just for kicks, I went into GPT-3.5, typed 'students at the MIT Sloan School of Management are', and asked it to predict the next word. It turns out 'invited' is the most likely next word, followed by 'given', 'expected', 'required', and 'able'. These are the top five words, with probabilities around 3%, 2%: pretty small, but the remaining 50,000-odd words below them are even lower. So here the most likely word is 'invited'.

Then I went in and tried again with 'students at the MIT Sloan School of Management are invited' as my new prompt, and asked it to autocomplete from there. It came back with: invited 'to submit their original white papers to the annual MIT' something. Seems reasonable, right? Doesn't seem bad.

Okay, now let's mess it up a bit. I noticed that the word 'masters' and the word 'spending' had much lower probabilities than those top five words; I just mucked around until I found them. 'Masters' is only 0.05% and 'spending' is 0.1%, so they're clearly in the tail; they're not the most likely. So I asked: what happens if I force it to use 'masters', and then force it to use 'spending'? This is what you get. 'Students at the MIT Sloan School of Management are masters of chaos. They routinely blow past deadlines, fracture...' and then I couldn't take it anymore; I stopped it.

Then, changing just that single word, I forced 'spending', the other unlikely word: 'Students at the Sloan School of Management are spending the semester learning life skills' (so far it looks promising) 'through knitting socks'.

I'm not making this stuff up. This is GPT-3.5. So yes, it will go off the rails; you have to be super careful. And so, the way we tame random sampling to make it work for us... yes?

>> Do you think these sentences refer to something that was in the training set, like 'masters of chaos' or 'blow past deadlines'?

>> Yeah, that's the thing: it's doing very rough, approximate pattern matching over all the training data it was trained on. It doesn't mean, for example, that somewhere on the mit.edu website there was text saying MIT Sloan students were doing all this crazy stuff. It's probably more that a whole bunch of college and university websites had some content like that, or a bunch of Reddit people were posting stuff like that. It's just rough pattern matching. The thing you always have to remember with large language models is that what they're trying to give you is a response that is not implausible. There is no guarantee of correctness, no accuracy, nothing like that. It gives you a probabilistically plausible response; that's it. Now, us being Sloan, we look at stuff like this and get offended. We're imputing our values onto its generation, but it doesn't know, and it doesn't care.

In fact, when I typed in something like 'list all the awards that Professor Ramakrishnan has won', it gave me an amazing list of awards: apparently I won this and I won that. None of it is true. To which a student said, 'not yet'. So I made a note of that fine person's name. [laughter] So yeah, that's what's going on. Yeah?

>> I get the sense that maybe there's some sort of sliding window that's somehow weighting later words more strongly than earlier words, because I feel like the context of 'students at MIT' should have steered it in a certain direction even with the presence of the word 'masters'. Is there something like that happening?

>> No. Think about the training process. In training, we gave it sentence fragments and asked it to predict the next word. Clearly, the more you know about the input, and the longer the input, the more clues you have to figure out what the right next prediction is going to be. If I say 'the capital of', you'll think: I don't know, it's got to be a country, I guess, or a state, but I don't know anything more than that. But if I say 'the capital of France is': dramatic narrowing of the cone of uncertainty. That's basically what's going on. In fact, there's a very beautiful expression I've heard for what LLMs do: subtractive sculpting. What I mean by that is, you start with this big block of marble, every word chips away at the marble, and when you're done it's pretty clear there's a David inside the marble. That's sort of what's going on.

All right. So, to come back to this: what can we do? There are three ways you can tune random sampling to make it work for you. The idea behind all of them is that you have some probability distribution, and we're now going to manually focus on the head, kill everything else, and sample only from the head. Which immediately begs the question: how do you decide what the head is? That was Alina's question from before. One way is to say: I know we have 50,000 words in the vocabulary; I don't care. Each time, I'm only going to keep the top K words. K could be 10, 20, 30, 40, 50; it's very problem dependent. Say I keep the top 20 words, ignore everything else, and sample only from those. That's called top-K sampling. Here's how it works. Let's say this is your whole distribution (I just stopped the picture at 'wet' instead of going all the way to 50,000), and you decide you want K to be two. You grab the top two words, K = 2, and renormalize their probabilities so they add up to one: 0.6 and 0.2, renormalized, become 0.75 and 0.25. Now just treat that as the new softmax table you're sampling from, grab a word from it, and you're done.
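Here's a minimal sketch of that keep-K, renormalize, and sample idea, again with a toy table rather than a real 50,000-word vocabulary:

```python
import numpy as np

def top_k_sample(vocab, probs, k, rng):
    """Keep the k highest-probability tokens, renormalize, then sample."""
    top = np.argsort(probs)[-k:]         # indices of the k largest probabilities
    p = probs[top] / probs[top].sum()    # renormalize so the head sums to 1
    return vocab[rng.choice(top, p=p)]

vocab = np.array(["stormy", "night", "foggy", "wet"])
probs = np.array([0.6, 0.2, 0.1, 0.1])
rng = np.random.default_rng(0)

# With k=2, only "stormy" (0.6 -> 0.75) and "night" (0.2 -> 0.25) survive.
print(top_k_sample(vocab, probs, k=2, rng=rng))
```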
Okay, so that's top-K sampling, very commonly used. But it has a small shortcoming: it assumes that whatever K you've come up with, say 20, is right for every input sentence, that the right number of words in the head is always 20. That's obviously not a well-supported assumption; it's just an assumption. So the question becomes: can we do better? What you really want is for the words you keep to carry the bulk of the probability, as much probability as possible. You don't really care how many words are in the set, as long as together they have a lot of probability. Which brings us to something called top-p sampling, also called nucleus sampling, where instead of fixing the number of words we'll pick every time, we choose just enough words from the top so that their total probability is at least p. Sometimes that may be just two words; sometimes it may be 20 words. We don't care. And then we sample from them.

Okay, same example. Let's say you go with p = 0.9. Then 0.6 plus 0.2 is 0.8, plus 0.1 is 0.9; boom, we've hit 0.9. We stop, grab those three words, renormalize them, and sample from the result. In my opinion this is even more effective, because it adapts; it doesn't hardcode the number of words you think is important. Was there a question? Yeah?

>> What if 0.9 lands mid-word? Say 'foggy' was 0.12; would it only take 0.1 from 'foggy'?

>> What it does, when you give it 0.9, is keep adding words until it just crosses that number.

>> Yeah, I was thinking: can't you just set a threshold on the individual word probabilities, and not pick any word below it? With top-p, what if one word is 0.89 and the next is just 0.1? Then you'd pick two words.

>> Yeah, you can do that. In fact, you can always say: I'll just pick the single most likely word; you can do that. But if you say you'll only consider words whose individual probabilities are at least some threshold, then basically you're drawing a line, and the problem is you don't know how many words have crept over your threshold. To take your example: maybe you set 0.9 as the threshold, and there was a word at 0.89 that you just missed because it didn't make the threshold, and you'll think, oh no, I should have made it 0.89. There's no right answer, unfortunately. But this is exactly the kind of thinking that brought us these ways of tuning things. The foundation here is the realization that we cannot decide a priori what the right number of words is, so we have to find heuristics. In practice, people try all of these methods. In fact, you can do both at once: you can set it up so that you do top-p and top-K at the same time, basically saying grab words until you cross the probability p or you cross K, whichever comes first. Okay. So those are two methods people use heavily.
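To make top-p concrete as well, here's a sketch in the same toy setting; the same idea would apply unchanged to the full sorted 50,000-word table:

```python
import numpy as np

def top_p_sample(vocab, probs, p, rng):
    """Nucleus sampling: smallest head whose cumulative probability >= p."""
    order = np.argsort(probs)[::-1]                        # highest probability first
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    head = order[:cutoff]                                  # just enough words to reach p
    q = probs[head] / probs[head].sum()                    # renormalize the head
    return vocab[rng.choice(head, p=q)]

vocab = np.array(["stormy", "night", "foggy", "wet"])
probs = np.array([0.6, 0.2, 0.1, 0.1])
rng = np.random.default_rng(0)

# p=0.9 keeps "stormy", "night", "foggy" (0.6 + 0.2 + 0.1 = 0.9) and drops "wet".
print(top_p_sample(vocab, probs, p=0.9, rng=rng))
```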
The third method is called distribution... I'm sorry, temperature. The idea of temperature is that in top-K and top-p we have to decide on a number up front, K or p, then draw the line and look at the words that pass the threshold. Temperature is a softer way to do the same thing: a softer way to emphasize the head over the tail. Let me grab the iPad. All right.

So, the idea of temperature. Remember when we have this softmax table, 'aardvark' all the way to 'zebra', with all these probabilities: where did these probabilities come from? They came from a softmax. And what is a softmax? We had all these nodes, say 50,000 nodes in some output layer, and they were just numbers; let's call them a_1 through a_50000. We ran them through the softmax function, which computed e^(a_1), e^(a_2), all the way to e^(a_n), and divided each one by the sum of all of them to get the probabilities. So the first probability is e^(a_1) / (e^(a_1) + e^(a_2) + ... + e^(a_n)), and so on. That's how softmax works; I'm just refreshing your memory from a few weeks ago.

Now, what temperature does is introduce a new parameter, T, and divide every one of these numbers by T before exponentiating, so the probability of word i becomes e^(a_i / T) / (e^(a_1 / T) + e^(a_2 / T) + ... + e^(a_n / T)).

The effect of adding this little knob called temperature is very interesting. Assume for a second that T is a very, very small number, pretty close to zero. Since T is in the denominator, all the a_i / T values are going to become huge in magnitude: if a_i happens to be positive, it becomes really big, and if a_i is negative, it becomes a really, really large negative number. Now, in particular, the biggest of all the a values, which was already big, gets massive, which means its probability is going to dominate everything else, because you're raising e to a really big number. So if T is close to zero, the word corresponding to the biggest a will have a probability of one, or close to one. And since all the probabilities have to add up to one, everything else is going to be zero. So reducing the temperature toward zero makes the probability distribution peak at the biggest word and wipes out everything else. In practice, if you apply a very small temperature to a table like ours, 'stormy' gets something like 0.999 and everything else gets really, really small. In the limit where T goes to zero, that one probability is exactly one and everything else is exactly zero, and when one entry is one and the rest are zero, sampling just picks the big number: it becomes greedy decoding. So that's the value of having temperature as a knob.
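Here's a small sketch of that knob in action, with made-up output-layer scores:

```python
import numpy as np

def softmax_with_temperature(logits, t):
    """Softmax over logits / t: small t sharpens the head, large t flattens it."""
    z = np.asarray(logits, dtype=float) / t
    z -= z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, 0.1])   # toy pre-softmax scores a_1..a_4
for t in (0.1, 1.0, 10.0):
    print(t, np.round(softmax_with_temperature(logits, t), 3))

# t = 0.1: nearly all mass on the biggest score (approaches greedy decoding)
# t = 1.0: the ordinary softmax
# t = 10 : close to uniform; every word becomes almost equally likely
```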
Conversely, if you take the temperature T and make it bigger and bigger, as opposed to smaller and smaller, the distribution becomes flat, meaning all the words get roughly the same probability, and any one of them becomes equally likely. So: T close to zero, the biggest word gets picked; T above one, say 1.5 or 2, any word becomes likely, and it becomes truly random. That is the effect of temperature. And this knob is something you can actually tune.

All right. So, I'm at platform.openai.com; this is called the OpenAI Playground. In this playground you can put in whatever sentences you want, choose the model, and it will actually show you the softmax output. Very handy. So here are a few things I want to draw your attention to. The first is that you see temperature here; the default is one. If you make it zero, it becomes greedy decoding, but you can make it more than one if you want, and that will give you all kinds of crazy stuff, as you'll see in a second. OpenAI doesn't have support for top-K, but they do have support for top-p; you can set p right here. I'll ignore these other settings; you can read the documentation to understand those. You can also ask it to show the probabilities, so I'm going to turn that on, and I'm also going to tell it not to go nuts: just give me a few output tokens, say 30. And now I'm going to enter some sentences so we can see what's going on. Let's enter the same sentence as before: 'students at the MIT Sloan School of Management are'. I think that's what we had, right? Submit.

Okay, this is what it's filling out, and if you click on a word, you get all the probabilities. Pretty cool, right? You can see 'invited', 'given', 'expected'; these are all things we had. And so what you can do is go in and... wait, 'aching'? What is that? That's very weird. So let me check that I used the same sentence as before; it's very brittle. 'Students at the MIT Sloan School of Management are'... oh, I know what it is. Okay. Let's try that again.

Okay. So: 'invited', 3.18%. That's what we had, right? We had 3.8% before; close enough. So this is what we have. Now, if you want to force it to choose 'invited' here, you just go in there and make the temperature zero. Temperature zero means it always picks the best one: greedy decoding. So you can hit it again, and it had better give you 'invited'. See, it has given you 'invited'. That's how you manipulate it using temperature. You can also manipulate top-p; you can do all these things. People actually use this playground very heavily for debugging, when they're playing with a model and a bunch of data for a particular use case: you play with it to get a sense for what kinds of probability distributions you see, and then you can fine-tune using that knowledge. So, check it out.

Oh, and I said that if the temperature goes above one to a higher number, every word in the 50,000 becomes more or less equally likely, which means it's going to produce garbage, right? So let's actually see garbage production in action. All right, let's just nuke this. I'm going to take the temperature and max it out at two, which means that literally anything is possible. Submit.

Ladies and gentlemen, I present to you: a modern large language model.

Isn't it shocking? Because when we work with these language models and see them doing smart things, we ascribe to them some level of interesting abilities and intelligence and so on, and then you realize all I had to do was go in there and change one parameter, and it's garbage. You can see the amount of garbage it's producing just from twiddling one parameter. So in production use cases, when you're building applications on top of these large language models, you have to be very, very careful with these parameters. Pay attention. All right. So, what did I have next?

Okay, that brings us to the end of the decoding section. Now I'm going to switch gears and talk about tokenization. So far, in everything we've done, including the homeworks, we've used the standard tokenization process for taking a bunch of text and vectorizing it, the STIE process: standardize, tokenize, index, and then encode. And standardization, as I mentioned earlier, strips out punctuation, lowercases everything, sometimes removes stop words like 'a' and 'the', and also does things like stemming. But it turns out, if you've actually worked with something like GPT, you know that it hasn't stripped out punctuation; the punctuation is really good, and it uses case, uppercase and lowercase.
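You can check this yourself with OpenAI's open-source tiktoken library. This is just an illustrative check; the GPT-2 encoding is one of several you could load, and 'reldoh' anticipates the made-up word we're about to discuss:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")

# Punctuation and case survive the round trip: nothing is stripped or lowercased.
ids = enc.encode("Students at MIT Sloan are invited!")
print(ids)               # a list of integer token ids
print(enc.decode(ids))   # -> 'Students at MIT Sloan are invited!'

# A made-up word still tokenizes: it just gets split into subword pieces.
print([enc.decode([i]) for i in enc.encode("reldoh")])
```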
And in fact, even better, you can actually make up a word as part of your question, and it'll use the word consistently in the output. So, just for fun, I made up a word — I did this just yesterday or the day before. I said: here's a new word and its definition. The word is "relo"; the definition: a student who understands deep learning backwards. Please use this word in a sentence. And here is the sentence it came up with — I was a little shocked: "During the advanced neural network seminar, it became evident that Jane was a true relo, effortlessly explaining even the most complex deep learning concepts in reverse order." Okay. So it clearly knows how to use anything you might make up. It has the ability to compose things from scratch, as opposed to just looking stuff up. So where is that ability coming from, right? That's the question. And the answer is this very beautiful thing called byte pair encoding, which we'll look at next. So, all right. When we look at the standard STIE process, the disadvantages are some of the things we've discussed: we want to be able to preserve punctuation, we want to be able to preserve case, we want to be able to handle new words, and so on and so forth. So the modern models, like BERT and so on, use different tokenization schemes — they don't actually do the STIE thing. The GPT family uses byte pair encoding, BPE; BERT uses something called WordPiece. In all of these encoding schemes, the fundamental idea is to say: well, you know what, whatever language you're working with, why don't we start, first of all, with all the individual characters?
Because if you can actually work with individual characters, you can clearly compose any word that comes up, right? "relo" is just its individual letters — a handful of character tokens, if you're working at the character level. But working only with characters is not great, because it means you're giving the model no information about the world; it has to learn every word from scratch — what the word means, and so on and so forth. So it would be nice if we could give it words as well. But we don't want to give it infrequent words, because infrequent words, by definition, are not worth adding to your vocabulary — each one would just take up another embedding vector, and things like that. For infrequent words, we'll just compose them; we'll construct them on the fly, because we can always fall back to characters. Okay, so we don't want to put every word in there — we only want to put in frequent words. But to give this thing the ability to compose new words without always having to go all the way down to characters, we will also give it parts of words. These are called subwords. So the key idea is: let's come up with a way to build a vocabulary which has characters, full words that are frequent enough to be worth adding, and subwords — word fragments — that occur frequently enough to be worth adding. So, for example, take words like "standardize" and "normalize": the fragment "ize" is going to show up a lot, in many places. So you don't need to store "standardize" and "normalize" and so on as whole words; you can just have "ize" and attach it to all kinds of words, right, and make it all work. So that's the basic idea of all these tokenization schemes. And BPE is one such way to figure out how to actually construct this vocabulary from a training corpus, right?
And by the way, when I say characters, this includes not just, you know, uppercase and lowercase letters and digits — it will also include punctuation, so that all these things just become atomic units. All right. So the way BPE works is that we start with each character as a token — and I'll talk about the rest of what's on the page in just a moment; don't worry about it. We'll start with each character as a token. So let's say that your training corpus is just a single sentence: "The cat sat on the mat." Now, even though GPT does not actually do any lowercasing — uppercase "Th" is a different token from lowercase "th" — just for simplicity I'm going to standardize it here, so it becomes "the cat sat on the mat." And then I'm going to write it in this form where I basically put a comma after every word and a little underscore to show the space between the words. Okay, I'm going to write it in this format, and it'll become clear why in just a second. Now, my starting vocabulary is just all the individual letters in the training corpus. That's it; that's the starting point. And now we come to the key step: we merge the tokens that most frequently occur right next to each other. So if two characters — two tokens — occur right next to each other a lot, let's just merge them; they seem to occur together a lot, so we may as well merge them, right? And so here, for example, I've listed the frequencies of adjacent tokens. If you look at "t h": it shows up right next to each other here, and it also shows up here — so it shows up twice. Now, "h e," again, shows up here —
— and it also shows up here, so that also occurs twice. "c a," on the other hand, shows up only here and nowhere else, so it occurs once. "a t" shows up three times — in "mat," "sat," and "cat" — and so on and so forth; you get the idea. So you're just looking at pairwise adjacent tokens, and you pick the most frequent pair, which in this case happens to be "a t." Then you take "a" and "t" and you merge them, so it becomes "at." Okay. So when you do that — when you merge them — you add the new token you've just created to your vocabulary list, and then you update the corpus to reflect the merge you've just done. So the corpus is still "the cat sat on the mat," but now there is no separate "a" and "t"; there is just the combined "at" token. Are we good with this step so far? Take the most frequent pair and merge it. It's a way to compress the data. In fact, the algorithm came from someone trying to figure out a way to compress data. You know, think of it this way, right? Suppose I tell you I want you to compress a message I'm going to send you, and you look at all the past messages you've had to deal with, and it turns out certain characters occur next to each other all the time. Maybe, just for argument's sake, "abc" shows up ridiculously often in the messages. Then you'd say: you know what, if it's always showing up together, why treat it as three things? Let me just call it one thing, "abc." You send a single token called "abc" every time you need "abc" — not "a," "b," "c." That's the basic idea.
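Here's that counting-and-merging step as a toy Python sketch (my own illustration, not GPT's actual tokenizer code). For simplicity I drop the underscore space-marker and just treat the corpus as a list of words, each word being a list of tokens:

from collections import Counter

def count_pairs(corpus):
    """Count how often each adjacent token pair occurs across the corpus."""
    pairs = Counter()
    for word in corpus:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs

def merge_pair(corpus, pair):
    """Fuse every occurrence of `pair` into a single new token."""
    merged = []
    for word in corpus:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])   # e.g. 'a' + 't' -> 'at'
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged

corpus = [list(w) for w in "the cat sat on the mat".split()]
print(count_pairs(corpus).most_common(3))
# -> [(('a', 't'), 3), (('t', 'h'), 2), (('h', 'e'), 2)]
corpus = merge_pair(corpus, ("a", "t"))   # the first merge from the lecture
print(corpus)
# -> [['t','h','e'], ['c','at'], ['s','at'], ['o','n'], ['t','h','e'], ['m','at']]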
So if you come back here, that's what we have, and what we do now is run this adjacent-token calculation again on the updated corpus. You can see "t h" shows up here, and it shows up here, so you get two; "h e" shows up twice; everything else shows up once. And when many pairs show up with equal frequency, you just pick one of them randomly. So we pick "t h" and merge it, which means we add "th" to our vocabulary, and once we do that, we update the corpus. Now "th" is one thing, fused together, alongside the previously fused "at." That's the corpus after the second merge. Then we do the same thing: we find the frequencies of adjacent tokens, and it turns out "th" and "e" show up together twice while everything else shows up once, so we merge "th" and "e" to get — boom — "the." And now we have "the cat sat on the mat," with "the" as a single token. This process continues until we reach a predefined limit for our vocabulary. Now, as it turns out, when they built GPT-2 and GPT-3 — let me just see; I think I did some digging around on this. Yeah: for GPT-2 and 3 they set the vocabulary size to be roughly 50,000, so it basically kept on doing this until it hit a limit of 50,000, and then it stopped. GPT-4, on the other hand, goes all the way to a 100,000 vocabulary size. Okay, so this is BPE in action — the little training loop sketched below pulls the whole procedure together. And what happens once you finish all this is: you have your vocabulary, and you have all these merges you made — remember, here we merged "a" and "t" to get "at," "t h" became "th," and so on. When a new piece of text arrives, the tokenizer applies the merges in the exact same order.
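Putting the loop together, using the two helpers from the sketch above (and with the caveat that real BPE implementations also track the word-boundary marker and work on bytes rather than letters):

corpus = [list(w) for w in "the cat sat on the mat".split()]
vocab = {ch for word in corpus for ch in word}   # 9 starting characters
merges = []                                      # learned merge rules, in order

VOCAB_LIMIT = 12    # tiny toy limit; GPT-2/3 used ~50,000 and GPT-4 ~100,000
while len(vocab) < VOCAB_LIMIT:
    pairs = count_pairs(corpus)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent pair; ties go to the
    merges.append(best)               # first-seen pair (the lecture just picks one)
    vocab.add(best[0] + best[1])
    corpus = merge_pair(corpus, best)

print(merges)  # [('a', 't'), ('t', 'h'), ('th', 'e')] -- 'at', 'th', then 'the'
print(corpus)  # [['the'], ['c','at'], ['s','at'], ['o','n'], ['the'], ['m','at']]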
So if the new text that comes in is "the rat," it's first going to apply the "a t" → "at" merge to fuse that here, then it's going to fuse "t h" to get "th," and then it's going to fuse "th" and "e" to get "the." And the final list of tokens that goes into your model is going to be the token for "the," a token for the space, the token for "r," and the token for "at." So let's see this in action. OpenAI has its own tool, but I found this site to be really good. So let's tokenize "Hands-on deep learning." So you can see here — look at this: uppercase "H" is its own token, token number 39; the rest of "Hands" is its own token; the dash is its own token; "on" is its own token; and then " deep" with the leading space is its own token, and " learning" with the leading space is its own token. Okay, note one thing. Suppose you had typed just "deep deep learning": "deep" on its own has a different token than " deep" with a space in front. What they realized is that most words are going to show up after a space — that's much more likely — so having the space attached to the beginning of the word saves you a lot of, sort of, you know, tokens and compute and so forth, because words will in fact arrive almost all the time with a space before them. That's why they attached the space to the word itself. And note that "Deep" and "deep" are different — right, there's "Deep" and there's "deep" — so clearly it's taking case into account. Then I put an exclamation mark here — boom, that's its own token too. And so, ultimately, here's what goes in when you have a phrase like "The cat sat on the mat": you can see all the tokens it produces.
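And here's the replay of the learned merges on unseen text, plus a way to poke at a real tokenizer locally with OpenAI's tiktoken package (assuming you have it installed) instead of the website; the exact ids depend on which encoding you load:

def apply_merges(word, merges):
    """Tokenize new text by replaying the learned merges in the same order."""
    tokens = list(word)
    for pair in merges:
        tokens = merge_pair([tokens], pair)[0]
    return tokens

print(apply_merges("rat", merges))   # ['r', 'at']
print(apply_merges("the", merges))   # ['the']

# The real thing: byte-level BPE with GPT-2's learned merges.
import tiktoken
enc = tiktoken.get_encoding("gpt2")
print(enc.encode("deep"), enc.encode(" deep"))   # leading space changes the token
print(enc.encode(" deep"), enc.encode(" Deep"))  # and so does case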
So: uppercase " The" with a leading space is token 383; lowercase " the" is 262; and that's distinct from just "the" without any space — that's a different token again. So these are all the tokens. Now, um, let's try something. Let's try "Jane." So "Jane" is one token, which is great, and "and" is another token. Let's see — "Rama." Ah, darn: my name wasn't worthy enough to be its own token. Okay. But, strangely enough — and I was very surprised by this — if I put "rama" in lowercase, it is its own token. I have no idea which websites they were scraping. And if I put "Jane" here, now " J" has become its own token, with the space, and the rest has become a different token. So tokenization is a very interesting thing, and it works in very interesting ways — but that's the basic idea of what's going on under the hood. I would encourage you to check out your own names, to see whether they've actually been tokenized. So, all right, I'm done. Thanks, folks. I'll see you on Wednesday.