WEBVTT

00:00:16.320 --> 00:00:21.039
So, let's start with a quick review.

00:00:18.559 --> 00:00:23.599
Last week we looked at BERT, how BERT

00:00:21.039 --> 00:00:25.439
was created, and we learned about this

00:00:23.600 --> 00:00:27.199
technique called masking, which is a

00:00:25.439 --> 00:00:29.278
kind of self-supervised learning. And

00:00:27.199 --> 00:00:31.279
the idea of masking was very simple. We

00:00:29.278 --> 00:00:33.039
asked ourselves a question: we have

00:00:31.278 --> 00:00:35.519
seen ways in which people can take

00:00:33.039 --> 00:00:38.238
images and pre-train models like ResNet

00:00:35.520 --> 00:00:40.160
on a vast body of

00:00:38.238 --> 00:00:42.320
images but then for each image somebody

00:00:40.159 --> 00:00:44.000
had to go and label them right so for

00:00:42.320 --> 00:00:46.399
text we asked the question well what

00:00:44.000 --> 00:00:48.000
does it mean to label a piece of text

00:00:46.399 --> 00:00:49.920
when we don't actually have a clearly

00:00:48.000 --> 00:00:51.600
defined end goal in mind except the

00:00:49.920 --> 00:00:53.280
general goal of pre-training things

00:00:51.600 --> 00:00:55.359
right and then we said oh well what we

00:00:53.280 --> 00:00:57.280
can do is we can actually replace some

00:00:55.359 --> 00:00:59.280
of the words in every sentence with

00:00:57.280 --> 00:01:00.719
what's called a mask token, and

00:00:59.280 --> 00:01:03.760
then we just train the network to

00:01:00.719 --> 00:01:06.079
recover the blanks to fill in the blanks
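The masking mechanic just described can be sketched in a few lines (an illustrative helper, not BERT's actual preprocessing, which masks tokens at random positions with some probability):

```python
def mask_position(words, i, mask_token="[MASK]"):
    # Replace the word at position i with a mask token; the
    # replaced word becomes the training target to recover.
    masked = words[:i] + [mask_token] + words[i + 1:]
    return masked, words[i]

masked, target = mask_position(["the", "cat", "sat", "on", "the", "mat"], 2)
# masked -> ["the", "cat", "[MASK]", "on", "the", "mat"], target -> "sat"
```

The network is then trained so its output at the masked position matches the target word.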

00:01:03.759 --> 00:01:07.280
right and this technique which is one of

00:01:06.079 --> 00:01:08.798
many ways of doing what's called

00:01:07.280 --> 00:01:12.239
self-supervised learning is called

00:01:08.799 --> 00:01:14.479
masking and we and we described how if

00:01:12.239 --> 00:01:16.399
you essentially take all of Wikipedia

00:01:14.478 --> 00:01:19.118
and for every sentence you mask it like

00:01:16.400 --> 00:01:21.280
this and then train a network to

00:01:19.118 --> 00:01:23.200
fill in the blanks, the resulting

00:01:21.280 --> 00:01:25.280
network becomes really good at doing all

00:01:23.200 --> 00:01:27.200
kinds of interesting things and that in

00:01:25.280 --> 00:01:29.359
fact the first such network or one of

00:01:27.200 --> 00:01:31.040
the first such networks was called BERT

00:01:29.359 --> 00:01:32.319
and in fact in your homework you've

00:01:31.040 --> 00:01:34.479
been looking at BERT and so

00:01:32.319 --> 00:01:35.758
on and so forth right that's masking now

00:01:34.478 --> 00:01:37.118
we're going to switch gears and talk

00:01:35.759 --> 00:01:38.640
about a different kind of self-

00:01:37.118 --> 00:01:41.840
supervised learning which is different

00:01:38.640 --> 00:01:45.280
from masking which turns out to be

00:01:41.840 --> 00:01:47.200
weirdly more interesting and powerful

00:01:45.280 --> 00:01:49.040
okay so we are going to look at another

00:01:47.200 --> 00:01:52.560
technique and this technique is called

00:01:49.040 --> 00:01:54.079
next word prediction so now it is

00:01:52.560 --> 00:01:55.680
actually in some sense a special

00:01:54.078 --> 00:01:57.519
case of masking where you're basically

00:01:55.680 --> 00:01:59.759
saying take a sentence and instead of

00:01:57.519 --> 00:02:01.679
randomly picking a word and making

00:01:59.759 --> 00:02:03.040
it a blank. You're saying, "I'm just

00:02:01.680 --> 00:02:06.000
going to take the last word and make it

00:02:03.040 --> 00:02:08.560
a blank." Okay? And then you send the

00:02:06.000 --> 00:02:10.080
sentence in and then you have the

00:02:08.560 --> 00:02:12.239
machine just fill in the blank on the

00:02:10.080 --> 00:02:13.920
last word. Predict the next word. Okay?

00:02:12.239 --> 00:02:15.920
And you don't have to use full sentences

00:02:13.919 --> 00:02:17.598
for it. You can use parts of sentences

00:02:15.919 --> 00:02:20.000
for it. Sentence fragments as well. So

00:02:17.598 --> 00:02:21.919
if you take the same sentence as before,

00:02:20.000 --> 00:02:23.598
"the mission of the MIT Sloan School," you

00:02:21.919 --> 00:02:25.439
can literally divide it up: you

00:02:23.598 --> 00:02:27.598
can give it "the" and ask it to predict

00:02:25.439 --> 00:02:29.598
"mission." You can give it "the mission"

00:02:27.598 --> 00:02:31.598
and ask it to predict "of." You give it

00:02:29.598 --> 00:02:33.919
"the mission of" and ask it to predict "the." You

00:02:31.598 --> 00:02:35.518
get the idea. So every sentence fragment

00:02:33.919 --> 00:02:37.119
you can take and literally just give it

00:02:35.519 --> 00:02:38.959
the first few and then predict the next

00:02:37.120 --> 00:02:41.360
one: first few, next one; first few, next

00:02:38.959 --> 00:02:44.640
one. Okay. So this is next word

00:02:41.360 --> 00:02:46.239
prediction. So
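The prefix-and-next-word scheme just described can be sketched as follows (a hypothetical helper; splitting on whole words is a simplification):

```python
def next_word_examples(sentence):
    # Every prefix of the sentence becomes an input whose target
    # is the word that immediately follows it.
    words = sentence.split()
    return [(words[:i], words[i]) for i in range(1, len(words))]

pairs = next_word_examples("the cat sat on the mat")
# first pair: (["the"], "cat"); last pair: (["the", "cat", "sat", "on", "the"], "mat")
```

One sentence of six words yields five (prefix, next word) training examples.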

00:02:44.639 --> 00:02:47.839
what we're going to do now

00:02:46.239 --> 00:02:50.480
is we're going to actually take the

00:02:47.840 --> 00:02:52.800
transformer encoder architecture that we

00:02:50.479 --> 00:02:54.479
used to build BERT in the last class, and

00:02:52.800 --> 00:02:56.239
we're going to try to use it to solve

00:02:54.479 --> 00:02:58.560
next word prediction to build a model

00:02:56.239 --> 00:03:01.039
that can do next word prediction. Okay.

00:02:58.560 --> 00:03:03.519
So this is what we have.

00:03:01.039 --> 00:03:08.120
So what we're going to do is, if you

00:03:03.519 --> 00:03:08.120
take the phrase the cat sat on the mat.

00:03:09.199 --> 00:03:15.199
So the phrase was let's say the cat

00:03:13.598 --> 00:03:16.719
sat

00:03:15.199 --> 00:03:18.639
on

00:03:16.719 --> 00:03:20.400
the mat.

00:03:18.639 --> 00:03:24.199
So what you might want to do is to say

00:03:20.400 --> 00:03:24.200
okay this is the input

00:03:25.519 --> 00:03:30.000
output

00:03:27.680 --> 00:03:33.120
the cat.

00:03:30.000 --> 00:03:36.400
Then maybe you have the cat

00:03:33.120 --> 00:03:39.840
then the output is sat.

00:03:36.400 --> 00:03:42.239
The cat sat on and so on. Right, you get

00:03:39.840 --> 00:03:45.200
the idea. And then finally, we have the

00:03:42.239 --> 00:03:48.480
cat sat on the, and the output is

00:03:45.199 --> 00:03:50.158
mat. Right, this is basically what

00:03:48.479 --> 00:03:51.679
we have all these inputs and outputs.

00:03:50.158 --> 00:03:54.000
But we're going to very compactly

00:03:51.680 --> 00:03:56.480
express it as if it's just coming in

00:03:54.000 --> 00:03:58.639
through as one sort of data point in

00:03:56.479 --> 00:04:00.079
one batch. And that's what we're doing

00:03:58.639 --> 00:04:02.158
here. So what we're going to do is we're

00:04:00.080 --> 00:04:04.879
going to stack it up like this where we

00:04:02.158 --> 00:04:07.120
have "the cat sat on the" on the left,

00:04:04.878 --> 00:04:08.560
meaning everything but the last word and

00:04:07.120 --> 00:04:10.158
then we're going to take that same

00:04:08.560 --> 00:04:13.199
sentence and just shift it to the left

00:04:10.158 --> 00:04:15.438
by one, right? So from "the cat sat on the mat" we

00:04:13.199 --> 00:04:17.599
cut off the mat right and that becomes

00:04:15.438 --> 00:04:19.918
the input then we cut off the first word

00:04:17.600 --> 00:04:22.160
and that becomes the output so when you

00:04:19.918 --> 00:04:25.599
look at it that way you can see here

00:04:22.160 --> 00:04:29.040
right? You will want "the" to be used

00:04:25.600 --> 00:04:31.120
to predict "cat"; you will want "the cat" to be

00:04:29.040 --> 00:04:32.800
used to predict "sat," and so on and so

00:04:31.120 --> 00:04:35.759
forth.
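That shift-by-one stacking trick, expressed as code (a minimal sketch; the names are mine):

```python
def make_shifted_pair(tokens):
    # Input drops the last token; target drops the first, i.e. the
    # same sequence shifted left by one. Position i of the input is
    # then trained to predict position i of the target.
    return tokens[:-1], tokens[1:]

inp, tgt = make_shifted_pair(["the", "cat", "sat", "on", "the", "mat"])
# inp -> ["the", "cat", "sat", "on", "the"]
# tgt -> ["cat", "sat", "on", "the", "mat"]
```

One data point now carries all the prefix examples at once.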

00:04:32.800 --> 00:04:37.918
Okay, so this is just a little sort of

00:04:35.759 --> 00:04:40.400
manipulation so that we don't have to

00:04:37.918 --> 00:04:42.319
have you know like dozens of sentences

00:04:40.399 --> 00:04:44.799
or sentence examples just for one

00:04:42.319 --> 00:04:46.879
starting sentence.

00:04:44.800 --> 00:04:49.040
So if you have something like this, what

00:04:46.879 --> 00:04:50.639
you can do is you can run it through

00:04:49.040 --> 00:04:53.280
positional input embeddings like we have

00:04:50.639 --> 00:04:54.560
done before with BERT. Then we can

00:04:53.279 --> 00:04:56.879
run it through a whole bunch of

00:04:54.560 --> 00:04:59.360
transformers, right? It's like a

00:04:56.879 --> 00:05:01.680
transformer stack. Then we get these

00:04:59.360 --> 00:05:03.680
contextual embeddings. Then we run them

00:05:01.680 --> 00:05:05.439
through maybe one or more ReLUs if you

00:05:03.680 --> 00:05:08.079
want because it's always a good idea to

00:05:05.439 --> 00:05:11.680
stick some ReLUs at the very end. And

00:05:08.079 --> 00:05:13.038
then we basically attach a softmax to

00:05:11.680 --> 00:05:17.038
every one of the things that are coming

00:05:13.038 --> 00:05:20.159
out. Okay. And then that softmax is

00:05:17.038 --> 00:05:23.759
actually going to be a softmax whose

00:05:20.160 --> 00:05:25.840
range is the entire vocabulary.

00:05:23.759 --> 00:05:27.199
Okay. For now, let's assume that the

00:05:25.839 --> 00:05:29.198
vocabulary is just a vocabulary of

00:05:27.199 --> 00:05:30.800
words, not tokens. We'll get into tokens

00:05:29.199 --> 00:05:32.639
a bit later on in the class. For now,

00:05:30.800 --> 00:05:33.919
just assume it's words. And roughly

00:05:32.639 --> 00:05:36.160
speaking, let's say there are 50,000

00:05:33.918 --> 00:05:38.079
words in our vocabulary. So each of

00:05:36.160 --> 00:05:39.759
these softmaxes, and this is exactly

00:05:38.079 --> 00:05:42.000
what we did for BERT, by the way. Each

00:05:39.759 --> 00:05:43.919
of these softmaxes is like a 50,000-way

00:05:42.000 --> 00:05:47.839
softmax.
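A shape-level sketch of this head (random numbers stand in for the transformer's contextual embeddings; the sizes and names are illustrative):

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax over one position's vocabulary logits.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

seq_len, vocab_size = 6, 50_000  # ~50,000-word vocabulary, as in the lecture
random.seed(0)
# Stand-in logits: one row per input position, one column per vocabulary word.
logits = [[random.gauss(0, 1) for _ in range(vocab_size)] for _ in range(seq_len)]
probs = [softmax(row) for row in logits]  # one 50,000-way softmax per position
```

Every position gets its own full distribution over the vocabulary, and each distribution sums to 1.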

00:05:43.918 --> 00:05:50.319
Okay. But what we're going to do is here

00:05:47.839 --> 00:05:52.079
when we look at it this way

00:05:50.319 --> 00:05:54.159
since we are fundamentally bothered

00:05:52.079 --> 00:05:55.519
about next word prediction as you will

00:05:54.160 --> 00:05:57.360
see later on we are actually going to

00:05:55.519 --> 00:05:59.519
ignore all these predictions because who

00:05:57.360 --> 00:06:02.800
cares? We are only going to look at the

00:05:59.519 --> 00:06:04.560
last one to figure out okay what is the

00:06:02.800 --> 00:06:06.960
last prediction? What is it? Because the

00:06:04.560 --> 00:06:09.759
last prediction is going to be based on

00:06:06.959 --> 00:06:11.279
everything that came before it here. So

00:06:09.759 --> 00:06:13.120
this is really the next word that's

00:06:11.279 --> 00:06:16.318
actually being predicted. All the things

00:06:13.120 --> 00:06:17.840
before we don't care so much.
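"Only look at the last one" amounts to something like this at inference time (illustrative names and toy probabilities, not real model outputs):

```python
def predict_next_word(per_position_probs, vocab):
    # Ignore the predictions at earlier positions; only the last one
    # conditions on the entire input, so only it gives the next word.
    last = per_position_probs[-1]
    best = max(range(len(vocab)), key=lambda i: last[i])
    return vocab[best]

vocab = ["the", "cat", "sat", "on", "mat"]
# Toy per-position distributions over this tiny vocabulary:
probs = [
    [0.1, 0.6, 0.1, 0.1, 0.1],    # ignored
    [0.1, 0.1, 0.6, 0.1, 0.1],    # ignored
    [0.05, 0.05, 0.1, 0.1, 0.7],  # only this row is used
]
next_word = predict_next_word(probs, vocab)  # -> "mat"
```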

00:06:16.319 --> 00:06:18.879
Okay. And all this will become slightly

00:06:17.839 --> 00:06:20.879
clearer because you're going to make a

00:06:18.879 --> 00:06:24.079
couple of passes through it. Yeah.

00:06:20.879 --> 00:06:27.519
>> How do we

00:06:24.079 --> 00:06:29.038
So, the notion of a sentence has

00:06:27.519 --> 00:06:30.879
disappeared at this point. What we're

00:06:29.038 --> 00:06:33.680
going to do is when we look at how we

00:06:30.879 --> 00:06:35.038
tokenize the input for these kinds of

00:06:33.680 --> 00:06:36.879
models, we're actually going to take

00:06:35.038 --> 00:06:37.918
punctuation into account. So we're going

00:06:36.879 --> 00:06:39.680
to take periods into account,

00:06:37.918 --> 00:06:41.279
exclamation marks into account and so on

00:06:39.680 --> 00:06:42.478
and so forth. And that that'll answer

00:06:41.279 --> 00:06:47.839
your question and we'll come back to

00:06:42.478 --> 00:06:49.360
that. Okay, so this is what we have.

00:06:47.839 --> 00:06:50.799
All right. So just to be clear, the

00:06:49.360 --> 00:06:52.560
embedding that's coming out of the final

00:06:50.800 --> 00:06:54.400
dense layer is passed through its own

00:06:52.560 --> 00:06:58.160
softmax with the number of softmax

00:06:54.399 --> 00:07:01.918
categories equal to the vocab size. Okay.

00:06:58.160 --> 00:07:04.080
All right. Um okay. So

00:07:01.918 --> 00:07:05.918
first of all, let's say we train

00:07:04.079 --> 00:07:08.478
a model like this with lots of

00:07:05.918 --> 00:07:10.159
inputs and outputs. Okay, this just

00:07:08.478 --> 00:07:11.598
looks like BERT, right? It's not that

00:07:10.160 --> 00:07:13.680
different except that there's no notion

00:07:11.598 --> 00:07:15.279
of a mask.

00:07:13.680 --> 00:07:19.519
Do you notice any problems with the way

00:07:15.279 --> 00:07:21.519
this thing has been set up?

00:07:19.519 --> 00:07:23.758
>> like for some words like the you're

00:07:21.519 --> 00:07:25.680
going to have a lot of potential output

00:07:23.759 --> 00:07:27.598
pairs that come out of that.

00:07:25.680 --> 00:07:29.120
>> True. Which means that if you have a

00:07:27.598 --> 00:07:29.839
word like the the next word

00:07:29.120 --> 00:07:32.319
>> hard to predict.

00:07:29.839 --> 00:07:35.198
>> It's true. So some words may be hard to

00:07:32.319 --> 00:07:36.560
predict depending on the last word of

00:07:35.199 --> 00:07:39.120
the sentence that was the input. Right.

00:07:36.560 --> 00:07:41.199
That's what you're getting at. Yeah. Any other

00:07:39.120 --> 00:07:43.759
concerns?

00:07:41.199 --> 00:07:46.080
Yeah, go ahead.

00:07:43.759 --> 00:07:48.960
>> Since you're using contextual embeddings,

00:07:46.079 --> 00:07:51.198
the output of the first word is

00:07:48.959 --> 00:07:53.839
going to have access to the second word

00:07:51.199 --> 00:07:55.360
and so it's kind of like cheating.

00:07:53.839 --> 00:07:58.959
>> Bingo.

00:07:55.360 --> 00:08:01.759
So remember, "bingo" is a technical

00:07:58.959 --> 00:08:05.439
term in deep learning which means great.

00:08:01.759 --> 00:08:08.000
So if you go to this, right, as she

00:08:05.439 --> 00:08:11.279
points out, look at the self-

00:08:08.000 --> 00:08:12.959
attention layer. Remember, the self-

00:08:11.279 --> 00:08:15.279
attention layer is the key building

00:08:12.959 --> 00:08:17.439
block of the transformer block right and

00:08:15.279 --> 00:08:19.839
so in the self-attention layer every

00:08:17.439 --> 00:08:23.360
word we calculate its contextual

00:08:19.839 --> 00:08:26.239
embedding by weighted averaging

00:08:23.360 --> 00:08:28.560
of its relationship to all other words

00:08:26.240 --> 00:08:30.079
in the sentence. So the last word can

00:08:28.560 --> 00:08:31.360
see the first word, the first word can

00:08:30.079 --> 00:08:33.199
see the last word and so on and so

00:08:31.360 --> 00:08:34.800
forth, right? But when you're doing next

00:08:33.200 --> 00:08:38.240
word prediction, this feels problematic

00:08:34.799 --> 00:08:40.958
because you're peeking into the future,

00:08:38.240 --> 00:08:42.320
right? So

00:08:40.958 --> 00:08:43.598
so let's say that you want to predict

00:08:42.320 --> 00:08:46.560
the next word. If you look at this

00:08:43.599 --> 00:08:48.560
architecture, what it can simply do, it

00:08:46.559 --> 00:08:50.559
can simply copy it from the input

00:08:48.559 --> 00:08:52.639
because it can see the whole sentence.

00:08:50.559 --> 00:08:55.119
So if I tell you, hey, the cat sat on

00:08:52.639 --> 00:08:56.958
the mat. If I just gave you the cat sat

00:08:55.120 --> 00:08:58.720
on the, can you predict the next word

00:08:56.958 --> 00:09:01.278
for me? You'll be like, yeah, duh,

00:08:58.720 --> 00:09:02.879
it's mat.

00:09:01.278 --> 00:09:04.559
The whole thing becomes challenging only

00:09:02.879 --> 00:09:07.039
if I say the cat sat on the dash. Now

00:09:04.559 --> 00:09:09.759
predict the dash.

00:09:07.039 --> 00:09:11.838
So to put it another way let's say that

00:09:09.759 --> 00:09:13.919
you want to predict right you have fed

00:09:11.839 --> 00:09:15.600
in the first two words and you want to

00:09:13.919 --> 00:09:17.838
predict this. This is the right answer

00:09:15.600 --> 00:09:20.639
for the prediction. The network should

00:09:17.839 --> 00:09:23.279
only use the first two.

00:09:20.639 --> 00:09:26.480
However, because self-attention can

00:09:23.278 --> 00:09:28.240
see "sat", it can see this next word,

00:09:26.480 --> 00:09:31.039
it'll trivially learn to predict the

00:09:28.240 --> 00:09:34.480
next word to be "sat",

00:09:31.039 --> 00:09:37.278
right? There is no challenge for it.

00:09:34.480 --> 00:09:38.480
So, this is the key problem, right? This

00:09:37.278 --> 00:09:41.600
is the key problem if we're just using the

00:09:38.480 --> 00:09:43.278
transformer as is.

00:09:41.600 --> 00:09:44.959
>> What's our loss function here?

00:09:43.278 --> 00:09:46.320
>> The loss function in all these things is

00:09:44.958 --> 00:09:48.559
actually the same as before, which is

00:09:46.320 --> 00:09:50.240
that for every output that's coming out.

00:09:48.559 --> 00:09:52.479
So imagine you have just a traditional

00:09:50.240 --> 00:09:54.799
classification problem in which you

00:09:52.480 --> 00:09:56.560
have one output, let's say

00:09:54.799 --> 00:09:57.919
you're classifying things into 10

00:09:56.559 --> 00:10:00.239
categories like we did with

00:09:57.919 --> 00:10:02.000
Fashion-MNIST, right? 10 classes. So you have 10

00:10:00.240 --> 00:10:03.759
outputs right and that goes through a

00:10:02.000 --> 00:10:05.759
softmax and then you have 10

00:10:03.759 --> 00:10:09.679
probabilities and there we use cross

00:10:05.759 --> 00:10:12.559
entropy right so here for every one of

00:10:09.679 --> 00:10:14.000
these things we use cross entropy so we

00:10:12.559 --> 00:10:16.078
take this output and there's a cross

00:10:14.000 --> 00:10:18.000
entropy just for that, plus cross

00:10:16.078 --> 00:10:20.479
entropy for that and so on and so forth

00:10:18.000 --> 00:10:21.759
So we still minimize cross entropy,

00:10:20.480 --> 00:10:22.560
but the sum of all these cross

00:10:21.759 --> 00:10:24.078
entropies.

00:10:22.559 --> 00:10:26.319
>> And does it get complicated at all by

00:10:24.078 --> 00:10:27.679
the fact we have a large vocabulary size

00:10:26.320 --> 00:10:29.040
now?

00:10:27.679 --> 00:10:30.239
>> I mean, it gets complicated just

00:10:29.039 --> 00:10:32.240
because there are more things to worry

00:10:30.240 --> 00:10:33.919
about compute and so on and so forth.

00:10:32.240 --> 00:10:35.360
But conceptually no difference whether

00:10:33.919 --> 00:10:37.759
you have 10 or 50,000 it's the same

00:10:35.360 --> 00:10:39.278
thing. It's just that instead of

00:10:37.759 --> 00:10:41.278
classifying an input into one of 10

00:10:39.278 --> 00:10:42.958
categories, the inputs

00:10:41.278 --> 00:10:45.039
themselves are as long as the number of

00:10:42.958 --> 00:10:46.799
words in your sentence. So each word

00:10:45.039 --> 00:10:49.679
that comes into your sentence is being

00:10:46.799 --> 00:10:51.838
classified in one of 50,000 ways, right?

00:10:49.679 --> 00:10:53.278
So essentially you have as many

00:10:51.839 --> 00:10:55.040
classification problems as you have

00:10:53.278 --> 00:10:56.240
number of words in a sentence. But at

00:10:55.039 --> 00:10:58.078
the end of the day, the loss function is

00:10:56.240 --> 00:10:59.440
just a sum of all those things or to be

00:10:58.078 --> 00:11:02.319
more precise, the average of all those

00:10:59.440 --> 00:11:03.600
things.

00:11:02.320 --> 00:11:05.600
Actually, I think I may have a slide

00:11:03.600 --> 00:11:07.440
about this which I may have hidden

00:11:05.600 --> 00:11:13.079
because I wasn't sure if I would have

00:11:07.440 --> 00:11:13.079
time. Uh let's unhide it.

00:11:17.360 --> 00:11:20.560
and by the way, I did not arrange ahead of time that

00:11:19.519 --> 00:11:23.120
we're going to set this up like this.

00:11:20.559 --> 00:11:25.599
Okay. So, all right. So, yeah. So, we

00:11:23.120 --> 00:11:27.759
still use the cross entropy

00:11:25.600 --> 00:11:30.480
loss function. So, each

00:11:27.759 --> 00:11:33.120
word that comes in. So, the cross

00:11:30.480 --> 00:11:35.200
entropy is actually minus log

00:11:33.120 --> 00:11:36.480
probability of the right answer. And you

00:11:35.200 --> 00:11:38.399
may recall this from earlier in the

00:11:36.480 --> 00:11:41.039
class. So we just do the same thing for

00:11:38.399 --> 00:11:43.278
"cat," "sat," "on," "the," everything. And then

00:11:41.039 --> 00:11:46.519
we just take the average: 1/7 times the sum. Boom.

00:11:43.278 --> 00:11:46.519
That's it.
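The per-position cross entropies and their average, as just described, in a minimal sketch (toy probabilities stand in for the softmax outputs at each position):

```python
import math

def avg_cross_entropy(per_position_probs, targets):
    # Cross entropy at each position is -log p(correct next word);
    # the total loss is the average over all positions.
    losses = [-math.log(p[t]) for p, t in zip(per_position_probs, targets)]
    return sum(losses) / len(losses)

# Toy example: the probability each position assigned to its correct word.
probs = [{"cat": 0.5}, {"sat": 0.25}, {"on": 0.5}]
loss = avg_cross_entropy(probs, ["cat", "sat", "on"])
# (-ln 0.5 - ln 0.25 - ln 0.5) / 3
```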

00:11:47.360 --> 00:11:52.000
So, to go back to this problem.

00:11:50.240 --> 00:11:55.680
So this is the issue. The issue is that

00:11:52.000 --> 00:11:57.440
we can't allow words to be predicted

00:11:55.679 --> 00:12:00.159
knowing the future. They should only

00:11:57.440 --> 00:12:02.240
know about the past words. Okay. So what

00:12:00.159 --> 00:12:03.919
do we do? Right? We have to make a

00:12:02.240 --> 00:12:06.320
change to the transformer to make it

00:12:03.919 --> 00:12:07.838
work for next word prediction. So what

00:12:06.320 --> 00:12:09.278
we're going to do is when we are

00:12:07.839 --> 00:12:11.200
calculating the contextual embedding for

00:12:09.278 --> 00:12:13.039
a word, remember the contextual

00:12:11.200 --> 00:12:14.560
embedding for a word is going to be a

00:12:13.039 --> 00:12:17.199
weighted average of all the other words

00:12:14.559 --> 00:12:20.000
embeddings. We will simply give zero

00:12:17.200 --> 00:12:22.000
weight to future words.

00:12:20.000 --> 00:12:26.240
If you give zero weight to future words,

00:12:22.000 --> 00:12:27.519
it's almost as if they don't exist.

00:12:26.240 --> 00:12:31.600
Okay? And this will become clear in a

00:12:27.519 --> 00:12:32.959
second. So imagine that this is the

00:12:31.600 --> 00:12:34.879
thing we are going to calculate. These

00:12:32.958 --> 00:12:38.239
are all for every word in the sentence

00:12:34.879 --> 00:12:41.439
we are calculating the pairwise

00:12:38.240 --> 00:12:43.039
attention weight and you will remember I

00:12:41.440 --> 00:12:45.440
went through this you know with like an

00:12:43.039 --> 00:12:48.240
iPad thing last week we calculate all

00:12:45.440 --> 00:12:51.279
the weights. So, for example:

00:12:48.240 --> 00:12:54.320
all these weights in every row

00:12:51.278 --> 00:12:56.799
will add up to one and so you take the

00:12:54.320 --> 00:12:58.560
contextual embeddings of the cat sat on

00:12:56.799 --> 00:12:59.838
the, multiply them by the respective

00:12:58.559 --> 00:13:01.439
weights that add up to one which is the

00:12:59.839 --> 00:13:02.639
first row of this table and that gives

00:13:01.440 --> 00:13:05.120
you the contextual embedding for the

00:13:02.639 --> 00:13:07.440
word the and so on and so forth. And

00:13:05.120 --> 00:13:10.159
since we can't look at the future words

00:13:07.440 --> 00:13:14.600
all we do is we go take this table and

00:13:10.159 --> 00:13:14.600
we just zero everything out in red.

00:13:14.720 --> 00:13:19.519
Okay, we just zero everything here out

00:13:17.278 --> 00:13:22.240
and then we renormalize so that the

00:13:19.519 --> 00:13:25.278
remaining cells, the non-zeroed-out cells,

00:13:22.240 --> 00:13:27.519
will still add up to one in each row. So

00:13:25.278 --> 00:13:29.519
what that means is that, for "the,"

00:13:27.519 --> 00:13:31.759
only this thing is going to play a

00:13:29.519 --> 00:13:32.959
role; for "cat,"

00:13:31.759 --> 00:13:36.399
only this thing is going to play a role.

00:13:32.958 --> 00:13:39.439
So let's give an example. So,

00:13:36.399 --> 00:13:43.839
to calculate,

00:13:39.440 --> 00:13:46.959
or rather to predict "on," you'll only look at the

00:13:43.839 --> 00:13:48.639
words "the cat sat."

00:13:46.958 --> 00:13:51.359
Okay. The rest of it will not be

00:13:48.639 --> 00:13:54.000
considered at all. Now the effect of

00:13:51.360 --> 00:13:56.240
doing all this is that, by the way, this

00:13:54.000 --> 00:13:58.559
is called causal self-attention. This

00:13:56.240 --> 00:14:01.198
tweak is called causal self-attention.

00:13:58.559 --> 00:14:02.799
It is also called masked self-attention.

00:14:01.198 --> 00:14:05.198
Right? Just different labels for the

00:14:02.799 --> 00:14:07.439
same thing. And so what that means is

00:14:05.198 --> 00:14:10.159
that when you're looking at the input

00:14:07.440 --> 00:14:12.720
for "the," only "the" is going to be used to

00:14:10.159 --> 00:14:15.600
predict "cat."

00:14:12.720 --> 00:14:18.240
When you look at "the cat," only these two are

00:14:15.600 --> 00:14:22.759
going to be used to predict "sat," and so

00:14:18.240 --> 00:14:22.759
on and so on and so forth.
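The zero-out-and-renormalize step can be sketched directly on the table of attention weights (a pure-Python illustration of the idea, not the efficient masked-logits implementation used in practice):

```python
def causal_weights(weights):
    # weights[i][j] is how much word i attends to word j; each row
    # sums to 1. Zero out future positions (j > i), then renormalize
    # so each row still sums to 1 over the surviving entries.
    out = []
    for i, row in enumerate(weights):
        kept = [w if j <= i else 0.0 for j, w in enumerate(row)]
        s = sum(kept)
        out.append([w / s for w in kept])
    return out

uniform = [[0.25] * 4 for _ in range(4)]  # e.g. the words "the cat sat on"
masked = causal_weights(uniform)
# row 0 -> [1.0, 0.0, 0.0, 0.0]; row 1 -> [0.5, 0.5, 0.0, 0.0]; ...
```

Under the hood, real implementations instead add negative infinity to the future positions' logits before the softmax, which produces the same zero weights without a second normalization pass.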

00:14:24.159 --> 00:14:30.240
Okay. So all we

00:14:28.159 --> 00:14:32.559
do is we go into a transformer and we

00:14:30.240 --> 00:14:36.360
just change each attention head to be a

00:14:32.559 --> 00:14:36.359
causal attention head

00:14:38.559 --> 00:14:42.399
and the way it's actually done under the

00:14:40.078 --> 00:14:44.399
hood is actually very elegant for

00:14:42.399 --> 00:14:46.399
computational efficiency purposes but I

00:14:44.399 --> 00:14:49.600
won't get into it because it gets a bit

00:14:46.399 --> 00:14:52.559
you know involved but the key idea is

00:14:49.600 --> 00:14:54.959
replace basic plain vanilla attention

00:14:52.559 --> 00:14:57.119
with causal attention, aka masked

00:14:54.958 --> 00:14:59.359
attention

00:14:57.120 --> 00:15:01.120
and you do that, and boom, suddenly it

00:14:59.360 --> 00:15:04.079
starts working for next word

00:15:01.120 --> 00:15:06.000
prediction. It can't cheat anymore,

00:15:04.078 --> 00:15:10.198
and when we do that we get the

00:15:06.000 --> 00:15:10.198
transformer causal encoder

00:15:11.440 --> 00:15:15.360
and by the way, the word causal here

00:15:13.519 --> 00:15:19.440
has no connection to causality;

00:15:15.360 --> 00:15:20.800
it's just a term.

00:15:19.440 --> 00:15:24.240
so if you look at the original

00:15:20.799 --> 00:15:26.319
transformer paper um

00:15:24.240 --> 00:15:28.000
it was created for translation for

00:15:26.320 --> 00:15:30.560
machine translation you know English to

00:15:28.000 --> 00:15:32.480
German right those kinds of use cases so

00:15:30.559 --> 00:15:34.399
it had something called an encoder which

00:15:32.480 --> 00:15:35.839
we are very familiar with from last week

00:15:34.399 --> 00:15:38.000
and then it had something called a

00:15:35.839 --> 00:15:40.480
decoder right and it is called the

00:15:38.000 --> 00:15:42.000
encoder decoder architecture and we are

00:15:40.480 --> 00:15:43.278
not going to cover the encoder decoder

00:15:42.000 --> 00:15:45.679
architecture because we are not covering

00:15:43.278 --> 00:15:48.958
machine translation in this class but

00:15:45.679 --> 00:15:51.439
I'm mentioning this because this

00:15:48.958 --> 00:15:52.559
part of the architecture is called a

00:15:51.440 --> 00:15:55.360
decoder

00:15:52.559 --> 00:15:57.758
because, see here, there is a

00:15:55.360 --> 00:15:59.199
masked attention business going on here

00:15:57.759 --> 00:16:02.959
because it is using this masked

00:15:59.198 --> 00:16:05.278
attention it's called a decoder so

00:16:02.958 --> 00:16:06.799
the transformer causal encoder is also

00:16:05.278 --> 00:16:09.360
referred to sometimes as a transformer

00:16:06.799 --> 00:16:11.039
decoder but the word decoder has two

00:16:09.360 --> 00:16:12.560
meanings

00:16:11.039 --> 00:16:14.319
right it's a synonym for the causal

00:16:12.559 --> 00:16:17.359
encoder like we have seen today it's

00:16:14.320 --> 00:16:19.040
also used to refer to sequence-to-sequence

00:16:17.360 --> 00:16:21.519
translation problems for the second part

00:16:19.039 --> 00:16:23.198
of its architecture. So you just have to

00:16:21.519 --> 00:16:25.120
keep that in mind; it'll become clear from context

00:16:23.198 --> 00:16:26.399
what we're talking about in this course

00:16:25.120 --> 00:16:27.278
of course there is no confusion because

00:16:26.399 --> 00:16:29.679
we're not going to be looking at

00:16:27.278 --> 00:16:32.958
translation, right? We may say decoder

00:16:29.679 --> 00:16:34.479
or causal encoder; it's the same thing.
>> So I

00:16:32.958 --> 00:16:36.638
thought there were some transformers

00:16:34.480 --> 00:16:39.600
that use bidirectional

00:16:36.639 --> 00:16:42.399
package like is it different from

00:16:39.600 --> 00:16:44.480
>> No, all

00:16:42.399 --> 00:16:47.839
bidirectional means is that I can see

00:16:44.480 --> 00:16:49.920
everything so the encoder we looked at

00:16:47.839 --> 00:16:52.880
last week, the basic self-attention

00:16:49.919 --> 00:16:52.879
thing is bidirectional.

00:16:54.480 --> 00:16:57.199
Basically all it means is I can look at

00:16:55.839 --> 00:16:58.800
in both directions to see what

00:16:57.198 --> 00:16:59.838
other words are there. In causal, you're

00:16:58.799 --> 00:17:02.559
not using the ones in the future.

00:16:59.839 --> 00:17:04.959
Correct.

00:17:02.559 --> 00:17:07.918
All right. So,

00:17:04.959 --> 00:17:09.519
to summarize where we are: this is

00:17:07.919 --> 00:17:11.600
what we looked at last week for BERT and

00:17:09.519 --> 00:17:14.078
this is a transformer encoder and we

00:17:11.599 --> 00:17:15.599
take the same thing and instead of

00:17:14.078 --> 00:17:18.639
multi-head attention we would do causal

00:17:15.599 --> 00:17:21.279
multi-head attention. We get the decoder, aka

00:17:18.640 --> 00:17:25.360
causal encoder.

00:17:21.279 --> 00:17:27.038
Okay. And we use the left for masked

00:17:25.359 --> 00:17:29.599
prediction. We use the right for next

00:17:27.038 --> 00:17:32.319
word prediction.

00:17:29.599 --> 00:17:34.079
All right. So now if you have instead of

00:17:32.319 --> 00:17:37.119
having an encoder, if you have a causal

00:17:34.079 --> 00:17:38.879
encoder, a TCE here, now we can train

00:17:37.119 --> 00:17:42.159
models for next word prediction using the

00:17:38.880 --> 00:17:43.600
same exact approach as before,

00:17:42.160 --> 00:17:45.200
right? We set up the inputs and the

00:17:43.599 --> 00:17:47.359
outputs like I described earlier. We run

00:17:45.200 --> 00:17:50.000
it through a bunch of stacks, a stack of

00:17:47.359 --> 00:17:52.159
causal encoders, dense, ReLU, softmax, and

00:17:50.000 --> 00:17:54.720
so on and so forth, right? Otherwise the

00:17:52.160 --> 00:17:56.558
details don't change, but the all-important

00:17:54.720 --> 00:18:00.759
changes go into the attention

00:17:56.558 --> 00:18:00.759
layer and make it masked or causal.

00:18:02.480 --> 00:18:08.240
Any questions so far?

00:18:06.240 --> 00:18:09.679
>> Uh yeah,

00:18:08.240 --> 00:18:11.120
this would only apply when we're

00:18:09.679 --> 00:18:13.679
training the model, not when we're

00:18:11.119 --> 00:18:15.918
validating and testing, right?

00:18:13.679 --> 00:18:18.559
So if you give me a sentence

00:18:15.919 --> 00:18:20.880
after training right the final

00:18:18.558 --> 00:18:22.960
prediction is the only thing you

00:18:20.880 --> 00:18:24.240
care about and by definition the final

00:18:22.960 --> 00:18:27.440
prediction will use everything that came

00:18:24.240 --> 00:18:30.240
before it. So we are okay.

00:18:27.440 --> 00:18:33.038
Was that your question?
>> No, I think the

00:18:30.240 --> 00:18:35.038
fact that we're

00:18:33.038 --> 00:18:36.720
uh we're zeroing out the weights in the

00:18:35.038 --> 00:18:38.240
future words I thought would apply more

00:18:36.720 --> 00:18:40.400
when we're training the model and we're

00:18:38.240 --> 00:18:44.558
trying to minimize the loss as opposed

00:18:40.400 --> 00:18:45.600
to when we're generating the next

00:18:44.558 --> 00:18:47.440
word.

00:18:45.599 --> 00:18:49.199
>> right but the point is when we actually

00:18:47.440 --> 00:18:50.480
use them what is the objective like what

00:18:49.200 --> 00:18:51.840
do we want to do when we actually use

00:18:50.480 --> 00:18:54.079
them for inference once we finish

00:18:51.839 --> 00:18:56.639
training our objective is given a

00:18:54.079 --> 00:18:59.038
particular string get me the next word

00:18:56.640 --> 00:19:00.320
right and to find the next word you can

00:18:59.038 --> 00:19:01.119
in fact use everything that came before

00:19:00.319 --> 00:19:03.119
it

00:19:01.119 --> 00:19:04.798
>> and therefore without any change to this

00:19:03.119 --> 00:19:06.639
model it'll just work for your intended

00:19:04.798 --> 00:19:08.160
purpose you don't have to go in there

00:19:06.640 --> 00:19:10.400
and change it to you don't have to

00:19:08.160 --> 00:19:13.600
unmask it for inference because you

00:19:10.400 --> 00:19:14.960
don't need to

00:19:13.599 --> 00:19:17.599
>> yes

00:19:14.960 --> 00:19:20.480
>> uh I have one question is regarding like

00:19:17.599 --> 00:19:22.480
when we do the causal transformers, we

00:19:20.480 --> 00:19:24.160
are putting certain weights to zero for

00:19:22.480 --> 00:19:24.798
the words which are to be predicted and

00:19:24.160 --> 00:19:26.720
then we

00:19:24.798 --> 00:19:27.200
>> no word the the words that are in the

00:19:26.720 --> 00:19:28.000
future

00:19:27.200 --> 00:19:29.279
>> future Yeah.

00:19:28.000 --> 00:19:29.679
>> And then we normalize it.

00:19:29.279 --> 00:19:31.200
>> Correct.

00:19:29.679 --> 00:19:33.200
>> And we have trained a transformer

00:19:31.200 --> 00:19:35.759
earlier with all the words packed

00:19:33.200 --> 00:19:37.279
together. So won't there be

00:19:35.759 --> 00:19:37.839
difference in weights between both the

00:19:37.279 --> 00:19:39.279
things

00:19:37.839 --> 00:19:40.879
>> between the two ways of training? The

00:19:39.279 --> 00:19:43.599
weights are going to be very different

00:19:40.880 --> 00:19:45.600
and they are two different models. BERT

00:19:43.599 --> 00:19:47.119
is used for certain things and this kind

00:19:45.599 --> 00:19:47.918
of model which is the basis of GPT is

00:19:47.119 --> 00:19:49.599
going to be used for other things.

00:19:47.919 --> 00:19:52.240
>> We are training it as well like that, I

00:19:49.599 --> 00:19:53.839
mean, while putting some of the

00:19:52.240 --> 00:19:56.160
weights to zero.

00:19:53.839 --> 00:19:59.199
>> correct correct. So what I'm talking

00:19:56.160 --> 00:20:01.279
about here is this: what we're trying to

00:19:59.200 --> 00:20:03.919
do here is to say let's say that we want

00:20:01.279 --> 00:20:06.160
to do next word prediction as the

00:20:03.919 --> 00:20:08.160
task as a self-supervised learning task

00:20:06.160 --> 00:20:10.798
and we want to train such a model on

00:20:08.160 --> 00:20:12.080
a vast amount of text data right well we

00:20:10.798 --> 00:20:13.279
can't just use what we did last week

00:20:12.079 --> 00:20:14.720
because it's not going to work because

00:20:13.279 --> 00:20:16.480
of the fact it can see the future

00:20:14.720 --> 00:20:17.839
therefore we make a tweak and then we

00:20:16.480 --> 00:20:18.960
build this model now the question

00:20:17.839 --> 00:20:20.319
becomes okay what can you do with this

00:20:18.960 --> 00:20:21.840
such a model right we have basically

00:20:20.319 --> 00:20:23.200
trained two different kinds of models

00:20:21.839 --> 00:20:25.599
the one that can see everything,

00:20:23.200 --> 00:20:27.840
BERT, and the one that can't see the

00:20:25.599 --> 00:20:28.959
future which is actually GPT. So what

00:20:27.839 --> 00:20:32.199
can you do with it? And we're going to

00:20:28.960 --> 00:20:32.200
come to that.

00:20:32.240 --> 00:20:38.558
Okay. U all right. So now once you train

00:20:35.679 --> 00:20:41.519
such a model u right given any input

00:20:38.558 --> 00:20:45.519
sentence um let's say that the sentence

00:20:41.519 --> 00:20:47.599
is "it was a dark and,"

00:20:45.519 --> 00:20:49.200
right? It goes through all these things.

00:20:47.599 --> 00:20:50.879
And remember what I said earlier the

00:20:49.200 --> 00:20:53.440
fact that it's predicting something

00:20:50.880 --> 00:20:55.600
after just seeing it. We don't really

00:20:53.440 --> 00:20:57.840
care.

00:20:55.599 --> 00:20:59.199
All what we're really curious about is

00:20:57.839 --> 00:21:01.119
what is the next thing it's going to

00:20:59.200 --> 00:21:02.400
say? And the next thing it's going to

00:21:01.119 --> 00:21:06.359
say is going to be is basically going to

00:21:02.400 --> 00:21:06.360
be what's coming out of this softmax.

00:21:06.798 --> 00:21:11.839
Does it make sense? We don't care about

00:21:08.720 --> 00:21:14.159
anything that went before it

00:21:11.839 --> 00:21:15.918
because we already have like a half-formed

00:21:14.159 --> 00:21:17.840
sentence and we want to just find the

00:21:15.919 --> 00:21:19.520
next thing here. So we only care about

00:21:17.839 --> 00:21:21.759
this. We I mean these things will come

00:21:19.519 --> 00:21:22.960
out of the of the architecture of the

00:21:21.759 --> 00:21:24.640
model, but we don't we throw them out.

00:21:22.960 --> 00:21:26.240
We don't even pay any attention to them.

00:21:24.640 --> 00:21:30.159
Okay, we only look at what's coming out

00:21:26.240 --> 00:21:32.720
in this one here. And what comes out of

00:21:30.159 --> 00:21:35.120
the softmax, remember, is a 50,000-way

00:21:32.720 --> 00:21:37.440
table of probabilities. That's what a

00:21:35.119 --> 00:21:39.279
soft max is, right? It's a whole bunch

00:21:37.440 --> 00:21:40.640
of probabilities that add up to one. And

00:21:39.279 --> 00:21:42.960
so it's going to and let's say, for

00:21:40.640 --> 00:21:45.360
example, that you know you have starting

00:21:42.960 --> 00:21:46.159
with aardvark all the way to zebra,

00:21:45.359 --> 00:21:48.000
right? Right? And these are the

00:21:46.159 --> 00:21:52.640
probabilities.

00:21:48.000 --> 00:21:55.519
So it was a dark and you know just for

00:21:52.640 --> 00:21:56.880
kicks I put stormy as the highest

00:21:55.519 --> 00:21:59.599
probability number but these numbers

00:21:56.880 --> 00:22:02.400
will add up to one. We have this table.

00:21:59.599 --> 00:22:04.959
Okay. And then what we do is we choose a

00:22:02.400 --> 00:22:06.960
token from this table. We get to

00:22:04.960 --> 00:22:08.880
choose right. There's a whole bunch of

00:22:06.960 --> 00:22:11.120
numbers in this table, and we get to

00:22:08.880 --> 00:22:12.880
choose a token. The simplest thing

00:22:11.119 --> 00:22:14.959
one can think of is to just choose the

00:22:12.880 --> 00:22:16.320
word that is the most likely, right? And

00:22:14.960 --> 00:22:18.400
we choose the word that's most likely

00:22:16.319 --> 00:22:20.319
here. And we we're going to have a whole

00:22:18.400 --> 00:22:22.880
section on how to choose these things

00:22:20.319 --> 00:22:23.918
coming up. Okay, for now let's go with

00:22:22.880 --> 00:22:26.960
the simple option. We're going to just

00:22:23.919 --> 00:22:30.880
choose the one that's most likely, 0.6. And

00:22:26.960 --> 00:22:32.319
then we we attach it to the input. So

00:22:30.880 --> 00:22:34.480
now the input has become it was a dark

00:22:32.319 --> 00:22:36.319
and stormy. We run it through and we

00:22:34.480 --> 00:22:37.919
again we only care about the last one

00:22:36.319 --> 00:22:40.399
softmax.

00:22:37.919 --> 00:22:42.720
Okay,

00:22:40.400 --> 00:22:44.480
we do that. We get another table and

00:22:42.720 --> 00:22:45.839
this table turns out the table keeps

00:22:44.480 --> 00:22:46.880
changing because the softmax is

00:22:45.839 --> 00:22:49.119
different for each time you run it

00:22:46.880 --> 00:22:50.880
through because the input has changed.

00:22:49.119 --> 00:22:53.918
So you get a new table and it turns out

00:22:50.880 --> 00:22:56.480
the most likely one is night. Okay. And

00:22:53.919 --> 00:22:59.840
then night comes out the

00:22:56.480 --> 00:23:03.279
other end. We attach night here

00:22:59.839 --> 00:23:05.759
and we keep on going right. We can keep

00:23:03.279 --> 00:23:08.240
on going maybe till we basically we tell

00:23:05.759 --> 00:23:11.200
the model okay generate up to 100 tokens

00:23:08.240 --> 00:23:12.720
and stop. It might stop after 100, or

00:23:11.200 --> 00:23:15.279
the model may decide

00:23:12.720 --> 00:23:17.120
in fact that when it sees a punctuation

00:23:15.279 --> 00:23:19.678
like a period or exclamation mark or

00:23:17.119 --> 00:23:21.199
something it's going to stop. Okay. And

00:23:19.679 --> 00:23:23.600
we have control over this when it stops

00:23:21.200 --> 00:23:26.080
and how it stops. But this is this is

00:23:23.599 --> 00:23:27.199
sort of the the basic process and you

00:23:26.079 --> 00:23:28.558
folks are all very used to it because

00:23:27.200 --> 00:23:30.960
you've all been playing with ChatGPT

00:23:28.558 --> 00:23:33.279
and the like, right? But the basic

00:23:30.960 --> 00:23:34.720
building block is next word prediction

00:23:33.279 --> 00:23:36.960
feed it back to the input next word

00:23:34.720 --> 00:23:38.640
prediction keep on doing it right you

00:23:36.960 --> 00:23:41.519
keep on doing it and suddenly you know

00:23:38.640 --> 00:23:42.799
it's writing entire novels for you
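
NOTE (editor's sketch): the loop just described, as Python. Here `model` is a hypothetical function standing in for the whole network, assumed to return the final softmax as a dict of token probabilities; the vocabulary and probabilities are made up for illustration.

```python
def generate(model, tokens, max_new_tokens=100, stop_tokens=(".", "!", "?")):
    """Greedy autoregressive generation: feed the model's output back in.

    `model(tokens)` is assumed to return a dict mapping each candidate
    token to its probability (the softmax row for the last position).
    """
    for _ in range(max_new_tokens):
        probs = model(tokens)                   # only the last position's softmax matters
        next_token = max(probs, key=probs.get)  # greedy: pick the most likely token
        tokens = tokens + [next_token]          # attach the prediction to the input
        if next_token in stop_tokens:           # optional stopping rule (punctuation)
            break
    return tokens

# Toy stand-in model that always continues "it was a dark and" -> "stormy night."
toy = {
    ("it", "was", "a", "dark", "and"): {"stormy": 0.6, "aardvark": 0.1, "zebra": 0.05},
    ("it", "was", "a", "dark", "and", "stormy"): {"night": 0.8, "day": 0.1},
    ("it", "was", "a", "dark", "and", "stormy", "night"): {".": 0.9, "and": 0.05},
}
out = generate(lambda t: toy[tuple(t)], ["it", "was", "a", "dark", "and"])
```

The same loop underlies ChatGPT-style generation; only the choice of `next_token` (greedy vs. sampling, covered later) and the stopping rule change.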

00:23:41.519 --> 00:23:44.960
um yeah

00:23:42.798 --> 00:23:47.519
>> Does that mean that the longer the initial

00:23:44.960 --> 00:23:48.880
input is, the better the

00:23:47.519 --> 00:23:52.639
prediction?

00:23:48.880 --> 00:23:54.400
>> um it depends on your objective so

00:23:52.640 --> 00:23:56.400
fundamentally you have some task you

00:23:54.400 --> 00:23:58.320
want the thing to do for you right and

00:23:56.400 --> 00:24:00.320
that task varies, and you need to give it

00:23:58.319 --> 00:24:02.639
all the information it can possibly

00:24:00.319 --> 00:24:04.240
find useful. Yeah. So the

00:24:02.640 --> 00:24:07.120
more helpful the input the better. Maybe

00:24:04.240 --> 00:24:09.200
that's how I would say it.

00:24:07.119 --> 00:24:11.199
Uh yeah.

00:24:09.200 --> 00:24:14.960
>> Would this also apply to something like

00:24:11.200 --> 00:24:17.038
Google search? Uh, or do they also do

00:24:14.960 --> 00:24:18.079
next letter prediction too? But would

00:24:17.038 --> 00:24:20.000
this just be a deeper

00:24:18.079 --> 00:24:22.000
>> Yeah. So the Google autocomplete for

00:24:20.000 --> 00:24:24.240
example, I don't know if they actually

00:24:22.000 --> 00:24:26.319
use uh this kind of model under the hood

00:24:24.240 --> 00:24:27.679
or not. I just don't know. Um these

00:24:26.319 --> 00:24:29.918
things tend to be kept tightly under

00:24:27.679 --> 00:24:31.840
wraps. Uh, you know, if

00:24:29.919 --> 00:24:33.360
they were using it you know my guess is

00:24:31.839 --> 00:24:34.959
that

00:24:33.359 --> 00:24:36.798
they so I don't know if you folks have

00:24:34.960 --> 00:24:38.319
seen recently over the last few months

00:24:36.798 --> 00:24:40.158
there is a generative

00:24:38.319 --> 00:24:42.639
AI panel that opens up when you do a

00:24:40.159 --> 00:24:45.278
Google search that panel I suspect uses

00:24:42.640 --> 00:24:47.038
this uh but I don't know if the default

00:24:45.278 --> 00:24:49.759
Google autocomplete actually uses it or

00:24:47.038 --> 00:24:52.558
not because it's very compute heavy

00:24:49.759 --> 00:24:55.359
right so I don't know what they do

00:24:52.558 --> 00:25:00.000
um so yeah this is what you do other

00:24:55.359 --> 00:25:01.599
questions on this, on the mechanics of it?

00:25:00.000 --> 00:25:03.679
Yeah,

00:25:01.599 --> 00:25:05.359
>> for our vocabulary list, I'm assuming

00:25:03.679 --> 00:25:07.200
it's static.

00:25:05.359 --> 00:25:08.959
>> Yeah, correct. Uh, and as you will see

00:25:07.200 --> 00:25:10.880
here, it's not really a word vocabulary.

00:25:08.960 --> 00:25:12.880
It's a token vocabulary, but yes, it is

00:25:10.880 --> 00:25:15.520
static for a given model.

00:25:12.880 --> 00:25:17.440
>> And so for I guess I'm assuming for

00:25:15.519 --> 00:25:19.839
Google or any other sort of like search

00:25:17.440 --> 00:25:23.519
engine that wouldn't necessarily be

00:25:19.839 --> 00:25:26.879
static. And so when it comes to I guess

00:25:23.519 --> 00:25:30.158
I'll leave it there, because the

00:25:26.880 --> 00:25:32.159
model would be different

00:25:30.159 --> 00:25:34.240
sort of thinking about uh what happens

00:25:32.159 --> 00:25:35.760
to like new words and things that are

00:25:34.240 --> 00:25:37.519
formed and how does it handle it if the

00:25:35.759 --> 00:25:41.440
vocabulary is static. There's a very

00:25:37.519 --> 00:25:45.119
elegant solution that's coming up.

00:25:41.440 --> 00:25:48.400
Okay. Um

00:25:45.119 --> 00:25:51.439
all right. So now in other words we have

00:25:48.400 --> 00:25:52.960
learned how to do sequence generation.

00:25:51.440 --> 00:25:54.558
We already saw that we can do

00:25:52.960 --> 00:25:56.640
classification with BERT. We can do

00:25:54.558 --> 00:25:59.038
labeling with BERT-like models, which

00:25:56.640 --> 00:26:00.720
are trained on masked prediction. And for

00:25:59.038 --> 00:26:02.319
generating sequences now we know how to

00:26:00.720 --> 00:26:05.519
do it. We just need to use a transformer

00:26:02.319 --> 00:26:08.319
causal encoder.

00:26:05.519 --> 00:26:10.480
Okay.

00:26:08.319 --> 00:26:12.240
Now

00:26:10.480 --> 00:26:13.919
these kind of models, sequence

00:26:12.240 --> 00:26:15.679
generation models trained on text

00:26:13.919 --> 00:26:17.759
sequences using next word prediction are

00:26:15.679 --> 00:26:20.798
called autoregressive language models

00:26:17.759 --> 00:26:22.558
or causal language models. Okay. And of

00:26:20.798 --> 00:26:25.519
course the GPT family is perhaps the

00:26:22.558 --> 00:26:28.079
most well-known uh example of an auto

00:26:25.519 --> 00:26:30.720
regressive causal language model. Auto-

00:26:28.079 --> 00:26:32.480
regressive, because people who have done

00:26:30.720 --> 00:26:34.159
econometrics and some regression know

00:26:32.480 --> 00:26:36.159
the notion of autoregression means

00:26:34.159 --> 00:26:38.320
that you predict something and then you

00:26:36.159 --> 00:26:40.159
you use sort of you know the past

00:26:38.319 --> 00:26:42.639
predictions as inputs into the next time

00:26:40.159 --> 00:26:44.720
you predict right so this is the notion

00:26:42.640 --> 00:26:46.799
of autoregression: you predict,

00:26:44.720 --> 00:26:48.079
you feed the prediction back get the

00:26:46.798 --> 00:26:51.798
next prediction and keep on cycling

00:26:48.079 --> 00:26:51.798
through yes

00:26:51.919 --> 00:26:56.320
>> so when you you're kind of putting an

00:26:53.839 --> 00:26:59.038
input into GPT for example and it has

00:26:56.319 --> 00:27:01.519
that um you know it shows you like the

00:26:59.038 --> 00:27:03.519
next words as as it's coming. Is that an

00:27:01.519 --> 00:27:05.759
indication of it doing this

00:27:03.519 --> 00:27:07.679
recalculation that you described here?

00:27:05.759 --> 00:27:09.759
>> Correct. That's exactly what's going on.

00:27:07.679 --> 00:27:12.240
Uh in fact, if you use the API, there is

00:27:09.759 --> 00:27:14.079
the thing called the streaming API where

00:27:12.240 --> 00:27:15.359
it'll actually stream each token that's

00:27:14.079 --> 00:27:17.278
coming out through the through every

00:27:15.359 --> 00:27:19.599
pass and you can actually see everything

00:27:17.278 --> 00:27:22.159
very clearly. But when you actually work

00:27:19.599 --> 00:27:24.079
with the web interface and you see the

00:27:22.159 --> 00:27:25.919
thing almost as if it's typing like a

00:27:24.079 --> 00:27:26.960
human, what I've heard from people, I

00:27:25.919 --> 00:27:28.720
don't know if this is true, what I've

00:27:26.960 --> 00:27:30.960
heard from people is that they can

00:27:28.720 --> 00:27:32.319
actually do it much faster. They slow it

00:27:30.960 --> 00:27:33.600
down intentionally to give you the

00:27:32.319 --> 00:27:36.480
feeling that it's actually coming from a

00:27:33.599 --> 00:27:39.599
human.

00:27:36.480 --> 00:27:41.278
So it's like a UX trick to slow it down

00:27:39.599 --> 00:27:42.480
to make it feel as if someone is

00:27:41.278 --> 00:27:44.240
actually typing something on the other

00:27:42.480 --> 00:27:46.159
end. So when you're interacting with a

00:27:44.240 --> 00:27:48.640
chatbot, for example, sometimes you see

00:27:46.159 --> 00:27:49.600
it actually typing like slowly you can

00:27:48.640 --> 00:27:50.720
see the bubble and you can see the

00:27:49.599 --> 00:27:53.439
typing. It's actually intentionally

00:27:50.720 --> 00:27:55.360
slowed down. Uh because you know it's a

00:27:53.440 --> 00:27:58.159
bot otherwise, right? So there's a

00:27:55.359 --> 00:28:01.119
little bit of UX

00:27:58.159 --> 00:28:03.039
creepiness maybe going on. Uh I don't

00:28:01.119 --> 00:28:05.038
know to what extent this is 100% true

00:28:03.038 --> 00:28:06.798
and how pervasive it is, but folks who

00:28:05.038 --> 00:28:10.558
work in the field have told me that this

00:28:06.798 --> 00:28:12.398
actually is not uncommon. So

00:28:10.558 --> 00:28:14.639
okay, so that's what's going on here.

00:28:12.398 --> 00:28:17.199
These are language models and of course

00:28:14.640 --> 00:28:20.159
GPT-3 is an autoregressive language

00:28:17.200 --> 00:28:22.399
model and the reason why we have an L in

00:28:20.159 --> 00:28:24.080
front of the LM is because it was trained

00:28:22.398 --> 00:28:25.678
on lots of data with lots of parameters

00:28:24.079 --> 00:28:26.960
right? Someone does this at some

00:28:25.679 --> 00:28:28.480
point it's not a small language model

00:28:26.960 --> 00:28:31.600
anymore it's a large language model so

00:28:28.480 --> 00:28:35.440
yeah so it's LLM nothing more momentous

00:28:31.599 --> 00:28:40.558
than that. So as it turns out, uh, GPT-3

00:28:35.440 --> 00:28:43.038
uses 96 transformer blocks, and

00:28:40.558 --> 00:28:44.960
each block has 96 causal attention

00:28:43.038 --> 00:28:46.480
heads.
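
NOTE (editor's addition): the architecture figures can be collected in one place. The 96 blocks and 96 heads are from the lecture; the remaining values are quoted from the GPT-3 paper ("Language Models are Few-Shot Learners", Brown et al., 2020), not derived here.

```python
# GPT-3 (175B) architecture figures, as published in the GPT-3 paper.
gpt3_config = {
    "n_layers": 96,         # transformer blocks (from the lecture)
    "n_heads": 96,          # causal attention heads per block (from the lecture)
    "d_model": 12288,       # embedding / residual-stream width (from the paper)
    "d_head": 12288 // 96,  # = 128 dimensions per head
    "n_ctx": 2048,          # context window in tokens (from the paper)
    "n_vocab": 50257,       # BPE token vocabulary size (from the paper)
}

# The model width must divide evenly across the attention heads.
assert gpt3_config["d_model"] % gpt3_config["n_heads"] == 0
```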

00:28:44.960 --> 00:28:48.480
Okay. And you can see you can read the

00:28:46.480 --> 00:28:50.319
GPT-3 paper. It gives you all the details

00:28:48.480 --> 00:28:51.839
of the architecture. That is interesting

00:28:50.319 --> 00:28:55.918
because for GPT-4 they didn't publish the

00:28:51.839 --> 00:28:58.158
architecture. After GPT-3,

00:28:55.919 --> 00:28:59.200
everything became closed. So we actually

00:28:58.159 --> 00:29:00.320
don't know what the architecture is even

00:28:59.200 --> 00:29:03.440
though there's a lot of speculation on

00:29:00.319 --> 00:29:06.240
Twitter. But GPT-3 we know exactly

00:29:03.440 --> 00:29:09.360
what happened right 96 blocks each has

00:29:06.240 --> 00:29:11.359
96 causal attention heads. Um and then

00:29:09.359 --> 00:29:14.000
the data: they actually scraped 30

00:29:11.359 --> 00:29:16.158
billion sentences um from a whole bunch

00:29:14.000 --> 00:29:19.599
of sources: WebText, Wikipedia, a bunch

00:29:16.159 --> 00:29:21.840
of book databases. Um, and then

00:29:19.599 --> 00:29:23.359
they basically just took those 30

00:29:21.839 --> 00:29:27.038
billion sentences and just trained it

00:29:23.359 --> 00:29:28.798
on exactly next word prediction. That's it.

00:29:27.038 --> 00:29:31.759
Now when they trained GPT-3, I think it

00:29:28.798 --> 00:29:34.158
cost them a lot of money um because

00:29:31.759 --> 00:29:36.158
things were not, well, we hadn't figured out

00:29:34.159 --> 00:29:38.240
how to do it as efficiently as we know now.

00:29:36.159 --> 00:29:39.679
uh but it was still pretty amazing and

00:29:38.240 --> 00:29:41.200
I'll talk about you know what is so

00:29:39.679 --> 00:29:44.080
special about GPT-3 in just a minute or

00:29:41.200 --> 00:29:46.960
two. So, so this is what we have here

00:29:44.079 --> 00:29:49.918
and as you folks have seen the notion of

00:29:46.960 --> 00:29:51.440
generating text right is very powerful

00:29:49.919 --> 00:29:53.278
right uh because we can obviously

00:29:51.440 --> 00:29:55.919
generate text but we can also generate

00:29:53.278 --> 00:29:57.519
code because code is just text uh we can

00:29:55.919 --> 00:29:58.960
generate documentation for code we can

00:29:57.519 --> 00:30:00.798
summarize text we can answer questions

00:29:58.960 --> 00:30:03.200
we can do chat I mean the list goes on

00:30:00.798 --> 00:30:05.278
all the excitement we see around GenAI

00:30:03.200 --> 00:30:07.600
from the time ChatGPT came out is

00:30:05.278 --> 00:30:12.000
precisely because the simple idea of

00:30:07.599 --> 00:30:13.759
text in text out is just so flexible

00:30:12.000 --> 00:30:15.119
It's so versatile. It can handle all

00:30:13.759 --> 00:30:17.038
sorts of use cases. That's why there's

00:30:15.119 --> 00:30:19.759
so much excitement.

00:30:17.038 --> 00:30:21.839
Um, by the way, um, if you're really

00:30:19.759 --> 00:30:24.798
curious, I would actually recommend

00:30:21.839 --> 00:30:28.959
seeing this video where this guy

00:30:24.798 --> 00:30:31.839
Andrej Karpathy builds GPT from scratch.

00:30:28.960 --> 00:30:33.759
Okay, it's a fantastic video. If you if

00:30:31.839 --> 00:30:35.038
you have even like a little bit of

00:30:33.759 --> 00:30:36.240
curiosity about how these things are

00:30:35.038 --> 00:30:38.000
actually built, I would strongly

00:30:36.240 --> 00:30:39.440
recommend checking it out. Um and

00:30:38.000 --> 00:30:41.519
there's also a little blog post where

00:30:39.440 --> 00:30:43.519
this person you know basically if you

00:30:41.519 --> 00:30:46.079
know NumPy you can actually create

00:30:43.519 --> 00:30:50.240
GPT using NumPy without using any

00:30:46.079 --> 00:30:52.319
frameworks and things like that. So um

00:30:50.240 --> 00:30:53.759
I found it super interesting and

00:30:52.319 --> 00:30:55.439
helpful to understand what exactly is

00:30:53.759 --> 00:30:57.759
going on. So if you would like to do

00:30:55.440 --> 00:31:00.320
this. Okay. So now we're going to talk

00:30:57.759 --> 00:31:03.679
about um decoding sampling strategies

00:31:00.319 --> 00:31:05.278
which is, as I said: when we

00:31:03.679 --> 00:31:07.759
come up with the

00:31:05.278 --> 00:31:10.398
softmax for that last token, right, we

00:31:07.759 --> 00:31:13.278
have 50,000 choices. What do we pick

00:31:10.398 --> 00:31:15.759
right as it turns out to actually get

00:31:13.278 --> 00:31:17.839
really good performance out of GenAI

00:31:15.759 --> 00:31:19.919
systems like ChatGPT, you need to be quite

00:31:17.839 --> 00:31:21.678
thoughtful about how to decode, right,

00:31:19.919 --> 00:31:25.278
how to actually sample from that table.

00:31:21.679 --> 00:31:27.600
So we'll talk about that for a bit. So,

00:31:25.278 --> 00:31:29.119
first of all, a definition: the

00:31:27.599 --> 00:31:30.639
process of choosing a token from the

00:31:29.119 --> 00:31:32.479
probability distribution coming

00:31:30.640 --> 00:31:34.399
out of the softmax right I'm sticking

00:31:32.480 --> 00:31:36.640
this table right here this is the

00:31:34.398 --> 00:31:38.798
softmax right this process of choosing

00:31:36.640 --> 00:31:40.720
it is called decoding that's a technical

00:31:38.798 --> 00:31:42.480
term for it, right? We get this

00:31:40.720 --> 00:31:44.480
table we have to decode meaning we have

00:31:42.480 --> 00:31:48.079
to pick something from this table okay

00:31:44.480 --> 00:31:51.038
that's called decoding now

00:31:48.079 --> 00:31:53.359
there are two sort of extreme cases of

00:31:51.038 --> 00:31:55.038
very simple ways to do it.

00:31:53.359 --> 00:31:56.558
The first thing of course is just pick

00:31:55.038 --> 00:31:58.798
the word with the

00:31:56.558 --> 00:32:02.240
highest probability.

00:31:58.798 --> 00:32:03.918
This is called greedy decoding.

00:32:02.240 --> 00:32:06.640
Okay.

00:32:03.919 --> 00:32:08.240
So in this case, for example, if stormy at

00:32:06.640 --> 00:32:10.880
0.6 is the highest probability in this whole

00:32:08.240 --> 00:32:14.558
table, we just pick stormy. Okay. So that

00:32:10.880 --> 00:32:15.760
is the obvious extreme simple case. The

00:32:14.558 --> 00:32:18.240
other thing we can do which is also

00:32:15.759 --> 00:32:20.480
super simple is that because we have a

00:32:18.240 --> 00:32:22.319
probability table here, we can just

00:32:20.480 --> 00:32:24.880
reach into the table and sample a word

00:32:22.319 --> 00:32:27.519
out of it, right? In proportion to its

00:32:24.880 --> 00:32:28.640
probability, which means that if you if

00:32:27.519 --> 00:32:30.960
if you have this table and you're

00:32:28.640 --> 00:32:33.519
sampling from it, if you sample from it

00:32:30.960 --> 00:32:36.480
100 times, 60 times you probably get

00:32:33.519 --> 00:32:38.079
stormy because the probability is 0.6. But

00:32:36.480 --> 00:32:39.919
some small fraction of the time you may

00:32:38.079 --> 00:32:42.798
get strange things like aardvark and

00:32:39.919 --> 00:32:44.080
zebra and so on and so forth,

00:32:42.798 --> 00:32:46.558
right? you're just literally doing

00:32:44.079 --> 00:32:48.960
random sampling.
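
NOTE (editor's sketch): the two extremes just described, with a made-up probability table. Greedy decoding always returns the argmax; random sampling draws a token in proportion to its probability via `random.choices`.

```python
import random

# The softmax output: a table of token probabilities that sums to 1
# (illustrative values only).
table = {"aardvark": 0.05, "stormy": 0.6, "zebra": 0.05, "night": 0.3}

def greedy_decode(table):
    """Always pick the single most likely token (deterministic)."""
    return max(table, key=table.get)

def sample_decode(table, rng=random):
    """Draw a token in proportion to its probability (stochastic)."""
    tokens = list(table)
    return rng.choices(tokens, weights=[table[t] for t in tokens], k=1)[0]

assert greedy_decode(table) == "stormy"
# sample_decode returns "stormy" roughly 60% of the time, "zebra" ~5%, etc.
```

Greedy is the right default when you need the same answer every time; sampling trades that determinism for diversity.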

00:32:46.558 --> 00:32:50.558
That's a fine way to do it too, right?

00:32:48.960 --> 00:32:53.200
There's nothing wrong with that. So

00:32:50.558 --> 00:32:56.158
these are both options. So the key

00:32:53.200 --> 00:32:58.080
thing you need to remember is that

00:32:56.159 --> 00:32:59.600
which one you pick and there are some

00:32:58.079 --> 00:33:01.519
variations on it which we'll get to in a

00:32:59.599 --> 00:33:03.278
moment. Which way to

00:33:01.519 --> 00:33:05.519
decode you pick really depends on what

00:33:03.278 --> 00:33:08.558
your task is, what you're trying to use

00:33:05.519 --> 00:33:10.880
the system for, right? The LLM for.

00:33:08.558 --> 00:33:13.839
So the broad thing to remember

00:33:10.880 --> 00:33:16.559
is that if you're working on questions

00:33:13.839 --> 00:33:19.678
for which the factual accuracy of the

00:33:16.558 --> 00:33:22.000
response is really important

00:33:19.679 --> 00:33:24.480
and/or you want the output to be

00:33:22.000 --> 00:33:26.159
deterministic meaning every time you ask

00:33:24.480 --> 00:33:28.720
it a particular question you really want

00:33:26.159 --> 00:33:31.120
the same answer back right you can

00:33:28.720 --> 00:33:33.120
imagine a customer call support agent

00:33:31.119 --> 00:33:34.639
where two different customers ask

00:33:33.119 --> 00:33:37.678
the same question and they get different

00:33:34.640 --> 00:33:40.000
answers right you don't want that so you

00:33:37.679 --> 00:33:41.679
want deterministic outputs. So in those

00:33:40.000 --> 00:33:43.759
situations, greedy

00:33:41.679 --> 00:33:45.519
decoding is a good starting point

00:33:43.759 --> 00:33:48.879
because, you know, you won't

00:33:45.519 --> 00:33:51.679
get any random stuff because for any

00:33:48.880 --> 00:33:53.120
given input sentence, the table that

00:33:51.679 --> 00:33:55.600
comes out of the softmax is not going to

00:33:53.119 --> 00:33:57.119
change. It's the same table and if

00:33:55.599 --> 00:33:58.398
you're always picking the highest number

00:33:57.119 --> 00:34:03.038
in the table that's not going to change

00:33:58.398 --> 00:34:05.199
either. So guaranteed determinism

00:34:03.038 --> 00:34:07.359
and I found that for reasoning questions

00:34:05.200 --> 00:34:08.960
and things where you know you're asking

00:34:07.359 --> 00:34:10.878
questions, math questions, reasoning

00:34:08.960 --> 00:34:12.878
questions, logic questions, you should

00:34:10.878 --> 00:34:15.598
really sort of keep it as sort of greedy

00:34:12.878 --> 00:34:18.319
as possible in my experience. Okay. Now

00:34:15.599 --> 00:34:20.879
there are other situations where random

00:34:18.320 --> 00:34:22.639
sampling is actually a better option. If

00:34:20.878 --> 00:34:24.159
you're doing creative things, right?

00:34:22.639 --> 00:34:26.000
write a poem, write a haiku, write a

00:34:24.159 --> 00:34:27.760
screenplay, things like that. You do

00:34:26.000 --> 00:34:30.320
want a lot of creativity in which case

00:34:27.760 --> 00:34:31.440
randomness is actually your friend,

00:34:30.320 --> 00:34:32.960
right? You get a lot of different

00:34:31.440 --> 00:34:35.119
varieties of responses, diversity of

00:34:32.960 --> 00:34:36.878
responses, all that is really good. The

00:34:35.119 --> 00:34:39.119
price you pay for it is that you lose

00:34:36.878 --> 00:34:40.239
determinism. The outputs are

00:34:39.119 --> 00:34:41.440
going to be stochastic. They're going to

00:34:40.239 --> 00:34:42.638
be random. They're going to vary from

00:34:41.440 --> 00:34:44.559
the same question. The answer is going

00:34:42.639 --> 00:34:47.119
to vary again and again. But in many

00:34:44.559 --> 00:34:49.599
cases, maybe it's okay. You don't care.

00:34:47.119 --> 00:34:50.960
Okay, so that's roughly how

00:34:49.599 --> 00:34:53.200
you think about it. The other thing I want to say

00:34:50.960 --> 00:34:55.039
is that the diversity of response is also

00:34:53.199 --> 00:34:58.239
important because you if you imagine a

00:34:55.039 --> 00:35:00.239
chatbot um if you ask questions if the

00:34:58.239 --> 00:35:03.118
chatbot always responds in the same

00:35:00.239 --> 00:35:05.199
stilted robotic fashion, right, it kind of

00:35:03.119 --> 00:35:07.519
starts to get annoying you want some

00:35:05.199 --> 00:35:08.879
variation in the output right because a

00:35:07.519 --> 00:35:11.358
human will never give you the same thing

00:35:08.880 --> 00:35:13.119
back though I must say that when I

00:35:11.358 --> 00:35:14.480
interact with call center agents I think

00:35:13.119 --> 00:35:16.320
they're just cutting and pasting from a

00:35:14.480 --> 00:35:18.639
text library so it does look kind of

00:35:16.320 --> 00:35:20.079
robotic. Uh, so maybe we are already kind

00:35:18.639 --> 00:35:21.199
of used to this but anyway Okay, so

00:35:20.079 --> 00:35:24.400
those are some of the things to keep in

00:35:21.199 --> 00:35:26.480
mind. Yeah,

00:35:24.400 --> 00:35:28.480
>> if you're using random sampling, do you

00:35:26.480 --> 00:35:33.079
end up with a better estimation of the

00:35:28.480 --> 00:35:33.079
uncertainty, and are the probabilities more

00:35:33.119 --> 00:35:36.960
calibrated in the sense that the table

00:35:35.199 --> 00:35:39.759
that you end up at the end is the real

00:35:36.960 --> 00:35:42.000
probability that you observe from the

00:35:39.760 --> 00:35:43.760
words in your corpus.

00:35:42.000 --> 00:35:45.440
>> The table doesn't change regardless of

00:35:43.760 --> 00:35:47.599
how you sample it. The table is a

00:35:45.440 --> 00:35:50.480
starting point for sampling.

00:35:47.599 --> 00:35:51.680
All of decoding is about which

00:35:50.480 --> 00:35:53.039
token from the table you're going to

00:35:51.679 --> 00:35:54.799
pull out.

00:35:53.039 --> 00:35:55.599
>> Oh, so it doesn't impact the loss

00:35:54.800 --> 00:35:56.720
function.

00:35:55.599 --> 00:35:58.880
>> No.

00:35:56.719 --> 00:36:00.559
>> Yeah. It's all those things are fixed.

00:35:58.880 --> 00:36:02.000
You literally get the table and then you

00:36:00.559 --> 00:36:06.000
literally can forget how you got the

00:36:02.000 --> 00:36:09.199
table and now decoding starts.

00:36:06.000 --> 00:36:11.119
>> Is there a reason why it would generate a

00:36:09.199 --> 00:36:12.559
different answer given the same prompt

00:36:11.119 --> 00:36:14.320
if we run it again and again? Because

00:36:12.559 --> 00:36:16.559
they are using random sampling.

00:36:14.320 --> 00:36:19.039
>> Correct. That's exactly why. And we'll

00:36:16.559 --> 00:36:20.159
see. I'll do a demo of it very, very

00:36:19.039 --> 00:36:22.800
shortly because you can actually

00:36:20.159 --> 00:36:25.039
manipulate it. Uh

00:36:22.800 --> 00:36:27.680
>> if you do the prediction word by word,

00:36:25.039 --> 00:36:29.838
is there a way to make it resilient to

00:36:27.679 --> 00:36:32.319
mistakes? Like if you say the night was

00:36:29.838 --> 00:36:33.199
dark and hard work, that can mess up the

00:36:32.320 --> 00:36:34.559
next word, right?

00:36:33.199 --> 00:36:37.439
>> It can totally mess it up.

00:36:34.559 --> 00:36:37.838
>> So can it get itself back on

00:36:37.440 --> 00:36:40.240
track?

00:36:37.838 --> 00:36:42.000
>> It cannot. And so great question. And

00:36:40.239 --> 00:36:46.078
we'll look at an example of things going

00:36:42.000 --> 00:36:48.400
off the rails in just a second. Yep.

00:36:46.079 --> 00:36:51.359
>> Is this how Bing works where you can

00:36:48.400 --> 00:36:52.000
slide between being more creative, more

00:36:51.358 --> 00:36:53.920
accurate?

00:36:52.000 --> 00:36:56.000
>> Yeah, exactly. So, Bing has creative,

00:36:53.920 --> 00:36:57.680
balanced, precise something, right? Uh

00:36:56.000 --> 00:36:59.440
they're basically, under the hood,

00:36:57.679 --> 00:37:00.399
manipulating some of the parameters.

00:36:59.440 --> 00:37:01.760
We're going to look at some of those

00:37:00.400 --> 00:37:03.838
parameters in just a moment. They're

00:37:01.760 --> 00:37:05.920
just manipulating it for you. But if you

00:37:03.838 --> 00:37:08.920
use the API, you can manipulate it

00:37:05.920 --> 00:37:08.920
directly.

00:37:09.760 --> 00:37:15.760
Okay. Um All right. So, so here's sort

00:37:14.559 --> 00:37:17.599
of the basic thing to remember about

00:37:15.760 --> 00:37:19.839
random sampling.

00:37:17.599 --> 00:37:22.000
So, our hope is that the, you know, for

00:37:19.838 --> 00:37:24.400
any given sentence, we think that there

00:37:22.000 --> 00:37:26.880
is probably some set of good answers for

00:37:24.400 --> 00:37:30.720
the next word and a whole bunch of bad

00:37:26.880 --> 00:37:33.358
answers, right? Intuitively. So, we want

00:37:30.719 --> 00:37:36.078
the probability of the good stuff,

00:37:33.358 --> 00:37:38.078
right? We want, you can imagine

00:37:36.079 --> 00:37:39.440
a distribution going like that. There

00:37:38.079 --> 00:37:41.119
is the head of the distribution, the

00:37:39.440 --> 00:37:42.320
first few words in the distribution, if

00:37:41.119 --> 00:37:44.160
you sort them from high to low

00:37:42.320 --> 00:37:46.400
probability, and then there's the

00:37:44.159 --> 00:37:48.078
long tail of, you know, kind of

00:37:46.400 --> 00:37:51.119
not inappropriate but

00:37:48.079 --> 00:37:53.440
irrelevant words, right? So our hope is

00:37:51.119 --> 00:37:55.838
that the model is so good that for any

00:37:53.440 --> 00:37:57.440
given input phrase, it basically

00:37:55.838 --> 00:37:59.679
concentrates the output probability in

00:37:57.440 --> 00:38:01.358
the softmax to just a few good words and

00:37:59.679 --> 00:38:04.639
sort of kind of zeros out everything

00:38:01.358 --> 00:38:06.559
else that is the ideal scenario because

00:38:04.639 --> 00:38:08.319
in that scenario if you do random

00:38:06.559 --> 00:38:10.000
sampling you by definition you'll pick

00:38:08.320 --> 00:38:13.119
something from the high quality head of

00:38:10.000 --> 00:38:16.159
the distribution and life is good. Okay.
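That ideal scenario, where sampling just pulls from a high-probability head, is easy to sketch in Python. The table below is a made-up softmax output for illustration, not anything from an actual model:

```python
import random

# Toy softmax table, sorted from high to low probability;
# the words and the probabilities are invented for illustration.
softmax_table = {"stormy": 0.6, "night": 0.3, "rainy": 0.05,
                 "foggy": 0.03, "aardvark": 0.01, "zebra": 0.01}

words = list(softmax_table)
probs = list(softmax_table.values())

# Random sampling: draw the next token in proportion to its probability,
# so high-probability "head" words come out most of the time.
next_word = random.choices(words, weights=probs, k=1)[0]
print(next_word)
```

Run it a few times and the output varies from run to run, which is exactly the stochastic behavior discussed above.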

00:38:13.119 --> 00:38:18.079
Now, we want random sampling to sample

00:38:16.159 --> 00:38:19.440
from the head and not from the tail,

00:38:18.079 --> 00:38:21.119
right? That's the key point. And what do

00:38:19.440 --> 00:38:24.119
I mean by head and tail? Let's be very

00:38:21.119 --> 00:38:24.119
clear.

00:38:26.320 --> 00:38:31.760
So, um imagine you have

00:38:30.559 --> 00:38:33.440
take the table that we looked at, the

00:38:31.760 --> 00:38:35.680
softmax table, which went from, whatever,

00:38:33.440 --> 00:38:37.280
aardvark to zebra, right, and let's say we

00:38:35.679 --> 00:38:39.199
sort the table based on high to low

00:38:37.280 --> 00:38:42.240
probabilities. So maybe what's going to

00:38:39.199 --> 00:38:43.838
happen is that stormy

00:38:42.239 --> 00:38:46.879
is going to have a probability of, I

00:38:43.838 --> 00:38:51.920
don't know, 0.6, and I think, if I remember

00:38:46.880 --> 00:38:53.440
right, night had a probability of 0.3,

00:38:51.920 --> 00:38:56.639
and then a there was a whole bunch of

00:38:53.440 --> 00:39:00.320
other words

00:38:56.639 --> 00:39:02.480
all the way to the 50,000th word right

00:39:00.320 --> 00:39:04.160
from highest to lowest probability. So this is

00:39:02.480 --> 00:39:06.880
what I mean. You can think of this

00:39:04.159 --> 00:39:09.598
as like a probability distribution

00:39:06.880 --> 00:39:12.240
okay. And so basically what we are saying

00:39:09.599 --> 00:39:13.920
here is that this is the head of

00:39:12.239 --> 00:39:16.078
the distribution

00:39:13.920 --> 00:39:18.960
while this long tail is the tail of the

00:39:16.079 --> 00:39:21.200
distribution and we want our system to

00:39:18.960 --> 00:39:23.119
grab something from the head and not

00:39:21.199 --> 00:39:24.480
from the tail because the head is the

00:39:23.119 --> 00:39:26.960
stuff that's actually the relevant

00:39:24.480 --> 00:39:28.719
useful good stuff. Okay, that's really

00:39:26.960 --> 00:39:32.639
what we're trying to do here. Does it

00:39:28.719 --> 00:39:37.279
make sense? Okay. So,

00:39:32.639 --> 00:39:39.039
so to come back to this um

00:39:37.280 --> 00:39:41.440
and here is like the most important

00:39:39.039 --> 00:39:43.679
point to remember about this slide.

00:39:41.440 --> 00:39:46.000
While the probability of choosing any

00:39:43.679 --> 00:39:47.440
individual word in this long tail is

00:39:46.000 --> 00:39:49.199
pretty small. For any one word, it's

00:39:47.440 --> 00:39:51.200
pretty small. The probability of

00:39:49.199 --> 00:39:54.159
choosing some word from the tail is

00:39:51.199 --> 00:39:56.239
high.
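This head-versus-tail arithmetic is worth doing explicitly. The numbers below are invented: a two-word head holding 0.9 of the mass and a huge tail sharing the remaining 0.1:

```python
# Invented numbers: a 2-word head plus a long tail of 49,998 junk
# words that together hold the remaining 10% of the probability mass.
head = {"stormy": 0.6, "night": 0.3}
n_tail = 49_998
tail_prob_each = 0.1 / n_tail  # any single tail word is tiny

print(f"any one tail word: {tail_prob_each:.2e}")
print(f"some tail word:    {n_tail * tail_prob_each:.2f}")
```

So even though each tail word is individually negligible, there is a 10% chance on every token of grabbing junk, and because decoding cannot backtrack, one bad draw can derail the rest of the output.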

00:39:54.159 --> 00:39:58.559
Some word from the tail is high. So to

00:39:56.239 --> 00:40:00.719
go back to this thing here. Yeah. Uh so

00:39:58.559 --> 00:40:03.519
in this particular example

00:40:00.719 --> 00:40:05.279
0.6 + 0.3, there is a 0.9 probability it's

00:40:03.519 --> 00:40:06.880
going to be either stormy or night but

00:40:05.280 --> 00:40:09.519
there is a 10% probability it's going to

00:40:06.880 --> 00:40:11.119
be one of these words

00:40:09.519 --> 00:40:12.639
and who knows what that word is

00:40:11.119 --> 00:40:15.358
going to be. It might be some random

00:40:12.639 --> 00:40:18.879
nonsense word right so what that means

00:40:15.358 --> 00:40:21.838
is and this goes to um

00:40:18.880 --> 00:40:24.160
this goes to the point from before. If the

00:40:21.838 --> 00:40:25.920
LLM happens to sample a token from the

00:40:24.159 --> 00:40:27.440
tail which is not good it won't be able

00:40:25.920 --> 00:40:29.920
to recover from its mistake it'll just

00:40:27.440 --> 00:40:31.679
go off the rails

00:40:29.920 --> 00:40:33.358
Which is why every word that gets

00:40:31.679 --> 00:40:35.919
generated is really important to get it

00:40:33.358 --> 00:40:37.759
right, because it can't recover very

00:40:35.920 --> 00:40:40.079
often.

00:40:37.760 --> 00:40:41.359
>> Is there a technical way to define the

00:40:40.079 --> 00:40:44.000
difference between the head and the

00:40:41.358 --> 00:40:45.358
tail? No,

00:40:44.000 --> 00:40:47.440
it's sort of like this common thing

00:40:45.358 --> 00:40:50.239
people use and the reason why it's not

00:40:47.440 --> 00:40:52.800
is because uh it's so problem dependent

00:40:50.239 --> 00:40:54.078
as to what, you know,

00:40:52.800 --> 00:40:55.680
basically you're saying that for any

00:40:54.079 --> 00:40:58.000
particular problem, depending on

00:40:55.679 --> 00:41:00.239
the question, the right number of words

00:40:58.000 --> 00:41:02.719
is probably 20; for a

00:41:00.239 --> 00:41:04.078
different question, maybe it's 40; for a

00:41:02.719 --> 00:41:05.919
totally different model, for the same

00:41:04.079 --> 00:41:09.200
question, maybe 10. So because of that

00:41:05.920 --> 00:41:12.960
variability we just can't figure it out

00:41:09.199 --> 00:41:14.318
Okay. So, um, all right. And I'll

00:41:12.960 --> 00:41:18.400
show you how to do this in just a

00:41:14.318 --> 00:41:22.719
moment. So, just for kicks, um I went in

00:41:18.400 --> 00:41:25.920
to GPT-3.5, uh, and then I said, students at

00:41:22.719 --> 00:41:29.118
the MIT Sloan School of Management are

00:41:25.920 --> 00:41:31.519
and I said predict the next word. Okay,

00:41:29.119 --> 00:41:33.599
so it turns out invited is the most

00:41:31.519 --> 00:41:35.838
likely next word followed by given,

00:41:33.599 --> 00:41:38.079
expected, required and able. These are

00:41:35.838 --> 00:41:40.960
the top five words.

00:41:38.079 --> 00:41:42.000
Okay. And the probabilities are 3%, 2%, you

00:41:40.960 --> 00:41:43.760
see the you know pretty small

00:41:42.000 --> 00:41:45.838
probabilities but then the words that

00:41:43.760 --> 00:41:47.440
are below it right the remaining

00:41:45.838 --> 00:41:50.078
whatever 50,000 odd words are even

00:41:47.440 --> 00:41:52.800
lower. Okay. So here the most likely

00:41:50.079 --> 00:41:54.800
word is invited. So what I did is I went

00:41:52.800 --> 00:41:56.720
in there and said okay let me try again

00:41:54.800 --> 00:41:59.119
now with "students at the MIT Sloan School of

00:41:56.719 --> 00:42:00.639
Management are invited." And now,

00:41:59.119 --> 00:42:03.519
autocomplete that, find me the next

00:42:00.639 --> 00:42:04.799
thing. So it comes back with, see, now

00:42:03.519 --> 00:42:07.119
this is my new prompt: students at the MIT Sloan

00:42:04.800 --> 00:42:08.640
School are invited to submit their original

00:42:07.119 --> 00:42:11.119
white papers to the annual MIT

00:42:08.639 --> 00:42:13.838
something. It seems reasonable. Doesn't

00:42:11.119 --> 00:42:16.640
seem bad, right? It seems reasonable.

00:42:13.838 --> 00:42:19.279
Okay. Now, let's mess it up a bit. So

00:42:16.639 --> 00:42:22.559
now I go in there and I noticed that the

00:42:19.280 --> 00:42:24.480
word masters and the word spending were

00:42:22.559 --> 00:42:26.480
much lower probability than these top

00:42:24.480 --> 00:42:28.400
five words. Right? I just mucked around

00:42:26.480 --> 00:42:31.599
till I found these things. So this is

00:42:28.400 --> 00:42:34.639
only 0.05%. This is 0.1%.

00:42:31.599 --> 00:42:36.640
So these are clearly in the tail, right?

00:42:34.639 --> 00:42:37.920
They're not the most likely. So I said,

00:42:36.639 --> 00:42:41.039
what's going to happen if I actually

00:42:37.920 --> 00:42:43.838
force it to use masters and then I force

00:42:41.039 --> 00:42:46.239
it to use spending? Okay, this is

00:42:43.838 --> 00:42:49.358
what you get: Students at the MIT Sloan School of

00:42:46.239 --> 00:42:52.639
Management are masters of chaos.

00:42:49.358 --> 00:42:53.920
They routinely blow past deadlines

00:42:52.639 --> 00:42:57.559
fracture and then I couldn't take it

00:42:53.920 --> 00:42:57.559
anymore. I stopped it.

00:42:58.000 --> 00:43:02.318
a single word

00:43:00.800 --> 00:43:03.760
and then I said, students at the Sloan School of

00:43:02.318 --> 00:43:05.838
Management are spending, which is the

00:43:03.760 --> 00:43:07.440
other unlikely word the semester

00:43:05.838 --> 00:43:11.960
learning life skills so far it looks

00:43:07.440 --> 00:43:11.960
promising through knitting socks

00:43:13.358 --> 00:43:17.519
I'm not making this stuff up but this is

00:43:14.639 --> 00:43:19.199
GPT-3.5,

00:43:17.519 --> 00:43:22.880
so yes it will go off the rails you have

00:43:19.199 --> 00:43:25.118
to be super careful um and so

00:43:22.880 --> 00:43:29.280
so the way we sort of tame random

00:43:25.119 --> 00:43:32.640
sampling to make it work for us uh

00:43:29.280 --> 00:43:35.920
>> Do you think that these sentences,

00:43:32.639 --> 00:43:38.078
like the "masters of chaos" who

00:43:35.920 --> 00:43:40.800
"blow past deadlines," is that something

00:43:38.079 --> 00:43:42.720
that was in the training set?

00:43:40.800 --> 00:43:45.200
>> Yeah. I mean, the thing is, it's

00:43:42.719 --> 00:43:47.039
basically doing some

00:43:45.199 --> 00:43:48.879
very rough and approximate pattern

00:43:47.039 --> 00:43:51.119
matching from all the training data it

00:43:48.880 --> 00:43:53.838
was trained on. So it doesn't mean for

00:43:51.119 --> 00:43:56.800
example that on the mit.edu

00:43:53.838 --> 00:43:59.039
website right on the collection of sites

00:43:56.800 --> 00:44:00.800
that there was actually text saying

00:43:59.039 --> 00:44:02.960
that yeah MIT Sloan students were doing

00:44:00.800 --> 00:44:06.400
all this crazy stuff it's probably more

00:44:02.960 --> 00:44:08.000
like a whole bunch of, you know, college

00:44:06.400 --> 00:44:09.519
university websites probably had some

00:44:08.000 --> 00:44:10.960
content like that maybe there was a

00:44:09.519 --> 00:44:12.559
bunch of Reddit people posting stuff

00:44:10.960 --> 00:44:14.400
like that so you're just doing some

00:44:12.559 --> 00:44:15.599
rough pattern matching. Basically, the

00:44:14.400 --> 00:44:16.960
thing is, you have to

00:44:15.599 --> 00:44:19.599
remember always with large language

00:44:16.960 --> 00:44:22.000
models, what it's trying to give you: it's

00:44:19.599 --> 00:44:23.680
giving you a response that is not

00:44:22.000 --> 00:44:25.358
implausible

00:44:23.679 --> 00:44:27.279
There is no guarantee of correctness.

00:44:25.358 --> 00:44:29.519
There's no accuracy. Nothing like that.

00:44:27.280 --> 00:44:32.000
It's giving you a probabilistically

00:44:29.519 --> 00:44:35.119
plausible response. That's it. Okay.

00:44:32.000 --> 00:44:36.880
Now, us being Sloan, uh, we look at

00:44:35.119 --> 00:44:39.200
stuff like this and we get offended. So,

00:44:36.880 --> 00:44:40.880
we are we are imputing our values onto

00:44:39.199 --> 00:44:43.919
its generation, but it doesn't know and

00:44:40.880 --> 00:44:46.079
it doesn't care.

00:44:43.920 --> 00:44:48.079
So in fact, when I typed in

00:44:46.079 --> 00:44:50.800
something like list all the awards that

00:44:48.079 --> 00:44:52.960
professor Ramak Krishna has won it gave

00:44:50.800 --> 00:44:55.440
me an amazing list of awards apparently

00:44:52.960 --> 00:44:58.639
I won this and I won that. None of

00:44:55.440 --> 00:45:00.240
it is true. To which a student said, not

00:44:58.639 --> 00:45:01.838
yet.

00:45:00.239 --> 00:45:05.039
So I made a note of that

00:45:01.838 --> 00:45:09.039
fine person's name. So [laughter]

00:45:05.039 --> 00:45:11.119
>> so yeah so that's what's going on.

00:45:09.039 --> 00:45:12.800
Yeah

00:45:11.119 --> 00:45:15.838
>> I get the sense like Maybe there's

00:45:12.800 --> 00:45:17.599
>> Could you use the microphone, please?

00:45:15.838 --> 00:45:20.480
>> I get the sense that maybe there's some

00:45:17.599 --> 00:45:23.519
sort of sliding window that's somehow

00:45:20.480 --> 00:45:26.480
weighting later words more strongly than

00:45:23.519 --> 00:45:28.079
earlier words given how far out because

00:45:26.480 --> 00:45:30.318
I feel like the context of students at

00:45:28.079 --> 00:45:32.000
MIT, right, should have steered in a

00:45:30.318 --> 00:45:34.318
certain direction even with the presence

00:45:32.000 --> 00:45:35.599
of the word masters. So, is there

00:45:34.318 --> 00:45:37.519
something like that happening?

00:45:35.599 --> 00:45:38.800
>> No, it is just the thing is think about

00:45:37.519 --> 00:45:41.199
the training process, right? In the

00:45:38.800 --> 00:45:42.800
training process, uh, we gave it

00:45:41.199 --> 00:45:45.519
sentence fragments and we asked it to

00:45:42.800 --> 00:45:48.240
predict the next word. Now, clearly the

00:45:45.519 --> 00:45:49.759
more you know about the input that's

00:45:48.239 --> 00:45:51.919
coming and the longer the input, the

00:45:49.760 --> 00:45:53.200
more clues you have to figure out what

00:45:51.920 --> 00:45:56.240
the right next prediction is going to

00:45:53.199 --> 00:45:58.480
be. Right? If I say the capital uh the

00:45:56.239 --> 00:46:00.239
capital of you'll be like, I don't know,

00:45:58.480 --> 00:46:01.440
it's got to be a country, I guess, or a

00:46:00.239 --> 00:46:03.039
state, but I don't know anything more

00:46:01.440 --> 00:46:06.318
than that. But if I say the

00:46:03.039 --> 00:46:08.719
capital of France is, dramatic narrowing

00:46:06.318 --> 00:46:11.039
of the cone of uncertainty. So that's

00:46:08.719 --> 00:46:12.480
basically what's going on. And in fact

00:46:11.039 --> 00:46:14.480
some there's a very beautiful expression

00:46:12.480 --> 00:46:17.679
I've heard, which is that what the

00:46:14.480 --> 00:46:20.159
LLMs do, they call it subtractive

00:46:17.679 --> 00:46:22.000
sculpting. So what I mean by that is

00:46:20.159 --> 00:46:24.559
it's sort of like when you start it's

00:46:22.000 --> 00:46:26.480
like this big block of marble and then

00:46:24.559 --> 00:46:27.838
every word chips away at the marble and

00:46:26.480 --> 00:46:29.599
then when you're done it's kind of

00:46:27.838 --> 00:46:31.358
pretty clear it's David inside the

00:46:29.599 --> 00:46:34.240
marble. Right? That's sort of what's

00:46:31.358 --> 00:46:36.559
going on.

00:46:34.239 --> 00:46:38.559
All right. So to come back to this, uh

00:46:36.559 --> 00:46:40.000
what can we do? We can there are three

00:46:38.559 --> 00:46:42.078
ways in which you can tune random

00:46:40.000 --> 00:46:44.400
sampling to make it work for you. The

00:46:42.079 --> 00:46:46.160
first way, and the idea of all

00:46:44.400 --> 00:46:48.800
these things is that you have some

00:46:46.159 --> 00:46:51.199
probability distribution. We are now

00:46:48.800 --> 00:46:53.680
going to sort of manually

00:46:51.199 --> 00:46:55.279
focus on the head and then we're going

00:46:53.679 --> 00:46:56.879
to kill everything else and

00:46:55.280 --> 00:46:58.400
sample from that head.

00:46:56.880 --> 00:46:59.920
Okay, which immediately begs the

00:46:58.400 --> 00:47:01.280
question, how will you decide what the

00:46:59.920 --> 00:47:02.880
head is? Right? And that was sort of

00:47:01.280 --> 00:47:04.640
Alina's question from before. How will

00:47:02.880 --> 00:47:07.440
you decide what the head is? So, one way

00:47:04.639 --> 00:47:08.559
we do that is to say, you know what, I

00:47:07.440 --> 00:47:11.280
know we have 50,000 words in the

00:47:08.559 --> 00:47:13.199
vocabulary. I don't care. Each time, I'm

00:47:11.280 --> 00:47:15.599
only going to pick the top K words,

00:47:13.199 --> 00:47:17.039
right? K could be 10, 20, 30, 40, 50.

00:47:15.599 --> 00:47:18.880
This very problem dependent. I'm going

00:47:17.039 --> 00:47:20.800
to pick the top 20 words and I'm going

00:47:18.880 --> 00:47:22.800
to ignore everything else and only

00:47:20.800 --> 00:47:24.800
sample from the top 10 or the top 20.

00:47:22.800 --> 00:47:25.920
That's called top K sampling. And so the

00:47:24.800 --> 00:47:27.440
way it works is that let's say this is

00:47:25.920 --> 00:47:28.720
your whole distribution and I just

00:47:27.440 --> 00:47:30.960
stopped at wet instead of going all the

00:47:28.719 --> 00:47:33.118
way to 50,000, right? And then you see

00:47:30.960 --> 00:47:36.240
and you decide let's say that you want k

00:47:33.119 --> 00:47:39.519
to be two. So you just grab the top two

00:47:36.239 --> 00:47:41.679
words, k equals 2, and then you renormalize

00:47:39.519 --> 00:47:45.119
the probabilities so they add up to one.

00:47:41.679 --> 00:47:46.799
So 0.6 and 0.2, renormalized, become 0.75 and

00:47:45.119 --> 00:47:48.480
0.25.

00:47:46.800 --> 00:47:50.160
And now just imagine that this is the

00:47:48.480 --> 00:47:52.240
new softmax table that you're sampling

00:47:50.159 --> 00:47:55.039
from and you grab a number from I'm

00:47:52.239 --> 00:47:58.159
sorry a word from here and you're done.

00:47:55.039 --> 00:48:00.639
Okay, that's this called top K sampling

00:47:58.159 --> 00:48:03.279
very commonly used
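A minimal sketch of top-k filtering, using the 0.6-and-0.2 head from the example above (the rest of the toy table is made up):

```python
def top_k_filter(softmax_table, k):
    """Keep only the k most likely words and renormalize to sum to 1."""
    top = sorted(softmax_table.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {w: p / total for w, p in top}

# Toy table; probabilities are invented for illustration.
table = {"stormy": 0.6, "night": 0.2, "rainy": 0.1, "wet": 0.05, "zebra": 0.05}

# k=2 keeps stormy and night; 0.6 and 0.2 renormalize to 0.75 and 0.25.
print(top_k_filter(table, 2))
```

You would then sample from the filtered table exactly as before; the tail simply no longer exists.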

00:48:00.639 --> 00:48:06.078
but it has a small shortcoming,

00:48:03.280 --> 00:48:07.680
which is that it basically assumes that

00:48:06.079 --> 00:48:11.119
this K that you have come up with, let's

00:48:07.679 --> 00:48:13.118
say 20, means for every input sentence the right

00:48:11.119 --> 00:48:15.519
number of words in the head is 20, which

00:48:13.119 --> 00:48:16.640
obviously is not a, you know, well-

00:48:15.519 --> 00:48:18.639
supported assumption; it's just an

00:48:16.639 --> 00:48:21.440
assumption so then the question becomes

00:48:18.639 --> 00:48:24.078
can we do better right because what you

00:48:21.440 --> 00:48:25.599
really want is you want the words that

00:48:24.079 --> 00:48:27.280
you pick to have the bulk of the

00:48:25.599 --> 00:48:29.440
probabilities,

00:48:27.280 --> 00:48:30.800
right? As much probability as possible.

00:48:29.440 --> 00:48:32.240
You don't really care how many words are

00:48:30.800 --> 00:48:34.800
inside it as long as together they have

00:48:32.239 --> 00:48:37.199
a lot of probability. Which brings us to

00:48:34.800 --> 00:48:39.359
something called top p sampling also

00:48:37.199 --> 00:48:40.639
called nucleus sampling where instead of

00:48:39.358 --> 00:48:42.719
deciding on the number of words we're

00:48:40.639 --> 00:48:45.118
going to pick every time, we decide you

00:48:42.719 --> 00:48:47.358
know what we're just going to

00:48:45.119 --> 00:48:49.119
choose all the words such that the

00:48:47.358 --> 00:48:51.679
total probability of the words we have

00:48:49.119 --> 00:48:53.039
chosen is at least P.

00:48:51.679 --> 00:48:54.639
Sometimes it may be just two words.

00:48:53.039 --> 00:48:58.880
Sometimes it may be 20 words. We don't

00:48:54.639 --> 00:49:02.000
care. And then we sample from it.

00:48:58.880 --> 00:49:05.280
Okay. So here, same thing here. Let's

00:49:02.000 --> 00:49:09.039
say you go with p = 0.9. So 0.6

00:49:05.280 --> 00:49:11.359
+ 0.2 + 0.1 = 0.9. Boom. We have hit 0.9. We

00:49:09.039 --> 00:49:14.400
stop and then we grab these three words

00:49:11.358 --> 00:49:16.799
and then we renormalize them to get this

00:49:14.400 --> 00:49:18.079
thing and then boom, we sample from it.

00:49:16.800 --> 00:49:19.839
So this actually is even more effective

00:49:18.079 --> 00:49:21.599
in my opinion because it sort of

00:49:19.838 --> 00:49:23.440
fluctuates. It doesn't hardcode the

00:49:21.599 --> 00:49:25.920
number of words you think is important.
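Top-p (nucleus) filtering follows the same walk-and-renormalize pattern; here is a minimal sketch with a toy table of made-up probabilities:

```python
def top_p_filter(softmax_table, p):
    """Keep the smallest high-probability prefix whose cumulative mass
    reaches p, then renormalize. Also known as nucleus sampling."""
    ranked = sorted(softmax_table.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p:  # stop as soon as we cross the threshold
            break
    total = sum(prob for _, prob in kept)
    return {w: prob / total for w, prob in kept}

# Toy table; probabilities are invented for illustration.
table = {"stormy": 0.6, "night": 0.2, "rainy": 0.1, "wet": 0.05, "zebra": 0.05}

# p=0.9 keeps stormy + night + rainy (0.6 + 0.2 + 0.1 = 0.9).
print(top_p_filter(table, 0.9))
```

Note how the number of surviving words is decided by the distribution itself, not fixed in advance the way k is.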

00:49:23.440 --> 00:49:29.440
Uh was there a question? Yeah.

00:49:25.920 --> 00:49:32.720
>> What if like let's say 0.9 ended up like

00:49:29.440 --> 00:49:33.838
if foggy was 0.12 will it only take 0.1

00:49:32.719 --> 00:49:35.519
from foggy?

00:49:33.838 --> 00:49:37.199
>> Yeah. What it does is, so you give it

00:49:35.519 --> 00:49:39.599
a 0.9. What it's going to do

00:49:37.199 --> 00:49:43.598
is it's going to keep adding words till

00:49:39.599 --> 00:49:46.640
it just crosses that number.

00:49:43.599 --> 00:49:50.240
>> Yeah. I was thinking, can't you just set

00:49:46.639 --> 00:49:53.598
a threshold for the words? Don't

00:49:50.239 --> 00:49:57.118
pick a word below some probability. This top

00:49:53.599 --> 00:49:59.680
p, what if it was like 0.89

00:49:57.119 --> 00:50:00.800
and then the other one is just 0.1. So

00:49:59.679 --> 00:50:03.440
you pick two words.

00:50:00.800 --> 00:50:04.960
>> Yeah, you can do that. Um and in fact in

00:50:03.440 --> 00:50:06.240
what you can do is you can always say I

00:50:04.960 --> 00:50:08.480
want to pick a word which is the most

00:50:06.239 --> 00:50:12.078
likely word, right? You can do that. But

00:50:08.480 --> 00:50:13.760
if you say, um, I want to only

00:50:12.079 --> 00:50:15.760
consider words whose probabilities are

00:50:13.760 --> 00:50:16.640
at least something then basically what

00:50:15.760 --> 00:50:18.559
you're saying is that I'm just going to

00:50:16.639 --> 00:50:21.039
keep on doing and then we draw a line

00:50:18.559 --> 00:50:23.839
here right but the problem is you don't

00:50:21.039 --> 00:50:25.519
know how many words have crept over your

00:50:23.838 --> 00:50:27.679
threshold

00:50:25.519 --> 00:50:29.759
right? You might, for example, find that,

00:50:27.679 --> 00:50:31.598
to go to your example, maybe you said 0.9

00:50:29.760 --> 00:50:33.520
as a threshold, and maybe there

00:50:31.599 --> 00:50:34.559
was a word at 0.89

00:50:33.519 --> 00:50:36.079
that you just missed because it didn't

00:50:34.559 --> 00:50:38.000
make the threshold. You'll be like, oh no,

00:50:36.079 --> 00:50:40.079
I should have made it 0.89. So there's no

00:50:38.000 --> 00:50:41.838
right answer, unfortunately. But this

00:50:40.079 --> 00:50:43.680
is exactly the kind

00:50:41.838 --> 00:50:46.239
of thinking that brought us these kinds

00:50:43.679 --> 00:50:48.639
of ways to tune these things

00:50:46.239 --> 00:50:51.118
all sort of you know the foundation here

00:50:48.639 --> 00:50:53.279
is the realization that we cannot

00:50:51.119 --> 00:50:54.800
sort of a priori decide what the

00:50:53.280 --> 00:50:56.720
right number of words is. So we have to

00:50:54.800 --> 00:50:58.318
find heuristics to try to do these

00:50:56.719 --> 00:51:00.318
things. So in practice people try all

00:50:58.318 --> 00:51:02.000
these methods. In fact you can do both.

00:51:00.318 --> 00:51:04.558
You can set it up so that you

00:51:02.000 --> 00:51:07.358
can do top p and top k at the same time.

00:51:04.559 --> 00:51:10.880
Basically you're saying grab words uh

00:51:07.358 --> 00:51:14.920
till you cross the probability uh or you

00:51:10.880 --> 00:51:14.920
cross k whichever is earlier.
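Doing both at once is just the same loop with two stopping conditions, stopping at whichever triggers first. A sketch with made-up numbers:

```python
def top_k_and_p_filter(softmax_table, k, p):
    """Walk words from most to least likely; stop when either we have
    taken k words or their cumulative probability has reached p,
    whichever comes first. Then renormalize the survivors."""
    ranked = sorted(softmax_table.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if len(kept) >= k or cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {w: prob / total for w, prob in kept}

# Toy table; probabilities are invented for illustration.
table = {"stormy": 0.6, "night": 0.2, "rainy": 0.1, "wet": 0.05, "zebra": 0.05}

# p=0.9 alone would keep 3 words, but k=2 cuts it off first.
print(top_k_and_p_filter(table, k=2, p=0.9))
```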

00:51:15.199 --> 00:51:19.118
Okay. So those are two methods people

00:51:17.358 --> 00:51:21.598
use heavily.

00:51:19.119 --> 00:51:23.680
The third method is called

00:51:21.599 --> 00:51:26.640
temperature. And the idea of

00:51:23.679 --> 00:51:28.719
temperature is that in top K and top P,

00:51:26.639 --> 00:51:31.598
we have to decide on a number

00:51:28.719 --> 00:51:33.279
up front K or P and then we just draw

00:51:31.599 --> 00:51:35.599
the line and look at the words that pass

00:51:33.280 --> 00:51:37.440
the threshold. Temperature is like a

00:51:35.599 --> 00:51:39.838
softer way to do the same thing. It's

00:51:37.440 --> 00:51:44.159
a softer way to emphasize the head more

00:51:39.838 --> 00:51:47.159
than the tail. So um I think iPad. All

00:51:44.159 --> 00:51:47.159
right.

00:51:52.960 --> 00:52:01.358
So the idea of temperature is remember

00:51:55.039 --> 00:52:04.400
uh, when we have this, um, softmax.

00:52:01.358 --> 00:52:06.639
So, you know, aardvark

00:52:04.400 --> 00:52:09.039
all the way to zebra

00:52:06.639 --> 00:52:10.799
you have all these probabilities right

00:52:09.039 --> 00:52:12.239
now remember where did we get these

00:52:10.800 --> 00:52:15.440
probabilities? These probabilities came from

00:52:12.239 --> 00:52:18.799
a softmax. So what is a softmax? We

00:52:15.440 --> 00:52:22.240
basically had you know all these nodes

00:52:18.800 --> 00:52:23.680
say 50,000 nodes in some output layer

00:52:22.239 --> 00:52:27.519
and these were just numbers let's just

00:52:23.679 --> 00:52:29.598
call them a1 through a 50,000

00:52:27.519 --> 00:52:31.838
and then we ran it through a softmax

00:52:29.599 --> 00:52:36.160
function and what did it do it basically

00:52:31.838 --> 00:52:39.358
did e^(a1), e^(a2), all the way to

00:52:36.159 --> 00:52:40.719
e^(an), and then it

00:52:39.358 --> 00:52:42.880
divided each by the sum of all these

00:52:40.719 --> 00:52:47.039
things to get the probabilities. So this

00:52:42.880 --> 00:52:51.640
number became e^(a1) divided by the

00:52:47.039 --> 00:52:51.639
sum of all the e^(ai).

00:52:52.400 --> 00:52:55.920
okay, so e^(a1) divided by e^(a1)

00:52:54.159 --> 00:52:57.598
plus e^(a2) and so on and so forth. So

00:52:55.920 --> 00:52:59.519
this how softmax works. I'm just

00:52:57.599 --> 00:53:03.200
refreshing your memory from a few weeks

00:52:59.519 --> 00:53:06.719
ago. Okay. Now what temperature does is

00:53:03.199 --> 00:53:08.558
that let me just write it a little

00:53:06.719 --> 00:53:13.358
easier.

00:53:08.559 --> 00:53:15.359
So e^(a1) plus e^(a2), all the

00:53:13.358 --> 00:53:18.358
way

00:53:15.358 --> 00:53:18.358
and

00:53:18.480 --> 00:53:22.800
what it does is it introduces a new

00:53:20.159 --> 00:53:27.799
parameter here called temperature which

00:53:22.800 --> 00:53:27.800
is that we divide everything here by t.
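The temperature-modified softmax just described, with every a divided by t before exponentiating, can be written directly. The logits below are made-up output-layer activations:

```python
import math

def softmax_with_temperature(logits, t):
    """Softmax where every logit is divided by t before exponentiating."""
    scaled = [a / t for a in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up activations a1, a2, a3

print(softmax_with_temperature(logits, 1.0))   # ordinary softmax
print(softmax_with_temperature(logits, 0.05))  # t near 0: biggest logit dominates
print(softmax_with_temperature(logits, 10.0))  # large t: nearly uniform
```

As t shrinks toward zero the distribution collapses onto the single biggest logit, and as t grows it flattens toward uniform; that is exactly the knob being described here.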

00:53:41.679 --> 00:53:45.519
And the effect of adding this little

00:53:43.358 --> 00:53:48.159
knob called temperature here, right, is

00:53:45.519 --> 00:53:50.800
very interesting. So let's assume for a

00:53:48.159 --> 00:53:52.399
second that t is a very very small

00:53:50.800 --> 00:53:53.920
number.

00:53:52.400 --> 00:53:57.838
Assume that t is pretty close to zero,

00:53:53.920 --> 00:54:00.838
very small number. So if t is close to

00:53:57.838 --> 00:54:00.838
zero,

00:54:00.960 --> 00:54:05.280
what's going to happen is that since

00:54:03.199 --> 00:54:06.799
it's in the denominator here, all these

00:54:05.280 --> 00:54:08.319
numbers,

00:54:06.800 --> 00:54:10.800
all these numbers are going to become

00:54:08.318 --> 00:54:13.119
really big because t is really small.

00:54:10.800 --> 00:54:14.240
Right? If a1 happens to be a positive

00:54:13.119 --> 00:54:15.519
number, it's going to become really big.

00:54:14.239 --> 00:54:16.799
If a1 is a negative number, it's going

00:54:15.519 --> 00:54:19.358
to be a really really small negative

00:54:16.800 --> 00:54:20.880
number. Okay? Now in particular, what's

00:54:19.358 --> 00:54:23.838
going to happen is the biggest of all

00:54:20.880 --> 00:54:26.559
the a numbers, it was already big. Now

00:54:23.838 --> 00:54:28.239
it's going to get massive

00:54:26.559 --> 00:54:30.240
which means that its probability is

00:54:28.239 --> 00:54:31.838
going to dominate everything else

00:54:30.239 --> 00:54:35.039
because you're taking a really big

00:54:31.838 --> 00:54:37.599
number and doing e raised to that number.

00:54:35.039 --> 00:54:40.400
So what's going to happen is that wait

00:54:37.599 --> 00:54:46.039
what did this

00:54:40.400 --> 00:54:46.039
okay so if t is close to zero

00:54:47.280 --> 00:54:51.160
the biggest a

00:54:56.000 --> 00:55:05.559
Uh, hold on.

00:54:59.199 --> 00:55:05.558
The word corresponding to the biggest A

00:55:06.960 --> 00:55:12.760
will have a probability of one or close

00:55:09.599 --> 00:55:12.760
to one.

00:55:12.800 --> 00:55:15.680
And since all the probabilities have to

00:55:14.480 --> 00:55:17.599
add up to one, which means that

00:55:15.679 --> 00:55:18.960
everything else is going to be zero. So

00:55:17.599 --> 00:55:20.160
the biggest A will have a probability of

00:55:18.960 --> 00:55:22.480
one. Everything else is going to have

00:55:20.159 --> 00:55:24.159
zero. So reducing temperature close to

00:55:22.480 --> 00:55:25.679
zero means that the probability

00:55:24.159 --> 00:55:27.358
distribution is going to peak at the

00:55:25.679 --> 00:55:29.358
biggest word and everything else is going to

00:55:27.358 --> 00:55:30.960
become zero. So in practice what that

00:55:29.358 --> 00:55:34.960
means is that if you look at something

00:55:30.960 --> 00:55:37.760
like this if you apply um

00:55:34.960 --> 00:55:40.240
temperature here

00:55:37.760 --> 00:55:43.200
what's going to happen is that stormiest

00:55:40.239 --> 00:55:46.000
thing is going to get something like .999

00:55:43.199 --> 00:55:49.480
and everything else right it's going to

00:55:46.000 --> 00:55:49.480
get wiped out

00:55:49.838 --> 00:55:52.880
right it's going to get really small

00:55:51.440 --> 00:55:55.599
it's going to get even smaller and so on

00:55:52.880 --> 00:55:57.358
and so forth and so when t is exactly

00:55:55.599 --> 00:55:59.519
zero basically what that means is that

00:55:57.358 --> 00:56:00.798
this is going to be exactly one

00:55:59.519 --> 00:56:02.719
and everything was going to just get

00:56:00.798 --> 00:56:03.838
zero. So when one of them is one and

00:56:02.719 --> 00:56:05.039
everything else is zero when you do

00:56:03.838 --> 00:56:07.119
sampling from it you're just picking the

00:56:05.039 --> 00:56:10.480
the big number right which means it sort

00:56:07.119 --> 00:56:12.480
it becomes greedy decoding.

00:56:10.480 --> 00:56:14.960
So that is the value of having

00:56:12.480 --> 00:56:16.798
temperature as a knob. Conversely, if

00:56:14.960 --> 00:56:19.519
you take temperature T and make it

00:56:16.798 --> 00:56:22.159
bigger and bigger, right, as opposed to

00:56:19.519 --> 00:56:24.159
smaller and smaller, this distribution

00:56:22.159 --> 00:56:25.199
is going to become flat. Meaning all the

00:56:24.159 --> 00:56:27.679
words are going to have the same

00:56:25.199 --> 00:56:29.358
probability.

00:56:27.679 --> 00:56:32.399
So a any one of these words becomes

00:56:29.358 --> 00:56:34.639
equally likely. So t close to zero, the

00:56:32.400 --> 00:56:38.160
biggest word gets picked. T

00:56:34.639 --> 00:56:40.078
exceeds one, goes to say 1.5, 2,

00:56:38.159 --> 00:56:42.318
any word becomes likely. It becomes

00:56:40.079 --> 00:56:44.880
truly random. So that is the effect of

00:56:42.318 --> 00:56:47.759
temperature.

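The two limits just described can be checked numerically. A self-contained sketch, where the logits are made-up scores for three hypothetical candidate words (t = 0 exactly would divide by zero; it is the greedy-decoding limit):

```python
import math

def softmax(logits, t):
    """Temperature-scaled softmax (no stability tricks; fine for this demo)."""
    exps = [math.exp(a / t) for a in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [3.0, 1.0, 0.5]            # made-up scores for three candidate words

cold = softmax(logits, t=0.05)      # t near zero: the biggest logit dominates
hot = softmax(logits, t=50.0)       # t very large: the distribution flattens
```

cold puts essentially all of its mass on the first word (sampling becomes greedy decoding), while hot is close to uniform (any word becomes roughly equally likely).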
00:56:44.880 --> 00:56:50.559
And this knob, you can actually tune it.

00:56:47.760 --> 00:56:53.119
Um,

00:56:50.559 --> 00:56:56.000
all right. So, uh, this is called, uh,

00:56:53.119 --> 00:56:57.519
I'm at

00:56:56.000 --> 00:56:59.599
platform.openai.com.

00:56:57.519 --> 00:57:01.119
It's called the OpenAI playground. And

00:56:59.599 --> 00:57:02.640
in this playground, you can actually put

00:57:01.119 --> 00:57:04.400
in all the sentences you want. You can

00:57:02.639 --> 00:57:05.598
choose the model and then you can it'll

00:57:04.400 --> 00:57:09.920
actually tell you what the softmax

00:57:05.599 --> 00:57:12.079
output is. Okay, it's very handy. So

00:57:09.920 --> 00:57:13.358
this is where I said oh so here are a

00:57:12.079 --> 00:57:15.039
few things I want to draw your attention

00:57:13.358 --> 00:57:18.239
to. The first one is you see temperature

00:57:15.039 --> 00:57:20.880
here the default is one. If you make it

00:57:18.239 --> 00:57:22.879
zero it becomes greedy decoding but you

00:57:20.880 --> 00:57:24.400
can make it more than one if you want.

00:57:22.880 --> 00:57:27.280
It'll give you all kinds of crazy stuff

00:57:24.400 --> 00:57:30.480
as you will see in a second. Okay. Um

00:57:27.280 --> 00:57:32.798
and then they don't have top K. They

00:57:30.480 --> 00:57:35.519
don't have support for top K openai but

00:57:32.798 --> 00:57:37.838
they do have support for top P. You can

00:57:35.519 --> 00:57:38.880
put P here in this thing. And I'll

00:57:37.838 --> 00:57:40.558
ignore these things. You can read the

00:57:38.880 --> 00:57:42.318
documentation uh to understand those

00:57:40.559 --> 00:57:44.319
things. But you can actually ask it to

00:57:42.318 --> 00:57:46.159
show the probabilities. So I'm going to

00:57:44.318 --> 00:57:48.480
ask it to show all the probabilities.

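The top P knob mentioned above is nucleus sampling: keep only the smallest set of most-probable words whose probabilities add up to at least p, renormalize, and sample from that set. A toy sketch; the helper top_p_filter is hypothetical, not OpenAI's implementation:

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p,
    then renormalize so the kept probabilities sum to one."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for idx, pr in ranked:
        kept.append((idx, pr))
        cum += pr
        if cum >= p:                      # nucleus is large enough, stop
            break
    total = sum(pr for _, pr in kept)
    return {idx: pr / total for idx, pr in kept}
```

For example, with word probabilities [0.5, 0.3, 0.15, 0.05] and p = 0.75, only the first two words survive, renormalized to 0.625 and 0.375.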
00:57:46.159 --> 00:57:50.879
I'm also going to tell it um don't go

00:57:48.480 --> 00:57:53.920
nuts. Just give me like a few outputs.

00:57:50.880 --> 00:57:55.920
Let's just call it 30. Okay. And now I'm

00:57:53.920 --> 00:57:57.440
going to enter some sentences for us to

00:57:55.920 --> 00:57:59.920
see what's going on. So let's enter the

00:57:57.440 --> 00:58:03.519
same sentence as before. students

00:57:59.920 --> 00:58:05.039
at the MIT

00:58:03.519 --> 00:58:08.079
Sloan

00:58:05.039 --> 00:58:10.798
School of Management

00:58:08.079 --> 00:58:13.798
or I think that's what we had right so

00:58:10.798 --> 00:58:13.798
submit

00:58:14.000 --> 00:58:18.239
so okay this is what it's filling out

00:58:16.159 --> 00:58:20.399
now you go click on this word you get

00:58:18.239 --> 00:58:23.118
all the probabilities

00:58:20.400 --> 00:58:25.119
pretty cool right so you can see invited

00:58:23.119 --> 00:58:27.440
given expected these are all some of the

00:58:25.119 --> 00:58:32.400
things we had u and so what you can do

00:58:27.440 --> 00:58:36.000
is you can go in and say here clearly uh

00:58:32.400 --> 00:58:40.079
aching. What is that?

00:58:36.000 --> 00:58:41.358
That's very weird. So I'm going to again

00:58:40.079 --> 00:58:43.200
I'm just going to check to make sure

00:58:41.358 --> 00:58:46.078
that I use the same sentence as before.

00:58:43.199 --> 00:58:50.558
It's very brittle. Students, MIT, school,

00:58:46.079 --> 00:58:54.440
management are okay. Uh are

00:58:50.559 --> 00:58:54.440
oh I know what it is.

00:58:54.798 --> 00:59:01.719
Okay.

00:58:57.519 --> 00:59:01.719
Okay. So, let's try that again.

00:59:03.679 --> 00:59:08.159
Okay. So, invited 3.18. That's what we

00:59:05.920 --> 00:59:10.559
had, right? Invited 3.19. 3.8. Okay.

00:59:08.159 --> 00:59:12.480
Close enough. So, this is what we have.

00:59:10.559 --> 00:59:15.040
And now, if you wanted to force it to

00:59:12.480 --> 00:59:18.798
choose invited here, you just go in

00:59:15.039 --> 00:59:20.000
there and make the temperature zero.

00:59:18.798 --> 00:59:21.519
Temperature zero means it's always going

00:59:20.000 --> 00:59:25.039
to pick the best one. Greedy decoding.

00:59:21.519 --> 00:59:27.119
So, you can hit it again.

00:59:25.039 --> 00:59:29.519
And it better give you invited. See it

00:59:27.119 --> 00:59:31.119
has given you invited.

00:59:29.519 --> 00:59:34.079
So that's how you manipulate it using

00:59:31.119 --> 00:59:35.760
temperature. Um you can also ask it you

00:59:34.079 --> 00:59:38.079
can also manipulate top P. You can do

00:59:35.760 --> 00:59:40.000
all these things, right? But

00:59:38.079 --> 00:59:41.839
people actually use it very

00:59:40.000 --> 00:59:42.798
heavily for debugging right and for when

00:59:41.838 --> 00:59:44.239
they're playing with a bunch of data

00:59:42.798 --> 00:59:45.679
with a model for that particular use

00:59:44.239 --> 00:59:46.879
case. You just play with it to get a

00:59:45.679 --> 00:59:48.399
sense for what kinds of probability

00:59:46.880 --> 00:59:50.400
distributions you see and then you can

00:59:48.400 --> 00:59:54.480
fine-tune it using that using that

00:59:50.400 --> 00:59:58.079
knowledge. Um so yeah check this out.

00:59:54.480 --> 01:00:01.199
Oh, uh, I said that if the temperature

00:59:58.079 --> 01:00:03.119
goes above one to a higher number, every

01:00:01.199 --> 01:00:04.798
word in the 50,000 becomes sort of

01:00:03.119 --> 01:00:06.400
equally likely, which means it's going

01:00:04.798 --> 01:00:07.679
to produce garbage, right? So, let's

01:00:06.400 --> 01:00:09.200
actually see garbage production in

01:00:07.679 --> 01:00:11.838
action.

01:00:09.199 --> 01:00:13.439
So, all right, let's just nuke this.

01:00:11.838 --> 01:00:15.519
Okay, and I'm going to take the

01:00:13.440 --> 01:00:19.280
temperature and max it. I'm going to

01:00:15.519 --> 01:00:22.000
call it two. Okay, which means that

01:00:19.280 --> 01:00:25.000
literally anything is possible.

01:00:22.000 --> 01:00:25.000
Submit.

01:00:25.838 --> 01:00:32.039
Ladies and gentlemen, I present to you a

01:00:28.079 --> 01:00:32.039
modern large language model.

01:00:35.838 --> 01:00:39.599
Isn't it like shocking

01:00:38.079 --> 01:00:41.760
>> because when we work with these language

01:00:39.599 --> 01:00:43.039
models we have, we always when we see it

01:00:41.760 --> 01:00:45.119
doing some smart things, we always

01:00:43.039 --> 01:00:46.480
ascribe some level of, you know,

01:00:45.119 --> 01:00:48.960
interesting abilities and intelligence

01:00:46.480 --> 01:00:50.318
and so on and then you realize all I had

01:00:48.960 --> 01:00:52.798
to go in go in there and change one

01:00:50.318 --> 01:00:54.719
parameter and it's garbage.

01:00:52.798 --> 01:00:56.480
So you can see the amount of garbage

01:00:54.719 --> 01:00:58.879
right it's showing just by twiddling one

01:00:56.480 --> 01:01:00.240
parameter. So, in

01:00:58.880 --> 01:01:01.358
production use cases when you're

01:01:00.239 --> 01:01:02.798
building applications on top of these

01:01:01.358 --> 01:01:05.279
large language models you got to be very

01:01:02.798 --> 01:01:09.039
very careful with these parameters. So

01:01:05.280 --> 01:01:12.359
pay attention. All right. So um what did

01:01:09.039 --> 01:01:12.358
I have next?

01:01:13.679 --> 01:01:21.639
Okay. So that brings us to the uh sort

01:01:17.440 --> 01:01:21.639
of the end of the decoding section.

01:01:22.798 --> 01:01:27.119
Oh, see now I'm going to switch gears

01:01:24.798 --> 01:01:30.798
and talk about tokenization, right?

01:01:27.119 --> 01:01:32.318
which is that um when so far in all the

01:01:30.798 --> 01:01:34.159
the the things we have done including

01:01:32.318 --> 01:01:36.798
the homeworks and so on we looked at

01:01:34.159 --> 01:01:38.639
this tokenization the standard process

01:01:36.798 --> 01:01:41.039
right for taking a bunch of text and

01:01:38.639 --> 01:01:44.960
vectorizing it, which was the STIE

01:01:41.039 --> 01:01:46.639
process: standardize, tokenize, index,

01:01:44.960 --> 01:01:48.559
right and then encode and the

01:01:46.639 --> 01:01:50.558
standardization I had mentioned earlier

01:01:48.559 --> 01:01:53.200
uh strips out punctuation lower cases

01:01:50.559 --> 01:01:55.359
everything uh sometimes removes stop

01:01:53.199 --> 01:01:57.118
words like a and the things like that it

01:01:55.358 --> 01:01:59.440
also does these things called stemming

01:01:57.119 --> 01:02:02.960
But turns out if you actually work with

01:01:59.440 --> 01:02:04.480
uh something like GPT, you know that

01:02:02.960 --> 01:02:06.159
it hasn't stripped out punctuation. The

01:02:04.480 --> 01:02:08.079
punctuation is really good, right? It

01:02:06.159 --> 01:02:10.078
uses case, uppercase, and lower case.

01:02:08.079 --> 01:02:11.920
And in fact, even better, you can

01:02:10.079 --> 01:02:13.359
actually make up a word as part of your

01:02:11.920 --> 01:02:15.760
question and it'll use the word

01:02:13.358 --> 01:02:18.000
consistently in the output. So just for

01:02:15.760 --> 01:02:22.240
fun,

01:02:18.000 --> 01:02:23.599
um I made up a word.

01:02:22.239 --> 01:02:24.879
I just did this yesterday, a day before.

01:02:23.599 --> 01:02:28.160
I said, here's a new word and its

01:02:24.880 --> 01:02:30.079
definition. The word is relo

01:02:28.159 --> 01:02:31.759
backwards.

01:02:30.079 --> 01:02:33.680
I said the definition a student who

01:02:31.760 --> 01:02:35.200
understands deep learning backwards

01:02:33.679 --> 01:02:37.358
please use this word in a sentence. And

01:02:35.199 --> 01:02:39.759
here is a sentence it's coming up with.

01:02:37.358 --> 01:02:41.838
Um

01:02:39.760 --> 01:02:43.200
I was like a little shocked during the

01:02:41.838 --> 01:02:45.838
advanced neural network seminar. It

01:02:43.199 --> 01:02:47.439
became evident that Jane was a true relo

01:02:45.838 --> 01:02:48.719
effortlessly explaining even the most

01:02:47.440 --> 01:02:50.720
complex deep learning concepts in

01:02:48.719 --> 01:02:53.039
reverse order.

01:02:50.719 --> 01:02:54.719
Okay. So it clearly knows how to use

01:02:53.039 --> 01:02:56.639
anything you may make up. Right? So

01:02:54.719 --> 01:02:59.039
it has the ability to compose things

01:02:56.639 --> 01:03:01.118
from scratch as opposed to just looking

01:02:59.039 --> 01:03:02.960
up stuff. So where is the thing coming

01:03:01.119 --> 01:03:04.559
from? Right? That's the question. And

01:03:02.960 --> 01:03:06.720
the answer is this very beautiful thing

01:03:04.559 --> 01:03:10.040
called byte pair encoding, which we'll

01:03:06.719 --> 01:03:10.039
look at next.

01:03:10.559 --> 01:03:15.599
So all right. So what here um when we

01:03:14.318 --> 01:03:17.119
look at this process, the

01:03:15.599 --> 01:03:18.400
disadvantages are some of the things we

01:03:17.119 --> 01:03:19.920
have discussed which is that we want to

01:03:18.400 --> 01:03:21.119
be able to preserve punctuation. We want

01:03:19.920 --> 01:03:22.318
to be able to preserve case. We want to

01:03:21.119 --> 01:03:26.240
be able to handle new words and so on

01:03:22.318 --> 01:03:28.318
and so forth. So uh the new like the the

01:03:26.239 --> 01:03:29.759
sort of the modern models like BERT and

01:03:28.318 --> 01:03:31.599
so on they use different tokenization

01:03:29.760 --> 01:03:34.720
schemes. They don't actually do the STIE

01:03:31.599 --> 01:03:37.519
thing, and the GPT family uses byte pair

01:03:34.719 --> 01:03:40.399
encoding BPE. Uh BERT uses something

01:03:37.519 --> 01:03:42.719
called WordPiece. For all of these ways of

01:03:40.400 --> 01:03:44.720
encoding, the fundamental idea is to

01:03:42.719 --> 01:03:46.078
say, well, you know what? Why don't

01:03:44.719 --> 01:03:47.598
whatever language you're working with,

01:03:46.079 --> 01:03:50.000
why don't we start first of all with all

01:03:47.599 --> 01:03:51.359
the individual characters? Because if

01:03:50.000 --> 01:03:53.039
you could actually work with individual

01:03:51.358 --> 01:03:56.000
characters, you can clearly compose any

01:03:53.039 --> 01:03:58.880
word that comes up, right? Reo is just R

01:03:56.000 --> 01:04:00.318
E L D O H, right? Six tokens. If you're

01:03:58.880 --> 01:04:02.720
working with characters at the character

01:04:00.318 --> 01:04:05.679
level, but working only with characters

01:04:02.719 --> 01:04:07.838
is not great, right? because that means

01:04:05.679 --> 01:04:09.279
that the model you're giving it no

01:04:07.838 --> 01:04:11.199
information about the world. It has to

01:04:09.280 --> 01:04:14.160
learn every word from scratch, what the

01:04:11.199 --> 01:04:15.439
word means and so on and so forth. So we

01:04:14.159 --> 01:04:17.759
it would be nice if we can actually give

01:04:15.440 --> 01:04:20.159
it words as well. But we don't we don't

01:04:17.760 --> 01:04:22.400
want to give it infrequent words because

01:04:20.159 --> 01:04:25.118
infrequent words by definition are not

01:04:22.400 --> 01:04:26.480
worth adding to your vocabulary. We're

01:04:25.119 --> 01:04:28.318
just going to you know take up another

01:04:26.480 --> 01:04:30.000
embedding vector and things like that.

01:04:28.318 --> 01:04:31.679
For infrequent words, we'll just

01:04:30.000 --> 01:04:32.960
compose them. We'll

01:04:31.679 --> 01:04:35.440
actually construct them on the fly

01:04:32.960 --> 01:04:37.199
because we can always use characters.

01:04:35.440 --> 01:04:38.880
Okay, so we don't want to put every word

01:04:37.199 --> 01:04:41.199
in there. We only want to put frequent

01:04:38.880 --> 01:04:43.039
words. But to give this thing the

01:04:41.199 --> 01:04:45.038
ability to compose new words and not

01:04:43.039 --> 01:04:47.520
always have to go to characters, we will

01:04:45.039 --> 01:04:52.000
give it parts of words. These are called

01:04:47.519 --> 01:04:54.000
subwords. So the key idea is that let's

01:04:52.000 --> 01:04:56.880
come up with a way to build a vocabulary

01:04:54.000 --> 01:04:59.679
which has characters full words that are

01:04:56.880 --> 01:05:01.838
frequent enough to be worth adding and

01:04:59.679 --> 01:05:03.519
subwords or word fragments that occur

01:05:01.838 --> 01:05:07.279
frequently enough to be worth adding. So

01:05:03.519 --> 01:05:09.759
for example the word standardize

01:05:07.280 --> 01:05:11.119
right normalize standardize and so on

01:05:09.760 --> 01:05:12.880
and so forth. "Ize" is going to show up a

01:05:11.119 --> 01:05:14.318
lot in many places. So you don't want to

01:05:12.880 --> 01:05:15.680
have standardize and normalize and so

01:05:14.318 --> 01:05:17.679
on. You just want to have "ize". You can

01:05:15.679 --> 01:05:19.598
just attach it to all kinds of words,

01:05:17.679 --> 01:05:20.960
right? And make it all work, right? So

01:05:19.599 --> 01:05:23.760
that's the basic idea of all these

01:05:20.960 --> 01:05:25.679
tokenization schemes. And BPE is one such

01:05:23.760 --> 01:05:27.039
way to figure out how to actually

01:05:25.679 --> 01:05:29.358
construct this vocabulary from a

01:05:27.039 --> 01:05:31.359
training corpus, right? And by the way,

01:05:29.358 --> 01:05:33.279
when I say characters, this will include

01:05:31.358 --> 01:05:34.639
not just, you know, uppercase, lowercase

01:05:33.280 --> 01:05:37.039
alphabets and numbers, it may it will

01:05:34.639 --> 01:05:38.318
also include punctuation.

01:05:37.039 --> 01:05:40.640
So that all these things just become

01:05:38.318 --> 01:05:42.960
atomic units.

01:05:40.639 --> 01:05:45.519
All right. So uh so what we're going to

01:05:42.960 --> 01:05:47.599
the way BPE works is that uh we're going

01:05:45.519 --> 01:05:49.679
to uh start with each character as a

01:05:47.599 --> 01:05:51.039
token and I'll talk about the rest of

01:05:49.679 --> 01:05:52.318
the thing on the page in just a moment.

01:05:51.039 --> 01:05:53.920
Don't worry about it. We'll start with

01:05:52.318 --> 01:05:56.480
each character as a token. So let's say

01:05:53.920 --> 01:05:58.720
that your training corpus is just a

01:05:56.480 --> 01:06:02.079
single sentence. The cat sat on the mat.

01:05:58.719 --> 01:06:03.838
Okay. And even though GPT does not

01:06:02.079 --> 01:06:05.839
actually do any lower casing, it'll just

01:06:03.838 --> 01:06:08.159
actually use like TH uppercase is

01:06:05.838 --> 01:06:09.038
different than TH lowercase. Uh just for

01:06:08.159 --> 01:06:11.118
simplicity, I'm just going to

01:06:09.039 --> 01:06:12.799
standardize it here. So it just becomes

01:06:11.119 --> 01:06:14.880
a cat sat on the mat. And then I'm going

01:06:12.798 --> 01:06:16.719
to write it in this form where I

01:06:14.880 --> 01:06:18.160
basically put a comma after every word

01:06:16.719 --> 01:06:20.318
and then I put a little underscore to

01:06:18.159 --> 01:06:21.598
show the space between the words. Okay,

01:06:20.318 --> 01:06:22.798
I'm going to write it in this format.

01:06:21.599 --> 01:06:25.359
And it'll become clear why I'm writing

01:06:22.798 --> 01:06:27.358
it in just a second. Okay. Now my

01:06:25.358 --> 01:06:28.719
starting vocabulary is just all the

01:06:27.358 --> 01:06:31.440
individual letters in the training

01:06:28.719 --> 01:06:34.159
corpus. So the starting is just whatever

01:06:31.440 --> 01:06:35.920
all these letters. Okay, that's it. And

01:06:34.159 --> 01:06:38.558
this is a starting point. And now what

01:06:35.920 --> 01:06:41.838
we do and this is the key step.

01:06:38.559 --> 01:06:44.960
We merge tokens that most frequently

01:06:41.838 --> 01:06:47.358
occur right next to each other. So if

01:06:44.960 --> 01:06:48.720
two characters or two tokens are

01:06:47.358 --> 01:06:51.119
occurring right next to each other a

01:06:48.719 --> 01:06:52.798
lot, let's just merge them because they

01:06:51.119 --> 01:06:54.880
seem to be occurring a lot together,

01:06:52.798 --> 01:06:57.679
right? May as well merge them. And so

01:06:54.880 --> 01:06:59.119
here, for example, I've I've listed the

01:06:57.679 --> 01:07:01.759
frequency of the adjacent token. So for

01:06:59.119 --> 01:07:04.160
example, if you look at th

01:07:01.760 --> 01:07:06.960
shows up right after each other here, it

01:07:04.159 --> 01:07:08.558
also shows up here. So therefore, it

01:07:06.960 --> 01:07:11.920
shows up twice.

01:07:08.559 --> 01:07:13.519
Now H E again is showing up here. It's

01:07:11.920 --> 01:07:16.079
also showing up here. So that also shows

01:07:13.519 --> 01:07:17.679
up twice. CA on the other hand is only

01:07:16.079 --> 01:07:20.798
showing up here. It's not showing up

01:07:17.679 --> 01:07:24.000
anywhere else. So it shows up once. A T

01:07:20.798 --> 01:07:25.599
shows up three times, in mat, sat, and

01:07:24.000 --> 01:07:27.838
in CAT and so on and so forth. You get

01:07:25.599 --> 01:07:30.798
the idea. So you're just looking at

01:07:27.838 --> 01:07:32.318
pair-wise adjacent tokens. And you pick

01:07:30.798 --> 01:07:34.318
the most frequent one that's showing up,

01:07:32.318 --> 01:07:36.000
which in this case happens to be A and T.

01:07:34.318 --> 01:07:40.000
And then you take a and t and you merge

01:07:36.000 --> 01:07:42.400
them. So it becomes "at".

01:07:40.000 --> 01:07:44.079
Okay. So when you do that when you when

01:07:42.400 --> 01:07:45.440
you you merge them and then you add that

01:07:44.079 --> 01:07:48.559
new token that you've just literally

01:07:45.440 --> 01:07:50.400
created to your vocabulary list and then

01:07:48.559 --> 01:07:52.559
you update the corpus to reflect the

01:07:50.400 --> 01:07:55.039
merge you've just did. So now the corpus

01:07:52.559 --> 01:07:56.319
becomes the cat sat on the mat. But in

01:07:55.039 --> 01:07:58.880
this case there is no a and t

01:07:56.318 --> 01:08:02.400
separately. There is just the at combo

01:07:58.880 --> 01:08:06.160
combo token here.

01:08:02.400 --> 01:08:07.599
Are we good with this step so far?

01:08:06.159 --> 01:08:10.598
take the most frequent things and merge

01:08:07.599 --> 01:08:10.599
them.

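The frequency count on the lecture's example corpus can be reproduced in a few lines of Python, with underscores standing in for the spaces between words as on the slide:

```python
from collections import Counter

# the training corpus as one flat list of character tokens,
# with "_" marking the spaces between words
corpus = list("the_cat_sat_on_the_mat")

# frequency of every pair of adjacent tokens
pairs = Counter(zip(corpus, corpus[1:]))

best_pair, best_count = pairs.most_common(1)[0]
```

As in the lecture, t-h and h-e each show up twice, and a-t shows up three times (in cat, sat, and mat), so a-t is the pair that gets merged first.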
01:08:12.639 --> 01:08:16.238
It's a way to compress the data. In

01:08:14.400 --> 01:08:17.440
fact, the algorithm came from someone

01:08:16.238 --> 01:08:18.959
trying to figure out a way to compress

01:08:17.439 --> 01:08:22.119
data.

01:08:18.960 --> 01:08:22.119
You know,

01:08:22.158 --> 01:08:25.920
think of it this way, right? Suppose I

01:08:23.759 --> 01:08:28.238
tell you uh I'm I want you to compress a

01:08:25.920 --> 01:08:30.158
message I'm going to send to you and

01:08:28.238 --> 01:08:32.079
then you look at all the past messages

01:08:30.158 --> 01:08:35.838
you've had to deal with and it turns out

01:08:32.079 --> 01:08:37.359
you're finding that u certain characters

01:08:35.838 --> 01:08:40.079
are occurring next to each other all the

01:08:37.359 --> 01:08:42.480
time right maybe just for argument let's

01:08:40.079 --> 01:08:44.158
say ABC shows up ridiculously often in

01:08:42.479 --> 01:08:45.439
the messaging and then you'll be like

01:08:44.158 --> 01:08:47.358
you know what's if it's always showing

01:08:45.439 --> 01:08:48.639
up all the time together why treat it as

01:08:47.359 --> 01:08:51.520
three things let me just call it one

01:08:48.640 --> 01:08:53.119
thing ABC that's it you send a single

01:08:51.520 --> 01:08:56.480
token called ABC every time you need to

01:08:53.119 --> 01:08:58.880
send ABC, not A B C. That's the basic

01:08:56.479 --> 01:09:01.278
idea. So here if you come here that's

01:08:58.880 --> 01:09:03.039
what we have and then what we do is now

01:09:01.279 --> 01:09:05.520
we do again this calculation of

01:09:03.039 --> 01:09:08.640
adjacency tokens on this updated corpus

01:09:05.520 --> 01:09:11.600
and you can see here T H shows up once, T H

01:09:08.640 --> 01:09:13.838
shows up here too, so you get two. H E

01:09:11.600 --> 01:09:16.880
shows up twice, everything else shows

01:09:13.838 --> 01:09:18.000
up once and yeah when many things are

01:09:16.880 --> 01:09:19.600
showing up with equal frequency just

01:09:18.000 --> 01:09:22.079
pick one randomly from this. So we pick

01:09:19.600 --> 01:09:25.199
up th right and we merge that which

01:09:22.079 --> 01:09:27.278
means that we add th to our vocabulary

01:09:25.198 --> 01:09:30.238
and once we do that we update the corpus

01:09:27.279 --> 01:09:32.080
and now we have th is now one thing

01:09:30.238 --> 01:09:34.959
fused together along with the previous

01:09:32.079 --> 01:09:36.960
thing, "at", that had been fused together

01:09:34.960 --> 01:09:38.960
that is a corpus after the second merge

01:09:36.960 --> 01:09:40.640
and then we do the same thing we find

01:09:38.960 --> 01:09:42.640
the frequency adjacent tokens turns out

01:09:40.640 --> 01:09:45.039
th and e are showing up twice everything

01:09:42.640 --> 01:09:48.880
else is showing up once so we take th

01:09:45.039 --> 01:09:51.359
merge it with e to get, boom, "the", and now we

01:09:48.880 --> 01:09:53.838
have the cat sat on the mat. So this

01:09:51.359 --> 01:09:56.159
process continues

01:09:53.838 --> 01:09:59.039
till we reach a predefined limit for our

01:09:56.158 --> 01:10:02.399
vocabulary. Now as it turns out when

01:09:59.039 --> 01:10:04.238
they built GPT2 and GPT let me just see

01:10:02.399 --> 01:10:07.279
I think I did some digging around on

01:10:04.238 --> 01:10:09.119
this thing. Yeah. So GPT-2 and 3, they set

01:10:07.279 --> 01:10:12.000
the vocabulary size to be roughly

01:10:09.119 --> 01:10:14.399
50,000. So it basically kept on doing

01:10:12.000 --> 01:10:17.119
this till it hit a limit of 50,000 then

01:10:14.399 --> 01:10:18.639
it stopped. GPT-4 on the other hand

01:10:17.119 --> 01:10:23.238
actually went goes all the way to

01:10:18.640 --> 01:10:23.239
100,000 vocabulary size.

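The whole loop, merging the most frequent adjacent pair and updating the corpus until the vocabulary hits a preset limit, can be sketched as a toy trainer. This is a simplification of real BPE; bpe_train is my name, and ties are broken by whatever Counter.most_common happens to return first:

```python
from collections import Counter

def bpe_train(text, vocab_limit):
    """Toy BPE trainer: start from characters, repeatedly merge the most
    frequent adjacent pair until the vocabulary reaches vocab_limit."""
    tokens = list(text)
    vocab = set(tokens)
    merges = []
    while len(vocab) < vocab_limit:
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < 2:                      # nothing worth merging any more
            break
        merges.append((a, b))
        vocab.add(a + b)
        # rewrite the corpus with the chosen pair fused into one token
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return vocab, merges, tokens
```

GPT-2 and 3 ran this kind of process to a vocabulary of roughly 50,000; here the limit is tiny so you can trace the merges by hand.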
01:10:23.439 --> 01:10:29.198
Okay, so this is BPE in action. And so

01:10:28.000 --> 01:10:30.000
what's going to happen is once you

01:10:29.198 --> 01:10:31.119
finish all this thing and you have

01:10:30.000 --> 01:10:32.960
vocabulary and you have all these things

01:10:31.119 --> 01:10:36.158
that you have merged when a new piece of

01:10:32.960 --> 01:10:39.760
text comes in right the merges remember

01:10:36.158 --> 01:10:41.759
here we merged a and t to get at, this th became

01:10:39.760 --> 01:10:43.119
this and so on. When a new piece of text

01:10:41.760 --> 01:10:45.520
arrives the tokenization apply the

01:10:43.119 --> 01:10:47.920
merges in the exact same order. So if

01:10:45.520 --> 01:10:50.239
the new text that comes in is the rat,

01:10:47.920 --> 01:10:52.800
it's first going to apply the a t merge

01:10:50.238 --> 01:10:54.399
to fuse this here and then it's going

01:10:52.800 --> 01:10:56.400
to fuse th to get this and then it's

01:10:54.399 --> 01:10:58.559
going to fuse th and e to get that. And

01:10:56.399 --> 01:11:00.319
the final list of tokens that goes in to

01:10:58.560 --> 01:11:02.080
your model is going to be the token for

01:11:00.319 --> 01:11:05.960
"the", the token for space, and the token

01:11:02.079 --> 01:11:05.960
for r and the token for at.

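Replaying the learned merges on new text, in training order, can be sketched like this, using the first three merges from the lecture's example (apply_merges is my name for the step the tokenizer performs):

```python
def apply_merges(text, merges):
    """Tokenize new text by replaying the learned merges in training order."""
    tokens = list(text)
    for a, b in merges:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# the merges learned in the lecture's example, in order
merges = [("a", "t"), ("t", "h"), ("th", "e")]
```

Applied to "the_rat" this yields the tokens the, _, r, at, exactly the list described above.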
01:11:06.560 --> 01:11:10.600
So let's see this in action.

01:11:12.319 --> 01:11:17.119
uh GP I mean OpenAI has a has its own

01:11:14.560 --> 01:11:20.960
thing but I found this uh site to be

01:11:17.119 --> 01:11:23.039
really good. So let's uh tokenize

01:11:20.960 --> 01:11:26.079
hands-on

01:11:23.039 --> 01:11:28.319
deep learning.

01:11:26.079 --> 01:11:30.960
So you can see here

01:11:28.319 --> 01:11:34.639
look at this.

01:11:30.960 --> 01:11:36.880
So H uppercase H is its own token. It's

01:11:34.640 --> 01:11:38.560
token number 39

01:11:36.880 --> 01:11:41.119
and

01:11:38.560 --> 01:11:43.440
"ands" is its own token, dash is its own token,

01:11:41.119 --> 01:11:45.198
on is its own token and then space deep

01:11:43.439 --> 01:11:48.319
is its token and space learning is its

01:11:45.198 --> 01:11:50.399
token okay note one thing suppose you

01:11:48.319 --> 01:11:51.920
had said

01:11:50.399 --> 01:11:53.679
let's just say you just had deep deep

01:11:51.920 --> 01:11:56.480
deep learning

01:11:53.679 --> 01:11:58.960
deep has a different token than space

01:11:56.479 --> 01:12:01.359
deep

01:11:58.960 --> 01:12:03.119
okay what they have realized is that

01:12:01.359 --> 01:12:06.079
most words are actually going to show up

01:12:03.119 --> 01:12:08.238
after a space, right? Much

01:12:06.079 --> 01:12:10.079
more likely so having a space attached

01:12:08.238 --> 01:12:12.000
to the beginning of the word saves you a

01:12:10.079 --> 01:12:13.519
lot of sort of you know saves you a lot

01:12:12.000 --> 01:12:15.198
of compute and so on and so forth

01:12:13.520 --> 01:12:17.199
because they will in fact arrive almost

01:12:15.198 --> 01:12:18.479
all the time with the space before it

01:12:17.198 --> 01:12:21.119
right that's why they have attached the

01:12:18.479 --> 01:12:25.759
space to the word itself um and note

01:12:21.119 --> 01:12:29.719
that deep learning deep and uh deep

01:12:25.760 --> 01:12:29.719
actually let's call it this way

01:12:30.800 --> 01:12:36.960
so deep and deep are different

01:12:34.319 --> 01:12:38.799
right there is deep there is so clearly

01:12:36.960 --> 01:12:43.359
it's taking case into account then I put

01:12:38.800 --> 01:12:44.800
an exclamation here. Boom. That and so

01:12:43.359 --> 01:12:48.319
ultimately what goes in when you have

01:12:44.800 --> 01:12:51.679
have a phrase like um

01:12:48.319 --> 01:12:53.679
sat on the mat.

01:12:51.679 --> 01:12:58.480
So the cat sat on the mat. And you can

01:12:53.679 --> 01:13:01.600
see here uppercase the um and then

01:12:58.479 --> 01:13:06.718
let's just do another thing here.

01:13:01.600 --> 01:13:10.239
So uppercase the with a space is 383.

01:13:06.719 --> 01:13:11.920
lowercase the is 262. Uh and then that's

01:13:10.238 --> 01:13:13.119
distinct from just the without any

01:13:11.920 --> 01:13:16.960
space. That's a different thing. So

01:13:13.119 --> 01:13:18.960
these are all the tokens. Now um let's

01:13:16.960 --> 01:13:21.520
try something.

01:13:18.960 --> 01:13:24.520
Let's try

01:13:21.520 --> 01:13:24.520
Jane.

01:13:24.719 --> 01:13:30.800
So Jane is one token which is great and

01:13:27.520 --> 01:13:34.560
is another token. Let's see. Rama. Ah

01:13:30.800 --> 01:13:38.960
darn. My name wasn't worthy enough to be

01:13:34.560 --> 01:13:41.520
its own token. Okay. But strangely

01:13:38.960 --> 01:13:44.000
enough

01:13:41.520 --> 01:13:46.080
this I was very surprised by this. So if

01:13:44.000 --> 01:13:48.319
I put Rama in lower case is its own

01:13:46.079 --> 01:13:51.039
token.

01:13:48.319 --> 01:13:55.039
I have no idea what they were scraping

01:13:51.039 --> 01:13:56.640
which websites. Uh and if I put Jane

01:13:55.039 --> 01:13:58.960
here

01:13:56.640 --> 01:14:01.600
now J has become its token with space

01:13:58.960 --> 01:14:03.840
and A has become different.

01:14:01.600 --> 01:14:05.199
So the tokenization is like very it's a

01:14:03.840 --> 01:14:07.360
very interesting thing and it works in

01:14:05.198 --> 01:14:08.719
very interesting ways. But that's the

01:14:07.359 --> 01:14:10.639
basic idea of what's going on under the

01:14:08.719 --> 01:14:12.079
hood. I would encourage you to like

01:14:10.640 --> 01:14:13.920
check out your names to see if it's

01:14:12.079 --> 01:14:15.359
actually been tokenized. So all right,

01:14:13.920 --> 01:14:18.359
I'm done. Thanks folks. I'll see you on

01:14:15.359 --> 01:14:18.359
Wednesday.
