
9: Generative AI – Large Language Models (LLMs) and Retrieval Augmented Generation (RAG)

MIT OpenCourseWare · May 11, 2026
Transcript ~13430 words · 1:14:29
0:16
Um, so let's start with a quick review.
0:18
Last week we looked at BERT, how BERT
0:21
was created, and we learned about this
0:23
technique called masking, which is a
0:25
kind of self-supervised learning. And
0:27
the idea of masking was very simple. We
0:29
asked ourselves the question we have
0:31
seen ways in which people can take
0:33
images and pre-train models like ResNet
0:35
on a vast you know vast uh body of
0:38
images but then for each image somebody
0:40
had to go and label them right so for
0:42
text we asked the question well what
0:44
does it mean to label a piece of text
0:46
when we don't actually have a clearly
0:48
defined end goal in mind except the
0:49
general goal of pre-training things
0:51
right and then we said oh well what we
0:53
can do is we can actually replace
0:55
some of the words in every sentence with
0:57
what you'd call a mask token and
0:59
then we just train the network to
1:00
recover the blanks to fill in the blanks
1:03
right and this technique which is one of
1:06
many ways of doing what's called
1:07
self-supervised learning is called
1:08
masking and we and we described how if
1:12
you essentially take all of Wikipedia
1:14
and for every sentence you mask it like
1:16
this and then train a network to recover
1:19
to fill in the blanks the resulting
1:21
network becomes really good at doing all
1:23
kinds of interesting things and that in
1:25
fact the first such network or one of
1:27
the first such networks was called BERT
1:29
u and in fact in your homework you've
1:31
been you've been looking at BERT and so
1:32
on and so forth right that's masking now
1:34
we're going to switch gears and talk
1:35
about a different kind of
1:37
self-supervised learning which is different
1:38
from masking which turns out to be
1:41
weirdly more interesting and powerful
1:45
okay so we are going to look at another
1:47
technique and this technique is called
1:49
next word prediction so now it is
1:52
actually in some sense a special
1:54
case of masking where you're basically
1:55
saying take a sentence and instead of
1:57
randomly picking a word and making
1:59
it a blank. You're saying, "I'm just
2:01
going to take the last word and make it
2:03
a blank." Okay? And then you send the
2:06
sentence in and then you have the
2:08
machine just fill in the blank on the
2:10
last word. Predict the next word. Okay?
2:12
And you don't have to use full sentences
2:13
for it. You can use parts of sentences
2:15
for it. Sentence fragments as well. So
2:17
if you take the same sentence as before,
2:20
"the mission of the MIT Sloan School," you
2:21
can literally divide it up: well, you
2:23
can give it "the" and ask it to predict
2:25
"mission." You can give it "the mission"
2:27
and ask it to predict "of." You give it
2:29
"the mission of" and ask it to predict "the." You
2:31
get the idea. So every sentence fragment
2:33
you can take and literally just give it
2:35
the first few and then predict the next
2:37
one. First few next one first few next
2:38
one. Okay. So this is next word
2:41
prediction. And
2:44
so the let's what we're going to do now
2:46
is we're going to actually take the
2:47
transformer encoder architecture that we
2:50
used to build BERT in the last class and
2:52
we're going to try to use it to solve
2:54
next word prediction to build a model
2:56
that can do next word prediction. Okay.
2:58
So this is what [clears throat] we have.
3:01
So what we're going to do is uh if you
3:03
take the phrase the cat sat on the mat.
3:09
So the phrase was let's say the cat
3:13
sat
3:15
on
3:16
the mat.
3:18
So what you might want to do is to say
3:20
okay this is the input
3:25
output
3:27
the cat.
3:30
Then maybe you have the cat
3:33
then the output is sat.
3:36
The cat sat on and so on. Right, you get
3:39
the idea. And then finally, we have "the
3:42
cat sat on the" predicting
3:45
"mat." Right, this is basically what
3:48
we have all these inputs and outputs.
3:50
But we're going to very compactly
3:51
express it as if it's just coming in
3:54
through as as one sort of data point in
3:56
one batch. And that's what we're doing
3:58
here. So what we're going to do is we're
4:00
going to stack it up like this where we
4:02
have the cat sat on the on the left
4:04
meaning everything but the last word and
4:07
then we're going to take that same
4:08
sentence and just shift it to the left
4:10
one right so the cat sat on the mat we
4:13
cut off the mat right and that becomes
4:15
the input then we cut off the first word
4:17
and that becomes the output so when you
4:19
look at it that way you can see here
4:22
right, you will want "the" to be used
4:25
to predict "cat," you will want "the cat" to be
4:29
used to predict "sat" and so on and so
4:31
forth.
4:32
Okay, so this is just a little sort of
4:35
manipulation so that we don't have to
4:37
have you know like dozens of sentences
4:40
or sentence examples just for one
4:42
starting sentence.
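A minimal Python sketch of this shift-by-one setup (a toy example, not the lecture's code): the input is everything but the last word, the target is the same sentence shifted left by one.

```python
# One sentence packed into several (prefix -> next word) training examples.
sentence = ["the", "cat", "sat", "on", "the", "mat"]

inputs  = sentence[:-1]   # ["the", "cat", "sat", "on", "the"]
targets = sentence[1:]    # ["cat", "sat", "on", "the", "mat"]

# Position i of `inputs` is used to predict position i of `targets`.
for i in range(len(inputs)):
    print(inputs[: i + 1], "->", targets[i])
```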
4:44
So if you have something like this, what
4:46
you can do is you can run it through
4:49
positional input embeddings like we have
4:50
done before with BERT. Uh then we can
4:53
run it through a whole bunch of
4:54
transformers, right? It's like a
4:56
transformer stack. Then we get these
4:59
contextual embeddings. Then we run them
5:01
through maybe one or more ReLUs if you
5:03
want because it's always a good idea to
5:05
stick some ReLUs at the very end. And
5:08
then we basically attach a softmax to
5:11
every one of the things that are coming
5:13
out. Okay. And then that softmax is
5:17
actually going to be a softmax whose
5:20
range is the entire vocabulary.
5:23
Okay. For now, let's assume that the
5:25
vocabulary is just a vocabulary of
5:27
words, not tokens. We'll get into tokens
5:29
a bit later on in the class. For now,
5:30
just assume it's words. And roughly
5:32
speaking, let's say there are 50,000
5:33
words in our vocabulary. So each of
5:36
these softmaxes, and this is exactly
5:38
what we did for BERT, by the way. Each
5:39
of these softmaxes is like a 50,000-way
5:42
softmax.
5:43
Okay. But what we're going to do is here
5:47
when we look at it this way
5:50
since we are fundamentally bothered
5:52
about next word prediction as you will
5:54
see later on we are actually going to
5:55
ignore all these predictions because who
5:57
cares? We are only going to look at the
5:59
last one to figure out okay what is the
6:02
last prediction? What is it? Because the
6:04
last prediction is going to be based on
6:06
everything that came before it here. So
6:09
this is really the next word that's
6:11
actually being predicted. All the things
6:13
before we don't care so much.
6:16
Okay. And all this will become slightly
6:17
clearer because you're going to make a
6:18
couple of passes through it. Yeah.
6:20
>> How do we
6:24
>> uh so um the notion of a sentence has
6:27
disappeared at this point. What we're
6:29
going to do is when we look at how we
6:30
tokenize the input for these kinds of
6:33
models, we're actually going to take
6:35
punctuation into account. So we're going
6:36
to take periods into account,
6:37
exclamation marks into account and so on
6:39
and so forth. And that that'll answer
6:41
your question and we'll come back to
6:42
that. Okay, so this is what we have. So um
6:47
all right. So just to be clear the
6:49
embedding that's coming out of the final
6:50
dense layer is passed through its own
6:52
softmax with the number of softmax
6:54
categories equal to the vocab size. Okay.
6:58
All right. Um okay. So
7:01
first of all, let's say we train
7:04
a model like this with lots of
7:05
inputs and outputs. Okay, this just
7:08
looks like BERT, right? It's not that
7:10
different except that there's no notion
7:11
of a mask.
7:13
Do you notice any problems with the way
7:15
this thing has been set up? Uh
7:19
>> like for some words like the you're
7:21
going to have a lot of potential output
7:23
pairs that come out of that.
7:25
>> True. Which means that if you have a
7:27
word like the the next word
7:29
>> hard to predict.
7:29
>> It's true. So some words may be hard to
7:32
predict depending on the last word of
7:35
the sentence that was the input. Right.
7:36
That's what you're getting at. Yeah. U
7:39
concerns.
7:41
So I want you Yeah. Uh
7:43
>> since you're using contextual
7:46
like the output of the first word is
7:48
going to have access to the second word
7:51
and so it's kind of like cheating.
7:53
>> Bingo.
7:55
So remember, bingo is a technical
7:58
term in deep learning which means great.
8:01
So um so if you go to this right as she
8:05
points out if you look at the self
8:08
attention layer note remember the self
8:11
attention layer is the key building
8:12
block of the transformer block right and
8:15
so in the self attention layer every
8:17
word we calculate its contextual
8:19
embedding by weighted averaging
8:23
of its relationship to all other words
8:26
in the sentence. So the last word can
8:28
see the first word, the first word can
8:30
see the last word and so on and so
8:31
forth, right? But when you're doing next
8:33
word prediction, this feels problematic
8:34
because you're peeking into the future,
8:38
right? So
8:40
so let's say that you want to predict
8:42
the next word. If you look at this
8:43
architecture, what it can simply do, it
8:46
can simply copy it from the input
8:48
because it can see the whole sentence.
8:50
So if I tell you, hey, the cat sat on
8:52
the mat. If I just gave you the cat sat
8:55
on the can you predict the the next word
8:56
for me? You'll be like, yeah, duh,
8:58
it's mat.
9:01
The whole thing becomes challenging only
9:02
if I say the cat sat on the dash. Now
9:04
predict the dash.
9:07
So to put it another way let's say that
9:09
you want to predict right you have fed
9:11
in the first two words and you want to
9:13
predict this. This is the right answer
9:15
for the prediction. The network should
9:17
only use the first two.
9:20
However, because self attention can
9:23
see "sat," it can see this next word,
9:26
it'll trivially learn to predict the
9:28
next word to be "sat,"
9:31
right? There is no challenge for it.
9:34
So, this is the key problem, right? This
9:37
is the key problem. We're just using the
9:38
transformer as is.
9:41
>> What's our loss function here?
9:43
>> The loss function in all these things is
9:44
actually the same as before, which is
9:46
that for every output that's coming out.
9:48
So imagine you have just a traditional
9:50
classification problem uh in which you
9:52
have one output uh let's say dividing
9:54
you're classifying things to uh 10
9:56
categories like we did with Fashion-
9:57
MNIST, right, 10 classes, so you have 10
10:00
outputs right and that goes through a
10:02
softmax and then you have 10
10:03
probabilities and there we use cross
10:05
entropy right so here for every one of
10:09
these things we use cross entropy so we
10:12
take this output and there's a cross
10:14
entropy for just for that plus cross
10:16
entropy for that and so on and so forth
10:18
So we still minimize cross entropy,
10:20
but the sum of all these cross
10:21
entropies.
10:22
>> And does it get complicated at all by
10:24
the fact we have a large vocabulary size
10:26
now?
10:27
>> I mean it it gets complicated just
10:29
because there are more things to worry
10:30
about compute and so on and so forth.
10:32
But conceptually no difference whether
10:33
you have 10 or 50,000 it's the same
10:35
thing. It's just that instead of
10:37
classifying an input into one of 10
10:39
categories you're take the inputs
10:41
themselves are as long as the number of
10:42
words in your sentence. So each word
10:45
that comes into your sentence is being
10:46
classified in one of 50,000 ways, right?
10:49
So essentially you have as many
10:51
classification problems as you have
10:53
number of words in a sentence. But at
10:55
the end of the day, the loss function is
10:56
just a sum of all those things or to be
10:58
more precise, the average of all those
10:59
things.
11:02
Actually, I think I may have a slide
11:03
about this which I may have hidden
11:05
because I wasn't sure if I would have
11:07
time. Uh let's unhide it.
11:17
and by the way, I did not agree ahead of time that
11:19
we're going to set this up like this.
11:20
Okay. So, all right. So, yeah. So, we
11:23
still use the cross entropy,
11:25
the cross entropy loss function. So, each
11:27
word that comes in. So, the cross
11:30
entropy is actually minus log
11:33
probability of the right answer. And you
11:35
may recall this from earlier in the
11:36
class. So, we just do the same thing for
11:38
for "cat," "sat," "on," "the," everything. And then
11:41
we just take the average, 1/7. Boom.
11:43
That's it.
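A toy NumPy sketch of the loss just described (illustrative numbers, tiny 4-word vocabulary): per position, take minus the log of the probability assigned to the correct next word, then average.

```python
import numpy as np

# Each row is a softmax output over the vocabulary at one position;
# `targets` holds the index of the correct next word at each position.
probs = np.array([
    [0.70, 0.10, 0.10, 0.10],   # correct word has p = 0.70
    [0.05, 0.80, 0.10, 0.05],   # correct word has p = 0.80
    [0.25, 0.25, 0.30, 0.20],   # correct word has p = 0.30
])
targets = np.array([0, 1, 2])

per_position = -np.log(probs[np.arange(len(targets)), targets])
loss = per_position.mean()      # the "average of all those things"
print(per_position, loss)
```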
11:47
So let's so to go back to this problem.
11:50
So this is the issue. The issue is that
11:52
we can't allow words to be predicted
11:55
knowing the future. They should only
11:57
know about the past words. Okay. So what
12:00
do we do? Right? We have to make a
12:02
change to the transformer to make it
12:03
work for next word prediction. So what
12:06
we're going to do is when we are
12:07
calculating the contextual embedding for
12:09
a word, remember the contextual
12:11
embedding for a word is going to be a
12:13
weighted average of all the other words
12:14
embeddings. We will simply give zero
12:17
weight to future words.
12:20
If you give zero weight to future words,
12:22
it's almost as if they don't exist.
12:26
Okay? And this will become clear in a
12:27
second. So imagine that this is the the
12:31
thing we are going to calculate. These
12:32
are all for every word in the sentence
12:34
we are calculating the uh the pair-wise
12:38
attention weight and you will remember I
12:41
went through this you know with like an
12:43
iPad thing last week we calculate all
12:45
the weights. So, for example,
12:48
all these weights in every row
12:51
will add up to one and so you take the
12:54
contextual embeddings of the cat sat on
12:56
the multiply them by the respective
12:58
weights that add up to one which is the
12:59
first row of this table and that gives
13:01
you the contextual embedding for the
13:02
word the and so on and so forth. And
13:05
since we can't look at the future words
13:07
all we do is we go take this table and
13:10
we just zero everything out in red.
13:14
Okay, we just zero everything here out
13:17
and then we renormalize so that the
13:19
remaining cells, the nonzero cells,
13:22
will still add up to one in each row. So
13:25
what that means is that if you're
13:27
actually looking at "the," only this
13:29
thing is going to play a role; for "cat,"
13:31
only these things are going to play a role.
13:32
So let's let's let's give an example. So
13:36
um, to calculate,
13:39
to predict "on," you'll only look at the
13:43
words "the cat sat."
13:46
Okay. The rest of it will not be
13:48
considered at all. Now the effect of
13:51
doing all this is that by the way this
13:54
is called causal self attention. This
13:56
tweak is called causal self attention.
13:58
Uh is also called masked self attention.
14:01
Right? Just different labels for the
14:02
same thing. And so what that means is
14:05
that when you're looking at the input
14:07
for the only the is going to be used to
14:10
predict cat.
14:12
When you look the cat only these two are
14:15
going to be used to predict sat and so
14:18
on and so on and so forth.
14:24
Okay. So this thing here this so all we
14:28
do is we go into a transformer and we
14:30
just change each attention head to be a
14:32
causal attention head
14:38
and the way it's actually done under the
14:40
hood is actually very elegant for
14:42
computational efficiency purposes but I
14:44
won't get into it because it gets a bit
14:46
you know involved but the key idea is
14:49
replace basic plain vanilla attention
14:52
with causal attention, aka masked
14:54
attention,
14:57
and you do that, boom, suddenly it
14:59
starts, you know, working for next word
15:01
prediction. It can't cheat anymore.
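A small NumPy sketch of the zero-out-and-renormalize idea from the slide, for one attention head (random weights stand in for real attention scores):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6                                          # "the cat sat on the mat"
weights = rng.random((n, n))                   # weights[i, j]: word i attending to word j
weights /= weights.sum(axis=1, keepdims=True)  # each row sums to 1

mask = np.tril(np.ones((n, n)))                # 1 on/below the diagonal, 0 for future words
causal = weights * mask                        # zero weight to the future
causal /= causal.sum(axis=1, keepdims=True)    # renormalize each row back to 1

print(np.round(causal, 2))
```

(The elegant under-the-hood version the lecture alludes to typically adds minus infinity to the future positions' scores before the softmax, which produces the same zeros in one step.)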
15:04
and when we do that we get the
15:06
transformer causal encoder
15:11
and by the way the word causal here
15:13
there's no connection to causality so
15:15
it's just a it's just a term
15:19
so if you look at the original
15:20
transformer paper um
15:24
it was created for translation for
15:26
machine translation you know English to
15:28
German right those kinds of use cases so
15:30
it had something called an encoder which
15:32
we are very familiar with from last week
15:34
and then it had something called a
15:35
decoder right and it is called the
15:38
encoder decoder architecture and we are
15:40
not going to cover the encoder decoder
15:42
architecture because we are not covering
15:43
machine translation in this class but
15:45
I'm mentioning this because the this
15:48
part of the the architecture is called a
15:51
decoder
15:52
because it uses see here there is a
15:55
masked attention business going on here
15:57
because it is using this masked
15:59
attention it's called a decoder so
16:02
the transformer causal encoder is also
16:05
referred to sometimes as a transformer
16:06
decoder but the word decoder has two
16:09
meanings
16:11
right it's a synonym for the causal
16:12
encoder like we have seen today it's
16:14
also used to refer to sequence-to-sequence
16:17
translation problems, for the second part
16:19
of that architecture. So you just have to
16:21
keep in mind, it'll become clear from context
16:23
what we're talking about in this course
16:25
of course there is no confusion because
16:26
we're not going to be looking at
16:27
translation right we may say decoder
16:29
causal encoder it's the same thing so I
16:32
thought there were some transformers
16:34
that use bidirectional
16:36
attention, like, is it different from
16:39
>> no, the um, the bidirectional, all
16:42
bidirectional means is that I can see
16:44
everything. So the encoder we looked at
16:47
last week, the basic self attention
16:49
thing, is bidirectional.
16:54
Basically all it means is I can look at
16:55
both in both directions to see what
16:57
other words are there. In causal, you're
16:58
not using the ones in the future.
16:59
Correct.
17:02
All right. So,
17:04
so in to summarize where we are. This is
17:07
what we looked at last week for BERT and
17:09
this is a transformer encoder and we
17:11
take the same thing and instead of
17:14
multi-head attention we would do causal
17:15
multi-head attention. We get the decoder aka
17:18
causal encoder.
17:21
Okay. And we use the left for masked
17:25
prediction. We use the right for next
17:27
word prediction.
17:29
All right. So now if you have instead of
17:32
having an encoder, if you have a causal
17:34
encoder, a TCE here, now we can train
17:37
models for next word prediction using the
17:38
same exact approach as before,
17:42
right? We set up the inputs and the
17:43
outputs like I described earlier. We run
17:45
it through a bunch of stacks, a stack of
17:47
causal encoders, dense, ReLU, softmax and
17:50
so on and so forth, right? Otherwise the
17:52
details don't change but the all
17:54
important changes go into the attention
17:56
layer and make it masked or causal.
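A minimal PyTorch sketch of that pipeline (sizes and names are illustrative, not GPT's): token plus position embeddings, a stack of transformer blocks with a causal mask, then a vocabulary-sized output layer at every position.

```python
import torch
import torch.nn as nn

class TinyCausalLM(nn.Module):
    def __init__(self, vocab=50_000, d=256, heads=4, layers=4, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab, d)
        self.pos = nn.Embedding(max_len, d)
        block = nn.TransformerEncoderLayer(d, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)
        self.out = nn.Linear(d, vocab)    # logits; softmax lives in the loss

    def forward(self, ids):               # ids: (batch, seq) token indices
        n = ids.size(1)
        x = self.tok(ids) + self.pos(torch.arange(n, device=ids.device))
        # Additive causal mask: -inf above the diagonal blocks future words.
        causal = torch.full((n, n), float("-inf"), device=ids.device).triu(1)
        x = self.blocks(x, mask=causal)   # the all-important change
        return self.out(x)                # (batch, seq, vocab) next-word logits
```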
18:02
Any questions so far?
18:06
>> Uh yeah,
18:08
this would only apply when we're
18:09
training the model, not when we're
18:11
validating and testing, right?
18:13
Uh so if I if you give me a sentence
18:15
after training right the final
18:18
prediction is only is the only thing you
18:20
care about and by definition the final
18:22
prediction will use everything that came
18:24
before it. So we are okay.
18:27
Was that your question? No, I think the
18:30
fact that we're
18:33
uh we're zeroing out the weights in the
18:35
future words I thought would apply more
18:36
when we're training the model and we're
18:38
trying to minimize the loss as opposed
18:40
to when we're asking it to generate the next
18:44
word.
18:45
>> right but the point is when we actually
18:47
use them what is the objective like what
18:49
do we want to do when we actually use
18:50
them for inference once we finish
18:51
training our objective is given a
18:54
particular string get me the next word
18:56
right and to find the next word you can
18:59
in fact use everything that came before
19:00
it
19:01
>> and therefore without any change to this
19:03
model it'll just work for your intended
19:04
purpose you don't have to go in there
19:06
and change it to you don't have to
19:08
unmask it for inference because you
19:10
don't need to
19:13
>> yes
19:14
>> uh I have one question is regarding like
19:17
when we do the causal transformers we
19:20
are putting certain weights to zero for
19:22
the words which are to be predicted and
19:24
then we
19:24
>> no word the the words that are in the
19:26
future
19:27
>> future Yeah.
19:28
>> And then we normalize it.
19:29
>> Correct.
19:29
>> And we have trained a transformer
19:31
earlier on the all the words packed all
19:33
the words together. So won't there be
19:35
difference in weights between both the
19:37
things
19:37
>> between the two ways of training? The
19:39
weights are going to be very different
19:40
and they are two different models. BERT
19:43
is used for certain things and this kind
19:45
of model which is the basis of GPT is
19:47
going to be used for other things.
19:47
>> We are training it as well like that. I
19:49
mean, while setting
19:52
some of the weights to zero.
19:53
>> correct correct. So what I'm talking
19:56
about here is the what we're trying to
19:59
do here is to say let's say that we want
20:01
to do next word prediction as the as the
20:03
task as a self-supervised learning task
20:06
and and we want to train such a model on
20:08
a vast amount of text data right well we
20:10
can't just use what we did last week
20:12
because it's not going to work because
20:13
of the fact it can see the future
20:14
therefore we make a tweak and then we
20:16
build this model now the question
20:17
becomes okay what can you do with this
20:18
such a model right we have basically
20:20
trained two different kinds of models
20:21
that the one that can see everything
20:23
Bert and that one that can't see the
20:25
future which is actually GPT. So what
20:27
can you do with it? And we're going to
20:28
come to that.
20:32
Okay. U all right. So now once you train
20:35
such a model u right given any input
20:38
sentence um let's say that the sentence
20:41
is it was a dark and it was a dark and
20:45
right it goes through all these things.
20:47
And remember what I said earlier the
20:49
fact that it's predicting something
20:50
after just seeing it. We don't really
20:53
care.
20:55
All what we're really curious about is
20:57
what is the next thing it's going to
20:59
say? And the next thing it's going to
21:01
say is going to be is basically going to
21:02
be what's coming out of this softmax.
21:06
Does it make sense? We don't care about
21:08
anything that went before it
21:11
because we already have like a half-formed
21:14
sentence and we want to just find the
21:15
next thing here. So we only care about
21:17
this. We I mean these things will come
21:19
out of the of the architecture of the
21:21
model, but we don't we throw them out.
21:22
We don't even pay any attention to them.
21:24
Okay, we only look at what's coming out
21:26
in this one here. And what comes out of
21:30
the softmax, remember, is a 50,000-way
21:32
table of probabilities. That's what a
21:35
softmax is, right? It's a whole bunch
21:37
of probabilities that add up to one. And
21:39
so it's going to and let's say, for
21:40
example, that you know you have starting
21:42
with aardvark all the way to zebra,
21:45
right? Right? And these are the
21:46
probabilities.
21:48
So it was a dark and you know just for
21:52
kicks I put stormy as the highest
21:55
probability number but these numbers
21:56
will add up to one. We have this table.
21:59
Okay. And then what we do is we choose a
22:02
token from this table. We get we get to
22:04
choose right. There's a whole bunch of
22:06
numbers in this table that we we get to
22:08
choose a token. The simplest thing
22:11
one can think of is just choose the
22:12
word that is the most likely, right? And
22:14
we choose the word that's most likely
22:16
here. And we we're going to have a whole
22:18
section on how to choose these things
22:20
coming up. Okay, for now let's go with
22:22
the simple option. We're going to just
22:23
choose the one that's most likely, stormy at 0.6. And
22:26
then we we attach it to the input. So
22:30
now the input has become it was a dark
22:32
and stormy. We run it through and we
22:34
again we only care about the last one
22:36
softmax.
22:37
Okay,
22:40
we do that. We get another table and
22:42
this table turns out the table keeps
22:44
changing because the softmax is
22:45
different for each time you run it
22:46
through because the input has changed.
22:49
So you get a new table and it turns out
22:50
the most likely one is night. Okay. And
22:53
then, so night comes out the
22:56
other end, and we attach night here
22:59
and we keep on going right. We can keep
23:03
on going maybe till we basically we tell
23:05
the model okay generate up to 100 tokens
23:08
and stop. It might stop after 100 or you
23:11
or it might decide the model may decide
23:12
in fact that when it sees a punctuation
23:15
like a period or exclamation mark or
23:17
something it's going to stop. Okay. And
23:19
we have control over this when it stops
23:21
and how it stops. But this is this is
23:23
sort of the the basic process and you
23:26
folks are all very used to it because
23:27
you've all been playing with chat GPT
23:28
and the like right? So the but the basic
23:30
building block is next word prediction
23:33
feed it back to the input next word
23:34
prediction keep on doing it right you
23:36
keep on doing it and suddenly you know
23:38
it's writing entire novels for you
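In pseudocode-ish Python, the loop just described looks like this (here `model` and `tokenizer` are hypothetical stand-ins, not a real API; greedy choice for simplicity):

```python
def generate(model, tokenizer, prompt, max_new_tokens=100):
    ids = tokenizer.encode(prompt)           # e.g. "it was a dark and"
    for _ in range(max_new_tokens):
        probs = model(ids)[-1]               # softmax over vocab, last position only
        next_id = int(probs.argmax())        # pick the most likely token (for now)
        ids.append(next_id)                  # feed the prediction back into the input
        if tokenizer.decode([next_id]) in (".", "!", "?"):
            break                            # optionally stop at end punctuation
    return tokenizer.decode(ids)
```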
23:41
um yeah
23:42
>> that mean that the longer the initial
23:44
input is better you get a better
23:47
prediction
23:48
>> um it depends on your objective so
23:52
fundamentally you have some task you
23:54
want the thing to do for you right and
23:56
that task may vary, and you need to give it
23:58
all the information it can possibly
24:00
find useful. Yeah. So the
24:02
more helpful the input, the better. Maybe
24:04
that's how I would say it.
24:07
Uh yeah.
24:09
>> Would this also apply to something like
24:11
Google search? Uh, or do they also do
24:14
next letter prediction too? But would
24:17
this just be a deeper
24:18
>> Yeah. So the Google autocomplete for
24:20
example, I don't know if they actually
24:22
use uh this kind of model under the hood
24:24
or not. I just don't know. Um these
24:26
things tend to be kept tightly under
24:27
wraps. uh you know if they were to do if
24:29
they were using it you know my guess is
24:31
that
24:33
they so I don't know if you folks have
24:34
seen recently over the last few months
24:36
they have there is there is a generative
24:38
AI panel that opens up when you do a
24:40
Google search that panel I suspect uses
24:42
this uh but I don't know if the default
24:45
Google autocomplete actually uses it or
24:47
not because it's very compute heavy
24:49
right so I don't know what they do
24:52
um so yeah this is what you do other
24:55
questions on this on the mechanics of
25:00
Yeah,
25:01
>> for our vocabulary list, I'm assuming
25:03
it's static.
25:05
>> Yeah, correct. Uh, and as you will see
25:07
here, it's not really a word vocabulary.
25:08
It's a token vocabulary, but yes, it is
25:10
static for a given model.
25:12
>> And so for I guess I'm assuming for
25:15
Google or any other sort of like search
25:17
engine that wouldn't necessarily be
25:19
static. And so when it comes to I guess
25:23
I guess I'll leave it like because the
25:26
model would be different
25:30
sort of thinking about uh what happens
25:32
to like new words and things that are
25:34
formed and how does it handle it if the
25:35
vocabulary is static. There's a very
25:37
elegant solution that's coming up.
25:41
Okay. Um
25:45
all right. So now in other words we have
25:48
learned how to do sequence generation.
25:51
We already saw that we can do
25:52
classification with BERT. We can do
25:54
labeling with BERT-like models which
25:56
are trained on masked prediction. And for
25:59
generating sequences now we know how to
26:00
do it. We just need to use a transformer
26:02
causal encoder.
26:05
Okay.
26:08
Now
26:10
these kind of models, sequence
26:12
generation models trained on text
26:13
sequences using next word prediction are
26:15
called autoregressive language models
26:17
or causal language models. Okay. And of
26:20
course the GPT family is perhaps the
26:22
most well-known uh example of an
26:25
autoregressive causal language model. Auto-
26:28
regressive because people who have done
26:30
econometrics and some regression know
26:32
the notion of autoregression means
26:34
that you predict something and then you
26:36
you use sort of you know the past
26:38
predictions as inputs into the next time
26:40
you predict right so this is the notion
26:42
of autoregression: you predict,
26:44
you feed the prediction back get the
26:46
next prediction and keep on cycling
26:48
through yes
26:51
>> so when you you're kind of putting an
26:53
input into GPT for example and it has
26:56
that um you know it shows you like the
26:59
next words as as it's coming. Is that an
27:01
indication of it doing this
27:03
recalculation that you described here?
27:05
>> Correct. That's exactly what's going on.
27:07
Uh in fact, if you use the API, there is
27:09
the thing called the streaming API where
27:12
it'll actually stream each token that's
27:14
coming out through the through every
27:15
pass and you can actually see everything
27:17
very clearly. But when you actually work
27:19
with the web interface and you see the
27:22
thing almost as if it's typing like a
27:24
human, what I've heard from people, I
27:25
don't know if this is true, what I've
27:26
heard from people is that they can
27:28
actually do it much faster. They slow it
27:30
down intentionally to give you the
27:32
feeling that it's actually coming from a
27:33
human.
27:36
So it's like a UX trick to slow it down
27:39
to make it feel as if someone is
27:41
actually typing something on the other
27:42
end. So when you're interacting with a
27:44
chatbot, for example, sometimes you see
27:46
it actually typing like slowly you can
27:48
see the bubble and you can see the
27:49
typing. It's actually intentionally
27:50
slowed down. Uh because you know it's a
27:53
bot otherwise, right? So there's a
27:55
little bit of UX
27:58
creepiness maybe going on. Uh I don't
28:01
know to what extent this is 100% true
28:03
and how pervasive it is, but folks who
28:05
work in the field have told me that this
28:06
actually is not uncommon. So
28:10
okay, so that's what's going on here.
28:12
These are language models and of course
28:14
GPT-3 is an autoregressive language
28:17
model and the reason why we have an L in
28:20
front of the LM because it was trained
28:22
on lots of data with lots of parameters
28:24
right some someone does this at some
28:25
point it's not a small language model
28:26
anymore it's a large language model so
28:28
yeah so it's LLM nothing more momentous
28:31
than that. So as it turns out, uh, GPT-3
28:35
uses 96 transformer blocks, 96 blocks, and
28:40
each block has 96 causal attention
28:43
heads.
28:44
Okay. And you can see you can read the
28:46
GPT-3 paper. It gives you all the details
28:48
of the architecture. That is interesting
28:50
because for GPT-4 they didn't publish the
28:51
architecture; after GPT-3
28:55
everything became closed. So we actually
28:58
don't know what the architecture is even
29:00
though there's a lot of speculation on
29:03
Twitter. So, uh, but GPT-3, we know exactly
29:03
what happened right 96 blocks each has
29:06
96 causal attention heads. Um and then
29:09
the data was actually they scraped 30
29:11
billion sentences um from a whole bunch
29:14
of sources, web text, Wikipedia, a bunch
29:16
of book databases. Um and um and then
29:19
they basically just took those 30
29:21
billion sentences and just trained it
29:23
exactly next word. That's it.
29:27
Now when they trained GPT-3, I think it
29:28
cost them a lot of money um because
29:31
things were not as we hadn't figured out
29:34
how to do as efficiently as we know now.
29:36
uh but it was still pretty amazing and
29:38
I'll talk about you know what is so
29:39
special about GPT-3 in just a minute or
29:41
two. So, so this is what we have here
29:44
and as you folks have seen the notion of
29:46
generating text right is very powerful
29:49
right uh because we can obviously
29:51
generate text but we can also generate
29:53
code because code is just text uh we can
29:55
generate documentation for code we can
29:57
summarize text we can answer questions
29:58
we can do chat I mean the list goes on
30:00
all the excitement we see around genai
30:03
from the time ChatGPT came out is
30:05
precisely because the simple idea of
30:07
text in text out is just so flexible
30:12
It's so versatile. It can handle all
30:13
sorts of use cases. That's why there's
30:15
so much excitement.
30:17
Um, by the way, um, if you're really
30:19
curious, I would actually recommend
30:21
seeing this video where this this guy
30:24
Andre Karpathi builds GPT from scratch.
30:28
Okay, it's a fantastic video. If you if
30:31
you have even like a little bit of
30:33
curiosity about how these things are
30:35
actually built, I would strongly
30:36
recommend checking it out. Um and
30:38
there's also a little blog post where
30:39
this person you know basically if you
30:41
know numpy you can actually create GPD3
30:43
GPD using numpy without any using any
30:46
frameworks and things like that. So um
30:50
I I found it super interesting and
30:52
helpful to understand what exactly is
30:53
going on. So if you would like to do
30:55
this. Okay. So now we're going to talk
30:57
about um decoding sampling strategies
31:00
which is I said that when we produce uh
31:03
when when when we come up with the
31:05
softmax for that last token right we
31:07
have 50,000 choices. What do we pick
31:10
right as it turns out to actually get
31:13
really good performance out of uh genai
31:15
systems like ChatGPT you need to be quite
31:17
thoughtful about the how to decode right
31:19
how to actually sample from that table.
31:21
So we'll talk about that for a bit. So,
31:25
so the first of all definition the
31:27
process of choosing a token from the
31:29
probability distribution coming
31:30
out of the softmax, right? I'm sticking with
31:32
this table right here, this is the
31:34
softmax right this process of choosing
31:36
it is called decoding that's a technical
31:38
term for it right we have to we get this
31:40
table we have to decode meaning we have
31:42
to pick something from this table okay
31:44
that's called decoding now
31:48
there are two sort of extreme cases of
31:51
very simple ways to do it.
31:53
The first thing of course is just pick
31:55
the one just pick the word with the
31:56
highest probability.
31:58
This is called greedy decoding.
32:02
Okay.
32:03
So in this case for example if stormy at
32:06
0.6 is the highest probability in this whole
32:08
table, we just pick stormy. Okay. So that
32:10
is the obvious extreme simple case. The
32:14
other thing we can do which is also
32:15
super simple is that because we have a
32:18
probability table here, we can just
32:20
reach into the table and sample a word
32:22
out of it, right? In proportion to its
32:24
probability, which means that if you if
32:27
if you have this table and you're
32:28
sampling from it, if you sample from it
32:30
100 times, 60 times you probably get
32:33
Stormy because the probability is 0.6. But
32:36
some small fraction of the time you may
32:38
get strange things like aardvark and
32:39
zebra and so on and so forth,
32:42
right? you're just literally doing
32:44
random sampling.
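Both extremes in a few lines of NumPy (toy numbers from the running example):

```python
import numpy as np

words = ["stormy", "night", "foggy", "aardvark", "zebra"]
probs = np.array([0.60, 0.30, 0.05, 0.03, 0.02])

greedy = words[int(np.argmax(probs))]        # always "stormy"

rng = np.random.default_rng()
sampled = rng.choice(words, p=probs)         # "stormy" about 60% of the time
print(greedy, sampled)
```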
32:46
That's a fine way to do it too, right?
32:48
There's nothing wrong with that. So
32:50
these these are both options. So the key
32:53
thing you need to remember is that the
32:56
which one you pick and there are some
32:58
variations on it which we'll get to in a
32:59
moment. What you pick, which way to
33:01
decode you pick really depends on what
33:03
your task is, what you're trying to use
33:05
the the system for, right? The LLM for.
33:08
So the the the broad thing to remember
33:10
is that if you're working on questions
33:13
for which the factual accuracy of the
33:16
response is really important
33:19
and or you want the output to be
33:22
deterministic meaning every time you ask
33:24
it a particular question you really want
33:26
the same answer back right you can
33:28
imagine a customer call support agent
33:31
where there two different customers ask
33:33
the same question and they get different
33:34
answers right you don't want that so you
33:37
want deterministic outputs. So in those
33:40
situations, you should use greedy
33:41
decoding is a good starting point
33:43
because you will get you know you won't
33:45
get any random stuff because for any
33:48
given input sentence the softmax that
33:51
comes out of that table is not going to
33:53
change. It's the same table and if
33:55
you're always picking the highest number
33:57
in the table that's not going to change
33:58
either. So guaranteed determinism
34:03
and I found that for reasoning questions
34:05
and things where you know you're asking
34:07
questions, math questions, reasoning
34:08
questions, logic questions, you should
34:10
really sort of keep it as sort of greedy
34:12
as possible in my experience. Okay. Now
34:15
there are other situations where random
34:18
sampling is actually a better option. If
34:20
you're doing creative things, right?
34:22
write a poem, write a haiku, write a
34:24
screenplay, things like that. You do
34:26
want a lot of creativity in which case
34:27
you actually randomness is your friend,
34:30
right? You get a lot of different
34:31
varieties of responses, diversity of
34:32
responses, all that is really good. The
34:35
price you pay for it is that you lose
34:36
determinism. The outputs are
34:39
going to be stochastic. They're going to
34:40
be random. They're going to vary from
34:41
the same question. The answer is going
34:42
to vary again and again. But in many
34:44
cases, maybe it's okay. You don't care.
34:47
Okay, so that's sort of how roughly how
34:49
you think about it. The other thing I want to say
34:53
is that the diversity of response is also
34:53
important because you if you imagine a
34:55
chatbot um if you ask questions if the
34:58
chatbot always responds in the same
35:00
stilted robotic fashion right it kind it
35:03
starts to get annoying you want some
35:05
variation in the output right because a
35:07
human will never give you the same thing
35:08
back though I must say that when I
35:11
interact with call center agents I think
35:13
they're just cutting and pasting from a
35:14
text library so it does look kind of
35:16
robotic u so maybe we are already kind
35:18
of used to this but anyway Okay, so
35:20
those are some of the things to keep in
35:21
mind. Yeah,
35:24
>> if you're using random sampling, do you
35:26
end up with a better estimation of the
35:28
uncertainty, and are the probabilities more
35:33
calibrated, in the sense that the table
35:35
that you end up at the end is the real
35:36
probability that you observe from the
35:39
words in your corpus.
35:42
>> The table doesn't change regardless of
35:43
how you sample it. The table is a
35:45
starting point for sampling.
35:47
The all of all decoding is about what
35:50
token from the table you're going to
35:51
pull out.
35:53
>> Oh, so it doesn't impact the loss
35:54
function.
35:55
>> No.
35:56
>> Yeah. It's all those things are fixed.
35:58
You literally get the table and then you
36:00
literally can forget how you got the
36:02
table and now decoding starts.
36:06
>> Is there a reason why it would generate a
36:09
different answer given the same prompt
36:11
if we run it again and again? Because
36:12
they are using random sampling.
36:14
>> Correct. That's exactly why. And we'll
36:16
see I'll see do a demo of it very very
36:19
shortly because you can actually
36:20
manipulate it. Uh
36:22
>> if you do the prediction word by word,
36:25
is there a way to make it resilient to
36:27
mistakes? Like if you say the night was
36:29
dark and hard work, that can mess up the
36:32
next word, right?
36:33
>> It can totally mess it up.
36:34
>> So how does it can it get itself back on
36:37
track?
36:37
>> It cannot. And so great question. And
36:40
we'll look at an example of things going
36:42
off the rails in just a second. Yep.
36:46
Is this how Bing works where you can
36:48
slide between being more creative, more
36:51
accurate?
36:52
>> Yeah, exactly. So, Bing has creative,
36:53
balanced, precise something, right? Uh
36:56
they're basically under the hood,
36:57
they're manipulating some of the parameters;
36:59
we're going to look at some of those
37:00
parameters in just a moment. They're
37:01
just manipulating it for you. But if you
37:03
use the API, you can manipulate it
37:05
directly.
37:09
Okay. Um All right. So, so here's sort
37:14
of the basic thing to remember about
37:15
random sampling.
37:17
So, our hope is that the, you know, for
37:19
any given sentence, we think that there
37:22
is probably some set of good answers for
37:24
the next word and a whole bunch of bad
37:26
answers, right? Intuitively. So, we want
37:30
the probability of the good stuff,
37:33
right? We we want like a you can imagine
37:36
a distribution is going like that. There
37:38
is the head of the distribution, the
37:39
first few words in the distribution. if
37:41
you sort them from high to low
37:42
probability and then there's all the
37:44
long tale of you know kind of you know
37:46
inappropriate not inappropriate
37:48
irrelevant words right so our hope is
37:51
that the model is so good that for any
37:53
given input phrase it it basically
37:55
concentrates the output probability in
37:57
the softmax to just a few good words and
37:59
sort of kind of zeros out everything
38:01
else that is the ideal scenario because
38:04
in that scenario if you do random
38:06
sampling you by definition you'll pick
38:08
something from the high quality head of
38:10
the distribution and life is good. Okay.
38:13
Now, we want random sampling to sample
38:16
from the head and not from the tail,
38:18
right? That's the key point. And what do
38:19
I mean by head and tail? Let's be very
38:21
clear.
38:26
So, um imagine you have
38:30
take the table that we looked at the
38:31
softmax table which went from whatever,
38:33
aardvark to zebra, right, and let's say we
38:35
sort the table based on high to low
38:37
probabilities. So maybe what's going to
38:39
happen is that stormy
38:42
is going to have a probability of, I
38:43
don't know, 0.6, and I think, if I remember
38:46
right, night had a probability of 0.3,
38:51
and then a there was a whole bunch of
38:53
other words
38:56
all the way to the 50,000th word right
39:00
from highest low probability so this is
39:02
what I so this is you can think of this
39:04
as like a probability distribution
39:06
okay and So basically what we are saying
39:09
here is that these this is the head of
39:12
the distribution
39:13
while this long tail is the tail of the
39:16
distribution and we want our system to
39:18
grab something from the head and not
39:21
from the tail because the head is the
39:23
stuff that's actually the relevant
39:24
useful good stuff. Okay, that's really
39:26
what we're trying to do here. Does it
39:28
make sense? Okay. So,
39:32
so to come back to this um
39:37
and here is like the most important
39:39
point to remember about this slide.
39:41
While the probability of choosing any
39:43
individual word in this long tail is
39:46
pretty small. For any one word, it's
39:47
pretty small. The probability of
39:49
choosing some word from the tail is
39:51
high.
39:54
Some word from the tail is high. So to
39:56
go back to this thing here. Yeah. Uh so
39:58
in this particular example
40:00
0.6 + 0.3, there is a 0.9 probability it's
40:03
going to be either stormy or night but
40:05
there is a 10% probability it's going to
40:06
be one of these words
40:09
and who knows what that word might it's
40:11
going to be it might be some random
40:12
nonsense word right so what that means
40:15
is and this goes to um
40:18
this goes to point from before if the
40:21
LLM happens to sample a token from the
40:24
tail which is not good it won't be able
40:25
to recover from its mistake it'll just
40:27
go off the rails
40:29
Which is why every word that gets
40:31
generated is really important to get it
40:33
right, because it can't recover very
40:35
often.
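The head-versus-tail arithmetic in code (illustrative numbers: a 2-word head holding 0.9 and a 49,998-word tail sharing the remaining 0.1):

```python
import numpy as np

head = np.array([0.6, 0.3])
tail = np.full(49_998, 0.1 / 49_998)   # each tail word is ~2e-6

print(tail.max())   # any ONE tail word: vanishingly unlikely
print(tail.sum())   # SOME tail word: 0.1, i.e. a 10% chance per generated token
```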
40:37
>> Is there a technical way to define the
40:40
difference between the head and the
40:41
tail? No,
40:44
it's sort of like this common thing
40:45
people use and the reason why it's not
40:47
is because uh it's so problem dependent
40:50
as to what like the you know like
40:52
basically you're saying that for any
40:54
particular problem I think depending on
40:55
the question the right number of words
40:58
is probably 20 for the same for a
41:00
different question maybe it's 40 for a
41:02
totally different model for the same
41:04
question maybe 10 so because of that
41:05
variability we just can't figure it out
41:09
okay so um all All right. So, and I'll
41:12
show you this how to do this in just a
41:14
moment. So, just for kicks, um I went in
41:18
to GPT-3.5 and then I said students at
41:22
the MIT Sloan School of Management are
41:25
and I said predict the next word. Okay,
41:29
so it turns out invited is the most
41:31
likely next word followed by given,
41:33
expected, required and able. These are
41:35
the top five words.
41:38
Okay. And the probability is 3% 2% you
41:40
see the you know pretty small
41:42
probabilities but then the words that
41:43
are below it right the remaining
41:45
whatever 50,000 odd words are even
41:47
lower. Okay. So here the most likely
41:50
word is invited. So what I did is I went
41:52
in there and said okay let me try again
41:54
now with "students at the MIT Sloan School of
41:56
Management are invited." And now
41:59
autocomplete that find me the next
42:00
thing. So it comes back with see now
42:03
this is my new prompt: students at the MIT Sloan
42:04
School are invited to submit their original
42:07
white papers to the annual MIT
42:08
something. It seems reasonable. Doesn't
42:11
seem bad, right? It seems reasonable.
42:13
Okay. Now, let's mess it up a bit. So
42:16
now I go in there and I noticed that the
42:19
word masters and the word spending were
42:22
much lower probability than these top
42:24
five words. Right? I just mucked around
42:26
till I found these things. So this is
42:28
only 0.05%. This is 0.1%.
42:31
So these are clearly in the tail, right?
42:34
They're not the most likely. So I said,
42:36
what's going to happen if I actually
42:37
force it to use masters and then I force
42:41
it to use spending? Okay, this is what I
42:43
what you get: Students at the MIT Sloan School of
42:46
management are masters of chaos.
42:49
They routinely blow past deadlines
42:52
fracture and then I couldn't take it
42:53
anymore. I stopped it.
42:58
a single word
43:00
and then I said students at the school of
43:03
management are spending, which is the
43:03
other unlikely word the semester
43:05
learning life skills so far it looks
43:07
promising through knitting socks
43:13
I'm not making this stuff up but this is
43:14
GPT-3.5.
43:17
so yes it will go off the rails you have
43:19
to be super careful um and so
43:22
so the way we sort of tame random
43:25
sampling to make it work for us uh
43:29
Do you think that these sentences refer,
43:32
like, the past, like the masters of chaos,
43:35
blow past deadlines, like, is it something
43:38
that was in the training set?
43:40
>> Yeah. I mean that is the thing is it's
43:42
basically doing rough it's doing some
43:45
very rough and approximate pattern
43:47
matching from all the training data it
43:48
was trained on. So it doesn't mean for
43:51
example that on the mit.edu
43:53
website right on the collection of sites
43:56
that actually there were text saying
43:59
that yeah MIT Sloan students were doing
44:00
all this crazy stuff it's probably more
44:02
like a whole bunch of you know u college
44:06
university websites probably had some
44:08
content like that maybe there was a
44:09
bunch of Reddit people posting stuff
44:10
like that so you're just doing some
44:12
rough pattern matching it's basically
44:14
looking the thing is you have to
44:15
remember always with large language
44:16
models what it's trying to give you it's
44:19
giving you a response that is not
44:22
implausible
44:23
There is no guarantee of correctness.
44:25
There's no accuracy. Nothing like that.
44:27
It's giving you a probabilistically
44:29
plausible response. That's it. Okay.
44:32
Now, us being Sloan, uh, we look at
44:35
stuff like this and we get offended. So,
44:36
we are we are imputing our values onto
44:39
its generation, but it doesn't know and
44:40
it doesn't care.
44:43
So in fact if I when I typed in
44:46
something like list all the awards that
44:48
Professor Ramakrishnan has won, it gave
44:50
me an amazing list of awards apparently
44:52
I won this and I won that I won none of
44:55
it is true to which a student said not
44:58
yet.
45:00
So I made a note of that
45:01
fine person's name. So [laughter]
45:05
>> so yeah so that's what's going on.
45:09
Yeah
45:11
>> I get the sense like Maybe there's
45:12
>> Could you use the microphone, please?
45:15
>> I get the sense that maybe there's some
45:17
sort of sliding window that's somehow
45:20
weighting later words more strongly than
45:23
earlier words given how far out because
45:26
I feel like the context of students at
45:28
MIT, right, should have steered in a
45:30
certain direction even with the presence
45:32
of the word masters. So, is there
45:34
something like that happening?
45:35
>> No, it is just the thing is think about
45:37
the training process, right? In the
45:38
training process, uh, we gave it
45:41
sentence fragments and we asked it to
45:42
predict the next word. Now, clearly the
45:45
more you know about the input that's
45:48
coming and the longer the input, the
45:49
more clues you have to figure out what
45:51
the right next prediction is going to
45:53
be. Right? If I say the capital uh the
45:56
capital of you'll be like, I don't know,
45:58
it's got to be a country, I guess, or a
46:00
state, but I don't know anything more
46:01
than that. But if you if I say the
46:03
capital of France is dramatic narrowing
46:06
of the cone of uncertainty. So that's
46:08
basically what's going on. And in fact
46:11
some there's a very beautiful expression
46:12
I've heard, which is that what the
46:14
LLMs do, they call it subtractive
46:17
sculpting. So what I mean by that is
46:20
it's sort of like when you start it's
46:22
like this big block of marble and then
46:24
every word chips away at the marble and
46:26
then when you're done it's kind of
46:27
pretty clear it's David inside the
46:29
marble. Right? That's sort of what's
46:31
going on.
46:34
All right. So to come back to this, uh
46:36
what can we do? We can there are three
46:38
ways in which you can tune random
46:40
sampling to make it work for you. The
46:42
first way and and the the idea of all
46:44
these things is that you have some
46:46
probability distribution. We are now
46:48
going to sort of manually
46:51
focus on the head and then we're going
46:53
to kill everything else and only focus
46:55
on the head and sample from that head.
46:56
Okay, which immediately begs the
46:58
question, how will you decide what the
46:59
head is? Right? And that was sort of
47:01
Alina's question from before. How will
47:02
you decide what the head is? So, one way
47:04
we do that is to say, you know what, I
47:07
know we have 50,000 words in the
47:08
vocabulary. I don't care. Each time, I'm
47:11
only going to pick the top K words,
47:13
right? K could be 10, 20, 30, 40, 50.
47:15
This very problem dependent. I'm going
47:17
to pick the top 20 words and I'm going
47:18
to ignore everything else and only
47:20
sample from the top 10 or the top 20.
47:22
That's called top K sampling. And so the
47:24
way it works is that let's say this is
47:25
your whole distribution and I just
47:27
stopped at wet instead of going all the
47:28
way to 50,000, right? And then you see
47:30
and you decide let's say that you want k
47:33
to be two. So you just grab the top two
47:36
words k equals 2 and then you reormalize
47:39
the probability so they add up to one.
47:41
So 6 and2 reormalize it becomes 75 and
47:45
0.25.
47:46
And now just imagine that this is the
47:48
new softmax table that you're sampling
47:50
from and you grab a number from I'm
47:52
sorry a word from here and you're done.
47:55
Okay, that's called top K sampling,
47:58
very commonly used
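A sketch of top-k sampling on the toy table from the slide (k = 2, so 0.6 and 0.2 renormalize to 0.75 and 0.25):

```python
import numpy as np

def top_k_sample(words, probs, k, rng):
    idx = np.argsort(probs)[::-1][:k]       # indices of the k most likely words
    kept = probs[idx] / probs[idx].sum()    # renormalize the head to sum to 1
    return words[rng.choice(idx, p=kept)]

words = np.array(["stormy", "night", "foggy", "wet"])
probs = np.array([0.6, 0.2, 0.15, 0.05])
print(top_k_sample(words, probs, k=2, rng=np.random.default_rng()))
```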
48:00
but it has a small shortcoming,
48:03
which is that it basically assumes that
48:06
for this K that you have come up with, let's
48:07
say 20, for every input sentence the right
48:11
number of words in the head is 20, which
48:13
obviously is not a, you know, well
48:15
supported assumption it's just an
48:16
assumption so then the question becomes
48:18
can we do better right because what you
48:21
really want is you want the words that
48:24
you pick to have the bulk of the
48:25
probabilities,
48:27
right? As much probability as possible.
48:29
You don't really care how many words are
48:30
inside it as long as together they have
48:32
a lot of probability. Which brings us to
48:34
something called top p sampling also
48:37
called nucleus sampling where instead of
48:39
deciding on the number of words we're
48:40
going to pick every time, we decide you
48:42
know what we're just going to
48:45
choose all the words such that the
48:47
probability of such words that we have
48:49
chosen is at least P.
48:51
Sometimes it may be just two words.
48:53
Sometimes it may be 20 words. We don't
48:54
care. And then we sample from it.
48:58
Okay. So here, same thing here. Let's
49:02
say you go with P = 0.9. So, 0.6
49:05
+ 0.2 + 0.1 = 0.9. Boom. We have hit 0.9. We
49:09
stop and then we grab these three words
49:11
and then we renormalize them to get this
49:14
thing and then boom, we sample from it.
49:16
So this actually is even more effective
49:18
in my opinion because it sort of it
49:19
fluctuates. It doesn't hardcode the
49:21
number of words you think is important.
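And a matching sketch of top-p (nucleus) sampling: keep adding words, highest probability first, until the running total just crosses p, then renormalize and sample.

```python
import numpy as np

def top_p_sample(words, probs, p, rng):
    order = np.argsort(probs)[::-1]            # sort high to low
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, p)) + 1  # first point where the total reaches p
    idx = order[:cutoff]
    kept = probs[idx] / probs[idx].sum()
    return words[rng.choice(idx, p=kept)]

words = np.array(["stormy", "night", "foggy", "wet"])
probs = np.array([0.6, 0.2, 0.1, 0.1])
print(top_p_sample(words, probs, p=0.9, rng=np.random.default_rng()))
# keeps 0.6 + 0.2 + 0.1 = 0.9, as in the example above
```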
49:23
Uh was there a question? Yeah.
49:25
>> What if like let's say 0.9 ended up like
49:29
if foggy was 0.12 will it only take 0.1
49:32
from foggy?
49:33
>> Yeah. What it does is it so you give it
49:35
a give it a 0.9. What it's going to do
49:37
is it's going to keep adding words till
49:39
it just crosses that number.
49:43
>> Yeah. I was thinking, can't you just set
49:46
a threshold for the words? Like, don't
49:50
pick a word below some probability. This top
49:53
P, what if it was like 0.89,
49:57
and then the other one is just 0.1. So
49:59
you pick two words.
50:00
>> Yeah, you can do that. Um and in fact in
50:03
what you can do is you can always say I
50:04
want to pick a word which is the most
50:06
likely word, right? You can do that. But
50:08
if you say I want a word um I want only
50:12
consider words whose probabilities are
50:13
at least something then basically what
50:15
you're saying is that I'm just going to
50:16
keep on doing and then we draw a line
50:18
here right but the problem is you don't
50:21
know how many words have crept over your
50:23
threshold
50:25
right you might for example find that to
50:27
to go to your example maybe you said 0.9
50:29
as a threshold; maybe there was
50:31
a word at 0.89
50:33
that you just missed because it didn't
50:34
make the threshold. You'll be like, oh no,
50:36
I should have made it 0.89. So there's no
50:38
right answer, unfortunately. But
50:40
this is exactly the kind
50:41
of thinking that brought us these kinds
50:43
of ways to tune these things.
50:46
Sort of, you know, the foundation here
50:48
is the realization that we cannot
50:51
sort of a priori decide what the
50:53
right number of words is. So we have to
50:54
find heuristics to try to do these
50:56
things. So in practice people try all
50:58
these methods. In fact you can do both.
51:00
You can set it up so that you
51:02
do top-p and top-k at the same time.
51:04
Basically you're saying grab words uh
51:07
till you cross the probability uh or you
51:10
cross k whichever is earlier.
51:15
Okay. So those are two methods people
51:17
use heavily.
51:19
The third method is called
51:21
temperature. And the idea of
51:23
temperature is that in top K and top P,
51:26
we sort of have to decide on a number
51:28
up front K or P and then we just draw
51:31
the line and look at the words that pass
51:33
the threshold. Temperature is like a
51:35
softer way to do the same thing. It's
51:37
a softer way to emphasize the head more
51:39
than the tail. So, um, let me grab the iPad. All
51:44
right.
51:52
So the idea of temperature is, remember,
51:55
when we have this, um, oops, softmax,
52:01
so, you know, aardvark
52:04
all the way to zebra,
52:06
you have all these probabilities, right?
52:09
Now remember, where did we get these
52:10
probabilities? These probabilities came from
52:12
a softmax. So what is a softmax? We
52:15
basically had, you know, all these nodes,
52:18
say 50,000 nodes, in some output layer,
52:22
and these were just numbers, let's just
52:23
call them a_1 through a_50000,
52:27
and then we ran it through a softmax
52:29
function. And what did it do? It basically
52:31
did e^(a_1), e^(a_2), all the way to
52:36
e^(a_n), let's call it n of them, and then it
52:39
divided each by the sum of all these
52:40
things to get the probabilities. So this
52:42
number became e^(a_1) divided by the
52:47
sum of all the e^(a_i),
52:52
okay, so e^(a_1) divided by e^(a_1)
52:54
plus e^(a_2) and so on and so forth. So
52:55
this is how softmax works. I'm just
52:57
refreshing your memory from a few weeks
52:59
ago. Okay. Now what temperature does is,
53:03
let me just write it a little
53:06
easier,
53:08
take that same sum, e^(a_1/T) plus e^(a_2/T), all the
53:13
way
53:15
to e^(a_n/T):
53:18
what it does is it introduces a new
53:20
parameter here called temperature, which
53:22
is that we divide every a_i here by T, so the probability of word i becomes e^(a_i/T) / (e^(a_1/T) + e^(a_2/T) + ... + e^(a_n/T)).
53:41
And the effect of adding this little
53:43
knob called temperature here, right, is
53:45
very interesting. So let's assume for a
53:48
second that t is a very very small
53:50
number.
53:52
Assume that t is pretty close to zero,
53:53
very small number. So if t is close to
53:57
zero,
54:00
what's going to happen is that since
54:03
T is in the denominator here, all these
54:05
numbers a_i/T
54:06
are going to become really big in
54:08
magnitude because T is really small.
54:10
Right? If a1 happens to be a positive
54:13
number, it's going to become really big.
54:14
If a1 is a negative number, it's going
54:15
to be a really really small negative
54:16
number. Okay? Now in particular, what's
54:19
going to happen is the biggest of all
54:20
the a numbers, it was already big. Now
54:23
it's going to get massive
54:26
which means that its probability is
54:28
going to dominate everything else
54:30
because you're taking a really big
54:31
number and doing e raised to that number.
54:35
So what's going to happen is,
54:37
okay,
54:40
so if T is close to zero,
54:47
the biggest a,
54:56
or rather
54:59
the word corresponding to the biggest a,
55:06
will have a probability of one or close
55:09
to one.
55:12
And since all the probabilities have to
55:14
add up to one, that means that
55:15
everything else is going to be zero. So
55:17
the biggest A will have a probability of
55:18
one. Everything else is going to have
55:20
zero. So reducing temperature close to
55:22
zero means that the probability
55:24
distribution is going to peak at the
55:25
biggest word and everything else is going to
55:27
become zero. So in practice what that
55:29
means is that if you look at something
55:30
like this if you apply um
55:34
temperature here
55:37
what's going to happen is that the stormiest
55:40
entry is going to get something like 0.999
55:43
and everything else right it's going to
55:46
get wiped out
55:49
right it's going to get really small
55:51
it's going to get even smaller and so on
55:52
and so forth and so when t is exactly
55:55
zero basically what that means is that
55:57
this is going to be exactly one
55:59
and everything else is going to just get
56:00
zero. So when one of them is one and
56:02
everything else is zero when you do
56:03
sampling from it you're just picking the
56:05
biggest one, right, which means it sort of
56:07
becomes greedy decoding.
56:10
So that is the value of having
56:12
temperature as a knob. Conversely, if
56:14
you take temperature T and make it
56:16
bigger and bigger, right, as opposed to
56:19
smaller and smaller, this distribution
56:22
is going to become flat. Meaning all the
56:24
words are going to have the same
56:25
probability.
56:27
So any one of these words becomes
56:29
equally likely. So T close to zero, the
56:32
biggest word gets picked. T
56:34
exceeding one, going to say 1.5 or 2,
56:38
any word becomes likely. It becomes
56:40
truly random. So that is the effect of
56:42
temperature.
56:44
And this knob, you can actually tune it.
56:47
Um,
56:50
all right. So, uh, this is called, uh,
56:53
I'm at
56:56
platform.openai.com.
56:57
It's called the OpenAI playground. And
56:59
in this playground, you can actually put
57:01
in all the sentences you want. You can
57:02
choose the model and then you can it'll
57:04
actually tell you what the softmax
57:05
output is. Okay, it's very handy. So
57:09
this is where I said oh so here are a
57:12
few things I want to draw your attention
57:13
to. The first one is you see temperature
57:15
here the default is one. If you make it
57:18
zero it becomes greedy decoding but you
57:20
can make it more than one if you want.
57:22
It'll give you all kinds of crazy stuff
57:24
as you will see in a second. Okay. Um
57:27
and then they don't have top k. They
57:30
don't have support for top k in OpenAI, but
57:32
they do have support for top p. You can
57:35
put P here in this thing. And I'll
57:37
ignore these things. You can read the
57:38
documentation uh to understand those
57:40
things. But you can actually ask it to
57:42
show the probabilities. So I'm going to
57:44
ask it to show all the probabilities.
57:46
I'm also going to tell it um don't go
57:48
nuts. Just give me like a few outputs.
57:50
Let's just call it 30. Okay. And now I'm
57:53
going to enter some sentences for us to
57:55
see what's going on. So let's enter the
57:57
same sentence as before. Students
57:59
at the MIT
58:03
Sloan
58:05
School of Management
58:08
or I think that's what we had right so
58:10
submit
58:14
so okay this is what it's filling out
58:16
now you go click on this word you get
58:18
all the probabilities
58:20
pretty cool right so you can see invited
58:23
given expected these are all some of the
58:25
things we had. Uh, and so what you can do
58:27
is you can go in and say here clearly uh
58:32
aching. What is that?
58:36
That's very weird. So I'm going to again
58:40
I'm just going to check to make sure
58:41
that I use the same sentence as before.
58:43
It's very brittle. Students at the MIT Sloan School of
58:46
Management are. Okay. Uh, are.
58:50
oh I know what it is.
58:54
Okay.
58:57
Okay. So, let's try that again.
59:03
Okay. So, invited 3.18. That's what we
59:05
had, right? Invited 3.19. 3.8. Okay.
59:08
Close enough. So, this is what we have.
59:10
And now, if you wanted to force it to
59:12
choose invited here, you just go in
59:15
there and make the temperature zero.
59:18
Temperature zero means it's always going
59:20
to pick the best one. Greedy decoding.
59:21
So, you can hit it again.
59:25
And it better give you invited. See it
59:27
has given you invited.
59:29
So that's how you manipulate it using
59:31
temperature. Um, you
59:34
can also manipulate top p. You can do
59:35
all these things, right? So
59:38
people actually use it very
59:40
heavily for debugging, right, and when
59:41
they're playing with a bunch of data
59:42
with a model for a particular use
59:44
case. You just play with it to get a
59:45
sense for what kinds of probability
59:46
distributions you see and then you can
59:48
fine-tune it using that
59:50
knowledge. Um, so yeah, check this out.
59:54
Oh, uh, I said that if the temperature
59:58
goes above one to a higher number, every
1:00:01
word in the 50,000 becomes sort of
1:00:03
equally likely, which means it's going
1:00:04
to produce garbage, right? So, let's
1:00:06
actually see garbage production in
1:00:07
action.
1:00:09
So, all right, let's just nuke this.
1:00:11
Okay, and I'm going to take the
1:00:13
temperature and max it. I'm going to
1:00:15
call it two. Okay, which means that
1:00:19
literally anything is possible.
1:00:22
Submit.
1:00:25
Ladies and gentlemen, I present to you a
1:00:28
modern large language model.
1:00:35
Isn't it like shocking
1:00:38
>> Because when we work with these language
1:00:39
models, when we see them
1:00:41
doing some smart things, we always
1:00:43
ascribe some level of, you know,
1:00:45
interesting abilities and intelligence
1:00:46
and so on, and then you realize all I had
1:00:48
to do was go in there and change one
1:00:50
parameter and it's garbage.
1:00:52
So you can see the amount of garbage
1:00:54
right it's showing just by twiddling one
1:00:56
parameter. So in
1:00:58
production use cases, when you're
1:01:00
building applications on top of these
1:01:01
large language models you got to be very
1:01:02
very careful with these parameters. So
1:01:05
pay attention. All right. So um what did
1:01:09
I have next?
1:01:13
Okay. So that brings us to the uh sort
1:01:17
of the end of the decoding section.
1:01:22
Oh, see now I'm going to switch gears
1:01:24
and talk about tokenization, right?
1:01:27
which is that, um, so far in all the
1:01:30
things we have done, including
1:01:32
the homeworks and so on, we looked at
1:01:34
tokenization as the standard process,
1:01:36
right, for taking a bunch of text and
1:01:38
vectorizing it, which was the STIE
1:01:41
process: standardize, tokenize, index,
1:01:44
and then encode. And the
1:01:46
standardization I had mentioned earlier
1:01:48
uh, strips out punctuation, lowercases
1:01:50
everything, uh, sometimes removes stop
1:01:53
words like "a" and "the", things like that. It
1:01:55
also does these things called stemming.
1:01:57
But turns out if you actually work with
1:01:59
uh something like GPT, you know that
1:02:02
it hasn't stripped out punctuation. The
1:02:04
punctuation is really good, right? It
1:02:06
uses case, uppercase, and lower case.
1:02:08
And in fact, even better, you can
1:02:10
actually make up a word as part of your
1:02:11
question and it'll use the word
1:02:13
consistently in the output. So just for
1:02:15
fun,
1:02:18
um I made up a word.
1:02:22
I just did this yesterday, a day before.
1:02:23
I said, here's a new word and its
1:02:24
definition. The word is
1:02:28
reldoh.
1:02:30
I said the definition: a student who
1:02:31
understands deep learning backwards.
1:02:33
Please use this word in a sentence. And
1:02:35
here is a sentence it's coming up with.
1:02:37
Um
1:02:39
I was, like, a little shocked: "During the
1:02:41
advanced neural network seminar, it
1:02:43
became evident that Jane was a true reldoh,
1:02:45
effortlessly explaining even the most
1:02:47
complex deep learning concepts in
1:02:48
reverse order."
1:02:50
Okay. So it clearly knows how to use
1:02:53
anything you make up. Right? So
1:02:54
it has the ability to compose things
1:02:56
from scratch as opposed to just looking
1:02:59
up stuff. So where is the thing coming
1:03:01
from? Right? That's the question. And
1:03:02
the answer is this very beautiful thing
1:03:04
called byte pair encoding, which we'll
1:03:06
look at next.
1:03:10
So, all right. So here, um, when we
1:03:14
look at this process, the
1:03:15
disadvantages are some of the things we
1:03:17
have discussed, which is that we want to
1:03:18
be able to preserve punctuation. We want
1:03:19
to be able to preserve case. We want to
1:03:21
be able to handle new words and so on
1:03:22
and so forth. So, uh, the
1:03:26
sort of modern models like BERT and
1:03:28
so on, they use different tokenization
1:03:29
schemes. They don't actually do the STIE
1:03:31
thing, and the GPT family uses byte pair
1:03:34
encoding, BPE. Uh, BERT uses something
1:03:37
called WordPiece. All of these ways of
1:03:40
encoding, the fundamental idea is to
1:03:42
say, well, you know what? Whatever
1:03:44
language you're working with,
1:03:46
why don't we start first of all with all
1:03:47
the individual characters? Because if
1:03:50
you could actually work with individual
1:03:51
characters, you can clearly compose any
1:03:53
word that comes up, right? Reldoh is just R,
1:03:56
E, L, D, O, H, right? Six tokens. If you're
1:03:58
working with characters at the character
1:04:00
level, but working only with characters
1:04:02
is not great, right? Because that means
1:04:05
you're giving the model no
1:04:07
information about the world. It has to
1:04:09
learn every word from scratch, what the
1:04:11
word means and so on and so forth. So
1:04:14
it would be nice if we could actually give
1:04:15
it words as well. But we don't
1:04:17
want to give it infrequent words because
1:04:20
infrequent words by definition are not
1:04:22
worth adding to your vocabulary. They would
1:04:25
just, you know, take up another
1:04:26
embedding vector and things like that.
1:04:28
For infrequent words, we'll
1:04:30
just compose them; we'll
1:04:31
actually construct them on the fly
1:04:32
because we can always use characters.
1:04:35
Okay, so we don't want to put every word
1:04:37
in there. We only want to put frequent
1:04:38
words. But to give this thing the
1:04:41
ability to compose new words and not
1:04:43
always have to go to characters, we will
1:04:45
give it parts of words. These are called
1:04:47
subwords. So the key idea is that let's
1:04:52
come up with a way to build a vocabulary
1:04:54
which has characters, full words that are
1:04:56
frequent enough to be worth adding and
1:04:59
subwords or word fragments that occur
1:05:01
frequently enough to be worth adding. So
1:05:03
for example, take the word standardize,
1:05:07
right, normalize, standardize, and so on
1:05:09
and so forth. "ize" is going to show up a
1:05:11
lot in many places. So you don't want to
1:05:12
have standardize and normalize and so
1:05:14
on as separate entries. You just want to have "ize". You can
1:05:15
just attach it to all kinds of words,
1:05:17
right? And make it all work, right? So
1:05:19
that's the basic idea of all these
1:05:20
tokenization schemes. And BPE is one such
1:05:23
way to figure out how to actually
1:05:25
construct this vocabulary from a
1:05:27
training corpus, right? And by the way,
1:05:29
when I say characters, this will include
1:05:31
not just, you know, uppercase and lowercase
1:05:33
letters and numbers; it will
1:05:34
also include punctuation.
1:05:37
So that all these things just become
1:05:38
atomic units.
1:05:40
All right. So, uh,
1:05:42
the way BPE works is that we're going
1:05:45
to, uh, start with each character as a
1:05:47
token and I'll talk about the rest of
1:05:49
the thing on the page in just a moment.
1:05:51
Don't worry about it. We'll start with
1:05:52
each character as a token. So let's say
1:05:53
that your training corpus is just a
1:05:56
single sentence. The cat sat on the mat.
1:05:58
Okay. And even though GPT does not
1:06:02
actually do any lowercasing, so it'll
1:06:03
actually treat "The" uppercase as
1:06:05
different from "the" lowercase, uh, just for
1:06:08
simplicity, I'm just going to
1:06:09
standardize it here. So it just becomes
1:06:11
the cat sat on the mat. And then I'm going
1:06:12
to write it in this form where I
1:06:14
basically put a comma after every word
1:06:16
and then I put a little underscore to
1:06:18
show the space between the words. Okay,
1:06:20
I'm going to write it in this format.
1:06:21
And it'll become clear why I'm writing
1:06:22
it in just a second. Okay. Now my
1:06:25
starting vocabulary is just all the
1:06:27
individual letters in the training
1:06:28
corpus. So the starting is just whatever
1:06:31
all these letters. Okay, that's it. And
1:06:34
this is a starting point. And now what
1:06:35
we do and this is the key step.
1:06:38
We merge tokens that most frequently
1:06:41
occur right next to each other. So if
1:06:44
two characters or two tokens are
1:06:47
occurring right next to each other a
1:06:48
lot, let's just merge them because they
1:06:51
seem to be occurring a lot together,
1:06:52
right? May as well merge them. And so
1:06:54
here, for example, I've listed the
1:06:57
frequencies of the adjacent tokens. So for
1:06:59
example, if you look at t and h, it
1:07:01
shows up right next to each other here, and it
1:07:04
also shows up here. So, therefore, it
1:07:06
shows up twice.
1:07:08
Now H E again is showing up here. It's
1:07:11
also showing up here. So that also shows
1:07:13
up twice. CA on the other hand is only
1:07:16
showing up here. It's not showing up
1:07:17
anywhere else. So it shows up once. A, T
1:07:20
shows up three times, in mat, sat, and
1:07:24
in cat, and so on and so forth. You get
1:07:25
the idea. So you're just looking at
1:07:27
pair-wise adjacent tokens. And you pick
1:07:30
the most frequent one that's showing up,
1:07:32
which in this case happens to be a, t.
1:07:34
And then you take a and t and you merge
1:07:36
them. So it becomes "at".
1:07:40
Okay. So when you do that, when
1:07:42
you merge them, you then add that
1:07:44
new token that you've just literally
1:07:45
created to your vocabulary list, and then
1:07:48
you update the corpus to reflect the
1:07:50
merge you've just done. So now the corpus
1:07:52
becomes the cat sat on the mat, but in
1:07:55
this case there is no a and t
1:07:56
separately. There is just the "at"
1:07:58
combo token here.
1:08:02
Are we good with this step so far?
1:08:06
Take the most frequent things and merge
1:08:07
them.
1:08:12
It's a way to compress the data. In
1:08:14
fact, the algorithm came from someone
1:08:16
trying to figure out a way to compress
1:08:17
data.
1:08:18
You know,
1:08:22
think of it this way, right? Suppose I
1:08:23
tell you uh I'm I want you to compress a
1:08:25
message I'm going to send to you and
1:08:28
then you look at all the past messages
1:08:30
you've had to deal with and it turns out
1:08:32
you're finding that u certain characters
1:08:35
are occurring next to each other all the
1:08:37
time right maybe just for argument let's
1:08:40
say ABC shows up ridiculously often in
1:08:42
the messaging and then you'll be like
1:08:44
you know what, if it's always showing
1:08:45
up all the time together why treat it as
1:08:47
three things let me just call it one
1:08:48
thing, ABC. That's it. You send a single
1:08:51
token called ABC every time you
1:08:53
need ABC, not A, B, C. That's the basic
1:08:56
idea. So here if you come here that's
1:08:58
what we have and then what we do is now
1:09:01
we again do this calculation of
1:09:03
adjacent tokens on this updated corpus,
1:09:05
and you can see here T H shows up here, and it
1:09:08
shows up here too, so you get two; H E
1:09:11
shows up twice; everything else shows
1:09:13
up once. And yeah, when many things are
1:09:16
showing up with equal frequency just
1:09:18
pick one randomly from those. So we pick
1:09:19
t, h, right, and we merge that, which
1:09:22
means that we add th to our vocabulary,
1:09:25
and once we do that we update the corpus,
1:09:27
and now th is one thing
1:09:30
fused together, along with the previous
1:09:32
thing, "at", that had been fused together.
1:09:34
That is the corpus after the second merge.
1:09:36
And then we do the same thing: we find
1:09:38
the frequencies of adjacent tokens. Turns out
1:09:40
th and e are showing up twice, everything
1:09:42
else is showing up once, so we take th and e,
1:09:45
merge them to get, boom, "the". And now we
1:09:48
have the cat sat on the mat. So this
1:09:51
process continues
1:09:53
till we reach a predefined limit for our
1:09:56
vocabulary. Now as it turns out when
1:09:59
they built GPT-2 and GPT-3, let me just see,
1:10:02
I think I did some digging around on
1:10:04
this thing. Yeah. So GPT-2 and 3, they set
1:10:07
the vocabulary size to be roughly
1:10:09
50,000. So it basically kept on doing
1:10:12
this till it hit a limit of 50,000, then
1:10:14
it stopped. GPT-4, on the other hand,
1:10:17
actually goes all the way to a
1:10:18
100,000 vocabulary size.
1:10:23
Okay, so this is BPE in action. Uh, and so
1:10:28
what's going to happen is, once you
1:10:29
finish all this and you have your
1:10:30
vocabulary and all these merges
1:10:31
that you have made, when a new piece of
1:10:32
text comes in, right, the merges, remember,
1:10:36
here we merged a and t to get "at", this t h became th,
1:10:39
and so on. When a new piece of text
1:10:41
arrives, the tokenizer applies the
1:10:43
merges in the exact same order. So if
1:10:45
the new text that comes in is "the rat",
1:10:47
it's first going to apply the a-t merge
1:10:50
to fuse this here, then it's going
1:10:52
to fuse t and h to get this, and then it's
1:10:54
going to fuse th and e to get that. And
1:10:56
the final list of tokens that goes into
1:10:58
your model is going to be the token for
1:11:00
"the", the token for space, the token
1:11:02
for r, and the token for "at".
1:11:06
So let's see this in action.
1:11:12
Uh, OpenAI has its own
1:11:14
thing, but I found this, uh, site to be
1:11:17
really good. So let's, uh, tokenize
1:11:20
Hands-on
1:11:23
deep learning.
1:11:26
So you can see here
1:11:28
look at this.
1:11:30
So uppercase H is its own token. It's
1:11:34
token number 39,
1:11:36
"ands"
1:11:38
is its own token, dash is its own token,
1:11:41
"on" is its own token, and then space-deep
1:11:43
is its own token and space-learning is its own
1:11:45
token. Okay, note one thing. Suppose you
1:11:48
had said
1:11:50
let's just say you just had
1:11:51
deep learning:
1:11:53
deep has a different token than space-
1:11:56
deep.
1:11:58
Okay, what they have realized is that
1:12:01
most words are actually going to show up
1:12:03
after a space, right, much
1:12:06
more likely, so having a space attached
1:12:08
to the beginning of the word saves you a
1:12:10
lot of, sort of, you know, a lot
1:12:12
of compute and so on and so forth,
1:12:13
because they will in fact arrive almost
1:12:15
all the time with the space before it
1:12:17
right that's why they have attached the
1:12:18
space to the word itself. Um, and note
1:12:21
that Deep and, uh, deep,
1:12:25
actually, let's put it this way,
1:12:30
so Deep and deep are different,
1:12:34
right? There is Deep, there is deep. So clearly
1:12:36
it's taking case into account. Then I put
1:12:38
an exclamation here. Boom, that's its own token too. And so
1:12:43
ultimately, what goes in when you
1:12:44
have a phrase like um
1:12:48
sat on the mat.
1:12:51
So the cat sat on the mat. And you can
1:12:53
see here uppercase The, um, and then
1:12:58
let's just do another thing here.
1:13:01
So uppercase The with a space is 383;
1:13:06
lowercase the is 262. Uh, and then that's
1:13:10
distinct from just the without any
1:13:11
space. That's a different thing. So
1:13:13
these are all the tokens. Now um let's
1:13:16
try something.
1:13:18
Let's try
1:13:21
Jane.
1:13:24
So Jane is one token which is great and
1:13:27
is another token. Let's see. Rama. Ah
1:13:30
darn. My name wasn't worthy enough to be
1:13:34
its own token. Okay. But strangely
1:13:38
enough
1:13:41
I was very surprised by this. So if
1:13:44
I put rama in lowercase, it is its own
1:13:46
token.
1:13:48
I have no idea what they were scraping
1:13:51
which websites. Uh and if I put Jane
1:13:55
here
1:13:56
now J has become its own token with the space,
1:13:58
and the rest has become a different one.
1:14:01
So tokenization is, like, a
1:14:03
very interesting thing, and it works in
1:14:05
very interesting ways. But that's the
1:14:07
basic idea of what's going on under the
1:14:08
hood. I would encourage you to like
1:14:10
check out your names to see if it's
1:14:12
actually been tokenized.
1:14:13
I'm done. Thanks folks. I'll see you on
1:14:15
Wednesday.
— end of transcript —