1:14:29
9: Generative AI – Large Language Models (LLMs) and Retrieval Augmented Generation (RAG)
MIT OpenCourseWare · May 11, 2026
Transcript
0:16
Um, so let's start with a quick review. Last week we looked at BERT, how BERT was created, and we learned about a technique called masking, which is a kind of self-supervised learning. The idea of masking was very simple. We asked ourselves a question: we have seen ways in which people can take images and pre-train models like ResNet on a vast body of images, but for each image somebody had to go and label it. So for text we asked: what does it mean to label a piece of text when we don't actually have a clearly defined end goal in mind, except the general goal of pre-training things? And then we said, well, what we can do is replace some of the words in every sentence with what's called a mask token, and then just train the network to recover the blanks, to fill in the blanks. And this technique, which is one of many ways of doing what's called self-supervised learning, is called masking. We described how, if you essentially take all of Wikipedia, mask every sentence like this, and then train a network to fill in the blanks, the resulting network becomes really good at doing all kinds of interesting things, and in fact the first such network, or one of the first such networks, was called BERT. In your homework you've been looking at BERT and so on and so forth. That's masking. Now we're going to switch gears and talk about a different kind of self-supervised learning, different from masking, which turns out to be weirdly more interesting and powerful.
1:45
Okay, so we are going to look at another technique, and this technique is called next-word prediction. It is actually in some sense a special case of masking, where you're basically saying: take a sentence, and instead of randomly picking a word and making it a blank, I'm just going to take the last word and make it a blank. Then you send the sentence in and have the machine fill in the blank on the last word: predict the next word. And you don't have to use full sentences for it; you can use parts of sentences, sentence fragments, as well. So if you take the same sentence as before, "The mission of the MIT Sloan School...", you can literally divide it up: you can give it "The" and ask it to predict "mission"; you can give it "The mission" and ask it to predict "of"; you can give it "The mission of" and ask it to predict "the"; you get the idea. For every sentence fragment, you can just give it the first few words and have it predict the next one: first few, next one; first few, next one. So this is next-word prediction.
2:44
So what we're going to do now is take the transformer encoder architecture that we used to build BERT in the last class and try to use it to build a model that can do next-word prediction. [clears throat] So this is what we have. Take the phrase "the cat sat on the mat." What you might want to do is say: the input is "the" and the output is "cat." Then maybe you have "the cat," and the output is "sat." Then "the cat sat" and "on," and so on; you get the idea. And finally we have "the cat sat on the" and "mat." So this is basically what we have: all these inputs and outputs. But we're going to express it very compactly, as if it's just one data point in one batch, and that's what we're doing here. We stack it up like this: we have "the cat sat on the" on the left, meaning everything but the last word, and then we take that same sentence and just shift it to the left by one. So for "the cat sat on the mat," we cut off "mat," and that becomes the input; then we cut off the first word, and that becomes the output. When you look at it that way, you can see that you will want "the" to be used to predict "cat," you will want "the cat" to be used to predict "sat," and so on and so forth. Okay, so this is just a little manipulation so that we don't have to have dozens of separate examples just for one starting sentence.
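To make the shifting concrete, here is a minimal sketch (my own illustration, not code from the course) of packing one sentence into an input sequence and a target sequence shifted left by one position:

```python
# Minimal sketch: one sentence becomes a stack of next-word-prediction examples.
sentence = ["the", "cat", "sat", "on", "the", "mat"]

inputs = sentence[:-1]   # "the cat sat on the"  (everything but the last word)
targets = sentence[1:]   # "cat sat on the mat"  (same sentence shifted left by one)

# Position i of `inputs` is used to predict position i of `targets`:
for i, next_word in enumerate(targets, start=1):
    print(sentence[:i], "->", next_word)
```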
4:44
So if you have something like this, what you can do is run it through positional input embeddings, like we have done before with BERT. Then we run it through a whole bunch of transformer blocks, a transformer stack. Then we get these contextual embeddings. Then we run them through maybe one or more ReLUs if you want, because it's always a good idea to stick some ReLUs at the very end. And then we basically attach a softmax to every one of the outputs, and that softmax is going to be a softmax whose range is the entire vocabulary. For now, let's assume that the vocabulary is just a vocabulary of words, not tokens; we'll get into tokens a bit later in the class. And roughly speaking, let's say there are 50,000 words in our vocabulary. So each of these softmaxes, and this is exactly what we did for BERT, by the way, is like a 50,000-way softmax.
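As a rough sketch of that output head (illustrative only; the layer sizes here are assumptions, not the course's actual model), each position's contextual embedding goes through a dense layer whose output size is the vocabulary size, followed by a softmax:

```python
import numpy as np

seq_len, d_model, vocab_size = 5, 64, 50_000    # assumed sizes for illustration

contextual = np.random.randn(seq_len, d_model)  # one contextual embedding per position
W = np.random.randn(d_model, vocab_size) * 0.01
b = np.zeros(vocab_size)

logits = contextual @ W + b                               # shape (seq_len, vocab_size)
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)                # a 50,000-way softmax per position

print(probs.shape, probs.sum(axis=-1))                    # (5, 50000), each row sums to 1
```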
5:43
Okay. But when we look at it this way, since we are fundamentally concerned with next-word prediction, as you will see later on, we are actually going to ignore all of these earlier predictions, because who cares? We are only going to look at the last one to figure out: what is the last prediction? Because the last prediction is going to be based on everything that came before it. So this is really the next word that's actually being predicted; all the positions before it we don't care about so much. Okay. And all this will become slightly clearer because you're going to make a couple of passes through it. Yeah.
6:20
>> How do we
>> So, the notion of a sentence has disappeared at this point. When we look at how we tokenize the input for these kinds of models, we're actually going to take punctuation into account: periods, exclamation marks, and so on and so forth. That will answer your question, and we'll come back to it. Okay, so this is what we have. So, just to be clear: the embedding that's coming out of the final dense layer is passed through its own softmax, with the number of softmax categories equal to the vocabulary size. Okay.
6:58
All right. So first of all, let's say we train a model like this with lots of inputs and outputs. Okay, this just looks like BERT, right? It's not that different, except that there's no notion of a mask. Do you notice any problems with the way this thing has been set up?
>> For some words, like "the," you're going to have a lot of potential output pairs that come out of that.
>> True. Which means that if you have a word like "the," the next word is...
>> Hard to predict.
>> True. So some words may be hard to predict depending on the last word of the sentence that was the input. That's what you're getting at. Yeah. Other concerns? Yeah, go ahead.
7:43
>> Since you're using contextual embeddings, the output of the first word is going to have access to the second word, and so it's kind of like cheating.
>> Bingo. And remember, "bingo" is a technical term in deep learning which means "great." So, as she points out, look at the self-attention layer. Remember, the self-attention layer is the key building block of the transformer block, and in the self-attention layer, for every word, we calculate its contextual embedding by a weighted averaging of its relationships to all the other words in the sentence. So the last word can see the first word, the first word can see the last word, and so on and so forth. But when you're doing next-word prediction, this feels problematic, because you're peeking into the future.
8:40
So let's say that you want to predict the next word. If you look at this architecture, it can simply copy the answer from the input, because it can see the whole sentence. If I tell you "the cat sat on the mat" and then ask, given "the cat sat on the," can you predict the next word for me? You'll be like, yeah, duh, it's "mat." The whole thing becomes challenging only if I say "the cat sat on the ___" and ask you to predict the blank. To put it another way: let's say that you have fed in the first two words and you want to predict the third; that is the right answer for the prediction, and the network should only use the first two. However, because self-attention can see "sat," it can see this next word, it will trivially learn to predict the next word to be "sat." There is no challenge for it. So this is the key problem with just using the transformer as is.
9:41
>> What's our loss function here?
>> The loss function in all these things is actually the same as before. Imagine you have a traditional classification problem with one output, say classifying things into 10 categories like we did with Fashion-MNIST: 10 digits, so you have 10 outputs, that goes through a softmax, you get 10 probabilities, and there we used cross-entropy. Here, for every one of these outputs, we use cross-entropy: we take this output and compute a cross-entropy just for it, plus the cross-entropy for the next one, and so on and so forth. So we still minimize cross-entropy, but the sum of all these cross-entropies.
>> And does it get complicated at all by the fact that we have a large vocabulary size now?
>> It gets complicated just because there are more things to worry about, compute and so on and so forth, but conceptually there's no difference: whether you have 10 or 50,000, it's the same thing. It's just that instead of classifying one input into one of 10 categories, you have as many inputs as there are words in your sentence. Each word that comes into your sentence is being classified in one of 50,000 ways, so essentially you have as many classification problems as you have words in the sentence. But at the end of the day, the loss function is just the sum of all those things, or to be more precise, the average of all those things.
11:02
Actually, I think I may have a slide about this which I may have hidden because I wasn't sure if I would have time. Let's unhide it. And by the way, I did not agree ahead of time that we were going to set this up like this. Okay. So, yes, we still use the cross-entropy loss function. For each word that comes in, the cross-entropy is actually minus the log probability of the right answer; you may recall this from earlier in the class. So we just do the same thing for "cat," "sat," "on," "the," everything, and then we just take the average, one over seven. Boom. That's it.
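A small sketch of that computation (toy numbers, not the course's code): for each position, take minus the log of the probability the softmax assigned to the correct next word, then average:

```python
import numpy as np

# Each row is the softmax output at one position (toy 4-word vocabulary);
# `targets` holds the index of the correct next word at each position.
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.5, 0.2, 0.1],
                  [0.1, 0.1, 0.6, 0.2]])
targets = np.array([0, 1, 2])

per_position_loss = -np.log(probs[np.arange(len(targets)), targets])
loss = per_position_loss.mean()   # the average of the per-word cross-entropies
print(loss)
```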
11:47
So, to go back to this problem: the issue is that we can't allow words to be predicted knowing the future. They should only know about the past words. Okay, so what do we do? We have to make a change to the transformer to make it work for next-word prediction. What we're going to do is this: when we are calculating the contextual embedding for a word (remember, that embedding is a weighted average of the other words' embeddings), we will simply give zero weight to future words. If you give zero weight to future words, it's almost as if they don't exist.
12:26
Okay? And this will become clear in a second. So imagine that this is the thing we are going to calculate: for every word in the sentence, we are calculating the pairwise attention weights, and you will remember I went through this on the iPad last week; we calculate all the weights. All the weights in every row will add up to one, so you take the embeddings of "the cat sat on the...", multiply them by the respective weights that add up to one, which is the first row of this table, and that gives you the contextual embedding for the word "the," and so on and so forth. And since we can't look at the future words, all we do is take this table and zero out everything in red. We just zero everything here out, and then we renormalize so that the remaining non-zero cells still add up to one in each row. What that means is that only the earlier words play a role. Let's give an example: to predict "on," you'll only look at the words "the cat sat." The rest of the sentence will not be considered at all. Now, by the way, this tweak is called causal self-attention; it is also called masked self-attention. Just different labels for the same thing. And so what that means is that, when you're looking at the input, for "the," only "the" is going to be used to predict "cat"; for "the cat," only these two words are going to be used to predict "sat"; and so on and so forth.
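Here is a small sketch of that zero-out-and-renormalize step (my own illustration; real implementations typically get the same effect by adding large negative values to the masked attention scores before the softmax):

```python
import numpy as np

# Toy 4x4 attention-weight table: row i holds word i's weights over all the words,
# and every row already sums to 1 (the output of ordinary self-attention).
attn = np.array([[0.25, 0.25, 0.25, 0.25],
                 [0.10, 0.40, 0.30, 0.20],
                 [0.30, 0.30, 0.20, 0.20],
                 [0.25, 0.25, 0.25, 0.25]])

causal = np.tril(np.ones_like(attn))        # 1 at or before the current position, 0 in the future
attn = attn * causal                        # give zero weight to future words
attn /= attn.sum(axis=-1, keepdims=True)    # renormalize each row so it sums to 1 again

print(attn)   # row 0 only attends to word 0, row 1 to words 0-1, and so on
```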
14:24
Okay. So all we do is go into the transformer and change each attention head to be a causal attention head. The way it's actually done under the hood is very elegant for computational-efficiency purposes, but I won't get into it because it gets a bit involved. The key idea is: replace basic, plain-vanilla attention with causal attention, a.k.a. masked attention. You do that, and boom, suddenly it starts working for next-word prediction; it can't cheat anymore. And when we do that, we get the transformer causal encoder. By the way, the word "causal" here has no connection to causality; it's just a term.
15:19
So if you look at the original transformer paper, it was created for translation, for machine translation: English to German, those kinds of use cases. So it had something called an encoder, which we are very familiar with from last week, and then it had something called a decoder; it's called the encoder-decoder architecture. We are not going to cover the encoder-decoder architecture, because we are not covering machine translation in this class, but I'm mentioning it because this part of the architecture is called a decoder: see, there is a masked-attention business going on here, and because it is using this masked attention, it's called a decoder. So the transformer causal encoder is also sometimes referred to as a transformer decoder, but the word "decoder" has two meanings: it's a synonym for the causal encoder like we have seen today, and it's also used, in sequence-to-sequence translation problems, to refer to the second part of that architecture. It'll become clear from context what we're talking about. In this course, of course, there is no confusion, because we're not going to be looking at translation; we may say decoder or causal encoder, and it's the same thing.
>> I thought there were some transformers that use bidirectional attention. Is that different from this?
>> No. All "bidirectional" means is that I can see everything. So the encoder we looked at last week, the basic self-attention thing, is bidirectional. Basically, all it means is that I can look in both directions to see what other words are there.
>> And in causal, you're not using the ones in the future?
>> Correct.
17:02
All right. So, to summarize where we are: this is what we looked at last week for BERT, and this is a transformer encoder. We take the same thing, and instead of multi-head attention we do causal multi-head attention, and we get the decoder, a.k.a. causal encoder. Okay. And we use the left one for masked prediction, and we use the right one for next-word prediction.
17:29
All right. So now, instead of having an encoder, if you have a causal encoder, a TCE, we can train models for next-word prediction using the exact same approach as before. We set up the inputs and the outputs like I described earlier, and we run them through a stack of causal encoders, dense layers, ReLU, softmax, and so on and so forth. Otherwise the details don't change, but the all-important change goes into the attention layer, making it masked, or causal. Any questions so far?
18:06
>> Uh, yeah. This would only apply when we're training the model, not when we're validating and testing, right?
>> So if you give me a sentence after training, the final prediction is the only thing you care about, and by definition the final prediction will use everything that came before it. So we are okay. Was that your question?
>> No, I think the fact that we're zeroing out the weights on the future words, I thought that would apply more when we're training the model and trying to minimize the loss, as opposed to when we're generating the next word.
>> Right, but the point is: when we actually use them, what is the objective? What do we want to do when we use them for inference, once we finish training? Our objective is: given a particular string, get me the next word. And to find the next word, you can in fact use everything that came before it. Therefore, without any change to this model, it'll just work for your intended purpose. You don't have to go in there and unmask it for inference, because you don't need to.
19:13
>> Yes?
>> I have one question regarding the causal transformers: we are putting certain weights to zero for the words which are to be predicted, and then we...
>> No, for the words that are in the future.
>> The future, yeah. And then we normalize it.
>> Correct.
>> And we trained a transformer earlier on all the words packed together. So won't there be a difference in weights between the two?
>> Between the two ways of training? The weights are going to be very different, and they are two different models. BERT is used for certain things, and this kind of model, which is the basis of GPT, is going to be used for other things.
>> But we are training it as well like that, I mean, while putting some of the weights to zero.
>> Correct, correct. So what we're trying to do here is to say: let's say that we want to do next-word prediction as the self-supervised learning task, and we want to train such a model on a vast amount of text data. Well, we can't just use what we did last week, because it's not going to work, because of the fact that it can see the future. Therefore we make a tweak, and then we build this model. Now the question becomes: what can you do with such a model? We have basically trained two different kinds of models: the one that can see everything, BERT, and the one that can't see the future, which is actually GPT. So what can you do with it? We're going to come to that.
20:32
Okay. All right. So now, once you train such a model, given any input sentence, let's say the sentence is "it was a dark and," it goes through all these layers. And remember what I said earlier: the fact that it's predicting something after just seeing it, we don't really care about. All we're really curious about is: what is the next thing it's going to say? And the next thing it's going to say is basically what's coming out of this last softmax. Does that make sense? We don't care about anything that came before it, because we already have a half-formed sentence and we want to find just the next thing. So we only care about this one. The other outputs will come out of the architecture of the model, but we throw them out; we don't even pay any attention to them. Okay, we only look at what's coming out in this one here. And what comes out of the softmax, remember, is a 50,000-way table of probabilities. That's what a softmax is: a whole bunch of probabilities that add up to one. So let's say, for example, that you have entries starting with "aardvark" all the way to "zebra," and these are the probabilities.
21:48
So, "it was a dark and": just for kicks I put "stormy" as the highest-probability entry, at 0.6, but these numbers will add up to one. We have this table. Okay. And then what we do is we choose a token from this table. We get to choose: there is a whole bunch of numbers in this table, and we get to choose a token. The simplest thing one can think of is to just choose the word that is the most likely, and we choose the word that's most likely here. We're going to have a whole section on how to choose these things coming up. Okay, for now let's go with the simple option: we just choose the one that's most likely, "stormy" at 0.6. And then we attach it to the input. So now the input has become "it was a dark and stormy." We run it through, and again we only care about the last softmax. Okay, we do that, we get another table, and the table keeps changing, because the softmax is different each time you run it through, because the input has changed. So you get a new table, and it turns out the most likely word is "night." So "night" comes out the other end, we attach "night" here, and we keep on going. We can keep on going maybe until we tell the model, okay, generate up to 100 tokens and stop. It might stop after 100, or the model may in fact decide that when it sees punctuation, like a period or an exclamation mark or something, it's going to stop. Okay, and we have control over when it stops and how it stops. But this is sort of the basic process, and you folks are all very used to it, because you've all been playing with ChatGPT and the like, right? The basic building block is: next-word prediction, feed it back into the input, next-word prediction, keep on doing it. You keep on doing it, and suddenly it's writing entire novels for you.
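The loop itself is short. Here is a hedged sketch of greedy autoregressive generation (the `model` function, the token strings, and the stop conditions are placeholders I'm assuming for illustration, not the course's actual interface):

```python
import numpy as np

def model(tokens):
    """Placeholder for a trained causal LM: returns the probability table
    over the vocabulary for the next token only."""
    vocab_size = 50_000
    logits = np.random.randn(vocab_size)      # stand-in for the real forward pass
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

tokens = ["it", "was", "a", "dark", "and"]
for _ in range(100):                          # e.g. "generate up to 100 tokens"
    probs = model(tokens)                     # only the last position's softmax matters
    next_id = int(np.argmax(probs))           # greedy: take the most likely token
    next_token = f"<token {next_id}>"         # a real system maps the id back to a word
    tokens.append(next_token)                 # feed the prediction back into the input
    if next_token in {".", "!", "?"}:         # or stop when punctuation / an end token appears
        break
```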
23:41
Yeah?
>> Does that mean that the longer the initial input is, the better the prediction you get?
>> It depends on your objective. Fundamentally, you have some task you want the thing to do for you, and you need to give it all the information it can plausibly find useful. So the more helpful the input, the better; maybe that's how I would say it.
24:07
Yeah?
>> Would this also apply to something like Google search? Do they also do next-letter prediction too, or would this just be a deeper...?
>> Yeah. So Google autocomplete, for example: I don't know if they actually use this kind of model under the hood or not. I just don't know; these things tend to be kept tightly under wraps. You may have seen that over the last few months there is a generative AI panel that opens up when you do a Google search. That panel, I suspect, uses this. But I don't know if the default Google autocomplete actually uses it or not, because it's very compute-heavy, so I don't know what they do. So yeah, this is what you do. Other questions on the mechanics of this?
25:00
Yeah?
>> For our vocabulary list, I'm assuming it's static.
>> Yeah, correct. And as you will see, it's not really a word vocabulary, it's a token vocabulary, but yes, it is static for a given model.
>> And I'm assuming that for Google or any other sort of search engine that wouldn't necessarily be static, because the model would be different. I'm sort of thinking about what happens to new words and things that are formed: how does it handle them if the vocabulary is static?
>> There's a very elegant solution to that coming up. Okay.
25:45
All right. So now, in other words, we have learned how to do sequence generation. We already saw that we can do classification with BERT, and we can do labeling with BERT-like models, which are trained on masked prediction. And for generating sequences, now we know how to do it: we just need to use a transformer causal encoder. Okay.
26:08
Now, these kinds of models, sequence-generation models trained on text sequences using next-word prediction, are called autoregressive language models, or causal language models. And of course the GPT family is perhaps the most well-known example of an autoregressive, causal language model. "Autoregressive" because people who have done econometrics and some regression know the notion of autoregression: it means that you predict something, and then you use the past predictions as inputs the next time you predict. So this is the notion of autoregression: you predict, you feed the prediction back, get the next prediction, and keep on cycling through. Yes?
26:51
>> So when you're putting an input into GPT, for example, and it shows you the next words as they're coming: is that an indication of it doing this recalculation that you described here?
>> Correct. That's exactly what's going on. In fact, if you use the API, there is a thing called the streaming API, where it will actually stream each token that's coming out through every pass, and you can see everything very clearly. But when you work with the web interface and you see the thing almost as if it's typing like a human: what I've heard from people, and I don't know if this is true, is that they can actually do it much faster; they slow it down intentionally to give you the feeling that it's actually coming from a human. So it's like a UX trick to slow it down, to make it feel as if someone is actually typing something on the other end. When you're interacting with a chatbot, for example, and you sometimes see it typing slowly, you can see the bubble and the typing, it's actually intentionally slowed down, because you'd know it's a bot otherwise. So there's a little bit of UX creepiness maybe going on. I don't know to what extent this is 100% true and how pervasive it is, but folks who work in the field have told me that this is actually not uncommon.
28:10
Okay, so that's what's going on here. These are language models, and of course GPT-3 is an autoregressive language model. The reason why we have an "L" in front of the "LM" is that it was trained on lots of data with lots of parameters. At some point it's not a small language model anymore, it's a large language model, so it's an LLM; nothing more momentous than that. So, as it turns out, GPT-3 uses 96 transformer blocks, and each block has 96 causal attention heads. You can read the GPT-3 paper; it gives you all the details of the architecture. That is interesting, because for GPT-4 they didn't publish the architecture; after GPT-3 everything became closed, so we actually don't know what the architecture is, even though there's a lot of speculation on Twitter. But for GPT-3 we know exactly what happened: 96 blocks, each with 96 causal attention heads. And the data: they scraped 30 billion sentences from a whole bunch of sources, web text, Wikipedia, a bunch of book databases, and then they basically took those 30 billion sentences and trained it on exactly next-word prediction. That's it.
29:27
Now, when they trained GPT-3, I think it cost them a lot of money, because we hadn't yet figured out how to do things as efficiently as we know now. But it was still pretty amazing, and I'll talk about what is so special about GPT-3 in just a minute or two. So this is what we have here. And as you folks have seen, the notion of generating text is very powerful: we can obviously generate text, but we can also generate code, because code is just text; we can generate documentation for code; we can summarize text; we can answer questions; we can do chat; the list goes on. All the excitement we see around generative AI from the time ChatGPT came out is precisely because the simple idea of text in, text out is so flexible and so versatile. It can handle all sorts of use cases. That's why there's so much excitement.
30:17
Um, by the way, if you're really curious, I would actually recommend watching the video where Andrej Karpathy builds GPT from scratch. It's a fantastic video. If you have even a little bit of curiosity about how these things are actually built, I would strongly recommend checking it out. And there's also a little blog post where the author shows that, if you know NumPy, you can actually create GPT using NumPy, without using any frameworks and things like that. I found it super interesting and helpful for understanding what exactly is going on, so do check it out if you would like. Okay.
30:57
So now we're going to talk about decoding and sampling strategies. As I said, when we come up with the softmax for that last token, we have 50,000 choices. What do we pick? As it turns out, to actually get really good performance out of generative AI systems like ChatGPT, you need to be quite thoughtful about how to decode, that is, how to actually sample from that table. So we'll talk about that for a bit. First of all, a definition: the process of choosing a token from the probability distribution coming out of the softmax (I'm sticking this table right here; this is the softmax) is called decoding. That's the technical term for it. We get this table and we have to decode, meaning we have to pick something from this table. Okay, that's called decoding.
31:48
Now, there are two sort of extreme, very simple ways to do it. The first, of course, is to just pick the word with the highest probability. This is called greedy decoding. Okay. So in this case, for example, if "stormy" at 0.6 is the highest probability in this whole table, we just pick "stormy." That is the obvious, extreme, simple case. The other thing we can do, which is also super simple, is that because we have a probability table here, we can just reach into the table and sample a word out of it, in proportion to its probability. Which means that if you have this table and you sample from it 100 times, about 60 times you'll probably get "stormy," because its probability is 0.6, but some small fraction of the time you may get strange things like "aardvark" and "zebra" and so on and so forth. You're just literally doing random sampling. That's a fine way to do it too; there's nothing wrong with that. So these are both options.
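A two-line sketch of the difference, using a toy probability table (the words and values are assumed for illustration):

```python
import numpy as np

words = np.array(["stormy", "night", "foggy", "aardvark", "zebra"])
probs = np.array([0.6, 0.2, 0.1, 0.05, 0.05])    # toy softmax output, sums to 1

greedy_pick = words[np.argmax(probs)]            # greedy decoding: always "stormy"
random_pick = np.random.choice(words, p=probs)   # random sampling: "stormy" about 60% of the time

print(greedy_pick, random_pick)
```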
32:53
So the key thing you need to remember is that which one you pick (and there are some variations on it, which we'll get to in a moment) really depends on what your task is, what you're trying to use the system, the LLM, for. The broad thing to remember is this: if you're working on questions for which the factual accuracy of the response is really important, and/or you want the output to be deterministic, meaning every time you ask a particular question you really want the same answer back (you can imagine a customer support agent where two different customers ask the same question and get different answers; you don't want that, so you want deterministic outputs), then in those situations greedy decoding is a good starting point. You won't get any random stuff, because for any given input sentence the softmax table that comes out is not going to change; it's the same table, and if you're always picking the highest number in the table, that's not going to change either. So: guaranteed determinism.
34:03
And I've found that for reasoning questions, math questions, logic questions, you should really keep it as greedy as possible, in my experience. Okay. Now, there are other situations where random sampling is actually a better option. If you're doing creative things (write a poem, write a haiku, write a screenplay, things like that), you do want a lot of creativity, in which case randomness is your friend. You get a lot of different varieties of responses, diversity of responses; all of that is really good. The price you pay for it is that you lose determinism. The outputs are going to be stochastic; they're going to be random; for the same question, the answer is going to vary again and again. But in many cases, maybe that's okay; you don't care. Okay, so that's roughly how you think about it. The other thing I want to say is that diversity of response is also important, because if you imagine a chatbot that always responds in the same stilted, robotic fashion, it starts to get annoying. You want some variation in the output, because a human will never give you the same thing back. Though I must say that when I interact with call center agents, I think they're just cutting and pasting from a text library, so it does look kind of robotic; maybe we are already kind of used to this. But anyway, those are some of the things to keep in mind. Yeah?
35:24
>> If you're using random sampling, do you end up with a better estimation of the uncertainty? Are the probabilities more calibrated, in the sense that the table you end up with at the end is the real probability that you observe from the words in your corpus?
>> The table doesn't change regardless of how you sample from it. The table is the starting point for sampling. All of decoding is about which token from the table you're going to pull out.
>> Oh, so it doesn't impact the loss function.
>> No. All those things are fixed. You literally get the table, and then you can literally forget how you got the table, and now decoding starts.
36:06
>> Is there a reason why it would generate a different answer given the same prompt if we run it again and again? Because they are using random sampling?
>> Correct. That's exactly why. And we'll see; I'll do a demo of it very shortly, because you can actually manipulate it.
>> If you do the prediction word by word, is there a way to make it resilient to mistakes? Like if you say "the night was dark and aardvark," that can mess up the next word, right?
>> It can totally mess it up.
>> So how does it... can it get itself back on track?
>> It cannot. And so, great question. We'll look at an example of things going off the rails in just a second. Yep.
36:46
>> Is this how Bing works, where you can slide between being more creative and more accurate?
>> Yeah, exactly. So Bing has creative, balanced, and precise modes, right? Under the hood, they're basically manipulating some of the parameters we're going to look at in just a moment; they're just manipulating them for you. But if you use the API, you can manipulate them directly.
37:09
Okay. All right. So here's sort of the basic thing to remember about random sampling. Our hope is that, for any given sentence, there is probably some set of good answers for the next word and a whole bunch of bad answers, intuitively. So we want the probability mass on the good stuff. You can imagine a distribution that goes like that: there is the head of the distribution, the first few words when you sort them from high to low probability, and then there is the long tail of irrelevant words. So our hope is that the model is so good that, for any given input phrase, it basically concentrates the output probability in the softmax on just a few good words and more or less zeros out everything else. That is the ideal scenario, because in that scenario, if you do random sampling, by definition you'll pick something from the high-quality head of the distribution, and life is good. Okay.
38:13
Now, we want random sampling to sample from the head and not from the tail; that's the key point. And what do I mean by head and tail? Let's be very clear. Imagine you take the softmax table we looked at, which went from "aardvark" to "zebra," and let's say we sort the table from high to low probability. So maybe what's going to happen is that "stormy" is going to have a probability of, I don't know, 0.6, and if I remember right, "night" had a probability of 0.3, and then there was a whole bunch of other words, all the way to the 50,000th word, from highest to lowest probability. So you can think of this as a probability distribution. And basically what we are saying here is that these first few entries are the head of the distribution, while this long tail is the tail of the distribution, and we want our system to grab something from the head and not from the tail, because the head is the stuff that's actually the relevant, useful, good stuff. Okay, that's really what we're trying to do here. Does it make sense? Okay.
39:32
So, to come back to this: here is the most important point to remember about this slide. While the probability of choosing any individual word in this long tail is pretty small, for any one word it's pretty small, the probability of choosing some word from the tail is high. So, to go back to this example: 0.6 plus 0.3 means there is a 0.9 probability it's going to be either "stormy" or "night," but there is a 10% probability it's going to be one of these tail words, and who knows what that word might be; it might be some random nonsense word. So what that means, and this goes back to the point from before, is that if the LLM happens to sample a token from the tail, which is not good, it won't be able to recover from its mistake; it'll just go off the rails. Which is why every word that gets generated is really important to get right, because it can't recover very often.
40:37
>> Is there a technical way to define the difference between the head and the tail?
>> No. It's sort of a common term people use, and the reason there isn't one is that it's so problem-dependent. Basically, for any particular problem, depending on the question, the right number of words is probably 20; for a different question maybe it's 40; for a totally different model and the same question maybe it's 10. Because of that variability, we just can't pin it down.
41:09
Okay. So, all right. And I'll show you how to do this in just a moment. Just for kicks, I went into GPT-3.5 and typed "Students at the MIT Sloan School of Management are" and asked it to predict the next word. Okay, so it turns out "invited" is the most likely next word, followed by "given," "expected," "required," and "able." These are the top five words. And the probabilities are 3%, 2%: pretty small probabilities, but the words below them, the remaining 50,000-odd words, are even lower. Okay. So here the most likely word is "invited." So what I did is I went in there and said, okay, let me try again, now with "Students at the MIT Sloan School of Management are invited," and autocomplete that; find me the next thing. So it comes back with (this is now my new prompt) "Students at the MIT Sloan School of Management are invited to submit their original white papers to the annual MIT" something. It seems reasonable; it doesn't seem bad, right? Okay. Now, let's mess it up a bit. So now I go in there, and I noticed that the word "masters" and the word "spending" were much lower probability than those top five words; I just mucked around till I found them. "Masters" is only 0.05%, and "spending" is 0.1%. So these are clearly in the tail; they're not the most likely. So I said, what's going to happen if I actually force it to use "masters," and then force it to use "spending"? Okay, this is what you get: "Students at the MIT Sloan School of Management are masters of chaos. They routinely blow past deadlines, fracture..." and then I couldn't take it anymore; I stopped it. All that from changing a single word. And then I said "Students at the MIT Sloan School of Management are spending," which is the other unlikely word: "the semester learning life skills" (so far it looks promising) "through knitting socks." I'm not making this stuff up; this is GPT-3.5. So yes, it will go off the rails; you have to be super careful. And so, the way we sort of tame random sampling to make it work for us...
43:32
like the past like the master of chaos
43:35
blow past deadline like is something
43:38
that it was in the training sense?
43:40
>> Yeah. I mean that is the thing is it's
43:42
basically doing rough it's doing some
43:45
very rough and approximate pattern
43:47
matching from all the training data it
43:48
was trained on. So it doesn't mean for
43:51
example that on on the mit.edu edu
43:53
website right on the collection of sites
43:56
that actually there were text saying
43:59
that yeah MIT Sloan students were doing
44:00
all this crazy stuff it's probably more
44:02
like a whole bunch of you know u college
44:06
university websites probably had some
44:08
content like that maybe there was a
44:09
bunch of Reddit people posting stuff
44:10
like that so you're just doing some
44:12
rough pattern matching it's basically
44:14
looking the thing is you have to
44:15
remember always with large language
44:16
models what it's trying to give you it's
44:19
giving you a response that is not
44:22
implausible
44:23
There is no guarantee of correctness.
44:25
There's no accuracy. Nothing like that.
44:27
It's giving you a probabilistically
44:29
plausible response. That's it. Okay.
44:32
Now, usies being Sloan, uh we look at
44:35
stuff like this and we get offended. So,
44:36
we are we are imputing our values onto
44:39
its generation, but it doesn't know and
44:40
it doesn't care.
44:43
So, in fact, when I typed in something like "list all the awards that Professor Ramakrishnan has won," it gave me an amazing list of awards: apparently I won this and I won that. None of it is true, to which a student said, "Not yet." So I made a note of that fine person's name. [laughter] So yeah, that's what's going on. Yeah?
45:11
>> I get the sense like maybe there's...
>> Could you use the microphone, please?
>> I get the sense that maybe there's some sort of sliding window that's somehow weighting later words more strongly than earlier words, given how far out they are, because I feel like the context of "students at MIT" should have steered it in a certain direction even with the presence of the word "masters." So is there something like that happening?
>> No. The thing is, think about the training process. In the training process, we gave it sentence fragments and we asked it to predict the next word. Now, clearly, the more you know about the input that's coming in, and the longer the input, the more clues you have to figure out what the right next prediction is going to be. If I say "the capital of," you'll be like, I don't know, it's got to be a country, I guess, or a state, but I don't know anything more than that. But if I say "the capital of France is," there is a dramatic narrowing of the cone of uncertainty. So that's basically what's going on. And in fact there's a very beautiful expression I've heard for what the LLMs do; they call it subtractive sculpting. What I mean by that is: it's like you start with this big block of marble, and then every word chips away at the marble, and when you're done, it's pretty clear there's a David inside the marble. That's sort of what's going on.
46:34
All right. So, to come back to this: what can we do? There are three ways in which you can tune random sampling to make it work for you. The idea behind all of them is that you have some probability distribution, and we are now going to manually focus on the head, kill everything else, and sample only from that head. Okay, which immediately begs the question: how will you decide what the head is? And that was sort of Alina's question from before: how will you decide what the head is? So one way we do that is to say: you know what, I know we have 50,000 words in the vocabulary; I don't care. Each time, I'm only going to pick the top K words. K could be 10, 20, 30, 40, 50; it's very problem-dependent. I'm going to pick the top 20 words, ignore everything else, and only sample from the top 10 or the top 20. That's called top-K sampling. And the way it works is: let's say this is your whole distribution, and I just stopped at "wet" instead of going all the way to 50,000. And you decide, let's say, that you want K to be two. So you just grab the top two words, K equals 2, and then you renormalize the probabilities so they add up to one. So 0.6 and 0.2, renormalized, become 0.75 and 0.25. And now just imagine that this is the new softmax table you're sampling from; you grab a word from it and you're done. Okay, that's called top-K sampling, very commonly used.
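A sketch of top-K sampling over a toy table (values assumed for illustration; K = 2 as in the example on the slide):

```python
import numpy as np

words = np.array(["stormy", "night", "foggy", "aardvark", "zebra"])
probs = np.array([0.6, 0.2, 0.1, 0.05, 0.05])

k = 2
top_idx = np.argsort(probs)[-k:]          # indices of the K most likely words
top_probs = probs[top_idx]
top_probs = top_probs / top_probs.sum()   # renormalize: 0.6 and 0.2 become 0.75 and 0.25

pick = np.random.choice(words[top_idx], p=top_probs)
print(pick)
```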
48:00
But it has a small shortcoming, which is that it basically assumes that this K you have come up with, let's say 20, is the right number of words in the head for every input sentence, which is obviously not a well-supported assumption; it's just an assumption. So then the question becomes: can we do better? Because what you really want is for the words that you pick to have the bulk of the probability, as much probability as possible. You don't really care how many words are in the set, as long as together they carry a lot of probability. Which brings us to something called top-p sampling, also called nucleus sampling, where instead of deciding on the number of words we're going to pick every time, we decide: we're just going to choose all the words such that the total probability of the words we have chosen is at least p. Sometimes it may be just two words; sometimes it may be 20 words; we don't care. And then we sample from that set. Okay. So here, same thing: let's say you go with p = 0.9. So 0.6 plus 0.2 is 0.8, plus 0.1 is 0.9; boom, we have hit 0.9. We stop, we grab these three words, we renormalize them to get this table, and then, boom, we sample from it. So this is actually even more effective, in my opinion, because it fluctuates; it doesn't hardcode the number of words you think is important.
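A matching sketch of top-p (nucleus) sampling (toy values again): sort the table, keep adding words until their cumulative probability first reaches p, renormalize, and sample:

```python
import numpy as np

words = np.array(["stormy", "night", "foggy", "aardvark", "zebra"])
probs = np.array([0.6, 0.2, 0.1, 0.05, 0.05])

p = 0.9
order = np.argsort(probs)[::-1]                    # highest probability first
cumulative = np.cumsum(probs[order])
cutoff = int(np.searchsorted(cumulative, p)) + 1   # smallest set whose total reaches p

head = order[:cutoff]                              # here: stormy, night, foggy (0.6 + 0.2 + 0.1)
head_probs = probs[head] / probs[head].sum()
pick = np.random.choice(words[head], p=head_probs)
print(words[head], pick)
```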
49:23
Was there a question? Yeah.
>> What if, let's say, 0.9 ended up... like if "foggy" was 0.12, will it only take 0.1 from "foggy"?
>> Yeah. What it does is: you give it a 0.9, and it's going to keep adding words till it just crosses that number.
49:43
>> Yeah. I was thinking, can't you just set
49:46
a threshold for the word slap? Don't
49:50
pick a word below probability. This top
49:53
B, what if was like 0.89
49:57
and then the other one is just 0.1. So
49:59
you pick two words.
50:00
>> Yeah, you can do that. Um and in fact in
50:03
what you can do is you can always say I
50:04
want to pick a word which is the most
50:06
likely word, right? You can do that. But
50:08
if you say I want a word um I want only
50:12
consider words whose probabilities are
50:13
at least something then basically what
50:15
you're saying is that I'm just going to
50:16
keep on doing and then we draw a line
50:18
here right but the problem is you don't
50:21
know how many words have crept over your
50:23
threshold
50:25
right you might for example find that to
50:27
to go to your example maybe you said 0.9
50:29
as a threshold may maybe there are a
50:31
whole bunch of there was a word at 089
50:33
that you just missed because you didn't
50:34
make the threshold you'll be like oh no
50:36
I should have made it 089 so there's No
50:38
right answer unfortunately. But these
50:40
are exactly the this is exactly the kind
50:41
of thinking that brought us these kinds
50:43
of ways to tune these things
50:46
all sort of you know the foundation here
50:48
is the realization that we cannot
50:51
sort of a priori decide what the
50:53
right number of words is. So we have to
50:54
find heuristics to try to do these
50:56
things. So in practice people try all
50:58
these methods. In fact you can do both.
51:00
You can do you can set up so that you
51:02
can do top p and top k at the same time.
51:04
Basically you're saying grab words uh
51:07
till you cross the probability uh or you
51:10
cross k whichever is earlier.
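(The combination just described can be sketched the same way: add words, most probable first, and stop as soon as either the probability cap or the count cap is hit. Purely illustrative.)

import numpy as np

def top_k_top_p_sample(probs, k=20, p=0.9, rng=np.random.default_rng()):
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]
    head, mass = [], 0.0
    for idx in order:
        head.append(idx)
        mass += probs[idx]
        if mass >= p or len(head) >= k:       # whichever is earlier
            break
    head = np.array(head)
    return rng.choice(head, p=probs[head] / probs[head].sum())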
51:15
Okay. So those are two methods people
51:17
use heavily.
51:19
The third method is called distribution.
51:21
I'm sorry temperature. And the idea of
51:23
temperature is that in top K and top P,
51:26
it sort of we have to decide on a number
51:28
up front K or P and then we just draw
51:31
the line and look at the words that pass
51:33
the threshold. Temperature is like a
51:35
softer way to do the same thing. It it's
51:37
a softer way to emphasize the head more
51:39
than the tail. So um I think iPad. All
51:44
right.
51:52
So the idea of temperature is remember
51:55
uh when we have this, um, oops, softmax.
52:01
So you know, aardvark,
52:04
all the way to zebra
52:06
you have all these probabilities right
52:09
now remember where did we get these
52:10
probabilities, these probabilities came from
52:12
a softmax. So what is a softmax? We
52:15
basically had you know all these nodes
52:18
say 50,000 nodes in some output layer
52:22
and these were just numbers let's just
52:23
call them a1 through a50,000
52:27
and then we ran it through a softmax
52:29
function and what did it do it basically
52:31
did e^(a1), e^(a2), all the way to
52:36
e^(an), let's call it n of them, and then it
52:39
divided each by the sum of all these
52:40
things to get the probabilities. So this
52:42
number became e^(a1) divided by the
52:47
sum of all the e^(ai),
52:52
okay, so e^(a1) divided by e^(a1)
52:54
plus e^(a2) and so on and so forth. So
52:55
this is how softmax works. I'm just
52:57
refreshing your memory from a few weeks
52:59
ago. Okay. Now what temperature does is
53:03
that let me just write it a little
53:06
easier.
53:08
So e^(a1) plus e^(a2) and so on, all the
53:13
way
53:15
and
53:18
what it does is it introduces a new
53:20
parameter here called temperature which
53:22
is that we divide each a here by t, so each term becomes e^(a/t).
53:41
And the effect of adding this little
53:43
knob called temperature here, right, is
53:45
very interesting. So let's assume for a
53:48
second that t is a very very small
53:50
number.
53:52
Assume that t is pretty close to zero,
53:53
very small number. So if t is close to
53:57
zero,
54:00
what's going to happen is that since
54:03
it's in the denominator here, all these
54:05
numbers,
54:06
all these numbers are going to become
54:08
really big because t is really small.
54:10
Right? If if a1 happens to be a positive
54:13
number, it's going to become really big.
54:14
If a1 is a negative number, it's going
54:15
to be a really really small negative
54:16
number. Okay? Now in particular, what's
54:19
going to happen is the biggest of all
54:20
the a numbers, it was already big. Now
54:23
it's going to get massive
54:26
which means that its probability is
54:28
going to dominate everything else
54:30
because you're taking a really big
54:31
number and doing e raised to that number.
54:35
So what's going to happen is that wait
54:37
what what did this
54:40
okay so if t is close to zero
54:47
the biggest a
54:56
Uh, hold on.
54:59
The word corresponding to the biggest A
55:06
will have a probability of one or close
55:09
to one.
55:12
And since all the probabilities have to
55:14
add up to one, which means that
55:15
everything else is going to be zero. So
55:17
the biggest A will have a probability of
55:18
one. Everything else is going to have
55:20
zero. So reducing temperature close to
55:22
zero means that the probability
55:24
distribution is going to peak at the
55:25
biggest word and everything else is going to
55:27
become zero. So in practice what that
55:29
means is that if you look at something
55:30
like this if you apply um
55:34
temperature here
55:37
what's going to happen is that the stormiest
55:40
thing is going to get something like 0.999
55:43
and everything else right it's going to
55:46
get wiped out
55:49
right it's going to get really small
55:51
it's going to get even smaller and so on
55:52
and so forth and so when t is exactly
55:55
zero basically what that means is that
55:57
this is going to be exactly, uh, one
55:59
and everything else is just going to get
56:00
zero. So when one of them is one and
56:02
everything else is zero when you do
56:03
sampling from it you're just picking the
56:05
the big number right which means it sort
56:07
it becomes greedy decoding.
56:10
So that is the value of having
56:12
temperature as a knob. Conversely, if
56:14
you take temperature T and make it
56:16
bigger and bigger, right, as opposed to
56:19
smaller and smaller, this distribution
56:22
is going to become flat. Meaning all the
56:24
words are going to have the same
56:25
probability.
56:27
So a any one of these words becomes
56:29
equally likely. So t close to zero, the
56:32
biggest word gets picked. T
56:34
exceeding one, say going to 1.5 or 2,
56:38
any word becomes likely. It becomes
56:40
truly random. So that is the effect of
56:42
temperature.
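(A small sketch of the temperature knob: divide every output a_i by t before the softmax. The logits below are invented just to show the effect; a tiny t pushes all the probability onto the biggest word, a large t flattens the distribution.)

import numpy as np

def softmax_with_temperature(logits, t=1.0):
    # softmax(a_i / t); subtracting the max first keeps the exponentials stable.
    scaled = np.asarray(logits, dtype=float) / t
    scaled -= scaled.max()
    exps = np.exp(scaled)
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])          # pretend output-layer values
print(softmax_with_temperature(logits, t=1.0))    # the usual softmax
print(softmax_with_temperature(logits, t=0.1))    # biggest word gets ~1, rest ~0 (greedy-like)
print(softmax_with_temperature(logits, t=5.0))    # nearly uniform: anything becomes possible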
56:44
And this knob, you can actually tune it.
56:47
Um,
56:50
all right. So, uh, this is called, uh,
56:53
I'm at
56:56
platform.openai.com.
56:57
It's called the OpenAI playground. And
56:59
in this playground, you can actually put
57:01
in all the sentences you want. You can
57:02
choose the model and then you can it'll
57:04
actually tell you what the softmax
57:05
output is. Okay, it's very handy. So
57:09
this is where I said oh so here are a
57:12
few things I want to draw your attention
57:13
to. The first one is you see temperature
57:15
here the default is one. If you make it
57:18
zero it becomes greedy decoding but you
57:20
can make it more than one if you want.
57:22
It'll give you all kinds of crazy stuff
57:24
as you will see in a second. Okay. Um
57:27
and then they don't have top K. They
57:30
don't have support for top K, OpenAI, but
57:32
they do have support for top P. You can
57:35
put P here in this thing. And I'll
57:37
ignore these things. You can read the
57:38
documentation uh to understand those
57:40
things. But you can actually ask it to
57:42
show the probabilities. So I'm going to
57:44
ask it to show all the probabilities.
57:46
I'm also going to tell it um don't go
57:48
nuts. Just give me like a few outputs.
57:50
Let's just call it 30. Okay. And now I'm
57:53
going to enter some sentences for us to
57:55
see what's going on. So let's enter the
57:57
same sentence as before. students
57:59
at the MIT
58:03
Sloan
58:05
School of Management
58:08
or I think that's what we had right so
58:10
submit
58:14
so okay this is what it's filling out
58:16
now you go click on this word you get
58:18
all the probabilities
58:20
pretty cool right so you can see invited
58:23
given expected these are all some of the
58:25
things we had u and so what you can do
58:27
is you can go in and say here clearly uh
58:32
aching. What is that?
58:36
That's very weird. So I'm going to again
58:40
I'm just going to check to make sure
58:41
that I use the same sentence as before.
58:43
It's very brittle. Students, MIT Sloan School of
58:46
Management, are. Okay. Uh, are
58:50
oh I know what it is.
58:54
Okay.
58:57
Okay. So, let's try that again.
59:03
Okay. So, invited 3.18. That's what we
59:05
had, right? Invited 3.19. 3.8. Okay.
59:08
Close enough. So, this is what we have.
59:10
And now, if you wanted to force it to
59:12
choose invited here, you just go in
59:15
there and make the temperature zero.
59:18
Temperature zero means it's always going
59:20
to pick the best one. Greedy decoding.
59:21
So, you can hit it again.
59:25
And it better give you invited. See it
59:27
has given you invited.
59:29
So that's how you manipulate it using
59:31
temperature. Um you can also ask it you
59:34
can also manipulate top P. You can do
59:35
all these things, right? But it's a
59:38
tool people actually use very
59:40
heavily for debugging right and for when
59:41
they're playing with a bunch of data
59:42
with a model for that particular use
59:44
case. You just play with it to get a
59:45
sense for what kinds of probability
59:46
distributions you see and then you can
59:48
fine-tune it using that using that
59:50
knowledge. Um so yeah check this out.
59:54
Oh, uh, I I said that if the temperature
59:58
goes above one to a higher number, every
1:00:01
word in the 50,000 becomes sort of
1:00:03
equally likely, which means it's going
1:00:04
to produce garbage, right? So, let's
1:00:06
actually see garbage production in
1:00:07
action.
1:00:09
So, all right, let's just nuke this.
1:00:11
Okay, and I'm going to take the
1:00:13
temperature and max it. I'm going to
1:00:15
call it two. Okay, which means that
1:00:19
literally anything is possible.
1:00:22
Submit.
1:00:25
Ladies and gentlemen, I present to you a
1:00:28
modern large language model.
1:00:35
Isn't it like shocking
1:00:38
>> because when we work with these language
1:00:39
models we have, we always when we see it
1:00:41
doing some smart things, we always
1:00:43
ascribe some level of, you know,
1:00:45
interesting abilities and intelligence
1:00:46
and so on and then you realize all I had
1:00:48
to go in go in there and change one
1:00:50
parameter and it's garbage.
1:00:52
So you can see the amount of garbage
1:00:54
right it's showing just by twiddling one
1:00:56
parameter. So in
1:00:58
production use cases, when you're
1:01:00
building applications on top of these
1:01:01
large language models you got to be very
1:01:02
very careful with these parameters. So
1:01:05
pay attention. All right. So um what did
1:01:09
I have next?
1:01:13
Okay. So that brings us to the uh sort
1:01:17
of the end of the decoding section.
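(Outside the playground, the same knobs are exposed through the API. A rough sketch using the openai Python client; the model name, prompt, and values here are placeholders, and the exact parameter names should be checked against the current API documentation. The playground's per-token probability view corresponds to a logprobs option in the API.)

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
resp = client.chat.completions.create(
    model="gpt-4o-mini",                               # placeholder model name
    messages=[{"role": "user",
               "content": "Students at the MIT Sloan School of Management are"}],
    temperature=0,    # 0 = greedy decoding; >1 gets increasingly random
    top_p=1.0,        # nucleus sampling cap
    max_tokens=30,    # "don't go nuts, just give me a few outputs"
)
print(resp.choices[0].message.content)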
1:01:22
Oh, see now I'm going to switch gears
1:01:24
and talk about tokenization, right?
1:01:27
which is that um when so far in all the
1:01:30
the the things we have done including
1:01:32
the homeworks and so on we looked at
1:01:34
this tokenization the standard process
1:01:36
right for taking a bunch of text and
1:01:38
vectorizing it, which was the STIE
1:03:41
process: standardize, tokenize, um, index,
1:03:44
right, and then encode. And the
1:01:46
standardization I had mentioned earlier
1:01:48
uh strips out punctuation lower cases
1:01:50
everything uh sometimes removes stop
1:01:53
words like a and the things like that it
1:01:55
also does these things called stemming
1:01:57
But turns out if you actually work with
1:01:59
uh something like GPT, you know that
1:02:02
it hasn't stripped out punctuation. The
1:02:04
punctuation is really good, right? It
1:02:06
uses case, uppercase, and lower case.
1:02:08
And in fact, even better, you can
1:02:10
actually make up a word as part of your
1:02:11
question and it'll use the word
1:02:13
consistently in the output. So just for
1:02:15
fun,
1:02:18
um I made up a word.
1:02:22
I just did this yesterday, a day before.
1:02:23
I said, here's a new word and its
1:02:24
definition. The word is relo
1:02:28
backwards.
1:02:30
I said the definition a student who
1:02:31
understands deep learning backwards
1:02:33
please use this word in a sentence. And
1:02:35
here is a sentence it's coming up with.
1:02:37
Um
1:02:39
I was like a little shocked during the
1:02:41
advanced neural network seminar. It
1:02:43
became evident that Jane was a true relo
1:02:45
effortlessly explaining even the most
1:02:47
complex deep learning concepts in
1:02:48
reverse order.
1:02:50
Okay. So it clearly knows how to use
1:02:53
anything you may make up. Right? So
1:02:54
it has the ability to compose things
1:02:56
from scratch as opposed to just looking
1:02:59
up stuff. So where is the thing coming
1:03:01
from? Right? That's the question. And
1:03:02
the answer is this very beautiful thing
1:03:04
called byte pair encoding, which we'll
1:03:06
look at next.
1:03:10
So all right. So what here um when we
1:03:14
look at this process, the
1:03:15
disadvantages are some of the things we
1:03:17
have discussed which is that we want to
1:03:18
be able to preserve punctuation. We want
1:03:19
to be able to preserve case. We want to
1:03:21
be able to handle new words and so on
1:03:22
and so forth. So uh the new like the the
1:03:26
sort of the modern models like BERT and
1:03:28
so on they use different tokenization
1:03:29
schemes. They don't actually do the STIE
1:03:31
thing, and the GPT family uses byte pair
1:03:34
encoding BPE. Uh BERT uses something
1:03:37
called WordPiece. All of these ways of
1:03:40
encoding, the fundamental idea is to
1:03:42
say, well, you know what? Why don't
1:03:44
whatever language you're working with,
1:03:46
why don't we start first of all with all
1:03:47
the individual characters? Because if
1:03:50
you could actually work with individual
1:03:51
characters, you can clearly compose any
1:03:53
word that comes up, right? Reo is just R
1:03:56
E L D O H, right? Six tokens. If you're
1:03:58
working with characters at the character
1:04:00
level, but working only with characters
1:04:02
is not great, right? because that means
1:04:05
that the model you're giving it no
1:04:07
information about the world. It has to
1:04:09
learn every word from scratch, what the
1:04:11
word means and so on and so forth. So we
1:04:14
it would be nice if we can actually give
1:04:15
it words as well. But we don't we don't
1:04:17
want to give it infrequent words because
1:04:20
infrequent words by definition are not
1:04:22
worth adding to your vocabulary. We're
1:04:25
just going to you know take up another
1:04:26
embedding vector and things like that.
1:04:28
For infrequent words, we'll just make
1:04:30
we'll just compose them. we'll we'll
1:04:31
actually construct them on the fly
1:04:32
because we can always use characters.
1:04:35
Okay, so we don't want to put every word
1:04:37
in there. We only want to put frequent
1:04:38
words. But to give this thing the
1:04:41
ability to compose new words and not
1:04:43
always have to go to characters, we will
1:04:45
give it parts of words. These are called
1:04:47
subwords. So the key idea is that let's
1:04:52
come up with a way to build a vocabulary
1:04:54
which has characters full words that are
1:04:56
frequent enough to be worth adding and
1:04:59
subwords or word fragments that occur
1:05:01
frequently enough to be worth adding. So
1:05:03
for example the word standardize
1:05:07
right normalize standardize and so on
1:05:09
and so forth. 'Ize' is going to show up a
1:05:11
lot in many places. So you don't want to
1:05:12
have standardize and normalize and so
1:05:14
on. You just want to have 'ize'. You can
1:05:15
just attach it to all kinds of words,
1:05:17
right? And make it all work, right? So
1:05:19
that's the basic idea of all these
1:05:20
tokenization schemes. And BPE is one such
1:05:23
way to figure out how to actually
1:05:25
construct this vocabulary from a
1:05:27
training corpus, right? And by the way,
1:05:29
when I say characters, this will include
1:05:31
not just, you know, uppercase and lowercase
1:05:33
alphabets and numbers; it will
1:05:34
also include punctuation.
1:05:37
So that all these things just become
1:05:38
atomic units.
1:05:40
All right. So uh so what we're going to
1:05:42
the way BPE works is that uh we're going
1:05:45
to uh start with each character as a
1:05:47
token and I'll talk about the rest of
1:05:49
the thing on the page in just a moment.
1:05:51
Don't worry about it. We'll start with
1:05:52
each character as a token. So let's say
1:05:53
that your training corpus is just a
1:05:56
single sentence. The cat sat on the mat.
1:05:58
Okay. And even though GPT does not
1:06:02
actually do any lowercasing, it'll just
1:06:03
actually treat, like, 'Th' uppercase as
1:06:05
different from 'th' lowercase. Uh just for
1:06:08
simplicity, I'm just going to
1:06:09
standardize it here. So it just becomes
1:06:11
the cat sat on the mat. And then I'm going
1:06:12
to write it in this form where I
1:06:14
basically put a comma after every word
1:06:16
and then I put a little underscore to
1:06:18
show the space between the words. Okay,
1:06:20
I'm going to write it in this format.
1:06:21
And it'll become clear why I'm writing
1:06:22
it in just a second. Okay. Now my
1:06:25
starting vocabulary is just all the
1:06:27
individual letters in the training
1:06:28
corpus. So the starting is just whatever
1:06:31
all these letters. Okay, that's it. And
1:06:34
this is a starting point. And now what
1:06:35
we do and this is the key step.
1:06:38
We merge tokens that most frequently
1:06:41
occur right next to each other. So if
1:06:44
two characters or two tokens are
1:06:47
occurring right next to each other a
1:06:48
lot, let's just merge them because they
1:06:51
seem to be occurring a lot together,
1:06:52
right? May as well merge them. And so
1:06:54
here, for example, I've listed the
1:06:57
frequencies of adjacent token pairs. So for
1:06:59
example, if you look at T and H, they
1:07:01
show up right next to each other here, and it
1:07:04
also shows up here. So therefore, it
1:07:06
shows up twice.
1:07:08
Now H E again is showing up here. It's
1:07:11
also showing up here. So that also shows
1:07:13
up twice. CA on the other hand is only
1:07:16
showing up here. It's not showing up
1:07:17
anywhere else. So it shows up once. A and T
1:07:20
show up three times, in mat, sat, and
1:07:24
in CAT and so on and so forth. You get
1:07:25
the idea. So you're just looking at
1:07:27
pair-wise adjacent tokens. And you pick
1:07:30
the most frequent one that's showing up,
1:07:32
which in this case happens to be A and T.
1:07:34
And then you take a and t and you merge
1:07:36
them. So it becomes AT.
1:07:40
Okay. So when you do that when you when
1:07:42
you you merge them and then you add that
1:07:44
new token that you've just literally
1:07:45
created to your vocabulary list and then
1:07:48
you update the corpus to reflect the
1:07:50
merge you've just done. So now the corpus
1:07:52
becomes the cat sat on the mat. But in
1:07:55
this case there is no a and t
1:07:56
separately. There is just the AT combo
1:07:58
token here.
1:08:02
Are we good with this step so far?
1:08:06
take the most frequent things and merge
1:08:07
them.
1:08:12
It's a way to compress the data. In
1:08:14
fact, the algorithm came from someone
1:08:16
trying to figure out a way to compress
1:08:17
data.
1:08:18
You know,
1:08:22
think of it this way, right? Suppose I
1:08:23
tell you uh I'm I want you to compress a
1:08:25
message I'm going to send to you and
1:08:28
then you look at all the past messages
1:08:30
you've had to deal with and it turns out
1:08:32
you're finding that u certain characters
1:08:35
are occurring next to each other all the
1:08:37
time right maybe just for argument let's
1:08:40
say ABC shows up ridiculously often in
1:08:42
the messaging and then you'll be like
1:08:44
you know what's if it's always showing
1:08:45
up all the time together why treat it as
1:08:47
three things let me just call it one
1:08:48
thing, ABC. That's it. You send a single
1:08:51
token called ABC every time you need to
1:08:53
send ABC, not A, B, C. That's the basic
1:08:56
idea. So here if you come here that's
1:08:58
what we have and then what we do is now
1:09:01
we do this calculation of
1:09:03
adjacent tokens again on this updated corpus,
1:09:05
and you can see here T H shows up once, T H
1:09:08
shows up here too, so you get two; H and
1:09:11
E show up twice; everything else shows
1:09:13
up once and yeah when many things are
1:09:16
showing up with equal frequency just
1:09:18
pick one randomly from those. So we pick
1:09:19
T H, right, and we merge that, which
1:09:22
means that we add TH to our vocabulary,
1:09:25
and once we do that we update the corpus,
1:09:27
and now TH is one thing
1:09:30
fused together, along with the previous
1:09:32
thing, AT, that had been fused together.
1:09:34
That is the corpus after the second merge,
1:09:36
and then we do the same thing: we find
1:09:38
the frequencies of adjacent tokens, turns out
1:09:40
TH and E are showing up twice, everything
1:09:42
else is showing up once, so we take TH and E,
1:09:45
merge them to get, boom, THE, and now we
1:09:48
have the cat sat on the mat. So this
1:09:51
process continues
1:09:53
till we reach a predefined limit for our
1:09:56
vocabulary. Now as it turns out when
1:09:59
they built GPT-2 and GPT-3, let me just see,
1:10:02
I think I did some digging around on
1:10:04
this thing. Yeah. So GPT-2 and 3, they set
1:10:07
the vocabulary size to be roughly
1:10:09
50,000. So it basically kept on doing
1:10:12
this till it hit a limit of 50,000 then
1:10:14
it stopped. GPT-4 on the other hand
1:10:17
actually went goes all the way to
1:10:18
100,000 vocabulary size.
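(A toy sketch of the BPE training loop just walked through, on the single-sentence corpus "the cat sat on the mat". Real tokenizers work on word frequencies and raw bytes, but the merge logic is the same idea; the underscore marks the space in front of a word, as above.)

from collections import Counter

def merge_pair(tokens, a, b):
    # Replace every adjacent occurrence of (a, b) in a token list with the fused token a+b.
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def train_bpe(words, num_merges):
    corpus = [list(w) for w in words]            # start with each character as a token
    vocab = {ch for w in corpus for ch in w}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in corpus:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1               # count adjacent token pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]      # most frequent adjacent pair
        merges.append((a, b))
        vocab.add(a + b)                         # add the new fused token to the vocabulary
        corpus = [merge_pair(w, a, b) for w in corpus]   # update the corpus to reflect the merge
    return merges, vocab

words = ["the", "_cat", "_sat", "_on", "_the", "_mat"]
merges, vocab = train_bpe(words, num_merges=3)
print(merges)   # e.g. [('a', 't'), ('t', 'h'), ('th', 'e')] -- ties are broken arbitrarily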
1:10:23
Okay, so this is BPE in action. Uh and so
1:10:28
what's going to happen is once you
1:10:29
finish all this thing and you have
1:10:30
vocabulary and you have all these things
1:10:31
that you have merged when a new piece of
1:10:32
text comes in right the merges remember
1:10:36
here we merged A and T to get AT, T and H became
1:10:39
TH, and so on. When a new piece of text
1:10:41
arrives, the tokenizer applies the
1:10:43
merges in the exact same order. So if
1:10:45
the new text that comes in is the rat,
1:10:47
it's first going to apply the A T to AT
1:10:50
merge to fuse this here, and then it's going
1:10:52
to fuse T H to get TH, and then it's
1:10:54
going to fuse TH and E to get THE. And
1:10:56
the final list of tokens that goes in to
1:10:58
your model is going to be the token for
1:11:00
THE, the token for space, the token
1:11:02
for R, and the token for AT.
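(Tokenizing new text then just replays those merges in the same order. A simplified sketch reusing merge_pair and merges from the sketch above; real tokenizers apply merges within word boundaries, but for "the rat" this toy version lands on the same tokens: the, space, r, at.)

def encode(text, merges):
    parts = text.lower().split()
    words = [parts[0]] + ["_" + w for w in parts[1:]]   # same underscore-for-space convention
    tokens = [ch for w in words for ch in w]            # start from individual characters
    for a, b in merges:                                 # apply merges in the exact same order
        tokens = merge_pair(tokens, a, b)
    return tokens

print(encode("the rat", merges))   # e.g. ['the', '_', 'r', 'at'] with the merges shown above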
1:11:06
So let's see this in action.
1:11:12
uh GPT, I mean OpenAI, has its own
1:11:14
thing but I found this uh site to be
1:11:17
really good. So let's uh tokenize
1:11:20
hands-on
1:11:23
deep learning.
1:11:26
So you can see here
1:11:28
look at this.
1:11:30
So H uppercase H is its own token. It's
1:11:34
token number 39
1:11:36
and
1:11:38
it's its own token. Dash is its own token,
1:11:41
on is its own token and then space deep
1:11:43
is its token and space learning is its
1:11:45
token okay note one thing suppose you
1:11:48
had said
1:11:50
let's just say you just had deep deep
1:11:51
deep learning
1:11:53
deep has a different token than space
1:11:56
deep
1:11:58
okay what they have realized is that
1:12:01
most words are actually going to show up
1:12:03
after the space after a space right much
1:12:06
more likely so having a space attached
1:12:08
to the beginning of the word saves you a
1:12:10
lot of sort of you know saves you a lot
1:12:12
of compute and so on and so forth
1:12:13
because they will in fact arrive almost
1:12:15
all the time with the space before it
1:12:17
right that's why they have attached the
1:12:18
space to the word itself um and note
1:12:21
that deep learning deep and uh deep
1:12:25
actually let's call it this way
1:12:30
so deep and deep are different
1:12:34
right there is deep there is so clearly
1:12:36
it's taking case into account then I put
1:12:38
an exclamation here. Boom. That and so
1:12:43
ultimately what goes in when you have
1:12:44
have a phrase like um
1:12:48
sat on the mat.
1:12:51
So the cat sat on the mat. And you can
1:12:53
see here uppercase the um and then
1:12:58
let's just do another thing here.
1:13:01
So uppercase the with a space is 383.
1:13:06
lowercase the is 262. Uh and then that's
1:13:10
distinct from just the without any
1:13:11
space. That's a different thing. So
1:13:13
these are all the tokens. Now um let's
1:13:16
try something.
1:13:18
Let's try
1:13:21
Jane.
1:13:24
So Jane is one token which is great and
1:13:27
is another token. Let's see. Rama. Ah
1:13:30
darn. My name wasn't worthy enough to be
1:13:34
its own token. Okay. But strangely
1:13:38
enough
1:13:41
this I was very surprised by this. So if
1:13:44
I put rama in lowercase, it is its own
1:13:46
token.
1:13:48
I have no idea what they were scraping
1:13:51
which websites. Uh and if I put Jane
1:13:55
here
1:13:56
now J has become its token with space
1:13:58
and A has become different.
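(If you want to poke at this from code rather than a website, one option, assuming the tiktoken package is installed, is to load one of OpenAI's published BPE vocabularies and inspect the token IDs directly; the "gpt2" encoding is the roughly 50,000-token one used in the examples above.)

import tiktoken

enc = tiktoken.get_encoding("gpt2")   # GPT-2/3 vocabulary; "cl100k_base" is the larger GPT-4 one
for text in ["Hands-on deep learning", "deep", " deep", "Jane", "rama"]:
    ids = enc.encode(text)
    print(repr(text), "->", ids, [enc.decode([i]) for i in ids])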
1:14:01
So the tokenization is like very it's a
1:14:03
very interesting thing and it works in
1:14:05
very interesting ways. But that's the
1:14:07
basic idea of what's going on under the
1:14:08
hood. I would encourage you to like
1:14:10
check out your names to see if it's
1:14:12
actually been tokenized. So all right,
1:14:13
I'm done. Thanks folks. I'll see you on
1:14:15
Wednesday.
— end of transcript —