1:44:31
Stanford CS229 I Machine Learning I Building Large Language Models (LLMs)
Stanford Online · May 10, 2026
Transcript
0:05
So, let's get started.
0:07
So I'll be talking about
building LLMs today.
0:10
So I think a lot of you have
heard of LLMs before, but just
0:14
as a quick recap.
0:16
LLMs standing for
large language models
0:18
are basically all the
chat bots that you've
0:21
been hearing about recently.
0:22
So, ChatGPT, from OpenAI,
Claude, from Anthropic, Gemini
0:28
and Llama, and other
types of models like this.
0:31
And today we'll be talking
about how do they actually work.
0:34
So it's going to be an overview
because it's only one lecture
0:36
and it's hard to
compress everything.
0:38
But hopefully, I'll
touch a little bit
0:39
about all the components
that are needed
0:41
to train some of these LLMs.
0:43
Also, if you have questions,
please interrupt me
0:46
and ask if you have a question.
0:48
Most likely other people in
the room or on Zoom
0:52
have the same questions.
0:53
So, please ask.
0:56
Great.
0:56
So what matters
when training LLMs?
1:00
So there are a few key
components that matter.
1:02
One is the architecture.
1:04
So as you probably all know,
LLMs are neural networks,
1:07
and when you think
about neural networks,
1:09
you have to think about what
architecture you're using.
1:11
And another component,
which is really important
1:13
is the training loss and
the training algorithm.
1:16
So, how you actually train
these models, then it's data.
1:20
So, what do you train
these models on.
1:24
The evaluation,
which is how do you
1:26
know whether you're
actually making progress
1:28
towards the goal of LLMs and
then, the system component.
1:33
So that is like
how do you actually
1:35
make these models run on
modern hardware, which
1:38
is really important because
these models are really large.
1:41
So now more than ever,
systems are actually
1:43
really an important
topic for LLMs.
1:47
So those are the five components.
You probably all know,
1:52
and if you don't
know, that LLMs are all
1:53
based on transformers,
or at least some version
1:56
of transformers.
1:57
I'm actually not going to talk
about the architecture today.
2:00
One, because I gave a lecture
on transformers a few weeks ago
2:06
and two, because you can find
so much information online
2:09
on transformers.
2:11
There's much less information
about the other four topics.
2:14
So, I really want
to talk about those.
2:17
And another thing to say
is that most of academia
2:20
actually focuses on
architecture and training
2:22
algorithms and
losses. As academics,
2:25
and I've done that for
a big part of my career,
2:28
we simply like thinking
that we make
2:32
new architectures,
new models, and it
2:35
seems like it's very important.
2:37
But in reality, honestly, what
matters in practice is mostly
2:39
the three other topics.
2:41
So, data, evaluation and
systems, which is what most
2:45
of industry actually focuses on.
2:48
So, that's also
one of the reasons
2:49
why I don't want to talk too
much about the architecture,
2:52
because really the rest
is super important.
2:55
Great.
2:55
So, overview of
the lecture, I'll
2:57
be talking about pretraining.
2:58
So, pretraining, you
probably heard that word.
3:00
This is the general word.
3:02
This is kind of the classical
language modeling paradigm where
3:06
you basically train your
language model to essentially
3:08
model all of the internet.
3:10
And then, there's
a post training,
3:11
which is a more
recent paradigm which
3:13
is taking these
large language models
3:15
and making them
essentially AI assistants.
3:18
So, this is more of a
recent trend since ChatGPT.
3:22
So, if you ever heard
of GPT3 or GPT2,
3:25
that's really pretraining land.
3:27
If you heard of ChatGPT,
which you probably have,
3:29
this is really
post training land,
3:31
so I'll be talking about both,
but I'll start with pretraining
3:34
and specifically
I'll talk about what
3:37
is the task of pretraining LLMs
and what is the loss that people
3:41
actually use.
3:43
So, language modeling,
this is a quick recap.
3:47
Language models at a
high level are simply
3:49
models of probability
distribution over sequences
3:52
of tokens or of words.
3:53
So it's basically
some model of p of x1
3:57
to xL, where x1
is basically word
3:59
one and xL is the last one in
the sequence or in the sentence.
4:04
So, very concretely, if you
have a sentence like the mouse
4:07
ate the cheese, what
the language model gives
4:09
you is simply a probability
of this sentence being uttered
4:13
by a human or
being found online.
4:17
So, if you have another sentence
like "The the mouse ate cheese."
4:21
Here, there's
grammatical mistakes.
4:23
So, the model should
4:25
have some syntactic knowledge.
4:27
So, it should know that
this has less likelihood
4:30
of appearing online.
4:32
If you have another sentence
like the cheese ate the mouse,
4:36
then the model should
hopefully know about the fact
4:39
that usually cheese
doesn't eat mice.
4:42
So, there's some
semantic knowledge
4:43
and this is less likely
than the first sentence.
4:45
So, this is basically at a high
level what language models are.
4:50
One term that you probably have
been hearing a lot in the news
4:52
is generative models.
4:54
So, this is just something
that can generate.
4:56
Models that can
generate sentences
4:57
or can generate some data.
4:59
The reason why we say language
models are generative models
5:01
is that once you have a
model of a distribution,
5:04
you can simply sample
from this model.
5:06
And now we can generate data.
5:07
So we can generate sentences
using a language model.
5:12
So the type of models that
people are all currently using
5:15
are what we call
autoregressive language models.
5:18
And the key idea of
autoregressive language models
5:21
is that you take this
distribution over words
5:25
and you basically decompose
it into the distribution
5:29
of the first word, multiplied
by the distribution,
5:32
or the likelihood,
of the second word
5:35
given the first
word, and multiply it
5:37
by P of the third word
given the first two words.
5:40
So, there's no
approximation here.
5:42
This is just the chain rule
of probability, which you
5:44
hopefully you all know about.
5:46
Really no approximation.
5:47
This is just one way of
modeling a distribution.
5:50
So, slightly more
concisely, you can write it
5:52
as a product of P's of the next
word, given everything which
5:57
happened in the past.
5:58
So, of the context.
5:59
So, this is what we call
autoregressive language models.
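Written out, the decomposition described here is just the chain rule of probability:

```latex
p(x_1, \dots, x_L)
  = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots p(x_L \mid x_1, \dots, x_{L-1})
  = \prod_{i=1}^{L} p(x_i \mid x_{1:i-1})
```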
6:02
Again, this is really
not the only way
6:05
of modeling distribution.
6:06
This is just one way.
6:07
It has some benefits
and some downsides.
6:10
One downside of
autoregressive language models
6:12
is that when you actually
sample from this autoregressive
6:15
language model,
you basically have
6:16
a for loop, which generates
the next word, then conditions
6:20
on that next word.
6:21
And then it generates
another word.
6:23
So, basically if you
have a longer sentence
6:24
that you want to generate, it
takes more time to generate it.
6:28
So, there are some downsides
of this current paradigm,
6:31
but that's what
we currently have.
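To make that for loop concrete, here is a minimal sketch of autoregressive sampling; `model` (returning a next-token distribution) and `tokenizer` are hypothetical stand-ins, not any specific library's API.

```python
# A minimal sketch of the autoregressive sampling loop just described,
# assuming hypothetical `model` and `tokenizer` objects.
import numpy as np

def generate(model, tokenizer, prompt, max_new_tokens=20):
    ids = tokenizer.encode(prompt)              # tokenize the prompt once
    for _ in range(max_new_tokens):             # one forward pass per new token
        probs = model(ids)                      # distribution over the vocabulary
        next_id = np.random.choice(len(probs), p=probs)  # sample the next token
        ids.append(next_id)                     # condition on it next iteration
    return tokenizer.decode(ids)                # detokenize back to text
```

Each new token requires a full pass over the sequence so far, which is exactly the length-dependent cost mentioned above.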
6:33
So, I'm going to
talk about this one.
6:36
Great.
6:36
So, autoregressive
language models.
6:38
At a high level, the task of
an autoregressive language model
6:41
is simply predicting the
next word, as I just said.
6:44
So, if we have a sentence
like she likely prefers,
6:47
one potential next
word might be dogs.
6:50
And the way we do it is
that we first tokenize.
6:54
So, you take these words or
subwords you tokenize them
6:58
and then you give an
ID for each token.
7:00
So here you have
one, two, three.
7:03
Then, you pass it
through this black box.
7:04
As I already said,
we're not going
7:06
to talk about the architecture.
7:07
You just pass it through,
pass it through a model,
7:10
and you then get a distribution,
a probability distribution
7:13
over the next word or
over the next token.
7:16
And then you sample
from this distribution,
7:20
you get a new token and
then you detokenize.
7:22
So, you get a new
ID, you detokenize
7:24
and that's how you basically
sample from a language model.
7:28
One thing which is
important to note
7:29
is that the last two
steps are actually
7:32
only needed during inference.
7:34
When you do training,
you just need
7:36
to predict the most likely
token and you can just
7:38
compare to the real token
which happened in practice,
7:41
and then, you basically
change the weights
7:43
of your model to increase
the probability of generating
7:46
that token.
7:49
Great.
7:50
So, autoregressive
neural language models.
7:52
So to be slightly
more specific, still,
7:54
without talking about
the architecture,
7:56
the first thing we do is
that we have all of these.
7:58
Sorry, yes.
7:59
On the previous slide.
8:01
Predicting the probability
of the next token,
8:03
does this mean that your
final output vector has
8:06
to be the same dimensionality
as the number of tokens
8:08
that you have?
8:09
Yes.
8:10
How do you deal with
it if you have more tokens,
8:13
adding more tokens
to your [INAUDIBLE]?
8:16
Yeah so we're going to
talk about tokenization
8:18
actually later so you will
get some sense of this.
8:21
You basically can't deal
with adding new tokens.
8:24
I'm kind of exaggerating.
8:25
There are methods for doing
it, but essentially people
8:28
don't do it.
8:29
So it's really
important to think about
8:32
how you tokenize your
text, and that's why
8:33
we'll talk about that later.
8:35
But it's a very
good point to note
8:38
that the vocabulary size, so
the number of tokens that
8:40
you have, is essentially
the output dimension of your
language model.
language model.
8:43
So it's actually pretty large.
8:46
So autoregressive
neural language models.
8:48
First thing you do is that you
take every word or every token.
8:51
You embed them so you get
some vector representation
8:56
for each of these tokens.
8:58
You pass them through some
neural network, as we said,
9:00
it's a transformer.
9:01
Then you get a representation
for all the words
9:04
in the context.
9:06
So it's basically
a representation
9:07
of the entire sentence.
9:09
You pass it through
a linear layer,
9:11
as you just asked, to
basically map it
9:15
so that the output size,
the number of outputs,
9:17
is the number of tokens.
9:19
You then pass it
through some softmax
9:21
and you basically get a
probability distribution
9:24
over the next words given
every word in the context.
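As a shape-level sketch of this forward pass; `transformer` is a stand-in for the architecture we are not covering, and all names here are illustrative:

```python
# Embed -> transformer -> linear projection -> softmax, shapes only.
import numpy as np

def next_token_distribution(token_ids, E, W, transformer):
    # E: (vocab_size, d) embedding matrix; W: (d, vocab_size) output projection
    h = E[token_ids]                   # (L, d): one embedding per token
    h = transformer(h)                 # (L, d): contextual representations
    logits = h[-1] @ W                 # (vocab_size,): scores for the next token
    z = np.exp(logits - logits.max())  # softmax, numerically stabilized
    return z / z.sum()                 # probability distribution over next token
```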
9:30
And the loss that you
use is basically--
9:32
it's essentially a task of
classifying the next token.
9:35
So it's a very simple, kind
of, machine learning task.
9:37
So you use the
cross-entropy loss.
9:39
Where you basically look at the
actual target that happened,
9:44
which is the target
distribution, which
9:45
is a one hot encoding,
which in this case says,
9:49
I saw the real word
that happened is cat.
9:51
So that's a one hot
distribution over cat.
9:55
And here this is the actual--
9:57
do you see my mouse?
9:58
Oh, yeah.
9:58
This is the distribution
that you generated.
10:00
And basically you
do cross entropy,
10:01
which really just increases the
probability of generating cat
10:04
and decreases all the
probability of generating
10:06
all the other tokens.
10:08
One thing to notice is
that, as you all know again,
10:11
this is just equivalent
to maximizing the text log
10:15
likelihood because
you can just rewrite
10:17
the max over the probability
of this autoregressive language
10:23
modeling task as being
the minimum of
10:26
its negative log,
which
10:29
is just the minimum of the loss,
which is the cross-entropy loss.
10:31
So basically
minimizing the loss is
10:33
the same thing as maximizing
the likelihood of your text.
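In symbols, the equivalence is:

```latex
\max_{\theta} \; p_{\theta}(x_{1:L})
  = \max_{\theta} \prod_{i=1}^{L} p_{\theta}(x_i \mid x_{1:i-1})
\;\Longleftrightarrow\;
\min_{\theta} \sum_{i=1}^{L} -\log p_{\theta}(x_i \mid x_{1:i-1})
  = \min_{\theta} \; \mathcal{L}_{\text{cross-entropy}}
```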
10:36
Any question?
10:37
Questions?
10:43
OK, tokenizer.
10:46
So this is one thing
that people usually
10:49
don't talk that much about.
10:50
Tokenizers are
extremely important.
10:53
So it's really important that
you understand at least what
10:56
they do at a high level.
10:57
So why do we need tokenizers
in the first place?
11:01
First, it's more
general than words.
11:02
So one simple thing
that you might think
11:04
is we're just going to take
every word that we will have.
11:07
You just say every word
is a token on its own.
11:11
But then what happens is if
there's a typo in your word?
11:14
Then you might not have
any token associated
11:17
with this word with a typo.
11:20
And then you don't know
how to actually pass
11:21
this word with a typo into
a large language model.
11:24
So what do you do next?
11:25
And also, even if you think
about words,
11:29
words are fine for
Latin-based languages.
11:32
But if you think about
a language like Thai,
11:34
you won't have a simple
way of tokenizing
11:36
by spaces because there are
no spaces between words.
11:39
So really, tokens are much
more general than words.
11:43
It's the first thing.
11:44
Second thing that
you might think
11:45
is that you might tokenize
every sentence, character
11:48
by character.
11:49
You might say A is one
token, B is another token.
11:52
That would actually work
and probably very well.
11:55
The issue is that then your
sequence becomes super long.
11:58
And as you probably
remember from the lecture
12:00
on transformers, the
complexity grows quadratically
12:05
with the length of sequences.
12:06
So you really don't want to
have a super-long sequence.
12:10
So tokenizers basically try to
deal with those two problems
12:14
and give common subsequences
a certain token.
12:19
And usually how you should be
thinking about it is that,
12:22
on average, every token
is around 3-4 letters.
12:27
And there are many
algorithms for tokenization.
12:30
I'll just talk about one of them
to give you a high level, which
12:32
is what we call Byte Pair
Encoding, which is actually
12:34
pretty common,
12:35
one of the two most
common tokenizers.
12:37
And the way that you
train a tokenizer
12:39
is that first you start with
a very large corpus of text.
12:42
And here, I'm really not talking
about training a large language
12:45
model yet, this is purely
for the tokenization step.
12:48
So this is my large corpus of
text with these five words.
12:52
And then you associate
every character
12:55
in this corpus of text
a different token.
12:58
So here, I just split
it up every character
13:00
with a different
token, and I just
13:03
color coded all of those tokens.
13:05
And then what you do is that
you go through your text,
13:08
and every time you see pairs
of tokens that are very common,
13:12
the most common pair of
token, you just merge them.
13:15
So here you see three
times the tokens t and o
13:19
next to each other.
13:20
So you're just going to
say this is a new token.
13:22
And then you continue,
you repeat that.
13:24
So now you have tok, tok
which happens three times.
13:28
Toke with an E that
happens 2 times and token,
13:33
which happens twice, and then
ex which also happens twice.
13:37
So if you were to
train a tokenizer on this corpus
13:41
of text, which is
very small, that's
13:43
how you would end up
13:45
with a trained tokenizer.
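A toy version of this BPE training loop, just to make the merge rule concrete; real tokenizers (like GPT's) work on bytes and are heavily optimized, so treat this as illustrative only:

```python
# Toy Byte Pair Encoding training: repeatedly merge the most frequent
# adjacent pair of tokens, starting from individual characters.
from collections import Counter

def train_bpe(words, num_merges):
    corpus = [list(w) for w in words]          # start: one token per character
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for toks in corpus:                    # count adjacent token pairs
            pairs.update(zip(toks, toks[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]    # most frequent pair wins
        merges.append((a, b))
        for toks in corpus:                    # merge every occurrence in place
            i = 0
            while i < len(toks) - 1:
                if toks[i] == a and toks[i + 1] == b:
                    toks[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges

# e.g. train_bpe(["token", "tokens", "tokenize", "toke", "tok"], 5)
# first merges ("t","o"), then ("to","k"), and so on, as in the example above.
```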
13:47
In reality, you do it on
much larger corpus of text.
13:51
And this is the
real tokenizer of--
13:54
actually, I think this
is GPT3 or ChatGPT.
13:57
And here you see how it would
actually separate these words.
14:00
So basically you
see the same thing
14:01
as what we gave in
the previous example.
14:03
Token becomes its own token.
14:06
So tokenizer actually
gets split up
14:08
into two tokens, token and -izer.
14:12
So yeah, that's all
about tokenizers.
14:15
Any questions on that?
14:16
Yeah.
14:16
How do you deal with
spaces, and how do you
14:18
deal with [INAUDIBLE].
14:19
Yeah so actually there's
a step before tokenizers,
14:23
which is what we call
pre-tokenizers, which
14:25
is exactly what you just said.
14:27
So this is mostly--
14:29
in theory, there's no reason to
deal with spaces and punctuation
14:33
separately.
14:34
You could just say every
space gets its own token,
14:37
every punctuation
gets its own token,
14:40
and you can just
do all the merging.
14:42
The problem is that-- so
there's an efficiency question.
14:45
Actually, training these
tokenizers takes a long time.
14:48
So you better-- because you have
to consider every pair of tokens.
14:51
So what you end up doing is
saying if there's a space,
14:54
this is very--
like pre-tokenizers
14:55
are very English specific.
14:57
You say if there's
a space, we're
14:58
not going to start looking
at the token that came before
15:01
and the token that
came afterwards.
15:03
So you're not merging
in between spaces.
15:06
But this is just like a
computational optimization.
15:10
You could theoretically
just deal with it
15:12
the same way as you deal
with any other character.
15:15
And--
15:15
Yeah.
15:16
When you merge tokens, do you delete
the tokens that you merged away
15:19
or do you keep the smaller
tokens that you merged?
15:22
You actually keep
the smaller tokens.
15:25
I mean, in reality, it doesn't
matter much because usually
15:29
on a large corpus of text, you
will have actually everything.
15:32
But you usually
keep the small ones.
15:34
And the reason why
you want to do that
15:36
is because if-- in case there's,
as we said before, you have
15:38
some grammatical
mistakes or some typos,
15:41
you still want to
be able to represent
15:43
these words by character.
15:46
So, yeah.
15:47
Yes.
15:48
Are the tokens unique?
15:51
So I mean, say in this case
T-O-K-E-N is there only one
15:54
occurrence or could--
15:56
do you need to leave multiple
occurrence so they could have--
16:00
take on different
meanings or something?
16:02
Oh I see what you're saying.
16:03
No, every token
has its own unique ID.
16:08
So a usual-- this
is a great question.
16:11
For example, if you
think about a bank, which
16:13
could be bank for like
money or bank like water,
16:16
it will have the same token.
16:18
But the model will
learn, the transformer
16:19
will learn that based on the
words that are around it,
16:22
it should associate that--
16:24
I'm saying-- I'm being
very handwavy here,
16:26
but associate that with
a representation that
16:30
is either more like the bank
money side or the bank water
16:33
side.
16:34
But that's a transformer
that does that.
16:36
It's not a tokenizer.
16:38
Yes.
16:39
Yes.
16:39
So you mentioned
during tokenization,
16:41
keep the smaller tokens
you started with, right.
16:43
Like if you start with
a T you keep the T
16:45
and then you build
your tokenize out to
16:47
[INAUDIBLE] allow input token.
16:49
So let's say maybe you didn't
train on token, but in your data
16:53
you are trying to encode token.
16:54
So how does the tokenizer know
to encode it with token or to
16:58
[INAUDIBLE]?
16:59
Yeah.
16:59
That's a great question.
17:00
You basically when you--
so when you tokenize,
17:02
so that's after training
of the tokenizer
17:04
when you actually
apply the tokenizer
17:06
you basically always
choose the largest token
17:10
that you can apply.
17:11
So if you can do token,
you will never do T,
17:13
you will always do token.
17:15
But there's actually--
so people don't usually
17:18
talk that much about
tokenizers, but there's
17:20
a lot of computational benefits
or computational tricks
17:24
that you can do for making
these things faster.
17:27
So I really don't think
we-- and honestly, I
17:29
think a lot of people think
that we should just get away
17:31
from tokenizers and just
kind of tokenize character
17:34
by character or bytes by bytes.
17:36
But as I said, right now
there's this issue of length,
17:39
but maybe one day, like
in five or 10 years,
17:42
we will have different
architectures
17:43
that don't scale quadratically
with the length of the sequence.
17:46
And maybe we'll move
away from tokenizers.
17:50
So can you share
with us the drawback?
17:53
Why do people want to move
away from the tokenizer?
17:57
Yeah.
17:58
So I think one good
example is math.
18:03
If you think about math,
actually numbers right now
18:06
are not tokenized digit by digit.
18:07
So for example, 327 might
have its own token, which
18:10
means that models,
when they see numbers,
18:13
they don't see them
the same way as we do.
18:15
And this is very
annoying because I mean,
18:17
the reason why we can
generalize with math
18:19
is because we can deal with
every digit separately
18:22
and we can then do composition.
18:24
Where you know that
basically if you add stuff,
18:26
it's the same thing as
adding every one separately
18:28
plus like whatever
the unit that you add.
18:30
So they can't do that.
18:32
So then you have to do
special tokenization.
18:35
And, like, one of the
big changes that GPT4 did
18:39
is changing the way
that they tokenize code.
18:42
So for example, if you have
code, you know you have often,
18:46
in Python, these four
spaces at the beginning.
18:48
Those were dealt with
strangely before.
18:52
And as a result, like,
the model couldn't really
18:54
understand how to
deal with code.
18:57
So tokenizers actually
matter a lot.
19:00
OK, so I'll move on right now,
but we can come back later
19:04
on tokenizers.
19:05
Great.
19:06
So we talked about the task,
the loss, the tokenizer,
19:08
let's talk a little
bit about evaluation.
19:11
So the way that LLMs
are usually evaluated
19:13
is what we call-- is using
what we call perplexity.
19:16
At a high level it's basically
just your validation loss.
19:20
The slight difference
with perplexity
19:21
is that we use something that
is slightly more interpretable,
19:24
which is that we use the
average per token loss,
19:27
and then you exponentiate it.
19:29
And the reason why
you exponentiate it
19:30
is because you want--
19:32
I mean, the loss has
a log inside and you--
19:35
like, one, humans
are actually pretty
19:36
bad at thinking in log space.
19:38
But two, logs depend
on the base of the log,
19:41
while when you exponentiate,
you basically have everything
19:44
in vocabulary-size units.
19:48
And the average per
token is just so
19:50
that your perplexity is
independent of the length
19:52
of your sequence.
19:54
So perplexity is just
two to the power average
19:57
of the loss of the sequence.
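As a sketch, with base 2 to match the "two to the power of the average loss" phrasing (natural log and exp are equally common):

```python
# Perplexity: exponentiate the average per-token loss.
import numpy as np

def perplexity(token_log2_probs):
    """token_log2_probs: log2 p(x_i | x_<i) for every token in the sequence."""
    avg_loss = -np.mean(token_log2_probs)  # average per-token loss, in bits
    return 2.0 ** avg_loss                 # 1 = perfect, vocab size = clueless

# Sanity check: a uniform guess over a vocabulary of size V gives
# perplexity(np.full(L, -np.log2(V))) == V, matching the bounds below.
```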
20:00
So perplexity is between one
and the length of the vocabulary
20:04
of your tokenizer.
20:05
One is simply, well,
if you predict perfectly
20:08
every
word, then every word
20:11
will have probability one, so
you get a product of ones.
20:14
So the best perplexity
you can have is one.
20:16
If you really have no
idea, you basically
20:18
predict with one divided
by size of vocabulary
20:22
and then you do simple
math and you basically
20:24
get perplexity of
size of vocabulary.
20:26
So the intuition
of perplexity is
20:28
that it's basically
the number of tokens
20:30
that your model is, kind
of, hesitating between.
20:32
So if your model is perfect,
it doesn't hesitate.
20:35
It knows exactly the word.
20:36
If it really has
no idea, then it
20:38
hesitates between all
of the vocabulary.
20:43
So perplexity really improved.
20:46
That's perplexity on a standard
data set between 2017 and 2023.
20:50
It went from a kind of 70
tokens to less than 10 tokens
20:54
over these five, six years.
20:56
So that means that the
models were previously
20:58
hesitating between 70 words every
time it was generating a word,
21:02
and now it's hesitating
between less than 10 words.
21:05
So that's much better.
21:06
Perplexity is actually
not used anymore
21:08
in academic benchmarking,
mostly because it depends
21:11
on the tokenizer that you use.
21:12
It depends on the actual data
that people are evaluating on.
21:16
But it's still very important
for development of LLMs.
21:19
So when you actually
train your own LLM people
21:21
will still really look
at the perplexity.
21:26
Another common way, and
now more common in academia,
21:30
of evaluating these LLMs is just
by taking all the classical NLP
21:34
benchmarks, and I'll give you
a few examples later and just,
21:37
kind of, aggregating everything.
21:39
So collect as many automatically
evaluatable benchmarks as possible
21:43
and just evaluate
across all of them.
21:46
So one such-- or
actually two such
21:50
benchmarks are what we call
HELM, which is from Stanford.
21:54
And another one is the
Hugging Face open leaderboard,
21:56
which are probably the two
most common ones right now.
22:00
So just to give you
an idea, in HELM,
22:02
all of these type
of tasks, which
22:04
are mostly things that
can be easily evaluated
22:08
like question answering.
22:09
So think about many different
question answering tasks.
22:13
And the benefit with
question answering
22:15
is that you usually know
what is the real answer.
22:18
So you can-- the way that
you evaluate these models
22:20
and I'll give you a concrete
example in one second,
22:22
is that you can just look at
how likely the language model is
22:26
to generate the real answer
compared to some other answers.
22:30
And that's essentially,
at a high level,
22:31
how you evaluate these models.
22:33
So to give you a
specific example,
22:35
MMLU is probably the most common
academic benchmark for LLMs.
22:42
And this is just a
collection of many question
22:45
and answers in all
of those domains.
22:47
For example, college
medicine, college physics,
22:50
astronomy and these
type of topics.
22:52
And the questions are things
like, so this is in astronomy.
22:55
What is true for
Type Ia supernovae?
22:58
Then you give four
different potential answers
23:01
and you just ask the model
which one is more likely.
23:04
So there are many
different ways of doing it.
23:06
Either you can look at the
likelihood of generating
23:09
all these answers, or
you can ask the model
23:11
which one is the most likely.
23:12
So there are different ways
that you can prompt the model,
23:15
but at a high level, you
know which one is correct.
23:17
And the three
others are wrong.
23:20
Yes.
23:22
Creating unconstrained
text as an output.
23:24
Yeah.
23:25
How do you evaluate
a model if it
23:28
gives something that's
semantically completely
23:31
identical, but is not the
exact tokens that you expect?
23:35
Yeah.
23:36
So that's a great question.
23:37
I'll talk more about that later.
23:38
Here, in this case, we
don't do unconstrained.
23:41
So the way you would evaluate
MMLU is basically either
23:44
you ask the first
question, and then you
23:47
look at the likelihood of
the model generating A,
23:50
the likelihood of the model
generating B, C, and D
23:53
and you look at which
one is the most likely.
23:55
Or you can ask the
model out of A, B, C, D,
23:58
which one is the most likely.
23:59
And you look at whether the
most likely next token is A, B,
24:03
C, or D. So you
constrain the model
24:05
to say it can only
answer these four things.
24:09
You say you constrain--
24:10
Yeah.
24:11
You constrain the
prompt or do you
24:13
mean of its whole
probability distribution
24:15
that it outputs
you only comparing
24:17
the outputs of like-- you're
only comparing the A token the
24:19
[INAUDIBLE].
24:20
Yeah.
24:20
So in the second case I gave
you, you would do exactly the--
24:24
actually would do both.
24:25
You would prompt the
model saying A, B, C, or D
24:27
plus you would constrain to
only look at these four tokens.
24:32
In the first case, you don't
even need to generate anything.
24:34
So in the first case,
you literally just
24:36
look, given it's
a language model,
24:38
it can give a distribution
over sentences.
24:40
You just look at what is
the likelihood of generating
24:43
all of these words?
24:45
What is the likelihood of
generating the second choice?
24:48
And you just look at whether the
most likely sentence is actually
24:52
the real answer.
24:54
So you don't actually
sample from it,
24:56
you really just
use P of X1 to XL.
24:59
Does that make sense?
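A sketch of that first, likelihood-based way of scoring an MMLU item; `logprob` is a hypothetical helper returning log p(continuation | prompt) under the model:

```python
# Likelihood-based multiple-choice scoring: pick the answer whose text
# the model is most likely to generate, without sampling anything.
def mmlu_item_correct(logprob, question, choices, correct_idx):
    scores = [logprob(question, choice) for choice in choices]
    predicted = max(range(len(choices)), key=lambda i: scores[i])
    return predicted == correct_idx    # accuracy is averaged over many items
```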
25:01
That being said, evaluation
of open-ended questions
25:05
is something we're going
to talk about later,
25:06
and it's actually
really important
25:08
and really challenging.
25:09
Yes.
25:10
Earlier you mentioned
[INAUDIBLE] metrics
25:13
like perplexity
are not usually
25:16
used because it
depends on how you do
25:18
your tokenization,
some design choices.
25:21
I was wondering if you
could speak more to that.
25:24
Yeah.
25:25
So think about perplexity.
25:26
I told you perplexity is
between 1 and vocabulary size.
25:30
So now imagine that ChatGPT
uses a tokenizer that has 10,000
25:34
tokens but Gemini from Google
uses a tokenizer that had
25:38
100,000 potential tokens.
25:41
Then the upper bound
25:45
of the perplexity that you can
get is actually worse for Gemini
25:48
than for ChatGPT.
25:50
Does that make sense?
25:52
So that's just an idea.
25:53
It's actually a little bit
more complicated than that,
25:55
but that's just one
simple example
25:58
where you can see that the
tokenizer actually matters.
26:02
Great.
26:05
OK, so evaluation challenges.
26:07
There are many.
26:08
I'll just talk about
two really briefly.
26:10
One, as I told you, there are
two ways of doing evaluation
26:13
for these MMLUs.
26:14
Actually, there are
many more than two
26:16
but I gave you two examples.
26:17
And it happens that
for a long time,
26:20
even though that was a
very classical benchmark
26:22
that everyone uses,
different companies
26:27
and different
organizations were actually
26:32
using different ways
of evaluating MMLU.
26:34
And as a result, you get
completely different results.
26:37
For example, Llama-65b, which
was the first model of Meta
26:42
in the Llama series, had
on HELM 63.7 accuracy
26:47
but on this other
benchmark had like 48.8.
26:53
So really the way that you
evaluate, and this is not even
26:55
talking about prompting
this is really just the way
26:58
that you evaluate the models.
27:01
Prompting is another issue.
27:02
So really, there are a
lot of inconsistencies.
27:04
It's not as easy as it looks.
27:07
First thing.
27:08
Yeah, sorry.
27:08
How can we make sure
that all these models
27:10
aren't trained on the benchmark?
27:13
Second thing.
27:14
This is a great question.
27:15
Train test contamination.
27:17
This is something
which I would say
27:19
is really important
in academia in--
27:24
given that the talk is mostly
about training large language
27:26
models, for companies, it's
maybe not that important
27:29
because they know
what they trained on.
27:33
For us, we have no idea.
27:35
So, for us, it's a real problem.
27:37
So there are many
different ways of trying
27:39
to test whether the test set--
27:42
or sorry, whether the
test set was actually
27:44
in the training set.
27:45
One, kind of, cute trick
that people in the lab,
27:51
in [? Tatsu's ?] lab have
found, is that what you can do
27:54
is that given that most
of the data set online
27:57
are not randomized,
you can just look at--
28:00
and that language models,
what they do is just
28:02
predict the next word.
28:03
You can just look at
the entire test set.
28:06
What if you generate
all the examples
28:09
in order versus all the
examples in a different order.
28:13
And if it's more likely to
generate a thing in order, given
28:17
that there's no
real order there,
28:19
then it means that probably
it was in the training set.
28:21
Does that make sense?
28:23
So there are many--
that's like one of them.
28:24
There are many other
ways of doing it.
28:26
Train test
contamination, again, not
28:28
that important for development,
really important for
28:30
academic benchmarking.
28:33
Great.
28:33
So there are many
other challenges,
28:34
but I'll move on for now.
28:37
Great.
28:38
Data.
28:40
So data is another
really big topic.
28:43
At a high level people
just say you basically
28:45
train large language
models on all of the internet.
28:48
What does that even mean?
28:50
So people sometimes say,
well, of clean internet,
28:53
which is even less defined.
28:55
So internet is very dirty
and really not representative
28:59
of what we want in practice.
29:00
If I download a random
website right now,
29:03
you would be shocked
at what is in there.
29:06
It's definitely
not your Wikipedia.
29:08
So I'll go really briefly
on what people do.
29:14
I can answer some
questions, but I mean,
29:16
data is on its own
it's a huge topic.
29:19
Basically, first what you do
is download all of the internet.
29:22
What that means is that
you use web crawlers that
29:25
will go on every web page on
the internet, or every web page that
29:29
is on Google.
29:31
And that is around 250
billion pages right now.
29:36
And that's around
1 petabyte of data.
29:39
So actually, Common
Crawl is one web crawler.
29:42
So people don't usually
write their own web crawlers
29:45
what they do is that they
use standard web crawlers,
29:47
and Common Crawl is one of them
that basically every month adds
29:51
all the new websites that were
added on internet that are found
29:56
by Google, and they put it in
a big basically a big data set.
30:00
So that's-- on Common Crawl, you
have around 250 billion pages
30:04
right now.
30:04
So 1E6 gigabytes of data.
30:07
Once you have this--
30:09
so this is a random web page.
30:11
Like literally random
from this Common Crawl.
30:14
And what you see is
that one, it really
30:16
doesn't look like the type of things
that you would usually see,
30:18
but actually-- so
this is an HTML page.
30:21
It's hard to see, but
if you look through
30:24
you will see some content.
30:26
For example, here,
Test King World
30:30
is your ultimate source for
the system x high performance
30:33
server.
30:34
And then you have three dots.
30:35
So you don't even-- the
sentence is not even finished.
30:37
That's what random
internet looks like.
30:40
So, of course, it's
not that useful
30:42
if you just train a
large language model
30:44
to generate things like this.
30:45
So what are some of the
steps that are needed?
30:48
First one, you extract
the text from the HTML.
30:51
So that's what I just
tried to do by looking
30:53
at basically the correct tags.
30:55
There are a lot of
challenges through this.
30:57
For example, extracting
math is actually
30:59
very complicated, but pretty
important for training
31:02
large language models.
31:03
Or for example, boilerplates.
31:05
A lot of your forums will
have the same type of headers,
31:08
the same type of footers.
31:10
You don't want to repeat
all of this in your data,
31:13
and then you will filter
undesirable content.
31:16
So not safe for work,
harmful content, PII.
31:20
So usually every
company has basically
31:22
a blacklist of websites
that they don't
31:26
want to train their models on.
31:27
That blacklist is very
long and you basically
31:30
say if it comes from there,
we don't train on this.
31:32
There are other ways
of doing these things.
31:34
Is that you can train a small
model for classifying what
31:36
is PII, removing these things.
31:39
It's hard.
31:40
Every point here that
I'm going to show you
31:42
is a huge amount of
work, but I'm just
31:46
going to go quickly through it.
31:48
So filter undesirable content.
31:50
Second or fourth
is de-duplication.
31:54
As I said, you might have
things like headers and footers
31:57
in forums that are
always the same.
31:59
You want to remove that.
32:01
Another thing that
you might have
32:02
is a lot of URLs that are
different, but actually show
32:05
the same website.
32:08
And you might also have a lot of
paragraphs that come from common
32:13
books that are basically
duplicated 1,000 times
32:16
or 10,000 times on the internet.
32:18
So you have to de-duplicate.
32:20
Also very challenging because
you have to do that at scale.
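A minimal exact-deduplication sketch via hashing; real pipelines also need approximate near-duplicate methods (e.g. MinHash) to work at this scale:

```python
# Exact deduplication: keep only the first copy of each document,
# tracking content hashes instead of full texts to save memory.
import hashlib

def dedup(documents):
    seen, kept = set(), []
    for doc in documents:
        h = hashlib.sha1(doc.encode("utf-8")).hexdigest()
        if h not in seen:          # first time we see this exact content
            seen.add(h)
            kept.append(doc)
    return kept
```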
32:24
Once you do the
de-duplication, you
32:26
will do some
heuristic filtering.
32:28
You will try to remove
low-quality documents.
32:31
The way you do that is with things
like rule-based filtering.
32:35
For example, if you see that
there are some outlier tokens.
32:37
If the distribution of
tokens in the website
32:39
is very different than the
usual distribution of tokens,
32:42
then it's probably some outlier.
32:43
If you see that the length
of the words in this website
32:46
is super long, there's something
strange going on with that website.
32:49
If you see that the website
has only three words,
32:52
is it worth
training on it?
32:54
Maybe not.
32:54
If it has 10 million words,
maybe there's something also
32:58
wrong going on with that page.
33:00
So a lot of rules like this.
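A sketch of what such rule-based filters can look like; every threshold here is invented for illustration, not taken from any real pipeline:

```python
# Heuristic document filtering with made-up thresholds, in the spirit of
# the rules just described (word counts, word length, token distribution).
def passes_heuristics(text):
    words = text.split()
    if not (5 <= len(words) <= 10_000_000):    # too few or absurdly many words
        return False
    mean_word_len = sum(len(w) for w in words) / len(words)
    if mean_word_len > 15:                     # suspiciously long "words"
        return False
    alpha_frac = sum(c.isalpha() for c in text) / len(text)
    return alpha_frac > 0.6                    # drop mostly-symbolic pages
```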
33:01
Yes.
33:02
Why do we filter out
undesirable content
33:04
from our data set instead
of putting it in as,
33:08
like, a supervised loss?
33:10
Can we not just say, here's
this like, hate speech website,
33:14
let's actively try to--
33:17
let's actively penalize
the model for getting it.
33:19
We'll do exactly that,
but not at this step.
33:22
That's where the post-training
will come in.
33:25
In pretraining, the
idea is just to say
33:30
I want to model, kind of, how
humans speak, essentially.
33:34
And I want to remove all
these headers, footers
33:36
and menus and things like this.
33:38
But it's a very good
idea that you just had.
33:41
And that's exactly
what we'll do later.
33:45
Next step,
model-based filtering.
33:47
So once you filter a lot
of data, what you will do--
33:50
that's actually a
very cute trick.
33:51
You will take all
of Wikipedia and you
33:54
will look at all
the links that are
33:56
linked through Wikipedia pages.
33:58
Because probably if something
is referenced by Wikipedia,
34:01
it's probably some
high-quality website.
34:02
And you will train a classifier
to predict whether something
34:07
comes from-- whether a
document comes from one
34:10
of these references
from Wikipedia
34:13
or whether it's
from the random web.
34:15
And you will try
to basically say,
34:17
I want more of the things that
come from Wikipedia references.
34:21
Does that make sense?
34:23
So yeah.
34:24
So you will train a
machine learning model.
34:26
Usually also very simple
models because you
34:28
need to do that really at scale.
34:30
I mean, just think about
the 250 billion pages.
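A sketch of that quality classifier, using scikit-learn purely for illustration; in practice people use similarly simple, fast models (fastText-style) so the filter can run over billions of pages:

```python
# Model-based quality filtering: Wikipedia-referenced pages as positives,
# random Common Crawl pages as negatives, simple linear classifier.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

def train_quality_filter(wiki_ref_docs, random_web_docs):
    texts = wiki_ref_docs + random_web_docs
    labels = [1] * len(wiki_ref_docs) + [0] * len(random_web_docs)
    vec = HashingVectorizer(n_features=2**18)   # fast, no fitted vocabulary
    clf = LogisticRegression(max_iter=1000).fit(vec.transform(texts), labels)
    # Returns an estimated "looks like a Wikipedia reference" probability.
    return lambda doc: clf.predict_proba(vec.transform([doc]))[0, 1]
```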
34:34
Next one, you will try
to classify your data
34:37
into different domains.
34:41
You will say, OK, this is
entertainment, this is books,
34:43
this is code, this is like
these type of domains.
34:46
And then you will try to
either up or down weight
34:51
some of the domains.
34:52
For example, you might say--
34:54
you might see that actually if
you train more on code, then
34:57
actually your model becomes
better on reasoning.
34:59
So that's something that
people usually say in
35:01
a very hand-wavy way.
35:02
If you train your
model more on code,
35:04
actually it helps reasoning.
35:05
So you want to upweight
the coding distribution
35:08
because that helps for general
language modeling skills.
35:11
Books are usually also another
one that people usually upweight.
35:16
Entertainment, they
usually down weight.
35:18
So things like this.
35:19
Of course, you want to do it--
so people used to do it, maybe
35:24
kind of heuristically.
35:25
Now there's entire
pipelines that we'll
35:27
talk about of how to do
these things slightly
35:30
more automatically.
35:33
And then at the end of
training, you usually train--
35:37
after training on all
of this data that we saw
35:40
you usually train on
very high quality data
35:42
at the end of training your
large language model where you
35:46
decrease your learning rate.
35:47
And that basically
means that you're,
35:49
kind of, overfitting your model
on a very high quality data.
35:52
So usually what you
do there is Wikipedia.
35:55
You basically
overfit on Wikipedia
35:57
and you overfit on, like,
human data that was collected.
36:04
The other thing is like
continual pretraining
36:06
for getting longer context.
36:07
I'm going to skip over
all of these things.
36:09
But that's just to give
you a sense of how hard it
36:12
is when people just say I'm
going to train on internet,
36:15
that's a lot of work.
36:17
And, really, we haven't
figured it out yet.
36:19
So collecting data well
is a huge part
36:23
of practical large
language modeling.
36:24
Some might say that
it's actually the key.
36:26
Yes.
36:27
[INAUDIBLE] about data.
36:29
So basic question.
36:30
So usually when you start
with like a petabyte of data,
36:33
after you go through
all the steps,
36:35
what's the typical amount
of data you have remaining.
36:37
And then how large a
team does it typically
36:40
take to go through all the
data steps you talked about?
36:43
Sorry how la-- is your
question how large
36:45
is the data after you filter?
36:46
Yeah.
36:47
After you filter and then
you go through all the steps.
36:49
How large a team do you
need to go through, like,
36:52
all the filtration
steps you mentioned.
36:54
How slow is it or--
36:56
How many people
would you need to be
37:00
able to do this [INAUDIBLE]?
37:02
OK that's a great question.
37:03
I'm going to somewhat
answer about the data.
37:06
How large is the data set
at the end of this slide.
37:10
For number of people that work
on it, that's a good question.
37:15
I'm actually not quite
sure, but I would say, yeah,
37:19
I actually don't
quite know but I
37:22
would say it's probably even
bigger than the number of people
37:25
that work on the tuning of
the pretraining of the model.
37:29
So the data is bigger
than the modeling aspect.
37:34
Yeah, I don't think
I have a good sense.
37:37
I would say probably in the Llama
team, which has 70-ish people,
37:41
I would say maybe
15 work on data.
37:45
Yeah.
37:46
All these things, you don't
need that many people,
37:48
you need a lot of compute also.
37:49
Because for data you
need a lot of CPUs.
37:52
So, yeah.
37:53
And I'll answer
the second question
37:54
at the end of this slide.
37:56
So as I just, kind
of, alluded to really,
37:59
we haven't solved data
at all for pretraining.
38:02
So there's a lot of research
that has to be done.
38:04
First, how do you process
these things super efficiently?
38:07
Second, how do you
balance kind of all
38:09
of these different domains?
38:10
Can you do synthetic
data generation?
38:12
That's actually a
big one right now.
38:14
And because we don't have--
38:16
we'll talk about that
later, but we don't have
38:18
enough data on the internet.
38:20
Can you use multimodal data
instead of just text data?
38:23
And how does that improve
even your text performance?
38:28
There's a lot of secrecy
because, really, this
38:30
is the key of most of the
pretraining large language
38:33
models.
38:34
So for competitive dynamics,
usually these companies
38:39
don't talk about how they
do the data collection.
38:41
And also there's a
copyright liability issue.
38:44
They definitely don't
want to tell you
38:45
that they've trained on
books even though they did
38:47
because if not can sue them.
38:50
Common academic benchmarks.
38:52
So that will, kind of,
answer what you asked.
38:54
It started-- so those
are the smaller ones.
38:57
The names are not
that important,
38:58
but it started from around
150 billion tokens, which are
39:02
around 800 gigabytes of data.
39:04
And now it's around
15 trillion--
39:06
15 trillion tokens,
which is also
39:09
the size of the datasets that
are-- right now the best models
39:12
are probably trained
on that amount of data.
39:14
So 15 trillion tokens,
which is probably,
39:18
I guess, two orders of
magnitude bigger than that.
39:20
So 80E3 gigabytes.
39:23
So that would be around 100
to 1,000 times filtering
39:29
of the Common Crawl,
if I'm not mistaken.
39:32
So, yeah.
39:34
One very famous one is the Pile.
39:37
So this is an academic
benchmark, the Pile.
39:39
And we can just look at what
distribution of data they have.
39:42
It's things like
archive, PubMed Central,
39:46
which is all the biology stuff.
39:50
Here it's Wikipedia, you see
Stack Exchange, some GitHub
39:55
and some books and
things like this.
39:58
Again, this is on
the smaller side.
39:59
So this is-- if we look at here,
this is on 280B so, in reality,
40:03
it's like 100 times bigger
so you cannot have that much
40:05
of GitHub and of Wikipedia.
40:09
In terms of closed
source models.
40:11
Just to give you
an idea, Llama 2
40:14
it was trained on
2 trillion tokens,
40:16
Llama 3 on 15 trillion
tokens, which is currently
40:19
the best model for which we know
how much it was trained on,
40:22
and which is the same as
the biggest
40:26
academic benchmark, which
is 15 trillion tokens.
40:29
GPT4 we don't really know,
but it's probably
40:31
in the same order of magnitude,
probably around that.
40:33
Actually, it's probably
around 13 trillion, from leaks.
40:36
If the leaks are true.
40:39
Great.
40:41
So scaling laws.
40:43
Any other questions on data
before we go to scaling laws?
40:48
Sorry I know I'm giving
you a lot of information,
40:51
but there's a lot into
training, large language models.
40:54
Great scaling laws.
40:56
So the idea is that what people
saw around 2020, or at least
41:01
for a long time, but they've
been able to theoretically show
41:05
it or empirically
show it since 2020,
41:07
is that the more data
you train your models on
41:09
and the larger the models,
the better the performance.
41:12
This is actually pretty
different than what
41:14
you've seen in this class.
41:15
In this class we teach
you about overfitting.
41:17
Overfitting doesn't happen
with large language models.
41:20
Larger models,
better performance.
41:23
It's something that
really took a long time
41:25
for the community who took
this type of class to realize.
41:29
But for the exam,
overfitting exists.
41:33
So, OK, the idea of scaling laws
is that-- given that more
41:38
data and larger
models will always
41:40
give you better
performance, can we
41:42
predict how much better
your performance will
41:46
be if you increase the amount of
data and the size of your model?
41:50
And surprisingly, it works.
41:52
So here you see three plots
from a very famous paper called
41:55
Scaling Laws from OpenAI.
41:57
Here you see on
the x-axis compute.
42:00
So how much did you train--
42:01
like, how much compute did
you spend for training?
42:04
And here you see test loss.
42:05
So this is essentially,
I mean, perplexity,
42:08
but it's your validation loss.
42:09
So it's a log of the perplexity.
42:11
And if you put these
two on log scale,
42:15
then you see that the
performance or the--
42:19
sorry, the scaling
law is linear.
42:22
That means that if you
increase your compute
42:25
by a certain amount, you can say
by how much your test loss will
42:29
actually decrease.
42:30
Same thing with data and
same thing for parameters.
42:33
If you increase
the data set size,
42:35
your loss will
decrease by an amount
42:38
that is somewhat predictable.
42:40
If you increase the
number of parameters,
42:42
the loss will
decrease by an amount,
42:44
which is somewhat predictable.
42:45
This is really amazing.
42:47
Very surprising.
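Concretely, fitting a scaling law usually means assuming a power law, loss(C) = a * C^(-b), which is a straight line in log-log space; a minimal sketch:

```python
# Fit a power-law scaling curve by least squares in log-log space,
# then use it to extrapolate loss at a larger compute budget.
import numpy as np

def fit_scaling_law(compute, loss):
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
    a, b = np.exp(intercept), -slope
    return lambda C: a * C ** (-b)     # predicted loss at a new budget

# Usage (numbers invented): fit on small runs, extrapolate 100x further.
# predict = fit_scaling_law([1e19, 1e20, 1e21], [3.2, 2.9, 2.6])
# predict(1e23)
```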
42:49
I mean, it looks innocuous when
you look at these type of plots,
42:52
but that's crazy because it
means that you can predict
42:55
how well we're going to
perform in two or three years,
42:58
depending on how much
compute we will add,
42:59
assuming that these
things will hold.
43:01
There's nothing
theoretical about it.
43:04
Yes.
43:05
Two things.
43:06
One, what is the loss
that they're using here.
43:08
Is this perplexity?
43:09
So it's-- I said perplexity was
like 2 to the power of the loss.
43:13
So this loss is the log
of the perplexity.
43:17
And then the second
thing is, when
43:19
you increase the
number of parameters
43:21
or you increase the data
set size [INAUDIBLE] data
43:24
[INAUDIBLE] times, doesn't
that just inherently
43:26
increase your compute?
43:27
Like does all of this
[INAUDIBLE] come to just how
43:30
[INAUDIBLE] you [INAUDIBLE]?
43:31
Yes.
43:31
--or something
specific [INAUDIBLE]?
43:32
No, this is a great question.
43:33
So the compute here is actually
a factor of two things, the data
43:37
and the parameter.
43:38
What I'm showing here
is that you can--
43:40
well, actually, we're going
to talk about that in details.
43:42
But basically, if you increase
the number of parameters,
43:44
you should increase the
number of data that you have.
43:48
So you actually don't
go multiple times
43:50
to the same data set.
43:51
No one does epochs
in at least not yet
43:56
because we haven't still
kind of enough data.
43:59
So yeah, this is
all the same trend,
44:01
which is increase
compute decrease loss.
44:04
Yes.
44:06
Have we seen the numbers for
the last two years, or is this
44:09
still holding?
44:10
It is still holding.
44:13
I don't have good
numbers to show you,
44:16
but it is still
holding, surprisingly.
44:20
Yes.
44:21
Is there no evidence that
[INAUDIBLE]
44:23
will ever plateau?
44:25
In theory, we would expect
it to plateau, [INAUDIBLE]?
44:28
No empirical evidence of
plateauing anytime soon.
44:33
Why?
44:34
We don't know.
44:35
Will it happen?
44:37
Probably.
44:37
I mean, it doesn't need
to because it's actually
44:39
in log scale.
44:40
So it's not
as if it has to
44:43
plateau.
44:44
Like mathematically, it could
continue decreasing like this.
44:47
I mean, most people think
that it will probably
44:49
plateau at some point.
44:50
We don't know when.
44:54
So that's-- I'll talk more
about scaling laws now.
44:57
So why are scaling
laws really cool?
44:59
Imagine that I gave you--
45:02
you're very fortunate I gave
you 10,000 GPUs for this month.
45:05
What model will you train?
45:07
How do you even go about
answering that question?
45:09
And I mean, this
is a hypothetical,
45:12
but that's exactly what these
companies are faced with.
45:16
The old pipeline,
which was basically
45:19
tune hyperparameters
on the big models.
45:21
So let's say I have
30 days, I will train
45:24
30 models for one day each.
45:26
I will pick the best one and
that will be the final model
45:30
that I will use in production.
45:32
That means that the model
that I actually used
45:34
was only trained for one day.
45:36
The new pipeline is that you
first find a scaling recipe.
45:40
So you find something that
tells you, for example,
45:43
like one common thing
is that if you increase
45:45
the size of your model, you
should decrease your learning
45:46
rate.
45:47
So you find a
scaling recipe such
45:49
that you know if I increase
the size of my model,
45:52
here's what I should do
with some hyperparameters.
45:55
Then you tune your
hyperparameters
45:57
on smaller models
of different sizes.
46:00
Let's say I will say for
three days, of my 30 days,
46:03
I will train many
different models.
46:05
And I will do
hyperparameter tuning
46:07
on these small models,
each of different sizes.
46:09
Then I will fit a
scaling law and try
46:11
to extrapolate from these
smaller models, which
46:15
one will be the best if I
train it for much longer--
46:20
or sorry, if I train
a larger model.
46:22
And then I will train
the final huge model
46:24
for 27 days instead
of just one day.
46:28
So the new pipeline
is not train things
46:31
or do hyperparameter tuning
on the real scale of the model
46:34
that you're going
to use in practice,
46:35
but do things on smaller
ones at different scales.
46:39
Try to predict how
well they will perform
46:41
once you make them bigger.
46:43
I will give-- I will give you a
very concrete example right now.
46:46
Let's say transformers
versus LSTMs.
46:49
Let's say you have
these 10,000 GPUs,
46:51
you are not sure which
one you should be using.
46:53
Should I be using a
transformer-based model
46:55
or LSTM-based model.
46:56
What I will do is I
will train transformers
46:58
at different scales.
47:00
So here you see different
parameters on the x-axis,
47:02
y-axis is my test loss.
47:04
I will then train different
LSTMs at different scales.
47:08
Once I have these points,
I will see oh it, kind of,
47:11
fits a scaling law.
47:12
I will fit my
scaling law and then
47:14
I will be able to predict if
I had 10 times more compute,
47:18
here's how well I would
perform for the LSTM.
47:21
It's actually slightly
less linear for the LSTM,
47:23
but you can probably try to
predict where you would end up.
47:26
And clearly from this
plot, you would see
47:28
that transformers are better.
47:30
One thing to notice when you
read these type of scaling laws
47:33
is that there are two
things that are important.
47:35
One is really your
scaling rate, which
47:40
is the slope of the-- the
slope of the scaling law.
47:45
The other thing
is your intercept,
47:49
you could start
worse, but actually
47:52
become better over time.
47:53
It just happens that
LSTMs are worse for both.
47:55
But I could show you
another one where things--
47:58
you can predict that actually
after a certain scale
48:01
you're better off using that
type of model than others.
48:04
So that's why scaling laws
are actually really useful.
48:08
Any questions on that?
48:12
Yeah.
48:12
So these are all,
kind of, very--
48:15
how sensitive are these to small
differences in the architecture.
48:18
Like one like
transformer architecture
48:21
versus another
transformer architecture.
48:23
Do you think we have
to fit your own curve
48:26
and, basically, say like oh
scaling laws tell me this should
48:28
be some logarithmic function.
48:31
Like, let me
extrapolate that for
48:33
my own specific architecture.
48:35
Yeah, so usually, for
example, if you're an academic
48:38
and you want to-- now at
least that's pretty recent
48:40
and you want to propose
a new activation.
48:43
That's exactly what you will do.
48:45
You will fit a scaling law,
show another scaling law
48:47
with the standard one,
like, I don't know, GELU,
48:49
and you will say
that it's better.
48:50
In reality, once you start
thinking about it in scaling
48:53
laws terms, you really
realize that actually
48:55
all the architecture
differences that we
48:57
can make, like the small,
minor ones, all they do
48:59
is maybe change a little
bit the intercept.
49:03
But really that doesn't
matter because just
49:05
train it for 10 hours longer or
like wait for the next generation of
49:09
GPUs and these things
are really secondary.
49:12
Which is exactly why I was
telling you originally,
49:14
people spend too much time on
the architecture and losses.
49:17
In reality, these things
don't matter as much.
49:19
Data though.
49:19
If you use good data, you will
have much better scaling laws
49:23
than if you use bad data.
49:24
So that really matters.
49:27
Another really cool thing
you can do with scaling laws
49:29
is that you can ask yourself,
how to optimally allocate
49:33
training resources.
49:35
Should I train larger models.
49:37
Because we saw that it's better
when you train larger models,
49:39
but we saw that it's also
better when you use more data.
49:42
So which one should I do?
49:43
Should I just train on
more data, a smaller model,
49:46
or should I train a
larger model on less data?
49:49
So Chinchilla is a very famous
paper that first showed this.
49:53
The way they did it,
I want to give you
49:55
a little bit of a sense
of what these plots are.
49:58
Here you see training
loss again, and on the x-axis,
50:00
you see parameter differences,
sorry, parameter size--
50:04
number of parameters.
50:04
So the size of the model.
50:06
And here all these
curves are what
50:07
we call ISO flops, which is that
all the models on this curve
50:13
have been trained with the
same amount of compute.
50:17
The way that you do
that is that you train--
50:19
you change.
50:20
Sorry, you vary the number of
tokens that were trained on
50:22
and the size of the models,
but you vary in such a way
50:25
that the total compute
is constant, OK.
50:27
So all these curves that you
see with different colors
50:29
have different amounts of
compute that they were trained with.
50:32
Then you take the best one
for each of those curves.
50:35
Once you have the best one
for each of those curves,
50:38
you can ask-- you can
plot how much flops it was
50:44
and which curve were you
on and how much parameters
50:47
did you actually use for
training that specific point.
50:50
You put that on the log
log scale again and now
50:55
you fit a scaling law again.
50:56
So now I have something
which tells me
50:59
if I want to train a model of 10
to the power 23 flops, here is
51:03
exactly the number of parameters
that I should be using.
51:06
100 B.
51:07
And you can do the same
thing with flops and tokens.
51:11
So now you can predict--
51:13
if I tell you exactly I
have one month of compute,
51:16
what size of model
should I be training?
51:18
Fit the scaling
law, and I tell you.
51:21
Of course that all
looks beautiful.
51:23
In reality like there's a
lot of small things of like,
51:26
should you be counting,
like, embedding parameters,
51:29
there's a lot of complexities.
51:30
But if you do things well,
these things actually do hold.
51:35
So the optimal ratio that
the Chinchilla paper
51:38
found is to use 20
tokens for every parameter
51:42
that you train.
51:44
So if you add one
more parameter,
51:45
you should train your thing on--
your model on 20 more tokens.
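To make the allocation rule concrete, here is a minimal sketch. It assumes the approximation compute = 6 x parameters x tokens (the same rule of thumb used below for Llama 3) together with the 20-tokens-per-parameter ratio; solving C = 6 * N * (20 * N) gives N = sqrt(C / 120).

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20):
    # C = 6*N*D with D = 20*N  =>  C = 120*N^2  =>  N = sqrt(C / 120)
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = chinchilla_optimal(1e23)
print(f"{n:.1e} parameters, {d:.1e} tokens")  # ~2.9e10 params, ~5.8e11 tokens
```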
51:49
So one caveat here is that this
is optimal for training resources.
51:53
So that is telling me: if you
have 10 to the power 23 flops--
51:57
I don't know exactly
how much that is in dollars--
52:00
52:02
let's say I have
$5 million to train
52:05
my best model that
gets the lowest
52:07
loss, what would I train on?
52:09
In reality, these companies need
to think about inference also.
52:12
If you have a smaller model,
they will spend less over time.
52:17
So actually, if you
consider the inference cost,
52:20
you have other papers that
try to show that, it's
52:23
around 150--
52:26
tokens per parameter, because
you prefer having a smaller
52:29
model because over
time you're going
52:32
to actually spend less money
on inference of these models.
52:37
So 150 to 1, that's around what
the best models are trained
52:42
on right now, at least
the ones that are
52:45
used in practice in production.
52:49
Great.
52:51
Any questions on Chinchilla?
52:55
Great.
52:56
Oh sorry.
52:58
In practice, how expensive
is inference for these models
53:01
relative to training?
53:03
Actually, very expensive.
53:05
I will not talk about
inference because that would
53:07
be another entire lecture.
53:09
But just think
about ChatGPT where
53:11
they have I don't know
how much it is now,
53:14
like 600 million
people that use it.
53:18
Like, that's a lot.
53:22
Yeah.
53:23
So it's actually very expensive.
53:24
There's a lot of optimization
you can do for inference though.
53:27
And that's an entire
other lecture.
53:29
I'm going to skip that this
time, but it's very interesting.
53:33
OK, moving on.
53:34
As I said, there are
many things that you
53:36
can answer with scaling laws.
53:38
I just try to give
you two examples,
53:40
but really there
are many things.
53:42
What data do you use?
53:43
What data-mixing
weights do you use?
53:46
The mixtures, that's what
we talked about before.
53:49
What architecture you use,
whether you should make
53:51
your models wider or deeper?
53:54
Should you be
paying for more GPUs
53:56
or actually
collecting more data?
53:58
All these things are
things you can try
54:00
to answer with scaling laws.
54:03
One thing I want to say
is the bitter lesson.
54:05
If you ever heard
of Richard Sutton,
54:08
very famous blog post in
2019, what he realized,
54:12
which I think not
enough people realize,
54:16
I didn't-- definitely did
not realize at that time,
54:19
is that once you see these type
of scaling laws you know that
54:23
the more compute you have, the
better models you will get.
54:26
So with scale, you
will get better model.
54:28
And you also know by
Moore's law or these type
54:30
of variants of Moore's
law that you will always
54:33
have better compute.
54:34
Then the only thing
that matters is just
54:36
to have architectures that
can leverage computation.
54:40
So what matters is basically
systems and data, and less
54:44
so the architecture-- like
the small architecture
54:46
differences, like your
activation and things like this.
54:49
So I think that's one of the
reasons why most of research
54:52
focuses on some things that
for industry matters less.
54:56
And I was one of
those researchers
54:58
for a large part of my career.
55:02
So don't spend time
overcomplicating.
55:04
Do the simple
things, do them well.
55:07
Scale them.
55:08
That's really what OpenAI taught
us with ChatGPT and with all
55:12
the GPTs before.
55:15
OK, I want to give you some back
of the envelope computation.
55:18
So I might be off by
a few factors here,
55:20
but I just want to give you
a sense of how costly it is
55:23
to train some of these models.
55:25
I'll give you an example:
55:26
Llama 3 405B, which is currently
the best open source model that
55:30
you can get.
55:31
It was trained on 15.6 trillion tokens.
55:35
It has 405 billion parameters.
55:37
So now that
you know what
55:39
this optimal tokens-per-parameter
ratio is: here, that's around 40.
55:43
So that's a little bit
more than Chinchilla,
55:45
but less than this
inference-optimal ratio.
55:50
So they went for
training optimality.
55:53
Flops for this model:
55:55
So one simple way
to compute flops
55:57
is 6 times the
number of parameters,
56:00
times the number of
data that you train on.
56:03
So if you do the simple
calculation here,
56:04
it's 3.8e25 flops.
56:07
The reason why this
is important is
56:09
that if you follow the news
a little bit,
56:11
there's an executive order
from Biden that basically
56:13
says that once you train with 1e26
flops, then
56:19
you have special
scrutiny on your models.
56:21
So they went to
2X less than that.
56:23
So they really went
right below this
56:25
to not have special scrutiny.
56:27
So 3.8.
56:28
I might be off by a little
bit, but it's definitely
56:30
under the 1e26.
56:36
So P is the number of parameters and
N is the data-- the number of tokens.
56:41
This is just an approximation.
56:46
Yeah.
56:48
OK.
56:49
Compute: we know that they
trained on 16,000 H100s, and we
56:55
know the throughput
they got.
56:58
So if you do the computation,
it takes around 70 days,
57:02
or 26 million GPU hours.
57:05
At least, that's what my back-
of-the-envelope computation says.
57:08
They actually said that
they use 30 million
57:10
instead of 26 million GPU hours.
57:13
So maybe they had
some challenges.
57:17
I don't really know.
57:18
But if you follow the
simple computation,
57:20
it's around 70 days.
57:22
Cost.
57:24
I mean, it's
hard to approximate,
57:27
but I'm just going to say
it's, kind of, the rent.
57:29
Like, what if I wanted to
rent that many H100s
57:33
for that many days,
how much would I pay?
57:36
A lower bound on
the renting cost of an H100
57:41
is around--
57:42
$2 per hour.
57:43
So if you multiply this
by 26,000,000 hours,
57:48
you get $52 million.
57:50
So they probably
pay less than that,
57:52
but not actually much less
because all these services
57:58
that actually rent GPUs, they
don't make that much money.
58:00
So it's probably slightly
less, but not that much less.
58:04
Now salary: say 50
employees at $500k per year.
58:10
Yeah, that's probably
the right ballpark.
58:12
$25 million.
58:13
So if you put altogether
around $75 million
58:17
for training this llama model.
58:21
I'm probably off
by like 10 million,
58:22
but that's kind
of right ballpark.
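If you want to redo this back-of-the-envelope math yourself, here is a minimal sketch putting the numbers above together; the per-GPU peak throughput and the ~45% utilization are assumptions (utilization comes up again in the systems part).

```python
# Back-of-the-envelope for Llama 3 405B, using the numbers from above.
params = 405e9    # model parameters
tokens = 15.6e12  # training tokens
flops = 6 * params * tokens
print(f"training FLOPs ~ {flops:.1e}")  # ~3.8e25, just under the 1e26 threshold

n_gpus = 16_000
peak_flops_per_gpu = 1e15  # assumed H100 peak in bf16, ~1e15 FLOP/s
utilization = 0.45         # assumed fraction of peak actually achieved
seconds = flops / (n_gpus * peak_flops_per_gpu * utilization)
gpu_hours = n_gpus * seconds / 3600
print(f"~{seconds / 86400:.0f} days, ~{gpu_hours / 1e6:.0f}M GPU hours")  # same ballpark as above

rent = gpu_hours * 2.0  # ~$2/hour lower bound on H100 rent
salary = 50 * 500_000   # 50 employees at $500k/year
print(f"~${(rent + salary) / 1e6:.0f}M total")  # roughly the $75M ballpark
```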
58:27
Carbon emitted.
58:29
A lot of people might ask
like also the cost is not
58:32
the only thing
that is important.
58:33
So I did the computation.
58:35
It's around 4000 tons
of CO2 equivalent.
58:42
That is actually only
2000 return tickets
58:45
from JFK to London.
58:47
So right now carbon
emitted is actually not--
58:51
I mean, it's huge, but
it's not meaningful yet.
58:56
I think in maybe GPT6,
GPT7, once you multiply this
59:01
by 100, that might
become a real issue.
59:04
Right now it's
still not, I think,
59:07
an issue in the grand
scheme of things.
59:09
Next models: the way you should be
thinking about these models is
59:12
that every new generation, the
number of flops essentially
59:16
multiplies by 10x, or at least
that's what they aim for, if they
59:19
have enough energy.
59:20
And if they can buy enough GPUs.
59:23
Great.
59:23
Any question on these
back of the envelope math.
59:29
No.
59:30
OK.
59:31
So now we talked
about pretraining,
59:34
I wanted to also
chat about systems
59:36
because now we know compute
is really important so there's
59:39
a question of how do
you optimize the--
59:41
how do you optimize the compute?
59:43
I will leave that for
the end because I'm not
59:45
sure how much time we will have.
59:46
I think it's important,
but hopefully I'll
59:48
be able to talk about it later.
59:50
It's slightly different
than what we've
59:52
been talking about right now.
59:54
So I'll move on to
post-training for now.
59:56
So the task of
post-training, the reason why
59:59
we need to do post
training is, as I told you
1:00:01
before, it's to
make AI assistants.
1:00:06
So language modeling
is not really the thing
1:00:09
that you want when you
have an AI assistant.
1:00:12
For example, if you
ask GPT-3, which
1:00:14
is a purely language model--
1:00:16
a pure language model,
a non-aligned one.
1:00:20
If you ask the question
"explain the moon landing
1:00:22
to a six-year-old," the
completion that you would get
1:00:26
is something like "explain the theory
of gravity to a six-year-old."
1:00:29
Because what it learned
is that on the internet,
1:00:31
if you have one
question, you usually
1:00:33
have maybe another bullet point
of other similar questions;
1:00:36
you don't usually have a
question and then its answer right after.
1:00:39
This is not what you want
from an AI assistant.
1:00:42
So how do we do this
alignment, which
1:00:46
is this post training and
making these models assistants?
1:00:49
So the goal of this
alignment is to basically get
1:00:52
LLMs to follow the
instructions that
1:00:55
are given by users, and
maybe the designers',
1:01:00
kind of, desires.
1:01:02
So think about motivation.
1:01:04
You don't want the
model-- like OpenAI
1:01:06
doesn't want the model to
say stuff that is very toxic.
1:01:09
So here you see on
the left-hand side
1:01:12
that when you ask a question, it
actually provides a real answer.
1:01:15
So it's not like before the LLM.
1:01:17
And on the right-hand side,
you see that it would--
1:01:20
if you ask to write a tweet
describing how a certain part
1:01:25
of the population are evil, it
will say that it cannot do that.
1:01:29
So that's kind of
this alignment.
1:01:32
The background here is
that basically the data
1:01:38
that you want for training
some of these models is--
1:01:41
like, we know what we want.
1:01:42
Which is just asking
humans, this is a question,
1:01:44
this is the answer
that you want.
1:01:46
But the thing is that it's very
expensive to collect that data,
1:01:48
and it's hard to find it online.
1:01:51
In contrast, pretraining
data is not what you want,
1:01:54
but there's a lot of it.
1:01:56
So what we will do, or
the main idea is simply
1:01:59
take a pretrained
large language model
1:02:01
pretrained on all of internet
and then just fine tune.
1:02:03
So you just change a little bit
the weights on the type of data
1:02:06
that you actually want.
1:02:07
And hopefully given
it, you already
1:02:08
pretrained it on
all of internet,
1:02:10
it basically learns or knows
how to speak in English
1:02:13
and knows standard
language syntax
1:02:18
then you can really fine tune
it with very little data.
1:02:23
OK, SFT.
1:02:24
So Supervised Fine Tuning is
really exactly what I just said.
1:02:27
Which is the idea of
fine-tuning the large language
1:02:29
model on basically the
desired answers that
1:02:33
are collected from humans.
1:02:35
So why is it called
supervised fine tuning?
1:02:37
Because you basically want to
do language modeling on the real
1:02:41
answers.
1:02:41
So language modeling is this
like next word prediction,
1:02:44
and that's the fine tuning part.
1:02:45
And then you want to do it on
desired answers given by humans
1:02:48
so that's why we
call it supervised.
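As a minimal sketch of that loss: it is the ordinary next-token cross-entropy, just restricted to the answer tokens. The `model(...).logits` interface and the exact masking convention here are assumptions for illustration.

```python
import torch.nn.functional as F

def sft_loss(model, input_ids, prompt_len):
    # input_ids: (batch, seq) prompt tokens followed by human-written answer tokens.
    logits = model(input_ids).logits          # (batch, seq, vocab)
    shift_logits = logits[:, :-1, :]          # predict token t+1 from tokens <= t
    shift_labels = input_ids[:, 1:].clone()
    # Mask the prompt positions so the loss only covers the desired answer.
    shift_labels[:, : prompt_len - 1] = -100  # -100 is ignored by cross_entropy
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```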
1:02:51
So how do we collect this data?
1:02:52
Well, I just said it.
1:02:54
You just ask humans
to tell you this
1:02:57
is a question this is
the answer that you would
1:02:59
want from some of these models.
1:03:00
So this is an example.
1:03:03
I can't read very
well on my computer,
1:03:04
but my kid needs
to do a science--
1:03:08
no let's read this one.
1:03:09
Can you write a
short introduction
1:03:11
about the relevance
of the term monopsony?
1:03:13
And then it says monopsony
refers to a market
1:03:15
structure, blah blah, blah.
1:03:16
And that's a human
that wrote that.
1:03:19
So, actually, this
is Open Assistant,
1:03:20
which was a way to collect
data online by humans.
1:03:27
So this type of supervised
fine tuning or alignment
1:03:31
is really the key of ChatGPT.
1:03:33
This is what made the big jump
from GPT 3, which was mostly
1:03:37
something that was
known by AI researchers
1:03:40
to ChatGPT, which became
known by basically everyone.
1:03:46
So the problem
with human data is
1:03:51
that it's very slow to
collect and very expensive.
1:03:56
So one possible
simple idea is to use
1:04:00
LLMs to scale data collection.
1:04:03
So that's exactly what we
did with Alpaca one year ago.
1:04:06
What we did is that
we started from humans:
1:04:09
we used a data set of
human question-answer pairs.
1:04:11
So there were 175
question-answer pairs here,
1:04:15
and we asked the best
model at the time,
1:04:16
text-davinci-003, to basically
generate many more of these
1:04:21
questions and answers.
1:04:22
So all we did is say: this is
what humans would write; now
1:04:25
write similar questions
and similar answers.
1:04:27
And we collected 52,000
LLM-generated question-answer pairs.
1:04:32
And then what we did is
simply we took llama 7B,
1:04:34
which was the best
pre-trained model at the time.
1:04:36
And we just fine tuned this
with supervised fine tuning,
1:04:39
as I told you.
1:04:39
And that's how we got
the Alpaca 7B model.
1:04:44
And this is the type of
data that we collected.
1:04:47
So things like what
does algorithm mean?
1:04:49
And algorithm is a step by
step set of instructions
1:04:53
you use to solve a problem or
achieve a goal, blah, blah,
1:04:55
blah, blah.
1:04:56
So the data is not bad actually--
it's actually pretty good,
1:04:58
given that it was generated
by LLMs from essentially two
1:05:02
generations ago.
1:05:04
So that really started
at least for us
1:05:07
as an academic
replication of ChatGPT.
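The generation loop looks roughly like this — a heavily simplified sketch, where `generate` stands in for the call to the strong LLM and the prompt wording is invented, not Alpaca's actual prompt:

```python
import random

# Toy seed pairs standing in for the 175 human-written ones.
seed_tasks = [
    ("What does algorithm mean?", "An algorithm is a step-by-step set of instructions..."),
    ("Define monopsony.", "A monopsony is a market structure with a single buyer..."),
]

def generate(prompt: str) -> str:
    # Placeholder for a call to a strong LLM (text-davinci-003 in Alpaca's case).
    return "Q: <new question> A: <new answer>"

synthetic = []
while len(synthetic) < 52_000:
    # Show the model a few human examples and ask for a new, similar pair.
    examples = random.sample(seed_tasks, 2)
    prompt = "Here are some example question-answer pairs:\n"
    for q, a in examples:
        prompt += f"Q: {q}\nA: {a}\n"
    prompt += "Now write one new, similar question and its answer."
    synthetic.append(generate(prompt))
# The generated pairs are then used for supervised fine-tuning of the base model.
```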
1:05:10
Now it really--
there's a big field
1:05:12
of synthetic data
generation of how
1:05:15
to use LLMs to basically make
development of LLMs faster.
1:05:21
And basically by decreasing
the amount of human hours that
1:05:24
you need.
1:05:26
Quantity of data.
1:05:28
So we talked about what type
of data and how we collect it.
1:05:31
One thing which is
surprising with SFT
1:05:33
is that you don't
need that much data.
1:05:36
So what this paper showed--
this is called LIMA--
1:05:38
is that if you scale the amount
of data that you use for
1:05:43
supervised fine-tuning
from 2,000 to 32,000,
1:05:46
it really doesn't help much.
1:05:47
So here scaling laws
definitely don't help.
1:05:49
And so the intuition here
is that all you learn
1:05:55
is you learn how to format
your desired answers.
1:05:58
Another way of saying it is that
your pre-trained models, they
1:06:02
essentially model the
distribution of every user
1:06:04
on internet, one that
might write bullet points,
1:06:07
another one that might
answer question-- answer
1:06:09
question with an answer.
1:06:10
So all you tell your
model is like, wait,
1:06:13
you should actually
be optimizing
1:06:14
more for this type of
user than another one.
1:06:17
So you're not
actually teaching it--
1:06:18
you're not teaching anything
through this SFT, so
1:06:23
supervised fine
tuning, all you do
1:06:25
is you tell the model to
optimize for one type of user
1:06:28
that it saw already in
the pretraining data set.
1:06:30
So the knowledge is already
in the pretrained LLM
1:06:33
and you basically just
specialize to one type of user.
1:06:37
Great.
1:06:38
Any question on SFT?
1:06:40
Yes.
1:06:41
So I know it's a big
issue with synthetic data
1:06:45
where if you keep generating
data from the same distribution,
1:06:49
eventually you're not
learning a new distribution,
1:06:51
you're essentially
playing with it.
1:06:52
Just bootstrapping that.
1:06:53
Yeah.
1:06:55
Surely you can't scale
that forever, right.
1:06:57
You can't keep going
on and generating
1:06:59
from the same distribution.
1:07:00
You'd hope to learn
something new.
1:07:01
Yeah.
1:07:02
So are there-- it's an
active area of research
1:07:05
but any thoughts
that you have around
1:07:06
how people are maybe thinking
around this and better ways
1:07:10
to bootstrap?
1:07:11
Or to give up on this idea and
realize that the chart shows
1:07:15
you don't need that many so
just get humans to generate
1:07:17
2000 really good prompts.
1:07:19
Yeah.
1:07:20
So that's a very good question.
1:07:21
So for the data
stuff, so I'm saying
1:07:23
it's not that important
for SFT, but there
1:07:25
will be another thing we'll talk
about right after where actually
1:07:28
data does matter.
1:07:29
My intuition, based on not
that many empirical results,
1:07:33
is that you can still get gains
even though you use LLMs.
1:07:38
But if you use purely
LLM-generated text,
1:07:40
and you do that for like three
or four generations of LLMs,
1:07:43
I agree with you that probably
you won't improve much.
1:07:45
But for me what is important is
how do you use human in the loop
1:07:48
with LLMs?
1:07:49
Not purely LLMs,
not purely humans,
1:07:53
but maybe what
you can do is just
1:07:54
have the model
regenerate some new text
1:07:56
and just humans
write a few edits.
1:07:59
Edits are much faster than
writing the entire text.
1:08:01
And I think that if you have
that type of collaboration,
1:08:04
then from an information
theoretical point of view,
1:08:07
you still get
additional information,
1:08:09
but you're still much faster
than if you use humans.
1:08:11
And I think that
as a field we'll
1:08:13
probably move towards these
type of things, which is really
1:08:17
just finding the examples that
are important and asking humans.
1:08:20
It's kind of active
learning, just
1:08:22
asking humans exactly when
you need to get their inputs.
1:08:28
Yes.
1:08:28
Do we train with the
same loss function
1:08:30
and the same general
training algorithm
1:08:32
for the supervised
fine tuning bit
1:08:34
as we do for the pretraining?
1:08:36
Because the examples
you showed, I
1:08:39
think the important thing
of the good examples
1:08:43
is like super
factually accurate.
1:08:45
Like there's these
more complex things
1:08:46
and it's still just
like [INAUDIBLE].
1:08:48
Same loss.
1:08:49
So that's why here--
1:08:50
yeah, I didn't-- maybe
didn't emphasize enough.
1:08:52
This is just language modeling.
1:08:53
Fine tune the LLM with language
model and the desired answers.
1:08:56
So this is literally
the same loss.
1:08:59
It will be different
in two seconds,
1:09:01
but the first step
of SFT is literally
1:09:04
the same loss where
you just say, OK, I
1:09:06
want to actually specialize
on that type of data.
1:09:08
So there's even a question
of what is pretraining,
1:09:10
what is post-training?
1:09:11
Because, in reality, it's
just like different data
1:09:13
that you use.
1:09:13
The reason why we usually call
it post-training is that the way
1:09:16
we collect that data
is very different.
1:09:18
Great, great questions.
1:09:20
Yes.
1:09:22
Maybe it's the same
question, but why would
1:09:24
these 2000 examples have
such an overweighted influence
1:09:28
on fine tuning?
1:09:30
So that's why we--
1:09:31
also that's another reason
why we call it post-training
1:09:33
is that we use different
types of hyperparameters.
1:09:35
So, I told you
basically at the end
1:09:37
of pretraining you
essentially end up
1:09:38
with a learning rate of 0.
1:09:40
Here, you're going to
increase your learning rate.
1:09:42
So like 1e minus
5, 1e minus-- yeah.
1:09:44
And so the weight that you give
to this data is actually different.
1:09:52
OK.
1:09:54
Second step or second
part of this post training
1:09:57
is what we call
reinforcement learning
1:10:00
from human feedback or RLHF.
1:10:02
Some of you might
have heard of that.
1:10:05
The idea is that SFT has
a problem, namely that you
1:10:09
do behavioral cloning, which
means that you just try to clone
1:10:12
what the humans would say.
1:10:14
And that has many issues.
1:10:16
One of them is that you're
bound by human abilities.
1:10:19
So humans actually
won't generate the things
1:10:26
that they think are actually
the best things to generate.
1:10:28
So if you ask me
to write a book,
1:10:30
I mean, I can definitely
enjoy your book.
1:10:32
I can probably say one book
is better than another,
1:10:34
but I'm definitely not going to
be as good as writing the book
1:10:37
that I want to read.
1:10:37
So you're going to be
bound by the human ability
1:10:39
to generate things, even though
the humans might be better
1:10:42
at distinguishing
between things.
1:10:43
That's one issue.
1:10:44
Issue number two, which I find
actually pretty interesting,
1:10:47
is this:
1:10:49
if you've ever heard of the
word hallucination-- so this
1:10:51
is LLMs generating fake,
like, false information.
1:10:55
Hallucination might--
at least people
1:10:57
have hypothesized that can come
from the supervised fine tuning
1:11:02
even if you do supervised fine
tuning on data that is correct.
1:11:06
And the reason why
that is:
1:11:09
I told you that basically
SFT is done with very little data,
1:11:13
and it's data from
which the model
1:11:15
doesn't learn anything new.
1:11:17
So what if the human gives an
answer that the model didn't
1:11:21
know was true.
1:11:23
From the model's perspective,
the human basically
1:11:26
is telling the model: generate
this thing that seems plausible,
1:11:30
even though you actually have no
idea if it's true or not.
1:11:34
So just to give you a
very concrete example,
1:11:36
if we go back to this
monopsony example,
1:11:39
can you write blah blah
blah about monopsony?
1:11:41
Imagine that the human wrote a
reference on this type of book.
1:11:46
And that book might exist.
1:11:47
That might be a
correct reference,
1:11:49
but what if the LLM
never saw this reference
1:11:51
during pretraining.
1:11:52
Then it doesn't know that
it's a correct reference.
1:11:54
So really what
you tell the model
1:11:56
is to generate or make up some
plausible sounding reference
1:12:00
rather than actually
tell the real reference
1:12:03
that it saw during pretraining.
1:12:05
So hallucination might
be caused by this SFT.
1:12:12
So that's problem number two.
1:12:14
Does that all make sense?
1:12:15
Great.
1:12:16
Problem number 3, price.
1:12:18
Generating the ideal
answers is very pricey.
1:12:21
And that comes back
to your question
1:12:23
of humans writing the
entire answer is actually
1:12:26
pretty expensive.
1:12:28
So that's why RLHF comes in.
1:12:30
The idea is that instead of
cloning the behaviors of humans,
1:12:34
we're going to maximize
human preference.
1:12:37
And the way we're going to
do that, so the pipeline,
1:12:39
is that for a certain--
for every instruction,
1:12:42
you're going to ask a model
to generate two answers
1:12:45
and you usually use a
pretty good model.
1:12:48
So you usually don't use a base LLM
here; you use an SFT fine-tuned model--
1:12:52
an already fine-tuned LLM--
to get pretty good answers.
1:12:56
And then you ask labelers which
of these two answers was better?
1:13:01
So select the preferred one.
1:13:02
And then with different
types of algorithms,
1:13:05
we're going to talk about
the algorithms, you just fine
1:13:07
tune the model to generate
more of the green thing
1:13:10
than the red thing.
1:13:10
So more of the good stuff.
1:13:12
So now the question
is how and we're
1:13:14
going to talk about
that right now.
1:13:17
So there are two ways that
we're going to talk about
1:13:20
and two that are mainly
use in the community.
1:13:23
The first one is simply the idea
of using reinforcement learning.
1:13:26
So hopefully you all know what
reinforcement learning is now.
1:13:30
So when you think about
using reinforcement learning,
1:13:33
one important question is
like, what is the reward
1:13:35
that we're optimizing.
1:13:36
So in this case, there
are really two options
1:13:38
that I could think about.
1:13:39
The first one, you
could just say,
1:13:41
I'm going to compare the output
generated by some baseline,
1:13:44
the output generated
by my model.
1:13:46
And I'm just going to ask the
human to say which one is better
1:13:49
and I'm going to use
this as a reward.
1:13:51
So if I'm better
than the baseline,
1:13:53
this is a plus 1, if
not, it's a minus 1.
1:13:55
So now it's binary reward.
1:13:57
The problem with binary reward
is that it's very sparse
1:13:59
and you don't get much
information out of it.
1:14:01
Like maybe your answer
was slightly better,
1:14:04
maybe it was like way
better and you don't really
1:14:07
know from this how
much better it was.
1:14:10
So option 2 is
that you can train
1:14:13
what we call a reward model,
which is simply a classifier.
1:14:16
So you use machine
learning to classify
1:14:19
how much better two outputs
are from the preference--
1:14:24
from the perspective
of the human.
1:14:26
So this is a little bit
meta, but what you basically
1:14:29
do is that you
take a reward model, which
is also just a large model--
1:14:37
a large classifier-- and you
basically ask this reward model:
1:14:41
you give it the input
and the actual output,
1:14:43
one
of the two outputs.
1:14:45
You exponentiate its reward--
so that's the softmax loss
1:14:49
that you all know about--
1:14:50
and now you divide by
the sum of the exponentiated rewards
1:14:56
1:14:58
on the
first output and
1:15:00
on the second output.
1:15:01
1:15:02
And the reason why you do that
1:15:05
is that you train this
reward model to be
1:15:07
able to classify how much better
one output is than another one.
one output is to another one.
1:15:13
So another slightly less
convoluted way of saying it
1:15:16
is that your reward
model will output
1:15:19
some reward that will be used
as the logits of your softmax.
1:15:22
So now if you have high
logits in your softmax,
1:15:25
it means that it's highly
likely this output is better.
1:15:32
So that's what we call
Bradley-Terry model.
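As a minimal sketch, with `r_chosen` and `r_rejected` the scalar rewards the reward model assigns to the human-preferred and dispreferred outputs, the Bradley-Terry loss is a two-way softmax over the rewards — equivalently a logistic loss on their difference:

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen, r_rejected):
    # Train exp(r_chosen) / (exp(r_chosen) + exp(r_rejected)) to match the
    # human picking the chosen output; same as -log sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: rewards for a batch of three preference pairs.
r_good = torch.tensor([1.2, 0.3, 2.0], requires_grad=True)
r_bad = torch.tensor([0.1, 0.5, -1.0])
loss = bradley_terry_loss(r_good, r_bad)
loss.backward()
```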
1:15:34
Yes.
1:15:35
Will this reward
model [INAUDIBLE]
1:15:36
lower the entire output, or
is it going to [INAUDIBLE]?
1:15:40
So this takes the entire--
1:15:45
yeah, this takes the
entire output at once.
1:15:46
So it takes all the
input and all the output
1:15:48
and it gives one number.
1:15:50
Yes.
1:15:51
So [INAUDIBLE] reward model,
where would the human be then?
1:15:55
Sorry.
1:15:55
With the reward model,
where would the human be?
1:15:58
Like--
1:15:58
I see.
1:16:00
OK sorry.
1:16:01
Maybe I wasn't clear.
1:16:02
You train this reward model
to fit this green and red
1:16:08
preference from humans.
1:16:09
So basically you
train a classifier
1:16:11
to say whether the humans
prefer red or green.
1:16:15
But instead of using
the binary reward, which
1:16:18
is what the human would
tell you you basically use
1:16:20
the logits of the softmax.
1:16:23
And the thing with the logits
is that logits are continuous.
1:16:26
So now you know that if
your reward model said
1:16:29
it has high logits,
then, in some ways,
1:16:31
the human highly preferred this
answer to some other answer.
1:16:36
Great.
1:16:38
So as I just said, continuous
information is better.
1:16:41
So that's what people use
in practice or at least
1:16:44
used to use in practice.
1:16:45
I'll tell you about the
other algorithm later.
1:16:48
So what do you do at the
end is that you basically
1:16:50
try to just use reinforcement
learning that you know about.
1:16:53
Now we know we have a reward.
1:16:55
What you sample through
is the generation
1:16:58
from your large language model.
1:16:59
And then you just use
some regularization term.
1:17:02
So the reason why we do
this regularization term
1:17:04
is for avoiding what we
call overoptimization.
1:17:06
So this reward
model might not be
1:17:08
really represent--
might not perfectly
1:17:10
model human preferences.
1:17:12
So you don't want to
maximize this thing
1:17:14
to essentially infinity.
1:17:17
And you do it using a PPO,
which is a common reinforcement
1:17:22
learning algorithm.
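Written out, the regularized objective described here — with pi_theta the model being trained, pi_ref the fine-tuned model it starts from, r_phi the reward model, and beta the regularization strength — is:

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\big[\, r_\phi(x, y) \,\big]
\;-\; \beta\,
\mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
```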
1:17:24
One thing to note here, because
it will be important for later,
1:17:27
is that when we use
maximum likelihood--
1:17:32
sorry, now the large
language models
1:17:34
are actually a policy for
your reinforcement learning.
1:17:38
It's not maximizing
maximum likelihood anymore.
1:17:41
Which means that you're not
modeling any distribution
1:17:43
anymore.
1:17:43
And the reason why
this is important
1:17:45
is that models that went
through this type of PPO
1:17:48
actually don't give
you likelihoods
1:17:51
of text that are meaningful.
1:17:52
Because what you
optimize them to do
1:17:54
is basically just
optimize for generating
1:17:56
the most likely thing,
not optimize for modeling,
1:18:00
all the answers that
humans might say.
1:18:02
Another way of saying
that is that there's
1:18:04
nothing that incentivizes
here the model to not give
1:18:09
a single possible generation.
1:18:11
Nothing here says it's good
if you have some distribution
1:18:15
with some entropy.
1:18:18
If you haven't followed, it's
not that important but just good
1:18:20
to know.
1:18:22
Great.
1:18:23
So PPO is exactly what
ChatGPT did originally.
1:18:27
So here is on their
blog post on what
1:18:30
they have is step one do
supervised fine tuning, which
1:18:33
now you all know about.
1:18:34
Step two, train a reward
model on human preferences.
1:18:38
Step three, do PPO
multiple steps,
1:18:40
which is where you
see this blue arrow.
1:18:43
So you continue-- you train
the model once with the PPO,
1:18:45
you collect new
data, you continue.
1:18:47
And that's why-- and that's
exactly what ChatGPT did.
1:18:50
And that was the
big breakthrough
1:18:52
between GPT 3 and ChatGPT.
1:18:55
One thing to note is that
PPO has many challenges.
1:18:58
Reinforcement learning
is something that
1:19:00
is super nice theoretically.
1:19:02
In practice, anyone
who ever worked
1:19:03
with reinforcement learning
knows it's such a mess.
1:19:06
There's a lot of things
like rollouts, outer loops,
1:19:09
clipping-- so many complications.
1:19:11
So it's messy.
1:19:13
This is the idealized PPO
used for LLM settings,
1:19:15
so that's already
much more complicated
1:19:17
than this expectation
we saw before.
1:19:19
And in practice it's actually
much more complicated.
1:19:21
So we have one implementation
of it that we had to do,
1:19:23
and I'm not going
to go through it.
1:19:25
But basically there's so
much stuff that you
1:19:27
have to think about
when you implement
1:19:29
that type of PPO algorithm.
1:19:31
So you have clipping everywhere,
you have a lot of complexities
1:19:34
and things are not
well documented.
1:19:37
All this to say: there
was a new method that
1:19:41
was proposed also from
Stanford one year ago
1:19:44
called DPO, which is essentially
a simplification of PPO.
1:19:49
And the way-- what they did
or the idea that they have
1:19:53
is that instead of using
reinforcement learning,
1:19:56
you can just maximize the
probability of generating
1:19:58
the stuff that you
like and minimizing
1:20:00
the probability of the
stuff that you don't like.
1:20:02
So if you think about the human
preference, the red and green,
1:20:05
maximize green, minimize red.
1:20:08
So the loss is actually
this one, where what you see
1:20:12
is simply the log-
likelihood of the model.
1:20:16
So this is the likelihood of
a model generating the things
1:20:19
that the human preferred,
given the inputs.
1:20:23
And what you try
to do is basically
1:20:25
maximize the likelihood of
generating the things that you
1:20:30
like, minimize the likelihood of
the things that you don't like.
1:20:33
All the rest of the terms
here it's not too important.
1:20:36
It's actually really not that
complicated to understand.
1:20:39
But at a high level, it's really
just maximizing the things
1:20:42
you like, minimizing the rest.
1:20:45
And one thing to note, which
I was going to say just here,
1:20:49
is that actually all
the rest is chosen such
1:20:51
that the global minima of
PPO and the global minima
1:20:56
of like this DPO,
under some assumptions,
1:20:59
are essentially equivalent.
1:21:01
So this is the right thing
to do mathematically.
1:21:04
I'm not going to go
through the derivations,
1:21:06
but that's the
right thing to do.
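For reference, a minimal sketch of that loss, assuming you have already computed the summed log-probabilities of the preferred and dispreferred answers under the model being trained and under a frozen reference model (typically the SFT model), with beta the same kind of regularization strength as before:

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Push up the preferred answer and push down the dispreferred one,
    # measured relative to the reference model.
    chosen_ratio = logp_chosen - ref_logp_chosen        # log pi/pi_ref on preferred y
    rejected_ratio = logp_rejected - ref_logp_rejected  # log pi/pi_ref on dispreferred y
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```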
1:21:08
It's pretty different with
PPO in the sense that now--
1:21:10
with PPO, what you had to do is
collect the human preferences,
1:21:13
then train a reward model
with maximum likelihood,
1:21:16
then use reinforcement learning.
1:21:17
Now all you do is basically
maximum likelihood.
1:21:19
Much simpler.
1:21:20
Yes.
1:21:21
I mean, yeah.
1:21:21
So it seems like this is A,
much simpler and B, like,
1:21:24
what you would just intuitively
do with [INAUDIBLE]?
1:21:27
Why did they start
with this reward model.
1:21:29
Like what led them doing that?
1:21:31
I think it's a great question.
1:21:33
I don't really know.
1:21:34
What I can tell you is that.
1:21:35
At ChatGPT the people
who did basically
1:21:41
this PP-- sorry, who
did ChatGPT initially
1:21:44
are the ones who
actually wrote PPO.
1:21:47
And I think they
were just-- like,
1:21:48
there are a lot of
reinforcement learning people.
1:21:50
And I think that for them
it was very intuitive.
1:21:54
So there's also some
additional potential benefits.
1:21:58
For example, I don't want to--
1:22:00
yeah, for example, if
you use the reward model,
1:22:03
the cool thing here with
reinforcement learning
1:22:04
is that you can use unlabeled
data with the reward model.
1:22:08
So for DPO, you can only use the
labeled data--
1:22:12
while for PPO, you first
train your reward model
1:22:15
and then you can
use unlabeled data
1:22:18
where the reward
model will basically
1:22:19
label this unlabeled data.
1:22:21
So this additional,
kind of, potential--
1:22:25
there could be
potential improvements.
1:22:26
In practice it happens
that there are none.
1:22:29
And I think just that a
lot of people in this team
1:22:32
were reinforcement
learning experts, including
1:22:35
the main author of
PPO, John Schulman.
1:22:39
So much simpler than PPO, and
it basically performs as well.
1:22:43
So now this is the standard
thing that people use.
1:22:46
At least in the open
source community,
1:22:47
I believe it's actually the
standard also in industry.
1:22:51
So that's called DPO.
1:22:53
Gains: so those are all
the papers on the left.
1:22:57
Here this is on the
summarization task.
1:22:59
You see, all I
want to show you is
1:23:01
that basically the
pretrained models were OK
1:23:04
and they improve of scale.
1:23:05
If you do supervised
fine tuning,
1:23:07
you improve them
a little bit more,
1:23:08
if you do PPO or something
with RLHF human feedback,
1:23:12
you get performance
that are, oftentimes
1:23:15
depending on a benchmark,
even better than humans.
1:23:18
So this is the human
reference summaries.
1:23:21
Same thing.
1:23:22
This is from a paper that
we have, AlpacaFarm, where--
1:23:25
the evaluation
here is not too important--
1:23:27
you basically see the
pretrained model.
1:23:29
You jump to SFT, and then you
jump to PPO and DPO. And PPO and
1:23:33
DPO have the exact
same performance.
1:23:36
So basically RLHF helps.
1:23:38
That's, kind of, the
conclusion and DPO is simple.
1:23:42
Data.
1:23:43
The way that you collect
that type of data.
1:23:46
First idea is just use humans
as we already talked about.
1:23:51
Guidelines are very
complicated for what
1:23:53
humans should be labeling,
and it's really not that easy.
1:23:55
And actually, if you ever
do some of the labeling,
1:23:58
you will see that it's
extremely complicated.
1:24:01
Like, if I zoom in on this.
1:24:03
Here, I have a question tell
me about self-driving cars.
1:24:07
And you read both
self-driving cars
1:24:09
are vehicles that are
capable of detecting
1:24:10
the surroundings,
blah, blah blah, blah.
1:24:12
Self driving cars are
cars that are equipped
1:24:13
with sensors, blah
blah, blah to navigate
1:24:15
without the need for a driver.
1:24:16
I mean, both seem OK.
1:24:18
Which one is better?
1:24:19
It's actually hard
to say at a glance.
1:24:21
And as a result, the
problem with humans
1:24:24
is that you will
start optimizing
1:24:27
for a lot of surface-level features.
1:24:28
For example, the
second one is longer.
1:24:30
I can guarantee you that
most humans will choose
1:24:32
the second one,
even though I mean,
1:24:34
maybe the first one is better.
1:24:35
I don't know.
1:24:36
I haven't read it carefully.
1:24:38
So challenges of humans.
1:24:39
First, slow and expensive.
1:24:42
Second, as I just mentioned,
it's hard to focus on things
1:24:46
that matter, like correctness.
1:24:47
And people usually
look at things
1:24:49
that don't matter as much
like the form, like length.
1:24:53
And as a result,
so what I show here
1:24:55
is that when you do RLHF,
the more you do RLHF,
1:24:58
the longer the output
of the models become.
1:25:01
So if you've ever been
annoyed at ChatGPT
1:25:03
answering you super
long sentences,
1:25:05
this is because of RLHF.
1:25:08
Annotator distribution shift.
1:25:11
Like the distribution
of annotators
1:25:12
that you use matters a
lot, and you have to think,
1:25:15
like, who even are the
humans that we want
1:25:17
to represent in these models?
1:25:20
Another question is
crowdsourcing ethics.
1:25:22
Like usually these--
basically a lot
1:25:25
of the labeling that is
done, the people who do them
1:25:29
are not paid well
and they have to go
1:25:31
through a lot of toxic
data because you basically
1:25:33
want the model to avoid
saying the toxic data.
1:25:36
So crowdsourcing ethics too.
1:25:40
So many challenges
with human data.
1:25:43
So what we did, also
last year, is again,
1:25:46
the same thing as Alpaca, just
the idea of like oh well, there
1:25:48
are challenges
with humans, maybe
1:25:50
we can just replace
them with LLMs.
1:25:51
So what we did is
simply replace--
1:25:55
I see that.
1:25:56
I'm just realizing that the
slides are not centered.
1:25:58
Anyway, you replace human
preferences with LLM preferences.
1:26:02
So here, on this figure, you
see on the x-axis, the price
1:26:06
that we paid for
collecting human data.
1:26:09
It's around $300
for 1,000 examples.
1:26:12
And this is with Mechanical
Turkers, which are usually
1:26:15
like cheaper than maybe
some of the other companies
1:26:19
that you could go through.
1:26:20
And on the y-axis,
it's basically
1:26:22
the agreement with other humans,
with the mode of other humans.
1:26:27
And what you see is that
actually, as I told you before,
1:26:29
labeling is really complicated.
1:26:30
Humans agree with
themselves only around 66%
1:26:34
of the time on a binary task.
1:26:36
And it's not that the
humans are not good
1:26:38
here because we were five
main authors on this paper.
1:26:41
We tried to label
this data ourselves,
1:26:43
and we only had, like, 67 or
68% accuracy, even though we
1:26:47
talked-- like we talked
for like three hours about how
1:26:50
we should be doing labeling.
1:26:51
But really, it's complicated.
1:26:52
It's not an easy task.
1:26:54
And here I just showed
many different models.
1:26:56
And, basically, you see that
models are much cheaper,
1:26:59
and they can actually
get higher agreement
1:27:01
with the mode of humans
than humans themselves.
1:27:04
And the reason why is because
humans have a lot of variance,
1:27:06
models have no variance.
1:27:08
So models might be a
little bit more biased,
1:27:09
but they have less variance.
1:27:11
So it works surprisingly well.
1:27:13
And now it's, kind
of, the standard
1:27:14
in open source community.
1:27:16
I think even in
industry a lot of people
1:27:18
use both humans and
LLMs for improving
1:27:21
the collection of RLHF data.
1:27:24
And this is like-- this is
the paper from last year,
1:27:27
but honestly, now it's more like
the LLMs would be around this
1:27:30
agreement, and
the cost is around,
1:27:32
I would say, 50x less than humans,
with better agreement with humans
1:27:36
than humans themselves.
1:27:39
OK.
1:27:39
So that gets us to
evaluation of post training.
1:27:45
That goes back to
your initial question
1:27:46
at the beginning of the lecture.
1:27:48
How do you evaluate
something like ChatGPT?
1:27:50
The answers that GPT could
give are basically unbounded.
1:27:54
And it's not that
there's one right answer,
1:27:56
there are many answers
that are just as good.
1:27:59
So there are many challenges.
1:28:00
One, you can't use
validation loss
1:28:03
because one method
might use PPO,
1:28:06
the other one might use DPO.
1:28:07
Validation loss
is not comparable.
1:28:08
Second, you can't use--
1:28:10
sorry, perplexity.
1:28:11
That's the thing
I told you before.
1:28:13
These models are not calibrated.
1:28:16
They don't give distributions.
1:28:17
They just optimize
for one thing.
1:28:19
So you can't use perplexity for
actually evaluating these type
1:28:22
of models once they aligned--
1:28:24
sorry, once they're aligned.
1:28:26
Third, there's a large
diversity of questions
1:28:29
that humans might
ask to these models.
1:28:31
Generation, open QA, some question
answering, some summarization,
1:28:35
and all of these things.
1:28:36
So there's so many
things you have to cover.
1:28:38
Then the tasks are
really open ended,
1:28:41
so it's very hard to automate.
1:28:42
So that's what you were
alluding to before.
1:28:45
So the idea is that
instead of trying
1:28:48
to come up with really
easily automated benchmarks,
1:28:51
it's just we're going to ask
questions that users actually
1:28:55
ask to these models in practice.
1:28:56
And we're just going
to ask annotators
1:28:58
to say between these two
models, which one is better.
1:29:01
What's the better output.
1:29:03
So basically the
exact same thing
1:29:04
as basically the data
from RLHF but you
1:29:08
use it now for evaluation.
1:29:10
Yes I'm not sure
I understand what
1:29:11
you mean by can't use
perplexity not calibrated.
1:29:14
Like RLHF still doing like
next token prediction.
1:29:19
So--
1:29:19
Why can't perplexity
be used then?
1:29:21
So think about the
optimal solution
1:29:24
after doing PPO is
basically one model that
1:29:27
gives you essentially a delta.
1:29:30
Like basically it says that
there's only one sentence
1:29:33
that is--
1:29:34
that could be generated
for that question.
1:29:36
So now if you use
it on something
1:29:38
that is slightly semantically
different,
1:29:40
it would actually give a
likelihood of 0 for that answer.
1:29:44
So in reality, it's not that
extreme because as you say,
1:29:46
it's still a
distribution, but it just
1:29:48
shows you that there's
a fundamental issue
1:29:50
with perplexity.
1:29:51
Once these models
are not language models anymore--
1:29:55
they were not trained--
at least with PPO,
1:29:56
they're not trained to do
maximum likelihood anymore;
1:29:59
they were trained
to be policies.
1:30:04
So probably the most
common or the most--
1:30:08
yeah, the most common benchmark
or the most trusted one
1:30:10
is what we call ChatBotArena,
which is basically
1:30:14
go on internet, have random
users on the internet,
1:30:17
blindly talk with two chatbots,
just ask many questions,
1:30:21
see the two answers and
rate, which one is better.
1:30:23
And you do that over hundreds
of thousands of users and then
1:30:26
you get the actual preferences
and you get rankings of models.
1:30:30
So you can go right
now on ChatBotArena
1:30:33
and actually interact
with these models.
1:30:35
One potential issue
just to highlight
1:30:38
is that the people who want
to do these type of things
1:30:40
are usually more like
tech-driven or like tech savvy.
1:30:44
So a lot of the questions
that you will ask
1:30:46
are more like tech
stuff discussing
1:30:47
software errors,
inquiries about AI tools
1:30:50
and all of these things.
1:30:52
So another issue
is cost and speed.
1:30:54
If you really want
to use something
1:30:55
like this for
development process,
1:30:58
it will be too costly because
you will need to basically pay
1:31:01
a lot of humans to do that.
1:31:03
So one simple idea is,
again, as we said many times,
1:31:07
just use LLM instead of humans.
1:31:10
You probably know the
drill at this point.
1:31:13
Steps for every instruction
generate outputs
1:31:15
by some baseline and the model
that you want to evaluate.
1:31:19
So here you imagine that
I'm comparing an answer
1:31:22
from ChatGPT and from Mistral.
1:31:24
I'm just asking a model, another
model, which one is better.
1:31:29
And I just basically
average that out.
1:31:32
Yeah.
1:31:32
I ask GPT-4,
which one is better.
1:31:34
I averaged that out over
my entire distribution,
1:31:37
over my entire
benchmark or data set,
1:31:39
and that gives me a win rate.
1:31:41
So a win probability for one
model compared to another one.
1:31:44
And now you can rank models.
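A minimal sketch of that evaluation loop — `judge` here is just a placeholder for the actual LLM call:

```python
def judge(instruction: str, answer_a: str, answer_b: str) -> str:
    # Placeholder for a call to a strong LLM judge (e.g. GPT-4) with a fixed
    # judging prompt; returns "a" or "b". A careful evaluator also randomizes
    # the a/b order to reduce position bias.
    return "a"

def win_rate(instructions, outputs_model, outputs_baseline) -> float:
    # Fraction of instructions where the judge prefers the model over the baseline.
    wins = sum(
        judge(x, y_model, y_base) == "a"
        for x, y_model, y_base in zip(instructions, outputs_model, outputs_baseline)
    )
    return wins / len(instructions)
```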
1:31:46
And this is the
AlpacaEval leaderboard.
1:31:50
So the benefits of this
is that actually we
1:31:53
show-- we get 98% correlation
with ChatBotArena.
1:31:56
So very high
correlation with humans.
1:31:59
So this is yeah,
comparison with correlation
1:32:01
with other benchmarks.
1:32:02
And it takes less than three
minutes and less than $10
1:32:05
to run.
1:32:05
So it's pretty cheap.
1:32:06
And there are downsides though.
1:32:08
One of them is spurious correlations.
1:32:11
So as we already saw
before, LLMs prefer--
1:32:14
this is one spurious
correlation among many;
1:32:16
I'll just talk about one--
1:32:17
LLMs prefer longer outputs.
1:32:19
Actually humans also
prefer longer outputs.
1:32:21
But the problem or the
issue once you use LLMs
1:32:23
is that once there is bias, you
will continue optimizing that.
1:32:26
Humans at some point,
I can guarantee you
1:32:28
if I ask a simple
question, and you give me
1:32:29
five pages of
answers, I'll be like,
1:32:31
no, I don't like that answer.
1:32:32
But LLMs if they have this bias
and they were trained for that,
1:32:35
they will continue
preferring longer outputs.
1:32:37
So here we see the
preference just showing
1:32:42
that humans and models
prefer longer outputs.
1:32:46
And here is another view of
the initial AlpacaEval data set
1:32:50
benchmark, where--
1:32:53
when we look
at the win rate of GPT4
1:32:56
versus actually GPT4 itself,
if we use the standard GPT4,
1:33:01
it gets 50%, kind of, by
definition because we're
1:33:03
comparing GPT4 versus GPT4.
1:33:06
But if we ask a GPT4 to
be slightly more verbose,
1:33:09
so we just say in the prompt,
be verbose in your answers,
1:33:12
then it gets a
win rate of 64.4%.
1:33:15
So really there's
a huge variance.
1:33:16
And if we ask it
to be concise, it
1:33:17
gets 20% so there's
a huge variance
1:33:20
depending on whether you ask
it to be concise or verbose.
1:33:24
That's very annoying.
1:33:25
So one possible solution,
which is what we did,
1:33:29
is just use some
regression analysis.
1:33:31
I'm not going to
go into details,
1:33:32
but basically use
causal inference
1:33:34
tools to control for length.
1:33:36
And right now actually
length matters much less.
1:33:38
So if you ask it to be verbose,
you still get some gains,
1:33:41
but much less.
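A rough sketch of that kind of length control — this is a generic logistic-regression version with made-up data, not necessarily the exact model used:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Per-comparison data: how much longer our model's answer was than the
# baseline's, and whether the judge picked ours. (Made-up numbers.)
length_diff = np.array([[120.0], [-30.0], [200.0], [10.0], [-80.0], [150.0]])
picked_ours = np.array([1, 0, 1, 1, 0, 1])

# Model the preference as a function of the length difference.
glm = LogisticRegression().fit(length_diff, picked_ours)

# Length-controlled win rate: the predicted preference when both
# answers have the same length (length difference = 0).
print(glm.predict_proba([[0.0]])[0, 1])
```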
1:33:44
Great.
1:33:44
So that's all about
post training.
1:33:46
And now for the
next eight minutes,
1:33:48
I might talk about systems
or just answer questions.
1:33:51
Yes.
1:33:52
Can you go back to your
post training, internal post
1:33:56
training.
1:33:57
How did we tune those
parameters using
1:33:59
the small body of
fine-tuning data
1:34:03
and have such a big
effect on the model?
1:34:05
You mentioned earlier that
there's a different set
1:34:07
of hyperparameters.
1:34:08
Are we changing just some of
the weights, the later weights
1:34:11
or other weights.
1:34:12
What's actually happening?
1:34:13
Yeah.
1:34:14
Yeah, I, kind of, skimmed
through all of this.
1:34:16
You change all the weights.
1:34:17
Actually, industry will
change all the weights.
1:34:20
In open source
land, you might have
1:34:22
heard of LoRA, which is
going to change basically only
1:34:26
some of the weights or it
actually, to be more specific,
1:34:29
it's going to add
some differences
1:34:31
to the output of every layer.
1:34:33
But in industry, you're going to
just fine tune all the weights.
1:34:37
And also to say something
else about the data, actually,
1:34:40
this last step, RLHF
you usually going
1:34:42
to collect a lot more
data than with SFT.
1:34:45
So if SFT is like 5,000,
10,000, maybe 50,000, with
1:34:50
RLHF I think you're going to be
more around like the one million
1:34:54
order of magnitude.
1:34:55
It's still much less
than pretraining though.
1:34:57
Yeah.
1:34:57
Because pretraining
is 15 trillion tokens.
1:35:00
I mean, this is like--
that's not even a drop in the bucket,
1:35:02
and yet you influence
the weights a lot.
1:35:05
So it's because of how you do it--
I mean, you have to think about
how you do it:
how you do it is you use--
1:35:10
I mean, as I said, the learning
rate that you're going to use
1:35:12
is going to be different,
but also you only do that.
1:35:16
So just imagine if I trained--
1:35:18
even if I trained
on one sentence,
1:35:19
but over and over
again at some point
1:35:22
my model will only
generate that sentence
1:35:24
even if it was just
one sentence instead of
1:35:27
the 15 trillion tokens.
1:35:29
So if you use a
large enough learning
1:35:30
rate and for enough
time, you will basically
1:35:33
overfit that sentence.
1:35:35
So the key thing to remember
is that the data is not--
1:35:39
it's not as if you mix
some post-training data
1:35:42
and some pretraining data.
1:35:43
You do pretraining, and then
you just start fine-tuning only
1:35:47
on the post-training.
1:35:48
So another way, maybe
another perspective
1:35:50
is that the pretraining
is just the initialization
1:35:53
of your model.
1:35:54
And once you view it that
way, that this is just
1:35:56
initialization of weights,
then there's nothing special.
1:35:59
Like you don't need to remember
that you train on a lot of data
1:36:02
before.
1:36:02
The only thing that matters is
that you had an initialization
1:36:04
and now I actually
train the model.
1:36:06
So maybe you think
about it that way.
1:36:07
Like this is a Markov
property in some ways.
1:36:10
It's just like you
had your weights.
1:36:11
This is my initialization.
1:36:12
Now I'm training that one.
1:36:14
Does that answer your question?
1:36:16
Kind of but you said
something just now about it's
1:36:20
almost the equivalent of just
rerunning the fine tuning
1:36:23
data many times.
1:36:25
Is it actually-- is that what
actually happens in order
1:36:28
to give so much more preference?
1:36:33
You might-- I actually don't
know right now how they do it
1:36:37
in industry.
1:36:37
When we did Alpaca,
we had to do three epochs.
1:36:40
So you did run it
three times through it.
1:36:44
But I mean, even
the number of times
1:36:46
that you run it through,
it's actually not important.
1:36:48
The only thing-- the only thing
that matters is the
1:36:52
effective learning rate.
1:36:54
So yeah.
1:36:56
Great.
1:36:58
So I think I have five minutes.
1:37:06
OK, I might try to give a
high-level overview of at least
1:37:12
one of the systems tricks.
1:37:14
Systems: as we said, for
everyone the bottleneck is--
1:37:19
sorry, compute is
the huge bottleneck.
1:37:21
One question you might ask
is, why not buy more GPUs?
1:37:24
GPUs are expensive,
but they're also scarce.
1:37:26
Even if you have $10
million right now,
1:37:28
you cannot buy the best GPUs.
1:37:31
[INAUDIBLE]
1:37:33
There's also some
physical limitations.
1:37:35
When you have multiple
GPUs, you have
1:37:37
to communicate between them.
1:37:39
That takes time.
1:37:40
So just buying more
GPUs is not that easy.
1:37:43
So it's really
important to think about
1:37:45
how do you allocate resources
and how do you optimize
1:37:47
your pipeline-- so, systems.
1:37:49
101 on GPUs, I'm sorry,
I'm going slightly faster.
1:37:53
I hope that some of you
at least can follow.
1:37:55
GPUs are basically
optimized for throughput.
1:37:58
CPUs are optimized for latency.
1:38:01
So GPUs, the way you
have to think about it
1:38:03
is that there's one--
1:38:04
there's one command that
is run on many, many cores
1:38:07
at the same time on
different type of data.
1:38:11
So this is how you see a GPU.
1:38:13
You see there are
many different cores.
1:38:14
We call them streaming
multiprocessors,
1:38:17
which is very different than
the usual CPU architecture.
1:38:20
So just think high throughput
parallelization for GPUs.
1:38:24
GPUs are optimized for
fast matrix multiplication.
1:38:27
So every time you will do--
you will do something on GPU.
1:38:30
If you can do it with a
matrix multiplication,
1:38:33
it's going to be 10 times
faster than with anything else.
1:38:36
That is a little bit
annoying because it
1:38:38
means that we are,
kind of, bottlenecked
1:38:40
to doing anything with
matrix multiplications.
1:38:44
Another thing to
note with GPUs is
1:38:46
that compute has
been improving faster
1:38:48
than memory and communication.
1:38:50
So right now GPUs usually
are hard to keep fed--
1:38:55
like, the data that
you send to GPUs
1:38:58
actually has a hard time keeping
up with the processors.
1:39:00
So most of your
GPUs are actually
1:39:02
going to be idle if you
just run normal code,
1:39:04
if you don't optimize your code.
1:39:06
So communication-- and this
will continue over time.
1:39:10
Another thing to know
about GPUs is that there's
1:39:12
a memory hierarchy.
1:39:13
This is the same thing
actually with CPUs,
1:39:15
but basically the closer
you are to your cores,
1:39:17
the less memory there is,
but the faster things run.
1:39:20
If you are further,
more memory slower.
1:39:24
Oh yeah I'm going to skip that.
1:39:26
OK actually, I'm
going to say it.
1:39:27
I told you about this--
1:39:29
the fact of communication.
1:39:31
The metric that
people usually look at
1:39:32
is model FLOP utilization.
1:39:34
So take the theoretical
maximum that the GPU could run at--
1:39:37
the number of flops that you
could use per second--
1:39:39
and then-- sorry, take the
observed throughput
1:39:42
divided by this
theoretical maximum.
1:39:45
And in general, if you
reach 50% you're very happy.
1:39:49
Like Facebook-- I looked:
Llama was at 45%
1:39:51
or something like this.
1:39:52
So that means that data
doesn't come fast enough
1:39:55
even for these big companies.
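As a minimal sketch of the metric, reusing the ~6-FLOPs-per-parameter-per-token approximation from the cost section; the throughput and peak numbers below are made up:

```python
def mfu(tokens_per_second, n_params, n_gpus, peak_flops_per_gpu):
    # Model FLOPs utilization: observed training FLOP/s over the theoretical peak.
    observed = tokens_per_second * 6 * n_params
    return observed / (n_gpus * peak_flops_per_gpu)

# Made-up example: a 405B model at 3M tokens/s on 16,000 GPUs,
# assuming ~1e15 FLOP/s peak per GPU.
print(f"{mfu(3e6, 405e9, 16_000, 1e15):.0%}")  # ~46%
```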
1:39:58
So one simple trick,
and that might
1:40:00
be the only one I'm
going to tell you about,
1:40:02
is low precision.
1:40:04
One simple idea is
that well, if I'm
1:40:06
going to put my floats
in low precision,
1:40:09
then there's going
to be fewer bits
1:40:10
that I have to send to my GPUs.
1:40:12
If there's fewer bits,
it's faster communication,
1:40:14
lower memory consumption.
1:40:16
Things are going to go faster.
1:40:17
And for deep learning
it just happens
1:40:19
that decimal precision is
not that important.
1:40:22
So when you do matrix
multiplication, when
1:40:25
you do like for example, SGD,
there's already so much noise
1:40:28
that if you update something
by 0.01 or 0.015, who cares.
1:40:33
So basically instead of using
32 bits per float, which
1:40:37
is what people used to use,
or 64 for example, which
1:40:41
is what you would
use in other domains,
1:40:43
you use 16 bits for
matrix multiplication.
1:40:46
So for every float
you use 16 bits.
1:40:49
And for training
you have this type
1:40:51
of what we call automatic
mixed precision.
1:40:54
Which is that some of the
things are in 32 bits,
1:40:57
others are in--
1:40:58
in 16 bits.
1:41:00
Generally, the way you
should be thinking about
1:41:02
it is that the weights
of your model
1:41:05
are stored in 32 bits.
1:41:06
But just before the computation
you put everything in 16 bits.
1:41:10
Like this you do
computation super fast.
1:41:12
And at the end you update
your weights in 32 bits.
1:41:16
And the reason why you do all
the updates in 32 bits is just
1:41:19
think that if your
learning rate, for example,
1:41:21
is very small, you still
want to be able to make
1:41:23
a difference in your weights.
1:41:25
So all the computation
is done in 16 bits,
1:41:28
but the weights are
actually stored in 32 bits.
1:41:30
So that's like the standard
way that people are doing it.
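In PyTorch, a minimal sketch of that standard recipe looks like this — autocast for the 16-bit compute plus a gradient scaler; the model and data are placeholders:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()  # placeholder model; weights stay fp32
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()        # rescales grads to avoid fp16 underflow

for _ in range(10):
    x = torch.randn(32, 1024, device="cuda")
    optimizer.zero_grad()
    # Inside autocast, the matrix multiplications run in fp16.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).square().mean()
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # the weight update itself happens in fp32
    scaler.update()
```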
1:41:35
OK, I'll actually
talk just about this,
1:41:36
and then I'll skip all the rest,
operator fusion, because I think
1:41:39
this is actually pretty cool.
1:41:40
As I just said,
communication is very slow
1:41:42
and actually every time
you use a PyTorch line,
1:41:45
it basically moves variables
to the global memory of your GPU.
1:41:49
So say you have something like
x1 = x.cos(),
1:41:54
and then you do x1.cos().
1:41:56
What is happening
behind the scenes
1:41:58
is that you take the
x, which is data.
1:42:00
You ship it to your actual
processors of your GPUs.
1:42:03
You apply the cosine.
1:42:05
You ship it back to the
main memory of your GPU
1:42:07
and then you see the next line.
1:42:09
You ship it back to the
computer-- to the GPU processor,
1:42:12
you apply another cosine
and you ship it back again.
1:42:15
So another way to
see that is that you
1:42:17
go from your DRAM, which is
your global memory and your GPU
1:42:20
and you ship it to compute.
1:42:22
You ship it back for every line.
1:42:24
This is a naive way of doing it.
1:42:25
This seems very wasteful.
1:42:28
So the idea, simple
idea of operator fusion
1:42:31
is just communicate, do all the
computation, ship it back once.
1:42:35
And this is exactly
what fused kernels are.
1:42:39
So if you ever want to make
your compute-- your computations
1:42:44
in PyTorch much faster,
just apply torch.compile
on your model.
compile on your model.
1:42:48
This is going to make your
model around 2 times faster.
1:42:51
And what it does is simply
that it rewrites your code--
1:42:56
your PyTorch code basically
in C++ and CUDA to do
1:43:03
the communication only once
then do all the operations,
1:43:05
then ship it back.
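A minimal sketch of both versions:

```python
import torch

def f(x):
    # Eager mode: each .cos() is its own kernel launch, with a round trip
    # through GPU global memory in between.
    return x.cos().cos()

# torch.compile traces f and can fuse the two cosines into one kernel,
# so the data makes a single trip from memory to compute and back.
f_fused = torch.compile(f)

x = torch.randn(1_000_000, device="cuda")
y = f_fused(x)
```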
1:43:07
OK I'm not going to have
time to talk about tiling.
1:43:10
Tiling is important.
1:43:11
Parallelization.
1:43:12
Parallelization is important.
1:43:15
And mixture of experts.
1:43:17
Mixture of experts is important.
1:43:18
Outlook.
1:43:19
There are many things
we haven't talked about.
1:43:23
We haven't talked about
architectures we definitely
1:43:25
haven't talked about inference.
1:43:27
There are many other things
that are important with LLMs.
1:43:29
What is the UI that you use?
1:43:31
I mean, arguably ChatGPT,
the big novelty was just
1:43:34
having a simple UI to use it.
1:43:35
Multi-modality.
1:43:36
What are all the
misuses you could have.
1:43:38
The fact that there might not
be enough data on the internet
1:43:41
to train all these models.
1:43:42
Legality of data collection,
so many other things.
1:43:45
If you are interested
in all these topics,
1:43:47
I would suggest three classes.
1:43:49
CS224N is probably the one
that touches the least on LLMs,
1:43:54
but it gives some background
and historical context
1:43:57
of all the LLMs and gives
some adjacent material.
1:44:01
CS324 I think it's called--
1:44:04
I think it's just called
Large Language Models, more
1:44:07
in depth reading and lectures
on everything I talked about.
1:44:10
CS336, which is Language
Models from Scratch, where
1:44:13
you actually build your own LLM.
1:44:16
It's an amazing class also
given by my two supervisors.
1:44:20
Very heavy workload,
so be careful.
1:44:23
Great.
— end of transcript —