Stanford CS229 I Machine Learning I Building Large Language Models (LLMs) 1:44:31

Stanford CS229 I Machine Learning I Building Large Language Models (LLMs)

Stanford Online · May 10, 2026
Transcript ~18170 words · 1:44:31
0:05
So, let's get started.
0:07
So I'll be talking about building LLMs today.
0:10
So I think a lot of you have heard of LLMs before, but just
0:14
as a quick recap.
0:16
LLMs standing for large language models
0:18
are basically all the chat bots that you've
0:21
been hearing about recently.
0:22
So, ChatGPT, from OpenAI, Claude, from Anthropic, Gemini
0:28
and Llama, and other types of models like this.
0:31
And today we'll be talking about how do they actually work.
0:34
So it's going to be an overview because it's only one lecture
0:36
and it's hard to compress everything.
0:38
But hopefully, I'll touch a little bit
0:39
about all the components that are needed
0:41
to train some of these LLMs.
0:43
Also, if you have questions, please interrupt me
0:46
and ask if you have a question.
0:48
Most likely other people in the room or on Zoom have other.
0:52
Have the same questions.
0:53
So, please ask.
0:56
Great.
0:56
So what matters when training LLMs.
1:00
So there are a few key components that matter.
1:02
One is the architecture.
1:04
So as you probably all know, LLMs are neural networks,
1:07
and when you think about neural networks,
1:09
you have to think about what architecture you're using.
1:11
And another component, which is really important
1:13
is the training loss and the training algorithm.
1:16
So, how you actually train these models, then it's data.
1:20
So, what do you train these models on.
1:24
The evaluation, which is how do you
1:26
know whether you're actually making progress
1:28
towards the goal of LLMs and then, the system component.
1:33
So that is like how do you actually
1:35
make these models run on modern hardware, which
1:38
is really important because these models are really large.
1:41
So now more than ever, systems are actually
1:43
really an important topic for LLMs.
1:47
So those are the five components. You probably all know that LLMs,
1:52
and if you don't know, LLMs are all
1:53
based on transformers or at least some version
1:56
of transformers.
1:57
I'm actually not going to talk about the architecture today.
2:00
One, because I gave a lecture on transformers a few weeks ago
2:06
and two, because you can find so much information online
2:09
on transformers.
2:11
There's much less information about the other four topics.
2:14
So, I really want to talk about those.
2:17
And another thing to say is that most of academia
2:20
actually focuses on architecture and training
2:22
algorithm and losses as academics
2:25
and I've done that for a big part of my career,
2:28
is simply we like thinking that this is like we make
2:32
new architectures, new models, and it
2:35
seems like it's very important.
2:37
But in reality, honestly, what matters in practice is mostly
2:39
the three other topics.
2:41
So, data, evaluation and systems, which is what most
2:45
of industry actually focuses on.
2:48
So, that's also one of the reasons
2:49
why I don't want to talk too much about the architecture,
2:52
because really the rest is super important.
2:55
Great.
2:55
So, overview of the lecture, I'll
2:57
be talking about pretraining.
2:58
So, pretraining, you probably heard that word.
3:00
This is the general word.
3:02
This is kind of the classical language modeling paradigm where
3:06
you basically train your language model to essentially
3:08
model all of internet.
3:10
And then, there's a post training,
3:11
which is a more recent paradigm which
3:13
is taking these large language models
3:15
and making them essentially AI assistants.
3:18
So, this is more of a recent trend since ChatGPT.
3:22
So, if you ever heard of GPT3 or GPT2,
3:25
that's really pretraining land.
3:27
If you heard of ChatGPT, which you probably have,
3:29
this is really post training land,
3:31
so I'll be talking about both, but I'll start with pretraining
3:34
and specifically I'll talk about what
3:37
is the task of pretraining LLMs and what is the loss that people
3:41
actually use.
3:43
So, language modeling, this is a quick recap.
3:47
Language models at a high level are simply
3:49
models of probability distribution over sequences
3:52
of tokens or of words.
3:53
So it's basically some model of p of x1
3:57
to xL, where x1 is basically word
3:59
one and xL is the last one in the sequence or in the sentence.
4:04
So, very concretely, if you have a sentence like the mouse
4:07
ate the cheese, what the language model gives
4:09
you is simply a probability of this sentence being uttered
4:13
by a human or being found online.
4:17
So, if you have another sentence like "The the mouse ate cheese."
4:21
Here, there's grammatical mistakes.
4:23
So, the model should know that this should
4:25
have some syntactic knowledge.
4:27
So, it should know that this has less likelihood
4:30
of appearing online.
4:32
If you have another sentence like the cheese ate the mouse,
4:36
then the model should hopefully know about the fact
4:39
that usually cheese don't eat mouse.
4:42
So, there's some semantic knowledge
4:43
and this is less likely than the first sentence.
4:45
So, this is basically at a high level what language models are.
4:50
One word that you probably have been hearing a lot in the news
4:52
are generative models.
4:54
So, this is just something that can generate.
4:56
Models that can generate sentences
4:57
or can generate some data.
4:59
The reason why we say language models are generative models
5:01
is that once you have a model of a distribution,
5:04
you can simply sample from this model.
5:06
And now we can generate data.
5:07
So we can generate sentences using a language model.
5:12
So the type of models that people are all currently using
5:15
are what we call autoregressive language models.
5:18
And the key idea of autoregressive language models
5:21
is that you take this distribution over words
5:25
and you basically decompose it into the distribution
5:29
of the first word, multiply by the distribution of
5:32
or the likelihood of the distribution of the second word
5:35
given the first word, and multiply it
5:37
by P of the third word given the first two words.
5:40
So, there's no approximation here.
5:42
This is just the chain rule of probability, which you
5:44
hopefully you all know about.
5:46
Really no approximation.
5:47
This is just one way of modeling a distribution.
5:50
So, slightly more concisely, you can write it
5:52
as a product of P's of the next word, given everything which
5:57
happened in the past.
5:58
So, of the context.
5:59
So, this is what we call autoregressive language models.
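Written out, the factorization just described is the chain rule of probability. The notation follows the lecture's x1 to xL; nothing here is approximated:

```latex
p(x_1, \dots, x_L)
  = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots
  = \prod_{i=1}^{L} p(x_i \mid x_1, \dots, x_{i-1})
```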
6:02
Again, this is really not the only way
6:05
of modeling distribution.
6:06
This is just one way.
6:07
It has some benefits and some downsides.
6:10
One downside of autoregressive language models
6:12
is that when you actually sample from this autoregressive
6:15
language model, you basically have
6:16
a for loop, which generates the next word, then conditions
6:20
on that next word.
6:21
And then we generate another word.
6:23
So, basically if you have a longer sentence
6:24
that you want to generate, it takes more time to generate it.
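The for loop mentioned here can be sketched as follows. This is a toy illustration, not the lecturer's code: the table `NEXT_TOKEN_PROBS` and the function names are hypothetical, and the stand-in "model" only looks at the last token, whereas a real LLM conditions on the whole context.

```python
import random

# Toy next-token table standing in for a real model (hypothetical values).
# It maps the previous token to a distribution over the next token.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 1.0},
    "the": {"mouse": 0.6, "cheese": 0.4},
    "mouse": {"ate": 1.0},
    "ate": {"the": 1.0},
    "cheese": {"</s>": 1.0},
}

def next_token_distribution(context):
    """Return p(next token | context). This toy model only uses the last token."""
    return NEXT_TOKEN_PROBS[context[-1]]

def sample(max_new_tokens, seed=0):
    rng = random.Random(seed)
    context = ["<s>"]
    for _ in range(max_new_tokens):  # one model call per generated token
        dist = next_token_distribution(context)
        tokens, probs = zip(*dist.items())
        token = rng.choices(tokens, weights=probs, k=1)[0]
        if token == "</s>":
            break
        context.append(token)  # condition on the token we just generated
    return context[1:]

print(" ".join(sample(max_new_tokens=10)))
```

The number of loop iterations grows with the length of the output, which is why longer generations take longer.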
6:28
So, there are some downsides of this current paradigm,
6:31
but that's what we currently have.
6:33
So, I'm going to talk about this one.
6:36
Great.
6:36
So, autoregressive language models.
6:38
At a high level, the task of an autoregressive language model
6:41
is simply predicting the next word, as I just said.
6:44
So, if we have a sentence like she likely prefers,
6:47
one potential, next word might be dogs.
6:50
And the way we do it is that we first tokenize.
6:54
So, you take these words or subwords you tokenize them
6:58
and then you give an ID for each token.
7:00
So here you have one, two, three.
7:03
Then, you pass it through this black box.
7:04
As I already said, we're not going
7:06
to talk about the architecture.
7:07
You just pass it through, pass it through a model,
7:10
and you then get a distribution, a probability distribution
7:13
over the next word or over the next token.
7:16
And then you sample from this distribution,
7:20
you get a new token and then you detokenize.
7:22
So, you get a new ID, you detokenize
7:24
and that's how you basically sample from a language model.
7:28
One thing which is important to note
7:29
is that the last two steps are actually
7:32
only needed during inference.
7:34
When you do training, you just need
7:36
to predict the most likely token and you can just
7:38
compare to the real token which happened in practice,
7:41
and then, you basically change the weights
7:43
of your model to increase the probability of generating
7:46
that token.
7:49
Great.
7:50
So, autoregressive neural language models.
7:52
So to be slightly more specific, still,
7:54
without talking about the architecture,
7:56
the first thing we do is that we have all of these.
7:58
Sorry, yes.
7:59
On the previous slide.
8:01
Predicting the probability of the next token,
8:03
does this mean that your final output vector has
8:06
to be the same dimensionality as the number of tokens
8:08
that you have?
8:09
Yes.
8:10
How do you deal with if you have more token.
8:13
Adding more token to your [INAUDIBLE]?
8:16
Yeah so we're going to talk about tokenization
8:18
actually later so you will get some sense of this.
8:21
You basically can't deal with adding new tokens.
8:24
I'm kind of exaggerating.
8:25
There are methods for doing it, but essentially people
8:28
don't do it.
8:29
So it's really important to think about
8:32
how you tokenize your text, and that's why
8:33
we'll talk about that later.
8:35
But it's a very good point to note
8:36
is that you basically-- the vocabulary size, so
8:38
the number of tokens that you have is essentially
8:40
the output of your language model.
8:43
So it's actually pretty large.
8:46
So autoregressive neural language models.
8:48
First thing you do is that you take every word or every token.
8:51
You embed them so you get some vector representation
8:56
for each of these tokens.
8:58
You pass them through some neural network, as we said,
9:00
it's a transformer.
9:01
Then you get a representation for all the word
9:04
and all the words in the context.
9:06
So it's basically a representation
9:07
of the entire sentence.
9:09
You pass it through a linear layer,
9:11
as you just said, to basically map it to the number
9:15
so that the output-- the number of outputs
9:17
is the number of tokens.
9:19
You then pass it through some softmax
9:21
and you basically get a probability distribution
9:24
over the next words given every word in the context.
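As a rough illustration of the last two steps (linear layer output, then softmax), here is a minimal sketch in plain Python. The four-token vocabulary and the logit values are made up for the example; in a real model the logits come from the linear layer and there is one per token in the vocabulary.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["cat", "dog", "cheese", "mouse"]  # hypothetical 4-token vocabulary
logits = [2.0, 1.0, 0.1, -1.0]             # hypothetical output of the linear layer

probs = softmax(logits)
for token, p in zip(vocab, probs):
    print(f"{token}: {p:.3f}")
```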
9:30
And the loss that you use is basically--
9:32
it's essentially a task of classifying the next token.
9:35
So it's a very simple, kind of, machine learning task.
9:37
So you use the cross-entropy loss.
9:39
Where you basically look at the actual target that happened,
9:44
which is the target distribution, which
9:45
is a one hot encoding, which in this case says,
9:49
I saw the real word that happened is cat.
9:51
So that's a one hot distribution over cat.
9:55
And here this is the actual--
9:57
do you see my mouse?
9:58
Oh, yeah.
9:58
This is the distribution that you generated.
10:00
And basically you do cross entropy,
10:01
which really just increases the probability of generating cat
10:04
and decreases all the probability of generating
10:06
all the other tokens.
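A small sketch of that loss, assuming a three-token vocabulary and made-up predicted probabilities. With a one-hot target, the cross entropy reduces to minus the log of the probability assigned to the real token, so raising that probability lowers the loss.

```python
import math

def cross_entropy(target, predicted):
    """H(target, predicted) = -sum_i target_i * log(predicted_i)."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

vocab = ["cat", "dog", "cheese"]
predicted = [0.7, 0.2, 0.1]  # hypothetical model output
target = [1.0, 0.0, 0.0]     # one-hot: the real next token was "cat"

loss = cross_entropy(target, predicted)
print(loss)  # equals -log(0.7)

# A model that puts more mass on "cat" gets a lower loss.
better = cross_entropy(target, [0.9, 0.05, 0.05])
print(better < loss)
```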
10:08
One thing to notice is that, as you all know again,
10:11
this is just equivalent to maximizing the text log
10:15
likelihood because you can just rewrite
10:17
the max over the probability of this autoregressive language
10:23
modeling task as just being this minimum of I just
10:26
added the log here and minus, which
10:29
is just the minimum of the loss, which is the cross entropy loss.
10:31
So basically minimizing the loss is
10:33
the same thing as maximizing the likelihood of your text.
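The rewrite described on the slide can be spelled out as below. The symbol θ for the model weights is an assumption added here for clarity; the log does not change the maximizer because it is monotonic, and flipping the sign turns the max into a min:

```latex
\arg\max_{\theta} \prod_{i=1}^{L} p_{\theta}(x_i \mid x_1, \dots, x_{i-1})
  = \arg\max_{\theta} \sum_{i=1}^{L} \log p_{\theta}(x_i \mid x_1, \dots, x_{i-1})
  = \arg\min_{\theta} \Big( -\sum_{i=1}^{L} \log p_{\theta}(x_i \mid x_1, \dots, x_{i-1}) \Big)
```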
10:36
Any question?
10:37
Questions?
10:43
OK, tokenizer.
10:46
So this is one thing that people usually
10:49
don't talk that much about.
10:50
Tokenizers are extremely important.
10:53
So it's really important that you understand at least what
10:56
they do at a high level.
10:57
So why do we need tokenizers in the first place?
11:01
First, it's more general than words.
11:02
So one simple thing that you might think
11:04
is we're just going to take every word that we will have.
11:07
You just say every word is a token in its own.
11:11
But then what happens is if there's a typo in your word?
11:14
Then you might not have any token associated
11:17
with this word with a typo.
11:20
And then you don't know how to actually pass
11:21
this word with a typo into a large language model.
11:24
So what do you do next?
11:25
And also, even if you think about words, words is a very--
11:29
words are fine with Latin-based languages.
11:32
But if you think about a language like Thai,
11:34
you won't have a simple way of tokenizing
11:36
by spaces because there are no spaces between words.
11:39
So really, tokens are much more general than words.
11:43
It's the first thing.
11:44
Second thing that you might think
11:45
is that you might tokenize every sentence, character
11:48
by character.
11:49
You might say A is one token, B is another token.
11:52
That would actually work and probably very well.
11:55
The issue is that then your sequence becomes super long.
11:58
And as you probably remember from the lecture
12:00
on transformers, the complexity grows quadratically
12:05
with the length of sequences.
12:06
So you really don't want to have a super-long sequence.
12:10
So tokenizers basically try to deal with those two problems
12:14
and give common subsequences a certain token.
12:19
And usually how you should be thinking about it is around
12:22
an average of every token is around 3-4 letters.
12:27
And there are many algorithms for tokenization.
12:30
I'll just talk about one of them to give you a high level, which
12:32
is what we call Byte Pair Encoding, which is actually
12:34
a pretty common.
12:35
One of the two most common tokenizers.
12:37
And the way that you train a tokenizer
12:39
is that first you start with a very large corpus of text.
12:42
And here, I'm really not talking about training a large language
12:45
model yet, this is purely for the tokenization step.
12:48
So this is my large corpus of text with these five words.
12:52
And then you associate every character
12:55
in this corpus of text a different token.
12:58
So here, I just split it up every character
13:00
with a different token, and I just
13:03
color coded all of those tokens.
13:05
And then what you do is that you go through your text,
13:08
and every time you see pairs of tokens that are very common,
13:12
the most common pair of token, you just merge them.
13:15
So here you see three times the tokens t and o
13:19
next to each other.
13:20
So you're just going to say this is a new token.
13:22
And then you continue, you repeat that.
13:24
So now you have tok, tok which happens three times.
13:28
Toke with an E that happens 2 times and token,
13:33
which happens twice, and then ex which also happens twice.
13:37
So this is the-- if you were to train a tokenizer on this corpus
13:41
of text, which is very small, that's
13:43
how you would finish with a token--
13:45
with like trained tokenizer.
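The merge procedure can be sketched in a few lines of Python. This is a simplified illustration, not a production tokenizer: the five-word corpus is an assumption chosen to match the counts mentioned in the lecture, the function names are hypothetical, and merges are not allowed to cross word boundaries.

```python
from collections import Counter

def most_common_pair(words):
    """Count adjacent token pairs across all words; return (pair, count) or None."""
    counts = Counter()
    for tokens in words:
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(1)[0] if counts else None

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` in `tokens` with the merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def train_bpe(corpus, num_merges):
    words = [list(word) for word in corpus.split()]  # start from single characters
    merges = []
    for _ in range(num_merges):
        best = most_common_pair(words)
        if best is None or best[1] < 2:  # stop when no pair repeats
            break
        pair = best[0]
        merges.append(pair)
        words = [merge_pair(tokens, pair) for tokens in words]
    return merges, words

merges, words = train_bpe("tokenizer text to token index", num_merges=4)
print(merges)  # the first merge is ('t', 'o'), which occurs three times
print(words)
```

Ties between equally frequent pairs are broken arbitrarily here; real implementations fix a deterministic rule.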
13:47
In reality, you do it on much larger corpus of text.
13:51
And this is the real tokenizer of--
13:54
actually, I think this is GPT3 or ChatGPT.
13:57
And here you see how it would actually separate these words.
14:00
So basically you see the same thing
14:01
as what we gave in the previous example.
14:03
Token becomes its own token.
14:06
So "tokenizer" is actually split up
14:08
into two tokens token and -izer.
14:12
So yeah, that's all about tokenizers.
14:15
Any questions on that?
14:16
Yeah.
14:16
How do you deal with spaces, and how do you
14:18
deal with [INAUDIBLE].
14:19
Yeah so actually there's a step before tokenizers,
14:23
which is what we call pre-tokenizers, which
14:25
is exactly what you just said.
14:27
So this is mostly--
14:29
in theory, there's no reason to deal with spaces and punctuation
14:33
separately.
14:34
You could just say every space gets its own token,
14:37
every punctuation gets its own token,
14:40
and you can just do all the merging.
14:42
The problem is that-- so there's an efficiency question.
14:45
Actually, training these tokenizers takes a long time.
14:48
So you better-- because you have to consider every pair of token.
14:51
So what you end up doing is saying if there's a space,
14:54
this is very-- like pre-tokenizers
14:55
are very English specific.
14:57
You say if there's a space, we're
14:58
not going to start looking at the token that came before
15:01
and the token that came afterwards.
15:03
So you're not merging in between spaces.
15:06
But this is just like a computational optimization.
15:10
You could theoretically just deal with it
15:12
the same way as you deal with any other character.
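A pre-tokenizer of this kind can be approximated with a regular expression. The rule below is a simplified, hypothetical one (a word keeps its leading space, punctuation becomes its own chunk); merges would then be learned inside each chunk only, never across chunk boundaries.

```python
import re

def pre_tokenize(text):
    """Split text into chunks; BPE merges are then computed within each chunk."""
    # Alternatives: optional space + word, run of punctuation, run of whitespace.
    return re.findall(r" ?\w+|[^\w\s]+|\s+", text)

chunks = pre_tokenize("The mouse ate the cheese.")
print(chunks)  # ['The', ' mouse', ' ate', ' the', ' cheese', '.']
```

The split is lossless: joining the chunks gives back the original text.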
15:15
And--
15:15
Yeah.
15:16
When you merge tokens, do you delete the tokens that you merged away,
15:19
or do you keep the smaller tokens that were merged?
15:22
You actually keep the smaller tokens.
15:25
I mean, in reality, it doesn't matter much because usually
15:29
on a large corpus of text, you will have actually everything.
15:32
But you usually keep the small ones.
15:34
And the reason why you want to do that
15:36
is because if-- in case there's, as we said before, you have
15:38
some grammatical mistakes or some typos,
15:41
you still want to be able to represent
15:43
these words by character.
15:46
So, yeah.
15:47
Yes.
15:48
Are the tokens unique?
15:51
So I mean, say in this case T-O-K-E-N is there only one
15:54
occurrence or could--
15:56
do you need to leave multiple occurrence so they could have--
16:00
take on different meanings or something?
16:02
Oh I see what you say.
16:03
No, it's every token has its own unique ID.
16:08
So a usual-- this is a great question.
16:11
For example, if you think about a bank, which
16:13
could be bank for like money or bank like water,
16:16
it will have the same token.
16:18
But the model will learn, the transformer
16:19
will learn that based on the words that are around it,
16:22
it should associate that--
16:24
I'm saying-- I'm being very handwavy here,
16:26
but associate that with a representation that
16:30
is either more like the bank money side or the bank water
16:33
side.
16:34
But that's a transformer that does that.
16:36
It's not a tokenizer.
16:38
Yes.
16:39
Yes.
16:39
So you mentioned during tokenization,
16:41
keep the smaller tokens you started with, right.
16:43
Like if you start with a T you keep the T
16:45
and then you build your tokenize out to
16:47
[INAUDIBLE] allow input token.
16:49
So let's say maybe you didn't train on token, but in your data
16:53
you are trying to encode token.
16:54
So how does the tokenizer know to encode it with token or to
16:58
[INAUDIBLE]?
16:59
Yeah.
16:59
Great question.
17:00
You basically when you-- so when you tokenize,
17:02
so that's after training of the tokenizer
17:04
when you actually apply the tokenizer
17:06
you basically always choose the largest token
17:10
that you can apply.
17:11
So if you can do token, you will never do T,
17:13
you will always do token.
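The "always choose the largest token" rule can be sketched as a greedy longest-match encoder. The vocabulary below is hypothetical, and this is only one way to apply a tokenizer: many BPE implementations instead replay the learned merges in order, which can give different splits.

```python
def encode(text, vocab):
    """Greedily pick the longest vocabulary entry that matches at each position."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens

# Hypothetical vocabulary: single characters are kept alongside merged tokens.
vocab = set("tokenizrx") | {"to", "tok", "token", "izer"}

print(encode("tokenizer", vocab))  # ['token', 'izer']
print(encode("tokne", vocab))      # ['tok', 'n', 'e']
```

The second call shows why the small tokens are kept: a typo still gets encoded, just with more tokens.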
17:15
But there's actually-- so people don't usually
17:18
talk that much about tokenizers, but there's
17:20
a lot of computational benefits or computational tricks
17:24
that you can do for making these things faster.
17:27
So I really don't think we-- and honestly, I
17:29
think a lot of people think that we should just get away
17:31
from tokenizers and just kind of tokenize character
17:34
by character or bytes by bytes.
17:36
But as I said, right now there's this issue of length,
17:39
but maybe one day, like in five or 10 years,
17:42
we will have different architectures
17:43
that don't scale quadratically with the length of the sequence.
17:46
And maybe we'll move away from tokenizers.
17:50
So can you share with us the drawback?
17:53
Why do people want to move away from the tokenizer?
17:57
Yeah.
17:58
So I think one good example is math.
18:03
If you think about math, actually numbers right now
18:06
are not tokenized.
18:07
So for example, 327 might have its own token, which
18:10
means that models, when they see numbers,
18:13
they don't see them the same way as we do.
18:15
And this is very annoying because I mean,
18:17
the reason why we can generalize with math
18:19
is because we can deal with every letter separately
18:22
and we can then do composition.
18:24
Where you know that basically if you add stuff,
18:26
it's the same thing as adding every one separately
18:28
plus like whatever the unit that you add.
18:30
So they can't do that.
18:32
So then you have to do special tokenization.
18:35
And, like, one of the big changes that GPT4 did
18:39
is changing the way that they tokenize code.
18:42
So for example, if you have code, you know you have often,
18:46
in Python, these four spaces at the beginning.
18:48
Those were dealt with strangely before.
18:52
And as a result, like, the model couldn't really
18:54
understand how to deal with code.
18:57
So tokenizers actually matter a lot.
19:00
OK, so I'll move on right now, but we can come back later
19:04
on tokenizers.
19:05
Great.
19:06
So we talked about the task, the loss, the tokenizer,
19:08
let's talk a little bit about evaluation.
19:11
So the way that LLMs are usually evaluated
19:13
is what we call-- is using what we call perplexity.
19:16
At a high level it's basically just your validation loss.
19:20
The slight difference with perplexity
19:21
is that we use something that is slightly more interpretable,
19:24
which is that we use the average per token loss,
19:27
and then you exponentiate it.
19:29
And the reason why you exponentiate it
19:30
is because you want--
19:32
I mean, the loss has a log inside and you--
19:35
like one humans are actually pretty
19:36
bad at thinking in log space.
19:38
But two logs depend on the base of the log
19:41
while when you exponentiate you basically have everything
19:44
in the vocabulary size unit.
19:48
And the average per token is just so
19:50
that your perplexity is independent of the length
19:52
of your sequence.
19:54
So perplexity is just two to the power average
19:57
of the loss of the sequence.
20:00
So perplexity is between one and the length of the vocabulary
20:04
of your tokenizer.
20:05
One is simply, well, if you predict perfectly
20:08
every word, then every word
20:11
will have basically products of ones.
20:14
So the best perplexity you can have is one.
20:16
If you really have no idea, you basically
20:18
predict with one divided by size of vocabulary
20:22
and then you do simple math and you basically
20:24
get perplexity of size of vocabulary.
20:26
So the intuition of perplexity is
20:28
that it's basically the number of tokens
20:30
that your model is, kind of, hesitating between.
20:32
So if your model is perfect, it doesn't hesitate.
20:35
It knows exactly the word.
20:36
If it really has no idea, then it
20:38
hesitates between all of the vocabulary.
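The two bounds can be checked numerically. In this sketch the vocabulary size is a made-up number, and the loss uses the natural log with `exp`; using log base 2 with "two to the power" gives the same perplexity, which is the base independence mentioned above.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability given to the true tokens."""
    avg_loss = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_loss)

vocab_size = 50_000  # hypothetical

perfect = perplexity([1.0, 1.0, 1.0])        # model is certain and right
uniform = perplexity([1 / vocab_size] * 3)   # model has no idea
print(perfect, uniform)  # about 1 and about vocab_size
```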
20:43
So perplexity really improved.
20:46
That's perplexity on a standard data set between 2017 and 2023.
20:50
It went from a kind of 70 tokens to less than 10 tokens
20:54
over these five, six years.
20:56
So that means that the model was previously
20:58
hesitating between 70 words every time it was generating a word,
21:02
and now it's hesitating between less than 10 words.
21:05
So that's much better.
21:06
Perplexity is actually not used anymore
21:08
in academic benchmarking, mostly because it depends
21:11
on the tokenizer that you use.
21:12
It depends on the actual data that people are evaluating on.
21:16
But it's still very important for development of LLMs.
21:19
So when you actually train your own LLM people
21:21
will still really look at the perplexity.
21:26
One common other way and now more common in academia
21:30
of evaluating these LLMs is just by taking all the classical NLP
21:34
benchmarks, and I'll give you a few examples later and just,
21:37
kind of, aggregating everything.
21:39
So collect as many automatically evaluatable benchmarks
21:43
and just evaluate across all of them.
21:46
So one such-- or actually two such
21:50
benchmarks are what we call HELM, which is from Stanford.
21:54
And another one is the Hugging Face open leaderboard,
21:56
which are probably the two most common ones right now.
22:00
So just to give you an idea, in HELM,
22:02
all of these type of tasks, which
22:04
are mostly things that can be easily evaluated
22:08
like question answering.
22:09
So think about many different question answering tasks.
22:13
And the benefit with question answering
22:15
is that you usually know what is the real answer.
22:18
So you can-- the way that you evaluate these models
22:20
and I'll give you a concrete example in one second,
22:22
is that you can just look at how likely the language model is
22:26
to generate the real answer compared to some other answers.
22:30
And that's essentially, at a high level,
22:31
how you evaluate these models.
22:33
So to give you a specific example,
22:35
MMLU is probably the most common academic benchmark for LLMs.
22:42
And this is just a collection of many question
22:45
and answers in all of those domains.
22:47
For example, college medicine, college physics,
22:50
astronomy and these type of topics.
22:52
And the questions are things like, so this is in astronomy.
22:55
What is true for type-1a supernova?
22:58
Then you give four different potential answers
23:01
and you just ask the model which one is more likely.
23:04
So there are many different ways of doing it.
23:06
Either you can look at the likelihood of generating
23:09
all these answers, or you can ask the model
23:11
which one is the most likely.
23:12
So there are different ways that you can prompt the model,
23:15
but at a high level, you know which one is correct.
23:17
And there are three other mistakes.
23:20
Yes.
23:22
Creating unconstrained text as an output.
23:24
Yeah.
23:25
How do you evaluate a model if it
23:28
gives something that's semantically completely
23:31
identical, but is not the exact tokens that you expect?
23:35
Yeah.
23:36
So that's a great question.
23:37
I'll talk more about that later.
23:38
Here, in this case, we don't do unconstrained.
23:41
So the way you would evaluate MMLU is basically either
23:44
you ask the first question, and then you
23:47
look at the likelihood of the model generating A,
23:50
the likelihood of the model generating B, C, and D
23:53
and you look at which one is the most likely.
23:55
Or you can ask the model out of A, B, C, D,
23:58
which one is the most likely.
23:59
And you look at whether the most likely next token is A, B,
24:03
C, or D. So you constrain the model
24:05
to say it can only answer these four things.
24:09
You say you constraint--
24:10
Yeah.
24:11
You constrain the prompt or do you
24:13
mean of its whole probability distribution
24:15
that it outputs you only comparing
24:17
the outputs of like-- you're only comparing the A token the
24:19
[INAUDIBLE].
24:20
Yeah.
24:20
So in the second case I gave you, you would do exactly the--
24:24
actually would do both.
24:25
You would prompt the model saying A, B, C, or D
24:27
plus you would constrain to only look at these four tokens.
24:32
In the first case, you don't even need to generate anything.
24:34
So in the first case, you literally just
24:36
look, given it's a language model,
24:38
it can give a distribution over sentences.
24:40
You just look at what is the likelihood of generating
24:43
all of these words?
24:45
What is the likelihood of generating the second choice?
24:48
And you just look at whether the most likely sentence is actually
24:52
the real answer.
24:54
So you don't actually sample from it,
24:56
you really just use P of X1 to XL.
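The first, likelihood-based way of scoring multiple-choice questions might look like the sketch below. Everything concrete in it is hypothetical: the questions, the table `TOY_SCORES`, and the function names. A real scorer would sum the model's log-probabilities of the answer tokens given the question instead of looking up a table.

```python
def pick_answer(question, choices, log_likelihood):
    """Return the index of the choice the model scores as most likely."""
    scores = [log_likelihood(question, choice) for choice in choices]
    return scores.index(max(scores))

def accuracy(examples, log_likelihood):
    """Fraction of questions where the most likely choice is the gold answer."""
    correct = sum(
        pick_answer(question, choices, log_likelihood) == gold
        for question, choices, gold in examples
    )
    return correct / len(examples)

# Made-up log-likelihoods standing in for a real language model.
TOY_SCORES = {
    ("2 + 2 = ?", "3"): -4.1, ("2 + 2 = ?", "4"): -0.3,
    ("2 + 2 = ?", "5"): -3.9, ("2 + 2 = ?", "22"): -5.0,
    ("3 * 3 = ?", "6"): -1.0, ("3 * 3 = ?", "9"): -1.5,
    ("3 * 3 = ?", "12"): -4.0, ("3 * 3 = ?", "33"): -6.0,
}

def toy_log_likelihood(question, answer):
    return TOY_SCORES[(question, answer)]

examples = [
    ("2 + 2 = ?", ["3", "4", "5", "22"], 1),   # gold answer is index 1
    ("3 * 3 = ?", ["6", "9", "12", "33"], 1),  # gold answer is index 1
]
print(accuracy(examples, toy_log_likelihood))  # 0.5 with these made-up scores
```

Nothing is sampled here: the model is only asked how likely each fixed answer is.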
24:59
Does that make sense?
25:01
That being said, evaluation of open-ended questions
25:05
is something we're going to talk about later,
25:06
and it's actually really important
25:08
and really challenging.
25:09
Yes.
25:10
Earlier you mentioned [INAUDIBLE] metrics
25:13
like perplexity are not usually
25:16
used because it depends on how you do
25:18
your tokenization, some design choices.
25:21
I was wondering if you could speak more to that.
25:24
Yeah.
25:25
So think about perplexity.
25:26
I told you perplexity is between 1 and vocabulary size.
25:30
So now imagine that ChatGPT uses a tokenizer that has 10,000
25:34
tokens but Gemini from Google uses a tokenizer that had
25:38
100,000 potential tokens.
25:41
Then actually the Gemini one will have the upper bound
25:45
of the perplexity that you can get is actually worse for Gemini
25:48
than for ChatGPT.
25:50
Does that make sense?
25:52
So that's just an idea.
25:53
It's actually a little bit more complicated than that,
25:55
but that's just one first-order bit
25:58
of where you can see that the tokenizer actually matters.
26:02
Great.
26:05
OK, so evaluation challenges.
26:07
There are many.
26:08
I'll just talk about two really briefly.
26:10
One, as I told you, there are two ways of doing evaluation
26:13
for these MMLUs.
26:14
Actually, there are many more than two
26:16
but I gave you two examples.
26:17
And it happens that for a long time,
26:20
even though that was a very classical benchmark
26:22
that everyone uses actually different companies
26:27
and different organizations were actually
26:32
using different ways of evaluating MMLU.
26:34
And as a result, you get completely different results.
26:37
For example, Llama-65b, which was the first model from Meta
26:42
in the Llama series, had on HELM 63.7 accuracy
26:47
but on this other benchmark had like 48.8.
26:53
So really the way that you evaluate, and this is not even
26:55
talking about prompting this is really just the way
26:58
that you evaluate the models.
27:01
Prompting is another issue.
27:02
So really, there are a lot of inconsistencies.
27:04
It's not as easy as it looks.
27:07
First thing.
27:08
Yeah, sorry.
27:08
How can we make sure that all these models
27:10
are not trained on the benchmark?
27:13
Second thing.
27:14
This is a great question.
27:15
Train test contamination.
27:17
This is something which I would say
27:19
is really important in academia in--
27:24
given that the talk is mostly about training large language
27:26
models, for companies, it's maybe not that important
27:29
because they know what they trained on.
27:33
For us, we have no idea.
27:35
So, for us, it's a real problem.
27:37
So there are many different ways of trying
27:39
to test whether the test set--
27:42
or sorry, whether the test set was actually
27:44
in the training set.
27:45
One, kind of, cute trick that people in the lab,
27:51
in [? Tatsuo's ?] lab have found, is that what you can do
27:54
is that given that most of the data set online
27:57
are not randomized, you can just look at--
28:00
and that language models, what they do is just
28:02
predict the next word.
28:03
You can just look at the entire test set.
28:06
What if you generate all the examples
28:09
in order versus all the examples in a different order.
28:13
And if it's more likely to generate a thing in order, given
28:17
that there's no real order there,
28:19
then it means that probably it was in the training set.
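A rough sketch of that ordering trick, written as a simple shuffle comparison. This is an illustration of the idea, not the lab's actual procedure: the two scoring functions are hypothetical stand-ins for a model's log-likelihood of the whole test set written out in a given order.

```python
import random

def order_test(examples, log_likelihood, num_shuffles=100, seed=0):
    """Fraction of shuffled orders scored at least as high as the original order."""
    rng = random.Random(seed)
    original = log_likelihood(examples)
    at_least_as_likely = 0
    for _ in range(num_shuffles):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        if log_likelihood(shuffled) >= original:
            at_least_as_likely += 1
    return at_least_as_likely / num_shuffles

CANONICAL = ["q1", "q2", "q3", "q4", "q5", "q6", "q7", "q8"]

def memorized_log_likelihood(examples):
    # Hypothetical contaminated model: rewards adjacent pairs seen in canonical order.
    seen_pairs = set(zip(CANONICAL, CANONICAL[1:]))
    return sum(1.0 for pair in zip(examples, examples[1:]) if pair in seen_pairs)

def order_blind_log_likelihood(examples):
    # Hypothetical clean model: the score does not depend on the order.
    return -float(len(examples))

print(order_test(CANONICAL, memorized_log_likelihood))    # small fraction
print(order_test(CANONICAL, order_blind_log_likelihood))  # 1.0
```

A small fraction means the published order is unusually likely under the model, which hints that the test set was seen during training.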
28:21
Does that make sense?
28:23
So there are many-- that's like one of them.
28:24
There are many other ways of doing it.
28:26
Train test contamination, again, not
28:28
that important for development, really important for
28:30
academic benchmarking.
28:33
Great.
28:33
So there are many other challenges,
28:34
but I'll move on for now.
28:37
Great.
28:38
Data.
28:40
So data is another really big topic.
28:43
At a high level people just say you basically
28:45
train large language models on all of internet.
28:48
What does that even mean?
28:50
So people sometimes say, well, of clean internet,
28:53
which is even less defined.
28:55
So internet is very dirty and really not representative
28:59
of what we want in practice.
29:00
If I download a random website right now,
29:03
you would be shocked at what is in there.
29:06
It's definitely not your Wikipedia.
29:08
So I'll go really briefly on what people do.
29:14
I can answer some questions, but I mean,
29:16
data is on its own it's a huge topic.
29:19
Basically, first what you do is download all of internet.
29:22
What that means is that you use web crawlers that
29:25
will go on every web page, on internet or every web page that
29:29
is on Google.
29:31
And that is around 250 billion pages right now.
29:36
And that's around 1 petabyte of data.
29:39
So actually, Common Crawl is one web crawler.
29:42
So people don't usually write their own web crawlers
29:45
what they do is that they use standard web crawlers,
29:47
and Common Crawl is one of them that basically every month adds
29:51
all the new websites that were added on internet that are found
29:56
by Google, and they put it in a big basically a big data set.
30:00
So that's-- on Common Crawl, you have around 250 billion pages
30:04
right now.
30:04
So 1E6 gigabytes of data.
30:07
Once you have this--
30:09
so this is a random web page.
30:11
Like literally random from this Common Crawl.
30:14
And what you see is that one, it really
30:16
doesn't look like the type of thing that you would usually see,
30:18
but actually-- so this is an HTML page.
30:21
It's hard to see, but if you look through
30:24
it, you will see some content.
30:26
For example, here, Test King World
30:30
is your ultimate source for the system x high performance
30:33
server.
30:34
And then you have three dots.
30:35
So you don't even-- the sentence is not even finished.
30:37
That's what the random internet looks like.
30:40
So, of course, it's not that useful
30:42
if you just train a large language model
30:44
to generate things like this.
30:45
So what are some of the steps that are needed?
30:48
First one, you extract the text from the HTML.
30:51
So that's what I just tried to do by looking
30:53
at basically the correct tags.
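As a rough sketch of the text-extraction step, the snippet below pulls visible text out of HTML with Python's standard `html.parser`. The set of skipped tags is an assumption made for illustration; real pipelines handle far more cases (math, tables, malformed markup).

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    # Tags whose content is dropped entirely (an illustrative choice).
    SKIPPED_TAGS = {"script", "style", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside every skipped tag.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def extract_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```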
30:55
There are a lot of challenges through this.
30:57
For example, extracting math is actually
30:59
very complicated, but pretty important for training
31:02
large language models.
31:03
Or for example, boilerplates.
31:05
A lot of your forums will have the same type of headers,
31:08
the same type of footers.
31:10
You don't want to repeat all of this in your data,
31:13
and then you will filter undesirable content.
31:16
So not safe for work, harmful content, PII.
31:20
So usually every company has basically
31:22
a blacklist of websites that they don't
31:26
want to train their models on.
31:27
That blacklist is very long and you basically
31:30
say if it comes from there, we don't train on this.
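A domain blocklist of this kind could look roughly like the sketch below. The domain names are hypothetical placeholders, and matching on the hostname plus its subdomains is one possible design, not a description of any company's actual filter.

```python
from urllib.parse import urlparse

# Hypothetical entries; a real list would be far longer.
BLOCKED_DOMAINS = {"spam.example", "adult.example"}


def is_blocked(url, blocked_domains=BLOCKED_DOMAINS):
    host = (urlparse(url).hostname or "").lower()
    # Match the domain itself and any subdomain of it.
    return any(host == d or host.endswith("." + d) for d in blocked_domains)


def filter_pages(pages):
    """Keep (url, text) pairs whose host is not on the blocklist."""
    return [(url, text) for url, text in pages if not is_blocked(url)]
```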
31:32
There are other ways of doing these things.
31:34
For example, you can train a small model for classifying what
31:36
is PII, removing these things.
31:39
It's hard.
31:40
Every point here that I'm going to show you
31:42
is a huge amount of work, but I'm just
31:46
going to go quickly through it.
31:48
So filter undesirable content.
31:50
Second or fourth is de-duplication.
31:54
As I said, you might have things like headers and footers
31:57
in forums that are always the same.
31:59
You want to remove that.
32:01
Another thing that you might have
32:02
is a lot of URLs that are different, but actually show
32:05
the same website.
32:08
And you might also have a lot of paragraphs that come from common
32:13
books that are basically duplicated 1,000 times
32:16
or 10,000 times on internet.
32:18
So you have to de-duplicate.
32:20
Also very challenging because you have to do that at scale.
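A minimal sketch of exact de-duplication at the paragraph level, assuming paragraphs are separated by blank lines. It hashes a normalized copy of each paragraph and keeps only the first occurrence; it does not catch near-duplicates, and it ignores the scale problem mentioned above, which is the hard part in practice.

```python
import hashlib
import re


def normalize(paragraph):
    # Collapse whitespace and lowercase so trivial variations hash the same.
    return re.sub(r"\s+", " ", paragraph).strip().lower()


def deduplicate_paragraphs(documents):
    """Drop any paragraph already seen earlier in the corpus."""
    seen = set()
    result = []
    for doc in documents:
        kept = []
        for paragraph in doc.split("\n\n"):
            norm = normalize(paragraph)
            if not norm:
                continue
            digest = hashlib.sha1(norm.encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept.append(paragraph.strip())
        result.append("\n\n".join(kept))
    return result
```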
32:24
Once you do the de-duplication, you
32:26
will do some heuristic filtering.
32:28
You will try to remove low-quality documents.
32:31
The way you do that are things like rules-based filtering.
32:35
For example, if you see that there are some outlier tokens.
32:37
If the distribution of tokens in the website
32:39
is very different than the usual distribution of tokens,
32:42
then it's probably some outlier.
32:43
If you see that the length of the words in this website
32:46
is super long, there's something strange going on that website.
32:49
If you see that the website has only three words,
32:52
is it worth training on it?
32:54
Maybe not.
32:54
If it has 10 million words, maybe there's something also
32:58
wrong going on that page.
33:00
So a lot of rules like this.
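Rules like these might be written as below. All thresholds are made-up values for illustration; the lecture does not give concrete numbers.

```python
def passes_heuristics(text, min_words=50, max_words=100_000, max_mean_word_len=12.0):
    """Return True if a document passes simple length-based quality rules."""
    words = text.split()
    if not words:
        return False
    # Too short to be useful, or suspiciously long.
    if not (min_words <= len(words) <= max_words):
        return False
    # Very long "words" often indicate encoded blobs or broken markup.
    mean_len = sum(len(w) for w in words) / len(words)
    return mean_len <= max_mean_word_len
```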
33:01
Yes.
33:02
Why do we filter out undesirable content
33:04
from our data set instead of putting it in as,
33:08
like, a supervised loss?
33:10
Can we not just say, here's this like, hate speech website,
33:14
let's actively try to--
33:17
let's actively penalize the model for getting it.
33:19
We'll do exactly that, but not at this step.
33:22
That's where the post-training will come in.
33:25
Pretraining the idea is just to say
33:30
I want to model, kind of, how humans speak, essentially.
33:34
And I want to remove all these headers, footers
33:36
and menus and things like this.
33:38
But it's a very good idea that you just had.
33:41
And that's exactly what we'll do later.
33:45
Next step, model-based filtering.
33:47
So once you filter a lot of data, what you will do--
33:50
that's actually a very cute trick.
33:51
You will take all of Wikipedia and you
33:54
will look at all the links that are
33:56
linked through Wikipedia pages.
33:58
Because probably if something is referenced by Wikipedia,
34:01
it's probably some high-quality website.
34:02
And you will train a classifier to predict whether something
34:07
comes from-- whether a document comes from one
34:10
of these references from Wikipedia
34:13
or whether it's from the random web.
34:15
And you will try to basically say,
34:17
I want more of the things that come from Wikipedia references.
34:21
Does that make sense?
34:23
So yeah.
34:24
So you will train a machine learning model.
34:26
Usually also very simple models because you
34:28
need to do that really at scale.
34:30
I mean, just think about the 250 billion pages.
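To make the idea concrete, here is a tiny bag-of-words naive Bayes classifier that scores how much a document resembles pages referenced by Wikipedia (label 1) versus random web pages (label 0). Naive Bayes is swapped in here only because it is simple and cheap; the lecture does not say which model is used.

```python
import math
from collections import Counter


class NaiveBayesQualityFilter:
    """Label 1 = resembles Wikipedia references, label 0 = random web."""

    def fit(self, texts, labels):
        self.word_counts = {0: Counter(), 1: Counter()}
        self.doc_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = set(self.word_counts[0]) | set(self.word_counts[1])
        return self

    def _log_joint(self, words, label):
        total = sum(self.word_counts[label].values())
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in words:
            # Laplace smoothing so unseen words do not zero out the probability.
            count = self.word_counts[label][word] + 1
            score += math.log(count / (total + len(self.vocab)))
        return score

    def quality_score(self, text):
        """Return P(label = 1 | text)."""
        words = text.lower().split()
        diff = self._log_joint(words, 0) - self._log_joint(words, 1)
        diff = max(min(diff, 700.0), -700.0)  # avoid overflow in exp
        return 1.0 / (1.0 + math.exp(diff))
```

Documents would then be kept or upweighted when `quality_score` is high.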
34:34
Next one, you will try to classify your data
34:37
into different domains.
34:41
You will say, OK, this is entertainment, this is books,
34:43
this is code, this is like these type of domains.
34:46
And then you will try to either up or down weight
34:51
some of the domains.
34:52
For example, you might say--
34:54
you might see that actually if you train more on code, then
34:57
actually your model becomes better on reasoning.
34:59
So that's something that people usually say in
35:01
a very hand-wavy way.
35:02
If you train your model more on code,
35:04
actually it helps reasoning.
35:05
So you want to upweight the coding distribution
35:08
because that helps for general language modeling skills.
35:11
Books is usually another one that people upweight.
35:16
Entertainment, they usually down weight.
35:18
So things like this.
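One simple way to express up- and down-weighting is to multiply each domain's natural share of the corpus by a weight and renormalize. The sizes and weights in the usage below are invented for illustration.

```python
import random


def mixture_proportions(domain_sizes, domain_weights):
    """Natural share of each domain times its up/down weight, renormalized."""
    masses = {d: domain_sizes[d] * domain_weights.get(d, 1.0) for d in domain_sizes}
    total = sum(masses.values())
    return {d: m / total for d, m in masses.items()}


def sample_domains(domain_sizes, domain_weights, num_samples, seed=0):
    """Draw the domain of each training document according to the mixture."""
    proportions = mixture_proportions(domain_sizes, domain_weights)
    domains = sorted(proportions)
    rng = random.Random(seed)
    return rng.choices(domains, weights=[proportions[d] for d in domains], k=num_samples)


# Made-up numbers: code and books upweighted, entertainment downweighted.
sizes = {"code": 10, "books": 10, "entertainment": 80}
weights = {"code": 3.0, "books": 2.0, "entertainment": 0.5}
proportions = mixture_proportions(sizes, weights)
```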
35:19
Of course, you want to do it-- so people used to do it, maybe
35:24
kind of heuristically.
35:25
Now there are entire pipelines that we'll
35:27
talk about of how to do these things slightly
35:30
more automatically.
35:33
And then at the end of training, you usually train--
35:37
after training on all of this data that we saw
35:40
you usually train on very high quality data
35:42
at the end of training your large language model where you
35:46
decrease your learning rate.
35:47
And that basically means that you're,
35:49
kind of, overfitting your model on a very high quality data.
35:52
So usually what you do there is Wikipedia.
35:55
You basically overfit on Wikipedia
35:57
and you overfit on, like, human data that was collected.
36:04
The other thing is like continual pretraining
36:06
for getting longer context.
36:07
I'm going to skip over all of these things.
36:09
But that's just to give you a sense of how hard it
36:12
is when people just say I'm going to train on internet,
36:15
that's a lot of work.
36:17
And, really, we haven't figured it out yet.
36:19
So collecting data well is a huge part
36:23
of practical large language models.
36:24
Some might say that it's actually the key.
36:26
Yes.
36:27
[INAUDIBLE] about data.
36:29
So basic question.
36:30
So usually when you start with like a petabyte of data,
36:33
after you go through all the steps,
36:35
what's the typical amount of data you have remaining.
36:37
And then how large a team does it typically
36:40
take to go through all the data steps you talked about?
36:43
Sorry how la-- is your question how large
36:45
is the data after you filter?
36:46
Yeah.
36:47
After you filter and then you go through all the steps.
36:49
How large a team do you need to go through, like,
36:52
all the filtration steps you mentioned.
36:54
How slow is it or--
36:56
How many people would you need to be
37:00
able to do this [INAUDIBLE]?
37:02
OK that's a great question.
37:03
I'm going to somewhat answer about the data.
37:06
How large is the data set at the end of this slide.
37:10
For number of people that work on it, that's a good question.
37:15
I'm actually not quite sure, but I would say, yeah,
37:19
I actually don't quite know but I
37:22
would say it's probably even bigger than the number of people
37:25
that work on the tuning of the pretraining of the model.
37:29
So the data is bigger than the modeling aspect.
37:34
Yeah, I don't think I have a good sense.
37:37
I would say probably in LLAMA's team, which have 70-ish people,
37:41
I would say maybe 15 work on data.
37:45
Yeah.
37:46
All these things, you don't need that many people,
37:48
you need a lot of compute also.
37:49
Because for data you need a lot of CPUs.
37:52
So, yeah.
37:53
And I'll answer the second question
37:54
at the end of this slide.
37:56
So as I just, kind of, alluded to really,
37:59
we haven't solved data at all for pretraining.
38:02
So there's a lot of research that has to be done.
38:04
First, how do you process these things super efficiently?
38:07
Second, how do you balance kind of all
38:09
of these different domains?
38:10
Can you do synthetic data generation?
38:12
That's actually a big one right now.
38:14
And because we don't have--
38:16
we'll talk about that later, but we don't have
38:18
enough data on the internet.
38:20
Can you use multimodal data instead of just text data?
38:23
And how does that improve even your text performance?
38:28
There's a lot of secrecy because, really, this
38:30
is the key of most of the pretraining large language
38:33
models.
38:34
So for competitive dynamics, usually these companies
38:39
don't talk about how they do the data collection.
38:41
And also there's a copyright liability issue.
38:44
They definitely don't want to tell you
38:45
that they've trained on books even though they did
38:47
because otherwise they could be sued.
38:50
Common academic benchmarks.
38:52
So that will, kind of, answer what you asked.
38:54
It started-- so those are the smaller ones.
38:57
The names are not that important,
38:58
but it started from around 150 billion tokens, which are
39:02
around 800 gigabytes of data.
39:04
And now it's around 15 trillion--
39:06
15 trillion tokens, which is also
39:09
the size of the models that are-- right now the best models
39:12
are probably trained on that amount of data.
39:14
So 15 trillion tokens, which is probably,
39:18
I guess, two orders of magnitude bigger than that.
39:20
So 80E3 gigabyte.
39:23
So that would be around 100 to 1,000 times filtering
39:29
of the Common Crawl, if I'm not mistaken.
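The size conversion can be checked with a few lines of arithmetic, taking the round numbers quoted here at face value. The bytes-per-token figure is derived from the 150 billion tokens and 800 gigabytes pair, which is an assumption about how the two relate.

```python
GB = 1e9

# Smaller dataset quoted in the lecture: 150B tokens, about 800 GB.
small_tokens, small_bytes = 150e9, 800 * GB
bytes_per_token = small_bytes / small_tokens  # about 5.3 bytes per token

# Larger dataset: 15T tokens at the same bytes-per-token rate.
large_tokens = 15e12
large_bytes = large_tokens * bytes_per_token  # about 80e3 GB

# Common Crawl figure quoted earlier: about 1 petabyte (1e6 GB).
common_crawl_bytes = 1e6 * GB
reduction_factor = common_crawl_bytes / large_bytes

print(f"{bytes_per_token:.2f} bytes/token, {large_bytes / GB:.0f} GB, "
      f"reduction about {reduction_factor:.1f}x")
```

With these inputs the byte ratio comes out near 12.5, lower than the 100 to 1,000 times mentioned. The inputs are rough, so read this only as an order-of-magnitude check.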
39:32
So, yeah.
39:34
One very famous one is the Pile.
39:37
So this is an academic benchmark, the Pile.
39:39
And we can just look at what distribution of data they have.
39:42
It's things like arXiv, PubMed Central,
39:46
which is all the biology stuff.
39:50
Here it's Wikipedia, you see Stack Exchange, some GitHub
39:55
and some books and things like this.
39:58
Again, this is on the smaller side.
39:59
So this is-- if we look at here, this is on 280B so, in reality,
40:03
it's like 100 times bigger so you cannot have that much
40:05
of GitHub and of Wikipedia.
40:09
In terms of closed source models.
40:11
Just to give you an idea, Llama 2
40:14
it was trained on 2 trillion tokens,
40:16
Llama 3 15 trillion tokens, which is currently
40:19
the best model that we know on how much it was trained on,
40:22
which is the same thing as is the best academic or the biggest
40:26
academic benchmark, which is 15 trillion tokens.
40:29
GPT-4, we don't really know, but it's probably
40:31
in the same order of magnitude or it's probably around that.
40:33
Actually, it's probably around 13 trillion, from leaks.
40:36
If the leaks are true.
40:39
Great.
40:41
So scaling laws.
40:43
Any other questions on data before we go to scaling laws?
40:48
Sorry I know I'm giving you a lot of information,
40:51
but there's a lot that goes into training large language models.
40:54
Great. Scaling laws.
40:56
So the idea is that what people saw around 2020, or at least
41:01
from a long time, but they've been able to theoretically show
41:05
it or empirically show it since 2020,
41:07
is that the more data you train your models on
41:09
and the larger the models, the better the performance.
41:12
This is actually pretty different than what
41:14
you've seen in this class.
41:15
In this class we teach you about overfitting.
41:17
Overfitting doesn't happen with large language models.
41:20
Larger models, better performance.
41:23
It's something that really took a long time
41:25
for the community who took this type of class to realize.
41:29
But for the exam, overfitting exists.
41:33
So, OK, the idea of scaling laws is that if-- given that more
41:38
data and larger models will always
41:40
give you better performance, can we
41:42
predict how much better your performance will
41:46
be if you increase the amount of data and the size of your model?
41:50
And surprisingly, it works.
41:52
So here you see three plots from a very famous paper called
41:55
Scaling Laws from OpenAI.
41:57
Here you see on the x-axis compute.
42:00
So how much did you train--
42:01
like, how much compute did you spend for training?
42:04
And here you see test loss.
42:05
So this is essentially, I mean, perplexity,
42:08
but it's your validation loss.
42:09
So it's a log of the perplexity.
42:11
And if you put these two on log scale,
42:15
then you see that the performance or the--
42:19
sorry, the scaling law is linear.
42:22
That means that if you increase your compute
42:25
by a certain amount, you can say by how much your test loss will
42:29
actually decrease.
42:30
Same thing with data and same thing for parameters.
42:33
If you increase the data set size,
42:35
your loss will decrease by an amount
42:38
that is somewhat predictable.
42:40
If you increase the number of parameters,
42:42
the loss will decrease by an amount,
42:44
which is somewhat predictable.
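A straight line on a log-log plot corresponds to a power law, so fitting a scaling law can be sketched as ordinary least squares on the logarithms. The helper names are illustrative, and any data points you feed in would come from your own training runs.

```python
import math


def fit_power_law(xs, losses):
    """Least-squares line in log-log space: log(loss) = intercept + slope * log(x)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in losses]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    variance = sum((a - mean_x) ** 2 for a in log_x)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_loss(x, slope, intercept):
    """Extrapolate the fitted line to a new compute, data, or parameter value."""
    return math.exp(intercept + slope * math.log(x))
```

Here `x` can be compute, dataset size, or parameter count; the slope is negative when loss falls as `x` grows.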
42:45
This is really amazing.
42:47
Very surprising.
42:49
I mean, it looks innocuous when you look at these type of plots,
42:52
but that's crazy because it means that you can predict
42:55
how well we're going to perform in two or three years,
42:58
depending on how much compute we will add,
42:59
assuming that these things will hold.
43:01
There's nothing theoretical about it.
43:04
Yes.
43:05
Two things.
43:06
One, what is the loss that they're using here.
43:08
Is this perplexity?
43:09
So it's-- I said perplexity was like 2 to the power of the loss.
43:13
So this is the log of the perplexity.
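The relationship between loss and perplexity, in code. Base 2 follows the "2 to the power of the loss" phrasing; if the loss is measured in nats, the base would be e instead.

```python
import math


def perplexity_from_loss(loss, base=2.0):
    """Perplexity is the base raised to the average per-token loss."""
    return base ** loss


def loss_from_perplexity(perplexity, base=2.0):
    """The loss is the logarithm of the perplexity in the same base."""
    return math.log(perplexity, base)
```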
43:17
And then the second thing is, when
43:19
you increase the number of parameters
43:21
or you increase the data set size [INAUDIBLE] data
43:24
[INAUDIBLE] times, doesn't that just inherently
43:26
increase your compute?
43:27
Like does all of this [INAUDIBLE] come to just how
43:30
[INAUDIBLE] you [INAUDIBLE]?
43:31
Yes.
43:31
--or something specific [INAUDIBLE]?
43:32
No, this is a great question.
43:33
So the compute here is actually a factor of two things, the data
43:37
and the parameter.
43:38
What I'm showing here is that you can--
43:40
well, actually, we're going to talk about that in details.
43:42
But basically, if you increase the number of parameters,
43:44
you should increase the number of data that you have.
43:48
So you actually don't go multiple times
43:50
to the same data set.
43:51
No one does epochs, at least not yet,
43:56
because we still have, kind of, enough data.
43:59
So yeah, this is all the same trend,
44:01
which is increase compute decrease loss.
44:04
Yes.
44:06
Have we seen the numbers for the last two years or this
44:09
is still holding?
44:10
It is still holding.
44:13
I don't have good numbers to show you,
44:16
but it is still holding, surprisingly.
44:20
Yes.
44:21
Is there no evidence that control quality density
44:23
will ever plateau?
44:25
In theory, we would expect it to plateau, [INAUDIBLE]?
44:28
No empirical evidence of plateauing anytime soon.
44:33
Why?
44:34
We don't know.
44:35
Will it happen?
44:37
Probably.
44:37
I mean, it doesn't need to because it's actually
44:39
in log scale.
44:40
So it's not like as if it had to go.
44:43
It had to plateau.
44:44
Like mathematically, it could continue decreasing like this.
44:47
I mean, most people think that it will probably
44:49
plateau at some point.
44:50
We don't know when.
44:54
So that's-- I'll talk more about scaling laws now.
44:57
So why are scaling laws really cool?
44:59
Imagine that I gave you--
45:02
you're very fortunate I gave you 10,000 GPUs for this month.
45:05
What model will you train?
45:07
How do you even go about answering that question?
45:09
And I mean, this is a hypothetical,
45:12
but that's exactly what these companies are faced with.
45:16
The old pipeline, which was basically
45:19
tune hyperparameters on the big models.
45:21
So let's say I have 30 days, I will train
45:24
30 models for one day each.
45:26
I will pick the best one and that will be the final model
45:30
that I will use in production.
45:32
That means that the model that I actually used
45:34
was only trained for one day.
45:36
The new pipeline is that you first find a scaling recipe.
45:40
So you find something that tells you, for example,
45:43
like one common thing is that if you increase
45:45
the size of your model, you should decrease your learning
45:46
rate.
45:47
So you find a scaling recipe such
45:49
that you know if I increase the size of my model,
45:52
here's what I should do with some hyperparameters.
45:55
Then you tune your hyperparameters
45:57
on smaller models of different sizes.
46:00
Let's say I will say for three days, of my 30 days,
46:03
I will train many different models.
46:05
And I will do hyperparameter tuning
46:07
on these small models, each of different sizes.
46:09
Then I will fit a scaling law and try
46:11
to extrapolate from these smaller models, which
46:15
one will be the best if I train it for much longer--
46:20
or sorry if I train it for a larger model.
46:22
And then I will train the final huge model
46:24
for 27 days instead of just one day.
46:28
So the new pipeline is not train things
46:31
or do hyperparameter tuning on the real scale of the model
46:34
that you're going to use in practice,
46:35
but do things on smaller ones at different scales.
46:39
Try to predict how well they will perform
46:41
once you make them bigger.
46:43
I will give-- I will give you a very concrete example right now.
46:46
Let's say transformers versus LSTMs.
46:49
Let's say you have these 10,000 GPUs,
46:51
you are not sure which one you should be using.
46:53
Should I be using a transformer-based model
46:55
or LSTM-based model.
46:56
What I will do is I will train transformers
46:58
at different scales.
47:00
So here you see different parameters on the x-axis,
47:02
y-axis is my test loss.
47:04
I will then train different LSTMs at different scales.
47:08
Once I have these points, I will see oh it, kind of,
47:11
fits a scaling law.
47:12
I will fit my scaling law and then
47:14
I will be able to predict if I had 10 times more compute,
47:18
here's how well I would perform for the LSTM.
47:21
It's actually slightly less linear for the LSTM,
47:23
but you can probably try to predict where you would end up.
47:26
And clearly from this plot, you would see
47:28
that transformers are better.
47:30
One thing to notice when you read these type of scaling laws
47:33
is that there are two things that are important.
47:35
One is really your scaling rate, which
47:40
is the slope of the-- the slope of the scaling law.
47:45
The other thing is your intercept,
47:49
you could start worse, but actually
47:52
become better over time.
47:53
It just happens that LSTMs are worse for both.
47:55
But I could show you another one where things--
47:58
you can predict that actually after a certain scale
48:01
you're better off using that type of model than others.
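Given two fitted scaling laws, each a line in log-log space with a slope and an intercept, the scale at which they swap places can be solved for directly. This is a sketch that assumes both fits use natural logarithms and the same x-axis.

```python
import math


def crossover_compute(slope_a, intercept_a, slope_b, intercept_b):
    """Compute value at which two log-log scaling lines predict the same loss.

    Returns None when the lines are parallel and never cross.
    """
    if slope_a == slope_b:
        return None
    # Solve intercept_a + slope_a * log(c) = intercept_b + slope_b * log(c).
    log_c = (intercept_b - intercept_a) / (slope_a - slope_b)
    return math.exp(log_c)
```

Past the crossover, the model with the steeper (more negative) slope has the lower predicted loss, even if it started worse.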
48:04
So that's why scaling laws are actually really useful.
48:08
Any questions on that?
48:12
Yeah.
48:12
So these are all, kind of, very--
48:15
how sensitive are these to small differences in the architecture.
48:18
Like one like transformer architecture
48:21
versus another transformer architecture.
48:23
Do you think we have to fit your own curve
48:26
and, basically, say like oh scaling laws tell me this should
48:28
be some logarithmic function.
48:31
Like, let me extrapolate that for
48:33
my own specific architecture.
48:35
Yeah, so usually, for example, if you're an academic
48:38
and you want to-- now at least that's pretty recent
48:40
and you want to propose a new activation.
48:43
That's exactly what you will do.
48:45
You will fit a scaling law, show another scaling law
48:47
with the standard, like, I don't know, GELU,
48:49
and you will say that it's better.
48:50
In reality, once you start thinking about it in scaling
48:53
laws terms, you really realize that actually
48:55
all the architecture differences that we
48:57
can make, like the small, minor ones, all they do
48:59
is maybe change a little bit the intercept.
49:03
But really that doesn't matter because just
49:05
train it for 10 hours longer or like wait for the next generation of
49:09
GPUs and these things are really secondary.
49:12
Which is exactly why I was telling you originally,
49:14
people spend too much time on the architecture and losses.
49:17
In reality, these things don't matter as much.
49:19
Data though.
49:19
If you use good data, you will have much better scaling laws
49:23
than if you use bad data.
49:24
So that really matters.
49:27
Another really cool thing you can do with scaling laws
49:29
is that you can ask yourself, how to optimally allocate
49:33
training resources.
49:35
Should I train larger models.
49:37
Because we saw that it's better when you train larger models,
49:39
but we saw that it's also better when you use more data.
49:42
So which one should I do?
49:43
Should I just train on more data, a smaller model,
49:46
or should I train a larger model on less data?
49:49
So Chinchilla is a very famous paper that first showed this.
49:53
The way they did it, I want to give you
49:55
a little bit of a sense of what these plots are.
49:58
Here you see training loss again. On the x-axis,
50:00
you see parameter differences, sorry, parameter size--
50:04
number of parameters.
50:04
So the size of the model.
50:06
And here all these curves are what
50:07
we call isoFLOPs, which is that all the models on this curve
50:13
have been trained with the same amount of compute.
50:17
The way that you do that is that you train--
50:19
you change.
50:20
Sorry, you vary the number of tokens that were trained on
50:22
and the size of the models, but you vary in such a way
50:25
that the total compute is constant, OK.
50:27
So all these curves that you see with different colors
50:29
have different amount of compute that were trained on.
50:32
Then you take the best one for each of those curves.
50:35
Once you have the best one for each of those curves,
50:38
you can ask-- you can plot how much flops it was
50:44
and which curve were you on and how much parameters
50:47
did you actually use for training that specific point.
50:50
You put that on the log log scale again and now
50:55
you fit a scaling law again.
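The procedure just described, picking the best model on each isoFLOP curve and then fitting a line in log-log space, can be sketched as follows. The tuple layout `(flops, num_params, loss)` is an assumed input format.

```python
import math


def best_params_per_budget(runs):
    """runs: iterable of (flops, num_params, loss).

    Returns {flops: num_params} for the lowest-loss run on each isoFLOP curve.
    """
    best = {}
    for flops, num_params, loss in runs:
        if flops not in best or loss < best[flops][1]:
            best[flops] = (num_params, loss)
    return {flops: params for flops, (params, _) in best.items()}


def fit_loglog(xs, ys):
    """Least-squares fit of log(y) = intercept + slope * log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    slope = sum((a - mx) * (b - my) for a, b in zip(lx, ly)) / sum((a - mx) ** 2 for a in lx)
    return slope, my - slope * mx


def optimal_params(flops, slope, intercept):
    """Extrapolate the fitted line to a new compute budget."""
    return math.exp(intercept + slope * math.log(flops))
```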
50:56
So now I have something which tells me
50:59
if I want to train a model of 10 to the power 23 flops, here is
51:03
exactly the number of parameters that I should be using.
51:06
100 B.
51:07
And you can do the same thing with flops and tokens.
51:11
So now you can predict--
51:13
if I tell you exactly I have one month of compute,
51:16
what size of model should I be training?
51:18
Fit the scaling law, and I tell you.
51:21
Of course that all looks beautiful.
51:23
In reality like there's a lot of small things of like,
51:26
should you be counting, like, embedding parameters,
51:29
there's a lot of complexities.
51:30
But if you do things well, these things actually do hold.
51:35
So the optimal ratio that the Chinchilla paper
51:38
found is to use 20 tokens for every parameter
51:42
that you train.
51:44
So if you add one more parameter,
51:45
you should train your thing on-- your model on 20 more tokens.
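Combining a fixed tokens-per-parameter ratio with the approximation compute = 6 x parameters x tokens (the rule of thumb used in the back-of-the-envelope section of this lecture) gives a closed-form split of a FLOP budget. Treat it as a rough sketch; the real analysis fits curves rather than assuming an exact ratio.

```python
import math


def allocate_compute(total_flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Split a FLOP budget into (num_params, num_tokens).

    Assumes C = 6 * N * D and D = r * N, so N = sqrt(C / (6 * r)).
    """
    num_params = math.sqrt(total_flops / (flops_per_param_token * tokens_per_param))
    num_tokens = tokens_per_param * num_params
    return num_params, num_tokens
```

Passing a larger `tokens_per_param`, such as the 150 mentioned for inference-aware training, yields a smaller model trained on more tokens for the same budget.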
51:49
So one caveat here is that this is optimal training resources.
51:53
So that is telling me if you have 10 to the power 23 flops
51:57
or if you have 100, I don't know how much that is, $100 million
52:00
or 10-- no, that's much less, actually.
52:02
Let's say I have $5 million to train
52:05
my best model that gets the lowest
52:07
loss what would I train on?
52:09
In reality, these companies need to think about inference also.
52:12
If you have a smaller model, they will spend less over time.
52:17
So actually, if you consider the inference cost,
52:20
you have other papers that try to show that, it's
52:23
around 150 parameters, sorry--
52:26
tokens per parameters, because you prefer having a smaller
52:29
model because over time you're going
52:32
to actually spend less money on inference of these models.
52:37
So 150 to 1, that's around what the best models are trained
52:42
on right now, at least the ones that are
52:45
used in practice in production.
52:49
Great.
52:51
Any questions on Chinchilla?
52:55
Great.
52:56
Oh sorry.
52:58
In practice, how expensive is inference for these models
53:01
relative to training?
53:03
Actually, very expensive.
53:05
I will not talk about inference because that would
53:07
be another entire lecture.
53:09
But just think about ChatGPT where
53:11
they have I don't know how much it is now,
53:14
like 600 million people that use it.
53:18
Like, that's a lot.
53:22
Yeah.
53:23
So it's actually very expensive.
53:24
There's a lot of optimization you can do for inference though.
53:27
And that's an entire other lecture.
53:29
I'm going to skip that this time, but it's very interesting.
53:33
OK, tuning.
53:34
As I said, there are many things that you
53:36
can answer with scaling laws.
53:38
I just try to give you two examples,
53:40
but really there are many things.
53:42
What data do you use.
53:43
What mixture-- what data mixing weighting you use.
53:46
The mixtures, that's what we talked about before.
53:49
What architecture you use, whether you should make
53:51
your models wider or deeper?
53:54
Should you be paying for more GPUs
53:56
or actually collecting more data?
53:58
All these things are things you can try
54:00
to answer with scaling laws.
54:03
One thing I want to say is the bitter lesson.
54:05
If you ever heard of Richard Sutton,
54:08
very famous blog post in 2019, what he realized,
54:12
which I think not enough people realize,
54:16
I didn't-- definitely did not realize at that time,
54:19
is that once you see these type of scaling laws you know that
54:23
the more compute you have, the better models you will get.
54:26
So with scale, you will get better model.
54:28
And you also know by Moore's law or these type
54:30
of variants of Moore's law that you will always
54:33
have better compute.
54:34
Then the only thing that matters is just
54:36
to have architectures that can leverage computation.
54:40
So what matters is basically systems data and less
54:44
so the architecture, like the small architecture
54:46
differences like, your activation and things like this.
54:49
So I think that's one of the reasons why most of research
54:52
focuses on some things that for industry matters less.
54:56
And I was one of those researchers
54:58
for a large part of my career.
55:02
So don't spend time over complicating.
55:04
Do the simple things, do it well.
55:07
Scale them.
55:08
That's really what OpenAI taught us with ChatGPT and with all
55:12
the GPTs before.
55:15
OK, I want to give you some back of the envelope computation.
55:18
So I might be off by a few factors here,
55:20
but I just want to give you a sense of how costly it is
55:23
to train some of these models.
55:25
I'll give us an example.
55:26
Llama 3 400B, which is currently the best open source model that
55:30
you can get.
55:31
It was trained on 15.6 trillion tokens.
55:35
It has 405 billion parameters.
55:37
So just now that you know what is
55:39
like this optimal tokens per parameter, that's around 40.
55:43
So that's a little bit more than Chinchilla,
55:45
but less than this like inference optimal model.
55:50
So they went for training optimality.
55:53
Flops for this model.
55:55
So one simple way to compute flops
55:57
is 6 times the number of parameters,
56:00
times the number of data that you train on.
56:03
So if you do the simple calculation here,
56:04
it's 3.8e25 flops.
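The calculation, written out with the numbers quoted in the lecture. The factor of 6 is the approximation stated above, not an exact count.

```python
def training_flops(num_params, num_tokens):
    """Approximate training compute: 6 * parameters * tokens."""
    return 6 * num_params * num_tokens


num_params = 405e9   # 405 billion parameters
num_tokens = 15.6e12  # 15.6 trillion tokens

llama3_flops = training_flops(num_params, num_tokens)  # about 3.8e25
tokens_per_param = num_tokens / num_params             # about 38.5
below_threshold = llama3_flops < 1e26                  # the threshold quoted in the lecture

print(f"{llama3_flops:.2e} FLOPs, {tokens_per_param:.1f} tokens per parameter")
```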
56:07
The reason why this is important is
56:09
that if you follow it a little bit, the news,
56:11
there's an executive order from Biden that basically
56:13
says that once you have 1e26 parameters, sorry, flops, then
56:19
you have special scrutiny on your models.
56:21
So they went to 2X less than that.
56:23
So they really went right below this
56:25
to not have special scrutiny.
56:27
So 3.8.
56:28
I might be off by a little bit, but it's definitely
56:30
under the 1 e26
56:36
So P is the number of parameters, N is the data, the number of tokens.
56:41
This is just an approximation.
56:46
Yeah.
56:48
OK.
56:49
Compute. We know that they trained on 16,000 H100s and we
56:55
know the throughput they set it to.
56:58
So if you do the computation, it takes around 70 days
57:02
or 26 million GPU hours.
57:05
At least that's what my back-of-the-envelope computation gives.
57:08
They actually said that they use 30 million
57:10
instead of 26 million GPU hours.
57:13
So maybe they had some challenges.
57:17
I don't really know.
57:18
But if you follow the simple computation,
57:20
it's around 70 days.
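A sketch of the time estimate. The lecture says the throughput is known but does not state it here, so the 400e12 FLOP/s per GPU below is an assumed value, chosen because it roughly reproduces the quoted 70 days and 26 million GPU hours.

```python
SECONDS_PER_HOUR = 3600


def training_time(total_flops, num_gpus, flops_per_gpu_per_second):
    """Return (wall-clock days, total GPU hours) for a training run."""
    seconds = total_flops / (num_gpus * flops_per_gpu_per_second)
    days = seconds / (24 * SECONDS_PER_HOUR)
    gpu_hours = num_gpus * seconds / SECONDS_PER_HOUR
    return days, gpu_hours


# 3.8e25 FLOPs on 16,000 H100s at an ASSUMED 400e12 FLOP/s each.
days, gpu_hours = training_time(3.8e25, 16_000, 400e12)
print(f"about {days:.0f} days, {gpu_hours / 1e6:.1f} million GPU hours")
```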
57:22
Cost.
57:24
I mean this it's hard to approximate,
57:27
but I'm just going to say it's, kind of, the rent.
57:29
Like, what if I wanted to rent that many H100s
57:33
for that many days, how much will I pay?
57:36
A lower bound on the renting cost of an H100
57:41
is around two hours--
57:42
$2 per hour.
57:43
So if you multiply this by 26,000,000 hours,
57:48
you get $52 million.
57:50
So they probably pay less than that,
57:52
but not actually much less because all these services
57:58
that actually rent GPUs, they don't make that much money.
58:00
So it's probably slightly less, but not that much less.
58:04
Now salary I said 50 employees, 500k per year.
58:10
Yeah it's probably the right ballpark.
58:12
$25 million.
58:13
So if you put it all together, it's around $75 million
58:17
for training this llama model.
58:21
I'm probably off by like 10 million,
58:22
but that's kind of right ballpark.
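The cost estimate as arithmetic, using the figures quoted above. All of them are the lecture's rough guesses, so the total is only a ballpark.

```python
def training_cost(gpu_hours, dollars_per_gpu_hour, num_employees, salary_per_year):
    """Return (compute cost, salary cost, total) in dollars."""
    compute = gpu_hours * dollars_per_gpu_hour
    salaries = num_employees * salary_per_year
    return compute, salaries, compute + salaries


compute, salaries, total = training_cost(
    gpu_hours=26e6,            # 26 million GPU hours
    dollars_per_gpu_hour=2.0,  # lower bound on H100 rental
    num_employees=50,
    salary_per_year=500_000,
)
print(f"compute ${compute / 1e6:.0f}M + salaries ${salaries / 1e6:.0f}M = ${total / 1e6:.0f}M")
```

The sum is $77 million, consistent with the "around $75 million" figure given the stated margin of error.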
58:27
Carbon emitted.
58:29
A lot of people might ask like also the cost is not
58:32
the only thing that is important.
58:33
So I did the computation.
58:35
It's around 4000 tons of CO2 equivalent.
58:42
That is actually only 2000 return tickets
58:45
from JFK to London.
58:47
So right now carbon emitted is actually not--
58:51
I mean, it's huge, but it's not meaningful yet.
58:56
I think in maybe GPT-6, GPT-7, once you multiply this
59:01
by 100, that might become a real issue.
59:04
Right now it's still not, I think,
59:07
an issue in the grand scheme of things.
59:09
Next model. The way you should be thinking about these models is
59:12
that every new generation, the number of flops essentially
59:16
multiplies 10x, or at least that's what they try if they
59:19
have enough energy.
59:20
And if they can buy enough GPUs.
59:23
Great.
59:23
Any questions on this back-of-the-envelope math?
59:29
No.
59:30
OK.
59:31
So now we talked about pretraining,
59:34
I wanted to also chat about systems
59:36
because now we know compute is really important so there's
59:39
a question of how do you optimize the--
59:41
how do you optimize the compute?
59:43
I will leave that for the end because I'm not
59:45
sure how much time we will have.
59:46
I think it's important, but hopefully I'll
59:48
be able to talk about it later.
59:50
It's slightly different than what we've
59:52
been talking about right now.
59:54
So I'll move on to post-training for now.
59:56
So the task of post-training, the reason why
59:59
we need to do post training is, as I told you
1:00:01
before, it's to make AI assistants.
1:00:06
So language modeling is not really the thing
1:00:09
that you want when you have an AI assistant.
1:00:12
For example, if you ask GPT-3, which
1:00:14
is a purely language model--
1:00:16
a pure language model, not an aligned one.
1:00:20
If you ask a question like "explain the moon landing
1:00:22
to a six-year-old," the completion that you would get
1:00:26
is something like "explain the theory of gravity to a six-year-old."
1:00:29
Because what it learned is that on the internet,
1:00:31
if you have one question, you usually
1:00:33
have maybe another bullet point of other similar questions;
1:00:36
you don't usually have a question and then an answer after it.
1:00:39
This is not what you want from an AI assistant.
1:00:42
So how do we do this alignment, which
1:00:46
is this post training and making these models assistants?
1:00:49
So the goal of this alignment is to basically get
1:00:52
LLMs to follow the instructions that
1:00:55
are given by users and maybe some designers',
1:01:00
kind of, desires.
1:01:02
So think about moderation.
1:01:04
You don't want the model-- like OpenAI
1:01:06
doesn't want the model to say stuff that is very toxic.
1:01:09
So here you see on the left-hand side
1:01:12
that when you ask a question, it actually provides a real answer.
1:01:15
So it's not like before with the pure LLM.
1:01:17
And on the right-hand side, you see that it would--
1:01:20
if you ask it to write a tweet describing how a certain part
1:01:25
of the population are evil, it will say that it cannot do that.
1:01:29
So that's kind of this alignment.
1:01:32
The background here is that basically the data
1:01:38
that you want for training some of these models is--
1:01:41
like, we know what we want.
1:01:42
Which is just asking humans, this is a question,
1:01:44
this is the answer that you want.
1:01:46
But the thing is that it's very expensive to collect that data,
1:01:48
and it's hard to find it online.
1:01:51
In contrast, pretraining data is not what you want,
1:01:54
but there's a lot of it.
1:01:56
So what we will do, or the main idea is simply
1:01:59
take a pretrained large language model,
1:02:01
pretrained on all of the internet, and then just fine tune.
1:02:03
So you just change a little bit the weights on the type of data
1:02:06
that you actually want.
1:02:07
And hopefully, given that you already
1:02:08
pretrained it on all of the internet,
1:02:10
it basically learns or knows how to speak in English
1:02:13
and knows standard language syntax,
1:02:18
so then you can really fine tune it with very little data.
1:02:23
OK, SFT.
1:02:24
So Supervised Fine Tuning is really exactly what I just said.
1:02:27
Which is the idea of fine-tuning the large language
1:02:29
model on basically the desired answers that
1:02:33
are collected from humans.
1:02:35
So why is it called supervised fine tuning?
1:02:37
Because you basically want to do language modeling on the real
1:02:41
answers.
1:02:41
So language modeling is this like next word prediction,
1:02:44
and that's the fine tuning part.
1:02:45
And then you want to do it on desired answers given by humans
1:02:48
so that's why we call it supervised.
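To make "language modeling on the desired answers" concrete, here is a minimal sketch of the loss. It assumes the model has already produced a log-probability for each correct next token, and it averages the negative log-likelihood over the answer tokens only. Masking out the prompt tokens is a common choice and an assumption here, not something stated in the lecture; the function name and the numbers are hypothetical.

```python
import math


def sft_loss(token_log_probs, is_answer_token):
    """Mean negative log-likelihood over the answer tokens only.

    token_log_probs: log-probability the model gives to each correct next token.
    is_answer_token: True where the token belongs to the human-written answer.
    """
    picked = [lp for lp, keep in zip(token_log_probs, is_answer_token) if keep]
    if not picked:
        raise ValueError("need at least one answer token")
    return -sum(picked) / len(picked)


# Hypothetical values: two prompt tokens followed by two answer tokens.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.125)]
mask = [False, False, True, True]
loss = sft_loss(log_probs, mask)
print(f"SFT loss: {loss:.4f}")
```

This is the same next-word prediction objective as in pretraining; only the data it is computed on changes.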
1:02:51
So how do we collect this data?
1:02:52
Well, I just said it.
1:02:54
You just ask humans to tell you, this
1:02:57
is a question, this is the answer that you would
1:02:59
want from some of these models.
1:03:00
So this is an example.
1:03:03
I can't read very well on my computer,
1:03:04
but my kid needs to do a science--
1:03:08
no let's read this one.
1:03:09
Can you write a short introduction
1:03:11
about the relevance of the term monopsony?
1:03:13
And then it says monopsony refers to a market
1:03:15
structure, blah blah, blah.
1:03:16
And that's human-written there.
1:03:19
So, actually, this is Open Assistant,
1:03:20
which was a way to collect data online by humans.
1:03:27
So this type of supervised fine tuning or alignment
1:03:31
is really the key of ChatGPT.
1:03:33
This is what made the big jump from GPT-3, which was mostly
1:03:37
something that was known by AI researchers
1:03:40
to ChatGPT, which became known by basically everyone.
1:03:46
So the problem with human data is
1:03:51
that it's very slow to collect and very expensive.
1:03:56
So one possible simple idea is to use
1:04:00
LLMs to scale data collection.
1:04:03
So that's exactly what we did with Alpaca one year ago.
1:04:06
What we did is that we asked humans,
1:04:09
so we used a dataset of human question-answer pairs.
1:04:11
So there were 175 question-answer pairs here,
1:04:15
and we asked the best model at the time,
1:04:16
text-davinci-003, to basically generate many more of these
1:04:21
question and answers.
1:04:22
So all we did is say, "this is what humans would write, now
1:04:25
write similar answers and similar questions."
1:04:27
And we collected 52,000 LLM-generated question answers.
1:04:32
And then what we did is simply we took LLaMA 7B,
1:04:34
which was the best pre-trained model at the time.
1:04:36
And we just fine tuned this with supervised fine tuning,
1:04:39
as I told you.
1:04:39
And that's how we got the Alpaca 7B model.
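The data-generation loop described here can be sketched roughly as follows. This is a hypothetical illustration, not the actual Alpaca pipeline: the prompt wording, the function names, and the `generate` callable (a stand-in for whatever LLM you would call) are all assumptions.

```python
import random


def build_generation_prompt(seed_pairs, num_examples, rng):
    """Show a few human-written pairs and ask for one more in the same style."""
    chosen = rng.sample(seed_pairs, num_examples)
    parts = ["Here are question-answer pairs written by humans."]
    for i, (question, answer) in enumerate(chosen, start=1):
        parts.append(f"Example {i}\nQuestion: {question}\nAnswer: {answer}")
    parts.append("Now write one new, similar question and its answer.")
    return "\n\n".join(parts)


def generate_dataset(seed_pairs, generate, target_size, num_examples=2, seed=0):
    """Call the (hypothetical) LLM once per new example."""
    rng = random.Random(seed)
    return [
        generate(build_generation_prompt(seed_pairs, num_examples, rng))
        for _ in range(target_size)
    ]


# Toy seed set and a fake LLM so the sketch runs without any external service.
seed_pairs = [
    ("What does algorithm mean?", "A step-by-step set of instructions."),
    ("What is a monopsony?", "A market with a single buyer."),
    ("What is a token?", "A unit of text that a model reads."),
]


def fake_llm(prompt):
    return "Question: (generated)\nAnswer: (generated)"


dataset = generate_dataset(seed_pairs, fake_llm, target_size=5)
print(len(dataset), "generated examples")
```

In the setup described in the talk, the seed set had 175 human-written pairs and the loop was run until 52,000 generated pairs were collected.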
1:04:44
And this is the type of data that we collected.
1:04:47
So things like, what does algorithm mean?
1:04:49
An algorithm is a step-by-step set of instructions
1:04:53
you use to solve a problem or achieve a goal, blah, blah,
1:04:55
blah, blah.
1:04:56
So the data is not actually-- it's actually pretty good,
1:04:58
given that it was generated by LLMs from essentially two
1:05:02
generations ago.
1:05:04
So that really started at least for us
1:05:07
as an academic replication of ChatGPT.
1:05:10
Now it really-- there's a big field
1:05:12
of synthetic data generation of how
1:05:15
to use LLMs to basically make development of LLMs faster.
1:05:21
And basically by decreasing the amount of human hours that
1:05:24
you need.
1:05:26
Quantity of data.
1:05:28
So we talked about what type of data and how we collect it.
1:05:31
One thing which is surprising with SFT
1:05:33
is that you don't need that much data.
1:05:36
So what this paper showed, it's called LIMA,
1:05:38
is that if you scale the amount of data that you use for
1:05:43
supervised fine tuning from 2,000 to 32,000,
1:05:46
it really doesn't help much.
1:05:47
So here scaling laws definitely don't help.
1:05:49
And so the intuition here is that all you learn
1:05:55
is you learn how to format your desired answers.
1:05:58
Another way of saying it is that your pre-trained models, they
1:06:02
essentially model the distribution of every user
1:06:04
on the internet, one that might write bullet points,
1:06:07
another one that might answer-- answer
1:06:09
a question with an answer.
1:06:10
So all you tell your model is like, wait,
1:06:13
you should actually be optimizing
1:06:14
more for this type of user than another one.
1:06:17
So you're not actually teaching it--
1:06:18
you're not teaching anything through this SFT, so
1:06:23
supervised fine tuning, all you do
1:06:25
is you tell the model to optimize for one type of user
1:06:28
that it saw already in the pretraining dataset.
1:06:30
So the knowledge is already in the pretrained LLM
1:06:33
and you basically just specialize to one type of user.
1:06:37
Great.
1:06:38
Any questions on SFT?
1:06:40
Yes.
1:06:41
So I know it's a big issue with synthetic data
1:06:45
where if you keep generating data from the same distribution,
1:06:49
eventually you're not learning a new distribution,
1:06:51
you're essentially playing with it.
1:06:52
Just bootstrapping that.
1:06:53
Yeah.
1:06:55
Surely you can't scale that forever, right.
1:06:57
You can't keep going on and generating
1:06:59
from the same distribution.
1:07:00
You hope to learn something new.
1:07:01
Yeah.
1:07:02
So are there-- it's an active area of research
1:07:05
but any thoughts that you have around
1:07:06
how people are maybe thinking around this and better ways
1:07:10
to bootstrap?
1:07:11
Or to give up on this idea and realize that the chart shows
1:07:15
you don't need that many so just get humans to generate
1:07:17
2000 really good prompts.
1:07:19
Yeah.
1:07:20
So that's a very good question.
1:07:21
So for the data stuff, so I'm saying
1:07:23
it's not that important for SFT, but there
1:07:25
will be another thing we'll talk about right after where actually
1:07:28
data does matter.
1:07:29
My intuition, based on not that many empirical results,
1:07:33
is that you can still get, even though you use your LLMs,
1:07:38
if you use purely LLM generated text
1:07:40
and you do that for like three or four generations of LLMs,
1:07:43
I agree with you that probably you won't improve much.
1:07:45
But for me what is important is how do you use humans in the loop
1:07:48
with LLMs?
1:07:49
Not purely LLMs, not purely humans,
1:07:53
but maybe what you can do is just
1:07:54
have the model regenerate some new text
1:07:56
and just have humans write a few edits.
1:07:59
Edits are much faster than writing the entire text.
1:08:01
And I think that if you have that type of collaboration,
1:08:04
then from an information theoretical point of view,
1:08:07
you still get additional information,
1:08:09
but you're still much faster than if you use humans.
1:08:11
And I think that as a field we'll
1:08:13
probably move towards these types of things, which is really
1:08:17
just finding the examples that are important and asking humans.
1:08:20
It's kind of active learning, just
1:08:22
asking humans exactly when you need to get their inputs.
1:08:28
Yes.
1:08:28
Do we train with the same loss function
1:08:30
and the same general training algorithm
1:08:32
for the supervised fine tuning bit
1:08:34
as we do for the pretraining?
1:08:36
Because the examples you showed, I
1:08:39
think the important thing of the good examples
1:08:43
is like super factually accurate.
1:08:45
Like there's these more complex things
1:08:46
and it's still just like [INAUDIBLE].
1:08:48
Same loss.
1:08:49
So that's why here--
1:08:50
yeah, I didn't-- maybe didn't emphasize enough.
1:08:52
This is just language modeling.
1:08:53
Fine tune the LLM with language modeling on the desired answers.
1:08:56
So this is literally the same loss.
1:08:59
It will be different in two seconds,
1:09:01
but the first step of SFT is literally
1:09:04
the same loss where you just say, OK, I
1:09:06
want to actually specialize on that type of data.
1:09:08
So there's even a question of what is pretraining,
1:09:10
what is post-training?
1:09:11
Because, in reality, it's just different data
1:09:13
that you use.
1:09:13
The reason why we usually call it post-training is that the way
1:09:16
we collect that data is very different.
1:09:18
Great, great questions.
1:09:20
Yes.
1:09:22
Maybe it's the same question, but why would
1:09:24
these 2000 examples have such an overweighted influence
1:09:28
on fine tuning?
1:09:30
So that's why we--
1:09:31
also that's another reason why we call it post-training
1:09:33
is that we use different types of hyperparameters.
1:09:35
So, I told you basically at the end
1:09:37
of pretraining you essentially end up
1:09:38
with a learning rate of 0.
1:09:40
Here, you're going to increase your learning rate.
1:09:42
So like 1e-5, 1e minus-- yeah.
1:09:44
And so the weight that you give to them is actually different.
1:09:52
OK.
1:09:54
Second step or second part of this post training
1:09:57
is what we call reinforcement learning
1:10:00
from human feedback or RLHF.
1:10:02
Some of you might have heard of that.
1:10:05
The idea is that SFT has a problem, namely that you
1:10:09
do behavioral cloning, which means that you just try to clone
1:10:12
what the humans would say.
1:10:14
And that has many issues.
1:10:16
One of them is that you're bound by human abilities.
1:10:19
So humans actually won't generate the things
1:10:26
that they think are actually the best things to generate.
1:10:28
So if you ask me to write a book,
1:10:30
I mean, I can definitely enjoy a book.
1:10:32
I can probably say one book is better than another,
1:10:34
but I'm definitely not going to be as good at writing the book
1:10:37
that I want to read.
1:10:37
So you're going to be bound by the human ability
1:10:39
to generate things, even though the humans might be better
1:10:42
at distinguishing between things.
1:10:43
That's one issue.
1:10:44
Issue number two, which I find actually pretty interesting,
1:10:47
is hallucination--
1:10:49
if you've ever heard of the word hallucination, this
1:10:51
is LLMs generating fake-- like false information.
1:10:55
Hallucination might-- at least people
1:10:57
have hypothesized that it can come from the supervised fine tuning
1:11:02
even if you do supervised fine tuning on data that is correct.
1:11:06
And the reason why is that,
1:11:09
as I told you, SFT is done with very little data,
1:11:13
and it's data from which the model
1:11:15
doesn't learn anything new.
1:11:17
So what if the human gives an answer that the model didn't
1:11:21
know was true.
1:11:23
From the model perspective, the human basically
1:11:26
is telling the model generate this thing that seems plausible
1:11:30
but you actually have no idea if it's true or not.
1:11:34
So just to give you a very concrete example,
1:11:36
if we go back to this monopsony example,
1:11:39
can you write blah blah blah about monopsony?
1:11:41
Imagine that the human wrote a reference to some book.
1:11:46
And that book might exist.
1:11:47
That might be a correct reference,
1:11:49
but what if the LLM never saw this reference
1:11:51
during pretraining.
1:11:52
Then it doesn't know that it's a correct reference.
1:11:54
So really what you tell the model
1:11:56
is to generate or make up some plausible sounding reference
1:12:00
rather than actually tell the real reference
1:12:03
that it saw during pretraining.
1:12:05
So hallucination might be caused by this SFT.
1:12:12
So that's problem number two.
1:12:14
Does that all make sense?
1:12:15
Great.
1:12:16
Problem number 3, price.
1:12:18
Generating the ideal answers is very pricey.
1:12:21
And that comes back to your question
1:12:23
of humans writing the entire answer is actually
1:12:26
pretty expensive.
1:12:28
So that's why RLHF comes in.
1:12:30
The idea is that instead of cloning the behaviors of humans,
1:12:34
we're going to maximize human preference.
1:12:37
And the way we're going to do that, so the pipeline,
1:12:39
is that for a certain-- for every instruction,
1:12:42
you're going to ask a model to generate two answers
1:12:45
and usually use a pretty good model.
1:12:48
So you usually don't use a pretrained LLM here, you use an SFT fine tuned--
1:12:52
you use an already fine tuned LLM to give pretty good answers.
1:12:56
And then you ask labelers which of these two answers was better?
1:13:01
So select the preferred one.
1:13:02
And then with different types of algorithms,
1:13:05
we're going to talk about the algorithms, you just fine
1:13:07
tune the model to generate more of the green thing
1:13:10
than the red thing.
1:13:10
So more of the good stuff.
1:13:12
So now the question is how and we're
1:13:14
going to talk about that right now.
1:13:17
So there are two ways that we're going to talk about
1:13:20
and the two that are mainly used in the community.
1:13:23
The first one is simply the idea of using reinforcement learning.
1:13:26
So hopefully you all know what reinforcement learning is now.
1:13:30
So when you think about using reinforcement learning,
1:13:33
one important question is like, what is the reward
1:13:35
that we're optimizing.
1:13:36
So in this case, there are really two options
1:13:38
that I could think about.
1:13:39
The first one, you could just say,
1:13:41
I'm going to compare the output generated by some baseline with
1:13:44
the output generated by my model.
1:13:46
And I'm just going to ask the human to say which one is better
1:13:49
and I'm going to use this as a reward.
1:13:51
So if I'm better than the baseline,
1:13:53
this is a plus 1, if not, it's a minus 1.
1:13:55
So now it's a binary reward.
1:13:57
The problem with binary reward is that it's very sparse
1:13:59
and you don't get much information out of it.
1:14:01
Like maybe your answer was slightly better,
1:14:04
maybe it was like way better and you don't really
1:14:07
know from this how much better it was.
1:14:10
So option 2 is that you can train
1:14:13
what we call a reward model, which is simply a classifier.
1:14:16
So you use machine learning to classify
1:14:19
how much better two outputs are from the preference--
1:14:24
from the perspective of the human.
1:14:26
So this is a little bit meta, but what you basically
1:14:29
do is that you train--
1:14:31
you take a reward model, which is just a large la-- also
1:14:37
a large classifier, and you basically ask this reward model,
1:14:41
you give it the input and the actual output
1:14:43
that you have, one of the two outputs.
1:14:45
And you just exponentiate that so that's the softmax loss
1:14:49
that you all know about.
1:14:50
And now you divide by the exponentiated reward
1:14:56
on the first example--
1:14:58
I'm sorry, on the first output and this
1:15:00
is on the second output.
1:15:01
And you basically train--
1:15:02
so the reason why you do that is that you train your model,
1:15:05
you train this reward model to be
1:15:07
able to classify how much better one output is than another one.
1:15:13
So another slightly less convoluted way of saying it
1:15:16
is that your reward model will output
1:15:19
some reward that will be used as the logits of your softmax.
1:15:22
So now if you have high logits in your softmax,
1:15:25
it means that it's highly likely this output is better.
1:15:32
So that's what we call the Bradley-Terry model.
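The softmax over two rewards described above can be written out directly. A minimal sketch, assuming the reward model has already returned one scalar per (input, output) pair; the function names are hypothetical.

```python
import math


def preference_probability(reward_a: float, reward_b: float) -> float:
    """P(a preferred over b) = exp(r_a) / (exp(r_a) + exp(r_b)).

    This equals a sigmoid of the reward difference, which is how it is
    computed here to avoid overflow for large rewards.
    """
    diff = reward_a - reward_b
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def reward_model_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the human's choice under the model."""
    return -math.log(preference_probability(reward_preferred, reward_rejected))


print(preference_probability(2.0, 0.0))  # above 0.5: output a is likely better
print(reward_model_loss(2.0, 0.0))       # small loss: model agrees with the label
print(reward_model_loss(0.0, 2.0))       # larger loss: model disagrees
```

Training the reward model means minimizing this loss over all labeled pairs, so the rewards end up acting as the logits of the softmax.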
1:15:34
Yes.
1:15:35
Will this reward model [INAUDIBLE]
1:15:36
lower the entire output, or is it going to [INAUDIBLE]?
1:15:40
So this takes the entire--
1:15:45
yeah, this takes the entire output at once.
1:15:46
So it takes all the input and all the output
1:15:48
and it gives one number.
1:15:50
Yes.
1:15:51
So [INAUDIBLE] reward model, where would the human be then?
1:15:55
Sorry.
1:15:55
With the reward model, where would the human be?
1:15:58
Like--
1:15:58
I see.
1:16:00
OK sorry.
1:16:01
Maybe I wasn't clear.
1:16:02
You train this reward model to fit this green and red
1:16:08
preference from humans.
1:16:09
So basically you train a classifier
1:16:11
to say whether the humans prefer red or green.
1:16:15
But instead of using the binary reward, which
1:16:18
is what the human would tell you, you basically use
1:16:20
the logits of the softmax.
1:16:23
And the thing with the logits is that logits are continuous.
1:16:26
So now you know that if your reward model said
1:16:29
it has high logits, then, in some ways,
1:16:31
the human highly preferred this answer to some other answer.
1:16:36
Great.
1:16:38
So as I just said, continuous information is better.
1:16:41
So that's what people use in practice or at least
1:16:44
used to use in practice.
1:16:45
I'll tell you about the other algorithm later.
1:16:48
So what you do at the end is that you basically
1:16:50
try to just use reinforcement learning that you know about.
1:16:53
Now we know we have a reward.
1:16:55
What you sample through is the generation
1:16:58
from your large language model.
1:16:59
And then you just use some regularization term.
1:17:02
So the reason why we do this regularization term
1:17:04
is for avoiding what we call overoptimization.
1:17:06
So this reward model might not
1:17:08
really represent-- might not perfectly
1:17:10
model human preferences.
1:17:12
So you don't want to maximize this thing
1:17:14
to essentially infinity.
1:17:17
And you do it using PPO, which is a common reinforcement
1:17:22
learning algorithm.
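The quantity being maximized can be sketched as a reward minus a penalty for drifting away from the starting model. The lecture only says "some regularization term"; using the log-ratio between the policy and a frozen reference model, scaled by a coefficient beta, is the common choice and is an assumption here. The function names and sample values are hypothetical.

```python
def regularized_reward(reward, logp_policy, logp_reference, beta=0.1):
    """Reward minus beta times the log-ratio of policy to reference model."""
    return reward - beta * (logp_policy - logp_reference)


def estimate_objective(samples, beta=0.1):
    """Average regularized reward over generations sampled from the policy.

    samples: list of (reward, logp_policy, logp_reference) tuples.
    """
    values = [regularized_reward(r, lp, lr, beta) for r, lp, lr in samples]
    return sum(values) / len(values)


# Hypothetical samples: the second one scores well but has drifted far
# from the reference model, so the penalty cancels part of its reward.
samples = [(1.0, -2.0, -2.0), (2.0, -1.0, -6.0)]
print(estimate_objective(samples, beta=0.2))
```

PPO itself adds rollouts, clipping, and more on top of this objective; the sketch only shows what is being optimized, not how.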
1:17:24
One thing to note here, because it will be important for later,
1:17:27
is that when we use maximum likelihood--
1:17:32
sorry, now the large language models
1:17:34
are actually a policy for your reinforcement learning.
1:17:38
It's not maximizing likelihood anymore.
1:17:41
Which means that you're not modeling any distribution
1:17:43
anymore.
1:17:43
And the reason why this is important
1:17:45
is that models that went through this type of PPO
1:17:48
actually don't give you likelihoods
1:17:51
of text that are meaningful.
1:17:52
Because what you optimize them to do
1:17:54
is basically just optimize for generating
1:17:56
the most likely thing, not optimize for modeling
1:18:00
all the answers that humans might say.
1:18:02
Another way of saying that is that there's
1:18:04
nothing that incentivizes here the model to not give
1:18:09
a single possible generation.
1:18:11
Nothing here says it's good if you have some distribution
1:18:15
with some entropy.
1:18:18
If you haven't followed, it's not that important but just good
1:18:20
to know.
1:18:22
Great.
1:18:23
So PPO is exactly what ChatGPT did originally.
1:18:27
So here, in their blog post, what
1:18:30
they have is step one, do supervised fine tuning, which
1:18:33
now you all know about.
1:18:34
Step two, train a reward model on human preferences.
1:18:38
Step three, do PPO multiple steps,
1:18:40
which is where you see this blue arrow.
1:18:43
So you continue-- you train the model once with the PPO,
1:18:45
you collect new data, you continue.
1:18:47
And that's why-- and that's exactly what ChatGPT did.
1:18:50
And that was the big breakthrough
1:18:52
between GPT-3 and ChatGPT.
1:18:55
One thing to note is that PPO has many challenges.
1:18:58
Reinforcement learning is something that
1:19:00
is super nice theoretically.
1:19:02
In practice, anyone who ever worked
1:19:03
with reinforcement learning knows it's such a mess.
1:19:06
There's a lot of things like rollouts, outer loops,
1:19:09
clipping, so many complications.
1:19:11
So it's messy.
1:19:13
This is the idealized PPO used for LLM settings,
1:19:15
so that's already much more complicated
1:19:17
than this expectation we saw before.
1:19:19
And in practice it's actually much more complicated.
1:19:21
So we have one implementation of it that we had to do,
1:19:23
and I'm not going to go through it.
1:19:25
But basically you have so much stuff that you
1:19:27
have to think about when you implement
1:19:29
that type of PPO algorithm.
1:19:31
So you have clipping everywhere, you have a lot of complexities
1:19:34
and things are not well documented.
1:19:37
All this to say that there was a new method that
1:19:41
was proposed also from Stanford one year ago
1:19:44
called DPO, which is essentially a simplification of PPO.
1:19:49
And the way-- what they did or the idea that they have
1:19:53
is that instead of using reinforcement learning,
1:19:56
you can just maximize the probability of generating
1:19:58
the stuff that you like and minimize
1:20:00
the probability of the stuff that you don't like.
1:20:02
So if you think about the human preference, the red and green,
1:20:05
maximize green, minimize red.
1:20:08
So the loss is actually this one where what you see
1:20:12
is simply some log of the model.
1:20:16
So this is the likelihood of a model generating the things
1:20:19
that the human preferred, given the inputs.
1:20:23
And what you try to do is basically
1:20:25
maximize the likelihood of generating the things that you
1:20:30
like, minimize the likelihood of the things that you don't like.
1:20:33
All the rest of the terms here, they're not too important.
1:20:36
It's actually really not that complicated to understand.
1:20:39
But at a high level, it's really just maximizing the things
1:20:42
you like, minimizing the rest.
1:20:45
And one thing to note, which I was going to say just here,
1:20:49
is that actually all the rest is chosen such
1:20:51
that the global minima of PPO and the global minima
1:20:56
of like this DPO, under some assumptions,
1:20:59
are essentially equivalent.
1:21:01
So this is the right thing to do mathematically.
1:21:04
I'm not going to go through the derivations,
1:21:06
but that's the right thing to do.
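For reference, here is a minimal sketch of the DPO loss for a single preference pair, in the form it is commonly written. It assumes you already have the summed log-probability of each full answer under the model being trained and under a frozen reference model; the argument names and the default beta are hypothetical choices.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (preferred, rejected) pair of answers.

    Raises the log-probability of the preferred answer and lowers that of
    the rejected one, each measured relative to the reference model.
    """
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


# Before any training the model equals the reference, so the loss is log(2).
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# After moving probability toward the preferred answer, the loss drops.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

There is no reward model and no sampling loop here: the loss is computed directly on the labeled pairs with ordinary gradient descent.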
1:21:08
It's pretty different from PPO in the sense that now--
1:21:10
with PPO, what you had to do is collect the human preferences,
1:21:13
then train a reward model with maximum likelihood,
1:21:16
then use reinforcement learning.
1:21:17
Now all you do is basically maximum likelihood.
1:21:19
Much simpler.
1:21:20
Yes.
1:21:21
I mean, yeah.
1:21:21
So it seems like this is A, much simpler and B, like,
1:21:24
what you would just intuitively do with [INAUDIBLE]?
1:21:27
Why did they start with this reward model.
1:21:29
Like what led them to doing that?
1:21:31
I think it's a great question.
1:21:33
I don't really know.
1:21:34
What I can tell you is that
1:21:35
At ChatGPT the people who did basically
1:21:41
this PP-- sorry, who did ChatGPT initially
1:21:44
are the ones who actually wrote PPO.
1:21:47
And I think they were just-- like,
1:21:48
there are a lot of reinforcement learning people.
1:21:50
And I think that for them it was very intuitive.
1:21:54
So there's also some additional potential benefits.
1:21:58
For example, I don't want to--
1:22:00
yeah, for example, if you use the reward model,
1:22:03
the cool thing here with reinforcement learning
1:22:04
is that you can use unlabeled data with the reward model.
1:22:08
So here you can only use the labeled data for doing DPO--
1:22:12
For PPO-- for PPO, you first train your reward model
1:22:15
and then you can use unlabeled data
1:22:18
where the reward model will basically
1:22:19
label this unlabeled data.
1:22:21
So this additional, kind of, potential--
1:22:25
there could be potential improvements.
1:22:26
In practice it happens that there are none.
1:22:29
And I think just that a lot of people in this team
1:22:32
were reinforcement learning experts, including
1:22:35
the main author of PPO, John Schulman.
1:22:39
So much simpler than PPO, and it basically performs as well.
1:22:43
So now this is the standard thing that people use.
1:22:46
At least in the open source community,
1:22:47
I believe it's actually the standard also in industry.
1:22:51
So that's called DPO.
1:22:53
Gains. So those are all the papers on the left.
1:22:57
Here this is on the summarization task.
1:22:59
You see, all I want to show you is
1:23:01
that basically the pretrained models were OK
1:23:04
and they improve with scale.
1:23:05
If you do supervised fine tuning,
1:23:07
you improve them a little bit more,
1:23:08
if you do PPO or something with RLHF human feedback,
1:23:12
you get performance that is, oftentimes
1:23:15
depending on a benchmark, even better than humans.
1:23:18
So this is the human reference summaries.
1:23:21
Same thing.
1:23:22
This is from a paper that we have, AlpacaFarm, where
1:23:25
the evaluation here is not too important,
1:23:27
but you basically see the pretrained model.
1:23:29
You jump to SFT and then you jump to PPO and DPO, and PPO and
1:23:33
DPO have the exact same performance.
1:23:36
So basically RLHF helps.
1:23:38
That's, kind of, the conclusion and DPO is simple.
1:23:42
Data.
1:23:43
The way that you collect that type of data.
1:23:46
First idea is just use humans as we already talked about.
1:23:51
Guidelines are very complicated for what
1:23:53
humans should be labeling, and it's really not that easy.
1:23:55
And actually, if you ever do some of the labeling,
1:23:58
you will see that it's extremely complicated.
1:24:01
Like if I zoom in to this.
1:24:03
Here, I have a question: tell me about self-driving cars.
1:24:07
And you read both: self-driving cars
1:24:09
are vehicles that are capable of detecting
1:24:10
the surroundings, blah, blah blah, blah.
1:24:12
Self driving cars are cars that are equipped
1:24:13
with sensors, blah blah, blah to navigate
1:24:15
without the need for a driver.
1:24:16
I mean, both seem OK.
1:24:18
Which one is better?
1:24:19
It's actually hard to say at a glance.
1:24:21
And as a result, the problem with humans
1:24:24
is that you will start optimizing
1:24:27
a lot of high-level features.
1:24:28
For example, the second one is longer.
1:24:30
I can guarantee you that most humans will choose
1:24:32
the second one, even though I mean,
1:24:34
maybe the first one is better.
1:24:35
I don't know.
1:24:36
I haven't read it carefully.
1:24:38
So challenges of humans.
1:24:39
First, slow and expensive.
1:24:42
Second, as I just mentioned, it's hard to focus on things
1:24:46
that matter, like correctness.
1:24:47
And people usually look at things
1:24:49
that don't matter as much like the form, like length.
1:24:53
And as a result, so what I show here
1:24:55
is that when you do RLHF, the more you do RLHF,
1:24:58
the longer the output of the models becomes.
1:25:01
So if you've ever been annoyed at ChatGPT
1:25:03
answering you super long sentences,
1:25:05
this is because of RLHF.
1:25:08
Annotator distribution shift.
1:25:11
Like the distribution of annotators
1:25:12
that you use matters a lot, and you have to think,
1:25:15
like, who are even the humans that we want
1:25:17
to represent in these models?
1:25:20
Another question is crowdsourcing ethics.
1:25:22
Like usually these-- basically a lot
1:25:25
of the labeling that is done, the people who do them
1:25:29
are not paid well and they have to go
1:25:31
through a lot of toxic data because you basically
1:25:33
want the model to avoid saying the toxic data.
1:25:36
So crowdsourcing ethics too.
1:25:40
So many challenges with human data.
1:25:43
So what we did, also last year, is again,
1:25:46
the same thing as Alpaca, just the idea of like oh well, there
1:25:48
are challenges with humans, maybe
1:25:50
we can just replace them with LLMs.
1:25:51
So what we did is simply replace--
1:25:55
I see that.
1:25:56
I'm just realizing that the slides are not centered.
1:25:58
Anyways, you replace human preferences with LLM preferences.
1:26:02
So here, on this figure, you see on the x-axis, the price
1:26:06
that we paid for collecting human data.
1:26:09
It's around $300 for 1,000 examples.
1:26:12
And this is with Mechanical Turkers, who are usually
1:26:15
like cheaper than maybe some of the other companies
1:26:19
that you could go through.
1:26:20
And on the y-axis, it's basically
1:26:22
the agreement with other humans, with the mode of other humans.
1:26:27
And what you see is that actually, as I told you before,
1:26:29
labeling is really complicated.
1:26:30
Humans agree with themselves only around 66%
1:26:34
of the time on a binary task.
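One plausible way to compute "agreement with the mode of other humans" is a leave-one-out comparison: for each annotator, check whether their label matches the most common label among everyone else on the same example. This is a sketch under that assumption, not necessarily the exact metric from the paper; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def agreement_with_mode(labels_per_example):
    """Fraction of labels that match the mode of the other annotators' labels."""
    agree = 0
    total = 0
    for labels in labels_per_example:
        for i, label in enumerate(labels):
            others = labels[:i] + labels[i + 1:]
            counts = Counter(others).most_common()
            if not counts:
                continue  # only one annotator, nothing to compare against
            if len(counts) > 1 and counts[0][1] == counts[1][1]:
                continue  # tie among the others, so the mode is undefined
            agree += label == counts[0][0]
            total += 1
    return agree / total if total else float("nan")


# Toy data: four annotators choose output "a" or "b" on two examples.
toy_labels = [
    ["a", "a", "a", "b"],
    ["a", "b", "a", "b"],
]
score = agreement_with_mode(toy_labels)
print(f"agreement with mode of others: {score:.3f}")
```

The same function can score an LLM judge by treating its label as one more annotator and comparing it against the mode of the human labels.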
1:26:36
And it's not that the humans are not good
1:26:38
here because we were five main authors on this paper.
1:26:41
We tried to label this data ourselves,
1:26:43
and we only had, like, 67 or 68% accuracy, even though we
1:26:47
talked-- like, we talked for like three hours about how
1:26:50
we should be doing labeling.
1:26:51
But really, it's complicated.
1:26:52
It's not an easy task.
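To make the "agreement with the mode of other humans" metric concrete, here is a minimal sketch. It is an assumption about how such a number could be computed, not the paper's actual evaluation code: the function name and the toy labels are hypothetical, and each annotator is simply scored against the majority label of the remaining annotators on the same example.

```python
from collections import Counter

def leave_one_out_agreement(labels_per_example):
    """Fraction of labels that match the mode of the OTHER annotators' labels.

    labels_per_example: list of lists; each inner list holds the binary
    labels (0 or 1) that different annotators gave to the same example.
    Cases where the other annotators are evenly split are skipped.
    """
    matches, total = 0, 0
    for labels in labels_per_example:
        for i, label in enumerate(labels):
            others = labels[:i] + labels[i + 1:]
            if not others:
                continue
            counts = Counter(others).most_common()
            # Skip when there is no unique mode among the others.
            if len(counts) > 1 and counts[0][1] == counts[1][1]:
                continue
            matches += int(label == counts[0][0])
            total += 1
    return matches / total

# Hypothetical toy data: 3 examples, 4 annotators each.
toy_labels = [[1, 1, 1, 0], [0, 1, 0, 0], [1, 0, 1, 1]]
agreement = leave_one_out_agreement(toy_labels)
print(f"agreement with the mode of others: {agreement:.2f}")
```

The same function could score an LLM judge by treating its label as one more "annotator" and comparing it against the human mode.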
1:26:54
And here I just showed many different models.
1:26:56
And, basically, you see that models are much cheaper,
1:26:59
and they can actually get higher agreement
1:27:01
with the mode of humans than humans themselves.
1:27:04
And the reason why is because humans have a lot of variance,
1:27:06
models have no variance.
1:27:08
So they might be a little bit more biased
1:27:09
but have less variance.
1:27:11
So it works surprisingly well.
1:27:13
And now it's, kind of, the standard
1:27:14
in the open-source community.
1:27:16
I think even in industry a lot of people
1:27:18
use both humans and LLMs for improving
1:27:21
the collection of RLHF data.
1:27:24
And this is like-- this is the paper from last year,
1:27:27
but honestly, now it's more like the LLMs would be around this
1:27:30
agreement, and this costs around,
1:27:32
I would say, 50x less than humans, with better agreement with humans
1:27:36
than humans themselves.
1:27:39
OK.
1:27:39
So that gets us to evaluation of post training.
1:27:45
That goes back to your initial question
1:27:46
at the beginning of the lecture.
1:27:48
How do you evaluate something like ChatGPT?
1:27:50
The answers that GPT could give are basically unbounded.
1:27:54
And it's not that there's one right answer,
1:27:56
there are many answers that are just as good.
1:27:59
So there are many challenges.
1:28:00
One, you can't use validation loss
1:28:03
because one method might use PPO,
1:28:06
the other one might use DPO.
1:28:07
Validation loss is not comparable.
1:28:08
Second, you can't use--
1:28:10
sorry, perplexity.
1:28:11
That's the thing I told you before.
1:28:13
These models are not calibrated.
1:28:16
They don't give distributions.
1:28:17
They just optimize for one thing.
1:28:19
So you can't use perplexity for actually evaluating these types
1:28:22
of models once they aligned--
1:28:24
sorry, once they're aligned.
1:28:26
Third, there's a large diversity of questions
1:28:29
that humans might ask to these models.
1:28:31
Generation, open QA, some question answering, some summarization,
1:28:35
and all of these things.
1:28:36
So there's so many things you have to cover.
1:28:38
Then the tasks are really open ended,
1:28:41
so it's very hard to automate.
1:28:42
So that's what you were alluding to before.
1:28:45
So the idea is that instead of trying
1:28:48
to come up with really easily automated benchmarks,
1:28:51
it's just we're going to ask questions that users actually
1:28:55
ask to these models in practice.
1:28:56
And we're just going to ask annotators
1:28:58
to say between these two models, which one is better.
1:29:01
What's the better output.
1:29:03
So basically the exact same thing
1:29:04
as basically the data from RLHF but you
1:29:08
use it now for evaluation.
1:29:10
Yes? I'm not sure I understand what
1:29:11
you mean by "can't use perplexity, not calibrated."
1:29:14
Like, RLHF is still doing next-token prediction.
1:29:19
So--
1:29:19
Why can't perplexity be used then?
1:29:21
So think about the optimal solution
1:29:24
after doing PPO: it is basically one model that
1:29:27
gives you essentially a delta.
1:29:30
Like basically it says that there's only one sentence
1:29:33
that is--
1:29:34
that could be generated for that question.
1:29:36
So now if you use it on something
1:29:38
that is semantically slightly different,
1:29:40
it would actually give a likelihood of 0 for that answer.
1:29:44
So in reality, it's not that extreme because as you say,
1:29:46
it's still a distribution, but it just
1:29:48
shows you that there's a fundamental issue
1:29:50
with perplexity.
1:29:51
These models are not really language models anymore.
1:29:55
They were not trained-- at least with PPO,
1:29:56
they're not trained to do maximum likelihood anymore,
1:29:59
they were trained to be policies.
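A small numeric sketch of this argument follows. The per-token probabilities are made up for illustration (they are not measurements from any model); the only real ingredient is the standard definition of perplexity as the exponential of the average negative log-likelihood per token.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(average negative log-likelihood per token)."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Hypothetical probabilities that two models assign to the SAME perfectly
# reasonable 5-token reference answer.
# A pretrained LM spreads its mass over many plausible answers.
pretrained_probs = [0.2, 0.3, 0.25, 0.2, 0.3]
# An aligned policy that is close to a delta on ONE different answer gives
# almost no mass to this one, even though it is just as good.
policy_probs = [1e-6, 1e-6, 1e-6, 1e-6, 1e-6]

ppl_pretrained = perplexity(pretrained_probs)
ppl_policy = perplexity(policy_probs)
print(f"pretrained LM perplexity: {ppl_pretrained:.2f}")
print(f"aligned policy perplexity: {ppl_policy:.0f}")
```

The policy would look far worse under perplexity even if users preferred its answers, which is why perplexity stops being a useful comparison after alignment.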
1:30:04
So probably the most common or the most--
1:30:08
yeah, the most common benchmark or the most trusted one
1:30:10
is what we call Chatbot Arena, which is basically:
1:30:14
go on the internet, have random users
1:30:17
blindly talk with two chatbots, just ask many questions,
1:30:21
see the two answers, and rate which one is better.
1:30:23
And you do that over hundreds of thousands of users and then
1:30:26
you get the actual preferences and you get rankings of models.
1:30:30
So you can go right now on ChatBotArena
1:30:33
and actually interact with these models.
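The lecture does not say how pairwise votes are turned into rankings. One common choice for this kind of data is an Elo-style rating, so the sketch below assumes that; the model names, the battle list, and the K-factor of 32 are all hypothetical.

```python
def expected_score(rating_a, rating_b):
    """Elo-predicted probability that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(ratings, winner, loser, k=32.0):
    """Shift rating points from the loser to the winner after one vote."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta

# Hypothetical blind pairwise votes: (preferred model, other model).
battles = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_a", "model_b"),
]

ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
for winner, loser in battles:
    elo_update(ratings, winner, loser)

ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
```

With hundreds of thousands of votes, the order of the updates matters less and the ratings settle into a stable leaderboard.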
1:30:35
One potential issue just to highlight
1:30:38
is that the people who want to do these types of things
1:30:40
are usually more tech-driven or tech-savvy.
1:30:44
So a lot of the questions that they will ask
1:30:46
are more like tech stuff: discussing
1:30:47
software errors, inquiries about AI tools
1:30:50
and all of these things.
1:30:52
So another issue is cost and speed.
1:30:54
If you really want to use something
1:30:55
like this in your development process,
1:30:58
it will be too costly because you will need to basically pay
1:31:01
a lot of humans to do that.
1:31:03
So one simple idea is, again, as we said many times,
1:31:07
just use LLMs instead of humans.
1:31:10
You probably know the drill at this point.
1:31:13
The steps: for every instruction, generate outputs
1:31:15
from some baseline and from the model that you want to evaluate.
1:31:19
So here you imagine that I'm comparing an answer
1:31:22
from ChatGPT and from Mistral.
1:31:24
I'm just asking a model, another model, which one is better.
1:31:29
And I just basically average that out.
1:31:32
Yeah.
1:31:32
I ask GPT-4 which one is better.
1:31:34
I average that out over my entire distribution,
1:31:37
over my entire benchmark or data set,
1:31:39
and that gives me a win rate.
1:31:41
So a win probability for one model compared to another one.
1:31:44
And now you can rank models.
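The steps above can be sketched as a short loop. This is an illustration, not the AlpacaEval implementation: `win_rate`, the two toy "models", and the judge are hypothetical stand-ins, and the judge here deliberately prefers the longer output so that the effect of a biased judge is visible.

```python
def win_rate(instructions, model_fn, baseline_fn, judge_fn):
    """Average judge preference for the model over the baseline.

    judge_fn returns 1.0 if the model's output is preferred, 0.0 if the
    baseline's output is preferred, and 0.5 for a tie.
    """
    total = 0.0
    for instruction in instructions:
        candidate = model_fn(instruction)
        reference = baseline_fn(instruction)
        total += judge_fn(instruction, candidate, reference)
    return total / len(instructions)

# Hypothetical stand-ins for real models and for an LLM judge.
def concise_model(instruction):
    return "Short answer."

def verbose_model(instruction):
    return "A much longer answer that keeps going with many extra words."

def length_biased_judge(instruction, candidate, reference):
    if len(candidate) == len(reference):
        return 0.5
    return 1.0 if len(candidate) > len(reference) else 0.0

instructions = ["Explain RLHF.", "What is a GPU?", "Summarize this talk."]
self_vs_self = win_rate(instructions, concise_model, concise_model, length_biased_judge)
verbose_vs_concise = win_rate(instructions, verbose_model, concise_model, length_biased_judge)
print(self_vs_self, verbose_vs_concise)
```

In practice `judge_fn` would wrap a call to a strong LLM with a comparison prompt, and the instructions would come from a fixed benchmark set.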
1:31:46
And this is the AlpacaEval leaderboard.
1:31:50
So the benefits of this is that actually we
1:31:53
show-- we get 98% correlation with Chatbot Arena.
1:31:56
So very high correlation with humans.
1:31:59
So this is, yeah, a comparison of the correlation
1:32:01
with other benchmarks.
1:32:02
And it takes less than three minutes and less than $10
1:32:05
to run.
1:32:05
So it's pretty cheap.
1:32:06
And there are downsides though.
1:32:08
One of them is spurious correlation.
1:32:11
So as we already saw before, LLMs prefer,
1:32:14
this is one spurious correlation; there are many.
1:32:16
I'll just talk about one.
1:32:17
LLMs prefer longer outputs.
1:32:19
Actually humans also prefer longer outputs.
1:32:21
But the problem or the issue once you use LLMs
1:32:23
is that once there is bias, you will continue optimizing that.
1:32:26
Humans at some point, I can guarantee you
1:32:28
if I ask a simple question, and you give me
1:32:29
five pages of answers, I'll be like,
1:32:31
no, I don't like that answer.
1:32:32
But LLMs if they have this bias and they were trained for that,
1:32:35
they will continue preferring longer outputs.
1:32:37
So here we see the preference just showing
1:32:42
that humans and models prefer longer outputs.
1:32:46
And here is another view of the initial AlpacaEval data set
1:32:50
benchmark, where when we asked--
1:32:53
when we rank GPT-4, when we look at the win rate of GPT-4
1:32:56
versus actually GPT-4 itself, if we use the standard GPT-4,
1:33:01
it gets 50%, kind of, by definition because we're
1:33:03
comparing GPT-4 versus GPT-4.
1:33:06
But if we ask GPT-4 to be slightly more verbose,
1:33:09
so we just say in the prompt, be verbose in your answers,
1:33:12
then it gets a win rate of 64.4%.
1:33:15
So really there's a huge variance.
1:33:16
And if we ask it to be concise, it
1:33:17
gets 20% so there's a huge variance
1:33:20
depending on whether you ask it to be concise or verbose.
1:33:24
That's very annoying.
1:33:25
So one possible solution, which is what we did,
1:33:29
is just use some regression analysis.
1:33:31
I'm not going to go into details,
1:33:32
but basically use causal inference
1:33:34
tools to control for length.
1:33:36
And right now actually length matters much less.
1:33:38
So if you ask it to be verbose, you still get some gains,
1:33:41
but much less.
1:33:44
Great.
1:33:44
So that's all about post training.
1:33:46
And now for the next eight minutes,
1:33:48
I might talk about systems or just answer questions.
1:33:51
Yes.
1:33:52
Can you go back to post training? In terms of post
1:33:56
training:
1:33:57
How did we tune those parameters using
1:33:59
the small body of fine-tuning data
1:34:03
and have such a big effect on the model?
1:34:05
You mentioned earlier that there's a different set
1:34:07
of hyperparameters.
1:34:08
Are we changing just some of the weights, the later weights
1:34:11
or other weights.
1:34:12
What's actually happening?
1:34:13
Yeah.
1:34:14
Yeah, I, kind of, skimmed through all of this.
1:34:16
You change all the weights.
1:34:17
Actually, industry will change all the weights.
1:34:20
In open source land, you might have
1:34:22
heard of LoRA, which is going to change basically only
1:34:26
some of the weights or, to be more specific,
1:34:29
it's going to add some differences
1:34:31
to the output of every layer.
1:34:33
But in industry, you're going to just fine tune all the weights.
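For readers who have not seen LoRA, here is a minimal pure-Python sketch of the idea of adding a low-rank correction to a frozen layer's output. The matrix sizes and values are toy assumptions, and the helper names are hypothetical; real implementations use tensor libraries, but the arithmetic is the same: output = W x + B (A x), where only A and B are trained.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, scale=1.0):
    """Frozen layer output plus a trainable low-rank correction."""
    base = matvec(W, x)                  # pretrained weights, kept frozen
    low_rank = matvec(B, matvec(A, x))   # B is d_out x r, A is r x d_in
    return [b + scale * l for b, l in zip(base, low_rank)]

# Toy 3x3 layer with a rank-1 adapter (r = 1).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[1.0, 1.0, 1.0]]
B_init = [[0.0], [0.0], [0.0]]       # B starts at zero, so nothing changes at first
B_trained = [[0.1], [0.0], [-0.1]]   # hypothetical values after fine-tuning
x = [1.0, 2.0, 3.0]

out_before = lora_forward(W, A, B_init, x)
out_after = lora_forward(W, A, B_trained, x)

full_params = len(W) * len(W[0])                         # 9 weights in W
lora_params = len(A) * len(A[0]) + len(B_init) * len(B_init[0])  # 3 + 3 = 6
print(out_before, out_after, full_params, lora_params)
```

On a toy 3x3 layer the saving is small, but for a large layer with a small rank the adapter has far fewer trainable parameters than the full weight matrix.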
1:34:37
And also to say something else about the data, actually,
1:34:40
for this last step, RLHF, you're usually going
1:34:42
to collect a lot more data than with SFT.
1:34:45
So if SFT is like 5,000, 10,000, maybe 50,000 examples, with
1:34:50
RLHF I think you're going to be more around like the one million
1:34:54
order of magnitude.
1:34:55
It's still much less than pretraining though.
1:34:57
Yeah.
1:34:57
Because pretraining is 15 trillion tokens.
1:35:00
I mean, this is like-- that's not even a drop
1:35:02
and yet you influence the weights a lot.
1:35:05
So because you do it--
1:35:05
I mean, you have to think that how you do it is you use--
1:35:10
I mean, as I said, the learning rate that you're going to use
1:35:12
is going to be different, but also you only do that.
1:35:16
So just imagine if I trained--
1:35:18
even if I trained on one sentence,
1:35:19
but over and over again at some point
1:35:22
my model will only generate that sentence
1:35:24
even if it was just one sentence instead of
1:35:27
the 15 trillion tokens.
1:35:29
So if you use a large enough learning
1:35:30
rate and for enough time, you will basically
1:35:33
overfit that sentence.
1:35:35
So the key thing to remember is that the data is not--
1:35:39
it's not as if you mix some post-training data
1:35:42
and some pretraining data.
1:35:43
You do pretraining, and then you just start fine-tuning only
1:35:47
on the post-training.
1:35:48
So another way, maybe another perspective
1:35:50
is that the pretraining is just the initialization
1:35:53
of your model.
1:35:54
And once you view it that way, that this is just
1:35:56
initialization of weights, then there's nothing special.
1:35:59
Like you don't need to remember that you trained on a lot of data
1:36:02
before.
1:36:02
The only thing that matters is that you had an initialization
1:36:04
and now I actually train the model.
1:36:06
So maybe you think about it that way.
1:36:07
Like this is a Markov property in some ways.
1:36:10
It's just like you had your weights.
1:36:11
This is my initialization.
1:36:12
Now I'm training that one.
1:36:14
Does that answer your question?
1:36:16
Kind of but you said something just now about it's
1:36:20
almost the equivalent of just rerunning the fine tuning
1:36:23
data many times.
1:36:25
Is it actually-- is that what actually happens in order
1:36:28
to give so much more preference?
1:36:33
You might-- I actually don't know right now how they do it
1:36:37
in industry.
1:36:37
When we did Alpaca, we had to do three epochs.
1:36:40
So we did run through it three times.
1:36:44
But I mean, even the number of times
1:36:46
that you run it through, it's actually not important.
1:36:48
The only thing-- the only thing is the effective learning rate
1:36:52
that's what matters.
1:36:54
So yeah.
1:36:56
Great.
1:36:58
So I think I have five minutes.
1:37:06
OK I might try to give a high-level overview at least
1:37:12
of one of the systems tricks.
1:37:14
Systems: as we said, for everyone the bottleneck is--
1:37:19
sorry, compute is the huge bottleneck.
1:37:21
One question you might ask is, why not buy more GPUs?
1:37:24
GPUs are expensive, but also are scarce.
1:37:26
Even if you have $10 million right now,
1:37:28
you cannot buy the best GPUs.
1:37:31
[INAUDIBLE]
1:37:33
There's also some physical limitations.
1:37:35
When you have multiple GPUs, you have
1:37:37
to communicate between them.
1:37:39
That takes time.
1:37:40
So just buying more GPUs is not that easy.
1:37:43
So it's really important to think about
1:37:45
how do you allocate resources and how do you optimize
1:37:47
your pipeline. So, systems
1:37:49
101 on GPUs. I'm sorry, I'm going slightly faster.
1:37:53
I hope that some of you at least can follow.
1:37:55
GPUs are basically optimized for throughput.
1:37:58
CPUs are optimized for latency.
1:38:01
So GPUs, the way you have to think about it
1:38:03
is that there's one--
1:38:04
there's one command that is run on many, many cores
1:38:07
at the same time on different pieces of data.
1:38:11
So this is how you see a GPU.
1:38:13
You see there are many different cores.
1:38:14
We call them streaming multiprocessors,
1:38:17
which is very different than the usual CPU architecture.
1:38:20
So just think high throughput parallelization for GPUs.
1:38:24
GPUs are optimized for fast matrix multiplication.
1:38:27
So every time you do something on a GPU:
1:38:30
If you can do it with a matrix multiplication,
1:38:33
it's going to be 10 times faster than with anything else.
1:38:36
That is a little bit annoying because it
1:38:38
means that we are, kind of, bottlenecked
1:38:40
into doing everything with matrix multiplications.
1:38:44
Another thing to note with GPUs is
1:38:46
that compute has been improving faster
1:38:48
than memory and communication.
1:38:50
So right now GPUs usually are hard to keep--
1:38:55
Like, the data that you send to GPUs
1:38:58
can actually hardly keep up with the processors.
1:39:00
So most of your GPUs are actually
1:39:02
going to be idle if you just run normal code,
1:39:04
if you don't optimize your code.
1:39:06
So communication-- and this will continue over time.
1:39:10
Another thing to know about GPUs is that there's
1:39:12
a memory hierarchy.
1:39:13
This is the same thing actually with CPUs,
1:39:15
but basically the closer you are to your cores,
1:39:17
the less memory there is, but the faster things run.
1:39:20
If you are further away, there's more memory, but it's slower.
1:39:24
Oh yeah I'm going to skip that.
1:39:26
OK actually, I'm going to say it.
1:39:27
I told you about this--
1:39:29
the fact of communication.
1:39:31
The metric that people usually look at
1:39:32
is model FLOP utilization.
1:39:34
So what is the theoretical maximum that a GPU could run at,
1:39:37
number of flops that you could use per second--
1:39:39
divide-- sorry, the observed throughput
1:39:42
divided by this theoretical maximum.
1:39:45
And in general, if you reach 50% you're very happy.
1:39:49
Like Facebook, I looked, and Llama was at 45%
1:39:51
or something like this.
1:39:52
So that means that data doesn't come fast enough
1:39:55
even for these big companies.
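As a worked example of model FLOP utilization, here is a small sketch. The parameter count, tokens per second, and peak FLOP/s are hypothetical numbers chosen for illustration, and the "about 6 FLOPs per parameter per training token" estimate is a common rule of thumb that is assumed here, not something derived in this part of the lecture.

```python
def training_flops_per_second(n_params, tokens_per_second):
    """Rough training throughput in FLOP/s.

    Assumption: about 6 FLOPs per parameter per token for a forward plus
    backward pass, which ignores attention and other smaller terms.
    """
    return 6.0 * n_params * tokens_per_second

def model_flop_utilization(observed_flops_per_second, peak_flops_per_second):
    """MFU = observed throughput divided by the theoretical maximum."""
    return observed_flops_per_second / peak_flops_per_second

# Hypothetical numbers for a single accelerator.
n_params = 7e9              # 7B-parameter model
tokens_per_second = 3000.0  # measured training throughput on this device
peak_flops = 312e12         # assumed peak of the device, in FLOP/s

observed = training_flops_per_second(n_params, tokens_per_second)
mfu = model_flop_utilization(observed, peak_flops)
print(f"MFU: {mfu:.1%}")
```

With these made-up numbers the device does useful model math for roughly 40% of its theoretical capacity, which is in the range the lecture describes as good.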
1:39:58
So one simple trick, and that might
1:40:00
be the only one I'm going to tell you about,
1:40:02
is low precision.
1:40:04
One simple idea is that well, if I'm
1:40:06
going to put my floats in low precision,
1:40:09
then there's going to be fewer bits
1:40:10
that I have to send to my GPUs.
1:40:12
If there's fewer bits, it's faster communication,
1:40:14
lower memory consumption.
1:40:16
Things are going to go faster.
1:40:17
And for deep learning it just happens
1:40:19
that decimal precision is not that important.
1:40:22
So when you do matrix multiplication, when
1:40:25
you do like for example, SGD, there's already so much noise
1:40:28
that if you update something by 0.01 or 0.015, who cares.
1:40:33
So basically instead of using 32 bits per float, which
1:40:37
is what people used to use, or 64 for example, which
1:40:41
is what you would use in other domains,
1:40:43
you use 16 bits for matrix multiplication.
1:40:46
So for every float you use 16 bits.
1:40:49
And for training you have this type
1:40:51
of what we call automatic mixed precision.
1:40:54
Which is that some of the things are in 32 bits,
1:40:57
others are in 16 bits.
1:41:00
Generally, the way you should be thinking about
1:41:02
it is that the weights of your model
1:41:05
are stored in 32 bits.
1:41:06
But just before the computation you put everything in 16 bits.
1:41:10
This way, you do the computation super fast.
1:41:12
And at the end you update your weights in 32 bits.
1:41:16
And the reason why you do all the updates in 32 bits is just
1:41:19
think that if your learning rate, for example,
1:41:21
is very small, you still want to be able to make
1:41:23
a difference in your weights.
1:41:25
So all the computation is done in 16 bits,
1:41:28
but the weights are actually stored in 32 bits.
1:41:30
So that's like the standard way that people are doing it.
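The reason for keeping the stored weights in 32 bits can be checked with nothing but the standard library. The sketch below uses Python's `struct` module to round numbers to IEEE half precision (format `"e"`) and single precision (format `"f"`); the weight value, the update size of 1e-4, and the 100 steps are toy assumptions, and this is not how a training framework implements mixed precision.

```python
import struct

def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]

def to_fp32(value):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]

update = to_fp16(1e-4)      # small step (learning rate times gradient), computed in 16 bits
weight_fp16 = to_fp16(1.0)  # weight stored in 16 bits
weight_fp32 = to_fp32(1.0)  # "master" weight stored in 32 bits

for _ in range(100):
    # Near 1.0, neighbouring fp16 values are about 0.001 apart, so adding
    # 0.0001 rounds straight back to 1.0 and the update is lost.
    weight_fp16 = to_fp16(weight_fp16 + update)
    # fp32 has enough resolution to accumulate the same small updates.
    weight_fp32 = to_fp32(weight_fp32 + update)

print(weight_fp16, weight_fp32)
```

After 100 steps the 16-bit weight has not moved at all, while the 32-bit weight has moved by about 0.01, which is the behaviour the lecture is describing.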
1:41:35
OK, I'll actually talk just about this,
1:41:36
and then I'll skip all the rest, operator fusion, because I think
1:41:39
this is actually pretty cool.
1:41:40
As I just said, communication is very slow
1:41:42
and actually every time you run a PyTorch line,
1:41:45
it basically moves the variable to the global memory of your GPU.
1:41:49
So when you have something like this, x1 = x.cos(),
1:41:54
and then you do x2 = x1.cos().
1:41:56
What is happening behind the scenes
1:41:58
is that you take the x, which is data.
1:42:00
You ship it to your actual processors of your GPUs.
1:42:03
You apply the cosine.
1:42:05
You ship it back to the main memory of your GPU
1:42:07
and then you see the next line.
1:42:09
You ship it back to the compute-- to the GPU processors,
1:42:12
you apply another cosine and you ship it back again.
1:42:15
So another way to see that is that you
1:42:17
go from your DRAM, which is the global memory on your GPU,
1:42:20
and you ship it to compute.
1:42:22
You ship it back for every line.
1:42:24
This is a naive way of doing it.
1:42:25
This seems very wasteful.
1:42:28
So the idea, simple idea of operator fusion
1:42:31
is just communicate, do all the computation, ship it back once.
1:42:35
And this is exactly what fused kernels are.
1:42:39
So if you ever want to make your compute-- your computations
1:42:44
in PyTorch much faster, just apply torch.compile
1:42:46
to your model.
1:42:48
This is going to make your model around 2 times faster.
1:42:51
And what it does is simply that it rewrites your code--
1:42:56
your PyTorch code, basically in C++ and CUDA, to do
1:43:03
the communication only once then do all the operations,
1:43:05
then ship it back.
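The saving from fusion can be illustrated with a toy model that only counts memory traffic. This is plain Python, not PyTorch and not a real GPU kernel: the `GlobalMemory` class and both functions are hypothetical stand-ins that count how many elements travel between "global memory" and "compute" for the two-cosine example.

```python
import math

class GlobalMemory:
    """Toy stand-in for GPU global memory (DRAM) that counts transfers."""

    def __init__(self):
        self.elements_moved = 0

    def load(self, data):
        self.elements_moved += len(data)   # global memory -> compute
        return list(data)

    def store(self, data):
        self.elements_moved += len(data)   # compute -> global memory
        return list(data)

def unfused_cos_cos(x, memory):
    # Each line acts as its own kernel: load, compute, store.
    x1 = memory.store([math.cos(v) for v in memory.load(x)])
    x2 = memory.store([math.cos(v) for v in memory.load(x1)])
    return x2

def fused_cos_cos(x, memory):
    # One fused kernel: load once, apply both cosines, store once.
    return memory.store([math.cos(math.cos(v)) for v in memory.load(x)])

x = [0.0, 0.5, 1.0, 1.5]
unfused_memory, fused_memory = GlobalMemory(), GlobalMemory()
unfused_result = unfused_cos_cos(x, unfused_memory)
fused_result = fused_cos_cos(x, fused_memory)
print(unfused_memory.elements_moved, fused_memory.elements_moved)
```

Both versions compute the same values, but the fused one moves half as many elements. In real PyTorch code this rewriting is what a compiler such as torch.compile is meant to do for you.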
1:43:07
OK I'm not going to have time to talk about tiling.
1:43:10
Tiling is important.
1:43:11
Parallelization.
1:43:12
Parallelization is important.
1:43:15
And mixture of experts.
1:43:17
Mixture of experts is important.
1:43:18
Outlook.
1:43:19
There are many things we haven't talked about.
1:43:23
We haven't talked about architectures; we definitely
1:43:25
haven't talked about inference.
1:43:27
There are many other things that are important with LLMs.
1:43:29
What is the UI that you use?
1:43:31
I mean, arguably ChatGPT, the big novelty was just
1:43:34
having a simple UI to use it.
1:43:35
Multi-modality.
1:43:36
What are all the misuses you could have.
1:43:38
The fact that there might not be enough data on the internet
1:43:41
to train all these models.
1:43:42
Legality of data collection, so many other things.
1:43:45
If you are interested in all these topics,
1:43:47
I would suggest three classes.
1:43:49
CS224N is probably the one that touches the least on LLMs,
1:43:54
but it gives some background and historical context
1:43:57
of all the LLMs and gives some adjacent material.
1:44:01
CS324 I think it's called--
1:44:04
I think it's just called Large Language Models, more
1:44:07
in depth reading and lectures on everything I talked about.
1:44:10
CS336, which is large language models from scratch:
1:44:13
you actually build your own LLM.
1:44:16
It's an amazing class also given by my two supervisors.
1:44:20
Very heavy workload, so be careful.
1:44:23
Great.
— end of transcript —