1:44:31
Stanford CS229 I Machine Learning I Building Large Language Models (LLMs)
Stanford Online · May 10, 2026
Transcript
0:05
So, let's get started.
0:07
So I'll be talking about
building LLMs today.
0:10
So I think a lot of you have
heard of LLMs before, but just
0:14
as a quick recap.
0:16
LLMs standing for
large language models
0:18
are basically all the
chat bots that you've
0:21
been hearing about recently.
0:22
So, ChatGPT, from OpenAI,
Claude, from Anthropic, Gemini
0:28
and Llama, and other
types of models like this.
0:31
And today we'll be talking
about how do they actually work.
0:34
So it's going to be an overview
because it's only one lecture
0:36
and it's hard to
compress everything.
0:38
But hopefully, I'll
touch a little bit
0:39
about all the components
that are needed
0:41
to train some of these LLMs.
0:43
Also, if you have questions,
please interrupt me
0:46
and ask if you have a question.
0:48
Most likely other people in
the room or on Zoom
0:52
have the same questions.
0:53
So, please ask.
0:56
Great.
0:56
So what matters
when training LLMs?
1:00
So there are a few key
components that matter.
1:02
One is the architecture.
1:04
So as you probably all know,
LLMs are neural networks,
1:07
and when you think
about neural networks,
1:09
you have to think about what
architecture you're using.
1:11
And another component,
which is really important
1:13
is the training loss and
the training algorithm.
1:16
So, how you actually train
these models, then it's data.
1:20
So, what do you train
these models on.
1:24
The evaluation,
which is how do you
1:26
know whether you're
actually making progress
1:28
towards the goal of LLMs and
then, the system component.
1:33
So that is like
how do you actually
1:35
make these models run on
modern hardware, which
1:38
is really important because
these models are really large.
1:41
So now more than ever,
systems are actually
1:43
really an important
topic for LLMs.
1:47
So those are the five components.
You probably all know,
1:52
and if you don't
know, that LLMs are all
1:53
based on transformers,
or at least some version
1:56
of transformers.
1:57
I'm actually not going to talk
about the architecture today.
2:00
One, because I gave a lecture
on transformers a few weeks ago
2:06
and two, because you can find
so much information online
2:09
on transformers.
2:11
There's much less information
about the other four topics.
2:14
So, I really want
to talk about those.
2:17
And another thing to say
is that most of academia
2:20
actually focuses on
architecture and training
2:22
algorithms and
losses. As academics,
2:25
and I've done that for
a big part of my career,
2:28
we simply like thinking
that we make
2:32
new architectures,
new models, and it
2:35
seems like it's very important.
2:37
But in reality, honestly, what
matters in practice is mostly
2:39
the three other topics.
2:41
So, data, evaluation and
systems, which is what most
2:45
of industry actually focuses on.
2:48
So, that's also
one of the reasons
2:49
why I don't want to talk too
much about the architecture,
2:52
because really the rest
is super important.
2:55
Great.
2:55
So, overview of
the lecture, I'll
2:57
be talking about pretraining.
2:58
So, pretraining, you
probably heard that word.
3:00
This is the general word.
3:02
This is kind of the classical
language modeling paradigm where
3:06
you basically train your
language model to essentially
3:08
model all of the internet.
3:10
And then, there's
a post training,
3:11
which is a more
recent paradigm which
3:13
is taking these
large language models
3:15
and making them
essentially AI assistants.
3:18
So, this is more of a
recent trend since ChatGPT.
3:22
So, if you ever heard
of GPT3 or GPT2,
3:25
that's really pretraining land.
3:27
If you heard of ChatGPT,
which you probably have,
3:29
this is really
post training land,
3:31
so I'll be talking about both,
but I'll start with pretraining
3:34
and specifically
I'll talk about what
3:37
is the task of pretraining LLMs
and what is the loss that people
3:41
actually use.
3:43
So, language modeling,
this is a quick recap.
3:47
Language models at a
high level are simply
3:49
models of probability
distribution over sequences
3:52
of tokens or of words.
3:53
So it's basically
some model of p of x1
3:57
to xL, where x1
is basically word
3:59
one and xL is the last one in
the sequence or in the sentence.
4:04
So, very concretely, if you
have a sentence like the mouse
4:07
ate the cheese, what
the language model gives
4:09
you is simply a probability
of this sentence being uttered
4:13
by a human or
being found online.
4:17
So, if you have another sentence
like "The the mouse ate cheese."
4:21
Here, there's
grammatical mistakes.
4:23
So, the model should
4:25
have some syntactic knowledge.
4:27
So, it should know that
this has less likelihood
4:30
of appearing online.
4:32
If you have another sentence
like the cheese ate the mouse,
4:36
then the model should
hopefully know about the fact
4:39
that usually cheese
doesn't eat mice.
4:42
So, there's some
semantic knowledge
4:43
and this is less likely
than the first sentence.
4:45
So, this is basically at a high
level what language models are.
4:50
One term that you probably have
been hearing a lot in the news
4:52
is generative models.
4:54
So, this is just something
that can generate.
4:56
Models that can
generate sentences
4:57
or can generate some data.
4:59
The reason why we say language
models are generative models
5:01
is that once you have a
model of a distribution,
5:04
you can simply sample
from this model.
5:06
And now we can generate data.
5:07
So we can generate sentences
using a language model.
5:12
So the type of models that
people are all currently using
5:15
are what we call
autoregressive language models.
5:18
And the key idea of
autoregressive language models
5:21
is that you take this
distribution over words
5:25
and you basically decompose
it into the distribution
5:29
of the first word, multiplied
by the distribution,
5:32
or the likelihood,
of the second word
5:35
given the first
word, and multiply it
5:37
by P of the third word
given the first two words.
5:40
So, there's no
approximation here.
5:42
This is just the chain rule
of probability, which you
5:44
hopefully you all know about.
5:46
Really no approximation.
5:47
This is just one way of
modeling a distribution.
5:50
So, slightly more
concisely, you can write it
5:52
as a product of P's of the next
word, given everything which
5:57
happened in the past.
5:58
So, of the context.
5:59
So, this is what we call
autoregressive language models.
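Written out, the decomposition described here is just the chain rule of probability:

```latex
p(x_1, \dots, x_L)
  = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots p(x_L \mid x_1, \dots, x_{L-1})
  = \prod_{i=1}^{L} p(x_i \mid x_{1:i-1})
```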
6:02
Again, this is really
not the only way
6:05
of modeling distribution.
6:06
This is just one way.
6:07
It has some benefits
and some downsides.
6:10
One downside of
autoregressive language models
6:12
is that when you actually
sample from this autoregressive
6:15
language model,
you basically have
6:16
a for loop, which generates
the next word, then conditions
6:20
on that next word.
6:21
And then it generates
another word.
6:23
So, basically if you
have a longer sentence
6:24
that you want to generate, it
takes more time to generate it.
6:28
So, there are some downsides
of this current paradigm,
6:31
but that's what
we currently have.
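To make that for loop concrete, here is a minimal sketch of autoregressive sampling; `model` (returning a next-token distribution) and `tokenizer` are hypothetical stand-ins, not any specific library's API.

```python
# A minimal sketch of the autoregressive sampling loop just described,
# assuming hypothetical `model` and `tokenizer` objects.
import numpy as np

def generate(model, tokenizer, prompt, max_new_tokens=20):
    ids = tokenizer.encode(prompt)              # tokenize the prompt once
    for _ in range(max_new_tokens):             # one forward pass per new token
        probs = model(ids)                      # distribution over the vocabulary
        next_id = np.random.choice(len(probs), p=probs)  # sample the next token
        ids.append(next_id)                     # condition on it next iteration
    return tokenizer.decode(ids)                # detokenize back to text
```

Each new token requires a full pass over the sequence so far, which is exactly the length-dependent cost mentioned above.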
6:33
So, I'm going to
talk about this one.
6:36
Great.
6:36
So, autoregressive
language models.
6:38
At a high level, the task of
an autoregressive language model
6:41
is simply predicting the
next word, as I just said.
6:44
So, if we have a sentence
like she likely prefers,
6:47
one potential next
word might be dogs.
6:50
And the way we do it is
that we first tokenize.
6:54
So, you take these words or
subwords you tokenize them
6:58
and then you give an
ID for each token.
7:00
So here you have
one, two, three.
7:03
Then, you pass it
through this black box.
7:04
As I already said,
we're not going
7:06
to talk about the architecture.
7:07
You just pass it through,
pass it through a model,
7:10
and you then get a distribution,
a probability distribution
7:13
over the next word or
over the next token.
7:16
And then you sample
from this distribution,
7:20
you get a new token and
then you detokenize.
7:22
So, you get a new
ID, you detokenize
7:24
and that's how you basically
sample from a language model.
7:28
One thing which is
important to note
7:29
is that the last two
steps are actually
7:32
only needed during inference.
7:34
When you do training,
you just need
7:36
to predict the most likely
token and you can just
7:38
compare to the real token
which happened in practice,
7:41
and then, you basically
change the weights
7:43
of your model to increase
the probability of generating
7:46
that token.
7:49
Great.
7:50
So, autoregressive
neural language models.
7:52
So to be slightly
more specific, still,
7:54
without talking about
the architecture,
7:56
the first thing we do is
that we have all of these.
7:58
Sorry, yes.
7:59
On the previous slide.
8:01
Predicting the probability
of the next token,
8:03
does this mean that your
final output vector has
8:06
to be the same dimensionality
as the number of tokens
8:08
that you have?
8:09
Yes.
8:10
How do you deal with
it if you have more tokens,
8:13
adding more tokens
to your [INAUDIBLE]?
8:16
Yeah so we're going to
talk about tokenization
8:18
actually later so you will
get some sense of this.
8:21
You basically can't deal
with adding new tokens.
8:24
I'm kind of exaggerating.
8:25
There are methods for doing
it, but essentially people
8:28
don't do it.
8:29
So it's really
important to think about
8:32
how you tokenize your
text, and that's why
8:33
we'll talk about that later.
8:35
But it's a very
good point to note
8:38
that the vocabulary size, so
the number of tokens that
8:40
you have, is essentially
the output dimension of your
language model.
language model.
8:43
So it's actually pretty large.
8:46
So autoregressive
neural language models.
8:48
First thing you do is that you
take every word or every token.
8:51
You embed them so you get
some vector representation
8:56
for each of these tokens.
8:58
You pass them through some
neural network, as we said,
9:00
it's a transformer.
9:01
Then you get a representation
for all the words
9:04
in the context.
9:06
So it's basically
a representation
9:07
of the entire sentence.
9:09
You pass it through
a linear layer,
9:11
as you just asked, to
basically map it
9:15
so that the output size,
the number of outputs,
9:17
is the number of tokens.
9:19
You then pass it
through some softmax
9:21
and you basically get a
probability distribution
9:24
over the next words given
every word in the context.
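As a shape-level sketch of this forward pass; `transformer` is a stand-in for the architecture we are not covering, and all names here are illustrative:

```python
# Embed -> transformer -> linear projection -> softmax, shapes only.
import numpy as np

def next_token_distribution(token_ids, E, W, transformer):
    # E: (vocab_size, d) embedding matrix; W: (d, vocab_size) output projection
    h = E[token_ids]                   # (L, d): one embedding per token
    h = transformer(h)                 # (L, d): contextual representations
    logits = h[-1] @ W                 # (vocab_size,): scores for the next token
    z = np.exp(logits - logits.max())  # softmax, numerically stabilized
    return z / z.sum()                 # probability distribution over next token
```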
9:30
And the loss that you
use is basically--
9:32
it's essentially a task of
classifying the next token.
9:35
So it's a very simple, kind
of, machine learning task.
9:37
So you use the
cross-entropy loss.
9:39
Where you basically look at the
actual target that happened,
9:44
which is the target
distribution, which
9:45
is a one hot encoding,
which in this case says,
9:49
I saw the real word
that happened is cat.
9:51
So that's a one hot
distribution over cat.
9:55
And here this is the actual--
9:57
do you see my mouse?
9:58
Oh, yeah.
9:58
This is the distribution
that you generated.
10:00
And basically you
do cross entropy,
10:01
which really just increases the
probability of generating cat
10:04
and decreases all the
probability of generating
10:06
all the other tokens.
10:08
One thing to notice is
that, as you all know again,
10:11
this is just equivalent
to maximizing the text log
10:15
likelihood because
you can just rewrite
10:17
the max over the probability
of this autoregressive language
10:23
modeling task as being
the minimum of
10:26
its negative log,
which
10:29
is just the minimum of the loss,
which is the cross-entropy loss.
10:31
So basically
minimizing the loss is
10:33
the same thing as maximizing
the likelihood of your text.
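In symbols, the equivalence is:

```latex
\max_{\theta} \; p_{\theta}(x_{1:L})
  = \max_{\theta} \prod_{i=1}^{L} p_{\theta}(x_i \mid x_{1:i-1})
\;\Longleftrightarrow\;
\min_{\theta} \sum_{i=1}^{L} -\log p_{\theta}(x_i \mid x_{1:i-1})
  = \min_{\theta} \; \mathcal{L}_{\text{cross-entropy}}
```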
10:36
Any question?
10:37
Questions?
10:43
OK, tokenizer.
10:46
So this is one thing
that people usually
10:49
don't talk that much about.
10:50
Tokenizers are
extremely important.
10:53
So it's really important that
you understand at least what
10:56
they do at a high level.
10:57
So why do we need tokenizers
in the first place?
11:01
First, it's more
general than words.
11:02
So one simple thing
that you might think
11:04
is we're just going to take
every word that we will have.
11:07
You just say every word
is a token on its own.
11:11
But then what happens is if
there's a typo in your word?
11:14
Then you might not have
any token associated
11:17
with this word with a typo.
11:20
And then you don't know
how to actually pass
11:21
this word with a typo into
a large language model.
11:24
So what do you do next?
11:25
And also, even if you think
about words,
11:29
words are fine for
Latin-based languages.
11:32
But if you think about
a language like Thai,
11:34
you won't have a simple
way of tokenizing
11:36
by spaces because there are
no spaces between words.
11:39
So really, tokens are much
more general than words.
11:43
It's the first thing.
11:44
Second thing that
you might think
11:45
is that you might tokenize
every sentence, character
11:48
by character.
11:49
You might say A is one
token, B is another token.
11:52
That would actually work
and probably very well.
11:55
The issue is that then your
sequence becomes super long.
11:58
And as you probably
remember from the lecture
12:00
on transformers, the
complexity grows quadratically
12:05
with the length of sequences.
12:06
So you really don't want to
have a super-long sequence.
12:10
So tokenizers basically try to
deal with those two problems
12:14
and give common subsequences
a certain token.
12:19
And usually how you should be
thinking about it is that,
12:22
on average, every token
is around 3-4 letters.
12:27
And there are many
algorithms for tokenization.
12:30
I'll just talk about one of them
to give you a high level, which
12:32
is what we call Byte Pair
Encoding, which is actually
12:34
pretty common,
12:35
one of the two most
common tokenizers.
12:37
And the way that you
train a tokenizer
12:39
is that first you start with
a very large corpus of text.
12:42
And here, I'm really not talking
about training a large language
12:45
model yet, this is purely
for the tokenization step.
12:48
So this is my large corpus of
text with these five words.
12:52
And then you associate
every character
12:55
in this corpus of text
a different token.
12:58
So here, I just split
it up every character
13:00
with a different
token, and I just
13:03
color coded all of those tokens.
13:05
And then what you do is that
you go through your text,
13:08
and every time you see pairs
of tokens that are very common,
13:12
the most common pair of
token, you just merge them.
13:15
So here you see three
times the tokens t and o
13:19
next to each other.
13:20
So you're just going to
say this is a new token.
13:22
And then you continue,
you repeat that.
13:24
So now you have tok, tok
which happens three times.
13:28
Toke with an E that
happens 2 times and token,
13:33
which happens twice, and then
ex which also happens twice.
13:37
So if you were to
train a tokenizer on this corpus
13:41
of text, which is
very small, that's
13:43
how you would end up
13:45
with a trained tokenizer.
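A toy version of this BPE training loop, just to make the merge rule concrete; real tokenizers (like GPT's) work on bytes and are heavily optimized, so treat this as illustrative only:

```python
# Toy Byte Pair Encoding training: repeatedly merge the most frequent
# adjacent pair of tokens, starting from individual characters.
from collections import Counter

def train_bpe(words, num_merges):
    corpus = [list(w) for w in words]          # start: one token per character
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for toks in corpus:                    # count adjacent token pairs
            pairs.update(zip(toks, toks[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]    # most frequent pair wins
        merges.append((a, b))
        for toks in corpus:                    # merge every occurrence in place
            i = 0
            while i < len(toks) - 1:
                if toks[i] == a and toks[i + 1] == b:
                    toks[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges

# e.g. train_bpe(["token", "tokens", "tokenize", "toke", "tok"], 5)
# first merges ("t","o"), then ("to","k"), and so on, as in the example above.
```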
13:47
In reality, you do it on
much larger corpus of text.
13:51
And this is the
real tokenizer of--
13:54
actually, I think this
is GPT3 or ChatGPT.
13:57
And here you see how it would
actually separate these words.
14:00
So basically you
see the same thing
14:01
as what we gave in
the previous example.
14:03
Token becomes its own token.
14:06
So tokenizer actually
gets split up
14:08
into two tokens, token and -izer.
14:12
So yeah, that's all
about tokenizers.
14:15
Any questions on that?
14:16
Yeah.
14:16
How do you deal with
spaces, and how do you
14:18
deal with [INAUDIBLE].
14:19
Yeah so actually there's
a step before tokenizers,
14:23
which is what we call
pre-tokenizers, which
14:25
is exactly what you just said.
14:27
So this is mostly--
14:29
in theory, there's no reason to
deal with spaces and punctuation
14:33
separately.
14:34
You could just say every
space gets its own token,
14:37
every punctuation
gets its own token,
14:40
and you can just
do all the merging.
14:42
The problem is that-- so
there's an efficiency question.
14:45
Actually, training these
tokenizers takes a long time.
14:48
So you better-- because you have
to consider every pair of tokens.
14:51
So what you end up doing is
saying if there's a space,
14:54
this is very--
like pre-tokenizers
14:55
are very English specific.
14:57
You say if there's
a space, we're
14:58
not going to start looking
at the token that came before
15:01
and the token that
came afterwards.
15:03
So you're not merging
in between spaces.
15:06
But this is just like a
computational optimization.
15:10
You could theoretically
just deal with it
15:12
the same way as you deal
with any other character.
15:15
And--
15:15
Yeah.
15:16
When you merge tokens, do you delete
the tokens that you merged away
15:19
or do you keep the smaller
tokens that you merged?
15:22
You actually keep
the smaller tokens.
15:25
I mean, in reality, it doesn't
matter much because usually
15:29
on a large corpus of text, you
will have actually everything.
15:32
But you usually
keep the small ones.
15:34
And the reason why
you want to do that
15:36
is because if-- in case there's,
as we said before, you have
15:38
some grammatical
mistakes or some typos,
15:41
you still want to
be able to represent
15:43
these words by character.
15:46
So, yeah.
15:47
Yes.
15:48
Are the tokens unique?
15:51
So I mean, say in this case
T-O-K-E-N is there only one
15:54
occurrence or could--
15:56
do you need to leave multiple
occurrence so they could have--
16:00
take on different
meanings or something?
16:02
Oh I see what you're saying.
16:03
No, every token
has its own unique ID.
16:08
So a usual-- this
is a great question.
16:11
For example, if you
think about a bank, which
16:13
could be bank for like
money or bank like water,
16:16
it will have the same token.
16:18
But the model will
learn, the transformer
16:19
will learn that based on the
words that are around it,
16:22
it should associate that--
16:24
I'm saying-- I'm being
very handwavy here,
16:26
but associate that with
a representation that
16:30
is either more like the bank
money side or the bank water
16:33
side.
16:34
But that's a transformer
that does that.
16:36
It's not a tokenizer.
16:38
Yes.
16:39
Yes.
16:39
So you mentioned
during tokenization,
16:41
keep the smaller tokens
you started with, right.
16:43
Like if you start with
a T you keep the T
16:45
and then you build
your tokenize out to
16:47
[INAUDIBLE] allow input token.
16:49
So let's say maybe you didn't
train on token, but in your data
16:53
you are trying to encode token.
16:54
So how does the tokenizer know
to encode it with token or to
16:58
[INAUDIBLE]?
16:59
Yeah.
16:59
That's a great question.
17:00
You basically when you--
so when you tokenize,
17:02
so that's after training
of the tokenizer
17:04
when you actually
apply the tokenizer
17:06
you basically always
choose the largest token
17:10
that you can apply.
17:11
So if you can do token,
you will never do T,
17:13
you will always do token.
17:15
But there's actually--
so people don't usually
17:18
talk that much about
tokenizers, but there's
17:20
a lot of computational benefits
or computational tricks
17:24
that you can do for making
these things faster.
17:27
So I really don't think
we-- and honestly, I
17:29
think a lot of people think
that we should just get away
17:31
from tokenizers and just
kind of tokenize character
17:34
by character or bytes by bytes.
17:36
But as I said, right now
there's this issue of length,
17:39
but maybe one day, like
in five or 10 years,
17:42
we will have different
architectures
17:43
that don't scale quadratically
with the length of the sequence.
17:46
And maybe we'll move
away from tokenizers.
17:50
So can you share
with us the drawback?
17:53
Why do people want to move
away from the tokenizer?
17:57
Yeah.
17:58
So I think one good
example is math.
18:03
If you think about math,
actually numbers right now
18:06
are not tokenized digit by digit.
18:07
So for example, 327 might
have its own token, which
18:10
means that models,
when they see numbers,
18:13
they don't see them
the same way as we do.
18:15
And this is very
annoying because I mean,
18:17
the reason why we can
generalize with math
18:19
is because we can deal with
every digit separately
18:22
and we can then do composition.
18:24
Where you know that
basically if you add stuff,
18:26
it's the same thing as
adding every one separately
18:28
plus like whatever
the unit that you add.
18:30
So they can't do that.
18:32
So then you have to do
special tokenization.
18:35
And, like, one of the
big changes that GPT4 did
18:39
is changing the way
that they tokenize code.
18:42
So for example, if you have
code, you know you have often,
18:46
in Python, these four
spaces at the beginning.
18:48
Those were dealt with
strangely before.
18:52
And as a result, like,
the model couldn't really
18:54
understand how to
deal with code.
18:57
So tokenizers actually
matter a lot.
19:00
OK, so I'll move on right now,
but we can come back later
19:04
on tokenizers.
19:05
Great.
19:06
So we talked about the task,
the loss, the tokenizer,
19:08
let's talk a little
bit about evaluation.
19:11
So the way that LLMs
are usually evaluated
19:13
is what we call-- is using
what we call perplexity.
19:16
At a high level it's basically
just your validation loss.
19:20
The slight difference
with perplexity
19:21
is that we use something that
is slightly more interpretable,
19:24
which is that we use the
average per token loss,
19:27
and then you exponentiate it.
19:29
And the reason why
you exponentiate it
19:30
is because you want--
19:32
I mean, the loss has
a log inside and you--
19:35
like, one, humans
are actually pretty
19:36
bad at thinking in log space.
19:38
But two, logs depend
on the base of the log,
19:41
while when you exponentiate,
you basically have everything
19:44
in vocabulary-size units.
19:48
And the average per
token is just so
19:50
that your perplexity is
independent of the length
19:52
of your sequence.
19:54
So perplexity is just
two to the power average
19:57
of the loss of the sequence.
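As a sketch, with base 2 to match the "two to the power of the average loss" phrasing (natural log and exp are equally common):

```python
# Perplexity: exponentiate the average per-token loss.
import numpy as np

def perplexity(token_log2_probs):
    """token_log2_probs: log2 p(x_i | x_<i) for every token in the sequence."""
    avg_loss = -np.mean(token_log2_probs)  # average per-token loss, in bits
    return 2.0 ** avg_loss                 # 1 = perfect, vocab size = clueless

# Sanity check: a uniform guess over a vocabulary of size V gives
# perplexity(np.full(L, -np.log2(V))) == V, matching the bounds below.
```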
20:00
So perplexity is between one
and the length of the vocabulary
20:04
of your tokenizer.
20:05
One is simply, well,
if you predict perfectly
20:08
every
word, then every word
20:11
will have probability one, so
you get a product of ones.
20:14
So the best perplexity
you can have is one.
20:16
If you really have no
idea, you basically
20:18
predict with one divided
by size of vocabulary
20:22
and then you do simple
math and you basically
20:24
get perplexity of
size of vocabulary.
20:26
So the intuition
of perplexity is
20:28
that it's basically
the number of tokens
20:30
that your model is, kind
of, hesitating between.
20:32
So if your model is perfect,
it doesn't hesitate.
20:35
It knows exactly the word.
20:36
If it really has
no idea, then it
20:38
hesitates between all
of the vocabulary.
20:43
So perplexity really improved.
20:46
That's perplexity on a standard
data set between 2017 and 2023.
20:50
It went from a kind of 70
tokens to less than 10 tokens
20:54
over these five, six years.
20:56
So that means that the
models were previously
20:58
hesitating between 70 words every
time it was generating a word,
21:02
and now it's hesitating
between less than 10 words.
21:05
So that's much better.
21:06
Perplexity is actually
not used anymore
21:08
in academic benchmarking,
mostly because it depends
21:11
on the tokenizer that you use.
21:12
It depends on the actual data
that people are evaluating on.
21:16
But it's still very important
for development of LLMs.
21:19
So when you actually
train your own LLM people
21:21
will still really look
at the perplexity.
21:26
Another common way, and
now more common in academia,
21:30
of evaluating these LLMs is just
by taking all the classical NLP
21:34
benchmarks, and I'll give you
a few examples later and just,
21:37
kind of, aggregating everything.
21:39
So collect as many automatically
evaluatable benchmarks as possible
21:43
and just evaluate
across all of them.
21:46
So one such-- or
actually two such
21:50
benchmarks are what we call
HELM, which is from Stanford.
21:54
And another one is the
Hugging Face open leaderboard,
21:56
which are probably the two
most common ones right now.
22:00
So just to give you
an idea, in HELM,
22:02
all of these type
of tasks, which
22:04
are mostly things that
can be easily evaluated
22:08
like question answering.
22:09
So think about many different
question answering tasks.
22:13
And the benefit with
question answering
22:15
is that you usually know
what is the real answer.
22:18
So you can-- the way that
you evaluate these models
22:20
and I'll give you a concrete
example in one second,
22:22
is that you can just look at
how likely the language model is
22:26
to generate the real answer
compared to some other answers.
22:30
And that's essentially,
at a high level,
22:31
how you evaluate these models.
22:33
So to give you a
specific example,
22:35
MMLU is probably the most common
academic benchmark for LLMs.
22:42
And this is just a
collection of many question
22:45
and answers in all
of those domains.
22:47
For example, college
medicine, college physics,
22:50
astronomy and these
type of topics.
22:52
And the questions are things
like, so this is in astronomy.
22:55
What is true for
Type Ia supernovae?
22:58
Then you give four
different potential answers
23:01
and you just ask the model
which one is more likely.
23:04
So there are many
different ways of doing it.
23:06
Either you can look at the
likelihood of generating
23:09
all these answers, or
you can ask the model
23:11
which one is the most likely.
23:12
So there are different ways
that you can prompt the model,
23:15
but at a high level, you
know which one is correct.
23:17
And the three
others are wrong.
23:20
Yes.
23:22
Creating unconstrained
text as an output.
23:24
Yeah.
23:25
How do you evaluate
a model if it
23:28
gives something that's
semantically completely
23:31
identical, but is not the
exact tokens that you expect?
23:35
Yeah.
23:36
So that's a great question.
23:37
I'll talk more about that later.
23:38
Here, in this case, we
don't do unconstrained.
23:41
So the way you would evaluate
MMLU is basically either
23:44
you ask the first
question, and then you
23:47
look at the likelihood of
the model generating A,
23:50
the likelihood of the model
generating B, C, and D
23:53
and you look at which
one is the most likely.
23:55
Or you can ask the
model out of A, B, C, D,
23:58
which one is the most likely.
23:59
And you look at whether the
most likely next token is A, B,
24:03
C, or D. So you
constrain the model
24:05
to say it can only
answer these four things.
24:09
You say you constrain--
24:10
Yeah.
24:11
You constrain the
prompt or do you
24:13
mean of its whole
probability distribution
24:15
that it outputs
you only comparing
24:17
the outputs of like-- you're
only comparing the A token the
24:19
[INAUDIBLE].
24:20
Yeah.
24:20
So in the second case I gave
you, you would do exactly the--
24:24
actually would do both.
24:25
You would prompt the
model saying A, B, C, or D
24:27
plus you would constrain to
only look at these four tokens.
24:32
In the first case, you don't
even need to generate anything.
24:34
So in the first case,
you literally just
24:36
look, given it's
a language model,
24:38
it can give a distribution
over sentences.
24:40
You just look at what is
the likelihood of generating
24:43
all of these words?
24:45
What is the likelihood of
generating the second choice?
24:48
And you just look at whether the
most likely sentence is actually
24:52
the real answer.
24:54
So you don't actually
sample from it,
24:56
you really just
use P of X1 to XL.
24:59
Does that make sense?
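A sketch of that first, likelihood-based way of scoring an MMLU item; `logprob` is a hypothetical helper returning log p(continuation | prompt) under the model:

```python
# Likelihood-based multiple-choice scoring: pick the answer whose text
# the model is most likely to generate, without sampling anything.
def mmlu_item_correct(logprob, question, choices, correct_idx):
    scores = [logprob(question, choice) for choice in choices]
    predicted = max(range(len(choices)), key=lambda i: scores[i])
    return predicted == correct_idx    # accuracy is averaged over many items
```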
25:01
That being said, evaluation
of open-ended questions
25:05
is something we're going
to talk about later,
25:06
and it's actually
really important
25:08
and really challenging.
25:09
Yes.
25:10
Earlier you mentioned
[INAUDIBLE] metrics
25:13
like perplexity
are not usually
25:16
used because it
depends on how you do
25:18
your tokenization,
some design choices.
25:21
I was wondering if you
could speak more to that.
25:24
Yeah.
25:25
So think about perplexity.
25:26
I told you perplexity is
between 1 and vocabulary size.
25:30
So now imagine that ChatGPT
uses a tokenizer that has 10,000
25:34
tokens but Gemini from Google
uses a tokenizer that had
25:38
100,000 potential tokens.
25:41
Then the upper bound
25:45
of the perplexity that you can
get is actually worse for Gemini
25:48
than for ChatGPT.
25:50
Does that make sense?
25:52
So that's just an idea.
25:53
It's actually a little bit
more complicated than that,
25:55
but that's just one
simple example
25:58
where you can see that the
tokenizer actually matters.
26:02
Great.
26:05
OK, so evaluation challenges.
26:07
There are many.
26:08
I'll just talk about
two really briefly.
26:10
One, as I told you, there are
two ways of doing evaluation
26:13
for these MMLUs.
26:14
Actually, there are
many more than two
26:16
but I gave you two examples.
26:17
And it happens that
for a long time,
26:20
even though that was a
very classical benchmark
26:22
that everyone uses,
different companies
26:27
and different
organizations were actually
26:32
using different ways
of evaluating MMLU.
26:34
And as a result, you get
completely different results.
26:37
For example, Llama-65b, which
was the first model of Meta
26:42
in the Llama series, had
on HELM 63.7 accuracy
26:47
but on this other
benchmark had like 48.8.
26:53
So really the way that you
evaluate, and this is not even
26:55
talking about prompting
this is really just the way
26:58
that you evaluate the models.
27:01
Prompting is another issue.
27:02
So really, there are a
lot of inconsistencies.
27:04
It's not as easy as it looks.
27:07
First thing.
27:08
Yeah, sorry.
27:08
How can we make sure
that all these models
27:10
aren't trained on the benchmark?
27:13
Second thing.
27:14
This is a great question.
27:15
Train test contamination.
27:17
This is something
which I would say
27:19
is really important
in academia in--
27:24
given that the talk is mostly
about training large language
27:26
models, for companies, it's
maybe not that important
27:29
because they know
what they trained on.
27:33
For us, we have no idea.
27:35
So, for us, it's a real problem.
27:37
So there are many
different ways of trying
27:39
to test whether the test set--
27:42
or sorry, whether the
test set was actually
27:44
in the training set.
27:45
One, kind of, cute trick
that people in the lab,
27:51
in [? Tatsu's ?] lab have
found, is that what you can do
27:54
is that given that most
of the data set online
27:57
are not randomized,
you can just look at--
28:00
and that language models,
what they do is just
28:02
predict the next word.
28:03
You can just look at
the entire test set.
28:06
What if you generate
all the examples
28:09
in order versus all the
examples in a different order.
28:13
And if it's more likely to
generate a thing in order, given
28:17
that there's no
real order there,
28:19
then it means that probably
it was in the training set.
28:21
Does that make sense?
28:23
So there are many--
that's like one of them.
28:24
There are many other
ways of doing it.
28:26
Train test
contamination, again, not
28:28
that important for development,
really important for
28:30
academic benchmarking.
28:33
Great.
28:33
So there are many
other challenges,
28:34
but I'll move on for now.
28:37
Great.
28:38
Data.
28:40
So data is another
really big topic.
28:43
At a high level people
just say you basically
28:45
train large language
models on all of the internet.
28:48
What does that even mean?
28:50
So people sometimes say,
well, of clean internet,
28:53
which is even less defined.
28:55
So internet is very dirty
and really not representative
28:59
of what we want in practice.
29:00
If I download a random
website right now,
29:03
you would be shocked
at what is in there.
29:06
It's definitely
not your Wikipedia.
29:08
So I'll go really briefly
on what people do.
29:14
I can answer some
questions, but I mean,
29:16
data is on its own
it's a huge topic.
29:19
Basically, first what you do
is download all of the internet.
29:22
What that means is that
you use web crawlers that
29:25
will go on every web page on
the internet, or every web page that
29:29
is on Google.
29:31
And that is around 250
billion pages right now.
29:36
And that's around
1 petabyte of data.
29:39
So actually, Common
Crawl is one web crawler.
29:42
So people don't usually
write their own web crawlers
29:45
what they do is that they
use standard web crawlers,
29:47
and Common Crawl is one of them
that basically every month adds
29:51
all the new websites that were
added on internet that are found
29:56
by Google, and they put it in
a big basically a big data set.
30:00
So that's-- on Common Crawl, you
have around 250 billion pages
30:04
right now.
30:04
So 1E6 gigabytes of data.
30:07
Once you have this--
30:09
so this is a random web page.
30:11
Like literally random
from this Common Crawl.
30:14
And what you see is
that one, it really
30:16
doesn't look like the type of things
that you would usually see,
30:18
but actually-- so
this is an HTML page.
30:21
It's hard to see, but
if you look through
30:24
you will see some content.
30:26
For example, here,
Test King World
30:30
is your ultimate source for
the system x high performance
30:33
server.
30:34
And then you have three dots.
30:35
So you don't even-- the
sentence is not even finished.
30:37
That's what random
internet looks like.
30:40
So, of course, it's
not that useful
30:42
if you just train a
large language model
30:44
to generate things like this.
30:45
So what are some of the
steps that are needed?
30:48
First one, you extract
the text from the HTML.
30:51
So that's what I just
tried to do by looking
30:53
at basically the correct tags.
30:55
There are a lot of
challenges through this.
30:57
For example, extracting
math is actually
30:59
very complicated, but pretty
important for training
31:02
large language models.
31:03
Or for example, boilerplates.
31:05
A lot of your forums will
have the same type of headers,
31:08
the same type of footers.
31:10
You don't want to repeat
all of this in your data,
31:13
and then you will filter
undesirable content.
31:16
So not safe for work,
harmful content, PII.
31:20
So usually every
company has basically
31:22
a blacklist of websites
that they don't
31:26
want to train their models on.
31:27
That blacklist is very
long and you basically
31:30
say if it comes from there,
we don't train on this.
31:32
There are other ways
of doing these things.
31:34
Is that you can train a small
model for classifying what
31:36
is PII, removing these things.
31:39
It's hard.
31:40
Every point here that
I'm going to show you
31:42
is a huge amount of
work, but I'm just
31:46
going to go quickly through it.
31:48
So filter undesirable content.
31:50
Second or fourth
is de-duplication.
31:54
As I said, you might have
things like headers and footers
31:57
in forums that are
always the same.
31:59
You want to remove that.
32:01
Another thing that
you might have
32:02
is a lot of URLs that are
different, but actually show
32:05
the same website.
32:08
And you might also have a lot of
paragraphs that come from common
32:13
books that are basically
duplicated 1,000 times
32:16
or 10,000 times on the internet.
32:18
So you have to de-duplicate.
32:20
Also very challenging because
you have to do that at scale.
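A minimal exact-deduplication sketch via hashing; real pipelines also need approximate near-duplicate methods (e.g. MinHash) to work at this scale:

```python
# Exact deduplication: keep only the first copy of each document,
# tracking content hashes instead of full texts to save memory.
import hashlib

def dedup(documents):
    seen, kept = set(), []
    for doc in documents:
        h = hashlib.sha1(doc.encode("utf-8")).hexdigest()
        if h not in seen:          # first time we see this exact content
            seen.add(h)
            kept.append(doc)
    return kept
```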
32:24
Once you do the
de-duplication, you
32:26
will do some
heuristic filtering.
32:28
You will try to remove
low-quality documents.
32:31
The way you do that is with things
like rule-based filtering.
32:35
For example, if you see that
there are some outlier tokens.
32:37
If the distribution of
tokens in the website
32:39
is very different than the
usual distribution of tokens,
32:42
then it's probably some outlier.
32:43
If you see that the length
of the words in this website
32:46
is super long, there's something
strange going on with that website.
32:49
If you see that the website
has only three words,
32:52
is it worth
training on it?
32:54
Maybe not.
32:54
If it has 10 million words,
maybe there's something also
32:58
wrong going on with that page.
33:00
So a lot of rules like this.
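A sketch of what such rule-based filters can look like; every threshold here is invented for illustration, not taken from any real pipeline:

```python
# Heuristic document filtering with made-up thresholds, in the spirit of
# the rules just described (word counts, word length, token distribution).
def passes_heuristics(text):
    words = text.split()
    if not (5 <= len(words) <= 10_000_000):    # too few or absurdly many words
        return False
    mean_word_len = sum(len(w) for w in words) / len(words)
    if mean_word_len > 15:                     # suspiciously long "words"
        return False
    alpha_frac = sum(c.isalpha() for c in text) / len(text)
    return alpha_frac > 0.6                    # drop mostly-symbolic pages
```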
33:01
Yes.
33:02
Why do we filter out
undesirable content
33:04
from our data set instead
of putting it in as,
33:08
like, a supervised loss?
33:10
Can we not just say, here's
this like, hate speech website,
33:14
let's actively try to--
33:17
let's actively penalize
the model for getting it.
33:19
We'll do exactly that,
but not at this step.
33:22
That's where the post-training
will come in.
33:25
In pretraining, the
idea is just to say
33:30
I want to model, kind of, how
humans speak, essentially.
33:34
And I want to remove all
these headers, footers
33:36
and menus and things like this.
33:38
But it's a very good
idea that you just had.
33:41
And that's exactly
what we'll do later.
33:45
Next step,
model-based filtering.
33:47
So once you filter a lot
of data, what you will do--
33:50
that's actually a
very cute trick.
33:51
You will take all
of Wikipedia and you
33:54
will look at all
the links that are
33:56
linked through Wikipedia pages.
33:58
Because probably if something
is referenced by Wikipedia,
34:01
it's probably some
high-quality website.
34:02
And you will train a classifier
to predict whether something
34:07
comes from-- whether a
document comes from one
34:10
of these references
from Wikipedia
34:13
or whether it's
from the random web.
34:15
And you will try
to basically say,
34:17
I want more of the things that
come from Wikipedia references.
34:21
Does that make sense?
34:23
So yeah.
34:24
So you will train a
machine learning model.
34:26
Usually also very simple
models because you
34:28
need to do that really at scale.
34:30
I mean, just think about
the 250 billion pages.
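A sketch of that quality classifier, using scikit-learn purely for illustration; in practice people use similarly simple, fast models (fastText-style) so the filter can run over billions of pages:

```python
# Model-based quality filtering: Wikipedia-referenced pages as positives,
# random Common Crawl pages as negatives, simple linear classifier.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

def train_quality_filter(wiki_ref_docs, random_web_docs):
    texts = wiki_ref_docs + random_web_docs
    labels = [1] * len(wiki_ref_docs) + [0] * len(random_web_docs)
    vec = HashingVectorizer(n_features=2**18)   # fast, no fitted vocabulary
    clf = LogisticRegression(max_iter=1000).fit(vec.transform(texts), labels)
    # Returns an estimated "looks like a Wikipedia reference" probability.
    return lambda doc: clf.predict_proba(vec.transform([doc]))[0, 1]
```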
34:34
Next one, you will try
to classify your data
34:37
into different domains.
34:41
You will say, OK, this is
entertainment, this is books,
34:43
this is code, this is like
these type of domains.
34:46
And then you will try to
either up or down weight
34:51
some of the domains.
34:52
For example, you might say--
34:54
you might see that actually if
you train more on code, then
34:57
actually your model becomes
better on reasoning.
34:59
So that's something that
people usually say in
35:01
a very hand-wavy way.
35:02
If you train your
model more on code,
35:04
actually it helps reasoning.
35:05
So you want to upweight
the coding distribution
35:08
because that helps for general
language modeling skills.
35:11
Books are usually also another
one that people usually upweight.
35:16
Entertainment, they
usually down weight.
35:18
So things like this.
35:19
Of course, you want to do it--
so people used to do it, maybe
35:24
kind of heuristically.
35:25
Now there's entire
pipelines that we'll
35:27
talk about of how to do
these things slightly
35:30
more automatically.
35:33
And then at the end of
training, you usually train--
35:37
after training on all
of this data that we saw
35:40
you usually train on
very high quality data
35:42
at the end of training your
large language model where you
35:46
decrease your learning rate.
35:47
And that basically
means that you're,
35:49
kind of, overfitting your model
on a very high quality data.
35:52
So usually what you
do there is Wikipedia.
35:55
You basically
overfit on Wikipedia
35:57
and you overfit on, like,
human data that was collected.
36:04
The other thing is like
continual pretraining
36:06
for getting longer context.
36:07
I'm going to skip over
all of these things.
36:09
But that's just to give
you a sense of how hard it
36:12
is when people just say I'm
going to train on internet,
36:15
that's a lot of work.
36:17
And, really, we haven't
figured it out yet.
36:19
So collecting data well
is a huge part
36:23
of practical large
language modeling.
36:24
Some might say that
it's actually the key.
36:26
Yes.
36:27
[INAUDIBLE] about data.
36:29
So basic question.
36:30
So usually when you start
with like a petabyte of data,
36:33
after you go through
all the steps,
36:35
what's the typical amount
of data you have remaining.
36:37
And then how large a
team does it typically
36:40
take to go through all the
data steps you talked about?
36:43
Sorry how la-- is your
question how large
36:45
is the data after you filter?
36:46
Yeah.
36:47
After you filter and then
you go through all the steps.
36:49
How large a team do you
need to go through, like,
36:52
all the filtration
steps you mentioned.
36:54
How slow is it or--
36:56
How many people
would you need to be
37:00
able to do this [INAUDIBLE]?
37:02
OK that's a great question.
37:03
I'm going to somewhat
answer about the data.
37:06
How large is the data set
at the end of this slide.
37:10
For number of people that work
on it, that's a good question.
37:15
I'm actually not quite
sure, but I would say, yeah,
37:19
I actually don't
quite know but I
37:22
would say it's probably even
bigger than the number of people
37:25
that work on the tuning of
the pretraining of the model.
37:29
So the data is bigger
than the modeling aspect.
37:34
Yeah, I don't think
I have a good sense.
37:37
I would say probably in the Llama
team, which has 70-ish people,
37:41
I would say maybe
15 work on data.
37:45
Yeah.
37:46
All these things, you don't
need that many people,
37:48
you need a lot of compute also.
37:49
Because for data you
need a lot of CPUs.
37:52
So, yeah.
37:53
And I'll answer
the second question
37:54
at the end of this slide.
37:56
So as I just, kind
of, alluded to really,
37:59
we haven't solved data
at all for pretraining.
38:02
So there's a lot of research
that has to be done.
38:04
First, how do you process
these things super efficiently?
38:07
Second, how do you
balance kind of all
38:09
of these different domains?
38:10
Can you do synthetic
data generation?
38:12
That's actually a
big one right now.
38:14
And because we don't have--
38:16
we'll talk about that
later, but we don't have
38:18
enough data on the internet.
38:20
Can you use multimodal data
instead of just text data?
38:23
And how does that improve
even your text performance?
38:28
There's a lot of secrecy
because, really, this
38:30
is the key of most of the
pretraining large language
38:33
models.
38:34
So for competitive dynamics,
usually these companies
38:39
don't talk about how they
do the data collection.
38:41
And also there's a
copyright liability issue.
38:44
They definitely don't
want to tell you
38:45
that they've trained on
books even though they did
38:47
because if not can sue them.
38:50
Common academic benchmarks.
38:52
So that will, kind of,
answer what you asked.
38:54
It started-- so those
are the smaller ones.
38:57
The names are not
that important,
38:58
but it started from around
150 billion tokens, which are
39:02
around 800 gigabytes of data.
39:04
And now it's around
15 trillion--
39:06
15 trillion tokens,
which is also
39:09
the size of the datasets that
are-- right now the best models
39:12
are probably trained
on that amount of data.
39:14
So 15 trillion tokens,
which is probably,
39:18
I guess, two orders of
magnitude bigger than that.
39:20
So 80E3 gigabytes.
39:23
So that would be around 100
to 1,000 times filtering
39:29
of the Common Crawl,
if I'm not mistaken.
39:32
So, yeah.
39:34
One very famous one is the Pile.
39:37
So this is an academic
benchmark, the Pile.
39:39
And we can just look at what
distribution of data they have.
39:42
It's things like
archive, PubMed Central,
39:46
which is all the biology stuff.
39:50
Here it's Wikipedia, you see
Stack Exchange, some GitHub
39:55
and some books and
things like this.
39:58
Again, this is on
the smaller side.
39:59
So this is-- if we look at here,
this is on 280B so, in reality,
40:03
it's like 100 times bigger
so you cannot have that much
40:05
of GitHub and of Wikipedia.
40:09
In terms of closed
source models.
40:11
Just to give you
an idea, Llama 2
40:14
it was trained on
2 trillion tokens,
40:16
Llama 3 on 15 trillion
tokens, which is currently
40:19
the best model for which we know
how much it was trained on,
40:22
and which is the same as
the biggest
40:26
academic benchmark, which
is 15 trillion tokens.
40:29
GPT4 we don't really know,
but it's probably
40:31
in the same order of magnitude,
probably around that.
40:33
Actually, it's probably
around 13 trillion, from leaks.
40:36
If the leaks are true.
40:39
Great.
40:41
So scaling laws.
40:43
Any other questions on data
before we go to scaling laws?
40:48
Sorry I know I'm giving
you a lot of information,
40:51
but there's a lot into
training, large language models.
40:54
Great scaling laws.
40:56
So the idea is that what people
saw around 2020, or at least
41:01
for a long time, but they've
been able to theoretically show
41:05
it or empirically
show it since 2020,
41:07
is that the more data
you train your models on
41:09
and the larger the models,
the better the performance.
41:12
This is actually pretty
different than what
41:14
you've seen in this class.
41:15
In this class we teach
you about overfitting.
41:17
Overfitting doesn't happen
with large language models.
41:20
Larger models,
better performance.
41:23
It's something that
really took a long time
41:25
for the community who took
this type of class to realize.
41:29
But for the exam,
overfitting exists.
41:33
So, OK, the idea of scaling laws
is that-- given that more
41:38
data and larger
models will always
41:40
give you better
performance, can we
41:42
predict how much better
your performance will
41:46
be if you increase the amount of
data and the size of your model?
41:50
And surprisingly, it works.
41:52
So here you see three plots
from a very famous paper called
41:55
Scaling Laws from OpenAI.
41:57
Here you see on
the x-axis compute.
42:00
So how much did you train--
42:01
like, how much compute did
you spend for training?
42:04
And here you see test loss.
42:05
So this is essentially,
I mean, perplexity,
42:08
but it's your validation loss.
42:09
So it's a log of the perplexity.
42:11
And if you put these
two on log scale,
42:15
then you see that the
performance or the--
42:19
sorry, the scaling
law is linear.
42:22
That means that if you
increase your compute
42:25
by a certain amount, you can say
by how much your test loss will
42:29
actually decrease.
42:30
Same thing with data and
same thing for parameters.
42:33
If you increase
the data set size,
42:35
your loss will
decrease by an amount
42:38
that is somewhat predictable.
42:40
If you increase the
number of parameters,
42:42
the loss will
decrease by an amount,
42:44
which is somewhat predictable.
42:45
This is really amazing.
42:47
Very surprising.
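Concretely, fitting a scaling law usually means assuming a power law, loss(C) = a * C^(-b), which is a straight line in log-log space; a minimal sketch:

```python
# Fit a power-law scaling curve by least squares in log-log space,
# then use it to extrapolate loss at a larger compute budget.
import numpy as np

def fit_scaling_law(compute, loss):
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
    a, b = np.exp(intercept), -slope
    return lambda C: a * C ** (-b)     # predicted loss at a new budget

# Usage (numbers invented): fit on small runs, extrapolate 100x further.
# predict = fit_scaling_law([1e19, 1e20, 1e21], [3.2, 2.9, 2.6])
# predict(1e23)
```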
42:49
I mean, it looks innocuous when
you look at these type of plots,
42:52
but that's crazy because it
means that you can predict
42:55
how well we're going to
perform in two or three years,
42:58
depending on how much
compute we will add,
42:59
assuming that these
things will hold.
43:01
There's nothing
theoretical about it.
43:04
Yes.
43:05
Two things.
43:06
One, what is the loss
that they're using here.
43:08
Is this perplexity?
43:09
So it's-- I said perplexity was
like 2 to the power of the loss.
43:13
So this loss is the log
of the perplexity.
43:17
And then the second
thing is, when
43:19
you increase the
number of parameters
43:21
or you increase the data
set size [INAUDIBLE] data
43:24
[INAUDIBLE] times, doesn't
that just inherently
43:26
increase your compute?
43:27
Like does all of this
[INAUDIBLE] come to just how
43:30
[INAUDIBLE] you [INAUDIBLE]?
43:31
Yes.
43:31
--or something
specific [INAUDIBLE]?
43:32
No, this is a great question.
43:33
So the compute here is actually
a factor of two things, the data
43:37
and the parameter.
43:38
What I'm showing here
is that you can--
43:40
well, actually, we're going
to talk about that in details.
43:42
But basically, if you increase
the number of parameters,
43:44
you should increase the
number of data that you have.
43:48
So you actually don't
go multiple times
43:50
to the same data set.
43:51
No one does epochs
in at least not yet
43:56
because we haven't still
kind of enough data.
43:59
So yeah, this is
all the same trend,
44:01
which is increase
compute decrease loss.
44:04
Yes.
44:06
Have we seen the numbers for
the last two years, or is this
44:09
still holding?
44:10
It is still holding.
44:13
I don't have good
numbers to show you,
44:16
but it is still
holding, surprisingly.
44:20
Yes.
44:21
Is there no evidence that
[INAUDIBLE]
44:23
will ever plateau?
44:25
In theory, we would expect
it to plateau, [INAUDIBLE]?
44:28
No empirical evidence of
plateauing anytime soon.
44:33
Why?
44:34
We don't know.
44:35
Will it happen?
44:37
Probably.
44:37
I mean, it doesn't need
to because it's actually
44:39
in log scale.
44:40
So it's not
as if it has to
44:43
plateau.
44:44
Like mathematically, it could
continue decreasing like this.
44:47
I mean, most people think
that it will probably
44:49
plateau at some point.
44:50
We don't know when.
44:54
So that's-- I'll talk more
about scaling laws now.
44:57
So why are scaling
laws really cool?
44:59
Imagine that I gave you--
45:02
you're very fortunate I gave
you 10,000 GPUs for this month.
45:05
What model will you train?
45:07
How do you even go about
answering that question?
45:09
And I mean, this
is a hypothetical,
45:12
but that's exactly what these
companies are faced with.
45:16
The old pipeline,
which was basically
45:19
tune hyperparameters
on the big models.
45:21
So let's say I have
30 days, I will train
45:24
30 models for one day each.
45:26
I will pick the best one and
that will be the final model
45:30
that I will use in production.
45:32
That means that the model
that I actually used
45:34
was only trained for one day.
45:36
The new pipeline is that you
first find a scaling recipe.
45:40
So you find something that
tells you, for example,
45:43
like one common thing
is that if you increase
45:45
the size of your model, you
should decrease your learning
45:46
rate.
45:47
So you find a
scaling recipe such
45:49
that you know if I increase
the size of my model,
45:52
here's what I should do
with some hyperparameters.
45:55
Then you tune your
hyperparameters
45:57
on smaller models
of different sizes.
46:00
Let's say I will say for
three days, of my 30 days,
46:03
I will train many
different models.
46:05
And I will do
hyperparameter tuning
46:07
on these small models,
each of different sizes.
46:09
Then I will fit a
scaling law and try
46:11
to extrapolate from these
smaller models, which
46:15
one will be the best if I
train it for much longer--
46:20
or sorry, if I train
a larger model.
46:22
And then I will train
the final huge model
46:24
for 27 days instead
of just one day.
46:28
So the new pipeline
is not train things
46:31
or do hyperparameter tuning
on the real scale of the model
46:34
that you're going
to use in practice,
46:35
but do things on smaller
ones at different scales.
46:39
Try to predict how
well they will perform
46:41
once you make them bigger.
46:43
I will give-- I will give you a
very concrete example right now.
46:46
Let's say transformers
versus LSTMs.
46:49
Let's say you have
these 10,000 GPUs,
46:51
you are not sure which
one you should be using.
46:53
Should I be using a
transformer-based model
46:55
or LSTM-based model.
46:56
What I will do is I
will train transformers
46:58
at different scales.
47:00
So here you see different
parameters on the x-axis,
47:02
y-axis is my test loss.
47:04
I will then train different
LSTMs at different scales.
47:08
Once I have these points,
I will see oh it, kind of,
47:11
fits a scaling law.
47:12
I will fit my
scaling law and then
47:14
I will be able to predict if
I had 10 times more compute,
47:18
here's how well I would
perform for the LSTM.
47:21
It's actually slightly
less linear for the LSTM,
47:23
but you can probably try to
predict where you would end up.
47:26
And clearly from this
plot, you would see
47:28
that transformers are better.
47:30
One thing to notice when you
read these type of scaling laws
47:33
is that there are two
things that are important.
47:35
One is really your
scaling rate, which
47:40
is the slope of the-- the
slope of the scaling law.
47:45
The other thing
is your intercept,
47:49
you could start
worse, but actually
47:52
become better over time.
47:53
It just happens that
LSTMs are worse for both.
47:55
But I could show you
another one where things--
47:58
you can predict that actually
after a certain scale
48:01
you're better off using that
type of model than others.
48:04
So that's why scaling laws
are actually really useful.
48:08
Any questions on that?
48:12
Yeah.
48:12
So these are all,
kind of, very--
48:15
how sensitive are these to small
differences in the architecture.
48:18
Like one like
transformer architecture
48:21
versus another
transformer architecture.
48:23
Do you think we have
to fit your own curve
48:26
and, basically, say like oh
scaling laws tell me this should
48:28
be some logarithmic function.
48:31
Like, let me
extrapolate that for
48:33
my own specific architecture.
48:35
Yeah, so usually, for
example, if you're an academic
48:38
and you want to-- now at
least that's pretty recent
48:40
and you want to propose
a new activation.
48:43
That's exactly what you will do.
48:45
You will fit a scaling law,
show another scaling law
48:47
with the standard one,
like, I don't know, GELU,
48:49
and you will say
that it's better.
48:50
In reality, once you start
thinking about it in scaling
48:53
laws terms, you really
realize that actually
48:55
all the architecture
differences that we
48:57
can make, like the small,
minor ones, all they do
48:59
is maybe change a little
bit the intercept.
49:03
But really that doesn't
matter because just
49:05
train it for 10 hours longer or
like wait for the next generation of
49:09
GPUs and these things
are really secondary.
49:12
Which is exactly why I was
telling you originally,
49:14
people spend too much time on
the architecture and losses.
49:17
In reality, these things
don't matter as much.
49:19
Data though.
49:19
If you use good data, you will
have much better scaling laws
49:23
than if you use bad data.
49:24
So that really matters.
49:27
Another really cool thing
you can do with scaling laws
49:29
is that you can ask yourself,
how to optimally allocate
49:33
training resources.
49:35
Should I train larger models.
49:37
Because we saw that it's better
when you train larger models,
49:39
but we saw that it's also
better when you use more data.
49:42
So which one should I do?
49:43
Should I just train on
more data, a smaller model,
49:46
or should I train a
larger model on less data?
49:49
So Chinchilla is a very famous
paper that first showed this.
49:53
The way they did it,
I want to give you
49:55
a little bit of a sense
of what these plots are.
49:58
Here you see training
loss again, and on the x-axis,
50:00
you see parameter differences,
sorry, parameter size--
50:04
number of parameters.
50:04
So the size of the model.
50:06
And here all these
curves are what
50:07
we call ISO flops, which is that
all the models on this curve
50:13
have been trained with the
same amount of compute.
50:17
The way that you do
that is that you train--
50:19
you change.
50:20
Sorry, you vary the number of
tokens that were trained on
50:22
and the size of the models,
but you vary in such a way
50:25
that the total compute
is constant, OK.
50:27
So all these curves that you
see with different colors
50:29
have different amounts of
compute that they were trained with.
50:32
Then you take the best one
for each of those curves.
50:35
Once you have the best one
for each of those curves,
50:38
you can ask-- you can
plot how much flops it was
50:44
and which curve were you
on and how much parameters
50:47
did you actually use for
training that specific point.
50:50
You put that on the log
log scale again and now
50:55
you fit a scaling law again.
50:56
So now I have something
which tells me
50:59
if I want to train a model of 10
to the power 23 flops, here is
51:03
exactly the number of parameters
that I should be using.
51:06
100 B.
51:07
And you can do the same
thing with flops and tokens.
51:11
So now you can predict--
51:13
if I tell you exactly I
have one month of compute,
51:16
what size of model
should I be training?
51:18
Fit the scaling
law, and I tell you.
51:21
Of course that all
looks beautiful.
51:23
In reality like there's a
lot of small things of like,
51:26
should you be counting,
like, embedding parameters,
51:29
there's a lot of complexities.
51:30
But if you do things well,
these things actually do hold.
51:35
So the optimal ratio that
the Chinchilla paper
51:38
found is to use 20
tokens for every parameter
51:42
that you train.
51:44
So if you add one
more parameter,
51:45
you should train your thing on--
your model on 20 more tokens.
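To make the allocation rule concrete, here is a minimal sketch. It assumes the approximation compute = 6 x parameters x tokens (the same rule of thumb used below for Llama 3) together with the 20-tokens-per-parameter ratio; solving C = 6 * N * (20 * N) gives N = sqrt(C / 120).

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20):
    # C = 6*N*D with D = 20*N  =>  C = 120*N^2  =>  N = sqrt(C / 120)
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = chinchilla_optimal(1e23)
print(f"{n:.1e} parameters, {d:.1e} tokens")  # ~2.9e10 params, ~5.8e11 tokens
```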
51:49
So one caveat here is that this
is optimal for training resources.
51:53
So that is telling me: if you
have 10 to the power 23 flops--
51:57
I don't know exactly
how much that is in dollars--
52:00
52:02
let's say I have
$5 million to train
52:05
my best model that
gets the lowest
52:07
loss, what would I train on?
52:09
In reality, these companies need
to think about inference also.
52:12
If you have a smaller model,
they will spend less over time.
52:17
So actually, if you
consider the inference cost,
52:20
you have other papers that
try to show that, it's
52:23
around 150--
52:26
tokens per parameter, because
you prefer having a smaller
52:29
model because over
time you're going
52:32
to actually spend less money
on inference of these models.
52:37
So 150 to 1, that's around what
the best models are trained
52:42
on right now, at least
the ones that are
52:45
used in practice in production.
52:49
Great.
52:51
Any questions on Chinchilla?
52:55
Great.
52:56
Oh sorry.
52:58
In practice, how expensive
is inference for these models
53:01
relative to training?
53:03
Actually, very expensive.
53:05
I will not talk about
inference because that would
53:07
be another entire lecture.
53:09
But just think
about ChatGPT where
53:11
they have I don't know
how much it is now,
53:14
like 600 million
people that use it.
53:18
Like, that's a lot.
53:22
Yeah.
53:23
So it's actually very expensive.
53:24
There's a lot of optimization
you can do for inference though.
53:27
And that's an entire
other lecture.
53:29
I'm going to skip that this
time, but it's very interesting.
53:33
OK, moving on.
53:34
As I said, there are
many things that you
53:36
can answer with scaling laws.
53:38
I just try to give
you two examples,
53:40
but really there
are many things.
53:42
What data do you use?
53:43
What data-mixing
weights do you use?
53:46
The mixtures, that's what
we talked about before.
53:49
What architecture you use,
whether you should make
53:51
your models wider or deeper?
53:54
Should you be
paying for more GPUs
53:56
or actually
collecting more data?
53:58
All these things are
things you can try
54:00
to answer with scaling laws.
54:03
One thing I want to say
is the bitter lesson.
54:05
If you ever heard
of Richard Sutton,
54:08
very famous blog post in
2019, what he realized,
54:12
which I think not
enough people realize,
54:16
I didn't-- definitely did
not realize at that time,
54:19
is that once you see these type
of scaling laws you know that
54:23
the more compute you have, the
better models you will get.
54:26
So with scale, you
will get better model.
54:28
And you also know by
Moore's law or these type
54:30
of variants of Moore's
law that you will always
54:33
have better compute.
54:34
Then the only thing
that matters is just
54:36
to have architectures that
can leverage computation.
54:40
So what matters is basically
systems and data, and less
54:44
so the architecture-- like
the small architecture
54:46
differences, like your
activation and things like this.
54:49
So I think that's one of the
reasons why most of research
54:52
focuses on some things that
for industry matters less.
54:56
And I was one of
those researchers
54:58
for a large part of my career.
55:02
So don't spend time
overcomplicating.
55:04
Do the simple
things, do them well.
55:07
Scale them.
55:08
That's really what OpenAI taught
us with ChatGPT and with all
55:12
the GPTs before.
55:15
OK, I want to give you some back
of the envelope computation.
55:18
So I might be off by
a few factors here,
55:20
but I just want to give you
a sense of how costly it is
55:23
to train some of these models.
55:25
I'll give you an example:
55:26
Llama 3 405B, which is currently
the best open source model that
55:30
you can get.
55:31
It was trained on 15.6 trillion tokens.
55:35
It has 405 billion parameters.
55:37
So now that
you know what
55:39
this optimal tokens-per-parameter
ratio is: here, that's around 40.
55:43
So that's a little bit
more than Chinchilla,
55:45
but less than this
inference-optimal ratio.
55:50
So they went for
training optimality.
55:53
Flops for this model:
55:55
So one simple way
to compute flops
55:57
is 6 times the
number of parameters,
56:00
times the number of
data that you train on.
56:03
So if you do the simple
calculation here,
56:04
it's 3.8e25 flops.
56:07
The reason why this
is important is
56:09
that if you follow the news
a little bit,
56:11
there's an executive order
from Biden that basically
56:13
says that once you train with 1e26
flops, then
56:19
you have special
scrutiny on your models.
56:21
So they went to
2X less than that.
56:23
So they really went
right below this
56:25
to not have special scrutiny.
56:27
So 3.8.
56:28
I might be off by a little
bit, but it's definitely
56:30
under the 1e26.
56:36
So P is the number of parameters and
N is the data-- the number of tokens.
56:41
This is just an approximation.
56:46
Yeah.
56:48
OK.
56:49
Compute: we know that they
trained on 16,000 H100s, and we
56:55
know the throughput
they got.
56:58
So if you do the computation,
it takes around 70 days,
57:02
or 26 million GPU hours.
57:05
At least, that's what my back-
of-the-envelope computation says.
57:08
They actually said that
they use 30 million
57:10
instead of 26 million GPU hours.
57:13
So maybe they had
some challenges.
57:17
I don't really know.
57:18
But if you follow the
simple computation,
57:20
it's around 70 days.
57:22
Cost.
57:24
I mean, it's
hard to approximate,
57:27
but I'm just going to say
it's, kind of, the rent.
57:29
Like, what if I wanted to
rent that many H100s
57:33
for that many days,
how much would I pay?
57:36
A lower bound on
the renting cost of an H100
57:41
is around--
57:42
$2 per hour.
57:43
So if you multiply this
by 26,000,000 hours,
57:48
you get $52 million.
57:50
So they probably
pay less than that,
57:52
but not actually much less
because all these services
57:58
that actually rent GPUs, they
don't make that much money.
58:00
So it's probably slightly
less, but not that much less.
58:04
Now salary: say 50
employees at $500k per year.
58:10
Yeah, that's probably
the right ballpark.
58:12
$25 million.
58:13
So if you put altogether
around $75 million
58:17
for training this llama model.
58:21
I'm probably off
by like 10 million,
58:22
but that's kind
of right ballpark.
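If you want to redo this back-of-the-envelope math yourself, here is a minimal sketch putting the numbers above together; the per-GPU peak throughput and the ~45% utilization are assumptions (utilization comes up again in the systems part).

```python
# Back-of-the-envelope for Llama 3 405B, using the numbers from above.
params = 405e9    # model parameters
tokens = 15.6e12  # training tokens
flops = 6 * params * tokens
print(f"training FLOPs ~ {flops:.1e}")  # ~3.8e25, just under the 1e26 threshold

n_gpus = 16_000
peak_flops_per_gpu = 1e15  # assumed H100 peak in bf16, ~1e15 FLOP/s
utilization = 0.45         # assumed fraction of peak actually achieved
seconds = flops / (n_gpus * peak_flops_per_gpu * utilization)
gpu_hours = n_gpus * seconds / 3600
print(f"~{seconds / 86400:.0f} days, ~{gpu_hours / 1e6:.0f}M GPU hours")  # same ballpark as above

rent = gpu_hours * 2.0  # ~$2/hour lower bound on H100 rent
salary = 50 * 500_000   # 50 employees at $500k/year
print(f"~${(rent + salary) / 1e6:.0f}M total")  # roughly the $75M ballpark
```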
58:27
Carbon emitted.
58:29
A lot of people might ask
like also the cost is not
58:32
the only thing
that is important.
58:33
So I did the computation.
58:35
It's around 4000 tons
of CO2 equivalent.
58:42
That is actually only
2000 return tickets
58:45
from JFK to London.
58:47
So right now carbon
emitted is actually not--
58:51
I mean, it's huge, but
it's not meaningful yet.
58:56
I think in maybe GPT6,
GPT7, once you multiply this
59:01
by 100, that might
become a real issue.
59:04
Right now it's
still not, I think,
59:07
an issue in the grand
scheme of things.
59:09
Next models: the way you should be
thinking about these models is
59:12
that every new generation, the
number of flops essentially
59:16
multiplies by 10x, or at least
that's what they aim for, if they
59:19
have enough energy.
59:20
And if they can buy enough GPUs.
59:23
Great.
59:23
Any question on these
back of the envelope math.
59:29
No.
59:30
OK.
59:31
So now we talked
about pretraining,
59:34
I wanted to also
chat about systems
59:36
because now we know compute
is really important so there's
59:39
a question of how do
you optimize the--
59:41
how do you optimize the compute?
59:43
I will leave that for
the end because I'm not
59:45
sure how much time we will have.
59:46
I think it's important,
but hopefully I'll
59:48
be able to talk about it later.
59:50
It's slightly different
than what we've
59:52
been talking about right now.
59:54
So I'll move on to
post-training for now.
59:56
So the task of
post-training, the reason why
59:59
we need to do post
training is, as I told you
1:00:01
before, it's to
make AI assistants.
1:00:06
So language modeling
is not really the thing
1:00:09
that you want when you
have an AI assistant.
1:00:12
For example, if you
ask GPT-3, which
1:00:14
is a purely language model--
1:00:16
a pure language model,
a non-aligned one.
1:00:20
If you ask the question
"explain the moon landing
1:00:22
to a six-year-old," the
completion that you would get
1:00:26
is something like "explain the theory
of gravity to a six-year-old."
1:00:29
Because what it learned
is that on the internet,
1:00:31
if you have one
question, you usually
1:00:33
have maybe another bullet point
of other similar questions;
1:00:36
you don't usually have a
question and then its answer right after.
1:00:39
This is not what you want
from an AI assistant.
1:00:42
So how do we do this
alignment, which
1:00:46
is this post training and
making these models assistants?
1:00:49
So the goal of this
alignment is to basically get
1:00:52
LLMs to follow the
instructions that
1:00:55
are given by users, and
maybe the designers',
1:01:00
kind of, desires.
1:01:02
So think about motivation.
1:01:04
You don't want the
model-- like OpenAI
1:01:06
doesn't want the model to
say stuff that is very toxic.
1:01:09
So here you see on
the left-hand side
1:01:12
that when you ask a question, it
actually provides a real answer.
1:01:15
So it's not like before the LLM.
1:01:17
And on the right-hand side,
you see that it would--
1:01:20
if you ask to write a tweet
describing how a certain part
1:01:25
of the population are evil, it
will say that it cannot do that.
1:01:29
So that's kind of
this alignment.
1:01:32
The background here is
that basically the data
1:01:38
that you want for training
some of these models is--
1:01:41
like, we know what we want.
1:01:42
Which is just asking
humans, this is a question,
1:01:44
this is the answer
that you want.
1:01:46
But the thing is that it's very
expensive to collect that data,
1:01:48
and it's hard to find it online.
1:01:51
In contrast, pretraining
data is not what you want,
1:01:54
but there's a lot of it.
1:01:56
So what we will do, or
the main idea is simply
1:01:59
take a pretrained
large language model
1:02:01
pretrained on all of internet
and then just fine tune.
1:02:03
So you just change a little bit
the weights on the type of data
1:02:06
that you actually want.
1:02:07
And hopefully given
it, you already
1:02:08
pretrained it on
all of internet,
1:02:10
it basically learns or knows
how to speak in English
1:02:13
and knows standard
language syntax
1:02:18
then you can really fine tune
it with very little data.
1:02:23
OK, SFT.
1:02:24
So Supervised Fine Tuning is
really exactly what I just said.
1:02:27
Which is the idea of
fine-tuning the large language
1:02:29
model on basically the
desired answers that
1:02:33
are collected from humans.
1:02:35
So why is it called
supervised fine tuning?
1:02:37
Because you basically want to
do language modeling on the real
1:02:41
answers.
1:02:41
So language modeling is this
like next word prediction,
1:02:44
and that's the fine tuning part.
1:02:45
And then you want to do it on
desired answers given by humans
1:02:48
so that's why we
call it supervised.
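As a minimal sketch of that loss: it is the ordinary next-token cross-entropy, just restricted to the answer tokens. The `model(...).logits` interface and the exact masking convention here are assumptions for illustration.

```python
import torch.nn.functional as F

def sft_loss(model, input_ids, prompt_len):
    # input_ids: (batch, seq) prompt tokens followed by human-written answer tokens.
    logits = model(input_ids).logits          # (batch, seq, vocab)
    shift_logits = logits[:, :-1, :]          # predict token t+1 from tokens <= t
    shift_labels = input_ids[:, 1:].clone()
    # Mask the prompt positions so the loss only covers the desired answer.
    shift_labels[:, : prompt_len - 1] = -100  # -100 is ignored by cross_entropy
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```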
1:02:51
So how do we collect this data?
1:02:52
Well, I just said it.
1:02:54
You just ask humans
to tell you this
1:02:57
is a question this is
the answer that you would
1:02:59
want from some of these models.
1:03:00
So this is an example.
1:03:03
I can't read very
well on my computer,
1:03:04
but my kid needs
to do a science--
1:03:08
no let's read this one.
1:03:09
Can you write a
short introduction
1:03:11
about the relevance
of the term monopsony?
1:03:13
And then it says monopsony
refers to a market
1:03:15
structure, blah blah, blah.
1:03:16
And that's a human
that wrote that.
1:03:19
So, actually, this
is Open Assistant,
1:03:20
which was a way to collect
data online by humans.
1:03:27
So this type of supervised
fine tuning or alignment
1:03:31
is really the key of ChatGPT.
1:03:33
This is what made the big jump
from GPT 3, which was mostly
1:03:37
something that was
known by AI researchers
1:03:40
to ChatGPT, which became
known by basically everyone.
1:03:46
So the problem
with human data is
1:03:51
that it's very slow to
collect and very expensive.
1:03:56
So one possible
simple idea is to use
1:04:00
LLMs to scale data collection.
1:04:03
So that's exactly what we
did with Alpaca one year ago.
1:04:06
What we did is that
we started from humans:
1:04:09
we used a data set of
human question-answer pairs.
1:04:11
So there were 175
question-answer pairs here,
1:04:15
and we asked the best
model at the time,
1:04:16
text-davinci-003, to basically
generate many more of these
1:04:21
questions and answers.
1:04:22
So all we did is say: this is
what humans would write; now
1:04:25
write similar questions
and similar answers.
1:04:27
And we collected 52,000
LLM-generated question-answer pairs.
1:04:32
And then what we did is
simply we took llama 7B,
1:04:34
which was the best
pre-trained model at the time.
1:04:36
And we just fine tuned this
with supervised fine tuning,
1:04:39
as I told you.
1:04:39
And that's how we got
the Alpaca 7B model.
1:04:44
And this is the type of
data that we collected.
1:04:47
So things like what
does algorithm mean?
1:04:49
And algorithm is a step by
step set of instructions
1:04:53
you use to solve a problem or
achieve a goal, blah, blah,
1:04:55
blah, blah.
1:04:56
So the data is not bad actually--
it's actually pretty good,
1:04:58
given that it was generated
by LLMs from essentially two
1:05:02
generations ago.
1:05:04
So that really started
at least for us
1:05:07
as an academic
replication of ChatGPT.
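The generation loop looks roughly like this — a heavily simplified sketch, where `generate` stands in for the call to the strong LLM and the prompt wording is invented, not Alpaca's actual prompt:

```python
import random

# Toy seed pairs standing in for the 175 human-written ones.
seed_tasks = [
    ("What does algorithm mean?", "An algorithm is a step-by-step set of instructions..."),
    ("Define monopsony.", "A monopsony is a market structure with a single buyer..."),
]

def generate(prompt: str) -> str:
    # Placeholder for a call to a strong LLM (text-davinci-003 in Alpaca's case).
    return "Q: <new question> A: <new answer>"

synthetic = []
while len(synthetic) < 52_000:
    # Show the model a few human examples and ask for a new, similar pair.
    examples = random.sample(seed_tasks, 2)
    prompt = "Here are some example question-answer pairs:\n"
    for q, a in examples:
        prompt += f"Q: {q}\nA: {a}\n"
    prompt += "Now write one new, similar question and its answer."
    synthetic.append(generate(prompt))
# The generated pairs are then used for supervised fine-tuning of the base model.
```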
1:05:10
Now it really--
there's a big field
1:05:12
of synthetic data
generation of how
1:05:15
to use LLMs to basically make
development of LLMs faster.
1:05:21
And basically by decreasing
the amount of human hours that
1:05:24
you need.
1:05:26
Quantity of data.
1:05:28
So we talked about what type
of data and how we collect it.
1:05:31
One thing which is
surprising with SFT
1:05:33
is that you don't
need that much data.
1:05:36
So what this paper showed--
this is called LIMA--
1:05:38
is that if you scale the amount
of data that you use for
1:05:43
supervised fine-tuning
from 2,000 to 32,000,
1:05:46
it really doesn't help much.
1:05:47
So here scaling laws
definitely don't help.
1:05:49
And so the intuition here
is that all you learn
1:05:55
is you learn how to format
your desired answers.
1:05:58
Another way of saying it is that
your pre-trained models, they
1:06:02
essentially model the
distribution of every user
1:06:04
on internet, one that
might write bullet points,
1:06:07
another one that might
answer question-- answer
1:06:09
question with an answer.
1:06:10
So all you tell your
model is like, wait,
1:06:13
you should actually
be optimizing
1:06:14
more for this type of
user than another one.
1:06:17
So you're not
actually teaching it--
1:06:18
you're not teaching anything
through this SFT, so
1:06:23
supervised fine
tuning, all you do
1:06:25
is you tell the model to
optimize for one type of user
1:06:28
that it saw already in
the pretraining data set.
1:06:30
So the knowledge is already
in the pretrained LLM
1:06:33
and you basically just
specialize to one type of user.
1:06:37
Great.
1:06:38
Any question on SFT?
1:06:40
Yes.
1:06:41
So I know it's a big
issue with synthetic data
1:06:45
where if you keep generating
data from the same distribution,
1:06:49
eventually you're not
learning a new distribution,
1:06:51
you're essentially
playing with it.
1:06:52
Just bootstrapping that.
1:06:53
Yeah.
1:06:55
Surely you can't scale
that forever, right.
1:06:57
You can't keep going
on and generating
1:06:59
from the same distribution.
1:07:00
You'd hope to learn
something new.
1:07:01
Yeah.
1:07:02
So are there-- it's an
active area of research
1:07:05
but any thoughts
that you have around
1:07:06
how people are maybe thinking
around this and better ways
1:07:10
to bootstrap?
1:07:11
Or to give up on this idea and
realize that the chart shows
1:07:15
you don't need that many so
just get humans to generate
1:07:17
2000 really good prompts.
1:07:19
Yeah.
1:07:20
So that's a very good question.
1:07:21
So for the data
stuff, so I'm saying
1:07:23
it's not that important
for SFT, but there
1:07:25
will be another thing we'll talk
about right after where actually
1:07:28
data does matter.
1:07:29
My intuition, based on not
that many empirical results,
1:07:33
is that you can still get gains
even though you use LLMs.
1:07:38
But if you use purely
LLM-generated text,
1:07:40
and you do that for like three
or four generations of LLMs,
1:07:43
I agree with you that probably
you won't improve much.
1:07:45
But for me what is important is
how do you use human in the loop
1:07:48
with LLMs?
1:07:49
Not purely LLMs,
not purely humans,
1:07:53
but maybe what
you can do is just
1:07:54
have the model
regenerate some new text
1:07:56
and just humans
write a few edits.
1:07:59
Edits are much faster than
writing the entire text.
1:08:01
And I think that if you have
that type of collaboration,
1:08:04
then from an information
theoretical point of view,
1:08:07
you still get
additional information,
1:08:09
but you're still much faster
than if you use humans.
1:08:11
And I think that
as a field we'll
1:08:13
probably move towards these
type of things, which is really
1:08:17
just finding the examples that
are important and asking humans.
1:08:20
It's kind of active
learning, just
1:08:22
asking humans exactly when
you need to get their inputs.
1:08:28
Yes.
1:08:28
Do we train with the
same loss function
1:08:30
and the same general
training algorithm
1:08:32
for the supervised
fine tuning bit
1:08:34
as we do for the pretraining?
1:08:36
Because the examples
you showed, I
1:08:39
think the important thing
of the good examples
1:08:43
is like super
factually accurate.
1:08:45
Like there's these
more complex things
1:08:46
and it's still just
like [INAUDIBLE].
1:08:48
Same loss.
1:08:49
So that's why here--
1:08:50
yeah, I didn't-- maybe
didn't emphasize enough.
1:08:52
This is just language modeling.
1:08:53
Fine tune the LLM with language
model and the desired answers.
1:08:56
So this is literally
the same loss.
1:08:59
It will be different
in two seconds,
1:09:01
but the first step
of SFT is literally
1:09:04
the same loss where
you just say, OK, I
1:09:06
want to actually specialize
on that type of data.
1:09:08
So there's even a question
of what is pretraining,
1:09:10
what is post-training?
1:09:11
Because, in reality, it's
just like different data
1:09:13
that you use.
1:09:13
The reason why we usually call
it post-training is that the way
1:09:16
we collect that data
is very different.
1:09:18
Great, great questions.
1:09:20
Yes.
1:09:22
Maybe it's the same
question, but why would
1:09:24
these 2000 examples have
such an overweighted influence
1:09:28
on fine tuning?
1:09:30
So that's why we--
1:09:31
also that's another reason
why we call it post-training
1:09:33
is that we use different
types of hyperparameters.
1:09:35
So, I told you
basically at the end
1:09:37
of pretraining you
essentially end up
1:09:38
with a learning rate of 0.
1:09:40
Here, you're going to
increase your learning rate.
1:09:42
So like 1e minus
5, 1e minus-- yeah.
1:09:44
And so the weight that you give
to this data is actually different.
1:09:52
OK.
1:09:54
Second step or second
part of this post training
1:09:57
is what we call
reinforcement learning
1:10:00
from human feedback or RLHF.
1:10:02
Some of you might
have heard of that.
1:10:05
The idea is that SFT has
a problem, namely that you
1:10:09
do behavioral cloning, which
means that you just try to clone
1:10:12
what the humans would say.
1:10:14
And that has many issues.
1:10:16
One of them is that you're
bound by human abilities.
1:10:19
So humans actually
won't generate the things
1:10:26
that they think are actually
the best things to generate.
1:10:28
So if you ask me
to write a book,
1:10:30
I mean, I can definitely
enjoy your book.
1:10:32
I can probably say one book
is better than another,
1:10:34
but I'm definitely not going to
be as good as writing the book
1:10:37
that I want to read.
1:10:37
So you're going to be
bound by the human ability
1:10:39
to generate things, even though
the humans might be better
1:10:42
at distinguishing
between things.
1:10:43
That's one issue.
1:10:44
Issue number two, which I find
actually pretty interesting,
1:10:47
is this:
1:10:49
if you've ever heard of the
word hallucination-- so this
1:10:51
is LLMs generating fake,
like, false information.
1:10:55
Hallucination might--
at least people
1:10:57
have hypothesized that can come
from the supervised fine tuning
1:11:02
even if you do supervised fine
tuning on data that is correct.
1:11:06
And the reason why
that is:
1:11:09
I told you that basically
SFT is done with very little data,
1:11:13
and it's data from
which the model
1:11:15
doesn't learn anything new.
1:11:17
So what if the human gives an
answer that the model didn't
1:11:21
know was true.
1:11:23
From the model's perspective,
the human basically
1:11:26
is telling the model: generate
this thing that seems plausible,
1:11:30
even though you actually have no
idea if it's true or not.
1:11:34
So just to give you a
very concrete example,
1:11:36
if we go back to this
monopsony example,
1:11:39
can you write blah blah
blah about monopsony?
1:11:41
Imagine that the human wrote a
reference on this type of book.
1:11:46
And that book might exist.
1:11:47
That might be a
correct reference,
1:11:49
but what if the LLM
never saw this reference
1:11:51
during pretraining.
1:11:52
Then it doesn't know that
it's a correct reference.
1:11:54
So really what
you tell the model
1:11:56
is to generate or make up some
plausible sounding reference
1:12:00
rather than actually
tell the real reference
1:12:03
that it saw during pretraining.
1:12:05
So hallucination might
be caused by this SFT.
1:12:12
So that's problem number two.
1:12:14
Does that all make sense?
1:12:15
Great.
1:12:16
Problem number 3, price.
1:12:18
Generating the ideal
answers is very pricey.
1:12:21
And that comes back
to your question
1:12:23
of humans writing the
entire answer is actually
1:12:26
pretty expensive.
1:12:28
So that's why RLHF comes in.
1:12:30
The idea is that instead of
cloning the behaviors of humans,
1:12:34
we're going to maximize
human preference.
1:12:37
And the way we're going to
do that, so the pipeline,
1:12:39
is that for a certain--
for every instruction,
1:12:42
you're going to ask a model
to generate two answers
1:12:45
and you usually use a
pretty good model.
1:12:48
So you usually don't use a base LLM
here; you use an SFT fine-tuned model--
1:12:52
an already fine-tuned LLM--
to get pretty good answers.
1:12:56
And then you ask labelers which
of these two answers was better?
1:13:01
So select the preferred one.
1:13:02
And then with different
types of algorithms,
1:13:05
we're going to talk about
the algorithms, you just fine
1:13:07
tune the model to generate
more of the green thing
1:13:10
than the red thing.
1:13:10
So more of the good stuff.
1:13:12
So now the question
is how and we're
1:13:14
going to talk about
that right now.
1:13:17
So there are two ways that
we're going to talk about
1:13:20
and two that are mainly
use in the community.
1:13:23
The first one is simply the idea
of using reinforcement learning.
1:13:26
So hopefully you all know what
reinforcement learning is now.
1:13:30
So when you think about
using reinforcement learning,
1:13:33
one important question is
like, what is the reward
1:13:35
that we're optimizing.
1:13:36
So in this case, there
are really two options
1:13:38
that I could think about.
1:13:39
The first one, you
could just say,
1:13:41
I'm going to compare the output
generated by some baseline,
1:13:44
the output generated
by my model.
1:13:46
And I'm just going to ask the
human to say which one is better
1:13:49
and I'm going to use
this as a reward.
1:13:51
So if I'm better
than the baseline,
1:13:53
this is a plus 1, if
not, it's a minus 1.
1:13:55
So now it's binary reward.
1:13:57
The problem with binary reward
is that it's very sparse
1:13:59
and you don't get much
information out of it.
1:14:01
Like maybe your answer
was slightly better,
1:14:04
maybe it was like way
better and you don't really
1:14:07
know from this how
much better it was.
1:14:10
So option 2 is
that you can train
1:14:13
what we call a reward model,
which is simply a classifier.
1:14:16
So you use machine
learning to classify
1:14:19
how much better two outputs
are from the preference--
1:14:24
from the perspective
of the human.
1:14:26
So this is a little bit
meta, but what you basically
1:14:29
do is that you
take a reward model, which
is also just a large model--
1:14:37
a large classifier-- and you
basically ask this reward model:
1:14:41
you give it the input
and the actual output,
1:14:43
one
of the two outputs.
1:14:45
You exponentiate its reward--
so that's the softmax loss
1:14:49
that you all know about--
1:14:50
and now you divide by
the sum of the exponentiated rewards
1:14:56
1:14:58
on the
first output and
1:15:00
on the second output.
1:15:01
1:15:02
And the reason why you do that
1:15:05
is that you train this
reward model to be
1:15:07
able to classify how much better
one output is than another one.
one output is to another one.
1:15:13
So another slightly less
convoluted way of saying it
1:15:16
is that your reward
model will output
1:15:19
some reward that will be used
as the logits of your softmax.
1:15:22
So now if you have high
logits in your softmax,
1:15:25
it means that it's highly
likely this output is better.
1:15:32
So that's what we call
Bradley-Terry model.
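As a minimal sketch, with `r_chosen` and `r_rejected` the scalar rewards the reward model assigns to the human-preferred and dispreferred outputs, the Bradley-Terry loss is a two-way softmax over the rewards — equivalently a logistic loss on their difference:

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen, r_rejected):
    # Train exp(r_chosen) / (exp(r_chosen) + exp(r_rejected)) to match the
    # human picking the chosen output; same as -log sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: rewards for a batch of three preference pairs.
r_good = torch.tensor([1.2, 0.3, 2.0], requires_grad=True)
r_bad = torch.tensor([0.1, 0.5, -1.0])
loss = bradley_terry_loss(r_good, r_bad)
loss.backward()
```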
1:15:34
Yes.
1:15:35
Will this reward
model [INAUDIBLE]
1:15:36
lower the entire output, or
is it going to [INAUDIBLE]?
1:15:40
So this takes the entire--
1:15:45
yeah, this takes the
entire output at once.
1:15:46
So it takes all the
input and all the output
1:15:48
and it gives one number.
1:15:50
Yes.
1:15:51
So [INAUDIBLE] reward model,
where would the human be then?
1:15:55
Sorry.
1:15:55
With the reward model,
where would the human be?
1:15:58
Like--
1:15:58
I see.
1:16:00
OK sorry.
1:16:01
Maybe I wasn't clear.
1:16:02
You train this reward model
to fit this green and red
1:16:08
preference from humans.
1:16:09
So basically you
train a classifier
1:16:11
to say whether the humans
prefer red or green.
1:16:15
But instead of using
the binary reward, which
1:16:18
is what the human would
tell you you basically use
1:16:20
the logits of the softmax.
1:16:23
And the thing with the logits
is that logits are continuous.
1:16:26
So now you know that if
your reward model said
1:16:29
it has high logits,
then, in some ways,
1:16:31
the human highly preferred this
answer to some other answer.
1:16:36
Great.
1:16:38
So as I just said, continuous
information is better.
1:16:41
So that's what people use
in practice or at least
1:16:44
used to use in practice.
1:16:45
I'll tell you about the
other algorithm later.
1:16:48
So what do you do at the
end is that you basically
1:16:50
try to just use reinforcement
learning that you know about.
1:16:53
Now we know we have a reward.
1:16:55
What you sample through
is the generation
1:16:58
from your large language model.
1:16:59
And then you just use
some regularization term.
1:17:02
So the reason why we do
this regularization term
1:17:04
is for avoiding what we
call overoptimization.
1:17:06
So this reward
model might not be
1:17:08
really represent--
might not perfectly
1:17:10
model human preferences.
1:17:12
So you don't want to
maximize this thing
1:17:14
to essentially infinity.
1:17:17
And you do it using a PPO,
which is a common reinforcement
1:17:22
learning algorithm.
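Written out, the regularized objective described here — with pi_theta the model being trained, pi_ref the fine-tuned model it starts from, r_phi the reward model, and beta the regularization strength — is:

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\big[\, r_\phi(x, y) \,\big]
\;-\; \beta\,
\mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
```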
1:17:24
One thing to note here, because
it will be important for later,
1:17:27
is that when we use
maximum likelihood--
1:17:32
sorry, now the large
language models
1:17:34
are actually a policy for
your reinforcement learning.
1:17:38
It's not maximizing
maximum likelihood anymore.
1:17:41
Which means that you're not
modeling any distribution
1:17:43
anymore.
1:17:43
And the reason why
this is important
1:17:45
is that models that went
through this type of PPO
1:17:48
actually don't give
you likelihoods
1:17:51
of text that are meaningful.
1:17:52
Because what you
optimize them to do
1:17:54
is basically just
optimize for generating
1:17:56
the most likely thing,
not optimize for modeling,
1:18:00
all the answers that
humans might say.
1:18:02
Another way of saying
that is that there's
1:18:04
nothing that incentivizes
here the model to not give
1:18:09
a single possible generation.
1:18:11
Nothing here says it's good
if you have some distribution
1:18:15
with some entropy.
1:18:18
If you haven't followed, it's
not that important but just good
1:18:20
to know.
1:18:22
Great.
1:18:23
So PPO is exactly what
ChatGPT did originally.
1:18:27
So here is on their
blog post on what
1:18:30
they have is step one do
supervised fine tuning, which
1:18:33
now you all know about.
1:18:34
Step two, train a reward
model on human preferences.
1:18:38
Step three, do PPO
multiple steps,
1:18:40
which is where you
see this blue arrow.
1:18:43
So you continue-- you train
the model once with the PPO,
1:18:45
you collect new
data, you continue.
1:18:47
And that's why-- and that's
exactly what ChatGPT did.
1:18:50
And that was the
big breakthrough
1:18:52
between GPT 3 and ChatGPT.
1:18:55
One thing to note is that
PPO has many challenges.
1:18:58
Reinforcement learning
is something that
1:19:00
is super nice theoretically.
1:19:02
In practice, anyone
who ever worked
1:19:03
with reinforcement learning
knows it's such a mess.
1:19:06
There's a lot of things
like rollouts, outer loops,
1:19:09
clipping-- so many complications.
1:19:11
So it's messy.
1:19:13
This is the idealized PPO
used for LLM settings,
1:19:15
so that's already
much more complicated
1:19:17
than this expectation
we saw before.
1:19:19
And in practice it's actually
much more complicated.
1:19:21
So we have one implementation
of it that we had to do,
1:19:23
and I'm not going
to go through it.
1:19:25
But basically there's so
much stuff that you
1:19:27
have to think about
when you implement
1:19:29
that type of PPO algorithm.
1:19:31
So you have clipping everywhere,
you have a lot of complexities
1:19:34
and things are not
well documented.
1:19:37
All this to say: there
was a new method that
1:19:41
was proposed also from
Stanford one year ago
1:19:44
called DPO, which is essentially
a simplification of PPO.
1:19:49
And the way-- what they did
or the idea that they have
1:19:53
is that instead of using
reinforcement learning,
1:19:56
you can just maximize the
probability of generating
1:19:58
the stuff that you
like and minimizing
1:20:00
the probability of the
stuff that you don't like.
1:20:02
So if you think about the human
preference, the red and green,
1:20:05
maximize green, minimize red.
1:20:08
So the loss is actually
this one, where what you see
1:20:12
is simply the log-
likelihood of the model.
1:20:16
So this is the likelihood of
a model generating the things
1:20:19
that the human preferred,
given the inputs.
1:20:23
And what you try
to do is basically
1:20:25
maximize the likelihood of
generating the things that you
1:20:30
like, minimize the likelihood of
the things that you don't like.
1:20:33
All the rest of the terms
here it's not too important.
1:20:36
It's actually really not that
complicated to understand.
1:20:39
But at a high level, it's really
just maximizing the things
1:20:42
you like, minimizing the rest.
1:20:45
And one thing to note, which
I was going to say just here,
1:20:49
is that actually all
the rest is chosen such
1:20:51
that the global minima of
PPO and the global minima
1:20:56
of like this DPO,
under some assumptions,
1:20:59
are essentially equivalent.
1:21:01
So this is the right thing
to do mathematically.
1:21:04
I'm not going to go
through the derivations,
1:21:06
but that's the
right thing to do.
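For reference, a minimal sketch of that loss, assuming you have already computed the summed log-probabilities of the preferred and dispreferred answers under the model being trained and under a frozen reference model (typically the SFT model), with beta the same kind of regularization strength as before:

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Push up the preferred answer and push down the dispreferred one,
    # measured relative to the reference model.
    chosen_ratio = logp_chosen - ref_logp_chosen        # log pi/pi_ref on preferred y
    rejected_ratio = logp_rejected - ref_logp_rejected  # log pi/pi_ref on dispreferred y
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```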
1:21:08
It's pretty different with
PPO in the sense that now--
1:21:10
with PPO, what you had to do is
collect the human preferences,
1:21:13
then train a reward model
with maximum likelihood,
1:21:16
then use reinforcement learning.
1:21:17
Now all you do is basically
maximum likelihood.
1:21:19
Much simpler.
1:21:20
Yes.
1:21:21
I mean, yeah.
1:21:21
So it seems like this is A,
much simpler and B, like,
1:21:24
what you would just intuitively
do with [INAUDIBLE]?
1:21:27
Why did they start
with this reward model.
1:21:29
Like what led them doing that?
1:21:31
I think it's a great question.
1:21:33
I don't really know.
1:21:34
What I can tell you is that.
1:21:35
At ChatGPT the people
who did basically
1:21:41
this PP-- sorry, who
did ChatGPT initially
1:21:44
are the ones who
actually wrote PPO.
1:21:47
And I think they
were just-- like,
1:21:48
there are a lot of
reinforcement learning people.
1:21:50
And I think that for them
it was very intuitive.
1:21:54
So there's also some
additional potential benefits.
1:21:58
For example, I don't want to--
1:22:00
yeah, for example, if
you use the reward model,
1:22:03
the cool thing here with
reinforcement learning
1:22:04
is that you can use unlabeled
data with the reward model.
1:22:08
So for DPO, you can only use the
labeled data--
1:22:12
while for PPO, you first
train your reward model
1:22:15
and then you can
use unlabeled data
1:22:18
where the reward
model will basically
1:22:19
label this unlabeled data.
1:22:21
So this additional,
kind of, potential--
1:22:25
there could be
potential improvements.
1:22:26
In practice it happens
that there are none.
1:22:29
And I think just that a
lot of people in this team
1:22:32
were reinforcement
learning experts, including
1:22:35
the main author of
PPO, John Schulman.
1:22:39
So much simpler than PPO, and
it basically performs as well.
1:22:43
So now this is the standard
thing that people use.
1:22:46
At least in the open
source community,
1:22:47
I believe it's actually the
standard also in industry.
1:22:51
So that's called DPO.
1:22:53
Gains: so those are all
the papers on the left.
1:22:57
Here this is on the
summarization task.
1:22:59
You see, all I
want to show you is
1:23:01
that basically the
pretrained models were OK
1:23:04
and they improve of scale.
1:23:05
If you do supervised
fine tuning,
1:23:07
you improve them
a little bit more,
1:23:08
if you do PPO or something
with RLHF human feedback,
1:23:12
you get performance
that are, oftentimes
1:23:15
depending on a benchmark,
even better than humans.
1:23:18
So this is the human
reference summaries.
1:23:21
Same thing.
1:23:22
This is from a paper that
we have, AlpacaFarm, where--
1:23:25
the evaluation
here is not too important--
1:23:27
you basically see the
pretrained model.
1:23:29
You jump to SFT, and then you
jump to PPO and DPO. And PPO and
1:23:33
DPO have the exact
same performance.
1:23:36
So basically RLHF helps.
1:23:38
That's, kind of, the
conclusion and DPO is simple.
1:23:42
Data.
1:23:43
The way that you collect
that type of data.
1:23:46
First idea is just use humans
as we already talked about.
1:23:51
Guidelines are very
complicated for what
1:23:53
humans should be labeling,
and it's really not that easy.
1:23:55
And actually, if you ever
do some of the labeling,
1:23:58
you will see that it's
extremely complicated.
1:24:01
Like, if I zoom in on this.
1:24:03
Here, I have a question tell
me about self-driving cars.
1:24:07
And you read both
self-driving cars
1:24:09
are vehicles that are
capable of detecting
1:24:10
the surroundings,
blah, blah blah, blah.
1:24:12
Self driving cars are
cars that are equipped
1:24:13
with sensors, blah
blah, blah to navigate
1:24:15
without the need for a driver.
1:24:16
I mean, both seem OK.
1:24:18
Which one is better?
1:24:19
It's actually hard
to say at a glance.
1:24:21
And as a result, the
problem with humans
1:24:24
is that you will
start optimizing
1:24:27
for a lot of surface-level features.
1:24:28
For example, the
second one is longer.
1:24:30
I can guarantee you that
most humans will choose
1:24:32
the second one,
even though I mean,
1:24:34
maybe the first one is better.
1:24:35
I don't know.
1:24:36
I haven't read it carefully.
1:24:38
So challenges of humans.
1:24:39
First, slow and expensive.
1:24:42
Second, as I just mentioned,
it's hard to focus on things
1:24:46
that matter, like correctness.
1:24:47
And people usually
look at things
1:24:49
that don't matter as much
like the form, like length.
1:24:53
And as a result,
so what I show here
1:24:55
is that when you do RLHF,
the more you do RLHF,
1:24:58
the longer the output
of the models become.
1:25:01
So if you've ever been
annoyed at ChatGPT
1:25:03
answering you super
long sentences,
1:25:05
this is because of RLHF.
1:25:08
Annotator distribution shift.
1:25:11
Like the distribution
of annotators
1:25:12
that you use matters a
lot, and you have to think,
1:25:15
like, who even are the
humans that we want
1:25:17
to represent in these models?
1:25:20
Another question is
crowdsourcing ethics.
1:25:22
Like usually these--
basically a lot
1:25:25
of the labeling that is
done, the people who do them
1:25:29
are not paid well
and they have to go
1:25:31
through a lot of toxic
data because you basically
1:25:33
want the model to avoid
saying the toxic data.
1:25:36
So crowdsourcing ethics too.
1:25:40
So many challenges
with human data.
1:25:43
So what we did, also
last year, is again,
1:25:46
the same thing as Alpaca, just
the idea of like oh well, there
1:25:48
are challenges
with humans, maybe
1:25:50
we can just replace
them with LLMs.
1:25:51
So what we did is
simply replace--
1:25:55
I see that.
1:25:56
I'm just realizing that the
slides are not centered.
1:25:58
Anyway, you replace human
preferences with LLM preferences.
1:26:02
So here, on this figure, you
see on the x-axis, the price
1:26:06
that we paid for
collecting human data.
1:26:09
It's around $300
for 1,000 examples.
1:26:12
And this is with Mechanical
Turkers, which are usually
1:26:15
like cheaper than maybe
some of the other companies
1:26:19
that you could go through.
1:26:20
And on the y-axis,
it's basically
1:26:22
the agreement with other humans,
with the mode of other humans.
1:26:27
And what you see is that
actually, as I told you before,
1:26:29
labeling is really complicated.
1:26:30
Humans agree with
themselves only around 66%
1:26:34
of the time on a binary task.
1:26:36
And it's not that the
humans are not good
1:26:38
here because we were five
main authors on this paper.
1:26:41
We tried to label
this data ourselves,
1:26:43
and we only had, like, 67 or
68% accuracy, even though we
1:26:47
talked-- like we talked
for like three hours about how
1:26:50
we should be doing labeling.
1:26:51
But really, it's complicated.
1:26:52
It's not an easy task.
1:26:54
And here I just showed
many different models.
1:26:56
And, basically, you see that
models are much cheaper,
1:26:59
and they can actually
get higher agreement
1:27:01
with the mode of humans
than humans themselves.
1:27:04
And the reason why is because
humans have a lot of variance,
1:27:06
models have no variance.
1:27:08
So models might be a
little bit more biased,
1:27:09
but they have less variance.
1:27:11
So it works surprisingly well.
1:27:13
And now it's, kind
of, the standard
1:27:14
in open source community.
1:27:16
I think even in
industry a lot of people
1:27:18
use both humans and
LLMs for improving
1:27:21
the collection of RLHF data.
1:27:24
And this is like-- this is
the paper from last year,
1:27:27
but honestly, now it's more like
the LLMs would be around this
1:27:30
agreement, and
the cost is around,
1:27:32
I would say, 50x less than humans,
with better agreement with humans
1:27:36
than humans themselves.
1:27:39
OK.
1:27:39
So that gets us to
evaluation of post training.
1:27:45
That goes back to
your initial question
1:27:46
at the beginning of the lecture.
1:27:48
How do you evaluate
something like ChatGPT?
1:27:50
The answers that GPT could
give are basically unbounded.
1:27:54
And it's not that
there's one right answer,
1:27:56
there are many answers
that are just as good.
1:27:59
So there are many challenges.
1:28:00
One, you can't use
validation loss
1:28:03
because one method
might use PPO,
1:28:06
the other one might use DPO.
1:28:07
Validation loss
is not comparable.
1:28:08
Second, you can't use--
1:28:10
sorry, perplexity.
1:28:11
That's the thing
I told you before.
1:28:13
These models are not calibrated.
1:28:16
They don't give distributions.
1:28:17
They just optimize
for one thing.
1:28:19
So you can't use perplexity for
actually evaluating these type
1:28:22
of models once they aligned--
1:28:24
sorry, once they're aligned.
1:28:26
Third, there's a large
diversity of questions
1:28:29
that humans might
ask to these models.
1:28:31
Generation, open QA, some question
answering, some summarization,
1:28:35
and all of these things.
1:28:36
So there's so many
things you have to cover.
1:28:38
Then the tasks are
really open ended,
1:28:41
so it's very hard to automate.
1:28:42
So that's what you were
alluding to before.
1:28:45
So the idea is that
instead of trying
1:28:48
to come up with really
easily automated benchmarks,
1:28:51
it's just we're going to ask
questions that users actually
1:28:55
ask to these models in practice.
1:28:56
And we're just going
to ask annotators
1:28:58
to say between these two
models, which one is better.
1:29:01
What's the better output.
1:29:03
So basically the
exact same thing
1:29:04
as basically the data
from RLHF but you
1:29:08
use it now for evaluation.
1:29:10
Yes I'm not sure
I understand what
1:29:11
you mean by can't use
perplexity not calibrated.
1:29:14
Like RLHF still doing like
next token prediction.
1:29:19
So--
1:29:19
Why can't perplexity
be used then?
1:29:21
So think about the
optimal solution
1:29:24
after doing PPO is
basically one model that
1:29:27
gives you essentially a delta.
1:29:30
Like basically it says that
there's only one sentence
1:29:33
that is--
1:29:34
that could be generated
for that question.
1:29:36
So now if you use
it on something
1:29:38
that is slightly semantically
different,
1:29:40
it would actually give a
likelihood of 0 for that answer.
1:29:44
So in reality, it's not that
extreme because as you say,
1:29:46
it's still a
distribution, but it just
1:29:48
shows you that there's
a fundamental issue
1:29:50
with perplexity.
1:29:51
Once these models
are not language models anymore--
1:29:55
they were not trained--
at least with PPO,
1:29:56
they're not trained to do
maximum likelihood anymore;
1:29:59
they were trained
to be policies.
1:30:04
So probably the most
common or the most--
1:30:08
yeah, the most common benchmark
or the most trusted one
1:30:10
is what we call ChatBotArena,
which is basically
1:30:14
go on internet, have random
users on the internet,
1:30:17
blindly talk with two chatbots,
just ask many questions,
1:30:21
see the two answers and
rate, which one is better.
1:30:23
And you do that over hundreds
of thousands of users and then
1:30:26
you get the actual preferences
and you get rankings of models.
1:30:30
So you can go right
now on ChatBotArena
1:30:33
and actually interact
with these models.
1:30:35
One potential issue
just to highlight
1:30:38
is that the people who want
to do these type of things
1:30:40
are usually more like
tech-driven or like tech savvy.
1:30:44
So a lot of the questions
that you will ask
1:30:46
are more like tech
stuff discussing
1:30:47
software errors,
inquiries about AI tools
1:30:50
and all of these things.
1:30:52
So another issue
is cost and speed.
1:30:54
If you really want
to use something
1:30:55
like this for
development process,
1:30:58
it will be too costly because
you will need to basically pay
1:31:01
a lot of humans to do that.
1:31:03
So one simple idea is,
again, as we said many times,
1:31:07
just use LLM instead of humans.
1:31:10
You probably know the
drill at this point.
1:31:13
Steps for every instruction
generate outputs
1:31:15
by some baseline and the model
that you want to evaluate.
1:31:19
So here you imagine that
I'm comparing an answer
1:31:22
from ChatGPT and from Mistral.
1:31:24
I'm just asking a model, another
model, which one is better.
1:31:29
And I just basically
average that out.
1:31:32
Yeah.
1:31:32
I ask GPT-4,
which one is better.
1:31:34
I averaged that out over
my entire distribution,
1:31:37
over my entire
benchmark or data set,
1:31:39
and that gives me a win rate.
1:31:41
So a win probability for one
model compared to another one.
1:31:44
And now you can rank models.
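A minimal sketch of that evaluation loop — `judge` here is just a placeholder for the actual LLM call:

```python
def judge(instruction: str, answer_a: str, answer_b: str) -> str:
    # Placeholder for a call to a strong LLM judge (e.g. GPT-4) with a fixed
    # judging prompt; returns "a" or "b". A careful evaluator also randomizes
    # the a/b order to reduce position bias.
    return "a"

def win_rate(instructions, outputs_model, outputs_baseline) -> float:
    # Fraction of instructions where the judge prefers the model over the baseline.
    wins = sum(
        judge(x, y_model, y_base) == "a"
        for x, y_model, y_base in zip(instructions, outputs_model, outputs_baseline)
    )
    return wins / len(instructions)
```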
1:31:46
And this is the
AlpacaEval leaderboard.
1:31:50
So the benefits of this
is that actually we
1:31:53
show-- we get 98% correlation
with ChatBotArena.
1:31:56
So very high
correlation with humans.
1:31:59
So this is yeah,
comparison with correlation
1:32:01
with other benchmarks.
1:32:02
And it takes less than three
minutes and less than $10
1:32:05
to run.
1:32:05
So it's pretty cheap.
1:32:06
And there are downsides though.
1:32:08
One of them is spurious correlations.
1:32:11
So as we already saw
before, LLMs prefer--
1:32:14
this is one spurious
correlation among many;
1:32:16
I'll just talk about one--
1:32:17
LLMs prefer longer outputs.
1:32:19
Actually humans also
prefer longer outputs.
1:32:21
But the problem or the
issue once you use LLMs
1:32:23
is that once there is bias, you
will continue optimizing that.
1:32:26
Humans at some point,
I can guarantee you
1:32:28
if I ask a simple
question, and you give me
1:32:29
five pages of
answers, I'll be like,
1:32:31
no, I don't like that answer.
1:32:32
But LLMs if they have this bias
and they were trained for that,
1:32:35
they will continue
preferring longer outputs.
1:32:37
So here we see the
preference just showing
1:32:42
that humans and models
prefer longer outputs.
1:32:46
And here is another view of
the initial AlpacaEval data set
1:32:50
benchmark, where--
1:32:53
when we look
at the win rate of GPT4
1:32:56
versus actually GPT4 itself,
if we use the standard GPT4,
1:33:01
it gets 50%, kind of, by
definition because we're
1:33:03
comparing GPT4 versus GPT4.
1:33:06
But if we ask a GPT4 to
be slightly more verbose,
1:33:09
so we just say in the prompt,
be verbose in your answers,
1:33:12
then it gets a
win rate of 64.4%.
1:33:15
So really there's
a huge variance.
1:33:16
And if we ask it
to be concise, it
1:33:17
gets 20% so there's
a huge variance
1:33:20
depending on whether you ask
it to be concise or verbose.
1:33:24
That's very annoying.
1:33:25
So one possible solution,
which is what we did,
1:33:29
is just use some
regression analysis.
1:33:31
I'm not going to
go into details,
1:33:32
but basically use
causal inference
1:33:34
tools to control for length.
1:33:36
And right now actually
length matters much less.
1:33:38
So if you ask it to be verbose,
you still get some gains,
1:33:41
but much less.
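A rough sketch of that kind of length control — this is a generic logistic-regression version with made-up data, not necessarily the exact model used:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Per-comparison data: how much longer our model's answer was than the
# baseline's, and whether the judge picked ours. (Made-up numbers.)
length_diff = np.array([[120.0], [-30.0], [200.0], [10.0], [-80.0], [150.0]])
picked_ours = np.array([1, 0, 1, 1, 0, 1])

# Model the preference as a function of the length difference.
glm = LogisticRegression().fit(length_diff, picked_ours)

# Length-controlled win rate: the predicted preference when both
# answers have the same length (length difference = 0).
print(glm.predict_proba([[0.0]])[0, 1])
```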
1:33:44
Great.
1:33:44
So that's all about
post training.
1:33:46
And now for the
next eight minutes,
1:33:48
I might talk about systems
or just answer questions.
1:33:51
Yes.
1:33:52
Can you go back to your
post training, internal post
1:33:56
training.
1:33:57
How did we tune those
parameters using
1:33:59
the small body of
fine-tuning data
1:34:03
and have such a big
effect on the model?
1:34:05
You mentioned earlier that
there's a different set
1:34:07
of hyperparameters.
1:34:08
Are we changing just some of
the weights, the later weights
1:34:11
or other weights.
1:34:12
What's actually happening?
1:34:13
Yeah.
1:34:14
Yeah, I, kind of, skimmed
through all of this.
1:34:16
You change all the weights.
1:34:17
Actually, industry will
change all the weights.
1:34:20
In open source
land, you might have
1:34:22
heard of LoRA, which is
going to change basically only
1:34:26
some of the weights or it
actually, to be more specific,
1:34:29
it's going to add
some differences
1:34:31
to the output of every layer.
1:34:33
But in industry, you're going to
just fine tune all the weights.
1:34:37
And also to say something
else about the data, actually,
1:34:40
this last step, RLHF
you usually going
1:34:42
to collect a lot more
data than with SFT.
1:34:45
So if SFT is like 5,000,
10,000, maybe 50,000, with
1:34:50
RLHF I think you're going to be
more around like the one million
1:34:54
order of magnitude.
1:34:55
It's still much less
than pretraining though.
1:34:57
Yeah.
1:34:57
Because pretraining
is 15 trillion tokens.
1:35:00
I mean, this is like--
that's not even a drop in the bucket,
1:35:02
and yet you influence
the weights a lot.
1:35:05
So it's because of how you do it--
I mean, you have to think about
how you do it:
how you do it is you use--
1:35:10
I mean, as I said, the learning
rate that you're going to use
1:35:12
is going to be different,
but also you only do that.
1:35:16
So just imagine if I trained--
1:35:18
even if I trained
on one sentence,
1:35:19
but over and over
again at some point
1:35:22
my model will only
generate that sentence
1:35:24
even if it was just
one sentence instead of
1:35:27
the 15 trillion tokens.
1:35:29
So if you use a
large enough learning
1:35:30
rate and for enough
time, you will basically
1:35:33
overfit that sentence.
1:35:35
So the key thing to remember
is that the data is not--
1:35:39
it's not as if you mix
some post-training data
1:35:42
and some pretraining data.
1:35:43
You do pretraining, and then
you just start fine-tuning only
1:35:47
on the post-training.
1:35:48
So another way, maybe
another perspective
1:35:50
is that the pretraining
is just the initialization
1:35:53
of your model.
1:35:54
And once you view it that
way, that this is just
1:35:56
initialization of weights,
then there's nothing special.
1:35:59
Like you don't need to remember
that you train on a lot of data
1:36:02
before.
1:36:02
The only thing that matters is
that you had an initialization
1:36:04
and now I actually
train the model.
1:36:06
So maybe you think
about it that way.
1:36:07
Like this is a Markov
property in some ways.
1:36:10
It's just like you
had your weights.
1:36:11
This is my initialization.
1:36:12
Now I'm training that one.
1:36:14
Does that answer your question?
1:36:16
Kind of but you said
something just now about it's
1:36:20
almost the equivalent of just
rerunning the fine tuning
1:36:23
data many times.
1:36:25
Is it actually-- is that what
actually happens in order
1:36:28
to give so much more preference?
1:36:33
You might-- I actually don't
know right now how they do it
1:36:37
in industry.
1:36:37
When we did Alpaca,
we had to do three epochs.
1:36:40
So you did run it
three times through it.
1:36:44
But I mean, even
the number of times
1:36:46
that you run it through,
it's actually not important.
1:36:48
The only thing-- the only thing
that matters is the
1:36:52
effective learning rate.
1:36:54
So yeah.
1:36:56
Great.
1:36:58
So I think I have five minutes.
1:37:06
OK, I might try to give a
high-level overview of at least
1:37:12
one of the systems tricks.
1:37:14
Systems: as we said, for
everyone the bottleneck is--
1:37:19
sorry, compute is
the huge bottleneck.
1:37:21
One question you might ask
is, why not buy more GPUs?
1:37:24
GPUs are expensive,
but they're also scarce.
1:37:26
Even if you have $10
million right now,
1:37:28
you cannot buy the best GPUs.
1:37:31
[INAUDIBLE]
1:37:33
There's also some
physical limitations.
1:37:35
When you have multiple
GPUs, you have
1:37:37
to communicate between them.
1:37:39
That takes time.
1:37:40
So just buying more
GPUs is not that easy.
1:37:43
So it's really
important to think about
1:37:45
how do you allocate resources
and how do you optimize
1:37:47
your pipeline-- so, systems.
1:37:49
101 on GPUs, I'm sorry,
I'm going slightly faster.
1:37:53
I hope that some of you
at least can follow.
1:37:55
GPUs are basically
optimized for throughput.
1:37:58
CPUs are optimized for latency.
1:38:01
So GPUs, the way you
have to think about it
1:38:03
is that there's one--
1:38:04
there's one command that
is run on many, many cores
1:38:07
at the same time on
different type of data.
1:38:11
So this is how you see a GPU.
1:38:13
You see there are
many different cores.
1:38:14
We call them streaming
multiprocessors,
1:38:17
which is very different than
the usual CPU architecture.
1:38:20
So just think high throughput
parallelization for GPUs.
1:38:24
GPUs are optimized for
fast matrix multiplication.
1:38:27
So every time you will do--
you will do something on GPU.
1:38:30
If you can do it with a
matrix multiplication,
1:38:33
it's going to be 10 times
faster than with anything else.
1:38:36
That is a little bit
annoying because it
1:38:38
means that we are,
kind of, bottlenecked
1:38:40
to doing anything with
matrix multiplications.
1:38:44
Another thing to
note with GPUs is
1:38:46
that compute has
been improving faster
1:38:48
than memory and communication.
1:38:50
So right now GPUs usually
are hard to keep fed--
1:38:55
like, the data that
you send to GPUs
1:38:58
actually has a hard time keeping
up with the processors.
1:39:00
So most of your
GPUs are actually
1:39:02
going to be idle if you
just run normal code,
1:39:04
if you don't optimize your code.
1:39:06
So communication-- and this
will continue over time.
1:39:10
Another thing to know
about GPUs is that there's
1:39:12
a memory hierarchy.
1:39:13
This is the same thing
actually with CPUs,
1:39:15
but basically the closer
you are to your cores,
1:39:17
the less memory there is,
but the faster things run.
1:39:20
If you are further,
more memory slower.
1:39:24
Oh yeah I'm going to skip that.
1:39:26
OK actually, I'm
going to say it.
1:39:27
I told you about this--
1:39:29
the fact of communication.
1:39:31
The metric that
people usually look at
1:39:32
is model FLOP utilization.
1:39:34
So take the theoretical
maximum that the GPU could run at--
1:39:37
the number of flops that you
could use per second--
1:39:39
and then-- sorry, take the
observed throughput
1:39:42
divided by this
theoretical maximum.
1:39:45
And in general, if you
reach 50% you're very happy.
1:39:49
Like Facebook-- I looked:
Llama was at 45%
1:39:51
or something like this.
1:39:52
So that means that data
doesn't come fast enough
1:39:55
even for these big companies.
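As a minimal sketch of the metric, reusing the ~6-FLOPs-per-parameter-per-token approximation from the cost section; the throughput and peak numbers below are made up:

```python
def mfu(tokens_per_second, n_params, n_gpus, peak_flops_per_gpu):
    # Model FLOPs utilization: observed training FLOP/s over the theoretical peak.
    observed = tokens_per_second * 6 * n_params
    return observed / (n_gpus * peak_flops_per_gpu)

# Made-up example: a 405B model at 3M tokens/s on 16,000 GPUs,
# assuming ~1e15 FLOP/s peak per GPU.
print(f"{mfu(3e6, 405e9, 16_000, 1e15):.0%}")  # ~46%
```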
1:39:58
So one simple trick,
and that might
1:40:00
be the only one I'm
going to tell you about,
1:40:02
is low precision.
1:40:04
One simple idea is
that well, if I'm
1:40:06
going to put my floats
in low precision,
1:40:09
then there's going
to be fewer bits
1:40:10
that I have to send to my GPUs.
1:40:12
If there's fewer bits,
it's faster communication,
1:40:14
lower memory consumption.
1:40:16
Things are going to go faster.
1:40:17
And for deep learning
it just happens
1:40:19
that decimal precision is
not that important.
1:40:22
So when you do matrix
multiplication, when
1:40:25
you do like for example, SGD,
there's already so much noise
1:40:28
that if you update something
by 0.01 or 0.015, who cares.
1:40:33
So basically instead of using
32 bits per float, which
1:40:37
is what people used to use,
or 64 for example, which
1:40:41
is what you would
use in other domains,
1:40:43
you use 16 bits for
matrix multiplication.
1:40:46
So for every float
you use 16 bits.
1:40:49
And for training
you have this type
1:40:51
of what we call automatic
mixed precision.
1:40:54
Which is that some of the
things are in 32 bits,
1:40:57
others are in--
1:40:58
in 16 bits.
1:41:00
Generally, the way you
should be thinking about
1:41:02
it is that the weights
of your model
1:41:05
are stored in 32 bits.
1:41:06
But just before the computation
you put everything in 16 bits.
1:41:10
Like this you do
computation super fast.
1:41:12
And at the end you update
your weights in 32 bits.
1:41:16
And the reason why you do all
the updates in 32 bits is just
1:41:19
think that if your
learning rate, for example,
1:41:21
is very small, you still
want to be able to make
1:41:23
a difference in your weights.
1:41:25
So all the computation
is done in 16 bits,
1:41:28
but the weights are
actually stored in 32 bits.
1:41:30
So that's like the standard
way that people are doing it.
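In PyTorch, a minimal sketch of that standard recipe looks like this — autocast for the 16-bit compute plus a gradient scaler; the model and data are placeholders:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()  # placeholder model; weights stay fp32
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()        # rescales grads to avoid fp16 underflow

for _ in range(10):
    x = torch.randn(32, 1024, device="cuda")
    optimizer.zero_grad()
    # Inside autocast, the matrix multiplications run in fp16.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).square().mean()
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # the weight update itself happens in fp32
    scaler.update()
```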
1:41:35
OK, I'll actually
talk just about this,
1:41:36
and then I'll skip all the rest,
operator fusion, because I think
1:41:39
this is actually pretty cool.
1:41:40
As I just said,
communication is very slow
1:41:42
and actually every time
you use a PyTorch line,
1:41:45
it basically moves variables
to the global memory of your GPU.
1:41:49
So say you have something like
x1 = x.cos(),
1:41:54
and then you do x1.cos().
1:41:56
What is happening
behind the scenes
1:41:58
is that you take the
x, which is data.
1:42:00
You ship it to your actual
processors of your GPUs.
1:42:03
You apply the cosine.
1:42:05
You ship it back to the
main memory of your GPU
1:42:07
and then you see the next line.
1:42:09
You ship it back to the
computer-- to the GPU processor,
1:42:12
you apply another cosine
and you ship it back again.
1:42:15
So another way to
see that is that you
1:42:17
go from your DRAM, which is
your global memory and your GPU
1:42:20
and you ship it to compute.
1:42:22
You ship it back for every line.
1:42:24
This is a naive way of doing it.
1:42:25
This seems very wasteful.
1:42:28
So the idea, simple
idea of operator fusion
1:42:31
is just communicate, do all the
computation, ship it back once.
1:42:35
And this is exactly
what fused kernels are.
1:42:39
So if you ever want to make
your compute-- your computations
1:42:44
in PyTorch much faster,
just apply torch.compile
on your model.
compile on your model.
1:42:48
This is going to make your
model around 2 times faster.
1:42:51
And what it does is simply
that it rewrites your code--
1:42:56
your PyTorch code basically
in C++ and CUDA to do
1:43:03
the communication only once
then do all the operations,
1:43:05
then ship it back.
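A minimal sketch of both versions:

```python
import torch

def f(x):
    # Eager mode: each .cos() is its own kernel launch, with a round trip
    # through GPU global memory in between.
    return x.cos().cos()

# torch.compile traces f and can fuse the two cosines into one kernel,
# so the data makes a single trip from memory to compute and back.
f_fused = torch.compile(f)

x = torch.randn(1_000_000, device="cuda")
y = f_fused(x)
```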
1:43:07
OK I'm not going to have
time to talk about tiling.
1:43:10
Tiling is important.
1:43:11
Parallelization.
1:43:12
Parallelization is important.
1:43:15
And mixture of experts.
1:43:17
Mixture of experts is important.
1:43:18
Outlook.
1:43:19
There are many things
we haven't talked about.
1:43:23
We haven't talked about
architectures we definitely
1:43:25
haven't talked about inference.
1:43:27
There are many other things
that are important with LLMs.
1:43:29
What is the UI that you use?
1:43:31
I mean, arguably ChatGPT,
the big novelty was just
1:43:34
having a simple UI to use it.
1:43:35
Multi-modality.
1:43:36
What are all the
misuses you could have.
1:43:38
The fact that there might not
be enough data on the internet
1:43:41
to train all these models.
1:43:42
Legality of data collection,
so many other things.
1:43:45
If you are interested
in all these topics,
1:43:47
I would suggest three classes.
1:43:49
CS224N is probably the one
that touches the least on LLMs,
1:43:54
but it gives some background
and historical context
1:43:57
of all the LLMs and gives
some adjacent material.
1:44:01
CS324 I think it's called--
1:44:04
I think it's just called
Large Language Models, more
1:44:07
in depth reading and lectures
on everything I talked about.
1:44:10
CS336, which is Language
Models from Scratch, where
1:44:13
you actually build your own LLM.
1:44:16
It's an amazing class also
given by my two supervisors.
1:44:20
Very heavy workload,
so be careful.
1:44:23
Great.
— end of transcript —