[00:05] So, let's get started. I'll be talking about building LLMs today. I think a lot of you have heard of LLMs before, but just as a quick recap: LLMs, standing for large language models, are basically all the chatbots that you've been hearing about recently. So, ChatGPT from OpenAI, Claude from Anthropic, Gemini, Llama, and other models like this. And today we'll be talking about how they actually work. It's going to be an overview, because it's only one lecture and it's hard to compress everything, but hopefully I'll touch a little bit on all the components that are needed to train some of these LLMs. Also, if you have a question, please interrupt me and ask. Most likely other people in the room or on Zoom have the same questions. So, please ask.

[00:56] Great. So what matters when training LLMs? There are a few key components. One is the architecture. As you probably all know, LLMs are neural networks, and when you think about neural networks, you have to think about what architecture you're using. Another component, which is really important, is the training loss and the training algorithm: how you actually train these models. Then it's data: what do you train these models on? Then evaluation: how do you know whether you're actually making progress towards the goal of LLMs? And then the systems component: how do you actually make these models run on modern hardware? That's really important because these models are really large, so now more than ever, systems are a really important topic for LLMs.

[01:47] So those are the five components. You probably all know that LLMs are all based on transformers, or at least some version of transformers, and if you didn't, now you do. I'm actually not going to talk about the architecture today. One, because I gave a lecture on transformers a few weeks ago, and two, because you can find so much information online about transformers. There's much less information about the other four topics, so I really want to talk about those.

[02:17] Another thing to say is that most of academia actually focuses on architectures, training algorithms, and losses. As academics, and I've done that for a big part of my career, we simply like thinking that making new architectures and new models is what's important. But in reality, honestly, what matters in practice is mostly the three other topics: data, evaluation, and systems, which is what most of industry actually focuses on. So that's also one of the reasons why I don't want to talk too much about the architecture: because really, the rest is super important.

[02:55] Great. So, overview of the lecture. I'll be talking about pretraining. Pretraining, you've probably heard that word; this is kind of the classical language modeling paradigm, where you basically train your language model to essentially model all of the internet.
[03:10] And then there's post-training, which is a more recent paradigm: taking these large language models and making them essentially AI assistants. This is more of a recent trend, since ChatGPT. So, if you've ever heard of GPT-3 or GPT-2, that's really pretraining land. If you've heard of ChatGPT, which you probably have, that's really post-training land. I'll be talking about both, but I'll start with pretraining, and specifically what the task of pretraining LLMs is and what loss people actually use.

[03:43] So, language modeling, a quick recap. Language models, at a high level, are simply models of a probability distribution over sequences of tokens or words. So it's basically some model of p(x1, ..., xL), where x1 is the first word and xL is the last one in the sequence or sentence. Very concretely, if you have a sentence like "the mouse ate the cheese", what the language model gives you is simply the probability of this sentence being uttered by a human, or being found online.

[04:17] If you have another sentence like "The the mouse ate cheese", there are grammatical mistakes here, so a model with some syntactic knowledge should know that this has a lower likelihood of appearing online. If you have another sentence like "the cheese ate the mouse", then the model should hopefully know that cheese doesn't usually eat mice. So there's some semantic knowledge, and this is less likely than the first sentence. That's basically, at a high level, what language models are.

[04:50] One term that you've probably been hearing a lot in the news is generative models. That's just something that can generate: models that can generate sentences, or some data. The reason we say language models are generative models is that once you have a model of a distribution, you can simply sample from it, and now you can generate data. So we can generate sentences using a language model.

[05:12] The type of models that people are all currently using are what we call autoregressive language models. And the key idea of autoregressive language models is that you take this distribution over words and decompose it into the distribution of the first word, multiplied by the likelihood of the second word given the first word, multiplied by p of the third word given the first two words, and so on. There's no approximation here: this is just the chain rule of probability, which hopefully you all know about. Really, no approximation; it's just one way of modeling a distribution. So, slightly more concisely, you can write it as a product of p's of the next word given everything that happened in the past, i.e. the context. This is what we call autoregressive language models. Again, this is really not the only way of modeling a distribution; it's just one way. It has some benefits and some downsides.
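To make that factorization concrete, here is a minimal sketch in Python. The toy next_token_probs function is a hypothetical stand-in (it just returns a uniform distribution where a real LLM would run a transformer); the point is the chain-rule scoring and the token-by-token sampling loop it implies.

```python
import math
import random

VOCAB = ["<s>", "the", "mouse", "ate", "cheese", "."]

def next_token_probs(context):
    # Toy stand-in for a trained model: returns p(next token | context).
    # A real LLM would run a transformer on the context here; for the
    # sketch we just return a uniform distribution over the vocabulary.
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

def sequence_log_prob(tokens):
    # Chain rule: log p(x1..xL) = sum_t log p(x_t | x_<t).
    # No approximation; we condition on the start token "<s>".
    total = 0.0
    for t in range(1, len(tokens)):
        total += math.log(next_token_probs(tokens[:t])[tokens[t]])
    return total

def sample(max_len=10):
    # Autoregressive sampling: a for loop that generates one token,
    # then conditions on it to generate the next, so longer sequences
    # take proportionally longer to generate.
    tokens = ["<s>"]
    for _ in range(max_len):
        probs = next_token_probs(tokens)
        tokens.append(random.choices(list(probs), weights=probs.values())[0])
    return tokens

print(sequence_log_prob(["<s>", "the", "mouse", "ate", "cheese"]))
print(sample())
```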
[06:10] One downside of autoregressive language models is that when you actually sample from them, you basically have a for loop which generates the next word, conditions on that next word, and then generates another word. So if you want to generate a longer sentence, it takes more time to generate it. There are some downsides to this current paradigm, but that's what we currently have, so I'm going to talk about this one.

[06:36] Great. So, autoregressive language models. At a high level, the task of an autoregressive language model is simply predicting the next word, as I just said. If we have a sentence like "she likely prefers", one potential next word might be "dogs". And the way we do it is that we first tokenize. So you take these words or subwords, you tokenize them, and you give an ID to each token; here you have one, two, three. Then you pass it through this black box; as I already said, we're not going to talk about the architecture. You pass it through a model, and you get a probability distribution over the next word, or over the next token. Then you sample from this distribution, you get a new token ID, you detokenize, and that's how you basically sample from a language model.

[07:28] One thing which is important to note is that the last two steps are only needed during inference. When you do training, you just need to predict the most likely token; you compare it to the real token that came next, and then you change the weights of your model to increase the probability of generating that token.

[07:49] Great. So, autoregressive neural language models. To be slightly more specific, still without talking about the architecture, the first thing we do is-- sorry, yes?

[07:59] On the previous slide: predicting the probability of the next token, does this mean that your final output vector has to be the same dimensionality as the number of tokens that you have?

[08:09] Yes.

[08:10] How do you deal with it if you have more tokens? Adding more tokens to your [INAUDIBLE]?

[08:16] Yeah, so we're going to talk about tokenization later, so you'll get some sense of this. You can deal with adding new tokens-- I'm kind of exaggerating; there are methods for doing it, but essentially people don't do it. So it's really important to think about how you tokenize your text, and that's why we'll talk about that later. But it's a very good point to note that the vocabulary size, the number of tokens that you have, is essentially the output dimension of your language model. So it's actually pretty large.

[08:46] So, autoregressive neural language models. The first thing you do is take every word, or every token, and embed it, so you get some vector representation for each of these tokens.
[08:58] You pass them through some neural network; as we said, it's a transformer. Then you get a representation for all the words in the context, so it's basically a representation of the entire sentence. You pass that through a linear layer, as you just said, to map it to the right dimension, so that the number of outputs is the number of tokens. You then pass it through a softmax, and you get a probability distribution over the next word given every word in the context.

[09:30] And the loss that you use: this is essentially a task of classifying the next token, so it's a very simple, kind of, machine learning task, and you use the cross-entropy loss. You look at the actual target that happened, which is the target distribution, a one-hot encoding; in this case it says the real word that happened is "cat", so that's a one-hot distribution over "cat". And here-- do you see my mouse? Oh, yeah-- this is the distribution that you generated. And you do cross-entropy, which really just increases the probability of generating "cat" and decreases the probability of generating all the other tokens.

[10:08] One thing to notice, as you all know, is that this is just equivalent to maximizing the log-likelihood of the text: you can rewrite the maximum of the probability under this autoregressive language modeling task, by adding a log and a minus sign, as the minimum of the loss, which is the cross-entropy loss. So minimizing the loss is the same thing as maximizing the likelihood of your text. Any questions?

[10:43] OK, tokenizers. This is one thing that people usually don't talk that much about. Tokenizers are extremely important, so it's really important that you understand at least what they do at a high level.

[10:57] So why do we need tokenizers in the first place? First, they're more general than words. One simple thing you might think of is to take every word and say every word is a token in its own right. But then what happens if there's a typo in a word? You might not have any token associated with that misspelled word, and then you don't know how to pass it into the large language model. So what do you do? And also, words are fine for Latin-based languages, but if you think about a language like Thai, you won't have a simple way of tokenizing by spaces, because there are no spaces between words. So really, tokens are much more general than words. That's the first thing.

[11:44] The second thing you might think of is tokenizing every sentence character by character: A is one token, B is another token. That would actually work, probably very well. The issue is that your sequence then becomes super long.
[11:58] And as you probably remember from the lecture on transformers, the complexity grows quadratically with the length of the sequence, so you really don't want a super long sequence. So tokenizers basically try to deal with those two problems and assign common subsequences their own tokens. And the way you should think about it is that, on average, a token is around three to four letters.

[12:27] There are many algorithms for tokenization. I'll just talk about one of them to give you a high-level picture: Byte Pair Encoding, which is one of the two most common tokenizers. The way you train a tokenizer is that first you start with a very large corpus of text. And here I'm really not talking about training the large language model yet; this is purely for the tokenization step. So this is my large corpus of text, with these five words. Then you associate with every character in this corpus a different token. Here, I just split up every character into a different token and color-coded all of those tokens. And then you go through your text, and every time you see a pair of tokens that is very common-- the most common pair of tokens-- you just merge them. Here you see the tokens "t" and "o" next to each other three times, so you just say that's a new token. Then you continue, you repeat that: now you have "tok", which happens three times; "toke" with an E, which happens twice; "token", which happens twice; and then "ex", which also happens twice. So if you were to train a tokenizer on this corpus of text, which is very small, that's how you would finish with a trained tokenizer.

[13:47] In reality, you do it on a much larger corpus of text. And this is a real tokenizer-- actually, I think this is GPT-3 or ChatGPT-- and here you see how it would actually separate these words. You basically see the same thing as in the previous example: "token" becomes its own token, so "tokenizer" actually gets split into two tokens, "token" and "-izer".

[14:12] So yeah, that's all about tokenizers. Any questions on that? Yeah.

[14:16] How do you deal with spaces, and how do you deal with [INAUDIBLE]?

[14:19] Yeah, so actually there's a step before tokenizers, which is what we call pre-tokenizers, and that's exactly what you just said. In theory, there's no reason to deal with spaces and punctuation separately; you could just say every space gets its own token, every punctuation mark gets its own token, and do all the merging. The problem is an efficiency question: training these tokenizers takes a long time, because you have to consider every pair of tokens. So what you end up doing-- and this is very English-specific; pre-tokenizers are very English-specific--
[14:57] is to say that if there's a space, we're not going to look at the token that came before and the token that came after. So you're not merging across spaces. But this is just a computational optimization; you could theoretically deal with spaces the same way as any other character. And-- yeah?

[15:16] When you merge tokens, do you delete the tokens that you merged away, or do you keep the smaller tokens that you merged?

[15:22] You actually keep the smaller tokens. I mean, in reality it doesn't matter much, because on a large corpus of text you will actually see everything, but you usually keep the small ones. And the reason you want to do that is that, as we said before, if there are grammatical mistakes or typos, you still want to be able to represent those words character by character. So, yeah. Yes?

[15:48] Are the tokens unique? Say, in this case, T-O-K-E-N: is there only one occurrence, or do you need to leave multiple occurrences so they could take on different meanings or something?

[16:02] Oh, I see what you're saying. No, every token has its own unique ID. This is a great question: for example, if you think about "bank", which could be a bank for money or the bank of a river, it will have the same token. But the transformer will learn, based on the words around it-- I'm being very handwavy here-- to associate it with a representation that is either more on the money side or more on the river side. But it's the transformer that does that, not the tokenizer. Yes?

[16:39] You mentioned that during tokenization you keep the smaller tokens you started with, right? Like if you start with a T, you keep the T, and then you build your tokenizer out to [INAUDIBLE] allow input tokens. So let's say maybe you didn't train on "token", but in your data you're trying to encode "token". How does the tokenizer know to encode it with "token" or to [INAUDIBLE]?

[16:59] Yeah, great question. When you tokenize-- so that's after training of the tokenizer, when you actually apply it-- you basically always choose the largest token that you can apply. So if you can use "token", you will never use "t"; you will always use "token". People don't usually talk that much about tokenizers, but there are a lot of computational tricks you can use to make these things faster.

[17:27] And honestly, I think a lot of people think we should just get away from tokenizers and tokenize character by character or byte by byte. As I said, right now there's this issue of sequence length, but maybe one day, in five or ten years, we will have different architectures that don't scale quadratically with the length of the sequence, and maybe we'll move away from tokenizers.
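Here is a minimal sketch of the BPE training loop described above, on a made-up five-word corpus (the lecture's actual example corpus isn't reproduced here). Note that it only records merges; the single-character tokens are never deleted, matching the answer above.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    # Start with one token per character, keeping words separate,
    # as a pre-tokenizer would.
    words = [list(word) for word in corpus.split()]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of tokens across the corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        # Merge the most frequent pair into a single new token.
        a, b = pairs.most_common(1)[0][0]
        merges.append((a, b))
        new_words = []
        for w in words:
            merged, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(w[i])
                    i += 1
            new_words.append(merged)
        words = new_words
    return merges, words

corpus = "token tokens tokenizer text texts"  # hypothetical toy corpus
merges, words = train_bpe(corpus, num_merges=6)
print(merges)  # ('t','o'), ('to','k'), ('tok','e'), ('toke','n'), ...
print(words)   # "tokenizer" ends up as 'token' plus leftover characters
```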
[17:50] So can you share with us the drawbacks? Why do people want to move away from tokenizers?

[17:58] Yeah. I think one good example is math. If you think about math, numbers right now are not tokenized digit by digit. For example, 327 might have its own token, which means that models, when they see numbers, don't see them the same way we do. And this is very annoying, because the reason we can generalize in math is that we can deal with every digit separately and then do composition: you know that adding numbers is the same as adding each digit separately, plus carrying over the units. Models can't do that, so you have to do special tokenization. And one of the big changes GPT-4 made is changing the way they tokenize code. For example, in Python code you often have these four spaces at the beginning of a line. Those were dealt with strangely before, and as a result, the model couldn't really understand how to deal with code. So tokenizers actually matter a lot. OK, I'll move on right now, but we can come back to tokenizers later.

[19:05] Great. So we've talked about the task, the loss, and the tokenizer; let's talk a little bit about evaluation. The way that LLMs are usually evaluated is using what we call perplexity. At a high level, it's basically just your validation loss. The slight difference with perplexity is that we use something slightly more interpretable: you take the average per-token loss, and then you exponentiate it. The reason you exponentiate is that the loss has a log inside, and one, humans are actually pretty bad at thinking in log space; two, the log depends on its base, while once you exponentiate, everything is in vocabulary-size units. And the average per token is just so that your perplexity is independent of the length of your sequence. So perplexity is just 2 to the power of the average per-token loss of the sequence.

[20:00] Perplexity is between one and the vocabulary size of your tokenizer. One: if you predict every word perfectly, then every factor is basically one, so the best perplexity you can have is one. If you really have no idea, you predict every token with probability one over the vocabulary size, and if you do the simple math, you get a perplexity equal to the vocabulary size. So the intuition of perplexity is that it's basically the number of tokens your model is, kind of, hesitating between. If your model is perfect, it doesn't hesitate; it knows exactly the word. If it really has no idea, it hesitates between the entire vocabulary.

[20:43] And perplexity really improved. This is perplexity on a standard dataset between 2017 and 2023: it went from around 70 tokens to less than 10 tokens over those five or six years.
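As a quick illustration of the definition and the bounds just described, here is a minimal sketch using base 2, as in the lecture's formula (the 5-token sequences and 8-token vocabulary are made up):

```python
import math

def perplexity(token_log_probs, base=2):
    # Average per-token negative log-likelihood, then exponentiate, so the
    # result is independent of sequence length and of the log base.
    avg_loss = -sum(token_log_probs) / len(token_log_probs)
    return base ** avg_loss

vocab_size = 8
# Model with no idea: p = 1/vocab_size for every token -> perplexity = vocab_size.
clueless = [math.log(1 / vocab_size, 2)] * 5
# Perfect model: p = 1 for every token -> perplexity = 1.
perfect = [math.log(1.0, 2)] * 5

print(perplexity(clueless))  # 8.0
print(perplexity(perfect))   # 1.0
```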
[20:56] So that means the models previously hesitated between around 70 words every time they generated a word, and now they hesitate between fewer than 10 words. That's much better.

[21:06] Perplexity is actually not used anymore in academic benchmarking, mostly because it depends on the tokenizer that you use and on the actual data that people evaluate on. But it's still very important for the development of LLMs: when you actually train your own LLM, people will still really look at the perplexity.

[21:26] Another way, now more common in academia, of evaluating these LLMs is just taking all the classical NLP benchmarks-- I'll give you a few examples later-- and, kind of, aggregating everything. So you collect as many automatically evaluatable benchmarks as possible and evaluate across all of them. Two such benchmark suites are HELM, which is from Stanford, and the Hugging Face open leaderboard; those are probably the two most common ones right now.

[22:00] Just to give you an idea: in HELM, you have all of these types of tasks, which are mostly things that can be easily evaluated, like question answering. So think about many different question answering tasks. The benefit of question answering is that you usually know the real answer, so the way you evaluate these models-- and I'll give you a concrete example in one second-- is to look at how likely the language model is to generate the real answer compared to some other answers. That's essentially, at a high level, how you evaluate these models.

[22:33] To give you a specific example, MMLU is probably the most common academic benchmark for LLMs. It's just a collection of many questions and answers in all of these domains-- for example, college medicine, college physics, astronomy, these types of topics. And the questions are things like-- this one is from astronomy-- "What is true for a type-Ia supernova?" You give four potential answers and you ask the model which one is most likely. There are many different ways of doing it: either you can look at the likelihood of generating each of these answers, or you can ask the model which one is most likely. So there are different ways you can prompt the model, but at a high level, you know which answer is correct and that the three others are wrong. Yes?

[23:22] [For a model] generating unconstrained text as output: how do you evaluate it if it gives something that's semantically completely identical, but is not the exact tokens that you expect?

[23:35] Yeah, that's a great question. I'll talk more about that later. Here, in this case, we don't do unconstrained generation. The way you would evaluate MMLU is basically: either you ask the question and then look at the likelihood of the model generating A, the likelihood of generating B, C, and D, and you look at which one is most likely; or you can ask the model which of A, B, C, or D is most likely.
[23:59] And you look at whether the most likely next token is A, B, C, or D. So you constrain the model to only answer those four things.

[24:09] You say you constrain-- do you constrain the prompt, or do you mean that out of the whole probability distribution it outputs, you're only comparing the outputs of, like-- you're only comparing the A token [INAUDIBLE]?

[24:20] Yeah. So in the second case I gave you, you would actually do both: you would prompt the model with "A, B, C, or D", plus you would constrain it to only look at those four tokens. In the first case, you don't even need to generate anything. Given that it's a language model, it gives a distribution over sentences, so you literally just look at the likelihood of generating the first choice, the likelihood of generating the second choice, and so on, and you check whether the most likely sentence is actually the real answer. You don't actually sample from it; you really just use p(x1, ..., xL). Does that make sense? That being said, evaluation of open-ended questions is something we're going to talk about later, and it's actually really important and really challenging. Yes?

[25:10] Earlier you mentioned that metrics like perplexity are not usually used because they depend on how you do your tokenization, some design choices. I was wondering if you could speak more to that.

[25:24] Yeah. Think about perplexity: I told you perplexity is between 1 and the vocabulary size. Now imagine that ChatGPT uses a tokenizer that has 10,000 tokens, but Gemini from Google uses a tokenizer that has 100,000 potential tokens. Then the upper bound on the perplexity you can get is actually worse for Gemini than for ChatGPT. Does that make sense? It's actually a little more complicated than that, but that's one very simple example where you can see that the tokenizer actually matters.

[26:05] Great. OK, so, evaluation challenges. There are many; I'll just talk about two really briefly. One: as I told you, there are two ways of doing evaluation for something like MMLU-- actually, there are many more than two, but I gave you two examples. And it happens that for a long time, even though it was a very classical benchmark that everyone used, different companies and organizations were actually using different ways of evaluating MMLU. And as a result, you get completely different results. For example, Llama-65B, which was the first model from Meta in the Llama series, had 63.7 accuracy on HELM, but on this other benchmark it had something like 48.8. So really, the way that you evaluate matters-- and this is not even talking about prompting; this is really just the way that you evaluate the models. Prompting is another issue. So really, there are a lot of inconsistencies; it's not as easy as it looks. That's the first thing.

[27:08] Yeah, sorry.
[27:08] How can we make sure that all these models aren't trained on the benchmark?

[27:13] Second thing-- and this is a great question-- train-test contamination. This is something I would say is really important in academia. Given that the talk is mostly about training large language models: for companies, it's maybe not that important, because they know what they trained on. For us, we have no idea, so for us it's a real problem.

[27:37] There are many different ways of trying to test whether the test set was actually in the training set. One kind of cute trick that people in [? Tatsu's ?] lab have found: given that most of the datasets online are not randomized, and that language models just predict the next word, you can take the entire test set and check whether the model is more likely to generate all the examples in their original order or in a shuffled order. If it's more likely to generate them in the original order, given that there's no real order there, then it probably was in the training set. Does that make sense? That's one way; there are many others. Train-test contamination, again: not that important for development, really important for academic benchmarking.

[28:33] Great. There are many other challenges, but I'll move on for now.

[28:38] Data. So data is another really big topic. At a high level, people just say you basically train large language models on all of the internet. What does that even mean? People sometimes say "all of the clean internet", which is even less well defined. The internet is very dirty and really not representative of what we want in practice. If I downloaded a random website right now, you would be shocked at what's in there. It's definitely not your Wikipedia.

[29:08] So I'll go really briefly over what people do. I can answer some questions, but data on its own is a huge topic. Basically, first you download all of the internet. What that means is that you use web crawlers that go to every web page on the internet-- or every web page that's on Google-- and that's around 250 billion pages right now, around 1 petabyte of data. Common Crawl is one such web crawler. People don't usually write their own web crawlers; they use standard ones, and Common Crawl is one of them. Every month it basically adds all the new websites that were added to the internet and found by Google, and puts them into one big dataset. So on Common Crawl you have around 250 billion pages right now-- 1e6 gigabytes of data.

[30:07] Once you have this-- so this is a random web page, literally random, from Common Crawl. And what you see is that, one, it really doesn't look like the type of thing you would usually see-- so this is an HTML page.
[30:21] It's hard to see, but if you look through it, you will see some content. For example, here: "Test King World is your ultimate source for the system x high performance server", and then you have three dots-- the sentence isn't even finished. That's what random internet looks like. Of course, it's not that useful to train a large language model to generate things like this. So what are some of the steps that are needed?

[30:48] First, you extract the text from the HTML. That's what I just tried to do by looking at the correct tags. There are a lot of challenges in this. For example, extracting math is actually very complicated, but pretty important for training large language models. Or boilerplate: a lot of forums will have the same headers and footers, and you don't want to repeat all of that in your data.

[31:13] Then you filter undesirable content: not-safe-for-work material, harmful content, PII. Usually every company has a blacklist of websites they don't want to train their models on. That blacklist is very long, and you basically say: if it comes from there, we don't train on it. There are other ways of doing these things, like training a small model to classify what is PII and removing it. It's hard-- every step I'm going to show you here is a huge amount of work, but I'm just going to go through it quickly.

[31:48] So, filtering undesirable content. The next step is de-duplication. As I said, you might have things like headers and footers in forums that are always the same; you want to remove that. Another thing you might have is a lot of URLs that are different but actually show the same website. And you might have a lot of paragraphs from common books that are duplicated 1,000 or 10,000 times across the internet, so you have to de-duplicate. This is also very challenging, because you have to do it at scale.

[32:24] Once you've done the de-duplication, you do some heuristic filtering to try to remove low-quality documents. The way you do that is things like rule-based filtering. For example, if you see outlier tokens-- if the distribution of tokens on a website is very different from the usual distribution-- then it's probably an outlier. If the words on the website are super long, something strange is going on there. If the website has only three words, maybe it's not worth training on; if it has 10 million words, maybe something is also wrong with that page. So, a lot of rules like this. Yes?

[33:02] Why do we filter undesirable content out of our dataset instead of putting it in with, like, a supervised loss? Can we not just say: here's this hate-speech website, let's actively penalize the model for generating it?
[33:19] We'll do exactly that, but not at this step. That's where post-training comes in. In pretraining, the idea is just to say: I want to model, kind of, how humans speak, and I want to remove all these headers, footers, menus, and things like that. But it's a very good idea you just had, and that's exactly what we'll do later.

[33:45] Next step: model-based filtering. Once you've filtered a lot of data-- this is actually a very cute trick-- you take all of Wikipedia and you look at all the links that are referenced by Wikipedia pages. Because if something is referenced by Wikipedia, it's probably a high-quality website. And you train a classifier to predict whether a document comes from one of these Wikipedia references or from the random web, and you basically say: I want more of the things that look like they come from Wikipedia references. Does that make sense? So you train a machine learning model-- usually a very simple one, because you need to run it at scale; just think about those 250 billion pages.

[34:34] Next, you try to classify your data into different domains: this is entertainment, this is books, this is code, these types of domains. And then you try to either up-weight or down-weight some of the domains. For example, you might see that if you train more on code, your model actually becomes better at reasoning. That's something people usually say in a very hand-wavy way: training your model more on code helps reasoning. So you want to up-weight the code distribution, because that helps general language modeling skills. Books are usually another one that people up-weight; entertainment they usually down-weight. Things like this. People used to do it kind of heuristically; now there are entire pipelines, which we'll talk about, for doing these things slightly more automatically.

[35:33] And then, at the end of training-- after training on all of this data that we saw-- you usually train on very high-quality data while you decrease your learning rate. That basically means you're, kind of, overfitting your model on very high-quality data. Usually what you do there is Wikipedia: you basically overfit on Wikipedia, and on human data that was collected.

[36:04] There are other things, like continual pretraining to get longer contexts. I'm going to skip over all of that, but it's just to give you a sense of how hard this is: when people just say "I'm going to train on the internet", that's a lot of work. And really, we haven't figured it out yet. So collecting data well is a huge part of practical large language modeling; some might say it's actually the key. Yes?
[36:27] [INAUDIBLE] about data. A basic question: usually when you start with, like, a petabyte of data, after you go through all the steps, what's the typical amount of data you have remaining? And then how large a team does it typically take to go through all the data steps you talked about?

[36:43] Sorry, is your question how large the data is after you filter?

[36:47] Yeah, after you filter and go through all the steps. And how large a team do you need to go through all the filtration steps you mentioned? How slow is it? How many people would you need to be able to do this [INAUDIBLE]?

[37:02] OK, that's a great question. I'll answer the data part-- how large the dataset is-- at the end of this slide. For the number of people that work on it, that's a good question; I'm actually not quite sure, but I would say it's probably even bigger than the number of people who work on the tuning of the pretraining of the model. So the data side is bigger than the modeling side. I don't think I have a good sense, but in the Llama team, which has 70-ish people, I would say maybe 15 work on data. For all these things you don't need that many people, but you do need a lot of compute-- for data, you need a lot of CPUs. So, yeah. And I'll answer the data-size question at the end of this slide.

[37:56] So, as I just alluded to, we really haven't solved data at all for pretraining. There's a lot of research to be done. First: how do you process these things super efficiently? Second: how do you balance all of these different domains? Can you do synthetic data generation? That's actually a big one right now, because-- we'll talk about this later-- we don't have enough data on the internet. Can you use multimodal data instead of just text, and how does that improve even your text performance?

[38:28] There's a lot of secrecy, because this is really the key to most of pretraining large language models. For competitive reasons, these companies usually don't talk about how they do data collection. And there's also copyright liability: they definitely don't want to tell you they trained on books, even though they did, because otherwise they can get sued.

[38:50] Common academic benchmarks. This will, kind of, answer what you asked. It started-- those are the smaller ones; the names are not that important-- at around 150 billion tokens, which is around 800 gigabytes of data. And now it's around 15 trillion tokens, which is also the amount of data the best models right now are probably trained on. So 15 trillion tokens, which is, I guess, two orders of magnitude bigger than that: around 80e3 gigabytes. So that would be around 100 to 1,000 times filtering of the Common Crawl, if I'm not mistaken. So, yeah.
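As a rough sanity check on those sizes, here is the arithmetic, assuming roughly 5 bytes of raw text per token (a common rule of thumb, not a figure from the lecture):

```python
bytes_per_token = 5   # rough rule of thumb (assumption, varies by tokenizer)
small = 150e9         # ~150B tokens, the older academic datasets
big = 15e12           # ~15T tokens, today's largest

print(f"{small * bytes_per_token / 1e9:,.0f} GB")  # ~750 GB, i.e. the ~800 GB quoted
print(f"{big * bytes_per_token / 1e9:,.0f} GB")    # ~75,000 GB, i.e. the ~80e3 GB quoted
```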
[39:34] One very famous one is the Pile. This is an academic dataset, and we can just look at the distribution of data it contains: things like arXiv; PubMed Central, which is all the biology stuff; here Wikipedia; you see Stack Exchange, some GitHub, some books, and things like this. Again, this is on the smaller side: if you look here, this is 280B tokens, so in reality the big datasets are like 100 times bigger, and you cannot have that much GitHub and Wikipedia.

[40:09] In terms of closed-source models, just to give you an idea: Llama 2 was trained on 2 trillion tokens; Llama 3 on 15 trillion tokens, which is currently the best model for which we know how much it was trained on-- the same as the biggest academic dataset, 15 trillion tokens. GPT-4 we don't really know, but it's probably in the same order of magnitude; actually, from leaks, it's probably around 13 trillion-- if the leaks are true.

[40:39] Great. So, scaling laws. Any other questions on data before we go to scaling laws? Sorry, I know I'm giving you a lot of information, but there's a lot that goes into training large language models.

[40:54] Great, scaling laws. The idea is that what people saw around 2020-- or at least suspected for a long time, but have been able to show empirically since 2020-- is that the more data you train your models on, and the larger the models, the better the performance. This is actually pretty different from what you've seen in this class. In this class we teach you about overfitting. Overfitting doesn't happen with large language models: larger models, better performance. It's something that really took a long time for the community, who took this type of class, to realize. But for the exam, overfitting exists.

[41:33] So, the idea of scaling laws: given that more data and larger models will always give you better performance, can we predict how much better your performance will be if you increase the amount of data and the size of your model? And surprisingly, it works. Here you see three plots from a very famous OpenAI paper called Scaling Laws. On the x-axis you see compute-- how much compute you spent on training-- and on the y-axis the test loss. This is essentially, I mean, perplexity-- it's your validation loss, so it's the log of the perplexity. And if you put these two on a log scale, you see that the scaling law is linear. That means that if you increase your compute by a certain amount, you can say by how much your test loss will decrease. Same thing with data, and same thing for parameters: if you increase the dataset size, your loss will decrease by a somewhat predictable amount; if you increase the number of parameters, the loss will decrease by a somewhat predictable amount. This is really amazing. Very surprising.
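To make "linear in log-log space" concrete, here is a minimal sketch of fitting and extrapolating such a power law; the compute/loss points are made up for illustration, not taken from the paper.

```python
import numpy as np

# Hypothetical measurements: training compute C in FLOPs, and test loss L.
compute = np.array([1e17, 1e18, 1e19, 1e20, 1e21])
loss = np.array([4.2, 3.6, 3.1, 2.65, 2.3])

# A power law L = a * C^(-b) is a straight line in log-log space:
# log L = log a - b * log C, so we fit a degree-1 polynomial there.
slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), deg=1)

def predict_loss(c):
    # Extrapolate the fitted line back out of log space.
    return 10 ** (intercept + slope * np.log10(c))

# What loss should 100x more compute buy us?
print(predict_loss(1e23))
```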
[42:49] I mean, it looks innocuous when you look at these types of plots, but it's crazy, because it means you can predict how well we're going to perform in two or three years, depending on how much compute we add-- assuming these trends hold. There's nothing theoretical about it. Yes?

[43:05] Two things. One: what is the loss that they're using here? Is this perplexity?

[43:09] So, I said perplexity is like 2 to the power of the loss; the loss here is the log of the perplexity.

[43:17] And the second thing: when you increase the number of parameters, or you increase the dataset size [INAUDIBLE] times, doesn't that just inherently increase your compute? Does all of this [INAUDIBLE] come down to just how [INAUDIBLE]?

[43:32] No, this is a great question. The compute here is actually a function of two things: the data and the parameters. What I'm showing here is that you can-- well, actually, we're going to talk about that in detail. But basically, if you increase the number of parameters, you should increase the amount of data that you have. So you actually don't go multiple times over the same dataset-- no one does multiple epochs, at least not yet, because we still have, kind of, enough data. So yeah, this is all the same trend: increase compute, decrease loss. Yes?

[44:06] Have we seen the numbers for the last two years? Is this still holding?

[44:10] It is still holding. I don't have good numbers to show you, but it is still holding, surprisingly. Yes?

[44:21] Is there no evidence that [INAUDIBLE] will ever plateau? In theory, we would expect it to plateau, [INAUDIBLE]?

[44:28] No empirical evidence of plateauing anytime soon. Why? We don't know. Will it happen? Probably. I mean, it doesn't need to, because it's actually in log scale, so it's not as if it mathematically had to plateau; it could continue decreasing like this. Most people think it will probably plateau at some point; we don't know when.

[44:54] So, I'll talk more about scaling laws now. Why are scaling laws really cool? Imagine that-- you're very fortunate-- I gave you 10,000 GPUs for a month. What model would you train? How do you even go about answering that question? This is a hypothetical, but it's exactly what these companies are faced with.

[45:16] The old pipeline was basically: tune hyperparameters on the big models. Let's say I have 30 days; I train 30 models for one day each, pick the best one, and that's the final model I use in production. That means the model I actually use was only trained for one day.

[45:36] The new pipeline is that you first find a scaling recipe. So you find something that tells you, for example-- one common one is that if you increase the size of your model, you should decrease your learning rate.
[45:47] So you find a scaling recipe such that you know: if I increase the size of my model, here's what I should do with the hyperparameters. Then you tune your hyperparameters on smaller models of different sizes. Let's say that for three of my 30 days, I train many different small models, each of a different size, and do hyperparameter tuning on them. Then I fit a scaling law and try to extrapolate from these smaller models which one will be the best if I train it as a much larger model. And then I train the final huge model for 27 days instead of just one day. So the new pipeline is: don't do hyperparameter tuning at the real scale of the model you're going to use in practice; do things on smaller models at different scales, and try to predict how well they will perform once you make them bigger.

[46:43] I'll give you a very concrete example right now: transformers versus LSTMs. Let's say you have these 10,000 GPUs and you're not sure which one you should be using: a transformer-based model or an LSTM-based model. What I will do is train transformers at different scales-- here you see the number of parameters on the x-axis, and the y-axis is my test loss-- and then train different LSTMs at different scales. Once I have these points, I see that they, kind of, fit a scaling law. I fit my scaling law, and then I can predict: if I had 10 times more compute, here's how well the LSTM would perform. It's actually slightly less linear for the LSTM, but you can still try to predict where you would end up. And clearly, from this plot, you see that transformers are better.

[47:30] One thing to note when you read these types of scaling laws is that there are two things that are important. One is your scaling rate, which is the slope of the scaling law. The other is your intercept: you could start worse but actually become better over time. It just happens that LSTMs are worse on both, but I could show you another example where you can predict that after a certain scale, you're better off using one type of model than the other. So that's why scaling laws are actually really useful. Any questions on that? Yeah.

[48:12] These are all, kind of-- how sensitive are these to small differences in the architecture? Like, one transformer architecture versus another transformer architecture. Do you have to fit your own curve and basically say, oh, scaling laws tell me this should be some logarithmic function, let me extrapolate that for my own specific architecture?

[48:35] Yeah. So usually, for example, if you're an academic-- at least that's pretty recent-- and you want to propose a new activation, that's exactly what you will do.
[48:45] You fit a scaling law for your new activation, show another scaling law for the standard one-- like, I don't know, GELU-- and you say that yours is better. In reality, once you start thinking in scaling-law terms, you realize that all the small, minor architecture differences we can make mostly just change the intercept a little bit. And that really doesn't matter much, because you can just train for 10 hours longer, or wait for the next generation of GPUs; these things are really secondary. Which is exactly why I was telling you at the start: people spend too much time on the architecture and the losses. In reality, those don't matter as much. Data, though: if you use good data, you will have much better scaling laws than if you use bad data. So that really matters.

[49:27] Another really cool thing you can do with scaling laws is ask yourself how to optimally allocate training resources. Should I train larger models? We saw that it's better when you train larger models, but we also saw that it's better when you use more data. So which one should I do? Should I train a smaller model on more data, or a larger model on less data?

[49:49] Chinchilla is the very famous paper that first showed this, and I want to give you a little bit of a sense of what these plots are. Here you see training loss on the y-axis, and on the x-axis the parameter count-- the size of the model. All of these curves are what we call iso-FLOP curves, which means that all the models on one curve have been trained with the same amount of compute. The way you do that is that you vary the number of tokens trained on and the size of the models, but in such a way that the total compute is constant. So all these curves with different colors correspond to different amounts of training compute.

[50:32] Then you take the best model on each of those curves. Once you have the best one per curve, you can plot how many FLOPs that curve used and how many parameters its best point actually had. You put that on a log-log scale, and you fit a scaling law again. So now I have something that tells me: if I want to train a model with 10 to the power 23 FLOPs, here is the number of parameters I should be using-- 100B. And you can do the same thing with FLOPs and tokens. So now, if you tell me exactly how much compute you have-- say, one month of compute-- I fit the scaling law and I tell you what size of model you should be training.

[51:21] Of course, that all looks beautiful. In reality there are a lot of small things-- like, should you count embedding parameters-- there's a lot of complexity. But if you do things well, these things actually do hold.
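Here is a minimal sketch of that procedure with made-up numbers: pretend we measured a few (model size, loss) points per compute budget, take the best size on each iso-FLOP curve, and fit the optimal parameter count as a power law in compute.

```python
import numpy as np

# Hypothetical iso-FLOP results: for each compute budget C (FLOPs), a few
# model sizes N (params) trained at that fixed budget, with measured loss.
iso_flop_results = {
    1e19: [(1e8, 3.90), (3e8, 3.70), (1e9, 3.75)],
    1e20: [(3e8, 3.45), (1e9, 3.30), (3e9, 3.38)],
    1e21: [(1e9, 3.05), (3e9, 2.92), (1e10, 3.00)],
}

budgets, best_n = [], []
for c, runs in sorted(iso_flop_results.items()):
    n, _ = min(runs, key=lambda r: r[1])  # best model size on this curve
    budgets.append(c)
    best_n.append(n)

# Fit N_opt(C) as a power law: a straight line in log-log space.
slope, intercept = np.polyfit(np.log10(budgets), np.log10(best_n), deg=1)

def optimal_params(c_flops):
    return 10 ** (intercept + slope * np.log10(c_flops))

print(f"{optimal_params(1e23):.1e}")  # extrapolated optimal model size
```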
[51:35] So the optimal ratio that the Chinchilla paper found is to use 20 tokens for every parameter that you train. [51:44] If you add one more parameter, you should train your model on 20 more tokens. [51:49] One caveat here is that this is optimal for training resources. [51:53] It's telling me: if I have 10 to the power 23 FLOPs — say, $5 million — to train the model that gets the lowest loss, what should I train? [52:09] In reality, these companies also need to think about inference. [52:12] If you have a smaller model, you will spend less over time. [52:17] So if you account for inference cost, there are other papers showing the ratio is more like 150 tokens per parameter, because you prefer a smaller model: over time you're going to spend less money on inference. [52:37] So 150 to 1 — that's around what the best models are trained at right now, at least the ones actually used in production. [52:49] Great. [52:51] Any questions on Chinchilla? [52:56] Oh sorry, yes. [52:58] In practice, how expensive is inference for these models relative to training? [53:03] Actually, very expensive. [53:05] I won't talk about inference because that would be another entire lecture. [53:09] But just think about ChatGPT, where they have — I don't know what it is now — something like 600 million people using it. [53:18] That's a lot. [53:23] So it's actually very expensive. [53:24] There is a lot of optimization you can do for inference, though. [53:27] That's an entire other lecture, so I'll skip it this time, but it's very interesting. [53:33] OK, continuing. [53:34] As I said, there are many things you can answer with scaling laws; I just tried to give you two examples. [53:42] What data do you use? [53:43] What data-mixing weights do you use — the mixtures we talked about before. [53:49] What architecture do you use; should you make your model wider or deeper? [53:54] Should you be paying for more GPUs, or collecting more data? [53:58] All of these are things you can try to answer with scaling laws.
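To make these token-per-parameter ratios concrete, here is an illustrative calculation using the C ≈ 6 × parameters × tokens approximation (the same formula used in the cost estimate below). The 1e23-FLOP budget is an arbitrary example, and the slide's own ISO-FLOP fit may give somewhat different numbers.

```python
import math

def allocation(compute_flops: float, tokens_per_param: float):
    """Solve C ~= 6 * params * tokens, with tokens = ratio * params."""
    params = math.sqrt(compute_flops / (6 * tokens_per_param))
    return params, tokens_per_param * params

# Chinchilla's training-optimal ratio (~20) vs. the inference-aware
# ratio (~150) mentioned above, for a hypothetical 1e23-FLOP budget.
for ratio in (20, 150):
    p, t = allocation(1e23, ratio)
    print(f"ratio {ratio:>3}: {p/1e9:5.0f}B params, {t/1e12:4.1f}T tokens")
```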
[54:03] One thing I want to mention here is the bitter lesson. [54:05] If you've ever heard of Richard Sutton — it's a very famous blog post from 2019. [54:12] What he realized, which I think not enough people realize, and which I definitely did not realize at the time, [54:19] is that once you see these types of scaling laws, you know that the more compute you have, the better the models you will get. [54:26] So with scale, you get better models. [54:28] And you also know, by Moore's law or its variants, that you will always get more compute. [54:34] Then the only thing that matters is to have architectures that can leverage computation. [54:40] So what matters is basically systems and data, and much less the small architecture differences, like your activation functions and things like that. [54:49] I think that's one of the reasons why most of research focuses on things that matter less for industry. [54:56] And I was one of those researchers for a large part of my career. [55:02] So don't spend time overcomplicating. [55:04] Do the simple things, do them well, and scale them. [55:08] That's really what OpenAI taught us with ChatGPT and with all the GPTs before it. [55:15] OK, I want to give you some back-of-the-envelope computations. [55:18] I might be off by a factor or two, but I just want to give you a sense of how costly it is to train some of these models. [55:25] I'll use as an example Llama 3 405B, which is currently the best open-source model you can get. [55:31] It was trained on 15.6 trillion tokens and has 405 billion parameters. [55:37] Now that you know about optimal tokens per parameter: here that ratio is around 40. [55:43] So a little more than Chinchilla, but less than the inference-optimal ratio. [55:50] So they went for training optimality. [55:53] FLOPs for this model: [55:55] one simple way to approximate training FLOPs is 6 times the number of parameters times the number of tokens you train on. [56:03] If you do that simple calculation here, it's 3.8e25 FLOPs. [56:07] The reason this is important is that, if you follow the news a little, there's an executive order from Biden that basically says that once you go above 1e26 FLOPs, your model gets special scrutiny. [56:21] So they went more than 2x below that — [56:23] they really went right under the threshold to avoid the special scrutiny. [56:27] So 3.8e25; I might be off by a little, but it's definitely under 1e26. [56:36] Here P is the number of parameters and N is the data, the number of tokens; this is just an approximation. [56:48] OK, compute. [56:49] We know they trained on 16,000 H100s, and we know the throughput they achieved. [56:58] If you do the computation, it takes around 70 days, or 26 million GPU-hours. [57:05] At least, that's what my back-of-the-envelope computation says. [57:08] They actually reported using 30 million GPU-hours instead of 26 million, [57:13] so maybe they had some challenges — I don't really know. [57:18] But following the simple computation, it's around 70 days. [57:22] Cost. [57:24] This is hard to estimate, so I'm just going to use the rental price: [57:29] if I wanted to rent that many H100s for that many days, how much would I pay? [57:36] A lower bound on the renting cost of an H100 is around $2 per hour. [57:43] Multiply that by 26 million hours, and you get $52 million. [57:50] They probably pay less than that, but actually not much less, because the services that rent GPUs don't make that much margin. [58:00] So probably slightly less, but not that much less. [58:04] Now salary: say 50 employees at $500k per year. [58:10] That's probably the right ballpark: $25 million. [58:13] So all together, around $75 million for training this Llama model. [58:21] I'm probably off by $10 million or so, but that's the right ballpark.
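The same back-of-the-envelope as a script. The H100 peak throughput, the 40% utilization, and the $2-per-GPU-hour rate are rough assumptions on my part, so read the outputs as order-of-magnitude only.

```python
# Back-of-envelope for Llama 3 405B, following the 6 * P * N rule.
params = 405e9            # parameters
tokens = 15.6e12          # training tokens
flops = 6 * params * tokens   # ~3.8e25 training FLOPs

gpus = 16_000             # H100s
peak = 989e12             # assumed H100 bf16 peak, FLOP/s (dense)
mfu = 0.40                # assumed model FLOP utilization

seconds = flops / (gpus * peak * mfu)
gpu_hours = gpus * seconds / 3600
print(f"{flops:.2e} FLOPs, {seconds/86400:.0f} days, "
      f"{gpu_hours/1e6:.0f}M GPU-hours, ~${2 * gpu_hours / 1e6:.0f}M rent")
# Prints roughly: 3.79e+25 FLOPs, 69 days, 27M GPU-hours, ~$53M rent
```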
[58:27] Carbon emitted. [58:29] A lot of people will ask about this — cost is not the only thing that matters. [58:33] So I did the computation: it's around 4,000 tons of CO2 equivalent. [58:42] That is actually only about 2,000 return tickets from JFK to London. [58:47] So right now, the carbon emitted is — I mean, it's large, but it's not the dominant concern yet. [58:56] I think maybe at GPT-6 or GPT-7, once you multiply this by 100, it might become a real issue. [59:04] Right now it's still not, I think, an issue in the grand scheme of things. [59:09] For the next models, the way to think about it is that with every new generation, the number of FLOPs essentially multiplies by 10x — or at least that's what they aim for, if they have enough energy and can buy enough GPUs. [59:23] Great. [59:23] Any questions on this back-of-the-envelope math? [59:29] No? [59:30] OK. [59:31] So now that we've talked about pretraining, I also wanted to chat about systems, because we now know compute is really important, so there's the question of how you optimize that compute. [59:43] I'll leave that for the end because I'm not sure how much time we'll have. [59:46] I think it's important, but hopefully I can talk about it later; it's slightly different from what we've been discussing. [59:54] So I'll move on to post-training for now. [59:56] The task of post-training: the reason we need post-training is, as I told you before, to make AI assistants. [01:00:06] Language modeling is not really what you want from an AI assistant. [01:00:12] For example, take GPT-3, which is a pure language model, not an aligned one. [01:00:20] If you prompt it with "Explain the moon landing to a six-year-old," the completion you get is something like "Explain the theory of gravity to a six-year-old." [01:00:29] Because what it learned is that on the internet, one question is usually followed by a bullet point of other similar questions — you don't usually see a question and then its answer. [01:00:39] This is not what you want from an AI assistant. [01:00:42] So how do we do this alignment, this post-training that turns these models into assistants? [01:00:49] The goal of alignment is basically to get LLMs to follow the instructions given by users, and maybe the designers' desires. [01:01:02] As motivation: OpenAI doesn't want the model to say things that are very toxic. [01:01:09] So here you see on the left-hand side that when you ask a question, the model actually provides a real answer — not like the pure language model before. [01:01:17] And on the right-hand side, if you ask it to write a tweet describing how a certain part of the population is evil, it says that it cannot do that. [01:01:29] So that's the alignment. [01:01:32] The background here is that we basically know what data we want for training these models:
[01:01:42] it's just asking humans — here is a question, here is the answer that we want. [01:01:46] The thing is that this data is very expensive to collect, and it's hard to find online. [01:01:51] In contrast, pretraining data is not what you want, but there's a lot of it. [01:01:56] So the main idea is simply to take a large language model pretrained on all of the internet, and then fine-tune it: [01:02:03] you change the weights a little bit on the type of data that you actually want. [01:02:07] And hopefully, given that it was pretrained on all of the internet, it already knows how to speak English and knows standard language syntax, [01:02:18] so you can fine-tune it with very little data. [01:02:23] OK, SFT. [01:02:24] Supervised fine-tuning is exactly what I just said: [01:02:27] fine-tuning the large language model on desired answers collected from humans. [01:02:35] Why is it called supervised fine-tuning? [01:02:37] Because you do language modeling on the desired answers. [01:02:41] Language modeling — next-word prediction — is the fine-tuning part, [01:02:45] and doing it on desired answers given by humans is why we call it supervised. [01:02:51] So how do we collect this data? [01:02:52] I just said it: you ask humans to write down, for a given question, the answer they would want from the model. [01:03:00] Here is an example — let's read this one. [01:03:09] "Can you write a short introduction about the relevance of the term monopsony?" [01:03:13] And the answer: "Monopsony refers to a market structure..." and so on — and that answer was written by a human. [01:03:19] This is from Open Assistant, which was an effort to collect this kind of data online from humans. [01:03:27] This type of supervised fine-tuning, or alignment, is really the key to ChatGPT. [01:03:33] It's what made the big jump from GPT-3, which was mostly known by AI researchers, to ChatGPT, which became known by basically everyone. [01:03:46] The problem with human data is that it's very slow and expensive to collect. [01:03:56] So one simple idea is to use LLMs to scale data collection. [01:04:03] That's exactly what we did with Alpaca one year ago. [01:04:06] We took a data set of 175 human-written question-answer pairs, [01:04:15] and we asked the best model at the time, text-davinci-003, to generate many more of them: [01:04:22] here is what humans wrote; now write similar questions and similar answers. [01:04:27] We collected 52,000 LLM-generated question-answer pairs. [01:04:32] Then we simply took LLaMA 7B, the best pretrained model at the time, [01:04:36] and fine-tuned it with supervised fine-tuning, as I told you — a minimal sketch of that step is shown below.
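A minimal sketch of the SFT step in PyTorch, assuming a Hugging Face-style causal LM. The model name and the single data pair are placeholders, and real pipelines typically mask the prompt tokens out of the loss; the point is only that the loss is plain language modeling on the desired answer.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is a stand-in for any causal LM; the lecture's setting
# would use a LLaMA-class base model instead.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)  # post-training LR

# One hypothetical (instruction, desired answer) pair.
pairs = [("What does algorithm mean?",
          "An algorithm is a step-by-step set of instructions...")]

for question, answer in pairs:
    batch = tok(question + "\n" + answer, return_tensors="pt")
    # labels = input_ids gives the standard next-token loss; real
    # pipelines usually mask the prompt tokens out of this loss.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
```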
[01:04:39] And that's how we got the Alpaca 7B model. [01:04:44] This is the type of data that we collected — [01:04:47] things like: "What does algorithm mean?" [01:04:49] "An algorithm is a step-by-step set of instructions you use to solve a problem or achieve a goal," and so on. [01:04:56] The data is actually pretty good, given that it was generated by an LLM from essentially two generations ago. [01:05:04] That started, at least for us, as an academic replication of ChatGPT. [01:05:10] Now there's a big field of synthetic data generation: how to use LLMs to make the development of LLMs faster, [01:05:21] basically by decreasing the number of human hours you need. [01:05:26] Quantity of data. [01:05:28] So we talked about what type of data and how to collect it. [01:05:31] One surprising thing with SFT is that you don't need that much data. [01:05:36] What this paper, called LIMA, showed is that if you scale the amount of supervised fine-tuning data from 2,000 to 32,000 examples, it really doesn't help much. [01:05:47] So here, scaling laws definitely don't apply. [01:05:49] The intuition is that all you learn in SFT is how to format your desired answers. [01:05:58] Another way of saying it: your pretrained model essentially models the distribution of every user on the internet — one who might write bullet points, another who might actually answer a question with an answer. [01:06:10] So all you tell your model is: you should put more weight on this type of user than on that one. [01:06:17] You're not actually teaching it anything new through SFT; [01:06:23] all you do is tell the model to optimize for one type of user that it already saw in the pretraining data. [01:06:30] The knowledge is already in the pretrained LLM; you just specialize it to one type of user. [01:06:37] Great. [01:06:38] Any questions on SFT? [01:06:40] Yes. [01:06:41] So I know there's a big issue with synthetic data where, if you keep generating data from the same distribution, [01:06:49] eventually you're not learning a new distribution, you're essentially just bootstrapping from it. [01:06:55] Surely you can't scale that forever, right? [01:06:57] You can't keep generating from the same distribution and hope to learn something new. [01:07:02] So — it's an active area of research, but do you have any thoughts on how people are thinking about this and on better ways to bootstrap? [01:07:11] Or do we give up on the idea, note that the chart shows you don't need that many examples, and just get humans to generate 2,000 really good ones? [01:07:19] Yeah, that's a very good question. [01:07:21] On the data point: I'm saying it's not that important for SFT, but there's another stage, which we'll talk about right after, where data does matter.
[01:07:29] My intuition, based on not that many empirical results, [01:07:33] is that if you use purely LLM-generated text, and you do that for three or four generations of LLMs, [01:07:43] I agree that you probably won't improve much. [01:07:45] But what matters to me is: how do you use humans in the loop with LLMs? [01:07:49] Not purely LLMs, not purely humans. [01:07:53] Maybe what you can do is have the model generate some new text and have humans just make a few edits. [01:07:59] Edits are much faster than writing the entire text. [01:08:01] With that type of collaboration, from an information-theoretic point of view you still get additional information, but you're much faster than if you only used humans. [01:08:11] I think that as a field we'll probably move towards these kinds of approaches — finding the examples that matter and asking humans about those. [01:08:20] It's kind of active learning: asking humans exactly when you need their input. [01:08:28] Yes. [01:08:28] Do we train with the same loss function and the same general training algorithm for the supervised fine-tuning as we do for pretraining? [01:08:36] Because for the examples you showed, the important thing about the good examples is that they're super factually accurate; there are these more complex things and it's still just [INAUDIBLE]. [01:08:48] Same loss. [01:08:49] Yeah, maybe I didn't emphasize this enough: it's just language modeling. [01:08:53] You fine-tune the LLM with the language-modeling loss on the desired answers. [01:08:56] So it's literally the same loss. [01:08:59] It will be different in two seconds, but the first step, SFT, is literally the same loss, where you just say: OK, I want to specialize on this type of data. [01:09:08] There's even a question of what counts as pretraining versus post-training, because in reality it's just different data that you use. [01:09:13] The reason we usually call it post-training is that the way we collect that data is very different. [01:09:18] Great, great questions. [01:09:20] Yes. [01:09:22] Maybe it's the same question, but why would these 2,000 examples have such an outsized influence on fine-tuning? [01:09:30] That's another reason we call it post-training: we use a different set of hyperparameters. [01:09:35] I told you that at the end of pretraining you essentially end up with a learning rate of 0. [01:09:40] Here, you increase your learning rate again, to something like 1e-5. [01:09:44] So the weight you effectively give to this data is different. [01:09:52] OK. [01:09:54] The second part of post-training is what we call reinforcement learning from human feedback, or RLHF. [01:10:02] Some of you might have heard of it. [01:10:05] The idea is that SFT has a problem: it's behavioral cloning, which means you just try to clone what the humans would say.
[01:10:14] And that has many issues. [01:10:16] One of them is that you're bound by human abilities. [01:10:19] Humans won't generate the thing they actually think is the best possible output. [01:10:28] If you ask me to write a book — I can definitely enjoy a book, [01:10:32] and I can probably say one book is better than another, [01:10:34] but I'm definitely not going to be able to write the book I would want to read. [01:10:37] So you're bound by the human ability to generate things, even though humans may be much better at distinguishing between things. [01:10:43] That's issue number one. [01:10:44] Issue number two, which I find pretty interesting: [01:10:49] you've probably heard the word hallucination — LLMs generating false information. [01:10:55] People have hypothesized that hallucination can come from supervised fine-tuning, even if you do supervised fine-tuning on data that is correct. [01:11:06] The reason is this: I told you that SFT uses very little data, and that the model doesn't learn anything new from it. [01:11:17] So what if the human gives an answer that the model didn't know was true? [01:11:23] From the model's perspective, the human is basically telling it: generate this thing that sounds plausible, even though you have no idea whether it's true. [01:11:34] To give a very concrete example, going back to the monopsony question: [01:11:41] imagine the human's answer cites a reference, some book. [01:11:46] That book might exist; it might be a correct reference. [01:11:47] But what if the LLM never saw that reference during pretraining? [01:11:52] Then it doesn't know that it's a correct reference. [01:11:54] So what you're really teaching the model is to make up plausible-sounding references, rather than to give the real references it saw during pretraining. [01:12:05] So hallucination might be caused by SFT. [01:12:12] That's problem number two. [01:12:14] Does that all make sense? [01:12:15] Great. [01:12:16] Problem number three: price. [01:12:18] Generating the ideal answers is very expensive. [01:12:21] That comes back to your earlier question: having humans write the entire answer is pretty expensive. [01:12:28] So that's where RLHF comes in. [01:12:30] The idea is that instead of cloning the behavior of humans, we're going to maximize human preference. [01:12:37] The pipeline is: for every instruction, you ask a model to generate two answers — [01:12:45] and you usually use a pretty good model here; not a base LLM, but a model that already went through SFT, so it gives pretty good answers. [01:12:56] Then you ask labelers which of the two answers is better — [01:13:01] select the preferred one.
[01:13:02] And then, with different types of algorithms — we're going to talk about the algorithms right now — you fine-tune the model to generate more of the green thing than the red thing. [01:13:10] More of the good stuff. [01:13:12] So the question is how, and we'll talk about that now. [01:13:17] There are two approaches we'll cover, the two mainly used in the community. [01:13:23] The first is simply to use reinforcement learning. [01:13:26] Hopefully you all know what reinforcement learning is by now. [01:13:30] When you use reinforcement learning, one important question is: what is the reward we're optimizing? [01:13:36] In this case, there are really two options. [01:13:39] First, you could compare the output generated by some baseline with the output generated by your model, [01:13:46] ask the human which one is better, and use that as the reward: [01:13:51] if I'm better than the baseline, it's a +1; if not, a -1. [01:13:55] So a binary reward. [01:13:57] The problem with a binary reward is that it's very sparse, and you don't get much information out of it. [01:14:01] Maybe your answer was slightly better, maybe it was way better — you can't tell from this how much better it was. [01:14:10] Option two is to train what we call a reward model, which is simply a classifier. [01:14:16] You use machine learning to classify how much better one output is than the other, from the perspective of the human. [01:14:26] This is a little bit meta, but what you do is take a reward model — which is also just a large model, a classifier — [01:14:41] and give it the input and one of the two outputs. [01:14:45] You exponentiate its reward and divide by the sum of the exponentiated rewards of the two outputs — that's the softmax loss you all know about. [01:15:01] You train this reward model to be able to classify how much better one output is than another. [01:15:13] A slightly less convoluted way of saying it: your reward model outputs a reward that is used as the logit of your softmax. [01:15:22] So a high logit means this output is very likely the better one. [01:15:32] That's what we call the Bradley-Terry model. [01:15:34] Yes. [01:15:35] Will this reward model [INAUDIBLE] the entire output, or is it going to [INAUDIBLE]? [01:15:40] Yeah, this takes the entire output at once. [01:15:46] It takes all of the input and all of the output, and it gives one number.
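In code, the Bradley-Terry objective is just a softmax over the two rewards, which reduces to a logistic loss on their difference. A sketch with placeholder tensors:

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor,
                       r_rejected: torch.Tensor) -> torch.Tensor:
    """r_chosen / r_rejected: scalar rewards the reward model assigned
    to the human-preferred and dispreferred outputs for the same prompt.
    Maximizing exp(r_c) / (exp(r_c) + exp(r_r)) is the same as
    minimizing -log(sigmoid(r_c - r_r))."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy example: rewards for a batch of 3 preference pairs.
loss = bradley_terry_loss(torch.tensor([1.2, 0.3, 2.0]),
                          torch.tensor([0.7, 0.9, -0.5]))
```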
[01:15:51] Yes — so with the reward model, where would the human be? [01:16:01] Oh, I see — sorry, maybe I wasn't clear. [01:16:02] You train this reward model to fit the green-and-red preferences from humans. [01:16:09] So you train a classifier to say whether the humans preferred red or green. [01:16:15] But instead of using the binary label, which is what the human gives you, you use the logits of the softmax. [01:16:23] And the thing with logits is that they're continuous. [01:16:26] So if your reward model gives a high logit, then, in some sense, the human highly preferred this answer over the other one. [01:16:36] Great. [01:16:38] So, as I just said, continuous information is better. [01:16:41] That's what people use in practice — or at least used to use; I'll tell you about the other algorithm later. [01:16:48] In the end, you just apply the reinforcement learning you know about. [01:16:53] Now we have a reward. [01:16:55] What you sample is the generation from your large language model, and you add a regularization term. [01:17:02] The reason for the regularization term is to avoid what we call over-optimization: [01:17:06] the reward model might not perfectly model human preferences, [01:17:12] so you don't want to maximize it all the way to infinity. [01:17:17] And you do this with PPO, a common reinforcement learning algorithm. [01:17:24] One thing to note here, because it will be important later: [01:17:32] the large language model is now a policy for your reinforcement learning; it's not doing maximum likelihood anymore, [01:17:41] which means you're not modeling a distribution anymore. [01:17:43] The reason this is important is that models that went through this type of PPO don't give you meaningful likelihoods of text. [01:17:52] What you optimized them for is generating the most preferred thing, not modeling all the answers that humans might give. [01:18:02] Another way of saying it: there's nothing here that incentivizes the model to produce more than a single possible generation. [01:18:11] Nothing here says it's good to keep a distribution with some entropy. [01:18:18] If you didn't follow that, it's not that important, but it's good to know. [01:18:22] Great. [01:18:23] So PPO is exactly what ChatGPT did originally. [01:18:27] Here, from their blog post: step one, do supervised fine-tuning, which you all now know about. [01:18:34] Step two, train a reward model on human preferences. [01:18:38] Step three, do PPO for multiple steps — that's this blue arrow: [01:18:43] you train the model once with PPO, collect new data, and continue. [01:18:47] That's exactly what ChatGPT did, and that was the big breakthrough between GPT-3 and ChatGPT.
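Written out, the objective is: maximize E[r(x, y)] − β · KL(π_θ ‖ π_ref). Below is a hedged sketch of how the KL-penalized reward is often computed per sequence; PPO's clipping, value function, and rollout machinery are omitted, and β = 0.1 is an arbitrary choice.

```python
import torch

def regularized_reward(reward: torch.Tensor,
                       logprob_policy: torch.Tensor,
                       logprob_ref: torch.Tensor,
                       beta: float = 0.1) -> torch.Tensor:
    """reward: r(x, y) from the reward model, one scalar per sequence.
    logprob_*: summed token log-probs of y under the current policy
    and under the frozen SFT reference model. The penalty keeps the
    policy from drifting too far from the reference, which is the
    guard against over-optimizing an imperfect reward model."""
    kl_estimate = logprob_policy - logprob_ref  # per-sample KL estimate
    return reward - beta * kl_estimate
```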
[01:18:55] One thing to note is that PPO has many challenges. [01:18:58] Reinforcement learning is super nice theoretically. [01:19:02] In practice, anyone who has ever worked with reinforcement learning knows it's a mess. [01:19:06] There are rollouts, outer loops, clipping — so many complications. [01:19:11] So it's messy. [01:19:13] This is the idealized PPO objective used in the LLM setting, and that's already much more complicated than the expectation we saw before. [01:19:19] And in practice it's even more complicated. [01:19:21] We had to write one implementation of it, and I'm not going to go through it, [01:19:25] but there is so much you have to think about when you implement this type of PPO algorithm: [01:19:31] clipping everywhere, a lot of complexity, and things are not well documented. [01:19:37] All this to say that a new method was proposed, also from Stanford, one year ago, called DPO, which is essentially a simplification of PPO. [01:19:49] The idea is that instead of using reinforcement learning, you can just maximize the probability of generating the stuff that you like and minimize the probability of the stuff that you don't like. [01:20:02] Thinking of the human preference as red and green: maximize green, minimize red. [01:20:08] The loss is this one: this term is simply the log-likelihood of the model generating the answer the human preferred, given the input, [01:20:23] and what you do is maximize the likelihood of generating the things you like while minimizing the likelihood of the things you don't like. [01:20:33] The rest of the terms are not too important; it's really not that complicated to understand. [01:20:39] At a high level, it's just maximizing the things you like and minimizing the rest. [01:20:45] One thing to note is that the remaining terms are chosen such that the global minimum of PPO and the global minimum of DPO are, under some assumptions, essentially equivalent. [01:21:01] So this is the right thing to do mathematically; I won't go through the derivation, but it is the right thing to do. [01:21:08] It's pretty different from PPO: with PPO, you had to collect human preferences, then train a reward model with maximum likelihood, then do reinforcement learning. [01:21:17] Now all you do is maximum likelihood. [01:21:19] Much simpler.
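The DPO loss itself is short enough to sketch. Each argument is a summed token log-probability under the trained policy or the frozen reference model; beta is a hyperparameter, and 0.1 here is just an illustrative default.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Summed token log-probs of the chosen / rejected answers under
    the policy being trained and under the frozen reference model.
    The loss pushes the policy's margin on the chosen answer above
    the reference model's margin -- maximize green, minimize red."""
    margin = (logp_chosen - ref_logp_chosen) \
           - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()
```

Note that this is plain supervised optimization: no sampling, no reward model, no rollouts, which is exactly the simplification over PPO described above.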
[01:21:21] Yes — so it seems like this is, A, much simpler, and B, what you would intuitively do [INAUDIBLE]. [01:21:27] Why did they start with the reward model? [01:21:29] What led them to do that? [01:21:31] I think that's a great question; I don't really know. [01:21:34] What I can tell you is that the people who initially built ChatGPT are the ones who actually wrote PPO. [01:21:47] There were a lot of reinforcement learning people there, and I think for them it was the intuitive approach. [01:21:54] There are also some additional potential benefits. [01:22:00] For example, the cool thing with a reward model and reinforcement learning is that you can use unlabeled data: [01:22:08] with DPO you can only use the labeled preference data, [01:22:12] while with PPO, you first train your reward model, and then the reward model can label unlabeled data for you. [01:22:21] So there could be potential improvements from that; in practice, it turns out there are none. [01:22:29] And I think a lot of people on that team were simply reinforcement learning experts, including the main author of PPO, John Schulman. [01:22:39] So DPO is much simpler than PPO, and it performs basically as well. [01:22:43] Now it's the standard thing that people use, at least in the open-source community, and I believe it's actually the standard in industry too. [01:22:51] So that's DPO. [01:22:53] Gains: these plots are from the papers on the left. [01:22:57] This one is on a summarization task. [01:22:59] All I want to show you is that the pretrained models were OK and improved with scale; [01:23:05] if you do supervised fine-tuning, you improve them a little more; [01:23:08] and if you do PPO, or some form of RLHF, you get performance that is oftentimes, depending on the benchmark, even better than humans — this line is the human reference summaries. [01:23:21] Same thing here, from our AlpacaFarm paper. [01:23:25] The evaluation metric here is not too important, [01:23:27] but you see the pretrained model, then the jump to SFT, then the jump to PPO and DPO — and PPO and DPO have essentially the same performance. [01:23:36] So RLHF helps — that's the conclusion — and DPO is simple. [01:23:42] Data. [01:23:43] How do you collect this type of preference data? [01:23:46] The first idea is to just use humans, as we already discussed. [01:23:51] The guidelines for what humans should be labeling are very complicated, and it's really not easy. [01:23:55] If you ever do some of this labeling yourself, you'll see that it's extremely hard. [01:24:01] If I zoom in on this example: the question is "tell me about self-driving cars." [01:24:07] You read both answers — "self-driving cars are vehicles that are capable of detecting their surroundings," blah blah blah; "self-driving cars are cars that are equipped with sensors... to navigate without the need for a driver." [01:24:16] Both seem OK. [01:24:18] Which one is better? [01:24:19] It's actually hard to say at a glance. [01:24:21] And as a result, the problem with human labels is that you end up optimizing for a lot of surface-level features.
[01:24:28] For example, the second answer is longer. [01:24:30] I can guarantee you that most humans will choose the second one, even though maybe the first one is better — I don't know, I haven't read it carefully. [01:24:38] So, challenges with humans. [01:24:39] First, they're slow and expensive. [01:24:42] Second, as I just mentioned, it's hard for them to focus on the things that matter, like correctness; [01:24:47] people usually look at things that matter less, like form and length. [01:24:53] What I show here is that the more RLHF you do, the longer the outputs of the models become. [01:25:01] So if you've ever been annoyed by ChatGPT answering with super long responses, that is because of RLHF. [01:25:08] Annotator distribution shift: the distribution of annotators you use matters a lot, and you have to ask which humans we even want these models to represent. [01:25:20] Another question is crowdsourcing ethics. [01:25:22] A lot of the people who do this labeling are not paid well, and they have to go through a lot of toxic data, precisely because you want the model to avoid saying toxic things. [01:25:36] So crowdsourcing ethics, too. [01:25:40] So, many challenges with human data. [01:25:43] What we did, also last year, is the same move as Alpaca: there are challenges with humans, so maybe we can just replace them with LLMs. [01:25:58] You replace the human preferences with LLM preferences. [01:26:02] In this figure, the x-axis is the price we paid for collecting human data: [01:26:09] around $300 per 1,000 examples, and that was with Mechanical Turk workers, who are usually cheaper than some of the other annotation companies you could go through. [01:26:20] The y-axis is the agreement with the mode of other humans. [01:26:27] And what you see is that, as I told you before, labeling is really complicated: [01:26:30] humans agree with the mode of other humans only around 66% of the time on this binary task. [01:26:36] And it's not that these particular humans are bad — the five main authors of this paper tried to label the data ourselves, [01:26:43] and we only got around 67 or 68% accuracy, even after discussing for three hours how we should be labeling. [01:26:51] It's genuinely complicated; it's not an easy task. [01:26:54] And here I show many different models, and you see that models are much cheaper, and they can actually get higher agreement with the mode of humans than individual humans do. [01:27:04] The reason is that humans have a lot of variance, while models have essentially none: [01:27:08] models may be a little more biased, but they have less variance. [01:27:11] So it works surprisingly well.
[01:27:13] And now it's, kind of, the standard in the open-source community. [01:27:16] I think even in industry, a lot of people use both humans and LLMs to improve the collection of RLHF data. [01:27:24] This is the paper from last year, but honestly, by now the LLMs would be around this agreement, at something like 50x lower cost than humans — and with better agreement with the mode of humans than humans themselves. [01:27:39] OK. [01:27:39] That gets us to the evaluation of post-training. [01:27:45] This goes back to your question at the beginning of the lecture: how do you evaluate something like ChatGPT? [01:27:50] The answers it could give are basically unbounded, and it's not that there's one right answer; many answers are just as good. [01:27:59] So there are many challenges. [01:28:00] One: you can't use validation loss, because one method might use PPO and another DPO — the validation losses are not comparable. [01:28:08] Second: you can't use perplexity. [01:28:11] That's the thing I told you before: these models are not calibrated anymore; [01:28:16] they don't give you meaningful distributions, they just optimize for one output. [01:28:19] So you can't use perplexity to evaluate these models once they're aligned. [01:28:26] Third: there's a huge diversity of questions that humans might ask these models — [01:28:31] generation, open-ended QA, summarization, all of these things — so there's a lot to cover. [01:28:38] And the tasks are really open-ended, so it's very hard to automate. [01:28:42] That's what you were alluding to before. [01:28:45] So the idea is that instead of trying to come up with easily automated benchmarks, we take the questions that users actually ask these models in practice, [01:28:56] and we ask annotators to say which of two models gives the better output. [01:29:03] Basically the exact same setup as the RLHF preference data, but now used for evaluation. [01:29:10] Yes — I'm not sure I understand what you mean by "can't use perplexity" and "not calibrated." [01:29:14] RLHF is still doing next-token prediction, so why can't perplexity be used? [01:29:21] So, think about the optimal solution after doing PPO: it's basically a model that puts all of its probability mass on a single answer — essentially a delta. [01:29:30] It says there's only one sentence that can be generated for that question. [01:29:36] So if you evaluate it on an answer that is even slightly semantically different, it would assign it a likelihood of essentially zero. [01:29:44] In reality it's not that extreme — as you say, it's still a distribution — but it shows you that there's a fundamental issue with perplexity.
[01:29:51] Once these models go through PPO, they're not trained to do maximum likelihood anymore; they're trained to be policies. [01:30:04] So, probably the most common — or at least the most trusted — benchmark is what we call Chatbot Arena. [01:30:10] The idea: random users on the internet blindly talk with two chatbots, ask many questions, see the two answers, and rate which one is better. [01:30:23] You do that over hundreds of thousands of users, and you get actual preferences and a ranking of models. [01:30:30] You can go on Chatbot Arena right now and interact with these models. [01:30:35] One potential issue, just to highlight: the people who want to do this kind of thing are usually more tech-savvy, [01:30:44] so a lot of the questions are tech questions — discussing software errors, inquiries about AI tools, and things like that. [01:30:52] Another issue is cost and speed. [01:30:54] If you want to use something like this in your development process, it's too costly, because you would need to pay a lot of humans. [01:31:03] So one simple idea, as we've said many times now: use an LLM instead of humans. [01:31:10] You probably know the drill at this point. [01:31:13] For every instruction, generate outputs from some baseline and from the model you want to evaluate. [01:31:19] So here, imagine I'm comparing an answer from ChatGPT and one from Mistral. [01:31:24] I just ask another model — say GPT-4 — which one is better, [01:31:32] and I average that out over my entire benchmark. [01:31:39] That gives me a win rate: a win probability for one model compared to another. [01:31:44] And now you can rank models — this is the AlpacaEval leaderboard. [01:31:50] The benefit is that we get 98% correlation with Chatbot Arena — very high correlation with human rankings; this is the comparison of correlations with other benchmarks. [01:32:02] And it takes less than three minutes and less than $10 to run. [01:32:05] So it's pretty cheap.
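A sketch of this judging loop; judge() stands in for an API call to a strong LLM and is entirely hypothetical.

```python
def win_rate(instructions, model_outputs, baseline_outputs, judge):
    """judge(instruction, a, b) -> True if the judge prefers a over b.
    Averaging over the benchmark gives the model's win probability
    against the baseline. Real setups also randomize the answer
    order shown to the judge, to control for position bias."""
    wins = sum(judge(x, a, b)
               for x, a, b in zip(instructions, model_outputs,
                                  baseline_outputs))
    return wins / len(instructions)
```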
[01:32:06] There are downsides, though. [01:32:08] One of them is spurious correlations. [01:32:11] There are many; I'll just talk about one: LLMs prefer longer outputs. [01:32:19] Humans also prefer longer outputs, [01:32:21] but the issue with LLM judges is that once the bias is there, you keep optimizing against it. [01:32:26] With humans, at some point, if I ask a simple question and you give me five pages of answer, I'll say no, I don't like that answer. [01:32:32] But LLMs, if they have this bias they were trained with, will just keep preferring longer outputs. [01:32:37] So here we see the preferences, showing that both humans and models prefer longer outputs. [01:32:46] And here is another view on the initial AlpacaEval benchmark: [01:32:53] when we look at the win rate of GPT-4 against GPT-4 itself, the standard GPT-4 gets 50%, by definition, since we're comparing it to itself. [01:33:06] But if we ask GPT-4 to be slightly more verbose — we just add "be verbose in your answers" to the prompt — it gets a win rate of 64.4%. [01:33:15] And if we ask it to be concise, it gets 20%. [01:33:17] So there's huge variance depending on whether you ask it to be concise or verbose. [01:33:24] That's very annoying. [01:33:25] One possible solution, which is what we did, is to use some regression analysis. [01:33:31] I won't go into details, but basically you use causal-inference tools to control for length. [01:33:36] And now length matters much less: if you ask the model to be verbose, you still get some gains, but much smaller ones. [01:33:44] Great — that's all about post-training. [01:33:46] For the next eight minutes, I can talk about systems, or just answer questions. [01:33:51] Yes — can you go back to post-training? [01:33:57] How did we tune those parameters using such a small body of fine-tuning data and have such a big effect on the model? [01:34:05] You mentioned earlier that there's a different set of hyperparameters — are we changing just some of the weights, the later weights, or all of them? [01:34:12] What's actually happening? [01:34:14] Yeah, I kind of skimmed through this: you change all the weights. [01:34:17] In industry, they change all the weights. [01:34:20] In open-source land, you might have heard of LoRA, which changes only some of the weights — or, to be more specific, adds a small low-rank difference on top of the weights at every layer. [01:34:33] But in industry, you just fine-tune all the weights. [01:34:37] And to say something else about the data: for this last step, RLHF, you usually collect a lot more data than for SFT. [01:34:45] If SFT is like 5,000, 10,000, maybe 50,000 examples, with RLHF I think you're more around the one-million order of magnitude. [01:34:55] It's still much less than pretraining, though. [01:34:57] Right — pretraining is 15 trillion tokens; this is not even a drop, and yet it influences the weights a lot. [01:35:05] Well, you have to think about how you do it: as I said, the learning rate you use is different, and you train only on that data. [01:35:16] Just imagine I trained on a single sentence, but over and over again: at some point my model would only generate that sentence, even though it was one sentence instead of 15 trillion tokens.
[01:35:29] So if you use a large enough learning rate for enough time, you will basically overfit to that sentence. [01:35:35] The key thing to remember is that you don't mix some post-training data into the pretraining data: [01:35:43] you do pretraining, and then you start fine-tuning only on the post-training data. [01:35:48] Another perspective is that the pretraining is just the initialization of your model. [01:35:54] Once you view it that way — it's just an initialization of the weights — there's nothing special about it. [01:35:59] You don't need to remember that you trained on a lot of data before; the only thing that matters is that you had an initialization, and now you're actually training the model. [01:36:07] It's a Markov property, in some ways: these were your weights, that's my initialization, now I train from there. [01:36:14] Does that answer your question? [01:36:16] Kind of, but you said something just now about it being almost equivalent to rerunning the fine-tuning data many times — is that what actually happens, in order to give it so much influence? [01:36:33] I actually don't know how they do it in industry right now. [01:36:37] When we did Alpaca, we did three epochs, so we did run through the data three times. [01:36:44] But even the number of passes is not really the important thing; [01:36:48] what matters is the effective learning rate. [01:36:56] Great. [01:36:58] So I think I have five minutes. [01:37:06] OK, I'll try to give a high-level overview of at least one of the systems tricks. [01:37:14] Systems: as we said, compute is the huge bottleneck for everyone. [01:37:21] One question you might ask is: why not just buy more GPUs? [01:37:24] Well, GPUs are expensive, but they're also scarce: even if you have $10 million right now, you cannot just buy the best GPUs. [01:37:31] [INAUDIBLE] [01:37:33] There are also physical limitations: with multiple GPUs, you have to communicate between them, and that takes time. [01:37:40] So just buying more GPUs is not that easy, [01:37:43] and it's really important to think about how you allocate resources and how you optimize your pipeline — that's systems. [01:37:49] A quick 101 on GPUs — sorry, I'm going slightly fast; I hope at least some of you can follow. [01:37:55] GPUs are optimized for throughput; CPUs are optimized for latency. [01:38:01] The way to think about a GPU is that one command is run on many, many cores at the same time, on different pieces of data. [01:38:11] This is how you should picture a GPU: there are many cores, which we call streaming multiprocessors, and that's very different from the usual CPU architecture. [01:38:20] So for GPUs, just think high-throughput parallelization. [01:38:24] GPUs are optimized for fast matrix multiplication.
[01:38:27] Every time you do something on a GPU, if you can express it as a matrix multiplication, it's going to be something like 10 times faster than anything else. [01:38:36] That is a little bit annoying, because it means we're, kind of, constrained to doing everything with matrix multiplications. [01:38:44] Another thing to note about GPUs is that compute has been improving faster than memory and communication. [01:38:50] So right now, the data you send to a GPU has a hard time keeping up with its processors: [01:39:00] most of your GPU will actually sit idle if you just run normal, unoptimized code. [01:39:06] Communication is the bottleneck, and this will only continue over time. [01:39:10] Another thing to know about GPUs is that there's a memory hierarchy — the same is actually true of CPUs: [01:39:15] the closer the memory is to your cores, the smaller it is but the faster it runs; further away, there's more memory, but it's slower. [01:39:26] OK, actually, one more thing on this communication issue. [01:39:31] The metric people usually look at is model FLOP utilization (MFU): the observed throughput divided by the theoretical maximum number of FLOPs per second the GPU could run at. [01:39:45] In general, if you reach 50%, you're very happy. [01:39:49] I looked at it for Llama: Facebook was at around 45%. [01:39:52] So even for these big companies, the data doesn't come in fast enough. [01:39:58] So here's one simple trick — maybe the only one I'll have time to tell you about — low precision. [01:40:04] The idea is that if I put my floats in low precision, there are fewer bits I have to move around on my GPU. [01:40:12] Fewer bits means faster communication and lower memory consumption, so things go faster. [01:40:17] And for deep learning, it just happens that the extra decimal places are not that important. [01:40:22] When you do matrix multiplications, when you do SGD for example, there's already so much noise that if you update something by 0.01 or 0.015, who cares? [01:40:33] So instead of using 32 bits per float, which is what people used to use — or 64, which is what you'd use in other domains — you use 16 bits for the matrix multiplications. [01:40:49] For training, you have what we call automatic mixed precision, where some things are in 32 bits and others are in 16 bits. [01:41:00] Generally, the way to think about it is that the weights of your model are stored in 32 bits, [01:41:06] but just before the computation you cast everything to 16 bits, do the computation super fast, [01:41:12] and at the end you update your weights in 32 bits. [01:41:16] The reason you do the updates in 32 bits: if your learning rate is very small, you still want to be able to make a difference to your weights. [01:41:25] So all the computation is done in 16 bits, but the weights are stored in 32 bits. [01:41:30] That's the standard way people do it.
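This is the standard PyTorch automatic-mixed-precision pattern; model, loss_fn, opt, and loader here are placeholders.

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # keeps fp16 grads from underflowing

for batch, labels in loader:          # model, loss_fn, opt, loader: placeholders
    opt.zero_grad()
    # Inside autocast, the matmuls run in 16-bit; the master weights
    # that the optimizer updates stay in 32-bit, as described above.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(batch), labels)
    scaler.scale(loss).backward()
    scaler.step(opt)                  # the 32-bit weight update
    scaler.update()
# With bfloat16 instead of float16, the GradScaler is usually dropped.
```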
[01:41:35] OK, I'll actually talk about just one more thing, operator fusion, and then skip the rest, because I think this one is pretty cool. [01:41:40] As I just said, communication is very slow, and every line of PyTorch basically moves variables to and from the global memory of your GPU. [01:41:49] So when you have something like x1 = x.cos(), and then you call x1.cos(), what happens behind the scenes is that you take x, which is data sitting in the GPU's main memory, ship it to the actual processors of your GPU, apply the cosine, and ship the result back to main memory. [01:42:07] Then you hit the next line: ship the data back to the GPU processors, apply another cosine, and ship it back again. [01:42:15] Another way to see it is that you go from your DRAM, the global memory of your GPU, to compute and back for every single line. [01:42:24] This is the naive way of doing it, and it seems very wasteful. [01:42:28] So the simple idea of operator fusion is: communicate once, do all the computation, and ship the result back once. [01:42:35] And this is exactly what fused kernels are. [01:42:39] So if you ever want to make your computations in PyTorch much faster, just apply torch.compile to your model. That will typically make your model around two times faster. [01:42:51] What it does is rewrite your PyTorch code, basically into C++ and CUDA, so that the communication happens only once, then all the operations run, then the result is shipped back.
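As an illustrative sketch of exactly that cosine example (assuming a CUDA device is available):

```python
import torch

# Naive version: each line launches its own kernel, with a round trip
# through GPU global memory (DRAM) in between.
def naive(x):
    x1 = x.cos()
    return x1.cos()

# torch.compile traces the function and generates one fused kernel that
# reads x once, applies both cosines, and writes the result back once.
fused = torch.compile(naive)

x = torch.randn(10_000_000, device="cuda")
out = fused(x)  # first call compiles; later calls reuse the fused kernel
```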
[01:43:07] OK, I'm not going to have time to talk about tiling. Tiling is important. [01:43:11] Parallelization. Parallelization is important. [01:43:15] And mixture of experts. Mixture of experts is important. [01:43:18] Outlook. [01:43:19] There are many things we haven't talked about. We haven't talked about architectures, and we definitely haven't talked about inference. [01:43:27] There are many other things that are important with LLMs. What is the UI that you use? Arguably, the big novelty of ChatGPT was just having a simple UI for using it. Multimodality. All the misuses you could have. The fact that there might not be enough data on the internet to train all these models. The legality of data collection. So many other things. [01:43:45] If you are interested in these topics, I would suggest three classes. [01:43:49] CS224N is probably the one that touches least on LLMs, but it gives some background and historical context for all of the LLMs and covers adjacent material. [01:44:01] CS324, which I think is just called Large Language Models, has more in-depth readings and lectures on everything I talked about. [01:44:10] And CS336, which is Large Language Models from Scratch, where you actually build your own LLM. [01:44:16] It's an amazing class, also given by my two supervisors. [01:44:20] Very heavy workload, so be careful. [01:44:23] Great.