
10: Generative AI – Adapting LLMs with Parameter-Efficient Fine-Tuning

MIT OpenCourseWare · May 11, 2026
Transcript ~15042 words · 1:17:42
0:16
Okay. So, um, so let's continue the
0:19
journey we started last time. Um so what
0:22
we're going to do uh you know if you
0:23
remember in the last class we showed how
0:26
we can actually build an
0:27
autoregressive large language model, aka
0:30
a causal large language model um using
0:33
this idea of a causal encoder, a
0:36
transformer causal encoder and then we
0:38
showed how you can actually take a bunch
0:39
of sentences and use next word
0:41
prediction and just run it through and
0:43
boom, you get GPT-3. Okay, so that's what we
0:46
saw last time I want to point out a sort
0:49
of an important clarification slash
0:50
correction which is that when we work
0:52
with large language models uh unlike
0:55
when we work with BERT uh for instance
0:57
when we work with these kinds of causal
0:59
models actually uh when the contextual
1:01
embeddings come out you don't actually
1:03
have to use ReLU activations here you
1:05
can literally just run it through just a
1:07
single dense layer with linear
1:09
activations and then pass it into a
1:11
softmax and boom you're done okay so
1:13
that's how GPT-3 and all these models are
1:15
trained u and the other thing I want to
1:18
point out, which may not have been clear, is
1:21
that what what is coming out of these
1:23
this dense layer right this vector is as
1:27
long as your vocabulary
1:29
because only then when it goes into the
1:31
soft max you're going to get
1:33
probabilities which are as long as your
1:35
vocabulary which means that you get to
1:36
pick one word or token out of that
1:39
entire 50,000 long vocabulary
1:42
okay so so just I just want to point
1:45
that out because I think it's easy for
1:47
us to sort of get a little confused
1:49
because of this little difference
1:50
between the way uh masked language
1:53
models like BERT work and causal
1:55
language models like GPT-3.
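To make that head concrete, here is a minimal sketch (not OpenAI's actual code; the vocabulary size and embedding width are assumptions) of a contextual embedding going through a single linear layer and a softmax over the vocabulary:

```python
# Minimal sketch of the causal-LM "head": contextual embeddings -> one linear
# layer (no ReLU needed) -> softmax over the whole vocabulary.
import torch
import torch.nn as nn

vocab_size = 50_000   # assumption: ~50k-token vocabulary, as in the lecture
d_model = 768         # assumption: width of the transformer's embeddings

lm_head = nn.Linear(d_model, vocab_size)   # single dense layer, linear activation

# contextual embeddings coming out of the causal transformer:
# shape (batch, sequence_length, d_model)
h = torch.randn(2, 10, d_model)

logits = lm_head(h)                           # (batch, seq_len, vocab_size)
probs = torch.softmax(logits, dim=-1)         # one probability per vocabulary token
next_token = probs[:, -1, :].argmax(dim=-1)   # pick a token for the last position
```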
1:58
Okay, so now let's continue with we have
2:02
we know how to build GPT-3. So like what
2:05
about GPT and GPT-2, like what's up with
2:07
them? Why is GPT-3 so famous and not
2:10
GPT-2? Right? So turns out, well, first of
2:13
all, you folks know that GPT stands for
2:15
generative pre-trained transformer. Now,
2:17
like GPT-3,
2:19
GPT-2 and GPT-1 were trained in
2:22
basically the same fashion. Predict the
2:23
next word uh same fashion the same sort
2:26
of transformer stack except that GPT3
2:29
was trained on much more data because
2:31
the underlying transformer stack had
2:33
many more layers. Okay, so it is a much
2:36
bigger stack meaning lots more
2:39
parameters and therefore you need lots
2:41
more data to train it well. Okay, so
2:44
that was really the only difference. The
2:47
difference was literally one of scale,
2:49
scale of network and scale of data. And
2:53
unlike GPT and GPT-2, GPT-3, even though it
2:57
was trained basically the same way with
2:59
the same kind of network, it was one of
3:01
the situations where more became
3:04
different. Okay, there was almost like
3:06
some sort of phase change that happened
3:07
between two and three. Unlike GPT and
3:10
GPT-2, GPT-3 could do amazingly coherent
3:14
continuations of any starting prompt,
3:16
right? Um so for example, if you have
3:19
this little prompt which says the
3:21
importance of being on Twitter by Jerome
3:22
K Jerome who was a famous humorist and
3:24
then you give it this prompt, right?
3:26
Ending with the word it, it produces
3:28
this continuation which is really like
3:30
strikingly good. And if any of you have
3:33
read Jerome K Jerome and if you read
3:35
this thing, you'll be like, "Wow, that
3:36
actually sounds like Jerome K Jerome."
3:38
Right? So amazing continuations the the
3:41
but the interesting thing here is not so
3:43
much the continuation it's the fact that
3:45
the same prompt, you give it to GPT-2 or GPT,
3:47
it won't be very good. In
3:49
fact after the first one two or three
3:51
sentences it'll sort of become sort of
3:52
incoherent and meander and start
3:54
rambling this thing can keep faking it
3:57
for a long longer time right that's the
3:59
amazing thing that was unexpected re
4:02
researchers did not expect this okay and
4:05
but it wasn't good at following your
4:07
instructions
4:09
So for instance, if you ask it, help me
4:10
write a short note, introduce myself to
4:12
my neighbor. This is the kind of thing
4:14
it'll come up with. And you can actually
4:15
run it yourself. You can actually go to
4:17
GPT-3 on the playground. I think GPT-3 is
4:20
still available in the playground. U if
4:21
it is, you can actually start try
4:23
running these prompts. You will start
4:25
getting garbage very quickly, right? And
4:28
the reason so for example here, help me
4:29
write a short note. It says, what's a
4:31
good introduction to a resume? "Résumé" for
4:33
some reason has glommed down to "resume."
4:35
I have no idea why. Right? But the
4:38
reason it's doing stuff like this is
4:39
because a lot of the training data it
4:42
was trained on are basically lots of
4:44
lists of things.
4:46
So when you say for example um you know
4:49
the the the capital of Paris continue
4:52
it'll come back with the capital sorry
4:53
the capital of France continue it say
4:55
the capital of France is Paris the
4:56
capital of you know uh Hungary is
4:58
Budapest and so on. It just start coming
4:59
up with a list. So it's sort of very
5:02
list driven right? it thinks that you
5:04
you need to complete some sort of list,
5:06
right? That's what's going on here. And
5:07
so it's not very good. So it doesn't
5:09
realize that you're actually asking it
5:10
to do something specific.
5:12
So this is the problem when you have an
5:14
autocomplete thing which doesn't realize
5:17
what you're asking it. It just thinks
5:18
that you're it's just an autocomplete.
5:20
So um now in addition to these unhelpful
5:24
answers, it can also produce offensive
5:25
answers, factually incorrect answers and
5:27
so on and so forth. The list of bad
5:28
things it can do is long. So why does it
5:32
do that? Why does it produce unhelpful
5:33
answers? Well, you know, as you recall,
5:35
it was only trained to predict the next
5:37
word. It wasn't explicitly trained to
5:39
follow instructions, right? So, it
5:41
seems, you know, reasonable that if it's
5:44
simply trying to guess the next word
5:46
repeatedly, it can't really do anything
5:48
more. Like, how can it figure out that
5:50
there's an instruction that it needs to
5:52
follow, right? Unless the training data
5:54
on the net was all instructional, which
5:57
it clearly is not.
5:59
So light bulb idea, right? Let's
6:02
explicitly train it with instruction
6:04
data,
6:06
right? Let's just train it with
6:07
instruction data. And so OpenAI
6:10
developed an approach called instruction
6:12
tuning to do exactly this. Um, and this
6:15
paper is the paper that sort of was the
6:18
breakthrough. Okay, this is what
6:20
actually put ChatGPT on the map. So, and
6:24
it's very readable. So, I would
6:25
encourage you to check it out if you're
6:26
curious.
6:28
And so we had GPT, GPT-2, GPT-3, you
6:33
know, just bigger and bigger models
6:34
trained the same way. And then we run
6:36
into the problem that it can't handle
6:37
instructions. So we do instruction
6:39
tuning to get to 3.5, also called
6:41
InstructGPT. And then a small tweak
6:43
after that gets you ChatGPT. Okay. And
6:46
by the way, this step here, there are
6:48
really two things going on in this as
6:50
you will soon see. I'm just calling it
6:52
instruction tuning just to so that I
6:53
don't have to say some long thing every
6:55
single time. This is not a consistent
6:58
piece of terminology, so just
6:59
be aware of that, that's all. So, all right,
7:03
first step they got a bunch of people to
7:06
write highquality answers to questions
7:09
and they created about 12,500 such
7:11
question answer pairs. So for example
7:14
let's say this was the question explain
7:15
the moon landing to a six-year-old in a
7:17
few sentences. Believe it or not, GPT-3's
7:19
answer to that question was another
7:21
question
7:23
because it thinks there's a list of
7:24
questions it needs to autocomplete, right?
7:27
So, it comes up with explain the theory
7:28
of gravity to a six-year-old. It's like one
7:30
of those people when you ask them a
7:31
question, they ask you a question back,
7:32
right? So, what what they did is they
7:35
said, "Okay, let's create a nice answer
7:36
to this question." And here's a human
7:38
created answer. People went to the moon
7:39
in a big rocket, walked around, blah
7:41
blah blah, right? Much better answer to
7:43
that question. And so once you create
7:46
these 12,500 question answer pairs as
7:48
training data, we just trained GPT-3 some
7:52
more using next-word prediction as before.
7:56
No difference. So, so here is the input
7:59
explain the moon landing blah blah blah
8:00
blah. This is the question and then we
8:02
have the answer right there. And then we
8:05
we take that answer, move it to the
8:07
right and just shift it up
8:10
so that when it finishes "sentences," it
8:13
needs to predict "People." And then you
8:16
give it "People," it needs to predict "went,"
8:17
and so on and so forth. Just like we saw
8:20
before, the cat sat on the mat became
8:22
the cat sat on the cat sat on the mat on
8:25
the right shifted, right? That's what
8:27
makes prediction possible and necessary.
8:30
So that's what they did. This is
8:31
step one. Okay, same as before.
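A minimal sketch of that shifting, with a whitespace split standing in for a real tokenizer (the answer text is abbreviated; this is an illustration, not the actual training code):

```python
# Next-word prediction on an instruction-answer pair: the target sequence is
# simply the input sequence shifted by one position.
prompt = "Explain the moon landing to a 6 year old in a few sentences."
answer = "People went to the moon in a big rocket and walked around."

tokens = (prompt + " " + answer).split()  # stand-in for a real tokenizer

inputs  = tokens[:-1]  # ends at "sentences." -> model must predict "People"
targets = tokens[1:]   # shifted by one: "People", then "went", and so on

for given, predict in zip(inputs, targets):
    print(f"given ... {given!r}: predict {predict!r}")
```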
8:35
And once you do that, it turns out this
8:37
step is called supervised fine-tuning.
8:39
It really helped. GPT-3, once you
8:42
supervised fine-tuned it was much much
8:44
better at following instructions. But
8:45
there's a small problem with this
8:46
approach. It takes a lot of money and
8:49
effort to have humans write highquality
8:51
answers to thousands of questions,
8:53
right? It takes a lot of money. So the
8:56
question is, what can we do, right? What
8:59
is easier than writing a good answer to
9:01
a question?
9:03
Well, what? Okay. Uh, all right. Uh, how
9:07
about somebody from this side?
9:11
>> Yeah, Joseph.
9:13
>> Perhaps writing a question for an
9:15
answer.
9:16
>> Oh, that's actually a good one. Yeah.
9:17
Yeah, I like that. Um, so given an
9:19
answer, find find a question. And while
9:22
that is not what I'm going to talk about
9:23
here, that technique is actually used
9:25
very heavily in LLMs. Uh, and so but
9:27
that that's great. Very creative. Uh
9:29
Mark,
9:31
>> thumbs up. Thumbs down.
9:32
>> Sorry.
9:33
>> Thumbs up or thumbs down?
9:34
>> Thumbs up or thumbs down. Exactly.
9:36
Because all of us, everyone loves to be
9:38
a critic. It's much easier to be
9:40
a critic than to be a creator. Right. So
9:43
what do we do? We basically say, let's
9:46
rank answers written by somebody else.
9:48
Which begs the question, who's going to
9:50
write those answers? And that's where
9:53
there's a brilliant answer to that
9:54
question which is
9:57
Wikipedia,
9:59
Reddit.
10:04
We will just ask GPT3 to write the
10:06
answers.
10:08
It might be crap, but we don't care
10:10
because we can rank them.
10:12
So we ask GPT-3 to generate several
10:15
answers to the question. And how can we
10:17
generate several answers? Because we can
10:19
do sampling.
10:21
We can do sampling.
10:23
The fact that we had these stochastic
10:25
outputs because of sampling is now a
10:27
feature, not a bug. Okay, we create lots
10:30
of different answers to the question. We
10:32
feed it a question, get like three
10:33
answers out. Just run it three times,
10:36
get three answers out with a nice
10:37
temperature of like one or 1.1 or
10:39
something so that it's nice and random,
10:41
right? Um, and then we literally have
10:43
humans just rank them, do the thumbs up,
10:45
thumbs down, just rank them from most
10:47
useful to least useful. Okay, so this
10:51
step is a step two of instruction
10:53
tuning. So OpenAI collected 33,000
10:55
instructions, fed them to GPT-3, generated
10:57
answers and had humans rank them. And
11:00
once you do that, once you do this, you
11:03
can assemble a beautiful training data
11:05
set, right? And so basically what we
11:07
have is that we have an instruction and
11:09
let's say we have just two answers A and
11:10
B. And in practice you can have
11:12
many many answers which we rank but just
11:14
for simplicity I'll go with Mark's
11:16
thumbs up thumbs down sort of answer
11:18
which is let's assume only you have two
11:19
answers to every question right and so
11:22
and the human has said I prefer this to
11:24
that that's it right so we have a data
11:26
set now where the data point is
11:28
instruction preferred answer is A the
11:31
other answer is B yeah
11:36
>> um the thumbs up thumbs down uh
11:38
technique that we're talking is that why
11:40
We're attaching to now we also use
11:42
thumbs up thumbs down. It's using only
11:44
answers to train.
11:45
>> Exactly. Right.
11:46
>> Yeah. So yeah, all the models have the
11:48
thumbs up thumbs down stuff going on
11:49
somewhere. They are all collecting data
11:51
for this step.
11:53
>> Thank you.
11:53
>> Yeah. It's sort of the old adage, right?
11:55
Uh if you're not sure who the product
11:57
is, you are the product. So it's one of
11:59
those things. Yeah.
12:07
So if we understand correctly when we
12:09
see thumbs up thumbs down it does mean
12:12
that ChatGPT is going to train on our data
12:16
right
12:16
>> unless you opt out. Yeah. So if you
12:19
actually go to the ChatGPT controls, there
12:20
is something called data controls or
12:22
something you can toggle it to off but I
12:24
think when I last checked if you toggle
12:26
it to off you lose your chat history. So
12:29
they have hobbled that feature to
12:31
prevent people from setting it to off as
12:33
much as possible. Yeah, clever.
12:37
But you can opt out and if you use the
12:39
API as opposed to the web interface,
12:41
you're automatically opted out. So you
12:43
have to deliberately opt in. And if you
12:45
use the versions that are available
12:46
through Microsoft Azure and so on and so
12:48
forth, there are all kinds of very safe
12:50
controls and stuff like that. In fact, I
12:51
think the Microsoft co-pilot license
12:54
that MIT has uh I think the default is
12:56
opted out.
12:58
Okay. So to go here, once you have this
13:01
data point, you can build something
13:02
called a reward model. Okay. And this is
13:05
a very clever piece of work. So what you
13:08
do is you have an instruction, right?
13:10
You have a preferred answer and you have
13:12
the other answer. You feed it to a
13:15
network. Okay? You feed it to a network.
13:18
This is just a a nice language model,
13:20
right? It's just a language model. And
13:23
the language model produces a number
13:25
which measures how good this thing is,
13:28
right? How good an answer is this to
13:30
that particular instruction. So you get
13:32
two you get a rating here, you get a
13:34
rating here and then what you do is you
13:38
run it through a little loss function
13:41
which
13:43
essentially encourages the model to give
13:45
higher numbers to the better answer.
13:50
It's the same model. You just run the
13:51
the question and the first answer,
13:53
question and the second answer. You get
13:54
these two numbers. And then initially
13:56
those numbers are just random. But then
13:59
you tell the model, hey, this is the
14:00
preferred thing. Make sure the preferred
14:02
answers
14:03
uh rating the R value is higher than the
14:06
other number because more is better.
14:08
Higher is better. Okay? And you can
14:12
actually since you and this thing is
14:13
just a sigmoid here, right? It's
14:15
basically take the difference of these
14:16
two things. do a sigma and take the
14:18
logarithm and you can actually convince
14:20
yourself afterwards and I encourage you
14:22
to do that to to check for yourself that
14:25
if we actually
14:28
give a higher number to the better
14:30
answer the loss will be lower and since
14:34
we are minimizing loss we're essentially
14:36
training the network to always to try to
14:38
give higher ratings to better answers
14:41
that's it so that's the approach uh did
14:43
you have a yeah Ben
14:46
So you could imagine training um
14:49
training the model and only the good
14:50
answers is the idea of having both that
14:52
the model is actually learning what
14:54
makes good
14:54
>> correct. Exactly. Much like if you want
14:56
to build a dog cat classifier, you have
14:58
to show pictures of both.
15:01
>> Yeah.
15:02
>> So u I understand the feedback mechanism
15:05
of thumbs up thumbs down but there are a
15:06
lot of times when the popular response
15:10
is not the accurate one. So uh is there
15:12
a way that they actually have a layer to
15:15
correct?
15:16
>> Yeah, good question Swati. So uh as it
15:18
turns out um the all these companies
15:22
like OpenAI, they have like a huge
15:24
document 100 200 pages longs you know
15:27
very very bulky document which instructs
15:30
and teaches the labelers the rankers to
15:32
how to rank these things. So they have
15:34
to follow these very strict guidelines
15:36
to precisely handle like strange corner
15:38
cases and things like that. And that
15:40
document is on the web. You can dig it
15:43
up, right? And it's actually very
15:44
instructive to read through it, right? I
15:46
think they put it out on the web because
15:48
they wanted to convince people that
15:49
they're going to inordinate trouble to
15:50
make sure the rankings are actually
15:52
good. U do you have a question? Comment.
15:55
Okay. All right. So um so back to this
16:00
and how how do you train this thing? SGD
16:03
because you have a network it's coming
16:04
up with an answer you have some way to
16:06
know if that answer is good or bad right
16:08
better answers of lower loss back
16:10
propagation through the network keep
16:12
updating the weights and boom you're
16:13
done
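A minimal sketch of that pairwise loss, assuming the InstructGPT-style formulation loss = -log(sigmoid(r_preferred - r_other)); the numbers here are made up and stand in for the reward model's scalar outputs:

```python
import torch
import torch.nn.functional as F

def pairwise_loss(r_preferred: torch.Tensor, r_other: torch.Tensor) -> torch.Tensor:
    # If the preferred answer already scores higher, the sigmoid is near 1,
    # its log is near 0, and the loss is small; otherwise the loss is large,
    # so minimizing it pushes r_preferred above r_other.
    return -F.logsigmoid(r_preferred - r_other).mean()

# Initially the reward model's ratings are essentially random:
r_a = torch.tensor([0.1], requires_grad=True)  # rating of the human-preferred answer
r_b = torch.tensor([0.3], requires_grad=True)  # rating of the other answer

loss = pairwise_loss(r_a, r_b)
loss.backward()  # gradients nudge r_a up and r_b down; SGD does the rest
print(loss.item())
```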
16:15
okay and once you do that this reward
16:18
model can provide a numerical rating for
16:21
any any instruction answer pair you just
16:24
give it an instruction you give it an
16:25
answer right could be a crappy answer
16:27
good answer it just tells you how good
16:28
it is which means right So in this case
16:31
for example maybe it's going to give you
16:32
like a nice number 1.5 uh uh which is
16:35
you know 1.5 for this this answer but
16:38
then a better answer comes along or 3.2
16:41
right what we have done by doing this
16:44
whole thing this modeling is that we
16:46
have essentially we have learned how
16:49
humans rank responses
16:51
because we can only have humans rank
16:53
responses for some finite number of
16:55
questions. What we really want to do is
16:58
to do this to automate that ranking
17:00
process so that we can just do it for
17:02
like tens of thousands of questions
17:03
really fast. Right? So we have
17:05
essentially built a model of how humans
17:07
rank things, right? Which is beautiful.
17:10
A lot of the stuff here is all very
17:12
self-referential, which I find very
17:13
elegant. Anyway, so this can be used to
17:15
improve GPT-3 even further. So we take the
17:18
instruction as before, we feed it. It
17:20
gives you some answer and then we feed
17:23
this instruction and the answer to our
17:25
newly minted reward model. It gives us a
17:28
numerical rating and then this is the
17:30
key step. We take this numerical rating
17:32
and then we use this rating to nudge the
17:35
internal weights of GPT-3 in the right
17:37
direction. Right? This nudging
17:41
uses a technique called reinforcement
17:43
learning.
17:44
Right? Which just in the interest of
17:46
time we can't get into in this lecture.
17:49
But that that's a technique you use to
17:51
nudge these things in the right
17:52
direction.
17:54
So that's what we do. That's
17:56
reinforcement learning. We nudge it in
17:58
the right direction.
18:01
And OpenAI did this with 31,000
18:04
questions.
18:07
Okay. Nudge, nudge, nudge, nudge, nudge.
18:09
And when you do that, you get GPT-3.5,
18:11
a.k.a. InstructGPT.
18:13
Okay. Uh that's it. And now by the way
18:18
this step here is called reinforcement
18:20
learning with human feedback because we
18:22
use reinforcement learning, and since humans
18:24
rank the answers, which led to the
18:26
building of the reward model we get
18:28
human feedback. Okay, that's
18:29
reinforcement learning with human
18:30
feedback. Yeah.
18:33
>> Yeah. I have [clears throat] a question
18:34
regarding the the type of questions that
18:37
they're using. I can imagine like maybe
18:39
there are very simple questions to
18:42
answer because I'm thinking now you can
18:44
ask GPT, like, for example, respond to this as
18:47
a pirate or something like that that is
18:49
kind of it's going to be harder to train
18:51
if you have bunch of questions that are
18:54
having like small interactions and then
18:56
there is the question like
18:57
>> that's a good question. So the quality
18:59
of the questions in the data set clearly
19:01
is a big factor because if you have
19:03
simple simplistic questions it won't be
19:05
able to handle complex questions later
19:07
on. So it's a good question, and
19:09
that actually begs
19:12
the question of where did they get these
19:14
questions from
19:16
so they actually got it from their API.
19:20
So people were asking GPT-3 questions on the API
19:23
right before it became 3.5. The
19:25
API was already available,
19:26
you know, fully commercially
19:28
available a lot of people are building
19:29
products on it already by then and so
19:31
they collected all those questions and
19:33
filtered them for quality and that was
19:35
the question set that they used and then
19:37
they judiciously added to it with human
19:39
created questions but they couldn't do a
19:41
lot of that because it's expensive to do
19:43
that but collecting stuff that somebody
19:44
else is asking your API already very
19:46
easy
19:49
Yeah, Tomaso,
19:50
>> uh, this might be more of a
19:52
philosophical question, but, uh, the
19:54
human bias that's present in the small
19:56
subset of human labelers that they've
19:58
chosen gets eventually compounded in
20:00
this model that we often consider as the
20:03
source of objective truth.
20:04
>> Yes.
20:06
>> Yeah, that's very true. Um I think the
20:08
the reward model is probably very
20:09
faithfully learns all the biases of the
20:12
human labelers which is why they have
20:14
these very complex u sort of frameworks
20:17
and guidelines to try to prevent the
20:19
bias from happening to mitigate it. So
20:21
for example they might give the same
20:22
question and set of possible answers to
20:25
many many different labelers and only if
20:28
people pick the same ranking they might
20:30
use it so that at least inter labeler
20:33
bias can be minimized right but if
20:36
everybody's sort of biased in the same
20:37
direction it won't protect you against
20:39
that. Um so yeah in general there's a
20:41
whole work that's being done to try to
20:43
debias these things and build them
20:44
without you know too much bias in them.
20:46
It's like a whole world unto itself
20:48
which we just don't have time to get
20:49
into. Uh Olivia,
20:53
>> um depending on the medium that's being
20:56
returned by these models, would there be
20:57
more than one reward model? Because
20:59
isn't that what Gemini
21:00
>> would there be more than one
21:01
>> reward model? Because isn't this what
21:03
Gemini is running into issues with right
21:05
now with their image generation is the
21:08
bias that they try to
21:09
>> Yeah. So the Gemini business that's
21:11
going on, it's unclear what's causing
21:13
it. Um it may be in this step, maybe
21:16
they were a little overzealous in
21:18
preventing certain things from
21:19
happening.
21:20
Some of these uh systems also have um
21:23
they will actually intercept the
21:25
question that you ask and then route it
21:27
differently based on what they sense is
21:29
sitting around in the question. So there
21:31
could be pre-processing post-processing
21:32
a lot of stuff that goes on. So unclear
21:34
to me where in the pipeline and it could
21:36
be more than one place these things may
21:38
be entering. So yes, so here may very
21:40
well be where it actually enters a
21:42
situation where people are people are
21:44
told if you see any sort of this kind of
21:46
answer downrank it right don't uprank it
21:50
and then it learns that ranking very
21:51
faithfully and then proceeds to apply it
21:53
where it does should not be applied so
21:54
that does happen uh Joselyn you had a
21:56
question
21:58
>> um I think I still I still don't totally
22:02
understand why. So when I ask ChatGPT a
22:04
question even in a lengthy response it
22:06
doesn't wander away from the topic that
22:08
I'm asking about right and so
22:10
understanding that it it's predicting
22:11
each word it's sort of taking a random
22:13
walk from one word to the next in some
22:15
sense
22:15
>> but each word it utters
22:17
>> now becomes part of the input to the
22:19
next word it utters
22:20
>> right
22:21
>> so it's not truly random walk in that
22:23
sense so the next step is not
22:24
independent of the previous step
22:26
>> it depends on what it depends on the
22:27
journey so far so it's going to try to
22:29
be very consistent with the journey so
22:31
far
22:32
>> okay
22:33
>> does the
22:35
does this part with um sort of
22:38
fine-tuning it on these question answer
22:40
sets. Does this play some role in it
22:42
being able to constrain itself and not
22:44
meander away?
22:46
>> I don't think so. I think this is more
22:48
to make sure that you know it does the
22:50
weights generally tend to produce the
22:52
right answer. Now what one of the things
22:54
that is possible is that when when I'm
22:57
let's say I'm a ranker and I'm looking
22:58
at a few different answers I'm you know
23:01
I have to figure out if the answer is
23:03
helpful if it is accurate if it is uh
23:06
you know non-toxic right things like
23:08
that and part of the rubric for
23:11
evaluating these answers could be their
23:13
coherence right so it could also be that
23:16
they are saying short coherent answers
23:18
are better than long coherent answers
23:21
but once you adjust for length Maybe
23:23
coherence is more important, right? It
23:24
could be any number of these things. So
23:25
it could play a role in that.
23:26
>> So just sort of one small followup. So
23:28
in other words, when it's when it's
23:30
learning from these question and answer
23:31
pairs, it's able to look at
23:32
[clears throat] the whole response and
23:33
learn something about the whole response
23:35
rather than just one word at a time,
23:36
right?
23:37
>> Correct. Yeah. The the entire question
23:39
is being ranked.
23:40
>> Yeah.
23:40
>> Correct. Correct.
23:42
>> Yeah. On a related note, um when it's
23:46
generating a new word on a topic, does
23:48
the attention pertain to the entire
23:50
prior text or can you have like
23:52
traveling attention? So like last five
23:55
word.
23:56
>> So yeah, the short answer is yeah, you
24:00
can you can it's called sliding window
24:02
attention. It can be done. They
24:04
typically tend to do it not uh so much
24:06
because they want to focus more on the
24:08
the recent words, but more because it
24:10
actually makes it very compute
24:12
efficient. U that's why they do it. So
24:14
it's called sliding window attention.
24:16
You can Google it.
24:17
>> So normally it's full attention.
24:19
>> Normally it's full default is full
24:21
attention.
24:23
Okay. So that's what they did. Uh and
24:25
when they did that and by the way as I
24:27
think you pointed out that's exactly
24:29
what's going on. You're training the
24:30
reward model with these thumbs up and
24:31
thumbs down. U hold on the questions.
24:35
And so if you give it the same question
24:37
to GPT-3.5 / InstructGPT, amazing answer.
24:42
Okay, like night and day difference,
24:45
amazingly good answer. Um, and so and
24:48
then to go from 3.5 to ChatGPT, they
24:51
basically followed the exact same
24:52
playbook except that because they wanted
24:55
to have a chatbot, meaning something
24:58
that could carry on a question answer,
24:59
question answer pair as opposed to just
25:00
a single question and answer, they
25:02
wanted question answer question answer,
25:03
right? Conversation. They trained it on
25:05
conversations. That's it. Instead of
25:08
training it on instruction answer data,
25:11
they trained it on instruction answer
25:13
instruction answer instruction answer a
25:16
sequence of such things which are strung
25:17
into a conversation.
25:19
That's it. That is the only difference
25:21
to go from 3.5 to ChatGPT, and then now
25:25
ChatGPT, given you do that, it's giving
25:26
you a much nicer response and then you
25:28
can ask a follow-on question. Can you
25:30
make it more formal? Boom. It gives you
25:32
a nice response because now it knows
25:33
about conversations. It's been trained
25:35
on conversational data. So that's it. So
25:37
that's the whole thing, that's how they built
25:38
ChatGPT, right, and all the things we are
25:41
seeing later on are all sort of
25:42
continuations of this sort of approach.
25:45
So pause for a couple of quick
25:46
questions. Swati you had a question then
25:47
we'll go to you and then to you. Yeah.
25:50
>> So does that make a difference if a new
25:53
question pair question answer pair or a
25:56
new training data comes early in the
25:59
building of the model or later in the
26:01
building of the model 7 billion
26:02
parameters. That be good. You mean the
26:05
order of the questions does it matter?
26:07
>> So I might have like let's say 5,000 uh
26:09
images to start with. Now there after my
26:12
model is trained and developed now I
26:14
have a new use case that has come in.
26:17
Will that make a difference if I set it
26:18
in now?
26:19
>> So if you have a new use case for which
26:22
you want to essentially adapt the model
26:24
there's a whole set of techniques you
26:26
use which is going to be the next
26:27
section.
26:27
>> But it's not
26:29
>> yeah because what you have out of the
26:30
box is just a generally good chatbot. It
26:33
knows about a lot of stuff because it's
26:34
been trained on, you know, those 30
26:36
billion sentences, it can answer a lot
26:37
of questions reasonably well using
26:39
common sense and world knowledge. But
26:41
any specific use case like medical and
26:43
so on and so forth, it may not know. So
26:44
you'll need to adapt it to your
26:46
particular unique situation and that's
26:47
coming. U all right. Yes. Habit.
26:51
>> Uh what determines if a whole
26:54
conversation is ranked positively versus
26:57
a specific answer proliferating your in
26:59
your question?
27:01
>> Is it if the first answer doesn't get a
27:03
positive response but then after follow
27:05
the second one does. Is that is that
27:06
correct?
27:07
>> Exactly. So if you're a human and you
27:08
read the transcript of an exchange
27:10
between two people and I'm giving you
27:12
two exchanges which all start with the
27:14
same question, you'll be able to assess
27:15
which one is a better transcript. That's
27:17
basically what's going on. Uh there was
27:20
something here, right? Something. Yeah.
27:22
>> So I was wondering when you ask a
27:25
question very often it sounds kind of
27:27
like you tell that something was written
27:29
by not by an actual person. Do you think
27:32
that comes from the reinforcement
27:35
learning part or where do you think it
27:38
comes from in this?
27:40
>> It's a good question. I don't know
27:41
because I know that part of the
27:42
evaluation uh the ranking rubric are
27:44
used is to is to favor responses which
27:48
sound more humanlike than you know more
27:50
than robotlike. So if anything I'm
27:52
hoping that reinforcement learning would
27:54
actually make it sound more humanlike
27:55
because the rankers would have
27:56
prioritized that. So if you if it still
27:58
comes up with robotic stuff, you know,
28:01
it's something else that's going on.
28:02
Maybe I mean maybe the lot of text on
28:05
the internet is not literature. It's
28:07
just people writing some crap, right? So
28:09
could be that. Yeah.
28:13
>> How much of this instruction tuning or
28:15
conversational tuning is happening in
28:17
real time within a conversation? So
28:19
>> none of it.
28:19
>> None of it. So as you kind of give
28:22
feedback to the model, it's just
28:24
basically regenerating it like I don't
28:25
like that answer. come up with something
28:27
else.
28:27
>> No, it's not doing it in real time. Uh,
28:29
basically whatever signals you're giving
28:31
it with these thumbs up, thumbs down
28:32
business, that gets added to the
28:34
training logs and they periodically will
28:36
retrain it.
28:39
Uh, okay. So, by the way, this is
28:41
instruction tuning in a nutshell and I
28:42
want to point that out and you don't
28:44
have to read the whole thing, but just
28:45
to quickly point out this was where we
28:47
had to have human involvement, right? In
28:50
the first step, writing a lot of
28:51
responses to these questions and then
28:52
ranking the answers. So these two are
28:56
still human sort of labor intensive. Now
28:58
it turns out you can actually use helper
29:00
LLMs to automate this too,
29:03
right? This is not what OpenAI did in
29:04
the beginning with ChatGPT, but now you can
29:06
do it this way right because there are
29:07
lots of really good LLMs available for
29:09
you to automate many of these things. uh
29:11
we don't have time but if you're curious
29:12
I had a little blog post on this check
29:14
it out okay so now we come to the
29:17
question of well if you want to take a
29:20
base LLM like GPT-3 and make it useful
29:23
and respond to instructions, we have seen
29:24
that we had to adapt it with high-
29:26
quality instruction-answer data, right,
29:28
using supervised fine-tuning and
29:30
reinforcement learning with human
29:31
feedback, right? That's what made GPT-3
29:33
actually useful and became ChatGPT. By
29:37
the same token this holds true more
29:39
generally if you want to take large
29:41
language model make it useful for a
29:42
medical use case, a legal use case, some
29:44
other narrow business use case. You have
29:47
to adapt it with business domain
29:49
specific data. Okay. And so let's look
29:52
at techniques for doing so. All right.
29:54
So adaptation is sort of the rough name
29:56
for the process of taking a base large
29:57
language model and making it tailoring
30:00
it for your particular use case. And so
30:02
there's sort of this ladder of things
30:03
you can do, right? And we're going to
30:05
look at every one of them. So you can do
30:07
this thing called zero-shot prompting,
30:08
which is just: you literally ask the LLM
30:11
nicely, clearly, what you want, and maybe
30:14
it'll just give it to you. Okay. And this is
30:16
sort of the use case we're all used to
30:17
in the web interface right you can also
30:20
do something called few-shot prompting
30:22
where you ask it something and you also
30:24
give a few examples of the kind of
30:25
things you want right and that helps it
30:27
a great deal and then there is this
30:30
thing called retrieval augmented
30:31
generation and fine-tuning and we'll
30:33
look at all of them and I'll explain all
30:34
these things as we go along. Okay, so
30:36
let's start with zero-shot prompting,
30:38
where, by the way, the word "shot" here is
30:40
a synonym for "example." So, zero-example
30:44
prompting. You literally ask in the
30:45
prompt what you want without giving even
30:47
a single example. Okay. And so let's say
30:50
we want to build we want to look at
30:51
product reviews and build a detector to
30:54
figure out if the product review
30:55
contains, not sentiment (that's kind of
30:56
boring), uh, whether it contains some
30:59
description of a potential product
31:01
defect or not. Okay. And so here is
31:04
something I actually pulled off Wayfair
31:06
with apologies to Wayfair. Uh it says
31:08
here the curve of the back of the chair
31:10
does not leave enough room to sit
31:11
comfortably. Okay, sounds like a kind of
31:14
a defectish kind of thing, right? So
31:16
instead, back in the day, you would
31:18
have collected all these reviews and
31:20
built a special purpose NLP based
31:21
classifier to figure out defect yes or
31:23
no. Here you can literally just feed
31:25
this thing into GPT-3 and ask it: tell
31:28
me if a product defect is being
31:30
described in this product review and
31:31
then the curve at the back boom and then
31:33
it comes back and says yep that's a
31:34
product defect. Okay so this zero shot
31:37
you just ask a question you get the
31:38
answer back. Okay and it actually works
31:41
remarkably well, and the bigger
31:43
models tend to be much better
31:45
than the smaller, simpler models at
31:47
zero-shot prompting. Okay.
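A minimal sketch of a zero-shot call like the one just described (OpenAI Python library v1 style; the exact prompt wording and model name are assumptions, not the lecture's actual code):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

review = ("The curve of the back of the chair does not leave enough room "
          "to sit comfortably.")
prompt = ("Tell me if a product defect is being described in this product "
          "review.\n\nReview: " + review)

# Zero-shot: no examples, just a clear instruction plus the review.
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # e.g. "Yes, this describes a defect."
```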
31:50
All right. Now, when you adapt an LLM to a specific task,
31:52
obviously you need to carefully design
31:54
the prompt as you folks know this is
31:55
called prompt engineering and we're not
31:57
going to spend much time on prompt
31:58
engineering except I just want to give a
32:00
simple example. So if you actually ask
32:02
GPT this question: what is the fifth
32:04
word of the sentence very often it'll
32:07
give the wrong answer.
32:09
It's very strange why it can't get this
32:11
question right. It's a very
32:12
simple question. So if it's the fifth
32:14
word of the sentence is s right uh
32:17
sometimes it gets it right but very
32:18
often it'll get it wrong okay but now
32:20
you can do a little prompt engineering
32:22
and it'll always get it right. So for
32:23
example you can say I'll give you a
32:25
sentence first list all the words that
32:26
are in the sentence then tell me the
32:27
fifth word. Okay, here is a sentence, boom, it
32:30
gets it right. So it's an example of you
32:33
can help it along by being very very
32:34
prescriptive as to what you want it to
32:36
do and break down all the steps. Don't
32:38
make it guess things. It does a great
32:40
job. Okay. So anyway uh and there are
32:42
lots of other tricks people have figured
32:43
out over the the last couple of years.
32:45
Uh for for a long time this is pretty
32:47
hot where you say let's think step by
32:49
step. You tell it give it a question and
32:51
say let's think step by step. It
32:53
actually gives it a better shot at giving
32:54
you a good answer back an accurate
32:55
answer back. Uh now this kind of thing
32:57
is actually already baked into the
32:59
LLMs. So when you ask a question to ChatGPT,
33:02
your question your prompt gets appended
33:05
to what's called the system prompt and
33:07
the whole thing goes into the LM. You
33:09
never see the system prompt and the
33:10
system prompt is telling ChatGPT: think
33:12
step by step take your time don't blurt
33:14
out an answer stuff like that okay and
33:17
the system prompt, you can just Google it; the
33:18
system prompts have been jailbroken, you
33:20
can find it on the web
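A minimal sketch of that mechanism (the system text here is a short stand-in; the real hidden system prompts are much longer):

```python
# What the model actually receives: your prompt appended to a hidden system prompt.
messages = [
    {"role": "system",
     "content": "You are a helpful assistant. Think step by step, take your "
                "time, and do not blurt out an answer."},
    {"role": "user",
     "content": "What is the fifth word of this sentence?"},
]
# The whole `messages` list, system prompt plus user prompt, goes into the LLM
# on every request; the web interface never shows you the system part.
```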
33:22
so all right uh and and this is funny I
33:25
this came out maybe like a month or two
33:26
ago it says apparently take a deep
33:28
breath and work on the problem step by
33:29
step works better than saying work on it
33:31
step by step and then more recently I
33:34
literally read this two nights ago
33:36
apparently if you tell it if you have a
33:38
math or a reasoning question. You tell
33:40
it you are an officer on the starship
33:42
enterprise. Now solve this problem for
33:44
me. It's more likely to get it
33:46
right.
33:47
>> Go figure. Thomas,
33:48
>> I read two more that were super fun.
33:50
>> Yeah.
33:51
>> One I will keep you if you solve me
33:53
>> correct
33:54
>> and the other one was
33:56
an answer was I cannot do that
34:00
for answer was I tried on Gemini and he
34:05
it was the way to solve it. So
34:07
>> nice. both like back and forth charge
34:10
you did you want to say was to solve
34:11
this can you solve this
34:13
>> yeah very good excellent one of the
34:15
things just on that right let's have
34:16
some fun, you can say I'm going to
34:18
tip you a thousand bucks if you solve
34:19
this it says right so this person
34:22
apparently kept using this tip and at
34:24
one point it says you keep promising me
34:26
tips you never give me the tip so I'm
34:28
not going to solve this problem for you
34:31
yeah okay so and there are many prompt
34:34
engineering resources this one that came
34:36
out a couple of weeks ago which I
34:37
thought was pretty Good. So I just put a
34:38
link to it here. Um so now let's look at
34:41
few-shot prompting where you give it a
34:42
few examples. So here let's say we want
34:45
to build a grammar corrector. Okay. So
34:47
what you can do is you can actually give
34:49
it examples of poor English good
34:52
English. You can see right poor English
34:54
I eated the purple berries. Good English
34:56
I ate the purple berries. And similarly
34:58
three examples right and then you end
35:00
the prompt with just the poor English
35:01
input. And then the response from GPT-3
35:04
is a good English output and it says fix
35:06
the error.
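A minimal sketch of such a few-shot prompt (the wording of the middle examples is assumed; only the last poor-English input is left unanswered, so the model's continuation is the correction):

```python
few_shot_prompt = """Poor English input: I eated the purple berries.
Good English output: I ate the purple berries.

Poor English input: The mentioned changes have done.
Good English output: The mentioned changes have been made.

Poor English input: I'd be more than happy to work with you in another project.
Good English output: I'd be more than happy to work with you on another project.

Poor English input: He no went to the store yesterday.
Good English output:"""
# Sent to the model, this prompt gets continued with the corrected sentence,
# e.g. "He did not go to the store yesterday."
```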
35:09
So this is an example of giving a few
35:10
examples of what you want, and it just
35:11
learns on the fly what you have
35:13
in mind what your intention is. Okay. So
35:16
that's that. Now the ability of LLMs to
35:19
learn from just a few examples or even
35:21
no examples and just with a clear
35:23
instruction. This thing is called in
35:25
context learning and that was something
35:28
that GPT-2 and GPT could not do. That was
35:31
new in GPT-3 and what they call an
35:33
emergent capability right it is
35:35
completely unanticipated by the people
35:37
who built it and all right so that's
35:40
that now let's look at retrieval
35:41
augmented generation by the way this
35:43
thing is also called indexing sometimes
35:45
so the idea of it, it's
35:47
called RAG, and the idea of RAG is
35:50
actually very simple so let's say that
35:52
you know we want to ask a question to a
35:53
chatbot but we want the chatbot to
35:56
leverage proprietary data that we might
35:59
have maybe it's a customer call support
36:01
sort of in a call center kind of
36:02
operation and you have like this massive
36:04
FAQ database right content database and
36:06
you want to give that FAQ to the chatbot
36:09
along with your question so that it can
36:10
leverage the FAQ to answer the question
36:12
for you as opposed to like whatever
36:14
things it has learned previously in its
36:16
general training right so can't we just
36:19
include the entire FAQ the whole data
36:21
set into a prompt and set it in maybe we
36:24
just take our question take everything
36:26
we have potentially relevant to the
36:27
question everything we have in the data
36:28
set database just attach it to the
36:31
question. The whole thing becomes a
36:32
prompt. Feed it in and say, "Hey, find
36:34
out for me." Can't you just do that?
36:38
Theoretically, I think it stops us.
36:43
The reason you can't do it is because
36:44
this pesky thing called the context
36:46
window.
36:47
So, uh, for any LLM, the prompt plus the
36:51
output, right, the length cannot exceed
36:53
a predefined limit. This is called the
36:55
context window. Remember the max
36:57
sequence length we had in our earlier
37:00
models where that was the size of the
37:02
sentence that could be fed in right
37:04
basically there is a size of the
37:05
sentence for any of these things right
37:07
it's called the context window it's
37:08
there are only so many tokens it can
37:09
accommodate and since what comes in is
37:12
what comes out it is for both the input
37:14
and the output together okay that's
37:16
called the context window okay and um
37:20
and and and furthermore when you have a
37:23
conversation with one of these chat bots
37:25
the entire entire conversation is fed in
37:27
every single time.
37:29
That's how it actually remembers the
37:31
what's going on earlier in the
37:32
conversation. It doesn't have any memory
37:34
per se. Each time you ask a question,
37:36
the entire thread is fed in. Okay? So,
37:39
initially you say what's the square root
37:41
of 17, it gives you an answer.
37:42
Initially, you only send in the red
37:44
stuff. Then the next question you ask is
37:46
the first question, the answer, the
37:48
second question. All of them are fed in.
37:50
Then all these are fed in. So with the
37:52
conversation, you're consuming more and
37:54
more of the context window as you go
37:55
along.
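A minimal sketch of what gets sent on each turn (illustrative message contents; the point is that the full history is re-sent, so each turn eats more of the context window):

```python
# Turn 1: only the first question is sent.
history = [{"role": "user", "content": "What is the square root of 17?"}]

# Turn 2: the model's reply and the new question are appended, and the
# *entire* list is sent again; the model has no memory beyond this.
history += [
    {"role": "assistant", "content": "It is approximately 4.123."},
    {"role": "user", "content": "And what is the square root of 18?"},
]

# Prompt tokens (all of `history`) plus response tokens must fit inside the
# model's context window, so long threads eventually blow the limit.
```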
37:57
Okay. So can you imagine taking a whole
38:00
FAQ asking a question and saying, "Well,
38:01
I didn't mean that. I wanted something
38:03
else." And before you know it, boom,
38:04
you've blown out the context window.
38:05
It's going to come back and give you an
38:06
error.
38:08
>> You finished that you can't does it
38:10
together or does it take specific
38:14
windows of it?
38:15
>> Yeah. So there is a whole research
38:17
cottage industry around when your thing
38:19
is longer than the context window. what
38:21
do you pick? Uh so the simplest case is
38:23
you have a moving window, right? If if
38:25
you have thousand tokens, you just look
38:27
at the last thousand tokens. But there
38:28
are some cleverer schemes where you can
38:30
actually take the first stuff that is
38:33
outside the window that doesn't fit into
38:34
the window and use an other LLM to
38:37
summarize it for you and then you attach
38:39
it to your current prompt. I know it
38:41
gets crazy. So
38:43
uh okay. So for all these reasons, we
38:46
need to pick and choose what we can
38:47
send, right? To answer a particular
38:49
question. So what we do is since we
38:51
can't include the whole thing, we first
38:53
retrieve the relevant content from the
38:54
database or the FAQ and then send it to
38:57
the LLM along with a question we have.
38:59
Okay? So retrieval augmented sequence
39:02
generation. That's what's going on.
39:05
Make sense? And so pictorially
39:08
um basically what we do is let's say
39:10
that this is our external set of
39:12
documents. We take this are think of it
39:15
FAQ and then we take the FAQ and imagine
39:18
for each question and answer. We take
39:20
each question and answer in the FAQ and
39:22
then we we just we treat it as its own
39:24
little unit of text and then we actually
39:27
calculate a contextual embedding for
39:29
each of those question answer pairs.
39:32
Remember we know how to do contextual
39:33
embeddings, right? That's like it's a
39:35
piece of cake at this point, right? You
39:36
folks know how to do contextual
39:37
embedding. Run it through something like
39:39
BERT, you're done, right? You get you
39:41
get a context. So you get embeddings for
39:43
all the things that are in your FAQ. And
39:47
now when a new question comes in, right,
39:50
what you do is you take that question
39:52
and you calculate a contextual embedding
39:53
for that too.
39:56
And then what you do is you then look to
39:58
see which of the FAQ elements you have,
40:02
which of those chunks are the most
40:04
similar to your question.
40:07
Okay? And then you grab the ones that
40:09
are the most similar and then pack it
40:11
into the prompt and send it in. Maybe
40:14
you have 10,000 questions, but you can
40:16
only accommodate five of them in your
40:18
prompt because the context window is
40:19
very small. So you pick the five what
40:22
you think is the most relevant content
40:24
to your particular question and then you
40:25
feed it in.
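A minimal sketch of that retrieve step (embed_text is a hypothetical stand-in for a real embedding model such as BERT or the OpenAI embeddings API used later in the collab; the FAQ strings are made up):

```python
import numpy as np

def embed_text(text: str) -> np.ndarray:
    # Stand-in: in practice this calls a contextual-embedding model and returns
    # its vector for `text`. Here it's just a deterministic dummy vector.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(128)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

faq_chunks = [
    "Q: How do I reset my password? A: Click 'Forgot password' on the login page.",
    "Q: What is your refund policy? A: Refunds are issued within 30 days.",
    "Q: How do I contact support? A: Email support@example.com.",
]
chunk_embeddings = [embed_text(c) for c in faq_chunks]   # computed once, offline

question = "Can I get my money back?"
q_emb = embed_text(question)

# Rank chunks by similarity to the question, keep the top k that fit in the
# context window, and pack them into the prompt.
scores = [cosine(q_emb, e) for e in chunk_embeddings]
top_k = [chunk for _, chunk in sorted(zip(scores, faq_chunks), reverse=True)[:2]]
prompt = ("Use the context below to answer the question.\n\n"
          + "\n".join(top_k) + "\n\nQuestion: " + question)
```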
40:28
That's the idea that is retrieval
40:29
augmented generation. Yeah, Rolando. So
40:32
if does this tie in for example if I
40:34
were to prompt and say help me work on
40:36
my startup pitch but given the voice of
40:38
Steve Jobs is it then kind of going out
40:41
there and reducing the subset of of data
40:45
to things that have been written by
40:48
Steve Jobs and then it's kind of
40:49
generating it response based
40:51
>> uh not as a default not as a default
40:53
typically because a lot of Steve Jobs
40:54
stuff on the web it's just using that
40:56
because it's all part of its
40:57
pre-training data but this tends to be
41:00
more useful for very targeted
41:01
applications where you don't expect to
41:03
know the answer because it is not on the
41:05
public internet.
41:07
It's your proprietary data and you
41:09
wanted to use that proprietary data and
41:10
this how you do it.
41:12
Uh yeah
41:15
this certain
41:19
>> sure like that there will be some loss.
41:22
>> There will be some loss because you have
41:23
to figure out how to chunk it right. Uh
41:26
maybe you have a 300page PDF and then
41:28
maybe you look for each section and make
41:30
it a chunk. Maybe you look for each
41:32
paragraph, make it a chunk. Again,
41:33
there's a whole empirical sort of
41:36
cottage industry of techniques for doing
41:37
these things better or worse depending
41:39
on the use case and so on and so forth.
41:40
But the conceptual idea is chunk and
41:42
embed.
41:43
>> Chunking is another use.
41:46
>> Yeah. In fact, we going to do it
41:47
ourselves in the collab right now.
41:49
>> Yeah.
41:50
>> Can we give more weightage lecture? Uh
41:54
[laughter]
41:55
so in the default implementation no but
41:58
but in some sense you by picking the
42:00
five most relevant chunks from 10,000
42:02
chunks you're giving it giving the other
42:04
you know 10,000 minus five chunks a
42:06
weight of zero and these a weight of
42:08
one. So in some sense you're waiting it.
42:10
>> Yeah.
42:12
>> I was just curious how much structure
42:13
you have to have with an external
42:14
document say hospital or something. Do
42:16
you have to do a bunch of like lab?
42:19
>> No, you just need to make sure it's kind
42:21
of relatively clean. Uh but you will see
42:23
in the collab that it can be kind of
42:26
crappy and it still works. Yeah, because
42:28
there is so much crap on the internet
42:30
has been trained on already. So, okay.
42:33
So, all right. So, let's look at the
42:34
collab.
42:36
By the way, retrieval augmented generation
42:38
is, in my opinion, the most prevalent
42:41
business application of LLMs that I've
42:43
seen to date. And
42:45
there's a huge ecosystem of tools and
42:47
vendors and so on and so forth.
42:51
I'm going to skip through the verbiage
42:52
here. Um, so you have to um install the
42:56
OpenAI library
42:58
and this thing called tick token which
43:00
we'll get to in a in a bit. I've already
43:01
installed it before class because it
43:03
takes some time. So I'll just make sure
43:05
all these things are already
43:08
few good. So we don't have to wait for
43:10
this. So I've imported pandas as before
43:12
and so uh and you can read through these
43:15
things because I'm just basically you
43:17
know I have an OpenAI token that I
43:19
have to use, a key rather, an API key,
43:23
and I'm not showing you the key
43:24
obviously I have to remember to delete
43:25
it before I upload the collab uh you
43:27
have to get your own key to make it all
43:29
work uh but the instructions are here.
43:31
So we're going to use GPT3.5 turbo to
43:34
demonstrate rag right so I give it the
43:36
name of the model, and then OpenAI also
43:38
has a whole bunch of different models
43:40
which can be used for u you can feed it
43:43
a sentence or a chunk of text it'll give
43:45
you a contextual embedding out it's like
43:47
a nice little API you don't have to use
43:49
your own bird and so on and so forth you
43:50
can just use the open AI embeddings
43:53
obviously you have to pay openai every
43:54
time you make a request but it's really
43:55
really cheap at this point. Uh, yep, a
44:01
question? >> But
44:03
by dealing with proprietary data because
44:05
a lot of companies are like we need to
44:07
invest in our own LLM because we don't
44:09
want our data to be going down in this
44:11
kind of it context how good is the the
44:14
cyber security or the compliance and
44:16
legal
44:17
>> I think each vendor has their own sort
44:19
of set of rules and contractual
44:21
commitments they're willing to sign up
44:22
for so you just
44:23
>> if you use the data here does this go
44:25
into the public domain or no
44:27
>> but the vendor gets to see it
44:29
>> okay
44:29
>> right meaning the vendor systems get to
44:31
see it, but do the vendors employees get
44:33
to see it if they need to? Unclear.
44:36
Those are all the like the legally sort
44:38
of nitty-gritty you have to worry about.
44:39
The other thing you can do is you can
44:41
actually just download an open source
44:42
LLM and do it all within your own
44:44
premises.
44:46
That's totally possible to do, right? In
44:48
fact, um I probably won't have time
44:50
today. I have a whole section on how do
44:51
you actually do a fine-tuning with an
44:52
open-source LLM, which I'll do a video,
44:55
right, if you don't have time. U okay.
44:58
So, and so this model, this
45:01
embedding, ada-002, is the name of the
45:02
OpenAI model that actually gives you
45:03
contextual embedding. So, we're going to
45:05
use that. So, so first thing we want to
45:07
so the the use case here is that uh we
45:10
have taken a whole bunch we want to ask
45:11
the LLM we want to create a chatbot
45:13
which can answer questions about the
45:15
2022 Olympics like random questions you
45:18
might have about the Olympics. So, uh so
45:20
let's first ask it this question. Uh
45:24
we'll ask it about the 2020 summer
45:26
Olympics. Okay, that's the query and
45:29
then this is the the API um request we
45:33
have to make and you can read through
45:35
it. I have linked to the documentation
45:36
here as how it works and then it says
45:38
that Barshim of Qatar and Tamberi
45:41
of Italy both won the gold and you can
45:42
actually fact check this is actually
45:44
accurate. It's correct. Uh so now let's
45:46
change the query and ask it about the
45:48
2022 Winter Olympics. Okay. And why 22
45:51
versus 20 will become clear in just a
45:53
moment. So, which athletes won the gold
45:55
in curling
45:57
in the 22 Olympics? And it says the gold
46:00
medal in curling was won by the Swedish
46:02
men's team and the South Korean women's
46:04
team. Okay, turns out if you fact check
46:07
this, it turns out, wait for it, Sweden
46:12
won the men's gold. Yes, the South Korean
46:13
team participated, but Great Britain
46:15
actually won the women's gold. So, it
46:17
got it wrong. So, it sounds like GPT-3.5
46:19
Turbo could use some help. And now one
46:22
of the things we can do is so the thing
46:24
is the reason why GPT-3.5 Turbo didn't
46:27
know about this is because its training
46:29
cutoff date was September 2021.
46:32
So as far as it's concerned the 22
46:34
Olympics haven't happened yet
46:37
it confidently gave you the wrong answer
46:39
as it is often prone to do. So and this
46:42
is by the way is called hallucination
46:43
where it gives you a very eloquent
46:45
confident wrong answer. And so um
46:50
or as some folks have said about um
46:53
another business school that should
46:54
remain nameless often in error but never
46:56
in doubt. So um
46:59
all right back to this uh so one simple
47:02
thing we can try right off the bat is to
47:03
tell GPT-3.5 Turbo: you can ask it to say I
47:06
don't know if it doesn't know rather
47:08
than just make stuff up right and how do
47:10
you do it? It's very simple. You say in
47:12
your prompt, answer the question as
47:14
truthfully as possible. And if you're
47:17
unsure of the answer, say, "Sorry, I
47:18
don't know." Okay, now here's the
47:20
question. Okay, this is a query. So,
47:22
let's run it through.
47:25
Sorry, I don't know. Not bad, huh? So,
47:29
so it worked. It's sort of trying to be
47:31
humble and honest and, you know,
47:32
self-aware and things like that. Um,
47:35
it's more like a a Sloan at this point.
47:37
All right. So now the reason I as I
47:40
mentioned earlier there's a you can
47:41
check the cutoff date and you can see
47:42
it's 2021 actually you know what let me
47:44
just uh open a new tab
47:49
so all these cut off dates are training
47:50
data right so 3.5 turbo this is what we
47:53
are using, cutoff date 2021, okay, that's
47:56
why all right so now what we can do is
47:59
to to we can obviously provide relevant
48:01
data on the prompt itself sort of we can
48:02
leading up to RAG here and by the way
48:04
the extra information we provide in the
48:06
prompt to help it answer a question is
48:07
called context, right? That's sort of
48:08
the lingo for it. So, we can do it,
48:10
we'll first do it manually. Um, so we
48:13
first we'll use the Wikipedia article
48:15
for 2022 Winter Olympics and we tell it
48:17
explicitly to make use of this context
48:19
because telling things explicitly always
48:21
seems to help. So, this is the thing we
48:23
cut and pasted here, right? Wikipedia
48:25
article on curling and it's like a
48:28
pretty long article. It's got all kinds
48:30
of stuff and it's not even all that like
48:32
cleanly formatted, right? It's kind of
48:34
it's very strange. Look at that.
48:38
So, to answer your question,
48:39
Spencer. It can be, you know, in pretty
48:41
bad shape. It still seems to work. Okay.
48:44
So now use below article on the Olympics
48:46
to answer the subsequent question. If
48:47
you don't know, say you don't know.
48:49
Okay. So that's what we have. That's the
48:51
query. And by the way, before I send it
48:53
into the LLM, this is the actual query
48:55
that's going to be sent. I'm printing
48:56
out the query. Look at how long the
48:58
query is. Use the article below. And
49:00
here is the article. Scroll, scroll,
49:02
scroll. There's a whole thing, right?
49:04
And it keeps on going on. And then
49:05
finally, I say which teams won the gold.
49:07
So, okay, so let's run it.
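The manual version of this is literally just string concatenation before the API call, roughly like the sketch below (the file path and helper names are hypothetical; the article text is whatever you pasted from Wikipedia):

```python
# Sketch of manually stuffing context into the prompt (article text is a paste from Wikipedia).
wikipedia_article = open("curling_2022_wikipedia.txt").read()  # hypothetical local copy

query = f"""Use the below article on the 2022 Winter Olympics to answer the subsequent question. If the answer cannot be found, write "I don't know."

Article:
\"\"\"
{wikipedia_article}
\"\"\"

Question: Which athletes won the gold medal in curling at the 2022 Winter Olympics?"""

print(query[:500])  # peek at the (very long) prompt before sending it as the user message
```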
49:12
Okay, look at that.
49:15
Women's curling Great Britain. It got it
49:16
right. Pretty good, right? I mean, it
49:19
had to parse all that crap to get and
49:22
find the nuggets, right? So, nicely done
49:25
now. But maybe it wasn't super hard
49:27
because we literally gave it the answer.
49:28
So, let's make it a bit harder. So, I
49:30
noticed that this person, Oskar Eriksson,
49:32
won two golds in the event, two medals
49:34
in the event. So let's ask if any
49:37
athlete won multiple medals. That
49:39
requires a little bit of abstraction,
49:40
right? So all right, same query. Did any
49:44
athlete win multiple medals in curling?
49:46
The question has changed. Everything
49:47
else hasn't changed. Hit it. Let's see
49:50
what happens.
49:51
Yes, Oskar Eriksson won multiple medals
49:53
in curling. He won a gold in the men's
49:56
event and a bronze in the mixed doubles.
49:58
It's pretty cool, right? Take that
50:00
Google. So
50:02
all right now we come to retrieval
50:04
augmented generation, where instead of
50:05
doing it manually obviously because it
50:06
doesn't scale we will do it
50:07
automatically and so the thing you have
50:09
to remember as I mentioned just a few
50:11
minutes ago is that there is a context
50:12
window for every LLM, and for GPT-3.5
50:15
Turbo the context window is
50:18
16,385 tokens. That is the length
50:21
of the input and the output right so we
50:24
can't exceed that uh by the way GPT4's
50:26
context window is I think up to 128,000
50:29
tokens and GPT sorry Google Gemini 1.5
50:33
pro they really need to work on their
50:35
names Google Gemini 1.5 pro the context
50:38
window is 1 million tokens
50:40
okay and in research they have tested 10
50:43
million tokens so Crazy times. All that
50:46
means is that you can upload entire
50:48
videos and ask it questions about the
50:49
video. So all right to come back to
50:51
this. So what we'll do is we'll only
50:53
grab the data from the Wikipedia
50:55
articles the all the articles about the
50:57
Olympics that are relevant to our
50:59
question by using pre-trained
51:00
embeddings. So again this is the thing
51:02
we talked about earlier, right? This is
51:04
the picture we saw in class. And the the
51:06
only thing I want to point out is that
51:08
if you have a particular embedding for a
51:09
question and a particular embedding for
51:11
a chunk of text that you have in your
51:13
database, you have to figure out how
51:15
similar how related they are. And for
51:17
that we can use what
51:21
dot product, or something slightly, uh,
51:24
almost a dot product, which is
51:27
easier for us to work with: the cosine
51:29
similarity. We have we have done cosine
51:31
similarity previously. I've explained it
51:32
in class. We're just going to use cosine
51:34
similarity. How similar are these
51:35
vectors? So that's what we're going to
51:37
do. Um all right. So the same picture as
51:40
we saw in class. So the first we what
51:42
we'll do is we need to break up the data
51:43
set into sections and then take each
51:45
section and then run it through the
51:47
embedding thing. But fortunately for us
51:49
uh I have code here which actually does
51:50
it for you manually. You can play around
51:52
with it later. But OpenAI has already
51:54
given us the chunked data set. So we
51:56
just use that because it's just easy for
51:58
us. And I downloaded already because it
52:00
took it takes five minutes to download.
52:01
I've downloaded this thing and I've
52:02
stuck it in a particular data frame
52:04
here. So let's print out five randomly
52:07
chosen chunks. Um so you can see here
52:09
right this is the first chunk somebody
52:12
else somebody else this just and look at
52:14
all this crazy stuff here right the
52:17
formatting is off but these are all you
52:19
know basically paragraphs and sections
52:21
just grabbed straight from Wikipedia
52:22
with no cleaning.
52:24
Okay, now we define a simple function to
52:28
basically send in any arbitrary piece of
52:30
text into the embedding model and get
52:33
the contextual embedding vector out,
52:35
right? And there is this little function
52:36
that does that. Okay, u we using an
52:39
embedding model. We send in a text, it
52:40
gives you something. So let's try it on
52:42
"that is amazing." You should get a vector
52:45
back.
52:51
Oh, come on. Don't fail me now.
52:56
All right. How long is it? 1536. Um, so
53:00
how about I say hodle is incredible.
53:02
Like hodle is amazing. Hopefully the two
53:04
vectors would be kind of similar in
53:05
terms of cosine, right? So um and so to
53:09
calculate the cosine distance, I use
53:11
this particular function from SciPy. It
53:13
just calculates the cosine similarity
53:15
and I hit it. So 0.9934
53:18
maximum is one, right? So 0.9934 means
53:21
that they're very very similar. which is
53:23
comforting because amazing and
53:24
incredible are obviously synonyms. Uh,
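A minimal sketch of what that helper plus the similarity check might look like (assuming the openai v1 client and SciPy; the second test sentence here is illustrative):

```python
# Sketch: get a contextual embedding from the OpenAI embeddings endpoint and compare two of them.
from openai import OpenAI
from scipy import spatial

client = OpenAI()

def get_embedding(text: str, model: str = "text-embedding-ada-002") -> list[float]:
    response = client.embeddings.create(model=model, input=text)
    return response.data[0].embedding

v1 = get_embedding("that is amazing")
v2 = get_embedding("that is incredible")
print(len(v1))  # 1536 for ada-002

# Cosine similarity = 1 - cosine distance; close to 1.0 means the texts are semantically similar.
similarity = 1 - spatial.distance.cosine(v1, v2)
print(round(similarity, 4))
```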
53:27
okay so now given a data frame with a
53:29
column of text chunks in it we can use
53:32
this function on every one of these
53:33
things to calculate the embedding right
53:34
and you have a function here that
53:36
basically does it for you I'm not going
53:37
to run it uh because it takes a long
53:39
time so but you can run it later on uh
53:41
just be prepared go get a cup of coffee
53:42
and stuff while it does it uh but once
53:44
you, but happily for us, OpenAI has actually
53:47
already done this step for us so we
53:48
don't have to uh so it's already
53:50
available in this data frame so if you
53:51
actually Look at this. And you can see
53:53
here there is a text and then there is
53:56
an embedding that's right sitting right
53:58
there right next to it. Okay. And these
54:00
embeddings are, whatever, how long is
54:02
it? 1536. 1536-long vectors. Okay.
54:07
Um All right. So that's what we have.
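Given such a DataFrame, the per-chunk embedding step is roughly the following sketch (the CSV path is hypothetical; in the lecture the embeddings come pre-computed from OpenAI, so you only have to parse them):

```python
# Sketch: attach an embedding to every text chunk in a DataFrame.
# (Computing them yourself is slow and costs API credits; the lecture uses a pre-computed file.)
import ast
import pandas as pd

df = pd.read_csv("winter_olympics_2022.csv")  # hypothetical path to the pre-chunked dataset

# If the CSV stores embeddings as strings like "[0.01, -0.02, ...]", parse them back into lists.
if isinstance(df["embedding"].iloc[0], str):
    df["embedding"] = df["embedding"].apply(ast.literal_eval)

# If you only had the text column, you could (slowly) compute embeddings yourself:
# df["embedding"] = df["text"].apply(get_embedding)   # get_embedding as defined above

print(df[["text", "embedding"]].head())
```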
54:14
Okay. So now that we have this thing
54:16
whenever we get a question we calculate
54:18
the question's embedding and then
54:20
compare calculate its cosine similarity
54:22
with all the embedding sitting in this
54:23
data frame. Okay. So to do that we're
54:26
going to define a couple of helper
54:28
functions here. You can read through the
54:29
Python later to understand; these are
54:31
basic Python manipulations that are
54:33
going on. Um and so let's just test this
54:36
function. So basically we have a little
54:38
function called strings ranked by
54:41
relatedness where you give it any input
54:44
question or text and then it's going to
54:46
give you the top five most related
54:49
chunks of text that it has in its data
54:52
frame. Okay. So uh let me just run this
54:55
thing. Okay.
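A sketch of what a strings-ranked-by-relatedness helper could look like (the function name mirrors the lecture; the implementation details are mine):

```python
# Sketch: embed the query, score every stored chunk by cosine similarity, return the top-n chunks.
from scipy import spatial

def strings_ranked_by_relatedness(query: str, df, top_n: int = 5):
    query_embedding = get_embedding(query)  # same embedding model as the chunks
    scored = [
        (row["text"], 1 - spatial.distance.cosine(query_embedding, row["embedding"]))
        for _, row in df.iterrows()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    strings, relatednesses = zip(*scored[:top_n])
    return strings, relatednesses

strings, scores = strings_ranked_by_relatedness("curling gold medal", df, top_n=5)
for text, score in zip(strings, scores):
    print(round(score, 3), text[:80])
```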
55:00
So, curling: the things it pulls back had
55:02
better involve curling and medals and
55:03
so on. So this one has a cosine
55:06
similarity of 0.888: curling at the 22
55:09
Olympics. That's good. Result summary.
55:11
Medal summary. Result summary. It's all
55:13
pretty good, right? Even the fifth one
55:14
has a cosine similarity of 0.867, which is
55:17
pretty high. So it's doing the right
55:18
things. It's picked up, "curling gold
55:20
medal" was the input text. It's picked up the
55:22
right things from it. Um, now let's see
55:25
what we can do um
55:28
with the original question. So here is a
55:30
header I'm going to use in the prompt.
55:31
I'm going to say use the below articles
55:33
to answer the subsequent question.
55:35
Answer the questions as truthfully as
55:36
possible. And if you're unsure of the
55:37
answer, say sorry, I don't know. As
55:38
before. Okay, that's our prompt. Uh, and
55:41
now here's the thing. We don't want to
55:42
exceed the context window, right? So, we
55:44
need to count the tokens we're
55:46
sending in and the likely number of
55:48
tokens we're going to get back so that
55:49
we don't exceed the budget. So, we use
55:51
this package called tiktoken
55:53
for this. Uh, and then it just, you
55:55
know, helps you count the tokens. And
55:57
you can read through this. It's just
55:58
again some basic Python for counting
56:00
tokens. And now what we do is, this is
56:03
where we actually assemble the
56:05
prompt. We start with the header right
56:08
we have the header which says you know
56:09
be truthful and all that. Then we say uh
56:12
here is a question that you need that
56:14
I'm going to ask you and then you go in
56:16
there and keep grabbing Wikipedia
56:18
articles till the number of tokens in
56:21
your prompt is about to exceed your token
56:23
budget and then you stop. Right? When
56:26
you're about to exceed the budget you
56:27
stop because you can't exceed the
56:28
budget. Um, and that's that's the whole
56:31
thing. So here, uh, all right, let's
56:34
just do tiktoken. Run this function.
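Roughly, the token counting and budget-limited prompt assembly might look like this sketch (the 3,700-token budget and the header wording follow the lecture; the helper names are mine):

```python
# Sketch: count tokens with tiktoken and pack retrieved chunks into the prompt until the budget runs out.
import tiktoken

def num_tokens(text: str, model: str = "gpt-3.5-turbo") -> int:
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def query_message(question: str, df, model: str = "gpt-3.5-turbo", token_budget: int = 3700) -> str:
    header = ('Use the below articles on the 2022 Winter Olympics to answer the subsequent question. '
              'Answer as truthfully as possible; if you are unsure of the answer, say "Sorry, I don\'t know."')
    question_part = f"\n\nQuestion: {question}"
    strings, _ = strings_ranked_by_relatedness(question, df)
    message = header
    for string in strings:
        next_article = f'\n\nWikipedia article section:\n"""\n{string}\n"""'
        # Stop adding articles just before the prompt would exceed the token budget.
        if num_tokens(message + next_article + question_part, model=model) > token_budget:
            break
        message += next_article
    return message + question_part

prompt = query_message("Which athletes won the gold medal in curling at the 2022 Winter Olympics?", df)
print(num_tokens(prompt))
```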
56:38
Now, it turns out, as you saw, we can go
56:40
up to like 16,000-something, uh, tokens in
56:42
the context window. I'm just using
56:45
3,700 as my budget. Uh, partly because
56:48
just to show you how to use this thing.
56:49
Uh, and also because it's charging my
56:52
credit card for every token that I'm
56:54
using, right? So, I'm just being
56:56
careful. um it charges by the token.
56:59
It's a beautiful business model. Anyway,
57:01
so back here, so let's ask the question,
57:03
which athletes won the gold medal in
57:05
curling at the Olympics? Here is the
57:06
data frame that you should use. Here is
57:08
the GPT model and don't exceed 3,700
57:11
tokens. Okay, that's the the query or
57:13
the prompt. It's going to compose the
57:15
prompt now. And this is the whole
57:17
prompt. Okay. Uh let's just go to the
57:19
very top. It's really long.
57:24
Okay. So, all right. use the below
57:25
articles to answer the subsequent question as truthfully as
57:27
possible and boom boom boom boom boom it
57:29
has all these things, it's added a
57:31
whole bunch of paragraphs from the
57:33
Wikipedia pages okay and then it finally
57:35
ends with a question which athletes won
57:37
the gold okay all right now let's just
57:39
ask it the thing and this is just a
57:41
little function to to send stuff into
57:44
the API and now we are finally ready to
57:47
ask GPT the question. Fingers crossed.
57:53
all right curling
57:55
Stefania Constantini in the mixed doubles and
57:58
the team consisting of blah blah blah in
58:01
the the men's tournament and oh
58:03
interesting it has actually ignored the
58:06
Great Britain people completely I think
58:08
right? Uh, last night it didn't. Welcome to
58:12
stochasticity.
58:14
so you can try it when you try it might
58:16
actually give you the the thing um and
58:19
so let's ask it now a question about the
58:21
2016 winter Olympics uh which by the way
58:24
didn't happen there were no winter
58:25
Olympics in 2016. So if you ask it,
58:31
sorry I don't know. All right. Now let's
58:34
change the header so that we don't say
58:36
be truthful. So we will remove the need
58:38
for it to be truthful and see what
58:40
happens.
58:43
All right, which athletes won the gold.
58:50
Oh, now it's telling you about the 2022
58:53
Olympics. So it answered an irrelevant
58:55
question accurately.
58:57
Okay, if you remove the need for it to
58:59
uh to be truthful. So the I guess the
59:01
moral of the story is that um first of
59:04
all you can use RAG to grab stuff from
59:07
massive databases and it's very heavily
59:09
used in industry. Number one, number
59:10
two. Um you have to be careful about
59:12
these token budgets and so on and so
59:13
forth. Uh and small wording changes in
59:16
the prompt can actually dramatically
59:18
alter behavior which makes it very
59:20
difficult in enterprise settings to do
59:21
QA on this stuff. Okay. Uh so a lot of
59:25
care has to go into it. Uh you know and
59:27
you have seen examples of for example
59:29
Air Canada had a chatbot which actually
59:30
gave the wrong advice to a customer. The
59:32
customer sued Air Canada and then the
59:34
court ruled in favor of the the
59:35
passenger and then they pulled the
59:37
chatbot off the website. Right? So you
59:39
got to be very careful. I think without
59:40
a human in the loop checking these
59:42
answers it's kind of dangerous in my
59:43
opinion at this current state. Hopefully
59:45
it'll get better but you have to be
59:47
careful. There's a lot of potential, but you have
59:48
to be careful. All right. So this
59:51
is what we have. Um, and you can
59:52
actually take this thing here and use
59:54
it. Um, you can actually, you know, take
59:57
like a thousand-page PDF that you might
59:58
have or something and then chunk it and
1:00:00
use this approach. And I've done it for
1:00:02
a whole bunch of different things. It
1:00:03
actually works really well, right? Most
1:00:04
of the time it'll make errors here and
1:00:05
there. Most of the time it actually
1:00:07
works really well. Okay. So, um, yeah.
1:00:11
>> Sorry, just a question. When, like,
1:00:14
GPT-4 now lets you upload PDFs, is it
1:00:18
chunking that or is it actually
1:00:20
ingesting all the
1:00:21
>> No, when you upload something because
1:00:22
GPT-4 Turbo has 128,000 tokens which
1:00:25
means it can accommodate a whole lot
1:00:27
of documents. So when you upload stuff
1:00:29
it's not doing any chunking. The chunking
1:00:31
you're talking about you have to do. The
1:00:32
LLM doesn't even know you're doing it.
1:00:34
As far as the LLM is concerned, it's
1:00:36
only seeing the prompt it sees and the
1:00:38
prompt says, "Hey, here's a bunch of
1:00:39
information. Here's a question. Answer
1:00:40
it for me using this information. Be
1:00:41
truthful." That's it.
1:00:44
Now when you ask these things a question
1:00:46
um which is later than its training
1:00:49
data, you will actually see GPT-4 say it's
1:00:51
doing a Bing search and things like
1:00:53
that. What's actually going on is
1:00:55
there's a pre-processing step
1:00:58
and a program which is doing a Bing
1:00:59
search, gathering a bunch of Bing
1:01:01
results, taking the top few results,
1:01:04
chunking, embedding, packing into a
1:01:06
prompt, sending it into GPT-4, and you
1:01:08
don't know what's all this is going on
1:01:10
under the hood. But that's actually so
1:01:11
when it's actually thinking and saying
1:01:12
Bing search, this is what's going on
1:01:13
under the hood.
1:01:19
Was was there a question somewhere here?
1:01:21
No. Oh, sorry. Yeah.
1:01:24
I have a question about formatting.
1:01:26
Yeah. So, it seems to be able to
1:01:29
understand and ignore irrelevant
1:01:31
formatting even though there's
1:01:33
colloquial tables, not really defined
1:01:35
tables. And also when it outputs
1:01:38
formats, it's able to do it really
1:01:40
humanly. Is that something that's
1:01:44
figuring out through the neural network
1:01:46
or just something that's kind of being
1:01:47
programmed in ahead of time somewhere as
1:01:49
standard?
1:01:49
>> There is no explicit programming going
1:01:51
on. It's typically because a lot of the
1:01:53
question-answer pairs that were used
1:01:54
for supervised fine-tuning and
1:01:56
instruction tuning and reinforcement
1:01:57
learning, right? The better answers with
1:02:00
the same sort of badly formatted input,
1:02:02
the better answers are just rewarded,
1:02:04
ranked higher. That's what's going on.
1:02:06
But on a related note, what one thing
1:02:08
that's very useful is that uh you can
1:02:10
actually ask it to give you the
1:02:12
answer back using certain formats like
1:02:14
markdown and JSON and things like that.
1:02:16
And by forcing it to adhere to a certain
1:02:19
well-defined format, you actually
1:02:21
increase the chance of it actually
1:02:22
getting the right answer in the first
1:02:23
place.
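For example, with model versions that support JSON mode you can request structured output along these lines; a hedged sketch (the schema and model version are illustrative, and client is the chat client from the earlier sketches):

```python
# Sketch: ask the model to reply in JSON. The response_format option ("JSON mode") is only
# supported on newer gpt-3.5/gpt-4 model versions; the keys below are purely illustrative.
response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Reply in JSON with keys 'event' and 'gold_medalists'."},
        {"role": "user", "content": "Which athletes won the gold medal in curling at the 2022 Winter Olympics?"},
    ],
)
print(response.choices[0].message.content)
```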
1:02:24
Uh again, there's like a whole tangent
1:02:26
here we can go into, but those are some
1:02:28
of the things that uh are part of prompt
1:02:30
engineering. All right, so that's what
1:02:33
we have here. Back to the PowerPoint.
1:02:40
So that's retrieval-augmented generation
1:02:42
and we finally come to fine-tuning. So
1:02:46
fine-tuning is when up to this point all
1:02:49
the things we have seen don't alter the
1:02:51
internals of the LLM. You have not
1:02:54
messed around with the weights or changed
1:02:55
them at all. You're just using it
1:02:56
as a black box. Right? With fine-tuning
1:03:00
you actually will train it further
1:03:01
meaning the weights are going to change.
1:03:04
Okay. So now remember we take something
1:03:07
like a causal LLM like GPT right uh
1:03:11
and then, and I haven't fixed this
1:03:13
yet. There is no ReLU here as I
1:03:15
mentioned earlier okay just remember
1:03:17
that
1:03:19
and then if you have domain specific
1:03:21
input output examples like input and
1:03:23
output you can just train it like this
1:03:25
okay input and then the shifted output
1:03:28
uh and that will update these weights
1:03:31
right all these weights so this is
1:03:33
basically fine- tuning exactly like we
1:03:34
saw with BERT and so on and and even
1:03:37
with restnet it's the same sort of thing
1:03:39
okay that is fine-tuning now before we
1:03:42
discuss the mechanics how to do I want
1:03:43
to look at a show you a quick example of
1:03:45
the usefulness of finetuning. So, so
1:03:48
imagine for a sec that we want to
1:03:50
generate synthetic product reviews
1:03:53
from product descriptions.
1:03:55
So we are building some product which
1:03:57
can simulate customer behavior in
1:03:59
e-commerce and for that we need to be
1:04:01
able to generate the kinds of reviews
1:04:03
that customers might come up with right
1:04:05
and writing a lot of reviews is very
1:04:07
time-consuming. So what you
1:04:09
can do is you can get a whole bunch of
1:04:10
product descriptions right from the
1:04:12
internet. So let's say you ask an LLM,
1:04:14
hey write a positive product review
1:04:16
using this information here, product
1:04:18
description here and it comes up with
1:04:19
this timeless, authentic, iconic, right?
1:04:24
Seriously, do product reviewers actually
1:04:26
write stuff like this? No. This looks
1:04:28
like marketing copy, right? This reads
1:04:31
like marketing copy because there's a
1:04:33
whole bunch of marketing copy on the
1:04:34
internet. So it's not good. It doesn't
1:04:36
feel like a review. It's not authentic,
1:04:38
right? Um, here's another example for
1:04:41
Urban Outfitters, and it says, uh, the
1:04:44
the boxy and cropped silhouette is
1:04:46
flattering on all body types. Come on.
1:04:50
Okay, so it's not going to work. So,
1:04:52
what we do is we fine-tune the LLM. We
1:04:55
can take an LLM and we can fine-tune it
1:04:57
with instruction, product description,
1:05:00
and product review examples.
1:05:02
Okay, that's what we can do. So for
1:05:05
instance we can take something like
1:05:06
this. Uh let me zoom into this thing.
1:05:14
So it says here write a positive review
1:05:17
for the following product and then you
1:05:19
can have the review. This is the
1:05:20
description is the input and the output
1:05:22
is: the best, my husband's favorite.
1:05:24
They fit well. Right? They feel like
1:05:26
product reviews. So you just have to get
1:05:28
a few hundred of these product review
1:05:30
examples. Okay just a few hundred. Um
1:05:33
and you may not even need that much. And
1:05:35
once you do that,
1:05:37
once you do that, you basically do uh
1:05:40
used to fine-tuning like I showed
1:05:42
earlier, you know, in instruction,
1:05:45
input, output, and then you take that
1:05:46
output and shift it a bit and make it
1:05:48
the actual label, the actual output.
1:05:50
Fine tune, fine tune, fine tune, fine
1:05:51
tune a bunch of times, gradient descent,
1:05:53
weights gets updated. Now you have a new
1:05:55
LM, an updated LLM. And when you do that
1:05:58
now for the same things, here's what you
1:06:00
get. Write a review. These are the best
1:06:02
jeans I've ever owned. I am whatever
1:06:04
some details. I've been wearing them for
1:06:06
a few weeks. They still look brand new,
1:06:07
right? It looks much better. Doesn't
1:06:09
look like marketing.
1:06:11
This is completely fake. By the way, it
1:06:13
came up with it after the fine tuning.
1:06:15
And then we say, "Write a horrible
1:06:16
review because we want to be balanced.
1:06:18
These are the worst jeans I've ever
1:06:20
worn. They're too tight here and there.
1:06:22
I'm going to return them and try a 30,
1:06:23
but I'm not optimistic.
1:06:25
I'm going to stick with Levi's." Phew.
1:06:27
Okay.
1:06:29
So, that is So, these read like real
1:06:31
reviews. So just by taking a few hundred
1:06:33
examples and fine-tuning it, it
1:06:34
completely changes the the behavior that
1:06:36
you want for your particular use case.
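If you were doing this through, say, OpenAI's fine-tuning API, each of those few hundred examples would be one JSONL line along these lines; a sketch (the product description and review text are made up, not the lecture's actual data):

```python
# Sketch of one fine-tuning example in OpenAI's chat-format JSONL (field values are illustrative).
import json

example = {
    "messages": [
        {"role": "system", "content": "You write realistic customer product reviews."},
        {"role": "user", "content": "Write a positive review for the following product:\n"
                                    "Slim-fit stretch jeans, mid-rise, five-pocket styling."},
        {"role": "assistant", "content": "Best jeans I've owned in years. They fit well, and "
                                         "wash after wash they still look brand new."},
    ]
}

# Append one training example per line to the JSONL file used for fine-tuning.
with open("review_finetune.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```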
1:06:38
That's the key thing. So for me, the
1:06:40
biggest sort of benefit here is that
1:06:43
while it took billions of sentences for
1:06:45
pre-training the original LLM and then
1:06:47
it took tens of thousands of examples to
1:06:49
do supervised fine-tuning and RLHF and
1:06:52
so on and so forth, for you for it to
1:06:55
make it work for your narrow business
1:06:56
use case, you only had to spend a couple
1:06:59
hundred examples. That's it. It's
1:07:02
amazing. Imagine that if you had to, you
1:07:04
know, collect like 30,000 examples to
1:07:06
make it work. Nobody's going to do these
1:07:07
things. It's too much work. But a couple
1:07:10
of hundred anybody can do. That's why
1:07:12
it's so powerful to finetune these
1:07:14
things. Yeah.
1:07:16
You talked about being able to um you
1:07:19
know, in industries where you you don't
1:07:22
want to put some of this stuff on the
1:07:23
internet, downloading uh the pre-trained
1:07:26
model and being able to do this on your
1:07:28
own. would you still need talking about
1:07:30
computer power some of the computers we
1:07:32
have now GPUs I don't know how they are
1:07:35
um are you able to do some of these very
1:07:37
small use cases on those types of
1:07:39
devices
1:07:40
>> Perfect question uh Ike, I mean we're
1:07:42
going to get to that because the short
1:07:44
answer it's hard yeah just a few hundred
1:07:46
examples but actually trying to
1:07:47
fine-tune these big models on consumer
1:07:50
grade hardware is actually not easy so
1:07:52
you have to make certain tricks and
1:07:53
simplifications which is the next topic
1:07:56
uh yeah
1:07:57
>> is tuning always supervised like you
1:08:00
need those pairs or could you do it if
1:08:02
the company has like less structured
1:08:05
data?
1:08:05
>> No, you can. The thing is it depends on
1:08:07
whether you want to make it generally
1:08:09
smart about the company's sort of
1:08:11
business details in which case you can
1:08:13
just take a whole bunch of text and just
1:08:14
do next-word prediction on it. It's
1:08:16
going to get generally smarter about
1:08:17
things. But it doesn't mean it's going
1:08:19
to specifically follow your instructions
1:08:20
on your particular business problem. So
1:08:23
if you wanted to follow instructions,
1:08:24
you need supervision.
1:08:27
Okay. So all right these three are great
1:08:29
reviews. So for small LLMs like GPT-2,
1:08:32
fine-tuning isn't difficult, to go to
1:08:35
your question. You can actually do this
1:08:36
with small models. So like for example
1:08:38
Google had this has released this thing
1:08:40
called Gemma which came out recently.
1:08:41
It's a small model like two billion
1:08:42
parameters or something if I remember
1:08:44
the smallest one and those things will
1:08:46
typically fit into uh thank you. Uh
1:08:50
those things will typically fit into
1:08:52
like one GPU and you can fine-tune it.
1:08:54
You still need GPUs just to be clear. uh
1:08:56
they will actually fit into one thing.
1:08:57
But if you want to use a larger model,
1:08:59
it won't fit. So to make this work, you
1:09:02
have to do other things and that's what
1:09:03
we're going to talk about now. So but
1:09:05
this there's a family of models called
1:09:07
Llama Llama 2. These are open source uh
1:09:10
LLMs and they are widely used for
1:09:12
fine-tuning, right? Because you can just
1:09:14
download the model and just do whatever
1:09:16
you want with it, right? It's open. uh I
1:09:18
mean it's not strictly open because
1:09:20
there are some you know footnote
1:09:22
considerations you got to worry about
1:09:23
but for most purposes it's open enough
1:09:26
uh in my opinion and so what we let's
1:09:29
see how hard it is to build the biggest
1:09:30
model in this family which is the llama
1:09:32
2 model with 70 billion parameters okay
1:09:35
70 billion parameters so first of all
1:09:37
the model is gigantic so 70 billion
1:09:40
parameters each parameter is let's say
1:09:42
we store it in two bytes per parameter
1:09:44
right? Uh, and then for each of these
1:09:48
parameters, actually, we will need a
1:09:50
multiplier on each parameter to store
1:09:52
various details about how the
1:09:53
optimization is done okay we know we
1:09:56
won't get into the details here the the
1:09:57
one thing I do want to point out is that
1:09:59
um this 3 to four uh should really be 1
1:10:02
to six right u so I I had I didn't have
1:10:06
a chance to change it this morning but
1:10:08
but the point is that it's going to be a
1:10:09
huge model right so even with this
1:10:12
number it's going to be like 420 to 560
1:10:14
gigabytes just to hold the model in
1:10:15
memory and manipulate it and So if you
1:10:18
use a GPU like an A100 GPU or an H100 GPU
1:10:21
which are all Nvidia GPUs,
1:10:23
each of these things typically has 80 GB
1:10:25
of RAM memory. So we need between six
1:10:28
and seven to accommodate this thing. Six
1:10:30
to seven GPUs just to accommodate this
1:10:32
thing. So that's the first problem. The
1:10:34
model is big just to hold it and work
1:10:35
with it. You need lots of GPUs. The
1:10:37
second problem, Llama 2 was trained on
1:10:40
two trillion tokens of text.
1:10:43
Two trillion tokens of text. So these
1:10:46
GPUs can process about 400 tokens per
1:10:49
GPU per second. By process, I mean the
1:10:51
forward pass through the network. Okay?
1:10:54
And so if you actually use seven GPUs
1:10:57
with all this thing, it's going to take
1:10:58
you 8,000 days, right? Let's say we want
1:11:01
to do it in about a month, you need
1:11:03
roughly 2,000 GPUs, and at a cost of around $2.25
1:11:08
per GPU per hour, this will cost you about $4
1:11:10
million.
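The back-of-the-envelope arithmetic behind those numbers is roughly this sketch (using the rough throughput and price figures quoted on the slide, not measurements):

```python
# Back-of-the-envelope pre-training cost estimate (rough figures from the slide, not measurements).
tokens = 2e12                   # ~2 trillion training tokens for Llama 2
tokens_per_gpu_per_sec = 400    # rough per-GPU throughput
gpus = 7                        # roughly enough 80 GB A100/H100s to hold the 70B model

days_with_7_gpus = tokens / (tokens_per_gpu_per_sec * gpus) / 86400
print(round(days_with_7_gpus))  # on the order of 8,000 days

gpu_hours = tokens / tokens_per_gpu_per_sec / 3600
cost = gpu_hours * 2.25         # assuming roughly $2.25 per GPU-hour
print(f"~${cost / 1e6:.1f} million")  # a few million dollars, in the ballpark of the ~$4M figure
```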
1:11:12
Okay? And we'd expect the actual cost to
1:11:14
be a lot higher than this because it's
1:11:15
very optimistic. It assumes you just do
1:11:16
one pass through it, you're all done,
1:11:17
right? In in general, you'll you know
1:11:19
you'll make some mistakes. You have to
1:11:20
do it a bunch of times and so on and so
1:11:21
forth. So this is overly optimistic
1:11:23
estimate and that is 4 million. So you
1:11:25
need lots of GPUs and you need to spend
1:11:27
a lot of money for it. Now what can we
1:11:29
do with fewer resources?
1:11:32
First of all, you you need to reduce the
1:11:34
size of the data set. The second thing
1:11:35
is you want to reduce the memory
1:11:36
required. So we can ideally do it on
1:11:38
many fewer GPUs, hopefully even one GPU
1:11:41
literally on Collab. Okay. And so now we
1:11:45
have good news on the data front because
1:11:47
as I mentioned earlier, while it takes a
1:11:49
lot of data to build these models, to
1:11:51
fine-tune them for your specific data
1:11:53
for use case, you may just need a few
1:11:55
hundred examples. Okay, it's no problem
1:11:57
at all. So the data for fine-tuning is
1:11:59
actually not a problem. Only for
1:12:01
building it in the first place, it's a
1:12:02
problem. So in fact, there's this famous
1:12:05
Alpaca fine-tuning data set. It is 50,000
1:12:07
instruction-output pairs, and so that's
1:12:11
way less than the two trillion tokens
1:12:13
and that can actually be done in about
1:12:14
20 hours. You can fine-tune a 50,000
1:12:17
example fine-tuning data set you can
1:12:19
fine tune with just 20 hours. Okay,
1:12:21
Tomaso,
1:12:23
>> could Microsoft's one bit model
1:12:26
drastically reduce the amount of compute?
1:12:28
Yeah, there's a whole bunch of
1:12:30
approximations and simplifications to
1:12:32
make all these things fit uh into
1:12:35
smaller GPUs and so on and so forth and
1:12:37
that's one of them. So, so the short
1:12:39
answer is yeah, there are many
1:12:40
possibilities uh and we have to very
1:12:42
carefully look at them because every one
1:12:44
of these simplifications you'll it'll
1:12:45
cost you something in terms of accuracy
1:12:47
and the ability of the model to do what
1:12:49
it needs to do. So there's always a
1:12:50
trade-off you have to worry about. So
1:12:52
that, for folks who are interested,
1:12:54
there's this whole field called
1:12:55
quantization LLM quantization. Google it
1:12:57
and that's an entry point
1:12:59
into that whole area. Okay. So now how
1:13:02
do we reduce the memory required so that
1:13:04
we can process the data using fewer GPUs
1:13:06
ideally just one GPU on collab. So if
1:13:08
you look at what actually consumes
1:13:10
memory, you have all these model
1:13:12
parameters. Let's say you know 70
1:13:14
billion parameters times two bytes each
1:13:16
140 GB gradient computations is another
1:13:18
140 to hold the gradient and then the
1:13:20
optimizer state is 2x. And as I
1:13:22
mentioned earlier it could be between
1:13:24
you know 1 to 6x as opposed to 3 to 4x
1:13:27
but we'll just go with these numbers for
1:13:28
the moment. And so the total is 560
1:13:30
gigabytes right if you just naively want
1:13:33
to use it. So turns out you can't do
1:13:36
anything about that; it is just 140. But
1:13:38
by using a trick called gradient
1:13:40
checkpointing this whole thing can
1:13:42
actually be squashed close to zero
1:13:44
basically you say hey I don't mind it
1:13:46
running longer but I don't want to use
1:13:48
as much memory and that trick is called
1:13:50
gradient checkpointing we won't go into
1:13:52
technical details that can go to zero
1:13:54
but then this thing here the optimizer
1:13:56
state turns out even this can be
1:13:58
squashed very close to zero and that's
1:14:00
actually was a breakthrough from you
1:14:02
know maybe a year ago and so to do do
1:14:06
that. What we're going to do is to say,
1:14:07
look, you know what? Uh there are a
1:14:09
whole bunch of weights here, but we're
1:14:11
only going to take those matrices
1:14:13
inside each attention layer, and we're
1:14:15
going to only look at those matrices.
1:14:17
We're going to freeze everything else.
1:14:19
So, we're going to take only a small set
1:14:22
of parameters, unfreeze them, and update
1:14:24
them and see if it's any good, if it
1:14:26
actually gets the job done. Instead of
1:14:27
unfreezing everything and updating them,
1:14:29
right? And so if you look at the weight
1:14:31
matrix, let's say the key weight
1:14:33
matrix A_K uh in Llama 2, this is
1:14:36
roughly an 8,000 by 8,000 matrix, which
1:14:38
means that there are 64 million
1:14:40
parameters inside each of these
1:14:41
matrices. 64 million. Okay. So you can
1:14:45
if you imagine this matrix A_K here and
1:14:48
let's say, as a thought experiment, you do
1:14:50
the finetuning and the numbers have
1:14:52
changed, right? as a result of
1:14:54
finetuning then you can imagine that the
1:14:56
resulting matrix is just the original
1:14:58
matrix you had plus just the changes
1:15:01
right the original plus the changes and
1:15:04
we call the changes ΔA_K, and of
1:15:07
course in general this change is
1:15:08
also going to be a 64-million-parameter matrix,
1:15:10
right 8,000 by 8,000 so the question is
1:15:13
can we make this change matrix smaller
1:15:15
and to make it smaller it seems
1:15:18
reasonable because a fine tune will only
1:15:20
make small changes to just a few weights
1:15:22
it's not going to change
1:15:23
By definition, a couple hundred
1:15:25
examples, you do some finetuning,
1:15:26
hopefully a few weights are going to
1:15:27
change and maybe they won't change a
1:15:29
whole lot, right? So the the key insight
1:15:32
here is that maybe we can force this
1:15:33
change matrix to be kind of simple and
1:15:36
get the job done, right? And it turns
1:15:38
out you can. And what you do is you can
1:15:40
think of this matrix as really coming
1:15:42
from two thin skinny matrices which if
1:15:46
you multiply them gets you the original
1:15:48
matrix, right? And I'm not going to get
1:15:51
into the mathematical details here. This
1:15:52
is called a low rank approximation. Uh
1:15:55
but the point here is that you can take
1:15:57
two very small matrices and if you
1:16:00
multiply them the right way, you
1:16:01
actually can recover the original
1:16:02
matrix, right? You can approximate the
1:16:04
original matrix. And this matrix, as it
1:16:06
turns out, these two matrices are much
1:16:08
smaller because each one is just 8,000-odd; times
1:16:11
2, about 16,000, right? And so this thing has
1:16:15
just 16,384 parameters, which is 0.02%
1:16:19
of the original 64 million.
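Numerically, the trick is to represent the change matrix as the product of two skinny matrices, ΔA_K ≈ B·A; a small sketch (the 8,192 size matches Llama 2 70B's hidden dimension, and rank 1 is chosen just to reproduce the parameter count on the slide):

```python
# Sketch of the low-rank idea behind LoRA: instead of learning a full d x d update,
# learn two skinny matrices A (r x d) and B (d x r) and use Delta_W = B @ A.
import numpy as np

d, r = 8192, 1
full_update_params = d * d                  # ~67 million numbers in a full Delta_W
lora_params = d * r + r * d                  # 16,384 numbers in A and B combined
print(lora_params, f"{lora_params / full_update_params:.4%}")   # roughly 0.02% of the full update

# Tiny numerical demo at a smaller size, so it runs instantly:
d_small = 64
W = np.random.randn(d_small, d_small)        # frozen pretrained weight
A = np.random.randn(r, d_small) * 0.01       # trainable
B = np.zeros((d_small, r))                   # trainable; zero init so B @ A starts as "no change"
W_effective = W + B @ A                      # what the adapted layer uses; only A and B get gradients
```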
1:16:23
So this thing is called low rank
1:16:25
adaptation, or LoRA, and it's incredibly
1:16:27
widely used in the industry. Uh, and so
1:16:30
what we do is we freeze all the
1:16:31
parameters. We initialize all
1:16:34
these change matrices to zero and then
1:16:36
we update just those two skinny
1:16:38
matrices right here; we update only
1:16:40
those matrices using gradient descent.
1:16:43
And when you do that everything will fit
1:16:45
into memory. So which means that the
1:16:47
whole thing will fit in and you can just
1:16:48
use like two GPUs and get the job done.
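In practice you rarely wire this up by hand; a common route is Hugging Face's peft library, roughly like this sketch (the model name, rank, and target modules are illustrative defaults, not the lecture's exact settings):

```python
# Sketch: LoRA fine-tuning setup with Hugging Face transformers + peft.
# 4-bit loading (needs bitsandbytes) and gradient checkpointing are the usual tricks
# for squeezing this onto one or two GPUs; all settings here are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # a smaller Llama 2, the kind that fits on a single Colab GPU
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
model.gradient_checkpointing_enable()  # trade extra compute for much lower activation memory

lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # the attention matrices, as in the lecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the full model
```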
1:16:50
And if you actually use Llama's
1:16:52
smaller models, like 7 billion or 13 billion,
1:16:55
it can be fine-tuned comfortably on a
1:16:56
single GPU on a single collab GPU. So
1:17:00
all right uh, 9:54, time does not permit, so
1:17:03
I will, uh, so I have a Colab on how to
1:17:05
do the fine-tuning uh using this
1:17:07
technique. I will do like a video walk
1:17:09
through um tomorrow or day after and I'm
1:17:12
done. Thanks folks. Have a good rest of
1:17:14
your week. [applause]
1:17:16
Thank you.
— end of transcript —