10: Generative AI – Adapting LLMs with Parameter-Efficient Fine-Tuning
MIT OpenCourseWare · May 11, 2026

Transcript
Okay, so let's continue the journey we started last time. If you remember, in the last class we showed how to build an autoregressive large language model, a.k.a. a causal large language model, using the idea of a causal encoder, a transformer causal encoder. Then we showed how you can take a bunch of sentences, use next-word prediction, run it all through, and boom: you get GPT-3. That's what we saw last time.

I want to point out an important clarification, or correction. When we work with these causal models, unlike when we work with BERT, for instance, you don't actually have to use ReLU activations when the contextual embeddings come out. You can literally run them through a single dense layer with linear activations, pass that into a softmax, and you're done. That's how GPT-3 and all these models are trained.

The other thing I want to point out, which may not have been clear, is that the vector coming out of this dense layer is as long as your vocabulary, because only then, when it goes into the softmax, do you get probabilities that span your vocabulary, which means you get to pick one word or token out of that entire 50,000-long vocabulary.

I just wanted to point that out, because I think it's easy to get a little confused by this small difference between the way masked language models like BERT work and the way causal language models like GPT-3 work.
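As a concrete illustration, here is a minimal sketch of that head in PyTorch; the dimensions and names are illustrative, not GPT-3's actual sizes:

```python
import torch
import torch.nn as nn

# One dense layer with linear (i.e., no) activation maps each contextual
# embedding to a vector as long as the vocabulary; softmax then turns that
# vector into a probability distribution over all ~50,000 tokens.
d_model, vocab_size = 768, 50_000
lm_head = nn.Linear(d_model, vocab_size)          # single dense layer

contextual = torch.randn(1, 10, d_model)          # (batch, seq_len, d_model)
logits = lm_head(contextual)                      # (1, 10, vocab_size)
probs = torch.softmax(logits, dim=-1)             # one distribution per position
next_token_id = probs[0, -1].argmax().item()      # pick a token from the vocabulary
```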
Okay, so now let's continue. We know how to build GPT-3, so what about GPT and GPT-2? What's up with them? Why is GPT-3 so famous and not GPT-2? Well, first of all, you folks know that GPT stands for generative pre-trained transformer. GPT-3, GPT-2, and GPT-1 were trained in basically the same fashion, predict the next word, with the same sort of transformer stack, except that GPT-3 was trained on much more data because the underlying transformer stack had many more layers. It is a much bigger stack, meaning lots more parameters, and therefore you need lots more data to train it well. That was really the only difference; the difference was literally one of scale, scale of network and scale of data. And unlike GPT and GPT-2, even though GPT-3 was trained basically the same way with the same kind of network, it was one of those situations where more became different. There was almost some sort of phase change between two and three. Unlike GPT and GPT-2, GPT-3 could do amazingly coherent continuations of any starting prompt.

For example, take this little prompt: "The Importance of Being on Twitter, by Jerome K. Jerome" (a famous humorist), ending with the word "it". GPT-3 produces this continuation, which is strikingly good; if any of you have read Jerome K. Jerome and you read this thing, you'll go, "Wow, that actually sounds like Jerome K. Jerome." So, amazing continuations. But the interesting thing here is not so much the continuation itself; it's that if you give the same prompt to GPT-2 or GPT, it won't be very good. After the first one, two, or three sentences it becomes incoherent, meanders, and starts rambling. GPT-3 can keep faking it for a lot longer. That's the amazing thing that was unexpected; researchers did not expect this. But it wasn't good at following your instructions.
For instance, if you ask it, "Help me write a short note to introduce myself to my neighbor," this is the kind of thing it comes up with. You can run it yourself: I think GPT-3 is still available in the playground, and if it is, you can try running these prompts; you will start getting garbage very quickly. For example, here, given "Help me write a short note," it says, "What's a good introduction to a resume?" For some reason it has glommed onto "resume"; I have no idea why. The reason it does stuff like this is that a lot of the training data it was trained on is basically lists of things. So when you say, for example, "The capital of France... continue," it comes back with "The capital of France is Paris, the capital of Hungary is Budapest," and so on; it just starts producing a list. It's very list-driven: it thinks you need it to complete some sort of list. That's what's going on here. So it's not very good; it doesn't realize that you're actually asking it to do something specific.

This is the problem when you have an autocomplete that doesn't realize what you're asking it; it just thinks it's an autocomplete. Now, in addition to these unhelpful answers, it can also produce offensive answers, factually incorrect answers, and so forth; the list of bad things it can do is long. So why does it produce unhelpful answers? As you recall, it was only trained to predict the next word; it wasn't explicitly trained to follow instructions. So it seems reasonable that if it's simply trying to guess the next word repeatedly, it can't really do anything more. How could it figure out that there's an instruction it needs to follow, unless the training data on the net were all instructional, which it clearly is not?
So, light-bulb idea: let's explicitly train it with instruction data. OpenAI developed an approach called instruction tuning to do exactly this, and this paper is the one that was the breakthrough; this is what actually put ChatGPT on the map. It's very readable, so I'd encourage you to check it out if you're curious.

So we had GPT, GPT-2, GPT-3: bigger and bigger models trained the same way. Then we ran into the problem that they can't handle instructions, so we do instruction tuning to get to 3.5, also called InstructGPT, and then a small tweak after that gets you ChatGPT. By the way, there are really two things going on in this step, as you will soon see; I'm just calling it instruction tuning so that I don't have to say some long phrase every single time. It's not a consistent piece of terminology, so just be aware of that.
All right, first step: they got a bunch of people to write high-quality answers to questions, creating about 12,500 such question-answer pairs. For example, say the question was "Explain the moon landing to a six-year-old in a few sentences." Believe it or not, GPT-3's answer to that question was another question, because it thinks there's a list of questions it needs to autocomplete: it comes up with "Explain the theory of gravity to a six-year-old." It's like one of those people who, when you ask them a question, ask you a question back. So what they did is create a nice answer to the question; here's the human-written answer: "People went to the moon in a big rocket, walked around," and so on. A much better answer. And once you have these 12,500 question-answer pairs as training data, you just train GPT-3 some more, using next-word prediction as before. No difference. So here is the input, "Explain the moon landing...": that's the question, and then we have the answer right there. Then we take that answer, move it to the right, and shift it up, so that when the input finishes with "sentences," the model needs to predict "People"; then you give it "People," and it needs to predict "went," and so on. Just like we saw before: "the cat sat on the mat" became "the cat sat on the" as the input, with "cat sat on the mat" shifted on the right as the target. That's what makes prediction possible and necessary. So that's what they did. This is step one, same as before.
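To make the shifting concrete, here is a tiny sketch (pure Python, with tokens simplified to words) of how an instruction-answer pair becomes ordinary next-word-prediction training data:

```python
# The target sequence is just the input sequence shifted one position:
# after "...in a few sentences." the model must predict "People", then "went", etc.
tokens = ["Explain", "the", "moon", "landing", "to", "a", "six-year-old",
          "in", "a", "few", "sentences.", "People", "went", "to", "the", "moon"]

inputs, targets = tokens[:-1], tokens[1:]
for x, y in zip(inputs, targets):
    print(f"after {x!r:>15}  predict {y!r}")
```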
Once you do that (this step is called supervised fine-tuning), it turns out it really helped: once you supervised fine-tuned GPT-3, it was much, much better at following instructions. But there's a small problem with this approach: it takes a lot of money and effort to have humans write high-quality answers to thousands of questions. So the question is, what can we do? What is easier than writing a good answer to a question?
Well, what? Okay, how about somebody from this side?
>> Yeah, Joseph.
>> Perhaps writing a question for an answer.
>> Oh, that's actually a good one. Yeah, I like that. So: given an answer, find a question. While that's not what I'm going to talk about here, that technique is actually used very heavily in LLMs. That's great, very creative. Mark?
>> Thumbs up, thumbs down.
>> Sorry?
>> Thumbs up or thumbs down?
>> Thumbs up or thumbs down. Exactly. Because everyone loves to be a critic; it's much easier to be a critic than to be a creator. So what do we do? We basically say: let's rank answers written by somebody else. Which begs the question, who's going to write those answers? And there's a brilliant answer to that question.
>> Wikipedia? Reddit?
>> We will just ask GPT-3 to write the answers. It might be crap, but we don't care, because we can rank them.
So we ask GPT-3 to generate several answers to the question. And how can we generate several answers? Because we can do sampling. The fact that we had these stochastic outputs because of sampling is now a feature, not a bug. We create lots of different answers to the question: feed in a question and get, say, three answers out; just run it three times with a nice temperature of 1 or 1.1 or something, so that it's nice and random. And then we literally have humans rank them, thumbs up, thumbs down, from most useful to least useful.
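A hedged sketch of that generation step, using the OpenAI Python library (v1.x interface); the model name and question are illustrative:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
question = "Explain the moon landing to a six-year-old in a few sentences."

# Sample three stochastic answers at a highish temperature; a human labeler
# would then rank them from most useful to least useful.
candidates = []
for _ in range(3):
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": question}],
        temperature=1.1,
    )
    candidates.append(resp.choices[0].message.content)
```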
Okay, so this step is step two of instruction tuning. OpenAI collected 33,000 instructions, fed them to GPT-3, generated answers, and had humans rank them. And once you do this, you can assemble a beautiful training data set. Basically, we have an instruction and, let's say, just two answers, A and B. In practice you can have many answers, which get ranked, but for simplicity I'll go with Mark's thumbs-up/thumbs-down version: assume there are only two answers to every question, and the human has said, "I prefer this one to that one." That's it. So we now have a data set where each data point is an instruction, the preferred answer A, and the other answer B. Yeah?
>> The thumbs-up/thumbs-down technique we're talking about, is that why the chatbots we use now also have thumbs up and thumbs down? They're using our rankings to train?
>> Exactly, right. All the models have the thumbs-up/thumbs-down stuff going on somewhere; they are all collecting data for this step.
>> Thank you.
>> Yeah. It's the old adage: if you're not sure who the product is, you are the product. It's one of those things.
>> So if we understand correctly, when we see thumbs up/thumbs down, it does mean that ChatGPT is going to train on our data, right?
>> Unless you opt out, yeah. If you go to the ChatGPT settings, there's something called data controls that you can toggle off. But I think, when I last checked, if you toggle it off, you lose your chat history; they have hobbled that feature to discourage people from turning it off as much as possible. Clever. But you can opt out, and if you use the API, as opposed to the web interface, you're automatically opted out: you have to deliberately opt in. And if you use the versions available through Microsoft Azure and so forth, there are all kinds of safety controls and such. In fact, I think the Microsoft Copilot license that MIT has defaults to opted out.
Okay. So, back to this: once you have these data points, you can build something called a reward model, and this is a very clever piece of work. You have an instruction, a preferred answer, and the other answer. You feed them to a network; this is just a language model. And the language model produces a number which measures how good the answer is, that is, how good an answer this is to that particular instruction. So you get a rating for one answer and a rating for the other, and then you run them through a little loss function which essentially encourages the model to give higher numbers to the better answer.

It's the same model: you run the question with the first answer, then the question with the second answer, and you get these two numbers. Initially those numbers are just random, but then you tell the model, "Hey, this is the preferred one; make sure the preferred answer's rating, the r value, is higher than the other number," because higher is better. The loss is basically this: take the difference of the two ratings, pass it through a sigmoid, and take the logarithm. You can convince yourself afterwards, and I encourage you to check for yourself, that if we give a higher number to the better answer, the loss will be lower; and since we are minimizing the loss, we're essentially training the network to try to give higher ratings to better answers. That's it; that's the approach.
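In symbols, the loss for one comparison is -log σ(r_preferred − r_other). A minimal sketch in PyTorch (names illustrative):

```python
import torch
import torch.nn.functional as F

# Pairwise reward-model loss: lower when the preferred answer is rated higher.
def reward_loss(r_preferred: torch.Tensor, r_other: torch.Tensor) -> torch.Tensor:
    return -F.logsigmoid(r_preferred - r_other)   # -log(sigmoid(difference))

print(reward_loss(torch.tensor(3.2), torch.tensor(1.5)))  # ~0.17: good ordering
print(reward_loss(torch.tensor(1.5), torch.tensor(3.2)))  # ~1.87: bad ordering
```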
Did you have a... yeah, Ben?
>> So you could imagine training the model on only the good answers; is the idea of having both that the model is actually learning what makes an answer good?
>> Correct, exactly. Much like if you want to build a dog/cat classifier, you have to show it pictures of both.
>> Yeah.
>> So I understand the feedback mechanism of thumbs up/thumbs down, but there are a lot of times when the popular response is not the accurate one. Is there a layer to correct for that?
>> Yeah, good question, Swati. As it turns out, all these companies like OpenAI have a huge document, 100 or 200 pages long, a very bulky document, which instructs and teaches the labelers, the rankers, how to rank these things. They have to follow very strict guidelines to precisely handle strange corner cases and things like that. That document is on the web; you can dig it up, and it's actually very instructive to read through. I think they put it out on the web because they wanted to convince people that they go to inordinate trouble to make sure the rankings are actually good. Do you have a question? A comment? Okay.
All right. So, back to this: how do you train this thing? SGD. You have a network, it's coming up with an answer, and you have a way to know whether that answer is good or bad (better answers give lower loss); backpropagate through the network, keep updating the weights, and boom, you're done.

And once you do that, this reward model can provide a numerical rating for any instruction-answer pair. You just give it an instruction and an answer, which could be a crappy answer or a good answer, and it tells you how good it is. So in this case, for example, maybe it gives this answer a nice number like 1.5, but then a better answer comes along and gets a 3.2. What we have done with this whole modeling exercise is essentially learn how humans rank responses. We can only have humans rank responses for some finite number of questions; what we really want is to automate that ranking process so that we can do it for tens of thousands of questions really fast. So we have essentially built a model of how humans rank things, which is beautiful. A lot of the stuff here is very self-referential, which I find very elegant.
Anyway, this can be used to improve GPT-3 even further. We take an instruction, as before, and feed it in; the model gives some answer. Then we feed the instruction and the answer to our newly minted reward model, which gives us a numerical rating. And then, this is the key step, we use that rating to nudge the internal weights of GPT-3 in the right direction. This nudging uses a technique called reinforcement learning, which, in the interest of time, we can't get into in this lecture; but that's the technique you use to nudge these things in the right direction. So that's what we do: that's reinforcement learning, and we nudge it in the right direction. OpenAI did this with 31,000 questions. Nudge, nudge, nudge. And when you do that, you get GPT-3.5, a.k.a. InstructGPT. By the way, this step is called reinforcement learning with human feedback: we use reinforcement learning, and since humans ranked the answers which led to the building of the reward model, that's where the human feedback comes in. That's reinforcement learning with human feedback.
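To give a flavor of the "nudging," here is a toy, self-contained REINFORCE-style sketch. It is only illustrative (the method actually used is PPO, which adds more machinery), with a one-parameter "policy" standing in for the LLM's weights and a hard-coded function standing in for the reward model:

```python
import torch

theta = torch.tensor(0.0, requires_grad=True)      # stand-in for the LLM's weights
opt = torch.optim.SGD([theta], lr=0.01)

def reward(sample):                                # stand-in reward model:
    return -(sample - 3.0) ** 2                    # pretend raters prefer outputs near 3

for _ in range(2000):
    dist = torch.distributions.Normal(theta, 1.0)  # stochastic "generation"
    sample = dist.sample()                         # like sampling an answer
    loss = -reward(sample) * dist.log_prob(sample) # raise probability of high-reward samples
    opt.zero_grad()
    loss.backward()
    opt.step()

print(theta.item())  # drifts toward 3.0, the behavior the reward prefers (noisy)
```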
Yeah?
>> Yeah, I have a question regarding the type of questions they're using. I can imagine some questions are very simple to answer, but now you can ask GPT, for example, to "respond to this as a pirate" or something like that. It's going to be harder to train if you have a bunch of questions that only involve small interactions.
>> That's a good question. The quality of the questions in the data set is clearly a big factor, because if you have simplistic questions, it won't be able to handle complex questions later on. Which begs the question of where they got these questions from. They actually got them from their API. People were asking GPT-3 questions through the API right before it became 3.5; the API was already fully commercially available, and a lot of people were building products on it by then. So they collected all those questions and filtered them for quality, and that was the question set they used. Then they judiciously added human-created questions, but they couldn't do a lot of that, because it's expensive; collecting what somebody else is already asking your API is very easy.
Yeah, Tomaso?
>> This might be more of a philosophical question, but the human bias that's present in the small subset of human labelers they've chosen eventually gets compounded into this model that we often treat as a source of objective truth.
>> Yes, that's very true. I think the reward model probably very faithfully learns all the biases of the human labelers, which is why they have these very complex frameworks and guidelines to try to prevent, or at least mitigate, the bias. For example, they might give the same question and set of possible answers to many different labelers and only use it if they pick the same ranking, so that at least inter-labeler bias is minimized. But if everybody is biased in the same direction, that won't protect you. In general, there's a whole line of work on trying to debias these things and build them without too much bias; it's a whole world unto itself that we just don't have time to get into. Olivia?
>> Depending on the medium being returned by these models, would there be more than one reward model? Because isn't this what Gemini is running into issues with right now, with their image generation, the bias they try to correct for?
>> Yeah. The Gemini business that's going on, it's unclear what's causing it. It may be in this step: maybe they were a little overzealous in preventing certain things from happening. Some of these systems will also intercept the question you ask and route it differently based on what they sense is sitting in the question, so there can be pre-processing and post-processing, a lot of stuff going on. So it's unclear to me where in the pipeline these things enter, and it could be more than one place. This step may very well be where it enters: a situation where people are told, "If you see this kind of answer, downrank it, don't uprank it," and then the model learns that ranking very faithfully and proceeds to apply it where it should not be applied. That does happen. Jocelyn, you had a question?
>> I think I still don't totally understand why, when I ask ChatGPT a question, even in a lengthy response it doesn't wander away from the topic I'm asking about. Understanding that it's predicting each word, it's sort of taking a random walk from one word to the next, in some sense.
>> But each word it utters becomes part of the input to the next word it utters.
>> Right.
>> So it's not truly a random walk in that sense; the next step is not independent of the previous step. It depends on the journey so far, so it's going to try to be very consistent with the journey so far.
>> Okay. Does this part, fine-tuning it on these question-answer sets, play some role in it being able to constrain itself and not meander away?
>> I don't think so. I think this is more to make sure the weights generally tend to produce the right answer. Now, one thing that is possible: when I'm a ranker looking at a few different answers, I have to figure out whether the answer is helpful, whether it is accurate, whether it is non-toxic, things like that, and part of the rubric for evaluating these answers could be their coherence. It could also be that they say short coherent answers are better than long coherent answers, but once you adjust for length, maybe coherence is more important; it could be any number of these things. So it could play a role in that.
>> One small follow-up. In other words, when it's learning from these question-and-answer pairs, it's able to look at the whole response and learn something about the whole response, rather than just one word at a time, right?
>> Correct, yeah. The entire response is being ranked.
>> Yeah.
>> Correct, correct.
>> Yeah. On a related note: when it's generating a new word, does the attention pertain to the entire prior text, or can you have, like, traveling attention, say the last five words?
>> Yeah, the short answer is you can; it's called sliding window attention, and it can be done. They typically do it not so much because they want to focus on the recent words, but because it makes things very compute-efficient. That's why they do it. It's called sliding window attention; you can Google it.
>> So normally it's full attention?
>> Normally, the default is full attention.
Okay. So that's what they did. And by the way, as I think you pointed out, that's exactly what's going on: you're training the reward model with these thumbs ups and thumbs downs. Hold on to the questions. So if you give the same question to GPT-3.5/InstructGPT: amazing answer. A night-and-day difference, an amazingly good answer. And then, to go from 3.5 to ChatGPT, they basically followed the exact same playbook, except that they wanted a chatbot, meaning something that could carry on question-answer, question-answer, a conversation, as opposed to just a single question and answer. So they trained it on conversations. That's it. Instead of training it on instruction-answer data, they trained it on instruction-answer-instruction-answer sequences strung into conversations. That is the only difference in going from 3.5 to ChatGPT. And now ChatGPT gives you a much nicer response, and you can ask a follow-on question, "Can you make it more formal?", and boom, it gives you a nice response, because now it knows about conversations; it's been trained on conversational data. So that's the whole thing; that's how they built ChatGPT, and all the things we're seeing later on are continuations of this sort of approach. Let's pause for a couple of quick questions. Swati, you had a question; then we'll go to you, and then to you. Yeah.
>> So does it make a difference if a new question-answer pair, or new training data, comes early in the building of the model or later?
>> You mean the order of the questions, does it matter?
>> Say I have 5,000 images to start with. After my model is trained and developed, a new use case comes in. Will it make a difference if I add the data now?
>> If you have a new use case for which you want to adapt the model, there's a whole set of techniques you use, which is going to be the next section. Because what you have out of the box is just a generally good chatbot: it knows about a lot of stuff because it's been trained on those 30 billion sentences, and it can answer a lot of questions reasonably well using common sense and world knowledge. But any specific use case, medical and so forth, it may not know, so you'll need to adapt it to your particular situation, and that's coming. All right, yes?
>> What determines whether a whole conversation is ranked positively, versus a specific answer within it? Is it when the first answer doesn't get a positive response, but after a follow-up the second one does?
>> Exactly. If you're a human reading the transcripts of two exchanges that both start with the same question, you'll be able to assess which one is the better transcript. That's basically what's going on. There was a question over here, right? Yeah?
>> I was wondering: when you ask a question, very often you can tell that the response was not written by an actual person. Do you think that comes from the reinforcement learning part, or where does it come from?
>> It's a good question; I don't know. Part of the ranking rubric that's used is to favor responses which sound more humanlike rather than robotlike. So if anything, I'd hope that reinforcement learning would actually make it sound more humanlike, because the rankers would have prioritized that. If it still comes up with robotic-sounding stuff, it's something else that's going on. Maybe it's that a lot of the text on the internet is not literature; it's just people writing some crap. Could be that. Yeah?
>> How much of this instruction tuning or conversational tuning is happening in real time, within a conversation?
>> None of it.
>> None of it? So as you give feedback to the model, it's just basically regenerating, like, "I don't like that answer, come up with something else"?
>> No, it's not doing it in real time. Basically, whatever signals you give it with this thumbs-up/thumbs-down business get added to the training logs, and they periodically retrain it.
Okay. So, by the way, this is instruction tuning in a nutshell, and I want to point it out; you don't have to read the whole thing, but just to quickly note: these were the places where we had to have human involvement, in the first step, writing a lot of responses to these questions, and then in ranking the answers. Those two are still human-labor-intensive. Now, it turns out you can actually use helper LLMs to automate this too. This is not what OpenAI did in the beginning with ChatGPT, but you can do it this way now, because there are lots of really good LLMs available to automate many of these things. We don't have time, but if you're curious, I have a little blog post on this; check it out.
Okay, so now we come to this question: if you want to take a base LLM like GPT-3 and make it useful, make it respond to instructions, we have seen that we had to adapt it with high-quality instruction-answer data, using supervised fine-tuning and reinforcement learning with human feedback. That's what made GPT-3 actually useful and turned it into ChatGPT. By the same token, this holds true more generally: if you want to take a large language model and make it useful for a medical use case, a legal use case, some other narrow business use case, you have to adapt it with business- or domain-specific data. So let's look at techniques for doing so. All right.
Adaptation is the rough name for the process of taking a base large language model and tailoring it for your particular use case. There's a ladder of things you can do, and we're going to look at every one of them. You can do zero-shot prompting, which is where you literally just ask the LLM, nicely and clearly, for what you want, and maybe it gives it to you; this is the use case we're all used to in the web interface. You can also do few-shot prompting, where you ask it something and also give a few examples of the kind of thing you want, which helps it a great deal. And then there are retrieval-augmented generation and fine-tuning. We'll look at all of them, and I'll explain all these things as we go along.
Okay, let's start with zero-shot prompting, where, by the way, the word "shot" is a synonym for "example." So: zero-example prompting. You literally ask in the prompt for what you want, without giving even a single example. Let's say we want to look at product reviews and build a detector to figure out whether a review contains, not sentiment (that's kind of boring), but a description of a potential product defect. Here is something I actually pulled off Wayfair, with apologies to Wayfair: "The curve of the back of the chair does not leave enough room to sit comfortably." Sounds like a defect-ish kind of thing, right? Back in the day, you would have collected all these reviews and built a special-purpose NLP-based classifier to decide defect, yes or no. Here you can literally just feed the review into GPT-3 and ask: "Tell me if a product defect is being described in this product review," followed by "The curve of the back...", and boom, it comes back and says yes, that's a product defect. That's zero-shot: you just ask a question and get the answer back. It actually works remarkably well, and the bigger models tend to be much better at zero-shot than the smaller, simpler models.
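A hedged sketch of that zero-shot call, using the OpenAI Python library (v1.x); the model name and phrasing are illustrative:

```python
from openai import OpenAI

client = OpenAI()
review = ("The curve of the back of the chair does not leave "
          "enough room to sit comfortably.")

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user",
               "content": "Tell me if a product defect is being described "
                          f"in this product review:\n\n{review}"}],
)
print(resp.choices[0].message.content)   # e.g., "Yes, a product defect is described."
```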
All right. Now, when you adapt an LLM to a specific task, you obviously need to carefully design the prompt; as you folks know, this is called prompt engineering. We're not going to spend much time on prompt engineering, but I want to give one simple example. If you ask ChatGPT, "What is the fifth word of this sentence?", very often it gives the wrong answer. It's very strange that it can't get such a simple question right; sometimes it does, but very often it gets it wrong. But now you can do a little prompt engineering and it will always get it right. For example, you can say: "I'll give you a sentence. First, list all the words in the sentence, then tell me the fifth word." Give it the sentence, boom, it gets it right. It's an example of how you can help it along by being very prescriptive about what you want it to do and breaking down all the steps. Don't make it guess things; then it does a great job.
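An illustrative before/after of that trick (paraphrased, not the exact slide wording):

```python
sentence = "She sells sea shells by the seashore"

# Naive prompt: the model often guesses and gets it wrong.
naive = f"What is the fifth word of this sentence: '{sentence}'?"

# Prescriptive prompt: spell out the steps instead of making it guess.
better = (
    "I'll give you a sentence. First, list all the words in the sentence, "
    f"numbered. Then tell me the fifth word.\nSentence: '{sentence}'"
)
# With the second prompt, the forced enumeration makes the answer ("by")
# easy for the model to read off.
```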
Anyway, there are lots of other tricks people have figured out over the last couple of years. For a long time this one was pretty hot: you give it a question and say, "Let's think step by step," and it actually has a better shot at giving you an accurate answer back. This kind of thing is by now baked into the LLMs: when you ask ChatGPT a question, your prompt gets appended to what's called the system prompt, and the whole thing goes into the LLM. You never see the system prompt, and the system prompt is telling ChatGPT things like: think step by step, take your time, don't blurt out an answer. You can just Google it; the system prompts have been jailbroken and you can find them on the web.

And this is funny; it came out maybe a month or two ago: apparently "Take a deep breath and work on the problem step by step" works better than just "Work on it step by step." And more recently, I literally read this two nights ago: apparently, if you have a math or reasoning question and you tell it, "You are an officer on the Starship Enterprise; now solve this problem for me," it's more likely to get it right. Go figure. Thomas?
>> I read two more that were super fun. One was offering a tip if it solves the problem correctly.
>> Correct.
>> And the other: when the answer was "I cannot do that," saying "I tried this on Gemini and it could solve it" was the way to get it solved.
>> Nice, playing them off against each other: "Gemini solved this; can you solve it?" Very good, excellent. Just on that, let's have some fun: you can say, "I'm going to tip you a thousand bucks if you solve this." This person apparently kept using the tip trick, and at one point the model says, "You keep promising me tips and you never give me the tip, so I'm not going to solve this problem for you." Okay. And there are many prompt-engineering resources; this one came out a couple of weeks ago and I thought it was pretty good, so I just put a link to it here.
Now let's look at few-shot prompting, where you give it a few examples. Say we want to build a grammar corrector. What you can do is give it examples of poor English and good English. You can see: poor English, "I eated the purple berries"; good English, "I ate the purple berries"; and similarly, three examples in all. Then you end the prompt with just a new poor-English input, and the response from GPT-3 is the good-English output; it fixes the error. This is an example of giving it a few examples of what you want, and it learns on the fly what you have in mind, what your intention is.
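A sketch of how such a few-shot prompt is assembled (the example pairs beyond the first are illustrative):

```python
examples = [
    ("I eated the purple berries.", "I ate the purple berries."),
    ("Thank you for picking me as your designer. I'd appreciate it.",
     "Thank you for choosing me as your designer. I appreciate it."),
]
prompt = "".join(f"Poor English: {bad}\nGood English: {good}\n\n"
                 for bad, good in examples)
prompt += "Poor English: The patient was died.\nGood English:"
# The model completes the pattern, returning the corrected sentence.
```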
Okay, so that's that. Now, this ability of LLMs to learn from just a few examples, or even from no examples and just a clear instruction, is called in-context learning. It was something GPT-2 and GPT could not do; it was new in GPT-3, and it's what they call an emergent capability: it was completely unanticipated by the people who built it.
All right, now let's look at retrieval-augmented generation; by the way, this is also sometimes called indexing. The idea of RAG is actually very simple. Let's say we want to ask a question of a chatbot, but we want the chatbot to leverage proprietary data that we might have. Maybe it's a customer-support operation, a call-center kind of thing, and you have a massive FAQ database, a content database, and you want to give that FAQ to the chatbot along with your question, so that it can use the FAQ to answer the question, as opposed to whatever it learned previously in its general training. So: can't we just include the entire FAQ, the whole data set, in the prompt and send it in? Take our question, take everything in the database that's potentially relevant to the question, attach it all to the question, the whole thing becomes a prompt, feed it in and say, "Hey, find the answer for me." Can't we just do that?
>> Theoretically... I think something stops us.
The reason you can't do it is this pesky thing called the context window. For any LLM, the prompt plus the output, their combined length, cannot exceed a predefined limit; this limit is called the context window. Remember the max sequence length we had in our earlier models, which was the size of the sentence that could be fed in? Basically, there is such a size for any of these models; it's called the context window. There are only so many tokens it can accommodate, and since what comes in is what comes out, the limit covers the input and the output together. That's the context window.
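You can count tokens yourself with the tiktoken library (which we install in the colab later); a small sketch:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
prompt = "Which athletes won the gold medal in curling at the 2022 Winter Olympics?"
print(len(enc.encode(prompt)))  # number of tokens this prompt consumes
# Prompt tokens plus generated tokens must stay within the context window.
```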
Furthermore, when you have a conversation with one of these chatbots, the entire conversation is fed in every single time. That's how it remembers what happened earlier in the conversation; it doesn't have any memory per se. Each time you ask a question, the entire thread is fed in. Initially you say, "What's the square root of 17?", and it gives you an answer; at first, only that question is sent in. Then for the next question, the first question, its answer, and the second question are all fed in; and then all of those, and so on. So as the conversation proceeds, you're consuming more and more of the context window as you go along.
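Concretely, in the chat-completions message format the client just keeps appending and re-sending (contents illustrative):

```python
# Turn 1: only the first question is sent.
messages = [{"role": "user", "content": "What's the square root of 17?"}]

# Turn 2: the model's reply and the new question are appended, and the
# WHOLE list is sent again; the model keeps no memory between requests.
messages += [
    {"role": "assistant", "content": "About 4.123."},
    {"role": "user", "content": "And of 18?"},
]
# Each turn therefore consumes more of the context window.
```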
So can you imagine taking a whole FAQ, asking a question, then saying, "Well, I didn't mean that, I wanted something else," and before you know it, boom, you've blown out the context window. It's going to come back and give you an error.
>> If it doesn't fit, does it take the whole thing together, or does it take specific windows of it?
>> Yeah, there's a whole research cottage industry around what to pick when your input is longer than the context window. The simplest case is a moving window: if you have a thousand tokens, you just look at the last thousand tokens. But there are cleverer schemes, where you take the earlier stuff that doesn't fit in the window, use another LLM to summarize it, and attach the summary to your current prompt. I know, it gets crazy.
Okay. So for all these reasons, we need to pick and choose what we send in order to answer a particular question. Since we can't include the whole thing, we first retrieve the relevant content from the database or the FAQ, and then send it to the LLM along with the question we have. Retrieval-augmented generation; that's what's going on.
Make sense? So, pictorially: let's say this is our external set of documents; think of it as the FAQ. We take each question and answer in the FAQ, treat it as its own little unit of text, and calculate a contextual embedding for each of those question-answer pairs. Remember, we know how to do contextual embeddings; that's a piece of cake at this point. Run it through something like BERT and you're done; you get embeddings for everything in your FAQ. Now, when a new question comes in, you take that question and calculate a contextual embedding for it too. Then you look to see which of the FAQ elements, which of those chunks, are the most similar to your question. You grab the most similar ones, pack them into the prompt, and send it in. Maybe you have 10,000 questions but can only accommodate five of them in your prompt, because the context window is very small; so you pick the five you think are the most relevant to your particular question, and you feed them in. That's the idea; that is retrieval-augmented generation.
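Here is a compact sketch of that pipeline, using OpenAI's v1.x library for the embeddings; the FAQ entries, question, and top-k choice are all illustrative:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])

faq = [
    "Q: How do I reset my password? A: Click 'Forgot password' on the login page.",
    "Q: What is your return policy? A: Returns are accepted within 30 days.",
    "Q: Do you ship internationally? A: Yes, to most countries.",
]
faq_vecs = embed(faq)                        # one-time indexing of the chunks

question = "Can I send an item back after two weeks?"
q_vec = embed([question])[0]                 # embed the incoming question

# Cosine similarity between the question and every chunk; keep the top 2.
sims = faq_vecs @ q_vec / (np.linalg.norm(faq_vecs, axis=1) * np.linalg.norm(q_vec))
top = [faq[i] for i in np.argsort(sims)[::-1][:2]]

prompt = ("Use the FAQ entries below to answer the question.\n\n"
          + "\n".join(top) + f"\n\nQuestion: {question}")
# `prompt` now goes to the chat model as usual.
```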
Yeah, Rolando?
>> So does this tie in, for example, if I were to prompt, "Help me work on my startup pitch, but in the voice of Steve Jobs"? Is it then going out there and reducing the subset of data to things that have been written by Steve Jobs, and generating its response based on that?
>> Not as a default, typically, because there's a lot of Steve Jobs material on the web and it's just using that, since it's all part of its pre-training data. RAG tends to be more useful for very targeted applications where you don't expect it to know the answer, because the answer is not on the public internet. It's your proprietary data, you want it to use that proprietary data, and this is how you do it.
Yeah?
>> [partly inaudible] ...surely there will be some loss?
>> There will be some loss, because you have to figure out how to chunk it right. Maybe you have a 300-page PDF, and you make each section a chunk; maybe you make each paragraph a chunk. Again, there's a whole empirical cottage industry of techniques for doing these things better or worse, depending on the use case and so forth. But the conceptual idea is: chunk and embed.
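A toy version of "chunk and embed" (the file name and the paragraph-based splitting rule are hypothetical; real pipelines chunk by section, paragraph, or token window):

```python
document = open("faq.txt").read()   # hypothetical long document
chunks = [p.strip() for p in document.split("\n\n") if p.strip()]
# Each chunk would then be embedded and indexed, as in the retrieval sketch above.
```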
>> So chunking is another thing you have to get right.
>> Yeah. In fact, we're going to do it ourselves in the colab right now.
>> Yeah.
>> Can we give more weightage to some of the content? [laughter]
>> In the default implementation, no. But in some sense, by picking the five most relevant chunks out of 10,000, you're giving the other 9,995 chunks a weight of zero and these five a weight of one. So in some sense you are weighting it.
>> Yeah.
>> I was just curious how much structure you have to have with an external document, say from a hospital or something. Do you have to do a bunch of labeling?
>> No, you just need to make sure it's relatively clean. But you'll see in the colab that it can be kind of crappy and it still works, because there's so much crap on the internet it's already been trained on. Okay, so let's look at the colab.
By the way, retrieval-augmented generation is, in my opinion, the most prevalent business application of LLMs that I've seen to date, and there's a huge ecosystem of tools and vendors around it.

I'm going to skip through the verbiage here. You have to install the OpenAI library and this thing called tiktoken, which we'll get to in a bit. I've already installed these before class because it takes some time, so I'll just make sure they're all in place and we don't have to wait. I've imported pandas as before, and you can read through these cells: basically, I have an OpenAI token, a key rather, an API key, that I have to use. I'm not showing you the key, obviously (I have to remember to delete it before I upload the colab); you'll have to get your own key to make it all work, but the instructions are here.
We're going to use GPT-3.5-turbo to demonstrate RAG, so I give it the name of the model. OpenAI also has a whole bunch of different models for embeddings: you feed one a sentence or a chunk of text and it gives you a contextual embedding back. It's a nice little API; you don't have to run your own BERT and so forth, you can just use the OpenAI embeddings. Obviously you have to pay OpenAI every time you make a request, but it's really, really cheap at this point. Yep, a question?
>> A question about dealing with proprietary data: a lot of companies say, "We need to invest in our own LLM because we don't want our data going out." In this kind of context, how good is the cybersecurity, or the compliance and legal side?
>> I think each vendor has their own set of rules and contractual commitments they're willing to sign up for.
>> If you use your data here, does it go into the public domain, or no?
>> No, but the vendor gets to see it. Meaning the vendor's systems get to see it; whether the vendor's employees get to see it if they need to is unclear. Those are the legal nitty-gritty details you have to worry about. The other thing you can do is just download an open-source LLM and do it all within your own premises.
That's totally possible to do. In fact, I probably won't have time today, but I have a whole section on how you actually do fine-tuning with an open-source LLM, which I'll do as a video if we don't get to it. Okay. So this embedding model, ada-002, is the name of the OpenAI model that gives you contextual embeddings; we're going to use that. The use case here is that we want to create a chatbot which can answer questions about the 2022 Olympics, random questions you might have about the Olympics.
So let's first ask it a question about the 2020 Summer Olympics. That's the query, and this is the API request we have to make; you can read through it, and I've linked the documentation here. It says that Barshim of Qatar and Tamberi of Italy both won the gold, and you can fact-check this: it's accurate, it's correct. Now let's change the query and ask about the 2022 Winter Olympics (why 2022 versus 2020 will become clear in just a moment): which athletes won the gold in curling at the 2022 Olympics? It says the gold medal in curling was won by the Swedish men's team and the South Korean women's team. Turns out, if you fact-check this... wait for it... Sweden did win the men's gold, and yes, the South Korean team participated, but Great Britain actually won the women's gold. So it got it wrong. It sounds like GPT-3.5-turbo could use some help.
46:19
Turbo could use some help. And now one
46:22
of the things we can do is so the thing
46:24
is the reason why GPT3 3.1 turbo didn't
46:27
know about this is because its training
46:29
cutoff date was September 2021.
46:32
So as far as it's concerned the 22
46:34
Olympics haven't happened yet
46:37
it confidently gave you the wrong answer
46:39
as it is often prone to do. So and this
46:42
is by the way is called hallucination
46:43
where it gives you a very eloquent
46:45
confident wrong answer. And so um
46:50
or as some folks have said about um
46:53
another business school that should
46:54
remain nameless often in error but never
46:56
in doubt. So um
46:59
all right back to this uh so one simple
47:02
thing we can try right off the bat is to
47:03
tell 3 3.5 Turbo you can ask it to say I
47:06
don't know if it doesn't know rather
47:08
than just make stuff up right and how do
47:10
you do it? It's very simple. You say in
47:12
your prompt, answer the question as
47:14
truthfully as possible. And if you're
47:17
unsure of the answer, say, "Sorry, I
47:18
don't know." Okay, now here's the
47:20
question. Okay, this is a query. So,
47:22
let's run it through.
47:25
Sorry, I don't know. Not bad, huh? So,
47:29
so it worked. It's sort of trying to be
47:31
humble and honest and, you know,
47:32
self-aware and things like that. Um,
47:35
it's more like a a Sloan at this point.
47:37
All right. So, as I mentioned earlier, you can check the cutoff date and see that it's 2021. Actually, you know what, let me just open a new tab. All these cutoff dates refer to the training data, right? So for GPT-3.5 Turbo, which is what we are using, the cutoff date is 2021. Okay, that's why.
47:56
why all right so now what we can do is
47:59
to to we can obviously provide relevant
48:01
data on the prompt itself sort of we can
48:02
leading up to rag here and by the way
48:04
the extra information we provide in the
48:06
prompt to help it answer a question is
48:07
called context, right? That's sort of
48:08
the lingo for it. So, we can do it,
48:10
we'll first do it manually. Um, so we
48:13
first we'll use the Wikipedia article
48:15
for 2022 Winter Olympics and we tell it
48:17
explicitly to make use of this context
48:19
because telling things explicitly always
48:21
seems to help. So, this is the thing we
48:23
cut and pasted here, right? Wikipedia
48:25
article on curling, and it's a pretty long article. It's got all kinds of stuff, and it's not even all that cleanly formatted, right? It's very strange. Look at that.
48:38
So, to answer your question, Spencer: the context can be, you know, in pretty bad shape, and it still seems to work. Okay.
48:44
So now: "Use the below article on the Olympics to answer the subsequent question. If you don't know, say you don't know." That's what we have; that's the query. And by the way, before I send it
48:53
into the LLM, this is the actual query
48:55
that's going to be sent; I'm printing out the query. Look at how long the query is: "Use the article below," and here is the article, scroll, scroll, scroll, the whole thing, and it keeps on going. And then finally I ask which teams won the gold.
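A sketch of the prompt template being assembled here, with the pasted Wikipedia text standing in as context; the file name and variable names are hypothetical:

    # Manually stuff the retrieved context into the prompt.
    wikipedia_article = open("curling_2022_wikipedia.txt").read()  # hypothetical file

    query = f"""Use the below article on the 2022 Winter Olympics to answer the
    subsequent question. If the answer cannot be found, write "I don't know."

    Article:
    \"\"\"
    {wikipedia_article}
    \"\"\"

    Question: Which teams won the gold medal in curling at the 2022 Winter Olympics?"""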
49:07
So, okay, so let's run it.
49:12
Okay, look at that.
49:15
Women's curling: Great Britain. It got it right. Pretty good, right? I mean, it had to parse all that crap to find the nuggets. So, nicely done. But maybe it wasn't super hard, because we literally gave it the answer.
49:28
So let's make it a bit harder. I noticed that this person, Oskar Eriksson, won two medals in the event. So let's ask if any
49:37
athlete won multiple medals. That
49:39
requires a little bit of abstraction,
49:40
right? So all right, same query. Did any
49:44
athlete win multiple medals in curling?
49:46
The question has changed; nothing else has. Hit it. Let's see
49:50
what happens.
49:51
Yes, Oskar Eriksson won multiple medals in curling: he won a gold in the men's event and a bronze in the mixed doubles. Pretty cool, right? Take that, Google.
50:02
All right, now we come to retrieval-augmented generation, where instead of doing it manually, which obviously doesn't scale, we will do it automatically. The thing you have to remember, as I mentioned just a few minutes ago, is that there is a context window for every LLM, and for GPT-3.5 Turbo the context window is 16,385 tokens; that is the combined length of the input and the output, so we
50:24
can't exceed that. By the way, GPT-4's context window is, I think, up to 128,000 tokens, and for Google Gemini 1.5 Pro (they really need to work on their names) the context window is 1 million tokens; in research they have tested 10 million tokens. Crazy times. All that means is that you can upload entire videos and ask questions about the video. All right, to come back to
50:51
this. So what we'll do is grab only the data from the Wikipedia articles about the Olympics that are relevant to our question, by using pre-trained
51:00
embeddings. So again, this is the thing we talked about earlier, the picture we saw in class. The only thing I want to point out is that if you have an embedding for a question and an embedding for a chunk of text in your database, you have to figure out how related they are. And for that we can use, what, the dot product, or something closely related to the dot product that is easier for us to work with: cosine similarity. We have done cosine similarity previously; I've explained it in class. We're just going to use cosine similarity: how similar are these vectors? So that's what we're going to do. All right. So, the same picture as
51:40
we saw in class. First, we need to break up the data set into sections and run each section through the embedding model. I have code here which actually does that for you manually, and you can play around with it later; but fortunately OpenAI has already given us the chunked data set, so we'll just use that because it's easy for us. And I downloaded it already, because it takes five minutes to download, and stuck it in a particular data frame
52:04
here. So let's print out five randomly
52:07
chosen chunks. So you can see here, this is the first chunk, and look at all this crazy stuff: the formatting is off, but these are all basically paragraphs and sections grabbed straight from Wikipedia with no cleaning.
52:24
Okay, now we define a simple function to send any arbitrary piece of text into the embedding model and get the contextual embedding vector out, right? There is this little function that does that: using an embedding model, we send in a text and it gives us something back. So let's try it on "that is amazing." You should get a vector back. Oh, come on, don't fail me now.
52:56
All right, how long is it? 1536. So how about I now say "that is incredible" instead of "that is amazing"? Hopefully the two vectors would be quite similar in terms of cosine. To calculate the cosine distance I use a function from SciPy; it just calculates the cosine similarity, and I hit it. So, 0.9934. The maximum is one, right? So 0.9934 means they're very, very similar, which is comforting, because "amazing" and "incredible" are obviously synonyms.
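Putting those pieces together, here is a minimal sketch of the embedding helper plus the cosine-similarity check, assuming the openai client and SciPy; the helper name get_embedding mirrors what the notebook does but is illustrative:

    from openai import OpenAI
    from scipy.spatial.distance import cosine

    client = OpenAI()

    def get_embedding(text: str, model: str = "text-embedding-ada-002") -> list[float]:
        """Return the contextual embedding vector (length 1536) for a piece of text."""
        response = client.embeddings.create(model=model, input=text)
        return response.data[0].embedding

    v1 = get_embedding("that is amazing")
    v2 = get_embedding("that is incredible")

    # SciPy's cosine() is a distance, so similarity = 1 - distance.
    similarity = 1 - cosine(v1, v2)
    print(similarity)  # something like 0.99 for near-synonyms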
53:27
Okay, so now, given a data frame with a column of text chunks in it, we can use this function on every one of those chunks to calculate its embedding, and you have a function here that basically does that for you. I'm not going to run it because it takes a long time, but you can run it later on; just be prepared to go get a cup of coffee
53:41
just be prepared go get a cup of coffee
53:42
and stuff while it does it uh but once
53:44
you but happily for us open has actually
53:47
already done this step for us so we
53:48
don't have to uh so it's already
53:50
available in this data frame so if you
53:51
actually Look at this. And you can see
53:53
here there is a text and then there is
53:56
an embedding that's right sitting right
53:58
there, right next to it. Okay. And these embeddings are, how long is it, 1536? Yes: 1536-long vectors. Okay. All right, so that's what we have.
54:14
Okay. So now that we have this, whenever we get a question we calculate the question's embedding and then calculate its cosine similarity with all the embeddings sitting in this data frame. To do that, we're going to define a couple of helper functions here. You can read through the Python later; it's just basic Python manipulation. So let's just test this
54:36
function. Basically, we have a little function called strings_ranked_by_relatedness, where you give it any input question or text and it gives you back the top five most related chunks of text in its data frame. Okay, so let me just run this thing.
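A sketch of what strings_ranked_by_relatedness might look like, reusing the get_embedding helper sketched above and assuming the data frame has "text" and "embedding" columns as shown earlier:

    import pandas as pd
    from scipy.spatial.distance import cosine

    def strings_ranked_by_relatedness(query: str, df: pd.DataFrame, top_n: int = 5):
        """Return the top_n chunks most related to the query, by cosine similarity."""
        query_embedding = get_embedding(query)
        scored = [
            (1 - cosine(query_embedding, emb), text)
            for text, emb in zip(df["text"], df["embedding"])
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_n]

    for score, text in strings_ranked_by_relatedness("curling gold medal", df):
        print(round(score, 3), text[:60])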
55:00
So, curling: the things it pulls back had better involve curling and medals and so on. This one has a cosine similarity of 0.888, "curling at the 2022 Olympics." That's good. Results summary, medal summary, results summary; it's all pretty good, right? Even the fifth one has a cosine similarity of 0.867, which is pretty high. So it's doing the right things: "curling gold medal" was the input text, and it's picked up the right chunks for it. Now let's see what we can do
55:28
with the original question. So here is a
55:30
header I'm going to use in the prompt.
55:31
I'm going to say use the below articles
55:33
to answer the subsequent question.
55:35
Answer the question as truthfully as possible, and if you're unsure of the answer, say "Sorry, I don't know," as before. Okay, that's our prompt. And
55:41
now here's the thing: we don't want to exceed the context window, right? So we need to count the tokens we're sending in, plus the likely number of tokens we're going to get back, so that we don't exceed the budget. We use a package called tiktoken for this, and it just, you know, helps you count the tokens. You can read through this part; it's again just some basic Python for counting tokens.
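Counting tokens with tiktoken is nearly a one-liner; a sketch (the model name just selects the matching tokenizer, and the helper name is illustrative):

    import tiktoken

    def num_tokens(text: str, model: str = "gpt-3.5-turbo") -> int:
        """Count how many tokens this model's tokenizer would see in the text."""
        encoding = tiktoken.encoding_for_model(model)
        return len(encoding.encode(text))

    print(num_tokens("Which athletes won the gold medal in curling?"))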
56:03
And now we come to where we actually assemble the prompt. We start with the header, the one which says, you know, be truthful and all that. Then we say, here is a question that I'm going to ask you, and then you go in there and keep grabbing Wikipedia articles until the number of tokens in your prompt is about to exceed your token budget, and then you stop, because you can't exceed the budget. And that's the whole thing.
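A sketch of that assembly loop with the budget logic made explicit, reusing the num_tokens and strings_ranked_by_relatedness helpers sketched above; the real notebook function is organized a bit differently:

    def build_prompt(question: str, df, token_budget: int = 3700) -> str:
        """Pack the most related chunks into the prompt until the budget is hit."""
        header = ('Use the below articles on the 2022 Winter Olympics to answer the '
                  'subsequent question. If the answer cannot be found, write '
                  '"Sorry, I don\'t know."')
        question_part = f"\n\nQuestion: {question}"
        prompt = header
        for score, text in strings_ranked_by_relatedness(question, df, top_n=100):
            candidate = prompt + f'\n\nWikipedia article section:\n"""\n{text}\n"""'
            # Stop before the prompt would blow past the token budget.
            if num_tokens(candidate + question_part) > token_budget:
                break
            prompt = candidate
        return prompt + question_part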
56:34
So here, all right, let's run this function. Now, it turns out, as you saw, we can go up to 16,000-something tokens in the context window, but I'm just using 3,700 as my budget, partly
56:48
just to show you how to use this thing.
56:49
And also because it's charging my credit card for every token I'm using, right? So I'm just being careful; it charges by the token. It's a beautiful business model. Anyway,
57:01
so, back here, let's ask the question: which athletes won the gold medal in curling at the Olympics? Here is the data frame you should use, here is the GPT model, and don't exceed 3,700 tokens. Okay, that's the query, or the prompt. It's going to compose the prompt now, and this is the whole prompt. Let's just go to the very top; it's really long.
57:24
Okay, all right: "Use the below articles to answer the subsequent question..." and boom, boom, boom, it has added a whole bunch of paragraphs from the Wikipedia pages, and then it finally ends with the question: which athletes won the gold? All right, now let's just ask it. This is just a little function to send stuff into the API, and now we are finally ready to ask GPT the question. Fingers crossed.
57:53
All right: curling, Stefania Constantini in the mixed doubles, and the team consisting of blah blah blah in the men's tournament. And, oh, interesting, it has actually ignored the Great Britain team completely, I think. Last night it didn't. Welcome to stochasticity. So when you try it, it might actually give you the full answer.
58:19
Now let's ask it a question about the 2016 Winter Olympics, which, by the way, didn't happen; there were no Winter Olympics in 2016. So if you ask it: "Sorry, I don't know." All right. Now let's
58:34
change the header so that we don't say "be truthful." We will remove the requirement that it be truthful and see what happens. All right: which athletes won the gold? Oh, now it's telling you about the 2022 Olympics. So it answered an irrelevant question accurately.
58:57
That's what happens if you remove the requirement that it be truthful. So I guess the moral of the story is: first, you can use RAG to grab stuff from massive databases, and it's very heavily used in industry. Second, you have to be careful about these token budgets and so on. And small wording changes in the prompt can dramatically alter behavior, which makes it very difficult in enterprise settings to do QA on this stuff. So a lot of
59:25
care has to go into it. You have seen examples: Air Canada had a chatbot which gave the wrong advice to a customer; the customer sued Air Canada, the court ruled in favor of the passenger, and they pulled the chatbot off the website. So you've got to be very careful. Without a human in the loop checking these answers, it's kind of dangerous, in my opinion, at the current state. Hopefully it'll get better; there's a lot of potential, but you have to be careful. All right. So this
59:51
is what we have. And you can actually take this thing and use it: you can take, say, a thousand-page PDF that you might have, chunk it, and use this approach. I've done it for a whole bunch of different things, and it actually works really well. It'll make errors here and there, but most of the time it works really well. Okay. So, yeah.
1:00:11
>> Sorry, just a question. When GPT-4 now lets you upload PDFs, is it chunking them, or is it actually ingesting the whole thing?
1:00:21
>> No. When you upload something, because GPT-4 Turbo has 128,000 tokens of context, it can accommodate a whole long batch of documents. So when you upload stuff, it's not doing any chunking. The chunking we're talking about, you have to do; the LLM doesn't even know you're doing it. As far as the LLM is concerned, it only sees the prompt, and the prompt says: "Hey, here's a bunch of information. Here's a question. Answer it for me using this information. Be truthful." That's it.
1:00:44
Now, when you ask these things a question about something later than their training data, you will actually see GPT-4 saying it's doing a Bing search and things like that. What's actually going on is that there's a pre-processing step: a program does a Bing search, gathers a bunch of Bing results, takes the top few, chunks them, embeds them, packs them into a prompt, and sends it into GPT-4, and you don't see all this going on under the hood. So when it says it's thinking and doing a Bing search, this is what's going on under the hood.
1:01:19
Was there a question somewhere here? No? Oh, sorry. Yeah.
1:01:24
>> I have a question about formatting. It seems to be able to understand and ignore irrelevant formatting, even though there are colloquial tables, not really defined tables. And also, when it outputs formats, it's able to do it really humanly. Is that something it's figuring out through the neural network, or something that's kind of being programmed in ahead of time somewhere?
1:01:49
>> There is no explicit programming going on. It's typically because of the question-answer pairs that were used for supervised fine-tuning, instruction tuning, and reinforcement learning, right? Given the same sort of badly formatted input, the better answers are just rewarded, ranked higher. That's what's going on.
1:02:06
But on a related note, one thing that's very useful is that you can ask it to give you the answer back in certain formats, like Markdown or JSON. And by forcing it to adhere to a well-defined format, you actually increase the chance of it getting the right answer in the first place.
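For example, a hedged sketch with the chat API; JSON mode (the response_format option) exists only on newer model versions, and the key names here are made up for illustration:

    response = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",  # a model version that supports JSON mode
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": ('Which athletes won the gold medal in curling at the 2022 '
                        'Winter Olympics? Respond as JSON with keys "event" and '
                        '"gold_medalists".'),
        }],
    )
    print(response.choices[0].message.content)  # a JSON string you can json.loads()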
1:02:24
Again, there's a whole tangent we could go into here, but those are some of the things that are part of prompt engineering. All right, so that's what we have here. Back to the PowerPoint.
1:02:40
So that's retrieval-augmented generation, and we finally come to fine-tuning. Up to this point, none of the things we have seen alter the internals of the LLM; you have not messed around with the weights or changed them at all. You're just using it as a black box, right? With fine-tuning, you actually train it further, meaning the weights are going to change. Okay. So remember, we take something like a causal LLM like GPT, and, as I mentioned earlier (I haven't fixed this slide yet), there is no ReLU here; just remember that.
1:03:19
Then, if you have domain-specific input-output examples, you can just train it like this: the input goes in, the shifted output is the target, and that updates these weights, all these weights. This is basically fine-tuning, exactly like we saw with BERT, and even with ResNet; it's the same sort of thing. Okay, that is fine-tuning.
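For reference, a minimal sketch of what that shifted-output training looks like in code, using Hugging Face transformers with GPT-2 as a stand-in (the model choice and example string are illustrative; for a causal LM the library performs the one-position shift internally when labels equal input_ids):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    example = "Write a positive review for: slim-fit jeans. Review: Best jeans I own."
    batch = tokenizer(example, return_tensors="pt")

    # labels = input_ids: the model internally shifts them so position t
    # predicts token t+1, which is exactly the next-word-prediction loss.
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()  # gradient descent on this loss updates the weights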
1:03:42
Now, before we discuss the mechanics of how to do it, I want to show you a quick example of the usefulness of fine-tuning. So,
1:03:48
imagine for a sec that we want to
1:03:50
generate synthetic product reviews
1:03:53
from product descriptions.
1:03:55
So we are building some product which can simulate customer behavior in e-commerce, and for that we need to be able to generate the kinds of reviews that customers might come up with, right? And writing a lot of reviews is very time-consuming. What you can do is get a whole bunch of product descriptions from the internet. So let's say you ask an LLM: hey, write a positive product review using this information, product description here. And it comes up with this: timeless, authentic, iconic, right?
1:04:24
Seriously, do product reviewers actually write stuff like this? No. This reads like marketing copy, because there's a whole bunch of marketing copy on the internet. So it's not good; it doesn't feel like a review, it's not authentic. Here's another example, for Urban Outfitters, and it says: the boxy and cropped silhouette is flattering on all body types. Come on.
1:04:50
Okay, so it's not going to work. So,
1:04:52
what we do is we fine-tune the LLM. We
1:04:55
can take an LLM and we can fine-tune it
1:04:57
with instruction, product description,
1:05:00
and product review examples.
1:05:02
Okay, that's what we can do. For instance, we can take something like this; let me zoom into it. It says here: write a positive review for the following product. The description is the input, and the output is the review: the best, my husband's favorite, they fit well. Right? These feel like product reviews. So you just have to get a few hundred of these product-review examples. Okay, just a few hundred, and you may not even need that many.
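A sketch of what a few such training records might look like; the field layout follows the common instruction/input/output convention, and the reviews themselves are invented for illustration:

    # Each record pairs an instruction+description input with a human-style review.
    finetuning_examples = [
        {
            "instruction": "Write a positive review for the following product.",
            "input": "Men's slim-fit stretch jeans, dark wash, five pockets.",
            "output": "Best jeans ever. My husband's favorite. They fit well.",
        },
        {
            "instruction": "Write a negative review for the following product.",
            "input": "Men's slim-fit stretch jeans, dark wash, five pockets.",
            "output": "Way too tight in the thighs. Returning them.",
        },
        # ...a few hundred of these is often enough for a narrow use case.
    ]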
1:05:35
Once you do that, you basically do the fine-tuning like I showed earlier: instruction, input, output, and then you take that output, shift it a bit, and make it the actual label, the actual target. Fine-tune a bunch of times, gradient descent, weights get updated. Now you have a new, updated LLM. And when you do that,
1:05:58
now for the same things, here's what you
1:06:00
get. Write a review: "These are the best jeans I've ever owned," some details, "I've been wearing them for a few weeks and they still look brand new." It reads much better; it doesn't read like marketing. And this is completely fake, by the way; the model came up with it after the fine-tuning.
1:06:15
And then we say, "Write a horrible
1:06:16
review because we want to be balanced.
1:06:18
These are the worst genes I've ever
1:06:20
worn. They're too tight here and there.
1:06:22
I'm going to return them and try a 30,
1:06:23
but I'm not optimistic.
1:06:25
I'm going to stick with Levis's." Few.
1:06:27
Okay.
1:06:29
So these read like real reviews. Just by taking a few hundred examples and fine-tuning on them, you completely change the behavior to what you want for your particular use case. That's the key thing. For me, the biggest benefit here is that while it took billions of sentences to pre-train the original LLM, and tens of thousands of examples to do supervised fine-tuning and RLHF and so on, to make it work for your narrow business use case you only had to spend a couple hundred examples. That's it. It's amazing. Imagine if you had to collect 30,000 examples to make it work; nobody's going to do that, it's too much work. But a couple of hundred, anybody can do. That's why it's so powerful to fine-tune these things. Yeah.
1:07:16
>> You talked about industries where you don't want to put some of this stuff on the internet, downloading the pre-trained model and doing this on your own. Talking about compute power, with the computers and GPUs we have now, are you able to do some of these very small use cases on those types of devices?
1:07:40
>> Perfect question; we're going to get to that. The short answer is that it's hard: yes, it's just a few hundred examples, but actually trying to fine-tune these big models on consumer-grade hardware is not easy, so you have to make certain tricks and simplifications, which is the next topic. Yeah.
1:07:57
>> Is fine-tuning always supervised, like you need those pairs, or could you do it if the company has less structured data?
1:08:05
>> No, you can. It depends on whether you want to make it generally smart about the company's business details, in which case you can just take a whole bunch of text and do next-word prediction on it. It's going to get smarter about things generally, but that doesn't mean it's going to specifically follow your instructions on your particular business problem. If you want it to follow instructions, you need supervision.
1:08:27
Okay, all right, these three are great reviews. So, for small LLMs like GPT-2, fine-tuning isn't difficult, to go back to your question; you can actually do this with small models. For example, Google has released this thing called Gemma, which came out recently. It's a small model, like two billion parameters or so for the smallest one, if I remember right, and those things will typically fit into one GPU and you can fine-tune them. You still need GPUs, just to be clear, but they will fit into one.
1:08:57
But if you want to use a larger model, it won't fit, so to make this work you have to do other things, and that's what we're going to talk about now. There's a family of models called Llama, Llama 2; these are open-source LLMs, and they are widely used for fine-tuning, because you can just download the model and do whatever you want with it. It's open. I mean, it's not strictly open, because there are some footnote considerations you've got to worry about, but for most purposes it's open enough, in my opinion. So let's see how hard it is to build the biggest model in this family, the Llama 2 model with 70 billion parameters. Okay,
1:09:35
70 billion parameters. First of all, the model is gigantic: 70 billion parameters, and let's say we store each parameter in two bytes. Then on top of that we need a multiplier on each parameter to store various details about how the optimization is done; we won't get into the details here. The one thing I do want to point out is that the 3-to-4x on the slide should really be 1-to-6x; I didn't have a chance to change it this morning. But the point is that it's going to be huge: even with this number, it's going to be like 420 to 560 gigabytes just to hold the model in memory and manipulate it. So if you
1:10:18
use a GPU like an A100 or an H100, which are Nvidia GPUs, each of these typically has 80 GB of memory. So we need six to seven GPUs just to accommodate this thing. That's the first problem: the model is big, and just to hold it and work with it you need lots of GPUs.
1:10:37
The second problem: Llama 2 was trained on two trillion tokens of text. Two trillion tokens. These GPUs can process about 400 tokens per GPU per second; by process, I mean the forward pass through the network. So if you use seven GPUs, it's going to take you around 8,000 days. Say we want to do it in about a month: you need roughly 2,000 GPUs, and at a cost of $2.25 per GPU per hour, this will cost you about 4 million dollars.
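The back-of-the-envelope arithmetic behind those numbers, written out as a sketch; the throughput and price are the rough figures quoted in the lecture, not measured values:

    params = 70e9                 # Llama 2 70B
    bytes_per_param = 2           # fp16/bf16 storage
    multiplier = 4                # optimizer/gradient overhead (really 1x to 6x)
    memory_gb = params * bytes_per_param * multiplier / 1e9
    gpus_to_hold = memory_gb / 80           # A100/H100 have ~80 GB each
    print(memory_gb, gpus_to_hold)          # ~560 GB -> ~7 GPUs

    tokens = 2e12                 # Llama 2 pre-training corpus
    tokens_per_gpu_sec = 400
    gpu_hours = tokens / tokens_per_gpu_sec / 3600
    gpus_for_one_month = gpu_hours / (30 * 24)
    cost = gpu_hours * 2.25       # rough $ per GPU-hour
    # ~1.4M GPU-hours, ~2,000 GPUs for a month, ~$3.1M with these optimistic
    # numbers; the lecture rounds this to roughly $4 million.
    print(gpu_hours, gpus_for_one_month, cost)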
1:11:12
And we'd expect the actual cost to be a lot higher than this, because it's very optimistic: it assumes you do just one pass through and you're all done. In general, you'll make some mistakes and have to do it a bunch of times. So this is an overly optimistic estimate, and it's still 4 million. You need lots of GPUs and you need to spend a lot of money. Now, what can we do with fewer resources?
1:11:32
First, you need to reduce the size of the data set. Second, you want to reduce the memory required, so we can ideally do it on many fewer GPUs, hopefully even one GPU, literally on Colab. Now, we have good news on the data front, because as I mentioned earlier, while it takes a lot of data to build these models, to fine-tune them for your specific use case you may just need a few hundred examples; it's no problem at all. So the data for fine-tuning is not a problem; it's only a problem for building the model in the first place. In fact, there's the famous Alpaca fine-tuning data set: it's about 50,000 instruction-output pairs, way less than two trillion tokens, and fine-tuning on it can actually be done in about 20 hours. Okay, Tomaso?
1:12:23
>> Could Microsoft's one-bit model drastically reduce the amount of compute?
1:12:28
>> Yeah, there's a whole bunch of approximations and simplifications to make all these things fit into smaller GPUs, and that's one of them. The short answer is yes, there are many possibilities, and we have to look at them very carefully, because every one of these simplifications will cost you something in terms of accuracy and the model's ability to do what it needs to do. There's always a trade-off to worry about. For folks who are interested, there's a whole field called LLM quantization; Google it, and that's an entry point into the whole area. Okay. So now, how
1:13:02
do we reduce the memory required, so that we can process the data using fewer GPUs, ideally just one GPU on Colab? Look at what actually consumes memory: you have the model parameters, 70 billion parameters times two bytes each, 140 GB; the gradient computations are another 140 GB to hold the gradients; and then the optimizer state is 2x, and as I mentioned earlier it could be anywhere from 1x to 6x rather than 3x to 4x, but we'll just go with these numbers for the moment. So the total is 560 gigabytes if you just naively use it. It turns out you can't do anything about the parameters themselves; that's just 140 GB. But by using a trick called gradient checkpointing, the gradient memory can be squashed close to zero; basically you say, hey, I don't mind it running longer, but I don't want to use as much memory. We won't go into the technical details, but that can go to zero. And then the optimizer state: it turns out even this can be squashed very close to zero, and that was actually a breakthrough from maybe a year ago. To do
1:14:06
that, what we're going to do is say: look, there are a whole bunch of weights here, but we're only going to take the matrices inside each attention layer and look only at those. We're going to freeze everything else. So instead of unfreezing everything and updating it all, we take only a small set of parameters, unfreeze them, update them, and see if that's good enough, if it actually gets the job done. And so if you look at the weight
1:14:29
right? And so if you look at the weight
1:14:31
matrix, let's say the key AK weight
1:14:33
matrix uh in llama 2, this is a 8,000
1:14:36
roughly 8,000 by 8,000 matrix, which
1:14:38
means that there are 64 million
1:14:40
parameters inside each of these
1:14:41
matrices. 64 million. Okay. So you can
1:14:45
So you can imagine this matrix W_K here, and suppose, as a thought experiment, that you do the fine-tuning and the numbers change as a result. Then you can think of the resulting matrix as just the original matrix you had plus the changes, the original plus the changes, and we call the changes delta W_K. Of course, in general this change matrix is also going to be 8,000 by 8,000, another 64 million entries. So the question is: can we make this change matrix smaller? That seems reasonable, because a fine-tune should only make small changes to just a few weights: with a couple hundred examples of fine-tuning, hopefully only a few weights change, and maybe they don't change a whole lot, right? So the key insight
1:15:32
here is that maybe we can force this change matrix to be kind of simple and still get the job done. And it turns out you can. What you do is think of this matrix as really coming from two thin, skinny matrices which, when you multiply them, give you back the original matrix. I'm not going to get into the mathematical details here; this is called a low-rank approximation. But the point is that you can take two very small matrices, and if you multiply them the right way, you can approximate the original matrix. And these two matrices are much smaller, because each one is just 8,000 by 2, about 16,000 parameters, so the pair has only about 32,000 parameters, a tiny fraction of a percent of the original 64 million.
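The parameter-count arithmetic as a quick sketch, using 8,192 for the "roughly 8,000" dimension and a rank of 2 (both illustrative):

    import numpy as np

    d, r = 8192, 2                      # hidden size, LoRA rank
    A = np.zeros((d, r))                # "down" matrix, d x r
    B = np.zeros((r, d))                # "up" matrix, r x d
    # A @ B would reconstruct the full d x d update matrix delta_W_K
    # (in real LoRA that product is applied on the fly, never stored).

    full_params = d * d                 # ~67 million entries in W_K
    lora_params = A.size + B.size       # 8192*2 + 2*8192 = 32,768
    print(lora_params, lora_params / full_params)  # ~0.0005, i.e. about 0.05%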
1:16:23
This is called low-rank adaptation, or LoRA, and it's incredibly widely used in industry. What we do is freeze all the original parameters, initialize these change matrices so the update starts at zero, and then update just those two skinny matrices using gradient descent. And when you do that, everything fits into memory, which means the whole thing fits and you can get the job done with just, like, two GPUs.
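In practice you rarely wire this up by hand. A minimal sketch using the Hugging Face peft library; the target module names match Llama-style attention layers, and the checkpoint name and hyperparameters are illustrative (the Llama 2 weights are gated, so any causal LM checkpoint works for trying this out):

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model, TaskType

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=2,                    # rank of the two skinny matrices
        lora_alpha=16,          # scaling factor on the update
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention matrices
    )

    model = get_peft_model(model, lora_config)  # freezes the base weights
    model.print_trainable_parameters()          # a tiny fraction of 7B is trainable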
1:16:50
And if you use Llama's smaller models, like the 7-billion or 13-billion ones, they can be fine-tuned comfortably on a single Colab GPU. All right, it's 9:54 and time does not permit, so: I have a Colab on how to do the fine-tuning using this technique, and I will do a video walkthrough tomorrow or the day after. And I'm done. Thanks, folks. Have a good rest of your week. [applause]
1:17:16
Thank you.
— end of transcript —