WEBVTT

00:00:16.399 --> 00:00:22.159
Okay. So, um, so let's continue the

00:00:19.519 --> 00:00:23.679
journey we started last time. Um so what

00:00:22.160 --> 00:00:26.079
we're going to do uh you know if you

00:00:23.679 --> 00:00:27.439
remember in the last class we showed how

00:00:26.079 --> 00:00:30.000
we can actually build an auto

00:00:27.439 --> 00:00:33.200
regressive large language model, aka

00:00:30.000 --> 00:00:36.399
a causal large language model um using

00:00:33.200 --> 00:00:38.559
this idea of a causal encoder, a

00:00:36.399 --> 00:00:39.920
transformer causal encoder and then we

00:00:38.558 --> 00:00:41.679
showed how you can actually take a bunch

00:00:39.920 --> 00:00:43.760
of sentences and use next word

00:00:41.679 --> 00:00:46.640
prediction and just run it through and

00:00:43.759 --> 00:00:49.119
boom you get GPT-3. Okay, so that's what we

00:00:46.640 --> 00:00:50.799
saw last time I want to point out a sort

00:00:49.119 --> 00:00:52.799
of an important clarification slash

00:00:50.799 --> 00:00:55.359
correction which is that when we work

00:00:52.799 --> 00:00:57.759
with large language models uh unlike

00:00:55.359 --> 00:00:59.198
when we work with BERT uh for instance

00:00:57.759 --> 00:01:01.519
when we work with these kinds of causal

00:00:59.198 --> 00:01:03.358
models actually uh when the contextual

00:01:01.520 --> 00:01:05.760
embeddings come out you don't actually

00:01:03.359 --> 00:01:07.599
have to use ReLU activations here you

00:01:05.760 --> 00:01:09.118
can literally just run it through just a

00:01:07.599 --> 00:01:11.359
single dense layer with linear

00:01:09.118 --> 00:01:13.760
activations and then pass it into a

00:01:11.359 --> 00:01:15.519
softmax and boom you're done okay so

00:01:13.760 --> 00:01:18.799
that's how GPT-3 and all these models are

00:01:15.519 --> 00:01:21.039
trained u and the other thing I want to

00:01:18.799 --> 00:01:23.600
point out, which may not have been clear, is

00:01:21.040 --> 00:01:27.360
that what what is coming out of these

00:01:23.599 --> 00:01:29.919
this dense layer right this vector is as

00:01:27.359 --> 00:01:31.840
long as your vocabulary

00:01:29.920 --> 00:01:33.519
because only then when it goes into the

00:01:31.840 --> 00:01:35.118
softmax you're going to get

00:01:33.519 --> 00:01:36.959
probabilities which are as long as your

00:01:35.118 --> 00:01:39.200
vocabulary which means that you get to

00:01:36.959 --> 00:01:42.158
pick one word or token out of that

00:01:39.200 --> 00:01:45.118
entire 50,000 long vocabulary

00:01:42.159 --> 00:01:47.520
okay so so just I just want to point

00:01:45.118 --> 00:01:49.118
that out because I think it's easy for

00:01:47.519 --> 00:01:50.319
us to sort of get a little confused

00:01:49.118 --> 00:01:53.280
because of this little difference

00:01:50.319 --> 00:01:55.279
between the way uh masked language

00:01:53.280 --> 00:01:58.718
models like BERT work and causal

00:01:55.280 --> 00:02:02.718
language models like GPT-3.
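
To make that concrete, here is a minimal numpy sketch of the head just described: one dense layer with linear activations applied to a contextual embedding, then a softmax over the whole vocabulary. The sizes and weights are toy stand-ins, not GPT-3's actual dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, vocab_size = 8, 50  # toy sizes; the real model is far larger

# A contextual embedding for one position, as it comes out of the transformer stack.
h = rng.normal(size=d_model)

# A single dense layer with *linear* activations: just W @ h + b, no ReLU needed.
W = rng.normal(size=(vocab_size, d_model)) * 0.1
b = np.zeros(vocab_size)
logits = W @ h + b  # one logit per vocabulary entry

# The softmax turns this vocabulary-length vector into probabilities over all tokens.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

next_token = int(np.argmax(probs))  # pick one token out of the whole vocabulary
```

With a ~50,000-token vocabulary, this same shape of computation is what lets the model pick one word or token out of the entire vocabulary at each step.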

00:01:58.718 --> 00:02:05.759
Okay, so now let's continue with we have

00:02:02.718 --> 00:02:07.759
we know how to build GPT-3. So like what

00:02:05.759 --> 00:02:10.479
about GPT-1 and GPT-2? Like, what's up with

00:02:07.759 --> 00:02:13.840
them? Why is GPT-3 so famous and not

00:02:10.479 --> 00:02:15.200
GPT-2? Right? So it turns out, well, first of

00:02:13.840 --> 00:02:17.360
all you folks know that GPT stands for

00:02:15.199 --> 00:02:19.119
generative pre-trained transformer. Now

00:02:17.360 --> 00:02:22.000
like GPT-3,

00:02:19.120 --> 00:02:23.680
GPT-2 and GPT-1 were trained in

00:02:22.000 --> 00:02:26.000
basically the same fashion. Predict the

00:02:23.680 --> 00:02:29.280
next word uh same fashion the same sort

00:02:26.000 --> 00:02:31.680
of transformer stack, except that GPT-3

00:02:29.280 --> 00:02:33.680
was trained on much more data because

00:02:31.680 --> 00:02:36.640
the underlying transformer stack had

00:02:33.680 --> 00:02:39.599
many more layers. Okay, so it is a much

00:02:36.639 --> 00:02:41.518
bigger stack meaning lots more

00:02:39.598 --> 00:02:44.878
parameters and therefore you need lots

00:02:41.519 --> 00:02:47.200
more data to train it well. Okay, so

00:02:44.878 --> 00:02:49.679
that was really the only difference. The

00:02:47.199 --> 00:02:53.919
difference was literally one of scale,

00:02:49.680 --> 00:02:57.680
scale of network and scale of data. And

00:02:53.919 --> 00:02:59.119
unlike GPT-1 and GPT-2, GPT-3, even though it

00:02:57.680 --> 00:03:01.760
was trained basically the same way with

00:02:59.120 --> 00:03:04.158
the same kind of network, it was one of

00:03:01.759 --> 00:03:06.239
the situations where more became

00:03:04.158 --> 00:03:07.759
different. Okay, there was almost like

00:03:06.239 --> 00:03:10.319
some sort of phase change that happened

00:03:07.759 --> 00:03:14.158
between two and three. Unlike GPT-1 and

00:03:10.318 --> 00:03:16.318
GPT-2, GPT-3 could do amazingly coherent

00:03:14.158 --> 00:03:19.840
continuations of any starting prompt,

00:03:16.318 --> 00:03:21.199
right? Um so for example, if you have

00:03:19.840 --> 00:03:22.640
this little prompt which says the

00:03:21.199 --> 00:03:24.878
importance of being on Twitter by Jerome

00:03:22.639 --> 00:03:26.318
K Jerome who was a famous humorist and

00:03:24.878 --> 00:03:28.399
then you give it this prompt, right?

00:03:26.318 --> 00:03:30.000
Ending with the word it, it produces

00:03:28.400 --> 00:03:33.599
this continuation which is really like

00:03:30.000 --> 00:03:35.120
strikingly good. And if any of you have

00:03:33.598 --> 00:03:36.479
read Jerome K Jerome and if you read

00:03:35.120 --> 00:03:38.319
this thing, you'll be like, "Wow, that

00:03:36.479 --> 00:03:41.518
actually sounds like Jerome K Jerome."

00:03:38.318 --> 00:03:43.119
Right? So, amazing continuations.

00:03:41.519 --> 00:03:45.120
but the interesting thing here is not so

00:03:43.120 --> 00:03:47.519
much the continuation it's the fact that

00:03:45.120 --> 00:03:49.680
the same prompt, you give it to GPT-2 or GPT-1,

00:03:47.519 --> 00:03:51.439
it won't be very good. In

00:03:49.680 --> 00:03:52.640
fact after the first one two or three

00:03:51.439 --> 00:03:54.158
sentences it'll sort of become sort of

00:03:52.639 --> 00:03:57.039
incoherent and meander and start

00:03:54.158 --> 00:03:59.759
rambling. This thing can keep faking it

00:03:57.039 --> 00:04:02.239
for a lot longer, right? That's the

00:03:59.759 --> 00:04:05.679
amazing thing that was unexpected;

00:04:02.239 --> 00:04:07.438
researchers did not expect this. Okay, and

00:04:05.680 --> 00:04:09.040
but it wasn't good at following your

00:04:07.438 --> 00:04:10.799
instructions

00:04:09.039 --> 00:04:12.400
So for instance, if you ask it, help me

00:04:10.799 --> 00:04:14.159
write a short note, introduce myself to

00:04:12.400 --> 00:04:15.519
my neighbor. This is the kind of thing

00:04:14.158 --> 00:04:17.358
it'll come up with. And you can actually

00:04:15.519 --> 00:04:20.000
run it yourself. You can actually go to

00:04:17.358 --> 00:04:21.918
GPT-3 on the playground. I think GPT-3 is

00:04:20.000 --> 00:04:23.759
still available in the playground. If

00:04:21.918 --> 00:04:25.359
it is, you can actually try

00:04:23.759 --> 00:04:28.080
running these prompts. You will start

00:04:25.360 --> 00:04:29.919
getting garbage very quickly, right? And

00:04:28.079 --> 00:04:31.519
the reason so for example here, help me

00:04:29.918 --> 00:04:33.839
write a short note. It says, what's a

00:04:31.519 --> 00:04:35.680
good introduction to a resume? "Résumé" for

00:04:33.839 --> 00:04:38.319
some reason has gotten garbled down to "resume."

00:04:35.680 --> 00:04:39.918
I have no idea why. Right? But the

00:04:38.319 --> 00:04:42.399
reason it's doing stuff like this is

00:04:39.918 --> 00:04:44.000
because a lot of the training data it

00:04:42.399 --> 00:04:46.159
was trained on are basically lots of

00:04:44.000 --> 00:04:49.040
lists of things.

00:04:46.160 --> 00:04:52.000
So when you say for example um you know

00:04:49.040 --> 00:04:53.840
the capital of Paris, continue,

00:04:52.000 --> 00:04:55.279
it'll come back with the capital sorry

00:04:53.839 --> 00:04:56.799
the capital of France, continue, it'll say

00:04:55.279 --> 00:04:58.559
the capital of France is Paris the

00:04:56.800 --> 00:04:59.919
capital of you know uh Hungary is

00:04:58.560 --> 00:05:02.319
Budapest and so on. It just start coming

00:04:59.918 --> 00:05:04.399
up with a list. So it's sort of very

00:05:02.319 --> 00:05:06.000
list driven, right? It thinks that you

00:05:04.399 --> 00:05:07.839
need to complete some sort of list,

00:05:06.000 --> 00:05:09.600
right? That's what's going on here. And

00:05:07.839 --> 00:05:10.799
so it's not very good. So it doesn't

00:05:09.600 --> 00:05:12.960
realize that you're actually asking it

00:05:10.800 --> 00:05:14.560
to do something specific.

00:05:12.959 --> 00:05:17.038
So this is the problem when you have an

00:05:14.560 --> 00:05:18.720
autocomplete thing which doesn't realize

00:05:17.038 --> 00:05:20.719
what you're asking it. It just thinks

00:05:18.720 --> 00:05:24.080
that it's just an autocomplete.

00:05:20.720 --> 00:05:25.600
So um now in addition to these unhelpful

00:05:24.079 --> 00:05:27.038
answers, it can also produce offensive

00:05:25.600 --> 00:05:28.960
answers, factually incorrect answers and

00:05:27.038 --> 00:05:32.079
so on and so forth. The list of bad

00:05:28.959 --> 00:05:33.599
things it can do is long. So why does it

00:05:32.079 --> 00:05:35.758
do that? Why does it produce unhelpful

00:05:33.600 --> 00:05:37.120
answers? Well, you know, as you recall,

00:05:35.759 --> 00:05:39.120
it was only trained to predict the next

00:05:37.120 --> 00:05:41.439
word. It wasn't explicitly trained to

00:05:39.120 --> 00:05:44.720
follow instructions, right? So, it

00:05:41.439 --> 00:05:46.160
seems, you know, reasonable that if it's

00:05:44.720 --> 00:05:48.800
simply trying to guess the next word

00:05:46.160 --> 00:05:50.880
repeatedly, it can't really do anything

00:05:48.800 --> 00:05:52.079
more. Like, how can it figure out that

00:05:50.879 --> 00:05:54.800
there's an instruction that it needs to

00:05:52.079 --> 00:05:57.120
follow, right? Unless the training data

00:05:54.800 --> 00:05:59.918
on the net was all instructional, which

00:05:57.120 --> 00:06:02.959
it clearly is not.

00:05:59.918 --> 00:06:04.560
So light bulb idea, right? Let's

00:06:02.959 --> 00:06:06.399
explicitly train it with instruction

00:06:04.560 --> 00:06:07.680
data,

00:06:06.399 --> 00:06:10.478
right? Let's just train it with

00:06:07.680 --> 00:06:12.079
instruction data. And so OpenAI

00:06:10.478 --> 00:06:15.758
developed an approach called instruction

00:06:12.079 --> 00:06:18.719
tuning to do exactly this. Um, and this

00:06:15.759 --> 00:06:20.720
paper is the paper that sort of was the

00:06:18.720 --> 00:06:24.160
breakthrough. Okay, this is what

00:06:20.720 --> 00:06:25.600
actually put ChatGPT on the map. So, and

00:06:24.160 --> 00:06:26.800
it's very readable. So, I would

00:06:25.600 --> 00:06:28.879
encourage you to check it out if you're

00:06:26.800 --> 00:06:33.280
curious.

00:06:28.879 --> 00:06:34.478
And so we had GPT-1, GPT-2, GPT-3, you

00:06:33.279 --> 00:06:36.079
know, just bigger and bigger models

00:06:34.478 --> 00:06:37.439
trained the same way. And then we run

00:06:36.079 --> 00:06:39.199
into the problem that it can't handle

00:06:37.439 --> 00:06:41.519
instructions. So we do instruction

00:06:39.199 --> 00:06:43.840
tuning to get to GPT-3.5, also called

00:06:41.519 --> 00:06:46.478
InstructGPT. And then a small tweak

00:06:43.839 --> 00:06:48.560
after that gets you chat GPT. Okay. And

00:06:46.478 --> 00:06:50.079
by the way, this step here, there are

00:06:48.560 --> 00:06:52.000
really two things going on in this as

00:06:50.079 --> 00:06:53.839
you will soon see. I'm just calling it

00:06:52.000 --> 00:06:55.360
instruction tuning just to so that I

00:06:53.839 --> 00:06:58.239
don't have to say some long thing every

00:06:55.360 --> 00:06:59.919
single time. This is not a consistent

00:06:58.240 --> 00:07:03.918
piece of terminology. So just just

00:06:59.918 --> 00:07:06.799
be aware of that, that's all. So all right,

00:07:03.918 --> 00:07:09.359
first step they got a bunch of people to

00:07:06.800 --> 00:07:11.918
write high-quality answers to questions

00:07:09.360 --> 00:07:14.080
and they created about 12,500 such

00:07:11.918 --> 00:07:15.758
question answer pairs. So for example

00:07:14.079 --> 00:07:17.199
let's say this was the question explain

00:07:15.759 --> 00:07:19.840
the moon landing to a six-year-old in a

00:07:17.199 --> 00:07:21.759
few sentences. Believe it or not, GPT-3's

00:07:19.839 --> 00:07:23.439
answer to that question was another

00:07:21.759 --> 00:07:24.879
question

00:07:23.439 --> 00:07:27.199
because it thinks there's a list of

00:07:24.879 --> 00:07:28.560
questions it needs to autocomplete, right?

00:07:27.199 --> 00:07:30.400
So, it comes up with explain the theory

00:07:28.560 --> 00:07:31.439
of gravity to a six-year-old. It's like one

00:07:30.399 --> 00:07:32.879
of those people when you ask them a

00:07:31.439 --> 00:07:35.199
question, they ask you a question back,

00:07:32.879 --> 00:07:36.399
right? So, what they did is they

00:07:35.199 --> 00:07:38.160
said, "Okay, let's create a nice answer

00:07:36.399 --> 00:07:39.758
to this question." And here's a human

00:07:38.160 --> 00:07:41.280
created answer. People went to the moon

00:07:39.759 --> 00:07:43.759
in a big rocket, walked around, blah

00:07:41.279 --> 00:07:46.559
blah blah, right? Much better answer to

00:07:43.759 --> 00:07:48.960
that question. And so once you create

00:07:46.560 --> 00:07:52.000
these 12,500 question answer pairs as

00:07:48.959 --> 00:07:56.079
training data, we just trained GPT-3 some

00:07:52.000 --> 00:07:59.199
more using next-word prediction as before.

00:07:56.079 --> 00:08:00.639
No difference. So here is the input:

00:07:59.199 --> 00:08:02.560
explain the moon landing blah blah blah

00:08:00.639 --> 00:08:05.840
blah. This is the question and then we

00:08:02.560 --> 00:08:07.918
have the answer right there. And then we

00:08:05.839 --> 00:08:10.719
we take that answer, move it to the

00:08:07.918 --> 00:08:13.758
right and just shift it up

00:08:10.720 --> 00:08:16.000
so that when it finishes sentences, it

00:08:13.759 --> 00:08:17.759
needs to predict people. And then you

00:08:16.000 --> 00:08:20.079
give it people, it needs to predict went

00:08:17.759 --> 00:08:22.560
and so on and so forth. Just like we saw

00:08:20.079 --> 00:08:25.359
before: "the cat sat on the mat" became

00:08:22.560 --> 00:08:27.360
"the cat sat on the" as input and "cat sat

00:08:25.360 --> 00:08:30.080
on the mat" as the right-shifted target, right? That's what

00:08:27.360 --> 00:08:31.598
makes prediction possible and necessary.
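
At the token level, the shifting just described can be sketched like this (a hypothetical toy tokenization, one word per token):

```python
# Build (input, target) pairs for next-word prediction: the target sequence is the
# input shifted one position to the right, so at every step the model must predict
# the next token given everything before it.
def shift_for_next_word(tokens):
    return tokens[:-1], tokens[1:]

# Hypothetical toy tokenization of a question-answer pair, one word per token.
qa = ["Explain", "the", "moon", "landing", ":", "People", "went", "to", "the", "moon"]
inputs, targets = shift_for_next_word(qa)
# After seeing up to ":", the model must predict "People"; given "People", predict
# "went"; and so on, exactly like "the cat sat on the mat" from last time.
```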

00:08:30.079 --> 00:08:35.360
So that's what they did. This is

00:08:31.598 --> 00:08:37.838
step one. Okay, same as before.

00:08:35.360 --> 00:08:39.680
And once you do that, it turns out this

00:08:37.839 --> 00:08:42.000
step is called supervised fine-tuning.

00:08:39.679 --> 00:08:44.000
It really helped. GPT-3, once you

00:08:42.000 --> 00:08:45.600
supervised fine-tuned it, was much, much

00:08:44.000 --> 00:08:46.958
better at following instructions. But

00:08:45.600 --> 00:08:49.278
there's a small problem with this

00:08:46.958 --> 00:08:51.518
approach. It takes a lot of money and

00:08:49.278 --> 00:08:53.759
effort to have humans write high-quality

00:08:51.519 --> 00:08:56.480
answers to thousands of questions,

00:08:53.759 --> 00:08:59.120
right? It takes a lot of money. So the

00:08:56.480 --> 00:09:01.200
question is, what can we do, right? What

00:08:59.120 --> 00:09:03.278
is easier than writing a good answer to

00:09:01.200 --> 00:09:07.519
a question?

00:09:03.278 --> 00:09:11.320
Well, what? Okay. Uh, all right. Uh, how

00:09:07.519 --> 00:09:11.320
about somebody from this side?

00:09:11.440 --> 00:09:15.600
>> Yeah, Joseph.

00:09:13.519 --> 00:09:16.560
>> Perhaps writing a question for an

00:09:15.600 --> 00:09:17.920
answer.

00:09:16.559 --> 00:09:19.679
>> Oh, that's actually a good one. Yeah.

00:09:17.919 --> 00:09:22.000
Yeah, I like that. Um, so given an

00:09:19.679 --> 00:09:23.439
answer, find find a question. And while

00:09:22.000 --> 00:09:25.039
that is not what I'm going to talk about

00:09:23.440 --> 00:09:27.680
here, that technique is actually used

00:09:25.039 --> 00:09:29.519
very heavily in LLMs. Uh, and so but

00:09:27.679 --> 00:09:31.199
that that's great. Very creative. Uh

00:09:29.519 --> 00:09:32.560
Mark,

00:09:31.200 --> 00:09:33.200
>> thumbs up. Thumbs down.

00:09:32.559 --> 00:09:34.479
>> Sorry.

00:09:33.200 --> 00:09:36.320
>> Thumbs up or thumbs down?

00:09:34.480 --> 00:09:38.320
>> Thumbs up or thumbs down. Exactly.

00:09:36.320 --> 00:09:40.800
Because all of us, everyone loves to be

00:09:38.320 --> 00:09:43.440
a critic. It's much better easier to be

00:09:40.799 --> 00:09:46.240
a critic than to be a creator. Right. So

00:09:43.440 --> 00:09:48.959
what do we do? We basically say, let's

00:09:46.240 --> 00:09:50.720
rank answers written by somebody else.

00:09:48.958 --> 00:09:53.199
Which begs the question, who's going to

00:09:50.720 --> 00:09:54.639
write those answers? And that's where

00:09:53.200 --> 00:09:57.440
there's a brilliant answer to that

00:09:54.639 --> 00:09:59.360
question which is

00:09:57.440 --> 00:10:02.360
Wikipedia,

00:09:59.360 --> 00:10:02.360
Reddit.

00:10:04.000 --> 00:10:08.159
We will just ask GPT-3 to write the

00:10:06.080 --> 00:10:10.000
answers.

00:10:08.159 --> 00:10:12.958
It might be crap, but we don't care

00:10:10.000 --> 00:10:15.519
because we can rank them.

00:10:12.958 --> 00:10:17.919
So we ask GPT-3 to generate several

00:10:15.519 --> 00:10:19.360
answers to the question. And how can we

00:10:17.919 --> 00:10:21.199
generate several answers? Because we can

00:10:19.360 --> 00:10:23.200
do sampling.

00:10:21.200 --> 00:10:25.200
We can do sampling.
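
A sketch of that sampling step (toy logits, not a real model); dividing by a temperature around 1 keeps the softmax flat enough that repeated runs give different answers:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_token(logits, temperature=1.0):
    """Sample one token id from softmax(logits / temperature)."""
    z = np.asarray(logits, dtype=float) / temperature
    p = np.exp(z - z.max())
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

# Toy logits over a 5-token vocabulary. Because sampling is stochastic, running
# the same "model" on the same question several times yields several different
# candidate answers, which is exactly what gets handed to humans to rank.
logits = [2.0, 1.5, 1.0, 0.2, -1.0]
answers = [sample_token(logits, temperature=1.1) for _ in range(3)]
```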

00:10:23.200 --> 00:10:27.200
The fact that we had these stochastic

00:10:25.200 --> 00:10:30.480
outputs because of sampling is now a

00:10:27.200 --> 00:10:32.240
feature, not a bug. Okay, we create lots

00:10:30.480 --> 00:10:33.920
of different answers to the question. We

00:10:32.240 --> 00:10:36.000
feed it a question, get like three

00:10:33.919 --> 00:10:37.599
answers out. Just run it three times,

00:10:36.000 --> 00:10:39.278
get three answers out with a nice

00:10:37.600 --> 00:10:41.120
temperature of like one or 1.1 or

00:10:39.278 --> 00:10:43.679
something so that it's nice and random,

00:10:41.120 --> 00:10:45.679
right? Um, and then we literally have

00:10:43.679 --> 00:10:47.278
humans just rank them, do the thumbs up,

00:10:45.679 --> 00:10:51.120
thumbs down, just rank them from most

00:10:47.278 --> 00:10:53.200
useful to least useful. Okay, so this

00:10:51.120 --> 00:10:55.120
step is a step two of instruction

00:10:53.200 --> 00:10:57.759
tuning. So OpenAI collected 33,000

00:10:55.120 --> 00:11:00.560
instructions, fed them to GPT-3, generated

00:10:57.759 --> 00:11:03.439
answers and had humans rank them. And

00:11:00.559 --> 00:11:05.278
once you do this, you

00:11:03.440 --> 00:11:07.519
can assemble a beautiful training data

00:11:05.278 --> 00:11:09.200
set, right? And so basically what we

00:11:07.519 --> 00:11:10.799
have is that we have an instruction and

00:11:09.200 --> 00:11:12.879
let's say we have just two answers A and

00:11:10.799 --> 00:11:14.879
B. And in practice you can have

00:11:12.879 --> 00:11:16.879
many answers, which we rank, but just

00:11:14.879 --> 00:11:18.720
for simplicity I'll go with Mark's

00:11:16.879 --> 00:11:19.919
thumbs up thumbs down sort of answer

00:11:18.720 --> 00:11:22.240
which is let's assume only you have two

00:11:19.919 --> 00:11:24.159
answers to every question right and so

00:11:22.240 --> 00:11:26.480
and the human has said I prefer this to

00:11:24.159 --> 00:11:28.958
that that's it right so we have a data

00:11:26.480 --> 00:11:31.278
set now where the data point is

00:11:28.958 --> 00:11:36.000
instruction preferred answer is A the

00:11:31.278 --> 00:11:38.720
other answer is B yeah

00:11:36.000 --> 00:11:40.958
>> Um, the thumbs up / thumbs down

00:11:38.720 --> 00:11:42.560
technique that we're talking about, is that why

00:11:40.958 --> 00:11:44.319
we're now also shown thumbs up /

00:11:42.559 --> 00:11:45.518
thumbs down? It's using our

00:11:44.320 --> 00:11:46.560
answers to train?

00:11:45.519 --> 00:11:48.240
>> Exactly. Right.

00:11:46.559 --> 00:11:49.439
>> Yeah. So yeah, all the models have the

00:11:48.240 --> 00:11:51.519
thumbs up thumbs down stuff going on

00:11:49.440 --> 00:11:53.120
somewhere. They are all collecting data

00:11:51.519 --> 00:11:53.600
for this step.

00:11:53.120 --> 00:11:55.679
>> Thank you.

00:11:53.600 --> 00:11:57.200
>> Yeah. It's sort of the old adage, right?

00:11:55.679 --> 00:11:59.199
Uh if you're not sure who the product

00:11:57.200 --> 00:12:02.680
is, you are the product. So it's one of

00:11:59.200 --> 00:12:02.680
those things. Yeah.

00:12:07.519 --> 00:12:12.240
So if we understand correctly when we

00:12:09.600 --> 00:12:16.320
see thumbs up thumbs down it does mean

00:12:12.240 --> 00:12:16.639
that ChatGPT is going to train on our data,

00:12:16.320 --> 00:12:19.200
right

00:12:16.639 --> 00:12:20.720
>> unless you opt out. Yeah. So if you

00:12:19.200 --> 00:12:22.079
actually go to the ChatGPT controls, there

00:12:20.720 --> 00:12:24.879
is something called data controls or

00:12:22.078 --> 00:12:26.479
something you can toggle it to off but I

00:12:24.879 --> 00:12:29.759
think when I last checked if you toggle

00:12:26.480 --> 00:12:31.120
it to off you lose your chat history. So

00:12:29.759 --> 00:12:33.120
they have hobbled that feature to

00:12:31.120 --> 00:12:37.600
prevent people from setting it to off as

00:12:33.120 --> 00:12:39.440
much as possible. Yeah, clever.

00:12:37.600 --> 00:12:41.040
But you can opt out and if you use the

00:12:39.440 --> 00:12:43.360
API as opposed to the web interface,

00:12:41.039 --> 00:12:45.360
you're automatically opted out. So you

00:12:43.360 --> 00:12:46.720
have to deliberately opt in. And if you

00:12:45.360 --> 00:12:48.399
use the versions that are available

00:12:46.720 --> 00:12:50.320
through Microsoft Azure and so on and so

00:12:48.399 --> 00:12:51.759
forth, there are all kinds of very safe

00:12:50.320 --> 00:12:54.079
controls and stuff like that. In fact, I

00:12:51.759 --> 00:12:56.799
think the Microsoft co-pilot license

00:12:54.078 --> 00:12:58.879
that MIT has uh I think the default is

00:12:56.799 --> 00:13:01.439
opted out.

00:12:58.879 --> 00:13:02.799
Okay. So to go back here, once you have this

00:13:01.440 --> 00:13:05.680
data point, you can build something

00:13:02.799 --> 00:13:08.000
called a reward model. Okay. And this is

00:13:05.679 --> 00:13:10.479
a very clever piece of work. So what you

00:13:08.000 --> 00:13:12.399
do is you have an instruction, right?

00:13:10.480 --> 00:13:15.920
You have a preferred answer and you have

00:13:12.399 --> 00:13:18.320
the other answer. You feed it to a

00:13:15.919 --> 00:13:20.319
network. Okay? You feed it to a network.

00:13:18.320 --> 00:13:23.200
This is just a nice language model,

00:13:20.320 --> 00:13:25.760
right? It's just a language model. And

00:13:23.200 --> 00:13:28.320
the language model produces a number

00:13:25.759 --> 00:13:30.480
which measures how good this thing is,

00:13:28.320 --> 00:13:32.480
right? How good an answer is this to

00:13:30.480 --> 00:13:34.639
that particular instruction. So you get

00:13:32.480 --> 00:13:38.320
two you get a rating here, you get a

00:13:34.639 --> 00:13:41.519
rating here and then what you do is you

00:13:38.320 --> 00:13:43.040
run it through a little loss function

00:13:41.519 --> 00:13:45.839
which

00:13:43.039 --> 00:13:50.000
essentially encourages the model to give

00:13:45.839 --> 00:13:51.680
higher numbers to the better answer.

00:13:50.000 --> 00:13:53.278
It's the same model. You just run the

00:13:51.679 --> 00:13:54.799
the question and the first answer,

00:13:53.278 --> 00:13:56.720
question and the second answer. You get

00:13:54.799 --> 00:13:59.120
these two numbers. And then initially

00:13:56.720 --> 00:14:00.480
those numbers are just random. But then

00:13:59.120 --> 00:14:02.078
you tell the model, hey, this is the

00:14:00.480 --> 00:14:03.600
preferred thing. Make sure the preferred

00:14:02.078 --> 00:14:06.638
answers

00:14:03.600 --> 00:14:08.959
uh rating, the R value, is higher than the

00:14:06.639 --> 00:14:12.000
other number because more is better.

00:14:08.958 --> 00:14:13.838
Higher is better. Okay? And you can

00:14:12.000 --> 00:14:15.919
actually, since this thing is

00:14:13.839 --> 00:14:16.959
just a sigmoid here, right? It's

00:14:15.919 --> 00:14:18.240
basically take the difference of these

00:14:16.958 --> 00:14:20.479
two things, do a sigmoid and take the

00:14:18.240 --> 00:14:22.320
logarithm and you can actually convince

00:14:20.480 --> 00:14:25.600
yourself afterwards and I encourage you

00:14:22.320 --> 00:14:28.480
to do that to to check for yourself that

00:14:25.600 --> 00:14:30.879
if we actually

00:14:28.480 --> 00:14:34.159
give a higher number to the better

00:14:30.879 --> 00:14:36.320
answer the loss will be lower and since

00:14:34.159 --> 00:14:38.958
we are minimizing loss we're essentially

00:14:36.320 --> 00:14:41.760
training the network to always to try to

00:14:38.958 --> 00:14:43.919
give higher ratings to better answers
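
The loss just described, difference, sigmoid, logarithm, can be checked numerically. A minimal sketch with made-up ratings:

```python
import math

def pairwise_reward_loss(r_preferred, r_other):
    """-log(sigmoid(r_preferred - r_other)): low when the preferred answer rates higher."""
    diff = r_preferred - r_other
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# The ratings here are made-up numbers. When the reward model rates the preferred
# answer above the other one, the loss is smaller, so minimizing it with SGD
# trains the model to give higher ratings to better answers.
good_ordering = pairwise_reward_loss(3.2, 1.5)  # preferred answer rated higher
bad_ordering = pairwise_reward_loss(1.5, 3.2)   # preferred answer rated lower
```

Running both cases confirms that `good_ordering` comes out lower than `bad_ordering`, which is the check the lecture encourages you to do for yourself.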

00:14:41.759 --> 00:14:46.639
that's it so that's the approach uh did

00:14:43.919 --> 00:14:49.360
you have a yeah Ben

00:14:46.639 --> 00:14:50.959
So you could imagine training

00:14:49.360 --> 00:14:52.720
the model on only the good

00:14:50.958 --> 00:14:54.078
answers. Is the idea of having both that

00:14:52.720 --> 00:14:54.720
the model is actually learning what

00:14:54.078 --> 00:14:56.719
makes a good answer?

00:14:54.720 --> 00:14:58.480
>> correct. Exactly. Much like if you want

00:14:56.720 --> 00:15:01.360
to build a dog cat classifier, you have

00:14:58.480 --> 00:15:02.879
to show pictures of both.

00:15:01.360 --> 00:15:05.278
>> Yeah.

00:15:02.879 --> 00:15:06.799
>> So u I understand the feedback mechanism

00:15:05.278 --> 00:15:10.078
of thumbs up thumbs down but there are a

00:15:06.799 --> 00:15:12.879
lot of times when the popular response

00:15:10.078 --> 00:15:15.759
is not the accurate one. So uh is there

00:15:12.879 --> 00:15:16.240
a way that they actually have a layer to

00:15:15.759 --> 00:15:18.958
correct?

00:15:16.240 --> 00:15:22.320
>> Yeah, good question Swati. So uh as it

00:15:18.958 --> 00:15:24.719
turns out um the all these companies

00:15:22.320 --> 00:15:27.440
like OpenAI, they have like a huge

00:15:24.720 --> 00:15:30.000
document 100 200 pages longs you know

00:15:27.440 --> 00:15:32.800
very very bulky document which instructs

00:15:30.000 --> 00:15:34.720
and teaches the labelers the rankers to

00:15:32.799 --> 00:15:36.479
how to rank these things. So they have

00:15:34.720 --> 00:15:38.800
to follow these very strict guidelines

00:15:36.480 --> 00:15:40.639
to precisely handle like strange corner

00:15:38.799 --> 00:15:43.439
cases and things like that. And that

00:15:40.639 --> 00:15:44.879
document is on the web. You can dig it

00:15:43.440 --> 00:15:46.959
up, right? And it's actually very

00:15:44.879 --> 00:15:48.078
instructive to read through it, right? I

00:15:46.958 --> 00:15:49.039
think they put it out on the web because

00:15:48.078 --> 00:15:50.879
they wanted to convince people that

00:15:49.039 --> 00:15:52.078
they're going to inordinate trouble to

00:15:50.879 --> 00:15:55.600
make sure the rankings are actually

00:15:52.078 --> 00:16:00.000
good. U do you have a question? Comment.

00:15:55.600 --> 00:16:03.278
Okay. All right. So um so back to this

00:16:00.000 --> 00:16:04.879
and how how do you train this thing? SGD

00:16:03.278 --> 00:16:06.639
because you have a network it's coming

00:16:04.879 --> 00:16:08.639
up with an answer you have some way to

00:16:06.639 --> 00:16:10.639
know if that answer is good or bad right

00:16:08.639 --> 00:16:12.480
better answers of lower loss back

00:16:10.639 --> 00:16:13.839
propagation through the network keep

00:16:12.480 --> 00:16:15.519
updating the weights and boom you're

00:16:13.839 --> 00:16:18.800
done

00:16:15.519 --> 00:16:21.198
okay and once you do that this reward

00:16:18.799 --> 00:16:24.000
model can provide a numerical rating for

00:16:21.198 --> 00:16:25.198
any any instruction answer pair you just

00:16:24.000 --> 00:16:27.120
give it an instruction you give it an

00:16:25.198 --> 00:16:28.240
answer right could be a crappy answer

00:16:27.120 --> 00:16:31.679
good answer it just tells you how good

00:16:28.240 --> 00:16:32.959
it is which means right So in this case

00:16:31.679 --> 00:16:35.919
for example maybe it's going to give you

00:16:32.958 --> 00:16:38.559
like a nice number, 1.5, for this

00:16:35.919 --> 00:16:41.599
answer, but then a better answer

00:16:38.559 --> 00:16:44.719
comes along and gets a 3.2,

00:16:41.600 --> 00:16:46.959
right what we have done by doing this

00:16:44.720 --> 00:16:49.278
whole thing this modeling is that we

00:16:46.958 --> 00:16:51.838
have essentially we have learned how

00:16:49.278 --> 00:16:53.759
humans rank responses

00:16:51.839 --> 00:16:55.279
because we can only have humans rank

00:16:53.759 --> 00:16:58.240
responses for some finite number of

00:16:55.278 --> 00:17:00.399
questions. What we really want to do is

00:16:58.240 --> 00:17:02.159
to do this to automate that ranking

00:17:00.399 --> 00:17:03.600
process so that we can just do it for

00:17:02.159 --> 00:17:05.599
like tens of thousands of questions

00:17:03.600 --> 00:17:07.519
really fast. Right? So we have

00:17:05.599 --> 00:17:10.879
essentially built a model of how humans

00:17:07.519 --> 00:17:12.160
rank things, right? Which is beautiful.

00:17:10.880 --> 00:17:13.439
A lot of the stuff here is all very

00:17:12.160 --> 00:17:15.519
self-referential, which I find very

00:17:13.439 --> 00:17:18.079
elegant. Anyway, so this can be used to

00:17:15.519 --> 00:17:20.798
improve GPT-3 even further. So we take the

00:17:18.078 --> 00:17:23.438
instruction as before, we feed it. It

00:17:20.798 --> 00:17:25.918
gives you some answer and then we feed

00:17:23.439 --> 00:17:28.319
this instruction and the answer to our

00:17:25.919 --> 00:17:30.799
newly minted reward model. It gives us a

00:17:28.318 --> 00:17:32.879
numerical rating and then this is the

00:17:30.798 --> 00:17:35.839
key step. We take this numerical rating

00:17:32.880 --> 00:17:37.840
and then we use this rating to nudge the

00:17:35.839 --> 00:17:41.519
internal weights of GPT-3 in the right

00:17:37.839 --> 00:17:43.199
direction. Right? This nudging

00:17:41.519 --> 00:17:44.720
uses a technique called reinforcement

00:17:43.200 --> 00:17:46.319
learning.

00:17:44.720 --> 00:17:49.200
Right? Which just in the interest of

00:17:46.319 --> 00:17:51.678
time we can't get into in this lecture.

00:17:49.200 --> 00:17:52.720
But that that's a technique you use to

00:17:51.679 --> 00:17:54.559
nudge these things in the right

00:17:52.720 --> 00:17:56.640
direction.

00:17:54.558 --> 00:17:58.319
So that's what we do. That's

00:17:56.640 --> 00:18:01.520
reinforcement learning. We nudge it in

00:17:58.319 --> 00:18:04.879
the right direction.
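The "nudge in the right direction" loop can be caricatured in a few lines. This is a toy hill-climb, not PPO: the names, the one-parameter "policy," and the reward function below are all hypothetical stand-ins, meant only to show how a numerical rating from a reward model can steer training:

```python
import random

def nudge_weights(generate, reward_model, weights, steps, lr):
    """Toy outer loop of the RLHF step: sample candidate answers, let the
    reward model pick the best, and nudge the weights toward it. Real
    systems use policy-gradient methods over billions of transformer
    weights; here 'weights' is a single number, purely for illustration."""
    for _ in range(steps):
        candidates = [generate(weights) for _ in range(4)]
        best = max(candidates, key=reward_model)  # reward model rates answers
        weights += lr * (best - weights)          # nudge toward what rated well
    return weights

# Hypothetical setup: the "model" emits a number near its weight, and the
# reward model prefers answers near 3.0
random.seed(0)
generate = lambda w: w + random.gauss(0.0, 0.5)
reward_model = lambda answer: -(answer - 3.0) ** 2
tuned = nudge_weights(generate, reward_model, weights=0.0, steps=500, lr=0.1)
```

After many nudges the "policy" drifts toward answers the reward model rates highly, which is the whole point of the step described above.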

00:18:01.519 --> 00:18:07.279
And OpenAI did this with 31,000

00:18:04.880 --> 00:18:09.919
questions.

00:18:07.279 --> 00:18:11.678
Okay. Nudge, nudge, nudge, nudge, nudge.

00:18:09.919 --> 00:18:13.759
And when you do that, you get GPT-3.5,

00:18:11.679 --> 00:18:18.320
aka InstructGPT.

00:18:13.759 --> 00:18:20.960
Okay. Uh that's it. And now by the way

00:18:18.319 --> 00:18:22.480
this step here is called reinforcement

00:18:20.960 --> 00:18:24.480
learning with human feedback because we

00:18:22.480 --> 00:18:26.240
use reinforcement learning, and since humans

00:18:24.480 --> 00:18:28.160
rank the answers, which led to the

00:18:26.240 --> 00:18:29.759
building of the reward model we get

00:18:28.160 --> 00:18:30.798
human feedback. Okay, that's

00:18:29.759 --> 00:18:33.200
reinforcement learning with human

00:18:30.798 --> 00:18:34.639
feedback. Yeah.

00:18:33.200 --> 00:18:37.360
>> Yeah. I have [clears throat] a question

00:18:34.640 --> 00:18:39.759
regarding the the type of questions that

00:18:37.359 --> 00:18:42.079
they're using. I can imagine like maybe

00:18:39.759 --> 00:18:44.400
there are very simple questions to

00:18:42.079 --> 00:18:47.439
answer because I'm thinking now you can

00:18:44.400 --> 00:18:49.440
ask GPT, for example, to respond to this as

00:18:47.440 --> 00:18:51.679
a pirate or something like that that is

00:18:49.440 --> 00:18:54.240
kind of it's going to be harder to train

00:18:51.679 --> 00:18:56.080
if you have a bunch of questions that are

00:18:54.240 --> 00:18:57.679
having like small interactions and then

00:18:56.079 --> 00:18:59.279
there is the question like

00:18:57.679 --> 00:19:01.280
>> that's a good question. So the quality

00:18:59.279 --> 00:19:03.279
of the questions in the data set clearly

00:19:01.279 --> 00:19:05.839
is a big factor because if you have

00:19:03.279 --> 00:19:07.918
simplistic questions, it won't be

00:19:05.839 --> 00:19:09.918
able to handle complex questions later

00:19:07.919 --> 00:19:12.400
on. So, it's a good question, and that

00:19:09.919 --> 00:19:14.559
actually raises

00:19:12.400 --> 00:19:16.880
the question of where did they get these

00:19:14.558 --> 00:19:20.079
questions from

00:19:16.880 --> 00:19:23.520
so they actually got it from their API.

00:19:20.079 --> 00:19:25.119
So people were asking GPT-3 questions

00:19:23.519 --> 00:19:26.798
through the API right before it became

00:19:25.119 --> 00:19:28.159
3.5. The API was already fully,

00:19:26.798 --> 00:19:29.599
commercially available, and a lot of

00:19:28.160 --> 00:19:31.679
people were building

00:19:29.599 --> 00:19:33.439
products on it already by then. And so

00:19:31.679 --> 00:19:35.440
they collected all those questions and

00:19:33.440 --> 00:19:37.279
filtered them for quality and that was

00:19:35.440 --> 00:19:39.360
the question set that they used and then

00:19:37.279 --> 00:19:41.519
they judiciously added to it with human

00:19:39.359 --> 00:19:43.038
created questions but they couldn't do a

00:19:41.519 --> 00:19:44.960
lot of that because it's expensive to do

00:19:43.038 --> 00:19:46.798
that but collecting stuff that somebody

00:19:44.960 --> 00:19:49.120
else is asking your API already very

00:19:46.798 --> 00:19:50.879
easy

00:19:49.119 --> 00:19:52.000
Yeah, Tomaso,

00:19:50.880 --> 00:19:54.400
>> uh, this might be more of a

00:19:52.000 --> 00:19:56.640
philosophical question, but, uh, the

00:19:54.400 --> 00:19:58.320
human bias that's present in the small

00:19:56.640 --> 00:20:00.799
subset of human labelers that they've

00:19:58.319 --> 00:20:03.119
chosen gets eventually compounded in

00:20:00.798 --> 00:20:04.798
this model that we often consider as the

00:20:03.119 --> 00:20:06.079
source of objective truth.

00:20:04.798 --> 00:20:08.319
>> Yes.

00:20:06.079 --> 00:20:09.918
>> Yeah, that's very true. Um I think the

00:20:08.319 --> 00:20:12.480
the reward model probably very

00:20:09.919 --> 00:20:14.480
faithfully learns all the biases of the

00:20:12.480 --> 00:20:17.519
human labelers which is why they have

00:20:14.480 --> 00:20:19.599
these very complex u sort of frameworks

00:20:17.519 --> 00:20:21.519
and guidelines to try to prevent the

00:20:19.599 --> 00:20:22.959
bias from happening to mitigate it. So

00:20:21.519 --> 00:20:25.200
for example they might give the same

00:20:22.960 --> 00:20:28.240
question and set of possible answers to

00:20:25.200 --> 00:20:30.480
many many different labelers and only if

00:20:28.240 --> 00:20:33.679
labelers pick the same ranking will they

00:20:30.480 --> 00:20:36.240
use it, so that at least inter-labeler

00:20:33.679 --> 00:20:37.679
bias can be minimized right but if

00:20:36.240 --> 00:20:39.359
everybody's sort of biased in the same

00:20:37.679 --> 00:20:41.519
direction it won't protect you against

00:20:39.359 --> 00:20:43.199
that. Um so yeah in general there's a

00:20:41.519 --> 00:20:44.720
whole work that's being done to try to

00:20:43.200 --> 00:20:46.480
debias these things and build them

00:20:44.720 --> 00:20:48.000
without you know too much bias in them.
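The inter-labeler agreement filter described above can be sketched as follows. The data shapes are hypothetical, just to illustrate keeping only prompts where labelers produced the same ranking:

```python
def filter_by_agreement(labeled_items, min_agreement=1.0):
    """Keep only prompts where the labelers agree on the ranking.
    Each item maps a prompt to the list of rankings it received;
    a ranking is a tuple of answer ids, best first."""
    kept = {}
    for prompt, rankings in labeled_items.items():
        top = max(set(rankings), key=rankings.count)       # most common ranking
        agreement = rankings.count(top) / len(rankings)    # fraction who chose it
        if agreement >= min_agreement:
            kept[prompt] = top
    return kept

# Three labelers each rank answers A/B/C for two prompts
data = {
    "q1": [("A", "B", "C"), ("A", "B", "C"), ("A", "B", "C")],  # unanimous
    "q2": [("B", "A", "C"), ("C", "A", "B"), ("A", "B", "C")],  # disagreement
}
clean = filter_by_agreement(data)  # only the unanimous prompt survives
```

Lowering `min_agreement` would trade labeler consistency for more training data, but, as noted above, no threshold protects against biases shared by every labeler.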

00:20:46.480 --> 00:20:49.200
It's like a whole world unto itself

00:20:48.000 --> 00:20:53.759
which we just don't have time to get

00:20:49.200 --> 00:20:56.000
into. Uh Olivia,

00:20:53.759 --> 00:20:57.519
>> um depending on the medium that's being

00:20:56.000 --> 00:20:59.119
returned by these models, would there be

00:20:57.519 --> 00:21:00.480
more than one reward model? Because

00:20:59.119 --> 00:21:01.839
isn't that what Gemini

00:21:00.480 --> 00:21:03.679
>> would there be more than one

00:21:01.839 --> 00:21:05.599
>> reward model? Because isn't this what

00:21:03.679 --> 00:21:08.000
Gemini is running into issues with right

00:21:05.599 --> 00:21:09.519
now with their image generation is the

00:21:08.000 --> 00:21:11.519
bias that they try to

00:21:09.519 --> 00:21:13.279
>> Yeah. So the Gemini business that's

00:21:11.519 --> 00:21:16.798
going on, it's unclear what's causing

00:21:13.279 --> 00:21:18.319
it. Um it may be in this step, maybe

00:21:16.798 --> 00:21:19.279
they were a little overzealous in

00:21:18.319 --> 00:21:20.319
preventing certain things from

00:21:19.279 --> 00:21:23.918
happening.

00:21:20.319 --> 00:21:25.279
Some of these uh systems also have um

00:21:23.919 --> 00:21:27.679
they will actually intercept the

00:21:25.279 --> 00:21:29.359
question that you ask and then route it

00:21:27.679 --> 00:21:31.280
differently based on what they sense is

00:21:29.359 --> 00:21:32.879
sitting around in the question. So there

00:21:31.279 --> 00:21:34.639
could be pre-processing post-processing

00:21:32.880 --> 00:21:36.960
a lot of stuff that goes on. So unclear

00:21:34.640 --> 00:21:38.559
to me where in the pipeline and it could

00:21:36.960 --> 00:21:40.960
be more than one place these things may

00:21:38.558 --> 00:21:42.639
be entering. So yes, so here may very

00:21:40.960 --> 00:21:44.720
well be where it actually enters a

00:21:42.640 --> 00:21:46.960
situation where people are people are

00:21:44.720 --> 00:21:50.000
told if you see any sort of this kind of

00:21:46.960 --> 00:21:51.600
answer downrank it right don't uprank it

00:21:50.000 --> 00:21:53.038
and then it learns that ranking very

00:21:51.599 --> 00:21:54.719
faithfully and then proceeds to apply it

00:21:53.038 --> 00:21:56.960
where it should not be applied. So

00:21:54.720 --> 00:21:58.880
that does happen uh Joselyn you had a

00:21:56.960 --> 00:22:02.000
question

00:21:58.880 --> 00:22:04.480
>> um I think I still I still don't totally

00:22:02.000 --> 00:22:06.480
understand why. So when I ask ChatGPT a

00:22:04.480 --> 00:22:08.319
question even in a lengthy response it

00:22:06.480 --> 00:22:10.159
doesn't wander away from the topic that

00:22:08.319 --> 00:22:11.839
I'm asking about right and so

00:22:10.159 --> 00:22:13.600
understanding that it it's predicting

00:22:11.839 --> 00:22:15.439
each word it's sort of taking a random

00:22:13.599 --> 00:22:15.839
walk from one word to the next in some

00:22:15.440 --> 00:22:17.759
sense

00:22:15.839 --> 00:22:19.839
>> but each word it utters

00:22:17.759 --> 00:22:20.879
>> now becomes part of the input to the

00:22:19.839 --> 00:22:21.199
next word it utters

00:22:20.880 --> 00:22:23.120
>> right

00:22:21.200 --> 00:22:24.640
>> so it's not truly random walk in that

00:22:23.119 --> 00:22:26.158
sense so the next step is not

00:22:24.640 --> 00:22:27.679
independent of the previous step

00:22:26.159 --> 00:22:29.440
>> it depends on what it depends on the

00:22:27.679 --> 00:22:31.038
journey so far so it's going to try to

00:22:29.440 --> 00:22:32.240
be very consistent with the journey so

00:22:31.038 --> 00:22:33.599
far
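This "consistent with the journey so far" point can be shown in a tiny loop: each emitted word is appended to the context that conditions the next one, so the walk is not memoryless. The `next_word` function below is a hypothetical stand-in for the LLM's next-token step:

```python
def generate(next_word, prompt, max_words=5):
    """Autoregressive generation: every word the model utters becomes
    part of the input for the next word, so each step depends on the
    whole journey so far rather than being an independent draw."""
    context = list(prompt)
    for _ in range(max_words):
        word = next_word(context)  # conditioned on everything emitted so far
        context.append(word)
    return context

# Toy "model": extend the last word, so each step visibly depends on
# the previous output
toy = lambda ctx: ctx[-1] + "'"
out = generate(toy, ["hello"], max_words=3)
```

With a real model, `next_word` would sample from a distribution over the vocabulary, but the feedback loop is the same.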

00:22:32.240 --> 00:22:35.519
>> okay

00:22:33.599 --> 00:22:38.480
>> does the

00:22:35.519 --> 00:22:40.158
does this part with um sort of

00:22:38.480 --> 00:22:42.079
fine-tuning it on these question answer

00:22:40.159 --> 00:22:44.799
sets. Does this play some role in it

00:22:42.079 --> 00:22:46.240
being able to constrain itself and not

00:22:44.798 --> 00:22:48.319
meander away?

00:22:46.240 --> 00:22:50.558
>> I don't think so. I think this is more

00:22:48.319 --> 00:22:52.319
to make sure that you know it does the

00:22:50.558 --> 00:22:54.960
weights generally tend to produce the

00:22:52.319 --> 00:22:57.359
right answer. Now what one of the things

00:22:54.960 --> 00:22:58.880
that is possible is that when I'm

00:22:57.359 --> 00:23:01.439
let's say I'm a ranker and I'm looking

00:22:58.880 --> 00:23:03.039
at a few different answers I'm you know

00:23:01.440 --> 00:23:06.080
I have to figure out if the answer is

00:23:03.038 --> 00:23:08.640
helpful if it is accurate if it is uh

00:23:06.079 --> 00:23:11.279
you know non-toxic right things like

00:23:08.640 --> 00:23:13.200
that and part of the rubric for

00:23:11.279 --> 00:23:16.639
evaluating these answers could be their

00:23:13.200 --> 00:23:18.880
coherence right so it could also be that

00:23:16.640 --> 00:23:21.280
they are saying short coherent answers

00:23:18.880 --> 00:23:23.039
are better than long coherent answers

00:23:21.279 --> 00:23:24.399
but once you adjust for length, maybe

00:23:23.038 --> 00:23:25.519
coherence is more important, right? It

00:23:24.400 --> 00:23:26.960
could be any number of these things. So

00:23:25.519 --> 00:23:28.720
it could play a role in that.

00:23:26.960 --> 00:23:30.079
>> So just sort of one small followup. So

00:23:28.720 --> 00:23:31.519
in other words, when it's when it's

00:23:30.079 --> 00:23:32.850
learning from these question and answer

00:23:31.519 --> 00:23:33.918
pairs, it's able to look at

00:23:32.851 --> 00:23:35.440
[clears throat] the whole response and

00:23:33.919 --> 00:23:36.960
learn something about the whole response

00:23:35.440 --> 00:23:37.519
rather than just one word at a time,

00:23:36.960 --> 00:23:39.440
right?

00:23:37.519 --> 00:23:40.319
>> Correct. Yeah. The entire response

00:23:39.440 --> 00:23:40.640
is being ranked.

00:23:40.319 --> 00:23:42.639
>> Yeah.

00:23:40.640 --> 00:23:46.240
>> Correct. Correct.

00:23:42.640 --> 00:23:48.400
>> Yeah. On a related note, um when it's

00:23:46.240 --> 00:23:50.640
generating a new word on a topic, does

00:23:48.400 --> 00:23:52.880
the attention pertain to the entire

00:23:50.640 --> 00:23:55.759
prior text or can you have like

00:23:52.880 --> 00:23:56.880
traveling attention? So like last five

00:23:55.759 --> 00:24:00.000
words.

00:23:56.880 --> 00:24:02.240
>> So yeah, the short answer is yeah, you

00:24:00.000 --> 00:24:04.480
can you can it's called sliding window

00:24:02.240 --> 00:24:06.640
attention. It can be done. They

00:24:04.480 --> 00:24:08.480
typically tend to do it not uh so much

00:24:06.640 --> 00:24:10.880
because they want to focus more on the

00:24:08.480 --> 00:24:12.000
the recent words, but more because it

00:24:10.880 --> 00:24:14.720
actually makes it very compute

00:24:12.000 --> 00:24:16.159
efficient. Uh, that's why they do it. So

00:24:14.720 --> 00:24:17.919
it's called sliding window attention.

00:24:16.159 --> 00:24:19.760
You can Google it.
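The masking rule behind sliding window attention is simple enough to write down. This is only a sketch of the rule, under the convention that position i may attend to the last `window` positions up to and including itself; real implementations apply the mask inside the attention kernel rather than building a Boolean matrix:

```python
def sliding_window_mask(seq_len, window):
    """Causal sliding-window attention mask: position i may attend to
    positions j with i - window < j <= i. Full causal attention is the
    special case window >= seq_len."""
    return [[(0 <= i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]

# With window=2, each token sees only itself and one previous token
mask = sliding_window_mask(seq_len=5, window=2)
```

Shrinking the window makes attention cost per token constant instead of growing with the sequence, which is the compute saving mentioned above.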

00:24:17.919 --> 00:24:21.120
>> So normally it's full attention.

00:24:19.759 --> 00:24:23.038
>> Normally it's full default is full

00:24:21.119 --> 00:24:25.678
attention.

00:24:23.038 --> 00:24:27.839
Okay. So that's what they did. Uh and

00:24:25.679 --> 00:24:29.278
when they did that and by the way as I

00:24:27.839 --> 00:24:30.480
think you pointed out that's exactly

00:24:29.278 --> 00:24:31.919
what's going on. You're training the

00:24:30.480 --> 00:24:35.038
reward model with these thumbs up and

00:24:31.919 --> 00:24:37.919
thumbs down. Uh, hold on to the questions.

00:24:35.038 --> 00:24:42.960
And so if you give the same question

00:24:37.919 --> 00:24:42.960
to GPT-3.5 as to GPT-3, you get an amazing answer.

00:24:42.960 --> 00:24:48.960
Okay, like night and day difference,

00:24:45.679 --> 00:24:51.360
amazingly good answer. Um, and so and

00:24:48.960 --> 00:24:52.640
then to go from 3.5 to ChatGPT, they

00:24:51.359 --> 00:24:55.439
basically followed the exact same

00:24:52.640 --> 00:24:58.080
playbook except that because they wanted

00:24:55.440 --> 00:24:59.600
to have a chatbot, meaning something

00:24:58.079 --> 00:25:00.879
that could carry on a question answer,

00:24:59.599 --> 00:25:02.319
question answer pair as opposed to just

00:25:00.880 --> 00:25:03.600
a single question and answer, they

00:25:02.319 --> 00:25:05.839
wanted question answer question answer,

00:25:03.599 --> 00:25:08.719
right? Conversation. They trained it on

00:25:05.839 --> 00:25:11.519
conversations. That's it. Instead of

00:25:08.720 --> 00:25:13.759
training it on instruction answer data,

00:25:11.519 --> 00:25:16.000
they trained it on instruction answer

00:25:13.759 --> 00:25:17.919
instruction answer instruction answer a

00:25:16.000 --> 00:25:19.839
sequence of such things which are strung

00:25:17.919 --> 00:25:21.360
into a conversation.
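The change described above is just a change in how the training examples are laid out. A sketch, with hypothetical role tags (each lab uses its own chat template):

```python
def to_training_text(turns):
    """Instead of a single instruction-answer pair, the training example
    is a whole conversation: alternating turns strung into one sequence."""
    return "\n".join(f"{role}: {text}" for role, text in turns)

conv = [
    ("user", "Write a short apology email."),
    ("assistant", "Sorry for the delay..."),
    ("user", "Can you make it more formal?"),
    ("assistant", "Please accept my sincere apologies..."),
]
sample = to_training_text(conv)
```

Because follow-up turns like "can you make it more formal?" appear in the training sequences, the model learns to condition its answer on the whole exchange so far.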

00:25:19.839 --> 00:25:25.278
That's it. That is the only difference

00:25:21.359 --> 00:25:26.959
to go from 3.5 to ChatGPT. And now

00:25:25.278 --> 00:25:28.880
ChatGPT, given you do that, is giving

00:25:26.960 --> 00:25:30.798
you a much nicer response and then you

00:25:28.880 --> 00:25:32.240
can ask a follow-on question. Can you

00:25:30.798 --> 00:25:33.759
make it more formal? Boom. It gives you

00:25:32.240 --> 00:25:35.120
a nice response because now it knows

00:25:33.759 --> 00:25:37.519
about conversations. It's been trained

00:25:35.119 --> 00:25:38.879
on conversational data. So that's it. So

00:25:37.519 --> 00:25:41.200
that's the whole that's how they built

00:25:38.880 --> 00:25:42.720
ChatGPT, right? And all the things we are

00:25:41.200 --> 00:25:45.038
seeing later on are all sort of

00:25:42.720 --> 00:25:46.240
continuations of this sort of approach.

00:25:45.038 --> 00:25:47.759
So pause for a couple of quick

00:25:46.240 --> 00:25:50.558
questions. Swati you had a question then

00:25:47.759 --> 00:25:53.759
we'll go to you and then to you. Yeah.

00:25:50.558 --> 00:25:56.240
>> So does that make a difference if a new

00:25:53.759 --> 00:25:59.759
question pair question answer pair or a

00:25:56.240 --> 00:26:01.759
new training data comes early in the

00:25:59.759 --> 00:26:02.960
building of the model or later in the

00:26:01.759 --> 00:26:05.278
building of the model?

00:26:02.960 --> 00:26:07.038
>> You mean the

00:26:05.278 --> 00:26:09.839
order of the questions does it matter?

00:26:07.038 --> 00:26:12.319
>> So I might have like let's say 5,000 uh

00:26:09.839 --> 00:26:14.240
images to start with. Now there after my

00:26:12.319 --> 00:26:17.278
model is trained and developed now I

00:26:14.240 --> 00:26:18.880
have a new use case that has come in.

00:26:17.278 --> 00:26:19.599
Will that make a difference if I set it

00:26:18.880 --> 00:26:22.000
in now?

00:26:19.599 --> 00:26:24.639
>> So if you have a new use case for which

00:26:22.000 --> 00:26:26.240
you want to essentially adapt the model

00:26:24.640 --> 00:26:27.278
there's a whole set of techniques you

00:26:26.240 --> 00:26:27.839
use which is going to be the next

00:26:27.278 --> 00:26:29.038
section.

00:26:27.839 --> 00:26:30.480
>> But it's not

00:26:29.038 --> 00:26:33.440
>> yeah because what you have out of the

00:26:30.480 --> 00:26:34.798
box is just a generally good chatbot. It

00:26:33.440 --> 00:26:36.000
knows about a lot of stuff because it's

00:26:34.798 --> 00:26:37.679
been trained on, you know, those 30

00:26:36.000 --> 00:26:39.119
billion sentences, it can answer a lot

00:26:37.679 --> 00:26:41.120
of questions reasonably well using

00:26:39.119 --> 00:26:43.038
common sense and world knowledge. But

00:26:41.119 --> 00:26:44.719
any specific use case like medical and

00:26:43.038 --> 00:26:46.079
so on and so forth, it may not know. So

00:26:44.720 --> 00:26:47.919
you'll need to adapt it to your

00:26:46.079 --> 00:26:51.678
particular unique situation and that's

00:26:47.919 --> 00:26:54.559
coming. Uh, all right. Yes, Habib.

00:26:51.679 --> 00:26:57.360
>> Uh what determines if a whole

00:26:54.558 --> 00:26:59.359
conversation is ranked positively versus

00:26:57.359 --> 00:27:01.278
a specific answer within

00:26:59.359 --> 00:27:03.119
the conversation?

00:27:01.278 --> 00:27:05.278
>> Is it if the first answer doesn't get a

00:27:03.119 --> 00:27:06.639
positive response but then after follow

00:27:05.278 --> 00:27:07.200
the second one does. Is that is that

00:27:06.640 --> 00:27:08.880
correct?

00:27:07.200 --> 00:27:10.319
>> Exactly. So if you're a human and you

00:27:08.880 --> 00:27:12.320
read the transcript of an exchange

00:27:10.319 --> 00:27:14.240
between two people and I'm giving you

00:27:12.319 --> 00:27:15.759
two exchanges which all start with the

00:27:14.240 --> 00:27:17.759
same question, you'll be able to assess

00:27:15.759 --> 00:27:20.000
which one is a better transcript. That's

00:27:17.759 --> 00:27:22.640
basically what's going on. Uh there was

00:27:20.000 --> 00:27:25.038
something here, right? Something. Yeah.

00:27:22.640 --> 00:27:27.919
>> So I was wondering when you ask a

00:27:25.038 --> 00:27:29.919
question very often it sounds kind of

00:27:27.919 --> 00:27:32.880
like you can tell that something was

00:27:29.919 --> 00:27:35.759
written not by an actual person. Do you think

00:27:32.880 --> 00:27:38.159
that comes from the reinforcement

00:27:35.759 --> 00:27:40.158
learning part or where do you think it

00:27:38.159 --> 00:27:41.440
comes from in this?

00:27:40.159 --> 00:27:42.559
>> It's a good question. I don't know

00:27:41.440 --> 00:27:44.960
because I know that part of the

00:27:42.558 --> 00:27:48.399
evaluation, the ranking rubrics that are

00:27:44.960 --> 00:27:50.400
used, is to favor responses which

00:27:48.400 --> 00:27:52.720
sound more humanlike rather

00:27:50.400 --> 00:27:52.720
than robotlike. So if anything, I'm

00:27:52.720 --> 00:27:55.839
hoping that reinforcement learning would

00:27:54.319 --> 00:27:56.720
actually make it sound more humanlike

00:27:55.839 --> 00:27:58.879
because the rankers would have

00:27:56.720 --> 00:28:01.120
prioritized that. So if you if it still

00:27:58.880 --> 00:28:02.960
comes up with robotic stuff, you know,

00:28:01.119 --> 00:28:05.119
it's something else that's going on.

00:28:02.960 --> 00:28:07.278
Maybe, I mean, maybe a lot of the text on

00:28:05.119 --> 00:28:09.439
the internet is not literature. It's

00:28:07.278 --> 00:28:13.200
just people writing some crap, right? So

00:28:09.440 --> 00:28:15.120
could be that. Yeah.

00:28:13.200 --> 00:28:17.120
>> How much of this instruction tuning or

00:28:15.119 --> 00:28:19.278
conversational tuning is happening in

00:28:17.119 --> 00:28:19.918
real time within a conversation? So

00:28:19.278 --> 00:28:22.159
>> none of it.

00:28:19.919 --> 00:28:24.080
>> None of it. So as you kind of give

00:28:22.159 --> 00:28:25.919
feedback to the model, it's just

00:28:24.079 --> 00:28:27.439
basically regenerating it like I don't

00:28:25.919 --> 00:28:27.759
like that answer. Come up with something

00:28:27.440 --> 00:28:29.840
else.

00:28:27.759 --> 00:28:31.679
>> No, it's not doing it in real time. Uh,

00:28:29.839 --> 00:28:32.879
basically whatever signals you're giving

00:28:31.679 --> 00:28:34.640
it with these thumbs up, thumbs down

00:28:32.880 --> 00:28:36.640
business, that gets added to the

00:28:34.640 --> 00:28:39.038
training logs and they periodically will

00:28:36.640 --> 00:28:41.038
retrain it.

00:28:39.038 --> 00:28:42.798
Uh, okay. So, by the way, this is

00:28:41.038 --> 00:28:44.000
instruction tuning in a nutshell and I

00:28:42.798 --> 00:28:45.440
want to point that out and you don't

00:28:44.000 --> 00:28:47.839
have to read the whole thing, but just

00:28:45.440 --> 00:28:50.320
to quickly point out this was where we

00:28:47.839 --> 00:28:51.439
had to have human involvement, right? In

00:28:50.319 --> 00:28:52.960
the first step, writing a lot of

00:28:51.440 --> 00:28:56.320
responses to these questions and then

00:28:52.960 --> 00:28:58.558
ranking the answers. So these two are

00:28:56.319 --> 00:29:00.798
still human sort of labor intensive. Now

00:28:58.558 --> 00:29:03.678
it turns out you can actually use helper

00:29:00.798 --> 00:29:04.879
LLMs to automate this too,

00:29:03.679 --> 00:29:06.320
right? This is not what OpenAI did in

00:29:06.319 --> 00:29:09.439
the beginning with ChatGPT, but now you can

00:29:06.319 --> 00:29:09.439
do it this way right because there are

00:29:07.759 --> 00:29:11.519
lots of really good LLMs available for

00:29:09.440 --> 00:29:12.880
you to automate many of these things. uh

00:29:11.519 --> 00:29:14.398
we don't have time but if you're curious

00:29:12.880 --> 00:29:17.039
I had a little blog post on this check

00:29:14.398 --> 00:29:20.000
it out okay so now we come to the

00:29:17.038 --> 00:29:23.278
question of well if you want to take a

00:29:20.000 --> 00:29:24.960
base LLM like GPT-3 and make it useful

00:29:23.278 --> 00:29:26.880
and respond to instructions, we have seen

00:29:24.960 --> 00:29:28.880
that we had to adapt it with high

00:29:26.880 --> 00:29:30.240
quality instruction-answer data, right,

00:29:28.880 --> 00:29:31.360
using supervised fine-tuning and

00:29:30.240 --> 00:29:33.919
reinforcement learning with human

00:29:31.359 --> 00:29:37.678
feedback, right? That's what made GPT-3

00:29:33.919 --> 00:29:39.600
actually useful and become ChatGPT. By

00:29:37.679 --> 00:29:41.278
the same token this holds true more

00:29:39.599 --> 00:29:42.639
generally if you want to take large

00:29:41.278 --> 00:29:44.798
language model make it useful for a

00:29:42.640 --> 00:29:47.120
medical use case, a legal use case, some

00:29:44.798 --> 00:29:49.359
other narrow business use case. You have

00:29:47.119 --> 00:29:52.000
to adapt it with business domain

00:29:49.359 --> 00:29:54.079
specific data. Okay. And so let's look

00:29:52.000 --> 00:29:56.000
at techniques for doing so. All right.

00:29:54.079 --> 00:29:57.759
So adaptation is sort of the rough name

00:29:56.000 --> 00:30:00.000
for the process of taking a base large

00:29:57.759 --> 00:30:02.000
language model and making it tailoring

00:30:00.000 --> 00:30:03.359
it for your particular use case. And so

00:30:02.000 --> 00:30:05.119
there's sort of this ladder of things

00:30:03.359 --> 00:30:07.199
you can do, right? And we're going to

00:30:05.119 --> 00:30:08.959
look at every one of them. So you can do

00:30:07.200 --> 00:30:11.679
this thing called zero-shot prompting

00:30:08.960 --> 00:30:14.319
which is just you literally ask the LLM

00:30:11.679 --> 00:30:16.240
nicely and clearly for what you want, and maybe

00:30:14.319 --> 00:30:17.519
it'll just give it to you. Okay. And this is

00:30:16.240 --> 00:30:20.480
sort of the use case we're all used to

00:30:17.519 --> 00:30:22.558
in the web interface right you can also

00:30:20.480 --> 00:30:24.319
do something called few-shot prompting

00:30:22.558 --> 00:30:25.599
where you ask it something and you also

00:30:24.319 --> 00:30:27.599
give a few examples of the kind of

00:30:25.599 --> 00:30:30.398
things you want right and that helps it

00:30:27.599 --> 00:30:31.519
a great deal and then there is this

00:30:30.398 --> 00:30:33.119
thing called retrieval augmented

00:30:31.519 --> 00:30:34.240
generation and fine-tuning and we'll

00:30:33.119 --> 00:30:36.079
look at all of them and I'll explain all

00:30:34.240 --> 00:30:38.159
these things as we go along. Okay, so

00:30:36.079 --> 00:30:40.240
let's start with zero-shot prompting

00:30:38.159 --> 00:30:44.000
where, by the way, the word shot here is

00:30:40.240 --> 00:30:45.839
a synonym for example. So zero example

00:30:44.000 --> 00:30:47.200
prompting. You literally ask in the

00:30:45.839 --> 00:30:50.639
prompt what you want without giving even

00:30:47.200 --> 00:30:51.919
a single example. Okay. And so let's say

00:30:50.640 --> 00:30:54.320
we want to build we want to look at

00:30:51.919 --> 00:30:55.360
product reviews and build a detector to

00:30:54.319 --> 00:30:56.960
figure out if the product review

00:30:55.359 --> 00:30:59.759
contains, not sentiment (that's kind of

00:30:56.960 --> 00:31:01.120
boring), but whether it contains some

00:30:59.759 --> 00:31:04.960
description of a potential product

00:31:01.119 --> 00:31:06.879
defect or not. Okay. And so here is

00:31:04.960 --> 00:31:08.960
something I actually pulled off Wayfair

00:31:06.880 --> 00:31:10.640
with apologies to Wayfair. Uh it says

00:31:08.960 --> 00:31:11.919
here the curve of the back of the chair

00:31:10.640 --> 00:31:14.399
does not leave enough room to sit

00:31:11.919 --> 00:31:16.960
comfortably. Okay, sounds like a kind of

00:31:14.398 --> 00:31:18.719
a defectish kind of thing, right? So

00:31:16.960 --> 00:31:20.079
back in the day, you would

00:31:18.720 --> 00:31:21.679
have collected all these reviews and

00:31:20.079 --> 00:31:23.599
built a special purpose NLP based

00:31:21.679 --> 00:31:25.679
classifier to figure out defect yes or

00:31:23.599 --> 00:31:28.798
no. Here you can literally just feed

00:31:25.679 --> 00:31:30.159
this thing into GPT-3 and ask it: tell

00:31:28.798 --> 00:31:31.679
me if a product defect is being

00:31:30.159 --> 00:31:33.278
described in this product review and

00:31:31.679 --> 00:31:34.399
then the curve at the back boom and then

00:31:33.278 --> 00:31:37.359
it comes back and says yep that's a

00:31:34.398 --> 00:31:38.398
product defect. Okay so this zero shot

00:31:37.359 --> 00:31:41.199
you just ask a question you get the

00:31:38.398 --> 00:31:43.759
answer back. Okay and it actually works

00:31:41.200 --> 00:31:45.360
remarkably well and the better models

00:31:43.759 --> 00:31:47.359
the bigger models tend to be much better

00:31:45.359 --> 00:31:50.000
than the smaller simpler models for

00:31:47.359 --> 00:31:52.639
doing zero shot. Okay. All right. Now

00:31:50.000 --> 00:31:54.079
when you adapt an LLM to a specific task

00:31:52.640 --> 00:31:55.919
obviously you need to carefully design

00:31:54.079 --> 00:31:57.759
the prompt as you folks know this is

00:31:55.919 --> 00:31:58.799
called prompt engineering and we're not

00:31:57.759 --> 00:32:00.640
going to spend much time on prompt

00:31:58.798 --> 00:32:02.720
engineering except I just want to give a

00:32:00.640 --> 00:32:04.960
simple example. So if you actually ask

00:32:02.720 --> 00:32:07.919
ChatGPT this question, what is the fifth

00:32:04.960 --> 00:32:09.919
word of the sentence very often it'll

00:32:07.919 --> 00:32:11.679
give the wrong answer.

00:32:09.919 --> 00:32:12.960
It's very strange why it can't get this

00:32:11.679 --> 00:32:14.880
question right. It's a very

00:32:12.960 --> 00:32:17.440
simple question. So, asked for the fifth

00:32:14.880 --> 00:32:18.559
word of a sentence,

00:32:17.440 --> 00:32:20.640
sometimes it gets it right but very

00:32:18.558 --> 00:32:22.000
often it'll get it wrong okay but now

00:32:20.640 --> 00:32:23.600
you can do a little prompt engineering

00:32:22.000 --> 00:32:25.278
and it'll always get it right. So for

00:32:23.599 --> 00:32:26.798
example you can say I'll give you a

00:32:25.278 --> 00:32:27.919
sentence first list all the words that

00:32:26.798 --> 00:32:30.398
are in the sentence then tell me the

00:32:27.919 --> 00:32:33.200
fifth word. Okay, here is a sentence, and boom, it

00:32:30.398 --> 00:32:34.798
gets it right. So it's an example of you

00:32:33.200 --> 00:32:36.720
can help it along by being very very

00:32:34.798 --> 00:32:38.558
prescriptive as to what you want it to

00:32:36.720 --> 00:32:40.399
do and break down all the steps. Don't

00:32:38.558 --> 00:32:42.558
make it guess things. It does a great

00:32:40.398 --> 00:32:43.918
job. Okay. So anyway, there are

00:32:42.558 --> 00:32:45.918
lots of other tricks people have figured

00:32:43.919 --> 00:32:47.519
out over the the last couple of years.

00:32:45.919 --> 00:32:49.759
Uh for for a long time this is pretty

00:32:47.519 --> 00:32:51.679
hot where you say let's think step by

00:32:49.759 --> 00:32:53.038
step. You tell it give it a question and

00:32:51.679 --> 00:32:54.399
say let's think step by step. It

00:32:53.038 --> 00:32:55.839
actually has a better shot at giving

00:32:54.398 --> 00:32:57.839
you a good answer back an accurate

00:32:55.839 --> 00:32:59.759
answer back. Uh now this kind of thing

00:32:57.839 --> 00:33:02.639
is actually already baked in into the

00:32:59.759 --> 00:33:05.200
LLMs. So when you ask a question to ch

00:33:02.640 --> 00:33:07.679
your question your prompt gets appended

00:33:05.200 --> 00:33:09.120
to what's called the system prompt and

00:33:07.679 --> 00:33:10.559
the whole thing goes into the LM. You

00:33:09.119 --> 00:33:12.558
never see the system prompt and the

00:33:10.558 --> 00:33:14.720
system prompt is telling ChatGPT: think

00:33:12.558 --> 00:33:17.678
step by step take your time don't blurt

00:33:14.720 --> 00:33:18.880
out an answer stuff like that okay and

00:33:17.679 --> 00:33:20.080
the system prompt, you can just Google it; the

00:33:18.880 --> 00:33:22.640
system prompts have been jailbroken, you

00:33:20.079 --> 00:33:25.519
can find it on the web
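The system-prompt mechanism above amounts to simple concatenation before the text hits the model. A sketch; the wording of the system prompt here is a paraphrase, not OpenAI's actual text, and real chat APIs pass a structured message list rather than one string:

```python
def build_input(system_prompt, user_prompt):
    """What the model actually sees: the hidden system prompt with the
    user's prompt appended, submitted as one combined input."""
    return system_prompt + "\n\n" + user_prompt

# Paraphrased system instructions (hypothetical wording)
system = ("Think step by step. Take your time. "
          "Do not blurt out an answer.")
full = build_input(system, "What is the fifth word of this sentence?")
```

This is why step-by-step behavior shows up even when the user never asks for it: the instruction is already prepended to every request.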

00:33:22.640 --> 00:33:26.880
so all right uh and and this is funny I

00:33:25.519 --> 00:33:28.399
this came out maybe like a month or two

00:33:26.880 --> 00:33:29.679
ago it says apparently take a deep

00:33:28.398 --> 00:33:31.678
breath and work on the problem step by

00:33:29.679 --> 00:33:34.960
step works better than saying work on it

00:33:31.679 --> 00:33:36.720
step by step and then more recently I

00:33:34.960 --> 00:33:38.558
literally read this two nights ago

00:33:36.720 --> 00:33:40.159
apparently if you tell it if you have a

00:33:38.558 --> 00:33:42.639
math or a reasoning question. You tell

00:33:40.159 --> 00:33:44.240
it you are an officer on the starship

00:33:42.640 --> 00:33:46.000
Enterprise. Now solve this problem for

00:33:44.240 --> 00:33:47.278
me. It's more likely to get it

00:33:46.000 --> 00:33:48.640
right.

00:33:47.278 --> 00:33:50.798
>> Go figure. Thomas,

00:33:48.640 --> 00:33:51.120
>> I read two more that were super fun.

00:33:50.798 --> 00:33:53.599
>> Yeah.

00:33:51.119 --> 00:33:54.398
>> One was: I'll keep you if you solve this

00:33:53.599 --> 00:33:56.719
>> correctly.

00:33:54.398 --> 00:34:00.798
>> and the other one was

00:33:56.720 --> 00:34:05.440
an answer was I cannot do that

00:34:00.798 --> 00:34:07.519
for answer was I tried on Gemini and he

00:34:05.440 --> 00:34:10.800
it was the way to solve it. So

00:34:07.519 --> 00:34:11.918
>> Nice. Both, like, back and forth with ChatGPT —

00:34:10.800 --> 00:34:13.839
what it would say was, to solve

00:34:11.918 --> 00:34:15.679
this — can you solve this?

00:34:13.838 --> 00:34:16.960
>> yeah very good excellent one of the

00:34:15.679 --> 00:34:18.639
things just on that right let's have

00:34:16.960 --> 00:34:19.918
some fun. You can say, I'm going to

00:34:18.639 --> 00:34:22.079
tip you a thousand bucks if you solve

00:34:19.918 --> 00:34:24.000
this it says right so this person

00:34:22.079 --> 00:34:26.159
apparently kept using this tip and at

00:34:24.000 --> 00:34:28.559
one point it says you keep promising me

00:34:26.159 --> 00:34:31.760
tips you never give me the tip so I'm

00:34:28.559 --> 00:34:34.960
not going to solve this problem for you

00:34:31.760 --> 00:34:36.399
yeah okay so and there are many prompt

00:34:34.960 --> 00:34:37.358
engineering resources this one that came

00:34:36.398 --> 00:34:38.638
out a couple of weeks ago which I

00:34:37.358 --> 00:34:41.199
thought was pretty good. So I just put a

00:34:38.639 --> 00:34:42.800
link to it here. Um so now let's look at

00:34:41.199 --> 00:34:45.118
few short prompting where you give it a

00:34:42.800 --> 00:34:47.919
few examples. So here let's say we want

00:34:45.119 --> 00:34:49.440
to build a grammar corrector. Okay. So

00:34:47.918 --> 00:34:52.319
what you can do is you can actually give

00:34:49.440 --> 00:34:54.079
it examples of poor English good

00:34:52.320 --> 00:34:56.159
English. You can see right poor English

00:34:54.079 --> 00:34:58.079
I eated the purple berries. Good English

00:34:56.159 --> 00:35:00.240
I ate the purple berries. And similarly

00:34:58.079 --> 00:35:01.680
three examples right and then you end

00:35:00.239 --> 00:35:04.959
the prompt with just the poor English

00:35:01.679 --> 00:35:06.799
input. And then the response from GPT-3

00:35:04.960 --> 00:35:09.039
is a good English output that fixes

00:35:06.800 --> 00:35:10.880
the error.
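The few-shot prompt format just described can be sketched as a simple string template: a few (poor English, good English) example pairs, then the new poor-English input, so the model completes the good-English line. The example sentences beyond the first are invented for illustration.

```python
# Sketch of the few-shot grammar-corrector prompt: example pairs,
# ending with the new input, so the model fills in the last line.

examples = [
    ("I eated the purple berries.", "I ate the purple berries."),
    ("He go to school yesterday.", "He went to school yesterday."),  # made-up example
    ("She have three cat.", "She has three cats."),                  # made-up example
]

def few_shot_prompt(new_input):
    lines = []
    for poor, good in examples:
        lines.append(f"Poor English: {poor}")
        lines.append(f"Good English: {good}")
    lines.append(f"Poor English: {new_input}")
    lines.append("Good English:")  # the model completes this line
    return "\n".join(lines)

print(few_shot_prompt("The mens was running fastly."))
```

The whole string is sent as a single prompt; the model infers the intended task from the pattern, with no weight updates — that is in-context learning.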

00:35:09.039 --> 00:35:11.920
So this is an example of giving a few

00:35:10.880 --> 00:35:13.680
examples of what you want and just

00:35:11.920 --> 00:35:16.960
learns on the fly what you have

00:35:13.679 --> 00:35:19.519
in mind what your intention is. Okay. So

00:35:16.960 --> 00:35:21.920
that's that. Now the ability of LLMs to

00:35:19.519 --> 00:35:23.838
learn from just a few examples or even

00:35:21.920 --> 00:35:25.920
no examples and just with a clear

00:35:23.838 --> 00:35:28.159
instruction. This thing is called in

00:35:25.920 --> 00:35:31.119
context learning and that was something

00:35:28.159 --> 00:35:33.440
that GPT-2 and GPT could not do. That was

00:35:31.119 --> 00:35:35.519
new in GPT-3 and what they call an

00:35:33.440 --> 00:35:37.280
emergent capability right it is

00:35:35.519 --> 00:35:40.159
completely unanticipated by the people

00:35:37.280 --> 00:35:41.920
who built it and all right so that's

00:35:40.159 --> 00:35:43.199
that. Now let's look at retrieval

00:35:41.920 --> 00:35:45.280
augmented generation by the way this

00:35:43.199 --> 00:35:47.439
thing is also called indexing sometimes

00:35:45.280 --> 00:35:50.160
so the idea — it's

00:35:47.440 --> 00:35:52.240
called RAG — the idea of RAG is

00:35:50.159 --> 00:35:53.838
actually very simple so let's say that

00:35:52.239 --> 00:35:56.639
you know we want to ask a question to a

00:35:53.838 --> 00:35:59.039
chatbot but we want the chatbot to

00:35:56.639 --> 00:36:01.039
leverage proprietary data that we might

00:35:59.039 --> 00:36:02.239
have maybe it's a customer call support

00:36:01.039 --> 00:36:04.159
sort of in a call center kind of

00:36:02.239 --> 00:36:06.719
operation and you have like this massive

00:36:04.159 --> 00:36:09.440
FAQ database right content database and

00:36:06.719 --> 00:36:10.719
you want to give that FAQ to the chatbot

00:36:09.440 --> 00:36:12.800
along with your question so that it can

00:36:10.719 --> 00:36:14.559
leverage the FAQ to answer the question

00:36:12.800 --> 00:36:16.320
for you as opposed to like whatever

00:36:14.559 --> 00:36:19.920
things it has learned previously in its

00:36:16.320 --> 00:36:21.920
general training right so can't we just

00:36:19.920 --> 00:36:24.559
include the entire FAQ the whole data

00:36:21.920 --> 00:36:26.079
set, into a prompt and send it in? Maybe we

00:36:24.559 --> 00:36:27.759
just take our question take everything

00:36:26.079 --> 00:36:28.960
we have potentially relevant to the

00:36:27.760 --> 00:36:31.040
question everything we have in the data

00:36:28.960 --> 00:36:32.480
set database just attach it to the

00:36:31.039 --> 00:36:34.400
question. The whole thing becomes a

00:36:32.480 --> 00:36:38.559
prompt. Feed it in and say, "Hey, find

00:36:34.400 --> 00:36:42.680
out for me." Can't you just do that?

00:36:38.559 --> 00:36:42.679
Theoretically, yes — but something stops us.

00:36:43.199 --> 00:36:46.159
The reason you can't do it is because

00:36:44.800 --> 00:36:47.760
this pesky thing called the context

00:36:46.159 --> 00:36:51.118
window.

00:36:47.760 --> 00:36:53.839
So, uh, for any LLM, the prompt plus the

00:36:51.119 --> 00:36:55.358
output, right, the length cannot exceed

00:36:53.838 --> 00:36:57.358
a predefined limit. This is called the

00:36:55.358 --> 00:37:00.239
context window. Remember the max

00:36:57.358 --> 00:37:02.239
sequence length we had in our earlier

00:37:00.239 --> 00:37:04.078
models where that was the size of the

00:37:02.239 --> 00:37:05.279
sentence that could be fed in right

00:37:04.079 --> 00:37:07.039
basically there is a size of the

00:37:05.280 --> 00:37:08.400
sentence for any of these things right

00:37:07.039 --> 00:37:09.838
it's called the context window it's

00:37:08.400 --> 00:37:12.400
there are only so many tokens it can

00:37:09.838 --> 00:37:14.880
accommodate and since what comes in is

00:37:12.400 --> 00:37:16.800
what comes out it is for both the input

00:37:14.880 --> 00:37:20.640
and the output together okay that's

00:37:16.800 --> 00:37:23.440
called the context window okay and um

00:37:20.639 --> 00:37:25.199
and furthermore, when you have a

00:37:23.440 --> 00:37:27.280
conversation with one of these chat bots

00:37:25.199 --> 00:37:29.919
the entire conversation is fed in

00:37:27.280 --> 00:37:31.519
every single time.

00:37:29.920 --> 00:37:32.800
That's how it actually remembers

00:37:31.519 --> 00:37:34.880
what's going on earlier in the

00:37:32.800 --> 00:37:36.800
conversation. It doesn't have any memory

00:37:34.880 --> 00:37:39.838
per se. Each time you ask a question,

00:37:36.800 --> 00:37:41.119
the entire thread is fed in. Okay? So,

00:37:39.838 --> 00:37:42.960
initially you say what's the square root

00:37:41.119 --> 00:37:44.480
of 17, it gives you an answer.

00:37:42.960 --> 00:37:46.880
Initially, you only send in the red

00:37:44.480 --> 00:37:48.320
stuff. Then the next question you ask is

00:37:46.880 --> 00:37:50.320
the first question, the answer, the

00:37:48.320 --> 00:37:52.480
second question. All of them are fed in.

00:37:50.320 --> 00:37:54.240
Then all these are fed in. So with the

00:37:52.480 --> 00:37:55.760
conversation, you're consuming more and

00:37:54.239 --> 00:37:57.358
more of the context window as you go

00:37:55.760 --> 00:38:00.000
along.
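The "whole thread re-sent every turn" behavior just described can be sketched as follows. Token counts here use a crude whitespace split standing in for a real tokenizer like tiktoken; the answers are hard-coded stand-ins, not model output.

```python
# Sketch: the chatbot has no memory per se -- on every turn the entire
# thread (all prior questions and answers) is packed into the prompt,
# so each turn consumes more of the context window than the last.

history = []

def send(question, canned_answer):
    """Append a turn; return a rough token count of what was actually sent."""
    prompt = "\n".join(history + [question])   # the whole thread is re-sent
    history.extend([question, canned_answer])  # the reply joins the thread too
    return len(prompt.split())                 # crude stand-in for a tokenizer

n1 = send("What's the square root of 17?", "About 4.123.")
n2 = send("And of 18?", "About 4.243.")
assert n2 > n1  # the second turn sends strictly more tokens than the first
```

With a real API the growing `history` would become the `messages` list sent on each request, and a tokenizer would give exact counts against the context window.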

00:37:57.358 --> 00:38:01.838
Okay. So can you imagine taking a whole

00:38:00.000 --> 00:38:03.199
FAQ asking a question and saying, "Well,

00:38:01.838 --> 00:38:04.400
I didn't mean that. I wanted something

00:38:03.199 --> 00:38:05.679
else." And before you know it, boom,

00:38:04.400 --> 00:38:06.720
you've blown out the context window.

00:38:05.679 --> 00:38:08.078
It's going to come back and give you an

00:38:06.719 --> 00:38:10.559
error.

00:38:08.079 --> 00:38:14.079
>> When it doesn't fit, does it take it all

00:38:10.559 --> 00:38:15.279
together or does it take specific

00:38:14.079 --> 00:38:17.599
windows of it?

00:38:15.280 --> 00:38:19.839
>> Yeah. So there is a whole research

00:38:17.599 --> 00:38:21.440
cottage industry around when your thing

00:38:19.838 --> 00:38:23.920
is longer than the context window. what

00:38:21.440 --> 00:38:25.920
do you pick? Uh so the simplest case is

00:38:23.920 --> 00:38:27.119
you have a moving window, right? If

00:38:25.920 --> 00:38:28.639
you have a thousand tokens, you just look

00:38:27.119 --> 00:38:30.800
at the last thousand tokens. But there

00:38:28.639 --> 00:38:33.039
are some cleverer schemes where you can

00:38:30.800 --> 00:38:34.560
actually take the first stuff that is

00:38:33.039 --> 00:38:37.119
outside the window that doesn't fit into

00:38:34.559 --> 00:38:39.358
the window and use another LLM to

00:38:37.119 --> 00:38:41.680
summarize it for you and then you attach

00:38:39.358 --> 00:38:43.920
it to your current prompt. I know it

00:38:41.679 --> 00:38:46.078
gets crazy. So
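The two overflow strategies just mentioned can be sketched like this. The "summary" is a stand-in string; in practice that step would be another LLM call, and you would reserve room in the window for the summary itself.

```python
# Two schemes for when the thread exceeds the context window:
# (1) moving window: keep only the most recent `limit` tokens;
# (2) summarize the overflow and prepend it to the recent tokens.

def moving_window(tokens, limit):
    # Simplest scheme: just keep the last `limit` tokens.
    return tokens[-limit:]

def summarize_overflow(tokens, limit):
    # Cleverer scheme: summarize what doesn't fit, keep the rest intact.
    if len(tokens) <= limit:
        return tokens
    overflow, recent = tokens[:-limit], tokens[-limit:]
    # Stand-in for an LLM-generated summary of the overflow.
    summary = f"[summary of {len(overflow)} earlier tokens]"
    return [summary] + recent

toks = [f"tok{i}" for i in range(10)]
print(moving_window(toks, 4))
print(summarize_overflow(toks, 4))
```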

00:38:43.920 --> 00:38:47.280
uh okay. So for all these reasons, we

00:38:46.079 --> 00:38:49.280
need to pick and choose what we can

00:38:47.280 --> 00:38:51.359
send, right? To answer a particular

00:38:49.280 --> 00:38:53.280
question. So what we do is since we

00:38:51.358 --> 00:38:54.960
can't include the whole thing, we first

00:38:53.280 --> 00:38:57.119
retrieve the relevant content from the

00:38:54.960 --> 00:38:59.440
database or the FAQ and then send it to

00:38:57.119 --> 00:39:02.400
the LLM along with a question we have.

00:38:59.440 --> 00:39:05.200
Okay? So retrieval augmented sequence

00:39:02.400 --> 00:39:08.320
generation. That's what's going on.

00:39:05.199 --> 00:39:10.319
Make sense? And so pictorially

00:39:08.320 --> 00:39:12.079
um basically what we do is let's say

00:39:10.320 --> 00:39:15.359
that this is our external set of

00:39:12.079 --> 00:39:18.320
documents. Think of it as an

00:39:15.358 --> 00:39:20.239
FAQ and then we take the FAQ and imagine

00:39:18.320 --> 00:39:22.320
for each question and answer. We take

00:39:20.239 --> 00:39:24.719
each question and answer in the FAQ and

00:39:22.320 --> 00:39:27.760
then we just treat it as its own

00:39:24.719 --> 00:39:29.439
little unit of text and then we actually

00:39:27.760 --> 00:39:32.079
calculate a contextual embedding for

00:39:29.440 --> 00:39:33.200
each of those question answer pairs.

00:39:32.079 --> 00:39:35.599
Remember we know how to do contextual

00:39:33.199 --> 00:39:36.879
embeddings, right? That's like it's a

00:39:35.599 --> 00:39:37.760
piece of cake at this point, right? You

00:39:36.880 --> 00:39:39.280
folks know how to do contextual

00:39:37.760 --> 00:39:41.760
embedding. Run it through something like

00:39:39.280 --> 00:39:43.920
BERT, you're done, right? You

00:39:41.760 --> 00:39:47.040
get a contextual embedding. So you get embeddings for

00:39:43.920 --> 00:39:50.159
all the things that are in your FAQ. And

00:39:47.039 --> 00:39:52.000
now when a new question comes in, right,

00:39:50.159 --> 00:39:53.519
what you do is you take that question

00:39:52.000 --> 00:39:56.559
and you calculate a contextual embedding

00:39:53.519 --> 00:39:58.880
for that too.

00:39:56.559 --> 00:40:02.880
And then what you do is you then look to

00:39:58.880 --> 00:40:04.640
see which of the FAQ elements you have,

00:40:02.880 --> 00:40:07.599
which of those chunks are the most

00:40:04.639 --> 00:40:09.759
similar to your question.

00:40:07.599 --> 00:40:11.599
Okay? And then you grab the ones that

00:40:09.760 --> 00:40:14.240
are the most similar and then pack it

00:40:11.599 --> 00:40:16.800
into the prompt and send it in. Maybe

00:40:14.239 --> 00:40:18.559
you have 10,000 questions, but you can

00:40:16.800 --> 00:40:19.839
only accommodate five of them in your

00:40:18.559 --> 00:40:22.078
prompt because the context window is

00:40:19.838 --> 00:40:24.239
very small. So you pick the five what

00:40:22.079 --> 00:40:25.920
you think is the most relevant content

00:40:24.239 --> 00:40:28.159
to your particular question and then you

00:40:25.920 --> 00:40:29.599
feed it in.
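The chunk-embed-retrieve loop just described can be sketched end to end. The embedder here is a toy bag-of-words vectorizer standing in for a real embedding model (BERT, OpenAI embeddings, etc.); the FAQ entries are invented. The retrieval logic — embed every chunk once up front, embed the incoming question, rank chunks by cosine similarity, keep the top k — is the actual RAG idea.

```python
import math

def embed(text):
    """Toy bag-of-words 'embedding' -- a stand-in for BERT/OpenAI embeddings."""
    vec = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0.0) + 1.0
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# The FAQ: each Q&A pair is its own little chunk, embedded once up front.
faq = [
    "How do I reset my password?",
    "What are your shipping rates?",
    "How do I cancel my subscription?",
]
faq_vecs = [embed(chunk) for chunk in faq]

def retrieve(question, k=1):
    """Embed the question, rank chunks by similarity, keep the top k."""
    qv = embed(question)
    ranked = sorted(zip(faq, faq_vecs), key=lambda p: cosine(qv, p[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]
```

The retrieved chunks, not the whole FAQ, are what get packed into the prompt alongside the user's question.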

00:40:28.159 --> 00:40:32.879
That's the idea that is retrieval

00:40:29.599 --> 00:40:34.880
augmented generation. Yeah, Rolando. So

00:40:32.880 --> 00:40:36.559
does this tie in — for example, if I

00:40:34.880 --> 00:40:38.800
were to prompt and say help me work on

00:40:36.559 --> 00:40:41.358
my startup pitch but given the voice of

00:40:38.800 --> 00:40:45.519
Steve Jobs is it then kind of going out

00:40:41.358 --> 00:40:48.000
there and reducing the subset of of data

00:40:45.519 --> 00:40:49.759
to things that have been written by

00:40:48.000 --> 00:40:51.358
Steve Jobs and then it's kind of

00:40:49.760 --> 00:40:53.680
generating its response based on that?

00:40:51.358 --> 00:40:54.960
>> uh not as a default not as a default

00:40:53.679 --> 00:40:56.879
typically, because there's a lot of Steve Jobs

00:40:54.960 --> 00:40:57.920
stuff on the web it's just using that

00:40:56.880 --> 00:41:00.160
because it's all part of its

00:40:57.920 --> 00:41:01.920
pre-training data but this tends to be

00:41:00.159 --> 00:41:03.838
more useful for very targeted

00:41:01.920 --> 00:41:05.200
applications where you don't expect it to

00:41:03.838 --> 00:41:07.519
know the answer because it is not on the

00:41:05.199 --> 00:41:09.039
public internet.

00:41:07.519 --> 00:41:10.559
It's your proprietary data and you

00:41:09.039 --> 00:41:12.800
wanted to use that proprietary data and

00:41:10.559 --> 00:41:15.838
this is how you do it.

00:41:12.800 --> 00:41:19.079
Uh yeah

00:41:15.838 --> 00:41:19.078
this certain

00:41:19.119 --> 00:41:23.838
>> sure like that there will be some loss.

00:41:22.400 --> 00:41:26.000
>> There will be some loss because you have

00:41:23.838 --> 00:41:28.960
to figure out how to chunk it right. Uh

00:41:26.000 --> 00:41:30.559
maybe you have a 300-page PDF and then

00:41:28.960 --> 00:41:32.000
maybe you look for each section and make

00:41:30.559 --> 00:41:33.679
it a chunk. Maybe you look for each

00:41:32.000 --> 00:41:36.000
paragraph, make it a chunk. Again,

00:41:33.679 --> 00:41:37.679
there's a whole empirical sort of

00:41:36.000 --> 00:41:39.039
cottage industry of techniques for doing

00:41:37.679 --> 00:41:40.559
these things better or worse depending

00:41:39.039 --> 00:41:42.719
on the use case and so on and so forth.

00:41:40.559 --> 00:41:43.759
But the conceptual idea is chunk and

00:41:42.719 --> 00:41:46.318
embed.

00:41:43.760 --> 00:41:47.359
>> Chunking is another use.

00:41:46.318 --> 00:41:49.519
>> Yeah. In fact, we're going to do it

00:41:47.358 --> 00:41:50.559
ourselves in the Colab right now.

00:41:49.519 --> 00:41:54.572
>> Yeah.

00:41:50.559 --> 00:41:55.838
>> Can we give more weightage to certain chunks? Uh

00:41:54.572 --> 00:41:58.400
[laughter]

00:41:55.838 --> 00:42:00.239
so in the default implementation no but

00:41:58.400 --> 00:42:02.000
but in some sense you by picking the

00:42:00.239 --> 00:42:04.479
five most relevant chunks from 10,000

00:42:02.000 --> 00:42:06.559
chunks, you're giving the other

00:42:04.480 --> 00:42:08.159
you know 10,000 minus five chunks a

00:42:06.559 --> 00:42:10.719
weight of zero and these a weight of

00:42:08.159 --> 00:42:12.078
one. So in some sense you're weighting it.

00:42:10.719 --> 00:42:13.598
>> Yeah.

00:42:12.079 --> 00:42:14.720
>> I was just curious how much structure

00:42:13.599 --> 00:42:16.880
you have to have with an external

00:42:14.719 --> 00:42:19.759
document say hospital or something. Do

00:42:16.880 --> 00:42:21.039
you have to do a bunch of, like, labeling?

00:42:19.760 --> 00:42:23.680
>> No, you just need to make sure it's kind

00:42:21.039 --> 00:42:26.079
of relatively clean. Uh but you will see

00:42:23.679 --> 00:42:28.879
in the collab that it can be kind of

00:42:26.079 --> 00:42:30.079
crappy and it still works. Yeah, because

00:42:28.880 --> 00:42:33.200
there is so much crap on the internet

00:42:30.079 --> 00:42:34.480
has been trained on already. So, okay.

00:42:33.199 --> 00:42:36.719
So, all right. So, let's look at the

00:42:34.480 --> 00:42:38.318
Colab.

00:42:36.719 --> 00:42:41.039
By the way, retrieval augmented generation

00:42:38.318 --> 00:42:43.039
is in my opinion the most prevalent

00:42:41.039 --> 00:42:45.920
business application of LLMs that I've

00:42:43.039 --> 00:42:47.599
seen to date. And

00:42:45.920 --> 00:42:51.358
there's a huge ecosystem of tools and

00:42:47.599 --> 00:42:52.640
vendors and so on and so forth.

00:42:51.358 --> 00:42:56.400
I'm going to skip through the verbiage

00:42:52.639 --> 00:42:58.799
here. Um, so you have to um install the

00:42:56.400 --> 00:43:00.480
OpenAI library

00:42:58.800 --> 00:43:01.920
and this thing called tiktoken, which

00:43:00.480 --> 00:43:03.440
we'll get to in a bit. I've already

00:43:01.920 --> 00:43:05.760
installed it before class because it

00:43:03.440 --> 00:43:08.000
takes some time. So I'll just make sure

00:43:05.760 --> 00:43:10.079
all these things are already

00:43:08.000 --> 00:43:12.880
good to go. So we don't have to wait for

00:43:10.079 --> 00:43:15.760
this. So I've imported pandas as before

00:43:12.880 --> 00:43:17.280
and so uh and you can read through these

00:43:15.760 --> 00:43:19.839
things because I'm just basically you

00:43:17.280 --> 00:43:23.519
know I have an OpenAI token that I

00:43:19.838 --> 00:43:24.719
have to use — a key, rather, an API key —

00:43:23.519 --> 00:43:25.920
and I'm not showing you the key

00:43:24.719 --> 00:43:27.759
obviously I have to remember to delete

00:43:25.920 --> 00:43:29.519
it before I upload the collab uh you

00:43:27.760 --> 00:43:31.599
have to get your own key to make it all

00:43:29.519 --> 00:43:34.639
work uh but the instructions are here.

00:43:31.599 --> 00:43:36.480
So we're going to use GPT-3.5 Turbo to

00:43:34.639 --> 00:43:38.639
demonstrate RAG. Right, so I give it the

00:43:36.480 --> 00:43:40.480
name of the model. And then OpenAI also

00:43:38.639 --> 00:43:43.679
has a whole bunch of different models

00:43:40.480 --> 00:43:45.760
which can be used for embeddings: you can feed it

00:43:43.679 --> 00:43:47.519
a sentence or a chunk of text it'll give

00:43:45.760 --> 00:43:49.040
you a contextual embedding out it's like

00:43:47.519 --> 00:43:50.800
a nice little API you don't have to use

00:43:49.039 --> 00:43:53.119
your own BERT and so on and so forth. You

00:43:50.800 --> 00:43:54.480
can just use the open AI embeddings

00:43:53.119 --> 00:43:55.680
obviously you have to pay OpenAI every

00:43:54.480 --> 00:44:00.440
time you make a request but it's really

00:43:55.679 --> 00:44:00.440
really cheap at this point. Uh, yep?

00:44:01.119 --> 00:44:05.358
>> A question about

00:44:03.440 --> 00:44:07.119
dealing with proprietary data: because

00:44:05.358 --> 00:44:09.598
a lot of companies are like we need to

00:44:07.119 --> 00:44:11.920
invest in our own LLM because we don't

00:44:09.599 --> 00:44:14.880
want our data to be going out in this

00:44:11.920 --> 00:44:16.720
kind of context. How good is the

00:44:14.880 --> 00:44:17.280
cyber security or the compliance and

00:44:16.719 --> 00:44:19.118
legal

00:44:17.280 --> 00:44:21.119
>> I think each vendor has their own sort

00:44:19.119 --> 00:44:22.559
of set of rules and contractual

00:44:21.119 --> 00:44:23.519
commitments they're willing to sign up

00:44:22.559 --> 00:44:25.199
for so you just

00:44:23.519 --> 00:44:27.440
>> if you use the data here does this go

00:44:25.199 --> 00:44:29.118
into the public domain or no

00:44:27.440 --> 00:44:29.760
>> but the vendor gets to see it

00:44:29.119 --> 00:44:31.760
>> okay

00:44:29.760 --> 00:44:33.839
>> right meaning the vendor systems get to

00:44:31.760 --> 00:44:36.160
see it, but do the vendors employees get

00:44:33.838 --> 00:44:38.078
to see it if they need to? Unclear.

00:44:36.159 --> 00:44:39.920
Those are all the like the legally sort

00:44:38.079 --> 00:44:41.119
of nitty-gritty you have to worry about.

00:44:39.920 --> 00:44:42.318
The other thing you can do is you can

00:44:41.119 --> 00:44:44.318
actually just download an open source

00:44:42.318 --> 00:44:46.000
LLM and do it all within your own

00:44:44.318 --> 00:44:48.239
premises.

00:44:46.000 --> 00:44:50.239
That's totally possible to do, right? In

00:44:48.239 --> 00:44:51.598
fact, um I probably won't have time

00:44:50.239 --> 00:44:52.959
today. I have a whole section on how do

00:44:51.599 --> 00:44:55.680
you actually do a fine-tuning with an

00:44:52.960 --> 00:44:58.400
open-source LLM, which I'll do a video,

00:44:55.679 --> 00:45:01.118
right, if we don't have time. Uh, okay.

00:44:58.400 --> 00:45:02.720
So, and this model, this

00:45:01.119 --> 00:45:03.920
embedding, ada-002, is the name of the

00:45:02.719 --> 00:45:05.118
OpenAI model that actually gives you

00:45:03.920 --> 00:45:07.680
contextual embedding. So, we're going to

00:45:05.119 --> 00:45:10.160
use that. So, so first thing we want to

00:45:07.679 --> 00:45:11.679
so the the use case here is that uh we

00:45:10.159 --> 00:45:13.598
have taken a whole bunch we want to ask

00:45:11.679 --> 00:45:15.598
the LLM we want to create a chatbot

00:45:13.599 --> 00:45:18.240
which can answer questions about the

00:45:15.599 --> 00:45:20.640
2022 Olympics like random questions you

00:45:18.239 --> 00:45:24.318
might have about the Olympics. So, uh so

00:45:20.639 --> 00:45:26.480
let's first ask it this question. Uh

00:45:24.318 --> 00:45:29.838
we'll ask it about the 2020 summer

00:45:26.480 --> 00:45:33.358
Olympics. Okay, that's the query and

00:45:29.838 --> 00:45:35.039
then this is the the API um request we

00:45:33.358 --> 00:45:36.400
have to make and you can read through

00:45:35.039 --> 00:45:38.559
it. I have linked to the documentation

00:45:36.400 --> 00:45:41.119
here as how it works and then it says

00:45:38.559 --> 00:45:42.799
that uh Barshim of Qatar and Tamberi

00:45:41.119 --> 00:45:44.000
of Italy both won the gold and you can

00:45:42.800 --> 00:45:46.480
actually fact check this is actually

00:45:44.000 --> 00:45:48.000
accurate. It's correct. Uh so now let's

00:45:46.480 --> 00:45:51.358
change the query and ask it about the

00:45:48.000 --> 00:45:53.358
2022 Winter Olympics. Okay. And why 22

00:45:51.358 --> 00:45:55.440
versus 20 will become clear in just a

00:45:53.358 --> 00:45:57.598
moment. So, which athletes won the gold

00:45:55.440 --> 00:46:00.480
in curling

00:45:57.599 --> 00:46:02.640
in the 22 Olympics? And it says the gold

00:46:00.480 --> 00:46:04.880
medal in curling was won by the Swedish

00:46:02.639 --> 00:46:07.920
men's team and the South Korean women's

00:46:04.880 --> 00:46:12.000
team. Okay, turns out if you fact check

00:46:07.920 --> 00:46:13.920
this, it turns out, wait for it, Sweden

00:46:12.000 --> 00:46:15.440
won the men's gold. Yes, South Korean

00:46:13.920 --> 00:46:17.358
team participated, but Great Britain

00:46:15.440 --> 00:46:19.599
actually won the women's gold. So, it

00:46:17.358 --> 00:46:22.078
got it wrong. So, it sounds like GPT-3.5

00:46:19.599 --> 00:46:24.559
Turbo could use some help. And now one

00:46:22.079 --> 00:46:27.119
of the things we can do is so the thing

00:46:24.559 --> 00:46:29.440
is, the reason why GPT-3.5 Turbo didn't

00:46:27.119 --> 00:46:32.400
know about this is because its training

00:46:29.440 --> 00:46:34.480
cutoff date was September 2021.

00:46:32.400 --> 00:46:37.280
So as far as it's concerned the 22

00:46:34.480 --> 00:46:39.519
Olympics haven't happened yet

00:46:37.280 --> 00:46:42.560
it confidently gave you the wrong answer

00:46:39.519 --> 00:46:43.920
as it is often prone to do. So and this

00:46:42.559 --> 00:46:45.519
is by the way is called hallucination

00:46:43.920 --> 00:46:50.159
where it gives you a very eloquent

00:46:45.519 --> 00:46:53.119
confident wrong answer. And so um

00:46:50.159 --> 00:46:54.480
or as some folks have said about um

00:46:53.119 --> 00:46:56.559
another business school that should

00:46:54.480 --> 00:46:59.519
remain nameless often in error but never

00:46:56.559 --> 00:47:02.239
in doubt. So um

00:46:59.519 --> 00:47:03.838
all right back to this uh so one simple

00:47:02.239 --> 00:47:06.719
thing we can try right off the bat is to

00:47:03.838 --> 00:47:08.239
tell GPT-3.5 Turbo — you can ask it to say I

00:47:06.719 --> 00:47:10.559
don't know if it doesn't know rather

00:47:08.239 --> 00:47:12.959
than just make stuff up right and how do

00:47:10.559 --> 00:47:14.559
you do it? It's very simple. You say in

00:47:12.960 --> 00:47:17.119
your prompt, answer the question as

00:47:14.559 --> 00:47:18.799
truthfully as possible. And if you're

00:47:17.119 --> 00:47:20.480
unsure of the answer, say, "Sorry, I

00:47:18.800 --> 00:47:22.560
don't know." Okay, now here's the

00:47:20.480 --> 00:47:25.519
question. Okay, this is a query. So,

00:47:22.559 --> 00:47:29.279
let's run it through.

00:47:25.519 --> 00:47:31.280
Sorry, I don't know. Not bad, huh? So,

00:47:29.280 --> 00:47:32.720
so it worked. It's sort of trying to be

00:47:31.280 --> 00:47:35.599
humble and honest and, you know,

00:47:32.719 --> 00:47:37.759
self-aware and things like that. Um,

00:47:35.599 --> 00:47:40.000
it's more like a Sloan at this point.

00:47:37.760 --> 00:47:41.040
All right. So now, as I

00:47:40.000 --> 00:47:42.159
mentioned earlier there's a you can

00:47:41.039 --> 00:47:44.159
check the cutoff date and you can see

00:47:42.159 --> 00:47:48.358
it's 2021 actually you know what let me

00:47:44.159 --> 00:47:48.358
just uh open a new tab

00:47:49.199 --> 00:47:53.118
so all these cutoff dates are training

00:47:50.800 --> 00:47:56.400
data right so 3.5 turbo this is what we

00:47:53.119 --> 00:47:59.440
are using — cutoff date 2021. Okay, that's

00:47:56.400 --> 00:48:01.280
why all right so now what we can do is

00:47:59.440 --> 00:48:02.960
we can obviously provide relevant

00:48:01.280 --> 00:48:04.880
data on the prompt itself sort of we can

00:48:02.960 --> 00:48:06.318
leading up to RAG here. And by the way,

00:48:04.880 --> 00:48:07.680
the extra information we provide in the

00:48:06.318 --> 00:48:08.960
prompt to help it answer a question is

00:48:07.679 --> 00:48:10.799
called context, right? That's sort of

00:48:08.960 --> 00:48:13.440
the lingo for it. So, we can do it,

00:48:10.800 --> 00:48:15.200
we'll first do it manually. Um, so we

00:48:13.440 --> 00:48:17.760
first we'll use the Wikipedia article

00:48:15.199 --> 00:48:19.838
for 2022 Winter Olympics and we tell it

00:48:17.760 --> 00:48:21.680
explicitly to make use of this context

00:48:19.838 --> 00:48:23.920
because telling things explicitly always

00:48:21.679 --> 00:48:25.679
seems to help. So, this is the thing we

00:48:23.920 --> 00:48:28.318
cut and pasted here, right? Wikipedia

00:48:25.679 --> 00:48:30.239
article on curling and it's like a

00:48:28.318 --> 00:48:32.800
pretty long article. It's got all kinds

00:48:30.239 --> 00:48:34.558
of stuff and it's not even all that like

00:48:32.800 --> 00:48:38.240
cleanly formatted, right? It's kind of

00:48:34.559 --> 00:48:39.359
it's very strange. Look at that.

00:48:38.239 --> 00:48:41.759
So, to answer your question,

00:48:39.358 --> 00:48:44.078
Spencer. It can be, you know, in pretty

00:48:41.760 --> 00:48:46.480
bad shape. It still seems to work. Okay.

00:48:44.079 --> 00:48:47.920
So now use below article on the Olympics

00:48:46.480 --> 00:48:49.599
to answer the subsequent question. If

00:48:47.920 --> 00:48:51.920
you don't know, say you don't know.

00:48:49.599 --> 00:48:53.760
Okay. So that's what we have. That's the

00:48:51.920 --> 00:48:55.599
query. And by the way, before I send it

00:48:53.760 --> 00:48:56.720
into the LLM, this is the actual query

00:48:55.599 --> 00:48:58.400
that's going to be sending. I'm printing

00:48:56.719 --> 00:49:00.159
out the query. Look at how long the

00:48:58.400 --> 00:49:02.240
query is. Use the article below. And

00:49:00.159 --> 00:49:04.399
here is the article. Scroll, scroll,

00:49:02.239 --> 00:49:05.759
scroll. There's a whole thing, right?

00:49:04.400 --> 00:49:07.680
And it keeps on going on. And then

00:49:05.760 --> 00:49:12.119
finally, I say which teams won the gold.

00:49:07.679 --> 00:49:12.118
So, okay, so let's run it.
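The query being built here — instruction, pasted article as context, then the actual question — can be sketched as a template. The exact wording below approximates what the Colab prints rather than reproducing it verbatim, and the article text is a placeholder.

```python
# Sketch of the manual context-stuffing step: instruction + pasted
# article + question, all packed into one long prompt string.
# `article` is a placeholder for the pasted Wikipedia text.

article = "...(paste the Wikipedia article on curling at the 2022 Winter Olympics here)..."

def build_query(question):
    return (
        "Use the below article on the 2022 Winter Olympics to answer "
        'the subsequent question. If the answer cannot be found, write "I don\'t know."\n\n'
        f"Article:\n{article}\n\n"
        f"Question: {question}"
    )

q = build_query("Which athletes won the gold medal in curling?")
print(q[:80])  # the full query runs to the article's full length
```

Only the question changes between runs; the instruction and article stay fixed, which is why swapping in "Did any athlete win multiple medals?" needs a one-line change.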

00:49:12.318 --> 00:49:16.880
Okay, look at that.

00:49:15.199 --> 00:49:19.679
Women's curling Great Britain. It got it

00:49:16.880 --> 00:49:22.640
right. Pretty good, right? I mean, it

00:49:19.679 --> 00:49:25.919
had to parse all that crap to get and

00:49:22.639 --> 00:49:27.199
find the nuggets, right? So, nicely done

00:49:25.920 --> 00:49:28.559
now. But maybe it wasn't super hard

00:49:27.199 --> 00:49:30.799
because we literally gave it the answer.

00:49:28.559 --> 00:49:32.720
So, let's make it a bit harder. So, I

00:49:30.800 --> 00:49:34.240
noticed that this person, Oscar Ericson,

00:49:32.719 --> 00:49:37.039
won two golds in the event, two medals

00:49:34.239 --> 00:49:39.519
in the event. So let's ask if any

00:49:37.039 --> 00:49:40.960
athlete won multiple medals. That

00:49:39.519 --> 00:49:44.000
requires a little bit of abstraction,

00:49:40.960 --> 00:49:46.400
right? So all right, same query. Did any

00:49:44.000 --> 00:49:47.599
athlete win multiple medals in curling?

00:49:46.400 --> 00:49:50.000
The question has changed. Everything

00:49:47.599 --> 00:49:51.920
else hasn't changed. Hit it. Let's see

00:49:50.000 --> 00:49:53.760
what happens.

00:49:51.920 --> 00:49:56.400
Yes, Oscar Ericson won multiple medals

00:49:53.760 --> 00:49:58.480
in curling. He won a gold in the men's

00:49:56.400 --> 00:50:00.720
event and a bronze in the mixed doubles.

00:49:58.480 --> 00:50:02.880
It's pretty cool, right? Take that

00:50:00.719 --> 00:50:04.239
Google. So

00:50:02.880 --> 00:50:05.440
all right now we come to retrieval

00:50:04.239 --> 00:50:06.719
augment generation where instead of

00:50:05.440 --> 00:50:07.838
doing it manually obviously because it

00:50:06.719 --> 00:50:09.919
doesn't scale we will do it

00:50:07.838 --> 00:50:11.519
automatically and so the thing you have

00:50:09.920 --> 00:50:12.800
to remember as I mentioned just a few

00:50:11.519 --> 00:50:15.920
minutes ago is that there is a context

00:50:12.800 --> 00:50:18.559
window for every LLM, and for GPT-3.5

00:50:15.920 --> 00:50:21.119
Turbo the context window is —

00:50:18.559 --> 00:50:24.400
sorry — 16,385 tokens. That is the length

00:50:21.119 --> 00:50:26.720
of the input and the output right so we

00:50:24.400 --> 00:50:29.280
can't exceed that. Uh, by the way, GPT-4's

00:50:26.719 --> 00:50:33.679
context window is I think up to 128,000

00:50:29.280 --> 00:50:35.280
tokens and GPT sorry Google Gemini 1.5

00:50:33.679 --> 00:50:38.399
pro they really need to work on their

00:50:35.280 --> 00:50:40.960
names Google Gemini 1.5 pro the context

00:50:38.400 --> 00:50:43.440
window is 1 million tokens
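The context-window figures just quoted can be collected into a small lookup table. This is only an illustration: the model names are informal shorthand rather than exact API identifiers, and the numbers are the ones stated in the lecture (input and output tokens combined).

```python
# Context-window sizes as quoted in the lecture (input + output tokens combined).
# Model names here are informal shorthand, not exact API identifiers.
CONTEXT_WINDOWS = {
    "gpt-3.5-turbo": 16_385,
    "gpt-4-turbo": 128_000,
    "gemini-1.5-pro": 1_000_000,
}

def fits_in_context(model: str, n_tokens: int) -> bool:
    """Check whether a prompt plus completion of n_tokens fits the model's window."""
    return n_tokens <= CONTEXT_WINDOWS[model]

print(fits_in_context("gpt-3.5-turbo", 20_000))  # → False
```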

00:50:40.960 --> 00:50:46.960
okay and in research they have tested 10

00:50:43.440 --> 00:50:48.079
million tokens, so crazy times. All that

00:50:46.960 --> 00:50:49.280
means is that you can upload entire

00:50:48.079 --> 00:50:51.839
videos and ask it questions about the

00:50:49.280 --> 00:50:53.359
video. So all right to come back to

00:50:51.838 --> 00:50:55.440
this. So what we'll do is we'll only

00:50:53.358 --> 00:50:57.920
grab the data from the Wikipedia

00:50:55.440 --> 00:50:59.200
articles the all the articles about the

00:50:57.920 --> 00:51:00.639
Olympics that are relevant to our

00:50:59.199 --> 00:51:02.719
question by using pre-trained

00:51:00.639 --> 00:51:04.318
embeddings. So again this is the thing

00:51:02.719 --> 00:51:06.879
we talked about earlier, right? This is

00:51:04.318 --> 00:51:08.159
the picture we saw in class. And the

00:51:06.880 --> 00:51:09.680
only thing I want to point out is that

00:51:08.159 --> 00:51:11.759
if you have a particular embedding for a

00:51:09.679 --> 00:51:13.199
question and a particular embedding for

00:51:11.760 --> 00:51:15.440
a chunk of text that you have in your

00:51:13.199 --> 00:51:17.679
database, you have to figure out how

00:51:15.440 --> 00:51:21.280
similar how related they are. And for

00:51:17.679 --> 00:51:24.799
that we can use the

00:51:21.280 --> 00:51:27.119
dot product, or something slightly

00:51:24.800 --> 00:51:29.039
different from the dot product which is

00:51:27.119 --> 00:51:31.440
easier for us to work with: the cosine

00:51:29.039 --> 00:51:32.800
similarity. We have done cosine

00:51:31.440 --> 00:51:34.240
similarity previously. I've explained it

00:51:32.800 --> 00:51:35.519
in class. We're just going to use cosine

00:51:34.239 --> 00:51:37.519
similarity. How similar are these

00:51:35.519 --> 00:51:40.400
vectors? So that's what we're going to

00:51:37.519 --> 00:51:42.559
do. Um all right. So the same picture as

00:51:40.400 --> 00:51:43.920
we saw in class. So the first thing

00:51:42.559 --> 00:51:45.839
we'll do is break up the data

00:51:43.920 --> 00:51:47.039
set into sections and then take each

00:51:45.838 --> 00:51:49.199
section and then run it through the

00:51:47.039 --> 00:51:50.558
embedding thing. But fortunately for us

00:51:49.199 --> 00:51:52.399
uh I have code here which actually does

00:51:50.559 --> 00:51:54.640
it for you manually. You can play around

00:51:52.400 --> 00:51:56.639
with it later. But OpenAI has already

00:51:54.639 --> 00:51:58.078
given us the chunked data set. So we

00:51:56.639 --> 00:52:00.000
just use that because it's just easy for

00:51:58.079 --> 00:52:01.519
us. And I downloaded it already because it

00:52:00.000 --> 00:52:02.800
takes five minutes to download.

00:52:01.519 --> 00:52:04.719
I've downloaded this thing and I've

00:52:02.800 --> 00:52:07.200
stuck it in a particular data frame

00:52:04.719 --> 00:52:09.598
here. So let's print out five randomly

00:52:07.199 --> 00:52:12.078
chosen chunks. Um so you can see here

00:52:09.599 --> 00:52:14.559
right this is the first chunk somebody

00:52:12.079 --> 00:52:17.119
else, somebody else, and look at

00:52:14.559 --> 00:52:19.200
all this crazy stuff here right the

00:52:17.119 --> 00:52:21.119
formatting is off but these are all you

00:52:19.199 --> 00:52:22.480
know basically paragraphs and sections

00:52:21.119 --> 00:52:24.559
just grabbed straight from Wikipedia

00:52:22.480 --> 00:52:28.240
with no cleaning.
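The chunking step being described can be sketched with a naive fixed-size splitter. This is only an illustration under simple assumptions: real pipelines, including the OpenAI dataset used here, split on section and paragraph boundaries instead of raw word windows.

```python
def chunk_text(text: str, max_words: int = 400) -> list:
    # Naive chunker: slide a fixed-size word window over the text.
    # Real pipelines usually split on headings/paragraphs instead.
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

chunks = chunk_text("word " * 1000, max_words=400)
print(len(chunks))  # → 3 chunks (400 + 400 + 200 words)
```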

00:52:24.559 --> 00:52:30.880
Okay, now we define a simple function to

00:52:28.239 --> 00:52:33.279
basically send in any arbitrary piece of

00:52:30.880 --> 00:52:35.200
text into the embedding model and get

00:52:33.280 --> 00:52:36.800
the contextual embedding vector out,

00:52:35.199 --> 00:52:39.118
right? And there is this little function

00:52:36.800 --> 00:52:40.640
that does that. Okay, we're using an

00:52:39.119 --> 00:52:42.400
embedding model. We send in a text, it

00:52:40.639 --> 00:52:45.039
gives you something. So let's try it on

00:52:42.400 --> 00:52:48.039
that is amazing. You should get a vector

00:52:45.039 --> 00:52:48.039
back.
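The little embedding helper being described might look like the sketch below. It assumes the `openai` Python package and an `OPENAI_API_KEY` environment variable, and the default model name is an illustrative assumption, not taken from the notebook; the import is done lazily so the function can be defined even without the package installed.

```python
def get_embedding(text: str, model: str = "text-embedding-ada-002") -> list:
    """Send a piece of text to an embedding model, return the embedding vector.

    Assumes the `openai` package and an OPENAI_API_KEY environment variable;
    the default model name is an assumption for illustration.
    """
    from openai import OpenAI  # lazy import: only needed when actually called
    client = OpenAI()
    response = client.embeddings.create(model=model, input=text)
    return response.data[0].embedding
```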

00:52:51.280 --> 00:52:55.599
Oh, come on. Don't fail me now.

00:52:56.000 --> 00:53:02.400
All right. How long is it? 1536. Um, so

00:53:00.800 --> 00:53:04.240
how about I say HODL is incredible.

00:53:02.400 --> 00:53:05.599
Like HODL is amazing. Hopefully the two

00:53:04.239 --> 00:53:09.919
vectors would be kind of similar in

00:53:05.599 --> 00:53:11.440
terms of cosine, right? So um and so to

00:53:09.920 --> 00:53:13.680
calculate the cosine distance, I use

00:53:11.440 --> 00:53:15.679
this particular function from SciPy. It

00:53:13.679 --> 00:53:18.799
just calculates the cosine similarity

00:53:15.679 --> 00:53:21.440
and I hit it. So 0.9934

00:53:18.800 --> 00:53:23.280
maximum is one, right? So 0.9934 means

00:53:21.440 --> 00:53:24.720
that they're very very similar. which is

00:53:23.280 --> 00:53:27.119
comforting because amazing and

00:53:24.719 --> 00:53:29.598
incredible are obviously synonyms.
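Cosine similarity itself is just a normalized dot product; a minimal pure-Python version is below. (The notebook uses SciPy's `spatial.distance.cosine`, which returns the cosine distance, i.e. one minus this value.)

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| |b|); 1.0 means same direction, 0.0 orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # → 1.0
```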

00:53:27.119 --> 00:53:32.000
okay so now given a data frame with a

00:53:29.599 --> 00:53:33.119
column of text chunks in it we can use

00:53:32.000 --> 00:53:34.800
this function on every one of these

00:53:33.119 --> 00:53:36.160
things to calculate the embedding right

00:53:34.800 --> 00:53:37.440
and you have a function here that

00:53:36.159 --> 00:53:39.199
basically does it for you I'm not going

00:53:37.440 --> 00:53:41.280
to run it uh because it takes a long

00:53:39.199 --> 00:53:42.799
time so but you can run it later on uh

00:53:41.280 --> 00:53:44.960
just be prepared go get a cup of coffee

00:53:42.800 --> 00:53:47.119
and stuff while it does it uh but once

00:53:44.960 --> 00:53:48.559
you. But happily for us, OpenAI has actually

00:53:47.119 --> 00:53:50.160
already done this step for us so we

00:53:48.559 --> 00:53:51.760
don't have to uh so it's already

00:53:50.159 --> 00:53:53.920
available in this data frame so if you

00:53:51.760 --> 00:53:56.079
actually Look at this. And you can see

00:53:53.920 --> 00:53:58.000
here there is a text and then there is

00:53:56.079 --> 00:54:00.079
an embedding that's right sitting right

00:53:58.000 --> 00:54:02.880
there right next to it. Okay. And these

00:54:00.079 --> 00:54:07.839
embeddings are whatever 15 how long is

00:54:02.880 --> 00:54:12.280
it? 1536 long. 1536 long vectors. Okay.

00:54:07.838 --> 00:54:12.279
Um All right. So that's what we have.

00:54:14.079 --> 00:54:18.640
Okay. So now that we have this thing

00:54:16.400 --> 00:54:20.240
whenever we get a question we calculate

00:54:18.639 --> 00:54:22.400
the question's embedding and then

00:54:20.239 --> 00:54:23.919
calculate its cosine similarity

00:54:22.400 --> 00:54:26.800
with all the embedding sitting in this

00:54:23.920 --> 00:54:28.079
data frame. Okay. So to do that we're

00:54:26.800 --> 00:54:29.839
going to define a couple of helper

00:54:28.079 --> 00:54:31.680
functions here. You can read through the

00:54:29.838 --> 00:54:33.199
Python later to understand; these are

00:54:31.679 --> 00:54:36.480
basic Python manipulations that are

00:54:33.199 --> 00:54:38.799
going on. Um and so let's just test this

00:54:36.480 --> 00:54:41.440
function. So basically we have a little

00:54:38.800 --> 00:54:44.079
function called strings ranked by

00:54:41.440 --> 00:54:46.400
relatedness where you give it any input

00:54:44.079 --> 00:54:49.280
question or text and then it's going to

00:54:46.400 --> 00:54:52.000
give you the top five most related

00:54:49.280 --> 00:54:55.680
chunks of text that it has in its data

00:54:52.000 --> 00:54:59.159
frame. Okay. So uh let me just run this

00:54:55.679 --> 00:54:59.159
thing. Okay.
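The ranking helper might be sketched like this. It is a simplified stand-in for the notebook's `strings_ranked_by_relatedness`: the query embedding is passed in directly (in the real code it comes from the embedding API), and the rows are toy (text, embedding) pairs.

```python
import math

def rank_by_relatedness(query_emb, rows, top_n=5):
    """Return the top_n (text, similarity) pairs, most related first.

    rows is a list of (text, embedding) pairs with precomputed embeddings;
    in the real notebook the query embedding comes from the embedding API.
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    scored = [(text, cos(query_emb, emb)) for text, emb in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Toy example: two-dimensional "embeddings" for three chunks.
rows = [("curling results", [1.0, 0.0]),
        ("ski jumping", [0.0, 1.0]),
        ("curling medals", [0.9, 0.1])]
print(rank_by_relatedness([1.0, 0.0], rows, top_n=2))
```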

00:55:00.000 --> 00:55:03.599
So curling. The things it pulls back had

00:55:02.079 --> 00:55:06.000
better involve curling and medals and

00:55:03.599 --> 00:55:09.119
so on. So this one has a cosine

00:55:06.000 --> 00:55:11.280
similarity of 0.888: curling at the 2022

00:55:09.119 --> 00:55:13.599
Olympics. That's good. Result summary.

00:55:11.280 --> 00:55:14.960
Medal summary. Result summary. It's all

00:55:13.599 --> 00:55:17.280
pretty good, right? Even the fifth one

00:55:14.960 --> 00:55:18.720
has a cosine similarity of 0.867, which is

00:55:17.280 --> 00:55:20.800
pretty high. So it's doing the right

00:55:18.719 --> 00:55:22.239
things. It's picked up 'curling gold

00:55:20.800 --> 00:55:25.200
medal' as the input text. It's picked up the

00:55:22.239 --> 00:55:28.078
right things from it. Um, now let's see

00:55:25.199 --> 00:55:30.000
what we can do um

00:55:28.079 --> 00:55:31.519
with the original question. So here is a

00:55:30.000 --> 00:55:33.358
header I'm going to use in the prompt.

00:55:31.519 --> 00:55:35.199
I'm going to say use the below articles

00:55:33.358 --> 00:55:36.400
to answer the subsequent question.

00:55:35.199 --> 00:55:37.439
Answer the questions as truthfully as

00:55:36.400 --> 00:55:38.880
possible. And if you're unsure of the

00:55:37.440 --> 00:55:41.519
answer, say sorry, I don't know. As

00:55:38.880 --> 00:55:42.800
before. Okay, that's our prompt. Uh, and

00:55:41.519 --> 00:55:44.960
now here's the thing. We don't want to

00:55:42.800 --> 00:55:46.559
exceed the context window, right? So, we

00:55:44.960 --> 00:55:48.240
need to count the tokens we're

00:55:46.559 --> 00:55:49.440
sending in and the likely number of

00:55:48.239 --> 00:55:51.439
tokens we're going to get back so that

00:55:49.440 --> 00:55:53.679
we don't exceed the budget. So, we use

00:55:51.440 --> 00:55:55.679
this package called tiktoken

00:55:53.679 --> 00:55:57.279
for this. Uh, and then it just, you

00:55:55.679 --> 00:55:58.480
know, helps you count the tokens. And

00:55:57.280 --> 00:56:00.079
you can read through this. It's just

00:55:58.480 --> 00:56:03.199
again some basic Python for counting

00:56:00.079 --> 00:56:05.519
tokens. And now what we do is um this

00:56:03.199 --> 00:56:08.318
is where we actually assemble the

00:56:05.519 --> 00:56:09.838
prompt. We start with the header right

00:56:08.318 --> 00:56:12.719
we have the header which says you know

00:56:09.838 --> 00:56:14.318
be truthful and all that. Then we say uh

00:56:12.719 --> 00:56:16.719
here is a question that

00:56:14.318 --> 00:56:18.400
I'm going to ask you and then you go in

00:56:16.719 --> 00:56:21.199
there and keep grabbing Wikipedia

00:56:18.400 --> 00:56:23.680
articles till the number of tokens in

00:56:21.199 --> 00:56:26.639
your prompt is about to exceed your token

00:56:23.679 --> 00:56:27.838
budget and then you stop. Right? When

00:56:26.639 --> 00:56:28.798
you're about to exceed the budget you

00:56:27.838 --> 00:56:31.119
stop because you can't exceed the

00:56:28.798 --> 00:56:34.239
budget. Um, and that's that's the whole

00:56:31.119 --> 00:56:38.480
thing. So here, uh, all right, let's

00:56:34.239 --> 00:56:40.159
just do tiktoken. Run this function.
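The assembly loop just described can be sketched as follows. The real notebook counts tokens with tiktoken; here a crude word count stands in so the sketch runs anywhere, and the header, chunk, and question strings are placeholders.

```python
def build_prompt(question, ranked_chunks, header, token_budget, count_tokens):
    """Pack as many retrieved chunks as fit under token_budget.

    count_tokens is pluggable: the notebook uses tiktoken, but any
    callable from string to int works for this sketch.
    """
    question_part = f"\n\nQuestion: {question}"
    prompt = header
    for chunk in ranked_chunks:
        candidate = prompt + f'\n\nWikipedia article section:\n"""\n{chunk}\n"""'
        if count_tokens(candidate + question_part) > token_budget:
            break  # about to exceed the budget: stop adding chunks
        prompt = candidate
    return prompt + question_part

rough_count = lambda text: len(text.split())  # crude stand-in for tiktoken
prompt = build_prompt("Who won gold in curling?",
                      ["chunk one " * 50, "chunk two " * 50, "chunk three " * 50],
                      header="Use the below articles to answer the question.",
                      token_budget=120, count_tokens=rough_count)
print("chunk two" in prompt)  # → False: only the first chunk fit the budget
```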

00:56:38.480 --> 00:56:42.960
Now, it turns out, as you saw, we can go

00:56:40.159 --> 00:56:45.440
up to 16,000-something tokens in

00:56:42.960 --> 00:56:48.400
the context window. I'm just using

00:56:45.440 --> 00:56:49.920
3,700 as my budget. Uh, partly because

00:56:48.400 --> 00:56:52.160
just to show you how to use this thing.

00:56:49.920 --> 00:56:54.880
Uh, and also because it's charging my

00:56:52.159 --> 00:56:56.480
credit card for every token that I'm

00:56:54.880 --> 00:56:59.280
using, right? So, I'm just being

00:56:56.480 --> 00:57:01.280
careful. um it charges by the token.

00:56:59.280 --> 00:57:03.519
It's a beautiful business model. Anyway,

00:57:01.280 --> 00:57:05.040
so back here, so let's ask the question,

00:57:03.519 --> 00:57:06.960
which athletes won the gold medal in

00:57:05.039 --> 00:57:08.558
curling at the Olympics? Here is the

00:57:06.960 --> 00:57:11.039
data frame that you should use. Here is

00:57:08.559 --> 00:57:13.440
the GPT model and don't exceed 3,700

00:57:11.039 --> 00:57:15.679
tokens. Okay, that's the the query or

00:57:13.440 --> 00:57:17.280
the prompt. It's going to compose the

00:57:15.679 --> 00:57:19.519
prompt now. And this is the whole

00:57:17.280 --> 00:57:23.400
prompt. Okay. Uh let's just go to the

00:57:19.519 --> 00:57:23.400
very top. It's really long.

00:57:24.079 --> 00:57:27.440
Okay. So, all right. use the below

00:57:25.920 --> 00:57:29.200
articles to answer the subsequent question as

00:57:27.440 --> 00:57:31.920
possible and boom boom boom boom boom it

00:57:29.199 --> 00:57:33.118
has all these things. It's added a

00:57:31.920 --> 00:57:35.920
whole bunch of paragraphs from the

00:57:33.119 --> 00:57:37.358
Wikipedia pages okay and then it finally

00:57:35.920 --> 00:57:39.599
ends with a question which athletes won

00:57:37.358 --> 00:57:41.759
the gold okay all right now let's just

00:57:39.599 --> 00:57:44.240
ask it the thing and this is just a

00:57:41.760 --> 00:57:47.200
little function to to send stuff into

00:57:44.239 --> 00:57:53.279
the API and now we are finally ready to

00:57:47.199 --> 00:57:55.519
ask GPT the question, fingers crossed
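The little send-to-the-API function might look like the sketch below. It assumes the `openai` package and an `OPENAI_API_KEY` environment variable; the model name and temperature are illustrative choices, not the notebook's exact settings, and the import is lazy so the function can be defined without the package installed.

```python
def ask(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    """Send the assembled prompt to a chat model and return the reply text.

    Assumes the `openai` package and an OPENAI_API_KEY environment variable;
    model name and temperature are illustrative assumptions.
    """
    from openai import OpenAI  # lazy import: only needed when actually called
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # low temperature for more repeatable question answering
    )
    return response.choices[0].message.content
```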

00:57:53.280 --> 00:57:58.400
all right curling

00:57:55.519 --> 00:58:01.199
Stefania Constantini in the mixed doubles and

00:57:58.400 --> 00:58:03.920
the team consisting of blah blah blah in

00:58:01.199 --> 00:58:06.159
the men's tournament and oh

00:58:03.920 --> 00:58:08.880
interesting it has actually ignored the

00:58:06.159 --> 00:58:12.798
Great Britain team completely, I think,

00:58:08.880 --> 00:58:14.960
right? Uh, last night it didn't. Welcome to

00:58:12.798 --> 00:58:16.480
stochasticity.

00:58:14.960 --> 00:58:19.039
so you can try it when you try it might

00:58:16.480 --> 00:58:21.119
actually give you the thing, and

00:58:19.039 --> 00:58:24.000
so let's ask it now a question about the

00:58:21.119 --> 00:58:25.838
2016 winter Olympics uh which by the way

00:58:24.000 --> 00:58:31.280
didn't happen there were no winter

00:58:25.838 --> 00:58:34.798
Olympics in 2016. So if you ask it,

00:58:31.280 --> 00:58:36.559
it says, sorry, I don't know. All right. Now let's

00:58:34.798 --> 00:58:38.960
change the header so that we don't say

00:58:36.559 --> 00:58:40.798
be truthful. So we will remove the need

00:58:38.960 --> 00:58:43.679
for it to be truthful and see what

00:58:40.798 --> 00:58:48.759
happens.

00:58:43.679 --> 00:58:48.759
All right, which athletes won the gold?

00:58:50.960 --> 00:58:55.838
Oh, now it's telling you about the 2022

00:58:53.199 --> 00:58:57.679
Olympics. So it answered an irrelevant

00:58:55.838 --> 00:58:59.440
question accurately.

00:58:57.679 --> 00:59:01.919
Okay, if you remove the need for it to

00:58:59.440 --> 00:59:04.400
uh to be truthful. So the I guess the

00:59:01.920 --> 00:59:07.280
moral of the story is that um first of

00:59:04.400 --> 00:59:09.039
all you can use RAG to grab stuff from

00:59:07.280 --> 00:59:10.319
massive databases and it's very heavily

00:59:09.039 --> 00:59:12.239
used in industry. Number one, number

00:59:10.318 --> 00:59:13.838
two. Um you have to be careful about

00:59:12.239 --> 00:59:16.719
these token budgets and so on and so

00:59:13.838 --> 00:59:18.159
forth. Uh and small wording changes in

00:59:16.719 --> 00:59:20.318
the prompt can actually dramatically

00:59:18.159 --> 00:59:21.838
alter behavior which makes it very

00:59:20.318 --> 00:59:25.279
difficult in enterprise settings to do

00:59:21.838 --> 00:59:27.679
QA on this stuff. Okay. Uh so a lot of

00:59:25.280 --> 00:59:29.200
care has to go into it. Uh you know and

00:59:27.679 --> 00:59:30.960
you have seen examples of for example

00:59:29.199 --> 00:59:32.639
Air Canada had a chatbot which actually

00:59:30.960 --> 00:59:34.240
gave the wrong advice to a customer. The

00:59:32.639 --> 00:59:35.679
customer sued Air Canada and then the

00:59:34.239 --> 00:59:37.199
court ruled in favor of the

00:59:35.679 --> 00:59:39.118
passenger and then they pulled the

00:59:37.199 --> 00:59:40.480
chatbot off the website. Right? So you

00:59:39.119 --> 00:59:42.160
got to be very careful. I think without

00:59:40.480 --> 00:59:43.519
a human in the loop checking these

00:59:42.159 --> 00:59:45.199
answers it's kind of dangerous in my

00:59:43.519 --> 00:59:47.440
opinion at this current state. Hopefully

00:59:45.199 --> 00:59:48.960
it'll get better but you have to be

00:59:47.440 --> 00:59:51.039
there's a lot of potential but you have

00:59:48.960 --> 00:59:52.798
to be careful. All right. So this

00:59:51.039 --> 00:59:54.719
is what we have. Um, and you can

00:59:52.798 --> 00:59:57.039
actually take this thing here and use

00:59:54.719 --> 00:59:58.719
it. Um, you can actually, you know, take

00:59:57.039 --> 01:00:00.639
like a thousand-page PDF that you might

00:59:58.719 --> 01:00:02.239
have or something and then chunk it and

01:00:00.639 --> 01:00:03.358
use this approach. And I've done it for

01:00:02.239 --> 01:00:04.639
a whole bunch of different things. It

01:00:03.358 --> 01:00:05.920
actually works really well, right? Most

01:00:04.639 --> 01:00:07.039
of the time it'll make errors here and

01:00:05.920 --> 01:00:11.599
there. Most of the time it actually

01:00:07.039 --> 01:00:14.318
works really well. Okay. So, um, yeah.

01:00:11.599 --> 01:00:18.318
>> Sorry, just a question. when when like

01:00:14.318 --> 01:00:20.159
GPT-4 now lets you upload PDFs, is it

01:00:18.318 --> 01:00:21.199
chunking that or is it actually

01:00:20.159 --> 01:00:22.719
ingesting all the

01:00:21.199 --> 01:00:25.759
>> No, when you upload something because

01:00:22.719 --> 01:00:27.919
GPT-4 Turbo has 128,000 tokens which

01:00:25.760 --> 01:00:29.200
means it can accommodate a whole bunch

01:00:27.920 --> 01:00:31.200
of documents. So when you upload stuff

01:00:29.199 --> 01:00:32.960
it's not doing any chunking. The chunking

01:00:31.199 --> 01:00:34.798
you're talking about you have to do. The

01:00:32.960 --> 01:00:36.240
LLM doesn't even know you're doing it.

01:00:34.798 --> 01:00:38.239
As far as the LLM is concerned, it's

01:00:36.239 --> 01:00:39.519
only seeing the prompt it sees and the

01:00:38.239 --> 01:00:40.639
prompt says, "Hey, here's a bunch of

01:00:39.519 --> 01:00:41.759
information. Here's a question. Answer

01:00:40.639 --> 01:00:44.159
it for me using this information. Be

01:00:41.760 --> 01:00:46.799
truthful." That's it.

01:00:44.159 --> 01:00:49.440
Now when you ask these things a question

01:00:46.798 --> 01:00:51.759
um which is later than its training

01:00:49.440 --> 01:00:53.920
data, you will actually see GPT-4 saying

01:00:51.760 --> 01:00:55.760
doing a Bing search and things like

01:00:53.920 --> 01:00:58.079
that. What's actually going on is

01:00:55.760 --> 01:00:59.920
there's a pre-processing step

01:00:58.079 --> 01:01:01.760
and a program which is doing a Bing

01:00:59.920 --> 01:01:04.159
search, gathering a bunch of Bing

01:01:01.760 --> 01:01:06.799
results, taking the top few results,

01:01:04.159 --> 01:01:08.960
chunking, embedding, packing into a

01:01:06.798 --> 01:01:10.159
prompt, sending it into GPT-4, and you

01:01:08.960 --> 01:01:11.358
don't know all of this is going on

01:01:10.159 --> 01:01:12.558
under the hood. So

01:01:11.358 --> 01:01:13.679
when it's actually thinking and saying

01:01:12.559 --> 01:01:17.000
Bing search, this is what's going on

01:01:13.679 --> 01:01:17.000
under the hood.

01:01:19.199 --> 01:01:24.798
Was there a question somewhere here?

01:01:21.679 --> 01:01:26.558
No. Oh, sorry. Yeah.

01:01:24.798 --> 01:01:29.280
I have a question about formatting.

01:01:26.559 --> 01:01:31.519
Yeah. So, it seems to be able to

01:01:29.280 --> 01:01:33.920
understand and ignore irrelevant

01:01:31.519 --> 01:01:35.759
formatting even though there's

01:01:33.920 --> 01:01:38.480
colloquial tables, not really defined

01:01:35.760 --> 01:01:40.559
tables. And also when it outputs

01:01:38.480 --> 01:01:44.000
formats, it's able to do it really

01:01:40.559 --> 01:01:46.000
humanly. Is that something that's

01:01:44.000 --> 01:01:47.199
figuring out through the neural network

01:01:46.000 --> 01:01:49.280
or just something that's kind of being

01:01:47.199 --> 01:01:49.919
programmed in the head or somewhere with

01:01:49.280 --> 01:01:51.280
standard?

01:01:49.920 --> 01:01:53.200
>> There is no explicit programming going

01:01:51.280 --> 01:01:54.720
on. It's typically because a lot of the

01:01:53.199 --> 01:01:56.078
question answer pairs that it was used

01:01:54.719 --> 01:01:57.358
for supervised fine-tuning and

01:01:56.079 --> 01:02:00.079
instruction tuning and reinforcement

01:01:57.358 --> 01:02:02.400
learning, right? The better answers with

01:02:00.079 --> 01:02:04.079
the same sort of badly formatted input,

01:02:02.400 --> 01:02:06.079
the better answers are just rewarded and

01:02:04.079 --> 01:02:08.240
ranked higher. That's what's going on.

01:02:06.079 --> 01:02:10.318
But on a related note, what one thing

01:02:08.239 --> 01:02:12.000
that's very useful is that uh you can

01:02:10.318 --> 01:02:14.239
actually ask it to give you the

01:02:12.000 --> 01:02:16.719
answer back using certain formats like

01:02:14.239 --> 01:02:19.118
markdown and JSON and things like that.

01:02:16.719 --> 01:02:21.039
And by forcing it to adhere to a certain

01:02:19.119 --> 01:02:22.079
well-defined format, you actually

01:02:21.039 --> 01:02:23.119
increase the chance of it actually

01:02:22.079 --> 01:02:24.798
getting the right answer in the first

01:02:23.119 --> 01:02:26.720
place.

01:02:24.798 --> 01:02:28.719
Uh again, there's like a whole tangent

01:02:26.719 --> 01:02:30.719
here we can go into, but those are some

01:02:28.719 --> 01:02:33.039
of the things that uh are part of prompt

01:02:30.719 --> 01:02:37.159
engineering. All right, so that's what

01:02:33.039 --> 01:02:37.159
we have here. Back to the PowerPoint.

01:02:40.639 --> 01:02:46.000
So that's retrieval augmented generation

01:02:42.559 --> 01:02:49.599
and we finally come to fine-tuning. So

01:02:46.000 --> 01:02:51.760
fine-tuning is when up to this point all

01:02:49.599 --> 01:02:54.240
the things we have seen don't alter the

01:02:51.760 --> 01:02:55.599
internals of the LLM. You have not

01:02:54.239 --> 01:02:56.798
messed around with the weights or changed

01:02:55.599 --> 01:03:00.000
them at all. You're just using it

01:02:56.798 --> 01:03:01.679
as a black box. Right? With fine-tuning

01:03:00.000 --> 01:03:04.000
you actually will train it further

01:03:01.679 --> 01:03:07.440
meaning the weights are going to change.

01:03:04.000 --> 01:03:11.440
Okay. So now remember we take something

01:03:07.440 --> 01:03:13.599
like a causal encoder like GPT, right,

01:03:11.440 --> 01:03:15.280
and then, I haven't fixed this

01:03:13.599 --> 01:03:17.760
yet. There is no ReLU here as I

01:03:15.280 --> 01:03:19.280
mentioned earlier okay just remember

01:03:17.760 --> 01:03:21.599
that

01:03:19.280 --> 01:03:23.359
and then if you have domain specific

01:03:21.599 --> 01:03:25.760
input output examples like input and

01:03:23.358 --> 01:03:28.719
output you can just train it like this

01:03:25.760 --> 01:03:31.280
okay input and then the shifted output

01:03:28.719 --> 01:03:33.038
uh and that will update these weights

01:03:31.280 --> 01:03:34.640
right all these weights so this is

01:03:33.039 --> 01:03:37.200
basically fine- tuning exactly like we

01:03:34.639 --> 01:03:39.598
saw with BERT and so on and and even

01:03:37.199 --> 01:03:42.318
with ResNet it's the same sort of thing
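The "input and shifted output" training setup described here can be made concrete: given a tokenized example, the target sequence is the input shifted left by one, so each position is trained to predict the next token. The token ids below are made up for illustration.

```python
def next_token_pair(token_ids: list) -> tuple:
    # Inputs: tokens 0..n-2; targets: the same sequence shifted left by one,
    # so position t is trained to predict token t+1.
    return token_ids[:-1], token_ids[1:]

x, y = next_token_pair([101, 7592, 2088, 102])
print(x, y)  # → [101, 7592, 2088] [7592, 2088, 102]
```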

01:03:39.599 --> 01:03:43.838
okay that is fine-tuning now before we

01:03:42.318 --> 01:03:45.759
discuss the mechanics of how to do it, I want

01:03:43.838 --> 01:03:48.639
to show you a quick example of

01:03:45.760 --> 01:03:50.480
the usefulness of fine-tuning. So,

01:03:48.639 --> 01:03:53.199
imagine for a sec that we want to

01:03:50.480 --> 01:03:55.358
generate synthetic product reviews

01:03:53.199 --> 01:03:57.439
from product descriptions.

01:03:55.358 --> 01:03:59.838
So we are building some product which

01:03:57.440 --> 01:04:01.760
can simulate customer behavior in

01:03:59.838 --> 01:04:03.838
e-commerce and for that we need to be

01:04:01.760 --> 01:04:05.760
able to generate the kinds of reviews

01:04:03.838 --> 01:04:07.358
that customers might come up with right

01:04:05.760 --> 01:04:09.200
and writing a lot of reviews is very

01:04:07.358 --> 01:04:10.318
time-consuming. But what you

01:04:09.199 --> 01:04:12.639
can do is you can get a whole bunch of

01:04:10.318 --> 01:04:14.719
product descriptions right from the

01:04:12.639 --> 01:04:16.798
internet. So let's say you ask an LLM,

01:04:14.719 --> 01:04:18.318
hey write a positive product review

01:04:16.798 --> 01:04:19.759
using this information here, product

01:04:18.318 --> 01:04:24.159
description here and it comes up with

01:04:19.760 --> 01:04:26.319
this timeless, authentic, iconic, right?

01:04:24.159 --> 01:04:28.639
Seriously, do product reviewers actually

01:04:26.318 --> 01:04:31.199
write stuff like this? No. This looks

01:04:28.639 --> 01:04:33.118
like marketing copy, right? This reads

01:04:31.199 --> 01:04:34.318
like marketing copy because there's a

01:04:33.119 --> 01:04:36.798
whole bunch of marketing copy on the

01:04:34.318 --> 01:04:38.798
internet. So it's not good. It doesn't

01:04:36.798 --> 01:04:41.440
feel like a review. It's not authentic,

01:04:38.798 --> 01:04:44.318
right? Um, here's another example for

01:04:41.440 --> 01:04:46.240
Urban Outfitters, and it says, uh, the

01:04:44.318 --> 01:04:50.719
the boxy and cropped silhouette is

01:04:46.239 --> 01:04:52.959
flattering on all body types. Come on.

01:04:50.719 --> 01:04:55.519
Okay, so it's not going to work. So,

01:04:52.960 --> 01:04:57.838
what we do is we fine-tune the LLM. We

01:04:55.519 --> 01:05:00.159
can take an LLM and we can fine-tune it

01:04:57.838 --> 01:05:02.719
with instruction, product description,

01:05:00.159 --> 01:05:05.199
and product review examples.

01:05:02.719 --> 01:05:06.959
Okay, that's what we can do. So for

01:05:05.199 --> 01:05:11.719
instance we can take something like

01:05:06.960 --> 01:05:11.720
this. Uh let me zoom into this thing.

01:05:14.639 --> 01:05:19.118
So it says here write a positive review

01:05:17.199 --> 01:05:20.318
for the following product and then you

01:05:19.119 --> 01:05:22.000
can see how it works. The

01:05:20.318 --> 01:05:24.719
description is the input and the output

01:05:22.000 --> 01:05:26.880
is the best car my husband's favorite.

01:05:24.719 --> 01:05:28.558
They fit well. Right? They feel like

01:05:26.880 --> 01:05:30.240
product reviews. So you just have to get

01:05:28.559 --> 01:05:33.119
a few hundred of these product review

01:05:30.239 --> 01:05:35.279
examples. Okay just a few hundred. Um
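A fine-tuning example of this kind is typically one JSON line pairing the instruction and product description with the human-written review. The sketch below mirrors OpenAI's chat fine-tuning JSONL format; all the strings are made-up placeholders, not the lecture's actual data.

```python
import json

# One training example in OpenAI's chat fine-tuning JSONL format.
# All the strings here are made-up placeholders.
example = {
    "messages": [
        {"role": "system", "content": "You write realistic customer product reviews."},
        {"role": "user", "content": "Write a positive review for the following product: <product description>"},
        {"role": "assistant", "content": "These fit well and still look brand new after weeks of wear."},
    ]
}
line = json.dumps(example)  # one line of the .jsonl training file
print(json.loads(line)["messages"][2]["role"])  # → assistant
```

A few hundred such lines make up the whole fine-tuning file.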

01:05:33.119 --> 01:05:37.440
and you may not even need that much. And

01:05:35.280 --> 01:05:40.960
once you do that,

01:05:37.440 --> 01:05:42.318
once you do that, you basically do uh

01:05:40.960 --> 01:05:45.280
the fine-tuning like I showed

01:05:42.318 --> 01:05:46.880
earlier, you know, in instruction,

01:05:45.280 --> 01:05:48.319
input, output, and then you take that

01:05:46.880 --> 01:05:50.318
output and shift it a bit and make it

01:05:48.318 --> 01:05:51.599
the actual label, the actual output.

01:05:50.318 --> 01:05:53.279
Fine tune, fine tune, fine tune, fine

01:05:51.599 --> 01:05:55.119
tune a bunch of times, gradient descent,

01:05:53.280 --> 01:05:58.160
weights gets updated. Now you have a new

01:05:55.119 --> 01:06:00.318
LLM, an updated LLM. And when you do that

01:05:58.159 --> 01:06:02.558
now for the same things, here's what you

01:06:00.318 --> 01:06:04.558
get. Write a review. These are the best

01:06:02.559 --> 01:06:06.319
jeans I've ever owned. I am whatever

01:06:04.559 --> 01:06:07.920
some details. I've been wearing them for

01:06:06.318 --> 01:06:09.199
a few weeks. They still look brand new,

01:06:07.920 --> 01:06:11.039
right? It looks much better. Doesn't

01:06:09.199 --> 01:06:13.679
look like marketing.

01:06:11.039 --> 01:06:15.119
This is completely fake. By the way, the

01:06:13.679 --> 01:06:16.558
model came up with it after the fine-tuning.

01:06:15.119 --> 01:06:18.640
And then we say, "Write a horrible

01:06:16.559 --> 01:06:20.000
review because we want to be balanced.

01:06:18.639 --> 01:06:22.078
These are the worst jeans I've ever

01:06:20.000 --> 01:06:23.519
worn. They're too tight here and there.

01:06:22.079 --> 01:06:25.760
I'm going to return them and try a 30,

01:06:23.519 --> 01:06:27.519
but I'm not optimistic.

01:06:25.760 --> 01:06:29.119
I'm going to stick with Levi's." Phew.

01:06:27.519 --> 01:06:31.119
Okay.

01:06:29.119 --> 01:06:33.280
So, these read like real

01:06:31.119 --> 01:06:34.798
reviews. So just by taking a few hundred

01:06:33.280 --> 01:06:36.400
examples and fine-tuning it, it

01:06:34.798 --> 01:06:38.318
completely changes the behavior that

01:06:36.400 --> 01:06:40.400
you want for your particular use case.

01:06:38.318 --> 01:06:43.038
That's the key thing. So for me, the

01:06:40.400 --> 01:06:45.358
biggest sort of benefit here is that

01:06:43.039 --> 01:06:47.680
while it took billions of sentences for

01:06:45.358 --> 01:06:49.598
pre-training the original LLM and then

01:06:47.679 --> 01:06:52.399
it took tens of thousands of examples to

01:06:49.599 --> 01:06:55.119
do supervised fine-tuning and RLHF and

01:06:52.400 --> 01:06:56.960
so on and so forth, for it to

01:06:55.119 --> 01:06:59.440
make it work for your narrow business

01:06:56.960 --> 01:07:02.079
use case, you only had to spend a couple

01:06:59.440 --> 01:07:04.240
hundred examples. That's it. It's

01:07:02.079 --> 01:07:06.160
amazing. Imagine that if you had to, you

01:07:04.239 --> 01:07:07.519
know, collect like 30,000 examples to

01:07:06.159 --> 01:07:10.318
make it work. Nobody's going to do these

01:07:07.519 --> 01:07:12.639
things. It's too much work. But a couple

01:07:10.318 --> 01:07:14.079
of hundred anybody can do. That's why

01:07:12.639 --> 01:07:16.719
it's so powerful to finetune these

01:07:14.079 --> 01:07:19.280
things. Yeah.

01:07:16.719 --> 01:07:22.000
You talked about being able to um you

01:07:19.280 --> 01:07:23.359
know, in industries where you don't

01:07:22.000 --> 01:07:26.000
want to put some of this stuff on the

01:07:23.358 --> 01:07:28.000
internet, downloading the pre-trained

01:07:26.000 --> 01:07:30.400
model and being able to do this on your

01:07:28.000 --> 01:07:32.079
own. would you still need talking about

01:07:30.400 --> 01:07:35.200
compute power, some of the computers we

01:07:32.079 --> 01:07:37.359
have now GPUs I don't know how they are

01:07:35.199 --> 01:07:39.279
um are you able to do some of these very

01:07:37.358 --> 01:07:40.558
small use cases on those types of

01:07:39.280 --> 01:07:42.559
devices

01:07:40.559 --> 01:07:44.079
>> Perfect question, uh, Ike. I mean, you're

01:07:42.559 --> 01:07:46.640
going to get to that because the short

01:07:44.079 --> 01:07:47.599
answer is, it's hard. Yeah, just a few hundred

01:07:46.639 --> 01:07:50.078
examples but actually trying to

01:07:47.599 --> 01:07:52.000
fine-tune these big models on consumer

01:07:50.079 --> 01:07:53.760
grade hardware is actually not easy so

01:07:52.000 --> 01:07:56.239
you have to use certain tricks and

01:07:53.760 --> 01:07:57.760
simplifications which is the next topic

01:07:56.239 --> 01:08:00.239
uh yeah

01:07:57.760 --> 01:08:02.480
>> Is tuning always supervised, like you

01:08:00.239 --> 01:08:05.439
need those pairs or could you do it if

01:08:02.480 --> 01:08:05.920
the company has like less structured

01:08:05.440 --> 01:08:07.838
data?

01:08:05.920 --> 01:08:09.599
>> No, you can. The thing is it depends on

01:08:07.838 --> 01:08:11.679
whether you want to make it generally

01:08:09.599 --> 01:08:13.519
smart about the company's sort of

01:08:11.679 --> 01:08:14.639
business details in which case you can

01:08:13.519 --> 01:08:16.319
just take a whole bunch of text and just

01:08:14.639 --> 01:08:17.759
do next-word prediction on it. It's

01:08:16.319 --> 01:08:19.279
going to get generally smarter about

01:08:17.759 --> 01:08:20.719
things. But it doesn't mean it's going

01:08:19.279 --> 01:08:23.279
to specifically follow your instructions

01:08:20.719 --> 01:08:24.880
on your particular business problem. So

01:08:23.279 --> 01:08:27.359
if you wanted to follow instructions,

01:08:24.880 --> 01:08:29.759
you need supervision.
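The distinction above, between feeding raw company text to the next-word-prediction objective and feeding instruction/response pairs, can be sketched in a few lines. This is an illustrative Python sketch; the example texts and the prompt template are invented for illustration and are not from the lecture:

```python
# Sketch: the same causal-LM training objective, fed two kinds of data.
# The documents, the pairs, and the template below are all hypothetical.

def format_unsupervised(documents):
    """Raw company text: the model just learns next-word prediction on it."""
    return list(documents)  # train on the text as-is

def format_supervised(pairs):
    """Instruction/response pairs: teach the model to follow instructions."""
    template = "### Instruction:\n{instruction}\n### Response:\n{response}"
    return [template.format(**p) for p in pairs]

docs = ["Acme Corp's refund policy allows returns within 30 days."]
pairs = [{"instruction": "Summarize our refund policy.",
          "response": "Returns are accepted within 30 days."}]

unsup = format_unsupervised(docs)
sup = format_supervised(pairs)
print(sup[0])
```

Both variants are trained with the same next-word loss; only the data changes, which is why the unsupervised version makes the model generally smarter while the supervised version teaches it to follow your instructions.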

01:08:27.359 --> 01:08:32.960
Okay. So, all right, those were great

01:08:29.759 --> 01:08:35.039
questions. So for small LLMs like GPT-2,

01:08:32.960 --> 01:08:36.399
fine-tuning isn't difficult, to go back to

01:08:35.039 --> 01:08:38.640
your question. You can actually do this

01:08:36.399 --> 01:08:40.000
with small models. So like for example

01:08:38.640 --> 01:08:41.440
Google has released this thing

01:08:40.000 --> 01:08:42.640
called Gemma which came out recently.

01:08:41.439 --> 01:08:44.000
It's a small model like two billion

01:08:42.640 --> 01:08:46.560
parameters or something if I remember

01:08:44.000 --> 01:08:50.640
the smallest one and those things will

01:08:46.560 --> 01:08:52.319
typically fit into uh thank you. Uh

01:08:50.640 --> 01:08:54.000
those things will typically fit into

01:08:52.319 --> 01:08:56.080
like one GPU and you can fine-tune it.

01:08:54.000 --> 01:08:57.600
You still need GPUs, just to be clear. Uh,

01:08:56.079 --> 01:08:59.119
they will actually fit into one GPU.

01:08:57.600 --> 01:09:02.000
But if you want to use a larger model,

01:08:59.119 --> 01:09:03.278
it won't fit. So to make this work, you

01:09:02.000 --> 01:09:05.520
have to do other things and that's what

01:09:03.279 --> 01:09:07.120
we're going to talk about now. But

01:09:05.520 --> 01:09:10.400
there's a family of models called

01:09:07.119 --> 01:09:12.960
Llama, Llama 2. These are open-source uh

01:09:10.399 --> 01:09:14.879
LLMs and they are widely used for

01:09:12.960 --> 01:09:16.158
fine-tuning, right? Because you can just

01:09:14.880 --> 01:09:18.880
download the model and just do whatever

01:09:16.158 --> 01:09:20.639
you want with it, right? It's open. Uh, I

01:09:18.880 --> 01:09:22.079
mean it's not strictly open because

01:09:20.640 --> 01:09:23.600
there are some you know footnote

01:09:22.079 --> 01:09:26.238
considerations you got to worry about

01:09:23.600 --> 01:09:29.120
but for most purposes it's open enough

01:09:26.238 --> 01:09:30.959
uh, in my opinion. And so, let's

01:09:29.119 --> 01:09:32.640
see how hard it is to build the biggest

01:09:30.960 --> 01:09:35.359
model in this family, which is the Llama

01:09:32.640 --> 01:09:37.759
2 model with 70 billion parameters okay

01:09:35.359 --> 01:09:40.719
70 billion parameters. So, first of all,

01:09:37.759 --> 01:09:42.399
the model is gigantic so 70 billion

01:09:40.719 --> 01:09:44.798
parameters each parameter is let's say

01:09:42.399 --> 01:09:48.000
we store it in two bytes per parameter

01:09:44.798 --> 01:09:50.079
right? Uh, and then for each of these

01:09:48.000 --> 01:09:52.000
parameters, actually, we will need a

01:09:50.079 --> 01:09:53.439
multiplier on each parameter to store

01:09:52.000 --> 01:09:56.238
various details about how the

01:09:53.439 --> 01:09:57.919
optimization is done. Okay, we

01:09:56.238 --> 01:09:59.678
won't get into the details here. The

01:09:57.920 --> 01:10:02.640
one thing I do want to point out is that

01:09:59.679 --> 01:10:06.239
um this 3 to four uh should really be 1

01:10:02.640 --> 01:10:08.400
to six, right? Uh, so I didn't have

01:10:06.238 --> 01:10:09.919
a chance to change it this morning but

01:10:08.399 --> 01:10:12.559
but the point is that it's going to be a

01:10:09.920 --> 01:10:14.239
huge model right so even with this

01:10:12.560 --> 01:10:15.760
number it's going to be like 48 to 560

01:10:14.238 --> 01:10:18.079
gigabytes just to hold the model in

01:10:15.760 --> 01:10:21.280
memory and manipulate it. And so if you

01:10:18.079 --> 01:10:23.760
use a GPU like an A100 GPU or an H100 GPU,

01:10:21.279 --> 01:10:25.759
which are all Nvidia GPUs,

01:10:23.760 --> 01:10:28.000
each of these things typically has 80 GB

01:10:25.760 --> 01:10:30.719
of memory. So we need between six

01:10:28.000 --> 01:10:32.319
and seven to accommodate this thing. Six

01:10:30.719 --> 01:10:34.079
to seven GPUs just to accommodate this

01:10:32.319 --> 01:10:35.840
thing. So that's the first problem. The

01:10:34.079 --> 01:10:37.760
model is big just to hold it and work

01:10:35.840 --> 01:10:40.239
with it. You need lots of GPUs. The

01:10:37.760 --> 01:10:43.360
second problem, Llama 2 was trained on

01:10:40.238 --> 01:10:46.879
two trillion tokens of text.

01:10:43.359 --> 01:10:49.439
Two trillion tokens of text. So these

01:10:46.880 --> 01:10:51.760
GPUs can process about 400 tokens per

01:10:49.439 --> 01:10:54.719
GPU per second. By process, I mean the

01:10:51.760 --> 01:10:57.039
forward pass through the network. Okay?

01:10:54.719 --> 01:10:58.079
And so if you actually use seven GPUs

01:10:57.039 --> 01:11:01.279
with all this thing, it's going to take

01:10:58.079 --> 01:11:03.439
you 8,000 days, right? Let's say we want

01:11:01.279 --> 01:11:08.479
to do it in about a month, you need

01:11:03.439 --> 01:11:10.799
roughly 2,000 GPUs at a cost of about $2.50

01:11:08.479 --> 01:11:12.399
per GPU per hour. This will cost you 4

01:11:10.800 --> 01:11:14.239
million.
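The arithmetic behind this estimate can be checked in a few lines, using the lecture's figures (2 trillion tokens, about 400 tokens per second per GPU, 7 GPUs) and an assumed rate of roughly $2.50 per GPU-hour, which is my gloss on the garbled audio, not an exact quote:

```python
# Back-of-the-envelope pre-training cost, using the lecture's figures.
# The $2.50/GPU-hour rate is an assumption for illustration.
tokens = 2e12
tokens_per_gpu_per_sec = 400
gpu_seconds = tokens / tokens_per_gpu_per_sec
gpu_hours = gpu_seconds / 3600               # ~1.4M GPU-hours
days_on_7_gpus = gpu_seconds / (7 * 86400)   # ~8,300 days on 7 GPUs
gpus_for_one_month = gpu_hours / (30 * 24)   # ~1,900 GPUs for a 30-day run
cost = gpu_hours * 2.50                      # ~$3.5M, before any reruns
print(round(days_on_7_gpus), round(gpus_for_one_month), round(cost))
```

These come out near the figures quoted in the lecture (about 8,000 days on 7 GPUs, roughly 2,000 GPUs for a one-month run, and on the order of $4 million once you allow for reruns and overhead).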

01:11:12.399 --> 01:11:15.359
Okay? And we'd expect the actual cost to

01:11:14.238 --> 01:11:16.718
be a lot higher than this because it's

01:11:15.359 --> 01:11:17.839
very optimistic. It assumes you just do

01:11:16.719 --> 01:11:19.679
one pass through it, you're all done,

01:11:17.840 --> 01:11:20.640
right? In general, you know,

01:11:19.679 --> 01:11:21.920
you'll make some mistakes. You have to

01:11:20.640 --> 01:11:23.440
do it a bunch of times and so on and so

01:11:21.920 --> 01:11:25.920
forth. So this is an overly optimistic

01:11:23.439 --> 01:11:27.439
estimate, and that is $4 million. So you

01:11:25.920 --> 01:11:29.679
need lots of GPUs and you need to spend

01:11:27.439 --> 01:11:32.000
a lot of money for it. Now what can we

01:11:29.679 --> 01:11:34.000
do with fewer resources?
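To ground the resource problem, the naive memory budget sketched earlier (weights, plus gradients, plus optimizer state) can be computed directly. This assumes fp16 (2-byte) weights and a 2x optimizer-state multiplier, which the lecture notes could really be anywhere from about 1x to 6x:

```python
# Naive memory budget for full fine-tuning of a 70B-parameter model,
# following the lecture's numbers.
params = 70e9
weights_gb = params * 2 / 1e9        # 140 GB of fp16 weights
grads_gb = params * 2 / 1e9          # 140 GB of gradients
optimizer_gb = 2 * weights_gb        # 280 GB, assuming a 2x multiplier
total_gb = weights_gb + grads_gb + optimizer_gb
gpus_needed = total_gb / 80          # A100/H100 GPUs have 80 GB each
print(total_gb, gpus_needed)         # 560.0 7.0
```

That 560 GB total and the seven 80 GB GPUs match the figures discussed above, and they are what the tricks that follow (gradient checkpointing, then LoRA) attack.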

01:11:32.000 --> 01:11:35.760
First of all, you need to reduce the

01:11:34.000 --> 01:11:36.880
size of the data set. The second thing

01:11:35.760 --> 01:11:38.960
is you want to reduce the memory

01:11:36.880 --> 01:11:41.199
required. So we can ideally do it on

01:11:38.960 --> 01:11:45.600
many fewer GPUs, hopefully even one GPU

01:11:41.198 --> 01:11:47.119
literally on Colab. Okay. And so now we

01:11:45.600 --> 01:11:49.360
have good news on the data front because

01:11:47.119 --> 01:11:51.519
as I mentioned earlier, while it takes a

01:11:49.359 --> 01:11:53.599
lot of data to build these models, to

01:11:51.520 --> 01:11:55.440
fine-tune them for your specific

01:11:53.600 --> 01:11:57.520
use case, you may just need a few

01:11:55.439 --> 01:11:59.839
hundred examples. Okay, it's no problem

01:11:57.520 --> 01:12:01.440
at all. So the data for fine-tuning is

01:11:59.840 --> 01:12:02.800
actually not a problem. Only for

01:12:01.439 --> 01:12:05.359
building it in the first place, it's a

01:12:02.800 --> 01:12:07.360
problem. So in fact, there's this famous

01:12:05.359 --> 01:12:11.119
Alpaca fine-tuning data set. It is 50,000

01:12:07.359 --> 01:12:13.039
instruction-response pairs, and so that's

01:12:11.119 --> 01:12:14.559
way less than the two trillion tokens

01:12:13.039 --> 01:12:17.920
and that can actually be done in about

01:12:14.560 --> 01:12:19.520
20 hours. You can fine-tune a 50,000

01:12:17.920 --> 01:12:21.760
example fine-tuning data set; you can

01:12:19.520 --> 01:12:23.280
fine-tune in just 20 hours. Okay,

01:12:21.760 --> 01:12:26.000
Tomaso,

01:12:23.279 --> 01:12:28.800
>> Could Microsoft's one-bit model

01:12:26.000 --> 01:12:30.640
drastically reduce the amount of compute?

01:12:28.800 --> 01:12:32.719
>> Yeah, there's a whole bunch of

01:12:30.640 --> 01:12:35.199
approximations and simplifications to

01:12:32.719 --> 01:12:37.198
make all these things fit uh into

01:12:35.198 --> 01:12:39.759
smaller GPUs and so on and so forth and

01:12:37.198 --> 01:12:40.879
that's one of them. So, so the short

01:12:39.760 --> 01:12:42.640
answer is yeah, there are many

01:12:40.880 --> 01:12:44.000
possibilities uh and we have to very

01:12:42.640 --> 01:12:45.760
carefully look at them because every one

01:12:44.000 --> 01:12:47.359
of these simplifications, it'll

01:12:45.760 --> 01:12:49.280
cost you something in terms of accuracy

01:12:47.359 --> 01:12:50.639
and the ability of the model to do what

01:12:49.279 --> 01:12:52.719
it needs to do. So there's always a

01:12:50.640 --> 01:12:54.239
trade-off you have to worry about. So

01:12:52.719 --> 01:12:55.119
for folks who are interested,

01:12:54.238 --> 01:12:57.839
there's this whole field called

01:12:55.119 --> 01:12:59.439
quantization, LLM quantization. Google it,

01:12:57.840 --> 01:13:02.719
and that gives you an entry point

01:12:59.439 --> 01:13:04.079
into that whole area. Okay. So now how

01:13:02.719 --> 01:13:06.158
do we reduce the memory required so that

01:13:04.079 --> 01:13:08.800
we can process the data using fewer GPUs

01:13:06.158 --> 01:13:10.079
ideally just one GPU on collab. So if

01:13:08.800 --> 01:13:12.079
you look at what actually consumes

01:13:10.079 --> 01:13:14.000
memory, you have all these model

01:13:12.079 --> 01:13:16.158
parameters. Let's say you know 70

01:13:14.000 --> 01:13:18.800
billion parameters times two bytes each

01:13:16.158 --> 01:13:20.639
140 GB. Gradient computations are another

01:13:18.800 --> 01:13:22.719
140 to hold the gradient and then the

01:13:20.640 --> 01:13:24.400
optimizer state is 2x. And as I

01:13:22.719 --> 01:13:27.520
mentioned earlier it could be between

01:13:24.399 --> 01:13:28.799
you know 1 to 6x as opposed to 3 to 4x

01:13:27.520 --> 01:13:30.880
but we'll just go with these numbers for

01:13:28.800 --> 01:13:33.440
the moment. And so the total is 560

01:13:30.880 --> 01:13:36.000
gigabytes right if you just naively want

01:13:33.439 --> 01:13:38.639
to use it. So turns out you can't do

01:13:36.000 --> 01:13:40.479
anything about that; it is just the 140. But

01:13:38.640 --> 01:13:42.000
by using a trick called gradient

01:13:40.479 --> 01:13:44.879
checkpointing this whole thing can

01:13:42.000 --> 01:13:46.800
actually be squashed close to zero

01:13:44.880 --> 01:13:48.239
basically you say hey I don't mind it

01:13:46.800 --> 01:13:50.560
running longer but I don't want to use

01:13:48.238 --> 01:13:52.079
as much memory and that trick is called

01:13:50.560 --> 01:13:54.560
gradient checkpointing we won't go into

01:13:52.079 --> 01:13:56.559
technical details that can go to zero

01:13:54.560 --> 01:13:58.640
but then this thing here the optimizer

01:13:56.560 --> 01:14:00.719
state turns out even this can be

01:13:58.640 --> 01:14:02.800
squashed very close to zero and that's

01:14:00.719 --> 01:14:06.319
actually a breakthrough from, you

01:14:02.800 --> 01:14:07.600
know, maybe a year ago. And so, to do

01:14:06.319 --> 01:14:09.439
that. What we're going to do is to say,

01:14:07.600 --> 01:14:11.120
look, you know what? Uh there are a

01:14:09.439 --> 01:14:13.599
whole bunch of weights here, but we're

01:14:11.119 --> 01:14:15.599
only going to take those matrices

01:14:13.600 --> 01:14:17.199
inside each attention layer, and we're

01:14:15.600 --> 01:14:19.840
going to only look at those matrices.

01:14:17.198 --> 01:14:22.399
We're going to freeze everything else.

01:14:19.840 --> 01:14:24.880
So, we're going to take only a small set

01:14:22.399 --> 01:14:26.319
of parameters, unfreeze them, and update

01:14:24.880 --> 01:14:27.760
them and see if it's any good, if it

01:14:26.319 --> 01:14:29.519
actually gets the job done. Instead of

01:14:27.760 --> 01:14:31.520
unfreezing everything and updating them,

01:14:29.520 --> 01:14:33.840
right? And so if you look at the weight

01:14:31.520 --> 01:14:36.719
matrix, let's say the key AK weight

01:14:33.840 --> 01:14:38.960
matrix, uh, in Llama 2. This is a

01:14:36.719 --> 01:14:40.399
roughly 8,000 by 8,000 matrix, which

01:14:38.960 --> 01:14:41.600
means that there are 64 million

01:14:40.399 --> 01:14:45.839
parameters inside each of these

01:14:41.600 --> 01:14:48.560
matrices. 64 million. Okay. So

01:14:45.840 --> 01:14:50.719
if you imagine this matrix AK here and

01:14:48.560 --> 01:14:52.480
let's say you thought experiment, you do

01:14:50.719 --> 01:14:54.239
the finetuning and the numbers have

01:14:52.479 --> 01:14:56.799
changed, right? as a result of

01:14:54.238 --> 01:14:58.399
finetuning then you can imagine that the

01:14:56.800 --> 01:15:01.600
resulting matrix is just the original

01:14:58.399 --> 01:15:04.079
matrix you had plus just the changes

01:15:01.600 --> 01:15:07.039
right the original plus the changes and

01:15:04.079 --> 01:15:08.960
we call the changes delta A_K, and of

01:15:07.039 --> 01:15:10.880
course, in general, this change is

01:15:08.960 --> 01:15:13.119
also going to be a 64-million-parameter matrix,

01:15:10.880 --> 01:15:15.760
right 8,000 by 8,000 so the question is

01:15:13.119 --> 01:15:18.079
can we make this change matrix smaller

01:15:15.760 --> 01:15:20.239
and to make it smaller it seems

01:15:18.079 --> 01:15:22.319
reasonable because a fine tune will only

01:15:20.238 --> 01:15:23.839
make small changes to just a few weights

01:15:22.319 --> 01:15:25.198
it's not going to change

01:15:23.840 --> 01:15:26.640
By definition, a couple hundred

01:15:25.198 --> 01:15:27.678
examples, you do some finetuning,

01:15:26.640 --> 01:15:29.920
hopefully a few weights are going to

01:15:27.679 --> 01:15:32.239
change and maybe they won't change a

01:15:29.920 --> 01:15:33.920
whole lot, right? So the key insight

01:15:32.238 --> 01:15:36.079
here is that maybe we can force this

01:15:33.920 --> 01:15:38.640
change matrix to be kind of simple and

01:15:36.079 --> 01:15:40.640
get the job done, right? And it turns

01:15:38.640 --> 01:15:42.640
out you can. And what you do is you can

01:15:40.640 --> 01:15:46.880
think of this matrix as really coming

01:15:42.640 --> 01:15:48.480
from two thin skinny matrices which if

01:15:46.880 --> 01:15:51.119
you multiply them gets you the original

01:15:48.479 --> 01:15:52.559
matrix, right? And I'm not going to get

01:15:51.119 --> 01:15:55.198
into the mathematical details here. This

01:15:52.560 --> 01:15:57.280
is called a low rank approximation. Uh

01:15:55.198 --> 01:16:00.238
but the point here is that you can take

01:15:57.279 --> 01:16:01.599
two very small matrices and if you

01:16:00.238 --> 01:16:02.639
multiply them the right way, you

01:16:01.600 --> 01:16:04.400
actually can recover the original

01:16:02.640 --> 01:16:06.800
matrix, right? You can approximate the

01:16:04.399 --> 01:16:08.960
original matrix. And this matrix, as it

01:16:06.800 --> 01:16:11.679
turns out, these two matrices are much

01:16:08.960 --> 01:16:15.760
smaller, because each one is just 8,192;

01:16:11.679 --> 01:16:19.359
two of them is about 16,000, right? And so this thing has

01:16:15.760 --> 01:16:23.360
just 16,384 parameters, which is 0.02%

01:16:19.359 --> 01:16:23.359
of the original 64 million.
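The parameter-count claim can be verified directly. This assumes an 8,192 x 8,192 weight matrix and rank-1 skinny factors, which matches the rough "8,000 by 8,000" figures above:

```python
# Parameter count for one attention weight matrix vs. its rank-1 LoRA update.
d = 8192
full = d * d            # parameters in the full matrix (~64-67 million)
rank = 1
lora = 2 * d * rank     # parameters in the two skinny factors
fraction = lora / full
print(full, lora, round(fraction * 100, 3))  # 67108864 16384 0.024
```

So the update costs about 16,000 trainable numbers instead of about 64 million, roughly 0.02% of the original, which is where the memory savings for gradients and optimizer state come from.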

01:16:23.439 --> 01:16:27.599
So this thing is called low rank

01:16:25.039 --> 01:16:30.238
adaptation, or LoRA, and it's incredibly

01:16:27.600 --> 01:16:31.840
widely used in the industry. Uh, and so

01:16:30.238 --> 01:16:34.079
what we do is we freeze all the

01:16:31.840 --> 01:16:36.079
parameters. We initialize all of

01:16:34.079 --> 01:16:38.319
these change matrices to zero and then

01:16:36.079 --> 01:16:40.960
we update just those two skinny

01:16:38.319 --> 01:16:43.759
matrices; right here, we update only

01:16:40.960 --> 01:16:45.198
those matrices using gradient descent.

01:16:43.760 --> 01:16:47.119
And when you do that everything will fit

01:16:45.198 --> 01:16:48.319
into memory. So which means that the

01:16:47.119 --> 01:16:50.079
whole thing will fit in and you can just

01:16:48.319 --> 01:16:52.158
use like two GPUs and get the job done.
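A minimal sketch of the LoRA update itself, in plain Python with toy sizes. This is illustrative, not the lecture's code: the pre-trained weight stays frozen, and only the two skinny factors are trained.

```python
# Minimal LoRA sketch: the frozen weight W plus a low-rank delta B @ A.
# Sizes here are toy; the lecture's Llama 2 matrices are ~8,192 x 8,192.

def matmul(X, Y):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

d, r = 4, 1
W = [[float(i == j) for j in range(d)] for i in range(d)]  # frozen weight
A = [[0.1] * d for _ in range(r)]   # r x d factor, small init
B = [[0.0] * r for _ in range(d)]   # d x r factor, zero init

delta = matmul(B, A)                # all zeros before any training step
assert add(W, delta) == W           # model is unchanged at initialization

# A gradient step would update only A and B (2*d*r numbers), never the
# d*d frozen weights: that is where the optimizer-state savings come from.
trainable = 2 * d * r
frozen = d * d
print(trainable, "trainable vs", frozen, "frozen")
```

Because the delta starts at zero, training begins exactly at the pre-trained model, and only the tiny factors need gradients and optimizer state; that is why the whole fine-tune fits in so much less memory.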

01:16:50.079 --> 01:16:55.039
And if you actually use Llama's

01:16:52.158 --> 01:16:56.719
smaller models, like 7 billion or 13 billion,

01:16:55.039 --> 01:17:00.158
it can be fine-tuned comfortably on a

01:16:56.719 --> 01:17:03.439
single GPU, a single Colab GPU. So,

01:17:00.158 --> 01:17:05.759
all right, uh, it's 9:54; time does not permit, so

01:17:03.439 --> 01:17:07.519
I have a Colab notebook on how to

01:17:05.760 --> 01:17:09.600
do the fine-tuning, uh, using this

01:17:07.520 --> 01:17:12.400
technique. I will do like a video walk

01:17:09.600 --> 01:17:14.159
through um tomorrow or day after and I'm

01:17:12.399 --> 01:17:16.158
done. Thanks folks. Have a good rest of

01:17:14.158 --> 01:17:19.399
your week. [applause]

01:17:16.158 --> 01:17:19.399
Thank you.
