10: Generative AI – Adapting LLMs with Parameter-Efficient Fine-Tuning — Full Transcript

0:57

when we work with these kinds of causal

0:59

models actually uh when the contextual

1:01

embeddings come out you don't actually

1:03

have to use ReLU activations here you

1:05

can literally just run it through just a

1:07

single dense layer with linear

1:09

activations and then pass it into a

1:11

softmax and boom you're done okay so

1:13

that's how GPD3 and all these models are

1:15

trained u and the other thing I want to

1:18

point out which may not have clear is

1:21

that what what is coming out of these

1:23

this dense layer right this vector is as

1:27

long as your vocabulary

1:29

because only then when it goes into the

1:31

soft max you're going to get

1:33

probabilities which are as long as your

1:35

vocabulary which means that you get to

1:36

pick one word or token out of that

1:39

entire 50,000 long vocabulary

1:42

okay so so just I just want to point

1:45

that out because I think it's easy for

1:47

us to sort of get a little confused

1:49

because of this little difference

1:50

between the way uh masked language

1:53

models like BERT work and causal

1:55

language models like GPD3.

1:58

Okay, so now let's continue with we have

2:02

we know how to build GPD3. So like what

2:05

about GPD and GPD2 like what's up to

2:07

them? Why is GPD3 so famous and not

2:10

GPD2? Right? So turns out well first of

2:13

all you folks know that GPD stands for

2:15

generative pre-trained transformer. Now

2:17

like GPD3

2:19

two GPD2 and GPD1 were trained in

2:22

basically the same fashion. Predict the

2:23

next word uh same fashion the same sort

2:26

of transformer stack except that GPT3

2:29

was trained on much more data because

2:31

the underlying transformer stack had

2:33

many more layers. Okay, so it is a much

2:36

bigger stack meaning lots more

2:39

parameters and therefore you need lots

2:41

more data to train it well. Okay, so

2:44

that was really the only difference. The

2:47

difference was literally one of scale,

2:49

scale of network and scale of data. And

2:53

unlike GPT and GPD2, GPD3 even though it

2:57

was trained basically the same way with

2:59

the same kind of network, it was one of

3:01

the situations where more became

3:04

different. Okay, there was almost like

3:06

some sort of phase change that happened

3:07

between two and three. Unlike GPD and

3:10

GPD2, GPD3 could do amazingly coherent

3:14

continuations of any starting prompt,

3:16

right? Um so for example, if you have

3:19

this little prompt which says the

3:21

importance of being on Twitter by Jerome

3:22

K Jerome who was a famous humorist and

3:24

then you give it this prompt, right?

3:26

Ending with the word it, it produces

3:28

this continuation which is really like

3:30

strikingly good. And if any of you have

3:33

read Jerome K Jerome and if you read

3:35

this thing, you'll be like, "Wow, that

3:36

actually sounds like Jerome K Jerome."

3:38

Right? So amazing continuations the the

3:41

but the interesting thing here is not so

3:43

much the continuation it's the fact that

3:45

the same prompt you give it a two or GPT

3:47

it won't do any it won't be very good in

3:49

fact after the first one two or three

3:51

sentences it'll sort of become sort of

3:52

incoherent and meander and start

3:54

rambling this thing can keep faking it

3:57

for a long longer time right that's the

3:59

amazing thing that was unexpected re

4:02

researchers did not expect this okay and

4:05

but it wasn't good at following your

4:07

instructions

4:09

So for instance, if you ask it, help me

4:10

write a short note, introduce myself to

4:12

my neighbor. This is the kind of thing

4:14

it'll come up with. And you can actually

4:15

run it yourself. You can actually go to

4:17

GPD3 on the playground. I think GPD3 is

4:20

still available in the playground. U if

4:21

it is, you can actually start try

4:23

running these prompts. You will start

4:25

getting garbage very quickly, right? And

4:28

the reason so for example here, help me

4:29

write a short note. It says, what's a

4:31

good introduction to a resume? Rumé for

4:33

some reason has glombmed down to resume.

4:35

I have no idea why. Right? But the

4:38

reason it's doing stuff like this is

4:39

because a lot of the training data it

4:42

was trained on are basically lots of

4:44

lists of things.

4:46

So when you say for example um you know

4:49

the the the capital of Paris continue

4:52

it'll come back with the capital sorry

4:53

the capital of France continue it say

4:55

the capital of France is Paris the

4:56

capital of you know uh Hungary is

4:58

Budapest and so on. It just start coming

4:59

up with a list. So it's sort of very

5:02

list driven right? it thinks that you

5:04

you need to complete some sort of list,

5:06

right? That's what's going on here. And

5:07

so it's not very good. So it doesn't

5:09

realize that you're actually asking it

5:10

to do something specific.

5:12

So this is the problem when you have an

5:14

autocomplete thing which doesn't realize

5:17

what you're asking it. It just thinks

5:18

that you're it's just an autocomplete.

5:20

So um now in addition to these unhelpful

5:24

answers, it can also produce offensive

5:25

answers, factually incorrect answers and

5:27

so on and so forth. The list of bad

5:28

things it can do is long. So why does it

5:32

do that? Why does it produce unhelpful

5:33

answers? Well, you know, as you recall,

5:35

it was only trained to predict the next

5:37

word. It wasn't explicitly trained to

5:39

follow instructions, right? So, it

5:41

seems, you know, reasonable that if it's

5:44

simply trying to guess the next word

5:46

repeatedly, it can't really do anything

5:48

more. Like, how can it figure out that

5:50

there's an instruction that it needs to

5:52

follow, right? Unless the training data

5:54

on the net was all instructional, which

5:57

it clearly is not.

5:59

So light bulb idea, right? Let's

6:02

explicitly train it with instruction

6:04

data,

6:06

right? Let's just train it with

6:07

instruction data. And so OpenAI

6:10

developed an approach called instruction

6:12

tuning to do exactly this. Um, and this

6:15

paper is the paper that sort of was the

6:18

breakthrough. Okay, this is what

6:20

actually put Chad on the map. So, and

6:24

it's very readable. So, I would

6:25

encourage you to check it out if you're

6:26

curious.

6:28

And so so we had GPT, GPD2, GPD3, you

6:33

know, just bigger and bigger models

6:34

trained the same way. And then we run

6:36

into the problem that it can't handle

6:37

instructions. So we do instruction

6:39

tuning to get to 3.5, also called

6:41

instruct GPT. And then a small tweak

6:43

after that gets you chat GPT. Okay. And

6:46

by the way, this step here, there are

6:48

really two things going on in this as

6:50

you will soon see. I'm just calling it

6:52

instruction tuning just to so that I

6:53

don't have to say some long thing every

6:55

single time. it this is not a consistent

6:58

piece of terminology. So just just

6:59

beware aware of that's all. So all right

7:03

first step they got a bunch of people to

7:06

write highquality answers to questions

7:09

and they created about 12,500 such

7:11

question answer pairs. So for example

7:14

let's say this was the question explain

7:15

the moon landing to a six-year-old in a

7:17

few sentences. Believe it or not, GPD3's

7:19

answer to that question was another

7:21

question

7:23

because it thinks there's a list of

7:24

questions it needs autocomplete, right?

7:27

So, it comes up with explain the theory

7:28

of gravity to a six-y old. It's like one

7:30

of those people when you ask them a

7:31

question, they ask you a question back,

7:32

right? So, what what they did is they

7:35

said, "Okay, let's create a nice answer

7:36

to this question." And here's a human

7:38

created answer. People went to the moon

7:39

in a big rocket, walked around, blah

7:41

blah blah, right? Much better answer to

7:43

that question. And so once you create

7:46

these 12,500 question answer pairs as

7:48

training data, we just trained GPD3 some

7:52

more using Xword prediction as before.

7:56

No difference. So, so here is the input

7:59

explain the moon landing blah blah blah

8:00

blah. This is the question and then we

8:02

have the answer right there. And then we

8:05

we take that answer, move it to the

8:07

right and just shift it up

8:10

so that when it finishes sentences, it

8:13

needs to predict people. And then you

8:16

give it people, it needs to predict went

8:17

and so on and so forth. Just like we saw

8:20

before, the cat sat on the mat became

8:22

the cat sat on the cat sat on the mat on

8:25

the right shifted, right? That's what

8:27

makes prediction possible and necessary.

8:30

So that's what they did. This co this is

8:31

step one. Okay, same as same as before.

8:35

And once you do that, it turns out this

8:37

step is called supervised fine-tuning.

8:39

It really helped. GPD3 once you

8:42

supervised fine-tuned it was much much

8:44

better at following instructions. But

8:45

there's a small problem with this

8:46

approach. It takes a lot of money and

8:49

effort to have humans write highquality

8:51

answers to thousands of questions,

8:53

right? It takes a lot of money. So the

8:56

question is, what can we do, right? What

8:59

is easier than writing a good answer to

9:01

a question?

9:03

Well, what? Okay. Uh, all right. Uh, how

9:07

about somebody from this side?

9:11

>> Yeah, Joseph.

9:13

>> Perhaps writing a question for an

9:15

answer.

9:16

>> Oh, that's actually a good one. Yeah.

9:17

Yeah, I like that. Um, so given an

9:19

answer, find find a question. And while

9:22

that is not what I'm going to talk about

9:23

here, that technique is actually used

9:25

very heavily in LLMs. Uh, and so but

9:27

that that's great. Very creative. Uh

9:29

Mark,

9:31

>> thumbs up. Thumbs down.

9:32

>> Sorry.

9:33

>> Thumbs up or thumbs down?

9:34

>> Thumbs up or thumbs down. Exactly.

9:36

Because all of us, everyone loves to be

9:38

a critic. It's much better easier to be

9:40

a critic than to be a creator. Right. So

9:43

what do we do? We basically say, let's

9:46

rank answers written by somebody else.

9:48

Which begs the question, who's going to

9:50

write those answers? And that's where

9:53

there's a brilliant answer to that

9:54

question which is

9:57

Wikipedia,

9:59

Reddit.

10:04

We will just ask GPT3 to write the

10:06

answers.

10:08

It might be crap, but we don't care

10:10

because we can rank them.

10:12

So we ask GPT3 to get generate several

10:15

answers to the question. And how can we

10:17

generate several answers? Because we can

10:19

do sampling.

10:21

We can do sampling.

10:23

The fact that we had these stoastic

10:25

outputs because of sampling is now a

10:27

feature, not a bug. Okay, we create lots

10:30

of different answers to the question. We

10:32

feed it a question, get like three

10:33

answers out. Just run it three times,

10:36

get three answers out with a nice

10:37

temperature of like one or 1.1 or

10:39

something so that it's nice and random,

10:41

right? Um, and then we literally have

10:43

humans just rank them, do the thumbs up,

10:45

thumbs down, just rank them from most

10:47

useful to least useful. Okay, so this

10:51

step is a step two of instruction

10:53

tuning. So OpenAI collected 33,000

10:55

instructions, fed them to GB3, generated

10:57

answers and had humans rank them. And

11:00

once you do that, once you do this, you

11:03

can assemble a beautiful training data

11:05

set, right? And so basically what we

11:07

have is that we have an instruction and

11:09

let's say we have just two answers A and

11:10

B. And in in practice they you can have

11:12

many many answers which we rank but just

11:14

for simplicity I'll go with Mark's

11:16

thumbs up thumbs down sort of answer

11:18

which is let's assume only you have two

11:19

answers to every question right and so

11:22

and the human has said I prefer this to

11:24

that that's it right so we have a data

11:26

set now where the data point is

11:28

instruction preferred answer is A the

11:31

10: Generative AI – Adapting LLMs with Parameter-Efficient Fine-Tuning

More from MIT OpenCourseWare

Trending Transcripts