1:16:46
8: Deep Learning for Natural Language – Transformers, Self-Supervised Learning
MIT OpenCourseWare
·
May 11, 2026
Transcript
0:17
Okay. Uh, all right. So, we'll continue
0:19
with transformers today. Part two. Uh,
0:21
we're going to do the second pass. Uh,
0:23
this is going to be a deeper pass
0:24
through the transformer stack. Um and I
0:27
think maybe the next 30 minutes it's
0:29
potentially the most demanding 30
0:31
minutes of the entire course. Okay, with
0:33
that motivational speech, let's get
0:35
going. Okay, so quick review. Why do we
0:38
want transformers? Because we want u we
0:41
want an architecture that can generate
0:43
output that has the same length as the
0:45
input. Same length. Oh, there it is. Uh
0:48
number two, we want to take the context
0:50
into account and we want to take the
0:51
order into account. And as you saw last
0:53
time, the transformer architecture
0:55
delivers on those three requirements.
0:57
And so uh just a quick review, if you
0:59
have a phrase like "the train left the station",
1:01
we have all these little arrows which
1:03
stand for the standalone or
1:05
uncontextual embeddings. Uh and then
1:08
sometimes this works. So I'm going to
1:09
put it close to me here.
1:12
Okay.
1:13
All right. So if we
1:16
start with either standalone
1:17
embeddings, i.e. the uncontextual
1:19
embeddings, which have been
1:20
pre-trained, or random, it doesn't really
1:22
matter. If you look at the Colab we did
1:25
the other day, we actually just start
1:27
with random weights for the embeddings
1:30
and then we add positional embeddings to
1:32
them. And so for each word here,
1:35
we take its standalone embedding, we
1:38
take its positional embedding, we just
1:39
literally add them up element by
1:41
element, then we get a total embedding,
1:43
and that's called the positional
1:45
embedding of each word. Okay. And then
1:48
that's what we have: positional input
1:49
embeddings. So this whole thing goes
1:51
into this transformer encoder stack and
1:54
what pops out the other end is
1:55
contextual embeddings. Okay. So that's
1:57
the overall flow.
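To make that flow concrete, here is a minimal NumPy sketch of the element-by-element addition just described; the shapes and values are made up for illustration and this is not the lecture's Colab code.

```python
import numpy as np

seq_len, d = 3, 4                        # toy sizes: 3 words, 4-dim embeddings
word_emb = np.random.randn(seq_len, d)   # standalone (uncontextual) embeddings
pos_emb = np.random.randn(seq_len, d)    # one positional embedding per position

# adding them element by element gives the positional input embeddings,
# which is what goes into the transformer encoder stack
x = word_emb + pos_emb
```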
2:01
Now, we applied this transformer stack
2:03
to the word-to-slot classification
2:06
problem where we basically took every
2:08
incoming natural language query that
2:10
comes in. We calculate its positional
2:12
embeddings and then we run it through
2:14
the transformer stack. uh and then we
2:16
get contextual embeddings and then at
2:18
this point, since each embedding that
2:21
comes out needs
2:22
to be classified into one of 125
2:24
possibilities, we run it through a ReLU
2:26
and then we attach a softmax
2:29
to each embedding. Right, this is
2:31
basically what we did last class.
2:33
So this is the transformer encoder.
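As a rough illustration of that head, here is a hypothetical Keras-style sketch (the sizes and names are invented, and this is not the lecture's Colab code): each contextual embedding goes through a ReLU layer and then a 125-way softmax.

```python
import tensorflow as tf
from tensorflow.keras import layers

contextual = tf.keras.Input(shape=(20, 128))         # 20 words, 128-dim embeddings
h = layers.Dense(64, activation="relu")(contextual)  # ReLU layer applied per word
slots = layers.Dense(125, activation="softmax")(h)   # 125-way distribution per word
head = tf.keras.Model(contextual, slots)             # word-to-slot classification head
```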
2:36
Okay, now, actually,
2:39
any questions on this before I continue?
2:48
>> I was wondering, how do you
2:50
decide where to add more self-attention
2:52
and where to add transformer layers? You
2:55
mentioned that GPT-3 has 96 of them.
2:58
>> Yeah. So right, GPT-3 has 96
3:03
transformer blocks. Each one is a block.
3:05
Um, so I think the question goes to do
3:07
you add more attention heads within a
3:09
single block or do you add lots of
3:11
blocks? And both are good things to do.
3:14
Um, what increasing the number of
3:16
attention heads in a block does for you,
3:18
it allows you to pick up more patterns
3:21
at that level of abstraction.
3:23
But if you add more blocks, much like
3:25
later convolutional filters can build on
3:28
earlier convolutional filters, you're
3:30
going up the levels of abstraction. So
3:32
to go to vision for instance you have
3:34
the notion of lines and so on in the
3:36
beginning and then you have a notion of
3:37
edges which are two lines then you have
3:40
you know nose eyes face and so on and so
3:42
forth. So both are worth doing. So
3:45
that's what you typically
3:46
find: people typically have, you know,
3:49
maybe five or six heads, up to
3:52
a dozen heads. We'll see examples of how
3:54
many heads in a couple of architectures
3:55
later on today. And the more you
3:58
go up, the more capable the model
4:01
becomes, as long as you have enough data
4:02
to train it well. So the perennial
4:05
question of do we have enough data to
4:07
train this large model because if you
4:09
don't have enough data we might run into
4:11
overfitting problems and so on. That's
4:12
always the trade-off.
4:14
So okay so here I just want to quickly
4:17
switch to the Colab because we didn't
4:18
get a chance to finish it. I'm not
4:20
going to run it because it's going to
4:22
take some time. So where we left off
4:24
last time.
4:27
Okay. So here we basically took this
4:31
architecture that we just saw on the
4:32
slide and then we essentially wrote it
4:34
as a Keras model and I went through this
4:36
model in the last class so I'm not going
4:37
to go through it all over again. What we
4:39
did not do last class was to actually
4:41
run it. Um and so uh so if you actually
4:44
run it right you can just run it for 10
4:47
epochs just like we normally do. Give it
4:50
data give it a bunch of epochs choose a
4:52
particular batch size. I just
4:53
arbitrarily chose 64. You run it for 10
4:55
epochs and then you evaluate it on the
4:57
test set. You get a 99% accuracy on this
5:00
problem. One transformer stack. That's
5:03
it. One block, rather. One block.
5:05
That's it. And uh of course here there's
5:08
a little trickiness going on here
5:09
because a naive model can literally say
5:12
every word that comes in is other. O.
5:15
And since the O's are the majority of
5:17
the words, it's not going to do badly,
5:19
right? It's like having a classification
5:20
problem in which one class is very
5:22
predominant. So the naive way to
5:25
actually do well is to just say every
5:26
time something comes in, oh it's that
5:27
majority class. The same thing happens.
5:30
But if you then adjust for that, it
5:32
turns out that the accuracy on the non-O
5:34
slots, which is really what you care
5:35
about, is actually 93%.
5:38
Which is actually pretty good. Okay. Uh
5:40
and then I had some examples of, you
5:42
know, lots of fun queries you can do,
5:44
including queries where I try to break
5:45
stuff like cheapest flight to fly from
5:47
MIT to Mars and see what happens, you
5:49
know, things like that. So have fun with
5:50
it. Okay. Um, all right, back to
5:53
PowerPoint.
5:59
So, this is what we had. Now, what we're
6:01
going to do in today's class, we are
6:03
actually going to take the encoder we
6:05
built last time and introduce three new
6:08
complications into it. And when we
6:10
finish introducing these three
6:11
complications, we will actually have the
6:14
actual transformer that was invented in
6:15
the 2017 paper. Okay. All right. Um, the
6:20
first tweak is the hardest tweak. So
6:21
we'll slowly work our way to it. So
6:24
the thing to remember is let's review
6:26
self attention. What is self attention?
6:28
You have a bunch of words and we further
6:30
said that for any particular word like
6:32
station we want to take its positional
6:34
embedding and then make it contextual.
6:36
And the way we do that is by taking each
6:38
word's embedding and then calculating
6:40
these dot products with all the
6:42
other words. And then since these dot
6:44
products can be positive or negative we
6:46
want to make them all positive and
6:48
normalize them so that they nicely add
6:50
up to one. So we then exponentiate them
6:52
and then divide by the total, right?
6:54
Which is basically soft max. And when
6:57
you do that, you have nice fractions
6:59
that add up to one. And then we said,
7:01
well, the contextual embedding for W6 is
7:03
just all these weights S1, S2 all the
7:07
way to S6 multiplied by the original W's
7:10
and then you get the context for W6. So
7:12
this is the basic logic we covered last
7:14
time. Now it is obviously the case that
7:19
we explained it only for one word but we
7:21
have to do the same exact operation for
7:23
every one of the other words too so that
7:25
we could calculate W5 hat, W4 hat, W3
7:28
hat and so on and so forth right so
7:30
there's a lot of computations that are
7:32
going on and they all look kind of
7:34
similar where you got to do a bunch of
7:36
dot products you got to like you know do
7:38
some soft maxing on it and stuff like
7:39
that so the natural question is is there
7:42
a way to organize it very efficiently
7:45
And the short answer is yes. In fact, if
7:46
you could not do that, there wouldn't be
7:48
any transformer revolution. Okay,
7:50
because there is that ability to package
7:52
it up into a very interesting and
7:53
efficient operation that allows you to
7:55
put the whole thing on GPUs.
7:58
Okay, so now I'm going to switch to iPad
8:02
uh and give you some iPad scribblings of
8:04
mine which were concocted last night
8:06
because I was very unhappy with the
8:08
slides that follow. So, we're going to
8:10
do iPad. Okay. All right. So if it
8:14
works, you folks are lucky. If it
8:16
doesn't work, last year's huddle class
8:17
is luckier.
8:21
So let's shift to that.
8:24
All right. So we're going to go here.
8:31
So let's assume we have a simple thing
8:32
like uh oops.
8:37
Okay, instead of you know train left the
8:40
station which is a long sentence, let's
8:41
just say you have a simple sentence like
8:42
"I love huddle." Okay, and so "I love
8:45
huddle" is what you have, and then you
8:47
have these standalone embeddings W1 W2
8:50
W3. Okay, so it comes into the self
8:53
attention layer and let's assume that
8:55
these W1's, W2, W3, they're already
8:58
positionally encoded, right? We have
9:00
already added up the position encoding,
9:02
all that stuff also. It's all behind us.
9:03
That all happens outside the
9:05
transformer. So you get it here.
9:08
Now what you do is you actually make
9:10
three copies of this thing.
9:13
Okay? And let's call this whole thing as
9:15
just X. Okay? I'm just giving it the
9:18
name X. It's a matrix of these three
9:20
vectors. And so the first copy goes up
9:23
here, the second copy goes straight, and
9:25
the third copy goes down. And don't
9:26
worry about the third copy just yet. So
9:29
if you look at the first two copies,
9:31
here is the key thing to focus on. Okay,
9:33
this whole thing here. Remember that we
9:36
want to calculate dot products between
9:37
all these vectors. And basically we want
9:40
to calculate the dot product of every
9:41
pair of vectors, every pair of words.
9:44
The whole point of self attention is
9:46
that every pair of words we figure out
9:47
how attracted or related they are.
9:49
Right? Which means that we have to
9:50
calculate all pairs of dot products. And
9:53
so what you do is you take this vector
9:55
right there, W1 W2 W3. You take this other
9:58
copy that went up. Okay? And then you
10:00
transpose it. So when you transpose it,
10:03
it all becomes nice and vertical like
10:05
that.
10:06
Right? All the vectors came in like
10:08
this. When you transpose, it becomes
10:09
vertical. And now what you do is you
10:12
take each one: you take W1 and you
10:15
multiply it by W1. Then you take W1 · W2,
10:19
W1 · W3. You calculate all those dot
10:22
products like that. And when you do that
10:23
you have these nice cells where every
10:27
pair of words their dot products have
10:29
been calculated in this grid. Okay. And
10:31
the key thing to see here and folks with
10:34
a matrix algebra background will see
10:36
this immediately. All we are doing is we
10:38
are taking this X, which is the matrix
10:40
that came in,
10:42
and then X transpose, which is the matrix that
10:44
we sent up and then brought back
10:46
down. We are basically doing a matrix
10:48
multiplication of X times X transpose. That's all
10:50
we are doing. And when we do that, we're
10:53
getting this nice grid in
10:57
which every pair of words their dot
10:59
products have been calculated for you
11:01
with one matrix multiplication. Boom.
11:03
Done. Okay. Okay, so if you have three
11:05
words, there are nine multiplications,
11:07
right? So if you have a million words,
11:11
that's a lot of multiplications, right?
11:13
One trillion multiplications on the
11:15
order of a trillion. And the reason to
11:18
say order is because you know W1 * W3 is
11:21
the same as W3 * W1. So there's some
11:23
duplication here. So you get this grid,
11:25
okay, in one shot, with one matrix
11:27
multiplication. And then, because each
11:29
of these numbers is just a dot product
11:31
which can be negative or positive, we
11:32
need to softmax it.
11:34
And so what we do is we take all these
11:36
numbers and we put it into a softmax
11:38
function where for each row it
11:40
calculates a softmax. And what do I
11:41
mean by that? It takes each number here
11:44
and does e raised to the
11:46
number. It does it for each of these
11:47
numbers and then divides by the sum of
11:49
those numbers for each row. And when you
11:51
do that okay you can think of this
11:54
operation as softmax applied to X times
11:56
X transpose, and you get this nice little table of
11:59
numbers.
12:01
This table of numbers basically says
12:02
that for the first word, W1,
12:06
you take 0.1 of the first
12:08
one, 0.7 of the second, 0.2 of the
12:11
third, and add them up. We do a weighted
12:14
average. So we have this table here.
12:17
Now the third copy shows up here.
12:20
Okay, it's right there. So we do this times
12:24
that which is just a matrix
12:25
multiplication again. And when we do
12:27
that we get the final contextual
12:29
embeddings. So this, for example, is just
12:31
0.1 · W1
12:34
plus 0.7 · W2
12:36
plus 0.2 · W3, right
12:40
there. And you can see the same logic
12:41
here as well. Okay. And you can read it
12:44
later on. I will post this thing uh to
12:46
make sure you understand exactly how it
12:47
flowed. But the larger point I want you
12:50
to focus on is that the entire self-
12:53
attention operation we just looked at
12:55
here basically is this beautiful
12:58
little compact matrix formula.
13:01
Okay: X comes in, you do X transpose, you do a
13:04
matrix multiplication, you do a softmax
13:06
on top of it, and then multiply by X
13:07
again, and boom, you're done.
13:10
So that is the magic of taking the
13:12
transformer stack and representing it
13:15
using matrix operations, because then it runs
13:17
lightning fast on GPUs.
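Here is a minimal NumPy sketch of that parameter-free self-attention, softmax(X Xᵀ) X, with made-up toy sizes:

```python
import numpy as np

def softmax_rows(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))  # subtract row max for stability
    return e / e.sum(axis=-1, keepdims=True)       # each row sums to one

X = np.random.randn(3, 4)        # three positional embeddings, 4-dim each

scores = X @ X.T                 # all pairwise dot products in one matmul (3x3 grid)
weights = softmax_rows(scores)   # per-row softmax: positive fractions, rows sum to 1
X_hat = weights @ X              # weighted averages = contextual embeddings
```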
13:20
Okay. All right.
13:22
That was the warm-up.
13:24
Now let's crank it up a notch.
13:27
So recall that in the last class I
13:31
talked about the fact that in
13:35
the self-attention operation, the W's are
13:38
coming in and we're doing all this stuff
13:39
with the W's right and then we're
13:41
getting some W hats out but there are no
13:44
parameters
13:46
there's nothing to be learned inside the
13:48
transformer self attention layer right
13:51
there are no weights, there
13:52
are no biases, there are no coefficients.
13:54
So, well, okay, what are we learning then?
13:58
Right? So what we now do is we going to
14:00
make the self attention layer tunable.
14:03
We're going to inject some weights into
14:05
it so that when we train it on an actual
14:07
system, the weights will keep
14:09
changing to adapt itself to the
14:10
particularities of whatever problem
14:12
you're working on. Right? So that takes
14:15
us to the tunable self attention layer.
14:22
Okay? Tunable self attention layer. So
14:25
this is the key thing to keep in mind.
14:28
any questions on this before I continue
14:29
with the tunability thing?
14:34
Okay.
14:37
Is this picture working out by the way?
14:39
Okay.
14:41
Uh all right.
14:44
So what we now do is we have the same
14:46
exact logic as before where we have this
14:48
thing that comes in. Okay. We have this
14:51
input that comes in, and we call it
14:53
X again, this whole matrix of
14:55
embeddings. And then, where before we just sent
14:58
three copies, instead what
15:01
we're going to do is we'll take each
15:02
copy of X and we will actually
15:04
multiply it by a matrix.
15:07
okay this matrix is called the key
15:09
matrix
15:10
Okay, and this matrix of
15:14
numbers is a set of weights that will be learned
15:16
by backprop.
15:18
so basically what we're saying is that
15:20
when this thing comes in let's see if
15:23
there's a way to transform this X into
15:25
some other set of embeddings which may
15:28
be useful for your task. We don't know
15:30
if they're going to be useful, but
15:32
surely giving it a bit more ability to
15:34
have weights which can be learned means
15:36
we're giving it more expressive power,
15:39
more modeling capacity. And whether it
15:41
actually uses the capacity will depend
15:42
on how much data you have and how well
15:44
you train it. And maybe if it's not
15:46
useful, it won't use it. What I mean
15:48
is if transforming X actually doesn't
15:50
really help at all, then this matrix A
15:52
is going to be what?
15:55
it's going to be the identity matrix
15:57
because you basically take the identity and
15:59
multiply by X, and you'll get X again. So
16:01
in the worst case maybe it just says I
16:03
have nothing to learn here but maybe
16:05
there is something you can learn. So so
16:07
that's what we do. So we multiplied by
16:09
this matrix A K and then we come up with
16:12
the same you know some embeddings
16:14
transformed embeddings and we call these
16:16
things K
16:18
okay K. Now this KQV as you will see has
16:22
its origins in the field of
16:24
information retrieval, but I personally
16:26
find that interpretation is not
16:28
super helpful because transformers are
16:30
used for lots of applications outside
16:32
information retrieval. So I'm not going
16:33
to go with that kind of interpretation.
16:35
I'm going to go with interpretation of
16:37
let's make each of these things tunable.
16:39
Okay. And tunability means we need to
16:41
give it weights. All right. So that's
16:42
what we have here. Now the second copy
16:46
we did this with the first copy. Well,
16:47
let's do the same thing with the second
16:48
copy. We'll take the second copy and
16:50
multiply it by some other matrix called
16:51
AQ.
16:53
And when we are done with that, we get
16:54
these embeddings. And we will call these
16:57
embeddings as Q.
17:00
Okay. Now, just like before, we will
17:02
take this thing here and we'll
17:05
transpose it.
17:07
So, it all becomes nice and vertical
17:08
like that. And then we'll do exactly the
17:11
same as before. We'll calculate all
17:12
these pairwise dot products in
17:14
one shot, with one matrix multiplication.
17:16
And because we are calling this Q and we
17:20
are calling this whole thing K, this
17:22
thing just becomes Q · Kᵀ.
17:26
Okay. At the end of it you come up with
17:29
a grid of numbers just like before.
17:31
Okay. And these numbers could be
17:33
negative or positive. So we need to do
17:35
the softmax on them to make sure they
17:36
are well behaved fractions that add up
17:38
to one. So we take this QKᵀ business
17:42
and we just put it
17:44
through a softmax function for each row,
17:48
and when we do that we'll get
17:50
basically a table like the
17:52
ones we saw before. By the way, the
17:54
numbers here are the same just because I
17:55
duplicated them, because I'm lazy; in
17:57
reality, given it has gone through all
17:59
these transformations, the numbers are
18:00
not going to be the same, right? You
18:03
have these numbers and then you take the
18:05
final copy, which is X · AV. Right? Each
18:08
copy is getting multiplied by its own
18:10
matrix. Right? And this copy is being
18:11
multiplied by AV. And let's call this X
18:14
A. Okay? Which is here as just V.
18:19
And so what you have here, this
18:21
softmax(QKᵀ) · V, is exactly the same kind of
18:24
matrix multiplication as we saw before.
18:26
So we have these
18:28
contextual embeddings and that's what's
18:30
coming out of the of the transformer
18:32
block. So now the whole thing we did
18:34
here can be represented
18:36
as softmax(QKᵀ) · V. Okay. So if we
18:42
zoom in a bit. Come on. Okay.
18:47
Okay.
18:49
So X came in.
18:52
Three tracks went here:
18:55
X · AK, X · AQ, X · AV. And this thing
18:59
is called K. This thing is called Q.
19:01
This thing is called V. And then we do
19:03
the same transpose as before. We do the
19:06
dot-product thing to calculate the
19:08
pairwise dot products for everything,
19:09
which is just QKᵀ. We run it through a
19:12
softmax. We get softmax(QKᵀ). We
19:15
multiply it by V to do the final
19:16
weighting, and then boom, the output comes out,
19:18
and that's this function. That's it.
19:22
Okay. So what we have done is we have
19:24
introduced three learnable
19:27
matrices into the self-attention layer.
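Here is the same sketch with the three learnable matrices injected (again a toy NumPy illustration with invented sizes, not the lecture's code):

```python
import numpy as np

def softmax_rows(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d = 4
X = np.random.randn(3, d)             # positional input embeddings, one row per word

# three independent learnable matrices: random at the start, updated by backprop
A_K = np.random.randn(d, d)
A_Q = np.random.randn(d, d)
A_V = np.random.randn(d, d)

K, Q, V = X @ A_K, X @ A_Q, X @ A_V   # the three transformed copies
X_hat = softmax_rows(Q @ K.T) @ V     # softmax(Q Kᵀ) V: tunable contextual embeddings
```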
19:31
Okay. Now,
19:34
okay. Let me just stop there for a sec.
19:35
Questions.
19:37
Yeah.
19:39
[clears throat]
19:39
>> Is there a relationship between AK, AQ,
19:43
and AV?
19:44
>> Independent, independent matrices.
19:47
>> Yes.
19:48
>> Like we have
19:49
>> could you use the microphone please?
19:50
>> Here we have three sets of parameters, K,
19:52
Q and V. If, let's say,
19:55
the total length was, let's
19:58
say,
19:59
50. So you would have 50 for a
20:02
set of parameters like you'll have to
20:04
>> so if you have a 50 if the dimension is
20:07
50 long what is coming in the W's are 50
20:10
long then the key the what comes out of
20:13
it if you want it to be 50 as well so
20:15
this matrix needs to be 50 * 50 2500
20:22
>> U Luna
20:24
>> what are the different things the three
20:27
the three matrices are trying to
20:30
Sorry,
20:30
>> what are the different things that the
20:32
matrices are trying to learn?
20:33
>> We don't know. All we are saying is that
20:35
we have a self attention layer which can
20:37
pay attention to every pair of words.
20:38
But we need to give it some ways to
20:40
transform what is coming in into
20:43
potentially useful things. Right? As to
20:45
their actual usefulness, we'll have to
20:48
figure out if if it actually helps or
20:49
not. And of course, as you know, the
20:51
punch line is that yeah, it helps
20:52
massively. That's why we do it. In
20:54
general, what you will find in the deep
20:55
learning literature is that whenever you
20:57
want to increase the capacity, the
20:58
modeling capacity of a particular model,
21:01
you just take a small piece and inject a
21:03
little matrix multiplication into it.
21:05
You take a vector that's showing up in
21:07
the middle and then you make it run
21:08
through a matrix to get another vector
21:10
and then further after you run it
21:13
through a matrix, you run it through a
21:14
little ReLU as well. Even better. So
21:17
that's how you inject modeling capacity
21:19
into the middle of these networks. Okay?
21:22
And that's what these people are doing
21:23
here. Yeah.
21:26
>> In the last step, you had the matrix V.
21:29
So on the previous example, you had used
21:31
the original matrix X. So could you just
21:33
say why it is not using X? What does
21:35
that mean?
21:36
>> So what we're saying is that in the
21:38
initial version we had three copies and
21:40
we treated them all identically. Now we
21:42
said, well, are there ways to
21:44
transform each copy into some other
21:45
representation which could be useful. So
21:47
we may as well use three different
21:48
matrices for it. Why stop with two?
21:51
There are three opportunities to make
21:52
them more expressive. We'll use all of
21:54
them.
21:56
>> Yeah.
21:59
>> You mentioned that these are kind of
22:02
you're tuning it. You're kind of
22:03
fine-tuning it. Is there any risk?
22:05
>> We're not fine-tuning it. Uh just to be
22:06
clear on the vocabulary here. So
22:09
we have added more weights to make them
22:10
tunable. What that means is that when
22:12
we finally train this entire model,
22:16
remember all the weights are going to be
22:17
updated using back propagation, right?
22:20
In particular, these matrices will also
22:21
get updated using back propagation.
22:23
>> So there's no risk of is there a risk of
22:26
>> there's always the risk of overfitting
22:27
when you add more parameters to a model
22:29
>> which means that you have to look at the
22:31
validation set and all that good stuff.
22:34
We are basically adding more parameters
22:36
in a very interesting way because we
22:39
want to add more capacity to the self
22:40
attention layer. We want to give it a
22:41
more of an ability to learn things from
22:43
the data. Before it could not learn
22:45
anything. It could only do dot products.
22:48
So we want to solve that problem.
22:51
All right, I'm going to continue and
22:52
we'll come back to this. Okay. Um
22:57
So, all right, just for
22:59
fun, I'm going to do this. The
23:01
original paper is called "Attention Is
23:03
All You Need." This is the transformer
23:05
paper.
23:07
You folks should read it at some point.
23:11
Just want to show you something.
23:14
Uh
23:20
You see that? So that is the famous
23:22
transformer formula. Okay. And the only
23:26
thing we ignored is this root of DK
23:29
business in the denominator. I
23:31
wouldn't worry about it. The reason they
23:33
have it is because these soft maxes when
23:35
you have lots of numbers and some
23:37
numbers really really big what's going
23:39
to happen is that all the other numbers
23:41
are going to get squashed to zero. Okay.
23:43
And so to make sure the gradient flows
23:45
properly, they just divide it by a
23:47
particular number to make sure no number
23:49
is too big. Okay, that's an
23:51
important but small bit of a
23:53
technical detail, which is why I ignored
23:54
it in my iPad. But the rest of it, you
23:57
can see, is exactly the formula we
23:59
derived: softmax(QKᵀ) · V.
24:03
Okay, so this is the famous transformer
24:05
formula
24:08
and congratulations now you understand
24:10
it.
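In code, the only change the paper's formula adds to the sketch above is the division by the square root of dk before the softmax (again a toy illustration):

```python
import numpy as np

def softmax_rows(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_k = 4
Q, K, V = (np.random.randn(3, d_k) for _ in range(3))

# dividing by sqrt(d_k) keeps any single dot product from getting so big
# that the softmax squashes all the other weights to zero
X_hat = softmax_rows(Q @ K.T / np.sqrt(d_k)) @ V
```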
24:11
You seem less than fully convinced.
24:14
Okay.
24:17
Yes. Hi iPad.
24:19
Now I have a bunch of slides which I had
24:21
but actually I'll come back to this. I
24:24
had a bunch of other slides. This is
24:25
from last year uh which actually
24:27
explains what I did in the iPad in a
24:28
very different way without using any
24:30
matrices and so on. I was looking at it
24:32
last evening and I was getting very
24:34
annoyed by these slides for some reason
24:36
because I felt that it wasn't really
24:38
conveying the core matrix sort of the
24:40
matrix uh the ability of using matrix
24:43
algebra to to actually do this so
24:45
efficiently and compactly which is why I
24:47
decided to hand-draw this thing on
24:49
the iPad. Okay, but you should read it
24:51
afterwards to make sure that whatever
24:53
you saw on the iPad actually matches
24:55
this. Okay, because two different ways
24:56
of understanding something always helps.
24:58
Um, okay, so this is what we have here now.
25:02
Just to recall:
25:05
by making self-attention tunable, we
25:07
get a very interesting benefit, which is
25:08
that when you have these different
25:10
attention heads, before,
25:13
you could have two attention heads, but
25:14
because there were no parameters inside,
25:16
their outputs would have been identical;
25:19
the inputs are the same for both,
25:21
therefore the outputs would be identical.
25:23
But now, since each attention head
25:25
will have its own A
25:28
matrices,
25:29
the outputs are going to be different.
25:32
That's why it makes sense to do the
25:34
tunability thing, because that's what
25:36
actually makes multiple attention heads
25:37
useful.
25:43
>> Is there actually any relationship
25:44
between AK, AQ and AV, or is the A just
25:47
from a notation standpoint?
25:49
>> Just notation. The thing is, we want to
25:51
use K, Q, V for the resulting matrices, and so I
25:54
had to find something else to use for
25:56
the first ones, and I was like, okay, AQ,
25:58
and we at MIT, we do subscripts and
25:59
superscripts, right? So, yeah.
26:03
>> What is the size of the
26:05
matrices? Are they like square matrices,
26:07
or
26:08
>> Yeah, so typically what happens is,
26:10
you can think
26:12
of it as a hyperparameter in some ways.
26:14
Typically what people do in most
26:15
implementations is that they will
26:17
actually just preserve the size: if
26:19
the incoming embedding is 10, they'll
26:20
make sure the thing coming out
26:22
is also 10. So you just do a 10x10
26:24
matrix to transform it. But the
26:27
value matrix AV, on the other hand,
26:31
there's a bit more technical stuff going
26:32
on where it often tends to be smaller.
26:35
Um so for example let's say that your
26:37
incoming is 100 you do 100 to 100 for
26:39
the key 100 to 100 for the query. But if
26:42
you have say five attention heads, you
26:44
may do 100 to 20 for the V's because
26:47
ultimately all the V's are going to get
26:48
concatenated into another 100 again. So
26:51
I can tell you more offline but fun
26:53
broadly speaking these things tend to
26:55
get transformed; they don't necessarily
26:56
preserve the dimension, 10 in and 10 out.
26:58
Yeah.
27:00
>> So this AQ, these numbers are
27:04
random when you start with it, and then
27:06
you allow it to backprop?
27:07
>> Exactly. Exactly.
27:11
So all right um
27:17
yeah so the values in these matrices are
27:19
weights learned through optimization
27:20
using SGD. Uh and then what that means
27:23
is that
27:25
each of these attention heads now has its own
27:27
copy of these matrices. It has its own
27:29
matrices and over the course of back
27:31
propagation these matrices will look
27:33
very different. Okay. So, important: each
27:36
attention head will have its own set
27:38
of three matrices. So if you have 10
27:40
attention heads 30 matrices will be
27:42
learned.
27:46
>> So by the math it seems like it's
27:48
creating essentially a relationship
27:50
between all of the content being
27:52
ingested, and if
27:54
you're ingesting all the content for
27:56
each attention head are there different
27:58
categories of attention head type that
28:00
you're trying to go after?
28:01
>> Yeah. So basically what we're trying to
28:03
do is to say a particular attention
28:04
head. So in any particular sentence it
28:07
may turn out to be the case that one
28:09
pattern could be about the meanings of
28:10
these words right like the word bank and
28:12
what it means the word station train
28:14
things like that. That's what really
28:15
we've been talking about. But there is a
28:17
whole other pattern to do with grammar
28:19
and tense and things like that. There
28:21
could be another one in terms of tone.
28:23
All those things are very important. And
28:25
a priori we don't know how many such
28:26
patterns exist. Much like in a
28:28
convolutional network, when
28:30
we're designing how many filters to
28:31
have, we don't know how many kinds of
28:33
little things we have to detect, you
28:34
know, vertical line, horizontal line,
28:36
semicircle, quarter circle, stuff like
28:38
that. So, you just give it a lot of
28:39
capacity so that it can learn whatever
28:41
it wants.
28:45
All right. So that is the
28:47
transformer encoder. So, we have done
28:49
the first of the three complications
28:51
needed to make it like industrial
28:53
strength and legit. Uh the second thing
28:56
we do is something called the residual
28:58
connection. So what we do is that
29:02
whatever comes out here right W1 through
29:05
W6 goes in and comes out as W1 hat W2
29:08
and so on and so forth right
29:11
actually sorry what comes out here is
29:13
the hats but what comes out here is some
29:16
intermediate W's right that is what the
29:18
self-attention is going to give you some
29:20
intermediate W's what we do is and
29:22
because what's coming out here these
29:24
vectors are the same length as what goes
29:26
in we can just add them element by
29:28
element
29:29
So we take the input and we actually add
29:32
it to what comes out.
29:35
So why would we want to do that? Why
29:37
would we want to you know go to a lot of
29:39
trouble to process this thing and then
29:41
when it comes out we like literally add
29:43
up the original input? What's like what
29:45
do you think the intuition is?
29:52
So turns out, think of it this way. You
29:56
have a bunch of inputs. You send it to a
29:57
neural network. It transforms it and
30:00
gives you something else. Right? At that
30:02
point, you might be thinking, well,
30:04
everything that
30:06
happens in the network from that point
30:07
onward can no longer see your original
30:10
input. It can only work with the
30:12
transformed input. Right? But what if
30:14
your transformations are not great?
30:17
So as an insurance policy what you can
30:20
do is you can take the the transform
30:22
stuff and you can take the original
30:24
stuff and send both in.
30:27
Right? And this whole thing, you
30:30
can Google it. It's called like a wide
30:31
and deep network and things like that.
30:33
But the whole point is that let's not
30:35
lose the original input anywhere. Let's
30:37
also send it along. But if you keep
30:39
adding the original input to every
30:40
intermediate layer, it's going to get
30:42
longer and longer and longer and bigger,
30:43
which you don't want because you want it
30:44
all to be the same size. So the simplest
30:46
alternative is to just add them up. You
30:49
take the transform stuff and you add the
30:50
original input. You get the same thing
30:52
again. What came
30:54
in, W1, was a 100-long vector, and the
30:57
transformed version is also 100 long. So
31:00
just literally add them up, 100 and 100.
31:02
That's it. You get another 100-long
31:04
vector. So that is what's called a
31:06
residual connection.
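In code, the residual connection is literally one addition (a toy sketch with made-up shapes):

```python
import numpy as np

x_in = np.random.randn(6, 100)           # input to the sub-layer: six 100-long vectors
sublayer_out = np.random.randn(6, 100)   # stand-in for the self-attention output

# same shape in and out, so just add element by element; the original
# input is never lost, it rides along with the transformed version
x_out = x_in + sublayer_out
```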
31:08
Okay. And as it turns out, residual connections
31:12
dramatically improve the gradient flow during back
31:14
propagation, and that's why
31:16
they are very heavily used. And in fact,
31:18
ResNet, which we looked at for computer
31:21
vision, it stands for residual net
31:24
because it was the first network to
31:26
actually figure this out. This
31:29
is not just a transformer thing, by
31:30
the way. It's widely used in you know
31:32
lots of new architectures. The notion of
31:35
a residual connection that's what it
31:36
means. Okay, so we do a residual
31:39
connection and then we come to the final
31:42
tweak which is called layer
31:44
normalization.
31:45
So once we add the residual connection,
31:47
we are going to do something else here
31:48
to these vectors before they continue
31:51
flowing. And what layer normalization does
31:54
is it basically says that
31:57
You will recall from the very
31:59
beginning of the semester I've been
32:00
saying that whatever comes into a neural
32:02
network the inputs let's just really
32:04
make sure that they are all in some sort
32:05
of a narrow, well-defined range; they
32:07
can't be in a big range right so for
32:10
pictures for images we divided every
32:12
number by 255 so that every little pixel
32:15
value is between zero and one okay for
32:18
continuous things like the heart disease
32:20
example we standardized by calculating
32:22
the mean and the standard deviation and
32:24
subtracting the mean and dividing
32:26
by the standard deviation. So when you
32:27
do that all the numbers are going to
32:28
roughly be in the -1 to +1 range. So
32:32
in neural networks, for backprop to
32:35
work really well you have to make sure
32:36
that no numbers get too big that all the
32:39
numbers are always in some sort of a
32:41
narrow range. So what layer
32:43
normalization does is to say you know
32:45
what whatever is coming out here I want
32:48
to make sure none of these numbers are
32:49
too big. I want to make sure they're all
32:51
well behaved in a small range because if
32:53
I don't do that back prop is not going
32:55
to work very well and so
32:59
>> Is this what we do to ensure we don't
33:01
have the problem of vanishing gradients, right?
33:04
>> So, technically, there
33:06
could be two problems: there's an
33:07
exploding gradient and a vanishing
33:09
gradient. Both are bad, and this is a way to
33:10
address it. So you will find a whole
33:12
bunch of something-normalization techniques:
33:15
layer normalization, batch normalization,
33:17
and so on and so forth. All these are
33:19
methods to make sure these numbers stay
33:21
in a small range so it doesn't cause
33:22
gradient issues later.
33:27
All right. So in particular
33:30
what we do, or what happens inside
33:32
this layer normalization, is we
33:35
just calculate the mean and standard
33:36
deviation of every one of these
33:37
embeddings. Okay? Right? If you have
33:39
let's say six embeddings here, we'll
33:41
have six means and six standard
33:42
deviations, right? For each one across
33:43
the rows and then we standardize it.
33:46
Meaning subtract the mean divide by the
33:48
standard deviation. And when you do
33:49
that, all these things are going to be
33:51
nice and small. And then we do this a
33:54
little other thing where we have
33:55
introduced two new parameters to rescale
33:58
it and move it around a little bit just
34:01
because adding more weights always helps
34:03
make these things better. So we add them
34:06
and this gets slightly complicated
34:07
because of the way the dimensions work.
34:09
So I'm not going to spend much time on
34:10
it. Uh and then what comes out the other
34:13
end is a very well-behaved set of
34:15
numbers in a nice and small and narrow
34:16
range.
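Here is a toy NumPy sketch of that computation: one mean and one standard deviation per embedding, then the two extra learned rescale-and-shift parameters.

```python
import numpy as np

X = np.random.randn(6, 100)            # six embeddings, 100 numbers each

mu = X.mean(axis=-1, keepdims=True)    # one mean per embedding (per row)
sigma = X.std(axis=-1, keepdims=True)  # one standard deviation per embedding

X_norm = (X - mu) / (sigma + 1e-5)     # standardize: small, well-behaved numbers

# the two extra learned parameters: rescale (gamma) and shift (beta);
# in a real layer these are learned by backprop, here they start at identity
gamma, beta = np.ones(100), np.zeros(100)
out = gamma * X_norm + beta
```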
34:18
Okay, so this is called layer
34:20
normalization. Um, you can see this link
34:23
to understand it a bit better. Um, and
34:25
we do that as well. So to put it all
34:28
together,
34:30
so this is a transformer encoder where
34:32
we have this multi head attention layer
34:34
where each attention head inside
34:36
of it is tunable with those A matrices,
34:39
and then we have a residual connection.
34:41
We do that and then we do layer norm and
34:43
then we do the same thing in the next
34:45
feed forward layer as well. And then
34:46
boom out pops the output
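Putting it together, here is a hypothetical Keras-style sketch of one encoder block (invented sizes, not the lecture's Colab code): multi-head attention, residual add plus layer norm, then the feed-forward sub-layer with its own residual add plus layer norm.

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.keras.Input(shape=(20, 128))                  # 20 tokens, 128-dim embeddings

# multi-head self-attention: each head carries its own learned K/Q/V matrices
attn = layers.MultiHeadAttention(num_heads=8, key_dim=16)(x, x)
h = layers.LayerNormalization()(x + attn)            # residual add, then layer norm

ff = layers.Dense(512, activation="relu")(h)         # feed-forward sub-layer
ff = layers.Dense(128)(ff)
out = layers.LayerNormalization()(h + ff)            # second residual add + layer norm

block = tf.keras.Model(x, out)                       # same shape in and out: stackable
```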
34:50
>> By that definition, in the multi-head
34:52
attention layer, when it's doing tone and
34:53
everything, theoretically it can pick up even
34:56
the biases or the hate speech aspects
34:59
which come in, and take care of it, right?
35:01
So the model can account for the fact
35:04
that something is biased or something is
35:06
not
35:07
>> Um, the thing is, it's not so much that the
35:09
model is accounting for it; it is capturing
35:11
whatever patterns happen to be inherent
35:13
in the data. Now,
35:16
what you do with that capture is up to
35:18
you. It depends on the actual problem
35:19
you're trying to solve. In particular,
35:21
it is going to capture all the bad stuff
35:23
too, because if your training data has
35:25
a lot of biased stuff in it, toxic
35:27
things in it, dangerous things in it, it
35:29
doesn't have a sense of
35:30
values as to what is good or bad. It's
35:32
just going to pick it up.
35:35
>> Yes.
35:36
>> On that, then, how do you actually make it
35:38
handle those, or how do you mitigate
35:40
the effect of those?
>> That's a whole
35:43
course unto itself, but I'm happy to
35:44
give you pointers offline.
35:47
All right, so this is what we have and
35:50
remember what I said that this is just a
35:52
single transformer block and since what
35:54
comes in and what goes out are the same
35:56
dimensions, we can just stack them one
35:58
after the other, right? It's very
36:00
stackable. You can do it, you can
36:02
multiply, you can you can stack it
36:03
vertically as much as you want. And as I
36:05
mentioned, I think GPD3 has 96 of these
36:08
things stacked one on top of the other.
36:09
Um and so yeah that brings us to that is
36:14
it that is the transformer encoder and
36:15
this exactly maps to that. So basically
36:18
the input embeddings come in you add
36:20
positional embeddings and then you send
36:22
it to say these many attention blocks
36:24
and they all get added up and then it
36:26
comes over the attention block you add
36:28
the add and nom here means add means
36:31
residual connection because you're
36:32
adding the input which is why you have
36:33
this arrow going from the input being
36:36
added there and then you normalize it
36:37
send it along and do it again and out
36:39
comes the output.
36:42
So all right now just to be very clear
36:46
on what is being optimized during back
36:48
propagation in this complex flow right
36:52
Now clearly the embeddings that you
36:54
started out with, both the standalone
36:56
embeddings as well as
36:57
the positional embeddings, those things are
37:00
going to get optimized, right? Those are
37:01
just weights; they're going to get
37:02
optimized. Clearly everything inside the
37:05
transformer encoder block is going to
37:06
get optimized, right? And what are
37:08
they? Well, they are the AK, AQ, AV matrices
37:12
for each attention head. Layer norm has
37:15
parameters as well. The next like the
37:18
little feed forward layer has weights as
37:20
well. All these things are going to get
37:22
optimized and then it goes through this
37:24
ReLU, which again has a bunch of weights.
37:26
It's going to get optimized and then the
37:28
final softmax has a bunch of weights.
37:29
That's going to get optimized.
37:32
All these things are going to get
37:33
optimized by back prop.
37:36
Okay. So in that sense you just step
37:38
back for a second and look at the whole
37:40
thing. It is just a mathematical model
37:41
with a lot of parameters
37:43
and we're just going to use gradient
37:45
descent or stochastic gradient descent to
37:46
optimize it. That's it.
37:49
Yeah.
37:51
>> For those A matrices, when we train the
37:53
model, are we calculating weights for
37:55
like each cell of every possible matrix
37:58
based on the number of inputs like every
38:00
possible dimension up to the max number
38:02
of inputs?
38:04
Um, actually, the weights themselves
38:07
um don't depend on how long your input
38:09
sentence is because remember what we're
38:11
doing is for each sentence that comes in
38:13
let's say the sentence has say three
38:14
words there are three embeddings for
38:16
that sentence each of those embeddings
38:19
gets multiplied by say AK right so AK
38:23
only needs to know how
38:25
long is each embedding it doesn't need
38:27
to know how many words do I have
38:31
and that's a I'm glad you raised that
38:33
question Ben because that's what makes a
38:35
transformer's number of weights
38:37
independent of the number of words in
38:40
your sentence.
38:42
It only depends on the vocabulary that
38:43
you're going to work with because the
38:45
vocabulary determines how many
38:46
embeddings you need.
38:48
The length only matters in
38:51
terms of the positional embedding
38:53
because if you have a thousand long
38:55
sentence, you need a thousand long
38:56
positional embedding matrix. But beyond
38:59
that, it doesn't care.
39:02
And that's why, for example, Google
39:04
Gemini 1.5 Pro can
39:07
accommodate basically a million-long,
39:09
million-token context window, right? It
39:12
can. It's still very compute heavy, but it
39:15
does not change the number of parameters.
39:18
uh yeah
39:20
>> Conceptually, which weights are optimized
39:24
first? Is it in sequential order, or are
39:26
they optimizing the weights at the very
39:28
same time, all
39:29
>> simultaneously because if you think of
39:31
back propagation ultimately you have a
39:34
loss function right and you calculate
39:35
the gradient of that loss function so if
39:38
you have a say a billion parameters that
39:40
gradient is basically a billion long
39:42
vector right and we're going to take the
39:44
gradient and we're going to do w new
39:47
equals w old minus alpha times the
39:49
gradient so all the w's are going to
39:51
update instantaneously
39:53
now the way it actually works in
39:55
computation is,
39:56
because of back propagation,
39:58
it's going to start at the end and
39:59
slowly flow backwards but when it's done
40:01
everything will be updated.
40:03
Yeah.
40:06
>> Say we take two attention heads, and we
40:10
have the matrices AK, AQ and AV in
40:12
them. Why would the parameters of all
40:16
three of them all the weights of the
40:18
three matrices on this side and this
40:19
side would be different because finally
40:21
the things you're inputting from this
40:22
side and the output are the same. So the
40:25
learning process should ideally be the
40:26
same, unlike a CNN where we had put in
40:29
filters which were different. So what
40:31
different thing do we have?
40:32
>> because the initialization is different.
40:35
>> What do you mean?
40:35
>> Like what I mean is if you have two
40:37
heads right each head has three
40:38
matrices. The starting values of those
40:40
six matrices are different.
40:42
>> The starting values of AK, AQ and AV are
40:45
different for both the heads?
40:46
>> right? Much like for all the weights
40:48
typically the values are randomly
40:50
chosen. If they were all the same,
40:53
you're right, it won't make a
40:54
difference right? They will all change
40:56
the same way. Yeah.
40:59
>> Is the input of the transformer the
41:02
sentence, or the array of embeddings
41:06
of each word?
41:08
>> Uh, the transformer itself is
41:10
expecting embeddings in and so what
41:13
basically happens is that we get some
41:14
sentence we run it through a tokenizer
41:16
which converts it to a bunch of tokens
41:18
which are just integers and then it goes
41:20
through the embedding layer which maps
41:22
the integers to these embeddings and
41:24
then you feed it to the transformer. But
41:26
when you do back propagation, it comes
41:28
all the way back to the starting
41:29
embedding layer and updates those
41:31
weights.
41:32
>> Okay. So they can be trainable. So the
41:34
weights at the beginning must be input
41:36
here, but they can train.
41:37
>> They're trainable. Exactly. Exactly.
41:40
>> Uh yeah.
41:41
>> Are the attention heads solely parallel
41:43
or can you have like a stack of
41:45
attention heads?
41:46
>> Typically they are parallelized. Um and
41:49
because you can always stack the block
41:50
itself to get more and more power.
41:54
All right. So now, to apply the
41:57
transformer: the common use
41:59
cases are that you have a whole sentence
42:01
that comes in and you just want to
42:03
classify it, right, the canonical
42:05
thing being, hey, movie sentiment
42:07
classification, boom, positive or negative,
42:09
right? Classification. Another common one
42:11
is labeling, where every word gets
42:13
labeled with a multiclass label, and that's
42:15
basically what we saw with our slot
42:17
filling problem. And then there is
42:19
another thing called sequence generation,
42:20
where you give it a sequence and you want it
42:22
to continue the sequence, right, generate
42:23
more stuff, i.e. large language models
42:25
and all that good stuff. So this we
42:28
already know how to do, because we
42:29
actually literally built a Colab
42:30
with this transformer stack. Now the
42:33
question is how can we do that right?
42:35
How can you do basic classification with
42:37
these things? So now if you again when
42:40
you send a sentence in after all that
42:42
stuff is done and when I say encoder
42:44
here, I'm assuming that you may have
42:46
one block, you may have 106 blocks, I
42:48
don't care at the end of the day you
42:49
send something in you get a bunch of
42:50
contextual embeddings out
42:53
right so at this point we need to take
42:57
these contextual embeddings and somehow
42:58
make it work for classification for just
43:00
classifying something into yes or no
43:02
positive or negative so it'll be nice if
43:05
we can actually take all these
43:06
embeddings and like essentially
43:08
summarize them into a single embedding,
43:10
a single vector
43:12
because if you have a single vector then
43:14
we can run it through maybe a ReLU and
43:16
then we do a sigmoid and boom we can do
43:18
a you know a binary classification
43:19
problem super easy right so this begs
43:22
the question okay how are we going to go
43:23
from all the many blue things to one
43:25
green thing
43:28
Okay, now, of course, what we can do is
43:33
simply average them: we can take
43:36
each of the embeddings and just
43:37
average them element by element, and you'll
43:39
get a nice green thing. Okay. Um any
43:42
shortcomings from doing that?
43:48
>> You would lose the ordering of the
43:50
words.
43:51
>> You do uh well in some sense the
43:53
positional embedding, the positional
43:55
encoding you have in the input does have
43:58
this notion of position, right? So
44:00
you're not necessarily losing the order
44:02
but you're sort of
44:04
averaging all this information into
44:06
something and averaging is going to lose
44:08
some richness.
44:12
Okay.
44:15
>> I think it's going to be skewed to the
44:17
one that has like the biggest number,
44:19
right? So something is influencing your
44:22
>> Yeah, the biggest ones are going to
44:23
dominate. But hopefully we won't have
44:25
too much of that because all the layer
44:27
norm business at the beginning has
44:29
hopefully made sure the numbers are all
44:30
in a reasonably small and well behaved
44:31
range. But the point really is that
44:33
you're going to lose richness in the
44:35
information because you're just like
44:36
mushing it down. So there's a much
44:40
better and more elegant way to do this
44:42
which is that what you do is for every
44:46
sentence when you train it you add an
44:49
artificial token called the class token.
44:52
Okay, literally it's an artificial token
44:54
and it's designated as you know CLS in
44:57
the literature and then this token is
45:00
getting trained with everything else.
45:03
Okay. And so once you finish
45:06
training
45:08
that token has its own embedding too.
45:10
And because it has been trained with
45:13
everything else and this token is
45:15
remember it's a contextual embedding
45:16
which means that it's very much aware of
45:18
all the other words in the sentence.
45:21
So in some sense this CLS
45:23
token's contextual embedding sort of
45:25
captures everything that's going on
45:26
about that sentence
45:29
right and so what we do is once we are
45:31
done training we just grab this thing
45:32
alone and then send that through a ReLU
45:35
and a sigmoid and boom you're done.
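As a sketch, grabbing the CLS embedding and classifying it might look like this in Keras (hypothetical shapes and names; the CLS token is assumed to sit at position 0):

```python
import tensorflow as tf
from tensorflow.keras import layers

enc_out = tf.keras.Input(shape=(20, 128))            # contextual embeddings, CLS first

cls = layers.Lambda(lambda t: t[:, 0, :])(enc_out)   # grab the CLS embedding alone
h = layers.Dense(64, activation="relu")(cls)         # little ReLU layer
yes_no = layers.Dense(1, activation="sigmoid")(h)    # binary (e.g. sentiment) output

classifier = tf.keras.Model(enc_out, yes_no)
```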
45:38
So this is a very clever trick:
45:41
instead of averaging
45:43
everything at the end, let's just have
45:45
something for the whole thing, the
45:46
sentence, and just learn it along
45:48
with everything else. So, a meta
45:50
principle in deep learning is that
45:52
whenever you think you're making an ad
45:54
hoc decision about something like
45:55
averaging a bunch of stuff you should
45:56
always stop and say is there a better
45:59
way to do it where it doesn't have to be
46:00
ad hoc where the right way is learnable
46:02
from the data directly using back
46:04
propagation. Um there was a hand. Yeah.
46:08
>> Is there a reason that you
46:11
added the CLS at the start? Why not add
46:14
it at the end?
46:15
>> You can do it at the end. Is there any
46:16
difference?
46:17
>> Um the only thing to remember is that um
46:19
it's a good question. So different
46:21
sentences are going to be of different
46:22
length, right? So there might be short
46:24
sentences, there might be long
46:25
sentences. In particular, the
46:27
short sentences are going to get padded,
46:29
right? Remember, I talked about padding
46:31
to make it fit to one length. So what
46:34
internally the transformer will do is
46:35
ignore all the padded tokens, because it's
46:37
just padding; it doesn't
46:39
really matter for anything. So if you
46:40
have the CLS at the very end, we'd have
46:42
to do much more administrative
46:44
bookkeeping: take everything but the
46:46
last one,
46:48
ignore it, and only do the last one. It's just
46:50
much easier just to put it in the beginning.
46:52
That's the reason. Yeah.
46:54
>> What would be a practical
46:56
application of this? Would it be something
46:58
like sentiment analysis, like positive
46:59
or negative?
47:00
>> Yeah. So basically any kind of text
47:02
comes in and you want to figure out some
47:04
labeling problem like a classification
47:06
problem. The easiest example I could
47:08
think of was sentiment.
47:09
But you can imagine for example an email
47:12
comes into a like a call center
47:14
operation and you want to take the email
47:16
and automatically figure out which
47:17
department should I send it to.
47:20
Okay. So now, if the input data for a
47:24
task is natural language text, right? We
47:27
don't have to restrict ourselves to only
47:28
the input training data we have. Right?
47:31
Wouldn't it be great to learn from all the
47:32
text that's out there? So, for example,
47:35
to go back to that call center thing I
47:36
just mentioned:
47:39
let's say it's coming in English, and you need the
47:41
ability to take that English email and
47:43
route it to one of 10 things. You know,
47:45
you shouldn't have to learn English just
47:47
for your call center application. You
47:49
should learn English generally and use
47:50
it for other things, right? So, why
47:52
can't we just learn from all the text
47:54
that's out there? And so, that brings us
47:56
to something called self-supervised
47:58
learning. And the idea of self-
48:00
supervised learning is this. So if you
48:02
recall the transfer learning example
48:03
from lecture four, right, where we had
48:05
ResNet: we took ResNet, we
48:08
chopped off the final thing, we made
48:10
it sort of headless, and then we attached
48:13
the output of the headless ResNet to
48:14
a little hidden layer and output, and we
48:17
did the handbags and shoes and you will
48:19
recall that we were able to build a very
48:21
good classifier for handbags and shoes
48:22
with just like a 100 examples. Right? So
48:24
the question is why was this so
48:26
effective? Why was this so effective?
48:29
And turns out the reason why any of this
48:31
stuff actually works is because neural
48:34
networks or they learn representations
48:36
automatically when you train them. So
48:38
what I mean by that is when you imagine
48:40
a network, you feed in a bunch of stuff,
48:42
it goes through all the layers, it comes
48:43
out. Uh you can think of each layer as
48:46
transforming the raw input into some
48:48
different, alternate representation of
48:50
the input. Okay? And these are
48:53
called representations. That's actually
48:54
a technical term. Um, and so,
48:57
from this perspective, when you train a
48:58
neural network, a deep network with lots
49:00
of layers, what you're really learning
49:02
is a way,
49:05
you're learning how to represent the input in
49:07
many different ways. Each of these
49:09
arrows is a different way of
49:10
representing things. Plus, you're
49:11
learning a final regression model,
49:14
either a linear regression model or a
49:15
logistic regression model.
49:16
Fundamentally, that's what's going on.
49:18
Because the final layers tend to be
49:19
sigmoid, soft max, or just linear,
49:21
right? So the final layer if you just
49:24
look at this part alone, whatever is
49:26
coming in it's just going through
49:27
essentially a linear regression model or
49:29
a logistic regression model that's it.
49:31
So fundamentally you're learning
49:32
representations and a final little
49:34
model. Okay. But the reason why all
49:36
these things work so much better than
49:38
logistic regression is because those
49:39
representations have learned all kinds
49:41
of useful things about the input data.
49:43
They have sort of automatically feature
49:45
engineered for you.
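To make that concrete, here is a minimal sketch (not from the class Colab; the layer name and sizes are just illustrative) of pulling an intermediate representation out of a pretrained network and treating it as automatically engineered features:

import torch
from torchvision import models
from torchvision.models.feature_extraction import create_feature_extractor

# A pretrained ResNet; every prefix of its layers acts as an "encoder."
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# Read out the representation just before the final classifier head.
encoder = create_feature_extractor(resnet, return_nodes={"avgpool": "feats"})

x = torch.randn(1, 3, 224, 224)          # a stand-in input image
feats = encoder(x)["feats"].flatten(1)   # a 2048-dim representation
# Feed `feats` into a small logistic-regression head for your own task.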
49:47
So, from this perspective you can
49:50
imagine that each layer here is like an
49:53
encoder. It encodes the input, right?
49:55
The first layer encodes it. The first
49:56
two layers encode something. The first
49:58
three layers encode something and so on
49:59
and so forth. So a deep network contains
50:01
many encoders. And so the question is
50:04
what do these representations actually
50:06
embody right? What do they capture? Is
50:08
it like specific knowledge about the
50:10
particular problem that you trained the
50:12
network on, or is it like
50:14
general knowledge about the input data
50:16
because if it is general knowledge about
50:18
the input, we can use it to solve other,
50:20
unrelated problems. So is it
50:22
specific knowledge or general knowledge
50:24
and it turns out they actually capture a
50:26
lot of general knowledge about the input
50:28
and that's why you can get reuse out of
50:31
them you can reuse them for other
50:33
unrelated things because they have
50:34
captured general stuff. So if you look
50:36
at this, I think I've shown you before,
50:38
right? If you if you look at a network
50:40
that classifies everyday objects into a
50:41
bunch of categories, it can learn all
50:43
these little patterns in the beginning
50:44
and later on and so on and so forth. And
50:46
this is a face detection network. It has
50:48
learned how to look at, you know,
50:50
identify little circles and edges and
50:52
nose like shapes and finally faces. So
50:55
all these things are examples of
50:56
representations, learning interesting
50:57
things about the input. Okay. So since
51:00
these representations are capturing
51:02
intrinsic aspects of the data, you can
51:04
use it for other things, right? You can
51:06
take a face detection neural network and
51:08
use it, reuse it for emotion detection
51:10
for instance.
51:12
Uh, so the question is if you can somehow
51:14
get like an encoder that generates good
51:17
representations for your input data, we
51:19
can simply build a regression model with
51:20
those as input and labels as output and
51:22
be done. And this is exactly what we did
51:24
with ResNet for handbags and shoes. We
51:27
found a thing that had already been
51:28
trained on similar everyday objects,
51:30
everyday images. And the key insight
51:33
here is that since we don't have to
51:35
spend precious data on learning these
51:37
good representations,
51:40
we won't need as much labeled data in the
51:42
first place because the pre-training
51:44
used a lot of data and you're sort of
51:46
piggybacking on that data. So in some
51:48
sense, your training data is everything
51:50
that the pre-trained model was trained
51:51
on plus your little 200 examples.
51:55
Um, okay. So this is what we did. We
51:57
used headless ResNet as an encoder
51:58
that can take raw input and transform it
52:00
into useful representations. Uh this is
52:02
what we did. All right. So the general
52:04
approach is that you find a deep neural
52:06
network built on similar inputs but
52:08
different outputs. Uh and then you
52:10
basically grab maybe the penultimate uh
52:13
representation or the one before that.
52:15
Then you chop off the head. You attach
52:17
your own output head. Train just
52:21
the final layer, or train the
52:23
whole thing if you want. Right? This is
52:25
like the playbook we followed for
52:26
ResNet. The same thing works for all
52:27
kinds of other data types as well.
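As a rough sketch of that playbook in code (PyTorch here, not the exact class Colab; the layer sizes and the binary head are illustrative):

import torch.nn as nn
from torchvision import models

base = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
base.fc = nn.Identity()            # chop off the head
for p in base.parameters():
    p.requires_grad = False        # option 1: train just the new head

model = nn.Sequential(
    base,
    nn.Linear(2048, 64), nn.ReLU(),        # your own little hidden layer
    nn.Linear(64, 1), nn.Sigmoid(),        # e.g., handbag vs. shoe
)
# ...train as usual on your few hundred labeled examples, or flip
# requires_grad back to True to fine-tune the whole thing instead.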
52:30
So now to build such a model, we need
52:32
labeled data, right? We were lucky
52:34
because ResNet was actually trained on
52:35
ImageNet data, which is like a million
52:37
images, each of which is labeled into a
52:39
thousand categories, which is very
52:40
convenient for us, right? But what if
52:44
you want to build a generally useful
52:46
model for text data?
52:49
Clearly we need to collect a lot of text
52:51
data. But that's no problem because
52:52
the internet is full of text data, right? We
52:54
can easily scrape the internet. We can
52:55
just download Wikipedia. So that's not a
52:57
problem. The problem is something else
52:59
which is: how do we define an input-
53:02
label pair for a piece of text? So for an
53:05
input sentence, what should the output
53:07
label be? That's the key question.
53:09
Because if you can answer this question,
53:10
you can just train all these
53:11
things on all kinds of text data, right?
53:14
So there's like a beautiful idea for doing
53:17
this is called self-supervised learning.
53:18
And the key idea is that you take your
53:20
input, whatever the input is you take a
53:23
small part of the input and just remove
53:26
it and then ask your network to fill in
53:28
the blanks from everything else.
53:31
Okay, so this is called masking and it's
53:33
just one of many techniques in
53:35
self-supervised learning, but this is
53:36
very commonly used. So this is the original
53:39
input, right? And then you take it and
53:41
then you just like take this thing in
53:43
the middle here randomly and
53:45
zero it out or mask it. And so this
53:48
incomplete input is now your new input
53:51
and the thing that you took out becomes
53:53
your fake label.
53:56
So you can almost imagine, right, if
53:58
you're baking donuts: you
54:00
make a donut and then you punch a
54:02
hole in the middle of the donut. The
54:04
donut with the hole is your new input; the
54:07
munchkin is the label.
54:11
Am I making everybody hungry at this
54:13
point? So,
54:15
So once you do that, no problem. You
54:17
have an input, you have
54:19
labels, you just train a neural network
54:23
to essentially predict those, to
54:25
basically fill in the blanks.
54:28
And so if for example, if you take a
54:30
sentence like the Sloan School's
54:32
mission, you can just go in there and
54:34
just knock out randomly a bunch of
54:36
words, like this. And the ones I'm
54:39
knocking out, I'm just putting the word
54:40
MASK in, just to show what I'm doing.
54:42
And then, when it's actually given this
54:45
sentence, it will try to fill in the
54:46
blanks with actual words.
54:50
Okay,
54:51
so now for the amazing part. In the
54:53
process of learning to fill in the
54:54
blanks, uh the network learns a really
54:57
good representation of the kind of input
54:58
data it's seeing. And it kind of makes
55:01
sense, right? Because if I give you a
55:02
sentence with a few missing blanks and
55:04
you're able to very successfully fill in
55:06
the blanks, you have learned a whole
55:08
bunch of stuff about the world to be
55:10
able to do that, right? If I say the
55:12
capital of France is ___ and you're
55:14
like Paris, okay, how did you know that?
55:16
It's sort of like that. By learning to
55:18
fill in the blanks, you really have to
55:20
learn how all these things work, all
55:22
the connections between various
55:24
words and so on and so forth. And so
55:27
what you can do is once we build such a
55:29
model, we can just extract an encoder
55:32
from it, right? And then we'll fine-tune
55:34
it like we do with transfer
55:36
learning. But this is how you build a
55:38
generic pre-trained model on
55:41
unlabeled data.
55:43
And so we can use a transformer encoder
55:46
to build this whole thing in the middle
55:48
because remember the transformer can
55:49
take any sentence and give you the same
55:51
size sentence back along with
55:53
predictions for everything. So we can
55:55
just have it take this thing in and ask
55:57
it to just predict all the missing words
55:58
here.
56:01
And
56:03
so uh to put it in other words, masked
56:05
self-supervised learning is just a
56:06
sequence labeling problem.
56:09
So basically this is the sequence that
56:11
comes in, and then you feed it to the
56:13
transformer and you get all these
56:14
embeddings. It goes through all that
56:16
stuff. You really don't care about these
56:18
outputs. But wherever the word MASK went
56:21
in in the input, you basically try
56:23
to get it to predict the right answer, for
56:25
example the word mission, and
56:26
that is the right answer.
56:28
This is the right answer here. And then
56:29
you take these right answers, create a
56:31
loss function, and do back prop and
56:32
boom, you're done.
56:35
Inputs, right answers, and you're in
56:37
business. That's it.
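A minimal sketch of that loop, assuming PyTorch; the tiny encoder here is just a stand-in for the transformer stack, and the ids and sizes are illustrative:

import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, D, MASK_ID = 30522, 768, 103        # BERT-style numbers
embed = nn.Embedding(VOCAB, D)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=12, batch_first=True),
    num_layers=2)
to_vocab = nn.Linear(D, VOCAB)             # per-position word predictions

tokens = torch.randint(1000, VOCAB, (8, 32))   # a batch of "sentences"
mask = torch.rand(tokens.shape) < 0.15         # knock out ~15% of words
inputs = tokens.masked_fill(mask, MASK_ID)     # the donut with the hole
labels = tokens.masked_fill(~mask, -100)       # the munchkin; -100 = ignore

logits = to_vocab(encoder(embed(inputs)))      # (batch, seq, vocab)
loss = F.cross_entropy(logits.reshape(-1, VOCAB),
                       labels.reshape(-1), ignore_index=-100)
loss.backward()                                # inputs, right answers, done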
56:40
Now, if we pre-train a transformer model like this
56:41
on massive amounts of English text,
56:44
let's say we did that. We get something
56:46
called BERT. BERT is a very famous
56:48
transformer model. And BERT was the
56:51
first model actually that Google used to
56:53
upgrade its search in 2019.
56:56
like the Brazil visa example you
56:58
may recall from earlier lectures that
57:00
uses BERT under the hood. Okay. Um and
57:03
so now I just want to show you because
57:06
you can actually read the BERT paper and
57:07
it'll actually make sense to you now
57:09
based on what you have learned in this
57:10
class. Look at this: BERT's model
57:13
architecture is a multi-layer
57:14
bidirectional transformer encoder. Okay,
57:16
transformer encoder. We denote the
57:18
number of layers transformer blocks as
57:20
L. The hidden size is H and the number
57:23
of attention heads as A. And how much is
57:25
that? Uh, okay, H is 768, okay,
57:30
so which means that the embedding sizes
57:34
are 768,
57:36
and the hidden feed-forward layer is
57:38
four times as much, so it's 3072. So
57:41
the 3072 is the feed-forward
57:44
layer; the embeddings are 768. And you can
57:47
see there are two BERT models here this
57:49
one has 12 transformer blocks this one
57:52
has 24 transformer blocks
57:55
Okay, so you can actually read this
57:58
paper. You can actually relate
57:59
it to exactly what we discussed in
58:00
class. It'll all make sense.
58:02
Bidirectional means that the words can
58:04
pay attention to every other word in the
58:06
sentence. And as we will see on Monday,
58:09
there's another
58:10
transformer thing called a causal
58:12
transformer in which you only pay
58:14
attention to the words that came before
58:15
you, not the ones after you. So
58:18
bidirectional means all words are seen.
58:21
Okay. So um, what we do is
58:24
remember we said to solve sequence
58:26
classification you can add a little
58:27
token at the beginning, uh, and then boom,
58:30
use it for classification. As it turns
58:32
out, very conveniently for us, the
58:35
people who built BERT,
58:36
when they trained BERT, they just used
58:38
the CLS business
58:41
during training so it's actually
58:42
available for us out of the box so when
58:44
you use BERT for sequence classification,
58:46
you don't even have to do any surgery on
58:47
it. It just gives you the class token
58:48
automatically which is very convenient
58:51
uh and you can also use it for sequence
58:52
labeling as well. So for sequence
58:55
classifications and sequence labeling uh
58:57
BERT is actually usually a really good
58:58
starting point and in particular there
59:00
have been lots of improvements and
59:02
variations of BERT over the years and if
59:04
you're curious about this there's a
59:05
thing called the sentence transformers
59:07
library which has got a whole bunch of
59:09
BERT related code and resources that you
59:11
can use to do things out of the box.
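For example, here is a minimal sketch of pulling BERT's [CLS] vector out of the box with the transformers library (the model name shown is the standard base checkpoint):

import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

batch = tok(["I loved the movie."], return_tensors="pt")
with torch.no_grad():
    out = bert(**batch)
cls_vec = out.last_hidden_state[:, 0]   # the 768-dim [CLS] embedding
# Attach your own classification head to this and fine-tune for
# sequence classification; use the other positions for labeling.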
59:14
Okay. So okay there's a bit of a word
59:18
wall.
59:20
So to solve any of these problems
59:21
classification or labeling where the
59:23
input is natural language we can
59:24
obviously use a model like BERT: label a
59:27
few hundred examples, attach the right
59:28
final layers, and fine-tune it like we
59:30
did for the ResNet. But if your
59:32
problem is like a standard NLP problem,
59:34
okay you don't even have to do that
59:37
because people for these standard tasks
59:39
they've already pre-trained it on those
59:40
standard tasks right and so you can do
59:43
all these things without any fine tuning
59:44
at all, like literally out of the box, uh,
59:47
and so there are many hubs which have
59:49
these pre-trained models, but perhaps
59:50
the biggest one is the Hugging Face Hub.
59:53
And I checked last night, it has 525,000
59:56
models
59:58
available. I think if I recall last year
1:00:00
when I taught this course, I think the number
1:00:02
was a lot smaller, maybe 50,000. So it's
1:00:04
like growing really, really fast. Um,
1:00:07
and so all right, let's just switch to a
1:00:09
Hugging Face Colab.
1:00:15
So, Hugging Face. How many of you are
1:00:18
familiar with Hugging Face?
1:00:21
Okay, it's good. All right, so um for
1:00:24
the others, basically you have a whole
1:00:26
bunch of pre-trained models on Hugging
1:00:28
Face. You actually have a lot of data
1:00:30
sets you can work with for your own
1:00:32
tasks. Uh there are lots of people
1:00:34
demoing what they have built in this
1:00:37
thing called Spaces and of course a lot
1:00:39
of documentation and so on. So the thing
1:00:40
you can do is what they have done is
1:00:42
they have organized all these models by
1:00:44
the kind of task you can use them for.
1:00:46
So you can see here there are a whole
1:00:47
bunch of computer vision tasks that you
1:00:49
can use them for. There's a whole bunch
1:00:50
of natural language tasks like text
1:00:52
classification
1:00:54
uh feature extraction this and that lots
1:00:56
of interesting examples here. And so
1:00:59
what you do is you just literally can go
1:01:00
in there and say okay I want to do a
1:01:01
text classification. You hit it and then
1:01:03
it tells you all the models that are
1:01:05
available. It turns out there are 50,000 models just
1:01:06
for text classification. And you can
1:01:08
look at okay which is you know most
1:01:10
downloaded or which is the most liked
1:01:11
and then you can just use them as a
1:01:13
starting point for whatever you want to
1:01:14
do. Okay. So that is Hugging Face,
1:01:17
and so the way you do Hugging Face is
1:01:20
I'm just connecting it. Um
1:01:24
if you have a problem in which the input is
1:01:26
natural language text the first question
1:01:28
you have to ask yourself is it standard
1:01:29
or not? Is it a standard task or not? If
1:01:31
it's a standard task, you just go that
1:01:32
route; do not reinvent the wheel. This thing
1:01:34
will usually work pretty well. Okay. So
1:01:37
here we will use this thing called um
1:01:39
the transformers library from Hugging
1:01:41
Face, in particular the pipeline function,
1:01:43
to demonstrate quickly how to do this
1:01:45
thing. Fortunately this library as of
1:01:47
this year is pre-installed in Colab, so
1:01:48
we don't have to install it. We
1:01:50
can just start using it right away. So
1:01:51
we'll take this example where you have a
1:01:53
bunch of text which says um
1:01:57
Dear Amazon, last week I got an Optimus
1:01:59
Prime action figure from your store in
1:02:00
Germany. Unfortunately when I opened the
1:02:01
package, I discovered to my horror that I
1:02:04
had been sent an action figure of
1:02:05
Megatron instead. Can you imagine that
1:02:06
person's like sheer distress at this?
1:02:08
Um, so as a lifelong enemy of the
1:02:10
Decepticons, I hope you can understand
1:02:12
my dilemma. So to resolve the issue, I
1:02:14
demand an exchange. Enclosed are copies; I
1:02:17
expect to hear from you soon. Sincerely,
1:02:19
Bumblebee.
1:02:21
Okay. They should have come
1:02:22
up with a better name for this example.
1:02:24
Uh, all right, cool. So that's the text
1:02:26
we have. So we import this pipeline
1:02:29
function; it's the one that basically gives
1:02:31
you the ability to start
1:02:33
using it out of the box without any training,
1:02:34
nothing like that. Okay, so we download
1:02:36
this thing. Um, oh wow, I got an A100
1:02:40
today. That happens very rarely. All
1:02:42
right, sorry.
1:02:44
So here, let's say you want to classify
1:02:46
that text. Okay, you just want to
1:02:48
classify it for sentiment. You literally
1:02:50
go in there and say pipeline
1:02:52
text classification. That's the task you
1:02:55
want the pipeline to do for you, right?
1:02:57
And you create a classifier. Okay, it's
1:02:59
going to download a bunch of stuff. Uh,
1:03:01
and then so on and so forth.
1:03:04
The first time it just takes time to
1:03:06
download and then you literally take the
1:03:08
text you have here and then run it
1:03:10
through the classifier as if it was just a
1:03:11
little function, right? You get some
1:03:14
outputs, and then you can just display them
1:03:17
this way:
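In code, the whole thing is roughly this (a sketch of the Colab; the complaint text is abbreviated here):

from transformers import pipeline

text = ("Dear Amazon, last week I got an Optimus Prime action figure "
        "from your store in Germany...")    # the full complaint above

classifier = pipeline("text-classification")   # downloads a default model
outputs = classifier(text)
print(outputs)
# e.g. [{'label': 'NEGATIVE', 'score': 0.90...}]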
1:03:19
Negative: the sentiment is negative with 90%
1:03:21
probability. Pretty good, right? Sequence
1:03:23
classification solved. I mean,
1:03:25
sentiment classification solved. So we'll
1:03:27
try a few different examples. Uh, "I hated
1:03:30
the movie." "If I said I loved the movie,
1:03:31
I would be lying." Okay, that's a little
1:03:33
tricky. "The movie left me speechless."
1:03:34
Incredible. And then I had to add this
1:03:36
last thing here last night. Almost but
1:03:38
not quite entirely unlike anything good
1:03:40
I've seen. Okay. And that's not
1:03:42
original. By the way, people who have
1:03:43
read Douglas Adams will know this famous
1:03:44
sentence about somebody drinking some
1:03:46
beverage and saying it's almost but not
1:03:48
quite entirely unlike tea. So I was
1:03:50
inspired by that. So anyway, we'll see
1:03:52
what happens. Um.
1:03:56
All right. Put it in there. Okay. So
1:03:59
negative. I hated the movie. Okay, fine.
1:04:01
"If I said I loved the movie, I'd be lying":
1:04:02
Negative. Movie left me speechless. Uh,
1:04:05
it says it's negative, but it could go
1:04:07
either way, right? A good classifier
1:04:09
would have probably given you a
1:04:09
probability around the 50% mark because
1:04:11
it's sort of right on the fence. Um,
1:04:13
incredible, it's positive, and then it
1:04:15
got fooled by my crazy long sentence and
1:04:17
it says it's positive. Okay, now that's
1:04:20
classification. Here's one other quick
1:04:22
example. So, you can actually give it a
1:04:23
piece of text, right? For example, you
1:04:25
can take like a Reuters news story.
1:04:28
You can feed it and say extract all the
1:04:30
company names from it. Extract company
1:04:32
names, people names and things like
1:04:34
that. It's called named entity
1:04:35
extraction. And back in
1:04:37
the day, people would
1:04:40
painstakingly hand-build all these
1:04:42
very complex systems to do named
1:04:44
entity extraction. Now it's just a
1:04:46
pipeline away. So you can take this
1:04:48
thing and you can say create a pipeline
1:04:50
for named entity extraction, and for any
1:04:53
particular task that you're using there
1:04:54
might be a few additional parameters you
1:04:56
can set right as a part of the
1:04:57
configuration. So we download this
1:05:00
pipeline.
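Sketched in code, reusing the same `text` as before (the aggregation_strategy parameter is the knob that groups word pieces back into whole entity names):

from transformers import pipeline

ner = pipeline("ner", aggregation_strategy="simple")
print(ner(text))
# e.g. [{'entity_group': 'ORG', 'word': 'Amazon', ...},
#       {'entity_group': 'LOC', 'word': 'Germany', ...}, ...]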
1:05:08
Okay, perfect. And then we run the
1:05:11
output. So it says okay good. Amazon is
1:05:14
an organization
1:05:16
uh
1:05:18
and Germany is a location, LOC, which is
1:05:21
nice. So these things have a standard
1:05:22
vocabulary, ORG or LOC, things like
1:05:23
that, which you can read up in the
1:05:24
documentation. Uh and then Bumblebee is
1:05:26
a person. And then, boy, all the
1:05:29
Optimus Prime transformer stuff, that's where
1:05:32
it got fooled, right? It thinks Optimus
1:05:33
Prime is miscellaneous. Uh, Decepticons is
1:05:36
miscellaneous and so on and so forth.
1:05:38
But you get the idea. You can take
1:05:39
standard things like Reuters news stories,
1:05:41
and just like that, you can get
1:05:42
very good entity extraction right
1:05:44
off the bat. And once you get these
1:05:45
entities extracted, then you can put
1:05:47
them into a nice structured data table
1:05:48
like a database and then you can run
1:05:50
traditional machine learning on it.
1:05:53
Okay. Um and then I had I think a few
1:05:55
more examples of question answering and
1:05:58
uh actually let's just try that. um you
1:06:01
can actually give it a thing and ask a
1:06:02
question about it, and it can actually
1:06:03
give you the answer, which gets into the
1:06:07
causal transformer thing that we're
1:06:09
going to see on Monday which builds up
1:06:10
into large language models because you
1:06:12
obviously can
1:06:14
give a passage to ChatGPT and ask a
1:06:16
question, ask it to give you an answer, so
1:06:17
it's really in that vein. But um, just
1:06:19
for fun let's just do that to see if
1:06:20
it's any good. Um, okay: so what does the
1:06:25
customer want? And the output is "an
1:06:27
exchange of Megatron," and it's telling
1:06:29
you where it starts in the text
1:06:32
and where it ends, the relevant passage.
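Again as a sketch, with the same `text` as context:

from transformers import pipeline

reader = pipeline("question-answering")
print(reader(question="What does the customer want?", context=text))
# e.g. {'answer': 'an exchange of Megatron',
#       'start': ..., 'end': ..., 'score': ...}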
1:06:34
It's pretty good, right? So because
1:06:37
remember if you have stuff like this
1:06:39
then when you ask like a large language
1:06:41
model a question it gives you an answer.
1:06:42
You can actually ask it to give you
1:06:44
exactly where in the input it found the
1:06:46
answer, and because you know these things
1:06:48
are going to hallucinate, you can actually
1:06:49
look at the input that it's claiming to
1:06:51
use and look at what it says and see if
1:06:54
they actually match. It's a way to sort
1:06:56
of essentially do QA on LLM output.
1:06:59
Um okay so that's what we have here and
1:07:01
I have a bunch of other stuff, much of which
1:07:03
I'll ignore for the moment because I
1:07:05
want to go back to the PowerPoint.
1:07:07
So yeah so if you have a standard task
1:07:10
uh you know you can just use pipelines
1:07:11
and Hugging Face to actually solve many
1:07:13
of them out of the box without any heavy
1:07:15
lifting. So I mentioned earlier on that
1:07:18
transformers have proven to be effective
1:07:19
for a whole bunch of domains outside of
1:07:21
natural language processing um like you
1:07:24
know speech recognition, computer vision
1:07:26
and so on and so forth. Um and so I want
1:07:29
to give you a couple of quick examples
1:07:30
of how to think about using
1:07:32
transformers for non-text applications.
1:07:35
Okay. So uh, the key insight here is
1:07:39
that the architecture of the transformer
1:07:41
block that we have looked at amazingly
1:07:42
enough can be used as is with no changes
1:07:45
no surgery needed. No clever thinking
1:07:47
required for any particular application.
1:07:49
What is needed where the clever thinking
1:07:51
may be required is you need to take the
1:07:53
inputs that you're working with and you
1:07:55
need to figure out a way to tokenize and
1:07:57
encode them into embeddings
1:07:59
which can then be sent into the
1:08:01
transformer. So all the action is in
1:08:03
taking that input, that non-text input, and
1:08:05
figuring out a way to cast them in the
1:08:07
language of embeddings. That's where the
1:08:09
action is; that's the game. Okay. So um, here is
1:08:12
something called the vision transformer
1:08:14
which is very famous actually. I think
1:08:16
it may be perhaps the first uh
1:08:19
transformer architecture that was
1:08:20
applied to vision problems. So um so
1:08:23
let's say you have a picture um yeah so
1:08:25
let's say you have this picture okay
1:08:28
it is just a picture okay so you have to
1:08:31
find a way to create embeddings from
1:08:33
this picture or to tokenize this picture
1:08:35
in some way. With sentences, you know, "I
1:08:38
love Harvard": well, obviously "I," "love," and "Harvard"
1:08:40
are three tokens; it's pretty trivial to
1:08:41
figure out how to tokenize them but with
1:08:43
a picture what do you do right it's kind
1:08:45
of weird to think of tokenizing a
1:08:47
picture. So what these people did is
1:08:49
they said, you know what, I'm going to take
1:08:51
this picture and chop it up into small
1:08:52
squares.
1:08:54
Right? So in this example, they have
1:08:57
taken this big picture and chopped it up
1:08:58
into nine little pictures. Okay? Then
1:09:02
you can take each of those nine
1:09:03
pictures.
1:09:05
Each of those nine pictures, right? If
1:09:07
you look at how it's represented,
1:09:09
it's just three tables of numbers,
1:09:11
right? The RGB values, right? So you can
1:09:15
take all those numbers and you just
1:09:16
create a giant long vector from it.
1:09:20
Okay? You have a huge long vector and
1:09:22
then you run it through a dense layer to
1:09:26
come up with a smaller vector
1:09:28
and that smaller vector is your
1:09:30
embedding.
1:09:31
That's it. But the way you transform the
1:09:34
long vector into a small vector is just a
1:09:36
dense layer whose weights can be
1:09:37
learned.
1:09:39
So what these people did is they said
1:09:41
well I'm going to first chop it up into
1:09:42
these patches and then I take each patch
1:09:44
and do a linear projection. Right? A
1:09:47
flattened patch is nothing more than
1:09:49
three tables of numbers flattened into a
1:09:50
long vector. That's what the word
1:09:52
flatten here means. And once you flatten
1:09:54
it, I'm just going to run it through a
1:09:56
dense layer. So, by the way, you will
1:09:58
see the words linear projection. It's a
1:09:59
synonym for run it through a dense
1:10:01
layer.
1:10:03
So, you run it through a dense layer,
1:10:05
right? You get these nice vectors, these
1:10:08
vectors.
1:10:09
And now you say, well, you know what? I
1:10:11
have to take the order of these things
1:10:12
into account because clearly this little
1:10:15
patch is in the top left while this
1:10:17
patch is somewhere in the middle. Right?
1:10:18
The order matters in the picture
1:10:20
otherwise every jumbled version is going
1:10:22
to be the same thing. So you use
1:10:24
positional embeddings
1:10:26
you basically say there are nine
1:10:27
positions in any picture right 0 1 2 3 4
1:10:31
5 6 7 8 there are nine positions. So I'm
1:10:33
going to create nine position embeddings
1:10:36
and then I'm just going to add them up,
1:10:39
add them up to
1:10:40
this embedding. Just like we did with
1:10:41
words. With words, each word had an
1:10:44
embedding. Each position had an
1:10:45
embedding. We added them up. Here each
1:10:47
patch has an embedding. The position of
1:10:49
the little patch in the picture has an
1:10:50
embedding. We add them up. Okay? And
1:10:53
then because we want to use it for
1:10:54
classification, no problem. We'll have a
1:10:57
little CLS token
1:11:00
and then we just run it through the
1:11:01
transformer. That's it.
1:11:04
and then you get the CLS token and then
1:11:06
you can attach a softmax to it and say,
1:11:08
"Okay, it's a bird, it's a ball, it's a
1:11:09
car.
1:11:12
That's it. This simple approach actually
1:11:14
works
1:11:16
amazingly enough."
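Here is a minimal sketch of those steps, patches, linear projection, and positional embeddings, in PyTorch (a 3x3 grid over 96x96 images to mirror the nine-patch figure; the real ViT uses many more, smaller patches):

import torch
import torch.nn as nn

B, C, H, W, P, D = 8, 3, 96, 96, 3, 64   # P x P grid, D-dim embeddings
ph, pw = H // P, W // P                  # each patch is 32 x 32

images = torch.randn(B, C, H, W)
# Chop each image into P*P patches, then flatten each patch.
patches = images.unfold(2, ph, ph).unfold(3, pw, pw)    # (B,C,P,P,ph,pw)
patches = patches.reshape(B, C, P * P, ph * pw)
patches = patches.permute(0, 2, 1, 3).reshape(B, P * P, -1)

project = nn.Linear(C * ph * pw, D)      # the "linear projection"
pos = nn.Embedding(P * P, D)             # one embedding per position

tokens = project(patches) + pos(torch.arange(P * P))   # add position info
cls = torch.zeros(B, 1, D)               # a learned [CLS] token in practice
sequence = torch.cat([cls, tokens], dim=1)   # ready for the transformer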
1:11:19
Okay, so that is the vision transformer
1:11:22
and I'm going through it fast just to
1:11:23
give you a sense for how these things
1:11:24
work. Uh any questions? Yeah. Uh my
1:11:29
question is like uh in case of uh text
1:11:31
we had fixed number of tokens that is
1:11:33
amount of words which could be there in
1:11:35
your vocabulary, the English vocabulary,
1:11:37
but here if you look at images they will
1:11:39
probably go into trillions that I know
1:11:41
like we are not talking about one image
1:11:43
but we take a whole set of
1:11:45
images and we try to subset each one of
1:11:47
them; each one would have its own, uh,
1:11:52
own weights, like own parameters. There
1:11:53
is no notion of vocabulary here. All
1:11:56
we're saying is that given any image, we
1:11:58
create nine patches, sub images from it.
1:12:02
Each of those patches gets passed
1:12:03
through a dense layer and out comes an
1:12:06
embedding. So at that point, any image
1:12:09
you give me, I'm going to get you
1:12:10
nine embeddings out of it. And once I
1:12:13
get the nine embeddings, I just throw it
1:12:14
into the meat grinder, the transformer
1:12:16
meat grinder.
1:12:20
All right. So uh another example I think
1:12:23
some of you have asked me outside of
1:12:25
class um how good are transformers for
1:12:27
structured data tabular data right for
1:12:30
tabular data in general um things like
1:12:32
XGBoost, gradient boosting, works really
1:12:34
really well so it's good to try them
1:12:36
certainly I don't think transformers and
1:12:38
deep learning networks have any great
1:12:39
edge over XGBoost for structured data
1:12:42
problems so it's worth trying both of
1:12:44
them however you can use transformers
1:12:46
for this stuff too so that's called the
1:12:48
TabTransformer, one of the first ones
1:12:50
to come out, a transformer for
1:12:52
tabular data, and again it's pretty
1:12:54
simple. All you do is
1:12:56
in any kind of input that you have, you
1:12:58
will have some categorical variables,
1:13:00
right? Like blood pressure, things like
1:13:02
that, right? Not blood pressure, bad
1:13:04
example, gender, right? Um, and so on
1:13:07
and so forth. And so what you do is you
1:13:10
take all the categorical features and
1:13:12
for each categorical feature, you create
1:13:14
embeddings
1:13:16
because a categorical feature is just
1:13:18
text.
1:13:20
A categorical feature is just text. So
1:13:22
you can create text embeddings for it.
1:13:23
No problem. Um,
1:13:27
and you take all the continuous
1:13:30
features, right? Cholesterol and blood
1:13:32
pressure and whatnot, right? To go to
1:13:34
the heart disease example, and then you
1:13:36
just collect them
1:13:38
all and create a vector out of
1:13:39
them.
1:13:41
It's just a vector. Okay? Then you run
1:13:45
the embeddings for all the
1:13:47
categorical variables through a nice
1:13:48
transformer block. And you can see here
1:13:51
it's exactly the block we have seen
1:13:52
before. No difference. And then at the
1:13:54
very end when it comes out of the
1:13:56
transformer, you take all the contextual
1:13:58
stuff coming out of the transformer and
1:13:59
then you concatenate it with the
1:14:01
continuous features.
1:14:03
Okay. And then you run it through maybe
1:14:05
one or more dense layers and boom
1:14:07
output.
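A rough sketch of that flow in PyTorch (column counts and sizes are made up for illustration; the real TabTransformer has more refinements):

import torch
import torch.nn as nn

n_cat, card, D, n_cont, B = 4, 10, 32, 6, 8   # 4 categorical, 6 continuous

embed = nn.Embedding(n_cat * card, D)         # one embedding per level
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True),
    num_layers=2)
head = nn.Sequential(nn.Linear(n_cat * D + n_cont, 64),
                     nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

cats = torch.randint(0, card, (B, n_cat))     # categorical levels as ids
conts = torch.randn(B, n_cont)                # continuous features as-is

offsets = torch.arange(n_cat) * card          # separate id range per column
ctx = encoder(embed(cats + offsets))          # contextual cat embeddings
flat = ctx.reshape(B, -1)                     # flatten the contextual part
prob = head(torch.cat([flat, conts], dim=1))  # concatenate, dense layers, out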
1:14:09
So this is a tabular data
1:14:11
transformer. And there are many you know
1:14:12
refinements and improvements over the years
1:14:14
that have come since then. But the key
1:14:16
thing I want you to remember from
1:14:18
here is that categorical variables can
1:14:21
be very easily represented as
1:14:24
embeddings. That's the key. Okay. Uh all
1:14:28
right. So that's that. Now once the
1:14:31
input has been transformed into sort of
1:14:32
this common language of embeddings, we
1:14:34
can process them without changing the
1:14:35
architecture of the block itself because
1:14:37
all it wants is embeddings. It's like
1:14:39
you give me embeddings, I give you
1:14:40
great contextual embeddings out, and
1:14:42
nobody gets hurt, right? That is the
1:14:44
deal with the transformer stack. So um
1:14:47
now, since
1:14:50
the transformer is agnostic to the kind
1:14:52
of input, as long as it comes
1:14:54
in in the form of an embedding, you can use
1:14:56
it for multimodal data very easily. So
1:14:58
for example let's say that you have a
1:15:00
problem in which you have a picture that
1:15:02
has to be sent in, some text that
1:15:03
goes in, a bunch of tabular data coming
1:15:05
in. Well, you take the text and do
1:15:08
language embeddings like we know how to
1:15:10
do you take the image and do image
1:15:11
embeddings like we just saw with the
1:15:12
vision transformer. You take tabular data
1:15:14
and do tabular embeddings like we saw
1:15:16
with the tab transformer. Once we do it,
1:15:18
it's all a bunch of embeddings
1:15:21
and then you attach a little class token
1:15:23
on top, send it through a bunch of
1:15:25
transformer blocks, and then out comes a
1:15:27
contextual class token, the contextual
1:15:29
version; run it through maybe a sigmoid
1:15:32
or a softmax, predict the label, done.
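Sketched end to end (all shapes illustrative; the random per-modality tensors stand in for the three embedding schemes just described):

import torch
import torch.nn as nn

B, D = 8, 64
text_tokens = torch.randn(B, 12, D)    # from word + position embeddings
image_tokens = torch.randn(B, 9, D)    # from ViT-style patch embeddings
tab_tokens = torch.randn(B, 4, D)      # from categorical embeddings

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True),
    num_layers=2)
head = nn.Linear(D, 1)

cls = torch.zeros(B, 1, D)             # a learned [CLS] token in practice
seq = torch.cat([cls, text_tokens, image_tokens, tab_tokens], dim=1)
prob = torch.sigmoid(head(encoder(seq)[:, 0]))   # classify via the CLS slot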
1:15:36
So this is extremely powerful, its
1:15:38
ability to handle multimodal data. Okay.
1:15:40
And that's why for example if you look
1:15:42
at Google Gemini 1.5 Pro, GPT-4
1:15:46
Vision, and so on, you can send it images
1:15:48
and a question and you'll get an answer
1:15:50
back because every modality that goes in
1:15:53
is cast into embeddings and once it's
1:15:55
embedded, once it's "embeddingized,"
1:15:58
then the transformer doesn't care. It'll
1:16:00
just do its thing.
1:16:02
It will decide, for example, that this
1:16:04
word in your question actually is highly
1:16:06
related to that patch in the picture.
1:16:09
Right? It'll just figure it out.
1:16:12
Uh, okay. That's all I had because
1:16:14
the time is nearing 9:55. Perfect. All
1:16:16
right, folks. Thanks. Have a great rest
1:16:18
of your week.
— end of transcript —