
7: Deep Learning for Natural Language – Transformers

MIT OpenCourseWare · May 11, 2026
Transcript ~13292 words · 1:16:37
0:16
So, all right. So, transformers, even
0:18
though they were originally invented for
0:20
machine translation, right, going from
0:22
English to German and German to French
0:24
and so on and so forth,
0:25
they have turned out to be an incredibly
0:27
effective deep neural network
0:29
architecture for just really a vast
0:32
array of domains. It has reached a point
0:34
where if you're actually working on
0:36
a particular problem, you almost
0:37
reflexively will try a transformer
0:39
first because it's probably going to be
0:40
pretty darn good.
0:42
Okay? So, they have just taken over
0:45
everything.
0:46
Um and obviously they've
0:48
transformed translation, which is the
0:50
original sort of target, uh Google
0:52
search, really information retrieval,
0:54
completely transformed speech
0:55
recognition, text-to-speech, even
0:57
computer vision. Even the stuff that we
0:59
learned with convolutional neural
1:00
networks, now there are transformers for
1:03
computer vision problems that are
1:04
actually quite good.
1:06
Right?
1:07
Um which is kind of shocking because
1:08
they were not even designed for that.
1:10
Um and then, you know, reinforcement
1:12
learning. And of course, all the crazy
1:14
stuff that's going on with generative
1:15
AI, large language models, multimodal
1:17
models, everything everything runs on a
1:20
transformer.
1:21
Okay? Uh and then there are numerous
1:23
special purpose systems
1:25
and I find these to be even more
1:27
interesting.
1:28
Um you know, like AlphaFold, the protein
1:30
folding AI, runs on a transformer
1:32
stack.
1:33
Okay? And I could just list examples one
1:35
after the other.
1:36
So, it's just amazing. It's incredibly
1:38
uh flexible architecture.
1:40
Um and I think we are lucky to be alive
1:43
during a time when such a thing was
1:44
invented.
1:47
And I'm not getting paid to tell you any
1:48
of this stuff.
1:50
All right, it's just amazing. Okay. So,
1:52
let's get going. We will use search um
1:55
or more broadly information retrieval as
1:57
a motivating use case. So, these are all
1:59
examples where people are typing in
2:00
natural language queries or uttering
2:02
natural language queries into a phone
2:03
and we need to sort of make sense of
2:05
what they want. And it's not like, you
2:07
know, write me a limerick about deep
2:08
learning where there could be many
2:10
possible right answers. It's more like,
2:12
okay, tell me all the flights that are
2:14
leaving from Boston going to
2:15
LaGuardia tomorrow morning between 8:00
2:16
and 9:00. Well, you better get it right.
2:19
Okay? Accuracy is a high bar.
2:21
So,
2:22
um or, you know, how many customers
2:23
abandoned their shopping cart? Find all
2:24
contracts that are up for renewal next
2:26
month. Uh you know, tell me all the
2:28
customers who ended the phone call to
2:30
the call center yesterday not entirely
2:32
pleased with the transaction. Right? The
2:34
list goes on and on. And so, in
2:37
particular, we'll focus on this
2:38
travel-related example today. Okay? Uh
2:40
find me all flights from Boston to
2:42
LaGuardia tomorrow morning, right? That
2:44
kind of query.
2:45
Um and so, in these sorts of use cases,
2:48
a very common approach historically has
2:50
been, well, we will take this, you know,
2:53
natural language query
2:55
and then we will convert it into a
2:57
structured query. By that I mean we will
3:01
parse the query and we'll extract out
3:03
key things in that query. Once we
3:05
extract out those key things, we will
3:07
reassemble it into a structured query,
3:09
like a SQL query, right? Uh SQL is just
3:12
one example of a possible structured
3:14
query. There are many many ways to
3:15
structure queries.
3:17
But SQL is sort of familiar to lots of
3:18
people, so I'm using that. So, you take
3:20
the SQL. Once you have the SQL query,
3:23
you're in a very comfortable structured
3:25
land, in which case you just run the
3:27
query through some database that you
3:28
have, get the results back, format it
3:30
nicely, and show it to the user.
3:32
Right? That's the flow.
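To make that flow concrete, here is a minimal sketch of the structured-query step once the entities have been pulled out. The slot names follow the ATIS-style convention we'll see in a moment, and the flights table and its columns are made up for illustration.

```python
# A minimal sketch (not the lecture's actual pipeline): once the entities are
# extracted from the natural-language query, building a structured query is
# just string/dict manipulation. Table and column names are hypothetical.
extracted = {
    "fromloc.city_name": "BOS",
    "toloc.city_name": "LGA",
    "depart_date.relative": "tomorrow",
    "depart_time.period_of_day": "morning",
}

sql = (
    "SELECT * FROM flights "
    f"WHERE origin = '{extracted['fromloc.city_name']}' "
    f"AND destination = '{extracted['toloc.city_name']}' "
    "AND depart_date = DATE('now', '+1 day') "
    "AND depart_time BETWEEN '06:00' AND '12:00';"
)
print(sql)
```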
3:34
So, the question becomes
3:36
um
3:37
how do we automatically extract all the
3:40
travel-related entities from this query?
3:43
Right? We want to be able to extract
3:45
BOS, LGA, tomorrow, morning, flights, so
3:49
on and so forth. These are all the
3:50
travel-related entities we want to
3:51
extract out, right? That's the problem.
3:54
And so,
3:56
we will use a really cool data set
3:58
called the Airline Travel Information
3:59
System (ATIS) data set and I'll explain the
4:01
data set in just a bit. We'll
4:02
use this as the basis for this example.
4:05
And so, the way we think about it is
4:07
that
4:08
we have a whole bunch of queries in
4:10
this data set.
4:12
And fortunately for us, the researchers
4:14
who compiled this data set,
4:16
they went through every one of these
4:18
queries, right? And we have, you know,
4:20
several thousands of them. They went
4:22
through every one of those queries and
4:24
they manually tagged each word in the
4:26
query
4:28
with what kind of travel entity it is
4:31
or none of them, right? So, for
4:33
instance, so they call them
4:35
slots. So, they will take each word in
4:37
the query and assign it to a slot, a
4:39
particular kind of slot, and I'll
4:41
explain what slot means in just a
4:42
second. Okay? That's the basic idea. So,
4:45
so, for example, if you have something
4:47
like I want to fly from
4:49
Okay? And this is a flight database, so
4:52
you can assume that everything is
4:53
related to flights and flying. So, if you
4:56
have all these words, I want to fly
4:57
from,
4:58
each of these words, these five words,
5:00
gets mapped to something called the O,
5:02
which means other.
5:04
It's the other slot, right? We don't
5:06
really care about it. It's the other
5:07
slot.
5:09
And then we come to Boston.
5:11
Oh, Boston is very special, right?
5:13
Because, you know, it's clearly a
5:15
departure city. So, we actually tag it,
5:18
we assign it this label. Think of it as
5:20
just like a classification problem,
5:21
right? A multi-class classification
5:23
problem. So, we assign it to
5:26
B-fromloc.city_name.
5:29
Okay? That is the label you assign it.
5:31
Okay?
5:32
And then you go to at. You don't care
5:34
about at. It's O, other. You come to
5:37
7:00 a.m.
5:38
And then, okay, that is depart time. So,
5:41
depart time and then another depart
5:43
time. And here you see there is a B and
5:45
then there is an I.
5:47
Right? So, what we are saying
5:49
here is that there could be entities that
5:51
are described using more than one word.
5:54
Like 7:00 a.m., right? Two tokens.
5:57
And for that, we need to be able to
5:58
figure out, okay, the second token is
6:00
really
6:01
part of the first token. Together,
6:03
they define the notion of a departure
6:05
time. So, what the B means is that
6:08
this is the token in
6:10
which we are beginning the idea of a
6:12
departure time. And then I means we are
6:15
in the middle of this description.
6:17
B is for beginning.
6:19
So,
6:21
you can see here. So, there is a B here
6:23
and there is an I. B for beginning, I
6:25
for intermediate or in the middle.
6:27
Um and then at, we don't care. 11:00 B
6:31
arrive time.
6:33
Boop boop boop. Morning arrive time
6:35
period.
6:38
So, this is an example of how you can
6:40
take a sentence and then manually label
6:43
every word in the sentence with
6:45
something that's relevant to your
6:46
particular problem.
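In code, one manually tagged example looks something like this. A minimal sketch: the point is the one-label-per-token pairing, and the exact label strings are just illustrative of the B/I/O convention.

```python
# A small sketch of an ATIS-style training example after manual tagging:
# one slot label per token, using the B-/I-/O convention described above.
tokens = ["i", "want", "to", "fly", "from", "boston", "at", "7:00", "am"]
slots  = ["O", "O", "O", "O", "O",
          "B-fromloc.city_name",
          "O",
          "B-depart_time.time", "I-depart_time.time"]

assert len(tokens) == len(slots)   # one label per word, always
for tok, slot in zip(tokens, slots):
    print(f"{tok:10s} -> {slot}")
```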
6:50
And
6:51
turns out, with these researchers,
6:54
every word is classified into one of 123
6:56
possibilities.
6:59
Okay? Um so, aircraft code, airline
7:02
code, airline name, airport code,
7:04
airport name, arrival date, relative
7:07
name. Now, you get the idea.
7:08
They want a round trip versus a one-way.
7:11
The relative dates, because if
7:13
somebody says tomorrow morning, it's
7:14
relative to today, so you need a notion
7:16
of absolute time and you need a
7:17
notion of relative time.
7:19
So, these researchers basically thought of every
7:20
possibility. And
7:23
so, every word in every one of these
7:25
queries is assigned one of these 123
7:27
labels.
7:32
Any questions on the setup?
7:36
Um
7:39
Did they have to contextualize what
7:42
comes before, let's say, Boston? So,
7:44
if someone says from
7:46
Boston, there should be
7:47
contextualization of the from with
7:49
Boston. So, because they did it
7:50
manually, they could just read it and
7:52
figure out what they mean,
7:54
right? That Boston is the departure
7:55
city and not the arrival city. So, do
7:57
they have two tags for Boston, which is
7:59
some like, you know, departure city as
8:01
well as arrival city for the
8:03
word Boston? In that particular phrase,
8:05
it's clear from that particular
8:07
case, in context, as a human
8:08
reading it, that Boston is a departure
8:10
city. So, it only gets that tag. In
8:13
that sentence. In some other sentence
8:15
where people are coming into Boston,
8:16
it'll have a different tag.
8:21
I was wondering, what if my query, unlike the
8:23
others, basically has two parts? Like, for
8:25
example, if my query was
8:27
give me flights from Boston at 7:00 a.m.
8:29
and
8:29
uh the
8:31
flights from Denver at 11:00 a.m.
8:33
You mean like a compound query? Yeah.
8:35
So, this one only takes single queries
8:37
into account.
8:39
Because most people are like, you know,
8:40
give me a flight from here to there. Or
8:42
what is the cheapest thing from here to
8:43
there? And we'll see examples of queries
8:45
later on.
8:50
Okay.
8:51
Uh all right. So, that's that's the
8:52
deal.
8:53
So, basically, you
8:56
know,
8:58
uh this problem that we have here is
8:59
really a
9:02
word-to-slot multi-class classification
9:04
problem.
9:06
Okay?
9:07
Um because if you look at that
9:09
input, we want to be able to take that
9:10
input and a really good model will then
9:12
give you this as the output.
9:17
Right? Because this is what a human
9:18
would have done.
9:20
So, that is our problem. Okay?
9:23
So, the question is
9:25
um the key thing here is that each
9:27
of the 18 words in this particular
9:29
example must be assigned to one of 123
9:32
slot types, right? Each word. It's not
9:34
like we take the entire query and
9:36
classify the entire query into one of
9:38
123 possibilities. Every word in the
9:40
query has to be classified.
9:42
That is the wrinkle.
9:45
Okay?
9:46
So, now, if we could run the query
9:49
through a deep neural network and
9:51
generate 18 output nodes,
9:54
it goes through some unspecified deep
9:55
neural network. And when it comes out
9:57
the other end, the output layer has 18
9:59
nodes.
10:00
Okay?
10:01
Because that is the
10:03
dimension of the
10:04
output that we care about. 18 in, 18
10:06
out, right?
10:09
And then for each one of those 18 nodes,
10:11
maybe we could attach a 123-way softmax
10:15
to each of those 18 outputs.
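In Keras-style code, that per-token softmax head might look like the sketch below. The sizes are placeholders, the encoder in the middle is left unspecified, and the actual Colab for this lecture may be organized differently.

```python
# A minimal sketch of the "18 in, 18 out" idea: every token position gets its
# own 123-way softmax. Vocab size, embedding size, and sequence length are
# placeholder values; the encoder in the middle is a stand-in.
import tensorflow as tf
from tensorflow.keras import layers

max_len, vocab_size, num_slots, emb_dim = 18, 1000, 123, 100

inputs = layers.Input(shape=(max_len,), dtype="int32")       # token ids
x = layers.Embedding(vocab_size, emb_dim)(inputs)            # (max_len, emb_dim)
# ... some encoder goes here (self-attention layers, etc.) ...
outputs = layers.Dense(num_slots, activation="softmax")(x)   # (max_len, 123)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```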
10:20
By the way, isn't it cool that we can
10:21
just casually talk about sticking a
10:23
123-way softmax onto each one of the 18
10:25
nodes?
10:27
Folks, wake up.
10:31
You're not easily impressed. I'm
10:32
impressed by that.
10:34
So, okay.
10:37
So, here's the key thing,
10:39
right? We want to generate an output
10:41
that has the same length as the input.
10:45
But the problem is the inputs could be
10:47
of different lengths as they come in.
10:48
They could be short sentences, long
10:50
sentences, we don't know, right?
10:52
Yet we need to accommodate this range
10:55
this variable size of input that's
10:56
coming in.
10:58
But the key thing is the output has to
10:59
be the same length as the input, the same
11:00
cardinality as the input.
11:02
Okay, that's a one big requirement.
11:05
In addition, we want to take the
11:07
surrounding context of each word into
11:08
account, right? To go to Ronak's
11:10
question, when you see the word Boston,
11:12
you can't conclude whether it's a
11:14
departure city or arrival city.
11:15
You have to look at what else is going
11:17
on around it. Is there a from? Is there
11:19
a to? Things like that to figure out
11:21
how to tag it. So, clearly the
11:22
context matters.
11:24
And then we clearly have to take the
11:25
order of the words into account.
11:28
Going from Boston to LaGuardia is very
11:29
different than going from LaGuardia to
11:30
Boston.
11:31
So, clearly the order matters.
11:33
Right? So, the context matters and the
11:35
order matters. And the output has to be
11:37
the same length as the input.
11:40
Okay?
11:42
So, context matters, right? Just a few
11:44
fun examples.
11:45
Remember from the last week that the
11:47
meaning of a word can change
11:48
dramatically depending on the context.
11:50
And we also saw that the standalone or
11:53
non-contextual embeddings that we saw
11:55
last week, like GloVe, um
11:58
you know, they don't take context into
11:59
account because they give a single
12:01
unique embedding vector to every word.
12:04
And if a word ends up having lots of
12:05
different meanings, that vector is kind
12:07
of some mushy average of all those
12:09
meanings.
12:11
Okay. So,
12:13
the word see. I will see you soon. I
12:15
will see this project to its end. I see
12:16
what you mean. Very different meanings
12:18
of the word see. This is my favorite,
12:20
bank.
12:21
Uh I went to the bank to apply for a
12:23
loan. I'm banking on the job. I'm
12:24
standing on the left bank. And so on. Uh
12:27
it. Oh, this is actually
12:29
a good one. The animal didn't cross
12:31
the street because it was too tired. The
12:33
animal didn't cross the street because
12:34
it was too wide.
12:37
Can you imagine
12:39
a deep neural network looking at this
12:40
word it and trying to figure out what
12:42
the heck does it word it mean?
12:44
What is it referring to?
12:46
Tricky, right?
12:48
Um and then, you know, if you take the
12:50
word station, and I have the station
12:52
example here because we're going to use
12:53
it a bit more for the rest of the lecture.
12:55
You know, the station could be
12:57
a radio station, a train station, being
12:59
stationed somewhere, the International
13:00
Space Station. The list goes on.
13:03
So, clearly order matters. I mean,
13:04
context matters.
13:05
And
13:08
clearly order matters. You can come up
13:10
with your own examples. Let's keep
13:12
moving.
13:13
Okay?
13:15
So, the Transformer architecture
13:18
is a very elegant
13:20
architecture
13:22
which checks these three boxes
13:23
beautifully.
13:25
Okay?
13:26
Um it takes the context into account,
13:27
order into account, and then, you know,
13:29
whatever is produced out there
13:32
is the same length as whatever is coming
13:33
in.
13:34
And the reason it's called the
13:35
Transformer
13:36
is because if 10 things come in,
13:39
10 things go out, but the 10 things that
13:41
go out are a transformed version of the
13:43
10 things that came in.
13:46
That's why it's called the Transformer.
13:47
Okay?
13:48
If 10 things came in and like one thing
13:50
goes out, well, sure, it's been
13:52
transformed, but what is it? It's some
13:54
weird thing. But when 10 comes in and 10
13:56
goes out, the 10 is preserved. Each
13:58
one is getting transformed in an
13:59
interesting way.
14:01
That's why it's called the Transformer.
14:04
So, developed in 2017, just dramatic
14:07
impact.
14:08
So, by the way, the effect of the
14:09
Transformer, um
14:11
Google had spent a lot of research effort on
14:13
machine translation and obviously
14:15
search. Uh and then when the Transformer
14:17
was invented, uh they took a model called
14:20
BERT, which we will uh see on Wednesday
14:22
in detail, and then they introduced BERT
14:25
into their search, and the results were
14:28
dramatic.
14:29
And from what I've read, apparently the
14:32
impact of doing that was significant.
14:34
Typically, when you make an improvement
14:35
to search, the improvement is very, very
14:37
marginal because it's already a very
14:38
heavily optimized system.
14:40
And then when the Transformer thing came
14:42
along, there was actually a significant
14:43
jump in search quality. So, for example,
14:46
and you can actually read this blog post
14:48
uh which came out when they introduced
14:49
BERT into search. It gives you a bit
14:51
more detail. But here, so if
14:54
you were querying something like uh you
14:56
know,
14:57
"Brazil traveler to USA needs a visa."
15:00
Right? You would think that it
15:02
should give you information about how to
15:03
get a visa if you're a Brazilian wanting to
15:04
come to the US, right? Uh but it turns
15:06
out the first result was how US citizens
15:09
going to Brazil can, you know,
15:11
get a visa.
15:13
So, clearly it's not taking the order
15:14
into account.
15:16
Uh but once they introduced it, boom,
15:19
the first thing was the US Embassy in
15:20
Brazil.
15:21
And a page on how to get a visa.
15:24
So, the effect was dramatic.
15:26
And so, this is a seminal paper,
15:30
right? And it's actually worth reading
15:31
the paper. And uh you
15:34
know, this picture is
15:35
like an iconic picture at this point
15:38
in the deep learning community. And we
15:39
will actually understand this picture
15:41
by the end of Wednesday.
15:43
Um and so, but the funny thing is that
15:45
when the researchers came up with it,
15:46
they didn't realize, in some sense, like
15:48
what they had stumbled on uh because
15:50
they were really focused on machine
15:51
translation.
15:53
It's only the rest of the research
15:54
community that took it and started
15:55
applying it to everything else and found it
15:56
to be really, really effective.
15:59
Okay. So, we're going to take each one
16:01
of these things and figure out how to
16:02
address them and thereby build up the
16:04
architecture.
16:05
Any questions before I continue?
16:07
Yeah.
16:11
Is there any uh
16:13
benefit to discarding some of those
16:16
unclassified nodes before it goes out
16:18
rather than going like you have 18 words
16:21
input, discarding all the ones that
16:23
don't actually matter and just doing
16:24
like eight for your output?
16:26
Yeah, yeah. I think that's a totally
16:28
fine way to think about it. Basically,
16:29
what you're saying is that can we have a
16:31
two-stage model? The first-stage model
16:33
is like a O non-O classifier. And the
16:35
second-stage model only goes after the
16:37
non-Os. That's a totally fine way to do
16:38
it.
16:39
Yeah.
16:40
But as you can see, even if you
16:41
go with just a simple one-stage
16:43
model, if you use a Transformer, you get
16:44
fantastic accuracy.
16:47
And we'll do the Colab in a bit.
16:50
Uh all right. So, let's take the first
16:52
thing. How do you take the
16:53
context of everything around the word
16:55
into account?
16:56
So,
16:59
so let's say that this is the
17:01
sentence we have. The train slowly left
17:03
the station.
17:04
Okay? For each of these words,
17:06
we can calculate a standalone embedding,
17:09
say something like GloVe.
17:11
Okay? So, I'm just depicting these
17:13
standalone embeddings using these uh
17:15
you know, thingies here.
17:18
Please appreciate them because it took
17:19
me a while to get them to do in
17:20
PowerPoint.
17:22
Okay? So, these are W1 through W6. These
17:24
are the vectors standing up. Okay?
17:27
Um now, let's say that So, we can easily
17:29
do that.
17:30
Now, what we want to figure out is we
17:32
want to focus on the word station.
17:34
And since station could mean very
17:36
different things in different contexts,
17:37
we want to figure out how do we actually
17:39
take
17:40
station's embedding and contextualize it
17:43
using all the other words that are going
17:45
on in that sentence.
17:46
Okay? Clearly, it's a train station.
17:49
So, we need to take the fact that there
17:50
is a train involved to alter the
17:53
embedding of the word station. Right?
17:55
That's what taking context into account
17:56
actually means.
17:58
So,
17:59
how can we modify station's embedding so
18:03
that it incorporates all the other
18:04
words? That's the question.
18:07
Okay?
18:08
So, when you look at it this way,
18:11
imagine just for a moment,
18:14
just for a moment,
18:15
that
18:16
we
18:17
Now, some of the other words in the
18:18
sentence don't matter. The word the
18:20
probably doesn't matter.
18:22
But some of the other words like train,
18:24
slowly, left probably do matter.
18:26
And suppose, just magically, we have
18:29
been told
18:30
for all the other words in the sentence,
18:32
this is how much weight you have to give
18:34
to them. These don't give it any weight.
18:36
Those give it a lot of weight. Okay?
18:38
Suppose we are told that.
18:39
Or to put it another way, and this
18:41
is the word that's heavily used in the
18:42
literature,
18:44
someone tells you how much attention to
18:46
pay to the other words.
18:47
Whether you got to pay it a lot of
18:48
attention or very little attention.
18:50
Okay?
18:51
And this
18:52
how much attention to pay is given in
18:54
the form of a weight that you can use.
18:55
Okay? So,
18:57
um
18:58
if you look at it that way, from this
19:00
notion of which word should I give a lot
19:01
of weight to and very little weight to,
19:04
in this example, intuitively, which
19:05
words do you think should get the most
19:06
weight and which words do you think
19:07
should get the least weight?
19:09
Yeah. Train.
19:11
Train. Right.
19:12
Time matters.
19:13
Uh
19:14
you can do one at a time.
19:16
Train. Okay, thank you.
19:18
Uh
19:18
okay. Others?
19:21
Slowly.
19:22
Slowly. Right. So, that also seems to
19:23
have some bearing on it. What about
19:25
words that we don't
19:27
think are going to
19:28
help at all?
19:31
The. The. Exactly. It probably doesn't
19:33
do much here. Some context it actually
19:35
might make a difference, but in this
19:37
sentence, maybe not.
19:38
Right? Intuitively.
19:40
So,
19:42
we should probably give a lot of weight
19:43
to train, maybe a little to slowly and
19:45
left, and hardly anything to the.
19:47
Okay?
19:49
And so, this intuition that we have
19:52
can be written numerically as maybe we
19:56
have a bunch of weights that add up to
19:58
one.
20:00
Okay?
20:02
Okay, maybe something like this. So, we
20:03
are saying: train, 30% weightage,
20:07
maybe 8% weightage to left, maybe 12%
20:11
weightage to slowly, uh and then as you
20:14
will see here,
20:15
the station's own embedding also plays a
20:17
role. Because we want to take its own
20:20
standalone embedding and just move it
20:22
slightly, change it slightly, which
20:23
means that has to be the starting point.
20:26
So, it will get a lot of weight. We
20:28
can't ignore itself, in other words.
20:30
Right? So, we give it maybe 40% weight.
20:33
By the way, these numbers I just made
20:34
them up.
20:35
Okay? Uh yeah.
20:38
I'm sorry, it's a quick question. So,
20:40
the weights
20:43
are they
20:44
Are they standalone for the
20:46
context of the entire sentence or are
20:48
they related to station that we started
20:50
off with? These six numbers are
20:54
only pertinent to station.
20:56
And for each word, we're going to do
20:57
something similar.
20:59
Yeah.
21:01
And at this point, does the model
21:03
understand order? Because like I'm just
21:05
thinking of like left because like I
21:07
gave it a
21:08
very low weight.
21:09
But let's say left
21:11
comes slowly, leave left station. The
21:14
station only have the two be higher.
21:15
Yeah, correct. So, at this point, we are
21:18
not worrying about order. We are only
21:20
worrying about context.
21:22
Later, we'll take order into account.
21:24
But how does the model know that left
21:25
here is of lesser importance because
21:28
it's a verb rather than a
21:31
It's It has to figure it out.
21:33
We don't It doesn't We We are just
21:34
giving it a whole bunch of capabilities.
21:36
How it manifests those capabilities is
21:38
all going to emerge from training.
21:42
Okay. So, all right. So, let's say we
21:45
have something like this. So, what we
21:46
can do,
21:48
right? And we'll get to the
21:49
all-important question of where do we
21:50
get these numbers from in just a moment.
21:51
But suppose you had the numbers,
21:54
how can we use these numbers to
21:56
contextualize W6? What can we do?
22:00
What is the simplest thing you can do?
22:05
You have W6, you want to make it a new
22:07
W6, which is now contextual, is aware of
22:10
what else is going on. Okay?
22:17
It's working now, I think.
22:20
We can take a weighted average. Exactly.
22:22
Exactly. So, when you have a bunch of
22:23
things and you have a bunch of weights
22:25
and, you know, we
22:26
have to somehow modify one of those
22:27
things with those weights, the simplest
22:29
thing you can do is to take a weighted
22:30
average.
22:31
Right? So, that's exactly what we're
22:33
going to do.
22:34
So, we're going to take all these
22:35
weights
22:37
and just like move them up.
22:39
Okay?
22:40
Move them up.
22:42
Don't even get me started on how long it
22:44
took me to get this arrow to run.
22:46
I don't know about you, folks. Is it
22:47
It's extremely painful to get the U-turn
22:49
arrows to work in PowerPoint.
22:51
Okay?
22:52
Anyway, uh back to work. So,
22:54
so we just move these up here, okay? So,
22:57
now we can do 0.05 * this vector + 0.3 *
23:01
that vector and so on and so forth.
23:03
And the result is just another vector.
23:06
Right?
23:08
And that vector, folks,
23:11
is the contextual embedding vector of
23:13
station.
23:15
Okay? That was the standalone embedding.
23:17
And now we multiplied this by
23:19
that, that by whoop whoop whoop, add them
23:21
all up, and then you get a new vector.
23:24
And contextual embeddings have this
23:27
bluish kind of color.
23:29
Okay?
23:30
And I'll maintain that color scheme as
23:32
we go along.
23:33
So, that's it.
23:36
That's it. That's the idea.
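Here is that weighted-average step as a tiny numerical sketch, with random vectors standing in for the standalone embeddings and the made-up weights from the slide.

```python
# A tiny numerical sketch of the weighted-average step. Each standalone
# embedding is a short random vector here; in practice they would be
# GloVe-style vectors. The weights are the made-up numbers from the slide.
import numpy as np

rng = np.random.default_rng(0)
words = ["the", "train", "slowly", "left", "the", "station"]
W = rng.normal(size=(6, 4))          # standalone embeddings w1..w6 (dim 4)

weights = np.array([0.05, 0.30, 0.12, 0.08, 0.05, 0.40])  # sums to 1
w6_hat = weights @ W                 # weighted average = contextual "station"
print(w6_hat)
```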
23:38
Any questions?
23:41
Yeah.
23:43
How did you come up with the original
23:44
weights again? You just kind of guessed?
23:46
No, these weights I just
23:49
hand typed them in manually just to make
23:51
the point. And And now I'm going to talk
23:53
about how we are actually going to
23:54
calculate them.
23:57
Okay.
23:58
Uh all right, cool. So, now I'm going to
24:00
uh okay, enough pictures. Let's switch
24:03
to some math. So,
24:05
so basically, let's write it
24:07
a bit more formally.
24:08
So, we have these W1 through W6, which
24:11
are the standalone embeddings.
24:12
And then for station, we want to
24:14
calculate, you know, W6 with a little
24:16
hat on it, which is the contextual
24:17
embedding. And the way we do it is to
24:19
say we calculate some weights for each
24:22
of these words. So, this weight S16
24:25
means that the weight
24:27
of the first word on the sixth word,
24:30
which happens to be station.
24:32
The The weight of the second word on the
24:33
sixth word, and so on and so forth. And
24:35
so, what we are saying is that W6 is
24:38
just, you know, this weight times W1,
24:40
this time W whoop whoop whoop,
24:41
that's it.
24:43
Okay?
24:45
I have to inflict all these, you know,
24:47
subscripts and all that because
24:48
you know, we need it.
24:51
All right. So, that's it.
24:53
That's what we have.
24:56
Now, let's talk about Okay, any
24:58
questions on the mechanics of it
25:00
before I get to Okay, where do these
25:01
weights come from?
25:02
Yeah.
25:06
Utilizing something like Google, for
25:08
example, like how does it understand
25:11
like the context of
25:12
new words
25:13
and context like
25:16
process immediately through the training
25:18
data the users played or
25:20
like basically
25:21
>> like a totally new word that didn't
25:22
exist before? A new word or a new
25:24
context to a word that already exists.
25:27
No, I think that the context is supplied
25:29
because the query coming into something
25:31
like Google is a full sentence.
25:33
And we only take that sentence and take
25:35
only the sentence into account as the
25:36
context for us.
25:37
So, the context is always present to us
25:40
when we get the input.
25:41
But the other question you had uh of
25:44
Okay, what if there's a brand new word
25:45
you've never seen before, for which
25:46
there is not even a standalone
25:47
embedding? What do you do then?
25:49
So, let's punt on that till Wednesday
25:51
because I have to talk about something
25:53
called byte pair encoding and stuff like
25:55
that before I can answer that.
25:57
And And really quickly, does that
25:59
immediately translate to their
26:00
predictive search queries?
26:03
Utilizing like verb
26:06
Yeah, a new word, for example.
26:08
Does that automatically get applied to
26:10
the predictive search queries like when
26:12
we're saying how to and then just home?
26:14
Oh, you mean like the auto complete?
26:15
You know, auto complete uses a slightly
26:17
different mechanism.
26:18
Um I They had a very complicated
26:20
non-transformer thing for a long time.
26:23
I'm sure they have a transformer version
26:24
now, but I don't I'm not privy to how
26:26
exactly they've done it. So, I don't
26:28
quite know how they do it. But what
26:29
you're proposing is a reasonable way to
26:31
think about it.
26:33
Yeah.
26:34
Um my question is like we have six
26:36
words, station and but number parameters
26:39
as in weights, let's say 10 of them.
26:41
And then we have calculated the
26:43
contextual version of W6. Yeah. So, this
26:46
has a different parameter or it remains
26:48
the same? It replaces. Okay.
26:50
Yeah, W becomes W6 becomes W6 hat.
26:54
Okay. And how we are expecting
26:57
Right.
26:58
This contextual word will be really
27:00
good. That's what we want.
27:07
Do we lose that
27:08
or retain it? No, we lose it. And as you
27:11
will see here, as it flows through the
27:12
transformer, it's getting more and more
27:14
and more contextualized.
27:16
So, it's a left-to-right flow.
27:20
All right. Uh all right, great. So, the
27:22
By the way, this thing that we did for
27:23
station, we will do it for each word in
27:25
the sentence.
27:27
The same exact logic. Obviously, the
27:30
weights are going to change.
27:31
Okay? But what will happen is that W1
27:34
through W6 will become W1 hat through W6
27:37
hat.
27:39
The same exact logic is going to hold.
27:41
Okay? That's what I just don't have the
27:43
slides for it because it's a waste of
27:44
time.
27:45
The same exact logic is going to hold.
27:47
All right. Now, switch gears
27:48
and and answer the all-important
27:50
question of where are the weights going
27:51
to come from.
27:52
Okay? So, the intuition here is really
27:54
really interesting and elegant.
27:56
So, clearly the weight of a word
27:59
should be proportional to how related it
28:02
is to the word station.
28:04
Right?
28:06
The word train clearly is very related
28:08
to the word station.
28:09
The word the, it's not clear how
28:11
related it is. Probably not all that
28:12
related. So, the relatedness matters to
28:15
the weight. More related, higher the
28:17
weight, right? Just intuitive.
28:19
So, one way to quantify how related two
28:21
words are is to take their standalone
28:23
embeddings and calculate the dot
28:25
product.
28:28
Okay? So, um
28:30
in case folks have
28:33
sort of forgotten about the dot product,
28:39
Oops, that's not what I want.
28:42
So, um, let's say you
28:44
have a vector.
28:50
Okay, let's say this is the
28:51
vector for
28:52
train.
28:55
This is the vector for station.
28:59
Okay? So, the dot product of these two
29:01
vectors,
29:05
I'll write it as
29:09
train · station
29:12
equals
29:13
basically the length
29:17
of
29:20
the vector for train
29:23
times the length
29:26
of the vector for station
29:30
times the cosine
29:33
of the angle between them.
29:36
Okay?
29:38
Okay?
29:42
So, how long is each vector?
29:45
Product of the two and then the angle
29:46
between them. Okay? Now, let's assume
29:48
for simplicity that these lengths are
29:50
roughly the same.
29:52
They're just one unit length. Okay? Just
29:54
roughly.
29:55
So, if you assume that,
29:57
okay? This thing, let's say, becomes
30:01
becomes one, let's say.
30:03
Okay?
30:05
This thing becomes one.
30:07
So, all the action
30:09
is here.
30:11
Okay?
30:12
So, all the action is here.
30:14
So, basically, the dot product of these
30:15
two vectors is really the cosine of
30:17
angle between them.
30:20
So, now, the question is, if you have
30:22
something like this,
30:27
right? Which are very close to each
30:28
other, the cosine of a very small angle,
30:31
actually, the cosine of zero is what?
30:34
One.
30:35
So, if the angle is really, really
30:37
small, the cosine is going to be very
30:39
close to one.
30:40
Right? Because the cosine
30:41
of zero is one. So, this thing is going
30:43
to be, you know, pretty close to one.
30:46
If you have a cosine of two vectors that
30:49
are like this, 90° apart, what is the
30:51
cosine?
30:52
Zero. They're orthogonal, right? Which
30:55
maps to the English orthogonal.
30:58
So, the cosine of that is zero.
31:00
And then, if you have something like
31:01
this,
31:03
where they're literally pointing in
31:04
opposite direction,
31:07
what is the cosine of that 180?
31:09
Minus one.
31:11
So, that's it. So, if
31:13
these two vectors are
31:14
very close to each other,
31:16
the cosine of the angle between them is
31:18
going to be very close to one. If they
31:19
are really kind of unrelated, it's going
31:21
to be zero. If they're anti-related,
31:22
it's going to be minus one.
31:24
Right? So, that's how dot products
31:27
capture this notion of closeness or
31:28
relatedness.
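A quick numerical check of that intuition. The vectors below are made up and normalized to unit length, so the dot product is exactly the cosine of the angle.

```python
# For (roughly) unit-length vectors, the dot product is just the cosine of
# the angle between them. The example vectors are made up for illustration.
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

train   = unit(np.array([0.9, 0.4, 0.1]))
station = unit(np.array([0.8, 0.5, 0.2]))
the     = unit(np.array([-0.2, 0.1, 0.95]))

print(train @ station)   # close to 1: small angle, highly related
print(train @ the)       # near 0: roughly orthogonal, unrelated
print(train @ -train)    # -1: pointing in opposite directions, anti-related
```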
31:30
Okay?
31:31
So, all right. Um iPad.
31:36
So, we can use the dot product of these
31:37
embeddings to capture relatedness.
31:40
And so, okay, iPad done.
31:43
So, now that
31:45
we know that dot products can be used,
31:48
we can't use them as is because we need
31:49
to do one more thing to make them proper
31:51
weights. And what I mean by proper
31:53
weights is that we want the weights
31:55
to be, first of all, non-negative, and
31:58
we want them to add up
31:59
to one, right? That's what a
32:00
weighted average actually is going to
32:01
mean.
32:02
But these cosines could be negative.
32:05
Right? And so, we need to now adjust
32:07
them to make them proper so that every
32:08
one of them is guaranteed to be
32:10
non-negative and they will add up to
32:11
one.
32:12
When was the last time you had to take a
32:14
bunch of numbers, which could be
32:15
anything, and then somehow make sure
32:18
that they are going to be positive,
32:20
non-negative, and they add up to one?
32:22
When was the last time?
32:23
Yeah, softmax. Exactly. So, we'll do the
32:25
same trick.
32:27
So, what we'll simply do is we'll just,
32:29
you know, exponentiate them, right? So,
32:32
like this W1 W6, this angle bracket
32:35
thing is the dot product. That's the
32:36
notation I'm using. EXP of that is just
32:39
you exponentiate them, e raised to that.
32:41
And once you exponentiate them, they all
32:42
become non-negative, and then we just
32:44
divide each by the sum of everything.
32:46
So, the whole thing will become like
32:47
a probability, right? It'll just add up
32:48
to one.
32:50
Make sense? So, that's how we take
32:52
arbitrary numbers and make them proper
32:53
weights.
32:56
All right.
32:59
So,
33:01
to summarize,
33:02
from embeddings to contextual
33:04
embeddings, that's what we do.
33:05
We take all the stand-alone embeddings,
33:08
we calculate these weights using this
33:09
formula, and then we just do the
33:11
weighted average, and we arrive at the
33:12
contextual embedding, and boom, done.
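Put together as code, this parameter-free self-attention step is only a few lines. A sketch with random embeddings; real standalone embeddings would be GloVe-style vectors.

```python
# Parameter-free self-attention: for every word i, the weight on word j is
# the softmax over j of the dot product <w_i, w_j>, and the contextual
# embedding is the weighted average of all the standalone embeddings.
import numpy as np

def self_attention(W):
    """W: (num_words, dim) standalone embeddings -> contextual embeddings."""
    scores = W @ W.T                                      # all pairwise dot products
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # each row sums to 1
    return weights @ W                                    # weighted averages

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))        # "the train slowly left the station"
W_hat = self_attention(W)
print(W_hat.shape)                 # (6, 4): six in, six out, same dimension
```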
33:16
Okay?
33:17
And so, by choosing weights in this
33:20
manner, the embedding of a word gets
33:22
dragged closer to the embeddings of the
33:24
other words in proportion to how related
33:26
they are. So, just imagine for a second,
33:29
right? In this case, station obviously
33:30
has many contexts, but let's assume for
33:31
a second that it only has the train context
33:33
and the radio station context.
33:35
Okay?
33:37
In the current context, train is closely
33:39
related to station, and therefore exerts
33:40
a strong pull on it.
33:42
Right?
33:43
Now, radio is also related to station,
33:45
but it doesn't appear in the
33:47
sentence.
33:48
So, effectively, it has a weight of
33:49
zero.
33:52
Okay? And that's the beauty of it. And
33:55
And please do not ask me things like,
33:56
you know, I I was listening to a great
33:58
song on the radio station and the train
33:59
pulled out of the station.
34:01
Okay? Transformers can deal with stuff
34:03
like that. Okay? But yeah, but you get
34:05
the idea, the main idea.
34:07
So, by moving station closer to
34:09
train,
34:11
by paying more attention to train, we
34:13
are contextualizing the word station's
34:15
embedding to the context of trains,
34:18
platforms, departures, tickets, and so
34:20
on. It's like this portal into the whole
34:22
train world.
34:25
Right? It's beautiful. This simple idea
34:27
will get you there.
34:30
Okay?
34:31
So, this, folks, is called
34:33
self-attention.
34:36
What we just described is called
34:37
self-attention.
34:39
And it's the key building block of
34:41
transformers.
34:42
Okay? Um and so, to
34:44
summarize, stand-alone embeddings come
34:46
in, contextual embeddings go out.
34:50
Any questions?
34:52
Uh yeah.
34:54
Uh I'm still struggling a little bit
34:56
with the intuition of the word
34:58
contextual embedding. So, like the
35:00
weight of station in the station
35:02
embedding, how how should I think about
35:03
that? It seems intuitive that it would
35:05
be high for all contextual embeddings,
35:07
but I assume that's not the case.
35:12
It'll be high. It'll typically be a
35:13
high number because the cosine of
35:15
the vector to itself is going to be
35:17
one, right? So,
35:19
it's going to be pretty high, but it
35:20
there's no guarantee it's going to be
35:21
the highest.
35:22
Right? Because the
35:24
length doesn't actually have to be one. They
35:26
could be anything. We try to keep them kind of
35:28
smallish, but they don't have to be.
35:30
Uh so, the way I would think about it is
35:31
imagine that you take an average of
35:33
everything else first, and then you
35:35
average it with the old
35:37
embedding.
35:38
Effectively, it's the same as just
35:39
calculating the different weights and
35:40
averaging the whole thing together.
35:42
Sure.
35:44
So, why should you say that the
35:45
embedding of a word would be the same
35:47
number but same place? But is this the
35:50
reason why you need a contextual
35:52
embedding?
35:53
But even if it's like a
35:55
other word
35:56
and it's not related, that's what
35:59
I'm saying. Correct. Correct. Exactly.
36:01
Exactly. And the other thing to remember
36:02
is that
36:04
by keeping the original input size, sort of
36:07
the input cardinality, intact
36:09
as you move through the transformer
36:10
stack,
36:11
when you finally come out the other end,
36:12
there is sort of no loss of information.
36:14
And in the very end, you can choose to
36:16
aggregate, simplify, summarize, and so
36:18
on and so forth. It preserves your
36:19
optionality as long as possible.
36:23
Do you know
36:25
how long the contextual
36:27
embedding is?
36:28
Is that a factor between the
36:29
two?
36:31
You know
36:33
Yeah, so, what we do is the sentence
36:34
comes in. There's a whole notion of
36:35
something called a context window, or
36:37
what is the sort of the maximum length
36:39
that these sentences will handle, and
36:40
that's a parameter you can set. And
36:42
we'll come to that when you actually
36:43
look at the Colab.
36:44
Um
36:46
Was that a question in the middle? No.
36:48
Okay.
36:49
All right. So, that is self-attention.
36:53
Um and now,
36:55
because that's felt too easy,
36:58
we're going to do a little tweak called
37:00
multi-head attention.
37:02
So,
37:03
this is the self-attention we
37:04
just saw.
37:06
What we can do is we can be like, you
37:07
know what?
37:08
Why can't we have more than this? Why
37:10
can't we have more than one of these?
37:12
So, this is called an attention head,
37:13
self-attention head. We'll have multiple
37:16
self-attention heads. Okay?
37:18
Now, and I'll come back to the top thing
37:20
in a second, okay? So, the question
37:22
is, why should we have multiple
37:23
self-attention heads?
37:25
Because a particular attention head is
37:26
going to pick up some patterns. The
37:28
reason is because
37:30
it'll help us attend to the multiple
37:32
patterns that may be present in a single
37:34
sentence.
37:35
So far, when I've been explaining, uh
37:37
I've sort of basically been looking at
37:38
what the meaning of these words are.
37:40
Just the meaning of these words. But in
37:42
any complicated sentence, you have to
37:44
worry about grammar, you have to worry
37:45
about tense, you have to worry about
37:47
tone. You have to worry about facts
37:49
versus, you know, opinions. There could
37:51
be any number of complicated patterns
37:53
that are sitting in a simple sentence.
37:55
Which means, well, there is just not one
37:57
way to pay attention. There could be
37:59
many ways of paying attention, many sorts
38:02
of needs to pay
38:03
attention. Right?
38:05
Which means, let's have many
38:07
of these attention heads.
38:09
And each one could be learning something
38:10
else. It's exactly like having lots of
38:12
filters in a convolutional network.
38:14
Right? Uh one filter might learn a line,
38:16
another filter might learn a curve, and
38:17
so on and so forth. And we don't want to
38:19
decide a priori, oh, you're going to
38:21
learn a line, right? Similarly here,
38:22
we're not telling any of these things
38:23
what you have to learn. They just have
38:25
to learn based on the training process.
38:27
So, what we do is
38:28
So, actually, this is an example
38:30
from the original transformer
38:32
paper, where the sentence is the lawyer
38:35
will Sorry, the law will never be
38:37
perfect, but its application should be
38:39
just. This is what we are missing, in my
38:43
opinion.
38:44
A complicated sentence, right? So, the
38:46
first attention head, actually, this
38:48
is the pattern of things it picks up.
38:50
So, for example, the word perfect here,
38:53
the contextual embedding of the word
38:54
perfect
38:57
draws upon heavily from the word law
39:00
in this example.
39:01
Okay?
39:02
If you look at another attention head,
39:04
the contextual embedding for the word
39:06
perfect is actually drawing heavily from
39:07
just perfect and nothing else. Right?
39:11
And if you look at other words, the
39:13
patterns are subtly different of what
39:14
it's paying attention to.
39:17
So, these are two different attention
39:18
heads, and they're learning different
39:20
kinds of attentions.
39:21
Okay? In reality, trying to make sense
39:24
of why they
39:25
pay attention the way they do, it's
39:27
usually quite sort of difficult to
39:29
figure that out. You can't actually
39:30
interpret it. But when you have lots of
39:32
attention heads, the performance on the
39:34
task that you care about gets really
39:35
much better.
39:37
Right? And then you're saying, okay, I
39:39
can use that. Uh yeah.
39:40
That's the
39:42
I think that's the idea behind this. Is
39:43
that the idea behind this?
39:49
Right.
39:50
Exactly. Same logic. Same logic.
39:53
Yeah.
40:13
Actually in the convolutional case, the
40:15
ones and zeros I had were just example
40:17
numbers to show that that particular
40:19
filter could detect a vertical line or
40:21
horizontal line. You will recall that
40:23
when we actually train a convolutional
40:24
network, we actually don't specify the
40:26
numbers. We start with randomly
40:27
initialized weights and then we let
40:30
backpropagation figure it out.
40:32
Similarly here, we don't decide any of
40:34
these things. We just let back prop
40:35
figure it out.
40:37
Okay? And now the question of what are
40:39
the weights that are actually going to
40:40
be learned? We'll come to that in a
40:42
bit.
40:43
Okay? Uh yeah.
40:47
Uh I was wondering how come we have
40:50
different attention heads even though
40:53
uh it seems like they're only a function
40:55
of a dot product and we have the same
40:57
dot product for the same embeddings.
40:59
Great question. Great question. And I
41:01
literally have a note in my slide
41:02
saying, "If a student asks this good
41:04
question, tell them to wait till
41:06
Wednesday."
41:08
So, great question. And we'll come back
41:10
to that uh on Wednesday and spend a fair
41:12
amount of time on it. So, uh
41:14
the point that's being made here
41:17
is that oops.
41:19
When we look at self-attention,
41:22
the embeddings came in and we did all
41:24
these dot products and the contextual
41:26
things popped out the other end. Note
41:28
that inside the self-attention box,
41:30
there are no parameters.
41:32
There are no parameters.
41:34
So, the question that is being raised
41:36
here is that so what are we learning
41:38
really? If there is nothing inside to be
41:40
learned, if there are no parameters, no
41:42
coefficients, what are we learning?
41:43
That's the question. And by extension,
41:46
if we have two of these and neither of
41:48
them is learning anything, what's the
41:49
point?
41:52
Sadly, you have to wait till Wednesday.
41:55
Okay? But we have a great answer to the
41:57
question. So,
41:58
it'll be worth it. And if you can't
42:00
stand the suspense, read the book.
42:03
All right. So, that is uh that's why we
42:05
need multiple heads. Okay? And now to
42:07
come back to this, so what we do is it
42:09
goes through this head and you get these
42:11
W's, right? And it goes through here and
42:13
we get another set of W's.
42:15
Then what we do at the very end is we
42:17
concatenate them.
42:19
Okay? We concatenate them and we do a
42:21
projection. And this is what I mean by
42:23
that.
42:29
So, we have
42:30
uh this this is one self-attention head,
42:33
self-attention one.
42:35
This is self-attention two.
42:38
And let's say that
42:41
W1 hat comes out.
42:44
And I'm just going to call it Z1 for
42:47
the same thing so that there's no name
42:48
clash.
42:49
Okay? And uh the W2, W6, all of them are
42:52
coming, right? Let's focus on W1 and Z1.
42:55
W1 and Z1 are both contextual embeddings
42:57
for the same word.
42:59
Okay? For the first word, word one. And
43:01
so what we do is let's say this is W1 uh
43:04
let's say this vector is like
43:06
this. Okay?
43:07
And let's say that this vector is like
43:10
this.
43:12
What I mean when I say concatenated here
43:14
is we literally take
43:16
um this word here,
43:18
this embedding here, then we take this
43:20
thing here.
43:23
Okay? And we just make it a long vector.
43:25
We concatenate it. But now this vector
43:27
has become twice as long, right?
43:30
But remember, we always want to
43:32
preserve the number of inputs
43:34
we have and the lengths of these vectors
43:36
everywhere as we go along. So, what we
43:39
do is at this point, we run it through
43:42
a single dense layer
43:44
which will take this thing and make it
43:46
back into the same small shape as
43:48
before.
43:50
So, this is a dense layer.
43:54
That's it. So, this vector comes in
43:56
and it becomes it gets compressed back
43:58
to the original shape that came out of
44:00
here.
44:01
So, you could have like 20 of these uh
44:03
attention heads
44:04
and the concatenated vector will be 20 times as
44:06
long and then just project, boom, one
44:08
dense layer comes back to the original
44:09
shape.
44:12
So, that's that is the projection step.
44:16
And that's what I mean here when I say
44:17
concatenate and project.
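Shape-wise, concatenate-and-project for two heads looks like the sketch below. The head outputs and the projection matrix are random here, since in reality the projection would be learned.

```python
# "Concatenate and project" for two attention heads: the per-word outputs of
# the heads are concatenated (twice as long), then a dense projection brings
# them back to the original embedding size. Everything is random here just
# to show the shapes; the projection matrix would be learned in training.
import numpy as np

rng = np.random.default_rng(1)
num_words, dim = 6, 100

Z1 = rng.normal(size=(num_words, dim))        # pretend: output of head 1
Z2 = rng.normal(size=(num_words, dim))        # pretend: output of head 2

concat = np.concatenate([Z1, Z2], axis=-1)    # (6, 200): twice as long
W_proj = rng.normal(size=(2 * dim, dim))      # the dense "projection" layer
out = concat @ W_proj                         # back to (6, 100)
print(out.shape)
```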
44:20
So, at this point, what we have is
44:21
things come in, we contextualize them
44:23
using these different attention heads,
44:25
and when they come out of the attention
44:27
heads, we take them all, we just like
44:29
concatenate them, and then compress them
44:31
back to the same original starting
44:32
shape. Right? If these vectors are 100
44:35
units long or 100 dimension long,
44:37
whatever comes out is 100 still.
44:39
And preserving this
44:42
size as we go along is very important
44:43
for reasons that'll become apparent a
44:44
bit later.
44:46
Okay. So, that is the multi-attention
44:49
thing.
44:50
Now, a final tweak for today
44:53
is that we will inject some
44:55
non-linearity
44:57
with some dense ReLU layers
44:59
at the very end. So, we went through a
45:01
bunch of attention heads. We came up
45:03
with a bunch of contextual embeddings
45:04
now.
45:05
So, at this point so far,
45:07
since there are no
45:08
parameters inside these boxes,
45:10
uh
45:11
right? And there are some parameters
45:13
here.
45:13
We need to add some non-linearity. So
45:15
far, there's been nothing that's
45:16
non-linear. So, here we actually
45:18
send it through one or more ReLUs.
45:21
Typically, they just use one ReLU. So,
45:24
and what I mean by that
45:34
Sorry.
45:37
So, this is what we had here and then
45:41
we take it in
45:46
and then run it through
45:50
actually
45:54
we typically run it through
45:57
a ReLU.
45:58
This is a nice ReLU.
46:01
Okay? And the rule of thumb,
46:03
as you will see, is if, let's say, this
46:04
vector is 100 dimensions long, they
46:06
typically will choose a ReLU which is
46:08
about 400
46:10
wide. And then it just gets projected
46:12
out again back to 100.
46:16
So,
46:17
this is just a simple, you know, the
46:20
input comes in, goes through a single
46:21
hidden layer with four times as
46:23
many units as here, and then it
46:26
projects through another dense layer
46:28
to 100 again.
46:29
And since there are ReLUs here,
46:32
we have injected some
46:33
non-linearity into the processing.
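That position-wise feed-forward piece, sketched with the rule-of-thumb sizes from above: 100 dimensions in, a ReLU hidden layer about four times wider, then back to 100.

```python
# A sketch of the position-wise feed-forward step: each 100-dimensional
# contextual embedding goes through a ReLU hidden layer roughly 4x wider,
# then gets projected back to 100 dimensions. Sizes are the lecture's
# rule-of-thumb values, nothing canonical.
import tensorflow as tf
from tensorflow.keras import layers

dim = 100
ffn = tf.keras.Sequential([
    layers.Dense(4 * dim, activation="relu"),   # 100 -> 400, injects non-linearity
    layers.Dense(dim),                          # 400 -> 100, back to original size
])

x = tf.random.normal((1, 6, dim))   # batch of 1 sentence, 6 words, 100 dims
print(ffn(x).shape)                 # (1, 6, 100): same shape in, same shape out
```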
46:35
Okay? Now,
46:37
a lot of this stuff when it came out
46:39
felt very ad hoc.
46:41
Right? It didn't come from some deep,
46:43
you know, theoretical motivations.
46:45
But and people had strong intuitions as
46:47
to why these things were helpful. And as
46:49
it turns out, since the transformer came
46:51
out, people have tried to optimize every
46:53
aspect of this thing.
46:55
It's actually pretty difficult to beat
46:56
the starting architecture.
46:58
Right? Improvements have been made, but
47:00
it's actually a very robust architecture.
47:02
So,
47:03
so that's what's going on here. And then
47:05
when we come out of this thing,
47:08
this is what we have, the story so far.
47:10
We start with random standalone
47:13
embeddings. This could be
47:14
GloVe embeddings, it could be random
47:15
weights, doesn't matter. It goes through
47:18
a bunch of self-attention heads. We
47:19
concatenate it when it comes out the
47:21
other end.
47:25
And then we project it back
47:27
to the same size as before. Then we run
47:29
it through, you know, a ReLU followed by
47:31
a linear layer and we get these things
47:33
again. So, in this whole process, if six
47:36
things came in, six things will come
47:37
out. And if those six things
47:40
that came in
47:41
were embedding standalone embedding
47:43
vectors of 100 dimensions, what comes
47:45
out is also 100 dimensions.
47:47
So, in that sense, you could think of
47:48
this whole thing as a black box in which
47:50
whatever you send in, the same number of
47:52
things will come out of the same length.
47:54
The numbers will be different because
47:56
they will have been heavily
47:56
contextualized.
47:58
The numbers are much smarter, in other
48:00
words.
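As a sketch of that black box using Keras built-ins: multi-head attention followed by the dense/ReLU step, with the same number of vectors and the same dimension coming out as going in. The built-in attention layer already contains the learned pieces we'll get to on Wednesday, and a real block also adds things like residual connections and layer normalization, so treat this as the story so far only.

```python
# The "black box" so far: multi-head self-attention, then the dense/ReLU
# feed-forward, preserving the number of vectors and their dimension.
import tensorflow as tf
from tensorflow.keras import layers

dim, num_heads = 100, 4

inputs = layers.Input(shape=(None, dim))                   # any number of words
x = layers.MultiHeadAttention(num_heads=num_heads,
                              key_dim=dim // num_heads)(inputs, inputs)
x = layers.Dense(4 * dim, activation="relu")(x)
outputs = layers.Dense(dim)(x)

block = tf.keras.Model(inputs, outputs)
print(block(tf.random.normal((1, 6, dim))).shape)          # (1, 6, 100)
```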
48:02
So, so far what we have seen is that we
48:04
have satisfied two of the three
48:05
requirements. We have taken the context
48:08
of each word into account
48:09
by using these dot products in the
48:11
self-attention layer, and we can
48:12
generate an output that is the same
48:13
length as the input, but we have ignored
48:15
word order
48:17
completely.
48:19
Okay? Because whether I had said the
48:21
train slowly left the station or I had
48:23
said the the station slowly left the
48:25
train,
48:26
this thing won't know the difference.
48:30
Because dot products
48:32
function on sets, not on sequences. They
48:34
function on sets.
48:36
Okay? You should
48:37
convince yourself of this. Regardless of
48:39
the order, the dot product calculation
48:40
doesn't change anything.
48:42
Because we are doing every pair.
48:46
Okay? So, the question is how do we take
48:48
the order of the words into account? Um
48:50
right. As I was saying, we can scramble
48:52
the order of the words in a sentence and
48:53
we'll get the exact same contextual
48:54
embeddings at the end.
48:55
So, by the way, if you're working on a
48:57
problem in which the order doesn't
48:58
matter,
49:00
then you can stop right now and use the
49:01
transformer.
49:04
And there are many problems that are
49:05
actually in that category where the
49:06
order doesn't matter. So, if you take
49:08
traditional structured data, right? Uh
49:10
tabular data,
49:12
uh you know, blood pressure, cholesterol
49:14
level, boom boom boom. Does it predict
49:15
heart disease? Well, there is no order
49:17
in that thing. You can use the
49:18
transformer as is without doing anything
49:20
more.
49:22
So, transformers work for both sets and
49:24
sequences where order matters.
49:27
Okay. So, the fix for this is something
49:29
called the positional encoding.
49:32
Um
49:33
so what we do is very simple. There are
49:34
many things that have been
49:36
invented um to give a
49:40
transformer some information
49:42
about the order of each of the things
49:44
that are coming in.
49:45
I'm going to go with something called
49:46
the, you know,
49:47
the simplest possible way which actually
49:49
works pretty well in practice. So, what
49:51
we do is
49:52
for each position
49:55
each possible position in the input
49:56
starting from the first position all the
49:58
way through the last position
50:00
we imagine that that position itself is
50:02
a categorical variable.
50:05
Right? If a sentence can only be 30
50:07
words long, let's say, we say that hey,
50:10
the position of each word is a number
50:11
between 0 and 29.
50:14
And so, we can just think of it as a
50:16
categorical variable.
50:17
And because it's a categorical variable, we
50:20
can just imagine an embedding
50:22
for each potential value. So, it'll
50:24
become clear in just a moment because I
50:25
have a numerical example.
50:27
And so, what we do is we will just take
50:28
that standalone embedding and then we'll
50:30
take this position embedding
50:32
which represents the position of the
50:33
word in the sentence, we just add them
50:35
up.
50:36
Okay? Uh yeah.
50:39
So, if
50:40
in the initial sentence itself, I have a
50:43
mistake, so I just write it as the train
50:45
slowly the station.
50:48
So, which means my output is actually
50:49
going to be wrong. Yes.
50:52
Now, the transformers, since they're
50:53
trained on lots of data,
50:55
they will be quite robust to these
50:57
things.
50:58
But strictly speaking, arithmetically,
51:00
yes.
51:02
Um okay. So, here's let's look at an
51:05
example.
51:06
Let's assume that
51:08
um
51:09
your standalone embeddings, right? This
51:11
is your vocabulary, okay?
51:13
Unknown, cat, mat, I, sit, love, the,
51:15
you, on. That's it. That's our
51:17
vocabulary.
51:18
And for this vocabulary, we have these
51:20
standalone embeddings.
51:22
And just for argument, let's assume
51:23
these embeddings are only two long.
51:26
Okay? The dimension of these embeddings
51:27
is two.
51:28
If you recall the GloVe embeddings we
51:30
used last week, I think they were what?
51:31
100 long?
51:33
And the ones we're using in the homework
51:34
are even longer than that.
51:35
Um but here we are assuming they're only
51:37
two long, okay? So, the embedding for
51:39
cat is 0.5, 7.1.
51:42
All right. Now, let's assume that we
51:45
can have at most 10 words in any
51:47
sentence that's coming in.
51:49
And obviously, a particular word could
51:50
be in position 0 all the way through
51:52
position 9.
51:53
And we will learn embeddings for each of
51:56
these positions, and these embeddings
51:57
are also two long.
51:59
Two units long. Dimension two.
52:03
Okay?
52:04
Now, where will these embeddings come
52:06
from?
52:07
What's the answer to that question? What
52:09
is the answer to the general question of
52:10
where will these weights come from?
52:14
We will learn it with backprop.
52:18
Okay?
52:20
We will start initially with random
52:21
numbers and then we'll make
52:23
them better and better
52:24
over the course of training.
52:26
So, what we do is we have these two
52:28
tables
52:29
of embeddings.
52:30
Um the standalone embedding for the word
52:32
and the position embedding.
52:34
And then, we literally add them up.
52:37
So, for example, let's say the word the
52:39
sentence that came in is cat sat mat.
52:41
That's the sentence. It's got three
52:43
words, cat sat mat. So, what we do is we
52:46
say, well, the embedding for cat is this
52:49
thing here, 0.5, 7.1.
52:51
So, I write it here: 0.5, 7.1.
52:53
Cat happens to be in the zeroth position
52:55
of the sentence.
52:56
So, I grab the embedding for position zero, which
52:58
is 1.3, 3.9. I stick it there, and then
53:01
I literally add them up: 0.5 + 1.3 = 1.8,
53:04
and 7.1 + 3.9 = 11.0. That's it.
53:07
So, now the positional encoded embedding
53:10
for the word cat is 1.8, 11.0, not 0.5,
53:15
7.1.
53:18
So, if cat happens to show up in another
53:20
part of the sentence, let's say instead
53:22
of cat sat mat, we had
53:25
mat sat cat.
53:28
Now, cat is in the third position,
53:29
position 2 (counting 0, 1, 2), which means
53:33
its standalone embedding doesn't change. It's just
53:34
the embedding for cat, but now instead
53:36
of picking zero, we'll pick this one,
53:38
0.6, 8.1, and put that here and add them
53:40
up instead.
53:43
So, this is the idea of the positional
53:45
encoding.
53:46
This is how we inject position knowledge
53:48
into the transformer.
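As a quick sanity check of the arithmetic, here is a tiny sketch using the made-up numbers from the slide (only cat's embedding and positions 0 through 2 are shown; everything else is omitted).

    import numpy as np

    cat = np.array([0.5, 7.1])        # standalone embedding for "cat", position-independent
    pos = np.array([[1.3, 3.9],       # learned embedding for position 0
                    [6.3, 3.7],       # learned embedding for position 1
                    [0.6, 8.1]])      # learned embedding for position 2

    print(cat + pos[0])               # cat in position 0 -> [ 1.8 11. ]
    print(cat + pos[2])               # cat in position 2 -> [ 1.1 15.2]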
53:52
Yes.
53:54
Um
53:55
the positional embedding would be
53:56
different for each sentence, right? How
53:58
do you No, this is just one table which
54:00
tells you what the position is.
54:01
So, it says for a word that appears
54:04
in the seventh position in any input
54:06
sentence that you're feeding in,
54:08
this is the embedding that you need to
54:09
use
54:11
for that position.
54:16
If the word appears twice in the same
54:19
sentence, how do
54:21
Great question. So, let's say, just
54:23
for argument, that the
54:25
sentence was cat cat cat.
54:27
So,
54:29
for each one of those
54:31
cats,
54:32
this standalone embedding will be the same,
54:34
0.5, 7.1, because that happens to be
54:36
just the embedding for cat regardless of
54:38
position.
54:39
But then, the first cat
54:42
for the first cat, we will use 1.3, 3.9
54:45
as the addition. For the second cat,
54:47
we'll use 6.3, 3.7. The third cat will
54:50
use 0.6, 8.1.
54:51
So, only the thing that we are adding,
54:53
the positional embedding, will change.
54:55
So, the resulting
54:57
sum is going to be different for each of
54:58
these three words, even though they're
54:59
exactly the same word.
55:05
Is that position embedding table
55:07
specific to the standalone embedding
55:09
table? Like if you were to add or remove
55:12
some words from the standalone table? It's
55:14
independent.
55:15
Independent. It only depends on your
55:18
assumption about how long the sentences
55:19
can be.
55:21
That's it.
55:21
It doesn't really care about what's what
55:23
words are coming in. That's a whole
55:24
different thing.
55:26
So, these are two independent tables
55:27
that just learned as part of this
55:28
process.
55:31
So, yeah, I have the same thing for sat
55:33
and mat.
55:35
Sat and mat, that's what we have.
55:39
So, just make sure you understand these
55:40
two slides to really like make sure the
55:42
mechanics are clear. Yeah.
55:46
How do you control for filler words? For
55:48
example, if you're taking
55:50
NLP output for transcription and you're
55:53
trying to run a transformer and you have
55:55
a lot of
55:56
um's and likes that are
55:58
disproportionately large and have these
56:00
random assignments or
56:03
really deep embeddings, are there other
56:04
ways to look through the noise?
56:07
Typically, what they do is, um,
56:09
as we'll talk about later, use this thing
56:10
called byte pair encoding, in which we
56:12
take individual characters,
56:14
fragments of words, and whole words into
56:16
account as tokens. So, when you hear
56:18
stuff like uh and so on, it gets mapped
56:21
to these small tokens.
56:23
Right? And then we treat them as just
56:24
any other token.
56:28
Um yeah, is the aggregation just a simple
56:31
sum here? And wouldn't the actual
56:33
semantic meaning of the standalone word
56:36
be more important than its
56:37
relative position in the sentence?
56:40
It could be. We just don't know a priori
56:42
whether it's going to be important or
56:43
not for any particular sentence.
56:45
We when we train the transformer with a
56:46
lot of textual data,
56:48
right? It'll just figure out the right
56:50
values for these things so that on
56:51
average, the accuracy is as high as
56:53
possible.
56:55
So, in many of these things, there's
56:56
always a tension between our human
56:58
intuition as to how it should work and
57:00
whether you should just throw it into
57:01
the meat grinder of backprop and see
57:02
what happens.
57:04
And so, here it does it turns out you
57:05
can just throw it into backprop, it'll
57:06
actually do a pretty good job.
57:08
Uh yeah.
57:10
For the positional encoding, we would
57:13
just be using the sum vectors,
57:15
this 2 by 3 matrix
57:18
that you have there, right?
57:20
Uh oh yeah, this is just for
57:21
demonstration. Basically, this is the
57:23
thing that will actually go into the
57:24
transformer. Correct.
57:26
Yeah.
57:28
That was just me being overly verbose in
57:30
the slides.
57:31
Uh yeah.
57:33
I can see sentences in the input. At
57:35
this point, are we still parsing out
57:36
punctuation or if we have like a
57:38
multi-sentence input, is there a
57:40
positional embedding vector for each of
57:41
the sentences? Yeah, so here um
57:44
basically, the starting point is tokens.
57:47
Right? And in our example, because we're
57:48
working with the idea of simple
57:50
standardization and stripping and things
57:51
like that, I'm just showing actual
57:53
words.
57:54
If you go to something like GPT-4, since
57:56
it uses a different tokenization scheme,
57:58
uh each token might be part of a word.
58:01
It might be it might be an individual
58:02
character, it might be a punctuation
58:03
mark, it could be in fact um the GPT
58:06
family doesn't strip out punctuation.
58:08
Which is why when you ask a question, it
58:10
comes back with intact punctuation in
58:12
its response.
58:13
Uh and so, we'll get we'll revisit this
58:15
when you look at BPE, byte pair encoding
58:17
later on.
58:19
But the key thing to remember is that
58:21
all the stuff we're talking about starts
58:22
from the notion of a token.
58:24
As to how you define a token given a
58:26
bunch of text, that's the tokenizer's
58:28
job. And we just assumed a simple
58:30
tokenizer for the time being.
58:33
Okay? So, at this point, folks, we have
58:36
satisfied all the requirements.
58:38
Uh we have taken the surrounding context
58:40
of each word, we have taken the order,
58:42
and so on and so forth, because what's
58:43
coming in here is the positional
58:45
embeddings. Okay? And it runs through
58:47
the whole transformer stack.
58:49
So,
58:51
this is called a transformer encoder.
58:54
Okay?
58:55
This is the transformer encoder.
58:57
And you can see here, this is the
58:59
original picture from the paper.
59:01
It's an iconic picture at this point.
59:03
So, it says here, these are the
59:04
inputs. This is like "the cat sat on the
59:06
mat."
59:07
It comes in here, gets
59:09
transformed into embeddings, standalone
59:11
embeddings.
59:12
And then, based on the position of each
59:14
word, we add that's why you see a plus
59:17
sign here, we add the positional
59:20
embedding to that.
59:22
And the resulting thing goes into this
59:24
transformer block. And here,
59:26
we go through multi-head attention.
59:30
And things come out the other end.
59:32
Then there is this thing called add and
59:34
norm, which we'll revisit on
59:36
Wednesday.
59:37
And then it goes through a feed forward
59:38
network, another add and norm, which
59:40
we'll revisit on Wednesday.
59:42
And then it comes out the other end.
59:43
That's it. That's a transformer encoder.
59:46
Okay?
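For reference, here is a minimal Keras sketch of one encoder block along the lines of that diagram. It is illustrative only, not the course's helper code, and the hyperparameters are whatever you pass in.

    from tensorflow import keras
    from tensorflow.keras import layers

    class SimpleTransformerEncoder(layers.Layer):
        """Self-attention -> add & norm -> feed-forward -> add & norm."""
        def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
            super().__init__(**kwargs)
            self.attention = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
            self.dense_proj = keras.Sequential([
                layers.Dense(dense_dim, activation="relu"),  # the hidden ReLU layer
                layers.Dense(embed_dim),                     # linear layer back to embed_dim
            ])
            self.layernorm_1 = layers.LayerNormalization()
            self.layernorm_2 = layers.LayerNormalization()

        def call(self, inputs):
            attention_output = self.attention(inputs, inputs)          # multi-head self-attention
            proj_input = self.layernorm_1(inputs + attention_output)   # first add & norm
            proj_output = self.dense_proj(proj_input)                  # position-wise feed-forward
            return self.layernorm_2(proj_input + proj_output)          # second add & norm

The same number of tokens and the same vector length come out as went in, which is what makes the stacking discussed below possible.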
59:47
Um
59:48
and so if you look at this
59:52
just to point out a couple of things,
59:53
the input embeddings can be random
59:55
weights or it could be pre-trained
59:56
embeddings.
59:57
Um
59:58
we add in a position-dependent embedding
1:00:00
to represent the position of each word
1:00:01
in the sentence. That's the plus.
1:00:02
Then we pass it through multi-headed
1:00:04
attention to get a contextual uh
1:00:05
representation.
1:00:07
Then finally we pass all this through
1:00:09
a simple feed-forward network;
1:00:10
typically it's a two-layer network:
1:00:12
one hidden layer with ReLUs and then a
1:00:13
linear layer after that, and boom.
1:00:16
That's the encoder. And
1:00:20
here is perhaps the most important
1:00:21
point to keep in mind.
1:00:23
Because we have taken inordinate care to
1:00:25
make sure that the things that are
1:00:26
coming in and the things that are going
1:00:28
out have the same size
1:00:30
both in terms of the number of tokens as
1:00:32
well as the length of each vector.
1:00:34
We can then stack them up like pancakes.
1:00:37
We can have lots of transformers stacked
1:00:39
one on top of each other.
1:00:41
Right? Because it's the perfect API.
1:00:43
It's the simplest possible API. The same
1:00:45
thing comes in, same thing goes out.
1:00:47
In terms of size. So you can have a
1:00:49
transformer encoder, another one on top,
1:00:51
boom, boom, boom, boom, boom, one after
1:00:53
the other. GPT-3 has a stack of 96
1:00:55
transformer blocks.
1:00:58
And like in all things deep learning
1:01:00
related, the more layers you have, the
1:01:02
more complicated things we can do with
1:01:04
it.
1:01:05
As long as you have enough data to keep
1:01:06
the model happy so it doesn't overfit.
1:01:11
Okay?
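A rough sketch of that stacking, reusing the SimpleTransformerEncoder sketched earlier (the batch size, depth, and hyperparameters below are arbitrary placeholders):

    import tensorflow as tf

    x = tf.random.normal((8, 30, 512))   # (batch, tokens, embedding dim) - placeholder input
    for _ in range(4):                   # 4 is arbitrary; GPT-3 stacks 96 such blocks
        x = SimpleTransformerEncoder(embed_dim=512, dense_dim=64, num_heads=5)(x)
    print(x.shape)                       # still (8, 30, 512): same thing in, same thing out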
1:01:13
All right. So, what we haven't covered,
1:01:15
which we'll cover on Wednesday
1:01:17
uh is the question that
1:01:20
he had posed about how
1:01:22
uh you know, since there are no
1:01:23
parameters inside the self-attention
1:01:24
block, what are we actually learning?
1:01:26
And then there are these things called
1:01:27
residual connections and layer
1:01:29
normalization. We'll talk about all
1:01:31
those things on Wednesday. Those are all
1:01:32
like, you know, refinements to the idea.
1:01:35
So, all right, 9:39. Um let's apply the
1:01:38
transformer encoder to an actual
1:01:39
problem.
1:01:40
Any questions?
1:01:43
Uh yeah.
1:01:45
My question is regarding, like you said,
1:01:46
you could have multiple transformers.
1:01:48
What is the difference between having
1:01:50
multiple self-attention heads
1:01:53
rather than having multiple... When I
1:01:54
say a transformer block, within the block
1:01:57
there could be multiple heads. So, if
1:01:59
the accuracy is the same, why
1:02:01
would you use this rather
1:02:04
Yeah, you can have a lot of attention
1:02:06
heads. And that's totally fine. And
1:02:08
typically I forget how many GPT-3 and 4
1:02:10
have. They have a whole bunch of them.
1:02:12
But you can So you can go wide and you
1:02:13
can go deep.
1:02:15
Both are done in practice.
1:02:18
But the thing is if
1:02:19
The one thing you have to remember is
1:02:20
that if you go wide, you have a
1:02:22
lot of attention heads then given the
1:02:24
particular input that's coming into that
1:02:26
block, it'll learn different patterns
1:02:28
from it.
1:02:29
While if you stack them all up, it's
1:02:31
going to learn different ways to
1:02:32
contextualize the things that are coming
1:02:33
in. It operates at higher levels of
1:02:35
abstraction. So the analogy would be
1:02:36
that like the seventh layer of a
1:02:38
convolutional net may take the sixth
1:02:40
layer's output and say, "Oh, I'm seeing
1:02:42
a lot of edges here. I'm going to take
1:02:44
an edge like this, two circles like that
1:02:46
and call it a face."
1:02:48
So it'll operate at a higher level of
1:02:49
abstraction.
1:02:52
Okay.
1:02:53
Um
1:02:58
All right, let's go to the Colab.
1:03:01
So what we're going to do is we're going
1:03:02
to take the transformer that we just
1:03:04
learned about and we're going to apply
1:03:05
it to solve the the travel uh slot
1:03:07
problem. Okay?
1:03:09
Uh all right. So
1:03:12
Okay, so we'll start with the usual
1:03:14
preliminaries.
1:03:16
And then we have taken the ATIS data set
1:03:18
I talked about and we have stuck them in
1:03:20
raw box for easy consumption.
1:03:23
It's here.
1:03:29
Okay.
1:03:30
So if you look at the top few,
1:03:33
you can see here, for example, I want to
1:03:35
fly from Boston 8:30 a.m. And then this
1:03:37
is the output. The slot filling is the
1:03:39
output. Um and so as it turns out here
1:03:42
there is
1:03:43
another label: these people also
1:03:46
took the whole query and gave it an
1:03:47
intent, as in, is it a flight query,
1:03:49
is it a something-else query, and so on,
1:03:51
which we're not going to use. Are you
1:03:52
kidding me?
1:03:54
I want to fly from Boston at 8:30 a.m.
1:03:56
and arrive in Denver at 11:00 in the
1:03:57
morning. What kind of ground
1:03:59
transportations are available in Denver?
1:04:01
What's the airport at Orlando?
1:04:03
Um how much does the limo service cost
1:04:06
within Pittsburgh? Okay.
1:04:08
And so on and so forth. So you get So
1:04:09
you get the idea. It's a very wide range
1:04:11
of queries that are in this data set.
1:04:13
Um okay. So let's just ignore that for a
1:04:16
sec. Um okay. So what we're now going to
1:04:18
do is we are going to take only
1:04:22
um this column, right? The query column.
1:04:24
That's going to be our input text. Okay?
1:04:27
And then the slot filling column is
1:04:29
going to be our dependent variable, the
1:04:31
output.
1:04:32
So we'll just gather them all up
1:04:34
uh here.
1:04:37
Let it run. We'll do it for the training
1:04:38
data and the test data.
1:04:40
And so what we have done is that we have
1:04:42
taken um the transformer related code in
1:04:45
Keras and we have packaged it into a
1:04:47
little hardel library for easy
1:04:49
consumption.
1:04:50
Um and so that thing is here. You can
1:04:53
download it.
1:04:55
Calling it a library is like overstating
1:04:56
it. We literally just collected a bunch
1:04:57
of code and stuck it in a file. Okay?
1:04:59
So
1:05:00
and so what we'll do is from hardel
1:05:02
we'll import the transformer
1:05:03
encoder.
1:05:04
And we'll import this positional
1:05:06
embedding layer.
1:05:08
Because what we're going to do is we are
1:05:09
going to take the input do the
1:05:11
positional encoding business and then
1:05:12
send it into the transformer.
1:05:14
Okay?
1:05:15
Um so but first let's vectorize the
1:05:18
input uh queries that are coming in.
1:05:21
So we'll define a thing here.
1:05:24
Uh oh, it says
1:05:26
max query length is not defined. That's
1:05:28
what happens when you
1:05:30
don't run everything.
1:05:32
All right.
1:05:38
Okay. So now we have this thing here. So
1:05:41
turns out that there are 8,888 tokens,
1:05:44
right? 8,888 words in the input queries
1:05:47
that are we have in the data. Uh so I
1:05:49
take a look at the first few.
1:05:52
And you can see here, you know, there is
1:05:54
unk. Uh and because the output mode here
1:05:56
is you just want integers to come out
1:05:58
not multi-hot encoding or anything
1:06:00
because we're going to take these
1:06:01
integers and then do embeddings from
1:06:02
them. So it'll
1:06:04
reserve this empty string as the pad
1:06:07
token. This should be familiar from last
1:06:10
week.
1:06:11
And then the unk for unknown tokens and
1:06:13
then two from flights these are all some
1:06:14
of the most frequent. Um turns out
1:06:17
Boston is actually the most frequent. I
1:06:18
don't know what's up with that.
1:06:20
It is what it is. Then we'll do the same
1:06:22
vectorization to the train and test data
1:06:24
sets.
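A rough sketch of what that query vectorization might look like in Keras; the toy queries and variable names below are placeholders, not the notebook's exact code.

    from tensorflow.keras.layers import TextVectorization

    # Two toy queries standing in for the ATIS training queries.
    train_queries = ["i want to fly from boston at 838 am",
                     "what flights go from boston to denver"]

    max_query_length = 30                            # assumed cap on tokens per query
    query_vectorizer = TextVectorization(
        output_mode="int",                           # integer token ids, not multi-hot
        output_sequence_length=max_query_length)     # pad or truncate every query to 30
    query_vectorizer.adapt(train_queries)            # on the real queries this builds the 8,888-token vocabulary
    print(query_vectorizer(train_queries).shape)     # (2, 30)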
1:06:25
Now we need to do the same thing for the output
1:06:28
side of the problem, because the slots,
1:06:30
the dependent variable here,
1:06:31
remember, are all sentences as well, with
1:06:33
the B and O tags and things like that, right? So we
1:06:36
need to vectorize those.
1:06:38
So we need to do the same vectorization on them.
1:06:40
So let's take a look at some of these
1:06:42
slots.
1:06:43
And you can see here all this stuff is
1:06:44
going on.
1:06:45
Note: here is an example where you
1:06:48
have to be very careful when you do the
1:06:49
standardization.
1:06:51
Typically standardization you will
1:06:52
remove punctuation and you know, do
1:06:54
things like that and lowercase, right?
1:06:56
But here
1:06:57
these things have a specific meaning.
1:07:00
We can't just go in there and remove the
1:07:01
period and the underscore and then
1:07:03
make the B into lowercase b and stuff
1:07:04
like that. That'll just harm it.
1:07:06
Right? We need to be able to preserve
1:07:07
the nomenclature of the output in terms
1:07:10
of all those tags. So
1:07:12
um so we don't want the standardization
1:07:13
to strip all of that out. So what we do is we
1:07:15
set standardize=None.
1:07:17
Look at that.
1:07:18
We tell Keras do not standardize this.
1:07:20
Do not do your usual thing.
1:07:22
Okay?
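A small sketch of that idea, with a made-up BIO tag string rather than the notebook's data:

    from tensorflow.keras.layers import TextVectorization

    slot_strings = ["O O O B-fromloc.city_name O B-toloc.city_name"]  # made-up slot sequence

    slot_vectorizer = TextVectorization(
        standardize=None,             # keep case, periods, underscores, and hyphens intact
        output_mode="int",
        output_sequence_length=30)
    slot_vectorizer.adapt(slot_strings)
    print(slot_vectorizer.get_vocabulary())   # '', '[UNK]', 'O', then the two B- tags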
1:07:23
Um so
1:07:25
we do that
1:07:26
for the output side. And then let's look
1:07:29
at the vocabulary.
1:07:30
Yeah, so this sounds pretty good.
1:07:33
These are all the things that we would
1:07:34
expect to see.
1:07:35
These are the distinct tokens in the
1:07:37
output strings.
1:07:39
Um all right.
1:07:43
Okay, we get it.
1:07:45
So we have 125 of them. In the
1:07:48
lecture I said there are 123 slots,
1:07:50
possible slots. Why is it 125 here?
1:07:54
Yes, unk and pad. Correct.
1:07:57
Um okay. Now we'll set up a transformer
1:07:59
encoder, right? Uh this Oh, wait, wait,
1:08:02
wait. I forgot about um doing this. My
1:08:05
bad. Um
1:08:07
All right.
1:08:11
I just realized when I saw the slide that
1:08:12
we had gone to the Colab
1:08:15
without giving you a bit more
1:08:16
background. No problem. So
1:08:18
So
1:08:20
the way we're going to model this
1:08:21
problem is that we're going to have
1:08:22
something like this, right? Fly from
1:08:23
Boston to Denver.
1:08:24
That's the input that's coming in and
1:08:26
that is the correct answer.
1:08:28
O, O, some B-something-or-others,
1:08:31
and then something else, right? That's
1:08:32
the correct answer. That's
1:08:34
the input and that is the right answer.
1:08:36
So what we'll do is we will
1:08:38
create these positional input embeddings
1:08:40
like we have discussed before.
1:08:42
We will run it through a transformer.
1:08:45
It gives us contextual embeddings.
1:08:47
So if we send five in, it's going to
1:08:49
send us five out except the color is now
1:08:50
blue.
1:08:51
Right? And then what we do is
1:08:54
we will run it through a ReLU.
1:08:57
Okay, we'll run it through a ReLU layer.
1:08:59
We will still have
1:09:01
you know, five vectors here, five
1:09:02
vectors will come in.
1:09:04
And then for each of the things that
1:09:05
comes in, we will stick a 123-way
1:09:07
softmax.
1:09:11
Okay, for each thing that comes out
1:09:13
we'll have a 123-way softmax and that's
1:09:15
the classification problem we're going
1:09:16
to solve.
1:09:20
Okay?
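In shape terms, a small sketch with made-up numbers: the head emits one probability distribution over the slot labels for every token position, and we read off one predicted slot per token.

    import numpy as np

    batch_size, seq_len, num_slots = 2, 30, 123                # 123 slot tags, per the lecture's count
    probs = np.random.rand(batch_size, seq_len, num_slots)     # stand-in for the per-token softmax outputs
    predicted_slots = probs.argmax(axis=-1)                    # one slot id per token
    print(predicted_slots.shape)                               # (2, 30)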
1:09:21
So
1:09:23
the weights in all these layers will get
1:09:25
optimized by backprop.
1:09:28
All these weights are going to get
1:09:29
optimized.
1:09:30
Uh yeah.
1:09:34
Sorry?
1:09:40
Oh no, the that's a layer. The weights
1:09:43
in the layer will still need to be
1:09:44
learned.
1:09:46
It's sort of like the text vectorization
1:09:48
layer is a bunch of code and then you
1:09:50
actually run it on a particular corpus
1:09:51
to adapt it and fill our vocabulary out
1:09:53
of it.
1:09:54
So, it's like an empty shell that needs
1:09:55
to get populated.
1:09:57
Okay, so with the weights and all these
1:09:59
things are going to get updated when we
1:10:00
when we train the model
1:10:02
by backprop.
1:10:03
Uh and that's it. That's the setup.
1:10:06
Does this make sense before I switch
1:10:07
back to the Colab?
1:10:09
In particular, does this make sense?
1:10:11
This part of it.
1:10:15
Bunch of things come out and then for
1:10:17
each one of those things we need to
1:10:18
figure out a classification of a 123-way
1:10:20
classification. And that's where we
1:10:22
stick a softmax on every one of those
1:10:23
output nodes.
1:10:25
Yeah.
1:10:32
Oh oh, I see.
1:10:36
Yeah, so
1:10:40
It could be whatever or to put it
1:10:41
another way, it is your choice as the
1:10:43
user as the modeler. Correct? The thing
1:10:45
is at this point with the blue stuff the
1:10:47
transformer is basically saying, my job
1:10:49
is done.
1:10:51
It has given you these valuable
1:10:52
contextual embeddings at some high-level
1:10:54
abstraction. What you do with it depends
1:10:56
on your particular problem. And so that
1:10:58
the best practice would be to take it
1:11:00
and then maybe, you know, if these
1:11:01
embeddings are really
1:11:03
long, maybe you make them a little
1:11:04
smaller, right? Using a ReLU. And using
1:11:07
a ReLU is always a good idea because
1:11:09
when in doubt, throw in a bit of
1:11:10
non-linearity.
1:11:11
Right? Uh and then once you're done with
1:11:13
that, well, at this point you need to
1:11:15
actually classify it. So, you stick an
1:11:17
output softmax on it.
1:11:20
Okay. So, that's what we have.
1:11:24
Um
1:11:27
All right, back to this picture.
1:11:29
So, what we're going to do is we
1:11:32
we also get to decide how long are these
1:11:34
embedding vectors. How long? Because here
1:11:36
we're not going to use GloVe embeddings.
1:11:37
We're just going to learn everything
1:11:37
from scratch.
1:11:39
Right? We're going to learn everything
1:11:40
from scratch. So, and we can decide how
1:11:42
long these embedding vectors are. So, um
1:11:45
these embedding vectors, I get to
1:11:46
decide, and
1:11:47
I have decided that I want them to be
1:11:49
512 long, right? I want these actually
1:11:52
to be 512 long. So, that's what I have
1:11:54
here, 512.
1:11:57
And then inside the transformer,
1:11:58
remember
1:12:00
when we
1:12:01
concatenate everything and then we have
1:12:02
something, we run it through a final
1:12:04
ReLU layer, how big should that layer
1:12:07
be?
1:12:08
That's what I mean here by dense
1:12:11
dim. I want it to be 64.
1:12:13
And then I, you know, for fun I'm going
1:12:15
to use five attention heads.
1:12:17
Because why not?
1:12:20
Okay. And then in the final thing here
1:12:24
to go to Ali's question here these
1:12:27
things are all 512 long as I mentioned
1:12:29
earlier, right? These are all 512.
1:12:32
But this thing here I'm going to make it
1:12:34
just 128.
1:12:36
Okay, that's what I mean by units here.
1:12:38
And so if you look at the actual model
1:12:41
okay, whatever comes in has a max query
1:12:43
length of I think 30 if I recall.
1:12:45
Um actually let's just make sure of
1:12:47
that. What did I assume?
1:12:51
30, correct? Max query length 30. So,
1:12:53
each sentence is 30. So, if a sentence
1:12:55
has 35 words in it, what's going to
1:12:57
happen?
1:12:59
The last five will get chopped,
1:13:01
truncated. If it comes in at 22, we're
1:13:03
going to pad it with eight more tokens
1:13:05
with a pad token. Okay? That's how we
1:13:06
make sure everything uh gets to 30.
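A tiny sketch of that pad-or-truncate rule (the strings below are placeholders just to show the mechanics):

    import tensorflow as tf
    from tensorflow.keras.layers import TextVectorization

    tv = TextVectorization(output_mode="int", output_sequence_length=30)
    tv.adapt(["show me flights from boston to denver tomorrow morning"])

    ids = tv(["show me flights from boston"])     # only 5 real tokens
    print(ids.shape)                              # (1, 30)
    print(int(tf.math.count_nonzero(ids)))        # 5 -> the remaining 25 slots are the pad token, 0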
1:13:09
All right. So, we come back here.
1:13:12
So, the input is still sentences which
1:13:14
are 30 long, tokens which are 30 long.
1:13:16
And then we run it through a positional
1:13:18
embedding layer.
1:13:20
Okay? This positional embedding layer
1:13:23
has the actual embedding table for each
1:13:25
word, and it has the
1:13:27
positional embedding
1:13:29
table. So, just to be clear, this
1:13:31
positional embedding layer is basically
1:13:34
it's basically this.
1:13:37
So, this table
1:13:38
and this table together are packaged up
1:13:41
into the positional encoding layer.
1:13:43
But they are two distinct tables. They
1:13:45
just happen to be packaged up.
1:13:47
So,
1:13:49
so this is what we have here.
1:13:51
And then we get a nice positional
1:13:52
embedding out and then boom, we run it
1:13:55
through the transformer. And you know,
1:13:57
this transformer encoder object we have
1:13:59
to tell it obviously, hey, this is the
1:14:01
embedding dimension that's going to come
1:14:02
out. This is the dense dimension you're
1:14:04
going to use in that final feedforward
1:14:06
layer inside each attention block and
1:14:09
this is the number of attention heads I
1:14:10
want you to use. That's it.
1:14:11
Very simple, right? Only three things have to
1:14:13
be specified.
1:14:14
And then whatever comes out of the
1:14:16
transformer encoder are these blue
1:14:18
vectors.
1:14:19
And then we are back into good old sort
1:14:20
of, you know, traditional DNN stuff
1:14:22
where we take this thing, run it through
1:14:24
a ReLU with 128 units, we add a little
1:14:27
dropout uh and then we run it through a
1:14:30
dense layer which the the vocab size
1:14:33
here is 125, which is the 125-way
1:14:35
softmax.
1:14:37
Okay? Activation softmax.
1:14:39
Connect up everything into model input
1:14:41
and output and boom, that's the whole
1:14:42
model.
1:14:44
So, that's what we have here.
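Putting that together, here is a rough reconstruction of the model just described. PositionalEmbedding and TransformerEncoder stand for the layers imported from the course's helper file; their class names and argument order, the dropout rate, and the loss are my assumptions, not the notebook's exact code.

    from tensorflow import keras
    from tensorflow.keras import layers
    from hardel import PositionalEmbedding, TransformerEncoder   # class names assumed

    vocab_size = 8888             # input vocabulary built from the queries
    slot_vocab_size = 125         # 123 slot tags plus pad and unknown
    max_query_length = 30
    embed_dim, dense_dim, num_heads = 512, 64, 5

    inputs = keras.Input(shape=(max_query_length,), dtype="int64")
    pos_embed = PositionalEmbedding(max_query_length, vocab_size, embed_dim)   # word table + position table
    x = pos_embed(inputs)
    x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)                 # contextual embeddings out
    x = layers.Dense(128, activation="relu")(x)                                # shrink to 128 + non-linearity
    x = layers.Dropout(0.3)(x)                                                 # "a little dropout" (rate assumed)
    outputs = layers.Dense(slot_vocab_size, activation="softmax")(x)           # 125-way softmax per token
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",                      # loss choice assumed
                  metrics=["accuracy"])
    model.summary()                                                            # roughly 5.3 million parameters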
1:14:47
Okay?
1:14:48
Now,
1:14:51
after Wednesday's
1:14:53
class,
1:14:54
for extra credit and for your personal
1:14:56
edification
1:14:59
try to work through this thing to come
1:15:00
up with this number.
1:15:03
53 million
1:15:04
um sorry, 5.3 million.
1:15:06
Right? Uh and see if it matches this
1:15:10
number here.
1:15:12
It should match.
1:15:13
Hand calculate the number of parameters
1:15:15
inside the transformer. Okay? For fame
1:15:17
and fortune. That's an optional thing.
1:15:19
So,
1:15:20
uh do it after Wednesday's class, not
1:15:22
right now.
1:15:23
And I have actually listed the exact
1:15:24
math that goes into it here. Okay? All
1:15:26
right. So, by the way, you can peek into
1:15:28
any layer's weights using its weights
1:15:30
attribute. This is the embedding
1:15:31
uh the positional embedding thing we
1:15:33
had. So,
1:15:34
we can click it and you can see here it
1:15:36
has two tables. There's the first table
1:15:39
which is just the embedding table which
1:15:40
says
1:15:41
there are 8,888 tokens in my
1:15:43
vocabulary and each of those tokens has
1:15:45
an embedding vector which is 512 long.
1:15:47
That is the first table here. And then
1:15:49
it has the second object which is the
1:15:51
positional embedding and it says here,
1:15:53
well, my sentences can be 30 long and
1:15:56
for each position of the 30 long
1:15:58
sentence, I will have a 512 embedding.
1:16:02
Both these tables as I mentioned earlier
1:16:04
are packaged up inside and you can
1:16:05
actually see what the weights are before
1:16:06
you do any training.
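A small sketch of that kind of inspection, reusing the pos_embed layer instance from the model sketch above:

    for w in pos_embed.weights:
        print(w.name, w.shape)
    # Expected shapes, per the discussion above:
    #   token embedding table:    (8888, 512) - one row per vocabulary token
    #   position embedding table: (30, 512)   - one row per position 0 through 29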
1:16:08
Okay?
1:16:09
So, all right. So, I'm going to stop
1:16:11
here uh because the model is going to
1:16:13
take a few minutes to run and we're
1:16:14
already at 9:45.
1:16:16
Um so, we will continue the journey on
1:16:17
Wednesday. If some of it is not super
1:16:19
clear, don't worry about it. It will
1:16:20
become much clearer on Wednesday. All
1:16:21
right? All right, folks, have a good
1:16:22
couple of days. I'll see you on
1:16:23
Wednesday.
— end of transcript —