6: Deep Learning for Natural Language – Embeddings
MIT OpenCourseWare
May 11, 2026
Transcript
0:21
We'll continue our journey with
0:23
natural language processing.
0:25
We looked at the bag of words model,
0:26
one-hot embeddings, and so on and so
0:28
forth. And today we will talk about
0:30
embeddings, or to be more precise,
0:32
stand-alone embeddings, and then that
0:34
will tee us up for something called
0:36
contextual embeddings, which is where
0:38
the transformer really sort of comes
0:40
into play.
0:41
All right, so let's get going. So far,
0:43
we have encoded input text as
0:47
one-hot vectors. So, just to refresh
0:50
your memories from Monday,
0:52
if this is the phrase
0:53
that's coming into the system, we run it
0:55
through the STIE process. And when we do
0:58
that, what happens is that first of all,
1:01
we standardize, then we
1:03
split on white space to get individual
1:05
words, then we assign words to integers,
1:08
and then we take you know, each integer
1:10
and essentially create a one-hot version
1:12
of that integer. And when we do that,
1:15
basically we have a vocabulary.
1:18
Right? And in this example, we just have
1:20
100 words, and you will note that this
1:23
vocabulary, which you arrive
1:25
at once you standardize and tokenize,
1:28
has words like "the", because we
1:30
decided not to remove stop words like "a"
1:32
and "the",
1:33
and so on. So just to be clear,
1:36
standardization
1:38
here: standardization has
1:40
historically been all about stripping
1:42
punctuation, lowercasing everything,
1:45
removing stop words, and stemming.
1:47
While that has been true historically,
1:49
if you look at modern practice, people
1:51
essentially strip punctuation (maybe) and
1:54
then lowercase, and they often don't
1:57
even bother to do stemming and things
1:58
like that, or to remove stop words.
2:00
Okay?
2:01
And that's why in Keras, the default
2:03
standardization is only lowercasing and
2:05
punctuation stripping.
2:09
This detail may actually be handy for
2:11
homework two, perhaps. That's why I'm
2:12
pointing it out.
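For reference, here is a minimal sketch of that Keras default (assuming TensorFlow's Keras; the example strings are made up):

```python
# A minimal sketch of the Keras default: TextVectorization only lowercases
# and strips punctuation unless you override `standardize`.
import tensorflow as tf

layer = tf.keras.layers.TextVectorization()  # standardize="lower_and_strip_punctuation"
layer.adapt(["The acting was GREAT!", "A great film."])
print(layer.get_vocabulary())  # note: stop words like "the" and "a" are kept
```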
2:14
Okay. So that's what we have. And so for
2:17
each word that's coming in, we have a
2:18
one-hot vector.
2:20
Right? But the one-hot vector is just
2:22
as long as the vocabulary. And then
2:25
we can either
2:27
add them up and get a
2:29
count encoding, or
2:32
we can just do a logical OR, right?
2:34
Look for any ones in each column
2:36
and get a multi-hot
2:38
encoding.
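As a small aside, a hedged sketch of those two output modes on a hypothetical phrase (assuming TensorFlow's Keras):

```python
# Contrast the two encodings: "count" sums occurrences per vocabulary slot,
# "multi_hot" only records presence as 0/1.
import tensorflow as tf

texts = ["the cat sat on the mat"]
for mode in ("count", "multi_hot"):
    layer = tf.keras.layers.TextVectorization(output_mode=mode)
    layer.adapt(texts)
    print(mode, layer(texts).numpy())  # "the" appears twice: 2 vs 1
```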
2:39
So that's what we saw last class. But
2:42
this scheme, while it's quite effective
2:44
for simple kinds of problems,
2:47
has some very serious
2:49
shortcomings. And so we will
2:50
delve into those shortcomings, and then
2:52
sort of step back and say, all right, is
2:54
there a solution to fix these things?
2:58
Problem with one-hot vectors.
3:00
There are lots of problems. Any
3:01
volunteers?
3:07
Similar words are understood
3:09
differently.
3:21
Absolutely. So what he's pointing
3:24
out is that if you have two words which
3:26
are synonyms, let's say, great and
3:28
awesome,
3:29
we would hope that the way we represent them
3:31
using these vectors would have some
3:33
connection to what the words actually
3:35
mean. In particular, we would hope that
3:37
if they mean similar things, that they
3:38
are sort of close by. If they mean very
3:40
different things, we would hope that
3:41
they are very far away. Right? Things
3:43
like that. Sort of common sensical
3:44
expectations of what you want the
3:46
vectors to have. Clearly it won't
3:49
have that, and we'll look into it in
3:50
detail in a bit. But before we do that,
3:53
there is also a computational issue,
3:54
which we covered last class, which is
3:56
that if the vocabulary is really long,
3:59
then each token, each word that's coming
4:01
in here, will have a one-hot vector
4:03
that's as long as the size of
4:04
vocabulary. Right? If you have 500,000
4:06
words in your vocabulary, every little
4:08
word that comes in has a vector which is
4:09
500,000 long. Which feels like a gross
4:12
waste.
4:16
Now you can mitigate it somewhat by
4:18
choosing only the most frequent words,
4:20
but it does increase the number of weights
4:21
the model has to learn, and increase the
4:23
need for compute and data, and so on and
4:25
so forth. Okay?
4:26
Now
4:27
let's say that we have created a
4:28
vocabulary from a training corpus. Okay?
4:31
We have a bunch of
4:32
strings, text that's coming in. We have
4:34
done the
4:36
standardization and tokenization. We
4:37
have created a vocabulary from it. And
4:39
let's say we get the words movie and
4:41
film.
4:42
So the question is, and the earlier
4:44
observation gets at this immediately: if
4:47
you look at the words movie and film,
4:48
are these two vectors close to each
4:50
other or not? Okay? So if you have two
4:52
vectors, how would we measure closeness?
4:56
What's the simplest way to think about
4:58
closeness?
5:02
It's not a trick question.
5:05
Distance. Yeah, exactly. So if they are
5:06
really close distance-wise, we would
5:08
hope, right? Similar words
5:10
should be close by. So
5:13
here, let's just imagine that the
5:16
vector for movie...
5:20
let's say your vocabulary is, I don't
5:21
know,
5:22
say,
5:25
100,000 long.
5:27
So your vector is 100,000 long,
5:30
and this is the position for movie,
5:33
so this has a one, and
5:35
everything else is zero. Right?
5:42
And this is the vector for film, and
5:44
maybe this is the position for film.
5:47
So that has a one, and everything else is
5:51
zero. Okay? What's the distance between
5:53
these two vectors?
5:55
You just use the Euclidean distance. So
5:58
the Euclidean distance, you will recall,
6:00
you literally just take the difference
6:01
of
6:02
these values,
6:04
square them, add them up, and take the square
6:06
root.
6:07
So which means that all the zeros will
6:09
obviously give you zero. This one is
6:12
going to give you a one.
6:14
This comparison is going to give you
6:15
another one. 1 + 1 = 2. Root 2. That's
6:18
the answer.
6:20
So the distance between these two
6:21
vectors is root 2.
6:25
Now,
6:27
so the distance between them is root 2.
6:30
What about the one-hot encoded vectors
6:32
for good and bad? Clearly good and bad
6:34
mean opposite things.
6:36
What is the distance between the good
6:37
and bad one-hot vectors?
6:42
Still root 2.
6:45
Because the zeros don't mean anything,
6:47
the ones are not in the same place.
6:49
So when you subtract the one and the
6:51
zero, you'll get ones and ones, add them
6:52
up, two, root 2.
6:54
In fact, you take any two words in your
6:56
vocabulary, what's the distance between
6:57
the two one-hot vectors for those words?
6:59
It's root 2.
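A quick numpy check of this fact (the vocabulary size and word positions below are made up):

```python
# The Euclidean distance between any two distinct one-hot vectors is always
# sqrt(2), whatever the words are.
import numpy as np

V = 100_000                          # assumed vocabulary size
movie, film = np.zeros(V), np.zeros(V)
movie[23], film[71] = 1, 1           # arbitrary, hypothetical positions
print(np.linalg.norm(movie - film))  # 1.4142... = sqrt(2) for any two words
```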
7:01
So if any two words have the same
7:03
distance, does this even have a notion
7:06
of distance?
7:08
It doesn't.
7:10
There's no notion of distance from
7:12
one-hot vectors.
7:13
It has no connection to the actual
7:15
meanings of these words.
7:17
It's just a way of representing them.
7:21
Okay?
7:22
So that is the big problem with one-hot
7:24
vectors.
7:26
So
7:27
the distance between them is the same
7:28
regardless of the words. It's got
7:29
nothing to do with the meaning of the
7:30
words.
7:32
And this is a huge problem, which we'll
7:33
have to solve.
7:35
So to summarize where we are, if the
7:37
vocabulary is very long, each token will
7:39
have a one-hot vector that's as long as the
7:40
vocabulary. That's sort of a
7:42
computational and training
7:44
problem. And then there is a deeper
7:46
problem, where there's no connection
7:48
between the meaning of a word and its
7:49
vector.
7:51
So wouldn't it be nice if
7:55
vectors that represent synonyms,
7:57
movie and film, apple, banana,
7:59
were close to each other?
8:01
It would be nice if the vectors for
8:03
things that mean very different things
8:04
were far from each other.
8:06
So let's take a look at a particular
8:08
example. Okay? Let's assume that we have
8:10
been magically given
8:13
these vectors, so that they actually
8:15
have some notion of meaning.
8:17
And for convenience, let's say that we
8:18
take just the first
8:21
two dimensions of these vectors,
8:23
so that we can do
8:25
a scatter plot on them.
8:28
So we plot the first dimension of
8:30
these vectors against the second dimension, and
8:31
what we have in this little cartoon is:
8:34
we have plotted the words for
8:37
factory, home, and building, and
8:41
they all happen to be clustered here.
8:44
Clearly this representation is capturing
8:45
some notion of what the thing is.
8:48
Right? Some sort of building.
8:50
Uh and here we have, you know, bicycle,
8:53
truck, and car. Clearly this is
8:55
like the automobile cluster, right?
8:57
The transportation cluster. And here we have
9:00
like a fruit cluster, and here we have
9:02
some, you know, sports balls cluster.
9:04
Okay?
9:05
Because it's a cartoon, things are
9:07
all nice and cleanly separated. Okay? So
9:10
now if you take the word apple, where do
9:12
you think it's going to go?
9:14
Is it going to go into A, C, D, or B?
9:19
C, right? It makes eminent sense it's
9:20
going to go to C.
9:23
Good. Now,
9:25
wouldn't it be nice if,
9:27
more generally, the geometric
9:29
relationships between word vectors
9:32
represented the semantic relationships
9:35
between the underlying objects that the
9:37
words represent?
9:38
Okay?
9:39
And I say relationship and not
9:41
distance, because it's not just
9:42
distance. It's actually more than that.
9:45
Okay?
9:46
So let's take another one.
9:48
Here we have
9:49
the vectors plotted for
9:52
puppy and dog,
9:54
and this is calf.
9:56
Right? We have plotted the vector for
9:58
calf. And let's say that we need to
9:59
figure out where the embedding,
10:01
the word vector for cow, would appear.
10:04
Where is it most logical? Should it be A?
10:07
Should it be C? Should it be B? Where
10:09
should it be?
10:11
This is
10:14
C? Okay, what's the logic?
10:16
Any volunteers? Just put your hand up.
10:19
Uh, yes.
10:21
Uh
10:23
A calf is a baby bull, whereas the cow
10:26
is an adult.
10:27
So, it should be closer to the dog,
10:28
which is the adult version of a puppy.
10:31
Got it. So, you're basically saying go
10:32
from the puppy version to the grown-up
10:34
version. Right? That's sort of what
10:36
you're getting at, right? And that's a
10:37
totally valid way to think about it.
10:39
But there are a couple of ways to think
10:40
about this, which is this is one of the
10:42
those two ways. So, what you can do is
10:44
you can actually look at it and say,
10:45
well,
10:46
Okay, if this is bringing you
10:48
bad memories of the GMAT and GRE and
10:50
stuff like that, I apologize.
10:52
But
10:55
So, a puppy is to a dog like a calf is
10:57
to a cow, right? Which is
10:59
exactly what Jay is pointing out. You
11:01
can go from the baby version to the
11:02
full-grown version if you go in the
11:04
horizontal direction. Okay? But maybe if
11:08
you go in the vertical direction, you're
11:10
essentially moving up and down across
11:13
the young versions of different animals.
11:15
Okay?
11:16
So, here you're still moving across
11:18
the same dimension
11:20
of animals; you're just staying at
11:22
the same age level, right?
11:24
That is the band here.
11:25
So, this band is the grown-up version of a
11:27
whole bunch of animals, and this one the puppy
11:28
version of a whole bunch of animals. So,
11:30
the vertical dimension measures some
11:31
sort of variation across animal species
11:34
of roughly the same maturity
11:36
stage.
11:37
Okay? So, these directions also matter.
11:41
It's not just the distance.
11:43
Okay. That's what I mean when I say
11:45
semantic relationship and geometric
11:47
relationship.
11:48
Relationship is distance and direction,
11:51
right? Both have to be involved.
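A toy sketch of that "distance and direction" idea, with made-up 2-D vectors (every number here is hypothetical, purely to make the arrows concrete):

```python
# The puppy->dog offset (growing up) should roughly match calf->cow.
import numpy as np

emb = {                               # hypothetical toy embeddings
    "puppy": np.array([1.0, 2.0]),
    "dog":   np.array([3.0, 2.1]),
    "calf":  np.array([1.1, 4.0]),
}
grow_up = emb["dog"] - emb["puppy"]   # horizontal "baby -> adult" direction
print(emb["calf"] + grow_up)          # a plausible location for "cow"
```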
11:53
So,
11:55
now word embeddings, as we will
11:57
learn soon, are word vectors designed to
12:00
achieve exactly these requirements.
12:03
Okay? They will achieve these
12:04
requirements.
12:06
Uh, and they will fix both these
12:07
problems very elegantly.
12:11
Okay?
12:13
So, let's say that we have word
12:14
embeddings that solve both these
12:15
problems. Are we basically done?
12:17
Can we declare victory?
12:19
Or is there anything that
12:22
even word vectors which actually capture the
12:24
meaning of the underlying thing
12:28
don't fully address? Is there any
12:30
remaining problem we have to worry
12:31
about? Yes?
12:33
Context. Context? Yes.
12:36
Context, right? What about the fact that a
12:39
word's meaning... Sure, every word has a
12:42
meaning, but we know that some words
12:44
have multiple meanings.
12:46
And that meaning is really only
12:49
inferable, you can only make sense of
12:51
it, if you know the surrounding
12:52
context, right? If you
12:55
see the word bank, b-a-n-k,
12:59
sure, it could be a financial
13:00
institution. It could be the side of a
13:02
river. It could be the act of a plane
13:04
turning in one direction.
13:07
It could be someone hoping for
13:09
something, banking on something. The
13:11
list of possible meanings of the word
13:13
bank is basically enormous.
13:16
And you cannot figure out what it means
13:18
unless you know what else is going on
13:19
around that word. So, context is super
13:22
super important. And these embeddings,
13:24
word embeddings, just tell you what the
13:26
meaning of the word is. And basically
13:28
what's going to happen when you have a
13:29
word which could mean many different
13:31
things, it's going to give you some
13:33
average version of that meaning.
13:36
And that average version is not going to
13:37
be very good.
13:39
Now, there are some words which only
13:40
mean one thing, and you'll be okay
13:41
there.
13:42
But for the rest of it, right? It's
13:44
going to be tough.
13:47
So, what we need...
13:53
We need to find a way to make word
13:54
embeddings contextual.
13:56
Meaning we need to somehow consider the
13:58
other words in the sentence.
14:00
Okay? So, if we can do that, then we
14:02
will be in great shape
14:05
to solve all sorts of NLP problems.
14:08
Now, as it turns out, contextual word
14:11
embeddings are word vectors that
14:13
achieve both these
14:15
requirements.
14:16
They capture the semantic-geometric
14:19
relationship I talked about, and
14:21
they are contextual.
14:22
Okay?
14:23
They're really fantastic. Uh, and the
14:27
key to calculating contextual word
14:29
embeddings is the transformer.
14:33
That is why transformers are justifiably
14:35
famous.
14:39
So, what's sort of the lay of the
14:40
land here? So, today we are going to
14:42
look at how to calculate
14:44
stand-alone or uncontextual word
14:46
embeddings.
14:48
And then starting Monday, we will take
14:50
these stand-alone
14:52
embeddings and make them contextual
14:53
using transformers. Okay? That is the
14:56
plan.
14:57
Any questions so far?
14:58
So, now let's think about how we can
15:00
learn these stand-alone embeddings from
15:02
data, right? Now, the naive way to think
15:05
about it would be: hey, why don't
15:07
we manually collect a whole bunch of
15:08
synonyms, antonyms, related words, etc.,
15:11
and try to assign embedding vectors to
15:13
them that satisfy
15:15
our requirements. Okay? Now, as you can
15:18
imagine, this is going to be a long,
15:19
painful, and never quite complete
15:21
exercise.
15:22
Okay?
15:23
So,
15:24
given that we are
15:26
machine learning people,
15:29
the question is: can we do it in a better
15:30
way? Can we just learn it from the data
15:32
without doing any of this manual stuff?
15:34
Okay? And
15:36
the key insight that makes it all happen
15:39
is this humble-looking line on the
15:42
screen by John Firth, who was a
15:44
linguist.
15:45
You shall know a word
15:47
by the company it keeps. I wish I could
15:49
deliver this in a British accent.
15:53
Know a word by the company it keeps.
15:55
Okay? It's a very profound statement.
15:57
Okay? And here is the sort of the key
15:59
intuition behind this.
16:02
It says,
16:03
let's say that you have a sentence like
16:05
the acting in the ___ was superb.
16:08
Okay?
16:09
What are some words that you folks think
16:11
are likely to appear in the sentence?
16:15
Shout it out. Play. Play.
16:18
Movie.
16:19
Show.
16:20
Musical. Right? Those are all some great
16:24
candidates, right? The acting in the
16:25
movie, the film, musical, and so on and
16:26
so forth. Okay? Now, let's say that I
16:28
ask you, what are some words that are
16:29
unlikely to appear in the sentence? And
16:31
I think we could all be here for like
16:32
days, you know, listing them out. Uh, I
16:35
just listed these out. Um, I love the
16:38
word tensor, so I have to find a way to
16:39
use it somewhere.
16:41
So, all right. So, the acting in the
16:43
banana was superb. Clearly nonsensical,
16:45
right? So, what
16:48
we are seeing here is that if certain
16:51
words are sort of interchangeable in a
16:53
sentence,
16:55
meaning you can change them and
16:57
the sentence still makes sense, right?
16:59
If they appear in the same context very
17:02
often, i.e., if they're interchangeable,
17:04
they are probably related.
17:07
Sort of like we don't even have to know
17:09
what the word is.
17:10
All we have to know is that this word
17:12
and this word, you can drop them into a
17:14
particular sentence, you can fill in the
17:15
blank of that sentence with that word,
17:17
and it actually makes sense, then we're
17:18
like, oh, wow, okay, these words are
17:20
related then.
17:21
Right? You're sort of inferring their
17:23
relatedness not by looking at them
17:25
directly, but by seeing where they live.
17:30
Right? It's a very very clever idea. And
17:32
it'll slowly sink in. Okay? So,
17:36
that's the first observation. If they
17:37
appear in the same context very often,
17:39
they are likely to be related.
17:41
More generally, related words appear in
17:44
related contexts.
17:47
So, all we have to do
17:49
is to figure out a way to calculate
17:52
context.
17:54
And then use that to understand, you
17:57
know, what the words are that happen to
17:58
be living in this context.
18:00
And there are some beautiful ways to do
18:02
these things, and we'll
18:03
really dive deep into one such way to do
18:05
it.
18:06
So, what we're going to do in
18:08
this approach
18:10
is this:
18:11
since
18:12
words that appear in
18:14
related contexts mean
18:16
similar things,
18:18
first of all, you have to define what
18:21
you mean by context.
18:22
And there are many ways to define
18:23
context. We're going to go with a very
18:24
simple definition,
18:24
which is that if words happen to appear
18:26
in the same sentence a lot,
18:29
then we think that, okay,
18:31
they are in the same context. So,
18:32
context here means sentence.
18:34
Okay?
18:35
So, what we can do is we can actually
18:38
take a whole bunch of text, maybe all of
18:40
Wikipedia,
18:41
and then break it up into sentences.
18:43
We'll have billions of sentences, right?
18:46
And then for all these billion
18:47
sentences, we can literally go and count
18:48
for every pair of words, how many times
18:51
are both these words showing up in the
18:52
same sentence?
18:55
Okay? And we call this co-occurrence,
18:57
right? The words are co-occurring in the
18:59
sentence.
19:00
And it doesn't have to be next to each
19:02
other,
19:02
right? We know that in complicated
19:04
sentences, the meaning of a word at the very end of the
19:07
sentence
19:09
could be altered by
19:10
a word that happened in the very
19:11
beginning of the sentence, and it could
19:12
be a really long sentence.
19:14
So, we take the whole sentence and say,
19:16
are two words co-occurring in the
19:18
sentence, yes or no? And we just count
19:19
them up.
19:20
And when we do that,
19:24
right? When we do that, we will get
19:26
something like this.
19:27
So...
19:29
this just captures what I've been
19:30
talking about. Identify all the words
19:32
that occur, let's say, in Wikipedia. And
19:34
then for every sentence, you look at
19:35
every word pair and count the number of
19:37
times they appear in the same sentence
19:38
across all those sentences. Okay?
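Here is a minimal Python sketch of exactly that counting procedure (the three sentences are made up):

```python
# For every pair of distinct words, count how many sentences contain both
# (order ignored, whole sentence used as the context window).
from collections import Counter
from itertools import combinations

sentences = [
    "the acting in the movie was superb",
    "deep learning is fun",
    "the movie used deep learning",
]
cooc = Counter()
for s in sentences:
    words = set(s.split())                  # each word counts once per sentence
    for a, b in combinations(sorted(words), 2):
        cooc[(a, b)] += 1
print(cooc[("deep", "learning")])           # 2: they co-occur in two sentences
```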
19:41
This is a word-word co-occurrence
19:43
matrix. So, for example,
19:46
let's assume that you took all of
19:47
Wikipedia, looked at all the words,
19:48
distinct words, and you found there are
19:49
500,000 words.
19:51
Okay? So, there are 500,000 words
19:54
here in the columns
19:56
500,000 words on the rows.
20:00
The columns and rows. And then you go
20:02
and each cell of this table is basically
20:05
has a number that you calculate which is
20:08
the number of times the word in the row
20:10
and the word in the column happen to
20:12
show up in the same sentence. That's it.
20:14
So, for instance
20:15
if you look at deep and learning, right?
20:18
The word deep and the word learning
20:20
maybe
20:22
those two words occurred in the same
20:24
sentence maybe 3,025 times.
20:28
3,025 sentences across all of Wikipedia.
20:31
You put 3,025 right in that cell.
20:35
Okay?
20:36
Many words are unlikely to appear in the
20:37
same sentence.
20:38
So, much of this matrix is going to be
20:40
zero.
20:44
But, we
20:45
fundamentally form this co-occurrence
20:47
matrix.
20:49
This matrix essentially embodies all the
20:54
context information that we can work
20:55
with in a very compact, beautiful,
20:58
sort of
20:59
elegant way.
21:03
And using this, we're going to try to
21:04
figure out
21:06
what the word embeddings actually are
21:07
going to be.
21:08
Okay?
21:09
And so
21:11
So, by the way, the approach I'm
21:13
describing here to calculate stand-alone
21:15
embeddings is called GloVe.
21:20
It's called GloVe, and when
21:23
stand-alone embeddings first came
21:24
onto the NLP deep learning scene,
21:27
there were two ways of doing it.
21:29
One was called word2vec.
21:32
The other one is GloVe.
21:34
And they're both comparable, right? They
21:35
use slightly different mechanisms of
21:36
doing this.
21:38
We went with GloVe for this lecture
21:40
because I think it's actually a little
21:42
easier to understand and equally
21:44
effective.
21:45
Okay?
21:47
So, this is what we have. And so, what
21:49
we want to do is
21:50
we want to learn these embedding vectors
21:52
that can be used to essentially
21:54
approximate this matrix.
21:56
Right? If you can find vectors that can
21:59
actually approximate this matrix, then
22:01
hopefully those vectors do in fact
22:03
capture some notion of what the words
22:04
actually mean. Okay? So, let me put it
22:06
differently.
22:07
You come to me with this matrix. Okay?
22:10
And you say uh okay, Rama, do you have
22:12
embeddings for me?
22:14
And I'm like, yeah, I reach into my bag
22:15
and I'm like, okay, every one of those
22:17
500,000 words, I have an embedding.
22:19
Right?
22:20
Let's ignore for a moment how I actually
22:21
calculated embeddings. I have the
22:23
embeddings.
22:24
How will you know if my embeddings are
22:25
any good?
22:28
How will you know?
22:30
How can you actually assess if those
22:31
embeddings are any good?
22:34
Well, you can certainly say, okay, give
22:35
me the embeddings for movie and film and
22:37
you can see if they're really close by.
22:39
You can look at the
22:40
embedding for movie and tensor and
22:42
hopefully they're far away.
22:43
But, you'll never get done.
22:46
Right?
22:47
How can you systematically evaluate
22:49
this?
22:51
Well, what if...
22:53
what if I come to you and say: not only
22:55
am I going to give you an embedding,
22:57
here is a procedure
22:59
which you can use with these embeddings
23:00
to validate how good they are and here
23:02
is the procedure. What you can do is you
23:04
can use the embedding to recreate the
23:07
co-occurrence matrix.
23:09
And if the recreated co-occurrence
23:11
matrix actually matches the real matrix
23:14
well, these embeddings probably are
23:15
pretty good.
23:17
Remember, the whole point of the
23:18
co-occurrence is to handle this context
23:20
information. So, if my embeddings can
23:21
actually recreate them, reconstruct them
23:23
pretty close, right? It'll never be
23:25
perfect. But, it comes pretty close,
23:27
then we're like, wow, okay, these
23:28
embeddings do mean something.
23:29
So, if it turns out for instance that
23:31
the matrix has, you know, a
23:33
value of 3,000 for deep and learning
23:36
and values of uh
23:40
say
23:40
50 for extreme and learning
23:43
and our embedding comes in and says
23:45
3,002 for the first one and 48 for the
23:48
second one, we'll be
23:49
pretty impressed.
23:51
Whoa, it didn't need to be that close.
23:53
Unless it was actually capturing
23:54
something.
23:55
Okay? So, that's what we're going to do.
23:57
And so, we're going to take this logic
23:59
of saying
24:00
find embeddings that can approximate
24:03
what we actually see in Wikipedia.
24:05
Right? And we're going to use that idea
24:07
to actually build the model and learn
24:09
the embeddings,
24:10
using nothing more than basically linear
24:12
regression.
24:16
And here you are thinking that linear
24:17
regression is useless now that you've
24:18
graduated machine learning, right?
24:22
So,
24:23
we can think of the embedding
24:24
vectors that we want to figure out as
24:26
just the weights in a model.
24:28
In a linear regression.
24:31
We can think of the co-occurrence matrix
24:33
as just the data we're going to use in
24:35
this model to estimate these weights.
24:37
And the model we're going to use
24:39
is something like this.
24:42
So, first I have to inflict some
24:43
notation on you.
24:45
We denote the co-occurrence count
24:46
of words i and j as Xij.
24:50
Xij is just data.
24:51
It's just data. Okay? It's not a
24:53
variable, it's data.
24:55
Uh
24:55
and then we will denote an embedding
24:57
vector for each word. Remember, we need
24:59
to have a vector for each word. So, we
25:01
call it Wi, right? Wi is the embedding
25:03
vector for word i.
25:06
And we will also assume that
25:09
some words are just inherently very
25:10
popular. They're going to show up all
25:11
the time like the word the.
25:13
Okay? So, we'll assume that every word
25:15
has some natural frequency of occurring
25:18
like movie versus flick.
25:20
The versus tensor. So, we want the
25:22
vectors to capture the co-occurrence
25:24
patterns independent of how naturally
25:27
frequent the words are.
25:28
Okay?
25:29
And so, to capture this natural
25:30
frequency, we will assign a bias or Bi
25:33
to each word that we're going to
25:34
calculate. And all this will become
25:36
clear in just a moment. Okay? So
25:39
with this setup, basically what we're
25:41
saying is something very simple. We're
25:42
saying, look, this co-occurrence matrix
25:44
that we have
25:45
that we're able to compute, it came
25:48
about because in truth, in reality,
25:51
in nature, there are these embedding
25:53
vectors for every word.
25:55
There are these biases Bi for every word
25:58
and every co-occurrence number that you
26:00
see just came about because, you know,
26:03
under the hood, mother nature grabbed
26:05
the bias number for word i and the bias
26:07
number for word j, took the two
26:09
embedding vectors, which only mother
26:11
nature knows at this point, did the dot
26:13
product of them, added them all up, and that's
26:15
how we got this number.
26:16
So, it basically says the number you see
26:19
is the sum of the inherent popularity of
26:21
the first word plus the inherent
26:23
popularity of the second word plus the
26:25
way in which these two words connect to
26:26
each other.
26:29
That's it.
26:29
And
26:30
you will agree with me
26:32
that it literally can't get simpler than
26:33
this.
26:34
If I tell you, hey, here are two things.
26:36
I want you to tell me how connected they
26:38
are, you'll be like, well, let's take
26:39
the first one, figure out how inherently
26:42
popular it is, do the same for the second, and
26:44
then of course you've got to worry about
26:45
the connection. So, we do a dot
26:46
product.
26:47
That's it. Those three things.
26:49
Right?
26:50
So, this is what we have. Now, you may
26:52
have seen
26:53
uh
26:54
from your, you know, good old linear
26:56
regression that whenever your
27:00
dependent variable happens to be
27:02
positive, guaranteed to be positive,
27:05
and ends up having a big range,
27:08
we always advise you folks
27:10
to take the logarithmic transformation
27:12
to squash it into a narrow range because
27:14
that will make these models much more
27:16
well-behaved.
27:18
Regression struggles if the Y value has a huge
27:20
range. The canonical example is
27:22
that, you know, if you are trying to
27:23
model, you know, the net worth of
27:24
people, right? It's going to have a long
27:27
right tail with people like Elon and
27:29
Jeff and so on on the right side, right?
27:30
And the rest of us on the left. And
27:33
so, to model this big long-tailed
27:34
distribution, you just take the
27:35
logarithm, just squash everything to a
27:37
very narrow range. And that will make
27:39
regression much better behaved. Okay?
27:41
Here
27:42
most of the counts are going to be zero.
27:45
But, some of the counts could be very
27:47
high.
27:48
Right?
27:49
And therefore, if you take
27:51
the logarithm, it makes it much better
27:52
behaved, so we take the logarithm here.
27:54
So, this is actually our model. That's
27:56
it.
27:57
And I know that many of the numbers are
27:58
zero and log of zero is not defined. So,
28:00
we can just add one to
28:02
all the numbers
28:03
to avoid that kind of
28:06
technical arithmetic problem.
28:08
But, this conceptually is what's going
28:09
on. This is the model we want to
28:10
calculate.
28:11
So, given that we have essentially
28:14
postulated this model
28:16
and we have this data, this
28:17
co-occurrence matrix, how can we
28:19
actually find the weights? How can we
28:21
actually find the Bs and the Ws? What
28:24
should we do?
28:25
Go back to the fundamentals of
28:26
regression. Think about it conceptually.
28:29
You have some model which has some
28:30
weights.
28:31
There's some data you can use to train
28:33
the model.
28:35
Right? And you need to find the best set
28:36
of weights. What does the best mean
28:38
here?
28:40
The lowest
28:42
The lowest error. Exactly. There are
28:43
many ways to measure error, right? What
28:46
is the simplest thing we
28:47
could use? What you do is you
28:48
actually do mean squared error. Right?
28:50
Which is what you're getting at.
28:52
You could take the actual thing, you
28:53
could take the predicted thing, take the
28:54
difference, square it, and minimize the
28:55
sum of it.
28:57
Okay? If your model exactly nails every
28:59
number in the co-occurrence matrix, the
29:00
error is going to be zero.
29:02
Okay? So
29:04
what we do is we literally just do that.
29:07
This is the data.
29:09
This is the predicted value.
29:11
Predicted value, actual value,
29:13
difference squared, add them all up,
29:14
minimize.
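A hedged numpy sketch of this objective, GloVe-like but simplified (the real GloVe loss also weights each pair by frequency, which is omitted here; the counts below are random stand-ins):

```python
import numpy as np

V, D = 500, 8                        # assumed: 500 words, 8-dim embeddings
rng = np.random.default_rng(0)
X = rng.poisson(1.0, (V, V))         # stand-in for the real co-occurrence counts
W = rng.normal(0, 0.1, (V, D))       # the embedding vectors = the weights
b = np.zeros(V)                      # one bias per word

def loss(W, b, X):
    pred = b[:, None] + b[None, :] + W @ W.T  # B_i + B_j + W_i . W_j
    target = np.log(1.0 + X)                  # add 1 so log(0) never occurs
    return np.mean((pred - target) ** 2)      # squared error to minimize

print(loss(W, b, X))
```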
29:17
Okay?
29:19
Uh yes.
29:21
And in the loss function, how is this
29:23
capturing the context? Because unless my
29:25
input data has that context,
29:28
how will this actually differentiate
29:31
based on where the particular word is
29:33
used?
29:34
The way the word is used...
29:36
well,
29:37
so, let's take two words like deep and
29:38
learning. Now, let's take this word and
29:41
change it according to the context.
29:42
Okay.
29:44
Sorry, go ahead. Yeah, so basically,
29:46
let's say I'm talking about the word
29:47
banana. So it's a fruit in some context
29:49
and I could be saying he's going
29:50
bananas. That's a
29:53
whatever, right? So now these are two
29:55
different contexts in my understanding
29:57
and my same model needs to be able to
29:59
tell me that banana is the right word in
30:01
this context but wrong word in this
30:02
context or
30:04
correct in both contexts. Yeah, very
30:06
good question. So let's actually spend a
30:08
minute on that. Good question. I'm going
30:10
to swap to my iPad.
30:13
So let's let's assume that this is our
30:15
co-occurrence matrix.
30:18
Right? And then we have words going from
30:20
A all the way to let's say zebra, right?
30:23
This is the all the words in our
30:24
vocabulary
30:25
and we have A through zebra here.
30:29
And now what we have is
30:32
we have uh
30:34
apple
30:36
and banana.
30:39
Right?
30:40
So basically what's going on at this
30:42
point is that
30:44
every number here measures,
30:48
for every word here, how many times that
30:50
word and apple show up in the same
30:51
sentence, okay?
30:53
It is not measuring, to your point,
30:56
how many times apple and banana are
30:57
showing up together. It's measuring how
31:01
many times apple shows up with each word in a
31:03
sentence, right? Now, if apple and
31:03
banana are sort of interchangeable,
31:06
what do we expect these
31:09
two rows of numbers to look like? Let's
31:11
assume that apple and banana are perfect
31:13
synonyms.
31:14
Just for argument, okay? Let's say they're
31:17
perfect synonyms.
31:17
What do we expect these two
31:19
rows of numbers
31:21
to look like?
31:23
Very similar.
31:25
So if two words are related, their
31:27
row vectors in the
31:30
co-occurrence matrix are going to be
31:31
very very similar.
31:32
So that is how the context comes into
31:34
the co-occurrence matrix.
31:36
So what we want to do is we want to find
31:37
if if embeddings can recreate the same
31:40
pattern of numbers in these two
31:42
uh in these two rows, it's actually
31:45
capturing the underlying context.
31:47
So words which are similar will sort of
31:49
zig and zag together the same way
31:51
through the co-occurrence matrix.
31:53
And that's where it comes in.
31:57
Yeah.
31:58
What's up with the diagonal of the
32:00
co-occurrence matrix where you have
32:01
apple showing up twice? Oh, I see. So
32:05
yeah, here you can typically just ignore the
32:07
diagonal,
32:08
because all the action is in the
32:10
off-diagonal entries.
32:15
So that's basically the idea:
32:18
words which are very similar will
32:20
have a very similar pattern of numbers,
32:22
and any
32:24
embeddings that can actually recreate
32:25
the same pattern of numbers are capturing
32:27
the underlying reality of what's going
32:28
on.
32:29
If words are kind of unrelated, those
32:32
two vectors...
32:34
let's say that
32:40
the other word you have is, well,
32:42
of course you know what I'm going to say: tensor.
32:45
Right?
32:48
These two vectors won't have any connection
32:49
to each other.
32:50
Which means if you look at something
32:51
like the correlation of those two
32:53
vectors, it's going to be around
32:54
zero.
32:55
Right?
32:56
Words which are
32:57
you know, interchangeable will have a
32:59
very high correlation.
33:01
Words which are antonyms and never show
33:03
up in the same place together may have a
33:05
highly negative correlation, close to
33:07
minus one for instance. So that's sort
33:09
of the intuition behind what's going on
33:10
in these row
33:11
vectors.
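A tiny numpy illustration of that intuition, with hypothetical co-occurrence rows over the same column vocabulary:

```python
import numpy as np

apple  = np.array([30, 12,  0, 45,  2,  0.])   # co-occurrence row for "apple"
banana = np.array([28, 10,  1, 50,  3,  0.])   # near-synonym: similar pattern
tensor = np.array([ 0,  1, 40,  0,  0, 35.])   # unrelated: different pattern

print(np.corrcoef(apple, banana)[0, 1])  # close to +1
print(np.corrcoef(apple, tensor)[0, 1])  # near zero or negative
```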
33:12
And so the point is: given that this
33:14
co-occurrence matrix is capturing all
33:16
this word-word correlational structure,
33:19
any embedding that can recreate it must
33:22
have captured the structure as well.
33:25
Because you can't recreate something
33:26
like this with great fidelity unless you
33:28
have some notion of what's going on
33:30
under the hood.
33:31
That's the basic idea.
33:33
Yeah.
33:34
So just connecting to Sophie's question.
33:36
So in that example then
33:39
banana is a fruit and apple is a fruit
33:40
as well. Banana and apple are synonyms
33:42
and you're going mad, you're going
33:44
bananas. How that comes together is that
33:47
Oh, I see. You're going mad, you're
33:48
going bananas, yeah. So those will
33:50
also have some correlational structure
33:52
to them, which the embeddings will
33:53
hopefully catch. But with words like banana,
33:57
which can mean very different things,
33:59
the thing is, it's called polysemy, where
34:01
the word looks the
34:03
same way. It's like the word bank,
34:04
right? It can mean very different things
34:06
in very different contexts. So the
34:07
embedding is going to be some average
34:09
representation of it, right? But we are
34:11
not happy with that average and we'll
34:13
get around that average
34:15
next week when we do contextual stuff.
34:18
All right.
34:19
Um
34:20
So that's what we have here. So to go
34:22
back to this thing,
34:26
what we can do is... yeah?
34:29
I didn't understand how we get the
34:31
mean squared error in this because we
34:34
didn't
34:35
do any reading from the data set we got.
34:37
We haven't calculated the embeddings.
34:39
We are trying to calculate them. Those
34:41
are just... it's sort of like, you know, in
34:42
regression you have beta
34:45
one times X1 plus beta two times X2 kind
34:47
of thing. The betas are what the
34:49
regression produces for us, right? The
34:51
embeddings are exactly that. They're
34:52
just coefficients that we're trying to
34:53
figure out.
34:55
The data is only the X's, the Xij.
34:59
And so this is what we're trying to
35:00
calculate,
35:01
right? And so what you can do is you can
35:03
actually start with some random values
35:06
for these things
35:08
and then
35:09
keep on trying to improve to minimize
35:11
the error
35:13
starting from these random values.
35:15
Are you folks aware of any
35:17
algorithm which allows us to take a
35:19
random starting point and then
35:20
minimize some notion of error?
35:32
Well, how do you know it's actually
35:33
random? Oh.
35:35
So that's actually a very deep question.
35:37
Um
35:39
and
35:39
so
35:41
it's actually a tough question, right?
35:42
Because ultimately the random number is
35:44
coming from a computer
35:46
and we know how the computer runs. It's
35:47
deterministic at the end of the day.
35:50
So we actually use something called
35:51
pseudo random numbers,
35:53
right? Um and there's like a whole
35:54
specialized field of math
35:56
which essentially says, "Look, how can I
35:59
get random numbers that are sufficiently
36:02
random even though they come from a
36:03
deterministic, non-random computer
36:05
process?" So we can talk offline about
36:07
it,
36:08
um but fundamentally all these systems
36:10
have some random number generators built
36:11
in. We just cross our fingers and hope
36:14
for the best and just use them.
36:17
So come back to this,
36:19
right? We can start with random values
36:20
for these weights
36:22
um and then we can try to minimize the
36:23
squared error. Are are you folks aware
36:25
of any algorithm that can help us do
36:26
that?
36:28
Yes.
36:30
Gradient descent. Yes, gradient descent.
36:33
Again, comes to the rescue. Uh and since
36:35
we are cool, we'll do stochastic
36:36
gradient descent.
36:38
Okay? So that's it. So gradient descent
36:41
actually doesn't care what the function
36:42
is as long as you can calculate a
36:44
derivative from it. As long as you
36:45
calculate a gradient, you're good.
36:47
Right? So we can just run gradient
36:48
descent on this thing, right?
36:50
Uh one key point here is that gradient
36:53
descent, stochastic gradient descent
36:54
work for
36:55
any models, as long as you can calculate
36:58
good gradients from them.
37:00
It doesn't have to be a neural network.
37:03
Any mathematical function as long as
37:05
it's differentiable and gives you a good
37:07
gradient.
37:08
Okay? So here this is not a neural
37:10
network per se, but we can still use
37:12
gradient descent for it.
37:14
So we do that.
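A hedged sketch of that minimization using TensorFlow's autodiff (full-batch gradient descent here; stochastic gradient descent would sample word pairs instead, and the counts are random stand-ins):

```python
import numpy as np
import tensorflow as tf

V, D = 200, 8
X = np.random.poisson(1.0, (V, V)).astype("float32")
target = tf.math.log(1.0 + X)                 # log(1 + X_ij), the actual values

W = tf.Variable(tf.random.normal((V, D), stddev=0.1))  # random starting embeddings
b = tf.Variable(tf.zeros(V))                           # random-ish starting biases
opt = tf.keras.optimizers.SGD(learning_rate=0.05)

for step in range(500):
    with tf.GradientTape() as tape:
        pred = b[:, None] + b[None, :] + W @ tf.transpose(W)
        loss = tf.reduce_mean((pred - target) ** 2)
    opt.apply_gradients(zip(tape.gradient(loss, [W, b]), [W, b]))
# W now holds the learned embeddings; the biases b can be thrown away.
```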
37:17
Um and when we are done, we would have
37:20
calculated some nice embeddings. We
37:22
would also have
37:23
calculated all these biases, but we don't
37:25
need the biases anymore. We can just
37:26
throw out the biases because we only
37:28
care about the embeddings and how they
37:29
connect to each other.
37:30
Okay? Yeah.
37:33
So when when you're doing that
37:34
regression, are you predicting the
37:36
co-occurrence matrix? Mhm. Okay.
37:39
Exactly.
37:42
So
37:43
actually, let me just show a very
37:45
quick
37:46
numerical example here.
37:48
So let's say, for example...
37:53
you know what?
37:57
So this is say W1 and this is W2.
38:00
Okay? And this is the vector and let's
38:02
assume for a moment that we it has two
38:04
dimensions, okay?
38:06
Two dimensions.
38:07
And we also need to calculate B1 and B2
38:09
which are each just a number, okay?
38:14
And let's say the number for deep and
38:16
learning in the co-occurrence matrix,
38:18
let's say it has occurred 104
38:20
times.
38:21
So all we are doing is to say log of
38:24
104.
38:27
That is the actual value
38:28
minus
38:30
B1 which we don't know plus B2 which we
38:33
don't know
38:34
and then these vectors here, let's just
38:36
call them
38:38
(W11,
38:40
W12)
38:42
and (W21,
38:43
W22).
38:45
Okay? And then we're just doing the dot
38:46
product, which is W11
38:49
times W21
38:51
plus W12 times
38:53
W22.
38:55
Okay? So this is our prediction.
38:58
Where is that cool laser pointer? Yeah.
39:00
So this is our prediction.
39:03
This is the actual.
39:05
So all we do is to say, "Okay,
39:07
this thing, the difference, we're going
39:09
to square it."
39:11
And then we're going to do the same
39:12
exact thing for every other word pair.
39:16
Okay? And when we are done with all of
39:17
that thing, we just take this whole
39:19
thing
39:20
and say gradient descent minimize.
39:23
So then it has to find the B's and the
39:26
W's for every pair,
39:28
every word.
39:29
So that's actually what's going on.
39:31
Make sense?
39:37
All right. So by the way, here
39:41
I said,
39:43
you know, let's assume that the
39:45
embeddings are just vectors of
39:47
dimension two.
39:51
Well,
39:52
that's an arbitrary decision that I made
39:54
just to show you how it works because I
39:55
was doing it by hand. But more
39:58
generally, we get to choose how long
39:59
these vectors are.
40:01
Right?
40:02
And the longer the vector, the more
40:04
interesting ways it can actually
40:05
reproduce the co-occurrence matrix. It
40:07
has more flexibility. But the longer the
40:09
vector, what is the risk that you run?
40:13
Overfitting.
40:14
Because these are all parameters at the
40:16
end of the day. More parameters you
40:17
have, the more risk of overfitting.
40:19
Okay? So, you get to choose how big
40:21
these things can be. Uh yes.
40:24
Um don't you find it surprising that
40:26
we're able to fit the model where we
40:29
have a lot more parameters than we have
40:30
data because usually with most machine
40:32
learning, you would
40:33
like to not have a lot of parameters,
40:35
but here we're going to have
40:37
as you said, the number of dimensions
40:40
times more parameters than we have
40:42
data points. Well, here in this
40:44
particular case, as it turns out, um
40:46
let's assume that you only have 10
40:48
words, right?
40:49
And for each word, let's assume that you
40:51
have... let's just keep the math
40:53
simple. You have a two-dimensional
40:55
vector.
40:56
So, 10 words × 2 dimensions, that's 20.
40:58
Plus you have 10 biases for the words,
41:00
right? So, that's another 10, that's 30.
41:02
But 10 × 10: the matrix has 100 entries.
41:06
So, because the matrix is an order
41:08
n squared matrix, you'll have a lot more
41:10
numbers than parameters.
41:13
In this particular case, you have more
41:14
data than parameters.
41:17
So, that particular problem doesn't
41:18
apply in this case.
41:20
But that does show up in other cases and
41:22
there is some
41:23
very interesting research in neural
41:24
networks which suggests that oftentimes
41:26
the traditional assumptions of data and
41:29
overfitting and all
41:30
can all be called into question under
41:32
some situations.
41:33
Um happy to tell you more offline, but
41:35
if you're curious, just Google something
41:37
called double descent.
41:39
You know what I mean.
41:42
But in this case, it's not a problem.
41:46
Okay.
41:47
So, what that means is that we can
41:49
choose how big these things are. So, if
41:51
you look at one-hot
41:53
vectors, right? Where
41:55
there's a one and everything else is
41:57
zero depending on the position of the
41:58
word, these are long vectors, as long as
42:00
the vocabulary, right? As we saw earlier.
42:03
Word embeddings on the other hand,
42:05
right? They can be very dense, right?
42:07
The numbers
42:08
that make up these embeddings, we're
42:10
actually going to figure out from the
42:11
data what they are. So, it can be
42:13
anything. It can So, the first dimension
42:15
may stand for some combination of, you
42:17
know, um
42:19
brightness plus speed plus animalness or
42:22
something. We have no idea what it
42:23
means.
42:24
All we know is that it's able to
42:26
reproduce the co-occurrence matrix
42:27
really well, so it has probably
42:29
figured something out.
42:30
Okay? And so, we can keep it really
42:32
short. So, the word embeddings tend to
42:33
be very
42:35
dense,
42:36
meaning not zeros and ones, but some
42:38
arbitrary numbers. They're much lower
42:39
dimensional, and of course learned
42:40
from data.
42:41
Right? So,
42:43
so once you do this, once you actually
42:45
run GloVe on this data and do gradient
42:47
descent and so on and so forth, uh you
42:49
will actually come up with embeddings
42:51
and then you can actually plot the
42:52
embeddings. You can take
42:54
these
42:55
embeddings and just plot them. Here
42:58
they're not literally plotting the first
42:59
two dimensions. They're using a
43:01
particular technique called t-SNE, which
43:03
is a way to take long vectors and
43:05
project them to 2D space for
43:07
visualization purposes.
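A hedged sketch of that visualization step, assuming scikit-learn's t-SNE (the random matrix below is a stand-in for real 100-d GloVe vectors and their word labels):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

emb = np.random.normal(size=(50, 100))   # stand-in for (num_words, 100) GloVe
words = [f"word{i}" for i in range(50)]  # stand-in labels

xy = TSNE(n_components=2, perplexity=10).fit_transform(emb)  # project to 2-D
plt.scatter(xy[:, 0], xy[:, 1])
for (x, y), w in zip(xy, words):
    plt.annotate(w, (x, y))
plt.show()
```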
43:09
And you can see here
43:11
some very interesting things are showing
43:12
up. So, they plotted the
43:15
embedding for brother,
43:17
nephew, uncle, sister, niece,
43:19
aunt, and so on and so forth. It's all
43:20
showing up here.
43:22
This the embedding for man, embedding
43:24
for woman,
43:25
sir, madam,
43:28
empress, heir,
43:29
duke, emperor, king. You get the idea.
43:32
Right? So, clearly there are patterns
43:34
here where
43:35
things which are sort of similar in
43:37
their nature are all hanging out
43:38
together in the same part of the space.
43:41
Which is comforting, which is good to
43:42
know.
43:44
Right?
43:44
Now, but as I mentioned earlier, it's
43:46
not just about the fact that similar
43:48
things happen to be near each other.
43:50
The direction also actually matters. And
43:53
beautiful things happen when you look at
43:54
directions. So, for instance,
43:57
you know, let's say that
44:00
man and you want to go from man to
44:01
brother.
44:03
Okay? So, to go from man to brother, you
44:05
have to start with man and then travel
44:07
along this arrow, right? To get to
44:09
brother.
44:11
So, this arrow has some notion of a
44:14
person becoming a sibling.
44:18
Right?
44:19
So, you would hope that if you take that
44:20
same arrow
44:22
and then
44:23
start here with that arrow, hopefully
44:26
the woman will become a sister.
44:29
Sure enough, it does.
44:32
So, this is called word vector algebra.
44:35
Right? Embedding algebra. And these
44:37
relationships are actually showing up in
44:39
the data. We didn't tell it any of these
44:41
things.
44:42
We just literally gave it the
44:43
co-occurrence matrix
44:44
and asked it to reproduce
44:46
it.
44:47
So, I find it pretty shocking that these
44:49
things are actually true.
44:52
And it gives us evidence and comfort
44:55
that whatever has been learned does have
44:57
some deep connection to describing the
44:59
underlying nature of what's going on.
45:01
It's not some statistically fluky
45:03
artifact.
45:05
Um yeah.
45:07
So,
45:07
you said
45:08
relatedness comes from context, from adjacency to other
45:11
words, and not from
45:12
the words appearing in the same place, right?
45:15
Because they won't appear in
45:16
the same sentence.
45:17
They have...
45:19
Right.
45:20
They won't appear in the same sentence,
45:22
but the pattern of co-occurrence will be
45:23
the same for them.
45:25
Which is what we've been able to
45:26
reproduce with these embeddings. So,
45:28
that's the key idea.
45:34
Um
45:34
so, my question is: how are we
45:37
able to capture all these directions in
45:40
2D
45:41
matrix versus a multi-dimensional matrix
45:44
because I feel like okay, so this
45:46
relationship is kind of
45:47
uh
45:48
confirmed that you're moving to
45:50
kind of like
45:51
family or like blood relationship or
45:53
something of the sort, but like how does
45:54
it not mess up the other sides of that
45:56
matrix? Like
45:58
No, this is just a visualization thing.
46:00
So, we're basically taking this...
46:02
you know, as you will see, GloVe embeddings
46:04
come in lots of different sizes. And
46:06
this one, I think, uses the 100-dimensional
46:08
embedding and just projects it to 2D
46:10
space using a particular technique and
46:12
then looks to see what's going on.
46:15
Um yeah.
46:17
If the input data, the co-occurrence
46:20
matrix, is biased, aren't we amplifying
46:22
that bias? Yes, we are. It's a
46:24
great observation. Uh any sort of data
46:26
you scrape from the internet and use for
46:28
this sort of modeling exercise will be
46:30
subject to all the biases that produced
46:32
the data in the place first place. And
46:34
the model will faithfully learn those
46:36
biases. And if you're not careful, it'll
46:38
perpetuate them.
46:40
So, and that's a whole very important
46:41
topic that unfortunately we won't cover in
46:43
this course because of time constraints,
46:45
but it's something you always have to
46:46
worry about when you're building these
46:47
models.
46:50
How do you think about the
46:51
dimensionality of the embeddings not the
46:53
2D representation of the actual data?
46:55
The one that we choose, that's that's in
46:57
our hands. So, you should think of them
46:59
as a hyperparameter.
47:00
So, much like the number of hidden units
47:03
to use in a particular hidden layer,
47:05
um it's a hyperparameter. Uh so, you
47:06
know, I would again start small and if
47:09
it solves the problem that you're trying
47:11
to solve with these embeddings, great.
47:13
If not, keep increasing them. And at
47:15
some point there might be a
47:16
flattening out and an overfitting sort of
47:19
dynamic and then you stop. So, just
47:20
think of it as a hyperparameter.
47:22
Yeah.
47:24
Do you see any benefit, in practice, to using
47:26
something like penalized regression here,
47:28
to make the embeddings more
47:31
sparse or just like
47:33
lowering the magnitude of them? Yeah.
47:36
Yes. So, there are lots of techniques to
47:39
uh
47:40
to apply regularization in the
47:42
estimation itself of all these numbers.
47:44
Um happy to give you pointers. It's I'm
47:46
just going with like the simplest
47:47
version possible.
47:49
Yeah.
47:50
Am I understanding why overfitting is a
47:53
problem in this case? Because we're not doing
47:55
any out-of-sample
47:58
prediction. So wouldn't you want
48:00
the embeddings to be
48:02
high dimensional, so you can capture
48:03
all
48:04
your relationships? Interesting
48:06
question. So, the question is given that
48:08
there's no notion of an out-of-sample
48:11
test set that we're going
48:12
to evaluate these things on, why do we
48:14
really care about overfitting?
48:16
Shouldn't we do the best we can to capture
48:18
everything in the data, right?
48:20
Well,
48:21
the thing is
48:22
even when you're not trying to use it
48:24
for out of sample prediction, you do
48:26
want to make sure that your model only
48:29
captures the true patterns and not the
48:31
noise.
48:32
In every data set, there's always noise.
48:35
Right? And you want it to capture a
48:36
signal but not the noise.
48:38
And regardless of what you use it for.
48:40
Because if it captures the noise, then
48:42
the insights you draw from the word
48:44
embeddings may be flawed.
48:45
That's the reason.
48:48
Okay.
48:49
Um all right, so let's keep going. So,
48:51
here the algebra is brother minus man
48:53
plus woman is sister.
48:55
That's it. Human biology reduced to a
48:57
single sentence.
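A toy sketch of that algebra with made-up 2-D vectors, just to make the arithmetic concrete (real GloVe vectors famously show the same behavior):

```python
import numpy as np

emb = {                                  # hypothetical toy embeddings
    "man":     np.array([1.0, 0.0]),
    "woman":   np.array([1.0, 1.0]),
    "brother": np.array([2.0, 0.1]),
    "sister":  np.array([2.0, 1.1]),
}

def nearest(query, exclude=()):
    # highest cosine similarity among the remaining words
    sims = {w: v @ query / (np.linalg.norm(v) * np.linalg.norm(query))
            for w, v in emb.items() if w not in exclude}
    return max(sims, key=sims.get)

query = emb["brother"] - emb["man"] + emb["woman"]          # = [2.0, 1.1]
print(nearest(query, exclude={"brother", "man", "woman"}))  # sister
```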
48:58
All right. So, now the pros and cons of
49:00
these things: you should use
49:02
something like a GloVe embedding if you
49:04
don't have enough data
49:07
to
49:07
learn a task-specific embedding for
49:10
your own vocabulary. As I'll show
49:11
you in the Colab, you can actually learn
49:13
these things just for your own data set
49:14
if you want. You don't have to use these
49:16
Glove embeddings. But the reason to use
49:18
these pretrained embeddings is that if
49:20
you're working with natural language,
49:22
you know, the word is the word, right?
49:24
It means something.
49:25
And so, there's no reason
49:28
for your model, for your little use
49:30
case, to actually somehow learn
49:32
all the fundamentals of English.
49:35
The fundamentals of English are the
49:36
fundamentals of English. May as well
49:37
learn it once and then piggyback on it.
49:40
So, that's the whole idea of using
49:42
pre-trained embeddings.
49:43
Because these things are all common
49:45
aspects of language. May as well learn
49:47
them using all the data you can throw at
49:48
it and then you can sort of fine-tune
49:50
and tweak and adapt to your particular
49:52
use case.
49:53
Right? So, this is particularly
49:55
useful when you don't have a lot of data
49:57
in your particular use case.
49:58
Uh right? That's one big advantage. Now,
50:01
it does have the drawback that this
50:03
embedding will not be customized to your
50:04
data.
50:05
Right? For example, if you're trying to
50:06
build an application for a medical or
50:08
legal use, it's going to have a lot of
50:10
jargon.
50:11
Right? And this pre-trained embedding
50:13
trained on all of Wikipedia may not
50:14
capture enough of the jargon and know
50:16
its meaning really accurately. So,
50:18
you may still want to take this
50:19
pre-trained
50:21
thing and then adapt and
50:22
fine-tune it using your jargon-heavy,
50:25
domain-specific data set.
50:28
Okay, those are some of the things to
50:29
keep in mind.
50:32
And of course, we can also learn it from
50:33
scratch if you want, and in the Colab I
50:35
demonstrate all these options.
50:38
So, when you're working with embeddings
50:39
in Keras, what we do is...
50:41
remember STI,
50:43
where we standardize and
50:45
tokenize and index, right? At this
50:48
point, we go from integers to vectors
50:50
and so far we have been using integers
50:51
to one-hot vectors. Here, we're going to
50:54
use embedding vectors that we're going
50:55
to learn, or that we're going to reuse
50:57
from GloVe. And so, what we do is we
51:00
tell Keras's text
51:02
vectorization layer to do only STI.
51:06
And then we will use a new layer called
51:08
the embedding layer to do the encoding.
51:10
Yeah, that's how we're going to
51:11
divide it up.
51:14
So, we'll take a look at this first,
51:17
before we switch to the Colab. So,
51:18
before
51:20
we told Keras in this layer output mode
51:23
should be multi-hot or whatever, right?
51:26
Here, we don't want it to actually
51:27
encode anything in multi-hot. We just
51:29
want it to give us integers back. So, we
51:30
tell it: give me int.
51:32
Okay? That's the first change.
51:35
We tell it: give us int. If you
51:36
say give us int, it'll stop with STI
51:39
and just give you the integers.
51:41
And then, the thing is that
51:43
all the incoming sentences are going to
51:45
have different lengths. So, what we want
51:47
to do is we want to actually take all
51:48
these sentences and sort of normalize
51:50
them so they are of the same length.
51:52
Okay?
51:53
And the way we do that,
51:55
very quickly, is
51:57
that we choose a maximum
51:59
length for the
52:01
sentences, and then if something
52:04
exactly fits that length, perfect.
52:05
Let's say in this case we want a max
52:07
length of five. Cats sat on the mat is
52:08
exactly five. Boom, fits perfectly. But
52:11
if something is smaller, I love you is
52:12
only three of these things, we actually
52:14
pad it with something called the pad
52:16
token.
52:17
Much like the unk token, pad token is a
52:19
special token which we use for padding.
52:22
And
52:23
Keras, you will see, will use zeros for
52:25
this padding, so that it fills it
52:27
up and gets all the way to the end. And
52:29
if you have something which is much
52:31
longer than five, you just truncate
52:33
everything else and just use the first
52:34
five.
52:36
So, this is what we do to get all the
52:38
sentences to be of the same length.
52:42
Okay?
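In code, this step might look like the following minimal sketch; the numbers and variable names here are illustrative, not the Colab's.

```python
# Minimal sketch, assuming Keras's TextVectorization layer.
import keras

vectorize = keras.layers.TextVectorization(
    max_tokens=100,             # vocabulary size: most frequent words kept
    output_mode="int",          # stop after Standardize-Tokenize-Index
    output_sequence_length=5,   # pad with 0 (the pad token) or truncate
)
vectorize.adapt(["the cat sat on the mat", "i love you"])  # learn the vocabulary

print(vectorize(["i love you"]))
# -> something like [[4 5 6 0 0]]: three real ids, then two pad zeros
```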
52:43
And once we do that we then go to the
52:45
embedding layer.
52:47
And the embedding layer is actually very
52:49
simple.
52:50
What is What is an embedding? It's just
52:51
a vector and we need a vector for every
52:53
token.
52:54
Of course, we're going to learn these
52:55
vectors. We need one for every token.
52:57
So, in this case for example, uh let's
52:59
say that these are all the tokens we
53:01
have
53:02
in our vocabulary after the STI process.
53:05
Maybe in this case we have 5,000 tokens.
53:08
Each token we have this embedding
53:09
vector, right? And we choose what the
53:11
dimension of that embedding vector is,
53:12
right? And so, we can set it up by
53:15
saying keras.layers.Embedding, and we
53:17
tell it max tokens, which means how
53:19
many rows we have here.
53:21
You know, what is the
53:21
vocabulary size that we're working with?
53:23
And then we tell it, okay, this is how
53:25
long I want each embedding vector to be.
53:28
So, the number of rows and the size of the columns:
53:31
that's the embedding layer. And we'll
53:33
use it in a second. I just want to show
53:34
it to you here because it's
53:35
slightly clearer.
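A rough sketch of that layer, reusing the lecture's numbers (the variable names are ours):

```python
import keras
import numpy as np

# One row per token in the vocabulary, one column per embedding dimension.
embed = keras.layers.Embedding(
    input_dim=5000,   # max tokens: vocabulary size (rows)
    output_dim=100,   # embedding dimension we chose (columns)
)

ids = np.array([[23, 9, 5, 0, 0]])  # integer ids from the vectorization layer
vectors = embed(ids)                # shape (1, 5, 100): one vector per id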
53:37
So, when an input sentence arrives, the
53:38
text vectorization layer will run STI
53:40
on it. It'll truncate and pad it to max
53:42
length as needed. So, let's say this
53:44
phrase comes in, STI will give you the
53:46
same tokens plus pad pad because let's
53:48
say the max length is five and then
53:50
these are the corresponding integers.
53:52
And then
53:53
the embedding layer will just look up
53:55
the corresponding vector. So, for
53:56
example here, we need
53:59
to look up the vectors for 23, 9, 5, 0,
54:01
and 0. So, we just go here and look up
54:04
rows 23, 9, 5, and 0. And then once we have
54:07
that, boom.
54:08
This is the resulting output. So,
54:10
whatever input sentence comes in, we
54:12
have now
54:13
five embedding vectors that have been
54:14
looked up from the embedding layer.
54:17
And once we do that
54:20
this is a table. So, I love you comes
54:22
in, it becomes this table. As we have
54:24
seen before
54:25
neural networks can only accommodate
54:27
vectors as inputs. We need to you know,
54:30
make this into a vector. And as we have
54:32
done before, you know, we can either
54:33
take all these things and concatenate
54:35
them into one long vector, or we can
54:37
find a way to average them or sum them
54:39
and things like that, right? As we have
54:40
seen before. And here,
54:42
the simplest thing is probably
54:44
just to average them. So,
54:46
These are some options, but
54:48
we'll average them here. And this is
54:51
called the GlobalAveragePooling1D
54:53
layer. And all it does is: whatever
54:55
table you give it, it just
54:57
takes each dimension and averages it.
54:59
The first dimension average, second
55:01
dimension average, and so on and so
55:02
forth. And once that's done,
55:04
that's the whole pipeline.
55:05
So,
55:07
the phrase comes in, STI gives you these
55:09
things, padding as needed or truncating
55:11
as needed. We look up the embeddings
55:14
from the embedding layer and then we get
55:16
all of this. We do global average
55:18
pooling on it and it's done.
55:20
The resulting thing is a vector that can
55:22
then be passed into hidden layers just
55:24
like we normally do.
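Here is a tiny sketch of what that pooling does to such a table; the numbers are made up.

```python
import numpy as np
import keras

# A "table" of 3 token vectors, each 2-dimensional (batch size 1).
table = np.array([[[1., 2.],
                   [3., 4.],
                   [5., 6.]]], dtype="float32")

pool = keras.layers.GlobalAveragePooling1D()
print(pool(table))  # [[3. 4.]] -- the per-dimension average over the 3 tokens
```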
55:27
I'm going over this a little fast, but
55:29
make sure you look at it afterwards and
55:31
understand every step, and the Colab
55:33
will mirror this
55:34
you know, perfectly.
55:36
All right, so let's switch to the
55:37
Colab.
55:39
Okay. All right.
55:41
Can folks see this okay?
55:43
All right, so we'll do the usual.
55:46
Um
55:47
import all the stuff we need and then
55:49
because I want to plot some of these uh
55:51
loss and accuracy curves to
55:53
you know, just to see what's going on,
55:55
I'll just bring in the functions from
55:56
the previous Colabs.
55:58
Here.
55:59
And then um and I think I already have
56:01
downloaded this. Let me just make sure I
56:03
have it.
56:08
Uh it's not there. Okay.
56:11
Do it again.
56:13
This is the same songs data set that we
56:14
looked at on Monday.
56:17
Okay.
56:19
So, roughly 49,000 examples as we saw
56:21
before. We'll one-hot encode them.
56:25
All right, so there's a bunch of stuff
56:27
that we already covered in class. So,
56:28
this is the thing
56:30
uh this URL has all the glove vectors
56:33
available for download. I downloaded it
56:35
uh before class because it takes a few
56:37
minutes. Uh and I've also unz- Did I
56:39
unzip it?
56:41
Uh yes, I did. And so, let's just look
56:43
at the first few.
56:46
All right, so these are all the first
56:47
few. We'll create a sort of an easier to
56:49
view version of these GloVe vectors.
56:54
So, I'm going to use the vectors which
56:56
are 100 long, but it comes in many
56:58
different shapes.
56:59
So, we have 400,000 word vectors.
57:03
Each is 100-dimensional.
57:05
Uh and these all have been calculated
57:07
from Wikipedia using
57:09
the model we described using gradient
57:11
descent. Okay?
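For reference, one plausible way to load that file into a Python dict; glove.6B.100d.txt is the standard Stanford filename for the 100-dimensional vectors, and the local path is an assumption.

```python
import numpy as np

glove = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        word, *values = line.split()          # each line: word v1 v2 ... v100
        glove[word] = np.asarray(values, dtype="float32")

print(len(glove))           # roughly 400,000 entries
print(glove["movie"][:5])   # first 5 of the 100 dimensions
```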
57:12
Uh all right, so this is the
57:15
vector for the word movie.
57:18
Yeah, I don't know what these dimensions
57:19
mean, but there's something going
57:21
on. It has figured stuff out.
57:23
Uh but the proof is in the pudding,
57:24
right? So, all right, now we'll first
57:26
set up the text vectorization and
57:28
embedding layers like we saw before.
57:30
Um and so, I'm going to use uh a max
57:33
length of 300 for the songs.
57:36
Um right? Because all the sentences have
57:38
to be the same length. And you might be
57:40
wondering, okay, why did you pick 300
57:42
and not say 400 or 200? So, typically
57:44
what you do is you actually look at
57:46
the length distribution of the songs you
57:48
have, and you look
57:51
for something like an 80/20 cutoff, one of
57:52
those things.
57:54
out 90% of the songs have less than or
57:56
equal to 300 words in our data set. So,
57:59
I'm just going to go with 300. Okay?
58:00
It's pretty good. The problem is, if
58:03
you instead look at the song
58:04
which has the maximum length,
58:06
that might be like 3,000 words, and
58:09
there would be hardly any songs
58:10
3,000 long. You'd just be wasting a lot of
58:12
capacity by doing that. So, you're just
58:13
being a little pragmatic here.
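As a sketch, picking that cutoff might look like this, where lyrics is assumed to be the list of song strings in the Colab:

```python
import numpy as np

# Word counts per song, then the 90th percentile of the distribution.
lengths = [len(song.split()) for song in lyrics]
print(np.percentile(lengths, 90))  # ~300 for this data set, per the lecture
```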
58:16
So, okay. And then, as before, for
58:18
the vocabulary itself, we tell Keras use
58:20
the most frequent 5,000 words, right?
58:22
when you're doing the
58:24
STI. So, we do that and we tell it
58:27
the output mode is int like we saw
58:29
before.
58:32
There we go.
58:35
Okay, perfect.
58:36
Okay, this is a very dangerous thing
58:39
where somebody is remotely changing it
58:41
in another tab somewhere.
58:44
Fingers crossed. Okay.
58:50
Okay. So, we have this, and this is
58:52
what we did with all this stuff, as
58:54
I've covered. So, now we will adapt this
58:57
layer as we have seen before using all
58:59
the lyrics we have.
59:04
And once we do that, we'll take a look at
59:06
the first few.
59:08
And so, here's a very important thing.
59:10
Before, when we asked it to do multi-hot
59:12
encoding and so on, on Monday,
59:14
the zeroth position was unk.
59:17
Right? Unk had zero. But here, unk
59:19
actually has one.
59:21
And the reason is that
59:23
the zeroth position is going to be
59:25
used for padding. You can think
59:28
of this as the empty string. That's how
59:30
Keras will print out pad.
59:32
So, the zero position is the padding,
59:35
the pad token. The first position is the
59:37
unk token. Okay?
59:39
So, it's an important thing here.
59:41
So, let's say that we do
59:44
"HODL you're the best."
59:46
We vectorize it.
59:49
Do you think HODL
59:51
is going to be part of those 400,000
59:52
word vectors?
59:54
They were trained on Wikipedia. Not yet. So,
59:57
Um all right. So, let's try that.
1:00:03
Okay, and as you can tell,
1:00:05
um
1:00:05
HODL is an unknown word, right? That's
1:00:08
why uh it's showing up here.
1:00:12
Right. So, one is unknown, right? The
1:00:14
index value one is unknown. Zero is pad.
1:00:18
But then,
1:00:19
this is unknown for HODL,
1:00:21
then you're, the, best, and then
1:00:25
everything else from that point on is a
1:00:26
zero because we are padding all the way
1:00:28
to 300.
1:00:30
Okay? So, that's why you see all these
1:00:31
zeros here.
1:00:32
All right. Uh now, let's just, you know,
1:00:34
run everything through
1:00:37
the vectorization layer, and then we'll
1:00:38
get to the embedding layer.
1:00:44
Okay. Now, first
1:00:48
there's just a bit of Python
1:00:50
housekeeping
1:00:51
to create a nice, easy-to-look-at
1:00:54
matrix. So, what we're going to do is
1:00:56
we're actually going to create a nice
1:00:58
matrix which shows us all
1:01:00
the GloVe embeddings.
1:01:02
Um
1:01:04
And so, here, this is the embedding
1:01:05
matrix.
1:01:07
And this matrix has only 5,000 words,
1:01:09
and each is 100 long.
1:01:11
Why is this embedding matrix only 5,000
1:01:13
even though we downloaded 400,000
1:01:15
vectors?
1:01:21
Right. So, clearly the 5,000 we used
1:01:23
there has some bearing on this, but what
1:01:24
is that 5,000?
1:01:30
We told Keras to take the most frequent
1:01:32
5,000 words in our corpus.
1:01:34
So, we'll only have 5,000 in vocabulary.
1:01:36
That's why there's 5,000. So, we grab
1:01:38
just the GloVe vectors for
1:01:40
those 5,000 words that Keras has chosen to
1:01:42
be in the vocabulary. Okay? And that's
1:01:44
our embedding matrix.
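A sketch of that step, assuming the glove dict from the loading sketch above and an adapted TextVectorization layer named vectorize_layer:

```python
import numpy as np

vocab = vectorize_layer.get_vocabulary()   # ['', '[UNK]', 'the', ...] ~5,000 words
embedding_matrix = np.zeros((len(vocab), 100), dtype="float32")

for i, word in enumerate(vocab):
    vector = glove.get(word)          # None for pad, unk, and out-of-GloVe words
    if vector is not None:
        embedding_matrix[i] = vector  # rows 0 (pad) and 1 (unk) stay all-zero
```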
1:01:45
And then, if you look at the first few
1:01:47
rows, the first two rows should be all
1:01:50
zeros because it's pad and unk,
1:01:52
which clearly GloVe doesn't know about.
1:01:54
They're going to be all zeros. And
1:01:57
so, you can see all these zeros here,
1:01:59
and then from the third row on, you start
1:02:00
getting some numbers. Okay?
1:02:02
All right. Next, we'll set up the
1:02:04
embedding layer.
1:02:05
Uh
1:02:06
so, basically, what's going on here is
1:02:07
we tell the embedding layer how
1:02:09
many rows, which is just the vocab size,
1:02:11
max tokens, and what the embedding
1:02:15
dimension is. Well, that's going to be 100
1:02:15
because the GloVe vectors are 100. And
1:02:17
then, here's the thing. You can tell it
1:02:19
in this embedding layer, just use
1:02:23
this matrix I'm giving you as the
1:02:25
embeddings. Because we already know
1:02:25
what the embeddings are. We downloaded
1:02:26
from GloVe, right? So, we will
1:02:28
tell it to use GloVe as the
1:02:30
weights here, as the embeddings
1:02:32
here. So, we initialize it using that
1:02:34
embedding matrix, right? And then, we
1:02:36
tell it
1:02:38
don't train. When we do back propagation
1:02:40
later on, don't change any of these
1:02:41
weights because somebody spent a lot of
1:02:43
money to create these weights for us.
1:02:45
Stanford. So, we don't want to like
1:02:47
further change them. Just freeze them
1:02:49
and use them as they are. Okay?
1:02:51
And this mask zero business I'll come
1:02:52
back to later. Don't worry about it for the
1:02:53
moment.
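Put together, the frozen GloVe-initialized layer looks roughly like this (variable names are ours):

```python
import keras

embed = keras.layers.Embedding(
    input_dim=embedding_matrix.shape[0],   # vocabulary size, e.g. 5,000
    output_dim=100,                        # GloVe dimension
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,   # freeze: backprop will not touch the GloVe weights
    mask_zero=True,    # treat id 0 (the pad token) as masked-out
)
```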
1:02:55
All right. So, once we do that,
1:02:58
we are ready to set up our model. So,
1:03:00
this model is pretty simple. Uh Keras
1:03:02
input, the length, of course, is the
1:03:04
length of the sentence, right? Which is
1:03:05
300 long, and then the input
1:03:08
runs through an embedding layer right
1:03:09
there, right? And out comes a 300 by 100
1:03:12
table, and then we global average pool
1:03:14
it,
1:03:15
right? And that becomes a 100 element
1:03:17
vector, and then we are back in familiar
1:03:19
ground, and we run it through a dense
1:03:20
layer with eight ReLU neurons, right?
1:03:23
And then we run it
1:03:25
through the final output layer, which is
1:03:27
a three-way softmax as before, hip hop
1:03:29
rock pop. And then, we tell Keras that's
1:03:31
our model, and then we summarize it.
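As a sketch, the model just described might be assembled like this, reusing the embed layer from the previous sketch (exact variable names in the Colab may differ):

```python
import keras

inputs = keras.Input(shape=(300,), dtype="int64")     # 300 token ids per song
x = embed(inputs)                                     # (batch, 300, 100), frozen GloVe
x = keras.layers.GlobalAveragePooling1D()(x)          # (batch, 100) averaged vector
x = keras.layers.Dense(8, activation="relu")(x)       # eight ReLU neurons
outputs = keras.layers.Dense(3, activation="softmax")(x)  # hip hop / rock / pop
model = keras.Model(inputs, outputs)
model.summary()
```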
1:03:34
Okay. So, this is what we have. And you can
1:03:36
see here,
1:03:38
the total parameters are 500,835,
1:03:41
but the trainable parameters are only
1:03:42
835.
1:03:44
It's because the total parameters are
1:03:46
all the GloVe embeddings plus the
1:03:49
things we added to the GloVe embeddings
1:03:50
like the hidden layer and so on.
1:03:52
But for the GloVe embeddings, we have
1:03:54
told Keras, freeze it. Do not train it.
1:03:57
Right? Which means only the rest of it
1:03:58
is going to be trainable. That's
1:04:00
the 835. Yeah.
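To check the arithmetic: the embedding table alone is 5,000 × 100 = 500,000 frozen weights; the dense layer adds 100 × 8 + 8 = 808 trainable weights and the softmax layer adds 8 × 3 + 3 = 27, so 808 + 27 = 835 trainable parameters, and 500,000 + 835 = 500,835 in total.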
1:04:03
So, when we do the global average
1:04:05
pooling, don't we lose any
1:04:06
sense of meaning that we gain from the
1:04:09
embedding as we average very different
1:04:12
embeddings together?
1:04:14
Sorry, say that again. I missed the
1:04:15
first
1:04:16
>> If we average the embeddings of apple
1:04:18
and learning, for instance, they are
1:04:20
very different words that are used in
1:04:22
different meanings, so we have different
1:04:23
embeddings; but if we average them, don't we
1:04:26
lose that?
1:04:27
We will lose a bunch of stuff. Yeah,
1:04:28
yeah, yeah. Anytime
1:04:30
you average anything, you're going to
1:04:31
lose some nuance and so on. So, the
1:04:33
real question is: despite that
1:04:36
averaging, is it good enough for you?
1:04:37
And sometimes it's good enough.
1:04:39
Very often it's good enough, as it turns
1:04:41
out. But as you will see when you go to
1:04:42
contextual embeddings, there's just a
1:04:44
better way to do it, right? When you
1:04:45
have contextual embeddings. But it
1:04:47
requires bigger models, more powerful
1:04:49
stuff, and so on and so forth. And
1:04:50
that's where you're going from the
1:04:51
foundations to the advanced stuff.
1:04:53
Yeah.
1:04:56
When we're doing optimization, like
1:04:58
let's say in a real-world problem, it's
1:05:00
often best to optimize everything
1:05:02
together than to optimize one part of
1:05:04
the system and then optimize the other
1:05:06
part of the system.
1:05:07
So, in that case, why wouldn't we want
1:05:09
to also change the embeddings?
1:05:12
I understand why we would
1:05:13
like to stick with
1:05:15
those weights that
1:05:17
some people have spent a lot of money
1:05:19
trying to find, but will
1:05:20
we be able to find more specific uh
1:05:23
embeddings related to our problem if we
1:05:25
let everything be
1:05:26
trainable? Yeah. Absolutely. Absolutely.
1:05:29
And in fact, you will see in the Colab
1:05:30
uh that we will do that next. I just
1:05:33
want to show people you don't have to do
1:05:35
it. You start with not training it
1:05:37
because it's going to be much faster.
1:05:38
And then, you train everything and see
1:05:39
if it gets better. And sometimes it'll
1:05:41
get better, in which case it's great.
1:05:42
Sometimes it won't get better. And I
1:05:44
will also show you, and I probably will
1:05:45
run out of time, so I'll do
1:05:46
it on Monday. I will also show you, hey,
1:05:48
what if you want to do your own
1:05:50
embeddings from scratch without using
1:05:51
GloVe?
1:05:52
So, all possibilities will be covered.
1:05:55
Um yeah. So, to come back to this, this
1:05:57
is the model we have. Um and then, all
1:06:00
right.
1:06:01
So, we'll take a look at the first
1:06:03
few embedding vectors, by the way, this
1:06:05
model.layers
1:06:06
will give you
1:06:09
a list of all the layers, and then you
1:06:10
can just grab any layer you want and
1:06:11
look at its weights. Okay? It's very
1:06:13
handy.
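For example, something like this; the layer index is an assumption about where the embedding layer sits in this model.

```python
# model.layers is a plain Python list of the model's layers.
embedding_layer = model.layers[1]            # index assumed for this model
weights = embedding_layer.get_weights()[0]   # the (5000, 100) embedding table
print(weights[:2])                           # rows 0 and 1 (pad, unk): all zeros
```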
1:06:14
So, we're looking at the weights, and
1:06:15
you can see here
1:06:16
the first two vectors are all zeros
1:06:19
because those stand for pad and unk, and
1:06:21
then we have everything else. So,
1:06:22
everything looks fine so far. And now,
1:06:24
we just, you know, compile and fit it.
1:06:26
So, as usual, Adam, cross entropy,
1:06:28
accuracy.
1:06:30
Um and then, we'll just fit the model.
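Sketched out (the loss name assumes integer genre labels; the data variable names, split, and epoch count are placeholders, not the Colab's):

```python
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",  # assumes integer class labels
    metrics=["accuracy"],
)
history = model.fit(
    train_ids, train_labels,   # placeholder names for the vectorized data
    validation_split=0.2,      # assumed split
    epochs=10,                 # assumed epoch count
)
```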
1:06:33
All right.
1:06:34
It's going to take
1:06:36
a few minutes.
1:06:39
And while it's running: what you
1:06:41
will see in this Colab is that
1:06:43
uh in this particular case, the
1:06:44
embeddings actually don't help a whole
1:06:46
lot.
1:06:47
Why do you think that is?
1:06:51
Could it be because we're
1:06:52
averaging a lot of stuff? Maybe that's
1:06:54
hurting us.
1:06:57
Yeah.
1:06:58
Um I mean, I think that the embeddings
1:06:59
were pre-trained on some corpus, right?
1:07:01
Like Wikipedia or something like that,
1:07:03
that is a little bit
1:07:05
different from the language we tend to
1:07:06
use in song lyrics. So, maybe
1:07:08
its ability
1:07:09
to extract the
1:07:11
meaning of a word like
1:07:12
"candy"
1:07:13
from a song lyric
1:07:16
is limited, because it's
1:07:18
thinking of all the other ways
1:07:19
that word could be used.
1:07:20
Yeah, so there could be a mismatch
1:07:22
between the corpus on which the
1:07:23
pre-trained stuff was trained on versus
1:07:26
the the corpus that you're working with
1:07:27
right now. That's one big reason. The
1:07:29
other reason is that we actually
1:07:31
have 50,000 examples, basically.
1:07:34
It's a lot of data.
1:07:36
So, when you have a lot of data, you may
1:07:37
not need any of these things.
1:07:39
These things tend to do really well when
1:07:41
you don't have a lot of data, which
1:07:43
means you get to piggyback on
1:07:46
what these embeddings have learned from
1:07:47
all of Wikipedia.
1:07:49
So, when you have a smallish data
1:07:52
set, basically, the rule of thumb
1:07:54
here is that when your data is really
1:07:55
small, try to use a pre-trained model.
1:07:58
Right? And that's what you saw with the
1:07:59
handbags and shoes classifier, right? We
1:08:01
had 100 examples of handbags and shoes,
1:08:03
and we used ResNet to basically get
1:08:04
to 100% accuracy.
1:08:06
The same sort of logic applies here.
1:08:08
All right. So,
1:08:09
here, let's see what's happening. Uh
1:08:11
okay, it's done.
1:08:12
So, we'll plot.
1:08:16
Right.
1:08:16
Okay, look at this: a very
1:08:18
well-behaved loss function curve.
1:08:21
Uh
1:08:25
Okay.
1:08:26
So,
1:08:27
uh there doesn't seem to be any massive
1:08:28
overfitting going on. They are moving
1:08:30
really nicely in lockstep. Let's see
1:08:32
what the thing is.
1:08:36
Okay, 63%, which is not great. Um right?
1:08:39
Uh it's not as good as what we saw
1:08:40
before when we used all 50,000 examples
1:08:43
and just trained something from scratch,
1:08:44
and that's just because in this case, we
1:08:45
have lots of examples, so these pre-trained
1:08:47
embeddings aren't, you know, as helpful
1:08:49
as they could be.
1:08:50
But if you have a small data set, they
1:08:52
could be very helpful. And now, we go to
1:08:54
what
1:08:56
he pointed out. Like, why can't we just,
1:08:58
you know, optimize these embeddings,
1:08:59
too? Why do we have to
1:09:00
treat them as sacred?
1:09:02
Let's
1:09:03
just unleash back
1:09:06
prop on them and see what happens.
1:09:07
So, we'll do that. Um
1:09:11
So, here, what we do is we retrain it,
1:09:13
but here, we set trainable equals true
1:09:15
for the embedding layer. Okay? This is
1:09:17
the key step. Trainable equals true.
1:09:19
Otherwise, it's unchanged.
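The only change from the earlier frozen sketch is the trainable flag:

```python
import keras

embed_finetune = keras.layers.Embedding(
    input_dim=embedding_matrix.shape[0],
    output_dim=100,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=True,   # the key change: backprop may now adjust the GloVe weights
    mask_zero=True,
)
```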
1:09:20
Uh and then,
1:09:23
let's skip that.
1:09:27
We'll run it and see what happens. So
1:09:28
before it was whatever 63% accuracy or
1:09:31
something, we'll see if it gets better
1:09:33
if you train the whole thing.
1:09:35
And the thing is you can never be sure.
1:09:38
Right? Because it may start to overfit.
1:09:40
Uh which is why you just have to
1:09:41
empirically see what's going on. There
1:09:42
are no guarantees.
1:09:47
Um all right, any questions while it's
1:09:48
training?
1:09:50
Yeah.
1:09:51
In that first graph, when you have
1:09:54
the training accuracy still increasing,
1:09:56
that might suggest that you could train
1:09:58
for even more epochs. Correct. Exactly.
1:10:00
Exactly. So in that curve,
1:10:02
we saw that the training was continuing
1:10:03
to increase. Typically what's going to
1:10:05
happen is the training will continue to
1:10:06
get better the more you train it. The
1:10:08
key thing is: is the validation also
1:10:10
improving. If the validation continues
1:10:12
to improve, there is a little bit more
1:10:13
gas left in the tank. You can keep
1:10:15
training more. If it starts to flatten
1:10:17
and even worse if it starts to go down,
1:10:19
then you want to pull back.
1:10:21
Yeah.
1:10:23
So you had limited the vocabulary
1:10:25
to
1:10:27
the most common 5,000. And then the
1:10:29
width of that was 100. What is the 100?
1:10:31
The 100 is just the length of the GloVe
1:10:33
vector.
1:10:34
Does that mean that it can only capture
1:10:37
how that word is related to 100 other
1:10:39
words? No, no. Basically, we are
1:10:41
saying that for every word, its intrinsic
1:10:43
meaning can be captured using a vector
1:10:45
of 100 dimensions.
1:10:48
Those dimensions mean something. We
1:10:49
don't know what it is. The first
1:10:51
dimension could mean color. Second could
1:10:53
mean some sort of location. The third
1:10:55
could mean some sort of time of the
1:10:57
year. We just have no idea.
1:11:01
Okay, and then the pre-trained model,
1:11:02
we're not going to learn it;
1:11:04
it already has those
1:11:05
dimensions. We don't know what they are,
1:11:07
but it has something. The people who
1:11:08
created it don't know what they are
1:11:10
either.
1:11:10
All they know is that for each word they
1:11:13
learned a 100 long vector.
1:11:15
And that 100-long vector was able to
1:11:18
kind of recreate the co-occurrence
1:11:20
matrix.
1:11:21
And then they probed it using that
1:11:23
visualization of man woman sister
1:11:25
brother all that stuff and it seems to
1:11:26
sort of fit with what you would expect.
1:11:29
Can you think of it as analogous to uh
1:11:31
when we did the convolutional ones, you
1:11:33
have the number of kernels, right? So in
1:11:35
this case, if you have 32 kernels,
1:11:37
it's sort of like 32 things it can
1:11:39
learn.
1:11:40
I think that's actually a great analogy.
1:11:42
I love it. That's that's a great way to
1:11:43
think about it. Yes. Uh much like we got
1:11:46
to decide how many filters to
1:11:48
have, here we get to decide how long the
1:11:50
embedding dimension needs to be and our
1:11:51
hope is that the more things we are able
1:11:53
to accommodate, the more complicated
1:11:55
things it will pick up. Right? Uh at the
1:11:57
same time, you don't want to have too
1:11:58
many of these things because it's going
1:11:59
to start picking up noise.
1:12:01
And that's not a good That's never a
1:12:03
good thing.
1:12:05
Okay.
1:12:06
Um
1:12:07
Another question on this side?
1:12:09
Yeah.
1:12:10
Go ahead. My
1:12:12
question is
1:12:13
why do we use embeddings
1:12:15
and not the actual
1:12:17
co-occurrence matrix rows to
1:12:20
represent words, right? Like why do we
1:12:23
need to abstract Yeah, yeah, yeah.
1:12:25
That's actually a
1:12:26
good question. One
1:12:28
immediate reason is that that row is
1:12:30
500,000 entries long.
1:12:33
Right? So you want a compact dense
1:12:35
representation of a word.
1:12:37
The second thing is that thing is
1:12:39
subject to all the counts of the
1:12:40
Wikipedia corpus. It's not normalized.
1:12:43
So you need to normalize it so that if
1:12:45
you take any two rows and do dot
1:12:47
product, you will get some number which
1:12:49
is sort of in a narrow range. Otherwise
1:12:50
things don't become comparable.
1:12:53
Now, both these objections can be
1:12:55
handled. You can normalize, you can
1:12:57
reduce the size of the corpus and so on
1:12:59
and so forth. And in fact that used to
1:13:00
be a very common way people used to do
1:13:01
it before.
1:13:03
But what they have discovered is that
1:13:04
these the way we learn embeddings now
1:13:06
tends to be much more effective in
1:13:07
practice.
1:13:10
So what we thought is:
1:13:13
what this process does is it
1:13:16
creates this like n-dimensional
1:13:18
incomprehensible matrix that captures
1:13:21
in essence a summarized version of these
1:13:23
relationships.
1:13:25
Correct. A compact representation of
1:13:28
relationships which is not subject to
1:13:30
the size of your vocabulary.
1:13:33
So you know, you have 500,000 words
1:13:34
today, tomorrow somebody comes up with
1:13:36
the word selfie, which didn't
1:13:37
exist 5 years ago.
1:13:39
And now your corpus has gotten a little
1:13:40
bit bigger, right? So here it's very
1:13:42
compact and it tends to have a much
1:13:43
longer shelf life.
1:13:48
Yeah.
1:13:49
Uh all right, so let's see where we are.
1:13:54
Uh okay. So evaluate.
1:13:59
Almost 69%. It was 63, went to 69. So
1:14:02
clearly here training the whole thing
1:14:04
including GloVe actually helps. And
1:14:06
so that sort of begs the question, well,
1:14:08
if training GloVe helps,
1:14:11
maybe we should actually train the whole
1:14:13
thing from scratch.
1:14:15
Like why the hell not, right? Why the
1:14:16
heck not? I apologize.
1:14:19
So uh what we'll do is we'll actually
1:14:21
create our own embeddings and just train
1:14:22
them. And here we don't have to worry
1:14:24
about co-occurrence matrices and so on
1:14:26
and so forth because we have a very
1:14:27
specific objective. We want to be very
1:14:29
accurate in predicting genre for these
1:14:30
songs.
1:14:32
The people who worked on
1:14:35
GloVe,
1:14:35
they didn't have a specific objective. They
1:14:36
just wanted to create embeddings that
1:14:37
were generally useful.
1:14:39
Okay? Here we want to be specifically
1:14:41
useful for genre prediction.
1:14:43
And so what we can do is we can actually
1:14:45
train the whole thing ourselves, right?
1:14:48
We can
1:14:50
actually put an embedding
1:14:51
layer here. You know, we just
1:14:53
arbitrarily decided to choose 64 as the
1:14:55
the dimension as opposed to 100. It
1:14:57
will run faster. Uh and then it's the
1:14:59
same thing. Global average pooling,
1:15:01
activation, blah blah blah blah blah. Um
1:15:03
and then you run it.
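A sketch of this from-scratch variant (variable names and the mask_zero choice are ours):

```python
import keras

inputs = keras.Input(shape=(300,), dtype="int64")
x = keras.layers.Embedding(5000, 64, mask_zero=True)(inputs)  # learned from scratch
x = keras.layers.GlobalAveragePooling1D()(x)
x = keras.layers.Dense(8, activation="relu")(x)
outputs = keras.layers.Dense(3, activation="softmax")(x)
scratch_model = keras.Model(inputs, outputs)
```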
1:15:08
We'll see if it finishes in the next
1:15:09
minute.
1:15:12
And we'll see if it actually does better
1:15:14
than the pre-trained embeddings or the
1:15:16
pre-trained embeddings that have been
1:15:17
further fine-tuned. And I don't remember
1:15:19
what I saw when I ran it yesterday.
1:15:21
Uh and while it's running, other
1:15:23
questions?
1:15:24
Yeah.
1:15:25
So my question is regarding embeddings.
1:15:28
When we call embedding for a particular
1:15:30
word, we indicate that we have a certain
1:15:32
number of parameters. Let's say in this
1:15:33
case we defined 100.
1:15:35
So there will be 100
1:15:36
parameters and there will be
1:15:37
coefficient weights for each of them.
1:15:40
So when we take a pre-trained model,
1:15:42
right?
1:15:43
The one we took, GloVe. So for each word
1:15:45
there would already be that number of
1:15:47
parameters. Yeah. So
1:15:49
but then how do we redefine them? What if
1:15:51
we want only 100, or we want only 10
1:15:53
parameters?
1:15:54
You know, the GloVe download actually
1:15:56
comes pre-packaged to
1:15:59
be 100 long. I think they have 200 and
1:16:01
300 as well if I recall. We just
1:16:03
happened to use the one with
1:16:04
100.
1:16:05
>> The one that's available in Google?
1:16:07
Yeah, yeah. And there are many
1:16:09
available. We just get to pick and
1:16:10
choose and I happen to pick 100.
1:16:12
Uh
1:16:13
Oh, it's okay. So it's a bit slow, but
1:16:15
it's actually looking promising.
1:16:17
Um
1:16:18
9:55, yeah.
1:16:21
So during the CNN model training in
1:16:23
our assignments,
1:16:24
changing the filters gave us more depth
1:16:27
than improvement in performance.
1:16:29
So here would I be right in concluding
1:16:32
that it's actually training the
1:16:33
embeddings which is giving us more,
1:16:36
assuming that epochs and batch size
1:16:37
are not
1:16:39
changed as much. So if I really want a
1:16:39
genuine change in performance, we go
1:16:42
to the level of retraining the
1:16:43
embeddings.
1:16:44
Yeah, so what we saw was that using
1:16:46
GloVe as is was okay. Using GloVe and
1:16:48
then training them helped a lot. And now
1:16:50
we are basically saying, well, what if
1:16:51
we just abandon GloVe and train our own
1:16:53
embeddings for our particular problem.
1:16:55
See, GloVe is a general-purpose tool.
1:16:57
So a general purpose tool is really good
1:16:59
if you don't have a lot of data
1:17:00
as a good starting point. But when you
1:17:01
have a lot of data, you should always
1:17:03
try to do your own thing and see if it's
1:17:04
any better.
1:17:05
And in this case,
1:17:07
well, whoa. Okay, I think it's almost done.
1:17:09
Come on, it's 9:55.
1:17:14
The result is going to appear any moment
1:17:15
now.
1:17:21
Right, let's just look at the thing.
1:17:25
Okay, folks. So 74% 72%.
1:17:29
So you can actually train your own
1:17:30
embeddings, because you have 50,000 examples, and
1:17:31
get an even better result. Thanks a
1:17:33
lot. Have a good rest of the week.
— end of transcript —