
6: Deep Learning for Natural Language – Embeddings

MIT OpenCourseWare · May 11, 2026
Transcript ~13722 words · 1:17:50
0:21
We'll continue our journey with
0:23
natural language processing.
0:25
We looked at the bag of words model,
0:26
one-hot embeddings, and so on and so
0:28
forth. And today we will talk about
0:30
embeddings, or to be more precise,
0:32
stand-alone embeddings, and then that
0:34
will tee us up for something called
0:36
contextual embeddings, which is where
0:38
the transformer really sort of comes
0:40
into play.
0:41
All right, so let's get going. So far,
0:43
we have encoded input text as
0:47
one-hot vectors. So just to refresh
0:50
your memories from Monday,
0:52
if this is the phrase
0:53
that's coming into the system, we run it
0:55
through the STI process. And when we do
0:58
that, what happens is that first of all,
1:01
we standardize, then we
1:03
split on white space to get individual
1:05
words, then we assign words to integers,
1:08
and then we take each integer
1:10
and essentially create a one-hot version
1:12
of that integer. And when we do that,
1:15
basically we have a vocabulary.
1:18
Right? And in this example, we just have
1:20
100 words, and you will note that this
1:23
vocabulary, which you arrive
1:25
at once you standardize and tokenize,
1:28
has words like "the," because we
1:30
decided not to remove stop words like "a"
1:32
and "the,"
1:33
and so on. So just to be clear,
1:36
standardization
1:38
here, while it has
1:40
historically been all about stripping
1:42
punctuation, lowercasing everything,
1:45
removing stop words, and stemming,
1:49
if you look at modern practice, people
1:51
essentially strip punctuation, maybe, and
1:54
then lowercase, and they often don't
1:57
even bother to do stemming and things
1:58
like that, or to remove stop words.
2:00
Okay?
2:01
And that's why in Keras, the default
2:03
standardization is only lowercasing and
2:05
punctuation stripping.
2:09
This detail may actually be handy for
2:11
homework two, perhaps. That's why I'm
2:12
pointing it out.
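For reference, a minimal sketch of that default behavior, with a made-up one-sentence corpus:

```python
import tensorflow as tf

# Keras's TextVectorization defaults: lowercase + strip punctuation,
# then split on whitespace. Spelling the defaults out explicitly:
vec = tf.keras.layers.TextVectorization(
    max_tokens=100,
    standardize="lower_and_strip_punctuation",  # the default
    split="whitespace",                          # the default
)
vec.adapt(["The acting in the movie was superb!"])
# Index 0 is reserved for padding, 1 for [UNK]; the rest are ordered by frequency.
print(vec.get_vocabulary())
```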
2:14
Okay. So that's what we have. And so for
2:17
each word that's coming in, we have a
2:18
one-hot vector.
2:20
Right? But the one-hot vector is as long
2:22
as the vocabulary. And then we can
2:25
either add them up, quote unquote, and
2:27
get a count encoding, or we can just do
2:29
an OR, right? Look for any ones in a
2:32
column and get a multi-hot encoding.
2:39
So that's what we saw last class. But
2:42
this scheme, while it's quite effective
2:44
for simple kinds of problems,
2:47
has some very serious
2:49
shortcomings. And so we will sort of
2:50
delve into those shortcomings, and then
2:52
sort of step back and say, all right, is
2:54
there a solution to fix these things?
2:58
Problem with one-hot vectors.
3:00
There are lots of problems. Any
3:01
volunteers?
3:07
Similar words are understood
3:09
differently.
3:21
Absolutely. So what he's pointing
3:24
out is that if you have two words which
3:26
are synonyms, let's say, great and
3:28
awesome,
3:29
we would hope that the way we represent them
3:31
using these vectors would have some
3:33
connection to what the words actually
3:35
mean. In particular, we would hope that
3:37
if they mean similar things, that they
3:38
are sort of close by. If they mean very
3:40
different things, we would hope that
3:41
they are very far away. Right? Things
3:43
like that. Sort of common sensical
3:44
expectations of what you want the
3:46
vectors to have. Clearly it won't
3:49
have that, and we'll look into it in
3:50
detail in a bit. But before we do that,
3:53
there is also a computational issue,
3:54
which we covered last class, which is
3:56
that if the vocabulary is really long,
3:59
then each token, each word that's coming
4:01
in here, will have a one-hot vector
4:03
that's as long as the size of
4:04
vocabulary. Right? If you have 500,000
4:06
words in your vocabulary, every little
4:08
word that comes in has a vector which is
4:09
500,000 long. Which feels like a gross
4:12
waste of space.
4:16
Now you can mitigate it somewhat by
4:18
choosing only the most frequent words,
4:20
but it does increase the number of weights
4:21
the model has to learn, and increase the
4:23
need for compute and data, and so on and
4:25
so forth. Okay?
4:26
Now
4:27
let's say that we have created a
4:28
vocabulary from a training corpus. Okay?
4:31
We have a bunch of
4:32
strings, text that's coming in. We have
4:34
done the
4:36
standardization and tokenization. We
4:37
have created a vocabulary from it. And
4:39
let's say we get the words movie and
4:41
film.
4:42
So the question is, and the earlier
4:44
observation gets at this immediately, if
4:47
you look at the words movie and film,
4:48
are these two vectors close to each
4:50
other or not? Okay? So if you have two
4:52
vectors, how would we measure closeness?
4:56
What's the simplest way to think about
4:58
closeness?
5:02
It's not a trick question.
5:05
Distance. Yeah, exactly. So if they are
5:06
really close distance-wise, we would
5:08
hope, right? Similar words
5:10
should be close by. So
5:13
here, let's just imagine that the
5:16
vector for movie,
5:20
let's say your vocabulary is, I don't
5:21
know,
5:25
100,000 long.
5:27
So your vector is 100,000 long,
5:30
and the word for movie
5:33
is at this position, so this has a one,
5:35
everything else is zero. Right?
5:42
Sorry, this is the vector for film, and
5:44
maybe this is the position for film.
5:47
So that has a one, everything else here
5:51
is zero. Okay? What's the distance between
5:53
these two vectors?
5:55
You just use the Euclidean distance. So
5:58
the Euclidean distance, you will recall,
6:00
you literally just take the difference
6:01
of
6:02
these values,
6:04
square them, add them up, take square
6:06
root.
6:07
So which means that all the zeros will
6:09
obviously give you zero. This one is
6:12
going to give you a one.
6:14
This comparison is going to give you
6:15
another one. 1 + 1 = 2. Root 2. That's
6:18
the answer.
6:20
So the distance between these two
6:21
vectors is root 2.
6:25
Now,
6:27
so the distance between them is root 2.
6:30
What about the one-hot encoded vectors
6:32
for good and bad? Clearly good and bad
6:34
mean opposite things.
6:36
What is the distance between the good
6:37
and bad 01 vectors?
6:42
Still root 2.
6:45
Because the zeros don't mean anything,
6:47
the ones are not in the same place.
6:49
So when you subtract the one and the
6:51
zero, you'll get ones and ones, add them
6:52
up, two, root 2.
6:54
In fact, you take any two words in your
6:56
vocabulary, what's the distance between
6:57
the two one-hot vectors for those words?
6:59
It's root 2.
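A quick numeric check of that claim; the word indices here are made up:

```python
import numpy as np

V = 100_000                 # vocabulary size
movie = np.zeros(V)
film = np.zeros(V)
movie[17] = 1.0             # hypothetical index for "movie"
film[42] = 1.0              # hypothetical index for "film"

# Euclidean distance: sqrt of the sum of squared differences
print(np.linalg.norm(movie - film))   # 1.4142... = sqrt(2)
# Any two distinct one-hot vectors give exactly this same answer.
```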
7:01
So if any two words have the same
7:03
distance, does this even have a notion
7:06
of distance?
7:08
It doesn't.
7:10
There's no notion of distance from
7:12
one-hot vectors.
7:13
It has no connection to the actual
7:15
meanings of these words.
7:17
It's just a way of representing them.
7:21
Okay?
7:22
So that is the big problem with one-hot
7:24
vectors.
7:26
So
7:27
the distance between them is the same
7:28
regardless of the words. It's got
7:29
nothing to do with the meaning of the
7:30
words.
7:32
And this is a huge problem, which we'll
7:33
have to solve.
7:35
So to summarize where we are, if the
7:37
vocabulary is very long, each token will
7:39
have a one-hot vector that's as long as the
7:40
vocabulary. That's sort of a
7:42
computational and sort of training
7:44
problem. And then this is a deeper
7:46
problem, where there's no connection
7:48
between the meaning of a word and its
7:49
vector.
7:51
So wouldn't it be nice if
7:55
vectors that represent synonyms or related words,
7:57
like movie and film, or apple and banana,
7:59
were close to each other?
8:01
It would be nice if the vectors for
8:03
things that mean very different things
8:04
are far from each other.
8:06
So let's take a look at a particular
8:08
example. Okay? Let's assume that we have
8:10
been magically given
8:13
these vectors, so that they actually
8:15
have some notion of meaning.
8:17
And for convenience, let's say that we
8:18
take just the first
8:21
two dimensions of these vectors, the
8:23
first two dimensions, so that we can do
8:25
a scatter plot on them.
8:28
So we plot the first dimension of
8:30
these vectors against the second dimension, and
8:31
what we have in this little cartoon is
8:34
we have plotted the word for
8:37
factory, for home, for building, and
8:41
they all happen to be clustered here.
8:44
Clearly this representation is capturing
8:45
some notion of what the thing is.
8:48
Right? Some sort of building.
8:50
And here we have bicycle,
8:53
truck, and car. Clearly this is
8:55
like the automobile cluster, right?
8:57
The transportation cluster. And here we have
9:00
like a fruit cluster, and here we have
9:02
some sports balls cluster.
9:04
Okay?
9:05
Because it's a cartoon, things are
9:07
all nice and cleanly separated. Okay? So
9:10
now if you take the word apple, where do
9:12
you think it's going to go?
9:14
It's going to go in into A, C, D, or B?
9:19
C, right? It makes eminent sense it's
9:20
going to go to C.
9:23
Good. Now,
9:25
wouldn't it be nice if,
9:27
more generally, the geometric
9:29
relationship between word vectors
9:32
represented the semantic relationship
9:35
between the underlying objects that the
9:37
words represent?
9:38
Okay?
9:39
And I say relationship and not
9:41
distance, because it's not just
9:42
distance. It's actually more than that.
9:45
Okay?
9:46
So let's take another one.
9:48
Here we have
9:49
the vector plotted for
9:52
puppy and dog,
9:54
and this is calf.
9:56
Right? We have plotted the word for
9:58
calf. And let's say that we need to
9:59
figure out where would the embedding,
10:01
the word vector for cow appear?
10:04
Where is it most logical? Should it be A?
10:07
Should it be C? Should it be B? Where
10:09
should it be?
10:11
This is
10:14
C? Okay, what's the logic?
10:16
Any volunteers? Just put your hand up.
10:19
Uh, yes.
10:21
A calf is a baby bull, whereas the cow
10:26
is an adult.
10:27
So, it should be closer to the dog,
10:28
which is the adult version of a puppy.
10:31
Got it. So, you're basically saying go
10:32
from the puppy version to the grown-up
10:34
version. Right? That's sort of what
10:36
you're getting at, right? And that's a
10:37
totally valid way to think about it.
10:39
But there are a couple of ways to think
10:40
about this, and that is one of
10:42
those two ways. So, what you can do is
10:44
you can actually look at it and say,
10:45
well,
10:46
okay, if this is bringing you
10:48
bad memories of GMAT and GRE and
10:50
stuff like that, I apologize.
10:52
But
10:55
So, a puppy is to a dog like a calf is
10:57
to a cow, right? Which means that that's
10:59
exactly what Jay is pointing out. You
11:01
can go from like the baby version to the
11:02
full-grown version if you go in the
11:04
horizontal direction. Okay? But if
11:10
you go in the vertical direction, you're
11:13
essentially moving across the young
11:15
versions of different animals.
11:16
Okay?
11:18
So, here you're still moving across
11:20
the same dimension
11:22
of animals. You're just staying at
11:24
the same age level, right?
11:25
That is the band here.
11:27
So, this row is the grown-up version of a
11:28
whole bunch of animals, and this is the puppy
11:30
version of a whole bunch of animals. So,
11:31
the vertical dimension measures some
11:34
sort of variation across animal species
11:36
at roughly the same maturity
11:37
stage.
11:37
Okay? So, these directions also matter.
11:41
It's not just the distance.
11:43
Okay. That's what I mean when I say
11:45
semantic relationship and geometric
11:47
relationship.
11:48
Relationship is distance and direction,
11:51
right? Both have to be involved.
11:53
So,
11:55
now word embeddings, as we will
11:57
learn soon, are word vectors designed to
12:00
achieve exactly these requirements.
12:03
Okay? They will achieve these
12:04
requirements.
12:06
And they will fix both these
12:07
problems very elegantly.
12:11
Okay?
12:13
So, let's say that we have word
12:14
embeddings that achieve both these
12:15
requirements. Are we basically done?
12:17
Can we declare victory?
12:19
Or is there anything that
12:22
even word vectors which actually capture the
12:24
meaning of the underlying thing
12:28
don't fully address? Is there any
12:30
remaining problem we have to worry
12:31
about? Yes?
12:33
Context. Context? Yes.
12:36
Context, right? What about the fact that,
12:39
sure, every word has a
12:42
meaning, but we know that some words
12:44
have multiple meanings.
12:46
And that meaning is really only
12:49
inferable, you can make sense of
12:51
it only if you know the surrounding
12:52
context, right? If you
12:55
see the word bank, b-a-n-k,
12:59
sure, it could be a financial
13:00
institution. It could be the side of a
13:02
river. It could be the act of a plane
13:04
turning in one direction.
13:07
It could be someone hoping for
13:09
something, banking on something. The
13:11
list of possible meanings of the word
13:13
bank is basically enormous.
13:16
And you cannot figure out what it means
13:18
unless you know what else is going on
13:19
around that word. So, context is super
13:22
super important. And these embeddings,
13:24
word embeddings, just tell you what the
13:26
meaning of the word is. And basically
13:28
what's going to happen when you have a
13:29
word which could mean many different
13:31
things, it's going to give you some
13:33
average version of that meaning.
13:36
And that average version is not going to
13:37
be very good.
13:39
Now, there are some words which only
13:40
mean one thing, and you'll be okay
13:41
there.
13:42
But for the rest of it, right? It's
13:44
going to be tough.
13:47
So, we need to find a way to make word
13:54
embeddings contextual.
13:56
Meaning we need to somehow consider the
13:58
other words in the sentence.
14:00
Okay? So, if we can do that, then we
14:02
will be in great shape.
14:05
We can solve all sorts of NLP problems.
14:08
Now, as it turns out, contextual word
14:11
embeddings are word vectors, word
14:13
embeddings, that achieve both these
14:15
requirements.
14:16
They capture the semantic-geometric
14:19
relationship thing I talked about, and
14:21
they are contextual.
14:22
Okay?
14:23
They're really fantastic. And the
14:27
key to calculating contextual word
14:29
embeddings is the transformer.
14:33
That is why transformers are justifiably
14:35
famous.
14:39
So, what's sort of the lay of the
14:40
land here? So, today we are going to
14:42
look at how to calculate
14:44
stand-alone or uncontextual word
14:46
embeddings.
14:48
And then starting Monday, we will take
14:50
these stand-alone
14:52
embeddings and make them contextual
14:53
using transformers. Okay? That is the
14:56
plan.
14:57
Any questions so far?
14:58
So, now let's think about how we can
15:00
learn these stand-alone embeddings from
15:02
data, right? Now, the naive way to think
15:05
about it would be, hey, why don't
15:07
we manually collect a whole bunch of
15:08
synonyms, antonyms, related words, etc.,
15:11
and try to assign embedding vectors to
15:13
them that satisfy
15:15
our requirements. Okay? Now, as you can
15:18
imagine, this is going to be a long,
15:19
painful, and never quite complete
15:21
exercise.
15:22
Okay?
15:23
So,
15:24
given that we are
15:26
machine learning people,
15:29
the question is, can we do it in a better
15:30
way? Can we just learn it from the data
15:32
without doing any of this manual stuff?
15:34
Okay? And
15:36
the key insight that makes it all happen
15:39
is this humble-looking line on the
15:42
screen by John Firth, who was a
15:44
linguist.
15:45
You shall know a word
15:47
by the company it keeps. I wish I could
15:49
deliver this in a British accent.
15:53
Know a word by the company it keeps.
15:55
Okay? It's a very profound statement.
15:57
Okay? And here is the sort of the key
15:59
intuition behind this.
16:02
It says,
16:03
let's say that you have a sentence like
16:05
"the acting in the ___ was superb."
16:08
Okay?
16:09
What are some words that you folks think
16:11
are likely to appear in the sentence?
16:15
Shout it out. Play. Play.
16:18
Movie.
16:19
Show.
16:20
Musical. Right? Those are all some great
16:24
candidates, right? The acting in the
16:25
movie, the film, musical, and so on and
16:26
so forth. Okay? Now, let's say that I
16:28
ask you, what are some words that are
16:29
unlikely to appear in the sentence? And
16:31
I think we could all be here for like
16:32
days, you know, listing them out. Uh, I
16:35
just listed these out. Um, I love the
16:38
word tensor, so I have to find a way to
16:39
use it somewhere.
16:41
So, all right. So, "the acting in the
16:43
banana was superb." Clearly nonsensical,
16:45
right? So, what
16:48
we are seeing here is that if certain
16:51
words are sort of interchangeable in a
16:53
sentence,
16:55
meaning if you swap them,
16:57
the sentence still makes sense, right?
16:59
If they appear in the same context very
17:02
often, i.e., if they're interchangeable,
17:04
they are probably related.
17:07
Sort of like we don't even have to know
17:09
what the word is.
17:10
All we have to know is that this word
17:12
and this word, you can drop them into a
17:14
particular sentence, you can fill in the
17:15
blank of that sentence with that word,
17:17
and it actually makes sense, then we're
17:18
like, oh, wow, okay, these words are
17:20
related then.
17:21
Right? You're sort of inferring their
17:23
relatedness not by looking at them
17:25
directly, but by seeing where they live.
17:30
Right? It's a very very clever idea. And
17:32
it'll slowly sink in. Okay? So
17:36
that's the first observation. If they
17:37
appear in the same context very often,
17:39
they are likely to be related.
17:41
More generally, related words appear in
17:44
related contexts.
17:47
So, all we have to do
17:49
is to figure out a way to calculate
17:52
context.
17:54
And then use that to understand, you
17:57
know, what the words are that happen to
17:58
be living in this context.
18:00
And there are some beautiful ways to do
18:02
these things, and we'll
18:03
really dive deep into one such way to do
18:05
it.
18:06
So, what we're going to do in
18:08
this approach
18:10
is that,
18:11
since
18:12
words that appear in
18:14
related contexts mean
18:16
similar things,
18:18
first of all, you have to define what do
18:19
you mean by context?
18:21
And there are many ways to define
18:22
context. We're going to go with a very
18:23
simple definition,
18:24
which is that if words happen to appear
18:26
in the same sentence a lot,
18:29
then we think that, okay,
18:31
they are in the same context. So,
18:32
context here means sentence.
18:34
Okay?
18:35
So, what we can do is we can actually
18:38
take a whole bunch of text, maybe all of
18:40
Wikipedia,
18:41
and then break it up into sentences.
18:43
We'll have billions of sentences, right?
18:46
And then for all these billion
18:47
sentences, we can literally go and count
18:48
for every pair of words, how many times
18:51
are both these words showing up in the
18:52
same sentence?
18:55
Okay? And we call this co-occurrence,
18:57
right? The words are co-occurring in the
18:59
sentence.
19:00
And they don't have to be next to each
19:02
other,
19:02
right? We know that in complicated
19:04
sentences, the meaning of a word at the very end
19:09
could be altered by
19:10
a word that happened in the very
19:11
beginning of the sentence, and it could
19:12
be a really long sentence.
19:14
So, we take the whole sentence and say,
19:16
are two words co-occurring in the
19:18
sentence, yes or no? And we just count
19:19
them up.
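A minimal sketch of that counting procedure, with three made-up sentences standing in for a real corpus:

```python
from collections import Counter
from itertools import combinations

sentences = [
    "the acting in the movie was superb",
    "deep learning is fun",
    "the movie was about deep learning",
]

counts = Counter()
for s in sentences:
    words = set(s.split())                  # distinct words in this sentence
    for a, b in combinations(sorted(words), 2):
        counts[(a, b)] += 1                 # both words appear in this sentence

print(counts[("deep", "learning")])         # -> 2 (sentences 2 and 3)
```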
19:20
And when we do that,
19:24
we will get
19:26
something like this.
19:27
This just captures what I've been
19:30
talking about. Identify all the words
19:32
that occur, let's say, in Wikipedia. And
19:34
then for every sentence, you look at
19:35
every word pair and count the number of
19:37
times they appear in the same sentence
19:38
across all those sentences. Okay?
19:41
This is a word-word co-occurrence
19:43
matrix. So, for example,
19:46
let's assume that you took all of
19:47
Wikipedia, looked at all the words,
19:48
distinct words, and you found there are
19:49
500,000 words.
19:51
Okay? So, there are 500,000 words
19:54
here in the columns and
19:56
500,000 words on the rows.
20:00
And then each cell of this table
20:02
has a number that you calculate, which is
20:05
the number of times the word in the row
20:08
and the word in the column happen to
20:10
show up in the same sentence. That's it.
20:14
So, for instance,
20:15
if you look at deep and learning, right?
20:18
The word deep and the word learning:
20:20
maybe
20:22
those two words occurred in the same
20:24
sentence 3,025 times.
20:28
3,025 sentences across all of Wikipedia.
20:31
You put 3,025 right in that cell.
20:35
Okay?
20:36
Many words are unlikely to appear in the
20:37
same sentence.
20:38
So, much of this matrix is going to be
20:40
zero.
20:44
But we
20:45
fundamentally form this co-occurrence
20:47
matrix.
20:49
This matrix essentially embodies all the
20:54
context information that we can work
20:55
with in a very compact, beautiful,
20:58
elegant form.
21:03
And using this, we're going to try to
21:04
figure out
21:06
what the word embeddings actually are
21:07
going to be.
21:08
Okay?
21:09
So, by the way, the approach I'm
21:13
describing here to calculate standalone
21:15
embeddings is called GloVe.
21:20
When
21:23
standalone embeddings first came
21:24
onto the NLP deep learning scene,
21:27
there were two sort of ways of doing it.
21:29
One was called word2vec.
21:32
The other one is GloVe.
21:34
And they're both comparable, right? They
21:35
use slightly different mechanisms of
21:36
doing this.
21:38
We went with GloVe for this lecture
21:40
because I think it's actually a little
21:42
easier to understand and equally
21:44
effective.
21:45
Okay?
21:47
So, this is what we have. And so, what
21:49
we want to do is
21:50
we want to learn these embedding vectors
21:52
that can be used to essentially
21:54
approximate this matrix.
21:56
Right? If you can find vectors that can
21:59
actually approximate this matrix, then
22:01
hopefully those vectors do in fact
22:03
capture some notion of what the words
22:04
actually mean. Okay? So, let me put it
22:06
differently.
22:07
You come to me with this matrix. Okay?
22:10
And you say uh okay, Rama, do you have
22:12
embeddings for me?
22:14
And I'm like, yeah, I reach into my bag
22:15
and I'm like, okay, every one of those
22:17
500,000 words, I have an embedding.
22:19
Right?
22:20
Let's ignore for a moment how I actually
22:21
calculated embeddings. I have the
22:23
embeddings.
22:24
How will you know if my embeddings are
22:25
any good?
22:28
How will you know?
22:30
How can you actually assess if those
22:31
embeddings are any good?
22:34
Well, you can certainly say, okay, give
22:35
me the embeddings for movie and film and
22:37
you can see if they're really close by.
22:39
You can look at the
22:40
embedding for movie and tensor, and
22:42
hopefully they're far away.
22:43
But, you'll never get done.
22:46
Right?
22:47
How can you systematically evaluate
22:49
this?
22:51
Well, what
22:53
if I come to you and say, not only
22:55
am I going to give you an embedding,
22:57
here is a procedure
22:59
which you can use with these embeddings
23:00
to validate how good they are. And here
23:02
is the procedure. What you can do is you
23:04
can use the embedding to recreate the
23:07
co-occurrence matrix.
23:09
And if the recreated co-occurrence
23:11
matrix actually matches the real matrix
23:14
well, these embeddings probably are
23:15
pretty good.
23:17
Remember, the whole point of the
23:18
co-occurrence matrix is to capture this context
23:20
information. So, if my embeddings can
23:21
actually recreate it, reconstruct it
23:23
pretty closely, right? It'll never be
23:25
perfect. But if it comes pretty close,
23:27
then we're like, wow, okay, these
23:28
embeddings do mean something.
23:29
So, if it turns out for instance that
23:31
the matrix has a
23:33
value of 3,000 for deep and learning
23:36
and a value of,
23:40
say,
23:40
50 for extreme and learning,
23:43
and our embedding comes in and says
23:45
3,002 for the first one and 48 for the
23:48
second one, we'll be
23:49
pretty impressed.
23:51
Whoa, it couldn't be that close
23:53
unless it was actually capturing
23:54
something.
23:55
Okay? So, that's what we're going to do.
23:57
And so, we're going to take this logic
23:59
of saying:
24:00
find embeddings that can approximate
24:03
what we actually see in Wikipedia.
24:05
Right? And we're going to use that idea
24:07
to actually build the model and learn
24:09
the embeddings
24:10
using nothing more than basically linear
24:12
regression.
24:16
And here you are thinking that linear
24:17
regression is useless now that you've
24:18
graduated machine learning, right?
24:22
So, we can think of the embedding
24:24
vectors that we want to figure out as
24:26
just the weights in a model.
24:28
In a linear regression.
24:31
We can think of the co-occurrence matrix
24:33
as just the data we're going to use in
24:35
this model to estimate these weights.
24:37
And the model we're going to use
24:39
is something like this.
24:42
So, first I have to inflict some
24:43
notation on you.
24:45
We will denote the co-occurrence count
24:46
of words i and j as Xij.
24:50
Xij is just data.
24:51
It's just data. Okay? It's not a
24:53
variable, it's data.
24:55
And then we will denote an embedding
24:57
vector for each word. Remember, we need
24:59
to have a vector for each word. So, we
25:01
call it Wi, right? Wi is the embedding
25:03
vector for word i.
25:06
And we will also assume that
25:09
some words are just inherently very
25:10
popular. They're going to show up all
25:11
the time like the word the.
25:13
Okay? So, we'll assume that every word
25:15
has some natural frequency of occurring
25:18
like movie versus flick.
25:20
The versus tensor. So, we want the
25:22
vectors to capture the co-occurrence
25:24
patterns independent of how naturally
25:27
frequent the words are.
25:28
Okay?
25:29
And so, to capture this natural
25:30
frequency, we will assign a bias or Bi
25:33
to each word that we're going to
25:34
calculate. And all this will become
25:36
clear in just a moment. Okay? So
25:39
with this setup, basically what we're
25:41
saying is something very simple. We're
25:42
saying, look, this co-occurrence matrix
25:45
that we have,
25:48
that we're able to compute, came
25:51
about because, in truth, in reality,
25:53
in nature, there are these embedding
25:55
vectors for every word.
25:58
There are these biases Bi for every word,
26:00
and every co-occurrence number that you
26:03
see just came about because,
26:05
under the hood, mother nature grabbed
26:07
the bias number for word i, the bias
26:09
number for word j, took the two
26:11
embedding vectors, which only mother
26:13
nature knows at this point, did the dot
26:15
product of them, added them up, and that's
26:16
how we got this number.
26:16
So, it basically says the number you see
26:19
is the sum of the inherent popularity of
26:21
the first word plus the inherent
26:23
popularity of the second word plus the
26:25
way in which these two words connect to
26:26
each other.
26:29
That's it.
26:29
And
26:30
you will agree with me
26:32
that it literally can't get simpler than
26:33
this.
26:34
If I tell you, hey, here are two things,
26:36
I want you to tell me how connected they
26:38
are, you'll be like, well, let's take
26:39
the first one's inherent popularity,
26:42
then the second one's inherent popularity, and
26:44
then of course you've got to worry about
26:45
the connection. So, we do a dot
26:46
product.
26:47
That's it. Those three things.
26:49
Right?
26:50
So, this is what we have. Now, you may
26:53
have seen
26:54
from your good old linear
26:56
regression that whenever your
27:00
dependent variable is
27:02
guaranteed to be positive
27:05
and it ends up having a big range,
27:08
we always advise you folks
27:10
to take the logarithmic transformation
27:12
to squash it into a narrow range, because
27:14
that will make these models much more
27:16
well-behaved.
27:18
Regression struggles if the Y value has a huge
27:20
range. The canonical example is
27:22
if you are trying to
27:23
model the net worth of
27:24
people, right? It's going to have a long
27:27
right tail with people like Elon and
27:29
Jeff and so on on the right side, right?
27:30
And the rest of us on the left. And
27:33
so, to model this big long-tail
27:34
distribution, you just take the
27:35
logarithm, just squash everything to a
27:37
very narrow range. And that will make
27:39
regression much better behaved. Okay?
27:41
Here
27:42
most of the counts are going to be zero.
27:45
But, some of the counts could be very
27:47
high.
27:48
Right?
27:49
And therefore, if you take
27:51
the logarithm, it makes it much better
27:52
behaved, so we take the logarithm here.
27:54
So, this is actually our model. That's
27:56
it.
27:57
And I know that many of the numbers are
27:58
zero, and log of zero is not defined. So,
28:00
we can just add one to
28:02
all the numbers
28:03
to avoid that kind of
28:06
technical arithmetic problem.
28:08
But, this conceptually is what's going
28:09
on. This is the model we want to
28:10
calculate.
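Written out, the model postulated here is:

```latex
\log\left(1 + X_{ij}\right) \;\approx\; b_i + b_j + \mathbf{w}_i \cdot \mathbf{w}_j
```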
28:11
So, given that we have essentially
28:14
postulated this model
28:16
and we have this data, this
28:17
co-occurrence matrix, how can we
28:19
actually find the weights? How can we
28:21
actually find the Bs and the Ws?
28:24
What should we do?
28:25
Go back to the fundamentals of
28:26
regression. Think about it conceptually.
28:29
You have some model which has some
28:30
weights.
28:31
There's some data you can use to train
28:33
the model.
28:35
Right? And you need to find the best set
28:36
of weights. What does the best mean
28:38
here?
28:40
The lowest...
28:42
The lowest error. Exactly. There are
28:43
many ways to measure error, right?
28:46
What is the simplest thing we
28:47
could use? You would
28:48
actually use mean squared error. Right?
28:50
Which is what you're getting at.
28:52
You take the actual value, you
28:53
take the predicted value, take the
28:54
difference, square it, and minimize the
28:55
sum of it.
28:57
Okay? If your model exactly nails every
28:59
number in the co-occurrence matrix, the
29:00
error is going to be zero.
29:02
Okay? So
29:04
what we do is we literally just do that.
29:07
This is the data.
29:09
This is the actual, and this is the predicted value.
29:11
Predicted value, actual value,
29:13
difference squared, add them all up,
29:14
minimize.
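In symbols, the quantity being minimized over all the b's and w's is:

```latex
\mathcal{L} \;=\; \sum_{i,j} \Bigl( \log\left(1 + X_{ij}\right) - b_i - b_j - \mathbf{w}_i \cdot \mathbf{w}_j \Bigr)^2
```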
29:17
Okay?
29:19
Uh yes.
29:21
And in the loss function, how is this
29:23
capturing the context? Because unless my
29:25
input data is having that context
29:28
how will this actually differentiate
29:31
based on where the particular word is
29:33
used?
29:34
So, let's take two words like deep and
29:38
learning. Now, let's take this word and
29:41
change it according to the context.
29:42
Okay.
29:44
Sorry, go ahead. Yeah, so basically,
29:46
let's say I'm talking about the word
29:47
banana. So it's a fruit in some context
29:49
and I could be saying he's going
29:50
bananas. That's a
29:53
whatever, right? So now these are two
29:55
different contexts in my understanding
29:57
and my same model needs to be able to
29:59
tell me that banana is the right word in
30:01
this context but wrong word in this
30:02
context or
30:04
correct in both contexts. Yeah, very
30:06
good question. So let's actually spend a
30:08
minute on that. Good question. I'm going
30:10
to swap to my iPad.
30:13
So let's let's assume that this is our
30:15
co-occurrence matrix.
30:18
Right? And then we have words going from
30:20
A all the way to let's say zebra, right?
30:23
This is the all the words in our
30:24
vocabulary
30:25
and we have A through zebra here.
30:29
And now what we have is
30:32
we have uh
30:34
apple
30:36
and banana.
30:39
Right?
30:40
So basically what's going on at this
30:42
point is that
30:44
every number here measures,
30:48
for every word here, how many times that
30:50
word and apple show up in the same
30:51
sentence, okay?
30:53
It is not measuring, to your point, just
30:56
how many times apple and banana are
30:57
showing up together. It's measuring how
30:59
many times apple is showing up with each
31:01
word, right? Now, if apple and
31:03
banana are sort of interchangeable,
31:06
what do we expect these
31:09
two rows of numbers to look like? Let's
31:11
assume that apple and banana are perfect
31:13
synonyms.
31:14
Just for argument, okay? Let's say they're
31:15
perfect synonyms.
31:17
What do we expect these two
31:19
rows of numbers
31:21
to look like?
31:23
Very similar.
31:25
So if two words are related, their
31:27
row vectors in the
31:30
co-occurrence matrix are going to be
31:31
very, very similar.
31:32
So that is how the context comes into
31:34
the co-occurrence matrix.
31:36
So what we want to find is:
31:37
if embeddings can recreate the same
31:40
pattern of numbers
31:42
in these two rows, they're actually
31:45
capturing the underlying context.
31:47
So words which are similar will sort of
31:49
zig and zag together the same way
31:51
through the co-occurrence matrix.
31:53
And that's where it comes in.
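A tiny illustration of that zig-and-zag idea, with made-up rows from a toy co-occurrence matrix:

```python
import numpy as np

# Made-up co-occurrence rows over six context words.
apple  = np.array([40, 3, 0, 12, 7, 0], dtype=float)
banana = np.array([38, 4, 1, 11, 6, 0], dtype=float)   # similar contexts
tensor = np.array([0, 0, 25, 0, 1, 30], dtype=float)   # different contexts

print(np.corrcoef(apple, banana)[0, 1])   # close to +1
print(np.corrcoef(apple, tensor)[0, 1])   # near zero or negative
```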
31:57
Yeah.
31:58
What's up with the diagonal of the
32:00
co-occurrence matrix where you have
32:01
apple showing up twice? Oh, I see. So
32:05
yeah, here you can typically just ignore the
32:07
diagonal,
32:08
because all the action is in the
32:10
off-diagonal entries.
32:15
So that's basically the idea:
32:18
words which are very similar will
32:20
have a very similar pattern of numbers,
32:22
and any
32:24
embeddings that can actually recreate
32:25
the same pattern of numbers are capturing
32:27
the underlying reality of what's going
32:28
on.
32:29
If words are kind of unrelated, those
32:32
two vectors... let's say that
32:34
the other word is,
32:40
of course
32:42
you know what I'm going to say, tensor.
32:45
Right? These two vectors
32:48
won't have any connection
32:49
to each other.
32:50
Which means if you look at something
32:51
like the correlation of those two
32:53
vectors, it's going to be around
32:54
zero.
32:55
Right?
32:56
Words which are
32:57
interchangeable will have a
32:59
very high correlation.
33:01
Words which are antonyms and never show
33:03
up in the same place together may have a
33:05
highly negative correlation, close to
33:07
minus one for instance. So that's sort
33:09
of the intuition behind what's going on
33:10
with these row
33:11
vectors.
33:12
And so the point is, given that this
33:14
co-occurrence matrix is capturing all
33:16
this word-word correlational structure,
33:19
any embedding that can recreate it must
33:22
have captured the structure as well.
33:25
Because you can't recreate something
33:26
like this with great fidelity unless you
33:28
have some notion of what's going on
33:30
under the hood.
33:31
That's the basic idea.
33:33
Yeah.
33:34
So just connecting to Sophie's question.
33:36
So in that example then,
33:39
banana is a fruit and apple is a fruit
33:40
as well. Banana and apple are synonyms,
33:42
and you're going mad, you're going
33:44
bananas. How does that come together?
33:47
Oh, I see. You're going mad, you're
33:48
going bananas, yeah. So those will
33:50
also have some correlational structure
33:52
to them which the embeddings will
33:53
hopefully catch. But with words like banana,
33:59
the thing is, it's called polysemy, where
34:01
the word looks the
34:03
same way but, like the word bank,
34:04
can mean very different things
34:06
in very different contexts. So the
34:07
embedding is going to be some average
34:09
representation of it, right? But we are
34:11
not happy with that average, and we'll
34:13
get around that average
34:15
next week when we do the contextual stuff.
34:18
All right.
34:19
Um
34:20
So that's what we have here. So to go
34:22
back to this thing,
34:26
so what we can do is yeah.
34:29
I didn't understand how we get the
34:31
mean squared error in this, because we
34:34
didn't
34:35
read these values from the data set we got.
34:37
We haven't calculated the embeddings.
34:39
We are trying to calculate them. It's
34:42
sort of like in
34:45
regression, where you have beta
34:47
one times X1 plus beta two times X2, that kind
34:49
of thing. The betas are what the
34:51
regression produces for us, right?
34:52
The embeddings are exactly that. They're
34:53
just coefficients that we're trying to
34:55
figure out.
34:59
The data is only the X's, the Xij.
34:59
And so this is what we're trying to
35:00
calculate,
35:01
right? And so what you can do is you can
35:03
actually start with some random values
35:06
for these things
35:08
and then
35:09
keep on trying to improve to minimize
35:11
the error
35:13
starting from these random values.
35:15
Are you folks aware of any
35:17
algorithm which allows us to take a
35:19
random starting point and then
35:20
minimize some notion of error?
35:32
Well, how do you know it's actually
35:33
random? Oh.
35:35
So that's actually a very deep question.
35:37
And
35:41
it's actually a tough question, right?
35:42
Because ultimately the random number is
35:44
coming from a computer
35:46
and we know how the computer runs. It's
35:47
deterministic at the end of the day.
35:50
So we actually use something called
35:51
pseudo random numbers,
35:53
right? Um and there's like a whole
35:54
specialized field of math
35:56
which essentially says, "Look, how can I
35:59
get random numbers that are sufficiently
36:02
random even though they come from a
36:03
deterministic computer
36:05
process?" So we can talk offline about
36:07
it,
36:08
um but fundamentally all these systems
36:10
have some random number generators built
36:11
in. We just cross our fingers and hope
36:14
for the best and just use them.
36:17
So come back to this,
36:19
right? We can start with random values
36:20
for these weights
36:22
and then we can try to minimize the
36:23
squared error. Are you folks aware
36:25
of any algorithm that can help us do
36:26
that?
36:28
Yes.
36:30
Gradient descent. Yes, gradient descent.
36:33
Again, it comes to the rescue. And since
36:35
we are cool, we'll do stochastic
36:36
gradient descent.
36:38
Okay? So that's it. So gradient descent
36:41
actually doesn't care what the function
36:42
is as long as you can calculate a
36:44
derivative from it. As long as you
36:45
calculate a gradient, you're good.
36:47
Right? So we can just run gradient
36:48
descent on this thing, right?
36:50
One key point here is that gradient
36:53
descent and stochastic gradient descent
36:54
work for
36:55
any model, as long as you can calculate
36:58
good gradients from it.
37:00
It doesn't have to be a neural network.
37:03
Any mathematical function as long as
37:05
it's differentiable and gives you a good
37:07
gradient.
37:08
Okay? So here this is not a neural
37:10
network per se, but we can still use
37:12
gradient descent for it.
37:14
So we do that.
37:17
And when we are done, we would have
37:20
calculated some nice embeddings. We
37:22
would also have
37:23
calculated all these biases, but we don't
37:25
need the biases anymore. We can just
37:26
throw out the biases, because we only
37:28
care about the embeddings and how they
37:29
connect to each other.
37:30
Okay? Yeah.
37:33
So when you're doing that
37:34
regression, are you predicting the
37:36
co-occurrence matrix? Mhm. Okay.
37:39
Exactly.
37:42
So
37:43
Actually, let me just show a very
37:45
quick
37:46
numerical example here.
37:48
So let's say, for example...
37:53
you know what?
37:57
So this is, say, W1 and this is W2.
38:00
Okay? These are the vectors, and let's
38:02
assume for a moment that each has two
38:04
dimensions, okay?
38:06
Two dimensions.
38:07
And we also need to calculate B1 and B2,
38:09
which are each just a number, okay?
38:14
So let's say the number for deep and
38:16
learning in the co-occurrence matrix:
38:18
let's say they have occurred together 104
38:20
times.
38:21
So all we are doing is to say: log of
38:24
104.
38:27
That is the actual value,
38:28
minus
38:30
B1, which we don't know, plus B2, which we
38:33
don't know,
38:34
and then this thing here; let's just
38:36
call the components
38:38
W11,
38:40
W12,
38:42
W21,
38:43
W22.
38:45
Okay? And then we're just doing the dot
38:46
product, which is W11
38:49
times W21
38:51
plus W12
38:53
times W22.
38:55
Okay? So this is our prediction.
38:58
Where is that cool laser pointer? Yeah.
39:00
So this is our prediction.
39:03
This is the actual.
39:05
So all we do is to say, "Okay,
39:07
this thing, the difference, we're going
39:09
to square it."
39:11
And then we're going to do the same
39:12
exact thing for every other word pair.
39:16
Okay? And when we are done with all of
39:17
that thing, we just take this whole
39:19
thing
39:20
and say gradient descent minimize.
39:23
So then it has to find the B's and the
39:26
W's and everything for every pair,
39:28
every word.
39:29
So that's actually what's going on.
39:31
Make sense?
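Here is a minimal sketch of that whole procedure in TensorFlow, on a made-up toy co-occurrence matrix; the sizes, counts, iteration budget, and learning rate are all illustrative:

```python
import numpy as np
import tensorflow as tf

V, D = 10, 2                                   # toy vocabulary size, embedding dim
X = np.random.randint(0, 50, (V, V)).astype("float32")
X = (X + X.T) / 2                              # co-occurrence counts are symmetric
Y = np.log1p(X)                                # log(1 + X), as in the lecture

W = tf.Variable(tf.random.normal((V, D), stddev=0.1))  # embedding vectors Wi
b = tf.Variable(tf.zeros(V))                            # biases Bi

opt = tf.keras.optimizers.SGD(learning_rate=0.05)
for step in range(2000):
    with tf.GradientTape() as tape:
        # prediction for every pair (i, j): Bi + Bj + Wi . Wj
        pred = b[:, None] + b[None, :] + tf.matmul(W, W, transpose_b=True)
        loss = tf.reduce_mean((Y - pred) ** 2)  # mean squared error over all pairs
    grads = tape.gradient(loss, [W, b])
    opt.apply_gradients(zip(grads, [W, b]))
# W now holds the learned embeddings; the biases b can be thrown away.
```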
39:37
All right. So by the way, here
39:41
I said,
39:43
let's assume that the
39:45
embeddings are just vectors of
39:47
dimension two.
39:51
Well,
39:52
that's an arbitrary decision that I made
39:54
just to show you how it works because I
39:55
was doing it by hand. But more
39:58
generally, we get to choose how long
39:59
these vectors are.
40:01
Right?
40:02
And the longer the vector, the more
40:04
interesting ways it can actually
40:05
reproduce the co-occurrence matrix. It
40:07
has more flexibility. But the longer the
40:09
vector, what is the risk that you run?
40:13
Overfitting.
40:14
Because these are all parameters at the
40:16
end of the day. The more parameters you
40:17
have, the more risk of overfitting.
40:19
Okay? So, you get to choose how big
40:21
these things can be. Uh yes.
40:24
Don't you find it surprising that
40:26
we're able to fit a model where we
40:29
have a lot more parameters than we have
40:30
data? Because usually with most machine
40:32
learning, you would
40:33
like to not have a lot of parameters,
40:35
but here we're going to have,
40:37
as you said, the number of dimensions
40:40
times more parameters than we have
40:42
data points. Well, here in this
40:44
particular case, as it turns out,
40:46
let's assume that you only have 10
40:48
words, right?
40:49
And for each word, let's just
40:51
keep the math
40:53
simple: you have a two-dimensional
40:55
vector.
40:56
So, 10 words times 2, that's 20.
40:58
Plus you have 10 biases for the words,
41:00
right? So, that's another 10; that's 30.
41:02
But the matrix is 10 by 10, so it has 100 entries.
41:06
Because the matrix has on the order of
41:08
n squared entries, you'll have a lot more
41:10
numbers than parameters.
41:13
In this particular case, you have more
41:14
data than parameters.
41:17
So, that particular problem doesn't
41:18
apply in this case.
41:20
But that does show up in other cases and
41:22
there is some
41:23
very interesting research in neural
41:24
networks which suggests that oftentimes
41:26
the traditional assumptions about data and
41:29
overfitting
41:30
can be called into question in
41:32
some situations.
41:33
Um happy to tell you more offline, but
41:35
if you're curious, just Google something
41:37
called double descent.
41:39
You know what I mean.
41:42
But in this case, it's not a problem.
41:46
Okay.
41:47
So, what that means is that we can
41:49
choose how big these things are. If
41:51
you look at one-hot
41:53
vectors, right? Where
41:55
there's a one and everything else is
41:57
zero depending on the position of the
41:58
word, these are long vectors, as long as
42:00
the vocabulary, right? As we saw earlier.
42:03
Word embeddings, on the other hand,
42:05
can be very dense, right?
42:07
The numbers
42:08
that make up these embeddings, we're
42:10
actually going to figure out from the
42:11
data. So, a dimension can be
42:13
anything. The first dimension
42:15
may stand for some combination of
42:17
brightness plus speed plus animalness or
42:19
something. We have no idea what it
42:22
means.
42:23
All we know is that it's able to
42:24
reproduce the co-occurrence matrix
42:26
really well, so it has probably
42:27
figured something out.
42:30
Okay? And so, we can keep it really
42:32
short. So, word embeddings tend to
42:33
be very
42:35
dense,
42:36
meaning not zeros and ones, but some
42:38
arbitrary numbers. They're much lower
42:39
dimensional, and of course learned
42:40
from data.
42:41
Right? So,
42:43
once you actually
42:45
run GloVe on this data and do gradient
42:47
descent and so on and so forth, you
42:49
will actually come up with embeddings,
42:52
and then you can
42:54
take these
42:55
embeddings and just plot them. Here
42:58
they're not literally plotting the first
42:59
two dimensions. They're using a
43:01
particular technique called t-SNE, which
43:03
is a way to take long vectors and
43:05
project them to 2D space for
43:07
visualization purposes.
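For reference, a minimal sketch of that projection step with scikit-learn; the embedding array is random stand-in data:

```python
import numpy as np
from sklearn.manifold import TSNE

vectors = np.random.rand(50, 100)   # stand-in for 50 learned 100-d embeddings
coords = TSNE(n_components=2, perplexity=5).fit_transform(vectors)
# coords has shape (50, 2): one 2D point per word, ready to scatter-plot
```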
43:09
And you can see here
43:11
some very interesting things are showing
43:12
up. So, they plotted the
43:15
embedding for brother,
43:17
nephew, uncle, sister, niece,
43:19
aunt, and so on and so forth. It's all
43:20
showing up here.
43:22
This is the embedding for man, embedding
43:24
for woman,
43:25
sir, madam,
43:28
empress, heir,
43:29
duke, emperor, king. You get the idea.
43:32
Right? So, clearly there are patterns
43:34
here where
43:35
things which are sort of similar in
43:37
their nature are all hanging out
43:38
together in the same part of the space.
43:41
Which is comforting, which is good to
43:42
know.
43:44
Right?
43:44
Now, but as I mentioned earlier, it's
43:46
not just about the fact that similar
43:48
things happen to be near each other.
43:50
The direction also actually matters. And
43:53
beautiful things happen when you look at
43:54
directions. So, for instance,
43:57
let's say that
44:00
you want to go from man to
44:01
brother.
44:03
Okay? So, to go from man to brother, you
44:05
have to start with man and then travel
44:07
along this arrow, right? To get to
44:09
brother.
44:11
So, this arrow has some notion of a
44:14
person becoming a sibling.
44:18
Right?
44:19
So, you would hope that if you take that
44:20
same arrow
44:22
and then
44:23
start here with that arrow, hopefully
44:26
the woman will become a sister.
44:29
Sure enough, it does.
44:32
So, this is called word vector algebra.
44:35
Right? Embedding algebra. And these
44:37
relationships are actually showing up in
44:39
the data. We didn't tell it any of these
44:41
things.
44:42
We just literally gave it the
44:43
co-occurrence matrix
44:44
and said and and asked it to reproduce
44:46
it.
44:47
So, I find it pretty shocking that these
44:49
things are actually true.
44:52
And it gives us evidence and comfort
44:55
that whatever has been learned does have
44:57
some deep connection to describing the
44:59
underlying nature of what's going on.
45:01
It's not some statistically fluky
45:03
artifact.
45:05
Um yeah.
45:07
So,
45:07
you said relatedness is inferred
45:08
by context, or by adjacency to other
45:11
words, and not by
45:12
the words themselves, right?
45:15
Because synonyms won't necessarily appear in
45:16
the same sentence.
45:17
Right.
45:20
They won't appear in the same sentence,
45:22
but the pattern of co-occurrence will be
45:23
the same for them.
45:25
Which is what we've been able to
45:26
reproduce with these embeddings. So,
45:28
that's the key idea.
45:34
Um
45:34
So, my question is, how are we
45:37
able to capture all these directions in a
45:40
2D
45:41
matrix versus a multi-dimensional matrix?
45:44
Because I feel like this
45:46
relationship is kind of
45:48
confirmed, that you're moving to
45:50
kind of a
45:51
family or blood relationship or
45:53
something of the sort, but how does
45:54
it not mess up the other sides of that
45:56
matrix?
45:58
No, this is just a visualization thing.
46:00
As you will see, GloVe embeddings
46:04
come in lots of different sizes. And
46:06
this one, I think, uses the 100-dimensional
46:08
embedding and just projects it to 2D
46:10
space using a particular technique and
46:12
then looks to see what's going on.
46:15
Um yeah.
46:17
If the input data, the co-occurrence
46:20
matrix, is biased, aren't we amplifying
46:22
that bias? Yes, we are. It's a
46:24
great observation. Any sort of data
46:26
you scrape from the internet and use for
46:28
this sort of modeling exercise will be
46:30
subject to all the biases that produced
46:32
the data in the first place. And
46:34
the model will faithfully learn those
46:36
biases. And if you're not careful, it'll
46:38
perpetuate them.
46:40
And that's a whole very important
46:41
topic that unfortunately we won't cover in
46:43
this course because of time constraints,
46:45
but it's something you always have to
46:46
worry about when you're building these
46:47
models.
46:50
How do you think about the
46:51
dimensionality of the embeddings not the
46:53
2D representation of the actual data?
46:55
The one that we choose, that's in
46:57
our hands. So, you should think of it
46:59
as a hyperparameter.
47:00
So, much like the number of hidden units
47:03
to use in a particular hidden layer,
47:05
it's a hyperparameter. So, you
47:06
know, I would again start small and if
47:09
it solves the problem that you're trying
47:11
to solve with these embeddings, great.
47:13
If not, keep increasing them. And at
47:15
some point there might be a
47:16
flattening out and an overfitting sort of
47:19
dynamic, and then you stop. So, just
47:20
think of it as a hyperparameter.
47:22
Yeah.
47:24
Do you see any benefit in practice to using
47:26
penalized regression to do this,
47:28
to make the embeddings more
47:31
sparse or just
47:33
lower their magnitude? Yeah.
47:36
Yes. So, there are lots of techniques
47:40
to apply regularization in the
47:42
estimation itself of all these numbers.
47:44
Happy to give you pointers. I'm
47:46
just going with the simplest
47:47
version possible.
47:49
Yeah.
47:50
I'm not understanding why overfitting is a
47:53
problem in this case, because we're not doing
47:55
any out-of-sample
47:58
prediction. So wouldn't you want
48:00
the embeddings to be
48:02
high dimensional so you can capture
48:03
all
48:04
your relationships? Interesting
48:06
question. So, the question is: given that
48:08
there's no notion of an out-of-
48:11
sample test set that we're going
48:12
to evaluate these things on, why do we
48:14
really care about overfitting?
48:16
Shouldn't we do the best we can to capture
48:18
everything in the data, right?
48:20
Well,
48:21
the thing is
48:22
even when you're not trying to use it
48:24
for out of sample prediction, you do
48:26
want to make sure that your model only
48:29
captures the true patterns and not the
48:31
noise.
48:32
In every data set, there's always noise.
48:35
Right? And you want it to capture the
48:36
signal but not the noise,
48:38
regardless of what you use it for.
48:40
Because if it captures the noise, then
48:42
the insights you draw from the word
48:44
embeddings may be flawed.
48:45
That's the reason.
48:48
Okay.
48:49
Um all right, so let's keep going. So,
48:51
here the algebra is brother minus man
48:53
plus woman is sister.
48:55
That's it. Human biology reduced to a
48:57
single sentence.
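A toy sketch of that algebra with made-up 2D vectors; real GloVe vectors are 50 to 300 dimensions, but the arithmetic is the same:

```python
import numpy as np

# Made-up 2D embeddings arranged so the "sibling" direction is shared
emb = {
    "man":     np.array([0.1, 0.8]),
    "woman":   np.array([0.9, 0.8]),
    "brother": np.array([0.1, 0.2]),
    "sister":  np.array([0.9, 0.2]),
}

query = emb["brother"] - emb["man"] + emb["woman"]
# Nearest word by Euclidean distance:
best = min(emb, key=lambda w: np.linalg.norm(emb[w] - query))
print(best)   # -> 'sister' for these toy vectors
```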
48:58
All right. So, now the pros and cons of
49:00
these things. You should use
49:02
something like a GloVe embedding if you
49:04
don't have enough data
49:07
to learn a task-specific embedding for
49:10
your own vocabulary. As I'll show
49:11
you in the Colab, you can actually learn
49:13
these things just for your own data set
49:14
if you want. You don't have to use these
49:16
Glove embeddings. But the reason to use
49:18
these pretrained embeddings is that if
49:20
you're working with natural language,
49:22
you know, the word is the word, right?
49:24
It means something.
49:25
And so, there's no reason
49:28
for your model, for your little use
49:30
case, to somehow have to learn
49:32
all the fundamentals of English.
49:35
The fundamentals of English are the
49:36
fundamentals of English. May as well
49:37
learn it once and then piggyback on it.
49:40
So, that's the whole idea of using
49:42
pre-trained embeddings.
49:43
Because these things are all common
49:45
aspects of language. May as well learn
49:47
them using all the data you can throw at
49:48
it and then you can sort of fine-tune
49:50
and tweak and adapt to your particular
49:52
use case.
49:53
Right? And this is particularly
49:55
useful when you don't have a lot of data
49:57
in your particular use case.
49:58
Uh right? That's one big advantage. Now,
50:01
it does have the drawback that this
50:03
embedding will not be customized to your
50:04
data.
50:05
Right? For example, if you're trying to
50:06
build an application for a medical or
50:08
legal use, it's going to have a lot of
50:10
jargon.
50:11
Right? And this pre-trained embedding
50:13
trained on all of Wikipedia may not
50:14
capture enough of the jargon and know
50:16
its meaning really accurately. So,
50:18
you may still want to take this
50:19
pre-trained thing
50:21
and then you can adapt and
50:22
fine-tune it using your jargon-packed,
50:25
heavy, domain-specific data set.
50:28
Okay, those are some of the things to
50:29
keep in mind.
50:32
And of course, you can also learn it from
50:33
scratch if you want, and in the Colab I
50:35
demonstrate all these options.
50:38
So, when you're working with embeddings
50:39
in Keras, what we do is,
50:41
remember STI,
50:43
where we standardize,
50:45
tokenize, and index, right? At this
50:48
point, we go from integers to vectors,
50:50
and so far we have been mapping integers
50:51
to one-hot vectors. Here, we're going to
50:54
use embedding vectors that we're going
50:55
to learn, or that we're going to reuse
50:57
from GloVe. And so, what we do is we
51:00
tell Keras's text
51:02
vectorization layer to do only STI.
51:06
And then we will use a new layer called
51:08
the embedding layer to do the encoding.
51:10
Yeah, that's how we're going to
51:11
divide it up.
51:14
So, we'll take a look at this first
51:17
before we switch to the Colab. So,
51:18
before
51:20
we told Keras in this layer output mode
51:23
should be multi-hot or whatever, right?
51:26
Here, we don't want it to actually
51:27
encode anything as multi-hot. We just
51:29
want it to give us the integers back. So, we
51:30
tell it: give me int.
51:32
Okay? That's the first change.
51:35
We tell it: give us int. If you
51:36
say give us int, it'll stop with STI.
51:39
It'll just give you the integers.
51:41
Uh and then what you do is that
51:43
all the incoming sentences are going to
51:45
have different lengths. So, what we want
51:47
to do is we want to actually take all
51:48
these sentences and sort of normalize
51:50
them so they are of the same length.
51:52
Okay?
51:53
And the way we do that,
51:55
very quickly, is
51:57
that we choose a maximum
51:59
length for the
52:01
sentences, and then if something
52:04
exactly fits that length, perfect.
52:05
Let's say in this case we want a max
52:07
length of five. Cats sat on the mat is
52:08
exactly five. Boom, fits perfectly. But
52:11
if something is smaller, I love you is
52:12
only three of these things, we actually
52:14
pad it with something called the pad
52:16
token.
52:17
Much like the unk token, pad token is a
52:19
special token which we use for padding.
52:22
And
52:23
Keras, you will see, will use zeros for
52:25
this padding, so that it fills it
52:27
up and gets all the way to the end. And
52:29
if you have something which is much
52:31
longer than five, you just truncate
52:33
everything else and just use the first
52:34
five.
52:36
So, this is what we do to get all the
52:38
sentences to be of the same length.
52:42
Okay?
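As a minimal sketch of this step in code (the setup here is illustrative, not the exact Colab code), Keras's TextVectorization layer handles both the integer encoding and the pad/truncate step:

    import tensorflow as tf

    # int mode stops after standardize-tokenize-index: it returns integer ids,
    # padded with 0 or truncated so every sentence has the same length.
    vectorizer = tf.keras.layers.TextVectorization(
        max_tokens=100,
        output_mode="int",          # give us integers, not multi-hot
        output_sequence_length=5,   # pad/truncate every sentence to length 5
    )
    vectorizer.adapt(["the cat sat on the mat", "i love you"])
    print(vectorizer(["i love you"]))  # e.g. [[id id id 0 0]]; the 0s are pad tokens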
52:43
And once we do that we then go to the
52:45
embedding layer.
52:47
And the embedding layer is actually very
52:49
simple.
52:50
What is an embedding? It's just
52:51
a vector and we need a vector for every
52:53
token.
52:54
Of course, we're going to learn these
52:55
vectors. We need one for every token.
52:57
So, in this case for example, uh let's
52:59
say that these are all the tokens we
53:01
have
53:02
in our vocabulary after the STI process.
53:05
Maybe in this case we have 5,000 tokens.
53:08
For each token we have this embedding
53:09
vector, right? And we choose what the
53:11
dimension of that embedding vector is,
53:12
right? And so, we can set it up by
53:15
saying keras.layers.Embedding, and we
53:17
tell it max tokens, which means how
53:19
many rows we have here.
53:21
You know, what is the
53:21
vocabulary size that we're working with?
53:23
And then we tell it, okay, this is how
53:25
long I want each embedding vector to be.
53:28
So, the number of rows and the size of the columns, and
53:31
that's the embedding layer. And we'll
53:33
use it in a second. I just want to show
53:34
it to you here because it's
53:35
slightly clearer.
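A sketch of that layer, with the sizes from this example (5,000 tokens, and an embedding dimension we choose):

    # An embedding layer is just a (vocab_size x embedding_dim) table of
    # learnable vectors, one row per token id.
    embedding = tf.keras.layers.Embedding(
        input_dim=5000,   # rows: the vocabulary size (max tokens)
        output_dim=100,   # columns: how long each embedding vector is
    )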
53:37
So, when an input sentence arrives, the
53:38
text vectorization layer will run STI
53:40
on it. It'll truncate and pad it to max
53:42
length as needed. So, let's say this
53:44
phrase comes in, STI will give you the
53:46
same tokens plus pad pad because let's
53:48
say the max length is five and then
53:50
these are the corresponding integers.
53:52
And then
53:53
the embedding layer will just look up
53:55
the corresponding vector. So, for
53:56
example here, we need
53:59
to look up the vectors for 23, 9, 5, 0,
54:01
and 0. So, we just go here and look up
54:04
23, 9, 5, and 0. And then once we have
54:07
that, boom.
54:08
This is the resulting output. So,
54:10
whatever input sentence comes in, we
54:12
have now
54:13
five embedding vectors that have been
54:14
looked up from the embedding layer.
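In code, that lookup is just calling the layer on the integer ids; continuing the sketch above:

    # Feeding the padded ids [23, 9, 5, 0, 0] returns the corresponding rows.
    ids = tf.constant([[23, 9, 5, 0, 0]])
    vectors = embedding(ids)  # shape (1, 5, 100): five 100-dim vectors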
54:17
And once we do that
54:20
this is a table. So, I love you comes
54:22
in, it becomes this table. As we have
54:24
seen before
54:25
neural networks can only accommodate
54:27
vectors as inputs. We need to you know,
54:30
make this into a vector. And as we have
54:32
done before, you know, we can either
54:33
take all these things and concatenate
54:35
them, make a one long vector, or we can
54:37
find a way to average them or sum them
54:39
and things like that, right? As we have
54:40
seen before. And
54:42
the simplest thing is probably
54:44
just to average them. So,
54:46
these are some options, but
54:48
we'll average them here. And this is
54:51
called the global average pooling 1D
54:53
layer. And all it does is, whatever
54:55
table you give it, it just
54:57
takes each dimension and averages it.
54:59
The first dimension average, second
55:01
dimension average, and so on and so
55:02
forth. And once that's done,
55:04
that's the whole thing.
55:05
So,
55:07
the phrase comes in, STI gives you these
55:09
things, padding as needed or truncating
55:11
as needed. We look up the embeddings
55:14
from the embedding layer and then we get
55:16
all this thing. We do global average
55:18
pooling on it and it's done.
55:20
The resulting thing is a vector that can
55:22
then be passed into hidden layers just
55:24
like we normally do.
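Continuing the sketch, the pooling step is one line:

    # GlobalAveragePooling1D collapses the (5, 100) table into one 100-dim
    # vector by averaging each dimension across the five positions.
    pooled = tf.keras.layers.GlobalAveragePooling1D()(vectors)  # shape (1, 100)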
55:27
I'm going over this a little fast, but
55:29
make sure you look at it afterwards and
55:31
understand every step, and the Colab
55:33
will mirror this
55:34
you know, perfectly.
55:36
All right, so let's switch to the
55:37
Colab.
55:39
Okay. All right.
55:41
Can folks see this okay?
55:43
All right, so we'll do the usual.
55:46
Um
55:47
import all the stuff we need and then
55:49
because I want to plot some of these uh
55:51
loss and accuracy curves to
55:53
you know, just to see what's going on,
55:55
I'll just bring in the functions from
55:56
the previous Colabs.
55:58
Here.
55:59
And then um and I think I already have
56:01
downloaded this. Let me just make sure I
56:03
have it.
56:08
Uh it's not there. Okay.
56:11
Do it again.
56:13
This is the same songs data set that we
56:14
looked at on Monday.
56:17
Okay.
56:19
So, roughly 49,000 examples as we saw
56:21
before. We'll one-hot encode them.
56:25
All right, so there's a bunch of stuff
56:27
that we already covered in class. So,
56:28
this is the thing:
56:30
this URL has all the GloVe vectors
56:33
available for download. I downloaded it
56:35
before class because it takes a few
56:37
minutes. And I've also... Did I
56:39
unzip it?
56:41
Uh yes, I did. And so, let's just look
56:43
at the first few.
56:46
All right, so these are all the first
56:47
few. We'll create a sort of an easier to
56:49
view version of these GloVe vectors.
56:54
So, I'm going to use the vectors which
56:56
are 100 long, but they come in many
56:58
different sizes.
56:59
So, we have 400,000 vectors, 400,000
57:03
word vectors. Each is 100-dimensional.
57:05
Uh and these all have been calculated
57:07
from Wikipedia using
57:09
the model we described using gradient
57:11
descent. Okay?
57:12
All right, so this is the
57:15
vector for the word movie.
57:18
Yeah, I don't know what these dimensions
57:19
mean, but there's something going
57:21
on. It has figured stuff out.
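For reference, a minimal sketch of loading the downloaded file, assuming the standard glove.6B.100d.txt format where each line is a word followed by its 100 numbers:

    import numpy as np

    glove = {}
    with open("glove.6B.100d.txt", encoding="utf-8") as f:
        for line in f:
            word, *values = line.split()
            glove[word] = np.asarray(values, dtype="float32")
    print(len(glove))          # 400,000 words
    print(glove["movie"][:5])  # the first few of movie's 100 dimensions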
57:23
Uh but the proof is in the pudding,
57:24
right? So, all right, now we'll first
57:26
set up the text vectorization and
57:28
embedding layers like we saw before.
57:30
Um and so, I'm going to use uh a max
57:33
length of 300 for the songs.
57:36
Um right? Because all the sentences have
57:38
to be the same length. And you might be
57:40
wondering, okay, why did you pick 300
57:42
and not say 400 or 200? So, typically
57:44
what you do is you actually look at
57:46
the length distribution of the songs you
57:48
have, and you're looking
57:51
for like an 80/20 or, you know, one of
57:52
those things. And in this case it turns
57:54
out 90% of the songs have less than or
57:56
equal to 300 words in our data set. So,
57:59
I'm just going to go with 300. Okay?
58:00
It's pretty good. The problem is, if
58:03
you look at the song
58:04
which has the maximum length,
58:06
that might be like 3,000 words, and
58:09
there would be hardly any songs
58:10
3,000 long. You're just wasting a lot of
58:12
capacity by doing that. So, you're just
58:13
being a little pragmatic here.
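A sketch of that check (all_lyrics here is an assumed list of the raw song strings):

    # Pick the max length from the length distribution rather than the maximum.
    lengths = [len(song.split()) for song in all_lyrics]
    print(np.percentile(lengths, 90))  # roughly 300 words covers 90% of songs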
58:16
So, okay. And then, as before, for
58:18
the vocabulary itself, we tell Keras: use
58:20
the most frequent 5,000 words, right,
58:22
when you're doing the
58:24
STI. So, we do that and we tell it
58:27
the output mode is int like we saw
58:29
before.
58:32
We have there.
58:35
Okay, perfect.
58:36
Okay, this is a very dangerous thing
58:39
where somebody is remotely changing it
58:41
in another tab somewhere.
58:44
Fingers crossed. Okay.
58:50
Okay. So, we have this, and this is
58:52
what we did with all this stuff, as
58:54
I've covered. So, now we will adapt this
58:57
layer as we have seen before using all
58:59
the lyrics we have.
59:04
And once we do that, we'll take a look at
59:06
the first few.
59:08
And so, here's a very important thing.
59:10
Before, when we asked it to do multi-hot
59:12
encoding and so on on Monday,
59:14
the zeroth position was unk.
59:17
Right? Unk had zero. But here, unk
59:19
actually has one.
59:21
And the reason is that
59:23
the zeroth position is going to be
59:25
used for padding, essentially. You can think
59:28
of it as the empty string. That's how
59:30
Keras will print out pad.
59:32
So, the zero position is the padding,
59:35
the pad token. The first position is the
59:37
unk token. Okay?
59:39
So, it's an important thing here.
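Putting this setup together, a sketch (variable names are illustrative):

    vectorizer = tf.keras.layers.TextVectorization(
        max_tokens=5000,              # keep the most frequent 5,000 words
        output_mode="int",
        output_sequence_length=300,   # pad/truncate every song to 300 tokens
    )
    vectorizer.adapt(all_lyrics)      # learn the vocabulary from the corpus
    print(vectorizer.get_vocabulary()[:3])  # ['', '[UNK]', ...]: pad is 0, unk is 1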
59:41
So, let's say that we do
59:44
"HODL you're the best."
59:46
We vectorize it.
59:49
Do you think HODL
59:51
is going to be part of those 400,000
59:52
word vectors?
59:54
Wikipedia. Not yet. So,
59:57
Um all right. So, let's try that.
1:00:03
Okay, and as you can tell,
1:00:05
HODL is an unknown word, right? That's
1:00:08
why it's showing up here.
1:00:12
Right. So, one is unknown, right? The
1:00:14
index value one is unknown. Zero is pad.
1:00:18
But then,
1:00:19
this is unknown, HODL,
1:00:21
sorry, you're the best, and then
1:00:25
everything else from that point on is a
1:00:26
zero because we are padding all the way
1:00:28
to 300.
1:00:30
Okay? So, that's why you see all these
1:00:31
zeros here.
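In other words, something like this (the non-zero ids are illustrative, not the real vocabulary indices):

    print(vectorizer(["HODL you're the best"]))
    # e.g. [[1 23 9 5 0 0 ... 0]]: 1 is unk (HODL), trailing 0s pad out to 300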
1:00:32
All right. Uh now, let's just, you know,
1:00:34
run everything through
1:00:37
the vectorization layer, and then we'll
1:00:38
get to the embedding layer.
1:00:44
Okay. Now, first,
1:00:48
there's just a bit of Python
1:00:50
housekeeping
1:00:51
to create a nice, easy-to-look-at
1:00:54
matrix. So, what we're going to do is
1:00:56
we're actually going to create a nice
1:00:58
matrix which shows us all
1:01:00
the GloVe embeddings.
1:01:02
Um
1:01:04
And so, here, this is the embedding
1:01:05
matrix.
1:01:07
And this matrix has only 5,000 words,
1:01:09
and each is 100 long.
1:01:11
Why is this embedding matrix only 5,000
1:01:13
even though we downloaded 400,000
1:01:15
vectors?
1:01:21
Right. So, clearly the 5,000 we used
1:01:23
there has some bearing on this, but what
1:01:24
is that 5,000?
1:01:30
We told Keras to take the most frequent
1:01:32
5,000 words in our corpus.
1:01:34
So, we'll only have 5,000 in vocabulary.
1:01:36
That's why there's 5,000. So, we grab
1:01:38
just the GloVe vectors for
1:01:40
those 5,000 words that Keras has chosen to
1:01:42
be in the vocabulary. Okay? And that's
1:01:44
our embedding matrix.
1:01:45
And then, if you look at the first few
1:01:47
rows, the first two rows should be all
1:01:50
zeros because it's pad and unk,
1:01:52
which clearly GloVe doesn't know about.
1:01:54
They're going to be all zeros. And
1:01:57
so, you can see all these zeros here,
1:01:59
and then from the third word on, you start
1:02:00
getting some numbers. Okay?
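A sketch of that housekeeping, using the glove dict from the earlier sketch:

    # Build a (5000 x 100) matrix holding the GloVe vector for each word Keras
    # kept in the vocabulary; rows 0 (pad) and 1 (unk) stay all zeros.
    vocab = vectorizer.get_vocabulary()
    embedding_matrix = np.zeros((len(vocab), 100), dtype="float32")
    for i, word in enumerate(vocab):
        if word in glove:
            embedding_matrix[i] = glove[word]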
1:02:02
All right. Next, we'll set up the
1:02:04
embedding layer.
1:02:05
So,
1:02:06
basically, what's going on here is
1:02:07
we tell the embedding layer how
1:02:09
many rows, which is just the vocab size,
1:02:11
max tokens, and what the embedding
1:02:13
dimension is. Well, that's going to be 100,
1:02:15
because the GloVe vectors are 100 long. And
1:02:17
then, here's the thing. You can tell it:
1:02:19
in this embedding layer, just use
1:02:22
this matrix I'm giving you as the
1:02:23
embeddings. Because we already know
1:02:25
what the embeddings are. We downloaded
1:02:26
them from GloVe, right? So, we will
1:02:28
tell it to use GloVe as the
1:02:30
weights here, as the embeddings
1:02:32
here. So, we initialize it using that
1:02:34
embedding matrix, right? And then, we
1:02:36
tell it
1:02:38
don't train. When we do back propagation
1:02:40
later on, don't change any of these
1:02:41
weights because somebody spent a lot of
1:02:43
money creating these weights for us.
1:02:43
Stanford. So, we don't want to
1:02:45
further change them. Just freeze them
1:02:49
and use them as they are. Okay?
1:02:51
And this mask zero business I'll come
1:02:52
back to later. Don't worry about it for the
1:02:53
moment.
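A sketch of that configuration; one common way to hand Keras a pre-trained matrix is a Constant initializer:

    embedding = tf.keras.layers.Embedding(
        input_dim=len(vocab),
        output_dim=100,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False,  # freeze: backprop will not touch the GloVe weights
        mask_zero=True,   # lets downstream layers ignore the padded positions
    )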
1:02:55
All right. So, once we do that,
1:02:58
we are ready to set up our model. So,
1:03:00
this model is pretty simple. Keras
1:03:02
Input: the length, of course, is the
1:03:04
length of the sentence, right? Which is
1:03:05
300 long, and then the input
1:03:08
runs through an embedding layer right
1:03:09
there, right? And out comes a 300 by 100
1:03:12
table, and then we global average pool
1:03:14
it,
1:03:15
right? And that becomes a 100 element
1:03:17
vector, and then we are back in familiar
1:03:19
ground, and we run it through a dense
1:03:20
layer with eight ReLU neurons, right?
1:03:23
And then we run it
1:03:25
through the final output layer, which is
1:03:27
a three-way softmax as before: hip hop,
1:03:29
rock, pop. And then, we tell Keras that's
1:03:31
our model, and then we summarize it.
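A sketch of that model in the functional API; the parameter arithmetic works out to the totals shown below (5,000 x 100 = 500,000 frozen, plus 808 + 27 = 835 trainable):

    inputs = tf.keras.Input(shape=(300,), dtype="int64")         # one padded song
    x = embedding(inputs)                                        # (300, 100) table
    x = tf.keras.layers.GlobalAveragePooling1D()(x)              # 100-dim vector
    x = tf.keras.layers.Dense(8, activation="relu")(x)           # hidden layer
    outputs = tf.keras.layers.Dense(3, activation="softmax")(x)  # hip hop/rock/pop
    model = tf.keras.Model(inputs, outputs)
    model.summary()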
1:03:34
Okay. So, this is what we have. And you can
1:03:36
see here,
1:03:38
the total parameters are 500,835,
1:03:41
but the trainable parameters are only
1:03:42
835.
1:03:44
It's because the total parameters are
1:03:46
all the GloVe embeddings plus the
1:03:49
things we added to the GloVe embeddings
1:03:50
like the hidden layer and so on.
1:03:52
But for the GloVe embeddings, we have
1:03:54
told Keras: freeze them. Do not train them.
1:03:57
Right? Which means only the rest of it
1:03:58
is going to be trainable. That's
1:04:00
the 835. Yeah.
1:04:03
So, when we do the global average
1:04:05
pooling, don't we lose any
1:04:06
sense of meaning that we gain from the
1:04:09
embedding as we average very different
1:04:12
embeddings together?
1:04:14
Sorry, say that again. I missed the
1:04:16
first part.
1:04:16
>> If we average the embeddings of apple
1:04:18
and learning, for instance, they are
1:04:20
very different words that are used with
1:04:22
different meanings, so we have different
1:04:23
embeddings, but we average them, so don't we
1:04:26
lose that?
1:04:27
We will lose a bunch of stuff. Yeah,
1:04:28
yeah, yeah. Anytime
1:04:30
you average anything, you're going to
1:04:31
lose some nuance and so on. So, the
1:04:33
real question is: despite that
1:04:36
averaging, is it good enough for you?
1:04:37
And sometimes it's good enough.
1:04:39
Very often it's good enough, as it turns
1:04:41
out. But as you will see when you go to
1:04:42
contextual embeddings, there's just a
1:04:44
better way to do it, right?
1:04:45
But it
1:04:47
requires bigger models, more powerful
1:04:49
stuff, and so on and so forth. And
1:04:50
that's where you're going from the
1:04:51
foundations to the advanced stuff.
1:04:53
Yeah.
1:04:56
When we're doing optimization, like
1:04:58
let's say, a real-world problem, it's
1:05:00
often better to optimize everything
1:05:02
together than to optimize one part of
1:05:04
the system and then optimize the other
1:05:06
part of the system.
1:05:07
So, in that case, why wouldn't we want
1:05:09
to also change the embeddings?
1:05:12
I understand why we would
1:05:13
like to stop
1:05:15
with those weights that
1:05:17
some people have spent a lot of money
1:05:19
trying to find, but will
1:05:20
we be able to find more specific
1:05:23
embeddings related to our problem if we
1:05:25
let everything be
1:05:26
trainable? Yeah. Absolutely. Absolutely.
1:05:29
And in fact, you will see in the Colab
1:05:30
that we will do that next. I just
1:05:33
want to show people you don't have to do
1:05:35
it. You start with not training it
1:05:37
because it's going to be much faster.
1:05:38
And then, you train everything and see
1:05:39
if it gets better. And sometimes it'll
1:05:41
get better, in which case it's great.
1:05:42
Sometimes it won't get better. And I
1:05:44
will also show you, and I probably will
1:05:45
run out of time, so I'll do
1:05:46
it on Monday: hey,
1:05:48
what if you want to do your own
1:05:50
embeddings from scratch without using
1:05:51
GloVe?
1:05:52
So, all possibilities will be covered.
1:05:55
Um yeah. So, to come back to this, this
1:05:57
is the model we have. Um and then, all
1:06:00
right.
1:06:01
So, let's take a look at the first
1:06:03
few embedding vectors. By the way, this
1:06:05
model.layers
1:06:06
will give you every layer as a list,
1:06:09
a list of all the layers, and then you
1:06:10
can just grab any layer you want and
1:06:11
look at its weights. Okay? It's very
1:06:13
handy.
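For instance (the layer index assumes the model sketched above, where the embedding is layer 1):

    weights = model.layers[1].get_weights()[0]  # the (5000, 100) embedding matrix
    print(weights[:2])  # the rows for pad and unk: all zeros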
1:06:14
So, we're looking at the weights, and
1:06:15
you can see here
1:06:16
the first two vectors are all zeros
1:06:19
because those stand for pad and unk, and
1:06:21
then we have everything else. So,
1:06:22
everything looks fine so far. And now,
1:06:24
we just, you know, compile and fit it.
1:06:26
So, as usual, Adam, cross entropy,
1:06:28
accuracy.
1:06:30
Um and then, we'll just fit the model.
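A sketch of that compile-and-fit; the loss here assumes integer genre labels (if the labels are one-hot encoded, categorical_crossentropy is the equivalent), and X_train / y_train are assumed arrays:

    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    history = model.fit(X_train, y_train,
                        validation_split=0.2, epochs=10)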
1:06:33
All right.
1:06:34
It's going to take
1:06:36
a few minutes.
1:06:39
And while it's running: what you
1:06:41
will see in this Colab is that
1:06:43
in this particular case, the
1:06:44
embeddings actually don't help a whole
1:06:46
lot.
1:06:47
Why do you think that is?
1:06:51
Could it be because we're
1:06:52
averaging a lot of stuff? Maybe that's
1:06:54
hurting us.
1:06:57
Yeah.
1:06:58
I mean, I think that the embeddings
1:06:59
were pre-trained on some corpus, right?
1:07:01
Like Wikipedia or something like that,
1:07:03
that is a little bit
1:07:05
different from the language we tend to
1:07:06
use in song lyrics. So maybe
1:07:08
its ability
1:07:09
to sort of extract the
1:07:11
meaning of
1:07:13
candy from, like, a song lyric
1:07:16
is limited, because it's
1:07:18
thinking of all the other ways
1:07:19
that word could be used.
1:07:20
Yeah, so there could be a mismatch
1:07:22
between the corpus the
1:07:23
pre-trained stuff was trained on versus
1:07:26
the corpus that you're working with
1:07:27
right now. That's one big reason. The
1:07:29
other reason is that we actually
1:07:31
have 50,000 examples, basically.
1:07:34
It's a lot of data.
1:07:36
So, when you have a lot of data, you may
1:07:37
not need any of these things.
1:07:39
These things tend to do really well when
1:07:41
you don't have a lot of data, which
1:07:43
means you get to piggyback on
1:07:46
what these embeddings have learned from
1:07:47
all of Wikipedia.
1:07:49
So, when you have a smallish data
1:07:52
set, basically, the rule of thumb
1:07:54
here is that when your data is really
1:07:55
small, try to use a pre-trained model.
1:07:58
Right? And that's what you saw with the
1:07:59
handbags and shoes classifier, right? We
1:08:01
had 100 examples of handbags and shoes,
1:08:03
and we used ResNet to basically get
1:08:04
to 100% accuracy.
1:08:06
The same sort of logic applies here.
1:08:08
All right. So,
1:08:09
here, let's see what's happening. Uh
1:08:11
okay, it's done.
1:08:12
So, we'll plot.
1:08:16
Right.
1:08:16
Okay, look at this: a very
1:08:18
well-behaved loss function curve.
1:08:21
Uh
1:08:25
Okay.
1:08:26
So,
1:08:27
uh there doesn't seem to be any massive
1:08:28
overfitting going on. They are moving
1:08:30
really nicely in lockstep. Let's see
1:08:32
what the thing is.
1:08:36
Okay, 63%, which is not great. Um right?
1:08:39
Uh it's not as good as what we saw
1:08:40
before when we used all 50,000 examples
1:08:43
and just trained something from scratch,
1:08:44
and that's just because in this case, we
1:08:45
have lots of examples, so these pre-trained
1:08:47
embeddings aren't, you know, as helpful
1:08:49
as they could be.
1:08:50
But if you have a small data set, they
1:08:52
could be very helpful. And now, we go to
1:08:54
what
1:08:56
he pointed out. Like, why can't we just,
1:08:58
you know, optimize these embeddings,
1:08:59
too? Why do we have to
1:09:00
treat them as sacred?
1:09:02
Let's just
1:09:03
unleash back
1:09:06
prop on them and see what happens.
1:09:07
So, we'll do that. Um
1:09:11
So, here, what we do is we retrain it,
1:09:13
but here, we set trainable equals true
1:09:15
for the embedding layer. Okay? This is
1:09:17
the key step. Trainable equals true.
1:09:19
Otherwise, it's unchanged.
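A sketch of that one change; after flipping the flag, the model has to be recompiled for it to take effect:

    embedding.trainable = True                   # unfreeze the GloVe weights
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])          # recompile after the change
    model.fit(X_train, y_train, validation_split=0.2, epochs=10)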
1:09:20
Uh and then,
1:09:23
let's skip that.
1:09:27
We'll run it and see what happens. So
1:09:28
before it was whatever 63% accuracy or
1:09:31
something, we'll see if it gets better
1:09:33
if we train the whole thing.
1:09:35
And the thing is you can never be sure.
1:09:38
Right? Because it may start to overfit.
1:09:40
Uh which is why you just have to
1:09:41
empirically see what's going on. There
1:09:42
are no guarantees.
1:09:47
Um all right, any questions while it's
1:09:48
training?
1:09:50
Yeah.
1:09:51
In that first graph, when you have
1:09:54
the training accuracy still increasing,
1:09:56
that might suggest that you could
1:09:58
train it even more. Correct. Exactly.
1:10:00
Exactly. So in that curve,
1:10:02
we saw that the training was continuing
1:10:03
to increase. Typically what's going to
1:10:05
happen is the training will continue to
1:10:06
get better the more you train it. The
1:10:08
key thing is: is the validation also
1:10:10
improving? If the validation continues
1:10:12
to improve, there is a little bit more
1:10:13
gas left in the tank. You can keep
1:10:15
training more. If it starts to flatten
1:10:17
and even worse if it starts to go down,
1:10:19
then you want to pull back.
1:10:21
Yeah.
1:10:23
So you had set the maximum,
1:10:25
the limit, like the vocabulary
1:10:27
of the most common 5,000. And then the
1:10:29
width of that was 100. What is the 100?
1:10:31
The 100 is just the length of the GloVe
1:10:33
vector.
1:10:34
Does that mean that it can only capture
1:10:37
how that word is related to 100 other
1:10:39
words? No, no. Basically, we are
1:10:41
saying that every word's intrinsic
1:10:43
meaning can be captured using a vector
1:10:45
of 100 dimensions.
1:10:48
Those dimensions mean something. We
1:10:49
don't know what they are. The first
1:10:51
dimension could mean color. Second could
1:10:53
mean some sort of location. The third
1:10:55
could mean some sort of, say, time of the
1:10:57
year. We just have no idea.
1:11:01
Okay, and then with the pre-trained model,
1:11:02
we're not going to learn those; the
1:11:04
pre-trained model has those
1:11:05
already. We don't know what they are,
1:11:07
but it has some... The people who
1:11:08
created it don't know what they are
1:11:10
either.
1:11:10
All they know is that for each word they
1:11:13
learned a 100 long vector.
1:11:15
And that 100-long vector was able to
1:11:18
kind of recreate the co-occurrence
1:11:20
matrix.
1:11:21
And then they probed it using that
1:11:23
visualization of man woman sister
1:11:25
brother all that stuff and it seems to
1:11:26
sort of fit with what you would expect.
1:11:29
Can you think of it as analogous to
1:11:31
when we did the convolutional ones, you
1:11:33
have the number of kernels, right? So in
1:11:35
this case, if you have 32 kernels,
1:11:37
it's sort of like 32 things it can
1:11:39
learn.
1:11:40
I think that's actually a great analogy.
1:11:42
I love it. That's a great way to
1:11:43
think about it. Yes. Much like we got
1:11:46
to decide how many filters to
1:11:48
have, here we get to decide how long the
1:11:50
embedding dimension needs to be and our
1:11:51
hope is that the more things we are able
1:11:53
to accommodate, the more complicated
1:11:55
things it will pick up. Right? Uh at the
1:11:57
same time, you don't want to have too
1:11:58
many of these things because it's going
1:11:59
to start picking up noise.
1:12:01
And that's never a
1:12:03
good thing.
1:12:05
Okay.
1:12:06
Um
1:12:07
Another question on this side?
1:12:09
Yeah.
1:12:10
Go ahead. My
1:12:12
question is
1:12:13
why do we use embeddings
1:12:15
and not the actual
1:12:17
co-occurrence matrix rows to
1:12:20
represent words, right? Like, why do we
1:12:23
need to abstract? Yeah, yeah, yeah.
1:12:25
That's actually a
1:12:26
good question. One
1:12:28
immediate reason is that that row is
1:12:30
500,000 entries long. 500,000 long.
1:12:33
Right? So you want a compact dense
1:12:35
representation of a word.
1:12:37
The second thing is that thing is
1:12:39
subject to all the counts of the
1:12:40
Wikipedia corpus. It's not normalized.
1:12:43
So you need to normalize it so that if
1:12:45
you take any two rows and do dot
1:12:47
product, you will get some number which
1:12:49
is sort of in a narrow range. Otherwise
1:12:50
things don't become comparable.
1:12:53
Now, both these objections can be
1:12:55
handled. You can normalize, you can
1:12:57
reduce the size of the corpus and so on
1:12:59
and so forth. And in fact that used to
1:13:00
be a very common way people did
1:13:01
it before.
1:13:03
But what they have discovered is that
1:13:04
these the way we learn embeddings now
1:13:06
tends to be much more effective in
1:13:07
practice.
1:13:10
So what we thought is:
1:13:13
what this process does is it
1:13:16
creates this like n-dimensional
1:13:18
incomprehensible matrix that captures
1:13:21
in essence a summarized version of these
1:13:23
relationships.
1:13:25
Correct. A compact representation of
1:13:28
relationships which is not subject to
1:13:30
the size of your vocabulary.
1:13:33
So you know, you have 500,000 words
1:13:34
today, tomorrow somebody comes up with
1:13:36
a word like selfie, which didn't
1:13:37
exist five years ago.
1:13:39
And now your corpus has gotten a little
1:13:40
bit bigger, right? So here it's very
1:13:42
compact and it tends to have a much
1:13:43
longer shelf life.
1:13:48
Yeah.
1:13:49
Uh all right, so let's see where we are.
1:13:54
Uh okay. So evaluate.
1:13:59
68, almost 69%. It was 63, went to 69. So
1:14:02
clearly here training the whole thing,
1:14:04
including GloVe, actually helps. And
1:14:06
so that sort of begs the question, well,
1:14:08
if training GloVe helps,
1:14:11
maybe we should actually train the whole
1:14:13
thing from scratch.
1:14:15
Like why the hell not, right? Why the
1:14:16
heck not? I apologize.
1:14:19
So uh what we'll do is we'll actually
1:14:21
create our own embeddings and just train
1:14:22
them. And here we don't have to worry
1:14:24
about co-occurrence matrices and so on
1:14:26
and so forth because we have a very
1:14:27
specific objective. We want to be very
1:14:29
accurate in predicting genre for these
1:14:30
songs.
1:14:32
The people who worked on
1:14:34
GloVe
1:14:35
didn't have a specific objective. They
1:14:36
just wanted to create embeddings that
1:14:37
were generally useful.
1:14:39
Okay? Here we want to be specifically
1:14:41
useful for genre prediction.
1:14:43
And so what we can do is we can actually
1:14:45
train the whole thing ourselves, right?
1:14:48
We can actually
1:14:50
put an embedding
1:14:51
layer here. You know, we just
1:14:53
arbitrarily decided to choose 64 as
1:14:55
the dimension as opposed to 100. It
1:14:57
will run faster. Uh and then it's the
1:14:59
same thing. Global average pooling,
1:15:01
activation, blah blah blah blah blah. Um
1:15:03
and then you run it.
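A sketch of the from-scratch variant: the same architecture, but the embedding layer starts random and trains along with everything else:

    scratch_embedding = tf.keras.layers.Embedding(
        input_dim=5000,
        output_dim=64,     # smaller than GloVe's 100, chosen for speed
        mask_zero=True,
    )
    inputs = tf.keras.Input(shape=(300,), dtype="int64")
    x = scratch_embedding(inputs)                  # randomly initialized, trainable
    x = tf.keras.layers.GlobalAveragePooling1D()(x)
    x = tf.keras.layers.Dense(8, activation="relu")(x)
    outputs = tf.keras.layers.Dense(3, activation="softmax")(x)
    scratch_model = tf.keras.Model(inputs, outputs)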
1:15:08
We'll see if it finishes in the next
1:15:09
minute.
1:15:12
And we'll see if it actually does better
1:15:14
than the pre-trained embeddings or the
1:15:16
pre-trained embeddings that have been
1:15:17
further fine-tuned. And I don't remember
1:15:19
what I saw when I ran it yesterday.
1:15:21
Uh and while it's running, other
1:15:23
questions?
1:15:24
Yeah.
1:15:25
So my question is regarding embeddings.
1:15:28
When we create an embedding for a particular
1:15:30
word, we indicate that we have a certain
1:15:32
number of parameters. Let's say in this
1:15:33
case
1:15:35
we defined 100. So there will be 100
1:15:36
parameters, and there will be
1:15:37
coefficients, weights, for each of them.
1:15:40
So when we take a pre-trained model,
1:15:42
right?
1:15:43
The one we took, GloVe. So for each word
1:15:45
there would already be that number of
1:15:47
parameters. Yeah. So
1:15:49
then how do we redefine them? Say
1:15:51
we want only 100, or we want only 10
1:15:53
parameters?
1:15:54
You know, the GloVe thing actually
1:15:56
comes pre-packaged to
1:15:59
be 100 long. I think they have 200 and
1:16:01
300 as well, if I recall. We just
1:16:03
happened to use the one with
1:16:04
100.
1:16:05
>> The one that is available in Google?
1:16:07
Yeah, yeah. And there are many
1:16:09
available. We just get to pick and
1:16:10
choose, and I happened to pick 100.
1:16:12
Uh
1:16:13
Oh, it's okay. So it's a bit slow, but
1:16:15
it's actually looking promising.
1:16:17
Um
1:16:18
9:55, yeah.
1:16:21
So during the CNN model training in
1:16:23
our assignments,
1:16:24
changing the filters gave us more depth
1:16:27
than improvement in performance.
1:16:29
So here would I be right in concluding
1:16:32
that it's actually training the
1:16:33
embeddings which is giving us more,
1:16:34
assuming that epochs and batch size
1:16:36
are not
1:16:37
changed as much. So if I really want a
1:16:39
genuine change in performance, we go
1:16:42
to the level of retraining the
1:16:43
embeddings.
1:16:44
Yeah, so what we saw was that using
1:16:46
GloVe as is was okay. Using GloVe and
1:16:48
then training them helped a lot. And now
1:16:50
we are basically saying, well, what if
1:16:51
we just abandon GloVe and train our own
1:16:53
embeddings for our particular problem.
1:16:55
See, GloVe is a general purpose tool.
1:16:57
So a general purpose tool is really good
1:16:59
if you don't have a lot of data
1:17:00
as a good starting point. But when you
1:17:01
have a lot of data, you should always
1:17:03
try to do your own thing and see if it's
1:17:04
any better.
1:17:05
And in this case, I
1:17:07
well, whoa. Okay, I think it's
1:17:09
uh
1:17:10
Come on, it's 9:55.
1:17:14
The button is going to enter any moment
1:17:15
now.
1:17:21
Right, let's just look at the thing.
1:17:25
Okay, folks. So 74%, 72%.
1:17:29
So you can actually train your own
1:17:30
thing, because of the 50,000 examples, and you
1:17:31
can see an even better result. Thanks a
1:17:33
lot. Have a good rest of the week.
— end of transcript —