WEBVTT

00:00:21.359 --> 00:00:25.039
We'll continue our journey with

00:00:23.120 --> 00:00:26.679
natural language processing.

00:00:25.039 --> 00:00:28.039
We looked at the bag of words model,

00:00:26.679 --> 00:00:30.679
one-hot embeddings, and so on and so

00:00:28.039 --> 00:00:32.679
forth. And today we will talk about

00:00:30.679 --> 00:00:34.759
embeddings, or to be more precise,

00:00:32.679 --> 00:00:36.159
stand-alone embeddings, and then that

00:00:34.759 --> 00:00:38.799
will tee us up for something called

00:00:36.159 --> 00:00:40.439
contextual embeddings, which is where

00:00:38.799 --> 00:00:41.519
the transformer really sort of comes

00:00:40.439 --> 00:00:43.960
into play.

00:00:41.520 --> 00:00:47.920
All right, so let's get going. So far

00:00:43.960 --> 00:00:50.200
we have encoded input text

00:00:47.920 --> 00:00:52.480
as one-hot vectors. So just to refresh

00:00:50.200 --> 00:00:53.640
your memories from Monday,

00:00:52.479 --> 00:00:55.640
if this is the phrase

00:00:53.640 --> 00:00:58.560
that's coming into the system, we run it

00:00:55.640 --> 00:01:01.359
through the STIE process. And when we do

00:00:58.560 --> 00:01:03.600
that, what happens is that first of all,

00:01:01.359 --> 00:01:05.760
we standardize, then we

00:01:03.600 --> 00:01:08.960
split on white space to get individual

00:01:05.760 --> 00:01:10.840
words, then we assign words to integers,

00:01:08.959 --> 00:01:12.439
and then we take each integer

00:01:10.840 --> 00:01:15.159
and essentially create a one-hot version

00:01:12.439 --> 00:01:18.759
of that integer. And when we do that,

00:01:15.159 --> 00:01:20.319
basically we have a vocabulary.

00:01:18.760 --> 00:01:23.040
Right? And in this example, we just have

00:01:20.319 --> 00:01:25.159
100 words, and you will note that this

00:01:23.040 --> 00:01:28.120
vocabulary, which you arrive

00:01:25.159 --> 00:01:30.039
at once you standardize and tokenize,

00:01:28.120 --> 00:01:32.000
has words like the, because we

00:01:30.040 --> 00:01:33.840
decided not to remove stop words like A,

00:01:32.000 --> 00:01:36.159
and the,

00:01:33.840 --> 00:01:38.159
and so on. So just to be clear,

00:01:36.159 --> 00:01:40.399
standardization

00:01:38.159 --> 00:01:42.519
here, standardization, while it has

00:01:40.400 --> 00:01:45.000
historically been all about stripping

00:01:42.519 --> 00:01:47.640
punctuation, lowercasing everything,

00:01:45.000 --> 00:01:49.239
removing stop words, and stemming,

00:01:47.640 --> 00:01:51.519
while that has been true historically,

00:01:49.239 --> 00:01:54.560
if you look at modern practice, people

00:01:51.519 --> 00:01:57.039
essentially strip punctuation maybe, and

00:01:54.560 --> 00:01:58.439
then lowercase, and they often don't

00:01:57.040 --> 00:02:00.120
even bother to do stemming and things

00:01:58.439 --> 00:02:01.120
like that, or to remove stop words.

00:02:00.120 --> 00:02:03.800
Okay?

00:02:01.120 --> 00:02:05.840
And that's why in Keras, the default

00:02:03.799 --> 00:02:08.800
standardization is only lowercasing and

00:02:05.840 --> 00:02:08.800
punctuation stripping.
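
NOTE
Editor's note: a minimal sketch of the Keras default standardization mentioned
above, assuming TensorFlow/Keras 2.x is installed. "lower_and_strip_punctuation"
and "whitespace" are the documented defaults for TextVectorization; the toy
sentences are made up.
import tensorflow as tf
# Default standardization: lowercase + strip punctuation only
# (no stop-word removal, no stemming).
vectorizer = tf.keras.layers.TextVectorization(
    standardize="lower_and_strip_punctuation",  # the Keras default, spelled out
    split="whitespace",                         # split on white space
    output_mode="int",                          # map each token to an integer id
)
vectorizer.adapt(["The movie was GREAT!", "A great film."])
print(vectorizer(["The movie was great"]))      # integer ids, one per token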

00:02:09.479 --> 00:02:12.840
This detail may actually be handy for

00:02:11.319 --> 00:02:14.280
homework two, perhaps. That's why I'm

00:02:12.840 --> 00:02:17.080
pointing it out.

00:02:14.280 --> 00:02:18.840
Okay. So that's what we have. And so for

00:02:17.080 --> 00:02:20.719
each word that's coming in, we have a

00:02:18.840 --> 00:02:22.520
one-hot vector.

00:02:20.719 --> 00:02:25.800
Right? But the one-hot vector is just

00:02:22.520 --> 00:02:27.520
as long as the vocabulary. And then, you

00:02:25.800 --> 00:02:29.520
know, we can either

00:02:27.520 --> 00:02:32.719
quote unquote add them up and get a

00:02:29.520 --> 00:02:34.719
multi-hot encoding, or

00:02:32.719 --> 00:02:36.560
sorry, get a count encoding, or we can

00:02:34.719 --> 00:02:38.280
just do an OR, right? Look for any

00:02:36.560 --> 00:02:39.280
ones in a column and get multi-hot

00:02:38.280 --> 00:02:42.199
encoding.
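
NOTE
Editor's note: a small NumPy sketch of the two pooling options just described
("add them up" vs. OR), using a made-up vocabulary of size 6 and a toy phrase.
import numpy as np
vocab_size = 6
token_ids = [2, 4, 2]                            # the phrase as integer ids (toy example)
one_hot = np.eye(vocab_size)[token_ids]          # one one-hot row per token
count_encoding = one_hot.sum(axis=0)             # add them up: [0, 0, 2, 0, 1, 0]
multi_hot = one_hot.any(axis=0).astype(int)      # OR the columns: [0, 0, 1, 0, 1, 0]
print(count_encoding, multi_hot)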

00:02:39.280 --> 00:02:44.439
So that's what we saw last class. But

00:02:42.199 --> 00:02:47.399
this scheme, while it's quite effective

00:02:44.439 --> 00:02:49.079
for simple kinds of problems,

00:02:47.400 --> 00:02:50.920
has some very serious

00:02:49.080 --> 00:02:52.760
shortcomings. And so we will sort of

00:02:50.919 --> 00:02:54.280
delve into those shortcomings, and then

00:02:52.759 --> 00:02:57.679
sort of step back and say, all right, is

00:02:54.280 --> 00:02:57.680
there a solution to fix these things?

00:02:58.319 --> 00:03:01.599
Problem with one-hot vectors.

00:03:00.199 --> 00:03:04.319
There are lots of problems. Any

00:03:01.599 --> 00:03:04.319
volunteers?

00:03:07.879 --> 00:03:12.439
Similar words are understood

00:03:09.919 --> 00:03:12.439
differently.

00:03:21.919 --> 00:03:26.319
Absolutely. So what he's pointing

00:03:24.439 --> 00:03:28.120
out is that if you have two words which

00:03:26.319 --> 00:03:29.439
are synonyms, let's say, great and

00:03:28.120 --> 00:03:31.840
awesome,

00:03:29.439 --> 00:03:33.759
we'd hope that the way we represent them

00:03:31.840 --> 00:03:35.120
using these vectors would have some

00:03:33.759 --> 00:03:37.439
connection to what the words actually

00:03:35.120 --> 00:03:38.800
mean. In particular, we would hope that

00:03:37.439 --> 00:03:40.919
if they mean similar things, that they

00:03:38.800 --> 00:03:41.920
are sort of close by. If they mean very

00:03:40.919 --> 00:03:43.599
different things, we would hope that

00:03:41.919 --> 00:03:44.839
they are very far away. Right? Things

00:03:43.599 --> 00:03:46.280
like that. Sort of common sensical

00:03:44.840 --> 00:03:49.000
expectations of what you want the

00:03:46.280 --> 00:03:50.599
vectors to have. So clearly it won't

00:03:49.000 --> 00:03:53.039
have that, and we'll look into it in

00:03:50.599 --> 00:03:54.759
detail in a bit. But before we do that,

00:03:53.039 --> 00:03:56.199
there is also a computational issue,

00:03:54.759 --> 00:03:59.120
which we covered last class, which is

00:03:56.199 --> 00:04:01.560
that if the vocabulary is really long,

00:03:59.120 --> 00:04:03.360
then each token, each word that's coming

00:04:01.560 --> 00:04:04.400
in here, will have a one-hot vector

00:04:03.360 --> 00:04:06.640
that's as long as the size of

00:04:04.400 --> 00:04:08.360
vocabulary. Right? If you have 500,000

00:04:06.639 --> 00:04:09.759
words in your vocabulary, every little

00:04:08.360 --> 00:04:12.200
word that comes in has a vector which is

00:04:09.759 --> 00:04:15.840
500,000 long. Which feels like a gross

00:04:12.199 --> 00:04:15.839
sort of waste.

00:04:16.798 --> 00:04:20.120
Now you can mitigate it somewhat by

00:04:18.199 --> 00:04:21.680
choosing only the most frequent words,

00:04:20.120 --> 00:04:23.360
but it does increase the number of weights

00:04:21.680 --> 00:04:25.280
the model has to learn, and increase the

00:04:23.360 --> 00:04:26.800
need for compute and data, and so on and

00:04:25.279 --> 00:04:27.759
so forth. Okay?

00:04:26.800 --> 00:04:28.800
Now

00:04:27.759 --> 00:04:31.159
let's say that we have created a

00:04:28.800 --> 00:04:32.520
vocabulary from a training corpus. Okay?

00:04:31.160 --> 00:04:34.400
We have a bunch of

00:04:32.519 --> 00:04:36.439
strings, text that's coming in. We have

00:04:34.399 --> 00:04:37.639
done the

00:04:36.439 --> 00:04:39.439
standardization and tokenization. We

00:04:37.639 --> 00:04:41.399
have created a vocabulary from it. And

00:04:39.439 --> 00:04:42.319
let's say we get the words movie and

00:04:41.399 --> 00:04:44.679
film.

00:04:42.319 --> 00:04:47.199
So the question is, and the earlier

00:04:44.680 --> 00:04:48.639
observation gets to this immediately, if

00:04:47.199 --> 00:04:50.039
you look at the words movie and film,

00:04:48.639 --> 00:04:52.759
are these two vectors close to each

00:04:50.040 --> 00:04:56.800
other or not? Okay? So if you have two

00:04:52.759 --> 00:04:56.800
vectors, how would we measure closeness?

00:04:56.879 --> 00:05:00.759
What's the simplest way to think about

00:04:58.240 --> 00:05:00.759
closeness?

00:05:02.519 --> 00:05:06.680
It's not a trick question.

00:05:05.199 --> 00:05:08.680
Distance. Yeah, exactly. So if they are

00:05:06.680 --> 00:05:10.480
really close distance-wise, we would

00:05:08.680 --> 00:05:13.480
hope, right? Similar words

00:05:10.480 --> 00:05:16.280
should be close by. So

00:05:13.480 --> 00:05:19.759
here, let's just imagine that the

00:05:16.279 --> 00:05:19.759
vector for movie,

00:05:20.000 --> 00:05:22.199
let's say your vocabulary is, I don't

00:05:21.480 --> 00:05:24.600
know,

00:05:22.199 --> 00:05:24.599
um

00:05:25.079 --> 00:05:30.359
100,000 long.

00:05:27.879 --> 00:05:33.000
So your vector is 100,000 long,

00:05:30.360 --> 00:05:35.520
and the word for movie

00:05:33.000 --> 00:05:39.399
is the position, so this this has a one,

00:05:35.519 --> 00:05:39.399
everything else is zero. Right?

00:05:42.519 --> 00:05:47.279
Sorry, this is the vector for film, and

00:05:44.199 --> 00:05:51.039
maybe this is the position for film.

00:05:47.279 --> 00:05:53.439
So that has a one, everything else here

00:05:51.040 --> 00:05:55.640
zero. Okay? What's the distance between

00:05:53.439 --> 00:05:58.000
these two vectors?

00:05:55.639 --> 00:06:00.360
You just use the Euclidean distance. So

00:05:58.000 --> 00:06:01.759
the Euclidean distance, you will recall,

00:06:00.360 --> 00:06:02.639
you literally just take the difference

00:06:01.759 --> 00:06:04.079
of

00:06:02.639 --> 00:06:06.039
these values,

00:06:04.079 --> 00:06:07.159
square them, add them up, take square

00:06:06.040 --> 00:06:09.360
root.

00:06:07.160 --> 00:06:12.080
So which means that all the zeros will

00:06:09.360 --> 00:06:14.000
obviously give you zero. This one is

00:06:12.079 --> 00:06:15.399
going to give you a one.

00:06:14.000 --> 00:06:18.079
This comparison is going to give you

00:06:15.399 --> 00:06:20.079
another one. 1 + 1 = 2. Root 2. That's

00:06:18.079 --> 00:06:21.039
the answer.

00:06:20.079 --> 00:06:23.759
So the distance between these two

00:06:21.040 --> 00:06:23.760
vectors is root 2.
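
NOTE
Editor's note: a quick NumPy check of the root-2 distance worked out above. The
vocabulary size and the positions of the two ones are arbitrary stand-ins.
import numpy as np
vocab_size = 10                                  # stand-in for 100,000
movie = np.zeros(vocab_size); movie[3] = 1.0     # one-hot for "movie"
film = np.zeros(vocab_size); film[7] = 1.0       # one-hot for "film"
print(np.sqrt(((movie - film) ** 2).sum()))      # 1.4142... = sqrt(2), same for ANY two distinct words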

00:06:25.120 --> 00:06:30.639
Now,

00:06:27.240 --> 00:06:32.639
so the distance between them is root 2.

00:06:30.639 --> 00:06:34.959
What about the one-hot encoded vectors

00:06:32.639 --> 00:06:36.800
for good and bad? Clearly good and bad

00:06:34.959 --> 00:06:37.799
mean opposite things.

00:06:36.800 --> 00:06:40.960
What is the distance between the good

00:06:37.800 --> 00:06:40.960
and bad 01 vectors?

00:06:42.600 --> 00:06:45.320
Still root 2.

00:06:45.360 --> 00:06:49.920
Because the zeros don't mean anything,

00:06:47.800 --> 00:06:51.040
the ones are not in the same place.

00:06:49.920 --> 00:06:52.640
So when you subtract the one and the

00:06:51.040 --> 00:06:54.879
zero, you'll get ones and ones, add them

00:06:52.639 --> 00:06:56.560
up, two, root 2.

00:06:54.879 --> 00:06:57.800
In fact, you take any two words in your

00:06:56.560 --> 00:06:59.720
vocabulary, what's the distance between

00:06:57.800 --> 00:07:01.560
the two one-hot vectors for those words?

00:06:59.720 --> 00:07:03.960
It's root 2.

00:07:01.560 --> 00:07:06.399
So if any two words have the same

00:07:03.959 --> 00:07:08.759
distance, does this even have a notion

00:07:06.399 --> 00:07:10.879
of distance?

00:07:08.759 --> 00:07:12.639
It doesn't.

00:07:10.879 --> 00:07:13.959
There's no notion of distance from

00:07:12.639 --> 00:07:15.959
one-hot vectors.

00:07:13.959 --> 00:07:17.839
It has no connection to the actual

00:07:15.959 --> 00:07:21.199
meanings of these words.

00:07:17.839 --> 00:07:22.319
It's just a way of representing them.

00:07:21.199 --> 00:07:24.039
Okay?

00:07:22.319 --> 00:07:26.240
So that is the big problem with one-hot

00:07:24.040 --> 00:07:27.400
vectors.

00:07:26.240 --> 00:07:28.639
So

00:07:27.399 --> 00:07:29.519
the distance between them is the same

00:07:28.639 --> 00:07:30.439
regardless of the words. It's got

00:07:29.519 --> 00:07:32.079
nothing to do with the meaning of the

00:07:30.439 --> 00:07:33.920
words.

00:07:32.079 --> 00:07:35.879
And this is a huge problem, which we'll

00:07:33.920 --> 00:07:37.759
have to solve.

00:07:35.879 --> 00:07:39.439
So to summarize where we are, if the

00:07:37.759 --> 00:07:40.519
vocabulary is very long, each token will

00:07:39.439 --> 00:07:42.279
have a one-hot vector that's as long as the

00:07:40.519 --> 00:07:44.599
vocabulary. That's sort of a

00:07:42.279 --> 00:07:46.759
computational and sort of training

00:07:44.600 --> 00:07:48.080
problem. And then this is a deeper

00:07:46.759 --> 00:07:49.039
problem, where there's no connection

00:07:48.079 --> 00:07:51.240
between the meaning of a word and its

00:07:49.040 --> 00:07:55.080
vector.

00:07:51.240 --> 00:07:57.639
So wouldn't it be nice if

00:07:55.079 --> 00:07:59.879
vectors that represent synonyms,

00:07:57.639 --> 00:08:01.839
movie and film, apple, banana,

00:07:59.879 --> 00:08:03.519
hopefully they're close to each other.

00:08:01.839 --> 00:08:04.919
It would be nice if the vectors for

00:08:03.519 --> 00:08:06.919
things that mean very different things

00:08:04.920 --> 00:08:08.400
are far from each other.

00:08:06.920 --> 00:08:10.840
So let's take a look at a particular

00:08:08.399 --> 00:08:13.239
example. Okay? Let's assume that we have

00:08:10.839 --> 00:08:15.279
been magically given

00:08:13.240 --> 00:08:17.319
these vectors, so that they actually

00:08:15.279 --> 00:08:18.959
have some notion of meaning.

00:08:17.319 --> 00:08:21.839
And for convenience, let's say that we

00:08:18.959 --> 00:08:23.759
take just the first

00:08:21.839 --> 00:08:25.239
two dimensions of these vectors, the

00:08:23.759 --> 00:08:28.120
first two dimensions, so that we can do

00:08:25.240 --> 00:08:30.040
a scatter plot on them.

00:08:28.120 --> 00:08:31.959
So we plot the first dimension of the of

00:08:30.040 --> 00:08:34.158
these vectors, the second dimension, and

00:08:31.959 --> 00:08:37.279
what we have in this little cartoon is

00:08:34.158 --> 00:08:41.559
we have plotted the word for

00:08:37.279 --> 00:08:44.000
factory, uh for home, for building, and

00:08:41.559 --> 00:08:45.719
they all happen to be clustered here.

00:08:44.000 --> 00:08:48.320
Clearly this representation is capturing

00:08:45.720 --> 00:08:50.120
some notion of what the thing is.

00:08:48.320 --> 00:08:53.680
Right? Some sort of building.

00:08:50.120 --> 00:08:55.919
Uh and here we have, you know, bicycle,

00:08:53.679 --> 00:08:57.799
truck, and car. This is

00:08:55.919 --> 00:09:00.000
like the automobile cluster, right?

00:08:57.799 --> 00:09:02.079
Transportation cluster. And here we have

00:09:00.000 --> 00:09:04.240
like a fruit cluster, and here we have

00:09:02.080 --> 00:09:05.879
some, you know, sports balls cluster.

00:09:04.240 --> 00:09:07.560
Okay?

00:09:05.879 --> 00:09:10.439
Because it's a cartoon, things are

00:09:07.559 --> 00:09:12.319
all nice and cleanly separated. Okay? So

00:09:10.440 --> 00:09:14.360
now if you take the word apple, where do

00:09:12.320 --> 00:09:19.000
you think it's going to go?

00:09:14.360 --> 00:09:20.840
It's going to go in into A, C, D, or B?

00:09:19.000 --> 00:09:23.519
C, right? It makes eminent sense it's

00:09:20.840 --> 00:09:23.519
going to go to C.

00:09:23.600 --> 00:09:27.920
Good. Now,

00:09:25.440 --> 00:09:29.960
wouldn't it be nice if

00:09:27.919 --> 00:09:32.839
more generally, if the geometric

00:09:29.960 --> 00:09:35.120
relationship between word vectors

00:09:32.840 --> 00:09:37.000
represent the semantic relationship

00:09:35.120 --> 00:09:38.399
between the underlying objects that the

00:09:37.000 --> 00:09:39.080
words represent?

00:09:38.399 --> 00:09:41.039
Okay?

00:09:39.080 --> 00:09:42.800
And I say relationship and not

00:09:41.039 --> 00:09:45.559
distance, because it's not just

00:09:42.799 --> 00:09:46.319
distance. It's actually more than that.

00:09:45.559 --> 00:09:48.239
Okay?

00:09:46.320 --> 00:09:49.720
So let's take another one.

00:09:48.240 --> 00:09:52.200
Here we have

00:09:49.720 --> 00:09:54.320
this is the vector plotted for

00:09:52.200 --> 00:09:56.240
puppy and dog,

00:09:54.320 --> 00:09:58.040
and this is calf.

00:09:56.240 --> 00:09:59.639
Uh right? We have plotted the word for

00:09:58.039 --> 00:10:01.599
calf. And let's say that we need to

00:09:59.639 --> 00:10:04.879
figure out where would the embedding,

00:10:01.600 --> 00:10:07.920
the word vector for cow appear?

00:10:04.879 --> 00:10:09.639
It is the most logical. Should it be A?

00:10:07.919 --> 00:10:11.639
Should it be C? Should it be B? Where

00:10:09.639 --> 00:10:13.919
should it be?

00:10:11.639 --> 00:10:13.919
This is

00:10:14.000 --> 00:10:19.600
C? Okay, what's the logic?

00:10:16.320 --> 00:10:21.440
Any volunteers? Just put your hand up.

00:10:19.600 --> 00:10:23.480
Uh, yes.

00:10:21.440 --> 00:10:26.200
Uh

00:10:23.480 --> 00:10:27.639
A calf is a baby bull, whereas the cow

00:10:26.200 --> 00:10:28.720
is an adult.

00:10:27.639 --> 00:10:31.240
So, it should be closer to the dog,

00:10:28.720 --> 00:10:32.840
which is the adult version of a puppy.

00:10:31.240 --> 00:10:34.600
Got it. So, you're basically saying go

00:10:32.840 --> 00:10:36.120
from the puppy version to the grown-up

00:10:34.600 --> 00:10:37.560
version. Right? That's sort of what

00:10:36.120 --> 00:10:39.560
you're getting at, right? And that's a

00:10:37.559 --> 00:10:40.719
totally valid way to think about it.

00:10:39.559 --> 00:10:42.479
But there are a couple of ways to think

00:10:40.720 --> 00:10:44.600
about this, and this is one of

00:10:42.480 --> 00:10:45.800
those two ways. So, what you can do is

00:10:44.600 --> 00:10:46.920
you can actually look at it and say,

00:10:45.799 --> 00:10:48.719
well,

00:10:46.919 --> 00:10:50.759
Okay, if this is bringing you, you

00:10:48.720 --> 00:10:52.920
know, bad memories of GMAT and GRE and

00:10:50.759 --> 00:10:55.080
stuff like that, I apologize.

00:10:52.919 --> 00:10:57.120
But

00:10:55.080 --> 00:10:59.720
So, a puppy is to a dog like a calf is

00:10:57.120 --> 00:11:01.159
to a cow, right? That's

00:10:59.720 --> 00:11:02.720
exactly what Jay is pointing out. You

00:11:01.159 --> 00:11:04.480
can go from like the baby version to the

00:11:02.720 --> 00:11:08.720
full-grown version if you go in the

00:11:04.480 --> 00:11:10.200
horizontal direction. Okay? But maybe if

00:11:08.720 --> 00:11:13.000
you go in the vertical direction, you're

00:11:10.200 --> 00:11:15.759
essentially moving up and down among the young

00:11:13.000 --> 00:11:16.720
animals of different species.

00:11:15.759 --> 00:11:18.399
Okay?

00:11:16.720 --> 00:11:20.560
So, here you are, you know,

00:11:18.399 --> 00:11:22.000
you're still across the same dimension

00:11:20.559 --> 00:11:24.039
of animals. You're just staying at, you

00:11:22.000 --> 00:11:25.639
know, the same age level, right?

00:11:24.039 --> 00:11:27.039
That is the band here.

00:11:25.639 --> 00:11:28.639
So, this is the grown-up version of a

00:11:27.039 --> 00:11:30.000
whole bunch of animals, the puppy

00:11:28.639 --> 00:11:31.600
version of a whole bunch of animals. So,

00:11:30.000 --> 00:11:34.720
the vertical dimension measures some

00:11:31.600 --> 00:11:36.839
sort of variation across animal species

00:11:34.720 --> 00:11:37.920
of the same roughly sort of maturity

00:11:36.839 --> 00:11:41.000
stage.

00:11:37.919 --> 00:11:43.399
Okay? So, these directions also matter.

00:11:41.000 --> 00:11:45.279
It's not just the distance.

00:11:43.399 --> 00:11:47.319
Okay. That's what I mean when I say

00:11:45.279 --> 00:11:48.759
semantic relationship and geometric

00:11:47.320 --> 00:11:51.200
relationship.

00:11:48.759 --> 00:11:53.159
Relationship is distance and direction,

00:11:51.200 --> 00:11:55.120
right? Both have to be involved.
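
NOTE
Editor's note: one way to see "distance plus direction" concretely is vector
arithmetic on toy 2-D embeddings. The numbers below are made up purely for
illustration; real embeddings are learned and higher-dimensional.
import numpy as np
puppy = np.array([1.0, 1.0])
dog = np.array([3.0, 1.0])        # growing up = moving in the horizontal direction
calf = np.array([1.0, 2.0])       # same maturity as puppy, different species (vertical)
cow_guess = calf + (dog - puppy)  # apply the puppy-to-dog direction to calf
print(cow_guess)                  # [3. 2.]: where we'd expect "cow" to land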

00:11:53.159 --> 00:11:57.879
So, so

00:11:55.120 --> 00:12:00.759
Now, word embeddings, as we will

00:11:57.879 --> 00:12:03.399
learn soon, are word vectors designed to

00:12:00.759 --> 00:12:04.720
achieve exactly these requirements.

00:12:03.399 --> 00:12:06.000
Okay? They will achieve these

00:12:04.720 --> 00:12:07.800
requirements.

00:12:06.000 --> 00:12:11.440
Uh, and they will fix both these

00:12:07.799 --> 00:12:11.439
problems very elegantly.

00:12:11.720 --> 00:12:14.399
Okay?

00:12:13.159 --> 00:12:15.279
So, let's say that we have word

00:12:14.399 --> 00:12:17.639
embeddings that solve both these

00:12:15.279 --> 00:12:19.720
problems. Are we basically done?

00:12:17.639 --> 00:12:22.399
Can we declare victory?

00:12:19.720 --> 00:12:24.639
Or is there anything that

00:12:22.399 --> 00:12:28.240
even words which actually capture the

00:12:24.639 --> 00:12:28.240
meaning of the underlying thing

00:12:28.279 --> 00:12:31.519
don't fully address? Is there any

00:12:30.159 --> 00:12:33.199
remaining problem we have to worry

00:12:31.519 --> 00:12:36.720
about? Yes?

00:12:33.200 --> 00:12:39.520
Context. Context? Yes.

00:12:36.720 --> 00:12:42.240
Context, right? What about a

00:12:39.519 --> 00:12:44.679
word's meaning? Sure, every word has a

00:12:42.240 --> 00:12:46.440
meaning, but we know that some words

00:12:44.679 --> 00:12:49.399
have multiple meanings.

00:12:46.440 --> 00:12:51.320
And that meaning is really sort of

00:12:49.399 --> 00:12:52.799
inferable, or you can make sense of

00:12:51.320 --> 00:12:55.680
it only if you know the surrounding

00:12:52.799 --> 00:12:59.039
context, right? If you

00:12:55.679 --> 00:13:00.239
see the word bank, b a n k, bank,

00:12:59.039 --> 00:13:02.120
sure, it could be a financial

00:13:00.240 --> 00:13:04.839
institution. It could be the side of a

00:13:02.120 --> 00:13:07.200
river. It could be the act of a plane

00:13:04.839 --> 00:13:09.160
turning in one direction.

00:13:07.200 --> 00:13:11.960
It could be someone hoping for

00:13:09.159 --> 00:13:13.679
something, banking on something. The

00:13:11.960 --> 00:13:16.360
list of possible meanings of the word

00:13:13.679 --> 00:13:18.359
bank is basically enormous.

00:13:16.360 --> 00:13:19.800
And you cannot figure out what it means

00:13:18.360 --> 00:13:22.120
unless you know what else is going on

00:13:19.799 --> 00:13:24.559
around that word. So, context is super

00:13:22.120 --> 00:13:26.159
super important. And these embeddings,

00:13:24.559 --> 00:13:28.199
word embeddings, just tell you what the

00:13:26.159 --> 00:13:29.838
meaning of the word is. And basically

00:13:28.200 --> 00:13:31.400
what's going to happen when you have a

00:13:29.839 --> 00:13:33.400
word which could mean many different

00:13:31.399 --> 00:13:36.319
things, it's going to give you some

00:13:33.399 --> 00:13:37.759
average version of that meaning.

00:13:36.320 --> 00:13:39.680
And that average version is not going to

00:13:37.759 --> 00:13:40.838
be very good.

00:13:39.679 --> 00:13:41.759
Now, there are some words which only

00:13:40.839 --> 00:13:42.800
mean one thing, and you'll be okay

00:13:41.759 --> 00:13:44.759
there.

00:13:42.799 --> 00:13:47.319
But for the rest of it, right? It's

00:13:44.759 --> 00:13:47.319
going to be tough.

00:13:47.480 --> 00:13:52.879
So, what do we need?

00:13:53.360 --> 00:13:56.680
We need to find a way to make word

00:13:54.480 --> 00:13:58.200
embeddings contextual.

00:13:56.679 --> 00:14:00.199
Meaning we need to somehow consider the

00:13:58.200 --> 00:14:02.879
other words in the sentence.

00:14:00.200 --> 00:14:05.040
Okay? So, if we can do that, then we

00:14:02.879 --> 00:14:08.279
will be in great shape.

00:14:05.039 --> 00:14:11.039
We can solve all sorts of NLP problems.

00:14:08.279 --> 00:14:13.639
Now, as it turns out, contextual word

00:14:11.039 --> 00:14:15.279
embeddings are word

00:14:13.639 --> 00:14:16.838
embeddings that achieve both these

00:14:15.279 --> 00:14:19.399
requirements.

00:14:16.839 --> 00:14:21.440
They capture the semantic geometric

00:14:19.399 --> 00:14:22.838
relationship thing I talked about, and

00:14:21.440 --> 00:14:23.880
they are contextual.

00:14:22.839 --> 00:14:27.079
Okay?

00:14:23.879 --> 00:14:29.078
They're really fantastic. Uh, and the

00:14:27.078 --> 00:14:32.838
key to calculating contextual word

00:14:29.078 --> 00:14:32.838
embeddings is the transformer.

00:14:33.200 --> 00:14:37.959
That is why transformers are justifiably

00:14:35.519 --> 00:14:37.958
famous.

00:14:39.320 --> 00:14:42.680
So, what's sort of the lay of the

00:14:40.440 --> 00:14:44.520
land here? So, today we are going to

00:14:42.679 --> 00:14:46.879
look at how to calculate

00:14:44.519 --> 00:14:48.159
stand-alone or uncontextual word

00:14:46.879 --> 00:14:50.600
embeddings.

00:14:48.159 --> 00:14:52.319
And then starting Monday, we will take

00:14:50.600 --> 00:14:53.759
these, you know, stand-alone

00:14:52.320 --> 00:14:56.079
embeddings and make them contextual

00:14:53.759 --> 00:14:57.159
using transformers. Okay? That is the

00:14:56.078 --> 00:14:58.679
plan.

00:14:57.159 --> 00:15:00.719
Any questions so far?

00:14:58.679 --> 00:15:02.519
So, now let's think about how we can

00:15:00.720 --> 00:15:05.800
learn these stand-alone embeddings from

00:15:02.519 --> 00:15:07.559
data, right? Now, the naive way to think

00:15:05.799 --> 00:15:08.879
about it would be: hey, why don't

00:15:07.559 --> 00:15:11.719
we manually collect a whole bunch of

00:15:08.879 --> 00:15:13.679
synonyms, antonyms, related words, etc.,

00:15:11.720 --> 00:15:15.440
and try to assign embedding vectors to

00:15:13.679 --> 00:15:18.399
them that satisfy

00:15:15.440 --> 00:15:19.880
our requirements. Okay? Now, as you can

00:15:18.399 --> 00:15:21.639
imagine, this is going to be a long,

00:15:19.879 --> 00:15:22.759
painful, and never quite complete

00:15:21.639 --> 00:15:23.720
exercise.

00:15:22.759 --> 00:15:24.480
Okay?

00:15:23.720 --> 00:15:26.600
Uh,

00:15:24.480 --> 00:15:29.200
And given that we are

00:15:26.600 --> 00:15:30.759
machine learning people,

00:15:29.200 --> 00:15:32.240
the question is, can we do it in a better

00:15:30.759 --> 00:15:34.039
way? Can we just learn it from the data

00:15:32.240 --> 00:15:36.600
without doing any of this manual stuff?

00:15:34.039 --> 00:15:39.639
Okay? And

00:15:36.600 --> 00:15:42.320
the key insight that makes it all happen

00:15:39.639 --> 00:15:44.360
is this humble-looking line on the

00:15:42.320 --> 00:15:45.839
screen by John Firth, who was a

00:15:44.360 --> 00:15:47.720
linguist.

00:15:45.839 --> 00:15:49.600
You shall know a word

00:15:47.720 --> 00:15:52.879
by the company it keeps. I wish I could

00:15:49.600 --> 00:15:52.879
deliver this in a British accent.

00:15:53.078 --> 00:15:57.879
Know a word by the company it keeps.

00:15:55.120 --> 00:15:59.919
Okay? It's a very profound statement.

00:15:57.879 --> 00:16:02.399
Okay? And here is the sort of the key

00:15:59.919 --> 00:16:03.958
intuition behind this.

00:16:02.399 --> 00:16:05.480
It says,

00:16:03.958 --> 00:16:08.559
let's say that you have a sentence like

00:16:05.480 --> 00:16:09.560
the acting in the ___ was superb.

00:16:08.559 --> 00:16:11.199
Okay?

00:16:09.559 --> 00:16:14.519
What are some words that you folks think

00:16:11.200 --> 00:16:14.520
are likely to appear in the sentence?

00:16:15.039 --> 00:16:19.519
Shout it out. Play. Play.

00:16:18.120 --> 00:16:20.560
Movie.

00:16:19.519 --> 00:16:24.159
Show.

00:16:20.559 --> 00:16:25.239
Musical. Right? Those are all some great

00:16:24.159 --> 00:16:26.838
candidates, right? The acting in the

00:16:25.240 --> 00:16:28.799
movie, the film, musical, and so on and

00:16:26.839 --> 00:16:29.800
so forth. Okay? Now, let's say that I

00:16:28.799 --> 00:16:31.679
ask you, what are some words that are

00:16:29.799 --> 00:16:32.958
unlikely to appear in the sentence? And

00:16:31.679 --> 00:16:35.879
I think we could all be here for like

00:16:32.958 --> 00:16:38.519
days, you know, listing them out. Uh, I

00:16:35.879 --> 00:16:39.919
just listed these out. Um, I love the

00:16:38.519 --> 00:16:41.759
word tensor, so I have to find a way to

00:16:39.919 --> 00:16:43.078
use it somewhere.

00:16:41.759 --> 00:16:45.200
So, all right. So, the acting in the

00:16:43.078 --> 00:16:48.239
banana was superb. Clearly nonsensical,

00:16:45.200 --> 00:16:51.200
right? So, what

00:16:48.240 --> 00:16:53.879
we are seeing here is that if certain

00:16:51.200 --> 00:16:55.360
words are sort of interchangeable in a

00:16:53.879 --> 00:16:57.000
sentence,

00:16:55.360 --> 00:16:59.959
meaning you change them and

00:16:57.000 --> 00:17:02.240
the sentence still makes sense, right?

00:16:59.958 --> 00:17:04.240
If they appear in the same context very

00:17:02.240 --> 00:17:07.559
often, i.e., if they're interchangeable,

00:17:04.240 --> 00:17:07.559
they are probably related.

00:17:07.799 --> 00:17:10.559
Sort of like we don't even have to know

00:17:09.119 --> 00:17:12.599
what the word is.

00:17:10.559 --> 00:17:14.240
All we have to know is that this word

00:17:12.599 --> 00:17:15.519
and this word, you can drop them into a

00:17:14.240 --> 00:17:17.240
particular sentence, you can fill in the

00:17:15.519 --> 00:17:18.799
blank of that sentence with that word,

00:17:17.240 --> 00:17:20.120
and it actually makes sense, then we're

00:17:18.799 --> 00:17:21.399
like, oh, wow, okay, these words are

00:17:20.119 --> 00:17:23.359
related then.

00:17:21.400 --> 00:17:25.519
Right? You're sort of inferring their

00:17:23.359 --> 00:17:29.559
relatedness not by looking at them

00:17:25.519 --> 00:17:29.559
directly, but by seeing where they live.

00:17:30.000 --> 00:17:36.119
Right? It's a very very clever idea. And

00:17:32.319 --> 00:17:37.480
it'll slowly sink into you. Okay? Um, so

00:17:36.119 --> 00:17:39.079
that's the first observation. If they

00:17:37.480 --> 00:17:41.160
appear in the same context very often,

00:17:39.079 --> 00:17:44.240
they are likely to be related.

00:17:41.160 --> 00:17:47.440
More generally, related words appear in

00:17:44.240 --> 00:17:47.440
related contexts.

00:17:47.880 --> 00:17:52.480
So, all we have to do

00:17:49.559 --> 00:17:54.240
is to figure out a way to calculate

00:17:52.480 --> 00:17:57.039
context.

00:17:54.240 --> 00:17:58.599
And then use that to understand, you

00:17:57.039 --> 00:18:00.519
know, what the words are that happen to

00:17:58.599 --> 00:18:02.119
be living in this context.

00:18:00.519 --> 00:18:03.639
And there are some beautiful ways to do

00:18:02.119 --> 00:18:05.239
these things, and we'll

00:18:03.640 --> 00:18:06.120
really dive deep into one such way to do

00:18:05.240 --> 00:18:08.759
it.

00:18:06.119 --> 00:18:10.639
So, what we're going to do in

00:18:08.759 --> 00:18:11.759
this approach

00:18:10.640 --> 00:18:12.920
is that

00:18:11.759 --> 00:18:14.879
since

00:18:12.920 --> 00:18:16.880
words that appear in

00:18:14.880 --> 00:18:18.200
related contexts mean

00:18:16.880 --> 00:18:19.200
similar things,

00:18:18.200 --> 00:18:21.480
first of all, you have to define what do

00:18:19.200 --> 00:18:22.440
you mean by context?

00:18:21.480 --> 00:18:23.360
And there are many ways to define

00:18:22.440 --> 00:18:24.759
context. We're going to go with a very

00:18:23.359 --> 00:18:26.959
simple definition,

00:18:24.759 --> 00:18:29.079
which is that if words happen to appear

00:18:26.960 --> 00:18:31.159
in the same sentence a lot,

00:18:29.079 --> 00:18:32.480
then we think that, okay,

00:18:31.159 --> 00:18:34.440
they are in the same context. So,

00:18:32.480 --> 00:18:35.120
context here means sentence.

00:18:34.440 --> 00:18:38.200
Okay?

00:18:35.119 --> 00:18:40.399
So, what we can do is we can actually

00:18:38.200 --> 00:18:41.919
take a whole bunch of text, maybe all of

00:18:40.400 --> 00:18:43.519
Wikipedia,

00:18:41.919 --> 00:18:46.040
and then break it up into sentences.

00:18:43.519 --> 00:18:47.279
We'll have billions of sentences, right?

00:18:46.039 --> 00:18:48.879
And then for all these billion

00:18:47.279 --> 00:18:51.639
sentences, we can literally go and count

00:18:48.880 --> 00:18:52.880
for every pair of words, how many times

00:18:51.640 --> 00:18:55.280
are both these words showing up in the

00:18:52.880 --> 00:18:57.880
same sentence?

00:18:55.279 --> 00:18:59.359
Okay? And we call this co-occurrence,

00:18:57.880 --> 00:19:00.640
right? The words are co-occurring in the

00:18:59.359 --> 00:19:02.000
sentence.

00:19:00.640 --> 00:19:02.880
And it doesn't have to be next to each

00:19:02.000 --> 00:19:04.759
other,

00:19:02.880 --> 00:19:07.280
right? We know that in complicated

00:19:04.759 --> 00:19:09.079
sentences, a word at the very end of the

00:19:07.279 --> 00:19:10.799
sentence could have

00:19:09.079 --> 00:19:11.759
its meaning altered by

00:19:10.799 --> 00:19:12.678
a word that happened in the very

00:19:11.759 --> 00:19:14.240
beginning of the sentence, and it could

00:19:12.679 --> 00:19:16.240
be a really long sentence.

00:19:14.240 --> 00:19:18.079
So, we take the whole sentence and say,

00:19:16.240 --> 00:19:19.599
are two words co-occurring in the

00:19:18.079 --> 00:19:20.720
sentence, yes or no? And we just count

00:19:19.599 --> 00:19:23.799
them up.

00:19:20.720 --> 00:19:23.799
And when we do that,

00:19:24.119 --> 00:19:27.678
right? When we do that, we will get

00:19:26.279 --> 00:19:29.519
something like this.

00:19:27.679 --> 00:19:30.880
So, I'm just

00:19:29.519 --> 00:19:32.359
This just captures what I've been

00:19:30.880 --> 00:19:34.280
talking about. Identify all the words

00:19:32.359 --> 00:19:35.799
that occur, let's say, in Wikipedia. And

00:19:34.279 --> 00:19:37.039
then for every sentence, you look at

00:19:35.799 --> 00:19:38.759
every word pair and count the number of

00:19:37.039 --> 00:19:41.480
times they appear in the same sentence

00:19:38.759 --> 00:19:43.839
across all those sentences. Okay?

00:19:41.480 --> 00:19:46.440
This is a word-word co-occurrence

00:19:43.839 --> 00:19:47.519
matrix. So, for example,

00:19:46.440 --> 00:19:48.679
let's assume that you took all of

00:19:47.519 --> 00:19:49.918
Wikipedia, looked at all the words,

00:19:48.679 --> 00:19:51.960
distinct words, and you found there are

00:19:49.919 --> 00:19:54.360
500,000 words.

00:19:51.960 --> 00:19:56.880
Okay? So, there are 500,000 words

00:19:54.359 --> 00:20:00.240
here in the columns

00:19:56.880 --> 00:20:02.640
500,000 words on the rows.

00:20:00.240 --> 00:20:05.599
The columns and rows. And then you go

00:20:02.640 --> 00:20:08.000
and each cell of this table is basically

00:20:05.599 --> 00:20:10.519
has a number that you calculate which is

00:20:08.000 --> 00:20:12.039
the number of times the word in the row

00:20:10.519 --> 00:20:14.319
and the word in the column happen to

00:20:12.039 --> 00:20:15.680
show up in the same sentence. That's it.

00:20:14.319 --> 00:20:18.119
So, for instance

00:20:15.680 --> 00:20:20.360
if you look at deep and learning, right?

00:20:18.119 --> 00:20:22.519
The word deep and the word learning

00:20:20.359 --> 00:20:24.719
maybe that

00:20:22.519 --> 00:20:28.319
those two words occurred in the same

00:20:24.720 --> 00:20:31.400
sentence maybe 3,025 times.

00:20:28.319 --> 00:20:35.200
3,025 sentences across all of Wikipedia.

00:20:31.400 --> 00:20:35.200
You put 3,025 right in that cell.

00:20:35.240 --> 00:20:37.680
Okay?

00:20:36.000 --> 00:20:38.880
Many words are unlikely to appear in the

00:20:37.680 --> 00:20:40.360
same sentence.

00:20:38.880 --> 00:20:42.720
So, much of this matrix is going to be

00:20:40.359 --> 00:20:42.719
zero.

00:20:44.319 --> 00:20:47.119
But, we

00:20:45.359 --> 00:20:49.639
fundamentally form this co-occurrence

00:20:47.119 --> 00:20:49.639
matrix.
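
NOTE
Editor's note: a minimal sketch of building the sentence-level co-occurrence
counts just described, on a tiny made-up corpus (in practice this would run over
millions of sentences, e.g. all of Wikipedia).
from collections import Counter
from itertools import combinations
sentences = [
    "the acting in the movie was superb",
    "the acting in the film was superb",
    "deep learning needs data",
]
vocab = sorted({w for s in sentences for w in s.split()})
index = {w: i for i, w in enumerate(vocab)}
counts = Counter()
for s in sentences:
    words = set(s.split())                       # co-occurrence = same sentence, any position
    for w1, w2 in combinations(sorted(words), 2):
        counts[(index[w1], index[w2])] += 1      # symmetric matrix; store one triangle
print(counts[(index["acting"], index["movie"])])  # 1
print(counts[(index["acting"], index["film"])])   # 1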

00:20:49.960 --> 00:20:55.640
This matrix essentially embodies all the

00:20:54.119 --> 00:20:58.359
context information that we can work

00:20:55.640 --> 00:20:59.840
with in a very compact, beautiful,

00:20:58.359 --> 00:21:02.240
sort of

00:20:59.839 --> 00:21:02.240
elegant way.

00:21:03.279 --> 00:21:06.039
And using this, we're going to try to

00:21:04.640 --> 00:21:07.400
figure out

00:21:06.039 --> 00:21:08.440
what the word embeddings actually are

00:21:07.400 --> 00:21:09.519
going to be.

00:21:08.440 --> 00:21:11.720
Okay?

00:21:09.519 --> 00:21:13.480
And so

00:21:11.720 --> 00:21:15.440
So, by the way, the approach I'm

00:21:13.480 --> 00:21:19.240
describing here to calculate standalone

00:21:15.440 --> 00:21:19.240
embeddings is called GloVe.

00:21:20.200 --> 00:21:24.799
It's called GloVe, and when

00:21:23.039 --> 00:21:27.519
standalone embeddings first sort of came

00:21:24.799 --> 00:21:29.720
onto the NLP deep learning scene,

00:21:27.519 --> 00:21:32.519
there were two sort of ways of doing it.

00:21:29.720 --> 00:21:34.400
One was called word2vec.

00:21:32.519 --> 00:21:35.879
The other one is GloVe.

00:21:34.400 --> 00:21:36.960
And they're both comparable, right? They

00:21:35.880 --> 00:21:38.520
use slightly different mechanisms of

00:21:36.960 --> 00:21:40.559
doing this.

00:21:38.519 --> 00:21:42.279
We went with GloVe for this lecture

00:21:40.559 --> 00:21:44.359
because I think it's actually a little

00:21:42.279 --> 00:21:45.759
easier to understand and equally

00:21:44.359 --> 00:21:47.199
effective.

00:21:45.759 --> 00:21:49.480
Okay?

00:21:47.200 --> 00:21:50.880
So, this is what we have. And so, what

00:21:49.480 --> 00:21:52.880
we want to do is

00:21:50.880 --> 00:21:54.120
we want to learn these embedding vectors

00:21:52.880 --> 00:21:56.200
that can be used to essentially

00:21:54.119 --> 00:21:59.319
approximate this matrix.

00:21:56.200 --> 00:22:01.720
Right? If you can find vectors that can

00:21:59.319 --> 00:22:03.279
actually approximate this matrix, then

00:22:01.720 --> 00:22:04.519
hopefully those vectors do in fact

00:22:03.279 --> 00:22:06.519
capture some notion of what the words

00:22:04.519 --> 00:22:07.440
actually mean. Okay? So, let me put it

00:22:06.519 --> 00:22:10.119
differently.

00:22:07.440 --> 00:22:12.759
You come to me with this matrix. Okay?

00:22:10.119 --> 00:22:14.359
And you say uh okay, Rama, do you have

00:22:12.759 --> 00:22:15.679
embeddings for me?

00:22:14.359 --> 00:22:17.319
And I'm like, yeah, I reach into my bag

00:22:15.679 --> 00:22:19.160
and I'm like, okay, every one of those

00:22:17.319 --> 00:22:20.119
500,000 words, I have an embedding.

00:22:19.160 --> 00:22:21.440
Right?

00:22:20.119 --> 00:22:23.039
Let's ignore for a moment how I actually

00:22:21.440 --> 00:22:24.000
calculated embeddings. I have the

00:22:23.039 --> 00:22:25.839
embeddings.

00:22:24.000 --> 00:22:28.400
How will you know if my embeddings are

00:22:25.839 --> 00:22:28.399
any good?

00:22:28.720 --> 00:22:31.559
How will you know?

00:22:30.279 --> 00:22:34.440
How can you actually assess if those

00:22:31.559 --> 00:22:34.440
embeddings are any good?

00:22:34.559 --> 00:22:37.440
Well, you can certainly say, okay, give

00:22:35.799 --> 00:22:39.240
me the embeddings for movie and film and

00:22:37.440 --> 00:22:40.440
you can see if they're really close by.

00:22:39.240 --> 00:22:42.160
You can look at the

00:22:40.440 --> 00:22:43.920
embedding for movie and tensor and

00:22:42.160 --> 00:22:46.600
hopefully they're far away.

00:22:43.920 --> 00:22:47.360
But, you'll never get done.

00:22:46.599 --> 00:22:49.199
Right?

00:22:47.359 --> 00:22:51.159
How can you systematically evaluate

00:22:49.200 --> 00:22:53.720
this?

00:22:51.160 --> 00:22:55.840
Well, what if

00:22:53.720 --> 00:22:57.400
I come to you and say, not only

00:22:55.839 --> 00:22:59.079
am I going to give you an embedding,

00:22:57.400 --> 00:23:00.480
here is a procedure

00:22:59.079 --> 00:23:02.279
which you can use with these embeddings

00:23:00.480 --> 00:23:04.400
to validate how good they are and here

00:23:02.279 --> 00:23:07.160
is the procedure. What you can do is you

00:23:04.400 --> 00:23:09.960
can use the embedding to recreate the

00:23:07.160 --> 00:23:11.600
co-occurrence matrix.

00:23:09.960 --> 00:23:14.400
And if the recreated co-occurrence

00:23:11.599 --> 00:23:15.319
matrix actually matches the real matrix

00:23:14.400 --> 00:23:17.519
well, these embeddings probably are

00:23:15.319 --> 00:23:18.559
pretty good.

00:23:17.519 --> 00:23:20.079
Remember, the whole point of the

00:23:18.559 --> 00:23:21.720
co-occurrence is to handle this context

00:23:20.079 --> 00:23:23.960
information. So, if my embeddings can

00:23:21.720 --> 00:23:25.640
actually recreate them, reconstruct them

00:23:23.960 --> 00:23:27.400
pretty close, right? It'll never be

00:23:25.640 --> 00:23:28.200
perfect. But, it comes pretty close,

00:23:27.400 --> 00:23:29.759
then we're like, wow, okay, these

00:23:28.200 --> 00:23:31.400
embeddings do mean something.

00:23:29.759 --> 00:23:33.839
So, if it turns out for instance that

00:23:31.400 --> 00:23:36.600
the matrix has, you know, a

00:23:33.839 --> 00:23:40.159
value of 3,000 for deep and learning

00:23:36.599 --> 00:23:40.959
and values of uh

00:23:40.160 --> 00:23:43.519
say

00:23:40.960 --> 00:23:45.200
50 for extreme learning

00:23:43.519 --> 00:23:48.480
and our embedding comes in and says

00:23:45.200 --> 00:23:49.360
3,002 for the first one and 48 for the

00:23:48.480 --> 00:23:51.440
second one, we'll be

00:23:49.359 --> 00:23:53.279
pretty impressed.

00:23:51.440 --> 00:23:54.320
Whoa, it couldn't have come that close.

00:23:53.279 --> 00:23:55.480
Unless it was actually capturing

00:23:54.319 --> 00:23:57.519
something.

00:23:55.480 --> 00:23:59.000
Okay? So, that's what we're going to do.

00:23:57.519 --> 00:24:00.240
And so, we're going to take this logic

00:23:59.000 --> 00:24:03.200
of saying

00:24:00.240 --> 00:24:05.960
find embeddings that can approximate

00:24:03.200 --> 00:24:07.880
what we actually see in Wikipedia.

00:24:05.960 --> 00:24:09.240
Right? And we're going to use that idea

00:24:07.880 --> 00:24:10.440
to actually build the model and learn

00:24:09.240 --> 00:24:12.559
the embeddings

00:24:10.440 --> 00:24:14.759
using nothing more than basically linear

00:24:12.559 --> 00:24:14.759
regression.

00:24:16.480 --> 00:24:18.839
And here you are thinking that linear

00:24:17.759 --> 00:24:22.160
regression is useless now that you've

00:24:18.839 --> 00:24:22.159
graduated machine learning, right?

00:24:22.319 --> 00:24:24.759
So

00:24:23.240 --> 00:24:26.599
So, we can think of the embedding

00:24:24.759 --> 00:24:28.879
vectors that we want to figure out as

00:24:26.599 --> 00:24:31.319
just the weights in a model.

00:24:28.880 --> 00:24:33.120
In a linear regression.

00:24:31.319 --> 00:24:35.200
We can think of the co-occurrence matrix

00:24:33.119 --> 00:24:37.759
as just the data we're going to use in

00:24:35.200 --> 00:24:39.799
this model to estimate these weights.

00:24:37.759 --> 00:24:42.200
And the model we're going to use

00:24:39.799 --> 00:24:43.799
is something like this.

00:24:42.200 --> 00:24:45.080
So, first I have to inflict some

00:24:43.799 --> 00:24:46.559
notation on you.

00:24:45.079 --> 00:24:50.000
We would denote the co-occurrence matrix

00:24:46.559 --> 00:24:51.759
of say words I and J as Xij.

00:24:50.000 --> 00:24:53.079
Xij is just data.

00:24:51.759 --> 00:24:55.079
It's just data. Okay? It's not a

00:24:53.079 --> 00:24:55.639
variable, it's data.

00:24:55.079 --> 00:24:57.399
Uh

00:24:55.640 --> 00:24:59.160
and then we will denote an embedding

00:24:57.400 --> 00:25:01.080
vector for each word. Remember, we need

00:24:59.160 --> 00:25:03.840
to have a vector for each word. So, we

00:25:01.079 --> 00:25:06.199
call it Wi, right? Wi is the embedding

00:25:03.839 --> 00:25:09.119
vector for each word.

00:25:06.200 --> 00:25:10.559
And we will also assume that

00:25:09.119 --> 00:25:11.639
some words are just inherently very

00:25:10.559 --> 00:25:13.440
popular. They're going to show up all

00:25:11.640 --> 00:25:15.920
the time like the word the.

00:25:13.440 --> 00:25:18.320
Okay? So, we'll assume that every word

00:25:15.920 --> 00:25:20.160
has some natural frequency of occurring

00:25:18.319 --> 00:25:22.919
like movie versus flick.

00:25:20.160 --> 00:25:24.480
The versus tensor. So, we want the

00:25:22.920 --> 00:25:27.279
vectors to capture the co-occurrence

00:25:24.480 --> 00:25:28.880
patterns independent of how naturally

00:25:27.279 --> 00:25:29.639
frequent the words are.

00:25:28.880 --> 00:25:30.920
Okay?

00:25:29.640 --> 00:25:33.600
And so, to capture this natural

00:25:30.920 --> 00:25:34.600
frequency, we will assign a bias or Bi

00:25:33.599 --> 00:25:36.359
to each word that we're going to

00:25:34.599 --> 00:25:39.319
calculate. And all this will become

00:25:36.359 --> 00:25:41.000
clear in just a moment. Okay? So

00:25:39.319 --> 00:25:42.480
with this setup, basically what we're

00:25:41.000 --> 00:25:44.679
saying is something very simple. We're

00:25:42.480 --> 00:25:45.960
saying, look, this co-occurrence matrix

00:25:44.679 --> 00:25:48.000
that we have

00:25:45.960 --> 00:25:51.240
that we're able to compute, it came

00:25:48.000 --> 00:25:53.400
about because in in truth, in reality,

00:25:51.240 --> 00:25:55.559
in nature, there are these embedding

00:25:53.400 --> 00:25:58.120
vectors for every word.

00:25:55.559 --> 00:26:00.240
There are these biases Bi for every word

00:25:58.119 --> 00:26:03.000
and every co-occurrence number that you

00:26:00.240 --> 00:26:05.079
see just came about because, you know,

00:26:03.000 --> 00:26:07.839
under the hood, mother nature grabbed

00:26:05.079 --> 00:26:09.720
the bias number for the word I, the bias

00:26:07.839 --> 00:26:11.639
number for the word J took the two

00:26:09.720 --> 00:26:13.799
embedding vectors, which only mother

00:26:11.640 --> 00:26:15.200
nature knows at this point did the dot

00:26:13.799 --> 00:26:16.919
product of them, add them, and that's

00:26:15.200 --> 00:26:19.080
how we get this number.

00:26:16.920 --> 00:26:21.560
So, it basically says the number you see

00:26:19.079 --> 00:26:23.039
is the sum of the inherent popularity of

00:26:21.559 --> 00:26:25.159
the first word plus the inherent

00:26:23.039 --> 00:26:26.799
popularity of the second word plus the

00:26:25.160 --> 00:26:29.000
way in which these two words connect to

00:26:26.799 --> 00:26:29.960
each other.

00:26:29.000 --> 00:26:30.839
That's it.

00:26:29.960 --> 00:26:32.440
And

00:26:30.839 --> 00:26:33.599
you will agree with me

00:26:32.440 --> 00:26:34.799
that it literally can't get simpler than

00:26:33.599 --> 00:26:36.759
this.

00:26:34.799 --> 00:26:38.200
If I tell you, hey, here are two things.

00:26:36.759 --> 00:26:39.799
I want you to tell me how connected they

00:26:38.200 --> 00:26:42.360
are, you'll be like, well, let's take

00:26:39.799 --> 00:26:44.200
the first one, figure out how inherently

00:26:42.359 --> 00:26:45.039
popular it is, inherent popularity, and

00:26:44.200 --> 00:26:46.319
then of course you got to worry about

00:26:45.039 --> 00:26:47.678
the connection. So, we do a dot

00:26:46.319 --> 00:26:49.720
product.

00:26:47.679 --> 00:26:50.440
That's it. Those three things.

00:26:49.720 --> 00:26:52.360
Right?

00:26:50.440 --> 00:26:53.840
So, this is what we have. Now, you may

00:26:52.359 --> 00:26:54.599
have seen

00:26:53.839 --> 00:26:56.839
uh

00:26:54.599 --> 00:27:00.079
from your, you know, good old linear

00:26:56.839 --> 00:27:02.039
regression that whenever uh your

00:27:00.079 --> 00:27:05.119
dependent variable happens to be

00:27:02.039 --> 00:27:08.279
positive, guaranteed to be positive

00:27:05.119 --> 00:27:10.519
and it ends up having a big range

00:27:08.279 --> 00:27:12.599
we always advise you folks

00:27:10.519 --> 00:27:14.839
to take the logarithmic transformation

00:27:12.599 --> 00:27:16.480
to squash it into a narrow range because

00:27:14.839 --> 00:27:18.319
that will make these models much more

00:27:16.480 --> 00:27:20.319
well-behaved.

00:27:18.319 --> 00:27:22.240
Regression misbehaves if the Y value has a huge

00:27:20.319 --> 00:27:23.159
range. Like the canonical example is

00:27:22.240 --> 00:27:24.960
that, you know, if you are trying to

00:27:23.160 --> 00:27:27.560
model, you know, the net worth of

00:27:24.960 --> 00:27:29.120
people, right? It's going to have a long

00:27:27.559 --> 00:27:30.879
right tail with people like Elon and

00:27:29.119 --> 00:27:33.279
Jeff and so on on the right side, right?

00:27:30.880 --> 00:27:34.880
And the rest of us on the left. And

00:27:33.279 --> 00:27:35.920
so, to model this big long tail

00:27:34.880 --> 00:27:37.360
distribution, you just take the

00:27:35.920 --> 00:27:39.120
logarithm, just squash everything to a

00:27:37.359 --> 00:27:41.479
very narrow range. And that will make

00:27:39.119 --> 00:27:42.559
regression much better behaved. Okay?

00:27:41.480 --> 00:27:45.400
Here

00:27:42.559 --> 00:27:47.000
most of the counts are going to be zero.

00:27:45.400 --> 00:27:48.440
But, some of the counts could be very

00:27:47.000 --> 00:27:49.160
high.

00:27:48.440 --> 00:27:51.000
Right?

00:27:49.160 --> 00:27:52.960
And therefore, if you take

00:27:51.000 --> 00:27:54.839
the logarithm, it makes it much better

00:27:52.960 --> 00:27:56.440
behaved, so we take the logarithm here.

00:27:54.839 --> 00:27:57.439
So, this is actually our model. That's

00:27:56.440 --> 00:27:58.720
it.

00:27:57.440 --> 00:28:00.759
And I know that many of the numbers are

00:27:58.720 --> 00:28:02.600
zero and log of zero is not defined. So,

00:28:00.759 --> 00:28:03.960
we can just add one to

00:28:02.599 --> 00:28:06.240
all the numbers

00:28:03.960 --> 00:28:08.360
to avoid that kind of, you know,

00:28:06.240 --> 00:28:09.559
technical arithmetic problems.

00:28:08.359 --> 00:28:10.319
But, this conceptually is what's going

00:28:09.559 --> 00:28:11.519
on. This is the model we want to

00:28:10.319 --> 00:28:14.079
calculate.
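
NOTE
Editor's note: written out in the lecture's notation, the model postulated above is
    log(1 + Xij) is approximately Bi + Bj + (Wi . Wj)
where Xij is the co-occurrence count for words i and j, Bi and Bj are their
inherent-popularity biases, Wi . Wj is the dot product of their embedding
vectors, and the "+ 1" inside the log is the fix for the zero counts.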

00:28:11.519 --> 00:28:16.759
So, given that we have essentially

00:28:14.079 --> 00:28:17.839
postulated this model

00:28:16.759 --> 00:28:19.519
and we have this data, this

00:28:17.839 --> 00:28:21.240
co-occurrence matrix, how can we

00:28:19.519 --> 00:28:24.279
actually find the weights? How can we

00:28:21.240 --> 00:28:25.679
actually find the Bs and the Ws? What

00:28:24.279 --> 00:28:26.960
would we What should we do?

00:28:25.679 --> 00:28:29.320
Go back to the fundamentals of

00:28:26.960 --> 00:28:30.519
regression. Think about it conceptually.

00:28:29.319 --> 00:28:31.879
You have some model which has some

00:28:30.519 --> 00:28:33.519
weights.

00:28:31.880 --> 00:28:35.320
There's some data you can use to train

00:28:33.519 --> 00:28:36.960
the model.

00:28:35.319 --> 00:28:38.240
Right? And you need to find the best set

00:28:36.960 --> 00:28:40.079
of weights. What does the best mean

00:28:38.240 --> 00:28:42.279
here?

00:28:40.079 --> 00:28:43.879
The lowest

00:28:42.279 --> 00:28:46.119
The lowest error. Exactly. There are

00:28:43.880 --> 00:28:47.280
many ways to measure error, right? What

00:28:46.119 --> 00:28:48.759
would be the simplest thing we

00:28:47.279 --> 00:28:50.240
could use? So, what you do is you would

00:28:48.759 --> 00:28:52.079
actually do mean squared error. Right?

00:28:50.240 --> 00:28:53.240
Which is what you're getting at.

00:28:52.079 --> 00:28:54.359
You could take the actual thing, you

00:28:53.240 --> 00:28:55.839
could take the predicted thing, take the

00:28:54.359 --> 00:28:57.119
difference, square it, and minimize the

00:28:55.839 --> 00:28:59.759
sum of it.

00:28:57.119 --> 00:29:00.839
Okay? If your model exactly nails every

00:28:59.759 --> 00:29:02.799
number in the co-occurrence matrix, the

00:29:00.839 --> 00:29:04.879
error is going to be zero.

00:29:02.799 --> 00:29:07.759
Okay? So

00:29:04.880 --> 00:29:09.240
what we do is we literally just do that.

00:29:07.759 --> 00:29:11.200
This is the data.

00:29:09.240 --> 00:29:13.319
This is the actual predicted value.

00:29:11.200 --> 00:29:14.880
Predicted value, actual value,

00:29:13.319 --> 00:29:17.439
difference squared, add them all up,

00:29:14.880 --> 00:29:17.440
minimize.

00:29:17.839 --> 00:29:21.039
Okay?
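
NOTE
Editor's note: a compact NumPy sketch of the recipe just described: minimize the
mean squared error between log(1 + Xij) and Bi + Bj + Wi.Wj by plain full-batch
gradient descent. The vocabulary size, embedding dimension, learning rate, and
random co-occurrence counts are all made up; this is a simplified illustration of
the idea, not the exact GloVe training procedure.
import numpy as np
rng = np.random.default_rng(0)
V, d = 8, 4                                    # toy vocabulary size and embedding dimension
X = rng.integers(0, 5, size=(V, V))
X = X + X.T                                    # symmetric toy co-occurrence counts
Y = np.log1p(X)                                # target: log(1 + Xij)
W = 0.1 * rng.standard_normal((V, d))          # one embedding vector Wi per word
B = np.zeros(V)                                # one bias Bi per word
lr = 0.1
def mse():
    pred = B[:, None] + B[None, :] + W @ W.T   # Bi + Bj + Wi.Wj for every pair
    return np.mean((pred - Y) ** 2)
print("before:", mse())
for step in range(2000):
    err = (B[:, None] + B[None, :] + W @ W.T) - Y             # predicted minus actual
    W -= lr * 2.0 * ((err + err.T) @ W) / V**2                 # gradient of the mean squared error
    B -= lr * 2.0 * (err.sum(axis=1) + err.sum(axis=0)) / V**2
print("after:", mse())                         # noticeably lower: the embeddings approximate the matrix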

00:29:19.200 --> 00:29:23.200
Uh yes.

00:29:21.039 --> 00:29:25.720
And in the loss function, how is this

00:29:23.200 --> 00:29:28.679
capturing the context? Because unless my

00:29:25.720 --> 00:29:31.120
input data is having that context

00:29:28.679 --> 00:29:33.120
how will this actually differentiate

00:29:31.119 --> 00:29:34.239
based on where the particular word is

00:29:33.119 --> 00:29:36.359
used?

00:29:34.240 --> 00:29:37.079
The word The way the word is

00:29:36.359 --> 00:29:38.559
the

00:29:37.079 --> 00:29:41.559
So, let's take two words like deep and

00:29:38.559 --> 00:29:42.918
learning. Now, let's take this word and

00:29:41.559 --> 00:29:44.839
change it according to the context.

00:29:42.919 --> 00:29:46.280
Okay.

00:29:44.839 --> 00:29:47.359
Sorry, go ahead. Yeah, so basically,

00:29:46.279 --> 00:29:49.759
let's say I'm talking about the word

00:29:47.359 --> 00:29:50.919
banana. So it's a fruit in some context

00:29:49.759 --> 00:29:53.119
and I could be saying he's going

00:29:50.920 --> 00:29:55.240
bananas. That's a

00:29:53.119 --> 00:29:57.039
whatever, right? So now these are two

00:29:55.240 --> 00:29:59.079
different contexts in my understanding

00:29:57.039 --> 00:30:01.000
and my same model needs to be able to

00:29:59.079 --> 00:30:02.720
tell me that banana is the right word in

00:30:01.000 --> 00:30:04.400
this context but wrong word in this

00:30:02.720 --> 00:30:06.600
context or

00:30:04.400 --> 00:30:08.440
correct in both contexts. Yeah, very

00:30:06.599 --> 00:30:10.359
good question. So let's actually spend a

00:30:08.440 --> 00:30:13.360
minute on that. Good question. I'm going

00:30:10.359 --> 00:30:15.439
to swap to my iPad.

00:30:13.359 --> 00:30:18.000
So let's let's assume that this is our

00:30:15.440 --> 00:30:20.160
co-occurrence matrix.

00:30:18.000 --> 00:30:23.160
Right? And then we have words going from

00:30:20.160 --> 00:30:24.600
A all the way to let's say zebra, right?

00:30:23.160 --> 00:30:25.800
This is the all the words in our

00:30:24.599 --> 00:30:29.439
vocabulary

00:30:25.799 --> 00:30:32.680
and we have A through zebra here.

00:30:29.440 --> 00:30:34.480
And now what we have is

00:30:32.680 --> 00:30:36.519
we have uh

00:30:34.480 --> 00:30:39.079
apple

00:30:36.519 --> 00:30:39.079
and banana.

00:30:39.559 --> 00:30:42.279
Right?

00:30:40.279 --> 00:30:44.079
So basically what's going on at this

00:30:42.279 --> 00:30:48.240
point is that

00:30:44.079 --> 00:30:50.559
here every number here measures

00:30:48.240 --> 00:30:51.960
for every word here, how many times that

00:30:50.559 --> 00:30:53.559
word and apple show up in the same

00:30:51.960 --> 00:30:56.400
sentence, okay?

00:30:53.559 --> 00:30:57.960
It is not measuring, to your point,

00:30:56.400 --> 00:30:59.880
how many times apple and banana are

00:30:57.960 --> 00:31:01.240
showing up together. It's measuring how

00:30:59.880 --> 00:31:03.680
many times apple is showing up in each

00:31:01.240 --> 00:31:06.480
sentence, right? Now, if apple and

00:31:03.680 --> 00:31:09.799
banana are sort of interchangeable,

00:31:06.480 --> 00:31:11.880
what do we expect these numbers these

00:31:09.799 --> 00:31:13.319
two rows of numbers to look like? Let's

00:31:11.880 --> 00:31:14.560
assume that apple and banana are perfect

00:31:13.319 --> 00:31:15.799
synonyms.

00:31:14.559 --> 00:31:17.240
Just for argument, okay? Let's say it's

00:31:15.799 --> 00:31:19.839
perfect synonyms.

00:31:17.240 --> 00:31:21.359
What do we expect these two

00:31:19.839 --> 00:31:23.839
numbers

00:31:21.359 --> 00:31:25.599
to look like?

00:31:23.839 --> 00:31:27.720
Very similar.

00:31:25.599 --> 00:31:30.240
So if two words are related, their

00:31:27.720 --> 00:31:31.120
entry row vectors in the

00:31:30.240 --> 00:31:32.599
co-occurrence matrix are going to be

00:31:31.119 --> 00:31:34.479
very very similar.

00:31:32.599 --> 00:31:36.079
So that is how the context comes into

00:31:34.480 --> 00:31:37.960
the co-occurrence matrix.

00:31:36.079 --> 00:31:40.559
So what we want to do is we want to find

00:31:37.960 --> 00:31:42.840
if embeddings can recreate the same

00:31:40.559 --> 00:31:45.000
pattern of numbers in these two

00:31:42.839 --> 00:31:47.919
rows, it's actually

00:31:45.000 --> 00:31:49.880
capturing the underlying context.

00:31:47.920 --> 00:31:51.560
So words which are similar will sort of

00:31:49.880 --> 00:31:53.280
zig and zag together the same way

00:31:51.559 --> 00:31:56.039
through the co-occurrence matrix.

00:31:53.279 --> 00:31:56.039
And that's where it comes in.
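
To make that concrete, here is a rough Python sketch of a co-occurrence matrix built from a tiny made-up corpus; the sentences and words are purely illustrative. If two words appear in the same kinds of sentences, their rows end up looking very similar.

    import numpy as np

    # Tiny hypothetical corpus; in practice this would be a huge text collection.
    sentences = [
        "i ate an apple and a pear",
        "i ate a banana and a pear",
        "we train a deep learning model with a tensor",
    ]
    vocab = sorted({w for s in sentences for w in s.split()})
    index = {w: i for i, w in enumerate(vocab)}

    # X[i, j] = number of sentences in which word i and word j both appear.
    X = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        words = set(s.split())
        for a in words:
            for b in words:
                X[index[a], index[b]] += 1

    # Interchangeable words should have very similar rows (ignore the diagonal).
    apple_row, banana_row = X[index["apple"]], X[index["banana"]]
    print(np.corrcoef(apple_row, banana_row)[0, 1])        # should be relatively high
    print(np.corrcoef(apple_row, X[index["tensor"]])[0, 1]) # should be much lower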

00:31:57.440 --> 00:32:00.440
Yeah.

00:31:58.440 --> 00:32:01.960
What's up with the diagonal of the

00:32:00.440 --> 00:32:05.240
co-occurrence matrix where you have

00:32:01.960 --> 00:32:07.200
apple showing up twice? Oh oh, I see. So

00:32:05.240 --> 00:32:08.799
yeah, here you can just ignore the

00:32:07.200 --> 00:32:10.480
diagonal typically

00:32:08.799 --> 00:32:13.519
uh because all the action is in the

00:32:10.480 --> 00:32:13.519
off-diagonal entries.

00:32:15.319 --> 00:32:20.319
So that's basically the idea:

00:32:18.720 --> 00:32:22.519
words which are very similar will

00:32:20.319 --> 00:32:24.039
have a very similar pattern of numbers

00:32:22.519 --> 00:32:25.720
and any

00:32:24.039 --> 00:32:27.759
embeddings that can actually recreate

00:32:25.720 --> 00:32:28.920
the same pattern of numbers are capturing

00:32:27.759 --> 00:32:29.720
the underlying reality of what's going

00:32:28.920 --> 00:32:32.240
on.

00:32:29.720 --> 00:32:34.799
If words are kind of unrelated, those

00:32:32.240 --> 00:32:38.000
two those two vectors, let's say that

00:32:34.799 --> 00:32:38.000
the word you have is uh

00:32:40.400 --> 00:32:45.640
Let's assume the word is uh of course

00:32:42.880 --> 00:32:48.080
you know what I'm going to say, tensor.

00:32:45.640 --> 00:32:49.440
Right? These two vectors

00:32:48.079 --> 00:32:50.799
won't really have any connection

00:32:49.440 --> 00:32:51.920
to each other.

00:32:50.799 --> 00:32:53.119
Which means if you look at something

00:32:51.920 --> 00:32:54.679
like the correlation of those two

00:32:53.119 --> 00:32:55.919
vectors, it's going to be around

00:32:54.679 --> 00:32:56.600
zero.

00:32:55.920 --> 00:32:57.960
Right?

00:32:56.599 --> 00:32:59.719
Words which are

00:32:57.960 --> 00:33:01.559
you know, interchangeable will have a

00:32:59.720 --> 00:33:03.720
very high correlation.

00:33:01.559 --> 00:33:05.519
Words which are antonyms and never show

00:33:03.720 --> 00:33:07.240
up in the same place together may have a

00:33:05.519 --> 00:33:09.079
highly negative correlation, close to

00:33:07.240 --> 00:33:10.640
minus one for instance. So that's sort

00:33:09.079 --> 00:33:11.919
of the intuition behind what's going on

00:33:10.640 --> 00:33:12.920
in these two row

00:33:11.920 --> 00:33:14.560
vectors.

00:33:12.920 --> 00:33:16.120
And so the point is: given that this

00:33:14.559 --> 00:33:19.879
co-occurrence matrix is capturing all

00:33:16.119 --> 00:33:22.039
this word-word correlational structure,

00:33:19.880 --> 00:33:25.200
any embedding that can recreate it must

00:33:22.039 --> 00:33:26.879
have captured the structure as well.

00:33:25.200 --> 00:33:28.759
Because you can't recreate something

00:33:26.880 --> 00:33:30.080
like this with great fidelity unless you

00:33:28.759 --> 00:33:31.799
have some notion of what's going on

00:33:30.079 --> 00:33:33.599
under the hood.

00:33:31.799 --> 00:33:34.519
That's the basic idea.

00:33:33.599 --> 00:33:36.599
Yeah.

00:33:34.519 --> 00:33:39.160
So just connecting to Sophie's question.

00:33:36.599 --> 00:33:40.879
So in that example then

00:33:39.160 --> 00:33:42.800
banana is a fruit and apple is a fruit

00:33:40.880 --> 00:33:44.160
as well. Banana and apple are synonyms

00:33:42.799 --> 00:33:47.039
and you're going mad, you're going

00:33:44.160 --> 00:33:48.040
bananas. How that comes together is that

00:33:47.039 --> 00:33:50.399
Oh, I see. You're going mad, you're

00:33:48.039 --> 00:33:52.319
going bananas, yeah. So uh so those will

00:33:50.400 --> 00:33:53.720
also have some correlational structure

00:33:52.319 --> 00:33:57.000
to it which the embeddings will

00:33:53.720 --> 00:33:59.440
hopefully catch. But a word like banana

00:33:57.000 --> 00:34:01.160
is trickier.

00:33:59.440 --> 00:34:03.400
The thing is, it's called polysemy, where

00:34:01.160 --> 00:34:04.880
the word looks the same way but means

00:34:03.400 --> 00:34:06.080
different things. It's like the word bank,

00:34:04.880 --> 00:34:07.520
right? It can mean very different things

00:34:06.079 --> 00:34:09.319
in very different contexts. So the

00:34:07.519 --> 00:34:11.800
embedding is going to be some average

00:34:09.320 --> 00:34:13.280
representation of it, right? But we are

00:34:11.800 --> 00:34:15.000
not happy with that average and we'll

00:34:13.280 --> 00:34:18.280
get around that average

00:34:15.000 --> 00:34:19.159
next week when we do contextual stuff.

00:34:18.280 --> 00:34:20.320
All right.

00:34:19.159 --> 00:34:22.280
Um

00:34:20.320 --> 00:34:25.519
So that's what we have here. So to go

00:34:22.280 --> 00:34:25.519
back to this thing,

00:34:26.719 --> 00:34:31.839
so what we can do is yeah.

00:34:29.000 --> 00:34:34.398
I didn't understand how we get the

00:34:31.840 --> 00:34:35.120
mean squared error in this because we

00:34:34.398 --> 00:34:37.319
didn't

00:34:35.119 --> 00:34:39.480
do any reading from the data set we got.

00:34:37.320 --> 00:34:41.200
We haven't calculated the embeddings.

00:34:39.480 --> 00:34:42.559
We are trying to calculate them. Those

00:34:41.199 --> 00:34:45.079
are just it's sort of like, you know, in

00:34:42.559 --> 00:34:47.199
regression you have, you know, beta

00:34:45.079 --> 00:34:49.398
one times X1 plus beta two times X2 kind

00:34:47.199 --> 00:34:51.199
of thing. The betas are what the

00:34:49.398 --> 00:34:52.759
regression produces for us, right? The

00:34:51.199 --> 00:34:53.918
the embeddings are exactly that. They're

00:34:52.760 --> 00:34:55.240
just coefficients that we're trying to

00:34:53.918 --> 00:34:59.400
figure out.

00:34:55.239 --> 00:34:59.399
The data is only the X's, the Xij.

00:34:59.519 --> 00:35:01.920
And so this is what we're trying to

00:35:00.760 --> 00:35:03.960
calculate,

00:35:01.920 --> 00:35:06.200
right? And so what you can do is you can

00:35:03.960 --> 00:35:08.320
actually start with some random values

00:35:06.199 --> 00:35:09.839
for these things

00:35:08.320 --> 00:35:11.920
and then

00:35:09.840 --> 00:35:13.240
keep on trying to improve to minimize

00:35:11.920 --> 00:35:15.639
the error

00:35:13.239 --> 00:35:17.319
starting from these random values.

00:35:15.639 --> 00:35:19.119
Do you folks are you aware of any

00:35:17.320 --> 00:35:20.559
algorithm that which allows us to take

00:35:19.119 --> 00:35:23.839
random value starting point and then

00:35:20.559 --> 00:35:23.840
minimize some notion of error?

00:35:32.760 --> 00:35:35.600
Well, how do you know it's actually

00:35:33.679 --> 00:35:37.879
random? Oh.

00:35:35.599 --> 00:35:39.000
So that's actually a very deep question.

00:35:37.880 --> 00:35:39.920
Um

00:35:39.000 --> 00:35:41.400
and

00:35:39.920 --> 00:35:42.480
so

00:35:41.400 --> 00:35:44.160
it's actually a tough question, right?

00:35:42.480 --> 00:35:46.079
Because ultimately the random number is

00:35:44.159 --> 00:35:47.960
coming from a computer

00:35:46.079 --> 00:35:50.000
and we know how the computer runs. It's

00:35:47.960 --> 00:35:51.559
deterministic at the end of the day.

00:35:50.000 --> 00:35:53.280
So we actually use something called

00:35:51.559 --> 00:35:54.880
pseudo random numbers,

00:35:53.280 --> 00:35:56.840
right? Um and there's like a whole

00:35:54.880 --> 00:35:59.358
specialized field of math

00:35:56.840 --> 00:36:02.120
which essentially says, "Look, how can I

00:35:59.358 --> 00:36:03.719
get random numbers that are sufficiently

00:36:02.119 --> 00:36:05.358
random even though they come from a

00:36:03.719 --> 00:36:07.759
deterministic computer

00:36:05.358 --> 00:36:08.519
process?" So we can talk offline about

00:36:07.760 --> 00:36:10.480
it,

00:36:08.519 --> 00:36:11.960
um but fundamentally all these systems

00:36:10.480 --> 00:36:14.519
have some random number generators built

00:36:11.960 --> 00:36:17.400
in. We just cross our fingers and hope

00:36:14.519 --> 00:36:19.079
for the best and just use them.
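
As a tiny illustration of pseudo-randomness (not part of the lecture slides): the numbers look random, but the same seed reproduces exactly the same sequence.

    import numpy as np

    rng_a = np.random.default_rng(seed=42)   # a seeded pseudo-random generator
    rng_b = np.random.default_rng(seed=42)   # a second generator with the same seed
    print(rng_a.standard_normal(3))
    print(rng_b.standard_normal(3))          # identical output: deterministic under the hood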

00:36:17.400 --> 00:36:20.639
So come back to this,

00:36:19.079 --> 00:36:22.119
right? We can start with random values

00:36:20.639 --> 00:36:23.559
for these weights

00:36:22.119 --> 00:36:25.440
um and then we can try to minimize the

00:36:23.559 --> 00:36:26.639
squared error. Are are you folks aware

00:36:25.440 --> 00:36:28.358
of any algorithm that can help us do

00:36:26.639 --> 00:36:30.239
that?

00:36:28.358 --> 00:36:33.079
Yes.

00:36:30.239 --> 00:36:35.279
Gradient descent. Yes, gradient descent.

00:36:33.079 --> 00:36:36.400
Again, comes to the rescue. Uh and since

00:36:35.280 --> 00:36:38.680
we are cool, we'll do stochastic

00:36:36.400 --> 00:36:41.880
gradient descent.

00:36:38.679 --> 00:36:42.960
Okay? So that's it. So gradient descent

00:36:41.880 --> 00:36:44.240
actually doesn't care what the function

00:36:42.960 --> 00:36:45.559
is as long as you can calculate a

00:36:44.239 --> 00:36:47.319
derivative from it. As long as you

00:36:45.559 --> 00:36:48.719
calculate a gradient, you're good.

00:36:47.320 --> 00:36:50.960
Right? So we can just run gradient

00:36:48.719 --> 00:36:53.119
descent on this thing, right?

00:36:50.960 --> 00:36:54.240
Uh one key point here is that gradient

00:36:53.119 --> 00:36:55.960
descent, stochastic gradient descent

00:36:54.239 --> 00:36:58.519
work for any

00:36:55.960 --> 00:37:00.480
model, as long as you can calculate

00:36:58.519 --> 00:37:03.639
good gradients from it.

00:37:00.480 --> 00:37:03.639
It doesn't have to be a neural network.

00:37:03.760 --> 00:37:07.400
Any mathematical function as long as

00:37:05.880 --> 00:37:08.800
it's differentiable and gives you a good

00:37:07.400 --> 00:37:10.440
gradient.

00:37:08.800 --> 00:37:12.480
Okay? So here this is not a neural

00:37:10.440 --> 00:37:14.200
network per se, but we can still use

00:37:12.480 --> 00:37:17.159
gradient descent for it.

00:37:14.199 --> 00:37:17.159
So we do that.
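
To illustrate that point, here is a minimal sketch (illustrative, not the lecture's code) of gradient descent minimizing an arbitrary differentiable function, starting from a random value and using TensorFlow's automatic differentiation.

    import tensorflow as tf

    x = tf.Variable(tf.random.normal(shape=()))          # random starting value
    opt = tf.keras.optimizers.SGD(learning_rate=0.1)

    for _ in range(200):
        with tf.GradientTape() as tape:
            loss = (x - 3.0) ** 2                        # any differentiable function will do
        grads = tape.gradient(loss, [x])
        opt.apply_gradients(zip(grads, [x]))             # move x a little downhill

    print(float(x))                                      # ends up very close to 3.0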

00:37:17.960 --> 00:37:22.159
Um and when we are done, we would have

00:37:20.039 --> 00:37:23.880
calculated some nice embeddings. We

00:37:22.159 --> 00:37:25.559
would have all calculated or we can also

00:37:23.880 --> 00:37:26.559
calculate all these biases, but we don't

00:37:25.559 --> 00:37:28.119
need the biases anymore. We can just

00:37:26.559 --> 00:37:29.519
throw out the biases because we only

00:37:28.119 --> 00:37:30.920
care about the embeddings and how they

00:37:29.519 --> 00:37:33.320
connect to each other.

00:37:30.920 --> 00:37:34.760
Okay? Yeah.

00:37:33.320 --> 00:37:36.800
So when when you're doing that

00:37:34.760 --> 00:37:39.480
regression, are you predicting the

00:37:36.800 --> 00:37:42.000
co-occurrence matrix? Mhm. Okay.

00:37:39.480 --> 00:37:42.000
Exactly.

00:37:42.320 --> 00:37:45.039
So

00:37:43.358 --> 00:37:46.559
um actually let me just show a very

00:37:45.039 --> 00:37:48.199
quick example

00:37:46.559 --> 00:37:52.039
numerical example here.

00:37:48.199 --> 00:37:52.039
So let's say for example that um

00:37:53.480 --> 00:37:56.039
you know what?

00:37:57.159 --> 00:38:02.000
So this is say W1 and this is W2.

00:38:00.358 --> 00:38:04.639
Okay? And this is the vector and let's

00:38:02.000 --> 00:38:06.400
assume for a moment that we it has two

00:38:04.639 --> 00:38:07.920
dimensions, okay?

00:38:06.400 --> 00:38:09.840
Two dimensions.

00:38:07.920 --> 00:38:13.320
And we also need to calculate B1 and B2

00:38:09.840 --> 00:38:13.320
which are just numbers, okay?

00:38:14.320 --> 00:38:18.359
So let's say the number for deep

00:38:16.960 --> 00:38:20.599
learning in the co-occurrence matrix

00:38:18.358 --> 00:38:21.759
happens, let's say, to have occurred 104

00:38:20.599 --> 00:38:24.759
times.

00:38:21.760 --> 00:38:27.200
So all we are doing is to say log of

00:38:24.760 --> 00:38:28.720
104.

00:38:27.199 --> 00:38:30.919
That is the actual value

00:38:28.719 --> 00:38:33.599
minus

00:38:30.920 --> 00:38:34.880
B1 which we don't know plus B2 which we

00:38:33.599 --> 00:38:36.880
don't know

00:38:34.880 --> 00:38:38.039
and then this thing here, let's just

00:38:36.880 --> 00:38:40.160
call it,

00:38:38.039 --> 00:38:42.119
you know, W11,

00:38:40.159 --> 00:38:43.960
W12,

00:38:42.119 --> 00:38:45.159
W21,

00:38:43.960 --> 00:38:46.519
W22.

00:38:45.159 --> 00:38:49.000
Okay? And then we're just doing the dot

00:38:46.519 --> 00:38:51.400
product, which is W11

00:38:49.000 --> 00:38:53.719
times W21

00:38:51.400 --> 00:38:55.280
plus W12

00:38:53.719 --> 00:38:58.679
times W22.

00:38:55.280 --> 00:39:00.240
Okay? So this is our prediction.

00:38:58.679 --> 00:39:03.559
Where is that cool laser pointer? Yeah.

00:39:00.239 --> 00:39:05.199
So this is our prediction.

00:39:03.559 --> 00:39:07.480
This is the actual.

00:39:05.199 --> 00:39:09.039
So all we do is to say, "Okay,

00:39:07.480 --> 00:39:11.000
this thing, the difference, we're going

00:39:09.039 --> 00:39:12.358
to square it."

00:39:11.000 --> 00:39:16.280
And then we're going to do the same

00:39:12.358 --> 00:39:17.799
exact thing for every other word pair.

00:39:16.280 --> 00:39:19.840
Okay? And when we are done with all of

00:39:17.800 --> 00:39:20.840
that thing, we just take this whole

00:39:19.840 --> 00:39:23.880
thing

00:39:20.840 --> 00:39:26.039
and say gradient descent minimize.

00:39:23.880 --> 00:39:28.200
So then it has to find the B's and the

00:39:26.039 --> 00:39:29.400
W's and everything for every pair,

00:39:28.199 --> 00:39:31.919
every word.

00:39:29.400 --> 00:39:34.440
So that's actually what's going on.

00:39:31.920 --> 00:39:34.440
Make sense?
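
Here is a rough numpy version of that worked example, with made-up numbers: a two-dimensional vector and a bias per word, a prediction b1 + b2 + w1 · w2, and a squared error against log(104). (The full GloVe objective also weights each pair by a function of its count; that detail is omitted here, as in the lecture.)

    import numpy as np

    rng = np.random.default_rng(0)
    w1, w2 = rng.normal(size=2), rng.normal(size=2)   # embeddings for "deep" and "learning"
    b1, b2 = 0.0, 0.0                                 # their biases
    x12 = 104.0                                       # co-occurrence count from the matrix

    lr = 0.01
    for _ in range(5000):
        pred = b1 + b2 + w1 @ w2                      # our prediction for this pair
        err = pred - np.log(x12)                      # difference from the actual value
        # gradient steps for the squared error err**2 (one term of the full sum)
        w1, w2 = w1 - lr * 2 * err * w2, w2 - lr * 2 * err * w1
        b1, b2 = b1 - lr * 2 * err, b2 - lr * 2 * err

    print(b1 + b2 + w1 @ w2, np.log(x12))             # the two numbers now agree

In the real model you sum this squared difference over every word pair in the co-occurrence matrix and let (stochastic) gradient descent adjust all the vectors and biases at once.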

00:39:37.039 --> 00:39:43.800
All right. So by the way uh here

00:39:41.559 --> 00:39:45.320
I said

00:39:43.800 --> 00:39:47.160
I said, you know, let's assume that the

00:39:45.320 --> 00:39:51.160
embeddings are just vectors which are

00:39:47.159 --> 00:39:52.440
of dimension two.

00:39:51.159 --> 00:39:54.039
Well,

00:39:52.440 --> 00:39:55.840
that's an arbitrary decision that I made

00:39:54.039 --> 00:39:58.119
just to show you how it works because I

00:39:55.840 --> 00:39:59.680
was doing it by hand. But more

00:39:58.119 --> 00:40:01.159
generally, we get to choose how long

00:39:59.679 --> 00:40:02.079
these vectors are.

00:40:01.159 --> 00:40:04.440
Right?

00:40:02.079 --> 00:40:05.920
And the longer the vector, the more

00:40:04.440 --> 00:40:07.240
interesting ways it can actually

00:40:05.920 --> 00:40:09.880
reproduce the co-occurrence matrix. It

00:40:07.239 --> 00:40:13.319
has more flexibility. But the longer the

00:40:09.880 --> 00:40:14.920
vector, what is the risk that you run?

00:40:13.320 --> 00:40:16.039
Overfitting.

00:40:14.920 --> 00:40:17.079
Because these are all parameters at the

00:40:16.039 --> 00:40:19.360
end of the day. More parameters you

00:40:17.079 --> 00:40:21.239
have, the more risk of overfitting.

00:40:19.360 --> 00:40:24.320
Okay? So, you get to choose how big

00:40:21.239 --> 00:40:26.799
these things can be. Uh yes.

00:40:24.320 --> 00:40:29.000
Um don't you find it surprising that

00:40:26.800 --> 00:40:30.680
we're able to fit the model where we

00:40:29.000 --> 00:40:32.719
have a lot more parameters than we have

00:40:30.679 --> 00:40:33.919
data because usually with most machine

00:40:32.719 --> 00:40:35.959
learning methods, you would

00:40:33.920 --> 00:40:37.920
like to not have a lot of parameters,

00:40:35.960 --> 00:40:40.240
but here we're going to have

00:40:37.920 --> 00:40:42.680
as you said, the number of dimensions

00:40:40.239 --> 00:40:44.359
times more parameters than we have

00:40:42.679 --> 00:40:46.839
data points. Well, here in this

00:40:44.360 --> 00:40:48.120
particular case, as it turns out, um

00:40:46.840 --> 00:40:49.440
let's assume that you only have 10

00:40:48.119 --> 00:40:51.920
words, right?

00:40:49.440 --> 00:40:53.960
And for each word, let's assume that you

00:40:51.920 --> 00:40:55.280
have, let's just keep the math

00:40:53.960 --> 00:40:56.320
simple. You have a two-dimensional

00:40:55.280 --> 00:40:58.600
vector.

00:40:56.320 --> 00:41:00.640
So, 10 words * 2, that's 20.

00:40:58.599 --> 00:41:02.880
Plus you have 10 biases for the words,

00:41:00.639 --> 00:41:06.000
right? So, that's another 10, that's 30.

00:41:02.880 --> 00:41:08.160
But 10 * 10, the matrix has 100 entries.

00:41:06.000 --> 00:41:10.360
So, because the matrix has on the order of

00:41:08.159 --> 00:41:13.000
n squared entries, you'll have a lot more

00:41:10.360 --> 00:41:14.640
numbers than parameters.

00:41:13.000 --> 00:41:17.239
In this particular case, you have more

00:41:14.639 --> 00:41:18.440
data than parameters.

00:41:17.239 --> 00:41:20.039
So, that particular problem doesn't

00:41:18.440 --> 00:41:22.119
apply in this case.
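
The counting argument from that answer, written out with the numbers from the 10-word example:

    V, d = 10, 2                  # vocabulary size, embedding dimension
    n_params = V * d + V          # one d-dimensional vector plus one bias per word = 30
    n_targets = V * V             # co-occurrence matrix entries we try to reproduce = 100
    print(n_params, n_targets)    # 30 100 -> more data points than parameters here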

00:41:20.039 --> 00:41:23.599
But that does show up in other cases and

00:41:22.119 --> 00:41:24.799
there is some

00:41:23.599 --> 00:41:26.599
very interesting research in neural

00:41:24.800 --> 00:41:29.120
networks which suggests that often times

00:41:26.599 --> 00:41:30.679
the traditional assumptions of data and

00:41:29.119 --> 00:41:32.079
overfitting and all

00:41:30.679 --> 00:41:33.879
can all be called into question under

00:41:32.079 --> 00:41:35.440
some situations.

00:41:33.880 --> 00:41:37.240
Um happy to tell you more offline, but

00:41:35.440 --> 00:41:39.280
if you're curious, just Google something

00:41:37.239 --> 00:41:41.799
called double descent.

00:41:39.280 --> 00:41:41.800
You know what I mean.

00:41:42.559 --> 00:41:45.840
But in this case, it's not a problem.

00:41:46.320 --> 00:41:49.680
Okay.

00:41:47.480 --> 00:41:51.519
So, what that means is that we can

00:41:49.679 --> 00:41:53.480
choose how big these things are. So, if

00:41:51.519 --> 00:41:55.920
you look at one-hot word

00:41:53.480 --> 00:41:57.119
vectors, right? Where

00:41:55.920 --> 00:41:58.559
there's a one and everything else is

00:41:57.119 --> 00:42:00.519
zero depending on the position of the

00:41:58.559 --> 00:42:03.440
word, these are long vectors as long as

00:42:00.519 --> 00:42:05.480
a vocabulary, right? As we saw earlier.

00:42:03.440 --> 00:42:07.400
Word embeddings on the other hand,

00:42:05.480 --> 00:42:08.679
right? They can be very dense, right?

00:42:07.400 --> 00:42:10.000
The numbers

00:42:08.679 --> 00:42:11.000
that make up these embeddings, we're

00:42:10.000 --> 00:42:13.199
actually going to figure out from the

00:42:11.000 --> 00:42:15.480
data what they are. So, it can be

00:42:13.199 --> 00:42:17.679
anything. So, the first dimension

00:42:15.480 --> 00:42:19.480
may stand for some combination of, you

00:42:17.679 --> 00:42:22.519
know, um

00:42:19.480 --> 00:42:23.559
brightness plus speed plus animalness or

00:42:22.519 --> 00:42:24.719
something. We have no idea what it

00:42:23.559 --> 00:42:26.279
means.

00:42:24.719 --> 00:42:27.959
All we know is that it's able to

00:42:26.280 --> 00:42:29.400
reproduce the co-occurrence matrix

00:42:27.960 --> 00:42:30.880
really well, so it probably has

00:42:29.400 --> 00:42:32.480
figured something out.

00:42:30.880 --> 00:42:33.720
Okay? And so, we can keep it really

00:42:32.480 --> 00:42:35.039
short. So, the word embeddings tend to

00:42:33.719 --> 00:42:36.039
be very

00:42:35.039 --> 00:42:38.079
dense,

00:42:36.039 --> 00:42:39.599
meaning not zeros and ones, but some

00:42:38.079 --> 00:42:40.880
arbitrary numbers. It's much lower

00:42:39.599 --> 00:42:41.960
dimensional and it's of course learned

00:42:40.880 --> 00:42:43.960
from data.

00:42:41.960 --> 00:42:45.760
Right? So,

00:42:43.960 --> 00:42:47.800
so once you do this, once you actually

00:42:45.760 --> 00:42:49.920
run GloVe on this data and do gradient

00:42:47.800 --> 00:42:51.400
descent and so on and so forth, uh you

00:42:49.920 --> 00:42:52.639
will actually come up with embeddings

00:42:51.400 --> 00:42:54.360
and then you can actually plot the

00:42:52.639 --> 00:42:55.719
embeddings. You can,

00:42:54.360 --> 00:42:58.320
you know, take these

00:42:55.719 --> 00:42:59.959
embeddings and just plot them. Here um

00:42:58.320 --> 00:43:01.600
they're not literally plotting the first

00:42:59.960 --> 00:43:03.599
two dimensions. They're using a

00:43:01.599 --> 00:43:05.480
particular technique called t-SNE, which

00:43:03.599 --> 00:43:07.239
is a way to take long vectors and

00:43:05.480 --> 00:43:09.119
project them to 2D space for

00:43:07.239 --> 00:43:11.479
visualization purposes.
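
A hedged sketch of that kind of 2-D projection using scikit-learn's t-SNE; `words` and `vectors` are assumed to already hold some embedding labels and their (say) 100-dimensional vectors.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    # Assumed inputs: words = ["brother", "sister", ...], vectors.shape == (len(words), 100)
    coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(np.asarray(vectors))

    plt.scatter(coords[:, 0], coords[:, 1])
    for (x, y), word in zip(coords, words):
        plt.annotate(word, (x, y))        # related words should land near each other
    plt.show()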

00:43:09.119 --> 00:43:12.719
And you can see here

00:43:11.480 --> 00:43:15.079
some very interesting things are showing

00:43:12.719 --> 00:43:17.000
up. So, they basically plotted the

00:43:15.079 --> 00:43:19.599
embedding for brother,

00:43:17.000 --> 00:43:20.920
nephew, uncle, sister, niece,

00:43:19.599 --> 00:43:22.199
aunt, and so on and so forth. It's all

00:43:20.920 --> 00:43:24.240
showing up here.

00:43:22.199 --> 00:43:25.439
This is the embedding for man, embedding

00:43:24.239 --> 00:43:28.119
for woman,

00:43:25.440 --> 00:43:29.920
sir, madam,

00:43:28.119 --> 00:43:32.519
empress, heir,

00:43:29.920 --> 00:43:34.599
duke, emperor, king. You get the idea.

00:43:32.519 --> 00:43:35.840
Right? So, clearly there are patterns

00:43:34.599 --> 00:43:37.480
here where

00:43:35.840 --> 00:43:38.880
things which are sort of similar in

00:43:37.480 --> 00:43:41.519
their nature are all hanging out

00:43:38.880 --> 00:43:42.720
together in the same part of the space.

00:43:41.519 --> 00:43:44.079
Which is comforting, which is good to

00:43:42.719 --> 00:43:44.959
know.

00:43:44.079 --> 00:43:46.719
Right?

00:43:44.960 --> 00:43:48.400
Now, but as I mentioned earlier, it's

00:43:46.719 --> 00:43:50.839
not just about the fact that similar

00:43:48.400 --> 00:43:53.280
things happen to be near each other.

00:43:50.840 --> 00:43:54.960
The direction also actually matters. And

00:43:53.280 --> 00:43:57.640
beautiful things happen when you look at

00:43:54.960 --> 00:44:00.119
directions. So, for instance,

00:43:57.639 --> 00:44:01.879
you know, let's say that

00:44:00.119 --> 00:44:03.159
you want to go from man to

00:44:01.880 --> 00:44:05.440
brother.

00:44:03.159 --> 00:44:07.799
Okay? So, to go from man to brother, you

00:44:05.440 --> 00:44:09.920
have to start with man and then travel

00:44:07.800 --> 00:44:11.440
along this arrow, right? To get to

00:44:09.920 --> 00:44:14.880
brother.

00:44:11.440 --> 00:44:18.519
So, this arrow has some notion of a

00:44:14.880 --> 00:44:19.400
person becoming a sibling.

00:44:18.519 --> 00:44:20.920
Right?

00:44:19.400 --> 00:44:22.400
So, you would hope that if you take that

00:44:20.920 --> 00:44:23.800
same arrow

00:44:22.400 --> 00:44:26.039
and then

00:44:23.800 --> 00:44:29.359
start here with that arrow, hopefully

00:44:26.039 --> 00:44:32.358
the woman will become a sister.

00:44:29.358 --> 00:44:32.358
Sure enough, this.

00:44:32.719 --> 00:44:37.119
So, this is called word vector algebra.

00:44:35.199 --> 00:44:39.039
Right? Embedding algebra. And these

00:44:37.119 --> 00:44:41.119
relationships are actually showing up in

00:44:39.039 --> 00:44:42.039
the data. We didn't tell it any of these

00:44:41.119 --> 00:44:43.039
things.

00:44:42.039 --> 00:44:44.920
We just literally gave it the

00:44:43.039 --> 00:44:46.358
co-occurrence matrix

00:44:44.920 --> 00:44:47.960
and said and and asked it to reproduce

00:44:46.358 --> 00:44:49.759
it.

00:44:47.960 --> 00:44:52.519
So, I find it pretty shocking that these

00:44:49.760 --> 00:44:55.160
things are actually true.

00:44:52.519 --> 00:44:57.358
And it gives us evidence and comfort

00:44:55.159 --> 00:44:59.358
that whatever has been learned does have

00:44:57.358 --> 00:45:01.759
some deep connection to describing the

00:44:59.358 --> 00:45:03.799
underlying nature of what's going on.

00:45:01.760 --> 00:45:05.520
It's not some statistically fluky

00:45:03.800 --> 00:45:07.000
artifact.

00:45:05.519 --> 00:45:07.679
Um yeah.

00:45:07.000 --> 00:45:08.639
So,

00:45:07.679 --> 00:45:11.239
I said

00:45:08.639 --> 00:45:12.960
by context or by adjacency to other

00:45:11.239 --> 00:45:15.000
words and not by

00:45:12.960 --> 00:45:16.480
their position in the same sentence, right?

00:45:15.000 --> 00:45:17.840
Because, like, synonyms won't appear in

00:45:16.480 --> 00:45:19.079
the same sentence.

00:45:17.840 --> 00:45:20.720
They have

00:45:19.079 --> 00:45:22.000
keywords. Right.

00:45:20.719 --> 00:45:23.799
They won't appear in the same sentence,

00:45:22.000 --> 00:45:25.199
but the pattern of co-occurrence will be

00:45:23.800 --> 00:45:26.240
the same for them.

00:45:25.199 --> 00:45:28.159
Which is what we've been able to

00:45:26.239 --> 00:45:30.879
reproduce with these embeddings. So,

00:45:28.159 --> 00:45:30.879
that's the key idea.

00:45:34.119 --> 00:45:37.400
Um

00:45:34.800 --> 00:45:40.359
so, my question is along like how are we

00:45:37.400 --> 00:45:41.480
able to capture all these directions in

00:45:40.358 --> 00:45:44.119
2D

00:45:41.480 --> 00:45:46.119
matrix versus a multi-dimensional matrix

00:45:44.119 --> 00:45:47.920
because I feel like okay, so this

00:45:46.119 --> 00:45:48.759
relationship is kind of

00:45:47.920 --> 00:45:50.519
uh

00:45:48.760 --> 00:45:51.880
confirmed that you're moving to

00:45:50.519 --> 00:45:53.440
kind of like

00:45:51.880 --> 00:45:54.800
family or like blood relationship or

00:45:53.440 --> 00:45:56.599
something of the sort, but like how does

00:45:54.800 --> 00:45:58.240
it not mess up the other sides of that

00:45:56.599 --> 00:46:00.000
matrix? Like

00:45:58.239 --> 00:46:02.199
No, this is just a visualization thing.

00:46:00.000 --> 00:46:04.159
So, we're basically taking this uh you

00:46:02.199 --> 00:46:06.279
know, as you will see, GloVe embeddings

00:46:04.159 --> 00:46:08.199
come in lots of different sizes. And

00:46:06.280 --> 00:46:10.080
this I think uses the 100 dimension

00:46:08.199 --> 00:46:12.358
embedding and just projects it to 2D

00:46:10.079 --> 00:46:15.840
space using a particular technique and

00:46:12.358 --> 00:46:15.840
then looks to see what's going on.

00:46:15.880 --> 00:46:20.000
Um yeah.

00:46:17.800 --> 00:46:22.519
Uh if the input data being co-occurrence

00:46:20.000 --> 00:46:24.599
matrix is biased, aren't we amplifying

00:46:22.519 --> 00:46:26.800
that bias? Yes, we are. Yes. No, it's a

00:46:24.599 --> 00:46:28.719
great observation. Uh any sort of data

00:46:26.800 --> 00:46:30.840
you scrape from the internet and use for

00:46:28.719 --> 00:46:32.679
this sort of modeling exercise will be

00:46:30.840 --> 00:46:34.760
subject to all the biases that produced

00:46:32.679 --> 00:46:36.599
the data in the first place. And

00:46:34.760 --> 00:46:38.760
the model will faithfully learn those

00:46:36.599 --> 00:46:40.358
biases. And if you're not careful, it'll

00:46:38.760 --> 00:46:41.840
perpetuate them.

00:46:40.358 --> 00:46:43.960
So, and that's a whole very important

00:46:41.840 --> 00:46:45.600
topic that unfortunately we won't cover in

00:46:43.960 --> 00:46:46.760
this course because of time constraints,

00:46:45.599 --> 00:46:47.920
but it's something you always have to

00:46:46.760 --> 00:46:50.359
worry about when you're building these

00:46:47.920 --> 00:46:50.358
models.

00:46:50.519 --> 00:46:53.679
How do you think about the

00:46:51.199 --> 00:46:55.799
dimensionality of the embeddings not the

00:46:53.679 --> 00:46:57.279
2D representation of the actual data?

00:46:55.800 --> 00:46:59.000
The one that we choose, that's in

00:46:57.280 --> 00:47:00.519
our hands. So, you should think of them

00:46:59.000 --> 00:47:03.358
as a hyperparameter.

00:47:00.519 --> 00:47:05.239
So, much like the number of hidden units

00:47:03.358 --> 00:47:06.920
to use in a particular hidden layer,

00:47:05.239 --> 00:47:09.719
um it's a hyperparameter. Uh so, you

00:47:06.920 --> 00:47:11.039
know, I would again start small and if

00:47:09.719 --> 00:47:13.159
it solves the problem that you're trying

00:47:11.039 --> 00:47:15.440
to solve with these embeddings, great.

00:47:13.159 --> 00:47:16.960
If not, keep increasing them. And at

00:47:15.440 --> 00:47:19.000
some point there might be like a

00:47:16.960 --> 00:47:20.400
flattening out and an overfitting sort of

00:47:19.000 --> 00:47:22.679
dynamic and then you stop. So, just

00:47:20.400 --> 00:47:24.280
think of it as a hyperparameter.

00:47:22.679 --> 00:47:26.599
Yeah.

00:47:24.280 --> 00:47:28.920
Do you see any benefit in practice to using

00:47:26.599 --> 00:47:31.239
like penalized regression to do this

00:47:28.920 --> 00:47:33.200
to make the embeddings more

00:47:31.239 --> 00:47:36.879
sparse or just like

00:47:33.199 --> 00:47:39.239
lowering the magnitude of them? Yeah.

00:47:36.880 --> 00:47:40.160
Yes. So, there are lots of techniques to

00:47:39.239 --> 00:47:42.159
uh

00:47:40.159 --> 00:47:44.679
to apply regularization in the

00:47:42.159 --> 00:47:46.759
estimation itself of all these numbers.

00:47:44.679 --> 00:47:47.799
Um happy to give you pointers. I'm

00:47:46.760 --> 00:47:49.480
just going with like the simplest

00:47:47.800 --> 00:47:50.800
version possible.

00:47:49.480 --> 00:47:53.719
Yeah.

00:47:50.800 --> 00:47:55.920
I'm not understanding why overfitting is a

00:47:53.719 --> 00:47:58.000
problem in this case cuz we're not doing

00:47:55.920 --> 00:48:00.079
any like out of sample

00:47:58.000 --> 00:48:02.039
prediction. So, like wouldn't you want

00:48:00.079 --> 00:48:03.599
like the embeddings to be

00:48:02.039 --> 00:48:04.519
like high dimensional so you can capture

00:48:03.599 --> 00:48:06.519
like

00:48:04.519 --> 00:48:08.679
your relationships? Uh interesting

00:48:06.519 --> 00:48:11.119
question. So, the question is given that

00:48:08.679 --> 00:48:12.879
there's no notion of a test set, out of

00:48:11.119 --> 00:48:14.559
sample test set that we got we're going

00:48:12.880 --> 00:48:16.079
to evaluate these things on, why do we

00:48:14.559 --> 00:48:18.519
really care about overfitting? Don't

00:48:16.079 --> 00:48:20.519
should we do the best we can to capture

00:48:18.519 --> 00:48:21.400
everything in the data, right?

00:48:20.519 --> 00:48:22.920
Well,

00:48:21.400 --> 00:48:24.280
the thing is

00:48:22.920 --> 00:48:26.320
even when you're not trying to use it

00:48:24.280 --> 00:48:29.560
for out of sample prediction, you do

00:48:26.320 --> 00:48:31.359
want to make sure that your model only

00:48:29.559 --> 00:48:32.639
captures the true patterns and not the

00:48:31.358 --> 00:48:35.199
noise.

00:48:32.639 --> 00:48:36.440
In every data set, there's always noise.

00:48:35.199 --> 00:48:38.358
Right? And you want it to capture a

00:48:36.440 --> 00:48:40.599
signal but not the noise.

00:48:38.358 --> 00:48:42.719
And regardless of what you use it for.

00:48:40.599 --> 00:48:44.159
Because if it captures the noise, then

00:48:42.719 --> 00:48:45.959
the insights you draw from the word

00:48:44.159 --> 00:48:48.399
embeddings may be flawed.

00:48:45.960 --> 00:48:48.400
That's the reason.

00:48:48.880 --> 00:48:51.358
Okay.

00:48:49.760 --> 00:48:53.080
Um all right, so let's keep going. So,

00:48:51.358 --> 00:48:55.400
here the algebra is brother minus man

00:48:53.079 --> 00:48:57.039
plus woman is sister.

00:48:55.400 --> 00:48:58.920
That's it. Human biology reduced to a

00:48:57.039 --> 00:49:00.759
single sentence.
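
A rough sketch of that embedding algebra, assuming a dict `glove` that maps words to numpy vectors (for example, built by parsing the GloVe file shown later in the Colab); the nearest word is found by cosine similarity.

    import numpy as np

    def nearest(query, glove, exclude=()):
        # return the word whose vector has the highest cosine similarity to `query`
        best_word, best_sim = None, -np.inf
        for word, vec in glove.items():
            if word in exclude:
                continue
            sim = query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec))
            if sim > best_sim:
                best_word, best_sim = word, sim
        return best_word

    query = glove["brother"] - glove["man"] + glove["woman"]
    print(nearest(query, glove, exclude={"brother", "man", "woman"}))  # typically "sister"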

00:48:58.920 --> 00:49:02.079
All right. So, now the pros and cons of

00:49:00.760 --> 00:49:04.520
these things are you should use

00:49:02.079 --> 00:49:07.159
something like a GloVe embedding if you

00:49:04.519 --> 00:49:07.960
don't have enough data to sort

00:49:07.159 --> 00:49:10.039
of

00:49:07.960 --> 00:49:11.920
learn a task-specific embedding for

00:49:10.039 --> 00:49:13.400
your own vocabulary. As we As I'll show

00:49:11.920 --> 00:49:14.880
you in the Colab, you can actually learn

00:49:13.400 --> 00:49:16.720
these things just for your own data set

00:49:14.880 --> 00:49:18.920
if you want. You don't have to use these

00:49:16.719 --> 00:49:20.879
GloVe embeddings. But the reason to use

00:49:18.920 --> 00:49:22.639
these pretrained embeddings is that if

00:49:20.880 --> 00:49:24.079
you're working with natural language,

00:49:22.639 --> 00:49:25.559
you know, the word is the word, right?

00:49:24.079 --> 00:49:28.358
It means something.

00:49:25.559 --> 00:49:30.639
And so, there's no reason,

00:49:28.358 --> 00:49:32.840
for your model, for your little use

00:49:30.639 --> 00:49:35.599
case, for you to actually somehow learn

00:49:32.840 --> 00:49:36.519
all the fundamentals of English.

00:49:35.599 --> 00:49:37.679
The fundamentals of English are the

00:49:36.519 --> 00:49:40.519
fundamentals of English. May as well

00:49:37.679 --> 00:49:42.039
learn it once and then piggyback on it.

00:49:40.519 --> 00:49:43.880
So, that's the whole idea of using

00:49:42.039 --> 00:49:45.519
pre-trained embeddings.

00:49:43.880 --> 00:49:47.000
Because these things are all common

00:49:45.519 --> 00:49:48.559
aspects of language. May as well learn

00:49:47.000 --> 00:49:50.559
them using all the data you can throw at

00:49:48.559 --> 00:49:52.039
it and then you can sort of fine-tune

00:49:50.559 --> 00:49:53.159
and tweak and adapt to your particular

00:49:52.039 --> 00:49:55.920
use case.

00:49:53.159 --> 00:49:57.319
Right? So, this is particularly

00:49:55.920 --> 00:49:58.920
useful when you don't have a lot of data

00:49:57.320 --> 00:50:01.360
in your particular use case.

00:49:58.920 --> 00:50:03.280
Uh right? That's one big advantage. Now,

00:50:01.360 --> 00:50:04.840
it does have the drawback that this

00:50:03.280 --> 00:50:05.920
embedding will not be customized to your

00:50:04.840 --> 00:50:06.920
data.

00:50:05.920 --> 00:50:08.840
Right? For example, if you're trying to

00:50:06.920 --> 00:50:10.599
build an application for a medical or

00:50:08.840 --> 00:50:11.680
legal use, it's going to have a lot of

00:50:10.599 --> 00:50:13.679
jargon.

00:50:11.679 --> 00:50:14.960
Right? And this pre-trained embedding

00:50:13.679 --> 00:50:16.960
trained on all of Wikipedia may not

00:50:14.960 --> 00:50:18.840
capture enough of the jargon and know

00:50:16.960 --> 00:50:19.960
its meaning really accurately. So, what

00:50:18.840 --> 00:50:21.000
you want to do is you want to take this

00:50:19.960 --> 00:50:22.440
thing. You may still want to take this

00:50:21.000 --> 00:50:25.119
thing and then you can adapt and

00:50:22.440 --> 00:50:28.679
fine-tune it using your jargon-packed,

00:50:25.119 --> 00:50:29.719
heavy, domain-specific data set.

00:50:28.679 --> 00:50:32.239
Okay, those are some of the things to

00:50:29.719 --> 00:50:32.239
keep in mind.

00:50:32.360 --> 00:50:35.559
And of course, we can also learn it from

00:50:33.559 --> 00:50:38.079
scratch if you want, and in the Colab I

00:50:35.559 --> 00:50:39.440
demonstrate all these options.

00:50:38.079 --> 00:50:41.880
So, when you're working with embeddings

00:50:39.440 --> 00:50:43.880
in Keras, so what we do is

00:50:41.880 --> 00:50:45.480
remember STI

00:50:43.880 --> 00:50:48.360
where we standardize and

00:50:45.480 --> 00:50:50.440
tokenize and index, right? At this

00:50:48.360 --> 00:50:51.960
point, we go from integers to vectors

00:50:50.440 --> 00:50:54.079
and so far we have been using integers

00:50:51.960 --> 00:50:55.559
to one-hot vectors. Here, we're going to

00:50:54.079 --> 00:50:57.679
use embedding vectors that we're going

00:50:55.559 --> 00:51:00.519
to learn or that we're going to reuse

00:50:57.679 --> 00:51:02.119
from GloVe. And so, what we do is we

00:51:00.519 --> 00:51:06.119
tell Keras's text

00:51:02.119 --> 00:51:08.000
vectorization layer to do only STI.

00:51:06.119 --> 00:51:10.519
And then we will use a new layer called

00:51:08.000 --> 00:51:11.639
the embedding layer to do the encoding.

00:51:10.519 --> 00:51:14.719
Yeah, that's how we're going to do it

00:51:11.639 --> 00:51:14.719
divide it up.

00:51:14.920 --> 00:51:18.760
So, we'll take a look at this first uh

00:51:17.039 --> 00:51:20.559
before we switch to the Colab. So,

00:51:18.760 --> 00:51:23.480
before

00:51:20.559 --> 00:51:26.279
we told Keras in this layer output mode

00:51:23.480 --> 00:51:27.679
should be multi-hot or whatever, right?

00:51:26.280 --> 00:51:29.080
Here, we don't want it to actually

00:51:27.679 --> 00:51:30.759
encode anything in multi-hot. We just

00:51:29.079 --> 00:51:32.920
want it to give us integers back. So, we

00:51:30.760 --> 00:51:35.120
tell it give me int.

00:51:32.920 --> 00:51:36.800
Okay? That's the first change.

00:51:35.119 --> 00:51:39.559
We tell it to give us int. If you

00:51:36.800 --> 00:51:41.120
say give us int, it'll stop with STI.

00:51:39.559 --> 00:51:43.759
It'll just give you the integers.

00:51:41.119 --> 00:51:45.079
Uh and then what you do is that

00:51:43.760 --> 00:51:47.160
all the incoming sentences are going to

00:51:45.079 --> 00:51:48.599
have different lengths. So, what we want

00:51:47.159 --> 00:51:50.159
to do is we want to actually take all

00:51:48.599 --> 00:51:52.319
these sentences and sort of normalize

00:51:50.159 --> 00:51:53.199
them so they are of the same length.

00:51:52.320 --> 00:51:55.440
Okay?

00:51:53.199 --> 00:51:57.599
And the way we do that

00:51:55.440 --> 00:51:59.800
And the way we do that very quickly is

00:51:57.599 --> 00:52:01.519
that we choose a maximum

00:51:59.800 --> 00:52:04.000
length for the

00:52:01.519 --> 00:52:05.960
sentences and then if something is

00:52:04.000 --> 00:52:07.119
uh exactly fits that length, perfect.

00:52:05.960 --> 00:52:08.920
Let's say in this case we want a max

00:52:07.119 --> 00:52:11.199
length of five. Cats sat on the mat is

00:52:08.920 --> 00:52:12.599
exactly five. Boom, fits perfectly. But

00:52:11.199 --> 00:52:14.480
if something is smaller, I love you is

00:52:12.599 --> 00:52:16.360
only three of these things, we actually

00:52:14.480 --> 00:52:17.760
pad it with something called the pad

00:52:16.360 --> 00:52:19.840
token.

00:52:17.760 --> 00:52:22.160
Much like the unk token, pad token is a

00:52:19.840 --> 00:52:23.840
special token which we use for padding.

00:52:22.159 --> 00:52:25.759
And then, you know,

00:52:23.840 --> 00:52:27.760
Keras, you will see, will use zeros for

00:52:25.760 --> 00:52:29.720
these paddings, so that it fills it

00:52:27.760 --> 00:52:31.440
up and gets all the way to the end. And

00:52:29.719 --> 00:52:33.239
if you have something which is much

00:52:31.440 --> 00:52:34.559
longer than five, you just truncate

00:52:33.239 --> 00:52:36.199
everything else and just use the first

00:52:34.559 --> 00:52:38.440
five.

00:52:36.199 --> 00:52:41.639
So, this is what we do to get all the

00:52:38.440 --> 00:52:41.639
sentences to be of the same length.
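
In Keras that looks roughly like this (toy sentences and a max length of 5, just for illustration):

    from tensorflow import keras

    vectorizer = keras.layers.TextVectorization(
        max_tokens=100,               # vocabulary size (illustrative)
        output_mode="int",            # stop after STI: just give back integer indices
        output_sequence_length=5,     # pad shorter inputs with 0, truncate longer ones
    )
    vectorizer.adapt(["the cat sat on the mat", "i love you"])

    print(vectorizer(["the cat sat on the mat"]))   # exactly 5 integers
    print(vectorizer(["i love you"]))               # 3 integers followed by two 0s (padding)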

00:52:42.400 --> 00:52:45.599
Okay?

00:52:43.480 --> 00:52:47.199
And once we do that we then go to the

00:52:45.599 --> 00:52:49.000
embedding layer.

00:52:47.199 --> 00:52:50.359
And the embedding layer is actually very

00:52:49.000 --> 00:52:51.719
simple.

00:52:50.360 --> 00:52:53.360
What is an embedding? It's just

00:52:51.719 --> 00:52:54.519
a vector and we need a vector for every

00:52:53.360 --> 00:52:55.440
token.

00:52:54.519 --> 00:52:57.639
Of course, we're going to learn these

00:52:55.440 --> 00:52:59.599
vectors. We need one for every token.

00:52:57.639 --> 00:53:01.039
So, in this case for example, uh let's

00:52:59.599 --> 00:53:02.239
say that these are all the tokens we

00:53:01.039 --> 00:53:05.000
have

00:53:02.239 --> 00:53:08.039
in our vocabulary after the STI process.

00:53:05.000 --> 00:53:09.280
Maybe in this case we have 5,000 tokens.

00:53:08.039 --> 00:53:11.159
Each token we have this embedding

00:53:09.280 --> 00:53:12.960
vector, right? And we choose what the

00:53:11.159 --> 00:53:15.000
dimension of that embedding vector is,

00:53:12.960 --> 00:53:17.400
right? And so, we can set it up by

00:53:15.000 --> 00:53:19.280
saying keras.layers.Embedding and we

00:53:17.400 --> 00:53:21.160
tell it max tokens, which means how

00:53:19.280 --> 00:53:21.920
many rows do we have here.

00:53:21.159 --> 00:53:23.759
You know, what is the

00:53:21.920 --> 00:53:25.680
vocabulary size that we're working with?

00:53:23.760 --> 00:53:28.360
And then we tell it, okay, this is how

00:53:25.679 --> 00:53:31.799
long I want each embedding vector to be.

00:53:28.360 --> 00:53:33.240
So, rows, the size of the columns, and

00:53:31.800 --> 00:53:34.400
that's the embedding layer. And we'll

00:53:33.239 --> 00:53:35.719
use it in a second. I just want to show

00:53:34.400 --> 00:53:37.039
it to you here because it's

00:53:35.719 --> 00:53:38.759
slightly clearer.
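
So the layer itself is roughly this (5,000 rows of 100 numbers each; the sizes are illustrative):

    from tensorflow import keras
    import tensorflow as tf

    max_tokens, embedding_dim = 5000, 100
    embedding_layer = keras.layers.Embedding(input_dim=max_tokens, output_dim=embedding_dim)

    ids = tf.constant([[23, 9, 5, 0, 0]])     # a padded sequence of token indices
    print(embedding_layer(ids).shape)         # (1, 5, 100): one 100-long vector per token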

00:53:37.039 --> 00:53:40.559
So, when an input sentence arrives, the

00:53:38.760 --> 00:53:42.200
text vectorization layer will run STI

00:53:40.559 --> 00:53:44.000
on it. It'll truncate and pad it to max

00:53:42.199 --> 00:53:46.599
length as needed. So, let's say this

00:53:44.000 --> 00:53:48.639
phrase comes in, STI will give you the

00:53:46.599 --> 00:53:50.319
same tokens plus pad pad because let's

00:53:48.639 --> 00:53:52.639
say the max length is five and then

00:53:50.320 --> 00:53:53.960
these are the corresponding integers.

00:53:52.639 --> 00:53:55.279
And then

00:53:53.960 --> 00:53:56.440
the embedding layer will just look up

00:53:55.280 --> 00:53:59.320
the corresponding vector. So, for

00:53:56.440 --> 00:54:01.519
example here, we need

00:53:59.320 --> 00:54:04.000
to look up the vectors for 23, 9, 5, 0,

00:54:01.519 --> 00:54:07.400
and 0. So, we just go here and look up

00:54:04.000 --> 00:54:08.760
23, 9, 5, 0, and 0. And then once we have

00:54:07.400 --> 00:54:10.720
that, boom.

00:54:08.760 --> 00:54:12.320
This is the resulting output. So,

00:54:10.719 --> 00:54:13.519
whatever input sentence comes in, we

00:54:12.320 --> 00:54:14.760
have now

00:54:13.519 --> 00:54:17.519
five embedding vectors that have been

00:54:14.760 --> 00:54:20.080
looked up from the embedding layer.

00:54:17.519 --> 00:54:22.159
And once we do that

00:54:20.079 --> 00:54:24.400
this is a table. So, I love you comes

00:54:22.159 --> 00:54:25.679
in, it becomes this table. As we have

00:54:24.400 --> 00:54:27.519
seen before

00:54:25.679 --> 00:54:30.279
neural networks can only accommodate

00:54:27.519 --> 00:54:32.119
vectors as inputs. We need to you know,

00:54:30.280 --> 00:54:33.760
make this into a vector. And as we have

00:54:32.119 --> 00:54:35.319
done before, you know, we can either

00:54:33.760 --> 00:54:37.320
take all these things and concatenate

00:54:35.320 --> 00:54:39.359
them, make one long vector, or we can

00:54:37.320 --> 00:54:40.800
find a way to average them or sum them

00:54:39.358 --> 00:54:42.960
and things like that, right? As we have

00:54:40.800 --> 00:54:44.039
seen before. And we will do the same here.

00:54:42.960 --> 00:54:46.320
The simplest thing is probably

00:54:44.039 --> 00:54:48.239
just to average them. So,

00:54:46.320 --> 00:54:51.000
uh these are some options, but

00:54:48.239 --> 00:54:53.439
we'll average them here. And this is

00:54:51.000 --> 00:54:55.679
called the GlobalAveragePooling1D

00:54:53.440 --> 00:54:57.240
layer. And all it does is, whatever

00:54:55.679 --> 00:54:59.839
table you give it, it just

00:54:57.239 --> 00:55:01.079
takes each dimension and averages it.

00:54:59.840 --> 00:55:02.280
The first dimension average, second

00:55:01.079 --> 00:55:04.358
dimension average, and so on and so

00:55:02.280 --> 00:55:05.440
forth. And once that's done

00:55:04.358 --> 00:55:07.279
that's the whole thing.

00:55:05.440 --> 00:55:09.679
So,

00:55:07.280 --> 00:55:11.920
the phrase comes in, STI gives you these

00:55:09.679 --> 00:55:14.000
things, padding as needed or truncating

00:55:11.920 --> 00:55:16.240
as needed. We look up the embeddings

00:55:14.000 --> 00:55:18.559
from the embedding layer and then we get

00:55:16.239 --> 00:55:20.479
all this thing. We do global average

00:55:18.559 --> 00:55:22.400
pooling on it and it's done.

00:55:20.480 --> 00:55:24.000
The resulting thing is a vector that can

00:55:22.400 --> 00:55:26.680
then be passed into hidden layers just

00:55:24.000 --> 00:55:26.679
like we normally do.
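
Putting the pieces together, a minimal sketch of that pipeline in Keras (layer sizes and the final output layer are illustrative, not the exact Colab model):

    from tensorflow import keras

    max_tokens, max_len, embedding_dim = 5000, 5, 100

    inputs = keras.Input(shape=(max_len,), dtype="int64")           # integer ids from TextVectorization
    x = keras.layers.Embedding(max_tokens, embedding_dim)(inputs)   # a (max_len, 100) table per example
    x = keras.layers.GlobalAveragePooling1D()(x)                    # average the rows into one vector
    x = keras.layers.Dense(32, activation="relu")(x)                # hidden layers as usual
    outputs = keras.layers.Dense(1, activation="sigmoid")(x)        # e.g. a binary label (illustrative)
    model = keras.Model(inputs, outputs)
    model.summary()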

00:55:27.559 --> 00:55:31.320
I'm going over this a little fast, but

00:55:29.320 --> 00:55:33.000
make sure you look at it afterwards and

00:55:31.320 --> 00:55:34.320
understand every step, and the Colab

00:55:33.000 --> 00:55:36.159
will mirror this

00:55:34.320 --> 00:55:37.200
you know, perfectly.

00:55:36.159 --> 00:55:39.559
All right, so let's switch to the

00:55:37.199 --> 00:55:41.480
Colab.

00:55:39.559 --> 00:55:43.960
Okay. All right.

00:55:41.480 --> 00:55:46.320
Can folks see this okay?

00:55:43.960 --> 00:55:47.639
All right, so we'll do the usual.

00:55:46.320 --> 00:55:49.800
Um

00:55:47.639 --> 00:55:51.599
import all the stuff we need and then

00:55:49.800 --> 00:55:53.519
because I want to plot some of these uh

00:55:51.599 --> 00:55:55.159
loss and accuracy curves to

00:55:53.519 --> 00:55:56.639
you know, just to see what's going on,

00:55:55.159 --> 00:55:58.239
I'll just bring in the functions from

00:55:56.639 --> 00:55:59.319
the previous Colabs.

00:55:58.239 --> 00:56:01.839
Here.

00:55:59.320 --> 00:56:03.440
And then um and I think I already have

00:56:01.840 --> 00:56:06.000
downloaded this. Let me just make sure I

00:56:03.440 --> 00:56:06.000
have it.

00:56:08.079 --> 00:56:13.480
Uh it's not there. Okay.

00:56:11.119 --> 00:56:14.960
Do it again.

00:56:13.480 --> 00:56:17.760
This is the same songs data set that we

00:56:14.960 --> 00:56:17.760
looked at on Monday.

00:56:17.840 --> 00:56:21.079
Okay.

00:56:19.000 --> 00:56:25.280
So, roughly 49,000 examples as we saw

00:56:21.079 --> 00:56:25.279
before. We'll one-hot encode them.

00:56:25.519 --> 00:56:28.840
All right, so there's a bunch of stuff

00:56:27.000 --> 00:56:30.519
that we already covered in class. So,

00:56:28.840 --> 00:56:33.880
this is the thing

00:56:30.519 --> 00:56:35.840
uh this URL has all the GloVe vectors

00:56:33.880 --> 00:56:37.160
available for download. I downloaded it

00:56:35.840 --> 00:56:39.960
uh before class because it takes a few

00:56:37.159 --> 00:56:41.519
minutes. Uh and I've also unz- Did I

00:56:39.960 --> 00:56:43.199
unzip it?

00:56:41.519 --> 00:56:46.039
Uh yes, I did. And so, let's just look

00:56:43.199 --> 00:56:47.119
at the first few.

00:56:46.039 --> 00:56:49.159
All right, so these are all the first

00:56:47.119 --> 00:56:52.839
few. We'll create a sort of an easier to

00:56:49.159 --> 00:56:52.839
view version of these GloVe vectors.

00:56:54.760 --> 00:56:58.480
So, I'm going to use the vectors which

00:56:56.639 --> 00:56:59.839
are 100 long, but it comes in many

00:56:58.480 --> 00:57:03.000
different shapes.

00:56:59.840 --> 00:57:05.720
So, we have 400,000 vectors, 400,000

00:57:03.000 --> 00:57:07.519
word vectors. Each is 100-dimensional.

00:57:05.719 --> 00:57:09.399
Uh and these all have been calculated

00:57:07.519 --> 00:57:11.119
from Wikipedia using

00:57:09.400 --> 00:57:12.720
the model we described using gradient

00:57:11.119 --> 00:57:15.480
descent. Okay?

00:57:12.719 --> 00:57:18.239
Uh all right, so this is the

00:57:15.480 --> 00:57:19.280
vector for the word movie.

00:57:18.239 --> 00:57:21.519
Yeah, I don't know what these dimensions

00:57:19.280 --> 00:57:23.480
mean, but there's something going

00:57:21.519 --> 00:57:24.880
on. It has figured stuff out.

00:57:23.480 --> 00:57:26.840
Uh but the proof is in the pudding,

00:57:24.880 --> 00:57:28.200
right? So, all right, now we'll first

00:57:26.840 --> 00:57:30.358
set up the text vectorization and

00:57:28.199 --> 00:57:33.839
embedding layers like we saw before.

00:57:30.358 --> 00:57:36.239
Um and so, I'm going to use uh a max

00:57:33.840 --> 00:57:38.240
length of 300 for the songs.

00:57:36.239 --> 00:57:40.879
Um right? Because all the sentences have

00:57:38.239 --> 00:57:42.479
to be the same length. And you might be

00:57:40.880 --> 00:57:44.519
wondering, okay, why did you pick 300

00:57:42.480 --> 00:57:46.840
and not say 400 or 200? So, typically

00:57:44.519 --> 00:57:48.920
what you do is you actually look at the

00:57:46.840 --> 00:57:51.039
the length distribution of the songs you

00:57:48.920 --> 00:57:52.720
have and you will find you're looking

00:57:51.039 --> 00:57:54.358
for like an 80/20 or a you know, one of

00:57:52.719 --> 00:57:56.399
those things. And in this case it turns

00:57:54.358 --> 00:57:59.000
out 90% of the songs have less than or

00:57:56.400 --> 00:58:00.880
equal to 300 words in our data set. So,

00:57:59.000 --> 00:58:03.000
I'm just going to go with 300. Okay?

00:58:00.880 --> 00:58:04.840
It's pretty good. Uh the problem is if

00:58:03.000 --> 00:58:06.800
you actually say if you look at the song

00:58:04.840 --> 00:58:09.079
which has the maximum length

00:58:06.800 --> 00:58:10.680
that might have be like 3,000 words and

00:58:09.079 --> 00:58:12.599
there would be any hardly any songs of

00:58:10.679 --> 00:58:13.839
3,000 long. You're just wasting a lot of

00:58:12.599 --> 00:58:16.079
capacity by doing that. So, you're just

00:58:13.840 --> 00:58:18.840
being a little pragmatic here.

00:58:16.079 --> 00:58:20.519
So, okay. So, and then we as before for

00:58:18.840 --> 00:58:22.920
the vocabulary itself, we tell Keras use

00:58:20.519 --> 00:58:24.599
the most frequent 5,000 words, right?

00:58:22.920 --> 00:58:27.599
When you're doing the STI

00:58:24.599 --> 00:58:29.719
process. So, we do that and we tell it

00:58:27.599 --> 00:58:32.279
the output mode is int like we saw

00:58:29.719 --> 00:58:32.279
before.

00:58:32.320 --> 00:58:36.840
We have there.

00:58:35.199 --> 00:58:39.480
Okay, perfect.

00:58:36.840 --> 00:58:41.840
Okay, this is a very dangerous thing

00:58:39.480 --> 00:58:44.519
where somebody is remotely changing it

00:58:41.840 --> 00:58:48.160
in another tab somewhere.

00:58:44.519 --> 00:58:48.159
Fingers crossed. Okay.

00:58:50.239 --> 00:58:54.279
Okay. So, we have this um and this is

00:58:52.400 --> 00:58:57.320
what we did with all this stuff uh as

00:58:54.280 --> 00:58:59.920
I've covered. So, now we will adapt this

00:58:57.320 --> 00:59:02.960
layer as we have seen before using all

00:58:59.920 --> 00:59:02.960
the lyrics we have.

00:59:04.358 --> 00:59:08.358
And once we do that, we'll take a look at

00:59:06.280 --> 00:59:10.080
the first few.

00:59:08.358 --> 00:59:12.239
And so, here's a very important thing.

00:59:10.079 --> 00:59:14.519
Before, when we asked it to do multi-hot

00:59:12.239 --> 00:59:17.679
encoding and so on in on Monday,

00:59:14.519 --> 00:59:19.880
uh the zero, the first position was unk.

00:59:17.679 --> 00:59:21.679
Right? Unk had zero. But here, unk

00:59:19.880 --> 00:59:23.400
actually has one.

00:59:21.679 --> 00:59:25.559
And the reason is that

00:59:23.400 --> 00:59:28.200
the zeroth position is going to be uh

00:59:25.559 --> 00:59:30.199
used for essentially the You can think

00:59:28.199 --> 00:59:32.839
of this as the empty string. That's how

00:59:30.199 --> 00:59:35.000
Keras will print out pad.

00:59:32.840 --> 00:59:37.039
So, the zero position is the padding,

00:59:35.000 --> 00:59:39.079
the pad token. The first position is the

00:59:37.039 --> 00:59:41.480
unk token. Okay?

00:59:39.079 --> 00:59:44.440
So, it's an important thing here.

00:59:41.480 --> 00:59:46.920
So, let's say that we do

00:59:44.440 --> 00:59:49.599
"HODL you're the best."

00:59:46.920 --> 00:59:51.240
We vectorize it. Um

00:59:49.599 --> 00:59:52.719
Do you think HODL

00:59:51.239 --> 00:59:54.319
is going to be part of those 400,000

00:59:52.719 --> 00:59:57.480
word vectors?

00:59:54.320 --> 01:00:01.400
Wikipedia. Not yet. So,

00:59:57.480 --> 01:00:01.400
Um all right. So, let's try that.

01:00:03.519 --> 01:00:05.960
Okay, and as you can tell,

01:00:05.199 --> 01:00:08.199
um

01:00:05.960 --> 01:00:12.720
HODL is an unknown word, right? That's

01:00:08.199 --> 01:00:14.879
why uh it's showing up here.

01:00:12.719 --> 01:00:18.119
Right. So, one is unknown, right? The

01:00:14.880 --> 01:00:19.559
index value one is unknown. Zero is pad.

01:00:18.119 --> 01:00:21.920
But then,

01:00:19.559 --> 01:00:25.000
this is unknown, that's HODL. And

01:00:21.920 --> 01:00:26.720
then you're the best, and then

01:00:25.000 --> 01:00:28.679
everything else from that point on is a

01:00:26.719 --> 01:00:30.119
zero because we are padding all the way

01:00:28.679 --> 01:00:31.239
to 300.

01:00:30.119 --> 01:00:32.599
Okay? So, that's why you see all these

01:00:31.239 --> 01:00:34.359
zeros here.

01:00:32.599 --> 01:00:37.000
All right. Uh now, let's just, you know,

01:00:34.360 --> 01:00:38.760
run everything through

01:00:37.000 --> 01:00:41.880
the vectorization layer, and then we'll

01:00:38.760 --> 01:00:41.880
get to the embedding layer.

01:00:44.400 --> 01:00:50.519
Okay. Now, first,

01:00:48.360 --> 01:00:51.840
there's just a bit of Python uh

01:00:50.519 --> 01:00:54.960
housekeeping

01:00:51.840 --> 01:00:56.600
um to create a nice, easy to look at

01:00:54.960 --> 01:00:58.679
matrix. So, what we're going to do is

01:00:56.599 --> 01:01:00.960
we're actually going to create a nice

01:00:58.679 --> 01:01:02.480
matrix which shows us all

01:01:00.960 --> 01:01:04.039
the GloVe embeddings.

01:01:02.480 --> 01:01:05.679
Um

01:01:04.039 --> 01:01:07.159
And so, here, this is the embedding

01:01:05.679 --> 01:01:09.639
matrix.

01:01:07.159 --> 01:01:11.679
And this matrix has only 5,000 words,

01:01:09.639 --> 01:01:13.639
and each is a 100 long.

01:01:11.679 --> 01:01:15.199
Why is this embedding matrix only 5,000

01:01:13.639 --> 01:01:17.679
even though we downloaded 400,000

01:01:15.199 --> 01:01:17.679
vectors?

01:01:21.480 --> 01:01:24.719
Right. So, clearly the 5,000 we used

01:01:23.440 --> 01:01:27.440
there has some bearing to this, but what

01:01:24.719 --> 01:01:27.439
is that 5,000?

01:01:30.760 --> 01:01:34.800
We told Keras to take the most frequent

01:01:32.840 --> 01:01:36.960
5,000 words in our corpus.

01:01:34.800 --> 01:01:38.920
So, we'll only have 5,000 in vocabulary.

01:01:36.960 --> 01:01:40.480
That's why there's 5,000. So, we grab

01:01:38.920 --> 01:01:42.480
just the GloVe vectors for

01:01:40.480 --> 01:01:44.119
those 5,000 words that Keras has chosen to

01:01:42.480 --> 01:01:45.599
be in the vocabulary. Okay? And that's

01:01:44.119 --> 01:01:47.639
our embedding matrix.

01:01:45.599 --> 01:01:50.279
And then, if you look at the first few

01:01:47.639 --> 01:01:52.879
rows, the first two rows should be all

01:01:50.280 --> 01:01:54.720
zeros because it's pad and unk,

01:01:52.880 --> 01:01:57.320
which clearly GloVe doesn't know about.

01:01:54.719 --> 01:01:59.000
It's all going to be all zeros. And um

01:01:57.320 --> 01:02:00.480
so, you can see all these zeros here,

01:01:59.000 --> 01:02:02.639
and then from the third row on, you start

01:02:00.480 --> 01:02:04.199
getting some numbers. Okay?
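
A rough sketch of that housekeeping step, assuming the GloVe file glove.6B.100d.txt and the vectorize_layer from before (names are assumptions):

```python
import numpy as np

# Load the 100-d GloVe vectors into a dict, then copy over only the rows for
# the 5,000 words Keras kept in the vocabulary.
embedding_dim = 100
glove = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        word, *coefs = line.split()
        glove[word] = np.asarray(coefs, dtype="float32")

vocab = vectorize_layer.get_vocabulary()          # ['', '[UNK]', 'the', ...]
embedding_matrix = np.zeros((len(vocab), embedding_dim))
for i, word in enumerate(vocab):
    vec = glove.get(word)
    if vec is not None:       # pad and [UNK] are not in GloVe: rows stay all zeros
        embedding_matrix[i] = vec

print(embedding_matrix.shape)                     # (5000, 100)
```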

01:02:02.639 --> 01:02:05.599
All right. Next, we'll set up the

01:02:04.199 --> 01:02:06.279
embedding layer.

01:02:05.599 --> 01:02:07.799
Uh

01:02:06.280 --> 01:02:09.600
so, basically, what's going on here is

01:02:07.800 --> 01:02:11.680
we tell the embedding layer how

01:02:09.599 --> 01:02:13.519
many rows, which is just the vocab size,

01:02:11.679 --> 01:02:15.759
max tokens, and what the embedding

01:02:13.519 --> 01:02:17.599
dimension is. Well, that's going to be 100

01:02:15.760 --> 01:02:19.840
because the GloVe vectors are 100. And

01:02:17.599 --> 01:02:22.279
then, here's the thing. You can tell it

01:02:19.840 --> 01:02:23.800
um in this embedding layer, just use

01:02:22.280 --> 01:02:25.640
this matrix I'm giving you as the

01:02:23.800 --> 01:02:26.880
embeddings. Because we already know

01:02:25.639 --> 01:02:28.799
what the embeddings are. We downloaded

01:02:26.880 --> 01:02:30.760
them from GloVe, right? So, we will

01:02:28.800 --> 01:02:32.320
tell it to use GloVe as the

01:02:30.760 --> 01:02:34.680
weights for here, as the embeddings

01:02:32.320 --> 01:02:36.720
here. So, we initialize it using that

01:02:34.679 --> 01:02:38.359
embedding matrix, right? And then, we

01:02:36.719 --> 01:02:40.199
tell it

01:02:38.360 --> 01:02:41.720
don't train it. When we do backpropagation

01:02:40.199 --> 01:02:43.679
later on, don't change any of these

01:02:41.719 --> 01:02:45.839
weights because somebody spent a lot of

01:02:43.679 --> 01:02:47.759
money creating these weights for us.

01:02:45.840 --> 01:02:49.200
Stanford did. So, we don't want to

01:02:47.760 --> 01:02:51.280
further change them. Just freeze them

01:02:49.199 --> 01:02:52.679
and use them as they are. Okay?

01:02:51.280 --> 01:02:53.760
And this mask_zero business I'll come

01:02:52.679 --> 01:02:55.440
back to later. Don't worry about it for the

01:02:53.760 --> 01:02:58.200
moment.
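
As a sketch, the embedding-layer setup described here might look like the following, assuming the embedding_matrix built earlier:

```python
from tensorflow.keras.layers import Embedding
from tensorflow.keras.initializers import Constant

# Rows = vocab size (max_tokens), columns = 100 (the GloVe dimension);
# initialized from GloVe and frozen so backprop leaves it untouched.
embedding_layer = Embedding(
    input_dim=embedding_matrix.shape[0],            # 5,000 rows
    output_dim=embedding_matrix.shape[1],           # 100-d GloVe vectors
    embeddings_initializer=Constant(embedding_matrix),
    trainable=False,                                # freeze the pre-trained weights
    mask_zero=True,                                 # index 0 (pad) is treated as masked
)
```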

01:02:55.440 --> 01:03:00.079
All right. So, once we do that,

01:02:58.199 --> 01:03:02.199
we are ready to set up our model. So,

01:03:00.079 --> 01:03:04.159
this model is pretty simple. A Keras

01:03:02.199 --> 01:03:05.799
input, the length, of course, is the

01:03:04.159 --> 01:03:08.319
length of the sentence, right? Which is

01:03:05.800 --> 01:03:09.600
300 long, and then the input

01:03:08.320 --> 01:03:12.320
runs through an embedding layer right

01:03:09.599 --> 01:03:14.679
there, right? And out comes a 300 by 100

01:03:12.320 --> 01:03:15.600
table, and then we global average pool

01:03:14.679 --> 01:03:17.039
it,

01:03:15.599 --> 01:03:19.079
right? And that becomes a 100 element

01:03:17.039 --> 01:03:20.920
vector, and then we are back in familiar

01:03:19.079 --> 01:03:23.480
ground, and we run it through a dense

01:03:20.920 --> 01:03:25.480
layer with eight ReLU neurons, uh right?

01:03:23.480 --> 01:03:27.159
And then we run it

01:03:25.480 --> 01:03:29.000
through the final output layer, which is

01:03:27.159 --> 01:03:31.239
a three-way softmax as before: hip hop,

01:03:29.000 --> 01:03:34.760
rock, pop. And then, we tell Keras that's

01:03:31.239 --> 01:03:36.879
our model, and then we summarize it.
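
A rough reconstruction of that model in Keras (assuming the embedding_layer above; the exact notebook code may differ):

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(300,), dtype="int64")    # one 300-token sentence
x = embedding_layer(inputs)                          # -> a 300 x 100 table
x = layers.GlobalAveragePooling1D()(x)               # -> a 100-element vector
x = layers.Dense(8, activation="relu")(x)            # eight ReLU neurons
outputs = layers.Dense(3, activation="softmax")(x)   # hip hop / rock / pop
model = keras.Model(inputs, outputs)
model.summary()   # ~500,835 total parameters, 835 trainable
```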

01:03:34.760 --> 01:03:38.000
Okay. So, this is what we have. And you can

01:03:36.880 --> 01:03:41.039
see here,

01:03:38.000 --> 01:03:42.559
the total parameters are 500,835,

01:03:41.039 --> 01:03:44.519
but the trainable parameters are only

01:03:42.559 --> 01:03:46.000
835.

01:03:44.519 --> 01:03:49.320
It's because the total parameters are

01:03:46.000 --> 01:03:50.800
all the GloVe embeddings plus the

01:03:49.320 --> 01:03:52.840
things we added to the GloVe embeddings

01:03:50.800 --> 01:03:54.720
like the hidden layer and so on.

01:03:52.840 --> 01:03:57.400
But for the GloVe embeddings, we have

01:03:54.719 --> 01:03:58.799
told Keras: freeze them. Do not train them.

01:03:57.400 --> 01:04:00.358
Right? Which means only the rest of it

01:03:58.800 --> 01:04:03.160
is going to be trainable. That's

01:04:00.358 --> 01:04:03.159
the 835. Yeah.
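
A quick sanity check of those numbers, using the layer sizes described above:

```python
embedding_params = 5000 * 100      # 500,000 frozen GloVe weights
hidden_params    = 100 * 8 + 8     # 808 = weights + biases of the Dense(8) layer
output_params    = 8 * 3 + 3       # 27 = weights + biases of the softmax layer
print(embedding_params + hidden_params + output_params)   # 500,835 total
print(hidden_params + output_params)                       # 835 trainable
```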

01:04:03.358 --> 01:04:06.880
So, when we do the global average

01:04:05.000 --> 01:04:09.840
pooling, don't we lose any

01:04:06.880 --> 01:04:12.680
sense of meaning that we gain from the

01:04:09.840 --> 01:04:14.519
embeddings as we average very different

01:04:12.679 --> 01:04:15.960
embeddings together?

01:04:14.519 --> 01:04:16.400
Sorry, say that again. I missed the

01:04:15.960 --> 01:04:18.639
first part.

01:04:16.400 --> 01:04:20.400
>> If we average the embeddings of apple

01:04:18.639 --> 01:04:22.279
and learning, for instance, they are

01:04:20.400 --> 01:04:23.880
very different words used with

01:04:22.280 --> 01:04:26.320
different meanings, so we have different

01:04:23.880 --> 01:04:27.358
embeddings, but we average them, so don't we

01:04:26.320 --> 01:04:28.600
lose that?

01:04:27.358 --> 01:04:30.319
We will lose a bunch of stuff. Yeah,

01:04:28.599 --> 01:04:31.239
yeah, yeah. Anytime

01:04:30.320 --> 01:04:33.559
you average anything, you're going to

01:04:31.239 --> 01:04:36.039
lose some nuance and so on. So, the

01:04:33.559 --> 01:04:37.840
real question is: despite that

01:04:36.039 --> 01:04:39.239
averaging, is it good enough for you?

01:04:37.840 --> 01:04:41.039
And sometimes it's good enough.

01:04:39.239 --> 01:04:42.599
Very often it's good enough, as it turns

01:04:41.039 --> 01:04:44.119
out. But as you will see when you go to

01:04:42.599 --> 01:04:45.759
contextual embeddings, there's just a

01:04:44.119 --> 01:04:47.519
better way to do it, right? When you

01:04:45.760 --> 01:04:49.240
have contextual embeddings, uh but it

01:04:47.519 --> 01:04:50.679
requires bigger models, more powerful

01:04:49.239 --> 01:04:51.559
stuff, and so on and so forth. And

01:04:50.679 --> 01:04:53.919
that's where you're going from the

01:04:51.559 --> 01:04:56.119
foundations to the advanced stuff.

01:04:53.920 --> 01:04:56.119
Yeah.

01:04:56.199 --> 01:05:00.679
When we're doing optimization, like

01:04:58.159 --> 01:05:02.839
let's say on a real-world problem, it's

01:05:00.679 --> 01:05:04.719
often best to optimize everything

01:05:02.840 --> 01:05:06.160
together than to optimize one part of

01:05:04.719 --> 01:05:07.199
the system and then optimize the other

01:05:06.159 --> 01:05:09.799
part of the system.

01:05:07.199 --> 01:05:12.319
So, in that case, why wouldn't we want

01:05:09.800 --> 01:05:13.600
to also change the embeddings?

01:05:12.320 --> 01:05:15.559
I understand why we would

01:05:13.599 --> 01:05:17.440
like to stick with

01:05:15.559 --> 01:05:19.119
those weights that

01:05:17.440 --> 01:05:20.679
some people have spent a lot of money

01:05:19.119 --> 01:05:23.358
trying to find, but will

01:05:20.679 --> 01:05:25.279
we be able to find more specific uh

01:05:23.358 --> 01:05:26.960
embeddings related to our problem if we

01:05:25.280 --> 01:05:29.000
let everything be

01:05:26.960 --> 01:05:30.880
trainable? Yeah. Absolutely. Absolutely.

01:05:29.000 --> 01:05:33.280
And in fact, you will see in the Colab

01:05:30.880 --> 01:05:35.280
uh that we will do that next. I just

01:05:33.280 --> 01:05:37.000
want to show people you don't have to do

01:05:35.280 --> 01:05:38.560
it. You start with not training it

01:05:37.000 --> 01:05:39.679
because it's going to be much faster.

01:05:38.559 --> 01:05:41.079
And then, you train everything and see

01:05:39.679 --> 01:05:42.519
if it gets better. And sometimes it'll

01:05:41.079 --> 01:05:44.119
get better, in which case it's great.

01:05:42.519 --> 01:05:45.599
Sometimes it won't get better. And I

01:05:44.119 --> 01:05:46.920
will also show you, and I probably will

01:05:45.599 --> 01:05:48.880
run out of time, so I'll do

01:05:46.920 --> 01:05:50.119
it on Monday. I will also show you, hey,

01:05:48.880 --> 01:05:51.480
what if you want to do your own

01:05:50.119 --> 01:05:52.599
embeddings from scratch without using

01:05:51.480 --> 01:05:55.639
GloVe?

01:05:52.599 --> 01:05:57.599
So, all possibilities will be covered.

01:05:55.639 --> 01:06:00.159
Um yeah. So, to come back to this, this

01:05:57.599 --> 01:06:01.440
is the model we have. Um and then, all

01:06:00.159 --> 01:06:03.000
right.

01:06:01.440 --> 01:06:05.079
So, if we take a look at the first

01:06:03.000 --> 01:06:06.960
few embedding vectors, by the way, this

01:06:05.079 --> 01:06:09.119
model.layers

01:06:06.960 --> 01:06:10.240
will give you a list of all the layers,

01:06:09.119 --> 01:06:11.480
and then you

01:06:10.239 --> 01:06:13.039
can just grab any layer you want and

01:06:11.480 --> 01:06:14.079
look at its weights. Okay? It's very

01:06:13.039 --> 01:06:15.159
handy.
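
For instance, a sketch of that inspection (the index 1 for the embedding layer is assumed from the architecture above):

```python
# model.layers is a list of all the layers; get_weights() returns a layer's
# weight matrices as NumPy arrays.
embedding_weights = model.layers[1].get_weights()[0]
print(embedding_weights[:2])   # the pad and [UNK] rows: all zeros
print(embedding_weights[2:4])  # real vocabulary words: non-zero GloVe values
```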

01:06:14.079 --> 01:06:16.559
So, we're looking at the weights, and

01:06:15.159 --> 01:06:19.399
you can see here

01:06:16.559 --> 01:06:21.519
the first two vectors are all zeros

01:06:19.400 --> 01:06:22.920
because those stand for pad and unk, and

01:06:21.519 --> 01:06:24.880
then we have everything else. So,

01:06:22.920 --> 01:06:26.400
everything looks fine so far. And now,

01:06:24.880 --> 01:06:28.800
we just, you know, compile and fit it.

01:06:26.400 --> 01:06:30.039
So, as usual: Adam, cross-entropy,

01:06:28.800 --> 01:06:33.080
accuracy.

01:06:30.039 --> 01:06:34.880
Um and then, we'll just fit the model.

01:06:33.079 --> 01:06:36.000
All right.
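
A sketch of that compile-and-fit step; the dataset variable names and epoch count are assumptions, and whether the loss is the sparse or one-hot variant of cross-entropy depends on how the genre labels are encoded:

```python
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
history = model.fit(train_ids, train_labels,
                    validation_data=(val_ids, val_labels),
                    epochs=10)
model.evaluate(test_ids, test_labels)   # around 63% accuracy in the lecture run
```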

01:06:34.880 --> 01:06:38.599
It's going to take

01:06:36.000 --> 01:06:38.599
a few minutes.

01:06:39.000 --> 01:06:43.519
And while it's running, what you

01:06:41.358 --> 01:06:44.960
will see in this Colab is that

01:06:43.519 --> 01:06:46.440
uh in this particular case, the

01:06:44.960 --> 01:06:47.519
embeddings actually don't help a whole

01:06:46.440 --> 01:06:50.440
lot.

01:06:47.519 --> 01:06:50.440
Why do you think that is?

01:06:51.920 --> 01:06:54.639
Could it be because we're

01:06:52.920 --> 01:06:57.079
averaging a lot of stuff? Maybe that's

01:06:54.639 --> 01:06:58.400
hurting us.

01:06:57.079 --> 01:06:59.840
Yeah.

01:06:58.400 --> 01:07:01.960
Um I mean, I think that the embeddings

01:06:59.840 --> 01:07:03.559
were pre-trained on some corpus, right?

01:07:01.960 --> 01:07:05.358
Like Wikipedia or something like that

01:07:03.559 --> 01:07:06.599
that is a little bit

01:07:05.358 --> 01:07:08.599
different from the language we tend to

01:07:06.599 --> 01:07:09.599
use in song lyrics. So

01:07:08.599 --> 01:07:11.599
maybe its ability

01:07:09.599 --> 01:07:12.679
to sort of extract the

01:07:11.599 --> 01:07:13.319
meaning of

01:07:12.679 --> 01:07:16.679
a word like

01:07:13.320 --> 01:07:18.240
candy from a song lyric

01:07:16.679 --> 01:07:19.679
is limited, because it's

01:07:18.239 --> 01:07:20.919
thinking of all the other ways

01:07:19.679 --> 01:07:22.039
that word could be represented.

01:07:20.920 --> 01:07:23.680
Yeah, so there could be a mismatch

01:07:22.039 --> 01:07:26.119
between the corpus on which the

01:07:23.679 --> 01:07:27.719
pre-trained stuff was trained on versus

01:07:26.119 --> 01:07:29.480
the the corpus that you're working with

01:07:27.719 --> 01:07:31.679
right now. That's one big reason. The

01:07:29.480 --> 01:07:34.719
other reason is that we actually

01:07:31.679 --> 01:07:36.119
have 50,000 examples, basically.

01:07:34.719 --> 01:07:37.959
It's a lot of data.

01:07:36.119 --> 01:07:39.920
So, when you have a lot of data, you may

01:07:37.960 --> 01:07:41.519
not need any of these things.

01:07:39.920 --> 01:07:43.280
These things tend to do really well when

01:07:41.519 --> 01:07:46.159
you don't have a lot of data, which

01:07:43.280 --> 01:07:47.720
means you get to piggyback on

01:07:46.159 --> 01:07:49.960
what these embeddings have learned from

01:07:47.719 --> 01:07:52.159
all of Wikipedia.

01:07:49.960 --> 01:07:54.240
So, when you have a smallish data

01:07:52.159 --> 01:07:55.799
set, basically, the rule of thumb

01:07:54.239 --> 01:07:58.119
here is that when your data is really

01:07:55.800 --> 01:07:59.320
small, try to use a pre-trained model.

01:07:58.119 --> 01:08:01.119
Right? And that's what you saw with the

01:07:59.320 --> 01:08:03.200
handbags and shoes classifier, right? We

01:08:01.119 --> 01:08:04.960
had 100 examples of handbags and shoes,

01:08:03.199 --> 01:08:06.480
and we used ResNet to basically get

01:08:04.960 --> 01:08:08.358
to 100% accuracy.

01:08:06.480 --> 01:08:09.719
The same sort of logic applies here.

01:08:08.358 --> 01:08:11.519
All right. So,

01:08:09.719 --> 01:08:12.879
here, let's see what's happening. Uh

01:08:11.519 --> 01:08:15.480
okay, it's done.

01:08:12.880 --> 01:08:15.480
So, we'll plot.

01:08:16.000 --> 01:08:18.199
Right.

01:08:16.880 --> 01:08:21.279
Okay, this looks like a very

01:08:18.199 --> 01:08:23.479
well-behaved loss curve.

01:08:21.279 --> 01:08:23.479
Uh

01:08:25.640 --> 01:08:27.600
Okay.

01:08:26.479 --> 01:08:28.879
So,

01:08:27.600 --> 01:08:30.039
There doesn't seem to be any massive

01:08:28.880 --> 01:08:32.119
overfitting going on. They are moving

01:08:30.039 --> 01:08:35.279
really nicely in lockstep. Let's see

01:08:32.119 --> 01:08:35.279
what the accuracy is.

01:08:36.319 --> 01:08:40.798
Okay, 63%, which is not great, right?

01:08:39.520 --> 01:08:43.080
It's not as good as what we saw

01:08:40.798 --> 01:08:44.479
before when we used all 50,000 examples

01:08:43.079 --> 01:08:45.880
and just trained something from scratch,

01:08:44.479 --> 01:08:47.519
and that's just because in this case, we

01:08:45.880 --> 01:08:49.079
have lots of examples, so these pre-trained

01:08:47.520 --> 01:08:50.319
embeddings aren't, you know, as helpful

01:08:49.079 --> 01:08:52.439
as they could be.

01:08:50.319 --> 01:08:54.280
But if you have a small data set, they

01:08:52.439 --> 01:08:56.318
could be very helpful. And now, we go to

01:08:54.279 --> 01:08:58.239
what um

01:08:56.319 --> 01:08:59.319
he pointed out. Like, why can't we just,

01:08:58.239 --> 01:09:00.759
you know, optimize these embeddings,

01:08:59.319 --> 01:09:02.440
too? Why do we have to

01:09:00.759 --> 01:09:03.838
treat them as sacred?

01:09:02.439 --> 01:09:06.000
Let's just

01:09:03.838 --> 01:09:07.920
unleash backprop

01:09:06.000 --> 01:09:11.079
on them and see what happens.

01:09:07.920 --> 01:09:13.319
So, we'll do that. Um

01:09:11.079 --> 01:09:15.359
So, here, what we do is we retrain it,

01:09:13.319 --> 01:09:17.120
but here, we set trainable equals true

01:09:15.359 --> 01:09:19.240
for the embedding layer. Okay? This is

01:09:17.119 --> 01:09:20.960
the key step. Trainable equals true.

01:09:19.239 --> 01:09:23.279
Otherwise, it's unchanged.

01:09:20.960 --> 01:09:25.880
Uh and then,

01:09:23.279 --> 01:09:25.880
let's skip that.
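
As a sketch, the only change relative to the frozen setup would be the trainable flag on the embedding layer (variable names assumed):

```python
# Fine-tuning variant: same architecture, but the embedding layer is now
# allowed to change during backpropagation. Everything else is unchanged.
embedding_layer_ft = Embedding(
    input_dim=embedding_matrix.shape[0],
    output_dim=embedding_matrix.shape[1],
    embeddings_initializer=Constant(embedding_matrix),
    trainable=True,        # the key change: let backprop update the GloVe weights
    mask_zero=True,
)
```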

01:09:27.119 --> 01:09:31.439
We'll run it and see what happens. So

01:09:28.960 --> 01:09:33.079
before it was whatever 63% accuracy or

01:09:31.439 --> 01:09:35.399
something, we'll see if it gets better

01:09:33.079 --> 01:09:38.000
if you train the whole thing.

01:09:35.399 --> 01:09:40.399
And the thing is you can never be sure.

01:09:38.000 --> 01:09:41.279
Right? Because it may start to overfit.

01:09:40.399 --> 01:09:42.599
Uh which is why you just have to

01:09:41.279 --> 01:09:45.440
empirically see what's going on. There

01:09:42.600 --> 01:09:45.440
are no guarantees.

01:09:47.640 --> 01:09:50.079
Um all right, any questions while it's

01:09:48.960 --> 01:09:51.880
training?

01:09:50.079 --> 01:09:54.399
Yeah.

01:09:51.880 --> 01:09:56.480
In that first graph, where you have

01:09:54.399 --> 01:09:58.000
the training accuracy still increasing,

01:09:56.479 --> 01:10:00.479
that might suggest that you could use

01:09:58.000 --> 01:10:02.399
even more epochs. Correct. Exactly.

01:10:00.479 --> 01:10:03.879
Exactly. So in that curve,

01:10:02.399 --> 01:10:05.519
we saw that the training accuracy was continuing

01:10:03.880 --> 01:10:06.840
to increase. Typically what's going to

01:10:05.520 --> 01:10:08.720
happen is the training will continue to

01:10:06.840 --> 01:10:10.520
get better the more you train it. The

01:10:08.720 --> 01:10:12.560
key thing is: is the validation also

01:10:10.520 --> 01:10:13.880
improving? If the validation continues

01:10:12.560 --> 01:10:15.520
to improve, there is a little bit more

01:10:13.880 --> 01:10:17.720
gas left in the tank. You can keep

01:10:15.520 --> 01:10:19.120
training more. If it starts to flatten

01:10:17.720 --> 01:10:21.960
and even worse if it starts to go down,

01:10:19.119 --> 01:10:23.279
then you want to pull back.

01:10:21.960 --> 01:10:25.359
Yeah.

01:10:23.279 --> 01:10:27.359
So you had used max tokens to limit

01:10:25.359 --> 01:10:29.079
the vocabulary

01:10:27.359 --> 01:10:31.880
to the most common 5,000. And then the

01:10:29.079 --> 01:10:33.600
width of that was 100. What is the 100?

01:10:31.880 --> 01:10:34.680
The 100 is just the length of the GloVe

01:10:33.600 --> 01:10:37.440
vector.

01:10:34.680 --> 01:10:39.560
Does that mean that it can only capture

01:10:37.439 --> 01:10:41.319
how that word is related to 100 other

01:10:39.560 --> 01:10:43.760
words? No, no. Basically, we are

01:10:41.319 --> 01:10:45.920
saying that for every word, its intrinsic

01:10:43.760 --> 01:10:48.119
meaning can be captured using a vector

01:10:45.920 --> 01:10:49.800
of 100 dimensions.

01:10:48.119 --> 01:10:51.319
Those dimensions mean something. We

01:10:49.800 --> 01:10:53.880
don't know what they are. The first

01:10:51.319 --> 01:10:55.599
dimension could mean color. Second could

01:10:53.880 --> 01:10:57.760
mean some sort of location. The third

01:10:55.600 --> 01:11:00.760
could mean some sort of time of the

01:10:57.760 --> 01:11:00.760
year. We just have no idea.

01:11:01.359 --> 01:11:04.159
Okay, and then with the pre-trained model,

01:11:02.880 --> 01:11:05.640
we're not going to learn those; the

01:11:04.159 --> 01:11:07.119
pre-trained model has those

01:11:05.640 --> 01:11:08.960
already. We don't know what they are,

01:11:07.119 --> 01:11:10.000
but it has something. The people who

01:11:08.960 --> 01:11:10.960
created it don't know what they are

01:11:10.000 --> 01:11:13.439
either.

01:11:10.960 --> 01:11:15.840
All they know is that for each word they

01:11:13.439 --> 01:11:18.799
learned a 100-long vector.

01:11:15.840 --> 01:11:20.279
And that 100-long vector was able to

01:11:18.800 --> 01:11:21.520
kind of recreate the co-occurrence

01:11:20.279 --> 01:11:23.359
matrix.

01:11:21.520 --> 01:11:25.200
And then they probed it using that

01:11:23.359 --> 01:11:26.920
visualization of man, woman, sister,

01:11:25.199 --> 01:11:29.880
brother, all that stuff, and it seems to

01:11:26.920 --> 01:11:31.960
sort of fit with what you would expect.

01:11:29.880 --> 01:11:33.480
Can you think of it as analogous to uh

01:11:31.960 --> 01:11:35.640
when we did the convolutional ones? You

01:11:33.479 --> 01:11:37.639
have the number of kernels, right? So in

01:11:35.640 --> 01:11:39.520
this case, if you have 32 kernels,

01:11:37.640 --> 01:11:40.840
it's sort of like 32 things it can

01:11:39.520 --> 01:11:42.400
learn.

01:11:40.840 --> 01:11:43.880
I think that's actually a great analogy.

01:11:42.399 --> 01:11:46.399
I love it. That's a great way to

01:11:43.880 --> 01:11:48.079
think about it. Yes. Much like we got

01:11:46.399 --> 01:11:50.079
to decide how many filters to

01:11:48.079 --> 01:11:51.680
have, here we get to decide how long the

01:11:50.079 --> 01:11:53.880
embedding dimension needs to be and our

01:11:51.680 --> 01:11:55.280
hope is that the more things we are able

01:11:53.880 --> 01:11:57.880
to accommodate, the more complicated

01:11:55.279 --> 01:11:58.920
things it will pick up. Right? Uh at the

01:11:57.880 --> 01:11:59.920
same time, you don't want to have too

01:11:58.920 --> 01:12:01.920
many of these things because it's going

01:11:59.920 --> 01:12:03.640
to start picking up noise.

01:12:01.920 --> 01:12:05.840
That's never a

01:12:03.640 --> 01:12:06.840
good thing.

01:12:05.840 --> 01:12:07.880
Okay.

01:12:06.840 --> 01:12:09.880
Um

01:12:07.880 --> 01:12:10.840
Another question on this side?

01:12:09.880 --> 01:12:12.359
Yeah.

01:12:10.840 --> 01:12:13.159
Go ahead. My

01:12:12.359 --> 01:12:15.359
question is

01:12:13.159 --> 01:12:17.399
why do we use embeddings

01:12:15.359 --> 01:12:20.599
and not the actual uh

01:12:17.399 --> 01:12:23.000
co-occurrence matrix rows to

01:12:20.600 --> 01:12:25.079
represent words, right? Like why do we

01:12:23.000 --> 01:12:26.399
need to abstract it? Yeah, yeah, yeah.

01:12:25.079 --> 01:12:28.800
That's actually a

01:12:26.399 --> 01:12:30.399
good question. One

01:12:28.800 --> 01:12:33.600
immediate reason is that that row is

01:12:30.399 --> 01:12:35.679
500,000 entries long.

01:12:33.600 --> 01:12:37.280
Right? So you want a compact dense

01:12:35.680 --> 01:12:39.000
representation of a word.

01:12:37.279 --> 01:12:40.679
The second thing is that that row is

01:12:39.000 --> 01:12:43.760
subject to all the counts of the

01:12:40.680 --> 01:12:45.200
Wikipedia corpus. It's not normalized.

01:12:43.760 --> 01:12:47.400
So you need to normalize it so that if

01:12:45.199 --> 01:12:49.119
you take any two rows and do dot

01:12:47.399 --> 01:12:50.839
product, you will get some number which

01:12:49.119 --> 01:12:53.800
is sort of in a narrow range. Otherwise

01:12:50.840 --> 01:12:55.560
things don't become comparable.

01:12:53.800 --> 01:12:57.600
Now, both these objections can be

01:12:55.560 --> 01:12:59.080
handled. You can normalize, you can

01:12:57.600 --> 01:13:00.520
reduce the size of the corpus and so on

01:12:59.079 --> 01:13:01.640
and so forth. And in fact that used to

01:13:00.520 --> 01:13:03.400
be a very common way of doing

01:13:01.640 --> 01:13:04.560
it before.

01:13:03.399 --> 01:13:06.319
But what they have discovered is that

01:13:04.560 --> 01:13:07.680
the way we learn embeddings now

01:13:06.319 --> 01:13:10.119
tends to be much more effective in

01:13:07.680 --> 01:13:10.119
practice.

01:13:10.960 --> 01:13:16.199
So what we thought is

01:13:13.960 --> 01:13:18.159
what this process does is it

01:13:16.199 --> 01:13:21.559
creates this n-dimensional,

01:13:18.159 --> 01:13:23.920
incomprehensible matrix that captures

01:13:21.560 --> 01:13:25.840
in essence a summarized version of these

01:13:23.920 --> 01:13:28.359
relationships.

01:13:25.840 --> 01:13:30.440
Correct. A compact representation of

01:13:28.359 --> 01:13:33.159
relationships which is not subject to

01:13:30.439 --> 01:13:34.719
the size of your vocabulary.

01:13:33.159 --> 01:13:36.439
So you know, you have 500,000 words

01:13:34.720 --> 01:13:37.720
today, tomorrow somebody comes up with

01:13:36.439 --> 01:13:39.039
a word like selfie which didn't

01:13:37.720 --> 01:13:40.360
exist 5 years ago.

01:13:39.039 --> 01:13:42.279
And now your corpus has gotten a little

01:13:40.359 --> 01:13:43.920
bit bigger, right? So here it's very

01:13:42.279 --> 01:13:46.800
compact and it tends to have a much

01:13:43.920 --> 01:13:46.800
longer shelf life.

01:13:48.039 --> 01:13:52.760
Yeah.

01:13:49.279 --> 01:13:52.759
All right, so let's see where we are.

01:13:54.079 --> 01:13:58.159
Okay. So, evaluate.

01:13:59.079 --> 01:14:04.199
68, almost 69%. It was 63 and went to 69. So

01:14:02.039 --> 01:14:06.239
clearly here training the whole thing

01:14:04.199 --> 01:14:08.359
including GloVe actually helps. And

01:14:06.239 --> 01:14:11.239
so that sort of begs the question, well,

01:14:08.359 --> 01:14:13.279
if training GloVe helps,

01:14:11.239 --> 01:14:15.000
maybe we should actually train the whole

01:14:13.279 --> 01:14:16.920
thing from scratch.

01:14:15.000 --> 01:14:19.319
Like why the hell not, right? Why the

01:14:16.920 --> 01:14:21.239
heck not? I apologize.

01:14:19.319 --> 01:14:22.639
So uh what we'll do is we'll actually

01:14:21.239 --> 01:14:24.760
create our own embeddings and just train

01:14:22.640 --> 01:14:26.079
them. And here we don't have to worry

01:14:24.760 --> 01:14:27.560
about co-occurrence matrices and so on

01:14:26.079 --> 01:14:29.239
and so forth because we have a very

01:14:27.560 --> 01:14:30.840
specific objective. We want to be very

01:14:29.239 --> 01:14:32.279
accurate in predicting genre for these

01:14:30.840 --> 01:14:34.119
songs.

01:14:32.279 --> 01:14:35.159
The people who worked on

01:14:34.119 --> 01:14:36.479
GloVe,

01:14:35.159 --> 01:14:37.760
they didn't have a specific objective. They

01:14:36.479 --> 01:14:39.559
just wanted to create embeddings that

01:14:37.760 --> 01:14:41.640
were generally useful.

01:14:39.560 --> 01:14:43.520
Okay? Here we want to be specifically

01:14:41.640 --> 01:14:45.760
useful for genre prediction.

01:14:43.520 --> 01:14:48.680
And so what we can do is we can actually

01:14:45.760 --> 01:14:50.320
train the whole thing ourselves, right?

01:14:48.680 --> 01:14:51.320
We can just

01:14:50.319 --> 01:14:53.119
put an embedding

01:14:51.319 --> 01:14:55.039
layer here. You know, we just

01:14:53.119 --> 01:14:57.439
arbitrarily decided to choose 64 as the

01:14:55.039 --> 01:14:59.479
the dimension, as opposed to 100. It

01:14:57.439 --> 01:15:01.000
will run faster. And then it's the

01:14:59.479 --> 01:15:03.519
same thing. Global average pooling,

01:15:01.000 --> 01:15:07.079
activation, blah blah blah.

01:15:03.520 --> 01:15:07.080
and then you run it.
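
A sketch of that from-scratch variant, assuming the same 5,000-word vocabulary and 300-token inputs (exact names and settings are assumptions):

```python
# A randomly initialized 64-dimensional embedding trained end-to-end on the
# genre labels, with no GloVe involved.
inputs = keras.Input(shape=(300,), dtype="int64")
x = layers.Embedding(input_dim=5000, output_dim=64, mask_zero=True)(inputs)
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dense(8, activation="relu")(x)
outputs = layers.Dense(3, activation="softmax")(x)
scratch_model = keras.Model(inputs, outputs)
scratch_model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
```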

01:15:08.039 --> 01:15:11.920
We'll see if it finishes in the next

01:15:09.520 --> 01:15:11.920
minute.

01:15:12.760 --> 01:15:16.360
And we'll see if it actually does better

01:15:14.479 --> 01:15:17.359
than the pre-trained embeddings or the

01:15:16.359 --> 01:15:19.399
pre-trained embeddings that have been

01:15:17.359 --> 01:15:21.639
further fine-tuned. And I don't remember

01:15:19.399 --> 01:15:23.119
what I saw when I ran it yesterday.

01:15:21.640 --> 01:15:24.920
Uh and while it's running, other

01:15:23.119 --> 01:15:25.760
questions?

01:15:24.920 --> 01:15:28.000
Yeah.

01:15:25.760 --> 01:15:30.039
So my question is regarding embeddings.

01:15:28.000 --> 01:15:32.439
When we create an embedding for a particular

01:15:30.039 --> 01:15:33.920
word, we indicate that we have a certain

01:15:32.439 --> 01:15:35.359
number of parameters. Let's say in this

01:15:33.920 --> 01:15:36.920
case we defined

01:15:35.359 --> 01:15:37.759
100. So there will be 100

01:15:36.920 --> 01:15:40.079
parameters and there will be

01:15:37.760 --> 01:15:42.520
coefficient weights for each of them.

01:15:40.079 --> 01:15:43.279
So when we take a pre-trained model,

01:15:42.520 --> 01:15:45.520
right?

01:15:43.279 --> 01:15:47.840
The one we took, GloVe. So for each word

01:15:45.520 --> 01:15:49.640
there would already be that number of

01:15:47.840 --> 01:15:51.159
parameters in there. Yeah. So

01:15:49.640 --> 01:15:53.320
but then how do we redefine them? Is it

01:15:51.159 --> 01:15:54.880
that we want only 100, or can we want only 10

01:15:53.319 --> 01:15:56.519
parameters?

01:15:54.880 --> 01:15:59.239
You know, the GloVe thing actually

01:15:56.520 --> 01:16:01.520
comes pre-packaged to

01:15:59.239 --> 01:16:03.199
be 100 long. I think they have 200 and

01:16:01.520 --> 01:16:04.240
300 as well if I recall. We just

01:16:03.199 --> 01:16:05.840
happened to use the one with

01:16:04.239 --> 01:16:07.719
100.

01:16:05.840 --> 01:16:09.000
The one that is available in Google?

01:16:07.720 --> 01:16:10.159
Yeah, yeah. And there are many

01:16:09.000 --> 01:16:12.960
available. We just get to pick and

01:16:10.159 --> 01:16:13.840
choose, and I happened to pick 100.

01:16:12.960 --> 01:16:15.880
Uh

01:16:13.840 --> 01:16:17.560
Oh, it's okay. So it's a bit slow, but

01:16:15.880 --> 01:16:18.680
it's actually looking promising.

01:16:17.560 --> 01:16:21.080
Um

01:16:18.680 --> 01:16:23.159
9:55, yeah.

01:16:21.079 --> 01:16:24.960
So during the CNN model training in

01:16:23.159 --> 01:16:27.199
our assignments,

01:16:24.960 --> 01:16:29.800
changing the filters gave us more depth

01:16:27.199 --> 01:16:32.399
rather than an improvement in performance.

01:16:29.800 --> 01:16:33.640
So here would I be right in concluding

01:16:32.399 --> 01:16:34.839
that it's actually training the

01:16:33.640 --> 01:16:36.600
embeddings which is giving us more,

01:16:34.840 --> 01:16:37.400
assuming that epoch and batch settings

01:16:36.600 --> 01:16:39.400
are not

01:16:37.399 --> 01:16:42.000
changed much. So if I really want a

01:16:39.399 --> 01:16:43.279
genuine change in performance, we should go

01:16:42.000 --> 01:16:44.760
to the level of retraining the

01:16:43.279 --> 01:16:46.359
embeddings.

01:16:44.760 --> 01:16:48.520
Yeah, so what we saw was that using

01:16:46.359 --> 01:16:50.359
GloVe as is was okay. Using GloVe and

01:16:48.520 --> 01:16:51.320
then training them helped a lot. And now

01:16:50.359 --> 01:16:53.559
we are basically saying, well, what if

01:16:51.319 --> 01:16:55.639
we just abandon GloVe and train our own

01:16:53.560 --> 01:16:57.760
embeddings for our particular problem.

01:16:55.640 --> 01:16:59.160
See, GloVe is a general-purpose tool.

01:16:57.760 --> 01:17:00.640
A general-purpose tool is really good

01:16:59.159 --> 01:17:01.920
if you don't have a lot of data

01:17:00.640 --> 01:17:03.119
as a starting point. But when you

01:17:01.920 --> 01:17:04.560
have a lot of data, you should always

01:17:03.119 --> 01:17:05.680
try to do your own thing and see if it's

01:17:04.560 --> 01:17:07.400
any better.

01:17:05.680 --> 01:17:09.480
And in this case, I

01:17:07.399 --> 01:17:10.719
well, whoa. Okay, I think it's

01:17:09.479 --> 01:17:13.759
uh

01:17:10.720 --> 01:17:13.760
Come on, it's 9:55.

01:17:14.439 --> 01:17:17.919
The button is going to enter any moment

01:17:15.640 --> 01:17:17.920
now.

01:17:21.399 --> 01:17:24.639
Right, let's just look at the thing.

01:17:25.880 --> 01:17:30.279
Okay, folks. So 74%, 72%.

01:17:29.079 --> 01:17:31.840
So you can actually train your own

01:17:30.279 --> 01:17:33.279
thing because you have 50,000 examples, and you

01:17:31.840 --> 01:17:36.480
get an even better result. Thanks a

01:17:33.279 --> 01:17:36.479
lot. Have a good rest of the week.
