
7: Deep Learning for Natural Language – Transformers

MIT OpenCourseWare · May 11, 2026
Transcript ~13292 words · 1:16:37
0:16
So, all right. So, transformers, even
0:18
though they were originally invented for
0:20
machine translation, right, going from
0:22
English to German and German to French
0:24
and so on and so forth,
0:25
they have turned out to be an incredibly
0:27
effective deep neural network
0:29
architecture for just really a vast
0:32
array of domains. It has reached a point
0:34
where if you're actually working on
0:36
a particular problem, you almost
0:37
reflexively will try a transformer
0:39
first because it's probably going to be
0:40
pretty darn good.
0:42
Okay? So, they have just taken over
0:45
everything.
0:46
Um and obviously they've
0:48
transformed translation, which is the
0:50
original sort of target, uh Google
0:52
search, really information retrieval,
0:54
completely transformed speech
0:55
recognition, text-to-speech, even
0:57
computer vision. Even the stuff that we
0:59
learned with convolutional neural
1:00
networks, now there are transformers for
1:03
computer vision problems that are
1:04
actually quite good.
1:06
Right?
1:07
Um which is kind of shocking because
1:08
they were not even designed for that.
1:10
Um and then, you know, reinforcement
1:12
learning. And of course, all the crazy
1:14
stuff that's going on with generative
1:15
AI, large language models, multimodal
1:17
models, everything everything runs on a
1:20
transformer.
1:21
Okay? Uh and then there are numerous
1:23
special purpose systems
1:25
and I find these to be even more
1:27
interesting.
1:28
Um you know, like AlphaFold, the protein
1:30
folding AI, runs on a transformer
1:32
stack.
1:33
Okay? And I could just list examples one
1:35
after the other.
1:36
So, it's just amazing. It's incredibly
1:38
uh flexible architecture.
1:40
Um and I think we are lucky to be alive
1:43
during a time when such a thing was
1:44
invented.
1:47
And I'm not getting paid to tell you any
1:48
of this stuff.
1:50
All right, it's just amazing. Okay. So,
1:52
let's get going. We will use search um
1:55
or more broadly information retrieval as
1:57
a motivating use case. So, these are all
1:59
examples where people are typing in
2:00
natural language queries or uttering
2:02
natural language queries into a phone
2:03
and we need to sort of make sense of
2:05
what they want. And it's not like, you
2:07
know, write me a limerick about deep
2:08
learning where there could be many
2:10
possible right answers. It's more like,
2:12
okay, tell me all the flights that are
2:14
leaving from Boston going to
2:15
LaGuardia tomorrow morning between 8:00
2:16
and 9:00. Well, you better get it right.
2:19
Okay? Accuracy is a high bar.
2:21
So,
2:22
um or, you know, how many customers
2:23
abandoned their shopping cart? Find all
2:24
contracts that are up for renewal next
2:26
month. Uh you know, tell me all the
2:28
customers who ended the phone call to
2:30
the call center yesterday not entirely
2:32
pleased with the transaction. Right? The
2:34
list goes on and on. And so, in
2:37
particular, we'll focus on this
2:38
travel-related example today. Okay? Uh
2:40
find me all flights from Boston to
2:42
LaGuardia tomorrow morning, right? That
2:44
kind of query.
2:45
Um and so, in these sorts of use cases,
2:48
a very common approach historically has
2:50
been, well, we will take this, you know,
2:53
natural language query
2:55
and then we will convert it into a
2:57
structured query. By that I mean we will
3:01
parse the query and we'll extract out
3:03
key things in that query. Once we
3:05
extract out those key things, we will
3:07
reassemble it into a structured query,
3:09
like a SQL query, right? Uh SQL is just
3:12
one example of a possible structured
3:14
query. There are many many ways to
3:15
structure queries.
3:17
But SQL is sort of familiar to lots of
3:18
people, so I'm using that. So, you take
3:20
the SQL. Once you have the SQL query,
3:23
you're in a very comfortable structured
3:25
land, in which case you just run the
3:27
query through some database that you
3:28
have, get the results back, format it
3:30
nicely, and show it to the user.
3:32
Right? That's the flow.
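To make that flow concrete, here is a minimal sketch of the structured-query step once the entities have been pulled out. The slot names follow the ATIS-style convention we'll see in a moment, and the flights table and its columns are made up for illustration.

```python
# A minimal sketch (not the lecture's actual pipeline): once the entities are
# extracted from the natural-language query, building a structured query is
# just string/dict manipulation. Table and column names are hypothetical.
extracted = {
    "fromloc.city_name": "BOS",
    "toloc.city_name": "LGA",
    "depart_date.relative": "tomorrow",
    "depart_time.period_of_day": "morning",
}

sql = (
    "SELECT * FROM flights "
    f"WHERE origin = '{extracted['fromloc.city_name']}' "
    f"AND destination = '{extracted['toloc.city_name']}' "
    "AND depart_date = DATE('now', '+1 day') "
    "AND depart_time BETWEEN '06:00' AND '12:00';"
)
print(sql)
```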
3:34
So, the question becomes
3:36
um
3:37
how do we automatically extract all the
3:40
travel-related entities from this query?
3:43
Right? We want to be able to extract
3:45
BOS, LGA, tomorrow, morning, flights, so
3:49
on and so forth. These are all the
3:50
travel-related entities we want to
3:51
extract out, right? That's the problem.
3:54
And so,
3:56
we will use a really cool data set
3:58
called the Airline Travel Information
3:59
System (ATIS) data set and I'll explain the
4:01
data set in just a bit. We'll
4:02
use this as the basis for this example.
4:05
And so, the way we think about it is
4:07
that
4:08
we have a whole bunch of queries in
4:10
this data set.
4:12
And fortunately for us, the researchers
4:14
who compiled this data set,
4:16
they went through every one of these
4:18
queries, right? And we have, you know,
4:20
several thousands of them. They went
4:22
through every one of those queries and
4:24
they manually tagged each word in the
4:26
query
4:28
with what kind of travel entity it is
4:31
or none of them, right? So, for
4:33
instance, so they call them
4:35
slots. So, they will take each word in
4:37
the query and assign it to a slot, a
4:39
particular kind of slot, and I'll
4:41
explain what slot means in just a
4:42
second. Okay? That's the basic idea. So,
4:45
so, for example, if you have something
4:47
like I want to fly from
4:49
Okay? And this is a flight database, so
4:52
you can assume that everything is
4:53
related to flights and flying. So, if you
4:56
have all these words, I want to fly
4:57
from,
4:58
each of these words, these five words,
5:00
gets mapped to something called the O,
5:02
which means other.
5:04
It's the other slot, right? We don't
5:06
really care about it. It's the other
5:07
slot.
5:09
And then we come to Boston.
5:11
Oh, Boston is very special, right?
5:13
Because, you know, it's clearly a
5:15
departure city. So, we actually tag it,
5:18
we assign it this label. Think of it as
5:20
just like a classification problem,
5:21
right? A multi-class classification
5:23
problem. So, we assign it to
5:26
B-fromloc.city_name.
5:29
Okay? That is the label you assign it.
5:31
Okay?
5:32
And then you go to at. You don't care
5:34
about at. It's O, other. You come to
5:37
7:00 a.m.
5:38
And then, okay, that is depart time. So,
5:41
depart time and then another depart
5:43
time. And here you see there is a B and
5:45
then there is an I.
5:47
Right? So, what we are saying
5:49
here is that there could be entities that
5:51
are described using more than one word.
5:54
Like 7:00 a.m., right? Two tokens.
5:57
And for that, we need to be able to
5:58
figure out, okay, the second token is
6:00
really
6:01
part of the first token. Together,
6:03
they define the notion of a departure
6:05
time. So, what the B means is that
6:08
this is the token in
6:10
which we are beginning the idea of a
6:12
departure time. And then I means we are
6:15
in the middle of this description.
6:17
B is for beginning.
6:19
So,
6:21
you can see here. So, there is a B here
6:23
and there is an I. B for beginning, I
6:25
for intermediate or in the middle.
6:27
Um and then at, we don't care. 11:00 B
6:31
arrive time.
6:33
Boop boop boop. Morning arrive time
6:35
period.
6:38
So, this is an example of how you can
6:40
take a sentence and then manually label
6:43
every word in the sentence with
6:45
something that's relevant to your
6:46
particular problem.
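In code, one manually tagged example looks something like this. A minimal sketch: the point is the one-label-per-token pairing, and the exact label strings are just illustrative of the B/I/O convention.

```python
# A small sketch of an ATIS-style training example after manual tagging:
# one slot label per token, using the B-/I-/O convention described above.
tokens = ["i", "want", "to", "fly", "from", "boston", "at", "7:00", "am"]
slots  = ["O", "O", "O", "O", "O",
          "B-fromloc.city_name",
          "O",
          "B-depart_time.time", "I-depart_time.time"]

assert len(tokens) == len(slots)   # one label per word, always
for tok, slot in zip(tokens, slots):
    print(f"{tok:10s} -> {slot}")
```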
6:50
And
6:51
turns out, with these researchers,
6:54
every word is classified into one of 123
6:56
possibilities.
6:59
Okay? Um so, aircraft code, airline
7:02
code, airline name, airport code,
7:04
airport name, arrival date, relative
7:07
name. Now, you get the idea.
7:08
They want a round trip versus a one-way.
7:11
The relative dates, because if
7:13
somebody says tomorrow morning, it's
7:14
relative to today, so you need a notion
7:16
of absolute time and you need a
7:17
notion of relative time.
7:19
So, these researchers basically thought of every
7:20
possibility. And
7:23
so, every word in every one of these
7:25
queries is assigned one of these 123
7:27
labels.
7:32
Any questions on the setup?
7:36
Um
7:39
Did they have to contextualize what
7:42
comes before, let's say, Boston? So,
7:44
if someone says from
7:46
Boston, there should be
7:47
contextualization of the from with
7:49
Boston. So, because they did it
7:50
manually, they could just read it and
7:52
figure out what they mean,
7:54
right? That Boston is the departure
7:55
city and not the arrival city. So, do
7:57
they have two tags for Boston, which is
7:59
some like, you know, departure city as
8:01
well as arrival city for the
8:03
word Boston? In that particular phrase,
8:05
it's clear from that particular
8:07
case, in context, as a human
8:08
reading it, that Boston is a departure
8:10
city. So, it only gets that tag. In
8:13
that sentence. In some other sentence
8:15
where people are coming into Boston,
8:16
it'll have a different tag.
8:21
I was wondering, what if my query, unlike the
8:23
others, basically has two parts? Like, for
8:25
example, if my query was
8:27
give me flights from Boston at 7:00 a.m.
8:29
and
8:29
uh the
8:31
flights from Denver at 11:00 a.m.
8:33
You mean like a compound query? Yeah.
8:35
So, this one only takes single queries
8:37
into account.
8:39
Because most people are like, you know,
8:40
give me a flight from here to there. Or
8:42
what is the cheapest thing from here to
8:43
there? And we'll see examples of queries
8:45
later on.
8:50
Okay.
8:51
Uh all right. So, that's that's the
8:52
deal.
8:53
So, basically, you
8:56
know,
8:58
uh this problem that we have here is
8:59
really a
9:02
word-to-slot multi-class classification
9:04
problem.
9:06
Okay?
9:07
Um because if you look at that
9:09
input, we want to be able to take that
9:10
input and a really good model will then
9:12
give you this as the output.
9:17
Right? Because this is what a human
9:18
would have done.
9:20
So, that is our problem. Okay?
9:23
So, the question is
9:25
um the key thing here is that each
9:27
of the 18 words in this particular
9:29
example must be assigned to one of 123
9:32
slot types, right? Each word. It's not
9:34
like we take the entire query and
9:36
classify the entire query into one of
9:38
123 possibilities. Every word in the
9:40
query has to be classified.
9:42
That is the wrinkle.
9:45
Okay?
9:46
So, now, if we could run the query
9:49
through a deep neural network and
9:51
generate 18 output nodes,
9:54
it goes through some unspecified deep
9:55
neural network. And when it comes out
9:57
the other end, the output layer has 18
9:59
nodes.
10:00
Okay?
10:01
Because that is the
10:03
dimension of the
10:04
output that we care about. 18 in, 18
10:06
out, right?
10:09
And then for each one of those 18 nodes,
10:11
maybe we could attach a 123-way softmax
10:15
to each of those 18 outputs.
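In Keras-style code, that per-token softmax head might look like the sketch below. The sizes are placeholders, the encoder in the middle is left unspecified, and the actual Colab for this lecture may be organized differently.

```python
# A minimal sketch of the "18 in, 18 out" idea: every token position gets its
# own 123-way softmax. Vocab size, embedding size, and sequence length are
# placeholder values; the encoder in the middle is a stand-in.
import tensorflow as tf
from tensorflow.keras import layers

max_len, vocab_size, num_slots, emb_dim = 18, 1000, 123, 100

inputs = layers.Input(shape=(max_len,), dtype="int32")       # token ids
x = layers.Embedding(vocab_size, emb_dim)(inputs)            # (max_len, emb_dim)
# ... some encoder goes here (self-attention layers, etc.) ...
outputs = layers.Dense(num_slots, activation="softmax")(x)   # (max_len, 123)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```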
10:20
By the way, isn't it cool that we can
10:21
just casually talk about sticking a
10:23
123-way softmax onto each one of the 18
10:25
nodes?
10:27
Folks, wake up.
10:31
You're not easily impressed. I'm
10:32
impressed by that.
10:34
So, okay.
10:37
So, here's the key thing,
10:39
right? We want to generate an output
10:41
that has the same length as the input.
10:45
But the problem is the inputs could be
10:47
of different lengths as they come in.
10:48
They could be short sentences, long
10:50
sentences, we don't know, right?
10:52
Yet we need to accommodate this range
10:55
this variable size of input that's
10:56
coming in.
10:58
But the key thing is the output has to
10:59
be the same length as the input, the same
11:00
cardinality as the input.
11:02
Okay, that's a one big requirement.
11:05
In addition, we want to take the
11:07
surrounding context of each word into
11:08
account, right? To go to Ronak's
11:10
question, when you see the word Boston,
11:12
you can't conclude whether it's a
11:14
departure city or arrival city.
11:15
You have to look at what else is going
11:17
on around it. Is there a from? Is there
11:19
a to? Things like that to figure out
11:21
how to tag it. So, clearly the
11:22
context matters.
11:24
And then we clearly have to take the
11:25
order of the words into account.
11:28
Going from Boston to LaGuardia is very
11:29
different than going from LaGuardia to
11:30
Boston.
11:31
So, clearly the order matters.
11:33
Right? So, the context matters and the
11:35
order matters. And the output has to be
11:37
the same length as the input.
11:40
Okay?
11:42
So, context matters, right? Just a few
11:44
fun examples.
11:45
Remember from the last week that the
11:47
meaning of a word can change
11:48
dramatically depending on the context.
11:50
And we also saw that the standalone or
11:53
non-contextual embeddings that we saw
11:55
last week, like GloVe, um
11:58
you know, they don't take context into
11:59
account because they give a single
12:01
unique embedding vector to every word.
12:04
And if a word ends up having lots of
12:05
different meanings, that vector is kind
12:07
of some mushy average of all those
12:09
meanings.
12:11
Okay. So,
12:13
the word see. I will see you soon. I
12:15
will see this project to its end. I see
12:16
what you mean. Very different meanings
12:18
of the word see. This is my favorite,
12:20
bank.
12:21
Uh I went to the bank to apply for a
12:23
loan. I'm banking on the job. I'm
12:24
standing on the left bank. And so on. Uh
12:27
it. Oh, this is actually
12:29
a good one. The animal didn't cross
12:31
the street because it was too tired. The
12:33
animal didn't cross the street because
12:34
it was too wide.
12:37
Can you imagine
12:39
a deep neural network looking at this
12:40
word it and trying to figure out what
12:42
the heck does it word it mean?
12:44
What is it referring to?
12:46
Tricky, right?
12:48
Um and then, you know, if you take the
12:50
word station, and I have the station
12:52
example here because we're going to use
12:53
it a bit more for the rest of the lecture.
12:55
You know, the station could be
12:57
a radio station, a train station, being
12:59
stationed somewhere, the International
13:00
Space Station. The list goes on.
13:03
So, clearly order matters. I mean,
13:04
context matters.
13:05
And
13:08
clearly order matters. You can come up
13:10
with your own examples. Let's keep
13:12
moving.
13:13
Okay?
13:15
So, the Transformer architecture
13:18
is a very elegant
13:20
architecture
13:22
which checks these three boxes
13:23
beautifully.
13:25
Okay?
13:26
Um it takes the context into account,
13:27
order into account, and then, you know,
13:29
whatever is produced out there
13:32
is the same length as whatever is coming
13:33
in.
13:34
And the reason it's called the
13:35
Transformer
13:36
is because if 10 things come in,
13:39
10 things go out, but the 10 things that
13:41
go out are a transformed version of the
13:43
10 things that came in.
13:46
That's why it's called the Transformer.
13:47
Okay?
13:48
If 10 things came in and like one thing
13:50
goes out, well, sure, it's been
13:52
transformed, but what is it? It's some
13:54
weird thing. But when 10 comes in and 10
13:56
goes out, the 10 is preserved. Each
13:58
one is getting transformed in an
13:59
interesting way.
14:01
That's why it's called the Transformer.
14:04
So, developed in 2017, just dramatic
14:07
impact.
14:08
So, by the way, the effect of the
14:09
Transformer, um
14:11
Google had spent a lot of research effort on
14:13
machine translation and obviously
14:15
search. Uh and then when the Transformer
14:17
was invented, uh they took a model called
14:20
BERT, which we will uh see on Wednesday
14:22
in detail, and then they introduced BERT
14:25
into their search, and the results were
14:28
dramatic.
14:29
And from what I've read, apparently the
14:32
impact of doing that was significant.
14:34
Typically, when you make an improvement
14:35
to search, the improvement is very, very
14:37
marginal because it's already a very
14:38
heavily optimized system.
14:40
And then when the Transformer thing came
14:42
along, there was actually a significant
14:43
jump in search quality. So, for example,
14:46
and you can actually read this blog post
14:48
uh which came out when they introduced
14:49
BERT into search. It gives you a bit
14:51
more detail. But here, so if
14:54
you were querying something like uh you
14:56
know,
14:57
"Brazil traveler to USA needs a visa."
15:00
Right? You would think that it
15:02
should give you information about how to
15:03
get a visa if you're a Brazilian wanting to
15:04
come to the US, right? Uh but it turns
15:06
out the first result was how US citizens
15:09
going to Brazil can, you know,
15:11
get a visa.
15:13
So, clearly it's not taking the order
15:14
into account.
15:16
Uh but once they introduced it, boom,
15:19
the first thing was the US Embassy in
15:20
Brazil.
15:21
And a page on how to get a visa.
15:24
So, the effect was dramatic.
15:26
And so, this is a seminal paper,
15:30
right? And it's actually worth reading
15:31
the paper. And uh you
15:34
know, this picture is
15:35
like an iconic picture at this point
15:38
in the deep learning community. And we
15:39
will actually understand this picture
15:41
by the end of Wednesday.
15:43
Um and so, but the funny thing is that
15:45
when the researchers came up with it,
15:46
they didn't realize, in some sense, like
15:48
what they had stumbled on uh because
15:50
they were really focused on machine
15:51
translation.
15:53
It's only the rest of the research
15:54
community that took it and started
15:55
applying it to everything else and found it
15:56
to be really, really effective.
15:59
Okay. So, we're going to take each one
16:01
of these things and figure out how to
16:02
address them and thereby build up the
16:04
architecture.
16:05
Any questions before I continue?
16:07
Yeah.
16:11
Is there any uh
16:13
benefit to discarding some of those
16:16
unclassified nodes before it goes out
16:18
rather than going like you have 18 words
16:21
input, discarding all the ones that
16:23
don't actually matter and just doing
16:24
like eight for your output?
16:26
Yeah, yeah. I think that's a totally
16:28
fine way to think about it. Basically,
16:29
what you're saying is that can we have a
16:31
two-stage model? The first-stage model
16:33
is like a O non-O classifier. And the
16:35
second-stage model only goes after the
16:37
non-Os. That's a totally fine way to do
16:38
it.
16:39
Yeah.
16:40
But as you can see, even if you
16:41
go with just a simple one-stage
16:43
model, if you use a Transformer, you get
16:44
fantastic accuracy.
16:47
And we'll do the Colab in a bit.
16:50
Uh all right. So, let's take the first
16:52
thing. How do you take the
16:53
context of everything around the word
16:55
into account?
16:56
So,
16:59
so let's say that this is the
17:01
sentence we have. The train slowly left
17:03
the station.
17:04
Okay? For each of these words,
17:06
we can calculate a standalone embedding,
17:09
say something like GloVe.
17:11
Okay? So, I'm just depicting these
17:13
standalone embeddings using these uh
17:15
you know, thingies here.
17:18
Please appreciate them because it took
17:19
me a while to get them to do in
17:20
PowerPoint.
17:22
Okay? So, these are W1 through W6. These
17:24
are the vectors standing up. Okay?
17:27
Um now, let's say that So, we can easily
17:29
do that.
17:30
Now, what we want to figure out is we
17:32
want to focus on the word station.
17:34
And since station could mean very
17:36
different things in different contexts,
17:37
we want to figure out how do we actually
17:39
take
17:40
station's embedding and contextualize it
17:43
using all the other words that are going
17:45
on in that sentence.
17:46
Okay? Clearly, it's a train station.
17:49
So, we need to take the fact that there
17:50
is a train involved to alter the
17:53
embedding of the word station. Right?
17:55
That's what taking context into account
17:56
actually means.
17:58
So,
17:59
how can we modify station's embedding so
18:03
that it incorporates all the other
18:04
words? That's the question.
18:07
Okay?
18:08
So, when you look at it this way,
18:11
imagine just for a moment,
18:14
just for a moment,
18:15
that
18:16
we
18:17
Now, some of the other words in the
18:18
sentence don't matter. The word the
18:20
probably doesn't matter.
18:22
But some of the other words like train,
18:24
slowly, left probably do matter.
18:26
And suppose, just magically, we have
18:29
been told
18:30
for all the other words in the sentence,
18:32
this is how much weight you have to give
18:34
to them. These don't give it any weight.
18:36
Those give it a lot of weight. Okay?
18:38
Suppose we are told that.
18:39
Or to put it another way, and this
18:41
is the word that's heavily used in the
18:42
literature,
18:44
someone tells you how much attention to
18:46
pay to the other words.
18:47
Whether you got to pay it a lot of
18:48
attention or very little attention.
18:50
Okay?
18:51
And this
18:52
how much attention to pay is given in
18:54
the form of a weight that you can use.
18:55
Okay? So,
18:57
um
18:58
if you look at it that way, from this
19:00
notion of which word should I give a lot
19:01
of weight to and very little weight to,
19:04
in this example, intuitively, which
19:05
words do you think should get the most
19:06
weight and which words do you think
19:07
should get the least weight?
19:09
Yeah. Train.
19:11
Train. Right.
19:12
Time matters.
19:13
Uh
19:14
you can do one at a time.
19:16
Train. Okay, thank you.
19:18
Uh
19:18
okay. Others?
19:21
Slowly.
19:22
Slowly. Right. So, that also seems to
19:23
have some bearing on it. What about
19:25
words that we don't
19:27
think are going to
19:28
help at all?
19:31
The. The. Exactly. It probably doesn't
19:33
do much here. Some context it actually
19:35
might make a difference, but in this
19:37
sentence, maybe not.
19:38
Right? Intuitively.
19:40
So,
19:42
we should probably give a lot of weight
19:43
to train, maybe a little to slowly and
19:45
left, and hardly anything to the.
19:47
Okay?
19:49
And so, this intuition that we have
19:52
can be written numerically as maybe we
19:56
have a bunch of weights that add up to
19:58
one.
20:00
Okay?
20:02
Okay, maybe something like this. So, we
20:03
are saying: train, 30% weightage,
20:07
maybe 8% weightage to left, maybe 12%
20:11
weightage to slowly, uh and then as you
20:14
will see here,
20:15
the station's own embedding also plays a
20:17
role. Because we want to take its own
20:20
standalone embedding and just move it
20:22
slightly, change it slightly, which
20:23
means that has to be the starting point.
20:26
So, it will get a lot of weight. We
20:28
can't ignore itself, in other words.
20:30
Right? So, we give it maybe 40% weight.
20:33
By the way, these numbers I just made
20:34
them up.
20:35
Okay? Uh yeah.
20:38
I'm sorry, it's a quick question. So,
20:40
the weights
20:43
are they
20:44
Are they standalone for the
20:46
context of the entire sentence or are
20:48
they related to station that we started
20:50
off with? These six numbers are
20:54
only pertinent to station.
20:56
And for each word, we're going to do
20:57
something similar.
20:59
Yeah.
21:01
And at this point, does the model
21:03
understand order? Because like I'm just
21:05
thinking of like left because like I
21:07
gave it a
21:08
very low weight.
21:09
But let's say left
21:11
comes slowly, leave left station. The
21:14
station only have the two be higher.
21:15
Yeah, correct. So, at this point, we are
21:18
not worrying about order. We are only
21:20
worrying about context.
21:22
Later, we'll take order into account.
21:24
But how does the model know that left
21:25
here is of lesser importance because
21:28
it's a verb rather than a
21:31
It's It has to figure it out.
21:33
We don't It doesn't We We are just
21:34
giving it a whole bunch of capabilities.
21:36
How it manifests those capabilities is
21:38
all going to emerge from training.
21:42
Okay. So, all right. So, let's say we
21:45
have something like this. So, what we
21:46
can do,
21:48
right? And we'll get to the
21:49
all-important question of where do we
21:50
get these numbers from in just a moment.
21:51
But suppose you had the numbers,
21:54
how can we use these numbers to
21:56
contextualize W6? What can we do?
22:00
What is the simplest thing you can do?
22:05
You have W6, you want to make it a new
22:07
W6, which is now contextual, is aware of
22:10
what else is going on. Okay?
22:17
It's working now, I think.
22:20
We can take a weighted average. Exactly.
22:22
Exactly. So, when you have a bunch of
22:23
things and you have a bunch of weights
22:25
and, you know, we
22:26
have to somehow modify one of those
22:27
things with those weights, the simplest
22:29
thing you can do is to take a weighted
22:30
average.
22:31
Right? So, that's exactly what we're
22:33
going to do.
22:34
So, we're going to take all these
22:35
weights
22:37
and just like move them up.
22:39
Okay?
22:40
Move them up.
22:42
Don't even get me started on how long it
22:44
took me to get this arrow to run.
22:46
I don't know about you, folks. Is it
22:47
It's extremely painful to get the U-turn
22:49
arrows to work in PowerPoint.
22:51
Okay?
22:52
Anyway, uh back to work. So,
22:54
so we just move these up here, okay? So,
22:57
now we can do 0.05 * this vector + 0.3 *
23:01
that vector and so on and so forth.
23:03
And the result is just another vector.
23:06
Right?
23:08
And that vector, folks,
23:11
is the contextual embedding vector of
23:13
station.
23:15
Okay? That was the standalone embedding.
23:17
And now we multiplied this by
23:19
that, that by whoop whoop whoop, add them
23:21
all up, and then you get a new vector.
23:24
And contextual embeddings have this
23:27
bluish kind of color.
23:29
Okay?
23:30
And I'll maintain that color scheme as
23:32
we go along.
23:33
So, that's it.
23:36
That's it. That's the idea.
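Here is that weighted-average step as a tiny numerical sketch, with random vectors standing in for the standalone embeddings and the made-up weights from the slide.

```python
# A tiny numerical sketch of the weighted-average step. Each standalone
# embedding is a short random vector here; in practice they would be
# GloVe-style vectors. The weights are the made-up numbers from the slide.
import numpy as np

rng = np.random.default_rng(0)
words = ["the", "train", "slowly", "left", "the", "station"]
W = rng.normal(size=(6, 4))          # standalone embeddings w1..w6 (dim 4)

weights = np.array([0.05, 0.30, 0.12, 0.08, 0.05, 0.40])  # sums to 1
w6_hat = weights @ W                 # weighted average = contextual "station"
print(w6_hat)
```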
23:38
Any questions?
23:41
Yeah.
23:43
How did you come up with the original
23:44
weights again? You just kind of guessed?
23:46
No, these weights I just
23:49
hand typed them in manually just to make
23:51
the point. And And now I'm going to talk
23:53
about how we are actually going to
23:54
calculate them.
23:57
Okay.
23:58
Uh all right, cool. So, now I'm going to
24:00
uh okay, enough pictures. Let's switch
24:03
to some math. So,
24:05
so basically, let's write it
24:07
a bit more formally.
24:08
So, we have these W1 through W6, which
24:11
are the standalone embeddings.
24:12
And then for station, we want to
24:14
calculate, you know, W6 with a little
24:16
hat on it, which is the contextual
24:17
embedding. And the way we do it is to
24:19
say we calculate some weights for each
24:22
of these words. So, this weight S16
24:25
means that the weight
24:27
of the first word on the sixth word,
24:30
which happens to be station.
24:32
The The weight of the second word on the
24:33
sixth word, and so on and so forth. And
24:35
so, what we are saying is that W6 is
24:38
just, you know, this weight times W1,
24:40
this time W whoop whoop whoop,
24:41
that's it.
24:43
Okay?
24:45
I have to inflict all these, you know,
24:47
subscripts and all that because
24:48
you know, we need it.
24:51
All right. So, that's it.
24:53
That's what we have.
24:56
Now, let's talk about Okay, any
24:58
questions on the mechanics of it
25:00
before I get to Okay, where do these
25:01
weights come from?
25:02
Yeah.
25:06
Utilizing something like Google, for
25:08
example, like how does it understand
25:11
like the context of
25:12
new words
25:13
and context like
25:16
process immediately through the training
25:18
data the users played or
25:20
like basically
25:21
>> like a totally new word that didn't
25:22
exist before? A new word or a new
25:24
context to a word that already exists.
25:27
No, I think that the context is supplied
25:29
because the query coming into something
25:31
like Google is a full sentence.
25:33
And we only take that sentence and take
25:35
only the sentence into account as the
25:36
context for us.
25:37
So, the context is always present to us
25:40
when we get the input.
25:41
But the other question you had uh of
25:44
Okay, what if there's a brand new word
25:45
you've never seen before, for which
25:46
there is not even a standalone
25:47
embedding? What do you do then?
25:49
So, let's punt on that till Wednesday
25:51
because I have to talk about something
25:53
called byte pair encoding and stuff like
25:55
that before I can answer that.
25:57
And And really quickly, does that
25:59
immediately translate to their
26:00
predictive search queries?
26:03
Utilizing like verb
26:06
Yeah, a new word, for example.
26:08
Does that automatically get applied to
26:10
the predictive search queries like when
26:12
we're saying how to and then just home?
26:14
Oh, you mean like the auto complete?
26:15
You know, auto complete uses a slightly
26:17
different mechanism.
26:18
Um I They had a very complicated
26:20
non-transformer thing for a long time.
26:23
I'm sure they have a transformer version
26:24
now, but I don't I'm not privy to how
26:26
exactly they've done it. So, I don't
26:28
quite know how they do it. But what
26:29
you're proposing is a reasonable way to
26:31
think about it.
26:33
Yeah.
26:34
Um my question is like we have six
26:36
words, station and but number parameters
26:39
as in weights, let's say 10 of them.
26:41
And then we have calculated the
26:43
contextual version of W6. Yeah. So, this
26:46
has a different parameter or it remains
26:48
the same? It replaces. Okay.
26:50
Yeah, W becomes W6 becomes W6 hat.
26:54
Okay. And how we are expecting
26:57
Right.
26:58
This contextual word will be really
27:00
good. That's what we want.
27:07
Do we lose that
27:08
or retain it? No, we lose it. And as you
27:11
will see here, as it flows through the
27:12
transformer, it's getting more and more
27:14
and more contextualized.
27:16
So, it's a left-to-right flow.
27:20
All right. Uh all right, great. So, the
27:22
By the way, this thing that we did for
27:23
station, we will do it for each word in
27:25
the sentence.
27:27
The same exact logic. Obviously, the
27:30
weights are going to change.
27:31
Okay? But what will happen is that W1
27:34
through W6 will become W1 hat through W6
27:37
hat.
27:39
The same exact logic is going to hold.
27:41
Okay? That's what I just don't have the
27:43
slides for it because it's a waste of
27:44
time.
27:45
The same exact logic is going to hold.
27:47
All right. Now, switch gears
27:48
and and answer the all-important
27:50
question of where are the weights going
27:51
to come from.
27:52
Okay? So, the intuition here is really
27:54
really interesting and elegant.
27:56
So, clearly the weight of a word
27:59
should be proportional to how related it
28:02
is to the word station.
28:04
Right?
28:06
The word train clearly is very related
28:08
to the word station.
28:09
The word the, it's not clear how
28:11
related it is. Probably not all that
28:12
related. So, the relatedness matters to
28:15
the weight. More related, higher the
28:17
weight, right? Just intuitive.
28:19
So, one way to quantify how related two
28:21
words are is to take their standalone
28:23
embeddings and calculate the dot
28:25
product.
28:28
Okay? So, um
28:30
in case folks have
28:33
sort of forgotten about the dot product,
28:39
Oops, that's not what I want.
28:42
So, um, let's say you
28:44
have a vector.
28:50
Okay, let's say this is the
28:51
vector for
28:52
train.
28:55
This is the vector for station.
28:59
Okay? So, the dot product of these two
29:01
vectors,
29:05
I'll write it as
29:09
train · station
29:12
equals
29:13
basically the length
29:17
of
29:20
the vector for train
29:23
times the length
29:26
of the vector for station
29:30
times the cosine
29:33
of the angle between them.
29:36
Okay?
29:38
Okay?
29:42
So, how long is each vector?
29:45
Product of the two and then the angle
29:46
between them. Okay? Now, let's assume
29:48
for simplicity that these lengths are
29:50
roughly the same.
29:52
They're just one unit length. Okay? Just
29:54
roughly.
29:55
So, if you assume that,
29:57
okay? This thing, let's say, becomes
30:01
becomes one, let's say.
30:03
Okay?
30:05
This thing becomes one.
30:07
So, all the action
30:09
is here.
30:11
Okay?
30:12
So, all the action is here.
30:14
So, basically, the dot product of these
30:15
two vectors is really the cosine of
30:17
angle between them.
30:20
So, now, the question is, if you have
30:22
something like this,
30:27
right? Which are very close to each
30:28
other, the cosine of a very small angle,
30:31
actually, the cosine of zero is what?
30:34
One.
30:35
So, if the angle is really, really
30:37
small, the cosine is going to be very
30:39
close to one.
30:40
Right? Because the cosine
30:41
of zero is one. So, this thing is going
30:43
to be, you know, pretty close to one.
30:46
If you have a cosine of two vectors that
30:49
are like this, 90° apart, what is the
30:51
cosine?
30:52
Zero. They're orthogonal, right? Which
30:55
maps to the English orthogonal.
30:58
So, the cosine of that is zero.
31:00
And then, if you have something like
31:01
this,
31:03
where they're literally pointing in
31:04
opposite direction,
31:07
what is the cosine of that 180?
31:09
Minus one.
31:11
So, that's it. So, if
31:13
these two vectors are
31:14
very close to each other,
31:16
the cosine of the angle between them is
31:18
going to be very close to one. If they
31:19
are really kind of unrelated, it's going
31:21
to be zero. If they're anti-related,
31:22
it's going to be minus one.
31:24
Right? So, that's how dot products
31:27
capture this notion of closeness or
31:28
relatedness.
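A quick numerical check of that intuition. The vectors below are made up and normalized to unit length, so the dot product is exactly the cosine of the angle.

```python
# For (roughly) unit-length vectors, the dot product is just the cosine of
# the angle between them. The example vectors are made up for illustration.
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

train   = unit(np.array([0.9, 0.4, 0.1]))
station = unit(np.array([0.8, 0.5, 0.2]))
the     = unit(np.array([-0.2, 0.1, 0.95]))

print(train @ station)   # close to 1: small angle, highly related
print(train @ the)       # near 0: roughly orthogonal, unrelated
print(train @ -train)    # -1: pointing in opposite directions, anti-related
```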
31:30
Okay?
31:31
So, all right. Um iPad.
31:36
So, we can use the dot product of these
31:37
embeddings to capture relatedness.
31:40
And so, okay, iPad done.
31:43
So, now that
31:45
we know that dot products can be used,
31:48
we can't use them as is because we need
31:49
to do one more thing to make them proper
31:51
weights. And what I mean by proper
31:53
weights is that we want the weights
31:55
to be, first of all, non-negative, and
31:58
we want them to add up
31:59
to one, right? That's what a
32:00
weighted average actually is going to
32:01
mean.
32:02
But these cosines could be negative.
32:05
Right? And so, we need to now adjust
32:07
them to make them proper so that every
32:08
one of them is guaranteed to be
32:10
non-negative and they will add up to
32:11
one.
32:12
When was the last time you had to take a
32:14
bunch of numbers, which could be
32:15
anything, and then somehow make sure
32:18
that they are going to be positive,
32:20
non-negative, and they add up to one?
32:22
When was the last time?
32:23
Yeah, softmax. Exactly. So, we'll do the
32:25
same trick.
32:27
So, what we'll simply do is we'll just,
32:29
you know, exponentiate them, right? So,
32:32
like this W1 W6, this angle bracket
32:35
thing is the dot product. That's the
32:36
notation I'm using. EXP of that is just
32:39
you exponentiate them, e raised to that.
32:41
And once you exponentiate them, they all
32:42
become non-negative, and then we just
32:44
divide each by the sum of everything.
32:46
So, the whole thing will become like
32:47
a probability, right? It'll just add up
32:48
to one.
32:50
Make sense? So, that's how we take
32:52
arbitrary numbers and make them proper
32:53
weights.
32:56
All right.
32:59
So,
33:01
to summarize,
33:02
from embeddings to contextual
33:04
embeddings, that's what we do.
33:05
We take all the stand-alone embeddings,
33:08
we calculate these weights using this
33:09
formula, and then we just do the
33:11
weighted average, and we arrive at the
33:12
contextual embedding, and boom, done.
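Put together as code, this parameter-free self-attention step is only a few lines. A sketch with random embeddings; real standalone embeddings would be GloVe-style vectors.

```python
# Parameter-free self-attention: for every word i, the weight on word j is
# the softmax over j of the dot product <w_i, w_j>, and the contextual
# embedding is the weighted average of all the standalone embeddings.
import numpy as np

def self_attention(W):
    """W: (num_words, dim) standalone embeddings -> contextual embeddings."""
    scores = W @ W.T                                      # all pairwise dot products
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # each row sums to 1
    return weights @ W                                    # weighted averages

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))        # "the train slowly left the station"
W_hat = self_attention(W)
print(W_hat.shape)                 # (6, 4): six in, six out, same dimension
```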
33:16
Okay?
33:17
And so, by choosing weights in this
33:20
manner, the embedding of a word gets
33:22
dragged closer to the embeddings of the
33:24
other words in proportion to how related
33:26
they are. So, just imagine for a second,
33:29
right? In this case, station obviously
33:30
has many contexts, but let's assume for
33:31
a second that it only has the train context
33:33
and the radio station context.
33:35
Okay?
33:37
In the current context, train is closely
33:39
related to station, and therefore exerts
33:40
a strong pull on it.
33:42
Right?
33:43
Now, radio is also related to station,
33:45
but it doesn't appear in the
33:47
sentence.
33:48
So, effectively, it has a weight of
33:49
zero.
33:52
Okay? And that's the beauty of it. And
33:55
And please do not ask me things like,
33:56
you know, I I was listening to a great
33:58
song on the radio station and the train
33:59
pulled out of the station.
34:01
Okay? Transformers can deal with stuff
34:03
like that. Okay? But yeah, but you get
34:05
the idea, the main idea.
34:07
So, by moving station closer to
34:09
train,
34:11
by paying more attention to train, we
34:13
are contextualizing the word station's
34:15
embedding to the context of trains,
34:18
platforms, departures, tickets, and so
34:20
on. It's like this portal into the whole
34:22
train world.
34:25
Right? It's beautiful. This simple idea
34:27
will get you there.
34:30
Okay?
34:31
So, this, folks, is called
34:33
self-attention.
34:36
What we just described is called
34:37
self-attention.
34:39
And it's the key building block of
34:41
transformers.
34:42
Okay? Um and so, to
34:44
summarize, stand-alone embeddings come
34:46
in, contextual embeddings go out.
34:50
Any questions?
34:52
Uh yeah.
34:54
Uh I'm still struggling a little bit
34:56
with the intuition of the word
34:58
contextual embedding. So, like the
35:00
weight of station in the station
35:02
embedding, how how should I think about
35:03
that? It seems intuitive that it would
35:05
be high for all contextual embeddings,
35:07
but I assume that's not the case.
35:12
It'll be high. It'll typically be a
35:13
high number because the cosine of
35:15
the vector to itself is going to be
35:17
one, right? So,
35:19
it's going to be pretty high, but it
35:20
there's no guarantee it's going to be
35:21
the highest.
35:22
Right? Because the
35:24
length doesn't actually have to be one. They
35:26
could be anything. We try to keep them kind of
35:28
smallish, but they don't have to be.
35:30
Uh so, the way I would think about it is
35:31
imagine that you take an average of
35:33
everything else first, and then you
35:35
average it with the old
35:37
embedding.
35:38
Effectively, it's the same as just
35:39
calculating the different weights and
35:40
averaging the whole thing together.
35:42
Sure.
35:44
So, why should you say that the
35:45
embedding of a word would be the same
35:47
number but same place? But is this the
35:50
reason why you need a contextual
35:52
embedding?
35:53
But even if it's like a
35:55
other word
35:56
and it's not related, that's what
35:59
I'm saying. Correct. Correct. Exactly.
36:01
Exactly. And the other thing to remember
36:02
is that
36:04
by keeping the original input size, sort of
36:07
the input cardinality, intact
36:09
as you move through the transformer
36:10
stack,
36:11
when you finally come out the other end,
36:12
there is sort of no loss of information.
36:14
And in the very end, you can choose to
36:16
aggregate, simplify, summarize, and so
36:18
on and so forth. It preserves your
36:19
optionality as long as possible.
36:23
Do you know
36:25
how long the contextual
36:27
embedding is?
36:28
Is that a factor between the
36:29
two?
36:31
You know
36:33
Yeah, so, what we do is the sentence
36:34
comes in. There's a whole notion of
36:35
something called a context window, or
36:37
what is the sort of the maximum length
36:39
that these sentences will handle, and
36:40
that's a parameter you can set. And
36:42
we'll come to that when you actually
36:43
look at the Colab.
36:44
Um
36:46
Was that a question in the middle? No.
36:48
Okay.
36:49
All right. So, that is self-attention.
36:53
Um and now,
36:55
because that's felt too easy,
36:58
we're going to do a little tweak called
37:00
multi-head attention.
37:02
So,
37:03
this is the self-attention we
37:04
just saw.
37:06
What we can do is we can be like, you
37:07
know what?
37:08
Why can't we have more than this? Why
37:10
can't we have more than one of these?
37:12
So, this is called an attention head,
37:13
self-attention head. We'll have multiple
37:16
self-attention heads. Okay?
37:18
Now, and I'll come back to the top thing
37:20
in a second, okay? So, the question
37:22
is, why should we have multiple
37:23
self-attention heads?
37:25
Because a particular attention head is
37:26
going to pick up some patterns. The
37:28
reason is because
37:30
it'll help us attend to the multiple
37:32
patterns that may be present in a single
37:34
sentence.
37:35
So far, when I've been explaining, uh
37:37
I've sort of basically been looking at
37:38
what the meaning of these words are.
37:40
Just the meaning of these words. But in
37:42
any complicated sentence, you have to
37:44
worry about grammar, you have to worry
37:45
about tense, you have to worry about
37:47
tone. You have to worry about facts
37:49
versus, you know, opinions. There could
37:51
be any number of complicated patterns
37:53
that are sitting in a simple sentence.
37:55
Which means, well, there is just not one
37:57
way to pay attention. There could be
37:59
many ways of paying attention, many sorts
38:02
of needs to pay
38:03
attention. Right?
38:05
Which means, let's have many
38:07
of these attention heads.
38:09
And each one could be learning something
38:10
else. It's exactly like having lots of
38:12
filters in a convolutional network.
38:14
Right? Uh one filter might learn a line,
38:16
another filter might learn a curve, and
38:17
so on and so forth. And we don't want to
38:19
decide a priori, oh, you're going to
38:21
learn a line, right? Similarly here,
38:22
we're not telling any of these things
38:23
what you have to learn. They just have
38:25
to learn based on the training process.
38:27
So, what we do is
38:28
So, actually, this is an example
38:30
from the original transformer
38:32
paper, where the sentence is the lawyer
38:35
will Sorry, the law will never be
38:37
perfect, but its application should be
38:39
just. This is what we are missing, in my
38:43
opinion.
38:44
A complicated sentence, right? So, the
38:46
first attention head, actually, this
38:48
is the pattern of things it picks up.
38:50
So, for example, the word perfect here,
38:53
the contextual embedding of the word
38:54
perfect
38:57
draws upon heavily from the word law
39:00
in this example.
39:01
Okay?
39:02
If you look at another attention head,
39:04
the contextual embedding for the word
39:06
perfect is actually drawing heavily from
39:07
just perfect and nothing else. Right?
39:11
And if you look at other words, the
39:13
patterns are subtly different of what
39:14
it's paying attention to.
39:17
So, these are two different attention
39:18
heads, and they're learning different
39:20
kinds of attentions.
39:21
Okay? In reality, trying to make sense
39:24
of why they
39:25
pay attention the way they do, it's
39:27
usually quite sort of difficult to
39:29
figure that out. You can't actually
39:30
interpret it. But when you have lots of
39:32
attention heads, the performance on the
39:34
task that you care about gets really
39:35
much better.
39:37
Right? And then you're saying, okay, I
39:39
can use that. Uh yeah.
39:40
That's the
39:42
I think that's the idea behind this. Is
39:43
that the idea behind this?
39:49
Right.
39:50
Exactly. Same logic. Same logic.
39:53
Yeah.
40:13
Actually in the convolutional case, the
40:15
ones and zeros I had were just example
40:17
numbers to show that that particular
40:19
filter could detect a vertical line or
40:21
horizontal line. You will recall that
40:23
when we actually train a convolutional
40:24
network, we actually don't specify the
40:26
numbers. We start with randomly
40:27
initialized weights and then we let
40:30
backpropagation figure it out.
40:32
Similarly here, we don't decide any of
40:34
these things. We just let back prop
40:35
figure it out.
40:37
Okay? And now the question of what are
40:39
the weights that are actually going to
40:40
be learned? We'll come to that in a
40:42
bit.
40:43
Okay? Uh yeah.
40:47
Uh I was wondering how come we have
40:50
different attention heads even though
40:53
uh it seems like they're only a function
40:55
of a dot product and we have the same
40:57
dot product for the same embeddings.
40:59
Great question. Great question. And I
41:01
literally have a note in my slide
41:02
saying, "If a student asks this good
41:04
question, tell them to wait till
41:06
Wednesday."
41:08
So, great question. And we'll come back
41:10
to that uh on Wednesday and spend a fair
41:12
amount of time on it. So, uh
41:14
the point that's being made here
41:17
is that oops.
41:19
When we look at self-attention,
41:22
the embeddings came in and we did all
41:24
these dot products and the contextual
41:26
things popped out the other end. Note
41:28
that inside the self-attention box,
41:30
there are no parameters.
41:32
There are no parameters.
41:34
So, the question that is being raised
41:36
here is that so what are we learning
41:38
really? If there is nothing inside to be
41:40
learned, if there are no parameters, no
41:42
coefficients, what are we learning?
41:43
That's the question. And by extension,
41:46
if we have two of these and neither of
41:48
them is learning anything, what's the
41:49
point?
41:52
Sadly, you have to wait till Wednesday.
41:55
Okay? But we have a great answer to the
41:57
question. So,
41:58
it'll be worth it. And if you can't
42:00
stand the suspense, read the book.
42:03
All right. So, that is uh that's why we
42:05
need multiple heads. Okay? And now to
42:07
come back to this, so what we do is it
42:09
goes through this head and you get these
42:11
W's, right? And it goes through here and
42:13
we get another set of W's.
42:15
Then what we do at the very end is we
42:17
concatenate them.
42:19
Okay? We concatenate them and we do a
42:21
projection. And this is what I mean by
42:23
that.
42:29
So, we have
42:30
uh this this is one self-attention head,
42:33
self-attention one.
42:35
This is self-attention two.
42:38
And let's say that
42:41
W1 hat comes out.
42:44
And I'm just going to call it Z1 for
42:47
the same thing so that there's no name
42:48
clash.
42:49
Okay? And uh the W2, W6, all of them are
42:52
coming, right? Let's focus on W1 and Z1.
42:55
W1 and Z1 are both contextual embeddings
42:57
for the same word.
42:59
Okay? For the first word, word one. And
43:01
so what we do is let's say this is W1 uh
43:04
let's say this vector is like
43:06
this. Okay?
43:07
And let's say that this vector is like
43:10
this.
43:12
What I mean when I say concatenated here
43:14
is we literally take
43:16
um this word here,
43:18
this embedding here, then we take this
43:20
thing here.
43:23
Okay? And we just make it a long vector.
43:25
We concatenate it. But now this vector
43:27
has become twice as long, right?
43:30
But remember, we always want to
43:32
preserve the number of inputs
43:34
we have and the lengths of these vectors
43:36
everywhere as we go along. So, what we
43:39
do is at this point, we run it through
43:42
a single dense layer
43:44
which will take this thing and make it
43:46
back into the same small shape as
43:48
before.
43:50
So, this is a dense layer.
43:54
That's it. So, this vector comes in
43:56
and it becomes it gets compressed back
43:58
to the original shape that came out of
44:00
here.
44:01
So, you could have like 20 of these uh
44:03
attention heads
44:04
and the concatenated vector will be 20 times as
44:06
long and then just project, boom, one
44:08
dense layer comes back to the original
44:09
shape.
44:12
So, that's that is the projection step.
44:16
And that's what I mean here when I say
44:17
concatenate and project.
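Shape-wise, concatenate-and-project for two heads looks like the sketch below. The head outputs and the projection matrix are random here, since in reality the projection would be learned.

```python
# "Concatenate and project" for two attention heads: the per-word outputs of
# the heads are concatenated (twice as long), then a dense projection brings
# them back to the original embedding size. Everything is random here just
# to show the shapes; the projection matrix would be learned in training.
import numpy as np

rng = np.random.default_rng(1)
num_words, dim = 6, 100

Z1 = rng.normal(size=(num_words, dim))        # pretend: output of head 1
Z2 = rng.normal(size=(num_words, dim))        # pretend: output of head 2

concat = np.concatenate([Z1, Z2], axis=-1)    # (6, 200): twice as long
W_proj = rng.normal(size=(2 * dim, dim))      # the dense "projection" layer
out = concat @ W_proj                         # back to (6, 100)
print(out.shape)
```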
44:20
So, at this point, what we have is
44:21
things come in, we contextualize them
44:23
using these different attention heads,
44:25
and when they come out of the attention
44:27
heads, we take them all, we just like
44:29
concatenate them, and then compress them
44:31
back to the same original starting
44:32
shape. Right? If these vectors are 100
44:35
units long or 100 dimension long,
44:37
whatever comes out is 100 still.
44:39
And preserving this
44:42
size as we go along is very important
44:43
for reasons that'll become apparent a
44:44
bit later.
44:46
Okay. So, that is the multi-attention
44:49
thing.
44:50
Now, a final tweak for today
44:53
is that we will inject some
44:55
non-linearity
44:57
with some dense ReLU layers
44:59
at the very end. So, we went through a
45:01
bunch of attention heads. We came up
45:03
with a bunch of contextual embeddings
45:04
now.
45:05
So, at this point so far,
45:07
since there are no
45:08
parameters inside these boxes,
45:10
uh
45:11
right? And there are some parameters
45:13
here.
45:13
We need to add some non-linearity. So
45:15
far, there's been nothing that's
45:16
non-linear. So, here we actually
45:18
send it through one or more ReLUs.
45:21
Typically, they just use one ReLU. So,
45:24
and what I mean by that
45:34
Sorry.
45:37
So, this is what we had here and then
45:41
we take it in
45:46
and then run it through
45:50
actually
45:54
we typically run it through
45:57
a ReLU.
45:58
This is a nice ReLU.
46:01
Okay? And the rule of thumb,
46:03
as you will see, is if, let's say, this
46:04
vector is 100 dimensions long, they
46:06
typically will choose a ReLU which is
46:08
about 400
46:10
wide. And then it just gets projected
46:12
out again back to 100.
46:16
So,
46:17
this is just a simple, you know, the
46:20
input comes in, goes through a single
46:21
hidden layer with four times as
46:23
many units as here, and then it
46:26
projects through another dense layer
46:28
to 100 again.
46:29
And since there are ReLUs here,
46:32
we have injected some
46:33
non-linearity into the processing.
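That position-wise feed-forward piece, sketched with the rule-of-thumb sizes from above: 100 dimensions in, a ReLU hidden layer about four times wider, then back to 100.

```python
# A sketch of the position-wise feed-forward step: each 100-dimensional
# contextual embedding goes through a ReLU hidden layer roughly 4x wider,
# then gets projected back to 100 dimensions. Sizes are the lecture's
# rule-of-thumb values, nothing canonical.
import tensorflow as tf
from tensorflow.keras import layers

dim = 100
ffn = tf.keras.Sequential([
    layers.Dense(4 * dim, activation="relu"),   # 100 -> 400, injects non-linearity
    layers.Dense(dim),                          # 400 -> 100, back to original size
])

x = tf.random.normal((1, 6, dim))   # batch of 1 sentence, 6 words, 100 dims
print(ffn(x).shape)                 # (1, 6, 100): same shape in, same shape out
```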
46:35
Okay? Now,
46:37
a lot of this stuff when it came out
46:39
felt very ad hoc.
46:41
Right? It didn't come from some deep,
46:43
you know, theoretical motivations.
46:45
But and people had strong intuitions as
46:47
to why these things were helpful. And as
46:49
it turns out, since the transformer came
46:51
out, people have tried to optimize every
46:53
aspect of this thing.
46:55
It's actually pretty difficult to beat
46:56
the starting architecture.
46:58
Right? Improvements have been made, but
47:00
it's actually a very robust architecture.
47:02
So,
47:03
so that's what's going on here. And then
47:05
when we come out of this thing,
47:08
this is what we have, the story so far.
47:10
We start with random standalone
47:13
embeddings. This could be
47:14
GloVe embeddings, it could be random
47:15
weights, doesn't matter. It goes through
47:18
a bunch of self-attention heads. We
47:19
concatenate it when it comes out the
47:21
other end.
47:25
And then we project it back
47:27
to the same size as before. Then we run
47:29
it through, you know, a ReLU followed by
47:31
a linear layer and we get these things
47:33
again. So, in this whole process, if six
47:36
things came in, six things will come
47:37
out. And if those six things
47:40
that came in
47:41
were embedding standalone embedding
47:43
vectors of 100 dimensions, what comes
47:45
out is also 100 dimensions.
47:47
So, in that sense, you could think of
47:48
this whole thing as a black box in which
47:50
whatever you send in, the same number of
47:52
things will come out of the same length.
47:54
The numbers will be different because
47:56
they will have been heavily
47:56
contextualized.
47:58
The numbers are much smarter, in other
48:00
words.
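As a sketch of that black box using Keras built-ins: multi-head attention followed by the dense/ReLU step, with the same number of vectors and the same dimension coming out as going in. The built-in attention layer already contains the learned pieces we'll get to on Wednesday, and a real block also adds things like residual connections and layer normalization, so treat this as the story so far only.

```python
# The "black box" so far: multi-head self-attention, then the dense/ReLU
# feed-forward, preserving the number of vectors and their dimension.
import tensorflow as tf
from tensorflow.keras import layers

dim, num_heads = 100, 4

inputs = layers.Input(shape=(None, dim))                   # any number of words
x = layers.MultiHeadAttention(num_heads=num_heads,
                              key_dim=dim // num_heads)(inputs, inputs)
x = layers.Dense(4 * dim, activation="relu")(x)
outputs = layers.Dense(dim)(x)

block = tf.keras.Model(inputs, outputs)
print(block(tf.random.normal((1, 6, dim))).shape)          # (1, 6, 100)
```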
48:02
So, so far what we have seen is that we
48:04
have satisfied two of the three
48:05
requirements. We have taken the context
48:08
of each word into account
48:09
by using these dot products in the
48:11
self-attention layer, and we can
48:12
generate an output that is the same
48:13
length as the input, but we have ignored
48:15
word order
48:17
completely.
48:19
Okay? Because whether I had said the
48:21
train slowly left the station or I had
48:23
said the the station slowly left the
48:25
train,
48:26
this thing won't know the difference.
48:30
Because dot products
48:32
function on sets, not on sequences. They
48:34
function on sets.
48:36
Okay? You should
48:37
convince yourself of this. Regardless of
48:39
the order, the dot product calculation
48:40
doesn't change anything.
48:42
Because we are doing every pair.
48:46
Okay? So, the question is how do we take
48:48
the order of the words into account? Um
48:50
right. As I was saying, we can scramble
48:52
the order of the words in a sentence and
48:53
we'll get the exact same contextual
48:54
embeddings at the end.
48:55
So, by the way, if you're working on a
48:57
problem in which the order doesn't
48:58
matter,
49:00
then you can stop right now and use the
49:01
transformer.
49:04
And there are many problems that are
49:05
actually in that category where the
49:06
order doesn't matter. So, if you take
49:08
traditional structured data, right? Uh
49:10
tabular data,
49:12
uh you know, blood pressure, cholesterol
49:14
level, boom boom boom. Does it predict
49:15
heart disease? Well, there is no order
49:17
in that thing. You can use the
49:18
transformer as is without doing anything
49:20
more.
49:22
So, transformers work for both sets and
49:24
sequences where order matters.
49:27
Okay. So, the fix for this is something
49:29
called the positional encoding.
49:32
Um
49:33
so what we do is very simple. There are
49:34
many things that have been
49:36
invented um to give a
49:40
transformer some information
49:42
about the order of each of the things
49:44
that are coming in.
49:45
I'm going to go with something called
49:46
the, you know,
49:47
the simplest possible way which actually
49:49
works pretty well in practice. So, what
49:51
we do is
49:52
for each position
49:55
each possible position in the input
49:56
starting from the first position all the
49:58
way through the last position
50:00
we imagine that that position itself is
50:02
a categorical variable.
50:05
Right? If a sentence can only be 30
50:07
words long, let's say, we say that hey,
50:10
the position of each word is a number
50:11
between 0 and 29.
50:14
And so, we can just think of it as a
50:16
categorical variable.
50:17
And because it's a categorical variable, we
50:20
can just imagine an embedding
50:22
for each potential value. So, it'll
50:24
become clear in just a moment because I
50:25
have a numerical example.
50:27
And so, what we do is we will just take
50:28
that standalone embedding and then we'll
50:30
take this position embedding
50:32
which represents the position of the
50:33
word in the sentence, we just add them
50:35
up.
50:36
Okay? Uh yeah.
50:39
So, if
50:40
in the initial sentence itself, I have a
50:43
mistake, so I just write it as the train
50:45
slowly the station.
50:48
So, which means my output is actually
50:49
going to be wrong. Yes.
50:52
Now, the transformers, since they're
50:53
trained on lots of data,
50:55
they will be quite robust to these
50:57
things.
50:58
But strictly speaking, arithmetically,
51:00
yes.
51:02
Um okay. So, here's let's look at an
51:05
example.
51:06
Let's assume that
51:08
um
51:09
your standalone embeddings, right? This
51:11
is your vocabulary, okay?
51:13
Unknown, cat, mat, I, sit, love, the,
51:15
you, on. That's it. That's our
51:17
vocabulary.
51:18
And for this vocabulary, we have these
51:20
standalone embeddings.
51:22
And just for argument, let's assume
51:23
these embeddings are only two long.
51:26
Okay? The dimension of these embeddings
51:27
is two.
51:28
If you recall the GloVe embeddings we
51:30
used last week, I think they were what?
51:31
100 long?
51:33
And the ones we're using in the homework
51:34
are even longer than that.
51:35
Um but here we are assuming they're only
51:37
two long, okay? So, the embedding for
51:39
cat is 0.5, 7.1.
51:42
All right. Now, let's assume that we
51:45
can have at most 10 words in any
51:47
sentence that's coming in.
51:49
And obviously, a particular word could
51:50
be in position 0 all the way through
51:52
position 9.
51:53
And we will learn embeddings for each of
51:56
these positions, and these embeddings
51:57
are also two long.
51:59
Two units long. Dimension two.
52:03
Okay?
52:04
Now, where will these embeddings come
52:06
from?
52:07
What's the answer to that question? What
52:09
is the answer to the general question of
52:10
where will these weights come from?
52:14
We will learn it with backprop.
52:18
Okay?
52:20
We will start initially with random
52:21
numbers and then we'll make
52:23
them better and better
52:24
over the course of training.
52:26
So, what we do is we have these two
52:28
tables
52:29
of embeddings.
52:30
Um the standalone embedding for the word
52:32
and the position embedding.
52:34
And then, we literally add them up.
52:37
So, for example, let's say the word the
52:39
sentence that came in is cat sat mat.
52:41
That's the sentence. It's got three
52:43
words, cat sat mat. So, what we do is we
52:46
say, well, the embedding for cat is this
52:49
thing here, 0.5, 7.1.
52:51
So, I write it here: 0.5, 7.1.
52:53
Cat happens to be in the zeroth position
52:55
of the sentence.
52:56
So, I grab the embedding for position zero, which
52:58
is 1.3, 3.9. I stick it there, and then
53:01
I literally add them up: 0.5 + 1.3 = 1.8,
53:04
and 7.1 + 3.9 = 11.0. That's it.
53:07
So, now the positional encoded embedding
53:10
for the word cat is 1.8, 11.0, not 0.5,
53:15
7.1.
53:18
So, if cat happens to show up in another
53:20
part of the sentence, let's say instead
53:22
of cat sat mat, we had
53:25
mat sat cat.
53:28
Now, cat is in the third position,
53:29
position 2 (counting 0, 1, 2), which means
53:33
its standalone embedding doesn't change. It's just
53:34
the embedding for cat, but now instead
53:36
of picking zero, we'll pick this one,
53:38
0.6, 8.1, and put that here and add them
53:40
up instead.
53:43
So, this is the idea of the positional
53:45
encoding.
53:46
This is how we inject position knowledge
53:48
into the transformer.
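As a quick sanity check of the arithmetic, here is a tiny sketch using the made-up numbers from the slide (only cat's embedding and positions 0 through 2 are shown; everything else is omitted).

    import numpy as np

    cat = np.array([0.5, 7.1])        # standalone embedding for "cat", position-independent
    pos = np.array([[1.3, 3.9],       # learned embedding for position 0
                    [6.3, 3.7],       # learned embedding for position 1
                    [0.6, 8.1]])      # learned embedding for position 2

    print(cat + pos[0])               # cat in position 0 -> [ 1.8 11. ]
    print(cat + pos[2])               # cat in position 2 -> [ 1.1 15.2]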
53:52
Yes.
53:54
Um
53:55
the positional embedding would be
53:56
different for each sentence, right? How
53:58
do you No, this is just one table which
54:00
tells you what the position is.
54:01
So, it says for a word that appears
54:04
in the seventh position in any input
54:06
sentence that you're feeding in,
54:08
this is the embedding that you need to
54:09
use
54:11
for that position.
54:16
If the word appears twice in the same
54:19
sentence, how do
54:21
Great question. So, let's say, just
54:23
for argument, that the
54:25
sentence was cat cat cat.
54:27
So,
54:29
for each one of those
54:31
cats,
54:32
this standalone embedding will be the same,
54:34
0.5, 7.1, because that happens to be
54:36
just the embedding for cat regardless of
54:38
position.
54:39
But then, the first cat
54:42
for the first cat, we will use 1.3, 3.9
54:45
as the addition. For the second cat,
54:47
we'll use 6.3, 3.7. The third cat will
54:50
use 0.6, 8.1.
54:51
So, only the thing that we are adding,
54:53
the positional embedding, will change.
54:55
So, the resulting
54:57
sum is going to be different for each of
54:58
these three words, even though they're
54:59
exactly the same word.
55:05
Is that position embedding table
55:07
specific to the standalone embedding
55:09
table? Like if you were to add or remove
55:12
some words from the standalone table? It's
55:14
independent.
55:15
Independent. It only depends on your
55:18
assumption about how long the sentences
55:19
can be.
55:21
That's it.
55:21
It doesn't really care about what's what
55:23
words are coming in. That's a whole
55:24
different thing.
55:26
So, these are two independent tables
55:27
that just learned as part of this
55:28
process.
55:31
So, yeah, I have the same thing for sat
55:33
and mat.
55:35
Sat and mat, that's what we have.
55:39
So, just make sure you understand these
55:40
two slides to really like make sure the
55:42
mechanics are clear. Yeah.
55:46
How do you control for filler words? For
55:48
example, if you're taking
55:50
NLP output for transcription and you're
55:53
trying to run a transformer and you have
55:55
a lot of
55:56
um's and likes that are
55:58
disproportionately large and have these
56:00
random assignments or
56:03
really deep embeddings, are there other
56:04
ways to look through the noise?
56:07
Typically, what they do is, um,
56:09
as we'll talk about later, use this thing
56:10
called byte pair encoding, in which we
56:12
take individual characters,
56:14
fragments of words, and whole words into
56:16
account as tokens. So, when you hear
56:18
stuff like uh and so on, it gets mapped
56:21
to these small tokens.
56:23
Right? And then we treat them as just
56:24
any other token.
56:28
Um yeah, is the aggregation just a simple
56:31
sum here? And wouldn't the actual
56:33
semantic meaning of the standalone word
56:36
be more important than its
56:37
relative position in the sentence?
56:40
It could be. We just don't know a priori
56:42
whether it's going to be important or
56:43
not for any particular sentence.
56:45
We when we train the transformer with a
56:46
lot of textual data,
56:48
right? It'll just figure out the right
56:50
values for these things so that on
56:51
average, the accuracy is as high as
56:53
possible.
56:55
So, in many of these things, there's
56:56
always a tension between our human
56:58
intuition as to how it should work and
57:00
whether you should just throw it into
57:01
the meat grinder of backprop and see
57:02
what happens.
57:04
And so, here it does it turns out you
57:05
can just throw it into backprop, it'll
57:06
actually do a pretty good job.
57:08
Uh yeah.
57:10
For the positional encoding, we would
57:13
just be using the sum vectors,
57:15
this 2 by 3 matrix
57:18
that you have there, right?
57:20
Uh oh yeah, this is just for
57:21
demonstration. Basically, this is the
57:23
thing that will actually go into the
57:24
transformer. Correct.
57:26
Yeah.
57:28
That was just me being overly verbose in
57:30
the slides.
57:31
Uh yeah.
57:33
I can see sentences in the input. At
57:35
this point, are we still parsing out
57:36
punctuation or if we have like a
57:38
multi-sentence input, is there a
57:40
positional embedding vector for each of
57:41
the sentences? Yeah, so here um
57:44
basically, the starting point is tokens.
57:47
Right? And in our example, because we're
57:48
working with the idea of simple
57:50
standardization and stripping and things
57:51
like that, I'm just showing actual
57:53
words.
57:54
If you go to something like GPT-4, since
57:56
it uses a different tokenization scheme,
57:58
uh each token might be part of a word.
58:01
It might be it might be an individual
58:02
character, it might be a punctuation
58:03
mark, it could be in fact um the GPT
58:06
family doesn't strip out punctuation.
58:08
Which is why when you ask a question, it
58:10
comes back with intact punctuation in
58:12
its response.
58:13
Uh and so, we'll get we'll revisit this
58:15
when you look at BPE, byte pair encoding
58:17
later on.
58:19
But the key thing to remember is that
58:21
all the stuff we're talking about starts
58:22
from the notion of a token.
58:24
As to how you define a token given a
58:26
bunch of text, that's the tokenizer's
58:28
job. And we just assumed a simple
58:30
tokenizer for the time being.
58:33
Okay? So, at this point, folks, we have
58:36
satisfied all the requirements.
58:38
Uh we have taken the surrounding context
58:40
of each word, we have taken the order,
58:42
and so on and so forth, because what's
58:43
coming in here is the positional
58:45
embeddings. Okay? And it runs through
58:47
the whole transformer stack.
58:49
So,
58:51
this is called a transformer encoder.
58:54
Okay?
58:55
This is the transformer encoder.
58:57
And you can see here, this is the
58:59
original picture from the paper.
59:01
It's an iconic picture at this point.
59:03
So, it says here, these are the
59:04
inputs. This is like "the cat sat on the
59:06
mat."
59:07
It comes in here, gets
59:09
transformed into embeddings, standalone
59:11
embeddings.
59:12
And then, based on the position of each
59:14
word, we add that's why you see a plus
59:17
sign here, we add the positional
59:20
embedding to that.
59:22
And the resulting thing goes into this
59:24
transformer block. And here,
59:26
we go through multi-head attention.
59:30
And things come out the other end.
59:32
Then there is this thing called add and
59:34
norm, which we'll revisit on
59:36
Wednesday.
59:37
And then it goes through a feed forward
59:38
network, another add and norm, which
59:40
we'll revisit on Wednesday.
59:42
And then it comes out the other end.
59:43
That's it. That's a transformer encoder.
59:46
Okay?
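For reference, here is a minimal Keras sketch of one encoder block along the lines of that diagram. It is illustrative only, not the course's helper code, and the hyperparameters are whatever you pass in.

    from tensorflow import keras
    from tensorflow.keras import layers

    class SimpleTransformerEncoder(layers.Layer):
        """Self-attention -> add & norm -> feed-forward -> add & norm."""
        def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
            super().__init__(**kwargs)
            self.attention = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
            self.dense_proj = keras.Sequential([
                layers.Dense(dense_dim, activation="relu"),  # the hidden ReLU layer
                layers.Dense(embed_dim),                     # linear layer back to embed_dim
            ])
            self.layernorm_1 = layers.LayerNormalization()
            self.layernorm_2 = layers.LayerNormalization()

        def call(self, inputs):
            attention_output = self.attention(inputs, inputs)          # multi-head self-attention
            proj_input = self.layernorm_1(inputs + attention_output)   # first add & norm
            proj_output = self.dense_proj(proj_input)                  # position-wise feed-forward
            return self.layernorm_2(proj_input + proj_output)          # second add & norm

The same number of tokens and the same vector length come out as went in, which is what makes the stacking discussed below possible.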
59:47
Um
59:48
and so if you look at this
59:52
just to point out a couple of things,
59:53
the input embeddings can be random
59:55
weights or it could be pre-trained
59:56
embeddings.
59:57
Um
59:58
we add in a position-dependent embedding
1:00:00
to represent the position of each word
1:00:01
in the sentence. That's the plus.
1:00:02
Then we pass it through multi-headed
1:00:04
attention to get a contextual uh
1:00:05
representation.
1:00:07
Then finally we pass all this through
1:00:09
a simple feed-forward network;
1:00:10
typically it's a two-layer network:
1:00:12
one hidden layer with ReLUs and then a
1:00:13
linear layer after that, and boom.
1:00:16
That's the encoder. And
1:00:20
here is perhaps the most important
1:00:21
point to keep in mind.
1:00:23
Because we have taken inordinate care to
1:00:25
make sure that the things that are
1:00:26
coming in and the things that are going
1:00:28
out have the same size
1:00:30
both in terms of the number of tokens as
1:00:32
well as the length of each vector.
1:00:34
We can then stack them up like pancakes.
1:00:37
We can have lots of transformers stacked
1:00:39
one on top of each other.
1:00:41
Right? Because it's the perfect API.
1:00:43
It's the simplest possible API. The same
1:00:45
thing comes in, same thing goes out.
1:00:47
In terms of size. So you can have a
1:00:49
transformer encoder, another one on top,
1:00:51
boom, boom, boom, boom, boom, one after
1:00:53
the other. GPT-3 has a stack of 96
1:00:55
transformer blocks.
1:00:58
And like in all things deep learning
1:01:00
related, the more layers you have, the
1:01:02
more complicated things we can do with
1:01:04
it.
1:01:05
As long as you have enough data to keep
1:01:06
the model happy so it doesn't overfit.
1:01:11
Okay?
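A rough sketch of that stacking, reusing the SimpleTransformerEncoder sketched earlier (the batch size, depth, and hyperparameters below are arbitrary placeholders):

    import tensorflow as tf

    x = tf.random.normal((8, 30, 512))   # (batch, tokens, embedding dim) - placeholder input
    for _ in range(4):                   # 4 is arbitrary; GPT-3 stacks 96 such blocks
        x = SimpleTransformerEncoder(embed_dim=512, dense_dim=64, num_heads=5)(x)
    print(x.shape)                       # still (8, 30, 512): same thing in, same thing out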
1:01:13
All right. So, what we haven't covered,
1:01:15
which we'll cover on Wednesday
1:01:17
uh is the question that
1:01:20
he had posed about how
1:01:22
uh you know, since there are no
1:01:23
parameters inside the self-attention
1:01:24
block, what are we actually learning?
1:01:26
And then there are these things called
1:01:27
residual connections and layer
1:01:29
normalization. We'll talk about all
1:01:31
those things on Wednesday. Those are all
1:01:32
like, you know, refinements to the idea.
1:01:35
So, all right, 9:39. Um let's apply the
1:01:38
transformer encoder to an actual
1:01:39
problem.
1:01:40
Any questions?
1:01:43
Uh yeah.
1:01:45
My question is regarding, like you said,
1:01:46
you could have multiple transformers.
1:01:48
What is the difference between having
1:01:50
multiple self-attention heads
1:01:53
rather than having multiple... When I
1:01:54
say a transformer block, within the block
1:01:57
there could be multiple heads. So, if
1:01:59
the accuracy is the same, why
1:02:01
would you use this rather
1:02:04
Yeah, you can have a lot of attention
1:02:06
heads. And that's totally fine. And
1:02:08
typically I forget how many GPT-3 and 4
1:02:10
have. They have a whole bunch of them.
1:02:12
But you can So you can go wide and you
1:02:13
can go deep.
1:02:15
Both are done in practice.
1:02:18
But the thing is if
1:02:19
The one thing you have to remember is
1:02:20
that if you go wide, you have a
1:02:22
lot of attention heads then given the
1:02:24
particular input that's coming into that
1:02:26
block, it'll learn different patterns
1:02:28
from it.
1:02:29
While if you stack them all up, it's
1:02:31
going to learn different ways to
1:02:32
contextualize the things that are coming
1:02:33
in. It operates at higher levels of
1:02:35
abstraction. So the analogy would be
1:02:36
that like the seventh layer of a
1:02:38
convolutional net may take the sixth
1:02:40
layer's output and say, "Oh, I'm seeing
1:02:42
a lot of edges here. I'm going to take
1:02:44
an edge like this, two circles like that
1:02:46
and call it a face."
1:02:48
So it'll operate at a higher level of
1:02:49
abstraction.
1:02:52
Okay.
1:02:53
Um
1:02:58
All right, let's go to the Colab.
1:03:01
So what we're going to do is we're going
1:03:02
to take the transformer that we just
1:03:04
learned about and we're going to apply
1:03:05
it to solve the the travel uh slot
1:03:07
problem. Okay?
1:03:09
Uh all right. So
1:03:12
Okay, so we'll start with the usual
1:03:14
preliminaries.
1:03:16
And then we have taken the ATIS data set
1:03:18
I talked about and we have stuck them in
1:03:20
raw box for easy consumption.
1:03:23
It's here.
1:03:29
Okay.
1:03:30
So if you look at the top few,
1:03:33
you can see here, for example, I want to
1:03:35
fly from Boston 8:30 a.m. And then this
1:03:37
is the output. The slot filling is the
1:03:39
output. Um and so as it turns out here
1:03:42
there is
1:03:43
another label: these people also
1:03:46
took the whole query and gave it an
1:03:47
intent, as in, is it a flight query,
1:03:49
is it a something-else query, and so on,
1:03:51
which we're not going to use. Are you
1:03:52
kidding me?
1:03:54
I want to fly from Boston at 8:30 a.m.
1:03:56
and arrive in Denver at 11:00 in the
1:03:57
morning. What kind of ground
1:03:59
transportations are available in Denver?
1:04:01
What's the airport at Orlando?
1:04:03
Um how much does the limo service cost
1:04:06
within Pittsburgh? Okay.
1:04:08
And so on and so forth. So you get So
1:04:09
you get the idea. It's a very wide range
1:04:11
of queries that are in this data set.
1:04:13
Um okay. So let's just ignore that for a
1:04:16
sec. Um okay. So what we're now going to
1:04:18
do is we are going to take only
1:04:22
um this column, right? The query column.
1:04:24
That's going to be our input text. Okay?
1:04:27
And then the slot filling column is
1:04:29
going to be our dependent variable, the
1:04:31
output.
1:04:32
So we'll just gather them all up
1:04:34
uh here.
1:04:37
Let it run. We'll do it for the training
1:04:38
data and the test data.
1:04:40
And so what we have done is that we have
1:04:42
taken um the transformer related code in
1:04:45
Keras and we have packaged it into a
1:04:47
little hardel library for easy
1:04:49
consumption.
1:04:50
Um and so that thing is here. You can
1:04:53
download it.
1:04:55
Calling it a library is like overstating
1:04:56
it. We literally just collected a bunch
1:04:57
of code and stuck it in a file. Okay?
1:04:59
So
1:05:00
and so what we'll do is from hardel
1:05:02
we'll import the transformer
1:05:03
encoder.
1:05:04
And we'll import this positional
1:05:06
embedding layer.
1:05:08
Because what we're going to do is we are
1:05:09
going to take the input do the
1:05:11
positional encoding business and then
1:05:12
send it into the transformer.
1:05:14
Okay?
1:05:15
Um so but first let's vectorize the
1:05:18
input uh queries that are coming in.
1:05:21
So we'll define a thing here.
1:05:24
Uh oh, it says
1:05:26
max query length is not defined. That's
1:05:28
what happens when you
1:05:30
don't run everything.
1:05:32
All right.
1:05:38
Okay. So now we have this thing here. So
1:05:41
turns out that there are 8,888 tokens,
1:05:44
right? 8,888 words in the input queries
1:05:47
that are we have in the data. Uh so I
1:05:49
take a look at the first few.
1:05:52
And you can see here, you know, there is
1:05:54
unk. Uh and because the output mode here
1:05:56
is you just want integers to come out
1:05:58
not multi-hot encoding or anything
1:06:00
because we're going to take these
1:06:01
integers and then do embeddings from
1:06:02
them. So it'll
1:06:04
reserve this empty string as the pad
1:06:07
token. This should be familiar from last
1:06:10
week.
1:06:11
And then the unk for unknown tokens and
1:06:13
then two from flights these are all some
1:06:14
of the most frequent. Um turns out
1:06:17
Boston is actually the most frequent. I
1:06:18
don't know what's up with that.
1:06:20
It is what it is. Then we'll do the same
1:06:22
vectorization to the train and test data
1:06:24
sets.
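A rough sketch of what that query vectorization might look like in Keras; the toy queries and variable names below are placeholders, not the notebook's exact code.

    from tensorflow.keras.layers import TextVectorization

    # Two toy queries standing in for the ATIS training queries.
    train_queries = ["i want to fly from boston at 838 am",
                     "what flights go from boston to denver"]

    max_query_length = 30                            # assumed cap on tokens per query
    query_vectorizer = TextVectorization(
        output_mode="int",                           # integer token ids, not multi-hot
        output_sequence_length=max_query_length)     # pad or truncate every query to 30
    query_vectorizer.adapt(train_queries)            # on the real queries this builds the 8,888-token vocabulary
    print(query_vectorizer(train_queries).shape)     # (2, 30)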
1:06:25
Now we need to do the same thing for the output
1:06:28
side of the problem, because the slots,
1:06:30
the dependent variable here,
1:06:31
remember, are all sentences as well, with
1:06:33
the B and O tags and things like that, right? So we
1:06:36
need to vectorize those.
1:06:38
So we need to do the same vectorization on them.
1:06:40
So let's take a look at some of these
1:06:42
slots.
1:06:43
And you can see here all this stuff is
1:06:44
going on.
1:06:45
Note: here is an example where you
1:06:48
have to be very careful when you do the
1:06:49
standardization.
1:06:51
Typically standardization you will
1:06:52
remove punctuation and you know, do
1:06:54
things like that and lowercase, right?
1:06:56
But here
1:06:57
these things have a specific meaning.
1:07:00
We can't just go in there and remove the
1:07:01
period and the underscore and then
1:07:03
make the B into lowercase b and stuff
1:07:04
like that. That'll just harm it.
1:07:06
Right? We need to be able to preserve
1:07:07
the nomenclature of the output in terms
1:07:10
of all those tags. So
1:07:12
um so we don't want the standardization
1:07:13
to strip all of that out. So what we do is we
1:07:15
set standardize=None.
1:07:17
Look at that.
1:07:18
We tell Keras do not standardize this.
1:07:20
Do not do your usual thing.
1:07:22
Okay?
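A small sketch of that idea, with a made-up BIO tag string rather than the notebook's data:

    from tensorflow.keras.layers import TextVectorization

    slot_strings = ["O O O B-fromloc.city_name O B-toloc.city_name"]  # made-up slot sequence

    slot_vectorizer = TextVectorization(
        standardize=None,             # keep case, periods, underscores, and hyphens intact
        output_mode="int",
        output_sequence_length=30)
    slot_vectorizer.adapt(slot_strings)
    print(slot_vectorizer.get_vocabulary())   # '', '[UNK]', 'O', then the two B- tags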
1:07:23
Um so
1:07:25
we do that
1:07:26
for the output side. And then let's look
1:07:29
at the vocabulary.
1:07:30
Yeah, so this sounds pretty good.
1:07:33
These are all the things that we would
1:07:34
expect to see.
1:07:35
These are the distinct tokens in the
1:07:37
output strings.
1:07:39
Um all right.
1:07:43
Okay, we get it.
1:07:45
So we have 125 of them. In the
1:07:48
lecture I said there are 123 slots,
1:07:50
possible slots. Why is it 125 here?
1:07:54
Yes, unk and pad. Correct.
1:07:57
Um okay. Now we'll set up a transformer
1:07:59
encoder, right? Uh this Oh, wait, wait,
1:08:02
wait. I forgot about um doing this. My
1:08:05
bad. Um
1:08:07
All right.
1:08:11
I just realized when I saw the slide that
1:08:12
we had gone to the Colab
1:08:15
without giving you a bit more
1:08:16
background. No problem. So
1:08:18
So
1:08:20
the way we're going to model this
1:08:21
problem is that we're going to have
1:08:22
something like this, right? Fly from
1:08:23
Boston to Denver.
1:08:24
That's the input that's coming in and
1:08:26
that is the correct answer.
1:08:28
O, O, some B-something-or-others,
1:08:31
and then something else, right? That's
1:08:32
the correct answer. That's
1:08:34
the input and that is the right answer.
1:08:36
So what we'll do is we will
1:08:38
create these positional input embeddings
1:08:40
like we have discussed before.
1:08:42
We will run it through a transformer.
1:08:45
It gives us contextual embeddings.
1:08:47
So if we send five in, it's going to
1:08:49
send us five out except the color is now
1:08:50
blue.
1:08:51
Right? And then what we do is
1:08:54
we will run it through a ReLU.
1:08:57
Okay, we'll run it through a ReLU layer.
1:08:59
We will still have
1:09:01
you know, five vectors here, five
1:09:02
vectors will come in.
1:09:04
And then for each of the things that
1:09:05
comes in, we will stick a 123-way
1:09:07
softmax.
1:09:11
Okay, for each thing that comes out
1:09:13
we'll have a 123-way softmax and that's
1:09:15
the classification problem we're going
1:09:16
to solve.
1:09:20
Okay?
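In shape terms, a small sketch with made-up numbers: the head emits one probability distribution over the slot labels for every token position, and we read off one predicted slot per token.

    import numpy as np

    batch_size, seq_len, num_slots = 2, 30, 123                # 123 slot tags, per the lecture's count
    probs = np.random.rand(batch_size, seq_len, num_slots)     # stand-in for the per-token softmax outputs
    predicted_slots = probs.argmax(axis=-1)                    # one slot id per token
    print(predicted_slots.shape)                               # (2, 30)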
1:09:21
So
1:09:23
the weights in all these layers will get
1:09:25
optimized by backprop.
1:09:28
All these weights are going to get
1:09:29
optimized.
1:09:30
Uh yeah.
1:09:34
Sorry?
1:09:40
Oh no, the that's a layer. The weights
1:09:43
in the layer will still need to be
1:09:44
learned.
1:09:46
It's sort of like the text vectorization
1:09:48
layer is a bunch of code and then you
1:09:50
actually run it on a particular corpus
1:09:51
to adapt it and fill our vocabulary out
1:09:53
of it.
1:09:54
So, it's like an empty shell that needs
1:09:55
to get populated.
1:09:57
Okay, so with the weights and all these
1:09:59
things are going to get updated when we
1:10:00
when we train the model
1:10:02
by backprop.
1:10:03
Uh and that's it. That's the setup.
1:10:06
Does this make sense before I switch
1:10:07
back to the Colab?
1:10:09
In particular, does this make sense?
1:10:11
This part of it.
1:10:15
Bunch of things come out and then for
1:10:17
each one of those things we need to
1:10:18
figure out a classification of a 123-way
1:10:20
classification. And that's where we
1:10:22
stick a softmax on every one of those
1:10:23
output nodes.
1:10:25
Yeah.
1:10:32
Oh oh, I see.
1:10:36
Yeah, so
1:10:40
It could be whatever or to put it
1:10:41
another way, it is your choice as the
1:10:43
user as the modeler. Correct? The thing
1:10:45
is at this point with the blue stuff the
1:10:47
transformer is basically saying, my job
1:10:49
is done.
1:10:51
It has given you these valuable
1:10:52
contextual embeddings at some high-level
1:10:54
abstraction. What you do with it depends
1:10:56
on your particular problem. And so that
1:10:58
the best practice would be to take it
1:11:00
and then maybe, you know, if these
1:11:01
embeddings are really
1:11:03
long, maybe you make them a little
1:11:04
smaller, right? Using a ReLU. And using
1:11:07
a ReLU is always a good idea because
1:11:09
when in doubt, throw in a bit of
1:11:10
non-linearity.
1:11:11
Right? Uh and then once you're done with
1:11:13
that, well, at this point you need to
1:11:15
actually classify it. So, you stick an
1:11:17
output softmax on it.
1:11:20
Okay. So, that's what we have.
1:11:24
Um
1:11:27
All right, back to this picture.
1:11:29
So, what we're going to do is we
1:11:32
we also get to decide how long are these
1:11:34
embedding vectors. How long? Because here
1:11:36
we're not going to use GloVe embeddings.
1:11:37
We're just going to learn everything
1:11:37
from scratch.
1:11:39
Right? We're going to learn everything
1:11:40
from scratch. So, and we can decide how
1:11:42
long these embedding vectors are. So, um
1:11:45
these embedding vectors, I get to
1:11:46
decide, and
1:11:47
I have decided that I want them to be
1:11:49
512 long, right? I want these actually
1:11:52
to be 512 long. So, that's what I have
1:11:54
here, 512.
1:11:57
And then inside the transformer,
1:11:58
remember
1:12:00
when we
1:12:01
concatenate everything and then we have
1:12:02
something, we run it through a final
1:12:04
ReLU layer, how big should that layer
1:12:07
be?
1:12:08
That's what I mean here by dense
1:12:11
dim. I want it to be 64.
1:12:13
And then I, you know, for fun I'm going
1:12:15
to use five attention heads.
1:12:17
Because why not?
1:12:20
Okay. And then in the final thing here
1:12:24
to go to Ali's question here these
1:12:27
things are all 512 long as I mentioned
1:12:29
earlier, right? These are all 512.
1:12:32
But this thing here I'm going to make it
1:12:34
just 128.
1:12:36
Okay, that's what I mean by units here.
1:12:38
And so if you look at the actual model
1:12:41
okay, whatever comes in has a max query
1:12:43
length of I think 30 if I recall.
1:12:45
Um actually let's just make sure of
1:12:47
that. What did I assume?
1:12:51
30, correct? Max query length 30. So,
1:12:53
each sentence is 30. So, if a sentence
1:12:55
has 35 words in it, what's going to
1:12:57
happen?
1:12:59
The last five will get chopped,
1:13:01
truncated. If it comes in at 22, we're
1:13:03
going to pad it with eight more tokens
1:13:05
with a pad token. Okay? That's how we
1:13:06
make sure everything uh gets to 30.
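A tiny sketch of that pad-or-truncate rule (the strings below are placeholders just to show the mechanics):

    import tensorflow as tf
    from tensorflow.keras.layers import TextVectorization

    tv = TextVectorization(output_mode="int", output_sequence_length=30)
    tv.adapt(["show me flights from boston to denver tomorrow morning"])

    ids = tv(["show me flights from boston"])     # only 5 real tokens
    print(ids.shape)                              # (1, 30)
    print(int(tf.math.count_nonzero(ids)))        # 5 -> the remaining 25 slots are the pad token, 0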
1:13:09
All right. So, we come back here.
1:13:12
So, the input is still sentences which
1:13:14
are 30 long, tokens which are 30 long.
1:13:16
And then we run it through a positional
1:13:18
embedding layer.
1:13:20
Okay? This positional embedding layer
1:13:23
has the actual embedding table for each
1:13:25
word, and it has the
1:13:27
positional embedding
1:13:29
table. So, just to be clear, this
1:13:31
positional embedding layer is basically
1:13:34
it's basically this.
1:13:37
So, this table
1:13:38
and this table together are packaged up
1:13:41
into the positional encoding layer.
1:13:43
But they are two distinct tables. They
1:13:45
just happen to be packaged up.
1:13:47
So,
1:13:49
so this is what we have here.
1:13:51
And then we get a nice positional
1:13:52
embedding out and then boom, we run it
1:13:55
through the transformer. And you know,
1:13:57
this transformer encoder object we have
1:13:59
to tell it obviously, hey, this is the
1:14:01
embedding dimension that's going to come
1:14:02
out. This is the dense dimension you're
1:14:04
going to use in that final feedforward
1:14:06
layer inside each attention block and
1:14:09
this is the number of attention heads I
1:14:10
want you to use. That's it.
1:14:11
Very simple, right? Only three things have to
1:14:13
be specified.
1:14:14
And then whatever comes out of the
1:14:16
transformer encoder are these blue
1:14:18
vectors.
1:14:19
And then we are back into good old sort
1:14:20
of, you know, traditional DNN stuff
1:14:22
where we take this thing, run it through
1:14:24
a ReLU with 128 units, we add a little
1:14:27
dropout uh and then we run it through a
1:14:30
dense layer which the the vocab size
1:14:33
here is 125, which is the 125-way
1:14:35
softmax.
1:14:37
Okay? Activation softmax.
1:14:39
Connect up everything into model input
1:14:41
and output and boom, that's the whole
1:14:42
model.
1:14:44
So, that's what we have here.
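Putting that together, here is a rough reconstruction of the model just described. PositionalEmbedding and TransformerEncoder stand for the layers imported from the course's helper file; their class names and argument order, the dropout rate, and the loss are my assumptions, not the notebook's exact code.

    from tensorflow import keras
    from tensorflow.keras import layers
    from hardel import PositionalEmbedding, TransformerEncoder   # class names assumed

    vocab_size = 8888             # input vocabulary built from the queries
    slot_vocab_size = 125         # 123 slot tags plus pad and unknown
    max_query_length = 30
    embed_dim, dense_dim, num_heads = 512, 64, 5

    inputs = keras.Input(shape=(max_query_length,), dtype="int64")
    pos_embed = PositionalEmbedding(max_query_length, vocab_size, embed_dim)   # word table + position table
    x = pos_embed(inputs)
    x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)                 # contextual embeddings out
    x = layers.Dense(128, activation="relu")(x)                                # shrink to 128 + non-linearity
    x = layers.Dropout(0.3)(x)                                                 # "a little dropout" (rate assumed)
    outputs = layers.Dense(slot_vocab_size, activation="softmax")(x)           # 125-way softmax per token
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",                      # loss choice assumed
                  metrics=["accuracy"])
    model.summary()                                                            # roughly 5.3 million parameters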
1:14:47
Okay?
1:14:48
Now,
1:14:51
after Wednesday's
1:14:53
class,
1:14:54
for extra credit and for your personal
1:14:56
edification
1:14:59
try to work through this thing to come
1:15:00
up with this number.
1:15:03
53 million
1:15:04
um sorry, 5.3 million.
1:15:06
Right? Uh and see if it matches this
1:15:10
number here.
1:15:12
It should match.
1:15:13
Hand calculate the number of parameters
1:15:15
inside the transformer. Okay? For fame
1:15:17
and fortune. That's an optional thing.
1:15:19
So,
1:15:20
uh do it after Wednesday's class, not
1:15:22
right now.
1:15:23
And I have actually listed the exact
1:15:24
math that goes into it here. Okay? All
1:15:26
right. So, by the way, you can peek into
1:15:28
any layer's weights using its weights
1:15:30
attribute. This is the embedding
1:15:31
uh the positional embedding thing we
1:15:33
had. So,
1:15:34
we can click it and you can see here it
1:15:36
has two tables. There's the first table
1:15:39
which is just the embedding table which
1:15:40
says
1:15:41
there are 8,888 tokens in my
1:15:43
vocabulary and each of those tokens has
1:15:45
an embedding vector which is 512 long.
1:15:47
That is the first table here. And then
1:15:49
it has the second object which is the
1:15:51
positional embedding and it says here,
1:15:53
well, my sentences can be 30 long and
1:15:56
for each position of the 30 long
1:15:58
sentence, I will have a 512 embedding.
1:16:02
Both these tables as I mentioned earlier
1:16:04
are packaged up inside and you can
1:16:05
actually see what the weights are before
1:16:06
you do any training.
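A small sketch of that kind of inspection, reusing the pos_embed layer instance from the model sketch above:

    for w in pos_embed.weights:
        print(w.name, w.shape)
    # Expected shapes, per the discussion above:
    #   token embedding table:    (8888, 512) - one row per vocabulary token
    #   position embedding table: (30, 512)   - one row per position 0 through 29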
1:16:08
Okay?
1:16:09
So, all right. So, I'm going to stop
1:16:11
here uh because the model is going to
1:16:13
take a few minutes to run and we're
1:16:14
already at 9:45.
1:16:16
Um so, we will continue the journey on
1:16:17
Wednesday. If some of it is not super
1:16:19
clear, don't worry about it. It will
1:16:20
become much clearer on Wednesday. All
1:16:21
right? All right, folks, have a good
1:16:22
couple of days. I'll see you on
1:16:23
Wednesday.
— end of transcript —