7: Deep Learning for Natural Language – Transformers
MIT OpenCourseWare
May 11, 2026
Transcript
0:16
So, all right. So, transformers, even
0:18
though they were originally invented for
0:20
machine translation, right, going from
0:22
English to German and German to French
0:24
and so on and so forth,
0:25
they have turned out to be an incredibly
0:27
effective deep neural network
0:29
architecture for just really a vast
0:32
array of domains. It has reached a point
0:34
where if you're actually working on
0:36
a particular problem, you will almost
0:37
reflexively try a transformer
0:39
first because it's probably going to be
0:40
pretty darn good.
0:42
Okay? So, they have just taken over
0:45
everything.
0:46
Um and obviously they've
0:48
transformed translation, which is the
0:50
original sort of target, uh Google
0:52
search, really information retrieval,
0:54
completely transformed speech
0:55
recognition, text-to-speech, even
0:57
computer vision. Even the stuff that we
0:59
learned with convolutional neural
1:00
networks, now there are transformers for
1:03
computer vision problems that are
1:04
actually quite good.
1:06
Right?
1:07
Um which is kind of shocking because
1:08
they were not even designed for that.
1:10
Um and then, you know, reinforcement
1:12
learning. And of course, all the crazy
1:14
stuff that's going on with generative
1:15
AI, large language models, multimodal
1:17
models, everything everything runs on a
1:20
transformer.
1:21
Okay? Uh and then there are numerous
1:23
special purpose systems
1:25
and I find these to be even more
1:27
interesting.
1:28
Um you know, like AlphaFold, the protein
1:30
folding AI, is run runs on a transformer
1:32
stack.
1:33
Okay? And I could just list examples one
1:35
after the other.
1:36
So, it's just amazing. It's an incredibly
1:38
flexible architecture.
1:40
Um and I think we are lucky to be alive
1:43
during a time when such a thing was
1:44
invented.
1:47
And I'm not getting paid to tell you any
1:48
of this stuff.
1:50
All right, it's just amazing. Okay. So,
1:52
let's get going. We will use search um
1:55
or more broadly information retrieval as
1:57
a motivating use case. So, these are all
1:59
examples where people are typing in
2:00
natural language queries or uttering
2:02
natural language queries into a phone
2:03
and we need to sort of make sense of
2:05
what they want. And it's not like, you
2:07
know, write me a limerick about deep
2:08
learning where there could be many
2:10
possible right answers. It's more like,
2:12
okay, tell me all the flights that are
2:14
leaving from Boston going to
2:15
LaGuardia tomorrow morning between 8:00
2:16
and 9:00. Well, you better get it right.
2:19
Okay? Accuracy is a high bar.
2:21
So,
2:22
um or, you know, how many customers
2:23
abandoned their shopping cart? Find all
2:24
contracts that are up for renewal next
2:26
month. Uh you know, tell me all the
2:28
customers who ended the phone call to
2:30
the call center yesterday not entirely
2:32
pleased with the transaction. Right? The
2:34
list goes on and on. And so, in
2:37
particular, we'll focus on this
2:38
travel-related example today. Okay? Uh
2:40
find me all flights from Boston to
2:42
LaGuardia tomorrow morning, right? That
2:44
kind of query.
2:45
Um and so, in these sorts of use cases,
2:48
a very common approach historically has
2:50
been, well, we will take this, you know,
2:53
natural language query
2:55
and then we will convert it into a
2:57
structured query. By that I mean we will
3:01
parse the query and we'll extract out
3:03
key things in that query. Once we
3:05
extract out those key things, we will
3:07
reassemble it into a structured query,
3:09
like a SQL query, right? Uh SQL is just
3:12
one example of a possible structured
3:14
query. There are many many ways to
3:15
structure queries.
3:17
But SQL is sort of familiar to lots of
3:18
people, so I'm using that. So, you take
3:20
the SQL. Once you have the SQL query,
3:23
you're in a very comfortable structured
3:25
land, in which case you just run the
3:27
query through some database that you
3:28
have, get the results back, format it
3:30
nicely, and show it to the user.
3:32
Right? That's the flow.
3:34
So, the question becomes
3:36
um
3:37
how do we automatically extract all the
3:40
travel-related entities from this query?
3:43
Right? We want to be able to extract
3:45
BOS, LGA, tomorrow, morning, flights, so
3:49
on and so forth. These are all the
3:50
travel-related entities we want to
3:51
extract out, right? That's the problem.
3:54
And so,
3:56
we will use a really cool data set
3:58
called the Airline Travel Information
3:59
System (ATIS) data set and I'll explain the
4:01
data set in just a bit. We'll
4:02
use this as the basis for this example.
4:05
And so, the way we think about it is
4:07
that
4:08
we have a whole bunch of queries in
4:10
this data set.
4:12
And fortunately for us, the researchers
4:14
who compiled this data set,
4:16
they went through every one of these
4:18
queries, right? And we have, you know,
4:20
several thousands of them. They went
4:22
through every one of those queries and
4:24
they manually tagged each word in the
4:26
query
4:28
with what kind of travel entity it is
4:31
or none of them, right? So, for
4:33
instance, they call them
4:35
slots. So, they will take each word in
4:37
the query and assign it to a slot, a
4:39
particular kind of slot, and I'll
4:41
explain what slot means in just a
4:42
second. Okay? That's the basic idea. So,
4:45
so, for example, if you have something
4:47
like I want to fly from
4:49
Okay? And this is a flight database, so
4:52
you can assume that everything is
4:53
related to flying. So, if you
4:56
have all these words, I want to fly
4:57
from,
4:58
each of these five words
5:00
gets mapped to something called the O,
5:02
which means other.
5:04
It's the other slot, right? We don't
5:06
really care about it. It's the other
5:07
slot.
5:09
And then we come to Boston.
5:11
Oh, Boston is very special, right?
5:13
Because, you know, it's clearly a
5:15
departure city. So, we actually tag it,
5:18
we assign it this label. Think of it as
5:20
just like a classification problem,
5:21
right? A multi-class classification
5:23
problem. So, we assign it to
5:26
B-fromloc.city_name.
5:29
Okay? That is the label you assign it.
5:31
Okay?
5:32
And then you go to at. You don't care
5:34
about at. It's O, other. You come to
5:37
7:00 a.m.
5:38
And then, okay, that is depart time. So,
5:41
depart time and then another depart
5:43
time. And here you see there is a B and
5:45
then there is an I.
5:47
Right? So, what we are saying
5:49
here is that there could be entities who
5:51
are described using more than one word.
5:54
Like 7:00 a.m., right? Two tokens.
5:57
And for that, we need to be able to
5:58
figure out, okay, the second token is
6:00
really
6:01
is part of the first token. Together,
6:03
they define the notion of a departure
6:05
time. So, what the B means is that
6:08
this is the token in
6:10
which we are beginning the idea of a
6:12
departure time. And then I means we are
6:15
in the middle of this description.
6:17
B is for beginning.
6:19
So,
6:21
you can see here. So, there is a B here
6:23
and there is an I. B for beginning, I
6:25
for intermediate or in the middle.
6:27
Um and then at, we don't care. 11:00 B
6:31
arrive time.
6:33
Boop boop boop. Morning arrive time
6:35
period.
6:38
So, this is an example of how you can
6:40
take a sentence and then manually label
6:43
every word in the sentence with
6:45
something that's relevant to your
6:46
particular problem.
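As a rough illustration of the labeling scheme just described, here is a minimal Python sketch of a tagged query. The tag names follow the ATIS-style BIO convention mentioned above, but the exact spellings are illustrative rather than the dataset's literal tag set.

```python
# Illustrative word-to-slot labels for a short query, in BIO style.
# "O" means other; "B-" begins an entity; "I-" continues a multi-word entity.
tokens = ["I", "want", "to", "fly", "from", "Boston", "at", "7:00", "am"]
labels = ["O", "O", "O", "O", "O",
          "B-fromloc.city_name",       # departure city
          "O",
          "B-depart_time.time",        # first token of the departure time
          "I-depart_time.time"]        # continuation of the departure time

for tok, lab in zip(tokens, labels):
    print(f"{tok:8s} -> {lab}")
```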
6:50
And
6:51
it turns out that
6:54
every word is classified into one of 123
6:56
possibilities.
6:59
Okay? Um so, aircraft code, airline
7:02
code, airline name, airport code,
7:04
airport name, arrival date, relative
7:07
name. Now, you get the idea.
7:08
Whether they want a round trip versus a one-way.
7:11
Dates relative to today, because if
7:13
somebody says tomorrow morning, it's
7:14
relative to today, so you need a notion
7:16
of absolute time and a notion
7:17
of relative time.
7:19
So, these researchers basically thought of every
7:20
possibility. And
7:23
so, every word in every one of these
7:25
queries is assigned one of these 123
7:27
labels.
7:32
Any questions on the setup?
7:36
Um
7:39
Did they have to contextualize what
7:42
comes before, let's say, Boston? So,
7:44
if someone says from
7:46
Boston, so that there should be
7:47
contextualization with the from to
7:49
Boston. So, because they did it
7:50
manually, they could just read it and
7:52
figure it out, that's what they mean,
7:54
right? That Boston is the departure
7:55
city and not the arrival city. So, do
7:57
they have two tags to Boston, which is
7:59
some like, you know, departure city as
8:01
well as arrival city
8:03
word Boston? In that particular phrase,
8:05
it's clear from that particular
8:07
case in the context of it as a human
8:08
reading it that Boston is a departure
8:10
city. So, it only gets that tag. In
8:13
that sentence. In some other sentence
8:15
where people are coming into Boston,
8:16
it'll have a different tag.
8:21
I was wondering if my query like the
8:23
others, basically there is like, for
8:25
example, if my query was
8:27
give me flights from Boston at 7:00 a.m.
8:29
and
8:29
uh the
8:31
flights from Denver at 11:00 a.m.
8:33
You mean like a compound query? Yeah.
8:35
So, this one only takes single queries
8:37
into account.
8:39
Because most people are like, you know,
8:40
give me a flight from here to there. Or
8:42
what is the cheapest thing from here to
8:43
there? And we'll see examples of queries
8:45
later on.
8:50
Okay.
8:51
Uh all right. So, that's the
8:52
deal.
8:53
So, basically,
8:56
you know,
8:58
this problem that we have here is
8:59
really a
9:02
word-to-slot multi-class classification
9:04
problem.
9:06
Okay?
9:07
Um because if you look at that
9:09
input, we want to be able to take that
9:10
input and a really good model will then
9:12
give you this as the output.
9:17
Right? Because this is what a human
9:18
would have done.
9:20
So, that is our problem. Okay?
9:23
So, the question is
9:25
um the key thing here is that each
9:27
of the 18 words in this particular
9:29
example must be assigned to one of 123
9:32
slot types, right? Each word. It's not
9:34
like we take the entire query and
9:36
classify the entire query into one of
9:38
123 possibilities. Every word in the
9:40
query has to be classified.
9:42
That is the wrinkle.
9:45
Okay?
9:46
So, now, if we could run the query
9:49
through a deep neural network and
9:51
generate 18 output nodes,
9:54
it goes through some unspecified deep
9:55
neural network. And when it comes out
9:57
the other end, the output layer has 18
9:59
nodes.
10:00
Okay?
10:01
Because that is
10:03
the dimension of the
10:04
output that we care about. 18 in, 18
10:06
out, right?
10:09
And then for each one of those 18 nodes,
10:11
maybe we could attach a 123-way softmax
10:15
to each of those 18 outputs.
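For concreteness, here is a minimal sketch of that "N words in, N slot distributions out" idea in PyTorch. The encoder is left as a placeholder, and the sizes (100-dimensional vectors, 123 slot labels, 18 tokens) are just the numbers used in this example, not a prescribed configuration.

```python
import torch
import torch.nn as nn

d_model, num_slots, seq_len = 100, 123, 18

encoder = nn.Identity()                    # stand-in for the unspecified network
slot_head = nn.Linear(d_model, num_slots)  # shared across all positions

tokens = torch.randn(seq_len, d_model)     # 18 words in
contextual = encoder(tokens)               # still 18 vectors out
logits = slot_head(contextual)             # 18 x 123
probs = logits.softmax(dim=-1)             # a 123-way softmax per word
print(probs.shape)                         # torch.Size([18, 123])
```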
10:20
By the way, isn't it cool that we can
10:21
just casually talk about sticking a
10:23
123-way softmax onto each one of the 18
10:25
nodes?
10:27
Folks, wake up.
10:31
You're not easily impressed. I'm
10:32
impressed by that.
10:34
So, okay.
10:37
So, here's the key thing,
10:39
right? We want to generate an output
10:41
that has the same length as the input.
10:45
But the problem is the inputs could be
10:47
of different lengths as they come in.
10:48
They could be short sentences, long
10:50
sentences, we don't know, right?
10:52
Yet we need to accommodate
10:55
this variable size of input that's
10:56
coming in.
10:58
But the key thing is the output has to
10:59
be the same length as the input, the same
11:00
cardinality as the input.
11:02
Okay, that's one big requirement.
11:05
In addition, we want to take the
11:07
surrounding context of each word into
11:08
account, right? To go to Ronak's
11:10
question, when you see the word Boston,
11:12
you can't conclude whether it's a
11:14
departure city or arrival city.
11:15
You have to look at what else is going
11:17
on around it. Is there a from? Is there
11:19
a to? Things like that to figure out
11:21
how to tag it. So, clearly the
11:22
context matters.
11:24
And then we clearly have to take the
11:25
order of the words into account.
11:28
Going from Boston to LaGuardia is very
11:29
different than going from LaGuardia to
11:30
Boston.
11:31
So, clearly the order matters.
11:33
Right? So, the context matters and the
11:35
order matters. And the output has to be
11:37
the same length as the input.
11:40
Okay?
11:42
So, context matters, right? Just a few
11:44
fun examples.
11:45
Remember from last week that the
11:47
meaning of a word can change
11:48
dramatically depending on the context.
11:50
And we also saw that the standalone or
11:53
uncontextual embeddings that we saw
11:55
last week, like GloVe, um
11:58
you know, they don't take context into
11:59
account because they give a single
12:01
unique embedding vector to every word.
12:04
And if a word ends up having lots of
12:05
different meanings, that vector is kind
12:07
of some mushy average of all those
12:09
meanings.
12:11
Okay. So,
12:13
the word see. I will see you soon. I
12:15
will see this project to its end. I see
12:16
what you mean. Very different meanings
12:18
of the word see. This is my favorite,
12:20
bank.
12:21
Uh I went to the bank to apply for a
12:23
loan. I'm banking on the job. I'm
12:24
standing on the left bank. And so on. Uh,
12:27
"it." Oh, this is actually
12:29
a good one. The animal didn't cross
12:31
the street because it was too tired. The
12:33
animal didn't cross the street because
12:34
it was too wide.
12:37
Can you imagine
12:39
a deep neural network looking at this
12:40
word "it" and trying to figure out what
12:42
the heck the word "it" means?
12:44
What is it referring to?
12:46
Tricky, right?
12:48
Um and then, you know, if you take the
12:50
word station, and I have the station
12:52
example here because we're going to use
12:53
it a bit more in the rest of the lecture.
12:55
You know, a station could be
12:57
a radio station, a train station, being
12:59
stationed somewhere, the International
13:00
Space Station. The list goes on.
13:03
So, clearly
13:04
context matters.
13:05
And
13:08
clearly order matters. You can come up
13:10
with your own examples. Let's keep
13:12
moving.
13:13
Okay?
13:15
So, the Transformer architecture
13:18
is a very elegant
13:20
architecture
13:22
which checks these three boxes
13:23
beautifully.
13:25
Okay?
13:26
Um it takes the context into account,
13:27
order into account, and then, you know,
13:29
whatever is produced out there
13:32
is the same length as whatever is coming
13:33
in.
13:34
And the reason it's called the
13:35
Transformer
13:36
is because if 10 things come in,
13:39
10 things go out, but the 10 things that
13:41
go out are a transformed version of the
13:43
10 things that came in.
13:46
That's why it's called the Transformer.
13:47
Okay?
13:48
If 10 things came in and like one thing
13:50
goes out, well, sure, it's been
13:52
transformed, but what is it? It's some
13:54
weird thing. But when 10 comes in and 10
13:56
goes out, the 10 is preserved. Each
13:58
one is getting transformed in
13:59
an interesting way.
14:01
That's why it's called the Transformer.
14:04
So, developed 2017, just dramatic
14:07
impact.
14:08
So, by the way, the effect of the
14:09
Transformer, um
14:11
Google had spent a lot of research effort on
14:13
machine translation and obviously
14:15
search. Uh and then when the Transformer
14:17
was invented, uh they took a model called
14:20
BERT, which we will uh see on Wednesday
14:22
in detail, and then they introduced BERT
14:25
into their search, and the results were
14:28
dramatic.
14:29
And from what I've read, apparently the
14:32
impact of doing that was significant.
14:34
Typically, when you make an improvement
14:35
to search, the improvement is very, very
14:37
marginal because it's already a very
14:38
heavily optimized system.
14:40
And then when the Transformer thing came
14:42
along, there was actually a significant
14:43
jump in search quality. So, for example,
14:46
and you can actually read this blog post
14:48
uh which came out when they introduced
14:49
BERT into search. It gives you a bit
14:51
more detail. But here, so if
14:54
you were querying something like uh you
14:56
know,
14:57
"Brazil traveler to USA needs a visa."
15:00
Right? You would think that it
15:02
should give you information about how to
15:03
get a visa if you're a Brazilian wanting to
15:04
come to the US, right? Uh but it turns
15:06
out the first result was how US citizens
15:09
going to Brazil can, you know,
15:11
get a visa.
15:13
So, clearly it's not taking the order
15:14
into account.
15:16
Uh but once they introduced it, boom,
15:19
the first thing was the US Embassy in
15:20
Brazil.
15:21
And a page on how to get a visa.
15:24
So, the effect was dramatic.
15:26
And so, this is a seminal paper,
15:30
right? And it's actually worth reading
15:31
the paper. And uh, you
15:34
know, this picture is
15:35
like an iconic picture at this point
15:38
in the deep learning community. And we
15:39
will actually understand this picture
15:41
by the end of Wednesday.
15:43
Um and so, but the funny thing is that
15:45
when the researchers came up with it,
15:46
they didn't realize, in some sense, like
15:48
what they had stumbled on uh because
15:50
they were really focused on machine
15:51
translation.
15:53
It's only the rest of the research
15:54
community that took it and started
15:55
applying it to everything else and found it
15:56
to be really, really effective.
15:59
Okay. So, we're going to take each one
16:01
of these things and figure out how to
16:02
address them and thereby build up the
16:04
architecture.
16:05
Any questions before I continue?
16:07
Yeah.
16:11
Is there any uh
16:13
benefits to discarding some of those
16:16
unclassified nodes before it goes out
16:18
rather than going like you have 18 words
16:21
input, discarding all the ones that
16:23
don't actually matter and just doing
16:24
like eight for your output?
16:26
Yeah, yeah. I think that's a totally
16:28
fine way to think about it. Basically,
16:29
what you're saying is that can we have a
16:31
two-stage model? The first-stage model
16:33
is like a O non-O classifier. And the
16:35
second-stage model only goes after the
16:37
non-Os. That's a totally fine way to do
16:38
it.
16:39
Yeah.
16:40
But as you can see, if you even if you
16:41
go with the just a simple one-stage
16:43
model, if you use a Transformer, you get
16:44
fantastic accuracy.
16:47
And we'll do the Colab in a bit.
16:50
Uh all right. So, let's take the first
16:52
thing. How do you take the
16:53
context of everything around the word
16:55
into account?
16:56
So,
16:59
so let's say that this is the
17:01
sentence we have. The train slowly left
17:03
the station.
17:04
Okay? For each of these words,
17:06
we can calculate a standalone embedding,
17:09
say something like Glove.
17:11
Okay? So, I'm just depicting these
17:13
standalone embeddings using these uh
17:15
you know, thingies here.
17:18
Please appreciate them because it took
17:19
me a while to get them to do in
17:20
PowerPoint.
17:22
Okay? So, these are W1 through W6. These
17:24
are the vectors standing up. Okay?
17:27
Um now, let's say that So, we can easily
17:29
do that.
17:30
Now, what we want to figure out is we
17:32
want to focus on the word station.
17:34
And since station could mean very
17:36
different things in different contexts,
17:37
we want to figure out how do we actually
17:39
take
17:40
station's embedding and contextualize it
17:43
using all the other words that are going
17:45
on in that sentence.
17:46
Okay? Clearly, it's a train station.
17:49
So, we need to take the fact that there
17:50
is a train involved to alter the
17:53
embedding of the word station. Right?
17:55
That's what taking context into account
17:56
actually means.
17:58
So,
17:59
how can we modify station's embedding so
18:03
that it incorporates all the other
18:04
words? That's the question.
18:07
Okay?
18:08
So, when you look at it this way,
18:11
imagine just for a moment,
18:14
just for a moment,
18:15
that
18:16
we
18:17
Now, some of the other words in the
18:18
sentence don't matter. The word the
18:20
probably doesn't matter.
18:22
But some of the other words like train,
18:24
slowly, left probably does matter.
18:26
And suppose, just magically, we have
18:29
been told
18:30
all the other words in the sentence,
18:32
this is how much weight you have to give
18:34
to them. These don't give it any weight.
18:36
Those give it a lot of weight. Okay?
18:38
Suppose we are told that.
18:39
Or to put it another way, and this this
18:41
is the word that's heavily used in the
18:42
literature,
18:44
someone tells you how much attention to
18:46
pay to the other words.
18:47
Whether you got to pay it a lot of
18:48
attention or very little attention.
18:50
Okay?
18:51
And this
18:52
how much attention to pay is given in
18:54
the form of a weight that you can use.
18:55
Okay? So,
18:57
um
18:58
if you look at it that way, from this
19:00
notion of which word should I give a lot
19:01
of weight to and very little weight to,
19:04
in this example, intuitively, which
19:05
words do you think should get the most
19:06
weight and which words do you think
19:07
should get the least weight?
19:09
Yeah. Train.
19:11
Train. Right.
19:12
Time matters.
19:13
Uh
19:14
you can do one at a time.
19:16
Train. Okay, thank you.
19:18
Uh
19:18
okay. Others?
19:21
Slowly.
19:22
Slowly. Right. So, that also seems to
19:23
have some bearing on it. What about
19:25
words that
19:27
we don't think are going to
19:28
help at all?
19:31
The. The. Exactly. It probably doesn't
19:33
do much here. In some contexts it actually
19:35
might make a difference, but in this
19:37
sentence, maybe not.
19:38
Right? Intuitively.
19:40
So,
19:42
we should probably give a lot of weight
19:43
to train, maybe a little to slowly and
19:45
left, and hardly anything to the.
19:47
Okay?
19:49
And so, this intuition that we have
19:52
can be written numerically as maybe we
19:56
have a bunch of weights that add up to
19:58
one.
20:00
Okay?
20:02
Okay, maybe something like this. So, we
20:03
are saying the train gets 30% weightage,
20:07
maybe 8% weightage to left, maybe 12%
20:11
weightage to slowly, uh and then as you
20:14
will see here,
20:15
the station's own embedding also plays a
20:17
role. Because we want to take its own
20:20
standalone embedding and just move it
20:22
slightly, change it slightly, which
20:23
means that has to be the starting point.
20:26
So, it will get a lot of weight. We
20:28
can't ignore itself, in other words.
20:30
Right? So, we give it maybe 40% weight.
20:33
By the way, these numbers I just made
20:34
them up.
20:35
Okay? Uh yeah.
20:38
I'm sorry, it's a quick question. So,
20:40
the weights
20:43
are they
20:44
Are they standalone for the
20:46
context of the entire sentence or are
20:48
they related to station that we started
20:50
off with? These six numbers are
20:54
only pertinent to station.
20:56
And for each word, we're going to do
20:57
something similar.
20:59
Yeah.
21:01
And at this point, does the model
21:03
understand order? Because like I'm just
21:05
thinking of like left because like I
21:07
gave it
21:08
a
21:09
very low weight. But let's say left
21:11
comes slowly, leave left station. The
21:14
station only have the two be higher.
21:15
Yeah, correct. So, at this point, we are
21:18
not worrying about order. We are only
21:20
worrying about context.
21:22
Later, we'll take order into account.
21:24
But how does the model know that left
21:25
here is of lesser importance because
21:28
it's a verb rather than a
21:31
It has to figure it out.
21:33
We are just
21:34
giving it a whole bunch of capabilities.
21:36
How it manifests those capabilities is
21:38
all going to emerge from training.
21:42
Okay. So, all right. So, let's say we
21:45
have something like this. So, what we
21:46
can do,
21:48
right? And we'll get to the
21:49
all-important question of where do we
21:50
get these numbers from in just a moment.
21:51
But suppose you had the numbers,
21:54
how can we use these numbers to
21:56
contextualize W6? What can we do?
22:00
What is the simplest thing you can do?
22:05
You have W6, you want to make it a new
22:07
W6, which is now contextual and aware of
22:10
what else is going on. Okay?
22:17
It's working now, I think.
22:20
We can take a weighted average. Exactly.
22:22
Exactly. So, when you have a bunch of
22:23
things and you have a bunch of weights
22:25
and, you know, we
22:26
have to somehow modify one of those
22:27
things with those weights, the simplest
22:29
thing you can do is to take a weighted
22:30
average.
22:31
Right? So, that's exactly what we're
22:33
going to do.
22:34
So, we're going to take all these
22:35
weights
22:37
and just like move them up.
22:39
Okay?
22:40
Move them up.
22:42
Don't even get me started on how long it
22:44
took me to get this arrow to run.
22:46
I don't know about you, folks. Is it
22:47
It's extremely painful to get the U-turn
22:49
arrows to work in PowerPoint.
22:51
Okay?
22:52
Anyway, uh back to work. So,
22:54
so we just move these up here, okay? So,
22:57
now we can do 0.05 * this vector + 0.3 *
23:01
that vector and so on and so forth.
23:03
And the result is just another vector.
23:06
Right?
23:08
And that vector, folks,
23:11
is the contextual embedding vector of
23:13
station.
23:15
Okay? That was the standalone embedding.
23:17
And now we multiplied this by
23:19
that, and so on, add them
23:21
all up, and then you get a new vector.
23:24
And contextual embeddings have this
23:27
bluish kind of color.
23:29
Okay?
23:30
And I'll maintain that color scheme as
23:32
we go along.
23:33
So, that's it.
23:36
That's it. That's the idea.
23:38
Any questions?
23:41
Yeah.
23:43
How did you come up with the original
23:44
weights again? You just kind of guessed?
23:46
No, these weights I just
23:49
hand typed them in manually just to make
23:51
the point. And And now I'm going to talk
23:53
about how we are actually going to
23:54
calculate them.
23:57
Okay.
23:58
Uh all right, cool. So, now I'm going to
24:00
uh okay, enough pictures. Let's switch
24:03
to some math. So,
24:05
so basically, let's write it
24:07
a bit more formally.
24:08
So, we have these W1 through W6, which
24:11
are the standalone embeddings.
24:12
And then for station, we want to
24:14
calculate, you know, W6 with a little
24:16
hat on it, which is the contextual
24:17
embedding. And the way we do it is to
24:19
say we calculate some weights for each
24:22
of these words. So, this weight S16
24:25
means the weight
24:27
of the first word on the sixth word,
24:30
which happens to be station.
24:32
The weight of the second word on the
24:33
sixth word, and so on and so forth. And
24:35
so, what we are saying is that W6 hat is
24:38
just, you know, this weight times W1,
24:40
plus that weight times W2, and so on,
24:41
that's it.
24:43
Okay?
24:45
I have to inflict all these, you know,
24:47
subscripts and all that because
24:48
you know, we need it.
24:51
All right. So, that's it.
24:53
That's what we have.
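As a quick numerical sketch of that formula: given standalone embeddings w1 through w6 and weights s16 through s66 that sum to one, the contextual embedding of station is just their weighted average. The numbers below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))      # rows w1..w6; embedding dimension 4 is arbitrary
s = np.array([0.05, 0.30, 0.12, 0.08, 0.05, 0.40])   # s16..s66, sums to 1

w6_hat = (s[:, None] * W).sum(axis=0)   # w6_hat = s16*w1 + s26*w2 + ... + s66*w6
print(w6_hat.shape)                     # (4,): same size as the standalone w6
```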
24:56
Now, let's talk about Okay, any
24:58
questions on the mechanics of it
25:00
before I get to Okay, where do these
25:01
weights come from?
25:02
Yeah.
25:06
Utilizing something like Google, for
25:08
example, like how does it understand
25:11
like the context of
25:12
new words
25:13
and context like
25:16
process immediately through the training
25:18
data the users played or
25:20
like basically
25:21
>> like a totally new word that didn't
25:22
exist before? A new word or a new
25:24
context to a word that already exists.
25:27
No, I think that the context is supplied
25:29
because the query coming into something
25:31
like Google is a full sentence.
25:33
And we only take that sentence and take
25:35
only the sentence into account as the
25:36
context for us.
25:37
So, the context is always present to us
25:40
when we get the input.
25:41
But the other question you had uh of
25:44
Okay, what if there's a brand new word
25:45
you've never seen before, for which
25:46
there is not even a standalone
25:47
embedding? What do you do then?
25:49
So, let's punt on that till Wednesday
25:51
because I have to talk about something
25:53
called byte pair encoding and stuff like
25:55
that before I can answer that.
25:57
And And really quickly, does that
25:59
immediately translate to their
26:00
predictive search queries?
26:03
Utilizing like verb
26:06
Yeah, a new word, for example.
26:08
Does that automatically get applied to
26:10
the predictive search queries like when
26:12
we're saying how to and then just home?
26:14
Oh, you mean like the auto complete?
26:15
You know, auto complete uses a slightly
26:17
different mechanism.
26:18
Um, they had a very complicated
26:20
non-transformer thing for a long time.
26:23
I'm sure they have a transformer version
26:24
now, but I'm not privy to how
26:26
exactly they've done it. So, I don't
26:28
quite know how they do it. But what
26:29
you're proposing is a reasonable way to
26:31
think about it.
26:33
Yeah.
26:34
Um my question is like we have six
26:36
words, station and but number parameters
26:39
as in weights, let's say 10 of them.
26:41
And then we have calculated the
26:43
contextual version of W6. Yeah. So, this
26:46
has a different parameter or it remains
26:48
the same? It replaces. Okay.
26:50
Yeah, W6 becomes W6 hat.
26:54
Okay. And how we are expecting
26:57
Right.
26:58
This contextual word will be really
27:00
good. That's what we want.
27:07
Do we lose that
27:08
or retain No, we lose it. And as you
27:11
will see here, as it flows through the
27:12
transformer, it's getting more and more
27:14
and more contextualized.
27:16
So, it's a left-to-right flow.
27:20
All right. Uh all right, great.
27:22
By the way, this thing that we did for
27:23
station, we will do it for each word in
27:25
the sentence.
27:27
The same exact logic. Obviously, the
27:30
weights are going to change.
27:31
Okay? But what will happen is that W1
27:34
through W6 will become W1 hat through W6
27:37
hat.
27:39
The same exact logic is going to hold.
27:41
Okay? That's what I just don't have the
27:43
slides for it because it's a waste of
27:44
time.
27:45
The same exact logic is going to hold.
27:47
All right. Now, switch gears
27:48
and answer the all-important
27:50
question of where are the weights going
27:51
to come from.
27:52
Okay? So, the intuition here is really
27:54
really interesting and elegant.
27:56
So, clearly the weight of a word
27:59
should be proportional to how related it
28:02
is to the word station.
28:04
Right?
28:06
The word train clearly is very related
28:08
to the word station.
28:09
The word the, it's not clear how
28:11
related it is. Probably not all that
28:12
related. So, the relatedness matters to
28:15
the weight. More related, higher the
28:17
weight, right? Just intuitive.
28:19
So, one way to quantify how related two
28:21
words are is to take their standalone
28:23
embeddings and calculate the dot
28:25
product.
28:28
Okay? So, um
28:30
in case folks have
28:33
sort of forgotten about the dot product,
28:39
Oops, that's not what I want.
28:42
So, um, let's say you
28:44
have a vector.
28:50
Okay, let's say this is the
28:51
vector for
28:52
train.
28:55
This is the vector for station.
28:59
Okay? So, the dot product of these two
29:01
vectors,
29:05
I'll write it as train
29:09
station
29:12
equals
29:13
basically the length
29:17
of
29:20
the vector for train
29:23
times the length
29:26
of the vector for station
29:30
times the cosine
29:33
of the angle between them.
29:36
Okay?
29:38
Okay?
29:42
So, how long is each vector?
29:45
Product of the two and then the angle
29:46
between them. Okay? Now, let's assume
29:48
for simplicity that these lengths are
29:50
roughly the same.
29:52
They're just one unit length. Okay? Just
29:54
roughly.
29:55
So, if you assume that,
29:57
okay? This thing, let's say,
30:01
becomes one.
30:03
Okay?
30:05
This thing becomes one.
30:07
So, all the action
30:09
is here.
30:11
Okay?
30:12
So, all the action is here.
30:14
So, basically, the dot product of these
30:15
two vectors is really the cosine of
30:17
angle between them.
30:20
So, now, the question is, if you have
30:22
something like this,
30:27
right? Which are very close to each
30:28
other, the cosine of a very small angle,
30:31
actually, the cosine of zero is what?
30:34
One.
30:35
So, if the angle is really, really
30:37
small, the cosine is going to be very
30:39
close to one.
30:40
Right? The cosine
30:41
of zero is one. So, this thing is going
30:43
to be, you know, pretty close to one.
30:46
If you have a cosine of two vectors that
30:49
are like this, 90° apart, what is the
30:51
cosine?
30:52
Zero. They're orthogonal, right? Which
30:55
maps to the English orthogonal.
30:58
So, the cosine of that is zero.
31:00
And then, if you have something like
31:01
this,
31:03
where they're literally pointing in
31:04
opposite direction,
31:07
what is the cosine of that 180?
31:09
Minus one.
31:11
So, that's it. So,
31:13
if these two vectors are
31:14
very close to each other,
31:16
the cosine of the angle between them is
31:18
going to be very close to one. If they
31:19
are really kind of unrelated, it's going
31:21
to be zero. If they're anti-related,
31:22
it's going to be minus one.
31:24
Right? So, that's how dot products
31:27
capture this notion of closeness or
31:28
relatedness.
31:30
Okay?
31:31
So, all right. Um iPad.
31:36
So, we can use the dot product of these
31:37
embeddings to capture relatedness.
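Here is a tiny sketch of that relatedness idea with toy 2-dimensional vectors. If the vectors are normalized to roughly unit length, the dot product is roughly the cosine of the angle between them: close to 1 for similar directions, near 0 for orthogonal ones, and -1 for opposite ones.

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)   # scale to unit length so dot product ~ cosine

train   = unit(np.array([0.9, 0.4]))
station = unit(np.array([0.8, 0.6]))
the     = unit(np.array([-0.5, 0.7]))

print(np.dot(train, station))   # ~0.97: nearly parallel, highly related
print(np.dot(the, station))     # ~0.02: nearly orthogonal, barely related
```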
31:40
And so, okay, iPad done.
31:43
So, now that
31:45
we know that dot products can be used,
31:48
we can't use them as is because we need
31:49
to do one more thing to make them proper
31:51
weights. And what I mean by proper
31:53
weights is that we want the weights
31:55
to be, first of all, non-negative, and
31:58
we want them to add up
31:59
to one, right? That's what a
32:00
weighted average actually is going to
32:01
mean.
32:02
But these cosines could be negative.
32:05
Right? And so, we need to now adjust
32:07
them to make them proper so that every
32:08
one of them is guaranteed to be
32:10
non-negative and they will add up to
32:11
one.
32:12
When was the last time you had to take a
32:14
bunch of numbers, which could be
32:15
anything, and then somehow make sure
32:18
that they are going to be positive,
32:20
non-negative, and they add up to one?
32:22
When was the last time?
32:23
Yeah, softmax. Exactly. So, we'll do the
32:25
same trick.
32:27
So, what we'll simply do is we'll just,
32:29
you know, exponentiate them, right? So,
32:32
like this W1 W6, this angle bracket
32:35
thing is the dot product. That's the
32:36
notation I'm using. EXP of that is just
32:39
you exponentiate them, e raised to that.
32:41
And once you exponentiate them, they all
32:42
become non-negative, and then we just
32:44
divide each by the sum of everything.
32:46
So, the whole thing will become like
32:47
a probability, right? It'll just add up
32:48
to one.
32:50
Make sense? So, that's how we take
32:52
arbitrary numbers and make them proper
32:53
weights.
32:56
All right.
32:59
So,
33:01
to summarize,
33:02
from embeddings to contextual
33:04
embeddings, that's what we do.
33:05
We take all the stand-alone embeddings,
33:08
we calculate these weights using this
33:09
formula, and then we just do the
33:11
weighted average, and we arrive at the
33:12
contextual embedding, and boom, done.
33:16
Okay?
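Putting the last two steps together, here is a minimal self-attention sketch in NumPy: pairwise dot products, a softmax per word to turn them into non-negative weights that sum to one, and then the weighted averages. Note there are no learned parameters in this version; that point comes up again below.

```python
import numpy as np

def self_attention(W):
    scores = W @ W.T                                # all pairwise dot products
    scores -= scores.max(axis=1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # softmax: each row sums to 1
    return weights @ W                              # weighted average per word

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 100))     # 6 standalone embeddings, 100 dimensions each
W_hat = self_attention(W)
print(W_hat.shape)                # (6, 100): same count in, same count out
```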
33:17
And so, by choosing weights in this
33:20
manner, the embedding of a word gets
33:22
dragged closer to the embeddings of the
33:24
other words in proportion to how related
33:26
they are. So, just imagine for a second,
33:29
right? In this case, station obviously
33:30
has many contexts, but let's assume for
33:31
a second that it only has the train context
33:33
and the radio station context.
33:35
Okay?
33:37
In the current context, train is closely
33:39
related to station, and therefore exerts
33:40
a strong pull on it.
33:42
Right?
33:43
Now, radio is also related to station,
33:45
but it doesn't appear in the
33:47
sentence.
33:48
So, effectively, it has a weight of
33:49
zero.
33:52
Okay? And that's the beauty of it. And
33:55
And please do not ask me things like,
33:56
you know, I I was listening to a great
33:58
song on the radio station and the train
33:59
pulled out of the station.
34:01
Okay? Transformers can deal with stuff
34:03
like that. Okay? But yeah, but you get
34:05
the idea, the main idea.
34:07
So, by moving station closer to
34:09
train,
34:11
by paying more attention to train, we
34:13
are contextualizing the word
34:15
station's embedding to the context of trains,
34:18
platforms, departures, tickets, and so
34:20
on. It's like this portal into the whole
34:22
train world.
34:25
Right? It's beautiful. This simple idea
34:27
will get you there.
34:30
Okay?
34:31
So, this, folks, is called
34:33
self-attention.
34:36
What we just described is called
34:37
self-attention.
34:39
And it's the key building block of
34:41
transformers.
34:42
Okay? Um and so, to
34:44
summarize, stand-alone embeddings come
34:46
in, contextual embeddings go out.
34:50
Any questions?
34:52
Uh yeah.
34:54
Uh I'm still struggling a little bit
34:56
with the intuition of the word
34:58
contextual embedding. So, like the
35:00
weight of station in the station
35:02
embedding, how how should I think about
35:03
that? It seems intuitive that it would
35:05
be high for all contextual embeddings,
35:07
but I assume that's not the case.
35:12
It'll be high. It'll typically be a
35:13
high number because the cosine of
35:15
the vector to itself is going to be
35:17
one, right? So,
35:19
it's going to be pretty high, but it
35:20
there's no guarantee it's going to be
35:21
the highest.
35:22
Right? Because the length
35:24
doesn't actually have to be one. They
35:26
could vary. We try to keep them kind of
35:28
smallish, but they don't have to be.
35:30
Uh so, the way I would think about it is
35:31
imagine that you take an average of
35:33
everything else first, and then you
35:35
average it with the old
35:37
embedding.
35:38
Effectively, it's the same as just
35:39
calculating the different weights and
35:40
averaging the whole thing together.
35:42
Sure.
35:44
So, why should you say that the
35:45
embedding of a word would be the same
35:47
number but same place? But is this the
35:50
reason why you need a contextual
35:52
embedding?
35:53
But even if it's like
35:55
another word
35:56
and it's not related, that's what
35:59
I'm saying. Correct. Correct. Exactly.
36:01
Exactly. And the other thing to remember
36:02
is that
36:04
by keeping sort of
36:07
the size of the input cardinality intact
36:09
as you move through the transformer
36:10
stack,
36:11
when you finally come out the other end,
36:12
there is sort of no loss of information.
36:14
And in the very end, you can choose to
36:16
aggregate, simplify, summarize, and so
36:18
on and so forth. It preserves your
36:19
optionality as long as possible.
36:23
Do you know
36:25
how long the contextual
36:27
embedding is?
36:28
Is that a factor between the
36:29
two?
36:31
You know
36:33
Yeah, so, what we do is the sentence
36:34
comes in. There's a whole notion of
36:35
something called a context window, or
36:37
what is sort of the maximum
36:39
sentence length it will handle, and
36:40
that's a parameter you can set. And
36:42
we'll come to that when you actually
36:43
look at the Colab.
36:44
Um
36:46
Was that a question in the middle? No.
36:48
Okay.
36:49
All right. So, that is self-attention.
36:53
Um and now,
36:55
because that felt too easy,
36:58
we're going to do a little tweak called
37:00
multi-head attention.
37:02
So,
37:03
this is this is the self-attention we
37:04
just saw.
37:06
What we can do is we can be like, you
37:07
know what?
37:08
Why can't we have more than this? Why
37:10
can't we have more than one of these?
37:12
So, this is called an attention head,
37:13
self-attention head. We'll have multiple
37:16
self-attention heads. Okay?
37:18
Now, and I'll come back to the top thing
37:20
in a second, okay? So, the question
37:22
is, why should we have multiple
37:23
self-attention heads?
37:25
Because a particular attention head is
37:26
going to pick up some patterns. The
37:28
reason is because
37:30
it'll help us attend to the multiple
37:32
patterns that may be present in a single
37:34
sentence.
37:35
So far, when I've been explaining, uh
37:37
I've sort of basically been looking at
37:38
what the meaning of these words are.
37:40
Just the meaning of these words. But in
37:42
any complicated sentence, you have to
37:44
worry about grammar, you have to worry
37:45
about tense, you have to worry about
37:47
tone. You have to worry about facts
37:49
versus, you know, opinions. There could
37:51
be any number of complicated patterns
37:53
that are sitting in a simple sentence.
37:55
Which means, well, there is just not one
37:57
way to pay attention. There could be
37:59
many ways of paying attention, many sort
38:02
of There could be many needs to pay
38:03
attention. Right?
38:05
Which means, let's have many
38:07
of these attention heads.
38:09
And each one could be learning something
38:10
else. It's exactly like having lots of
38:12
filters in a convolutional network.
38:14
Right? Uh one filter might learn a line,
38:16
another filter might learn a curve, and
38:17
so on and so forth. And we don't want to
38:19
decide a priori, oh, you're going to
38:21
learn a line, right? Similarly here,
38:22
we're not telling any of these things
38:23
what you have to learn. They just have
38:25
to learn based on the training process.
38:27
So, what we do is
38:28
So, actually, this is an example where
38:30
this is from the original transformer
38:32
paper, where this sentence is the lawyer
38:35
will Sorry, the law will never be
38:37
perfect, but its application should be
38:39
just. This is what we are missing, in my
38:43
opinion.
38:44
A complicated sentence, right? So, the
38:46
first attention head, actually, this
38:48
is the pattern of things it's attending to.
38:50
So, for example, the word perfect here,
38:53
the contextual embedding of the word
38:54
perfect
38:57
draws upon heavily from the word law
39:00
in this example.
39:01
Okay?
39:02
If you look at another attention head,
39:04
the contextual embedding for the word
39:06
perfect is actually drawing heavily from
39:07
just perfect and nothing else. Right?
39:11
And if you look at other words, the
39:13
patterns are subtly different of what
39:14
it's paying attention to.
39:17
So, these are two different attention
39:18
heads, and they're learning different
39:20
kinds of attentions.
39:21
Okay? In reality, trying to make sense
39:24
of why they
39:25
pay attention the way they do, it's
39:27
usually quite sort of difficult to
39:29
figure that out. You can't actually
39:30
interpret it. But when you have lots of
39:32
attention heads, the performance on the
39:34
task that you care about gets really
39:35
much better.
39:37
Right? And then you're saying, okay, I
39:39
can use that. Uh yeah.
39:40
That's the
39:42
I think that's the idea behind this. Is
39:43
that the idea behind this?
39:49
Right.
39:50
Exactly. Same logic. Same logic.
39:53
Yeah.
40:13
Actually in the convolutional case, the
40:15
ones and zeros I had were just example
40:17
numbers to show that that particular
40:19
filter could detect a vertical line or
40:21
horizontal line. You will recall that
40:23
when we actually train a convolutional
40:24
network, we actually don't specify the
40:26
numbers. We start with random
40:27
initialized weights and then we let
40:30
backpropagation figure it out.
40:32
Similarly here, we don't decide any of
40:34
these things. We just let back prop
40:35
figure it out.
40:37
Okay? And now the question of what are
40:39
the weights that are actually going to
40:40
be learned? We'll come come to that in a
40:42
bit.
40:43
Okay? Uh yeah.
40:47
Uh I was wondering how come we have
40:50
different attention head even though
40:53
uh it seems like they're only function
40:55
of a dot product and we have the same
40:57
dot product for same embeddings.
40:59
Great question. Great question. And I
41:01
literally have a a note in my slide
41:02
saying, "If a student asks this good
41:04
question, tell them to wait till
41:06
Wednesday."
41:08
So, great question. And we'll come back
41:10
to that uh on Wednesday and spend a fair
41:12
amount of time on it. So, uh
41:14
the point that's being made here
41:17
is that oops.
41:19
When we look at self-attention,
41:22
the embeddings came in and we did all
41:24
these dot products and the contextual
41:26
things popped out the other end. Note
41:28
that inside the self-attention box,
41:30
there are no parameters.
41:32
There are no parameters.
41:34
So, the question that is being raised
41:36
here is that so what are we learning
41:38
really? If there is nothing inside to be
41:40
learned, if there are no parameters, no
41:42
coefficients, what are we learning?
41:43
That's the question. And by extension,
41:46
if we have two of these and neither of
41:48
them is learning anything, what's the
41:49
point?
41:52
Sadly, you have to wait till Wednesday.
41:55
Okay? But we have a great answer to the
41:57
question. So,
41:58
it'll be worth it. And if you can't
42:00
stand the suspense, read the book.
42:03
All right. So, that is uh that's why we
42:05
need multiple heads. Okay? And now to
42:07
come back to this, so what we do is it
42:09
goes through this head and you get these
42:11
W's, right? And it goes through here and
42:13
we get the another set of W's.
42:15
Then what we do at the very end is we
42:17
concatenate them.
42:19
Okay? We concatenate them and we do a
42:21
projection. And this is what I mean by
42:23
that.
42:29
So, we have
42:30
uh this this is one self-attention head,
42:33
self-attention one.
42:35
This is self-attention two.
42:38
And let's say that
42:41
W1 hat comes out.
42:44
And I'm just going to call it Z1 for
42:47
the same thing so that there's no name
42:48
clash.
42:49
Okay? And uh the W2, W6, all of them are
42:52
coming, right? Let's focus on W1 and Z1.
42:55
W1 and Z1 are both contextual embeddings
42:57
for the same word.
42:59
Okay? For the first word, word one. And
43:01
so what we do is let's say this is W1, uh
43:04
let's say this vector is like
43:06
this. Okay?
43:07
And let's say that this vector is like
43:10
this.
43:12
What I mean when I say concatenated here
43:14
is we literally take
43:16
um this word here,
43:18
this embedding here, then we take this
43:20
thing here.
43:23
Okay? And we just make it a long vector.
43:25
We concatenate it. But now this vector
43:27
has become twice as long, right?
43:30
So, but remember, we always want to
43:32
preserve the number of inputs
43:34
we have and the lengths of these vectors
43:36
everywhere as we go along. So, what we
43:39
do is at this point, we run it through
43:42
a single dense layer
43:44
which will take this thing and make it
43:46
back into the same small shape as
43:48
before.
43:50
So, this is a dense layer.
43:54
That's it. So, this vector comes in
43:56
and it becomes it gets compressed back
43:58
to the original shape that came out of
44:00
here.
44:01
So, you could have like 20 of these uh
44:03
attention heads
44:04
and the concatenated vector will be 20 times as
44:06
long and then just project, boom, one
44:08
dense layer comes back to the original
44:09
shape.
44:12
So, that's that is the projection step.
44:16
And that's what I mean here when I say
44:17
concatenate and project.
44:20
So, at this point, what we have is
44:21
things come in, we contextualize them
44:23
using these different attention heads,
44:25
and when they come out of the attention
44:27
heads, we take them all, we just like
44:29
concatenate them, and then compress them
44:31
back to the same original starting
44:32
shape. Right? If these vectors are 100
44:35
units long or 100 dimension long,
44:37
whatever comes out is 100 still.
44:39
And preserving this
44:42
size as we go along is very important
44:43
for reasons that'll become apparent a
44:44
bit later.
44:46
Okay. So, that is the multi-attention
44:49
thing.
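A minimal sketch of the concatenate-and-project step, assuming each head has already produced its contextual vectors. The head count and dimensions below are illustrative, not the lecture's exact configuration.

```python
import torch
import torch.nn as nn

d_model, num_heads, seq_len = 100, 20, 6
project = nn.Linear(num_heads * d_model, d_model)   # the single dense projection layer

# Pretend each attention head already produced a (seq_len x d_model) output.
head_outputs = [torch.randn(seq_len, d_model) for _ in range(num_heads)]
concatenated = torch.cat(head_outputs, dim=-1)      # 6 x 2000: twenty times as wide
projected = project(concatenated)                   # 6 x 100: back to the original shape
print(concatenated.shape, projected.shape)
```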
44:50
Now, a final tweak for today
44:53
is that we will inject some
44:55
non-linearity
44:57
with some dense ReLU layers
44:59
at the very end. So, we went through a
45:01
bunch of attention heads. We came up
45:03
with a bunch of contextual embeddings
45:04
now.
45:05
So, at this point,
45:07
since there are no
45:08
parameters inside these boxes,
45:10
uh
45:11
right? And there are some parameters
45:13
here.
45:13
We need some non-linearity. So
45:15
far, there's been nothing that's
45:16
non-linear. So, here we actually
45:18
send it through one or more ReLUs.
45:21
Typically, they just use one ReLU. So,
45:24
and what I mean by that
45:34
Sorry.
45:37
So, this is what we had here and then
45:41
we take it in
45:46
and then run it through
45:50
actually
45:54
we typically run it through
45:57
a ReLU.
45:58
This is a nice ReLU.
46:01
Okay? And the rule of thumb,
46:03
as you will see, if let's say this
46:04
vector is say 100 dimensions long, they
46:06
typically will choose a ReLU which is
46:08
about 400
46:10
wide. And then it just gets projected
46:12
out again back to 100.
46:16
So,
46:17
this is just a simple, you know, the
46:20
input comes in, goes through a single
46:21
hidden layer with four times as
46:23
many units as here, and then
46:26
another dense layer projects it
46:28
to 100 again.
46:29
And since there are ReLUs here,
46:32
we have injected some
46:33
non-linearity into the processing.
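A minimal sketch of that dense ReLU step: expand each vector to about four times its width, apply a ReLU, and project back down so the shape is preserved. The 100 and 400 are the rule-of-thumb numbers from the lecture, not a requirement.

```python
import torch
import torch.nn as nn

d_model, d_hidden = 100, 400
ffn = nn.Sequential(
    nn.Linear(d_model, d_hidden),   # expand to ~4x the width
    nn.ReLU(),                      # the non-linearity
    nn.Linear(d_hidden, d_model),   # project back to the original size
)

x = torch.randn(6, d_model)   # 6 contextual embeddings in
y = ffn(x)                    # 6 x 100 out: same shape, non-linearity applied
print(y.shape)
```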
46:35
Okay? Now,
46:37
a lot of this stuff when it came out
46:39
felt very ad hoc.
46:41
Right? It didn't come from some deep,
46:43
you know, theoretical motivations.
46:45
But people had strong intuitions as
46:47
to why these things were helpful. And as
46:49
it turns out, since the transformer came
46:51
out, people have tried to optimize every
46:53
aspect of this thing.
46:55
It's actually pretty difficult to beat
46:56
the starting architecture.
46:58
Right? Improvements have been made, but
47:00
it's actually a very robust architecture.
47:02
So,
47:03
so that's what's going on here. And then
47:05
when we come out of this thing,
47:08
this is what we have, the story so far.
47:10
We start with random standalone
47:13
embeddings. This could be
47:14
GloVe embeddings, it could be random
47:15
weights, doesn't matter. It goes through
47:18
a bunch of self-attention heads. We
47:19
concatenate it when it comes out the
47:21
other end.
47:23
Concatenate it when it comes out the
47:25
other end. And then we project it back
47:27
to the same size as before. Then we run
47:29
it through, you know, a ReLU followed by
47:31
a linear layer and we get these things
47:33
again. So, in this whole process, if six
47:36
things came in, six things will come
47:37
out. And if those six things
47:40
that came in
47:41
were embedding standalone embedding
47:43
vectors of 100 dimensions, what comes
47:45
out is also 100 dimensions.
47:47
So, in that sense, you could think of
47:48
this whole thing as a black box in which
47:50
whatever you send in, the same number of
47:52
things will come out of the same length.
47:54
The numbers will be different because
47:56
they will have been heavily
47:56
contextualized.
47:58
The numbers are much smarter, in other
48:00
words.
48:02
So, so far what we have seen is that we
48:04
have satisfied two of the three
48:05
requirements. We have taken the context
48:08
of each word into account
48:09
by using these dot products in the
48:11
self-attention layer, and we can
48:12
generate an output that is the same
48:13
length as the input, but we have
48:15
ignored word order
48:17
completely.
48:19
Okay? Because whether I had said the
48:21
train slowly left the station or I had
48:23
said the the station slowly left the
48:25
train,
48:26
this thing won't know the difference.
48:30
Because dot products
48:32
function on sets, not on sequences. They
48:34
function on sets.
48:36
Okay? You should
48:37
convince yourself of this. Regardless of
48:39
the order, the dot product calculation
48:40
doesn't change anything.
48:42
Because we are doing every pair.
48:46
Okay? So, the question is how do we take
48:48
the order of the words into account? Um
48:50
right. As I was saying, we can scramble
48:52
the order of the words in a sentence and
48:53
we'll get the exact same contextual
48:54
embeddings at the end.
48:55
So, by the way, if you're working on a
48:57
problem in which the order doesn't
48:58
matter,
49:00
then you can stop right now and use the
49:01
transformer.
49:04
And there are many problems that are
49:05
actually in that category where the
49:06
order doesn't matter. So, if you take
49:08
traditional structured data, right? Uh
49:10
tabular data,
49:12
uh you know, blood pressure, cholesterol
49:14
level, boom boom boom. Does it predict
49:15
heart disease? Well, there is no order
49:17
in that thing. You can use the
49:18
transformer as is without doing anything
49:20
more.
49:22
So, transformers work for both sets and
49:24
sequences where order matters.
49:27
Okay. So, the fix for this is something
49:29
called the positional encoding.
49:32
Um
49:33
so what we do is very simple. There are
49:34
many things that have been
49:36
invented um
49:40
to give a transformer some information
49:42
about the order of each of the things
49:44
that are coming in.
49:45
I'm going to go with, you know,
49:46
something that is
49:47
the simplest possible way, which actually
49:49
works pretty well in practice. So, what
49:51
we do is
49:52
for each position
49:55
each possible position in the input
49:56
starting from the first position all the
49:58
way through the last position
50:00
we imagine that that position itself is
50:02
a categorical variable.
50:05
Right? If a sentence can only be 30
50:07
words long, let's say, we say that hey,
50:10
the position of each word is a number
50:11
between 0 and 29.
50:14
And so, we can just think of it as a
50:16
categorical variable.
50:17
And because it's a categorical variable, we
52:20
can just imagine an embedding
52:22
for each potential value. So, it'll
50:24
become clear in just a moment because I
50:25
have a numerical example.
50:27
And so, what we do is we will just take
50:28
that standalone embedding and then we'll
50:30
take this position embedding
50:32
which represents the position of the
50:33
word in the sentence, we just add them
50:35
up.
50:36
Okay? Uh yeah.
50:39
So, if
50:40
in the initial sentence itself, I have a
50:43
mistake, so I just write it as the train
50:45
slowly the station.
50:48
So, which means my output is actually
50:49
going to be wrong. Yes.
50:52
Now, the transformers are since they're
50:53
trained on lots of data,
50:55
they will be quite robust to these
50:57
things.
50:58
But strictly arithmetically speaking
51:00
correct, yes.
51:02
Um okay. So, here's let's look at an
51:05
example.
51:06
Let's assume that
51:08
um
51:09
your standalone embeddings, right? This
51:11
is your vocabulary, okay?
51:13
Unknown, cat, mat, I, sit, love, the,
51:15
you, on. That's it. That's our
51:17
vocabulary.
51:18
And for this vocabulary, we have these
51:20
standalone embeddings.
51:22
And just for argument, let's assume
51:23
these embeddings are only two long.
51:26
Okay? The dimension of these embeddings
51:27
is two.
51:28
If you recall the GloVe embeddings we
51:30
used last week, I think they were what?
51:31
100 long?
51:33
And the ones we're using in the homework
51:34
are even longer than that.
51:35
Um but here we are assuming they're only
51:37
two long, okay? So, the embedding for
51:39
cat is 0.5, 7.1.
51:42
All right. Now, let's assume that we
51:45
can have at most 10 words in any
51:47
sentence that's coming in.
51:49
And obviously, a particular word could
51:50
be in position 0 all the way through
51:52
position 9.
51:53
And we will learn embeddings for each of
51:56
these positions, and these embeddings
51:57
are also two long.
51:59
Two units long. Dimension two.
52:03
Okay?
52:04
Now, where will these embeddings come
52:06
from?
52:07
What's the answer to that question? What
52:09
is the answer to the general question of
52:10
where will these weights come from?
52:14
We will learn it with backprop.
52:18
Okay?
52:20
We will start initially with random
52:21
numbers and then we'll make
52:23
them better and better
52:24
over the course of training.
52:26
So, what we do is we have these two
52:28
tables
52:29
of embeddings.
52:30
Um the standalone embedding for the word
52:32
and the position embedding.
52:34
And then, we literally add them up.
52:37
So, for example, let's say the
52:39
sentence that came in is cat sat mat.
52:41
That's the sentence. It's got three
52:43
words, cat sat mat. So, what we do is we
52:46
say, well, the embedding for cat is this
52:49
thing here, 0.5, 7.1.
52:51
So, I write it here, 0.5, 7.1.
52:53
Cat happens to be in the zeroth position of the sentence.
52:56
So, I grab the embedding for zero, which
52:58
is 1.3, 3.9. I stick it there, and then
53:01
I literally add them up: 0.5 + 1.3 = 1.8,
53:04
and 7.1 + 3.9 = 11.0. That's it.
53:07
So, now the positional encoded embedding
53:10
for the word cat is 1.8, 11.0, not 0.5,
53:15
7.1.
53:18
So, if cat happens to show up in another
53:20
part of the sentence, let's say instead
53:22
of cat sat mat, we had
53:25
mat sat cat.
53:28
Now, cat is in the third position,
53:29
right? Which is index 2, counting 0, 1, 2. Which means
53:33
its embedding doesn't change. It's just
53:34
the embedding for cat, but now instead
53:36
of picking zero, we'll pick this one,
53:38
0.6, 8.1, and put that here and add them
53:40
up instead.
53:43
So, this is the idea of the positional
53:45
encoding.
53:46
This is how we inject position knowledge
53:48
into the transformer.
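Here is a tiny sketch of that arithmetic; the values for cat and the position table are the ones from the slide, while the sat and mat embeddings are made-up placeholders.

```python
# Positional encoding as described: look up the word embedding and the
# position embedding, then add them. "cat" and the position table are
# from the slide; "sat" and "mat" are made-up placeholders.
import numpy as np

word_emb = {"cat": np.array([0.5, 7.1]),
            "sat": np.array([2.0, 0.3]),    # placeholder
            "mat": np.array([4.4, 1.2])}    # placeholder
pos_emb = [np.array([1.3, 3.9]),            # position 0
           np.array([6.3, 3.7]),            # position 1
           np.array([0.6, 8.1])]            # position 2

sentence = ["cat", "sat", "mat"]
encoded = [word_emb[w] + pos_emb[i] for i, w in enumerate(sentence)]

print(encoded[0])    # [ 1.8 11. ]  -> cat in position 0, as on the slide
```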
53:52
Yes.
53:54
Um
53:55
the positional embedding would be
53:56
different for each sentence, right? How
53:58
do you... No, this is just one table, which
54:00
tells you what the position is.
54:01
So, it says: for a word that appears
54:04
in the seventh position in any input
54:06
sentence that you're feeding in,
54:08
this is the embedding that you need to
54:09
use
54:11
for that position.
54:16
If the word appears twice in the same
54:19
sentence, how do you handle that?
54:21
Great question. So, let's say, just for argument, the sentence was cat cat cat.
For each one of those cats,
this standalone embedding will be the same,
0.5, 7.1, because that happens to be
just the embedding for cat regardless of
position.
54:39
But then, for the first cat, we will use 1.3, 3.9
54:45
as the addition. For the second cat,
54:47
we'll use 6.3, 3.7. The third cat will
54:50
use 0.6, 8.1.
54:51
So, only the thing we are adding, the positional embedding, will change. So, the resulting
54:57
sum is going to be different for each of
54:58
these three words, even though they're
54:59
exactly the same word.
55:05
Is that position embedding table
55:07
specific to the standalone embedding
55:09
table? Like if you were to add or remove
55:12
some words from the standalone table? It's
55:14
independent.
55:15
Independent. It only depends on your
55:18
assumption about how long the sentences
55:19
can be.
55:21
That's it.
55:21
It doesn't really care about what
55:23
words are coming in. That's a whole
55:24
different thing.
55:26
So, these are two independent tables
55:27
that are just learned as part of this
55:28
process.
55:31
So, yeah, I have the same thing for sat
55:33
and mat.
55:35
Sat and mat, that's what we have.
55:39
So, just make sure you understand these
55:40
two slides, to really make sure the
55:42
mechanics are clear. Yeah.
55:46
How do you control for filler words? For
55:48
example, if you're taking
55:50
NLP output for transcription and you're
55:53
trying to run a transformer and you have
55:55
a lot of
55:56
um's and likes that are
55:58
disproportionately large and have these
56:00
random assignments or
56:03
really deep embeddings, are there other ways to look through the noise?
56:07
Typically, what they do is um
56:09
as we'll talk about, there's this thing
56:10
called byte pair encoding in which we
56:12
take individual characters,
56:14
fragments of words, and words into
56:16
account as tokens. So, when you hear
56:18
stuff like uh and so on, it gets mapped
56:21
to these small tokens.
56:23
Right? And then we treat them as just
56:24
any other token.
56:28
Um yeah, is the aggregation just a simple sum here? Wouldn't the actual
semantic meaning of the standalone word
be more important than its
relative position in the sentence?
56:40
It could be. We just don't know a priori
56:42
whether it's going to be important or
56:43
not for any particular sentence.
56:45
When we train the transformer with a
56:46
lot of textual data,
56:48
right? It'll just figure out the right
56:50
values for these things so that on
56:51
average, the accuracy is as high as
56:53
possible.
56:55
So, in many of these things, there's
56:56
always a tension between our human
56:58
intuition as to how it should work and
57:00
whether you should just throw it into
57:01
the meat grinder of backprop and see
57:02
what happens.
57:04
And so, here it turns out you
57:05
can just throw it into backprop, it'll
57:06
actually do a pretty good job.
57:08
Uh yeah.
57:10
For the positional encoding, we would
just be using the sum vectors, like this 2 by 3 matrix
that you have here, right?
57:20
Uh oh yeah, this is just for
57:21
demonstration. Basically, this is the
57:23
thing that will actually go into the
57:24
transformer. Correct.
57:26
Yeah.
57:28
That was just me being overly verbose in
57:30
the slides.
57:31
Uh yeah.
57:33
I can see sentences in the input. At
57:35
this point, are we still parsing out
57:36
punctuation or if we have like a
57:38
multi-sentence input, is there a
57:40
positional embedding vector for each of
57:41
the sentences? Yeah, so here um
57:44
basically, the starting point is tokens.
57:47
Right? And in our example, because we're
57:48
working with the idea of simple
57:50
standardization and stripping and things
57:51
like that, I'm just showing actual
57:53
words.
57:54
If you go to something like GPT-4, since
57:56
it uses a different tokenization scheme,
57:58
uh each token might be part of a word.
58:01
It might be an individual
58:02
character, it might be a punctuation
58:03
mark. In fact, the GPT
58:06
family doesn't strip out punctuation.
58:08
Which is why when you ask a question, it
58:10
comes back with intact punctuation in
58:12
its response.
58:13
Uh and so, we'll revisit this
58:15
when we look at BPE, byte pair encoding,
58:17
later on.
58:19
But the key thing to remember is that
58:21
all the stuff we're talking about starts
58:22
from the notion of a token.
58:24
As to how you define a token given a
58:26
bunch of text, that's the tokenizer's
58:28
job. And we just assumed a simple
58:30
tokenizer for the time being.
58:33
Okay? So, at this point, folks, we have
58:36
satisfied all the requirements.
58:38
Uh we have taken the surrounding context
58:40
of each word, we have taken the order,
58:42
and so on and so forth, because what's
58:43
coming in here is the positional
58:45
embeddings. Okay? And it runs through
58:47
the whole transformer stack.
58:49
So,
58:51
this is called a transformer encoder.
58:54
Okay?
58:55
This is the transformer encoder.
58:57
And you can see here, this is the
58:59
original picture from the paper.
59:01
It's an iconic picture at this point.
59:03
So, it says here, these are the
inputs. This is like the cat sat on the
mat.
It comes in here, gets
transformed into embeddings, standalone
59:11
embeddings.
59:12
And then, based on the position of each
59:14
word, we add that's why you see a plus
59:17
sign here, we add the positional
59:20
embedding to that.
59:22
And the resulting thing goes into this
59:24
transformer block. And here,
59:26
we go through multi-head attention.
59:30
And things come out the other end.
59:32
Then there is this thing called add and
59:34
norm, which we'll revisit on
59:36
Wednesday.
59:37
And then it goes through a feed forward
59:38
network, another add and norm, which
59:40
we'll revisit on Wednesday.
59:42
And then it comes out the other end.
59:43
That's it. That's a transformer encoder.
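To make that dataflow concrete, here is a rough Keras sketch of one encoder block (self-attention, add and norm, feed-forward, add and norm). This is my illustration of the idea, not the exact code the course uses.

```python
# Rough Keras sketch of the encoder dataflow just described; an illustration,
# not the course's actual implementation.
import tensorflow as tf
from tensorflow.keras import layers

class SimpleTransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.attention = layers.MultiHeadAttention(num_heads=num_heads,
                                                   key_dim=embed_dim)
        self.ffn = tf.keras.Sequential([
            layers.Dense(dense_dim, activation="relu"),
            layers.Dense(embed_dim),          # back to embed_dim, so input and output sizes match
        ])
        self.norm1 = layers.LayerNormalization()
        self.norm2 = layers.LayerNormalization()

    def call(self, inputs):
        attn_out = self.attention(inputs, inputs)   # multi-head self-attention
        x = self.norm1(inputs + attn_out)           # add & norm (residual connection)
        ffn_out = self.ffn(x)                       # per-token feed-forward network
        return self.norm2(x + ffn_out)              # add & norm again
```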
59:46
Okay?
59:47
Um
59:48
and so if you look at this
59:52
just to point out a couple of things,
59:53
the input embeddings can be random
59:55
weights or they could be pre-trained
59:56
embeddings.
59:57
Um
59:58
we add in a position-dependent embedding
1:00:00
to represent the position of each word
1:00:01
in the sentence. That's the plus.
1:00:02
Then we pass it through multi-headed
1:00:04
attention to get a contextual uh
1:00:05
representation.
1:00:07
Then, finally, we pass all this through
a simple network,
typically a two-layer network:
one hidden layer with ReLUs and then a
linear layer after that, and boom.
That's it. This is the encoder. And
1:00:20
here is perhaps the most important
1:00:21
point to keep in mind.
1:00:23
Because we have taken inordinate care to
1:00:25
make sure that the things that are
1:00:26
coming in and the things that are going
1:00:28
out have the same size
1:00:30
both in terms of the number of tokens as
1:00:32
well as the length of each vector.
1:00:34
We can then stack them up like pancakes.
1:00:37
We can have lots of transformers stacked
1:00:39
one on top of each other.
1:00:41
Right? Because it's the perfect API.
1:00:43
It's the simplest possible API. The same
1:00:45
thing comes in, same thing goes out.
1:00:47
In terms of size. So you can have a
1:00:49
transformer encoder, another one on top,
1:00:51
boom, boom, boom, boom, boom, one after
1:00:53
the other. GPT-3 has 96 transformer
1:00:55
blocks stacked up this way.
1:00:58
And like in all things deep learning
1:01:00
related, the more layers you have, the
1:01:02
more complicated things we can do with
1:01:04
it.
1:01:05
As long as you have enough data to keep
1:01:06
the model happy so it doesn't overfit.
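Because the input and output shapes match, stacking is just a loop; here is a tiny sketch using the illustrative SimpleTransformerEncoder above (the depth of 6 is arbitrary for the demo).

```python
# Stacking encoder blocks like pancakes; depth here is arbitrary for the demo.
inputs = layers.Input(shape=(30, 512))       # 30 tokens, each a 512-dim embedding
x = inputs
for _ in range(6):                           # GPT-3 stacks 96 such layers
    x = SimpleTransformerEncoder(embed_dim=512, dense_dim=64, num_heads=5)(x)
stacked_encoder = tf.keras.Model(inputs, x)  # the same (30, 512) shape comes out the other end
```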
1:01:11
Okay?
1:01:13
All right. So, what we haven't covered,
1:01:15
which we'll cover on Wednesday
1:01:17
uh is the question that
1:01:20
he had posed about how
1:01:22
uh you know, since there are no
1:01:23
parameters inside the self-attention
1:01:24
block, what are we actually learning?
1:01:26
And then there are these things called
1:01:27
residual connections and layer
1:01:29
normalization. We'll talk about all
1:01:31
those things on Wednesday. Those are all
1:01:32
like, you know, refinements to the idea.
1:01:35
So, all right, 9:39. Um let's apply the
1:01:38
transformer encoder to an actual
1:01:39
problem.
1:01:40
Any questions?
1:01:43
Uh yeah.
1:01:45
My question is regarding, like you said,
you could have multiple transformers.
What is the difference between having
multiple self-attention heads
rather than having multiple blocks? When I
say a transformer block, within the block
there could be multiple heads. So, if
the accuracy is the same, why
would you use one rather than the other?
1:02:04
Yeah, you can have a lot of attention
1:02:06
heads. And that's totally fine. And
1:02:08
typically I forget how many GPT-3 and 4
1:02:10
have. They have a whole bunch of them.
1:02:12
But you can So you can go wide and you
1:02:13
can go deep.
1:02:15
Both are done in practice.
1:02:18
But the one thing you have to remember is
that if you go wide, you have a
lot of attention heads, then given the
particular input that's coming into that
block, it'll learn different patterns
1:02:28
from it.
1:02:29
While if you stack them all up, it's
1:02:31
going to learn different ways to
1:02:32
contextualize the things that are coming
1:02:33
in. It operates at higher levels of
1:02:35
abstraction. So the analogy would be
1:02:36
that like the seventh layer of a
1:02:38
convolutional net may take the sixth
1:02:40
layer's output and say, "Oh, I'm seeing
1:02:42
a lot of edges here. I'm going to take
1:02:44
an edge like this, two circles like that
1:02:46
and call it a face."
1:02:48
So it'll operate at a higher level of
1:02:49
abstraction.
1:02:52
Okay.
1:02:53
Um
1:02:58
All right, let's go to the Colab.
1:03:01
So what we're going to do is we're going
1:03:02
to take the transformer that we just
1:03:04
learned about and we're going to apply
1:03:05
it to solve the travel slot-filling
1:03:07
problem. Okay?
1:03:09
Uh all right. So
1:03:12
Okay, so we'll start with the usual
1:03:14
preliminaries.
1:03:16
And then we have taken the ATIS data set
1:03:18
I talked about and we have stuck it in
1:03:20
raw box for easy consumption.
1:03:23
It's here.
1:03:29
Okay.
1:03:30
So if you look at the top few rows,
1:03:33
you can see here, for example, I want to
1:03:35
fly from Boston 8:30 a.m. And then this
1:03:37
is the output. The slot filling is the
1:03:39
output. Um and as it turns out,
these people also gave it another label.
They took the whole query and gave it an
intent, as in, is it a flight query,
is it a something-else query, and so on,
1:03:51
which we're not going to use. Are you
1:03:52
kidding me?
1:03:54
I want to fly from Boston at 8:30 a.m.
1:03:56
and arrive in Denver at 11:00 in the
1:03:57
morning. What kind of ground
1:03:59
transportations are available in Denver?
1:04:01
What's the airport at Orlando?
1:04:03
Um how much does the limo service cost
1:04:06
within Pittsburgh? Okay.
1:04:08
And so on and so forth. So
1:04:09
you get the idea. It's a very wide range
1:04:11
of queries that are in this data set.
1:04:13
Um okay. So let's just ignore that for a
1:04:16
sec. Um okay. So what we're now going to
1:04:18
do is we are going to take only
1:04:22
um this column, right? The query column.
1:04:24
That's going to be our input text. Okay?
1:04:27
And then the slot filling column is
1:04:29
going to be our dependent variable, the
1:04:31
output.
1:04:32
So we'll just gather them all up
1:04:34
uh here.
1:04:37
Let it run. We'll do it for the training
1:04:38
data and the test data.
1:04:40
And so what we have done is that we have
1:04:42
taken um the transformer related code in
1:04:45
Keras and we have packaged it into a
1:04:47
little hardel library for easy
1:04:49
consumption.
1:04:50
Um and so that thing is here. You can
1:04:53
download it.
1:04:55
Calling it a library is like overstating
1:04:56
it. We literally just collected a bunch
1:04:57
of code and stuck it in a file. Okay?
1:04:59
So
1:05:00
and so what we'll do is from hardel
1:05:02
we'll import the transformer
1:05:03
encoder.
1:05:04
And we'll import this positional
1:05:06
embedding layer.
1:05:08
Because what we're going to do is we are
1:05:09
going to take the input do the
1:05:11
positional encoding business and then
1:05:12
send it into the transformer.
1:05:14
Okay?
1:05:15
Um so but first let's vectorize the
1:05:18
input uh queries that are coming in.
1:05:21
So we'll define a thing here.
1:05:24
Then use this. Uh, max query length is not defined. That's
1:05:28
what happens when you
1:05:30
don't run everything.
1:05:32
All right.
1:05:38
Okay. So now we have this thing here. So
1:05:41
turns out that there are 8,888 tokens,
1:05:44
right? 8,888 words in the input queries
1:05:47
that we have in the data. Uh so I
1:05:49
take a look at the first few.
1:05:52
And you can see here, you know, there is
1:05:54
unk. Uh and because the output mode here
1:05:56
is you just want integers to come out
1:05:58
not multi-hot encoding or anything
1:06:00
because we're going to take these
1:06:01
integers and then do embeddings from
1:06:02
them. So it'll
1:06:04
reserve this empty string as the pad
1:06:07
token. This should be familiar from last
1:06:10
week.
1:06:11
And then the unk for unknown tokens and
1:06:13
then to, from, flights, these are all some
1:06:14
of the most frequent. Um turns out
1:06:17
Boston is actually the most frequent. I
1:06:18
don't know what's up with that.
1:06:20
It is what it is. Then we'll do the same
1:06:22
vectorization to the train and test data
1:06:24
sets.
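Roughly what that vectorization step looks like; variable names like train_queries and test_queries are my guesses, but the TextVectorization settings match what is described (integer output, queries capped at the max query length).

```python
# Roughly the query vectorization in the notebook; variable names are guesses.
from tensorflow.keras.layers import TextVectorization

max_query_length = 30
query_vectorizer = TextVectorization(
    output_mode="int",                         # integer ids, not multi-hot, so we can embed them
    output_sequence_length=max_query_length,   # pad / truncate every query to 30 tokens
)
query_vectorizer.adapt(train_queries)          # builds the roughly 8,888-token vocabulary

vocab = query_vectorizer.get_vocabulary()      # index 0 is "" (pad), index 1 is "[UNK]"
X_train = query_vectorizer(train_queries)
X_test = query_vectorizer(test_queries)
```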
1:06:25
Now uh we need to do STIE for the output
1:06:28
side of the problem because the slots
1:06:30
the dependent variable here,
1:06:31
remember, are all sentences as well with
1:06:33
the B, O, things like that, right? So we
1:06:36
need to vectorize those.
1:06:38
So we need to do STIE on them.
1:06:40
So let's take a look at some of these
1:06:42
slots.
1:06:43
And you can see here all this stuff is
1:06:44
going on.
1:06:45
So now here is an example where you
1:06:48
have to be very careful when you do the
1:06:49
standardization.
1:06:51
Typically, in standardization, you will
1:06:52
remove punctuation and you know, do
1:06:54
things like that and lowercase, right?
1:06:56
But here
1:06:57
these things have a specific meaning.
1:07:00
We can't just go in there and remove the
1:07:01
period and the underscore and then
1:07:03
make the B into a lowercase b and stuff
1:07:04
like that. That'll just harm it.
1:07:06
Right? We need to be able to preserve
1:07:07
the nomenclature of the output in terms
1:07:10
of all those tags. So
1:07:12
um so we don't want the standardization
1:07:13
to strip all those out. So what we do is we
1:07:15
say standardization none.
1:07:17
Look at that.
1:07:18
We tell Keras do not standardize this.
1:07:20
Do not do your usual thing.
1:07:22
Okay?
1:07:23
Um so
1:07:25
we do that
1:07:26
for the output side. And then let's look
1:07:29
at the vocabulary.
1:07:30
Yeah, so this sounds pretty good.
1:07:33
These are all the things that we would
1:07:34
expect to see.
1:07:35
These are the distinct tokens in the
1:07:37
output strings.
1:07:39
Um all right.
1:07:43
Okay, we get it.
1:07:45
So we have 125 of them. In the
1:07:48
lecture I said there are 123 slots,
1:07:50
possible slots. Why is it 125 here?
1:07:54
Yes, unk and pad. Correct.
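A sketch of the output-side vectorizer with standardization turned off, which is why the vocabulary comes out to 123 slot tags plus pad plus unk, i.e. 125; train_slots is an assumed variable name.

```python
# Output-side vectorizer: standardization is off so the B-/I-/O slot tags
# keep their punctuation and case. Variable names are illustrative.
slot_vectorizer = TextVectorization(
    standardize=None,                          # do NOT lowercase or strip punctuation
    output_mode="int",
    output_sequence_length=max_query_length,
)
slot_vectorizer.adapt(train_slots)             # 123 slot tags + "" (pad) + "[UNK]" = 125
y_train = slot_vectorizer(train_slots)
```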
1:07:57
Um okay. Now we'll set up a transformer
1:07:59
encoder, right? Uh this Oh, wait, wait,
1:08:02
wait. I forgot about um doing this. My
1:08:05
bad. Um
1:08:07
All right.
1:08:11
I just realized when I saw the slide that
1:08:12
we went to the Colab
1:08:15
without giving you a bit more
1:08:16
background. No problem. So
1:08:18
So
1:08:20
the way we're going to model this
1:08:21
problem is that we're going to have
1:08:22
something like this, right? Fly from
1:08:23
Boston to Denver.
1:08:24
That's the input that's coming in and
1:08:26
that is the correct answer.
1:08:28
O, O, some B-something-or-others,
1:08:31
and then something else, right? That's
1:08:32
the correct answer. That's
1:08:34
the input and that is the right answer.
1:08:36
So what we'll do is we will
1:08:38
create these positional input embeddings
1:08:40
like we have discussed before.
1:08:42
We will run it through a transformer.
1:08:45
It gives us contextual embeddings.
1:08:47
So if we send five in, it's going to
1:08:49
send us five out except the color is now
1:08:50
blue.
1:08:51
Right? And then what we do is
1:08:54
we will run it through a ReLU layer.
1:08:59
We will still have
1:09:01
you know, five vectors here, five
1:09:02
vectors will come out.
1:09:04
And then for each of the things that
1:09:05
comes in, we will stick a 123-way
1:09:07
softmax.
1:09:11
Okay, for each thing that comes out
1:09:13
we'll have a 123-way softmax and that's
1:09:15
the classification problem we're going
1:09:16
to solve.
1:09:20
Okay?
1:09:21
So
1:09:23
the weights in all these layers will get
1:09:25
optimized by backprop.
1:09:28
All these weights are going to get
1:09:29
optimized.
1:09:30
Uh yeah.
1:09:34
Sorry?
1:09:40
Oh no, that's a layer. The weights
1:09:43
in the layer will still need to be
1:09:44
learned.
1:09:46
It's sort of like the text vectorization
1:09:48
layer is a bunch of code and then you
1:09:50
actually run it on a particular corpus
1:09:51
to adapt it and build a vocabulary out
1:09:53
of it.
1:09:54
So, it's like an empty shell that needs
1:09:55
to get populated.
1:09:57
Okay, so the weights in all these
things are going to get updated when we
train the model
1:10:02
by backprop.
1:10:03
Uh and that's it. That's the setup.
1:10:06
Does this make sense before I switch
1:10:07
back to the Colab?
1:10:09
In particular, does this make sense?
1:10:11
This part of it.
1:10:15
Bunch of things come out and then for
1:10:17
each one of those things we need to
1:10:18
figure out a classification of a 123-way
1:10:20
classification. And that's where we
1:10:22
stick a softmax on every one of those
1:10:23
output nodes.
1:10:25
Yeah.
1:10:32
Oh oh, I see.
1:10:36
Yeah, so
1:10:40
It could be whatever or to put it
1:10:41
another way, it is your choice as the
1:10:43
user as the modeler. Correct? The thing
1:10:45
is at this point with the blue stuff the
1:10:47
transformer is basically saying, my job
1:10:49
is done.
1:10:51
It has given you these valuable
1:10:52
contextual embeddings at some high-level
1:10:54
abstraction. What you do with it depends
1:10:56
on your particular problem. And so
1:10:58
the best practice would be to take it
1:11:00
and then maybe, you know, if these
1:11:01
embeddings are really
1:11:03
long, maybe you make them a little
1:11:04
smaller, right? Using a ReLU. And using
1:11:07
a ReLU is always a good idea because
1:11:09
when in doubt, throw in a bit of
1:11:10
non-linearity.
1:11:11
Right? Uh and then once you're done with
1:11:13
that, well, at this point you need to
1:11:15
actually classify it. So, you stick an
1:11:17
output softmax on it.
1:11:20
Okay. So, that's what we have.
1:11:24
Um
1:11:27
All right, back to this picture.
1:11:29
So, what we're going to do is,
we also get to decide how long these
embedding vectors are, because here
we're not going to use GloVe embeddings.
1:11:37
We're just going to learn everything
1:11:37
from scratch.
1:11:39
Right? We're going to learn everything
1:11:40
from scratch. And we can decide how
long these embedding vectors are. So, for
these embedding vectors,
I have decided that I want them to be
512 long, right? I want these
to be 512 long. So, that's what I have
1:11:54
here, 512.
1:11:57
And then inside the transformer,
1:11:58
remember
1:12:00
when we
1:12:01
concatenate everything and then we have
1:12:02
something, we run it through a final
1:12:04
ReLU layer, how big should that layer
1:12:07
be?
1:12:08
That's what I mean here by dense
1:12:11
dim. I want it to be 64.
1:12:13
And then I, you know, for fun I'm going
1:12:15
to use five attention heads.
1:12:17
Because why not?
1:12:20
Okay. And then in the final thing here
1:12:24
to go back to Ali's question, these
1:12:27
things are all 512 long as I mentioned
1:12:29
earlier, right? These are all 512.
1:12:32
But this thing here I'm going to make it
1:12:34
just 128.
1:12:36
Okay, that's what I mean by units here.
1:12:38
And so if you look at the actual model
1:12:41
okay, whatever comes in has a max query
1:12:43
length of I think 30 if I recall.
1:12:45
Um actually let's just make sure of
1:12:47
that. What did I assume?
1:12:51
30, correct? Max query length 30. So,
1:12:53
each sentence is 30. So, if a sentence
1:12:55
has 35 words in it, what's going to
1:12:57
happen?
1:12:59
The last five will get chopped,
1:13:01
truncated. If it comes in at 22, we're
1:13:03
going to pad it with eight more pad tokens. Okay? That's how we
1:13:06
make sure everything uh gets to 30.
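A quick check of that pad and truncate behavior, using the query_vectorizer sketched earlier (the example query is made up):

```python
# Quick check of the pad / truncate behavior: a 5-token query comes back
# as 30 integer ids, with the tail filled by the pad id 0.
ids = query_vectorizer(["show me flights from boston"])
print(ids.shape)      # (1, 30)
print(ids[0, :7])     # first 5 entries are word ids, the rest are the pad id 0
```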
1:13:09
All right. So, we come back here.
1:13:12
So, the input is still sentences which
1:13:14
are 30 long, tokens which are 30 long.
1:13:16
And then we run it through a positional
1:13:18
embedding layer.
1:13:20
Okay? This positional embedding layer
has the actual embedding table for each
word, and it has the
positional embedding table.
So, just to be clear, this
positional embedding layer is
basically this.
1:13:37
So, this table
1:13:38
and this table together are packaged up
1:13:41
into the positional encoding layer.
1:13:43
But they are two distinct tables. They
1:13:45
just happen to be packaged up.
1:13:47
So,
1:13:49
so this is what we have here.
1:13:51
And then we get a nice positional
1:13:52
embedding out and then boom, we run it
1:13:55
through the transformer. And you know,
1:13:57
this transformer encoder object we have
1:13:59
to tell it obviously, hey, this is the
1:14:01
embedding dimension that's going to come
1:14:02
out. This is the dense dimension you're
1:14:04
going to use in that final feedforward
1:14:06
layer inside each attention block and
1:14:09
this is the number of attention heads I
1:14:10
want you to use. That's it.
1:14:11
Very simple, right? Only three things have to
1:14:13
be specified.
1:14:14
And then whatever comes out of the
1:14:16
transformer encoder are these blue
1:14:18
vectors.
1:14:19
And then we are back into good old sort
1:14:20
of, you know, traditional DNN stuff
1:14:22
where we take this thing, run it through
1:14:24
a ReLU with 128 units, we add a little
1:14:27
dropout uh and then we run it through a
1:14:30
dense layer which the the vocab size
1:14:33
here is 125, which is the 125-way
1:14:35
softmax.
1:14:37
Okay? Activation softmax.
1:14:39
Connect up everything into model input
1:14:41
and output and boom, that's the whole
1:14:42
model.
1:14:44
So, that's what we have here.
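Putting the pieces together, here is roughly the model just described. The PositionalEmbedding and TransformerEncoder layers are the ones imported from the course's hardel file, but the constructor argument order and the dropout rate here are my guesses, not the file's verbatim API.

```python
# A sketch of the model as described; constructor argument order and the
# dropout rate are guesses, not the hardel file's verbatim API.
from tensorflow.keras import layers, Model

vocab_size, slot_vocab_size = 8888, 125
embed_dim, dense_dim, num_heads = 512, 64, 5

inputs = layers.Input(shape=(max_query_length,), dtype="int64")
x = PositionalEmbedding(max_query_length, vocab_size, embed_dim)(inputs)  # word table + position table, added up
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)                # contextual embeddings, still (30, 512)
x = layers.Dense(128, activation="relu")(x)                               # shrink each token vector to 128
x = layers.Dropout(0.3)(x)                                                # dropout rate is illustrative
outputs = layers.Dense(slot_vocab_size, activation="softmax")(x)          # 125-way softmax per token
model = Model(inputs, outputs)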
1:14:47
Okay?
1:14:48
Now,
1:14:51
after Wednesday's
1:14:53
class
1:14:54
for extra credit and for your personal
1:14:56
edification
1:14:59
try to work through this thing to come
1:15:00
up with this number.
1:15:03
53 million
1:15:04
um sorry, 5.3 million.
1:15:06
Right? Uh and see if it matches this
1:15:10
number here.
1:15:12
It should match.
1:15:13
Hand calculate the number of parameters
1:15:15
inside the transformer. Okay? For fame
1:15:17
and fortune. That's an optional thing.
1:15:19
So,
1:15:20
uh do it after Wednesday's class, not
1:15:22
right now.
1:15:23
And I have actually listed the exact
1:15:24
math that goes into it here. Okay? All
1:15:26
right. So, by the way, you can peek into
1:15:28
any layer's weights using its weights
1:15:30
attribute. This is the embedding
1:15:31
uh the positional embedding thing we
1:15:33
had. So,
1:15:34
we can click it and you can see here it
1:15:36
has two tables. There's the first table
1:15:39
which is just the embedding table which
1:15:40
says
1:15:41
there are 8,888 tokens in my
1:15:43
vocabulary and each of those tokens has
1:15:45
an embedding vector which is 512 long.
1:15:47
That is the first table here. And then
1:15:49
it has the second object which is the
1:15:51
positional embedding and it says here,
1:15:53
well, my sentences can be 30 long and
1:15:56
for each position of the 30 long
1:15:58
sentence, I will have a 512-long embedding.
1:16:02
Both these tables as I mentioned earlier
1:16:04
are packaged up inside and you can
1:16:05
actually see what the weights are before
1:16:06
you do any training.
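For example (the layer variable name here is illustrative):

```python
# Peeking at the positional embedding layer's two tables via the standard
# Keras `weights` attribute; the variable name is illustrative.
for w in positional_embedding_layer.weights:
    print(w.name, w.shape)
# expect roughly: token embedding table (8888, 512) and position table (30, 512)
```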
1:16:08
Okay?
1:16:09
So, all right. So, I'm going to stop
1:16:11
here uh because the model is going to
1:16:13
take a few minutes to run and we're
1:16:14
already at 9:45.
1:16:16
Um so, we will continue the journey on
1:16:17
Wednesday. If some of it is not super
1:16:19
clear, don't worry about it. It will
1:16:20
become much clearer on Wednesday. All
1:16:21
right? All right, folks, have a good
1:16:22
couple of days. I'll see you on
1:16:23
Wednesday.
— end of transcript —