WEBVTT

00:00:16.519 --> 00:00:20.600
So, all right. So, transformers, even

00:00:18.839 --> 00:00:22.839
though they were originally invented for

00:00:20.600 --> 00:00:24.240
machine translation, right, going from

00:00:22.839 --> 00:00:25.879
English to German and German to French

00:00:24.239 --> 00:00:27.799
and so on and so forth,

00:00:25.879 --> 00:00:29.559
they have turned out to be an incredibly

00:00:27.800 --> 00:00:32.439
effective deep neural network

00:00:29.559 --> 00:00:34.600
architecture for just really a vast

00:00:32.439 --> 00:00:36.000
array of domains. It has reached a point

00:00:34.600 --> 00:00:37.640
where if you're actually working on

00:00:36.000 --> 00:00:39.439
a particular problem, you almost

00:00:37.640 --> 00:00:40.960
reflexively try a transformer

00:00:39.439 --> 00:00:42.919
first because it's probably going to be

00:00:40.960 --> 00:00:45.120
pretty darn good.

00:00:42.920 --> 00:00:46.480
Okay? So, they have just taken over

00:00:45.119 --> 00:00:48.199
everything.

00:00:46.479 --> 00:00:50.000
Um and obviously they've

00:00:48.200 --> 00:00:52.400
transformed translation, which is the

00:00:50.000 --> 00:00:54.079
original sort of target, uh Google

00:00:52.399 --> 00:00:55.600
search, really information retrieval,

00:00:54.079 --> 00:00:57.479
completely transformed speech

00:00:55.600 --> 00:00:59.520
recognition, text-to-speech, even

00:00:57.479 --> 00:01:00.599
computer vision. Even the stuff that we

00:00:59.520 --> 00:01:03.000
learned with convolutional neural

00:01:00.600 --> 00:01:04.760
networks, now there are transformers for

00:01:03.000 --> 00:01:06.519
computer vision problems that are

00:01:04.760 --> 00:01:07.560
actually quite good.

00:01:06.519 --> 00:01:08.839
Right?

00:01:07.560 --> 00:01:10.960
Um which is kind of shocking because

00:01:08.840 --> 00:01:12.719
they were not even designed for that.

00:01:10.959 --> 00:01:14.519
Um and then, you know, reinforcement

00:01:12.719 --> 00:01:15.719
learning. And of course, all the crazy

00:01:14.519 --> 00:01:17.439
stuff that's going on with generative

00:01:15.719 --> 00:01:20.200
AI, large language models, multimodal

00:01:17.439 --> 00:01:21.399
models, everything everything runs on a

00:01:20.200 --> 00:01:23.799
transformer.

00:01:21.400 --> 00:01:25.600
Okay? Uh and then there are numerous

00:01:23.799 --> 00:01:27.079
special purpose systems

00:01:25.599 --> 00:01:28.399
and I find these to be even more

00:01:27.079 --> 00:01:30.000
interesting.

00:01:28.400 --> 00:01:32.440
Um you know, like AlphaFold, the protein

00:01:30.000 --> 00:01:33.640
folding AI, runs on a transformer

00:01:32.439 --> 00:01:35.519
stack.

00:01:33.640 --> 00:01:36.640
Okay? And I could just list examples one

00:01:35.519 --> 00:01:38.479
after the other.

00:01:36.640 --> 00:01:40.040
So, it's just amazing. It's incredibly

00:01:38.480 --> 00:01:43.079
uh flexible architecture.

00:01:40.040 --> 00:01:44.280
Um and I think we are lucky to be alive

00:01:43.079 --> 00:01:46.879
during a time when such a thing was

00:01:44.280 --> 00:01:46.879
invented.

00:01:47.200 --> 00:01:50.480
And I'm not getting paid to tell you any

00:01:48.480 --> 00:01:52.120
of this stuff.

00:01:50.480 --> 00:01:55.439
All right, it's just amazing. Okay. So,

00:01:52.120 --> 00:01:57.280
let's get going. We will use search um

00:01:55.439 --> 00:01:59.359
or more broadly information retrieval as

00:01:57.280 --> 00:02:00.640
a motivating use case. So, these are all

00:01:59.359 --> 00:02:02.120
examples where people are typing in

00:02:00.640 --> 00:02:03.959
natural language queries or uttering

00:02:02.120 --> 00:02:05.400
natural language queries into a phone

00:02:03.959 --> 00:02:07.319
and we need to sort of make sense of

00:02:05.400 --> 00:02:08.879
what they want. And it's not like, you

00:02:07.319 --> 00:02:10.599
know, write me a limerick about deep

00:02:08.879 --> 00:02:12.639
learning where there could be many

00:02:10.599 --> 00:02:14.000
possible right answers. It's more like,

00:02:12.639 --> 00:02:15.279
okay, tell me all the flights that are

00:02:14.000 --> 00:02:16.680
leaving from Boston and going to

00:02:15.280 --> 00:02:19.080
LaGuardia tomorrow morning between 8:00

00:02:16.680 --> 00:02:21.120
and 9:00. Well, you better get it right.

00:02:19.080 --> 00:02:22.200
Okay? Accuracy is a high bar.

00:02:21.120 --> 00:02:23.319
So,

00:02:22.199 --> 00:02:24.679
um or, you know, how many customers

00:02:23.319 --> 00:02:26.000
abandoned their shopping cart? Find all

00:02:24.680 --> 00:02:28.960
contracts that are up for renewal next

00:02:26.000 --> 00:02:30.840
month. Uh you know, tell me all the

00:02:28.960 --> 00:02:32.800
customers who ended the phone call to

00:02:30.840 --> 00:02:34.879
the call center yesterday not entirely

00:02:32.800 --> 00:02:37.040
pleased with the transaction. Right? The

00:02:34.879 --> 00:02:38.479
list goes on and on. And so, in

00:02:37.039 --> 00:02:40.959
particular, we'll focus on this

00:02:38.479 --> 00:02:42.879
travel-related example today. Okay? Uh

00:02:40.960 --> 00:02:44.159
find me all flights from Boston to

00:02:42.879 --> 00:02:45.639
LaGuardia tomorrow morning, right? That

00:02:44.159 --> 00:02:48.560
kind of query.

00:02:45.639 --> 00:02:50.919
Um and so, in these sorts of use cases,

00:02:48.560 --> 00:02:53.599
a very common approach historically has

00:02:50.919 --> 00:02:55.599
been, well, we will take this, you know,

00:02:53.599 --> 00:02:57.919
natural language query

00:02:55.599 --> 00:03:01.039
and then we will convert it into a

00:02:57.919 --> 00:03:03.559
structured query. By that I mean we will

00:03:01.039 --> 00:03:05.799
parse the query and we'll extract out

00:03:03.560 --> 00:03:07.640
key things in that query. Once we

00:03:05.800 --> 00:03:09.800
extract out those key things, we will

00:03:07.639 --> 00:03:12.919
reassemble it into a structured query,

00:03:09.800 --> 00:03:14.760
like a SQL query, right? Uh SQL is just

00:03:12.919 --> 00:03:15.919
one example of a possible structured

00:03:14.759 --> 00:03:17.239
query. There are many many ways to

00:03:15.919 --> 00:03:18.759
structure queries.

00:03:17.240 --> 00:03:20.840
But SQL is sort of familiar to lots of

00:03:18.759 --> 00:03:23.120
people, so I'm using that. So, you take

00:03:20.840 --> 00:03:25.200
the SQL. Once you have the SQL query,

00:03:23.120 --> 00:03:27.319
you're in a very comfortable structured

00:03:25.199 --> 00:03:28.839
land, in which case you just run the

00:03:27.319 --> 00:03:30.959
query through some database that you

00:03:28.840 --> 00:03:32.960
have, get the results back, format it

00:03:30.960 --> 00:03:34.719
nicely, and show it to the user.

00:03:32.960 --> 00:03:36.480
Right? That's the flow.
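
NOTE
A rough sketch of the flow just described, in code. The table and column names
below are made up for illustration, and SQL is, as noted, just one possible
structured-query target.
    # slots extracted from "find me all flights from Boston to LaGuardia tomorrow morning"
    slots = {"fromloc": "BOS", "toloc": "LGA",
             "depart_date": "tomorrow", "depart_period": "morning"}
    sql = ("SELECT * FROM flights "                      # 'flights' is an illustrative table name
           "WHERE origin = :fromloc AND dest = :toloc "  # parameter placeholders filled from slots
           "AND depart_date = :depart_date AND depart_period = :depart_period")
    # run the structured query against the database, format the results, show them to the user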

00:03:34.719 --> 00:03:37.599
So, the question becomes

00:03:36.479 --> 00:03:40.560
um

00:03:37.599 --> 00:03:43.960
how do we automatically extract all the

00:03:40.560 --> 00:03:45.280
travel-related entities from this query?

00:03:43.960 --> 00:03:49.800
Right? We want to be able to extract

00:03:45.280 --> 00:03:50.640
BOS, LGA, tomorrow, morning, flights, so

00:03:49.800 --> 00:03:51.800
on and so forth. These are all the

00:03:50.639 --> 00:03:54.799
travel-related entities we want to

00:03:51.800 --> 00:03:56.520
extract out, right? That's the problem.

00:03:54.800 --> 00:03:58.200
And so,

00:03:56.520 --> 00:03:59.760
we will use a really cool data set

00:03:58.199 --> 00:04:01.159
called the Airline Travel Information

00:03:59.759 --> 00:04:02.759
System (ATIS) data set, and I'll explain the

00:04:01.159 --> 00:04:05.159
data set in just a bit. We'll

00:04:02.759 --> 00:04:07.120
use this as the basis for this example.

00:04:05.159 --> 00:04:08.599
And so, the way we think about it is

00:04:07.120 --> 00:04:10.400
that

00:04:08.599 --> 00:04:12.079
we have a whole bunch of queries in

00:04:10.400 --> 00:04:14.400
this data set.

00:04:12.080 --> 00:04:16.359
And fortunately for us, the researchers

00:04:14.400 --> 00:04:18.358
who compiled this data set,

00:04:16.358 --> 00:04:20.399
they went through every one of these

00:04:18.358 --> 00:04:22.239
queries, right? And we have, you know,

00:04:20.399 --> 00:04:24.239
several thousand of them. They went

00:04:22.240 --> 00:04:26.800
through every one of those queries and

00:04:24.240 --> 00:04:28.400
they manually tagged each word in the

00:04:26.800 --> 00:04:31.520
query

00:04:28.399 --> 00:04:33.319
with what kind of travel entity it is

00:04:31.519 --> 00:04:35.399
or none of them, right? So, for

00:04:33.319 --> 00:04:37.360
instance, they call them

00:04:35.399 --> 00:04:39.799
slots. So, they will take each word in

00:04:37.360 --> 00:04:41.439
the query and assign it to a slot, a

00:04:39.800 --> 00:04:42.800
particular kind of slot, and I'll

00:04:41.439 --> 00:04:45.480
explain what slot means in just a

00:04:42.800 --> 00:04:47.439
second. Okay? That's the basic idea. So,

00:04:45.480 --> 00:04:49.759
so, for example, if you have something

00:04:47.439 --> 00:04:52.639
like I want to fly from

00:04:49.759 --> 00:04:53.759
Okay? And this is a flight database, so

00:04:52.639 --> 00:04:56.039
you can assume that everything is

00:04:53.759 --> 00:04:57.519
related to a flight, to flying. So, if you

00:04:56.040 --> 00:04:58.560
have all these words, I want to fly

00:04:57.519 --> 00:05:00.759
from,

00:04:58.560 --> 00:05:02.560
each of these five words

00:05:00.759 --> 00:05:04.599
gets mapped to something called the O,

00:05:02.560 --> 00:05:06.280
which means other.

00:05:04.600 --> 00:05:07.840
It's the other slot, right? We don't

00:05:06.279 --> 00:05:09.279
really care about it. It's the other

00:05:07.839 --> 00:05:11.599
slot.

00:05:09.279 --> 00:05:13.159
And then we come to Boston.

00:05:11.600 --> 00:05:15.560
Oh, Boston is very special, right?

00:05:13.160 --> 00:05:18.160
Because, you know, it's clearly a

00:05:15.560 --> 00:05:20.280
departure city. So, we actually tag it,

00:05:18.160 --> 00:05:21.640
we assign it this label. Think of it as

00:05:20.279 --> 00:05:23.000
just like a classification problem,

00:05:21.639 --> 00:05:26.479
right? A multi-class classification

00:05:23.000 --> 00:05:29.240
problem. So, we assign it the label

00:05:26.480 --> 00:05:31.160
B-fromloc.city_name.

00:05:29.240 --> 00:05:32.560
Okay? That is the label you assign it.

00:05:31.160 --> 00:05:34.720
Okay?

00:05:32.560 --> 00:05:37.199
And then you go to at. You don't care

00:05:34.720 --> 00:05:38.680
about at. It's O, other. You come to

00:05:37.199 --> 00:05:41.159
7:00 a.m.

00:05:38.680 --> 00:05:43.280
And then, okay, that is depart time. So,

00:05:41.160 --> 00:05:45.680
depart time and then another depart

00:05:43.279 --> 00:05:47.319
time. And here you see there is a B and

00:05:45.680 --> 00:05:49.360
then there is an I.

00:05:47.319 --> 00:05:51.800
Right? So, what we are saying

00:05:49.360 --> 00:05:54.160
here is that there could be entities who

00:05:51.800 --> 00:05:57.439
are described using more than one word.

00:05:54.160 --> 00:05:58.600
Like 7:00 a.m., right? Two tokens.

00:05:57.439 --> 00:06:00.600
And for that, we need to be able to

00:05:58.600 --> 00:06:01.760
figure out, okay, the second token is

00:06:00.600 --> 00:06:03.920
really

00:06:01.759 --> 00:06:05.360
is part of the first token. Together,

00:06:03.920 --> 00:06:08.920
they define the notion of a departure

00:06:05.360 --> 00:06:10.520
time. So, what the B means is that

00:06:08.920 --> 00:06:12.480
this is the token in

00:06:10.519 --> 00:06:15.240
which we are beginning the idea of a

00:06:12.480 --> 00:06:17.840
departure time. And then I means we are

00:06:15.240 --> 00:06:19.920
in the middle of this description.

00:06:17.839 --> 00:06:21.079
B is for beginning.

00:06:19.920 --> 00:06:23.240
So,

00:06:21.079 --> 00:06:25.079
you can see here. So, there is a B here

00:06:23.240 --> 00:06:27.519
and there is an I. B for beginning, I

00:06:25.079 --> 00:06:31.680
for intermediate or in the middle.

00:06:27.519 --> 00:06:33.120
Um and then at, we don't care. 11:00 B

00:06:31.680 --> 00:06:35.400
arrive time.

00:06:33.120 --> 00:06:37.920
Boop boop boop. Morning arrive time

00:06:35.399 --> 00:06:37.919
period.

00:06:38.199 --> 00:06:43.560
So, this is an example of how you can

00:06:40.800 --> 00:06:45.040
take a sentence and then manually label

00:06:43.560 --> 00:06:46.120
every word in the sentence with

00:06:45.040 --> 00:06:48.920
something that's relevant to your

00:06:46.120 --> 00:06:48.920
particular problem.
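
NOTE
The labeling walk-through above, written out as (word, slot) pairs. The exact
label strings are an approximation of the ATIS tag set based on the spoken
description, not a verbatim copy of the data set.
    tagged = [
        ("i", "O"), ("want", "O"), ("to", "O"), ("fly", "O"), ("from", "O"),
        ("boston", "B-fromloc.city_name"),
        ("at", "O"),
        ("7:00", "B-depart_time.time"), ("am", "I-depart_time.time"),
        ("and", "O"), ("arrive", "O"), ("at", "O"),
        ("11:00", "B-arrive_time.time"),
        ("in", "O"), ("the", "O"),
        ("morning", "B-arrive_time.period_of_day"),
    ]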

00:06:50.360 --> 00:06:54.439
And

00:06:51.959 --> 00:06:56.959
turns out, with these people's labels,

00:06:54.439 --> 00:06:59.439
every word is classified into one of 123

00:06:56.959 --> 00:07:02.919
possibilities.

00:06:59.439 --> 00:07:04.920
Okay? Um so, aircraft code, airline

00:07:02.920 --> 00:07:07.080
code, airline name, airport code,

00:07:04.920 --> 00:07:08.960
airport name, arrival date relative

00:07:07.079 --> 00:07:11.560
to today, and so on. Now, you get the idea.

00:07:08.959 --> 00:07:13.359
They want a round trip versus a one-way.

00:07:11.560 --> 00:07:14.480
Relative to today, because if

00:07:13.360 --> 00:07:16.040
somebody says tomorrow morning, it's

00:07:14.480 --> 00:07:17.520
relative to today, so you need a notion of

00:07:16.040 --> 00:07:19.560
absolute time and a

00:07:17.519 --> 00:07:20.919
notion of relative time.

00:07:19.560 --> 00:07:23.480
So, these researchers basically thought of every

00:07:20.920 --> 00:07:25.560
possibility. And

00:07:23.480 --> 00:07:27.840
so, every word in every one of these

00:07:25.560 --> 00:07:30.319
queries is assigned one of these 123

00:07:27.839 --> 00:07:30.319
labels.

00:07:32.240 --> 00:07:35.480
Any questions on the setup?

00:07:36.920 --> 00:07:39.480
Um

00:07:39.920 --> 00:07:44.480
Did they have to contextualize what

00:07:42.199 --> 00:07:46.360
comes before, let's say, Boston? So,

00:07:44.480 --> 00:07:47.720
if someone says from

00:07:46.360 --> 00:07:49.360
Boston, so that there should be

00:07:47.720 --> 00:07:50.880
contextualization with the from to

00:07:49.360 --> 00:07:52.480
Boston. So, because they did it

00:07:50.879 --> 00:07:54.079
manually, they could just read it and

00:07:52.480 --> 00:07:55.680
figure it out, that's what they mean,

00:07:54.079 --> 00:07:57.319
right? Boston is the departure

00:07:55.680 --> 00:07:59.480
city and not the arrival city. So, do

00:07:57.319 --> 00:08:01.480
they have two tags for Boston, which is

00:07:59.480 --> 00:08:03.520
some like, you know, departure city as

00:08:01.480 --> 00:08:05.759
well as arrival city

00:08:03.519 --> 00:08:07.399
for the word Boston? In that particular phrase,

00:08:05.759 --> 00:08:08.959
it's clear from that particular

00:08:07.399 --> 00:08:10.639
case in the context of it as a human

00:08:08.959 --> 00:08:13.279
reading it that Boston is a departure

00:08:10.639 --> 00:08:15.360
city. So, it only gets that tag in

00:08:13.279 --> 00:08:16.759
that sentence. In some other sentence

00:08:15.360 --> 00:08:19.720
where people are coming into Boston,

00:08:16.759 --> 00:08:19.719
it'll have a different tag.

00:08:21.040 --> 00:08:25.120
I was wondering if my query like the

00:08:23.000 --> 00:08:27.279
others, basically there is like, for

00:08:25.120 --> 00:08:29.079
example, if my query was

00:08:27.279 --> 00:08:29.759
giving flights from Boston at 7:00 a.m.

00:08:29.079 --> 00:08:31.079
and

00:08:29.759 --> 00:08:33.559
uh the

00:08:31.079 --> 00:08:35.478
flights from Denver at 11:00 a.m.

00:08:33.559 --> 00:08:37.079
You mean like a compound query? Yeah.

00:08:35.479 --> 00:08:39.080
So, this one only takes single queries

00:08:37.080 --> 00:08:40.158
into account.

00:08:39.080 --> 00:08:42.038
Because most people are like, you know,

00:08:40.158 --> 00:08:43.120
give me a flight from here to there. Or

00:08:42.038 --> 00:08:45.279
what is the cheapest thing from here to

00:08:43.120 --> 00:08:47.720
there? And we'll see examples of queries

00:08:45.279 --> 00:08:47.720
later on.

00:08:50.000 --> 00:08:52.679
Okay.

00:08:51.120 --> 00:08:53.600
Uh all right. So, that's that's the

00:08:52.679 --> 00:08:56.959
deal.

00:08:53.600 --> 00:08:58.000
So, basically, you

00:08:56.960 --> 00:08:59.480
know,

00:08:58.000 --> 00:09:02.120
uh this problem that we have here is

00:08:59.480 --> 00:09:04.879
really a word-to-slot

00:09:02.120 --> 00:09:06.399
multi-class classification

00:09:04.879 --> 00:09:07.480
problem.

00:09:06.399 --> 00:09:09.240
Okay?

00:09:07.480 --> 00:09:10.560
Um because if you look at that

00:09:09.240 --> 00:09:12.840
input, we want to be able to take that

00:09:10.559 --> 00:09:16.159
input and a really good model will then

00:09:12.840 --> 00:09:16.160
give you this as the output.

00:09:17.000 --> 00:09:20.240
Right? Because this is what a human

00:09:18.159 --> 00:09:23.480
would have done.

00:09:20.240 --> 00:09:25.720
So, that is our problem. Okay?

00:09:23.480 --> 00:09:27.840
So, the question is

00:09:25.720 --> 00:09:29.320
um the key thing here is that each

00:09:27.840 --> 00:09:32.040
of the 18 words in this particular

00:09:29.320 --> 00:09:34.440
example must be assigned to one of 123

00:09:32.039 --> 00:09:36.399
slot types, right? Each word. It's not

00:09:34.440 --> 00:09:38.080
like we take the entire query and

00:09:36.399 --> 00:09:40.399
classify the entire query into one of

00:09:38.080 --> 00:09:42.480
123 possibilities. Every word in the

00:09:40.399 --> 00:09:45.360
query has to be classified.

00:09:42.480 --> 00:09:45.360
That is the wrinkle.

00:09:45.399 --> 00:09:49.240
Okay?

00:09:46.960 --> 00:09:51.080
So, now, if we could run the query

00:09:49.240 --> 00:09:54.120
through a deep neural network and

00:09:51.080 --> 00:09:55.800
generate 18 output nodes,

00:09:54.120 --> 00:09:57.679
it goes through some unspecified deep

00:09:55.799 --> 00:09:59.399
neural network. And when it comes out

00:09:57.679 --> 00:10:00.399
the other end, the output layer has 18

00:09:59.399 --> 00:10:01.439
nodes.

00:10:00.399 --> 00:10:03.120
Okay?

00:10:01.440 --> 00:10:04.480
Because that is the

00:10:03.120 --> 00:10:06.919
dimension of the

00:10:04.480 --> 00:10:09.000
output that we care about. 18 in, 18

00:10:06.919 --> 00:10:11.599
out. 18 in, 18 out, right?

00:10:09.000 --> 00:10:15.600
And then for each one of those 18 nodes,

00:10:11.600 --> 00:10:19.200
maybe we could attach a 123-way softmax

00:10:15.600 --> 00:10:19.200
to each of those 18 outputs.
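
NOTE
One way to realize the idea just described, as a minimal sketch assuming PyTorch.
Whatever the unspecified network is, assume it hands us one vector per word; a
shared linear layer then produces 123 logits per word, and a softmax turns each
row into a distribution over slot types. All sizes are illustrative.
    import torch
    import torch.nn.functional as F
    d_model, n_tokens, n_slots = 256, 18, 123     # illustrative sizes
    token_reps = torch.randn(n_tokens, d_model)   # stand-in for the network's 18 output vectors
    slot_head = torch.nn.Linear(d_model, n_slots)
    logits = slot_head(token_reps)                # shape (18, 123): one score vector per word
    probs = F.softmax(logits, dim=-1)             # a 123-way softmax on each of the 18 outputs
    predicted_slots = probs.argmax(dim=-1)        # one slot id per word: 18 in, 18 out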

00:10:20.200 --> 00:10:23.280
By the way, isn't it cool that we can

00:10:21.480 --> 00:10:25.440
just casually talk about sticking a

00:10:23.279 --> 00:10:27.159
123-way softmax onto each one of the 18

00:10:25.440 --> 00:10:29.920
nodes?

00:10:27.159 --> 00:10:29.919
Folks, wake up.

00:10:31.360 --> 00:10:34.840
You're not easily impressed. I'm

00:10:32.720 --> 00:10:37.560
impressed by that.

00:10:34.840 --> 00:10:37.560
So, okay.

00:10:37.879 --> 00:10:41.840
So, here's the key thing,

00:10:39.759 --> 00:10:45.639
right? We want to generate an output

00:10:41.840 --> 00:10:47.399
that has the same length as the input.

00:10:45.639 --> 00:10:48.960
But the problem is the inputs could be

00:10:47.399 --> 00:10:50.480
of different lengths as they come in.

00:10:48.960 --> 00:10:52.840
They could be short sentences, long

00:10:50.480 --> 00:10:55.120
sentences, we don't know, right?

00:10:52.840 --> 00:10:56.759
Yet we need to accommodate this range

00:10:55.120 --> 00:10:58.159
this variable size of input that's

00:10:56.759 --> 00:10:59.399
coming in.

00:10:58.159 --> 00:11:00.799
But the key thing is the output has to

00:10:59.399 --> 00:11:02.759
be the same length as the input, the same

00:11:00.799 --> 00:11:05.559
cardinality as the input.

00:11:02.759 --> 00:11:07.080
Okay, that's one big requirement.

00:11:05.559 --> 00:11:08.799
In addition, we want to take the

00:11:07.080 --> 00:11:10.440
surrounding context of each word into

00:11:08.799 --> 00:11:12.719
account, right? To go to Ronak's

00:11:10.440 --> 00:11:14.040
question, when you see the word Boston,

00:11:12.720 --> 00:11:15.920
you can't conclude whether it's a

00:11:14.039 --> 00:11:17.079
departure city or arrival city.

00:11:15.919 --> 00:11:19.399
You have to look at what else is going

00:11:17.080 --> 00:11:21.400
on around it. Is there a from? Is there

00:11:19.399 --> 00:11:22.600
a to? Things like that to figure out

00:11:21.399 --> 00:11:24.399
how to tag it. So, clearly the

00:11:22.600 --> 00:11:25.800
context matters.

00:11:24.399 --> 00:11:28.319
And then we clearly have to take the

00:11:25.799 --> 00:11:29.599
order of the words into account.

00:11:28.320 --> 00:11:30.480
Going from Boston to LaGuardia is very

00:11:29.600 --> 00:11:31.680
different than going from LaGuardia to

00:11:30.480 --> 00:11:33.720
Boston.

00:11:31.679 --> 00:11:35.479
So, clearly the order matters.

00:11:33.720 --> 00:11:37.560
Right? So, the context matters and the

00:11:35.480 --> 00:11:40.279
order matters. And the output has to be

00:11:37.559 --> 00:11:42.119
the same length as the input.

00:11:40.279 --> 00:11:44.240
Okay?

00:11:42.120 --> 00:11:45.720
So, context matters, right? Just a few

00:11:44.240 --> 00:11:47.480
fun examples.

00:11:45.720 --> 00:11:48.680
Remember from the last week that the

00:11:47.480 --> 00:11:50.639
meaning of a word can change

00:11:48.679 --> 00:11:53.359
dramatically depending on the context.

00:11:50.639 --> 00:11:55.080
And we also saw that the standalone or

00:11:53.360 --> 00:11:58.360
uncontextual embeddings that we saw for

00:11:55.080 --> 00:11:59.960
last week, like GloVe, um

00:11:58.360 --> 00:12:01.440
you know, they don't take context into

00:11:59.960 --> 00:12:04.040
account because they give a single

00:12:01.440 --> 00:12:05.880
unique embedding vector to every word.

00:12:04.039 --> 00:12:07.959
And if a word ends up having lots of

00:12:05.879 --> 00:12:09.720
different meanings, that vector is kind

00:12:07.960 --> 00:12:11.680
of some mushy average of all those

00:12:09.720 --> 00:12:13.320
meanings.

00:12:11.679 --> 00:12:15.399
Okay. So,

00:12:13.320 --> 00:12:16.960
the word see. I will see you soon. I

00:12:15.399 --> 00:12:18.959
will see this project to its end. I see

00:12:16.960 --> 00:12:20.879
what you mean. Very different meanings

00:12:18.960 --> 00:12:21.920
of the word see. This is my favorite,

00:12:20.879 --> 00:12:23.559
bank.

00:12:21.919 --> 00:12:24.838
Uh I went to the bank to apply for a

00:12:23.559 --> 00:12:27.359
loan. I'm banking on the job. I'm

00:12:24.839 --> 00:12:29.839
standing on the left bank. And so on. Uh

00:12:27.360 --> 00:12:31.680
it. The animal Oh, this is actually very

00:12:29.839 --> 00:12:33.640
good. The animal didn't cross

00:12:31.679 --> 00:12:34.719
the street because it was too tired. The

00:12:33.639 --> 00:12:37.279
animal didn't cross the street because

00:12:34.720 --> 00:12:39.040
it was too wide.

00:12:37.279 --> 00:12:40.519
Can you imagine

00:12:39.039 --> 00:12:42.279
a deep neural network looking at this

00:12:40.519 --> 00:12:44.360
word it and trying to figure out what

00:12:42.279 --> 00:12:46.120
the heck does the word it mean?

00:12:44.360 --> 00:12:48.480
What is it referring to?

00:12:46.120 --> 00:12:50.440
Tricky, right?

00:12:48.480 --> 00:12:52.000
Um and then, you know, if you take the

00:12:50.440 --> 00:12:53.120
word station, and I have the station

00:12:52.000 --> 00:12:55.200
example here because we're going to use

00:12:53.120 --> 00:12:57.080
it a bit more in the rest of the lecture.

00:12:55.200 --> 00:12:59.360
You know, the station could be

00:12:57.080 --> 00:13:00.839
a radio station, a train station, being

00:12:59.360 --> 00:13:03.080
stationed somewhere, the International

00:13:00.839 --> 00:13:04.360
Space Station. The list goes on.

00:13:03.080 --> 00:13:05.960
So, clearly order matters. I mean,

00:13:04.360 --> 00:13:08.680
context matters.

00:13:05.960 --> 00:13:08.680
And

00:13:08.879 --> 00:13:12.000
clearly order matters. You can come up

00:13:10.480 --> 00:13:13.279
with your own examples. Let's keep

00:13:12.000 --> 00:13:15.159
moving.

00:13:13.279 --> 00:13:18.240
Okay?

00:13:15.159 --> 00:13:20.600
So, the Transformer architecture

00:13:18.240 --> 00:13:22.080
is a very elegant

00:13:20.600 --> 00:13:23.560
architecture

00:13:22.080 --> 00:13:25.360
which checks these three boxes

00:13:23.559 --> 00:13:26.479
beautifully.

00:13:25.360 --> 00:13:27.960
Okay?

00:13:26.480 --> 00:13:29.680
Um it takes the context into account,

00:13:27.960 --> 00:13:32.000
order into account, and then, you know,

00:13:29.679 --> 00:13:33.599
whatever is produced out there

00:13:32.000 --> 00:13:34.399
is the same length as whatever is coming

00:13:33.600 --> 00:13:35.279
in.

00:13:34.399 --> 00:13:36.879
And the reason it's called the

00:13:35.279 --> 00:13:39.679
Transformer

00:13:36.879 --> 00:13:41.759
is because if 10 things come in,

00:13:39.679 --> 00:13:43.799
10 things go out, but the 10 things that

00:13:41.759 --> 00:13:46.000
go out are a transformed version of the

00:13:43.799 --> 00:13:47.919
10 things that came in.

00:13:46.000 --> 00:13:48.919
That's why it's called the Transformer.

00:13:47.919 --> 00:13:50.679
Okay?

00:13:48.919 --> 00:13:52.559
If 10 things came in and like one thing

00:13:50.679 --> 00:13:54.359
goes out, well, sure, it's been

00:13:52.559 --> 00:13:56.439
transformed, but what is it? It's some

00:13:54.360 --> 00:13:58.800
weird thing. But when 10 comes in and 10

00:13:56.440 --> 00:13:59.760
goes out, the 10 is preserved. Each

00:13:58.799 --> 00:14:01.039
one is getting transformed in

00:13:59.759 --> 00:14:04.279
an interesting way.

00:14:01.039 --> 00:14:04.279
That's why it's called the Transformer.

00:14:04.440 --> 00:14:08.400
So, developed in 2017, just dramatic

00:14:07.080 --> 00:14:09.360
impact.

00:14:08.399 --> 00:14:11.078
So, by the way, the effect of

00:14:09.360 --> 00:14:13.639
the Transformer, um

00:14:11.078 --> 00:14:15.239
Google had done a lot of research on

00:14:13.639 --> 00:14:17.439
machine translation and obviously

00:14:15.240 --> 00:14:20.079
search. Uh and then when the Transformer

00:14:17.440 --> 00:14:22.600
was invented, uh they took a model called

00:14:20.078 --> 00:14:25.958
BERT, which we will uh see on Wednesday

00:14:22.600 --> 00:14:28.320
in detail, and then they introduced BERT

00:14:25.958 --> 00:14:29.599
into their search, and the results were

00:14:28.320 --> 00:14:32.040
dramatic.

00:14:29.600 --> 00:14:34.279
And from what I've read, apparently the

00:14:32.039 --> 00:14:35.639
impact of doing that was a

00:14:34.279 --> 00:14:37.360
Typically, when you make an improvement

00:14:35.639 --> 00:14:38.799
to search, the improvement is very, very

00:14:37.360 --> 00:14:40.680
marginal because it's already a very

00:14:38.799 --> 00:14:42.240
heavily optimized system.

00:14:40.679 --> 00:14:43.799
And then when the Transformer thing came

00:14:42.240 --> 00:14:46.320
along, there was actually a significant

00:14:43.799 --> 00:14:48.240
jump in search quality. So, for example,

00:14:46.320 --> 00:14:49.800
and you can actually read this blog post

00:14:48.240 --> 00:14:51.839
uh which came out when they introduced

00:14:49.799 --> 00:14:54.359
BERT into search. It gives you a bit

00:14:51.839 --> 00:14:56.360
more detail. But here, so if you had if

00:14:54.360 --> 00:14:57.600
you were querying something like uh you

00:14:56.360 --> 00:15:00.480
know,

00:14:57.600 --> 00:15:02.240
"Brazil traveler to USA needs a visa."

00:15:00.480 --> 00:15:03.279
Right? You would think that it

00:15:02.240 --> 00:15:04.600
should give you information about how to

00:15:03.279 --> 00:15:06.480
get a visa if you're a Brazilian wanting to

00:15:04.600 --> 00:15:09.000
come to the US, right? Uh but it turns

00:15:06.480 --> 00:15:11.159
out the first result was how US citizens

00:15:09.000 --> 00:15:13.240
going to Brazil can, you know,

00:15:11.159 --> 00:15:14.879
get a visa.

00:15:13.240 --> 00:15:16.480
So, clearly it's not taking the order

00:15:14.879 --> 00:15:19.000
into account.

00:15:16.480 --> 00:15:20.440
Uh but once they introduced it, boom,

00:15:19.000 --> 00:15:21.720
the first thing was the US Embassy in

00:15:20.440 --> 00:15:24.200
Brazil.

00:15:21.720 --> 00:15:26.839
And a page on how to get a visa.

00:15:24.200 --> 00:15:30.120
So, the effect was dramatic.

00:15:26.839 --> 00:15:31.600
And so, this is a seminal paper,

00:15:30.120 --> 00:15:34.440
right? And it's actually worth reading

00:15:31.600 --> 00:15:35.639
the paper. And uh you

00:15:34.440 --> 00:15:38.079
know, this is the picture, this is

00:15:35.639 --> 00:15:39.799
like an iconic picture at this point

00:15:38.078 --> 00:15:41.679
in the deep learning community. And we

00:15:39.799 --> 00:15:43.399
will actually understand this picture

00:15:41.679 --> 00:15:45.399
by the end of Wednesday.

00:15:43.399 --> 00:15:46.399
Um and so, but the funny thing is that

00:15:45.399 --> 00:15:48.720
when the researchers came up with it,

00:15:46.399 --> 00:15:50.720
they didn't realize, in some sense, like

00:15:48.720 --> 00:15:51.759
what they had stumbled on uh because

00:15:50.720 --> 00:15:53.000
they were really focused on machine

00:15:51.759 --> 00:15:54.240
translation.

00:15:53.000 --> 00:15:55.519
It's only the rest of the research

00:15:54.240 --> 00:15:56.879
community that took it and started

00:15:55.519 --> 00:15:59.639
applying it to everything else and found it

00:15:56.879 --> 00:16:01.240
to be really, really effective.

00:15:59.639 --> 00:16:02.399
Okay. So, we're going to take each one

00:16:01.240 --> 00:16:04.039
of these things and figure out how to

00:16:02.399 --> 00:16:05.480
address them and thereby build up the

00:16:04.039 --> 00:16:07.759
architecture.

00:16:05.480 --> 00:16:10.000
Any questions before I continue?

00:16:07.759 --> 00:16:10.000
Yeah.

00:16:11.000 --> 00:16:16.360
Is there any uh

00:16:13.559 --> 00:16:18.679
benefits to discarding some of those

00:16:16.360 --> 00:16:21.240
unclassified nodes before it goes out

00:16:18.679 --> 00:16:23.078
rather than going like you have 18 words

00:16:21.240 --> 00:16:24.279
input, discarding all the ones that

00:16:23.078 --> 00:16:26.239
don't actually matter and just doing

00:16:24.279 --> 00:16:28.480
like eight for your output?

00:16:26.240 --> 00:16:29.959
Yeah, yeah. I think that's a totally

00:16:28.480 --> 00:16:31.120
fine way to think about it. Basically,

00:16:29.958 --> 00:16:33.119
what you're saying is that can we have a

00:16:31.120 --> 00:16:35.839
two-stage model? The first-stage model

00:16:33.120 --> 00:16:37.200
is like an O/non-O classifier. And the

00:16:35.839 --> 00:16:38.520
second-stage model only goes after the

00:16:37.200 --> 00:16:39.280
non-Os. That's a totally fine way to do

00:16:38.519 --> 00:16:40.319
it.

00:16:39.279 --> 00:16:41.958
Yeah.

00:16:40.320 --> 00:16:43.320
But as you can see, even if you

00:16:41.958 --> 00:16:44.958
go with just a simple one-stage

00:16:43.320 --> 00:16:47.120
model, if you use a Transformer, you get

00:16:44.958 --> 00:16:50.359
fantastic accuracy.

00:16:47.120 --> 00:16:52.360
And we'll do the Colab in a bit.

00:16:50.360 --> 00:16:53.600
Uh all right. So, let's take the first

00:16:52.360 --> 00:16:55.240
thing. How do you take the

00:16:53.600 --> 00:16:56.959
context of everything around the word

00:16:55.240 --> 00:16:59.279
into account?

00:16:56.958 --> 00:17:01.000
So,

00:16:59.279 --> 00:17:03.039
so let's say that this is the

00:17:01.000 --> 00:17:04.199
sentence we have. The train slowly left

00:17:03.039 --> 00:17:06.759
the station.

00:17:04.199 --> 00:17:09.839
Okay? For each of these words,

00:17:06.759 --> 00:17:11.279
we can calculate a standalone embedding,

00:17:09.838 --> 00:17:13.720
say something like Glove.

00:17:11.279 --> 00:17:15.959
Okay? So, I'm just depicting these

00:17:13.720 --> 00:17:18.400
standalone embeddings using these uh

00:17:15.959 --> 00:17:19.600
you know, thingies here.

00:17:18.400 --> 00:17:20.560
Please appreciate them because it took

00:17:19.599 --> 00:17:22.119
me a while to do them in

00:17:20.559 --> 00:17:24.678
PowerPoint.

00:17:22.119 --> 00:17:27.000
Okay? So, these are W1 through W6. These

00:17:24.679 --> 00:17:29.120
are the vectors standing up. Okay?

00:17:27.000 --> 00:17:30.359
Um now, we can easily

00:17:29.119 --> 00:17:32.079
do that.

00:17:30.359 --> 00:17:34.559
Now, what we want to figure out is we

00:17:32.079 --> 00:17:36.119
want to focus on the word station.

00:17:34.559 --> 00:17:37.519
And since station could mean very

00:17:36.119 --> 00:17:39.559
different things in different contexts,

00:17:37.519 --> 00:17:40.599
we want to figure out how do we actually

00:17:39.559 --> 00:17:43.359
take

00:17:40.599 --> 00:17:45.439
station's embedding and contextualize it

00:17:43.359 --> 00:17:46.799
using all the other words that are going

00:17:45.440 --> 00:17:49.799
on in that sentence.

00:17:46.799 --> 00:17:50.879
Okay? Clearly, it's a train station.

00:17:49.799 --> 00:17:53.720
So, we need to take the fact that there

00:17:50.880 --> 00:17:55.120
is a train involved to alter the

00:17:53.720 --> 00:17:56.880
embedding of the word station. Right?

00:17:55.119 --> 00:17:58.719
That's what taking context into account

00:17:56.880 --> 00:17:59.960
actually means.

00:17:58.720 --> 00:18:03.039
So,

00:17:59.960 --> 00:18:04.799
how can we modify station's embedding so

00:18:03.039 --> 00:18:07.519
that it incorporates all the other

00:18:04.799 --> 00:18:08.399
words? That's the question.

00:18:07.519 --> 00:18:11.879
Okay?

00:18:08.400 --> 00:18:14.040
So, when you look at it this way,

00:18:11.880 --> 00:18:15.640
imagine just for a moment,

00:18:14.039 --> 00:18:16.559
just for a moment,

00:18:15.640 --> 00:18:17.640
that

00:18:16.559 --> 00:18:18.960
we

00:18:17.640 --> 00:18:20.440
Now, some of the other words in the

00:18:18.960 --> 00:18:22.279
sentence don't matter. The word the

00:18:20.440 --> 00:18:24.120
probably doesn't matter.

00:18:22.279 --> 00:18:26.678
But some of the other words like train,

00:18:24.119 --> 00:18:29.119
slowly, left probably do matter.

00:18:26.679 --> 00:18:30.480
And suppose, just magically, we have

00:18:29.119 --> 00:18:32.439
been told

00:18:30.480 --> 00:18:34.480
all the other words in the sentence,

00:18:32.440 --> 00:18:36.640
this is how much weight you have to give

00:18:34.480 --> 00:18:38.159
to them. These don't give it any weight.

00:18:36.640 --> 00:18:39.800
Those give it a lot of weight. Okay?

00:18:38.159 --> 00:18:41.360
Suppose we are told that.

00:18:39.799 --> 00:18:42.639
Or to put it another way, and this

00:18:41.359 --> 00:18:44.199
is the word that's heavily used in the

00:18:42.640 --> 00:18:46.200
literature,

00:18:44.200 --> 00:18:47.720
someone tells you how much attention to

00:18:46.200 --> 00:18:48.720
pay to the other words.

00:18:47.720 --> 00:18:50.440
Whether you got to pay it a lot of

00:18:48.720 --> 00:18:51.360
attention or very little attention.

00:18:50.440 --> 00:18:52.600
Okay?

00:18:51.359 --> 00:18:54.439
And this

00:18:52.599 --> 00:18:55.879
how much attention to pay is given in

00:18:54.440 --> 00:18:57.440
the form of a weight that you can use.

00:18:55.880 --> 00:18:58.880
Okay? So,

00:18:57.440 --> 00:19:00.080
um

00:18:58.880 --> 00:19:01.840
if you look at it that way, from this

00:19:00.079 --> 00:19:04.039
notion of which word should I give a lot

00:19:01.839 --> 00:19:05.599
of weight to and very little weight to,

00:19:04.039 --> 00:19:06.799
in this example, intuitively, which

00:19:05.599 --> 00:19:07.759
words do you think should get the most

00:19:06.799 --> 00:19:09.759
weight and which words do you think

00:19:07.759 --> 00:19:11.319
should get the least weight?

00:19:09.759 --> 00:19:12.679
Yeah. Train.

00:19:11.319 --> 00:19:13.759
Train. Right.

00:19:12.679 --> 00:19:14.840
Time matters.

00:19:13.759 --> 00:19:16.200
Uh

00:19:14.839 --> 00:19:18.119
you can do one at a time.

00:19:16.200 --> 00:19:18.720
Train. Okay, thank you.

00:19:18.119 --> 00:19:21.279
Uh

00:19:18.720 --> 00:19:22.279
okay. Others?

00:19:21.279 --> 00:19:23.918
Slowly.

00:19:22.279 --> 00:19:25.599
Slowly. Right. So, that also seems to

00:19:23.919 --> 00:19:27.759
have some bearing on it. What about

00:19:25.599 --> 00:19:28.799
words that we don't

00:19:27.759 --> 00:19:31.079
think are going to

00:19:28.799 --> 00:19:33.279
help at all?

00:19:31.079 --> 00:19:35.839
The. The. Exactly. It probably doesn't

00:19:33.279 --> 00:19:37.200
do much here. Some context it actually

00:19:35.839 --> 00:19:38.678
might make a difference, but in this

00:19:37.200 --> 00:19:40.759
sentence, maybe not.

00:19:38.679 --> 00:19:42.200
Right? Intuitively.

00:19:40.759 --> 00:19:43.079
So,

00:19:42.200 --> 00:19:45.000
we should probably give a lot of weight

00:19:43.079 --> 00:19:47.839
to train, maybe a little to slowly and

00:19:45.000 --> 00:19:49.359
left, and hardly anything to the.

00:19:47.839 --> 00:19:52.519
Okay?

00:19:49.359 --> 00:19:56.759
And so, this intuition that we have

00:19:52.519 --> 00:19:58.519
can be written numerically as maybe we

00:19:56.759 --> 00:20:00.160
have a bunch of weights that add up to

00:19:58.519 --> 00:20:02.240
one.

00:20:00.160 --> 00:20:03.560
Okay?

00:20:02.240 --> 00:20:07.120
Okay, maybe something like this. So, we

00:20:03.559 --> 00:20:11.639
are saying the train 30% weightage,

00:20:07.119 --> 00:20:14.159
maybe 8% weightage to left, maybe 12%

00:20:11.640 --> 00:20:15.680
weightage to slowly, uh and then as you

00:20:14.160 --> 00:20:17.960
will see here,

00:20:15.680 --> 00:20:20.680
the station's own embedding also plays a

00:20:17.960 --> 00:20:22.240
role. Because we want to take its own

00:20:20.680 --> 00:20:23.799
standalone embedding and just move it

00:20:22.240 --> 00:20:26.759
slightly, change it slightly, which

00:20:23.799 --> 00:20:28.279
means that has to be the starting point.

00:20:26.759 --> 00:20:30.799
So, it will get a lot of weight. We

00:20:28.279 --> 00:20:33.599
can't ignore itself, in other words.

00:20:30.799 --> 00:20:34.720
Right? So, we give it maybe 40% weight.

00:20:33.599 --> 00:20:35.879
By the way, these numbers I just made

00:20:34.720 --> 00:20:38.640
them up.

00:20:35.880 --> 00:20:40.640
Okay? Uh yeah.

00:20:38.640 --> 00:20:43.120
I'm sorry, it's a quick question. So,

00:20:40.640 --> 00:20:44.560
the weights

00:20:43.119 --> 00:20:46.399
are they

00:20:44.559 --> 00:20:48.200
standalone for the

00:20:46.400 --> 00:20:50.759
context of the entire sentence or are

00:20:48.200 --> 00:20:54.000
they related to station that we started

00:20:50.759 --> 00:20:56.400
off with? These six numbers are

00:20:54.000 --> 00:20:57.799
only pertinent to station.

00:20:56.400 --> 00:20:59.960
And for each word, we're going to do

00:20:57.799 --> 00:21:01.319
something similar.

00:20:59.960 --> 00:21:03.240
Yeah.

00:21:01.319 --> 00:21:05.399
And at this point, does the model

00:21:03.240 --> 00:21:07.000
understand order? Because like I'm just

00:21:05.400 --> 00:21:08.920
thinking of like left because like I

00:21:07.000 --> 00:21:09.559
gave it a very low

00:21:08.920 --> 00:21:11.360
a

00:21:09.559 --> 00:21:14.200
a very low weight. But let's say left

00:21:11.359 --> 00:21:15.919
comes slowly, leave left station. The

00:21:14.200 --> 00:21:18.000
station only have the two be higher.

00:21:15.920 --> 00:21:20.039
Yeah, correct. So, at this point, we are

00:21:18.000 --> 00:21:22.480
not worrying about order. We

00:21:20.039 --> 00:21:24.000
are only worrying about context.

00:21:22.480 --> 00:21:25.720
Later, we'll take order into account.

00:21:24.000 --> 00:21:28.119
But how does the model know that left

00:21:25.720 --> 00:21:31.039
here is of lesser importance because

00:21:28.119 --> 00:21:33.000
it's a verb rather than a

00:21:31.039 --> 00:21:34.279
It has to figure it out.

00:21:33.000 --> 00:21:36.519
We are just

00:21:34.279 --> 00:21:38.879
giving it a whole bunch of capabilities.

00:21:36.519 --> 00:21:42.279
How it manifests those capabilities is

00:21:38.880 --> 00:21:42.280
all going to emerge from training.

00:21:42.880 --> 00:21:46.760
Okay. So, all right. So, let's say we

00:21:45.160 --> 00:21:48.120
have something like this. So, what we

00:21:46.759 --> 00:21:49.119
can do,

00:21:48.119 --> 00:21:50.319
right? And we'll get to the

00:21:49.119 --> 00:21:51.639
all-important question of where do we

00:21:50.319 --> 00:21:54.599
get these numbers from in just a moment.

00:21:51.640 --> 00:21:56.240
But suppose you had the numbers,

00:21:54.599 --> 00:22:00.399
how can we use these numbers to

00:21:56.240 --> 00:22:03.839
contextualize W6? What can we do?

00:22:00.400 --> 00:22:03.840
What is the simplest thing you can do?

00:22:05.359 --> 00:22:10.240
You have W6, you want to make it a new

00:22:07.359 --> 00:22:13.639
W6, which is now contextual, is aware of

00:22:10.240 --> 00:22:13.640
what else is going on. Okay?

00:22:17.480 --> 00:22:22.079
It's working now, I think.

00:22:20.119 --> 00:22:23.639
We can take a weighted average. Exactly.

00:22:22.079 --> 00:22:25.079
Exactly. So, when you have a bunch of

00:22:23.640 --> 00:22:26.400
things and you have a bunch of weights

00:22:25.079 --> 00:22:27.839
and, you know, when we

00:22:26.400 --> 00:22:29.480
have to somehow modify one of those

00:22:27.839 --> 00:22:30.519
things with those weights, the simplest

00:22:29.480 --> 00:22:31.559
thing you can do is to take a weighted

00:22:30.519 --> 00:22:33.000
average.

00:22:31.559 --> 00:22:34.359
Right? So, that's exactly what we're

00:22:33.000 --> 00:22:35.279
going to do.

00:22:34.359 --> 00:22:37.119
So, we're going to take all these

00:22:35.279 --> 00:22:39.678
weights

00:22:37.119 --> 00:22:40.639
and just like move them up.

00:22:39.679 --> 00:22:42.720
Okay?

00:22:40.640 --> 00:22:44.120
Move them up.

00:22:42.720 --> 00:22:46.319
Don't even get me started on how long it

00:22:44.119 --> 00:22:47.439
took me to get this arrow to run.

00:22:46.319 --> 00:22:49.439
I don't know about you, folks. Is it

00:22:47.440 --> 00:22:51.160
It's extremely painful to get the U-turn

00:22:49.440 --> 00:22:52.039
arrows to work in PowerPoint.

00:22:51.160 --> 00:22:54.960
Okay?

00:22:52.039 --> 00:22:57.159
Anyway, uh back to work. So,

00:22:54.960 --> 00:23:01.400
so we just move these up here, okay? So,

00:22:57.160 --> 00:23:03.679
now we can do 0.05 * this vector + 0.3 *

00:23:01.400 --> 00:23:06.679
that vector and so on and so forth.

00:23:03.679 --> 00:23:08.640
And the result is just another vector.

00:23:06.679 --> 00:23:11.400
Right?

00:23:08.640 --> 00:23:13.440
And that vector, folks,

00:23:11.400 --> 00:23:15.320
is the contextual embedding vector of

00:23:13.440 --> 00:23:17.759
station.

00:23:15.319 --> 00:23:19.759
Okay? That was the standalone embedding.

00:23:17.759 --> 00:23:21.119
And now we multiplied this by

00:23:19.759 --> 00:23:24.759
that, whoop whoop whoop, add them

00:23:21.119 --> 00:23:24.759
all up, and then you get a new vector.
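
NOTE
The same weighted average in code, using the illustrative weights from the slide
(0.05, 0.30, 0.12, 0.08, 0.05, 0.40 for "the train slowly left the station").
The embeddings are random stand-ins for GloVe-style vectors; only the arithmetic
is the point here.
    import numpy as np
    W = [np.random.randn(50) for _ in range(6)]         # standalone embeddings w1..w6; w6 = "station"
    s = np.array([0.05, 0.30, 0.12, 0.08, 0.05, 0.40])  # weights on each word; they sum to 1
    w6_hat = sum(s_i * w_i for s_i, w_i in zip(s, W))   # weighted average = contextual embedding of "station"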

00:23:24.799 --> 00:23:29.519
And contextual embeddings have this

00:23:27.839 --> 00:23:30.959
bluish kind of color.

00:23:29.519 --> 00:23:32.400
Okay?

00:23:30.960 --> 00:23:33.559
And I'll maintain that color scheme as

00:23:32.400 --> 00:23:36.320
we go along.

00:23:33.559 --> 00:23:38.440
So, that's it.

00:23:36.319 --> 00:23:41.079
That's it. That's the idea.

00:23:38.440 --> 00:23:41.080
Any questions?

00:23:41.679 --> 00:23:44.800
Yeah.

00:23:43.039 --> 00:23:46.960
How did you come up with the original

00:23:44.799 --> 00:23:49.359
weights again? You just kind of guessed?

00:23:46.960 --> 00:23:51.559
No, these weights I just I just

00:23:49.359 --> 00:23:53.279
hand typed them in manually just to make

00:23:51.559 --> 00:23:54.319
the point. And And now I'm going to talk

00:23:53.279 --> 00:23:57.039
about how we are actually going to

00:23:54.319 --> 00:23:57.039
calculate them.

00:23:57.599 --> 00:24:00.959
Okay.

00:23:58.640 --> 00:24:03.080
Uh all right, cool. So, now I'm going to

00:24:00.960 --> 00:24:05.400
uh okay, enough pictures. Let's switch

00:24:03.079 --> 00:24:07.319
to some math. So,

00:24:05.400 --> 00:24:08.759
so basically, let's write it

00:24:07.319 --> 00:24:11.279
a bit more formally.

00:24:08.759 --> 00:24:12.920
So, we have these W1 through W6, which

00:24:11.279 --> 00:24:14.240
are the standalone embeddings.

00:24:12.920 --> 00:24:16.080
And then for station, we want to

00:24:14.240 --> 00:24:17.359
calculate, you know, W6 with a little

00:24:16.079 --> 00:24:19.599
hat on it, which is the contextual

00:24:17.359 --> 00:24:22.359
embedding. And the way we do it is to

00:24:19.599 --> 00:24:25.000
say we calculate some weights for each

00:24:22.359 --> 00:24:27.159
of these words. So, this weight S16

00:24:25.000 --> 00:24:30.079
means that the weight

00:24:27.160 --> 00:24:32.040
of the first word on the sixth word,

00:24:30.079 --> 00:24:33.678
which happens to be station.

00:24:32.039 --> 00:24:35.839
The weight of the second word on the

00:24:33.679 --> 00:24:38.120
sixth word, and so on and so forth. And

00:24:35.839 --> 00:24:40.480
so, what we are saying is that W6 is

00:24:38.119 --> 00:24:41.879
just, you know, this weight times W1,

00:24:40.480 --> 00:24:43.240
this times W2, whoop whoop whoop,

00:24:41.880 --> 00:24:45.560
that's it.
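
NOTE
In the notation just introduced, with s_i6 the weight of word i on word 6, the
contextual embedding of station is simply the weighted sum, written here in LaTeX:
    \hat{w}_6 = \sum_{i=1}^{6} s_{i6}\, w_i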

00:24:43.240 --> 00:24:45.559
Okay?

00:24:45.839 --> 00:24:48.839
I have to inflict all these, you know,

00:24:47.039 --> 00:24:51.240
subscripts and all that because

00:24:48.839 --> 00:24:53.919
you know, we need it.

00:24:51.240 --> 00:24:56.559
All right. So, that's it.

00:24:53.920 --> 00:24:58.000
That's what we have.

00:24:56.559 --> 00:25:00.279
Now, let's talk about Okay, any

00:24:58.000 --> 00:25:01.759
questions on the mechanics of it

00:25:00.279 --> 00:25:02.879
before I get to Okay, where do these

00:25:01.759 --> 00:25:05.160
weights come from?

00:25:02.880 --> 00:25:05.160
Yeah.

00:25:06.920 --> 00:25:11.039
Utilizing something like Google, for

00:25:08.839 --> 00:25:12.759
example, like how does it understand

00:25:11.039 --> 00:25:13.960
like the context of

00:25:12.759 --> 00:25:16.000
new words

00:25:13.960 --> 00:25:18.480
and context like

00:25:16.000 --> 00:25:20.400
process immediately through the training

00:25:18.480 --> 00:25:21.480
data the users played or

00:25:20.400 --> 00:25:22.640
like basically

00:25:21.480 --> 00:25:24.440
>> like a totally new word that didn't

00:25:22.640 --> 00:25:27.520
exist before? A new word or a new

00:25:24.440 --> 00:25:29.320
context to a word that already exists.

00:25:27.519 --> 00:25:31.400
No, I think that the context is supplied

00:25:29.319 --> 00:25:33.159
because the query coming into something

00:25:31.400 --> 00:25:35.120
like Google is a full sentence.

00:25:33.160 --> 00:25:36.400
And we only take that sentence and take

00:25:35.119 --> 00:25:37.919
only the sentence into account as the

00:25:36.400 --> 00:25:40.000
context for us.

00:25:37.920 --> 00:25:41.600
So, the context is always present to us

00:25:40.000 --> 00:25:44.079
when we get the input.

00:25:41.599 --> 00:25:45.199
But the other question you had uh of

00:25:44.079 --> 00:25:46.678
Okay, what if there's a brand new word

00:25:45.200 --> 00:25:47.799
you've never seen before, for which

00:25:46.679 --> 00:25:49.720
there is not even a standalone

00:25:47.799 --> 00:25:51.919
embedding? What do you do then?

00:25:49.720 --> 00:25:53.600
So, let's punt on that till Wednesday

00:25:51.920 --> 00:25:55.440
because I have to talk about something

00:25:53.599 --> 00:25:57.359
called byte pair encoding and stuff like

00:25:55.440 --> 00:25:59.279
that before I can answer that.

00:25:57.359 --> 00:26:00.599
And really quickly, does that

00:25:59.279 --> 00:26:03.480
immediately translate to their

00:26:00.599 --> 00:26:06.399
predictive search queries?

00:26:03.480 --> 00:26:08.559
Utilizing like verb

00:26:06.400 --> 00:26:10.759
Yeah, a new word, for example.

00:26:08.559 --> 00:26:12.200
Does that automatically get applied to

00:26:10.759 --> 00:26:14.000
the predictive search queries like when

00:26:12.200 --> 00:26:15.880
we're saying how to and then just home?

00:26:14.000 --> 00:26:17.200
Oh, you mean like the auto complete?

00:26:15.880 --> 00:26:18.560
You know, auto complete uses a slightly

00:26:17.200 --> 00:26:20.880
different mechanism.

00:26:18.559 --> 00:26:23.440
Um, they had a very complicated

00:26:20.880 --> 00:26:24.800
non-transformer thing for a long time.

00:26:23.440 --> 00:26:26.320
I'm sure they have a transformer version

00:26:24.799 --> 00:26:28.039
now, but I'm not privy to how

00:26:26.319 --> 00:26:29.799
exactly they've done it. So, I don't

00:26:28.039 --> 00:26:31.200
quite know how they do it. But what

00:26:29.799 --> 00:26:33.279
you're proposing is a reasonable way to

00:26:31.200 --> 00:26:34.360
think about it.

00:26:33.279 --> 00:26:36.678
Yeah.

00:26:34.359 --> 00:26:39.678
Um my question is like we have six

00:26:36.679 --> 00:26:41.800
words, station and but number parameters

00:26:39.679 --> 00:26:43.400
as in weights, let's say 10 of them.

00:26:41.799 --> 00:26:46.119
And then we have calculated the

00:26:43.400 --> 00:26:48.280
contextual version of W6. Yeah. So, this

00:26:46.119 --> 00:26:50.559
has a different parameter or it remains

00:26:48.279 --> 00:26:54.759
the same? It replaces. Okay.

00:26:50.559 --> 00:26:57.720
Yeah, W6 becomes W6 hat.

00:26:54.759 --> 00:26:58.759
Okay. And how we are expecting

00:26:57.720 --> 00:27:00.600
Right.

00:26:58.759 --> 00:27:03.640
This contextual word will be really

00:27:00.599 --> 00:27:03.639
good. That's what we want.

00:27:07.759 --> 00:27:11.319
Do we lose that

00:27:08.960 --> 00:27:12.759
or retain it? No, we lose it. And as you

00:27:11.319 --> 00:27:14.439
will see here, as it flows through the

00:27:12.759 --> 00:27:16.720
transformer, it's getting more and more

00:27:14.440 --> 00:27:19.920
and more contextualized.

00:27:16.720 --> 00:27:19.920
So, it's a left-to-right flow.

00:27:20.000 --> 00:27:23.200
All right. Uh all right, great. So, the

00:27:22.000 --> 00:27:25.720
By the way, this thing that we did for

00:27:23.200 --> 00:27:27.960
station, we will do it for each word in

00:27:25.720 --> 00:27:30.039
the sentence.

00:27:27.960 --> 00:27:31.759
The same exact logic. Obviously, the

00:27:30.039 --> 00:27:34.079
weights are going to change.

00:27:31.759 --> 00:27:37.920
Okay? But what will happen is that W1

00:27:34.079 --> 00:27:39.480
through W6 will become W1 hat through W6

00:27:37.920 --> 00:27:41.880
hat.

00:27:39.480 --> 00:27:43.360
The same exact logic is going to hold.

00:27:41.880 --> 00:27:44.440
Okay? I just don't have the

00:27:43.359 --> 00:27:45.719
slides for it because it's a waste of

00:27:44.440 --> 00:27:47.160
time.

00:27:45.720 --> 00:27:48.880
The same exact logic is going to hold.

00:27:47.160 --> 00:27:50.679
All right. Now, switch gears

00:27:48.880 --> 00:27:51.600
and and answer the all-important

00:27:50.679 --> 00:27:52.679
question of where are the weights going

00:27:51.599 --> 00:27:54.678
to come from.

00:27:52.679 --> 00:27:56.840
Okay? So, the intuition here is really

00:27:54.679 --> 00:27:59.800
really interesting and elegant.

00:27:56.839 --> 00:28:02.199
So, clearly the weight of a word

00:27:59.799 --> 00:28:04.879
should be proportional to how related it

00:28:02.200 --> 00:28:06.319
is to the word station.

00:28:04.880 --> 00:28:08.240
Right?

00:28:06.319 --> 00:28:09.919
The word train clearly is very related

00:28:08.240 --> 00:28:11.559
to the word station.

00:28:09.920 --> 00:28:12.640
The word the, it's not clear how

00:28:11.559 --> 00:28:15.440
related it is. Probably not all that

00:28:12.640 --> 00:28:17.160
related. So, the relatedness matters to

00:28:15.440 --> 00:28:19.360
the weight. More related, higher the

00:28:17.160 --> 00:28:21.400
weight, right? Just intuitive.

00:28:19.359 --> 00:28:23.799
So, one way to quantify how related two

00:28:21.400 --> 00:28:25.560
words are is to take their standalone

00:28:23.799 --> 00:28:27.918
embeddings and calculate the dot

00:28:25.559 --> 00:28:27.918
product.

00:28:28.000 --> 00:28:33.119
Okay? So, um

00:28:30.720 --> 00:28:36.799
in case folks have

00:28:33.119 --> 00:28:36.799
sort of forgotten about the dot product,

00:28:39.559 --> 00:28:44.599
Oops, that's not what I want.

00:28:42.519 --> 00:28:47.200
So, um, let's say you

00:28:44.599 --> 00:28:47.199
have a vector.

00:28:50.039 --> 00:28:52.599
Okay, let's say this is the

00:28:51.599 --> 00:28:55.039
vector for

00:28:52.599 --> 00:28:55.039
train.

00:28:55.720 --> 00:28:59.079
This is the vector for station.

00:28:59.279 --> 00:29:04.599
Okay? So, the dot product of these two

00:29:01.960 --> 00:29:04.600
vectors,

00:29:05.559 --> 00:29:11.759
I'll write it as train

00:29:09.039 --> 00:29:11.759
station

00:29:12.039 --> 00:29:17.519
equals

00:29:13.880 --> 00:29:19.960
basically the length

00:29:17.519 --> 00:29:19.960
of

00:29:20.359 --> 00:29:23.479
the vector for train

00:29:23.720 --> 00:29:30.480
times the length

00:29:26.679 --> 00:29:30.480
of the vector for station

00:29:30.720 --> 00:29:36.519
times the cosine

00:29:33.839 --> 00:29:38.480
of the angle between them.

00:29:36.519 --> 00:29:40.639
Okay?

00:29:38.480 --> 00:29:40.640
Okay?
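
NOTE
The identity being written out on the iPad, in symbols (theta is the angle
between the two vectors):
    \langle w_{\mathrm{train}}, w_{\mathrm{station}} \rangle
      = \lVert w_{\mathrm{train}} \rVert \, \lVert w_{\mathrm{station}} \rVert \, \cos\theta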

00:29:42.400 --> 00:29:46.440
So, how long is each vector?

00:29:45.159 --> 00:29:48.560
Product of the two and then the angle

00:29:46.440 --> 00:29:50.480
between them. Okay? Now, let's assume

00:29:48.559 --> 00:29:52.480
for simplicity that these lengths are

00:29:50.480 --> 00:29:54.519
roughly the same.

00:29:52.480 --> 00:29:55.599
They're just one unit length. Okay? Just

00:29:54.519 --> 00:29:57.720
roughly.

00:29:55.599 --> 00:30:01.799
So, if you assume that,

00:29:57.720 --> 00:30:01.799
okay? This thing

00:30:01.880 --> 00:30:05.160
becomes one, let's say.

00:30:03.799 --> 00:30:07.119
Okay?

00:30:05.160 --> 00:30:09.240
This thing becomes one.

00:30:07.119 --> 00:30:11.399
So, all the action

00:30:09.240 --> 00:30:12.519
is here.

00:30:11.400 --> 00:30:14.280
Okay?

00:30:12.519 --> 00:30:15.839
So, all the action is here.

00:30:14.279 --> 00:30:17.440
So, basically, the dot product of these

00:30:15.839 --> 00:30:20.079
two vectors is really the cosine of

00:30:17.440 --> 00:30:22.360
the angle between them.

00:30:20.079 --> 00:30:25.319
So, now, the question is, if you have

00:30:22.359 --> 00:30:25.319
something like this,

00:30:27.200 --> 00:30:31.519
right? Which are very close to each

00:30:28.519 --> 00:30:34.440
other, the cosine of a very small angle,

00:30:31.519 --> 00:30:35.480
actually, the cosine of zero is what?

00:30:34.440 --> 00:30:37.720
One.

00:30:35.480 --> 00:30:39.000
So, if the angle is really, really

00:30:37.720 --> 00:30:40.160
small, the cosine is going to be very

00:30:39.000 --> 00:30:41.559
close to one.

00:30:40.160 --> 00:30:43.519
Right? Because zero is one. The cosine

00:30:41.559 --> 00:30:46.639
of zero is one. So, this thing is going

00:30:43.519 --> 00:30:49.039
to be, you know, pretty close to one.

00:30:46.640 --> 00:30:51.520
If you have two vectors that

00:30:49.039 --> 00:30:52.759
are like this, 90° apart, what is the

00:30:51.519 --> 00:30:55.440
cosine?

00:30:52.759 --> 00:30:58.079
Zero. They're orthogonal, right? Which

00:30:55.440 --> 00:31:00.720
maps to the English orthogonal.

00:30:58.079 --> 00:31:01.960
So, the cosine of that is zero.

00:31:00.720 --> 00:31:03.400
And then, if you have something like

00:31:01.960 --> 00:31:04.640
this,

00:31:03.400 --> 00:31:07.400
where they're literally pointing in

00:31:04.640 --> 00:31:07.400
opposite direction,

00:31:07.640 --> 00:31:11.240
what is the cosine of that, 180°?

00:31:09.880 --> 00:31:13.080
Minus one.

00:31:11.240 --> 00:31:14.799
So, that's it. So, if these

00:31:13.079 --> 00:31:16.119
two vectors are

00:31:14.799 --> 00:31:18.039
very close to each other,

00:31:16.119 --> 00:31:19.919
the cosine of the angle between them is

00:31:18.039 --> 00:31:21.399
going to be very close to one. If they

00:31:19.920 --> 00:31:22.960
are really kind of unrelated, it's going

00:31:21.400 --> 00:31:24.240
to be zero. If they're anti-related,

00:31:22.960 --> 00:31:27.120
it's going to be minus one.

00:31:24.240 --> 00:31:28.960
Right? So, that's how dot products

00:31:27.119 --> 00:31:30.679
capture this notion of closeness or

00:31:28.960 --> 00:31:31.680
relatedness.

00:31:30.680 --> 00:31:36.320
Okay?

00:31:31.680 --> 00:31:37.960
So, all right. Um iPad.

00:31:36.319 --> 00:31:40.480
So, we can use the dot product of these

00:31:37.960 --> 00:31:43.519
embeddings to capture relatedness.

00:31:40.480 --> 00:31:45.960
And so, okay, iPad done.

00:31:43.519 --> 00:31:48.000
So, now that

00:31:45.960 --> 00:31:49.920
we know that dot products can be used,

00:31:48.000 --> 00:31:51.759
we can't use them as is because we need

00:31:49.920 --> 00:31:53.880
to do one more thing to make them proper

00:31:51.759 --> 00:31:55.519
weights. And what I mean by proper

00:31:53.880 --> 00:31:58.000
weights is that we want the weights

00:31:55.519 --> 00:31:59.279
to be, first of all, non-negative, and

00:31:58.000 --> 00:32:00.240
we want them to add up

00:31:59.279 --> 00:32:01.319
to one, right? That's what a

00:32:00.240 --> 00:32:02.279
weighted average actually is going to

00:32:01.319 --> 00:32:05.359
mean.

00:32:02.279 --> 00:32:07.359
But these cosines could be negative.

00:32:05.359 --> 00:32:08.959
Right? And so, we need to now adjust

00:32:07.359 --> 00:32:10.039
them to make them proper so that every

00:32:08.960 --> 00:32:11.400
one of them is guaranteed to be

00:32:10.039 --> 00:32:12.279
non-negative and they will add up to

00:32:11.400 --> 00:32:14.200
one.

00:32:12.279 --> 00:32:15.839
When was the last time you had to take a

00:32:14.200 --> 00:32:18.279
bunch of numbers, which could be

00:32:15.839 --> 00:32:20.480
anything, and then somehow make sure

00:32:18.279 --> 00:32:22.079
that they are going to be positive,

00:32:20.480 --> 00:32:23.839
non-negative, and they add up to one?

00:32:22.079 --> 00:32:25.720
When was the last time?

00:32:23.839 --> 00:32:27.519
Yeah, softmax. Exactly. So, we'll do the

00:32:25.720 --> 00:32:29.759
same trick.

00:32:27.519 --> 00:32:32.799
So, what we'll simply do is we'll just,

00:32:29.759 --> 00:32:35.400
you know, exponentiate them, right? So,

00:32:32.799 --> 00:32:36.519
like this ⟨W1, W6⟩: the angle bracket

00:32:35.400 --> 00:32:39.120
thing is the dot product. That's the

00:32:36.519 --> 00:32:41.319
notation I'm using. EXP of that is just

00:32:39.119 --> 00:32:42.839
you exponentiate them, e raised to that.

00:32:41.319 --> 00:32:44.599
And once you exponentiate them, they all

00:32:42.839 --> 00:32:46.119
become non-negative, and then we just

00:32:44.599 --> 00:32:47.359
divide each by the sum of everything.

00:32:46.119 --> 00:32:48.559
So, it the whole thing will become like

00:32:47.359 --> 00:32:50.119
a probability, right? It'll just add up

00:32:48.559 --> 00:32:52.119
to one.

00:32:50.119 --> 00:32:53.519
Make sense? So, that's how we take

00:32:52.119 --> 00:32:55.919
arbitrary numbers and make them proper

00:32:53.519 --> 00:32:55.920
weights.

00:32:56.679 --> 00:32:59.200
All right.

00:32:59.880 --> 00:33:02.840
So,

00:33:01.440 --> 00:33:04.200
to summarize,

00:33:02.839 --> 00:33:05.759
from embeddings to contextual

00:33:04.200 --> 00:33:08.120
embeddings, that's what we do.

00:33:05.759 --> 00:33:09.720
We take all the stand-alone embeddings,

00:33:08.119 --> 00:33:11.678
we calculate these weights using this

00:33:09.720 --> 00:33:12.799
formula, and then we just do the

00:33:11.679 --> 00:33:16.080
weighted average, and we arrive at the

00:33:12.799 --> 00:33:16.079
contextual embedding, and boom, done.

00:33:16.480 --> 00:33:20.079
Okay?

00:33:17.880 --> 00:33:22.360
And so, by choosing weights in this

00:33:20.079 --> 00:33:24.359
manner, the embedding of a word gets

00:33:22.359 --> 00:33:26.839
dragged closer to the embeddings of the

00:33:24.359 --> 00:33:29.039
other words in proportion to how related

00:33:26.839 --> 00:33:30.439
they are. So, just imagine for a second,

00:33:29.039 --> 00:33:31.920
right? In this case, station obviously

00:33:30.440 --> 00:33:33.880
has many contexts, but let's assume for

00:33:31.920 --> 00:33:35.800
a second that it only has the train context

00:33:33.880 --> 00:33:37.400
and the radio station context.

00:33:35.799 --> 00:33:39.200
Okay?

00:33:37.400 --> 00:33:40.920
In the current context, train is closely

00:33:39.200 --> 00:33:42.640
related to station, and therefore exerts

00:33:40.920 --> 00:33:43.840
a strong pull on it.

00:33:42.640 --> 00:33:45.720
Right?

00:33:43.839 --> 00:33:47.199
Now, radio is also related to station,

00:33:45.720 --> 00:33:48.440
but it doesn't appear in the

00:33:47.200 --> 00:33:49.840
sentence.

00:33:48.440 --> 00:33:52.200
So, effectively, it has a weight of

00:33:49.839 --> 00:33:52.199
zero.

00:33:52.839 --> 00:33:56.399
Okay? And that's the beauty of it.

00:33:55.119 --> 00:33:58.079
And please do not ask me things like,

00:33:56.400 --> 00:33:59.640
you know, I was listening to a great

00:33:58.079 --> 00:34:01.559
song on the radio station and the train

00:33:59.640 --> 00:34:03.360
pulled out of the station.

00:34:01.559 --> 00:34:05.480
Okay? Transformers can deal with stuff

00:34:03.359 --> 00:34:07.519
like that. Okay? But yeah, but you get

00:34:05.480 --> 00:34:09.878
the idea, the main idea.

00:34:07.519 --> 00:34:11.480
So, by moving station closer to

00:34:09.878 --> 00:34:13.440
train,

00:34:11.480 --> 00:34:15.559
by paying more attention to train, we

00:34:13.440 --> 00:34:18.000
are contextualizing the embedding of the word

00:34:15.559 --> 00:34:20.440
station to the context of trains,

00:34:18.000 --> 00:34:22.960
platforms, departures, tickets, and so

00:34:20.440 --> 00:34:25.159
on. It's like this portal into the whole

00:34:22.960 --> 00:34:27.280
train world.

00:34:25.159 --> 00:34:29.840
Right? It's beautiful. This simple idea

00:34:27.280 --> 00:34:29.840
will get you there.

00:34:30.840 --> 00:34:33.960
Okay?

00:34:31.800 --> 00:34:36.679
So, this, folks, is called

00:34:33.960 --> 00:34:37.760
self-attention.

00:34:36.679 --> 00:34:39.639
What we just described is called

00:34:37.760 --> 00:34:41.240
self-attention.

00:34:39.639 --> 00:34:42.679
And it's the key building block of

00:34:41.239 --> 00:34:44.759
transformers.

00:34:42.679 --> 00:34:46.599
Okay? Um and so, to

00:34:44.760 --> 00:34:50.320
summarize, stand-alone embeddings come

00:34:46.599 --> 00:34:50.319
in, contextual embeddings go out.

00:34:50.760 --> 00:34:54.720
Any questions?

00:34:52.398 --> 00:34:56.199
Uh yeah.

00:34:54.719 --> 00:34:58.799
Uh I'm still struggling a little bit

00:34:56.199 --> 00:35:00.239
with the intuition of the word

00:34:58.800 --> 00:35:02.039
contextual embedding. So, like the

00:35:00.239 --> 00:35:03.639
weight of station in the station

00:35:02.039 --> 00:35:05.159
embedding, how how should I think about

00:35:03.639 --> 00:35:07.679
that? It seems intuitive that it would

00:35:05.159 --> 00:35:11.879
be high for all contextual embeddings,

00:35:07.679 --> 00:35:11.879
but I assume that's not the case.

00:35:12.079 --> 00:35:15.920
It'll typically be a

00:35:13.639 --> 00:35:17.599
high number because the cosine of the

00:35:15.920 --> 00:35:19.200
vector to itself is going to be

00:35:17.599 --> 00:35:20.799
one, right? So,

00:35:19.199 --> 00:35:21.559
it's going to be pretty high, but

00:35:20.800 --> 00:35:22.880
there's no guarantee it's going to be

00:35:21.559 --> 00:35:24.840
the highest.

00:35:22.880 --> 00:35:26.519
Right? Because

00:35:24.840 --> 00:35:28.000
the length doesn't have to be one. They

00:35:26.519 --> 00:35:30.358
could be anything. We try to keep them kind of

00:35:28.000 --> 00:35:31.840
smallish, but they don't have to be.

00:35:30.358 --> 00:35:33.319
Uh so, the way I would think about it is

00:35:31.840 --> 00:35:35.320
imagine that you take an average of

00:35:33.320 --> 00:35:37.359
everything else first, and then you

00:35:35.320 --> 00:35:38.480
average it with the old

00:35:37.358 --> 00:35:39.639
embedding.

00:35:38.480 --> 00:35:40.880
Effectively, it's the same as just

00:35:39.639 --> 00:35:42.639
calculating the different weights and

00:35:40.880 --> 00:35:44.599
averaging the whole thing together.

00:35:42.639 --> 00:35:45.639
Sure.

00:35:44.599 --> 00:35:47.679
So, why should you say that the

00:35:45.639 --> 00:35:50.239
embedding of a word would be the same

00:35:47.679 --> 00:35:52.679
number but same place? But is this the

00:35:50.239 --> 00:35:53.719
reason why you need a contextual

00:35:52.679 --> 00:35:55.159
embedding?

00:35:53.719 --> 00:35:56.519
But even if it's like a

00:35:55.159 --> 00:35:59.000
other word

00:35:56.519 --> 00:36:01.079
and it's not related, that's what

00:35:59.000 --> 00:36:02.840
I'm saying. Correct. Correct. Exactly.

00:36:01.079 --> 00:36:04.759
Exactly. And the other thing to remember

00:36:02.840 --> 00:36:07.120
is that

00:36:04.760 --> 00:36:09.000
by keeping the size of the input, the

00:36:07.119 --> 00:36:10.119
input cardinality, intact

00:36:09.000 --> 00:36:11.119
as you move through the transformer

00:36:10.119 --> 00:36:12.719
stack,

00:36:11.119 --> 00:36:14.880
when you finally come out the other end,

00:36:12.719 --> 00:36:16.439
there is sort of no loss of information.

00:36:14.880 --> 00:36:18.079
And in the very end, you can choose to

00:36:16.440 --> 00:36:19.519
aggregate, simplify, summarize, and so

00:36:18.079 --> 00:36:22.840
on and so forth. It preserves your

00:36:19.519 --> 00:36:22.840
optionality as long as possible.

00:36:23.679 --> 00:36:27.000
Do you know

00:36:25.119 --> 00:36:28.039
how how long the embedding contextual

00:36:27.000 --> 00:36:29.880
embedding is?

00:36:28.039 --> 00:36:31.039
Is that a factor between the

00:36:29.880 --> 00:36:33.240
two?

00:36:31.039 --> 00:36:34.679
You know

00:36:33.239 --> 00:36:35.839
Yeah, so, what we do is the sentence

00:36:34.679 --> 00:36:37.679
comes in. There's a whole notion of

00:36:35.840 --> 00:36:39.079
something called a context window, or

00:36:37.679 --> 00:36:40.480
what is the sort of the maximum length

00:36:39.079 --> 00:36:42.480
that these sentences will handle, and

00:36:40.480 --> 00:36:43.519
that's a parameter you can set. And

00:36:42.480 --> 00:36:44.719
we'll come to that when you actually

00:36:43.519 --> 00:36:46.639
look at the collab.

00:36:44.719 --> 00:36:48.399
Um

00:36:46.639 --> 00:36:49.639
Was that a question in the middle? No.

00:36:48.400 --> 00:36:53.119
Okay.

00:36:49.639 --> 00:36:53.119
All right. So, that is self-attention.

00:36:53.199 --> 00:36:58.000
Um and now,

00:36:55.199 --> 00:37:00.119
because that felt too easy,

00:36:58.000 --> 00:37:02.079
we're going to do a little tweak called

00:37:00.119 --> 00:37:03.039
multi-head attention.

00:37:02.079 --> 00:37:04.719
So,

00:37:03.039 --> 00:37:06.039
this is this is the self-attention we

00:37:04.719 --> 00:37:07.439
just saw.

00:37:06.039 --> 00:37:08.920
What we can do is we can be like, you

00:37:07.440 --> 00:37:10.720
know what?

00:37:08.920 --> 00:37:12.400
Why can't we have more than this? Why

00:37:10.719 --> 00:37:13.879
can't we have more than one of these?

00:37:12.400 --> 00:37:16.160
So, this is called an attention head,

00:37:13.880 --> 00:37:18.519
self-attention head. We'll have multiple

00:37:16.159 --> 00:37:20.279
self-attention heads. Okay?

00:37:18.519 --> 00:37:22.239
Now, and I'll come back to the top thing

00:37:20.280 --> 00:37:23.840
in a second, okay? So, the question

00:37:22.239 --> 00:37:25.399
is, why should we have multiple

00:37:23.840 --> 00:37:26.920
self-attention heads?

00:37:25.400 --> 00:37:28.280
Because a particular attention head is

00:37:26.920 --> 00:37:30.480
going to pick up some patterns. The

00:37:28.280 --> 00:37:32.519
reason is that

00:37:30.480 --> 00:37:34.358
it'll help us attend to the multiple

00:37:32.519 --> 00:37:35.599
patterns that may be present in a single

00:37:34.358 --> 00:37:37.239
sentence.

00:37:35.599 --> 00:37:38.440
So far, when I've been explaining, uh

00:37:37.239 --> 00:37:40.319
I've sort of basically been looking at

00:37:38.440 --> 00:37:42.240
what the meaning of these words are.

00:37:40.320 --> 00:37:44.120
Just the meaning of these words. But in

00:37:42.239 --> 00:37:45.759
any complicated sentence, you have to

00:37:44.119 --> 00:37:47.519
worry about grammar, you have to worry

00:37:45.760 --> 00:37:49.880
about tense, you have to worry about

00:37:47.519 --> 00:37:51.880
tone. You have to worry about facts

00:37:49.880 --> 00:37:53.760
versus, you know, opinions. There could

00:37:51.880 --> 00:37:55.559
be any number of complicated patterns

00:37:53.760 --> 00:37:57.920
that are sitting in a simple sentence.

00:37:55.559 --> 00:37:59.519
Which means, well, there is just not one

00:37:57.920 --> 00:38:02.079
way to pay attention. There could be

00:37:59.519 --> 00:38:03.880
many ways of paying attention, many

00:38:02.079 --> 00:38:05.799
different reasons to pay

00:38:03.880 --> 00:38:07.599
attention. Right?

00:38:05.800 --> 00:38:09.240
Which means let's have many

00:38:07.599 --> 00:38:10.719
of these attention heads.

00:38:09.239 --> 00:38:12.919
And each one could be learning something

00:38:10.719 --> 00:38:14.919
else. It's exactly like having lots of

00:38:12.920 --> 00:38:16.680
filters in a convolutional network.

00:38:14.920 --> 00:38:17.960
Right? Uh one filter might learn a line,

00:38:16.679 --> 00:38:19.399
another filter might learn a curve, and

00:38:17.960 --> 00:38:21.000
so on and so forth. And we don't want to

00:38:19.400 --> 00:38:22.760
decide a priori, oh, you're going to

00:38:21.000 --> 00:38:23.840
learn a line, right? Similarly here,

00:38:22.760 --> 00:38:25.040
we're not telling any of these things

00:38:23.840 --> 00:38:27.400
what you have to learn. They just have

00:38:25.039 --> 00:38:28.960
to learn based on the training process.

00:38:27.400 --> 00:38:30.800
So, what we do is

00:38:28.960 --> 00:38:32.400
So, actually, here's an example

00:38:30.800 --> 00:38:35.000
from the original transformer

00:38:32.400 --> 00:38:37.039
paper, where the sentence is: the law

00:38:35.000 --> 00:38:39.559
will never be

00:38:37.039 --> 00:38:43.079
perfect, but its application should be

00:38:39.559 --> 00:38:44.400
just. This is what we are missing, in my

00:38:43.079 --> 00:38:46.400
opinion.

00:38:44.400 --> 00:38:48.840
A complicated sentence, right? So, in the

00:38:46.400 --> 00:38:50.559
first attention head, this

00:38:48.840 --> 00:38:53.120
is the pattern it picks up.

00:38:50.559 --> 00:38:54.759
So, for example, the word perfect here,

00:38:53.119 --> 00:38:57.279
the contextual embedding of the word

00:38:54.760 --> 00:39:00.480
perfect

00:38:57.280 --> 00:39:01.920
draws heavily on the word law

00:39:00.480 --> 00:39:02.920
in this example.

00:39:01.920 --> 00:39:04.840
Okay?

00:39:02.920 --> 00:39:06.240
If you look at another attention head,

00:39:04.840 --> 00:39:07.840
the contextual embedding for the word

00:39:06.239 --> 00:39:11.519
perfect is actually drawing heavily from

00:39:07.840 --> 00:39:13.039
just perfect and nothing else. Right?

00:39:11.519 --> 00:39:14.880
And if you look at other words, the

00:39:13.039 --> 00:39:17.079
patterns are subtly different of what

00:39:14.880 --> 00:39:18.400
it's paying attention to.

00:39:17.079 --> 00:39:20.279
So, these are two different attention

00:39:18.400 --> 00:39:21.960
heads, and they're learning different

00:39:20.280 --> 00:39:24.200
kinds of attentions.

00:39:21.960 --> 00:39:25.679
Okay? In reality, trying to make sense

00:39:24.199 --> 00:39:27.719
of why they

00:39:25.679 --> 00:39:29.399
pay attention to the way they do, it's

00:39:27.719 --> 00:39:30.319
usually quite sort of difficult to

00:39:29.400 --> 00:39:32.320
figure that out. You can't actually

00:39:30.320 --> 00:39:34.200
interpret it. But when you have lots of

00:39:32.320 --> 00:39:35.840
attention heads, the performance on the

00:39:34.199 --> 00:39:37.559
task that you care about gets really

00:39:35.840 --> 00:39:39.000
much better.

00:39:37.559 --> 00:39:40.759
Right? And then you're saying, okay, I

00:39:39.000 --> 00:39:42.000
can use that. Uh yeah.

00:39:40.760 --> 00:39:43.520
That's the

00:39:42.000 --> 00:39:46.960
I think that's the idea behind this. Is

00:39:43.519 --> 00:39:46.960
that the idea behind this?

00:39:49.320 --> 00:39:53.360
Right.

00:39:50.760 --> 00:39:55.640
Exactly. Same logic. Same logic.

00:39:53.360 --> 00:39:55.640
Yeah.

00:40:13.519 --> 00:40:17.360
Actually in the convolutional case, the

00:40:15.079 --> 00:40:19.519
ones and zeros I had were just example

00:40:17.360 --> 00:40:21.000
numbers to show that that particular

00:40:19.519 --> 00:40:23.360
filter could detect a vertical line or

00:40:21.000 --> 00:40:24.760
horizontal line. You will recall that

00:40:23.360 --> 00:40:26.000
when we actually train a convolutional

00:40:24.760 --> 00:40:27.880
network, we actually don't specify the

00:40:26.000 --> 00:40:30.039
numbers. We start with random

00:40:27.880 --> 00:40:32.200
initialized weights and then we let

00:40:30.039 --> 00:40:34.199
backpropagation figure it out.

00:40:32.199 --> 00:40:35.679
Similarly here, we don't decide any of

00:40:34.199 --> 00:40:37.239
these things. We just let back prop

00:40:35.679 --> 00:40:39.559
figure it out.

00:40:37.239 --> 00:40:40.559
Okay? And now the question of what are

00:40:39.559 --> 00:40:42.519
the weights that are actually going to

00:40:40.559 --> 00:40:43.480
be learned? We'll come come to that in a

00:40:42.519 --> 00:40:46.480
bit.

00:40:43.480 --> 00:40:46.480
Okay? Uh yeah.

00:40:47.559 --> 00:40:53.559
Uh I was wondering how come we have

00:40:50.360 --> 00:40:55.480
different attention heads even though

00:40:53.559 --> 00:40:57.119
it seems like they're only a function

00:40:55.480 --> 00:40:59.480
of a dot product and we have the same

00:40:57.119 --> 00:41:01.239
dot product for the same embeddings.

00:40:59.480 --> 00:41:02.960
Great question. Great question. And I

00:41:01.239 --> 00:41:04.799
literally have a a note in my slide

00:41:02.960 --> 00:41:06.480
saying, "If a student asks this good

00:41:04.800 --> 00:41:08.480
question, tell them to wait till

00:41:06.480 --> 00:41:10.400
Wednesday."

00:41:08.480 --> 00:41:12.079
So, great question. And we'll come back

00:41:10.400 --> 00:41:14.440
to that uh on Wednesday and spend a fair

00:41:12.079 --> 00:41:17.079
amount of time on it. So, uh

00:41:14.440 --> 00:41:19.800
the point that's being made here

00:41:17.079 --> 00:41:22.840
is that oops.

00:41:19.800 --> 00:41:24.720
When we look at self-attention,

00:41:22.840 --> 00:41:26.600
the embeddings came in and we did all

00:41:24.719 --> 00:41:28.799
these dot products and the contextual

00:41:26.599 --> 00:41:30.319
things popped out the other end. Note

00:41:28.800 --> 00:41:32.800
that inside the self-attention box,

00:41:30.320 --> 00:41:34.160
there are no parameters.

00:41:32.800 --> 00:41:36.519
There are no parameters.

00:41:34.159 --> 00:41:38.799
So, the question that is being raised

00:41:36.519 --> 00:41:40.880
here is that so what are we learning

00:41:38.800 --> 00:41:42.840
really? If there is nothing inside to be

00:41:40.880 --> 00:41:43.880
learned, if there are no parameters, no

00:41:42.840 --> 00:41:46.480
coefficients, what are we learning?

00:41:43.880 --> 00:41:48.400
That's the question. And by extension,

00:41:46.480 --> 00:41:49.599
if we have two of these and neither of

00:41:48.400 --> 00:41:52.079
them is learning anything, what's the

00:41:49.599 --> 00:41:52.079
point?

00:41:52.880 --> 00:41:57.320
Sadly, you have to wait till Wednesday.

00:41:55.719 --> 00:41:58.719
Okay? But we have a great answer to the

00:41:57.320 --> 00:42:00.120
question. So,

00:41:58.719 --> 00:42:03.279
it'll be worth it. And if you can't

00:42:00.119 --> 00:42:03.279
stand the suspense, read the book.

00:42:03.320 --> 00:42:07.320
All right. So, that is uh that's why we

00:42:05.519 --> 00:42:09.719
need multiple heads. Okay? And now to

00:42:07.320 --> 00:42:11.400
come back to this, so what we do is it

00:42:09.719 --> 00:42:13.199
goes through this head and you get these

00:42:11.400 --> 00:42:15.760
W's, right? And it goes through here and

00:42:13.199 --> 00:42:17.639
we get another set of W's.

00:42:15.760 --> 00:42:19.880
Then what we do at the very end is we

00:42:17.639 --> 00:42:21.920
concatenate them.

00:42:19.880 --> 00:42:23.480
Okay? We concatenate them and we do a

00:42:21.920 --> 00:42:25.800
projection. And this is what I mean by

00:42:23.480 --> 00:42:25.800
that.

00:42:29.199 --> 00:42:33.279
So, we have

00:42:30.760 --> 00:42:35.880
uh this this is one self-attention head,

00:42:33.280 --> 00:42:38.960
self-attention one.

00:42:35.880 --> 00:42:41.760
This is self-attention two.

00:42:38.960 --> 00:42:44.720
And let's say that

00:42:41.760 --> 00:42:47.200
W1 hat comes out.

00:42:44.719 --> 00:42:48.799
And I'm just going to call it Z Z1 for

00:42:47.199 --> 00:42:49.919
the same thing so that there's no name

00:42:48.800 --> 00:42:52.440
clash.

00:42:49.920 --> 00:42:55.360
Okay? And uh the W2, W6, all of them are

00:42:52.440 --> 00:42:57.599
coming, right? Let's focus on W1 and Z1.

00:42:55.360 --> 00:42:59.320
W1 and Z1 are both contextual embeddings

00:42:57.599 --> 00:43:01.679
for the same word.

00:42:59.320 --> 00:43:04.720
Okay? For the first word, word one. And

00:43:01.679 --> 00:43:06.440
so what we do is, let's say this is W1,

00:43:04.719 --> 00:43:07.959
let's say this vector is like

00:43:06.440 --> 00:43:10.039
this. Okay?

00:43:07.960 --> 00:43:12.400
And let's say that this vector is like

00:43:10.039 --> 00:43:14.679
this.

00:43:12.400 --> 00:43:16.360
What I mean when I say concatenated here

00:43:14.679 --> 00:43:18.719
is we literally take

00:43:16.360 --> 00:43:20.320
um this embedding here,

00:43:18.719 --> 00:43:22.839
then we take this

00:43:20.320 --> 00:43:22.840
thing here.

00:43:23.079 --> 00:43:27.799
Okay? And we just make it a long vector.

00:43:25.039 --> 00:43:30.519
We concatenate it. But now this vector

00:43:27.800 --> 00:43:32.519
has become twice as long, right?

00:43:30.519 --> 00:43:34.759
But remember, we always want to

00:43:32.519 --> 00:43:36.759
preserve the number of inputs

00:43:34.760 --> 00:43:39.400
we have and the lengths of these vectors

00:43:36.760 --> 00:43:42.760
everywhere as we go along. So, what we

00:43:39.400 --> 00:43:44.840
do is at this point, we run it through

00:43:42.760 --> 00:43:46.560
a single dense layer

00:43:44.840 --> 00:43:48.480
which will take this thing and make it

00:43:46.559 --> 00:43:50.039
back into the same small shape as

00:43:48.480 --> 00:43:53.119
before.

00:43:50.039 --> 00:43:53.119
So, this is a dense layer.

00:43:54.320 --> 00:43:58.559
That's it. So, this vector comes in

00:43:56.840 --> 00:44:00.240
and it becomes it gets compressed back

00:43:58.559 --> 00:44:01.239
to the original shape that came out of

00:44:00.239 --> 00:44:03.599
here.

00:44:01.239 --> 00:44:04.919
So, you could have like 20 of these uh

00:44:03.599 --> 00:44:06.480
attention heads

00:44:04.920 --> 00:44:08.440
and the concatenated will be 20 times

00:44:06.480 --> 00:44:09.800
long and then just project boom, one

00:44:08.440 --> 00:44:12.119
dense layer comes back to the original

00:44:09.800 --> 00:44:12.120
shape.

00:44:12.920 --> 00:44:17.519
So, that's that is the projection step.

00:44:16.320 --> 00:44:20.480
And that's what I mean here when I say

00:44:17.519 --> 00:44:21.800
concatenate and project.

00:44:20.480 --> 00:44:23.559
So, at this point, what we have is

00:44:21.800 --> 00:44:25.120
things come in, we contextualize them

00:44:23.559 --> 00:44:27.039
using these different attention heads,

00:44:25.119 --> 00:44:29.000
and when they come out of the attention

00:44:27.039 --> 00:44:31.039
heads, we take them all, we just like

00:44:29.000 --> 00:44:32.480
concatenate them, and then compress them

00:44:31.039 --> 00:44:35.320
back to the same original starting

00:44:32.480 --> 00:44:37.119
shape. Right? If these vectors are 100

00:44:35.320 --> 00:44:39.640
units long or 100 dimension long,

00:44:37.119 --> 00:44:42.000
whatever comes out is 100 still.

00:44:39.639 --> 00:44:43.839
And preserving this

00:44:42.000 --> 00:44:44.920
size as we go along is very important

00:44:43.840 --> 00:44:46.800
for reasons that'll become apparent a

00:44:44.920 --> 00:44:49.440
bit later.

00:44:46.800 --> 00:44:50.320
Okay. So, that is the multi-attention

00:44:49.440 --> 00:44:53.679
thing.

00:44:50.320 --> 00:44:55.120
Now, a final tweak for today

00:44:53.679 --> 00:44:57.440
is that we will inject some

00:44:55.119 --> 00:44:59.400
non-linearity

00:44:57.440 --> 00:45:01.358
with some dense layer dense ReLU layers

00:44:59.400 --> 00:45:03.280
at the very end. So, we went through a

00:45:01.358 --> 00:45:04.400
bunch of attention heads. We came up

00:45:03.280 --> 00:45:05.240
with a bunch of contextual embeddings

00:45:04.400 --> 00:45:07.720
now.

00:45:05.239 --> 00:45:08.479
So, at this point so far,

00:45:07.719 --> 00:45:10.759
since there are no

00:45:08.480 --> 00:45:11.840
parameters inside these boxes,

00:45:10.760 --> 00:45:13.000
uh

00:45:11.840 --> 00:45:13.960
right? And there are some parameters

00:45:13.000 --> 00:45:15.559
here.

00:45:13.960 --> 00:45:16.480
We need some non-linearity. So

00:45:15.559 --> 00:45:18.840
far, there's been nothing that's

00:45:16.480 --> 00:45:21.480
non-linear. So, here we actually

00:45:18.840 --> 00:45:24.680
send it through one or more ReLUs.

00:45:21.480 --> 00:45:27.559
Typically, they just use one ReLU. So,

00:45:24.679 --> 00:45:27.559
and what I mean by that

00:45:34.199 --> 00:45:36.599
Sorry.

00:45:37.920 --> 00:45:44.480
So, this is what we had here and then

00:45:41.760 --> 00:45:44.480
we take it in

00:45:46.400 --> 00:45:49.599
and then run it through

00:45:50.079 --> 00:45:52.639
actually

00:45:54.719 --> 00:45:58.639
we typically run it through

00:45:57.320 --> 00:46:01.440
a ReLU.

00:45:58.639 --> 00:46:03.358
This is a nice ReLU.

00:46:01.440 --> 00:46:04.559
Okay? And all and and the rule of thumb,

00:46:03.358 --> 00:46:06.840
as you will see, if let's say this

00:46:04.559 --> 00:46:08.119
vector is say 100 dimensions long, they

00:46:06.840 --> 00:46:10.039
typically will choose a ReLU which is

00:46:08.119 --> 00:46:12.440
about 400

00:46:10.039 --> 00:46:15.920
wide. And then it just gets projected

00:46:12.440 --> 00:46:15.920
out again back to 100.

00:46:16.639 --> 00:46:20.279
So,

00:46:17.719 --> 00:46:21.759
this is just a simple, you know, the

00:46:20.280 --> 00:46:23.480
input comes in, goes through a single

00:46:21.760 --> 00:46:26.040
hidden layer with four four times as

00:46:23.480 --> 00:46:28.599
many as here, and then it

00:46:26.039 --> 00:46:29.800
project another dense layer

00:46:28.599 --> 00:46:32.279
to 100 again.

00:46:29.800 --> 00:46:33.280
And since there are ReLUs here,

00:46:32.280 --> 00:46:35.760
we have injected some

00:46:33.280 --> 00:46:37.519
non-linearity into the processing.

00:46:35.760 --> 00:46:39.200
Okay? Now,

00:46:37.519 --> 00:46:41.719
a lot of this stuff when it came out

00:46:39.199 --> 00:46:43.358
felt very ad hoc.

00:46:41.719 --> 00:46:45.599
Right? It didn't come from some deep,

00:46:43.358 --> 00:46:47.400
you know, theoretical motivations.

00:46:45.599 --> 00:46:49.400
But people had strong intuitions as

00:46:47.400 --> 00:46:51.680
to why these things were helpful. And as

00:46:49.400 --> 00:46:53.720
it turns out, since the transformer came

00:46:51.679 --> 00:46:55.519
out, people have tried to optimize every

00:46:53.719 --> 00:46:56.959
aspect of this thing.

00:46:55.519 --> 00:46:58.719
It's actually pretty difficult to beat

00:46:56.960 --> 00:47:00.358
the starting architecture.

00:46:58.719 --> 00:47:02.679
Right? Improvements have been made, but

00:47:00.358 --> 00:47:03.719
it's actually a very robust architecture.

00:47:02.679 --> 00:47:05.719
So,

00:47:03.719 --> 00:47:08.959
so that's what's going on here. And then

00:47:05.719 --> 00:47:10.919
when we come out of this thing,

00:47:08.960 --> 00:47:13.000
this is what we have, the story so far.

00:47:10.920 --> 00:47:14.639
We start with standalone

00:47:13.000 --> 00:47:15.960
embeddings. This could be

00:47:14.639 --> 00:47:18.159
GloVe embeddings, it could be random

00:47:15.960 --> 00:47:19.920
weights, doesn't matter. It goes through

00:47:18.159 --> 00:47:21.399
a bunch of self-attention heads. We

00:47:19.920 --> 00:47:23.840
concatenate it when it comes out the

00:47:21.400 --> 00:47:25.039
other end.

00:47:23.840 --> 00:47:27.160
Concatenate it when it comes out the

00:47:25.039 --> 00:47:29.119
other end. And then we project it back

00:47:27.159 --> 00:47:31.358
to the same size as before. Then we run

00:47:29.119 --> 00:47:33.400
it through, you know, a ReLU followed by

00:47:31.358 --> 00:47:36.079
a linear layer and we get these things

00:47:33.400 --> 00:47:37.760
again. So, in this whole process, if six

00:47:36.079 --> 00:47:40.400
things came in, six things will come

00:47:37.760 --> 00:47:41.920
out. And if those six things

00:47:40.400 --> 00:47:43.358
that came in

00:47:41.920 --> 00:47:45.440
were standalone embedding

00:47:43.358 --> 00:47:47.559
vectors of 100 dimensions, what comes

00:47:45.440 --> 00:47:48.559
out is also 100 dimensions.

00:47:47.559 --> 00:47:50.440
So, in that sense, you could think of

00:47:48.559 --> 00:47:52.358
this whole thing as a black box in which

00:47:50.440 --> 00:47:54.599
whatever you send in, the same number of

00:47:52.358 --> 00:47:56.079
things will come out of the same length.

00:47:54.599 --> 00:47:56.759
The numbers will be different because

00:47:56.079 --> 00:47:58.519
they will have been heavily

00:47:56.760 --> 00:48:00.240
contextualized.

00:47:58.519 --> 00:48:02.480
The numbers are much smarter, in other

00:48:00.239 --> 00:48:04.959
words.

00:48:02.480 --> 00:48:05.920
So, so far what we have seen is that we

00:48:04.960 --> 00:48:08.079
have satisfied two of the three

00:48:05.920 --> 00:48:09.920
requirements. We have taken the context

00:48:08.079 --> 00:48:11.119
of each word into account

00:48:09.920 --> 00:48:12.599
by using these dot products in the

00:48:11.119 --> 00:48:13.799
self-attention layer, and we can

00:48:12.599 --> 00:48:15.599
generate an output that is the same

00:48:13.800 --> 00:48:17.480
length as the input, but we have ignored

00:48:15.599 --> 00:48:19.759
the fact that we have ignored word order

00:48:17.480 --> 00:48:21.519
completely.

00:48:19.760 --> 00:48:23.880
Okay? Because whether I had said the

00:48:21.519 --> 00:48:25.559
train slowly left the station or I had

00:48:23.880 --> 00:48:26.800
said the the station slowly left the

00:48:25.559 --> 00:48:30.279
train,

00:48:26.800 --> 00:48:30.280
this thing won't know the difference.

00:48:30.840 --> 00:48:34.519
Because dot products

00:48:32.239 --> 00:48:36.559
function on sets, not on sequences. They

00:48:34.519 --> 00:48:37.800
function on sets.

00:48:36.559 --> 00:48:39.159
Okay? You should

00:48:37.800 --> 00:48:40.600
convince yourself of this. Regardless of

00:48:39.159 --> 00:48:42.039
the order, the dot product calculation

00:48:40.599 --> 00:48:45.159
doesn't change anything.

00:48:42.039 --> 00:48:45.159
Because we are doing every pair.

00:48:46.440 --> 00:48:50.519
Okay? So, the question is how do we take

00:48:48.159 --> 00:48:52.199
the order of the words into account? Um

00:48:50.519 --> 00:48:53.519
right. As I was saying, we can scramble

00:48:52.199 --> 00:48:54.519
the order of the words in a sentence and

00:48:53.519 --> 00:48:55.759
we'll get the exact same contextual

00:48:54.519 --> 00:48:57.079
embeddings at the end.

00:48:55.760 --> 00:48:58.840
So, by the way, if you're working on a

00:48:57.079 --> 00:49:00.319
problem in which the order doesn't

00:48:58.840 --> 00:49:01.960
matter,

00:49:00.320 --> 00:49:04.160
then you can stop right now and use the

00:49:01.960 --> 00:49:05.199
transformer.

00:49:04.159 --> 00:49:06.759
And there are many problems that are

00:49:05.199 --> 00:49:08.799
actually in that category where the

00:49:06.760 --> 00:49:10.880
order doesn't matter. So, if you take

00:49:08.800 --> 00:49:12.359
traditional structured data, right? Uh

00:49:10.880 --> 00:49:14.320
tabular data,

00:49:12.358 --> 00:49:15.759
uh you know, blood pressure, cholesterol

00:49:14.320 --> 00:49:17.519
level, boom boom boom. Does it predict

00:49:15.760 --> 00:49:18.520
heart disease? Well, there is no order

00:49:17.519 --> 00:49:20.199
in that thing. You can use the

00:49:18.519 --> 00:49:22.119
transformer as is without doing anything

00:49:20.199 --> 00:49:24.679
more.

00:49:22.119 --> 00:49:27.199
So, transformers work for both sets and

00:49:24.679 --> 00:49:29.839
sequences where order matters.

00:49:27.199 --> 00:49:32.239
Okay. So, the fix for this is something

00:49:29.840 --> 00:49:33.160
called the positional encoding.

00:49:32.239 --> 00:49:34.839
Um

00:49:33.159 --> 00:49:36.159
so what we do is very simple. There are

00:49:34.840 --> 00:49:40.920
many things that have been

00:49:36.159 --> 00:49:42.759
invented um to give the transformer

00:49:40.920 --> 00:49:44.159
some information

00:49:42.760 --> 00:49:45.760
about the order of each of the things

00:49:44.159 --> 00:49:46.799
that are coming in.

00:49:45.760 --> 00:49:47.920
I'm going to go with something called

00:49:46.800 --> 00:49:49.480
the, you know,

00:49:47.920 --> 00:49:51.440
the simplest possible way which actually

00:49:49.480 --> 00:49:52.840
works pretty well in practice. So, what

00:49:51.440 --> 00:49:55.000
we do is

00:49:52.840 --> 00:49:56.960
for each position

00:49:55.000 --> 00:49:58.280
each possible position in the input

00:49:56.960 --> 00:50:00.280
starting from the first position all the

00:49:58.280 --> 00:50:02.120
way through the last position

00:50:00.280 --> 00:50:05.280
we imagine that that position itself is

00:50:02.119 --> 00:50:05.279
a categorical variable.

00:50:05.599 --> 00:50:10.039
Right? If a sentence can only be 30

00:50:07.639 --> 00:50:11.719
words long, let's say, we say that hey,

00:50:10.039 --> 00:50:14.599
the position of each word is a number

00:50:11.719 --> 00:50:16.039
between 0 and 29.

00:50:14.599 --> 00:50:17.960
And so, we can just think of it as a

00:50:16.039 --> 00:50:20.000
categorical variable.

00:50:17.960 --> 00:50:22.159
And because the categorical variable, we

00:50:20.000 --> 00:50:24.199
can just imagine an embedding for that

00:50:22.159 --> 00:50:25.319
for each potential value. So, it'll

00:50:24.199 --> 00:50:27.000
become clear in just a moment because I

00:50:25.320 --> 00:50:28.920
have a numerical example.

00:50:27.000 --> 00:50:30.800
And so, what we do is we will just take

00:50:28.920 --> 00:50:32.920
that standalone embedding and then we'll

00:50:30.800 --> 00:50:33.960
take this position embedding

00:50:32.920 --> 00:50:35.280
which represents the position of the

00:50:33.960 --> 00:50:36.800
word in the sentence, we just add them

00:50:35.280 --> 00:50:39.560
up.

00:50:36.800 --> 00:50:40.519
Okay? Uh yeah.

00:50:39.559 --> 00:50:43.079
So, if

00:50:40.519 --> 00:50:45.280
in the initial sentence itself, I have a

00:50:43.079 --> 00:50:48.039
mistake, so I just write it as the train

00:50:45.280 --> 00:50:49.840
slowly the station.

00:50:48.039 --> 00:50:52.079
So, which means my output is actually

00:50:49.840 --> 00:50:53.760
going to be wrong. Yes.

00:50:52.079 --> 00:50:55.559
Now, the transformers are since they're

00:50:53.760 --> 00:50:57.000
trained on lots of data,

00:50:55.559 --> 00:50:58.199
they will be quite robust to these

00:50:57.000 --> 00:51:00.239
things.

00:50:58.199 --> 00:51:02.839
But strictly arithmetically speaking

00:51:00.239 --> 00:51:05.439
correct, yes.

00:51:02.840 --> 00:51:06.720
Um okay. So, here's let's look at an

00:51:05.440 --> 00:51:08.800
example.

00:51:06.719 --> 00:51:09.359
Let's assume that

00:51:08.800 --> 00:51:11.360
um

00:51:09.360 --> 00:51:13.480
your standalone embeddings, right? This

00:51:11.360 --> 00:51:15.920
is your vocabulary, okay?

00:51:13.480 --> 00:51:17.400
Unknown, cat, mat, I, sit, love, the,

00:51:15.920 --> 00:51:18.960
you, on. That's it. That's our

00:51:17.400 --> 00:51:20.800
vocabulary.

00:51:18.960 --> 00:51:22.440
And for this vocabulary, we have these

00:51:20.800 --> 00:51:23.680
standalone embeddings.

00:51:22.440 --> 00:51:26.159
And just for argument, let's assume

00:51:23.679 --> 00:51:27.239
these embeddings are only two long.

00:51:26.159 --> 00:51:28.599
Okay? The dimension of these embeddings

00:51:27.239 --> 00:51:30.039
is two.

00:51:28.599 --> 00:51:31.880
If you recall the glove embeddings we

00:51:30.039 --> 00:51:33.159
used last week, I think they were what?

00:51:31.880 --> 00:51:34.400
100 long?

00:51:33.159 --> 00:51:35.799
And the ones we're using in the homework

00:51:34.400 --> 00:51:37.200
are even longer than that.

00:51:35.800 --> 00:51:39.120
Um but here we are assuming they're only

00:51:37.199 --> 00:51:42.799
two long, okay? So, the embedding for

00:51:39.119 --> 00:51:45.880
cat is 0.5, 7.1.

00:51:42.800 --> 00:51:47.320
All right. Now, let's assume that the we

00:51:45.880 --> 00:51:49.079
can have at most 10 words in any

00:51:47.320 --> 00:51:50.559
sentence that's coming in.

00:51:49.079 --> 00:51:52.360
And obviously, a particular word could

00:51:50.559 --> 00:51:53.639
be in position 0 all the way through

00:51:52.360 --> 00:51:56.240
position 9.

00:51:53.639 --> 00:51:57.719
And we will learn embeddings for each of

00:51:56.239 --> 00:51:59.759
these positions, and these embeddings

00:51:57.719 --> 00:52:03.239
are also two long.

00:51:59.760 --> 00:52:03.240
Two units long. Dimension two.

00:52:03.320 --> 00:52:06.480
Okay?

00:52:04.519 --> 00:52:07.880
Now, where will these embeddings come

00:52:06.480 --> 00:52:09.199
from?

00:52:07.880 --> 00:52:10.720
What's the answer to that question? What

00:52:09.199 --> 00:52:13.839
is the answer to the general question of

00:52:10.719 --> 00:52:13.839
where will these weights come from?

00:52:14.599 --> 00:52:17.759
We will learn it with backprop.

00:52:18.159 --> 00:52:21.599
Okay?

00:52:20.400 --> 00:52:23.240
We will start initially with random

00:52:21.599 --> 00:52:24.519
numbers and then we'll get them make

00:52:23.239 --> 00:52:26.599
them better and better

00:52:24.519 --> 00:52:28.280
as over the course of training.

00:52:26.599 --> 00:52:29.400
So, what we do is we have these two

00:52:28.280 --> 00:52:30.680
tables

00:52:29.400 --> 00:52:32.400
of embeddings.

00:52:30.679 --> 00:52:34.039
Um the standalone embedding for the word

00:52:32.400 --> 00:52:37.000
and the position embedding.

00:52:34.039 --> 00:52:39.239
And then, we literally add them up.

00:52:37.000 --> 00:52:41.599
So, for example, let's say the word the

00:52:39.239 --> 00:52:43.119
sentence that came in is cat sat mat.

00:52:41.599 --> 00:52:46.119
That's the sentence. It's got three

00:52:43.119 --> 00:52:49.119
words, cat sat mat. So, what we do is we

00:52:46.119 --> 00:52:51.119
say, well, the embedding for cat is this

00:52:49.119 --> 00:52:53.400
thing here, 0.5, 7.1.

00:52:51.119 --> 00:52:55.239
So, I write it here: 0.5, 7.1.

00:52:53.400 --> 00:52:56.240
Cat happens to be the zeroth position of

00:52:55.239 --> 00:52:58.119
the word.

00:52:56.239 --> 00:53:01.079
So, I grab the embedding for zero, which

00:52:58.119 --> 00:53:04.799
is 1.3, 3.9. I stick it there, and then

00:53:01.079 --> 00:53:07.159
I literally add them up: 0.5 + 1.3 = 1.8,

00:53:04.800 --> 00:53:10.880
and 7.1 + 3.9 = 11.0. That's it.

00:53:07.159 --> 00:53:15.159
So, now the positional encoded embedding

00:53:10.880 --> 00:53:17.880
for the word cat is 1.8, 11.0, not 0.5,

00:53:15.159 --> 00:53:17.879
7.1.

00:53:18.400 --> 00:53:22.400
So, if cat happens to show up in another

00:53:20.719 --> 00:53:25.199
part of the sentence, let's say instead

00:53:22.400 --> 00:53:28.119
of cat sat mat, we had

00:53:25.199 --> 00:53:29.839
mat sat cat.

00:53:28.119 --> 00:53:33.159
Now, cat is in the third position,

00:53:29.840 --> 00:53:34.680
right? Counting 0, 1, 2, that's position 2. Which means

00:53:33.159 --> 00:53:36.239
its embedding doesn't change. It's just

00:53:34.679 --> 00:53:38.159
the embedding for cat, but now instead

00:53:36.239 --> 00:53:40.519
of picking zero, we'll pick this one,

00:53:38.159 --> 00:53:43.079
0.6, 8.1, and put that here and add them

00:53:40.519 --> 00:53:43.079
up instead.

00:53:43.719 --> 00:53:46.959
So, this is the idea of the positional

00:53:45.840 --> 00:53:48.800
encoding.

00:53:46.960 --> 00:53:51.599
This is how we inject position knowledge

00:53:48.800 --> 00:53:51.600
into the transformer.

00:53:52.960 --> 00:53:55.000
Yes.

00:53:54.400 --> 00:53:56.280
Um

00:53:55.000 --> 00:53:58.159
the positional embedding would be

00:53:56.280 --> 00:54:00.000
different for each sentence, right? How

00:53:58.159 --> 00:54:01.799
do you No, this is just one table which

00:54:00.000 --> 00:54:04.159
tells you what the position is.

00:54:01.800 --> 00:54:06.200
So, the it says for a word that appears

00:54:04.159 --> 00:54:08.279
in the seventh position in any input

00:54:06.199 --> 00:54:09.599
sentence that you're feeding in,

00:54:08.280 --> 00:54:11.359
this is the embedding that you need to

00:54:09.599 --> 00:54:14.079
use

00:54:11.358 --> 00:54:14.079
for that position.

00:54:16.679 --> 00:54:21.639
If the word appears twice in the same

00:54:19.559 --> 00:54:23.920
sentence, how do you handle that?

00:54:21.639 --> 00:54:25.719
Great question. So, let's say just

00:54:23.920 --> 00:54:27.559
for argument that the

00:54:25.719 --> 00:54:29.480
sentence was cat cat cat.

00:54:27.559 --> 00:54:31.599
So,

00:54:29.480 --> 00:54:32.559
for each one of those cats in cat cat

00:54:31.599 --> 00:54:34.759
cat,

00:54:32.559 --> 00:54:36.519
this embedding will be the same,

00:54:34.760 --> 00:54:38.240
0.5, 7.1, because that happens to be

00:54:36.519 --> 00:54:39.519
just the embedding for cat regardless of

00:54:38.239 --> 00:54:42.159
position.

00:54:39.519 --> 00:54:45.599
But then,

00:54:42.159 --> 00:54:47.440
for the first cat, we will use 1.3, 3.9

00:54:45.599 --> 00:54:50.159
as the addition. For the second cat,

00:54:47.440 --> 00:54:51.679
we'll use 6.3, 3.7. The third cat will

00:54:50.159 --> 00:54:53.519
use 0.6, 8.1.

00:54:51.679 --> 00:54:55.000
So, only the things that are adding the

00:54:53.519 --> 00:54:57.119
position encoding will change, the

00:54:55.000 --> 00:54:58.280
positional embedding. So, the resulting

00:54:57.119 --> 00:54:59.679
sum is going to be different for each of

00:54:58.280 --> 00:55:02.560
these three words, even though they're

00:54:59.679 --> 00:55:02.559
exactly the same word.

00:55:05.760 --> 00:55:09.800
Is that position embedding table

00:55:07.800 --> 00:55:12.000
specific to the standalone embedding

00:55:09.800 --> 00:55:14.320
table? Like if you were to add or remove

00:55:12.000 --> 00:55:15.960
some words from the standalone table? It's

00:55:14.320 --> 00:55:18.000
independent.

00:55:15.960 --> 00:55:19.880
Independent. It only depends on your

00:55:18.000 --> 00:55:21.000
assumption about how long the sentences

00:55:19.880 --> 00:55:21.920
can be.

00:55:21.000 --> 00:55:23.400
That's it.

00:55:21.920 --> 00:55:24.840
It doesn't really care about what's what

00:55:23.400 --> 00:55:26.039
words are coming in. That's a whole

00:55:24.840 --> 00:55:27.400
different thing.

00:55:26.039 --> 00:55:28.719
So, these are two independent tables

00:55:27.400 --> 00:55:31.160
that just learned as part of this

00:55:28.719 --> 00:55:31.159
process.

00:55:31.639 --> 00:55:35.480
So, yeah, I have the same thing for sat

00:55:33.599 --> 00:55:39.079
and mat.

00:55:35.480 --> 00:55:39.079
Sat and mat, that's what we have.

00:55:39.519 --> 00:55:42.679
So, just make sure you understand these

00:55:40.519 --> 00:55:46.199
two slides to really like make sure the

00:55:42.679 --> 00:55:48.839
mechanics are clear. Yeah.

00:55:46.199 --> 00:55:50.839
How do you control for filler words? For

00:55:48.840 --> 00:55:53.920
example, if you're taking

00:55:50.840 --> 00:55:55.680
NLP output for transcription and you're

00:55:53.920 --> 00:55:56.639
trying to run a transformer and you have

00:55:55.679 --> 00:55:58.799
a lot of

00:55:56.639 --> 00:56:00.879
um's and likes that are

00:55:58.800 --> 00:56:03.000
disproportionately large and have these

00:56:00.880 --> 00:56:04.559
random assignments or

00:56:03.000 --> 00:56:07.039
really deep embeddings, are there other

00:56:04.559 --> 00:56:09.000
ways to look through the noise?

00:56:07.039 --> 00:56:10.440
Typically, what they do is um

00:56:09.000 --> 00:56:12.239
as we will we'll talk about this thing

00:56:10.440 --> 00:56:14.639
called byte pair encoding in which we

00:56:12.239 --> 00:56:16.599
take individual characters,

00:56:14.639 --> 00:56:18.879
fragments of words, and words into

00:56:16.599 --> 00:56:21.239
account as tokens. So, when you hear

00:56:18.880 --> 00:56:23.079
stuff like uh and so on, it gets mapped

00:56:21.239 --> 00:56:24.119
to these small tokens.

00:56:23.079 --> 00:56:26.799
Right? And then we treat them as just

00:56:24.119 --> 00:56:26.799
any other token.

00:56:28.840 --> 00:56:33.480
Um yeah, is the aggregation just a simple

00:56:31.119 --> 00:56:36.039
sum here? And might the actual

00:56:33.480 --> 00:56:37.840
semantic meaning of the standalone word

00:56:36.039 --> 00:56:40.400
not be more important than its

00:56:37.840 --> 00:56:42.200
relative position in the sentence?

00:56:40.400 --> 00:56:43.400
It could be. We just don't know a priori

00:56:42.199 --> 00:56:45.399
whether it's going to be important or

00:56:43.400 --> 00:56:46.960
not for any particular sentence.

00:56:45.400 --> 00:56:48.880
We when we train the transformer with a

00:56:46.960 --> 00:56:50.358
lot of textual data,

00:56:48.880 --> 00:56:51.880
right? It'll just figure out the right

00:56:50.358 --> 00:56:53.719
values for these things so that on

00:56:51.880 --> 00:56:55.280
average, the accuracy is as high as

00:56:53.719 --> 00:56:56.879
possible.

00:56:55.280 --> 00:56:58.120
So, in many of these things, there's

00:56:56.880 --> 00:57:00.480
always a tension between our human

00:56:58.119 --> 00:57:01.559
intuition as to how it should work and

00:57:00.480 --> 00:57:02.960
whether you should just throw it into

00:57:01.559 --> 00:57:04.079
the meat grinder of backprop and see

00:57:02.960 --> 00:57:05.280
what happens.

00:57:04.079 --> 00:57:06.400
And so, here it does it turns out you

00:57:05.280 --> 00:57:08.840
can just throw it into backprop, it'll

00:57:06.400 --> 00:57:10.920
actually do a pretty good job.

00:57:08.840 --> 00:57:13.000
Uh yeah.

00:57:10.920 --> 00:57:15.960
For the positional encoding, we would

00:57:13.000 --> 00:57:18.199
just be using the sum vector, i.e. we

00:57:15.960 --> 00:57:20.720
would be using like this 2 by 3 matrix

00:57:18.199 --> 00:57:21.719
that you have there, right?

00:57:20.719 --> 00:57:23.559
Uh oh yeah, this is just for

00:57:21.719 --> 00:57:24.679
demonstration. Basically, this is the

00:57:23.559 --> 00:57:26.279
thing that will actually go into the

00:57:24.679 --> 00:57:28.358
transformer. Correct.

00:57:26.280 --> 00:57:28.359
Yeah.

00:57:28.559 --> 00:57:31.679
That was just me being overly verbose in

00:57:30.079 --> 00:57:33.199
the slides.

00:57:31.679 --> 00:57:35.239
Uh yeah.

00:57:33.199 --> 00:57:36.919
I can see sentences in the input. At

00:57:35.239 --> 00:57:38.279
this point, are we still parsing out

00:57:36.920 --> 00:57:40.039
punctuation or if we have like a

00:57:38.280 --> 00:57:41.760
multi-sentence input, is there a

00:57:40.039 --> 00:57:44.119
positional embedding vector for each of

00:57:41.760 --> 00:57:47.120
the sentences? Yeah, so here um

00:57:44.119 --> 00:57:48.799
basically, the starting point is tokens.

00:57:47.119 --> 00:57:50.239
Right? And in our example, because we're

00:57:48.800 --> 00:57:51.760
working with the idea of simple

00:57:50.239 --> 00:57:53.039
standardization and stripping and things

00:57:51.760 --> 00:57:54.000
like that, I'm just showing actual

00:57:53.039 --> 00:57:56.000
words.

00:57:54.000 --> 00:57:58.199
If you go to something like GPT-4, since

00:57:56.000 --> 00:58:01.159
it uses a different tokenization scheme,

00:57:58.199 --> 00:58:02.319
uh each token might be part of a word.

00:58:01.159 --> 00:58:03.559
It might be it might be an individual

00:58:02.320 --> 00:58:06.240
character, it might be a punctuation

00:58:03.559 --> 00:58:08.440
mark, it could be in fact um the GPT

00:58:06.239 --> 00:58:10.439
family doesn't strip out punctuation.

00:58:08.440 --> 00:58:12.480
Which is why when you ask a question, it

00:58:10.440 --> 00:58:13.920
comes back with intact punctuation in

00:58:12.480 --> 00:58:15.840
its response.

00:58:13.920 --> 00:58:17.400
Uh and so, we'll get we'll revisit this

00:58:15.840 --> 00:58:19.760
when you look at BPE, byte pair encoding

00:58:17.400 --> 00:58:19.760
later on.

00:58:19.840 --> 00:58:22.800
But the key thing to remember is that

00:58:21.119 --> 00:58:24.679
all the stuff we're talking about starts

00:58:22.800 --> 00:58:26.560
from the notion of a token.

00:58:24.679 --> 00:58:28.559
As to how you define a token given a

00:58:26.559 --> 00:58:30.719
bunch of text, that's the tokenizer's

00:58:28.559 --> 00:58:33.519
job. And we just assumed a simple

00:58:30.719 --> 00:58:36.759
tokenizer for the time being.

00:58:33.519 --> 00:58:38.960
Okay? So, at this point, folks, we have

00:58:36.760 --> 00:58:40.680
satisfied all the requirements.

00:58:38.960 --> 00:58:42.480
Uh we have taken the surrounding context

00:58:40.679 --> 00:58:43.839
of each word, we have taken the order,

00:58:42.480 --> 00:58:45.480
and so on and so forth, because what's

00:58:43.840 --> 00:58:47.519
coming in here is the positional

00:58:45.480 --> 00:58:49.639
embeddings. Okay? And it runs through

00:58:47.519 --> 00:58:51.440
the whole transformer stack.

00:58:49.639 --> 00:58:54.799
So,

00:58:51.440 --> 00:58:55.920
this is called a transformer encoder.

00:58:54.800 --> 00:58:57.840
Okay?

00:58:55.920 --> 00:58:59.039
This is the transformer encoder.

00:58:57.840 --> 00:59:01.039
And you can see here, this is the

00:58:59.039 --> 00:59:03.239
original picture from the paper.

00:59:01.039 --> 00:59:04.719
It's an iconic picture at this point.

00:59:03.239 --> 00:59:06.239
So, it says here this is these are the

00:59:04.719 --> 00:59:07.599
input This is like the cat sat on the

00:59:06.239 --> 00:59:09.519
mat.

00:59:07.599 --> 00:59:11.400
It comes in here, gets

00:59:09.519 --> 00:59:12.679
transformed into embeddings, standalone

00:59:11.400 --> 00:59:14.639
embeddings.

00:59:12.679 --> 00:59:17.319
And then, based on the position of each

00:59:14.639 --> 00:59:20.679
word, we add that's why you see a plus

00:59:17.320 --> 00:59:22.120
sign here, we add the positional

00:59:20.679 --> 00:59:24.358
embedding to that.

00:59:22.119 --> 00:59:26.799
And the resulting thing goes into this

00:59:24.358 --> 00:59:30.599
transformer block. And here,

00:59:26.800 --> 00:59:30.600
we go through multi-head attention.

00:59:30.800 --> 00:59:34.480
And things come out the other end.

00:59:32.800 --> 00:59:36.160
Then there is this thing called add and

00:59:34.480 --> 00:59:37.440
norm, which we'll visit we'll revisit on

00:59:36.159 --> 00:59:38.759
Wednesday.

00:59:37.440 --> 00:59:40.800
And then it goes through a feed forward

00:59:38.760 --> 00:59:42.480
network, another add and norm, which

00:59:40.800 --> 00:59:43.640
we'll revisit on Wednesday.

00:59:42.480 --> 00:59:46.360
And then it comes out the other end.

00:59:43.639 --> 00:59:47.519
That's it. That's a transformer encoder.

00:59:46.360 --> 00:59:48.360
Okay?

00:59:47.519 --> 00:59:51.759
Um

00:59:48.360 --> 00:59:51.760
and so if you look at this

00:59:52.320 --> 00:59:55.160
just to point out a couple of things,

00:59:53.719 --> 00:59:56.359
the input embeddings can be random

00:59:55.159 --> 00:59:57.519
weights or it could be pre-trained

00:59:56.360 --> 00:59:58.440
embeddings.

00:59:57.519 --> 01:00:00.119
Um

00:59:58.440 --> 01:00:01.000
we add in a position-dependent embedding

01:00:00.119 --> 01:00:02.799
to represent the position of each word

01:00:01.000 --> 01:00:04.000
in the sentence. That's the plus.

01:00:02.800 --> 01:00:05.800
Then we pass it through multi-headed

01:00:04.000 --> 01:00:07.199
attention to get a contextual uh

01:00:05.800 --> 01:00:09.000
representation.

01:00:07.199 --> 01:00:10.639
Then we finally we pass all this through

01:00:09.000 --> 01:00:12.480
a simple

01:00:10.639 --> 01:00:13.879
typically it's a two-layer network:

01:00:12.480 --> 01:00:16.039
one hidden layer with ReLUs and then a

01:00:13.880 --> 01:00:20.079
linear layer after that and boom. Uh and

01:00:16.039 --> 01:00:21.840
then that's it. This is the encoder. And

01:00:20.079 --> 01:00:23.799
here is the perhaps the most important

01:00:21.840 --> 01:00:25.600
point to keep in mind.

01:00:23.800 --> 01:00:26.840
Because we have taken inordinate care to

01:00:25.599 --> 01:00:28.159
make sure that the things that are

01:00:26.840 --> 01:00:30.200
coming in and the things that are going

01:00:28.159 --> 01:00:32.159
out have the same size

01:00:30.199 --> 01:00:34.199
both in terms of the number of tokens as

01:00:32.159 --> 01:00:37.319
well as the length of each vector.

01:00:34.199 --> 01:00:39.079
We can then stack them up like pancakes.

01:00:37.320 --> 01:00:41.480
We can have lots of transformers stacked

01:00:39.079 --> 01:00:43.679
one on top of each other.

01:00:41.480 --> 01:00:45.679
Right? Because it's the perfect API.

01:00:43.679 --> 01:00:47.879
It's the simplest possible API. The same

01:00:45.679 --> 01:00:49.639
thing comes in, same thing goes out.

01:00:47.880 --> 01:00:51.200
In terms of size. So you can have a

01:00:49.639 --> 01:00:53.239
transformer encoder, another one top,

01:00:51.199 --> 01:00:55.799
boom, boom, boom, boom, boom, one after

01:00:53.239 --> 01:00:58.239
the other. GPT-3 has 96 transformer

01:00:55.800 --> 01:00:58.240
blocks in its stack.

01:00:58.719 --> 01:01:02.919
And like in all things deep learning

01:01:00.440 --> 01:01:04.360
related, the more layers you have, the

01:01:02.920 --> 01:01:05.400
more complicated things you can do with

01:01:04.360 --> 01:01:06.760
it.

01:01:05.400 --> 01:01:10.559
As long as you have enough data to keep

01:01:06.760 --> 01:01:10.560
the model happy so it doesn't overfit.

01:01:11.760 --> 01:01:15.920
Okay?

01:01:13.400 --> 01:01:17.920
All right. So, what we haven't covered,

01:01:15.920 --> 01:01:20.079
which we'll cover on Wednesday

01:01:17.920 --> 01:01:22.400
is the question that

01:01:20.079 --> 01:01:23.440
he had posed about how

01:01:22.400 --> 01:01:24.680
since there are no

01:01:23.440 --> 01:01:26.760
parameters inside the self-attention

01:01:24.679 --> 01:01:27.879
block, what are we actually learning?

01:01:26.760 --> 01:01:29.120
And then there is these things called

01:01:27.880 --> 01:01:31.000
residual connections and layer

01:01:29.119 --> 01:01:32.400
normalization. We'll talk about all

01:01:31.000 --> 01:01:35.159
those things on Wednesday. Those are all

01:01:32.400 --> 01:01:38.559
like, you know, refinements to the idea.

01:01:35.159 --> 01:01:39.719
So, all right, 9:39. Um let's apply the

01:01:38.559 --> 01:01:40.920
transformer encoder to an actual

01:01:39.719 --> 01:01:43.319
problem.

01:01:40.920 --> 01:01:45.119
Any questions?

01:01:43.320 --> 01:01:46.760
Uh yeah.

01:01:45.119 --> 01:01:48.839
My question is regarding what you said, that

01:01:46.760 --> 01:01:50.400
you could have multiple transformers.

01:01:48.840 --> 01:01:53.200
What is the difference between having

01:01:50.400 --> 01:01:54.840
multiple self-attention heads

01:01:53.199 --> 01:01:57.519
rather than having multiple blocks? When I

01:01:54.840 --> 01:01:59.400
say a transformer block, within the block

01:01:57.519 --> 01:02:01.599
there could be multiple heads. So, if

01:01:59.400 --> 01:02:04.680
the accuracy is the same, why

01:02:01.599 --> 01:02:06.039
would you use one rather than the other?

01:02:04.679 --> 01:02:08.199
Yeah, you can have a lot of attention

01:02:06.039 --> 01:02:10.559
heads. And that's totally fine. And

01:02:08.199 --> 01:02:12.079
typically I forget how many GPT-3 and 4

01:02:10.559 --> 01:02:13.799
have. They have a whole bunch of them.

01:02:12.079 --> 01:02:15.360
So you can go wide and you

01:02:13.800 --> 01:02:18.320
can go deep.

01:02:15.360 --> 01:02:19.599
Both are done in practice.

01:02:18.320 --> 01:02:20.559
But the thing is,

01:02:19.599 --> 01:02:22.119
the one thing you have to remember is

01:02:20.559 --> 01:02:24.480
that if you go wide, with a

01:02:22.119 --> 01:02:26.239
lot of attention heads, then given the

01:02:24.480 --> 01:02:28.440
particular input that's coming into that

01:02:26.239 --> 01:02:29.439
block, each head will learn different patterns

01:02:28.440 --> 01:02:31.039
from it.

01:02:29.440 --> 01:02:32.440
While if you stack them all up, it's

01:02:31.039 --> 01:02:33.800
going to learn different ways to

01:02:32.440 --> 01:02:35.200
contextualize the things that are coming

01:02:33.800 --> 01:02:36.760
in. It operates at higher levels of

01:02:35.199 --> 01:02:38.279
abstraction. So the analogy would be

01:02:36.760 --> 01:02:40.520
that like the seventh layer of a

01:02:38.280 --> 01:02:42.640
convolutional net may take the sixth

01:02:40.519 --> 01:02:44.960
layer's output and say, "Oh, I'm seeing

01:02:42.639 --> 01:02:46.839
a lot of edges here. I'm going to take

01:02:44.960 --> 01:02:48.519
an edge like this, two circles like that

01:02:46.840 --> 01:02:49.480
and call it a face."

01:02:48.519 --> 01:02:52.000
So it'll operate at a higher level of

01:02:49.480 --> 01:02:52.000
abstraction.

01:02:52.400 --> 01:02:55.440
Okay.

01:02:53.360 --> 01:02:55.440
Um

01:02:58.320 --> 01:03:02.840
All right, let's go to the Colab.

01:03:01.800 --> 01:03:04.080
So what we're going to do is we're going

01:03:02.840 --> 01:03:05.360
to take the transformer that we just

01:03:04.079 --> 01:03:07.599
learned about and we're going to apply

01:03:05.360 --> 01:03:09.320
it to solve the travel slot-filling

01:03:07.599 --> 01:03:12.079
problem. Okay?

01:03:09.320 --> 01:03:14.320
Uh all right. So

01:03:12.079 --> 01:03:16.199
Okay, so we'll start with the usual

01:03:14.320 --> 01:03:18.600
preliminaries.

01:03:16.199 --> 01:03:20.319
And then we have taken the ATIS data set

01:03:18.599 --> 01:03:23.960
I talked about and we have stuck them in

01:03:20.320 --> 01:03:26.480
raw box for easy consumption.

01:03:23.960 --> 01:03:26.480
It's here.

01:03:29.880 --> 01:03:33.400
Okay.

01:03:30.800 --> 01:03:35.160
So if you look at the top few rows,

01:03:33.400 --> 01:03:37.960
you can see here, for example, I want to

01:03:35.159 --> 01:03:39.599
fly from Boston at 8:30 a.m. And then this

01:03:37.960 --> 01:03:42.880
is the output. The slot filling is the

01:03:39.599 --> 01:03:43.880
output. Um and as it turns out, there

01:03:42.880 --> 01:03:46.000
is one more thing here:

01:03:43.880 --> 01:03:47.358
these people also gave each query another label.

01:03:46.000 --> 01:03:49.440
They took the whole query and gave it an

01:03:47.358 --> 01:03:51.199
intent, as in, is it a flight query,

01:03:49.440 --> 01:03:52.480
a something-else query, and so on,

01:03:51.199 --> 01:03:54.559
which we're not going to use. Are you

01:03:52.480 --> 01:03:56.599
kidding me?

01:03:54.559 --> 01:03:57.519
I want to fly from Boston at 8:30 a.m.

01:03:56.599 --> 01:03:59.239
and arrive in Denver at 11:00 in the

01:03:57.519 --> 01:04:01.239
morning. What kind of ground

01:03:59.239 --> 01:04:03.759
transportations are available in Denver?

01:04:01.239 --> 01:04:06.079
What's the airport at Orlando?

01:04:03.760 --> 01:04:08.480
Um how much does the limo service cost

01:04:06.079 --> 01:04:09.799
within Pittsburgh? Okay.

01:04:08.480 --> 01:04:11.480
And so on and so forth. So

01:04:09.800 --> 01:04:13.760
you get the idea. It's a very wide range

01:04:11.480 --> 01:04:16.440
of queries that are in this data set.

01:04:13.760 --> 01:04:18.960
Um okay. So let's just ignore that for a

01:04:16.440 --> 01:04:22.240
sec. Um okay. So what we're now going to

01:04:18.960 --> 01:04:24.960
do is we are going to take only

01:04:22.239 --> 01:04:27.799
um this column, right? The query column.

01:04:24.960 --> 01:04:29.559
That's going to be our input text. Okay?

01:04:27.800 --> 01:04:31.359
And then the slot filling column is

01:04:29.559 --> 01:04:32.599
going to be our dependent variable, the

01:04:31.358 --> 01:04:34.880
output.

01:04:32.599 --> 01:04:37.440
So we'll just gather them all up

01:04:34.880 --> 01:04:38.840
uh here.

01:04:37.440 --> 01:04:40.599
Let it run. We'll do it for the training

01:04:38.840 --> 01:04:42.559
data and the test data.

01:04:40.599 --> 01:04:45.759
And so what we have done is that we have

01:04:42.559 --> 01:04:47.840
taken um the transformer related code in

01:04:45.760 --> 01:04:49.480
Keras and we have packaged it into a

01:04:47.840 --> 01:04:50.640
little hardel library for easy

01:04:49.480 --> 01:04:53.240
consumption.

01:04:50.639 --> 01:04:55.279
Um and so that thing is here. You can

01:04:53.239 --> 01:04:56.719
download it.

01:04:55.280 --> 01:04:57.680
Calling it a library is like overstating

01:04:56.719 --> 01:04:59.679
it. We literally just collected a bunch

01:04:57.679 --> 01:05:00.719
of code and stuck it in a file. Okay?

01:04:59.679 --> 01:05:02.039
So

01:05:00.719 --> 01:05:03.639
and so what we'll do is from hardel

01:05:02.039 --> 01:05:04.960
we'll import the transformer

01:05:03.639 --> 01:05:06.679
encoder.

01:05:04.960 --> 01:05:08.039
And we'll import this positional

01:05:06.679 --> 01:05:09.239
embedding layer.

01:05:08.039 --> 01:05:11.039
Because what we're going to do is we are

01:05:09.239 --> 01:05:12.519
going to take the input do the

01:05:11.039 --> 01:05:14.199
positional encoding business and then

01:05:12.519 --> 01:05:15.400
send it into the transformer.

01:05:14.199 --> 01:05:18.559
Okay?

01:05:15.400 --> 01:05:21.119
Um so but first let's vectorize the

01:05:18.559 --> 01:05:24.920
input uh queries that are coming in.

01:05:21.119 --> 01:05:26.559
So we'll define a thing here.

01:05:24.920 --> 01:05:28.440
We'll use this... uh,

01:05:26.559 --> 01:05:30.320
max query length is not defined. That's

01:05:28.440 --> 01:05:32.079
what happens when you

01:05:30.320 --> 01:05:34.480
don't run everything.

01:05:32.079 --> 01:05:34.480
All right.

01:05:38.599 --> 01:05:44.839
Okay. So now we have this thing here. So

01:05:41.719 --> 01:05:47.319
turns out that there are 8,888 tokens,

01:05:44.840 --> 01:05:49.320
right? 8,888 words in the input queries

01:05:47.320 --> 01:05:52.359
that we have in the data. So let's

01:05:49.320 --> 01:05:54.200
take a look at the first few.

01:05:52.358 --> 01:05:56.799
And you can see here, you know, there is

01:05:54.199 --> 01:05:58.759
unk. Uh and because the output mode here

01:05:56.800 --> 01:06:00.280
is int, you just want integers to come out,

01:05:58.760 --> 01:06:01.000
not multi-hot encoding or anything

01:06:00.280 --> 01:06:02.600
because we're going to take these

01:06:01.000 --> 01:06:04.920
integers and then do embeddings from

01:06:02.599 --> 01:06:07.880
them. So it'll

01:06:04.920 --> 01:06:10.280
reserve this empty string as the pad

01:06:07.880 --> 01:06:11.119
token. This should be familiar from last

01:06:10.280 --> 01:06:13.200
week.

01:06:11.119 --> 01:06:14.679
And then the unk for unknown tokens and

01:06:13.199 --> 01:06:17.039
then 'to', 'from', 'flights'; these are all some

01:06:14.679 --> 01:06:18.559
of the most frequent. Um turns out

01:06:17.039 --> 01:06:20.119
Boston is actually the most frequent. I

01:06:18.559 --> 01:06:22.358
don't know what's up with that.

01:06:20.119 --> 01:06:24.279
It is what it is. Then we'll do the same

01:06:22.358 --> 01:06:25.319
vectorization to the train and test data

01:06:24.280 --> 01:06:28.160
sets.
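Roughly, that input vectorization step looks like this. This is a sketch: train_queries is a hypothetical name standing in for the list of raw query strings, and the exact arguments in the notebook may differ.

```python
from tensorflow.keras.layers import TextVectorization

max_query_length = 30   # every query is padded or truncated to this many tokens

# output_mode="int": each word becomes a vocabulary index, which we will later
# turn into an embedding; index 0 is reserved for the '' pad token, index 1 for [UNK].
query_vectorizer = TextVectorization(
    output_mode="int",
    output_sequence_length=max_query_length,
)
query_vectorizer.adapt(train_queries)        # train_queries: the raw query strings (hypothetical name)

vocab = query_vectorizer.get_vocabulary()
print(len(vocab))                            # 8,888 tokens for this data set
print(vocab[:10])                            # '', '[UNK]', then the most frequent words
```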

01:06:25.320 --> 01:06:30.480
Now we need to do the same for the output

01:06:28.159 --> 01:06:31.799
side of the problem because the slots

01:06:30.480 --> 01:06:33.800
the dependent variable here,

01:06:31.800 --> 01:06:36.519
remember, are all sentences as well with

01:06:33.800 --> 01:06:38.200
the B, O, things like that, right? So we

01:06:36.519 --> 01:06:40.840
need to vectorize those.

01:06:38.199 --> 01:06:42.039
So we need to do the same on them.

01:06:40.840 --> 01:06:43.280
So let's take a look at some of these

01:06:42.039 --> 01:06:44.519
slots.

01:06:43.280 --> 01:06:45.800
And you can see here all this stuff is

01:06:44.519 --> 01:06:48.280
going on.

01:06:45.800 --> 01:06:49.760
Now here is an example where you

01:06:48.280 --> 01:06:51.440
have to be very careful when you do the

01:06:49.760 --> 01:06:52.800
standardization.

01:06:51.440 --> 01:06:54.440
Typically in standardization you will

01:06:52.800 --> 01:06:56.120
remove punctuation and you know, do

01:06:54.440 --> 01:06:57.358
things like that and lowercase, right?

01:06:56.119 --> 01:07:00.400
But here

01:06:57.358 --> 01:07:01.559
these things have a specific meaning.

01:07:00.400 --> 01:07:03.400
We can't just go in there and remove the

01:07:01.559 --> 01:07:04.880
period and the underscore and then

01:07:03.400 --> 01:07:06.559
make the B into lowercase B and stuff

01:07:04.880 --> 01:07:07.880
like that. That'll just harm it.

01:07:06.559 --> 01:07:10.239
Right? We need to be able to preserve

01:07:07.880 --> 01:07:12.559
the nomenclature of the output in terms

01:07:10.239 --> 01:07:13.639
of all those tags. So

01:07:12.559 --> 01:07:15.119
um so we don't want the standardization

01:07:13.639 --> 01:07:17.000
to strip all those out. So what we do is we

01:07:15.119 --> 01:07:18.358
say standardize=None.

01:07:17.000 --> 01:07:20.039
Look at that.

01:07:18.358 --> 01:07:22.319
We tell Keras do not standardize this.

01:07:20.039 --> 01:07:23.239
Do not do your usual thing.

01:07:22.320 --> 01:07:25.280
Okay?
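In code, continuing the sketch above, the only difference for the slot labels is turning that default behavior off; train_slots is a hypothetical name for the slot-tag strings.

```python
slot_vectorizer = TextVectorization(
    standardize=None,                        # keep periods, underscores, and uppercase B/I/O intact
    output_mode="int",
    output_sequence_length=max_query_length,
)
slot_vectorizer.adapt(train_slots)           # train_slots: the slot-tag strings (hypothetical name)
print(len(slot_vectorizer.get_vocabulary())) # 125 = 123 slot tags + pad + [UNK]
```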

01:07:23.239 --> 01:07:26.919
Um so

01:07:25.280 --> 01:07:29.080
we do that

01:07:26.920 --> 01:07:30.960
for the output side. And then let's look

01:07:29.079 --> 01:07:33.358
at the vocabulary.

01:07:30.960 --> 01:07:34.440
Yeah, so this sounds pretty good.

01:07:33.358 --> 01:07:35.880
These are all the things that we would

01:07:34.440 --> 01:07:37.599
expect to see.

01:07:35.880 --> 01:07:39.800
These are the distinct tokens in the

01:07:37.599 --> 01:07:42.759
output strings.

01:07:39.800 --> 01:07:42.760
Um all right.

01:07:43.320 --> 01:07:48.359
Okay, we get it.

01:07:45.880 --> 01:07:50.400
So we have 125 of them. In the

01:07:48.358 --> 01:07:54.279
lecture I said there are 123 slots,

01:07:50.400 --> 01:07:57.240
possible slots. Why is it 125 here?

01:07:54.280 --> 01:07:59.519
Yes, unk and pad. Correct.

01:07:57.239 --> 01:08:02.279
Um okay. Now we'll set up a transformer

01:07:59.519 --> 01:08:05.119
encoder, right? Oh, wait, wait,

01:08:02.280 --> 01:08:07.280
wait. I forgot about um doing this. My

01:08:05.119 --> 01:08:09.519
bad. Um

01:08:07.280 --> 01:08:09.519
All right.

01:08:11.519 --> 01:08:15.639
I just thought when I saw the slide that

01:08:12.880 --> 01:08:16.560
we should go to the Colab

01:08:15.639 --> 01:08:18.880
without giving you a bit more

01:08:16.560 --> 01:08:20.240
background. No problem. So

01:08:18.880 --> 01:08:21.119
So

01:08:20.239 --> 01:08:22.318
the way we're going to model this

01:08:21.119 --> 01:08:23.479
problem is that we're going to have

01:08:22.319 --> 01:08:24.839
something like this, right? Fly from

01:08:23.479 --> 01:08:26.239
Boston to Denver.

01:08:24.838 --> 01:08:28.600
That's the input that's coming in and

01:08:26.239 --> 01:08:31.439
that is the correct answer.

01:08:28.600 --> 01:08:32.798
O O, some B-something-or-others, I mean O,

01:08:31.439 --> 01:08:34.479
and then something else, right? That's

01:08:32.798 --> 01:08:36.399
the correct answer. That's

01:08:34.479 --> 01:08:38.718
the input and that is the right answer.

01:08:36.399 --> 01:08:40.559
So what we'll do is we will

01:08:38.719 --> 01:08:42.640
create these positional input embeddings

01:08:40.560 --> 01:08:45.359
like we have discussed before.

01:08:42.640 --> 01:08:47.719
We will run it through a transformer.

01:08:45.359 --> 01:08:49.120
It gives us contextual embeddings.

01:08:47.719 --> 01:08:50.680
So if we send five in, it's going to

01:08:49.119 --> 01:08:51.960
send us five out except the color is now

01:08:50.680 --> 01:08:54.319
blue.

01:08:51.960 --> 01:08:57.520
Right? And then what we do is

01:08:54.319 --> 01:08:59.400
we will run it through a ReLU.

01:08:57.520 --> 01:09:01.080
Okay, we'll run it through a ReLU.

01:08:59.399 --> 01:09:02.639
We will still have

01:09:01.079 --> 01:09:04.039
you know, five vectors here, five

01:09:02.640 --> 01:09:05.920
vectors will come in.

01:09:04.039 --> 01:09:07.960
And then for each of the things that

01:09:05.920 --> 01:09:10.759
comes in, we will stick a 123-way

01:09:07.960 --> 01:09:10.759
softmax.

01:09:11.838 --> 01:09:15.838
Okay, for each thing that comes out

01:09:13.279 --> 01:09:16.838
we'll have a 123-way softmax and that's

01:09:15.838 --> 01:09:19.239
the classification problem we're going

01:09:16.838 --> 01:09:19.239
to solve.
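In Keras terms, a Dense layer applied to a sequence of vectors acts on each position independently, which is exactly the "one softmax per token" idea; a tiny sketch with illustrative sizes follows.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Say five contextual vectors of size 512 come out of the transformer.
contextual = keras.Input(shape=(5, 512))
hidden = layers.Dense(64, activation="relu")(contextual)     # the ReLU step; 64 units is illustrative
tag_probs = layers.Dense(123, activation="softmax")(hidden)  # one 123-way softmax per token position
head = keras.Model(contextual, tag_probs)                    # output shape: (batch, 5, 123)
```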

01:09:20.439 --> 01:09:23.639
Okay?

01:09:21.719 --> 01:09:25.759
So

01:09:23.640 --> 01:09:28.280
the weights in all these layers will get

01:09:25.759 --> 01:09:29.279
optimized by backprop.

01:09:28.279 --> 01:09:30.798
All these weights are going to get

01:09:29.279 --> 01:09:33.200
optimized.

01:09:30.798 --> 01:09:33.199
Uh yeah.

01:09:34.119 --> 01:09:36.399
Sorry?

01:09:40.798 --> 01:09:44.798
Oh no, that's a layer. The weights

01:09:43.680 --> 01:09:46.920
in the layer will still need to be

01:09:44.798 --> 01:09:48.159
learned.

01:09:46.920 --> 01:09:50.199
It's sort of like the text vectorization

01:09:48.159 --> 01:09:51.880
layer is a bunch of code and then you

01:09:50.199 --> 01:09:53.439
actually run it on a particular corpus

01:09:51.880 --> 01:09:54.480
to adapt it and build its vocabulary out

01:09:53.439 --> 01:09:55.679
of it.

01:09:54.479 --> 01:09:57.879
So, it's like an empty shell that needs

01:09:55.680 --> 01:09:59.320
to get populated.

01:09:57.880 --> 01:10:00.680
Okay, so the weights and all these

01:09:59.319 --> 01:10:02.239
things are going to get updated when we

01:10:00.680 --> 01:10:03.600
train the model

01:10:02.239 --> 01:10:06.399
by backprop.

01:10:03.600 --> 01:10:07.600
Uh and that's it. That's the setup.

01:10:06.399 --> 01:10:09.639
Does this make sense before I switch

01:10:07.600 --> 01:10:11.560
back to the collab?

01:10:09.640 --> 01:10:14.320
In particular, does this make sense?

01:10:11.560 --> 01:10:14.320
This part of it.

01:10:15.920 --> 01:10:18.440
Bunch of things come out and then for

01:10:17.319 --> 01:10:20.439
each one of those things we need to

01:10:18.439 --> 01:10:22.119
figure out a classification of a 123-way

01:10:20.439 --> 01:10:23.479
classification. And that's where we

01:10:22.119 --> 01:10:25.319
stick a softmax on every one of those

01:10:23.479 --> 01:10:27.599
output nodes.

01:10:25.319 --> 01:10:27.599
Yeah.

01:10:32.800 --> 01:10:35.440
Oh oh, I see.

01:10:36.000 --> 01:10:38.439
Yeah, so

01:10:40.239 --> 01:10:43.279
It could be whatever or to put it

01:10:41.560 --> 01:10:45.600
another way, it is your choice as the

01:10:43.279 --> 01:10:47.880
user as the modeler. Correct? The thing

01:10:45.600 --> 01:10:49.400
is at this point with the blue stuff the

01:10:47.880 --> 01:10:51.359
transformer is basically saying, my job

01:10:49.399 --> 01:10:52.639
is done.

01:10:51.359 --> 01:10:54.639
It has given you these valuable

01:10:52.640 --> 01:10:56.720
contextual embeddings at some high-level

01:10:54.640 --> 01:10:58.480
abstraction. What you do with it depends

01:10:56.720 --> 01:11:00.680
on your particular problem. And so

01:10:58.479 --> 01:11:01.959
the best practice would be to take it

01:11:00.680 --> 01:11:03.280
and then maybe, you know, if these

01:11:01.960 --> 01:11:04.279
embeddings are really

01:11:03.279 --> 01:11:07.159
long, maybe you make them a little

01:11:04.279 --> 01:11:09.079
smaller, right? Using a ReLU. And using

01:11:07.159 --> 01:11:10.239
a ReLU is always a good idea because

01:11:09.079 --> 01:11:11.640
when in doubt, throw in a bit of

01:11:10.239 --> 01:11:13.519
non-linearity.

01:11:11.640 --> 01:11:15.440
Right? Uh and then once you're done with

01:11:13.520 --> 01:11:17.040
that, well, at this point you need to

01:11:15.439 --> 01:11:20.079
actually classify it. So, you stick an

01:11:17.039 --> 01:11:20.079
output softmax on it.

01:11:20.560 --> 01:11:24.120
Okay. So, that's what we have.

01:11:24.680 --> 01:11:26.960
Um

01:11:27.680 --> 01:11:32.119
All right, back to this picture.

01:11:29.640 --> 01:11:34.280
So, what we're going to do is we

01:11:32.119 --> 01:11:36.119
we also get to decide how long these

01:11:34.279 --> 01:11:37.199
embedding vectors are, because here

01:11:36.119 --> 01:11:37.920
we're not going to use GloVe embeddings.

01:11:37.199 --> 01:11:39.800
We're just going to learn everything

01:11:37.920 --> 01:11:40.800
from scratch.

01:11:39.800 --> 01:11:42.880
Right? We're going to learn everything

01:11:40.800 --> 01:11:45.360
from scratch. So we can decide how

01:11:42.880 --> 01:11:46.440
long these embedding vectors are. So, um

01:11:45.359 --> 01:11:47.519
these embedding vectors I'm going to

01:11:46.439 --> 01:11:49.359
decide

01:11:47.520 --> 01:11:52.880
uh I have decided that I want them to be

01:11:49.359 --> 01:11:54.839
512 long, right? I want these actually

01:11:52.880 --> 01:11:57.000
to be 512 long. So, that's what I have

01:11:54.840 --> 01:11:58.880
here, 512.

01:11:57.000 --> 01:12:00.000
And then inside the transformer,

01:11:58.880 --> 01:12:01.239
remember

01:12:00.000 --> 01:12:02.920
when we

01:12:01.239 --> 01:12:04.679
concatenate everything and then we have

01:12:02.920 --> 01:12:07.600
something, we run it through a final

01:12:04.680 --> 01:12:08.960
ReLU layer, how big should that layer

01:12:07.600 --> 01:12:11.079
be?

01:12:08.960 --> 01:12:13.279
That's what I mean here by dense

01:12:11.079 --> 01:12:15.039
dim. I want it to be 64.

01:12:13.279 --> 01:12:17.519
And then I, you know, for fun I'm going

01:12:15.039 --> 01:12:20.399
to use five attention heads.

01:12:17.520 --> 01:12:20.400
Because why not?

01:12:20.439 --> 01:12:27.399
Okay. And then in the final thing here

01:12:24.319 --> 01:12:29.199
to go to Ali's question here these

01:12:27.399 --> 01:12:32.079
things are all 512 long as I mentioned

01:12:29.199 --> 01:12:34.479
earlier, right? These are all 512.

01:12:32.079 --> 01:12:36.760
But this thing here I'm going to make it

01:12:34.479 --> 01:12:38.799
just 128.

01:12:36.760 --> 01:12:41.199
Okay, that's what I mean by units here.

01:12:38.800 --> 01:12:43.119
And so if you look at the actual model

01:12:41.199 --> 01:12:45.679
okay, whatever comes in has a max query

01:12:43.119 --> 01:12:47.239
length of I think 30 if I recall.

01:12:45.680 --> 01:12:50.240
Um actually let's just make sure of

01:12:47.239 --> 01:12:50.239
that. What did I assume?

01:12:51.439 --> 01:12:55.759
30, correct? Max query length 30. So,

01:12:53.079 --> 01:12:57.319
each sentence is 30. So, if a sentence

01:12:55.760 --> 01:12:59.680
has 35 words in it, what's going to

01:12:57.319 --> 01:12:59.679
happen?

01:12:59.840 --> 01:13:03.760
The last five will get chopped,

01:13:01.159 --> 01:13:05.359
truncated. If it comes in at 22, we're

01:13:03.760 --> 01:13:06.840
going to pad it with eight more

01:13:05.359 --> 01:13:09.559
pad tokens. Okay? That's how we

01:13:06.840 --> 01:13:12.159
make sure everything uh gets to 30.

01:13:09.560 --> 01:13:14.039
All right. So, we come back here.

01:13:12.159 --> 01:13:16.720
So, the input is still sentences which

01:13:14.039 --> 01:13:18.960
are 30 tokens long.

01:13:16.720 --> 01:13:20.520
And then we run it through a positional

01:13:18.960 --> 01:13:23.119
embedding layer.

01:13:20.520 --> 01:13:25.160
Okay? This positional embedding layer

01:13:23.119 --> 01:13:27.319
has the actual embedding for each

01:13:25.159 --> 01:13:29.279
word, that table and it has the

01:13:27.319 --> 01:13:31.639
positional table, positional embedding

01:13:29.279 --> 01:13:34.119
table. So, just to be clear, this

01:13:31.640 --> 01:13:37.119
positional embedding layer is basically

01:13:34.119 --> 01:13:38.800
this.

01:13:37.119 --> 01:13:41.199
So, this table

01:13:38.800 --> 01:13:43.720
and this table together are packaged up

01:13:41.199 --> 01:13:45.279
into the positional encoding layer.

01:13:43.720 --> 01:13:47.400
But they are two distinct tables. They

01:13:45.279 --> 01:13:49.479
just happen to be packaged up.
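A rough sketch of what such a layer looks like if you write it yourself: two embedding lookup tables, one per token and one per position, added together. This mirrors the standard Keras pattern and is not necessarily the exact hardel implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

class PositionalEmbedding(layers.Layer):
    """Two lookup tables packaged together: one per token, one per position."""

    def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(vocab_size, embed_dim)          # e.g. 8888 x 512
        self.position_embeddings = layers.Embedding(sequence_length, embed_dim)  # e.g. 30 x 512

    def call(self, inputs):
        positions = tf.range(start=0, limit=tf.shape(inputs)[-1], delta=1)
        # word embedding + positional embedding: the "plus" in the diagram
        return self.token_embeddings(inputs) + self.position_embeddings(positions)
```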

01:13:47.399 --> 01:13:51.119
So,

01:13:49.479 --> 01:13:52.839
so this is what we have here.

01:13:51.119 --> 01:13:55.000
And then we get a nice positional

01:13:52.840 --> 01:13:57.480
embedding out and then boom, we run it

01:13:55.000 --> 01:13:59.640
through the transformer. And you know,

01:13:57.479 --> 01:14:01.559
this transformer encoder object we have

01:13:59.640 --> 01:14:02.800
to tell it obviously, hey, this is the

01:14:01.560 --> 01:14:04.640
embedding dimension that's going to come

01:14:02.800 --> 01:14:06.880
out. This is the dense dimension you're

01:14:04.640 --> 01:14:09.000
going to use in that final feedforward

01:14:06.880 --> 01:14:10.159
layer inside each attention block and

01:14:09.000 --> 01:14:11.640
this is the number of attention heads I

01:14:10.159 --> 01:14:13.519
want you to use. That's it.

01:14:11.640 --> 01:14:14.800
Very simple, right? Only three things have to

01:14:13.520 --> 01:14:16.840
be specified.

01:14:14.800 --> 01:14:18.039
And then whatever comes out of the

01:14:16.840 --> 01:14:19.159
transformer encoder are these blue

01:14:18.039 --> 01:14:20.960
vectors.

01:14:19.159 --> 01:14:22.720
And then we are back into good old sort

01:14:20.960 --> 01:14:24.560
of, you know, traditional DNN stuff

01:14:22.720 --> 01:14:27.880
where we take this thing, run it through

01:14:24.560 --> 01:14:30.880
a ReLU with 128 units, we add a little

01:14:27.880 --> 01:14:33.279
dropout uh and then we run it through a

01:14:30.880 --> 01:14:35.600
dense layer which the the vocab size

01:14:33.279 --> 01:14:37.359
here is 125, which is the 125-way

01:14:35.600 --> 01:14:39.840
softmax.

01:14:37.359 --> 01:14:41.239
Okay? Activation softmax.

01:14:39.840 --> 01:14:42.720
Connect up everything into model input

01:14:41.239 --> 01:14:44.399
and output and boom, that's the whole

01:14:42.720 --> 01:14:47.440
model.

01:14:44.399 --> 01:14:48.519
So, that's what we have here.

01:14:47.439 --> 01:14:50.839
Okay?
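Putting it all together, the model described here looks roughly like the following. This is a sketch: TransformerEncoder and PositionalEmbedding are the layers imported from hardel earlier, but their exact constructor signatures, the dropout rate, and the compile settings are assumptions on my part.

```python
from tensorflow import keras
from tensorflow.keras import layers
from hardel import TransformerEncoder, PositionalEmbedding   # the course helper file

vocab_size = 8888          # input vocabulary size
num_slots = 125            # output vocabulary: 123 slot tags + pad + [UNK]
max_query_length = 30
embed_dim, dense_dim, num_heads = 512, 64, 5

inputs = keras.Input(shape=(max_query_length,), dtype="int64")
x = PositionalEmbedding(max_query_length, vocab_size, embed_dim)(inputs)   # word table + position table
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)                 # contextual embeddings out
x = layers.Dense(128, activation="relu")(x)                                # the 128-unit ReLU layer
x = layers.Dropout(0.5)(x)                                                 # "a little dropout" (rate assumed)
outputs = layers.Dense(num_slots, activation="softmax")(x)                 # 125-way softmax per token
model = keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",                      # integer slot labels per token
              metrics=["accuracy"])
model.summary()
```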

01:14:48.520 --> 01:14:50.840
Now,

01:14:51.079 --> 01:14:54.680
after Wednesday's

01:14:53.399 --> 01:14:56.679
class

01:14:54.680 --> 01:14:59.320
for extra credit and for your personal

01:14:56.680 --> 01:15:00.880
edification

01:14:59.319 --> 01:15:03.000
try to work through this thing to come

01:15:00.880 --> 01:15:04.800
up with this number.

01:15:03.000 --> 01:15:06.960
53 million

01:15:04.800 --> 01:15:10.039
um sorry, 5.3 million.

01:15:06.960 --> 01:15:12.600
Right? Uh and see if it matches this

01:15:10.039 --> 01:15:13.920
number here.

01:15:12.600 --> 01:15:15.520
It should match.

01:15:13.920 --> 01:15:17.840
Hand calculate the number of parameters

01:15:15.520 --> 01:15:19.720
inside the transformer. Okay? For fame

01:15:17.840 --> 01:15:20.520
and fortune. That's an optional thing.

01:15:19.720 --> 01:15:22.240
So,

01:15:20.520 --> 01:15:23.480
uh do it after Wednesday's class, not

01:15:22.239 --> 01:15:24.920
right now.

01:15:23.479 --> 01:15:26.799
And I have actually listed the exact

01:15:24.920 --> 01:15:28.560
math that goes into it here. Okay? All

01:15:26.800 --> 01:15:30.159
right. So, by the way, you can peek into

01:15:28.560 --> 01:15:31.960
any layer's weights using its weights

01:15:30.159 --> 01:15:33.319
attribute. This is the embedding

01:15:31.960 --> 01:15:34.640
uh the positional embedding thing we

01:15:33.319 --> 01:15:36.759
had. So,

01:15:34.640 --> 01:15:39.440
we can click it and you can see here it

01:15:36.760 --> 01:15:40.840
has two tables. There's the first table

01:15:39.439 --> 01:15:41.799
which is just the embedding table which

01:15:40.840 --> 01:15:43.560
says

01:15:41.800 --> 01:15:45.840
there are 8,888 tokens in my

01:15:43.560 --> 01:15:47.880
vocabulary and each of those tokens has

01:15:45.840 --> 01:15:49.880
an embedding vector which is 512 long.

01:15:47.880 --> 01:15:51.520
That is the first table here. And then

01:15:49.880 --> 01:15:53.880
it has the second object which is the

01:15:51.520 --> 01:15:56.480
positional embedding and it says here,

01:15:53.880 --> 01:15:58.640
well, my sentences can be 30 long and

01:15:56.479 --> 01:16:02.079
for each position of the 30 long

01:15:58.640 --> 01:16:04.079
sentence, I will have a 512 embedding.

01:16:02.079 --> 01:16:05.439
Both these tables as I mentioned earlier

01:16:04.079 --> 01:16:06.800
are packaged up inside and you can

01:16:05.439 --> 01:16:08.159
actually see what the weights are before

01:16:06.800 --> 01:16:09.560
you do any training.

01:16:08.159 --> 01:16:11.319
Okay?
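For the curious, a hedged sketch of that kind of inspection: the layer index is an assumption about where the positional embedding layer sits in the model, and the two table shapes are the 8,888 x 512 and 30 x 512 tables just described.

```python
# Peek at the positional embedding layer's weights before any training.
pos_embed = model.layers[1]                 # assuming it is the layer right after the Input
for w in pos_embed.weights:
    print(w.name, w.shape)                  # expect (8888, 512) and (30, 512)

# Hand count for this layer alone:
#   token table    8888 * 512 = 4,550,656
#   position table   30 * 512 =    15,360
print(pos_embed.count_params())             # 4,566,016

model.summary()                             # the grand total should come to about 5.3 million
```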

01:16:09.560 --> 01:16:13.400
So, all right. So, I'm going to stop

01:16:11.319 --> 01:16:14.359
here because the model is going to

01:16:13.399 --> 01:16:16.079
take a few minutes to run and we're

01:16:14.359 --> 01:16:17.519
already at 9:45.

01:16:16.079 --> 01:16:19.479
Um so, we will continue the journey on

01:16:17.520 --> 01:16:20.560
Wednesday. If some of it is not super

01:16:19.479 --> 01:16:21.799
clear, don't worry about it. It will

01:16:20.560 --> 01:16:22.960
become much clearer on Wednesday. All

01:16:21.800 --> 01:16:23.640
right? All right, folks, have a good

01:16:22.960 --> 01:16:26.000
couple of days. I'll see you on

01:16:23.640 --> 01:16:26.000
Wednesday.
