WEBVTT

00:00:17.039 --> 00:00:21.760
Okay. Uh, all right. So, we'll continue

00:00:19.199 --> 00:00:23.519
with transformers today. Part two. Uh,

00:00:21.760 --> 00:00:24.960
we're going to do the second pass. Uh,

00:00:23.518 --> 00:00:27.439
this is going to be a deeper pass

00:00:24.960 --> 00:00:29.760
through the transformer stack. Um and I

00:00:27.439 --> 00:00:31.278
think maybe the next 30 minutes it's

00:00:29.760 --> 00:00:33.840
potentially the most demanding 30

00:00:31.278 --> 00:00:35.439
minutes of the entire course. Okay, with

00:00:33.840 --> 00:00:38.960
that motivational speech, let's get

00:00:35.439 --> 00:00:41.359
going. Okay, so quick review. Why do we

00:00:38.960 --> 00:00:43.520
want transformers? Because we

00:00:41.359 --> 00:00:45.280
want an architecture that can generate

00:00:43.520 --> 00:00:48.879
output that has the same length as the

00:00:45.280 --> 00:00:50.320
input. Same length. Oh, there it is. Uh

00:00:48.878 --> 00:00:51.359
number two, we want to take the context

00:00:50.320 --> 00:00:53.520
into account and we want to take the

00:00:51.359 --> 00:00:55.198
order into account. And as you saw last

00:00:53.520 --> 00:00:57.199
time, the transformer architecture

00:00:55.198 --> 00:00:59.358
delivers on those three requirements.

00:00:57.198 --> 00:01:01.599
And so uh just a quick review, if you

00:00:59.359 --> 00:01:03.198
have a phrase like the train left the station,

00:01:01.600 --> 00:01:05.280
we have all these little arrows which

00:01:03.198 --> 00:01:08.079
stand for the standalone or

00:01:05.280 --> 00:01:09.439
uncontextual embeddings. Uh and then

00:01:08.079 --> 00:01:12.079
sometimes this works. So I'm going to

00:01:09.438 --> 00:01:13.759
put it close to me here.

00:01:12.079 --> 00:01:16.640
Okay.

00:01:13.760 --> 00:01:17.920
All right. So here,

00:01:16.640 --> 00:01:19.359
we start with either standalone

00:01:17.920 --> 00:01:20.879
embeddings, i.e. uncontextual

00:01:19.359 --> 00:01:22.239
embeddings uh which have been

00:01:20.879 --> 00:01:25.118
pre-trained or random doesn't really

00:01:22.239 --> 00:01:27.199
matter. If you look at the Colab we did

00:01:25.118 --> 00:01:30.400
uh the other day we actually just start

00:01:27.200 --> 00:01:32.478
with random weights for the embeddings

00:01:30.400 --> 00:01:35.439
and then we add positional embeddings to

00:01:32.478 --> 00:01:38.239
them. And so you know each embedding

00:01:35.438 --> 00:01:39.679
each word here, we take its standalone embedding, we

00:01:38.239 --> 00:01:41.438
take its positional embedding we just

00:01:39.680 --> 00:01:43.680
literally just add them up element by

00:01:41.438 --> 00:01:45.039
element then we get a total embedding

00:01:43.680 --> 00:01:48.000
and that's called the positional

00:01:45.040 --> 00:01:49.439
embedding of each word. Okay. And then

00:01:48.000 --> 00:01:51.920
that's what we have: positional input

00:01:49.438 --> 00:01:54.000
embeddings. So this whole thing goes

00:01:51.920 --> 00:01:55.359
into this transformer encoder stack and

00:01:54.000 --> 00:01:57.359
what pops out the other end is

00:01:55.359 --> 00:02:01.280
contextual embeddings. Okay. So that's

00:01:57.359 --> 00:02:03.920
the overall flow. Now

00:02:01.280 --> 00:02:06.159
we applied the transformer stack

00:02:03.920 --> 00:02:08.080
to the word to slot classification

00:02:06.159 --> 00:02:10.319
problem where we basically took every

00:02:08.080 --> 00:02:12.400
incoming natural language query that

00:02:10.318 --> 00:02:14.399
comes in. We calculate its positional

00:02:12.400 --> 00:02:16.640
embeddings and then we run it through

00:02:14.400 --> 00:02:18.480
the transformer stack. uh and then we

00:02:16.639 --> 00:02:21.279
get contextual embeddings and then at

00:02:18.479 --> 00:02:22.878
this point uh since each word that comes

00:02:21.280 --> 00:02:24.640
out each embedding that comes out needs

00:02:22.878 --> 00:02:26.959
to be classified into one of 125

00:02:24.639 --> 00:02:29.119
possibilities we run it through a ReLU

00:02:26.959 --> 00:02:31.199
and then we attach a softmax

00:02:29.120 --> 00:02:33.920
to each embedding right this is

00:02:31.199 --> 00:02:36.399
basically what we did last class

00:02:33.919 --> 00:02:39.439
um so this is the transformer encoder

00:02:36.400 --> 00:02:43.760
okay now actually

00:02:39.439 --> 00:02:43.759
any questions on this before I continue
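[Editorial aside: the flow just reviewed — standalone embeddings plus positional embeddings, added element by element, then the encoder, then a per-word softmax over the slot classes — can be sketched in a few lines of NumPy. The dimensions and random weights below are made up for illustration (the Colab likewise starts from random embedding weights); the encoder stack itself is elided.]

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, vocab, n_tags, d = 6, 100, 125, 8

# Standalone (uncontextual) word embeddings -- random, as in the Colab.
word_emb = rng.normal(size=(vocab, d))
# One positional embedding per position in the input.
pos_emb = rng.normal(size=(seq_len, d))

tokens = rng.integers(0, vocab, size=seq_len)  # a toy input query

# Add them up element by element: the positional input embeddings.
x = word_emb[tokens] + pos_emb                 # shape (seq_len, d)

# ...the transformer encoder stack goes here, turning x into
# contextual embeddings of the same shape...

# Then each output embedding gets a softmax over the slot classes.
logits = x @ rng.normal(size=(d, n_tags))
e = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = e / e.sum(axis=1, keepdims=True)       # one distribution per word
```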

00:02:48.479 --> 00:02:52.399
I was wondering, how do you

00:02:50.479 --> 00:02:55.280
decide where to add more self attention

00:02:52.400 --> 00:02:58.000
and where to add transformer layers? You

00:02:55.280 --> 00:03:03.120
mentioned that the chart has 96 of them.

00:02:58.000 --> 00:03:05.439
>> Yeah. So right, GPT-3 has 96

00:03:03.120 --> 00:03:07.759
transformer blocks. Each one is a block.

00:03:05.439 --> 00:03:09.519
Um, so I think the question goes to do

00:03:07.759 --> 00:03:11.598
you add more attention heads within a

00:03:09.519 --> 00:03:14.158
single block or do you add lots of

00:03:11.598 --> 00:03:16.719
blocks? And both are good things to do.

00:03:14.158 --> 00:03:18.639
Um, what increasing the number of

00:03:16.719 --> 00:03:21.039
attention heads in a block does for you,

00:03:18.639 --> 00:03:23.679
it allows you to pick up more patterns

00:03:21.039 --> 00:03:25.919
at that level of abstraction.

00:03:23.680 --> 00:03:28.319
But if you add more blocks, much like

00:03:25.919 --> 00:03:30.798
later convolutional filters can build on

00:03:28.318 --> 00:03:32.798
earlier convolutional filters, you're

00:03:30.799 --> 00:03:34.719
going up the levels of abstraction. So

00:03:32.799 --> 00:03:36.480
to go to vision for instance you have

00:03:34.719 --> 00:03:37.680
the notion of lines and so on in the

00:03:36.479 --> 00:03:40.560
beginning and then you have a notion of

00:03:37.680 --> 00:03:42.879
edges which are two lines then you have

00:03:40.560 --> 00:03:45.199
you know nose eyes face and so on and so

00:03:42.878 --> 00:03:46.798
forth. So both are worth doing. So

00:03:45.199 --> 00:03:49.039
typically that's what you

00:03:46.799 --> 00:03:52.239
find that people typically have

00:03:49.039 --> 00:03:54.400
maybe five, six, up to

00:03:52.239 --> 00:03:55.840
a dozen heads. We'll see examples of how

00:03:54.400 --> 00:03:58.400
many heads in a couple of architectures

00:03:55.840 --> 00:04:01.200
later on today. And you can the more you

00:03:58.400 --> 00:04:02.799
go up, the more capable the model

00:04:01.199 --> 00:04:05.518
becomes, as long as you have enough data

00:04:02.799 --> 00:04:07.200
to train it well. So the perennial

00:04:05.519 --> 00:04:09.360
question of do we have enough data to

00:04:07.199 --> 00:04:11.039
train this large model because if you

00:04:09.360 --> 00:04:12.720
don't have enough data we might run into

00:04:11.039 --> 00:04:14.719
overfitting problems and so on. That's

00:04:12.719 --> 00:04:17.040
always the trade-off.
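[As a toy illustration of that width-versus-depth distinction — pure NumPy, with random stand-in projections per head, not the actual tuned attention (that comes later today): heads within a block run in parallel and are combined, while blocks are composed one after another, each building on the previous one's output.]

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d = 5, 8

def softmax_rows(s):
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def make_head(d):
    # Each head gets its own (here random, in practice learned) projection,
    # so each head can pick up a different pattern at the same level.
    a = rng.normal(size=(d, d)) / np.sqrt(d)
    return lambda x: softmax_rows((x @ a) @ (x @ a).T) @ x

def block(heads, x):
    # Width: heads within one block run in parallel and are combined
    # (averaged here for simplicity; real blocks concatenate and project).
    return np.mean([h(x) for h in heads], axis=0)

x = rng.normal(size=(seq_len, d))

# Depth: 3 blocks stacked, each with 4 heads; each block consumes
# the previous block's output -- rising levels of abstraction.
for _ in range(3):
    x = block([make_head(d) for _ in range(4)], x)
```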

00:04:14.719 --> 00:04:18.720
So okay so here I just want to quickly

00:04:17.040 --> 00:04:20.560
switch to the Colab because we didn't

00:04:18.720 --> 00:04:22.240
get a chance to finish it. I'm not

00:04:20.560 --> 00:04:24.800
going to run it because it's going to

00:04:22.240 --> 00:04:27.680
take some time. So where we left off

00:04:24.800 --> 00:04:31.120
last time.

00:04:27.680 --> 00:04:32.959
Okay. So here we basically took this

00:04:31.120 --> 00:04:34.959
architecture that we just saw on the

00:04:32.959 --> 00:04:36.560
slide and then we essentially wrote it

00:04:34.959 --> 00:04:37.839
as a Keras model and I went through this

00:04:36.560 --> 00:04:39.918
model in the last class so I'm not going

00:04:37.839 --> 00:04:41.519
to go through it all over again. What we

00:04:39.918 --> 00:04:44.719
did not do last class was to actually

00:04:41.519 --> 00:04:47.599
run it. Um and so uh so if you actually

00:04:44.720 --> 00:04:50.160
run it right you can just run it for 10

00:04:47.600 --> 00:04:52.479
epochs just like we normally do. Give it

00:04:50.160 --> 00:04:53.439
data give it a bunch of epochs choose a

00:04:52.478 --> 00:04:55.680
particular batch size. I just

00:04:53.439 --> 00:04:57.519
arbitrarily chose 64. You run it for 10

00:04:55.680 --> 00:05:00.959
epochs and then you evaluate it on the

00:04:57.519 --> 00:05:03.599
test set. You get a 99% accuracy on this

00:05:00.959 --> 00:05:05.439
problem. One transformer stack. That's

00:05:03.600 --> 00:05:08.320
it. One block rather. One block.

00:05:05.439 --> 00:05:09.759
That's it. And uh of course here there's

00:05:08.319 --> 00:05:12.560
a little trickiness going on here

00:05:09.759 --> 00:05:15.360
because a naive model can literally say

00:05:12.560 --> 00:05:17.199
every word that comes in is other. O.

00:05:15.360 --> 00:05:19.439
And since the O's are the majority of

00:05:17.199 --> 00:05:20.960
the words, it's not going to do badly,

00:05:19.439 --> 00:05:22.639
right? It's like having a classification

00:05:20.959 --> 00:05:25.038
problem in which one class is very

00:05:22.639 --> 00:05:26.478
predominant. So the naive way to

00:05:25.038 --> 00:05:27.918
actually do well is to just say every

00:05:26.478 --> 00:05:30.000
time something comes in, oh it's that

00:05:27.918 --> 00:05:32.079
majority class. The same thing happens.

00:05:30.000 --> 00:05:34.478
But if you then adjust for that, it

00:05:32.079 --> 00:05:35.918
turns out that the accuracy on the non-O

00:05:34.478 --> 00:05:38.079
slots, which is really what you care

00:05:35.918 --> 00:05:40.639
about, is actually 93%.
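[To make the majority-class effect concrete, here is a small NumPy sketch with made-up tags: overall accuracy looks excellent even when the non-O slots — the ones you actually care about — are only half right.]

```python
import numpy as np

# Toy slot labels: "O" (other) dominates, so overall accuracy is inflated.
true_tags = np.array(["O", "O", "O", "O", "O", "O", "city", "date"])
pred_tags = np.array(["O", "O", "O", "O", "O", "O", "city", "O"])

overall = (pred_tags == true_tags).mean()      # looks great: 7 of 8 correct

# Accuracy restricted to the non-O slots -- the adjusted number.
mask = true_tags != "O"
non_o = (pred_tags[mask] == true_tags[mask]).mean()   # only 1 of 2 correct
```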

00:05:38.079 --> 00:05:42.399
Which is actually pretty good. Okay. Uh

00:05:40.639 --> 00:05:44.319
and then I had some examples of, you

00:05:42.399 --> 00:05:45.758
know, lots of fun queries you can do,

00:05:44.319 --> 00:05:47.439
including queries where I try to break

00:05:45.759 --> 00:05:49.038
stuff like cheapest flight to fly from

00:05:47.439 --> 00:05:50.800
MIT to Mars and see what happens, you

00:05:49.038 --> 00:05:53.519
know, things like that. So have fun with

00:05:50.800 --> 00:05:56.520
it. Okay. Um, all right, back to

00:05:53.519 --> 00:05:56.519
PowerPoint.

00:05:59.439 --> 00:06:03.839
So, this is what we had. Now, what we're

00:06:01.199 --> 00:06:05.919
going to do in today's class, we are

00:06:03.839 --> 00:06:08.478
actually going to take the encoder we

00:06:05.918 --> 00:06:10.799
built last time and introduce three new

00:06:08.478 --> 00:06:11.758
complications into it. And when we

00:06:10.800 --> 00:06:14.079
finish introducing these three

00:06:11.759 --> 00:06:15.919
complications, we will actually have the

00:06:14.079 --> 00:06:20.079
actual transformer that was invented in

00:06:15.918 --> 00:06:21.918
the 2017 paper. Okay. All right. Um, the

00:06:20.079 --> 00:06:24.959
first tweak is the hardest tweak. So

00:06:21.918 --> 00:06:26.719
we'll slowly work our way to it. So

00:06:24.959 --> 00:06:28.560
the thing to remember is let's review

00:06:26.720 --> 00:06:30.560
self attention. What is self attention?

00:06:28.560 --> 00:06:32.319
You have a bunch of words and we further

00:06:30.560 --> 00:06:34.000
said that for any particular word like

00:06:32.319 --> 00:06:36.000
station we want to take its positional

00:06:34.000 --> 00:06:38.240
embedding and then make it contextual.

00:06:36.000 --> 00:06:40.319
And the way we do that is by taking each

00:06:38.240 --> 00:06:42.240
word's embedding and then calculating

00:06:40.319 --> 00:06:44.160
these dot products between all the

00:06:42.240 --> 00:06:46.400
other words. And then since these dot

00:06:44.160 --> 00:06:48.400
products can be positive or negative we

00:06:46.399 --> 00:06:50.000
want to make them all positive and

00:06:48.399 --> 00:06:52.638
normalize them so that they nicely add

00:06:50.000 --> 00:06:54.879
up to one. So we then exponentiate them

00:06:52.639 --> 00:06:57.519
and then divide by the total, right?

00:06:54.879 --> 00:06:59.038
Which is basically soft max. And when

00:06:57.519 --> 00:07:01.198
you do that, you have nice fractions

00:06:59.038 --> 00:07:03.519
that add up to one. And then we said,

00:07:01.199 --> 00:07:07.199
well, the contextual embedding for W6 is

00:07:03.519 --> 00:07:10.079
just all these weights S1, S2 all the

00:07:07.199 --> 00:07:12.960
way to S6 multiplied by the original W's

00:07:10.079 --> 00:07:14.879
and then you get the context for W6. So

00:07:12.959 --> 00:07:19.198
this is the basic logic we covered last

00:07:14.879 --> 00:07:21.839
time. Now it is obviously the case that

00:07:19.199 --> 00:07:23.598
we explained it only for one word but we

00:07:21.839 --> 00:07:25.839
have to do the same exact operation for

00:07:23.598 --> 00:07:28.719
every one of the other words too so that

00:07:25.839 --> 00:07:30.719
we could calculate W5 hat, W4 hat, W3

00:07:28.720 --> 00:07:32.240
hat and so on and so forth right so

00:07:30.720 --> 00:07:34.479
there's a lot of computations that are

00:07:32.240 --> 00:07:36.720
going on and they all look kind of

00:07:34.478 --> 00:07:38.240
similar where you got to do a bunch of

00:07:36.720 --> 00:07:39.759
dot products, you've got to do

00:07:38.240 --> 00:07:42.000
some soft maxing on it and stuff like

00:07:39.759 --> 00:07:45.120
that so the natural question is is there

00:07:42.000 --> 00:07:46.959
a way to organize it very efficiently

00:07:45.120 --> 00:07:48.079
And the short answer is yes. In fact, if

00:07:46.959 --> 00:07:50.318
you could not do that, there wouldn't be

00:07:48.079 --> 00:07:52.079
any transformer revolution. Okay,

00:07:50.319 --> 00:07:53.439
because there is that ability to package

00:07:52.079 --> 00:07:55.758
it up into a very interesting and

00:07:53.439 --> 00:07:58.478
efficient operation that allows you to

00:07:55.759 --> 00:08:02.000
put the whole thing on GPUs.

00:07:58.478 --> 00:08:04.478
Okay, so now I'm going to switch to iPad

00:08:02.000 --> 00:08:06.879
uh and give you some iPad scribblings of

00:08:04.478 --> 00:08:08.399
mine which were concocted last night

00:08:06.879 --> 00:08:10.319
because I was very unhappy with the

00:08:08.399 --> 00:08:14.799
slides that follow. So, we're going to

00:08:10.319 --> 00:08:16.560
do iPad. Okay. All right. So if it

00:08:14.800 --> 00:08:17.919
works, you folks are lucky. If it

00:08:16.560 --> 00:08:21.079
doesn't work, last year's huddle class

00:08:17.918 --> 00:08:21.079
is luckier.

00:08:21.360 --> 00:08:29.639
So let's shift to that.

00:08:24.240 --> 00:08:29.639
All right. So we're going to go here.

00:08:31.199 --> 00:08:37.158
So let's assume we have a simple thing

00:08:32.799 --> 00:08:37.158
like uh oops.

00:08:37.679 --> 00:08:41.359
Okay, instead of you know train left the

00:08:40.080 --> 00:08:42.639
station which is a long sentence, let's

00:08:41.360 --> 00:08:45.759
just say you have a simple sentence like

00:08:42.639 --> 00:08:47.439
I love huddle. Okay, and so I love

00:08:45.759 --> 00:08:50.639
huddle is what you have and then you

00:08:47.440 --> 00:08:53.760
have these standalone embeddings W1 W2

00:08:50.639 --> 00:08:55.838
W3. Okay, so it comes into the self

00:08:53.759 --> 00:08:58.639
attention layer and let's assume that

00:08:55.839 --> 00:09:00.959
these W1, W2, W3, they're already

00:08:58.639 --> 00:09:02.399
positionally encoded, right? We have

00:09:00.958 --> 00:09:03.919
already added up the position encoding,

00:09:02.399 --> 00:09:05.039
all that stuff also. It's all behind us.

00:09:03.919 --> 00:09:08.079
That all happens outside the

00:09:05.039 --> 00:09:10.480
transformer. So you get it here.

00:09:08.080 --> 00:09:13.200
Now what you do is you actually make

00:09:10.480 --> 00:09:15.200
three copies of this thing.

00:09:13.200 --> 00:09:18.000
Okay? And let's call this whole thing as

00:09:15.200 --> 00:09:20.640
just X. Okay? I'm just giving it the

00:09:18.000 --> 00:09:23.360
name X. It's a matrix of these three

00:09:20.639 --> 00:09:25.199
vectors. And so the first copy goes up

00:09:23.360 --> 00:09:26.720
here, the second copy goes straight, and

00:09:25.200 --> 00:09:29.360
the third copy goes down. And don't

00:09:26.720 --> 00:09:31.600
worry about the third copy just yet. So

00:09:29.360 --> 00:09:33.680
if you look at the the first two copies,

00:09:31.600 --> 00:09:36.320
here is the key thing to focus on. Okay,

00:09:33.679 --> 00:09:37.759
this whole thing here. Remember that we

00:09:36.320 --> 00:09:40.240
want to calculate dotproducts between

00:09:37.759 --> 00:09:41.679
all these vectors. And basically we want

00:09:40.240 --> 00:09:44.799
to calculate the dot product of every

00:09:41.679 --> 00:09:46.319
pair of vectors, every pair of words.

00:09:44.799 --> 00:09:47.919
The whole point of self attention is

00:09:46.320 --> 00:09:49.440
that every pair of words we figure out

00:09:47.919 --> 00:09:50.639
how attracted or related they are.

00:09:49.440 --> 00:09:53.040
Right? Which means that we have to

00:09:50.639 --> 00:09:55.439
calculate all pairs of dot products. And

00:09:53.039 --> 00:09:58.159
so what you do is you take this vector

00:09:55.440 --> 00:10:00.880
right there, W1 W2 W3. You take this other

00:09:58.159 --> 00:10:03.759
copy that went up. Okay? And then you

00:10:00.879 --> 00:10:05.278
transpose it. So when you transpose it,

00:10:03.759 --> 00:10:06.958
it all becomes nice and vertical like

00:10:05.278 --> 00:10:08.720
that.

00:10:06.958 --> 00:10:09.679
Right? All the vectors came in like

00:10:08.720 --> 00:10:12.399
this. When you transpose, it becomes

00:10:09.679 --> 00:10:15.439
vertical. And now what you do is you

00:10:12.399 --> 00:10:19.839
take each one you take W1 and then you

00:10:15.440 --> 00:10:22.240
multiply it by W1. Then W1 * W2,

00:10:19.839 --> 00:10:23.760
W1 * W3. You calculate all those dot

00:10:22.240 --> 00:10:27.120
products like that. And when you do that

00:10:23.759 --> 00:10:29.439
you have these nice cells where every

00:10:27.120 --> 00:10:31.919
pair of words their dot products have

00:10:29.440 --> 00:10:34.079
been calculated in this grid. Okay. And

00:10:31.919 --> 00:10:36.078
the key thing to see here and folks with

00:10:34.078 --> 00:10:38.559
a matrix algebra background will see

00:10:36.078 --> 00:10:40.319
this immediately. All we are doing is we

00:10:38.559 --> 00:10:42.399
are taking this x which is the matrix

00:10:40.320 --> 00:10:44.800
that came in

00:10:42.399 --> 00:10:46.480
and then X transpose, which is the matrix that

00:10:44.799 --> 00:10:48.559
we sent up and then brought back

00:10:46.480 --> 00:10:50.639
down. We are basically doing a matrix

00:10:48.559 --> 00:10:53.759
multiplication of X times X transpose. That's all

00:10:50.639 --> 00:10:57.600
we're doing. And when we do that, we're

00:10:53.759 --> 00:10:59.439
getting this nice grid in

00:10:57.600 --> 00:11:01.120
which every pair of words their dot

00:10:59.440 --> 00:11:03.600
products have been calculated for you

00:11:01.120 --> 00:11:05.440
with one matrix multiplication. Boom.

00:11:03.600 --> 00:11:07.200
Done. Okay. So if you have three

00:11:05.440 --> 00:11:11.200
words, there are nine multiplications,

00:11:07.200 --> 00:11:13.920
right? So if you have a million words,

00:11:11.200 --> 00:11:15.680
that's a lot of multiplications, right?

00:11:13.919 --> 00:11:18.078
One trillion multiplications on the

00:11:15.679 --> 00:11:21.199
order of a trillion. And the reason to

00:11:18.078 --> 00:11:23.039
say order is because you know W1 * W3 is

00:11:21.200 --> 00:11:25.680
the same as W3 * W1. So there's some

00:11:23.039 --> 00:11:27.360
duplication here. So you get this grid,

00:11:25.679 --> 00:11:29.838
okay, in one shot with one matrix

00:11:27.360 --> 00:11:31.278
multiplication. And then, because each

00:11:29.839 --> 00:11:32.800
of these numbers is just a dot product

00:11:31.278 --> 00:11:34.559
which can be negative or positive, we

00:11:32.799 --> 00:11:36.240
need to softmax it.

00:11:34.559 --> 00:11:38.399
And so what we do is we take all these

00:11:36.240 --> 00:11:40.240
numbers and we put it into a softmax

00:11:38.399 --> 00:11:41.759
function where for each row it

00:11:40.240 --> 00:11:44.480
calculates a soft max. And what do I

00:11:41.759 --> 00:11:46.559
mean by that? It takes each number here

00:11:44.480 --> 00:11:47.839
does e raised to the

00:11:46.559 --> 00:11:49.919
number. It does it for each of these

00:11:47.839 --> 00:11:51.920
numbers and then divides by the sum of

00:11:49.919 --> 00:11:54.159
those numbers for each row. And when you

00:11:51.919 --> 00:11:56.639
do that okay you can think of this

00:11:54.159 --> 00:11:59.439
operation as softmax applied to X times

00:11:56.639 --> 00:12:01.039
X transpose, and you get this nice little table of

00:11:59.440 --> 00:12:02.880
numbers.
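[That row-wise recipe — exponentiate every entry, divide by its row's sum — is just softmax applied to each row of X times X transpose. A minimal NumPy check with three toy embeddings:]

```python
import numpy as np

x = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])   # three toy word embeddings as rows

scores = x @ x.T             # all pairwise dot products in one shot

# Row-wise softmax: e^entry for each number, divided by its row's sum.
e = np.exp(scores)
weights = e / e.sum(axis=1, keepdims=True)

# Each row is now well-behaved fractions that nicely add up to one.
```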

00:12:01.039 --> 00:12:06.240
This table of numbers basically says

00:12:02.879 --> 00:12:08.799
that for the first word right W1 for the

00:12:06.240 --> 00:12:11.519
first word, take 0.1 of the first

00:12:08.799 --> 00:12:14.240
one, 0.7 of the second, 0.2 of the

00:12:11.519 --> 00:12:17.200
third and add them up. We do a weighted

00:12:14.240 --> 00:12:20.720
average. So we have this table here. We

00:12:17.200 --> 00:12:24.000
have now the third copy showing up here.

00:12:20.720 --> 00:12:25.200
Okay, it's right there. So we do this times

00:12:24.000 --> 00:12:27.200
that which is just a matrix

00:12:25.200 --> 00:12:29.040
multiplication again. And when we do

00:12:27.200 --> 00:12:31.519
that we get the final contextual

00:12:29.039 --> 00:12:34.559
embeddings. So this for example is just

00:12:31.519 --> 00:12:36.399
0.1 * W1

00:12:34.559 --> 00:12:40.078
+ 0.7 * W2

00:12:36.399 --> 00:12:41.600
+ 0.2 * W3 right

00:12:40.078 --> 00:12:44.399
there. And you can see the same logic

00:12:41.600 --> 00:12:46.480
here as well. Okay. And you can read it

00:12:44.399 --> 00:12:47.679
later on. I will post this thing uh to

00:12:46.480 --> 00:12:50.399
make sure you understand exactly how it

00:12:47.679 --> 00:12:53.759
flowed. But the larger point I want you

00:12:50.399 --> 00:12:55.278
to focus on is that the entire self

00:12:53.759 --> 00:12:58.159
attention operation we just looked at

00:12:55.278 --> 00:13:01.919
here basically is this beautiful

00:12:58.159 --> 00:13:04.480
little compact matrix formula.

00:13:01.919 --> 00:13:06.240
Okay, X comes in, you do X transpose, you do a

00:13:04.480 --> 00:13:07.519
matrix multiplication you do a softmax

00:13:06.240 --> 00:13:10.320
on top of it and then multiply by X

00:13:07.519 --> 00:13:12.799
again and boom you're done.
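[Putting the pieces together, the whole untuned self-attention layer really is that one compact formula, softmax(X X^T) X — here as a few lines of NumPy on three made-up word vectors:]

```python
import numpy as np

def self_attention(x):
    """Untuned self-attention: softmax(X X^T) X, pure matrix ops."""
    scores = x @ x.T                               # all pairwise dot products
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights = e / e.sum(axis=1, keepdims=True)     # row-wise softmax
    return weights @ x                             # weighted averages

rng = np.random.default_rng(2)
x = rng.normal(size=(3, 4))        # three words, 4-dim embeddings
out = self_attention(x)

# Row i of `out` is s1*W1 + s2*W2 + s3*W3 for row i's softmax weights --
# the same weighted average computed word by word earlier.
```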

00:13:10.320 --> 00:13:15.200
So that is the magic of taking the

00:13:12.799 --> 00:13:17.199
transformer stack and representing it

00:13:15.200 --> 00:13:20.079
using matrix operations, because then it runs

00:13:17.200 --> 00:13:22.560
lightning fast on GPUs.

00:13:20.078 --> 00:13:24.638
Okay. All right.

00:13:22.559 --> 00:13:27.119
That was the warm-up.

00:13:24.639 --> 00:13:31.278
Now let's crank it up a notch.

00:13:27.120 --> 00:13:34.839
So recall that in the last class um I

00:13:31.278 --> 00:13:34.838
talked about the fact that in

00:13:35.519 --> 00:13:39.600
the self attention operation the W's are

00:13:38.000 --> 00:13:41.839
coming in and we're doing all this stuff

00:13:39.600 --> 00:13:44.399
with the W's right and then we're

00:13:41.839 --> 00:13:46.639
getting some W hats out but there are no

00:13:44.399 --> 00:13:48.958
parameters

00:13:46.639 --> 00:13:51.360
there's nothing to be learned inside the

00:13:48.958 --> 00:13:52.719
transformer self attention layer right

00:13:51.360 --> 00:13:54.639
there are no there are no weights there

00:13:52.720 --> 00:13:58.560
are no biases there are no coefficients

00:13:54.639 --> 00:14:00.879
so, okay, what are we learning then?

00:13:58.559 --> 00:14:03.359
Right? So what we now do is we going to

00:14:00.879 --> 00:14:05.039
make the self attention layer tunable.

00:14:03.360 --> 00:14:07.440
We're going to inject some weights into

00:14:05.039 --> 00:14:09.120
it so that when we train it on an actual

00:14:07.440 --> 00:14:10.800
system, the weights will keep

00:14:09.120 --> 00:14:12.240
changing to adapt itself to the

00:14:10.799 --> 00:14:15.599
particularities of whatever problem

00:14:12.240 --> 00:14:21.959
you're working on. Right? So that takes

00:14:15.600 --> 00:14:21.959
us to the tunable self attention layer.

00:14:22.720 --> 00:14:28.399
Okay? Tunable self attention layer. So

00:14:25.519 --> 00:14:29.759
this is the key thing to keep in mind.

00:14:28.399 --> 00:14:33.639
any questions on this before I continue

00:14:29.759 --> 00:14:33.639
with the tunability thing.

00:14:34.639 --> 00:14:39.839
Okay.

00:14:37.120 --> 00:14:41.839
Is this picture working out by the way?

00:14:39.839 --> 00:14:44.000
Okay.

00:14:41.839 --> 00:14:46.160
Uh all right.

00:14:44.000 --> 00:14:48.799
So what we now do is we have the same

00:14:46.159 --> 00:14:51.120
exact logic as before where we have this

00:14:48.799 --> 00:14:53.519
thing that comes in. Okay. We have this

00:14:51.120 --> 00:14:55.360
input that comes in the same we call it

00:14:53.519 --> 00:14:58.399
X again, this whole matrix of

00:14:55.360 --> 00:15:01.120
embeddings. And whereas before we just sent

00:14:58.399 --> 00:15:02.720
three copies untouched, now what

00:15:01.120 --> 00:15:04.879
we're going to do is we'll take each

00:15:02.720 --> 00:15:07.519
copy X and then we will actually

00:15:04.879 --> 00:15:09.120
multiply it by a matrix

00:15:07.519 --> 00:15:10.720
okay this matrix is called the key

00:15:09.120 --> 00:15:14.078
matrix

00:15:10.720 --> 00:15:16.000
okay, and this matrix of

00:15:14.078 --> 00:15:18.319
numbers are weights that will be learned

00:15:16.000 --> 00:15:20.399
by backprop

00:15:18.320 --> 00:15:23.199
so basically what we're saying is that

00:15:20.399 --> 00:15:25.759
when this thing comes in let's see if

00:15:23.198 --> 00:15:28.399
there's a way to transform this X into

00:15:25.759 --> 00:15:30.639
some other set of embeddings which may

00:15:28.399 --> 00:15:32.159
be useful for your task. We don't know

00:15:30.639 --> 00:15:34.320
if they're going to be useful, but

00:15:32.159 --> 00:15:36.399
surely giving it a bit more ability to

00:15:34.320 --> 00:15:39.199
have weights which can be learned means

00:15:36.399 --> 00:15:41.600
that we're giving it more expressive power,

00:15:39.198 --> 00:15:42.799
more modeling capacity. And whether it

00:15:41.600 --> 00:15:44.159
actually uses the capacity will depend

00:15:42.799 --> 00:15:46.479
on how much data you have and how well

00:15:44.159 --> 00:15:48.879
you train it. And maybe if it's not

00:15:46.480 --> 00:15:50.800
useful, it won't use it. What I mean

00:15:48.879 --> 00:15:52.799
is if transforming X actually doesn't

00:15:50.799 --> 00:15:55.679
really help at all, then this matrix A

00:15:52.799 --> 00:15:57.359
is going to be what?

00:15:55.679 --> 00:15:59.120
it's going to be the identity matrix

00:15:57.360 --> 00:16:01.278
because if you take the identity and

00:15:59.120 --> 00:16:03.120
multiply by X, you'll get X back again. So

00:16:01.278 --> 00:16:05.039
in the worst case maybe it just says I

00:16:03.120 --> 00:16:07.519
have nothing to learn here but maybe

00:16:05.039 --> 00:16:09.278
there is something you can learn. So

00:16:07.519 --> 00:16:12.560
that's what we do. So we multiply by

00:16:09.278 --> 00:16:14.480
this matrix AK and then we come up with

00:16:12.559 --> 00:16:16.239
some, you know,

00:16:14.480 --> 00:16:18.240
transformed embeddings and we call these

00:16:16.240 --> 00:16:22.000
things K

00:16:18.240 --> 00:16:24.079
okay, K. Now this K, Q, V, as you will see, has

00:16:22.000 --> 00:16:26.480
its origins in the in this field of

00:16:24.078 --> 00:16:28.159
information retrieval but I personally

00:16:26.480 --> 00:16:30.639
find that that interpretation is not

00:16:28.159 --> 00:16:32.000
super helpful because transformers are

00:16:30.639 --> 00:16:33.519
used for lots of applications outside

00:16:32.000 --> 00:16:35.759
information retrieval. So I'm not going

00:16:33.519 --> 00:16:37.360
to go with that kind of interpretation.

00:16:35.759 --> 00:16:39.440
I'm going to go with interpretation of

00:16:37.360 --> 00:16:41.360
let's make each of these things tunable.

00:16:39.440 --> 00:16:42.800
Okay. And tunability means we need to

00:16:41.360 --> 00:16:46.240
give it weights. All right. So that's

00:16:42.799 --> 00:16:47.758
what we have here. Now the second copy

00:16:46.240 --> 00:16:48.720
we did this with the first copy. Well,

00:16:47.759 --> 00:16:50.159
let's do the same thing with the second

00:16:48.720 --> 00:16:51.519
copy. We'll take the second copy and

00:16:50.159 --> 00:16:53.278
multiply it by some other matrix called

00:16:51.519 --> 00:16:54.720
AQ.

00:16:53.278 --> 00:16:57.439
And when we are done with that, we get

00:16:54.720 --> 00:17:00.320
these embeddings. And we will call these

00:16:57.440 --> 00:17:02.720
embeddings as Q.

00:17:00.320 --> 00:17:05.038
Okay. Now, just like before, we will

00:17:02.720 --> 00:17:07.120
take this this thing here and we'll

00:17:05.038 --> 00:17:08.400
transpose it.

00:17:07.119 --> 00:17:11.279
So, it all becomes nice and vertical

00:17:08.400 --> 00:17:12.319
like that. And then we'll do exactly the

00:17:11.279 --> 00:17:14.078
same as before. We'll calculate all

00:17:12.318 --> 00:17:16.720
these pairwise dot products using

00:17:14.078 --> 00:17:20.159
one shot, one matrix multiplication.

00:17:16.720 --> 00:17:22.078
And because we are calling this Q and we

00:17:20.160 --> 00:17:26.000
are calling this whole thing as K. This

00:17:22.078 --> 00:17:29.038
thing just becomes Q * KT.

00:17:26.000 --> 00:17:31.919
Okay. At the end of it you come up with

00:17:29.038 --> 00:17:33.359
a grid of numbers just like before.

00:17:31.919 --> 00:17:35.120
Okay. And these numbers could be

00:17:33.359 --> 00:17:36.399
negative or positive. So we need to do

00:17:35.119 --> 00:17:38.079
the softmax on them to make sure they

00:17:36.400 --> 00:17:42.160
are well behaved fractions that add up

00:17:38.079 --> 00:17:44.879
to one. So we take this Q KT business

00:17:42.160 --> 00:17:48.320
and then we do we just run a we put it

00:17:44.880 --> 00:17:50.720
through a softmax function for each row

00:17:48.319 --> 00:17:52.639
and when we do that, we'll get

00:17:50.720 --> 00:17:54.160
basically a table like the

00:17:52.640 --> 00:17:55.600
ones we saw before by the way the

00:17:54.160 --> 00:17:57.919
numbers here are the same just because I

00:17:55.599 --> 00:17:59.439
duplicated it because I'm lazy. In

00:17:57.919 --> 00:18:00.480
reality, given it has gone through all

00:17:59.440 --> 00:18:03.120
these transformations the numbers are

00:18:00.480 --> 00:18:05.440
not going to be the same right uh you

00:18:03.119 --> 00:18:08.719
have these numbers and then you take the

00:18:05.440 --> 00:18:10.080
final copy, which is X * AV. Right? Each

00:18:08.720 --> 00:18:11.919
copy is getting multiplied by its own

00:18:10.079 --> 00:18:14.319
matrix. Right? And this copy is being

00:18:11.919 --> 00:18:19.440
multiplied by AV. And let's call this X

00:18:14.319 --> 00:18:21.519
* AV. Okay? Which is here just V.

00:18:19.440 --> 00:18:24.640
And so what you have here is this soft

00:18:21.519 --> 00:18:26.319
max of QKᵀ times V, exactly the same kind of

00:18:24.640 --> 00:18:28.080
dot product as we saw before matrix

00:18:26.319 --> 00:18:30.000
multiplication. So we have these

00:18:28.079 --> 00:18:32.240
contextual embeddings and that's what's

00:18:30.000 --> 00:18:34.798
coming out of the of the transformer

00:18:32.240 --> 00:18:36.960
block. So now the whole thing we did

00:18:34.798 --> 00:18:42.558
here the whole thing can be represented

00:18:36.960 --> 00:18:47.038
as softmax of QKᵀ times V. Okay. So if we

00:18:42.558 --> 00:18:49.200
zoom in a bit. Come on. Okay.

00:18:47.038 --> 00:18:52.319
Okay.

00:18:49.200 --> 00:18:55.440
So X came in.

00:18:52.319 --> 00:18:59.519
Three tracks went here:

00:18:55.440 --> 00:19:01.360
X * AK, X * AQ, X * AV. And this thing

00:18:59.519 --> 00:19:03.918
is called K. This thing is called Q.

00:19:01.359 --> 00:19:06.079
This thing is called V. And then we do

00:19:03.919 --> 00:19:08.080
the same transpose as before. We do the

00:19:06.079 --> 00:19:09.839
dot product thing to calculate the

00:19:08.079 --> 00:19:12.319
pairwise dot products for everything

00:19:09.839 --> 00:19:15.119
which is just QKᵀ. We run it through a

00:19:12.319 --> 00:19:16.798
softmax. We get softmax of QKᵀ. We

00:19:15.119 --> 00:19:18.879
multiply it by V to do the final

00:19:16.798 --> 00:19:22.079
weighting and then boom the output comes

00:19:18.880 --> 00:19:24.160
and that's this function. That's it.
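As a minimal sketch of that whole softmax(QKᵀ)V computation (a NumPy illustration; the sizes and variable names are made up for this example, not from the lecture):

```python
import numpy as np

def softmax(z):
    # Row-wise softmax; subtract each row's max for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d = 6, 8                   # illustrative sizes: 6 words, 8-dim embeddings
X = rng.normal(size=(n_tokens, d))   # the incoming (uncontextual) embeddings

# Three independent learnable matrices: random at init, updated by backprop.
A_Q = rng.normal(size=(d, d))
A_K = rng.normal(size=(d, d))
A_V = rng.normal(size=(d, d))

Q, K, V = X @ A_Q, X @ A_K, X @ A_V  # the three "tracks"
weights = softmax(Q @ K.T)           # all pairwise dot products in one shot
out = weights @ V                    # the contextual embeddings
```

Each row of `weights` is a set of well-behaved fractions adding up to one, and `out` has the same shape as `X`: one contextual embedding per word.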

00:19:22.079 --> 00:19:27.279
Okay. So what we have done is we have

00:19:24.160 --> 00:19:31.200
introduced three matrices learnable

00:19:27.279 --> 00:19:34.319
matrices into the self attention layer.

00:19:31.200 --> 00:19:35.679
Okay. Now,

00:19:34.319 --> 00:19:37.439
okay. Let me just stop there for a sec.

00:19:35.679 --> 00:19:39.668
Questions.

00:19:37.440 --> 00:19:39.840
Yeah.

00:19:39.667 --> 00:19:43.119
[clears throat]

00:19:39.839 --> 00:19:44.159
>> Is there a relationship between AK, AQ,

00:19:43.119 --> 00:19:47.599
and AV?

00:19:44.160 --> 00:19:48.558
>> Independent, independent matrices.

00:19:47.599 --> 00:19:49.279
>> Yes.

00:19:48.558 --> 00:19:50.558
>> Like we have

00:19:49.279 --> 00:19:52.480
>> could you use the microphone please?

00:19:50.558 --> 00:19:55.038
>> Here we have three sets of parameters: K,

00:19:52.480 --> 00:19:58.240
Q and V. If, let's say,

00:19:55.038 --> 00:19:59.839
the total length, the

00:19:58.240 --> 00:20:02.558
number of dimensions, were let's

00:19:59.839 --> 00:20:04.959
say 50, so you would have, uh, 50 for a

00:20:02.558 --> 00:20:07.678
set of parameters, like you'll have to

00:20:04.960 --> 00:20:10.079
>> So if the dimension is

00:20:07.679 --> 00:20:13.038
50, if what is coming in, the W's, are 50

00:20:10.079 --> 00:20:15.678
long, then if you want the key, what comes out of

00:20:13.038 --> 00:20:20.599
it, to be 50 as well,

00:20:15.679 --> 00:20:20.600
this matrix needs to be 50 * 50, so 2500 parameters.

00:20:22.960 --> 00:20:27.519
>> Uh, Luna?

00:20:24.798 --> 00:20:30.400
>> What are the different things the

00:20:27.519 --> 00:20:30.798
three matrices are trying to

00:20:30.400 --> 00:20:32.000
Sorry,

00:20:30.798 --> 00:20:33.679
>> what are the different things that the

00:20:32.000 --> 00:20:35.599
matrices are trying to learn?

00:20:33.679 --> 00:20:37.120
>> We don't know. All we are saying is that

00:20:35.599 --> 00:20:38.959
we have a self attention layer which can

00:20:37.119 --> 00:20:40.959
pay attention to every pair of words.

00:20:38.960 --> 00:20:43.120
But we need to give it some ways to

00:20:40.960 --> 00:20:45.759
transform what is coming in into

00:20:43.119 --> 00:20:48.000
potentially useful things. Right? As to

00:20:45.759 --> 00:20:49.679
their actual usefulness, we'll have to

00:20:48.000 --> 00:20:51.200
figure out if if it actually helps or

00:20:49.679 --> 00:20:52.320
not. And of course, as you know, the

00:20:51.200 --> 00:20:54.240
punch line is that yeah, it helps

00:20:52.319 --> 00:20:55.279
massively. That's why we do it. In

00:20:54.240 --> 00:20:57.519
general, what you will find in the deep

00:20:55.279 --> 00:20:58.960
learning literature is that whenever you

00:20:57.519 --> 00:21:01.119
want to increase the capacity, the

00:20:58.960 --> 00:21:03.600
modeling capacity of a particular model,

00:21:01.119 --> 00:21:05.839
you just take a small piece and inject a

00:21:03.599 --> 00:21:07.599
little matrix multiplication into it.

00:21:05.839 --> 00:21:08.959
You take a vector that's showing up in

00:21:07.599 --> 00:21:10.879
the middle and then you make it run

00:21:08.960 --> 00:21:13.038
through a matrix to get another vector

00:21:10.880 --> 00:21:14.559
and then further after you run it

00:21:13.038 --> 00:21:17.119
through a matrix, you run it through a

00:21:14.558 --> 00:21:19.519
little ReLU as well. Even better. So

00:21:17.119 --> 00:21:22.158
that's how you inject modeling capacity

00:21:19.519 --> 00:21:23.359
into the middle of these networks. Okay?

00:21:22.159 --> 00:21:26.640
And that's what these people are doing

00:21:23.359 --> 00:21:29.519
here. Yeah.
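That capacity-injection recipe, take a vector in the middle, run it through a matrix, then through a little ReLU, is tiny in code (a toy sketch; every name and size here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
v = rng.normal(size=(d,))       # a vector showing up in the middle of the network

W = rng.normal(size=(d, d))     # the injected learnable matrix
h = np.maximum(W @ v, 0.0)      # matrix multiply, then a little ReLU
```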

00:21:26.640 --> 00:21:31.360
>> In the last step, you had the matrix V.

00:21:29.519 --> 00:21:33.038
So on the previous example, you had used

00:21:31.359 --> 00:21:35.359
the original matrix X. So could you just

00:21:33.038 --> 00:21:36.079
say for why is it not using X? What does

00:21:35.359 --> 00:21:38.479
that mean?

00:21:36.079 --> 00:21:40.319
>> So what we're saying is that the in the

00:21:38.480 --> 00:21:42.480
initial version we had three copies and

00:21:40.319 --> 00:21:44.000
we treated them all identical. Now we

00:21:42.480 --> 00:21:45.599
said well there are are there ways to

00:21:44.000 --> 00:21:47.519
transform each copy into some other

00:21:45.599 --> 00:21:48.959
representation which could be useful. So

00:21:47.519 --> 00:21:51.519
we may as well use three different

00:21:48.960 --> 00:21:52.960
matrices for it. Why stop with two?

00:21:51.519 --> 00:21:54.558
There are three opportunities to make

00:21:52.960 --> 00:21:56.558
them more expressive. We'll use all of

00:21:54.558 --> 00:21:59.558
them.

00:21:56.558 --> 00:21:59.558
>> Yeah.

00:21:59.759 --> 00:22:03.919
>> You mentioned that these are kind of

00:22:02.240 --> 00:22:05.359
you're tuning it. You're kind of

00:22:03.919 --> 00:22:06.960
fine-tuning it. Is there any risk?

00:22:05.359 --> 00:22:09.199
>> We're not fine-tuning it. Uh just to be

00:22:06.960 --> 00:22:10.880
clear on the on the vocabulary here. So

00:22:09.200 --> 00:22:12.880
we have added more weights to make them

00:22:10.880 --> 00:22:16.320
tunable. What that means is that we when

00:22:12.880 --> 00:22:17.760
we finally train this entire model,

00:22:16.319 --> 00:22:20.240
remember all the weights are going to be

00:22:17.759 --> 00:22:21.839
updated using back propagation, right?

00:22:20.240 --> 00:22:23.839
In particular, these matrices will also

00:22:21.839 --> 00:22:26.319
get updated using back propagation.

00:22:23.839 --> 00:22:27.678
>> So there's no risk of is there a risk of

00:22:26.319 --> 00:22:29.759
>> there's always the risk of overfitting

00:22:27.679 --> 00:22:31.038
when you add more parameters to a model

00:22:29.759 --> 00:22:34.079
>> which means that you have to look at the

00:22:31.038 --> 00:22:36.400
validation set and all that good stuff.

00:22:34.079 --> 00:22:39.038
We are basically adding more parameters

00:22:36.400 --> 00:22:40.720
in a very interesting way because we

00:22:39.038 --> 00:22:41.919
want to add more capacity to the self

00:22:40.720 --> 00:22:43.440
attention layer. We want to give it a

00:22:41.919 --> 00:22:45.600
more of an ability to learn things from

00:22:43.440 --> 00:22:48.080
the data. Before it could not learn

00:22:45.599 --> 00:22:51.119
anything. It could only do dot products.

00:22:48.079 --> 00:22:52.399
So we we want to solve that problem.

00:22:51.119 --> 00:22:56.759
All right, I'm going to continue and

00:22:52.400 --> 00:22:56.759
we'll come back to this. Okay. Um

00:22:57.359 --> 00:23:01.678
so uh all right, let's just just for

00:22:59.359 --> 00:23:03.119
fun, I'm going to do this. Um the the

00:23:01.679 --> 00:23:05.519
original paper is called attention is

00:23:03.119 --> 00:23:07.599
all you need. This is a transformer

00:23:05.519 --> 00:23:11.519
paper.

00:23:07.599 --> 00:23:14.399
You folks should read it at some point.

00:23:11.519 --> 00:23:17.400
Just want to show you something.

00:23:14.400 --> 00:23:17.400
Uh

00:23:20.000 --> 00:23:26.000
You see that? So that is the famous

00:23:22.319 --> 00:23:29.038
transformer formula. Okay. And the only

00:23:26.000 --> 00:23:31.440
thing we ignored is this root of DK

00:23:29.038 --> 00:23:33.119
business in the denominator. I

00:23:31.440 --> 00:23:35.759
wouldn't worry about it. The reason they

00:23:33.119 --> 00:23:37.199
have it is because these soft maxes when

00:23:35.759 --> 00:23:39.679
you have lots of numbers and some

00:23:37.200 --> 00:23:41.120
numbers really really big what's going

00:23:39.679 --> 00:23:43.679
to happen is that all the other numbers

00:23:41.119 --> 00:23:45.519
are going to get squashed to zero. Okay.

00:23:43.679 --> 00:23:47.600
And so to make sure the gradient flows

00:23:45.519 --> 00:23:49.279
properly, they just divide it by a

00:23:47.599 --> 00:23:51.918
particular number to make sure no number

00:23:49.279 --> 00:23:53.599
is too big. Okay, that's a small

00:23:51.919 --> 00:23:54.880
but important bit of a

00:23:53.599 --> 00:23:57.359
technical detail which is why I ignored

00:23:54.880 --> 00:23:59.760
it in my iPad. But the rest of it you

00:23:57.359 --> 00:24:03.918
can see this is exactly the formula we

00:23:59.759 --> 00:24:05.759
derived: softmax of QKᵀ times V.

00:24:03.919 --> 00:24:08.159
Okay, so this is the famous transformer

00:24:05.759 --> 00:24:10.240
formula

00:24:08.159 --> 00:24:11.840
and congratulations now you understand

00:24:10.240 --> 00:24:14.720
it.
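A toy sketch of that formula, softmax(QKᵀ/√d_k)·V, with the scaling term included this time (the function name and sizes are my own, for illustration only):

```python
import numpy as np

def scaled_attention(Q, K, V):
    # Dividing by sqrt(d_k) keeps any one score from getting so big that
    # the softmax squashes all the other weights toward zero.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(2)
Q, K, V = (rng.normal(size=(6, 8)) for _ in range(3))
out = scaled_attention(Q, K, V)
```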

00:24:11.839 --> 00:24:17.199
You seem less than fully convinced.

00:24:14.720 --> 00:24:19.120
Okay.

00:24:17.200 --> 00:24:21.600
Yes. Hi iPad.

00:24:19.119 --> 00:24:24.079
Now I have a bunch of slides which I had

00:24:21.599 --> 00:24:25.678
but actually I'll come back to this. I

00:24:24.079 --> 00:24:27.359
had a bunch of other slides. This is

00:24:25.679 --> 00:24:28.880
from last year uh which actually

00:24:27.359 --> 00:24:30.000
explains what I did in the iPad in a

00:24:28.880 --> 00:24:32.240
very different way without using any

00:24:30.000 --> 00:24:34.240
matrices and so on. I was looking at it

00:24:32.240 --> 00:24:36.480
last evening and I was getting very

00:24:34.240 --> 00:24:38.000
annoyed by these slides for some reason

00:24:36.480 --> 00:24:40.480
because I felt that it wasn't really

00:24:38.000 --> 00:24:43.679
conveying the core idea, the, uh,

00:24:38.000 --> 00:24:43.679
ability of using matrix

00:24:40.480 --> 00:24:45.919
algebra to actually do this so

00:24:45.919 --> 00:24:49.278
efficiently and compactly which is why I

00:24:47.839 --> 00:24:51.599
decided to, like, hand-draw this thing on

00:24:49.278 --> 00:24:53.119
the iPad. Okay, but you should read it

00:24:51.599 --> 00:24:55.439
afterwards to make sure that whatever

00:24:53.119 --> 00:24:56.798
you saw on the iPad actually matches

00:24:55.440 --> 00:24:58.880
this. Okay, because two different ways

00:24:56.798 --> 00:25:02.480
of understanding something always helps.

00:24:58.880 --> 00:25:05.360
Um okay so this is what we have here now,

00:25:02.480 --> 00:25:07.360
just to recall.

00:25:05.359 --> 00:25:08.798
By making self-attention tunable we

00:25:07.359 --> 00:25:10.319
get a very interesting benefit which is

00:25:08.798 --> 00:25:13.278
that when you have these different

00:25:10.319 --> 00:25:14.798
attention heads before

00:25:13.278 --> 00:25:16.798
you could have two attention heads but

00:25:14.798 --> 00:25:19.278
because there were no parameters inside

00:25:16.798 --> 00:25:21.440
their outputs would have been identical

00:25:19.278 --> 00:25:23.119
because the inputs are the same for both

00:25:21.440 --> 00:25:25.440
therefore the outputs would be identical

00:25:23.119 --> 00:25:28.319
but now, since each attention head

00:25:25.440 --> 00:25:29.120
will have its own AQ

00:25:28.319 --> 00:25:32.000
matrix,

00:25:29.119 --> 00:25:34.239
the outputs are going to be different.

00:25:32.000 --> 00:25:36.319
That's why it makes sense to do the

00:25:34.240 --> 00:25:37.759
tunability thing because that's what

00:25:36.319 --> 00:25:42.439
actually makes multiple attention heads

00:25:37.759 --> 00:25:42.440
actually useful. Um

00:25:43.038 --> 00:25:47.839
is is there actually any relationship

00:25:44.880 --> 00:25:49.520
between AK AQ and AV or is the A just

00:25:47.839 --> 00:25:51.439
for like a notation standpoint?

00:25:49.519 --> 00:25:54.720
>> Just notation. The thing is we want to

00:25:51.440 --> 00:25:56.320
use Q, K and V for the resulting matrices and so I

00:25:54.720 --> 00:25:58.480
had to find something else to use for

00:25:56.319 --> 00:25:59.839
the first ones and I was like, okay, AQ,

00:25:58.480 --> 00:26:03.360
and we at MIT we do subscripts and

00:25:59.839 --> 00:26:05.038
superscripts, right? So yeah

00:26:03.359 --> 00:26:07.678
>> What is the size of the

00:26:05.038 --> 00:26:08.319
matrices? Are they like square matrices

00:26:07.679 --> 00:26:10.400
or

00:26:08.319 --> 00:26:12.158
>> yeah so typically what happens is that

00:26:10.400 --> 00:26:14.240
um there's a whole bunch you can think

00:26:12.159 --> 00:26:15.919
of it as a hyperparameter in some ways

00:26:14.240 --> 00:26:17.200
um typically what people do in most

00:26:15.919 --> 00:26:19.038
implementations is that they will

00:26:17.200 --> 00:26:20.798
actually just preserve the size so if

00:26:19.038 --> 00:26:22.400
the incoming embedding is, say, 10 long, they'll

00:26:20.798 --> 00:26:24.720
make sure the thing coming out of

00:26:22.400 --> 00:26:27.519
it is also 10. So you just do a 10x10

00:26:24.720 --> 00:26:31.038
matrix to transform it. Uh but the

00:26:27.519 --> 00:26:32.639
value AV matrix on the other hand

00:26:31.038 --> 00:26:35.599
there's a bit more technical stuff going

00:26:32.640 --> 00:26:37.200
on where it often tends to be smaller.

00:26:35.599 --> 00:26:39.839
Um so for example let's say that your

00:26:37.200 --> 00:26:42.240
incoming is 100 you do 100 to 100 for

00:26:39.839 --> 00:26:44.480
the key 100 to 100 for the query. But if

00:26:42.240 --> 00:26:47.440
you have say five attention heads, you

00:26:44.480 --> 00:26:48.798
may do 100 to 20 for the V's because

00:26:47.440 --> 00:26:51.600
ultimately all the V's are going to get

00:26:48.798 --> 00:26:53.918
concatenated into another 100 again. So

00:26:51.599 --> 00:26:55.278
I can tell you more offline but

00:26:53.919 --> 00:26:56.240
broadly speaking these things tend to

00:26:55.278 --> 00:26:58.720
get transformed. They don't

00:26:56.240 --> 00:27:00.240
preserve the dimension, 10 in and 10 out.

00:26:58.720 --> 00:27:04.798
Yeah.
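The dimension bookkeeping just described, keys and queries staying 100 wide while each of five heads emits a 20-wide value so the concatenation comes back to 100, might look like this rough sketch (all names and numbers are illustrative):

```python
import numpy as np

def head_output(X, d_v, rng):
    d = X.shape[-1]
    A_Q = rng.normal(size=(d, d))    # 100 -> 100 for the query
    A_K = rng.normal(size=(d, d))    # 100 -> 100 for the key
    A_V = rng.normal(size=(d, d_v))  # 100 -> 20 for the value
    Q, K, V = X @ A_Q, X @ A_K, X @ A_V
    s = Q @ K.T
    s = s - s.max(axis=-1, keepdims=True)
    w = np.exp(s)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V                     # each head emits (n, 20)

rng = np.random.default_rng(3)
n, d, heads = 6, 100, 5
X = rng.normal(size=(n, d))
out = np.concatenate([head_output(X, d // heads, rng) for _ in range(heads)],
                     axis=-1)        # five 20-wide heads -> 100 again
```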

00:27:00.240 --> 00:27:06.640
>> So this, uh, AQ, these numbers are

00:27:04.798 --> 00:27:07.599
random when you start with it and then

00:27:06.640 --> 00:27:11.159
you allow it to backprop?

00:27:07.599 --> 00:27:11.158
>> Exactly. Exactly.

00:27:11.440 --> 00:27:15.640
So all right um

00:27:17.359 --> 00:27:20.798
yeah so the values in these matrices are

00:27:19.359 --> 00:27:23.918
weights learned through optimization

00:27:20.798 --> 00:27:25.839
using SGD. Uh and then what that means

00:27:23.919 --> 00:27:27.679
is that

00:27:25.839 --> 00:27:29.599
each of these attention now has its own

00:27:27.679 --> 00:27:31.759
copy of these matrices. It has its own

00:27:29.599 --> 00:27:33.359
matrices and over the course of back

00:27:31.759 --> 00:27:36.319
propagation these matrices will look

00:27:33.359 --> 00:27:38.558
very different. Okay. So important each

00:27:36.319 --> 00:27:40.639
attention head will have its own set

00:27:38.558 --> 00:27:42.079
of three matrices. So if you have 10

00:27:40.640 --> 00:27:45.080
attention heads 30 matrices will be

00:27:42.079 --> 00:27:45.079
learned.

00:27:46.400 --> 00:27:50.798
So by the math it seems like it's

00:27:48.558 --> 00:27:52.399
creating essentially a relationship

00:27:50.798 --> 00:27:54.639
between all of the content being

00:27:52.400 --> 00:27:56.240
ingested, and if

00:27:54.640 --> 00:27:58.080
you're ingesting all the content for

00:27:56.240 --> 00:28:00.399
each attention head are there different

00:27:58.079 --> 00:28:01.839
categories of attention head type that

00:28:00.398 --> 00:28:03.199
you're trying to go after?

00:28:01.839 --> 00:28:04.798
>> Yeah. So basically what we're trying to

00:28:03.200 --> 00:28:07.120
do is to say a particular attention

00:28:04.798 --> 00:28:09.038
head. So in any particular sentence it

00:28:07.119 --> 00:28:10.558
may turn out to be the case that one

00:28:09.038 --> 00:28:12.240
pattern could be about the meanings of

00:28:10.558 --> 00:28:14.480
these words right like the word bank and

00:28:12.240 --> 00:28:15.679
what it means the word station train

00:28:14.480 --> 00:28:17.519
things like that. That's what really

00:28:15.679 --> 00:28:19.360
we've been talking about. But there is a

00:28:17.519 --> 00:28:21.200
whole other pattern to do with grammar

00:28:19.359 --> 00:28:23.759
and tense and things like that. There

00:28:21.200 --> 00:28:25.440
could be another one in terms of tone.

00:28:23.759 --> 00:28:26.960
All those things are very important. And

00:28:25.440 --> 00:28:28.640
a priori we don't know how many such

00:28:26.960 --> 00:28:30.079
patterns exist. Much like in a

00:28:28.640 --> 00:28:31.600
convolutional network, when

00:28:30.079 --> 00:28:33.199
we're designing how many filters to

00:28:31.599 --> 00:28:34.798
have, we don't know how many kinds of

00:28:33.200 --> 00:28:36.798
little things we have to detect, you

00:28:34.798 --> 00:28:38.158
know, vertical line, horizontal line,

00:28:36.798 --> 00:28:39.679
semicircle, quarter circle, stuff like

00:28:38.159 --> 00:28:41.840
that. So, you just give it a lot of

00:28:39.679 --> 00:28:45.000
capacity so that it can learn whatever

00:28:41.839 --> 00:28:45.000
it wants.

00:28:45.038 --> 00:28:49.359
All right. So, um so that that is the

00:28:47.440 --> 00:28:51.840
transformer encoder. So, we have done

00:28:49.359 --> 00:28:53.278
the first of the three complications

00:28:51.839 --> 00:28:56.720
needed to make it like industrial

00:28:53.278 --> 00:28:58.159
strength and legit. Uh the second thing

00:28:56.720 --> 00:29:02.720
we do is something called the residual

00:28:58.159 --> 00:29:05.120
connection. So what we do is that

00:29:02.720 --> 00:29:08.798
whatever comes out here right W1 through

00:29:05.119 --> 00:29:11.278
W6 goes in and comes out as W1 hat W2

00:29:08.798 --> 00:29:13.759
and so on and so forth right

00:29:11.278 --> 00:29:16.240
actually sorry what comes out here is

00:29:13.759 --> 00:29:18.720
the hats but what comes out here is some

00:29:16.240 --> 00:29:20.079
intermediate W's right that is what the

00:29:18.720 --> 00:29:22.399
self-attention is going to give you: some

00:29:20.079 --> 00:29:24.079
intermediate W's. What we do is, and

00:29:22.398 --> 00:29:26.479
because what's coming out here these

00:29:24.079 --> 00:29:28.720
vectors are the same length as what goes

00:29:26.480 --> 00:29:29.440
in we can just add them element by

00:29:28.720 --> 00:29:32.159
element

00:29:29.440 --> 00:29:35.120
So we take the input and we actually add

00:29:32.159 --> 00:29:37.679
it to what comes out.

00:29:35.119 --> 00:29:39.918
So why would we want to do that? Why

00:29:37.679 --> 00:29:41.600
would we want to you know go to a lot of

00:29:39.919 --> 00:29:43.520
trouble to process this thing and then

00:29:41.599 --> 00:29:45.759
when it comes out we like literally add

00:29:43.519 --> 00:29:49.879
up the original input? What's like what

00:29:45.759 --> 00:29:49.879
do you think the intuition is?

00:29:52.398 --> 00:29:57.918
So turns out, think of it this way. You

00:29:56.240 --> 00:30:00.000
have a bunch of inputs. You send it to a

00:29:57.919 --> 00:30:02.240
neural network. It transforms it and

00:30:00.000 --> 00:30:04.798
gives you something else. Right? At that

00:30:02.240 --> 00:30:06.159
point, you might be thinking, well,

00:30:04.798 --> 00:30:07.519
everything that

00:30:06.159 --> 00:30:10.240
happens in the network from that point

00:30:07.519 --> 00:30:12.558
onward can no longer see your original

00:30:10.240 --> 00:30:14.640
input. It can only work with the

00:30:12.558 --> 00:30:17.599
transformed input. Right? But what if

00:30:14.640 --> 00:30:20.080
your transformations are not great?

00:30:17.599 --> 00:30:22.798
So as an insurance policy what you can

00:30:20.079 --> 00:30:24.798
do is you can take the transformed

00:30:22.798 --> 00:30:27.839
stuff and you can take the original

00:30:24.798 --> 00:30:30.158
stuff and send both in.

00:30:27.839 --> 00:30:31.439
Right? And this whole thing is and you

00:30:30.159 --> 00:30:33.120
can Google it. It's called like a wide

00:30:31.440 --> 00:30:35.200
and deep network and things like that.

00:30:33.119 --> 00:30:37.278
But the whole point is that let's not

00:30:35.200 --> 00:30:39.440
lose the original input anywhere. Let's

00:30:37.278 --> 00:30:40.880
also send it along. But if you keep

00:30:39.440 --> 00:30:42.080
adding the original input to every

00:30:40.880 --> 00:30:43.440
intermediate layer, it's going to get

00:30:42.079 --> 00:30:44.720
longer and longer and longer and bigger,

00:30:43.440 --> 00:30:46.640
which you don't want because you want it

00:30:44.720 --> 00:30:49.360
all to be the same size. So the simplest

00:30:46.640 --> 00:30:50.960
alternative is to just add them up. You

00:30:49.359 --> 00:30:52.879
take the transformed stuff and you add the

00:30:50.960 --> 00:30:54.960
original input. You get the same thing

00:30:52.880 --> 00:30:57.919
again. What came

00:30:54.960 --> 00:31:00.240
in, W1, was a 100-long vector and the

00:30:57.919 --> 00:31:02.240
transformed version is also 100 long. So

00:31:00.240 --> 00:31:04.000
just literally 100 100 add them up.
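That add-them-up step is one line (a sketch with made-up stand-in values):

```python
import numpy as np

rng = np.random.default_rng(4)
w1_in = rng.normal(size=(100,))    # what came in: a 100-long vector
w1_mid = rng.normal(size=(100,))   # what self-attention produced, also 100 long

w1_out = w1_in + w1_mid            # residual connection: add element by element
```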

00:31:02.240 --> 00:31:06.319
That's it. You get another 100 long

00:31:04.000 --> 00:31:08.880
vector. So that is what's called a

00:31:06.319 --> 00:31:12.079
residual connection. Okay. And as it

00:31:08.880 --> 00:31:14.480
turns out, residual connections

00:31:12.079 --> 00:31:16.960
improve the gradient flow during back

00:31:14.480 --> 00:31:18.960
propagation dramatically and that's why

00:31:16.960 --> 00:31:21.440
they are very heavily used. And in fact,

00:31:18.960 --> 00:31:24.319
ResNet, which we looked at for computer

00:31:21.440 --> 00:31:26.399
vision, it stands for residual net

00:31:24.319 --> 00:31:29.200
because it was the first network to

00:31:26.398 --> 00:31:30.719
actually figure this out. This

00:31:29.200 --> 00:31:32.399
is not just a transformer thing by

00:31:30.720 --> 00:31:35.278
the way. It's widely used in you know

00:31:32.398 --> 00:31:36.719
lots of new architectures. The notion of

00:31:35.278 --> 00:31:39.759
a residual connection that's what it

00:31:36.720 --> 00:31:42.720
means. Okay, so we do a residual

00:31:39.759 --> 00:31:44.000
connection and then we come to the final

00:31:42.720 --> 00:31:45.600
tweak which is called layer

00:31:44.000 --> 00:31:47.440
normalization.

00:31:45.599 --> 00:31:48.879
So once we add the residual connection,

00:31:47.440 --> 00:31:51.120
we are going to do something else here

00:31:48.880 --> 00:31:54.080
to these vectors before they continue

00:31:51.119 --> 00:31:57.759
flowing. And what layer normalization does

00:31:54.079 --> 00:31:59.599
is it basically says that

00:31:57.759 --> 00:32:00.798
you will recall from the very

00:31:59.599 --> 00:32:02.639
beginning of the semester I've been

00:32:00.798 --> 00:32:04.480
saying that whatever comes into a neural

00:32:02.640 --> 00:32:05.840
network the inputs let's just really

00:32:04.480 --> 00:32:07.919
make sure that they are all in some sort

00:32:05.839 --> 00:32:10.558
of a narrow, well-defined range; they

00:32:07.919 --> 00:32:12.960
can't be in a big range right so for

00:32:10.558 --> 00:32:15.038
pictures for images we divided every

00:32:12.960 --> 00:32:18.480
number by 255 so that every little pixel

00:32:15.038 --> 00:32:20.158
value is between zero and one okay for

00:32:18.480 --> 00:32:22.720
continuous things like the heart disease

00:32:20.159 --> 00:32:24.399
example we standardized by calculating

00:32:22.720 --> 00:32:26.000
the mean and the standard deviation and

00:32:24.398 --> 00:32:27.278
doing subtracting the mean and dividing

00:32:26.000 --> 00:32:28.880
by the standard deviation. So when you

00:32:27.278 --> 00:32:32.480
do that all the numbers are going to

00:32:28.880 --> 00:32:35.039
roughly be in the minus1 to +1 range. So

00:32:32.480 --> 00:32:36.960
in neural networks, for backprop to

00:32:35.038 --> 00:32:39.839
work really well you have to make sure

00:32:36.960 --> 00:32:41.200
that no numbers get too big that all the

00:32:39.839 --> 00:32:43.199
numbers are always in some sort of a

00:32:41.200 --> 00:32:45.519
narrow range. So what layer

00:32:43.200 --> 00:32:48.240
normalization does is to say you know

00:32:45.519 --> 00:32:49.519
what whatever is coming out here I want

00:32:48.240 --> 00:32:51.519
to make sure none of these numbers are

00:32:49.519 --> 00:32:53.679
too big. I want to make sure they're all

00:32:51.519 --> 00:32:55.599
well behaved in a small range because if

00:32:53.679 --> 00:32:59.600
I don't do that back prop is not going

00:32:55.599 --> 00:33:01.759
to work very well and so

00:32:59.599 --> 00:33:04.480
is this what we do to ensure we don't

00:33:01.759 --> 00:33:06.558
problem of vanishing right

00:33:04.480 --> 00:33:07.839
>> so um so the there technically there are

00:33:06.558 --> 00:33:09.038
there could be two problems there's an

00:33:07.839 --> 00:33:10.959
exploding gradient and vanishing

00:33:09.038 --> 00:33:12.480
gradient both are bad this is a way to

00:33:10.960 --> 00:33:15.200
address it so you will find a whole

00:33:12.480 --> 00:33:17.038
bunch of normalization techniques:

00:33:15.200 --> 00:33:19.120
layer normalization batch normalization

00:33:17.038 --> 00:33:21.200
and so on and so forth all these are

00:33:19.119 --> 00:33:22.798
methods to make sure that these numbers stay

00:33:21.200 --> 00:33:26.600
in a small range so it doesn't cause

00:33:22.798 --> 00:33:26.599
gradient issues later.

00:33:27.038 --> 00:33:32.879
All right. So in particular

00:33:30.159 --> 00:33:35.200
what we do is or what happens inside

00:33:32.880 --> 00:33:36.480
this layer normalization is we

00:33:35.200 --> 00:33:37.440
just calculate the mean and standard

00:33:36.480 --> 00:33:39.759
deviation of every one of these

00:33:37.440 --> 00:33:41.440
embeddings. Okay? Right? If you have

00:33:39.759 --> 00:33:42.480
let's say six embeddings here, we'll

00:33:41.440 --> 00:33:43.840
have six means and six standard

00:33:42.480 --> 00:33:46.000
deviations, right? For each one across

00:33:43.839 --> 00:33:48.319
the rows and then we standardize it.

00:33:46.000 --> 00:33:49.599
Meaning subtract the mean divide by the

00:33:48.319 --> 00:33:51.599
standard deviation. And when you do

00:33:49.599 --> 00:33:54.079
that, all these things are going to be

00:33:51.599 --> 00:33:55.678
nice and small. And then we do this a

00:33:54.079 --> 00:33:58.879
little other thing where we have

00:33:55.679 --> 00:34:01.120
introduced two new parameters to rescale

00:33:58.880 --> 00:34:03.840
it and move it around a little bit just

00:34:01.119 --> 00:34:06.079
because adding more weights always helps

00:34:03.839 --> 00:34:07.918
make these things better. So we add them

00:34:06.079 --> 00:34:09.358
and this gets slightly complicated

00:34:07.919 --> 00:34:10.480
because of the way the dimensions work.

00:34:09.358 --> 00:34:13.598
So I'm not going to spend much time on

00:34:10.480 --> 00:34:15.358
it. Uh and then what comes out the other

00:34:13.599 --> 00:34:16.960
end is a very well- behaved set of

00:34:15.358 --> 00:34:18.960
numbers in a nice and small and narrow

00:34:16.960 --> 00:34:20.480
range.
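A minimal sketch of that row-wise standardization plus the two extra rescaling parameters (gamma and beta are the conventional names, assumed here rather than taken from the slide):

```python
import numpy as np

def layer_norm(X, gamma=1.0, beta=0.0, eps=1e-5):
    # One mean and one standard deviation per embedding (per row), then
    # standardize; gamma and beta are the two extra learnable parameters
    # that rescale and shift the result.
    mu = X.mean(axis=-1, keepdims=True)
    sd = X.std(axis=-1, keepdims=True)
    return gamma * (X - mu) / (sd + eps) + beta

rng = np.random.default_rng(5)
X = rng.normal(loc=50.0, scale=10.0, size=(6, 100))  # six big-valued embeddings
Y = layer_norm(X)   # each row now has mean ~0 and std ~1
```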

00:34:18.960 --> 00:34:23.280
Okay, so this is called layer

00:34:20.480 --> 00:34:25.760
normalization. Um, you can see this link

00:34:23.280 --> 00:34:28.480
to understand it a bit better. Um, and

00:34:25.760 --> 00:34:30.639
we do that as well. So to put it all

00:34:28.480 --> 00:34:32.639
together,

00:34:30.639 --> 00:34:34.159
so this is a transformer encoder where

00:34:32.639 --> 00:34:36.559
we have this multi head attention layer

00:34:34.159 --> 00:34:39.039
where each attention head inside

00:34:36.559 --> 00:34:39.039
of it is tunable with those A matrices

00:34:39.039 --> 00:34:43.039
and then we have a residual connection.

00:34:41.039 --> 00:34:45.119
We do that and then we do layer norm and

00:34:43.039 --> 00:34:46.800
then we do the same thing in the next

00:34:45.119 --> 00:34:50.399
feed forward layer as well. And then

00:34:46.800 --> 00:34:52.159
boom out pops the output

00:34:50.398 --> 00:34:53.838
>> by that definition in the multi head

00:34:52.159 --> 00:34:56.398
attention layer when I'm doing tone and

00:34:53.838 --> 00:34:59.039
everything theoretically I can add even

00:34:56.398 --> 00:35:01.759
the biases or the hate speech aspects

00:34:59.039 --> 00:35:04.159
which come in to take care of it right

00:35:01.760 --> 00:35:06.320
so the model can account for the fact

00:35:04.159 --> 00:35:07.199
that something is biased or something is

00:35:06.320 --> 00:35:09.200
not

00:35:07.199 --> 00:35:11.679
>> um the thing is it's not so much the

00:35:09.199 --> 00:35:13.358
model is accounting for it is capturing

00:35:11.679 --> 00:35:16.719
whatever patterns happen to be inherent

00:35:13.358 --> 00:35:18.319
in the data it's capturing Right now

00:35:16.719 --> 00:35:19.598
what you do with that capture is up to

00:35:18.320 --> 00:35:21.838
you. It depends on the actual problem

00:35:19.599 --> 00:35:23.440
you're trying to solve. In particular,

00:35:21.838 --> 00:35:25.440
it is going to capture all the bad stuff

00:35:23.440 --> 00:35:27.119
too because if your training data has

00:35:25.440 --> 00:35:29.119
a lot of biased stuff in it, toxic

00:35:27.119 --> 00:35:30.480
things in it, dangerous things in it, it

00:35:29.119 --> 00:35:32.240
doesn't it doesn't have a sense of

00:35:30.480 --> 00:35:35.039
values as to what it's good or bad. It's

00:35:32.239 --> 00:35:36.239
just going to pick it up.

00:35:35.039 --> 00:35:38.480
>> Yes.

00:35:36.239 --> 00:35:40.799
>> On that then, how do you actually

00:35:38.480 --> 00:35:43.119
handle those, or how do you mitigate

00:35:40.800 --> 00:35:44.800
the effect of those? That's a whole

00:35:43.119 --> 00:35:47.838
course unto itself, but I'm happy to

00:35:44.800 --> 00:35:50.560
give you pointers offline.

00:35:47.838 --> 00:35:52.799
All right, so this is what we have and

00:35:50.559 --> 00:35:54.960
remember what I said that this is just a

00:35:52.800 --> 00:35:56.480
single transformer block and since what

00:35:54.960 --> 00:35:58.240
comes in and what goes out are the same

00:35:56.480 --> 00:36:00.400
dimensions, we can just stack them one

00:35:58.239 --> 00:36:02.000
after the other, right? It's very

00:36:00.400 --> 00:36:03.280
stackable. You can do it, you can

00:36:02.000 --> 00:36:05.199
multiply, you can stack it

00:36:03.280 --> 00:36:08.000
vertically as much as you want. And as I

00:36:05.199 --> 00:36:09.919
mentioned, I think GPT-3 has 96 of these

00:36:08.000 --> 00:36:14.079
things stacked one on top of the other.

00:36:09.920 --> 00:36:15.760
Um and so yeah that brings us to that is

00:36:14.079 --> 00:36:18.240
it that is the transformer encoder and

00:36:15.760 --> 00:36:20.160
this exactly maps to that. So basically

00:36:18.239 --> 00:36:22.239
the input embeddings come in you add

00:36:20.159 --> 00:36:24.480
positional embeddings and then you send

00:36:22.239 --> 00:36:26.399
it to say these many attention blocks

00:36:24.480 --> 00:36:28.800
and they all get added up and then it

00:36:26.400 --> 00:36:31.119
comes out of the attention block, you add

00:36:28.800 --> 00:36:32.320
the Add & Norm here: add means

00:36:31.119 --> 00:36:33.920
residual connection because you're

00:36:32.320 --> 00:36:36.079
adding the input which is why you have

00:36:33.920 --> 00:36:37.920
this arrow going from the input being

00:36:36.079 --> 00:36:39.920
added there and then you normalize it

00:36:37.920 --> 00:36:42.960
send it along and do it again and out

00:36:39.920 --> 00:36:46.480
comes the output.
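The block just described (attention, then Add & Norm, then feed-forward, then Add & Norm) can be sketched in plain NumPy. Everything below is an illustrative assumption, not the lecture's Colab code: a single head, toy dimensions, and the same weights reused in every block to keep the sketch short (real stacks give each block its own weights).

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each embedding to zero mean / unit variance
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def self_attention(x, Wq, Wk, Wv):
    # single-head scaled dot-product attention
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def encoder_block(x, params):
    # Add & Norm: "add" is the residual connection, then layer norm
    a = layer_norm(x + self_attention(x, *params["attn"]))
    ff = np.maximum(0, a @ params["W1"]) @ params["W2"]  # feed-forward + ReLU
    return layer_norm(a + ff)                            # second Add & Norm

# Input and output shapes match, so blocks stack freely.
d = 8
rng = np.random.default_rng(0)
params = {"attn": [rng.normal(size=(d, d)) for _ in range(3)],
          "W1": rng.normal(size=(d, 4 * d)), "W2": rng.normal(size=(4 * d, d))}
x = rng.normal(size=(5, d))       # 5 tokens, embedding size 8
for _ in range(96):               # stack 96 blocks, like the GPT-3 depth mentioned
    x = encoder_block(x, params)
print(x.shape)                    # still (5, 8)
```

Because what comes in and what goes out have the same dimensions, the `for` loop can run for any depth without changing anything else.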

00:36:42.960 --> 00:36:48.400
So all right now just to be very clear

00:36:46.480 --> 00:36:52.480
on what is being optimized during back

00:36:48.400 --> 00:36:54.559
propagation in this complex flow right

00:36:52.480 --> 00:36:56.320
now clearly the embeddings that you

00:36:54.559 --> 00:36:57.838
started out with both the standalone

00:36:56.320 --> 00:37:00.000
embeddings as well as the positional uh

00:36:57.838 --> 00:37:01.838
the position embeddings those things are

00:37:00.000 --> 00:37:02.880
going to get optimized right those are

00:37:01.838 --> 00:37:05.279
just weights they're going to get

00:37:02.880 --> 00:37:06.800
optimized clearly everything inside the

00:37:05.280 --> 00:37:08.640
transformer encoder block is going to

00:37:06.800 --> 00:37:12.000
get optimized, right? And what are

00:37:08.639 --> 00:37:15.598
they? Well, they are the AQ, AK, AV matrices

00:37:12.000 --> 00:37:18.079
for each attention head. Layer norm has

00:37:15.599 --> 00:37:20.160
parameters as well. Then the

00:37:18.079 --> 00:37:22.160
little feed forward layer has weights as

00:37:20.159 --> 00:37:24.960
well. All these things are going to get

00:37:22.159 --> 00:37:26.799
optimized and then it goes through this

00:37:24.960 --> 00:37:28.320
relu which again has a bunch of weights.

00:37:26.800 --> 00:37:29.920
It's going to get optimized and then the

00:37:28.320 --> 00:37:32.240
final softmax has a bunch of weights.

00:37:29.920 --> 00:37:33.280
That's going to get optimized.

00:37:32.239 --> 00:37:36.239
All these things are going to get

00:37:33.280 --> 00:37:38.560
optimized by back prop.

00:37:36.239 --> 00:37:40.000
Okay. So in that sense you just step

00:37:38.559 --> 00:37:41.679
back for a second and look at the whole

00:37:40.000 --> 00:37:43.760
thing. It is just a mathematical model

00:37:41.679 --> 00:37:45.118
with a lot of parameters

00:37:43.760 --> 00:37:46.480
and we're just going to use gradient

00:37:45.119 --> 00:37:49.440
descent or stochastic gradient descent to

00:37:46.480 --> 00:37:51.039
optimize it. That's it.

00:37:49.440 --> 00:37:53.358
Yeah.

00:37:51.039 --> 00:37:55.519
>> For those eight matrices we train the

00:37:53.358 --> 00:37:58.559
model, are we calculating weights for

00:37:55.519 --> 00:38:00.480
like each cell of every possible matrix

00:37:58.559 --> 00:38:02.559
based on the number of inputs like every

00:38:00.480 --> 00:38:04.559
possible dimension up to the max number

00:38:02.559 --> 00:38:07.199
of inputs?

00:38:04.559 --> 00:38:09.358
Um actually the weights themselves

00:38:07.199 --> 00:38:11.439
um don't depend on how long your input

00:38:09.358 --> 00:38:13.519
sentence is because remember what we're

00:38:11.440 --> 00:38:14.880
doing is for each sentence that comes in

00:38:13.519 --> 00:38:16.800
let's say the sentence has say three

00:38:14.880 --> 00:38:19.119
words there are three embeddings for

00:38:16.800 --> 00:38:23.440
that sentence each of those embeddings

00:38:19.119 --> 00:38:25.440
gets multiplied by say AK right so AK

00:38:23.440 --> 00:38:27.679
only needs to work needs to know how

00:38:25.440 --> 00:38:31.599
long is each embedding it doesn't need

00:38:27.679 --> 00:38:33.039
to know how many words do I have

00:38:31.599 --> 00:38:35.599
and that's a I'm glad you raised that

00:38:33.039 --> 00:38:37.759
question Ben because that's what makes a

00:38:35.599 --> 00:38:40.000
transformer's number of weights

00:38:37.760 --> 00:38:42.160
independent of the number of words in

00:38:40.000 --> 00:38:43.920
your sentence.

00:38:42.159 --> 00:38:45.279
It only depends on the vocabulary that

00:38:43.920 --> 00:38:46.960
you're going to work with because the

00:38:45.280 --> 00:38:48.960
vocabulary determines how many

00:38:46.960 --> 00:38:51.679
embeddings you need.

00:38:48.960 --> 00:38:53.519
The length only matters in

00:38:51.679 --> 00:38:55.039
terms of the positional embedding

00:38:53.519 --> 00:38:56.639
because if you have a thousand long

00:38:55.039 --> 00:38:59.199
sentence, you need a thousand long

00:38:56.639 --> 00:39:02.239
positional embedding matrix. But beyond

00:38:59.199 --> 00:39:04.480
that, it doesn't care.
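That independence can be sketched directly: the attention matrices only ever see the embedding length `d`, never the sentence length, so the same weights handle any number of tokens. Shapes and names below are assumptions for illustration.

```python
import numpy as np

d = 16
rng = np.random.default_rng(1)
AQ, AK, AV = (rng.normal(size=(d, d)) for _ in range(3))
n_params = 3 * d * d              # parameter count is fixed by d alone

for n_tokens in (3, 1000):        # a 3-word sentence and a 1000-word one
    X = rng.normal(size=(n_tokens, d))
    # each token's embedding is multiplied by the SAME matrices
    Q, K, V = X @ AQ, X @ AK, X @ AV
    print(n_tokens, Q.shape, n_params)   # weights unchanged either way
```

The compute cost grows with the number of tokens, but `n_params` never does, which is the point made above about long context windows.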

00:39:02.239 --> 00:39:07.679
And that's why for example Google uh

00:39:04.480 --> 00:39:09.280
Gemini 1.5 Pro can

00:39:07.679 --> 00:39:12.078
accommodate basically a million-token-

00:39:09.280 --> 00:39:15.200
long context window, right? It

00:39:12.079 --> 00:39:18.960
can. It's still very compute heavy, but it

00:39:15.199 --> 00:39:20.719
does not change the number of parameters

00:39:18.960 --> 00:39:24.159
uh yeah

00:39:20.719 --> 00:39:26.319
>> Conceptually, which weights are optimized

00:39:24.159 --> 00:39:28.639
first? Is it in sequential order, or are

00:39:26.320 --> 00:39:29.680
they optimizing the weights at the very

00:39:28.639 --> 00:39:31.920
same time?

00:39:29.679 --> 00:39:34.078
>> simultaneously because if you think of

00:39:31.920 --> 00:39:35.680
back propagation ultimately you have a

00:39:34.079 --> 00:39:38.000
loss function right and you calculate

00:39:35.679 --> 00:39:40.159
the gradient of that loss function so if

00:39:38.000 --> 00:39:42.000
you have a say a billion parameters that

00:39:40.159 --> 00:39:44.159
gradient is basically a billion long

00:39:42.000 --> 00:39:47.039
vector right and we're going to take the

00:39:44.159 --> 00:39:49.519
gradient and we're going to do w new

00:39:47.039 --> 00:39:51.679
equals w old minus alpha times the

00:39:49.519 --> 00:39:53.519
gradient so all the w's are going to

00:39:51.679 --> 00:39:55.118
update instantaneously

00:39:53.519 --> 00:39:56.880
now the way it actually works in

00:39:55.119 --> 00:39:58.559
computation is you're going to do it the

00:39:56.880 --> 00:39:59.599
because of the back and back propagation

00:39:58.559 --> 00:40:01.759
it's going to start at the end and

00:39:59.599 --> 00:40:03.920
slowly flow backwards but when it's done

00:40:01.760 --> 00:40:06.720
everything will be updated.
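The update just described can be written out in a few lines. This is a toy loss and a small weight vector, assumed purely for illustration: conceptually every weight moves at once via `w_new = w_old - alpha * gradient`, even though backprop computes the gradient from the end backwards.

```python
import numpy as np

rng = np.random.default_rng(2)
w = rng.normal(size=1_000)        # imagine a billion-long vector; 1000 here
target = np.ones_like(w)

def loss(w):
    # toy quadratic loss, just to have something to minimize
    return 0.5 * np.sum((w - target) ** 2)

grad = w - target                 # gradient of the loss w.r.t. every weight
alpha = 0.1
w_new = w - alpha * grad          # all the w's update in one step
assert loss(w_new) < loss(w)      # one gradient step reduced the loss
```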

00:40:03.920 --> 00:40:10.159
Yeah.

00:40:06.719 --> 00:40:12.559
>> We take uh two attention heads and we

00:40:10.159 --> 00:40:16.319
have the matrices of AQ, AK and AV in

00:40:12.559 --> 00:40:18.078
them. Uh why would the parameters of all

00:40:16.320 --> 00:40:19.519
three of them all the weights of the

00:40:18.079 --> 00:40:21.280
three matrices on this side and this

00:40:19.519 --> 00:40:22.960
side would be different because finally

00:40:21.280 --> 00:40:25.359
the things you're inputting from this

00:40:22.960 --> 00:40:26.800
side and the output is same. So the

00:40:25.358 --> 00:40:29.199
learning process should be ideally the

00:40:26.800 --> 00:40:31.200
same, unlike a CNN where we had put

00:40:29.199 --> 00:40:32.159
filters which were different. So what

00:40:31.199 --> 00:40:35.279
different thing we have to

00:40:32.159 --> 00:40:35.920
>> because the initialization is different.

00:40:35.280 --> 00:40:37.119
>> What do we mean?

00:40:35.920 --> 00:40:38.480
>> Like what I mean is if you have two

00:40:37.119 --> 00:40:40.960
heads right each head has three

00:40:38.480 --> 00:40:42.559
matrices. The starting values of those

00:40:40.960 --> 00:40:45.599
six matrices are different.

00:40:42.559 --> 00:40:46.559
>> Starting values of AK, AQ and AV are

00:40:45.599 --> 00:40:48.800
different for both the heads

00:40:46.559 --> 00:40:50.000
>> right? Much like for all the weights

00:40:48.800 --> 00:40:53.119
typically the values are randomly

00:40:50.000 --> 00:40:54.880
chosen. If they were all the same thing

00:40:53.119 --> 00:40:56.000
you're right, it won't make a

00:40:54.880 --> 00:40:59.920
difference right? They will all change

00:40:56.000 --> 00:41:02.639
the same way. Yeah.
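The symmetry-breaking point above is easy to demonstrate: two heads start from different random draws, so their gradients differ and they can specialize. Tiny shapes below are an assumption for illustration.

```python
import numpy as np

d = 4
rng = np.random.default_rng(3)
head1 = [rng.normal(size=(d, d)) for _ in range(3)]  # AQ, AK, AV of head 1
head2 = [rng.normal(size=(d, d)) for _ in range(3)]  # AQ, AK, AV of head 2

# Different random starting values: if they were identical, identical
# gradients would keep the two heads identical forever.
different = not np.allclose(head1[0], head2[0])
print(different)
```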

00:40:59.920 --> 00:41:06.720
>> Is the input of the transformer the

00:41:02.639 --> 00:41:08.239
sentence, or the array of embeddings

00:41:06.719 --> 00:41:10.639
of each word?

00:41:08.239 --> 00:41:13.039
>> Uh, the transformer itself is

00:41:10.639 --> 00:41:14.480
expecting embeddings in and so what

00:41:13.039 --> 00:41:16.639
basically happens is that we get some

00:41:14.480 --> 00:41:18.639
sentence we run it through a tokenizer

00:41:16.639 --> 00:41:20.879
which converts it to a bunch of tokens

00:41:18.639 --> 00:41:22.719
which are just integers and then it goes

00:41:20.880 --> 00:41:24.480
through the embedding layer which maps

00:41:22.719 --> 00:41:26.239
the integers to these embeddings and

00:41:24.480 --> 00:41:28.318
then you feed it to the transformer. But

00:41:26.239 --> 00:41:29.598
when you do back propagation, it comes

00:41:28.318 --> 00:41:31.358
all the way back to the starting

00:41:29.599 --> 00:41:32.240
embedding layer and updates those

00:41:31.358 --> 00:41:34.318
weights.
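That pipeline, sentence to tokenizer to integers to embedding lookup, can be sketched with a toy vocabulary (the words and sizes are assumptions, not the lecture's data). The embedding table is just a weight matrix, which is why backprop can update it like everything else.

```python
import numpy as np

vocab = {"the": 0, "train": 1, "left": 2}   # toy vocabulary
rng = np.random.default_rng(4)
embedding_table = rng.normal(size=(len(vocab), 8))  # trainable weights

def tokenize(sentence):
    # maps each word to its integer token id
    return [vocab[w] for w in sentence.split()]

tokens = tokenize("the train left")   # -> [0, 1, 2]
x = embedding_table[tokens]           # (3, 8): one embedding per token
print(tokens, x.shape)                # these rows then feed the transformer
```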

00:41:32.239 --> 00:41:36.239
>> Okay. So they can be trainable. So the

00:41:34.318 --> 00:41:37.358
weights at the beginning must be input

00:41:36.239 --> 00:41:40.000
here, but they can train.

00:41:37.358 --> 00:41:41.679
>> They're trainable. Exactly. Exactly.

00:41:40.000 --> 00:41:43.920
>> Uh yeah.

00:41:41.679 --> 00:41:45.519
>> Are the attention heads solely parallel

00:41:43.920 --> 00:41:46.639
or can you have like a stack of

00:41:45.519 --> 00:41:49.119
attention heads?

00:41:46.639 --> 00:41:50.879
>> Typically they are parallelized. Um and

00:41:49.119 --> 00:41:54.480
because you can always stack the block

00:41:50.880 --> 00:41:57.200
itself to get more and more power.

00:41:54.480 --> 00:41:59.519
All right. So um so now to apply the

00:41:57.199 --> 00:42:01.919
transformer, right? The common use

00:41:59.519 --> 00:42:03.599
cases are that you have a whole sentence

00:42:01.920 --> 00:42:05.519
that comes in and then you just want to

00:42:03.599 --> 00:42:07.119
classify it right the the canonical

00:42:05.519 --> 00:42:09.599
thing being hey movie sentiment

00:42:07.119 --> 00:42:11.599
classification boom positive or negative

00:42:09.599 --> 00:42:13.359
right classification another common one

00:42:11.599 --> 00:42:15.838
is labeling where every word gets

00:42:13.358 --> 00:42:17.279
labeled as a multiclass label and that's

00:42:15.838 --> 00:42:19.119
basically what we saw with our slot

00:42:17.280 --> 00:42:20.720
filling problem and then there is

00:42:19.119 --> 00:42:22.160
another thing called sequence generation

00:42:20.719 --> 00:42:23.838
where you give it a sequence you wanted

00:42:22.159 --> 00:42:25.598
to continue the sequence right generate

00:42:23.838 --> 00:42:28.159
more stuff i.e. large language models

00:42:25.599 --> 00:42:29.359
and all that good stuff. So, so this we

00:42:28.159 --> 00:42:30.559
already know how to do because we

00:42:29.358 --> 00:42:33.759
actually literally built a Colab with

00:42:33.760 --> 00:42:37.280
the transformer stack. Now the

00:42:33.760 --> 00:42:37.280
question is how can we do that right?

00:42:35.760 --> 00:42:40.160
How can you do basic classification with

00:42:37.280 --> 00:42:42.000
these things? So now if you again when

00:42:40.159 --> 00:42:44.000
you send a sentence in after all that

00:42:42.000 --> 00:42:46.000
stuff is done and when I say encoder

00:42:44.000 --> 00:42:48.079
here I'm assuming that you may have one

00:42:46.000 --> 00:42:49.679
block, you may have 106 blocks, I

00:42:48.079 --> 00:42:50.880
don't care at the end of the day you

00:42:49.679 --> 00:42:53.598
send something in you get a bunch of

00:42:50.880 --> 00:42:57.200
contextual embeddings out

00:42:53.599 --> 00:42:58.720
right so at this point we need to take

00:42:57.199 --> 00:43:00.480
these contextual embeddings and somehow

00:42:58.719 --> 00:43:02.078
make it work for classification for just

00:43:00.480 --> 00:43:05.440
classifying something into yes or no

00:43:02.079 --> 00:43:06.720
positive or negative so it'll be nice if

00:43:05.440 --> 00:43:08.159
we can actually take all these

00:43:06.719 --> 00:43:10.639
embeddings and like essentially

00:43:08.159 --> 00:43:12.719
summarize them into a single embedding,

00:43:10.639 --> 00:43:14.559
a single vector

00:43:12.719 --> 00:43:16.239
because if you have a single vector then

00:43:14.559 --> 00:43:18.078
we can run it through maybe a relu and

00:43:16.239 --> 00:43:19.519
then we do a sigmoid and boom we can do

00:43:18.079 --> 00:43:22.079
a you know a binary classification

00:43:19.519 --> 00:43:23.759
problem super easy right so this begs

00:43:22.079 --> 00:43:25.680
the question okay how are we going to go

00:43:23.760 --> 00:43:28.720
from the all the many blue things to one

00:43:25.679 --> 00:43:33.358
green thing

00:43:28.719 --> 00:43:36.318
okay now of course um what we can do is

00:43:33.358 --> 00:43:37.598
we can simply average them we can take

00:43:36.318 --> 00:43:39.519
each of the embeddings just simply

00:43:37.599 --> 00:43:42.960
average them element by element, you'll

00:43:39.519 --> 00:43:47.318
get a nice green thing. Okay. Um any

00:43:42.960 --> 00:43:47.318
shortcomings from doing that?

00:43:48.318 --> 00:43:51.358
>> You would lose the ordering of the

00:43:50.318 --> 00:43:53.759
words.

00:43:51.358 --> 00:43:55.598
>> You do uh well in some sense the

00:43:53.760 --> 00:43:58.079
positional embedding, the positional

00:43:55.599 --> 00:44:00.640
encoding you have in the input does have

00:43:58.079 --> 00:44:02.720
this notion of position, right? So

00:44:00.639 --> 00:44:04.719
you're not necessarily losing the order

00:44:02.719 --> 00:44:06.239
but you're sort of

00:44:04.719 --> 00:44:08.318
averaging all this information into

00:44:06.239 --> 00:44:11.559
something and averaging is going to lose

00:44:08.318 --> 00:44:11.559
some richness.

00:44:12.800 --> 00:44:17.280
Okay.

00:44:15.440 --> 00:44:19.920
>> I think it's going to be skewed to the

00:44:17.280 --> 00:44:22.640
one that has like the biggest number,

00:44:19.920 --> 00:44:23.760
right? So something is influencing your

00:44:22.639 --> 00:44:25.279
>> Yeah, the biggest ones are going to

00:44:23.760 --> 00:44:27.440
dominate. But hopefully we won't have

00:44:25.280 --> 00:44:29.040
too much of that because all the layer

00:44:27.440 --> 00:44:30.240
norm business at the beginning has

00:44:29.039 --> 00:44:31.759
hopefully made sure the numbers are all

00:44:30.239 --> 00:44:33.838
in a reasonably small and well behaved

00:44:31.760 --> 00:44:35.599
range. But the the point really is that

00:44:33.838 --> 00:44:36.960
you're going to lose richness in the

00:44:35.599 --> 00:44:40.160
information because you're just like

00:44:36.960 --> 00:44:42.880
mushing it down. So there's a much

00:44:40.159 --> 00:44:46.639
better and more elegant way to do this

00:44:42.880 --> 00:44:49.280
which is that what you do is for every

00:44:46.639 --> 00:44:52.239
sentence when you train it you add an

00:44:49.280 --> 00:44:54.640
artificial token called the class token.

00:44:52.239 --> 00:44:57.199
Okay, literally it's an artificial token

00:44:54.639 --> 00:45:00.318
and it's designated as you know CLS in

00:44:57.199 --> 00:45:03.838
the literature and then this token is

00:45:00.318 --> 00:45:06.400
getting trained with everything else.

00:45:03.838 --> 00:45:08.000
Okay. And so once you once you finish

00:45:06.400 --> 00:45:10.720
training

00:45:08.000 --> 00:45:13.039
that token has its own embedding too.

00:45:10.719 --> 00:45:15.039
And because it has been trained with

00:45:13.039 --> 00:45:16.480
everything else and this token is

00:45:15.039 --> 00:45:18.239
remember it's a contextual embedding

00:45:16.480 --> 00:45:21.119
which means that it's very much aware of

00:45:18.239 --> 00:45:23.358
all the other words in the sentence.

00:45:21.119 --> 00:45:25.440
So in some sense this CLS

00:45:23.358 --> 00:45:26.960
token's contextual embedding sort of

00:45:25.440 --> 00:45:29.119
captures everything that's going on

00:45:26.960 --> 00:45:31.358
about that sentence

00:45:29.119 --> 00:45:32.960
right and so what we do is once we are

00:45:31.358 --> 00:45:35.759
done training we just grab this thing

00:45:32.960 --> 00:45:38.960
alone and then send that through a relu

00:45:35.760 --> 00:45:41.040
and a sigmoid and boom you're done.
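The two pooling options above, averaging versus grabbing the CLS embedding, can be sketched side by side. Shapes and the tiny ReLU-plus-sigmoid classifier are assumptions for illustration, not the lecture's model.

```python
import numpy as np

rng = np.random.default_rng(5)
contextual = rng.normal(size=(6, 8))   # 1 CLS token + 5 word embeddings

mean_pooled = contextual.mean(axis=0)  # option 1: element-wise average
cls_pooled = contextual[0]             # option 2: grab the CLS embedding alone

def classify(v, W, b):
    h = np.maximum(0, v @ W)           # small ReLU layer
    return 1 / (1 + np.exp(-(h @ b)))  # sigmoid -> binary probability

W, b = rng.normal(size=(8, 4)), rng.normal(size=4)
print(classify(cls_pooled, W, b))      # a probability in (0, 1)
```

Either pooled vector feeds the same classifier head; the difference is that the CLS embedding's "summary" is learned by backprop rather than fixed as an average.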

00:45:38.960 --> 00:45:43.599
So this is a very clever trick to

00:45:41.039 --> 00:45:45.119
somehow you know instead of averaging

00:45:43.599 --> 00:45:46.640
everything at the end let's just have

00:45:45.119 --> 00:45:48.480
something just for the whole thing the

00:45:46.639 --> 00:45:50.719
sentence and just learn it anyway along

00:45:48.480 --> 00:45:52.800
with everything else. So in like a meta

00:45:50.719 --> 00:45:54.480
principle in deep learning is that

00:45:52.800 --> 00:45:55.760
whenever you think you're making an ad

00:45:54.480 --> 00:45:56.960
hoc decision about something like

00:45:55.760 --> 00:45:59.040
averaging a bunch of stuff you should

00:45:56.960 --> 00:46:00.480
always stop and say is there a better

00:45:59.039 --> 00:46:02.480
way to do it where it doesn't have to be

00:46:00.480 --> 00:46:04.639
ad hoc where the right way is learnable

00:46:02.480 --> 00:46:08.400
from the data directly using back

00:46:04.639 --> 00:46:11.679
propagation. Um there was a hand. Yeah.

00:46:08.400 --> 00:46:14.400
>> Is there a reason that you

00:46:11.679 --> 00:46:15.039
added the CLS at the start? Why not add

00:46:14.400 --> 00:46:16.559
it at the end?

00:46:15.039 --> 00:46:17.039
>> You can do it at the end. Is there any

00:46:16.559 --> 00:46:19.759
difference?

00:46:17.039 --> 00:46:21.759
>> Um the only thing to remember is that um

00:46:19.760 --> 00:46:22.800
it's a good question. So different

00:46:21.760 --> 00:46:24.319
sentences are going to be of different

00:46:22.800 --> 00:46:25.200
length, right? So there might be short

00:46:24.318 --> 00:46:27.039
sentences, there might be long

00:46:25.199 --> 00:46:29.759
sentences. In particular, the

00:46:27.039 --> 00:46:31.599
short sentences are going to get padded,

00:46:29.760 --> 00:46:34.079
right? I remember I talked about padding

00:46:31.599 --> 00:46:35.680
to make it fit one length. So what

00:46:34.079 --> 00:46:37.519
internally the transformer will do is

00:46:35.679 --> 00:46:39.118
ignore all the padded tokens, because

00:46:37.519 --> 00:46:40.800
it's just padding, it doesn't

00:46:39.119 --> 00:46:42.720
really matter for anything. So if you

00:46:40.800 --> 00:46:44.079
have the CLS at the very end, we have

00:46:42.719 --> 00:46:46.559
to have much more administrative

00:46:44.079 --> 00:46:48.400
bookkeeping to take everything but the

00:46:46.559 --> 00:46:50.318
last one

00:46:48.400 --> 00:46:52.480
ignore it and only do the last one just

00:46:50.318 --> 00:46:54.960
much easier just to have it at the beginning,

00:46:52.480 --> 00:46:56.800
that's the reason. Yeah.

00:46:54.960 --> 00:46:58.159
>> What would be just a practical

00:46:56.800 --> 00:46:59.839
application of this would be something

00:46:58.159 --> 00:47:00.559
like sentiment analysis like a positive

00:46:59.838 --> 00:47:02.159
or negative.

00:47:00.559 --> 00:47:04.480
>> Yeah. So basically any kind of text

00:47:02.159 --> 00:47:06.078
comes in and you want to figure out some

00:47:04.480 --> 00:47:08.000
labeling problem like a classification

00:47:06.079 --> 00:47:09.920
problem. The easiest example I could

00:47:08.000 --> 00:47:12.079
think of was sentiment.

00:47:09.920 --> 00:47:14.079
But you can imagine for example an email

00:47:12.079 --> 00:47:16.000
comes into a like a call center

00:47:14.079 --> 00:47:17.200
operation and you want to take the email

00:47:16.000 --> 00:47:20.960
and automatically figure out which

00:47:17.199 --> 00:47:24.399
department should I send it to.

00:47:20.960 --> 00:47:27.039
Okay. So now, if the input data for a

00:47:24.400 --> 00:47:28.480
task is natural language text, right? We

00:47:27.039 --> 00:47:31.199
don't have to restrict ourselves to only

00:47:28.480 --> 00:47:32.880
the input training data we have. Right?

00:47:31.199 --> 00:47:35.358
Wouldn't it be great to learn from all the

00:47:32.880 --> 00:47:36.800
text that's out there? So, for example,

00:47:35.358 --> 00:47:39.119
to go back to that call center thing I

00:47:36.800 --> 00:47:41.039
just mentioned, you know, why clearly,

00:47:39.119 --> 00:47:43.599
let's say it's coming in English, the

00:47:41.039 --> 00:47:45.759
ability to take that English email and

00:47:43.599 --> 00:47:47.280
route it to one of 10 things. You know,

00:47:45.760 --> 00:47:49.119
you shouldn't have to learn English just

00:47:47.280 --> 00:47:50.640
for your call center application. You

00:47:49.119 --> 00:47:52.800
should learn English generally and use

00:47:50.639 --> 00:47:54.318
it for other things, right? So, why

00:47:52.800 --> 00:47:56.880
can't we just learn from all the text

00:47:54.318 --> 00:47:58.239
that's out there? And so, that brings us

00:47:56.880 --> 00:48:00.079
to something called self-supervised

00:47:58.239 --> 00:48:02.318
learning. And the idea of sens

00:48:00.079 --> 00:48:03.760
supervised learning is this. So if you

00:48:02.318 --> 00:48:05.838
recall the transfer learning example

00:48:03.760 --> 00:48:08.400
from lecture four right where we had

00:48:05.838 --> 00:48:10.480
ResNet, right? And we took ResNet, we

00:48:08.400 --> 00:48:13.039
chopped off the final thing, we made

00:48:10.480 --> 00:48:14.800
it sort of headless and then we attached

00:48:13.039 --> 00:48:17.519
the output of the headless ResNet to

00:48:14.800 --> 00:48:19.760
a little hidden layer and output and we

00:48:17.519 --> 00:48:21.039
did the handbags and shoes and you will

00:48:19.760 --> 00:48:22.480
recall that we were able to build a very

00:48:21.039 --> 00:48:24.880
good classifier for handbags and shoes

00:48:22.480 --> 00:48:26.639
with just like a 100 examples. Right? So

00:48:24.880 --> 00:48:29.280
the question is why was this so

00:48:26.639 --> 00:48:31.519
effective? Why was this so effective?

00:48:29.280 --> 00:48:34.079
And turns out the reason why any of this

00:48:31.519 --> 00:48:36.400
stuff actually works is because neural

00:48:34.079 --> 00:48:38.160
networks learn representations

00:48:36.400 --> 00:48:40.318
automatically when you train them. So

00:48:38.159 --> 00:48:42.399
what I mean by that is when you imagine

00:48:40.318 --> 00:48:43.759
a network, you feed in a bunch of stuff,

00:48:42.400 --> 00:48:46.639
it goes through all the layers, it comes

00:48:43.760 --> 00:48:48.960
out. Uh you can think of each layer as

00:48:46.639 --> 00:48:50.400
transforming the raw input in some

00:48:48.960 --> 00:48:53.280
different alternate representation of

00:48:50.400 --> 00:48:54.480
the input. Okay? And so and these are

00:48:53.280 --> 00:48:57.200
called representations. That's actually

00:48:54.480 --> 00:48:58.960
a technical term. Um, and so you can

00:48:57.199 --> 00:49:00.558
from this perspective when you train a

00:48:58.960 --> 00:49:02.880
neural network, a deep network with lots

00:49:00.559 --> 00:49:05.920
of layers, what you're really learning

00:49:02.880 --> 00:49:07.838
is you're

00:49:05.920 --> 00:49:09.440
learning how to represent the input in

00:49:07.838 --> 00:49:10.400
many different ways. Each of these

00:49:09.440 --> 00:49:11.838
arrows is a different way of

00:49:10.400 --> 00:49:14.240
representing things. Plus, you're

00:49:11.838 --> 00:49:15.759
learning a final regression model,

00:49:14.239 --> 00:49:16.799
either a linear regression model or a

00:49:15.760 --> 00:49:18.079
logistic regression model.

00:49:16.800 --> 00:49:19.680
Fundamentally, that's what's going on.

00:49:18.079 --> 00:49:21.599
Because the final layers tend to be

00:49:19.679 --> 00:49:24.000
sigmoid, soft max, or just linear,

00:49:21.599 --> 00:49:26.480
right? So the final layer if you just

00:49:24.000 --> 00:49:27.760
look at this part alone, whatever is

00:49:26.480 --> 00:49:29.119
coming in it's just going through

00:49:27.760 --> 00:49:31.520
essentially a linear regression model or

00:49:29.119 --> 00:49:32.800
a logistic regression model that's it.

00:49:31.519 --> 00:49:34.318
So fundamentally you're learning

00:49:32.800 --> 00:49:36.720
representations and a final little

00:49:34.318 --> 00:49:38.079
model. Okay. But the reason why all

00:49:36.719 --> 00:49:39.358
these things work so much better than

00:49:38.079 --> 00:49:41.359
logistic regression is because those

00:49:39.358 --> 00:49:43.598
representations have learned all kinds

00:49:41.358 --> 00:49:45.358
of useful things about the input data.

00:49:43.599 --> 00:49:47.519
They have sort of automatically done feature

00:49:45.358 --> 00:49:50.078
engineering for you.

00:49:47.519 --> 00:49:53.119
So, so from this perspective you can

00:49:50.079 --> 00:49:55.280
imagine that each layer here is like an

00:49:53.119 --> 00:49:56.800
encoder. It encodes the input, right?

00:49:55.280 --> 00:49:58.240
The first layer encodes it. The first

00:49:56.800 --> 00:49:59.519
two layers encode something. The first

00:49:58.239 --> 00:50:01.439
three layers encode something and so on

00:49:59.519 --> 00:50:04.318
and so forth. So a deep network contains

00:50:01.440 --> 00:50:06.639
many encoders. And so the question is

00:50:04.318 --> 00:50:08.719
what do these representations actually

00:50:06.639 --> 00:50:10.639
embody right? What do they capture? Is

00:50:08.719 --> 00:50:12.719
it like specific knowledge about the

00:50:10.639 --> 00:50:14.879
particular problem that you train the

00:50:12.719 --> 00:50:16.399
network on, or is it like

00:50:14.880 --> 00:50:18.160
general knowledge about the input data

00:50:16.400 --> 00:50:20.160
because if it is general knowledge about

00:50:18.159 --> 00:50:22.719
the input we can use it to solve other

00:50:20.159 --> 00:50:24.318
problems unrelated problems. So is it

00:50:22.719 --> 00:50:26.480
specific knowledge or general knowledge

00:50:24.318 --> 00:50:28.558
and it turns out they actually capture a

00:50:26.480 --> 00:50:31.039
lot of general knowledge about the input

00:50:28.559 --> 00:50:33.040
and that's why you can get reuse out of

00:50:31.039 --> 00:50:34.558
them you can reuse them for other

00:50:33.039 --> 00:50:36.400
unrelated things because they have

00:50:34.559 --> 00:50:38.240
captured general stuff. So if you look

00:50:36.400 --> 00:50:40.160
at this, I think I've shown you before,

00:50:38.239 --> 00:50:41.759
right? If you if you look at a network

00:50:40.159 --> 00:50:43.358
that classifies everyday objects into a

00:50:41.760 --> 00:50:44.720
bunch of categories, it can learn all

00:50:43.358 --> 00:50:46.799
these little patterns in the beginning

00:50:44.719 --> 00:50:48.558
and later on and so on and so forth. And

00:50:46.800 --> 00:50:50.480
this is a face detection network. It has

00:50:48.559 --> 00:50:52.640
learned how to look at, you know,

00:50:50.480 --> 00:50:55.280
identify little circles and edges and

00:50:52.639 --> 00:50:56.639
nose like shapes and finally faces. So

00:50:55.280 --> 00:50:57.760
all these things are examples of

00:50:56.639 --> 00:51:00.960
representations, learning interesting

00:50:57.760 --> 00:51:02.480
things about the input. Okay. So since

00:51:00.960 --> 00:51:04.240
these representations are capturing

00:51:02.480 --> 00:51:06.960
intrinsic aspects of the data, you can

00:51:04.239 --> 00:51:08.719
use it for other things, right? You can

00:51:06.960 --> 00:51:10.559
take a face detection neural network and

00:51:08.719 --> 00:51:12.318
use it, reuse it for emotion detection

00:51:10.559 --> 00:51:14.640
for instance.

00:51:12.318 --> 00:51:17.358
Um, so the question is, if you can somehow

00:51:14.639 --> 00:51:19.358
get like an encoder that generates good

00:51:17.358 --> 00:51:20.799
representations for your input data, we

00:51:19.358 --> 00:51:22.558
can simply build a regression model with

00:51:20.800 --> 00:51:24.079
those as input and labels as output and

00:51:22.559 --> 00:51:27.359
be done. And this is exactly what we did

00:51:24.079 --> 00:51:28.960
with ResNet for handbags and shoes. We

00:51:27.358 --> 00:51:30.799
found a thing that had already been

00:51:28.960 --> 00:51:33.679
trained on similar everyday objects,

00:51:30.800 --> 00:51:35.200
everyday images. And the key insight

00:51:33.679 --> 00:51:37.279
here is that since we don't have to

00:51:35.199 --> 00:51:40.078
spend precious data on learning these

00:51:37.280 --> 00:51:42.160
good representations,

00:51:40.079 --> 00:51:44.880
we won't need as much label data in the

00:51:42.159 --> 00:51:46.318
first place because the pre-training

00:51:44.880 --> 00:51:48.318
used a lot of data and you're sort of

00:51:46.318 --> 00:51:50.239
piggybacking on that data. So in some

00:51:48.318 --> 00:51:51.599
sense, your training data is everything

00:51:50.239 --> 00:51:55.358
that the pre-trained model was trained

00:51:51.599 --> 00:51:57.119
on plus your little 200 examples.
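The ResNet recipe above, keep the pretrained body as an encoder, chop off the old head, attach your own, can be sketched with a toy network. All names, shapes, and the "pretrained" weights here are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(6)
body_W = rng.normal(size=(32, 16))      # pretend-pretrained layers: the encoder
old_head = rng.normal(size=(16, 1000))  # e.g. 1000 ImageNet-style classes

def encode(x):
    # the "headless" network: raw input -> learned representation
    return np.maximum(0, x @ body_W)

new_head = rng.normal(size=(16, 2)) * 0.01  # fresh 2-class head (e.g. bag/shoe)
x = rng.normal(size=(1, 32))
logits = encode(x) @ new_head           # old_head is simply never used
print(logits.shape)                     # (1, 2)
```

Training then updates only `new_head` (or optionally the body too), which is why a couple hundred labeled examples can be enough.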

00:51:55.358 --> 00:51:58.558
Um, okay. So this is what we did. We

00:51:57.119 --> 00:52:00.160
used headless ResNet as an encoder

00:51:58.559 --> 00:52:02.480
that can take raw input and transform it

00:52:00.159 --> 00:52:04.639
into useful representations. Uh this is

00:52:02.480 --> 00:52:06.318
what we did. All right. So the general

00:52:04.639 --> 00:52:08.000
approach is that you find a deep neural

00:52:06.318 --> 00:52:10.719
network built on similar inputs but

00:52:08.000 --> 00:52:13.119
different outputs. Uh and then you

00:52:10.719 --> 00:52:15.439
basically grab maybe the penultimate uh

00:52:13.119 --> 00:52:17.760
representation or the one before that.

00:52:15.440 --> 00:52:21.119
Then you chop off the head. You attach

00:52:17.760 --> 00:52:23.119
your own output head. Train just

00:52:21.119 --> 00:52:25.039
the final layer, or train the

00:52:23.119 --> 00:52:26.079
whole thing if you want. Right? This is

00:52:25.039 --> 00:52:27.838
like the playbook we followed for

00:52:26.079 --> 00:52:30.720
ResNet. The same thing works for all

00:52:27.838 --> 00:52:32.318
kinds of other data types as well. So

00:52:30.719 --> 00:52:34.000
now to build such a model we need

00:52:32.318 --> 00:52:35.599
labeled data, right? We were lucky

00:52:34.000 --> 00:52:37.599
because ResNet was actually trained on

00:52:35.599 --> 00:52:39.119
ImageNet data, which is like a million

00:52:37.599 --> 00:52:40.960
images, each labeled into one of a

00:52:39.119 --> 00:52:44.880
thousand categories which is very

00:52:40.960 --> 00:52:46.639
convenient for us, right? But what if

00:52:44.880 --> 00:52:49.760
you want to build a generally useful

00:52:46.639 --> 00:52:51.279
model for text data?

00:52:49.760 --> 00:52:52.559
Clearly we need to collect a lot of text

00:52:51.280 --> 00:52:54.160
data. But that's no problem because

00:52:52.559 --> 00:52:55.680
internet is full of text data, right? we

00:52:54.159 --> 00:52:57.519
can easily scrape the internet. We can

00:52:55.679 --> 00:52:59.759
just download Wikipedia. So that's not a

00:52:57.519 --> 00:53:02.559
problem. The problem is something else

00:52:59.760 --> 00:53:05.520
which is: how do we define a

00:53:02.559 --> 00:53:07.119
label for a piece of text? So for an

00:53:05.519 --> 00:53:09.199
input sentence, what should the output

00:53:07.119 --> 00:53:10.480
label be? That's the key question.

00:53:09.199 --> 00:53:11.759
Because if you can answer this question,

00:53:10.480 --> 00:53:14.318
you can just train all these

00:53:11.760 --> 00:53:17.520
things on all kinds of text data, right?

00:53:14.318 --> 00:53:18.800
So a beautiful idea for doing

00:53:17.519 --> 00:53:20.880
this is called self-supervised learning.

00:53:18.800 --> 00:53:23.359
And the key idea is that you take your

00:53:20.880 --> 00:53:26.079
input, whatever the input is you take a

00:53:23.358 --> 00:53:28.719
small part of the input and just remove

00:53:26.079 --> 00:53:31.680
it and then ask your network to fill in

00:53:28.719 --> 00:53:33.919
the blanks from everything else.

00:53:31.679 --> 00:53:35.118
Okay, so this is called masking and it's

00:53:33.920 --> 00:53:36.559
just one of many techniques in

00:53:35.119 --> 00:53:39.119
self-supervised learning, but this is

00:53:36.559 --> 00:53:41.680
very commonly used. So this is original

00:53:39.119 --> 00:53:43.599
input, right? And then you take it and

00:53:41.679 --> 00:53:45.679
then you just like take this thing in

00:53:43.599 --> 00:53:48.880
the middle here randomly and

00:53:45.679 --> 00:53:51.199
zero it out or mask it. And so this

00:53:48.880 --> 00:53:53.119
incomplete input is your now new input

00:53:51.199 --> 00:53:56.719
and the thing that you took out becomes

00:53:53.119 --> 00:53:58.240
your fake label.

00:53:56.719 --> 00:54:00.399
So you can almost imagine right if you

00:53:58.239 --> 00:54:02.318
if you're baking donuts, you

00:54:00.400 --> 00:54:04.720
you make a donut and then you punch a

00:54:02.318 --> 00:54:07.199
hole in the middle of the donut, the

00:54:04.719 --> 00:54:11.598
donut with the hole is now your input; the

00:54:07.199 --> 00:54:13.039
munchkin is the label.

00:54:11.599 --> 00:54:15.200
Am I making everybody hungry at this

00:54:13.039 --> 00:54:17.838
point? So,

00:54:15.199 --> 00:54:19.519
And once you do that, no problem. You

00:54:17.838 --> 00:54:23.799
have an input, you have

00:54:19.519 --> 00:54:23.800
labels, you just train a neural network

00:54:23.838 --> 00:54:28.558
to essentially predict those to

00:54:25.679 --> 00:54:30.879
basically fill in the blanks.

00:54:28.559 --> 00:54:32.559
And so if for example, if you take a

00:54:30.880 --> 00:54:34.640
sentence like the Sloan School's

00:54:32.559 --> 00:54:36.559
mission, you can just go in there and

00:54:34.639 --> 00:54:39.199
just knock out randomly a bunch of

00:54:36.559 --> 00:54:40.319
words, like this. And the ones I'm

00:54:39.199 --> 00:54:42.960
knocking out, I'm just putting the word

00:54:40.318 --> 00:54:45.119
mask in it just to show what I'm doing.

00:54:42.960 --> 00:54:46.720
And then, when it's actually given this

00:54:45.119 --> 00:54:50.240
sentence, it will try to fill in the

00:54:46.719 --> 00:54:51.759
blanks with actual words.
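The masking step can be sketched in a few lines of plain Python. This is a toy illustration on whole words; real masked-LM training works on subword tokens and typically masks around 15% of them.

```python
import random

def mask_tokens(tokens, mask_rate=0.3, seed=0):
    """Randomly replace some tokens with [MASK]; whatever was removed
    becomes the label, keyed by its original position."""
    rng = random.Random(seed)
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append("[MASK]")   # the donut with the hole
            labels[i] = tok           # the munchkin
        else:
            masked.append(tok)
    return masked, labels

sentence = "the capital of france is paris".split()
masked, labels = mask_tokens(sentence)
print(masked, labels)
```

The masked list is the new input; the labels dictionary holds the fake labels the network must learn to predict.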

00:54:50.239 --> 00:54:53.439
Okay,

00:54:51.760 --> 00:54:54.400
so now for the amazing part. In the

00:54:53.440 --> 00:54:57.358
process of learning to fill in the

00:54:54.400 --> 00:54:58.960
blanks, uh the network learns a really

00:54:57.358 --> 00:55:01.199
good representation of the kind of input

00:54:58.960 --> 00:55:02.880
data it's seeing. And it kind of makes

00:55:01.199 --> 00:55:04.879
sense, right? Because if I give you a

00:55:02.880 --> 00:55:06.800
sentence with a few missing blanks and

00:55:04.880 --> 00:55:08.720
you're able to very successfully fill in

00:55:06.800 --> 00:55:10.079
the blanks, you have learned a whole

00:55:08.719 --> 00:55:12.558
bunch of stuff about the world to be

00:55:10.079 --> 00:55:14.318
able to do that, right? If I say the

00:55:12.559 --> 00:55:16.800
capital of France is Dash and you're

00:55:14.318 --> 00:55:18.880
like Paris, okay, how did you know that?

00:55:16.800 --> 00:55:20.559
It's sort of like that. By learning to

00:55:18.880 --> 00:55:22.559
fill in the blanks, you really have to

00:55:20.559 --> 00:55:24.079
learn how how all these things work, all

00:55:22.559 --> 00:55:27.760
the the connections between various

00:55:24.079 --> 00:55:29.839
words and so on and so forth. And so

00:55:27.760 --> 00:55:32.000
what you can do is once we build such a

00:55:29.838 --> 00:55:34.159
model, we can just extract an encoder

00:55:32.000 --> 00:55:36.079
from it, right? And then we'll fine-tune

00:55:34.159 --> 00:55:38.239
it like we did with regular transfer

00:55:36.079 --> 00:55:41.359
learning. But this is how you build a

00:55:38.239 --> 00:55:43.598
generic pre-trained model on

00:55:41.358 --> 00:55:46.159
unlabelled data.

00:55:43.599 --> 00:55:48.000
And so we can use a transformer encoder

00:55:46.159 --> 00:55:49.598
to build this whole thing in the middle

00:55:48.000 --> 00:55:51.599
because remember the transformer can

00:55:49.599 --> 00:55:53.280
take any sentence and give you the same

00:55:51.599 --> 00:55:55.280
size sentence back along with

00:55:53.280 --> 00:55:57.280
predictions for everything. So we can

00:55:55.280 --> 00:55:58.880
just have it take this thing in and ask

00:55:57.280 --> 00:56:01.519
it to just predict all the missing words

00:55:58.880 --> 00:56:03.119
here.

00:56:01.519 --> 00:56:05.358
And

00:56:03.119 --> 00:56:06.880
So, to put it in other words, masked

00:56:05.358 --> 00:56:09.440
self-supervised learning is just a

00:56:06.880 --> 00:56:11.039
sequence labeling problem.

00:56:09.440 --> 00:56:13.440
So basically this is the sequence that

00:56:11.039 --> 00:56:14.639
comes in, and then you tell the

00:56:13.440 --> 00:56:16.240
transformer and you get all these

00:56:14.639 --> 00:56:18.078
embeddings. It goes through all that

00:56:16.239 --> 00:56:21.118
stuff. You really don't care about these

00:56:18.079 --> 00:56:23.359
outputs. But wherever the word mask went

00:56:21.119 --> 00:56:25.358
in the input, you basically try

00:56:23.358 --> 00:56:26.798
to get it to produce the right answer, for

00:56:25.358 --> 00:56:28.159
example the word mission and you're

00:56:26.798 --> 00:56:29.759
trying to get. That is the right answer.

00:56:28.159 --> 00:56:31.440
This is the right answer here. And then

00:56:29.760 --> 00:56:32.799
you take these right answers, create a

00:56:31.440 --> 00:56:35.519
loss function, and do back prop and

00:56:32.798 --> 00:56:37.358
boom, you're done.
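That "loss only at the masked positions" trick can be sketched with PyTorch's cross entropy, which ignores the label value -100 by default; this is also the convention Hugging Face uses when preparing masked-LM labels. The numbers here are made up for illustration.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 10, 5

# Pretend these are the transformer's outputs: one row of logits
# (scores over the vocabulary) per input position.
logits = torch.randn(seq_len, vocab_size)

# Labels are -100 everywhere except at the masked position. -100 is
# the value cross_entropy ignores by default.
labels = torch.full((seq_len,), -100, dtype=torch.long)
labels[2] = 7  # position 2 was masked; the right answer is token 7

# The loss is computed only from position 2; the outputs at the
# other positions, which we don't care about, contribute nothing.
loss = F.cross_entropy(logits, labels, ignore_index=-100)
print(loss)
```

Backprop on this loss is the "inputs, right answers, and you're in business" step.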

00:56:35.519 --> 00:56:40.159
Inputs, right answers, and you're in

00:56:37.358 --> 00:56:41.759
business. That's it. Now, if we

00:56:40.159 --> 00:56:44.078
pre-train a transformer model like this

00:56:41.760 --> 00:56:46.240
on massive amounts of English text,

00:56:44.079 --> 00:56:48.960
let's say we did that. We get something

00:56:46.239 --> 00:56:51.118
called BERT. BERT is a very famous

00:56:48.960 --> 00:56:53.599
transformer model. And BERT was the

00:56:51.119 --> 00:56:56.400
first model actually that Google used to

00:56:53.599 --> 00:56:58.559
upgrade its search in 2019.

00:56:56.400 --> 00:57:00.318
Like the Brazil visa example you

00:56:58.559 --> 00:57:03.599
may recall from earlier lectures that

00:57:00.318 --> 00:57:06.400
uses BERT under the hood. Okay. Um and

00:57:03.599 --> 00:57:07.920
so now I just want to show you because

00:57:06.400 --> 00:57:09.680
you can actually read the BERT paper and

00:57:07.920 --> 00:57:10.880
it'll actually make sense to you now

00:57:09.679 --> 00:57:13.440
based on what you have learned in this

00:57:10.880 --> 00:57:14.798
class. Look at this: BERT's model

00:57:13.440 --> 00:57:16.798
architecture is a multi-layer

00:57:14.798 --> 00:57:18.639
bidirectional transformer encoder. Okay,

00:57:16.798 --> 00:57:20.639
transformer encoder. We denote the

00:57:18.639 --> 00:57:23.039
number of layers (transformer blocks) as

00:57:20.639 --> 00:57:25.118
L. The hidden size is H and the number

00:57:23.039 --> 00:57:30.558
of attention heads as A. And how much is

00:57:25.119 --> 00:57:34.318
that? Okay, H is 768,

00:57:30.559 --> 00:57:36.480
which means that the embedding size

00:57:34.318 --> 00:57:38.318
is 768,

00:57:36.480 --> 00:57:41.599
and the hidden feed forward layer is

00:57:38.318 --> 00:57:44.719
four times as much, so it's 3072. So the

00:57:41.599 --> 00:57:47.760
feed-forward layer is 3072, the

00:57:44.719 --> 00:57:49.838
embeddings are 768, and you can

00:57:47.760 --> 00:57:52.799
see there are two BERT models here this

00:57:49.838 --> 00:57:55.759
one has 12 transformer blocks this one

00:57:52.798 --> 00:57:58.159
has 24 transformer blocks
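Those numbers are visible directly in the default `BertConfig` from the transformers library (assuming the library is installed; no model download is needed, since this is just a configuration object).

```python
from transformers import BertConfig

# The default BertConfig matches BERT-base from the paper:
# L = 12 layers, H = 768 hidden size, A = 12 attention heads,
# and a feed-forward (intermediate) size of 4 * H = 3072.
cfg = BertConfig()
print(cfg.num_hidden_layers, cfg.hidden_size,
      cfg.num_attention_heads, cfg.intermediate_size)
```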

00:57:55.760 --> 00:57:59.440
Okay, so you can actually read this

00:57:58.159 --> 00:58:00.879
paper. You can actually relate

00:57:59.440 --> 00:58:02.720
it to exactly what we discussed in

00:58:00.880 --> 00:58:04.640
class. It'll all make sense.

00:58:02.719 --> 00:58:06.239
Bidirectional means that the words can

00:58:04.639 --> 00:58:09.598
pay attention to every other word in the

00:58:06.239 --> 00:58:10.959
sentence. And as we will see on Monday,

00:58:09.599 --> 00:58:12.400
you can have another

00:58:10.960 --> 00:58:14.240
transformer thing called a causal

00:58:12.400 --> 00:58:15.440
transformer in which you only pay

00:58:14.239 --> 00:58:18.000
attention to the words that came before

00:58:15.440 --> 00:58:21.014
you, not the ones after you. So

00:58:18.000 --> 00:58:24.400
bidirectional means all words are seen.

00:58:21.014 --> 00:58:26.639
Okay. So what we do is

00:58:24.400 --> 00:58:27.760
remember we said that to solve sequence

00:58:26.639 --> 00:58:30.719
classification you can add a little

00:58:27.760 --> 00:58:32.480
token at the beginning uh and then boom

00:58:30.719 --> 00:58:35.199
use it for classification as it turns

00:58:32.480 --> 00:58:36.960
out but very conveniently for us the

00:58:35.199 --> 00:58:38.639
people who built BERT actually

00:58:36.960 --> 00:58:41.039
when they trained BERT, they just used

00:58:38.639 --> 00:58:42.318
the CLS token

00:58:41.039 --> 00:58:44.719
during training so it's actually

00:58:42.318 --> 00:58:46.159
available for us out of the box so when

00:58:44.719 --> 00:58:47.439
you use BERT for sequence classification,

00:58:46.159 --> 00:58:48.798
you don't even have to do any surgery on

00:58:47.440 --> 00:58:51.519
it; it just gives you the class token

00:58:48.798 --> 00:58:52.960
automatically which is very convenient

00:58:51.519 --> 00:58:55.280
uh and you can also use it for sequence

00:58:52.960 --> 00:58:57.440
labeling as well. So for sequence

00:58:55.280 --> 00:58:58.960
classifications and sequence labeling uh

00:58:57.440 --> 00:59:00.960
BERT is actually usually a really good

00:58:58.960 --> 00:59:02.159
starting point and in particular there

00:59:00.960 --> 00:59:04.240
have been lots of improvements and

00:59:02.159 --> 00:59:05.759
variations of BERT over the years and if

00:59:04.239 --> 00:59:07.199
you're curious about this there's a

00:59:05.760 --> 00:59:09.040
thing called the sentence transformers

00:59:07.199 --> 00:59:11.199
library which has got a whole bunch of

00:59:09.039 --> 00:59:14.400
BERT related code and resources that you

00:59:11.199 --> 00:59:18.480
can use to do things out of the box.

00:59:14.400 --> 00:59:20.000
Okay. So there's a bit of a word

00:59:18.480 --> 00:59:21.920
wall.

00:59:20.000 --> 00:59:23.519
So, to solve any of these problems,

00:59:21.920 --> 00:59:24.720
classification or labeling where the

00:59:23.519 --> 00:59:27.199
input is natural language we can

00:59:24.719 --> 00:59:28.719
obviously use a model like BERT, label a

00:59:27.199 --> 00:59:30.159
few hundred examples attach the right

00:59:28.719 --> 00:59:32.318
final layers, and fine-tune it like we

00:59:30.159 --> 00:59:34.879
did for the ResNet. But if your

00:59:32.318 --> 00:59:37.358
problem is like a standard NLP problem

00:59:34.880 --> 00:59:39.280
okay you don't even have to do that

00:59:37.358 --> 00:59:40.719
because people for these standard tasks

00:59:39.280 --> 00:59:43.440
they've already pre-trained it on those

00:59:40.719 --> 00:59:44.719
standard tasks right and so you can do

00:59:43.440 --> 00:59:47.440
all these things without any fine tuning

00:59:44.719 --> 00:59:49.199
at all, literally out of the box.

00:59:47.440 --> 00:59:50.720
and so there are many hubs which have

00:59:49.199 --> 00:59:53.519
these pre-trained models, but perhaps

00:59:50.719 --> 00:59:56.558
the biggest one is the Hugging Face Hub.

00:59:53.519 --> 00:59:58.159
And I checked last night, it has 525,000

00:59:56.559 --> 01:00:00.640
models

00:59:58.159 --> 01:00:02.239
available. I think if I recall last year

01:00:00.639 --> 01:00:04.719
when I taught this course, I think the number

01:00:02.239 --> 01:00:07.118
was a lot smaller, maybe 50,000. So it's

01:00:04.719 --> 01:00:09.039
like growing really, really fast. Um,

01:00:07.119 --> 01:00:12.599
and so all right, let's just switch to a

01:00:09.039 --> 01:00:12.599
Hugging Face Colab.

01:00:15.199 --> 01:00:21.759
So, Hugging Face. How many of you are

01:00:18.159 --> 01:00:24.719
familiar with Hugging Face?

01:00:21.760 --> 01:00:26.720
Okay, it's good. All right, so um for

01:00:24.719 --> 01:00:28.480
the others, basically you have a whole

01:00:26.719 --> 01:00:30.318
bunch of pre-trained models on Hugging

01:00:28.480 --> 01:00:32.240
Face. You actually have a lot of data

01:00:30.318 --> 01:00:34.960
sets you can work with for your own

01:00:32.239 --> 01:00:37.039
tasks. Uh there are lots of people

01:00:34.960 --> 01:00:39.039
demoing what they have built in this

01:00:37.039 --> 01:00:40.558
thing called spaces and of course a lot

01:00:39.039 --> 01:00:42.318
of documentation and so on. So the thing

01:00:40.559 --> 01:00:44.000
you can do is what they have done is

01:00:42.318 --> 01:00:46.318
they have organized all these models by

01:00:44.000 --> 01:00:47.760
the kind of task you can use them for.

01:00:46.318 --> 01:00:49.279
So you can see here there are a whole

01:00:47.760 --> 01:00:50.960
bunch of computer vision tasks that you

01:00:49.280 --> 01:00:52.480
can use them for. There's a whole bunch

01:00:50.960 --> 01:00:54.000
of natural language tasks like text

01:00:52.480 --> 01:00:56.798
classification

01:00:54.000 --> 01:00:59.280
uh feature extraction this and that lots

01:00:56.798 --> 01:01:00.559
of interesting examples here. And so

01:00:59.280 --> 01:01:01.760
what you do is you just literally can go

01:01:00.559 --> 01:01:03.839
in there and say okay I want to do a

01:01:01.760 --> 01:01:05.200
text classification. You hit it and then

01:01:03.838 --> 01:01:06.798
it tells you all the models that are

01:01:05.199 --> 01:01:08.558
available. It turns out there are 50,000 models just

01:01:06.798 --> 01:01:10.159
for text classification. And you can

01:01:08.559 --> 01:01:11.680
look at okay which is you know most

01:01:10.159 --> 01:01:13.118
downloaded or which is the most liked

01:01:11.679 --> 01:01:14.318
and then you can just use them as a

01:01:13.119 --> 01:01:17.358
starting point for whatever you want to

01:01:14.318 --> 01:01:20.880
do. Okay. So that is Hugging Face,

01:01:17.358 --> 01:01:24.960
and so the way you use Hugging Face is

01:01:20.880 --> 01:01:26.798
I'm just connecting it. Um

01:01:24.960 --> 01:01:28.159
if you have a problem which the input is

01:01:26.798 --> 01:01:29.440
natural language text the first question

01:01:28.159 --> 01:01:31.199
you have to ask yourself is it standard

01:01:29.440 --> 01:01:32.960
or not? Is it a standard task or not? If

01:01:31.199 --> 01:01:34.639
it's a standard task, you just go that route;

01:01:32.960 --> 01:01:37.199
do not reinvent the wheel. This thing

01:01:34.639 --> 01:01:39.679
will usually work pretty well. Okay. So

01:01:37.199 --> 01:01:41.598
here we will use this thing called

01:01:39.679 --> 01:01:43.759
the transformers library from Hugging

01:01:41.599 --> 01:01:45.599
Face, in particular the pipeline function,

01:01:43.760 --> 01:01:47.520
to demonstrate quickly how to do this

01:01:45.599 --> 01:01:48.960
thing. Fortunately this library as of

01:01:47.519 --> 01:01:50.000
this year is pre-installed in Colab, so

01:01:48.960 --> 01:01:51.599
we don't have to install it. We

01:01:50.000 --> 01:01:53.920
can just start using it right away. So

01:01:51.599 --> 01:01:57.119
we'll take this example where you have a

01:01:53.920 --> 01:01:59.039
bunch of text which says um

01:01:57.119 --> 01:02:00.480
dear Amazon last week I got an Optimus

01:01:59.039 --> 01:02:01.519
Prime action figure from your store in

01:02:00.480 --> 01:02:04.000
Germany. Unfortunately when I opened the

01:02:01.519 --> 01:02:05.039
package, I discovered to my horror that I

01:02:04.000 --> 01:02:06.719
had been sent an action figure of

01:02:05.039 --> 01:02:08.639
Megatron instead. Can you imagine that

01:02:06.719 --> 01:02:10.879
person's like sheer distress at this?

01:02:08.639 --> 01:02:12.159
Um, so as a lifelong enemy of the

01:02:10.880 --> 01:02:14.640
Decepticons, I hope you can understand

01:02:12.159 --> 01:02:17.039
my dilemma. So to resolve the issue, I

01:02:14.639 --> 01:02:19.440
demand an exchange. Enclosed are copies; I

01:02:17.039 --> 01:02:21.039
expect to hear from you soon. Sincerely,

01:02:19.440 --> 01:02:22.720
Bumblebee.

01:02:21.039 --> 01:02:24.880
Okay, that Okay, they should have come

01:02:22.719 --> 01:02:26.558
up with a better name for this example.

01:02:24.880 --> 01:02:29.358
Uh, all right, cool. So that's the text

01:02:26.559 --> 01:02:31.040
we have. So we import the this pipeline

01:02:29.358 --> 01:02:33.119
function is the one that basically gives

01:02:31.039 --> 01:02:34.558
you the ability to out of the box start

01:02:33.119 --> 01:02:36.720
using it without any training of your own,

01:02:34.559 --> 01:02:40.160
nothing like that. Okay, so we download

01:02:36.719 --> 01:02:42.399
this thing. Oh wow, I got an A100

01:02:40.159 --> 01:02:44.480
today. That happens very rarely. All

01:02:42.400 --> 01:02:46.079
right, sorry.

01:02:44.480 --> 01:02:48.000
So here, let's say you want to classify

01:02:46.079 --> 01:02:50.000
that text. Okay, you just want to

01:02:48.000 --> 01:02:52.880
classify it for sentiment. You literally

01:02:50.000 --> 01:02:55.358
go in there and say pipeline

01:02:52.880 --> 01:02:57.599
text classification. That's the task you

01:02:55.358 --> 01:02:59.519
want the pipeline to do for you, right?

01:02:57.599 --> 01:03:01.280
And you create a classifier. Okay, it's

01:02:59.519 --> 01:03:04.318
going to download a bunch of stuff. Uh,

01:03:01.280 --> 01:03:06.079
and then so on and so forth.

01:03:04.318 --> 01:03:08.558
The first time it just takes time to

01:03:06.079 --> 01:03:10.240
download and then you literally take the

01:03:08.559 --> 01:03:11.599
text you have here and then run it

01:03:10.239 --> 01:03:14.078
through the classifier as if it were just a

01:03:11.599 --> 01:03:17.280
little function right you get some

01:03:14.079 --> 01:03:19.599
outputs and then actually just do this

01:03:17.280 --> 01:03:21.519
this way

01:03:19.599 --> 01:03:23.760
negative sentiment is negative with 90%

01:03:21.519 --> 01:03:25.838
probability pretty good right sequence

01:03:23.760 --> 01:03:27.440
classification solved. I mean,

01:03:25.838 --> 01:03:30.239
sentiment classification solved. So we'll

01:03:27.440 --> 01:03:31.838
try a few different examples: I hated

01:03:30.239 --> 01:03:33.038
the movie; if I said I loved the movie,

01:03:31.838 --> 01:03:34.880
I would be lying. Okay, that's a little

01:03:33.039 --> 01:03:36.400
tricky. The movie left me speechless.

01:03:34.880 --> 01:03:38.798
Incredible. And then I had to add this

01:03:36.400 --> 01:03:40.400
last thing here last night. Almost but

01:03:38.798 --> 01:03:42.000
not quite entirely unlike anything good

01:03:40.400 --> 01:03:43.119
I've seen. Okay. And that's not

01:03:42.000 --> 01:03:44.960
original. By the way, people who have

01:03:43.119 --> 01:03:46.720
read Douglas Adams will know this famous

01:03:44.960 --> 01:03:48.240
sentence about somebody drinking some

01:03:46.719 --> 01:03:50.959
beverage and saying it's almost but not

01:03:48.239 --> 01:03:52.558
quite entirely unlike tea. So I was

01:03:50.960 --> 01:03:56.159
inspired by that. So anyway, we'll see

01:03:52.559 --> 01:03:59.519
what happens. Um.

01:03:56.159 --> 01:04:01.679
All right. Put it in there. Okay. So

01:03:59.519 --> 01:04:02.960
negative. I hated the movie. Okay, fine.

01:04:01.679 --> 01:04:05.038
If I said I loved the movie, I'd be lying.

01:04:02.960 --> 01:04:07.440
Negative. Movie left me speechless. Uh,

01:04:05.039 --> 01:04:09.119
it says it's negative, but it could go

01:04:07.440 --> 01:04:09.838
either way, right? A good classifier

01:04:09.119 --> 01:04:11.599
would have probably given you a

01:04:09.838 --> 01:04:13.759
probability around the 50% mark because

01:04:11.599 --> 01:04:15.760
it's sort of right on the fence. Um,

01:04:13.760 --> 01:04:17.680
incredible, it's positive, and then it

01:04:15.760 --> 01:04:20.640
got fooled by my crazy long sentence and

01:04:17.679 --> 01:04:22.159
it says it's positive. Okay, now that's

01:04:20.639 --> 01:04:23.679
classification. Here's one other quick

01:04:22.159 --> 01:04:25.759
example. So, you can actually give it a

01:04:23.679 --> 01:04:28.318
piece of text, right? For example, you

01:04:25.760 --> 01:04:30.319
can take like a a Reuter's news story.

01:04:28.318 --> 01:04:32.880
You can feed it and say extract all the

01:04:30.318 --> 01:04:34.159
company names from it. Extract company

01:04:32.880 --> 01:04:35.599
names, people names and things like

01:04:34.159 --> 01:04:37.920
that. It's called named entity

01:04:35.599 --> 01:04:40.240
extraction. Back in

01:04:37.920 --> 01:04:42.400
the day, people would painstakingly

01:04:40.239 --> 01:04:44.479
hand-build all these

01:04:42.400 --> 01:04:46.079
very complex systems to do named

01:04:44.480 --> 01:04:48.400
entity extraction. Now it's just a

01:04:46.079 --> 01:04:50.559
pipeline away. So you can take this

01:04:48.400 --> 01:04:53.280
thing and you can say create a pipeline

01:04:50.559 --> 01:04:54.798
for named-entity recognition, and for any

01:04:53.280 --> 01:04:56.240
particular task that you're using there

01:04:54.798 --> 01:04:57.838
might be a few additional parameters you

01:04:56.239 --> 01:05:00.000
can set right as a part of the

01:04:57.838 --> 01:05:03.000
configuration. So we download this

01:05:00.000 --> 01:05:03.000
pipeline.

01:05:08.480 --> 01:05:14.798
Okay, perfect. And then we run the

01:05:11.199 --> 01:05:16.960
output. So it says okay good. Amazon is

01:05:14.798 --> 01:05:18.559
an organization

01:05:16.960 --> 01:05:21.119
uh

01:05:18.559 --> 01:05:22.400
and Germany is a location (LOC), which is

01:05:21.119 --> 01:05:23.920
nice. So these things have a standard

01:05:22.400 --> 01:05:24.798
vocabulary of tags, ORG or LOC, things like

01:05:23.920 --> 01:05:26.960
that which you can read up in the

01:05:24.798 --> 01:05:29.599
documentation. Uh and then Bumblebee is

01:05:26.960 --> 01:05:32.079
a person. But then all the

01:05:29.599 --> 01:05:33.760
Optimus Prime transformer stuff is all

01:05:32.079 --> 01:05:36.480
it got fooled, right? It thinks Optimus

01:05:33.760 --> 01:05:38.000
Prime is miscellaneous (MISC), Decepticons is

01:05:36.480 --> 01:05:39.039
miscellaneous and so on and so forth.

01:05:38.000 --> 01:05:41.039
But you get the idea. You can take

01:05:39.039 --> 01:05:42.400
standard things like Reuters news stories

01:05:41.039 --> 01:05:44.160
and so on. You can get

01:05:42.400 --> 01:05:45.440
a very good entity extraction right off

01:05:44.159 --> 01:05:47.038
the bat. And once you get these

01:05:45.440 --> 01:05:48.960
entities extracted, then you can put

01:05:47.039 --> 01:05:50.640
them into a nice structured data table

01:05:48.960 --> 01:05:53.280
like a database and then you can run

01:05:50.639 --> 01:05:55.679
traditional machine learning on it.
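The entity-extraction demo looks roughly like this, assuming the transformers library and network access for the model download; `aggregation_strategy="simple"` is the option that merges word pieces back into whole entity spans (e.g. "Germany" rather than "Ger" + "##many").

```python
from transformers import pipeline

# Create a named-entity-recognition pipeline; the default English
# NER model is downloaded on first use.
ner = pipeline("ner", aggregation_strategy="simple")

text = ("Dear Amazon, last week I got an Optimus Prime action figure "
        "from your store in Germany.")
entities = ner(text)
for ent in entities:
    # Each entity carries a tag (ORG, LOC, PER, MISC), the span text,
    # and a confidence score.
    print(ent["entity_group"], ent["word"], round(float(ent["score"]), 2))
```

The extracted entities can then go straight into a structured table for traditional machine learning, as described above.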

01:05:53.280 --> 01:05:58.559
Okay. Um and then I had I think a few

01:05:55.679 --> 01:06:01.598
more examples of question answering and

01:05:58.559 --> 01:06:02.798
Actually, let's just try that. You

01:06:01.599 --> 01:06:03.920
can actually give it a thing and ask a

01:06:02.798 --> 01:06:07.599
question about it, and it can actually

01:06:03.920 --> 01:06:09.119
give you the answer, which gets into the

01:06:07.599 --> 01:06:10.960
causal transformer thing that we're

01:06:09.119 --> 01:06:12.720
going to see on Monday which builds up

01:06:10.960 --> 01:06:14.480
into large language models because you

01:06:12.719 --> 01:06:16.000
obviously can give something you can

01:06:14.480 --> 01:06:17.440
give a passage to ChatGPT and ask a

01:06:16.000 --> 01:06:19.599
question, ask it to give you an answer, so

01:06:17.440 --> 01:06:20.880
it's really in that vein. But just

01:06:19.599 --> 01:06:25.280
for fun let's just do that to see if

01:06:20.880 --> 01:06:27.440
it's any good um okay so what does the

01:06:25.280 --> 01:06:29.359
customer want and the output is an

01:06:27.440 --> 01:06:32.480
exchange of Megatron, and it's telling

01:06:29.358 --> 01:06:34.558
you which where it starts in the text

01:06:32.480 --> 01:06:37.599
and where it ends the relevant passage.
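The question-answering demo can be sketched like this; the context string here is a shortened paraphrase of the example letter, and the pipeline downloads a default extractive QA model on first use.

```python
from transformers import pipeline

reader = pipeline("question-answering")

context = ("Dear Amazon, last week I got an Optimus Prime action figure "
           "from your store in Germany. Unfortunately, when I opened the "
           "package, I discovered to my horror that I had been sent an "
           "action figure of Megatron instead. To resolve the issue, I "
           "demand an exchange.")
answer = reader(question="What does the customer want?", context=context)

# The result includes the answer span plus 'start'/'end' character
# offsets, so you can check the claimed passage against the input.
print(answer)
```

Because the answer is an extracted span, the start/end offsets let you verify the model's claim against the source text, which is the QA-on-LLM-output idea mentioned below.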

01:06:34.559 --> 01:06:39.119
It's pretty good, right? So because

01:06:37.599 --> 01:06:41.200
remember if you have stuff like this

01:06:39.119 --> 01:06:42.559
then when you ask like a large language

01:06:41.199 --> 01:06:44.399
model a question it gives you an answer.

01:06:42.559 --> 01:06:46.720
You can actually ask it to give you

01:06:44.400 --> 01:06:48.480
exactly where in the input it found the

01:06:46.719 --> 01:06:49.679
answer and because you know these things

01:06:48.480 --> 01:06:51.920
are going to elicitate you can actually

01:06:49.679 --> 01:06:54.078
look at the input that it's claiming to

01:06:51.920 --> 01:06:56.318
use and look at what it says and see if

01:06:54.079 --> 01:06:59.760
they actually match. It's a way to sort

01:06:56.318 --> 01:07:01.920
of essentially do QA on LLM output.

01:06:59.760 --> 01:07:03.280
Um okay so that's what we have here and

01:07:01.920 --> 01:07:05.280
I have other material, much of which

01:07:03.280 --> 01:07:07.760
I'll ignore for the moment because I

01:07:05.280 --> 01:07:10.000
want to go back to the PowerPoint.

01:07:07.760 --> 01:07:11.760
So yeah so if you have a standard task

01:07:10.000 --> 01:07:13.679
uh you know you can just use pipelines

01:07:11.760 --> 01:07:15.200
and hugging face to actually solve many

01:07:13.679 --> 01:07:18.078
of them out of the box without any heavy

01:07:15.199 --> 01:07:19.598
lifting. So I mentioned earlier on that

01:07:18.079 --> 01:07:21.519
transformers have proven to be effective

01:07:19.599 --> 01:07:24.079
for a whole bunch of domains outside of

01:07:21.519 --> 01:07:26.480
natural language processing um like you

01:07:24.079 --> 01:07:29.119
know speech recognition, computer vision

01:07:26.480 --> 01:07:30.559
and so on and so forth. Um and so I want

01:07:29.119 --> 01:07:32.400
to give you a couple of quick examples

01:07:30.559 --> 01:07:35.280
of how to think about using

01:07:32.400 --> 01:07:39.358
transformers for non-text applications.

01:07:35.280 --> 01:07:41.039
Okay. So uh the the key insight here is

01:07:39.358 --> 01:07:42.880
that the architecture of the transformer

01:07:41.039 --> 01:07:45.280
block that we have looked at amazingly

01:07:42.880 --> 01:07:47.599
enough can be used as is with no changes

01:07:45.280 --> 01:07:49.519
no surgery needed. No clever thinking

01:07:47.599 --> 01:07:51.359
required for any particular application.

01:07:49.519 --> 01:07:53.759
What is needed where the clever thinking

01:07:51.358 --> 01:07:55.358
may be required is you need to take the

01:07:53.760 --> 01:07:57.280
inputs that you're working with and you

01:07:55.358 --> 01:07:59.679
need to figure out a way to tokenize and

01:07:57.280 --> 01:08:01.039
encode them into embeddings

01:07:59.679 --> 01:08:03.358
which can then be sent into the

01:08:01.039 --> 01:08:05.839
transformer. So all the action is in

01:08:03.358 --> 01:08:07.759
taking that non-text input and

01:08:05.838 --> 01:08:09.759
figuring out a way to cast them in the

01:08:07.760 --> 01:08:12.640
language of embeddings. That's where the

01:08:09.760 --> 01:08:14.160
that's the game. Okay. So um here is

01:08:12.639 --> 01:08:16.158
something called the vision transformer

01:08:14.159 --> 01:08:19.119
which is very famous actually. I think

01:08:16.158 --> 01:08:20.559
it may perhaps be the first

01:08:19.119 --> 01:08:23.358
transformer architecture that was

01:08:20.560 --> 01:08:25.759
applied to vision problems. So um so

01:08:23.359 --> 01:08:28.960
let's say you have a picture. So

01:08:25.759 --> 01:08:31.679
let's say you have this picture okay

01:08:28.960 --> 01:08:33.279
it is just a picture okay so you have to

01:08:31.679 --> 01:08:35.679
find a way to create embeddings from

01:08:33.279 --> 01:08:38.000
this picture or to tokenize this picture

01:08:35.679 --> 01:08:40.158
in some way. With sentences, the

01:08:38.000 --> 01:08:41.759
words are obviously the

01:08:40.158 --> 01:08:43.599
tokens; it's pretty trivial to

01:08:41.759 --> 01:08:45.359
figure out how to tokenize them. But with

01:08:43.600 --> 01:08:47.120
a picture what do you do right it's kind

01:08:45.359 --> 01:08:49.600
of weird to think of tokenizing a

01:08:47.119 --> 01:08:51.119
picture so what these people did is that

01:08:49.600 --> 01:08:52.960
they said, you know what, I'm going to take

01:08:51.119 --> 01:08:54.479
this picture and chop it up into small

01:08:52.960 --> 01:08:57.039
squares.

01:08:54.479 --> 01:08:58.639
Right? So in this example, they have

01:08:57.039 --> 01:09:02.079
taken this big picture and chopped it up

01:08:58.640 --> 01:09:03.359
into nine little pictures. Okay? Then

01:09:02.079 --> 01:09:05.278
you can take each of those nine

01:09:03.359 --> 01:09:07.600
pictures.

01:09:05.279 --> 01:09:09.679
Each of those nine pictures, right? If

01:09:07.600 --> 01:09:11.600
you look at the how it's represented,

01:09:09.679 --> 01:09:15.440
it's just three tables of numbers,

01:09:11.600 --> 01:09:16.960
right? The RGB values, right? So you can

01:09:15.439 --> 01:09:20.318
take all those numbers and you just

01:09:16.960 --> 01:09:22.880
create a giant long vector from it.

01:09:20.319 --> 01:09:26.080
Okay? You have a huge long vector and

01:09:22.880 --> 01:09:28.719
then you run it through a dense layer to

01:09:26.079 --> 01:09:30.079
come up with a smaller vector

01:09:28.719 --> 01:09:31.838
and that smaller vector is your

01:09:30.079 --> 01:09:34.318
embedding.

01:09:31.838 --> 01:09:36.079
That's it. But the way you transform the

01:09:34.319 --> 01:09:37.600
long vector into small vector is just a

01:09:36.079 --> 01:09:39.119
dense layer whose weights can be

01:09:37.600 --> 01:09:41.039
learned.

01:09:39.119 --> 01:09:42.559
So what these people did is they said

01:09:41.039 --> 01:09:44.560
well I'm going to first chop it up into

01:09:42.560 --> 01:09:47.199
these patches and then I take each patch

01:09:44.560 --> 01:09:49.039
and do a linear projection. Right? A

01:09:47.198 --> 01:09:50.639
flattened patch is nothing more than

01:09:49.039 --> 01:09:52.079
three tables of numbers flattened into a

01:09:50.640 --> 01:09:54.480
long vector. That's what the word

01:09:52.079 --> 01:09:56.000
flatten here means. And once you flatten

01:09:54.479 --> 01:09:58.158
it, I'm just going to run it through a

01:09:56.000 --> 01:09:59.760
dense layer. So, by the way, you will

01:09:58.158 --> 01:10:01.279
see the words linear projection. It's a

01:09:59.760 --> 01:10:03.360
synonym for run it through a dense

01:10:01.279 --> 01:10:05.198
layer.

01:10:03.359 --> 01:10:08.000
So, you run it through a dense layer,

01:10:05.198 --> 01:10:09.599
right? You get these nice vectors, these

01:10:08.000 --> 01:10:11.520
vectors.

01:10:09.600 --> 01:10:12.880
And now you say, well, you know what? I

01:10:11.520 --> 01:10:15.120
have to take the order of these things

01:10:12.880 --> 01:10:17.039
into account because clearly this little

01:10:15.119 --> 01:10:18.479
patch is in the top left while this

01:10:17.039 --> 01:10:20.640
patch is somewhere in the middle. Right?

01:10:18.479 --> 01:10:22.079
The order matters in the picture

01:10:20.640 --> 01:10:24.239
otherwise every jumbled version is going

01:10:22.079 --> 01:10:26.158
to be the same thing. So you use

01:10:24.238 --> 01:10:27.519
positional embeddings

01:10:26.158 --> 01:10:31.439
you basically say there are nine

01:10:27.520 --> 01:10:33.760
positions in any picture, right? 0, 1, 2, 3, 4,

01:10:31.439 --> 01:10:36.879
5, 6, 7, 8. There are nine positions. So I'm

01:10:33.760 --> 01:10:39.199
going to create nine position embeddings

01:10:36.880 --> 01:10:40.319
and then I'm just going to add them up.

01:10:39.198 --> 01:10:41.519
Then I'm just going to add them up to

01:10:40.319 --> 01:10:44.319
this embedding. Just like we did with

01:10:41.520 --> 01:10:45.440
words. With words, we each word had an

01:10:44.319 --> 01:10:47.119
embedding. Each position had an

01:10:45.439 --> 01:10:49.359
embedding. We added them up. Here each

01:10:47.119 --> 01:10:50.800
image has an embedding. The position of

01:10:49.359 --> 01:10:53.439
the little patch in the picture has an

01:10:50.800 --> 01:10:54.960
embedding. We add them up. Okay? And

01:10:53.439 --> 01:10:57.198
then because we want to use it for

01:10:54.960 --> 01:11:00.239
classification, no problem. We'll have a

01:10:57.198 --> 01:11:01.678
little CLS token

01:11:00.238 --> 01:11:04.399
and then we just run it through the

01:11:01.679 --> 01:11:06.480
transformer. That's it.

01:11:04.399 --> 01:11:08.238
and then you get the CLS token and then

01:11:06.479 --> 01:11:09.759
you can attach a softmax to it and say,

01:11:08.238 --> 01:11:12.079
"Okay, it's a bird, it's a ball, it's a

01:11:09.760 --> 01:11:14.560
car."

01:11:12.079 --> 01:11:16.960
That's it. This simple approach actually

01:11:14.560 --> 01:11:19.440
works

01:11:16.960 --> 01:11:22.158
amazingly enough.
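The patch-embedding pipeline just described can be sketched in a few lines of NumPy. The patch size, embedding width, and random weights here are illustrative stand-ins for the values a real vision transformer would learn:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy 96x96 RGB image: three 96x96 tables of numbers (one per channel).
image = rng.random((96, 96, 3))

patch, d_model = 32, 64        # 32x32 patches -> 3x3 = 9 patches (illustrative)
n = image.shape[0] // patch    # patches per side

# 1. Chop the picture into nine little squares and flatten each one
#    into a long vector.
patches = np.stack([
    image[i*patch:(i+1)*patch, j*patch:(j+1)*patch].reshape(-1)
    for i in range(n) for j in range(n)
])                                       # (9, 32*32*3) = (9, 3072)

# 2. "Linear projection": run each long vector through a dense layer
#    (a weight matrix whose entries would be learned) to get a smaller
#    embedding vector.
W = rng.standard_normal((patches.shape[1], d_model)) * 0.02
tokens = patches @ W                     # (9, 64)

# 3. Add a (learnable) position embedding for each of the 9 positions.
pos = rng.standard_normal((n * n, d_model)) * 0.02
tokens = tokens + pos

# 4. Prepend a CLS token; its contextual version coming out of the
#    transformer feeds the classification head.
cls = rng.standard_normal((1, d_model)) * 0.02
tokens = np.concatenate([cls, tokens])   # (10, 64) -> into the transformer
print(tokens.shape)                      # (10, 64)
```

From here, `tokens` is just a sequence of embeddings, so the transformer blocks that follow are exactly the ones used for text.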

01:11:19.439 --> 01:11:23.439
Okay, so that is the vision transformer

01:11:22.158 --> 01:11:24.639
and I'm going through it fast just to

01:11:23.439 --> 01:11:29.359
give you a sense for how these things

01:11:24.640 --> 01:11:31.760
work. Uh any questions? Yeah. Uh my

01:11:29.359 --> 01:11:33.759
question is, in the case of text,

01:11:31.760 --> 01:11:35.280
we had a fixed number of tokens, that is,

01:11:33.760 --> 01:11:37.360
the number of words that could be in

01:11:35.279 --> 01:11:39.359
the vocabulary, the English vocabulary,

01:11:37.359 --> 01:11:41.279
but here, if you look at images, they will

01:11:39.359 --> 01:11:43.519
probably go into the trillions. And

01:11:41.279 --> 01:11:45.599
we are not talking about one image,

01:11:43.520 --> 01:11:47.760
but we take a whole set, a lot of

01:11:45.600 --> 01:11:52.079
images, and we try to subset each one of

01:11:47.760 --> 01:11:53.679
them. Each one would have its own

01:11:52.079 --> 01:11:56.158
weights, its own parameters. There

01:11:53.679 --> 01:11:58.719
is no notion of vocabulary here. All

01:11:56.158 --> 01:12:02.319
we're saying is that given any image, we

01:11:58.719 --> 01:12:03.920
create nine patches, sub images from it.

01:12:02.319 --> 01:12:06.880
Each of those patches gets passed

01:12:03.920 --> 01:12:09.440
through a dense layer and out comes an

01:12:06.880 --> 01:12:10.800
embedding. So at that point, any image

01:12:09.439 --> 01:12:13.119
you give me, I'm going to get you

01:12:10.800 --> 01:12:14.719
nine embeddings out of it. And once I

01:12:13.119 --> 01:12:16.000
get the nine embeddings, I just throw it

01:12:14.719 --> 01:12:19.239
into the meat grinder, the transformer

01:12:16.000 --> 01:12:19.238
meat grinder.

01:12:20.079 --> 01:12:25.198
All right. So uh another example I think

01:12:23.760 --> 01:12:27.600
some of you have asked me outside of

01:12:25.198 --> 01:12:30.238
class um how good are transformers for

01:12:27.600 --> 01:12:32.480
structured data tabular data right for

01:12:30.238 --> 01:12:34.399
tabular data in general um things like

01:12:32.479 --> 01:12:36.238
XGBoost, gradient boosting, works really

01:12:34.399 --> 01:12:38.238
really well so it's good to try them

01:12:36.238 --> 01:12:39.519
certainly I don't think transformers and

01:12:38.238 --> 01:12:42.639
deep learning networks have any great

01:12:39.520 --> 01:12:44.400
edge over XGBoost for structured data

01:12:42.640 --> 01:12:46.480
problems so it's worth trying both of

01:12:44.399 --> 01:12:48.719
them however you can use transformers

01:12:46.479 --> 01:12:50.238
for this stuff too so that's called the

01:12:48.719 --> 01:12:52.158
tab transformer, one of the first ones

01:12:50.238 --> 01:12:54.158
to come out, a transformer for

01:12:52.158 --> 01:12:56.799
tabular data, and again it's pretty

01:12:54.158 --> 01:12:58.719
simple. All you do is

01:12:56.800 --> 01:13:00.640
in any kind of input that you have, you

01:12:58.719 --> 01:13:02.560
will have some categorical variables,

01:13:00.640 --> 01:13:04.640
right? Like blood pressure, things like

01:13:02.560 --> 01:13:07.440
that, right? Not blood pressure, bad

01:13:04.640 --> 01:13:10.079
example, gender, right? Um, and so on

01:13:07.439 --> 01:13:12.000
and so forth. And so what you do is you

01:13:10.079 --> 01:13:14.640
take all the categorical features and

01:13:12.000 --> 01:13:16.640
for each categorical feature, you create

01:13:14.640 --> 01:13:18.480
embeddings

01:13:16.640 --> 01:13:20.640
because a categorical feature is just

01:13:18.479 --> 01:13:22.399
text.

01:13:20.640 --> 01:13:23.920
A categorical feature is just text. So

01:13:22.399 --> 01:13:27.920
you can create text embeddings for it.

01:13:23.920 --> 01:13:30.000
No problem. Um,

01:13:27.920 --> 01:13:32.800
and you take all the continuous

01:13:30.000 --> 01:13:34.640
features, right? Cholesterol and blood

01:13:32.800 --> 01:13:36.560
pressure and whatnot, right? To go to

01:13:34.640 --> 01:13:38.560
the heart disease example, and then you

01:13:36.560 --> 01:13:39.840
collect them

01:13:38.560 --> 01:13:41.840
all and just create a vector out of

01:13:39.840 --> 01:13:45.199
them.

01:13:41.840 --> 01:13:47.279
It's just a vector. Okay? Then you run

01:13:45.198 --> 01:13:48.960
these the embeddings for all the

01:13:47.279 --> 01:13:51.599
categorical variables through a nice

01:13:48.960 --> 01:13:52.880
transformer block. And you can see here

01:13:51.600 --> 01:13:54.960
it's exactly the block we have seen

01:13:52.880 --> 01:13:56.319
before. no difference. And then at the

01:13:54.960 --> 01:13:58.000
very end when it comes out of the

01:13:56.319 --> 01:13:59.279
transformer, you take all the contextual

01:13:58.000 --> 01:14:01.119
stuff coming out of the transformer and

01:13:59.279 --> 01:14:03.519
then you concatenate it with the

01:14:01.119 --> 01:14:05.198
continuous features.

01:14:03.520 --> 01:14:07.120
Okay. And then you run it through maybe

01:14:05.198 --> 01:14:09.519
one or more dense layers and boom

01:14:07.119 --> 01:14:11.198
output.
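That input path can be sketched roughly as follows. The feature names, the tiny vocabularies, and the identity stand-in for the transformer blocks are all made up for illustration; a real tab transformer learns the embedding tables and runs actual attention blocks in the middle:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Toy schema: two categorical features plus two continuous ones
# (hypothetical names, in the spirit of the heart-disease example).
vocab = {"gender": ["F", "M"], "smoker": ["no", "yes"]}

# One embedding table per categorical feature: a categorical value is
# just text, so it gets an embedding exactly like a word does.
tables = {f: rng.standard_normal((len(v), d_model)) * 0.02
          for f, v in vocab.items()}

def featurize(cats, conts):
    # Look up an embedding for each categorical value...
    emb = np.stack([tables[f][vocab[f].index(v)] for f, v in cats.items()])
    # ...run those embeddings through the transformer blocks to make
    # them contextual (identity used here as a stand-in)...
    ctx = emb
    # ...then concatenate the flattened contextual embeddings with the
    # raw continuous features; dense layers take it from there.
    return np.concatenate([ctx.reshape(-1), np.asarray(conts, dtype=float)])

x = featurize({"gender": "F", "smoker": "yes"}, [128.0, 203.0])
print(x.shape)   # (2*16 + 2,) = (34,)
```

The point the sketch makes is the one above: only the categorical side needs embeddings and the transformer; the continuous features just ride along into the final dense layers.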

01:14:09.520 --> 01:14:12.880
So this is a tabular data

01:14:11.198 --> 01:14:14.399
transformer. And there are many you know

01:14:12.880 --> 01:14:16.960
refinements improvements over the years

01:14:14.399 --> 01:14:18.879
that have come since then. But the key

01:14:16.960 --> 01:14:21.840
thing I want you to remember from

01:14:18.880 --> 01:14:24.159
here is that categorical variables can

01:14:21.840 --> 01:14:28.800
be very easily represented as

01:14:24.158 --> 01:14:31.039
embeddings. That's the key. Okay. Uh all

01:14:28.800 --> 01:14:32.480
right. So that's that. Now once the

01:14:31.039 --> 01:14:34.479
input has been transformed into sort of

01:14:32.479 --> 01:14:35.839
this common language of embeddings, we

01:14:34.479 --> 01:14:37.279
can process them without changing the

01:14:35.840 --> 01:14:39.600
architecture of the block itself because

01:14:37.279 --> 01:14:40.960
all it wants is embeddings. It's like

01:14:39.600 --> 01:14:42.159
you give me embeddings, I give you

01:14:40.960 --> 01:14:44.399
great contextual embeddings out and

01:14:42.158 --> 01:14:47.439
nobody gets hurt, right? That is the

01:14:44.399 --> 01:14:50.639
deal with the transformer stack. So um

01:14:47.439 --> 01:14:52.559
now, this ability: since

01:14:50.640 --> 01:14:54.480
the transformer is agnostic to the kind

01:14:52.560 --> 01:14:56.640
of input, as long as it comes

01:14:54.479 --> 01:14:58.799
in the form of an embedding, you can use

01:14:56.640 --> 01:15:00.159
it for multimodal data very easily. So

01:14:58.800 --> 01:15:02.079
for example let's say that you have a

01:15:00.158 --> 01:15:03.759
problem in which you have a picture that

01:15:02.079 --> 01:15:05.760
has to be sent in, some text that

01:15:03.760 --> 01:15:08.560
goes in, a bunch of tabular data coming

01:15:05.760 --> 01:15:10.079
in. Well, you take the text and do

01:15:08.560 --> 01:15:11.520
language embeddings like we know how to

01:15:10.079 --> 01:15:12.640
do you take the image and do image

01:15:11.520 --> 01:15:14.640
embeddings like we just saw with the

01:15:12.640 --> 01:15:16.320
vision transformer. You take tabular data

01:15:14.640 --> 01:15:18.719
and do tabular data embeddings like we saw

01:15:16.319 --> 01:15:21.840
with the tab transformer. Once we do it,

01:15:18.719 --> 01:15:23.439
it's all a bunch of embeddings

01:15:21.840 --> 01:15:25.199
and then you attach a little class token

01:15:23.439 --> 01:15:27.839
on top, send it through a bunch of

01:15:25.198 --> 01:15:29.839
transformers blocks and then out comes a

01:15:27.840 --> 01:15:32.319
contextual class token the contextual

01:15:29.840 --> 01:15:36.000
version run it through maybe a sigmoid

01:15:32.319 --> 01:15:38.079
or a softmax predict the label done.
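A minimal sketch of that multimodal assembly step, assuming each modality's encoder has already produced embeddings of the same width (all shapes here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Hypothetical per-modality encoders have already cast their inputs
# into the common language of d_model-dimensional embeddings:
text_tokens  = rng.standard_normal((12, d_model))  # from a text tokenizer
image_tokens = rng.standard_normal((9,  d_model))  # from ViT-style patches
table_tokens = rng.standard_normal((5,  d_model))  # from tab-style features

# Once everything is a bunch of embeddings, the transformer does not
# care which modality each token came from: attach a class token on
# top and send the whole sequence through the blocks as one input.
cls = rng.standard_normal((1, d_model))
sequence = np.concatenate([cls, text_tokens, image_tokens, table_tokens])
print(sequence.shape)  # (27, 64)
```

Inside the blocks, attention then operates across all 27 tokens at once, which is exactly what lets a word in the question attend to a patch in the picture.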

01:15:36.000 --> 01:15:40.960
So this is extremely powerful, its

01:15:38.079 --> 01:15:42.880
ability to handle multimodal data. Okay.

01:15:40.960 --> 01:15:46.079
And that's why for example if you look

01:15:42.880 --> 01:15:48.400
at Google Gemini 1.5 Pro, GPT-4

01:15:46.079 --> 01:15:50.559
Vision, and so on, you can send it images

01:15:48.399 --> 01:15:53.599
and a question and you'll get an answer

01:15:50.560 --> 01:15:55.840
back because every modality that goes in

01:15:53.600 --> 01:15:58.880
is cast into embeddings and once it's

01:15:55.840 --> 01:16:00.159
embedded, once it's embedding-ized,

01:15:58.880 --> 01:16:02.079
then the transformer doesn't care. It'll

01:16:00.158 --> 01:16:04.238
just do its thing.

01:16:02.079 --> 01:16:06.479
It will decide, for example, that this

01:16:04.238 --> 01:16:09.678
word in your question actually is highly

01:16:06.479 --> 01:16:12.479
related to that patch in the picture.

01:16:09.679 --> 01:16:14.640
Right? It'll just figure it out.

01:16:12.479 --> 01:16:16.879
Uh, okay. That's all I had because

01:16:14.640 --> 01:16:18.320
the time is nearing 9:55. Perfect. All

01:16:16.880 --> 01:16:21.640
right, folks. Thanks. Have a great rest

01:16:18.319 --> 01:16:21.639
of your week.
