1:16:46
8: Deep Learning for Natural Language – Transformers, Self-Supervised Learning
MIT OpenCourseWare
·
May 11, 2026
Transcript
0:17
Okay. Uh, all right. So, we'll continue
0:19
with transformers today. Part two. Uh,
0:21
we're going to do the second pass. Uh,
0:23
this is going to be a deeper pass
0:24
through the transformer stack. Um and I
0:27
think maybe the next 30 minutes it's
0:29
potentially the most demanding 30
0:31
minutes of the entire course. Okay, with
0:33
that motivational speech, let's get
0:35
going. Okay, so quick review. Why do we
0:38
want transformers? Because we want u we
0:41
want an architecture that can generate
0:43
output that has the same length as the
0:45
input. Same length. Oh, there it is. Uh
0:48
number two, we want to take the context
0:50
into account and we want to take the
0:51
order into account. And as you saw last
0:53
time, the transformer architecture
0:55
delivers on those three requirements.
0:57
And so uh just a quick review, if you
0:59
have a phrase like "the train left the station",
1:01
we have all these little arrows which
1:03
stand for the standalone or
1:05
uncontextual embeddings. Uh and then
1:08
sometimes this works. So I'm going to
1:09
put it close to me here.
1:12
Okay.
1:13
All right. So if we
1:16
start with either standalone
1:17
embeddings, i.e. the uncontextual
1:19
embeddings, which have been
1:20
pre-trained, or random, it doesn't really
1:22
matter. If you look at the Colab we did
1:25
the other day, we actually just start
1:27
with random weights for the embeddings
1:30
and then we add positional embeddings to
1:32
them. And so for each word here,
1:35
we take its standalone embedding, we
1:38
take its positional embedding, we just
1:39
literally add them up element by
1:41
element, then we get a total embedding,
1:43
and that's called the positional
1:45
embedding of each word. Okay. And then
1:48
that's what we have: positional input
1:49
embeddings. So this whole thing goes
1:51
into this transformer encoder stack and
1:54
what pops out the other end is
1:55
contextual embeddings. Okay. So that's
1:57
the overall flow.
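To make that flow concrete, here is a minimal NumPy sketch of the element-by-element addition just described; the shapes and values are made up for illustration and this is not the lecture's Colab code.

```python
import numpy as np

seq_len, d = 3, 4                        # toy sizes: 3 words, 4-dim embeddings
word_emb = np.random.randn(seq_len, d)   # standalone (uncontextual) embeddings
pos_emb = np.random.randn(seq_len, d)    # one positional embedding per position

# adding them element by element gives the positional input embeddings,
# which is what goes into the transformer encoder stack
x = word_emb + pos_emb
```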
2:01
Now, we applied this transformer stack
2:03
to the word-to-slot classification
2:06
problem where we basically took every
2:08
incoming natural language query that
2:10
comes in. We calculate its positional
2:12
embeddings and then we run it through
2:14
the transformer stack. uh and then we
2:16
get contextual embeddings and then at
2:18
this point, since each embedding that
2:21
comes out needs
2:22
to be classified into one of 125
2:24
possibilities, we run it through a ReLU
2:26
and then we attach a softmax
2:29
to each embedding. Right, this is
2:31
basically what we did last class.
2:33
So this is the transformer encoder.
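As a rough illustration of that head, here is a hypothetical Keras-style sketch (the sizes and names are invented, and this is not the lecture's Colab code): each contextual embedding goes through a ReLU layer and then a 125-way softmax.

```python
import tensorflow as tf
from tensorflow.keras import layers

contextual = tf.keras.Input(shape=(20, 128))         # 20 words, 128-dim embeddings
h = layers.Dense(64, activation="relu")(contextual)  # ReLU layer applied per word
slots = layers.Dense(125, activation="softmax")(h)   # 125-way distribution per word
head = tf.keras.Model(contextual, slots)             # word-to-slot classification head
```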
2:36
Okay, now, actually,
2:39
any questions on this before I continue?
2:48
>> I was wondering, how do you
2:50
decide where to add more self-attention
2:52
and where to add transformer layers? You
2:55
mentioned that GPT-3 has 96 of them.
2:58
>> Yeah. So right, GPT-3 has 96
3:03
transformer blocks. Each one is a block.
3:05
Um, so I think the question goes to do
3:07
you add more attention heads within a
3:09
single block or do you add lots of
3:11
blocks? And both are good things to do.
3:14
Um, what increasing the number of
3:16
attention heads in a block does for you,
3:18
it allows you to pick up more patterns
3:21
at that level of abstraction.
3:23
But if you add more blocks, much like
3:25
later convolutional filters can build on
3:28
earlier convolutional filters, you're
3:30
going up the levels of abstraction. So
3:32
to go to vision for instance you have
3:34
the notion of lines and so on in the
3:36
beginning and then you have a notion of
3:37
edges which are two lines then you have
3:40
you know nose eyes face and so on and so
3:42
forth. So both are worth doing. So
3:45
that's what you typically
3:46
find: people typically have, you know,
3:49
maybe five or six heads, up to
3:52
a dozen heads. We'll see examples of how
3:54
many heads in a couple of architectures
3:55
later on today. And the more you
3:58
go up, the more capable the model
4:01
becomes, as long as you have enough data
4:02
to train it well. So the perennial
4:05
question of do we have enough data to
4:07
train this large model because if you
4:09
don't have enough data we might run into
4:11
overfitting problems and so on. That's
4:12
always the trade-off.
4:14
So okay so here I just want to quickly
4:17
switch to the Colab because we didn't
4:18
get a chance to finish it. I'm not
4:20
going to run it because it's going to
4:22
take some time. So where we left off
4:24
last time.
4:27
Okay. So here we basically took this
4:31
architecture that we just saw on the
4:32
slide and then we essentially wrote it
4:34
as a Keras model and I went through this
4:36
model in the last class so I'm not going
4:37
to go through it all over again. What we
4:39
did not do last class was to actually
4:41
run it. Um and so uh so if you actually
4:44
run it right you can just run it for 10
4:47
epochs just like we normally do. Give it
4:50
data give it a bunch of epochs choose a
4:52
particular batch size. I just
4:53
arbitrarily chose 64. You run it for 10
4:55
epochs and then you evaluate it on the
4:57
test set. You get a 99% accuracy on this
5:00
problem. One transformer stack. That's
5:03
it. One block, rather. One block.
5:05
That's it. And uh of course here there's
5:08
a little trickiness going on here
5:09
because a naive model can literally say
5:12
every word that comes in is other. O.
5:15
And since the O's are the majority of
5:17
the words, it's not going to do badly,
5:19
right? It's like having a classification
5:20
problem in which one class is very
5:22
predominant. So the naive way to
5:25
actually do well is to just say every
5:26
time something comes in, oh it's that
5:27
majority class. The same thing happens.
5:30
But if you then adjust for that, it
5:32
turns out that the accuracy on the non-O
5:34
slots, which is really what you care
5:35
about, is actually 93%.
5:38
Which is actually pretty good. Okay. Uh
5:40
and then I had some examples of, you
5:42
know, lots of fun queries you can do,
5:44
including queries where I try to break
5:45
stuff like cheapest flight to fly from
5:47
MIT to Mars and see what happens, you
5:49
know, things like that. So have fun with
5:50
it. Okay. Um, all right, back to
5:53
PowerPoint.
5:59
So, this is what we had. Now, what we're
6:01
going to do in today's class, we are
6:03
actually going to take the encoder we
6:05
built last time and introduce three new
6:08
complications into it. And when we
6:10
finish introducing these three
6:11
complications, we will actually have the
6:14
actual transformer that was invented in
6:15
the 2017 paper. Okay. All right. Um, the
6:20
first tweak is the hardest tweak. So
6:21
we'll slowly work our way to it. So
6:24
the thing to remember is let's review
6:26
self attention. What is self attention?
6:28
You have a bunch of words and we further
6:30
said that for any particular word like
6:32
station we want to take its positional
6:34
embedding and then make it contextual.
6:36
And the way we do that is by taking each
6:38
word's embedding and then calculating
6:40
these dot products with all the
6:42
other words. And then since these dot
6:44
products can be positive or negative we
6:46
want to make them all positive and
6:48
normalize them so that they nicely add
6:50
up to one. So we then exponentiate them
6:52
and then divide by the total, right?
6:54
Which is basically soft max. And when
6:57
you do that, you have nice fractions
6:59
that add up to one. And then we said,
7:01
well, the contextual embedding for W6 is
7:03
just all these weights S1, S2 all the
7:07
way to S6 multiplied by the original W's
7:10
and then you get the context for W6. So
7:12
this is the basic logic we covered last
7:14
time. Now it is obviously the case that
7:19
we explained it only for one word but we
7:21
have to do the same exact operation for
7:23
every one of the other words too so that
7:25
we could calculate W5 hat, W4 hat, W3
7:28
hat and so on and so forth right so
7:30
there's a lot of computations that are
7:32
going on and they all look kind of
7:34
similar where you got to do a bunch of
7:36
dot products you got to like you know do
7:38
some soft maxing on it and stuff like
7:39
that so the natural question is is there
7:42
a way to organize it very efficiently
7:45
And the short answer is yes. In fact, if
7:46
you could not do that, there wouldn't be
7:48
any transformer revolution. Okay,
7:50
because there is that ability to package
7:52
it up into a very interesting and
7:53
efficient operation that allows you to
7:55
put the whole thing on GPUs.
7:58
Okay, so now I'm going to switch to iPad
8:02
uh and give you some iPad scribblings of
8:04
mine which were concocted last night
8:06
because I was very unhappy with the
8:08
slides that follow. So, we're going to
8:10
do iPad. Okay. All right. So if it
8:14
works, you folks are lucky. If it
8:16
doesn't work, last year's huddle class
8:17
is luckier.
8:21
So let's shift to that.
8:24
All right. So we're going to go here.
8:31
So let's assume we have a simple thing
8:32
like uh oops.
8:37
Okay, instead of you know train left the
8:40
station which is a long sentence, let's
8:41
just say you have a simple sentence like
8:42
"I love huddle." Okay, and so "I love
8:45
huddle" is what you have, and then you
8:47
have these standalone embeddings W1 W2
8:50
W3. Okay, so it comes into the self
8:53
attention layer and let's assume that
8:55
these W1's, W2, W3, they're already
8:58
positionally encoded, right? We have
9:00
already added up the position encoding,
9:02
all that stuff also. It's all behind us.
9:03
That all happens outside the
9:05
transformer. So you get it here.
9:08
Now what you do is you actually make
9:10
three copies of this thing.
9:13
Okay? And let's call this whole thing as
9:15
just X. Okay? I'm just giving it the
9:18
name X. It's a matrix of these three
9:20
vectors. And so the first copy goes up
9:23
here, the second copy goes straight, and
9:25
the third copy goes down. And don't
9:26
worry about the third copy just yet. So
9:29
if you look at the first two copies,
9:31
here is the key thing to focus on. Okay,
9:33
this whole thing here. Remember that we
9:36
want to calculate dot products between
9:37
all these vectors. And basically we want
9:40
to calculate the dot product of every
9:41
pair of vectors, every pair of words.
9:44
The whole point of self attention is
9:46
that every pair of words we figure out
9:47
how attracted or related they are.
9:49
Right? Which means that we have to
9:50
calculate all pairs of dot products. And
9:53
so what you do is you take this vector
9:55
right there, W1 W2 W3. You take this other
9:58
copy that went up. Okay? And then you
10:00
transpose it. So when you transpose it,
10:03
it all becomes nice and vertical like
10:05
that.
10:06
Right? All the vectors came in like
10:08
this. When you transpose, it becomes
10:09
vertical. And now what you do is you
10:12
take each one: you take W1 and you
10:15
multiply it by W1. Then you take W1 · W2,
10:19
W1 · W3. You calculate all those dot
10:22
products like that. And when you do that
10:23
you have these nice cells where every
10:27
pair of words their dot products have
10:29
been calculated in this grid. Okay. And
10:31
the key thing to see here and folks with
10:34
a matrix algebra background will see
10:36
this immediately. All we are doing is we
10:38
are taking this X, which is the matrix
10:40
that came in,
10:42
and then X transpose, which is the matrix that
10:44
we sent up and then brought back
10:46
down. We are basically doing a matrix
10:48
multiplication of X times X transpose. That's all
10:50
we are doing. And when we do that, we're
10:53
getting this nice grid in
10:57
which every pair of words their dot
10:59
products have been calculated for you
11:01
with one matrix multiplication. Boom.
11:03
Done. Okay. Okay, so if you have three
11:05
words, there are nine multiplications,
11:07
right? So if you have a million words,
11:11
that's a lot of multiplications, right?
11:13
One trillion multiplications on the
11:15
order of a trillion. And the reason to
11:18
say order is because you know W1 * W3 is
11:21
the same as W3 * W1. So there's some
11:23
duplication here. So you get this grid,
11:25
okay, in one shot, with one matrix
11:27
multiplication. And then, because each
11:29
of these numbers is just a dot product
11:31
which can be negative or positive, we
11:32
need to softmax it.
11:34
And so what we do is we take all these
11:36
numbers and we put it into a softmax
11:38
function where for each row it
11:40
calculates a softmax. And what do I
11:41
mean by that? It takes each number here
11:44
and does e raised to the
11:46
number. It does it for each of these
11:47
numbers and then divides by the sum of
11:49
those numbers for each row. And when you
11:51
do that okay you can think of this
11:54
operation as softmax applied to X times
11:56
X transpose, and you get this nice little table of
11:59
numbers.
12:01
This table of numbers basically says
12:02
that for the first word, W1,
12:06
you take 0.1 of the first
12:08
one, 0.7 of the second, 0.2 of the
12:11
third, and add them up. We do a weighted
12:14
average. So we have this table here.
12:17
Now the third copy shows up here.
12:20
Okay, it's right there. So we do this times
12:24
that which is just a matrix
12:25
multiplication again. And when we do
12:27
that we get the final contextual
12:29
embeddings. So this, for example, is just
12:31
0.1 · W1
12:34
plus 0.7 · W2
12:36
plus 0.2 · W3, right
12:40
there. And you can see the same logic
12:41
here as well. Okay. And you can read it
12:44
later on. I will post this thing uh to
12:46
make sure you understand exactly how it
12:47
flowed. But the larger point I want you
12:50
to focus on is that the entire self-
12:53
attention operation we just looked at
12:55
here basically is this beautiful
12:58
little compact matrix formula.
13:01
Okay: X comes in, you do X transpose, you do a
13:04
matrix multiplication, you do a softmax
13:06
on top of it, and then multiply by X
13:07
again, and boom, you're done.
13:10
So that is the magic of taking the
13:12
transformer stack and representing it
13:15
using matrix operations, because then it runs
13:17
lightning fast on GPUs.
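Here is a minimal NumPy sketch of that parameter-free self-attention, softmax(X Xᵀ) X, with made-up toy sizes:

```python
import numpy as np

def softmax_rows(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))  # subtract row max for stability
    return e / e.sum(axis=-1, keepdims=True)       # each row sums to one

X = np.random.randn(3, 4)        # three positional embeddings, 4-dim each

scores = X @ X.T                 # all pairwise dot products in one matmul (3x3 grid)
weights = softmax_rows(scores)   # per-row softmax: positive fractions, rows sum to 1
X_hat = weights @ X              # weighted averages = contextual embeddings
```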
13:20
Okay. All right.
13:22
That was the warm-up.
13:24
Now let's crank it up a notch.
13:27
So recall that in the last class I
13:31
talked about the fact that in
13:35
the self-attention operation, the W's are
13:38
coming in and we're doing all this stuff
13:39
with the W's right and then we're
13:41
getting some W hats out but there are no
13:44
parameters
13:46
there's nothing to be learned inside the
13:48
transformer self attention layer right
13:51
there are no weights, there
13:52
are no biases, there are no coefficients.
13:54
So, well, okay, what are we learning then?
13:58
Right? So what we now do is we going to
14:00
make the self attention layer tunable.
14:03
We're going to inject some weights into
14:05
it so that when we train it on an actual
14:07
system, the weights will keep
14:09
changing to adapt itself to the
14:10
particularities of whatever problem
14:12
you're working on. Right? So that takes
14:15
us to the tunable self attention layer.
14:22
Okay? Tunable self attention layer. So
14:25
this is the key thing to keep in mind.
14:28
any questions on this before I continue
14:29
with the tunability thing?
14:34
Okay.
14:37
Is this picture working out by the way?
14:39
Okay.
14:41
Uh all right.
14:44
So what we now do is we have the same
14:46
exact logic as before where we have this
14:48
thing that comes in. Okay. We have this
14:51
input that comes in, and we call it
14:53
X again, this whole matrix of
14:55
embeddings. And then, where before we just sent
14:58
three copies, instead what
15:01
we're going to do is we'll take each
15:02
copy of X and we will actually
15:04
multiply it by a matrix.
15:07
okay this matrix is called the key
15:09
matrix
15:10
Okay, and this matrix of
15:14
numbers is a set of weights that will be learned
15:16
by backprop.
15:18
so basically what we're saying is that
15:20
when this thing comes in let's see if
15:23
there's a way to transform this X into
15:25
some other set of embeddings which may
15:28
be useful for your task. We don't know
15:30
if they're going to be useful, but
15:32
surely giving it a bit more ability to
15:34
have weights which can be learned means
15:36
we're giving it more expressive power,
15:39
more modeling capacity. And whether it
15:41
actually uses the capacity will depend
15:42
on how much data you have and how well
15:44
you train it. And maybe if it's not
15:46
useful, it won't use it. What I mean
15:48
is if transforming X actually doesn't
15:50
really help at all, then this matrix A
15:52
is going to be what?
15:55
it's going to be the identity matrix
15:57
because you basically take the identity and
15:59
multiply by X, and you'll get X again. So
16:01
in the worst case maybe it just says I
16:03
have nothing to learn here but maybe
16:05
there is something you can learn. So so
16:07
that's what we do. So we multiplied by
16:09
this matrix A K and then we come up with
16:12
the same you know some embeddings
16:14
transformed embeddings and we call these
16:16
things K
16:18
okay K. Now this KQV as you will see has
16:22
its origins in the field of
16:24
information retrieval, but I personally
16:26
find that interpretation is not
16:28
super helpful because transformers are
16:30
used for lots of applications outside
16:32
information retrieval. So I'm not going
16:33
to go with that kind of interpretation.
16:35
I'm going to go with interpretation of
16:37
let's make each of these things tunable.
16:39
Okay. And tunability means we need to
16:41
give it weights. All right. So that's
16:42
what we have here. Now the second copy
16:46
we did this with the first copy. Well,
16:47
let's do the same thing with the second
16:48
copy. We'll take the second copy and
16:50
multiply it by some other matrix called
16:51
AQ.
16:53
And when we are done with that, we get
16:54
these embeddings. And we will call these
16:57
embeddings as Q.
17:00
Okay. Now, just like before, we will
17:02
take this thing here and we'll
17:05
transpose it.
17:07
So, it all becomes nice and vertical
17:08
like that. And then we'll do exactly the
17:11
same as before. We'll calculate all
17:12
these pairwise dot products in
17:14
one shot, with one matrix multiplication.
17:16
And because we are calling this Q and we
17:20
are calling this whole thing K, this
17:22
thing just becomes Q · Kᵀ.
17:26
Okay. At the end of it you come up with
17:29
a grid of numbers just like before.
17:31
Okay. And these numbers could be
17:33
negative or positive. So we need to do
17:35
the softmax on them to make sure they
17:36
are well behaved fractions that add up
17:38
to one. So we take this QKᵀ business
17:42
and we just put it
17:44
through a softmax function for each row,
17:48
and when we do that we'll get
17:50
basically a table like the
17:52
ones we saw before. By the way, the
17:54
numbers here are the same just because I
17:55
duplicated them, because I'm lazy; in
17:57
reality, given it has gone through all
17:59
these transformations, the numbers are
18:00
not going to be the same, right? You
18:03
have these numbers and then you take the
18:05
final copy, which is X · AV. Right? Each
18:08
copy is getting multiplied by its own
18:10
matrix. Right? And this copy is being
18:11
multiplied by AV. And let's call this X
18:14
A. Okay? Which is here as just V.
18:19
And so what you have here, this
18:21
softmax(QKᵀ) · V, is exactly the same kind of
18:24
matrix multiplication as we saw before.
18:26
So we have these
18:28
contextual embeddings and that's what's
18:30
coming out of the of the transformer
18:32
block. So now the whole thing we did
18:34
here can be represented
18:36
as softmax(QKᵀ) · V. Okay. So if we
18:42
zoom in a bit. Come on. Okay.
18:47
Okay.
18:49
So X came in.
18:52
Three tracks went here:
18:55
X · AK, X · AQ, X · AV. And this thing
18:59
is called K. This thing is called Q.
19:01
This thing is called V. And then we do
19:03
the same transpose as before. We do the
19:06
dot-product thing to calculate the
19:08
pairwise dot products for everything,
19:09
which is just QKᵀ. We run it through a
19:12
softmax. We get softmax(QKᵀ). We
19:15
multiply it by V to do the final
19:16
weighting, and then boom, the output comes out,
19:18
and that's this function. That's it.
19:22
Okay. So what we have done is we have
19:24
introduced three learnable
19:27
matrices into the self-attention layer.
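Here is the same sketch with the three learnable matrices injected (again a toy NumPy illustration with invented sizes, not the lecture's code):

```python
import numpy as np

def softmax_rows(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d = 4
X = np.random.randn(3, d)             # positional input embeddings, one row per word

# three independent learnable matrices: random at the start, updated by backprop
A_K = np.random.randn(d, d)
A_Q = np.random.randn(d, d)
A_V = np.random.randn(d, d)

K, Q, V = X @ A_K, X @ A_Q, X @ A_V   # the three transformed copies
X_hat = softmax_rows(Q @ K.T) @ V     # softmax(Q Kᵀ) V: tunable contextual embeddings
```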
19:31
Okay. Now,
19:34
okay. Let me just stop there for a sec.
19:35
Questions.
19:37
Yeah.
19:39
[clears throat]
19:39
>> Is there a relationship between AK, AQ,
19:43
and AV?
19:44
>> Independent, independent matrices.
19:47
>> Yes.
19:48
>> Like we have
19:49
>> could you use the microphone please?
19:50
>> Here we have three sets of parameters, K,
19:52
Q and V. If, let's say,
19:55
the total length was, let's
19:58
say,
19:59
50. So you would have 50 for a
20:02
set of parameters like you'll have to
20:04
>> so if you have a 50 if the dimension is
20:07
50 long what is coming in the W's are 50
20:10
long then the key the what comes out of
20:13
it if you want it to be 50 as well so
20:15
this matrix needs to be 50 * 50 2500
20:22
>> U Luna
20:24
>> what are the different things the three
20:27
the three matrices are trying to
20:30
Sorry,
20:30
>> what are the different things that the
20:32
matrices are trying to learn?
20:33
>> We don't know. All we are saying is that
20:35
we have a self attention layer which can
20:37
pay attention to every pair of words.
20:38
But we need to give it some ways to
20:40
transform what is coming in into
20:43
potentially useful things. Right? As to
20:45
their actual usefulness, we'll have to
20:48
figure out if if it actually helps or
20:49
not. And of course, as you know, the
20:51
punch line is that yeah, it helps
20:52
massively. That's why we do it. In
20:54
general, what you will find in the deep
20:55
learning literature is that whenever you
20:57
want to increase the capacity, the
20:58
modeling capacity of a particular model,
21:01
you just take a small piece and inject a
21:03
little matrix multiplication into it.
21:05
You take a vector that's showing up in
21:07
the middle and then you make it run
21:08
through a matrix to get another vector
21:10
and then further after you run it
21:13
through a matrix, you run it through a
21:14
little ReLU as well. Even better. So
21:17
that's how you inject modeling capacity
21:19
into the middle of these networks. Okay?
21:22
And that's what these people are doing
21:23
here. Yeah.
21:26
>> In the last step, you had the matrix V.
21:29
So on the previous example, you had used
21:31
the original matrix X. So could you just
21:33
say why it is not using X? What does
21:35
that mean?
21:36
>> So what we're saying is that in the
21:38
initial version we had three copies and
21:40
we treated them all identically. Now we
21:42
said, well, are there ways to
21:44
transform each copy into some other
21:45
representation which could be useful. So
21:47
we may as well use three different
21:48
matrices for it. Why stop with two?
21:51
There are three opportunities to make
21:52
them more expressive. We'll use all of
21:54
them.
21:56
>> Yeah.
21:59
>> You mentioned that these are kind of
22:02
you're tuning it. You're kind of
22:03
fine-tuning it. Is there any risk?
22:05
>> We're not fine-tuning it. Uh just to be
22:06
clear on the vocabulary here. So
22:09
we have added more weights to make them
22:10
tunable. What that means is that when
22:12
we finally train this entire model,
22:16
remember all the weights are going to be
22:17
updated using back propagation, right?
22:20
In particular, these matrices will also
22:21
get updated using back propagation.
22:23
>> So there's no risk of is there a risk of
22:26
>> there's always the risk of overfitting
22:27
when you add more parameters to a model
22:29
>> which means that you have to look at the
22:31
validation set and all that good stuff.
22:34
We are basically adding more parameters
22:36
in a very interesting way because we
22:39
want to add more capacity to the self
22:40
attention layer. We want to give it a
22:41
more of an ability to learn things from
22:43
the data. Before it could not learn
22:45
anything. It could only do dot products.
22:48
So we want to solve that problem.
22:51
All right, I'm going to continue and
22:52
we'll come back to this. Okay. Um
22:57
So, all right, just for
22:59
fun, I'm going to do this. The
23:01
original paper is called "Attention Is
23:03
All You Need." This is the transformer
23:05
paper.
23:07
You folks should read it at some point.
23:11
Just want to show you something.
23:14
Uh
23:20
You see that? So that is the famous
23:22
transformer formula. Okay. And the only
23:26
thing we ignored is this root of DK
23:29
business in the denominator. I
23:31
wouldn't worry about it. The reason they
23:33
have it is because these soft maxes when
23:35
you have lots of numbers and some
23:37
numbers really really big what's going
23:39
to happen is that all the other numbers
23:41
are going to get squashed to zero. Okay.
23:43
And so to make sure the gradient flows
23:45
properly, they just divide it by a
23:47
particular number to make sure no number
23:49
is too big. Okay, that's an
23:51
important but small bit of a
23:53
technical detail, which is why I ignored
23:54
it in my iPad. But the rest of it, you
23:57
can see, is exactly the formula we
23:59
derived: softmax(QKᵀ) · V.
24:03
Okay, so this is the famous transformer
24:05
formula
24:08
and congratulations now you understand
24:10
it.
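In code, the only change the paper's formula adds to the sketch above is the division by the square root of dk before the softmax (again a toy illustration):

```python
import numpy as np

def softmax_rows(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_k = 4
Q, K, V = (np.random.randn(3, d_k) for _ in range(3))

# dividing by sqrt(d_k) keeps any single dot product from getting so big
# that the softmax squashes all the other weights to zero
X_hat = softmax_rows(Q @ K.T / np.sqrt(d_k)) @ V
```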
24:11
You seem less than fully convinced.
24:14
Okay.
24:17
Yes. Hi iPad.
24:19
Now I have a bunch of slides which I had
24:21
but actually I'll come back to this. I
24:24
had a bunch of other slides. This is
24:25
from last year uh which actually
24:27
explains what I did in the iPad in a
24:28
very different way without using any
24:30
matrices and so on. I was looking at it
24:32
last evening and I was getting very
24:34
annoyed by these slides for some reason
24:36
because I felt that it wasn't really
24:38
conveying the core matrix sort of the
24:40
matrix uh the ability of using matrix
24:43
algebra to to actually do this so
24:45
efficiently and compactly which is why I
24:47
decided to hand-draw this thing on
24:49
the iPad. Okay, but you should read it
24:51
afterwards to make sure that whatever
24:53
you saw on the iPad actually matches
24:55
this. Okay, because two different ways
24:56
of understanding something always helps.
24:58
Um, okay, so this is what we have here now.
25:02
Just to recall:
25:05
by making self-attention tunable, we
25:07
get a very interesting benefit, which is
25:08
that when you have these different
25:10
attention heads, before,
25:13
you could have two attention heads, but
25:14
because there were no parameters inside,
25:16
their outputs would have been identical;
25:19
the inputs are the same for both,
25:21
therefore the outputs would be identical.
25:23
But now, since each attention head
25:25
will have its own A
25:28
matrices,
25:29
the outputs are going to be different.
25:32
That's why it makes sense to do the
25:34
tunability thing, because that's what
25:36
actually makes multiple attention heads
25:37
useful.
25:43
>> Is there actually any relationship
25:44
between AK, AQ and AV, or is the A just
25:47
from a notation standpoint?
25:49
>> Just notation. The thing is, we want to
25:51
use K, Q, V for the resulting matrices, and so I
25:54
had to find something else to use for
25:56
the first ones, and I was like, okay, AQ,
25:58
and we at MIT, we do subscripts and
25:59
superscripts, right? So, yeah.
26:03
>> What is the size of the
26:05
matrices? Are they like square matrices,
26:07
or
26:08
>> Yeah, so typically what happens is,
26:10
you can think
26:12
of it as a hyperparameter in some ways.
26:14
Typically what people do in most
26:15
implementations is that they will
26:17
actually just preserve the size: if
26:19
the incoming embedding is 10, they'll
26:20
make sure the thing coming out
26:22
is also 10. So you just do a 10x10
26:24
matrix to transform it. But the
26:27
value matrix AV, on the other hand,
26:31
there's a bit more technical stuff going
26:32
on where it often tends to be smaller.
26:35
Um so for example let's say that your
26:37
incoming is 100 you do 100 to 100 for
26:39
the key 100 to 100 for the query. But if
26:42
you have say five attention heads, you
26:44
may do 100 to 20 for the V's because
26:47
ultimately all the V's are going to get
26:48
concatenated into another 100 again. So
26:51
I can tell you more offline but fun
26:53
broadly speaking these things tend to
26:55
get transformed; they don't necessarily
26:56
preserve the dimension, 10 in and 10 out.
26:58
Yeah.
27:00
>> So this AQ, these numbers are
27:04
random when you start with it, and then
27:06
you allow it to backprop?
27:07
>> Exactly. Exactly.
27:11
So all right um
27:17
yeah so the values in these matrices are
27:19
weights learned through optimization
27:20
using SGD. Uh and then what that means
27:23
is that
27:25
each of these attention heads now has its own
27:27
copy of these matrices. It has its own
27:29
matrices and over the course of back
27:31
propagation these matrices will look
27:33
very different. Okay. So, important: each
27:36
attention head will have its own set
27:38
of three matrices. So if you have 10
27:40
attention heads 30 matrices will be
27:42
learned.
27:46
>> So by the math it seems like it's
27:48
creating essentially a relationship
27:50
between all of the content being
27:52
ingested, and if
27:54
you're ingesting all the content for
27:56
each attention head are there different
27:58
categories of attention head type that
28:00
you're trying to go after?
28:01
>> Yeah. So basically what we're trying to
28:03
do is to say a particular attention
28:04
head. So in any particular sentence it
28:07
may turn out to be the case that one
28:09
pattern could be about the meanings of
28:10
these words right like the word bank and
28:12
what it means the word station train
28:14
things like that. That's what really
28:15
we've been talking about. But there is a
28:17
whole other pattern to do with grammar
28:19
and tense and things like that. There
28:21
could be another one in terms of tone.
28:23
All those things are very important. And
28:25
a priori we don't know how many such
28:26
patterns exist. Much like in a
28:28
convolutional network, when
28:30
we're designing how many filters to
28:31
have, we don't know how many kinds of
28:33
little things we have to detect, you
28:34
know, vertical line, horizontal line,
28:36
semicircle, quarter circle, stuff like
28:38
that. So, you just give it a lot of
28:39
capacity so that it can learn whatever
28:41
it wants.
28:45
All right. So that is the
28:47
transformer encoder. So, we have done
28:49
the first of the three complications
28:51
needed to make it like industrial
28:53
strength and legit. Uh the second thing
28:56
we do is something called the residual
28:58
connection. So what we do is that
29:02
whatever comes out here right W1 through
29:05
W6 goes in and comes out as W1 hat W2
29:08
and so on and so forth right
29:11
actually sorry what comes out here is
29:13
the hats but what comes out here is some
29:16
intermediate W's right that is what the
29:18
self-attention is going to give you some
29:20
intermediate W's what we do is and
29:22
because what's coming out here these
29:24
vectors are the same length as what goes
29:26
in we can just add them element by
29:28
element
29:29
So we take the input and we actually add
29:32
it to what comes out.
29:35
So why would we want to do that? Why
29:37
would we want to you know go to a lot of
29:39
trouble to process this thing and then
29:41
when it comes out we like literally add
29:43
up the original input? What's like what
29:45
do you think the intuition is?
29:52
So turns out, think of it this way. You
29:56
have a bunch of inputs. You send it to a
29:57
neural network. It transforms it and
30:00
gives you something else. Right? At that
30:02
point, you might be thinking, well,
30:04
everything that
30:06
happens in the network from that point
30:07
onward can no longer see your original
30:10
input. It can only work with the
30:12
transformed input. Right? But what if
30:14
your transformations are not great?
30:17
So as an insurance policy what you can
30:20
do is you can take the the transform
30:22
stuff and you can take the original
30:24
stuff and send both in.
30:27
Right? And this whole thing, you
30:30
can Google it. It's called like a wide
30:31
and deep network and things like that.
30:33
But the whole point is that let's not
30:35
lose the original input anywhere. Let's
30:37
also send it along. But if you keep
30:39
adding the original input to every
30:40
intermediate layer, it's going to get
30:42
longer and longer and longer and bigger,
30:43
which you don't want because you want it
30:44
all to be the same size. So the simplest
30:46
alternative is to just add them up. You
30:49
take the transform stuff and you add the
30:50
original input. You get the same thing
30:52
again. What came
30:54
in, W1, was a 100-long vector, and the
30:57
transformed version is also 100 long. So
31:00
just literally add them up, 100 and 100.
31:02
That's it. You get another 100-long
31:04
vector. So that is what's called a
31:06
residual connection.
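In code, the residual connection is literally one addition (a toy sketch with made-up shapes):

```python
import numpy as np

x_in = np.random.randn(6, 100)           # input to the sub-layer: six 100-long vectors
sublayer_out = np.random.randn(6, 100)   # stand-in for the self-attention output

# same shape in and out, so just add element by element; the original
# input is never lost, it rides along with the transformed version
x_out = x_in + sublayer_out
```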
31:08
Okay. And as it turns out, residual connections
31:12
dramatically improve the gradient flow during back
31:14
propagation, and that's why
31:16
they are very heavily used. And in fact,
31:18
ResNet, which we looked at for computer
31:21
vision, it stands for residual net
31:24
because it was the first network to
31:26
actually figure this out. This
31:29
is not just a transformer thing, by
31:30
the way. It's widely used in you know
31:32
lots of new architectures. The notion of
31:35
a residual connection that's what it
31:36
means. Okay, so we do a residual
31:39
connection and then we come to the final
31:42
tweak which is called layer
31:44
normalization.
31:45
So once we add the residual connection,
31:47
we are going to do something else here
31:48
to these vectors before they continue
31:51
flowing. And what layer normalization does
31:54
is it basically says that
31:57
You will recall from the very
31:59
beginning of the semester I've been
32:00
saying that whatever comes into a neural
32:02
network the inputs let's just really
32:04
make sure that they are all in some sort
32:05
of a narrow, well-defined range; they
32:07
can't be in a big range right so for
32:10
pictures for images we divided every
32:12
number by 255 so that every little pixel
32:15
value is between zero and one okay for
32:18
continuous things like the heart disease
32:20
example we standardized by calculating
32:22
the mean and the standard deviation and
32:24
subtracting the mean and dividing
32:26
by the standard deviation. So when you
32:27
do that all the numbers are going to
32:28
roughly be in the -1 to +1 range. So
32:32
in neural networks, for backprop to
32:35
work really well you have to make sure
32:36
that no numbers get too big that all the
32:39
numbers are always in some sort of a
32:41
narrow range. So what layer
32:43
normalization does is to say you know
32:45
what whatever is coming out here I want
32:48
to make sure none of these numbers are
32:49
too big. I want to make sure they're all
32:51
well behaved in a small range because if
32:53
I don't do that back prop is not going
32:55
to work very well and so
32:59
>> Is this what we do to ensure we don't
33:01
have the problem of vanishing gradients, right?
33:04
>> So, technically, there
33:06
could be two problems: there's an
33:07
exploding gradient and a vanishing
33:09
gradient. Both are bad, and this is a way to
33:10
address it. So you will find a whole
33:12
bunch of something-normalization techniques:
33:15
layer normalization, batch normalization,
33:17
and so on and so forth. All these are
33:19
methods to make sure these numbers stay
33:21
in a small range so it doesn't cause
33:22
gradient issues later.
33:27
All right. So in particular
33:30
what we do, or what happens inside
33:32
this layer normalization, is we
33:35
just calculate the mean and standard
33:36
deviation of every one of these
33:37
embeddings. Okay? Right? If you have
33:39
let's say six embeddings here, we'll
33:41
have six means and six standard
33:42
deviations, right? For each one across
33:43
the rows and then we standardize it.
33:46
Meaning subtract the mean divide by the
33:48
standard deviation. And when you do
33:49
that, all these things are going to be
33:51
nice and small. And then we do this a
33:54
little other thing where we have
33:55
introduced two new parameters to rescale
33:58
it and move it around a little bit just
34:01
because adding more weights always helps
34:03
make these things better. So we add them
34:06
and this gets slightly complicated
34:07
because of the way the dimensions work.
34:09
So I'm not going to spend much time on
34:10
it. Uh and then what comes out the other
34:13
end is a very well-behaved set of
34:15
numbers in a nice and small and narrow
34:16
range.
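Here is a toy NumPy sketch of that computation: one mean and one standard deviation per embedding, then the two extra learned rescale-and-shift parameters.

```python
import numpy as np

X = np.random.randn(6, 100)            # six embeddings, 100 numbers each

mu = X.mean(axis=-1, keepdims=True)    # one mean per embedding (per row)
sigma = X.std(axis=-1, keepdims=True)  # one standard deviation per embedding

X_norm = (X - mu) / (sigma + 1e-5)     # standardize: small, well-behaved numbers

# the two extra learned parameters: rescale (gamma) and shift (beta);
# in a real layer these are learned by backprop, here they start at identity
gamma, beta = np.ones(100), np.zeros(100)
out = gamma * X_norm + beta
```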
34:18
Okay, so this is called layer
34:20
normalization. Um, you can see this link
34:23
to understand it a bit better. Um, and
34:25
we do that as well. So to put it all
34:28
together,
34:30
so this is a transformer encoder where
34:32
we have this multi head attention layer
34:34
where each attention head inside
34:36
of it is tunable with those A matrices,
34:39
and then we have a residual connection.
34:41
We do that and then we do layer norm and
34:43
then we do the same thing in the next
34:45
feed forward layer as well. And then
34:46
boom out pops the output
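Putting it together, here is a hypothetical Keras-style sketch of one encoder block (invented sizes, not the lecture's Colab code): multi-head attention, residual add plus layer norm, then the feed-forward sub-layer with its own residual add plus layer norm.

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.keras.Input(shape=(20, 128))                  # 20 tokens, 128-dim embeddings

# multi-head self-attention: each head carries its own learned K/Q/V matrices
attn = layers.MultiHeadAttention(num_heads=8, key_dim=16)(x, x)
h = layers.LayerNormalization()(x + attn)            # residual add, then layer norm

ff = layers.Dense(512, activation="relu")(h)         # feed-forward sub-layer
ff = layers.Dense(128)(ff)
out = layers.LayerNormalization()(h + ff)            # second residual add + layer norm

block = tf.keras.Model(x, out)                       # same shape in and out: stackable
```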
34:50
>> By that definition, in the multi-head
34:52
attention layer, when it's doing tone and
34:53
everything, theoretically it can pick up even
34:56
the biases or the hate speech aspects
34:59
which come in, and take care of it, right?
35:01
So the model can account for the fact
35:04
that something is biased or something is
35:06
not
35:07
>> Um, the thing is, it's not so much that the
35:09
model is accounting for it; it is capturing
35:11
whatever patterns happen to be inherent
35:13
in the data. Now,
35:16
what you do with that capture is up to
35:18
you. It depends on the actual problem
35:19
you're trying to solve. In particular,
35:21
it is going to capture all the bad stuff
35:23
too, because if your training data has
35:25
a lot of biased stuff in it, toxic
35:27
things in it, dangerous things in it, it
35:29
doesn't have a sense of
35:30
values as to what is good or bad. It's
35:32
just going to pick it up.
35:35
>> Yes.
35:36
>> On that, then, how do you actually make it
35:38
handle those, or how do you mitigate
35:40
the effect of those?
>> That's a whole
35:43
course unto itself, but I'm happy to
35:44
give you pointers offline.
35:47
All right, so this is what we have and
35:50
remember what I said that this is just a
35:52
single transformer block and since what
35:54
comes in and what goes out are the same
35:56
dimensions, we can just stack them one
35:58
after the other, right? It's very
36:00
stackable. You can do it, you can
36:02
multiply, you can you can stack it
36:03
vertically as much as you want. And as I
36:05
mentioned, I think GPD3 has 96 of these
36:08
things stacked one on top of the other.
36:09
Um and so yeah that brings us to that is
36:14
it that is the transformer encoder and
36:15
this exactly maps to that. So basically
36:18
the input embeddings come in you add
36:20
positional embeddings and then you send
36:22
it to say these many attention blocks
36:24
and they all get added up and then it
36:26
comes over the attention block you add
36:28
the add and nom here means add means
36:31
residual connection because you're
36:32
adding the input which is why you have
36:33
this arrow going from the input being
36:36
added there and then you normalize it
36:37
send it along and do it again and out
36:39
comes the output.
36:42
So all right now just to be very clear
36:46
on what is being optimized during back
36:48
propagation in this complex flow right
36:52
Now clearly the embeddings that you
36:54
started out with, both the standalone
36:56
embeddings as well as
36:57
the positional embeddings, those things are
37:00
going to get optimized, right? Those are
37:01
just weights; they're going to get
37:02
optimized. Clearly everything inside the
37:05
transformer encoder block is going to
37:06
get optimized, right? And what are
37:08
they? Well, they are the AK, AQ, AV matrices
37:12
for each attention head. Layer norm has
37:15
parameters as well. The next like the
37:18
little feed forward layer has weights as
37:20
well. All these things are going to get
37:22
optimized and then it goes through this
37:24
ReLU, which again has a bunch of weights.
37:26
It's going to get optimized and then the
37:28
final softmax has a bunch of weights.
37:29
That's going to get optimized.
37:32
All these things are going to get
37:33
optimized by back prop.
37:36
Okay. So in that sense you just step
37:38
back for a second and look at the whole
37:40
thing. It is just a mathematical model
37:41
with a lot of parameters
37:43
and we're just going to use gradient
37:45
descent or stochastic gradient descent to
37:46
optimize it. That's it.
37:49
Yeah.
37:51
>> For those A matrices, when we train the
37:53
model, are we calculating weights for
37:55
like each cell of every possible matrix
37:58
based on the number of inputs like every
38:00
possible dimension up to the max number
38:02
of inputs?
38:04
Um, actually, the weights themselves
38:07
um don't depend on how long your input
38:09
sentence is because remember what we're
38:11
doing is for each sentence that comes in
38:13
let's say the sentence has say three
38:14
words there are three embeddings for
38:16
that sentence each of those embeddings
38:19
gets multiplied by say AK right so AK
38:23
only needs to know how
38:25
long is each embedding it doesn't need
38:27
to know how many words do I have
38:31
and that's a I'm glad you raised that
38:33
question Ben because that's what makes a
38:35
transformer's number of weights
38:37
independent of the number of words in
38:40
your sentence.
38:42
It only depends on the vocabulary that
38:43
you're going to work with because the
38:45
vocabulary determines how many
38:46
embeddings you need.
38:48
The length only matters in
38:51
terms of the positional embedding
38:53
because if you have a thousand long
38:55
sentence, you need a thousand long
38:56
positional embedding matrix. But beyond
38:59
that, it doesn't care.
39:02
And that's why, for example, Google
39:04
Gemini 1.5 Pro can
39:07
accommodate basically a million-long,
39:09
million-token context window, right? It
39:12
can. It's still very compute heavy, but it
39:15
does not change the number of parameters.
39:18
uh yeah
39:20
>> Conceptually, which weights are optimized
39:24
first? Is it in sequential order, or are
39:26
they optimizing the weights at the very
39:28
same time, all
39:29
>> simultaneously because if you think of
39:31
back propagation ultimately you have a
39:34
loss function right and you calculate
39:35
the gradient of that loss function so if
39:38
you have a say a billion parameters that
39:40
gradient is basically a billion long
39:42
vector right and we're going to take the
39:44
gradient and we're going to do w new
39:47
equals w old minus alpha times the
39:49
gradient so all the w's are going to
39:51
update instantaneously
39:53
now the way it actually works in
39:55
computation is,
39:56
because of back propagation,
39:58
it's going to start at the end and
39:59
slowly flow backwards but when it's done
40:01
everything will be updated.
40:03
Yeah.
40:06
>> Say we take two attention heads, and we
40:10
have the matrices AK, AQ and AV in
40:12
them. Why would the parameters of all
40:16
three of them all the weights of the
40:18
three matrices on this side and this
40:19
side would be different because finally
40:21
the things you're inputting from this
40:22
side and the output are the same. So the
40:25
learning process should ideally be the
40:26
same, unlike a CNN where we had put in
40:29
filters which were different. So what
40:31
different thing do we have?
40:32
>> because the initialization is different.
40:35
>> What do you mean?
40:35
>> Like what I mean is if you have two
40:37
heads right each head has three
40:38
matrices. The starting values of those
40:40
six matrices are different.
40:42
>> The starting values of AK, AQ and AV are
40:45
different for both the heads?
40:46
>> right? Much like for all the weights
40:48
typically the values are randomly
40:50
chosen. If they were all the same,
40:53
you're right, it won't make a
40:54
difference right? They will all change
40:56
the same way. Yeah.
40:59
>> Is the input of the transformer the
41:02
sentence, or the array of embeddings
41:06
of each word?
41:08
>> Uh, the transformer itself is
41:10
expecting embeddings in and so what
41:13
basically happens is that we get some
41:14
sentence we run it through a tokenizer
41:16
which converts it to a bunch of tokens
41:18
which are just integers and then it goes
41:20
through the embedding layer which maps
41:22
the integers to these embeddings and
41:24
then you feed it to the transformer. But
41:26
when you do back propagation, it comes
41:28
all the way back to the starting
41:29
embedding layer and updates those
41:31
weights.
41:32
>> Okay. So they can be trainable. So the
41:34
weights at the beginning must be input
41:36
here, but they can train.
41:37
>> They're trainable. Exactly. Exactly.
41:40
>> Uh yeah.
41:41
>> Are the attention heads solely parallel
41:43
or can you have like a stack of
41:45
attention heads?
41:46
>> Typically they are parallelized. Um and
41:49
because you can always stack the block
41:50
itself to get more and more power.
41:54
All right. So now, to apply the
41:57
transformer: the common use
41:59
cases are that you have a whole sentence
42:01
that comes in and you just want to
42:03
classify it, right, the canonical
42:05
thing being, hey, movie sentiment
42:07
classification, boom, positive or negative,
42:09
right? Classification. Another common one
42:11
is labeling, where every word gets
42:13
labeled with a multiclass label, and that's
42:15
basically what we saw with our slot
42:17
filling problem. And then there is
42:19
another thing called sequence generation,
42:20
where you give it a sequence and you want it
42:22
to continue the sequence, right, generate
42:23
more stuff, i.e. large language models
42:25
and all that good stuff. So this we
42:28
already know how to do, because we
42:29
actually literally built a Colab
42:30
with this transformer stack. Now the
42:33
question is how can we do that right?
42:35
How can you do basic classification with
42:37
these things? So now if you again when
42:40
you send a sentence in after all that
42:42
stuff is done and when I say encoder
42:44
here, I'm assuming that you may have
42:46
one block, you may have 106 blocks, I
42:48
don't care at the end of the day you
42:49
send something in you get a bunch of
42:50
contextual embeddings out
42:53
right so at this point we need to take
42:57
these contextual embeddings and somehow
42:58
make it work for classification for just
43:00
classifying something into yes or no
43:02
positive or negative so it'll be nice if
43:05
we can actually take all these
43:06
embeddings and like essentially
43:08
summarize them into a single embedding,
43:10
a single vector
43:12
because if you have a single vector then
43:14
we can run it through maybe a ReLU and
43:16
then we do a sigmoid and boom we can do
43:18
a you know a binary classification
43:19
problem super easy right so this begs
43:22
the question okay how are we going to go
43:23
from all the many blue things to one
43:25
green thing
43:28
Okay, now, of course, what we can do is
43:33
simply average them: we can take
43:36
each of the embeddings and just
43:37
average them element by element, and you'll
43:39
get a nice green thing. Okay. Um any
43:42
shortcomings from doing that?
43:48
>> You would lose the ordering of the
43:50
words.
43:51
>> You do uh well in some sense the
43:53
positional embedding, the positional
43:55
encoding you have in the input does have
43:58
this notion of position, right? So
44:00
you're not necessarily losing the order
44:02
but you're sort of
44:04
averaging all this information into
44:06
something and averaging is going to lose
44:08
some richness.
44:12
Okay.
44:15
>> I think it's going to be skewed to the
44:17
one that has like the biggest number,
44:19
right? So something is influencing your
44:22
>> Yeah, the biggest ones are going to
44:23
dominate. But hopefully we won't have
44:25
too much of that because all the layer
44:27
norm business at the beginning has
44:29
hopefully made sure the numbers are all
44:30
in a reasonably small and well behaved
44:31
range. But the point really is that
44:33
you're going to lose richness in the
44:35
information because you're just like
44:36
mushing it down. So there's a much
44:40
better and more elegant way to do this
44:42
which is that what you do is for every
44:46
sentence when you train it you add an
44:49
artificial token called the class token.
44:52
Okay, literally it's an artificial token
44:54
and it's designated as you know CLS in
44:57
the literature and then this token is
45:00
getting trained with everything else.
45:03
Okay. And so once you finish
45:06
training
45:08
that token has its own embedding too.
45:10
And because it has been trained with
45:13
everything else and this token is
45:15
remember it's a contextual embedding
45:16
which means that it's very much aware of
45:18
all the other words in the sentence.
45:21
So in some sense this CLS
45:23
token's contextual embedding sort of
45:25
captures everything that's going on
45:26
about that sentence
45:29
right and so what we do is once we are
45:31
done training we just grab this thing
45:32
alone and then send that through a ReLU
45:35
and a sigmoid and boom you're done.
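As a sketch, grabbing the CLS embedding and classifying it might look like this in Keras (hypothetical shapes and names; the CLS token is assumed to sit at position 0):

```python
import tensorflow as tf
from tensorflow.keras import layers

enc_out = tf.keras.Input(shape=(20, 128))            # contextual embeddings, CLS first

cls = layers.Lambda(lambda t: t[:, 0, :])(enc_out)   # grab the CLS embedding alone
h = layers.Dense(64, activation="relu")(cls)         # little ReLU layer
yes_no = layers.Dense(1, activation="sigmoid")(h)    # binary (e.g. sentiment) output

classifier = tf.keras.Model(enc_out, yes_no)
```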
45:38
So this is a very clever trick:
45:41
instead of averaging
45:43
everything at the end, let's just have
45:45
something for the whole thing, the
45:46
sentence, and just learn it along
45:48
with everything else. So, a meta
45:50
principle in deep learning is that
45:52
whenever you think you're making an ad
45:54
hoc decision about something like
45:55
averaging a bunch of stuff you should
45:56
always stop and say is there a better
45:59
way to do it where it doesn't have to be
46:00
ad hoc where the right way is learnable
46:02
from the data directly using back
46:04
propagation. Um there was a hand. Yeah.
46:08
>> Is there a reason that you
46:11
added the CLS at the start? Why not add
46:14
it at the end?
46:15
>> You can do it at the end. Is there any
46:16
difference?
46:17
>> Um the only thing to remember is that um
46:19
it's a good question. So different
46:21
sentences are going to be of different
46:22
length, right? So there might be short
46:24
sentences, there might be long
46:25
sentences. In particular, the
46:27
short sentences are going to get padded,
46:29
right? Remember, I talked about padding
46:31
to make it fit to one length. So what
46:34
internally the transformer will do is
46:35
ignore all the padded tokens, because it's
46:37
just padding; it doesn't
46:39
really matter for anything. So if you
46:40
have the CLS at the very end, we'd have
46:42
to do much more administrative
46:44
bookkeeping: take everything but the
46:46
last one,
46:48
ignore it, and only do the last one. It's just
46:50
much easier just to put it in the beginning.
46:52
That's the reason. Yeah.
46:54
>> What would be a practical
46:56
application of this? Would it be something
46:58
like sentiment analysis, like positive
46:59
or negative?
47:00
>> Yeah. So basically any kind of text
47:02
comes in and you want to figure out some
47:04
labeling problem like a classification
47:06
problem. The easiest example I could
47:08
think of was sentiment.
47:09
But you can imagine for example an email
47:12
comes into a like a call center
47:14
operation and you want to take the email
47:16
and automatically figure out which
47:17
department should I send it to.
47:20
Okay. So now, if the input data for a
47:24
task is natural language text, right? We
47:27
don't have to restrict ourselves to only
47:28
the input training data we have. Right?
47:31
Wouldn't it be great to learn from all the
47:32
text that's out there? So, for example,
47:35
to go back to that call center thing I
47:36
just mentioned:
47:39
let's say it's coming in English, and you need the
47:41
ability to take that English email and
47:43
route it to one of 10 things. You know,
47:45
you shouldn't have to learn English just
47:47
for your call center application. You
47:49
should learn English generally and use
47:50
it for other things, right? So, why
47:52
can't we just learn from all the text
47:54
that's out there? And so, that brings us
47:56
to something called self-supervised
47:58
learning. And the idea of self-
48:00
supervised learning is this. So if you
48:02
recall the transfer learning example
48:03
from lecture four, right, where we had
48:05
ResNet: we took ResNet, we
48:08
chopped off the final thing, we made
48:10
it sort of headless, and then we attached
48:13
the output of the headless ResNet to
48:14
a little hidden layer and output, and we
48:17
did the handbags and shoes and you will
48:19
recall that we were able to build a very
48:21
good classifier for handbags and shoes
48:22
with just like a 100 examples. Right? So
48:24
the question is why was this so
48:26
effective? Why was this so effective?
48:29
And turns out the reason why any of this
48:31
stuff actually works is because neural
48:34
networks or they learn representations
48:36
automatically when you train them. So
48:38
what I mean by that is when you imagine
48:40
a network, you feed in a bunch of stuff,
48:42
it goes through all the layers, it comes
48:43
out. Uh you can think of each layer as
48:46
transforming the raw input into some
48:48
different, alternate representation of
48:50
the input. Okay? And these are
48:53
called representations. That's actually
48:54
a technical term. Um, and so,
48:57
from this perspective, when you train a
48:58
neural network, a deep network with lots
49:00
of layers, what you're really learning
49:02
is a way,
49:05
you're learning how to represent the input in
49:07
many different ways. Each of these
49:09
arrows is a different way of
49:10
representing things. Plus, you're
49:11
learning a final regression model,
49:14
either a linear regression model or a
49:15
logistic regression model.
49:16
Fundamentally, that's what's going on.
49:18
Because the final layers tend to be
49:19
sigmoid, soft max, or just linear,
49:21
right? So the final layer if you just
49:24
look at this part alone, whatever is
49:26
coming in it's just going through
49:27
essentially a linear regression model or
49:29
a logistic regression model that's it.
49:31
So fundamentally you're learning
49:32
representations and a final little
49:34
model. Okay. But the reason why all
49:36
these things work so much better than
49:38
logistic regression is because those
49:39
representations have learned all kinds
49:41
of useful things about the input data.
49:43
They have sort of automatically feature
49:45
engineered for you.
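To make that concrete, here is a minimal sketch (not from the class Colab; the layer name and sizes are just illustrative) of pulling an intermediate representation out of a pretrained network and treating it as automatically engineered features:

import torch
from torchvision import models
from torchvision.models.feature_extraction import create_feature_extractor

# A pretrained ResNet; every prefix of its layers acts as an "encoder."
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# Read out the representation just before the final classifier head.
encoder = create_feature_extractor(resnet, return_nodes={"avgpool": "feats"})

x = torch.randn(1, 3, 224, 224)          # a stand-in input image
feats = encoder(x)["feats"].flatten(1)   # a 2048-dim representation
# Feed `feats` into a small logistic-regression head for your own task.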
49:47
So, from this perspective you can
49:50
imagine that each layer here is like an
49:53
encoder. It encodes the input, right?
49:55
The first layer encodes it. The first
49:56
two layers encode something. The first
49:58
three layers encode something and so on
49:59
and so forth. So a deep network contains
50:01
many encoders. And so the question is
50:04
what do these representations actually
50:06
embody right? What do they capture? Is
50:08
it like specific knowledge about the
50:10
particular problem that you trained the
50:12
network on, or is it like
50:14
general knowledge about the input data
50:16
because if it is general knowledge about
50:18
the input, we can use it to solve other,
50:20
unrelated problems. So is it
50:22
specific knowledge or general knowledge
50:24
and it turns out they actually capture a
50:26
lot of general knowledge about the input
50:28
and that's why you can get reuse out of
50:31
them you can reuse them for other
50:33
unrelated things because they have
50:34
captured general stuff. So if you look
50:36
at this, I think I've shown you before,
50:38
right? If you if you look at a network
50:40
that classifies everyday objects into a
50:41
bunch of categories, it can learn all
50:43
these little patterns in the beginning
50:44
and later on and so on and so forth. And
50:46
this is a face detection network. It has
50:48
learned how to look at, you know,
50:50
identify little circles and edges and
50:52
nose like shapes and finally faces. So
50:55
all these things are examples of
50:56
representations, learning interesting
50:57
things about the input. Okay. So since
51:00
these representations are capturing
51:02
intrinsic aspects of the data, you can
51:04
use it for other things, right? You can
51:06
take a face detection neural network and
51:08
use it, reuse it for emotion detection
51:10
for instance.
51:12
Uh, so the question is if you can somehow
51:14
get like an encoder that generates good
51:17
representations for your input data, we
51:19
can simply build a regression model with
51:20
those as input and labels as output and
51:22
be done. And this is exactly what we did
51:24
with ResNet for handbags and shoes. We
51:27
found a thing that had already been
51:28
trained on similar everyday objects,
51:30
everyday images. And the key insight
51:33
here is that since we don't have to
51:35
spend precious data on learning these
51:37
good representations,
51:40
we won't need as much labeled data in the
51:42
first place because the pre-training
51:44
used a lot of data and you're sort of
51:46
piggybacking on that data. So in some
51:48
sense, your training data is everything
51:50
that the pre-trained model was trained
51:51
on plus your little 200 examples.
51:55
Um, okay. So this is what we did. We
51:57
used headless ResNet as an encoder
51:58
that can take raw input and transform it
52:00
into useful representations. Uh this is
52:02
what we did. All right. So the general
52:04
approach is that you find a deep neural
52:06
network built on similar inputs but
52:08
different outputs. Uh and then you
52:10
basically grab maybe the penultimate uh
52:13
representation or the one before that.
52:15
Then you chop off the head. You attach
52:17
your own output head. Train just
52:21
the final layer, or train the
52:23
whole thing if you want. Right? This is
52:25
like the playbook we followed for
52:26
ResNet. The same thing works for all
52:27
kinds of other data types as well.
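As a rough sketch of that playbook in code (PyTorch here, not the exact class Colab; the layer sizes and the binary head are illustrative):

import torch.nn as nn
from torchvision import models

base = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
base.fc = nn.Identity()            # chop off the head
for p in base.parameters():
    p.requires_grad = False        # option 1: train just the new head

model = nn.Sequential(
    base,
    nn.Linear(2048, 64), nn.ReLU(),        # your own little hidden layer
    nn.Linear(64, 1), nn.Sigmoid(),        # e.g., handbag vs. shoe
)
# ...train as usual on your few hundred labeled examples, or flip
# requires_grad back to True to fine-tune the whole thing instead.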
52:30
So now to build such a model, we need
52:32
labeled data, right? We were lucky
52:34
because ResNet was actually trained on
52:35
ImageNet data, which is like a million
52:37
images, each of which is labeled into a
52:39
thousand categories, which is very
52:40
convenient for us, right? But what if
52:44
you want to build a generally useful
52:46
model for text data?
52:49
Clearly we need to collect a lot of text
52:51
data. But that's no problem because
52:52
the internet is full of text data, right? We
52:54
can easily scrape the internet. We can
52:55
just download Wikipedia. So that's not a
52:57
problem. The problem is something else
52:59
which is: how do we define an input-
53:02
label pair for a piece of text? So for an
53:05
input sentence, what should the output
53:07
label be? That's the key question.
53:09
Because if you can answer this question,
53:10
you can just train all these
53:11
things on all kinds of text data, right?
53:14
So there's like a beautiful idea for doing
53:17
this is called self-supervised learning.
53:18
And the key idea is that you take your
53:20
input, whatever the input is you take a
53:23
small part of the input and just remove
53:26
it and then ask your network to fill in
53:28
the blanks from everything else.
53:31
Okay, so this is called masking and it's
53:33
just one of many techniques in
53:35
self-supervised learning, but this is
53:36
very commonly used. So this is the original
53:39
input, right? And then you take it and
53:41
then you just like take this thing in
53:43
the middle here randomly and
53:45
zero it out or mask it. And so this
53:48
incomplete input is now your new input
53:51
and the thing that you took out becomes
53:53
your fake label.
53:56
So you can almost imagine, right, if
53:58
you're baking donuts: you
54:00
make a donut and then you punch a
54:02
hole in the middle of the donut. The
54:04
donut with the hole is your new input; the
54:07
munchkin is the label.
54:11
Am I making everybody hungry at this
54:13
point? So,
54:15
So once you do that, no problem. You
54:17
have an input, you have
54:19
labels, you just train a neural network
54:23
to essentially predict those, to
54:25
basically fill in the blanks.
54:28
And so if for example, if you take a
54:30
sentence like the Sloan School's
54:32
mission, you can just go in there and
54:34
just knock out randomly a bunch of
54:36
words, like this. And the ones I'm
54:39
knocking out, I'm just putting the word
54:40
MASK in, just to show what I'm doing.
54:42
And then, when it's actually given this
54:45
sentence, it will try to fill in the
54:46
blanks with actual words.
54:50
Okay,
54:51
so now for the amazing part. In the
54:53
process of learning to fill in the
54:54
blanks, uh the network learns a really
54:57
good representation of the kind of input
54:58
data it's seeing. And it kind of makes
55:01
sense, right? Because if I give you a
55:02
sentence with a few missing blanks and
55:04
you're able to very successfully fill in
55:06
the blanks, you have learned a whole
55:08
bunch of stuff about the world to be
55:10
able to do that, right? If I say the
55:12
capital of France is ___ and you're
55:14
like Paris, okay, how did you know that?
55:16
It's sort of like that. By learning to
55:18
fill in the blanks, you really have to
55:20
learn how all these things work, all
55:22
the connections between various
55:24
words and so on and so forth. And so
55:27
what you can do is once we build such a
55:29
model, we can just extract an encoder
55:32
from it, right? And then we'll fine-tune
55:34
it like we do with transfer
55:36
learning. But this is how you build a
55:38
generic pre-trained model on
55:41
unlabeled data.
55:43
And so we can use a transformer encoder
55:46
to build this whole thing in the middle
55:48
because remember the transformer can
55:49
take any sentence and give you the same
55:51
size sentence back along with
55:53
predictions for everything. So we can
55:55
just have it take this thing in and ask
55:57
it to just predict all the missing words
55:58
here.
56:01
And
56:03
so uh to put it in other words, masked
56:05
self-supervised learning is just a
56:06
sequence labeling problem.
56:09
So basically this is the sequence that
56:11
comes in, and then you feed it to the
56:13
transformer and you get all these
56:14
embeddings. It goes through all that
56:16
stuff. You really don't care about these
56:18
outputs. But wherever the word MASK went
56:21
in in the input, you basically try
56:23
to get it to predict the right answer, for
56:25
example the word mission, and
56:26
that is the right answer.
56:28
This is the right answer here. And then
56:29
you take these right answers, create a
56:31
loss function, and do back prop and
56:32
boom, you're done.
56:35
Inputs, right answers, and you're in
56:37
business. That's it.
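A minimal sketch of that loop, assuming PyTorch; the tiny encoder here is just a stand-in for the transformer stack, and the ids and sizes are illustrative:

import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, D, MASK_ID = 30522, 768, 103        # BERT-style numbers
embed = nn.Embedding(VOCAB, D)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=12, batch_first=True),
    num_layers=2)
to_vocab = nn.Linear(D, VOCAB)             # per-position word predictions

tokens = torch.randint(1000, VOCAB, (8, 32))   # a batch of "sentences"
mask = torch.rand(tokens.shape) < 0.15         # knock out ~15% of words
inputs = tokens.masked_fill(mask, MASK_ID)     # the donut with the hole
labels = tokens.masked_fill(~mask, -100)       # the munchkin; -100 = ignore

logits = to_vocab(encoder(embed(inputs)))      # (batch, seq, vocab)
loss = F.cross_entropy(logits.reshape(-1, VOCAB),
                       labels.reshape(-1), ignore_index=-100)
loss.backward()                                # inputs, right answers, done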
56:40
Now, if we pre-train a transformer model like this
56:41
on massive amounts of English text,
56:44
let's say we did that. We get something
56:46
called BERT. BERT is a very famous
56:48
transformer model. And BERT was the
56:51
first model actually that Google used to
56:53
upgrade its search in 2019.
56:56
like the Brazil visa example you
56:58
may recall from earlier lectures that
57:00
uses BERT under the hood. Okay. Um and
57:03
so now I just want to show you because
57:06
you can actually read the BERT paper and
57:07
it'll actually make sense to you now
57:09
based on what you have learned in this
57:10
class. Look at this: BERT's model
57:13
architecture is a multi-layer
57:14
bidirectional transformer encoder. Okay,
57:16
transformer encoder. We denote the
57:18
number of layers transformer blocks as
57:20
L. The hidden size is H and the number
57:23
of attention heads as A. And how much is
57:25
that? Uh, okay, H is 768, okay,
57:30
so which means that the embedding sizes
57:34
are 768,
57:36
and the hidden feed-forward layer is
57:38
four times as much, so it's 3072. So
57:41
the 3072 is the feed-forward
57:44
layer; the embeddings are 768. And you can
57:47
see there are two BERT models here this
57:49
one has 12 transformer blocks this one
57:52
has 24 transformer blocks
57:55
Okay, so you can actually read this
57:58
paper. You can actually relate
57:59
it to exactly what we discussed in
58:00
class. It'll all make sense.
58:02
Bidirectional means that the words can
58:04
pay attention to every other word in the
58:06
sentence. And as we will see on Monday,
58:09
there's another
58:10
transformer thing called a causal
58:12
transformer in which you only pay
58:14
attention to the words that came before
58:15
you, not the ones after you. So
58:18
bidirectional means all words are seen.
58:21
Okay. So um, what we do is
58:24
remember we said to solve sequence
58:26
classification you can add a little
58:27
token at the beginning, uh, and then boom,
58:30
use it for classification. As it turns
58:32
out, very conveniently for us, the
58:35
people who built BERT,
58:36
when they trained BERT, they just used
58:38
the CLS business
58:41
during training so it's actually
58:42
available for us out of the box so when
58:44
you use BERT for sequence classification,
58:46
you don't even have to do any surgery on
58:47
it. It just gives you the class token
58:48
automatically which is very convenient
58:51
uh and you can also use it for sequence
58:52
labeling as well. So for sequence
58:55
classifications and sequence labeling uh
58:57
BERT is actually usually a really good
58:58
starting point and in particular there
59:00
have been lots of improvements and
59:02
variations of BERT over the years and if
59:04
you're curious about this there's a
59:05
thing called the sentence transformers
59:07
library which has got a whole bunch of
59:09
BERT related code and resources that you
59:11
can use to do things out of the box.
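For example, here is a minimal sketch of pulling BERT's [CLS] vector out of the box with the transformers library (the model name shown is the standard base checkpoint):

import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

batch = tok(["I loved the movie."], return_tensors="pt")
with torch.no_grad():
    out = bert(**batch)
cls_vec = out.last_hidden_state[:, 0]   # the 768-dim [CLS] embedding
# Attach your own classification head to this and fine-tune for
# sequence classification; use the other positions for labeling.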
59:14
Okay. So okay there's a bit of a word
59:18
wall.
59:20
So to solve any of these problems
59:21
classification or labeling where the
59:23
input is natural language we can
59:24
obviously use a model like BERT: label a
59:27
few hundred examples, attach the right
59:28
final layers, and fine-tune it like we
59:30
did for the ResNet. But if your
59:32
problem is like a standard NLP problem,
59:34
okay you don't even have to do that
59:37
because people for these standard tasks
59:39
they've already pre-trained it on those
59:40
standard tasks right and so you can do
59:43
all these things without any fine tuning
59:44
at all, like literally out of the box, uh,
59:47
and so there are many hubs which have
59:49
these pre-trained models, but perhaps
59:50
the biggest one is the Hugging Face Hub.
59:53
And I checked last night, it has 525,000
59:56
models
59:58
available. I think if I recall last year
1:00:00
when I taught this course, I think the number
1:00:02
was a lot smaller, maybe 50,000. So it's
1:00:04
like growing really, really fast. Um,
1:00:07
and so all right, let's just switch to a
1:00:09
Hugging Face Colab.
1:00:15
So, Hugging Face. How many of you are
1:00:18
familiar with Hugging Face?
1:00:21
Okay, it's good. All right, so um for
1:00:24
the others, basically you have a whole
1:00:26
bunch of pre-trained models on Hugging
1:00:28
Face. You actually have a lot of data
1:00:30
sets you can work with for your own
1:00:32
tasks. Uh there are lots of people
1:00:34
demoing what they have built in this
1:00:37
thing called Spaces and of course a lot
1:00:39
of documentation and so on. So the thing
1:00:40
you can do is what they have done is
1:00:42
they have organized all these models by
1:00:44
the kind of task you can use them for.
1:00:46
So you can see here there are a whole
1:00:47
bunch of computer vision tasks that you
1:00:49
can use them for. There's a whole bunch
1:00:50
of natural language tasks like text
1:00:52
classification
1:00:54
uh feature extraction this and that lots
1:00:56
of interesting examples here. And so
1:00:59
what you do is you just literally can go
1:01:00
in there and say okay I want to do a
1:01:01
text classification. You hit it and then
1:01:03
it tells you all the models that are
1:01:05
available. It turns out there are 50,000 models just
1:01:06
for text classification. And you can
1:01:08
look at okay which is you know most
1:01:10
downloaded or which is the most liked
1:01:11
and then you can just use them as a
1:01:13
starting point for whatever you want to
1:01:14
do. Okay. So that is Hugging Face,
1:01:17
and so the way you do Hugging Face is
1:01:20
I'm just connecting it. Um
1:01:24
if you have a problem in which the input is
1:01:26
natural language text the first question
1:01:28
you have to ask yourself is it standard
1:01:29
or not? Is it a standard task or not? If
1:01:31
it's a standard task, you just go that
1:01:32
route; do not reinvent the wheel. This thing
1:01:34
will usually work pretty well. Okay. So
1:01:37
here we will use this thing called um
1:01:39
the transformers library from Hugging
1:01:41
Face, in particular the pipeline function,
1:01:43
to demonstrate quickly how to do this
1:01:45
thing. Fortunately this library as of
1:01:47
this year is pre-installed in Colab, so
1:01:48
we don't have to install it. We
1:01:50
can just start using it right away. So
1:01:51
we'll take this example where you have a
1:01:53
bunch of text which says um
1:01:57
Dear Amazon, last week I got an Optimus
1:01:59
Prime action figure from your store in
1:02:00
Germany. Unfortunately when I opened the
1:02:01
package, I discovered to my horror that I
1:02:04
had been sent an action figure of
1:02:05
Megatron instead. Can you imagine that
1:02:06
person's like sheer distress at this?
1:02:08
Um, so as a lifelong enemy of the
1:02:10
Decepticons, I hope you can understand
1:02:12
my dilemma. So to resolve the issue, I
1:02:14
demand an exchange. Enclosed are copies; I
1:02:17
expect to hear from you soon. Sincerely,
1:02:19
Bumblebee.
1:02:21
Okay. They should have come
1:02:22
up with a better name for this example.
1:02:24
Uh, all right, cool. So that's the text
1:02:26
we have. So we import this pipeline
1:02:29
function; it's the one that basically gives
1:02:31
you the ability to start
1:02:33
using it out of the box without any training,
1:02:34
nothing like that. Okay, so we download
1:02:36
this thing. Um, oh wow, I got an A100
1:02:40
today. That happens very rarely. All
1:02:42
right, sorry.
1:02:44
So here, let's say you want to classify
1:02:46
that text. Okay, you just want to
1:02:48
classify it for sentiment. You literally
1:02:50
go in there and say pipeline
1:02:52
text classification. That's the task you
1:02:55
want the pipeline to do for you, right?
1:02:57
And you create a classifier. Okay, it's
1:02:59
going to download a bunch of stuff. Uh,
1:03:01
and then so on and so forth.
1:03:04
The first time it just takes time to
1:03:06
download and then you literally take the
1:03:08
text you have here and then run it
1:03:10
through the classifier as if it was just a
1:03:11
little function, right? You get some
1:03:14
outputs, and then you can just display them
1:03:17
this way:
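In code, the whole thing is roughly this (a sketch of the Colab; the complaint text is abbreviated here):

from transformers import pipeline

text = ("Dear Amazon, last week I got an Optimus Prime action figure "
        "from your store in Germany...")    # the full complaint above

classifier = pipeline("text-classification")   # downloads a default model
outputs = classifier(text)
print(outputs)
# e.g. [{'label': 'NEGATIVE', 'score': 0.90...}]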
1:03:19
Negative: the sentiment is negative with 90%
1:03:21
probability. Pretty good, right? Sequence
1:03:23
classification solved. I mean,
1:03:25
sentiment classification solved. So we'll
1:03:27
try a few different examples. Uh, "I hated
1:03:30
the movie." "If I said I loved the movie,
1:03:31
I would be lying." Okay, that's a little
1:03:33
tricky. "The movie left me speechless."
1:03:34
Incredible. And then I had to add this
1:03:36
last thing here last night. Almost but
1:03:38
not quite entirely unlike anything good
1:03:40
I've seen. Okay. And that's not
1:03:42
original. By the way, people who have
1:03:43
read Douglas Adams will know this famous
1:03:44
sentence about somebody drinking some
1:03:46
beverage and saying it's almost but not
1:03:48
quite entirely unlike tea. So I was
1:03:50
inspired by that. So anyway, we'll see
1:03:52
what happens. Um.
1:03:56
All right. Put it in there. Okay. So
1:03:59
negative. I hated the movie. Okay, fine.
1:04:01
"If I said I loved the movie, I'd be lying":
1:04:02
Negative. Movie left me speechless. Uh,
1:04:05
it says it's negative, but it could go
1:04:07
either way, right? A good classifier
1:04:09
would have probably given you a
1:04:09
probability around the 50% mark because
1:04:11
it's sort of right on the fence. Um,
1:04:13
incredible, it's positive, and then it
1:04:15
got fooled by my crazy long sentence and
1:04:17
it says it's positive. Okay, now that's
1:04:20
classification. Here's one other quick
1:04:22
example. So, you can actually give it a
1:04:23
piece of text, right? For example, you
1:04:25
can take like a Reuters news story.
1:04:28
You can feed it and say extract all the
1:04:30
company names from it. Extract company
1:04:32
names, people names and things like
1:04:34
that. It's called named entity
1:04:35
extraction. And back in
1:04:37
the day, people would
1:04:40
painstakingly hand-build all these
1:04:42
very complex systems to do named
1:04:44
entity extraction. Now it's just a
1:04:46
pipeline away. So you can take this
1:04:48
thing and you can say create a pipeline
1:04:50
for named entity extraction, and for any
1:04:53
particular task that you're using there
1:04:54
might be a few additional parameters you
1:04:56
can set right as a part of the
1:04:57
configuration. So we download this
1:05:00
pipeline.
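Sketched in code, reusing the same `text` as before (the aggregation_strategy parameter is the knob that groups word pieces back into whole entity names):

from transformers import pipeline

ner = pipeline("ner", aggregation_strategy="simple")
print(ner(text))
# e.g. [{'entity_group': 'ORG', 'word': 'Amazon', ...},
#       {'entity_group': 'LOC', 'word': 'Germany', ...}, ...]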
1:05:08
Okay, perfect. And then we run the
1:05:11
output. So it says okay good. Amazon is
1:05:14
an organization
1:05:16
uh
1:05:18
and Germany is a location, LOC, which is
1:05:21
nice. So these things have a standard
1:05:22
vocabulary, ORG or LOC, things like
1:05:23
that, which you can read up in the
1:05:24
documentation. Uh and then Bumblebee is
1:05:26
a person. And then, boy, all the
1:05:29
Optimus Prime transformer stuff, that's where
1:05:32
it got fooled, right? It thinks Optimus
1:05:33
Prime is miscellaneous. Uh, Decepticons is
1:05:36
miscellaneous and so on and so forth.
1:05:38
But you get the idea. You can take
1:05:39
standard things like Reuters news stories,
1:05:41
and just like that, you can get
1:05:42
very good entity extraction right
1:05:44
off the bat. And once you get these
1:05:45
entities extracted, then you can put
1:05:47
them into a nice structured data table
1:05:48
like a database and then you can run
1:05:50
traditional machine learning on it.
1:05:53
Okay. Um and then I had I think a few
1:05:55
more examples of question answering and
1:05:58
uh actually let's just try that. um you
1:06:01
can actually give it a thing and ask a
1:06:02
question about it, and it can actually
1:06:03
give you the answer, which gets into the
1:06:07
causal transformer thing that we're
1:06:09
going to see on Monday which builds up
1:06:10
into large language models because you
1:06:12
obviously can
1:06:14
give a passage to ChatGPT and ask a
1:06:16
question, ask it to give you an answer, so
1:06:17
it's really in that vein. But um, just
1:06:19
for fun let's just do that to see if
1:06:20
it's any good. Um, okay: so what does the
1:06:25
customer want? And the output is "an
1:06:27
exchange of Megatron," and it's telling
1:06:29
you where it starts in the text
1:06:32
and where it ends, the relevant passage.
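Again as a sketch, with the same `text` as context:

from transformers import pipeline

reader = pipeline("question-answering")
print(reader(question="What does the customer want?", context=text))
# e.g. {'answer': 'an exchange of Megatron',
#       'start': ..., 'end': ..., 'score': ...}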
1:06:34
It's pretty good, right? So because
1:06:37
remember if you have stuff like this
1:06:39
then when you ask like a large language
1:06:41
model a question it gives you an answer.
1:06:42
You can actually ask it to give you
1:06:44
exactly where in the input it found the
1:06:46
answer, and because you know these things
1:06:48
are going to hallucinate, you can actually
1:06:49
look at the input that it's claiming to
1:06:51
use and look at what it says and see if
1:06:54
they actually match. It's a way to sort
1:06:56
of essentially do QA on LLM output.
1:06:59
Um okay so that's what we have here and
1:07:01
I have a bunch of other stuff, much of which
1:07:03
I'll ignore for the moment because I
1:07:05
want to go back to the PowerPoint.
1:07:07
So yeah so if you have a standard task
1:07:10
uh you know you can just use pipelines
1:07:11
and Hugging Face to actually solve many
1:07:13
of them out of the box without any heavy
1:07:15
lifting. So I mentioned earlier on that
1:07:18
transformers have proven to be effective
1:07:19
for a whole bunch of domains outside of
1:07:21
natural language processing um like you
1:07:24
know speech recognition, computer vision
1:07:26
and so on and so forth. Um and so I want
1:07:29
to give you a couple of quick examples
1:07:30
of how to think about using
1:07:32
transformers for non-text applications.
1:07:35
Okay. So uh, the key insight here is
1:07:39
that the architecture of the transformer
1:07:41
block that we have looked at amazingly
1:07:42
enough can be used as is with no changes
1:07:45
no surgery needed. No clever thinking
1:07:47
required for any particular application.
1:07:49
What is needed where the clever thinking
1:07:51
may be required is you need to take the
1:07:53
inputs that you're working with and you
1:07:55
need to figure out a way to tokenize and
1:07:57
encode them into embeddings
1:07:59
which can then be sent into the
1:08:01
transformer. So all the action is in
1:08:03
taking that input, that non-text input, and
1:08:05
figuring out a way to cast them in the
1:08:07
language of embeddings. That's where the
1:08:09
action is; that's the game. Okay. So um, here is
1:08:12
something called the vision transformer
1:08:14
which is very famous actually. I think
1:08:16
it may be perhaps the first uh
1:08:19
transformer architecture that was
1:08:20
applied to vision problems. So um so
1:08:23
let's say you have a picture um yeah so
1:08:25
let's say you have this picture okay
1:08:28
it is just a picture okay so you have to
1:08:31
find a way to create embeddings from
1:08:33
this picture or to tokenize this picture
1:08:35
in some way. With sentences, you know, "I
1:08:38
love Harvard": well, obviously "I," "love," and "Harvard"
1:08:40
are three tokens; it's pretty trivial to
1:08:41
figure out how to tokenize them but with
1:08:43
a picture what do you do right it's kind
1:08:45
of weird to think of tokenizing a
1:08:47
picture. So what these people did is
1:08:49
they said, you know what, I'm going to take
1:08:51
this picture and chop it up into small
1:08:52
squares.
1:08:54
Right? So in this example, they have
1:08:57
taken this big picture and chopped it up
1:08:58
into nine little pictures. Okay? Then
1:09:02
you can take each of those nine
1:09:03
pictures.
1:09:05
Each of those nine pictures, right? If
1:09:07
you look at how it's represented,
1:09:09
it's just three tables of numbers,
1:09:11
right? The RGB values, right? So you can
1:09:15
take all those numbers and you just
1:09:16
create a giant long vector from it.
1:09:20
Okay? You have a huge long vector and
1:09:22
then you run it through a dense layer to
1:09:26
come up with a smaller vector
1:09:28
and that smaller vector is your
1:09:30
embedding.
1:09:31
That's it. But the way you transform the
1:09:34
long vector into a small vector is just a
1:09:36
dense layer whose weights can be
1:09:37
learned.
1:09:39
So what these people did is they said
1:09:41
well I'm going to first chop it up into
1:09:42
these patches and then I take each patch
1:09:44
and do a linear projection. Right? A
1:09:47
flattened patch is nothing more than
1:09:49
three tables of numbers flattened into a
1:09:50
long vector. That's what the word
1:09:52
flatten here means. And once you flatten
1:09:54
it, I'm just going to run it through a
1:09:56
dense layer. So, by the way, you will
1:09:58
see the words linear projection. It's a
1:09:59
synonym for run it through a dense
1:10:01
layer.
1:10:03
So, you run it through a dense layer,
1:10:05
right? You get these nice vectors, these
1:10:08
vectors.
1:10:09
And now you say, well, you know what? I
1:10:11
have to take the order of these things
1:10:12
into account because clearly this little
1:10:15
patch is in the top left while this
1:10:17
patch is somewhere in the middle. Right?
1:10:18
The order matters in the picture
1:10:20
otherwise every jumbled version is going
1:10:22
to be the same thing. So you use
1:10:24
positional embeddings
1:10:26
you basically say there are nine
1:10:27
positions in any picture right 0 1 2 3 4
1:10:31
5 6 7 8 there are nine positions. So I'm
1:10:33
going to create nine position embeddings
1:10:36
and then I'm just going to add them up,
1:10:39
add them up to
1:10:40
this embedding. Just like we did with
1:10:41
words. With words, each word had an
1:10:44
embedding. Each position had an
1:10:45
embedding. We added them up. Here each
1:10:47
patch has an embedding. The position of
1:10:49
the little patch in the picture has an
1:10:50
embedding. We add them up. Okay? And
1:10:53
then because we want to use it for
1:10:54
classification, no problem. We'll have a
1:10:57
little CLS token
1:11:00
and then we just run it through the
1:11:01
transformer. That's it.
1:11:04
and then you get the CLS token and then
1:11:06
you can attach a softmax to it and say,
1:11:08
"Okay, it's a bird, it's a ball, it's a
1:11:09
car.
1:11:12
That's it. This simple approach actually
1:11:14
works
1:11:16
amazingly enough."
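Here is a minimal sketch of those steps, patches, linear projection, and positional embeddings, in PyTorch (a 3x3 grid over 96x96 images to mirror the nine-patch figure; the real ViT uses many more, smaller patches):

import torch
import torch.nn as nn

B, C, H, W, P, D = 8, 3, 96, 96, 3, 64   # P x P grid, D-dim embeddings
ph, pw = H // P, W // P                  # each patch is 32 x 32

images = torch.randn(B, C, H, W)
# Chop each image into P*P patches, then flatten each patch.
patches = images.unfold(2, ph, ph).unfold(3, pw, pw)    # (B,C,P,P,ph,pw)
patches = patches.reshape(B, C, P * P, ph * pw)
patches = patches.permute(0, 2, 1, 3).reshape(B, P * P, -1)

project = nn.Linear(C * ph * pw, D)      # the "linear projection"
pos = nn.Embedding(P * P, D)             # one embedding per position

tokens = project(patches) + pos(torch.arange(P * P))   # add position info
cls = torch.zeros(B, 1, D)               # a learned [CLS] token in practice
sequence = torch.cat([cls, tokens], dim=1)   # ready for the transformer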
1:11:19
Okay, so that is the vision transformer
1:11:22
and I'm going through it fast just to
1:11:23
give you a sense for how these things
1:11:24
work. Uh any questions? Yeah. Uh my
1:11:29
question is like uh in case of uh text
1:11:31
we had fixed number of tokens that is
1:11:33
amount of words which could be there in
1:11:35
your vocabulary, the English vocabulary,
1:11:37
but here if you look at images they will
1:11:39
probably go into trillions that I know
1:11:41
like we are not talking about one image
1:11:43
but we take a whole set of
1:11:45
images and we try to subset each one of
1:11:47
them; each one would have its own, uh,
1:11:52
own weights, like own parameters. There
1:11:53
is no notion of vocabulary here. All
1:11:56
we're saying is that given any image, we
1:11:58
create nine patches, sub images from it.
1:12:02
Each of those patches gets passed
1:12:03
through a dense layer and out comes an
1:12:06
embedding. So at that point, any image
1:12:09
you give me, I'm going to get you
1:12:10
nine embeddings out of it. And once I
1:12:13
get the nine embeddings, I just throw it
1:12:14
into the meat grinder, the transformer
1:12:16
meat grinder.
1:12:20
All right. So uh another example I think
1:12:23
some of you have asked me outside of
1:12:25
class um how good are transformers for
1:12:27
structured data tabular data right for
1:12:30
tabular data in general um things like
1:12:32
XGBoost, gradient boosting, works really
1:12:34
really well so it's good to try them
1:12:36
certainly I don't think transformers and
1:12:38
deep learning networks have any great
1:12:39
edge over XGBoost for structured data
1:12:42
problems so it's worth trying both of
1:12:44
them however you can use transformers
1:12:46
for this stuff too so that's called the
1:12:48
TabTransformer, one of the first ones
1:12:50
to come out, a transformer for
1:12:52
tabular data, and again it's pretty
1:12:54
simple. All you do is
1:12:56
in any kind of input that you have, you
1:12:58
will have some categorical variables,
1:13:00
right? Like blood pressure, things like
1:13:02
that, right? Not blood pressure, bad
1:13:04
example, gender, right? Um, and so on
1:13:07
and so forth. And so what you do is you
1:13:10
take all the categorical features and
1:13:12
for each categorical feature, you create
1:13:14
embeddings
1:13:16
because a categorical feature is just
1:13:18
text.
1:13:20
A categorical feature is just text. So
1:13:22
you can create text embeddings for it.
1:13:23
No problem. Um,
1:13:27
and you take all the continuous
1:13:30
features, right? Cholesterol and blood
1:13:32
pressure and whatnot, right? To go to
1:13:34
the heart disease example, and then you
1:13:36
just collect them
1:13:38
all and create a vector out of
1:13:39
them.
1:13:41
It's just a vector. Okay? Then you run
1:13:45
the embeddings for all the
1:13:47
categorical variables through a nice
1:13:48
transformer block. And you can see here
1:13:51
it's exactly the block we have seen
1:13:52
before. No difference. And then at the
1:13:54
very end when it comes out of the
1:13:56
transformer, you take all the contextual
1:13:58
stuff coming out of the transformer and
1:13:59
then you concatenate it with the
1:14:01
continuous features.
1:14:03
Okay. And then you run it through maybe
1:14:05
one or more dense layers and boom
1:14:07
output.
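A rough sketch of that flow in PyTorch (column counts and sizes are made up for illustration; the real TabTransformer has more refinements):

import torch
import torch.nn as nn

n_cat, card, D, n_cont, B = 4, 10, 32, 6, 8   # 4 categorical, 6 continuous

embed = nn.Embedding(n_cat * card, D)         # one embedding per level
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True),
    num_layers=2)
head = nn.Sequential(nn.Linear(n_cat * D + n_cont, 64),
                     nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

cats = torch.randint(0, card, (B, n_cat))     # categorical levels as ids
conts = torch.randn(B, n_cont)                # continuous features as-is

offsets = torch.arange(n_cat) * card          # separate id range per column
ctx = encoder(embed(cats + offsets))          # contextual cat embeddings
flat = ctx.reshape(B, -1)                     # flatten the contextual part
prob = head(torch.cat([flat, conts], dim=1))  # concatenate, dense layers, out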
1:14:09
So this is a tabular data
1:14:11
transformer. And there are many you know
1:14:12
refinements and improvements over the years
1:14:14
that have come since then. But the key
1:14:16
thing I want you to remember from
1:14:18
here is that categorical variables can
1:14:21
be very easily represented as
1:14:24
embeddings. That's the key. Okay. Uh all
1:14:28
right. So that's that. Now once the
1:14:31
input has been transformed into sort of
1:14:32
this common language of embeddings, we
1:14:34
can process them without changing the
1:14:35
architecture of the block itself because
1:14:37
all it wants is embeddings. It's like
1:14:39
you give me embeddings, I give you
1:14:40
great contextual embeddings out, and
1:14:42
nobody gets hurt, right? That is the
1:14:44
deal with the transformer stack. So um
1:14:47
now, since
1:14:50
the transformer is agnostic to the kind
1:14:52
of input, as long as it comes
1:14:54
in in the form of an embedding, you can use
1:14:56
it for multimodal data very easily. So
1:14:58
for example let's say that you have a
1:15:00
problem in which you have a picture that
1:15:02
has to be sent in, some text that
1:15:03
goes in, a bunch of tabular data coming
1:15:05
in. Well, you take the text and do
1:15:08
language embeddings like we know how to
1:15:10
do you take the image and do image
1:15:11
embeddings like we just saw with the
1:15:12
vision transformer. You take tabular data
1:15:14
and do tabular embeddings like we saw
1:15:16
with the tab transformer. Once we do it,
1:15:18
it's all a bunch of embeddings
1:15:21
and then you attach a little class token
1:15:23
on top, send it through a bunch of
1:15:25
transformer blocks, and then out comes a
1:15:27
contextual class token, the contextual
1:15:29
version; run it through maybe a sigmoid
1:15:32
or a softmax, predict the label, done.
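Sketched end to end (all shapes illustrative; the random per-modality tensors stand in for the three embedding schemes just described):

import torch
import torch.nn as nn

B, D = 8, 64
text_tokens = torch.randn(B, 12, D)    # from word + position embeddings
image_tokens = torch.randn(B, 9, D)    # from ViT-style patch embeddings
tab_tokens = torch.randn(B, 4, D)      # from categorical embeddings

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True),
    num_layers=2)
head = nn.Linear(D, 1)

cls = torch.zeros(B, 1, D)             # a learned [CLS] token in practice
seq = torch.cat([cls, text_tokens, image_tokens, tab_tokens], dim=1)
prob = torch.sigmoid(head(encoder(seq)[:, 0]))   # classify via the CLS slot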
1:15:36
So this is extremely powerful, its
1:15:38
ability to handle multimodal data. Okay.
1:15:40
And that's why for example if you look
1:15:42
at Google Gemini 1.5 Pro, GPT-4
1:15:46
Vision, and so on, you can send it images
1:15:48
and a question and you'll get an answer
1:15:50
back because every modality that goes in
1:15:53
is cast into embeddings and once it's
1:15:55
embedded, once it's "embeddingized,"
1:15:58
then the transformer doesn't care. It'll
1:16:00
just do its thing.
1:16:02
It will decide, for example, that this
1:16:04
word in your question actually is highly
1:16:06
related to that patch in the picture.
1:16:09
Right? It'll just figure it out.
1:16:12
Uh, okay. That's all I had because
1:16:14
the time is nearing 9:55. Perfect. All
1:16:16
right, folks. Thanks. Have a great rest
1:16:18
of your week.
— end of transcript —