Okay, all right. So we'll continue with transformers today: part two, the second pass. This is going to be a deeper pass through the transformer stack, and I think the next 30 minutes are potentially the most demanding 30 minutes of the entire course. Okay, with that motivational speech, let's get going.

Quick review: why do we want transformers? One, we want an architecture that can generate output that has the same length as the input. Two, we want to take the context into account. Three, we want to take the order into account. And as you saw last time, the transformer architecture delivers on those three requirements. So, just as a quick review: if you have a phrase like "the train left the station," we have these little arrows, which stand for the standalone, or non-contextual, embeddings. (Let me move this closer to me. Okay. All right.)

So we start with standalone, i.e. non-contextual, embeddings, which can be pre-trained or random; it doesn't really matter. If you look at the Colab we did the other day, we actually just start with random weights for the embeddings. Then we add positional embeddings to them: for each word, we take its standalone embedding and its positional embedding and literally add them up, element by element, to get a total embedding. Those are the position-aware input embeddings for each word. That whole thing goes into the transformer encoder stack, and what pops out the other end is contextual embeddings. That's the overall flow.

We applied this transformer stack to the word-to-slot classification problem, where we took every incoming natural-language query, calculated its positional embeddings, and ran it through the transformer stack to get contextual embeddings. At that point, since each embedding that comes out needs to be classified into one of 125 possibilities, we run it through a ReLU layer and then attach a softmax to each embedding. That's basically what we did last class. So this is the transformer encoder. Any questions on this before I continue?

>> I was wondering, how do you decide where to add more self-attention heads and where to add transformer layers? You mentioned GPT has 96 of them.

>> Yeah. So, right, GPT-3 has 96 transformer blocks; each one is a block. I think the question is: do you add more attention heads within a single block, or do you add lots of blocks? And both are good things to do.
What increasing the number of attention heads in a block does for you is that it lets you pick up more patterns at that level of abstraction. But if you add more blocks, then, much like later convolutional filters can build on earlier convolutional filters, you're going up the levels of abstraction. To use vision as an analogy: you have the notion of lines at the beginning, then edges, which are built from lines, then noses, eyes, faces, and so on. So both are worth doing. Typically you find that people have maybe a dozen heads, or somewhere between five or six and a dozen; we'll see how many heads a couple of architectures use later today. And the more you scale up, the more capable the model becomes, as long as you have enough data to train it well. So there's the perennial question of whether you have enough data to train a large model, because if you don't, you may run into overfitting problems. That's always the trade-off.

Okay, so here I just want to quickly switch to the Colab, because we didn't get a chance to finish it. I'm not going to run it, because it takes some time, but here's where we left off last time. We took the architecture we just saw on the slide and wrote it as a Keras model; I went through that model in the last class, so I'm not going to go through it all over again. What we did not do last class was actually run it. If you run it, you just train it for 10 epochs like we normally do: give it data, give it a number of epochs, choose a batch size (I arbitrarily chose 64), run it for 10 epochs, and then evaluate it on the test set. You get 99% accuracy on this problem. One transformer block. That's it.

Of course, there's a little trickiness going on here, because a naive model could literally say that every word that comes in is "other," O. And since the O's are the majority of the words, it's not going to do badly, right? It's like a classification problem in which one class is very predominant: the naive way to do well is to predict the majority class every time. The same thing happens here. But if you adjust for that, it turns out that the accuracy on the non-O slots, which is really what you care about, is 93%, which is actually pretty good. And then I had some examples of fun queries you can try, including queries where I try to break things, like "cheapest flight to fly from MIT to Mars," to see what happens. So have fun with it.
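For reference, here is a minimal sketch of the training-and-evaluation step just described, assuming the Keras slot-filling model from the Colab is already built (called `model` here) and the data has been tokenized, padded, and split. The variable names, loss, and optimizer are placeholders, not necessarily the exact ones in the notebook:

```python
# Hypothetical sketch of the run described above; names are placeholders.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # one of ~125 slot labels per word
              metrics=["accuracy"])

model.fit(X_train, y_train,
          epochs=10,       # ten epochs, as in the lecture
          batch_size=64)   # arbitrarily chosen batch size

loss, acc = model.evaluate(X_test, y_test)  # ~99% overall accuracy reported in class
```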
Okay, back to PowerPoint. So this is what we had. Now, in today's class, we're going to take the encoder we built last time and introduce three new complications into it. When we finish introducing those three complications, we will have the actual transformer that was invented in the 2017 paper. The first tweak is the hardest one, so we'll slowly work our way up to it.

The thing to remember is, let's review self-attention. What is self-attention? You have a bunch of words, and we said that for any particular word, like "station," we want to take its positional embedding and make it contextual. The way we do that is by taking each word's embedding and calculating dot products with all the other words. Since these dot products can be positive or negative, we want to make them all positive and normalize them so that they nicely add up to one. So we exponentiate them and divide by the total, which is basically a softmax. When you do that, you have nice fractions that add up to one. And then we said the contextual embedding for W6 is just all these weights S1, S2, all the way to S6, multiplied by the original W's, and that gives you the contextual embedding for W6. That's the basic logic we covered last time.

Now, obviously we explained it only for one word, but we have to do the exact same operation for every other word too, to calculate W5-hat, W4-hat, W3-hat, and so on. So there are a lot of computations going on, and they all look similar: a bunch of dot products, some softmaxing, and so on. The natural question is: is there a way to organize this very efficiently? And the short answer is yes. In fact, if you could not do that, there wouldn't be any transformer revolution, because it's the ability to package the whole thing into one compact, efficient operation that allows you to put it all on GPUs.

Okay, so now I'm going to switch to the iPad and give you some iPad scribblings of mine, which were concocted last night because I was very unhappy with the slides that follow. If it works, you folks are lucky; if it doesn't, last year's class is luckier. All right, let's shift to that.

Instead of "the train left the station," which is a long sentence, let's just say you have a simple sentence like "I love hodddle." So that's what you have, and then you have these standalone embeddings W1, W2, W3. They come into the self-attention layer, and let's assume that W1, W2, W3 are already positionally encoded.
We have already added in the positional encoding; all of that happens outside the transformer, so it's behind us. You get the embeddings here. Now, what you do is make three copies of this thing. Let's call the whole thing X; it's a matrix of these three vectors. The first copy goes up, the second copy goes straight across, and the third copy goes down. Don't worry about the third copy just yet.

If you look at the first two copies, here is the key thing to focus on. Remember that we want to calculate dot products between all these vectors: for every pair of words, the whole point of self-attention is to figure out how attracted or related they are. Which means we have to calculate all pairs of dot products. So you take this matrix, W1, W2, W3, and you take the copy that went up, and you transpose it, so all the vectors that came in horizontally become vertical. And now you take W1 and dot it with W1, then W1 with W2, then W1 with W3, and so on, calculating all those dot products. When you do that, you get these nice cells where every pair of words has its dot product calculated in a grid. The key thing to see here, and folks with a matrix algebra background will see this immediately, is that all we are doing is taking X, the matrix that came in, and X-transpose, the copy that we sent up and brought back down, and doing one matrix multiplication: X times X-transpose. That's all. And when we do that, we get this nice grid in which every pair of words has its dot product calculated for you with one matrix multiplication. Boom, done.

So if you have three words, there are nine multiplications, right? If you have a million words, that's a lot of dot products: on the order of a trillion. I say "on the order of" because W1 dot W3 is the same as W3 dot W1, so there's some duplication. But you get this grid in one shot, with one matrix multiplication. Then, because each of these numbers is just a dot product, which can be negative or positive, we need to softmax it. So we take all these numbers and put them through a softmax function that works row by row: it takes each number, computes e raised to that number, and divides by the sum of those values across the row.
When you do that — and you can think of this whole operation as softmax applied to X times X-transpose — you get a nice little table of numbers. This table basically says: for the first word, W1, take 0.1 of the first embedding, 0.7 of the second, 0.2 of the third, and add them up; a weighted average. So we have this table, and now the third copy shows up. We multiply this table by that third copy, which is just a matrix multiplication again, and when we do that we get the final contextual embeddings. This one, for example, is just 0.1 times W1 plus 0.7 times W2 plus 0.2 times W3, right there, and you can see the same logic for the other rows as well. You can read it later; I will post this to make sure you understand exactly how it flows. But the larger point I want you to focus on is that the entire self-attention operation we just looked at is this beautifully compact little matrix formula: X comes in, you form X-transpose, you do a matrix multiplication, you take a softmax on top of it, you multiply by X again, and boom, you're done. That is the magic of taking the self-attention operation and representing it with matrix operations, because then it runs lightning fast on GPUs.

Okay. That was the warm-up. Now let's crank it up a notch. Recall that in the last class I pointed out that in this self-attention operation, the W's come in, we do all this stuff with the W's, and we get some W-hats out, but there are no parameters: there is nothing to be learned inside the self-attention layer. There are no weights, no biases, no coefficients. So, well, what are we learning then? What we're going to do now is make the self-attention layer tunable. We're going to inject some weights into it, so that when we train it on an actual system, the weights keep changing to adapt to the particularities of whatever problem you're working on. So that takes us to the tunable self-attention layer. That's the key thing to keep in mind. Any questions on this before I continue with the tunability part? (Is this picture working out, by the way? Okay. All right.)
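Before we make it tunable, here is a minimal numpy sketch of the warm-up version we just walked through: a row-wise softmax of X times X-transpose, multiplied by X again. The three-word sentence and the embedding size are made up purely for illustration:

```python
import numpy as np

def softmax_rows(scores):
    # exponentiate and normalize each row so that it sums to one
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))  # subtract row max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X):
    # X: (num_words, embedding_dim), one row per positionally encoded word
    scores = X @ X.T                 # all pairwise dot products in one matrix multiply
    weights = softmax_rows(scores)   # each row becomes fractions that add up to one
    return weights @ X               # weighted averages = contextual embeddings

X = np.random.randn(3, 4)            # e.g. a three-word sentence with 4-dim embeddings
print(self_attention(X).shape)        # (3, 4): one contextual embedding per word
```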
So what we now do is the same exact logic as before: the input comes in, we call it X again, this whole matrix of embeddings. But instead of just sending three identical copies, we take each copy of X and actually multiply it by a matrix. This first matrix is called the key matrix, and its entries are weights that will be learned by backprop. Basically what we're saying is: when this thing comes in, let's see if there's a way to transform X into some other set of embeddings which may be useful for your task. We don't know if they're going to be useful, but surely giving it weights that can be learned gives it more expressive power, more modeling capacity. Whether it actually uses that capacity depends on how much data you have and how well you train it; maybe, if it's not useful, it won't use it. What I mean is: if transforming X doesn't really help at all, then what is this matrix going to be? The identity matrix, because multiplying by the identity just gives you X back. So in the worst case maybe it says "I have nothing to learn here," but maybe there is something to learn. So we multiply by this matrix A_K, come up with some transformed embeddings, and we call them K.

Now, this K, Q, V terminology, as you will see, has its origins in the field of information retrieval, but I personally find that interpretation not super helpful, because transformers are used for lots of applications outside information retrieval. So I'm not going to go with that interpretation; I'm going to go with the interpretation of "let's make each of these things tunable." And tunability means we need to give it weights. All right, so that's what we have here.

Now, the second copy: we did this with the first copy, so let's do the same thing with the second. We take the second copy and multiply it by some other matrix called A_Q, and when we're done we get another set of embeddings, which we call Q. Just like before, we take this thing and transpose it, so it all becomes nice and vertical, and then we do exactly the same as before: we calculate all the pairwise dot products in one shot, with one matrix multiplication. And because we're calling this Q and that whole thing K, this just becomes Q times K-transpose. At the end of it you get a grid of numbers, just like before. These numbers could be negative or positive, so we need to run the softmax on them to make sure they are well-behaved fractions that add up to one.
So we take this QK-transpose business and run it through a softmax function, row by row, and when we do that we get a table like the ones we saw before. (By the way, the numbers here are the same only because I duplicated them out of laziness; in reality, since the input has gone through all these transformations, the numbers would not be the same.) You have these numbers, and then you take the final copy, which is X times A_V — each copy is getting multiplied by its own matrix, and this copy is multiplied by A_V — and we call the result V. So what you have here, softmax of QK-transpose times V, is exactly the same kind of matrix multiplication as before. We get contextual embeddings, and that's what comes out of the transformer block. So the whole thing we did here can be written as softmax of QK-transpose, times V.

If we zoom in a bit: X came in, and three tracks went out. The first track is X times A_K, the second is X times A_Q, the third is X times A_V; these are called K, Q, and V. Then we do the same transpose as before, we do the dot-product step to calculate the pairwise dot products for everything, which is just QK-transpose, we run it through a softmax to get softmax of QK-transpose, we multiply it by V to do the final weighting, and boom, the output comes out. That's the function. That's it. So what we have done is introduce three learnable matrices into the self-attention layer.

Okay, let me just stop there for a second. Questions? Yeah.

>> Is there a relationship between A_K, A_Q, and A_V?

>> They're independent matrices.

>> Here we have three sets of parameters, for K, Q, and V. If the dimension was, say, 50, would you have 50 parameters per set?

>> So if the dimension is 50, meaning the W's coming in are 50 long, and you want what comes out to also be 50 long, then this matrix needs to be 50 by 50: 2,500 parameters.

>> What are the different things that the three matrices are trying to learn?

>> We don't know. All we are saying is that we have a self-attention layer which can pay attention to every pair of words, but we need to give it some way to transform what is coming in into potentially useful things. As to their actual usefulness, we have to find out whether it actually helps or not. And of course, as you know, the punch line is that yes, it helps massively. That's why we do it.
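In code, the tunable version is a small change to the earlier sketch: each copy of X is first multiplied by its own learnable matrix before the same softmax(QKᵀ)V computation. This reuses `softmax_rows` from the previous sketch, and the sizes are illustrative. The original paper additionally divides the scores by the square root of d_k before the softmax; that scaling is included below with a comment, even though the lecture derivation omits it:

```python
import numpy as np

def tunable_self_attention(X, A_K, A_Q, A_V):
    # X: (num_words, d); A_K, A_Q: (d, d); A_V: (d, d_v) -- all learned by backprop
    K = X @ A_K                               # "key" embeddings
    Q = X @ A_Q                               # "query" embeddings
    V = X @ A_V                               # "value" embeddings
    scores = Q @ K.T                          # pairwise dot products in one matrix multiply
    scores = scores / np.sqrt(K.shape[-1])    # the paper's sqrt(d_k) scaling (omitted in the iPad derivation)
    weights = softmax_rows(scores)            # row-wise softmax, as before
    return weights @ V                        # contextual embeddings out

d = 4
X = np.random.randn(3, d)
A_K, A_Q, A_V = (np.random.randn(d, d) for _ in range(3))  # random initialization, then trained
out = tunable_self_attention(X, A_K, A_Q, A_V)             # shape (3, 4)
```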
In general, what you will find in the deep learning literature is that whenever you want to increase the modeling capacity of a particular model, you take a small piece and inject a little matrix multiplication into it. You take a vector that shows up in the middle, run it through a matrix to get another vector, and then, even better, run it through a little ReLU as well. That's how you inject modeling capacity into the middle of these networks, and that's what's being done here. Yeah?

>> In the last step, you had the matrix V, but in the previous example you used the original matrix X. Could you say why it's not using X? What does that mean?

>> What we're saying is that in the initial version we had three copies and treated them all identically. Now we're asking whether there are ways to transform each copy into some other representation which could be useful, so we may as well use three different matrices. Why stop at two? There are three opportunities to make it more expressive, so we use all of them.

>> You mentioned that you're kind of fine-tuning it. Is there any risk?

>> We're not fine-tuning it, just to be clear on the vocabulary. We have added more weights to make the layer tunable. What that means is that when we finally train the entire model, all the weights get updated using back propagation; in particular, these matrices will also get updated by back propagation.

>> So is there a risk of...

>> There's always a risk of overfitting when you add more parameters to a model, which means you have to look at the validation set and all that good stuff. We are adding more parameters in a very deliberate way because we want to add more capacity to the self-attention layer; we want to give it more of an ability to learn things from the data. Before, it could not learn anything — it could only do dot products — and we want to solve that problem.

All right, I'm going to continue, and we'll come back to this. So, just for fun: the original paper is called "Attention Is All You Need." This is the transformer paper; you folks should read it at some point. I just want to show you something. You see that? That is the famous transformer formula. The only thing we ignored is the square root of d_k in the denominator. I wouldn't worry about it too much. The reason they have it is that with softmaxes, when you have lots of numbers and some of them are really big, all the other numbers get squashed to zero.
So, to make sure the gradient flows properly, they divide by a particular number to ensure nothing gets too big. That's a small but important technical detail, which is why I ignored it on my iPad. The rest of it you can see is exactly the formula we derived: softmax of QK-transpose, times V. So this is the famous transformer formula, and congratulations, now you understand it. You seem less than fully convinced. Okay.

Now, I have a bunch of slides from last year which explain what I did on the iPad in a very different way, without using any matrices. I was looking at them last evening and getting very annoyed, because I felt they weren't conveying the core idea: the ability to use matrix algebra to do all of this so efficiently and compactly. That's why I decided to hand-draw it on the iPad. But you should read those slides afterwards, to make sure that whatever you saw on the iPad matches them, because two different ways of understanding something always helps.

Okay, so, to recall: by making self-attention tunable we get a very interesting benefit. Before, you could have two attention heads, but because there were no parameters inside them, their outputs would have been identical — the inputs are the same for both, therefore the outputs would be identical. But now, since each attention head has its own A_K, A_Q, A_V matrices, the outputs are going to be different. That's why it makes sense to do the tunability thing: that's what actually makes multiple attention heads useful.

>> Is there actually any relationship between A_K, A_Q, and A_V, or is the "A" just notation?

>> Just notation. The thing is, we want to use K, Q, and V for the resulting matrices, so I had to find something else for the matrices that produce them, and I went with A plus subscripts — we're at MIT, we do subscripts and superscripts.

>> What are the sizes of the matrices? Are they square matrices?

>> Yeah, so typically — you can think of it as a hyperparameter in some ways — what people do in most implementations is preserve the size: if the incoming embedding is 10 long, they make sure what comes out is also 10 long, so you use a 10-by-10 matrix to transform it. The value matrix A_V, on the other hand, has a bit more technical stuff going on and often tends to be smaller. For example, say your incoming embedding is 100 long: you do 100-to-100 for the key and 100-to-100 for the query.
But if you have, say, five attention heads, you may do 100-to-20 for the V's, because ultimately all the V's are going to get concatenated back into a 100-long vector. I can tell you more offline, but broadly speaking these matrices tend to preserve the dimension: 10 in, 10 out.

>> So these A_Q numbers are random when you start, and then backprop updates them?

>> Exactly. Exactly.

So, the values in these matrices are weights learned through optimization using SGD. And what that means is that each attention head now has its own copy of these matrices, and over the course of back propagation those matrices will come to look very different. So, important: each attention head has its own set of three matrices. If you have 10 attention heads, 30 matrices will be learned.

>> By the math, it seems like it's creating essentially a relationship between all of the content being ingested. If you're ingesting all the content for each attention head, are there different categories of attention-head type that you're trying to go after?

>> Yeah. So basically, in any particular sentence it may turn out that one pattern is about the meanings of the words — the word "bank" and what it means, the words "station," "train," things like that; that's what we've mostly been talking about. But there is a whole other pattern to do with grammar and tense, and there could be another one about tone. All of those things are important, and a priori we don't know how many such patterns exist. Much like in a convolutional network, when we're deciding how many filters to have, we don't know how many kinds of little things we need to detect — vertical line, horizontal line, semicircle, quarter circle. So you give it a lot of capacity so that it can learn whatever it wants.

All right. So that is the transformer encoder so far: we have done the first of the three complications needed to make it industrial-strength and legit. The second thing we do is something called the residual connection. What happens is that W1 through W6 go in and — actually, sorry, the hats come out at the very end; what comes out of the self-attention layer here are some intermediate W's. And because these vectors are the same length as what goes in, we can add them element by element: we take the input and actually add it to what comes out.
So why would we want to do that? Why go to a lot of trouble to process this thing and then, when it comes out, literally add the original input back on? What do you think the intuition is?

Think of it this way. You have a bunch of inputs. You send them into a neural network; it transforms them and gives you something else. At that point, everything that happens in the network from there onward can no longer see your original input; it can only work with the transformed input. But what if your transformations are not great? So, as an insurance policy, you can take the transformed stuff and the original stuff and send both along. You can Google this — it shows up in ideas like "wide and deep" networks — but the whole point is: let's not lose the original input anywhere; let's send it along too. Now, if you kept carrying the original input along to every intermediate layer, the vectors would get longer and longer, which you don't want, because you want everything to stay the same size. So the simplest alternative is to just add them up: you take the transformed output and add the original input, element by element. What came in, W1, was a 100-long vector, and the transformed version is also 100 long, so you just add them and get another 100-long vector. That is what's called a residual connection. And as it turns out, residual connections improve the gradient flow during back propagation dramatically, which is why they are so heavily used. In fact ResNet, which we looked at for computer vision, stands for "residual network," because it was the first network to really exploit this. It's not just a transformer thing, by the way; the residual connection is widely used in lots of newer architectures. That's what it means.

Okay, so we do a residual connection, and then we come to the final tweak, which is called layer normalization. Once we add the residual connection, we're going to do something else to these vectors before they continue flowing. You will recall that from the very beginning of the semester I've been saying that whatever comes into a neural network — the inputs — should be kept in some sort of narrow, well-defined range; they can't be spread over a huge range. For images, we divided every number by 255 so that every pixel value is between zero and one. For continuous things, like the heart-disease example, we standardized: we calculated the mean and standard deviation, subtracted the mean, and divided by the standard deviation.
When you do that, all the numbers are roughly in the -1 to +1 range. In neural networks, for backprop to work really well, you have to make sure no numbers get too big — that everything stays in some sort of narrow range. So what layer normalization says is: whatever is coming out here, I want to make sure none of these numbers is too big; I want them all well behaved, in a small range, because if I don't do that, backprop is not going to work very well.

>> Is this what we do to avoid the vanishing-gradient problem?

>> So, technically there are two possible problems — exploding gradients and vanishing gradients — and both are bad; this is a way to address them. You will find a whole family of "-normalization" techniques — layer normalization, batch normalization, and so on — and all of them are methods to keep these numbers in a small range so they don't cause gradient issues later.

All right. In particular, what happens inside layer normalization is that we calculate the mean and standard deviation of each of these embeddings. If you have, say, six embeddings here, we'll have six means and six standard deviations — one per row — and then we standardize: subtract the mean, divide by the standard deviation. When you do that, all these numbers become nice and small. Then we do one more little thing: we introduce two new parameters to rescale and shift the result a bit, just because adding more learnable weights tends to help. This gets slightly complicated because of the way the dimensions work, so I'm not going to spend much time on it. What comes out the other end is a very well-behaved set of numbers in a nice, small, narrow range. That's called layer normalization; you can see the link on the slide to understand it a bit better.

So, to put it all together: this is a transformer encoder. We have the multi-head attention layer — each attention head inside it tunable with those A matrices — then a residual connection, then layer norm, and then we do the same thing around the next feed-forward layer as well, and out pops the output.
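Putting the last two tweaks together, the wiring inside one encoder block follows an "add & norm" pattern around each sub-layer. Here is a rough numpy sketch of that pattern, with the attention and feed-forward pieces left abstract; the learnable rescale/shift parameters of layer norm are shown but left at their trivial defaults:

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # normalize each embedding (row) to zero mean / unit std, then rescale and shift
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gamma * (x - mean) / (std + eps) + beta

def encoder_block(x, attention_fn, feed_forward_fn):
    # sub-layer 1: multi-head self-attention, wrapped in residual connection + layer norm
    x = layer_norm(x + attention_fn(x))       # "add" = residual, then "norm"
    # sub-layer 2: feed-forward layer, same add & norm pattern
    x = layer_norm(x + feed_forward_fn(x))
    return x                                   # same shape in and out, so blocks stack
```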
>> By that definition, in the multi-head attention layer, when it's picking up tone and everything, theoretically it can also pick up the biases or the hate-speech aspects that come in, right? So the model can account for the fact that something is biased or something is not?

>> The thing is, it's not so much that the model is "accounting" for it; it's capturing whatever patterns happen to be inherent in the data. What you do with that is up to you — it depends on the actual problem you're trying to solve. In particular, it is going to capture all the bad stuff too: if your training data has a lot of biased, toxic, or dangerous material in it, the model doesn't have a sense of values about what's good or bad; it's just going to pick it up.

>> Then how do you actually mitigate the effect of those?

>> That's a whole course unto itself, but I'm happy to give you pointers offline.

All right. So this is what we have, and remember what I said: this is just a single transformer block, and since what comes in and what goes out have the same dimensions, we can stack them one after the other. It's very stackable — you can stack it vertically as much as you want — and as I mentioned, GPT-3 has 96 of these things stacked one on top of the other. So that is the transformer encoder, and this diagram maps exactly to it: the input embeddings come in, you add positional embeddings, you send them through the attention heads, their outputs get combined, and then comes the "Add & Norm." "Add" means the residual connection, because you're adding the input — which is why you see that arrow going from the input around to be added there — and then you normalize it, send it along, do the same thing again around the feed-forward layer, and out comes the output.

Now, just to be very clear on what is being optimized during back propagation in this complex flow: clearly the embeddings you started out with — both the standalone embeddings and the positional embeddings — get optimized; they're just weights. Clearly everything inside the transformer encoder block gets optimized, and what is that? The A_Q, A_K, A_V matrices for each attention head; layer norm has its parameters as well; the little feed-forward layer has weights as well. Then the output goes through the dense/ReLU layer, which again has a bunch of weights, and the final softmax layer, which has a bunch of weights too. All of these things are going to be optimized by backprop.
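As a rough illustration of what backprop is updating inside a single block, here is a back-of-the-envelope parameter count using the example sizes mentioned earlier (embedding dimension 100; 5 attention heads, each with 100x100 key and query matrices and a 100x20 value matrix). The feed-forward width of 100 and the exact layer-norm accounting are simplifying assumptions for this sketch; the important point is that none of these counts depend on how long the input sentence is:

```python
d_model, n_heads, d_v = 100, 5, 20     # example sizes from the discussion above
ff_width = 100                         # simplifying assumption for this sketch

per_head = 2 * d_model * d_model + d_model * d_v   # A_Q, A_K (100x100 each) + A_V (100x20)
attention = n_heads * per_head                     # each head has its own three matrices
layer_norms = 2 * (2 * d_model)                    # two layer norms, each with rescale + shift vectors
feed_forward = d_model * ff_width + ff_width + ff_width * d_model + d_model  # weights + biases

total = attention + layer_norms + feed_forward
print(total)  # 130,600 here -- and nothing in this count depends on sentence length
```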
Okay. So, in that sense, just step back for a second and look at the whole thing. It is just a mathematical model with a lot of parameters, and we're going to use gradient descent, or stochastic gradient descent, to optimize it. That's it. Yeah?

>> For those A matrices we train in the model, are we calculating weights for each cell of every possible matrix based on the number of inputs — like every possible dimension up to the max number of inputs?

>> Actually, the weights themselves don't depend on how long your input sentence is. Remember what we're doing: for each sentence that comes in — say the sentence has three words — there are three embeddings for that sentence, and each of those embeddings gets multiplied by, say, A_K. So A_K only needs to know how long each embedding is; it doesn't need to know how many words there are. And I'm glad you raised that question, Ben, because that's what makes a transformer's number of weights independent of the number of words in your sentence. It only depends on the vocabulary you're working with, because the vocabulary determines how many embeddings you need. The length only matters for the positional embeddings: if you allow thousand-word sentences, you need a thousand-row positional embedding matrix. Beyond that, it doesn't care. That's why, for example, Google's Gemini 1.5 Pro can accommodate basically a million-token context window: it's still very compute-heavy, but it does not change the number of parameters.

>> Conceptually, which weights are optimized first? Are they optimized in sequential order, or all at the very same time?

>> Simultaneously. If you think of back propagation, ultimately you have a loss function and you calculate the gradient of that loss function. If you have, say, a billion parameters, that gradient is a billion-long vector, and we do w_new = w_old minus alpha times the gradient, so all the w's update together. Now, the way it actually works computationally, because of back propagation, is that it starts at the end and slowly flows backwards, but when it's done, everything has been updated. Yeah?
So what [40:31] different thing we have to [40:32] >> because the initialization is different. [40:35] >> What do we mean? [40:35] >> Like what I mean is if you have two [40:37] heads right each head has three [40:38] matrices. The starting values of those [40:40] six matrix is different. [40:42] >> Starting value of A aka B AQ and A is [40:45] different for both the heads [40:46] >> right? Much like for all the weights [40:48] typically the values are randomly [40:50] chosen. If they were all the same thing [40:53] you're right. It won't you don't make a [40:54] difference right? They will all change [40:56] the same way. Yeah. [40:59] U is the input of the transformer of the [41:02] sentence or the the array of embedding [41:06] of each word. [41:08] >> Uh the in the transformer itself is [41:10] expecting embeddings in and so what [41:13] basically happens is that we get some [41:14] sentence we run it through a tokenizer [41:16] which connects it to a bunch of tokens [41:18] which are just integers and then it goes [41:20] through the embedding layer which maps [41:22] the integers to these embeddings and [41:24] then you feed it to the transformer. But [41:26] when you do back propagation, it comes [41:28] all the way back to the starting [41:29] embedding layer and updates those [41:31] weights. [41:32] >> Okay. So they can be trainable. So the [41:34] twist at the beginning must be input [41:36] here, but they can train. [41:37] >> They're trainable. Exactly. Exactly. [41:40] >> Uh yeah. [41:41] >> Are the attention heads solely parallel [41:43] or can you have like a stack of [41:45] attention heads? [41:46] >> Typically they are parallelized. Um and [41:49] because you can always stack the block [41:50] itself to get more and more power. [41:54] All right. So um so now to apply the [41:57] transformer right there are common use [41:59] cases are that you have a whole sentence [42:01] that comes in and then you just want to [42:03] classify it right the the canonical [42:05] thing being hey movie sentiment [42:07] classification boom positive or negative [42:09] right classification another common one [42:11] is labeling where every word gets [42:13] labeled as a multiclass label and that's [42:15] basically what we saw with our slot [42:17] filling problem and then there is [42:19] another thing called sequence generation [42:20] where you give it a sequence you wanted [42:22] to continue the sequence right generate [42:23] more stuff i.e. large language models [42:25] and all that good stuff. So, so this we [42:28] know already know how to do because we [42:29] actually literally built a collab with [42:30] this with the transformer stack. Now the [42:33] question is how can we do that right? [42:35] How can you do basic classification with [42:37] these things? 
So again, when you send a sentence in — and when I say "encoder" here, you may have one block or you may have 106 blocks, I don't care — at the end of the day, you send something in and you get a bunch of contextual embeddings out. At this point we need to take these contextual embeddings and somehow make them work for classification — for classifying something as yes or no, positive or negative. It would be nice if we could take all these embeddings and essentially summarize them into a single embedding, a single vector, because if you have a single vector, then we can run it through maybe a ReLU and then a sigmoid, and boom, we can do a binary classification problem, super easy. So this begs the question: how do we go from the many blue things to the one green thing?

Of course, one thing we could do is simply average them: take each of the embeddings and average them element by element, and you get a nice green thing. Any shortcomings of doing that?

>> You would lose the ordering of the words.

>> Well, in some sense the positional encoding you added at the input does carry the notion of position, so you're not necessarily losing the order. But you are averaging all this information into one thing, and averaging is going to lose some richness.

>> I think it's going to be skewed toward the one that has the biggest numbers, right? So something is influencing your average disproportionately.

>> Yeah, the biggest ones are going to dominate. Hopefully we won't have too much of that, because all the layer norm business has hopefully kept the numbers in a reasonably small and well-behaved range. But the real point is that you're going to lose richness in the information, because you're just mushing it all down.

So there's a much better and more elegant way to do this: for every sentence, when you train, you add an artificial token called the class token. Literally, it's an artificial token, designated CLS in the literature, and this token gets trained along with everything else. Once you finish training, that token has its own embedding too. And because it has been trained with everything else — and remember, what comes out for it is a contextual embedding, which means it is very much aware of all the other words in the sentence — this CLS token's contextual embedding in some sense captures everything that's going on in that sentence. So what we do, once we're done training, is grab that one embedding alone and send it through a ReLU and a sigmoid, and boom, you're done.
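Here is a small sketch of the two pooling options just discussed, given the matrix of contextual embeddings that comes out of the encoder. It assumes the artificial CLS token was prepended at position 0, as described above; the array here is just a random stand-in for a real encoder's output:

```python
import numpy as np

# contextual: (num_tokens, d) -- encoder output, with the artificial CLS token at position 0
contextual = np.random.randn(7, 100)          # stand-in for a real encoder's output

# Option 1: average all the token embeddings into one vector (loses some richness)
sentence_vec_mean = contextual.mean(axis=0)   # shape (100,)

# Option 2: just take the CLS token's contextual embedding as the sentence summary
sentence_vec_cls = contextual[0]              # shape (100,)

# Either vector can then be fed through a small dense/ReLU layer and a sigmoid
# (or softmax) to do the actual classification.
```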
So this is a very clever trick: instead of averaging everything at the end, let's have one token that stands for the whole sentence and learn it along with everything else. A meta-principle in deep learning is that whenever you think you're making an ad hoc decision about something — like averaging a bunch of stuff — you should stop and ask: is there a better way, where the right answer is learnable from the data directly using back propagation? There was a hand — yeah?

>> Is there a reason you added the CLS at the start? Why not add it at the end?

>> You can do it at the end. Is there any difference? The only thing to remember — it's a good question — is that different sentences are going to be of different lengths. There will be short sentences and long sentences, and in particular the short sentences are going to get padded; remember, I talked about padding to fit everything to one length. Internally, the transformer will ignore all the padded tokens, because padding doesn't matter for anything. So if you put the CLS at the very end, you need a lot more administrative bookkeeping to skip past the padding and pick out the right position; it's just much easier to put it at the beginning. That's the reason.

>> What would be a practical application of this — something like sentiment analysis, positive or negative?

>> Yeah. Basically, any kind of text comes in and you want to solve some labeling problem, like a classification problem. The easiest example I could think of was sentiment, but you can imagine, for example, an email coming into a call-center operation, and you want to automatically figure out which department to send it to.

Okay. So now, if the input data for a task is natural-language text, we don't have to restrict ourselves to only the training data we have, right? Wouldn't it be great to learn from all the text that's out there? For example, to go back to that call-center case: let's say the email is coming in English, and you need the ability to take that English email and route it to one of ten departments. You shouldn't have to learn English just for your call-center application; you should learn English generally and reuse it for other things. So why can't we just learn from all the text that's out there? And that brings us to something called self-supervised learning. The idea of self-supervised learning is this.
So if you [48:02] recall the transfer learning example [48:03] from lecture four, where we had [48:05] ResNet: we took ResNet, we [48:08] chopped off the final layer, we made [48:10] it sort of headless, and then we attached [48:13] the output of the headless ResNet to [48:14] a little hidden layer and an output, and we [48:17] did the handbags and shoes, and you will [48:19] recall that we were able to build a very [48:21] good classifier for handbags and shoes [48:22] with just like a 100 examples. Right? So [48:24] the question is, why was this so [48:26] effective? Why was this so effective? [48:29] And it turns out the reason why any of this [48:31] stuff actually works is because neural [48:34] networks learn representations [48:36] automatically when you train them. So [48:38] what I mean by that is, when you imagine [48:40] a network, you feed in a bunch of stuff, [48:42] it goes through all the layers, it comes [48:43] out. You can think of each layer as [48:46] transforming the raw input into some [48:48] different, alternate representation of [48:50] the input. Okay? And these are [48:53] called representations. That's actually [48:54] a technical term. Um, and so [48:57] from this perspective, when you train a [48:58] neural network, a deep network with lots [49:00] of layers, what you're really learning [49:02] is [49:05] how to represent the input in [49:07] many different ways. Each of these [49:09] arrows is a different way of [49:10] representing things. Plus, you're [49:11] learning a final regression model, [49:14] either a linear regression model or a [49:15] logistic regression model. [49:16] Fundamentally, that's what's going on. [49:18] Because the final layers tend to be [49:19] sigmoid, softmax, or just linear, [49:21] right? So if you just [49:24] look at the final layer alone, whatever is [49:26] coming in is just going through [49:27] essentially a linear regression model or [49:29] a logistic regression model. That's it. [49:31] So fundamentally you're learning [49:32] representations and a final little [49:34] model. Okay. But the reason why all [49:36] these things work so much better than [49:38] logistic regression is because those [49:39] representations have learned all kinds [49:41] of useful things about the input data. [49:43] They have sort of automatically feature [49:45] engineered for you. [49:47] So, from this perspective you can [49:50] imagine that each layer here is like an [49:53] encoder. It encodes the input, right? [49:55] The first layer encodes it. The first [49:56] two layers encode something. The first [49:58] three layers encode something, and so on [49:59] and so forth. So a deep network contains [50:01] many encoders. And so the question is, [50:04] what do these representations actually [50:06] embody? What do they capture? Is [50:08] it specific knowledge about the [50:10] particular problem that you trained the [50:12] network on, or is it [50:14] general knowledge about the input data? [50:16] Because if it is general knowledge about [50:18] the input, we can use it to solve other [50:20] problems, unrelated problems.
So is it [50:22] specific knowledge or general knowledge? [50:24] It turns out they actually capture a [50:26] lot of general knowledge about the input, [50:28] and that's why you can get reuse out of [50:31] them: you can reuse them for other, [50:33] unrelated things, because they have [50:34] captured general stuff. So if you look [50:36] at this, I think I've shown you this before, [50:38] right? If you look at a network [50:40] that classifies everyday objects into a [50:41] bunch of categories, it learns all [50:43] these little patterns in the beginning [50:44] and later on and so on and so forth. And [50:46] this is a face detection network. It has [50:48] learned how to [50:50] identify little circles and edges and [50:52] nose-like shapes and finally faces. So [50:55] all these things are examples of [50:56] representations learning interesting [50:57] things about the input. Okay. So since [51:00] these representations are capturing [51:02] intrinsic aspects of the data, you can [51:04] use them for other things, right? You can [51:06] take a face detection neural network and [51:08] reuse it for emotion detection, [51:10] for instance. [51:12] So the point is, if you can somehow [51:14] get an encoder that generates good [51:17] representations for your input data, we [51:19] can simply build a regression model with [51:20] those as input and labels as output and [51:22] be done. And this is exactly what we did [51:24] with ResNet for handbags and shoes. We [51:27] found a thing that had already been [51:28] trained on similar everyday objects, [51:30] everyday images. And the key insight [51:33] here is that since we don't have to [51:35] spend precious data on learning these [51:37] good representations, [51:40] we won't need as much labeled data in [51:42] the first place, because the pre-training [51:44] used a lot of data and you're sort of [51:46] piggybacking on that data. So in some [51:48] sense, your training data is everything [51:50] that the pre-trained model was trained [51:51] on plus your little 200 examples. [51:55] Um, okay. So this is what we did. We [51:57] used headless ResNet as an encoder [51:58] that can take raw input and transform it [52:00] into useful representations. [52:02] All right. So the general [52:04] approach is that you find a deep neural [52:06] network built on similar inputs but [52:08] different outputs. And then you [52:10] basically grab maybe the penultimate [52:13] representation, or the one before that. [52:15] Then you chop off the head. You attach [52:17] your own output head. Train [52:21] just the final layer, or train the [52:23] whole thing if you want. Right? This is [52:25] the playbook we followed for [52:26] ResNet, and the same thing works for all [52:27] kinds of other data types as well; a short sketch of this playbook appears below. So [52:30] now, to build such a model we need [52:32] labeled data, right? We were lucky, [52:34] because ResNet was actually trained on [52:35] ImageNet data, which is like a million [52:37] images, each of which is labeled into a [52:39] thousand categories, which is very [52:40] convenient for us, right? But what if [52:44] you want to build a generally useful [52:46] model for text data? [52:49] Clearly we need to collect a lot of text [52:51] data. But that's no problem, because the [52:52] internet is full of text data, right? We [52:54] can easily scrape the internet. We can [52:55] just download Wikipedia. So that's not a [52:57] problem.
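As referenced above, here is a minimal Keras sketch of that chop-off-the-head playbook, assuming a ResNet50 pretrained on ImageNet as the headless encoder; the input size, head layers, and the handbags-versus-shoes framing are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hedged sketch of the transfer-learning playbook: grab a pretrained
# encoder, drop its head, attach a small new head, and train that.

base = tf.keras.applications.ResNet50(include_top=False,   # "chop off the head"
                                      weights="imagenet",
                                      pooling="avg")        # penultimate representation
base.trainable = False                                       # train only our new head

inputs = tf.keras.Input(shape=(224, 224, 3))
features = base(inputs, training=False)
x = layers.Dense(64, activation="relu")(features)
outputs = layers.Dense(1, activation="sigmoid")(x)           # e.g. handbag vs. shoe

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, epochs=10) with your couple hundred labeled examples.
```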
The problem is something else, [52:59] which is: how do we define a [53:02] label for a piece of text? So for an [53:05] input sentence, what should the output [53:07] label be? That's the key question. [53:09] Because if you can answer this question, [53:10] you can just go train all these [53:11] things on all kinds of text data, right? [53:14] So a beautiful idea for doing [53:17] this is called self-supervised learning. [53:18] And the key idea is that you take your [53:20] input, whatever the input is, you take a [53:23] small part of the input and just remove [53:26] it, and then ask your network to fill in [53:28] the blanks from everything else. [53:31] Okay, so this is called masking, and it's [53:33] just one of many techniques in [53:35] self-supervised learning, but it is [53:36] very commonly used. So this is the original [53:39] input, right? And then you take it and [53:41] you just take this thing in [53:43] the middle here, randomly, and [53:45] zero it out, or mask it. And so this [53:48] incomplete input is now your new input, [53:51] and the thing that you took out becomes [53:53] your fake label. [53:56] So you can almost imagine, right, if [53:58] you're baking donuts, you [54:00] make a donut and then you punch a [54:02] hole in the middle of the donut: the [54:04] donut with the hole is your new input, the [54:07] munchkin is the label. [54:11] Am I making everybody hungry at this [54:13] point? So, [54:15] once you do that, no problem. You [54:17] have inputs, you have [54:19] labels, you just train a neural network [54:23] to [54:25] basically fill in the blanks. [54:28] And so, for example, if you take a [54:30] sentence like the Sloan School's [54:32] mission, you can just go in there and [54:34] randomly knock out a bunch of [54:36] words, like this. And in the ones I'm [54:39] knocking out, I'm just putting the word [54:40] MASK, just to show what I'm doing. [54:42] And then, given this [54:45] sentence, the network will try to fill in the [54:46] blanks with actual words. [54:50] Okay, [54:51] so now for the amazing part. In the [54:53] process of learning to fill in the [54:54] blanks, the network learns a really [54:57] good representation of the kind of input [54:58] data it's seeing. And it kind of makes [55:01] sense, right? Because if I give you a [55:02] sentence with a few missing blanks and [55:04] you're able to very successfully fill in [55:06] the blanks, you have learned a whole [55:08] bunch of stuff about the world to be [55:10] able to do that, right? If I say the [55:12] capital of France is blank and you're [55:14] like Paris, okay, how did you know that? [55:16] It's sort of like that. By learning to [55:18] fill in the blanks, you really have to [55:20] learn how all these things work, all [55:22] the connections between various [55:24] words and so on and so forth. So, what [55:27] you can do is, once we build such a [55:29] model, we can just extract an encoder [55:32] from it, right? And then we'll fine-tune [55:34] it like we do with regular transfer [55:36] learning. But this is how you build a [55:38] generic pre-trained model on [55:41] unlabeled data.
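Here is a toy sketch of the masking step just described: randomly remove some tokens and keep what was removed as the label. The 15% masking rate and the [MASK] string are conventions borrowed from BERT-style training, not requirements; the sentence is just the example from a moment ago.

```python
import random

def mask_sentence(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Randomly replace some tokens with [MASK]; keep the originals as labels."""
    masked_input, labels = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            masked_input.append(mask_token)   # the donut with the hole
            labels.append(tok)                # the munchkin
        else:
            masked_input.append(tok)
            labels.append(None)               # nothing to predict at this position
    return masked_input, labels

tokens = "the capital of france is paris".split()
x, y = mask_sentence(tokens)
print(x)   # e.g. ['the', 'capital', 'of', 'france', 'is', '[MASK]']
print(y)   # e.g. [None, None, None, None, None, 'paris']
```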
[55:43] And so we can use a transformer encoder [55:46] to build this whole thing in the middle, [55:48] because remember, the transformer can [55:49] take any sentence and give you the same [55:51] size sentence back, along with [55:53] predictions for everything. So we can [55:55] just have it take this thing in and ask [55:57] it to predict all the missing words [55:58] here. [56:01] And [56:03] so, to put it in other words, masked [56:05] self-supervised learning is just a [56:06] sequence labeling problem. [56:09] So basically this is the sequence that [56:11] comes in, you run it through the [56:13] transformer and you get all these [56:14] embeddings. It goes through all that [56:16] stuff. You really don't care about these [56:18] outputs. But wherever the word MASK went [56:21] in in the input, you basically try [56:23] to get it to produce the right answer, which is, for [56:25] example, the word mission. [56:26] That is the right answer. [56:28] This is the right answer here. And then [56:29] you take these right answers, create a [56:31] loss function, and do backprop, and [56:32] boom, you're done. [56:35] Inputs, right answers, and you're in [56:37] business. That's it. Now, if we [56:40] pre-train a transformer model like this [56:41] on massive amounts of English text, [56:44] let's say we did that, we get something [56:46] called BERT. BERT is a very famous [56:48] transformer model. And BERT was the [56:51] first model actually that Google used to [56:53] upgrade its search in 2019; [56:56] the Brazil visa example you [56:58] may recall from earlier lectures, that [57:00] uses BERT under the hood. Okay. Um, and [57:03] so now I just want to show you, because [57:06] you can actually read the BERT paper and [57:07] it'll actually make sense to you now, [57:09] based on what you have learned in this [57:10] class. Look at this: BERT's model [57:13] architecture is a multi-layer [57:14] bidirectional transformer encoder. Okay, [57:16] transformer encoder. We denote the [57:18] number of layers, transformer blocks, as [57:20] L. The hidden size is H, and the number [57:23] of attention heads is A. And how much is [57:25] that? Okay, H is 768, [57:30] which means that the embedding size [57:34] is 768, [57:36] and the hidden feed-forward layer is [57:38] four times as much, so it's 3072. So [57:41] the feed-forward [57:44] layer is 3072, the embeddings are 768, and you can [57:47] see there are two BERT models here: this [57:49] one has 12 transformer blocks, this one [57:52] has 24 transformer blocks. [57:55] Okay, so you can actually read this [57:58] paper. You can actually relate [57:59] it to exactly what we discussed in [58:00] class. It'll all make sense. [58:02] Bidirectional means that each word can [58:04] pay attention to every other word in the [58:06] sentence. And as we will see on Monday, [58:09] there is another [58:10] transformer thing called a causal [58:12] transformer, in which you only pay [58:14] attention to the words that came before [58:15] you, not the ones after you. So [58:18] bidirectional means all words are seen. [58:21] Okay.
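If you want to see the fill-in-the-blanks behavior for yourself, the Hugging Face pipeline function covered later in this lecture makes it a couple of lines. A hedged sketch, assuming the bert-base-uncased checkpoint (the 12-block, H=768 variant from the paper excerpt; other checkpoints would work too):

```python
from transformers import pipeline

# Ask a pretrained BERT to fill in the masked word.
fill = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill("The capital of France is [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))
# "paris" should show up near the top of the candidate list.
```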
So what we do is, [58:24] remember, we said to solve sequence [58:26] classification you can add a little [58:27] token at the beginning and then, boom, [58:30] use it for classification. As it turns [58:32] out, very conveniently for us, the [58:35] people who built BERT, when they trained BERT, they just used [58:38] the CLS business [58:41] during training, so it's actually [58:42] available for us out of the box. So when [58:44] you use BERT for sequence classification [58:46] you don't even have to do any surgery on [58:47] it; it just gives you the class token [58:48] automatically, which is very convenient. [58:51] And you can also use it for sequence [58:52] labeling as well. So for sequence [58:55] classification and sequence labeling, [58:57] BERT is actually usually a really good [58:58] starting point, and in particular there [59:00] have been lots of improvements and [59:02] variations of BERT over the years, and if [59:04] you're curious about this, there's a [59:05] thing called the sentence-transformers [59:07] library, which has a whole bunch of [59:09] BERT-related code and resources that you [59:11] can use to do things out of the box. [59:14] Okay. So, okay, there's a bit of a wall of [59:18] words here. [59:20] So to solve any of these problems, [59:21] classification or labeling, where the [59:23] input is natural language, we can [59:24] obviously use a model like BERT, label a [59:27] few hundred examples, attach the right [59:28] final layers, and fine-tune it like we [59:30] did for ResNet. But if your [59:32] problem is a standard NLP problem, [59:34] okay, you don't even have to do that, [59:37] because for these standard tasks people [59:39] have already pre-trained models on those [59:40] standard tasks, right? And so you can do [59:43] all these things without any fine-tuning [59:44] at all, literally out of the box. [59:47] And so there are many hubs which have [59:49] these pre-trained models, but perhaps [59:50] the biggest one is the Hugging Face hub. [59:53] And I checked last night, it has 525,000 [59:56] models [59:58] available. I think, if I recall, last year [01:00:00] when I taught this, the number [01:00:02] was a lot smaller, maybe 50,000. So it's [01:00:04] growing really, really fast. Um, [01:00:07] so all right, let's just switch to a [01:00:09] Hugging Face collab. [01:00:15] So, Hugging Face: how many of you are [01:00:18] familiar with Hugging Face? [01:00:21] Okay, good. All right, so for [01:00:24] the others, basically you have a whole [01:00:26] bunch of pre-trained models on Hugging [01:00:28] Face. You actually have a lot of data [01:00:30] sets you can work with for your own [01:00:32] tasks. There are lots of people [01:00:34] demoing what they have built in this [01:00:37] thing called Spaces, and of course a lot [01:00:39] of documentation and so on. And the thing [01:00:40] is, what they have done is [01:00:42] they have organized all these models by [01:00:44] the kind of task you can use them for. [01:00:46] So you can see here there are a whole [01:00:47] bunch of computer vision tasks that you [01:00:49] can use them for. There's a whole bunch [01:00:50] of natural language tasks like text [01:00:52] classification, [01:00:54] feature extraction, this and that, lots [01:00:56] of interesting examples here. And so [01:00:59] what you do is you can just literally go [01:01:00] in there and say, okay, I want to do [01:01:01] text classification.
You hit it, and then [01:01:03] it tells you all the models that are [01:01:05] available. It turns out there are 50,000 models just [01:01:06] for text classification. And you can [01:01:08] look at, okay, which is the most [01:01:10] downloaded or which is the most liked, [01:01:11] and then you can just use them as a [01:01:13] starting point for whatever you want to [01:01:14] do. Okay. So that is Hugging Face, [01:01:17] and the way you use Hugging Face is, [01:01:20] I'm just connecting to it. Um, [01:01:24] if you have a problem in which the input is [01:01:26] natural language text, the first question [01:01:28] you have to ask yourself is: is it standard [01:01:29] or not? Is it a standard task or not? If [01:01:31] it's a standard task, you just go that [01:01:32] route; do not reinvent the wheel. This thing [01:01:34] will usually work pretty well. Okay. So [01:01:37] here we will use this thing called [01:01:39] the transformers library from Hugging [01:01:41] Face, in particular the pipeline function, [01:01:43] to demonstrate quickly how to do this. [01:01:45] Fortunately this library, as of [01:01:47] this year, is pre-installed in collab, so [01:01:48] we don't have to install it. We [01:01:50] can just start using it right away. So [01:01:51] we'll take this example where you have a [01:01:53] bunch of text which says: [01:01:57] Dear Amazon, last week I got an Optimus [01:01:59] Prime action figure from your store in [01:02:00] Germany. Unfortunately, when I opened the [01:02:01] package I discovered to my horror that I [01:02:04] had been sent an action figure of [01:02:05] Megatron instead. Can you imagine that [01:02:06] person's sheer distress at this? [01:02:08] Um, so: as a lifelong enemy of the [01:02:10] Decepticons, I hope you can understand [01:02:12] my dilemma. So to resolve the issue, I [01:02:14] demand an exchange. Enclosed are copies; I [01:02:17] expect to hear from you soon. Sincerely, [01:02:19] Bumblebee. [01:02:21] Okay. They should have come [01:02:22] up with a better name for this example. [01:02:24] All right, cool. So that's the text [01:02:26] we have. So we import this pipeline [01:02:29] function; it's the one that basically gives [01:02:31] you the ability to start [01:02:33] using these models out of the box, without any [01:02:34] training on your part, nothing like that. Okay, so we download [01:02:36] this thing. Um, oh wow, I got an A100 [01:02:40] today. That happens very rarely. All [01:02:42] right. [01:02:44] So here, let's say you want to classify [01:02:46] that text. Okay, you just want to [01:02:48] classify it for sentiment. You literally [01:02:50] go in there and say pipeline, [01:02:52] text-classification. That's the task you [01:02:55] want the pipeline to do for you, right? [01:02:57] And you create a classifier. Okay, it's [01:02:59] going to download a bunch of stuff, [01:03:01] and so on and so forth.
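In code, the steps being walked through are roughly the following. This is a condensed sketch: the email text is abbreviated here and stands in for the full complaint above, and with no model named, the pipeline downloads the library's default sentiment model.

```python
from transformers import pipeline

text = ("Dear Amazon, last week I got an Optimus Prime action figure "
        "from your store in Germany. Unfortunately, when I opened the "
        "package I discovered I had been sent Megatron instead...")

# Create the classifier once; it downloads the model the first time.
classifier = pipeline("text-classification")
print(classifier(text))
# e.g. [{'label': 'NEGATIVE', 'score': 0.9...}]
```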
[01:03:04] The first time, it just takes time to [01:03:06] download, and then you literally take the [01:03:08] text you have here and run it [01:03:10] through the classifier as if it were just a [01:03:11] little function, right? You get some [01:03:14] outputs, and if you display them [01:03:17] this way: [01:03:19] the sentiment is negative with 90% [01:03:21] probability. Pretty good, right? Sequence [01:03:23] classification solved; I mean, [01:03:25] sentiment classification solved. So we'll [01:03:27] try a few different examples. I hated [01:03:30] the movie. If I said I loved the movie [01:03:31] I would be lying; okay, that's a little [01:03:33] tricky. The movie left me speechless. [01:03:34] Incredible. And then I had to add this [01:03:36] last one here last night: almost but [01:03:38] not quite entirely unlike anything good [01:03:40] I've seen. Okay. And that's not [01:03:42] original. By the way, people who have [01:03:43] read Douglas Adams will know this famous [01:03:44] sentence about somebody drinking some [01:03:46] beverage and saying it's almost but not [01:03:48] quite entirely unlike tea. So I was [01:03:50] inspired by that. So anyway, we'll see [01:03:52] what happens. Um. [01:03:56] All right. Put it in there. Okay. So, [01:03:59] negative. I hated the movie: okay, fine. [01:04:01] If I said I loved it, I'd be lying: [01:04:02] negative. The movie left me speechless: [01:04:05] it says it's negative, but it could go [01:04:07] either way, right? A good classifier [01:04:09] would have probably given you a [01:04:09] probability around the 50% mark, because [01:04:11] it's sort of right on the fence. Um, [01:04:13] incredible: it's positive. And then it [01:04:15] got fooled by my crazy long sentence, and [01:04:17] it says it's positive. Okay, now that's [01:04:20] classification. Here's one other quick [01:04:22] example. So, you can actually give it a [01:04:23] piece of text, right? For example, you [01:04:25] can take a Reuters news story, [01:04:28] you can feed it in and say, extract all the [01:04:30] company names from it. Extract company [01:04:32] names, people names, and things like [01:04:34] that. It's called named entity [01:04:35] extraction. And back [01:04:37] in the day, people would [01:04:40] painstakingly hand-build all these [01:04:42] very complex systems to do named [01:04:44] entity extraction. Now it's just a [01:04:46] pipeline away. So you can take this [01:04:48] thing and you can say, create a pipeline [01:04:50] for named entity extraction, and for any [01:04:53] particular task that you're using there [01:04:54] might be a few additional parameters you [01:04:56] can set, right, as part of the [01:04:57] configuration. So we download this [01:05:00] pipeline. [01:05:08] Okay, perfect. And then we look at the [01:05:11] output. So it says, okay, good: Amazon is [01:05:14] an organization, [01:05:16] and [01:05:18] Germany is a location, LOC, which is [01:05:21] nice. So these things have a standard [01:05:22] vocabulary, ORG, LOC, things like [01:05:23] that, which you can read up on in the [01:05:24] documentation. And then Bumblebee is [01:05:26] a person, and then, boy, all the [01:05:29] Optimus Prime transformer stuff: [01:05:32] it got fooled, right? It thinks Optimus [01:05:33] Prime is miscellaneous, Decepticons is [01:05:36] miscellaneous, and so on and so forth. [01:05:38] But you get the idea.
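A hedged sketch of that named-entity pipeline. The aggregation option is one of those extra per-task parameters: it groups word pieces back into whole entities. The printed tags and scores are only indicative, and the email text is again abbreviated.

```python
from transformers import pipeline

text = ("Dear Amazon, last week I got an Optimus Prime action figure "
        "from your store in Germany... Sincerely, Bumblebee.")

ner = pipeline("ner", aggregation_strategy="simple")
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 2))
# e.g. ORG Amazon, LOC Germany, PER Bumblebee, MISC Optimus Prime ...
```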
You can take [01:05:39] standard things like Reuters news stories [01:05:41] and, boom, you can get [01:05:42] very good entity extraction right off [01:05:44] the bat. And once you get these [01:05:45] entities extracted, you can put [01:05:47] them into a nice structured data table, [01:05:48] like a database, and then you can run [01:05:50] traditional machine learning on it. [01:05:53] Okay. Um, and then I had, I think, a few [01:05:55] more examples, of question answering and [01:05:58] so on. Actually, let's just try that. You [01:06:01] can actually give it a passage and ask a [01:06:02] question about it, and it can actually [01:06:03] give you the answer, which gets into the [01:06:07] causal transformer thing that we're [01:06:09] going to see on Monday, which builds up [01:06:10] into large language models, because you [01:06:12] obviously can give a passage to [01:06:14] ChatGPT and ask it [01:06:16] a question and have it give you an answer, so [01:06:17] it's really in that vein. But just [01:06:19] for fun, let's do that to see if [01:06:20] it's any good. Okay, so: what does the [01:06:25] customer want? And the output is "an [01:06:27] exchange of Megatron", and it's telling [01:06:29] you where the relevant passage starts in the text [01:06:32] and where it ends. [01:06:34] It's pretty good, right? Because, [01:06:37] remember, if you have stuff like this, [01:06:39] then when you ask a large language [01:06:41] model a question and it gives you an answer, [01:06:42] you can actually ask it to tell you [01:06:44] exactly where in the input it found the [01:06:46] answer, and because you know these things [01:06:48] are going to hallucinate, you can actually [01:06:49] look at the input that it's claiming to [01:06:51] use, look at what it says, and see if [01:06:54] they actually match. It's a way to [01:06:56] essentially do QA on LLM output. [01:06:59] Um, okay, so that's what we have here, and [01:07:01] I have some other stuff which [01:07:03] I'll ignore for the moment, because I [01:07:05] want to go back to the PowerPoint. [01:07:07] So yeah, if you have a standard task, [01:07:10] you can just use pipelines [01:07:11] and Hugging Face to actually solve many [01:07:13] of them out of the box, without any heavy [01:07:15] lifting. So, I mentioned earlier on that [01:07:18] transformers have proven to be effective [01:07:19] for a whole bunch of domains outside of [01:07:21] natural language processing, like, [01:07:24] you know, speech recognition, computer vision, [01:07:26] and so on and so forth. And so I want [01:07:29] to give you a couple of quick examples [01:07:30] of how to think about using [01:07:32] transformers for non-text applications. [01:07:35] Okay. So the key insight here is [01:07:39] that the architecture of the transformer [01:07:41] block that we have looked at, amazingly [01:07:42] enough, can be used as is, with no changes, [01:07:45] no surgery needed, no clever thinking [01:07:47] required, for any particular application. [01:07:49] What is needed, where the clever thinking [01:07:51] may be required, is you need to take the [01:07:53] inputs that you're working with and you [01:07:55] need to figure out a way to tokenize and [01:07:57] encode them into embeddings, [01:07:59] which can then be sent into the [01:08:01] transformer.
So all the action is in [01:08:03] taking that non-text input and [01:08:05] figuring out a way to cast it into the [01:08:07] language of embeddings. That's where the [01:08:09] game is. Okay. So here is [01:08:12] something called the vision transformer, [01:08:14] which is very famous, actually. I think [01:08:16] it may be perhaps the first [01:08:19] transformer architecture that was [01:08:20] applied to vision problems. So [01:08:23] let's say you have a picture. Yeah, so [01:08:25] let's say you have this picture. Okay, [01:08:28] it is just a picture, okay? So you have to [01:08:31] find a way to create embeddings from [01:08:33] this picture, or to tokenize this picture [01:08:35] in some way. With sentences, [01:08:38] it's pretty trivial to [01:08:41] figure out how to tokenize them, each word [01:08:43] is a token, but with a picture, what do you do, right? It's kind [01:08:45] of weird to think of tokenizing a [01:08:47] picture. So what these people did is [01:08:49] they said, you know what, I'm going to take [01:08:51] this picture and chop it up into small [01:08:52] squares. [01:08:54] Right? So in this example, they have [01:08:57] taken this big picture and chopped it up [01:08:58] into nine little pictures. Okay? Then [01:09:02] you can take each of those nine [01:09:03] pictures. [01:09:05] Each of those nine pictures, right, if [01:09:07] you look at how it's represented, [01:09:09] it's just three tables of numbers, [01:09:11] right? The RGB values. So you can [01:09:15] take all those numbers and just [01:09:16] create a giant long vector from them. [01:09:20] Okay? You have a huge long vector, and [01:09:22] then you run it through a dense layer to [01:09:26] come up with a smaller vector, [01:09:28] and that smaller vector is your [01:09:30] embedding. [01:09:31] That's it. The way you transform the [01:09:34] long vector into the small vector is just a [01:09:36] dense layer whose weights can be [01:09:37] learned. [01:09:39] So what these people did is they said, [01:09:41] well, I'm going to first chop it up into [01:09:42] these patches, and then I take each patch [01:09:44] and do a linear projection. Right? A [01:09:47] flattened patch is nothing more than [01:09:49] three tables of numbers flattened into a [01:09:50] long vector. That's what the word [01:09:52] flatten here means. And once you flatten [01:09:54] it, I'm just going to run it through a [01:09:56] dense layer. So, by the way, you will [01:09:58] see the words linear projection: it's a [01:09:59] synonym for "run it through a dense [01:10:01] layer". [01:10:03] So, you run it through a dense layer, [01:10:05] right? You get these nice vectors. [01:10:09] And now you say, well, you know what, I [01:10:11] have to take the order of these things [01:10:12] into account, because clearly this little [01:10:15] patch is in the top left while this [01:10:17] patch is somewhere in the middle, right? [01:10:18] The order matters in the picture, [01:10:20] otherwise every jumbled version is going [01:10:22] to be the same thing. So you use [01:10:24] positional embeddings: [01:10:26] you basically say there are nine [01:10:27] positions in any picture, right, 0, 1, 2, 3, 4, [01:10:31] 5, 6, 7, 8, there are nine positions. So I'm [01:10:33] going to create nine position embeddings [01:10:36] and then I'm just going to add them up.
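A hedged sketch of exactly that pipeline in Keras/TensorFlow, with made-up sizes (a 96x96 image cut into nine 32x32 patches, a 64-dimensional embedding), just to show where the flattening, the dense layer, and the position embeddings go:

```python
import tensorflow as tf
from tensorflow.keras import layers

image = tf.random.uniform((1, 96, 96, 3))   # one 96x96 RGB picture, stand-in data
patch_size = 32                              # 96/32 = 3, so 3x3 = 9 patches
d_model = 64                                 # embedding size, arbitrary here

# Chop the picture into patches.
patches = tf.image.extract_patches(
    image,
    sizes=[1, patch_size, patch_size, 1],
    strides=[1, patch_size, patch_size, 1],
    rates=[1, 1, 1, 1],
    padding="VALID")
# Flatten each patch: three tables of numbers become one long vector.
patches = tf.reshape(patches, (1, 9, patch_size * patch_size * 3))

# "Linear projection": one dense layer maps each long vector to d_model dims.
patch_embed = layers.Dense(d_model)(patches)                  # (1, 9, 64)

# Learned position embeddings for the nine positions, added element by element.
pos_embed = layers.Embedding(input_dim=9, output_dim=d_model)(tf.range(9))
tokens = patch_embed + pos_embed                              # ready for the transformer
print(tokens.shape)                                           # (1, 9, 64)
```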
[01:10:39] You add each position embedding to [01:10:40] the corresponding patch embedding, just like we did with [01:10:41] words. With words, each word had an [01:10:44] embedding, each position had an [01:10:45] embedding, and we added them up. Here each [01:10:47] patch has an embedding, the position of [01:10:49] the little patch in the picture has an [01:10:50] embedding, and we add them up. Okay? And [01:10:53] then, because we want to use it for [01:10:54] classification, no problem: we'll have a [01:10:57] little CLS token, [01:11:00] and then we just run it through the [01:11:01] transformer. That's it. [01:11:04] And then you get the CLS token out, and then [01:11:06] you can attach a softmax to it and say, [01:11:08] okay, it's a bird, it's a ball, it's a [01:11:09] car. [01:11:12] That's it. This simple approach actually [01:11:14] works, [01:11:16] amazingly enough. [01:11:19] Okay, so that is the vision transformer, [01:11:22] and I'm going through it fast just to [01:11:23] give you a sense for how these things [01:11:24] work. Any questions? Yeah. [01:11:29] >> My question is: in the case of text [01:11:31] we had a fixed number of tokens, that is, the [01:11:33] number of words which could be [01:11:35] in the English vocabulary, but here, if [01:11:37] you look at images, they will [01:11:39] probably go into the trillions, because [01:11:41] we are not talking about one image, [01:11:43] we take a whole lot of [01:11:45] images and we subset each one of [01:11:47] them, and each one would have its [01:11:52] own weights, its own parameters. [01:11:53] >> There is no notion of vocabulary here. All [01:11:56] we're saying is that, given any image, we [01:11:58] create nine patches, sub-images, from it. [01:12:02] Each of those patches gets passed [01:12:03] through a dense layer and out comes an [01:12:06] embedding. So at that point, any image [01:12:09] you give me, I'm going to get you [01:12:10] nine embeddings out of it. And once I [01:12:13] get the nine embeddings, I just throw them [01:12:14] into the meat grinder, the transformer [01:12:16] meat grinder. [01:12:20] All right. So, another example. I think [01:12:23] some of you have asked me outside of [01:12:25] class, how good are transformers for [01:12:27] structured data, tabular data? For [01:12:30] tabular data in general, [01:12:32] things like XGBoost, gradient boosting, work really, [01:12:34] really well, so it's good to try them, [01:12:36] certainly. I don't think transformers and [01:12:38] deep learning networks have any great [01:12:39] edge over XGBoost for structured data [01:12:42] problems, so it's worth trying both of [01:12:44] them. However, you can use transformers [01:12:46] for this stuff too. So there's something called the [01:12:48] tab transformer, one of the first [01:12:50] transformers to come out for [01:12:52] tabular data, and again it's pretty [01:12:54] simple. All you do is: [01:12:56] in any kind of input that you have, you [01:12:58] will have some categorical variables, [01:13:00] right? Like blood pressure, things like [01:13:02] that. Well, not blood pressure, bad [01:13:04] example: gender, right? And so on [01:13:07] and so forth. And so what you do is you [01:13:10] take all the categorical features, and [01:13:12] for each categorical feature, you create [01:13:14] embeddings, [01:13:16] because a categorical feature is just [01:13:18] text. [01:13:20] A categorical feature is just text, so [01:13:22] you can create text embeddings for it.
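For instance, here is a minimal Keras sketch of giving one categorical column its own learned embedding; the column name, vocabulary, and embedding size are made up for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

# A categorical column is just a tiny vocabulary of strings, so each
# category can get its own learned embedding, exactly like a word.
gender_lookup = layers.StringLookup(vocabulary=["female", "male"])
gender_embedding = layers.Embedding(
    input_dim=gender_lookup.vocabulary_size(),   # categories plus the OOV slot
    output_dim=8)                                # 8-dim embedding per category

batch = tf.constant(["male", "female", "female"])
ids = gender_lookup(batch)          # strings -> integer ids
vectors = gender_embedding(ids)     # ids -> learned embeddings
print(vectors.shape)                # (3, 8)
```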
[01:13:23] No problem. [01:13:27] And then you take all the continuous [01:13:30] features, right, cholesterol and blood [01:13:32] pressure and whatnot, to go to [01:13:34] the heart disease example, and you [01:13:36] just collect them [01:13:38] all and create a vector out of [01:13:39] them. [01:13:41] It's just a vector. Okay? Then you run [01:13:45] the embeddings for all the [01:13:47] categorical variables through a nice [01:13:48] transformer block, and you can see here [01:13:51] it's exactly the block we have seen [01:13:52] before, no difference. And then at the [01:13:54] very end, when it comes out of the [01:13:56] transformer, you take all the contextual [01:13:58] stuff coming out of the transformer and [01:13:59] you concatenate it with the [01:14:01] continuous features. [01:14:03] Okay. And then you run it through maybe [01:14:05] one or more dense layers and, boom, [01:14:07] output. [01:14:09] So this is a tabular data [01:14:11] transformer, and there are many [01:14:12] refinements and improvements [01:14:14] that have come since then. But the key [01:14:16] thing I want you to remember from [01:14:18] here is that categorical variables can [01:14:21] be very easily represented as [01:14:24] embeddings. That's the key; a sketch of the whole recipe appears at the end of this passage. Okay. All [01:14:28] right. So that's that. Now, once the [01:14:31] input has been transformed into this [01:14:32] common language of embeddings, we [01:14:34] can process it without changing the [01:14:35] architecture of the block itself, because [01:14:37] all it wants is embeddings. It's like: [01:14:39] you give me embeddings, I give you [01:14:40] great contextual embeddings out, and [01:14:42] nobody gets hurt, right? That is the [01:14:44] deal with the transformer stack. So, [01:14:47] since [01:14:50] the transformer is agnostic to the kind [01:14:52] of input, as long as it comes [01:14:54] in in the form of an embedding, you can use [01:14:56] it for multimodal data very easily. So, [01:14:58] for example, let's say that you have a [01:15:00] problem in which you have a picture that [01:15:02] has to be sent in, some text that [01:15:03] goes in, and a bunch of tabular data coming [01:15:05] in. Well, you take the text and do [01:15:08] language embeddings like we know how to [01:15:10] do, you take the image and do image [01:15:11] embeddings like we just saw with the [01:15:12] vision transformer, you take the tabular data [01:15:14] and do tabular data embeddings like we saw [01:15:16] with the tab transformer. Once we do that, [01:15:18] it's all a bunch of embeddings, [01:15:21] and then you attach a little class token [01:15:23] on top, send it through a bunch of [01:15:25] transformer blocks, and out comes a [01:15:27] contextual class token, the contextual [01:15:29] version; run it through maybe a sigmoid [01:15:32] or a softmax, predict the label, done. [01:15:36] So this is extremely powerful, this [01:15:38] ability to handle multimodal data. Okay. [01:15:40] And that's why, for example, if you look [01:15:42] at Google Gemini 1.5 Pro, GPT-4 [01:15:46] Vision, and so on, you can send them images [01:15:48] and a question and you'll get an answer [01:15:50] back, because every modality that goes in [01:15:53] is cast into embeddings, and once it's [01:15:55] embedded, once it's "embeddingized", [01:15:58] then the transformer doesn't care. It'll [01:16:00] just do its thing.
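As referenced above, here is a hedged Keras sketch of that tabular recipe: categorical columns become embeddings, go through a transformer block, get concatenated with the continuous features, and feed a small dense head. The column counts, vocabulary sizes, and layer sizes are assumptions, and the "transformer block" here is Keras' built-in MultiHeadAttention plus a small feed-forward layer rather than the tab transformer paper's exact block.

```python
import tensorflow as tf
from tensorflow.keras import layers

num_categorical = 3        # e.g. gender, smoker, chest-pain type (illustrative)
vocab_size = 10            # pretend each column has at most 10 categories
num_continuous = 4         # e.g. age, cholesterol, blood pressure, max heart rate
d_model = 16

cat_in = tf.keras.Input(shape=(num_categorical,), dtype="int32")   # already integer-encoded
cont_in = tf.keras.Input(shape=(num_continuous,))

# Each categorical value becomes an embedding; the sequence of embeddings
# then goes through a simplified transformer block.
x = layers.Embedding(vocab_size, d_model)(cat_in)                  # (batch, 3, 16)
attn = layers.MultiHeadAttention(num_heads=2, key_dim=d_model)(x, x)
x = layers.LayerNormalization()(x + attn)
ff = layers.Dense(d_model, activation="relu")(x)
x = layers.LayerNormalization()(x + ff)
x = layers.Flatten()(x)                                            # contextual categorical features

# Concatenate with the (normalized) continuous features, then dense layers and output.
merged = layers.Concatenate()([x, layers.LayerNormalization()(cont_in)])
h = layers.Dense(32, activation="relu")(merged)
out = layers.Dense(1, activation="sigmoid")(h)                     # e.g. heart disease yes/no

model = tf.keras.Model([cat_in, cont_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```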
[01:16:02] It will decide, for example, that this [01:16:04] word in your question actually is highly [01:16:06] related to that patch in the picture. [01:16:09] Right? It'll just figure it out. [01:16:12] Uh, okay. That's all I had, because [01:16:14] we're at time, it's 9:55. Perfect. All [01:16:16] right, folks. Thanks. Have a great rest [01:16:18] of your week.