[00:16] So, all right. So, transformers, even [00:18] though they were originally invented for [00:20] machine translation, right, going from [00:22] English to German and German to French [00:24] and so on and so forth, [00:25] they have turned out to be an incredibly [00:27] effective deep neural network [00:29] architecture for just really a vast [00:32] array of domains. It has reached a point [00:34] where if you're actually working on [00:36] a particular problem, you almost [00:37] reflexively will try a transformer [00:39] first because it's probably going to be [00:40] pretty darn good. [00:42] Okay? So, they have just taken over [00:45] everything. [00:46] Um and obviously they've [00:48] transformed translation, which is the [00:50] original sort of target, uh Google [00:52] search, really information retrieval, [00:54] completely transformed speech [00:55] recognition, text-to-speech, even [00:57] computer vision. Even the stuff that we [00:59] learned with convolutional neural [01:00] networks, now there are transformers for [01:03] computer vision problems that are [01:04] actually quite good. [01:06] Right? [01:07] Um which is kind of shocking because [01:08] they were not even designed for that. [01:10] Um and then, you know, reinforcement [01:12] learning. And of course, all the crazy [01:14] stuff that's going on with generative [01:15] AI, large language models, multimodal [01:17] models, everything runs on a [01:20] transformer. [01:21] Okay? Uh and then there are numerous [01:23] special-purpose systems, [01:25] and I find these to be even more [01:27] interesting. [01:28] Um you know, like AlphaFold, the protein [01:30] folding AI, runs on a transformer [01:32] stack. [01:33] Okay? And I could just list examples one [01:35] after the other. [01:36] So, it's just amazing. It's an incredibly [01:38] flexible architecture. [01:40] Um and I think we are lucky to be alive [01:43] during a time when such a thing was [01:44] invented. [01:47] And I'm not getting paid to tell you any [01:48] of this stuff. [01:50] All right, it's just amazing. Okay. So, [01:52] let's get going. We will use search, um, [01:55] or more broadly information retrieval, as [01:57] a motivating use case. So, these are all [01:59] examples where people are typing in [02:00] natural language queries or uttering [02:02] natural language queries into a phone, [02:03] and we need to sort of make sense of [02:05] what they want. And it's not like, you [02:07] know, write me a limerick about deep [02:08] learning, where there could be many [02:10] possible right answers. It's more like, [02:12] okay, tell me all the flights that are [02:14] leaving from Boston going to [02:15] LaGuardia tomorrow morning between 8:00 [02:16] and 9:00. Well, you better get it right. [02:19] Okay? Accuracy is a high bar. [02:21] So, [02:22] um or, you know, how many customers [02:23] abandoned their shopping cart? Find all [02:24] contracts that are up for renewal next [02:26] month. Uh you know, tell me all the [02:28] customers who ended the phone call to [02:30] the call center yesterday not entirely [02:32] pleased with the transaction. Right? The [02:34] list goes on and on. And so, in [02:37] particular, we'll focus on this [02:38] travel-related example today. Okay? Uh [02:40] find me all flights from Boston to [02:42] LaGuardia tomorrow morning, right? That [02:44] kind of query.
[02:45] Um and so, in these sorts of use cases, [02:48] a very common approach historically has [02:50] been, well, we will take this, you know, [02:53] natural language query [02:55] and then we will convert it into a [02:57] structured query. By that I mean we will [03:01] parse the query and we'll extract out [03:03] key things in that query. Once we [03:05] extract out those key things, we will [03:07] reassemble it into a structured query, [03:09] like a SQL query, right? Uh SQL is just [03:12] one example of a possible structured [03:14] query. There are many, many ways to [03:15] structure queries. [03:17] But SQL is sort of familiar to lots of [03:18] people, so I'm using that. So, you take [03:20] the SQL. Once you have the SQL query, [03:23] you're in a very comfortable structured [03:25] land, in which case you just run the [03:27] query through some database that you [03:28] have, get the results back, format them [03:30] nicely, and show them to the user. [03:32] Right? That's the flow. [03:34] So, the question becomes [03:36] um [03:37] how do we automatically extract all the [03:40] travel-related entities from this query? [03:43] Right? We want to be able to extract [03:45] BOS, LGA, tomorrow, morning, flights, so [03:49] on and so forth. These are all the [03:50] travel-related entities we want to [03:51] extract out, right? That's the problem. [03:54] And so, [03:56] we will use a really cool data set [03:58] called the Airline Travel Information [03:59] System (ATIS) data set, and I'll explain the [04:01] data set in just a bit. We'll [04:02] use this as the basis for this example. [04:05] And so, the way we think about it is [04:07] that [04:08] we have a whole bunch of queries in [04:10] this data set. [04:12] And fortunately for us, the researchers [04:14] who compiled this data set, [04:16] they went through every one of these [04:18] queries, right? And we have, you know, [04:20] several thousands of them. They went [04:22] through every one of those queries and [04:24] they manually tagged each word in the [04:26] query [04:28] with what kind of travel entity it is, [04:31] or none of them, right? So, for [04:33] instance, they call them [04:35] slots. So, they will take each word in [04:37] the query and assign it to a slot, a [04:39] particular kind of slot, and I'll [04:41] explain what slot means in just a [04:42] second. Okay? That's the basic idea. So, [04:45] for example, if you have something [04:47] like "I want to fly from," [04:49] okay? And this is a flight database, so [04:52] you can assume that everything is [04:53] related to flying. So, if you [04:56] have all these words, I want to fly [04:57] from, [04:58] each of these five words [05:00] gets mapped to something called the O, [05:02] which means other. [05:04] It's the other slot, right? We don't [05:06] really care about it. It's the other [05:07] slot. [05:09] And then we come to Boston. [05:11] Oh, Boston is very special, right? [05:13] Because, you know, it's clearly a [05:15] departure city. So, we actually tag it, [05:18] we assign it this label. Think of it as [05:20] just like a classification problem, [05:21] right? A multi-class classification [05:23] problem. So, we assign it to [05:26] B-fromloc.city_name. [05:29] Okay? That is the label you assign it. [05:31] Okay? [05:32] And then you go to "at". You don't care [05:34] about "at". It's O, other. You come to [05:37] 7:00 a.m. [05:38] And then, okay, that is depart time.
So, [05:41] depart time and then another depart [05:43] time. And here you see there is a B and [05:45] then there is an I. [05:47] Right? So, what's what we are saying [05:49] here is that there could be entities who [05:51] are described using more than one word. [05:54] Like 7:00 a.m., right? Two tokens. [05:57] And for that, we need to be able to [05:58] figure out, okay, the second token is [06:00] really [06:01] is part of the first token. Together, [06:03] they define the notion of a departure [06:05] time. So, what the B means that is that [06:08] this is the word this is the token in [06:10] which we are beginning the idea of a [06:12] departure time. And then I means we are [06:15] in the middle of this description. [06:17] B is for beginning. [06:19] So, [06:21] you can see here. So, there is a B here [06:23] and there is an I. B for beginning, I [06:25] for intermediate or in the middle. [06:27] Um and then at, we don't care. 11:00 B [06:31] arrive time. [06:33] Boop boop boop. Morning arrive time [06:35] period. [06:38] So, this is an example of how you can [06:40] take a sentence and then manually label [06:43] every word in the sentence with [06:45] something that's relevant to your [06:46] particular problem. [06:50] And [06:51] turns out these people [06:54] every word is classified into one of 123 [06:56] possibilities. [06:59] Okay? Um so, aircraft code, airline [07:02] code, airline name, airport code, [07:04] airport name, arrival date, relative [07:07] name. Now, you get the idea. [07:08] They want a round trip versus a one-way. [07:11] The relative to today because if [07:13] somebody say tomorrow morning, it's [07:14] relative to today, so you need to notion [07:16] you need absolute time and you need [07:17] notion of relative time. [07:19] So, they basically thought of every [07:20] possibility with these researchers. And [07:23] so, the every word in every one of these [07:25] queries is assigned one of these 123 [07:27] labels. [07:32] Any questions on the setup? [07:36] Um [07:39] Did they have to contextualize what [07:42] comes before than let's say Boston? So, [07:44] if someone says from [07:46] Boston, so that there should be [07:47] contextualization with the from to [07:49] Boston. So, because they did it [07:50] manually, they could just read it and [07:52] figure it out, that's what they mean, [07:54] right? You Boston is the the departure [07:55] city and not the arrival city. So, do [07:57] they have two tags to Boston, which is [07:59] some like, you know, departure city as [08:01] well as arrival city [08:03] word Boston? In that particular phrase, [08:05] it's it's clear from that particular [08:07] case in the context of it as a human [08:08] reading it that Boston is a departure [08:10] city. So, it just only gets that tag. In [08:13] that sentence. In some other sentence [08:15] where people are coming into Boston, [08:16] it'll have a different tag. [08:21] I was wondering if my query like the [08:23] others, basically there is like, for [08:25] example, if my query was [08:27] giving flights from Boston at 7:00 a.m. [08:29] and [08:29] uh the [08:31] flights from Denver at 11:00 a.m. [08:33] You mean like a compound query? Yeah. [08:35] So, this one only takes single queries [08:37] into account. [08:39] Because most people are like, you know, [08:40] give me a flight from here to there. Or [08:42] what is the cheapest thing from here to [08:43] there? And we'll see examples of queries [08:45] later on. [08:50] Okay. [08:51] Uh all right. 
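To make the setup concrete, here is roughly what one labeled ATIS-style query looks like as data, with one slot label per word. The sentence and the exact slot-name strings are reconstructed for illustration (the B-/I-/O convention and names like B-fromloc.city_name follow the scheme described above), so treat the specifics as an approximation of the real data set rather than an exact record from it.

```python
# A reconstructed ATIS-style example: one slot label per token.
# "O" = other; "B-" marks the beginning of an entity; "I-" marks a continuation.
tokens = ("i want to fly from boston at 7:00 am "
          "and arrive in denver at 11:00 in the morning").split()

labels = [
    "O", "O", "O", "O", "O",          # i want to fly from
    "B-fromloc.city_name",            # boston
    "O",                              # at
    "B-depart_time.time",             # 7:00
    "I-depart_time.time",             # am
    "O", "O", "O",                    # and arrive in
    "B-toloc.city_name",              # denver
    "O",                              # at
    "B-arrive_time.time",             # 11:00
    "O", "O",                         # in the
    "B-arrive_time.period_of_day",    # morning
]

for token, label in zip(tokens, labels):
    print(f"{token:10s} -> {label}")
```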
So, that's the [08:52] deal. [08:53] So, basically, this [08:58] problem that we have here is [08:59] really a word-to-slot [09:02] multi-class classification [09:04] problem. [09:06] Okay? [09:07] Um because if you look at that [09:09] input, we want to be able to take that [09:10] input, and a really good model will then [09:12] give you this as the output. [09:17] Right? Because this is what a human [09:18] would have done. [09:20] So, that is our problem. Okay? [09:23] So, the question is, [09:25] um, the key thing here is that each [09:27] of the 18 words in this particular [09:29] example must be assigned to one of 123 [09:32] slot types, right? Each word. It's not [09:34] like we take the entire query and [09:36] classify the entire query into one of [09:38] 123 possibilities. Every word in the [09:40] query has to be classified. [09:42] That is the wrinkle. [09:45] Okay? [09:46] So, now, if we could run the query [09:49] through a deep neural network and [09:51] generate 18 output nodes, [09:54] it goes through some unspecified deep [09:55] neural network. And when it comes out [09:57] the other end, the output layer has 18 [09:59] nodes. [10:00] Okay? [10:01] Because that is [10:03] the dimension of the [10:04] output that we care about. 18 in, 18 [10:06] out, right? [10:09] And then for each one of those 18 nodes, [10:11] maybe we could attach a 123-way softmax [10:15] to each of those 18 outputs. [10:20] By the way, isn't it cool that we can [10:21] just casually talk about sticking a [10:23] 123-way softmax onto each one of the 18 [10:25] nodes? [10:27] Folks, wake up. [10:31] You're not easily impressed. I'm [10:32] impressed by that. [10:34] So, okay. [10:37] So, here's the key thing, [10:39] right? We want to generate an output [10:41] that has the same length as the input. [10:45] But the problem is the inputs could be [10:47] of different lengths as they come in. [10:48] They could be short sentences, long [10:50] sentences, we don't know, right? [10:52] Yet we need to accommodate this [10:55] variable size of input that's [10:56] coming in. [10:58] But the key thing is the output has to [10:59] have the same [11:00] cardinality as the input. [11:02] Okay, that's one big requirement. [11:05] In addition, we want to take the [11:07] surrounding context of each word into [11:08] account, right? To go to Ronak's [11:10] question, when you see the word Boston, [11:12] you can't conclude whether it's a [11:14] departure city or arrival city. [11:15] You have to look at what else is going [11:17] on around it. Is there a from? Is there [11:19] a to? Things like that, to figure out [11:21] how to tag it. So, clearly the [11:22] context matters. [11:24] And then we clearly have to take the [11:25] order of the words into account. [11:28] Going from Boston to LaGuardia is very [11:29] different from going from LaGuardia to [11:30] Boston. [11:31] So, clearly the order matters. [11:33] Right? So, the context matters and the [11:35] order matters. And the output has to be [11:37] the same length as the input. [11:40] Okay? [11:42] So, context matters, right? Just a few [11:44] fun examples. [11:45] Remember from last week that the [11:47] meaning of a word can change [11:48] dramatically depending on the context.
[11:50] And we also saw that the standalone or [11:53] uncontextual embeddings that we saw for [11:55] last week, like Glove, um [11:58] you know, they don't take context into [11:59] account because they give a single [12:01] unique embedding vector to every word. [12:04] And if a word ends up having lots of [12:05] different meanings, that vector is kind [12:07] of some mushy average of all those [12:09] meanings. [12:11] Okay. So, [12:13] the word see. I will see you soon. I [12:15] will see this project to its end. I see [12:16] what you mean. Very different meanings [12:18] of the word see. This is my favorite, [12:20] bank. [12:21] Uh I went to the bank to apply for a [12:23] loan. I'm banking on the job. I'm [12:24] standing on the left bank. And so on. Uh [12:27] it. The animal Oh, this is actually very [12:29] It's a good one. The animal didn't cross [12:31] the street because it was too tired. The [12:33] animal didn't cross the street because [12:34] it was too wide. [12:37] Can you imagine [12:39] a deep neural network looking at this [12:40] word it and trying to figure out what [12:42] the heck does it word it mean? [12:44] What is it referring to? [12:46] Tricky, right? [12:48] Um and then, you know, if you take the [12:50] word station, and I have the station [12:52] example here because we're going to use [12:53] it a bit more the rest of the lecture. [12:55] The train You know, the station could be [12:57] a radio station, a train station, being [12:59] stationed somewhere, the International [13:00] Space Station. The list goes on. [13:03] So, clearly order matters. I mean, [13:04] context matters. [13:05] And [13:08] clearly order matters. You can come up [13:10] with your own examples. Let's keep [13:12] moving. [13:13] Okay? [13:15] So, the Transformer architecture [13:18] is a very elegant [13:20] architecture [13:22] which checks these three boxes [13:23] beautifully. [13:25] Okay? [13:26] Um it takes the context into account, [13:27] order into account, and then, you know, [13:29] whatever is produced out there [13:32] is the same length as whatever is coming [13:33] in. [13:34] And the reason it's called the [13:35] Transformer [13:36] is because if 10 things come in, [13:39] 10 things go out, but the 10 things that [13:41] go out are a transformed version of the [13:43] 10 things that came in. [13:46] That's why it's called the Transformer. [13:47] Okay? [13:48] If 10 things came in and like one thing [13:50] go goes out, well, sure, it's been [13:52] transformed, but what is it? It's some [13:54] weird thing. But when 10 comes in and 10 [13:56] goes out, the 10 10 is preserved. Each [13:58] one is getting transformed in [13:59] interesting way. [14:01] That's why it's called the Transformer. [14:04] So, developed 2017, just dramatic [14:07] impact. [14:08] So, by the way, the effect of [14:09] Transformer, um [14:11] Google had spent a lot of research on [14:13] machine translation and obviously [14:15] search. Uh and then when the Transformer [14:17] is invented, uh they took a model called [14:20] BERT, which we will uh see on Wednesday [14:22] in detail, and then they introduced BERT [14:25] into their search, and the results were [14:28] dramatic. [14:29] And from what I've read, apparently the [14:32] impact of doing that was a [14:34] Typically, when you make an improvement [14:35] to search, the improvement is very, very [14:37] marginal because it's already a very [14:38] heavily optimized system. 
[14:40] And then when the Transformer thing came [14:42] along, there was actually a significant [14:43] jump in search quality. So, for example, [14:46] and you can actually read this blog post [14:48] uh which came out when they introduced [14:49] BERT into search. It gives you a bit [14:51] more detail. But here, so if you had if [14:54] you were querying something like uh you [14:56] know, [14:57] "Brazil traveler to USA needs a visa." [15:00] Right? You would think that it is it [15:02] should give you information about how to [15:03] get a visa if you're a Brazilian want to [15:04] come to the US, right? Uh but it turns [15:06] out the first result was how US citizens [15:09] going to Brazil can get you know, [15:11] get a visa. [15:13] So, clearly it's not taking the order [15:14] into account. [15:16] Uh but once they introduced it, boom, [15:19] the first thing was the US Embassy in [15:20] Brazil. [15:21] And a page on how to get a visa. [15:24] So, the effect was dramatic. [15:26] And so, this is a seminal paper, [15:30] right? And it's actually worth reading [15:31] the paper. And uh and it's worth and you [15:34] know, this is the picture this this is [15:35] like an iconic picture at this point [15:38] in the deep learning community. And we [15:39] will actually understand this picture [15:41] by the end of Wednesday. [15:43] Um and so, but the funny thing is that [15:45] when the researchers came up with it, [15:46] they didn't realize, in some sense, like [15:48] what they had stumbled on uh because [15:50] they were really focused on machine [15:51] translation. [15:53] It's only the rest of the research [15:54] community that took it and started [15:55] applying to everything else and found it [15:56] to be really, really effective. [15:59] Okay. So, we're going to take each one [16:01] of these things and figure out how to [16:02] address them and thereby build up the [16:04] architecture. [16:05] Any questions before I continue? [16:07] Yeah. [16:11] Is there any uh [16:13] benefits to discarding some of those [16:16] unclassified nodes before it goes out [16:18] rather than going like you have 18 words [16:21] input, discarding all the ones that [16:23] don't actually matter and just doing [16:24] like eight for your output? [16:26] Yeah, yeah. I think that's a totally [16:28] fine way to think about it. Basically, [16:29] what you're saying is that can we have a [16:31] two-stage model? The first-stage model [16:33] is like a O non-O classifier. And the [16:35] second-stage model only goes after the [16:37] non-Os. That's a totally fine way to do [16:38] it. [16:39] Yeah. [16:40] But as you can see, if you even if you [16:41] go with the just a simple one-stage [16:43] model, if you use a Transformer, you get [16:44] fantastic accuracy. [16:47] And we'll do the collab in a bit. [16:50] Uh all right. So, let's take the first [16:52] thing. How do you how do you take the [16:53] context of everything around the word [16:55] into account? [16:56] So, [16:59] so let's say that this is this is the [17:01] sentence we have. The train slowly left [17:03] the station. [17:04] Okay? For each of these words, [17:06] we can calculate a standalone embedding, [17:09] say something like Glove. [17:11] Okay? So, I'm just rep- depicting these [17:13] standalone embeddings using these uh [17:15] you know, thingies here. [17:18] Please appreciate them because it took [17:19] me a while to get them to do in [17:20] PowerPoint. [17:22] Okay? So, these are W1 through W6. 
These [17:24] are the vectors standing up. Okay? [17:27] Um now, let's say that So, we can easily [17:29] do that. [17:30] Now, what we want to figure out is we [17:32] want to focus on the word station. [17:34] And since station could mean very [17:36] different things in different contexts, [17:37] we want to figure out how do we actually [17:39] take [17:40] station's embedding and contextualize it [17:43] using all the other words that are going [17:45] on in that sentence. [17:46] Okay? Clearly, it's a train station. [17:49] So, we need to take the fact that there [17:50] is a train involved to to alter the [17:53] embedding of the word station. Right? [17:55] That's what taking context into account [17:56] actually means. [17:58] So, [17:59] how can we modify station's embedding so [18:03] that it incorporates all the other [18:04] words? That's the question. [18:07] Okay? [18:08] So, when you look at it this way, [18:11] imagine just for a moment, [18:14] just for a moment, [18:15] that [18:16] we [18:17] Now, some of the other words in the [18:18] sentence don't matter. The word the [18:20] probably doesn't matter. [18:22] But some of the other words like train, [18:24] slowly, left probably does matter. [18:26] And suppose, just magically, we have [18:29] been told [18:30] all the other words in the sentence, [18:32] this is how much weight you have to give [18:34] to them. These don't give it any weight. [18:36] Those give it a lot of weight. Okay? [18:38] Suppose we are told that. [18:39] Or to put it another way, and this this [18:41] is the word that's heavily used in the [18:42] literature, [18:44] someone tells you how much attention to [18:46] pay to the other words. [18:47] Whether you got to pay it a lot of [18:48] attention or very little attention. [18:50] Okay? [18:51] And this [18:52] how much attention to pay is given in [18:54] the form of a weight that you can use. [18:55] Okay? So, [18:57] um [18:58] if you look at it that way, from this [19:00] notion of which word should I give a lot [19:01] of weight to and very little weight to, [19:04] in this example, intuitively, which [19:05] words do you think should get the most [19:06] weight and which words do you think [19:07] should get the least weight? [19:09] Yeah. Train. [19:11] Train. Right. [19:12] Time matters. [19:13] Uh [19:14] you can do one at a time. [19:16] Train. Okay, thank you. [19:18] Uh [19:18] okay. Others? [19:21] Slowly. [19:22] Slowly. Right. So, that also seems to [19:23] have some bearing to it. What about [19:25] words that don't really I don't [19:27] we don't think is going to are going to [19:28] help at all? [19:31] The. The. Exactly. It probably doesn't [19:33] do much here. Some context it actually [19:35] might make a difference, but in this [19:37] sentence, maybe not. [19:38] Right? Intuitively. [19:40] So, [19:42] we should probably give a lot of weight [19:43] to train, maybe a little to slowly and [19:45] left, and hardly anything to the. [19:47] Okay? [19:49] And so, this intuition that we have [19:52] can be written numerically as maybe we [19:56] have a bunch of weights that add up to [19:58] one. [20:00] Okay? [20:02] Okay, maybe something like this. So, we [20:03] are saying the train 30% weightage, [20:07] maybe 8% weightage to left, maybe 12% [20:11] weightage to slowly, uh and then as you [20:14] will see here, [20:15] the station's own embedding also plays a [20:17] role. 
Because we want to take its own [20:20] standalone embedding and just move it [20:22] slightly, change it slightly, which [20:23] means that has to be the starting point. [20:26] So, it will get a lot of weight. We [20:28] can't ignore itself, in other words. [20:30] Right? So, we give it maybe 40% weight. [20:33] By the way, these numbers I just made [20:34] them up. [20:35] Okay? Uh yeah. [20:38] I'm sorry, it's a quick question. So, [20:40] the weights [20:43] are they [20:44] Are they Are they standalone for the [20:46] context of the entire sentence or are [20:48] they related to station that we started [20:50] off with? The The These six numbers are [20:54] only pertinent to station. [20:56] And for each word, we're going to do [20:57] something similar. [20:59] Yeah. [21:01] And at this point, does the model [21:03] understand order? Because like I'm just [21:05] thinking of like left because like I [21:07] gave it a very low [21:08] a [21:09] a very low weight. But let's say left [21:11] comes slowly, leave left station. The [21:14] station only have the two be higher. [21:15] Yeah, correct. So, at this point, we are [21:18] not worrying about order. We are only We [21:20] are worrying about context. [21:22] Later, we'll take order into account. [21:24] But how does the model know that left [21:25] here is of lesser importance because [21:28] it's a verb rather than a [21:31] It's It has to figure it out. [21:33] We don't It doesn't We We are just [21:34] giving it a whole bunch of capabilities. [21:36] How it manifests those capabilities is [21:38] all going to emerge from training. [21:42] Okay. So, all right. So, let's say we [21:45] have something like this. So, what we [21:46] can do, [21:48] right? And we'll get to the [21:49] all-important question of where do we [21:50] get these numbers from in just a moment. [21:51] But suppose you had the numbers, [21:54] how can we use these numbers to [21:56] contextualize W6? What can we do? [22:00] What is the simplest thing you can do? [22:05] You have W6, you want to make it a new [22:07] W6, which is now contextual, is aware of [22:10] what else is going on. Okay? [22:17] It's working now, I think. [22:20] We can take a weighted average. Exactly. [22:22] Exactly. So, when you have a bunch of [22:23] things and you have a bunch of weights [22:25] and I, you know, and we have when we [22:26] have to somehow modify one of those [22:27] things with those weights, the simplest [22:29] thing you can do is to take a weighted [22:30] average. [22:31] Right? So, that's exactly what we're [22:33] going to do. [22:34] So, we're going to take all these [22:35] weights [22:37] and just like move them up. [22:39] Okay? [22:40] Move them up. [22:42] Don't even get me started on how long it [22:44] took me to get this arrow to run. [22:46] I don't know about you, folks. Is it [22:47] It's extremely painful to get the U-turn [22:49] arrows to work in PowerPoint. [22:51] Okay? [22:52] Anyway, uh back to work. So, [22:54] so we just move these up here, okay? So, [22:57] now we can do 0.05 * this vector + 0.3 * [23:01] that vector and so on and so forth. [23:03] And the result is just another vector. [23:06] Right? [23:08] And that vector, folks, [23:11] is the contextual embedding vector of [23:13] station. [23:15] Okay? That was the standalone embedding. [23:17] And now we did the We multiplied this by [23:19] that that by whoop whoop whoop, add them [23:21] all up, and then you get a new vector. 
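Here is that weighted-average step in a few lines of NumPy. The attention weights are roughly the made-up ones from the slide (0.05, 0.30, 0.12, 0.08, 0.05, 0.40), and the embedding values are also made up; in reality they would be something like 100-dimensional GloVe vectors.

```python
import numpy as np

# Made-up 4-dimensional standalone embeddings (w1..w6) for
# "the train slowly left the station"; real ones would be much longer.
W = np.array([
    [0.1, 0.2, 0.0, 0.3],   # the
    [0.9, 0.1, 0.8, 0.2],   # train
    [0.3, 0.7, 0.1, 0.1],   # slowly
    [0.4, 0.2, 0.5, 0.0],   # left
    [0.1, 0.2, 0.0, 0.3],   # the
    [0.8, 0.3, 0.9, 0.1],   # station
])

# Made-up attention weights from the point of view of "station":
# non-negative, and they add up to one.
s = np.array([0.05, 0.30, 0.12, 0.08, 0.05, 0.40])

# Contextual embedding of "station": the weighted average of all six vectors.
w6_hat = s @ W        # 0.05*w1 + 0.30*w2 + ... + 0.40*w6
print(w6_hat)
```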
[23:24] And contextual embeddings have this [23:27] bluish kind of color. [23:29] Okay? [23:30] And I'll maintain that color scheme as [23:32] we go along. [23:33] So, that's it. [23:36] That's it. That's the idea. [23:38] Any questions? [23:41] Yeah. [23:43] How did you come up with the original [23:44] weights again? You just kind of guessed? [23:46] No, these weights I just I just [23:49] hand typed them in manually just to make [23:51] the point. And And now I'm going to talk [23:53] about how we are actually going to [23:54] calculate them. [23:57] Okay. [23:58] Uh all right, cool. So, now I'm going to [24:00] uh okay, enough pictures. Let's switch [24:03] to some math. So, [24:05] so basically what I'm So, let's write it [24:07] a bit more formally. [24:08] So, we have these W1 through W6, which [24:11] are the standalone embeddings. [24:12] And then for station, we want to [24:14] calculate, you know, W6 with a little [24:16] hat on it, which is the contextual [24:17] embedding. And the way we do it is to [24:19] say we calculate some weights for each [24:22] of these words. So, this weight S16 [24:25] means that the weight [24:27] of the first word on the sixth word, [24:30] which happens to be station. [24:32] The The weight of the second word on the [24:33] sixth word, and so on and so forth. And [24:35] so, what we are saying is that W6 is [24:38] just, you know, this weight times W1, [24:40] this time W whoop whoop whoop, [24:41] that's it. [24:43] Okay? [24:45] I have to inflict all these, you know, [24:47] subscripts and all that because [24:48] you know, we need it. [24:51] All right. So, that's it. [24:53] That's what we have. [24:56] Now, let's talk about Okay, any [24:58] questions on the mechanics of it [25:00] before I get to Okay, where do these [25:01] weights come from? [25:02] Yeah. [25:06] Utilizing something like Google, for [25:08] example, like how does it understand [25:11] like the context of [25:12] new words [25:13] and context like [25:16] process immediately through the training [25:18] data the users played or [25:20] like basically [25:21] >> like a totally new word that didn't [25:22] exist before? A new word or a new [25:24] context to a word that already exists. [25:27] No, I think that the context is supplied [25:29] because the query coming into something [25:31] like Google is a full sentence. [25:33] And we only take that sentence and take [25:35] only the sentence into account as the [25:36] context for us. [25:37] So, the context is always present to us [25:40] when we get the input. [25:41] But the other question you had uh of [25:44] Okay, what if there's a brand new word [25:45] you've never seen before, for which [25:46] there is not even a standalone [25:47] embedding? What do you do then? [25:49] So, let's punt on that till Wednesday [25:51] because I have to talk about something [25:53] called byte pair encoding and stuff like [25:55] that before I can answer that. [25:57] And And really quickly, does that [25:59] immediately translate to their [26:00] predictive search queries? [26:03] Utilizing like verb [26:06] Yeah, a new word, for example. [26:08] Does that automatically get applied to [26:10] the predictive search queries like when [26:12] we're saying how to and then just home? [26:14] Oh, you mean like the auto complete? [26:15] You know, auto complete uses a slightly [26:17] different mechanism. [26:18] Um I They had a very complicated [26:20] non-transformer thing for a long time. 
[26:23] I'm sure they have a transformer version [26:24] now, but I don't I'm not privy to how [26:26] exactly they've done it. So, I don't [26:28] quite know how they do it. But what [26:29] you're proposing is a reasonable way to [26:31] think about it. [26:33] Yeah. [26:34] Um my question is like we have six [26:36] words, station and but number parameters [26:39] as in weights, let's say 10 of them. [26:41] And then we have calculated the [26:43] contextual version of W6. Yeah. So, this [26:46] has a different parameter or it remains [26:48] the same? It replaces. Okay. [26:50] Yeah, W becomes W6 becomes W6 hat. [26:54] Okay. And how we are expecting [26:57] Right. [26:58] This contextual word will be really [27:00] good. That's what we want. [27:07] Do we lose that [27:08] or retain No, we lose it. And as you [27:11] will see here, as it flows through the [27:12] transformer, it's getting more and more [27:14] and more contextualized. [27:16] So, it's a left-to-right flow. [27:20] All right. Uh all right, great. So, the [27:22] By the way, this thing that we did for [27:23] station, we will do it for each word in [27:25] the in the in the sentence. [27:27] The same exact logic. Obviously, the [27:30] weights are going to change. [27:31] Okay? But what will happen is that W1 [27:34] through W6 will become W1 hat through W6 [27:37] hat. [27:39] The same exact logic is going to hold. [27:41] Okay? That's what I just don't have the [27:43] slides for it because it's a waste of [27:44] time. [27:45] The same exact logic is going to hold. [27:47] All right. Now, switch gears [27:48] and and answer the all-important [27:50] question of where are the weights going [27:51] to come from. [27:52] Okay? So, the intuition here is really [27:54] really interesting and elegant. [27:56] So, clearly the weight of a word [27:59] should be proportional to how related it [28:02] is to the word station. [28:04] Right? [28:06] The word train clearly is very related [28:08] to the word station. [28:09] The word the is not clear how it's [28:11] related it is. Probably not all that [28:12] related. So, the relatedness matters to [28:15] the weight. More related, higher the [28:17] weight, right? Just intuitive. [28:19] So, one way to quantify how related two [28:21] words are is to take their standalone [28:23] embeddings and calculate the dot [28:25] product. [28:28] Okay? So, um [28:30] in case folks have [28:33] sort of forgotten about the dot product, [28:39] Oops, that's not what I want. [28:42] So, um So, if you have a Let's say you [28:44] have a vector. [28:50] Okay, let's Let's Let's say this is the [28:51] vector for [28:52] train. [28:55] This is the vector for station. [28:59] Okay? So, the dot product of these two [29:01] vectors, [29:05] I'll write it as train [29:09] station [29:12] equals [29:13] basically the length [29:17] of [29:20] the vector for train [29:23] times the length [29:26] of the vector for station [29:30] times the cosine [29:33] of the angle between them. [29:36] Okay? [29:38] Okay? [29:42] So, how long is each vector? [29:45] Product of the two and then the angle [29:46] between them. Okay? Now, let's assume [29:48] for simplicity that these lengths are [29:50] roughly the same. [29:52] They're just one unit length. Okay? Just [29:54] roughly. [29:55] So, if you assume that, [29:57] okay? This thing, let's say, becomes [30:01] becomes one, let's say. [30:03] Okay? [30:05] This thing becomes one. [30:07] So, all the action [30:09] is here. [30:11] Okay? 
[30:12] So, all the action is here. [30:14] So, basically, the dot product of these [30:15] two vectors is really the cosine of [30:17] angle between them. [30:20] So, now, the question is, if you have [30:22] something like this, [30:27] right? Which are very close to each [30:28] other, the cosine of a very small angle, [30:31] actually, the cosine of zero is what? [30:34] One. [30:35] So, if the angle is really, really [30:37] small, the cosine is going to be very [30:39] close to one. [30:40] Right? Because zero is one. The cosine [30:41] of zero is one. So, this thing is going [30:43] to be, you know, pretty close to one. [30:46] If you have a cosine of two vectors that [30:49] are like this, 90° apart, what is the [30:51] cosine? [30:52] Zero. They're orthogonal, right? Which [30:55] maps to the English orthogonal. [30:58] So, the cosine of that is zero. [31:00] And then, if you have something like [31:01] this, [31:03] where they're literally pointing in [31:04] opposite direction, [31:07] what is the cosine of that 180? [31:09] Minus one. [31:11] So, that's it. So, the if these things [31:13] if these these these two vectors are [31:14] very close to each other, [31:16] the cosine of the angle between them is [31:18] going to be very close to one. If they [31:19] are really kind of unrelated, it's going [31:21] to be zero. If they're anti-related, [31:22] it's going to be minus one. [31:24] Right? So, that's how dot products [31:27] capture this notion of closeness or [31:28] relatedness. [31:30] Okay? [31:31] So, all right. Um iPad. [31:36] So, we can use the dot product of these [31:37] embeddings to capture relatedness. [31:40] And so, okay, iPad done. [31:43] So, now, what we do is we know now that [31:45] we know that dot products can be used, [31:48] we can't use them as is because we need [31:49] to do one more thing to make them proper [31:51] weights. And what I mean by proper [31:53] weights is that the we want the weights [31:55] to be, first of all, non-negative, and [31:58] we want to add up we want them to add up [31:59] to one, right? That's that's what a [32:00] weighted average actually is going to [32:01] mean. [32:02] But these cosines could be negative. [32:05] Right? And so, we need to now adjust [32:07] them to make them proper so that every [32:08] one of them is guaranteed to be [32:10] non-negative and they will add up to [32:11] one. [32:12] When was the last time you had to take a [32:14] bunch of numbers, which could be [32:15] anything, and then somehow make sure [32:18] that they are going to be positive, [32:20] non-negative, and they add up to one? [32:22] When was the last time? [32:23] Yeah, softmax. Exactly. So, we'll do the [32:25] same trick. [32:27] So, what we'll simply do is we'll just, [32:29] you know, exponentiate them, right? So, [32:32] like this W1 W6, this angle bracket [32:35] thing is the dot product. That's the [32:36] notation I'm using. EXP of that is just [32:39] you exponentiate them, e raised to that. [32:41] And once you exponentiate them, they all [32:42] become non-negative, and then we just [32:44] divide each by the sum of everything. [32:46] So, it the whole thing will become like [32:47] a probability, right? It'll just add up [32:48] to one. [32:50] Make sense? So, that's how we take [32:52] arbitrary numbers and make them proper [32:53] weights. [32:56] All right. [32:59] So, [33:01] to summarize, [33:02] from embeddings to contextual [33:04] embeddings, that's what we do. 
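Putting the pieces together in the lecture's notation: for station (word 6), the weight of word i is s_i6 = exp(⟨w_i, w_6⟩) divided by the sum over all words j of exp(⟨w_j, w_6⟩), and the contextual embedding is w6-hat = s_16·w_1 + s_26·w_2 + ... + s_66·w_6. Here is that whole calculation, done for every word at once, as a small NumPy sketch. This is the parameter-free version described so far (the learned-parameter version comes on Wednesday), and the embedding values are made up.

```python
import numpy as np

def simple_self_attention(W):
    """W: (n_words, dim) standalone embeddings, one row per word.
    Returns an (n_words, dim) matrix of contextual embeddings."""
    scores = W @ W.T                                # all pairwise dot products
    weights = np.exp(scores)                        # exponentiate...
    weights /= weights.sum(axis=1, keepdims=True)   # ...and normalize: each row sums to one
    return weights @ W                              # row i = weighted average of all embeddings

# Six made-up embeddings for "the train slowly left the station".
W = 0.1 * np.random.randn(6, 100)
W_hat = simple_self_attention(W)
print(W_hat.shape)   # (6, 100): six vectors in, six contextualized vectors out
```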
[33:05] We take all the stand-alone embeddings, [33:08] we calculate these weights using this [33:09] formula, and then we just do the [33:11] weighted average, and we arrive at the [33:12] contextual embedding, and boom, done. [33:16] Okay? [33:17] And so, by choosing weights in this [33:20] manner, the embedding of a word gets [33:22] dragged closer to the embeddings of the [33:24] other words in proportion to how related [33:26] they are. So, just imagine for a second, [33:29] right? In this case, station obviously [33:30] has many contexts, but let's assume for [33:31] a second that it only has the train context [33:33] and the radio station context. [33:35] Okay? [33:37] In the current context, train is closely [33:39] related to station, and therefore exerts [33:40] a strong pull on it. [33:42] Right? [33:43] Now, radio is also related to station, [33:45] but it doesn't appear in the [33:47] sentence. [33:48] So, effectively, it has a weight of [33:49] zero. [33:52] Okay? And that's the beauty of it. [33:55] And please do not ask me things like, [33:56] you know, I was listening to a great [33:58] song on the radio station and the train [33:59] pulled out of the station. [34:01] Okay? Transformers can deal with stuff [34:03] like that. Okay? But yeah, you get [34:05] the idea, the main idea. [34:07] So, by moving station closer to [34:09] train, [34:11] by paying more attention to train, we [34:13] are contextualizing the word's [34:15] embedding to the context of trains, [34:18] platforms, departures, tickets, and so [34:20] on. It's like this portal into the whole [34:22] train world. [34:25] Right? It's beautiful. This simple idea [34:27] will get you there. [34:30] Okay? [34:31] So, this, folks, is called [34:33] self-attention. [34:36] What we just described is called [34:37] self-attention. [34:39] And it's the key building block of [34:41] transformers. [34:42] Okay? Um and so, to [34:44] summarize, stand-alone embeddings come [34:46] in, contextual embeddings go out. [34:50] Any questions? [34:52] Uh yeah. [34:54] Uh I'm still struggling a little bit [34:56] with the intuition of the word [34:58] contextual embedding. So, like the [35:00] weight of station in the station [35:02] embedding, how should I think about [35:03] that? It seems intuitive that it would [35:05] be high for all contextual embeddings, [35:07] but I assume that's not the case. [35:12] It'll typically be a [35:13] high number, because the cosine of the [35:15] angle of the vector with itself is going to be [35:17] one, right? So, [35:19] it's going to be pretty high, but [35:20] there's no guarantee it's going to be [35:21] the highest. [35:22] Right? Because the [35:24] lengths don't actually have to be one. [35:26] We try to keep them kind of [35:28] smallish, but they don't have to be. [35:30] Uh so, the way I would think about it is, [35:31] imagine that you take an average of [35:33] everything else first, and then you [35:35] average it with the old [35:37] embedding. [35:38] Effectively, it's the same as just [35:39] calculating the different weights and [35:40] averaging the whole thing together. [35:42] Sure. [35:44] So, why should you say that the [35:45] embedding of a word would be the same [35:47] number but same place? But is this the [35:50] reason why you need a contextual [35:52] embedding?
[35:53] But even if it's like a [35:55] other word [35:56] and it's not related, that's what [35:59] I'm saying. Correct. Correct. Exactly. [36:01] Exactly. And the other thing to remember [36:02] is that by getting [36:04] by keeping the origin the input sort of [36:07] the size of the input cardinality intact [36:09] as you move through the transformer [36:10] stack, [36:11] when you finally come out the other end, [36:12] there is sort of no loss of information. [36:14] And in the very end, you can choose to [36:16] aggregate, simplify, summarize, and so [36:18] on and so forth. It preserves your [36:19] optionality as long as possible. [36:23] Do you know [36:25] how how long the embedding contextual [36:27] embedding is? [36:28] Is that a factor between the [36:29] two? [36:31] You know [36:33] Yeah, so, what we do is the the sentence [36:34] comes in. There's a whole notion of [36:35] something called a context window, or [36:37] what is the sort of the maximum length [36:39] that these sentences will handle, and [36:40] that's a parameter you can set. And [36:42] we'll come to that when you actually [36:43] look at the collab. [36:44] Um [36:46] Was that a question in the middle? No. [36:48] Okay. [36:49] All right. So, that is self-attention. [36:53] Um and now, [36:55] because that's felt too easy, [36:58] we're going to do a little tweak called [37:00] multi-head attention. [37:02] So, [37:03] this is this is the self-attention we [37:04] just saw. [37:06] What we can do is we can be like, you [37:07] know what? [37:08] Why can't we have more than this? Why [37:10] can't we have more than one of these? [37:12] So, this is called an attention head, [37:13] self-attention head. We'll have multiple [37:16] self-attention heads. Okay? [37:18] Now, and I'll come back to the top thing [37:20] in a second, okay? But So, the question [37:22] is, why should we have multiple [37:23] self-attention heads? [37:25] Because a particular attention head is [37:26] going to pick up some patterns. The [37:28] reason is because [37:30] it'll help us attend to the multiple [37:32] patterns that may be present in a single [37:34] sentence. [37:35] So far, when I've been explaining, uh [37:37] I've sort of basically been looking at [37:38] what the meaning of these words are. [37:40] Just the meaning of these words. But in [37:42] any complicated sentence, you have to [37:44] worry about grammar, you have to worry [37:45] about tense, you have to worry about [37:47] tone. You have to worry about facts [37:49] versus, you know, opinions. There could [37:51] be any number of complicated patterns [37:53] that are sitting in a simple sentence. [37:55] Which means, well, there is just not one [37:57] way to pay attention. There could be [37:59] many ways of paying attention, many sort [38:02] of There could be many needs to pay [38:03] attention. Right? [38:05] Which means that we'll let's have many [38:07] of these attention heads. [38:09] And each one could be learning something [38:10] else. It's exactly like having lots of [38:12] filters in a convolutional network. [38:14] Right? Uh one filter might learn a line, [38:16] another filter might learn a curve, and [38:17] so on and so forth. And we don't want to [38:19] decide a priori, oh, you're going to [38:21] learn a line, right? Similarly here, [38:22] we're not telling any of these things [38:23] what you have to learn. They just have [38:25] to learn based on the training process. 
[38:27] So, what we do is [38:28] So, actually, this is an example where [38:30] this is from the original transformer [38:32] paper, where this sentence is the lawyer [38:35] will Sorry, the law will never be [38:37] perfect, but its application should be [38:39] just. This is what we are missing, in my [38:43] opinion. [38:44] The complicated sentence, right? So, the [38:46] first one attention head, actually, this [38:48] is the pattern of things it's it's it's [38:50] So, for example, the word perfect here, [38:53] the contextual embedding of the word [38:54] perfect [38:57] draws upon heavily from the word law [39:00] in this example. [39:01] Okay? [39:02] If you look at another attention head, [39:04] the contextual embedding for the word [39:06] perfect is actually drawing heavily from [39:07] just perfect and nothing else. Right? [39:11] And if you look at other words, the [39:13] patterns are subtly different of what [39:14] it's paying attention to. [39:17] So, these are two different attention [39:18] heads, and they're learning different [39:20] kinds of attentions. [39:21] Okay? In reality, trying to make sense [39:24] of why they [39:25] pay attention to the way they do, it's [39:27] usually quite sort of difficult to [39:29] figure that out. You can't actually [39:30] interpret it. But when you have lots of [39:32] attention heads, the performance on the [39:34] task that you care about gets really [39:35] much better. [39:37] Right? And then you're saying, okay, I [39:39] can use that. Uh yeah. [39:40] That's the [39:42] I think that's the idea behind this. Is [39:43] that the idea behind this? [39:49] Right. [39:50] Exactly. Same logic. Same logic. [39:53] Yeah. [40:13] Actually in the convolutional case, the [40:15] ones and zeros I had were just example [40:17] numbers to show that that particular [40:19] filter could detect a vertical line or [40:21] horizontal line. You will recall that [40:23] when we actually train a convolutional [40:24] network, we actually don't specify the [40:26] numbers. We start with random [40:27] initialized weights and then we let back [40:30] back propagation figure it out. [40:32] Similarly here, we don't decide any of [40:34] these things. We just let back prop [40:35] figure it out. [40:37] Okay? And now the question of what are [40:39] the weights that are actually going to [40:40] be learned? We'll come come to that in a [40:42] bit. [40:43] Okay? Uh yeah. [40:47] Uh I was wondering how come we have [40:50] different attention head even though [40:53] uh it seems like they're only function [40:55] of a dot product and we have the same [40:57] dot product for same embeddings. [40:59] Great question. Great question. And I [41:01] literally have a a note in my slide [41:02] saying, "If a student asks this good [41:04] question, tell them to wait till [41:06] Wednesday." [41:08] So, great question. And we'll come back [41:10] to that uh on Wednesday and spend a fair [41:12] amount of time on it. So, uh [41:14] the the the point that's being made here [41:17] is that oops. [41:19] When we look at self-attention, [41:22] the embeddings came in and we did all [41:24] these dot products and the contextual [41:26] things popped out the other end. Note [41:28] that inside the self-attention box, [41:30] there are no parameters. [41:32] There are no parameters. [41:34] So, the question that is being raised [41:36] here is that so what are we learning [41:38] really? 
If there is nothing inside to be [41:40] learned, if there are no parameters, no [41:42] coefficients, what are we learning? [41:43] That's the question. And by extension, [41:46] if we have two of these and neither of [41:48] them is learning anything, what's the [41:49] point? [41:52] Sadly, you have to wait till Wednesday. [41:55] Okay? But we have a great answer to the [41:57] question. So, [41:58] it'll be worth it. And if you can't [42:00] stand the suspense, read the book. [42:03] All right. So, that is uh that's why we [42:05] need multiple heads. Okay? And now to [42:07] come back to this, so what we do is it [42:09] goes through this head and you get these [42:11] W's, right? And it goes through here and [42:13] we get the another set of W's. [42:15] Then what we do at the very end is we [42:17] concatenate them. [42:19] Okay? We concatenate them and we do a [42:21] projection. And this is what I mean by [42:23] that. [42:29] So, we have [42:30] uh this this is one self-attention head, [42:33] self-attention one. [42:35] This is self-attention two. [42:38] And let's say that [42:41] W1 hat comes out. [42:44] And I'm just going to call it Z Z1 for [42:47] the same thing so that there's no name [42:48] clash. [42:49] Okay? And uh the W2, W6, all of them are [42:52] coming, right? Let's focus on W1 and Z1. [42:55] W1 and Z1 are both contextual embeddings [42:57] for the same word. [42:59] Okay? For the first word, word one. And [43:01] so what we do is let's say this is W1 uh [43:04] let's call let's say this vector is like [43:06] this. Okay? [43:07] And let's say that this vector is like [43:10] this. [43:12] What I mean when I say concatenated here [43:14] is we literally take [43:16] um this word here, [43:18] this embedding here, then we take this [43:20] thing here. [43:23] Okay? And we just make it a long vector. [43:25] We concatenate it. But now this vector [43:27] has become twice as long, right? [43:30] So, what but remember, we always want to [43:32] preserve this the the number of inputs [43:34] we have and the lengths of these vectors [43:36] everywhere as we go along. So, what we [43:39] do is at this point, we run it through [43:42] a single dense layer [43:44] which will take this thing and make it [43:46] back into the same small shape as [43:48] before. [43:50] So, this is a dense layer. [43:54] That's it. So, this vector comes in [43:56] and it becomes it gets compressed back [43:58] to the original shape that came out of [44:00] here. [44:01] So, you could have like 20 of these uh [44:03] attention heads [44:04] and the concatenated will be 20 times [44:06] long and then just project boom, one [44:08] dense layer comes back to the original [44:09] shape. [44:12] So, that's that is the projection step. [44:16] And that's what I mean here when I say [44:17] concatenate and project. [44:20] So, at this point, what we have is [44:21] things come in, we contextualize them [44:23] using these different attention heads, [44:25] and when they come out of the attention [44:27] heads, we take them all, we just like [44:29] concatenate them, and then compress them [44:31] back to the same original starting [44:32] shape. Right? If these vectors are 100 [44:35] units long or 100 dimension long, [44:37] whatever comes out is 100 still. [44:39] And to pre- preserving this [44:42] size as we go along is very important [44:43] for reasons that'll become apparent a [44:44] bit later. [44:46] Okay. So, that is the multi-attention [44:49] thing. 
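Here is the concatenate-and-project step as a rough NumPy sketch: two heads each produce a contextual vector for word one (the W1-hat and Z1 above), we glue them end to end, and a single dense (linear) layer brings the result back to the original width. The projection weights are random here purely to check the shapes; in the real model they are learned by backprop.

```python
import numpy as np

dim = 100                                  # width of each embedding
w1_hat = np.random.randn(dim)              # word 1's contextual vector from head 1
z1_hat = np.random.randn(dim)              # word 1's contextual vector from head 2

concat = np.concatenate([w1_hat, z1_hat])  # length 200: twice as long

# One dense layer projects back down to the original width.
# (Random weights here; in practice they are learned.)
W_proj = 0.01 * np.random.randn(2 * dim, dim)
b_proj = np.zeros(dim)
projected = concat @ W_proj + b_proj       # back to length 100

print(concat.shape, projected.shape)       # (200,) (100,)
```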
[44:50] Now, a final tweak for today [44:53] is that we will inject some [44:55] non-linearity [44:57] with some dense layer dense ReLU layers [44:59] at the very end. So, we'd went through a [45:01] bunch of attention heads. We we came up [45:03] with a bunch of contextual embeddings [45:04] now. [45:05] So, at this point so far, [45:07] there are no since there are no [45:08] parameters inside these boxes, [45:10] uh [45:11] right? And there are some parameters [45:13] here. [45:13] We need to do some non-linearity. So [45:15] far, there's been nothing that's [45:16] non-linear so far. So, here we actually [45:18] send it through one or more ReLUs. [45:21] Typically, they just use one ReLU. So, [45:24] and what I mean by that [45:34] Sorry. [45:37] So, this is what we had here and then [45:41] we take it in [45:46] and then run it through [45:50] actually [45:54] we typically run it through [45:57] a ReLU. [45:58] This is a nice ReLU. [46:01] Okay? And all and and the rule of thumb, [46:03] as you will see, if let's say this [46:04] vector is say 100 dimensions long, they [46:06] typically will choose a ReLU which is [46:08] about 400 [46:10] wide. And then it just gets projected [46:12] out again back to 100. [46:16] So, [46:17] this is just a simple, you know, the [46:20] input comes in, goes through a single [46:21] hidden layer with four four times as [46:23] many as here, and then it [46:26] project another dense layer [46:28] to 100 again. [46:29] And this since there are ReLUs here, [46:32] we in- we have injected some [46:33] non-linearity into the processing. [46:35] Okay? Now, [46:37] a lot of this stuff when it came out [46:39] felt very ad hoc. [46:41] Right? It didn't come from some deep, [46:43] you know, theoretical motivations. [46:45] But and people had strong intuitions as [46:47] to why these things were helpful. And as [46:49] it turns out, since the transformer came [46:51] out, people have tried to optimize every [46:53] aspect of this thing. [46:55] It's actually pretty difficult to beat [46:56] the starting architecture. [46:58] Right? Improvements have been made, but [47:00] it's actually very robust architecture. [47:02] So, [47:03] so that's what's going on here. And then [47:05] when we come out of this thing, [47:08] this is what we have, the story so far. [47:10] We start with random standalone [47:13] embeddings. This could be [47:14] GloVe embeddings, it could be random [47:15] weights, doesn't matter. It goes through [47:18] a bunch of self-attention heads. We [47:19] concatenate it when it comes out the [47:21] other end. [47:23] Concatenate it when it comes out the [47:25] other end. And then we project it back [47:27] to the same size as before. Then we run [47:29] it through, you know, a ReLU followed by [47:31] a linear layer and we get these things [47:33] again. So, in this whole process, if six [47:36] things came in, six things will come [47:37] out. And if six and if those six things [47:40] that came in [47:41] were embedding standalone embedding [47:43] vectors of 100 dimensions, what comes [47:45] out is also 100 dimensions. [47:47] So, in that sense, you could think of [47:48] this whole thing as a black box in which [47:50] whatever you send in, the same number of [47:52] things will come out of the same length. [47:54] The numbers will be different because [47:56] they will have been heavily [47:56] contextualized. [47:58] The numbers are much smarter, in other [48:00] words. 
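The "inject some non-linearity" step above is just a tiny two-layer network applied to each position independently: widen to roughly four times the embedding size with a ReLU, then project back down. A minimal Keras-style sketch, assuming 100-dimensional embeddings (the exact sizes are assumptions, not the lecture's Colab settings):

```python
import tensorflow as tf
from tensorflow.keras import layers

EMBED_DIM = 100              # assumed width of the contextual embeddings
FFN_WIDTH = 4 * EMBED_DIM    # rule of thumb from the lecture: about 4x wider

# Applied to every position in the sequence independently.
feed_forward = tf.keras.Sequential([
    layers.Dense(FFN_WIDTH, activation="relu"),  # widen, with a ReLU for non-linearity
    layers.Dense(EMBED_DIM),                     # project back to the original width
])

x = tf.random.normal((1, 6, EMBED_DIM))  # (batch, 6 words, 100)
y = feed_forward(x)
print(y.shape)                           # (1, 6, 100): same shape out as in
```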
[48:02] So, so far what we have seen is that we [48:04] have satisfied two of the three [48:05] requirements. We have taken the context [48:08] of each word into account [48:09] by using these dot products in the [48:11] self-attention layer, and we can [48:12] generate an output that is the same [48:13] length as the input, but we have ignored [48:15] the fact that we have ignored word order [48:17] completely. [48:19] Okay? Because whether I had said the [48:21] train slowly left the station or I had [48:23] said the the station slowly left the [48:25] train, [48:26] this thing won't know the difference. [48:30] Because dot products [48:32] function on sets, not on sequences. They [48:34] function on sets. [48:36] Okay? Regard- You can you should [48:37] convince yourself of this. Regardless of [48:39] the order, the dot product calculation [48:40] doesn't change anything. [48:42] Because we are doing every pair. [48:46] Okay? So, the question is how do we take [48:48] the order of the words into account? Um [48:50] right. As I was saying, we can scramble [48:52] the order of the words in a sentence and [48:53] we'll get the exact same contextual [48:54] embeddings at the end. [48:55] So, by the way, if you're working on a [48:57] problem in which the order doesn't [48:58] matter, [49:00] then you can stop right now and use the [49:01] transformer. [49:04] And there are many problems that are [49:05] actually in that category where the [49:06] order doesn't matter. So, if you take [49:08] traditional structured data, right? Uh [49:10] tabular data, [49:12] uh you know, blood pressure, cholesterol [49:14] level, boom boom boom. Does it predict [49:15] heart disease? Well, there is no order [49:17] in that thing. You can use the [49:18] transformer as is without doing anything [49:20] more. [49:22] So, transformers work for both sets and [49:24] sequences where order matters. [49:27] Okay. So, the fix for this is something [49:29] called the positional encoding. [49:32] Um [49:33] so what we do is very simple. There are [49:34] By By there are many things that been [49:36] invented um to to to tell transformers [49:40] to give an transformer some information [49:42] about the order of each of the things [49:44] that are coming in. [49:45] I'm going to go with something called [49:46] the, you know, [49:47] the simplest possible way which actually [49:49] works pretty well in practice. So, what [49:51] we do is [49:52] for each position [49:55] each possible position in the input [49:56] starting from the first position all the [49:58] way through the last position [50:00] we imagine that that position itself is [50:02] a categorical variable. [50:05] Right? If a sentence can only be 30 30 [50:07] words long, let's say, we say that hey, [50:10] the position of each word is a number [50:11] between 0 and 29. [50:14] And so, we can just think of it as a [50:16] categorical variable. [50:17] And because the categorical variable, we [50:20] can just imagine an embedding for that [50:22] for each potential value. So, it'll [50:24] become clear in just a moment because I [50:25] have a numerical example. [50:27] And so, what we do is we will just take [50:28] that standalone embedding and then we'll [50:30] take this position embedding [50:32] which represents the position of the [50:33] word in the sentence, we just add them [50:35] up. [50:36] Okay? Uh yeah. 
[50:39] Question: if the initial sentence itself has a mistake — say I write it as "the train slowly the station" — then my output is actually going to be wrong? Yes. Now, since transformers are trained on lots of data, they will be quite robust to these things. But strictly speaking, arithmetically, yes.

[51:02] Okay, let's look at an example. Let's assume these are your standalone embeddings, and this is your whole vocabulary: unknown, cat, mat, I, sit, love, the, you, on. That's it; that's our vocabulary. For this vocabulary we have these standalone embeddings, and just for argument's sake, let's assume these embeddings are only two long — the dimension of these embeddings is two. If you recall, the GloVe embeddings we used last week were 100 long, and the ones we're using in the homework are even longer than that. But here we are assuming they're only two long. So the embedding for "cat" is (0.5, 7.1).

[51:42] Now, let's assume that we can have at most 10 words in any sentence that comes in. Obviously, a particular word could be in position 0 all the way through position 9. We will learn embeddings for each of these positions, and these embeddings are also two long — dimension two.

[52:04] Now, where will these embeddings come from? What is the answer to the general question of where any of these weights come from? We will learn them with backprop. We start with random numbers initially and then make them better and better over the course of training.

[52:26] So, we have these two tables of embeddings — the standalone embedding for the word and the position embedding — and then we literally add them up. For example, let's say the sentence that came in is "cat sat mat." It's got three words. The embedding for "cat" is (0.5, 7.1), so I write it down here. "Cat" happens to be in the zeroth position, so I grab the embedding for position 0, which is (1.3, 3.9), stick it there, and literally add them up: 0.5 + 1.3 = 1.8, and 7.1 + 3.9 = 11.0. That's it. So the position-encoded embedding for the word "cat" is (1.8, 11.0), not (0.5, 7.1).

[53:18] And if "cat" happens to show up in another part of the sentence — say instead of "cat sat mat" we had "mat sat cat" — then "cat" is now in the third position, index 2 (the positions being 0, 1, and 2). Its word embedding doesn't change; it's still just the embedding for "cat." But instead of picking the row for position 0, we pick the row for position 2, which is (0.6, 8.1), put that there, and add them up instead.

[53:43] So this is the idea of the positional encoding. This is how we inject position knowledge into the transformer.
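Here is the same arithmetic as a tiny NumPy sketch, using the numbers from the slide. Only the rows mentioned in the lecture are filled in; in the real model both tables are learned by backprop rather than written out by hand:

```python
import numpy as np

# Standalone (word) embedding table -- only the row for "cat" is from the slide.
word_embedding = {
    "cat": np.array([0.5, 7.1]),
}

# Position embedding table, one row per position 0..9 -- first three rows from the slide.
position_embedding = {
    0: np.array([1.3, 3.9]),
    1: np.array([6.3, 3.7]),
    2: np.array([0.6, 8.1]),
}

# "cat" as the first word of "cat sat mat": add the word row and the position-0 row.
encoded_cat_at_0 = word_embedding["cat"] + position_embedding[0]
print(encoded_cat_at_0)    # [ 1.8 11. ]  -- this is what actually enters the transformer

# The same word "cat" as the last word of "mat sat cat": same word row, position-2 row.
encoded_cat_at_2 = word_embedding["cat"] + position_embedding[2]
print(encoded_cat_at_2)    # [ 1.1 15.2]
```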
[53:52] Question: wouldn't the positional embedding be different for each sentence? No — this is just one table that tells you what the embedding for each position is. It says that for a word appearing in, say, the seventh position of any input sentence you feed in, this is the embedding to use for that position.

[54:16] Question: what if the word appears twice in the same sentence? Great question. Say, just for argument's sake, the sentence was "cat cat cat." For each one of those cats, the word embedding will be the same, (0.5, 7.1), because that happens to be the embedding for "cat" regardless of position. But for the first cat we will add (1.3, 3.9); for the second cat, (6.3, 3.7); and for the third cat, (0.6, 8.1). Only the positional embedding changes, so the resulting sum is different for each of those three words, even though they are exactly the same word.

[55:05] Question: is the position embedding table specific to the standalone embedding table — say, if you were to add or remove some words from the standalone table? It's independent. It only depends on your assumption about how long the sentences can be. That's it. It doesn't care at all about which words are coming in; that's a whole different thing. They are two independent tables that just get learned as part of this process.

[55:31] And yes, I have the same thing for "sat" and "mat" — that's what we have. So just make sure you understand these two slides, so the mechanics are really clear.

[55:46] Question: how do you control for filler words? For example, if you're taking transcription output and running it through a transformer, and you have a lot of "um"s and "like"s that are disproportionately frequent, is there a way to see through the noise? Typically, what's done — and we will talk about this thing called byte pair encoding — is that individual characters, fragments of words, and whole words are all treated as tokens. So when you hear stuff like "uh" and so on, it gets mapped to these small tokens, and then we treat them like any other token.

[56:28] Question: is the aggregation just a simple sum? Wouldn't the actual standalone semantic meaning of the word be more important than its relative position in the sentence? It could be. We just don't know a priori whether it's going to be important or not for any particular sentence. When we train the transformer on a lot of textual data, it will figure out the right values for these things so that, on average, the accuracy is as high as possible.
[56:55] So, in many of these things there's always a tension between our human intuition about how it should work and whether you should just throw it into the meat grinder of backprop and see what happens. And here it turns out you can just throw it into backprop, and it will actually do a pretty good job.

[57:10] Question: for the positional encoding, would we just be using the sum vector, rather than the 2-by-3 matrix on the slide? Oh yes — that matrix is just for demonstration. The sum is the thing that will actually go into the transformer. Correct. That was just me being overly verbose in the slides.

[57:33] Question: for the sentences in the input, at this point are we still parsing out punctuation? And if we have a multi-sentence input, is there a positional embedding vector per sentence? So here, basically, the starting point is tokens. In our example, because we're working with simple standardization and stripping and things like that, I'm just showing actual words. If you go to something like GPT-4, since it uses a different tokenization scheme, each token might be part of a word, an individual character, or a punctuation mark; in fact, the GPT family doesn't strip out punctuation, which is why when you ask a question, it comes back with intact punctuation in its response. We'll revisit this when we look at BPE, byte pair encoding, later on. But the key thing to remember is that all the stuff we're talking about starts from the notion of a token. How you define a token, given a bunch of text, is the tokenizer's job, and we've just assumed a simple tokenizer for the time being.

[58:33] Okay, so at this point, folks, we have satisfied all the requirements. We have taken the surrounding context of each word into account, we have taken the order into account, and so on and so forth, because what's coming in here is the positional embeddings, and it runs through the whole transformer stack. This is called a transformer encoder.

[58:57] And you can see here the original picture from the paper — it's an iconic picture at this point. These are the inputs, like "the cat sat on the mat." They come in here and get transformed into standalone embeddings. Then, based on the position of each word — that's why you see a plus sign here — we add the positional embedding to that. The resulting thing goes into this transformer block, where we go through multi-head attention, and things come out the other end. Then there is this thing called add-and-norm, which we'll revisit on Wednesday.
[59:37] Then it goes through a feed-forward network and another add-and-norm, which we'll also revisit on Wednesday, and then it comes out the other end. That's it. That's a transformer encoder.

[59:48] Just to point out a couple of things: the input embeddings can be random weights or pre-trained embeddings. We add in a position-dependent embedding to represent the position of each word in the sentence — that's the plus. Then we pass it through multi-headed attention to get a contextual representation. Finally, we pass all of this through a simple network, typically two layers: one hidden layer with ReLUs and then a linear layer after that. That's the encoder.

[01:00:20] And here is perhaps the most important point to keep in mind. Because we have taken inordinate care to make sure that the things coming in and the things going out have the same size — both in the number of tokens and in the length of each vector — we can stack them up like pancakes. We can have lots of transformer encoders stacked one on top of the other. It's the perfect API, the simplest possible API: the same thing comes in, the same thing goes out, in terms of size. So you can have a transformer encoder, another one on top, boom, boom, boom, one after the other. GPT-3 has 96 of these transformer blocks in its stack. And as with all things deep learning, the more layers you have, the more complicated things you can do with it — as long as you have enough data to keep the model happy so it doesn't overfit.
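As a rough sketch of that stacking idea: because each encoder block returns a tensor with the same shape it received, blocks can simply be applied in a loop. This uses the TransformerEncoder helper that shows up later in the Colab; its exact constructor signature is an assumption here, based only on the three hyperparameters named in the lecture.

```python
from tensorflow import keras
from hardel import TransformerEncoder   # the course's helper layer (name/signature assumed)

embed_dim, dense_dim, num_heads = 512, 64, 5   # example hyperparameters
num_blocks = 4                                 # GPT-3 stacks 96 of these

inputs = keras.Input(shape=(None, embed_dim))  # (num_tokens, embed_dim) comes in...
x = inputs
for _ in range(num_blocks):
    # ...and (num_tokens, embed_dim) comes out of every block, so stacking is trivial.
    x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)

encoder_stack = keras.Model(inputs, x)
```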
[01:01:13] All right. What we haven't covered, and will cover on Wednesday, is the question that was posed earlier: since there are no parameters inside the self-attention block, what are we actually learning? And then there are these things called residual connections and layer normalization. We'll talk about all of those on Wednesday; they are refinements to the idea. So, all right, it's 9:39 — let's apply the transformer encoder to an actual problem. Any questions?

[01:01:45] Question: you said you could have multiple transformer blocks stacked, and within a block there can be multiple self-attention heads. If the accuracy is the same, why would you use one rather than the other? Yeah — you can have a lot of attention heads, and that's totally fine; I forget how many GPT-3 and GPT-4 have, but they have a whole bunch of them. So you can go wide and you can go deep. Both are done in practice. The one thing to remember is that if you go wide, with a lot of attention heads, then, given the particular input coming into that block, the heads will learn different patterns from it. Whereas if you stack blocks on top of each other, each block learns different ways to contextualize what's coming in — it operates at higher levels of abstraction. The analogy would be that the seventh layer of a convolutional net takes the sixth layer's output and says, "Oh, I'm seeing a lot of edges here. I'm going to take an edge like this, two circles like that, and call it a face." So the deeper layers operate at a higher level of abstraction.

[01:02:58] All right, let's go to the Colab. What we're going to do is take the transformer we just learned about and apply it to solve the travel slot-filling problem. We start with the usual preliminaries, and we have taken the ATIS data set I talked about and put it up for easy consumption — it's here.

[01:03:30] If you look at the top few rows, you can see, for example, "I want to fly from Boston at 8:30 a.m.," and then the slot-filling column is the output. As it turns out, the people who built this data set also gave each whole query an intent — is it a flight query, is it some other kind of query, and so on — which we're not going to use. "I want to fly from Boston at 8:30 a.m. and arrive in Denver at 11:00 in the morning." "What kind of ground transportation is available in Denver?" "What is the airport at Orlando?" "How much does the limo service cost within Pittsburgh?" And so on and so forth — you get the idea. It's a very wide range of queries in this data set.

[01:04:13] So let's just ignore the intent for a sec. What we're now going to do is take only the query column — that's going to be our input text — and the slot-filling column is going to be our dependent variable, the output. We just gather them all up here, let it run, and we do it for both the training data and the test data.

[01:04:40] And we have taken the transformer-related code in Keras and packaged it into a little hardel library for easy consumption. That thing is here; you can download it. Calling it a library is overstating it — we literally just collected a bunch of code and stuck it in a file.
[01:04:59] So from hardel we'll import the transformer encoder, and we'll import this positional embedding layer, because what we're going to do is take the input, do the positional-encoding business, and then send it into the transformer. But first, let's vectorize the input queries that are coming in. We define a thing here — ah, "max query length is not defined"; that's what happens when you don't run everything. All right.

[01:05:38] Okay, so now we have this. It turns out there are 8,888 tokens — 8,888 distinct words — in the input queries in the data. Take a look at the first few. You can see there is unk. Because the output mode here is integers — we just want integers to come out, not multi-hot encodings or anything, since we're going to take these integers and look up embeddings from them — the layer reserves the empty string as the pad token (this should be familiar from last week), then unk for unknown tokens, and then "to," "from," "flights," which are among the most frequent. It turns out "Boston" is actually the most frequent; I don't know what's up with that. It is what it is. Then we apply the same vectorization to the train and test data sets.

[01:06:25] Now, we need to do STIE on the output side of the problem as well, because the slots — the dependent variable here, remember — are also sentences, with the B's, the O's, things like that, so we need to vectorize those too. Let's take a look at some of these slots; you can see all this stuff going on. And here is an example where you have to be very careful when you do the standardization. Typically, standardization will remove punctuation, lowercase everything, and do things like that. But here, these tags have a specific meaning. We can't just go in there, remove the period and the underscore, and turn the capital B into a lowercase b — that would just harm it. We need to preserve the nomenclature of the output, all of those tags. So we don't want the standardization to strip any of that out, and we say: standardization, None. Look at that — we tell Keras, do not standardize this; do not do your usual thing.

[01:07:25] We do that for the output side, and then let's look at the vocabulary. Yeah, this looks pretty good. These are all the things we would expect to see — the distinct tokens in the output strings. [01:07:45] We have 125 of them. In the lecture I said there are 123 possible slots, so why is it 125 here? Yes: unk and pad. Correct.
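Here is a rough sketch of those two vectorization layers. The variable names (train_texts for the gathered queries, train_slots for the gathered slot strings) are placeholders, and the parameter values are taken from the lecture, including the max query length of 30 that comes up a bit later:

```python
from tensorflow.keras.layers import TextVectorization

max_query_length = 30   # sentences will be padded / truncated to this length

# Input side: the default standardization (lowercasing, stripping punctuation) is fine,
# and we ask for integer token ids so we can feed them into an embedding layer.
query_vectorizer = TextVectorization(
    output_mode="int",
    output_sequence_length=max_query_length,
)
query_vectorizer.adapt(train_texts)    # fills the ~8,888-token query vocabulary

# Output side: standardize=None, or Keras would strip the periods/underscores
# and lowercase the "B" in the slot tags, destroying their meaning.
slot_vectorizer = TextVectorization(
    standardize=None,
    output_mode="int",
    output_sequence_length=max_query_length,
)
slot_vectorizer.adapt(train_slots)     # 123 slot tags + unknown + pad = 125 entries
```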
[01:07:57] Okay. Now we'll set up a transformer encoder — oh, wait, wait, I forgot about doing this first. My bad. When I saw the slide, I jumped to the Colab without giving you a bit more background. No problem.

[01:08:20] The way we're going to model this problem is as follows. We have something like "fly from Boston to Denver" — that's the input coming in — and the correct answer is the corresponding tag sequence: O, O, a B-something, O, another B-something. That's the input, and that's the right answer. So what we'll do is create these positional input embeddings, like we discussed before, and run them through a transformer. It gives us contextual embeddings — if we send five in, it will send five out, except now they're the blue, contextualized vectors. Then we run them through a ReLU; we still have five vectors coming out. And then, for each of the things that comes out, we stick a 123-way softmax on it. For each output position there is a 123-way softmax, and that's the classification problem we're going to solve. The weights in all of these layers will get optimized by backprop.

[01:09:34] Sorry? Oh no, that's a layer — the weights in the layer still need to be learned. It's like the text vectorization layer: it's a bunch of code, and then you actually run it on a particular corpus to adapt it and fill its vocabulary. It's like an empty shell that needs to get populated. So the weights in all of these things are going to get updated when we train the model, by backprop. And that's it; that's the setup.

[01:10:06] Does this make sense before I switch back to the Colab? In particular, does this part make sense: a bunch of things come out of the transformer, and for each one of them we need to solve a 123-way classification, so we stick a softmax on every one of those output positions. Yeah.

[01:10:32] Oh, I see. It could be whatever — or, to put it another way, it's your choice as the modeler. The thing is, at this point, with those blue contextual vectors, the transformer is basically saying: my job is done.
[01:10:51] It has given you these valuable contextual embeddings at some high level of abstraction. What you do with them depends on your particular problem. The usual practice is to take them and — if these embeddings are really long — maybe make them a little smaller using a ReLU layer. And using a ReLU is always a good idea, because when in doubt, throw in a bit of non-linearity. Once you're done with that, you need to actually classify, so you stick an output softmax on it. Okay, so that's what we have.

[01:11:27] All right, back to this picture. We also get to decide how long these embedding vectors are, because here we're not going to use GloVe embeddings — we're going to learn everything from scratch. So we can decide how long the embedding vectors are, and I have decided I want them to be 512 long. That's what I have here: 512. Then, inside the transformer — remember, after we concatenate everything, we run it through a final ReLU layer — how big should that layer be? That's what I mean here by dense dim; I want it to be 64. And then, just for fun, I'm going to use five attention heads. Because why not?

[01:12:20] And in the final part here — to go to Ali's question — these vectors are all 512 long, as I mentioned earlier. But this layer here I'm going to make just 128; that's what I mean by units.

[01:12:38] Now, if you look at the actual model: whatever comes in has a max query length of, I think, 30 — actually, let's just make sure of that. What did I assume? Thirty, correct: max query length 30. So each sentence is 30 tokens long. If a sentence has 35 words in it, what's going to happen? The last five get chopped — truncated. If it comes in at 22 words, we pad it out with eight pad tokens. That's how we make sure everything gets to 30.

[01:13:09] All right, so we come back here. The input is still sentences — sequences of tokens — that are 30 long. Then we run it through a positional embedding layer. This positional embedding layer has the actual embedding table for each word, and it has the positional embedding table. So, just to be clear, this positional embedding layer is basically this:
[01:13:37] This table and this table together are packaged up into the positional encoding layer. But they are two distinct tables; they just happen to be packaged together. So that's what we have here: we get a nice positional embedding out, and then, boom, we run it through the transformer. This transformer-encoder object — we have to tell it, obviously: this is the embedding dimension that's going to come out, this is the dense dimension to use in that final feed-forward layer inside each block, and this is the number of attention heads I want you to use. That's it — only three things have to be specified. And whatever comes out of the transformer encoder are those blue contextual vectors.

[01:14:19] Then we are back in good old traditional DNN territory: we take that output, run it through a ReLU layer with 128 units, add a little dropout, and then run it through a dense layer whose size is the output vocab size, 125 — so a 125-way softmax, activation softmax. Connect everything up into a model with an input and an output, and boom, that's the whole model. That's what we have here.

[01:14:51] Now, after Wednesday's class, for extra credit and for your personal edification, try to work through this and come up with this number — 5.3 million (not 53 million) — and see if it matches the number here. It should match. Hand-calculate the number of parameters inside the transformer, for fame and fortune. It's optional; do it after Wednesday's class, not right now. I have actually listed the exact math that goes into it here.

[01:15:26] By the way, you can peek into any layer's weights using its weights attribute. This is the positional embedding layer we had. We can click on it and see that it holds two tables. The first is the word-embedding table, which says: there are 8,888 tokens in my vocabulary, and each of those tokens gets an embedding vector that is 512 long. The second object is the positional embedding table, which says: my sentences can be 30 long, and for each position of a 30-long sentence, I have a 512-long embedding. Both of these tables, as I mentioned earlier, are packaged up inside the layer, and you can actually see what the weights are before you do any training.
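For reference, here is a hedged sketch of what that whole model could look like in code. The PositionalEmbedding and TransformerEncoder constructor signatures are assumptions about the course's hardel helper file (only the three hyperparameters named above come from the lecture), and the dropout rate is a guess:

```python
from tensorflow import keras
from tensorflow.keras import layers
from hardel import PositionalEmbedding, TransformerEncoder  # helper file; exact API assumed

vocab_size = 8888        # distinct tokens in the input queries
num_slots = 125          # 123 slot tags + unknown + pad
max_query_length = 30
embed_dim, dense_dim, num_heads = 512, 64, 5

inputs = keras.Input(shape=(max_query_length,), dtype="int64")

# Word-embedding table + position-embedding table, packaged together and summed.
x = PositionalEmbedding(max_query_length, vocab_size, embed_dim)(inputs)

# One shape-preserving encoder block; more could be stacked here, one per line.
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)

# Per-token classification head: Dense acts on the last axis, so each of the
# 30 token positions gets its own 125-way softmax.
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)          # rate is an assumption, not from the lecture
outputs = layers.Dense(num_slots, activation="softmax")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()   # the lecture's version reports roughly 5.3 million parameters

# Peek at the two tables inside the positional embedding layer (before training):
# model.layers[1].weights  -> an (8888, 512) word table and a (30, 512) position table
```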
[01:16:09] So, I'm going to stop here, because the model is going to take a few minutes to run and we're already at 9:45. We will continue the journey on Wednesday. If some of it is not super clear, don't worry about it; it will become much clearer on Wednesday. All right, folks, have a good couple of days. I'll see you on Wednesday.