So, all right. Transformers, even though they were originally invented for machine translation, going from English to German, German to French, and so on, have turned out to be an incredibly effective deep neural network architecture for a vast array of domains. It has reached the point where, if you're working on a particular problem, you will almost reflexively try a transformer first, because it's probably going to be pretty darn good. Okay? They have just taken over everything.

Obviously they have transformed translation, which was the original target. Google search, really information retrieval. Completely transformed speech recognition, text-to-speech, even computer vision. Even the stuff we learned with convolutional neural networks: there are now transformers for computer vision problems that are actually quite good, which is kind of shocking because they were not even designed for that. Then reinforcement learning. And of course all the crazy stuff going on with generative AI: large language models, multimodal models, everything runs on a transformer.

And then there are numerous special-purpose systems, and I find these to be even more interesting. AlphaFold, the protein-folding AI, runs on a transformer stack. I could just list examples one after the other. It's an incredibly flexible architecture, and I think we are lucky to be alive during a time when such a thing was invented. And I'm not getting paid to tell you any of this. All right, it's just amazing. Okay, so let's get going.
We will use search, or more broadly information retrieval, as a motivating use case. These are all examples where people type natural language queries, or utter them into a phone, and we need to make sense of what they want. And it's not like "write me a limerick about deep learning," where there could be many possible right answers. It's more like, "tell me all the flights leaving Boston for LaGuardia tomorrow morning between 8:00 and 9:00." You had better get that right. Accuracy is a high bar. Or: how many customers abandoned their shopping cart? Find all contracts that are up for renewal next month. Tell me all the customers who ended their call to the call center yesterday not entirely pleased with the transaction. The list goes on and on. In particular, we'll focus on this travel-related example today: find me all flights from Boston to LaGuardia tomorrow morning. That kind of query.

In these sorts of use cases, a very common approach historically has been to take the natural language query and convert it into a structured query. By that I mean we parse the query and extract the key things in it. Once we extract those key things, we reassemble them into a structured query, like a SQL query. SQL is just one example of a possible structured query; there are many, many ways to structure queries, but SQL is familiar to lots of people, so I'm using that. Once you have the SQL query, you're in very comfortable structured land: you just run the query against some database you have, get the results back, format them nicely, and show them to the user. That's the flow.
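To make that last step concrete, here's a minimal Python sketch of what "reassemble the extracted pieces into a structured query" might look like. The slot names and the `flights` table schema are hypothetical, made up purely for illustration:

```python
# Minimal sketch: turning extracted travel-related entities ("slots") into a
# structured query. The slot names and the "flights" schema are hypothetical.
slots = {
    "fromloc.city_name": "BOS",
    "toloc.city_name": "LGA",
    "depart_date.relative": "tomorrow",        # would be resolved to an absolute date
    "depart_time.period_of_day": "morning",
}

sql = (
    "SELECT * FROM flights "
    "WHERE origin = ? AND destination = ? AND depart_date = ? AND depart_period = ?"
)
params = (
    slots["fromloc.city_name"],
    slots["toloc.city_name"],
    slots["depart_date.relative"],
    slots["depart_time.period_of_day"],
)

# Run it against whatever database you have, e.g. cursor.execute(sql, params)
print(sql)
print(params)
```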
So, the question becomes: how do we automatically extract all the travel-related entities from this query? We want to be able to extract BOS, LGA, tomorrow, morning, flights, and so on and so forth. Those are the travel-related entities we want to pull out. That's the problem.

We will use a really cool data set called the airline travel information system (ATIS) data set, and I'll explain it in just a bit. We'll use this as the basis for the example. The way to think about it is that we have a whole bunch of queries in this data set, several thousand of them. And fortunately for us, the researchers who compiled it went through every one of those queries and manually tagged each word with what kind of travel entity it is, or none of them. They call these slots: they take each word in the query and assign it to a particular kind of slot, and I'll explain what a slot means in just a second. That's the basic idea.

So, for example, suppose you have something like "I want to fly from". This is a flight database, so you can assume that everything is related to flying. Each of those five words, "I want to fly from", gets mapped to something called O, which means other. It's the "other" slot; we don't really care about it. And then we come to Boston. Oh, Boston is very special, right? Because it's clearly a departure city.
So we actually tag it: we assign it a label. Think of it as just a classification problem, a multi-class classification problem. We assign it B-fromloc.city_name. That is the label it gets.

Then you go to "at". You don't care about "at"; it's O, other. You come to "7:00 a.m." Okay, that is a departure time. So, depart time, and then another depart time. And here you see there is a B and then there is an I. What we are saying here is that there can be entities that are described using more than one word, like "7:00 a.m.", which is two tokens. For that, we need to be able to figure out that the second token is really part of the first one; together they describe the notion of a departure time. The B means this is the token at which we begin the idea of a departure time, and the I means we are in the middle of that description. B is for beginning, I for intermediate, in the middle. Then "at", we don't care. "11:00": B, arrive time. And so on. "Morning": arrive time period.

So this is an example of how you can take a sentence and manually label every word in it with something that's relevant to your particular problem. And it turns out these researchers classified every word into one of 123 possibilities. Aircraft code, airline code, airline name, airport code, airport name, arrival date, relative date; you get the idea. Whether they want a round trip versus a one-way.
Or dates relative to today: if somebody says "tomorrow morning", it's relative to today, so you need a notion of absolute time and a notion of relative time. These researchers basically thought of every possibility. And so every word in every one of these queries is assigned one of these 123 labels.

Any questions on the setup?

Did they have to contextualize what comes before, say, Boston? If someone says "from Boston", the "from" gives the context for "Boston". Because they did it manually, they could just read it and figure out what it means, that Boston is the departure city and not the arrival city. So do they keep two tags for Boston, something like departure city as well as arrival city?

In that particular phrase, it's clear from the context, to a human reading it, that Boston is a departure city. So it only gets that tag, in that sentence. In some other sentence where people are coming into Boston, it will have a different tag.

What about compound queries? For example, if my query was "give me flights from Boston at 7:00 a.m. and flights from Denver at 11:00 a.m."?

You mean like a compound query? This data set only takes single queries into account, because most people ask things like "give me a flight from here to there" or "what is the cheapest flight from here to there?" We'll see examples of the queries later on.

Okay. All right, so that's the deal.
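To make the tagging scheme concrete, here's a minimal sketch of one ATIS-style query with one tag per word. The query and the exact label strings are illustrative, not copied from the data set:

```python
# A minimal sketch of B/I/O slot tags for one ATIS-style query.
# One tag per word; the full ATIS label inventory has 123 classes.
tokens = ["i", "want", "to", "fly", "from", "boston", "at", "838", "am",
          "and", "arrive", "in", "denver", "at", "1110", "in", "the", "morning"]
tags = ["O", "O", "O", "O", "O",
        "B-fromloc.city_name", "O",
        "B-depart_time.time", "I-depart_time.time",
        "O", "O", "O",
        "B-toloc.city_name", "O",
        "B-arrive_time.time",
        "O", "O",
        "B-arrive_time.period_of_day"]

assert len(tokens) == len(tags)   # one label per word: output length equals input length
for token, tag in zip(tokens, tags):
    print(f"{token:>10}  {tag}")
```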
So, basically, the problem we have here is really a word-to-slot multi-class classification problem. Okay? Because if you look at that input, we want a really good model to take that input and give us this as the output, because this is what a human would have done. That is our problem.

The key thing here is that each of the 18 words in this particular example must be assigned to one of 123 slot types. Each word. It's not like we take the entire query and classify the whole thing into one of 123 possibilities; every word in the query has to be classified. That is the wrinkle.

So now, suppose we could run the query through a deep neural network and generate 18 output nodes. It goes through some unspecified deep neural network, and when it comes out the other end, the output layer has 18 nodes, because that is the dimension of the output we care about: 18 in, 18 out. And then, for each of those 18 nodes, maybe we could attach a 123-way softmax to each of the 18 outputs.

By the way, isn't it cool that we can just casually talk about sticking a 123-way softmax onto each one of the 18 nodes? Folks, wake up. You're not easily impressed; I'm impressed by that.

So here's the key thing: we want to generate an output that has the same length as the input. But the problem is that the inputs could be of different lengths as they come in. They could be short sentences or long sentences; we don't know.
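Here's a minimal PyTorch-flavored sketch of that idea: a shared linear layer plus a 123-way softmax applied at every token position. The vector size and the "unspecified network" feeding it are placeholders, not the actual model:

```python
# Minimal sketch: a 123-way classification head attached to every token's output.
import torch
import torch.nn as nn

NUM_SLOTS = 123   # ATIS slot labels
D_MODEL = 256     # size of each token's vector (a made-up placeholder)

head = nn.Linear(D_MODEL, NUM_SLOTS)

# Pretend some unspecified deep network already produced one vector per word.
token_vectors = torch.randn(18, D_MODEL)     # 18 words in...
logits = head(token_vectors)                 # ...18 rows of 123 scores out
probs = torch.softmax(logits, dim=-1)        # a 123-way softmax per token
predicted_slots = probs.argmax(dim=-1)       # one slot id per word

print(probs.shape)             # torch.Size([18, 123])
print(predicted_slots.shape)   # torch.Size([18])
```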
Yet we need to accommodate this variable-size input, and the key thing is that the output has to have the same cardinality as the input. That's one big requirement.

In addition, we want to take the surrounding context of each word into account. To go back to Ronak's question: when you see the word Boston, you can't conclude whether it's a departure city or an arrival city. You have to look at what else is going on around it. Is there a "from"? Is there a "to"? Things like that, to figure out how to tag it. So clearly the context matters.

And then we clearly have to take the order of the words into account. Going from Boston to LaGuardia is very different from going from LaGuardia to Boston. So clearly the order matters. The context matters, the order matters, and the output has to be the same length as the input.

So, context matters. Just a few fun examples. Remember from last week that the meaning of a word can change dramatically depending on the context. We also saw that the standalone, uncontextual embeddings we looked at last week, like GloVe, don't take context into account, because they give a single, fixed embedding vector to every word. If a word ends up having lots of different meanings, that vector is some mushy average of all those meanings.

Take the word "see": I will see you soon. I will see this project to its end. I see what you mean. Very different meanings of the word "see". This is my favorite: "bank". I went to the bank to apply for a loan. I'm banking on the job. I'm standing on the left bank. And so on.
And then "it". Oh, this is actually a good one. "The animal didn't cross the street because it was too tired." "The animal didn't cross the street because it was too wide." Can you imagine a deep neural network looking at the word "it" and trying to figure out what on earth "it" means? What is it referring to? Tricky, right?

And then take the word "station", and I use the station example here because we're going to use it more for the rest of the lecture. A station could be a radio station, a train station, being stationed somewhere, the International Space Station; the list goes on. So clearly context matters, and clearly order matters. You can come up with your own examples. Let's keep moving.

Okay. So, the Transformer architecture is a very elegant architecture which checks these three boxes beautifully. It takes the context into account, it takes the order into account, and whatever comes out is the same length as whatever went in. And the reason it's called the Transformer is that if 10 things come in, 10 things go out, but the 10 things that go out are a transformed version of the 10 things that came in. That's why it's called the Transformer. If 10 things came in and only one thing went out, well, sure, it's been transformed, but what is it? Some weird thing. But when 10 come in and 10 go out, the 10 is preserved, and each one gets transformed in an interesting way. That's why it's called the Transformer.

It was developed in 2017, and it has had just dramatic impact.
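To see the "10 in, 10 out" behavior concretely, here's a tiny sketch using PyTorch's built-in encoder layer (assuming a reasonably recent PyTorch); the sizes are arbitrary placeholders:

```python
# Minimal sketch: a transformer encoder layer preserves the sequence length.
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=16, nhead=2, batch_first=True)

x = torch.randn(1, 10, 16)   # 10 "things" (token vectors) come in...
y = layer(x)                 # ...and 10 transformed "things" go out
print(x.shape, y.shape)      # torch.Size([1, 10, 16]) torch.Size([1, 10, 16])
```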
By the way, on the effect of the Transformer: Google had spent a lot of research effort on machine translation and obviously on search. When the Transformer was invented, they took a model called BERT, which we will see in detail on Wednesday, introduced BERT into their search, and the results were dramatic. From what I've read, typically when you make an improvement to search, the improvement is very, very marginal, because it's already a very heavily optimized system. But when the Transformer came along, there was actually a significant jump in search quality.

For example, and you can read the blog post that came out when they introduced BERT into search for a bit more detail: if you queried something like "Brazil traveler to USA needs a visa," you would think it should give you information about how to get a visa if you're a Brazilian who wants to come to the US. But it turns out the first result was about how US citizens going to Brazil can get a visa. Clearly it was not taking the order into account. Once they introduced BERT, boom, the first result was the US Embassy in Brazil and a page on how to get a visa. The effect was dramatic.

And so this is a seminal paper, and it's actually worth reading. This picture is an iconic picture at this point in the deep learning community, and we will actually understand this picture by the end of Wednesday.
The funny thing is that when the researchers came up with it, they didn't quite realize, in some sense, what they had stumbled on, because they were really focused on machine translation. It's the rest of the research community that took it, started applying it to everything else, and found it to be really, really effective.

Okay. So we're going to take each of these requirements, figure out how to address them, and thereby build up the architecture. Any questions before I continue? Yeah.

Is there any benefit to discarding some of those unclassified nodes before the output, rather than keeping all of them? Like, you have 18 words as input; could you discard all the ones that don't actually matter and just output, say, eight?

Yeah, I think that's a totally fine way to think about it. Basically, what you're saying is: can we have a two-stage model? The first-stage model is an O versus non-O classifier, and the second-stage model only goes after the non-Os. That's a totally fine way to do it. But as you will see, even if you go with just a simple one-stage model, if you use a Transformer, you get fantastic accuracy. And we'll do the Colab in a bit.

All right. So let's take the first thing: how do you take the context of everything around a word into account?

Let's say this is the sentence we have: "The train slowly left the station." For each of these words, we can calculate a standalone embedding, say something like GloVe. I'm depicting these standalone embeddings with these little thingies here. Please appreciate them, because it took me a while to get them to work in PowerPoint.
These are W1 through W6; these are the vectors standing up. We can easily compute those. Now, what we want to do is focus on the word "station". Since "station" could mean very different things in different contexts, we want to figure out how to take station's embedding and contextualize it using all the other words going on in that sentence. Clearly, here it's a train station, so we need to use the fact that there is a train involved to alter the embedding of the word "station". That's what taking context into account actually means. So: how can we modify station's embedding so that it incorporates all the other words? That's the question.

When you look at it this way, imagine just for a moment that some of the other words in the sentence don't matter; the word "the" probably doesn't matter. But some of the other words, like "train", "slowly", "left", probably do matter. And suppose, just magically, we have been told, for every other word in the sentence, how much weight to give it: these get no weight, those get a lot of weight. Suppose we are told that. Or, to put it another way, and this is the word that's heavily used in the literature, someone tells you how much attention to pay to the other words, whether to pay a lot of attention or very little attention. And this "how much attention to pay" is given in the form of a weight that you can use.
So, if you look at it that way, from this notion of which words should get a lot of weight and which should get very little: in this example, intuitively, which words do you think should get the most weight and which should get the least?

Train. Right. One at a time, please. Okay, others? Slowly. Right, that also seems to have some bearing on it. What about words that we don't think are going to help at all? "The". Exactly. It probably doesn't do much here. In some contexts it might actually make a difference, but in this sentence, maybe not. So, intuitively, we should probably give a lot of weight to "train", maybe a little to "slowly" and "left", and hardly anything to "the".

And this intuition can be written numerically as a bunch of weights that add up to one. Maybe something like this: 30% weight to "train", maybe 8% to "left", maybe 12% to "slowly". And then, as you'll see here, station's own embedding also plays a role, because we want to take its standalone embedding and just move it slightly, change it slightly, which means it has to be the starting point. So it gets a lot of weight; we can't ignore the word itself, in other words. We give it maybe 40% weight. By the way, I just made these numbers up.

Yeah, a quick question.
So, these weights: are they standalone for the context of the entire sentence, or are they tied to "station", the word we started with?

These six numbers are only pertinent to "station". For each other word, we're going to do something similar.

And at this point, does the model understand order? I'm thinking of "left": we gave it a very low weight, but depending on where it appears relative to "station", its weight might need to be higher.

Correct. At this point we are not worrying about order; we are only worrying about context. Later, we'll take order into account.

But how does the model know that "left" here is of lesser importance, say because it's a verb rather than a noun?

It has to figure that out. We are just giving the model a whole bunch of capabilities; how it uses those capabilities is all going to emerge from training.

Okay. So let's say we have weights like this. We'll get to the all-important question of where these numbers come from in just a moment. But suppose you had the numbers: how can we use them to contextualize W6? You have W6, and you want to make it a new W6 which is contextual, which is aware of what else is going on. What can we do? What is the simplest thing you can do?

We can take a weighted average. Exactly. When you have a bunch of things and a bunch of weights, and you have to somehow modify one of those things using those weights, the simplest thing you can do is take a weighted average.
So, that's exactly what we're 685 00:22:33,000 --> 00:22:35,279 going to do. 686 00:22:34,359 --> 00:22:37,119 So, we're going to take all these 687 00:22:35,279 --> 00:22:39,678 weights 688 00:22:37,119 --> 00:22:40,639 and just like move them up. 689 00:22:39,679 --> 00:22:42,720 Okay? 690 00:22:40,640 --> 00:22:44,120 Move them up. 691 00:22:42,720 --> 00:22:46,319 Don't even get me started on how long it 692 00:22:44,119 --> 00:22:47,439 took me to get this arrow to run. 693 00:22:46,319 --> 00:22:49,439 I don't know about you, folks. Is it 694 00:22:47,440 --> 00:22:51,160 It's extremely painful to get the U-turn 695 00:22:49,440 --> 00:22:52,039 arrows to work in PowerPoint. 696 00:22:51,160 --> 00:22:54,960 Okay? 697 00:22:52,039 --> 00:22:57,159 Anyway, uh back to work. So, 698 00:22:54,960 --> 00:23:01,400 so we just move these up here, okay? So, 699 00:22:57,160 --> 00:23:03,679 now we can do 0.05 * this vector + 0.3 * 700 00:23:01,400 --> 00:23:06,679 that vector and so on and so forth. 701 00:23:03,679 --> 00:23:08,640 And the result is just another vector. 702 00:23:06,679 --> 00:23:11,400 Right? 703 00:23:08,640 --> 00:23:13,440 And that vector, folks, 704 00:23:11,400 --> 00:23:15,320 is the contextual embedding vector of 705 00:23:13,440 --> 00:23:17,759 station. 706 00:23:15,319 --> 00:23:19,759 Okay? That was the standalone embedding. 707 00:23:17,759 --> 00:23:21,119 And now we did the We multiplied this by 708 00:23:19,759 --> 00:23:24,759 that that by whoop whoop whoop, add them 709 00:23:21,119 --> 00:23:24,759 all up, and then you get a new vector. 710 00:23:24,799 --> 00:23:29,519 And contextual embeddings have this 711 00:23:27,839 --> 00:23:30,959 bluish kind of color. 712 00:23:29,519 --> 00:23:32,400 Okay? 713 00:23:30,960 --> 00:23:33,559 And I'll maintain that color scheme as 714 00:23:32,400 --> 00:23:36,320 we go along. 715 00:23:33,559 --> 00:23:38,440 So, that's it. 716 00:23:36,319 --> 00:23:41,079 That's it. That's the idea. 717 00:23:38,440 --> 00:23:41,080 Any questions? 718 00:23:41,679 --> 00:23:44,800 Yeah. 719 00:23:43,039 --> 00:23:46,960 How did you come up with the original 720 00:23:44,799 --> 00:23:49,359 weights again? You just kind of guessed? 721 00:23:46,960 --> 00:23:51,559 No, these weights I just I just 722 00:23:49,359 --> 00:23:53,279 hand typed them in manually just to make 723 00:23:51,559 --> 00:23:54,319 the point. And And now I'm going to talk 724 00:23:53,279 --> 00:23:57,039 about how we are actually going to 725 00:23:54,319 --> 00:23:57,039 calculate them. 726 00:23:57,599 --> 00:24:00,959 Okay. 727 00:23:58,640 --> 00:24:03,080 Uh all right, cool. So, now I'm going to 728 00:24:00,960 --> 00:24:05,400 uh okay, enough pictures. Let's switch 729 00:24:03,079 --> 00:24:07,319 to some math. So, 730 00:24:05,400 --> 00:24:08,759 so basically what I'm So, let's write it 731 00:24:07,319 --> 00:24:11,279 a bit more formally. 732 00:24:08,759 --> 00:24:12,920 So, we have these W1 through W6, which 733 00:24:11,279 --> 00:24:14,240 are the standalone embeddings. 734 00:24:12,920 --> 00:24:16,080 And then for station, we want to 735 00:24:14,240 --> 00:24:17,359 calculate, you know, W6 with a little 736 00:24:16,079 --> 00:24:19,599 hat on it, which is the contextual 737 00:24:17,359 --> 00:24:22,359 embedding. And the way we do it is to 738 00:24:19,599 --> 00:24:25,000 say we calculate some weights for each 739 00:24:22,359 --> 00:24:27,159 of these words. 
Any questions? How did I come up with the original weights, did I just guess? No, those weights I just typed in by hand to make the point. Now I'm going to talk about how we are actually going to calculate them.

All right, enough pictures; let's switch to some math and write it a bit more formally. We have W1 through W6, which are the standalone embeddings. For "station", we want to calculate W6 with a little hat on it, the contextual embedding. The way we do it is to calculate a weight for each of the words. This weight s16 means the weight of the first word on the sixth word, which happens to be "station"; s26 is the weight of the second word on the sixth word, and so on and so forth. And what we are saying is that the contextual embedding is just

W6-hat = s16 W1 + s26 W2 + ... + s66 W6.

That's it. I have to inflict all these subscripts on you because we need them. All right, so that's what we have.

Now, any questions on the mechanics of it before I get to where these weights come from? Yeah.

With something like Google, for example, how does it understand the context of new words? Is that picked up immediately through the training data, or, basically, what about a totally new word that didn't exist before?

A new word, or a new context for a word that already exists? The context is supplied because the query coming into something like Google is a full sentence, and we take only that sentence into account as the context. So the context is always present when we get the input. But the other question you had, what if there's a brand new word you've never seen before, for which there isn't even a standalone embedding, what do you do then? Let's punt on that until Wednesday, because I have to talk about something called byte pair encoding before I can answer it.

And really quickly, does that immediately translate to their predictive search queries? Say, for a new word?
Does that automatically get applied to the predictive search queries, like when we type "how to" and it suggests completions?

Oh, you mean the autocomplete? Autocomplete uses a slightly different mechanism. They had a very complicated non-transformer system for a long time. I'm sure they have a transformer version now, but I'm not privy to exactly how they've done it, so I don't quite know. But what you're proposing is a reasonable way to think about it.

Another question: we have six words and some number of weights, and we have calculated the contextual version of W6. Is that stored separately, or does it replace the original?

It replaces it. W6 becomes W6-hat.

And we are expecting this contextual version to be really good?

Right. That's what we want.

Do we lose the original, or retain it?

No, we lose it. And as you will see, as it flows through the transformer, it is getting more and more and more contextualized. It's a left-to-right flow.

All right, great. By the way, this thing that we did for "station", we will do for each word in the sentence, with the same exact logic. Obviously, the weights are going to change, but W1 through W6 will become W1-hat through W6-hat. The same exact logic holds; I just don't have slides for every word because that would be a waste of time.
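As a sketch, doing it for every word at once just means one weighted average per word; with a full matrix of weights (one row per word, each row summing to one) it becomes a single matrix product. The weight values below are random placeholders, not learned or meaningful:

```python
# Minimal sketch: contextualizing every word at once. S[i, j] is the weight of
# word j when contextualizing word i; each row of S sums to 1.
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 8                                # 6 words, embedding size 8 (placeholder)
W = rng.normal(size=(n, d))                # standalone embeddings W1..W6 (stand-ins)

S = rng.random(size=(n, n))
S = S / S.sum(axis=1, keepdims=True)       # normalize each row to sum to 1

W_hat = S @ W                              # row i is the contextual embedding of word i
print(W_hat.shape)                         # (6, 8): W1..W6 have become W1-hat..W6-hat
```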
All right. Now let's switch gears and answer the all-important question: where are the weights going to come from? The intuition here is really, really interesting and elegant.

Clearly, the weight of a word should be proportional to how related it is to the word "station". The word "train" is clearly very related to "station". The word "the" is probably not all that related. So relatedness matters to the weight: the more related, the higher the weight. Just intuitive.

One way to quantify how related two words are is to take their standalone embeddings and calculate the dot product. In case folks have forgotten about the dot product: say this is the vector for "train" and this is the vector for "station". The dot product of these two vectors, which I'll write as <train, station>, equals the length of the vector for "train", times the length of the vector for "station", times the cosine of the angle between them:

<train, station> = |train| x |station| x cos(angle between them)

How long each vector is, the product of the two, and then the angle between them. Now, let's assume for simplicity that these lengths are roughly the same, just one unit each, roughly. If you assume that, those two length terms become one, and all the action is in the cosine. So basically the dot product of these two vectors is really the cosine of the angle between them.
So, now, the question is: if you have two vectors that are very close to each other, what is the cosine of that very small angle? Well, the cosine of zero is one. So, if the angle is really, really small, the cosine is going to be very close to one.

If you have two vectors that are 90 degrees apart, what is the cosine? Zero. They're orthogonal, which matches the everyday sense of "unrelated."

And if you have two vectors that are literally pointing in opposite directions, what is the cosine of 180 degrees? Minus one.

So, that's it. If these two vectors are very close to each other, the cosine of the angle between them is going to be very close to one. If they are really kind of unrelated, it's going to be zero. If they're anti-related, it's going to be minus one. That's how dot products capture this notion of closeness or relatedness. So, we can use the dot product of these embeddings to capture relatedness.
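Here is a tiny sketch of those three cases with made-up 2-dimensional unit vectors:

    import numpy as np

    a = np.array([1.0, 0.0])

    same     = np.array([1.0, 0.0])    # 0 degrees apart
    ortho    = np.array([0.0, 1.0])    # 90 degrees apart
    opposite = np.array([-1.0, 0.0])   # 180 degrees apart

    # For unit vectors, the dot product IS the cosine of the angle between them.
    print(np.dot(a, same), np.dot(a, ortho), np.dot(a, opposite))   # 1.0  0.0  -1.0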
But we can't use these dot products as-is, because we need to do one more thing to make them proper weights. By proper weights I mean that we want the weights to be, first of all, non-negative, and we want them to add up to one; that's what a weighted average actually means. But these cosines could be negative. So, we need to adjust them so that every one of them is guaranteed to be non-negative and they will add up to one.

When was the last time you had to take a bunch of numbers, which could be anything, and somehow make sure they end up non-negative and add up to one? Yeah, softmax. Exactly. So, we'll do the same trick.

What we'll simply do is exponentiate them. This angle-bracket thing, like <W1, W6>, is the dot product; that's the notation I'm using. EXP of that just means e raised to that. Once you exponentiate them, they all become non-negative, and then we just divide each one by the sum of everything. So the whole thing becomes like a probability distribution; it adds up to one. Make sense? That's how we take arbitrary numbers and make them proper weights.

All right. So, to summarize, from embeddings to contextual embeddings, that's what we do: we take all the standalone embeddings, we calculate these weights using this formula, and then we just do the weighted average, and we arrive at the contextual embedding. Boom, done.

And by choosing the weights in this manner, the embedding of a word gets dragged closer to the embeddings of the other words in proportion to how related they are. So, just imagine for a second: station obviously has many contexts, but let's assume for a moment that it has only the train context and the radio-station context. In the current sentence, train is closely related to station, and therefore exerts a strong pull on it.
Now, radio is also related to station, but it doesn't appear in the sentence. So, effectively, it has a weight of zero.

Okay? And that's the beauty of it. And please do not ask me things like, "I was listening to a great song on the radio station and the train pulled out of the station." Transformers can deal with stuff like that. But you get the main idea.

So, by moving station closer to train — by paying more attention to train — we are contextualizing the embedding of station to the context of trains, platforms, departures, tickets, and so on. It's like a portal into the whole train world. It's beautiful. This simple idea will get you there.

So this, folks, is called self-attention. What we just described is called self-attention, and it's the key building block of transformers. To summarize: standalone embeddings come in, contextual embeddings go out.

Any questions? Yeah.

I'm still struggling a little bit with the intuition of the contextual embedding. Like, the weight of station in the station embedding — how should I think about that? It seems intuitive that it would be high for all contextual embeddings, but I assume that's not the case.

It'll typically be a high number, because the cosine of a vector with itself is one, right? So it's going to be pretty high, but there's no guarantee it's going to be the highest, because the lengths don't actually have to be one.
We try to keep them kind of smallish, but they don't have to be. So, the way I would think about it is: imagine that you take an average of everything else first, and then you average that with the old embedding. Effectively, it's the same as calculating the different weights and averaging the whole thing together.

Sure. But then why would the embedding of a word stay where it was? Is that the reason you need a contextual embedding — even when the other word is not related? That's what I'm saying.

Correct. Correct. Exactly. And the other thing to remember is that by keeping the size of the input — the number of things — intact as you move through the transformer stack, when you finally come out the other end, there is essentially no loss of information. At the very end, you can choose to aggregate, simplify, summarize, and so on. It preserves your optionality for as long as possible.

Do you know how long the contextual embedding is? Is that a factor between the two?

Yeah, so what we do is: the sentence comes in, and there's a whole notion of something called a context window — the maximum length of input that will be handled — and that's a parameter you can set. We'll come to that when you actually look at the Colab.

All right. So, that is self-attention.
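Here is a minimal Python sketch of exactly the recipe described so far — all pairwise dot products, softmax to get proper weights, then a weighted average. The embeddings are random made-up numbers, and this is the simplified, parameter-free version from today's discussion, not the full transformer attention:

    import numpy as np

    def simple_self_attention(E):
        """E: (num_words, dim) standalone embeddings, one row per word.
        Returns contextual embeddings of the same shape:
        dot products -> softmax weights -> weighted average.
        Note: no learned parameters in here (that refinement comes later)."""
        scores = E @ E.T                                        # all pairwise dot products
        scores = scores - scores.max(axis=1, keepdims=True)     # for numerical stability
        weights = np.exp(scores)
        weights = weights / weights.sum(axis=1, keepdims=True)  # each row sums to one
        return weights @ E                                      # each W_i becomes W_i hat

    # Six made-up 4-d embeddings standing in for "the train slowly left the station"
    E = np.random.randn(6, 4)
    E_hat = simple_self_attention(E)
    print(E.shape, E_hat.shape)   # same number of words in, same shape out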
And now, because that felt too easy, we're going to do a little tweak called multi-head attention.

So, this is the self-attention we just saw. What we can do is say: why can't we have more than one of these? This is called an attention head — a self-attention head — and we'll have multiple self-attention heads.

Now, I'll come back to the thing at the top in a second, but the question is: why should we have multiple self-attention heads? Because a particular attention head is going to pick up certain patterns, and multiple heads will help us attend to the multiple patterns that may be present in a single sentence. So far, when I've been explaining this, I've basically been looking only at what these words mean. But in any complicated sentence, you have to worry about grammar, you have to worry about tense, you have to worry about tone, you have to worry about facts versus opinions. There could be any number of complicated patterns sitting in a simple sentence. Which means there is not just one way to pay attention — there are many different needs to pay attention.

So, let's have many of these attention heads, and each one can learn something different. It's exactly like having lots of filters in a convolutional network: one filter might learn a line, another might learn a curve, and so on, and we don't decide a priori, "oh, you're going to learn a line." Similarly here, we're not telling any of these heads what to learn. They just have to learn, based on the training process.
So, here's an example from the original transformer paper, where the sentence is "the law will never be perfect, but its application should be just." Which is what we are missing, in my opinion. A complicated sentence, right?

In the first attention head, look at the pattern of what it attends to. For example, the contextual embedding of the word "perfect" draws heavily on the word "law" in this example. If you look at another attention head, the contextual embedding for "perfect" is drawing heavily from just "perfect" and nothing else. And if you look at other words, the patterns of what they pay attention to are subtly different.

So, these are two different attention heads, and they're learning different kinds of attention. In reality, trying to make sense of why they pay attention the way they do is usually quite difficult; you can't really interpret it. But when you have lots of attention heads, the performance on the task you care about gets much better. And then you say, okay, I can use that. Yeah?

I think that's the idea behind this. Is that the idea behind this?

Right. Exactly. Same logic. Same logic. Yeah.

Actually, in the convolutional case, the ones and zeros I had were just example numbers to show that a particular filter could detect a vertical line or a horizontal line. You will recall that when we actually train a convolutional network, we don't specify the numbers.
We start with randomly initialized weights and then let backpropagation figure it out. Similarly here, we don't decide any of these things. We just let backprop figure it out. Now, the question of what weights are actually going to be learned — we'll come to that in a bit. Yeah?

I was wondering how come we have different attention heads, even though it seems like they're only a function of a dot product, and we have the same dot product for the same embeddings.

Great question. Great question. And I literally have a note in my slide saying, "If a student asks this good question, tell them to wait till Wednesday." So, great question, and we'll come back to it on Wednesday and spend a fair amount of time on it. The point being made here is this: when we looked at self-attention, the embeddings came in, we did all these dot products, and the contextual things popped out the other end. Note that inside the self-attention box, there are no parameters. There are no parameters. So the question being raised is: what are we learning, really? If there is nothing inside to be learned — no parameters, no coefficients — what are we learning? And by extension, if we have two of these and neither of them is learning anything, what's the point?

Sadly, you have to wait till Wednesday. But we have a great answer to the question, so it'll be worth it. And if you can't stand the suspense, read the book.

All right. So, that is why we need multiple heads. And now, to come back to this: what we do is, the input goes through this head and you get these W's, right?
And it goes through the other head, and we get another set of W's. Then what we do at the very end is we concatenate them. We concatenate them and we do a projection. And this is what I mean by that.

So, we have one self-attention head, self-attention one, and another, self-attention two. Let's say W1 hat comes out of the first head; for the second head, I'm just going to call its output Z1 so that there's no name clash. W2 through W6 and Z2 through Z6 all come out the same way, but let's focus on W1 and Z1. W1 hat and Z1 are both contextual embeddings for the same word, word one.

What I mean when I say concatenate is that we literally take this embedding and that embedding and make one long vector. But now this vector has become twice as long, right? And remember, we always want to preserve the number of inputs and the lengths of these vectors as we go along. So, what we do at this point is run it through a single dense layer, which takes this long thing and brings it back to the same small shape as before. This vector comes in, and it gets compressed back to the original shape that came out of each head.

So, you could have, say, 20 of these attention heads; the concatenation would be 20 times as long, and then, boom, one dense layer brings it back to the original shape. So, that is the projection step.
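A minimal sketch of that concatenate-and-project step for a single word, assuming two heads; the per-head outputs and the projection matrix are random made-up numbers here, whereas in a real transformer the projection is a learned dense layer:

    import numpy as np

    dim = 4                      # length of each contextual embedding (e.g. 100 in the lecture)
    num_heads = 2

    # Made-up outputs of two attention heads for the same word ("word one")
    w1_hat = np.random.randn(dim)    # from self-attention head 1
    z1     = np.random.randn(dim)    # from self-attention head 2

    concat = np.concatenate([w1_hat, z1])        # now twice as long: shape (8,)

    # The "projection": a single dense (linear) layer mapping 2*dim back to dim.
    # Its weights would be learned by backprop; random numbers here just illustrate shapes.
    W_proj = np.random.randn(dim, num_heads * dim)
    b_proj = np.zeros(dim)

    projected = W_proj @ concat + b_proj         # back to the original shape: (4,)
    print(concat.shape, projected.shape)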
And that's what I mean here when I say concatenate and project.

So, at this point, what we have is: things come in, we contextualize them using these different attention heads, and when they come out of the attention heads, we take them all, concatenate them, and then compress them back to the same original starting shape. If these vectors are 100 dimensions long, whatever comes out is still 100. And preserving this size as we go along is very important, for reasons that will become apparent a bit later.

Okay. So, that is the multi-head attention thing.

Now, a final tweak for today is that we will inject some non-linearity with some dense ReLU layers at the very end. We went through a bunch of attention heads and came up with a bunch of contextual embeddings. But since there are no parameters inside those attention boxes — there are only some parameters in the projection — nothing so far has been non-linear. So, here we actually send it through one or more ReLU layers; typically they use just one. And what I mean by that is: we take what we had, and we run it through a dense layer with ReLUs. The rule of thumb, as you will see, is that if this vector is, say, 100 dimensions long, they will typically choose a ReLU layer that is about 400 units wide, and then it just gets projected back out to 100 again.
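A minimal sketch of that expand-then-project pattern for one embedding; the weights below are random placeholders, whereas in the real model they are learned by backprop:

    import numpy as np

    dim, hidden = 100, 400                 # the 100 -> 400 -> 100 rule of thumb

    # Made-up weights; in a real transformer these are learned.
    W1, b1 = np.random.randn(hidden, dim) * 0.01, np.zeros(hidden)
    W2, b2 = np.random.randn(dim, hidden) * 0.01, np.zeros(dim)

    def feed_forward(x):
        """x: (dim,) one contextual embedding. ReLU layer 4x as wide, then back to dim."""
        h = np.maximum(0.0, W1 @ x + b1)   # dense + ReLU: the injected non-linearity
        return W2 @ h + b2                 # dense projection back to the original size

    x = np.random.randn(dim)
    print(feed_forward(x).shape)           # (100,): same size out as in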
So, this is just a simple thing: the input comes in, goes through a single hidden layer with four times as many units, and then another dense layer brings it back to 100 again. And since there are ReLUs in there, we have injected some non-linearity into the processing.

Now, a lot of this stuff, when it came out, felt very ad hoc. It didn't come from deep theoretical motivations, but people had strong intuitions as to why these things were helpful. And as it turns out, since the transformer came out, people have tried to optimize every aspect of this thing, and it's actually pretty difficult to beat the starting architecture. Improvements have been made, but it's a very robust architecture.

So, that's what's going on here. And when we come out of this thing, here is the story so far. We start with standalone embeddings — these could be GloVe embeddings or random weights, it doesn't matter. They go through a bunch of self-attention heads. We concatenate the outputs when they come out the other end, and then we project them back to the same size as before. Then we run that through a ReLU layer followed by a linear layer, and we get these things again. So, in this whole process, if six things came in, six things will come out. And if those six things that came in were standalone embedding vectors of 100 dimensions, what comes out is also 100 dimensions. So, in that sense, you could think of this whole thing as a black box: whatever you send in, the same number of things come out, and of the same length.
The numbers will be different, of course, because they will have been heavily contextualized. The numbers are much smarter, in other words.

So far, we have satisfied two of the three requirements. We have taken the context of each word into account, by using these dot products in the self-attention layer, and we can generate an output that is the same length as the input. But we have ignored word order completely. Whether I had said "the train slowly left the station" or "the station slowly left the train," this thing won't know the difference. Because dot products operate on sets, not on sequences. You should convince yourself of this: regardless of the order, the dot-product calculation doesn't change anything, because we are computing every pair.

So, the question is: how do we take the order of the words into account? As I was saying, we can scramble the order of the words in a sentence and we'll get the exact same contextual embeddings at the end. By the way, if you're working on a problem in which order doesn't matter, then you can stop right now and use the transformer. And there are many problems in that category. If you take traditional structured, tabular data — blood pressure, cholesterol level, and so on; does it predict heart disease? — there is no order in that. You can use the transformer as-is without doing anything more. So, transformers work for both sets and for sequences where order matters.
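To convince yourself of that order-invariance point, here is a quick sketch that runs the same parameter-free attention recipe on a shuffled copy of the same made-up embeddings:

    import numpy as np

    def simple_self_attention(E):
        scores = E @ E.T
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)
        return weights @ E

    E = np.random.randn(6, 4)                 # six made-up word embeddings
    perm = np.random.permutation(6)           # scramble the word order

    out_original  = simple_self_attention(E)
    out_scrambled = simple_self_attention(E[perm])

    # Each word ends up with exactly the same contextual embedding either way;
    # the output is simply listed in the scrambled order.
    print(np.allclose(out_original[perm], out_scrambled))   # True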
So, the fix for this is something called the positional encoding.

What we do is very simple. Many schemes have been invented to give the transformer some information about the order of the things coming in. I'm going to go with the simplest possible one, which actually works pretty well in practice. For each possible position in the input, from the first position all the way through the last position, we imagine that the position itself is a categorical variable. If a sentence can be at most 30 words long, say, then the position of each word is a number between 0 and 29, and we can just treat that as a categorical variable. And because it's a categorical variable, we can imagine an embedding for each potential value. It'll become clear in just a moment, because I have a numerical example. So, what we do is take the standalone embedding, take this position embedding, which represents the position of the word in the sentence, and just add them up. Yeah?

So, if the initial sentence itself has a mistake — say I just write it as "the train slowly the station" — that means my output is actually going to be wrong?

Yes. Now, transformers, since they're trained on lots of data, will be quite robust to things like that. But strictly, arithmetically speaking, yes.

Okay. So, let's look at an example. Let's assume your standalone embeddings are as follows, right?
This is your vocabulary, okay? Unknown, cat, mat, I, sit, love, the, you, on. That's it. That's our vocabulary. And for this vocabulary, we have these standalone embeddings. Just for argument, let's assume these embeddings are only two long — dimension two. If you recall, the GloVe embeddings we used last week were, what, 100 long? And the ones we're using in the homework are even longer than that. But here we are assuming they're only two long. So, the embedding for cat is 0.5, 7.1.

All right. Now, let's assume that we can have at most 10 words in any sentence that's coming in. Obviously, a particular word could be in position 0 all the way through position 9. And we will learn embeddings for each of these positions, and these embeddings are also two long — dimension two.

Now, where will these embeddings come from? What's the answer to that question? What is the answer to the general question of where these weights come from? We will learn them with backprop. We will start with random numbers initially, and then make them better and better over the course of training.

So, we have these two tables of embeddings: the standalone embedding for the word and the position embedding. And then we literally add them up. For example, let's say the sentence that came in is "cat sat mat." That's the sentence; it's got three words: cat, sat, mat. So, we say, well, the embedding for cat is this thing here, 0.5, 7.1. So, I write it down here: 0.5, 7.1.
Cat happens to be in the zeroth position. So, I grab the embedding for position zero, which is 1.3, 3.9, I stick it there, and then I literally add them up: 0.5 + 1.3 is 1.8, and 7.1 + 3.9 is 11.0. That's it. So, now the positionally encoded embedding for the word cat is 1.8, 11.0 — not 0.5, 7.1.

If cat happens to show up in another part of the sentence — say, instead of "cat sat mat" we had "mat sat cat" — now cat is in the third position, which is index 2 (positions 0, 1, 2). Its word embedding doesn't change; it's still just the embedding for cat. But now, instead of picking position zero's embedding, we pick position two's, which is 0.6, 8.1, and add that instead.

So, this is the idea of the positional encoding. This is how we inject position knowledge into the transformer. Yes?

The positional embedding would be different for each sentence, right? How do you...

No, this is just one table which tells you what the position embedding is. It says: for a word that appears in the seventh position of any input sentence that you're feeding in, this is the embedding that you need to use for that position.

If the word appears twice in the same sentence, how do you handle that?

Great question. Let's say, just for argument, the sentence was "cat cat cat." For each one of those cats, this word embedding will be the same, 0.5, 7.1, because that happens to be just the embedding for cat regardless of position. But then, for the first cat we will add 1.3, 3.9; for the second cat, 6.3, 3.7; and for the third cat, 0.6, 8.1.
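Here is a minimal sketch of that lookup-and-add, using the cat numbers from the slide; the embeddings for sat and mat are made-up placeholders, and in a real model both tables would start random and be learned by backprop:

    import numpy as np

    # Standalone word embeddings (dimension two). "cat" is from the slide;
    # "sat" and "mat" are placeholder values for illustration.
    word_emb = {
        "cat": np.array([0.5, 7.1]),
        "sat": np.array([2.0, 0.3]),   # placeholder
        "mat": np.array([4.4, 1.2]),   # placeholder
    }

    # Position embeddings for positions 0, 1, 2 (also dimension two), from the slide.
    pos_emb = [
        np.array([1.3, 3.9]),   # position 0
        np.array([6.3, 3.7]),   # position 1
        np.array([0.6, 8.1]),   # position 2
    ]

    sentence = ["cat", "sat", "mat"]
    encoded = [word_emb[w] + pos_emb[i] for i, w in enumerate(sentence)]

    print(encoded[0])   # [ 1.8 11. ]  -- cat's embedding plus position 0's embedding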
So, only the positional part of the sum changes — the positional embedding. The resulting sum is going to be different for each of those three words, even though they're exactly the same word.

Is that position embedding table specific to the standalone embedding table? Like, if you were to add or remove some words from the standalone table...

It's independent. Independent. It only depends on your assumption about how long the sentences can be. That's it. It doesn't really care what words are coming in; that's a whole different thing. These are two independent tables that are just learned as part of this process.

So, yeah, I have the same thing for sat and mat. Sat and mat, that's what we have. Just make sure you understand these two slides, to really make sure the mechanics are clear. Yeah?

How do you control for filler words? For example, if you're taking NLP output from transcription and you're trying to run a transformer, and you have a lot of "um"s and "like"s that are disproportionately frequent and have these random assignments or really deep embeddings — are there ways to look through the noise?

Typically, what they do is — we'll talk about this thing called byte pair encoding, in which individual characters, fragments of words, and whole words are all taken into account as tokens. So, when you have stuff like "uh" and so on, it gets mapped to these small tokens, and then we treat them as just any other token.

Yeah — is the aggregation just a simple sum? Wouldn't the actual semantic meaning of the standalone word be more important than its relative position in the sentence?

It could be.
We just don't know a priori 1691 00:56:42,199 --> 00:56:45,399 whether it's going to be important or 1692 00:56:43,400 --> 00:56:46,960 not for any particular sentence. 1693 00:56:45,400 --> 00:56:48,880 We when we train the transformer with a 1694 00:56:46,960 --> 00:56:50,358 lot of textual data, 1695 00:56:48,880 --> 00:56:51,880 right? It'll just figure out the right 1696 00:56:50,358 --> 00:56:53,719 values for these things so that on 1697 00:56:51,880 --> 00:56:55,280 average, the accuracy is as high as 1698 00:56:53,719 --> 00:56:56,879 possible. 1699 00:56:55,280 --> 00:56:58,120 So, in many of these things, there's 1700 00:56:56,880 --> 00:57:00,480 always a tension between our human 1701 00:56:58,119 --> 00:57:01,559 intuition as to how it should work and 1702 00:57:00,480 --> 00:57:02,960 whether you should just throw it into 1703 00:57:01,559 --> 00:57:04,079 the meat grinder of backprop and see 1704 00:57:02,960 --> 00:57:05,280 what happens. 1705 00:57:04,079 --> 00:57:06,400 And so, here it does it turns out you 1706 00:57:05,280 --> 00:57:08,840 can just throw it into backprop, it'll 1707 00:57:06,400 --> 00:57:10,920 actually do a pretty good job. 1708 00:57:08,840 --> 00:57:13,000 Uh yeah. 1709 00:57:10,920 --> 00:57:15,960 For the positional encoding, we would 1710 00:57:13,000 --> 00:57:18,199 just be as using the sum vector, we 1711 00:57:15,960 --> 00:57:20,720 would be using like this 2 by 3 matrix 1712 00:57:18,199 --> 00:57:21,719 that you have for our right? 1713 00:57:20,719 --> 00:57:23,559 Uh oh yeah, this is just for 1714 00:57:21,719 --> 00:57:24,679 demonstration. Basically, this is the 1715 00:57:23,559 --> 00:57:26,279 thing that will actually go into the 1716 00:57:24,679 --> 00:57:28,358 transformer. Correct. 1717 00:57:26,280 --> 00:57:28,359 Yeah. 1718 00:57:28,559 --> 00:57:31,679 That was just me being overly verbose in 1719 00:57:30,079 --> 00:57:33,199 the slides. 1720 00:57:31,679 --> 00:57:35,239 Uh yeah. 1721 00:57:33,199 --> 00:57:36,919 I can see sentences in the input. At 1722 00:57:35,239 --> 00:57:38,279 this point, are we still parsing out 1723 00:57:36,920 --> 00:57:40,039 punctuation or if we have like a 1724 00:57:38,280 --> 00:57:41,760 multi-sentence input, is there a 1725 00:57:40,039 --> 00:57:44,119 positional embedding vector for each of 1726 00:57:41,760 --> 00:57:47,120 the sentences? Yeah, so here um 1727 00:57:44,119 --> 00:57:48,799 basically, the starting point is tokens. 1728 00:57:47,119 --> 00:57:50,239 Right? And in our example, because we're 1729 00:57:48,800 --> 00:57:51,760 working with the idea of simple 1730 00:57:50,239 --> 00:57:53,039 standardization and stripping and things 1731 00:57:51,760 --> 00:57:54,000 like that, I'm just showing actual 1732 00:57:53,039 --> 00:57:56,000 words. 1733 00:57:54,000 --> 00:57:58,199 If you go to something like GPT-4, since 1734 00:57:56,000 --> 00:58:01,159 it uses a different tokenization scheme, 1735 00:57:58,199 --> 00:58:02,319 uh each token might be part of a word. 1736 00:58:01,159 --> 00:58:03,559 It might be it might be an individual 1737 00:58:02,320 --> 00:58:06,240 character, it might be a punctuation 1738 00:58:03,559 --> 00:58:08,440 mark, it could be in fact um the GPT 1739 00:58:06,239 --> 00:58:10,439 family doesn't strip out punctuation. 1740 00:58:08,440 --> 00:58:12,480 Which is why when you ask a question, it 1741 00:58:10,440 --> 00:58:13,920 comes back with intact punctuation in 1742 00:58:12,480 --> 00:58:15,840 its response. 
1743 00:58:13,920 --> 00:58:17,400 Uh and so, we'll get we'll revisit this 1744 00:58:15,840 --> 00:58:19,760 when you look at BPE, byte pair encoding 1745 00:58:17,400 --> 00:58:19,760 later on. 1746 00:58:19,840 --> 00:58:22,800 But the key thing to remember is that 1747 00:58:21,119 --> 00:58:24,679 all the stuff we're talking about starts 1748 00:58:22,800 --> 00:58:26,560 from the notion of a token. 1749 00:58:24,679 --> 00:58:28,559 As to how you define a token given a 1750 00:58:26,559 --> 00:58:30,719 bunch of text, that's the tokenizer's 1751 00:58:28,559 --> 00:58:33,519 job. And we just assumed a simple 1752 00:58:30,719 --> 00:58:36,759 tokenizer for the time being. 1753 00:58:33,519 --> 00:58:38,960 Okay? So, at this point, folks, we have 1754 00:58:36,760 --> 00:58:40,680 satisfied all the requirements. 1755 00:58:38,960 --> 00:58:42,480 Uh we have taken the surrounding context 1756 00:58:40,679 --> 00:58:43,839 of each word, we have taken the order, 1757 00:58:42,480 --> 00:58:45,480 and so on and so forth, because what's 1758 00:58:43,840 --> 00:58:47,519 coming in here is the positional 1759 00:58:45,480 --> 00:58:49,639 embeddings. Okay? And it runs through 1760 00:58:47,519 --> 00:58:51,440 the whole transformer stack. 1761 00:58:49,639 --> 00:58:54,799 So, 1762 00:58:51,440 --> 00:58:55,920 this is called a transformer encoder. 1763 00:58:54,800 --> 00:58:57,840 Okay? 1764 00:58:55,920 --> 00:58:59,039 This is the transformer encoder. 1765 00:58:57,840 --> 00:59:01,039 And you can see here, this is the 1766 00:58:59,039 --> 00:59:03,239 original picture from the paper. 1767 00:59:01,039 --> 00:59:04,719 It's an iconic picture at this point. 1768 00:59:03,239 --> 00:59:06,239 So, it says here this is these are the 1769 00:59:04,719 --> 00:59:07,599 input This is like the cat sat on the 1770 00:59:06,239 --> 00:59:09,519 mat. 1771 00:59:07,599 --> 00:59:11,400 It comes in here, gets transferred to 1772 00:59:09,519 --> 00:59:12,679 transformed into embeddings, standalone 1773 00:59:11,400 --> 00:59:14,639 embeddings. 1774 00:59:12,679 --> 00:59:17,319 And then, based on the position of each 1775 00:59:14,639 --> 00:59:20,679 word, we add that's why you see a plus 1776 00:59:17,320 --> 00:59:22,120 sign here, we add the positional 1777 00:59:20,679 --> 00:59:24,358 embedding to that. 1778 00:59:22,119 --> 00:59:26,799 And the resulting thing goes into this 1779 00:59:24,358 --> 00:59:30,599 transformer block. And here, 1780 00:59:26,800 --> 00:59:30,600 we go through multi-head attention. 1781 00:59:30,800 --> 00:59:34,480 And things come out the other end. 1782 00:59:32,800 --> 00:59:36,160 Then there is this thing called add and 1783 00:59:34,480 --> 00:59:37,440 norm, which we'll visit we'll revisit on 1784 00:59:36,159 --> 00:59:38,759 Wednesday. 1785 00:59:37,440 --> 00:59:40,800 And then it goes through a feed forward 1786 00:59:38,760 --> 00:59:42,480 network, another add and norm, which 1787 00:59:40,800 --> 00:59:43,640 we'll revisit on Wednesday. 1788 00:59:42,480 --> 00:59:46,360 And then it comes out the other end. 1789 00:59:43,639 --> 00:59:47,519 That's it. That's a transformer encoder. 1790 00:59:46,360 --> 00:59:48,360 Okay? 
1791 00:59:47,519 --> 00:59:51,759 Um 1792 00:59:48,360 --> 00:59:51,760 and so if you look at this 1793 00:59:52,320 --> 00:59:55,160 just to point out a couple of things, 1794 00:59:53,719 --> 00:59:56,359 the input embeddings can be random 1795 00:59:55,159 --> 00:59:57,519 weights or it could be pre-trained 1796 00:59:56,360 --> 00:59:58,440 embeddings. 1797 00:59:57,519 --> 01:00:00,119 Um 1798 00:59:58,440 --> 01:00:01,000 we add in a position-dependent embedding 1799 01:00:00,119 --> 01:00:02,799 to represent the position of each word 1800 01:00:01,000 --> 01:00:04,000 in the sentence. That's the plus. 1801 01:00:02,800 --> 01:00:05,800 Then we pass it through multi-headed 1802 01:00:04,000 --> 01:00:07,199 attention to get a contextual uh 1803 01:00:05,800 --> 01:00:09,000 representation. 1804 01:00:07,199 --> 01:00:10,639 Then we finally we pass all this through 1805 01:00:09,000 --> 01:00:12,480 a simple 1806 01:00:10,639 --> 01:00:13,879 typically it's a two-layer network. A 1807 01:00:12,480 --> 01:00:16,039 one hidden layer with relus and then a 1808 01:00:13,880 --> 01:00:20,079 linear layer after that and boom. Uh and 1809 01:00:16,039 --> 01:00:21,840 then we do it. This is the encoder. And 1810 01:00:20,079 --> 01:00:23,799 here is the perhaps the most important 1811 01:00:21,840 --> 01:00:25,600 point to keep in mind. 1812 01:00:23,800 --> 01:00:26,840 Because we have taken inordinate care to 1813 01:00:25,599 --> 01:00:28,159 make sure that the things that are 1814 01:00:26,840 --> 01:00:30,200 coming in and the things that are going 1815 01:00:28,159 --> 01:00:32,159 out have the same size 1816 01:00:30,199 --> 01:00:34,199 both in terms of the number of tokens as 1817 01:00:32,159 --> 01:00:37,319 well as the length of each vector. 1818 01:00:34,199 --> 01:00:39,079 We can then stack them up like pancakes. 1819 01:00:37,320 --> 01:00:41,480 We can have lots of transformers stacked 1820 01:00:39,079 --> 01:00:43,679 one on top of each other. 1821 01:00:41,480 --> 01:00:45,679 Right? Because it's the perfect API. 1822 01:00:43,679 --> 01:00:47,879 It's the simplest possible API. The same 1823 01:00:45,679 --> 01:00:49,639 thing comes in, same thing goes out. 1824 01:00:47,880 --> 01:00:51,200 In terms of size. So you can have a 1825 01:00:49,639 --> 01:00:53,239 transformer encoder, another one top, 1826 01:00:51,199 --> 01:00:55,799 boom, boom, boom, boom, boom, one after 1827 01:00:53,239 --> 01:00:58,239 the other. GPT-3 has 96 transformer 1828 01:00:55,800 --> 01:00:58,240 stacks. 1829 01:00:58,719 --> 01:01:02,919 And like in all things deep learning 1830 01:01:00,440 --> 01:01:04,360 related, the more layers you have, the 1831 01:01:02,920 --> 01:01:05,400 more complicated things we can do with 1832 01:01:04,360 --> 01:01:06,760 it. 1833 01:01:05,400 --> 01:01:10,559 As long as you have enough data to keep 1834 01:01:06,760 --> 01:01:10,560 the model happy so it doesn't overfit. 1835 01:01:11,760 --> 01:01:15,920 Okay? 1836 01:01:13,400 --> 01:01:17,920 All right. So, what we haven't covered, 1837 01:01:15,920 --> 01:01:20,079 which we'll cover on Wednesday 1838 01:01:17,920 --> 01:01:22,400 uh is is the question that 1839 01:01:20,079 --> 01:01:23,440 he had posed about how 1840 01:01:22,400 --> 01:01:24,680 uh you know, since there are no 1841 01:01:23,440 --> 01:01:26,760 parameters inside the self-attention 1842 01:01:24,679 --> 01:01:27,879 block, what are we actually learning? 
1843 01:01:26,760 --> 01:01:29,120 And then there is these things called 1844 01:01:27,880 --> 01:01:31,000 residual connections and layer 1845 01:01:29,119 --> 01:01:32,400 normalization. We'll talk about all 1846 01:01:31,000 --> 01:01:35,159 those things on Wednesday. Those are all 1847 01:01:32,400 --> 01:01:38,559 like, you know, refinements to the idea. 1848 01:01:35,159 --> 01:01:39,719 So, all right, 9:39. Um let's apply the 1849 01:01:38,559 --> 01:01:40,920 transformer encoder to an actual 1850 01:01:39,719 --> 01:01:43,319 problem. 1851 01:01:40,920 --> 01:01:45,119 Any questions? 1852 01:01:43,320 --> 01:01:46,760 Uh yeah. 1853 01:01:45,119 --> 01:01:48,839 My question is regarding like you said 1854 01:01:46,760 --> 01:01:50,400 you could have multiple transformers. 1855 01:01:48,840 --> 01:01:53,200 What is the difference with having 1856 01:01:50,400 --> 01:01:54,840 multiple self-attention heads uh and 1857 01:01:53,199 --> 01:01:57,519 rather than that having multiple When I 1858 01:01:54,840 --> 01:01:59,400 say a transformer block within the block 1859 01:01:57,519 --> 01:02:01,599 there could be multiple heads. So, if 1860 01:01:59,400 --> 01:02:04,680 you're if the accuracy is the same, why 1861 01:02:01,599 --> 01:02:06,039 would you use this rather 1862 01:02:04,679 --> 01:02:08,199 Yeah, you can have a lot of attention 1863 01:02:06,039 --> 01:02:10,559 heads. And that's totally fine. And 1864 01:02:08,199 --> 01:02:12,079 typically I forget how many GPT-3 and 4 1865 01:02:10,559 --> 01:02:13,799 have. They have a whole bunch of them. 1866 01:02:12,079 --> 01:02:15,360 But you can So you can go wide and you 1867 01:02:13,800 --> 01:02:18,320 can go deep. 1868 01:02:15,360 --> 01:02:19,599 Both are done in practice. 1869 01:02:18,320 --> 01:02:20,559 But the thing is if 1870 01:02:19,599 --> 01:02:22,119 The one thing you have to remember is 1871 01:02:20,559 --> 01:02:24,480 that if you if you go wide, you have a 1872 01:02:22,119 --> 01:02:26,239 lot of attention heads then given the 1873 01:02:24,480 --> 01:02:28,440 particular input that's coming into that 1874 01:02:26,239 --> 01:02:29,439 block, it'll learn different patterns 1875 01:02:28,440 --> 01:02:31,039 from it. 1876 01:02:29,440 --> 01:02:32,440 While if you stack them all up, it's 1877 01:02:31,039 --> 01:02:33,800 going to learn different ways to 1878 01:02:32,440 --> 01:02:35,200 contextualize the things that are coming 1879 01:02:33,800 --> 01:02:36,760 in. It operates at higher levels of 1880 01:02:35,199 --> 01:02:38,279 abstraction. So the analogy would be 1881 01:02:36,760 --> 01:02:40,520 that like the seventh layer of a 1882 01:02:38,280 --> 01:02:42,640 convolutional net may take the sixth 1883 01:02:40,519 --> 01:02:44,960 layer's output and say, "Oh, I'm seeing 1884 01:02:42,639 --> 01:02:46,839 a lot of edges here. I'm going to take 1885 01:02:44,960 --> 01:02:48,519 an edge like this, two circles like that 1886 01:02:46,840 --> 01:02:49,480 and call it a face." 1887 01:02:48,519 --> 01:02:52,000 So it'll operate at a higher level of 1888 01:02:49,480 --> 01:02:52,000 abstraction. 1889 01:02:52,400 --> 01:02:55,440 Okay. 1890 01:02:53,360 --> 01:02:55,440 Um 1891 01:02:58,320 --> 01:03:02,840 All right, let's go to the collab. 
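Before the Colab, a quick sketch of the wide-versus-deep point from that exchange, using Keras MultiHeadAttention directly; the residual connections and feed-forward parts of a full encoder block are omitted here for brevity.
```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 5, 512))   # five token vectors going into a block

# Going wide: one block with many heads, each free to pick up a different
# pattern from the same input.
wide = layers.MultiHeadAttention(num_heads=16, key_dim=32)(x, x)

# Going deep: several blocks stacked, each one re-contextualizing the
# previous block's output at a higher level of abstraction.
deep = x
for _ in range(4):
    deep = layers.MultiHeadAttention(num_heads=4, key_dim=32)(deep, deep)

print(wide.shape, deep.shape)        # both remain (1, 5, 512)
```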
1892 01:03:01,800 --> 01:03:04,080 So what we're going to do is we're going 1893 01:03:02,840 --> 01:03:05,360 to take the transformer that we just 1894 01:03:04,079 --> 01:03:07,599 learned about and we're going to apply 1895 01:03:05,360 --> 01:03:09,320 it to solve the the travel uh slot 1896 01:03:07,599 --> 01:03:12,079 problem. Okay? 1897 01:03:09,320 --> 01:03:14,320 Uh all right. So 1898 01:03:12,079 --> 01:03:16,199 Okay, so we'll start with the usual 1899 01:03:14,320 --> 01:03:18,600 preliminaries. 1900 01:03:16,199 --> 01:03:20,319 And then we have taken the ATIS data set 1901 01:03:18,599 --> 01:03:23,960 I talked about and we have stuck them in 1902 01:03:20,320 --> 01:03:26,480 raw box for easy consumption. 1903 01:03:23,960 --> 01:03:26,480 It's here. 1904 01:03:29,880 --> 01:03:33,400 Okay. 1905 01:03:30,800 --> 01:03:35,160 So if you look at to the top view 1906 01:03:33,400 --> 01:03:37,960 you can see here, for example, I want to 1907 01:03:35,159 --> 01:03:39,599 fly from Boston 8:30 a.m. And then this 1908 01:03:37,960 --> 01:03:42,880 is the output. The slot filling is the 1909 01:03:39,599 --> 01:03:43,880 output. Um and so as it turns out here 1910 01:03:42,880 --> 01:03:46,000 there is 1911 01:03:43,880 --> 01:03:47,358 this these people also gave it a another 1912 01:03:46,000 --> 01:03:49,440 They took the whole query and gave it an 1913 01:03:47,358 --> 01:03:51,199 intent as to is it it's a flight query, 1914 01:03:49,440 --> 01:03:52,480 it's a something else query and so on, 1915 01:03:51,199 --> 01:03:54,559 which we're not going to use. Are you 1916 01:03:52,480 --> 01:03:56,599 kidding me? 1917 01:03:54,559 --> 01:03:57,519 I want to fly from Boston at 8:30 a.m. 1918 01:03:56,599 --> 01:03:59,239 and arrive in Denver at 11:00 in the 1919 01:03:57,519 --> 01:04:01,239 morning. What kind of ground 1920 01:03:59,239 --> 01:04:03,759 transportations are available in Denver? 1921 01:04:01,239 --> 01:04:06,079 What's the airport at Orlando? 1922 01:04:03,760 --> 01:04:08,480 Um how much does the limo service cost 1923 01:04:06,079 --> 01:04:09,799 within Pittsburgh? Okay. 1924 01:04:08,480 --> 01:04:11,480 And so on and so forth. So you get So 1925 01:04:09,800 --> 01:04:13,760 you get the idea. It's a very wide range 1926 01:04:11,480 --> 01:04:16,440 of queries that are in this data set. 1927 01:04:13,760 --> 01:04:18,960 Um okay. So let's just ignore that for a 1928 01:04:16,440 --> 01:04:22,240 sec. Um okay. So what we're now going to 1929 01:04:18,960 --> 01:04:24,960 do is we are going to take only 1930 01:04:22,239 --> 01:04:27,799 um this column, right? The query column. 1931 01:04:24,960 --> 01:04:29,559 That's going to be our input text. Okay? 1932 01:04:27,800 --> 01:04:31,359 And then the slot filling column is 1933 01:04:29,559 --> 01:04:32,599 going to be our dependent variable, the 1934 01:04:31,358 --> 01:04:34,880 output. 1935 01:04:32,599 --> 01:04:37,440 So we'll just gather them all up 1936 01:04:34,880 --> 01:04:38,840 uh here. 1937 01:04:37,440 --> 01:04:40,599 Let it run. We'll do it for the training 1938 01:04:38,840 --> 01:04:42,559 data and the test data. 1939 01:04:40,599 --> 01:04:45,759 And so what we have done is that we have 1940 01:04:42,559 --> 01:04:47,840 taken um the transformer related code in 1941 01:04:45,760 --> 01:04:49,480 Keras and we have packaged it into a 1942 01:04:47,840 --> 01:04:50,640 little hardel library for easy 1943 01:04:49,480 --> 01:04:53,240 consumption. 
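For reference, one ATIS training pair looks roughly like the sketch below: the query is the input text, the slot string is the target, one tag per token. The exact tag strings shown are illustrative, taken from the public ATIS annotation scheme rather than from the Colab.
```python
# Illustrative only: roughly what one ATIS training pair looks like.
# One BIO-style tag per input token (tag names follow the public ATIS scheme).
query = ("i want to fly from boston at 838 am and arrive in denver "
         "at 1110 in the morning")
slots = ("O O O O O B-fromloc.city_name O B-depart_time.time "
         "I-depart_time.time O O O B-toloc.city_name O B-arrive_time.time "
         "O O B-arrive_time.period_of_day")

# One tag per token: this alignment is what lets us treat slot filling
# as a per-token classification problem later on.
assert len(query.split()) == len(slots.split())
```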
1944 01:04:50,639 --> 01:04:55,279 Um and so that thing is here. You can 1945 01:04:53,239 --> 01:04:56,719 download it. 1946 01:04:55,280 --> 01:04:57,680 Calling it a library is like overstating 1947 01:04:56,719 --> 01:04:59,679 it. We literally just collected a bunch 1948 01:04:57,679 --> 01:05:00,719 of code and stuck it in a file. Okay? 1949 01:04:59,679 --> 01:05:02,039 So 1950 01:05:00,719 --> 01:05:03,639 and so what we'll do is from hardel 1951 01:05:02,039 --> 01:05:04,960 we'll import the transformer 1952 01:05:03,639 --> 01:05:06,679 encoder. 1953 01:05:04,960 --> 01:05:08,039 And we'll import this positional 1954 01:05:06,679 --> 01:05:09,239 embedding layer. 1955 01:05:08,039 --> 01:05:11,039 Because what we're going to do is we are 1956 01:05:09,239 --> 01:05:12,519 going to take the input, do the 1957 01:05:11,039 --> 01:05:14,199 positional encoding business, and then 1958 01:05:12,519 --> 01:05:15,400 send it into the transformer. 1959 01:05:14,199 --> 01:05:18,559 Okay? 1960 01:05:15,400 --> 01:05:21,119 Um so but first let's vectorize the 1961 01:05:18,559 --> 01:05:24,920 input uh queries that are coming in. 1962 01:05:21,119 --> 01:05:26,559 So we'll define a thing here. 1963 01:05:24,920 --> 01:05:28,440 Oh, it uses this uh 1964 01:05:26,559 --> 01:05:30,320 max query length, which is not defined. That's 1965 01:05:28,440 --> 01:05:32,079 what happens when you 1966 01:05:30,320 --> 01:05:34,480 don't run everything. 1967 01:05:32,079 --> 01:05:34,480 All right. 1968 01:05:38,599 --> 01:05:44,839 Okay. So now we have this thing here. So 1969 01:05:41,719 --> 01:05:47,319 turns out that there are 8,888 tokens, 1970 01:05:44,840 --> 01:05:49,320 right? 8,888 words in the input queries 1971 01:05:47,320 --> 01:05:52,359 that we have in the data. Uh so I 1972 01:05:49,320 --> 01:05:54,200 take a look at the first few. 1973 01:05:52,358 --> 01:05:56,799 And you can see here, you know, there is 1974 01:05:54,199 --> 01:05:58,759 unk. Uh and because the output mode here 1975 01:05:56,800 --> 01:06:00,280 is, you just want integers to come out, 1976 01:05:58,760 --> 01:06:01,000 not multi-hot encoding or anything, 1977 01:06:00,280 --> 01:06:02,600 because we're going to take these 1978 01:06:01,000 --> 01:06:04,920 integers and then do embeddings from 1979 01:06:02,599 --> 01:06:07,880 them. So it'll 1980 01:06:04,920 --> 01:06:10,280 reserve this empty string as the pad 1981 01:06:07,880 --> 01:06:11,119 token. This should be familiar from last 1982 01:06:10,280 --> 01:06:13,200 week. 1983 01:06:11,119 --> 01:06:14,679 And then the unk for unknown tokens, and 1984 01:06:13,199 --> 01:06:17,039 then "to", "from", "flights", these are all some 1985 01:06:14,679 --> 01:06:18,559 of the most frequent. Um turns out 1986 01:06:17,039 --> 01:06:20,119 Boston is actually the most frequent. I 1987 01:06:18,559 --> 01:06:22,358 don't know what's up with that. 1988 01:06:20,119 --> 01:06:24,279 It is what it is. Then we'll do the same 1989 01:06:22,358 --> 01:06:25,319 vectorization to the train and test data 1990 01:06:24,280 --> 01:06:28,160 sets. 1991 01:06:25,320 --> 01:06:30,480 Now uh we need to do the same thing for the output 1992 01:06:28,159 --> 01:06:31,799 side of the problem, because the slots, 1993 01:06:30,480 --> 01:06:33,800 the dependent variable here, 1994 01:06:31,800 --> 01:06:36,519 remember, are all sentences as well with 1995 01:06:33,800 --> 01:06:38,200 the B, O, things like that, right? So we 1996 01:06:36,519 --> 01:06:40,840 need to vectorize those.
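A minimal sketch of that vectorization step, assuming Keras's TextVectorization with integer output; the argument values and the toy queries are assumptions, not the Colab's exact cell. The output side is handled the same way but with standardization turned off, which is explained next.
```python
import tensorflow as tf
from tensorflow.keras import layers

# Input-side vectorization (values and toy queries are assumptions).
max_query_length = 30

train_queries = [
    "i want to fly from boston at 838 am",
    "what flights leave denver in the morning",
]  # stand-in for the real ATIS training queries

query_vectorizer = layers.TextVectorization(
    output_mode="int",                        # integer ids, ready for an Embedding layer
    output_sequence_length=max_query_length,  # pad or truncate every query to 30 tokens
)
query_vectorizer.adapt(train_queries)         # builds the vocabulary: '' is pad, '[UNK]' is unknown

print(query_vectorizer.get_vocabulary()[:8])
print(query_vectorizer([train_queries[0]]))   # shape (1, 30): token ids padded to length 30

# The output side gets its own TextVectorization, but with standardize=None
# so tags like B-fromloc.city_name survive intact.
```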
1997 01:06:38,199 --> 01:06:42,039 So we do the same thing on them. 1998 01:06:40,840 --> 01:06:43,280 So let's take a look at some of these 1999 01:06:42,039 --> 01:06:44,519 slots. 2000 01:06:43,280 --> 01:06:45,800 And you can see here all this stuff is 2001 01:06:44,519 --> 01:06:48,280 going on. 2002 01:06:45,800 --> 01:06:49,760 Now here is an example where you 2003 01:06:48,280 --> 01:06:51,440 have to be very careful when you do the 2004 01:06:49,760 --> 01:06:52,800 standardization. 2005 01:06:51,440 --> 01:06:54,440 Typically in standardization you will 2006 01:06:52,800 --> 01:06:56,120 remove punctuation and, you know, do 2007 01:06:54,440 --> 01:06:57,358 things like that and lowercase, right? 2008 01:06:56,119 --> 01:07:00,400 But here 2009 01:06:57,358 --> 01:07:01,559 these things have a specific meaning. 2010 01:07:00,400 --> 01:07:03,400 We can't just go in there and remove the 2011 01:07:01,559 --> 01:07:04,880 period and the underscore and then 2012 01:07:03,400 --> 01:07:06,559 make the B into lowercase b and stuff 2013 01:07:04,880 --> 01:07:07,880 like that. That'll just harm it. 2014 01:07:06,559 --> 01:07:10,239 Right? We need to be able to preserve 2015 01:07:07,880 --> 01:07:12,559 the nomenclature of the output in terms 2016 01:07:10,239 --> 01:07:13,639 of all those tags. So 2017 01:07:12,559 --> 01:07:15,119 um so we don't want the standardization 2018 01:07:13,639 --> 01:07:17,000 to strip all those out. So what we do is we 2019 01:07:15,119 --> 01:07:18,358 say standardization none. 2020 01:07:17,000 --> 01:07:20,039 Look at that. 2021 01:07:18,358 --> 01:07:22,319 We tell Keras, do not standardize this. 2022 01:07:20,039 --> 01:07:23,239 Do not do your usual thing. 2023 01:07:22,320 --> 01:07:25,280 Okay? 2024 01:07:23,239 --> 01:07:26,919 Um so 2025 01:07:25,280 --> 01:07:29,080 we do that 2026 01:07:26,920 --> 01:07:30,960 for the output side. And then let's look 2027 01:07:29,079 --> 01:07:33,358 at the vocabulary. 2028 01:07:30,960 --> 01:07:34,440 Yeah, so this looks pretty good. 2029 01:07:33,358 --> 01:07:35,880 These are all the things that we would 2030 01:07:34,440 --> 01:07:37,599 expect to see. 2031 01:07:35,880 --> 01:07:39,800 These are the distinct tokens in the 2032 01:07:37,599 --> 01:07:42,759 output strings. 2033 01:07:39,800 --> 01:07:42,760 Um all right. 2034 01:07:43,320 --> 01:07:48,359 Okay, we get it. 2035 01:07:45,880 --> 01:07:50,400 So we have 125 of them. In the 2036 01:07:48,358 --> 01:07:54,279 lecture I said there are 123 slots, 2037 01:07:50,400 --> 01:07:57,240 possible slots. Why is it 125 here? 2038 01:07:54,280 --> 01:07:59,519 Yes, unk and pad. Correct. 2039 01:07:57,239 --> 01:08:02,279 Um okay. Now we'll set up a transformer 2040 01:07:59,519 --> 01:08:05,119 encoder, right? Uh this Oh, wait, wait, 2041 01:08:02,280 --> 01:08:07,280 wait. I forgot about um doing this. My 2042 01:08:05,119 --> 01:08:09,519 bad. Um 2043 01:08:07,280 --> 01:08:09,519 All right. 2044 01:08:11,519 --> 01:08:15,639 I just realized when I saw the slide that 2045 01:08:12,880 --> 01:08:16,560 we went to the collab 2046 01:08:15,639 --> 01:08:18,880 without giving you a bit more 2047 01:08:16,560 --> 01:08:20,240 background. No problem. So 2048 01:08:18,880 --> 01:08:21,119 So 2049 01:08:20,239 --> 01:08:22,318 the way we're going to model this 2050 01:08:21,119 --> 01:08:23,479 problem is that we're going to have 2051 01:08:22,319 --> 01:08:24,839 something like this, right?
Fly from 2052 01:08:23,479 --> 01:08:26,239 Boston to Denver. 2053 01:08:24,838 --> 01:08:28,600 That's the input that's coming in and 2054 01:08:26,239 --> 01:08:31,439 that is the correct answer. 2055 01:08:28,600 --> 01:08:32,798 0 0 some B something or others I mean O 2056 01:08:31,439 --> 01:08:34,479 and then something else, right? That's 2057 01:08:32,798 --> 01:08:36,399 the correct answer. That's the that's 2058 01:08:34,479 --> 01:08:38,718 the input and that is the right answer. 2059 01:08:36,399 --> 01:08:40,559 So what we'll do is we will 2060 01:08:38,719 --> 01:08:42,640 create these positional input embeddings 2061 01:08:40,560 --> 01:08:45,359 like we have discussed before. 2062 01:08:42,640 --> 01:08:47,719 We will run it through a transformer. 2063 01:08:45,359 --> 01:08:49,120 It gives us contextual embeddings. 2064 01:08:47,719 --> 01:08:50,680 So if we send five in, it's going to 2065 01:08:49,119 --> 01:08:51,960 send us five out except the color is now 2066 01:08:50,680 --> 01:08:54,319 blue. 2067 01:08:51,960 --> 01:08:57,520 Right? And then what we do is 2068 01:08:54,319 --> 01:08:59,400 we will run it through a relu. 2069 01:08:57,520 --> 01:09:01,080 Okay, we'll run it through a relu. 2070 01:08:59,399 --> 01:09:02,639 We will still have 2071 01:09:01,079 --> 01:09:04,039 you know, five vectors here, five 2072 01:09:02,640 --> 01:09:05,920 vectors will come in. 2073 01:09:04,039 --> 01:09:07,960 And then for each of the things that 2074 01:09:05,920 --> 01:09:10,759 comes in, we will stick a 123-way 2075 01:09:07,960 --> 01:09:10,759 softmax. 2076 01:09:11,838 --> 01:09:15,838 Okay, for each thing that comes out 2077 01:09:13,279 --> 01:09:16,838 we'll have a 123-way softmax and that's 2078 01:09:15,838 --> 01:09:19,239 the classification problem we're going 2079 01:09:16,838 --> 01:09:19,239 to solve. 2080 01:09:20,439 --> 01:09:23,639 Okay? 2081 01:09:21,719 --> 01:09:25,759 So 2082 01:09:23,640 --> 01:09:28,280 the weights in all these layers will get 2083 01:09:25,759 --> 01:09:29,279 optimized by backprop. 2084 01:09:28,279 --> 01:09:30,798 All these weights are going to get 2085 01:09:29,279 --> 01:09:33,200 optimized. 2086 01:09:30,798 --> 01:09:33,199 Uh yeah. 2087 01:09:34,119 --> 01:09:36,399 Sorry? 2088 01:09:40,798 --> 01:09:44,798 Oh no, the that's a layer. The weights 2089 01:09:43,680 --> 01:09:46,920 in the layer will still need to be 2090 01:09:44,798 --> 01:09:48,159 learned. 2091 01:09:46,920 --> 01:09:50,199 It's sort of like the text vectorization 2092 01:09:48,159 --> 01:09:51,880 layer is a bunch of code and then you 2093 01:09:50,199 --> 01:09:53,439 actually run it on a particular corpus 2094 01:09:51,880 --> 01:09:54,480 to adapt it and fill our vocabulary out 2095 01:09:53,439 --> 01:09:55,679 of it. 2096 01:09:54,479 --> 01:09:57,879 So, it's like an empty shell that needs 2097 01:09:55,680 --> 01:09:59,320 to get populated. 2098 01:09:57,880 --> 01:10:00,680 Okay, so with the weights and all these 2099 01:09:59,319 --> 01:10:02,239 things are going to get updated when we 2100 01:10:00,680 --> 01:10:03,600 when we train the model 2101 01:10:02,239 --> 01:10:06,399 by backprop. 2102 01:10:03,600 --> 01:10:07,600 Uh and that's it. That's the setup. 2103 01:10:06,399 --> 01:10:09,639 Does this make sense before I switch 2104 01:10:07,600 --> 01:10:11,560 back to the collab? 2105 01:10:09,640 --> 01:10:14,320 In particular, does this make sense? 2106 01:10:11,560 --> 01:10:14,320 This part of it. 
2107 01:10:15,920 --> 01:10:18,440 Bunch of things come out and then for 2108 01:10:17,319 --> 01:10:20,439 each one of those things we need to 2109 01:10:18,439 --> 01:10:22,119 figure out a classification of a 123-way 2110 01:10:20,439 --> 01:10:23,479 classification. And that's where we 2111 01:10:22,119 --> 01:10:25,319 stick a softmax on every one of those 2112 01:10:23,479 --> 01:10:27,599 output nodes. 2113 01:10:25,319 --> 01:10:27,599 Yeah. 2114 01:10:32,800 --> 01:10:35,440 Oh oh, I see. 2115 01:10:36,000 --> 01:10:38,439 Yeah, so 2116 01:10:40,239 --> 01:10:43,279 It could be whatever or to put it 2117 01:10:41,560 --> 01:10:45,600 another way, it is your choice as the 2118 01:10:43,279 --> 01:10:47,880 user as the modeler. Correct? The thing 2119 01:10:45,600 --> 01:10:49,400 is at this point with the blue stuff the 2120 01:10:47,880 --> 01:10:51,359 transformer is basically saying, my job 2121 01:10:49,399 --> 01:10:52,639 is done. 2122 01:10:51,359 --> 01:10:54,639 It has given you these valuable 2123 01:10:52,640 --> 01:10:56,720 contextual embeddings at some high-level 2124 01:10:54,640 --> 01:10:58,480 abstraction. What you do with it depends 2125 01:10:56,720 --> 01:11:00,680 on your particular problem. And so that 2126 01:10:58,479 --> 01:11:01,959 the best practice would be to take it 2127 01:11:00,680 --> 01:11:03,280 and then maybe, you know, if these 2128 01:11:01,960 --> 01:11:04,279 embeddings are embeddings are really 2129 01:11:03,279 --> 01:11:07,159 long, maybe you make them a little 2130 01:11:04,279 --> 01:11:09,079 smaller, right? Using a ReLU. And using 2131 01:11:07,159 --> 01:11:10,239 a ReLU is always a good idea because 2132 01:11:09,079 --> 01:11:11,640 when in doubt, throw in a bit of 2133 01:11:10,239 --> 01:11:13,519 non-linearity. 2134 01:11:11,640 --> 01:11:15,440 Right? Uh and then once you're done with 2135 01:11:13,520 --> 01:11:17,040 that, well, at this point you need to 2136 01:11:15,439 --> 01:11:20,079 actually classify it. So, you stick an 2137 01:11:17,039 --> 01:11:20,079 output softmax on it. 2138 01:11:20,560 --> 01:11:24,120 Okay. So, that's what we have. 2139 01:11:24,680 --> 01:11:26,960 Um 2140 01:11:27,680 --> 01:11:32,119 All right, back to this picture. 2141 01:11:29,640 --> 01:11:34,280 So, what we're going to do is we 2142 01:11:32,119 --> 01:11:36,119 we also get to decide how long are these 2143 01:11:34,279 --> 01:11:37,199 embedding vectors. How long because here 2144 01:11:36,119 --> 01:11:37,920 we're not going to use Glove embeddings. 2145 01:11:37,199 --> 01:11:39,800 We're just going to learn everything 2146 01:11:37,920 --> 01:11:40,800 from scratch. 2147 01:11:39,800 --> 01:11:42,880 Right? We're going to learn everything 2148 01:11:40,800 --> 01:11:45,360 from scratch. So, and we can decide how 2149 01:11:42,880 --> 01:11:46,440 long these embedding vectors are. So, um 2150 01:11:45,359 --> 01:11:47,519 these embedding vectors I'm going to 2151 01:11:46,439 --> 01:11:49,359 decide 2152 01:11:47,520 --> 01:11:52,880 uh I have decided that I want them to be 2153 01:11:49,359 --> 01:11:54,839 512 long, right? I want these actually 2154 01:11:52,880 --> 01:11:57,000 to be 512 long. So, that's what I have 2155 01:11:54,840 --> 01:11:58,880 here, 512. 
2156 01:11:57,000 --> 01:12:00,000 And then inside the transformer, 2157 01:11:58,880 --> 01:12:01,239 remember 2158 01:12:00,000 --> 01:12:02,920 when we 2159 01:12:01,239 --> 01:12:04,679 concatenate everything and then we have 2160 01:12:02,920 --> 01:12:07,600 something, we run it through a final 2161 01:12:04,680 --> 01:12:08,960 ReLU layer, how big should that layer 2162 01:12:07,600 --> 01:12:11,079 be? 2163 01:12:08,960 --> 01:12:13,279 That's what it here what I mean by dense 2164 01:12:11,079 --> 01:12:15,039 dim. I want it to be 64. 2165 01:12:13,279 --> 01:12:17,519 And then I, you know, for fun I'm going 2166 01:12:15,039 --> 01:12:20,399 to use five attention heads. 2167 01:12:17,520 --> 01:12:20,400 Because why not? 2168 01:12:20,439 --> 01:12:27,399 Okay. And then in the final thing here 2169 01:12:24,319 --> 01:12:29,199 to go to Ali's question here these 2170 01:12:27,399 --> 01:12:32,079 things are all 512 long as I mentioned 2171 01:12:29,199 --> 01:12:34,479 earlier, right? These are all 512. 2172 01:12:32,079 --> 01:12:36,760 But this thing here I'm going to make it 2173 01:12:34,479 --> 01:12:38,799 just 128. 2174 01:12:36,760 --> 01:12:41,199 Okay, that's what I mean by units here. 2175 01:12:38,800 --> 01:12:43,119 And so if you look at the actual model 2176 01:12:41,199 --> 01:12:45,679 okay, whatever comes in has a max query 2177 01:12:43,119 --> 01:12:47,239 length of I think 30 if I recall. 2178 01:12:45,680 --> 01:12:50,240 Um actually let's just make sure of 2179 01:12:47,239 --> 01:12:50,239 that. What did I assume? 2180 01:12:51,439 --> 01:12:55,759 30, correct? Max query length 30. So, 2181 01:12:53,079 --> 01:12:57,319 each sentence is 30. So, if a sentence 2182 01:12:55,760 --> 01:12:59,680 has 35 words in it, what's going to 2183 01:12:57,319 --> 01:12:59,679 happen? 2184 01:12:59,840 --> 01:13:03,760 The last five will get chopped, 2185 01:13:01,159 --> 01:13:05,359 truncated. If it comes in at 22, we're 2186 01:13:03,760 --> 01:13:06,840 going to pad it with eight more tokens 2187 01:13:05,359 --> 01:13:09,559 with a pad token. Okay? That's how we 2188 01:13:06,840 --> 01:13:12,159 make sure everything uh gets to 30. 2189 01:13:09,560 --> 01:13:14,039 All right. So, we come back here. 2190 01:13:12,159 --> 01:13:16,720 So, the input is still sentences which 2191 01:13:14,039 --> 01:13:18,960 are 30 long, tokens which are 30 long. 2192 01:13:16,720 --> 01:13:20,520 And then we run it through a positional 2193 01:13:18,960 --> 01:13:23,119 embedding layer. 2194 01:13:20,520 --> 01:13:25,160 Okay? This positional embedding layer 2195 01:13:23,119 --> 01:13:27,319 has the the actual embedding for each 2196 01:13:25,159 --> 01:13:29,279 word, that table and it has the 2197 01:13:27,319 --> 01:13:31,639 positional table, positional embedding 2198 01:13:29,279 --> 01:13:34,119 table. So, just to be clear, this 2199 01:13:31,640 --> 01:13:37,119 positional embedding layer is basically 2200 01:13:34,119 --> 01:13:38,800 it's basically this. 2201 01:13:37,119 --> 01:13:41,199 So, this table 2202 01:13:38,800 --> 01:13:43,720 and this table together are packaged up 2203 01:13:41,199 --> 01:13:45,279 into the positional encoding layer. 2204 01:13:43,720 --> 01:13:47,400 But they are two distinct tables. They 2205 01:13:45,279 --> 01:13:49,479 just happen to be packaged up. 2206 01:13:47,399 --> 01:13:51,119 So, 2207 01:13:49,479 --> 01:13:52,839 so this is what we have here. 
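A minimal sketch of what that packaging looks like: a token-embedding table and a position-embedding table inside one layer, looked up and added. The hardel layer's actual code may differ in details.
```python
import tensorflow as tf
from tensorflow.keras import layers

# A sketch of what a positional-embedding layer bundles together: a token
# embedding table plus a position embedding table, looked up and added.
class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(vocab_size, embed_dim)          # e.g. 8888 x 512
        self.position_embeddings = layers.Embedding(sequence_length, embed_dim)  # e.g. 30 x 512

    def call(self, token_ids):
        length = tf.shape(token_ids)[-1]
        positions = tf.range(start=0, limit=length, delta=1)   # 0, 1, ..., 29
        # Standalone word embedding plus the embedding of the word's position.
        return self.token_embeddings(token_ids) + self.position_embeddings(positions)

# A batch of one query, 30 token ids in, (1, 30, 512) out.
demo = PositionalEmbedding(sequence_length=30, vocab_size=8888, embed_dim=512)
print(demo(tf.zeros((1, 30), dtype=tf.int32)).shape)
```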
2208 01:13:51,119 --> 01:13:55,000 And then we get a nice positional 2209 01:13:52,840 --> 01:13:57,480 embedding out and then boom, we run it 2210 01:13:55,000 --> 01:13:59,640 through the transformer. And you know, 2211 01:13:57,479 --> 01:14:01,559 this transformer encoder object we have 2212 01:13:59,640 --> 01:14:02,800 to tell it obviously, hey, this is the 2213 01:14:01,560 --> 01:14:04,640 embedding dimension that's going to come 2214 01:14:02,800 --> 01:14:06,880 out. This is the dense dimension you're 2215 01:14:04,640 --> 01:14:09,000 going to use in that final feedforward 2216 01:14:06,880 --> 01:14:10,159 layer inside each attention block and 2217 01:14:09,000 --> 01:14:11,640 this is the number of attention heads I 2218 01:14:10,159 --> 01:14:13,519 want you to use. That's it. 2219 01:14:11,640 --> 01:14:14,800 Very, right? Only three things have to 2220 01:14:13,520 --> 01:14:16,840 be specified. 2221 01:14:14,800 --> 01:14:18,039 And then whatever comes out of the 2222 01:14:16,840 --> 01:14:19,159 transformer encoder are these blue 2223 01:14:18,039 --> 01:14:20,960 vectors. 2224 01:14:19,159 --> 01:14:22,720 And then we are back into good old sort 2225 01:14:20,960 --> 01:14:24,560 of, you know, traditional DNN stuff 2226 01:14:22,720 --> 01:14:27,880 where we take this thing, run it through 2227 01:14:24,560 --> 01:14:30,880 a ReLU with 128 units, we add a little 2228 01:14:27,880 --> 01:14:33,279 dropout uh and then we run it through a 2229 01:14:30,880 --> 01:14:35,600 dense layer which the the vocab size 2230 01:14:33,279 --> 01:14:37,359 here is 125, which is the 125-way 2231 01:14:35,600 --> 01:14:39,840 softmax. 2232 01:14:37,359 --> 01:14:41,239 Okay? Activation softmax. 2233 01:14:39,840 --> 01:14:42,720 Connect up everything into model input 2234 01:14:41,239 --> 01:14:44,399 and output and boom, that's the whole 2235 01:14:42,720 --> 01:14:47,440 model. 2236 01:14:44,399 --> 01:14:48,519 So, that's what we have here. 2237 01:14:47,439 --> 01:14:50,839 Okay? 2238 01:14:48,520 --> 01:14:50,840 Now, 2239 01:14:51,079 --> 01:14:54,680 this for the you know, after Wednesday's 2240 01:14:53,399 --> 01:14:56,679 class 2241 01:14:54,680 --> 01:14:59,320 for extra credit and for your personal 2242 01:14:56,680 --> 01:15:00,880 edification 2243 01:14:59,319 --> 01:15:03,000 try to work through this thing to come 2244 01:15:00,880 --> 01:15:04,800 up with this number. 2245 01:15:03,000 --> 01:15:06,960 53 million 2246 01:15:04,800 --> 01:15:10,039 um sorry, 5.3 million. 2247 01:15:06,960 --> 01:15:12,600 Right? Uh and see if it matches this 2248 01:15:10,039 --> 01:15:13,920 number here. 2249 01:15:12,600 --> 01:15:15,520 It should match. 2250 01:15:13,920 --> 01:15:17,840 Hand calculate the number of parameters 2251 01:15:15,520 --> 01:15:19,720 inside the transformer. Okay? For fame 2252 01:15:17,840 --> 01:15:20,520 and fortune. That's an optional thing. 2253 01:15:19,720 --> 01:15:22,240 So, 2254 01:15:20,520 --> 01:15:23,480 uh do it after Wednesday's class, not 2255 01:15:22,239 --> 01:15:24,920 right now. 2256 01:15:23,479 --> 01:15:26,799 And I have actually listed the exact 2257 01:15:24,920 --> 01:15:28,560 math that goes into it here. Okay? All 2258 01:15:26,800 --> 01:15:30,159 right. So, by the way, you can peek into 2259 01:15:28,560 --> 01:15:31,960 any layers' weights using its weight 2260 01:15:30,159 --> 01:15:33,319 attribute. 
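Putting the pieces together, here is a minimal sketch of the slot-filling model just described. The hardel import, its constructor arguments (the three things quoted: embedding dimension, dense dimension, number of heads), the dropout rate, and the loss are all assumptions based on the walkthrough; running it requires the course's hardel file on the path.
```python
import tensorflow as tf
from tensorflow.keras import layers

# Assumed import: the course's hardel file, with the constructor arguments
# following the three things quoted in the walkthrough.
from hardel import PositionalEmbedding, TransformerEncoder

max_query_length = 30
input_vocab_size = 8888     # distinct tokens in the ATIS queries
slot_vocab_size = 125       # 123 slot tags plus pad and [UNK]
embed_dim, dense_dim, num_heads = 512, 64, 5

inputs = layers.Input(shape=(max_query_length,), dtype="int64")        # 30 token ids per query
x = PositionalEmbedding(max_query_length, input_vocab_size, embed_dim)(inputs)
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)             # contextual embeddings
x = layers.Dense(128, activation="relu")(x)                            # per-token ReLU layer
x = layers.Dropout(0.3)(x)                                             # "a little dropout" (rate assumed)
outputs = layers.Dense(slot_vocab_size, activation="softmax")(x)       # 125-way softmax per token

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam",                                        # optimizer and loss assumed
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()   # compare the total against the 5.3 million quoted above

# Peek at the positional-embedding layer's weights: two tables,
# roughly (8888, 512) for tokens and (30, 512) for positions.
print([w.shape for w in model.layers[1].weights])
```
Note that the final Dense layer is applied to each of the 30 token positions independently, which is exactly the per-token softmax described earlier.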
This is the embedding 2261 01:15:31,960 --> 01:15:34,640 uh the positional embedding thing we 2262 01:15:33,319 --> 01:15:36,759 had. So, 2263 01:15:34,640 --> 01:15:39,440 we can click it and you can see here it 2264 01:15:36,760 --> 01:15:40,840 has two tables. There's the first table, 2265 01:15:39,439 --> 01:15:41,799 which is just the embedding table, which 2266 01:15:40,840 --> 01:15:43,560 says 2267 01:15:41,800 --> 01:15:45,840 there are 8,888 tokens in my 2268 01:15:43,560 --> 01:15:47,880 vocabulary and each of those tokens is 2269 01:15:45,840 --> 01:15:49,880 an embedding vector which is 512 long. 2270 01:15:47,880 --> 01:15:51,520 That is the first table here. And then 2271 01:15:49,880 --> 01:15:53,880 it has the second object, which is the 2272 01:15:51,520 --> 01:15:56,480 positional embedding, and it says here, 2273 01:15:53,880 --> 01:15:58,640 well, my sentences can be 30 long and 2274 01:15:56,479 --> 01:16:02,079 for each position of the 30-long 2275 01:15:58,640 --> 01:16:04,079 sentence, I will have a 512-long embedding. 2276 01:16:02,079 --> 01:16:05,439 Both these tables, as I mentioned earlier, 2277 01:16:04,079 --> 01:16:06,800 are packaged up inside, and you can 2278 01:16:05,439 --> 01:16:08,159 actually see what the weights are before 2279 01:16:06,800 --> 01:16:09,560 you do any training. 2280 01:16:08,159 --> 01:16:11,319 Okay? 2281 01:16:09,560 --> 01:16:13,400 So, all right. So, I'm going to stop 2282 01:16:11,319 --> 01:16:14,359 here uh because the model is going to 2283 01:16:13,399 --> 01:16:16,079 take a few minutes to run and we're 2284 01:16:14,359 --> 01:16:17,519 already at 9:45. 2285 01:16:16,079 --> 01:16:19,479 Um so, we will continue the journey on 2286 01:16:17,520 --> 01:16:20,560 Wednesday. If some of it is not super 2287 01:16:19,479 --> 01:16:21,799 clear, don't worry about it. It will 2288 01:16:20,560 --> 01:16:22,960 become much clearer on Wednesday. All 2289 01:16:21,800 --> 01:16:23,640 right? All right, folks, have a good 2290 01:16:22,960 --> 01:16:26,000 couple of days. I'll see you on 2291 01:16:23,640 --> 01:16:26,000 Wednesday.