Um, so let's start with a quick review. Last week we looked at BERT, how BERT was created, and we learned about this technique called masking, which is a kind of self-supervised learning. And the idea of masking was very simple. We asked ourselves the question: we have seen ways in which people can take images and pre-train models like ResNet on a vast body of images, but then for each image somebody had to go and label it, right? So for text we asked the question: well, what does it mean to label a piece of text when we don't actually have a clearly defined end goal in mind, except the general goal of pre-training things? And then we said, oh, well, what we can do is replace some of the words in every sentence with what you'd call a mask token, and then we just train the network to recover the blanks, to fill in the blanks. And this technique, which is one of many ways of doing what's called self-supervised learning, is called masking. And we described how if you essentially take all of Wikipedia, and for every sentence you mask it like this and then train a network to fill in the blanks, the resulting network becomes really good at doing all kinds of interesting things. In fact, the first such network, or one of the first such networks, was called BERT, and in your homework you've been looking at BERT and so on and so forth. That's masking.

Now we're going to switch gears and talk about a different kind of self-supervised learning, which is different from masking and which turns out to be weirdly more interesting and powerful. Okay, so we are going to look at another technique, and this technique is called next word prediction. Now, it is actually in some sense a special case of masking, where you're basically saying: take a sentence, and instead of randomly picking a word to blank out,
you're saying, "I'm just going to take the last word and make it a blank." Okay? And then you send the sentence in, and you have the machine just fill in the blank on the last word. Predict the next word. Okay? And you don't have to use full sentences for it. You can use parts of sentences, sentence fragments, as well. So if you take the same sentence as before, "The mission of the MIT Sloan School...", you can literally divide it up: you can give it "The" and ask it to predict "mission." You can give it "The mission" and ask it to predict "of." You give it "The mission of" and ask it to predict "the." You get the idea. So you can take every sentence fragment and literally just give it the first few words and have it predict the next one. First few, next one; first few, next one. Okay. So this is next word prediction.

And so what we're going to do now is actually take the transformer encoder architecture that we used to build BERT in the last class, and we're going to try to use it to solve next word prediction, to build a model that can do next word prediction. Okay. So this is what we have. So what we're going to do is take the phrase "the cat sat on the mat." So the phrase was, let's say, "the cat sat on the mat." So what you might want to do is say: okay, this is the input, this is the output. Input "the," output "cat." Then maybe you have input "the cat," and the output is "sat." "The cat sat," output "on," and so on. Right, you get the idea. And then finally, we have "the cat sat on the," output "mat." Right, this is basically what we have: all these inputs and outputs. But we're going to very compactly express it as if it's just coming in as one data point in one batch. And that's what we're doing here.
So what we're going to do is stack it up like this, where we have "the cat sat on the" on the left, meaning everything but the last word, and then we're going to take that same sentence and just shift it to the left by one, right? So "the cat sat on the mat": we cut off "mat," and that becomes the input. Then we cut off the first word, and that becomes the output. So when you look at it that way, you can see, right, you will want "the" to be used to predict "cat," you will want "the cat" to be used to predict "sat," and so on and so forth.

Okay, so this is just a little manipulation so that we don't have to have, you know, dozens of sentence examples just for one starting sentence.

So if you have something like this, what you can do is run it through positional input embeddings, like we have done before with BERT. Then we can run it through a whole bunch of transformer blocks, right? It's like a transformer stack. Then we get these contextual embeddings. Then we run them through maybe one or more ReLUs if you want, because it's always a good idea to stick some ReLUs at the very end. And then we basically attach a softmax to every one of the things that are coming out. Okay. And that softmax is actually going to be a softmax whose range is the entire vocabulary.

Okay. For now, let's assume that the vocabulary is just a vocabulary of words, not tokens. We'll get into tokens a bit later on in the class. For now, just assume it's words. And roughly speaking, let's say there are 50,000 words in our vocabulary. So each of these softmaxes, and this is exactly what we did for BERT, by the way, each of these softmaxes is like a 50,000-way softmax.

Okay.
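Concretely, that input/output shift can be set up in a few lines. A minimal sketch in PyTorch, using a toy word-level vocabulary (the lecture assumes a ~50,000-word one):

```python
import torch

# Toy word-level vocabulary; a real model would use ~50,000 entries.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
sentence = ["the", "cat", "sat", "on", "the", "mat"]
ids = torch.tensor([vocab[w] for w in sentence])

# One compact training example: the input is everything but the last word,
# and the target is the same sequence shifted left by one position.
inputs = ids[:-1]    # "the cat sat on the"
targets = ids[1:]    # "cat sat on the mat"
print(inputs.tolist(), targets.tolist())
```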
But here's what we're going to do when we look at it this way: since we fundamentally care about next word prediction, as you will see later on, we are actually going to ignore all these predictions, because who cares? We are only going to look at the last one to figure out: okay, what is the last prediction? What is it? Because the last prediction is going to be based on everything that came before it. So this is really the next word that's actually being predicted. All the things before it we don't care about so much.

Okay. And all this will become slightly clearer because you're going to make a couple of passes through it. Yeah?

>> How do we...?

>> So, um, the notion of a sentence has disappeared at this point. What we're going to do is, when we look at how we tokenize the input for these kinds of models, we're actually going to take punctuation into account. So we're going to take periods into account, exclamation marks into account, and so on and so forth. And that'll answer your question, and we'll come back to that. Okay, so this is what we have. So, all right. So just to be clear: the embedding that's coming out of the final dense layer is passed through its own softmax, with the number of softmax categories equal to the vocab size. Okay.

All right. So first of all, let's say we train a model like this with lots of inputs and outputs. Okay, this just looks like BERT, right? It's not that different, except that there's no notion of a mask.

Do you notice any problems with the way this thing has been set up?

>> Uh, like for some words, like "the," you're going to have a lot of potential output pairs that come out of that.

>> True. Which means that if you have a word like "the," the next word is...

>> ...hard to predict.

>> It's true.
So some words may be hard to predict, depending on the last word of the input sentence. Right, that's what you're getting at. Yeah. Any other concerns? Yeah?

>> Since you're using contextual [embeddings], the output of the first word is going to have access to the second word, and so it's kind of like cheating.

>> Bingo. So remember, "bingo" is a technical term in deep learning which means "great." So if you go to this, right, as she points out: look at the self-attention layer. Remember, the self-attention layer is the key building block of the transformer block, right? And in the self-attention layer, for every word we calculate its contextual embedding by weighted averaging over its relationship to all the other words in the sentence. So the last word can see the first word, the first word can see the last word, and so on and so forth, right? But when you're doing next word prediction, this feels problematic, because you're peeking into the future, right?

So let's say that you want to predict the next word. If you look at this architecture, what it can simply do is copy it from the input, because it can see the whole sentence. So if I tell you, hey, "the cat sat on the mat," and then I just give you "the cat sat on the" — can you predict the next word for me? You'll be like, yeah, duh, it's "mat."

The whole thing becomes challenging only if I say "the cat sat on the ___" — now predict the blank.

So to put it another way: let's say that you have fed in the first two words and you want to predict this. This is the right answer for the prediction. The network should only use the first two.
However, because self-attention can see "sat" — it can see this next word — it'll trivially learn to predict the next word to be "sat," right? There is no challenge for it.

So this is the key problem, right? This is the key problem with just using the transformer as-is.

>> What's our loss function here?

>> The loss function in all these things is actually the same as before, which is that it's applied to every output that's coming out. So imagine you have just a traditional classification problem, in which you have one output — let's say you're classifying things into 10 categories, like we did with Fashion-MNIST — so you have 10 outputs, right? And that goes through a softmax, and then you have 10 probabilities, and there we use cross-entropy, right? So here, for every one of these outputs, we use cross-entropy. So we take this output, and there's a cross-entropy just for that, plus a cross-entropy for the next one, and so on and so forth. So we still minimize cross-entropy, but the sum of all these cross-entropies.

>> And does it get complicated at all by the fact that we have a large vocabulary size now?

>> I mean, it gets complicated just because there are more things to worry about — compute and so on and so forth. But conceptually, no difference; whether you have 10 or 50,000, it's the same thing. It's just that instead of classifying one input into one of 10 categories, the inputs themselves are as long as the number of words in your sentence. So each word that comes into your sentence is being classified in one of 50,000 ways, right? So essentially you have as many classification problems as you have words in the sentence. But at the end of the day, the loss function is just the sum of all those things, or, to be more precise, the average of all those things.
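As a rough sketch of that loss, assuming word-level IDs and a model that emits one 50,000-way logit vector per position (the logits here are random stand-ins, just to show the shape of the computation):

```python
import torch
import torch.nn.functional as F

vocab_size = 50_000
seq_len = 5  # positions for the input "the cat sat on the"

# Stand-in logits: one vocab-sized score vector per input position.
logits = torch.randn(seq_len, vocab_size)
targets = torch.tensor([1, 2, 3, 0, 4])  # next-word IDs: "cat sat on the mat"

# Per-position cross-entropy is -log p(correct next word); F.cross_entropy
# applies the softmax internally and averages over positions by default.
loss = F.cross_entropy(logits, targets)
print(loss.item())
```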
Actually, I think I may have a slide about this, which I may have hidden because I wasn't sure if I would have time. Let's unhide it. And, by the way, I did not agree ahead of time that we were going to set this up like this. Okay. So, all right. So, yeah, we still use the cross-entropy loss function. So, for each word that comes in, the cross-entropy is minus the log probability of the right answer — and you may recall this from earlier in the class. So we just do the same thing for "cat," "sat," "on," "the," everything. And then we just take the average, 1/7. Boom. That's it.

So, to go back to this problem. So this is the issue. The issue is that we can't allow words to be predicted knowing the future. They should only know about the past words. Okay. So what do we do? Right? We have to make a change to the transformer to make it work for next word prediction. So what we're going to do is this: when we are calculating the contextual embedding for a word — remember, the contextual embedding for a word is a weighted average of all the other words' embeddings — we will simply give zero weight to future words.

If you give zero weight to future words, it's almost as if they don't exist.

Okay? And this will become clear in a second. So imagine that this is the thing we are going to calculate. For every word in the sentence, we are calculating the pairwise attention weights — and you will remember I went through this, you know, with the iPad last week — we calculate all the weights.
So, for example, all these weights in every row will add up to one. And so you take the embeddings of "the cat sat on the," multiply them by the respective weights that add up to one — which is the first row of this table — and that gives you the contextual embedding for the word "the," and so on and so forth. And since we can't look at the future words, all we do is go to this table and just zero out everything in red.

Okay, we just zero everything here out, and then we renormalize so that the remaining cells, the non-zeroed cells, will still add up to one in each row. So what that means is that only this part is going to play a role; for "cat," only this part is going to play a role. So let's give an example. To predict "on," you'll only look at the words "the cat sat."

Okay. The rest of it will not be considered at all. Now, by the way, this tweak is called causal self-attention. It is also called masked self-attention. Right? Just different labels for the same thing. And so what that means is that when you're looking at the input, for "the," only "the" is going to be used to predict "cat." When you look at "the cat," only these two are going to be used to predict "sat," and so on and so forth.

Okay.
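Here is a small sketch of that zero-out-and-renormalize step on a matrix of raw attention scores. The usual trick is to set the future positions' scores to negative infinity before the softmax, which gives them exactly zero weight while the softmax renormalizes each row:

```python
import torch
import torch.nn.functional as F

seq_len = 6  # "the cat sat on the mat"
scores = torch.randn(seq_len, seq_len)  # raw pairwise attention scores

# Causal mask: position i may only attend to positions j <= i. Setting the
# future positions' scores to -inf makes their post-softmax weight exactly
# zero, and the softmax renormalizes each row so it still sums to 1.
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
weights = F.softmax(scores.masked_fill(future, float("-inf")), dim=-1)

print(weights)  # upper triangle is all zeros; each row sums to 1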
So this thing here this so all we 378 00:14:28,159 --> 00:14:32,559 do is we go into a transformer and we 379 00:14:30,240 --> 00:14:36,360 just change each attention head to be a 380 00:14:32,559 --> 00:14:36,359 causal attention head 381 00:14:38,559 --> 00:14:42,399 and the way it's actually done under the 382 00:14:40,078 --> 00:14:44,399 hood is actually very elegant for 383 00:14:42,399 --> 00:14:46,399 computational efficiency purposes but I 384 00:14:44,399 --> 00:14:49,600 won't get into it because it gets a bit 385 00:14:46,399 --> 00:14:52,559 you know involved but the key idea is 386 00:14:49,600 --> 00:14:54,959 replace basic plain vanilla attention 387 00:14:52,559 --> 00:14:57,119 with causal attention aka pay mass 388 00:14:54,958 --> 00:14:59,359 attention 389 00:14:57,120 --> 00:15:01,120 and you do that boom suddenly it it 390 00:14:59,360 --> 00:15:04,079 starts you know working for an expert 391 00:15:01,120 --> 00:15:06,000 prediction it can't cheat anymore 392 00:15:04,078 --> 00:15:10,198 and when we do that we get the 393 00:15:06,000 --> 00:15:10,198 transformer causal encoder 394 00:15:11,440 --> 00:15:15,360 and by the way the word causal here 395 00:15:13,519 --> 00:15:19,440 there's no connection to causality so 396 00:15:15,360 --> 00:15:20,800 it's just a it's just a term 397 00:15:19,440 --> 00:15:24,240 so if you look at the original 398 00:15:20,799 --> 00:15:26,319 transformer paper um 399 00:15:24,240 --> 00:15:28,000 it was created for translation for 400 00:15:26,320 --> 00:15:30,560 machine translation you know English to 401 00:15:28,000 --> 00:15:32,480 German right those kinds of use cases so 402 00:15:30,559 --> 00:15:34,399 it had something called an encoder which 403 00:15:32,480 --> 00:15:35,839 we are very familiar with from last week 404 00:15:34,399 --> 00:15:38,000 and then it had something called a 405 00:15:35,839 --> 00:15:40,480 decoder right and it is called the 406 00:15:38,000 --> 00:15:42,000 encoder decoder architecture and we are 407 00:15:40,480 --> 00:15:43,278 not going to cover the encoder decoder 408 00:15:42,000 --> 00:15:45,679 architecture because we are not covering 409 00:15:43,278 --> 00:15:48,958 machine translation in this class but 410 00:15:45,679 --> 00:15:51,439 I'm mentioning this because the this 411 00:15:48,958 --> 00:15:52,559 part of the the architecture is called a 412 00:15:51,440 --> 00:15:55,360 decoder 413 00:15:52,559 --> 00:15:57,758 because it uses see here there is a 414 00:15:55,360 --> 00:15:59,199 masked attention business going on here 415 00:15:57,759 --> 00:16:02,959 because it is using this masked 416 00:15:59,198 --> 00:16:05,278 attention it's called a decoder so 417 00:16:02,958 --> 00:16:06,799 the transformer causal encoder is also 418 00:16:05,278 --> 00:16:09,360 referred to sometimes as a transformer 419 00:16:06,799 --> 00:16:11,039 decoder but the word decoder has two 420 00:16:09,360 --> 00:16:12,560 meanings 421 00:16:11,039 --> 00:16:14,319 right it's a synonym for the causal 422 00:16:12,559 --> 00:16:17,359 encoder like we have seen today it's 423 00:16:14,320 --> 00:16:19,040 also used to refer to sequencetosequence 424 00:16:17,360 --> 00:16:21,519 translation problems for the second part 425 00:16:19,039 --> 00:16:23,198 of its architecture so you just have 426 00:16:21,519 --> 00:16:25,120 keep it it'll become clear from context 427 00:16:23,198 --> 00:16:26,399 what we're talking about in this course 428 00:16:25,120 --> 00:16:27,278 of course there is no confusion 
because we're not going to be looking at translation, right? We may say "decoder" or "causal encoder" — it's the same thing.

>> So I thought there were some transformers that use bidirectional [attention]. Is that different from this?

>> No. All "bidirectional" means is: I can see everything. So the encoder we looked at last week — the basic self-attention thing — is bidirectional. Basically, all it means is that I can look in both directions to see what other words are there. In causal attention, you're not using the ones in the future. Correct.

All right. So, to summarize where we are: this is what we looked at last week for BERT, and this is a transformer encoder. We take the same thing, and instead of multi-head attention we do causal multi-head attention, and we get the decoder, a.k.a. causal encoder.

Okay. And we use the left one for masked prediction; we use the right one for next word prediction.

All right. So now, instead of having an encoder, if you have a causal encoder — a TCE here — we can train models for next word prediction using the same exact approach as before, right? We set up the inputs and the outputs like I described earlier. We run it through a stack of causal encoders, dense layers, ReLU, softmax, and so on and so forth, right? Otherwise the details don't change, but the all-important changes go into the attention layer, to make it masked, or causal.

Any questions so far?

>> Uh, yeah. This would only apply when we're training the model, not when we're validating and testing, right?

>> Uh, so if you give me a sentence after training, right, the final prediction is the only thing you care about, and by definition the final prediction will use everything that came before it. So we are okay. Was that your question?
>> No, I think the fact that we're zeroing out the weights on the future words — I thought that would apply more when we're training the model and we're trying to minimize the loss, as opposed to when we're asking it for the next [word].

>> Right, but the point is, when we actually use them, what is the objective? Like, what do we want to do when we actually use them for inference? Once we finish training, our objective is: given a particular string, get me the next word, right? And to find the next word, you can in fact use everything that came before it. And therefore, without any change to this model, it'll just work for your intended purpose. You don't have to go in there and change it; you don't have to unmask it for inference, because you don't need to.

>> Yes.

>> Uh, I have one question regarding, like, when we do the causal transformers: we are putting certain weights to zero for the words which are to be predicted, and then we—

>> No — the words that are in the future.

>> The future, yeah. And then we normalize it.

>> Correct.

>> And we had trained a transformer earlier on all the words packed together. So won't there be a difference in weights between both the things?

>> Between the two ways of training? The weights are going to be very different, and they are two different models. BERT is used for certain things, and this kind of model, which is the basis of GPT, is going to be used for other things.

>> We are training it as well like that, I mean, while putting some of the weights to zero.

>> Correct, correct.
So what I'm talking about here is: what we're trying to do is say, let's take next word prediction as the self-supervised learning task, and we want to train such a model on a vast amount of text data, right? Well, we can't just use what we did last week, because it's not going to work, because of the fact that it can see the future. Therefore we make a tweak, and then we build this model. Now the question becomes: okay, what can you do with such a model? Right? We have basically trained two different kinds of models: the one that can see everything, BERT, and the one that can't see the future, which is actually GPT. So what can you do with it? We're going to come to that.

Okay. All right. So now, once you train such a model, given any input sentence — let's say that the sentence is "it was a dark and" — right, it goes through all these things. And remember what I said earlier: the fact that it's predicting something after just seeing each earlier word — we don't really care. All we're really curious about is: what is the next thing it's going to say? And the next thing it's going to say is basically going to be what's coming out of this last softmax.

Does that make sense? We don't care about anything that came before it, because we already have, like, a half-formed sentence, and we just want to find the next thing here. So we only care about this. I mean, those other outputs will come out of the architecture of the model, but we throw them out. We don't even pay any attention to them. Okay, we only look at what's coming out of this last one here. And what comes out of the softmax, remember, is a 50,000-way table of probabilities. That's what a softmax is, right? It's a whole bunch of probabilities that add up to one.
And so, let's say, for example, that you have a table starting with "aardvark" all the way to "zebra," right? And these are the probabilities. So, "it was a dark and..." — you know, just for kicks I put "stormy" as the highest-probability entry, but these numbers will add up to one. We have this table. Okay. And then what we do is choose a token from this table. We get to choose, right? There's a whole bunch of numbers in this table, and we get to choose a token. The simplest thing one can think of is to just choose the word that is the most likely, right? And we choose the word that's most likely here. And we're going to have a whole section on how to choose these things coming up. Okay, for now let's go with the simple option: we're going to just choose the one that's most likely, "stormy" at 0.6. And then we attach it to the input. So now the input has become "it was a dark and stormy." We run it through, and again we only care about the last softmax.

Okay. We do that, we get another table — and it turns out the table keeps changing, because the softmax is different each time you run it through, because the input has changed. So you get a new table, and it turns out the most likely word is "night." Okay. So "night" comes out the other end, we attach "night" here, and we keep on going, right? We can keep on going, maybe until we tell the model: okay, generate up to 100 tokens and stop. It might stop after 100, or the model may decide, in fact, that when it sees a punctuation mark, like a period or an exclamation mark or something, it's going to stop. Okay. And we have control over when it stops and how it stops.
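That predict-append-repeat loop is only a few lines. A minimal greedy-decoding sketch, assuming a hypothetical `model` callable that maps a sequence of token IDs to one probability vector per position:

```python
import torch

def generate(model, ids, max_new_tokens=100, stop_ids=()):
    """Greedy autoregressive loop: predict the next word, append it, repeat."""
    for _ in range(max_new_tokens):
        probs = model(ids)[-1]              # only the last position's softmax matters
        next_id = int(torch.argmax(probs))  # greedy: take the most likely token
        ids = torch.cat([ids, torch.tensor([next_id])])
        if next_id in stop_ids:             # e.g. the IDs for '.' or '!'
            break
    return ids
```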
But this is sort of the basic process, and you folks are all very used to it, because you've all been playing with ChatGPT and the like, right? So the basic building block is: next word prediction, feed it back to the input, next word prediction, keep on doing it. Right? You keep on doing it, and suddenly, you know, it's writing entire novels for you.

Um, yeah?

>> Does that mean that the longer the initial input is, the better a prediction you get?

>> Um, it depends on your objective. So fundamentally, you have some task you want the thing to do for you, right? And you need to give it all the information it can possibly find useful. Yeah. So the more helpful the input, the better — maybe that's how I would say it.

Uh, yeah?

>> Would this also apply to something like Google Search? Do they also do next-letter prediction, or would this just be a deeper [model]?

>> Yeah. So Google autocomplete, for example — I don't know if they actually use this kind of model under the hood or not. I just don't know. Um, these things tend to be kept tightly under wraps. So I don't know if you folks have seen, recently, over the last few months, there is a generative AI panel that opens up when you do a Google search. That panel, I suspect, uses this. But I don't know if the default Google autocomplete actually uses it or not, because it's very compute-heavy, right? So I don't know what they do.

Um, so yeah, this is what you do. Other questions on this, on the mechanics of it? Yeah?

>> For our vocabulary list, I'm assuming it's static?

>> Yeah, correct. And as you will see here, it's not really a word vocabulary.
It's a token vocabulary. But yes, it is static for a given model.

>> And so, I guess I'm assuming, for Google or any other sort of search engine, that wouldn't necessarily be static, because the model would be different. I'm sort of thinking about what happens with new words that get formed, and how it handles them if the vocabulary is static.

>> There's a very elegant solution to that coming up.

Okay. Um, all right. So now, in other words, we have learned how to do sequence generation. We already saw that we can do classification with BERT; we can do labeling with BERT-like models, which are trained on masked prediction. And for generating sequences, now we know how to do it: we just need to use a transformer causal encoder.

Okay.

Now, these kinds of models — sequence generation models trained on text sequences using next word prediction — are called autoregressive language models, or causal language models. Okay. And of course the GPT family is perhaps the most well-known example of an autoregressive, causal language model. "Autoregressive" because people who have done econometrics and some regression know the notion of autoregression: it means that you predict something, and then you use the past predictions as inputs the next time you predict, right? So this is the notion of autoregression: you predict, you feed the prediction back, get the next prediction, and keep on cycling through. Yes?
>> So when you're putting an input into GPT, for example, and it, you know, shows you the next words as they're coming — is that an indication of it doing this recalculation that you described here?

>> Correct. That's exactly what's going on. Uh, in fact, if you use the API, there is a thing called the streaming API, where it'll actually stream each token that's coming out of every pass, and you can see everything very clearly. But when you actually work with the web interface and you see the thing almost as if it's typing like a human — what I've heard from people, and I don't know if this is true, is that they can actually do it much faster. They slow it down intentionally to give you the feeling that it's actually coming from a human.

So it's like a UX trick: slow it down to make it feel as if someone is actually typing something on the other end. So when you're interacting with a chatbot, for example, sometimes you see it typing slowly — you can see the bubble and you can see the typing. It's intentionally slowed down, because, you know, it's obviously a bot otherwise, right? So there's a little bit of UX creepiness maybe going on. I don't know to what extent this is 100% true and how pervasive it is, but folks who work in the field have told me that this actually is not uncommon.

So, okay, that's what's going on here. These are language models, and of course GPT-3 is an autoregressive language model. And the reason why we have an "L" in front of the "LM" is because it was trained on lots of data with lots of parameters, right? Someone does this at some point, and it's not a small language model anymore; it's a large language model. So, yeah, that's "LLM" — nothing more momentous than that.

So, as it turns out, GPT-3 uses 96 transformer blocks — 96 blocks — and each block has 96 causal attention heads.

Okay. And you can read the GPT-3 paper; it gives you all the details of the architecture.
That is interesting, because for GPT-4 they didn't publish the architecture. After GPT-3, everything became closed. So we actually don't know what the architecture is, even though there's a lot of speculation on Twitter. But for GPT-3, we know exactly what happened, right? 96 blocks, each with 96 causal attention heads. Um, and then the data: they scraped 30 billion sentences from a whole bunch of sources — web text, Wikipedia, a bunch of book databases. And then they basically just took those 30 billion sentences and trained it on exactly next word prediction. That's it.

Now, when they trained GPT-3, I think it cost them a lot of money, because we hadn't yet figured out how to do things as efficiently as we know now. But it was still pretty amazing, and I'll talk about what is so special about GPT-3 in just a minute or two. So this is what we have here. And as you folks have seen, the notion of generating text is very powerful, right? Because we can obviously generate text, but we can also generate code, because code is just text. We can generate documentation for code, we can summarize text, we can answer questions, we can do chat — I mean, the list goes on. All the excitement we've seen around GenAI from the time ChatGPT came out is precisely because the simple idea of text in, text out is just so flexible. It's so versatile. It can handle all sorts of use cases. That's why there's so much excitement.

Um, by the way, if you're really curious, I would actually recommend seeing this video where this guy Andrej Karpathy builds GPT from scratch. Okay, it's a fantastic video. If you have even a little bit of curiosity about how these things are actually built, I would strongly recommend checking it out.
Um and 820 00:30:38,000 --> 00:30:41,519 there's also a little blog post where 821 00:30:39,440 --> 00:30:43,519 this person you know basically if you 822 00:30:41,519 --> 00:30:46,079 know numpy you can actually create GPD3 823 00:30:43,519 --> 00:30:50,240 GPD using numpy without any using any 824 00:30:46,079 --> 00:30:52,319 frameworks and things like that. So um 825 00:30:50,240 --> 00:30:53,759 I I found it super interesting and 826 00:30:52,319 --> 00:30:55,439 helpful to understand what exactly is 827 00:30:53,759 --> 00:30:57,759 going on. So if you would like to do 828 00:30:55,440 --> 00:31:00,320 this. Okay. So now we're going to talk 829 00:30:57,759 --> 00:31:03,679 about um decoding sampling strategies 830 00:31:00,319 --> 00:31:05,278 which is I said that when we produce uh 831 00:31:03,679 --> 00:31:07,759 when when when we come up with the 832 00:31:05,278 --> 00:31:10,398 softmax for that last token right we 833 00:31:07,759 --> 00:31:13,278 have 50,000 choices. What do we pick 834 00:31:10,398 --> 00:31:15,759 right as it turns out to actually get 835 00:31:13,278 --> 00:31:17,839 really good performance out of uh genai 836 00:31:15,759 --> 00:31:19,919 systems like charge you need to be quite 837 00:31:17,839 --> 00:31:21,678 thoughtful about the how to decode right 838 00:31:19,919 --> 00:31:25,278 how to actually sample from that table. 839 00:31:21,679 --> 00:31:27,600 So we'll talk about that for a bit. So, 840 00:31:25,278 --> 00:31:29,119 so the first of all definition the 841 00:31:27,599 --> 00:31:30,639 process of choosing a token from the 842 00:31:29,119 --> 00:31:32,479 probability distribution from the coming 843 00:31:30,640 --> 00:31:34,399 out of the softmax right I'm sticking 844 00:31:32,480 --> 00:31:36,640 this table right here this is the 845 00:31:34,398 --> 00:31:38,798 softmax right this process of choosing 846 00:31:36,640 --> 00:31:40,720 it is called decoding that's a technical 847 00:31:38,798 --> 00:31:42,480 term for it right we have to we get this 848 00:31:40,720 --> 00:31:44,480 table we have to decode meaning we have 849 00:31:42,480 --> 00:31:48,079 to pick something from this table okay 850 00:31:44,480 --> 00:31:51,038 that's called decoding now 851 00:31:48,079 --> 00:31:53,359 there are two sort of extreme cases of 852 00:31:51,038 --> 00:31:55,038 very highly simple ways to do 853 00:31:53,359 --> 00:31:56,558 The first thing of course is just pick 854 00:31:55,038 --> 00:31:58,798 the one just pick the word with the 855 00:31:56,558 --> 00:32:02,240 highest probability. 856 00:31:58,798 --> 00:32:03,918 This is called greedy decoding. 857 00:32:02,240 --> 00:32:06,640 Okay. 858 00:32:03,919 --> 00:32:08,240 So in this case for example if stommy is 859 00:32:06,640 --> 00:32:10,880 6 the highest probability in this whole 860 00:32:08,240 --> 00:32:14,558 table we just pick stommy. Okay. So that 861 00:32:10,880 --> 00:32:15,760 is the obvious extreme simple case. The 862 00:32:14,558 --> 00:32:18,240 other thing we can do which is also 863 00:32:15,759 --> 00:32:20,480 super simple is that because we have a 864 00:32:18,240 --> 00:32:22,319 probability table here, we can just 865 00:32:20,480 --> 00:32:24,880 reach into the table and sample a word 866 00:32:22,319 --> 00:32:27,519 out of it, right? 
In proportion to its 867 00:32:24,880 --> 00:32:28,640 probability, which means that if you if 868 00:32:27,519 --> 00:32:30,960 if you have this table and you're 869 00:32:28,640 --> 00:32:33,519 sampling from it, if you sample from it 870 00:32:30,960 --> 00:32:36,480 100 times, 60 times you probably get 871 00:32:33,519 --> 00:32:38,079 Stormy because the probability is 6. But 872 00:32:36,480 --> 00:32:39,919 some small fraction of the time you may 873 00:32:38,079 --> 00:32:42,798 get strange things like oddwark and 874 00:32:39,919 --> 00:32:44,080 zebra and so on and so forth, 875 00:32:42,798 --> 00:32:46,558 right? you're just literally doing 876 00:32:44,079 --> 00:32:48,960 random sampling. 877 00:32:46,558 --> 00:32:50,558 That's a fine way to do it too, right? 878 00:32:48,960 --> 00:32:53,200 There's nothing wrong with that. So 879 00:32:50,558 --> 00:32:56,158 these these are both options. So the key 880 00:32:53,200 --> 00:32:58,080 thing you need to remember is that the 881 00:32:56,159 --> 00:32:59,600 which one you pick and there are some 882 00:32:58,079 --> 00:33:01,519 variations on it which we'll get to in a 883 00:32:59,599 --> 00:33:03,278 moment. What you pick, which way to 884 00:33:01,519 --> 00:33:05,519 decode you pick really depends on what 885 00:33:03,278 --> 00:33:08,558 your task is, what you're trying to use 886 00:33:05,519 --> 00:33:10,880 the the system for, right? The LLM for. 887 00:33:08,558 --> 00:33:13,839 So the the the broad thing to remember 888 00:33:10,880 --> 00:33:16,559 is that if you're working on questions 889 00:33:13,839 --> 00:33:19,678 for which the factual accuracy of the 890 00:33:16,558 --> 00:33:22,000 response is really important 891 00:33:19,679 --> 00:33:24,480 and or you want the output to be 892 00:33:22,000 --> 00:33:26,159 deterministic meaning every time you ask 893 00:33:24,480 --> 00:33:28,720 it a particular question you really want 894 00:33:26,159 --> 00:33:31,120 the same answer back right you can 895 00:33:28,720 --> 00:33:33,120 imagine a customer call support agent 896 00:33:31,119 --> 00:33:34,639 where there two different customers ask 897 00:33:33,119 --> 00:33:37,678 the same question and they get different 898 00:33:34,640 --> 00:33:40,000 answers right you don't want that so you 899 00:33:37,679 --> 00:33:41,679 want determinist IC outputs. So in those 900 00:33:40,000 --> 00:33:43,759 situations, you should use greedy 901 00:33:41,679 --> 00:33:45,519 decoding is a good starting point 902 00:33:43,759 --> 00:33:48,879 because you will get you know you won't 903 00:33:45,519 --> 00:33:51,679 get any random stuff because for any 904 00:33:48,880 --> 00:33:53,120 given input sentence the softmax that 905 00:33:51,679 --> 00:33:55,600 comes out of that table is not going to 906 00:33:53,119 --> 00:33:57,119 change. It's the same table and if 907 00:33:55,599 --> 00:33:58,398 you're always picking the highest number 908 00:33:57,119 --> 00:34:03,038 in the table that's not going to change 909 00:33:58,398 --> 00:34:05,199 either. So guaranteed determinism 910 00:34:03,038 --> 00:34:07,359 and I found that for reasoning questions 911 00:34:05,200 --> 00:34:08,960 and things where you know you're asking 912 00:34:07,359 --> 00:34:10,878 questions, math questions, reasoning 913 00:34:08,960 --> 00:34:12,878 questions, logic questions, you should 914 00:34:10,878 --> 00:34:15,598 really sort of keep it as sort of greedy 915 00:34:12,878 --> 00:34:18,319 as possible in my experience. Okay. 
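To make the two extremes concrete, here's a minimal sketch. The tiny vocabulary and the probabilities are made up for illustration; this shows the mechanics only, not what any production system literally runs:

```python
import numpy as np

# Toy softmax output over a tiny vocabulary (made-up numbers).
vocab = ["aardvark", "night", "stormy", "zebra"]
probs = np.array([0.02, 0.30, 0.60, 0.08])   # sums to 1

# Greedy decoding: always take the highest-probability token.
greedy = vocab[int(np.argmax(probs))]        # -> "stormy", every single time

# Random sampling: draw a token in proportion to its probability.
rng = np.random.default_rng(0)
sampled = vocab[rng.choice(len(vocab), p=probs)]  # "stormy" about 60% of the time

print(greedy, sampled)
```

Run the sampling line many times and the empirical frequencies approach the table's probabilities, which is exactly the behavior described above.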
Now, there are other situations where random sampling is actually the better option. If you're doing creative things (write a poem, write a haiku, write a screenplay), you do want a lot of creativity, in which case randomness is your friend: you get a variety of responses, a diversity of responses, and all of that is really good. The price you pay is that you lose determinism. The outputs are going to be stochastic; ask the same question again and again and the answer will vary. But in many cases that's okay; you don't care. So that's roughly how to think about it.

The other thing I want to say is that diversity of response is also important in its own right. If you imagine a chatbot that always responds in the same stilted, robotic fashion, it starts to get annoying. You want some variation in the output, because a human would never give you the same thing back. Though I must say that when I interact with call center agents, I think they're just cutting and pasting from a text library, so it already looks kind of robotic; maybe we're already used to this. Anyway, those are some of the things to keep in mind. Yeah?

>> If you're using random sampling, do you end up with a better estimate of the uncertainty? Are the probabilities more calibrated, in the sense that the table you end up with reflects the real probabilities you'd observe for the words in your corpus?

>> The table doesn't change regardless of how you sample from it. The table is the starting point for sampling. All of decoding is about which token you're going to pull out of the table.

>> Oh, so it doesn't impact the loss function.

>> No. All of that is fixed. You literally get the table, you can then forget how you got the table, and now decoding starts.

>> Is the reason it generates a different answer given the same prompt, if we run it again and again, that they're using random sampling?

>> Correct. That's exactly why. And I'll do a demo of it very shortly, because you can actually manipulate this.

>> If you do the prediction word by word, is there a way to make it resilient to mistakes? Like, if it says 'the night was dark and hard work', that can mess up the next word, right?

>> It can totally mess it up.

>> So how does it get itself back on track?

>> It cannot. Great question, and we'll look at an example of things going off the rails in just a second. Yep?

>> Is this how Bing works, where you can slide between being more creative and more accurate?

>> Yeah, exactly. Bing has creative, balanced, precise, something like that. Under the hood they're basically manipulating some of the parameters we're going to look at in just a moment; they're manipulating them for you. But if you use the API, you can manipulate them directly.

Okay. So here's the basic thing to remember about random sampling. Our hope is that for any given sentence there is, intuitively, some set of good answers for the next word and a whole bunch of bad answers. So we want the probability mass on the good stuff. Imagine sorting the distribution from high to low probability: there's the head of the distribution, the first few words, and then there's the long tail of irrelevant words. Our hope is that the model is so good that for any given input phrase it concentrates the output probability in the softmax on just a few good words and more or less zeros out everything else. That's the ideal scenario, because then, if you do random sampling, by definition you'll pick something from the high-quality head of the distribution, and life is good.

So we want random sampling to sample from the head and not from the tail; that's the key point. And what do I mean by head and tail? Let's be very clear. Take the table we looked at, the softmax table that went from 'aardvark' to 'zebra', and sort it from high to low probability. Maybe 'stormy' has a probability of, I don't know, 0.6, and, if I remember right, 'night' had a probability of 0.3, and then there's a whole bunch of other words, all the way down to the 50,000th word, from highest to lowest probability. You can think of this as a probability distribution. The first few words are the head of the distribution, and the long run of words after them is the tail. We want our system to grab something from the head and not from the tail, because the head is the stuff that's actually relevant, useful, and good. That's really what we're trying to do here. Does it make sense? Okay.

So, to come back to this, here's the most important point to remember about this slide: while the probability of choosing any individual word in the long tail is pretty small, the probability of choosing some word from the tail is high. In this particular example, 0.6 plus 0.3 means there's a 0.9 probability the next word is either 'stormy' or 'night', but there's a 10% probability it's going to be one of the tail words, and who knows which one; it might be some random nonsense word. What that means, and this goes back to the point from before, is that if the LLM happens to sample a bad token from the tail, it won't be able to recover from its mistake; it'll just go off the rails. Which is why every word that gets generated is really important to get right: very often, it can't recover.

>> Is there a technical way to define the difference between the head and the tail?

>> No. It's just a common term people use, and the reason there's no formal definition is that it's so problem dependent. For one question the right number of words in the head is maybe 20; for a different question maybe it's 40; for a totally different model on the same question, maybe 10. Because of that variability, we just can't pin it down.

Okay. I'll show you how to do this in just a moment, but just for kicks, I went into GPT-3.5, typed 'students at the MIT Sloan School of Management are', and asked it to predict the next word. It turns out 'invited' is the most likely next word, followed by 'given', 'expected', 'required', and 'able'. These are the top five words, with probabilities around 3%, 2%: pretty small, but the remaining 50,000-odd words below them are even lower. So here the most likely word is 'invited'.

Then I went in and tried again with 'students at the MIT Sloan School of Management are invited' as my new prompt, and asked it to autocomplete from there. It came back with: invited 'to submit their original white papers to the annual MIT' something. Seems reasonable, right? Doesn't seem bad.

Okay, now let's mess it up a bit. I noticed that the word 'masters' and the word 'spending' had much lower probabilities than those top five words; I just mucked around until I found them. 'Masters' is only 0.05% and 'spending' is 0.1%, so they're clearly in the tail; they're not the most likely. So I asked: what happens if I force it to use 'masters', and then force it to use 'spending'? This is what you get. 'Students at the MIT Sloan School of Management are masters of chaos. They routinely blow past deadlines, fracture...' and then I couldn't take it anymore; I stopped it.

Then, changing just that single word, I forced 'spending', the other unlikely word: 'Students at the Sloan School of Management are spending the semester learning life skills' (so far it looks promising) 'through knitting socks'.

I'm not making this stuff up. This is GPT-3.5. So yes, it will go off the rails; you have to be super careful. And so, the way we tame random sampling to make it work for us... yes?

>> Do you think these sentences refer to something that was in the training set, like 'masters of chaos' or 'blow past deadlines'?

>> Yeah, that's the thing: it's doing very rough, approximate pattern matching over all the training data it was trained on. It doesn't mean, for example, that somewhere on the mit.edu website there was text saying MIT Sloan students were doing all this crazy stuff. It's probably more that a whole bunch of college and university websites had some content like that, or a bunch of Reddit people were posting stuff like that. It's just rough pattern matching. The thing you always have to remember with large language models is that what they're trying to give you is a response that is not implausible. There is no guarantee of correctness, no accuracy, nothing like that. It gives you a probabilistically plausible response; that's it. Now, us being Sloan, we look at stuff like this and get offended. We're imputing our values onto its generation, but it doesn't know, and it doesn't care.

In fact, when I typed in something like 'list all the awards that Professor Ramakrishnan has won', it gave me an amazing list of awards: apparently I won this and I won that. None of it is true. To which a student said, 'not yet'. So I made a note of that fine person's name. [laughter] So yeah, that's what's going on. Yeah?

>> I get the sense that maybe there's some sort of sliding window that's somehow weighting later words more strongly than earlier words, because I feel like the context of 'students at MIT' should have steered it in a certain direction even with the presence of the word 'masters'. Is there something like that happening?

>> No. Think about the training process. In training, we gave it sentence fragments and asked it to predict the next word. Clearly, the more you know about the input, and the longer the input, the more clues you have to figure out what the right next prediction is going to be. If I say 'the capital of', you'll think: I don't know, it's got to be a country, I guess, or a state, but I don't know anything more than that. But if I say 'the capital of France is': dramatic narrowing of the cone of uncertainty. That's basically what's going on. In fact, there's a very beautiful expression I've heard for what LLMs do: subtractive sculpting. What I mean by that is, you start with this big block of marble, every word chips away at the marble, and when you're done it's pretty clear there's a David inside the marble. That's sort of what's going on.

All right. So, to come back to this: what can we do? There are three ways you can tune random sampling to make it work for you. The idea behind all of them is that you have some probability distribution, and we're now going to manually focus on the head, kill everything else, and sample only from the head. Which immediately begs the question: how do you decide what the head is? That was Alina's question from before. One way is to say: I know we have 50,000 words in the vocabulary; I don't care. Each time, I'm only going to keep the top K words. K could be 10, 20, 30, 40, 50; it's very problem dependent. Say I keep the top 20 words, ignore everything else, and sample only from those. That's called top-K sampling. Here's how it works. Let's say this is your whole distribution (I just stopped the picture at 'wet' instead of going all the way to 50,000), and you decide you want K to be two. You grab the top two words, K = 2, and renormalize their probabilities so they add up to one: 0.6 and 0.2, renormalized, become 0.75 and 0.25. Now just treat that as the new softmax table you're sampling from, grab a word from it, and you're done.
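Here's a minimal sketch of that keep-K, renormalize, and sample idea, again with a toy table rather than a real 50,000-word vocabulary:

```python
import numpy as np

def top_k_sample(vocab, probs, k, rng):
    """Keep the k highest-probability tokens, renormalize, then sample."""
    top = np.argsort(probs)[-k:]         # indices of the k largest probabilities
    p = probs[top] / probs[top].sum()    # renormalize so the head sums to 1
    return vocab[rng.choice(top, p=p)]

vocab = np.array(["stormy", "night", "foggy", "wet"])
probs = np.array([0.6, 0.2, 0.1, 0.1])
rng = np.random.default_rng(0)

# With k=2, only "stormy" (0.6 -> 0.75) and "night" (0.2 -> 0.25) survive.
print(top_k_sample(vocab, probs, k=2, rng=rng))
```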
Okay, so that's top-K sampling, very commonly used. But it has a small shortcoming: it assumes that whatever K you've come up with, say 20, is right for every input sentence, that the right number of words in the head is always 20. That's obviously not a well-supported assumption; it's just an assumption. So the question becomes: can we do better? What you really want is for the words you keep to carry the bulk of the probability, as much probability as possible. You don't really care how many words are in the set, as long as together they have a lot of probability. Which brings us to something called top-p sampling, also called nucleus sampling, where instead of fixing the number of words we'll pick every time, we choose just enough words from the top so that their total probability is at least p. Sometimes that may be just two words; sometimes it may be 20 words. We don't care. And then we sample from them.

Okay, same example. Let's say you go with p = 0.9. Then 0.6 plus 0.2 is 0.8, plus 0.1 is 0.9; boom, we've hit 0.9. We stop, grab those three words, renormalize them, and sample from the result. In my opinion this is even more effective, because it adapts; it doesn't hardcode the number of words you think is important. Was there a question? Yeah?

>> What if 0.9 lands mid-word? Say 'foggy' was 0.12; would it only take 0.1 from 'foggy'?

>> What it does, when you give it 0.9, is keep adding words until it just crosses that number.

>> Yeah, I was thinking: can't you just set a threshold on the individual word probabilities, and not pick any word below it? With top-p, what if one word is 0.89 and the next is just 0.1? Then you'd pick two words.

>> Yeah, you can do that. In fact, you can always say: I'll just pick the single most likely word; you can do that. But if you say you'll only consider words whose individual probabilities are at least some threshold, then basically you're drawing a line, and the problem is you don't know how many words have crept over your threshold. To take your example: maybe you set 0.9 as the threshold, and there was a word at 0.89 that you just missed because it didn't make the threshold, and you'll think, oh no, I should have made it 0.89. There's no right answer, unfortunately. But this is exactly the kind of thinking that brought us these ways of tuning things. The foundation here is the realization that we cannot decide a priori what the right number of words is, so we have to find heuristics. In practice, people try all of these methods. In fact, you can do both at once: you can set it up so that you do top-p and top-K at the same time, basically saying grab words until you cross the probability p or you cross K, whichever comes first. Okay. So those are two methods people use heavily.
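To make top-p concrete as well, here's a sketch in the same toy setting; the same idea would apply unchanged to the full sorted 50,000-word table:

```python
import numpy as np

def top_p_sample(vocab, probs, p, rng):
    """Nucleus sampling: smallest head whose cumulative probability >= p."""
    order = np.argsort(probs)[::-1]                        # highest probability first
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    head = order[:cutoff]                                  # just enough words to reach p
    q = probs[head] / probs[head].sum()                    # renormalize the head
    return vocab[rng.choice(head, p=q)]

vocab = np.array(["stormy", "night", "foggy", "wet"])
probs = np.array([0.6, 0.2, 0.1, 0.1])
rng = np.random.default_rng(0)

# p=0.9 keeps "stormy", "night", "foggy" (0.6 + 0.2 + 0.1 = 0.9) and drops "wet".
print(top_p_sample(vocab, probs, p=0.9, rng=rng))
```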
The third method is called distribution... I'm sorry, temperature. The idea of temperature is that in top-K and top-p we have to decide on a number up front, K or p, then draw the line and look at the words that pass the threshold. Temperature is a softer way to do the same thing: a softer way to emphasize the head over the tail. Let me grab the iPad. All right.

So, the idea of temperature. Remember when we have this softmax table, 'aardvark' all the way to 'zebra', with all these probabilities: where did these probabilities come from? They came from a softmax. And what is a softmax? We had all these nodes, say 50,000 nodes in some output layer, and they were just numbers; let's call them a_1 through a_50000. We ran them through the softmax function, which computed e^(a_1), e^(a_2), all the way to e^(a_n), and divided each one by the sum of all of them to get the probabilities. So the first probability is e^(a_1) / (e^(a_1) + e^(a_2) + ... + e^(a_n)), and so on. That's how softmax works; I'm just refreshing your memory from a few weeks ago.

Now, what temperature does is introduce a new parameter, T, and divide every one of these numbers by T before exponentiating, so the probability of word i becomes e^(a_i / T) / (e^(a_1 / T) + e^(a_2 / T) + ... + e^(a_n / T)).

The effect of adding this little knob called temperature is very interesting. Assume for a second that T is a very, very small number, pretty close to zero. Since T is in the denominator, all the a_i / T values are going to become huge in magnitude: if a_i happens to be positive, it becomes really big, and if a_i is negative, it becomes a really, really large negative number. Now, in particular, the biggest of all the a values, which was already big, gets massive, which means its probability is going to dominate everything else, because you're raising e to a really big number. So if T is close to zero, the word corresponding to the biggest a will have a probability of one, or close to one. And since all the probabilities have to add up to one, everything else is going to be zero. So reducing the temperature toward zero makes the probability distribution peak at the biggest word and wipes out everything else. In practice, if you apply a very small temperature to a table like ours, 'stormy' gets something like 0.999 and everything else gets really, really small. In the limit where T goes to zero, that one probability is exactly one and everything else is exactly zero, and when one entry is one and the rest are zero, sampling just picks the big number: it becomes greedy decoding. So that's the value of having temperature as a knob.
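Here's a small sketch of that knob in action, with made-up output-layer scores:

```python
import numpy as np

def softmax_with_temperature(logits, t):
    """Softmax over logits / t: small t sharpens the head, large t flattens it."""
    z = np.asarray(logits, dtype=float) / t
    z -= z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, 0.1])   # toy pre-softmax scores a_1..a_4
for t in (0.1, 1.0, 10.0):
    print(t, np.round(softmax_with_temperature(logits, t), 3))

# t = 0.1: nearly all mass on the biggest score (approaches greedy decoding)
# t = 1.0: the ordinary softmax
# t = 10 : close to uniform; every word becomes almost equally likely
```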
Conversely, if you take the temperature T and make it bigger and bigger, as opposed to smaller and smaller, the distribution becomes flat, meaning all the words get roughly the same probability, and any one of them becomes equally likely. So: T close to zero, the biggest word gets picked; T above one, say 1.5 or 2, any word becomes likely, and it becomes truly random. That is the effect of temperature. And this knob is something you can actually tune.

All right. So, I'm at platform.openai.com; this is called the OpenAI Playground. In this playground you can put in whatever sentences you want, choose the model, and it will actually show you the softmax output. Very handy. So here are a few things I want to draw your attention to. The first is that you see temperature here; the default is one. If you make it zero, it becomes greedy decoding, but you can make it more than one if you want, and that will give you all kinds of crazy stuff, as you'll see in a second. OpenAI doesn't have support for top-K, but they do have support for top-p; you can set p right here. I'll ignore these other settings; you can read the documentation to understand those. You can also ask it to show the probabilities, so I'm going to turn that on, and I'm also going to tell it not to go nuts: just give me a few output tokens, say 30. And now I'm going to enter some sentences so we can see what's going on. Let's enter the same sentence as before: 'students at the MIT Sloan School of Management are'. I think that's what we had, right? Submit.

Okay, this is what it's filling out, and if you click on a word, you get all the probabilities. Pretty cool, right? You can see 'invited', 'given', 'expected'; these are all things we had. And so what you can do is go in and... wait, 'aching'? What is that? That's very weird. So let me check that I used the same sentence as before; it's very brittle. 'Students at the MIT Sloan School of Management are'... oh, I know what it is. Okay. Let's try that again.

Okay. So: 'invited', 3.18%. That's what we had, right? We had 3.8% before; close enough. So this is what we have. Now, if you want to force it to choose 'invited' here, you just go in there and make the temperature zero. Temperature zero means it always picks the best one: greedy decoding. So you can hit it again, and it had better give you 'invited'. See, it has given you 'invited'. That's how you manipulate it using temperature. You can also manipulate top-p; you can do all these things. People actually use this playground very heavily for debugging, when they're playing with a model and a bunch of data for a particular use case: you play with it to get a sense for what kinds of probability distributions you see, and then you can fine-tune using that knowledge. So, check it out.

Oh, and I said that if the temperature goes above one to a higher number, every word in the 50,000 becomes more or less equally likely, which means it's going to produce garbage, right? So let's actually see garbage production in action. All right, let's just nuke this. I'm going to take the temperature and max it out at two, which means that literally anything is possible. Submit.

Ladies and gentlemen, I present to you: a modern large language model.

Isn't it shocking? Because when we work with these language models and see them doing smart things, we ascribe to them some level of interesting abilities and intelligence and so on, and then you realize all I had to do was go in there and change one parameter, and it's garbage. You can see the amount of garbage it's producing just from twiddling one parameter. So in production use cases, when you're building applications on top of these large language models, you have to be very, very careful with these parameters. Pay attention. All right. So, what did I have next?

Okay, that brings us to the end of the decoding section. Now I'm going to switch gears and talk about tokenization. So far, in everything we've done, including the homeworks, we've used the standard tokenization process for taking a bunch of text and vectorizing it, the STIE process: standardize, tokenize, index, and then encode. And standardization, as I mentioned earlier, strips out punctuation, lowercases everything, sometimes removes stop words like 'a' and 'the', and also does things like stemming. But it turns out, if you've actually worked with something like GPT, you know that it hasn't stripped out punctuation; the punctuation is really good, and it uses case, uppercase and lowercase.
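You can check this yourself with OpenAI's open-source tiktoken library. This is just an illustrative check; the GPT-2 encoding is one of several you could load, and 'reldoh' anticipates the made-up word we're about to discuss:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")

# Punctuation and case survive the round trip: nothing is stripped or lowercased.
ids = enc.encode("Students at MIT Sloan are invited!")
print(ids)               # a list of integer token ids
print(enc.decode(ids))   # -> 'Students at MIT Sloan are invited!'

# A made-up word still tokenizes: it just gets split into subword pieces.
print([enc.decode([i]) for i in enc.encode("reldoh")])
```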
And in fact, even better, you can actually make up a word as part of your question, and it'll use the word consistently in the output. So, just for fun, I made up a word — I did this just yesterday or the day before. I said: here's a new word and its definition. The word is "relo"; the definition: a student who understands deep learning backwards. Please use this word in a sentence. And here is the sentence it came up with — I was a little shocked: "During the advanced neural network seminar, it became evident that Jane was a true relo, effortlessly explaining even the most complex deep learning concepts in reverse order." Okay. So it clearly knows how to use anything you might make up. It has the ability to compose things from scratch, as opposed to just looking stuff up. So where is that ability coming from, right? That's the question. And the answer is this very beautiful thing called byte pair encoding, which we'll look at next. So, all right. When we look at the standard STIE process, the disadvantages are some of the things we've discussed: we want to be able to preserve punctuation, we want to be able to preserve case, we want to be able to handle new words, and so on and so forth. So the modern models, like BERT and so on, use different tokenization schemes — they don't actually do the STIE thing. The GPT family uses byte pair encoding, BPE; BERT uses something called WordPiece. In all of these encoding schemes, the fundamental idea is to say: well, you know what, whatever language you're working with, why don't we start, first of all, with all the individual characters?
Because if you can actually work with individual characters, you can clearly compose any word that comes up, right? "relo" is just its individual letters — a handful of character tokens, if you're working at the character level. But working only with characters is not great, because it means you're giving the model no information about the world; it has to learn every word from scratch — what the word means, and so on and so forth. So it would be nice if we could give it words as well. But we don't want to give it infrequent words, because infrequent words, by definition, are not worth adding to your vocabulary — each one would just take up another embedding vector, and things like that. For infrequent words, we'll just compose them; we'll construct them on the fly, because we can always fall back to characters. Okay, so we don't want to put every word in there — we only want to put in frequent words. But to give this thing the ability to compose new words without always having to go all the way down to characters, we will also give it parts of words. These are called subwords. So the key idea is: let's come up with a way to build a vocabulary which has characters, full words that are frequent enough to be worth adding, and subwords — word fragments — that occur frequently enough to be worth adding. So, for example, take words like "standardize" and "normalize": the fragment "ize" is going to show up a lot, in many places. So you don't need to store "standardize" and "normalize" and so on as whole words; you can just have "ize" and attach it to all kinds of words, right, and make it all work. So that's the basic idea of all these tokenization schemes. And BPE is one such way to figure out how to actually construct this vocabulary from a training corpus, right?
And by the way, when I say characters, this includes not just, you know, uppercase and lowercase letters and digits — it will also include punctuation, so that all these things just become atomic units. All right. So the way BPE works is that we start with each character as a token — and I'll talk about the rest of what's on the page in just a moment; don't worry about it. We'll start with each character as a token. So let's say that your training corpus is just a single sentence: "The cat sat on the mat." Now, even though GPT does not actually do any lowercasing — uppercase "Th" is a different token from lowercase "th" — just for simplicity I'm going to standardize it here, so it becomes "the cat sat on the mat." And then I'm going to write it in this form where I basically put a comma after every word and a little underscore to show the space between the words. Okay, I'm going to write it in this format, and it'll become clear why in just a second. Now, my starting vocabulary is just all the individual letters in the training corpus. That's it; that's the starting point. And now we come to the key step: we merge the tokens that most frequently occur right next to each other. So if two characters — two tokens — occur right next to each other a lot, let's just merge them; they seem to occur together a lot, so we may as well merge them, right? And so here, for example, I've listed the frequencies of adjacent tokens. If you look at "t h": it shows up right next to each other here, and it also shows up here — so it shows up twice. Now, "h e," again, shows up here —
— and it also shows up here, so that also occurs twice. "c a," on the other hand, shows up only here and nowhere else, so it occurs once. "a t" shows up three times — in "mat," "sat," and "cat" — and so on and so forth; you get the idea. So you're just looking at pairwise adjacent tokens, and you pick the most frequent pair, which in this case happens to be "a t." Then you take "a" and "t" and you merge them, so it becomes "at." Okay. So when you do that — when you merge them — you add the new token you've just created to your vocabulary list, and then you update the corpus to reflect the merge you've just done. So the corpus is still "the cat sat on the mat," but now there is no separate "a" and "t"; there is just the combined "at" token. Are we good with this step so far? Take the most frequent pair and merge it. It's a way to compress the data. In fact, the algorithm came from someone trying to figure out a way to compress data. You know, think of it this way, right? Suppose I tell you I want you to compress a message I'm going to send you, and you look at all the past messages you've had to deal with, and it turns out certain characters occur next to each other all the time. Maybe, just for argument's sake, "abc" shows up ridiculously often in the messages. Then you'd say: you know what, if it's always showing up together, why treat it as three things? Let me just call it one thing, "abc." You send a single token called "abc" every time you need "abc" — not "a," "b," "c." That's the basic idea.
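Here's that counting-and-merging step as a toy Python sketch (my own illustration, not GPT's actual tokenizer code). For simplicity I drop the underscore space-marker and just treat the corpus as a list of words, each word being a list of tokens:

from collections import Counter

def count_pairs(corpus):
    """Count how often each adjacent token pair occurs across the corpus."""
    pairs = Counter()
    for word in corpus:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs

def merge_pair(corpus, pair):
    """Fuse every occurrence of `pair` into a single new token."""
    merged = []
    for word in corpus:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])   # e.g. 'a' + 't' -> 'at'
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged

corpus = [list(w) for w in "the cat sat on the mat".split()]
print(count_pairs(corpus).most_common(3))
# -> [(('a', 't'), 3), (('t', 'h'), 2), (('h', 'e'), 2)]
corpus = merge_pair(corpus, ("a", "t"))   # the first merge from the lecture
print(corpus)
# -> [['t','h','e'], ['c','at'], ['s','at'], ['o','n'], ['t','h','e'], ['m','at']]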
So if you come back here, that's what we have, and what we do now is run this adjacent-token calculation again on the updated corpus. You can see "t h" shows up here, and it shows up here, so you get two; "h e" shows up twice; everything else shows up once. And when many pairs show up with equal frequency, you just pick one of them randomly. So we pick "t h" and merge it, which means we add "th" to our vocabulary, and once we do that, we update the corpus. Now "th" is one thing, fused together, alongside the previously fused "at." That's the corpus after the second merge. Then we do the same thing: we find the frequencies of adjacent tokens, and it turns out "th" and "e" show up together twice while everything else shows up once, so we merge "th" and "e" to get — boom — "the." And now we have "the cat sat on the mat," with "the" as a single token. This process continues until we reach a predefined limit for our vocabulary. Now, as it turns out, when they built GPT-2 and GPT-3 — let me just see; I think I did some digging around on this. Yeah: for GPT-2 and 3 they set the vocabulary size to be roughly 50,000, so it basically kept on doing this until it hit a limit of 50,000, and then it stopped. GPT-4, on the other hand, goes all the way to a 100,000 vocabulary size. Okay, so this is BPE in action — the little training loop sketched below pulls the whole procedure together. And what happens once you finish all this is: you have your vocabulary, and you have all these merges you made — remember, here we merged "a" and "t" to get "at," "t h" became "th," and so on. When a new piece of text arrives, the tokenizer applies the merges in the exact same order.
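Putting the loop together, using the two helpers from the sketch above (and with the caveat that real BPE implementations also track the word-boundary marker and work on bytes rather than letters):

corpus = [list(w) for w in "the cat sat on the mat".split()]
vocab = {ch for word in corpus for ch in word}   # 9 starting characters
merges = []                                      # learned merge rules, in order

VOCAB_LIMIT = 12    # tiny toy limit; GPT-2/3 used ~50,000 and GPT-4 ~100,000
while len(vocab) < VOCAB_LIMIT:
    pairs = count_pairs(corpus)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent pair; ties go to the
    merges.append(best)               # first-seen pair (the lecture just picks one)
    vocab.add(best[0] + best[1])
    corpus = merge_pair(corpus, best)

print(merges)  # [('a', 't'), ('t', 'h'), ('th', 'e')] -- 'at', 'th', then 'the'
print(corpus)  # [['the'], ['c','at'], ['s','at'], ['o','n'], ['the'], ['m','at']]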
So if the new text that comes in is "the rat," it's first going to apply the "a t" → "at" merge to fuse that here, then it's going to fuse "t h" to get "th," and then it's going to fuse "th" and "e" to get "the." And the final list of tokens that goes into your model is going to be the token for "the," a token for the space, the token for "r," and the token for "at." So let's see this in action. OpenAI has its own tool, but I found this site to be really good. So let's tokenize "Hands-on deep learning." So you can see here — look at this: uppercase "H" is its own token, token number 39; the rest of "Hands" is its own token; the dash is its own token; "on" is its own token; and then " deep" with the leading space is its own token, and " learning" with the leading space is its own token. Okay, note one thing. Suppose you had typed just "deep deep learning": "deep" on its own has a different token than " deep" with a space in front. What they realized is that most words are going to show up after a space — that's much more likely — so having the space attached to the beginning of the word saves you a lot of, sort of, you know, tokens and compute and so forth, because words will in fact arrive almost all the time with a space before them. That's why they attached the space to the word itself. And note that "Deep" and "deep" are different — right, there's "Deep" and there's "deep" — so clearly it's taking case into account. Then I put an exclamation mark here — boom, that's its own token too. And so, ultimately, here's what goes in when you have a phrase like "The cat sat on the mat": you can see all the tokens it produces.
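And here's the replay of the learned merges on unseen text, plus a way to poke at a real tokenizer locally with OpenAI's tiktoken package (assuming you have it installed) instead of the website; the exact ids depend on which encoding you load:

def apply_merges(word, merges):
    """Tokenize new text by replaying the learned merges in the same order."""
    tokens = list(word)
    for pair in merges:
        tokens = merge_pair([tokens], pair)[0]
    return tokens

print(apply_merges("rat", merges))   # ['r', 'at']
print(apply_merges("the", merges))   # ['the']

# The real thing: byte-level BPE with GPT-2's learned merges.
import tiktoken
enc = tiktoken.get_encoding("gpt2")
print(enc.encode("deep"), enc.encode(" deep"))   # leading space changes the token
print(enc.encode(" deep"), enc.encode(" Deep"))  # and so does case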
So: uppercase " The" with a leading space is token 383; lowercase " the" is 262; and that's distinct from just "the" without any space — that's a different token again. So these are all the tokens. Now, um, let's try something. Let's try "Jane." So "Jane" is one token, which is great, and "and" is another token. Let's see — "Rama." Ah, darn: my name wasn't worthy enough to be its own token. Okay. But, strangely enough — and I was very surprised by this — if I put "rama" in lowercase, it is its own token. I have no idea which websites they were scraping. And if I put "Jane" here, now " J" has become its own token, with the space, and the rest has become a different token. So tokenization is a very interesting thing, and it works in very interesting ways — but that's the basic idea of what's going on under the hood. I would encourage you to check out your own names, to see whether they've actually been tokenized. So, all right, I'm done. Thanks, folks. I'll see you on Wednesday.