We'll continue our journey with natural language processing. We looked at the bag-of-words model, one-hot encodings, and so forth. And today we will talk about embeddings, or to be more precise, stand-alone embeddings, and that will tee us up for something called contextual embeddings, which is where the transformer really comes into play.

All right, so let's get going. So far we have encoded input text as one-hot vectors. Just to refresh your memories from Monday: if this is the phrase that's coming into the system, we run it through the STIE process. First we standardize, then we split on white space to get individual words, then we assign words to integers, and then we take each integer and essentially create a one-hot version of that integer. And when we do that, basically we have a vocabulary. Right?
And in this example, we just have 100 words, and you will note that this vocabulary, which you arrive at once you standardize and tokenize, has words like "the", because we decided not to remove stop words like "a" and "the", and so on. So just to be clear about standardization: while it has historically been all about stripping punctuation, lowercasing everything, removing stop words, and stemming, if you look at modern practice, people essentially strip punctuation (maybe) and lowercase, and they often don't even bother to do stemming and things like that, or to remove stop words. Okay? And that's why in Keras, the default standardization is only lowercasing and punctuation stripping.

This detail may actually be handy for homework two, perhaps. That's why I'm pointing it out.

Okay. So that's what we have. And so for each word that's coming in, we have a one-hot vector. Right?
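A minimal sketch of that standardize-tokenize-index-encode pipeline in plain Python. The function names and the two-sentence corpus are made up for illustration; Keras's `TextVectorization` layer performs the same default standardization via `standardize="lower_and_strip_punctuation"`:

```python
import re

def standardize(text):
    # Keras-style default standardization: lowercase, then strip punctuation
    return re.sub(r"[^\w\s]", "", text.lower())

def tokenize(text):
    # split on white space to get individual words
    return standardize(text).split()

def build_vocab(corpus):
    # assign each distinct word an integer index, in order of first appearance
    vocab = {}
    for sentence in corpus:
        for word in tokenize(sentence):
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def one_hot(word, vocab):
    # a vector as long as the vocabulary, with a single 1 at the word's index
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

corpus = ["The movie was great!", "The film was awesome."]
vocab = build_vocab(corpus)
print(vocab)                    # {'the': 0, 'movie': 1, 'was': 2, 'great': 3, 'film': 4, 'awesome': 5}
print(one_hot("movie", vocab))  # [0, 1, 0, 0, 0, 0]
```

Note how small the vocabulary is here; in the lecture's examples it runs to 100,000 or 500,000 entries, which is exactly the problem discussed below.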
But the one-hot vector is just as long as the vocabulary. And then we can either quote-unquote add them up and get a count encoding, or we can just do an OR, looking for any ones in a column, and get a multi-hot encoding.

So that's what we saw last class. But this scheme, while it's quite effective for simple kinds of problems, has some very serious shortcomings. And so we will delve into those shortcomings, and then step back and ask: all right, is there a solution to fix these things?

The problem with one-hot vectors. There are lots of problems. Any volunteers?

"Similar words are understood differently."

Absolutely. What he's pointing out is that if you have two words which are synonyms, let's say "great" and "awesome", we would hope that the way we represent them using these vectors would have some connection to what the words actually mean.
In particular, we would hope that if they mean similar things, they are close by, and if they mean very different things, they are far away. Things like that: common-sense expectations of what you want the vectors to have. Clearly one-hot vectors won't have that, and we'll look into it in detail in a bit. But before we do that, there is also a computational issue, which we covered last class: if the vocabulary is really long, then each token, each word that's coming in, will have a one-hot vector that's as long as the vocabulary. Right? If you have 500,000 words in your vocabulary, every little word that comes in has a vector which is 500,000 long. Which feels like a gross waste.

Now, you can mitigate it somewhat by choosing only the most frequent words, but these long vectors still increase the number of weights the model has to learn, and increase the need for compute and data, and so on. Okay?
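The count and multi-hot encodings mentioned a moment ago can be sketched like this (the four-word vocabulary and the token list are made up for illustration):

```python
def count_encode(words, vocab):
    # "add up" the one-hot vectors: each entry is how often the word appears
    vec = [0] * len(vocab)
    for w in words:
        vec[vocab[w]] += 1
    return vec

def multi_hot_encode(words, vocab):
    # OR the one-hot vectors: each entry is 1 if the word appears at all
    vec = [0] * len(vocab)
    for w in words:
        vec[vocab[w]] = 1
    return vec

vocab = {"the": 0, "movie": 1, "was": 2, "great": 3}
tokens = ["the", "movie", "was", "the", "great"]
print(count_encode(tokens, vocab))      # [2, 1, 1, 1]
print(multi_hot_encode(tokens, vocab))  # [1, 1, 1, 1]
```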
Now let's say that we have created a vocabulary from a training corpus. Okay? We have a bunch of strings, text that's coming in; we have done the standardization and tokenization; we have created a vocabulary from it. And let's say we get the words "movie" and "film".

So the question is, and a sharp observer gets to this immediately: if you look at the words "movie" and "film", are these two vectors close to each other or not? Okay? So if you have two vectors, how would we measure closeness? What's the simplest way to think about closeness?

It's not a trick question.

Distance. Yeah, exactly. So if they are really close distance-wise, we would hope, right, that similar words should be close by. So let's just imagine the vector for "movie". Let's say your vocabulary is, I don't know, 100,000 long.
So your vector is 100,000 long, and the position for "movie" has a one, and everything else is zero. Right? And this second vector is for "film", and maybe this is the position for "film", so that has a one and everything else is zero. Okay? What's the distance between these two vectors?

You just use the Euclidean distance. The Euclidean distance, you will recall: you literally take the difference of these values, square them, add them up, and take the square root. Which means that all the zeros will obviously give you zero. This position is going to give you a one; this comparison is going to give you another one. 1 + 1 = 2; square root of 2. That's the answer. So the distance between these two vectors is root 2.

Now, so the distance between them is root 2. What about the one-hot encoded vectors for "good" and "bad"? Clearly "good" and "bad" mean opposite things. What is the distance between the "good" and "bad" one-hot vectors?
Still root 2. Because the zeros don't contribute anything, and the ones are not in the same place. So when you subtract, you'll get a one and a one; add them up, two; root 2. In fact, take any two words in your vocabulary: what's the distance between the two one-hot vectors for those words? It's root 2.

So if any two words are the same distance apart, does this even have a notion of distance? It doesn't. There's no notion of distance in one-hot vectors. They have no connection to the actual meanings of these words; they're just a way of representing them. Okay?

So that is the big problem with one-hot vectors. The distance between them is the same regardless of the words; it's got nothing to do with the meaning of the words. And this is a huge problem, which we'll have to solve.
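A quick check of that claim, assuming a hypothetical 100,000-word vocabulary (the word positions here are invented for illustration):

```python
import math

def one_hot(index, size):
    vec = [0] * size
    vec[index] = 1
    return vec

def euclidean(u, v):
    # difference of values, squared, summed, square-rooted
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

V = 100_000                                    # vocabulary size
movie, film = one_hot(7, V), one_hot(42, V)    # hypothetical word positions
good, bad = one_hot(13, V), one_hot(99_999, V)

# any pair of distinct one-hot vectors is exactly sqrt(2) apart
print(euclidean(movie, film))  # 1.4142135623730951
print(euclidean(good, bad))    # 1.4142135623730951
```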
So to summarize where we are: if the vocabulary is very long, each token will have a one-hot vector that's as long as the vocabulary. That's a computational and training problem. And then there is a deeper problem: there's no connection between the meaning of a word and its vector.

So wouldn't it be nice if vectors that represent synonyms, movie and film, or related words, apple and banana, were close to each other? And it would be nice if the vectors for things that mean very different things were far from each other.

So let's take a look at a particular example. Okay? Let's assume that we have magically been given vectors that actually carry some notion of meaning. And for convenience, let's say that we take just the first two dimensions of these vectors, so that we can do a scatter plot of them.
So we plot the first dimension of these vectors against the second dimension, and in this little cartoon we have plotted the words for factory, home, and building, and they all happen to be clustered here. Clearly this representation is capturing some notion of what the thing is. Right? Some sort of building. And here we have bicycle, truck, and car: clearly the transportation cluster. And here we have a fruit cluster, and here we have some sports-balls cluster. Okay?

Because it's a cartoon, things are all nicely and cleanly separated. Okay? So now if you take the word apple, where do you think it's going to go? Into A, B, C, or D? C, right? It makes eminent sense that it's going to go to C.

Good.
Now, wouldn't it be nice if, more generally, the geometric relationships between word vectors represented the semantic relationships between the underlying objects that the words stand for? Okay? And I say relationship and not distance, because it's not just distance; it's actually more than that. Okay?

So let's take another one. Here we have the vectors plotted for puppy and dog, and this is calf. And let's say that we need to figure out where the embedding, the word vector, for cow should appear. Where is most logical? Should it be A? Should it be B? Should it be C?

C? Okay, what's the logic? Any volunteers? Just put your hand up. Yes?

"A calf is a baby cow, whereas the cow is an adult. So it should be closer to the dog, which is the adult version of the puppy."
Got it. So you're basically saying: go from the puppy version to the grown-up version. Right? That's sort of what you're getting at, and that's a totally valid way to think about it. But there are a couple of ways to think about this, and that is one of the two. If this is bringing you bad memories of GMAT and GRE analogy questions, I apologize.

But: a puppy is to a dog as a calf is to a cow. Which is exactly what Jay is pointing out: you can go from the baby version to the full-grown version if you go in the horizontal direction. Okay? But maybe if you go in the vertical direction, you're essentially moving across the young animals of different species. You're still moving along the same dimension, the same age level; that is the band here.
So this band is the grown-up version of a whole bunch of animals, and that one is the puppy version of a whole bunch of animals. So the vertical dimension measures some sort of variation across animal species at roughly the same maturity stage.

Okay? So these directions also matter. It's not just the distance. That's what I mean when I say semantic relationship and geometric relationship: relationship is distance and direction. Both have to be involved.

Now, word embeddings, as we will learn soon, are word vectors designed to achieve exactly these requirements. Okay? They will achieve these requirements, and they will fix both these problems very elegantly.

Okay? So let's say that we have word embeddings that solve both these problems. Are we basically done? Can we declare victory?
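Before answering, the direction idea above, puppy is to dog as calf is to cow, can be sketched with toy 2-D vectors. All the numbers here are invented for illustration; real learned embeddings only approximately satisfy this kind of arithmetic:

```python
# Toy 2-D "embeddings": the x-axis encodes maturity (baby -> adult),
# the y-axis separates species. Numbers are made up.
puppy = (1.0, 3.0)
dog   = (4.0, 3.0)
calf  = (1.0, 1.0)

def sub(u, v):
    return (u[0] - v[0], u[1] - v[1])

def add(u, v):
    return (u[0] + v[0], u[1] + v[1])

# puppy is to dog as calf is to cow: start at calf and
# follow the same horizontal "grow up" direction
grow_up = sub(dog, puppy)      # (3.0, 0.0)
cow_guess = add(calf, grow_up)
print(cow_guess)               # (4.0, 1.0): to the right of calf, same species row
```

This is the same vector arithmetic behind the well-known word2vec example, king minus man plus woman is approximately queen.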
Or is there anything that even word vectors which actually capture the meaning of the underlying thing don't fully address? Is there any remaining problem we have to worry about? Yes?

Context. Context? Yes. Context, right. Sure, every word has a meaning, but we know that some words have multiple meanings. And that meaning is really only inferable, you can only make sense of it, if you know the surrounding context. Right? If you see the word bank, b-a-n-k: sure, it could be a financial institution. It could be the side of a river. It could be the act of a plane turning in one direction. It could be someone hoping for something, banking on something. The list of possible meanings of the word bank is basically enormous. And you cannot figure out what it means unless you know what else is going on around that word. So context is super, super important.
And these embeddings, word embeddings, just tell you what the meaning of the word is. So what's going to happen when you have a word which could mean many different things is that the embedding gives you some average version of that meaning. And that average version is not going to be very good. Now, there are some words which only mean one thing, and you'll be okay there. But for the rest, it's going to be tough.

So what we need is some way, we need to find a way, to make word embeddings contextual. Meaning we need to somehow consider the other words in the sentence. Okay? So if we can do that, then we will be in great shape; we can solve all sorts of NLP problems.

Now, as it turns out, there are contextual word embeddings: word vectors that achieve both these requirements. They capture the semantic-geometric relationship I talked about, and they are contextual. Okay? They're really fantastic.
And the key to calculating contextual word embeddings is the transformer. That is why transformers are justifiably famous.

So what's the lay of the land here? Today we are going to look at how to calculate stand-alone, or non-contextual, word embeddings. And then, starting Monday, we will take these stand-alone embeddings and make them contextual using transformers. Okay? That is the plan. Any questions so far?

So now let's think about how we can learn these stand-alone embeddings from data. Now, the naive way to think about it would be: why don't we manually collect a whole bunch of synonyms, antonyms, related words, et cetera, and try to assign embedding vectors to them that satisfy our requirements? As you can imagine, this is going to be a long, painful, and never quite complete exercise. Okay?
And given that we are machine learning people, the question is: can we do it in a better way? Can we just learn it from the data, without doing any of this manual stuff? Okay?

And the key insight that makes it all happen is this humble-looking line on the screen by John Firth, who was a linguist: "You shall know a word by the company it keeps." I wish I could deliver this in a British accent. Know a word by the company it keeps. Okay? It's a very profound statement. And here is the key intuition behind it.

Let's say that you have a sentence like "The acting in the ___ was superb." Okay? What are some words that you folks think are likely to appear in the blank? Shout it out. Play. Play. Movie. Show. Musical. Right? Those are all great candidates: the acting in the movie, the film, the musical, and so on. Okay?
Now, let's say that I ask you: what are some words that are unlikely to appear in the sentence? I think we could all be here for days listing them out. I just listed a few; I love the word tensor, so I have to find a way to use it somewhere. So, all right: "The acting in the banana was superb." Clearly nonsensical, right?

So what we are seeing here is that if certain words are interchangeable in a sentence, meaning you can swap one for the other and the sentence still makes sense, that is, if they appear in the same context very often, then they are probably related.

We don't even have to know what the word is. All we have to know is that you can fill in the blank of a particular sentence with this word or with that word and it actually makes sense. Then we say: oh, wow, okay, these words are related then.
Right? You're inferring their relatedness not by looking at them directly, but by seeing where they live. It's a very, very clever idea, and it'll slowly sink in. Okay?

So that's the first observation: if words appear in the same context very often, they are likely to be related. More generally, related words appear in related contexts.

So all we have to do is figure out a way to calculate context, and then use that to understand what the words are that happen to be living in this context. And there are some beautiful ways to do these things; we'll really dive deep into one such way.

So what we're going to do in this approach is this: since words that appear in related contexts mean related, similar things, first of all you have to define what you mean by context. And there are many ways to define context.
We're going to go with a very 514 00:18:23,359 --> 00:18:26,959 simple definition, 515 00:18:24,759 --> 00:18:29,079 which is that if words happen to appear 516 00:18:26,960 --> 00:18:31,159 in the same sentence a lot, 517 00:18:29,079 --> 00:18:32,480 then we think that, okay, 518 00:18:31,159 --> 00:18:34,440 they are in the same context. So, 519 00:18:32,480 --> 00:18:35,120 context here means sentence. 520 00:18:34,440 --> 00:18:38,200 Okay? 521 00:18:35,119 --> 00:18:40,399 So, what we can do is we can actually 522 00:18:38,200 --> 00:18:41,919 take a whole bunch of text, maybe all of 523 00:18:40,400 --> 00:18:43,519 Wikipedia, 524 00:18:41,919 --> 00:18:46,040 and then break it up into sentences. 525 00:18:43,519 --> 00:18:47,279 We'll have billions of sentences, right? 526 00:18:46,039 --> 00:18:48,879 And then for all these billions of 527 00:18:47,279 --> 00:18:51,639 sentences, we can literally go and count, 528 00:18:48,880 --> 00:18:52,880 for every pair of words, how many times 529 00:18:51,640 --> 00:18:55,280 are both these words showing up in the 530 00:18:52,880 --> 00:18:57,880 same sentence? 531 00:18:55,279 --> 00:18:59,359 Okay? And we call this co-occurrence, 532 00:18:57,880 --> 00:19:00,640 right? The words are co-occurring in the 533 00:18:59,359 --> 00:19:02,000 sentence. 534 00:19:00,640 --> 00:19:02,880 And it doesn't have to be next to each 535 00:19:02,000 --> 00:19:04,759 other, 536 00:19:02,880 --> 00:19:07,280 right? We know that in complicated 537 00:19:04,759 --> 00:19:09,079 sentences, a word at the very end of the 538 00:19:07,279 --> 00:19:10,799 sentence could have 539 00:19:09,079 --> 00:19:11,759 its meaning altered by 540 00:19:10,799 --> 00:19:12,678 a word that happened at the very 541 00:19:11,759 --> 00:19:14,240 beginning of the sentence, and it could 542 00:19:12,679 --> 00:19:16,240 be a really long sentence. 
543 00:19:14,240 --> 00:19:18,079 So, we take the whole sentence and say, 544 00:19:16,240 --> 00:19:19,599 are two words co-occurring in the 545 00:19:18,079 --> 00:19:20,720 sentence, yes or no? And we just count 546 00:19:19,599 --> 00:19:23,799 them up. 547 00:19:20,720 --> 00:19:23,799 And when we do that, 548 00:19:24,119 --> 00:19:27,678 we will get 549 00:19:26,279 --> 00:19:29,519 something like this. 550 00:19:27,679 --> 00:19:30,880 551 00:19:29,519 --> 00:19:32,359 This just captures what I've been 552 00:19:30,880 --> 00:19:34,280 talking about. Identify all the words 553 00:19:32,359 --> 00:19:35,799 that occur, let's say, in Wikipedia. And 554 00:19:34,279 --> 00:19:37,039 then for every sentence, you look at 555 00:19:35,799 --> 00:19:38,759 every word pair and count the number of 556 00:19:37,039 --> 00:19:41,480 times they appear in the same sentence 557 00:19:38,759 --> 00:19:43,839 across all those sentences. Okay? 558 00:19:41,480 --> 00:19:46,440 This is a word-word co-occurrence 559 00:19:43,839 --> 00:19:47,519 matrix. So, for example, 560 00:19:46,440 --> 00:19:48,679 let's assume that you took all of 561 00:19:47,519 --> 00:19:49,918 Wikipedia, looked at all the 562 00:19:48,679 --> 00:19:51,960 distinct words, and you found there are 563 00:19:49,919 --> 00:19:54,360 500,000 words. 564 00:19:51,960 --> 00:19:56,880 Okay? So, there are 500,000 words 565 00:19:54,359 --> 00:20:00,240 in the columns and 566 00:19:56,880 --> 00:20:02,640 500,000 words on the rows. 567 00:20:00,240 --> 00:20:05,599 And then you go 568 00:20:02,640 --> 00:20:08,000 and each cell of this table 569 00:20:05,599 --> 00:20:10,519 has a number that you calculate, which is 570 00:20:08,000 --> 00:20:12,039 the number of times the word in the row 571 00:20:10,519 --> 00:20:14,319 and the word in the column happen to 572 00:20:12,039 --> 00:20:15,680 show up in the same sentence. 
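The counting procedure just described can be sketched in a few lines of Python. This is a toy illustration: the three-sentence corpus below is made up, standing in for something like Wikipedia, and co-occurrence is counted as a yes/no per sentence for each word pair.

```python
from collections import Counter
from itertools import combinations

# A made-up corpus standing in for "all of Wikipedia, broken into sentences".
sentences = [
    "deep learning is fun",
    "deep learning needs data",
    "the acting in the movie was superb",
]

# Vocabulary: every distinct word across all sentences, mapped to an integer.
vocab = sorted({w for s in sentences for w in s.split()})
index = {w: i for i, w in enumerate(vocab)}

# For every unordered word pair, count how many sentences contain both.
cooccur = Counter()
for s in sentences:
    words = sorted(set(s.split()))          # co-occurrence is a yes/no per sentence
    for w1, w2 in combinations(words, 2):
        cooccur[(index[w1], index[w2])] += 1

print(cooccur[(index["deep"], index["learning"])])  # 2: sentences 1 and 2
```

In a real setting you would stream billions of sentences through the same loop; with a 500,000-word vocabulary the resulting matrix is enormous and mostly zeros, so in practice it would be stored sparsely.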
That's it. 573 00:20:14,319 --> 00:20:18,119 So, for instance, 574 00:20:15,680 --> 00:20:20,360 if you look at deep and learning, right? 575 00:20:18,119 --> 00:20:22,519 The word deep and the word learning, 576 00:20:20,359 --> 00:20:24,719 maybe 577 00:20:22,519 --> 00:20:28,319 those two words occurred in the same 578 00:20:24,720 --> 00:20:31,400 sentence maybe 3,025 times, 579 00:20:28,319 --> 00:20:35,200 3,025 sentences across all of Wikipedia. 580 00:20:31,400 --> 00:20:35,200 You put 3,025 right in that cell. 581 00:20:35,240 --> 00:20:37,680 Okay? 582 00:20:36,000 --> 00:20:38,880 Many words are unlikely to appear in the 583 00:20:37,680 --> 00:20:40,360 same sentence. 584 00:20:38,880 --> 00:20:42,720 So, much of this matrix is going to be 585 00:20:40,359 --> 00:20:42,719 zero. 586 00:20:44,319 --> 00:20:47,119 But, we 587 00:20:45,359 --> 00:20:49,639 fundamentally form this co-occurrence 588 00:20:47,119 --> 00:20:49,639 matrix. 589 00:20:49,960 --> 00:20:55,640 This matrix essentially embodies all the 590 00:20:54,119 --> 00:20:58,359 context information that we can work 591 00:20:55,640 --> 00:20:59,840 with in a very compact, 592 00:20:58,359 --> 00:21:02,240 sort of 593 00:20:59,839 --> 00:21:02,240 elegant form. 594 00:21:03,279 --> 00:21:06,039 And using this, we're going to try to 595 00:21:04,640 --> 00:21:07,400 figure out 596 00:21:06,039 --> 00:21:08,440 what the word embeddings actually are 597 00:21:07,400 --> 00:21:09,519 going to be. 598 00:21:08,440 --> 00:21:11,720 Okay? 599 00:21:09,519 --> 00:21:13,480 And 600 00:21:11,720 --> 00:21:15,440 so, by the way, the approach I'm 601 00:21:13,480 --> 00:21:19,240 describing here to calculate standalone 602 00:21:15,440 --> 00:21:19,240 embeddings is called GloVe. 603 00:21:20,200 --> 00:21:24,799 It's called GloVe, and 604 00:21:23,039 --> 00:21:27,519 standalone embeddings first sort of came 605 00:21:24,799 --> 00:21:29,720 onto the NLP deep learning scene. 
606 00:21:27,519 --> 00:21:32,519 There were two sort of ways of doing it. 607 00:21:29,720 --> 00:21:34,400 One was called word2vec. 608 00:21:32,519 --> 00:21:35,879 The other one is GloVe. 609 00:21:34,400 --> 00:21:36,960 And they're both comparable, right? They 610 00:21:35,880 --> 00:21:38,520 use slightly different mechanisms of 611 00:21:36,960 --> 00:21:40,559 doing this. 612 00:21:38,519 --> 00:21:42,279 We went with GloVe for this lecture 613 00:21:40,559 --> 00:21:44,359 because I think it's actually a little 614 00:21:42,279 --> 00:21:45,759 easier to understand and equally 615 00:21:44,359 --> 00:21:47,199 effective. 616 00:21:45,759 --> 00:21:49,480 Okay? 617 00:21:47,200 --> 00:21:50,880 So, this is what we have. And so, what 618 00:21:49,480 --> 00:21:52,880 we want to do is 619 00:21:50,880 --> 00:21:54,120 we want to learn these embedding vectors 620 00:21:52,880 --> 00:21:56,200 that can be used to essentially 621 00:21:54,119 --> 00:21:59,319 approximate this matrix. 622 00:21:56,200 --> 00:22:01,720 Right? If you can find vectors that can 623 00:21:59,319 --> 00:22:03,279 actually approximate this matrix, then 624 00:22:01,720 --> 00:22:04,519 hopefully those vectors do in fact 625 00:22:03,279 --> 00:22:06,519 capture some notion of what the words 626 00:22:04,519 --> 00:22:07,440 actually mean. Okay? So, let me put it 627 00:22:06,519 --> 00:22:10,119 differently. 628 00:22:07,440 --> 00:22:12,759 You come to me with this matrix. Okay? 629 00:22:10,119 --> 00:22:14,359 And you say, okay, Rama, do you have 630 00:22:12,759 --> 00:22:15,679 embeddings for me? 631 00:22:14,359 --> 00:22:17,319 And I'm like, yeah, I reach into my bag 632 00:22:15,679 --> 00:22:19,160 and I'm like, okay, for every one of those 633 00:22:17,319 --> 00:22:20,119 500,000 words, I have an embedding. 634 00:22:19,160 --> 00:22:21,440 Right? 
635 00:22:20,119 --> 00:22:23,039 Let's ignore for a moment how I actually 636 00:22:21,440 --> 00:22:24,000 calculated the embeddings. I have the 637 00:22:23,039 --> 00:22:25,839 embeddings. 638 00:22:24,000 --> 00:22:28,400 How will you know if my embeddings are 639 00:22:25,839 --> 00:22:28,399 any good? 640 00:22:28,720 --> 00:22:31,559 How will you know? 641 00:22:30,279 --> 00:22:34,440 How can you actually assess if those 642 00:22:31,559 --> 00:22:34,440 embeddings are any good? 643 00:22:34,559 --> 00:22:37,440 Well, you can certainly say, okay, give 644 00:22:35,799 --> 00:22:39,240 me the embeddings for movie and film, and 645 00:22:37,440 --> 00:22:40,440 you can see if they're really close by. 646 00:22:39,240 --> 00:22:42,160 You can look at the 647 00:22:40,440 --> 00:22:43,920 embeddings for movie and tensor, and 648 00:22:42,160 --> 00:22:46,600 hopefully they're far away. 649 00:22:43,920 --> 00:22:47,360 But, you'll never get done. 650 00:22:46,599 --> 00:22:49,199 Right? 651 00:22:47,359 --> 00:22:51,159 How can you systematically evaluate 652 00:22:49,200 --> 00:22:53,720 this? 653 00:22:51,160 --> 00:22:55,840 Well, what 654 00:22:53,720 --> 00:22:57,400 if I come to you and say, not only 655 00:22:55,839 --> 00:22:59,079 am I going to give you an embedding, 656 00:22:57,400 --> 00:23:00,480 here is a procedure 657 00:22:59,079 --> 00:23:02,279 which you can use with these embeddings 658 00:23:00,480 --> 00:23:04,400 to validate how good they are. 659 00:23:02,279 --> 00:23:07,160 What you can do is you 660 00:23:04,400 --> 00:23:09,960 can use the embeddings to recreate the 661 00:23:07,160 --> 00:23:11,600 co-occurrence matrix. 
662 00:23:09,960 --> 00:23:14,400 And if the recreated co-occurrence 663 00:23:11,599 --> 00:23:15,319 matrix actually matches the real matrix 664 00:23:14,400 --> 00:23:17,519 well, these embeddings probably are 665 00:23:15,319 --> 00:23:18,559 pretty good. 666 00:23:17,519 --> 00:23:20,079 Remember, the whole point of the 667 00:23:18,559 --> 00:23:21,720 co-occurrence matrix is to capture this context 668 00:23:20,079 --> 00:23:23,960 information. So, if my embeddings can 669 00:23:21,720 --> 00:23:25,640 actually recreate it, reconstruct it 670 00:23:23,960 --> 00:23:27,400 pretty closely, right? It'll never be 671 00:23:25,640 --> 00:23:28,200 perfect. But if it comes pretty close, 672 00:23:27,400 --> 00:23:29,759 then we're like, wow, okay, these 673 00:23:28,200 --> 00:23:31,400 embeddings do mean something. 674 00:23:29,759 --> 00:23:33,839 So, if it turns out, for instance, that 675 00:23:31,400 --> 00:23:36,600 the matrix has 676 00:23:33,839 --> 00:23:40,159 a value of 3,000 for deep and learning 677 00:23:36,599 --> 00:23:40,959 and a value of, 678 00:23:40,160 --> 00:23:43,519 say, 679 00:23:40,960 --> 00:23:45,200 50 for extreme and learning, 680 00:23:43,519 --> 00:23:48,480 and our embedding comes in and says 681 00:23:45,200 --> 00:23:49,360 3,002 for the first one and 48 for the 682 00:23:48,480 --> 00:23:51,440 second one, we'll be 683 00:23:49,359 --> 00:23:53,279 pretty impressed. 684 00:23:51,440 --> 00:23:54,320 Whoa, it couldn't be that close 685 00:23:53,279 --> 00:23:55,480 unless it was actually capturing 686 00:23:54,319 --> 00:23:57,519 something. 687 00:23:55,480 --> 00:23:59,000 Okay? So, that's what we're going to do. 688 00:23:57,519 --> 00:24:00,240 And so, we're going to take this logic 689 00:23:59,000 --> 00:24:03,200 of saying: 690 00:24:00,240 --> 00:24:05,960 find embeddings that can approximate 691 00:24:03,200 --> 00:24:07,880 what we actually see in Wikipedia. 
692 00:24:05,960 --> 00:24:09,240 Right? And we're going to use that idea 693 00:24:07,880 --> 00:24:10,440 to actually build the model and learn 694 00:24:09,240 --> 00:24:12,559 the embeddings 695 00:24:10,440 --> 00:24:14,759 using nothing more than basically linear 696 00:24:12,559 --> 00:24:14,759 regression. 697 00:24:16,480 --> 00:24:18,839 And here you were thinking that linear 698 00:24:17,759 --> 00:24:22,160 regression is useless now that you've 699 00:24:18,839 --> 00:24:22,159 graduated machine learning, right? 700 00:24:22,319 --> 00:24:24,759 701 00:24:23,240 --> 00:24:26,599 So, we can think of the embedding 702 00:24:24,759 --> 00:24:28,879 vectors that we want to figure out as 703 00:24:26,599 --> 00:24:31,319 just the weights in a model, 704 00:24:28,880 --> 00:24:33,120 in a linear regression. 705 00:24:31,319 --> 00:24:35,200 We can think of the co-occurrence matrix 706 00:24:33,119 --> 00:24:37,759 as just the data we're going to use in 707 00:24:35,200 --> 00:24:39,799 this model to estimate these weights. 708 00:24:37,759 --> 00:24:42,200 And the model we're going to use 709 00:24:39,799 --> 00:24:43,799 is something like this. 710 00:24:42,200 --> 00:24:45,080 So, first I have to inflict some 711 00:24:43,799 --> 00:24:46,559 notation on you. 712 00:24:45,079 --> 00:24:50,000 We will denote the co-occurrence count 713 00:24:46,559 --> 00:24:51,759 of, say, words i and j as Xij. 714 00:24:50,000 --> 00:24:53,079 Xij is just data. 715 00:24:51,759 --> 00:24:55,079 It's just data. Okay? It's not a 716 00:24:53,079 --> 00:24:55,639 variable, it's data. 717 00:24:55,079 --> 00:24:57,399 718 00:24:55,640 --> 00:24:59,160 And then we will denote an embedding 719 00:24:57,400 --> 00:25:01,080 vector for each word. Remember, we need 720 00:24:59,160 --> 00:25:03,840 to have a vector for each word. So, we 721 00:25:01,079 --> 00:25:06,199 call it Wi, right? Wi is the embedding 722 00:25:03,839 --> 00:25:09,119 vector for word i. 
723 00:25:06,200 --> 00:25:10,559 And we will also assume that 724 00:25:09,119 --> 00:25:11,639 some words are just inherently very 725 00:25:10,559 --> 00:25:13,440 popular. They're going to show up all 726 00:25:11,640 --> 00:25:15,920 the time like the word the. 727 00:25:13,440 --> 00:25:18,320 Okay? So, we'll assume that every word 728 00:25:15,920 --> 00:25:20,160 has some natural frequency of occurring 729 00:25:18,319 --> 00:25:22,919 like movie versus flick. 730 00:25:20,160 --> 00:25:24,480 The versus tensor. So, we want the 731 00:25:22,920 --> 00:25:27,279 vectors to capture the co-occurrence 732 00:25:24,480 --> 00:25:28,880 patterns independent of how naturally 733 00:25:27,279 --> 00:25:29,639 frequent the words are. 734 00:25:28,880 --> 00:25:30,920 Okay? 735 00:25:29,640 --> 00:25:33,600 And so, to capture this natural 736 00:25:30,920 --> 00:25:34,600 frequency, we will assign a bias or Bi 737 00:25:33,599 --> 00:25:36,359 to each word that we're going to 738 00:25:34,599 --> 00:25:39,319 calculate. And all this will become 739 00:25:36,359 --> 00:25:41,000 clear in just a moment. Okay? So 740 00:25:39,319 --> 00:25:42,480 with this setup, basically what we're 741 00:25:41,000 --> 00:25:44,679 saying is something very simple. We're 742 00:25:42,480 --> 00:25:45,960 saying, look, this co-occurrence matrix 743 00:25:44,679 --> 00:25:48,000 that we have 744 00:25:45,960 --> 00:25:51,240 that we're able to compute, it came 745 00:25:48,000 --> 00:25:53,400 about because in in truth, in reality, 746 00:25:51,240 --> 00:25:55,559 in nature, there are these embedding 747 00:25:53,400 --> 00:25:58,120 vectors for every word. 
748 00:25:55,559 --> 00:26:00,240 There are these biases Bi for every word, 749 00:25:58,119 --> 00:26:03,000 and every co-occurrence number that you 750 00:26:00,240 --> 00:26:05,079 see just came about because, you know, 751 00:26:03,000 --> 00:26:07,839 under the hood, mother nature grabbed 752 00:26:05,079 --> 00:26:09,720 the bias number for word i, the bias 753 00:26:07,839 --> 00:26:11,639 number for word j, took the two 754 00:26:09,720 --> 00:26:13,799 embedding vectors, which only mother 755 00:26:11,640 --> 00:26:15,200 nature knows at this point, did the dot 756 00:26:13,799 --> 00:26:16,919 product of them, added them up, and that's 757 00:26:15,200 --> 00:26:19,080 how we get this number. 758 00:26:16,920 --> 00:26:21,560 So, it basically says the number you see 759 00:26:19,079 --> 00:26:23,039 is the sum of the inherent popularity of 760 00:26:21,559 --> 00:26:25,159 the first word plus the inherent 761 00:26:23,039 --> 00:26:26,799 popularity of the second word plus the 762 00:26:25,160 --> 00:26:29,000 way in which these two words connect to 763 00:26:26,799 --> 00:26:29,960 each other. 764 00:26:29,000 --> 00:26:30,839 That's it. 765 00:26:29,960 --> 00:26:32,440 And 766 00:26:30,839 --> 00:26:33,599 you will agree with me 767 00:26:32,440 --> 00:26:34,799 that it literally can't get simpler than 768 00:26:33,599 --> 00:26:36,759 this. 769 00:26:34,799 --> 00:26:38,200 If I tell you, hey, here are two things, 770 00:26:36,759 --> 00:26:39,799 I want you to tell me how connected they 771 00:26:38,200 --> 00:26:42,360 are, you'll be like, well, let's take 772 00:26:39,799 --> 00:26:44,200 the first one, figure out how inherently 773 00:26:42,359 --> 00:26:45,039 popular it is, do the same for the second, and 774 00:26:44,200 --> 00:26:46,319 then of course you've got to worry about 775 00:26:45,039 --> 00:26:47,678 the connection. So, we do a dot 776 00:26:46,319 --> 00:26:49,720 product. 777 00:26:47,679 --> 00:26:50,440 That's it. Those three things. 
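As a tiny numeric illustration of that model, the predicted co-occurrence score for a word pair is just two biases plus a dot product. All the embedding values and biases below are made-up toy numbers:

```python
import numpy as np

# Hypothetical embedding vectors and biases for two words i and j:
w_i = np.array([0.2, -0.5, 0.7])   # embedding vector of word i
w_j = np.array([0.1,  0.4, 0.6])   # embedding vector of word j
b_i, b_j = 1.5, 0.8                # inherent-popularity biases

# The number you see = popularity of i + popularity of j + how they connect.
prediction = b_i + b_j + w_i @ w_j
print(prediction)  # close to 2.54 (= 1.5 + 0.8 + 0.24)
```

With a 500,000-word vocabulary this same three-term sum is what the model would produce for every cell of the matrix.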
778 00:26:49,720 --> 00:26:52,360 Right? 779 00:26:50,440 --> 00:26:53,840 So, this is what we have. Now, you may 780 00:26:52,359 --> 00:26:54,599 have seen 781 00:26:53,839 --> 00:26:56,839 782 00:26:54,599 --> 00:27:00,079 from your, you know, good old linear 783 00:26:56,839 --> 00:27:02,039 regression that whenever your 784 00:27:00,079 --> 00:27:05,119 dependent variable happens to be 785 00:27:02,039 --> 00:27:08,279 positive, guaranteed to be positive, 786 00:27:05,119 --> 00:27:10,519 and it ends up having a big range, 787 00:27:08,279 --> 00:27:12,599 we always advise you folks 788 00:27:10,519 --> 00:27:14,839 to take the logarithmic transformation 789 00:27:12,599 --> 00:27:16,480 to squash it into a narrow range, because 790 00:27:14,839 --> 00:27:18,319 that will make these models much more 791 00:27:16,480 --> 00:27:20,319 well-behaved. 792 00:27:18,319 --> 00:27:22,240 Regression misbehaves if the Y value has a huge 793 00:27:20,319 --> 00:27:23,159 range. The canonical example is 794 00:27:22,240 --> 00:27:24,960 that, you know, if you are trying to 795 00:27:23,160 --> 00:27:27,560 model the net worth of 796 00:27:24,960 --> 00:27:29,120 people, right? It's going to have a long 797 00:27:27,559 --> 00:27:30,879 right tail with people like Elon and 798 00:27:29,119 --> 00:27:33,279 Jeff and so on on the right side, right? 799 00:27:30,880 --> 00:27:34,880 And the rest of us on the left. So, 800 00:27:33,279 --> 00:27:35,920 to model this big long-tail 801 00:27:34,880 --> 00:27:37,360 distribution, you just take the 802 00:27:35,920 --> 00:27:39,120 logarithm, just squash everything to a 803 00:27:37,359 --> 00:27:41,479 very narrow range. And that will make 804 00:27:39,119 --> 00:27:42,559 regression much better behaved. Okay? 805 00:27:41,480 --> 00:27:45,400 Here, 806 00:27:42,559 --> 00:27:47,000 most of the counts are going to be zero. 
807 00:27:45,400 --> 00:27:48,440 But, some of the counts could be very 808 00:27:47,000 --> 00:27:49,160 high. 809 00:27:48,440 --> 00:27:51,000 Right? 810 00:27:49,160 --> 00:27:52,960 And therefore, if you take 811 00:27:51,000 --> 00:27:54,839 the logarithm, it makes it much better 812 00:27:52,960 --> 00:27:56,440 behaved, so we take the logarithm here. 813 00:27:54,839 --> 00:27:57,439 So, this is actually our model. That's 814 00:27:56,440 --> 00:27:58,720 it. 815 00:27:57,440 --> 00:28:00,759 And I know that many of the numbers are 816 00:27:58,720 --> 00:28:02,600 zero and log of zero is not defined. So, 817 00:28:00,759 --> 00:28:03,960 we can just add one to 818 00:28:02,599 --> 00:28:06,240 all the numbers 819 00:28:03,960 --> 00:28:08,360 to avoid that kind of 820 00:28:06,240 --> 00:28:09,559 technical arithmetic problem. 821 00:28:08,359 --> 00:28:10,319 But, this conceptually is what's going 822 00:28:09,559 --> 00:28:11,519 on. This is the model we want to 823 00:28:10,319 --> 00:28:14,079 calculate. 824 00:28:11,519 --> 00:28:16,759 So, given that we have essentially 825 00:28:14,079 --> 00:28:17,839 postulated this model 826 00:28:16,759 --> 00:28:19,519 and we have this data, this 827 00:28:17,839 --> 00:28:21,240 co-occurrence matrix, how can we 828 00:28:19,519 --> 00:28:24,279 actually find the weights? How can we 829 00:28:21,240 --> 00:28:25,679 actually find the Bs and the Ws? What 830 00:28:24,279 --> 00:28:26,960 should we do? 831 00:28:25,679 --> 00:28:29,320 Go back to the fundamentals of 832 00:28:26,960 --> 00:28:30,519 regression. Think about it conceptually. 833 00:28:29,319 --> 00:28:31,879 You have some model which has some 834 00:28:30,519 --> 00:28:33,519 weights. 835 00:28:31,880 --> 00:28:35,320 There's some data you can use to train 836 00:28:33,519 --> 00:28:36,960 the model. 837 00:28:35,319 --> 00:28:38,240 Right? 
And you need to find the best set 838 00:28:36,960 --> 00:28:40,079 of weights. What does the best mean 839 00:28:38,240 --> 00:28:42,279 here? 840 00:28:40,079 --> 00:28:43,879 The lowest 841 00:28:42,279 --> 00:28:46,119 The lowest error. Exactly. There are 842 00:28:43,880 --> 00:28:47,280 many ways to measure error, right? What 843 00:28:46,119 --> 00:28:48,759 would be What is the simplest thing we 844 00:28:47,279 --> 00:28:50,240 could use? So, what you do is you would 845 00:28:48,759 --> 00:28:52,079 actually do mean squared error. Right? 846 00:28:50,240 --> 00:28:53,240 Which is what you're getting at. 847 00:28:52,079 --> 00:28:54,359 You could take the actual thing, you 848 00:28:53,240 --> 00:28:55,839 could take the predicted thing, take the 849 00:28:54,359 --> 00:28:57,119 difference, square it, and minimize the 850 00:28:55,839 --> 00:28:59,759 sum of it. 851 00:28:57,119 --> 00:29:00,839 Okay? If your model exactly nails every 852 00:28:59,759 --> 00:29:02,799 number in the co-occurrence matrix, the 853 00:29:00,839 --> 00:29:04,879 error is going to be zero. 854 00:29:02,799 --> 00:29:07,759 Okay? So 855 00:29:04,880 --> 00:29:09,240 what we do is we literally just do that. 856 00:29:07,759 --> 00:29:11,200 This is the data. 857 00:29:09,240 --> 00:29:13,319 This is the actual predicted value. 858 00:29:11,200 --> 00:29:14,880 Predicted value, actual value, 859 00:29:13,319 --> 00:29:17,439 difference squared, add them all up, 860 00:29:14,880 --> 00:29:17,440 minimize. 861 00:29:17,839 --> 00:29:21,039 Okay? 862 00:29:19,200 --> 00:29:23,200 Uh yes. 863 00:29:21,039 --> 00:29:25,720 And in the loss function, how is this 864 00:29:23,200 --> 00:29:28,679 capturing the context? Because unless my 865 00:29:25,720 --> 00:29:31,120 input data is having that context 866 00:29:28,679 --> 00:29:33,120 how will this actually differentiate 867 00:29:31,119 --> 00:29:34,239 based on where the particular word is 868 00:29:33,119 --> 00:29:36,359 used? 
869 00:29:34,240 --> 00:29:37,079 The way the word is 870 00:29:36,359 --> 00:29:38,559 used... 871 00:29:37,079 --> 00:29:41,559 so, let's take two words like deep and 872 00:29:38,559 --> 00:29:42,918 learning. Now, let's take this word and 873 00:29:41,559 --> 00:29:44,839 change it according to the context. 874 00:29:42,919 --> 00:29:46,280 Okay. 875 00:29:44,839 --> 00:29:47,359 Sorry, go ahead. Yeah, so basically, 876 00:29:46,279 --> 00:29:49,759 let's say I'm talking about the word 877 00:29:47,359 --> 00:29:50,919 banana. So it's a fruit in some context, 878 00:29:49,759 --> 00:29:53,119 and I could be saying he's going 879 00:29:50,920 --> 00:29:55,240 bananas. That's 880 00:29:53,119 --> 00:29:57,039 something else, right? So now these are two 881 00:29:55,240 --> 00:29:59,079 different contexts in my understanding, 882 00:29:57,039 --> 00:30:01,000 and my same model needs to be able to 883 00:29:59,079 --> 00:30:02,720 tell me that banana is the right word in 884 00:30:01,000 --> 00:30:04,400 this context but the wrong word in that 885 00:30:02,720 --> 00:30:06,600 context, or 886 00:30:04,400 --> 00:30:08,440 correct in both contexts. Yeah, very 887 00:30:06,599 --> 00:30:10,359 good question. So let's actually spend a 888 00:30:08,440 --> 00:30:13,360 minute on that. I'm going 889 00:30:10,359 --> 00:30:15,439 to swap to my iPad. 890 00:30:13,359 --> 00:30:18,000 So let's assume that this is our 891 00:30:15,440 --> 00:30:20,160 co-occurrence matrix. 892 00:30:18,000 --> 00:30:23,160 Right? And then we have words going from 893 00:30:20,160 --> 00:30:24,600 A all the way to, let's say, zebra, right? 894 00:30:23,160 --> 00:30:25,800 These are all the words in our 895 00:30:24,599 --> 00:30:29,439 vocabulary, 896 00:30:25,799 --> 00:30:32,680 and we have A through zebra here. 
897 00:30:29,440 --> 00:30:34,480 And now what we have is, 898 00:30:32,680 --> 00:30:36,519 we have 899 00:30:34,480 --> 00:30:39,079 apple 900 00:30:36,519 --> 00:30:39,079 and banana. 901 00:30:39,559 --> 00:30:42,279 Right? 902 00:30:40,279 --> 00:30:44,079 So basically what's going on at this 903 00:30:42,279 --> 00:30:48,240 point is that 904 00:30:44,079 --> 00:30:50,559 every number here measures, 905 00:30:48,240 --> 00:30:51,960 for every word here, how many times that 906 00:30:50,559 --> 00:30:53,559 word and apple show up in the same 907 00:30:51,960 --> 00:30:56,400 sentence, okay? 908 00:30:53,559 --> 00:30:57,960 It is not measuring, to your point, 909 00:30:56,400 --> 00:30:59,880 how many times apple and banana are 910 00:30:57,960 --> 00:31:01,240 showing up together. It's measuring how 911 00:30:59,880 --> 00:31:03,680 many times apple is showing up with each 912 00:31:01,240 --> 00:31:06,480 other word, right? Now, if apple and 913 00:31:03,680 --> 00:31:09,799 banana are sort of interchangeable, 914 00:31:06,480 --> 00:31:11,880 what do we expect these 915 00:31:09,799 --> 00:31:13,319 two rows of numbers to look like? Let's 916 00:31:11,880 --> 00:31:14,560 assume that apple and banana are perfect 917 00:31:13,319 --> 00:31:15,799 synonyms. 918 00:31:14,559 --> 00:31:17,240 Just for argument, okay? Let's say they're 919 00:31:15,799 --> 00:31:19,839 perfect synonyms. 920 00:31:17,240 --> 00:31:21,359 What do we expect these two 921 00:31:19,839 --> 00:31:23,839 rows of numbers 922 00:31:21,359 --> 00:31:25,599 to look like? 923 00:31:23,839 --> 00:31:27,720 Very similar. 924 00:31:25,599 --> 00:31:30,240 So if two words are related, their 925 00:31:27,720 --> 00:31:31,120 row vectors in the 926 00:31:30,240 --> 00:31:32,599 co-occurrence matrix are going to be 927 00:31:31,119 --> 00:31:34,479 very very similar. 
928 00:31:32,599 --> 00:31:36,079 So that is how the context comes into 929 00:31:34,480 --> 00:31:37,960 the co-occurrence matrix. 930 00:31:36,079 --> 00:31:40,559 So what we want to find out is this: 931 00:31:37,960 --> 00:31:42,840 if embeddings can recreate the same 932 00:31:40,559 --> 00:31:45,000 pattern of numbers in these two 933 00:31:42,839 --> 00:31:47,919 rows, they're actually 934 00:31:45,000 --> 00:31:49,880 capturing the underlying context. 935 00:31:47,920 --> 00:31:51,560 So words which are similar will sort of 936 00:31:49,880 --> 00:31:53,280 zig and zag together the same way 937 00:31:51,559 --> 00:31:56,039 through the co-occurrence matrix. 938 00:31:53,279 --> 00:31:56,039 And that's where it comes in. 939 00:31:57,440 --> 00:32:00,440 Yeah. 940 00:31:58,440 --> 00:32:01,960 What's up with the diagonal of the 941 00:32:00,440 --> 00:32:05,240 co-occurrence matrix, where you have 942 00:32:01,960 --> 00:32:07,200 apple showing up twice? Oh, I see. So 943 00:32:05,240 --> 00:32:08,799 yeah, here you can just ignore the 944 00:32:07,200 --> 00:32:10,480 diagonal, typically, 945 00:32:08,799 --> 00:32:13,519 because all the action is in the 946 00:32:10,480 --> 00:32:13,519 off-diagonal entries. 947 00:32:15,319 --> 00:32:20,319 So that's basically the idea: 948 00:32:18,720 --> 00:32:22,519 words which are very similar will 949 00:32:20,319 --> 00:32:24,039 have a very similar pattern of numbers, 950 00:32:22,519 --> 00:32:25,720 and then any 951 00:32:24,039 --> 00:32:27,759 embeddings that can actually recreate 952 00:32:25,720 --> 00:32:28,920 the same pattern of numbers are capturing 953 00:32:27,759 --> 00:32:29,720 the underlying reality of what's going 954 00:32:28,920 --> 00:32:32,240 on. 
955 00:32:29,720 --> 00:32:34,799 If words are kind of unrelated, those 956 00:32:32,240 --> 00:32:38,000 two vectors, let's say that 957 00:32:34,799 --> 00:32:38,000 the word you have is, 958 00:32:40,400 --> 00:32:45,640 well, of course 959 00:32:42,880 --> 00:32:48,080 you know what I'm going to say, tensor. 960 00:32:45,640 --> 00:32:49,440 Right? These two vectors 961 00:32:48,079 --> 00:32:50,799 won't have any connection 962 00:32:49,440 --> 00:32:51,920 to each other. 963 00:32:50,799 --> 00:32:53,119 Which means if you look at something 964 00:32:51,920 --> 00:32:54,679 like the correlation of those two 965 00:32:53,119 --> 00:32:55,919 vectors, it's going to be around 966 00:32:54,679 --> 00:32:56,600 zero. 967 00:32:55,920 --> 00:32:57,960 Right? 968 00:32:56,599 --> 00:32:59,719 Words which are 969 00:32:57,960 --> 00:33:01,559 interchangeable will have a 970 00:32:59,720 --> 00:33:03,720 very high correlation. 971 00:33:01,559 --> 00:33:05,519 Words which are antonyms and never show 972 00:33:03,720 --> 00:33:07,240 up in the same place together may have a 973 00:33:05,519 --> 00:33:09,079 highly negative correlation, close to 974 00:33:07,240 --> 00:33:10,640 minus one, for instance. So that's sort 975 00:33:09,079 --> 00:33:11,919 of the intuition behind what's going on 976 00:33:10,640 --> 00:33:12,920 in these two row vectors. 977 00:33:11,920 --> 00:33:14,560 978 00:33:12,920 --> 00:33:16,120 And so the point is, given that this 979 00:33:14,559 --> 00:33:19,879 co-occurrence matrix is capturing all 980 00:33:16,119 --> 00:33:22,039 this word-word correlational structure, 981 00:33:19,880 --> 00:33:25,200 any embedding that can recreate it must 982 00:33:22,039 --> 00:33:26,879 have captured the structure as well. 
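That correlation intuition is easy to check numerically. The counts below are invented for illustration: apple and banana are given nearly proportional rows, and tensor gets a very different pattern:

```python
import numpy as np

# Invented co-occurrence counts against six shared context words:
apple  = np.array([40.0, 35.0,  2.0, 50.0,  1.0, 20.0])
banana = np.array([38.0, 33.0,  3.0, 48.0,  2.0, 19.0])  # near-synonym: zigs and zags with apple
tensor = np.array([ 0.0,  1.0, 60.0,  0.0, 55.0,  0.0])  # unrelated: a very different pattern

print(np.corrcoef(apple, banana)[0, 1])  # very close to +1
print(np.corrcoef(apple, tensor)[0, 1])  # negative here: high where apple is low
```

Rows that "zig and zag together" have correlation near one; rows with opposite patterns come out negative, just as described.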
983 00:33:25,200 --> 00:33:28,759 Because you can't recreate something 984 00:33:26,880 --> 00:33:30,080 like this with great fidelity unless you 985 00:33:28,759 --> 00:33:31,799 have some notion of what's going on 986 00:33:30,079 --> 00:33:33,599 under the hood. 987 00:33:31,799 --> 00:33:34,519 That's the basic idea. 988 00:33:33,599 --> 00:33:36,599 Yeah. 989 00:33:34,519 --> 00:33:39,160 So just connecting to Sophie's question. 990 00:33:36,599 --> 00:33:40,879 So in that example, then, 991 00:33:39,160 --> 00:33:42,800 banana is a fruit and apple is a fruit 992 00:33:40,880 --> 00:33:44,160 as well. Banana and apple are synonyms, 993 00:33:42,799 --> 00:33:47,039 and you're going mad, you're going 994 00:33:44,160 --> 00:33:48,040 bananas. How does that come together? 995 00:33:47,039 --> 00:33:50,399 Oh, I see. You're going mad, you're 996 00:33:48,039 --> 00:33:52,319 going bananas, yeah. So those will 997 00:33:50,400 --> 00:33:53,720 also have some correlational structure 998 00:33:52,319 --> 00:33:57,000 to them, which the embeddings will 999 00:33:53,720 --> 00:33:59,440 hopefully catch. But with words like banana, 1000 00:33:57,000 --> 00:34:01,160 the thing is, 1001 00:33:59,440 --> 00:34:03,400 it's called polysemy, where 1002 00:34:01,160 --> 00:34:04,880 the word looks the 1003 00:34:03,400 --> 00:34:06,080 same way but means different things. It's like the word bank, 1004 00:34:04,880 --> 00:34:07,520 right? It can mean very different things 1005 00:34:06,079 --> 00:34:09,319 in very different contexts. So the 1006 00:34:07,519 --> 00:34:11,800 embedding is going to be some average 1007 00:34:09,320 --> 00:34:13,280 representation of it, right? But we are 1008 00:34:11,800 --> 00:34:15,000 not happy with that average, and we'll 1009 00:34:13,280 --> 00:34:18,280 get around that average 1010 00:34:15,000 --> 00:34:19,159 next week when we do contextual stuff. 1011 00:34:18,280 --> 00:34:20,320 All right. 
Um, so that's what we have here. So, to go back to this thing, what we can do is... yeah?
I didn't understand how we get the mean squared error in this, because we haven't estimated anything from the data set we got.
We haven't calculated the embeddings; we are trying to calculate them. It's sort of like regression, where you have beta one times X1 plus beta two times X2, that kind of thing. The betas are what the regression produces for us, right? The embeddings are exactly that: they're just coefficients that we're trying to figure out. The data is only the X's, the Xij. And so this is what we're trying to calculate, right? And what you can do is start with some random values for these things, and then keep trying to improve them to minimize the error, starting from those random values.
Are you folks aware of any algorithm which allows us to take a random starting point and then minimize some notion of error?
Well, how do you know it's actually random?
Oh. So that's actually a very deep question, and a tough one, right? Because ultimately the random number is coming from a computer, and we know how the computer runs: it's deterministic at the end of the day. So we actually use something called pseudo-random numbers, and there's a whole specialized field of math which essentially asks, "How can I get numbers that are sufficiently random even though they come from a deterministic, non-random process?" We can talk offline about it, but fundamentally all these systems have random number generators built in. We just cross our fingers, hope for the best, and use them.
So, to come back to this, right?
We can start with random values for these weights, and then we can try to minimize the squared error. Are you folks aware of any algorithm that can help us do that?
Gradient descent.
Yes, gradient descent again comes to the rescue. And since we are cool, we'll do stochastic gradient descent. Okay? So that's it. Gradient descent actually doesn't care what the function is, as long as you can calculate a derivative from it. As long as you can calculate a gradient, you're good. Right? So we can just run gradient descent on this thing. One key point here: gradient descent and stochastic gradient descent work for any model, as long as you can calculate good gradients from it. It doesn't have to be a neural network. Any mathematical function, as long as it's differentiable and gives you a good gradient. Okay? So this is not a neural network per se, but we can still use gradient descent for it.
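As a minimal illustration that gradient descent needs nothing more than a derivative, here is a sketch minimizing a plain one-variable squared error from a random starting point (the target value 3 is arbitrary, not from the lecture):

```python
import random

# Gradient descent on a plain function, no neural network involved:
# minimize f(w) = (w - 3)^2, whose derivative is f'(w) = 2 * (w - 3).
w = random.uniform(-10.0, 10.0)  # random starting point
lr = 0.1                         # learning rate

for _ in range(200):
    grad = 2.0 * (w - 3.0)  # gradient of the squared error
    w -= lr * grad          # step against the gradient

print(w)  # converges to roughly 3.0, the minimizer
```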
So we do that. And when we are done, we will have calculated some nice embeddings. We will also have calculated all these biases, but we don't need the biases anymore; we can just throw them out, because we only care about the embeddings and how they connect to each other. Okay? Yeah?
So when you're doing that regression, are you predicting the co-occurrence matrix?
Exactly.
So let me just show a very quick numerical example here. Let's say this is W1 and this is W2, the two word vectors, and let's assume for a moment that each has two dimensions. Okay? Two dimensions. And we also need to calculate B1 and B2, each of which is just a number, okay? And let's say the pair "deep" and "learning" happens to have occurred 104 times in the co-occurrence matrix.
So all we are doing is to say: log of 104, that is the actual value, minus our prediction, which is B1, which we don't know, plus B2, which we don't know, plus the dot product of the two vectors. If we call the components W11 and W12 for the first vector, and W21 and W22 for the second, the dot product is W11 times W21 plus W12 times W22. Okay? So the prediction is W1 dot W2 plus B1 plus B2, and log of 104 is the actual. So all we do is take that difference, actual minus prediction, and square it. And then we do the same exact thing for every other word pair. Okay? And when we are done with all of that, we take the whole sum and say: gradient descent, minimize. So then it has to find the B's and the W's for every word. So that's actually what's going on.
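Here is a rough numerical sketch of that whole procedure in NumPy, a simplified GloVe-style fit: toy invented counts, a squared error between the log count and W1 dot W2 plus B1 plus B2 for each co-occurring pair, minimized by plain gradient descent. (Real GloVe also weights each pair's error by a function of the count, which is omitted here.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy symmetric co-occurrence counts for a 4-word vocabulary
# (all numbers, including the 104, are invented for illustration).
X = np.array([[  0., 104.,   3.,   2.],
              [104.,   0.,   4.,   1.],
              [  3.,   4.,   0.,  90.],
              [  2.,   1.,  90.,   0.]])

V, d = 4, 2                        # vocabulary size, embedding dimension
W = rng.normal(0.0, 0.1, (V, d))   # random starting embeddings
b = np.zeros(V)                    # one bias per word
lr = 0.02

mask = X > 0                           # only fit pairs that actually co-occur
logX = np.log(np.where(mask, X, 1.0))  # target: log of each count

for _ in range(20000):
    pred = W @ W.T + b[:, None] + b[None, :]   # w_i . w_j + b_i + b_j
    err = np.where(mask, pred - logX, 0.0)     # prediction minus actual
    W -= lr * 2.0 * err @ W                    # gradient step (X is symmetric)
    b -= lr * 2.0 * err.sum(axis=1)            # gradient step on the biases

final = np.abs(np.where(mask, W @ W.T + b[:, None] + b[None, :] - logX, 0.0))
print(final.max())  # worst remaining error over the fitted pairs
```

The remaining error shrinks toward zero, meaning the little model has reproduced the log co-occurrence matrix, which is exactly the training signal the lecture describes.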
Make sense?
All right. So, by the way, here I said let's assume that the embeddings are just vectors of dimension two. Well, that's an arbitrary decision that I made just to show you how it works, because I was doing it by hand. More generally, we get to choose how long these vectors are. Right? And the longer the vector, the more interesting ways it can reproduce the co-occurrence matrix; it has more flexibility. But the longer the vector, what is the risk that you run?
Overfitting.
Because these are all parameters at the end of the day. The more parameters you have, the more risk of overfitting. Okay? So you get to choose how big these things can be. Yes?
Don't you find it surprising that we're able to fit a model where we have a lot more parameters than we have data? Because usually in machine learning you would like to not have a lot of parameters, but here we're going to have, as you said, the number of dimensions times more parameters than we have data points.
Well, in this particular case, as it turns out... let's assume that you only have 10 words, right? And for each word, just to keep the math simple, let's assume you have a two-dimensional vector. So, 10 words times 2, that's 20. Plus you have 10 biases for the words, so that's another 10, that's 30. But 10 times 10, the matrix has 100 entries. So, because the co-occurrence matrix is an order n squared object, you'll have a lot more numbers than parameters. In this particular case, you have more data than parameters. So that particular problem doesn't apply in this case.
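That parameter-versus-data bookkeeping is easy to write down (the 10-word, 2-dimensional numbers are the ones from the example above; the 300-dimensional case is an added hypothetical to show when the balance flips):

```python
# Count parameters (embeddings plus one bias per word) versus data points
# (entries of the V-by-V co-occurrence matrix).
def counts(vocab_size, dim):
    params = vocab_size * dim + vocab_size  # V*d embedding numbers + V biases
    data = vocab_size * vocab_size          # V^2 matrix entries
    return params, data

print(counts(10, 2))    # (30, 100): more data than parameters
print(counts(10, 300))  # (3010, 100): now parameters would dominate
```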
But it does show up in other cases, and there is some very interesting research in neural networks which suggests that oftentimes the traditional assumptions about data and overfitting can be called into question in some situations. Happy to tell you more offline, but if you're curious, just Google something called double descent. But in this case, it's not a problem.
Okay. So what that means is that we can choose how big these things are. So, if you look at one-hot word vectors, where there's a one and everything else is zero depending on the position of the word, these are long vectors, as long as the vocabulary, as we saw earlier. Word embeddings, on the other hand, can be very dense. The numbers that make up these embeddings we're actually going to figure out from the data, so each number can be anything.
The first dimension may stand for some combination of, you know, brightness plus speed plus animal-ness or something. We have no idea what it means. All we know is that it's able to reproduce the co-occurrence matrix really well, so it has probably figured something out. Okay? And we can keep these vectors really short. So word embeddings tend to be very dense, meaning not zeros and ones but arbitrary numbers; they're much lower dimensional; and of course they're learned from data.
Right? So, once you do this, once you actually run GloVe on this data and do gradient descent and so on and so forth, you will come up with embeddings, and then you can actually plot them. Here, they're not literally plotting the first two dimensions.
They're using a particular technique called t-SNE, which is a way to take long vectors and project them into 2D space for visualization purposes. And you can see some very interesting things showing up here. They plotted the embedding for brother, nephew, uncle, sister, niece, aunt, and so on; the embedding for man, the embedding for woman, sir, madam, empress, heir, duke, emperor, king. You get the idea. Right? So clearly there are patterns here, where things which are sort of similar in their nature are all hanging out together in the same part of the space. Which is comforting, which is good to know. Right? Now, as I mentioned earlier, it's not just about the fact that similar things happen to be near each other. The direction also actually matters. And beautiful things happen when you look at directions.
So, for instance, let's say you want to go from man to brother. To go from man to brother, you have to start at man and then travel along this arrow to get to brother. So this arrow has some notion of a person becoming a sibling. Right? So you would hope that if you take that same arrow and start at woman, hopefully the woman will become a sister. And sure enough, that's exactly what happens. So this is called word vector algebra, or embedding algebra. And these relationships are actually showing up in the data. We didn't tell it any of these things. We just literally gave it the co-occurrence matrix and asked it to reproduce it. So I find it pretty shocking that these things are actually true. And it gives us evidence and comfort that whatever has been learned does have some deep connection to the underlying nature of what's going on.
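The arrow arithmetic can be sketched with tiny hand-made vectors (these are not real GloVe values; the two dimensions are contrived so that one axis roughly encodes gender and the other sibling-ness):

```python
import numpy as np

# Hand-made toy vectors, invented for illustration only.
vocab = {
    "man":     np.array([ 1.0, 0.0]),
    "woman":   np.array([-1.0, 0.0]),
    "brother": np.array([ 1.0, 1.0]),
    "sister":  np.array([-1.0, 1.0]),
    "aunt":    np.array([-1.0, 0.8]),
    "uncle":   np.array([ 1.0, 0.8]),
}

def nearest(v, exclude=()):
    """Word whose vector has the highest cosine similarity to v."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vocab if w not in exclude),
               key=lambda w: cos(vocab[w], v))

# Take the man -> brother arrow and apply the same arrow starting at woman.
arrow = vocab["brother"] - vocab["man"]   # "person becomes a sibling"
result = vocab["woman"] + arrow

print(nearest(result, exclude={"man", "woman", "brother"}))  # sister
```

With real pretrained embeddings the same arithmetic, done in the full embedding space, lands near "sister" in just this way.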
It's not some statistically fluky artifact. Yeah?
So, you're saying similarity is captured by context, by adjacency to other words, and not by appearing in the same place, right? Because synonyms won't appear in the same sentence together.
Right. They won't appear in the same sentence, but the pattern of co-occurrence will be the same for them, which is what we've been able to reproduce with these embeddings. So that's the key idea.
Um, so my question is, how are we able to capture all these directions in a 2D plot versus the full multi-dimensional space? Because this relationship is kind of confirmed, that you're moving toward family or blood relationship or something of the sort, but how does it not mess up the other parts of that space?
No, this is just a visualization thing.
So we're basically taking, as you will see, GloVe embeddings, which come in lots of different sizes. This plot, I think, uses the 100-dimensional embedding and just projects it into 2D space using a particular technique, and then looks to see what's going on.
Um, yeah?
If the input data, the co-occurrence matrix, is biased, aren't we amplifying that bias?
Yes, we are. It's a great observation. Any sort of data you scrape from the internet and use for this kind of modeling exercise will be subject to all the biases that produced the data in the first place. And the model will faithfully learn those biases, and if you're not careful, it'll perpetuate them. That's a whole, very important topic that we unfortunately won't cover in this course because of time constraints, but it's something you always have to worry about when you're building these models.
How do you think about the dimensionality of the embeddings themselves, not the 2D representation?
The one that we choose? That's in our hands, so you should think of it as a hyperparameter. Much like the number of hidden units to use in a particular hidden layer, it's a hyperparameter. So, you know, I would again start small, and if it solves the problem that you're trying to solve with these embeddings, great. If not, keep increasing it. And at some point there might be a flattening out, an overfitting sort of dynamic, and then you stop. So just think of it as a hyperparameter. Yeah?
Do you see any benefit in practice to using something like penalized regression to do this, to make the embeddings more sparse, or just to lower their magnitude?
Yes. So, there are lots of techniques for applying regularization in the estimation itself of all these numbers. Happy to give you pointers; I'm just going with the simplest version possible here. Yeah?
Am I understanding why overfitting is a problem in this case? Because we're not doing any out-of-sample prediction. So wouldn't you want the embeddings to be high dimensional, so you can capture more relationships?
Interesting question. So the question is: given that there's no notion of an out-of-sample test set that we're going to evaluate these things on, why do we really care about overfitting? Shouldn't we do the best we can to capture everything in the data? Well, the thing is, even when you're not trying to use it for out-of-sample prediction, you do want to make sure that your model captures only the true patterns and not the noise. In every data set there's always noise, right? And you want it to capture the signal but not the noise, regardless of what you use it for. Because if it captures the noise, then the insights you draw from the word embeddings may be flawed.
That's the reason.
Okay. All right, so let's keep going. So here the algebra is: brother minus man plus woman is sister. That's it, human biology reduced to a single sentence.
All right. So now, the pros and cons of these things. You should use something like a GloVe embedding if you don't have enough data to learn a task-specific embedding for your own vocabulary. As I'll show you in the Colab, you can actually learn these things just for your own data set if you want; you don't have to use the GloVe embeddings. But the reason to use these pretrained embeddings is that if you're working with natural language, you know, the word is the word, right? It means something. And so there's no reason for your model, for your little use case, to somehow learn all the fundamentals of English. The fundamentals of English are the fundamentals of English. May as well learn them once and then piggyback on them.
So that's the whole idea of using pre-trained embeddings. Because these things are all common aspects of language, you may as well learn them using all the data you can throw at the problem, and then fine-tune, tweak, and adapt to your particular use case. Right? This is particularly useful when you don't have a lot of data in your own use case. That's one big advantage. Now, it does have the drawback that the embedding will not be customized to your data. Right? For example, if you're trying to build an application for a medical or legal use, it's going to have a lot of jargon. And this pre-trained embedding, trained on all of Wikipedia, may not capture enough of the jargon and know its meaning really accurately. So what you may still want to do is take this thing, and then adapt and fine-tune it using your jargon-packed, domain-specific data set.
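A minimal sketch of that adaptation step, using made-up stand-ins for pretrained GloVe rows (a real glove.6B file stores one word followed by its floats per line, and "glioblastoma" is just a hypothetical piece of medical jargon): rows for words covered by the pretrained table get copied into your embedding matrix, while missing jargon starts at zero and must be learned from your own data.

```python
import numpy as np

# Made-up 3-dimensional stand-ins for pretrained GloVe rows.
pretrained = {
    "the": np.array([0.1, 0.2, 0.3]),
    "cat": np.array([0.5, 0.1, 0.9]),
    "sat": np.array([0.4, 0.8, 0.2]),
}

vocab = ["", "[UNK]", "the", "cat", "sat", "glioblastoma"]  # your own index
dim = 3

# Copy in rows the pretrained table knows; jargon it is missing stays at
# zero and has to be picked up during fine-tuning on your own corpus.
emb = np.zeros((len(vocab), dim))
for i, word in enumerate(vocab):
    if word in pretrained:
        emb[i] = pretrained[word]

print(emb[vocab.index("the")], emb[vocab.index("glioblastoma")])
```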
Okay, those are some of the things to keep in mind. And of course, we can also learn embeddings from scratch if we want; the Colab demonstrates all these options.

So, when you're working with embeddings in Keras, remember STI: we standardize, tokenize, and index. At that point we go from integers to vectors, and so far we have been mapping integers to one-hot vectors. Here, we're going to use embedding vectors instead, which we will either learn ourselves or reuse from GloVe. So what we do is tell Keras's TextVectorization layer to do only STI, and then we use a new layer, called the embedding layer, to do the encoding. That's how we divide it up.

We'll take a look at this first, before we switch to the Colab. Before, we told Keras that this layer's output mode should be multi-hot or whatever, right? Here, we don't want it to encode anything as multi-hot; we just want it to give us the integers back. So we tell it: give me "int". That's the first change. If you say "int", it will stop with STI and just give you the integers.

Then, because all the incoming sentences are going to have different lengths, we want to normalize them so they are all the same length. The way we do that is to choose a maximum length for the sentences. If a sentence exactly fits that length, perfect: say we want a max length of five, and "cat sat on the mat" is exactly five, so it fits perfectly. But if something is shorter, say "I love you", which is only three tokens, we pad it with something called the pad token. Much like the unk token, the pad token is a special token we use for padding, and you will see that Keras uses zeros for this padding, filling the sequence up all the way to the end. And if you have something much longer than five, you just truncate everything else and keep the first five tokens. That's what we do to get all the sentences to be the same length.

Okay? And once we do that, we go to the embedding layer. The embedding layer is actually very simple. What is an embedding? It's just a vector, and we need a vector for every token; of course, we're going to learn these vectors. So in this case, let's say these are all the tokens in our vocabulary after the STI process, maybe 5,000 tokens. For each token we have an embedding vector, right?
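The pad-or-truncate step can be sketched in a few lines of plain Python (a minimal sketch of the idea only; in the Colab, Keras's TextVectorization layer does this internally, and the function name and example ids here are made up):

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Force a token-id sequence to exactly max_length:
    truncate if too long, pad with pad_id (zero) if too short."""
    if len(token_ids) >= max_length:
        return token_ids[:max_length]
    return token_ids + [pad_id] * (max_length - len(token_ids))

print(pad_or_truncate([23, 9, 5, 2, 7], 5))  # exactly five: untouched -> [23, 9, 5, 2, 7]
print(pad_or_truncate([23, 9, 5], 5))        # three tokens: padded    -> [23, 9, 5, 0, 0]
print(pad_or_truncate(list(range(8)), 5))    # too long: truncated     -> [0, 1, 2, 3, 4]
```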
And we choose what the dimension of that embedding vector is. So we can set it up by saying keras.layers.Embedding, and we tell it max tokens, which is how many rows we have here, that is, the vocabulary size we're working with, and then we tell it how long we want each embedding vector to be. So: the number of rows, and the width of the columns, and that's the embedding layer. We'll use it in a second; I just want to show it to you here because it's slightly clearer.

So when an input sentence arrives, the text vectorization layer runs STI on it and truncates or pads it to max length as needed. Say this phrase comes in: STI gives you the same tokens plus pad, pad, because the max length is five, and then these are the corresponding integers. The embedding layer then just looks up the corresponding vector for each integer. For example, here we need to look up the vectors for 23, 9, 5, 0, and 0. So we just go to the table and look up 23, 9, 5, and 0, and once we have that, boom, this is the resulting output. Whatever input sentence comes in, we now have five embedding vectors that have been looked up from the embedding layer.

And once we do that, this is a table. So "I love you" comes in and becomes this table. As we have seen before, neural networks can only accept vectors as inputs, so we need to turn this into a vector. And as we have done before, we can either concatenate all these vectors into one long vector, or we can average or sum them. The simplest thing is probably just to average them, so that's what we'll do here. That's called the GlobalAveragePooling1D layer, and all it does is take whatever table you give it and average each dimension.
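That lookup-then-average pipeline is easy to mimic in NumPy (a sketch with random stand-in vectors rather than real learned embeddings; the variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 5000, 100
embedding_table = rng.normal(size=(vocab_size, embed_dim))
embedding_table[0] = 0.0                 # row 0 is the pad token

token_ids = np.array([23, 9, 5, 0, 0])   # a padded five-token sentence
looked_up = embedding_table[token_ids]   # embedding lookup -> a (5, 100) table
pooled = looked_up.mean(axis=0)          # GlobalAveragePooling1D: average each dimension

print(looked_up.shape)  # (5, 100)
print(pooled.shape)     # (100,)
```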
The first dimension averaged, the second dimension averaged, and so on. And once that's done, that's the whole pipeline: the phrase comes in, STI gives you the tokens, padding or truncating as needed; we look up the embeddings from the embedding layer; we get this table; we do global average pooling on it; and it's done. The resulting thing is a vector that can then be passed into hidden layers just like we normally do.

I'm going over this a little fast, but make sure you look at it afterwards and understand every step; the Colab will mirror this exactly.

All right, so let's switch to the Colab. Okay. Can folks see this okay? All right, so we'll do the usual.
We'll import all the stuff we need, and then, because I want to plot some of these loss and accuracy curves to see what's going on, I'll just bring in the plotting functions from the previous Colabs. And then, I think I've already downloaded the data set; let me just make sure I have it. It's not there. Okay, we'll do it again. This is the same songs data set that we looked at on Monday. Okay. So, roughly 49,000 examples, as we saw before. We'll one-hot encode them.

All right, so there's a bunch of stuff here that we already covered in class. This URL has all the GloVe vectors available for download. I downloaded it before class because it takes a few minutes, and I've also unzipped it. So let's just look at the first few. All right, these are the first few. We'll create an easier-to-view version of these GloVe vectors.
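Reading a GloVe file into a word-to-vector dictionary is a simple split per line, since each line is a word followed by its vector components. A sketch using a tiny in-memory stand-in (the real file, such as glove.6B.100d.txt, has 400,000 lines of 100 numbers each):

```python
import numpy as np

# Tiny in-memory stand-in for a GloVe file; real vectors are 100-dimensional.
sample = """the 0.1 0.2 0.3
movie -0.4 0.5 0.6"""

glove = {}
for line in sample.splitlines():
    word, *values = line.split()          # first field is the word, rest are numbers
    glove[word] = np.asarray(values, dtype="float32")

print(glove["movie"])   # [-0.4  0.5  0.6]
```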
So, I'm going to use the vectors which are 100 long; GloVe comes in several different sizes. So we have 400,000 word vectors, each of dimension 100, and these have all been computed from Wikipedia using the model we described, trained with gradient descent. Okay? All right, so this is the vector for the word "movie". Now, I don't know what these dimensions mean, but there is something going on; it has figured stuff out. The proof is in the pudding, though. So, all right, now we'll first set up the text vectorization and embedding layers like we saw before. I'm going to use a max length of 300 for the songs, because all the sequences have to be the same length. And you might be wondering: okay, why did you pick 300 and not, say, 400 or 200?

Typically, what you do is look at the length distribution of the songs you have, looking for something like an 80/20 cutoff. In this case, it turns out that 90% of the songs in our data set have 300 words or fewer, so I'm just going to go with 300. That's pretty good. The problem with instead using the length of the longest song is that the longest song might be 3,000 words, and there would be hardly any songs that long; you would just be wasting a lot of capacity. So we're being a little pragmatic here.

Okay. And then, as before, for the vocabulary itself, we tell Keras to use the most frequent 5,000 words when doing STI. So we do that, and we tell it the output mode is int, like we saw before. There we go. Okay, perfect. (Okay, this is a very dangerous thing, where somebody is remotely changing the notebook in another tab somewhere.)
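That 90% cutoff for the max length can be sketched with a percentile over the word counts (the song_lengths data here is made up for illustration; in the Colab you would compute the counts from the actual lyrics):

```python
import numpy as np

# Made-up word counts per song; in the Colab you'd compute these from the lyrics.
rng = np.random.default_rng(1)
song_lengths = rng.integers(50, 600, size=1000)
song_lengths[0] = 3000                    # one extreme outlier

print(song_lengths.max())                 # 3000: sizing to the longest song wastes capacity
max_length = int(np.percentile(song_lengths, 90))
print(max_length)                         # a far smaller cutoff that still covers 90% of songs
```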
Fingers crossed. Okay.

Okay. So, we have this, and this is what we did with all this stuff, as I've covered. So now we will adapt this layer, as we have seen before, using all the lyrics we have. And once we do that, we'll take a look at the first few entries.

And here's a very important thing. Before, on Monday, when we asked it to do multi-hot encoding and so on, unk was at position zero. But here, unk actually has index one. The reason is that the zeroth position is going to be used for what you can think of as the empty string; that's how Keras will print the pad token. So the zeroth position is the pad token, and the first position is the unk token. Okay? That's an important thing here.

So, let's say that we vectorize "HODL you're the best." Do you think HODL is going to be part of those 400,000 word vectors from Wikipedia? Not yet.
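That index layout (0 for pad, 1 for unk, real words from 2 upward) can be mimicked with a toy lookup table before we run the real thing (a sketch with a made-up five-word vocabulary and simple whitespace splitting, not Keras's actual standardization):

```python
vocab = ["", "[UNK]", "you", "are", "the", "best"]   # index 0 = pad, index 1 = unk
word_to_id = {w: i for i, w in enumerate(vocab)}

def vectorize(text, max_length=10):
    """Lowercase, split on whitespace, map unknown words to 1, pad with zeros."""
    ids = [word_to_id.get(word, 1) for word in text.lower().split()]
    return (ids + [0] * max_length)[:max_length]

print(vectorize("HODL you are the best"))
# [1, 2, 3, 4, 5, 0, 0, 0, 0, 0]: HODL maps to unk (1), zeros pad out the rest
```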
So, all right, let's try that.

Okay, and as you can tell, HODL is an unknown word; that's why it's showing up here as a one. The index value one is unknown, and zero is pad. So this is unk, for HODL, then the tokens for the rest of the phrase, and then everything from that point on is a zero, because we are padding all the way out to 300. That's why you see all these zeros here.

All right. Now let's just run everything through the vectorization layer, and then we'll get to the embedding layer.

Okay. Now, there's just a bit of Python housekeeping to create a nice, easy-to-look-at matrix. What we're going to do is create a matrix which holds the GloVe embeddings for our words. So here, this is the embedding matrix.
And this matrix has only 5,000 words, each 100 long. Why does this embedding matrix have only 5,000 rows even though we downloaded 400,000 vectors?

Right. So clearly the 5,000 we used earlier has some bearing on this, but what is that 5,000? We told Keras to take the most frequent 5,000 words in our corpus, so we only have 5,000 words in the vocabulary. That's why there are 5,000 rows: we grab just the GloVe vectors for those 5,000 words that Keras has chosen to be in the vocabulary. Okay? And that's our embedding matrix.

And then, if you look at the first few rows, the first two rows should be all zeros, because they are pad and unk, which GloVe obviously doesn't know about. So you can see all those zeros here, and then from the third row on you start getting real numbers. Okay? All right. Next, we'll set up the embedding layer.
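Before that, the matrix construction just described can be sketched in NumPy: copy the GloVe vector for each vocabulary word into the row given by that word's index, and let pad and unk stay all zero since GloVe has no vector for them (toy dimensions and stand-in vectors here, not the Colab's actual code):

```python
import numpy as np

embed_dim = 4                             # 100 in the Colab; 4 here for readability
vocab = ["", "[UNK]", "the", "love"]      # toy version of the Keras-chosen vocabulary
glove = {"the": np.ones(embed_dim),       # stand-ins for real GloVe vectors
         "love": np.full(embed_dim, 2.0)}

embedding_matrix = np.zeros((len(vocab), embed_dim))
for idx, word in enumerate(vocab):
    if word in glove:                     # pad and unk miss the lookup, so stay zero
        embedding_matrix[idx] = glove[word]

print(embedding_matrix[0])  # [0. 0. 0. 0.]  (pad row)
print(embedding_matrix[2])  # [1. 1. 1. 1.]  (row for "the")
```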
So basically, what's going on here is that we tell the embedding layer how many rows it has, which is just the vocab size, max tokens, and what the embedding dimension is; that's going to be 100, because the GloVe vectors are 100-dimensional. And then, here's the thing: you can tell this embedding layer, just use this matrix I'm giving you as the embeddings, because we already know what the embeddings are; we downloaded them from GloVe. So we initialize the layer with that embedding matrix as its weights. And then we tell it: don't train. When we do backpropagation later on, don't change any of these weights, because somebody, Stanford, spent a lot of money creating these weights for us. We don't want to change them further; just freeze them and use them as they are. Okay? (And this mask_zero business I'll come back to later; don't worry about it for the moment.)

All right. So once we do that, we are ready to set up our model. This model is pretty simple. A Keras Input whose length is, of course, the length of the sequence, which is 300, and then the input runs through the embedding layer, and out comes a 300-by-100 table. Then we global average pool it, and that becomes a 100-element vector, and then we are back on familiar ground: we run it through a dense layer with eight ReLU neurons, and then through the final output layer, which is a three-way softmax as before: hip hop, rock, pop. And then we tell Keras that's our model, and we summarize it.

Okay, so this is what we have. And you can see here that the total parameters are 500,835, but the trainable parameters are only 835. That's because the total parameters include all the GloVe embeddings plus the things we added on top of them, like the hidden layer and so on.
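These parameter counts can be verified by hand with the numbers from the lecture: a 5,000-word vocabulary with frozen 100-dimensional embeddings, a dense layer of 8 ReLU units, and a 3-way softmax:

```python
vocab_size, embed_dim = 5000, 100
hidden_units, num_classes = 8, 3

frozen = vocab_size * embed_dim                    # GloVe table: 500,000 frozen weights
dense1 = embed_dim * hidden_units + hidden_units   # 100 -> 8 ReLU: weights + biases = 808
dense2 = hidden_units * num_classes + num_classes  # 8 -> 3 softmax: weights + biases = 27

print(dense1 + dense2)            # 835 trainable parameters
print(frozen + dense1 + dense2)   # 500835 total parameters
```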
But we have told Keras to freeze the GloVe embeddings, do not train them, which means only the rest of the network is trainable. That's the 835. Yeah?

>> So, when we do the global average pooling, don't we lose the sense of meaning that we gain from the embedding, since we average very different embeddings together?

Sorry, say that again; I missed the first part.

>> If we average the embeddings of "apple" and "learning", for instance: they are very different words used with different meanings, so they have different embeddings, but we average them, so don't we lose that?

We will lose a bunch of stuff, yeah. Any time you average anything, you're going to lose some nuance. So the real question is: despite that averaging, is it good enough for you? And sometimes it's good enough; very often it's good enough, as it turns out. But as you will see when we get to contextual embeddings, there's just a better way to do it. It requires bigger models and more powerful machinery, though, and that's where you go from the foundations to the advanced stuff. Yeah?

>> When we're doing optimization, it's often better to optimize everything together than to optimize one part of the system and then the other. So in that case, why wouldn't we also want to change the embeddings? I understand why we'd want to keep the weights that people spent a lot of money finding, but wouldn't we find embeddings more specific to our problem if we let everything be trainable?

Yeah, absolutely. Absolutely. And in fact, you will see in the Colab that we do exactly that next. I just want to show people that you don't have to. You start by not training the embeddings, because it's much faster, and then you train everything and see if it gets better. Sometimes it will, in which case great; sometimes it won't. And I will also show you, though I'll probably run out of time, so I'll do it on Monday, what to do if you want to learn your own embeddings from scratch without using GloVe. So all the possibilities will be covered.

Yeah. So, to come back to this: this is the model we have. All right. Now let's take a look at the first few embedding vectors. By the way, model.layers gives you a list of all the layers, and you can just grab any layer you want and look at its weights. It's very handy.
1978 01:06:14,079 --> 01:06:16,559 So, we're looking at the weights, and 1979 01:06:15,159 --> 01:06:19,399 you can see here 1980 01:06:16,559 --> 01:06:21,519 the first two vectors are all zeros 1981 01:06:19,400 --> 01:06:22,920 because that stands for unk and pad, and 1982 01:06:21,519 --> 01:06:24,880 then we have everything else. So, 1983 01:06:22,920 --> 01:06:26,400 everything looks fine so far. And now, 1984 01:06:24,880 --> 01:06:28,800 we just, you know, compile and fit it. 1985 01:06:26,400 --> 01:06:30,039 So, as usual, Adam, cross entropy, 1986 01:06:28,800 --> 01:06:33,080 accuracy. 1987 01:06:30,039 --> 01:06:34,880 Um and then, we'll just fit the model. 1988 01:06:33,079 --> 01:06:36,000 All right. 1989 01:06:34,880 --> 01:06:38,599 It's going to take 1990 01:06:36,000 --> 01:06:38,599 a few minutes. 1991 01:06:39,000 --> 01:06:43,519 And while it's running, so what what you 1992 01:06:41,358 --> 01:06:44,960 will see in this collab is that 1993 01:06:43,519 --> 01:06:46,440 uh in this particular case, the 1994 01:06:44,960 --> 01:06:47,519 embeddings actually don't help a whole 1995 01:06:46,440 --> 01:06:50,440 lot. 1996 01:06:47,519 --> 01:06:50,440 Why do you think that is? 1997 01:06:51,920 --> 01:06:54,639 What if it could be because we're 1998 01:06:52,920 --> 01:06:57,079 averaging a lot of stuff? Maybe that's 1999 01:06:54,639 --> 01:06:58,400 hurting us. 2000 01:06:57,079 --> 01:06:59,840 Yeah. 2001 01:06:58,400 --> 01:07:01,960 Um I mean, I think that the embeddings 2002 01:06:59,840 --> 01:07:03,559 were pre-trained on some corpus, right? 2003 01:07:01,960 --> 01:07:05,358 Like Wikipedia or something like that 2004 01:07:03,559 --> 01:07:06,599 that is different from the a little bit 2005 01:07:05,358 --> 01:07:08,599 different from the language we tend to 2006 01:07:06,599 --> 01:07:09,599 use in song lyrics. 
Student: So maybe its ability to extract the meaning of a word like "candy" from a song lyric is limited, because it's thinking of all the other ways that word could be used.

Right. So there could be a mismatch between the corpus the pre-trained embeddings were trained on and the corpus you're working with right now. That's one big reason. The other reason is that we have 50,000 examples, which is a lot of data. When you have a lot of data, you may not need any of these things. Pre-trained embeddings tend to do really well when you don't have a lot of data, because you get to piggyback on what they have learned from all of Wikipedia. So the rule of thumb is: when your data is really small, try to use a pre-trained model. And that's what you saw with the handbags-and-shoes classifier, right?
We had 100 examples of handbags and shoes, and we used ResNet to get to basically 100% accuracy. The same logic applies here. All right, let's see what's happening. Okay, it's done, so we'll plot. Look at that, a very well-behaved loss curve. There doesn't seem to be any massive overfitting going on; training and validation are moving really nicely in lockstep. Let's see what the accuracy is. Okay, 63%, which is not great. It's not as good as what we saw before, when we used all 50,000 examples and just trained something from scratch, and that's because when you have lots of examples, these pre-trained embeddings aren't as helpful as they could be. But if you have a small data set, they can be very helpful. And now we come to what he pointed out earlier.
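The averaging hypothesis a student raised a moment ago can be made concrete. A minimal numpy sketch (random vectors standing in for real learned embeddings, sizes invented) of what global average pooling over word embeddings does to a document:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for learned 100-dim word embeddings (5,000-word vocab).
embeddings = rng.normal(size=(5000, 100)).astype("float32")

def average_embedding(token_ids, embeddings):
    """Mean of the word vectors: this is what global average pooling does.

    Word order is lost, and averaging many vectors pulls the result
    toward zero, which can wash out distinctive words in a long lyric.
    """
    return embeddings[token_ids].mean(axis=0)

short_doc = average_embedding([10, 42], embeddings)          # 2 words
long_doc = average_embedding(list(range(500)), embeddings)   # 500 words
print(short_doc.shape, long_doc.shape)   # both (100,)
print(np.linalg.norm(long_doc), np.linalg.norm(short_doc))
```

Both documents collapse to a single 100-dim point, and the 500-word average has a much smaller norm than the 2-word one: exactly the "averaging a lot of stuff" effect.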
Why can't we just optimize these embeddings too? Why do we have to treat them as sacred? Let's just unleash backprop on them and see what happens. So we'll do that. Here, what we do is retrain the model, but we set trainable equal to true for the embedding layer. This is the key step: trainable equals true. Otherwise, it's unchanged. We'll run it and see what happens. Before, it was around 63% accuracy; we'll see if it gets better when you train the whole thing. And the thing is, you can never be sure, because it may start to overfit, which is why you just have to see empirically what's going on. There are no guarantees. Any questions while it's training? Yeah.
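What that trainable flag actually changes can be sketched without any framework at all: a frozen parameter group is simply skipped by the gradient update. Everything here (the parameter names, shapes, and numbers) is invented for illustration.

```python
import numpy as np

def sgd_step(weights, grads, trainable, lr=0.1):
    """Apply one gradient-descent step, skipping frozen parameter groups."""
    return {
        name: w - lr * grads[name] if trainable[name] else w
        for name, w in weights.items()
    }

weights = {"embedding": np.ones((3, 2)), "dense": np.ones((2, 1))}
grads = {"embedding": np.full((3, 2), 0.5), "dense": np.full((2, 1), 0.5)}

# Frozen embeddings (trainable=False): only the dense layer moves.
frozen = sgd_step(weights, grads, {"embedding": False, "dense": True})
print(frozen["embedding"][0, 0], frozen["dense"][0, 0])   # 1.0 0.95

# Unfrozen (trainable=True): backprop now also updates the embeddings.
tuned = sgd_step(weights, grads, {"embedding": True, "dense": True})
print(tuned["embedding"][0, 0])                           # 0.95
```

With the flag off, the GloVe rows stay exactly as loaded; with it on, they drift toward whatever helps the task.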
Student: In that first graph, the training accuracy was still increasing. Might that suggest you could train it even more?

Correct, exactly. In that curve, we saw the training accuracy continuing to increase. Typically, training accuracy will keep getting better the longer you train. The key question is whether the validation accuracy is also improving. If the validation continues to improve, there is a little more gas left in the tank and you can keep training. If it starts to flatten, or even worse, starts to go down, then you want to pull back.

Student: You capped the vocabulary at the most common 5,000 words, and the width of that was 100. What is the 100?

The 100 is just the length of the GloVe vector.

Student: Does that mean it can only capture how that word is related to 100 other words?

No, no.
We're basically saying that every word's intrinsic meaning can be captured by a vector of 100 dimensions. Those dimensions mean something; we just don't know what. The first dimension could mean color, the second some sort of location, the third some notion of time of year. We have no idea.

Student: And the pre-trained model already has those. We're not going to learn them; we don't know what they are, but it has them.

The people who created it don't know what they are either. All they know is that for each word they learned a 100-long vector, and that 100-long vector was able to roughly recreate the co-occurrence matrix. Then they probed it using that visualization of man, woman, sister, brother, all that stuff, and it seems to fit what you would expect.

Student: Can you think of it as analogous to the convolutional networks, where you have the number of kernels?
Student: So in this case, if you have 32 kernels, it's sort of like 32 things it can learn.

I think that's actually a great analogy; I love it. That's a great way to think about it. Much like we got to decide how many filters to have, here we get to decide how long the embedding dimension needs to be, and our hope is that the more dimensions we accommodate, the more complicated things it will pick up. At the same time, you don't want too many of them, because it will start picking up noise, and that's never a good thing. Another question on this side? Yeah, go ahead.

Student: Why do we use embeddings and not the rows of the actual co-occurrence matrix to represent words? Why do we need the abstraction?

That's actually a good question.
One immediate reason is that that row is 500,000 entries long, so you want a compact, dense representation of a word. The second is that the row is subject to all the raw counts of the Wikipedia corpus; it's not normalized. You would need to normalize it so that if you take any two rows and compute a dot product, you get a number in a narrow range; otherwise things aren't comparable. Now, both of these objections can be handled: you can normalize, you can reduce the size of the corpus, and so on. In fact, that used to be a very common way of doing it. But what people have discovered is that the way we learn embeddings now tends to be much more effective in practice.

Student: So what this process does is create an n-dimensional, incomprehensible matrix that captures, in essence, a summarized version of these relationships.

Correct.
A compact representation of relationships that is not tied to the size of your vocabulary. You have 500,000 words today; tomorrow somebody comes up with a word like "selfie" that didn't exist five years ago, and your corpus has grown a little. The embedding, by contrast, is very compact and tends to have a much longer shelf life. All right, so let's see where we are. Okay, evaluate: almost 69%. It went from 63 to 69, so clearly, training the whole thing, including the GloVe embeddings, actually helps. And that raises the question: if training GloVe helps, maybe we should train the whole thing from scratch. Like, why the heck not? So what we'll do is create our own embeddings and just train them. And here we don't have to worry about co-occurrence matrices and so on, because we have a very specific objective.
We want to be very accurate in predicting the genre of these songs. The people who worked on GloVe didn't have any such objective; they just wanted to create embeddings that were generally useful. Here, we want to be specifically useful for genre prediction. And so we can train the whole thing ourselves. We just put an embedding layer in; we arbitrarily chose 64 as the dimension instead of 100, so it will run faster. And then it's the same thing: global average pooling, activation, and so on. And then you run it. We'll see if it finishes in the next minute, and whether it actually does better than the pre-trained embeddings, or the pre-trained embeddings that were further fine-tuned. I don't remember what I saw when I ran it yesterday. While it's running, other questions?
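The from-scratch architecture just described, an embedding layer feeding global average pooling feeding a dense classifier, can be sketched as a single numpy forward pass. The sizes, the random initialization, and the choice of 10 genres are all illustrative assumptions, and real training would of course fit these weights with backprop.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim, n_genres = 5000, 64, 10   # sizes assumed for illustration

# Randomly initialized, to be learned end-to-end for genre prediction
# (no GloVe and no co-occurrence matrix involved).
embedding = rng.normal(0, 0.05, size=(vocab_size, embed_dim))
dense_w = rng.normal(0, 0.05, size=(embed_dim, n_genres))
dense_b = np.zeros(n_genres)

def forward(token_ids):
    """Embedding lookup -> global average pooling -> softmax over genres."""
    pooled = embedding[token_ids].mean(axis=0)   # (64,)
    logits = pooled @ dense_w + dense_b          # (10,)
    exp = np.exp(logits - logits.max())          # stable softmax
    return exp / exp.sum()                       # class probabilities

probs = forward([7, 123, 4999])
print(probs.shape)                    # (10,)
print(round(float(probs.sum()), 6))   # 1.0
```

Because the embedding rows are now just another weight matrix of the model, the training objective shapes them for genre prediction specifically.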
Student: My question is about embeddings. When we create an embedding for a particular word, we specify a certain number of parameters; say we defined 100, so there will be 100 weights for each word. When we take a pre-trained model like GloVe, each word already comes with that number of parameters. So how do we redefine them if we want only 100, or only 10?

The GloVe vectors actually come pre-packaged at 100 long. I think they have 200 and 300 as well, if I recall; we just happened to use the one with 100.

Student: The one that's available in Google?

Yeah, and there are many available. We just get to pick and choose, and I happened to pick 100. Oh, okay, it's a bit slow, but it's actually looking promising.
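Coming back to the earlier question of why we don't just use raw co-occurrence rows: a toy sketch of the normalization point. The counts below are invented, and a real row would be on the order of 500,000 entries long; the issue is that raw dot products between count rows are dominated by word frequency, while the normalized dot product (cosine) always lands in a fixed range.

```python
import numpy as np

# Toy co-occurrence rows (raw counts over 3 context words).
row_the = np.array([9000.0, 8000.0, 7000.0])   # very frequent word
row_cat = np.array([12.0, 3.0, 1.0])           # rare word
row_dog = np.array([11.0, 4.0, 1.0])           # rare word, similar contexts

def cosine(a, b):
    """Normalized dot product: always lands in [-1, 1]."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Raw dot products: the scale is set by the counts themselves,
# so numbers from different word pairs aren't comparable.
print(row_the @ row_cat)   # 139000.0
print(row_cat @ row_dog)   # 145.0

# Cosine puts every pair on the same scale; close to 1 here because
# cat and dog appear in similar contexts.
print(cosine(row_cat, row_dog))
```

Normalization fixes comparability, but as noted, learned dense embeddings go further: they compress those huge rows into a short vector that still reproduces the co-occurrence structure.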
It's 9:55, yeah.

Student: During the CNN training in our assignments, changing the filters gave us more depth than improvement in performance. So here, would I be right in concluding that it's actually training the embeddings that gives us the real gain, assuming the epoch and batch settings aren't changed much? If I really want a genuine change in performance, do we go to the level of retraining the embeddings?

Yeah. So what we saw was that using GloVe as-is was okay, and using GloVe and then training it helped a lot. And now we're basically asking: what if we just abandon GloVe and train our own embeddings for our particular problem? See, GloVe is a general-purpose tool. A general-purpose tool is a really good starting point if you don't have a lot of data. But when you have a lot of data, you should always try to do your own thing and see if it's any better. And in this case, well, whoa.
Okay, I think it's... come on, it's 9:55. It should finish any moment now. Right, let's just look at the result. Okay, folks: 74%, 72%. So you can actually train your own embeddings, because we have 50,000 examples, and get an even better result. Thanks a lot. Have a good rest of the week.