We'll continue our journey with natural language processing. We looked at the bag-of-words model, one-hot encodings, and so on. Today we'll talk about embeddings, or to be more precise, stand-alone embeddings, and that will tee us up for something called contextual embeddings, which is where the transformer really comes into play.

All right, let's get going. So far we have encoded input text as one-hot vectors. To refresh your memories from Monday: if this is the phrase coming into the system, we run it through the STI process. First we standardize, then we split on whitespace to get individual words (tokenize), then we assign words to integers (index), and then we take each integer and create a one-hot version of it. When we do that, we have a vocabulary. In this example we have just 100 words, and you'll note that this vocabulary, which you arrive at once you standardize and tokenize, has words like "the", because we decided not to remove stop words like "a" and "the".

Just to be clear about standardization: historically it was all about stripping punctuation, lowercasing everything, removing stop words, and stemming. In modern practice, people essentially strip punctuation (maybe) and lowercase, and often don't even bother with stemming or stop-word removal. That's why in Keras the default standardization is only lowercasing and punctuation stripping. This detail may actually be handy for homework two, which is why I'm pointing it out.

So that's what we have: for each word that comes in, we have a one-hot vector as long as the vocabulary. We can either "add them up" to get a count encoding, or OR them together (look for any ones in a column) to get a multi-hot encoding. That's what we saw last class. But this scheme, while quite effective for simple problems, has some very serious shortcomings. We'll delve into those shortcomings, then step back and ask: is there a solution that fixes these things?
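(For concreteness, here is a minimal sketch of that Monday recap in Keras; the tiny corpus and the max_tokens value are made up for illustration.)

```python
# Minimal sketch of the STI + encoding recap above (TensorFlow/Keras assumed).
import tensorflow as tf

corpus = tf.constant(["The movie was great.", "The film was awesome!"])

vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=100,                              # cap on vocabulary size
    standardize="lower_and_strip_punctuation",   # the Keras default mentioned above
    split="whitespace",
    output_mode="multi_hot",                     # or "count" for a count encoding
)
vectorizer.adapt(corpus)                         # S, T, I: builds word -> integer map

print(vectorizer.get_vocabulary())               # note "the" survives: no stop-word removal
print(vectorizer(corpus))                        # one multi-hot row per input string
```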
So: what's the problem with one-hot vectors? There are lots of problems. Any volunteers?

Student: Similar words end up represented as if they were unrelated.

Absolutely. What he's pointing out is that if you have two words that are synonyms, say "great" and "awesome", you'd hope that the way we represent them as vectors would have some connection to what the words actually mean. In particular, we would hope that if they mean similar things, their vectors are close by, and if they mean very different things, their vectors are far apart. Those are common-sense expectations for what you want the vectors to do. One-hot vectors clearly don't have that property, and we'll look into it in detail in a bit.

But before we do that, there's also a computational issue, which we covered last class: if the vocabulary is really long, then each token, each word coming in, has a one-hot vector as long as the vocabulary. If you have 500,000 words in your vocabulary, every little word that comes in has a vector that's 500,000 long, which feels like a gross waste. You can mitigate it somewhat by keeping only the most frequent words, but it still increases the number of weights the model has to learn, and increases the need for compute and data.

Now, say we have created a vocabulary from a training corpus: a bunch of text came in, we did the standardization and tokenization, and we built a vocabulary from it. And say we get the words "movie" and "film". The question, and an observation gets at this immediately: are these two vectors close to each other or not? If you have two vectors, how would we measure closeness? What's the simplest way to think about it? It's not a trick question.

Student: Distance.

Yeah, exactly. If they're really close distance-wise, we'd be happy: similar words should be close by. So imagine your vocabulary is, say, 100,000 words long, so each vector is 100,000 long. The vector for "movie" has a one at movie's position and zeros everywhere else; the vector for "film" has a one at film's (different) position and zeros everywhere else. What's the distance between these two vectors? Just use the Euclidean distance: take the differences of the values, square them, add them up, take the square root. All the matching zeros contribute zero. Movie's position contributes a one; film's position contributes another one. 1 + 1 = 2, square root: the distance between these two vectors is √2.

Now, what about the one-hot vectors for "good" and "bad"? Clearly good and bad mean opposite things. What's the distance between their one-hot vectors? Still √2: the zeros don't contribute anything, and since the ones are not in the same place, subtracting gives you a one and another one, 1 + 1 = 2, root 2.
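(A quick numerical check of that point; numpy assumed, and the index positions are arbitrary.)

```python
# Any two distinct one-hot vectors are exactly sqrt(2) apart.
import numpy as np

V = 100_000                                    # vocabulary size
movie, film, good, bad = np.zeros((4, V))
movie[17] = film[42] = good[7] = bad[99] = 1   # arbitrary vocabulary positions

print(np.linalg.norm(movie - film))            # 1.4142... == sqrt(2)
print(np.linalg.norm(good - bad))              # 1.4142... == sqrt(2), for any pair
```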
In fact, take any two distinct words in your vocabulary: the distance between their one-hot vectors is always √2. So if any two words are the same distance apart, does this representation even have a notion of distance? It doesn't. There is no meaningful notion of distance in one-hot vectors. They have no connection to the actual meanings of the words; they're just a way of representing them. That is the big problem with one-hot vectors: the distance between them is the same regardless of the words, and it has nothing to do with meaning. This is a huge problem, which we'll have to solve.

To summarize where we are: if the vocabulary is very long, each token gets a one-hot vector as long as the vocabulary; that's the computational and training problem. And then there's the deeper problem: there's no connection between the meaning of a word and its vector.

So wouldn't it be nice if the vectors representing synonyms or closely related words (movie and film, apple and banana) were close to each other, and the vectors for things that mean very different things were far apart? Let's look at a particular example. Assume we've magically been given vectors that actually carry some notion of meaning, and for convenience say we take just the first two dimensions so we can make a scatter plot. We plot the first dimension against the second, and in this little cartoon we've plotted the words for factory, home, and building, and they all happen to be clustered together. Clearly this representation is capturing some notion of what the thing is: some sort of building. Over here we have bicycle, truck, and car, the transportation cluster. Here a fruit cluster, and here some sports-balls cluster. Because it's a cartoon, everything is nice and cleanly separated. Now if you take the word "apple", where do you think it's going to go: A, B, C, or D? C, right? It makes eminent sense that it goes to C, the fruit cluster.

Good. Now, more generally, wouldn't it be nice if the geometric relationships between word vectors represented the semantic relationships between the underlying objects the words stand for? And I say "relationship" and not "distance", because it's not just distance; it's actually more than that. Let's take another example.
Here we have the vectors plotted for puppy and dog, and here is calf. We've plotted the word for calf, and now we need to figure out where the embedding, the word vector, for "cow" should appear. Where is it most logical: A, B, or C?

Student: C. A calf is a baby cow, so cow should sit relative to calf the way dog sits relative to puppy: it's the grown-up version.

Got it. You're basically saying: go from the puppy version to the grown-up version. That's a totally valid way to think about it, and there are a couple of ways to see it. (If this is bringing back bad memories of GMAT and GRE analogy questions, I apologize.) A puppy is to a dog as a calf is to a cow, which is exactly what Jay is pointing out: you go from the baby version to the full-grown version by moving in the horizontal direction. But if you move in the vertical direction, you're moving across different animal species at the same maturity stage. This band is the grown-up version of a whole bunch of animals; that band is the baby version of the same animals. So the vertical dimension measures variation across species at roughly the same stage of maturity.

So these directions also matter. It's not just the distance. That's what I mean when I say semantic relationship and geometric relationship: relationship means distance and direction; both are involved.

Now, word embeddings, as we'll learn soon, are word vectors designed to achieve exactly these requirements, and they fix both of the problems above very elegantly.

So let's say we have word embeddings that fix both problems. Are we basically done? Can we declare victory? Or is there something that even vectors which genuinely capture the meaning of the underlying thing don't fully address? Is there any remaining problem? Yes?

Student: Context.

Context, right. Sure, every word has a meaning, but we know that some words have multiple meanings, and the intended meaning is something you can only infer if you know the surrounding context.
If you see the word "bank", b-a-n-k: sure, it could be a financial institution. It could be the side of a river. It could be the act of a plane turning in one direction. It could be someone hoping for something, banking on it. The list of possible meanings of "bank" is basically enormous, and you cannot figure out which one is intended unless you know what else is going on around that word. So context is super important. These stand-alone word embeddings just tell you what the meaning of a word is, and when a word can mean many different things, the embedding ends up being some average version of that meaning. That average is not going to be very good. There are some words which only mean one thing, and there you'll be okay; for the rest, it's going to be tough.

So we need to find a way to make word embeddings contextual, meaning we need to somehow consider the other words in the sentence. If we can do that, we'll be in great shape and can solve all sorts of NLP problems. As it turns out, contextual word embeddings, word vectors that achieve both of our requirements (they capture the semantic-geometric relationships I talked about, and they are contextual), are really fantastic, and the key to calculating them is the transformer. That is why transformers are justifiably famous.

So what's the lay of the land? Today we'll look at how to calculate stand-alone, non-contextual word embeddings. Then, starting Monday, we'll take these stand-alone embeddings and make them contextual using transformers. That's the plan. Any questions so far?

Now let's think about how we can learn these stand-alone embeddings from data. The naive approach would be: let's manually collect a whole bunch of synonyms, antonyms, related words, and so on, and try to hand-assign embedding vectors that satisfy our requirements. As you can imagine, that would be a long, painful, and never-quite-complete exercise. And given that we're machine learning people, the question is: can we do better? Can we just learn it from the data without any of this manual work?

The key insight that makes it all happen is this humble-looking line on the screen, from the linguist John Firth: "You shall know a word by the company it keeps." (I wish I could deliver that in a British accent.) It's a very profound statement, and here is the key intuition behind it.
Let's say you have a sentence like "The acting in the ___ was superb." What are some words you think are likely to fill the blank? Shout them out.

Students: Play. Movie. Show. Musical.

Those are all great candidates: the acting in the play, the movie, the film, the musical, and so on. Now say I ask you for words that are unlikely to appear there; we could be here for days listing them, so I just jotted a few down. I love the word "tensor", so I have to find a way to use it somewhere. "The acting in the banana was superb": clearly nonsensical.

What we're seeing is that if certain words are interchangeable in a sentence (you can swap them and the sentence still makes sense), that is, if they appear in the same contexts very often, they are probably related. We don't even have to know what the words are. All we have to know is that you can fill in the blank of a particular sentence with either word and it still makes sense; then we say: these words are related. You're inferring their relatedness not by looking at the words directly, but by seeing where they live. It's a very clever idea, and it'll slowly sink in. So that's the first observation: words that appear in the same context very often are likely to be related. More generally, related words appear in related contexts.

So all we have to do is figure out a way to calculate context, and then use it to understand which words happen to live in that context. There are some beautiful ways to do this, and we'll dive deep into one of them.

Since words that appear in related contexts mean related things, we first have to define what we mean by "context". There are many ways to define it; we'll go with a very simple definition: if two words happen to appear in the same sentence a lot, we say they're in the same context. Context here means sentence. So we can take a whole bunch of text, maybe all of Wikipedia, and break it up into sentences; we'll have billions of them. Then, across all those sentences, we can literally count, for every pair of words, how many times both words show up in the same sentence. We call this co-occurrence: the words co-occur in the sentence.
And the words don't have to be next to each other. We know that in a complicated, long sentence, the meaning of a word at the very end can be altered by a word at the very beginning. So we take the whole sentence and ask: do these two words co-occur in this sentence, yes or no? And we just count that up.

When we do that, we get something like this; it just captures what I've been describing. Identify all the words that occur in, say, Wikipedia. Then, for every sentence, look at every word pair and count the number of times the pair appears in the same sentence, across all those sentences. This is a word-word co-occurrence matrix. For example, suppose you took all of Wikipedia, looked at all the distinct words, and found there are 500,000 of them. Then there are 500,000 words along the columns and 500,000 along the rows, and each cell of the table holds a number you calculate: how many times the word in the row and the word in the column show up in the same sentence. For instance, take "deep" and "learning": maybe those two words occurred in the same sentence 3,025 times across all of Wikipedia, so you put 3,025 in that cell. Many word pairs are unlikely to ever appear in the same sentence, so much of this matrix is going to be zero.

But fundamentally, we form this co-occurrence matrix, and it embodies all the context information we have to work with, in a very compact, elegant form. Using it, we're going to figure out what the word embeddings should be.

By the way, the approach I'm describing here for calculating stand-alone embeddings is called GloVe. When stand-alone embeddings first came onto the NLP deep-learning scene, there were two main ways of doing it: one called word2vec, the other GloVe. They're comparable; they just use slightly different mechanisms. We went with GloVe for this lecture because I think it's a little easier to understand and equally effective.

So this is what we have, and what we want to do is learn embedding vectors that can be used to approximate this matrix.
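(A minimal sketch of that counting step, in plain Python with a made-up three-sentence corpus.)

```python
# Build "same sentence" co-occurrence counts, as described above.
from collections import Counter
from itertools import combinations

sentences = [
    "deep learning is fun",
    "deep learning needs data",
    "the acting in the movie was superb",
]

cooccur = Counter()
for sentence in sentences:
    words = set(sentence.split())           # each pair counted once per sentence
    for w1, w2 in combinations(sorted(words), 2):
        cooccur[(w1, w2)] += 1              # order-independent pair count

print(cooccur[("deep", "learning")])        # 2: they co-occur in two sentences
```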
If you can find vectors that approximate this matrix well, then hopefully those vectors do in fact capture some notion of what the words mean. Let me put it differently. You come to me with this matrix and ask, "Rama, do you have embeddings for me?" And I'm like, yeah, I reach into my bag: for every one of those 500,000 words, I have an embedding. Ignore for a moment how I actually calculated them. How will you know if my embeddings are any good? How can you actually assess them?

Well, you could spot-check: get the embeddings for movie and film and see if they're really close by; look at the embeddings for movie and tensor and hope they're far apart. But you'll never get done that way. How can you evaluate systematically?

What if I come to you and say: not only am I going to give you embeddings, here is a procedure you can use with them to validate how good they are. The procedure is: use the embeddings to recreate the co-occurrence matrix. If the recreated co-occurrence matrix matches the real one well, the embeddings are probably pretty good. Remember, the whole point of co-occurrence is to capture the context information. So if my embeddings can reconstruct it pretty closely (it'll never be perfect, but pretty close), then we say: these embeddings do mean something. If it turns out, for instance, that the matrix has a value of 3,000 for "deep" and "learning" and a value of 50 for, say, "extreme" and "learning", and our embeddings predict 3,002 for the first and 48 for the second, we'd be pretty impressed. Whoa, it didn't need to be that close, unless it was actually capturing something.

So that's what we're going to do. We'll take this logic, find embeddings that can approximate what we actually observe in Wikipedia, and use it to build a model and learn the embeddings, using nothing more than, basically, linear regression. And here you were thinking that linear regression is useless now that you've graduated machine learning!

So: we can think of the embedding vectors we want to figure out as just the weights in a model, a linear regression. We can think of the co-occurrence matrix as just the data we'll use in this model to estimate those weights. And the model we're going to use looks like this. First, I have to inflict some notation on you.
We'll denote the co-occurrence count of words i and j as X_ij. X_ij is just data; it's not a variable, it's data. Then we'll denote an embedding vector for each word: w_i is the embedding vector for word i. We'll also recognize that some words are just inherently popular and show up all the time, like the word "the". So we'll assume every word has some natural frequency of occurring (movie versus flick, "the" versus "tensor"), and we want the vectors to capture the co-occurrence patterns independent of how naturally frequent the words are. To capture that natural frequency, we assign each word a bias b_i, which we'll also calculate. All of this will become clear in a moment.

With this setup, we're saying something very simple. This co-occurrence matrix that we're able to compute came about because, in truth, in nature, there are true embedding vectors and true biases for every word, and every co-occurrence number you see arose because, under the hood, mother nature grabbed the bias for word i, the bias for word j, took the two embedding vectors (which only mother nature knows at this point), did the dot product of them, and added it all up. The number you see is the inherent popularity of the first word, plus the inherent popularity of the second word, plus the way these two words connect to each other. That's it. And you'll agree with me that it literally can't get simpler than this. If I tell you, here are two things, tell me how connected they are, you'd say: take the first one and figure out how inherently popular it is, do the same for the second, and then of course you've got to worry about the connection, so you do a dot product. Those three things.
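(In symbols, and folding in the logarithm refinement that's motivated next, the postulated model is:)

```latex
\log\bigl(1 + X_{ij}\bigr) \;\approx\; b_i + b_j + \mathbf{w}_i^{\top}\mathbf{w}_j
```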
Now, you may remember from good old linear regression that whenever your dependent variable is guaranteed to be positive and has a big range, we always advise taking a logarithmic transformation to squash it into a narrow range, because that makes these models much better behaved. Regression struggles when the Y values span a huge range. The canonical example: if you're modeling the net worth of people, the distribution has a long right tail, with people like Elon and Jeff on the right side and the rest of us on the left. To model a big long-tailed distribution, you take the logarithm, squash everything into a narrow range, and regression behaves much better.

Here, most of the counts are going to be zero, but some of the counts could be very high. Taking the logarithm makes things much better behaved, so we take the logarithm, and that is actually our model. I know many of the numbers are zero and log of zero is not defined, so we just add the number one to all the counts to avoid that kind of technical arithmetic problem. But conceptually, this is the model we want to fit.

Given that we've postulated this model and we have this data, the co-occurrence matrix, how can we actually find the weights, the b's and the w's? What should we do? Go back to the fundamentals of regression and think about it conceptually: you have a model with some weights, there's data you can use to train it, and you need to find the best set of weights. What does "best" mean here?

Student: The lowest error.

Exactly. There are many ways to measure error; the simplest thing here is squared error, which is what you're getting at. You take the actual value and the predicted value, take the difference, square it, and minimize the sum over all entries. If your model exactly nails every number in the co-occurrence matrix, the error is zero. So we literally just do that: this is the data; predicted value, actual value, difference squared, added up, minimized.

Student: In the loss function, how is this capturing the context? Unless my input data carries that context, how will this differentiate based on where a particular word is used? Say I'm talking about the word "banana": it's a fruit in some contexts, but I could also say "he's going bananas". Those are two different contexts, and the same model needs to be able to tell me that banana is the right word in one context and the wrong word in the other, or correct in both.

Very good question; let's spend a minute on that. I'm going to swap to my iPad. Let's assume this is our co-occurrence matrix.
We have words going from "a" all the way to, let's say, "zebra": all the words in our vocabulary, down the rows and across the columns. Now look at the rows for "apple" and "banana". Every number in apple's row measures, for each word in the vocabulary, how many times that word and "apple" show up in the same sentence. It is not measuring, to your point, how many times apple and banana show up together; it's apple's co-occurrence profile against every word.

Now, if apple and banana are sort of interchangeable (let's say, just for argument's sake, perfect synonyms), what do we expect those two rows of numbers to look like?

Student: Very similar.

Right. If two words are related, their row vectors in the co-occurrence matrix are going to be very, very similar. That is how context comes into the co-occurrence matrix. So what we want is this: if embeddings can recreate the same pattern of numbers in those two rows, they are actually capturing the underlying context. Words which are similar will zig and zag together the same way through the co-occurrence matrix.

Student: What's up with the diagonal of the co-occurrence matrix, where you have apple paired with apple?

Oh, I see. You can typically just ignore the diagonal; all the action is in the off-diagonal entries.

So that's basically the idea: words which are very similar will have a very similar pattern of numbers, and any embeddings that can recreate that pattern are capturing the underlying reality. If two words are unrelated (say the second word is, of course you know what I'm going to say, "tensor"), their two row vectors won't have any connection to each other. If you look at something like the correlation of those two rows, it's going to be around zero. Words which are interchangeable will have a very high correlation. Words which are antonyms and never show up in the same place together may have a highly negative correlation, close to minus one for instance. That's the intuition behind these row vectors.

So the point is: this co-occurrence matrix captures all this word-word correlational structure, and any embedding that can recreate the matrix must have captured that structure as well. You can't recreate something like this with great fidelity unless you have some notion of what's going on under the hood. That's the basic idea.
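(A toy illustration of that row-vector intuition, with made-up counts over the same five context words.)

```python
# Related words have similar co-occurrence rows; unrelated words don't.
import numpy as np

apple  = np.array([120, 80,  3, 40,  2], dtype=float)
banana = np.array([110, 90,  5, 35,  1], dtype=float)   # near-synonym: similar row
tensor = np.array([  0,  2, 95,  1, 60], dtype=float)   # unrelated: very different row

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

print(corr(apple, banana))   # close to +1
print(corr(apple, tensor))   # much lower (negative here)
```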
Student: Connecting to Sophie's question: banana is a fruit and apple is a fruit, so banana and apple are near-synonyms. But "you're going mad, you're going bananas": how does that come together?

Oh, I see: going mad, going bananas. Those usages will also leave some correlational structure that the embeddings will hopefully catch. But a word like banana is a case of what's called polysemy: the word looks exactly the same while meaning very different things in different contexts, like the word "bank". So the stand-alone embedding is going to be some average representation of those meanings. We're not happy with that average, and we'll get around it next week when we do the contextual stuff.

All right, so back to this. Yes?

Student: I didn't understand how we get the mean squared error here, because we haven't calculated the embeddings from the data set; we're trying to calculate them.

Right, and it's just like regression, where you have beta-one times X1 plus beta-two times X2: the betas are what the regression produces for us. The embeddings are exactly that; they're just coefficients we're trying to figure out. The data is only the X's, the X_ij. So what you can do is start with some random values for all these things and then keep trying to improve them, to minimize the error, starting from those random values. Are you folks aware of any algorithm that lets us take a random starting point and then minimize some notion of error?

Student: How do you know it's actually random?

That's actually a very deep question, and a tough one, because ultimately the random number is coming from a computer, and we know how a computer runs: it's deterministic at the end of the day. So we actually use something called pseudo-random numbers, and there's a whole specialized field of math that asks: how can I get random numbers that are sufficiently random even though they come from a deterministic, non-random process? We can talk offline about it, but fundamentally all these systems have random number generators built in; we cross our fingers, hope for the best, and use them.

So, coming back to this:
We can start with random values for these weights and then try to minimize the squared error. Are you folks aware of any algorithm that can help us do that?

Student: Gradient descent.

Yes, gradient descent, which comes to the rescue again. And since we are cool, we'll do stochastic gradient descent. Gradient descent actually doesn't care what the function is, as long as you can calculate a gradient from it. It doesn't have to be a neural network: any mathematical function works, as long as it's differentiable and gives you a good gradient. This isn't a neural network per se, but we can still use gradient descent on it. So we do that, and when we're done, we'll have calculated some nice embeddings. We'll also have calculated all the biases, but we don't need them anymore; we can just throw the biases out, because we only care about the embeddings and how they connect to each other.

Student: So when you're doing that regression, you're predicting the co-occurrence matrix?

Exactly. Let me show a very quick numerical example. Assume for a moment that each embedding has two dimensions: word 1 has vector W1 = (W11, W12) and bias B1, which is just a number, and word 2 has vector W2 = (W21, W22) and bias B2. And say the pair "deep", "learning" occurred 104 times in the co-occurrence matrix. Then the actual value is log(104), and our prediction is B1 plus B2 plus the dot product, W11·W21 + W12·W22, all quantities we don't know yet. So all we do is take the difference,

log(104) - (B1 + B2 + W11·W21 + W12·W22),

and square it. Then we do the same exact thing for every other word pair, add the whole thing up, and say: gradient descent, minimize. It then has to find the B's and the W's for every word. That's actually what's going on. Make sense? All right.
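(Here's a runnable toy version of that procedure: numpy, plain gradient descent on the simplified, unweighted objective above, with a made-up four-word co-occurrence matrix. Real GloVe also weights each term, which this sketch skips.)

```python
# Toy fit of the model: log(1 + X_ij) ~ b_i + b_j + W_i . W_j
import numpy as np

rng = np.random.default_rng(0)
vocab = ["deep", "learning", "extreme", "tensor"]
V, D = len(vocab), 2                      # vocabulary size, embedding dimension

X = np.array([[  0, 104,  3, 0],          # made-up symmetric co-occurrence counts
              [104,   0, 50, 1],
              [  3,  50,  0, 0],
              [  0,   1,  0, 0]], dtype=float)
Y = np.log(1.0 + X)                       # add 1 so log(0) never occurs

W = rng.normal(scale=0.1, size=(V, D))    # embedding vectors: the "weights"
b = np.zeros(V)                           # one bias per word

lr = 0.02
for epoch in range(5000):                 # plain gradient descent, pair by pair
    for i in range(V):
        for j in range(V):
            if i == j:
                continue                  # ignore the diagonal
            err = b[i] + b[j] + W[i] @ W[j] - Y[i, j]   # predicted minus actual
            dWi, dWj = err * W[j], err * W[i]           # d(err^2)/dW, the 2 folded into lr
            W[i] -= lr * dWi
            W[j] -= lr * dWj
            b[i] -= lr * err
            b[j] -= lr * err

pred = b[:, None] + b[None, :] + W @ W.T  # reconstruction check
print(np.round(pred, 2))                  # off-diagonal entries should track Y closely
print(np.round(Y, 2))
```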
By the way, I assumed here that the embeddings are vectors of dimension two. That's an arbitrary decision I made just to show you how it works, because I was doing it by hand. More generally, we get to choose how long these vectors are. The longer the vector, the more interesting ways it can reproduce the co-occurrence matrix; it has more flexibility. But the longer the vector, what risk do you run?

Student: Overfitting.

Right, because these are all parameters at the end of the day: the more parameters you have, the more risk of overfitting. So you get to choose how big these things are. Yes?

Student: Don't you find it surprising that we're able to fit a model with a lot more parameters than data? Usually in machine learning you'd like to not have a lot of parameters, but here we'll have the number of dimensions times the vocabulary, more parameters than data points.

Well, in this particular case, as it turns out: suppose you have only 10 words and, to keep the math simple, a two-dimensional vector for each. That's 10 × 2 = 20 numbers, plus 10 biases for the words, so 30 parameters. But the matrix has 10 × 10 = 100 entries. Because the matrix has on the order of n² entries, you have a lot more numbers than parameters. In this particular case you have more data than parameters, so that problem doesn't apply here. It does show up in other cases, though, and there's some very interesting research in neural networks which suggests that the traditional assumptions about data and overfitting can be called into question in some situations. Happy to tell you more offline; if you're curious, just Google "double descent". But in this case, it's not a problem.

So what that means is that we can choose how big these things are. Compare: one-hot word vectors, with a one in one position depending on the word and zeros everywhere else, are long vectors, as long as the vocabulary, as we saw earlier. Word embeddings, on the other hand, can be very dense. The numbers that make up these embeddings are figured out from the data, so they can be anything: the first dimension may stand for some combination of brightness plus speed plus animal-ness or something. We have no idea what it means. All we know is that it's able to reproduce the co-occurrence matrix really well, so it has probably figured something out.
And so we can keep these vectors really short. Word embeddings tend to be very dense (not zeros and ones, but arbitrary numbers), much lower-dimensional, and of course learned from data.

Once you run GloVe on this data, doing gradient descent and so on, you actually come up with embeddings, and then you can plot them. Here they're not literally plotting the first two dimensions: they're using a particular technique called t-SNE, which is a way to take long vectors and project them into 2D space for visualization purposes. And you can see some very interesting things showing up. They've plotted the embeddings for brother, nephew, uncle, sister, niece, aunt, and so on, all showing up here. The embedding for man, the embedding for woman; sir, madam; empress, heir, duke, emperor, king. You get the idea. Clearly there are patterns here: things which are similar in nature are all hanging out together in the same part of the space. Which is comforting, good to know.

But as I mentioned earlier, it's not just that similar things happen to be near each other: direction also matters, and beautiful things happen when you look at directions. For instance, say you want to go from "man" to "brother": you start at man and travel along this arrow to get to brother. So this arrow carries some notion of a person becoming a sibling. You would hope that if you take that same arrow and apply it starting at "woman", the woman becomes a sister. And sure enough, she does.

This is called word-vector algebra, or embedding algebra, and these relationships actually show up in the data. We didn't tell it any of these things. We literally just gave it the co-occurrence matrix and asked it to reproduce it. I find it pretty shocking that these things are actually true, and it gives us evidence and comfort that whatever has been learned does have some deep connection to the underlying nature of what's going on. It's not some statistically fluky artifact.

Student: So similarity is established by context, by adjacency to other words, and not by two words appearing in the same sentence, right? Because synonyms often won't appear in the same sentence.

Right. They won't appear in the same sentence, but their patterns of co-occurrence with everything else will be the same, which is exactly what we've been able to reproduce with these embeddings. That's the key idea.
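(A sketch of that embedding-algebra demo. It assumes a pretrained GloVe text file on disk, e.g. "glove.6B.100d.txt" from the Stanford GloVe release; the file path and the brute-force nearest-neighbor search are illustrative, not optimized.)

```python
import numpy as np

def load_glove(path):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *nums = line.split()     # each line: word, then its numbers
            vectors[word] = np.asarray(nums, dtype=float)
    return vectors

glove = load_glove("glove.6B.100d.txt")

# Apply the "becoming a sibling" arrow (brother - man) starting at woman.
query = glove["brother"] - glove["man"] + glove["woman"]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

best = max((w for w in glove if w not in {"brother", "man", "woman"}),
           key=lambda w: cosine(glove[w], query))
print(best)   # hopefully: "sister"
```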
Student: My question is about how we're able to see all these directions in a 2D plot versus the full multi-dimensional space. This family-relationship direction is confirmed here, but how does projecting down not mess up the other parts of the space?

This is just a visualization. As you'll see, GloVe embeddings come in lots of different sizes; this plot, I think, uses the 100-dimensional embedding, projects it to 2D space using a particular technique, and then looks to see what's going on.

Student: If the input data, the co-occurrence matrix, is biased, aren't we amplifying that bias?

Yes, we are. It's a great observation. Any data you scrape from the internet and use for this sort of modeling exercise will be subject to all the biases that produced the data in the first place, and the model will faithfully learn those biases; if you're not careful, it will perpetuate them. That's a very important topic that we unfortunately won't cover in this course because of time constraints, but it's something you always have to worry about when you're building these models.

Student: How do you think about the dimensionality of the embeddings, the actual one, not the 2D representation?

The one we choose is in our hands, so you should think of it as a hyperparameter, much like the number of hidden units to use in a particular hidden layer. I would start small; if it solves the problem you're trying to solve with these embeddings, great. If not, keep increasing it. At some point there might be a flattening-out, or an overfitting sort of dynamic, and then you stop. Just think of it as a hyperparameter.

Student: Do you see any benefit in practice to using penalized regression here, to make the embeddings sparser or lower their magnitudes?

Yes. There are lots of techniques for applying regularization in the estimation itself of all these numbers; happy to give you pointers. I'm just going with the simplest version possible.

Student: Am I understanding why overfitting is even a problem in this case? We're not doing any out-of-sample prediction, so wouldn't you want the embeddings to be high-dimensional so they can capture all the relationships?

Interesting question. So the question is: given that there's no notion of a test set, no out-of-sample data we're going to evaluate these things on, why do we really care about overfitting?
Don't [48:16] should we do the best we can to capture [48:18] everything in the data, right? [48:20] Well, [48:21] the thing is [48:22] even when you're not trying to use it [48:24] for out of sample prediction, you do [48:26] want to make sure that your model only [48:29] captures the true patterns and not the [48:31] noise. [48:32] In every data set, there's always noise. [48:35] Right? And you want it to capture a [48:36] signal but not the noise. [48:38] And regardless of what you use it for. [48:40] Because if it captures the noise, then [48:42] the insights you draw from the word [48:44] embeddings may be flawed. [48:45] That's the reason. [48:48] Okay. [48:49] Um all right, so let's keep going. So, [48:51] here the algebra is brother minus man [48:53] plus woman is sister. [48:55] That's it. Human biology reduced to a [48:57] single sentence. [48:58] All right. So, now the pros and cons of [49:00] these things are you should use [49:02] something like a Glove embedding if you [49:04] don't have enough data to do to to sort [49:07] of [49:07] to learn a task-specific embedding for [49:10] your own vocabulary. As we As I'll show [49:11] you in the Colab, you can actually learn [49:13] these things just for your own data set [49:14] if you want. You don't have to use these [49:16] Glove embeddings. But the reason to use [49:18] these pretrained embeddings is that if [49:20] you're working with natural language, [49:22] you know, the word is the word, right? [49:24] It means something. [49:25] And so, there's no reason for you to [49:28] have for your model, for your little use [49:30] case, for you to actually somehow learn [49:32] all the fundamentals of English. [49:35] The fundamentals of English are the [49:36] fundamentals of English. May as well [49:37] learn it once and then piggyback on it. [49:40] So, that's the whole idea of using [49:42] pre-trained embeddings. [49:43] Because it These things are all common [49:45] aspects of language. May as well learn [49:47] them using all the data you can throw at [49:48] it and then you can sort of fine-tune [49:50] and tweak and adapt to your particular [49:52] use case. [49:53] Right? So, if you and this particular [49:55] useful when you don't have a lot of data [49:57] in your particular use case. [49:58] Uh right? That's one big advantage. Now, [50:01] it does have the drawback that this [50:03] embedding will not be customized to your [50:04] data. [50:05] Right? For example, if you're trying to [50:06] build an application for a medical or [50:08] legal use, it's going to have a lot of [50:10] jargon. [50:11] Right? And this pre-trained embedding [50:13] trained on all of Wikipedia may not [50:14] capture enough of the jargon and know [50:16] its meaning really accurately. So, what [50:18] you want to do is you want to take this [50:19] thing. You may still want to take this [50:21] thing and then you can adapt and [50:22] fine-tune it using your jargon-packed, [50:25] heavy, domain-specific data set. [50:28] Okay, those are some of the things to [50:29] keep in mind. [50:32] And of course, we can also learn it from [50:33] scratch if you want and the collab I [50:35] demonstrate all these options. [50:38] So, when you're working with embeddings [50:39] in Keras uh Keras, so what we do is [50:41] remember STI [50:43] where we after we standardize and [50:45] tokenize and index, right? At this [50:48] point, we go from integers to vectors [50:50] and so far we have been using integers [50:51] to one-hot vectors. 
[48:58] Now, the pros and cons. You should use something like a GloVe embedding if you don't have enough data to learn a task-specific embedding for your own vocabulary. As I'll show you in the Colab, you can learn embeddings just for your own data set if you want; you don't have to use GloVe. But the reason to use pretrained embeddings is that if you're working with natural language, a word is a word: it means something. There's no reason for your model, for your little use case, to somehow relearn all the fundamentals of English. The fundamentals of English are the fundamentals of English; may as well learn them once and then piggyback on them. That's the whole idea of pretrained embeddings: these are common aspects of language, so learn them using all the data you can throw at the problem, and then fine-tune, tweak, and adapt to your particular use case. This is particularly useful when you don't have a lot of data of your own. That's one big advantage.

[50:01] The drawback is that the embedding will not be customized to your data. For example, if you're building an application for a medical or legal use case, your text will have a lot of jargon, and a pretrained embedding trained on all of Wikipedia may not capture enough of the jargon or know its meaning accurately. So you may still want to take the pretrained embedding and then adapt and fine-tune it on your jargon-heavy, domain-specific data set. Those are some of the things to keep in mind. And of course you can also learn embeddings from scratch; the Colab demonstrates all of these options.

[50:38] So, when you're working with embeddings in Keras: remember STI, where we standardize, tokenize, and index. At that point we go from integers to vectors, and so far we have been mapping integers to one-hot vectors. Here, instead, we're going to use embedding vectors, either ones we learn ourselves or ones we reuse from GloVe. So we tell Keras's TextVectorization layer to do only STI, and then we use a new layer, called the embedding layer, to do the encoding. That's how we divide up the work.

[51:14] Let's take a look at this before we switch to the Colab. Before, we told Keras that this layer's output mode should be multi-hot or whatever. Here, we don't want it to encode anything; we just want the integers back. So we tell it: output mode "int". That's the first change. If you ask for int, it stops with STI and just gives you the integers.

[51:41] Then, because incoming sentences have different lengths, we normalize them so they're all the same length. We choose a maximum length. Say the max length is five: "cats sat on the mat" is exactly five tokens, so it fits perfectly. If something is shorter, like "I love you", which is only three tokens, we pad it with a special token called the pad token. Much like the UNK token, the pad token is a special token used for padding; you will see that Keras uses zeros for this padding, filling the sequence all the way to the end. And if something is much longer than five, you just truncate it and keep the first five tokens. That's how we get all the sentences to the same length.

[52:43] Once we've done that, we go to the embedding layer, which is actually very simple. What is an embedding? It's just a vector, and we need one for every token. Of course, we're going to learn these vectors. So say our vocabulary after the STI process has 5,000 tokens: each token gets an embedding vector, and we choose the dimension of that vector. We set it up with keras.layers.Embedding, telling it the number of rows (max tokens, i.e., the vocabulary size we're working with) and how long each embedding vector should be. The rows and the size of the columns: that's the embedding layer. We'll use it in a second.
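Here's a sketch of that division of labor, assuming tf.keras / Keras 3 (toy corpus and sizes are mine): TextVectorization stops at STI and hands back padded integer sequences, and the Embedding layer then turns each integer into a vector.

```python
# Sketch: TextVectorization does only STI; Embedding does the encoding.
import keras

max_tokens = 5000    # vocabulary size = rows of the embedding table
max_length = 5       # pad/truncate every sentence to this many tokens
embed_dim = 16       # length of each embedding vector (our choice)

vectorizer = keras.layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",                  # "give me int": stop after STI
    output_sequence_length=max_length,  # 0-pad or truncate to max_length
)
vectorizer.adapt(["the cat sat on the mat", "i love you"])  # toy corpus

embedding = keras.layers.Embedding(input_dim=max_tokens, output_dim=embed_dim)

ids = vectorizer(["i love you"])  # three real tokens + two trailing 0 pads
vectors = embedding(ids)          # shape (1, max_length, embed_dim)
print(ids)
print(vectors.shape)
```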
[53:34] I just want to show it to you here so it's slightly clearer. When an input sentence arrives, the text vectorization layer applies STI to it and truncates or pads it to the max length as needed. So say this phrase comes in: STI gives you the tokens, plus pad, pad, because the max length is five, and then these are the corresponding integers. The embedding layer then just looks up the corresponding vectors: here we need the vectors for 23, 9, 5, 0, and 0, so we look up those rows, and boom, this is the resulting output. Whatever input sentence comes in, we now have five embedding vectors looked up from the embedding layer.

[54:17] Note that this is a table: "I love you" comes in and becomes this table. As we have seen before, these networks take vectors as inputs, so we need to turn the table into a vector. As before, we can either concatenate all the rows into one long vector, or find a way to average or sum them. The simplest thing is probably just to average, and that's what we'll do here, using what's called the GlobalAveragePooling1D layer. All it does is take whatever table you give it and average each dimension: first dimension average, second dimension average, and so on.

[55:04] So that's the whole pipeline: the phrase comes in; STI gives you the integers, padding or truncating as needed; we look up the embeddings from the embedding layer to get the table; we do global average pooling on it, and we're done. The resulting vector can then be passed into hidden layers just like we normally do. I'm going over this a little fast, but make sure you look at it afterwards and understand every step; the Colab will mirror it exactly.
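A tiny sketch of what GlobalAveragePooling1D does, with toy numbers: a (batch, tokens, dims) table becomes a (batch, dims) vector by averaging over the token axis.

```python
# Sketch: GlobalAveragePooling1D averages the looked-up table over tokens.
import numpy as np
import keras

table = np.arange(20, dtype="float32").reshape(1, 5, 4)  # 5 tokens, 4 dims
pooled = keras.layers.GlobalAveragePooling1D()(table)

print(np.asarray(pooled))     # identical to table.mean(axis=1)
print(table.mean(axis=1))
```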
[55:36] All right, let's switch to the Colab. Can folks see this okay? We'll do the usual: import all the stuff we need, and then, because I want to plot some of the loss and accuracy curves to see what's going on, I'll bring in the plotting functions from the previous Colabs. And I think I already downloaded the data... no, it's not there; okay, let's do it again. This is the same songs data set that we looked at on Monday: roughly 49,000 examples, as we saw before. We'll one-hot encode them.

[56:25] All right, there's a bunch of stuff here that we already covered in class. This URL has all the GloVe vectors available for download; I downloaded them before class because it takes a few minutes, and I've also unzipped them. Let's look at the first few, and we'll create an easier-to-view version of these GloVe vectors.

[56:54] I'm going to use the vectors that are 100 long, though they come in several different sizes. So we have 400,000 word vectors, each 100-dimensional, all calculated from Wikipedia using the model we described, fit with gradient descent. And this is the vector for the word "movie". I don't know what these dimensions mean, but there is something going on; it has figured stuff out. The proof is in the pudding, though.

[57:24] Now we'll set up the text vectorization and embedding layers like we saw before. I'm going to use a max length of 300 for the songs, since all the sequences have to be the same length. You might wonder: why 300 and not, say, 400 or 200? Typically you look at the length distribution of the songs you have, looking for an 80/20-style cutoff, and in this case it turns out that 90% of the songs in our data set have 300 words or fewer. So I'll just go with 300; it's pretty good. If you instead sized everything to the longest song, which might be 3,000 words with hardly any songs that long, you'd be wasting a lot of capacity. So you're just being a little pragmatic here.

[58:16] Then, as before, for the vocabulary itself we tell Keras to use the most frequent 5,000 words when doing STI, and we tell it the output mode is int, like we saw before. (Okay, this is a very dangerous situation where somebody is remotely changing the notebook in another tab somewhere; fingers crossed.) So we have all this, as I've covered, and now we'll adapt this layer, as we have seen before, using all the lyrics we have. Once we do that, we'll take a look at the first few entries.
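A minimal sketch of loading the unzipped GloVe file into a dict. This assumes the standard glove.6B.100d.txt format (one word per line followed by 100 floats); the exact path depends on where you unzipped the download.

```python
# Sketch: read glove.6B.100d.txt into a word -> 100-dim vector dict.
import numpy as np

glove = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        word, *values = line.split()
        glove[word] = np.asarray(values, dtype="float32")

print(len(glove))           # 400000 words
print(glove["movie"][:5])   # first few of the 100 dimensions
```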
[59:08] And here's a very important detail. On Monday, when we asked for multi-hot encoding, the zeroth position was UNK. But here, UNK actually gets index one, because the zeroth position is reserved for padding; you can think of it as the empty string, which is how Keras will print the pad token. So position zero is the pad token, and position one is the UNK token. That's an important thing.

[59:41] So let's say we vectorize "HODL you're the best." Do you think HODL is going to be among those 400,000 Wikipedia word vectors? Not yet. Let's try it. And as you can tell, HODL is an unknown word: that's why index one shows up here. Index one is UNK; zero is pad. So we get UNK for HODL, then "you are the best", and then everything from that point on is a zero, because we're padding all the way out to length 300. That's why you see all these zeros. All right, now let's run everything through the vectorization layer, and then we'll get to the embedding layer.

[01:00:44] Next there's a bit of Python housekeeping to create a nice, easy-to-look-at matrix of the GloVe embeddings. This embedding matrix has only 5,000 words, each 100 long. Why only 5,000, even though we downloaded 400,000 vectors? Because we told Keras to take the most frequent 5,000 words in our corpus, the vocabulary only has 5,000 entries, so we grab the GloVe vectors just for those 5,000 words that Keras has chosen. That's our embedding matrix. And if you look at the first few rows, the first two should be all zeros, because they are pad and UNK, which GloVe obviously doesn't know about; from the third row on, you start getting real numbers.
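A sketch of that housekeeping step, reusing the `vectorizer` and `glove` objects from the sketches above (assuming the vectorizer was adapted on the lyrics with max_tokens=5000): a (5000, 100) matrix holding the GloVe vector for each word Keras kept.

```python
# Sketch: build the embedding matrix for the Keras vocabulary. Rows 0 (pad)
# and 1 (UNK) stay all-zero because GloVe has no vectors for them.
import numpy as np

vocab = vectorizer.get_vocabulary()     # ['', '[UNK]', 'the', 'you', ...]
embedding_matrix = np.zeros((len(vocab), 100), dtype="float32")
for i, word in enumerate(vocab):
    if word in glove:                   # pad, UNK, HODL, ... miss here
        embedding_matrix[i] = glove[word]

print(embedding_matrix.shape)           # (5000, 100)
print(embedding_matrix[:2].sum())       # 0.0: the pad and UNK rows
```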
[01:02:02] Next, we set up the embedding layer. We tell it how many rows it has, which is just the vocab size (max tokens), and the embedding dimension, which is 100 because the GloVe vectors are 100 long. And here's the thing: you can tell the embedding layer to use the matrix we just built as its weights, because we already know what the embeddings are; we downloaded them from GloVe. So we initialize it with that embedding matrix, and we tell it: don't train. When we do backpropagation later on, don't change any of these weights, because somebody (Stanford) spent a lot of money creating them for us. We don't want to change them further; just freeze them and use them as they are. The mask_zero business I'll come back to later; don't worry about it for the moment.

[01:02:55] Once we've done that, we're ready to set up our model, and the model is pretty simple. A Keras Input whose length is the sentence length, 300; the input runs through the embedding layer, and out comes a 300-by-100 table; we global-average-pool that into a 100-element vector; and then we're back on familiar ground: a dense layer with eight ReLU neurons, and then the final output layer, a three-way softmax as before (hip hop, rock, pop). We tell Keras that's our model and summarize it.

[01:03:34] And you can see here that the total parameters are 500,835, but the trainable parameters are only 835. That's because the total includes all the GloVe embeddings plus the things we added on top, like the hidden layer; but we've told Keras to freeze the GloVe embeddings and not train them, so only the rest is trainable. That's the 835.
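A sketch of that frozen-GloVe model (variable names are mine, not fixed by the lecture). The parameter count matches the summary: 5000 × 100 = 500,000 frozen embedding weights, plus 100 × 8 + 8 = 808 and 8 × 3 + 3 = 27 trainable weights, giving 500,835 total and 835 trainable.

```python
# Sketch: the frozen-GloVe genre model described above.
import keras

glove_embedding = keras.layers.Embedding(
    input_dim=5000,
    output_dim=100,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,  # freeze: backprop must not touch the GloVe weights
    mask_zero=True,   # treat token 0 (pad) as masked; explained later
)

inputs = keras.Input(shape=(300,), dtype="int64")   # 300 token ids per song
x = glove_embedding(inputs)                         # (None, 300, 100) table
x = keras.layers.GlobalAveragePooling1D()(x)        # (None, 100) vector
x = keras.layers.Dense(8, activation="relu")(x)
outputs = keras.layers.Dense(3, activation="softmax")(x)  # hip hop/rock/pop

model = keras.Model(inputs, outputs)
model.summary()  # Total params: 500,835 -- Trainable params: 835
```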
[01:04:03] When we do the global average pooling, don't we lose whatever sense of meaning we gained from the embeddings, since we average very different embeddings together? Sorry, say that again; I missed the first part. If we average the embeddings of "apple" and "learning", for instance, which are very different words used with different meanings, don't we lose that? We will lose a bunch of stuff, yes. Any time you average anything, you lose some nuance. The real question is: despite that averaging, is it good enough for you? Sometimes it is; very often it's good enough, as it turns out. But as you will see when we get to contextual embeddings, there's just a better way to do it. It requires bigger models and more powerful machinery, and that's where you go from the foundations to the advanced stuff.

[01:04:56] When we're doing optimization, isn't it often better to optimize everything together rather than one part of the system and then the other? So why wouldn't we also want to change the embeddings? I understand wanting to keep the weights that somebody spent a lot of money finding, but wouldn't we find embeddings more specific to our problem if we let everything be trainable? Absolutely. And in fact, you'll see in the Colab that we do exactly that next. I just want to show people that you don't have to: you start without training the embeddings, because it's much faster, and then you train everything and see if it gets better. Sometimes it does, which is great; sometimes it doesn't. I will also show you (I'll probably run out of time, so I'll do it on Monday) how to build your own embeddings from scratch without using GloVe. All the possibilities will be covered.

[01:05:55] So, to come back to this: this is the model we have. Let's take a look at the first few embedding vectors. By the way, model.layers gives you every layer as a list, and you can grab any layer you want and look at its weights; it's very handy. Looking at the weights, you can see the first two vectors are all zeros, because those positions stand for pad and UNK, and then we have everything else. Everything looks fine so far. Now we just compile and fit: as usual, Adam, cross-entropy, accuracy. It's going to take a few minutes.
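A sketch of that compile/fit step plus the model.layers trick; `train_ds` and `val_ds` are hypothetical stand-ins for the songs data, and the loss assumes one-hot labels (use the sparse variant for integer labels).

```python
# Sketch: compile, fit, and inspect the embedding weights via model.layers.
model.compile(optimizer="adam",
              loss="categorical_crossentropy",   # labels assumed one-hot
              metrics=["accuracy"])
history = model.fit(train_ds, validation_data=val_ds, epochs=10)

# model.layers is a plain list, so grab any layer and inspect its weights;
# here, the embedding table (layer 1, right after the input layer).
emb = model.layers[1].get_weights()[0]
print(emb[:2])   # the pad and UNK rows: all zeros
```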
[01:06:39] While it's running: what you will see in this Colab is that, in this particular case, the embeddings actually don't help a whole lot. Why do you think that is? Could it be because we're averaging a lot of stuff, and maybe that's hurting us? Or: the embeddings were pretrained on some corpus, like Wikipedia, that's a little different from the language we tend to use in song lyrics, so maybe its ability to extract the meaning of, say, "candy" in a song lyric is limited, because it's thinking of all the other ways the word is used.

[01:07:20] Yeah, so there could be a mismatch between the corpus the pretrained embeddings were trained on and the corpus you're working with right now. That's one big reason. The other reason is that we have about 50,000 examples, which is a lot of data, and when you have a lot of data you may not need any of this. Pretrained embeddings tend to do really well when you don't have a lot of data, because you get to piggyback on what they learned from all of Wikipedia. So the rule of thumb is: when your data is really small, try a pretrained model. That's what you saw with the handbags-and-shoes classifier: we had 100 examples, and we used ResNet to get to essentially 100% accuracy. The same logic applies here.

[01:08:09] All right, let's see what's happening. Okay, it's done, so we'll plot. This looks like a very well-behaved loss curve: there doesn't seem to be any massive overfitting going on; training and validation are moving nicely in lockstep. Let's evaluate. Okay, 63%, which is not great. It's not as good as what we saw before, when we used all 50,000 examples and trained something from scratch, and that's just because we have lots of examples here, so these pretrained embeddings aren't as helpful as they could be. With a small data set, they could be very helpful.

[01:08:52] And now we come to what was pointed out earlier: why can't we just optimize these embeddings too? Why treat them as sacred? Let's just unleash backprop on them and see what happens. So here we retrain, but we set trainable equals true for the embedding layer. That's the key step; otherwise everything is unchanged. We'll run it and see whether it beats the roughly 63% accuracy we got before. And the thing is, you can never be sure, because it may start to overfit; you just have to see empirically what's going on. There are no guarantees.
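A sketch of that fine-tuning variant, reusing the model from the earlier sketch: same architecture, but with the embedding layer unfrozen so backprop can adapt the GloVe weights.

```python
# Sketch: unfreeze the embedding layer and retrain.
glove_embedding.trainable = True   # the key change

# After flipping `trainable` you must re-compile for it to take effect.
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=10)
```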
[01:09:47] Any questions while it's training? In that first graph, the training accuracy was still increasing; might that suggest you could train for even more epochs? Correct. Exactly. In that curve, we saw the training accuracy continuing to increase, and typically training performance will keep improving the more you train. The key question is whether validation is also improving. If validation keeps improving, there's a little more gas left in the tank and you can keep going; if it starts to flatten, or worse, starts to go down, you want to pull back.

[01:10:23] You capped the vocabulary at the most common 5,000 words, and the width of the table was 100. What is the 100? The 100 is just the length of the GloVe vector. Does that mean it can only capture how a word relates to 100 other words? No, no. We're saying that every word's intrinsic meaning can be captured by a vector of 100 dimensions. Those dimensions mean something; we just don't know what. The first dimension could encode color, the second some notion of location, the third something like time of year. We have no idea.

[01:11:01] And the pretrained model has those dimensions already; we're not going to learn them. We don't know what they are, and the people who created the model don't know what they are either. All they know is that for each word they learned a 100-long vector, and those 100-long vectors were able to roughly recreate the co-occurrence matrix. Then they probed the result using visualizations like man/woman/sister/brother, and it seems to fit what you would expect.

[01:11:29] Can you think of it as analogous to the convolutional networks, where we chose the number of kernels? If you have 32 kernels, it's sort of like 32 things it can learn. I think that's actually a great analogy; I love it. Much like we got to decide how many filters to have, here we get to decide how long the embedding dimension should be, and our hope is that the more room we provide, the more complicated things it will pick up. At the same time, you don't want too many dimensions, because it will start picking up noise, and that's never a good thing.
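As an aside on "recreate the co-occurrence matrix": in the published GloVe paper (Pennington et al., 2014), which the lecture doesn't spell out, the vectors are fit by minimizing the weighted least-squares objective

$$
J \;=\; \sum_{i,j=1}^{V} f(X_{ij})\,\bigl(w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\bigr)^2,
$$

where $X_{ij}$ is the co-occurrence count of words $i$ and $j$, $V$ is the vocabulary size, $w_i$ and $\tilde{w}_j$ are the learned (here 100-long) vectors, $b_i, \tilde{b}_j$ are biases, and $f$ is a weighting function that down-weights rare pairs. So each dot product is pushed toward the log co-occurrence count, which is the sense in which the vectors "recreate" the matrix.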
[01:12:07] Another question on this side? Go ahead. Why do we use embeddings, and not the actual rows of the co-occurrence matrix, to represent words? Why do we need this abstraction? That's actually a good question. One immediate reason is that such a row is 500,000 entries long; you want a compact, dense representation of a word. The second reason is that the raw row is subject to all the counts of the Wikipedia corpus; it's not normalized. You'd need to normalize it so that the dot product of any two rows lands in some narrow range; otherwise things aren't comparable. Now, both of these objections can be handled: you can normalize, you can reduce the dimensionality, and so on, and in fact that used to be a very common way of doing it. But what people have discovered is that the way we learn embeddings now tends to be much more effective in practice.

[01:13:10] So what this process does is create an n-dimensional, admittedly incomprehensible, matrix that captures, in essence, a summarized version of these relationships? Correct: a compact representation of the relationships, one that isn't subject to the size of your vocabulary. You have 500,000 words today; tomorrow somebody coins a word like "selfie" that didn't exist five years ago, and your vocabulary grows a little. The embedding representation stays very compact and tends to have a much longer shelf life.

[01:13:49] All right, let's see where we are. Okay, evaluate: almost 69%. It was 63% and went to 69%, so clearly training the whole thing, including the GloVe weights, actually helps here. And that begs the question: if training GloVe helps, maybe we should just train the whole thing from scratch. Why the heck not?

[01:14:19] So what we'll do is create our own embeddings and train them. Here we don't have to worry about co-occurrence matrices and so forth, because we have a very specific objective: we want to be very accurate at predicting genre for these songs. The people who worked on GloVe didn't have a specific objective; they just wanted embeddings that were generally useful. Here we want to be specifically useful for genre prediction. So we can train the whole thing ourselves: we put in a fresh embedding layer, where I just arbitrarily chose 64 as the dimension instead of 100 (it will run faster), and then it's the same thing: global average pooling, activation, and so on. Then we run it. We'll see if it finishes in the next minute.
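A sketch of that train-from-scratch variant: a fresh, randomly initialized, fully trainable embedding layer with 64 dimensions instead of GloVe's 100 (names and the data set stand-ins are mine, as before).

```python
# Sketch: genre model with embeddings learned entirely from our own data.
import keras

inputs = keras.Input(shape=(300,), dtype="int64")
x = keras.layers.Embedding(input_dim=5000, output_dim=64,
                           mask_zero=True)(inputs)   # learned from zero
x = keras.layers.GlobalAveragePooling1D()(x)
x = keras.layers.Dense(8, activation="relu")(x)
outputs = keras.layers.Dense(3, activation="softmax")(x)

scratch_model = keras.Model(inputs, outputs)
scratch_model.compile(optimizer="adam",
                      loss="categorical_crossentropy",
                      metrics=["accuracy"])
# scratch_model.fit(train_ds, validation_data=val_ds, epochs=10)
```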
[01:15:12] And we'll see whether it actually does better than the pretrained embeddings, or the pretrained embeddings that were further fine-tuned; I don't remember what I got when I ran it yesterday. While it's running, other questions?

[01:15:25] A question regarding embeddings: when we build an embedding for a particular word, we decide on a certain number of parameters, say 100, with a weight for each. When we take a pretrained model like GloVe, each word already comes with that many parameters, so how do we redefine them if we want, say, only 10? Well, the GloVe download comes pre-packaged: the one we used is 100 long, and I think they have 200- and 300-dimensional versions as well, if I recall. Many sizes are available; you pick and choose, and I happened to pick 100. Okay, it's a bit slow, but it's actually looking promising. It's 9:55...

[01:16:21] During the CNN training in our assignments, changing the number of filters gave us more improvement than depth did. So here, would I be right in concluding that it's actually training the embeddings that gives the improvement, assuming the epoch and batch settings are unchanged? That if I really want a genuine change in performance, I should go as far as retraining the embeddings? Yeah. What we saw was that using GloVe as-is was okay, and using GloVe and then training it helped a lot. And now we're asking: what if we just abandon GloVe and train our own embeddings for our particular problem? See, GloVe is a general-purpose tool, and a general-purpose tool is a really good starting point when you don't have a lot of data. But when you have a lot of data, you should always try your own thing and see if it's any better.

[01:17:05] And in this case... whoa, come on, it's 9:55; it should finish any moment now. Right, let's just look at the result. Okay, folks: about 72 to 74%. So you can train your own embeddings, because you have 50,000 examples, and get an even better result. Thanks a lot; have a good rest of the week.