Okay. So today we start the natural language processing sequence. Just to give you a quick idea: we're going to start with what's called vectorization, then the bag-of-words model, and then we'll spend a fair amount of time on a Colab. On Wednesday we'll talk about these things called embeddings, which you'll come to appreciate over the next couple of weeks as the core atomic unit of all modern natural language processing, and for that matter vision processing as well. The following week we'll do transformers, two lectures: we'll get into the theory and then into a bunch of applications. And then lectures nine and ten will be all about LLMs. So it's going to be a lot of fun. This is one of my favorite segments of the class; of course, truth be told, every segment of the class is my favorite, so don't judge me.

All right, let's get going. So why natural language processing? The things I have on the slide here are in some sense obvious, but I think it's actually worth reminding ourselves how important text is for everything we do. Obviously, human knowledge is mostly encoded as text. The internet is mostly text; at least this was true until the advent of TikTok and YouTube. Human communication is mostly text, and cultural production, movies, books, the arts and so on, is heavily text-based. So in some sense text forms not just a big chunk of all the media that's out there; it also happens to be the way in which we think and communicate. Its primacy is, in my opinion, unparalleled in how we think about the world.

And so the tantalizing possibility is: imagine if we had an AI system which could just read and, quote unquote, understand all this text. You can imagine such a system reading all of PubMed, reading all the medical literature, and then coming back and saying: for this particular disease, this particular protein is the malfunctioning one, and that small molecule is going to dock into the protein and cure the disease, and you didn't know this. It came back and told you that. Wouldn't that be unbelievable? My feeling is that such things are going to happen; it's just not going to happen soon enough for my lifetime, but perhaps it'll happen in yours.

All right, let's continue. So NLP is in action all around us. According to Google, Google autocomplete, which uses a fair bit of NLP, apparently saves 200 years of typing time every day. I actually wasn't very impressed with this number, frankly, because billions of searches are being done every day, and I'm like: only 200 years?
So anyway, I think the more important point is that it made mobile possible. If you didn't have autocomplete, people would not be typing and pecking on their keyboards; it would be much worse, and it would have had a hugely dampening effect on e-commerce, for instance. So this humble little autocomplete has incredible impact on the world economy. The other thing I heard about, and I'm not sure if it's 100% true, but it's an interesting example: apparently the very first iPhone keyboard that came out, the soft keyboard, not a hard keyboard, had some very basic word-continuation prediction going on. So when you start typing T and H, obviously it's going to guess that E is coming next, right? That part is old news, nothing new there. But apparently the E key on the keyboard would become slightly bigger, so that when your finger goes toward it, it has a better shot of actually connecting with it. These kinds of things are used to change the UI in real time in a whole bunch of applications, and you just don't even realize it.

And of course we all know about LLMs at this point. So I asked one to write a limerick about the beauty and power of deep learning yesterday, and it says: "In a world where data flows like a stream, deep learning is more than a dream. Sifts through the noise with an elegant poise, unveiling insights that gleam." Cool, right?

All right, so let's get back to work. NLP has extraordinary potential for making products and services much, much smarter. And what I want to point out here is that even if you focus on this very simple formalism, a bunch of text comes in, a bunch of text goes out, that's it, this humble little text-in, text-out formalism has just an enormous range of applicability. Obviously, you can send a bunch of text in and ask the system to classify it: for sentiment, to route it for customer support, to figure out the intent of what the person is asking in search, to content-filter and make sure there's no toxic, abusive stuff going on. The possibilities for just text classification are numerous. But that's a use case we're all kind of familiar with, so no surprise there.

Now, text extraction we may be less familiar with. The idea is that you can look at a lot of unstructured textual data and extract all sorts of interesting entities from it. Hedge funds use it very heavily: they will extract all sorts of company information from news articles. And then, obviously, doctor's notes.
There are a whole bunch of NLP startups that will take the doctor-patient conversation, transcribe it, and then extract disease codes, diagnosis codes, medication codes, and things like that. So the possibilities for this are enormous. Of course, text summarization; we've all been doing it thanks to ChatGPT: take text in, and any kind of summary that comes out of the text is just text out. And then text generation: we can produce marketing copy, sales emails, market summaries, and so on, including, troublingly for educators, college application essays.

Code generation is a more subtle example of text out, because code is just text. So text-in, text-out also covers text in, code out. And question answering: you can take a whole bunch of documents, add a bit of text to them which is your question, and this whole thing at the end of the day is just text in, and you can use it to answer questions and therefore create chatbots for all sorts of interesting applications.

And if you look at this example, call centers, that is where a lot of money is being spent right now, building call center chatbots for text-in, text-out question answering. If you drill into this: imagine taking all the call center transcripts and the internal product documentation, service documentation, FAQs, etc., sticking it all in, and you can start to answer these kinds of questions. Yesterday, what were the top reasons customers were upset with us? What interventions made by the agent actually worked, and what did not? What characterizes the best agents from the rest? How should we grade this particular agent's interaction with this particular customer? How should we change the call center script? How should we coach the agent in real time? Every one of these applications is amenable to this very humble text-in, text-out model.

Okay. And of course everybody now knows this potential because of the advent of large language models. By the way, Google released something called Gemini 1.5 Pro a couple of days ago, and it's incredible. It's incredible, right? We'll get back to that later. But the point is that the kind of potential we have is just amazing, even for plain text in, text out.
And, as you would imagine...

>> Though we are calling it language, this is all primarily English, right?

>> Now, there are lots of multilingual models as well. There are models which are specialized to other, non-English languages, and models which are truly multilingual, polyglot models, and both of them are available right now. Many modern LLMs are actually trained from the get-go to be multilingual in a bunch of what are called high-resource languages, languages which are spoken by lots of people. But actually, it's funny you should ask that question, because of this Google Gemini model I just described. There is a language called Kalamang which is spoken by about 200 people in the world, and a researcher had created one book which is essentially a grammar manual for Kalamang, because there are no other written works in that language. So what they did is they took a whole bunch of English dialogue and this book, fed it into Google Gemini 1.5 Pro, and it translated into Kalamang at human-level proficiency. It had never seen the language before. So that's an example of this.

Yes. So to restate the question: the question text here is all the things you want to translate from English to Kalamang; the documents here are just one document, singular, the grammar book, the manual; and then what comes out is a translation. So these models, even when they're not explicitly trained on a different language, if you give them enough grammar manuals and the like, may do a pretty decent job from the get-go with no training. It's kind of a shocker. Two years ago people would have said that's impossible.

All right, so back to this. As you folks may already know, and maybe you're in fact participating in this gold rush already, lots of people are creating lots of really cool companies to take some of these ideas and turn them into really interesting products and services. So if you're not doing it and you've been thinking about entrepreneurial stuff, here's a word of advice: take the plunge. Dismissed. Just kidding.

All right. And as you can imagine, enterprise vendors are rushing to add NLP to all their products. Salesforce Einstein now has Einstein GPT. Microsoft has Copilot. The list goes on. Everybody's scrambling and really trying hard to infuse some GPT magic into whatever they're doing. Some of it is real; a lot of it is not.

Okay. So let's go to the arc of NLP progress. How did we get to these crazy times that we live in? Natural language processing is basically the effort to take language, analyze it, and make predictions with it, and so on.
The first phase of it was just handcrafted rules based on linguistics. These were linguists who would really understand the grammar of a language, and they would use deep knowledge of linguistics to figure out all these rules by which you can process and analyze natural language text. And then this other thing came along, the statistical machine learning approach, which basically said: never mind all that complicated knowledge of linguistics and grammar, why don't we simply count things? Let's count the number of times these two words co-occur; let's count that; let's count this; basically, just count a lot, and let's see if it works for predicting things, say for classifying text. And shockingly, those methods ended up being really good. They ended up being really good, and in fact they were actually better than the lovingly hand-curated, linguistically driven rules. So much so that there's a famous quote which says, "Every time I fire a linguist, the performance of the speech recognizer goes up." Obviously made in jest, but there is a kernel of truth to it.

So that's where we were, and then deep learning happened, around 2012. Then we had these things called recurrent neural networks, which are based on deep learning and which actually moved the ball forward. And then in 2017 something called the transformer was invented, and the transformer replaced everything else across the board. So in this course we're just going to leapfrog directly to transformers; we will not spend any time on recurrent neural networks. That is not to say they are dead: there's some very interesting work which is now trying to revive recurrent neural networks to make them work for these kinds of modern LLM tasks, but it's still very early days. So for now we'll just focus on transformers.

Okay. So the very high-level view of the problem here is that, like most things in deep learning, it's basically fancy regression. There is some variable X that comes in; it goes through a very complicated function along with W, which is the weights, and out pops an output. That's just the view you've always had. In this case X happens to be text. Y can be text; it could be labels; it could be numbers; it could be anything else. W is the weights, and the function is a deep neural network. At this point, when you look at this slide, it should be blindingly obvious.

So now the key question here is: how do you actually represent X? That's the key question. For pictures, for images,
we saw that we just took the pixel values, which were light intensity numbers between 0 and 255, and you could use those directly. But when a sentence comes in, like "I love deep learning," what do you do? How do you actually represent it? Because remember, we have to numericalize everything that's coming in. So that's a key question, and it's actually a very subtle and very important question, and we'll focus on it today. Then next week, when we look at transformers, we'll look at what neural network architecture is best suited to process this sort of text input. Those are the two big questions we're going to look at.

All right, so preprocessing basics. We're going to follow a very standard process. This is the process by which we take any text that comes in and run it through four steps, and this process is called text vectorization. As the name suggests, we are essentially taking text and creating vectors of numbers out of it. We'll go through each of these steps one after the other. I just find it very useful to have this acronym STIE in my head: standardize, tokenize, index, encode. Just keep that in mind; it may be helpful.

All right, so the setup here is that we have a whole bunch of documents; we call it the training corpus. We have a whole bunch of text documents, text data, and as far as we're concerned, you can just imagine them as long passages. What is a novel? It's just a long passage of text. So whether it's a novel or a sentence doesn't really matter; we just think of them as a big list of strings, a big list of text. That's the training corpus. And what we do is take this training corpus and apply standardization and tokenization, which I will describe, to the entire training corpus up front.

So we first do standardization, and the default for most applications tends to be this: we first strip capitalization and make everything lowercase, and then we remove punctuation and accents and so on. That's the first thing we do; I'll talk about why in just a moment, but mechanically, we do this first. Then we look at words like "a," "the," "it," and so on, basically filler words, which we need to make complete sentences but which may not have any value for predicting things. We remove them; they are called stop words. And then finally we take words which are very similar, which have the same kind of stem or root, and map them to a common representation: "ate," "eaten," "eating" all just become, let's say, "eat." That's stemming. The first we almost always do, the second we often do, and the third we do sometimes. Okay. Now, why do we do any of these things?
>> I think we want to try to recognize the essential thing in the word, right? Whether it's "eaten" or "eat," the essential thing is the "eat." So we want to abstract from it the more essential thing.

>> Right. So why do we need to abstract? You're absolutely correct, we're trying to abstract. Why is there a benefit to doing this abstraction? How about somebody from this side of the room? Yes.

>> To reduce the size of the vocabulary.

>> Why is it a good idea to reduce the size of the vocabulary?

>> Because of the amount of computation needed.

>> So that is part of the answer. There's another part to the answer. All right, let's swing to the right.

>> Is it to facilitate comparison between different sets of standards?

>> Okay, I'll go with that, but I think the key thing to realize here is this. Much like when we talked about computer vision, we said: if there's a vertical line, I want to be able to detect it wherever it happens. I don't want the model to think that a vertical line on the left side is different from a vertical line on the right side and only later realize they are the same thing, because it would have wasted valuable capacity learning things which happen to be the same, just because it didn't know they were the same. So here, if you take a word and lowercase it: clearly the case of it, whether it's uppercase or lowercase, most of the time is not going to matter for anything you want to predict. So you're essentially telling the model that the lowercase version and the uppercase version are not different, they're the same, and the easiest way to tell the model they're the same is to just make everything lowercase. That is the key idea. Similarly, if you look at stop words, the reason is that these stop words may not help you predict anything: whether the word "the" showed up in a movie review probably does not affect the sentiment of the review, and therefore let's remove it. That's a slightly different reason. Stemming is the same reason as the first: all these words mean roughly the same thing, we don't have to be super precise about it, so let's just collapse them onto the same representation. Now, these are the standard things we do; there are important exceptions to all of them, and we'll come back to the exceptions a bit later, but that is the standard practice. Makes sense? All right.
So if you look at something like this sentence here, "¡Hola! What do you picture when you think of travel to Mexico?" and so on, you can see the standardized version: everything has become lowercase, the H has become a small h, the punctuation has disappeared, that's part of standardization; Mexico's M has become small; "sipping" has become "sips," "thinking" has become "thinks," and so on. So that's an example of standardization at work.

Okay. The next thing we do is something very important, and it's called tokenization. Now that we have standardized everything, we have a bunch of words, and we need to split them into what are called tokens. The most common default is to just treat a word as a token. We split on the whitespace: you take each string, and wherever there is whitespace, meaning actual spaces, carriage returns, and things like that, boom, you split on it, and you create words out of it. So for instance, if you have this standardized sentence here, you just split it after every word and you get this. Each of these is now a token.

Now, this has some disadvantages. What are some disadvantages of just splitting on the space between words?

>> I think we lose any context, because we look at each word separately; we don't know what came before or what happens next.

>> Right. So for example, "the cat sat on the mat" and "the mat sat on the cat" will have the same set of tokens. Yeah, so you lose the order. What are some other issues with it?

>> For words that should go together, you lose the fact that it's one name, because you separated them.

>> Right, exactly. So there are compound words, like "father-in-law," for instance; that's one problem. Another problem is that lots of non-English languages don't have this notion of a space between words: the text actually runs one word after the other, and native speakers know from context how to chunk it and break it up. Well, what do we do then? Because you'd basically end up with one token for the whole passage. The other problem is that there are languages, German perhaps the most notable, in which you have very long words. I saw a word, which I think I might have on the site somewhere, about this long, which means: the feeling when you realize that something amazing is happening but the rest of the world hasn't woken up to it yet. There's a word for that. Amazing, right? Or Japanese, for example: do people know the meaning of the word komorebi? It means the transient beauty of sunlight going through fall foliage. There's a word for that. How cool is that? Anyway, sorry, I love that word. So, back to this.
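To make the mechanics concrete, here is a minimal sketch of the default standardize-then-tokenize step in plain Python. This is an illustration, not the exact implementation any library uses; the regex-based punctuation stripping is an assumption for the example.

```python
import re

def standardize(text: str) -> str:
    # Lowercase everything, then strip punctuation characters.
    return re.sub(r"[^\w\s]", "", text.lower())

def tokenize(text: str) -> list[str]:
    # Default tokenization: split on whitespace (spaces, tabs, newlines).
    return text.split()

print(tokenize(standardize("The cat sat on the mat!")))
# -> ['the', 'cat', 'sat', 'on', 'the', 'mat']
```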
So we have this thing here: there are all these reasons why splitting on the space between words is not going to work in general. Now, what we have described so far, despite its shortcomings, is actually really good for lots of NLP use cases. If you want to classify text, it's good enough, for instance. But if you want to generate text, like LLMs do, it's not going to work. It's not going to work because when you ask ChatGPT a question, it comes back with perfect punctuation; clearly punctuation was not stripped. It comes back with particular upper and lower case; clearly that wasn't stripped either. You can actually make up new words and ask it to use them, and it will; therefore it's not like it can only recognize a finite set of words. So there's a very clever scheme called byte pair encoding, which was invented to handle all those things. I have slides at the end, and if we have time we'll talk about it.

All right, for now let's continue. So when this is done for every sentence, every passage in our training data set, we now have a list of distinct tokens. In this simple case, it happens to be all the distinct words we have seen. That's called the vocabulary.

So now we move to the third and fourth stages, indexing and encoding; in these stages we work only with the vocabulary. The first thing, indexing: we assign a unique integer to each distinct token in the vocabulary. So for instance, let's say you took a whole bunch of English literature as your training corpus and ran it through; you'll basically come up with an English dictionary. It'll have a whole bunch of words, maybe starting with "a" all the way to "zebra." I'm just putting 50,000 here because it turns out the GPT family uses a vocabulary of roughly 50,000 tokens; it's not the actual number of words in the English language, which is much more than that. So let's say we give them the numbers one through 50,000. And then we also introduce a special token called UNK, which stands for unknown, and we'll come back to this later. We give unknown the integer zero.

Okay. So this is what we mean by indexing: take the tokens you have identified and map each one to an integer. That's the indexing step. Then what we do is assign a vector to every one of these integers. That is the encoding step: we assign a vector to each integer. So you have a bunch of distinct words; each word gets an integer, and then we take that integer and map it to a vector. Yeah?

>> Can you please explain what "unknown" means?
>> Yeah, so I'll come back to that. For now, just assume that we have a token called unknown; the way we're going to use it will become apparent in a few minutes.

>> Does it mean there's a base to it, though? Like a letter or something?

>> It's a placeholder for something else, which I'll describe shortly.

Okay. So let's say we want to assign a vector to each integer in our vocabulary, and let's assume we have 50,000 possible integers, because we have 50,000 possible words, and we want to assign vectors such that the vectors of two different words look different. Clearly that's the whole point of mapping from integer to vector: they'd better be different. What is the simplest way to come up with a vector for each of these tokens?

>> The same as the index. It's just a vector, one by one, with the index.

>> So, a vector of zeros and ones, or...

>> It's just a vector with one dimension.

>> Oh, I see. Well, it's creative, but it's a little bit cheating, right? Because you're essentially putting a square bracket around the number and calling it a vector. Good try.

>> You could try one-hot encoding.

>> Right, you can try one-hot encoding. So remember, the list of distinct tokens you have, you can just think of them as the distinct levels of a categorical variable, and you can use one-hot encoding for it. So the simplest thing is to do one-hot encoding, and the way it works is that if you have, say, 50,000 possible values, the vector is going to be 50,000 long; it will have zeros everywhere except at the index value of that token. For instance, since we said UNK is number zero, it has a one in the zero index position and zeros everywhere else; "a" happens to be the second one, so it has a one in the second position and zeros elsewhere. You get the idea.

So that's one-hot encoding. And the dimension of this encoding vector, how long it is, is basically the number of distinct tokens you have seen in the training corpus, plus one for this UNK thing we'll get to. That dimension is called the vocabulary size.

All right. So at this point we have created a vocabulary from the training corpus, every distinct token in the vocabulary has been assigned a one-hot vector, and we are done with basic preprocessing. All the text that has come in, every token, has been mapped to some potentially very long one-hot vector. Any questions on the mechanics of this before we continue?
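In code, the indexing and encoding mechanics look roughly like this; a sketch with a made-up toy vocabulary, not production code:

```python
import numpy as np

# Indexing: assign a unique integer to each distinct token,
# reserving index 0 for the special UNK token.
vocab = ["[UNK]", "a", "cat", "mat", "on", "sat", "the"]
token_to_index = {tok: i for i, tok in enumerate(vocab)}

def one_hot(token: str) -> np.ndarray:
    # Encoding: a vector of length vocabulary_size with a single 1
    # at the token's integer position; unseen tokens fall back to UNK.
    vec = np.zeros(len(vocab))
    vec[token_to_index.get(token, 0)] = 1.0
    return vec

print(one_hot("cat"))    # [0. 0. 1. 0. 0. 0. 0.]
print(one_hot("zebra"))  # [1. 0. 0. 0. 0. 0. 0.]  (mapped to UNK)
```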
Now let's see, when a new input sentence arrives, a sentence we want to feed into a deep neural network, how this process actually applies to it. Let's assume we have completed our STIE on the training corpus, and it turns out we found only 99 distinct tokens, 99 distinct words; then we add this UNK thing, so we've got 100. So this is our vocabulary: it starts with UNK, then "a," and goes all the way to "zebra," but there are only 100 tokens in total. And just to be very clear, we didn't bother to do things like stemming and stop-word removal, which is why you have words like "the" showing up in this list.

Okay. So let's say this input string arrives, "the cats are on the mat," and we run it through STIE. The output is going to be a table with a bunch of rows and a bunch of columns. Any guesses how many rows and how many columns? Just raise your hands; I'll call on you. Yeah, use the microphone, go for it.

>> I would guess 100 rows and 6 columns.

>> All right, we'll take a look. 100-by-6 as well as 6-by-100 are both correct; the way I've done it is 6-by-100, and that's exactly right. So the idea is that this is your vocabulary, and the sentence, once you change its case, becomes like this. The word "the" becomes a one-hot vector with a one in the "the" position and zeros everywhere else; I'm not showing all the zeros because it would get too cluttered. Similarly, "cat" has a one in the "cat" position and zeros everywhere else, and so on. Does that make sense? So the phrase came in as just six words, and it became this 600-entry table.

Okay. Now, what is the best way to feed this table to a deep neural network? What can we do? It's not a vector, it's a table. If it were a vector, we'd know what to do: we'd just feed it in, maybe send it to some hidden layer, and declare victory at that point. Yeah?

>> You could flatten it.

>> You would like to flatten it. And how might you do it? Flattening is a reasonable answer, by the way.

>> I think you take each row, each word, and...

>> Yeah, so basically you can take the first row, then take the second row and attach it after the first, and so on. We can certainly do that, and it's very akin to how we work with images. But there is one downside to that. What is that downside?

>> It's pretty long.
Like, I wonder if instead, for the first word it's a one, for the second word it's a two, and you maintain the order but still keep it as just one row.

>> One row. So we'll come back to what we do about this, but one issue, as you're pointing out, is that it could be very long. If each word is a 50,000-long one-hot vector, then with just six words it becomes a 300,000-long vector. Imagine taking that 300,000-long vector and sending it into a hidden layer with 100 hidden units: 300,000 times 100 parameters. Too much; you can't learn anything. So that's one issue. The other issue is that texts of different lengths will have different-sized inputs. Here "the cat sat on the mat" is 6 times 50,000, but "the cat sat on the mat and the rat ran over to the cat" becomes even longer. We can't handle variable-sized inputs; the inputs all have to be mapped to the same length. That's another problem.

>> So maybe you can sum the columns, basically, and count how many times each word appears, since you're not using the spatial relationship.

>> Yes. So you're both on the same trajectory, which is that we need to somehow take this table and make it into a single vector, and there are many ways, like the ones you folks are describing, to do that, which deal with all the things we've been discussing, the varying lengths and so on. So what we can do is aggregate. If you just add the rows up, that's what you described; I believe it's called sum encoding. If instead of adding you OR them, meaning you look at each column and ask, is there any one in this column? If there is, you put a one, otherwise a zero; that's called multi-hot encoding. And if you literally go column by column and count everything: okay, there's a one here, a one here, oh wait, there are two here, so you put a two; that's count encoding. Make sense? By the way, there are many ways to take these tables and make them into vectors; multi-hot and count happen to be very commonly used, and they make common sense.

Okay. Right. So this aggregation approach that we just described is called the bag-of-words model. And the reason is that, first of all, this bag we have contains words: it either records whether a word exists or not, or counts how many times the word appeared; that's multi-hot versus count encoding. But more importantly, and this goes back to your observation, we have lost the order of the words. Whether the phrase that came in was "the cat sat on the mat" or "the mat sat on the cat," the count encoding and the multi-hot encoding are exactly the same. There's no difference, because we're just looking for the presence or absence of words; we don't care in which order they appear.
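A small sketch of these aggregations, which also demonstrates the order-blindness just mentioned (toy vocabulary again, illustrative only):

```python
import numpy as np

vocab = ["[UNK]", "cat", "mat", "on", "sat", "the"]
index = {tok: i for i, tok in enumerate(vocab)}

def encode(tokens, mode="multi_hot"):
    # Build the tokens-by-vocabulary one-hot table, then aggregate over rows.
    table = np.zeros((len(tokens), len(vocab)))
    for row, tok in enumerate(tokens):
        table[row, index.get(tok, 0)] = 1.0
    counts = table.sum(axis=0)  # count encoding: occurrences of each word
    return counts if mode == "count" else (counts > 0).astype(float)

print(encode("the cat sat on the mat".split()))
print(encode("the mat sat on the cat".split()))
# Both print [0. 1. 1. 1. 1. 1.] -- the word order is lost.
```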
That's a huge limitation, but shockingly, for many applications it doesn't matter; it's good enough. So it's called the bag-of-words model.

All right. Now, does it have any other shortcomings? I already talked about the first shortcoming, which is that it loses sequentiality: we lost the order information, and with it the meaning inherent in the order of the words. What are some other issues with it?

>> [inaudible]

>> What do you mean by that?

>> [inaudible]

>> Right, so there are lots of zeros and not that many ones: it's a very sparse representation, and you're carrying around a lot of data to make it all work. Now, there are some computer science tricks to handle sparsity in clever ways, but it is certainly an issue. The other issue is, say the vocabulary is very long: every input, whether it's the collected works of William Shakespeare or the phrase "I love you," will have the same length, because ultimately every incoming thing gets mapped into one vector. That feels a little suboptimal; clearly the collected works have a lot more going on in them. In particular, for very small inputs, you'll be spending a lot of compute processing those long vectors.

Now, you can mitigate some of this by choosing only the most frequent words. You don't have to take everything; I read somewhere that the English language has roughly 500,000 words, but it turns out the most frequent 50,000 or so account for just about everything you're ever going to see, and the rest are what's called the long tail: they almost never occur. So you can be very pragmatic and say: I'm not going to take every little word I see; I'm going to take only the most frequent words and ignore the rest. I'm just going to ignore the rest.

But if you ignore the rest: let's say there is one word, let's take some Shakespeare word, "Hamlet." Assume you dropped the word Hamlet from your training corpus; you just deleted it because it wasn't one of the most frequent tokens you'd seen. And then somebody sends you a text saying "Hamlet was a bad prince"; analyze the sentiment of the sentence. Well, when your system sees "Hamlet," what is it going to do? It's going to look at "Hamlet" and say, I can't find it in my vocabulary anywhere. And if it can't find it in the vocabulary, what is the only thing it can do? Replace it with UNK. So that's where UNK comes into the picture.
So whenever it can't find something from a new input in the vocabulary, it just replaces it with UNK. Which means that if you had dropped Romeo, Juliet, and Hamlet from the training corpus, all of them are going to be replaced by the same UNK, which means we can't distinguish between them anymore.

>> So is this where hallucination comes into play, where it doesn't recognize...

>> Interesting question: is this where hallucination comes up? Actually, as it turns out, no, as we will see when we talk about LLMs later. LLMs will not have this UNK problem, because they use a different tokenization scheme which can handle anything you throw at it, including new stuff you just made up. So we'll come back to that.

All right. So that's what we have. Despite its shortcomings, bag of words is actually a really good default for many NLP tasks, and in the spirit of "do the simple stuff first and do complicated things only if the simple doesn't work," we'll use a bag-of-words model right now. Okay. So we'll switch to a Colab and see how it's done.

The application we're going to work with is kind of a fun one: we're going to try to predict the genre of songs. It's a nice classification use case. We want to take some arbitrary song and classify it as either hip-hop, rock, or pop. So for instance, this is the kind of lyrics you're going to see. And a quick word of caution about this data set: it does have lyrics which may not be safe for work, as it were. So I'm not going to be exploring the lyrics in the Colab, but I wanted you to be aware of it. It's just a data set we downloaded from somewhere, and it's got all these lyrics. We're going to classify each verse we see into one of three things: hip-hop, rock, or pop. It's a multi-class classification problem.

All right. So what is the simplest neural-network-based classifier we can build for this problem? Remember what the input is: the input is going to be a bunch of song lyrics; it could be a really long song for all you know. We're going to use the bag-of-words model, and let's assume for a moment that we'll use multi-hot encoding. We'll create a vocabulary from the songs: we'll take all the songs, process them, run them through STIE, and we
will do multi-hot encoding, which means that every song that comes in will become a vector. How long will that vector be?

>> As long as the vocabulary size.

>> Correct, as long as the vocabulary size. Right. So maybe what comes in is this phrase; since it's supposed to be songs, I'll say something which is probably common to 90% of songs: "I love you." That goes in, it goes through our STIE process, and the STIE process gives us a vector x1, x2, all the way to xV, where V stands for the size of the vocabulary. Okay, so that's our input layer. So, knowing what we know now about deep learning, what can we do next?

>> Maybe I'm getting ahead, but wouldn't the baseline classifier just classify everything as the most common genre?

>> That is the baseline, correct, and we'll come to the baseline a bit later. But here I'm saying: suppose you wanted to build a neural network model for this. How would you set it up?

>> You think about the layers that you want.

>> Right. And what is the simplest thing you can do with a neural network? How many layers?

>> No layers.

>> Well, then it becomes problematic to even call it a neural network, because it could just be logistic regression.

>> One hidden layer.

>> Yes, thank you. I'm being a little squishy about this, because there are some people who'd say, well, even with no hidden layers, if you're using ReLUs and sigmoids and so on, maybe it's a neural network; I don't want to get into that how-many-angels-on-the-head-of-a-pin argument. So yes, in this course we need at least one hidden layer for it to qualify as a neural network. Okay, so let's have a hidden layer, and we'll have a bunch of ReLUs as usual. A bunch of ReLUs, and I'll ignore all the arrows between them; it's kind of a pain. And then we come to the output layer. What should the output layer be? How many nodes do we need in the output layer? Three, right: hip-hop, rock, pop. And that layer uses what activation function? Softmax. Perfect. Love it; love this class. All right, three things: rock, hip-hop, and pop, with a softmax right there. And then it's going to give us three probabilities that add up to one, because it's a softmax. So that's our basic network, right? Perfect. Yeah?

>> Why do you need those probabilities? If you just want to identify the most likely genre, the softmax just gives you a way to normalize them. Why don't you just take the max value and say it's that?

>> Oh, interesting question. Why can't we just produce three numbers and grab the maximum? So it turns out that finding the maximum of a bunch of numbers, that function, is not very friendly for differentiation.
And ultimately you want to take this output, run it through a loss function like cross-entropy, and then be able to run backprop on it. Fundamentally, backpropagation is just differentiation, and it requires everything inside it to have well-behaved gradients. This little max function is not well behaved, which is why we have a soft version of it, softmax, which is easy to differentiate. I can tell you more about it offline, but that's the quick synopsis.

A lot of the tricks you'll see in the neural network literature are ways to avoid this problem: the obvious choice of function is not well behaved for differentiation, and that's why you go through all these other mechanisms. Much like we couldn't just maximize accuracy instead of doing this cross-entropy business; same reason.

All right, so let's come back here. So that's what we created earlier, right? The cat-sat-on-the-mat vocabulary and so on. And I was playing around with this earlier, and I found that eight ReLU neurons were pretty good to get the job done, so I'm just going to go with eight ReLU neurons in the hidden layer.

So I think that brings us to the Colab. Let's switch to it. So that's what we have here; there's a little bit of verbiage which just describes what I talked about. We'll do the usual things: upload everything, import everything we want, TensorFlow and Keras and the holy trinity of NumPy, pandas, and Matplotlib, and set the random seed, as usual, to 42.

This is our STIE framework here, and the nice thing is that all four of these steps are beautifully implemented in Keras as a single simple layer called the TextVectorization layer, which is nice. So we have the TextVectorization layer right here. In our first example, we'll use the default standardization, which just removes punctuation and converts to lowercase; we'll use the default tokenization, which just means split on the space between words; and then we'll set the output mode to multi-hot. All the things we talked about, Keras will do for you automatically: output mode multi-hot, standardize by lowercasing and stripping punctuation, split on whitespace, and boom, you run the text vectorization setup. Once you do that, Keras creates this TextVectorization layer with these settings, and it's ready to swing into action. What does "swing into action" actually mean? Well, now we need to feed it a training corpus, so that it can do all the things it's supposed to do and create the vocabulary for you. And that is done with the adapt method. So we create a tiny training corpus; this is our data set, just a bunch of phrases from some of these lyrics.
Then we take the layer we just set up and ask it to actually create the vocabulary, using this adapt command: index the vocabulary, and it's done. And once it does that, you can ask it for the vocabulary using the get_vocabulary command. So first of all, how long is the vocab? 17 words, 17 tokens. What are they? You can see here these are all the words, and you can see UNK is stuck in at the very beginning; that's the default. By the way, a little programming tip if you don't have a ton of programming experience: if you want to print Python objects like lists in a pretty way, one trick that often works is to stick the object into a DataFrame and then print it; usually it'll print in a much nicer way. So you can see it like that.

So you can see here: UNK, "arrays," and so on, and you can see integer zero is assigned to the UNK token. By the way, how come it picked the word "arrays" as the second entry? Why not something like "an," or why isn't "a" chosen as the second entry? Why did it pick "arrays," do you think?

>> Maybe it tried to put the words that are most influential on the meaning of the sentence at the front.

>> But at this point it doesn't know what we're going to use it for, so it has no way to know which word is useful, because we haven't told it how we're going to use it. But you're on the right track. What Keras does is find all the tokens and then sort them by frequency. The most frequent token in the four sentences we gave it happens to be the word "arrays"; that's why "arrays" shows up on top. You can confirm this by going to our little data set: "arrays" shows up here and here, twice, and that's why it came out on top.

Okay. So that's what we have, and now that we have populated this, we can run any sentence through it easily. Yeah?

>> Does it matter that it's at the top, or is it just...

>> It doesn't matter. The reason it's helpful later on is that if you tell Keras: hey, don't take every word you see here, give me only the most frequent 100 words, I don't want any more than that, it can easily do that. That's the reason. Yeah?

>> [inaudible question]

>> This is just the vocabulary. Basically, you give it all these phrases, it happens to be just four phrases in our example, and it finds all the distinct words, does all that processing, and creates a vocabulary. At this point the training corpus you fed it is forgotten, and the only thing that has survived the processing is the vocabulary. That's it.
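Putting the pieces together, here is a minimal sketch of this Keras flow. The four-phrase corpus below is made up for illustration; the exact vocabulary and its order depend on the data you adapt on.

```python
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

# STIE in a single Keras layer: standardize, tokenize, index, encode.
vectorizer = TextVectorization(
    standardize="lower_and_strip_punctuation",  # the default standardization
    split="whitespace",                         # the default tokenization
    output_mode="multi_hot",                    # encode presence/absence per token
)

corpus = [  # a made-up four-phrase training corpus
    "write the rhyme",
    "rewrite the rhyme",
    "write it again",
    "rewrite it again",
]
vectorizer.adapt(corpus)              # builds (indexes) the vocabulary
print(vectorizer.get_vocabulary())    # ['[UNK]', ...] sorted by frequency
print(vectorizer(["still writing"]))  # multi-hot tensor; unseen words hit [UNK]
```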
[50:21] Now we can start applying it to any kind of text we want to use it on. [50:25] So, coming back here: you can take any sentence and just run it through the layer, to make sure it's actually doing the right thing for you. We take the sentence, we run it through the TextVectorization layer by just passing the sentence into it, and then we can just print the result. [50:46] So now it's giving you a tensor: a multi-hot encoded tensor of ones and zeros. Note that this tensor is 17 units long, which is a good check, because our vocabulary is 17 long; it had better match. [51:00] Now recall that the [UNK] token is at the first location, index zero, and the encoding says this sentence does contain an unknown word. [51:13] So what is this unknown word? Anyone want to guess? [51:19] Well, it turns out to be the word "still." "Still" is not in our vocabulary, because the four sentences in our training corpus had a lot of "write" and "rewrite" but no "still" anywhere; that's why there's an [UNK] for it. We can double-check by asking Python: is "still" in the vocabulary? Nope, it's not. [51:40] Now, in the spirit of making small changes to the code to understand what's going on, which is a very useful habit for folks who don't have a ton of programming experience, let's say you send in the nonsense phrase "sloan hoddle dmd." I think you'll agree with me that none of these words is in the training corpus. So what is the multi-hot encoded vector for this phrase? [52:07] >> Three? [52:11] >> It's not count encoding, it's multi-hot encoding. Remember, the vocabulary is 17 long, so each of these three words maps to the [UNK] slot: a one followed by 16 zeros. [52:27] And then it multi-hot encodes them, which means the three ones in that first column just collapse into a single one. So you still get a vector with only that one one, at index zero. [52:37] All right, good.
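Continuing the same sketch, the sanity checks he runs look roughly like this:

```python
# Encode one sentence; the layer accepts a batch of strings and returns
# one multi-hot row per string.
encoded = vectorize_layer(["the cat still sat on the mat"])
print(encoded.shape)   # (1, vocab_size): one vector, vocabulary-sized

# Check whether a word made it into the vocabulary.
print("still" in vectorize_layer.get_vocabulary())   # False, so it maps to [UNK]

# An all-out-of-vocabulary phrase: every token hits the [UNK] slot at index 0,
# and multi-hot collapses the repeats into a single 1 followed by zeros.
print(vectorize_layer(["sloan hoddle dmd"]))
```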
[52:39] So now let's actually get to the dataset. We have about 90,000 songs. We've grabbed the data and cleaned it up, cleaned up meaning formatting-wise, not content-wise, stuck it in a DataFrame, and already divided it into train, validation, and test for your benefit, so you don't have to worry about it. [53:00] It turns out we have almost 49,000 songs in the training set, 16,000 songs in the validation set, and roughly 22,000 in the test set. That's a lot of songs; it's a big dataset. [53:13] So let's just look at the first few. "Oh girl, I can't get ready." "We met on a rainy evening." "Paralysis through analysis": okay, that one I can relate to as a data science person. [53:27] But anyway, by the way, these interactive table views are very useful for exploring any DataFrames you might have. It's a Colab feature; check it out. [53:36] So those are the first few rows. Let's look at the last few rows. [53:48] "You never listen to me" is pop. "Beamer, Benz" is hip-hop. Yeah, of course. [53:57] Okay. Now, to go back to the question of what could be a good baseline model: we need to understand the proportions of the three classes of songs. So we do a quick check, and it turns out rock is 55%. So if you had to just guess something naively, you would guess everything to be rock, and you'd be right 55% of the time. [54:18] Now, the target variable, the one that tells you which of the three genres a song is, is a categorical variable, so we need to one-hot encode it. We'll just do that using the pandas get_dummies function. [54:32] And when we do that, y_train, which contains the dependent variable, is one-hot encoded: 0 1 0, 0 1 0, 0 0 1, and so on and so forth. I forget the column order, rock, hip-hop, pop or whatever; it's in some order, and we'll get to that later. So the target is one-hot encoded as well. [54:50] So that is as far as the data downloading and setup is concerned.
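The baseline check and the one-hot encoding he describes look roughly like this; the DataFrame and column names (train_df, "genre") are hypothetical stand-ins, since the notebook's actual names aren't shown:

```python
# Class balance: the majority class is the naive baseline (~55% rock here).
print(train_df["genre"].value_counts(normalize=True))

# One-hot encode the genre labels. get_dummies creates one 0/1 column per
# class, in alphabetical order; newer pandas emits booleans, so cast for Keras.
y_train = pd.get_dummies(train_df["genre"]).astype("float32")
y_val   = pd.get_dummies(val_df["genre"]).astype("float32")
y_test  = pd.get_dummies(test_df["genre"]).astype("float32")
print(y_train.head())
```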
[54:52] Any questions? [54:55] >> Yeah. [54:57] >> This kind of goes back to the transfer learning concept, but do you always want to build your vocabulary off of your own training data, or could you use a pre-compiled one, like a list of 50,000 words somebody has already made? [55:13] >> That's a really good question. Unfortunately, I'm going to punt on it for the moment, because with modern large language models, a number of these NLP tasks for which you used to have to roll your own and build your own thing can now be done very easily using a large language model, without even any further training. [55:33] The price you pay is that you have to use a large language model, which means you have to pay somebody for API calls and things like that, and there are other issues with it. But we'll talk a lot about transfer learning for text a little later in the NLP sequence, so if I forget, please bring it up again. [55:54] >> A quick clarification on the encoded vector: it's stored as floats, not ints. If it gets incredibly long, wouldn't that eat into compute time? Is there a reason it's floats? [56:06] >> Yeah. So the question is: when I showed you that tensor, it's written as floating-point numbers, but we know these are just zeros and ones, so why waste compute capacity by telling the computer these are big continuous numbers when each one is just a zero or a one? There are ways to optimize that, but these problems are so small that we just don't worry about it. But when we come to something called parameter-efficient fine-tuning, in lecture 10 or so, we will actually exploit that particular fact to make things faster. [56:38] Okay, so that's what we have. Now we'll do the bag-of-words model. By the way, there's a whole bunch of text here that just repeats what I've been telling you in lecture, so feel free to read it later, but we can ignore it for the moment. [56:54] And now there's a new thing we're doing here. We're basically saying: look, instead of taking every word you see in these 49,000 songs in the training corpus, which would be too many words, just pick the 5,000 most frequent words. That's what this max_tokens argument stands for. [57:15] So we tell it: max_tokens 5,000 (not 50,000, 5,000), still multi-hot, and we're not explicitly specifying the standardization and all that because the defaults are what we're going with. [57:29] >> This is for making it more efficient, right? Like, don't waste your time on thousands of rare words; just focus on the frequent ones. [57:40] >> It does make it more efficient, but there's a related and important point, which is that fundamentally the number of tokens you allow this layer dictates the size of your vocabulary, and the size of your vocabulary dictates the size of the vector you feed in. Shorter vectors are better than longer vectors: that's the efficiency point. [57:57] The other point is that the longer the input vector, the more parameters the network has to learn, because the first layer alone is roughly the size of the input times the size of the hidden layer. So if the input becomes 10 times as long, you have 10 times as many parameters to learn, and given a finite amount of data, the more parameters you have, the worse it's going to do when you actually start using it in the real world. It's going to overfit heavily. That's why you need to be very careful. [58:27] >> So you downloaded the dataset, but are you still using the 17-word vocabulary, or...? [58:33] >> No, no, that was just for fun. I'm going to actually build a vocabulary now; it's coming. Good question. [58:38] So, all right, let's do that. First I define this layer. Okay, I've just defined it.
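Continuing the sketch, the capped vectorizer is one argument away from the earlier one:

```python
# Bag-of-words vectorizer capped at the 5,000 most frequent tokens; anything
# rarer falls into [UNK]. Standardization and splitting stay at the defaults.
max_tokens = 5000
text_vectorizer = TextVectorization(
    max_tokens=max_tokens,
    output_mode="multi_hot",
)
```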
[58:47] Now we actually build the vocabulary, by telling the layer to adapt on essentially the full training dataset, all 49,000 songs. That's a long list of songs, but as far as Keras is concerned, it's just looking for a list of strings, so you give it the list of strings. Instead of four, we're giving it 49,000; the same philosophy applies. [59:09] So we run it. It's obviously going to take a few seconds, because it's 49,000 songs. [59:17] There: about five seconds. All right, let's look at the most common 20. We get the vocabulary from our layer. See, once you adapt the layer and it has built a vocabulary, the layer has been populated with all this information, so you can query it. So you get the vocab, top 20 words; the most frequent, no surprise, are "you," "I," and so on. [59:39] Let's look at the last few: "dagger," "cheddar," "verified." Moving on. [59:48] And once we have done that, we can actually vectorize all the datasets we have using this layer; and by vectorize I mean take every string and create the multi-hot encoded vector from it. [01:00:00] >> Are we doing STIE here? Because we're keeping stuff like "d," "a," etc. [01:00:05] >> Yeah, we are not strictly doing STIE. To put it differently: the S, standardization, typically covers lowercasing, stripping punctuation, stemming, and stop-word removal. The default in Keras happens to not do stemming and not do stop-word removal, so we're just going with the default. Thanks for the clarification. [01:00:22] And in fact, in practice, what I find these days is: don't even bother to stem, don't even bother to remove the stop words. It's going to work well enough. [01:00:31] Okay, so now each song is a vector. How long is that vector? [01:00:41] 5,000. Correct, because that is the size of the vocabulary. It's max_tokens long, which is 5,000. [01:00:49] So if you actually look at X... oh wait, I haven't run this cell yet. [01:00:57] It's going through the 49,000 songs, and now another 23,000 or so. Fine, so let's run it. [01:01:04] Okay, now we can see X_train, which holds all the training data. It's a tensor, a table with 48,991 rows, and each row is a 5,000-long vector.
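Continuing the sketch, with the same hypothetical DataFrame names as before, the adapt-and-vectorize step looks roughly like this:

```python
# Build the 5,000-token vocabulary from the full training corpus.
text_vectorizer.adapt(train_df["lyrics"].tolist())

vocab = text_vectorizer.get_vocabulary()
print(vocab[:20])    # the most frequent tokens
print(vocab[-5:])    # the rarest tokens that still made the cut

# Vectorize each split: one multi-hot row per song, max_tokens columns.
X_train = text_vectorizer(train_df["lyrics"].tolist())
X_val   = text_vectorizer(val_df["lyrics"].tolist())
X_test  = text_vectorizer(test_df["lyrics"].tolist())
print(X_train.shape)   # e.g. (48991, 5000)
```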
[01:01:18] All right, good. Now we will try the simple neural network that we wrote up in class. And at this point this code should be sort of second nature, right? Isn't that cool? It's so easy to write; that's the power of abstraction. [01:01:36] So we take a Keras Input layer as usual and tell it the size of each thing that's coming in. Well, each thing is a max_tokens-long vector, so we tell it the shape is max_tokens, and then we run it through a dense layer with eight ReLUs. Okay, I'm hurrying. [01:01:54] So we get the outputs, then we string the inputs and the outputs into a model, and then we summarize the model. That's it. [01:01:59] And this has about 40,000 parameters, and you can see where they come from: going from the input, 5,000 times 8 gives you 40,000, plus the eight neurons each have a bias coming in, that's another eight, so you get 40,008. [01:02:15] And we compile it as usual, we use Adam as usual, and because the y_train variable is now itself one-hot encoded, 0 1 0, 0 0 1, depending on pop, rock, and so on and so forth, we don't use sparse categorical cross-entropy. We just use plain old categorical cross-entropy here. This was explained in lecture last week, so you can revisit it if it's not familiar. We again report accuracy. [01:02:44] So let's compile it; we've got a model. So we just run it for 10 epochs with a batch size of 32. And because we have validation data already supplied to us, we don't have to tell Keras to take the training data and keep 20% of it aside for validation. We can literally tell it what validation data to use. That's what we're doing here. [01:03:04] All right, so it's running. It's pretty fast. [01:03:16] Any questions so far? [01:03:18] >> Yes. [01:03:20] >> The microphone, please. [01:03:23] >> How do we decide the max_tokens? We defined it as 5,000 here, but we don't know how many words there will be in the entire text. [01:03:29] >> Yeah, so it's a good question: how do you decide on the maximum vocabulary size? What you typically do in practice is run it without max_tokens, see how long the vocabulary is, and then get statistics on how frequently the very infrequent words actually show up. You'll typically see a dramatic fall-off at some point, and you pick that fall-off point and set it to be the max. [01:03:54] All right, perfect. Let's test it. [01:03:58] Accuracy is pretty good: 87% on the training set and 73% on the validation set. We'll do it on the test set: 72%. [01:04:09] So we saw earlier that the largest class of the three-way split is rock, at around 55%, so the naive model is going to get about 55% accuracy, and this little neural network model gets you 72%, which is pretty nice.
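Put together, the model he runs looks roughly like this, continuing the sketch above, with the three-way softmax output over the genres. The first Dense layer alone accounts for the 40,008 parameters (5,000 × 8 weights plus 8 biases):

```python
# Simple bag-of-words classifier: multi-hot vector in, genre probabilities out.
inputs = keras.Input(shape=(max_tokens,))
x = keras.layers.Dense(8, activation="relu")(inputs)   # 5000*8 + 8 = 40,008 params
outputs = keras.layers.Dense(3, activation="softmax")(x)
model = keras.Model(inputs, outputs)
model.summary()

# y_train is already one-hot encoded, hence plain (not sparse) cross-entropy.
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

model.fit(X_train, y_train,
          epochs=10, batch_size=32,
          validation_data=(X_val, y_val))   # use the supplied validation split

model.evaluate(X_test, y_test)   # roughly 72% accuracy in the lecture's run
```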
[01:04:22] Okay, so now let's kick it up a notch and make the model slightly more capable. The key thing here, as has already been observed in class, is that when you go with a bag-of-words model, we lose all notion of order. Word order clearly matters, and we're kind of ignoring it. So what do we do to get around that? [01:04:40] There's actually a really interesting sentence here. Let's say this is a movie review: [01:04:44] "Kate Winslet's performance as a detective trying to solve a terrible crime in a small Pennsylvania town is anything but disappointing." [01:04:52] Tricky, right? Because if you look at the words separately, "terrible" and "disappointing" look like negative sentiment. But if you actually know that the word "terrible" refers to the crime, not to the movie, and that "anything but disappointing" flips the meaning of the word "disappointing," you will see it's obviously a positive review. So clearly the words around a word provide valuable clues as to how to interpret that word. [01:05:17] So how can we make our little model a bit more capable of recognizing the context around every word? The way we do it is something called bigrams. [01:05:29] For bigrams, what we basically do is: instead of just taking each word, we take each word and we additionally take every pair of adjacent words, and those become our tokens. And because we take two adjacent words, they're called bigrams. You can take three adjacent words, trigrams, and in general, n-grams. You get the idea. [01:05:54] So for example, if you had "the cat sat on the mat," you will also have "the cat," "cat sat," and so on. That's what we have. [01:06:05] So let's do a little example, and Keras makes it very easy: you literally tell it ngrams=2 for bigrams. And from this you immediately know that ngrams=1 is the default; that's why we didn't have to specify it before. [01:06:23] So you run it, with "the cat sat on the mat" as your training corpus, you get the vocabulary, and you can see it has created all these nice bigrams for you. [01:06:34] All right. Now we go back to the songs, and we actually tell Keras to not just take each word but take all the bigrams as well, and hopefully it'll do a better job of figuring out what's going on. [01:06:47] Now, when you say "take the top 5,000 words," that's great for single words, unigrams as they're called. But when you have bigrams, you have 5,000 possibilities for the first word and maybe 5,000 for the second word. That's a lot of possibilities: 25 million. Now, most of those 25 million possibilities are not going to show up in the data, so you don't need to make the vocabulary that much larger, but you should make it a fair bit more than 5,000. So here we go with, say, 20,000; otherwise it's the same, still multi-hot. [01:07:13] So let's run it. And now, with the layer set up with all the right settings, we'll ask it to create the vocabulary, again by doing exactly what we did before.
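The bigram demo and the bigger song vectorizer look roughly like this, continuing the sketch (in Keras, ngrams=2 means "up to bigrams," so the single words are still included):

```python
# Tiny demo: with ngrams=2 the layer emits every word and every adjacent pair.
bigram_demo = TextVectorization(ngrams=2, output_mode="multi_hot")
bigram_demo.adapt(["the cat sat on the mat"])
print(bigram_demo.get_vocabulary())
# e.g. ['[UNK]', 'the', 'the cat', 'cat sat', 'sat on', 'on the', ...]

# The song vectorizer: same idea, with a larger vocabulary cap for bigrams.
bigram_vectorizer = TextVectorization(
    max_tokens=20000,
    ngrams=2,
    output_mode="multi_hot",
)
bigram_vectorizer.adapt(train_df["lyrics"].tolist())
```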
[01:07:30] It takes a bit longer this time, and if you go to trigrams and beyond, all of this gets much more compute-intensive; that's why you're seeing this. [01:07:48] All right, let's look at the first 10 tokens. The first 10 are all just single words, and that's not surprising, because the single words are going to be the most frequent. [01:07:59] And then the last few: "your mom," "your god," "you short," "you hell." [01:08:09] All right, let's index all the data we have, the training, validation, and test sets, using this vocabulary. [01:08:23] Perfect. Now we come to our second model, where we say the incoming shape is now 20,000 long, because we increased max_tokens from 5,000 to 20,000. So each input is a 20,000-long vector; otherwise it's the same. [01:08:35] And now we will use this thing called dropout for the first time, which is a regularization technique that I have referred to earlier but never really described. I will describe it today if we have time, but first I'll run through the whole demo. For now, you can think of dropout as just another layer you can insert, and it's essentially a great way to prevent overfitting, so I just routinely use it. I'll talk more about it. [01:08:58] So for now you have this dropout layer in the middle. It receives the input from the dense layer and then sends it to the output layer. The output layer is unchanged: it's a three-way softmax. Same model as before. [01:09:10] All right, we'll come back to dropout. So we compile it the same way as before, and then I will just fit it for three epochs. If you're interested, after class you can actually try it for more epochs and see if it does better; for now, in the interest of time, we'll just do three. [01:09:36] I think 72% was what the single-word, unigram model got. [01:09:43] >> If you're rerunning this code with the same settings, do you ever expect the accuracy to change? [01:09:49] >> If you ran this code on your machine, you would expect it to be roughly the same, but there are some minute differences due to hardware and device drivers. [01:09:58] >> If you rerun it on your own machine twice, would you expect a change? [01:10:02] >> That's actually a very tricky question, because it depends on what else I have been doing in that notebook. If I start fresh and do nothing but that, I typically get the same numbers. Typically. But for some reason I don't always get exactly the same thing. [01:10:19] Okay. So we come to this: let's evaluate our little model. [01:10:25] 75%! So it went from 72 to 75. That's actually a meaningful jump, just by using bigrams. And I ran it only for three epochs; if you run it for 10, maybe it's going to do even better.
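As a sketch, the second model differs from the first only in the input width and the Dropout layer; the 0.5 rate and the Xb_* names are assumptions, since the notebook's values aren't shown:

```python
# Re-vectorize the splits with the adapted bigram layer (assumed names).
Xb_train = bigram_vectorizer(train_df["lyrics"].tolist())
Xb_val   = bigram_vectorizer(val_df["lyrics"].tolist())
Xb_test  = bigram_vectorizer(test_df["lyrics"].tolist())

# Same architecture, 20,000-wide input, with Dropout between hidden and output.
inputs = keras.Input(shape=(20000,))
x = keras.layers.Dense(8, activation="relu")(inputs)
x = keras.layers.Dropout(0.5)(x)   # assumed rate; zeroes random activations while training
outputs = keras.layers.Dense(3, activation="softmax")(x)
bigram_model = keras.Model(inputs, outputs)

bigram_model.compile(optimizer="adam",
                     loss="categorical_crossentropy",
                     metrics=["accuracy"])
bigram_model.fit(Xb_train, y_train,
                 epochs=3, batch_size=32,
                 validation_data=(Xb_val, y_val))
bigram_model.evaluate(Xb_test, y_test)   # ~75% in the lecture's run
```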
[01:10:36] All right. So that is the beauty of this thing. Now let's actually do a little demo; we'll try to predict some lyrics. I'll try one: "Bites the Dust." [01:10:49] It says rock song. I think that's correct. Yes. Okay, folks, your turn now. Somebody tell me your favorite song. [01:11:00] >> Dancing Queen, from ABBA. [01:11:03] >> I love ABBA. That's awesome. All right: Dancing Queen lyrics. [01:11:17] Hmm, this first one has an intro I don't like; let's just go to a version without all this metadata. [01:11:27] All right, I'll just take the first page. Are we good? [01:11:45] All right, run it through the model. Let's predict: [01:11:50] pop, just about. Yay! [01:11:55] So that's basically the model, but we have five minutes and I want to get back to the lecture. You can play around and put your own lyrics in. Typically what happens, in the last two years that I've been doing this particular lecture, is that the songs people pick are always rock songs for some reason. [01:12:13] This is the first time I'm getting a pop song, and from a group that I actually like, so thank you. [01:12:18] All right, let's go back to dropout. [01:12:22] So the idea here in dropout is that the input comes in, it goes through a hidden layer, and so on and so forth. Dropout is a layer, and you put this layer in just like you use any other layer. And what dropout does is take all the numbers coming into it from the previous layer and randomly decide whether to replace each number with a zero. [01:12:46] That's it. It drops that number and replaces it with a zero, and it does it randomly: it basically tosses a coin, and if the coin comes up heads, zero; if it comes up tails, it lets the value pass through. [01:12:58] And the reason this is very effective is that you can imagine all the neurons in a particular layer: when they overfit to a particular dataset, the overfitting happens because the neurons essentially collude with each other. They sort of collude with each other to predict things in a very precisely tuned way. So you want to break any collusion between the neurons. I'm obviously describing it in a sort of game-theoretic way, but the idea is that neurons can pick up any kind of spurious correlations in your data by becoming correlated themselves. [01:13:36] And so the way you avoid the spurious correlations is by dropping neurons randomly. You just kill a neuron at random, which means that no neuron can depend on another neuron being available.
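The mechanics he just described are easy to see in isolation. Here's a self-contained sketch of standard Keras behavior: Dropout zeroes a random subset of its inputs during training, rescales the survivors by 1/(1-rate) so the expected total is unchanged, and passes everything through untouched at inference time:

```python
import tensorflow as tf

drop = tf.keras.layers.Dropout(rate=0.5)   # coin-flip probability of zeroing
x = tf.ones((1, 8))

print(drop(x, training=True))    # roughly half zeros, survivors scaled to 2.0
print(drop(x, training=False))   # all ones: dropout is inactive at inference
```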
[01:13:47] I know it's a bit grim, but that's the basic idea of dropout. And the story goes that Geoff Hinton, whose team invented it (he won the Turing Award for deep learning generally, not for dropout), said he got the idea when he went to a bank. I don't know if it's true, but apparently he noticed that the people working in the branch he used to go to kept changing. [01:14:13] They were never the same; people kept transferring in and out, and he wondered: why can't they just leave these people alone? Why does it keep changing? And then he got the insight that maybe a lot of fraud happens because a person working in the branch colludes with a customer, but by changing the staff constantly, you break the risk of fraud happening. And that apparently was the genesis of this idea. True? Apocryphal? I have no idea, but it's sort of a fun story. Yes? [01:14:40] >> Instead of dropping randomly, if we went the way classical statistical models are built, looking at concepts like multicollinearity and all of that, would that make it sharper compared to this? [01:14:50] >> The problem is that these networks are massive. For you to take each layer and look at its correlation with other layers and so on and so forth: first of all, investigating multicollinearity at that scale is itself a problem. The second thing is, okay, what do you do then? In linear regression you can do things like principal components analysis to get around it. Here everything is nonlinear; there is no easy way to solve the problem. So we just solve the problem in one shot using dropout. [01:15:20] All right. So I had some material on something called byte pair encoding, which I will cover when we get to LLMs; I stuck it at the end because I knew we probably wouldn't have enough time to cover it anyway. That is a very clever tokenization scheme used by, for example, the GPT family, and it allows them to handle punctuation beautifully, keep case intact, and deal with words you just made up, things like that. [01:15:45] Okay, so we have one more minute. I'm happy to answer any questions you might have. [01:15:50] >> So initially, when we are picking the hidden layer, the number of neurons and the width: so far in all the materials this has been given to us, but how do you pick it initially? Is it more of a trial-and-error type of thing? [01:16:03] >> It tends to be trial and error. That's in fact what I did when I created the colabs. And you can actually make it a bit more systematic by trying lots of different values, and there is a particular Python package for this called Keras Tuner.
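For a flavor of what that looks like, here's a minimal sketch with Keras Tuner; the search space (hidden width, dropout rate) and the variable names are invented for illustration, not taken from the course materials:

```python
# pip install keras-tuner
import keras_tuner as kt
from tensorflow import keras

def build_model(hp):
    # Search over the hidden width and dropout rate instead of hand-picking them.
    units = hp.Int("units", min_value=4, max_value=64, step=4)
    rate = hp.Float("dropout", min_value=0.0, max_value=0.5, step=0.1)
    model = keras.Sequential([
        keras.Input(shape=(20000,)),
        keras.layers.Dense(units, activation="relu"),
        keras.layers.Dropout(rate),
        keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

tuner = kt.RandomSearch(build_model, objective="val_accuracy", max_trials=10)
# tuner.search(Xb_train, y_train, epochs=3, validation_data=(Xb_val, y_val))
# best_model = tuner.get_best_models(1)[0]
```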
[01:16:18] So just Google "Keras Tuner." It comes with very nice colabs, and if I have a chance, maybe I'll just record a screen walkthrough of using it. It's a very efficient way to do these things. And it comes under the broad category of something called hyperparameter optimization, where the number of neurons, the activation you use, the learning rate, all those things can be tried. You can try lots of variations, and Keras Tuner is a great way to do it in the context of Keras. [01:16:42] Other questions? [01:16:45] All right, I'll give you 30 seconds back. Thank you. See you tomorrow.