Okay. So today we start the natural language processing sequence. To give you a quick idea of where we're headed: we're going to start with what's called vectorization, then the bag-of-words model, and then we'll spend a fair amount of time on a Colab. Then on Wednesday we talk about these things called embeddings, which you'll come to appreciate over the next couple of weeks as the core atomic unit of all modern natural language processing, and for that matter vision processing as well. The following week we'll do transformers, two lectures on transformers: we'll get into the theory and then into a bunch of applications. And then lectures nine and ten will be all about LLMs. So it's going to be a lot of fun. This is one of my favorite segments of the class; of course, truth be told, every segment of the class is my favorite, so don't judge me. All right, so let's get going.

So, why natural language processing? The things I have on the slide here are in some sense obvious, but I think it's worth reminding ourselves of how important text is for everything we do. Human knowledge is mostly encoded as text. The internet is mostly text, or at least this was true until the advent of TikTok and YouTube. Human communication is mostly text, and cultural production, you know, movies, books, the arts and so on, is so text-heavy. So text forms not just a big chunk of all the media that's out there; it also happens to be the way in which we think and communicate. Its primacy is, in my opinion, unparalleled in how we think about the world.
And so the tantalizing possibility is this: imagine if we had an AI system that could just read and, quote unquote, understand all this text. You can imagine such a system reading all of PubMed, reading all the medical literature, and then coming back and saying: for this particular disease, this particular protein is actually the malfunctioning protein, and that small molecule is going to dock into the protein and cure the disease. And you didn't know this; it came back and told you that. Wouldn't that be unbelievable? My feeling is that such things are going to happen. It's just that they're not going to happen soon enough for my lifetime, but perhaps they'll happen in yours. All right. Okay. So, let's continue.

NLP is in action all around us. According to Google, Google autocomplete, which uses a fair bit of NLP, saves 200 years of typing time every day. I actually wasn't very impressed with this number, frankly, because billions of searches are being done every day, and I'm like, only 200 years? But I think the more important point is that it made mobile possible. If you didn't have autocomplete, people would not be typing and pecking on their keyboards; it would be much worse, and it would have had a hugely dampening effect on e-commerce, for instance. So this humble little autocomplete has an incredible impact on the world economy. And the other thing I heard about, I'm not sure if it's 100% true, but it's an interesting example: apparently the very first iPhone keyboard that came out, the soft keyboard, not a hard keyboard, had some very basic word-continuation prediction going on.
And so when you start typing T and H, obviously it's going to guess that an E is going to come next; that part is old news, nothing new there. But apparently the E key on the keyboard would become slightly bigger, so when your finger goes towards it, it has a better shot of actually connecting with it. So these kinds of things are used to change the UI in real time in a whole bunch of applications, and you just don't even realize it. All right. And of course we all know about LLMs at this point. So I asked one to write a limerick about the beauty and power of deep learning yesterday, and it said: in a world where data flows like a stream, deep learning is more than a dream; sifts through the noise with an elegant poise, unveiling insights that gleam. Cool, right? All right, so let's get back to work.

NLP has extraordinary potential for making products and services much, much smarter. And what I want to point out here is that even if you focus on this very simple formalism, a bunch of text comes in, a bunch of text goes out, that's it; this humble little text-in, text-out formalism has just an enormous range of applicability.
So obviously you can send a bunch of text in and ask it to classify it: for sentiment, to route it for customer support, to figure out the intent of what the person is asking in search, or to content-filter it to make sure there's no toxic, abusive stuff going on. The possibilities for just text classification are numerous, but that's a use case we're all familiar with, so no surprise there. Now, text extraction we may be less familiar with. The idea is that you can look at a lot of unstructured textual data and extract all sorts of interesting entities from it. Hedge funds use it very heavily; they will extract all sorts of company information from news articles. And then doctor's notes: there are a whole bunch of NLP startups that will take the doctor-patient conversation, transcribe it, and then extract disease codes, diagnosis codes, medication codes, and things like that. So the possibilities for this are enormous. Of course there's text summarization, which we have all been doing thanks to ChatGPT: take text in, and any kind of summary that comes out of the text is just text out. And then text generation: marketing copy, sales emails, market summaries, and so on, including, troublingly for educators, college application essays.

Code generation is a more subtle example of text out, because code is just text; so text-in, text-out also covers text in, code out. Okay. And question answering.
So you can take a whole bunch of documents, you can add a bit of text to it which is your question, and this whole thing at the end of the day is just text in; and then you can use it to answer questions and therefore create chatbots for all sorts of interesting applications.

And if you look at this example, call centers: that is where a lot of money is being spent right now, building call center chatbots for text-in, text-out question answering. If you drill into this, imagine taking all the call center transcripts and the internal product documentation, service documentation, FAQs, etc., and sticking it in. You can start to answer these kinds of questions: yesterday, what were the top reasons why customers were upset with us? Which interventions made by the agent actually worked, and which did not? What characterizes the best agents from the rest? How should we grade this particular agent's interaction with this particular customer? How should we change the call center script? How should we coach the agent in real time? Every one of these applications is amenable to this very humble text-in, text-out model.

Okay. And of course everybody now knows this potential because of the advent of large language models. By the way, Google released something called Gemini 1.5 Pro a couple of days ago, and it's incredible. It's incredible, right? Anyway, we'll get back to that later. But the point is that the kind of potential we have is just amazing, even for text in, text out. Okay.
And as you would imagine...
>> Though we are calling it language, this is all primarily English, right?
>> Now, there are lots of multilingual models as well. By that I mean models which are specialized to other, non-English languages, and models which are truly multilingual, polyglot models; both kinds are available right now, and many modern LLMs are actually trained from the get-go to be multilingual in a bunch of what are called high-resource languages, languages which are spoken by lots of people. But actually, it's funny you should ask that question, because of this Google Gemini model I just described. There is a language called Kalamang which is spoken by 200 people in the world, and a researcher had created one book which is sort of a grammar manual for Kalamang, because there are no other written works in that language. So what they did is they took a whole bunch of English dialogue and this book, fed it into Google Gemini 1.5 Pro, and it translated into Kalamang at human-level proficiency. It had never seen the language before. So that's an example of this.

Yes. So the question text here is all the things you want to translate from English to Kalamang; the documents here are just one document, singular, the grammar book, the manual; and then what comes out is a translation. So these models, even when they're not explicitly trained on a different language, if you give them enough of, sort of, grammar manuals and stuff like that, may do a pretty decent job from the get-go with no training. It's kind of a shocker. Two years ago people would have said that's impossible. All right, so back to this. All right.
And as you folks may already know, and maybe you're in fact participating in this gold rush already, lots of people are creating lots of really cool companies to take some of these ideas and turn them into really interesting products and services. So if you're not doing it, and if you've been thinking about entrepreneurial stuff, here's a word of advice: take the plunge.

Dismissed. Just kidding. All right. And as you can imagine, enterprise vendors are rushing to add NLP to all their products. Salesforce Einstein now has Einstein GPT, Microsoft has Copilot; the list goes on. Everybody is scrambling and really trying hard to infuse some GPT magic into whatever they're doing. Some of it is real, a lot of it is not. Okay.

So, let's go to the arc of NLP progress. How did we get to the kind of crazy times that we live in? If you look at natural language processing, basically the effort to take language, analyze it, and make predictions with it, the first phase was just handcrafted rules based on linguistics. These were linguists who would really understand the grammar of a language, and they would use their deep knowledge of linguistics to figure out all these rules by which you can process and analyze natural language text. And then this other thing came along, the statistical machine learning approach, which basically said: never mind all that complicated knowledge of linguistics and grammar. Why don't we simply count things? Let's count the number of times these two words co-occur; let's count that, let's count this; basically, just count a lot, and let's see if that works for predicting things, say for classifying text and so on.
And shockingly, those methods ended up being really good. They ended up being really good, and in fact they were better than the lovingly hand-curated, linguistically driven rules. So much so that there is a famous quote which says, "Every time I fire a linguist, the performance of the speech recognizer goes up." Obviously said in jest, but there is a kernel of truth to it.

So that's where we were, and then deep learning happened, roughly in 2012. Then we had these things called recurrent neural networks, which are based on deep learning and which actually moved the ball forward. And then in 2017, something called the transformer was invented, and the transformer replaced everything else across the board. So we're just going to leapfrog directly to transformers; we will not spend any time on recurrent neural networks. That is not to say they are dead: there is some very interesting work that is now trying to revive recurrent neural networks and make them work for these kinds of modern LLM tasks, but it's still very early days. Okay, so for now we'll just focus on transformers.

Okay. So the very high-level view of the problem here is that, like most things in deep learning, it's basically fancy regression. There is some variable X that comes in; it goes through this very complicated function along with W, which is the weights, and out pops an output. That's just the view you've always had. In this case X happens to be text; Y can be text, it could be labels, it could be numbers, it could be anything else; W is the weights; and the function is a deep neural network. Right?
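In symbols, and this is just a restatement of that picture rather than anything new:

\[
\hat{y} = f(x; W)
\]

where x is the input (here, text, once we numericalize it), W is the set of weights we learn, f is the deep neural network, and the output y-hat could be text, labels, or numbers.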
At this point, when you look at this slide, it should be blindingly obvious. So now the key question is: how do you actually represent X? That's the key question. For pictures, for images, we saw that we just took the pixel values, which were light intensity numbers between 0 and 255, and you could use those directly. But when a sentence comes in, like "I love deep learning", what do you do? How do you actually represent it? Because remember, we have to numericalize everything that's coming in. So that's a key question, and it's actually a very subtle, very important question; we'll focus on that today. Then next week, when we look at transformers, we'll look at which neural network architecture is best suited to process these text inputs. Those are the two big questions we're going to look at.

All right, so: processing basics. We're going to follow a very standard process. This is the process by which we take any text that comes in and run it through these four steps, and it's called text vectorization; as the name suggests, we are essentially taking text and creating vectors of numbers out of it. We'll go through each of these steps one after the other. I just find it very useful to have the acronym STIE in my head: standardize, tokenize, index, encode. Just keep that in mind; it may be helpful.

All right. So the setup here is that we have a whole bunch of documents; we call it the training corpus. We have a whole bunch of text documents, text data, and as far as we are concerned, you can just imagine it as lists of long passages. What is a novel? It's just a long passage of text. So whether it's a novel or a sentence doesn't really matter.
We just think of them as a big list of strings, a big list of text. Okay, that's the training corpus. And what we do is we take this training corpus and run it through, applying standardization and tokenization, which I will describe, to the entire training corpus up front.

Okay. So we first do standardization, and the default for most applications tends to be this: we first strip capitalization and make everything lowercase, and then we remove punctuation and accents and so on. That's the first thing we do; I'll talk about why we do it in just a moment, but mechanically, we do this first. Then we look at words like "a", "the", "it", and so on, basically filler words, which we need in order to make complete sentences but which may not have any value for predicting things. So we remove them, and they are called stop words. And then finally we take words which are very similar, which have the same kind of stem or root, and we map them to a common representation: "ate", "eaten", "eating" all just become, let's say, "eat". We do that sometimes. So the first we almost always do, the second we often do, and the third we do sometimes. Okay. Now, why do we do any of these things?
>> I think we want to try to recognize the essential thing in the word, right? Whether it's "eaten" or "eat", the essential thing is the "eat". So we want to abstract from it the more essential thing.
>> Right. So why do we need to abstract? You're absolutely correct, we're trying to abstract; why is there a benefit to doing this abstraction? How about somebody from this side of the room? Oh yes.
>> We want to reduce the library.
>> Why is it a good idea to reduce the library,
the size of the library?
>> Because of the amount of computation needed.
>> So that is part of the answer. There's another part to the answer. All right, let's swing to the right.
>> Is it that it facilitates comparison between different sets of standards?
>> Okay, I will go with that, but I think the key thing to realize here is that you want the model, much like in computer vision, where we said: if there's a vertical line, I want to be able to detect it wherever it happens. I don't want the model to think that the vertical line on the left side is different from the vertical line on the right side and only later realize they are the same thing, because it would have wasted valuable capacity learning things which actually happen to be the same, because it didn't know they were the same. So here, if you for example take a word and lowercase it: clearly the case of it, whether it's uppercase or lowercase, is most of the time not going to matter for anything you want to predict. So you're essentially telling the model that the lowercase version and the uppercase version are not different, they're actually the same, and the easiest way to tell the model they are the same is to just make everything lowercase. That is the key idea. Okay. And similarly, if you look at stop words, the reason is that these stop words may not help you predict anything: whether the word "the" showed up in a movie review probably does not affect the sentiment of the review, and therefore let's remove it. So that's a slightly different reason. Stemming is the same reason as the first, which is that all these words kind of mean the same thing; we don't have to be super precise about it, so let's just collapse them onto the same thing.
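Just to make the mechanics concrete, here is a minimal sketch of that standardization step in Python. It is only an illustration: the tiny stop-word list and the optional stemming hook are placeholders I'm adding here, not the exact setup from the Colab.

```python
# Minimal standardization sketch: lowercase, strip accents and punctuation,
# optionally drop stop words and stem. The stop-word list is illustrative.
import string
import unicodedata

STOP_WORDS = {"a", "an", "the", "it", "is", "of", "and"}

def standardize(text, remove_stop_words=True, stem=None):
    text = text.lower()                                   # "The" and "the" become identical
    text = (unicodedata.normalize("NFKD", text)           # strip accents, e.g. "méxico" -> "mexico"
            .encode("ascii", "ignore").decode("ascii"))
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    words = text.split()
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    if stem is not None:                                   # e.g. pass in a stemmer's stem() function
        words = [stem(w) for w in words]
    return " ".join(words)

print(standardize("Hola! What do you picture when you think of travel? Mexico!"))
# -> "hola what do you picture when you think travel mexico"
```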
Now, these are all the standard things we do, and there are important exceptions to all of them; we'll come back to the exceptions a bit later, but that is the standard thing we do. Make sense? All right.

So if you look at something like this sentence here, "Hola! What do you picture when you think of travel? Mexico?", boom, you can see the standardized version: everything has become lowercase, the H has become a small h, the punctuation has disappeared, that's part of standardization; the M in Mexico has become small; and stemming has kicked in too, so "sipping" has become "sip", and so on and so forth. So that's an example of standardization at work.

Okay. The next thing we do is something very important, and it's called tokenization. So now we have standardized everything and we have a bunch of words; we need to split them into what are called tokens. The most common default is to just treat a word as a token: we split on the whitespace. You take each string, and wherever there is whitespace, meaning actual spaces, carriage returns, and things like that, boom, you split on it and you create words out of it. So for instance, if you have this standardized sentence here, you just split it after every word and you get this list. Each of these is now a token.

Now, this has some disadvantages. What are some disadvantages of just splitting on the space between words? Yeah?
>> I think we lose any context, because we look at each word separately; we don't know what came before or what happens next.
>> Right.
So for example, "the cat sat on the mat" and "the mat sat on the cat" will have the same set of tokens, right? Yeah, so you lose the order. What are some other issues with it?
>> For words that should go together, like a name, you lose the fact that it's one name, because you separated it.
>> Right, exactly. So there are compound words, like "father-in-law" for instance; that's one problem. Another problem is that lots of non-English languages don't have this notion of a space between words: the text actually runs one word after the other, and native speakers know from context how to chunk it and break it up. So what do we do then? Because you would basically end up with one token for the whole passage. The other problem is that there are languages, German perhaps the most notable one, in which you have very long words. I saw a word, which I think I might have on a slide somewhere, about this long, and it means the feeling when you realize that something amazing is happening but the rest of the world hasn't woken up to it yet. There's a word for that. Amazing, right? Or Japanese, for example: there's a word, komorebi. Do people know the meaning of that word? It means the transient beauty of sunlight going through fall foliage. There's a word for that. How cool is that? Anyway, sorry, I love that word.

So, back to this. There are all these reasons why splitting on the space between words is not always going to work. So what do modern large language models do? Well, what we have described so far, despite its shortcomings, is actually really good for lots of NLP use cases.
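For reference, here is a minimal sketch of that default whitespace tokenization; the helper name is just for illustration, not the Colab's actual code.

```python
# Minimal whitespace tokenization sketch: split a standardized string on
# any run of whitespace (spaces, tabs, newlines / carriage returns).
def tokenize(standardized_text):
    return standardized_text.split()

print(tokenize("the cat sat on the mat"))
# -> ['the', 'cat', 'sat', 'on', 'the', 'mat']
```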
This kind of word-level tokenization is good enough if you want to classify text, for instance, but if you want to generate text the way LLMs do, it's not going to work. It's not going to work because when you ask ChatGPT a question, it comes back with perfect punctuation; clearly punctuation was not stripped. It comes back with particular upper and lower case; clearly that wasn't stripped either. You can actually make up new words and ask it to use the new word, and it will use it; therefore it's not like it can only recognize a finite set of words. So there's a very clever scheme called byte pair encoding which was invented to handle all of those things. I have slides at the end, and if we have time, we'll talk about it.

All right, for now let's continue. So when this is done for every sentence, every passage in our training data set, we now have a list of distinct tokens. In this simple case, it happens to be all the distinct words we have seen. That's called the vocabulary. That's called the vocabulary.

So now we move to the third and fourth stages, the indexing and encoding stages, and in these stages we only work with the vocabulary. So what we do first is the indexing: we assign a unique integer to each distinct token in the vocabulary. For instance, let's say you took a whole bunch of English literature as your training corpus and ran it through; you'll basically come up with an English dictionary, starting maybe with "a" and going all the way to "zebra", a whole bunch of words. I'm just putting 50,000 here because it turns out the GPT family uses a vocabulary of about 50,000 tokens, so I'm just using 50,000. It's not the actual number of words in the English language; that's much more than this.
So let's say that we give each of them a number, one through 50,000. And then we also introduce a special token called UNK; it stands for "unknown", and we'll come back to it later. We give unknown the integer zero.

Okay. So this is what we mean by indexing: take the tokens you have identified and map each one to an integer. That's the indexing step. Then what we do is we assign a vector to every one of these integers, and that is the encoding step. We assign a vector to each integer. So you have a bunch of distinct words; we put an integer on each word, and then we take that integer and map it to a vector. Yes?
>> Can you please explain what unknown means?
>> Yeah, I'll come back to that; for now, just assume that we have a token called unknown, and the way we are going to use it will become apparent in a few minutes.
>> Does it mean there's a base to it, though? Like a letter or something?
>> It's a placeholder for something else, which I'll describe shortly.

Okay. So that's what we have. So let's say we want to assign a vector to each integer in our vocabulary, and let's say we have 50,000 possible integers because we have 50,000 possible words. We want to assign the vectors so that if you take the vectors of two different words, they look different; clearly that's the whole point of mapping from integer to vector, they had better be different. What is the simplest way to come up with a vector for each of these tokens?
>> The same as the index.
>> Sorry?
>> The same as the index. It's just a vector, one by one, with the index.
>> So, a vector of zeros and ones, or...?
>> It's just a vector with one dimension.
>> Oh, I see. Well, it's creative, but it's a little bit of a cheat, because you're essentially putting a square bracket around the number and calling it a vector. Good try.
>> You could try one-hot encoding.
>> Right, you can try one-hot encoding. So remember the list of distinct tokens you have: you can just think of them as the distinct levels of a categorical variable, and you can use one-hot encoding for them. So the simplest thing is one-hot encoding, and the way it works is that if you have, say, 50,000 possible values, the vector is going to be 50,000 long, with zeros everywhere except at the index value of whatever that token is. So for instance, since we said UNK is going to be the first one, number zero, it has a one in the zero index position and zeros everywhere else; "a" happens to be the second one, so it has a one in the second position and zeros everywhere else. You get the idea. Okay.

So we can do one-hot encoding, and the dimension of this encoding vector, how long it is, is basically the number of distinct tokens you have seen in the training corpus, plus one for this UNK thing we'll get to. That dimension of the encoding vector is called the vocabulary size. It's called the vocabulary size.

All right. So at this point we have created a vocabulary from the training corpus, every distinct token in the vocabulary has been assigned a one-hot vector, and we are done with basic preprocessing.
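Here is a minimal sketch of those indexing and encoding steps, assuming a tokenized training corpus; the helper names are illustrative, and mapping tokens not found in the vocabulary to the UNK slot is just one common way that token gets used, which we come back to in a moment.

```python
# Minimal indexing + one-hot encoding sketch: build a vocabulary from the
# training corpus, reserve integer 0 for UNK, and map each token to a
# one-hot vector whose length is the vocabulary size.
import numpy as np

def build_vocab(tokenized_corpus):
    # tokenized_corpus: a list of token lists, one per training document.
    distinct = sorted({tok for doc in tokenized_corpus for tok in doc})
    return {"UNK": 0, **{tok: i + 1 for i, tok in enumerate(distinct)}}

def one_hot(token, vocab):
    vec = np.zeros(len(vocab), dtype=np.int64)   # vocabulary_size entries, all zero
    vec[vocab.get(token, vocab["UNK"])] = 1      # a single 1 at the token's index (UNK if unseen)
    return vec

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["i", "love", "deep", "learning"]]
vocab = build_vocab(corpus)
print(vocab["cat"], one_hot("cat", vocab))
```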
Okay, so all the text that has come in, every token, has been mapped to some one-hot, potentially very long one-hot vector. Any questions on the mechanics of this before we continue on?

Now let's see what happens when you get a new input sentence, a new sentence freshly arriving, and we want to feed it into a deep neural network. How will this process apply to the new sentence coming in? Okay, so let's assume that we have completed our STIE on the training corpus, and it turns out we found only 99 distinct tokens, 99 distinct words; then we add this UNK thing to it, so we've got 100. Okay, so this is our vocabulary: it starts with UNK, then "a", and goes all the way to "zebra", but there are only 100 of them in total. And just to be very clear, we didn't bother to do things like stemming and stop word removal and stuff like that, which is why you have words like "the" showing up in this list.

Okay. All right. So let's say this input string arrives, "The cat sat on the mat", and we run it through STIE; the cat sat on the mat goes through this whole pipeline, boom, and the output is going to be a table with a bunch of rows and a bunch of columns. Any guesses how many rows and how many columns? Just raise your hands, I'll call on you.
>> Yeah, use the microphone. Go for it.
>> Yeah, I would guess 100 rows and six columns.
>> All right, we'll take a look. 100 by six, as well as six by 100, are both correct; the way I've done it is six by 100, and that's exactly right. So the idea is that this is your vocabulary, and the phrase "the cat sat on the mat", once you change the case of it, becomes like this: "the" becomes a one-hot vector with a one where "the" sits in the vocabulary and zero everywhere else.
I'm not 811 00:30:48,798 --> 00:30:52,079 showing all the zeros because it'll get 812 00:30:50,159 --> 00:30:55,679 too cluttered. 813 00:30:52,079 --> 00:30:57,519 Similarly, cat has a one where the the 814 00:30:55,679 --> 00:30:59,919 cat position is and zero everywhere else 815 00:30:57,519 --> 00:31:02,079 and so on and so forth. Does that make 816 00:30:59,919 --> 00:31:04,159 sense? So, the the phrase the cat sat on 817 00:31:02,079 --> 00:31:06,319 the mat came in as just whatever six 818 00:31:04,159 --> 00:31:10,200 words and then it became this you know 819 00:31:06,319 --> 00:31:10,200 600 entry table. 820 00:31:12,240 --> 00:31:18,000 Okay. Now, what is the best way to feed 821 00:31:15,679 --> 00:31:21,559 this table to a deep neural network? 822 00:31:18,000 --> 00:31:21,558 What can we do? 823 00:31:23,599 --> 00:31:27,678 It's not a vector. It's a table. 824 00:31:26,319 --> 00:31:29,359 If it's a vector, we know what to do. We 825 00:31:27,679 --> 00:31:30,960 just feed it in. We'll just maybe send 826 00:31:29,359 --> 00:31:34,398 it to some, you know, hidden layer and 827 00:31:30,960 --> 00:31:37,200 declare victory at that point. 828 00:31:34,398 --> 00:31:38,959 >> Yeah. 829 00:31:37,200 --> 00:31:42,840 >> You would like to flatten it. And like 830 00:31:38,960 --> 00:31:42,840 how how might you do it? 831 00:31:43,200 --> 00:31:46,960 Flattening is a reasonable answer by the 832 00:31:45,119 --> 00:31:49,038 way. 833 00:31:46,960 --> 00:31:52,480 I think you mean you just have to like 834 00:31:49,038 --> 00:31:54,798 take each like each column 835 00:31:52,480 --> 00:31:56,319 take the first one each row and each row 836 00:31:54,798 --> 00:31:57,839 each word kind of like 837 00:31:56,319 --> 00:31:59,599 >> yeah so basically you can take all the 838 00:31:57,839 --> 00:32:01,439 first columns and then take the second 839 00:31:59,599 --> 00:32:03,359 column and attach it under the first 840 00:32:01,440 --> 00:32:05,120 column and so on and so forth right so 841 00:32:03,359 --> 00:32:08,158 we can certainly do that and that's very 842 00:32:05,119 --> 00:32:10,319 akin to how we work with images right u 843 00:32:08,159 --> 00:32:13,640 but there is one downside to that what 844 00:32:10,319 --> 00:32:13,639 is that downside 845 00:32:15,759 --> 00:32:20,798 uh Um, 846 00:32:18,480 --> 00:32:23,360 >> it's pretty long. Like I wonder if 847 00:32:20,798 --> 00:32:25,440 instead you could for the first word 848 00:32:23,359 --> 00:32:27,439 it's one, for the second word it's two, 849 00:32:25,440 --> 00:32:30,558 and then you maintain the order, but you 850 00:32:27,440 --> 00:32:33,038 still keep it just as like one row. 851 00:32:30,558 --> 00:32:34,960 >> One row. So one issue, so we'll come 852 00:32:33,038 --> 00:32:36,240 back to what we do about this, but what 853 00:32:34,960 --> 00:32:39,440 you're pointing out is it could be very 854 00:32:36,240 --> 00:32:42,399 long, right? Because if each word is a 855 00:32:39,440 --> 00:32:45,278 50,000 long one vector with just six 856 00:32:42,398 --> 00:32:48,000 words, it becomes a 300,000 long vector. 857 00:32:45,278 --> 00:32:50,798 Imagine take the 300,000 long vector and 858 00:32:48,000 --> 00:32:53,839 sending it into a 100 hidden unit hidden 859 00:32:50,798 --> 00:32:56,158 layer. 300,000 times 100 parameters. Too 860 00:32:53,839 --> 00:32:58,879 much can't learn anything. 861 00:32:56,159 --> 00:33:01,360 So that's one issue. 
The other issue is 862 00:32:58,880 --> 00:33:02,720 that different length texts that are 863 00:33:01,359 --> 00:33:04,398 coming in will have different sized 864 00:33:02,720 --> 00:33:06,319 inputs. 865 00:33:04,398 --> 00:33:08,879 So here the cat sat on the mat has six 866 00:33:06,319 --> 00:33:10,558 times 50,000, but maybe the cat sat on 867 00:33:08,880 --> 00:33:13,200 the mat and the rat ran over to the 868 00:33:10,558 --> 00:33:15,359 cat becomes even longer. We can't handle 869 00:33:13,200 --> 00:33:16,798 variable sized inputs. 870 00:33:15,359 --> 00:33:19,599 The inputs all have to be mapped to the 871 00:33:16,798 --> 00:33:22,158 same length. 872 00:33:19,599 --> 00:33:24,079 That's another problem. 873 00:33:22,159 --> 00:33:26,000 >> So maybe you can sum the 874 00:33:24,079 --> 00:33:27,599 columns basically and count how 875 00:33:26,000 --> 00:33:29,519 many times each word appears, since 876 00:33:27,599 --> 00:33:30,240 you're losing the, like, spatial 877 00:33:29,519 --> 00:33:33,359 relationship. 878 00:33:30,240 --> 00:33:34,880 >> Yes. Yeah. So you're both on 879 00:33:33,359 --> 00:33:37,199 the same sort of trajectory, which is 880 00:33:34,880 --> 00:33:39,120 that uh we need to somehow take this 881 00:33:37,200 --> 00:33:40,960 table and make it into a vector. And 882 00:33:39,119 --> 00:33:42,879 there are many ways, like what you folks 883 00:33:40,960 --> 00:33:46,880 are describing, to make it into a vector, 884 00:33:42,880 --> 00:33:48,159 and it turns out um this addresses all the things 885 00:33:46,880 --> 00:33:50,880 that we've been discussing so far, the 886 00:33:48,159 --> 00:33:53,039 varying lengths and so on. So, so 887 00:33:50,880 --> 00:33:56,720 what we can do is we can aggregate all 888 00:33:53,038 --> 00:33:58,319 these things. If you just add them up, 889 00:33:56,720 --> 00:34:00,720 this is what you described. I believe 890 00:33:58,319 --> 00:34:02,720 it's called sum encoding. 891 00:34:00,720 --> 00:34:04,079 And if instead of adding you just OR 892 00:34:02,720 --> 00:34:05,360 them, meaning if you look at the column 893 00:34:04,079 --> 00:34:07,038 and say, is there any one in this 894 00:34:05,359 --> 00:34:08,878 column? If there's any one, I'll put a 895 00:34:07,038 --> 00:34:12,239 one, otherwise it's a zero. 896 00:34:08,878 --> 00:34:13,918 It's called multi-hot encoding. So, so if 897 00:34:12,239 --> 00:34:15,358 you look at this thing, if you literally 898 00:34:13,918 --> 00:34:17,199 just go column by column and count 899 00:34:15,358 --> 00:34:19,838 everything. Okay, there's a one here, 900 00:34:17,199 --> 00:34:21,519 one here. Oh, wait. There are two 'the's 901 00:34:19,838 --> 00:34:23,039 here. So you put a two. That's 902 00:34:21,519 --> 00:34:26,159 count encoding. Multi-hot encoding 903 00:34:23,039 --> 00:34:28,800 just looks for any ones and puts a one. 904 00:34:26,159 --> 00:34:30,159 Make sense? So by the way there are many 905 00:34:28,800 --> 00:34:32,159 ways to take these tables and make them 906 00:34:30,159 --> 00:34:34,159 into vectors. These two happen to be 907 00:34:32,159 --> 00:34:37,480 very commonly used and they kind of make 908 00:34:34,159 --> 00:34:37,480 common sense. 909 00:34:39,199 --> 00:34:43,039 Okay. 910 00:34:41,039 --> 00:34:44,800 Right. So this aggregation approach that 911 00:34:43,039 --> 00:34:46,800 we just described is called the bag of 912 00:34:44,800 --> 00:34:49,039 words model.
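As a rough sketch of the aggregation just described (reusing the same made-up toy vocabulary, so the numbers are only illustrative): sum/count encoding adds the one-hot rows, while multi-hot encoding only records presence or absence. Either way, the result has the same fixed length no matter how long the input text was.

```python
import numpy as np

vocab = ["[UNK]", "a", "cat", "mat", "on", "sat", "the", "zebra"]
word_to_index = {w: i for i, w in enumerate(vocab)}
tokens = "the cat sat on the mat".split()

one_hot = np.zeros((len(tokens), len(vocab)), dtype=int)
for row, tok in enumerate(tokens):
    one_hot[row, word_to_index[tok]] = 1

# Sum / count encoding: add the rows, so 'the' gets a 2 because it appears twice.
count_vector = one_hot.sum(axis=0)

# Multi-hot encoding: just ask whether each column has any one in it at all.
multi_hot_vector = (count_vector > 0).astype(int)

print(count_vector)      # [0 0 1 1 1 1 2 0]  -- one entry per vocabulary word
print(multi_hot_vector)  # [0 0 1 1 1 1 1 0]  -- same length for any input text
```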
913 00:34:46,800 --> 00:34:51,760 Bag of words model. And the reason is 914 00:34:49,039 --> 00:34:53,918 that, first of all, this bag that we have 915 00:34:51,760 --> 00:34:56,560 has words: either it counts whether a 916 00:34:53,918 --> 00:34:58,000 word exists or not, or it counts how many 917 00:34:56,559 --> 00:35:01,039 times the word has 918 00:34:58,000 --> 00:35:04,000 appeared, right? Multi-hot encoding 919 00:35:01,039 --> 00:35:05,920 versus count (or sum) encoding. But 920 00:35:04,000 --> 00:35:09,199 more importantly, and this goes back to 921 00:35:05,920 --> 00:35:12,320 your observation, is that we have lost 922 00:35:09,199 --> 00:35:14,399 the order of the words. Now whether the 923 00:35:12,320 --> 00:35:18,079 phrase that came in was the cat sat on the 924 00:35:14,400 --> 00:35:19,599 mat or the mat sat on the cat, the count 925 00:35:18,079 --> 00:35:21,440 encoding and the multi-hot encoding 926 00:35:19,599 --> 00:35:23,200 are exactly the same. There's no 927 00:35:21,440 --> 00:35:24,880 difference, because we're just looking 928 00:35:23,199 --> 00:35:27,039 for the presence or absence of 929 00:35:24,880 --> 00:35:29,599 words. That's it. We don't care 930 00:35:27,039 --> 00:35:32,480 in which order they appear, right? That's a 931 00:35:29,599 --> 00:35:34,160 huge limitation, but shockingly for many 932 00:35:32,480 --> 00:35:36,800 applications, it doesn't matter. It's 933 00:35:34,159 --> 00:35:38,960 good enough. So, it's called the bag of 934 00:35:36,800 --> 00:35:40,480 words model. 935 00:35:38,960 --> 00:35:42,720 All right, so this is called the bag of 936 00:35:40,480 --> 00:35:46,320 words model. 937 00:35:42,719 --> 00:35:47,199 Um, now does it have any shortcomings? I 938 00:35:46,320 --> 00:35:48,960 already talked about the first 939 00:35:47,199 --> 00:35:51,279 shortcoming, which is that it loses 940 00:35:48,960 --> 00:35:54,320 sequentiality, the order. We lost this 941 00:35:51,280 --> 00:35:55,680 order information, right? Uh we lose 942 00:35:54,320 --> 00:36:00,280 the meaning inherent in the order of the 943 00:35:55,679 --> 00:36:00,279 words. What are some other issues with it? 944 00:36:04,079 --> 00:36:07,720 What do you mean by that? 945 00:36:12,480 --> 00:36:16,559 >> Right, so there are lots of zeros, not 946 00:36:14,639 --> 00:36:18,239 that many ones, so it's a very 947 00:36:16,559 --> 00:36:19,920 sparse amount of information, but maybe 948 00:36:18,239 --> 00:36:22,000 it is carrying around a lot of information 949 00:36:19,920 --> 00:36:24,159 to make it all work. Now there are 950 00:36:22,000 --> 00:36:26,239 some tricks, CS computer science tricks, 951 00:36:24,159 --> 00:36:29,118 to handle sparsity in some clever ways, 952 00:36:26,239 --> 00:36:30,319 but it is certainly an issue. Now the 953 00:36:29,119 --> 00:36:32,640 other issue is that, let's say the 954 00:36:30,320 --> 00:36:34,960 vocabulary is very long. 955 00:36:32,639 --> 00:36:36,879 Each input sentence, whether it's the 956 00:36:34,960 --> 00:36:39,838 collected works of William Shakespeare 957 00:36:36,880 --> 00:36:42,640 or the phrase I love you, will have the 958 00:36:39,838 --> 00:36:45,519 same length input. 959 00:36:42,639 --> 00:36:48,078 It's the same length input 960 00:36:45,519 --> 00:36:51,440 because ultimately every incoming thing 961 00:36:48,079 --> 00:36:54,480 gets mapped into one vector. Okay, that 962 00:36:51,440 --> 00:36:56,159 feels a little suboptimal.
963 00:36:54,480 --> 00:36:59,280 Clearly the collected works of Shakespeare have 964 00:36:56,159 --> 00:37:02,719 a lot more stuff going on in them. 965 00:36:59,280 --> 00:37:04,480 Right? So that's a problem. In 966 00:37:02,719 --> 00:37:06,239 particular, even for very, very small things that 967 00:37:04,480 --> 00:37:08,159 come in, you'll be spending a lot of 968 00:37:06,239 --> 00:37:10,799 compute on those long vectors and 969 00:37:08,159 --> 00:37:13,039 processing them. Um, now you can 970 00:37:10,800 --> 00:37:14,560 mitigate some of this by choosing only 971 00:37:13,039 --> 00:37:16,000 the most frequent words. You don't have 972 00:37:14,559 --> 00:37:18,000 to take, you know, I think the English 973 00:37:16,000 --> 00:37:20,800 language, I read somewhere, has roughly 974 00:37:18,000 --> 00:37:23,440 500,000 words or so. Uh, but turns out 975 00:37:20,800 --> 00:37:24,640 the top 50,000 most frequent words are 976 00:37:23,440 --> 00:37:27,200 responsible for just about everything 977 00:37:24,639 --> 00:37:29,519 you're going to see ever. And the other 978 00:37:27,199 --> 00:37:31,358 450,000 are what's called the long tail. 979 00:37:29,519 --> 00:37:33,119 They almost never happen, right? You 980 00:37:31,358 --> 00:37:34,639 never see them. So, you can be very 981 00:37:33,119 --> 00:37:36,640 pragmatic and say, "I'm not going to 982 00:37:34,639 --> 00:37:38,559 take every little word that I see into my 983 00:37:36,639 --> 00:37:40,000 vocabulary. I'm going to only take the 984 00:37:38,559 --> 00:37:42,078 most frequent words. I'm just going to 985 00:37:40,000 --> 00:37:44,000 ignore the rest. 986 00:37:42,079 --> 00:37:46,960 I'm just going to ignore the rest." 987 00:37:44,000 --> 00:37:50,079 Okay? 988 00:37:46,960 --> 00:37:52,400 But if you ignore the rest, let's say 989 00:37:50,079 --> 00:37:55,280 there is one word, uh, let's take some 990 00:37:52,400 --> 00:37:57,358 Shakespeare word, Hamlet. Let's let's 991 00:37:55,280 --> 00:37:58,640 assume that you ignore the word Hamlet 992 00:37:57,358 --> 00:38:00,400 from your training corpus. You just 993 00:37:58,639 --> 00:38:02,159 delete it because it's not one of the 994 00:38:00,400 --> 00:38:04,480 top most frequent things you have seen. 995 00:38:02,159 --> 00:38:06,559 And then somebody sends you a text 996 00:38:04,480 --> 00:38:08,240 saying, you know, Hamlet was a bad 997 00:38:06,559 --> 00:38:10,400 prince. 998 00:38:08,239 --> 00:38:12,159 Analyze the sentiment of the sentence. 999 00:38:10,400 --> 00:38:14,160 Well, when you see Hamlet, what is your 1000 00:38:12,159 --> 00:38:15,358 system going to do? 1001 00:38:14,159 --> 00:38:16,799 It's going to look at the Hamlet and 1002 00:38:15,358 --> 00:38:18,480 say, I can't see it in my vocabulary 1003 00:38:16,800 --> 00:38:19,920 anywhere. 1004 00:38:18,480 --> 00:38:22,400 And if it can't see it in the vocabulary, 1005 00:38:19,920 --> 00:38:26,000 what is the only thing it can do? 1006 00:38:22,400 --> 00:38:28,400 Replace it with UNK. So that's where UNK 1007 00:38:26,000 --> 00:38:30,079 comes into the picture. 1008 00:38:28,400 --> 00:38:32,000 So whenever it can't see something in 1009 00:38:30,079 --> 00:38:35,839 the vocabulary in a new input, it just 1010 00:38:32,000 --> 00:38:37,838 replaces it with UNK.
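Here is a small hand-rolled sketch of that pruning idea, not the Keras layer the Colab uses later: build a vocabulary from only the most frequent words and map everything else to UNK. The corpus, the cutoff, and the names are all made up for illustration.

```python
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "hamlet was a bad prince",
]

# Count word frequencies over the training corpus.
counts = Counter(word for line in corpus for word in line.lower().split())

# Keep only the top-k most frequent words; everything else will become [UNK].
top_k = 6
vocab = ["[UNK]"] + [w for w, _ in counts.most_common(top_k)]
word_to_index = {w: i for i, w in enumerate(vocab)}

def encode(sentence):
    # Any word outside the truncated vocabulary is replaced by index 0 ([UNK]).
    return [word_to_index.get(w, 0) for w in sentence.lower().split()]

print(vocab)
print(encode("Hamlet was a bad prince"))  # rare words all collapse to index 0
print(encode("Romeo was a bad prince"))   # ...so Hamlet and Romeo look identical
```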
Which means that 1011 00:38:35,838 --> 00:38:40,880 if you had ignored Romeo, Juliet, and 1012 00:38:37,838 --> 00:38:42,239 Hamlet in the training corpus, 1013 00:38:40,880 --> 00:38:44,079 all of them are going to be replaced by 1014 00:38:42,239 --> 00:38:46,719 the same UNK, which means that we can't 1015 00:38:44,079 --> 00:38:48,960 distinguish between them anymore. 1016 00:38:46,719 --> 00:38:52,159 >> So is this where hallucination 1017 00:38:48,960 --> 00:38:54,880 comes into play here, where it doesn't 1018 00:38:52,159 --> 00:38:56,239 recognize it? 1019 00:38:54,880 --> 00:38:58,400 Hm, interesting question. Is this where 1020 00:38:56,239 --> 00:39:00,799 hallucination comes up? Actually, as it 1021 00:38:58,400 --> 00:39:03,680 turns out, no, as we will see when we 1022 00:39:00,800 --> 00:39:06,480 talk about LLMs later. Uh LLMs actually 1023 00:39:03,679 --> 00:39:08,078 will not have this UNK problem because 1024 00:39:06,480 --> 00:39:09,440 they use a different tokenization scheme 1025 00:39:08,079 --> 00:39:10,960 which can handle anything you throw at 1026 00:39:09,440 --> 00:39:12,480 it, including new stuff you just made 1027 00:39:10,960 --> 00:39:14,800 up. 1028 00:39:12,480 --> 00:39:17,838 So, we'll come back to that. 1029 00:39:14,800 --> 00:39:19,760 All right. Um so, that's what we have. 1030 00:39:17,838 --> 00:39:21,440 And so what we're going to do is, despite 1031 00:39:19,760 --> 00:39:23,599 its shortcomings, bag of words is 1032 00:39:21,440 --> 00:39:26,400 actually a really good default for many 1033 00:39:23,599 --> 00:39:27,599 NLP tasks. Uh and in the spirit of do 1034 00:39:26,400 --> 00:39:28,880 the simple stuff first and do 1035 00:39:27,599 --> 00:39:30,400 complicated things only if the simple 1036 00:39:28,880 --> 00:39:32,079 doesn't work, we'll use a bag of words 1037 00:39:30,400 --> 00:39:36,480 model right now. Okay. So we'll switch 1038 00:39:32,079 --> 00:39:39,440 to a Colab and see how it's done. 1039 00:39:36,480 --> 00:39:40,719 So here the application we're going 1040 00:39:39,440 --> 00:39:43,119 to work with is kind of a fun 1041 00:39:40,719 --> 00:39:46,000 application. Uh we're going to try to 1042 00:39:43,119 --> 00:39:47,599 predict the genre of songs. 1043 00:39:46,000 --> 00:39:50,480 Okay, it's a nice classification use 1044 00:39:47,599 --> 00:39:52,800 case. Um, so we want to take some 1045 00:39:50,480 --> 00:39:55,440 arbitrary song and then classify it into 1046 00:39:52,800 --> 00:39:59,599 either hip-hop, rock or pop. 1047 00:39:55,440 --> 00:40:01,200 Okay. Um, and so for instance, 1048 00:39:59,599 --> 00:40:03,200 right, these are the kind of 1049 00:40:01,199 --> 00:40:04,879 lyrics you're going to see. And as you 1050 00:40:03,199 --> 00:40:07,279 will see in this data set, the data set, 1051 00:40:04,880 --> 00:40:10,320 just a quick word of caution, uh, the 1052 00:40:07,280 --> 00:40:12,720 data set does have lyrics which may not 1053 00:40:10,320 --> 00:40:14,320 be sort of, you know, safe for work as 1054 00:40:12,719 --> 00:40:16,719 it were. So I'm not going to be, like, 1055 00:40:14,320 --> 00:40:18,880 exploring the lyrics in the Colab, but 1056 00:40:16,719 --> 00:40:20,959 I just wanted you to be aware of it. Okay. 1057 00:40:18,880 --> 00:40:22,480 Um, so but it's just some data set that 1058 00:40:20,960 --> 00:40:24,240 we downloaded from somewhere, right? Uh, 1059 00:40:22,480 --> 00:40:25,599 it's got all these lyrics. Okay.
So 1060 00:40:24,239 --> 00:40:27,759 we're going to try to classify each 1061 00:40:25,599 --> 00:40:29,200 verse that we see into one of three 1062 00:40:27,760 --> 00:40:31,680 things: hip-hop, rock or pop. It's a 1063 00:40:29,199 --> 00:40:33,279 multi-class classification problem. 1064 00:40:31,679 --> 00:40:35,039 All right. Actually, what is the 1065 00:40:33,280 --> 00:40:37,760 simplest neural network based classifier 1066 00:40:35,039 --> 00:40:41,119 we can build 1067 00:40:37,760 --> 00:40:42,800 for this problem? 1068 00:40:41,119 --> 00:40:44,880 All right. So what is the simplest 1069 00:40:42,800 --> 00:40:47,519 neural network we can build for this 1070 00:40:44,880 --> 00:40:49,519 problem? So remember, what is the input? 1071 00:40:47,519 --> 00:40:50,719 The input is going to be a bunch of song 1072 00:40:49,519 --> 00:40:52,800 lyrics. It's going to be a really long 1073 00:40:50,719 --> 00:40:54,879 song for all you know, right? And we're 1074 00:40:52,800 --> 00:40:56,560 going to use the bag of words model. Uh 1075 00:40:54,880 --> 00:40:59,680 and let's assume for a moment that we 1076 00:40:56,559 --> 00:41:02,239 will use multi-hot encoding, right? We'll 1077 00:40:59,679 --> 00:41:04,000 create a vocabulary from this, for the 1078 00:41:02,239 --> 00:41:06,559 songs. We'll take all the songs, we'll 1079 00:41:04,000 --> 00:41:08,239 process them, run them through STIE, we'll 1080 00:41:06,559 --> 00:41:10,719 do multi-hot encoding, which means that 1081 00:41:08,239 --> 00:41:14,239 every song that comes in will 1082 00:41:10,719 --> 00:41:17,279 be a vector. And how long 1083 00:41:14,239 --> 00:41:20,719 will it be? As long as the... 1084 00:41:17,280 --> 00:41:24,720 correct, as long as the vocabulary size, right? So um 1085 00:41:20,719 --> 00:41:26,480 so maybe what comes in is this phrase, um, 1086 00:41:24,719 --> 00:41:28,000 since it's supposed to be songs I'll say 1087 00:41:26,480 --> 00:41:30,960 something which is probably common to 1088 00:41:28,000 --> 00:41:34,639 90% of songs: I love you. 1089 00:41:30,960 --> 00:41:38,480 Okay, that goes in. 1090 00:41:34,639 --> 00:41:42,000 It goes into our STIE process, 1091 00:41:38,480 --> 00:41:49,039 and then this STIE process gives us a 1092 00:41:42,000 --> 00:41:50,318 vector which is X1, X2, all the way to XV, 1093 00:41:49,039 --> 00:41:52,639 where V stands for the size of the 1094 00:41:50,318 --> 00:41:54,960 vocabulary. Okay. So that that's our 1095 00:41:52,639 --> 00:41:58,239 input layer, 1096 00:41:54,960 --> 00:42:02,400 all the way. So knowing what we know now 1097 00:41:58,239 --> 00:42:04,959 about deep learning, what can we do next? 1098 00:42:02,400 --> 00:42:07,920 >> Couldn't you, or maybe I'm getting ahead, 1099 00:42:04,960 --> 00:42:10,240 but wouldn't the classifier, just like 1100 00:42:07,920 --> 00:42:11,920 the baseline, would be classify it as the 1101 00:42:10,239 --> 00:42:13,199 most common genre? 1102 00:42:11,920 --> 00:42:14,800 >> That is the baseline. Correct. Correct. 1103 00:42:13,199 --> 00:42:17,039 I'm just saying, and we'll come to the 1104 00:42:14,800 --> 00:42:18,720 baseline a bit later. But here I'm 1105 00:42:17,039 --> 00:42:21,119 saying suppose you wanted to 1106 00:42:18,719 --> 00:42:23,358 build a neural network model for this. 1107 00:42:21,119 --> 00:42:25,280 How would you set it up? 1108 00:42:23,358 --> 00:42:26,078 >> You think about the layers that you 1109 00:42:25,280 --> 00:42:27,359 want, 1110 00:42:26,079 --> 00:42:29,039 >> right?
And what is the simplest thing 1111 00:42:27,358 --> 00:42:30,159 you can do with a neural network? How 1112 00:42:29,039 --> 00:42:33,279 many layers? 1113 00:42:30,159 --> 00:42:35,358 >> Uh no layers. Well, then it becomes 1114 00:42:33,280 --> 00:42:36,800 problematic with even a neural network 1115 00:42:35,358 --> 00:42:37,759 because it could just be logistic 1116 00:42:36,800 --> 00:42:38,800 regression 1117 00:42:37,760 --> 00:42:41,760 >> one hidden layer. 1118 00:42:38,800 --> 00:42:43,119 >> Yes, thank you. I'm being a little 1119 00:42:41,760 --> 00:42:44,800 squishy about this because there are 1120 00:42:43,119 --> 00:42:46,480 some people who be like well even if 1121 00:42:44,800 --> 00:42:48,560 there's no hidden layers if you're using 1122 00:42:46,480 --> 00:42:49,838 relus and this and that and sigma that's 1123 00:42:48,559 --> 00:42:51,519 maybe it's a neural network and I don't 1124 00:42:49,838 --> 00:42:54,400 want to get into that how many ages in 1125 00:42:51,519 --> 00:42:56,079 the tip of a pin argument. So um so yeah 1126 00:42:54,400 --> 00:42:57,358 we need one hidden layer right in this 1127 00:42:56,079 --> 00:42:59,039 course we need at least one hidden layer 1128 00:42:57,358 --> 00:43:01,119 for it to qualify as a neural network. 1129 00:42:59,039 --> 00:43:04,800 Okay, so let's have a hidden layer and 1130 00:43:01,119 --> 00:43:07,680 we'll have a bunch of ReLUS as usual. 1131 00:43:04,800 --> 00:43:09,119 Okay, bunch of ReLULS and I'll ignore 1132 00:43:07,679 --> 00:43:11,519 all the arrows between them. It's kind 1133 00:43:09,119 --> 00:43:13,039 of a pain. U and then we come to the 1134 00:43:11,519 --> 00:43:15,358 output layer. And what should the output 1135 00:43:13,039 --> 00:43:16,960 layer be? 1136 00:43:15,358 --> 00:43:19,519 How many nodes do we have need in the 1137 00:43:16,960 --> 00:43:22,400 output layer? Three, right? Hip-hop, 1138 00:43:19,519 --> 00:43:23,759 rock, whatever. Pop. So we And then that 1139 00:43:22,400 --> 00:43:25,358 layer is called what? What activation 1140 00:43:23,760 --> 00:43:27,520 function? 1141 00:43:25,358 --> 00:43:30,960 Softmax. Perfect. Love it. love this 1142 00:43:27,519 --> 00:43:33,838 class. All right, three things. Uh, 1143 00:43:30,960 --> 00:43:36,880 rock, hip-hop, 1144 00:43:33,838 --> 00:43:39,199 and uh, pop, right? And this is a soft 1145 00:43:36,880 --> 00:43:41,760 max right there. 1146 00:43:39,199 --> 00:43:44,639 And then it's going to give us three 1147 00:43:41,760 --> 00:43:46,400 probabilities that add up to one because 1148 00:43:44,639 --> 00:43:49,679 it's a soft max. So that's our basic 1149 00:43:46,400 --> 00:43:51,039 network, right? Perfect. Yeah. 1150 00:43:49,679 --> 00:43:52,799 >> Why do you need those probabilities? 1151 00:43:51,039 --> 00:43:55,279 Again, if you just want to identify the 1152 00:43:52,800 --> 00:43:56,720 most likely genre, the soft max just 1153 00:43:55,280 --> 00:43:59,359 give you a way to kind of add them all 1154 00:43:56,719 --> 00:44:01,358 up once. Why do you need soft? Why don't 1155 00:43:59,358 --> 00:44:01,759 you just take the max value and say it's 1156 00:44:01,358 --> 00:44:03,679 that? 1157 00:44:01,760 --> 00:44:05,760 >> Oh, interesting question. Why can't we 1158 00:44:03,679 --> 00:44:09,519 just produce three numbers and grab the 1159 00:44:05,760 --> 00:44:11,200 maximum number? 
So, it turns out finding 1160 00:44:09,519 --> 00:44:12,719 the maximum bunch of numbers that 1161 00:44:11,199 --> 00:44:14,960 function 1162 00:44:12,719 --> 00:44:16,959 is not very it's not very friendly for 1163 00:44:14,960 --> 00:44:18,880 differentiation. 1164 00:44:16,960 --> 00:44:20,800 And ultimately you want to take this 1165 00:44:18,880 --> 00:44:23,200 output, run it through a loss function 1166 00:44:20,800 --> 00:44:25,839 like cross entropy and then be able to 1167 00:44:23,199 --> 00:44:27,679 run back prop on it. And so 1168 00:44:25,838 --> 00:44:29,599 fundamentally back propagation is just 1169 00:44:27,679 --> 00:44:31,199 differentiation and it requires 1170 00:44:29,599 --> 00:44:34,160 everything inside of it to have well- 1171 00:44:31,199 --> 00:44:36,239 behaved gradients. And so this little 1172 00:44:34,159 --> 00:44:39,039 max function is actually not well 1173 00:44:36,239 --> 00:44:41,598 behaved and which is why we have a soft 1174 00:44:39,039 --> 00:44:44,318 version of it soft max which makes it 1175 00:44:41,599 --> 00:44:45,760 easy to differentiate. So I can tell you 1176 00:44:44,318 --> 00:44:49,079 more about it offline but that's sort of 1177 00:44:45,760 --> 00:44:49,079 the quick synopsis. 1178 00:44:49,119 --> 00:44:52,640 So a lot of tricks you will see in the 1179 00:44:50,480 --> 00:44:55,440 neural network literature or ways to 1180 00:44:52,639 --> 00:44:57,358 avoid this the problem of having certain 1181 00:44:55,440 --> 00:44:59,200 the like the obvious choice of function 1182 00:44:57,358 --> 00:45:00,400 will not be well behaved for 1183 00:44:59,199 --> 00:45:02,960 differentiation. That's why you need to 1184 00:45:00,400 --> 00:45:05,039 go through all these other mechanisms 1185 00:45:02,960 --> 00:45:06,400 much like we couldn't just say accuracy. 1186 00:45:05,039 --> 00:45:07,679 Why don't you just maximize accuracy 1187 00:45:06,400 --> 00:45:10,880 instead of doing this cross entropy 1188 00:45:07,679 --> 00:45:14,480 business? Same reason. 1189 00:45:10,880 --> 00:45:17,640 All right. So let's come back here. 1190 00:45:14,480 --> 00:45:17,639 All right. 1191 00:45:20,639 --> 00:45:27,279 So that's what we created on the thing. 1192 00:45:23,679 --> 00:45:28,960 Right? Cats out of the mat vocabulary 1193 00:45:27,280 --> 00:45:31,359 thing and so on. And I you know I was 1194 00:45:28,960 --> 00:45:33,519 playing around with it uh earlier and so 1195 00:45:31,358 --> 00:45:35,039 I I found that you know eight relu 1196 00:45:33,519 --> 00:45:36,159 neurons were pretty good to get the job 1197 00:45:35,039 --> 00:45:37,838 done. So I'm just going to go with eight 1198 00:45:36,159 --> 00:45:39,920 rel 1199 00:45:37,838 --> 00:45:44,078 neurons in the hidden layer. 1200 00:45:39,920 --> 00:45:47,039 So I think that brings us to the collab. 1201 00:45:44,079 --> 00:45:49,519 Yeah. So let's switch to the collab. 1202 00:45:47,039 --> 00:45:50,960 All right. So um that's what we have 1203 00:45:49,519 --> 00:45:52,318 here. We you know there's a little bit 1204 00:45:50,960 --> 00:45:54,159 of verbiage here which just describes 1205 00:45:52,318 --> 00:45:56,400 what I just talked about. So we'll do 1206 00:45:54,159 --> 00:45:58,639 the usual things and upload everything 1207 00:45:56,400 --> 00:46:01,280 uh import everything we want. TensorFlow 1208 00:45:58,639 --> 00:46:03,838 and caras and the the holy trinity of 1209 00:46:01,280 --> 00:46:07,040 numpy pandas and mattplot lib. 
Uh set 1210 00:46:03,838 --> 00:46:09,679 the random seed as usual at 42. 1211 00:46:07,039 --> 00:46:11,759 This is our STIE framework here. And the 1212 00:46:09,679 --> 00:46:14,480 nice thing is that all four of these 1213 00:46:11,760 --> 00:46:16,880 STIE things are beautifully implemented 1214 00:46:14,480 --> 00:46:19,440 in Keras as a single simple layer called 1215 00:46:16,880 --> 00:46:22,880 the TextVectorization layer. Okay, which 1216 00:46:19,440 --> 00:46:25,200 is nice. Um, so we have the TextVectorization layer 1217 00:46:22,880 --> 00:46:26,960 right here. And so in our first example, 1218 00:46:25,199 --> 00:46:29,039 what we'll do is we will use a default 1219 00:46:26,960 --> 00:46:31,199 standardization, which will just remove 1220 00:46:29,039 --> 00:46:33,039 punctuation, convert to lowercase. We'll 1221 00:46:31,199 --> 00:46:35,598 use a default tokenization, which just 1222 00:46:33,039 --> 00:46:37,358 means split on the space between words. 1223 00:46:35,599 --> 00:46:39,680 And then we will set the output to 1224 00:46:37,358 --> 00:46:41,039 multi-hot. Right? All the things we 1225 00:46:39,679 --> 00:46:43,598 talked about, Keras will just do for 1226 00:46:41,039 --> 00:46:45,759 you automatically. And so output mode 1227 00:46:43,599 --> 00:46:47,359 multi-hot, standardize, split on whitespace, 1228 00:46:45,760 --> 00:46:49,760 and boom, you run the text 1229 00:46:47,358 --> 00:46:52,000 vectorization thing. And once you do it, 1230 00:46:49,760 --> 00:46:53,599 Keras creates this text vectorization layer 1231 00:46:52,000 --> 00:46:56,159 with these settings and it's now ready 1232 00:46:53,599 --> 00:46:58,480 to swing into action. So what does swing 1233 00:46:56,159 --> 00:46:59,679 into action actually mean? Well, now we 1234 00:46:58,480 --> 00:47:01,920 need to actually feed it a training 1235 00:46:59,679 --> 00:47:02,960 corpus so that it can do all the things 1236 00:47:01,920 --> 00:47:07,039 it's supposed to do and create the 1237 00:47:02,960 --> 00:47:08,800 vocabulary for you, right? So um so and 1238 00:47:07,039 --> 00:47:11,599 that thing is called the adapt method. 1239 00:47:08,800 --> 00:47:14,880 So we create a tiny training corpus for 1240 00:47:11,599 --> 00:47:16,160 us. This is our data set. Um right, this is 1241 00:47:14,880 --> 00:47:18,240 just a bunch of words from some of these 1242 00:47:16,159 --> 00:47:19,920 lyrics. And then what we'll do is we'll 1243 00:47:18,239 --> 00:47:21,838 take this layer that we just defined 1244 00:47:19,920 --> 00:47:24,240 here, that we have set up here. And then 1245 00:47:21,838 --> 00:47:26,078 we will ask this layer to actually 1246 00:47:24,239 --> 00:47:29,679 create the vocabulary using this adapt 1247 00:47:26,079 --> 00:47:31,760 command. Okay. Index the vocabulary. And 1248 00:47:29,679 --> 00:47:34,239 it's done. And once it does it, you can 1249 00:47:31,760 --> 00:47:36,160 actually ask it for the vocabulary. 1250 00:47:34,239 --> 00:47:38,479 Okay, this is the vocabulary, using the 1251 00:47:36,159 --> 00:47:41,679 get vocabulary command. And so first of 1252 00:47:38,480 --> 00:47:45,119 all, how long is the vocab? 17. 17 words, 1253 00:47:41,679 --> 00:47:46,799 17 tokens. What are they? 1254 00:47:45,119 --> 00:47:48,880 And see here, and you can see these are 1255 00:47:46,800 --> 00:47:50,640 all the words, and you can see it has 1256 00:47:48,880 --> 00:47:52,400 stuck UNK in at the very beginning, 1257 00:47:50,639 --> 00:47:54,239 right? It's sort of the default.
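A minimal sketch of what that looks like in code, assuming TensorFlow 2.x; the four example phrases are placeholders rather than the actual strings in the Colab.

```python
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

# Default standardization (lowercase + strip punctuation), default whitespace
# split, and multi-hot output: the whole STIE pipeline in one layer.
vectorizer = TextVectorization(output_mode="multi_hot")

# A tiny stand-in training corpus; adapt() scans it and indexes the vocabulary.
tiny_corpus = tf.constant([
    "Write the verse, rewrite the verse",
    "Arrays of rhymes, arrays of beats",
    "The beat goes on and on",
    "Rewrite it until it rhymes",
])
vectorizer.adapt(tiny_corpus)

vocab = vectorizer.get_vocabulary()
print(len(vocab))   # vocabulary size, with '[UNK]' sitting at index 0
print(vocab[:5])    # the most frequent tokens come first
```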
By the 1258 00:47:52,400 --> 00:47:55,599 way, uh just a little programming tip if 1259 00:47:54,239 --> 00:47:57,118 you're not familiar with if you don't 1260 00:47:55,599 --> 00:47:58,400 have a ton of programming experience. If 1261 00:47:57,119 --> 00:48:00,240 you want to, you know, print these 1262 00:47:58,400 --> 00:48:02,960 Python objects like list and all in a 1263 00:48:00,239 --> 00:48:05,838 pretty way, one trick that often works 1264 00:48:02,960 --> 00:48:08,240 is just stick it into a data frame 1265 00:48:05,838 --> 00:48:09,599 and then print it. Usually, it'll print 1266 00:48:08,239 --> 00:48:11,679 it in a much better way. So, you can see 1267 00:48:09,599 --> 00:48:13,760 it like that. 1268 00:48:11,679 --> 00:48:15,598 So, you can see here ank arrays blah 1269 00:48:13,760 --> 00:48:17,920 blah blah blah blah. And you can see 1270 00:48:15,599 --> 00:48:19,760 integer zero assigned the ank token. By 1271 00:48:17,920 --> 00:48:22,559 the way, how come it picked the word 1272 00:48:19,760 --> 00:48:26,960 arrays as the second entry? Why not 1273 00:48:22,559 --> 00:48:29,839 something like an or um you know why 1274 00:48:26,960 --> 00:48:32,400 not? Why not a how come a is not chosen 1275 00:48:29,838 --> 00:48:36,039 as a second entry? Why why did it pick 1276 00:48:32,400 --> 00:48:36,039 arrays? You think 1277 00:48:40,318 --> 00:48:45,358 >> maybe maybe it tried like the words that 1278 00:48:43,358 --> 00:48:49,119 are most influential on the meaning of 1279 00:48:45,358 --> 00:48:49,119 the sentence to be on the 1280 00:48:49,760 --> 00:48:54,160 But it at this point it doesn't know 1281 00:48:51,280 --> 00:48:56,000 what we're going to use it for. 1282 00:48:54,159 --> 00:48:57,358 So it has no way to know what word is 1283 00:48:56,000 --> 00:48:59,599 useful because we haven't told it how 1284 00:48:57,358 --> 00:49:01,838 we're going to use it. 1285 00:48:59,599 --> 00:49:04,559 But but you're kind of on the right 1286 00:49:01,838 --> 00:49:06,400 track. So what KAS does is it'll 1287 00:49:04,559 --> 00:49:07,680 calculate it'll find all these tokens 1288 00:49:06,400 --> 00:49:09,760 and then it'll actually just sort them 1289 00:49:07,679 --> 00:49:12,239 by frequency. 1290 00:49:09,760 --> 00:49:13,680 So the most frequent as it turns out in 1291 00:49:12,239 --> 00:49:15,838 those four sentences we gave it happen 1292 00:49:13,679 --> 00:49:17,838 to be the word arrays. That's why arrays 1293 00:49:15,838 --> 00:49:19,279 is showing up on top. Um, and you can 1294 00:49:17,838 --> 00:49:21,759 actually confirm this by going to the 1295 00:49:19,280 --> 00:49:23,760 our little data set and you can see here 1296 00:49:21,760 --> 00:49:25,920 array shows up here and was up here 1297 00:49:23,760 --> 00:49:29,920 twice and that's why it came up on top. 1298 00:49:25,920 --> 00:49:32,559 Okay. All right. So that's what we have 1299 00:49:29,920 --> 00:49:34,400 and u and now now that we have populated 1300 00:49:32,559 --> 00:49:36,319 this we can run any sentence through it 1301 00:49:34,400 --> 00:49:37,358 easily. Yeah. 1302 00:49:36,318 --> 00:49:39,599 >> Does [clears throat] it matter that it's 1303 00:49:37,358 --> 00:49:41,199 on the top or is it just 1304 00:49:39,599 --> 00:49:43,519 >> it doesn't matter. It doesn't matter. 
1305 00:49:41,199 --> 00:49:45,598 The reason why it's helpful later on is 1306 00:49:43,519 --> 00:49:48,079 because suppose you tell Kas hey don't 1307 00:49:45,599 --> 00:49:50,559 take every word you see here give me 1308 00:49:48,079 --> 00:49:52,318 only the most frequent 100 words I don't 1309 00:49:50,559 --> 00:49:56,519 want any more than that it can easily do 1310 00:49:52,318 --> 00:49:56,519 that that's the reason yeah 1311 00:50:01,199 --> 00:50:05,679 >> this is just a vocabulary so basically 1312 00:50:03,280 --> 00:50:07,519 you you give it all this phrases it 1313 00:50:05,679 --> 00:50:09,039 happens just four phrases in our example 1314 00:50:07,519 --> 00:50:10,639 and then it finds all the distinct words 1315 00:50:09,039 --> 00:50:12,558 and you know does all that stuff and and 1316 00:50:10,639 --> 00:50:14,480 then it has created a vocabulary. At 1317 00:50:12,559 --> 00:50:17,680 this point the the training corpus you 1318 00:50:14,480 --> 00:50:19,440 fed it will is forgotten and the only 1319 00:50:17,679 --> 00:50:21,838 thing has survived this processing is 1320 00:50:19,440 --> 00:50:23,280 just the vocabulary. That's it. Now we 1321 00:50:21,838 --> 00:50:25,838 have to start applying it to any kind of 1322 00:50:23,280 --> 00:50:28,559 text we want to use it for. 1323 00:50:25,838 --> 00:50:30,159 So here when you come back here u so 1324 00:50:28,559 --> 00:50:32,240 this is what we have and so what you can 1325 00:50:30,159 --> 00:50:33,920 do is you can take any sentence and you 1326 00:50:32,239 --> 00:50:35,039 can just run it through a layer and to 1327 00:50:33,920 --> 00:50:37,039 make sure that actually is doing the 1328 00:50:35,039 --> 00:50:39,119 right thing for you. So we'll take the 1329 00:50:37,039 --> 00:50:40,558 sentence, we will then run it through 1330 00:50:39,119 --> 00:50:42,000 the text vectorization layer by just 1331 00:50:40,559 --> 00:50:45,640 passing that sentence into it and then 1332 00:50:42,000 --> 00:50:45,639 we can just print it. 1333 00:50:46,000 --> 00:50:50,559 So now it's giving you a tensor. This is 1334 00:50:47,838 --> 00:50:54,318 a multihot encoder tensor with all these 1335 00:50:50,559 --> 00:50:56,400 ones and zeros. So note that this tensor 1336 00:50:54,318 --> 00:50:58,079 is 17 units long which is which is a 1337 00:50:56,400 --> 00:51:00,880 good check because our vocabulary is 17 1338 00:50:58,079 --> 00:51:03,519 long. So it's better match that. Uh now 1339 00:51:00,880 --> 00:51:05,680 recall that the ank token is at the 1340 00:51:03,519 --> 00:51:08,159 first location. It's at index zero and 1341 00:51:05,679 --> 00:51:10,558 it says that this encoded sentence does 1342 00:51:08,159 --> 00:51:13,358 have an unk word. 1343 00:51:10,559 --> 00:51:15,920 Okay. So 1344 00:51:13,358 --> 00:51:19,039 why is that? What is this UN word? 1345 00:51:15,920 --> 00:51:21,680 Anyone can guess? 1346 00:51:19,039 --> 00:51:24,400 Well, it turns out to be the word still. 1347 00:51:21,679 --> 00:51:26,480 Um I think yeah still is not in our 1348 00:51:24,400 --> 00:51:28,079 vocabulary because the four sentences 1349 00:51:26,480 --> 00:51:30,240 which is our training corpus used to 1350 00:51:28,079 --> 00:51:32,000 build vocabulary. They had a lot of 1351 00:51:30,239 --> 00:51:33,838 write and rewrite but there was no still 1352 00:51:32,000 --> 00:51:35,920 in it anyway. That's why there's an UN 1353 00:51:33,838 --> 00:51:38,159 ank for it. 
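Continuing the hypothetical sketch above, this is roughly how a new sentence goes through the adapted layer, and how you can confirm that an out-of-vocabulary word such as "still" falls into the UNK slot (the sentence is a stand-in, not the Colab's).

```python
# Vectorize a new sentence with the adapted layer from the sketch above.
encoded = vectorizer(tf.constant(["The verse is still unwritten"]))
print(encoded.shape)       # (1, vocab_size): one multi-hot row per input string
print(encoded.numpy()[0])  # index 0 is 1 because at least one word is unknown

# Membership check, mirroring the "is it in the vocabulary?" test in the Colab.
print("still" in vectorizer.get_vocabulary())   # False -> it maps to [UNK]
```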
Uh we can just double check 1354 00:51:35,920 --> 00:51:40,000 that by asking Python is it is it 1355 00:51:38,159 --> 00:51:41,598 vocabulary? Nope, it's not. Okay. Now, 1356 00:51:40,000 --> 00:51:42,960 in the spirit of making small changes to 1357 00:51:41,599 --> 00:51:45,680 the code to understand what's going on, 1358 00:51:42,960 --> 00:51:46,880 which is a very useful tip for folks who 1359 00:51:45,679 --> 00:51:48,879 don't have a ton of programming 1360 00:51:46,880 --> 00:51:52,960 knowledge. Let's say that you send the 1361 00:51:48,880 --> 00:51:54,480 phrase Sloan Hodddle and DM, DMD. Uh I 1362 00:51:52,960 --> 00:51:55,760 think you will agree with me that none 1363 00:51:54,480 --> 00:51:59,358 of these words is in the training 1364 00:51:55,760 --> 00:52:02,000 corpus, right? So what will this what is 1365 00:51:59,358 --> 00:52:06,199 the multihot encoded vector for this 1366 00:52:02,000 --> 00:52:06,199 phrase sloan hodddle BMD 1367 00:52:07,440 --> 00:52:10,440 three 1368 00:52:11,440 --> 00:52:14,800 it's not count encoding it's multihod 1369 00:52:13,119 --> 00:52:17,358 encoding 1370 00:52:14,800 --> 00:52:19,039 right it's going to be 1 0 0 so you can 1371 00:52:17,358 --> 00:52:21,598 see here or in this case remember the 1372 00:52:19,039 --> 00:52:23,599 vocabulary is 17 1373 00:52:21,599 --> 00:52:27,440 right so each of these words is going to 1374 00:52:23,599 --> 00:52:29,200 be a one followed by 16 zeros 1375 00:52:27,440 --> 00:52:30,880 And then it's going to multih hot encode 1376 00:52:29,199 --> 00:52:34,318 them which means the three ones in the 1377 00:52:30,880 --> 00:52:37,039 column just become a one. So so you 1378 00:52:34,318 --> 00:52:39,599 still have only this one. Okay. All 1379 00:52:37,039 --> 00:52:41,358 right. Good. So now let's see that's now 1380 00:52:39,599 --> 00:52:45,359 let's actually get to the the the data 1381 00:52:41,358 --> 00:52:47,598 set. We have this 90,000 songs. Uh and 1382 00:52:45,358 --> 00:52:49,440 it's in this little thing here. Uh we 1383 00:52:47,599 --> 00:52:50,720 have grabbed the data and cleaned it up. 1384 00:52:49,440 --> 00:52:53,280 Cleaned it up meaning like formatting 1385 00:52:50,719 --> 00:52:55,039 wise not content wise. uh and then we 1386 00:52:53,280 --> 00:52:56,960 stuck it in this uh data frame and it's 1387 00:52:55,039 --> 00:52:58,480 we already have divided into train, test 1388 00:52:56,960 --> 00:53:00,720 and validation for your benefit. So you 1389 00:52:58,480 --> 00:53:03,599 don't have to worry about it. So turns 1390 00:53:00,719 --> 00:53:05,759 out we have 40 almost 49,000 songs in 1391 00:53:03,599 --> 00:53:08,800 the training set, 16,000 songs in the 1392 00:53:05,760 --> 00:53:10,960 validation set and 22 roughly 22,000 in 1393 00:53:08,800 --> 00:53:13,119 the test set. Okay, lot of songs. It's a 1394 00:53:10,960 --> 00:53:15,838 lot. It's a big data set. Um so let's 1395 00:53:13,119 --> 00:53:18,079 just look at the first few. 1396 00:53:15,838 --> 00:53:20,558 So oh girl, I can't get ready. We met on 1397 00:53:18,079 --> 00:53:22,000 rainy evening. Paralysis through 1398 00:53:20,559 --> 00:53:23,599 analysis. 1399 00:53:22,000 --> 00:53:27,599 Okay, that I can relate to as a data 1400 00:53:23,599 --> 00:53:29,280 science person. But anyway, u but uh by 1401 00:53:27,599 --> 00:53:31,440 the way this uh these things are very 1402 00:53:29,280 --> 00:53:33,440 useful for exploration of any uh data 1403 00:53:31,440 --> 00:53:36,720 frames that you might have. 
Collab is a 1404 00:53:33,440 --> 00:53:38,318 collab feature just check it out. Um so 1405 00:53:36,719 --> 00:53:40,159 anyway, that's the first few the first 1406 00:53:38,318 --> 00:53:43,119 few rows. Let's look at the last few 1407 00:53:40,159 --> 00:53:46,118 rows. 1408 00:53:43,119 --> 00:53:46,119 Okay, 1409 00:53:48,800 --> 00:53:56,280 you never listen to me as pop. Beamer 1410 00:53:51,440 --> 00:53:56,280 Benz is hip-hop. Yeah, of course. 1411 00:53:57,599 --> 00:54:01,440 So, okay. Uh, now to go back to the 1412 00:53:59,679 --> 00:54:02,639 question of, okay, um, what could be a 1413 00:54:01,440 --> 00:54:04,559 good baseline model? We need to 1414 00:54:02,639 --> 00:54:07,118 understand the proportion of these three 1415 00:54:04,559 --> 00:54:10,559 classes of songs. So, we'll do a quick 1416 00:54:07,119 --> 00:54:12,480 check. Turns out rock is 55%. So, if you 1417 00:54:10,559 --> 00:54:13,599 had to just guess something just 1418 00:54:12,480 --> 00:54:15,920 naively, you would just guess everything 1419 00:54:13,599 --> 00:54:18,400 to be rock and you'd be right 55% of the 1420 00:54:15,920 --> 00:54:20,159 time. Uh so now uh by the way the the 1421 00:54:18,400 --> 00:54:21,680 the target variable which tells you 1422 00:54:20,159 --> 00:54:24,639 whether which of these three genres it 1423 00:54:21,679 --> 00:54:26,318 is uh is is is a is actually a dummy 1424 00:54:24,639 --> 00:54:29,598 variable. So we need to one hot encode 1425 00:54:26,318 --> 00:54:32,000 that right. Um so we'll just turn that 1426 00:54:29,599 --> 00:54:34,559 this way using the pandas get dummies 1427 00:54:32,000 --> 00:54:35,920 function. And when we do that uh this is 1428 00:54:34,559 --> 00:54:37,200 y train which contains the dependent 1429 00:54:35,920 --> 00:54:40,800 variable. And you can see that is one 1430 00:54:37,199 --> 00:54:42,719 hot encoded now. Uh 0 1 0 0 1 0 0 1 and 1431 00:54:40,800 --> 00:54:44,960 so on and so forth. That's it. So I 1432 00:54:42,719 --> 00:54:46,799 think the first I forget it rock, 1433 00:54:44,960 --> 00:54:48,400 hip-hop, rock, pop or whatever. It's in 1434 00:54:46,800 --> 00:54:50,800 some order. We'll we'll get to that 1435 00:54:48,400 --> 00:54:52,559 later. So it's one hot encoded as well. 1436 00:54:50,800 --> 00:54:54,240 So that is as far as the data 1437 00:54:52,559 --> 00:54:55,680 downloading and setup is concerned. Any 1438 00:54:54,239 --> 00:54:57,439 questions? 1439 00:54:55,679 --> 00:54:58,960 >> Yeah. 1440 00:54:57,440 --> 00:55:01,440 >> Uh this kind of goes back to the 1441 00:54:58,960 --> 00:55:04,000 transfer learning concept. But do you 1442 00:55:01,440 --> 00:55:06,079 always want to build your corpus based 1443 00:55:04,000 --> 00:55:08,000 off of the vocabulary of your training 1444 00:55:06,079 --> 00:55:10,559 data or could you have like a 1445 00:55:08,000 --> 00:55:13,679 pre-ompiled like somebody's already made 1446 00:55:10,559 --> 00:55:15,280 like a list of the 50,000 words? 1447 00:55:13,679 --> 00:55:16,558 >> That's a really good question. 
Uh 1448 00:55:15,280 --> 00:55:20,240 unfortunately I'm going to punt on it 1449 00:55:16,559 --> 00:55:22,240 for the moment because um with modern 1450 00:55:20,239 --> 00:55:25,039 large language models a number of these 1451 00:55:22,239 --> 00:55:27,039 NLP tasks for which you had to sort of 1452 00:55:25,039 --> 00:55:29,759 roll your own and build your own thing 1453 00:55:27,039 --> 00:55:31,838 can now be very easily done using large 1454 00:55:29,760 --> 00:55:33,520 language models without even any further 1455 00:55:31,838 --> 00:55:34,639 training. 1456 00:55:33,519 --> 00:55:35,759 Case you pay for it is that you have to 1457 00:55:34,639 --> 00:55:37,759 use a large language model which means 1458 00:55:35,760 --> 00:55:38,800 you have to pay somebody an API call and 1459 00:55:37,760 --> 00:55:41,760 things like that and there are other 1460 00:55:38,800 --> 00:55:43,920 issues with it. uh but 1461 00:55:41,760 --> 00:55:46,319 we'll talk a lot about transfer learning 1462 00:55:43,920 --> 00:55:48,559 for text when we come to a little later 1463 00:55:46,318 --> 00:55:52,279 in the NLP sequence. So if I forget 1464 00:55:48,559 --> 00:55:52,280 please bring it up again. 1465 00:55:53,358 --> 00:55:58,159 >> Yeah. 1466 00:55:54,880 --> 00:56:00,880 >> Um quick clarification on the encode 1467 00:55:58,159 --> 00:56:03,440 factor. If I post it as floats not ins. 1468 00:56:00,880 --> 00:56:05,599 If it gets incredibly long wouldn't that 1469 00:56:03,440 --> 00:56:06,559 eat into compute time? Is there a reason 1470 00:56:05,599 --> 00:56:09,119 why it's floats? 1471 00:56:06,559 --> 00:56:11,359 >> Yeah. So uh question is that when when I 1472 00:56:09,119 --> 00:56:13,200 showed you that tensor the it is 1473 00:56:11,358 --> 00:56:14,639 actually is written as a continuous 1474 00:56:13,199 --> 00:56:16,399 number right a float floating point 1475 00:56:14,639 --> 00:56:18,159 number but we know these are one zeros 1476 00:56:16,400 --> 00:56:20,240 and ones so why can't we why do we have 1477 00:56:18,159 --> 00:56:21,519 to waste compute capacity by telling the 1478 00:56:20,239 --> 00:56:23,118 computer that these are all big 1479 00:56:21,519 --> 00:56:25,199 continuous numbers when it's just a zero 1480 00:56:23,119 --> 00:56:26,559 one there are ways to optimize that but 1481 00:56:25,199 --> 00:56:28,960 these problems are so small we just 1482 00:56:26,559 --> 00:56:30,319 don't worry about it but when we come to 1483 00:56:28,960 --> 00:56:34,079 something called parameter efficient 1484 00:56:30,318 --> 00:56:35,838 fine-tuning lecture maybe 10ish uh we 1485 00:56:34,079 --> 00:56:38,318 actually exploit that particular fact to 1486 00:56:35,838 --> 00:56:38,318 make things faster 1487 00:56:38,480 --> 00:56:43,519 Okay, so that's what we have. Uh, so 1488 00:56:41,199 --> 00:56:46,000 we'll we'll do the bag of birds model. 1489 00:56:43,519 --> 00:56:47,119 Um, by the way, there's a whole bunch of 1490 00:56:46,000 --> 00:56:49,199 stuff here. It just repeats what I've 1491 00:56:47,119 --> 00:56:50,880 been telling you in the lecture. So feel 1492 00:56:49,199 --> 00:56:54,000 free to read it again, but we can ignore 1493 00:56:50,880 --> 00:56:55,920 it for the moment. And now there's a new 1494 00:56:54,000 --> 00:56:58,159 thing we are doing here. 
So we are 1495 00:56:55,920 --> 00:57:00,159 basically saying, look, instead of 1496 00:56:58,159 --> 00:57:03,519 taking every word you see in these 1497 00:57:00,159 --> 00:57:05,358 49,000 uh songs in the training corpus, 1498 00:57:03,519 --> 00:57:09,119 uh, it's going to be too many words. 1499 00:57:05,358 --> 00:57:11,679 just pick the 5,000 most frequent words 1500 00:57:09,119 --> 00:57:15,039 and that's what this max tokens stands 1501 00:57:11,679 --> 00:57:18,719 for. Okay. And so we tell it uh all 1502 00:57:15,039 --> 00:57:20,798 right do this thing max tokens 5,000 1503 00:57:18,719 --> 00:57:22,318 sorry not 50,000 5,000 and still do 1504 00:57:20,798 --> 00:57:24,318 multihart and we are not explicitly 1505 00:57:22,318 --> 00:57:25,599 saying the standardization and all that 1506 00:57:24,318 --> 00:57:29,119 stuff because the defaults are what 1507 00:57:25,599 --> 00:57:30,960 we're going with. Okay. Yeah. 1508 00:57:29,119 --> 00:57:32,798 This is for making it more efficient. 1509 00:57:30,960 --> 00:57:36,639 Like this is like don't waste your time 1510 00:57:32,798 --> 00:57:39,358 on these thousand sports. Use them more. 1511 00:57:36,639 --> 00:57:40,239 Use them. Just focus on that to make 1512 00:57:39,358 --> 00:57:42,318 more efficient. 1513 00:57:40,239 --> 00:57:44,000 >> Make more efficient. But there is a 1514 00:57:42,318 --> 00:57:46,400 related and important point which is 1515 00:57:44,000 --> 00:57:49,599 that fundamentally the number of tokens 1516 00:57:46,400 --> 00:57:51,760 you allow this layer to have dictates 1517 00:57:49,599 --> 00:57:53,680 the size of your vocabulary and the size 1518 00:57:51,760 --> 00:57:56,079 of your vocabulary dictates the size of 1519 00:57:53,679 --> 00:57:57,358 the vector that you feed in. So shorter 1520 00:57:56,079 --> 00:57:59,039 vectors are better than longer vectors. 1521 00:57:57,358 --> 00:58:00,639 That's the efficiency point. The other 1522 00:57:59,039 --> 00:58:02,719 point is that the longer the input 1523 00:58:00,639 --> 00:58:04,400 vector, the more the number of 1524 00:58:02,719 --> 00:58:06,558 parameters the network has to learn 1525 00:58:04,400 --> 00:58:08,480 because the first layer itself is the 1526 00:58:06,559 --> 00:58:10,000 size of the input times roughly times 1527 00:58:08,480 --> 00:58:11,199 the size of the hidden layer. So this 1528 00:58:10,000 --> 00:58:13,039 thing becomes 10 times as long. You have 1529 00:58:11,199 --> 00:58:15,439 10 times as many parameters to learn and 1530 00:58:13,039 --> 00:58:17,199 given a finite amount of data, right? 1531 00:58:15,440 --> 00:58:18,400 The more parameters you have, the worse 1532 00:58:17,199 --> 00:58:19,679 it's going to do when you actually start 1533 00:58:18,400 --> 00:58:21,200 using it in the real world. It's going 1534 00:58:19,679 --> 00:58:24,000 to overfitit heavily. That's why you 1535 00:58:21,199 --> 00:58:25,679 need to be very careful. 1536 00:58:24,000 --> 00:58:27,519 Okay. 1537 00:58:25,679 --> 00:58:29,440 Yeah. 1538 00:58:27,519 --> 00:58:31,358 So, um, you downloaded the data set, but 1539 00:58:29,440 --> 00:58:33,760 are you still using the vocabulary the 1540 00:58:31,358 --> 00:58:35,598 17 words or did you 1541 00:58:33,760 --> 00:58:36,720 >> No, no, I'm that was just for fun. I'm 1542 00:58:35,599 --> 00:58:38,960 going to actually build a vocabulary 1543 00:58:36,719 --> 00:58:41,838 now. It's coming. Yeah, good question. 1544 00:58:38,960 --> 00:58:43,599 Yeah. So, all right, let's do that. 
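In code, the only change being described is the max_tokens argument. Here is a sketch under the assumption that the training songs sit in a pandas DataFrame with a lyrics column; the DataFrame, its contents, and the variable names are placeholders, not necessarily what the Colab uses.

```python
import pandas as pd
from tensorflow.keras.layers import TextVectorization

# Stand-in for the real training DataFrame of ~49,000 songs (names assumed).
train_df = pd.DataFrame({
    "lyrics": ["i love you and you love me",
               "rock and roll all night",
               "beats and rhymes on the block"],
    "genre": ["pop", "rock", "hip hop"],
})

max_tokens = 5000   # keep only the 5,000 most frequent words; the rest map to [UNK]
vectorize_layer = TextVectorization(max_tokens=max_tokens, output_mode="multi_hot")

# adapt() scans the training lyrics and builds the (capped) vocabulary.
vectorize_layer.adapt(train_df["lyrics"].values)

# Every song, long or short, becomes one fixed-length multi-hot vector.
X_train = vectorize_layer(train_df["lyrics"].values)
print(X_train.shape)   # one row per song, one column per vocabulary slot
```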
Um, 1545 00:58:41,838 --> 00:58:46,000 so I first, you know, I defined this 1546 00:58:43,599 --> 00:58:47,599 layer. Uh, okay. I just defined it. All 1547 00:58:46,000 --> 00:58:49,760 right. Now we actually build the 1548 00:58:47,599 --> 00:58:53,519 vocabulary by essentially telling it to 1549 00:58:49,760 --> 00:58:56,640 adapt the layer using essentially the 1550 00:58:53,519 --> 00:58:58,719 full all 15 basically 49,000 songs in 1551 00:58:56,639 --> 00:59:01,679 the training data set right that's a 1552 00:58:58,719 --> 00:59:02,798 long list of songs as far as kas is 1553 00:59:01,679 --> 00:59:04,879 concerned you're just looking for a list 1554 00:59:02,798 --> 00:59:06,159 of strings so you just give it the list 1555 00:59:04,880 --> 00:59:09,200 of strings instead of four we're giving 1556 00:59:06,159 --> 00:59:11,358 it 49,000 the same uh philosophy applies 1557 00:59:09,199 --> 00:59:12,879 so we run it 1558 00:59:11,358 --> 00:59:15,039 it's obviously going to take a few 1559 00:59:12,880 --> 00:59:17,280 seconds to do that because it's 49,000 1560 00:59:15,039 --> 00:59:19,039 songs 1561 00:59:17,280 --> 00:59:21,519 five seconds. Uh, all right. Let's look 1562 00:59:19,039 --> 00:59:23,759 at the most common 20, 1563 00:59:21,519 --> 00:59:26,318 right? We get the vocabulary from our 1564 00:59:23,760 --> 00:59:27,839 layer. See, once you adapt the layer and 1565 00:59:26,318 --> 00:59:29,358 has built a vocabulary, the layer is 1566 00:59:27,838 --> 00:59:31,279 sort of been populated with all this 1567 00:59:29,358 --> 00:59:34,719 information. So, you can query it. So, 1568 00:59:31,280 --> 00:59:37,040 you can get the vocab top 20 words, the 1569 00:59:34,719 --> 00:59:39,039 most frequent word, no surprise, u, I, 1570 00:59:37,039 --> 00:59:41,039 blah, blah, blah. Uh, let's look at the 1571 00:59:39,039 --> 00:59:43,599 last few. 1572 00:59:41,039 --> 00:59:46,599 Dagger cheddar 1573 00:59:43,599 --> 00:59:46,599 verified 1574 00:59:46,798 --> 00:59:51,199 moving on 1575 00:59:48,880 --> 00:59:52,960 right and then we so once we have done 1576 00:59:51,199 --> 00:59:55,439 that now we actually can vectorize all 1577 00:59:52,960 --> 00:59:57,039 the data sets we have using this and by 1578 00:59:55,440 --> 00:59:59,119 vectorize you mean take every string and 1579 00:59:57,039 --> 01:00:00,400 create the multihot encoded vector from 1580 00:59:59,119 --> 01:00:02,480 it uh yeah 1581 01:00:00,400 --> 01:00:05,358 >> are we doing stie because we're keeping 1582 01:00:02,480 --> 01:00:07,119 stuff like d a etc. 
Yeah, we are not 1583 01:00:05,358 --> 01:00:09,598 strictly doing STI or to put it 1584 01:00:07,119 --> 01:00:12,000 differently the S stands typically S has 1585 01:00:09,599 --> 01:00:14,960 lower case uppercase strip punctuation 1586 01:00:12,000 --> 01:00:16,798 stemming stop word removal here the 1587 01:00:14,960 --> 01:00:18,639 default in KAS happens to not do 1588 01:00:16,798 --> 01:00:20,000 stemming not do stop word removal so 1589 01:00:18,639 --> 01:00:22,078 we're just going with the default thanks 1590 01:00:20,000 --> 01:00:23,519 for the clarification 1591 01:00:22,079 --> 01:00:25,039 and in fact in practice what I find 1592 01:00:23,519 --> 01:00:27,039 these days is that don't even bother to 1593 01:00:25,039 --> 01:00:28,239 stem don't even bother to remove the 1594 01:00:27,039 --> 01:00:31,119 stop words it's going to work well 1595 01:00:28,239 --> 01:00:34,399 enough 1596 01:00:31,119 --> 01:00:36,000 okay so all right uh okay so now Each 1597 01:00:34,400 --> 01:00:38,639 phrase is a vector. How long is this 1598 01:00:36,000 --> 01:00:41,039 vector? Each song is now a vector. How 1599 01:00:38,639 --> 01:00:43,279 long is that vector? 1600 01:00:41,039 --> 01:00:46,920 5,000. Correct. Because that is a size 1601 01:00:43,280 --> 01:00:46,920 vocabulary. Correct. 1602 01:00:47,199 --> 01:00:51,679 It's max tokens long, which is 5,000. So 1603 01:00:49,599 --> 01:00:52,960 if you actually look at X Oh, wait, 1604 01:00:51,679 --> 01:00:56,358 wait, wait, wait, wait. I haven't done 1605 01:00:52,960 --> 01:00:56,358 this thing yet. 1606 01:00:57,838 --> 01:01:02,400 It's going through 49,000. It's going 1607 01:00:59,599 --> 01:01:04,400 through another what? 23,000. Fine. So 1608 01:01:02,400 --> 01:01:06,798 let's run it. 1609 01:01:04,400 --> 01:01:09,200 Okay, now we can see X train which is 1610 01:01:06,798 --> 01:01:12,960 all the training data you have has is a 1611 01:01:09,199 --> 01:01:18,039 tensor is a table with 48 991 rows and 1612 01:01:12,960 --> 01:01:18,039 each row is a 5,000 long vector. 1613 01:01:18,079 --> 01:01:23,280 All right, good. Now we will try the 1614 01:01:20,559 --> 01:01:28,240 simple neural network that we wrote up 1615 01:01:23,280 --> 01:01:31,359 in class. So and now at this point this 1616 01:01:28,239 --> 01:01:34,078 code should be sort of second nature, 1617 01:01:31,358 --> 01:01:36,159 right? Isn't that cool? It's so easy to 1618 01:01:34,079 --> 01:01:39,280 write the write the thing the power of 1619 01:01:36,159 --> 01:01:41,279 abstraction. So uh we take kasin input 1620 01:01:39,280 --> 01:01:42,720 as usual input layer we tell it what is 1621 01:01:41,280 --> 01:01:44,480 the size of each thing that's coming in. 1622 01:01:42,719 --> 01:01:46,480 Well the size of each thing is a 50 max 1623 01:01:44,480 --> 01:01:48,880 tokens long vector. So we tell it the 1624 01:01:46,480 --> 01:01:51,119 shape is max tokens and then we run it 1625 01:01:48,880 --> 01:01:54,160 through a dense layer with eight relus. 1626 01:01:51,119 --> 01:01:56,079 Okay I'm hurrying. 1627 01:01:54,159 --> 01:01:58,000 So we get the outputs then we string the 1628 01:01:56,079 --> 01:01:59,680 inputs and the outputs into a model and 1629 01:01:58,000 --> 01:02:02,239 then we summarize the model. That's it. 
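A minimal sketch of the network just described, using the Keras functional API. The layer sizes follow the lecture (eight ReLUs in the hidden layer, three softmax outputs), while the variable names and the max_tokens value are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

max_tokens = 5000  # length of each multi-hot input vector (assumed)

inputs = keras.Input(shape=(max_tokens,))                  # one 5,000-long vector per song
hidden = layers.Dense(8, activation="relu")(inputs)        # the single hidden layer of 8 ReLUs
outputs = layers.Dense(3, activation="softmax")(hidden)    # hip-hop / rock / pop probabilities

model = keras.Model(inputs=inputs, outputs=outputs)
model.summary()
# Hidden layer: 5,000 weights x 8 units + 8 biases = 40,008 parameters.
# Output layer: 8 x 3 weights + 3 biases = 27 more.
```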
1630 01:01:59,679 --> 01:02:04,639 So we go here and this has about 40,000 1631 01:02:02,239 --> 01:02:08,239 parameters, and you can see here, right, 1632 01:02:04,639 --> 01:02:10,239 when you go from the input, the 5,000 * 8 1633 01:02:08,239 --> 01:02:11,838 that gives you 40,000, plus the eight 1634 01:02:10,239 --> 01:02:15,039 neurons have a bias coming in, that's 1635 01:02:11,838 --> 01:02:17,119 another eight, so you get 40,008. Okay, 1636 01:02:15,039 --> 01:02:20,159 and we compile it as usual, we use Adam 1637 01:02:17,119 --> 01:02:23,760 as usual, and because now the output 1638 01:02:20,159 --> 01:02:27,039 y variable, the y train variable, 1639 01:02:23,760 --> 01:02:29,599 is itself actually one-hot encoded, 1640 01:02:27,039 --> 01:02:31,440 right, 0 1 0 or 0 0 1, depending on pop, rock 1641 01:02:29,599 --> 01:02:33,519 and so on and so forth, we don't use 1642 01:02:31,440 --> 01:02:35,119 sparse categorical cross entropy. We 1643 01:02:33,519 --> 01:02:38,000 just use plain old categorical cross 1644 01:02:35,119 --> 01:02:40,318 entropy here. Okay. And this was 1645 01:02:38,000 --> 01:02:42,400 explained in lecture last week. So you 1646 01:02:40,318 --> 01:02:44,318 can revisit it if uh if it's not 1647 01:02:42,400 --> 01:02:46,400 familiar. We again report accuracy, 1648 01:02:44,318 --> 01:02:48,558 right? So let's compile it. And we've 1649 01:02:46,400 --> 01:02:50,798 got a model. So we just run it for 10 1650 01:02:48,559 --> 01:02:52,640 epochs with a batch size of 32. And 1651 01:02:50,798 --> 01:02:53,838 because we have validation data already 1652 01:02:52,639 --> 01:02:55,679 supplied to us, we don't have to tell 1653 01:02:53,838 --> 01:02:58,159 Keras take the training data and keep 1654 01:02:55,679 --> 01:02:59,519 20% of it aside for validation. We can 1655 01:02:58,159 --> 01:03:04,000 literally tell it what validation set to 1656 01:02:59,519 --> 01:03:06,798 use. That's what we're doing here. Okay. 1657 01:03:04,000 --> 01:03:09,119 All right. So, it's running. 1658 01:03:06,798 --> 01:03:12,599 Um, 1659 01:03:09,119 --> 01:03:12,599 it's pretty fast. 1660 01:03:16,318 --> 01:03:20,480 Any questions so far? 1661 01:03:18,159 --> 01:03:23,519 >> Yes. 1662 01:03:20,480 --> 01:03:25,358 >> The microphone. 1663 01:03:23,519 --> 01:03:27,679 >> How do we decide the max tokens? Like, 1664 01:03:25,358 --> 01:03:29,038 we define the number as 5,000 here, but we 1665 01:03:27,679 --> 01:03:29,919 do not know how many words would be 1666 01:03:29,039 --> 01:03:31,200 there in the entire text. 1667 01:03:29,920 --> 01:03:32,720 >> Yeah. So it's a good question. How do 1668 01:03:31,199 --> 01:03:34,399 you decide on this, the maximum 1669 01:03:32,719 --> 01:03:36,480 vocabulary? What you typically do in 1670 01:03:34,400 --> 01:03:38,240 practice is that you actually do it 1671 01:03:36,480 --> 01:03:40,079 without the max tokens and then you see 1672 01:03:38,239 --> 01:03:41,838 how long the vocabulary is, and then you 1673 01:03:40,079 --> 01:03:43,839 actually get statistics on how 1674 01:03:41,838 --> 01:03:45,279 frequently the very infrequent words 1675 01:03:43,838 --> 01:03:47,279 actually show up. And then you'll 1676 01:03:45,280 --> 01:03:49,599 typically see like a dramatic fall-off 1677 01:03:47,280 --> 01:03:54,119 at some point, and you pick that fall-off 1678 01:03:49,599 --> 01:03:54,119 point and then set that to be the max. 1679 01:03:54,960 --> 01:04:01,599 Uh all right. So perfect. Let's test it.
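And, continuing that sketch, the compile-and-train step being described might look roughly like this, assuming the multi-hot X arrays and one-hot y arrays from the earlier steps already exist (all variable names are placeholders).

```python
# Adam optimizer, plain (not sparse) categorical cross-entropy because y_train
# is already one-hot encoded, and accuracy as the reported metric.
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# The validation set is supplied explicitly instead of carving 20% off the
# training data.
history = model.fit(
    X_train, y_train,
    epochs=10,
    batch_size=32,
    validation_data=(X_val, y_val),
)

test_loss, test_acc = model.evaluate(X_test, y_test)
print(test_acc)  # compare against the ~55% "always predict rock" baseline
```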
Accuracy is pretty good: 87% on the training set and 73% on the validation set. We'll do it on the test set. All right, 72%. We saw earlier that the largest class of the three is rock, with around 50%, so the naive model is going to get 50% accuracy, and this little neural network gets you about 72%, which is pretty nice.

Okay, so now let's kick it up a notch and make it slightly more capable. The key thing here, as has been observed in class already, is that when you go with a bag-of-words model we lose all notion of order. Word order clearly matters, and we're ignoring it. So what do we do to get around that? There's actually a really interesting sentence here. Let's say this is a movie review: "Kate Winslet's performance as a detective trying to solve a terrible crime in a small town is anything but disappointing." Tricky, right? Because if you look at the words separately, "terrible" and "disappointing" look like negative sentiment. But if you know that "terrible" refers to the crime, not to the movie, and that "anything but" changes the meaning of the word "disappointing", you will see it's obviously a positive review. So clearly the words around a word provide valuable clues as to how to interpret that word. And so the question is: how can we make our little model a bit more capable of recognizing the context around every word? The way we do it is something called bigrams.
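Before moving to bigrams, here's a quick sketch of the evaluation and the majority-class comparison just mentioned (assuming `x_test` and `y_test` hold the vectorized test songs and one-hot labels):

```python
import numpy as np

# Trained model on the held-out test set (roughly 0.72 in the lecture run).
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f"test accuracy: {test_acc:.2f}")

# Naive baseline: always predict the most common genre (~50% here).
true_classes = np.argmax(y_test, axis=1)
majority_class = np.bincount(true_classes).argmax()
print(f"baseline accuracy: {np.mean(true_classes == majority_class):.2f}")
```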
For bigrams, what we basically do is this: instead of just taking each word, we take each word and we further take every pair of adjacent words, and those become our tokens. Because we take two adjacent words, they're called bigrams; you can take three adjacent words, trigrams; you get the idea, n-grams. So that's the idea of bigrams. For example, if you had "the cat sat on the mat", you will have "the cat", "cat sat", and so on; you get the idea. So let's do a little example, and Keras makes it very easy: you literally tell it ngrams equals 2. And from this you should immediately know that ngrams equals 1 is the default; that's why we didn't have to specify it before. So you run it, "the cat sat on the mat" is your training corpus, you get the vocabulary, and you can see it has created all these nice bigrams for you. That's it.

All right. Now what we do is go back to the songs, and we actually tell Keras to not just take each word but take all the bigrams as well, and hopefully it'll do a better job of figuring out what the genre is. And now, when you say, okay, take the top 5,000 words, that's great for single words, unigrams as they are called. But when you have bigrams, you have 5,000 possibilities for the first word and maybe 5,000 for the second word; that's a lot of possibilities, 25 million. Most of those 25 million possibilities are not going to show up in the data, so you don't need to make the vocabulary that much larger, but you should make it a bit more than 5,000. So here we go with, say, 20,000. Otherwise it's the same, still multi-hot. So let's run it.
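A minimal version of that toy bigram example (a sketch; `output_mode="multi_hot"` is assumed to match the rest of the notebook):

```python
from keras.layers import TextVectorization

# ngrams=2 keeps single words and adds every pair of adjacent words;
# ngrams=1 (words only) is the default, which is why it wasn't specified earlier.
vectorizer = TextVectorization(ngrams=2, output_mode="multi_hot")
vectorizer.adapt(["the cat sat on the mat"])
print(vectorizer.get_vocabulary())
# expect entries like 'the', 'cat', 'the cat', 'cat sat', 'sat on', ...
```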
Now that the layer has been set up with all the right settings, we'll ask it to create the vocabulary, again by doing exactly what we did before. This takes a few seconds; with bigrams, trigrams, all of them, it gets much more compute-intensive, which is why you're seeing this pause. All right, let's look at the first 10 tokens. The first 10 are all just single words, and that's not surprising, because single words are going to be the most frequent. And then the last few: "your mom", "your god", "you short", "you hell". All right, let's index all the data we have, the training, validation, and test sets, using this vocabulary.

Perfect. Now we come to our second model, where we say the incoming shape is now 20,000 long, because we increased max tokens from 5,000 to 20,000. So each thing is a 20,000-long vector; otherwise it's the same. And now we will use this thing called dropout for the first time, which is a regularization technique I have referred to earlier but never really described; I will describe it today if we have time, but first I'll run through the whole demo. For now, you can think of dropout as just another layer you can insert, and it's essentially a great way to prevent overfitting, so I routinely use it, and I'll talk more about it. You have this dropout layer in the middle: it receives the input from the dense layer and sends it to the output layer. The output layer is unchanged; it's a three-way softmax. Same model as before. All right, we'll come back to dropout. We compile it the same way as before, and then I will just fit it for three epochs.
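Roughly, the bigram pipeline and the second model look like this (a sketch under assumptions: `train_texts` is an assumed name for the raw lyrics, and the 0.5 dropout rate is illustrative, not necessarily the notebook's value):

```python
max_tokens = 20000
text_vectorizer = TextVectorization(ngrams=2, max_tokens=max_tokens, output_mode="multi_hot")
text_vectorizer.adapt(train_texts)        # build the unigram + bigram vocabulary
x_train = text_vectorizer(train_texts)    # index train/val/test the same way

inputs = keras.Input(shape=(max_tokens,))
x = layers.Dense(8, activation="relu")(inputs)
x = layers.Dropout(0.5)(x)                # randomly zero activations during training
outputs = layers.Dense(3, activation="softmax")(x)
bigram_model = keras.Model(inputs, outputs)
bigram_model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
bigram_model.fit(x_train, y_train, epochs=3, batch_size=32, validation_data=(x_val, y_val))
```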
If you're interested, after class you can try it for more epochs and see if it does better. For now, in the interest of time, we'll just do it for three. I think 72% was what we had with the single-word, unigram version.

>> If you rerun this code with the same settings, do you ever expect the accuracy to change?

>> If you ran this code on your machine, you would expect it to be roughly the same, but there are some minute differences due to hardware and device drivers.

>> If you rerun it on your own machine twice, would you expect a change?

>> That's actually a very tricky question, because it depends on what else I have been doing in that notebook. If I start fresh and do nothing but that, typically I get the same numbers. But for some reason I don't get exactly the same result every time.

Okay, so we come to this. Let's evaluate our little model. Okay, 75%. So it went from 72 to 75; that's actually a meaningful jump just from using bigrams. And I ran it only for three epochs; if you run it for 10, maybe it's going to do even better. That is the beauty of this thing. Now let's actually do a little demo: we'll try to predict the genre from some lyrics. Okay, I'll try another one: "Bites the Dust". It says it's a rock song; I think that's correct. Okay, folks, your turn now. Somebody tell me your favorite song.

>> "Dancing Queen" by ABBA.

>> I love ABBA. That's awesome. All right: Dancing Queen lyrics. Hmm, "Verse one", "Intro" — I don't like that; let's go to something without all this metadata. All right, I'll just take the first page. Are we good?
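In code, that little demo is roughly the following (a sketch; the genre names and their order are assumptions, as is the `genres` list itself):

```python
import numpy as np

genres = ["hip-hop", "pop", "rock"]          # assumed label order
lyrics = "paste the song's lyrics here"      # raw text copied from a lyrics page
vec = text_vectorizer([lyrics])              # same bigram multi-hot encoding used in training
probs = bigram_model.predict(vec)[0]
print(genres[int(np.argmax(probs))], probs)
```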
All right, run the model. Let's predict: pop, just about. Yay. So that's basically the model, but we have five minutes and I want to get back to dropout. You can play around and put your own lyrics in. Typically, in the last two years that I've been doing this particular lecture, I've noticed that the songs students pick are always rock songs for some reason. First time I'm getting a pop song, and from a group that I actually like, so thank you.

All right, let's go back to dropout. The idea is that the input comes in, it goes through a hidden layer, and so on and so forth. Dropout is a layer, and you put this layer in just like you use any other layer. What dropout does is take all the numbers coming into it from the previous layer and randomly decide to replace each number with a zero. That's it: it drops that number and replaces it with a zero. But it does it randomly. It basically tosses a coin: if the coin comes up heads, zero it out; if it comes up tails, let it through, pass it through. And the reason this is very effective is that you can imagine all the neurons in a particular layer, when they overfit to a particular data set, the overfitting happens because the neurons essentially collude with each other; they collude to overfit and predict things very accurately. So you want to break any sort of collusion between the neurons. I'm obviously using a sort of game-theoretic way of describing it, but the idea is that if there are any spurious correlations in your data, the neurons can pick them up by being correlated themselves.
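To see the coin-toss mechanism concretely, here is a minimal sketch (not notebook code):

```python
import numpy as np
from keras import layers

drop = layers.Dropout(rate=0.5)
x = np.ones((1, 8), dtype="float32")
print(drop(x, training=True))   # roughly half the entries become 0; Keras rescales the survivors (to 2.0 here)
print(drop(x, training=False))  # at inference time the layer is a pass-through
```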
And so the way you avoid the spurious correlation is by dropping neurons randomly. You just kill a neuron at random, which means that no neuron can depend on another neuron being available. I know it's a bit grim, but that's the basic idea of dropout. And apparently the story goes that the person on the team that invented it, Geoff Hinton, who won the Turing Award for this stuff, not for dropout, just for deep learning generally, said, and I don't know if it's true, that he got the idea when he went to a bank and realized that the people working in the branch he used to go to kept changing. They were never the same; people would be transferring in and out, and he wondered: why can't they just leave these people alone? Why does it keep changing? And then he got the insight that maybe a lot of fraud happens because a person working in the branch colludes with a customer, and by changing the staff constantly you break the risk of fraud happening. That apparently was the genesis of this idea. True or apocryphal? I have no idea, but it's sort of a fun story. Yes?

>> Instead of dropping at random, if we go with the way traditional models are built, concepts of multicollinearity and all of that, would that make it sharper compared to this?

>> The problem is that these networks are massive, right? For you to take each layer and look at its correlation with some other layer, and so on: first of all, investigating multicollinearity is itself a problem. The second thing is, okay, what do you do then? In linear regression you can do things like principal components analysis to get around it. Here everything is nonlinear; there is no easy way to solve the problem.
So instead we just solve the problem in one shot using dropout. All right. So I had some material on something called byte pair encoding, which I will cover when we get to LLMs; I stuck it at the end because I knew we probably wouldn't have enough time to cover it today anyway. It is a very clever tokenization scheme used by, for example, the GPT family, and it allows them to handle punctuation nicely, keep the case intact, and deal with words you just made up, things like that. Okay, we have one more minute; I'm happy to answer any questions you might have.

>> Initially, when we are picking the hidden layer, the number of neurons and so on: so far in all the materials this has been given to us, but initially how do you pick it? Is it more of a trial-and-error type of thing, or...?

>> It tends to be trial and error. That's in fact what I did when I created the Colabs. You can make it a bit more systematic by trying lots of different values, and there is a particular Python package called KerasTuner. Just Google KerasTuner; it comes with very nice Colabs, and if I have a chance maybe I'll record a screen walkthrough of using it. It's a very efficient way to do these things, and it comes under the broad category of something called hyperparameter optimization, where the number of neurons, the activation you use, the learning rate, all those things can be tried; you can try lots of variations, and KerasTuner is a great way to do it in the context of Keras. Other questions?

All right, I give you 30 seconds back. Thank you. See you tomorrow.
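A rough sketch of what such a KerasTuner search could look like for this model (assumptions: the package is installed as `keras-tuner`, and the search ranges and variable names are illustrative, not the lecture's Colab):

```python
import keras
from keras import layers
import keras_tuner as kt

def build_model(hp):
    # Hyperparameters to try: hidden-layer width and dropout rate.
    model = keras.Sequential([
        keras.Input(shape=(20000,)),
        layers.Dense(hp.Int("units", min_value=8, max_value=64, step=8), activation="relu"),
        layers.Dropout(hp.Float("dropout", min_value=0.2, max_value=0.5, step=0.1)),
        layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

tuner = kt.RandomSearch(build_model, objective="val_accuracy", max_trials=10)
tuner.search(x_train, y_train, epochs=3, validation_data=(x_val, y_val))
best_model = tuner.get_best_models(num_models=1)[0]
```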