All right. So, today's lecture: introduction to neural networks and deep learning. We'll start with a very quick intro to these topics, and then we'll switch and dive deep into neural networks.

So, the field of AI originated in 1956. Sadly, it didn't originate at MIT; it originated at Dartmouth, because all these people got together at Dartmouth. I guess it's got a nice quad or whatever. They got together and they defined the field. But fortunately for us, MIT was very well represented. We have Marvin Minsky, who founded the MIT AI Lab; John McCarthy, who invented Lisp and then later defected to the West Coast; and Claude Shannon, who invented information theory and was a professor at MIT. So MIT was well represented. These folks founded the field, and they were so bright, they thought that AI was going to be "substantially solved," quote unquote, by that fall. Now, obviously, it turned out a bit differently than what they expected.
It's been, whatever, 67 or 68 years since its founding, and in that time the field has gone through, essentially, in my opinion, three seminal breakthroughs: starting with the traditional approach, then machine learning, deep learning, and generative AI. Let's take a very quick look at each of these breakthroughs and what motivated them.

Let's start with the traditional approach to AI. And so, what is AI? Informally, AI is the ability to imbue computers with the ability to do things that typically only humans can do: cognitive tasks, thinking tasks, and things like that. The most commonsensical way to do that is to say, "Well, if I want the computer to do something complicated like play chess, I'm just going to sit down with a few chess grandmasters, show them a whole bunch of board positions, and ask them how they figure out how to respond, how to play the next move." I'm going to sit down, talk to all these people, and then I'm going to write down a whole bunch of rules.
If this is the board position, move this; if this is that board position, move that; and so on and so forth. Or I might sit down with a cardiologist and ask, "Okay, how do you actually interpret an ECG?" They will similarly give me a bunch of if-then rules. I will take all these rules, put them into the computer, and boom, I have a system that can do what a human can do. Right? Now, even though this approach is commonsensical and kind of makes sense, it had success in only a few areas. So the interesting question is: why was it not pervasively successful? It seems like a pretty good idea to me, right? And the people who came up with these things are smart people, not dumb people. They know what they're doing. So, why did it not work?

>> Because it's time-intensive, since you'd have to run through all the scenarios that could ever exist, and still some new scenarios can come up that you didn't cater for initially.

>> Right.
So, there are two aspects to what you said. The first aspect is that it's time-intensive. That, as it turns out, is not a big deal, because computers are getting faster and faster. The second thing is actually the key thing, which is that it doesn't generalize to new situations very well. The problem is that there are an infinite number of things you're going to see when you deploy these systems in the real world. By definition, what you're training on is a small sample of rules, so these rules are very brittle. But there's actually an even more interesting reason. And that reason is that we know more than we can tell. This is called Polanyi's paradox. The idea is that if I come to you and say, "Hey, here's a picture. Is it a dog or a cat?" you will tell me within, I believe they've measured it, something like 20 milliseconds whether it's a dog or a cat.
And 114 00:03:36,919 --> 00:03:40,039 then if I ask you to explain to me 115 00:03:38,560 --> 00:03:41,520 exactly how you figured that out, you'll 116 00:03:40,039 --> 00:03:43,639 come up with a bunch of sort of reasons, 117 00:03:41,520 --> 00:03:45,120 right? Alleged reasons. Oh, you know, if 118 00:03:43,639 --> 00:03:46,000 it has whiskers, I think it's a cat or 119 00:03:45,120 --> 00:03:47,800 whatever. 120 00:03:46,000 --> 00:03:49,080 But, the problem is that you actually, 121 00:03:47,800 --> 00:03:50,280 first of all, can't really articulate 122 00:03:49,080 --> 00:03:51,840 what's going on in your head, how you do 123 00:03:50,280 --> 00:03:54,000 these things. And number two, even if 124 00:03:51,840 --> 00:03:55,479 you articulate it, often times, your 125 00:03:54,000 --> 00:03:58,000 articulation has no correspondence with 126 00:03:55,479 --> 00:04:01,239 how your brain actually does it. 127 00:03:58,000 --> 00:04:03,360 So, you're incomplete and a liar. 128 00:04:01,240 --> 00:04:04,840 So, this is Polanyi's paradox. So, if 129 00:04:03,360 --> 00:04:06,840 you can't even 130 00:04:04,840 --> 00:04:08,120 tell me how you do something, how the 131 00:04:06,840 --> 00:04:10,080 heck am I supposed to take it and put it 132 00:04:08,120 --> 00:04:11,680 into a computer? Doesn't work. And 133 00:04:10,080 --> 00:04:13,480 second is the fact that we can't write 134 00:04:11,680 --> 00:04:15,760 down these rules for all possible 135 00:04:13,479 --> 00:04:17,279 situations. Edge cases, corner cases, 136 00:04:15,759 --> 00:04:18,759 etc. And the world is full of edge 137 00:04:17,279 --> 00:04:20,199 cases. 138 00:04:18,759 --> 00:04:21,560 So, for these reasons, this approach 139 00:04:20,199 --> 00:04:22,800 didn't work. 
And so a different approach was developed, and this approach basically said, "Hey, instead of explicitly telling the computer what to do, why don't we simply give it lots of examples of inputs and outputs? Chess positions and next moves, right? ECGs and diagnoses. Inputs and outputs. And then, why don't we just use some statistical techniques to learn a mapping, a function, that can go from the input to the output?" That was the idea. And this idea is machine learning. So, machine learning is basically just a fancy way of saying, "Learn from input-output examples using statistical techniques." Good. Now, there are numerous ways to create machine learning models, and if you've ever done linear regression, congratulations, you've been doing machine learning. And only one of those methods happens to be something called neural networks.
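To make the "learn a mapping from input-output examples" idea concrete, here is a minimal sketch using ordinary least squares. The data and feature names are invented for illustration; the point is only that the mapping is fitted from examples, not written down as rules.

```python
import numpy as np

# Inputs: (age, resting heart rate); output: an invented risk score.
X = np.array([[40.0, 60.0],
              [50.0, 70.0],
              [60.0, 80.0],
              [70.0, 90.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

# Add an intercept column and solve for the weights w minimizing ||Xw - y||.
X1 = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(X1, y, rcond=None)

# The learned function can now be applied to any input, seen or unseen.
def predict(x):
    return np.array([*x, 1.0]) @ w
```

Nobody told the computer a rule like "risk rises with age"; the weights encode whatever pattern the examples contain.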
There are many other methods, and in fact you've probably used those other methods if you've taken a course like The Analytics Edge or something similar. Okay. So, machine learning has had tremendous impact around the world, right? At this point it's widely accepted; it's a very, very successful technology. And in fact, whenever people are talking about AI, chances are they're actually talking about machine learning. It's just that AI sounds cooler. The only problem is, for machine learning to work really well, the input data has to be structured. And what I mean by that is data that can essentially be numericalized and stuck into the columns and rows of a spreadsheet. So, for example, let's say I want to put together a data set of patients, their symptoms and their characteristics, and then whether, in the year after they showed up at the doctor's office, they had a cardiac event or not.
I might create a data set like this, with age, smoking status (yes/no), exercise, and so on. Either these values are numerical, or if they're not numerical, they're categorical: smoking, yes or no, things like that. Which means that if you have categorical variables, you can numericalize them pretty easily. You folks have done some machine learning before, so you know that things like one-hot encoding can be done to make them all numerical. So the point is, you can render the data into the columns and rows of a spreadsheet pretty easily, right? That's what I mean by structured data. But the situation is very different if you have unstructured data. Say you have an image of a cute puppy. This is my puppy, by the way, from many years ago. Sadly, he's no more. But his name was Google. So, yeah, anyway, my DMD alums know Google well. So, this is Google, right?
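As an aside, the one-hot encoding just mentioned can be sketched in a few lines of plain Python; the categories here are invented.

```python
def one_hot(values):
    """Map each categorical value to a 0/1 indicator vector."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

# A "smoking status" column becomes purely numerical:
cats, rows = one_hot(["yes", "no", "yes"])
# cats == ["no", "yes"]; rows == [[0, 1], [1, 0], [0, 1]]
```

In practice a library routine (e.g. scikit-learn's `OneHotEncoder`) would do this, but the idea is exactly this small.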
If you want to take Google, this picture, and figure out how to numericalize it, the first thing you need to understand is how this picture is represented digitally, in the computer: basically, every picture like this is represented using three tables of numbers. We'll get to what these numbers mean later on, but the point I'm making is that each number represents the amount of light, on a scale of 0 to 255, in that location, in that pixel. That's all: the amount of light. This table is the amount of red light, this one the amount of green light, this one the amount of blue light. Okay? Now, you will agree with me that if you, for example, look at something like this, 251, you'd say, "Okay, at this location there is a lot of blue light, because it's 251 out of a possible 255." Maybe a lot of blue light somewhere here. There's a lot of blue here.
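The "three tables of numbers" representation just described can be sketched like this (the 2x2 image and its pixel values are invented for illustration):

```python
import numpy as np

# Three tables, one per color channel, each entry the amount of
# light at that pixel on a 0-255 scale.
red   = np.array([[  0,  10], [ 20,  30]], dtype=np.uint8)
green = np.array([[  5,  15], [ 25,  35]], dtype=np.uint8)
blue  = np.array([[251, 200], [180, 160]], dtype=np.uint8)

# Stacking the three tables gives the usual height x width x 3 layout.
image = np.stack([red, green, blue], axis=-1)
# image.shape == (2, 2, 3); image[0, 0, 2] == 251 (lots of blue light here)
```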
Whether that area is blue because of a piece of sky, some water, or a bunch of blue paint, it could be anything, it's going to say 251. So, the underlying reality, the underlying object that's being described, has nothing to do with the 251. Right? So, that's the whole problem: the raw form of the data has no intrinsic connection to the underlying thing. So, given that there's no connection between the number and what it's describing, how the heck can any algorithm do anything with it? It can't. So, what you have to do is something called feature engineering, or feature extraction, where you have to manually take all these things and essentially create a spreadsheet from them. So, basically, let's say that you have a bunch of birds, right?
And you're trying to build a bird classifier to figure out what bird species it is. You might actually have to take this picture and then measure the beak length, the wingspan, the primary color, and so on and so forth. So, you're basically structuring the unstructured data manually, right? And for this process of structuring unstructured data, we use the word representation: we take the raw data and we represent the data in a different form. The reason I'm focusing on the word representation is that it becomes really, really important a bit later on, when we get to deep learning. Okay? So, we have to represent the data in a different way for it to work. That's the basic idea. All right. So, what that means is that, historically, researchers would manually develop these representations.
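A hand-built feature extractor of the kind described might look like the sketch below. Everything here is invented for illustration: the measurements are passed in rather than computed from pixels, which is exactly the hard part a real researcher would spend years on.

```python
def extract_features(bird):
    """Turn one raw record into a fixed-length row of numbers."""
    colors = ["brown", "red", "blue"]          # assumed color vocabulary
    return [
        bird["beak_length_mm"],                # already numerical
        bird["wingspan_cm"],                   # already numerical
        colors.index(bird["primary_color"]),   # categorical -> number
    ]

row = extract_features({"beak_length_mm": 12.5,
                        "wingspan_cm": 25.0,
                        "primary_color": "blue"})
# row == [12.5, 25.0, 2]
```

Each photo becomes one structured row; stack the rows and you have the spreadsheet that classical machine learning needs.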
And once you develop them, once you have representations, you can just use traditional linear regression or logistic regression to get the job done. So, the whole name of the game is the representations. In fact, people doing PhDs in, for example, computer vision would spend something like four years developing amazing representations for solving one particular little problem. Say we have a bunch of CAT scans, and we need to figure out whether there's evidence in the CAT scan for a particular kind of stroke. They might sit and develop all kinds of representations, test them, and so on. And then they'll finally declare victory and say, "Yay, I'm done with my PhD. Here is this amazing representation, and you can build a classifier with it to predict a particular kind of stroke with high accuracy." Okay? So, that's where the world was.
Now, as you can imagine, developing representations, because it's so manual, is this massive human bottleneck, and this sharply limited the reach and applicability of machine learning. As you would expect. To address this problem, a different approach came about, and that's deep learning. So, deep learning sits inside machine learning. Okay? And deep learning can handle unstructured input data without upfront manual processing. Meaning, it will automatically learn the right representations from the raw input. Automatically is the key word. Automatically learn representations, which means that you can give it structured data, you can give it pictures, you can give it text, you can give it anything you want; it will just learn them. Okay?
It can automatically extract these representations, and since they're being automatically extracted, you can imagine a pipeline where the raw data comes in, you have a bunch of stuff in the middle that's learning these representations automatically, without your help, and then boom, you just attach a little linear regression or logistic regression at the end, and the problem is solved. That, in a nutshell, is deep learning. Input, a whole bunch of representations being learned, and then piped into a linear or logistic regression model. Okay? So, the amazing thing is that this simple idea is just incredibly powerful. Right? That idea has led to ChatGPT, to AlphaGo, AlphaFold, and so on and so forth. And, I kid you not, I've been doing deep learning for about ten years now, and every time I look at it, I literally get goosebumps every so often.
That something so simple could be so powerful really boggles the mind. I'm just so lucky to be alive and working during this period. Okay? And, you know, coming from people who have been in the industry a long time, this sort of breathless exclamation is quite rare, particularly because I'm not in marketing. I actually mean it. With all due apologies to various marketing folks. I just realized this is being taped, so, okay. So, this has demolished the human bottleneck for using machine learning with unstructured data, and it comes from the confluence of three forces: new algorithmic ideas, a whole lot of data, and then, very importantly, the fact that we have access to parallel computing hardware in the form of these things called GPUs, graphics processing units. These three forces came together, they were applied to an old idea called neural networks, and that's basically deep learning.
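The pipeline described a moment ago, raw input, then layers that learn a representation, then a small logistic-regression head, can be sketched as a forward pass. This is only the shape of the idea: the weights below are random placeholders, whereas in a real network they would be learned from data, and all the sizes are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Raw input: a flattened 4-pixel "image" (invented numbers).
x = np.array([0.2, 0.9, 0.1, 0.5])

# "Representation" layers: two small fully connected layers.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)
h = relu(W2 @ relu(W1 @ x + b1) + b2)   # the learned features

# Logistic-regression head sitting on top of the learned features.
w, b = rng.normal(size=2), 0.0
p = sigmoid(w @ h + b)                  # probability of the class
```

Swap the hand-crafted feature extractor for the middle layers, train everything end to end, and that is the whole trick.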
And I'll go through it very quickly, because obviously we're going to spend half the semester looking into this thing in detail. So, what's the immediate application of the ability to automatically handle unstructured data? What is, like, the no-brainer application? It's okay if it's obvious; tell me.

>> Image classification.

>> Right. So, image classification, yes. You can take an image, a good example of unstructured data, and do some classification on it. But more generally, what I'm getting at is that every sensor in the world can be given the ability to detect, recognize, and classify what it's sensing. Every sensor. Because remember, what does a sensor do? A sensor is just a receptacle for unstructured data. A camera is a receptacle for unstructured video, or unstructured still images. A microphone: unstructured audio, right?
So, for every sensor, you can imagine taking the sensor and sticking a little deep learning system behind it. And now, suddenly, with what comes out of that sensor plus the deep learning system, you can count, you can classify, you can detect, you can do all kinds of stuff. In short, you can analyze. And you can predict, right? Now, the way I'm describing it right now, you'll be like, "Yeah, duh, obviously." But you know what, this "obviously" is actually not at all obvious in terms of whether it'll help you find interesting applications or not. Okay? So, here's something I literally saw last week. Actually, I have another slide before that, but we're coming to that. So, for instance, every time you use Face ID to unlock your phone, this is the basic principle at work. The camera in the iPhone is the sensor, and they stuck a deep learning system behind it to do image classification: my face versus not my face, right? That's what it's classifying.
And so here you have a breast cancer detection system that works from a mammogram. By the way, this is a very interesting picture. There's a professor in EECS, Regina Barzilay, who's a very well-known expert in this field, and she has actually built a breast cancer detection system, which has been deployed at Mass General Hospital. And it turns out she's actually a breast cancer survivor. She's good now, all good. But after she built her system, I heard that she ran that system against her mammograms from many years prior, from when she went for a mammogram and was told that everything was fine. She ran the system on that mammogram, and it came back and said, "Here is a problem." So, a very interesting example where a deep learning system picked up something that a radiologist could not. So, these things can be quite powerful.
Obviously, any self-driving system has numerous deep learning algorithms running under the hood: pedestrian detection, stoplight detection, zebra-crossing detection, and so on and so forth. It's also being very heavily used in visual inspection in manufacturing. You have various cameras now — instead of people looking at a part and saying, "Okay, there is a dent, or there's a scratch," they have a little system which is a dent detector, a scratch detector, and so on. That's going on right now. And now I come to the example I saw last week. So, this is an example of how you can create dramatically better products if you really internalize this idea. It's almost like you're looking at the world and saying, "Oh, there's a sensor. Can I attach a DL thing behind it?" That's the way you should be looking at the world, okay, for startup ideas. So, here's an example. These, apparently, are the world's first smart binoculars.
Okay? This is the binocular — it came out two weeks ago. You look at the bird, and it tells you what kind of bird it is, right there. It's a simple idea, but imagine, right? Imagine you are the first out of the gate with this feature — you'll have a little bit of an edge till everybody catches up, like 3 months later. Let's be very clear: there are no long-term monopoly windows in the world. There are only short-term windows, so the hunt is always on for a little monopoly window. So, here's an example of that. Right? So, I encourage you to always think about the world as: where are the sensors here? And can I attach something behind the sensor to do something useful with it? Okay? All right. Now, let's turn our attention to the output.
We've been talking about structured data and unstructured data, and how deep learning has sort of unlocked the ability to work with unstructured data, but we've sort of been neglecting the output side of the equation. So, traditionally, we could predict single numbers, or a few numbers, pretty easily, right? You've all done the canonical "should this loan application be approved" exercise in machine learning, right? You just predict the probability that a borrower will repay a loan, based on a whole bunch of data. Or in supply chain, you predict the demand for the product next week. Or you could predict a bunch of numbers: given a picture, you can say, "Okay, which one of the 10 kinds of furniture is it?" Right? You can predict 10 numbers — 10 probabilities that add up to one. Or you can predict a whole bunch of numbers that don't have to add up to one, such as the GPS coordinates of an Uber ride.
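The "10 probabilities that add up to one" output is what a softmax layer produces. A minimal sketch in Python — the raw scores below are made-up numbers purely for illustration:

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability, exponentiate, then normalize
    # so the results sum to one and can be read as probabilities.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for 10 furniture classes.
scores = [2.0, 0.5, -1.0, 0.0, 1.2, -0.3, 0.8, 0.1, -2.0, 0.4]
probs = softmax(scores)
print(sum(probs))  # 1.0, up to floating point
```

The class with the largest raw score always gets the largest probability, which is why the model's "pick" is just the argmax of the softmax output.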
So, these are all simple structured outputs — just a few numbers, right? What we could not do very easily was actually generate pictures like this. We could not generate unstructured data. We could only consume unstructured data, right? With generative AI, that problem is gone: you can generate text, you can generate pictures, and audio, and so on, and so forth. So, generative AI is the ability to actually create unstructured data, all right? And therefore, it sits within deep learning. It still runs on deep learning, but it's just one kind of deep learning. Okay? There's plenty of stuff going on in deep learning that's got nothing to do with generative AI. Nowadays, of course, if you're a self-respecting entrepreneur who wants to ride this craze, you'll probably declare whatever you're doing as generative AI. Right? And some VCs may actually be ready to fund you, who knows?
But the point is, there's plenty of stuff going on in deep learning that's got nothing to do with generative AI. So, this is the overall picture. Now, here, we can produce unstructured outputs, like pictures. You can take this image, and then you can actually come up with a nice description of it. This actually is a very famous picture, by the way, in the world of computer vision. So, we are actually going to be analyzing this picture a little later on in the semester. You can obviously go from a very complicated caption to an image. You can go from text to music. Can people hear it? Okay. Yeah. All right. And of course, we can go from text to text — i.e., ChatGPT. And then, as of a few months ago, things have gotten even more interesting: you can send text and an image in, and you can get text out. Right?
And in fact, as of a few weeks ago, you can send text, image, text, image, text, image — in an arbitrary sequence — into the system, and it'll actually come back to you with text and images. Right? So, things are becoming multimodal. I just want to share with you a really fun example I saw recently. So, this person sends this picture. Can folks see this? It's this very complicated parking sign, apparently in San Francisco. And they're like: it's Wednesday at 4:00 p.m. Can I park here? Tell me in one line. Because you really didn't want GPT-4 to be giving you a big essay about this. Like, you literally want to park. So, GPT-4 comes back and says, "Yes, you can park here for up to 1 hour starting at 4:00 p.m." And folks, I double-checked this thing — it's correct. We all know these things hallucinate, right? Can you imagine getting a parking ticket and telling the judge, "I'm sorry, I didn't realize it was hallucinating"? So, you have to double-check it.
So, yeah. So, things are getting multimodal very quickly. And so, the picture here is that within gen AI, we used to have these separate circles: text to text, text to image, text to music, text to this, text to that, so on and so forth. Those are all beginning to merge now inside gen AI, because multimodal models are going to become the norm this year, right? We already have really good closed models. We actually already have very good open-source multimodal models. And so, my feeling is that by the end of the year, the idea of using a text-only model is going to be like, "Really, you still do that?" Right? It's going to become a quaint, old-fashioned thing. I think multimodality is going to become the norm. So, that's where the world is, and this is the landscape. So, any questions on the landscape? Before we actually start doing some math. Okay. Yeah. You mean the evidence of that being a problem would have been smaller? Yeah.
Yeah. So, the question is: in general, how do you train your models so that they give you the right answers, given that over the passage of time, the amount of evidence in the data could be highly variable? So, in this particular case — the professor I talked about — everything at that point was going through an expert radiologist. So, 5 years ago, this mammogram was seen by a radiologist, and that person concluded there was no problem. So, that was the training label, right? The wrong training label. So, typically what happens is that training labels could be wrong some small fraction of the time. So, you need to have systems that are robust. So, your data needs to be complete, it needs to be comprehensive, it needs to have correct labels. If these conditions are not met, your systems are not going to be that good. But as it turns out, with neural networks, even with some amount of noise in the labels, they still do a pretty good job. Right? So, that's sort of the general idea.
The verification comes from the human. So, remember, when we look at radiology data, the data we're working with is: the input is, let's say, an image, like a mammogram, and then a human radiologist, or a set of radiologists, have said this has a problem or does not have a problem. So, that is called the ground truth. So, it is this ground-truth image-and-label combination that's being used to train these models. Yeah. Embodiment? So, are we going to cover embodiment? So, the embodiment here refers to the fact that if you have robots, right, they need to actually operate in the real world, and so robots are an example of what's called embodied intelligence. So, unfortunately, due to the constraints of time, we're not going to get into robotics at all. But I will say that a lot of the deep learning stuff we're going to talk about — those are all fundamental building blocks in modern robotic systems.
All right. So, in summary: X and Y can be anything, and they can be multimodal. Okay? I literally could not have put up this slide maybe 2 years ago. Right? So, it's very simple in how it looks, but it's very profound. You can learn a mapping from anything to anything at this point, very easily, as long as you have enough data. Okay? So, now, note that all this excitement that we see around us — everything stems from deep learning. Okay? Everything depends on deep learning. And so, if you understand deep learning, a lot of interesting things become possible. So, let's get going. All right. So, we'll start with the very basics. What's a neural network? Now, recall logistic regression from back in the day. So, what is logistic regression? You send in a bunch of numbers, a vector of numbers, and you usually get a probability out, right? Between 0 and 1. What is the probability of something or the other? Okay?
And so, this logistic regression model is also represented in this form, if you will recall. So, basically, what we do is we take all these numbers and run them through a linear function, right? We run them through a linear function, we get a number z, and then we take that number and run it through 1 / (1 + e^(-z)), and that's guaranteed to give you a number between 0 and 1, which can be interpreted as a probability — and that's logistic regression. Okay? And the canonical examples — loan approvals, things like that — all fall into this sort of convenient bucket. Okay? So, this should be super familiar. All right. Now, we're going to actually look at this simple, modest, humble little operation through the lens of a network of mathematical operations, and the reason why we do it will become clear a bit later. So, we'll take this very simple example where we have, let's say, two variables: GPA and experience, right?
This is the GPA of some graduates, and the number of years of work experience, and then this is the dependent variable, which is either 0 or 1 — 0 if they don't get called for an interview, 1 if they get called for an interview. Okay? It's a two-input-variable, one-output-variable problem. Okay? And it's a classification problem, because we're classifying people into: will they get called for an interview, yes or no. Okay? And so, that's the setup for this problem. And let's say that we actually try to fit a logistic regression model to it. So, if you're familiar with R, for example, you would use something like glm to fit this model. If you use something like statsmodels in Python, there's a similar function for it. Scikit-learn — there's another function for it. You get the idea, right? You can use whatever favorite method you have for logistic regression modeling to get this job done.
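Under the hood, glm, statsmodels, and scikit-learn all fit the intercept and coefficients by maximizing the likelihood of the labels. A bare-bones gradient-descent sketch of that fitting process, on a small made-up dataset (the lecture's actual data isn't reproduced here, so the numbers and the fitted coefficients are purely illustrative):

```python
import math

# Made-up (GPA, years of experience, got-interview) rows for illustration only.
data = [(3.9, 2.0, 1), (3.1, 0.5, 0), (3.6, 3.0, 1), (2.8, 1.0, 0),
        (3.4, 2.5, 1), (2.9, 0.0, 0), (3.8, 1.5, 1), (3.0, 2.0, 0)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Fit intercept b and coefficients w1 (GPA), w2 (experience) by plain
# gradient descent on the log-loss -- a stripped-down version of what
# the library routines do with fancier optimizers.
b = w1 = w2 = 0.0
lr = 0.1
for _ in range(5000):
    gb = g1 = g2 = 0.0
    for gpa, yrs, y in data:
        err = sigmoid(b + w1 * gpa + w2 * yrs) - y  # prediction error
        gb += err
        g1 += err * gpa
        g2 += err * yrs
    b -= lr * gb / len(data)
    w1 -= lr * g1 / len(data)
    w2 -= lr * g2 / len(data)

print(b, w1, w2)  # fitted intercept and coefficients
```

With real data you would just call the library function; the point of the sketch is that "fitting" means nudging the weights until the predicted probabilities match the 0/1 labels as closely as possible.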
And if you do that with this little dataset, you're going to get these coefficients, right? The 0.4 is the intercept, 0.2 is the coefficient for GPA, 0.5 for experience. And that is the resulting sigmoid function. Okay? All right. Cool. So, now let's actually rewrite this formula as a network, in the following way. So, first, what we'll do is we'll take GPA and experience and stick them here on the left side, and we'll put little circles next to them, and we'll call them the input nodes. Okay? And so, imagine that somebody writes a GPA into the circle — 3.5, say, or years of experience, 2.0 — and then it flows through this arrow, and as it flows through, it gets multiplied by its coefficient, 0.2. The 0.2 is coming from here. Similarly, experience gets multiplied by 0.5, it comes in here, and this node, as the plus indicates, is adding everything that's coming into it. So, it's adding 0.2 * GPA, 0.5 * experience, plus the intercept, which is the green arrow coming in on its own.
It comes through here, and what comes out of this is just a single number, and that number goes into this little circle, and then out pops a probability. Okay? So, I've written a simple function in a ridiculously long-winded way. Okay? And the reason why I'm doing it will become clear in a second. Okay? So, this is a little network of operations for the simple function. And so, for instance, here's how you would use it to make a prediction. Let's say someone has a 3.8 GPA and 1.2 years of experience. You just plug it in here, do the math, you get 0.76; same thing here; add them all up, you get 1.76; you run 1.76 through the sigmoid, you get 0.85, and that is the probability that that particular individual may get called for an interview. Okay? At this point, we're just doing logistic regression, nothing more complicated. Okay?
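The forward pass just described can be checked in a couple of lines of Python, using the fitted coefficients from the lecture (intercept 0.4, weight 0.2 for GPA, weight 0.5 for experience):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Fitted coefficients from the lecture's example.
b, w_gpa, w_exp = 0.4, 0.2, 0.5

gpa, years = 3.8, 1.2
z = b + w_gpa * gpa + w_exp * years  # 0.4 + 0.76 + 0.6 = 1.76
p = sigmoid(z)
print(round(z, 2), round(p, 2))  # 1.76 0.85
```

Same arithmetic as reading the network left to right: multiply each input by its weight, add the intercept, squash through the sigmoid.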
So, now, if you have many variables — not two, but X1 through XK — the same sort of logic applies. Each one has some coefficient, and then there's an intercept; they all get added up here, run through a sigmoid, and out pops this number. Okay? Notice how the data flows from left to right. Okay? All right. Any questions on this? All right. Good. So, now, terminology. You'll discover that the world of neural networks and deep learning has its own terminology. They have their own ways of referring to things that the rest of the world has been referring to by something else for the longest time. Right? It's kind of annoying sometimes, but it's the way it is. So — remember, in regression, we used to call those numbers next to each variable coefficients, and the constant thing an intercept? Well, guess what? In this world, those coefficients are actually called weights, and the intercepts are called biases.
So, in the neural network world, these are called weights and biases. And sometimes, if you're a little lazy, you may just call the whole thing weights. Okay? So, when you see in the newspaper that, you know, "Oh my god, this amazing model's weights have been leaked on the internet, or on BitTorrent, or something" — that's what's going on, right? All these coefficients have been leaked. Because once you know what the coefficients are and what the architecture is, you can just reconstruct the model. All right. So, that's what's going on here. Now, why did we do this network business? Why did we write it as a network? Yeah, what is the advantage? Any guesses? "When you have multiple functions — it's just easier to see it that way." Right. If you have lots of things going on, it's easier to see if you actually write it in graphical form. Yes, correct. But, so, is it only a usability advantage?
"I mean, the thing is, you want different functions for different layers of that." Uh-huh. Okay. So, maybe we want to use different functions in different layers. But I think there's actually an even larger, more basic point, which is that the moment you write it down, you suddenly realize that you could have lots of things in the middle. I don't have to go from the input to the output directly. I can do lots of things in the middle, right? That's sort of the key idea. So, remember the notion of learning representations of unstructured data, right? Where you take a picture and say beak length, and things like that, right? And remember, I said deep learning actually automatically learns these things. Where is that automatic learning coming from? Well, this is where it's coming from. So, what we do is we take this thing, right? It's just a logistic regression model.
Inputs 978 00:31:41,559 --> 00:31:45,720 get multiplied and added up as a linear 979 00:31:43,480 --> 00:31:46,880 function, run through a sigmoid. 980 00:31:45,720 --> 00:31:48,799 And then 981 00:31:46,880 --> 00:31:51,520 we are like, "Hmm, if we want to learn 982 00:31:48,799 --> 00:31:53,000 representations of the raw input, we 983 00:31:51,519 --> 00:31:54,720 better be doing something in the middle 984 00:31:53,000 --> 00:31:56,759 here." 985 00:31:54,720 --> 00:31:58,720 Because the output is the output. 986 00:31:56,759 --> 00:32:00,039 That is That's not going to change. 987 00:31:58,720 --> 00:32:02,079 You know, it's it's either a dog or a 988 00:32:00,039 --> 00:32:05,440 cat. You don't have any choice 989 00:32:02,079 --> 00:32:07,960 as to what it is. Okay? The only agency 990 00:32:05,440 --> 00:32:09,279 you have at this point is you can take 991 00:32:07,960 --> 00:32:11,079 the raw input and do things in the 992 00:32:09,279 --> 00:32:12,678 middle with it. 993 00:32:11,079 --> 00:32:14,439 You can do a lot of stuff in the middle 994 00:32:12,679 --> 00:32:18,160 and then run it through something to get 995 00:32:14,440 --> 00:32:20,679 the output. Okay? So, in any in in in 996 00:32:18,160 --> 00:32:22,120 any mathematical discipline, 997 00:32:20,679 --> 00:32:23,679 if someone comes to you and says, 998 00:32:22,119 --> 00:32:25,639 "Here's a bunch of data. 999 00:32:23,679 --> 00:32:27,280 I want you to do something with it." 1000 00:32:25,640 --> 00:32:30,759 What should the What is like the big the 1001 00:32:27,279 --> 00:32:30,759 most basic first thing you should do? 1002 00:32:31,720 --> 00:32:36,120 Run it through a linear function. 1003 00:32:34,480 --> 00:32:37,759 The most basic thing in math is a linear 1004 00:32:36,119 --> 00:32:38,559 function. So, given anything, just run 1005 00:32:37,759 --> 00:32:40,039 it through a linear function. See what 1006 00:32:38,559 --> 00:32:42,678 happens.
1007 00:32:40,039 --> 00:32:44,399 So, that's exactly what we can do. So, 1008 00:32:42,679 --> 00:32:46,560 the simplest thing we can do here, we 1009 00:32:44,400 --> 00:32:49,400 can insert a bunch of linear functions. 1010 00:32:46,559 --> 00:32:50,960 So, what we do is we take all this input and 1011 00:32:49,400 --> 00:32:52,759 we just run it we we do a linear 1012 00:32:50,960 --> 00:32:56,079 function on it. So, think of this as 1013 00:32:52,759 --> 00:32:58,879 X1 * 2 + X3 * 4 and all the way to XK * 1014 00:32:56,079 --> 00:33:00,599 9 plus some intercept and boom, it goes 1015 00:32:58,880 --> 00:33:05,200 out the other end. So, this little 1016 00:33:00,599 --> 00:33:05,959 circle here with a plus in it is just 1017 00:33:05,200 --> 00:33:06,600 Thank you. 1018 00:33:05,960 --> 00:33:08,279 Uh 1019 00:33:06,599 --> 00:33:10,359 that is This is just a linear It's a 1020 00:33:08,279 --> 00:33:11,480 shorthand for a linear function. 1021 00:33:10,359 --> 00:33:13,159 So, whenever you see a circle with a 1022 00:33:11,480 --> 00:33:15,360 plus, it's just a shorthand for a linear 1023 00:33:13,160 --> 00:33:16,279 function. Okay? So, you can take this 1024 00:33:15,359 --> 00:33:17,759 whole thing and run through a linear 1025 00:33:16,279 --> 00:33:19,799 function and when you do it, you'll get 1026 00:33:17,759 --> 00:33:21,960 some number right there. You'll get some 1027 00:33:19,799 --> 00:33:23,399 number. So, you've taken these K numbers 1028 00:33:21,960 --> 00:33:25,559 and you've sort of compressed them 1029 00:33:23,400 --> 00:33:26,840 in some way into one number. 1030 00:33:25,559 --> 00:33:28,319 Okay? 1031 00:33:26,839 --> 00:33:30,079 But, you don't have to stop at one 1032 00:33:28,319 --> 00:33:31,599 number. You can do more. 1033 00:33:30,079 --> 00:33:33,439 So, we can have a stack of linear 1034 00:33:31,599 --> 00:33:35,359 functions in the middle. 1035 00:33:33,440 --> 00:33:37,279 Right?
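[Editor's note: written as code, one of those circles with a plus is just a weighted sum. This is a minimal sketch; the coefficients and intercept below are made-up illustration values echoing the 2s, 4s, and 9 mentioned in the lecture, not anything from the slides.]

```python
import numpy as np

# One "circle with a plus": a linear function that collapses K inputs
# into a single number. Weights and intercept are made up for illustration.
x = np.array([1.0, 2.0, 3.0])   # K = 3 inputs
w = np.array([2.0, 4.0, 9.0])   # one coefficient per input
b = 0.5                         # the intercept ("bias")

z = w @ x + b                   # K numbers compressed into one number
```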
There's a linear function here, 1036 00:33:35,359 --> 00:33:40,159 another one here, another one here. At 1037 00:33:37,279 --> 00:33:42,240 this point, the K numbers you have 1038 00:33:40,160 --> 00:33:43,440 K could be, for example, 1,000. 1039 00:33:42,240 --> 00:33:44,400 Right? It's just the size of your input 1040 00:33:43,440 --> 00:33:45,799 data. 1041 00:33:44,400 --> 00:33:47,280 You've taken these K things and you've 1042 00:33:45,799 --> 00:33:48,839 compressed them into three numbers at 1043 00:33:47,279 --> 00:33:50,359 this point. 1044 00:33:48,839 --> 00:33:52,079 Okay? 1045 00:33:50,359 --> 00:33:53,079 So, okay, maybe three is the right 1046 00:33:52,079 --> 00:33:54,039 number, maybe 10 is the right number. We 1047 00:33:53,079 --> 00:33:55,480 don't know. 1048 00:33:54,039 --> 00:33:58,079 And we'll get to know how do we know 1049 00:33:55,480 --> 00:33:59,519 what the right number is later on. 1050 00:33:58,079 --> 00:34:01,159 So, we can stack as many linear 1051 00:33:59,519 --> 00:34:02,720 functions we want. 1052 00:34:01,160 --> 00:34:04,440 So, we have transformed this K thing 1053 00:34:02,720 --> 00:34:06,600 into a three-dimensional vector, right? 1054 00:34:04,440 --> 00:34:07,519 K numbers become three numbers. 1055 00:34:06,599 --> 00:34:10,279 Um 1056 00:34:07,519 --> 00:34:12,280 and now we can flow this three these 1057 00:34:10,280 --> 00:34:13,919 three numbers through some other little 1058 00:34:12,280 --> 00:34:16,359 function. 1059 00:34:13,918 --> 00:34:16,358 Okay? 
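[Editor's note: a vertical stack of linear functions is conveniently written as a matrix: one row per linear function. A minimal sketch, with K = 1,000 and random weights standing in for coefficients the network would learn.]

```python
import numpy as np

# Three stacked linear functions = a 3 x K weight matrix plus a
# 3-vector of intercepts: K numbers in, 3 numbers out.
K = 1000
rng = np.random.default_rng(42)
x = rng.normal(size=K)        # e.g. a flattened input of 1,000 numbers
W = rng.normal(size=(3, K))   # three linear functions, one per row
b = np.zeros(3)

h = W @ x + b                 # the K numbers become 3 numbers
```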
1060 00:34:16,440 --> 00:34:19,559 And as you will see in a few minutes, 1061 00:34:18,039 --> 00:34:20,759 that function is called an activation 1062 00:34:19,559 --> 00:34:22,320 function 1063 00:34:20,760 --> 00:34:23,359 and it's chosen to be a non-linear 1064 00:34:22,320 --> 00:34:24,559 function 1065 00:34:23,358 --> 00:34:26,759 because if you don't choose it to be a 1066 00:34:24,559 --> 00:34:28,719 non-linear function, all the effort we 1067 00:34:26,760 --> 00:34:30,280 are doing is going to be a total waste 1068 00:34:28,719 --> 00:34:32,839 of time. 1069 00:34:30,280 --> 00:34:34,399 Okay? For now, just 1070 00:34:32,840 --> 00:34:36,200 take it on faith that you need to have 1071 00:34:34,398 --> 00:34:39,480 non-linear functions here. 1072 00:34:36,199 --> 00:34:41,039 But, note that the three numbers here 1073 00:34:39,480 --> 00:34:42,079 are still three numbers. They are three 1074 00:34:41,039 --> 00:34:43,398 different numbers, but they're still 1075 00:34:42,079 --> 00:34:45,000 three numbers. 1076 00:34:43,398 --> 00:34:46,440 And once we do this, we'll be like, "You 1077 00:34:45,000 --> 00:34:48,119 know what? This was fun. Let's do it 1078 00:34:46,440 --> 00:34:51,918 again." 1079 00:34:48,119 --> 00:34:51,918 Okay? So, you can do it again. 1080 00:34:52,320 --> 00:34:55,720 And you can keep on doing it. You can 1081 00:34:53,559 --> 00:34:57,400 keep doing it 100 times if you want. 1082 00:34:55,719 --> 00:35:00,639 And the key thing is that every time you 1083 00:34:57,400 --> 00:35:03,079 do it, you're giving this network some 1084 00:35:00,639 --> 00:35:05,159 ability, some capacity to learn 1085 00:35:03,079 --> 00:35:07,799 something interesting from the data. 1086 00:35:05,159 --> 00:35:09,319 To learn an interesting representation. 1087 00:35:07,800 --> 00:35:10,680 Now, of course, you're thinking, "Well, 1088 00:35:09,320 --> 00:35:12,039 how do we know it's interesting?
How do 1089 00:35:10,679 --> 00:35:14,079 you know it's a useful thing?" And we'll 1090 00:35:12,039 --> 00:35:14,840 come to all that later on. 1091 00:35:14,079 --> 00:35:16,840 Right? We're just giving it the 1092 00:35:14,840 --> 00:35:17,960 capacity, the potential to learn 1093 00:35:16,840 --> 00:35:19,240 interesting things from the data. 1094 00:35:17,960 --> 00:35:21,199 Whether it actually lives up to its 1095 00:35:19,239 --> 00:35:23,000 potential, we don't know yet. 1096 00:35:21,199 --> 00:35:24,719 Okay? We'll give it the potential. 1097 00:35:23,000 --> 00:35:26,358 Because the more transformations of the 1098 00:35:24,719 --> 00:35:27,799 input data you make, the more 1099 00:35:26,358 --> 00:35:29,039 opportunity you have to do interesting 1100 00:35:27,800 --> 00:35:30,160 things with it. 1101 00:35:29,039 --> 00:35:31,480 If I don't even give you the opportunity 1102 00:35:30,159 --> 00:35:32,879 to transform it once, you don't have any 1103 00:35:31,480 --> 00:35:34,719 opportunity, right? 1104 00:35:32,880 --> 00:35:36,200 If I give you 10 chances to transform 1105 00:35:34,719 --> 00:35:38,039 things, you have 10 shots at doing 1106 00:35:36,199 --> 00:35:40,239 something useful. 1107 00:35:38,039 --> 00:35:42,159 So, you can you can do this repeatedly 1108 00:35:40,239 --> 00:35:44,759 and once we are done doing these 1109 00:35:42,159 --> 00:35:46,159 transformations, we just pipe it through 1110 00:35:44,760 --> 00:35:49,920 to our good old logistic regression 1111 00:35:46,159 --> 00:35:49,920 sigmoid here and we are done. 1112 00:35:50,440 --> 00:35:53,960 Okay? 1113 00:35:51,480 --> 00:35:55,960 So, this is the basic idea. 
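[Editor's note: the whole idea described so far — transform the input a few times with linear functions plus non-linearities, then hand the result to a plain logistic-regression head — can be sketched end to end. All layer sizes here are made up, and the small random weights stand in for coefficients a real network would learn from data.]

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def relu(a):
    return np.maximum(a, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=8)                               # raw input, K = 8

# Two hidden transformations, then a logistic-regression head.
# 0.1 scaling keeps the made-up weights small.
W1, b1 = 0.1 * rng.normal(size=(5, 8)), np.zeros(5)  # hidden layer 1
W2, b2 = 0.1 * rng.normal(size=(3, 5)), np.zeros(3)  # hidden layer 2
w_out, b_out = 0.1 * rng.normal(size=3), 0.0         # output layer

h1 = relu(W1 @ x + b1)            # transform once...
h2 = relu(W2 @ h1 + b2)           # ...and again ("this was fun, do it again")
p = sigmoid(w_out @ h2 + b_out)   # good old logistic regression at the end
```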
1114 00:35:53,960 --> 00:35:57,800 And so, just to contrast it, this was 1115 00:35:55,960 --> 00:35:59,240 good old logistic regression where we 1116 00:35:57,800 --> 00:36:00,519 take the input, 1117 00:35:59,239 --> 00:36:02,319 we run it through a linear function and 1118 00:36:00,519 --> 00:36:04,599 pop out a number, 1119 00:36:02,320 --> 00:36:06,080 a probability number. But, after we do 1120 00:36:04,599 --> 00:36:08,599 all this stuff, the input stays the 1121 00:36:06,079 --> 00:36:09,679 same, the output stays the same, but in 1122 00:36:08,599 --> 00:36:11,480 the middle you just run through a whole 1123 00:36:09,679 --> 00:36:12,639 bunch of these functions, you know, 1124 00:36:11,480 --> 00:36:14,358 these layers, boop boop boop boop, and 1125 00:36:12,639 --> 00:36:15,239 then we get the output. 1126 00:36:14,358 --> 00:36:16,559 Okay? 1127 00:36:15,239 --> 00:36:19,519 That's all we have done. 1128 00:36:16,559 --> 00:36:21,679 And this is a neural network. 1129 00:36:19,519 --> 00:36:25,079 A neural network is nothing more than 1130 00:36:21,679 --> 00:36:27,519 repeatedly transformed inputs which are 1131 00:36:25,079 --> 00:36:30,159 finally fed to a linear or logistic 1132 00:36:27,519 --> 00:36:30,159 regression model. 1133 00:36:35,400 --> 00:36:38,800 Any questions? 1134 00:36:37,559 --> 00:36:41,799 I have two questions. Could you use the 1135 00:36:38,800 --> 00:36:43,320 thing so that everyone can hear? Yeah. 1136 00:36:41,800 --> 00:36:45,240 I have two questions. Firstly, so when 1137 00:36:43,320 --> 00:36:48,080 we say that there isn't chance of 1138 00:36:45,239 --> 00:36:51,559 explainability, is it that we don't know 1139 00:36:48,079 --> 00:36:53,239 which arrow it went through? That's one. 1140 00:36:51,559 --> 00:36:54,960 Second, 1141 00:36:53,239 --> 00:36:57,239 who's controlling the number of 1142 00:36:54,960 --> 00:36:59,639 iterations or the number of functions? 
1143 00:36:57,239 --> 00:37:01,239 That's up to us or how does that work? 1144 00:36:59,639 --> 00:37:03,960 Right. So, yeah, so the the first 1145 00:37:01,239 --> 00:37:06,879 question, um explainability, we actually 1146 00:37:03,960 --> 00:37:09,119 know exactly for any given input input 1147 00:37:06,880 --> 00:37:10,760 uh data data point, we know exactly how 1148 00:37:09,119 --> 00:37:12,119 it flows through the network. So, there 1149 00:37:10,760 --> 00:37:15,680 is no problem there. 1150 00:37:12,119 --> 00:37:17,599 The problem is in ascribing, "Okay, this 1151 00:37:15,679 --> 00:37:20,159 we we think this person is going to be 1152 00:37:17,599 --> 00:37:21,880 uh repay the loan because 1153 00:37:20,159 --> 00:37:24,159 of this particular attribute." We don't 1154 00:37:21,880 --> 00:37:25,680 know that because those attributes all 1155 00:37:24,159 --> 00:37:27,358 get enmeshed together and goes through 1156 00:37:25,679 --> 00:37:29,119 this complicated thing. So, we know 1157 00:37:27,358 --> 00:37:31,480 exactly what happens. We just can't give 1158 00:37:29,119 --> 00:37:33,319 credit to anyone thing very easily. 1159 00:37:31,480 --> 00:37:35,480 I'm again, I'm just standing on the 1160 00:37:33,320 --> 00:37:36,280 brink of this vast ocean of something 1161 00:37:35,480 --> 00:37:38,519 called explainability and 1162 00:37:36,280 --> 00:37:39,960 interpretability, uh which I'll get to a 1163 00:37:38,519 --> 00:37:42,280 bit later on in the semester. But, 1164 00:37:39,960 --> 00:37:44,280 that's sort of the quick 1165 00:37:42,280 --> 00:37:46,880 kind of right-ish kind of wrong answer. 1166 00:37:44,280 --> 00:37:47,760 Okay? Number two, um 1167 00:37:46,880 --> 00:37:49,559 uh 1168 00:37:47,760 --> 00:37:51,000 we decide the number of layers. 
We 1169 00:37:49,559 --> 00:37:52,880 decide a whole bunch of things and as 1170 00:37:51,000 --> 00:37:53,920 we'll see in a few minutes, uh there is 1171 00:37:52,880 --> 00:37:55,640 something that's given to us and 1172 00:37:53,920 --> 00:37:58,840 something we get to design and I'll make 1173 00:37:55,639 --> 00:37:58,839 it very clear which is which. 1174 00:37:59,320 --> 00:38:01,600 Yeah. 1175 00:38:02,000 --> 00:38:06,320 Did I say your name right? Yeah. 1176 00:38:04,039 --> 00:38:08,840 So, which functions have to be linear 1177 00:38:06,320 --> 00:38:11,960 and also like why does it have to be 1178 00:38:08,840 --> 00:38:15,200 linear? Yeah. So, these functions uh the 1179 00:38:11,960 --> 00:38:16,920 f of x here, they have to be non-linear. 1180 00:38:15,199 --> 00:38:19,439 As to why they have to be non-linear, 1181 00:38:16,920 --> 00:38:22,559 we'll get to that in a few minutes. 1182 00:38:19,440 --> 00:38:23,480 Okay. So, these are called neurons. 1183 00:38:22,559 --> 00:38:25,239 Okay? 1184 00:38:23,480 --> 00:38:27,559 These things where you basically there's 1185 00:38:25,239 --> 00:38:29,358 a linear function followed by uh a 1186 00:38:27,559 --> 00:38:31,000 little non-linear function, 1187 00:38:29,358 --> 00:38:32,679 right? This is a Each one of these 1188 00:38:31,000 --> 00:38:34,239 things is called a neuron. 1189 00:38:32,679 --> 00:38:36,960 Um 1190 00:38:34,239 --> 00:38:39,719 By the way, you know, this is loosely 1191 00:38:36,960 --> 00:38:41,679 inspired by the way how, you know, uh 1192 00:38:39,719 --> 00:38:42,919 neurons work in a human in mammalian 1193 00:38:41,679 --> 00:38:45,599 brains. 1194 00:38:42,920 --> 00:38:47,880 But, the connections between 1195 00:38:45,599 --> 00:38:50,679 neuroscience and deep learning 1196 00:38:47,880 --> 00:38:52,599 are very heavily argued. 1197 00:38:50,679 --> 00:38:55,559 So, I'm going to like stay away from it. 1198 00:38:52,599 --> 00:38:57,559 Okay? 
Uh suffice it to say it's I I just 1199 00:38:55,559 --> 00:38:59,559 think of For for building practical deep 1200 00:38:57,559 --> 00:39:01,880 learning systems in industry, you don't 1201 00:38:59,559 --> 00:39:04,000 you don't worry about this. Okay? 1202 00:39:01,880 --> 00:39:06,880 All right, let's move on. 1203 00:39:04,000 --> 00:39:09,320 Terminology. Uh this vertical stack of 1204 00:39:06,880 --> 00:39:10,760 linear functions or neurons, 1205 00:39:09,320 --> 00:39:12,080 right? This vertical stack is called a 1206 00:39:10,760 --> 00:39:14,080 layer. 1207 00:39:12,079 --> 00:39:15,840 Right? This is a layer, that's a layer. 1208 00:39:14,079 --> 00:39:17,279 Uh and these little non-linear 1209 00:39:15,840 --> 00:39:20,440 functions, which we haven't gotten to 1210 00:39:17,280 --> 00:39:22,280 yet, are called activation functions. 1211 00:39:20,440 --> 00:39:25,240 Uh and we'll get to why they are called 1212 00:39:22,280 --> 00:39:25,240 that in just a second. 1213 00:39:25,320 --> 00:39:29,400 And 1214 00:39:26,920 --> 00:39:31,840 the input 1215 00:39:29,400 --> 00:39:34,079 is called an input layer and I have the 1216 00:39:31,840 --> 00:39:35,640 word layer in double quotes because like 1217 00:39:34,079 --> 00:39:36,759 it's not really doing anything, right? 1218 00:39:35,639 --> 00:39:39,279 It's just the input. 1219 00:39:36,760 --> 00:39:41,480 So, but we call it an input layer. 1220 00:39:39,280 --> 00:39:42,880 And what the very final thing that 1221 00:39:41,480 --> 00:39:45,280 produces outputs is called the output 1222 00:39:42,880 --> 00:39:48,360 layer, right? Obviously. And everything 1223 00:39:45,280 --> 00:39:50,200 in the middle is called a hidden layer. 1224 00:39:48,360 --> 00:39:52,440 Okay? 
1225 00:39:50,199 --> 00:39:54,960 So, the final piece of terminology is 1226 00:39:52,440 --> 00:39:56,240 that when you have a layer like this in 1227 00:39:54,960 --> 00:39:58,240 which say three numbers are coming out 1228 00:39:56,239 --> 00:40:00,799 and there's another another layer, 1229 00:39:58,239 --> 00:40:03,319 right? If every neuron in this layer is 1230 00:40:00,800 --> 00:40:05,280 connected to every neuron in this layer, 1231 00:40:03,320 --> 00:40:07,280 it's called a fully connected or dense 1232 00:40:05,280 --> 00:40:08,880 layer. So, for instance, here 1233 00:40:07,280 --> 00:40:10,360 this arrow that's 1234 00:40:08,880 --> 00:40:11,240 whatever the whatever number is coming 1235 00:40:10,360 --> 00:40:12,720 out. Let's say the number three is 1236 00:40:11,239 --> 00:40:15,239 coming out of this thing here. That 1237 00:40:12,719 --> 00:40:17,399 number three goes flows on this arrow to 1238 00:40:15,239 --> 00:40:19,559 this thing, flows on this arrow to this 1239 00:40:17,400 --> 00:40:21,200 neuron, and flows on this third arrow to 1240 00:40:19,559 --> 00:40:23,239 this neuron. That's what I mean. So, 1241 00:40:21,199 --> 00:40:25,159 every neuron, its output is being sent 1242 00:40:23,239 --> 00:40:27,559 to every neuron in the following layer. 1243 00:40:25,159 --> 00:40:29,319 Okay? That's we call it fully connected 1244 00:40:27,559 --> 00:40:30,599 or dense. 1245 00:40:29,320 --> 00:40:32,559 And then 1246 00:40:30,599 --> 00:40:34,480 if you look at logistic regression, 1247 00:40:32,559 --> 00:40:36,320 right? This is logistic regression. You 1248 00:40:34,480 --> 00:40:40,440 can see basically logistic regression is 1249 00:40:36,320 --> 00:40:40,440 a neural network with no hidden layers. 1250 00:40:41,000 --> 00:40:43,639 So, in some sense, logistic regression 1251 00:40:42,159 --> 00:40:45,440 is like almost the simplest possible 1252 00:40:43,639 --> 00:40:48,359 network you can think of. 
1253 00:40:45,440 --> 00:40:50,280 Like barely a neural network. 1254 00:40:48,360 --> 00:40:51,079 Right? It's got no no hidden layers. 1255 00:40:50,280 --> 00:40:52,440 That's what makes it logistic 1256 00:40:51,079 --> 00:40:54,239 regression. 1257 00:40:52,440 --> 00:40:56,119 And so, as you might have guessed by 1258 00:40:54,239 --> 00:40:58,879 now, deep learning is just neural 1259 00:40:56,119 --> 00:41:00,119 networks with lots and lots of 1260 00:40:58,880 --> 00:41:02,400 of what? 1261 00:41:00,119 --> 00:41:04,319 Yes, layers. 1262 00:41:02,400 --> 00:41:07,079 So, here are a few. 1263 00:41:04,320 --> 00:41:08,480 Uh and by the way, these are not even 1264 00:41:07,079 --> 00:41:10,039 considered all that, you know, 1265 00:41:08,480 --> 00:41:13,039 impressive these days. 1266 00:41:10,039 --> 00:41:16,039 Okay? Uh but I put them up because this 1267 00:41:13,039 --> 00:41:18,119 this thing here is called ResNet. 1268 00:41:16,039 --> 00:41:20,440 And it's famous because the ResNet 1269 00:41:18,119 --> 00:41:21,559 neural network was I think the first 1270 00:41:20,440 --> 00:41:24,039 network 1271 00:41:21,559 --> 00:41:26,799 to surpass human-level performance in 1272 00:41:24,039 --> 00:41:28,920 image classification. 1273 00:41:26,800 --> 00:41:31,039 Sort of it it's sort of like the Skynet 1274 00:41:28,920 --> 00:41:32,960 of image classification. Okay? It 1275 00:41:31,039 --> 00:41:34,159 surpassed human-level performance. And 1276 00:41:32,960 --> 00:41:36,320 I'm putting it up here because we'll 1277 00:41:34,159 --> 00:41:37,759 actually work with ResNet on next next 1278 00:41:36,320 --> 00:41:39,280 Wednesday. And we'll actually take 1279 00:41:37,760 --> 00:41:41,920 ResNet, we'll fine-tune it, and solve a 1280 00:41:39,280 --> 00:41:43,640 real problem in class. 1281 00:41:41,920 --> 00:41:46,000 All right. So, it's got lots and lots of 1282 00:41:43,639 --> 00:41:47,159 layers. 
Uh now, let's turn to these 1283 00:41:46,000 --> 00:41:48,800 activation functions. We've been 1284 00:41:47,159 --> 00:41:49,839 ignoring these little guys, right? So 1285 00:41:48,800 --> 00:41:52,800 far. 1286 00:41:49,840 --> 00:41:54,920 So, the activation function at a node is 1287 00:41:52,800 --> 00:41:56,960 a first of all, it's a function that 1288 00:41:54,920 --> 00:41:58,639 receives a single number and outputs a 1289 00:41:56,960 --> 00:42:00,760 single number, right? It's not very 1290 00:41:58,639 --> 00:42:03,000 complicated, right? It receives 1291 00:42:00,760 --> 00:42:04,560 basically this this is a linear function 1292 00:42:03,000 --> 00:42:06,679 which receives all these inputs. It 1293 00:42:04,559 --> 00:42:07,880 could be 10 inputs, 1,000 inputs, 1294 00:42:06,679 --> 00:42:09,559 runs it through a linear function, 1295 00:42:07,880 --> 00:42:12,200 outputs a number, and that single 1296 00:42:09,559 --> 00:42:14,759 number, a scalar, goes in here, and it 1297 00:42:12,199 --> 00:42:16,599 comes out as another single number. 1298 00:42:14,760 --> 00:42:18,000 Just just just remember that. 1299 00:42:16,599 --> 00:42:19,480 And so, these are some of the most 1300 00:42:18,000 --> 00:42:21,519 common activation functions. In fact, 1301 00:42:19,480 --> 00:42:23,400 the sigmoid we saw, which is actually we 1302 00:42:21,519 --> 00:42:25,639 use for the output, is actually a kind 1303 00:42:23,400 --> 00:42:28,119 of activation function where a single 1304 00:42:25,639 --> 00:42:30,000 number comes in and it gets mapped into 1305 00:42:28,119 --> 00:42:31,799 this curve because of this thing. So, 1306 00:42:30,000 --> 00:42:33,920 the single number that comes in is A, 1307 00:42:31,800 --> 00:42:37,160 and it and it gets transformed as 1 / 1 1308 00:42:33,920 --> 00:42:38,880 + e ^ -A, and you get a shape like this, 1309 00:42:37,159 --> 00:42:40,679 and it's called the sigmoid activation 1310 00:42:38,880 --> 00:42:41,840 function. 
And And And as you can see 1311 00:42:40,679 --> 00:42:44,319 here, 1312 00:42:41,840 --> 00:42:45,920 for very small values, for very negative 1313 00:42:44,320 --> 00:42:47,840 values, 1314 00:42:45,920 --> 00:42:50,280 it's going to be pretty close to zero, 1315 00:42:47,840 --> 00:42:52,559 meaning it won't get activated. 1316 00:42:50,280 --> 00:42:53,680 And for very very large values, it's 1317 00:42:52,559 --> 00:42:55,360 going to be 1318 00:42:53,679 --> 00:42:57,759 pretty close to one. 1319 00:42:55,360 --> 00:42:59,079 All the action happens in the middle. 1320 00:42:57,760 --> 00:43:00,160 When your When your When your values are 1321 00:42:59,079 --> 00:43:03,119 somewhere in this range, there's a 1322 00:43:00,159 --> 00:43:05,079 dramatic increases in what comes out. 1323 00:43:03,119 --> 00:43:06,440 Okay? So, that little thing in the 1324 00:43:05,079 --> 00:43:07,799 middle is a sweet spot for these 1325 00:43:06,440 --> 00:43:08,639 functions. 1326 00:43:07,800 --> 00:43:10,000 Uh 1327 00:43:08,639 --> 00:43:11,440 and this 1328 00:43:10,000 --> 00:43:12,760 I you know, I'm also almost embarrassed 1329 00:43:11,440 --> 00:43:13,880 to call it an activation function 1330 00:43:12,760 --> 00:43:15,520 because it's literally not doing 1331 00:43:13,880 --> 00:43:16,880 anything. It's sort of getting a nice 1332 00:43:15,519 --> 00:43:18,639 label for free. 1333 00:43:16,880 --> 00:43:19,720 Um right? You basically it says you just 1334 00:43:18,639 --> 00:43:20,839 get a number, just pass it straight 1335 00:43:19,719 --> 00:43:22,359 along. 1336 00:43:20,840 --> 00:43:23,720 It's a linear activation function, but 1337 00:43:22,360 --> 00:43:25,599 just for completeness, I want to put it 1338 00:43:23,719 --> 00:43:28,319 here. 
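[Editor's note: the sigmoid's behavior at the extremes is easy to check numerically. A minimal sketch; the probe values -10, 0, and 10 are arbitrary.]

```python
import numpy as np

# The sigmoid activation: one number in, one number out, 1 / (1 + e^-a).
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Very negative inputs land near 0, very positive ones near 1;
# all the action is in the middle, around a = 0.
lo, mid, hi = sigmoid(-10.0), sigmoid(0.0), sigmoid(10.0)
```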
1339 00:43:25,599 --> 00:43:30,920 And then we come to the hero of deep 1340 00:43:28,320 --> 00:43:32,000 learning, which is the rectified linear 1341 00:43:30,920 --> 00:43:34,519 unit, 1342 00:43:32,000 --> 00:43:37,079 right? Rectified linear unit. It's 1343 00:43:34,519 --> 00:43:38,519 called ReLU. Uh and ReLU is going to 1344 00:43:37,079 --> 00:43:41,039 become part of your vocabulary very very 1345 00:43:38,519 --> 00:43:43,000 quickly. Uh and so, ReLU is actually a 1346 00:43:41,039 --> 00:43:44,920 very interesting function. So, you write 1347 00:43:43,000 --> 00:43:46,320 it as maximum of whatever number and 1348 00:43:44,920 --> 00:43:48,360 zero, 1349 00:43:46,320 --> 00:43:50,600 which is another way of saying if the 1350 00:43:48,360 --> 00:43:53,480 number is positive, just send it along 1351 00:43:50,599 --> 00:43:56,639 unchanged. If the number is negative, 1352 00:43:53,480 --> 00:43:57,639 send a zero instead. Squish it to zero. 1353 00:43:56,639 --> 00:43:59,799 So, which means if the number is 1354 00:43:57,639 --> 00:44:03,039 negative, nothing happens. If the number 1355 00:43:59,800 --> 00:44:03,039 is positive, it wakes up. 1356 00:44:03,239 --> 00:44:07,159 So, what happens is that you could have 1357 00:44:04,920 --> 00:44:09,320 a very complicated linear function with 1358 00:44:07,159 --> 00:44:10,519 millions of variables, and then it puts 1359 00:44:09,320 --> 00:44:12,000 a single number, and that number 1360 00:44:10,519 --> 00:44:13,239 unfortunately happens to be negative. 1361 00:44:12,000 --> 00:44:15,199 The ReLU is not impressed. It's going to 1362 00:44:13,239 --> 00:44:17,519 send a zero out. 1363 00:44:15,199 --> 00:44:20,279 Okay? It's a very simple function. 
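[Editor's note: ReLU is short enough to state in one line of code. The sample inputs below are arbitrary.]

```python
import numpy as np

# ReLU: max(a, 0). A negative number is squashed to zero
# ("not impressed"); a positive number passes through unchanged.
def relu(a):
    return np.maximum(a, 0.0)

out = relu(np.array([-2.0, -0.5, 0.0, 0.5, 2.0]))
```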
1364 00:44:17,519 --> 00:44:22,559 And many many folks who've been in deep 1365 00:44:20,280 --> 00:44:23,480 learning for a long long time believe 1366 00:44:22,559 --> 00:44:25,519 that 1367 00:44:23,480 --> 00:44:26,760 the use of the ReLUs is one of the key 1368 00:44:25,519 --> 00:44:28,840 factors 1369 00:44:26,760 --> 00:44:30,440 that led to the amazing success of deep 1370 00:44:28,840 --> 00:44:32,160 learning because it's got some very 1371 00:44:30,440 --> 00:44:33,880 interesting properties, 1372 00:44:32,159 --> 00:44:35,759 uh which we'll get to hopefully on 1373 00:44:33,880 --> 00:44:40,039 Wednesday. 1374 00:44:35,760 --> 00:44:42,000 Okay. So, the shorthand here is that um 1375 00:44:40,039 --> 00:44:43,639 whenever you see this thing, it's just a 1376 00:44:42,000 --> 00:44:44,679 linear activation, linear function 1377 00:44:43,639 --> 00:44:47,319 followed by just sending it straight 1378 00:44:44,679 --> 00:44:49,119 out. If I If you do this this If I put a 1379 00:44:47,320 --> 00:44:51,519 ReLU in here, I'm going to denote it 1380 00:44:49,119 --> 00:44:53,239 like that, which mimics the graph 1381 00:44:51,519 --> 00:44:54,719 uh how it looks. And if I'm going If I 1382 00:44:53,239 --> 00:44:55,839 put a sigmoid, I'm just going to use 1383 00:44:54,719 --> 00:44:56,839 this thing here. 1384 00:44:55,840 --> 00:44:59,941 Okay? 1385 00:44:56,840 --> 00:45:00,240 Just a visual shorthand. 1386 00:44:59,940 --> 00:45:02,358 >> [clears throat] 1387 00:45:00,239 --> 00:45:03,839 >> There are many other functions 1388 00:45:02,358 --> 00:45:05,079 activation functions, by the way. 1389 00:45:03,840 --> 00:45:07,840 There's something called the tan h 1390 00:45:05,079 --> 00:45:10,960 function, the leaky ReLU, the GELU, the 1391 00:45:07,840 --> 00:45:12,640 Swish. 
I mean, it's like a menagerie of 1392 00:45:10,960 --> 00:45:14,280 activation functions because very often 1393 00:45:12,639 --> 00:45:15,799 researchers will be like, "Well, I don't 1394 00:45:14,280 --> 00:45:17,040 like this activation function. Here's a 1395 00:45:15,800 --> 00:45:18,080 little modified version of the function 1396 00:45:17,039 --> 00:45:20,400 which is going to be better for certain 1397 00:45:18,079 --> 00:45:22,480 things." So, you know, people's research 1398 00:45:20,400 --> 00:45:24,400 creativity is sort of on this point has 1399 00:45:22,480 --> 00:45:26,519 gone unhinged. Um so, there's lots of 1400 00:45:24,400 --> 00:45:27,760 options. But if you just stick to the 1401 00:45:26,519 --> 00:45:29,519 ReLU 1402 00:45:27,760 --> 00:45:31,720 for your hidden layers, you can 1403 00:45:29,519 --> 00:45:32,519 basically get anything done practically, 1404 00:45:31,719 --> 00:45:34,039 right? You don't have to worry about 1405 00:45:32,519 --> 00:45:37,280 anything else. So, we'll only focus on 1406 00:45:34,039 --> 00:45:38,559 ReLUs for all the intermediate stuff. Uh 1407 00:45:37,280 --> 00:45:40,400 yeah. 1408 00:45:38,559 --> 00:45:41,840 Yeah, how do you gauge which activation 1409 00:45:40,400 --> 00:45:42,720 function is more suited for your use 1410 00:45:41,840 --> 00:45:45,280 case? 1411 00:45:42,719 --> 00:45:48,000 Yeah. So, the rule of thumb here is that 1412 00:45:45,280 --> 00:45:49,680 for your hidden layers, use ReLUs, 1413 00:45:48,000 --> 00:45:51,880 right? Because empirically we have seen 1414 00:45:49,679 --> 00:45:54,199 that they they do an amazing job. 1415 00:45:51,880 --> 00:45:56,320 For your output layer, your very final 1416 00:45:54,199 --> 00:45:57,960 thing, you actually don't have a choice 1417 00:45:56,320 --> 00:45:59,640 because what you have to use depends on 1418 00:45:57,960 --> 00:46:01,199 what kind of output you have to work 1419 00:45:59,639 --> 00:46:02,679 with. 
If it's an output which is a 1420 00:46:01,199 --> 00:46:04,480 probability number between zero and one, 1421 00:46:02,679 --> 00:46:05,839 you have to use a sigmoid. 1422 00:46:04,480 --> 00:46:07,559 Um if it is 1423 00:46:05,840 --> 00:46:08,960 say 10 numbers, all of which have to be 1424 00:46:07,559 --> 00:46:10,119 probabilities, and they have to add up 1425 00:46:08,960 --> 00:46:10,880 to one, 1426 00:46:10,119 --> 00:46:12,199 you got to use something called the 1427 00:46:10,880 --> 00:46:13,960 softmax, which we'll get to on 1428 00:46:12,199 --> 00:46:15,679 Wednesday. So, it really depends on the 1429 00:46:13,960 --> 00:46:16,760 output, and the nature of the output 1430 00:46:15,679 --> 00:46:18,599 dictates what you use in the output 1431 00:46:16,760 --> 00:46:19,920 layer. 1432 00:46:18,599 --> 00:46:22,000 Okay. 1433 00:46:19,920 --> 00:46:24,880 So, coming back to this. So, if you want 1434 00:46:22,000 --> 00:46:27,280 to design a deep neural network, 1435 00:46:24,880 --> 00:46:29,599 uh the input is the input. 1436 00:46:27,280 --> 00:46:30,960 The output is the output. And so, you 1437 00:46:29,599 --> 00:46:32,880 get to choose everything else. You get 1438 00:46:30,960 --> 00:46:35,320 to choose the number of hidden layers, 1439 00:46:32,880 --> 00:46:37,559 the number of neurons in each layer, the 1440 00:46:35,320 --> 00:46:39,600 activation functions you're going to use 1441 00:46:37,559 --> 00:46:41,119 and uh for the hidden layers, and then 1442 00:46:39,599 --> 00:46:42,759 you have to make sure that the what you 1443 00:46:41,119 --> 00:46:44,279 choose for the output layer matches the 1444 00:46:42,760 --> 00:46:46,840 kind of output you want to generate. 1445 00:46:44,280 --> 00:46:48,680 Okay? So, this is this sort of This is 1446 00:46:46,840 --> 00:46:51,120 all in your hands. You decide what 1447 00:46:48,679 --> 00:46:52,799 happens. 
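[Editor's note: the rule of thumb from the answer above — sigmoid when the output is one probability, softmax when several class probabilities must sum to one — can be sketched as follows. Softmax is only previewed here (the lecture covers it on Wednesday), and the input scores are made-up values.]

```python
import numpy as np

def sigmoid(a):
    # one probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-a))

def softmax(z):
    # several probabilities that add up to 1
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

p_binary = sigmoid(1.3)                          # e.g. a dog-vs-cat output
p_classes = softmax(np.array([2.0, 1.0, 0.1]))   # e.g. a 3-class output
```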
But 1448 00:46:51,119 --> 00:46:53,719 you will there there's a lot of guidance 1449 00:46:52,800 --> 00:46:56,080 for how to do these things, which we'll 1450 00:46:53,719 --> 00:46:57,679 which we'll cover as we go along. 1451 00:46:56,079 --> 00:47:00,519 Did you have a question? 1452 00:46:57,679 --> 00:47:03,279 Kind of, but I guess I'll do it. 1453 00:47:00,519 --> 00:47:05,400 Is Is there also exploration in kind of 1454 00:47:03,280 --> 00:47:07,920 dynamic uh 1455 00:47:05,400 --> 00:47:11,400 setting up layers so that your users 1456 00:47:07,920 --> 00:47:11,400 determine the number of layers 1457 00:47:12,599 --> 00:47:16,719 Yeah. So, there's a whole field called 1458 00:47:14,320 --> 00:47:18,680 neural architecture search, NAS, 1459 00:47:16,719 --> 00:47:20,480 where we can actually try a whole bunch 1460 00:47:18,679 --> 00:47:22,319 of different architectures, 1461 00:47:20,480 --> 00:47:23,800 uh and then use some optimization and in 1462 00:47:22,320 --> 00:47:25,640 fact reinforcement learning, which we 1463 00:47:23,800 --> 00:47:27,160 won't get to in this class, 1464 00:47:25,639 --> 00:47:28,440 as a way to figure out really good 1465 00:47:27,159 --> 00:47:32,199 architectures for any particular 1466 00:47:28,440 --> 00:47:33,760 problem. Uh but the 1467 00:47:32,199 --> 00:47:34,799 the question of okay, 1468 00:47:33,760 --> 00:47:36,480 when I'm training a model with a 1469 00:47:34,800 --> 00:47:37,840 particular kind of data, 1470 00:47:36,480 --> 00:47:39,039 the first pass through the training 1471 00:47:37,840 --> 00:47:40,240 data, I'm going to use two layers. The 1472 00:47:39,039 --> 00:47:42,440 second pass, I'm going to do seven 1473 00:47:40,239 --> 00:47:44,039 layers. That is not done. 
And the reason it's not done is because of certain other constraints we have in how we can do the optimization and the gradient descent and things like that. But what you can do, and we'll look at this thing called dropout: for certain layers, each time you run data through the network, you can decide that in this layer I'm not going to use all the nodes; I'm going to drop out a few of the nodes randomly. And it's a very effective technique to prevent overfitting, and we'll come to that a little later on. Yeah?
>> So, one question regarding neural networks is about the coefficients. Is this something we decide, or do we have to use a predefined coefficient for the weights?
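The dropout idea just mentioned fits in a few lines. This is an illustrative NumPy sketch of the usual "inverted dropout" formulation, not anything specific to this course's implementation:

```python
import numpy as np

def dropout(activations, rate=0.5, training=True, rng=np.random.default_rng(0)):
    # During training, randomly zero a fraction `rate` of the nodes' outputs
    # and rescale the survivors so their expected sum is unchanged
    # ("inverted dropout"). At prediction time, pass everything through.
    if not training:
        return activations
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

a = np.ones(10)
print(dropout(a, rate=0.5))        # some entries zeroed, survivors scaled to 2.0
print(dropout(a, training=False))  # unchanged at prediction time
```

The rescaling is why dropout needs no special handling at prediction time: the surviving activations are boosted during training so the next layer sees the same expected input either way.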
No, the whole trick here, the whole name of the game, is that we use the data, the training data, and something called a loss function, which I'll get to on Wednesday, along with an optimization algorithm, so that the network figures out by itself what the weights need to be, what the coefficients need to be, so as to minimize prediction error. And that's the whole thing. The magic here is that we don't have to do anything. We only have to set it up, sit back, often for many hours, and watch it do its thing. Yeah?
>> Just one quick question. You mentioned nodes just now when you were answering Roland's question. Can you confirm exactly what a node is? I have an idea that it's basically any circle, but...
>> Yeah, yeah, you just asked for a lot more detail. Sure. When I'm referring to a node, I'm literally referring to something like this. Think of it as a linear function followed by a non-linear activation.
So, it reads a bunch of inputs, runs them through a linear function, and passes the result through, say, a ReLU or a sigmoid or something, and out pops a number. So, in general, a node will have many numbers potentially coming in, but only one number going out. Now, that one number may get copied to every node in the next layer, but what comes out of that particular node is just a single number. All right. So, let's use a DNN for our interview example. In this problem we had two inputs, right? GPA and experience. The output variable has to be between zero and one, because you're trying to predict the probability that someone will get called for an interview. So, the input size is fixed and the output is fixed. And since it's really the very first network we're actually playing with, let's just start simple, right? We'll just have one hidden layer, and we'll have three neurons, right?
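A node as just described, many numbers in, one number out, can be sketched like this; the weights and bias below are made-up illustrative values, not anything from the slides:

```python
import numpy as np

def relu(z):
    return max(z, 0.0)

def node(inputs, weights, bias):
    # Linear function of the inputs, then a non-linear activation (ReLU here).
    # Many numbers come in; a single number comes out.
    z = np.dot(inputs, weights) + bias
    return relu(z)

x = np.array([3.6, 2.0])                     # e.g. GPA and years of experience
print(node(x, np.array([0.4, -0.1]), 0.2))   # one number out
```

Whatever the number of inputs, the output of one node is always a single scalar, which is exactly what then gets copied to every node in the next layer.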
And as I mentioned in response to Tommaso's question from before, if you are choosing activation functions for the hidden layers, just go with the ReLU as a default. It usually works really well out of the box. So, we'll just use a ReLU, and since the output has to be between zero and one, we don't have a choice. We have to use a sigmoid for the output layer. Okay? That's it. Those are the design choices, and when we make them, this is what it looks like, right? We have two inputs, X1 and X2, GPA and experience, and they go through these three ReLUs, and out come three numbers, and they pass through a sigmoid, and we get a probability Y at the end. All right, quick question. Concept check. How many parameters, both weights and biases, does this network have? Let's take a moment to count.
All right, any guesses?
>> 12.
I think you're almost there.
Are folks going to be doing a binary search on this now? Okay. Uh, no. Yes?
>> 13.
Yes, very good. So, that's 13, and my guess is that the reason you came up with 12, and I made the same mistake, that's why I know, is that you probably forgot this green thing here. What folks often forget is the bias, right? We all count the weights, right? And the easy way to do it is: two inputs here, three neurons here, so two times three is six; three times one is three; that's nine; and then you have to add up all the intercepts. Right? So, you get 13. And so, when we get to very complicated networks, the first two or three times you work with a very complex network, and we'll do it starting very soon, just get into the habit of hand-calculating the number of parameters, just to make sure you understand what's going on.
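The hand count can also be checked mechanically. This sketch applies the counting rule just described, weights per layer plus one bias per neuron, to the 2-input, 3-hidden-neuron, 1-output network from the example:

```python
def count_parameters(layer_sizes):
    # Each fully connected layer contributes (inputs x neurons) weights
    # plus one bias (intercept) per neuron.
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total

# 2 inputs -> 3 hidden neurons -> 1 output:
# weights: 2*3 + 3*1 = 9, biases: 3 + 1 = 4
print(count_parameters([2, 3, 1]))  # 13
```

The same function works for any stack of fully connected layers, which is handy for the habit of sanity-checking parameter counts on bigger networks.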
Once you get it right a couple of times, you don't have to do it anymore. Okay? The first couple of times, hand-calculate to make sure you get it. Okay. So, let's say that we have trained this network using techniques which we'll cover on Wednesday, and it comes back to you after training and says, "Okay, these are the optimal, the best, values for the weights and the biases that I have found." So, now your network is ready for action. It's ready to be used. And so, let's say that you want to predict with this network: if you have X1 and X2, what comes out of this top neuron, right? Let's call it A1. It's basically this. Okay? That's what's coming out of this thing. For any X1 and X2, this is what's coming out. Similarly for A2 and A3. Okay?
And then what comes out at the very end is basically A1 times that, plus A2 times that, plus A3 times that, plus 0.05, and the whole thing gets run through the sigmoid, and this is what you get. Okay? So, this slide and the one before: just make sure you look at them afterwards, to make sure you totally understand the mechanics, because this is really important. If you don't fully internalize the mechanics, when we get to things like transformers, it's going to get hard. Okay? So, just make sure it's automatic at this point. It should be reflexive. Okay. And so, when you want to predict anything, you just run some numbers through it, you get all these things, and boom, you calculate it. It turns out to be 22.6%. That's the answer. All right. So, let's say that you built this network, and now we're like, "Hey, given any X1 and X2, I can come up with a Y."
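The full forward pass just walked through, three ReLU nodes and then a sigmoid, can be sketched end to end. All of the weight and bias values below are made up for illustration, since the trained values live on the slide:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical trained parameters: 2 inputs -> 3 hidden ReLUs -> 1 sigmoid output.
# Note the count: 6 + 3 + 3 + 1 = 13 parameters.
W1 = np.array([[0.4, -0.3,  0.1],
               [0.2,  0.5, -0.2]])   # 2x3 weights into the hidden layer
b1 = np.array([0.1, -0.1, 0.05])     # 3 hidden biases
W2 = np.array([0.7, -0.6, 0.3])      # 3 weights into the output
b2 = 0.05                            # output bias

def predict(x1, x2):
    a = relu(np.array([x1, x2]) @ W1 + b1)   # A1, A2, A3 from the hidden layer
    return sigmoid(a @ W2 + b2)              # probability of an interview call

print(predict(3.6, 2.0))   # some probability between 0 and 1
```

Running any X1 and X2 through `predict` reproduces exactly the mechanics described: linear step, ReLU, linear step, sigmoid, one probability out.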
But I'm feeling a little mathy. Can we actually write down the function? Yeah, you can write down the function. This is what it looks like. Super interpretable, right? So, this goes to the comment that you made earlier on, Itai, where the act of depicting something using this sort of graphical layout makes it so much easier to reason with and to think about, compared to trying to figure out what this function is doing. Right? The other point I want to make is this: just contrast what we just saw with the logistic regression model we saw earlier, which was this little function. And here, even this simple network, with just three nodes in that single hidden layer, right? It's so much more complicated than the logistic regression model. So much more complicated, right? And it is from this complexity that springs the ability of these networks to do basically magical things. Right? That's where the complexity comes from.
That's where the magic comes from. And here, in this case, the number of variables hasn't even changed. It's still only two. But we can go from the two inputs to the one output in very complicated ways, as long as we know how to train these networks the right way. That's sort of the secret sauce, which we'll spend a lot of time on. So, yeah. To summarize, this is what we have. It's a deep neural network. By the way, this kind of network, where things just flow from left to right, is called a feedforward neural network, in contrast to some other kinds of networks called recurrent networks, which you won't get to in this class, because transformers have actually proven to be much more capable than recurrent networks and have become the norm, so we'll just focus on those instead. And so, this arrangement of neurons into layers and activation functions and all that stuff is called the architecture of the neural network.
And as you will see later on, the transformer, the famous transformer network, is just an example of a particular neural network architecture, much like the convolutional neural networks we'll get to next week for computer vision are another example of a particular architecture. So, we will focus on transformers. They are a particular kind of architecture. All right. So, in summary, this is what we have. You get to choose the hidden layers, the neurons, the activation functions, stuff like that. The inputs and outputs are what you have to work with. And so, we will actually take this idea and use it to solve a problem from start to finish on Wednesday. So, I think I'm done. I give you three minutes back of your day. Thank you.
>> [applause]