Okay, so let's get going. Today we're going to talk about how you actually train a neural network, because that is the heart of the game here. Just to recap: last class we looked at what it takes to design a neural network, and we made a very important distinction between the things you are handed by your problem and the things you have agency over, that you control. We noticed that the input layer is dictated by your problem: the input is the input, the output is the output, and you have to produce the output that's expected. But everything that happens in the middle is in your hands. In particular, we have to decide how many hidden layers we want, we have to decide how many neurons each layer gets, and we have to decide what activation to use. Although I'm cheating a little when I say that, because I told you very clearly on Monday that for the hidden-layer activation you should just go with the ReLU activation function; you don't have to think deep thoughts about it. But the other things are all choices you have to make, and we'll talk a bit later about how you actually make those choices.

Now, the rule of thumb is always to start with the simplest network you can think of. If it gets the job done, stop working on it. If it's not good enough, make it slightly more complicated. That's the meta-rule to remember whenever you're designing these things.

Okay, so that's what it takes to design a deep neural network. What we'll do in this class is take a real example with real data and think through how we would design a network to solve that problem.
And while doing so, we'll cover a whole bunch of conceptual foundations such as optimization, loss functions, gradient descent, and all that good stuff.

All right. The case study, or scenario, is a dataset of patients made available by the Cleveland Clinic. Essentially, we have a bunch of patients, and the setting is that they came into the Cleveland Clinic, but not for a heart problem — they came in for something else; maybe just a physical. And a whole bunch of things were measured about them. The kinds of things measured are demographic information, like age and gender, whether they had any chest pain at all when they came in, blood pressure, cholesterol, blood sugar, and so on. You get the idea: demographic information plus a bunch of biomarker information. And then what the Cleveland Clinic did was track these people and figure out whether, in the next year, they were diagnosed with heart disease or not. Which means that maybe you can build a model so that when someone comes in — even though they didn't come in for a chest problem — you can predict whether something is going to happen to them in the next year. It's a nice, classic machine learning setup.

So that's the task. We could absolutely solve this problem using decision trees — sorry, random forests — and gradient boosting and all that good stuff you folks have already learned in machine learning. But we'll try to solve it using neural networks. This is an example, of course, of what's called structured data, because it's all data sitting in the columns of a spreadsheet.
Working with structured data is how we warm up our knowledge of neural networks. Then, starting next week, we'll work with unstructured data: first images, and later on text and so forth. Any questions on this?

Student: Just to connect this to last class, where we took the same example and first did logistic regression and then a neural network — the predicted probability was 0.85 in one case and 0.22 in the other. Here as well, how do you know when to use what? In textbooks you know when to use logistic regression versus something else, but in this case, when do I complicate things to a neural network versus maybe just using a random forest?

It's a great question: when do you use what? I think there are two broad dimensions to consider. One is how important it is that you can explain or interpret what's going on inside the model, perhaps to a non-technical consumer. The other is how important sheer predictive accuracy is. In some situations predictive accuracy trumps everything else, in which case just go with whatever predicts best. In other cases explainability becomes a big deal, because if people can't understand the model, they won't use it. In those cases it's probably better to go with simpler models: decision trees, maybe even random forests, certainly logistic regression — those are all a little more amenable to interpretation. That said, even for complex black-box methods like neural networks, there is a whole field called mechanistic interpretability that seeks to get insight into what's going on inside these big black boxes. So the story isn't over.
But that's the first cut: you analyze the problem along those dimensions.

Okay, so let's get going and design the network. We have to choose the number of hidden layers and the number of neurons in each layer, and then pick the right output layer. Now, the simplest thing you can do, of course, is to have no hidden layer at all. And if you have no hidden layers, what is that model called? Yes: logistic regression. But of course we want a neural network, so I'm going to have one hidden layer, because that's the simplest thing I can do. And I'll confess, I tried a few different numbers of neurons in that layer, and with 16 neurons it actually did pretty well. So there was some trial and error before I landed on 16. For some reason people always use powers of two, so we may as well do that; I tried 4, 8, 16, and 16 was really good. As it turns out, when I went above 16 it started to do badly, and it did badly because of something called overfitting, which we'll talk about later. So: 16, and by default I use ReLUs — 16 ReLU neurons. The output here is categorical — heart disease, yes or no, one or zero, a classification problem — which means we want to emit a probability at the very end. Therefore we'll use a sigmoid.

So far, so good? Any questions? All right, let's lay this network out visually. We have an input layer, and as you'll see here, it's X1 through X29.
And you may be wondering where 29 comes from, because there don't seem to be 29 independent variables here. It turns out there are only 13 input variables, but some of them are categorical. So what I ended up doing is taking each categorical variable and one-hot encoding it, and when you do that you get 29 inputs. When we actually do the Colab later on, I'll show you exactly how I one-hot encoded it, but that's what's happening here; that's why there are 29 inputs, not 13. Then, as decided, we have the hidden layer with 16 units and nice ReLUs, and then an output layer with a little sigmoid. And I got bored of trying to draw all the arrows, so I just gave up and said: assume there are arrows between all of these things. Good?

Student: Sorry, I think you already mentioned this, but why 16 units?

I tried a bunch of different numbers of units, and at 16 the resulting model did well, so I went with that.

Student: And the logic of why a ReLU?

Oh, why a ReLU? There's just a mountain of empirical evidence suggesting that ReLU is a really good default option for activations in hidden layers. There's also a really nice set of theoretical results, and I'll allude to some of them when we talk about gradient descent.

Student: Quick question — in the input layer, how did you get to 29 again when you had 13 variables?

Some of those 13 variables are categorical, like cholesterol low, medium, high. So I took them and one-hot encoded them.
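A minimal sketch of that one-hot encoding step, assuming the data sits in a pandas DataFrame; the column names and values here are made up for illustration, not the actual Cleveland columns (the Colab will show the real thing):

```python
import pandas as pd

# Hypothetical stand-ins for two of the categorical columns in the data.
df = pd.DataFrame({
    "age": [63, 51, 45],
    "chest_pain": ["typical", "atypical", "none"],
    "cholesterol_level": ["high", "medium", "low"],
})

# get_dummies replaces each categorical column with one 0/1 column per level,
# which is how 13 raw variables can expand into 29 model inputs.
X = pd.get_dummies(df, columns=["chest_pain", "cholesterol_level"])
print(X.shape)  # (3, 7): age + 3 chest_pain levels + 3 cholesterol levels
```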
So if a variable had, say, five levels, it becomes five columns.

And by the way, folks, please use a microphone, just like that, so the people on the live stream can hear your question. Yeah, go ahead.

Student: Sorry, just one question. Since you didn't draw the arrows, are we assuming every X is connected to all the units?

Correct.

Student: And is that also a parameter we have to decide, or...?

That ends up being the default. We'll see deviations from that assumption when we get to image processing and language processing and so on, but when you're working with structured data like we are now, fully connected is the default.

Okay, let's keep going. This is what we have. Remember what I told you last class: whenever you're working with these networks, get into the habit of very quickly calculating the number of parameters. Just do it the first few times, so that you know cold exactly what's going on. So, how many parameters do we have here — how many weights and biases? You can work through it; you don't have to tell me the final number, you can say x times y plus z, something like that.

Student: 65 — 48 weights and 17 biases.

Okay, and how did he come up with that? For the weights, he counted 2 × 16 for the first layer and 1 × 16 for the second connection, and then the biases are the 16 hidden ones plus the output. Any other views on this?

Student: I think it's 29 times 16, and then 16 times 1, plus 16 biases and one bias.

Right. The way it works is: we have 29 things here and 16 in the middle, so 29 × 16 arrows.
And then for each of those 16 hidden units there's a bias coming in, so that's another 16. Plus you have 16 × 1 weights into the output, plus one bias for the output unit. So the total is 29 × 16 + 16 + 16 × 1 + 1 = 497.

You can see something very interesting going on here: when you go from one layer to the next, the number of weights is roughly on the order of a × b, where a and b are the numbers of units in the two layers. That's a dramatic explosion in the number of parameters, and it's something we'll have to watch later on to prevent overfitting. That explosion comes from the fact that each layer is fully connected to the next. We'll revisit this later.
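A quick sanity check of that count — a tiny sketch; the same number should show up again when we ask Keras to summarize the model later:

```python
n_inputs, n_hidden, n_outputs = 29, 16, 1

# weights into hidden layer + hidden biases + weights into output + output bias
n_params = n_inputs * n_hidden + n_hidden + n_hidden * n_outputs + n_outputs
print(n_params)  # 497
```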
Okay. What I'm going to do now is translate this network — the one we've laid out graphically — into Keras code, to demonstrate how easy it is. I'll give a fuller introduction to Keras and TensorFlow later on, but for now just suspend your disbelief; we'll write it in Keras as if we already know Keras. Later we'll get into all the gory details and train it in Colab and so forth.

The way we typically do this is that once we have a network like this, we start from the left and define each layer in Keras one after the other — we flow left to right. So let's take the input layer. Defining an input layer in Keras is really easy: you literally say keras.Input, and you tell Keras how many nodes are coming in. In this case that happens to be 29, so you tell it the shape: shape equals 29. And the reason we say "shape" rather than "length" is that, as you'll see later, we don't have to send in just vectors; we can send complicated objects into Keras — matrices, 3D cubes, 4D tensors, and so on. So it expects a shape: what is the shape of the thing you're going to send me? In this particular case it's a nice flat vector, so the shape is 29. That's it — we write this down, and it creates the input layer. We also give it a name, and the name means that whatever comes out of this layer is called input.

Good. Next, we go to the following layer, and we'll unpack this. The way you typically define a hidden layer is keras.layers.Dense with its arguments. First, it says I want a dense layer — by dense I mean a layer that fully connects to the prior and the following layers; "fully connected" is what dense means. Second, I want 16 nodes in this layer. Finally, I want to use a ReLU activation. See how compact and parsimonious that is? That's the appeal of Keras: it's very easy to get going. So the moment you write that, you've defined the layer. But what you have not done is tell this layer what input it's going to get, because as far as this layer is concerned, it doesn't know the previous layer exists. So you need to connect them. Yes?

Student: Do we need to define for the ReLU where the bend is — like where you take the max?
No — for the ReLU, the bend is always at zero.

Student: Okay, thank you.

All right, so that's what we have here. Next, we have to tell this layer that we want to feed it the output of the previous layer. You do that by taking whatever comes out of the input layer — which we named input — and sticking it in here. The moment you do that, boom, it receives the input from the previous layer. And because this layer's output needs to go on to the final layer, you give a name to that output too. I'm just calling it h, because it comes out of the hidden layer. It's just a variable; you can call it anything you want.

Now we go to the final output layer, and it's just another dense layer — that's why I use the word dense again — but we say: give me just one unit, because I literally need only one unit here to emit one probability, and the activation I want is a sigmoid. Done. Once you've done that, you have to feed it the output of the second layer, so you stick the h in here; now you've connected the third and second layers. After that, you give a name to what comes out of it — we'll just call it output; you can call it y or whatever you want.

So at this point, we have mapped that picture into those three lines. That's it. But we aren't quite done yet; there's one little thing left to do.
What we have to do is formally define a model, so that Keras can work with the model object — train it, evaluate it, use it for prediction, and so on. So we tell Keras: create a model for me, keras.Model, where the input is this thing here and the output is that thing there, and we'll just call the whole thing model. That's it — we're done. That is the whole model. It sounds really fancy, right? A neural model for heart disease prediction. Pretty cool. Four lines. We'll show how to train this model with real data and use it for prediction after we switch gears and get into some conceptual building blocks.
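Putting those four lines together — a minimal sketch of the network as described, assuming TensorFlow's bundled Keras; the variable names just follow the ones used on the slide:

```python
from tensorflow import keras

# Input layer: 29 features (13 raw variables after one-hot encoding)
inputs = keras.Input(shape=(29,), name="input")

# Hidden layer: 16 fully connected (dense) ReLU units, fed the input
h = keras.layers.Dense(16, activation="relu")(inputs)

# Output layer: one sigmoid unit emitting a single probability, fed the hidden output
output = keras.layers.Dense(1, activation="sigmoid")(h)

# Wrap input and output into a model object Keras can train, evaluate, and predict with
model = keras.Model(inputs=inputs, outputs=output)

model.summary()  # should report 497 trainable parameters, matching the hand count
```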
There was a question?

Student: Can you define a custom activation function that is not in the Keras library?

The question was whether you can define a custom activation function — you totally can. In fact, the kind of flexibility you have here is incredible, and these innocent four lines unfortunately hide the potential of what's possible. But I guarantee you that in two to three weeks you folks will be thinking in building blocks, like Legos. I'm so happy when it happens: students come to my office hours and say, "I want to create a network with a little network going along the top, another along the bottom, they meet in the middle, then they fork and split again." Unbelievable. It's fantastic. And you're going to be doing this in two weeks, I guarantee you.

Student: In the case of a multi-class classification problem, are the output nodes equal to the number of classes?

Correct. This is binary classification; for multi-class classification — say you're classifying an input into one of 10 possibilities — we would have 10 outputs. But the way we define that uses something called the softmax function, which we'll cover on Monday. For now, we'll stay with binary classification.

Student: Is there a default activation in Keras, or do you have to specify something?

Ah, good question. I believe the default might be ReLU for hidden layers, but I'm not 100% sure — let's double-check that.

Student: Just to get a clearer understanding: when you said that performance worsened beyond 16 neurons — is that where you were playing around, starting with two, then maybe four, six, and eight?

Exactly, right. Could you use the mic?

Student: Do we need to define each hidden layer individually when the model gets more complex — when we have more than one layer?

Oh, like if you have 25 layers? Good question. If you have, say, 100 layers, do you actually have to type each one in by hand and copy-paste? No. You can write a little loop that creates them for you automatically. Basically, this little output variable you see here could just as well be the result of a thousand-layer network, with all sorts of complicated transformations going on, before it finally pops out as one little thing called the output.
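A sketch of that idea — stacking many hidden layers with a loop instead of writing each line by hand; the layer count and width here are arbitrary choices for illustration:

```python
from tensorflow import keras

inputs = keras.Input(shape=(29,))

# Build a deep stack programmatically: each iteration adds one dense ReLU layer,
# always feeding it whatever came out of the previous layer.
x = inputs
for _ in range(10):          # 10 hidden layers, chosen arbitrarily for this example
    x = keras.layers.Dense(16, activation="relu")(x)

output = keras.layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs=inputs, outputs=output)
```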
And what Keras will do is say: okay, this model has this input and this output — and sure, this output came from incredible transformations applied to the input — and Keras will process all of that for you very easily. You don't have to worry about it. It's really a beautiful example of the power of abstraction, and you'll see that as we go along.

Okay. So now let's switch gears: once you've written a model like that in Keras, how do you actually train it? Training is something you've already done a lot. For example, with linear regression, you have all these coefficients to estimate: you have the model, you have a bunch of data, you run it through something like lm if you use R, and what it gives you is actual values for those coefficients — 2.8, 0.9, and so on. So the role of the data is to give you the coefficients; or you can think of the coefficients as a compressed version of the data. Similarly, with logistic regression, you have a model like that, you add some data, you run it through an estimation routine like glm, or scikit-learn, or statsmodels — pick your favorite tool — and out come fitted values like those. So training simply means: find the values of the coefficients such that the model's predictions are as close to the actual values as possible. That's it. And to find the values that get you as close as possible, a whole bunch of optimization is involved.
You didn't have to worry about that optimization when you did linear or logistic regression, because it's all done under the hood for you. But for neural networks we actually get to see how it's done, because it's important.

Training a neural network — a deep neural network, even GPT-4 — is basically the same process as for regression. You just have a very complicated function with lots of parameters: a network full of question marks, you add some data, you do some training, and boom, you get numbers.

Student: You may be getting to this, but are we determining the architecture of the network before we train it?

Yes, because if you don't define the architecture, Keras doesn't know how to compute an output from an input, and unless it can relate inputs to outputs, it can't do anything more.

Okay. So the essence of training is to find the best values for the weights and biases. And the way we define "best" is that we set up a little function that measures the discrepancy between the actual and the predicted values. I use the word discrepancy because there is an incredible amount of creativity in the field around how you define it. In fact, a lot of breakthroughs in deep learning happen precisely because someone defines a very clever measure of discrepancy, and it turns out to produce all sorts of interesting behavior. That's why I say discrepancy rather than error: when I say error, you might think only of something like predicted minus actual, and that's too limiting.
So we define a function that captures the discrepancy between the actual and the predicted values, and these functions are called loss functions in the deep learning world. In every paper you read, you'll find interesting loss functions; there are hundreds of them, and enormous research creativity goes into defining them.

So a loss function is a function that quantifies a discrepancy. Say the predictions are really close to the actual values — what would the loss be? Close to zero; very small. And if you had a perfect model, a perfect crystal ball, what would the loss be? Exactly zero.

In linear regression, the loss function we use is called the sum of squared errors. We didn't call it a loss function because we weren't doing deep learning, just linear regression, but that's basically a loss function. Now, the loss function must be matched properly to the kind of output we have. If your output is a number like 23 — say you're predicting demand for a particular product next week, the predicted value is 23 and the actual value is 21 — it's fine to use 23 minus 21, i.e. 2, as the discrepancy, the error. But for other kinds of outputs it's not so obvious what the correct loss function, the correct measure of discrepancy, is.

So here, for the simple case of regression: the superscript i in y^(i) stands for the i-th data point. What this says is that for the i-th data point, y^(i) is the actual value and model(x^(i)) is what the model predicted. I take the difference, square it, and once I've squared it for each point, I average all those numbers to get an average squared error — mean squared error, MSE. This is about the easiest loss function there is.
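A minimal sketch of that computation in NumPy, using the demand example above (the numbers besides 23 and 21 are made up):

```python
import numpy as np

y_actual = np.array([21.0, 40.0, 18.0])   # actual demand for three products (made up)
y_pred   = np.array([23.0, 37.0, 18.5])   # the model's predictions (made up)

# Mean squared error: average of the squared per-point discrepancies
mse = np.mean((y_actual - y_pred) ** 2)
print(mse)  # (4 + 9 + 0.25) / 3 ≈ 4.42
```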
Now let's crank it up a notch. In the heart disease example — the neural prediction model — the prediction is a number between zero and one, because it comes out of the sigmoid; it's a fraction. The actual output is a zero or a one, one of the two; it's binary. So how would we measure the discrepancy between a fraction and the numbers zero and one? What is a good loss function in this situation? That's the key question. So let's build some intuition around it.

And let's see if my little daisy-chained iPad setup works — I'm doing this on the iPad so that people on the live stream can see it; otherwise the blackboard is a little tough for them.

Okay, here's the situation. Say a patient comes in and they do have heart disease, so for that patient Y equals one; the true value is one. And now you have this model, and along this axis is the predicted probability from the model. Can people see my handwriting okay? Good — I could never be a doctor, right? So: the axis runs from zero to one, because it's a probability, and on the other axis is the loss we want the model to incur. So, this patient actually had heart disease: Y equals one.
Now say the predicted probability is pretty close to one. What do you think the loss should be? Small — close to zero, exactly. So if the prediction lands here, you want the loss to be down here. But if the predicted probability is close to zero, even though the patient actually has heart disease, what do you want the loss to be? Really high — because the model is messing up badly, so you want the loss to be up here. Basically, you want a function shaped like that: high values of the predicted probability should have low loss, and low values should have high loss. Yeah?

Student: I understand why it has to be increasing or decreasing, but can you explain why it has to be curved like that?

It can certainly be linear. But basically, the bigger the mistake the model makes, the more harshly you want to penalize it. What you really want is something where, if the model says this person's probability is, say, one in a million — essentially zero — the loss is super high, like a huge rap on the knuckles for the model: don't do that. That's what we're after, and I'm demonstrating that dynamic with a very curved, steep loss function. But you can absolutely use a linear function — it's totally fine, it just won't be as effective for gradient descent later on, for a bunch of technical reasons. Are we good with this?
All right. Now let's look at the case where a patient does not have heart disease: Y equals zero. Same setup — the predicted probability on one axis, from zero to one, and the loss on the other. For this patient, who doesn't have heart disease, if the predicted probability is close to zero, what should the loss be? Close to zero; it should be down here. And the closer the probability gets to one, the more heavily you want to penalize it, which means you want the loss to be up here. So you want a loss that climbs higher and higher as the probability goes up. Are we good? Perfect — because we have a perfect loss function for exactly that.

So, to recap: for points with Y equals one, lower predictions should have higher loss — you want something shaped like that. And it turns out there's a very simple loss function, which literally just uses the logarithm, that gets the job done: you take minus the log of the predicted probability. That's it, and it has exactly the shape we want. You can see it numerically: if the predicted probability is one, the loss is zero; if it's a half, the loss is 1.0; if it's one in a thousand, the loss is almost 10; and if it's one in ten thousand, it's much higher still. Very high losses. So: minus log of the probability — boom, done.

Similarly, this is what we want for patients with Y equals zero, and it turns out that if you take minus the log of one minus the predicted probability, it does the same thing. Okay?
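A quick numerical check of those two curves — a sketch; the numbers quoted above line up with a base-2 logarithm, and the base only rescales the loss by a constant factor (frameworks typically use the natural log):

```python
import numpy as np

# Loss for a patient with Y = 1: -log(predicted probability)
p = np.array([1.0, 0.5, 1e-3, 1e-4])
print(-np.log2(p))          # [ 0.    1.    9.97 13.29]  low near 1, huge near 0

# Loss for a patient with Y = 0: -log(1 - predicted probability)
q = np.array([0.0, 0.5, 0.999])
print(-np.log2(1.0 - q))    # [0.   1.   9.97]           low near 0, huge near 1
```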
Mathematicians once again saved us with a logarithm. So, in summary, this is what we have: for data points where y equals 1, we use minus log of the predicted probability; for data points where y equals 0, we use minus log of one minus the predicted probability. But it feels a little inelegant to say, "Well, if y equals 1, I want to use this; if y equals 0, I want to use that." There's an if-then thing going on here, and I don't know about you folks, but if-then really irks me mathematically, because you can't take derivatives and so on very easily. But no worries. This is MIT; we have our bag of math tricks. So, what we do is combine them both into a single expression, like this:

loss_i = -[ y_i * log(model(x_i)) + (1 - y_i) * log(1 - model(x_i)) ]

Here y_i is again the label of the ith data point; remember, y_i is always either 1 or 0. And model(x_i) is the predicted probability. I've just taken the minus sign from each log term and moved it out front; that's why it looks like this. You can convince yourself that this single expression gets the job done. Say there's a patient for whom y equals 1. When you plug in y equals 1, the second term becomes 0 and the whole thing collapses away, while the first term just becomes minus log of the predicted probability, which is what we want. Conversely, if y equals 0, the first term disappears, and 1 minus 0 is just 1, so it becomes minus log of one minus the predicted probability, which is again what we want. Simple and neat, right? So, in one expression, we have defined the perfect loss.
No if-thens, none of that crap. Good. Now, that was the loss for a single data point, but we obviously have lots of data points. So we just add them all up and take the average; we average across all the data points we have, so that we get an average loss. We call this the binary cross-entropy loss function.
>> Is there a way you can edit the loss function so that you penalize, say, false negatives more strongly than false positives?
>> You can do all of that. Great question. I'm just looking at the basic case where the loss is symmetric, but you can penalize overestimates much more heavily than underestimates and things like that. If you're curious, you can Google something called the pinball loss. Any other questions on this? So, when you see some massive deep neural network built by Google for doing something or other, if it's a binary classification problem, chances are they're using this thing.
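Putting the single expression and the averaging together, a minimal NumPy version of the binary cross-entropy loss might look like the sketch below. This is not the course's code, and the labels and probabilities are made up just to have something to run.

```python
import numpy as np

def binary_cross_entropy(y, p):
    """Average of -[y*log(p) + (1-y)*log(1-p)] over all data points."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    per_point = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return per_point.mean()

# Made-up example: true labels and a model's predicted probabilities
y_true = [1, 0, 1, 0]
p_pred = [0.9, 0.2, 0.7, 0.4]
print(binary_cross_entropy(y_true, p_pred))   # about 0.30: decent predictions, small loss
```

The asymmetric penalties asked about above would amount to weighting the two terms differently before averaging, so that missing a true case costs more than a false alarm.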
All right. So, now let's figure out how to minimize these loss functions, because the name of the game is to find a way to minimize them. Loss functions are just a particular kind of function, so we'll first consider the general problem of minimizing some arbitrary function, and once we develop a little bit of intuition about that, we'll return to the specific task of minimizing loss functions. How's everyone doing? Yes, no, good, bad? You have a bit of a tough-to-interpret head shake there.
>> It's more that I kind of lost you where you said the loss function and the predicted probability, how they were inversely related, because my understanding was that the loss function is supposed to be the sum of the errors; we're averaging the errors. And when you said the heart patient...
>> Sorry, let me just stop there for a second. For each point, you define the loss. That's the whole point of the game. And once you define it, you calculate it for every point and average it, right? So, just focus on a single data point. Now continue.
>> So, when the heart patient has... there is more probability that they...
>> No.
>> So, when there is a person who has heart disease, you said that you want the loss function to be high. I think... I'm going back to the graph.
>> You want the loss function to be high if I'm predicting that they basically don't have heart disease. If the predicted probability is close to zero, then I'm badly wrong, because in reality they do have heart disease. And that's why I want the loss to be really high.
>> Okay, so effectively, the loss is my way of finding out how good my model is...
>> Or rather, how bad your model is. Right? How bad is it? That's really what the loss function is.
>> Got it.
>> And you want to minimize badness. That's the whole point of optimization.
>> I guess I still don't have a fully clear intuition of why exactly a log function, rather than something that's, say, flatter for small mistakes and then really steep later.
>> Those are all fantastic alternatives. You can totally do that. The reason we picked this function is that, A, it's easy to work with; it has good gradients.
It's well-behaved mathematically. But there are many alternatives to it. I don't want you to think that this is the only game in town or the only choice we have. We have many choices. This just happens to be a very easy choice, which also happens to be empirically very effective. And I'm happy to give you pointers to other, crazier loss functions which can do all of those things, too.

All right. So, minimizing a single-variable function. We'll warm up by looking at this little function here, which is a... what do you call a fourth-power polynomial?
>> Quartic.
>> Quartic, right? Thank you. So, it's a quartic function, and this is what it looks like. You can see there's a minimum somewhere between minus one and minus two, maybe around minus 1.5. We want to minimize this function. It's obviously a toy function, a little function with one variable, but the intuition we use here is going to be exactly what we use for GPT-4, so pay attention. How can we go about minimizing this function? What will we do?
>> Take the derivative and set it equal to zero.
>> You take the derivative, exactly. So, let's look at what the derivative does for us. But the second part of what was said, setting it to zero, becomes problematic when you have very complicated functions. It's not clear at all what's going to make them zero, unfortunately. But the idea of taking the derivative is in fact the right idea. So, we can calculate the derivative.
And that's exactly what happens here; you can convince yourself. If you plot the derivative, it looks like that. And as you would hope, wherever the minimum is, the derivative is zero; it's crossing the x-axis right there. In this simple case, you actually could solve for that directly. So, let's say you have the derivative. How can you use it? What is the value of a derivative? What does it tell you?
>> You use a gradient descent algorithm.
>> You are ten steps ahead of me, my friend. I just want the basic answer: what good is a derivative? When you calculate the derivative of something at a particular point, what does it tell you?
>> The rate of change of the function at the place you are.
>> Correct, exactly right. The derivative, the slope, tells us the change in the function for a very small increase in w. This is high school calculus; I'm just doing a quick refresher. What that means is that if the derivative is positive, increasing w slightly will increase the function. So, if you're here and you calculate the derivative, the slope is positive, which means that if you go slightly in this direction, the function is going to get higher. Similarly, if it's negative, say over here, the slope points the other way, which means that if you increase w, if you go in this direction, the function is going to decrease. And if it's kind of close to zero, it means that changing w slightly won't change anything.
So, if you're at a point like this, changing w slightly won't change anything. All right? That's it. And this immediately suggests an algorithm for minimizing g(w). Start with some random point w, and calculate the derivative at that point. Once we do that, there are three possibilities: it could be positive, negative, or close to zero. If it's positive, we know that increasing w will increase the function. But we want to decrease the function, we want to minimize it, which means we should not be increasing w. We should be doing what here?
>> Decrease.
>> Yes. And similarly, if it's negative, what should we do?
>> Increase.
>> Exactly. So, in the first case, you reduce w slightly. In the second case, you increase w slightly. And if the derivative is close to zero, you just stop, because there's nothing else you can do.

This is the basic intuition behind how GPT-4 was built, which is kind of shocking if you think about it. It means that all the heavy-duty optimization machinery people have figured out over the decades is mostly not used. This algorithm is what's being used, with some flavors on top of it. So, back to this: you do that, and if you've run out of time or compute, just stop. Otherwise, go back to step one and try again. Of course, if the derivative is close to zero, you've got to stop anyway.
>> Is there a concern about potentially hitting a local minimum there?
>> It's coming. Okay? So, that's the algorithm.
It's going to find you some point where the derivative is close to zero. Okay? This is called gradient descent, this little algorithm. And this very PowerPoint-y, MBA-style table can be collapsed into one little expression. It basically says: calculate the derivative, multiply it by a small number, which we'll get to in a second, and then the new w is the old w minus that little number times the derivative:

w_new = w_old - alpha * g'(w_old)

This little one-line formula is basically gradient descent. And what you should do, just to build your intuition, is make sure that the three possibilities in the table map nicely onto this expression; this one formula really does capture all three cases.

This is when gradient descent was invented. A bit of historical fun, right?
>> The 19th century?
>> 19th century. Yeah, okay, good. Excellent guess. 1847. It was invented in 1847 by Cauchy, the great mathematician. And in fact, if you're curious, you can check out the paper; I've given it to you here for handy reference. So, 1847. GPT-4 is built using an algorithm invented in 1847, which I find astonishing, frankly, that this little thing is so capable.

Okay. So, that's gradient descent. And this little number alpha is called the learning rate. It's our way of quantifying the idea of: let's not increase or decrease w massively, let's do it slightly. Because the gradient is only valid for small movements around your current point. If you take a big step, all bets are off.
So, this alpha tells you how small a step you should take. Typically, it's set to very small values like 0.1 or 0.001, and so on. In fact, if you read deep learning papers where they've trained some big model, a lot of researchers will go very quickly to the appendix, where the authors describe exactly what learning rates were used, because the learning rate is sort of part of the IP for how the model was built. There's a lot of trial and error that goes into these learning rates.

Okay. So, that is gradient descent. If we apply this algorithm to g(w), our original function, we just keep doing this update a few times. What you'll find is that if the point we randomly pick to start from is 2.5, and we set the alpha to one and run this algorithm, it starts here, then it goes there, then there, and finally ends up here. In four or five iterations, it finds some minimum. This is obviously a very simple, well-behaved, nice little function, so you can easily optimize it. If you want, you can just go to this link; there's a nice animation of it as well.
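The little run just described fits in a few lines of code. The exact quartic and settings from the slide aren't reproduced in the transcript, so the function, the starting point, and the learning rate below are stand-ins; the only thing that matters is the one-line update.

```python
# Sketch of 1-D gradient descent. The quartic on the slide isn't given in the
# transcript, so g below is a stand-in with its single minimum near w = -1.5;
# alpha and the starting point are likewise illustrative, not the slide's values.

def g(w):
    return w**4 + 2 * w**3 + 2 * w**2 + 6 * w

def dg_dw(w):                      # derivative of g, worked out by hand
    return 4 * w**3 + 6 * w**2 + 4 * w + 6

w = 2.5                            # starting point
alpha = 0.03                       # learning rate: take small steps

for step in range(100):
    grad = dg_dw(w)
    if abs(grad) < 1e-6:           # derivative is essentially zero: stop
        break
    w = w - alpha * grad           # the one-line gradient descent update

print(w, g(w))                     # ends up near w = -1.5, the minimum of this stand-in g
```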
Okay. All right. Before we actually go to the multi-variable case, I wanted to get to the question posed earlier about local minima... actually, you know what, I think I may have some slides on it, so I'll come back to that. So, we looked at a toy example where there was only one variable. What if it was GPT-3? GPT-3 has 175 billion parameters. 175 billion; and for GPT-4 they haven't published it, so we don't know, but it's supposed to be about eight times as much. The number of parameters is massive. So, basically, our loss function has billions of variables, billions of w's, that we need to minimize over. For that, we need the notion of a partial derivative.

Let's take baby steps and say: what if you have a two-variable function, something very simple like this? What we can do is calculate the partial derivative of g with respect to each of the w's. The partial derivative, just to quickly refresh your memory: you take the function, you pretend that everything other than w1 is a constant, so the function becomes a function of just the one variable w1, and then you differentiate it like you would anything else. You get something, and that's this thing here. Then you do the same thing for w2, you get this thing here, and you stack them up in a nice list. This is the vector of partial derivatives.

How should we interpret it? The same way as before. For a small change in w1, keeping w2 and everything else fixed, how does the function change? And similarly for w2, and all the way up to w number 175 billion. Same thing. Now, when you have these functions with many variables, many w's, since we have a partial derivative for each one of those w's, we stack them all up into a nice vector of derivatives, and this vector is called the gradient. And it's denoted using this symbol. Anyone know what the symbol is called?
>> Nabla.
>> Yeah?
>> Laplacian?
>> Maybe, maybe that's related. But the one I'm familiar with is nabla. I think the upside-down triangle is called nabla, if I recall, and delta is the other one. Am I right? Thank you. He's my go-to. So, yeah: the gradient, we just call it the gradient, and it's written like this.

All right. So, what we do is simply gradient descent on every one of the w's, each using its own partial derivative. In one gradient step, we update w1 using this formula and w2 using this formula. Finished. We've just generalized gradient descent to an arbitrary number of variables. And of course, as before, this can be summarized compactly as a vector formula. Let me just write this out. What's going on here is that I have

new w1 = old w1 - alpha * (partial derivative of g with respect to w1)
new w2 = old w2 - alpha * (partial derivative of g with respect to w2)

and all we're doing is stacking these up into a vector: the old vector of w's, minus alpha times the vector of partial derivatives. So, this can be written as just: the new vector w equals the old vector w minus alpha times the gradient. Finished. And you can see that for GPT-3, this vector is going to be 175 billion entries long. But whether it's two or 175 billion, who cares? It's the same thing, right?
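In code, the vector version looks almost identical to the single-variable version. The two-variable function from the slide isn't given in the transcript, so the bowl-shaped g below is an illustrative stand-in.

```python
import numpy as np

# Gradient descent with a vector of weights. The function g is a made-up,
# well-behaved stand-in; its minimum sits at w = [1.0, -0.5].

def g(w):                               # w is the vector [w1, w2]
    w1, w2 = w
    return (w1 - 1.0) ** 2 + 2.0 * (w2 + 0.5) ** 2

def grad_g(w):                          # the gradient: one partial derivative per w
    w1, w2 = w
    return np.array([2.0 * (w1 - 1.0),       # dg/dw1
                     4.0 * (w2 + 0.5)])      # dg/dw2

w = np.array([3.0, 2.0])                # starting point
alpha = 0.1

for step in range(200):
    grad = grad_g(w)
    if np.linalg.norm(grad) < 1e-6:     # gradient is essentially the zero vector: stop
        break
    w = w - alpha * grad                # same one-line update, now on a whole vector

print(w)                                # ends up near [1.0, -0.5]
```

Whether w has two entries or 175 billion, the update line is literally the same.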
Okay, so that's what we have here. I'm really thrilled, by the way, with how this whole iPad business is working out; I was a little worried about it. Okay. So, if you look at two dimensions and actually plot the function, with the first w on one axis and the second w on the other, this surface is the function g(w); think of it as the loss function. You're trying to find the minimum here, and this is how gradient descent will progress if you're starting from this point. Or you can look at it top-down into the function, which is what this picture shows: gradient descent starting from there and working its way down, from here all the way to the center.

Okay. All right, local minima. Now, gradient descent will just stop near, hopefully, a minimum. But the problem is it may not be a global minimum. It may not even be a minimum. So, let's see what I'm talking about here. Here are some possibilities. Let's take a simple function: this axis is g(w), this axis is w, and it turns out the function actually looks like this. You can see that this point here is a local minimum. This one is a local minimum. So is this one; there are lots of local minima here, and a lot of local minima over here, too. These are all places where the derivative is going to be zero. So, if you run gradient descent and it stops because the gradient reached zero, you could be in any of these places. There's no guarantee. This one in the picture happens to be maybe the global minimum, because it's the lowest of the lot. Right?
But there's no guarantee you're actually going to get there. There's not even a guarantee you're going to end up in any of these places, because you could literally land in this thing here, where the curve sort of takes a break and then continues on down. That, by the way, is called a saddle point; I drew it badly, but this pattern of coming in, flattening out, and going down again is called a saddle point. So, gradient descent can stop at a saddle point, it can stop at some local minimum, and there's no guarantee it's going to be the global one.

But it turns out this has not mattered, and there are a whole bunch of reasons why. When you have these very complicated neural networks, they're very complex functions, and even finding a decent solution is actually really good for solving the problem. You don't have to get to the best possible solution. In fact, if you do get to the best possible solution, you run the risk of overfitting. So, that's one reason. Empirically, what we have seen is that not worrying about local minima versus global minima has not hurt us, because these things are amazing. With GPT-4, they probably just stopped somewhere; it probably wasn't even a local minimum. They were like, "All right, it's been running for six days, we've spent two million dollars, let's stop." Because these runs are very expensive. And that's still so magical: you don't need to get anywhere close to even a local minimum.
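You can see the "no guarantee" point by running plain gradient descent on a wiggly function from different starting points. The function drawn on the board isn't in the transcript, so the one below is a stand-in; the point is simply that the algorithm stops at whichever nearby flat spot it falls into.

```python
import numpy as np

# Stand-in wiggly function with several local minima. Plain gradient descent
# lands in a different valley depending on where it starts; none of them
# need be the global minimum.

def g(w):
    return np.sin(3 * w) + 0.1 * w**2

def dg_dw(w):
    return 3 * np.cos(3 * w) + 0.2 * w

def gradient_descent(w, alpha=0.02, steps=500):
    for _ in range(steps):
        grad = dg_dw(w)
        if abs(grad) < 1e-8:
            break
        w = w - alpha * grad
    return w

for start in (2.0, -3.0, -1.0):
    w_end = gradient_descent(start)
    print(f"start {start:5.1f} -> stops at w = {w_end:6.2f}, g(w) = {g(w_end):6.2f}")
```

Three starting points, three different stopping points, and only one of them is the lowest valley of this particular stand-in function.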
But there's another interesting point, which I've read about, and by the way this is a very hot area of research to figure out exactly. People basically hypothesize the following. For you to be at a local minimum, think about what that means: you're standing at a particular point and, in every direction you look, things are sloping upward. Only if everything is sloping upward all around you can you be at a local minimum, by definition. But if you have a billion dimensions, what are the odds that you're standing at a point where every one of those billion dimensions is sloping upward? The odds are really low. Chances are some of them are going up, some of them are going down, others are curving off some other way; it's going to be crazy. So, in some sense, the best you can hope for in these very high-dimensional situations is probably a saddle point. And it turns out that's good enough. So, for those reasons, we are content with just running gradient descent, with some tweaks which I'll get to in a second, and it performs really admirably.
>> How does alpha depend on how much compute you have? Would you set the learning rate based on that, or not really?
>> No. The learning rate is really a measure of... it's sort of like this. When you're at a point where you think the gradient is looking nice, meaning that if you take a step in that direction the function will go down, and you further believe it's going to keep going down in that direction for a while, then you're very confident about taking a big step. But if you're thinking, "I don't know, maybe I take a little step and then I have to go this other way, I can't keep going straight," then you don't want to take a big step, because then you'd have to backtrack.
So, those kinds of considerations go into the learning rate. That's the rough answer to your question: it's not so much determined by compute and bandwidth and things like that. But again, it's a complicated thing, because sometimes, with a given amount of compute, if you have a particular kind of data you can get away with very aggressive learning rates. It tends to be a bit jumbled up and complicated, but that's the quick, surface-level idea of what's going on.

Okay, 9:31. Anyway, folks, this lecture is probably one of the driest of the semester, because I have to go through all the concepts. Once we start doing the Colabs, things get a lot more lively.

All right. So, now let's talk about minimizing a loss function with gradient descent. Here is our little binary cross-entropy loss function that we saw before. This is what we want to minimize. So, if you look at this thing, where are the variables we need to change to minimize this function? Folks, don't look at your phones. Laptops and iPads are fine; don't look at your phones.
>> Sorry, we've kind of abstracted the variables w, but just to bring it back, those are actually the weights in the neural network, right?
>> Yeah, the weights and the biases; I'm just calling them all weights.
>> So, the output of this minimization is going to be the actual weights in your model, right?
>> Exactly. Exactly right. The whole name of the game is to find the weights.
And so, for example, when you see in the press that Meta has essentially made the weights of Llama 2 or something available, that's basically what they've done: they've published the weights.
>> The reason that's so valuable is...
>> Microphone, please. Go.
>> Because if you have a billion parameters, the compute time on that is horrendous and expensive. That's why the weights are so valuable.
>> Correct. The weights are the crown jewels, because they are the result of a lot of money and time and smartness being spent. There's a separate question of why they're making them open source, which I'm happy to chat about offline.

All right, cool. So, what are the variables we need to change to minimize the loss? It's basically the parameters, and they're hiding inside the model term. Because what is the model? The model is some function like that. If you look at the simple GPA-and-experience example we went through on Monday, we figured out that the thing that comes out at the end is a complicated function of all the x's and the w's and so on, and that complicated expression is what shows up inside the loss. So, the w's in there are the variables we need to change to minimize the loss function. And it's important for you to note and understand that the values of x and y and so on are just data; you're not optimizing anything there. What you're optimizing is the w's. The weights.
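A tiny sketch of that point, not from the course's Colab: with the model written out, the loss really is just a function of the weights, and the x's and y's sit inside it as fixed data. A single sigmoid neuron on two made-up, standardized features stands in for "the model"; the four patients below are fabricated.

```python
import numpy as np

X = np.array([[63.0, 145.0],     # made-up [age, resting blood pressure]
              [41.0, 120.0],
              [58.0, 138.0],
              [35.0, 110.0]])
X = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize the features
y = np.array([1.0, 0.0, 1.0, 0.0])           # 1 = diagnosed within a year, 0 = not

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b):
    # Binary cross-entropy, viewed purely as a function of the weights w and bias b.
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def grads(w, b):
    # Gradient of the loss with respect to the weights (standard logistic-regression algebra).
    p = sigmoid(X @ w + b)
    return X.T @ (p - y) / len(y), np.mean(p - y)

w, b, alpha = np.zeros(2), 0.0, 0.1
print("loss before:", loss(w, b))
for _ in range(500):
    gw, gb = grads(w, b)
    w, b = w - alpha * gw, b - alpha * gb    # the same gradient descent update as before
print("loss after: ", loss(w, b))
```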
Okay. So, imagine replacing the model term with the mathematical expression above wherever it appears in the loss function. Once you do that, your loss function is just a good old function of the w's. The fact that it's a loss function is kind of irrelevant; it's just a function. And since it's just a good old function of the w's, you can apply gradient descent to it as we normally would. It's no big deal.

Which brings us to something called backpropagation. If you remember nothing else about backpropagation, just remember this: never use the word backpropagation again. Only use the word backprop. Then you're hip and cool to the deep learning community. Backprop.

Okay. All right. So, what is backprop? Backprop is a very efficient way to compute the gradient of the loss function. When you have this loss function, and let's say you have a billion w's and ten million data points, so the little n we saw is ten million, that is a lot of computation, and that's just for one step of gradient descent. So, backprop is a very efficient and clever way to compute the gradient of the loss function, which takes advantage of the fact that what we have here is not some arbitrary model. It's a model that came from a particular kind of neural network, which has layers one after the other and an output at the very end. What backprop does is organize the computation in the form of something called a computational graph, and the book has a good discussion of it. We start at the very end and calculate the gradient of the loss with respect to the output. Then we move left: we calculate the gradient of that output with respect to the output of the prior hidden layer. Step to the left again: calculate the gradient of the current thing with respect to the previous layer. You get the idea, right?
It's iterative, and it moves backwards, and by doing so you never wastefully repeat the same computation twice. That's the big advantage: you calculate something once and reuse it many, many times. The second advantage is that if you organize the computation this way, it just becomes a sequence of matrix multiplications. And because it's a sequence of matrix multiplications that eliminates redundant calculations, and best of all, there are these things called GPUs, graphics processing units, originally invented to accelerate video game rendering. As it turns out, the core math operation in video game rendering is basically matrix multiplication, linear algebra operations. So at some point someone had the bright idea: for deep learning, for calculating gradients and so on, we need to do matrix multiplications, and here is specialized hardware that does a fast job of matrix multiplications; can we use this for that? And they did. And all hell broke loose. That's literally what happened, and that's why Nvidia is valued at, what, 1.5 trillion dollars or something. So, yes, GPUs are really good at this. And backprop, the way you do backprop, plus running it on GPUs, leads to fast calculation of loss function gradients. If this were not true, this class would not exist, because there would not have been a deep learning revolution. This is a fundamental, seminal reason.
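A hand-rolled sketch of the idea, not the course's heart-disease model: a tiny one-hidden-layer network whose data, sizes, and weights are all made up. The things to notice are that each backward step reuses quantities already computed in the forward pass, and that every step is a matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up batch: 5 examples, 2 features, binary labels
X = rng.normal(size=(5, 2))
y = rng.integers(0, 2, size=5).astype(float)

# Tiny network: 2 inputs -> 3 ReLU hidden units -> 1 sigmoid output
W1 = rng.normal(scale=0.5, size=(2, 3)); b1 = np.zeros(3)
w2 = rng.normal(scale=0.5, size=3);      b2 = 0.0

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# ---- forward pass: save the intermediates, the backward pass will reuse them ----
Z1 = X @ W1 + b1           # hidden-layer pre-activations
A1 = np.maximum(Z1, 0)     # ReLU
z2 = A1 @ w2 + b2          # output pre-activation
p  = sigmoid(z2)           # predicted probability
loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# ---- backward pass: walk the graph right to left, one matrix product per step ----
dz2 = (p - y) / len(y)     # dLoss/dz2 (BCE through a sigmoid simplifies to this)
dw2 = A1.T @ dz2           # reuses A1 from the forward pass
db2 = dz2.sum()
dA1 = np.outer(dz2, w2)    # push the gradient back through the output weights
dZ1 = dA1 * (Z1 > 0)       # ReLU gate: gradient flows only where Z1 was positive
dW1 = X.T @ dZ1
db1 = dZ1.sum(axis=0)

# Sanity check one entry against a brute-force finite difference
def loss_at(W1_try):
    A = np.maximum(X @ W1_try + b1, 0)
    q = sigmoid(A @ w2 + b2)
    return -np.mean(y * np.log(q) + (1 - y) * np.log(1 - q))

eps = 1e-6
W1_bump = W1.copy(); W1_bump[0, 0] += eps
print(dW1[0, 0], (loss_at(W1_bump) - loss_at(W1)) / eps)   # the two numbers should be very close
```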
So, the book has a bunch of 1776 00:59:59,760 --> 01:00:01,880 detail, 1777 01:00:00,840 --> 01:00:05,600 um, 1778 01:00:01,880 --> 01:00:07,559 and I actually hand- 1779 01:00:05,599 --> 01:00:09,599 worked out an example 1780 01:00:07,559 --> 01:00:11,679 of calculating a gradient the 1781 01:00:09,599 --> 01:00:13,400 old-fashioned way and calculating it 1782 01:00:11,679 --> 01:00:14,879 using backprop. 1783 01:00:13,400 --> 01:00:17,200 So, take a look at it. I'll post it on 1784 01:00:14,880 --> 01:00:18,519 Canvas and you will understand exactly 1785 01:00:17,199 --> 01:00:21,519 where the savings come from, where the 1786 01:00:18,519 --> 01:00:22,800 efficiency gains come from. Okay? 1787 01:00:21,519 --> 01:00:25,239 Because of time, I'm not going to get 1788 01:00:22,800 --> 01:00:25,240 into it now. 1789 01:00:26,400 --> 01:00:30,400 All right. Any questions so far? 1790 01:00:28,840 --> 01:00:32,600 Yep. 1791 01:00:30,400 --> 01:00:34,559 Sorry, a follow-up: so, we've 1792 01:00:32,599 --> 01:00:36,239 done gradient descent, which is 1793 01:00:34,559 --> 01:00:37,840 different than calculation of the 1794 01:00:36,239 --> 01:00:39,239 gradient of the loss function. What 1795 01:00:37,840 --> 01:00:41,039 is the purpose of the calculation of the 1796 01:00:39,239 --> 01:00:42,519 gradient of the loss function? You 1797 01:00:41,039 --> 01:00:44,159 calculate the gradient because the 1798 01:00:42,519 --> 01:00:47,039 fundamental operation of gradient 1799 01:00:44,159 --> 01:00:48,199 descent is to take your current value of 1800 01:00:47,039 --> 01:00:50,159 W 1801 01:00:48,199 --> 01:00:52,919 and modify it slightly, and the 1802 01:00:50,159 --> 01:00:56,000 modification is old value minus learning 1803 01:00:52,920 --> 01:00:56,000 rate times gradient. 1804 01:01:03,360 --> 01:01:06,280 It'd be cool, right, if I say, "Go 1805 01:01:04,960 --> 01:01:08,400 back five slides to this thing," and 1806 01:01:06,280 --> 01:01:09,880 it just goes back. Product idea. Anyone? 1807 01:01:08,400 --> 01:01:11,840 Startups? 1808 01:01:09,880 --> 01:01:14,320 So. 1809 01:01:11,840 --> 01:01:15,360 So, this one. 1810 01:01:14,320 --> 01:01:16,920 So, this is the fundamental step of 1811 01:01:15,360 --> 01:01:19,280 gradient descent. 1812 01:01:16,920 --> 01:01:20,720 So, this is the current value of W. 1813 01:01:19,280 --> 01:01:22,000 You calculate the gradient at that 1814 01:01:20,719 --> 01:01:24,159 current value, 1815 01:01:22,000 --> 01:01:26,199 multiply it by alpha, do this thing, and 1816 01:01:24,159 --> 01:01:27,440 you get the new value. 1817 01:01:26,199 --> 01:01:29,879 And you keep repeating. 1818 01:01:27,440 --> 01:01:32,240 Right, but G(W), 1819 01:01:29,880 --> 01:01:33,559 that's not the loss function. 1820 01:01:32,239 --> 01:01:34,039 >> It is the loss function. That is the 1821 01:01:33,559 --> 01:01:35,960 loss function. 1822 01:01:34,039 --> 01:01:37,880 >> Yeah, right. Here, I'm just using G as 1823 01:01:35,960 --> 01:01:39,880 an arbitrary function 1824 01:01:37,880 --> 01:01:41,599 just to demonstrate the point. But 1825 01:01:39,880 --> 01:01:42,880 when you're optimizing, when you're 1826 01:01:41,599 --> 01:01:45,519 training a neural network, what you're 1827 01:01:42,880 --> 01:01:46,800 actually doing is minimizing a loss 1828 01:01:45,519 --> 01:01:49,320 function. Right. 1829 01:01:46,800 --> 01:01:51,360 >> Loss of W. Sorry, I got things mixed up. 1830 01:01:49,320 --> 01:01:53,000 Thank you.
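In code, that fundamental step is just the update line repeated over and over; here is a toy sketch with a made-up one-dimensional function standing in for G(W).

```python
def g(w):
    # toy stand-in for G(W): a simple bowl with its minimum at w = 3
    return (w - 3.0) ** 2

def grad_g(w):
    # gradient of g
    return 2.0 * (w - 3.0)

w = 10.0      # current value of W (any starting point)
alpha = 0.1   # learning rate
for step in range(100):
    w = w - alpha * grad_g(w)   # old value minus learning rate times gradient

print(w)      # ends up very close to 3, the minimizer of g
```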
1831 01:01:51,360 --> 01:01:54,680 >> Yeah. 1832 01:01:53,000 --> 01:01:55,639 Uh how do we define the initial weights 1833 01:01:54,679 --> 01:01:57,279 for the neural network? 1834 01:01:55,639 --> 01:02:01,639 >> Ah. 1835 01:01:57,280 --> 01:02:01,640 So, yeah, the initial weights um 1836 01:02:02,199 --> 01:02:04,919 So, there's a there are many ways to So, 1837 01:02:04,000 --> 01:02:06,119 first of all, they are initialized 1838 01:02:04,920 --> 01:02:08,119 randomly. 1839 01:02:06,119 --> 01:02:09,920 Uh but randomly doesn't mean you can 1840 01:02:08,119 --> 01:02:11,839 just pick any random weight. There are 1841 01:02:09,920 --> 01:02:13,519 actually some good ways to randomly pick 1842 01:02:11,840 --> 01:02:16,240 the weights. Uh those are called 1843 01:02:13,519 --> 01:02:18,199 initialization schemes. Um and there are 1844 01:02:16,239 --> 01:02:19,359 a bunch of very effective initialization 1845 01:02:18,199 --> 01:02:21,119 schemes people have figured out over the 1846 01:02:19,360 --> 01:02:22,880 years and those things are baked into 1847 01:02:21,119 --> 01:02:24,880 Keras as the default. 1848 01:02:22,880 --> 01:02:26,079 So, the Keras, I believe, uses something 1849 01:02:24,880 --> 01:02:27,960 called the 1850 01:02:26,079 --> 01:02:31,199 uh He initialization, H E 1851 01:02:27,960 --> 01:02:33,039 initialization, or the Xavier Glorot 1852 01:02:31,199 --> 01:02:33,839 initialization. I wouldn't worry about 1853 01:02:33,039 --> 01:02:36,000 it. Just go with the default 1854 01:02:33,840 --> 01:02:37,519 initialization. 1855 01:02:36,000 --> 01:02:38,679 The reason why they have to be very 1856 01:02:37,519 --> 01:02:40,880 careful about how these weights are 1857 01:02:38,679 --> 01:02:43,039 initialized is because if you have a 1858 01:02:40,880 --> 01:02:45,200 very big network and if you initialize 1859 01:02:43,039 --> 01:02:47,679 badly then 1860 01:02:45,199 --> 01:02:48,919 the gradient will just explode as you 1861 01:02:47,679 --> 01:02:50,440 calculate it. 1862 01:02:48,920 --> 01:02:52,480 The earlier layers, the weights will 1863 01:02:50,440 --> 01:02:53,720 have massive gradients or the gradients 1864 01:02:52,480 --> 01:02:55,119 will vanish. 1865 01:02:53,719 --> 01:02:56,319 So, they're called the exploding 1866 01:02:55,119 --> 01:02:58,239 gradient problem or the vanishing 1867 01:02:56,320 --> 01:02:59,240 gradient problem. To avoid all those 1868 01:02:58,239 --> 01:03:00,719 things, researchers have figured out 1869 01:02:59,239 --> 01:03:03,599 some clever way to initialize so that 1870 01:03:00,719 --> 01:03:05,359 it's well-behaved throughout. 1871 01:03:03,599 --> 01:03:08,400 Yep. 1872 01:03:05,360 --> 01:03:10,360 If using um backprops and GPUs was so 1873 01:03:08,400 --> 01:03:12,440 critical, I'm just curious like who 1874 01:03:10,360 --> 01:03:14,760 first did it and when? Was this like a 1875 01:03:12,440 --> 01:03:15,119 couple years ago? Was it a company? Was 1876 01:03:14,760 --> 01:03:17,520 it a Yeah. 1877 01:03:15,119 --> 01:03:20,199 >> Yeah. Well, GPUs have been used for deep 1878 01:03:17,519 --> 01:03:22,400 learning, I want to say um 1879 01:03:20,199 --> 01:03:26,279 I think the first uh case may have been 1880 01:03:22,400 --> 01:03:27,920 in the mid 2005, 2006 sort of thing. 
1881 01:03:26,280 --> 01:03:30,000 But I would say that it sort of burst 1882 01:03:27,920 --> 01:03:32,800 out onto the world stage and made 1883 01:03:30,000 --> 01:03:35,000 everyone take notice when uh a deep 1884 01:03:32,800 --> 01:03:38,519 learning model called AlexNet 1885 01:03:35,000 --> 01:03:40,440 in 2012 won a very famous 1886 01:03:38,519 --> 01:03:43,320 computer vision competition. 1887 01:03:40,440 --> 01:03:45,079 Uh and it beat the and it set a world 1888 01:03:43,320 --> 01:03:46,200 record for how good it was. 1889 01:03:45,079 --> 01:03:48,039 Uh and that's when everyone was like, 1890 01:03:46,199 --> 01:03:49,119 "Hey, what is this thing?" And that's 1891 01:03:48,039 --> 01:03:50,719 really when it burst onto the world 1892 01:03:49,119 --> 01:03:51,880 stage. I'll talk a bit more about it 1893 01:03:50,719 --> 01:03:54,119 when I get into the computer vision 1894 01:03:51,880 --> 01:03:55,480 segment of the class. 1895 01:03:54,119 --> 01:03:58,759 But you can Google AlexNet and you'll 1896 01:03:55,480 --> 01:03:58,760 find a whole bunch of history around it. 1897 01:03:59,599 --> 01:04:04,920 I believe that if you do this, is it 1898 01:04:00,760 --> 01:04:06,040 true that could get to a global minima 1899 01:04:04,920 --> 01:04:07,840 that would mean there would be no 1900 01:04:06,039 --> 01:04:09,840 hallucinations? 1901 01:04:07,840 --> 01:04:11,920 Aha, good question. 1902 01:04:09,840 --> 01:04:13,120 So, if it is perfect 1903 01:04:11,920 --> 01:04:14,519 if you get to a global minimum. First of 1904 01:04:13,119 --> 01:04:15,880 all, global minima doesn't mean the 1905 01:04:14,519 --> 01:04:17,199 model is perfect, right? It may still 1906 01:04:15,880 --> 01:04:18,400 have some loss. 1907 01:04:17,199 --> 01:04:21,119 Um 1908 01:04:18,400 --> 01:04:24,000 but global minima is going to be on the 1909 01:04:21,119 --> 01:04:24,000 training data. 1910 01:04:24,199 --> 01:04:28,519 You can imagine that the test data, 1911 01:04:26,280 --> 01:04:29,480 future data has its own loss function, 1912 01:04:28,519 --> 01:04:31,000 right? 1913 01:04:29,480 --> 01:04:34,599 So, what is minimum here may not be 1914 01:04:31,000 --> 01:04:34,599 minimum there. That's the problem. 1915 01:04:36,440 --> 01:04:40,280 Is that a comment? No, okay. 1916 01:04:38,800 --> 01:04:42,280 Just saying that 1917 01:04:40,280 --> 01:04:43,240 uh that would mean that also you can be 1918 01:04:42,280 --> 01:04:45,200 over-fitting for 1919 01:04:43,239 --> 01:04:47,119 >> Correct. Exactly. Exactly. So, if you 1920 01:04:45,199 --> 01:04:48,960 overdo, if you find the best thing in 1921 01:04:47,119 --> 01:04:50,960 the training function, chances are it 1922 01:04:48,960 --> 01:04:52,000 doesn't match the best thing of the test 1923 01:04:50,960 --> 01:04:53,358 data. 1924 01:04:52,000 --> 01:04:55,880 So, on the test data, you're actually 1925 01:04:53,358 --> 01:04:55,880 doing badly. 1926 01:04:56,440 --> 01:05:00,880 Okay. So, 1927 01:04:57,960 --> 01:05:00,880 uh come back to this. 1928 01:05:03,800 --> 01:05:08,240 Okay. Now, uh the final uh twist to the 1929 01:05:06,199 --> 01:05:10,039 tail here uh we're going to go from 1930 01:05:08,239 --> 01:05:11,839 something gradient descent to something 1931 01:05:10,039 --> 01:05:14,639 called stochastic gradient descent. And 1932 01:05:11,840 --> 01:05:16,400 stochastic gradient descent or SGD is 1933 01:05:14,639 --> 01:05:17,480 the workhorse for all deep learning. 1934 01:05:16,400 --> 01:05:19,639 Okay? 
1935 01:05:17,480 --> 01:05:20,679 And funnily enough, SGD is simpler than 1936 01:05:19,639 --> 01:05:21,839 GD. 1937 01:05:20,679 --> 01:05:23,799 Okay? Just when you thought it couldn't 1938 01:05:21,840 --> 01:05:25,280 get simpler, right? 1939 01:05:23,800 --> 01:05:27,400 Okay. So, 1940 01:05:25,280 --> 01:05:28,640 So, for large data sets, computing the 1941 01:05:27,400 --> 01:05:31,440 gradient of the loss function can be 1942 01:05:28,639 --> 01:05:32,920 very expensive. Right? Needless to say. 1943 01:05:31,440 --> 01:05:34,519 Because it has to be done at every step 1944 01:05:32,920 --> 01:05:36,760 and the cardinality of the data set is 1945 01:05:34,519 --> 01:05:38,079 really big. Right? And you may have, I 1946 01:05:36,760 --> 01:05:39,480 don't know, billions of parameters. It's 1947 01:05:38,079 --> 01:05:43,119 just very, very 1948 01:05:39,480 --> 01:05:45,679 tough to compute it even with backprop. 1949 01:05:43,119 --> 01:05:47,519 So, the solution is at each iteration, 1950 01:05:45,679 --> 01:05:50,119 when I say iteration, I'm talking about 1951 01:05:47,519 --> 01:05:52,599 this step of gradient descent. 1952 01:05:50,119 --> 01:05:54,599 Instead of using all the data 1953 01:05:52,599 --> 01:05:57,358 instead of calculating the loss function 1954 01:05:54,599 --> 01:05:59,480 by averaging the loss across all N data 1955 01:05:57,358 --> 01:06:01,880 points and then calculating the gradient 1956 01:05:59,480 --> 01:06:04,440 of that thing, what you do is you just 1957 01:06:01,880 --> 01:06:06,480 choose a small sample randomly. You 1958 01:06:04,440 --> 01:06:08,400 choose just a few of the N observations 1959 01:06:06,480 --> 01:06:10,159 and we call it a mini batch. 1960 01:06:08,400 --> 01:06:11,599 So, for example, the number of data 1961 01:06:10,159 --> 01:06:12,639 points you may you may have 10 billion 1962 01:06:11,599 --> 01:06:14,000 data points 1963 01:06:12,639 --> 01:06:16,559 but in every iteration, you may 1964 01:06:14,000 --> 01:06:18,119 literally grab just like 32 or 64, 1965 01:06:16,559 --> 01:06:20,199 something really small. 1966 01:06:18,119 --> 01:06:21,199 Like absurdly small. 1967 01:06:20,199 --> 01:06:23,000 Okay? 1968 01:06:21,199 --> 01:06:24,799 And then you pretend that okay, that's 1969 01:06:23,000 --> 01:06:27,159 all the data I have. You calculate the 1970 01:06:24,800 --> 01:06:30,359 loss, find the gradient and just use 1971 01:06:27,159 --> 01:06:33,199 that here instead. 1972 01:06:30,358 --> 01:06:36,799 Okay? So, this is called stochastic 1973 01:06:33,199 --> 01:06:39,159 gradient descent. So, strictly speaking 1974 01:06:36,800 --> 01:06:40,680 theoretically, SGD uses just one data 1975 01:06:39,159 --> 01:06:42,079 point. 1976 01:06:40,679 --> 01:06:44,599 But in practice, we use what's called a 1977 01:06:42,079 --> 01:06:47,039 mini batch, 32, 64, whatever. 1978 01:06:44,599 --> 01:06:48,319 Uh and so, mini batch gradient descent 1979 01:06:47,039 --> 01:06:51,719 is just loosely called stochastic 1980 01:06:48,320 --> 01:06:51,720 gradient descent, SGD. 1981 01:06:52,719 --> 01:06:57,559 So, and SGD, as it turns out 1982 01:06:55,679 --> 01:06:58,799 you can see it's clearly very efficient, 1983 01:06:57,559 --> 01:07:00,960 right? Because 1984 01:06:58,800 --> 01:07:02,519 it's just processing a few at a time. 
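As a rough sketch of what "grab just 32" means, assuming a toy one-parameter squared loss rather than the course's network: compare the gradient computed from all N points with the gradient computed from a random mini-batch.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000_000                     # pretend this is all the data
x = rng.normal(loc=3.0, size=N)

def full_gradient(w):
    # gradient of the average squared loss (1/N) * sum of (w - x_i)^2 over ALL N points
    return 2.0 * np.mean(w - x)

def minibatch_gradient(w, batch_size=32):
    # grab just 32 points at random and pretend that's all the data you have
    batch = x[rng.integers(0, N, size=batch_size)]
    return 2.0 * np.mean(w - batch)

w = 0.0
print(full_gradient(w))        # the "true" gradient
print(minibatch_gradient(w))   # a noisy estimate of it, from roughly 300,000x less work
```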
1985 01:07:00,960 --> 01:07:03,559 Uh and in fact, if you have a lot of 1986 01:07:02,519 --> 01:07:05,159 data 1987 01:07:03,559 --> 01:07:07,119 and you calculate the full gradient of 1988 01:07:05,159 --> 01:07:09,319 the loss function, it may not even fit 1989 01:07:07,119 --> 01:07:11,319 into memory. 1990 01:07:09,320 --> 01:07:12,880 Right? It's really problematic. But with 1991 01:07:11,320 --> 01:07:14,359 SGD, it says, "I don't care whether you 1992 01:07:12,880 --> 01:07:17,400 have a billion data points or a trillion 1993 01:07:14,358 --> 01:07:19,199 data points. Just give me 32 at a time." 1994 01:07:17,400 --> 01:07:20,720 Okay? And you just keep on doing it. 1995 01:07:19,199 --> 01:07:22,639 And 1996 01:07:20,719 --> 01:07:24,719 turns out, because not all the points 1997 01:07:22,639 --> 01:07:26,679 are used in the calculation this only 1998 01:07:24,719 --> 01:07:27,919 approximates the true gradient. Right? 1999 01:07:26,679 --> 01:07:29,919 It's only an approximation. It's not the 2000 01:07:27,920 --> 01:07:32,079 real thing. It's only an approximation. 2001 01:07:29,920 --> 01:07:33,760 But it works extremely well in practice. 2002 01:07:32,079 --> 01:07:34,960 Extremely well in practice. 2003 01:07:33,760 --> 01:07:37,359 And there's a whole bunch of research 2004 01:07:34,960 --> 01:07:39,079 that goes into why is it so effective? 2005 01:07:37,358 --> 01:07:40,920 And you know, people are discovering 2006 01:07:39,079 --> 01:07:42,599 interesting things about SGD, but we 2007 01:07:40,920 --> 01:07:44,680 don't have like a definitive theory as 2008 01:07:42,599 --> 01:07:46,039 to why it's so good yet. We have some 2009 01:07:44,679 --> 01:07:47,799 interesting, you know, uh research 2010 01:07:46,039 --> 01:07:50,000 threads that have happened. 2011 01:07:47,800 --> 01:07:51,840 And very tantalizingly, very 2012 01:07:50,000 --> 01:07:53,920 tantalizingly 2013 01:07:51,840 --> 01:07:55,640 because it's only an approximation of 2014 01:07:53,920 --> 01:07:59,480 the true gradient 2015 01:07:55,639 --> 01:08:00,480 SGD can actually escape local minima. 2016 01:07:59,480 --> 01:08:02,240 So, 2017 01:08:00,480 --> 01:08:04,159 in the in the true loss function, you're 2018 01:08:02,239 --> 01:08:06,679 at a local minimum 2019 01:08:04,159 --> 01:08:08,519 but in SGD's loss function, when you're 2020 01:08:06,679 --> 01:08:11,440 doing SGD, you're reaching the the 2021 01:08:08,519 --> 01:08:13,159 minimum of the SGD loss function 2022 01:08:11,440 --> 01:08:14,920 which actually may not be the actual 2023 01:08:13,159 --> 01:08:16,798 loss function. So, as you're moving 2024 01:08:14,920 --> 01:08:18,359 around, you're actually jumping from 2025 01:08:16,798 --> 01:08:20,359 local minima to local minima of the 2026 01:08:18,359 --> 01:08:22,039 actual loss function. 2027 01:08:20,359 --> 01:08:24,039 I know that's a mouthful. I'm happy to 2028 01:08:22,039 --> 01:08:25,319 tell you more. It's just a side thing 2029 01:08:24,039 --> 01:08:26,560 that I just wanted you to be aware of. 2030 01:08:25,319 --> 01:08:27,960 Okay? 2031 01:08:26,560 --> 01:08:30,640 One of the reasons why SGD is actually 2032 01:08:27,960 --> 01:08:33,838 effective. It's almost like you work 2033 01:08:30,640 --> 01:08:33,838 less and you do better. 2034 01:08:34,000 --> 01:08:38,159 How many times does it happen in life? 2035 01:08:35,680 --> 01:08:38,159 This is one of them. 2036 01:08:39,520 --> 01:08:44,359 Okay? Now, SGD comes in many flavors. 
2037 01:08:42,798 --> 01:08:45,680 Uh many siblings. It's got a lot of 2038 01:08:44,359 --> 01:08:47,520 siblings and variations. It's a big 2039 01:08:45,680 --> 01:08:49,838 family. Uh and we're going to use a 2040 01:08:47,520 --> 01:08:52,040 particular flavor called Adam 2041 01:08:49,838 --> 01:08:53,159 as our default in this course and I'll 2042 01:08:52,039 --> 01:08:56,000 get back to it when we get into the 2043 01:08:53,159 --> 01:08:57,119 co-labs and things like that. 2044 01:08:56,000 --> 01:08:58,159 All right. 2045 01:08:57,119 --> 01:09:00,039 Um 2046 01:08:58,159 --> 01:09:01,519 By the way 2047 01:09:00,039 --> 01:09:02,600 you know how you know all these pictures 2048 01:09:01,520 --> 01:09:04,600 I've been showing you a nice little 2049 01:09:02,600 --> 01:09:05,440 function like that, a little bowl and so 2050 01:09:04,600 --> 01:09:07,359 on. 2051 01:09:05,439 --> 01:09:08,960 This is a visualization 2052 01:09:07,359 --> 01:09:11,400 of an actual neural network loss 2053 01:09:08,960 --> 01:09:12,838 function. 2054 01:09:11,399 --> 01:09:14,920 You can see like the hills and valleys 2055 01:09:12,838 --> 01:09:16,798 and the cracks and so on and so forth. 2056 01:09:14,920 --> 01:09:18,600 Okay? And you can check out the paper to 2057 01:09:16,798 --> 01:09:19,359 get more insight into how they actually, 2058 01:09:18,600 --> 01:09:21,680 you know, came up with this 2059 01:09:19,359 --> 01:09:24,280 visualization. It's crazy. 2060 01:09:21,680 --> 01:09:25,520 It's complicated. 2061 01:09:24,279 --> 01:09:28,439 Yep. 2062 01:09:25,520 --> 01:09:30,920 So, for for SGD, do you perform the 2063 01:09:28,439 --> 01:09:32,599 iterations until you minimize the loss 2064 01:09:30,920 --> 01:09:34,440 function for each mini batch and then 2065 01:09:32,600 --> 01:09:36,520 move to another mini batch? Yeah, so 2066 01:09:34,439 --> 01:09:37,719 what you do is you take each mini batch 2067 01:09:36,520 --> 01:09:39,440 and then 2068 01:09:37,720 --> 01:09:41,560 you calculate the loss for the mini 2069 01:09:39,439 --> 01:09:43,679 batch, you find the gradient. 2070 01:09:41,560 --> 01:09:45,319 And use the gradient and update the W. 2071 01:09:43,680 --> 01:09:47,119 Then you pick up the next mini batch. So 2072 01:09:45,319 --> 01:09:48,920 you don't you don't pick a mini batch 2073 01:09:47,119 --> 01:09:50,920 and try to perform the iterations on 2074 01:09:48,920 --> 01:09:52,838 that mini batch until you reach the 2075 01:09:50,920 --> 01:09:54,840 You Each mini batch, one iteration. Each 2076 01:09:52,838 --> 01:09:56,359 mini batch, one iteration. Because if 2077 01:09:54,840 --> 01:09:57,600 you do a lot of iterations on one mini 2078 01:09:56,359 --> 01:09:58,759 batch, 2079 01:09:57,600 --> 01:09:59,640 first of all, you'll never be sure that 2080 01:09:58,760 --> 01:10:00,960 you're going to find any optimal 2081 01:09:59,640 --> 01:10:03,079 solution because you're not guaranteed 2082 01:10:00,960 --> 01:10:04,039 of any global minima. And secondly, it's 2083 01:10:03,079 --> 01:10:05,960 much better for you to get new 2084 01:10:04,039 --> 01:10:07,399 information constantly because what you 2085 01:10:05,960 --> 01:10:09,439 can do is you can revisit that mini 2086 01:10:07,399 --> 01:10:10,799 batch later on. 2087 01:10:09,439 --> 01:10:13,039 Right? 
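To make the "one update per mini-batch, then move on" answer concrete, here is a toy sketch of the loop, with a made-up linear model rather than the heart-disease network; an epoch is simply one full pass over all the mini-batches, so each mini-batch does get revisited, just on later passes.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size):
    # shuffle once per pass, then hand out consecutive chunks of rows
    idx = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        take = idx[start:start + batch_size]
        yield X[take], y[take]

# toy data: a linear relationship we want the weights to recover
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=1000)

W = np.zeros(5)
alpha = 0.05
for epoch in range(20):                                    # each epoch revisits every mini-batch
    for X_batch, y_batch in iterate_minibatches(X, y, 32):
        grad = 2.0 * X_batch.T @ (X_batch @ W - y_batch) / len(X_batch)
        W = W - alpha * grad                               # one update per mini-batch, then move on

print(W)   # ends up close to w_true
```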
And that gets into these things 2088 01:10:10,800 --> 01:10:14,239 called epochs and batch size and so on, 2089 01:10:13,039 --> 01:10:16,359 which we'll get into a lot of gory 2090 01:10:14,239 --> 01:10:17,880 detail when we do the collab. 2091 01:10:16,359 --> 01:10:20,359 So let's revisit that question. It's a 2092 01:10:17,880 --> 01:10:20,359 good question. 2093 01:10:20,439 --> 01:10:25,439 Yeah. 2094 01:10:22,520 --> 01:10:26,880 When you do the backprop process, Very 2095 01:10:25,439 --> 01:10:27,960 good. Backprop. Not backpropagation. 2096 01:10:26,880 --> 01:10:29,039 Nice. I made sure. 2097 01:10:27,960 --> 01:10:30,840 >> Yes. 2098 01:10:29,039 --> 01:10:32,760 Well, it's it sounded like you started 2099 01:10:30,840 --> 01:10:35,159 from the layers that were closest to the 2100 01:10:32,760 --> 01:10:36,920 output and you went backward. Okay. And 2101 01:10:35,159 --> 01:10:39,479 um my question is are you doing that 2102 01:10:36,920 --> 01:10:39,760 once or is it looping multiple times and 2103 01:10:39,479 --> 01:10:42,439 then 2104 01:10:39,760 --> 01:10:44,600 >> do it once. Just once. Yeah. So for each 2105 01:10:42,439 --> 01:10:45,960 gradient calculation, you do it once. 2106 01:10:44,600 --> 01:10:47,680 Why does it Why does it want to start 2107 01:10:45,960 --> 01:10:48,560 from the layer that's closest or why do 2108 01:10:47,680 --> 01:10:49,800 you want to start it from the layer 2109 01:10:48,560 --> 01:10:51,280 that's closest to the output? 2110 01:10:49,800 --> 01:10:53,239 >> Yeah. So basically what happens is let's 2111 01:10:51,279 --> 01:10:54,920 say that just for argument that you go 2112 01:10:53,239 --> 01:10:56,800 go in the reverse direction. 2113 01:10:54,920 --> 01:10:58,279 You will discover that a lot of paths to 2114 01:10:56,800 --> 01:10:59,960 go from the left to the right will end 2115 01:10:58,279 --> 01:11:02,439 up calculating certain intermediate 2116 01:10:59,960 --> 01:11:04,720 quantities including the very final 2117 01:11:02,439 --> 01:11:06,559 gradient sort of item 2118 01:11:04,720 --> 01:11:07,760 again and again and again. 2119 01:11:06,560 --> 01:11:09,280 Same thing is going to get calculated 2120 01:11:07,760 --> 01:11:10,520 again and again and again. So by 2121 01:11:09,279 --> 01:11:12,159 starting from the end and working 2122 01:11:10,520 --> 01:11:14,320 backwards, you just reuse stuff you've 2123 01:11:12,159 --> 01:11:15,920 already calculated. 2124 01:11:14,319 --> 01:11:17,960 So that is sort of the rough idea. But 2125 01:11:15,920 --> 01:11:19,440 if you see my PDF, I've actually worked 2126 01:11:17,960 --> 01:11:22,399 out the example and you and that will 2127 01:11:19,439 --> 01:11:22,399 demonstrate what I'm talking about. 2128 01:11:23,359 --> 01:11:28,319 By the way, this gradient the backprop 2129 01:11:25,119 --> 01:11:28,319 is just a sort of a 2130 01:11:28,600 --> 01:11:31,760 Like in calculus, we have something 2131 01:11:29,920 --> 01:11:32,600 called the chain rule. 2132 01:11:31,760 --> 01:11:34,400 To calculate the derivative of a 2133 01:11:32,600 --> 01:11:35,960 complicated function, you calculate the 2134 01:11:34,399 --> 01:11:37,479 calculate derivative of like the outer 2135 01:11:35,960 --> 01:11:39,239 function then the inner function and so 2136 01:11:37,479 --> 01:11:40,799 on and so forth. 
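For reference, here is the chain rule being described, applied to one made-up composite function and checked numerically.

```python
import numpy as np

# f(x) = sin(x**2): outer function sin(u), inner function u = x**2
x = 1.5

outer_grad = np.cos(x ** 2)          # derivative of the outer function, evaluated at the inner value
inner_grad = 2.0 * x                 # derivative of the inner function
analytic = outer_grad * inner_grad   # chain rule: multiply them

# sanity check with a tiny finite difference
h = 1e-6
numeric = (np.sin((x + h) ** 2) - np.sin((x - h) ** 2)) / (2 * h)
print(analytic, numeric)             # the two agree
```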
The backprop is 2137 01:11:39,239 --> 01:11:42,840 essentially a way to organize the chain 2138 01:11:40,800 --> 01:11:46,279 rule to work with the neural network 2139 01:11:42,840 --> 01:11:46,279 layer-by-layer architecture. That's all. 2140 01:11:49,520 --> 01:11:54,120 So is it Is it fair to say that once we 2141 01:11:51,960 --> 01:11:56,560 are finding like the local minimum, we 2142 01:11:54,119 --> 01:11:58,079 are not optimizing to all the GWs 2143 01:11:56,560 --> 01:11:59,400 because like this local minimum is 2144 01:11:58,079 --> 01:12:01,239 coming like from different curves, from 2145 01:11:59,399 --> 01:12:02,920 different lines. So 2146 01:12:01,239 --> 01:12:04,760 Is that fair to say? When we are using 2147 01:12:02,920 --> 01:12:06,640 stochastic gradient descent, yes. So for 2148 01:12:04,760 --> 01:12:09,360 in stochastic gradient descent, when you 2149 01:12:06,640 --> 01:12:10,880 take say 32 data points from a million 2150 01:12:09,359 --> 01:12:12,960 and you're calculating the loss for that 2151 01:12:10,880 --> 01:12:14,880 32 data points, you're basically trying 2152 01:12:12,960 --> 01:12:17,039 to do a gradient step. 2153 01:12:14,880 --> 01:12:20,000 Right? The W equals W minus alpha 2154 01:12:17,039 --> 01:12:22,680 gradient thing. You're doing it for that 2155 01:12:20,000 --> 01:12:24,720 that 32 points loss function. 2156 01:12:22,680 --> 01:12:25,840 Right? Which is not the 1 million points 2157 01:12:24,720 --> 01:12:27,680 loss function. 2158 01:12:25,840 --> 01:12:29,279 That's why it's approximate. 2159 01:12:27,680 --> 01:12:31,640 But the approximation, instead of 2160 01:12:29,279 --> 01:12:33,719 hurting you, actually helps you because 2161 01:12:31,640 --> 01:12:35,640 it helps you escape the local minima of 2162 01:12:33,720 --> 01:12:37,000 the global loss function. 2163 01:12:35,640 --> 01:12:38,640 So it's it's sort of an interesting and 2164 01:12:37,000 --> 01:12:40,159 somewhat technically subtle point, which 2165 01:12:38,640 --> 01:12:41,920 is why I'm not getting into it too much, 2166 01:12:40,159 --> 01:12:44,119 but I'm happy to give pointers if people 2167 01:12:41,920 --> 01:12:45,680 are interested. Yeah? 2168 01:12:44,119 --> 01:12:47,319 Uh when you say you initialize the 2169 01:12:45,680 --> 01:12:50,039 weights, you initialize for the whole 2170 01:12:47,319 --> 01:12:51,119 network or just the end layer and then 2171 01:12:50,039 --> 01:12:52,119 go backwards like you 2172 01:12:51,119 --> 01:12:53,880 >> No, you initialize everything in one 2173 01:12:52,119 --> 01:12:54,840 shot. 2174 01:12:53,880 --> 01:12:55,960 Because if you don't initialize 2175 01:12:54,840 --> 01:12:57,760 everything in one shot, what's going to 2176 01:12:55,960 --> 01:12:58,960 happen is that you can't do like the 2177 01:12:57,760 --> 01:13:00,560 forward computation to find the 2178 01:12:58,960 --> 01:13:02,720 prediction. 2179 01:13:00,560 --> 01:13:05,080 Uh and so they are done independently 2180 01:13:02,720 --> 01:13:07,159 and the initialization schemes will take 2181 01:13:05,079 --> 01:13:08,680 into account, okay, I'm initializing the 2182 01:13:07,159 --> 01:13:10,720 weights between a layer which has 10 2183 01:13:08,680 --> 01:13:12,280 nodes and on one side and 32 on the 2184 01:13:10,720 --> 01:13:13,240 other side and the 10 and the 32 2185 01:13:12,279 --> 01:13:15,800 actually play a role in how you 2186 01:13:13,239 --> 01:13:15,800 initialize. 2187 01:13:15,960 --> 01:13:19,960 Okay. 
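As a rough illustration, using the 10-and-32 sizes from the answer rather than the course network: a He-style initialization scales random weights by the square root of 2 over the fan-in, and in Keras you would normally just accept the layer's default initializer (a Glorot scheme for Dense layers) or name one explicitly.

```python
import numpy as np
from tensorflow import keras

# By hand: He-style initialization for the weights between a 10-node layer and a 32-node layer.
fan_in, fan_out = 10, 32
W = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)   # the fan-in sets the scale

# In Keras, just take the default, or name a scheme explicitly if you want to:
layer = keras.layers.Dense(32, activation="relu", kernel_initializer="he_normal")
```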
So, the summary of the 2188 01:13:18,279 --> 01:13:22,840 overall training flow 2189 01:13:19,960 --> 01:13:24,359 is that, you know, you have an input. 2190 01:13:22,840 --> 01:13:26,079 It goes through a bunch of layers. You 2191 01:13:24,359 --> 01:13:28,319 come up with a prediction. You compare 2192 01:13:26,079 --> 01:13:29,600 it to the true values, and these two 2193 01:13:28,319 --> 01:13:31,679 things go into the loss function 2194 01:13:29,600 --> 01:13:33,600 calculation. You get a loss number. 2195 01:13:31,680 --> 01:13:35,480 Right? And you do it for, say, 10 points 2196 01:13:33,600 --> 01:13:38,000 or 32 points or a million points. And 2197 01:13:35,479 --> 01:13:39,959 this loss thing goes into the optimizer, 2198 01:13:38,000 --> 01:13:41,640 which calculates the gradient. And once 2199 01:13:39,960 --> 01:13:44,159 it calculates the gradient, it updates 2200 01:13:41,640 --> 01:13:45,880 the weights of every layer using the W 2201 01:13:44,159 --> 01:13:47,760 equals W minus alpha times gradient 2202 01:13:45,880 --> 01:13:48,920 formula, the gradient descent formula. And 2203 01:13:47,760 --> 01:13:50,440 then you keep doing this again and 2204 01:13:48,920 --> 01:13:53,000 again and again. 2205 01:13:50,439 --> 01:13:54,439 This is the overall flow. 2206 01:13:53,000 --> 01:13:56,359 This is how our little network is going 2207 01:13:54,439 --> 01:14:00,039 to get built for heart disease 2208 01:13:56,359 --> 01:14:00,039 prediction. This is how GPT-4 was built. 2209 01:14:00,720 --> 01:14:04,240 And this is how AlphaFold was built. 2210 01:14:02,720 --> 01:14:06,720 And AlphaGo was built. 2211 01:14:04,239 --> 01:14:06,719 You get the idea. 2212 01:14:07,359 --> 01:14:10,799 I mean, it's astonishing, frankly. 2213 01:14:09,479 --> 01:14:12,359 If you're not getting goosebumps at the 2214 01:14:10,800 --> 01:14:14,239 thought that this simple thing can do 2215 01:14:12,359 --> 01:14:17,159 all these complicated things, we really 2216 01:14:14,239 --> 01:14:20,359 need to talk offline. 2217 01:14:17,159 --> 01:14:23,119 Uh there was a hand raised here. Yeah. 2218 01:14:20,359 --> 01:14:25,759 Sorry. Just quickly, this is for each 2219 01:14:23,119 --> 01:14:27,159 mini batch, right? So 2220 01:14:25,760 --> 01:14:28,680 my question is, if you came up with a 2221 01:14:27,159 --> 01:14:30,199 different weight for each mini batch, 2222 01:14:28,680 --> 01:14:31,520 how do you 2223 01:14:30,199 --> 01:14:33,800 add it up? 2224 01:14:31,520 --> 01:14:35,400 Like, okay, this weight is the 2225 01:14:33,800 --> 01:14:37,880 perfect combination for this mini batch, 2226 01:14:35,399 --> 01:14:39,559 but you have a different 2227 01:14:37,880 --> 01:14:41,560 weight for another mini batch. How do 2228 01:14:39,560 --> 01:14:43,360 you combine those two? No. 2229 01:14:41,560 --> 01:14:45,400 At each step, what you do is 2230 01:14:43,359 --> 01:14:46,519 you start with 2231 01:14:45,399 --> 01:14:48,000 a weight. 2232 01:14:46,520 --> 01:14:49,320 You run it through for a mini batch. You 2233 01:14:48,000 --> 01:14:50,680 come up with the loss function. You 2234 01:14:49,319 --> 01:14:51,880 calculate the gradient. 2235 01:14:50,680 --> 01:14:53,159 And now, using the gradient, you've 2236 01:14:51,880 --> 01:14:54,159 updated the weight. Now you have a new 2237 01:14:53,159 --> 01:14:55,559 set of weights, right? Which is the 2238 01:14:54,159 --> 01:14:57,680 updated weights.
Call it 2239 01:14:55,560 --> 01:14:59,480 W2 instead of W1. 2240 01:14:57,680 --> 01:15:00,680 Now W2 is is your network and when you 2241 01:14:59,479 --> 01:15:03,559 take the next mini batch, it's going to 2242 01:15:00,680 --> 01:15:05,240 use W2 to calculate the prediction. 2243 01:15:03,560 --> 01:15:08,800 And this this whole flow will become a 2244 01:15:05,239 --> 01:15:11,840 lot clearer when we do the collabs. 2245 01:15:08,800 --> 01:15:13,360 Okay. So we have 3 minutes. 2246 01:15:11,840 --> 01:15:15,720 I don't want to go into 2247 01:15:13,359 --> 01:15:19,039 regularization overfitting in 3 minutes. 2248 01:15:15,720 --> 01:15:19,039 So let's have some more questions. 2249 01:15:19,680 --> 01:15:22,600 Yeah. 2250 01:15:20,640 --> 01:15:25,200 Can you use any activation function as 2251 01:15:22,600 --> 01:15:26,760 long as it gives like positive values? 2252 01:15:25,199 --> 01:15:29,679 For like X squared or mod X or 2253 01:15:26,760 --> 01:15:31,400 something. Um you can use a variety of 2254 01:15:29,680 --> 01:15:33,320 activation functions. 2255 01:15:31,399 --> 01:15:35,519 Um 2256 01:15:33,319 --> 01:15:37,319 There is uh but yeah, there's a whole 2257 01:15:35,520 --> 01:15:38,640 literature on, you know, the pros and 2258 01:15:37,319 --> 01:15:39,840 cons of various activation functions 2259 01:15:38,640 --> 01:15:42,520 that you could use. 2260 01:15:39,840 --> 01:15:44,760 But in general, you have to make sure of 2261 01:15:42,520 --> 01:15:46,880 a couple of things. One is that when you 2262 01:15:44,760 --> 01:15:48,360 do backprop, 2263 01:15:46,880 --> 01:15:49,520 the gradient is going to flow through 2264 01:15:48,359 --> 01:15:50,639 the activation function in the reverse 2265 01:15:49,520 --> 01:15:52,200 direction. 2266 01:15:50,640 --> 01:15:53,720 And the activation function should 2267 01:15:52,199 --> 01:15:55,439 actually sort of make sure the gradient 2268 01:15:53,720 --> 01:15:56,800 doesn't get squished. 2269 01:15:55,439 --> 01:15:58,559 It shouldn't get squished. It shouldn't 2270 01:15:56,800 --> 01:16:00,199 get exploded. 2271 01:15:58,560 --> 01:16:01,280 So those are some considerations and 2272 01:16:00,199 --> 01:16:02,760 these are technical considerations, but 2273 01:16:01,279 --> 01:16:04,239 those all those considerations have to 2274 01:16:02,760 --> 01:16:07,000 be taken into account. If you can take 2275 01:16:04,239 --> 01:16:08,039 those into account, then you're okay. 2276 01:16:07,000 --> 01:16:08,960 That's sort of the key thing to keep in 2277 01:16:08,039 --> 01:16:10,479 mind. 2278 01:16:08,960 --> 01:16:11,920 And that's in fact why the ReLU is 2279 01:16:10,479 --> 01:16:13,319 actually very popular 2280 01:16:11,920 --> 01:16:15,640 because as long as the value is 2281 01:16:13,319 --> 01:16:18,000 positive, the gradient of the ReLU is 2282 01:16:15,640 --> 01:16:20,640 just one. Right? 2283 01:16:18,000 --> 01:16:20,640 Uh because 2284 01:16:22,680 --> 01:16:26,600 So if you look at something 2285 01:16:24,239 --> 01:16:26,599 Oops. 2286 01:16:28,720 --> 01:16:31,920 Was it frozen? 2287 01:16:30,359 --> 01:16:34,880 I jinxed it. 2288 01:16:31,920 --> 01:16:37,399 So sorry, livestream. 2289 01:16:34,880 --> 01:16:39,880 If you have something like this, 2290 01:16:37,399 --> 01:16:41,719 the ReLU is like that, right? 2291 01:16:39,880 --> 01:16:43,480 So the gradient here 2292 01:16:41,720 --> 01:16:44,560 is always going to be one. 
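The ReLU and its gradient, written out as a quick sketch:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    # 1 wherever the input is positive, 0 elsewhere
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.3, 4.0])
print(relu(z))       # 0, 0, 0.3, 4.0
print(relu_grad(z))  # 0, 0, 1, 1
```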
2293 01:16:43,479 --> 01:16:46,279 Which means that as long as the value is 2294 01:16:44,560 --> 01:16:47,960 positive, whatever gradient comes in 2295 01:16:46,279 --> 01:16:49,000 like this, it just like gets multiplied 2296 01:16:47,960 --> 01:16:50,960 by one and gets pushed out the other 2297 01:16:49,000 --> 01:16:52,840 side. So it doesn't get it doesn't get 2298 01:16:50,960 --> 01:16:55,399 harmed or squished or anything like 2299 01:16:52,840 --> 01:16:57,119 that. Um so that's one reason why the 2300 01:16:55,399 --> 01:16:59,239 ReLU is very popular because it 2301 01:16:57,119 --> 01:17:00,640 preserves the gradient while injecting 2302 01:16:59,239 --> 01:17:04,519 almost like the minimum amount of 2303 01:17:00,640 --> 01:17:04,520 non-linearity to do interesting things. 2304 01:17:04,760 --> 01:17:10,280 Um yeah. 2305 01:17:07,520 --> 01:17:13,080 If you have a high number of dimensions, 2306 01:17:10,279 --> 01:17:14,920 can you do mini batching on like 2307 01:17:13,079 --> 01:17:17,119 features dimensions instead of just 2308 01:17:14,920 --> 01:17:19,840 observations and keep the same number of 2309 01:17:17,119 --> 01:17:21,760 observations, but just take a small 2310 01:17:19,840 --> 01:17:24,000 sample of the number of features that 2311 01:17:21,760 --> 01:17:25,760 you're actually using? Oh, I see. I see. 2312 01:17:24,000 --> 01:17:27,039 So you're saying let's say you have 10 2313 01:17:25,760 --> 01:17:28,720 features. 2314 01:17:27,039 --> 01:17:31,000 Um instead of taking all data points of 2315 01:17:28,720 --> 01:17:33,640 10 features, what if you have choose 2316 01:17:31,000 --> 01:17:34,920 five features and just use them and do 2317 01:17:33,640 --> 01:17:36,760 the thing 2318 01:17:34,920 --> 01:17:38,520 as long as you can actually compute the 2319 01:17:36,760 --> 01:17:39,840 prediction. 2320 01:17:38,520 --> 01:17:41,600 To compute the prediction, you may need 2321 01:17:39,840 --> 01:17:43,239 all 10 features. 2322 01:17:41,600 --> 01:17:44,720 Right? Or you need to have some defaults 2323 01:17:43,239 --> 01:17:46,800 for those features. 2324 01:17:44,720 --> 01:17:48,560 And by if you define defaults for those 2325 01:17:46,800 --> 01:17:50,520 other five features, you're basically 2326 01:17:48,560 --> 01:17:51,400 using all all features. 2327 01:17:50,520 --> 01:17:53,400 So that's the key thing. Can you 2328 01:17:51,399 --> 01:17:55,079 actually calculate the prediction 2329 01:17:53,399 --> 01:17:57,399 by manipulating? And typically, you 2330 01:17:55,079 --> 01:17:57,399 can't. 2331 01:17:57,840 --> 01:18:00,960 All right? 2332 01:17:58,960 --> 01:18:02,439 Okay, folks. 9:55. I'm done. Have a 2333 01:18:00,960 --> 01:18:04,800 great rest of your week. I'll see you on 2334 01:18:02,439 --> 01:18:04,799 Monday.