Okay. All right. Let's get going. [clears throat] Today is going to be packed. I'm going to spend roughly the first half of the lecture actually building a model, a Keras model, in Colab to solve the heart disease problem we saw earlier, and then switch gears halfway and talk about how to solve image classification. Okay, so we're going to do two Colabs today. I've been talking about Colab, right? I've been teasing you; we'll actually do Colabs today. All right. By the way, I've shut off the lights at the top, because when I switch to Colab it's going to be much better for you folks, particularly the folks in the back, to be able to see it. Okay, but I hope you can see the slide right now. Yes? Okay, great.

So this is just a quick recap of what we did last class. Broadly speaking, training a neural network is essentially no different than training other kinds of models. We have a bunch of parameters, i.e., weights and biases, and we need to use the data to find good values of those weights. And what does "good" mean? Typically it means that we define some measure of discrepancy between what the model predicts for a given set of weights and what the right answer is, the ground truth answer, and then we try to find weights that minimize this discrepancy. That's it. And this notion of a discrepancy is called a loss function.

So, broadly speaking, the overall training flow is that you define some network. It has an input; it goes through a bunch of layers; you come up with some predictions. You take the predictions, you take the true values, and those two go into the loss function, i.e., the discrepancy function, and you come up with the loss score. Then you send it to the optimizer, which proceeds to calculate the gradient of this loss function with respect to all the parameters and then updates all the weights using that gradient, and then this process repeats. That's it. That is the training flow. Okay, quick recap.
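To make that loop concrete, here is a tiny runnable toy with a single weight. All the numbers are made up; the point is only the predict, score, differentiate, update rhythm described above:

```python
# Toy training flow: one weight w, prediction = w * x, loss = (prediction - truth)^2.
w, x, truth, learning_rate = 0.0, 2.0, 6.0, 0.1
for step in range(3):
    prediction = w * x                       # forward pass through the "network"
    loss = (prediction - truth) ** 2         # the discrepancy score
    gradient = 2 * (prediction - truth) * x  # d(loss)/dw by the chain rule
    w = w - learning_rate * gradient         # the optimizer's weight update
    print(step, round(w, 3), round(loss, 3)) # w heads toward 3.0, loss toward 0
```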
Now, we also talked about the optimization algorithm we're going to use, which is called gradient descent. And in gradient descent, as you noticed, in each iteration every data point is used to make predictions, and therefore to calculate the loss, and then to calculate the gradient. And then we pointed out that gradient descent is actually not as good as something called stochastic gradient descent, where instead of taking all the points, we just randomly choose a small number of points, pretend for a moment as if those are the only points we have, make predictions, calculate loss, calculate gradient, and go on. So that was the basic idea behind stochastic gradient descent, right? Two different kinds of things.

Now, what this means is that when we actually start training the model, as we will in a few minutes, because we only take a few points at a time, we have to be a bit careful about what's going on. And I want to make sure you clearly understand what the differences are before we actually get to the Colab. Okay.

All right. So there is the notion of an epoch. An epoch essentially just means that we make one pass through the training data. All the training data, we make one pass through it. Okay. And what is one pass? If you have something like gradient descent, one pass means every data point is sent through the network. We calculate its predictions, calculate the loss, calculate the gradient, right? We run every training sample through it.
We calculate the gradient, which is just this thing here, right? I will sometimes write it as dLoss/dw, the derivative of the loss with respect to w; sometimes I might use the nabla symbol, ∇Loss. These are all interchangeable. Okay, so we calculate the gradient and then we update using some version of this. But we just do it once, at the end of the epoch: if you have 10 billion data points, every one of them flows through, you get 10 billion outputs, and then at the end of all of it we calculate the gradient and update once. One update per epoch. Yes.

Now, in stochastic gradient descent, what we do is process the data in batches, small numbers of points at a time, right? Technically speaking, these are called mini-batches. I don't know about you, but I just get tired of saying "mini-batches"; I'm just going to say "batches" from this point on, and in fact that is widely done in the literature. So we'll process the data in batches. We take the training data and divide it up into batches: batch one, batch two, all the way to the final batch. And for each batch, we basically do gradient descent: we take batch one, run just the training samples in that batch through the network to get predictions, calculate the gradient, update the parameters, and then we go to batch two, then batch three, and so on and so forth.
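Here is a rough numpy sketch of one epoch of that per-batch loop, using a made-up linear model and squared-error loss purely for illustration:

```python
import numpy as np

# Toy setup: linear model y = X @ w_true, squared-error loss; illustrative only.
rng = np.random.default_rng(42)
X = rng.normal(size=(194, 3))               # 194 samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5])          # made-up ground-truth targets
w = np.zeros(3)                             # weights to learn
learning_rate, batch_size = 0.1, 32

for start in range(0, len(X), batch_size):  # one epoch, batch by batch
    Xb = X[start:start + batch_size]        # just this batch's rows
    yb = y[start:start + batch_size]
    preds = Xb @ w                          # forward pass on the batch only
    grad = 2 * Xb.T @ (preds - yb) / len(Xb)  # gradient of mean squared error
    w -= learning_rate * grad               # weights change before the next batch
```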
So pictorially, this is how it's going to look. Let's say the first batch is, say, 32 points. We take those 32 points, run them through the network, get all the stuff out, calculate the gradient, and update the weights. So when we now get to batch two, the weights have changed; they have been updated. And then we do the same thing for batch two, batch three, all the way until we get to the end. And when we are done with this, this whole thing is called a what?

An epoch. [clears throat] This whole thing is an epoch. Okay.

All right. Now, the question of course is: if you have a bunch of data points and you're going to run stochastic gradient descent on them in a particular epoch, how many batches are going to be there? Okay, how many batches are going to be there? Now, Keras is going to calculate all this stuff. You don't have to worry about it, but you just need to understand exactly what happens. Okay. My philosophy, by the way, is that you have to know the details of what's going on. If you don't know the details, if you haven't figured them out at least once, you will not actually be able to think new and creative thoughts for a new problem. Okay? It's because the concepts are not manipulable in your head yet. Okay.

Please use the microphone.

>> So when we talk about SGD, we're talking about only taking some part of it. Are we saying that we only take some variables, or that we only take some part of the data?

>> We are taking some rows.

>> Okay, only some rows, right. So those data points, that means a batch.

>> Exactly. So for example, let's say you have a thousand data points, right? A thousand rows of observations: a thousand patients in the heart disease example, or a thousand images that you're trying to classify.
You take, let's say, 32 of those images, 32 of those patients, and that's a batch. Then you go to the next 32, then the next 32, and so on and so forth, until you run out of patients or run out of images.

>> And each iteration you are updating with the new weights that you've got?

>> You're basically updating the weights as you go.

>> And it means you keep correcting it, or keep moving towards...

>> Updating the weights, yes.

>> And is what we're calling the epoch ultimately the equation of the loss function that we are trying to do?

>> No. An epoch... see, the thing to remember is that here, this whole thing is called an epoch, because we have to do one full pass through the training data. Okay? But within that epoch we update the weights many times. Basically, we update the weights as many times as we have batches.

All right. Um, so to go here: basically the idea is that you take the training set and divide it by the batch size, and you choose the batch size. Okay, you choose the batch size, and we'll talk later about how you choose it. Once you choose the size, just divide and round up. So for example, as you will see in the Colab, the training set is going to be 194 patients, and we're going to choose a batch size of 32. We typically tend to choose batch sizes like 32 or 64, because they align very well with the nature of the parallel hardware we're going to use. Okay. And so here, 32 and so on. So divide 194 by 32: you get 6-point-something. You round it up to seven. Okay. And what that means is that the first six batches will have 32 samples each, and then the final batch has only two samples left. And that's okay; it can be a nice little small batch at the end. There's nothing that says every batch has to be the same size.
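That round-up is just a ceiling division. As a quick sanity check:

```python
import math

n_samples, batch_size = 194, 32
n_batches = math.ceil(n_samples / batch_size)          # 194/32 = 6.06..., rounds up to 7
last_batch = n_samples - (n_batches - 1) * batch_size  # 194 - 6*32 = 2 samples
print(n_batches, last_batch)                           # 7 2
```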
>> That's it: epochs and batches.

>> And for each batch, do you run through the whole network, like all the layers, or is each layer one batch?

>> No. For a batch, you run it through the entire network. The way I think about it is that you take a batch, and momentarily you assume that's all the data you have, and you just run it through the network. Because unless you run it through every layer of the network, you can't get a prediction; and unless you get a prediction, you can't calculate the loss; and unless you calculate the loss, you can't calculate the gradient; and unless you calculate the gradient, you can't update the weights.

>> Last thing: but if you're using all the data, just doing gradient descent, then you just go through the network once, right?

>> Exactly. So in gradient descent, one epoch is one pass and one weight update. In stochastic gradient descent, the number of updates you make is equal to the number of batches you have, which ends up being the training set size divided by the batch size, rounded up.

>> So just to confirm: initially when we introduced the concept of batches, the whole purpose was to not run through all the data and to be able to make some prediction from a subset. So now the advantage is that after batch one, we are using more accurate coefficients to run through batch two, and so on. Is that really the advantage, or is there something else to it?

>> Perfectly said. That's exactly the advantage. We take a small amount of data and we say: hey, we know this is not all the data, it's just a small subset, so it's not going to be super accurate. It's going to be approximate, but that's okay.
We'll still tend to move in the right direction. So instead of waiting for the whole thing to get done and then updating, we're just going to update as we go along.

All right. Uh, yes?

>> Building on her question: is it that doing this process for SGD will render us a better solution, or that it requires less compute power?

>> Both. Both, and the reasons for both are in the previous lecture. I'm not repeating them just because I'm very pressed for time today. All right, cool. So that's what we have. Are we good?

Okay, so now we come to the last step before we actually fire up the Colab, which is overfitting and regularization. If you remember from your machine learning background: when your model gets more and more complex, right, you use a simple model, then a more complex model, and so on and so forth, what happens to the error on the training data? Typically, what happens to the error on the training data? Let's say you have a simple regression model and you get some error. Then you have a regression model in which you use all kinds of interaction terms, you use logarithms and this and that and make it super complicated. What do you think is going to happen to the error on the training data?

Right. Basically, it's going to go down as the model gets more complex. Correct. Now of course comes the punch line, which is: what do you think is going to happen to the error on the test data? I showed you the answer.

Right. Basically, what's going to happen, typically, at least conceptually, is that it's going to get better and better, at some point it's going to bottom out, and then it's going to start climbing again.
And so we typically refer to this phenomenon here, when it starts to climb again, as overfitting, because the model is essentially fitting to the idiosyncrasies of the training data as opposed to generalizing patterns. And in this region we call it underfitting, because there is still a lot of potential to improve. We really are hoping to find the sweet spot in the middle, right? That's the basic idea of overfitting and underfitting.

And to relate this to neural networks: as you've learned so far, you have to learn smart representations of the input data, and to do that, I have argued that you need lots of layers in your network. The more layers you have, the better things get. GPT-3, for example, has 96 layers, if I recall right. More layers the better, but more layers means more parameters, more parameters means more complexity in the model, and therefore more chance of overfitting. Okay?

So it's really important in neural networks that we think about regularization. And regularization, you will recall from your machine learning background, is the way we handle the risk of overfitting and try to find models that fit just right. Okay. Several regularization methods have been developed over the years, and we are going to use only two of them. The first one is called early stopping,
and this has been famously referred to by Geoff Hinton, who is one of the pioneers, or as he's more colorfully known, one of the godfathers of deep learning, and who also won the Turing Award a few years ago, as sort of a beautiful free lunch. That's what he calls it. The idea is very simple. We take the training data and split it into a training set and a validation set, and then we just keep doing gradient descent. The training error will hopefully keep getting better and better, lower and lower, and we keep track of what's going on in the validation set. At some point, if the validation error starts to flatten out and starts to climb, we just say: okay, that's when we stop training. Right?

And what we're going to do in the Colab is actually run it through the whole thing, see where it flattens out, and then say: okay, that's where we should have stopped. But of course, you don't want to go all the way to the end and then go back and say, well, I want to stop at the 10th epoch; there are ways you can use Keras to be very efficient about this. The fundamental idea is: you take the training data, split it into training and validation, and just track what's going on in the validation set to see whether this kind of bottoming out happens. Okay. So this is called early stopping; we're looking for this part.
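In Keras, the efficient version of this looks roughly like the sketch below. The callback and its arguments are the real Keras API, but the surrounding names (a compiled `model`, arrays `X_train` and `y_train`) are assumptions standing in for what the Colab builds:

```python
# Stop training automatically when the validation loss stops improving,
# instead of running to the end and rewinding by hand.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch the held-out validation loss
    patience=10,                 # tolerate 10 epochs with no improvement
    restore_best_weights=True,   # roll back to the best epoch's weights
)
history = model.fit(
    X_train, y_train,
    validation_split=0.2,        # carve out the validation set
    epochs=200, batch_size=32,
    callbacks=[early_stop],
)
```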
The other method is called dropout. And I'm going to come back to dropout in Wednesday's lecture, because that's the first time we're going to use it. So I'll come back to dropout then and tell you exactly how it works. It's a very, very clever strategy, but we will not use it today; we'll use it on Wednesday. Okay.

So in summary, what do we do? We get the data ready. We design the network: number of hidden layers, number of neurons, and so on and so forth. We pick the right output layer. We pick the right loss function. We choose an optimizer. As I mentioned earlier, SGD comes in lots of flavors, lots of variations on the theme, and empirically, much like we tend to use ReLU as the activation function for hidden-layer neurons, for optimization we tend to use a flavor of SGD called Adam as sort of the default, because it's really good. So we'll use Adam. As you'll see, we typically use either early stopping or dropout, and then you just fire it up and start training in Keras and TensorFlow. All right, so that is the training loop.

Now I'm going to switch gears and give you a quick intro to Keras and TensorFlow. Okay, Keras and Tensor... no, TensorFlow and Keras. Thank you. Um, and then we'll actually fire up the Colab. So, first of all, what's a tensor?

>> Yeah, just a quick question on the previous thing. If you're looking at the validation set to avoid overfitting, aren't you actually overfitting anyway, because you're kind of using the validation set as a training set? Or not?

>> Uh, no, no, no. The validation set is never used to calculate any gradients. It's only used to calculate accuracy and loss. Yeah, it's kept aside and only used for evaluation, not for training. That's what keeps you honest.

>> Right.

>> And this will become clear when we actually go to the Colab. So, what's a tensor?

>> All right. Okay. A tensor is the input data which you're giving to the system. It could be in various formats: if it's an image, we call it a 4D tensor; if it's time-series data, it's 3D.
And typically, if you just send numbers in, it becomes a vector, which gives the values of the variables associated with it, as well as the information you want to get to.

>> You're kind of on the right track, but not entirely, right? It's actually a simpler concept than that. So, uh...

>> It's like a matrix, but generalized to higher dimensions.

>> Correct. That's also actually correct, but incomplete. The reason is that it can be simpler than a matrix. It's not matrix-or-higher; it could actually be simpler. In fact, you take a number: that's actually a tensor. All right? The simplest case of a tensor is a number. The next case up is a vector, which is a list. The next higher case is a table.

Okay, so these are all tensors. Tensors basically are a generalization of the notions of a number, a vector, and a table to higher dimensions. Okay? And you can think of every tensor as having something called a rank, right? A number is just a number; it doesn't have any dimensionality to it, so it has rank zero. Okay. While a vector is a list of numbers; you can sort of write it down top to bottom, and it has one dimension. Right? That one dimension is called a rank, so it's called rank one. A table is 2D, two-dimensional, so it's called rank two. And you can have a rank three, which is just a bunch of tables. A bunch of tables is a rank-three tensor; we also think of it as a cube. Okay.

So these things are very useful, because obviously we are all familiar with vectors. And as you will see very shortly later in this class, black-and-white grayscale images are usually represented using tables of numbers like this.
Color images are represented using three tables. Okay. Can you think of what might be representable as, you know, a tensor of rank four? Meaning every element of a tensor of rank four is actually a color picture. Just shout it out. Video. Exactly. What is a video? A video is basically a stream of color images. So for each element of that stream, the first dimension of the tensor is which frame it is, and then everything else is the actual frame.

So the way I always think about these tensors is this: you can think of a tensor as being this array which has all these axes, or dimensions. This is the first one, this is the second one, this is the third one. Right? This is a tensor of rank four. Okay? One, two, three, four. And if you have a vector, right, you can imagine the vector living like this, just a list of numbers, right? But if it is a rank-two tensor, which is just like that, then this thing becomes, you know, like that, and that thing becomes like that. So for example, if this is a 7 by 3, that means there are seven rows and three columns.

So you get the idea. The way you think about a tensor is always: an open square bracket, a bunch of things, a closed square bracket. That's really what a tensor object is. What that means is that any time you have a tensor, however complicated it is, you can always create a more complicated tensor by taking a list of those tensors. Let's say that you have a list of videos. Each video is a rank-four tensor, so a list of videos is what rank?

Exactly. So a tensor of rank, say, ten is just a list of rank-nine tensors.
So that is the most important thing you need to understand about tensors. At any point in time, if I give you a tensor, you can just iterate through the first dimension of it, the first axis, and go through each one of those values. So, for example, here... yeah, that will do it. If you have this tensor here and you want to create a more complicated tensor, no problem: you add another dimension here. Okay. Now, let's say this new dimension has nine values, one through nine. So you put a zero here, and what do you get? This whole thing is a rank-four tensor. You put a one here: another rank-four tensor. You put a two here: another rank-four tensor. So for every tensor, you take the first element; it's just a list, but it's a list of the next lower-rank tensors.

Okay. Now, this tensor concept is actually something Einstein came up with. Um, and so it's simultaneously kind of easy to understand and also slippery. So I would actually encourage you to read the book, which has a really good discussion of tensors, and the more you practice with it, the easier it'll get. Okay. So if you feel you kind of understood, but not quite, you're not alone. It happens to all of us, right? You have to pay the price, go through the crucible. Okay. Okay. All right.

So, to come back to this: that's what we have, and we already talked about a rank-four tensor, it's a video. Section 2.2 of the text has a lot more detail; you should definitely read it.
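As a quick sanity check you can run yourself, here is a small sketch of those ranks in TensorFlow (the values are arbitrary):

```python
import tensorflow as tf

t0 = tf.constant(7.0)                # rank 0: a single number
t1 = tf.constant([1.0, 2.0, 3.0])    # rank 1: a vector (a list)
t2 = tf.constant([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])  # rank 2: a table (2 rows, 3 columns)
t3 = tf.stack([t2, t2, t2, t2])      # rank 3: a list of tables (a "cube")

print(t0.ndim, t1.ndim, t2.ndim, t3.ndim)  # 0 1 2 3
print(t3.shape)                            # (4, 2, 3)
```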
So, TensorFlow is a library, and as you can imagine, in neural networks tensors come in, go through the network, and go out the other end, right? And since tensors capture everything (numbers, lists, tables, and so on and so forth), it's just tensors flowing from input to output. Hence it's called TensorFlow. And it gives you a couple of things which are really, really important, which is why we use it.

The first is that it will automatically calculate gradients for you, of arbitrarily complicated loss functions. You don't have to calculate the gradient, because calculating the gradient is very painful, right? It'll automatically calculate the gradients for you. That's the best part. You don't have to use the chain rule; you don't do anything. The second thing: it gives you all these optimizers, including SGD and all its variations, so you don't have to worry about the optimization itself; you can just pick and choose what you want. Third, if you have a lot of servers, it'll actually take the computational load and distribute it across all those servers. People here with a CS background know that parallelizing computation is actually a very difficult problem, right? There are things which are called embarrassingly parallel, but many things are not, and are actually quite tricky to parallelize. TensorFlow will figure it out. Okay?
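Here is the first of those features in miniature: a tiny sketch of TensorFlow computing a derivative for you, with a made-up toy loss:

```python
import tensorflow as tf

w = tf.Variable(3.0)
with tf.GradientTape() as tape:
    loss = (w - 1.0) ** 2      # a toy loss; could be arbitrarily complicated
grad = tape.gradient(loss, w)  # d(loss)/dw = 2*(w - 1)
print(grad.numpy())            # prints 4.0; no chain rule by hand
```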
And then finally, I talked about the fact that there are these things called GPUs, graphics processing units, which are parallel hardware. Even if you have just one computer, if it has GPUs, there's a particular way in which you have to take your computation and organize it to really exploit the fact that you have a GPU, and TensorFlow will do that for you out of the box, automatically. You don't have to worry about any of that stuff. Okay, so those are all the advantages. By the way, a TPU is a tensor processing unit; you can think of it as Google's GPU, right? They came up with their own variation on the theme.

Okay, now Keras sits on top of TensorFlow, right? This is the hardware you have; TensorFlow sits on top of the hardware; Keras sits on top of TensorFlow, and it basically gives you a whole bunch of convenience features. So, for example, it gives you the notion of a layer, right? We already saw keras.layers.Dense, a dense layer, right? It gives you the notion of a layer. It gives you the notion of activation functions, and so on and so forth. It gives you easy ways to preprocess the data, easy ways to train the model and report on metrics, you know, calculate validation loss, validation accuracy, training loss, all the metrics we care about. And then it also gives you a whole library of pre-trained models that you can just use and adapt for your particular problem. So it gives you a whole bunch of conveniences, and that's why it's very popular.

And by the way, many of you might also be familiar with PyTorch, which is a fantastic framework for deep learning as well. The reason we chose to go with TensorFlow for this course rather than PyTorch is that we wanted to make the course accessible to folks who don't have a ton of programming background before coming to the class, and PyTorch is a bit more demanding from a CS perspective.
It requires more knowledge of object-oriented programming, which is why we decided to go with TensorFlow and Keras: I think it's just as powerful in many ways, and it's a little easier to get going. Okay, so that's what we have here.

One other thing I will mention is that there are three ways in which you can use Keras. There are three kinds of APIs: sequential, functional, and subclassing. We'll almost exclusively use the functional API. Okay? And in fact, the model we built for heart disease prediction uses the functional API. So just read Section 7.2.2 of the textbook to understand in detail how the API works. I find in my own work that the functional API is basically all I need; I don't need anything more complicated than that. And as you will see as you work on the homeworks and on your project, it's sort of a beautifully designed Lego-block environment for doing these things, and you can create very complicated models very easily. Okay. There's a whole bunch of stuff on these websites, so check them out; lots of Colabs are available.

So now, if you go back to the neural model for heart disease prediction, this is what we came up with in the last class, right? We had an input layer, one dense layer with 16 neurons, ReLU neurons, an output layer with the sigmoid, and boom, that was the model.
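In functional-API terms, that model looks roughly like the sketch below. The layer structure matches what we built last class; `num_features` is a hypothetical placeholder, since the real input width depends on the Colab's preprocessing:

```python
from tensorflow import keras

num_features = 13  # hypothetical; the actual count depends on preprocessing

# Functional API: each layer is called on the previous layer's output.
inputs = keras.Input(shape=(num_features,))               # one row of patient features
x = keras.layers.Dense(16, activation="relu")(inputs)     # the hidden layer
outputs = keras.layers.Dense(1, activation="sigmoid")(x)  # probability of disease
model = keras.Model(inputs=inputs, outputs=outputs)
```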
So let's train this model. The training checklist: we have already done this part, a hidden layer of 16 neurons, and the sigmoid output. We need to use an appropriate loss function based on the type of output. What loss function should we use? What is the output here? It's a binary classification problem, so what should the loss function be?

I kind of heard it somewhere. Shout it out.

No, the output is a sigmoid. The loss function is...

>> Cross-entropy.

Okay. Remember: if you're predicting an arbitrary number, you can use something like mean squared error. If you're predicting a probability which has to be compared to a 0/1 output, which is what binary classification is all about, we use binary cross-entropy. Okay, so that's what we do here. We use binary cross-entropy, and then we'll go with Adam, right? And then we'll use early stopping to make sure we don't overfit.
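Wiring that checklist into the model above is one call. This is a sketch, with the loss and optimizer named as the lecture prescribes:

```python
# Loss matches the sigmoid / 0-1 output; Adam is the default flavor of SGD here.
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
```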
Okay, I know, I promise this is literally the last slide before I go to the Colab. I feel like one of those used-car salesmen: but wait, there's more!

So anyway, don't worry if you don't understand every detail of what I'm about to go through. I'm going to link to the Colab as soon as the class is over. But once you get your hands on the Colab, make sure you actually go through every line in it. What I typically do when I'm trying to learn something new is... I'll cut and paste, right? No, I won't do that. I won't cut and paste the code and run it; I will retype the code. If you retype the code as opposed to cutting and pasting, trust me, you'll learn a lot more. Right? So I strongly encourage you to do it that way.

Um, and for all the Colabs we're going to publish in this class, the first thing you should do is make your own copy of the notebook, right? Copy to Drive. And then, if you're using anything other than today's Colab, anything involving natural language processing or vision, you probably should use a GPU. So just go in here and choose the runtime to be a GPU. Um, and then you start your notebook and you're done. From the second time onwards, you can just go directly to this step; you don't have to repeat all this for that particular notebook. And there are numerous tutorials, like five-minute videos and so on, on how to use Colab. Just do that; I'm not going to spend time on it here.

All right. Okay. So, I just ran it a few hours ago. I'm not going to run every cell now, because it would take some time and get in the way of class time, but I'm going to go through it slowly and explain what's going on. Here, this is just an introduction to the data set. We already saw this introduction last week. We have, whatever, 303 patients, heart patients. We have a whole bunch of variables here: age, demographics, and a whole bunch of biomarker information. And this is the target variable, okay? Zero or one: heart disease, yes or no.

And so, by the way, just some technical preliminaries here. Basically, every time we load these things, we're actually going to load these packages. You can see here, these are the two key things we need to do: we import tensorflow first, and then from within tensorflow we import keras. Okay, that's what these two lines do. And then, folks who have done some data science and machine learning before will know this: we will also load the three packages that are most commonly used, which are numpy, pandas, and matplotlib. numpy, because it's very easy for manipulating matrices and arrays and tensors;
pandas because oftentimes you get data in from somewhere and you need to massage it and wrangle it to the point where you can actually feed it into Keras, so you need pandas for that; and matplotlib because you want to plot these loss curves and accuracy curves to see whether early stopping is needed. Okay, so that's why we use them. So we import all these things.
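(The import cell being described looks roughly like this; a sketch, since the actual notebook may differ in detail:)

```python
import tensorflow as tf          # the underlying framework
from tensorflow import keras     # the high-level model-building API

import numpy as np               # manipulating arrays, matrices, tensors
import pandas as pd              # loading and wrangling tabular data
import matplotlib.pyplot as plt  # plotting loss and accuracy curves
```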
And then I guess the other thing you have to remember is that when we are training these deep learning models, there is randomness in the process, and it enters in a few different places. So clearly the starting values for these weights: the weights are going to be randomly initialized, and that's obviously a source of randomness. Now, we talked about how, when you're doing stochastic gradient descent, you take all the data and then you randomly choose batches from it until we finish a whole pass through it. Well, that immediately raises the question: what do you mean by randomly choose? So typically what we do in practice, and Keras will take care of all this for you, is you basically take the data and just shuffle it once randomly, and then you go first 32, next 32, next 32, like that. Okay, but it is a source of randomness. And then when we split the data into train, validation, test and so on, particularly if you want to look for early stopping and overfitting, we need to split the data randomly again, and that's another source of randomness. And then when we do dropout, which we'll talk about on Wednesday, again, dropout has a little bit of a random element to it, so that's another source of randomness.

So all this means is that if you're working with these models and you want to build a model and hand it off to someone so that they can reproduce your results, well, you'd better make sure that you make it easy for them to replicate what you have. And the way you do it is by setting a random seed for all these things, okay? And the way you do that is by having this little handy function here, set random seed. And of course I use 42, just like everybody should, right? Okay, so that's that. By the way, that's just a pop-culture reference to this book called The Hitchhiker's Guide to the Galaxy.

>> Number 42, and you'll know what I mean.

Okay. So, by the way, the question inevitably comes up at this point: if we do exactly this, will you actually get the exact same numbers that I have in my version of the notebook? And the answer is: hopefully, most of the time, but it's not guaranteed. This is called bitwise reproducibility, and it's not guaranteed due to certain hardware things and device drivers and stuff like that, so we won't get into all that. Which is why, as you see here, I have a bit of a fingers-crossed thing. Okay, all right, cool. So that's what we have. Now, as it turns out, François Chollet, who wrote the textbook, actually made this data available as a CSV. So we read the CSV file into a pandas data frame right there. And it's 303 rows, 14 columns, right? And you can see here, we'll take a look at the first few rows. And these are all the columns: age, gender, cholesterol, blah blah blah. And then this is the target variable right there.
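(A sketch of those two steps, seeding and loading; the CSV location is an assumption, it's the copy used in the public Keras structured-data example, so substitute your own path if needed:)

```python
from tensorflow import keras
import pandas as pd

# One call seeds Python's, NumPy's, and TensorFlow's generators at once.
keras.utils.set_random_seed(42)

# Assumed location of the heart-disease CSV; not necessarily the lecture's copy.
csv_url = "http://storage.googleapis.com/download.tensorflow.org/data/heart.csv"
df = pd.read_csv(csv_url)

print(df.shape)  # (303, 14): 303 patients, 13 features plus the target
df.head()        # first few rows: age, sex, cholesterol, ..., target
```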
And one of the first things I always do when I'm working with a binary classification problem is to quickly check whether the positive and negative classes are balanced or not. And so what you can do is quickly check what percent of the data points are zero versus one. And you can see here: 72.6% of the patients don't have heart disease, which is a good thing, of course, and 27.4% have heart disease. So it's not bad. It's not 50/50, or roughly 50/50; it's a little skewed. So, by the way, quick question: what is a good baseline model for this problem? Suppose you couldn't use anything complicated. What's a good baseline model?

>> Yes. Just predict zero.

>> Yeah, and why would you do that?

>> It would give you a 72.6% accuracy.

>> Exactly. Because 72.6% is the class with the higher percentage: if you just predict it, you'll be right on those 72.6% of the cases and wrong on the rest, which means the accuracy of this model is going to be 72.6%. Okay. And so any fancy model we build had better do better than this, otherwise it's not worth its weight in layers. All right, so we'll come back to this later.
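(That check is a one-liner; a sketch, assuming the label column is named `target`:)

```python
# Fraction of each class in the label column.
df["target"].value_counts(normalize=True)
# Roughly:
# 0    0.726   <- no heart disease
# 1    0.274   <- heart disease
```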
So the first thing we want to do is preprocess it, because this dataset has both categorical variables and numeric variables. And so it's usually convenient to group them into two different groups. So I have listed all the categorical variables here and the numeric ones here. And then we have the preprocessing here: we have to take the categorical variables and one-hot encode them. And the reason is that unlike, say, a decision tree model, a neural network cannot handle categorical inputs directly; it can only handle numeric inputs. Which means that we have to numericalize every categorical thing that comes in. There are many ways to do it, but the standard way is one-hot encoding. And for the numeric variables, we need to normalize them, and I'll come to that in a second. So pandas has this get_dummies function here, and you can just run it and it'll one-hot encode the whole thing. So once you do that, this is what you have. You can see here that previously, let's say, thal had three values, fixed, normal, reversible, or something, and then you go to the one-hot encoded version, and now you can see thal_fixed, thal_normal, thal_reversible. That's three columns, right? That's one-hot encoding in action. Okay.
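(A sketch of that step; the exact column lists are assumptions based on the standard version of this dataset:)

```python
# Assumed categorical columns; the remaining columns stay numeric.
categorical_cols = ["sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal"]

# get_dummies replaces each categorical column with one 0/1 column per
# category, e.g. thal -> thal_fixed, thal_normal, thal_reversible.
df_encoded = pd.get_dummies(df, columns=categorical_cols)
```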
Now, the other thing to remember is that neural networks work best when the numeric inputs you send them are all in a relatively small range; they shouldn't have a wide range of variation. And so the standard practice is to standardize the numerical variables. By standardize, I mean typically subtract the mean and divide by the standard deviation. We should do that. But before we do so, we should split the data into a training set and a test set, right? And why do we want to split out a test set? Because at the very end, once we've built the model and done all the things we want to do with it, we finally want to take out the test set and evaluate on it once, so that we get a true measure of how the model is going to perform in the wild after you deploy it. Okay. So you want to divide it, say, 80% training set and 20% test set. So the question is: why should we do the splitting now, before we do the normalization? Why can't we just do the normalization and then do the splitting?

>> Because then your validation set is also somewhat dependent on your test set, as well as the mean of the test set.

>> Correct. Because the test set has then essentially leaked into the modeling process, right? The splitting, and also the standardization, are part of the modeling process, and if the standardization, which is part of that process, uses information about the test set, well, the test set isn't really kept away from anything, is it? That's why we want to split first: lock away the test set somewhere and then proceed with the modeling. Again, this is machine learning 101, which is why I'm going through it pretty fast. Okay, so we use this sampling function: take 20% of the data and make it the test set, and the remainder is going to be the training set. And when we do that, you can see the training set is now 242 rows while the test set is 61 rows. And for any of these data frames, you'll know that the shape attribute gives you the dimensions, the number of rows and columns; that's what we're doing here. And now that we have done the split, we can calculate the mean and the standard deviation. So I calculate the mean here, I calculate the standard deviation, and these are all the means. And once I do that, I just do each column minus the mean, divided by the standard deviation, and I save the results back into the train and test data frames. And you can see here that now all the numbers are smallish, around 0, 1, minus 1, around that range, and that's kind of ideal for network training. Okay, all right.
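(A sketch of the split-then-standardize flow just described, continuing the earlier sketch; the statistics are computed on training rows only, so the test set never leaks into preprocessing. The numeric column names are assumptions:)

```python
# Hold out 20% of the rows as the test set; the seed makes the draw repeatable.
test_df = df_encoded.sample(frac=0.2, random_state=42)
train_df = df_encoded.drop(test_df.index)
print(train_df.shape, test_df.shape)  # roughly (242, 30) and (61, 30)

# Standardize numeric columns using TRAINING statistics only.
numeric_cols = ["age", "trestbps", "chol", "thalach", "oldpeak"]  # assumed names
mean = train_df[numeric_cols].mean()
std = train_df[numeric_cols].std()
train_df[numeric_cols] = (train_df[numeric_cols] - mean) / std
test_df[numeric_cols] = (test_df[numeric_cols] - mean) / std
```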
So at this point the data is entirely numeric, and we are almost ready to feed it into Keras. And the way you do that is: you take a pandas data frame and convert it into a NumPy array, and then Keras is happy to receive it. So we use this method called to_numpy, which I think is about as descriptive as it gets in programming, and we save the results as train and test. Now train and test are two NumPy arrays with exactly the same information, and now we can feed them into Keras. All right. Now, there's one other thing we need to do, which is that in these data frames, train and test, our independent variables, all the features, as well as the target, the 0/1 target, are all in there together. And we need to take the dependent variable, the 0/1 column, split it out, and keep the X and the Y separately, right? That's the whole point, because you need to feed in the X, do the prediction, and then compare it to the actual Y and calculate the loss, and so on and so forth. So the target column is our Y variable, and it's column number six from the left; if you count, you can see it. So we just delete it from the train and test arrays. And now we have 242 rows and 29 columns, 29 features. You will recall that the network we made way back had 29 inputs, right? 29 nodes in the input layer; that's where the 29 is coming from. And so now we just select the sixth column, which is the target, and make it the Y variable, train Y and test Y. And that is of course a vector, which is 242 long in the training set and 61 long in the test set. So at this point, all we have done is, to be honest, boring preprocessing. Okay, we haven't actually gotten to the action yet.
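(A sketch of that conversion and the X/Y separation, continuing the earlier sketch; here the target is selected by name before conversion, rather than by counting to the sixth column, which is a little more robust:)

```python
# Separate the 0/1 label (y) from the features (X), then hand NumPy arrays to Keras.
train_y = train_df["target"].to_numpy()                  # shape (242,)
test_y = test_df["target"].to_numpy()                    # shape (61,)

train_x = train_df.drop(columns=["target"]).to_numpy()   # shape (242, 29)
test_x = test_df.drop(columns=["target"]).to_numpy()     # shape (61, 29)
```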
Finally, let's do something. So we start with a single hidden layer. Since it's a binary classification problem, we'll use sigmoids, as we saw earlier. And this is the model we created last class. Okay? The only difference between that model and this model is that I've actually given names to these layers. And this name thing is totally optional, right? If you want to give a name, give a name; it's just a little easier to interpret later on. It's just cosmetic. Okay? So I've just put it here. And once you build the model, you should immediately run the model summary command, because it gives you a nice overview of the model. For each layer, it tells you what the layer is, what's coming into the layer, meaning the shape of the tensor that's coming in, what's going out, and how many parameters the layer has. And it turns out this network has 497 parameters. Okay. And I have told you repeatedly, at least the first few times, to hand-calculate the number of parameters to make sure it checks out. So we should just make sure that it is in fact 497. So let's hand-calculate it. It's basically what's going on here: 29 inputs times 16, right, all the arrows, 29 × 16 arrows, and then you have a bias of another 16; that's why you have this expression. And then the next one is 16 × 1, plus one bias for the output sigmoid, and you get to 497. Okay? Just make sure you follow this later on when you work with the Colab.
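(A sketch of the model as described, 16 hidden sigmoid units and one sigmoid output, with the optional layer names, plus the hand count of the parameters:)

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(29,)),                               # 29 input features
    layers.Dense(16, activation="sigmoid", name="hidden"),  # hidden layer
    layers.Dense(1, activation="sigmoid", name="output"),   # P(heart disease)
])

model.summary()  # per-layer shapes and parameter counts

# Hand verification: (29 * 16 + 16) + (16 * 1 + 1) = 480 + 17 = 497.
assert 29 * 16 + 16 + 16 * 1 + 1 == 497
```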
We did this in class last week, and you can visualize the network graphically as well by using the plot_model function, so we do that here. And it gives you the same information, but in a slightly easier form to consume. And when we work with larger networks, starting on Wednesday, you will see that being able to visualize the topology of the network is actually quite handy. Okay, we finally come to actually trying to train this thing. So what loss function should we use? We need to use binary cross entropy, right there. What optimizer? Well, as I mentioned earlier, we'll use Adam. All right, Adam. And then the final thing is, you can ask Keras to report out whatever metrics you care about. These metrics are not going to be used in any optimization; it's just reporting for you. And the most common thing people report out for binary classification is accuracy, so we'll just go with that metric. And so what we do is we tell Keras: take the model we just built and compile it with this choice of optimizer, this choice of loss function, and these metrics. And this compilation step, what it does is this: Keras will take this information and the model you have built, and it'll reorganize the model in such a way that it becomes amenable to parallelization and to distributing the computation across many servers and so on. That's what's happening in the compile step, and that's why you actually have to do something called the compile step. Okay.
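(The compile step just described, as a sketch:)

```python
model.compile(
    optimizer="adam",             # the Adam flavor of SGD
    loss="binary_crossentropy",   # the discrepancy being minimized
    metrics=["accuracy"],         # reported only; never used for training
)
```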
And once we do that, we are finally ready to train the model. And to do that, we have to decide what batch size we're going to use. Remember, we're using some flavor of SGD, which means we have to choose the batch size. And typically, 32 is a good default: if you're just getting started with something, just use 32. There's a whole bunch of literature on what the right batch size should be for the number of data points you have, the size of the network, and so on and so forth. My philosophy is to start with 32, and you can always try 32, 64, 128. Oftentimes what researchers tell me is: just use the biggest batch size that doesn't make your machine die, right? If it fits into memory, it's probably good; just try the biggest size. We'll just start with 32; this is a tiny problem, it's not a big deal. Then we also have to decide how many epochs through the data we want to go, right? How many epochs? Usually 20 to 30 epochs is a good starting point. But because this is a tiny problem, just for kicks, I decided to run it for 300 epochs, just to see if any overfitting is going to happen. And then, do we want to use a validation set? Of course we want to use a validation set. So we will use 20% of the data points as a validation set, so that we can look for overfitting and underfitting.

All right. So with these decisions made, we finally use the model.fit command. model.fit is what actually trains the neural network. Okay. And you have to tell it what the X tensor is, and what the dependent-variable Y tensor is. We need to tell it how many epochs to run and what batch size to use. verbose=1 just means: print a lot of descriptive output as you do this. And validation_split means: take 20% of the training data and set it aside as your validation dataset; don't use it for training, because I want to measure overfitting with it. So that's it.
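(The fit call with the choices just made, as a sketch:)

```python
history = model.fit(
    train_x, train_y,
    epochs=300,
    batch_size=32,
    verbose=1,              # print per-epoch progress
    validation_split=0.2,   # hold out 20% of training data for end-of-epoch evaluation
)
```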
So you run that, it'll run for 300 epochs, and this is the reason I decided not to actually run it in class. And so it keeps going, gives you a lot of output, and finally we reach the end.

Okay. Now let's take a moment to understand what's being reported. So I'll just take this one line here. There is a pair of lines for each epoch. And here it's telling you that in this 300th epoch it used seven batches, seven out of seven, right? And you will recall from the math we did in class that it's actually seven batches where the first six batches are 32 examples each and the last batch is just a couple of examples; but we have seven batches, right? This is 193 divided by 32, rounded up. Okay, so that's why we have seven here. And then it tells you how long that took, and then this is the loss value, the binary cross entropy loss value on the training set, on that particular batch, that it calculated. This is the accuracy that you asked it to report: 98.4%, 98.5% accuracy on that batch. And then at the end of this epoch, using whatever weights were available in the network at that point, it actually calculates the loss on the validation set, which is the 20% of the data we set aside, and then this is the accuracy on that validation set. Okay, so that's what each of these numbers means. Now, looking at this wall of numbers is kind of painful, so usually you just plot it. And the way you do that: if you notice here, okay, I'm not going to go back, I said history = model.fit(...), and that history object has a lot of information that we can use for plotting and diagnostics and so on.
And that history object has an attribute also called history, history.history, which is a dictionary with all these values, and that's what we're going to plot. Was there a question here? Yeah.

>> So you told it to keep aside 20% for validation, but didn't we already keep a test set? So that's going to be a secondary validation, right?

>> So basically we have a training set, and then a validation set, and a test set. The role of the validation set is to figure out things like early stopping: should we stop here, should we go back? And as you will see later on, when we have hyperparameters, we'll try different values of the hyperparameters and use the validation set to figure out which one is best. But once we are done with all that, we will finally have a model. At that point, we open the safe, take out the test set, and use it just once, with your final, final model. Not because you want to improve the model, but because you want a realistic idea of how it'll do when you actually deploy it out in the real world.

>> Yeah.

>> Instead of accuracy, could we use other metrics for this, like a confusion matrix, let's say?

>> Yeah, you can do whatever you want.
You can use... like I said, it's not used for training, so there's no mathematical implication in what you choose, right? You can choose error rate, accuracy, F1, F-beta; you can do whatever you want. And Keras, as you will see, has a dizzying list of possible metrics you can use for reporting. The key thing to remember is that you're just reporting these metrics; you're not actually using them for any training. Yeah.

>> My question is with respect to validation. We've got a training dataset, so when we take out 20% for validation, are we taking it out from the training set at that level, or do we go to each batch and take out 20%?

>> No, we're taking it out from the training set.

>> So it means the number of data points available for forming batches will reduce.

>> Correct. And in [snorts] fact, once we take out the validation set, whatever remains is 193.

>> Okay. And then we divide that into batches.

>> Right. Once you take out the validation set at the very beginning, you keep it aside, and then you only evaluate, at the end of each epoch, what your loss and accuracy are on that validation set.

>> So you don't have cross-validation?

>> No, no, we're not doing any of that stuff. We're just taking it out once, and we're just evaluating at the end of every epoch.

>> Okay. So, yeah. Okay. So I know we both asked similar questions, but just to reconfirm: here my training model is giving me, say, a loss of 0.086, and my validation is giving me 0.66. That means I've already crossed the U.
So when I have to actually test the model, it's the model at that midpoint which I take, and that's the model which will get deployed in production.

>> Correct. And as to, okay, what do we do to get that model: do we actually have to go back to the beginning and run it for a few epochs, or can we do something smarter than that? We'll get to that. Yeah.

>> Is the validation set different for each epoch, or is it the same?

>> It's the same. So what you do is: you have a training set, and before you do any training, you take out 20% of it and keep it aside. Whatever is left over, you divide into mini-batches and then start running it through each epoch. But at the end of each epoch, you evaluate the quality of the resulting model using the validation set.

>> What's different between each epoch? Is it just the way...

>> The weights have changed.

>> ...the division into the different batches?

>> No. The difference in each epoch is that the weights have changed. After every mini-batch, the weights have changed. At the end of one epoch, you've gone through all the data points you ever had in the training set, and then you come back to the beginning and you do it again.

>> How do you identify the sweet spot?

>> It's coming. Yeah. All right. So I'm going to keep going. So we have this here. And there's a little bit of matplotlib code: what we do is we just plot the training loss and the validation loss as a function of the number of epochs. Okay?
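(That plotting code is roughly the following sketch:)

```python
import matplotlib.pyplot as plt

# history.history holds per-epoch lists: loss, accuracy, val_loss, val_accuracy.
plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("binary cross entropy")
plt.legend()
plt.show()
```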
And as you can see here, the training loss is these points here, and it's steadily going down, as you would expect. The validation loss goes down here, and then at some point it kind of flattens out, and then maybe gently starts to rise. Okay. So do you think there's overfitting? Right, there seems to be some level of overfitting here. But the thing you always have to remember is that the binary cross entropy loss is a loss function that is convenient for you, partly because it captures the thing you want to capture, the discrepancy, but also because it's mathematically convenient. What you may actually care about in practice is something like accuracy, right? That's why you're reporting out the accuracy when we do these things. So you should also plot the accuracy to see what's going on, and really you should look at the accuracy to figure out overfitting and underfitting and all that stuff. So let's just do that. So I have it here. Okay, so this is how it looks for accuracy. Accuracy, of course, as you do more and more epochs, hopefully gets better and better on the training set. So you can see here, training accuracy actually climbs all the way up to the low-to-mid 90s right there. The validation accuracy gets to this point after maybe 50 epochs, and then it kind of flattens out, and then, strangely, it climbs up again a bit later, right? So now, the fact that the accuracy actually got better at the very end suggests that maybe we can live with this overfitting.

>> Okay.

>> Right? It's not the end of the world. So what you can certainly do is go back and say: you know what, no, I'm going to be a purist about this; around 50 epochs or so, I think, is when it actually flattened out for loss. So you can just go back, restart the model, run it for only 50 epochs, not 300, then stop, and use that model for everything from that point on. Or you can say: you know what, it's okay, I can live with this thing.
And that's what we're going to do here. Let me just stop for a second; there was a question. Yeah?

>> Originally, when we were starting out, we were saying 20 to 30 epochs, but we were going to do 300, and 50 is over 20 to 30. So when it comes to validation, if you run enough epochs, are you doing, like, derivative calculations?

>> Oh, I see. No, that's a great question. So the question is: I said start with 20 to 30 epochs as a rule of thumb, and here I'm just going with 300. And because I'm going with 300, I can actually see some potential evidence of overfitting; if I had done only 20 to 30, maybe I wouldn't have even seen that. What happens next, right? Is that the question? Great question. So what you should do is, when you look at these curves, if at the end of 30 epochs you find that the validation loss continues to drop, then you know maybe there is more room for it to drop, so you continue from that point on. The thing about Keras is that you can actually run the fit command again at that point, and it'll continue where it left off; it won't go back to the beginning. Right? So you run 10: okay, the validation is still getting better and better. Run another 10: getting better and better. Run another 10: getting better and better. Run another 10: oh, it starts to climb up again. Okay, now I'm going to back off. That's what you do.

All right. Now, all this manual stuff I'm going through just to build intuition. There are these things called callbacks in Keras, which we'll get to later on, with which you can actually tell it: hey, when the validation loss stops improving, stop everything; or, when it stops improving, save that model for me somewhere. So you don't have to go back and rerun everything; it'll just have saved it for you, and you can just pick it up and use it.
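(A sketch of the callbacks being alluded to; both are standard Keras callbacks, though the exact arguments here are illustrative:)

```python
from tensorflow import keras

callbacks = [
    # Stop once validation loss has not improved for 10 epochs,
    # and roll back to the best weights seen so far.
    keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=10, restore_best_weights=True
    ),
    # Independently, keep the best model saved to disk as training runs.
    keras.callbacks.ModelCheckpoint(
        "best_model.keras", monitor="val_loss", save_best_only=True
    ),
]

history = model.fit(
    train_x, train_y,
    epochs=300, batch_size=32,
    validation_split=0.2,
    callbacks=callbacks,
)
```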
Yeah?

>> What's the intuition behind the accuracy continuing to improve when the loss is getting higher?

>> Because accuracy and loss are related, but they're not the same thing. It's a really good question, also kind of a profound question. Accuracy is a very discrete measure, right? So if for a particular point we predict its probability to be, say, 0.49, we're going to say: okay, that's a zero, no heart disease. But if it goes to 0.51, we're going to say: oh, that's heart disease. So when you go from 0.49 to 0.51, the binary cross entropy loss changes very, very slightly, but the accuracy goes from zero to one, a dramatic jump. So it's very jumpy and discrete, and that's why it tends to be a proxy, but a crude proxy, for loss. That's part of the reason, and I can talk more offline. Okay. So, yeah.

>> You mentioned that if you were a purist, you could stop at 50, in this case rerun it and stop there. I was wondering: if you could see the history of the model, could you take the weights at epoch 50 and put them into your model, and would it be roughly the same, or would there be certain differences?

>> You could try it. Yeah, you should just try it, because what happens is that ultimately what we care about is how it performs on the validation set, right? Here it appears to perform better on the validation set if you stop at 50, but only for the loss; for accuracy, actually, if you wait till the very end, it gets better. So my thrust tends to be: what is the measure that's closest to the real-world deployment? It's accuracy. So I tend to go with accuracy.
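(A tiny worked example of the 0.49-versus-0.51 point, assuming the true label is 1: the loss barely moves while the accuracy on that point flips from 0 to 1.)

```python
import math

def bce(y_true, p):
    """Binary cross entropy for one example with predicted probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# True label 1 (heart disease); the predicted class flips at threshold 0.5.
print(bce(1, 0.49))  # ~0.713, predicted class 0 -> wrong (accuracy 0 here)
print(bce(1, 0.51))  # ~0.673, predicted class 1 -> right (accuracy 1 here)
```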
Binary cross entropy is a beautiful proxy, but an imperfect proxy, for the thing we actually care about in the real world, which is error rate and accuracy. That's why I tend to plot both, and if accuracy is telling me one thing, I kind of tend to believe it.

All right, so that's what we have. So once we do all this, we have a model, and now we want to evaluate it to see: okay, if you actually deployed it, how good is it going to be? So you use this thing called the model evaluate function. You take the evaluate function, and now we use the test X and test Y dataset, which we split off at the very, very beginning and never used from that point on. We run it, and when I ran it last night, it came up with an 83.6% accuracy for the model. And remember, our baseline model, which just predicts that everybody is a zero, is going to have a 72.6% accuracy, and this little neural network gives you 83.6%, which is pretty good, right? It's beating the baseline model, which is nice.
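(The final evaluation, as a sketch; the test set is used exactly once, at the very end:)

```python
# evaluate returns the loss plus each compiled metric, here accuracy.
test_loss, test_acc = model.evaluate(test_x, test_y)
print(f"test accuracy: {test_acc:.1%}")  # ~83.6% in the lecture's run,
                                         # vs. the 72.6% all-zeros baseline
```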
And I guess there is something to say here about the fact that we did a bunch of preprocessing outside Keras and then sent the results into Keras. You can actually do all this preprocessing inside Keras automatically; there are layers for that, and I have linked to a bunch of material here. So that's it as far as this model is concerned. I know we went through it really fast, but please go through it afterwards and make sure you understand every single line. Change each of these lines, rerun it, see how the output changes; that's how we build some intuition. Okay. All right: computer vision.

>> Just one question before we move on: is there a way to build a model that has fewer false positives, or fewer false negatives?

>> Oh yeah, you can do that. You can report on all those things very easily, but there are also more complex loss functions which will take the asymmetry between false positives and false negatives into account. So the short answer is: it's possible, yeah.

All right. So first, let's just talk about how you represent an image digitally. Okay. And so this is how grayscale images, black and white images, are represented. The basic idea is very simple. In every picture you have, every location in that picture is a pixel, and the pixel basically has a light intensity, the amount of light at that location. And that light level is measured from zero, no light, to blinding white light, which is 255. And so, for all the numbers here, if you take this five, for example, you can see a lot of no light, like all the black regions: those are all zeros. Okay? And then wherever there is white light, there's a number, and the greater the amount of light, the closer it gets to 255. In fact, if you just step back and squint at this, you can actually see the five. Okay? So that's it. That's how a black and white image is represented. Very simple. Okay. Now, yeah?
Now, yeah? 1662 00:59:42,239 --> 00:59:45,838 [microphone] 1663 00:59:43,838 --> 00:59:47,679 >> Just, when you say amount of light, what's 1664 00:59:45,838 --> 00:59:48,239 the unit that's being measured? Like, what 1665 00:59:47,679 --> 00:59:51,039 do you mean? 1666 00:59:48,239 --> 00:59:54,639 >> So here, basically, what we have is: 1667 00:59:51,039 --> 00:59:56,318 when you 1668 00:59:54,639 --> 00:59:58,239 take an analog 1669 00:59:56,318 --> 00:59:59,440 picture, there's a process by 1670 00:59:58,239 --> 01:00:02,000 which you take that analog picture and 1671 00:59:59,440 --> 01:00:04,559 read it in, and it gets mapped to a scale 1672 01:00:02,000 --> 01:00:05,599 between 0 and 255. That's it. 1673 01:00:04,559 --> 01:00:07,119 So you can think of it as like a 1674 01:00:05,599 --> 01:00:10,559 relative scale, a normalized scale, 1675 01:00:07,119 --> 01:00:12,240 between 0 and 255, and so it just 1676 01:00:10,559 --> 01:00:14,720 roughly maps to the amount of light in that 1677 01:00:12,239 --> 01:00:16,318 location. The exact lumens-to-number 1678 01:00:14,719 --> 01:00:18,159 mapping, I don't know how they do 1679 01:00:16,318 --> 01:00:20,798 it; my guess is there are a number of 1680 01:00:18,159 --> 01:00:22,318 variations on that. But for our 1681 01:00:20,798 --> 01:00:24,079 purposes, just think of it as a 1682 01:00:22,318 --> 01:00:26,318 normalized scale which runs from 0 to 1683 01:00:24,079 --> 01:00:28,880 255. 1684 01:00:26,318 --> 01:00:30,798 All right. So that's what's happening: 1685 01:00:28,880 --> 01:00:34,318 every pixel is a 1686 01:00:30,798 --> 01:00:37,119 number between 0 and 255. Okay. So 1687 01:00:34,318 --> 01:00:38,880 if you have a color image, each pixel of 1688 01:00:37,119 --> 01:00:42,400 a color image is represented by three 1689 01:00:38,880 --> 01:00:44,480 numbers. And these numbers measure the 1690 01:00:42,400 --> 01:00:46,480 intensity of red light, blue light, and 1691 01:00:44,480 --> 01:00:47,599 green light, because red, blue, and green, 1692 01:00:46,480 --> 01:00:50,480 if you mix them in the right proportion, 1693 01:00:47,599 --> 01:00:52,559 you can get whatever you want. Okay. So 1694 01:00:50,480 --> 01:00:54,719 each light intensity is still a 1695 01:00:52,559 --> 01:00:56,480 number between 0 and 255, and that's what 1696 01:00:54,719 --> 01:00:58,078 you have. Which means that now you have 1697 01:00:56,480 --> 01:01:00,079 three tables of numbers instead of one 1698 01:00:58,079 --> 01:01:02,240 table of numbers. And by the way, just 1699 01:01:00,079 --> 01:01:05,440 some lingo here: in the deep learning 1700 01:01:02,239 --> 01:01:06,959 world, these colors, RGB, red, green, 1701 01:01:05,440 --> 01:01:10,318 blue, are sometimes referred to as 1702 01:01:06,960 --> 01:01:11,358 channels. Okay. All right. So this is 1703 01:01:10,318 --> 01:01:13,599 what we have here. This is a picture of 1704 01:01:11,358 --> 01:01:16,159 Killian Court, and then if you take that 1705 01:01:13,599 --> 01:01:18,960 little patch here: the red table, the 1706 01:01:16,159 --> 01:01:21,039 green table, and the blue table. So for 1707 01:01:18,960 --> 01:01:23,760 this picture, these three tables form a 1708 01:01:21,039 --> 01:01:26,159 tensor of rank what? 1709 01:01:23,760 --> 01:01:30,520 Good. 1710 01:01:26,159 --> 01:01:30,519 All right. Any questions on this?
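[Editor's note: a short sketch of the color representation just described, assuming NumPy; the image dimensions are arbitrary.]

    import numpy as np

    # A color image carries three channels (red, green, blue) per pixel,
    # so it is a rank-3 tensor: height x width x 3.
    height, width = 480, 640
    img = np.zeros((height, width, 3), dtype=np.uint8)

    img[..., 0] = 255      # max out the red channel: a solid red image
    print(img.shape)       # (480, 640, 3)
    print(img[0, 0])       # one pixel = three numbers, each 0..255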
1711 01:01:33,920 --> 01:01:37,599 So the key task in computer vision, 1712 01:01:35,838 --> 01:01:40,239 the most important thing, is 1713 01:01:37,599 --> 01:01:42,160 image classification, right? The most 1714 01:01:40,239 --> 01:01:43,679 basic task, if you will, when you're 1715 01:01:42,159 --> 01:01:45,358 working with images: you have an 1716 01:01:43,679 --> 01:01:46,719 image, and you 1717 01:01:45,358 --> 01:01:48,078 take the image and figure out, okay, you 1718 01:01:46,719 --> 01:01:49,519 have a list of possible objects the 1719 01:01:48,079 --> 01:01:51,039 image could contain, and you're figuring 1720 01:01:49,519 --> 01:01:53,280 out which of these possible objects 1721 01:01:51,039 --> 01:01:54,960 exists in that image, right? The dog-cat 1722 01:01:53,280 --> 01:01:57,760 classification is like the canonical 1723 01:01:54,960 --> 01:01:59,599 example, right, that we all know and love, 1724 01:01:57,760 --> 01:02:01,280 and that's what we will solve 1725 01:01:59,599 --> 01:02:02,720 later today and on Wednesday. But there 1726 01:02:01,280 --> 01:02:05,680 are many other tasks that you need to 1727 01:02:02,719 --> 01:02:07,358 be aware of. So, when you not 1728 01:02:05,679 --> 01:02:10,318 just classify an image, but you also 1729 01:02:07,358 --> 01:02:11,519 localize where in the image it is, 1730 01:02:10,318 --> 01:02:13,039 right? It's not enough to say 1731 01:02:11,519 --> 01:02:14,639 sheep; you want to figure out where 1732 01:02:13,039 --> 01:02:16,159 the sheep is, right? That's called 1733 01:02:14,639 --> 01:02:18,239 localization. And the way you do 1734 01:02:16,159 --> 01:02:21,118 localization is you put this little box 1735 01:02:18,239 --> 01:02:23,358 around it. And then you output not just 1736 01:02:21,119 --> 01:02:26,000 whether it's a, you know, sheep, yes or 1737 01:02:23,358 --> 01:02:28,159 no, but the coordinates of this box, the 1738 01:02:26,000 --> 01:02:29,760 top left and the bottom right, for 1739 01:02:28,159 --> 01:02:31,598 example. If you output the coordinates, you 1740 01:02:29,760 --> 01:02:33,599 can actually draw a box around it. So 1741 01:02:31,599 --> 01:02:36,079 you output the numbers, the 1742 01:02:33,599 --> 01:02:39,760 coordinates of where this box is in the 1743 01:02:36,079 --> 01:02:42,720 picture. Okay, this is called localization. 1744 01:02:39,760 --> 01:02:45,040 Then there is object detection, where you 1745 01:02:42,719 --> 01:02:47,039 may have lots of objects going on, and 1746 01:02:45,039 --> 01:02:49,759 you want to pick up every one of them, 1747 01:02:47,039 --> 01:02:51,679 and you want to localize it. 1748 01:02:49,760 --> 01:02:53,359 Okay, this is object detection. So here 1749 01:02:51,679 --> 01:02:55,679 we have gone in there and said, okay, 1750 01:02:53,358 --> 01:02:57,519 sheep one, sheep two, sheep three, and 1751 01:02:55,679 --> 01:02:59,598 each of these sheep has a little box 1752 01:02:57,519 --> 01:03:01,440 around it. Okay. 1753 01:02:59,599 --> 01:03:04,000 >> By the way, you know, self-driving 1754 01:03:01,440 --> 01:03:05,358 cars: the camera vision system is 1755 01:03:04,000 --> 01:03:06,960 constantly scanning what's coming in 1756 01:03:05,358 --> 01:03:08,400 through the cameras and doing object 1757 01:03:06,960 --> 01:03:09,039 detection constantly, many times a 1758 01:03:08,400 --> 01:03:09,680 second, 1759 01:03:09,039 --> 01:03:11,599 right? 1760 01:03:09,679 --> 01:03:13,838 Pedestrian box, you know, zebra crossing 1761 01:03:11,599 --> 01:03:16,240 box, doggy box, stroller box, and so on 1762 01:03:13,838 --> 01:03:17,358 and so forth.
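[Editor's note: one common way to write down the classification-plus-localization target described above: the class label plus the box corners. The exact format and the numbers here are illustrative only; real detection systems use many variants.]

    # A localization target: what the object is, plus where its box sits,
    # as (x1, y1) top-left and (x2, y2) bottom-right pixel coordinates.
    target = {"label": "sheep", "box": (42, 18, 131, 97)}

    x1, y1, x2, y2 = target["box"]
    print(f"{target['label']} inside a {x2 - x1} x {y2 - y1} pixel box")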
1763 01:03:16,239 --> 01:03:20,479 And then we have this thing called 1764 01:03:17,358 --> 01:03:22,960 semantic segmentation, where we take 1765 01:03:20,480 --> 01:03:24,880 every pixel in the picture and classify 1766 01:03:22,960 --> 01:03:26,159 every pixel. We are not classifying the 1767 01:03:24,880 --> 01:03:28,880 whole picture; we're classifying every 1768 01:03:26,159 --> 01:03:32,318 pixel. So we are saying, okay, all these 1769 01:03:28,880 --> 01:03:34,798 gray pixels are road, all these pixels are 1770 01:03:32,318 --> 01:03:37,838 sheep, and all these pixels are grass. 1771 01:03:34,798 --> 01:03:39,838 Every pixel is being classified. 1772 01:03:37,838 --> 01:03:42,159 So we are taking an image, and instead of 1773 01:03:39,838 --> 01:03:43,920 giving one classification for the whole image, for every 1774 01:03:42,159 --> 01:03:47,558 pixel we are solving a multiclass 1775 01:03:43,920 --> 01:03:47,559 classification problem. 1776 01:03:48,318 --> 01:03:51,199 Okay, every pixel is classified. And 1777 01:03:49,920 --> 01:03:53,280 just when you think it can't get more 1778 01:03:51,199 --> 01:03:54,480 complicated than this, 1779 01:03:53,280 --> 01:03:56,880 we have something called instance 1780 01:03:54,480 --> 01:03:58,559 segmentation, where not only are we 1781 01:03:56,880 --> 01:03:59,838 classifying every pixel, we are 1782 01:03:58,559 --> 01:04:01,920 distinguishing between the different 1783 01:03:59,838 --> 01:04:04,318 sheep. 1784 01:04:01,920 --> 01:04:06,400 So every pixel is classified, and 1785 01:04:04,318 --> 01:04:09,960 different instances of the same category 1786 01:04:06,400 --> 01:04:09,960 need to be identified. 1787 01:04:10,480 --> 01:04:14,880 Okay. So these are some of the most 1788 01:04:12,318 --> 01:04:16,798 popular, most 1789 01:04:14,880 --> 01:04:18,960 prevalent, and most 1790 01:04:16,798 --> 01:04:20,880 useful categories of image 1791 01:04:18,960 --> 01:04:23,920 processing problems that are amenable to 1792 01:04:20,880 --> 01:04:25,440 a deep learning approach. 1793 01:04:23,920 --> 01:04:27,200 All right. So let's go to image 1794 01:04:25,440 --> 01:04:28,559 classification, and we're going to work 1795 01:04:27,199 --> 01:04:32,598 with this data set called Fashion- 1796 01:04:28,559 --> 01:04:32,599 MNIST. Um, 1797 01:04:33,039 --> 01:04:38,400 so the idea here is that you have 1798 01:04:35,358 --> 01:04:40,960 70,000 images of clothing items across 1799 01:04:38,400 --> 01:04:43,119 10 categories, you know, like boots and 1800 01:04:40,960 --> 01:04:45,760 sweaters and t-shirts, and you get the 1801 01:04:43,119 --> 01:04:48,559 idea: 10 categories of clothing. We 1802 01:04:45,760 --> 01:04:50,559 have 70,000 images like this, and then 1803 01:04:48,559 --> 01:04:52,559 we'll build a network from scratch to 1804 01:04:50,559 --> 01:04:54,559 classify all these things, you know, 1805 01:04:52,559 --> 01:04:55,920 with pretty high accuracy. These 1806 01:04:54,559 --> 01:04:58,000 classes, by the way, you know, this is a 1807 01:04:55,920 --> 01:04:59,838 very balanced data set. So 10% of the 1808 01:04:58,000 --> 01:05:01,920 data is, you know, sweaters, 10% is boots, 1809 01:04:59,838 --> 01:05:03,519 and so on and so forth.
So a naive 1810 01:05:01,920 --> 01:05:06,519 baseline model would give you what 1811 01:05:03,519 --> 01:05:06,519 accuracy? 1812 01:05:07,679 --> 01:05:12,078 10%. Exactly. So we need to build 1813 01:05:10,559 --> 01:05:13,440 something that's better than 10%, and I'm 1814 01:05:12,079 --> 01:05:14,559 glad to report that a simple neural 1815 01:05:13,440 --> 01:05:17,559 network can actually get you close to 1816 01:05:14,559 --> 01:05:17,559 90%. 1817 01:05:18,559 --> 01:05:24,798 Right? So this is the simple network 1818 01:05:21,838 --> 01:05:28,400 that we have. The input in this case is 1819 01:05:24,798 --> 01:05:33,358 a 28 x 28 picture. 1820 01:05:28,400 --> 01:05:36,720 It's a 28 x 28 picture. And 1821 01:05:33,358 --> 01:05:38,318 so far we have been feeding vectors into 1822 01:05:36,719 --> 01:05:40,239 our neural network. Now we have a 1823 01:05:38,318 --> 01:05:43,759 picture which is 28 by 28. It's a tensor 1824 01:05:40,239 --> 01:05:45,919 of rank two, right? It's a table of 1825 01:05:43,760 --> 01:05:49,160 numbers. What do we do? How do we feed 1826 01:05:45,920 --> 01:05:49,159 that in? 1827 01:05:51,199 --> 01:05:54,960 No, each image is a table 1828 01:05:53,599 --> 01:05:57,519 of numbers. Let's just take a single 1829 01:05:54,960 --> 01:05:59,280 image. 1830 01:05:57,519 --> 01:06:01,679 Like, what do we do? What do we 1831 01:05:59,280 --> 01:06:04,079 do with this table? 1832 01:06:01,679 --> 01:06:06,399 Convert it into a vector. Exactly. And 1833 01:06:04,079 --> 01:06:08,079 that's called flattening. So we take 1834 01:06:06,400 --> 01:06:11,440 this table of numbers and we flatten it 1835 01:06:08,079 --> 01:06:13,599 into a vector. And so what we do is, 1836 01:06:11,440 --> 01:06:17,760 let me just... 1837 01:06:13,599 --> 01:06:20,240 Okay. So we have 1838 01:06:17,760 --> 01:06:22,400 28 by 28. 1839 01:06:20,239 --> 01:06:25,598 So what we can do is we can take each 1840 01:06:22,400 --> 01:06:27,838 row, right, take this row, and then write 1841 01:06:25,599 --> 01:06:32,599 it like that. 1842 01:06:27,838 --> 01:06:32,599 We take the second row, oops, 1843 01:06:33,440 --> 01:06:36,639 write it like that. 1844 01:06:38,079 --> 01:06:43,599 The third row is here, 1845 01:06:41,440 --> 01:06:45,358 like that. You get the idea. So you take 1846 01:06:43,599 --> 01:06:47,039 each row, just rotate it and stack it all 1847 01:06:45,358 --> 01:06:49,119 up, right? And string them up. It 1848 01:06:47,039 --> 01:06:51,760 becomes one long vector. So this is called 1849 01:06:49,119 --> 01:06:52,960 flattening. Okay? So that's how you take 1850 01:06:51,760 --> 01:06:55,960 this thing and make it into one long 1851 01:06:52,960 --> 01:06:55,960 vector. 1852 01:06:56,400 --> 01:07:03,400 So when you do that, 28 by 28 is what, is 1853 01:07:00,159 --> 01:07:03,399 it? 1854 01:07:03,599 --> 01:07:09,440 784. So we get a vector. 1855 01:07:07,440 --> 01:07:11,119 This is the flattened input, and you get 1856 01:07:09,440 --> 01:07:15,039 784. 1857 01:07:11,119 --> 01:07:17,358 It's a vector that's 784 long. 1858 01:07:15,039 --> 01:07:18,799 Okay. After the flattening, we have not 1859 01:07:17,358 --> 01:07:19,920 done anything complicated yet. We have 1860 01:07:18,798 --> 01:07:21,679 literally taken the numbers and just 1861 01:07:19,920 --> 01:07:24,318 reorganized them in a different way. 1862 01:07:21,679 --> 01:07:26,000 Okay.
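[Editor's note: the flattening just described, as a NumPy sketch; the stand-in image is synthetic.]

    import numpy as np

    img = np.arange(28 * 28).reshape(28, 28)   # stand-in for one 28 x 28 image

    flat = img.reshape(-1)                     # rows strung end to end
    print(flat.shape)                          # (784,)

    # Inside a Keras model, a layer does the same reorganization:
    # keras.layers.Flatten()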
And once we do that, now we are 1863 01:07:24,318 --> 01:07:27,759 back in our familiar neural network 1864 01:07:26,000 --> 01:07:29,760 territory, right? We know how to work 1865 01:07:27,760 --> 01:07:33,760 with vectors. So we just need to pass 1866 01:07:29,760 --> 01:07:35,520 it through a hidden layer, right? And 1867 01:07:33,760 --> 01:07:37,599 for this hidden layer, we're going to use ReLU 1868 01:07:35,519 --> 01:07:39,119 neurons. And I tried a few different 1869 01:07:37,599 --> 01:07:41,680 values, and it turns out that 256 1870 01:07:39,119 --> 01:07:43,680 neurons does a really good job. 1871 01:07:41,679 --> 01:07:46,480 Okay? And so I'm going to use 256 1872 01:07:43,679 --> 01:07:48,000 neurons here. And then we need to 1873 01:07:46,480 --> 01:07:51,199 think about what the output layer should 1874 01:07:48,000 --> 01:07:54,159 be. Now we run into a problem, 1875 01:07:51,199 --> 01:07:55,759 because the output layer before, as we saw 1876 01:07:54,159 --> 01:07:58,239 in the heart disease example, is just 1877 01:07:55,760 --> 01:08:01,039 zero or one. Right? Here there are 10 1878 01:07:58,239 --> 01:08:02,879 possible outputs. It could be a, you know, 1879 01:08:01,039 --> 01:08:04,799 boot, a sweater, a shirt, and so on and so 1880 01:08:02,880 --> 01:08:06,798 forth: 10 possible categories. So we 1881 01:08:04,798 --> 01:08:09,199 need some way to handle something with 1882 01:08:06,798 --> 01:08:12,960 many more than, you know, one binary 1883 01:08:09,199 --> 01:08:15,038 output: many possible outputs. So, the way 1884 01:08:12,960 --> 01:08:16,880 we do that, 1885 01:08:15,039 --> 01:08:20,079 and by the way, pay attention to this, 1886 01:08:16,880 --> 01:08:24,000 because this is actually how GPT-4 works. 1887 01:08:20,079 --> 01:08:26,880 Okay. So what we do is, here's what we 1888 01:08:24,000 --> 01:08:28,640 have. We know how to output 10 numbers, 1889 01:08:26,880 --> 01:08:30,000 right? If you want to output 10 numbers, 1890 01:08:28,640 --> 01:08:31,440 no problem. We 1891 01:08:30,000 --> 01:08:33,600 can easily output 10 numbers by just 1892 01:08:31,439 --> 01:08:36,559 using a linear activation. We also know 1893 01:08:33,600 --> 01:08:37,838 how to output 10 probabilities, 1894 01:08:36,560 --> 01:08:40,560 right? Each one just needs to be a 1895 01:08:37,838 --> 01:08:44,079 sigmoid. But here we can't use 10 1896 01:08:40,560 --> 01:08:47,839 sigmoids as the output. Why is that? 1897 01:08:44,079 --> 01:08:50,000 Why can't we use 10 sigmoids? 1898 01:08:47,838 --> 01:08:52,798 >> Because the probabilities have to add up to one. 1899 01:08:50,000 --> 01:08:54,640 >> Right. So here, when the output comes, we 1900 01:08:52,798 --> 01:08:56,238 need to figure out, okay, is it a boot, a 1901 01:08:54,640 --> 01:08:59,199 sweater, a shirt, and so on and so forth. 1902 01:08:56,238 --> 01:09:00,479 There's only one right answer. Okay, 1903 01:08:59,198 --> 01:09:01,838 which means that we need to 1904 01:09:00,479 --> 01:09:03,519 figure out which of these 10 is the 1905 01:09:01,838 --> 01:09:05,439 right answer, which means that we need to 1906 01:09:03,520 --> 01:09:07,520 produce probabilities, but they have to 1907 01:09:05,439 --> 01:09:09,599 add up to one, because only one of them 1908 01:09:07,520 --> 01:09:10,719 can be true. 1909 01:09:09,600 --> 01:09:12,159 So that's the key thing. They have to 1910 01:09:10,719 --> 01:09:13,279 add up to one. That's the wrinkle.
If 1911 01:09:12,158 --> 01:09:16,000 not for that, we could just use 10 1912 01:09:13,279 --> 01:09:17,600 sigmoids, right? And the way we handle that 1913 01:09:16,000 --> 01:09:20,079 is by using something called the 1914 01:09:17,600 --> 01:09:22,319 softmax function, or the softmax layer. 1915 01:09:20,079 --> 01:09:25,198 And the idea is actually very simple. We 1916 01:09:22,319 --> 01:09:27,759 have these 10 outputs in the very final 1917 01:09:25,198 --> 01:09:29,759 layer, which is just linear activations. 1918 01:09:27,759 --> 01:09:32,719 And then we take each one of these 1919 01:09:29,759 --> 01:09:34,719 numbers, run it through the 1920 01:09:32,719 --> 01:09:37,279 exponential function, and then divide by 1921 01:09:34,719 --> 01:09:39,279 the total. So when you do that, two 1922 01:09:37,279 --> 01:09:40,560 things happen. The first one is, when you 1923 01:09:39,279 --> 01:09:43,359 take these numbers and run them through, 1924 01:09:40,560 --> 01:09:45,920 say you take a1 and do e raised to a1, 1925 01:09:43,359 --> 01:09:47,039 you now get a positive number. 1926 01:09:45,920 --> 01:09:48,640 And now you have a positive number 1927 01:09:47,039 --> 01:09:50,319 divided by the sum of a bunch of positive 1928 01:09:48,640 --> 01:09:52,079 numbers, and you can see here, 1929 01:09:50,319 --> 01:09:53,920 you can confirm visually, that they will 1930 01:09:52,079 --> 01:09:55,198 add up to one, because you're literally 1931 01:09:53,920 --> 01:09:56,719 taking each number and dividing by 1932 01:09:55,198 --> 01:09:59,439 the total. So they will add up to one; 1933 01:09:56,719 --> 01:10:00,880 there's no other option, right? So this is 1934 01:09:59,439 --> 01:10:02,559 called the softmax function, which means 1935 01:10:00,880 --> 01:10:04,000 that you can take any set of 10 numbers 1936 01:10:02,560 --> 01:10:05,199 that's coming out of the network and 1937 01:10:04,000 --> 01:10:07,198 convert them into probabilities that add 1938 01:10:05,198 --> 01:10:09,919 up to one. 1939 01:10:07,198 --> 01:10:12,639 And so, by the way, the GPT-4 reference: 1940 01:10:09,920 --> 01:10:14,480 when you actually put a prompt into GPT-4 1941 01:10:12,640 --> 01:10:17,760 and it starts giving you the output, 1942 01:10:14,479 --> 01:10:19,359 every word it's emitting, right? It's 1943 01:10:17,760 --> 01:10:21,199 actually a token, but we'll get to that 1944 01:10:19,359 --> 01:10:23,599 later; imagine it's a word. For every 1945 01:10:21,198 --> 01:10:27,599 word it's emitting, it's 1946 01:10:23,600 --> 01:10:28,960 doing a 52,000-way softmax. 1947 01:10:27,600 --> 01:10:31,840 Think of it as every word in the 1948 01:10:28,960 --> 01:10:34,158 language is a possible output. So it's a 1949 01:10:31,840 --> 01:10:36,560 vector which is 52,000 long, but it's 1950 01:10:34,158 --> 01:10:39,839 actually a softmax, and it just picks the 1951 01:10:36,560 --> 01:10:41,440 most probable word and emits that. So 1952 01:10:39,840 --> 01:10:43,360 this notion of a softmax is actually 1953 01:10:41,439 --> 01:10:45,039 very powerful. 1954 01:10:43,359 --> 01:10:49,119 Okay, but we'll come back to that 1955 01:10:45,039 --> 01:10:51,039 later. So, to summarize: if you have 1956 01:10:49,119 --> 01:10:53,519 a single number, you can use a simple 1957 01:10:51,039 --> 01:10:55,519 linear output layer; a single probability, a 1958 01:10:53,520 --> 01:10:57,440 sigmoid; if you have lots of numbers, just 1959 01:10:55,520 --> 01:10:58,719 have a stack of these things. And when 1960 01:10:57,439 --> 01:10:59,839 you have a lot of numbers that have to 1961 01:10:58,719 --> 01:11:03,640 add up to one, that have to be 1962 01:10:59,840 --> 01:11:03,640 probabilities, use softmax, 1963 01:11:03,679 --> 01:11:08,399 right?
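[Editor's note: the softmax computation just described, softmax(a)_i = e^(a_i) / sum_j e^(a_j), as a NumPy sketch. Subtracting the max before exponentiating is a standard numerical-stability trick, not something the lecture covers.]

    import numpy as np

    def softmax(a):
        """Exponentiate each score, then divide by the total."""
        e = np.exp(a - a.max())   # stability trick; does not change the result
        return e / e.sum()

    scores = np.array([2.0, 1.0, 0.1, -1.0])   # raw linear-layer outputs
    p = softmax(scores)
    print(p)           # all positive
    print(p.sum())     # 1.0: a valid set of probabilities
    print(p.argmax())  # index of the most probable class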
So, yeah? 1964 01:11:06,640 --> 01:11:11,360 >> Why do we choose probabilities instead 1965 01:11:08,399 --> 01:11:12,000 of just the number 1966 01:11:11,359 --> 01:11:12,559 one? 1967 01:11:12,000 --> 01:11:14,158 >> Sorry? 1968 01:11:12,560 --> 01:11:15,760 >> Then we know it's only going to be one. 1969 01:11:14,158 --> 01:11:19,399 >> Because you can't force the network to 1970 01:11:15,760 --> 01:11:19,400 give you ones or zeros; 1971 01:11:20,158 --> 01:11:22,639 it's going to produce what it's going to 1972 01:11:21,279 --> 01:11:24,399 produce. 1973 01:11:22,640 --> 01:11:26,239 You can't force it to be exactly one or 1974 01:11:24,399 --> 01:11:28,479 zero; 1975 01:11:26,238 --> 01:11:30,319 it'll give you some number. What you can do is 1976 01:11:28,479 --> 01:11:32,238 tame that number so that it comes 1977 01:11:30,319 --> 01:11:34,639 into a range that you like, like between 1978 01:11:32,238 --> 01:11:38,399 zero and one. 1979 01:11:34,640 --> 01:11:40,000 So here, very quickly: when 1980 01:11:38,399 --> 01:11:41,759 we have a binary classification example, 1981 01:11:40,000 --> 01:11:43,279 like yes or no, this is the one-hot 1982 01:11:41,760 --> 01:11:45,440 encoded version, one or zero. This is what 1983 01:11:43,279 --> 01:11:46,719 we saw in the heart disease example. When 1984 01:11:45,439 --> 01:11:48,639 you have something like this example, 1985 01:11:46,719 --> 01:11:51,039 Fashion-MNIST, where you have all these 1986 01:11:48,640 --> 01:11:52,560 different possibilities, then you can 1987 01:11:51,039 --> 01:11:54,479 encode it in one of two ways. You can 1988 01:11:52,560 --> 01:11:56,560 encode it just using integers, like 0 to 1989 01:11:54,479 --> 01:11:59,519 9, right? This is called the sparse 1990 01:11:56,560 --> 01:12:02,239 encoded version. Or you can do a one-hot 1991 01:11:59,520 --> 01:12:03,760 encoded version of the output, right? You 1992 01:12:02,238 --> 01:12:06,879 can have a one-hot encoded version of 1993 01:12:03,760 --> 01:12:08,960 the output. And depending on how your 1994 01:12:06,880 --> 01:12:11,760 data comes into your 1995 01:12:08,960 --> 01:12:13,840 Colab, right, just pay attention to this, 1996 01:12:11,760 --> 01:12:18,239 depending on what it is, you have to 1997 01:12:13,840 --> 01:12:20,159 pick the right Keras loss function. So if the 1998 01:12:18,238 --> 01:12:21,839 data comes like a one-zero thing, which 1999 01:12:20,158 --> 01:12:24,079 is exactly what we had in the heart disease 2000 01:12:21,840 --> 01:12:26,400 example, we use binary cross entropy. If 2001 01:12:24,079 --> 01:12:28,719 your data comes in this form, where it's 2002 01:12:26,399 --> 01:12:31,279 sparse encoded, you use sparse 2003 01:12:28,719 --> 01:12:32,640 categorical cross entropy. And then if it 2004 01:12:31,279 --> 01:12:34,960 comes in this form, you use 2005 01:12:32,640 --> 01:12:36,640 categorical cross entropy, right? These 2006 01:12:34,960 --> 01:12:38,399 are all equivalent things. It just depends 2007 01:12:36,640 --> 01:12:40,159 on the data that you get, how it happens 2008 01:12:38,399 --> 01:12:42,559 to be encoded by the people who sent it 2009 01:12:40,158 --> 01:12:43,759 to you. If they send it this way, use 2010 01:12:42,560 --> 01:12:46,080 this loss function. If they send it that 2011 01:12:43,760 --> 01:12:47,600 way, use that loss function.
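[Editor's note: a sketch of the network and loss choice described above, assuming TensorFlow/Keras. The layer sizes follow the lecture (flatten, 256 ReLU units, 10-way softmax); the optimizer choice is an assumption, and this is not the lecture's exact Colab code.]

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(28, 28)),
        layers.Flatten(),                        # 28 x 28 table -> 784-long vector
        layers.Dense(256, activation="relu"),    # hidden layer
        layers.Dense(10, activation="softmax"),  # 10 probabilities summing to 1
    ])

    # Labels arrive as integers 0..9 (sparse-encoded), so pick the matching loss;
    # with one-hot labels it would be "categorical_crossentropy" instead.
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )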
2012 01:12:46,079 --> 01:12:49,359 Now, as it turns out, in our example 2013 01:12:47,600 --> 01:12:50,800 here, the data is actually coming in 2014 01:12:49,359 --> 01:12:52,158 this form. So we'll use this thing 2015 01:12:50,800 --> 01:12:54,880 called sparse categorical cross 2016 01:12:52,158 --> 01:12:56,399 entropy. And categorical cross entropy 2017 01:12:54,880 --> 01:12:58,159 is a generalization of binary cross 2018 01:12:56,399 --> 01:12:59,839 entropy. I'm not going to get into 2019 01:12:58,158 --> 01:13:01,359 the mathematical details, but the 2020 01:12:59,840 --> 01:13:04,319 intuition is basically roughly the 2021 01:13:01,359 --> 01:13:07,439 same. 2022 01:13:04,319 --> 01:13:09,198 Okay, so this is what we have. If this 2023 01:13:07,439 --> 01:13:11,599 is your output layer, use mean squared 2024 01:13:09,198 --> 01:13:14,079 error. If this is your output layer, use 2025 01:13:11,600 --> 01:13:15,360 binary cross entropy. And if you 2026 01:13:14,079 --> 01:13:17,039 have a stack of these numbers, you can 2027 01:13:15,359 --> 01:13:19,519 still use mean squared error. And if your 2028 01:13:17,039 --> 01:13:22,000 output is a softmax, use categorical 2029 01:13:19,520 --> 01:13:24,560 cross entropy or sparse categorical 2030 01:13:22,000 --> 01:13:26,479 cross entropy. 2031 01:13:24,560 --> 01:13:30,600 Okay. So let's actually run this in 2032 01:13:26,479 --> 01:13:30,599 Colab. Um, 2033 01:13:32,079 --> 01:13:37,198 right. So this is what we have. Can 2034 01:13:33,679 --> 01:13:40,800 folks see this? Okay. All right. So this 2035 01:13:37,198 --> 01:13:44,399 is the data set we saw earlier. Down 2036 01:13:40,800 --> 01:13:47,039 here, as usual, right, we load 2037 01:13:44,399 --> 01:13:49,198 TensorFlow and Keras, we load our usual 2038 01:13:47,039 --> 01:13:51,119 three packages, and then we set the 2039 01:13:49,198 --> 01:13:53,198 random seed for reproducibility. And it 2040 01:13:51,119 --> 01:13:54,719 turns out that the Fashion-MNIST data is 2041 01:13:53,198 --> 01:13:56,000 actually available in Keras. You don't 2042 01:13:54,719 --> 01:13:57,439 have to go find it somewhere and bring 2043 01:13:56,000 --> 01:13:59,279 it in. It's actually available in Keras; 2044 01:13:57,439 --> 01:14:01,119 it's one of the standard data sets. We 2045 01:13:59,279 --> 01:14:04,079 luck out. So we just load the 2046 01:14:01,119 --> 01:14:05,920 data using this load_data command. 2047 01:14:04,079 --> 01:14:08,399 And when you do that, conveniently 2048 01:14:05,920 --> 01:14:10,399 for us, Keras has not only made the data 2049 01:14:08,399 --> 01:14:12,238 available, it has already split it into a 2050 01:14:10,399 --> 01:14:13,920 training and test set. So we don't have 2051 01:14:12,238 --> 01:14:15,279 to do the splitting. Okay. And the 2052 01:14:13,920 --> 01:14:18,279 reason they do that, why would they do 2053 01:14:15,279 --> 01:14:18,279 that? 2054 01:14:18,640 --> 01:14:21,679 They do that so that different people 2055 01:14:20,238 --> 01:14:23,678 who are building algorithms for that 2056 01:14:21,679 --> 01:14:26,640 particular data set can all be evaluated 2057 01:14:23,679 --> 01:14:28,079 using the same test set. 2058 01:14:26,640 --> 01:14:29,600 Otherwise, if I split it one way and 2059 01:14:28,079 --> 01:14:31,439 say, "Hey, look how well I did," it's like, 2060 01:14:29,600 --> 01:14:32,480 "I don't know, how did you split it?" 2061 01:14:31,439 --> 01:14:36,000 >> That's the reason. 2062 01:14:32,479 --> 01:14:38,158 >> Okay.
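[Editor's note: the loading step described above; fashion_mnist.load_data() is the standard Keras entry point and returns the pre-made train/test split. Variable names are illustrative.]

    from tensorflow import keras

    (train_X, train_y), (test_X, test_y) = keras.datasets.fashion_mnist.load_data()

    print(train_X.shape, train_y.shape)   # (60000, 28, 28) (60000,)
    print(test_X.shape, test_y.shape)     # (10000, 28, 28) (10000,)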
So here you can see that 2063 01:14:36,000 --> 01:14:43,760 we have 2064 01:14:38,158 --> 01:14:47,039 the input data as a tensor of rank 2065 01:14:43,760 --> 01:14:48,239 three. Basically, another 2066 01:14:47,039 --> 01:14:50,158 way to think about a tensor of rank 2067 01:14:48,238 --> 01:14:52,879 three is just a list of rank-two 2068 01:14:50,158 --> 01:14:57,279 tensors, right? So here you have 60,000 2069 01:14:52,880 --> 01:15:02,079 images, 60,000 images, and each image is 2070 01:14:57,279 --> 01:15:04,639 a 28 x 28 square of numbers. Each image 2071 01:15:02,079 --> 01:15:07,279 is a 28 x 28 table. And then, of 2072 01:15:04,640 --> 01:15:09,920 course, the output is just what 2073 01:15:07,279 --> 01:15:11,519 category it is, a number between 0 and 9. 2074 01:15:09,920 --> 01:15:13,840 So you just have 60,000 numbers; it's 2075 01:15:11,520 --> 01:15:15,920 just a vector of 60,000 numbers. Okay. 2076 01:15:13,840 --> 01:15:19,039 So there are 60,000 in the training 2077 01:15:15,920 --> 01:15:21,279 set. Oops. And then there are 10,000 2078 01:15:19,039 --> 01:15:23,519 in the test set, same structure, 28 by 2079 01:15:21,279 --> 01:15:25,039 28. That's what we have. So if you 2080 01:15:23,520 --> 01:15:27,040 look at the first 10 rows of the 2081 01:15:25,039 --> 01:15:29,039 dependent variable y, you get these 2082 01:15:27,039 --> 01:15:31,439 numbers: 9, 0, 3, 3, like that. They are 2083 01:15:29,039 --> 01:15:33,359 numbers from 0 to 9. And if you look at 2084 01:15:31,439 --> 01:15:35,919 the Fashion-MNIST GitHub site, this is 2085 01:15:33,359 --> 01:15:37,839 what each refers to. Zero is a t-shirt, 2086 01:15:35,920 --> 01:15:41,600 one is a trouser, and so on and so 2087 01:15:37,840 --> 01:15:43,760 forth. And nine is an ankle boot. 2088 01:15:41,600 --> 01:15:45,280 All right. So whenever I'm working 2089 01:15:43,760 --> 01:15:47,520 with multiclass classification 2090 01:15:45,279 --> 01:15:49,439 problems, I always, you know, do a 2091 01:15:47,520 --> 01:15:51,120 little thing here to help me figure out 2092 01:15:49,439 --> 01:15:52,319 that nine corresponds to an ankle boot, 2093 01:15:51,119 --> 01:15:53,519 and so on and so forth. It just makes it 2094 01:15:52,319 --> 01:15:56,639 a little easier to work with this 2095 01:15:53,520 --> 01:15:59,679 stuff. So I create this little list.
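[Editor's note: the "little list" described above, using the label names from the Fashion-MNIST GitHub page; assumes train_y from the loading sketch earlier.]

    # Index i of this list is the human-readable name of class i.
    class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
                   "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

    print(class_names[train_y[0]])   # first training example: "Ankle boot"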
2096 01:15:56,640 --> 01:16:01,119 And then, okay, what 2097 01:15:59,679 --> 01:16:02,960 is the very first data point? What is 2098 01:16:01,119 --> 01:16:05,279 its y-value? It turns out to 2099 01:16:02,960 --> 01:16:07,679 be an ankle boot. So you can actually 2100 01:16:05,279 --> 01:16:10,238 look at the raw data for that image, 2101 01:16:07,679 --> 01:16:13,119 which is just a 28 x 28 thing, and these 2102 01:16:10,238 --> 01:16:16,959 are the numbers you have. 2103 01:16:13,119 --> 01:16:19,198 See, all these 250s, 233s, lots of zeros, and 2104 01:16:16,960 --> 01:16:20,960 so on and so forth. So you can actually 2105 01:16:19,198 --> 01:16:22,639 visualize the first 25 2106 01:16:20,960 --> 01:16:24,560 images. I have a little bit of code here 2107 01:16:22,640 --> 01:16:25,920 which visualizes that, just matplotlib 2108 01:16:24,560 --> 01:16:28,719 code, and you can see these are all the 2109 01:16:25,920 --> 01:16:32,319 images. They're kind of smallish. This, 2110 01:16:28,719 --> 01:16:34,560 my friends, is an ankle boot, 2111 01:16:32,319 --> 01:16:35,759 right? It's like, okay, can the network 2112 01:16:34,560 --> 01:16:37,360 really make any sense out of this thing, 2113 01:16:35,760 --> 01:16:39,920 right? It looks very blurry, and I don't 2114 01:16:37,359 --> 01:16:42,158 know. 2115 01:16:39,920 --> 01:16:43,679 This is, 2116 01:16:42,158 --> 01:16:45,359 oh, this is actually a better ankle boot. 2117 01:16:43,679 --> 01:16:47,840 Look at that. Okay, sorry, I'm getting 2118 01:16:45,359 --> 01:16:49,599 distracted. So this is what we have 2119 01:16:47,840 --> 01:16:51,520 here. 2120 01:16:49,600 --> 01:16:53,360 Okay, we are at 9:55. 2121 01:16:51,520 --> 01:16:54,880 I'm going to stop so you folks are 2122 01:16:53,359 --> 01:16:56,399 not late for your next class. We'll 2123 01:16:54,880 --> 01:16:58,079 continue this journey on Wednesday, and 2124 01:16:56,399 --> 01:16:59,599 then we'll go on to color images the 2125 01:16:58,079 --> 01:17:03,000 next class as well. Thank you, folks. 2126 01:16:59,600 --> 01:17:03,000 Have a good one.
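[Editor's note: for reference, a sketch of the kind of matplotlib snippet mentioned above, showing the first 25 training images in a 5 x 5 grid with their labels; it assumes train_X, train_y, and class_names from the earlier sketches, and is not the lecture's exact Colab code.]

    import matplotlib.pyplot as plt

    plt.figure(figsize=(8, 8))
    for i in range(25):
        plt.subplot(5, 5, i + 1)
        plt.imshow(train_X[i], cmap="gray")
        plt.title(class_names[train_y[i]], fontsize=8)
        plt.axis("off")
    plt.show()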