Right folks, good morning. Welcome back. I hope you all had a nice weekend, and I hope you had a chance to watch the video walk-through I posted yesterday. It's going to save us some time today. So, let's get right in. Today is going to be super packed. You're going to go from not knowing anything about convolutions, perhaps, for some of you, to actually knowing how convolutional networks work, and actually building one and demoing it in class, okay? And this demo has actually worked pretty well for the last few years that I've taught the class, but you never know; because it's a live demo, it may not work. We'll see. May the Valentine's Day gods be with us. Okay, so let's get going. So, Fashion MNIST: we saw previously, as in the video walk-through, that a neural network with a single hidden layer can get us to an accuracy in the high 80s, okay?
And that network actually didn't know that what was coming in was an image, right? It literally took this table of numbers, took each row, concatenated all the rows into one giant long vector, and sent it in. So, the neural network did not exploit the fact that the input data was known to be of a certain type, okay? Which is the clue for how we can do better, right? So, let's just spend a few minutes on what it is about images that we have to really pay attention to, okay? As opposed to any arbitrary vector of numbers that's coming in. Okay? So, when we flatten the image into a long vector and feed it into a dense layer, several undesirable things can happen. What are some of them? Any guesses?
[Student] I think you lose the proximity of one pixel to the other ones that would be around it.
Right.
So, if you take a particular pixel, let's say the picture shows a t-shirt. If there's a little pixel in the center of the t-shirt, knowing that the surrounding pixels are related to that pixel, because they are all part of this concept called a t-shirt, would certainly be helpful, right? So, to put it more technically, spatial adjacency information is very important, and we need to somehow take that into account. Okay? All right. What else? What else might be going on here?
[Student] You have some metadata about the image, like the resolution.
Oh, I see. So, if you actually had structured data about the image, such as various characteristics, that might be helpful. True. But let's just focus on the case where you only have the raw image and nothing else. And under that constraint, what else might go wrong? Or what else might be suboptimal? Okay. Well, the first thing that might happen is that we may have too many parameters.
So, let's take... these numbers are from my older iPhone. I noticed that when I take a color picture with my phone, it's roughly a 3,000 by 3,000 grid, right? The picture is actually 3,024 pixels on this axis and 3,024 on that axis, okay? So, that gets us to roughly 9 million pixels. But remember, it's a color picture, which means there are three channels, which means there are 27 million numbers, each of which is between 0 and 255, from that one little picture, okay? And now let's say we connect it to a single 100-neuron dense layer. A single 100-neuron dense layer: how many parameters are we going to have, just in that one little part of the network?
Could the mumbling be louder? Yes, roughly 2.7 billion, because 27 million inputs times 100 neurons, right? Roughly, of course; forget about the biases for a moment. It's 2.7 billion. 2.7 billion parameters, right?
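The arithmetic here is easy to check for yourself. A quick sketch; the 3,024-pixel grid and the 100-neuron layer are the lecture's example, nothing else is assumed:

```python
# Parameter count for flattening a phone photo into a 100-neuron dense layer.
height, width, channels = 3024, 3024, 3    # roughly a 3,000 x 3,000 RGB grid
inputs = height * width * channels         # one number per pixel per channel
neurons = 100                              # a single 100-neuron dense layer
weights = inputs * neurons                 # one weight per (input, neuron) pair
biases = neurons                           # one bias per neuron (ignored in class)
print(f"{inputs:,}")    # 27,433,728 -> "roughly 27 million numbers"
print(f"{weights:,}")   # 2,743,372,800 -> "roughly 2.7 billion"
```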
Do you think we can actually get 2.7 billion images to train any of these things? So, you're going to overfit, right? Too many parameters. We have to be smarter about this. It's not going to work, right? That's the first problem. So, this clearly is computationally demanding, very data hungry, and increases the risk of overfitting. Okay? Next, we lose spatial adjacency, right? We are literally ignoring what's nearby. So, that's a huge factor. There's a third factor, right, that we have to worry about, which is this: let's say that the picture has a vertical line on the top left side, and it has some other vertical line on the bottom right side. What this sort of dumb approach is going to do is learn to detect that vertical line on the top left and, independent of that, learn to detect the vertical line on the bottom right.
Okay? Which doesn't make any sense. A vertical line is a vertical line. So, you want to be able to detect it wherever it happens. Detect once, reuse everywhere. That's what you need to do. So, this, by the way, is called translation invariance. Translation is math speak for moving stuff around, right? You take a line and it moves around; it doesn't matter, it's still a line. We have to figure that out. So, these are the three things we need to worry about. We want to learn once and use it all over the place, number one. We want to take spatial adjacency into account, number two. And number three, let's just find a way to make sure that we don't have billions of parameters for simple toy problems. Any questions? Yep.
[Student] Um, is this a problem just because we are compressing the image, or would it have happened anyway?
It would have happened... So, the question was: is it a problem because we are compressing the image, or would it have happened anyway? The answer is it would have happened anyway. You can take any picture and this is going to happen, right? Because I'm not making any assumptions about how the image is coming in to me, whether it's compressed or not, and so on and so forth. Okay. All right. So, convolutional layers were developed to precisely address these shortcomings, and they're an amazing solution, as you will see. Very elegant. All right. So, the next, I don't know, half an hour is going to be me defining a whole bunch of stuff before we actually get to the fun Colabs and so on and so forth. So, just to put it in perspective, I have a PowerPoint, two Colabs, an Excel spreadsheet, and maybe even a Notability file to cover today. Okay?
So, hang on for the next 30 minutes, because it's going to be a little concept heavy before we get to the fun stuff. Stop me, ask me questions, because we do have time. All right. A convolutional layer is made up of something called a convolutional filter. Okay? That's the atomic building block. A convolutional filter is nothing but a small matrix of numbers, like this. It's just a small square matrix of numbers. That's a convolutional filter, okay? Now, a layer is just composed of one or more of these filters. All right? Filters and layers. Now, the thing about the convolutional filter that makes it really magical is that if you choose the numbers in the filter carefully and then you apply the filter to an image, and I'll get to what I mean by applying the filter, this little humble thing has the ability to detect features in your image.
It can detect lines, curves, gradations in color, circles, things like that, okay? It's pretty cool. And so, I'm going to claim, and I'm going to prove shortly, that this little humble filter with the ones and zeros can detect horizontal lines in any picture you give it. Okay? And this thing here has the ability to detect vertical lines. All right? So, I will demonstrate how this thing actually detects all these things, and then we will ask the big question that's probably in your minds already: where are we going to get these numbers from? That all sounds great, Rama, but where are we going to get the numbers from? Okay? And we have a beautiful answer to that question. All right, so let's go. Now, I'm going to first explain what I mean by applying a filter to an image, and then I'm going to give you examples of how the filter works for detecting vertical and horizontal lines. All right. So, let's say that this is the image we have.
Okay? Again, an image. Assume it's a grayscale image, so you just have a bunch of numbers between 0 and 255, okay? So, this is the image we have. It's a little tiny image. And this is the filter that's been magically given to us by somebody. And what we are trying to do now is apply it, okay? So, what we do is literally take this filter, the little one, and superimpose it on the top left part of the image. So, you have the image here, you take this little filter, and you move it to the top left so that they are right on top of each other. Okay? Once you have it right on top of each other, you have these matching numbers. You have three-by-three numbers in the image, three-by-three numbers in the filter, and they're all matching each other, right on top of each other, right? So, you have nine pairs of numbers. And then what we do, once we overlay it, is literally just multiply all the matching numbers and add them up. Okay?
You just multiply all the matching numbers and add them up, and you can confirm later on that the arithmetic I'm doing is actually accurate. Okay? And once you do that, you get some number, right? Um, once you get that number, what we do is go to our good old friend the ReLU, and we just run it through the ReLU. Now, in this case all that effort comes to nothing, because it's zero. That's okay. Okay? So, zero, and this number becomes the top left cell of your output. So, this is called the convolution operation. Okay? And we won't get into why it's called that and so on and so forth; there's a long and rich and storied history to these things. But this is the convolution operation. And once we do that, you can sort of predict what's going to happen next, right? We take the same exact operation and we just move it to the right. We move this little 3 by 3 thing to the right and repeat the exact same process.
Matching numbers: multiply all the matching numbers together, add them up, run the sum through a ReLU. Okay? And then, boom, you get the second number here. And you keep doing that till you reach the very end. You fill up all these numbers, and then you come to the start of the second row. Okay? And you keep on doing that till you reach the very bottom. So, this is what I mean when I say apply a filter to an image. Okay? Any questions? Okay. Microphone, please. Microphone.
[Student, partly inaudible] What happens when... the filter doesn't perfectly match?
Yeah, so you start from the left and then you keep on going. At some point the right edge of the filter is going to match the right edge of the image, and then you stop. Yeah. Now, there are some nuances here.
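The procedure just walked through (overlay the filter, multiply the matching numbers, add them up, run the sum through a ReLU, slide one step) can be sketched in a few lines of NumPy. This is a minimal sketch with the default one-step slide and no padding; the tiny image and the horizontal-line filter values are made up for illustration:

```python
import numpy as np

def convolve_relu(image, kernel):
    """Slide the kernel over the image; at each position multiply the
    matching numbers, add them up, and run the sum through a ReLU."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1   # stop when the filter's edge meets the image's
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):            # move down one row at a time
        for j in range(out_w):        # move right one step at a time
            patch = image[i:i + kh, j:j + kw]                    # overlapping window
            out[i, j] = max(0.0, float(np.sum(patch * kernel)))  # multiply, add, ReLU
    return out

image = np.array([[0, 0, 0, 0, 0],
                  [1, 1, 1, 1, 1],    # a horizontal line
                  [0, 0, 0, 0, 0],
                  [0, 0, 0, 0, 0],
                  [0, 0, 0, 0, 0]], dtype=float)
horiz = np.array([[ 1,  1,  1],
                  [ 0,  0,  0],
                  [-1, -1, -1]], dtype=float)   # a horizontal-line filter
print(convolve_relu(image, horiz))    # only the row with the line lights up
```

A real convolutional layer does exactly this, just vectorized and with many filters whose numbers are learned rather than hand-picked.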
So, for example, you can actually pad the whole image on its borders, so that you can actually go outside the image and it'll still work. Okay? That's number one. Nuance number two: instead of just moving one step to the right every time, you can move two steps to the right, right? And that's something called a stride. Okay? So, there are a bunch of pesky details here, but I'm just ignoring them, because this basic default approach works amazingly well almost all the time. Okay? All right. So, that's the mechanics of how this operation works. All right. Now, I'm going to switch to a spreadsheet which shows this really beautifully, courtesy of the fast.ai people. All right. It's a big spreadsheet, so I'll upload it after class so you can see it. So, all I have done here, rather, all they have done here, thanks to them, is essentially create a table of numbers in Excel, as you can tell.
And they have just put in some numbers. Most of the numbers are zero, but some of these numbers are more than zero; they're like 0.8, 0.9, and so on. Basically, all they have done is, instead of working with numbers between 0 and 255, divide all the numbers by 255 so you get fractions, and they just put the fractions in the table. Okay? And then they have used Excel's very cool conditional formatting to essentially mark in red all the values that are high, right? The closer the number is to one, the more reddish it gets. Okay? And when you do that, the three obviously pops out. So, there is a three in the image. Yes? Okay, good. So, now what we're going to do is move to our little filter here. You can see the filter, right? And I'm claiming this detects horizontal lines. And this table here... sorry. This table here is the result of applying that filter to the three. Okay?
And you can see here, I'm looking at the top left cell here. Look at this top left cell. The formula is nothing more than: multiply all those things and add them up. And once you add them up, run the sum through a max of zero comma that, which is just the ReLU. Okay? Basic arithmetic. So, we do that. And this is the output, and the output is also conditionally formatted to show you where things are lighting up. And you can see only the horizontal lines of the three are lighting up. Everyone see that? Right? So, now you understand the filter is in fact living up to the claim I made for it. Right? Similarly, if you look at what's going on here, this is a vertical filter. Same thing: you apply it, and only the vertical line is lighting up.
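To see why a given cell lights up, it's enough to do the arithmetic for a single 3-by-3 patch sitting on a horizontal edge. A sketch; the patch values are made up, and the filter is the ones/zeros/minus-ones pattern from the spreadsheet demo:

```python
import numpy as np

# A 3x3 patch straddling a horizontal edge: bright on top, dark below.
patch = np.array([[0.9, 0.8, 0.9],
                  [0.9, 0.9, 0.8],
                  [0.0, 0.1, 0.0]])
filt = np.array([[ 1,  1,  1],     # rewards bright pixels in the top row
                 [ 0,  0,  0],     # ignores the middle row
                 [-1, -1, -1]])    # penalizes bright pixels in the bottom row
score = max(0.0, float(np.sum(patch * filt)))   # multiply, add, ReLU
print(round(score, 3))   # 2.5: a big value, so this cell "lights up"

# The same filter on a flat region scores zero: nothing to detect.
flat = np.ones((3, 3))
print(max(0.0, float(np.sum(flat * filt))))     # 0.0
```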
Right? Now, what you can do, and I would encourage you to do this after class, is look at all these numbers here, for example, and ask yourself, "Okay, why is that lighting up?" Right? And you will discover that what's actually going on is that it's looking for edges. It's looking for rows in the table where there is some nonzero thing in the first row and zeros in the second row. And by choosing the numbers carefully, you multiply the ones with positive numbers and you multiply the zeros with zeros, and you'll come up with a positive number, and thereby you detect an edge. Right? So, what I would encourage you to do is use this Excel thing here. All right. So, here is a cell we have. So, let's trace its precedents. Okay. So, you can see here these numbers, right? This is what it's processing, right? This grid is being processed to come up with that big number.
And you can see here, in this grid, all these numbers are here, and then these numbers are a lot lower than those numbers, because there is an edge. Right? The numbers are a lot lower. That's why you can see the horizontal part of the three. And so, what this filter is doing, it's basically saying, "Well, the row that I'm catching here has the ones, the middle has the zeros, and the rest are all minus ones." Right? So, the small values are going to stay very small, the big values are going to get very big, and the overall thing is going to be emphasized. So, that's the basic idea of edge detection. Spend some time with the Excel and it'll become clear to you what I'm talking about here. All right, cool. So, that's that. All right. By the way, there is also a very cool little site here in which you can actually go in and punch in your own numbers and see what it detects.
Lots of edges and curves and this and that. It's very cool, so I encourage you to try it out. So, the key thing here I want to say is: by choosing the numbers in a filter carefully and applying this operation, different features can be detected. All right. Now, I mentioned earlier that a convolution layer is composed of one or more of these filters. And so, you can think of each filter as a sort of specialist for a particular feature. Okay? Maybe it specializes in detecting vertical lines, horizontal lines, semicircles, quarter circles, you don't know. Right? You can imagine them as being specialists. And given that modern images can be very complicated, with lots of interesting features going on, you probably want to have lots of these filters. Okay?
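The edge-detection arithmetic described above can be sketched in a few lines of NumPy. Note that the toy image and the filter values here are made-up illustrations, not the exact numbers from the Excel sheet:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1),
    multiplying element-wise and summing at each position."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# Toy image: a bright band on top of a dark band -> a horizontal edge.
image = np.array([
    [9, 9, 9, 9, 9],
    [9, 9, 9, 9, 9],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
], dtype=float)

# Classic horizontal-edge filter: positive on top, zeros in the middle,
# negative below, exactly the ones/zeros/minus-ones pattern from the lecture.
edge_filter = np.array([
    [ 1,  1,  1],
    [ 0,  0,  0],
    [-1, -1, -1],
], dtype=float)

response = convolve2d(image, edge_filter)
print(response)
```

The response is largest exactly where the bright rows sit on top of the dark rows, which is the edge this filter specializes in.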
But the key is that you don't have to decide up front, "Hey, you, filter, you'd better specialize in detecting vertical lines, and you, on the other hand, stay in your lane and do horizontal lines." Right? You're not going to do that. You will let the system figure out what it wants to figure out. Okay? So, there is no human bottleneck in doing this. And I mention this because there used to be a human bottleneck, you know, before deep learning happened. Now, let's just make sure we understand the mechanics of what happens when you have two of these filters, not one. So, this is the input image as before, this is the filter we saw earlier, and this is another filter we have. The thing is, we just run them in parallel. We take each filter, do the operation, come up with an output; take the other filter, do the operation, come up with its output. And when you do that, the first one gives you that, and the second one gives you that.
And this output is a table of... it's actually not a table. What is it? Louder, please. It's a tensor. Thank you, it's a tensor. And so, these two 5 by 5 matrices can be represented as a tensor of what shape? And there are two right answers. 5 by 5 by 2, correct. So, you can either think of it as 5 by 5 by 2 or 2 by 5 by 5. They're both fine. Which one you go with ends up being a matter of convention. Okay? So, now you begin to see why we care about tensors. Imagine if instead of having two filters, we have 103 filters. The resulting tensor is going to be 5 by 5 by 103. Okay. Good. All right. Now, let's look at the slightly more complex situation where you have not a grayscale image with just a single little table, but an actual color image. Okay? So, we know how to apply a filter to a 2D tensor like this to get that.
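As a sketch of the mechanics just described, here is what running two filters in parallel and packaging their outputs looks like; the 7 by 7 image and the two filter patterns are arbitrary placeholders:

```python
import numpy as np

def convolve2d(image, kernel):
    """Apply one filter: element-wise multiply and sum at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    return np.array([[np.sum(image[i:i+kh, j:j+kw] * kernel)
                      for j in range(out_w)] for i in range(out_h)])

rng = np.random.default_rng(0)
image = rng.random((7, 7))               # a 7x7 grayscale image

# Two hypothetical 3x3 filters, e.g. a vertical- and a horizontal-line detector.
vertical = np.array([[1.0, 0.0, -1.0]] * 3)
horizontal = vertical.T

# Run them in parallel: each filter independently produces a 5x5 output.
outputs = [convolve2d(image, f) for f in (vertical, horizontal)]

# Package the two 5x5 tables into one tensor; both conventions work.
stacked = np.stack(outputs, axis=-1)       # shape (5, 5, 2)
stacked_first = np.stack(outputs, axis=0)  # shape (2, 5, 5)
print(stacked.shape, stacked_first.shape)
```

With 103 filters instead of two, the same `np.stack` call would simply produce a 5 by 5 by 103 tensor.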
But let's say we have something like this, where it has three channels: red, green, blue, RGB. It's got three tables of numbers. So, this is a tensor of shape 6 by 6 by 3, let's say, and you want to apply this 3 by 3 filter just like before. You want to apply the convolution operation. How's that going to work? Do we just apply it to each channel? We first apply it to the red, then to the green, then to the blue. Should we do that? Or is there a problem with that approach? Yeah. Could you use the microphone, please? The problem with the approach, I think, would be the same as what you said earlier: it would probably learn the same lines in each channel, right? Like, the location of the lines is probably the same in each channel. Yes, the location of the line is going to be the same, because that line, if you will, is sort of the aggregation of information from the three different channels. Right.
But the problem here is slightly different, which is that if you do them independently, the network has not been informed that these things are all part of the same underlying concept. As far as it's concerned, it's just three things, and it's just going to process them independently. So, we need to somehow change the filter so that it understands that at this pixel location, the three numbers under it, RGB, are actually parts of the same underlying thing. So, what we do is actually very simple. We just take this filter and make it 3D. We take this filter, and instead of having just one of them, we make it a cube like that, three deep. And once we do that, you can imagine taking this thing here and essentially doing that. Okay. Now, instead of having nine numbers in the image and nine numbers in the filter, you have 27 numbers in the image and 27 numbers in the filter.
But you still match them up, multiply them, add them up, and run the result through a ReLU. By the way, I tried to get ChatGPT to give me a picture like that. It just completely bombed. I tried three, four, five different variations; it just gave up. And then I found this nice picture at deeplearning.ai, and I used it. So, then, if you put different numbers in each of the layers, is that like color processing? Like, it could be doing a different thing to green and blue. I'm sorry, say that again. If you put different numbers in each of the different depth dimensions of your convolution filter, would that be like color processing? Yeah, you will put different numbers. In fact, you have 27 numbers now, but we haven't gotten to the question of where these numbers are coming from, so just hold that thought till we get there. Okay. So, any questions on this? Okay. You literally take the 2D thing and make it 3D.
You basically give it depth, and the depth just matches the depth of the input. So, if the input is, say, 10 deep, your filter is going to be 10 deep. Okay? Yes. Rather than increasing the rank of the tensor by one, is there any instance where you would create a subtraction layer, where you would run an operation across the different layers to come up with an intermediary layer that you would run a lower-rank filter over? Yeah, so there is a lot of stuff in the research literature which tries to do things like that. I'm just describing the most basic approach to doing this. And as it turns out, this basic approach is actually extremely powerful, right? And of course, researchers try to go from the 95% thing to 95.1%, so they invent all sorts of crazy, complicated stuff, which is all good for us, for humanity, but for practical use, this is good enough.
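A minimal sketch of this 3D filter operation, with random numbers standing in for the image and the 27 filter weights:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def convolve3d(image, kernel):
    """Slide a 3x3x3 filter over an HxWx3 image: at each position,
    match up all 27 numbers, multiply, and sum to a single value."""
    kh, kw, kc = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw, :] * kernel)
    return out

rng = np.random.default_rng(1)
image = rng.random((6, 6, 3))    # 6x6 RGB image: three tables of numbers
kernel = rng.random((3, 3, 3))   # the 3x3 filter given depth to match

feature_map = relu(convolve3d(image, kernel))
print(feature_map.shape)
```

Even though the filter is 3D, its output is a single 2D table, because all 27 products collapse to one number per position.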
How do you convert the 3-layer thing into a single 4 by 4 layer? The 4 by 4 is understood, but what about the 3 layers? How do they work? Yeah. So, we are coming to that. I think we have a slide here. Actually, we don't. Never mind, we'll answer that. So, here you have one filter, right? You have one 3 by 3 by 3 filter, which plugs into this thing here, and then it gives you the 4 by 4 at the end. Right? So, for one filter, we know that by doing this operation, we get this 4 by 4. Let's say that you have another filter, which is also 3D. You do that thing, and you'll get another 4 by 4. And if you have 10 filters, you'll get 10 of these 4 by 4s, which then get packaged up into a 4 by 4 by 10 tensor. Remember, whether the input is 2D, 3D, or 10D, what is coming out of each filter is always 2D. Because ultimately, when you apply this operation, at each position you just have one number. And then you do all those things, and you come up with a table of numbers, always.
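The packaging step can be sketched the same way: ten hypothetical 3 by 3 by 3 filters over one 6 by 6 by 3 input, each producing its own 4 by 4 table, stacked into one tensor:

```python
import numpy as np

def convolve3d(image, kernel):
    """One 3D filter over a color image -> one 2D table of numbers."""
    kh, kw, _ = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    return np.array([[np.sum(image[i:i+kh, j:j+kw, :] * kernel)
                      for j in range(out_w)] for i in range(out_h)])

rng = np.random.default_rng(2)
image = rng.random((6, 6, 3))        # 6x6x3 input
filters = rng.random((10, 3, 3, 3))  # ten 3x3x3 filters

# Each filter yields a 4x4 table; stacking all ten gives a 4x4x10 tensor.
maps = np.stack([convolve3d(image, f) for f in filters], axis=-1)
print(maps.shape)
```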
So, what's coming out of each filter is always a 2D table of numbers like that. But when you have lots of filters, you have lots of these 2D tables one after the other, and therefore they get packaged up into a tensor. All right. So, textbook chapter 8.1 has a lot of detail and intuition, which I think is really good, so please try it out. Okay. And folks, by the way, this convolution stuff sort of grows in the telling. So, I would encourage you to revisit it a few times, and then it slowly becomes part of your muscle memory. Don't expect to understand all the nuances in one shot. Do it a few times, and it will become wired into your head. Okay. So, all right, the big question. These seem excellent, but how are we supposed to come up with these numbers? In fact, traditionally, these filters actually used to be designed by hand.
Computer vision researchers would invest prodigious amounts of time and effort and talent to figure out the right kinds of filters to use for various specific applications. So, if you wanted to build an application which would look at, say, MRI images and figure out what kind of features to extract to be able to, say, predict the evidence for a stroke, they would actually hand-design the filter. They'd try lots of different values and then come up with, "Ah, I've got the perfect filter for this thing here." Right? So, that's the way it used to be done. And then, as we figured out how to train deep networks with lots of parameters (we figured out things like the ReLU activation, stochastic gradient descent, GPUs, backprop, things like that), this big idea emerged: why don't we think of the numbers in the filter as just weights? And why don't we simply learn them from the data using backprop?
Right? Just like we learn all the other weights. What's the big deal? And this simple idea, and it feels blindingly obvious in hindsight (I'm sure it was not obvious in foresight), this was the breakthrough. This was the key breakthrough. And it's actually possible to do this because a convolutional filter, as we have seen, is actually just a neuron, and its underlying arithmetic is just neuronal arithmetic. It just happens to be a slightly special one; it's actually even simpler than a regular neuron. In the interest of time, I have one or two slides in the appendix which tell you exactly why it's a neuron, so check it out. But just take my word for it: it's a particular kind of neuron. And because it's a particular kind of neuron, and we know how to work with neurons, right?
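A quick numeric check of the filter-is-a-neuron claim (the numbers here are random; the point is only the arithmetic): the convolution response at one position is exactly a dot product of the flattened patch with the flattened filter, that is, a neuron with shared weights and no bias.

```python
import numpy as np

rng = np.random.default_rng(3)
patch = rng.random((3, 3))    # the 3x3 window of the image under the filter
kernel = rng.random((3, 3))   # the filter: its 9 numbers are the weights

# Convolution at this position: match up, multiply, add.
conv_out = np.sum(patch * kernel)

# A plain neuron with the same 9 weights, zero bias, fed the flattened patch.
neuron_out = patch.flatten() @ kernel.flatten()

print(np.isclose(conv_out, neuron_out))   # True: same arithmetic
```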
You know how to work with neurons, which means that our entire machinery (layers, loss functions, gradient descent, SGD, and so on) is immediately applicable. We don't have to invent any new stuff to make it work. Okay? All right. Do you initialize the layers differently in different applications, like computer vision versus medical imaging, or is it just that the networks have different sizes? Yeah, so, the initialization. It's a good question. Let's come back to it when we get to something called transfer learning, which I'm going to get to by about 9:30. All right. So, that's it. This turned out to be a huge turning point in the computer vision field, and this was the massive unlock in the year 2012.
This computer vision system that used this technology, called AlexNet, burst onto the world stage because it crushed the competition in a competition called ImageNet. The previous best score was a 26% error rate, and this thing came in and had a 16% error rate. Right? It's the kind of thing where, if you see it, you'll be like, "Oh, that must be a typo." Right? Because every year, the improvements in error rate were very small, half a percent, 1%, and then this year it was 10%, and that was because of this approach. All right. Now, one other thing I want to talk about is that with every succeeding convolutional layer, any particular convolutional filter is implicitly seeing much more of the input image as we go along. Right? Which means that in the very beginning, if this is the input, right?
This little convolutional filter, this number here in the first layer, let's say, only sees the top of the chimney of this house, or whatever. But then the next layer, remember, the next layer's input is this particular layer. And so, this particular little thing here is getting information from this whole square here. And every one of the points in that square is actually something big in the original picture. So, with every additional layer, you're seeing more and more and more of the image. All right? And this is a key part of why these things work, because you're essentially hierarchically building a better and better understanding of the image. It is the hierarchical understanding, the hierarchical learning, that's a very key part of the unlock.
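This growth can be made concrete with the standard receptive-field recurrence; here is a sketch assuming a stack of stride-1, 3 by 3 convolutions:

```python
def receptive_field(num_layers, kernel_size=3, stride=1):
    """Receptive field of one output unit after stacking conv layers.
    With stride 1, each extra kxk layer adds (k - 1) input pixels."""
    rf, jump = 1, 1   # start: one output unit maps to one input pixel
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

for n in range(1, 5):
    print(n, receptive_field(n))
# 1 layer sees 3x3 of the input, 2 layers see 5x5, 3 see 7x7, 4 see 9x9.
```

So a unit deep in the stack is implicitly looking at a large patch of the original image, even though its own filter is only 3 by 3.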
And so, if you look at what networks are visualizing (this is actually what a face-detection deep network visualizes of what it's learning), you'll see that the first layer is just learning lines. And the second layer is actually learning edges. Look at this thing, right? It's learning to put these lines together to get some sort of an edge here, another edge here. This looks like three quarters of somebody's ear. And then, these things are being assembled to get whole faces out. Can you imagine the researchers who did this work? They built the network, it's doing really well at detecting faces, and they turn around: "Okay, let's see what it's actually doing." And then this picture pops up. I mean, goosebumps.
Okay, so pooling layers, the next one. So far we've talked about convolutional layers; this is the second building block, and then we'll again go to the Colab. So, pooling layers are also called subsampling or downsampling layers. The idea is that every time a tensor comes out of these convolutional layers, we try to make it slightly smaller, because the act of making it smaller will force the network to summarize and learn what's going on in the complicated thing coming into it, okay? So, I will describe the mechanics first. Let's say that this is the output of a convolutional layer, a 4 by 4. Okay? So, there are two kinds of pooling: max pooling and average pooling. This is called max pooling, and the idea is really simple. In this max pooling layer, there are no weights or parameters to be learned; it's just a simple arithmetic operation.
We basically superimpose a 2 by 2 empty grid on the top left, and then we say, "Hey, what's the biggest number among these four numbers?" Well, the biggest number is 43. Boom, I'm going to stick a 43 here. Then I move my 2 by 2 to the right so that it overlaps with these numbers in blue, and I say, "Hey, what's the biggest number here?" Okay, that's 109. And I move it down: what's the biggest number here? 105, stick it in here. Biggest number here, 35, and I stick it in there. That's it. This is max pooling. Similarly, there's this thing called average pooling, where instead of taking the maximum of the four numbers, we just average them. Okay, the average of these four things in yellow... am I done? The average of these four numbers is 32.2. The average of the blue numbers is 25.5. You get the idea. That's it: max pooling and average pooling.
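The mechanics above can be sketched directly. The 4 by 4 grid below is made up, except that its blocks are arranged so the max of each 2 by 2 block matches the numbers called out in class (43, 109, 105, 35):

```python
import numpy as np

def pool2x2(x, reduce_fn):
    """Slide a non-overlapping 2x2 window over x and reduce each block."""
    h, w = x.shape
    out = np.zeros((h // 2, w // 2))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            out[i // 2, j // 2] = reduce_fn(x[i:i+2, j:j+2])
    return out

# A made-up 4x4 output of a convolutional layer.
x = np.array([
    [  1, 43,   7,  2],
    [  5,  8, 109,  4],
    [105,  3,   6,  9],
    [  2,  7,  35, 10],
], dtype=float)

print(pool2x2(x, np.max))    # max pooling: [[43, 109], [105, 35]]
print(pool2x2(x, np.mean))   # average pooling: the mean of each block
```

Swapping `np.max` for `np.mean` is the only difference between the two pooling flavors.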
Now, as you can see, when you apply pooling, the number of entries drops significantly. Right? The number of entries drops significantly. And the output from this layer is just fed to the next layer as usual. Okay? There's nothing crazy going on. So, it's a way to shrink the output from one convolutional layer before it passes on to the next one: you interject a pooling layer. Now, I actually have, even if I say so myself, a very nice handwritten explanation of the effect of pooling. Unfortunately, I can't get my iPad to show up on my laptop, so I'm not going to be able to do it, but I will record a walk-through and post it; check it out, okay? But the intuition that I tried to convey with that thing is... oh, sorry, I'll come back to this. So, max pooling acts like an OR condition. It basically says: I have this big picture.
In the four things I'm looking at, if any number is really high, that means some feature is being detected. A really high number coming out of a convolutional layer means something somewhere fired, lit up. So I'm just looking to see whether anything lit up in that part of the image. If it did, I say, yep, something lit up. If nothing did, I say, nothing lit up. In that sense, you can imagine it acting like an OR condition: anything fired up here? Here? Here? If yes, okay; otherwise, no. So it acts like a feature detector.
If you have lots of things going on in a particular picture, you want to be able to summarize and aggregate them. You may have a big picture with things lighting up here and there, but you want to step back and say: in this picture, in the top left, nothing lit up; in the top right, something lit up; bottom left, something lit up; bottom right, nothing lit up. You're operating at a higher level of abstraction. That's the effect of pooling.

>> But don't you lose spatial information?

You don't, because what you're actually saying is: the top left has this thing. You already know it's in the top left, and you've already moved up to that level of abstraction. For example, if there's a human eye in the top left and there's a circle detector, it's going to fire and say: hey, in the top left there's an eye. Yep, lit up.
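The "anything lit up?" intuition in a few lines of NumPy; the positions and activation values here are made up for illustration, arranged to match the quadrant pattern described above.

```python
import numpy as np

# Max pooling as an OR over each region: a high activation anywhere in a
# quadrant means "this feature fired somewhere in that region".
fmap = np.zeros((4, 4))
fmap[0, 3] = 9.0   # something lit up in the top-right quadrant
fmap[2, 1] = 7.5   # ...and in the bottom-left quadrant

# Group the 4x4 map into four 2x2 quadrants, then take the max of each.
quads = fmap.reshape(2, 2, 2, 2).transpose(0, 2, 1, 3).reshape(2, 2, 4)
lit_up = quads.max(axis=-1) > 0
# lit_up: [[False, True], [True, False]]
# top-left: nothing; top-right: fired; bottom-left: fired; bottom-right: nothing
```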
So you're not looking at pixels anymore; you're already operating at a higher level of abstraction, and that's how we get around it. But this proceeds slowly and incrementally, which is why you have these big networks.

All right. So, as we saw, successive convolutional layers can see more and more of the original image, and the max pooling layers that follow them can detect whether a feature exists in more and more of the original input as well. So by the time you get to the seventh, eighth, ninth layers and so on, this thing is actually really smart. It's operating at a very high level of abstraction. You can think of it as having tagged all the features in the image at various resolutions, and it can work with that.

>> Is there a trade-off between doing pre-processing as opposed to adding additional convolutional layers?
>> I'm thinking of, say, turning a video into a sequence of black-and-white static images, as opposed to shoving in a color video with a ton of noise. Is there a trade-off element?

There is a trade-off. If your particular data set and input carry some very important domain knowledge that you want to encode into the network, so that the network doesn't waste its capacity learning things you know have to be true, then yes, modify the input. But if you're not sure, then you want to just let the network learn whatever it can; as long as it's focused on predicting as accurately as possible, just let it be.

All right, so that's the basic idea. And again, I'm sorry the Notability thing isn't working, but take a look at the walk-through to really understand how this max pooling business works. Oh, I think I skipped over this.
So when you have something like this, say a tensor coming out of some convolutional layer whose size is 224 by 224 by 64, and you apply pooling, the thing I want to point out is that the pooling works on every slice of the tensor. If the tensor is 224 by 224 by 64, it has a depth of 64, which is basically like saying it's got 64 tables of 224 by 224, and the pooling works on every one of those tables. Which means you'll still have 64 things at the very end; it's just that each of the 64 tables of 224 by 224 will shrink to 112 by 112. So each table shrinks due to pooling, but the number of tables does not change.

Okay. By the way, this link here has a beautiful explanation of all these things, with a bit more complexity as well, from a course taught at Stanford around 2018 or 2019, I forget.
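You can check the every-slice behavior directly. Here is a plain-NumPy sketch of what a 2 by 2 max pool does to a 224 by 224 by 64 tensor: height and width halve, but all 64 channels survive.

```python
import numpy as np

# A random tensor standing in for a convolutional layer's output.
t = np.random.rand(224, 224, 64)

# 2x2 max pool applied to every slice at once: split rows and columns
# into pairs, take the max over each pair.
pooled = t.reshape(112, 2, 112, 2, 64).max(axis=(1, 3))

assert pooled.shape == (112, 112, 64)           # 64 tables, each 112x112
assert pooled[0, 0, 0] == t[0:2, 0:2, 0].max()  # each entry is a 2x2 patch max
```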
Check it out if you're curious about this stuff; it's really good.

All right. So that brings us to the architecture of a basic CNN. What we do is we have an input. We take that input and run it through a bunch of convolutional and pooling layers: there's a convolutional layer, then we pool it, which is why it has shrunk in size; then it goes through another convolutional layer, then we pool it, and it shrinks again; and it keeps going. So we have a series of what are called convolutional blocks. A convolutional block is typically one to two convolutional layers followed by a pooling layer. So you have a series of convolutional blocks.
And the thing to notice is that as you go further and further into the network, the blocks get smaller and smaller because of max pooling, but they get deeper and deeper. And we have figured out empirically that this model of reducing the height and width while making things deeper tends to work really well in practice.

In fact, and apologies to the livestream that I can't use the iPad, I'm going to do this on the board. Let's say you have a picture coming in as 224 by 224, and then you have three of them, because it's a color picture. Can you folks see this okay? All right, let's say this is the input coming in.
And ResNet, which is a very famous network that we're actually going to work with in a few minutes, when it gets done with all this convolution and pooling business, the final tensor it has is of shape 7 by 7, but it is 2048 deep. So it has processed something that was 224 by 224 by 3 down to a much smaller height and width, just 7 by 7, but it has gotten much deeper: 2048 channels. This is a numerical example of what I was saying over there: as you go along, things get smaller but deeper. All right. Yes?
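The smaller-but-deeper progression, written out numerically: five halvings take 224 down to 7. The channel depths below are illustrative of ResNet-style growth, not an exact layer-by-layer specification of the real network.

```python
# Track how the spatial size shrinks while the depth grows.
h = w = 224
depths = [3, 64, 256, 512, 1024, 2048]  # illustrative channel counts
shapes = [(h, w, depths[0])]
for d in depths[1:]:
    h, w = h // 2, w // 2          # pooling/striding halves height and width
    shapes.append((h, w, d))       # while the channel depth grows

# shapes: (224, 224, 3) -> (112, 112, 64) -> ... -> (7, 7, 2048)
```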
>> Is the reason it gets deeper that each layer has a single feature that's picked up, and then it gets stacked on top?

It's not so much that each layer picks up a single feature. The way I think about it is that the number of atomic features you may want to detect is probably not that large: lines, curves, gradations in color, and things like that. But the number of ways you can combine these atomic features to depict real-world things is combinatorial. It's sort of like: I have 10 kinds of atoms, how many molecules can I make from them? You can make a lot of molecules from those 10 atoms, which means you'd better give the network the ability to capture more and more of these possible things the real world can come up with.
And so as the depth increases, you have more filters, and every filter now has the ability to pick up some combination of what's coming in.

>> Sorry, a quick question related to this. Right now our model is being trained to detect certain specific features, like a line or a color or something of that sort. But it still doesn't attach meaning to them, right? It still doesn't know whether that arc is a sun or an eye.

Yeah. So we don't tell it what to learn; it just learns. All we tell it is: make sure you minimize the loss function. Now, once it has finished learning, if it's a good network with good accuracy, then we can introspect. We can peek into the internals and try to understand what it is learning.
And sometimes, like you saw in the face detection example, it's actually learning interesting things: basic lines and edges, then slowly more complicated shapes, and finally entire human faces. Sometimes it may not be understandable.

>> And the way it's doing this is by constructing features? Like, how do you figure out what it's learning?

Oh, I see. So, I'm going to give a reference in just a few minutes. Read the paper; it was one of the first to actually visualize what these things are learning, and it'll give you an idea of how this works. I'm also happy to talk about it offline. It's a bit of a tangent, but a really rich one, so if I keep talking about it I'll end up spending ten minutes on it. I'm going to back off.

Okay. All right. So now, once we do that,
we're back in familiar territory. We take whatever tensor is coming out of these convolutional and pooling operations and flatten it, only now, into a long vector. Once we flatten it, we can connect it to some good old dense layers, like we know how to do, and then finally connect that to whatever output layer you want. In this case, the example is doing multi-class classification, classifying images by what kind of automobile it is or whatever, so it's a softmax. So this is the general framework. Any questions? Yeah.

>> Can you explain again how exactly the depth increases?

Oh, the depth increases because you decide what the depth is. When you add a convolutional layer, you decide how many filters it has. So you just keep adding more and more filters the later you go in the network. It's in your control.
Remember, the number of neurons in a hidden layer is in your control, right? Similarly, the number of filters is in your control. It's a design choice, and we design it so that the later we go, the more depth we have.

>> So you stack layers, and each of those layers has a different filter applied to it?

Yeah, a layer is made up of filters, and the depth just comes from having lots and lots of filters. And you get to choose what they are.

All right. So now let's go to the Fashion MNIST colab that I did the video walk-through on and actually solve it using a convolutional network. All right, cool. At this point I'm going to zip through some of the stuff, because you know the preliminaries have to be done: import all these packages, set the random seed. Great. Then we load the Fashion MNIST data set just like in the colab yesterday, and we create these little labels.
Then we have the standard functions to plot accuracy and loss that we've been using so far. All right. Now we come to the convolutional part. As before, we divide by 255 to normalize everything to a zero-to-one range. Let's confirm that nothing has been tampered with: yep, we have 60,000 images in the training set, each 28 by 28. Now, convolutional networks expect the input to have an explicit channel dimension. Color images have three channels, but black-and-white images have only one channel, one table of numbers. So instead of saying 28 by 28, we tell the convolutional layer to expect 28 by 28 by 1. It's the same thing conceptually, but that's the format it expects.
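That preprocessing looks like this in NumPy. As a stand-in, the sketch uses a batch of 4 random images rather than the real 60,000 Fashion MNIST images; the steps are identical.

```python
import numpy as np

# A stand-in batch of 4 random grayscale "images".
x = np.random.randint(0, 256, size=(4, 28, 28)).astype("float32")
x = x / 255.0                  # normalize pixel values to the 0..1 range
x = np.expand_dims(x, -1)      # (4, 28, 28) -> (4, 28, 28, 1): add channel axis
```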
So we go here and use a thing called expand dims. I'm just telling it to expand its dimension, and once I do that, you can see it's still 60,000 images, but instead of 28 by 28, each has become 28 by 28 by 1. Same thing.

Okay? Now let's define our very first CNN. As before, the input is just keras.Input, no difference here, and we tell it the shape, which of course is just 28 by 28 by 1. Then we come to the first convolutional block. And this is the key thing: if you want to tell Keras to use a convolutional layer, you use layers.Conv2D. From this you can probably also figure out that there's a Conv1D and a Conv3D and so on and so forth, which you should explore; it's really good stuff. But for image processing, Conv2D is all you need. And now we tell it how many filters we want.
So we decide on the number of filters; I've decided on 32. We also have to decide the size of the filter. The simplest size is 2 by 2, so I'm just going to go with that: kernel size is 2 by 2. The activation is, of course, ReLU. I give it a name, convolution one, and then I feed it the input. Once I do that, I follow it up with a little pooling layer, where I use MaxPooling2D. With MaxPooling2D you just pass the input and get the output back; it shrinks everything using pooling. So that's the first convolutional block.

And you know what, I know how to cut and paste. Boom, cut and paste, I get the second convolutional block. I know I just mentioned in lecture that as you go deeper you add more depth, but this is just a starting point, so I'm going to use the same depth. Not a big deal.
It's a simple problem, which is why in the second convolutional block I'm still using only 32 filters. But you could totally go to 64, for instance, to make it much deeper. Once I do that, I finally come to the point where I flatten everything into a long vector, then I connect it to one dense layer of 256 neurons. And then finally I come to the softmax, where I have 10 outputs, one per category of clothing. Then I tell Keras: take this input and this output, string them together, and define a model for me. That's it. That's a convolutional network. The new concepts we're seeing here are Conv2D for the convolutional layer and MaxPooling2D for the max pooling layer. Okay? That's it. So let me just run this thing. It runs. Okay, good.

>> How do you decide when to flatten? And would there ever be a situation in which we just use the method we used before and not use a CNN?
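Putting the verbal description together, here is a sketch of what the colab's model definition plausibly looks like. The layer names and the ReLU activation on the dense layer are assumptions; the exact colab code may differ in detail.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Two convolutional blocks (Conv2D + max pooling), then flatten,
# a 256-neuron dense layer, and a 10-way softmax.
inputs = keras.Input(shape=(28, 28, 1))
x = layers.Conv2D(32, kernel_size=2, activation="relu", name="convolution_1")(inputs)
x = layers.MaxPooling2D()(x)                      # first convolutional block
x = layers.Conv2D(32, kernel_size=2, activation="relu", name="convolution_2")(x)
x = layers.MaxPooling2D()(x)                      # second convolutional block
x = layers.Flatten()(x)                           # back to a long vector
x = layers.Dense(256, activation="relu")(x)
outputs = layers.Dense(10, activation="softmax")(x)  # 10 clothing categories
model = keras.Model(inputs, outputs)
```

With the default 'valid' padding and stride 1, this configuration ends up with roughly the 302 thousand parameters mentioned below.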
Well, we already tried that with Fashion MNIST, right? We didn't use a CNN; we just flattened right away. It wasn't bad, but we're asking: can we do better than the 85 or 88 percent or whatever it was? When you're working with images, it's typically a good idea to start with a CNN straight out of the gate, because you're not giving up anything. In terms of how many layers you should have, my philosophy is: start simple, and if it works, stop working on it. If it doesn't, add more layers. Yeah?

>> Is the architecture design, the number of filters, kernel size, number of layers, convolution, pooling, all based on trial and error?

Yeah, typically it's based on trial and error, to answer your question.
But as you will see in the transfer learning discussion we're going to have soon, instead of doing anything from scratch, it's much better to just download a pre-trained model and adapt it to your particular problem. That is actually the norm by which people do these things. The reason I'm doing it from scratch is that you should know how it was done. It should not be a black box to you. That's my goal. Yeah?
>> Just from a notation perspective, I noticed you named all of these layers X. Is that a habit we should get into, naming them all the same, or is that just a...
>> Actually, I'm not naming the layers X. What's going on here is that I'm feeding each layer X, and whatever comes out of it, I'm just calling X again. That's all. It's just a notational convenience: I reuse one name for the input and the output, and Keras under the hood will track everything and make sure the right thing happens.
Otherwise, I'd have to write X1, X2, X3, X4, and then if I wanted to add a new layer somewhere in the middle, between X3 and X4, I'd have to call that X4 and then renumber everything to 5, 6, 7. A complete pain in the neck. That's why I do this.
All right. So, model.summary: it has about 302,000 parameters. I'll just plot it. And I encourage you to hand-calculate that later and make sure the numbers tally. For now, let's just go. As before, we'll use the same compilation: Adam, and then we'll train it for just 10 epochs, with a validation split of 20% as usual. So let's run it. As you will see, there's a lot more going on in a convolutional network, so it's going to be a bit slower to run. Hopefully not too much slower. While it's doing that, other questions?
>> If we have a task other than image classification, say segmentation, do we still flatten first?
Yeah, so this is for image classification. For other kinds of applications, typically you still run the input through a bunch of convolutional layers and so on, but the output side of the equation gets much more complicated. Because if, instead of classifying the whole picture into, you know, dog or cat, you have to take every pixel and classify it, then you'd better have an output whose shape has the same dimensions as the input. For that we use a different architecture, called U-Net, and so on, which unfortunately I won't be able to get into. But I am planning to post another video walk-through where I show you how to use the Hugging Face Hub to very quickly build models for the other applications like segmentation. I'm hoping to post that tomorrow. It's an optional viewing that might help with that.
Okay. So, is it done? Okay, good. It's done. All right, let's plot the thing here.
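For reference, the Fashion MNIST model described above can be sketched with the Keras functional API, using the same X-reassignment convention from the lecture. The exact kernel sizes and padding here are my assumptions (the in-class summary reported about 302,000 parameters, so the instructor's configuration differs slightly); the point is the structure and the hand-calculation of the parameter count.

```python
# A minimal sketch, assuming 3x3 kernels and default ("valid") padding;
# the instructor's exact model reports ~302k parameters, so his layer
# sizes differ a little from this one.
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(28, 28, 1))              # Fashion MNIST images
x = layers.Conv2D(32, 3, activation="relu")(inputs)  # first conv block: 32 filters
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(32, 3, activation="relu")(x)       # second block: still 32 filters
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)                              # flatten to one long vector
x = layers.Dense(256, activation="relu")(x)          # one dense layer of 256 neurons
outputs = layers.Dense(10, activation="softmax")(x)  # softmax over 10 clothing categories

model = keras.Model(inputs, outputs)

# Hand-check the parameter count, layer by layer:
#   Conv2D: filters * (kh*kw*in_channels + 1);  Dense: in*out + out.
expected = (32 * (3 * 3 * 1 + 1)       # conv 1:     320
            + 32 * (3 * 3 * 32 + 1)    # conv 2:   9,248
            + 5 * 5 * 32 * 256 + 256   # dense:  205,056 (feature map is 5x5x32 here)
            + 256 * 10 + 10)           # softmax:  2,570
assert model.count_params() == expected
```

This is how you "make the numbers tally": every conv layer contributes filters times (kernel area times input channels, plus one bias), and every dense layer contributes inputs times outputs plus biases.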
All right, so it seems like the training loss is going down nicely. Validation is flattening out somewhere around the eighth epoch. Let's look at the accuracy. Same situation here: the accuracy is in the 90s. The final question, of course, is how it does on the test set.
Whoa, 90.5%. Pretty good. By the way, if you're not impressed that we went from 88 to 90: these applications are the proverbial diminishing-returns problems. What you should always do is look at the amount of error that's left and ask yourself, how much of that error am I able to reduce? We had roughly 12% error left when we did the simple Colab yesterday. From that 12% we have knocked off about two percentage points to get over 90, which is great. And in fact, I think the state of the art on this is 97%.
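The "look at the error that's left" framing can be made concrete with a line of arithmetic, using the figures from the lecture:

```python
# Relative error reduction: going from 88% to 90.5% accuracy removes
# about a fifth of the error that was left, not just "2 points".
old_acc, new_acc = 0.88, 0.905
old_err, new_err = 1 - old_acc, 1 - new_acc         # 12% -> 9.5% error
relative_reduction = (old_err - new_err) / old_err  # fraction of remaining error removed
print(f"{relative_reduction:.1%}")  # about 20.8%
```

Measured this way, the CNN eliminated roughly a fifth of all the mistakes the dense network was still making.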
So, I invite you to take this thing and try different filters and so on to see if you can get to the mid-90s. It's not easy, but try it. Yeah?
>> Does the number of epochs have to be related to the number of batches? Because you used a batch size of 64 and 10 epochs.
No, the epochs are independent. An epoch is just one pass through the whole data. Within each pass, within each epoch, the batch size determines how many batches you process: it's the number of examples in your training data divided by the batch size you've chosen, rounded up. That's the number of batches within each epoch. And here I'm choosing 10 epochs because... (Siri found something on the web. Okay.) I chose 10 because it's fast enough to do in class, and 10 is actually more than enough, because you can see it's already beginning to overfit.
Yeah?
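The epochs-versus-batches relationship above is just this arithmetic (using Fashion MNIST's 60,000 training examples and the batch size of 64 from the lecture):

```python
import math

# Batches per epoch = training examples / batch size, rounded up.
n_train, batch_size = 60_000, 64
batches_per_epoch = math.ceil(n_train / batch_size)
print(batches_per_epoch)  # 938  (60,000 / 64 = 937.5, rounded up)

# Epochs are independent: 10 epochs means 10 full passes over the data,
# i.e. 10 * 938 = 9,380 gradient updates in total.
total_steps = 10 * batches_per_epoch
```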
>> This is more of a conceptual question, but is it always the case that a neural network will have better accuracy than a classical machine learning algorithm? I'm asking more in the case of, say, the heart disease problem.
Oh yeah, great question. Neural networks are really good for unstructured data like the images we have here. But if you have structured data, like the heart disease problem, sometimes things like gradient boosting, XGBoost, work really well. So if I'm working on a structured data problem, I'll try both. I'm not going to axiomatically assume that the deep network is the best thing. But if you have unstructured data, it's the best game in town.
All right. By the way, I have a whole section here on how, once you've built a model, you actually improve it. Check it out; it's optional. All right, I'm going to stop this here.
All right. So, we went from 88 to 90-plus percent using convolutional networks. Now let's work with color images. Let's kick it up a notch.
I actually web-scraped all these pictures for you folks, for your enjoyment: roughly 100 color images of handbags and roughly 100 of shoes. The question is, with these essentially 200 images, can we build a really good neural network to classify handbags and shoes? It seems kind of absurd, right? 200 examples is not that much; it doesn't feel like a lot. Fashion MNIST has 60,000 images, and even with that we were overfitting within five to eight epochs. With 200 images, is there any hope? Obviously there is hope, otherwise it wouldn't be in the lecture. So, yeah.
So, we're going to take this data set and see what we can do with it. We'll first build a convolutional network from scratch to solve this problem. All right.
I'm going to run through the code, because at the end of it we'll have a live demo. So, I would like one volunteer to give me a handbag and one volunteer to give me their footwear. In class, yes.
Okay. Unlike the previous data set, this one I just web-scraped, and I've stuck it in this Dropbox folder. Let's download it and unzip it. Once we do that, we have to organize these 200 images, so I have to do some boring-ish Python here. With roughly 100 handbags and 100 shoes, what this code does is create a directory structure: it splits everything into train, validation, and test.
And then for each of the splits, it creates a handbags folder and a shoes folder. Once we do that, this directory structure exists: a training folder, a validation folder, and a test folder, and within each of them, handbags and shoes. In fact, you can see it here. The idea is that when you're working with images, you can just create one folder for each kind of image, say dogs and cats: two folders, one with cat images and one with dog images, and then just point Keras at it. It will automatically figure out that those are the labels. It makes things easy for you, so it's very convenient when you're working with images. And the book explains this in great detail.
All right. When working with these color images, we'll follow this process. We'll read in the JPEGs.
We'll convert them to tensors. And then, since I'm web-scraping these, they all come in different shapes and sizes, so I need to bring them to a common size: I resize them, and then I batch them, here with a batch size of 32. This utility from Keras does all of that for you, very quickly. It reports that it found 98 images in the training data belonging to two classes, 49 in the validation set, and 38 in the test set. So, fewer than 100 examples in the training set. That's what we have.
All right. What's the time? 9:30. Okay. Now let's check the dimensions to make sure. Good: 224 by 224 by 3. Why did I pick 224 by 224? As you will see later, we're going to use something called ResNet, and ResNet expects its input to be 224 by 224 by 3. That's why I resized everything to 224 by 224. Let's look at a few examples of my wonderful web scraping in action. Pretty wild, right? Okay.
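The read-tensor-resize-batch pipeline above can be sketched on synthetic stand-ins for the decoded JPEGs (in the actual Colab, the Keras utility being described is `keras.utils.image_dataset_from_directory(..., image_size=(224, 224), batch_size=32)`, which does all of this in one call given the folder structure):

```python
# A sketch of the resize-and-batch step, using random tensors as stand-ins
# for decoded JPEGs of assorted sizes.
import tensorflow as tf

def resize_for_resnet(image):
    # ResNet expects 224 x 224 x 3 inputs, hence this target size.
    return tf.image.resize(image, (224, 224))

# Fake "decoded JPEGs" of assorted heights and widths, all 3-channel color.
raw_images = [tf.random.uniform((h, w, 3))
              for h, w in [(300, 200), (180, 260), (224, 224)] * 32]

ds = (tf.data.Dataset.from_generator(
          lambda: iter(raw_images),
          output_signature=tf.TensorSpec(shape=(None, None, 3), dtype=tf.float32))
      .map(resize_for_resnet)   # bring everything to the same size
      .batch(32))               # batch size used in the lecture

first_batch = next(iter(ds))
print(first_batch.shape)  # (32, 224, 224, 3)
```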
Now, let's do a simple convolutional network. Before, with Fashion MNIST, we took all the X values and manually divided them by 255 to normalize them to the 0-to-1 range. Well, we're graduating to the higher levels of Keras now, so let's not do that; manual stuff is bad. We'll do it within Keras using something called the Rescaling layer, where you just tell it how much to rescale by, and boom, it does it for you. The first convolution block has 32 filters, just like with Fashion MNIST; the second block, again 32; then max pool, then flatten. And since it's only handbags versus shoes, a sigmoid is enough: it's just a binary classification problem, so I'm using one output unit with a sigmoid. That's our model. So let's build the model. All right, model summary: about 101,000 parameters in this little model. Okay, let's compile it and run it. And note that because it's a binary classification problem, I'm using binary cross-entropy.
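A rough sketch of this binary model, including the Rescaling layer and the binary cross-entropy compile step. The layer details here are my assumptions (the in-class summary reported roughly 101,000 parameters, so the instructor's exact configuration differs a little):

```python
# A minimal sketch of the handbags-vs-shoes model, assuming 3x3 kernels
# and default padding; exact sizes in the lecture differ slightly.
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(224, 224, 3))       # color images, resized earlier
x = layers.Rescaling(1.0 / 255)(inputs)         # replaces the manual "divide by 255"
x = layers.Conv2D(32, 3, activation="relu")(x)  # first conv block, 32 filters
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(32, 3, activation="relu")(x)  # second block, again 32
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)  # one sigmoid unit: binary problem

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="binary_crossentropy",  # binary classification -> binary cross-entropy
              metrics=["accuracy"])
```

A single sigmoid unit is enough here because the probability of "shoe" is just one minus the probability of "handbag"; with 10 classes, as in Fashion MNIST, you need a softmax over 10 outputs instead.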
Adam again, and accuracy as the metric. Compile, and then boom, let's run it, for 20 epochs. Hopefully.
Okay, while it's doing this business, I'm going to shift to the PowerPoint. We'll come back to see how well it did, but whatever it did, we built it from scratch. So the question is, can we do better than that? We only have about 100 examples of each class, which brings us to something very cool and very powerful called transfer learning. The key thing is that there are two research trends going on that we take advantage of. The first is that researchers have designed architectures which exploit the kind of input you have. Olivia asked the question: if you have a particular kind of input, say images, do you change the input, or do you change the network?
As it turns out, if the input is images, we know we should use convolutional layers, because convolutional layers were designed to exploit the image-ness of the input. Similarly, if you have sequences of information, like natural language, audio, video, gene sequences, and so on, these things called transformers were invented to exploit them, and we're going to spend a lot of time on transformers starting next week. So that's the first trend. The second trend is that researchers have used these innovations to create and train models on vast data sets, and thankfully they've made them publicly available for us to use. So transfer learning is the idea that when you have a particular problem, you take a pre-trained network somebody has already created and customize it to your problem, rather than building anything from scratch. That's the basic idea.
So here we have to build a classifier which takes in an arbitrary image and figures out whether it's a handbag or a shoe. That's our goal. Now, handbags and shoes are everyday objects, so you can look around and see whether there are networks, trained by other people, that have been trained on everyday images, as opposed to specialized images like MRIs or X-rays. Of course, the first thing you should probably do is check whether anybody has already built the specific thing you want, a handbags-versus-shoes classifier, on GitHub. Assuming not, then you do transfer learning. Now, it turns out there's this thing called ImageNet, which is a database of millions of images of everyday objects in a thousand different categories: furniture, animals, automobiles, you get the idea. So we can look for networks that have been trained on ImageNet.
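The transfer-learning recipe being set up here (covered in detail later in the course) looks roughly like this sketch: take a network pre-trained on ImageNet, such as the ResNet mentioned earlier, freeze it, and put a small handbags-versus-shoes head on top. Note that `weights=None` below is only to avoid the ImageNet weight download in this sketch; in practice you would pass `weights="imagenet"`.

```python
# A sketch of transfer learning with a ResNet50 backbone. weights=None here
# avoids the download; use weights="imagenet" for real transfer learning.
from tensorflow import keras
from tensorflow.keras import layers

base = keras.applications.ResNet50(weights=None,       # "imagenet" in practice
                                   include_top=False,  # drop the 1000-class ImageNet head
                                   input_shape=(224, 224, 3))
base.trainable = False                                 # freeze the pre-trained features

inputs = keras.Input(shape=(224, 224, 3))
x = keras.applications.resnet50.preprocess_input(inputs)  # ResNet's own preprocessing
x = base(x, training=False)                # run frozen backbone in inference mode
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)        # new binary head
model = keras.Model(inputs, outputs)
```

Only the tiny new head is trained, which is why this works even with about 100 examples per class.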
Okay, let me just go back to the Colab to make sure it doesn't time out. All right, it has finished. Let's plot these things. Okay, so there is some overfitting that happens around the 10th epoch. Let's look at the accuracy. The training accuracy is getting to almost 100%, but we're not interested in training accuracy, right? We care about validation and test accuracy, and that seems to be hovering around the 80s. So let's evaluate it anyway and see what happens. Okay, it gets to 87% accuracy on this data set, which is actually pretty good given that we only have about 100 examples per class. So, 87% accuracy, and we built everything from scratch. Now, there's this whole section about data augmentation, which, you know what, do we have time?
So, the idea of augmentation is that when you have an image, say you take this image and rotate it slightly, by 10 degrees. If it was a handbag before you rotated it, it sure as hell is a handbag after you rotated it. The meaning of the image doesn't change just because you rotated it slightly. Or maybe you zoom in slightly, zoom out slightly, crop it slightly: nothing happens. So what you can do is take any image you have, perturb it slightly, and then add it as a new example to your training data. This is an unbelievable free lunch, frankly. And the same kinds of techniques actually work for text too, which we'll cover later on. This broad area is called data augmentation. It's a great way, when you don't have a lot of data, to artificially bolster the amount of data you have.
And of course, Keras makes it very easy to do all these things; it comes with a whole bunch of predefined data augmentation layers. Here's a little example where I take a picture and randomly flip it horizontally; then I randomly rotate it by a factor of 0.1 (per the Keras documentation, that factor is a fraction of a full circle, so 0.1 means up to roughly 36 degrees either way); and then a random zoom, in and out a little bit. But it won't do this to every picture: the perturbations are applied randomly, so only some pictures get perturbed in some ways. That's how you make sure there's enough diversity in the pictures you have. Once you do that, you can take a picture and see what it does; I grab a random picture each time, so it keeps changing. Yeah, look at this handbag: slightly rotated this way, rotated that way. Some more.
Maybe a little bit of zooming going on, and so on. You get the idea, right? And there's a whole list of these things you can do. But when you do them, make sure that what you're doing doesn't actually change the underlying meaning of the picture. That's really important. So for example, if you're working with satellite data, be very careful about doing crazy flips. Or even if you're working with everyday images: horizontal flips are okay, but don't do vertical flips. How many times will you have an upside-down dog picture that you need to classify? Make sure your augmentation doesn't go nuts. All right. Once you do that, you can just insert the data augmentation layers into your model right there, right after the input. The rest of it can stay unchanged.
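For concreteness, a minimal sketch of what such an augmentation pipeline can look like in Keras, assuming Fashion-MNIST-sized 28 by 28 grayscale inputs; the layer names are real Keras layers, but the factors and the little model around them are illustrative, not the Colab's exact code. (To resolve the degrees-versus-radians question: per the Keras documentation, the `RandomRotation` factor is a fraction of a full turn, so 0.1 means up to about 36 degrees either way.)

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# The augmentation layers described above, bundled together.
data_augmentation = keras.Sequential([
    layers.RandomFlip("horizontal"),  # safe for everyday photos; avoid "vertical"
    layers.RandomRotation(0.1),       # factor is a fraction of a full turn (2*pi)
    layers.RandomZoom(0.1),           # zoom in or out by up to 10%
])

# Insert the augmentation right after the input; the rest of the model
# can stay unchanged. These layers are only active during training and
# act as the identity at inference time.
inputs = keras.Input(shape=(28, 28, 1))
x = data_augmentation(inputs)
x = layers.Flatten()(x)
x = layers.Dense(128, activation="relu")(x)   # illustrative hidden layer size
outputs = layers.Dense(10, activation="softmax")(x)
model = keras.Model(inputs, outputs)
```

Because the perturbations are sampled fresh on every pass, calling `data_augmentation` repeatedly on the same picture gives you a slightly different picture each time, which is exactly the diversity effect described above.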
So this is a great way to increase the size of your training data. Here's a model, and I invite you to play with it and train it; in the interest of time we won't actually train this one, but it's in the Colab, so you can just try it. Data augmentation also figures prominently in homework one, by the way, so you'll get more experience with this. Okay, so back to the slides. So this is what we have. It turns out that any network that has been trained on this ImageNet data set learns all kinds of interesting features in every one of its layers. This is the first layer, and you can see it's picking up gradations of color, sort of line-ish kinds of behavior. Layer two is actually picking up... hey, look, it's picking up an edge. Can you see that edge? Like that. And then layer three is picking up these interesting honeycomb shapes, and so on.
Oh, and this one is actually already picking up the shape of a human torso. Yeah, and this layer is picking up what looks like a Labrador retriever. Isn't that cute? Come on, even if you're not a dog person. All right. So this is the visualization I was referring to earlier, for figuring out what these networks are actually learning. This paper was one of the first to actually visualize what's going on inside, so if you folks are curious how these pictures are produced, I'd encourage you to check it out. Okay, yep?

So, we spoke about images, and you referred to classes, and to text next week with transformers. But what about, say, an email, which has both text and images, and maybe white space, depending on who has written it?
Does that get put in as an input as an image, or...?

So, we'll revisit this great question a bit later in the course. The answer is a bit complicated and I want to do it justice, so we'll come back to it. All right. So it turns out this thing called ResNet is a family of networks that were trained on this ImageNet data set, and they did really well in the competition associated with the ImageNet data set, the ImageNet challenge (ILSVRC). So this is an example of such a network. We would expect the weights and parameters of ResNet, given that it's been trained on ImageNet, to have some knowledge about lines and shapes and curves and things like that. So maybe we can just use that, right? But the thing is, we can't use ResNet as is, because remember, it was trained to classify an incoming image into a thousand possibilities. Here we only have two possibilities, handbags and shoes. So what we do is very simple and elegant.
We do just a little bit of surgery. We take ResNet and stop just before the final layer. Take my word for it: this thing here says fully connected, one thousand, because it's a thousand-way classifier, a thousand objects. So we take everything and stop just before that last layer. And what comes out of that layer, hopefully, will be a very smart representation of the images it has been trained on. So we can think of this sort of headless ResNet as our model. We can take all our data and run it through ResNet up to, but not including, the last layer. You get some tensor, and that tensor probably carries a very rich understanding of what's going on in the image: all the objects and features and things like that. So we can think of it as a smart representation of the input.
We can connect that to just a little hidden layer, and then a little sigmoid which tells you handbag or shoe, and we can just run this network. Okay? And since the inputs to the hidden layer are now not raw images anymore, but this much higher level of abstraction that ResNet has learned, hopefully it can get the job done with hardly any examples. Okay? And now you can get fancier. That's the basic idea, but you can get much fancier: you can connect headless ResNet directly to our little network, the hidden layer and the final output, and the whole thing can be trained end to end. But when you do that, you must start the training with the weights you downloaded with ResNet, because those are the crown jewels that have been learned, so you want to start from there. You will do this in homework one. Okay? All right. By the way, these pre-trained models are available all over the internet: there's TensorFlow Hub, there's PyTorch Hub, and then there's the Hugging Face Hub.
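The fancier, end-to-end version can be sketched like this, assuming a ResNet50 backbone initialized from its downloaded ImageNet weights as the starting point; the head sizes, the 224 by 224 input shape, and the small learning rate are illustrative assumptions, not the homework's actual settings.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Headless ResNet, initialized from the downloaded ImageNet weights.
# Starting from these weights (not random ones) is the whole point.
backbone = keras.applications.ResNet50(include_top=False,
                                       weights="imagenet",
                                       input_shape=(224, 224, 3))
backbone.trainable = True  # the whole thing trains end to end

inputs = keras.Input(shape=(224, 224, 3))
x = keras.applications.resnet50.preprocess_input(inputs)  # ResNet's expected scaling
x = backbone(x)
x = layers.Flatten()(x)
x = layers.Dense(256, activation="relu")(x)         # little hidden layer
outputs = layers.Dense(1, activation="sigmoid")(x)  # handbag vs. shoe

model = keras.Model(inputs, outputs)
# A small learning rate (an assumption here) helps avoid wrecking the
# pre-trained weights in the first few gradient updates.
model.compile(optimizer=keras.optimizers.Adam(1e-5),
              loss="binary_crossentropy",
              metrics=["accuracy"])
```

The contrast with the simpler approach: here the backbone's 23 million parameters keep updating during training, whereas in the feature-extraction version they stay frozen and only the little head learns.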
When I checked it yesterday, on the 13th, it had over half a million models available for download. Half a million! I think last year, when I taught the course, it was like 50,000. So, yes?

I was just wondering, doesn't this make your neural network susceptible to adversarial attacks, because the weights have been pre-trained?

Yes, there is some adversarial risk. I'm happy to talk about it offline. All right. So that's what we have. So back to the Colab. Okay, so this is ResNet. ResNet is all packaged up and available for download, so we download it here. And you see that I'm saying include_top=False. Basically you're telling Keras: the top, the very final layer of the thing, don't give it to me; just give me everything up to but not including that. Of course, I think of it as left to right; other people think of it as bottom to top, so: the very top layer, don't give it to me.
You're telling it this so that you don't have to go in and remove the layer manually. Okay? And then I'm not going to summarize it... well, I'll summarize some of it, just to show you how big it is. Okay? 23 million parameters. That's ResNet. And I won't plot it, because then I'd be scrolling for five minutes. So let's just do this now. What we're going to do is run all the data through this thing, and whatever comes out at that penultimate point, I'm going to grab it and store it. That's what this code does. All right. And now we create a handy little function to do all these things. Once I do that, every image has been sent through ResNet up to, but not including, the final layer, and whatever would have gone into the final layer, we're storing. Then we'll create a simple network and feed it only that information.
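As a sketch, the download and feature-extraction step might look like this; `extract_features` is a hypothetical helper name (the Colab's own function may differ), and the 224 by 224 input size is the standard ResNet50 default, which we assume here.

```python
import numpy as np
from tensorflow import keras

# Download ResNet50 without its final 1000-way classification layer.
# weights="imagenet" pulls the pre-trained weights from the internet.
backbone = keras.applications.ResNet50(
    include_top=False,          # chop off the very top (final) layer
    weights="imagenet",
    input_shape=(224, 224, 3),
)
backbone.trainable = False      # feature extraction only; no fine-tuning here

def extract_features(images):
    """Hypothetical helper: run a batch of images through headless ResNet
    once, and return the feature tensors so they can be stored."""
    # ResNet50 expects its own input preprocessing.
    x = keras.applications.resnet50.preprocess_input(images.astype("float32"))
    return backbone.predict(x, verbose=0)

# Each image comes out as a 7 x 7 x 2048 feature tensor.
features = extract_features(np.random.rand(4, 224, 224, 3) * 255.0)
# features.shape is (4, 7, 7, 2048)
```

The key efficiency trick is that this expensive pass through the 23-million-parameter backbone happens once per image; the stored features are then reused for every training epoch of the little head.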
So here's what's coming out of ResNet: you can see 98 examples in the training data, and each example is now a 7 by 7 by 2048 tensor. That's what came out of ResNet; you saw that's what I did there. Okay? All right, so that's what it looks like. Now let's create our actual model. We have our input, which is just a 7 by 7 by 2048 tensor. We flatten it immediately. Then we run it through a dense layer with 256 ReLU neurons, and then we use dropout, which I haven't talked about yet; I'll talk about it early next week, so don't worry about this detail for the moment. And then we just run it through a sigmoid. Okay? And that's our model. Finished. Plot the model: this is what we have. Model summary. All right, good. Now let's actually train this thing. I'm just going to run it for 10 epochs, because I tried running it previously and it seems to do a fine job in just one epoch. Okay, it's already done.
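A minimal sketch of that head model, following the layers as described: flatten, dense ReLU layer of 256, dropout, sigmoid. The dropout rate and the compile settings are assumptions, since they aren't stated here.

```python
from tensorflow import keras
from tensorflow.keras import layers

# The input is the stored ResNet feature tensor, not a raw image.
inputs = keras.Input(shape=(7, 7, 2048))
x = layers.Flatten()(inputs)                    # 7 * 7 * 2048 = 100,352 values
x = layers.Dense(256, activation="relu")(x)     # the little hidden layer
x = layers.Dropout(0.5)(x)                      # rate 0.5 is an assumed value
outputs = layers.Dense(1, activation="sigmoid")(x)  # handbag vs. shoe

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam",                 # assumed optimizer/loss choices
              loss="binary_crossentropy",
              metrics=["accuracy"])
```

Training this is cheap: it's a tiny network sitting on top of precomputed features, which is why each epoch finishes almost instantly.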
It's so fast because we ran everything through this monster ResNet once, took all the output values, and used them as the starting point, right? We don't have to run it every single time. So you can see here the accuracy is quite high. Wow, interesting: in the 10th epoch something bad happened. Maybe I should have stopped at the ninth epoch. I didn't see this yesterday when I was running it. So much for reproducibility. So let's just run this. Oh wow, look: on the test set it's achieving 100% accuracy. It's unbelievable. Okay folks, now for the moment of truth. All right, I have a little code snippet here to capture stuff from the webcam. Because the accuracy went down in that last epoch, I'm a little worried that the demo is going to flunk. But you know what? We all have to live dangerously. So here's a little function to predict what's going to happen. Okay. Now, I tried it at home yesterday, by the way.
And it's like, "Yay, it's a handbag." So, okay. Now let's do something else. Okay, any volunteers? I want a piece of footwear or a handbag. That's more like a backpack, right? I don't know, it feels like an adversarial example, but yeah, let's just try it. Okay. No disrespect, but let me go with the shoe first; I have a better chance of it working. It's a pretty big shoe. If it can't get this shoe, I'm worried about this model. All right. So... okay, hold on, hold on, hold on. All right. Please don't get distracted by my hand. Capture. It's a shoe! Look at that. Phew. All right, thanks. Okay, now let's try that one. I'm feeling kind of brave now. Thank you. All right, let's do this. Camera capture. Okay. Put its better side forward. It's a handbag! Look at that.
I swear, every time I do the demo I age a few years. All right folks, I'm done. Thank you.