Right folks, good morning. Welcome back. I hope you all had a nice weekend, and I hope you had a chance to watch the video walk-through I posted yesterday; it's going to save us some time today. So let's get right in. Today is going to be super packed: you're going to go from perhaps not knowing anything about convolutions to knowing how convolutional networks work, and we'll actually build one and demo it in class. This demo has worked pretty well for the last few years that I've taught the class, but it's a live demo, so you never know. Maybe the Valentine's Day gods will be with us.

Okay, so let's get going. Fashion MNIST we saw previously, in the video walk-through: a neural network with a single hidden layer can get us to an accuracy in the high 80s. And that network didn't actually know that what was coming in was an image. It literally took the table of numbers, took each row, concatenated all the rows into one giant long vector, and sent it in. So the neural network did not exploit the fact that the input data was known to be of a certain type, and that's the clue for how we can do better.

So let's spend a few minutes on what it is about images that we really have to pay attention to, as opposed to any arbitrary vector of numbers coming in. When we flatten the image into a long vector and feed it into a dense layer, several undesirable things can happen. What are some of them? Any guesses?

>> I think you lose the proximity of one pixel to the other ones that would be around it.

Right. Say the picture shows a t-shirt, and there's a little pixel in the center of the t-shirt. Knowing that the surrounding pixels are related to that pixel, because they are all part of this concept called a t-shirt, would certainly be helpful. To put it more technically: spatial adjacency information is very important, and we need to somehow take it into account. All right, what else might be going on here?

>> You have some metadata about the image, like the resolution.

I see, so if you had structured data about the image, various characteristics, that might be helpful. True. But let's focus on the case where you only have the raw image and nothing else. Under that constraint, what else might go wrong, or be suboptimal?

Okay. Well, the first thing is that we may have too many parameters.
So let's take some numbers from my older iPhone. I noticed that when I take a color picture with my phone, it's roughly a 3,000 × 3,000 grid: the picture is actually 3,024 pixels on this axis and 3,024 on that axis. That gets us to roughly 9 million pixels. But remember, it's a color picture, which means there are three channels, which means there are roughly 27 million numbers, each between 0 and 255, from that one little picture. Now let's say we connect it to a single 100-neuron dense layer. A single 100-neuron dense layer: how many parameters are we going to have, just in that one little part of the network? Could the mumbling be louder?

>> Roughly 2.7 billion.

Yes, roughly 2.7 billion, because it's 27 million inputs times 100. Forget about the biases for a moment. 2.7 billion parameters. Do you think we can actually get 2.7 billion images to train this thing?

>> So then you're going to overfit.

Right. Too many parameters. We have to be smarter about this; it's not going to work. That's the first problem: this approach is computationally demanding, very data hungry, and it increases the risk of overfitting.

Next, we lose spatial adjacency. We are literally ignoring what's nearby. That's a huge factor.

And there's a third factor to worry about. Let's say the picture has a vertical line on the top left and some other vertical line on the bottom right. What this rather dumb approach is going to do is learn to detect the vertical line on the top left, and then, independently of that, learn to detect the vertical line on the bottom right. Which doesn't make any sense: a vertical line is a vertical line, and you want to be able to detect it wherever it happens. Detect once, reuse everywhere. This, by the way, is called translation invariance. Translation is math speak for moving stuff around: you take a line and move it around, it doesn't matter, it's still a line.

So these are the three things we need to worry about. One, we want to learn once and use everywhere. Two, we want to take spatial adjacency into account. And three, let's find a way to make sure we don't have billions of parameters for simple toy problems.
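As a quick sanity check of that arithmetic, here is the count in a few lines (a sketch; the 3,024-pixel and 100-neuron figures are the ones from the example above):

```python
# Rough parameter count for flattening a phone photo into a dense layer.
height = width = 3024           # pixels per side (approximately an iPhone photo)
channels = 3                    # RGB
inputs = height * width * channels
print(f"{inputs:,} input numbers")   # 27,433,728 -> roughly 27 million

neurons = 100
weights = inputs * neurons           # ignoring the 100 biases
print(f"{weights:,} weights")        # 2,743,372,800 -> roughly 2.7 billion
```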
Any questions?

>> Is this a problem just because we are compressing the image, or would it have happened anyway?

The question was: is it a problem because we are compressing the image, or would it have happened anyway? The answer is it would have happened anyway. You can take any picture and this is going to happen, because I'm not making any assumptions about how the image comes in to me, whether it's compressed or not, and so on.

Okay. So, convolutional layers were developed precisely to address these shortcomings, and they are an amazing solution, as you will see. Very elegant.

All right. The next half hour or so is going to be me defining a whole bunch of stuff before we actually get to the fun Colabs. Just to put it in perspective, I have a PowerPoint, two Colabs, an Excel spreadsheet, and maybe even a Notability file to cover today. So hang on for the next 30 minutes, because it's going to be a little concept heavy before we get to the fun stuff. Stop me and ask questions, because we do have time.

All right. A convolutional layer is made up of something called a convolutional filter. That's the atomic building block. A convolutional filter is nothing but a small square matrix of numbers, like this one. And a layer is just composed of one or more of these filters. Filters and layers.

Now, the thing about the convolutional filter that makes it really magical is this: if you choose the numbers in the filter carefully and then apply the filter to an image, and I'll get to what I mean by applying the filter, this little humble thing has the ability to detect features in your image. It can detect lines, curves, gradations in color, circles, things like that. It's pretty cool. I'm going to claim, and shortly prove, that this little humble filter with the ones and zeros can detect horizontal lines in any picture you give it. And this one here has the ability to detect vertical lines. I will demonstrate how it detects these things, and then we will ask the big question that's probably in your minds already: where are we going to get these numbers from? That all sounds great, Rama, but where are we going to get the numbers? We have a beautiful answer to that question.

All right, let's go. First I'm going to explain what I mean by applying a filter to an image, and then I'll give you examples of how a filter detects vertical and horizontal lines. So let's say this is the image we have. Assume it's a grayscale image, so you just have a bunch of numbers between 0 and 255. It's a little tiny image.
And this is the filter that's been magically given to us by somebody. What we're trying to do now is apply it. So we literally take this filter, the little one, and superimpose it on the top left part of the image. You have the image here, you take the little filter, and you move it to the top left so that they are right on top of each other.

Once it's on top, you have matching numbers: nine numbers in the image, nine numbers in the filter, each pair right on top of each other. So you have nine pairs of numbers. And then we literally just multiply the matching numbers and add everything up. You can confirm later that the arithmetic I'm doing here is accurate. Once you do that, you get some number.

Once you get that number, we go to our good old friend the ReLU and run it through. Now, in this case all that effort comes to nothing because the result is zero. That's okay. And that zero becomes the top left cell of your output.

This is called the convolution operation. We won't get into why it's called that; there's a long and rich and storied history behind these things. But this is the convolution operation.

And once we've done that, you can predict what happens next: we take the exact same operation and move it one step to the right. We move this little 3 by 3 thing to the right and repeat the same process: multiply the matching numbers, add them up, run the sum through a ReLU. And boom, you get the second number here. You keep doing that until you reach the right edge, which fills up the first row of the output, and then you move down to the start of the second row. And you keep going until you reach the very bottom. That is what I mean when I say apply a filter to an image.
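Here is that slide-multiply-add-ReLU loop as a minimal NumPy sketch (my own toy numbers, not the slide's; the filter is the ones/zeros/minus-ones horizontal-line detector discussed in this lecture):

```python
import numpy as np

def convolve2d(image, filt):
    """Slide `filt` over `image`; at each spot, multiply matching numbers, sum, ReLU."""
    fh, fw = filt.shape
    ih, iw = image.shape
    out = np.zeros((ih - fh + 1, iw - fw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = image[r:r + fh, c:c + fw]            # the superimposed region
            out[r, c] = max(0.0, np.sum(patch * filt))   # multiply-add, then ReLU
    return out

# A tiny 6x6 "image" with a bright horizontal bar across the middle.
image = np.zeros((6, 6))
image[2, :] = 1.0

# Ones on top, zeros in the middle, minus ones below: a horizontal-line detector.
filt = np.array([[ 1,  1,  1],
                 [ 0,  0,  0],
                 [-1, -1, -1]], dtype=float)

print(convolve2d(image, filt))
# Output is 4x4; row 2 comes out as all 3s, because there the bright bar sits
# above darker pixels, which is exactly the edge this filter is tuned to catch.
```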
Any questions? Microphone, please.

>> What happens when you reach the edge of the image and the filter doesn't perfectly fit?

Yeah, so you start from the left and keep going. At some point the right edge of the filter meets the right edge of the image, and then you stop. Now, there are some nuances here. For example, you can actually pad the whole image on its borders, so that you can go slightly outside the image and it'll still work. That's number one. The second nuance: instead of moving one step to the right every time, you can move two steps. That's called a stride. So there are a bunch of pesky details here, but I'm ignoring them because this basic default approach works amazingly well almost all the time.

Okay, so that's the mechanics of how the operation works. Now I'm going to switch to a spreadsheet which shows this really beautifully, courtesy of the fast.ai people. I'll upload the spreadsheet after class so you can see it. All they have done here, thanks to them, is create a table of numbers in Excel. Most of the numbers are zero, but some of them are more than zero: 0.8, 0.9 and so on. Basically, instead of working with numbers between 0 and 255, they divided everything by 255 to get fractions and put the fractions in the table. And then they used Excel's very cool conditional formatting to mark in red all the values that are high: the closer a number is to one, the more reddish it gets. And when you do that, the digit 3 obviously pops out. So there is a 3 in the image. Yes? Okay, good.

Now we move to our little filter here. You can see the filter, and I'm claiming it detects horizontal lines. And this table here, sorry, this table here is the result of applying that filter to the 3. Look at the top left cell: the formula is nothing more than multiply all those pairs, add them up, and run the sum through a max(0, ...), which is just the ReLU. Basic arithmetic.

So we do that, and this is the output, also conditionally formatted to show you where things are lighting up. And you can see that only the horizontal lines of the 3 are lighting up. Everyone see that? So the filter is in fact living up to the claim I made for it. Similarly, if you look at what's going on here, this is a vertical filter. Same thing: you apply it, and only the vertical line lights up.

Now, what you can do, and I would encourage you to do this after class, is look at all these numbers here and ask yourself: okay, why is that one lighting up? And you will discover that what's actually going on is that it's looking for edges.
It's looking for rows in the table where there is something nonzero in one row and zeros in the row below. By choosing the numbers carefully, the filter multiplies the bright pixels by positive numbers and the dark pixels by negative numbers, so wherever there is a transition it comes up with a positive number, and thereby it detects an edge. So what I would encourage you to do is play with this Excel sheet.

All right, here is a cell; let's trace its precedents. You can see these numbers: this is the grid being processed to come up with that big number. And in this grid, these numbers down here are a lot lower than these numbers up here, because there is an edge. That's why you can see the horizontal part of the 3. What the filter is doing is basically saying: the row I'm catching here gets the ones, the middle row gets the zeros, and the rest get the minus ones. The small values get pushed down, the big values get pushed up, and the overall contrast is emphasized. That's the basic idea of edge detection. Spend some time with the Excel sheet and it'll become clear what I'm talking about.

All right, cool. By the way, there is also a very cool little site here where you can punch in your own numbers and see what the filter detects: lots of edges and curves and this and that. It's very cool, so I encourage you to try it out.

So the key thing I want to say is: by choosing the numbers in a filter carefully and applying this operation, different features can be detected.

Now, I mentioned earlier that a convolutional layer is composed of one or more of these filters. You can think of each filter as a specialist for a particular feature. Maybe it specializes in detecting vertical lines, or horizontal lines, or semicircles, or quarter circles; you don't know. And given that modern images can be very complicated, with lots of interesting features going on, you probably want lots of these filters. But the key is that you don't have to decide up front: "Hey, you, filter, you'd better specialize in detecting vertical lines. And you over there, stay in your lane, do horizontal lines." You're not going to do that. You let the system figure out what it wants to figure out. So there is no human bottleneck in doing this.
And I mention this because there used to be a human bottleneck, before deep learning happened. More on that shortly.

Now let's make sure we understand the mechanics of what happens when you have two of these filters, not one. This is the input image as before, this is the filter we saw earlier, and this is another filter we have. The thing is, we just run them in parallel: take each filter, do the operation, come up with an output; take the other filter, do the operation, come up with its output. The first one gives you that, the second one gives you that. And this combined output is, well, it's actually not a table. What is it? Louder, please.

>> It's a tensor.

Thank you. It's a tensor. And so these two 5 by 5 matrices can be represented as a tensor of what shape? There are two right answers.

>> 5 by 5 by 2.

Correct. You can think of it either as 5 × 5 × 2 or as 2 × 5 × 5. They're both fine; which one you go with ends up being a matter of convention. So now you begin to see why we care about tensors. Imagine that instead of two filters we have 103 filters: the resulting tensor is going to be 5 × 5 × 103.
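Those two equally valid conventions, channels first versus channels last, are easy to see with a couple of dummy 5 × 5 outputs (a sketch; the stacking axis is all that differs):

```python
import numpy as np

out1 = np.random.rand(5, 5)   # output of filter 1
out2 = np.random.rand(5, 5)   # output of filter 2

print(np.stack([out1, out2], axis=0).shape)   # (2, 5, 5): "channels first"
print(np.stack([out1, out2], axis=-1).shape)  # (5, 5, 2): "channels last", the Keras default
```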
Okay, good. Now let's look at the slightly more complex situation where you have not a grayscale image with a single table of numbers, but an actual color image. We know how to apply a filter to a 2D tensor like this and get that output. But say we have something with three channels, red, green, blue, RGB: three tables of numbers, a tensor of shape 6 × 6 × 3, and you want to apply this 3 by 3 filter, the convolution operation, just like before. How is that going to work? Do we just apply it to each channel, first the red, then the green, then the blue? Should we do that? Or is there a problem with that approach?

>> The problem with that approach, I think, would be the same as what you said earlier: it would probably learn the lines the same in each channel. The location of the lines is probably the same in each channel.

Yes, the location of the line is going to be the same, because that line is the aggregation of information from the three channels. But the problem here is slightly different: if you process them independently, the network has not been informed that these things are all part of the same underlying concept. As far as it's concerned, it's just three separate things to process independently. So we need to change the filter so that it understands that at each pixel location, the three numbers under it, R, G, and B, are parts of the same underlying thing.

What we do is actually very simple. We take the filter and make it 3D. Instead of having just one 3 by 3 slice, we stack it three deep, into a little cube. And once we do that, you can imagine taking this cube and doing the same operation. Now, instead of nine numbers in the image patch and nine numbers in the filter, you have 27 numbers in the image and 27 numbers in the filter. But you still match them up, multiply them, add them up, and run the sum through a ReLU.

By the way, I tried to get ChatGPT to give me a picture like this. It completely bombed. I tried three, four, five different variations and it just gave up. Then I found this nice picture on the deeplearning.ai site and used it.

>> If you put different numbers in each of the depth slices of your convolution filter, would that be like color processing? Like, it could be doing a different thing to green and to blue.

Yeah, you will put different numbers; in fact you have 27 numbers now. But we haven't gotten to the question of where these numbers come from, so hold that thought until we get there.

Okay, any questions on this? You literally take the 2D thing and make it 3D. You give it depth, and the depth just matches the depth of the input. So if the input is, say, 10 deep, your filter is going to be 10 deep. Yes?

>> Rather than increasing the rank of the tensor by one, is there any instance where you would run an operation across the different channels to come up with an intermediary layer, and then run a lower-rank filter over that?

Yeah, there is a lot of work in the research literature that tries things like that. I'm describing the most basic approach here, and as it turns out, this basic approach is extremely powerful. Of course, researchers trying to go from 95% to 95.1% invent all sorts of crazy complicated stuff, which is all good for us and for humanity, but for practical use, this is good enough.

>> How do you convert the three layers into a single 4 by 4 output? The 4 by 4 part is understood, but what about the three layers? How do they work?

Yeah, we're coming to that. So here you have one filter.
You have one 3 × 3 × 3 filter, which plugs into this 6 × 6 × 3 input and gives you a 4 × 4 output at the end. So for one filter, this operation gets us one 4 × 4. Say you have another filter, also 3D: you do the same thing and get another 4 × 4. And if you have 10 filters, you get ten of these 4 × 4s, which get packaged up into a 4 × 4 × 10 tensor.

Remember: whether the input is 2D, 3D, or 10 deep, what comes out of a single filter is always 2D. Ultimately, when you apply the operation, you get one number at each position, so each filter always produces a plain table of numbers. But when you have lots of filters, you have lots of these 2D tables, one after the other, and therefore they get packaged up into a tensor.

All right. Textbook chapter 8.1 has a lot of detail and intuition, which I think is really good, so please read it. And folks, by the way, this convolution stuff grows in the telling. I encourage you to revisit it a few times, and it slowly becomes part of your muscle memory. Don't expect to understand all the nuances in one shot. Do it a few times and it will become wired into your head.
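You can check this shape bookkeeping directly in Keras, which we'll meet properly in the Colab (a sketch; the 6 × 6 × 3 input and 10 filters match the example above):

```python
import numpy as np
from tensorflow.keras import layers

x = np.random.rand(1, 6, 6, 3).astype("float32")  # one 6x6 RGB image (batch of 1)
conv = layers.Conv2D(10, kernel_size=3)           # 10 filters; each is 3x3x3 under
                                                  # the hood, matching the 3 channels
print(conv(x).shape)        # (1, 4, 4, 10): ten 4x4 outputs, packaged into one tensor
print(conv.count_params())  # 280 = 10 filters * (3*3*3 weights + 1 bias)
```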
Okay, so now the big question. These filters seem excellent, but how are we supposed to come up with the numbers? In fact, traditionally, these filters used to be designed by hand. Computer vision researchers would invest prodigious amounts of time, effort, and talent to figure out the right kinds of filters for specific applications. If you wanted to build an application that looks at, say, MRI images and predicts the evidence for a stroke, they would hand design the filters: what kind of features should I extract from this MRI image? They'd try lots of different values and eventually say, "Ah, I've got the perfect filter for this." That's the way it used to be done.

But as we figured out how to train deep networks with lots of parameters, with things like the ReLU activation, stochastic gradient descent, GPUs, and backprop, this big idea emerged: why don't we think of the numbers in the filter as just weights, and simply learn them from the data using backprop, just like we learn all the other weights? What's the big deal?

This simple idea, and it feels blindingly obvious in hindsight though I'm sure it was not obvious in foresight, was the key breakthrough. And it's actually possible to do this because a convolutional filter, as we have seen it, is just a neuron. The underlying arithmetic is neuronal arithmetic; it just happens to be a slightly special neuron, actually even simpler than a regular one. In the interest of time, I have a slide or two in the appendix showing exactly why it's a neuron; check it out, but for now take my word for it. And because it's a particular kind of neuron, and we know how to work with neurons, our entire machinery, layers, loss functions, gradient descent, SGD, and so on, is immediately applicable. We don't have to invent anything new to make it work.

>> Do you initialize the layers differently for different applications, say computer vision versus medical imaging, or just because the networks have different sizes?

Good question. Let's come back to it when we get to something called transfer learning, which I'll reach by about 9:30.

All right. This turned out to be a huge turning point in the computer vision field, and the massive unlock came in the year 2012. A computer vision system using this technology, called AlexNet, burst onto the world stage because it crushed the field in a competition called ImageNet. The previous best score was a 26% error rate, and this thing came in at 16%. It's the kind of result where, if you see it, you think it must be a typo. Every year the improvements in error rate had been tiny, half a percent, one percent, and then this year it was ten points. That was because of this approach.

Now, one other thing I want to cover: with every succeeding convolutional layer, any particular convolutional filter is implicitly seeing more and more of the input image. If this is the input, then in the first layer this little number here only sees, say, the top of the chimney of this house. But the next layer's input is this layer's output, so this little value here is getting information from this whole square, and every point in that square corresponds to something bigger in the original picture. So with every additional layer, you're seeing more and more of the image.
And this is a key part of why these things work: you're hierarchically building a better and better understanding of the image. That hierarchical understanding, the hierarchical learning, is a very key part of the unlock.

And if you look at what networks are visualizing, this is a visualization of what a face-detection deep network is learning, you'll see that the first layer is just learning lines and edges. The second layer is learning to put those lines and edges together into parts: look at this thing, an edge here, another edge here, that looks like three quarters of somebody's ear. And then those parts are assembled into whole faces. Can you imagine the researchers who did this work? They built the network, it's doing really well at detecting faces, and they turn around and say, okay, let's see what it's actually doing, and this picture pops up. I mean, goosebumps.

Okay, so pooling layers, the next thing. So far we've talked about convolutional layers; this is the second building block, and then we'll go back to the Colabs. Pooling layers are also called subsampling or downsampling layers. The idea is that every time a tensor comes out of a convolutional layer, we try to make it slightly smaller, because the act of making it smaller forces the network to summarize what's going on in the complicated thing coming into it. I'll describe the mechanics first.

So let's say this 4 by 4 is the output of a convolutional layer. There are two kinds of pooling, max pooling and average pooling. This one is max pooling, and the idea is really simple. In a max pooling layer there are no weights or parameters to be learned; it's just a simple arithmetic operation. We superimpose an empty 2 by 2 grid on the top left and ask: what's the biggest of these four numbers? The biggest number is 43. Boom, I stick a 43 here. Then I move my 2 by 2 to the right so it overlaps the numbers in blue: what's the biggest number here? 109. Move it down: biggest number here? 105, stick it in. Biggest number here, 35, and stick it in there. That's it. That's max pooling.

Similarly, there's average pooling: instead of taking the maximum of the four numbers, we just average them. The average of the four numbers in yellow is 32.2, the average of the blue numbers is 25.5, and you get the idea.
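Here are both pooling operations as a minimal NumPy sketch. The 4 × 4 grid is my own, chosen so that max pooling reproduces the 43, 109, 105, and 35 from the slide; the averages are just the means of the same windows (the slide's average-pooling example used a different grid):

```python
import numpy as np

def pool2x2(t, op):
    """Apply `op` (np.max or np.mean) over non-overlapping 2x2 windows."""
    h, w = t.shape
    out = np.zeros((h // 2, w // 2))
    for r in range(0, h, 2):
        for c in range(0, w, 2):
            out[r // 2, c // 2] = op(t[r:r + 2, c:c + 2])
    return out

t = np.array([[ 12,  20,  30,   0],
              [  8,  43,   2, 109],
              [105,   6,   7,   8],
              [  3,   2,  10,  35]], dtype=float)

print(pool2x2(t, np.max))    # [[ 43. 109.] [105.  35.]]
print(pool2x2(t, np.mean))   # [[ 20.75 35.25] [ 29.   15.  ]]
```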
That's it: max pooling and average pooling. Now, as you can see, when you apply pooling, the number of entries drops significantly, and the output of this layer is just fed to the next layer as usual. Nothing crazy going on. It's a way to shrink the output of one convolutional layer before it passes on to the next one: you interject a pooling layer.

Now, I actually have, even if I say so myself, a very nice handwritten explanation of the effect of pooling. Unfortunately, I can't get my iPad to show up on my laptop, so I won't be able to do it live, but I will record a walk-through and post it; check it out. The intuition I try to convey there is this: max pooling acts like an OR condition. It basically says: I have this big picture, and among the four things I'm looking at, if there's any number which is really high, that means some feature is being detected. A really high number coming out of a convolutional layer means that something, somewhere, fired up, lit up. So I'm just looking to see whether anything lit up in that part of the picture. If it did, I say: yep, something lit up. If nothing did, I say: nothing lit up. In that sense you can imagine it acting like an OR condition: anything fired up? anything fired up? Yes, okay. Otherwise, no.

And so, sadly, since I can't switch to Notability: it acts like a feature detector at a coarser scale. If you have lots of things going on in a particular picture, you want to summarize and aggregate them, so that you can step back and say: in this picture, top left, nothing lit up; top right, something lit up; bottom left, something lit up; bottom right, nothing lit up. You're operating at a higher level of abstraction. That's the effect of pooling.

>> But don't you lose spatial information?

You don't, because what you're actually saying is: the top left has this thing. You still know it is in the top left; you've just moved up a level of abstraction. For example, if there's a human eye in the top left and there's a circle detector, it's going to fire and say: hey, in the top left there is an eye. Yep, lit up. You're not looking at the pixels anymore; you're already operating at a higher level of abstraction, and that's how we get around it.
But this proceeds slowly and incrementally, which is why you have these big networks.

All right. So, just as successive convolutional layers can see more and more of the original image, the max pooling layers that follow them can detect whether a feature exists in more and more of the original input as well. By the time you get to the seventh, eighth, ninth layers and so on, the network is actually really smart: it's operating at a very high level of abstraction. You can think of it as having tagged all the features in the image at various resolutions, and it can work with that.

>> Is there a trade-off between doing pre-processing as opposed to adding additional convolutional layers? I'm thinking of turning a video into a sequence of black-and-white static images, as opposed to shoving in a color video with a ton of noise. Is there a trade-off?

There is a trade-off. If your particular data set has some very important domain knowledge that you want to encode into the network, so that the network doesn't waste its capacity learning things that you know have to be true, then yes, modify the input. But if you're not sure, then just let the network learn whatever it can, as long as it's focused on predicting as accurately as possible. Just let it be.

All right, so that's the basic idea. And again, I'm sorry the Notability thing isn't working, but take a look later to really understand how this max pooling business works. Oh, I think I skipped over this. When you have something like this, say a tensor coming out of some convolutional layer with size 224 × 224 × 64, and you apply pooling, the thing I want to point out is that the pooling works on every slice of the tensor. If the tensor is 224 × 224 × 64, it has a depth of 64, which is like saying it has 64 tables of 224 × 224, and the pooling works on every one of those tables. Which means you'll still have 64 slices at the very end; it's just that each of the 64 tables shrinks from 224 × 224 to 112 × 112. So each table shrinks due to pooling, but the number of tables does not change.

Okay. By the way, this link here has a beautiful explanation of all these things, with a bit more complexity as well, from a course taught at Stanford in 2018 or 2019, I forget. Check it out if you're curious about this stuff; it's really good.
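That per-slice behavior is a one-liner to confirm in Keras (a sketch using the shape numbers from the slide):

```python
import numpy as np
from tensorflow.keras import layers

t = np.random.rand(1, 224, 224, 64).astype("float32")  # batch of 1, 64 channels deep
print(layers.MaxPooling2D()(t).shape)
# (1, 112, 112, 64): height and width halved, depth unchanged
```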
All right. So that brings us to the architecture of a basic CNN. We have an input. We take that input and run it through a bunch of convolutional and pooling layers: there's a convolutional layer, then we pool, which is why the tensor has shrunk in size; then another convolutional layer, then we pool again, and it shrinks again; and it keeps going. So we have a series of what are called convolutional blocks. A convolutional block is typically one to two convolutional layers followed by a pooling layer, and you stack a series of these blocks.

The thing to notice is that as you go further and further into the network, the tensors get smaller and smaller in height and width because of max pooling, but they get deeper and deeper. We have figured out empirically that this pattern, reducing the height and the width while increasing the depth, tends to work really well in practice.

In fact, and apologies to the live stream that I can't use the iPad, I'm going to do this on the board. Let's say you have a picture coming in as 224 × 224, and three of them because it's a color picture: 224 × 224 × 3. Can you folks see this okay? All right. ResNet, a very famous network that we're actually going to work with in a few minutes, takes that input, gets through all this convolution and pooling business, and the final tensor it produces has shape 7 × 7, but 2,048 deep. So it has processed something that was 224 × 224 × 3 down to a much smaller height and width, just 7 × 7, while getting much deeper: 2,048 channels. That's a numerical example of what I mean by things getting smaller but deeper as you go along.

>> Is the reason it gets deeper that each layer has a single feature that is picked up, and then it gets stacked on top?

It's not so much that each layer picks up a single feature. The way I think about it is that the number of atomic features you may want to detect is probably not that large: lines, curves, gradations in color, things like that. But the number of ways you can combine these atomic features to depict real-world things is combinatorial. It's like asking: I have 10 kinds of atoms, how many molecules can I make from them? A lot. Which means you'd better give the network the ability to capture more and more of the possible things the real world can come up with. And so, as the depth increases, you have more filters, and every filter has the ability to pick up some combination of what's coming in.
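Coming back to the board example: you can verify those ResNet shapes with the stock Keras model (a sketch; I'm assuming ResNet50 here, since the lecture just says "ResNet"):

```python
from tensorflow import keras

# include_top=False drops the final dense/softmax part, leaving just the
# convolutional feature extractor; weights=None skips the pretrained download.
resnet = keras.applications.ResNet50(include_top=False, weights=None,
                                     input_shape=(224, 224, 3))
print(resnet.output_shape)   # (None, 7, 7, 2048): 224x224x3 in, 7x7x2048 out
```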
>> Sorry, a quick question related to this. Right now our model is being trained to detect certain specific features, like a line or a color. But it still doesn't attach meaning to them, right? It doesn't know whether that arc is a sun or an eye.

Yeah. So we don't tell it what to learn; it just learns. All we tell it is: make sure you minimize the loss function. Now, once it has finished learning, if it's a good network with good accuracy, then we can introspect. We can peek into the internals and try to understand what it is learning. Sometimes, as you saw in the face detection example, it's learning interesting things: basic lines and edges, then slowly more complicated shapes, and finally entire human faces. Sometimes it may not be understandable.

>> And how do you actually figure out what it's learning?

I'm going to give a reference in just a few minutes. Read that paper; it was one of the first to actually visualize what these things are learning, and it will give you an idea of how it works. I'm also happy to talk about it offline. It's a bit of a tangent, but a really rich tangent, and if I keep going I'll end up spending ten minutes on it, so I'm going to back off.

Okay. So now, once we've done all that, we are back in familiar territory. We take whatever tensor is coming out of these convolutional and pooling operations, and only now do we flatten it into a long vector. Once we've flattened it, we can connect it to some good old dense layers, like we know how to do, and then finally connect that to whatever output layer you want. In this example it's multi-class classification, classifying images by what kind of automobile it is or whatever, so it's a softmax. This is the general framework.

Any questions?

>> Can you explain again how exactly the depth increases?

Oh, the depth increases because you decide what the depth is. When you add a convolutional layer, you decide how many filters it has, and you just keep adding more and more filters the later you go in the network. So it's in your control. Remember, the number of neurons in a hidden layer is in your control; similarly, the number of filters is in your control. It's a design choice, and we design it so that the later we go, the more depth we have.
>> So you stack layers, and each of those layers has different filters applied to it?

Yeah, a layer is made up of filters, and the depth just comes from having lots and lots of filters. And you get to choose how many there are.

All right. So now let's go to the Fashion MNIST Colab that I did the video walk-through on, and actually solve it using a convolutional network.

All right, cool. At this point I'm going to zip through some of the stuff, because the preliminaries have to be done: import all the packages, set the random seed here. Great. Then we load the Fashion MNIST data set just like in yesterday's Colab, create the little class labels, and define the standard functions we've been using to plot accuracy and loss. Now we come to the convolutional part. As before, we divide by 255 to normalize everything to a zero-to-one range. Let's confirm nothing has been tampered with: yep, we have 60,000 images in the training set, each 28 by 28.

Now, convolutional layers expect the input to have an explicit channel dimension. Color images have three channels, but grayscale images have only one channel: one table of numbers. So instead of saying 28 by 28, we tell the convolutional layer to expect 28 by 28 by 1. It's the same thing conceptually, but that's the format it expects. So we call a function called expand_dims, telling it to expand the dimensions, and once we do, you can see it's still 60,000 images, but each is now 28 by 28 by 1 instead of 28 by 28. Same thing.

Okay, now let's define our very first CNN. As before, the input is just keras.Input, no difference here, and we tell it the shape, which is of course 28 by 28 by 1. Then we come to the first convolutional block. And this is the key new thing: to tell Keras to use a convolutional layer, you use layers.Conv2D. From the name you can probably also figure out that there's a Conv1D and a Conv3D and so on, which you should explore, it's really good stuff, but for image processing, Conv2D is all you need. Now we tell it how many filters we want; I've decided on 32 filters. We also have to decide the size of each filter. The simplest size is 2 by 2, so I'm just going to go with that: kernel size 2 by 2. And the activation is of course ReLU.
I give the layer a name, convolution one, and then I feed it the input. Then I follow it up with a little pooling layer using MaxPooling2D. With MaxPooling2D you just literally pass in the input and get the output back; it shrinks everything using pooling. That's the first convolutional block.

And you know what? I know how to cut and paste. Boom, cut and paste, and I get the second convolutional block. Now, in lecture I just mentioned that as you go deeper, you give the network more depth, but this is a starting point and a simple problem, which is why in the second convolutional block I'm still using only 32 filters. You could totally go to 64, for instance, to make it deeper.

Once I've done that, I finally flatten everything into a long vector, connect it to one dense layer of 256 neurons, and then come to the softmax with 10 outputs: 10 categories of clothing. And then I tell Keras: take this input and this output, string them together, and define a model for me. So that's it. That's a convolutional network. The only new concepts we're seeing here are Conv2D for the convolutional layer and MaxPooling2D for the max pooling layer. Let me run this. It runs. Okay, good.

>> How do you decide when to flatten? And would there ever be a situation where we just use the method we used before and not use a CNN?

Well, we already tried that with Fashion MNIST: we didn't use a CNN, we just flattened right away, and it worked. It wasn't bad, but we're asking: can we do better than the 85 or 88 percent we got? When you're working with images, it's typically a good idea to start with a CNN right out of the gate, because you're not giving anything up. As for how many layers you should have, my philosophy is: start simple, and if it works, stop working on it. If it doesn't, add more layers.

>> Is the architecture design, the number of filters, kernel size, number of convolution and pooling layers, all based on trial and error?

Typically it's based on trial and error, yes. But as you will see in the transfer learning discussion we're going to have soon, instead of doing anything from scratch, it's actually much better to download a pre-trained model and adapt it to your particular problem. That is the norm for how people do these things. The reason I'm doing it from scratch is that you should know how it was done. It should not be a black box to you. That's my goal.
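Putting that cell together, here is roughly what the Colab code looks like, reconstructed from the walk-through (the variable names are mine; the 32 filters, 2 by 2 kernels, two blocks, 256-neuron dense layer, and 10-way softmax are the choices stated in lecture, and the ReLU on the dense layer is my assumption):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Load Fashion MNIST and normalize to the 0-1 range.
(x_train, y_train), (x_test, y_test) = keras.datasets.fashion_mnist.load_data()
x_train = x_train / 255.0
x_test = x_test / 255.0

# Give the grayscale images an explicit channel dimension: 28x28 -> 28x28x1.
x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)

inputs = keras.Input(shape=(28, 28, 1))

# First convolutional block: 32 filters of size 2x2, ReLU, then max pooling.
x = layers.Conv2D(32, kernel_size=(2, 2), activation="relu",
                  name="convolution_1")(inputs)
x = layers.MaxPooling2D()(x)

# Second convolutional block: same depth (32) for this simple starting point.
x = layers.Conv2D(32, kernel_size=(2, 2), activation="relu",
                  name="convolution_2")(x)
x = layers.MaxPooling2D()(x)

# Flatten to a long vector, one dense layer (ReLU assumed), then 10-way softmax.
x = layers.Flatten()(x)
x = layers.Dense(256, activation="relu")(x)
outputs = layers.Dense(10, activation="softmax")(x)

model = keras.Model(inputs, outputs)
model.summary()   # with these choices: 302,026 parameters, the ~302K from lecture
```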
[50:03] Yes? [50:05] >> Just from a notation perspective, I noticed you named all of these layers X. Is that a habit we should get into, naming them all the same, or is that just a... [50:12] Actually, I'm not naming the layers X. What's going on here is that I'm feeding each layer X, [50:19] and whatever comes out of it, I'm calling X again. [50:22] That's all. It's just a notational convenience: I call both the input and the output X, and Keras under the hood will track everything and make sure the right thing happens. Otherwise, I'd have to write X1, X2, X3, X4, and then if I wanted to add a new layer somewhere in the middle, between X3 and X4, I'd have to call it X4 and renumber everything to 5, 6, 7. A complete pain in the neck. That's why I do this. [50:42] All right. So, model.summary: [50:46] it has got 302 thousand parameters. I'll just plot it. [50:53] Great. And I encourage you to hand-calculate that later on and make sure the numbers tally, okay? [51:00] For now, let's just go. As before, we'll use the same compilation: [51:06] we'll use Adam, and then we'll train it for just 10 epochs, with a validation split, as usual, of 20% (the compile-and-train cells are sketched below). So, let's just run it. [51:15] It's actually going to run, and as you will see, [51:18] with convolutional networks there's a lot more going on, so it's going to be a bit slower. Hopefully not too much slower. [51:25] While it's running, other questions? [51:31] >> If we have a task other than image classification, say segmentation, do we still flatten like this first? [51:37] So, this setup is for image classification. For other kinds of applications, [51:42] you typically still run the input through a bunch of convolutional layers and so forth, [51:46] but the output side of the equation gets much more complicated. Because instead of classifying the whole picture into, you know, dog or cat, if you have to classify every pixel, [52:01] you had better have an output whose shape has the same dimensions as the input. [52:06] For that we use a different architecture, called U-Net, [52:09] and so on, which unfortunately I won't be able to get into. But I am planning to post another video walk-through where I show you how to use the Hugging Face Hub [52:19] to very quickly build models for those other applications, like segmentation. I'm hoping to post that tomorrow. [52:26] It's optional viewing that might help with that. [52:29] Okay. So, is it done? Okay, good. It's done. All right, let's plot the [52:35] thing here. [52:36] All right, so it seems like the training loss is going down nice and steadily. Validation is sort of flattening out somewhere around the eighth epoch. Let's look at the accuracy. [52:47] Same situation here. The accuracy is in the 90s. The final question, of course, is how it does on the test set. [52:55] Whoa, 90.5%. [52:58] Pretty good.
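Here are the compile-and-train cells as a sketch. Adam, the 10 epochs, and the 20% validation split are from the lecture; the array names (x_train and friends) are placeholders for the Fashion MNIST splits, integer labels are assumed (hence sparse categorical cross-entropy), and batch size 64 matches the question that follows.

    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    history = model.fit(x_train, y_train,
                        epochs=10,              # ten passes over the training data
                        batch_size=64,
                        validation_split=0.2)   # hold out 20% for validation

    test_loss, test_acc = model.evaluate(x_test, y_test)   # how it does on the test set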
[52:59] By the way, if you're not impressed that we went from 88 to 90: [53:04] these applications are the proverbial diminishing-returns problems, okay? What you should always do is look at the amount of error that's left and ask yourself how much of that error you were able to remove. [53:16] We had roughly 12% error left when we did the simple Colab yesterday. From that 12%, we have knocked off two points to get to over 90, which is amazing. [53:27] Okay? [53:28] And in fact, I think the state of the art on this data set [53:32] is 97%. [53:34] So, I invite you [53:36] to take this thing and try different filters and so on, and see if you can get to the mid-90s. [53:42] It's not easy, but try it. Yeah? [53:45] >> Does the number of epochs have to be related to the number of batches? Because you used batches of 64 and 10 epochs. [53:50] No, the number of epochs is independent; [53:55] the epochs are just the number of passes through the whole data set. [53:58] But within each pass, within each epoch, the batch size tells you how many batches you're going to process. [54:05] It is basically the number of examples in your training data divided by the batch size you have chosen, [54:11] and that number, rounded up, is the number of batches within each epoch. For Fashion MNIST at batch size 64, that's 60,000 / 64 = 937.5, so 938 batches per epoch (or exactly 750 once the 20% validation split is held out, since 48,000 / 64 = 750). [54:16] And here I'm just choosing 10 epochs because... [54:20] oh, Siri found something on the web. Okay. [54:23] I chose 10 because it's fast enough for me to do in class, and 10 is actually more than enough, because you can see it's already beginning to overfit. [54:31] Yeah? [54:33] >> This is more of a conceptual question, but is it always the case that a neural network will have better accuracy than a classical machine learning algorithm? I'm asking more about cases like the heart disease problem. [54:45] Oh yeah, great question. Neural networks are really good for unstructured data, like what we're working with here. But if you have structured data, like the heart disease problem, sometimes a neural network works really well, and sometimes [54:57] things like gradient boosting, XGBoost, work really well. So, if I'm working on a structured-data problem, I'll try both. [55:04] I'm not going to axiomatically assume that the DNN is going to be the best thing. But if you have unstructured data, it's the best game in town. [55:11] All right. By the way, I have a whole section here on how, once you've built a model, you actually improve it. [55:17] Check it out; it's an optional thing. [55:20] All right, I'm going to stop this here. [55:22] So, the next thing I want to do: [55:25] we went from 88 to 90-plus percent using convolutional networks. Now let's work with color images. Let's kick it up a notch. [55:33] So, I actually [55:36] web-scraped [55:38] all these pictures for you folks, for your enjoyment: about 200 color images of handbags and shoes, roughly 100 handbags and 100 shoes.
So, the question is: with these [55:48] essentially 200 images, [55:51] can we build a really good neural network to classify handbags and shoes? [55:54] It seems kind of absurd, right? Two hundred examples is not that much; it doesn't feel like a lot. Fashion MNIST has 60,000 images, [56:04] and even with that, we were overfitting within five to eight epochs. [56:09] With 200 images, is there any hope? Obviously there is hope, otherwise it wouldn't be in the lecture. So, we're going to take this data set and see what we can do with it. We'll first build a convolutional network from scratch to solve the problem. Okay? [56:22] All right. [56:24] I'm going to run through the code, because at the end of it we'll have a live demo. So, I would like one volunteer to give me a handbag and one volunteer to give me their footwear. [56:37] Okay. So, unlike the previous data set, this one I just web-scraped, and I've stuck it in this Dropbox folder. [56:47] Let's download it and unzip it. Once we do that, we have to organize these 200 images, so [56:54] I have to do some boring-ish Python stuff here. [57:00] What this code is doing is splitting the roughly 100 handbags and 100 shoes into train, validation, and test, and then, for each of those splits, creating a handbags folder and a shoes folder. Okay? Once we run it, this directory structure is created: [57:20] a training folder, a validation folder, and a test folder, each containing handbags and shoes. In fact, I think you can see it here. [57:27] See: handbags and shoes, and within that there's train, test, and validation, and within each of those, handbags and shoes. The idea is that when you're working with images, you can just create a folder for each kind of image (say, two folders, one with cat images and one with dog images) and then point Keras at it. [57:46] It will automatically figure out that those are the labels. [57:49] It makes things easy for you, so it's very convenient when you're working with images. [57:52] And the book explains this in great detail. [57:55] All right. When working with these color images, we'll follow this process: we read in the JPEGs and convert them to tensors, and then, since I web-scraped them, they all come in different shapes and sizes, so I need to bring them all to the same size. [58:06] I resize them, and then I batch them, using a batch size of 32 here. [58:13] And this utility from Keras will do all of that for you, right? Very quickly; see the sketch below.
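Here is a sketch of that loading step. The utility is keras.utils.image_dataset_from_directory; the "data/..." paths are placeholders for wherever the unzipped Dropbox folder ends up, and label_mode="binary" is an assumption that suits the sigmoid model coming up.

    from tensorflow import keras

    # Reads the JPEGs, decodes them into tensors, resizes everything to one
    # shape, batches them, and infers the labels from the folder names.
    train_ds = keras.utils.image_dataset_from_directory(
        "data/train",
        label_mode="binary",       # 0/1 labels: handbags vs. shoes
        image_size=(224, 224),     # ResNet will expect 224 x 224 x 3 later
        batch_size=32)

    val_ds = keras.utils.image_dataset_from_directory(
        "data/validation", label_mode="binary",
        image_size=(224, 224), batch_size=32)

    test_ds = keras.utils.image_dataset_from_directory(
        "data/test", label_mode="binary",
        image_size=(224, 224), batch_size=32)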
[58:19] So, basically what it says is that it found 98 images in the training data belonging to two classes, 49 in the validation set, and 38 in the test set. [58:28] So, fewer than 100 examples in the training set; that's what we have here. [58:31] All right. What's the time? 9:30. Okay. Now let's check the dimensions to make sure. Good: 224 by 224 by 3. And why did I pick 224 by 224? As you will see later, we're going to use something called ResNet, and ResNet expects its input to be 224 by 224 by 3. That's why I resized everything to 224 by 224. [58:49] Let's look at a few examples of my wonderful web scraping in action. [59:01] It's pretty wild, right? [59:02] Okay. Now let's build a simple convolutional network. [59:07] Before, we took all the X values in Fashion MNIST and divided them manually by 255 to normalize them to the range 0 to 1. Well, you know what? We are graduating to the higher levels of Keras now, so let's not do that; manual stuff is bad. We'll do it within Keras, using something called the rescaling layer: we just tell it how much to rescale, and boom, it does it for us. Then the first convolution block, just like for Fashion MNIST, with 32 filters; a second block, again 32; max pool; flatten. And since it's just handbags versus shoes, a sigmoid is enough, right? It's a binary classification problem, so I'm using a single output neuron with a sigmoid, and that's our model (sketched below). So, let's build the model. [59:43] All right, model summary: [59:48] 101,000 parameters in this little model. Okay, let's compile it and run it. Note that because it's a binary [59:57] classification problem, I'm using binary cross-entropy, [01:00:02] the same Adam optimizer, [01:00:03] and accuracy as the metric. Compile, and then boom, let's run it. We'll run it for 20 epochs. [01:00:08] Hopefully. [01:00:12] Okay, while it's doing this business, I'm going to shift to the PowerPoint. [01:00:17] We'll come back to see how well it did. But whatever it does, we built it from scratch, so the question is: can we do better than that? Because we only have about 100 examples of each class. Which brings us to something very cool and very powerful called transfer learning. [01:00:31] The key thing is that there are two research trends going on that we can take advantage of. The first is that researchers have designed architectures which exploit the kind of input you have. Olivia asked the question: if you have a particular kind of input, images, do you change the input, or do you change the network? As it turns out, if it's images, we know we should use convolutional layers, because convolutional layers were designed to exploit the image-ness of the input. [01:00:57] Okay?
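Here is the promised sketch of the from-scratch handbags-versus-shoes model, using the datasets loaded earlier. The rescaling layer, the two 32-filter blocks, the single sigmoid output, binary cross-entropy, Adam, and the 20 epochs are from the lecture; the 3x3 kernels are again an assumption.

    from tensorflow import keras
    from tensorflow.keras import layers

    inputs = keras.Input(shape=(224, 224, 3))
    x = layers.Rescaling(1.0 / 255)(inputs)             # replaces the manual divide-by-255
    x = layers.Conv2D(32, 3, activation="relu")(x)      # first convolutional block
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(32, 3, activation="relu")(x)      # second block, same depth
    x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)  # one output: handbag or shoe

    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",           # binary problem, binary loss
                  metrics=["accuracy"])

    history = model.fit(train_ds, validation_data=val_ds, epochs=20)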
Back to the two trends: similarly, if you have sequences [01:00:59] of information (natural language, obviously, but also audio, video, gene sequences, and so forth), these things called transformers were invented [01:01:07] to exploit them, and we're going to spend a lot of time on transformers starting next week. So, that's the first trend. The second trend is that researchers have used these innovations [01:01:15] to create and train models on vast data sets, and thankfully, they've made them publicly available for us to use. [01:01:23] So, transfer learning is the idea that if you have a particular problem, you take a pre-trained network somebody has already created and customize it to your problem, rather than building anything from scratch. [01:01:37] Okay, that's the basic idea. [01:01:39] So, here we have to build a classifier which takes in an arbitrary image and figures out whether it's a handbag or a shoe, right? That's our goal. [01:01:47] Now, handbags and shoes are everyday objects, so you can look around and see whether there are networks, trained by other people, that have been trained on everyday images, [01:02:00] as opposed to, say, MRIs or X-rays; specialized images versus everyday images. Of course, the first thing you should probably do is check whether anybody has built the specific thing you want, a handbag-shoe classifier, on GitHub. Assuming not, then you do transfer learning. Okay? [01:02:12] Now, it turns out [01:02:15] there's this thing called ImageNet, [01:02:17] which is a database of millions of images of everyday objects in a thousand different categories: furniture, animals, automobiles, you get the idea. [01:02:26] So, we can look for networks that have been trained on ImageNet. [01:02:31] Okay, let me just go back to the Colab to make sure it doesn't time out. [01:02:37] All right, it has finished. [01:02:40] Let's plot these things. [01:02:48] Okay, so [01:02:49] there is some overfitting that happens around here, [01:02:52] around the tenth epoch. Let's look at the accuracy. [01:02:59] The training accuracy is getting almost to 100%. But we're not interested in training accuracy, right? We care about validation and test accuracy, and that seems to be hovering somewhere in the 80s. So, let's evaluate it anyway and see what happens. [01:03:15] Okay, so it gets to 87% accuracy [01:03:19] on this data set. [01:03:20] That's actually pretty good, given that we only have about 100 examples per class. So, 87% accuracy, and we trained the whole thing... sorry, we did everything from scratch. Okay? Now, [01:03:31] there's this whole section about data augmentation, which... you know what? Do we have time?
[01:03:40] So, the idea of augmentation is that when you have an image, [01:03:44] let's say you take this image and rotate it slightly, by 10 degrees: [01:03:49] if it was a handbag before you rotated it, it sure as hell is a handbag after you rotated it. [01:03:55] The meaning of the image doesn't change just because you rotated it slightly. Or maybe you zoom in slightly, zoom out slightly, crop it slightly; nothing happens. [01:04:03] So, what you can do is take any image you have, perturb it slightly, [01:04:08] like right there, and add it as a new example to your training data. [01:04:14] This is an unbelievable free lunch, frankly. [01:04:16] And the same kinds of techniques actually work for text too, which we'll cover later on. [01:04:22] This broad area is called data augmentation. [01:04:26] It's a great way, when you don't have a lot of data, to artificially bolster the amount of data you have. [01:04:32] And of course, Keras makes all of this very easy: it has already predefined a whole bunch of data augmentation layers for you. So, here's a little example [01:04:43] where I take a picture and randomly flip it; if it looks like this, I flip it that way, horizontally. Then I randomly rotate it by a factor of 0.1 (I forget whether that's degrees or radians; you can look it up in the documentation, and see the note below). And then a random zoom, [01:04:57] zooming in and out a little bit. [01:05:00] But it won't perturb every picture the same way; the perturbations are applied randomly, [01:05:04] so different pictures get perturbed in different ways. That's how you make sure there's enough diversity in the pictures. [01:05:10] Once you've set that up, [01:05:12] you can take a picture and see what it does. [01:05:15] I just grab a random picture, so it keeps changing every time. [01:05:21] Yeah, look at this handbag: [01:05:22] slightly rotated this way, rotated that way, [01:05:26] some more, maybe a little bit of zooming going on, and so on. You get the idea, right? And there's a whole list of these transformations you can apply. But when you apply them, make sure [01:05:35] that what you're doing doesn't change the underlying meaning of the picture. [01:05:39] That's really important. [01:05:41] For example, if you're working with satellite data, [01:05:45] be very careful not to do crazy flips. [01:05:49] And even with everyday images: horizontal flips are okay, but don't do vertical flips. [01:05:54] How many times will you need to classify an upside-down dog picture? [01:05:59] Make sure your augmentation doesn't go nuts. [01:06:02] All right. [01:06:05] Once you've settled on your augmentations, you can insert the data augmentation layers into your model right there, right after the input, and the rest of the model can stay unchanged, as in the sketch below.
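Here is that sketch. The three layers and the 0.1 factors are from the lecture; on the degrees-versus-radians question, Keras's RandomRotation factor is actually a fraction of a full circle, so 0.1 means rotations of up to roughly plus or minus 36 degrees.

    from tensorflow import keras
    from tensorflow.keras import layers

    data_augmentation = keras.Sequential([
        layers.RandomFlip("horizontal"),   # horizontal only; no upside-down dogs
        layers.RandomRotation(0.1),        # up to +/- 10% of a full turn
        layers.RandomZoom(0.1),            # zoom in or out by up to 10%
    ])

    # Inserted right after the input; everything downstream stays unchanged.
    # These layers only perturb images during training and pass them through
    # untouched at inference time.
    inputs = keras.Input(shape=(224, 224, 3))
    x = data_augmentation(inputs)
    x = layers.Rescaling(1.0 / 255)(x)
    x = layers.Conv2D(32, 3, activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(32, 3, activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)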
[01:06:14] So, this is a great way to increase the size of your training data, and here is the augmented model; I invite you to play with it and train it. In the interest of time, we won't train this model in class, but it's in the Colab, so you can just try it. It also figures prominently in homework one, by the way, data augmentation, so you'll get more experience with it. Okay. Back to the PowerPoint. [01:06:34] So, this is what we have. Now, any network that has been trained on this ImageNet thing turns out to learn all kinds of interesting features in every one of its layers. Here, this is the first layer, and you can see it's picking up gradations of color, sort of line-ish behavior. Layer two [01:06:52] is picking up... hey, look, it's picking up an edge. Can you see that edge? [01:06:56] Right? Like that. [01:06:59] And then layer three is picking up these interesting honeycomb shapes, and so on. Oh, and this one is already picking up the shape of a human torso. [01:07:12] Yeah, and this layer is picking up what looks like a Labrador retriever. [01:07:17] Isn't that cute? [01:07:19] Come on, even if you're not a dog person. [01:07:22] All right. This is the visualization I was referring to earlier, [01:07:26] for figuring out what these networks are actually learning. [01:07:30] This paper was one of the first to visualize what's going on inside, so if you're curious how these pictures are produced, I encourage you to check it out. [01:07:38] Okay, yep? [01:07:40] >> So, we spoke about images, and you referred to classes, and to text next week with transformers. But what about, say, an email, which has both text and images, and maybe white space, depending on who has written it? Does that get put in as an input as an image, or... [01:08:01] We'll revisit this great question a bit later in the course. [01:08:04] The answer is a bit complicated, and I want to do it justice, so we'll come back to it. [01:08:09] All right. So, it turns out this thing called ResNet is a family of networks that were trained on the ImageNet data set, and they did really well in the competition associated with ImageNet (the ImageNet Large Scale Visual Recognition Challenge). [01:08:21] This is an example of such a network. We would expect the weights and parameters of ResNet, given that it's been trained on ImageNet, to have some knowledge about lines and shapes and curves and things like that. So, maybe we can just use that, right?
So, the idea is this: [01:08:37] we can't use ResNet as is, because remember, it was trained to classify an incoming image into a thousand possibilities. [01:08:44] Here we have only two possibilities, handbags and shoes. So, what we do is very simple and elegant: we do a little bit of surgery. [01:08:51] We take ResNet and stop just before the final layer. Take my word for it: this thing here says "fully connected, thousand," [01:09:01] because it's a thousand-way classifier, right? A thousand objects. So, we take everything up to, but not including, that last layer. [01:09:08] And what comes out at that point, hopefully, is a very smart representation of the images it has been trained on. [01:09:14] So, we can think of this sort of headless ResNet [01:09:19] as our model. [01:09:21] We can take all our data and run it through ResNet up to, but not including, the last layer. [01:09:28] You get some tensor, and that tensor probably carries a very rich understanding of what's going on in the image: all the objects and features and things like that. We can think of it as a smart representation of the input. Then we simply connect it to a little hidden layer, and then a little sigmoid that says handbag or shoe, and we train just that small network. [01:09:50] Okay? And since the inputs to the hidden layer are no longer raw images but this much higher level of abstraction that ResNet has learned, hopefully it can get the job done with hardly any examples. [01:10:02] Okay? And you can get fancier. That's the basic idea, but you can get much fancier: you can connect headless ResNet directly to our little network, the hidden layer and the final sigmoid, and train the whole thing [01:10:12] end to end. But when you do that, you must start the training from the weights you downloaded with ResNet, because those weights are the crown jewel that's been learned; you want to start from there. [01:10:23] You will do this in homework one; a sketch of the end-to-end setup follows below. [01:10:26] Okay? By the way, these pre-trained models are available all over the internet. There's the TensorFlow Hub, the PyTorch Hub, and then there's the Hugging Face Hub. When I checked it yesterday, the 13th, it had over half a million models available for download. Half a million. [01:10:41] I think last year, when I taught the course, it was around 50,000. So, yes? [01:10:46] >> I was just wondering, doesn't this make your neural network susceptible to adversarial attacks, because the weights have been pre-trained? [01:10:53] Yes, there is some adversarial risk. I'm happy to talk about it offline. [01:10:59] All right. So that's what we have: this is ResNet.
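Here is the promised sketch of the end-to-end version: headless ResNet plus the little head, trained as one network, starting from the downloaded ImageNet weights. The small learning rate and the ResNet-specific preprocessing call are assumptions (reasonable defaults for fine-tuning, not necessarily what homework one will specify).

    from tensorflow import keras
    from tensorflow.keras import layers

    # Headless ResNet50, initialized with the downloaded ImageNet weights.
    base = keras.applications.ResNet50(include_top=False,
                                       weights="imagenet",
                                       input_shape=(224, 224, 3))

    inputs = keras.Input(shape=(224, 224, 3))
    x = keras.applications.resnet50.preprocess_input(inputs)  # ResNet's own preprocessing
    x = base(x)                                   # rich features, 7 x 7 x 2048
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)   # the little hidden layer
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)

    # Train everything together; a small learning rate keeps the pre-trained
    # weights from being wrecked. A common refinement: freeze the base first
    # (base.trainable = False), train the head, then unfreeze and fine-tune.
    model.compile(optimizer=keras.optimizers.Adam(1e-5),
                  loss="binary_crossentropy",
                  metrics=["accuracy"])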
[01:11:03] Back to the Colab. So, ResNet is all packaged up and available for download, and we download it here. [01:11:13] You see here that I'm saying include_top equals false. [01:11:19] Basically, you're telling Keras: the top, the very final layer of the network, don't give it to me; give me everything up to, but not including, that. I think of the network left to right; other people think of it bottom to top, hence "the very top layer, don't give it to me." You tell Keras this so you don't have to go in and remove the layer manually. [01:11:37] Okay? And I'm not going to summarize... well, I'll just summarize some of it, to show you how big it is. [01:11:44] Okay: [01:11:45] 23 million parameters. [01:11:48] That's ResNet. And I won't plot it, because I'd be scrolling for five minutes. [01:11:53] So, let's do this now. We're going to run all the data through this thing, and whatever comes out at that penultimate point, I'm going to grab it and store it. That's what this cell does. [01:12:04] And now we create a handy little function to do all of that. [01:12:09] Once I've done that, [01:12:11] every image has been sent through ResNet up to, but not including, the final layer, and whatever would have gone into that final layer, we store. Then we create a network where we feed only that stored information into a simple classifier. [01:12:23] Okay? [01:12:24] So, what is coming out of ResNet? You can see here: 98 examples in the training data, and each example is now a 7 by 7 by 2048 tensor. [01:12:33] That's what came out of ResNet; that's what I did there. [01:12:37] All right. Now let's create our actual model. We have our input, which is just a 7 by 7 by 2048 tensor. [01:12:46] We flatten it immediately, [01:12:48] then we run it through a dense layer with 256 ReLU neurons, and then we use dropout, which I haven't talked about yet; I'll cover it early next week, so don't worry about this detail for the moment. [01:13:00] And then we run it through a sigmoid. [01:13:03] Okay? And that's our model. Finished. Plot the model; this is what we have. Model summary. [01:13:13] There it is. All right, good. Now let's actually train this thing; the whole pipeline is sketched below.
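Here is a sketch of that pipeline: download headless ResNet, push every image through it once, store the 7-by-7-by-2048 outputs, and train a tiny classifier on the stored features. The helper function and the 0.5 dropout rate are assumptions; the dataset names are the ones loaded earlier.

    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    base = keras.applications.ResNet50(include_top=False, weights="imagenet")

    def extract_features(dataset):
        # Run each batch through headless ResNet once and keep what comes out.
        feats, labels = [], []
        for images, y in dataset:
            x = keras.applications.resnet50.preprocess_input(images)
            feats.append(base.predict(x, verbose=0))
            labels.append(y.numpy())
        return np.concatenate(feats), np.concatenate(labels)

    train_x, train_y = extract_features(train_ds)   # each example: 7 x 7 x 2048
    val_x, val_y = extract_features(val_ds)

    # The little network that trains on the stored features.
    inputs = keras.Input(shape=(7, 7, 2048))
    x = layers.Flatten()(inputs)
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dropout(0.5)(x)         # dropout: details coming early next week
    outputs = layers.Dense(1, activation="sigmoid")(x)
    head = keras.Model(inputs, outputs)

    head.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    history = head.fit(train_x, train_y, epochs=10, validation_data=(val_x, val_y))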
[01:13:18] I'm just going to run it for 10 epochs, because I tried running it previously and it seems to do a fine job in just one epoch. Okay, it's already done. It's so fast because we ran everything through this monster ResNet once and used the stored outputs as our starting point; we don't have to run the big network every single time. And you can see here that the accuracy is [01:13:40] quite high. [01:13:44] Wow, interesting: in the tenth epoch, something bad happened. [01:13:48] So maybe I should have stopped at the ninth epoch. I didn't see this yesterday when I was running it. So much for reproducibility with all this randomness. [01:13:55] Let's just run this. Oh wow, look: on the test set it's achieving 100% accuracy. [01:14:02] Unbelievable. Okay folks, now for the moment of truth. I have a little code snippet here to capture pictures from the webcam. [01:14:10] Because that last epoch went down, I'm a little worried the demo is going to flunk. [01:14:14] But you know what? We all have to live dangerously. [01:14:18] So, here's a little function to predict what's in the captured picture. [01:14:21] I tried it at home yesterday, by the way, [01:14:24] and it was like, "Yay, it's a handbag." [01:14:27] Okay. Now let's do something different. [01:14:30] Any volunteers? [01:14:32] I want a piece of footwear [01:14:34] or a handbag. [01:14:37] That's more like a backpack, right? [01:14:39] I don't know; it feels like an adversarial example, but let's try it. [01:14:45] No disrespect. Let me go with the shoe first; I have a better chance of it working. [01:14:51] It's a pretty big shoe. If it can't get this shoe, I'm worried about this model. [01:14:55] All right. [01:15:05] Okay, hold on. Hold on. [01:15:07] All right. [01:15:10] Please don't get distracted by my hand. [01:15:14] Capture. [01:15:16] It's a shoe! Look at that. [01:15:21] Phew. All right, thanks. [01:15:25] Okay, now let's try the other one. I'm feeling kind of brave now. [01:15:28] Thank you. All right, let's do this. [01:15:34] Camera capture. [01:15:40] Okay. [01:15:44] Let me show its better side. [01:15:54] It's a handbag! Look at that. [01:15:59] I swear, every time I do this demo I age a few years. [01:16:03] All right folks, I'm done. Thank you.