[00:21] Okay. So, let's get going. Today we're [00:24] going to talk about how do you actually [00:26] train a neural network, right? Because [00:28] that is sort of the heart of the game [00:30] here. Um so, just to recap, we looked [00:33] last class [00:34] at what it takes to design a neural [00:36] network, and we made this very important [00:38] distinction between the things that you [00:40] are handed by your problem and the [00:42] things that you have agency over, that [00:44] you have control over. And we noticed [00:46] that, you know, the input layer for your [00:49] problem, the input is the input. Uh the [00:51] output is the output. You got to do [00:53] something with the output, something [00:54] that's expected. But everything that [00:56] happens in the middle is actually in [00:58] your hands. And in particular, we [01:00] noticed that we have to decide how many [01:03] hidden layers we want. We have to decide [01:05] in each layer how many neurons to have. [01:08] And then we had to decide what uh [01:11] activation to use. Even though I'm kind [01:13] of cheating when I say that because I [01:14] told you very clearly on Monday that for [01:17] the hidden layer activation, just go [01:18] with the ReLU activation function. You [01:20] don't have to think deep thoughts about [01:22] this, okay? [01:23] But the other things are all choices you [01:24] have to make, and we will talk a bit [01:26] later about how do you actually make [01:28] those choices. [01:29] Okay. Now, the rule of thumb, [01:32] right? The rule of thumb always is to [01:34] start with the simplest network you can [01:36] think of. [01:37] And if it's if it gets the job done, [01:39] stop working on it. [01:41] If it's not good enough, make it [01:42] slightly more complicated. Okay? So, [01:45] that's sort of the, you know, like the [01:46] meta thing you have to remember always [01:48] when you're designing these things. [01:49] Okay. So, that's sort of, you know, what [01:52] it takes to design a deep neural [01:53] network. So, what we will do in this [01:55] class is we'll actually take a real [01:57] example with real data, and then we [01:59] we'll think through how we would design [02:01] a network to solve this problem. [02:03] And while doing so, we will cover a [02:05] whole bunch of conceptual foundations [02:07] such as optimization, loss functions, [02:09] gradient descent, and all that good [02:11] stuff. [02:12] Okay? [02:12] All right. So, the the case study or the [02:16] scenario here is we have a data set of [02:18] patients uh made available by the [02:20] Cleveland Clinic. And essentially, we [02:23] have a bunch of patients, and for all [02:25] these patients, the setting is that they [02:27] have come into the Cleveland Clinic, and [02:29] they have not come in with a heart [02:31] problem. They have come in for something [02:32] else. Maybe they just came in for a [02:33] physical. And we measured a whole bunch [02:36] of things about them, okay? And the [02:38] kinds of things we measured are, you [02:40] know, demographic information, like [02:41] what's their age, uh gender, whether [02:44] they have any chest pain at all when [02:45] they came in, blood pressure, [02:47] cholesterol, sugar, so on and so forth. [02:50] Right? You get the idea? Demographic [02:52] information and a bunch of biomarker [02:53] information. 
And then, [02:56] what the Cleveland Clinic uh did was [02:59] they actually tracked these people [03:01] and figured out in the next year, [03:04] did they get diagnosed with heart [03:05] disease or not? [03:07] Okay, in the next year. [03:09] Which means that maybe you can build a [03:10] model when someone comes in, even though [03:12] they didn't come in for a chest problem, [03:15] maybe you can predict that something's [03:16] going to happen to them in the next [03:17] year, right? It's a nice sort of classic [03:20] machine learning setup. [03:23] All right. So, this is the thing. So, [03:24] what we want to do is we can totally [03:26] solve this problem using decision trees, [03:28] neural network I mean, sorry, random [03:29] forests and gradient boosting and all [03:31] that good stuff you folks have already [03:33] learned from machine learning. [03:35] But we will try to solve it using neural [03:36] networks, okay? Um this is an example, [03:38] of course, of what's called structured [03:40] data because this is all data sitting in [03:41] the columns of a spreadsheet, right? Uh [03:43] so, working with structured data is the [03:46] way we warm up our knowledge of neural [03:48] networks. And then we will do things [03:50] like working with unstructured data [03:51] starting next week with images and then [03:53] later on with text and so on and so [03:55] forth. Okay, any questions on this? [04:00] Okay. Uh yes. Uh just connected even to [04:03] last time's class where we took uh the [04:05] same example and first it was a logistic [04:07] and then we did a neural network. So, [04:10] the probability in case of one was 0.85, [04:12] then was 0.22, and here as well, how do [04:14] you know when to uh [04:16] use what? Usually in textbooks, you know [04:19] when to use logistic or when to use uh [04:21] something else, but in this case, [04:24] uh [04:25] when do I complicate it to neural [04:27] networks visa-vis in this case maybe [04:29] just doing a random It's a great [04:30] question. Uh when do you use what? So, I [04:33] think there are two broad dimensions [04:34] that you have to think about. One broad [04:35] dimension is [04:37] uh how important is it that you need to [04:39] explain or interpret what's going on [04:41] inside the model to perhaps a [04:43] non-technical consumer. [04:46] The other dimension is how important is [04:48] sheer predictive accuracy. [04:50] In some situations, predictive accuracy [04:52] trumps everything else. In which case, [04:54] just go with it. In other cases, [04:56] explainability becomes a big deal [04:57] because if they can't understand, they [04:59] won't use it. [05:00] And those cases, it's probably better to [05:02] go with simpler models such as decision [05:04] trees and neural I mean, not neural [05:05] network decision trees, maybe even [05:07] random forests, certainly logistic [05:09] regression. Those are all a little more [05:10] amenable. [05:12] But that said, uh even complex black box [05:15] methods like neural networks, there is a [05:17] whole field called mechanistic [05:19] interpretability, [05:20] which seeks to try to get insight into [05:23] what's going on inside these big black [05:24] boxes. So, the story isn't over, right? [05:28] But that's just the first cut you sort [05:30] of analyze the problem. [05:33] Okay. So, [05:35] um let's get going. So, if you want to [05:37] design a network, [05:39] All right. So, we design the network. 
Uh [05:42] so, we have to choose the number of [05:43] hidden layers and the number of neurons [05:45] in each layer. Then we have to pick the [05:46] right output layer. So, here, [05:49] what I did is the simplest thing you can [05:51] do is, of course, is to have no hidden [05:52] layer. [05:53] So, if you have no hidden layers, what [05:55] is that model called? [05:58] Yes, logistic regression. [06:00] Okay? So, of course, we want to do a [06:02] neural network, so I'm going to have one [06:03] hidden layer because that's the simplest [06:05] thing I can do. And then, I'll confess, [06:08] I tried a few different numbers of [06:09] neurons in this thing, and when I had 16 [06:12] neurons, it actually did pretty well. [06:14] Okay? So, there was some trial and error [06:15] that went on before I landed on the [06:16] number 16. Right? And for some reason, [06:19] people always use powers of two, so may [06:20] as well do that. [06:22] So, I tried like 4, 8, 16, and 16 was [06:24] really good. [06:25] And as it turns out, when I went above [06:27] 16, uh it sort of started to do badly. [06:30] And it started to do badly because [06:31] something called overfitting, [06:33] which we're going to talk about later, [06:35] okay? So, yeah, 16. [06:37] Um and then by default, I use ReLUs, [06:39] okay? So, 16 ReLU neurons. And then [06:42] here, the output is a categorical [06:44] output, right? Heart disease, yes or no, [06:47] one or zero, classification problem, [06:49] which means that we want to emit a [06:51] probability at the very end. Therefore, [06:53] we'll use a sigmoid. [06:54] Okay? So, so far, so good, right? Any [06:57] questions? [06:59] All right. [07:00] So, we're going to lay out this network [07:02] visually. [07:03] Okay? So, we have an input, and so I [07:06] just have have an input. And as you will [07:09] see here, [07:10] X1 through X29, that's our input layer. [07:13] And you may be wondering, 29, where did [07:15] he get that from? [07:17] Because there doesn't seem to be like 29 [07:19] rows here of independent variables. So, [07:22] it turns out there are only 13 input [07:24] variables here, [07:26] but some of them are categorical. [07:29] So, what I ended up doing is to take [07:31] each categorical variable and one-hot [07:32] encode it. [07:34] Okay? [07:35] And when you do that, you get to 39. [07:37] Sorry, 29. [07:39] All right? And when we actually do the [07:40] Colab later on, I'll show you exactly [07:43] how I one-hot encode encoded it, but [07:45] that's what I'm doing here. [07:46] That's why you have 29, not 13. [07:49] Okay? Now, obviously, we have decided on [07:51] these hidden units, 16 units, [07:54] with nice ReLUs here. [07:56] Okay? And then we have an output layer [07:57] with a little sigmoid. [07:59] And I got bored of trying to draw all [08:01] these arrows, so I just gave up and [08:02] said, "Assume there are arrows." [08:05] Okay, between all these things. [08:07] Good? [08:09] Yeah. [08:11] Yeah, I'm sorry. I think you already [08:12] mentioned this, but why 16 units? Why [08:15] 16? Uh [08:16] I tried a bunch of different numbers of [08:18] units. Uh and at 16, the resulting model [08:21] did well, so I just went with that. And [08:23] the logic of why is a ReLU? [08:25] Oh, why a ReLU? Yeah, so there's a [08:28] there's just a mountain of empirical [08:29] evidence that suggests that uh ReLU is a [08:31] really good default option for using as [08:35] activations in hidden layers. 
There is [08:37] also a really great set of theoretical [08:39] results, and I'll allude to some of them [08:41] when we actually talk about gradient [08:42] descent. [08:45] Yeah. [08:47] Sorry, quick question. You mentioned um [08:50] in the input layer, how how did you get [08:51] to 29 again when you had like 13 [08:53] variables? So, some of those 13 [08:55] variables are categorical variables like [08:58] uh cholesterol low, medium, high. Right? [09:00] And so, I took them and one-hot encoded [09:02] them. So, if it had like five levels, I [09:04] would get five columns now. [09:08] Uh yeah. [09:09] And by the way, folks, um just like uh [09:12] is it can Yeah, just like did, please [09:15] use a microphone so that people on the [09:17] live stream can hear your question. [09:18] Yeah, go ahead. Uh sorry, just one [09:20] question. So, the vectors, since you [09:22] didn't represent them, are we assuming [09:23] like every X is connected to all the [09:26] units? [09:26] >> Correct. And this is also a parameter [09:28] that we have to decide or That ends up [09:31] being the default. [09:32] And we will see [09:33] deviations from that assumption when we [09:36] go to image processing and language [09:37] processing and so on. But when you're [09:39] working with structured data like we're [09:40] doing now, that's the default. [09:43] Okay. So, let's keep going. [09:46] So, this is what we have. [09:47] So, what Remember what I told you in the [09:49] last class? Whenever you're working with [09:50] these networks, right? Get into the [09:52] habit of very quickly calculating the [09:54] number of parameters. [09:55] Right? Just do it a few times, the first [09:57] few times, so that you really know cold [09:59] exactly what's going on. Okay? So, yeah, [10:02] how many parameters do we have here? [10:04] How many weights and biases? You can [10:06] work through it, okay? You can You don't [10:08] have to tell me the final number. You [10:09] can say x * y + z, stuff like that. [10:14] Yeah. [10:15] 65. You have 48 weights and 17 biases. [10:20] Okay, and how did he come up with that? [10:21] So, for the weights, you have like for [10:23] the first layer it's 2 * 16 and for the [10:26] the second connection it's 1 * 16 and [10:28] then the biases are the 16 hidden plus [10:30] the outputs. [10:32] Okay. [10:33] Um any other views on this? [10:36] I think it's 29 into 16. 29, okay, 29 [10:40] into 16. And then 16 into [10:43] uh plus I mean 16 there. Yeah. And then [10:46] biases 16 biases and one bias. Right. [10:49] So, the way it's going to work is we [10:52] have 29 things here, 16 in the middle, [10:55] so 29 into 16 arrows. [10:58] And then for each of these fellows, [11:00] there's a bias coming in. [11:02] So, that's another 16. [11:05] Plus, you have 16 * 1. [11:08] Which is here, plus there is one bias [11:10] for this one. [11:12] So, the total is 497. [11:16] So, you can see here there's something [11:19] very interesting going on, which is that [11:21] when you go from one layer to another [11:22] layer, [11:24] the number of weights is roughly on the [11:26] order of a * b. [11:28] The number of units and so that's a [11:30] dramatic explosion in the number of [11:31] parameters. [11:33] Right? And that's something we have to [11:34] watch for later on to prevent [11:36] overfitting. [11:38] Okay, that's where the explosion of [11:39] parameters comes from the fact that each [11:41] layer is fully connected to the next [11:43] layer. [11:44] Okay? 
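To make that count concrete, here is the arithmetic just described as a tiny sketch (the layer sizes are the ones from the example; the code itself is purely illustrative):

```python
# Each fully connected layer going from a units to b units contributes
# a*b weights plus b biases.
sizes = [29, 16, 1]   # input, hidden, output, as in the heart-disease network
total = sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))
print(total)          # (29*16 + 16) + (16*1 + 1) = 480 + 17 = 497
```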
But we'll revisit this later on. [11:46] Okay. [11:47] So, [11:48] what I'm going to do now is I'm going to [11:50] actually translate this network, right? [11:52] The one that we have laid out [11:53] graphically, into Keras code [11:56] to demonstrate how easy it is. [11:58] Okay? So, I will give a fuller intro to [12:01] Keras in TensorFlow later on, but for [12:03] now, just suspend your disbelief. [12:06] We'll just try to do it in Keras as if [12:08] we know Keras. Okay? So, let's try that. [12:10] Later on we'll get into all the gory [12:12] details and train it in Colab and so on [12:14] and so forth. Okay. All right. So, [12:17] So, the So, the way we typically do it [12:19] is that once we have a network like [12:21] this, we typically start from the left [12:23] and start defining each layer in Keras [12:25] one after the other. So, we flow left to [12:27] right. Okay? So, let's take the input [12:30] layer. The way you define an input layer [12:32] in Keras is really easy. [12:34] You literally say Keras.input. [12:38] Okay? And then you tell Keras how many [12:41] nodes you have in the input coming in. [12:43] In this case it happens to be 29, so you [12:45] tell it the shape. Shape equals 29. And [12:47] the reason why we say shape as opposed [12:49] to length is because, as you will see [12:51] later on, we don't have to just send [12:53] vectors in, we can send complicated [12:55] things in to Keras. [12:57] And those complicated objects could be [12:59] matrices, it could be 3D cubes, it could [13:01] be 4D tensors and so on and so forth. [13:03] So, it's expecting a shape. [13:06] Right? What is the shape shape of this [13:07] thing you're going to send me? In this [13:09] particular case it happens to be a nice [13:10] list or a vector, so it's 29. Okay, [13:12] that's it. So, we we write this down. [13:15] This creates the input layer. [13:17] Right? And we give it a name. Right? And [13:19] the name here means [13:21] this layer, whatever comes out of this [13:23] layer has a name input. [13:26] Okay? [13:27] Good. Next. [13:30] Let's make sure the shape of the input [13:31] as I mentioned. [13:32] Right there. [13:34] Then we go to the next one. And here and [13:36] we will unpack this. The way you define [13:39] a layer is typically a hidden layer [13:41] Keras.layers.dense [13:43] and all this stuff. Okay? So, what this [13:46] is is it first of all it says [13:48] I want a dense layer. By dense layer I [13:50] mean a layer that's going to fully [13:52] connect to the prior and the later [13:53] layers. [13:55] Fully connect, that's what the word [13:56] dense means. Okay? [13:58] Number two, [13:59] I want 16 nodes here in this layer. [14:02] Okay? Finally, I want to use a ReLU. [14:06] See how compact and parsimonious it is? [14:09] Right? And that is the appeal of Keras. [14:11] It's very easy to get going. [14:13] So, the moment you do that, you've [14:15] actually defined this layer. [14:18] But what you have not done [14:20] is you have not told this layer what [14:23] input is going to get. [14:25] Because as far as this layer is [14:26] concerned, it doesn't know that this [14:28] other layer exists. [14:30] So, you need to connect them. Yes. [14:33] Um do we need to define for the ReLU [14:35] where the the bends are? Like where you [14:38] take the max? [14:39] >> No, the ReLU the bend is always at zero. [14:41] Okay. Thank you. [14:45] Okay? [14:47] All right. [14:48] So, that's what we have here. 
[14:51] And then, what we do is we have to tell [14:53] it I you want to feed this layer the [14:55] output of the previous layer, so you [14:57] feed it by taking whatever is coming out [15:00] of this thing, which is called input, [15:02] and you basically [15:03] stick it in here. [15:05] So, the moment you do that, boom, it's [15:07] going to receive the input from the [15:09] previous layer. [15:10] And because this one's output needs to [15:12] go to the final layer, you need to give [15:15] a name to that output. [15:16] So, you give it a name. I'm just calling [15:17] it h for because it's coming out of the [15:19] hidden layer. [15:20] It's just a variable. You can call it [15:21] anything you want. [15:25] Now, what we do, we go to the final [15:26] output layer. [15:28] And this is what we use. The output [15:30] layer is just another dense layer. [15:32] That's why I use the word dense. But we [15:34] say, "Hey, give me just one thing [15:36] because I just literally just need one [15:37] unit here because I need to emit just [15:40] one probability. [15:41] And the activation I want to use is a [15:44] sigmoid." [15:46] Done. [15:48] Okay? [15:50] And once you do that, you [15:52] have to feed it the input from the [15:54] second layer. So, you stick an h here. [15:57] Now you have connected the third and the [16:00] second layers. [16:01] And after you do that, you give a name [16:03] to the output coming out of that. We'll [16:04] just call it output. You can call it y, [16:06] you can call it output, you can call it [16:07] whatever you want. [16:09] Okay? So, at this point, what we have [16:11] done [16:12] is we have mapped that picture into [16:14] those three lines. [16:16] That's it. [16:17] Okay? [16:19] But we aren't quite done yet. There's [16:20] one little thing we have to do. [16:22] So, what we have to do is we have to [16:24] formally define a model so that Keras [16:27] can just work with this model object. It [16:30] can train it, it can evaluate it, it can [16:31] use it for prediction and so on and so [16:33] forth. So, we tell Keras, "Hey, uh [16:35] create a model for me, Keras.model, [16:38] and basically where the input is this [16:40] thing here and the output is that thing [16:41] there. [16:42] And then the whole thing we'll just call [16:43] it model." [16:45] Okay? So, that's it. [16:48] We are done. That is the whole model. [16:50] That is It sounds really fancy, right? A [16:52] neural model for heart disease [16:53] prediction. That's pretty cool. [16:56] Four lines. [16:58] And we will show how to train this model [17:00] with real data and so on and so forth [17:02] and use it for prediction after we [17:05] switch gears and really get into some [17:06] conceptual building blocks. [17:08] Had a question. [17:13] Can you define a custom activation [17:16] function that is not in the list of [17:18] Keras library? Yes. [17:21] Yeah, you can define The question was, [17:22] can you define a custom activation [17:23] function? You totally can. [17:25] Uh in fact, I mean, the the kind of [17:27] flexibility you have here is incredible. [17:30] And this these innocent four lines [17:32] unfortunately sort of hide the the [17:34] potential that's possible here, but I [17:36] guarantee you in two to three weeks you [17:38] folks will be thinking in building [17:39] blocks like Legos. [17:41] So, you'll be, you know, I I I I'm so [17:43] happy when it happens. 
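Putting those four lines together, a minimal sketch of the model just described might look like this in the Keras functional API (the layer sizes and names follow the lecture, which spells the calls as keras.Input and keras.layers.Dense; the import style and the exact data preprocessing are assumptions, and the 29-column input is whatever comes out of the one-hot encoding step shown later in the Colab):

```python
from tensorflow import keras

# Input layer: 29 features (the 13 raw variables after one-hot encoding).
inputs = keras.Input(shape=(29,), name="input")

# Hidden layer: 16 ReLU units, densely (fully) connected to the input.
h = keras.layers.Dense(16, activation="relu", name="hidden")(inputs)

# Output layer: one sigmoid unit emitting the probability of heart disease.
output = keras.layers.Dense(1, activation="sigmoid", name="output")(h)

# Wrap input and output into a model object Keras can train, evaluate, and
# use for prediction.
model = keras.Model(inputs=inputs, outputs=output)
model.summary()   # should report the 497 parameters counted earlier
```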
Students will [17:44] come to my office hours and say, "You [17:46] know, I want to create a network where I [17:47] have a little network going up on top, [17:49] one going in the bottom, then they meet [17:50] in the middle, then they fork again, [17:52] they split." I'm like, "Unbelievable." [17:54] It's fantastic. And you're going to be [17:55] doing this in two weeks, I guarantee [17:56] you. [17:58] Yeah, in the case of a multi-class [18:00] classification problem, are the output [18:01] nodes equal to the number of classes? [18:04] Correct. [18:05] So, we will come to So, this is binary [18:07] classification. And the question is for [18:09] multi-class classification, let's say [18:10] you're trying to classify some input [18:12] into one of 10 possibilities, we will [18:14] have 10 outputs. [18:16] But the way we define it is going to be [18:18] using something called a softmax [18:20] function, which we're going to cover on [18:21] Monday. [18:24] So, for now, we just live with binary [18:25] classification. [18:27] Uh [18:29] Is there a default activation method in [18:31] Keras or you have to put something? Ah, [18:33] that's a good question. I believe the [18:35] default might be ReLUs for hidden [18:37] layers, but I'm not 100% sure. Let's [18:39] double-check that. [18:40] Uh [18:42] Uh just to get a clearer understanding, [18:44] when you said that beyond 16 when you [18:47] tried working on those neurons, the [18:50] performance uh worsened. [18:52] So, that is where you were playing [18:53] around with initially two and then maybe [18:54] four and six and eight. Exactly. Right. [18:58] Could you use the mic? [19:02] Do we need to define each of the hidden [19:04] layer when the model gets more complex [19:05] when we have more than one layer? Oh, [19:08] like if you have like 25 layers? [19:09] >> consolidate, yeah. Yeah, yeah, yeah. So, [19:11] what we typically Good question. If you [19:12] have let's say 100 layers, right? Uh do [19:14] you actually write I have to type in [19:16] each by hand and cut and paste? No. You [19:18] can actually write a little loop which [19:19] will just automatically create them for [19:20] you. [19:22] And so, basically what's going on is [19:24] that this little output thing you see [19:26] here, this variable, [19:27] this output could be the result of a [19:30] thousand layer network with all sorts of [19:32] complicated transformations going on and [19:34] then finally it pops up as a little [19:36] thing called the output. And what Keras [19:38] will do is it'll be like, "Okay, this [19:39] model has this input and has this [19:41] output, but boy, this output came from [19:43] incredible transformations applied to [19:45] the input." And Keras will process all [19:47] that very easily for you. You don't have [19:48] to worry about it. [19:49] Right? It's really a beautiful example [19:51] of the power of abstraction. [19:53] And you will you will see that as we go [19:54] along. [19:55] Okay. So, [19:56] now let's switch gears and say once [19:58] you've written a model like that in [20:00] Keras, how do you actually train it? [20:01] Okay? Now, training is something you've [20:04] been doing a lot, right? So, for [20:05] example, when you have something like [20:06] linear regression, right? 
Where you have [20:08] all these coefficients you need to [20:09] estimate, you have this model, then you [20:12] have a bunch of data, then you run it [20:14] through something like LM if you use R, [20:16] and what it gives you is actual values [20:18] for these coefficients, right? 2.8, 0.9, [20:20] and so on and so forth. So, the the role [20:22] of the data is to give you the [20:23] coefficients. [20:25] Right? Or you can think of the [20:26] coefficients as really a compressed [20:28] version of the data. [20:30] Okay? Similarly, if you do logistic [20:31] regression, you have a model like that, [20:33] you add some data, you run it through [20:35] some estimation routine like GLM or [20:37] scikit-learn or statsmodels, pick your [20:40] favorite tool, then you'll come up with [20:42] something like that. So, basically [20:43] what's going on here is training simply [20:45] means find the values of the [20:47] coefficients that so that the model's [20:49] predictions are as close to the actual [20:51] values as possible. That's it. Okay? And [20:54] so and to find the one that is as close [20:57] to the actual value as possible, a whole [20:59] bunch of optimization is involved. You [21:01] didn't have to worry about the [21:02] optimization when you did the [21:03] regression, linear or logistic, because [21:05] it's all done under the hood for you, [21:07] but for neural networks, we actually get [21:08] to know how it's done. [21:10] Okay, because it's important. [21:12] Okay. So, training a neural network, a [21:15] deep neural network, even GPT-4, it's [21:18] basically the same process as what you [21:19] do for regression. [21:21] Right? You basically you're just a very [21:23] complicated function with lots of [21:24] parameters, but ultimately you have a [21:26] network with all these question marks, [21:28] you add some data, you do some training, [21:29] and boom, you get some numbers. [21:36] You may get into this, but are we [21:38] determining the architecture of the [21:40] network before we train it? [21:43] Okay. Yes, because if you don't define [21:45] the architecture, [21:46] um Keras doesn't know how to actually [21:49] calculate the output. [21:51] Given an input. And unless it knows [21:53] input-output pairs, it can't do anything [21:55] more with it. [21:58] Okay. So, um [22:00] so the essence of training is to find [22:02] the best values for the weights and [22:04] biases. [22:05] And the way we think of the best values [22:07] is that we basically set up a little [22:09] function, and this function measures the [22:11] discrepancy between the actual and the [22:14] predicted values. Okay? And I use the [22:16] word discrepancy because the way you [22:19] define discrepancy, there's an [22:20] incredible amounts of creativity in the [22:22] field. [22:23] In fact, a lot of breakthroughs in deep [22:25] learning come because people define a [22:27] very clever measure of discrepancy, and [22:29] then turns out it actually gives you all [22:31] sorts of interesting behavior. Okay? [22:33] That's why I use the word discrepancy as [22:34] opposed to the word error, because when [22:35] I say error, you might be just thinking [22:37] something like predicted minus actual. [22:39] That's too limiting. [22:42] Prediction minus actual is too limiting, [22:43] that's why I use the word discrepancy. 
[22:45] So, so we we basically define a function [22:48] that captures the discrepancy between [22:49] these the actual and the predicted [22:50] values, and these functions are called [22:53] loss functions in the deep learning [22:54] world. [22:55] And every paper that you read, you will [22:58] find interesting loss functions. There [23:00] are hundreds of loss functions, enormous [23:02] research creativity goes into defining [23:03] these loss functions. Okay? [23:05] All right. So, these are loss functions. [23:08] And so a loss function is a function [23:10] that quantifies a discrepancy. So, let's [23:12] say the predictions are really close to [23:14] the actual values, the loss would be [23:16] what? [23:19] It's close to zero. It's close to zero. [23:20] Close to zero. Right? Very small. [23:23] And if if you have a perfect model, [23:26] perfect crystal ball, what would the [23:27] loss be? [23:28] Exactly zero. [23:30] Right? Exactly zero. So, in linear [23:32] regression, we the loss function we use [23:35] is called sum of squared errors. [23:37] We didn't call it loss function because [23:39] we were not doing deep learning, just [23:40] linear regression, but that's basically [23:42] the loss function. Right? So, [23:45] the loss function we use must be very [23:47] matched very properly with the kind of [23:49] output we have. [23:51] Right? So, if your output is a number [23:53] like 23, right? You're trying to predict [23:55] demand like a product demand for next [23:57] week for a particular product, and uh [24:00] predicted value is 23, the actual value [24:02] is 21, [24:03] it's okay to do 23 minus 21, two as a [24:05] discrepancy, right? The error. Okay? But [24:09] for other kinds of outputs, it's not so [24:11] obvious what the correct loss function [24:13] is, what the correct measure of [24:14] discrepancy is. And so here, [24:18] for the simple case of regression, [24:20] right? Um [24:21] the YI, the I here, by the way, is a [24:23] superscript which stands for the ith [24:26] data point, the ith data point. So, what [24:29] I'm saying is that okay, for the ith [24:31] data point, this is the actual value, Y, [24:33] and this is what the model predicted. [24:36] Okay? I take the difference, square it, [24:39] and once I square it for each point, I [24:41] just average all these numbers to get an [24:43] average squared error, i.e. mean squared [24:45] error, MSE. So, this is sort of like the [24:48] easiest loss function. [24:50] Okay? [24:52] Now, let's crank it up a notch. [24:55] In the heart disease example, the heart [24:57] disease the neural prediction model, [24:59] the prediction is a number between zero [25:01] and one, right? It's because it's coming [25:03] out of the sigmoid. [25:04] It's a fraction. The actual output is a [25:07] zero or one, one of the two, right? It's [25:09] binary. [25:11] So, how would we compare the [25:12] discrepancy? How would we measure the [25:14] discrepancy between a fraction and the [25:16] numbers zero and one? Right? What is the [25:18] good loss function in this situation? [25:21] Right? Is the key question. So, let's [25:22] build some intuition around this. [25:26] And let's see if my little daisy chain [25:28] iPad thing works. [25:31] I'm doing it on the iPad so that people [25:32] on the live stream can see it, otherwise [25:34] the blackboard is a little tough for [25:35] them. [25:37] Okay. So, let's have a situation here. [25:41] Okay? 
So, let's say let's say that you [25:43] have a patient who comes in, and let's [25:45] say they have heart disease. Okay? So, [25:47] for that patient, Y equals one. [25:50] Right? The true value is one for that [25:51] patient. And now you have this model. [25:55] Okay? And this is the predicted [25:59] probability from this model. [26:04] Can people see my [26:05] handwriting okay? [26:07] Good. [26:08] I could never be a doctor, right? So. [26:11] So, zero, okay? One, it's going to be [26:13] between zero and one because it's [26:14] probability. [26:15] And then this is the loss we want to [26:17] sort of have, right? This is the loss. [26:19] So, for this this patient actually had [26:21] heart disease, Y equals one. So, let's [26:23] say that the predicted probability is [26:25] pretty close to one. [26:26] Okay? What do you think the loss should [26:28] be? [26:29] Small. [26:30] Close to zero. [26:32] Sorry? [26:34] Close to zero, exactly. So, here, if the [26:36] prediction comes here, you want the loss [26:38] to be you want the loss to be somewhere [26:40] here. [26:42] But if the predicted probability is [26:44] pretty close to zero, even though the [26:45] patient actually has heart disease, what [26:47] do you want the loss to be? [26:49] Really high. [26:50] Because it's screwing up badly, right? [26:52] So, you want the loss to be somewhere [26:53] here. [26:55] So, basically you want a function that's [26:57] kind of like that. [27:00] Right? You want the loss function shape [27:02] to be like that. [27:04] High values of probability should have [27:05] low losses, low values of probability [27:07] should have high losses. Yeah. [27:08] I understand like why it has to be [27:10] increasing or decreasing, but can you [27:12] explain why it has to be Yeah, yeah. So, [27:14] it can be linear, it can certainly be [27:16] linear, but basically what you want to [27:18] do is the more it makes a mistake, the [27:21] more harshly you want to penalize it. [27:23] Right? So, basically what you're what [27:25] what you really want is something where [27:27] if it basically says this person's [27:29] probability is say uh the probability [27:31] the predicted probability is say one [27:33] over a million, [27:34] basically close to zero, you want the [27:35] loss to be like super high. [27:37] So that the model is like it's like a [27:39] huge rap on the knuckles for the model. [27:41] Don't do that. [27:42] That's basically what we're doing, and [27:43] I'm sort of demonstrating that dynamic [27:45] by using a very curved and steep loss [27:47] function. [27:49] But you can absolutely use a linear [27:50] function, it's totally fine. It won't be [27:52] as effective for gradient descent later [27:54] on with a bunch of bunch of technical [27:56] details. [27:57] Are we good with this? [27:59] All right. So, now let's look at the [28:01] case where a patient does not have heart [28:03] disease. [28:05] Y equals zero. [28:06] Same setup, okay? [28:09] Predicted probability, [28:11] zero, one, loss. [28:15] So, for this patient, they don't have um [28:18] whatever uh they're not [28:20] uh they don't have heart disease. If the [28:22] probability is close to zero, what [28:24] should the loss be? [28:26] Close to zero. It should be somewhere [28:27] here, right? [28:28] And the more and more the probability [28:31] gets closer and closer to one, you want [28:32] to penalize it very heavily, which means [28:34] you want the loss to be somewhere here. 
[28:36] So, you basically want a loss ideally [28:37] that's kind of going up like that and [28:39] climbing higher and higher. [28:42] Are we good? [28:43] Okay, perfect. [28:44] Because we have a perfect loss function [28:46] for that. [28:48] So, just a recap. [28:51] Right? This is what we want. [28:53] For points with Y equals [28:54] one, lower predictions should [28:56] have higher loss. You want something [28:58] like that. And then it turns out [29:02] there's a very simple little loss [29:03] function, [29:04] which just uses the [29:05] logarithm, that will get the job done. [29:07] So, what you do is you literally take [29:09] minus log of the predicted probability. [29:13] That's it. And that thing has exactly [29:15] that shape. [29:16] Okay? And in fact, you can see it [29:17] numerically. So, if the predicted probability is one, [29:20] the loss is zero. If it's half, the loss is 1.0. And [29:22] if the probability is like one over 1,000, the loss is almost [29:24] 10. If it's one over 10,000, it's going [29:26] to be [29:27] much higher, right? Very high losses. [29:30] Okay? So, minus log probability, boom, [29:32] done. [29:34] Similarly, this is what we want for [29:36] patients for whom Y equals zero. [29:38] And it turns out if you do minus log of one [29:42] minus the predicted probability, it does the [29:44] same thing. [29:47] Okay? [29:50] Mathematicians once again save the day with a [29:52] logarithm. [29:54] So, in summary, [29:56] this is what we have. [29:58] Right? For data points where y equals 1, [30:00] we have this. For data points where y equals [30:01] 0, we have this. But, it feels a little [30:03] inelegant [30:05] to say, "Well, if it's y equals 1, I [30:07] want to use this. If y equals 0, I want [30:08] to use that." [30:09] Right? There's like an if-then [30:11] thing going on here. And I don't know [30:12] about you folks, but if-then really irks [30:14] me [30:15] mathematically, because you can't do [30:17] derivatives and so on very easily. [30:19] Okay? [30:20] But, no worries. This is MIT. We have [30:22] our bag of math tricks. [30:24] So, what we do is [30:26] we can actually combine them both into a [30:28] single expression. [30:30] Okay? Like this. [30:32] Okay? And here the yi again is the ith [30:35] data point. Remember, yi is either 1 or [30:37] 0, always. [30:38] And this model of xi is the predicted [30:40] probability. Okay? So, [30:43] the minus that was on the first log [30:45] I've just moved here, [30:48] and the minus that [30:50] was on the other one I've just moved here. Okay? [30:52] That's why you see it like this. [30:54] So, [30:57] you can convince yourself of what [30:58] happens. This single expression will get [30:59] the job done. So, let's say there is a [31:01] patient for whom y equals 1. [31:04] What's going to happen is that when you [31:05] plug in y equals 1, this becomes 0. The [31:07] whole thing will collapse to 0. [31:10] While here, y equals 1 just means it [31:12] becomes minus log probability, which is [31:14] what we want. [31:17] Conversely, if y equals 0, this whole [31:20] thing is going to disappear. [31:22] And this thing becomes 1 minus 0, which [31:23] is just 1. And so, it becomes minus log [31:25] of 1 minus the probability, which is again what [31:27] we want. [31:29] Simple and neat, right? [31:32] So, in one expression, we have defined [31:34] the perfect loss. No if-thens, none of [31:36] that crap. [31:39] Good.
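Written out, the single per-point expression being described is, as best it can be reconstructed from the narration (with p_i = model(x_i) standing for the predicted probability):

$$
\mathcal{L}_i \;=\; -\,y_i \log \hat{p}_i \;-\; (1 - y_i)\,\log\!\left(1 - \hat{p}_i\right),
\qquad y_i \in \{0, 1\},\quad \hat{p}_i = \mathrm{model}(x_i)
$$

For y_i = 1 the second term vanishes, leaving minus log of the probability; for y_i = 0 the first term vanishes, leaving minus log of one minus the probability, exactly as described above.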
So, now what we do is that was [31:42] true for every data point. [31:44] But, we obviously have lots of data [31:45] points. So, we just add them all up and [31:47] take the average. [31:50] That's it. We average across all the [31:51] data points we have. So, that we get an [31:53] average loss. [31:55] Okay? [31:57] We call this is the binary cross entropy [31:58] loss function. [32:06] Is there a way you can um edit the loss [32:08] function so that you penalize like false [32:11] negatives more strongly than false [32:13] >> you can do all of them. Great question. [32:15] Uh I'm just looking at the basic case [32:17] where we it's symmetric [32:19] loss. Um you can actually penalize [32:21] overestimates much more than [32:23] underestimates and things like that. [32:25] Um and if you're curious, you can just [32:26] Google something called the pinball [32:28] loss. [32:31] Okay? [32:32] Any other questions on this? [32:34] So, when you see this massive deep [32:36] neural network built by Google for doing [32:38] something or the other, if it's a binary [32:39] classification problem, chances are [32:41] they're using this thing. [32:44] Okay? [32:45] All right. [32:45] So, now let's figure out how to minimize [32:48] these loss functions because the name of [32:49] the game [32:50] is to find a way to minimize these loss [32:52] functions. So, now loss functions are [32:54] just a particular kind of function. So, [32:56] we'll first consider the general problem [32:59] of minimizing some arbitrary function. [33:02] Okay? [33:02] And once we develop a little bit of [33:03] intuition about that, we'll return to [33:05] the specific task of minimizing loss [33:07] functions. [33:12] How's everyone doing? [33:15] Yes, no, good, bad? [33:18] You have a bit of a [33:20] like a tough-to-interpret head shake. [33:23] It's more like um I kind of lost you [33:24] where you said that the loss function [33:26] and the predicted probability [33:28] uh how were they inversely because my [33:30] understanding was that the loss function [33:31] is supposed to be the sum of errors. [33:33] We're averaging the errors. And when you [33:35] said the heart patient [33:36] >> Sorry, sorry. Let me Let me just stop [33:37] there for a second. [33:38] For each point, you define the loss. [33:41] That's the whole point of the game. And [33:42] once you define it, you calculate for [33:44] every point and average it, right? So, [33:46] just focus on a single data point. [33:49] And so, now continue. [33:50] So, now when the heart patient has There [33:53] is more probability that they No. So, [33:56] when there is a person who has the heart [33:58] uh disease, you said that you want the [34:00] loss function to be high. [34:02] I think I'm going back to the graph. [34:03] >> You want the loss function to be high if [34:06] I'm predicting that they basically don't [34:08] have heart disease. [34:09] If the prediction is close to 0, [34:12] the predicted probability is close to 0, [34:13] then I'm badly wrong. [34:16] Because in reality, they do have heart [34:18] disease. [34:19] And that's why I want the loss to be [34:21] really high. Okay, so effectively, loss [34:23] is my way of finding out how good my [34:25] model is instead of saying, "Okay." Or [34:28] rather, how bad your model is. Yeah. [34:31] Right? How bad is it? That's really what [34:33] the loss function is. Got it. [34:34] >> And you want to minimize badness. [34:37] That's the whole point of optimization. [34:39] Okay. 
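As a quick numerical sketch of that averaged binary cross-entropy (the labels and probabilities below are made up, and natural logs are used, which is also what the built-in Keras loss uses):

```python
import numpy as np
from tensorflow import keras

y = np.array([1.0, 1.0, 0.0, 0.0])   # true 0/1 labels (made-up)
p = np.array([0.9, 0.2, 0.1, 0.8])   # predicted probabilities (made-up)

# Per-point loss: -[ y*log(p) + (1-y)*log(1-p) ], then average over points.
per_point = -(y * np.log(p) + (1 - y) * np.log(1 - p))
print(per_point.mean())

# Keras ships this as a built-in loss; the two numbers should agree
# (up to a tiny epsilon Keras adds for numerical stability).
print(keras.losses.BinaryCrossentropy()(y, p).numpy())
```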
[34:41] Um I guess I don't have a fully like [34:43] similar to the point where I said but I [34:45] don't have a fully clear intuition of [34:46] why exactly a log function rather than [34:48] something that say [34:50] flatter for small and then really steep [34:53] later. Those are all fantastic things. [34:55] You can totally do it. Uh the reason we [34:57] picked the loss this function because A, [35:00] it's easy to work with. It has good [35:02] gradients. It's well-behaved [35:04] mathematically. But, there are many [35:06] alternatives to it. I don't want you to [35:07] think that this is like the only game in [35:09] town or it's the only choice for us. We [35:11] have many choices. This is really This [35:13] happens to be a very easy choice, which [35:15] also happens to be empirically very [35:17] effective. [35:18] And I'm happy to give you pointers to [35:20] other crazy loss functions, right? Which [35:22] can actually do all these things, too. [35:26] Okay? [35:30] All right. So, uh minimizing a single [35:32] variable function, we will warm up by [35:34] looking at this little function here. [35:36] Okay? Which is a [35:38] What do you call a fourth power? [35:41] What? Quartic, right? Yeah, thank you. [35:43] Quartic. So, yeah, it's a quartic [35:45] function. Um [35:47] right? And this is how it looks like. [35:50] But, you can see there is like a minimum [35:51] somewhere here, right? Between like one [35:53] minus one and minus two. Like maybe [35:54] minus 1.5. Okay? [35:56] So, we want to minimize this function. [35:58] It's obviously a toy function, little [36:00] function with one variable. [36:02] But, the intuition we use here is going [36:03] to be exactly what we use for GPT-4. [36:06] So, pay attention. [36:08] So, how can we go about minimizing this [36:09] function? [36:11] What will we do? [36:15] Yeah. [36:16] Take the derivative and set it equal to [36:18] zero. You take the derivative. Exactly. [36:20] So, you take the derivative, right? [36:22] Um so, when you So, let's look at what [36:23] the derivative does for us. [36:25] But, then [36:26] the second part of what said [36:30] Yeah. Second part of what said was set [36:31] it to zero. Setting it to zero becomes [36:33] problematic [36:35] when you have very complicated [36:37] functions. It's not clear at all what's [36:38] going to make them zero, right? [36:39] Unfortunately. But, the idea of taking [36:41] the derivative is in fact the right [36:42] idea. [36:43] So, we can go about this. We can [36:45] calculate the derivative. And that [36:46] actually happens with the derivative. [36:47] You can convince yourself. [36:49] And if you plot the derivative, it looks [36:50] like that. [36:53] And as you would hope, wherever the [36:55] minimum is, in fact, the derivative is [36:56] crossing [36:58] right? The derivative is zero here. It's [36:59] crossing the x-axis. [37:01] Right? In this case, you can actually do [37:02] that. [37:03] So, let's say you have the derivative. [37:04] How can you use it? [37:06] Like, what is the value of a derivative? [37:08] What does it tell you? [37:09] Yeah. [37:11] You use a gradient descent algorithm. [37:13] You are 10 steps ahead of me, my friend. [37:16] I just want the basic answer. [37:18] Like, what what what what good is a [37:19] derivative? What Like, what does it tell [37:21] you? When you calculate the derivative [37:22] of something at a particular point [37:23] >> you the rate of change of the function [37:25] at the place you are. Correct. 
Exactly [37:27] right. So, here, what the derivative [37:29] would tells us is that the slope tells [37:32] us the change in the function for a very [37:34] small increase in w, right? [37:36] And this is high school calculus. I'm [37:38] just doing a quick refresher. [37:41] So, what that means is that [37:45] if the derivative is positive, [37:47] what that means is that increasing w [37:49] slightly will increase the function. [37:52] So, if if you're here, [37:53] you calculate the derivative, the slope [37:55] is positive. It means that if you go [37:56] slightly in this direction, the function [37:57] is going to get higher. [37:58] Right? [38:00] Similarly, if it's negative, [38:02] let's say here, you calculate the [38:03] derivative, it's the the slope is like [38:05] this. It's negative, which means that if [38:06] you increase w, if you go in this [38:08] direction, it's going to decrease the [38:10] function. [38:12] Okay? [38:13] All right. [38:15] And if it's kind of close to zero, [38:17] it means that changing w slightly won't [38:19] change anything. [38:22] So, if you're here, changing it slightly [38:24] won't change anything. [38:25] All right? [38:26] That's it. [38:28] So, [38:29] So, what we do is this immediately [38:31] suggests an algorithm for minimizing gw, [38:35] which is let's start with some random [38:37] point w. [38:38] And then, [38:39] let's calculate the derivative at that [38:40] point. [38:41] And once we do that, [38:42] there are three possibilities. [38:45] It could be positive, negative, or kind [38:46] of close to zero. [38:48] And if it's positive, we know that [38:49] increasing w will increase the function. [38:52] But, we want to decrease the function. [38:53] We want to minimize it. [38:55] Which means that we should not be [38:56] increasing w. We should be doing what [38:58] here? [39:00] Decrease. [39:01] Yes. And similarly, if it's negative, [39:03] what should we do here? Increase. [39:07] Exactly. So, in the first case, you [39:09] reduce w slightly. In the second case, [39:11] you increase w slightly. And if the [39:13] thing is close to zero, you just stop [39:14] because there's nothing else you can do. [39:17] Okay? [39:21] This is the basic intuition behind how [39:23] GPT-4 was built. [39:26] Which is kind of shocking if you think [39:28] about it. [39:29] Right? Which means that all the the [39:31] heavy-duty optimization stuff that [39:32] people have figured out over the decades [39:35] is kind of not used. [39:37] Right? This algorithm is what's being [39:39] used with some, you know, flavors on top [39:41] of it. [39:42] So, yeah. So, back to this [39:44] uh and you you do that and then if [39:46] you've sort of run out of time or [39:48] compute [39:49] or right, if you run out of time and so [39:52] on, just stop. [39:54] Otherwise, just go back to step one and [39:55] try again. Of course, if it's close to [39:56] zero, you got to stop anyway. [40:00] Yeah. [40:02] Is there the um concern of a potentially [40:05] local minimum there? It's coming. [40:10] Okay? So, that's the function. It's [40:11] going to give find It's going to find [40:12] you some point where the derivative is [40:13] kind of close to zero. Okay? [40:16] So, [40:17] this is called gradient descent. Right? [40:19] This is gradient descent, this little [40:21] algorithm. [40:23] And this this [40:26] this very power pointy MBA table can be [40:29] collapsed into this little expression. 
[40:32] Basically says, [40:34] calculate the derivative, [40:35] multiplied by a small number which we'll [40:36] get to in a second, [40:38] and then change the old W to the new W [40:41] is the old W minus a little number times [40:44] gradient. [40:45] So, this little one-line formula is [40:47] basically gradient descent. [40:50] Okay? [40:51] And what you should do, just to build [40:54] your intuition, is to make sure that [40:56] these three possibilities here map [40:58] nicely to this. Like this thing will [41:00] actually capture these three [41:01] possibilities. [41:03] This is when gradient descent was [41:04] invented. [41:07] It has some historical fun, right? [41:13] The 19th century? [41:15] 19th century. Yeah, okay. Good. Very [41:17] good. Excellent guess. [41:20] 1847. [41:22] It was uh invented uh in 1847 by Cauchy, [41:25] the great mathematician. And in fact, if [41:27] you're curious, you can check out the [41:29] paper. [41:30] I have I gave you I give you the paper [41:32] here for handy reference. [41:36] So, 1847. [41:38] So, GPT-4 is built using an algorithm [41:40] invented in 1847. [41:44] Which I find like astonishing, frankly. [41:47] That this little thing is so capable. [41:51] Okay. [41:52] So, that's gradient descent. And this [41:54] little number alpha [41:56] is called the learning rate. And it's [41:58] our way of sort of essentially [41:59] quantifying the idea of let's not [42:02] increase or decrease W massively, let's [42:04] do it slightly. [42:06] Because the gradient is only valid for [42:08] small movements around your point. If [42:11] you take a big step, all bets are off. [42:14] So, this alpha tells you how how small a [42:17] step should you take. [42:20] Okay? [42:20] And in typically, it's set to very small [42:23] values like, you know, 0.1, 0.001, and [42:25] so on and so forth. And in fact, if you [42:27] read any deep learning academic papers [42:30] where they have trained like a big model [42:31] to do something, [42:32] right? More lot of researchers will very [42:34] quickly go to the appendix where they [42:36] have described exactly what learning [42:37] rates were used. [42:39] Because sort of the learning rate is [42:40] like part of the IP for how it's built. [42:44] A lot of trial and error that goes into [42:45] these learning rates. [42:47] Okay. So, that is gradient descent. [42:50] Um so, if we apply this algorithm to GW, [42:53] our original function, [42:55] right? We just keep on doing this thing [42:56] a few times. [42:58] Right? What you will find is that if [43:00] let's say we start with two point the [43:01] the [43:02] the point we randomly pick is a 2.5, we [43:05] set the alpha to one, we run this [43:07] algorithm, it starts here, then it goes [43:09] there, it goes there, bup bup bup bup [43:11] bup, and then finally ends up here. [43:12] In like four or five iterations, it [43:14] finds some minimum. [43:16] This is obviously a very simple, [43:17] well-behaved, nice little function, so [43:19] you can easily optimize it. [43:22] Okay? If you want, you can just go to [43:23] this thing. There's a nice animation of [43:25] this thing as well. [43:28] Okay. So, now [43:30] All right. Before we actually go to the [43:31] multi-variable function, I want to go to [43:33] the question that you posed about local [43:35] minima. [43:36] Um actually, you know what? I think I [43:37] may have some slides on it. So, sorry. [43:38] I'll come back to this. 
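Here is that whole loop spelled out on a made-up quartic (the lecture's exact g(w), starting point, and learning rate aren't reproduced here, so the numbers below are purely illustrative):

```python
# Minimal 1-D gradient descent: the update w_new = w_old - alpha * g'(w).
def g(w):
    return w**4 + 2 * w**3 - 2 * w + 1      # an assumed toy quartic

def dg(w):
    return 4 * w**3 + 6 * w**2 - 2          # its derivative

w = 2.5          # starting point
alpha = 0.02     # learning rate: keep the steps small
for _ in range(100):
    w = w - alpha * dg(w)                   # the one-line gradient-descent update
print(w, g(w), dg(w))                       # dg(w) should now be close to zero
```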
[43:40] So, let's actually see — what we [43:41] looked at was a toy example where [43:43] there was only one variable. What if you [43:45] have, [43:46] uh, what if it was GPT-3? GPT-3 has 175 [43:49] billion parameters. [43:51] 175 billion, and GPT-4, they haven't [43:53] published it, so we don't know. It's [43:55] supposed to be eight times as much. [43:57] Okay? So, I mean, the number of [43:59] parameters is massive. So, basically, [44:02] our loss function has [44:04] billions of variables, billions of Ws, [44:07] that we need to optimize over, minimize [44:10] over. So, we need to use this notion of [44:12] a partial derivative. So, let's take [44:14] baby steps and say, okay, what if you [44:16] have a two-variable function, right? [44:18] Something like this, very simple. So, [44:20] what we can do is we can calculate the [44:21] partial derivative of G with respect to [44:23] each of these Ws. [44:26] And the partial derivative, just to [44:27] quickly refresh your memories, [44:29] is: you take the function, you pretend that [44:32] everything other than W1 is a constant. [44:36] Then the function becomes [44:38] a function of just one variable, W1. [44:40] And then you just differentiate it like [44:41] you do everything else. And you get [44:43] something, and that is [44:46] this thing here. [44:48] And then you do the same thing for W2, [44:50] you get this thing here, and then you [44:51] just stack them up in a nice list. [44:54] Okay? [44:55] This is the vector of partial [44:56] derivatives. [44:58] So, how should we interpret this? The [44:59] same way as before. Basically, for a [45:01] small change in W1, keeping W2 and [45:04] everything else fixed, how does the [45:06] function change if you change just W1 [45:08] slightly? And similarly for W2 and all [45:11] the way to W175 billion. [45:14] Same thing. Okay? [45:15] So, um, [45:17] now, when you have these functions with [45:19] many variables, many Ws, [45:22] since we have a partial derivative for each one [45:24] of those Ws, we stack them up into a [45:26] nice vector [45:28] of derivatives, and this vector is [45:30] called the gradient. [45:32] And it's denoted [45:33] using [45:35] this — uh, anyone know what the symbol is [45:37] called? [45:38] Nabla? [45:40] Yeah? [45:41] Laplacian? [45:43] Maybe. But the [45:45] one I'm familiar with is nabla. [45:48] Delta is the regular, right-side-up [45:50] triangle; the upside-down [45:52] triangle is called nabla, if I [45:53] recall. Am I right? [45:55] Thank you. [45:58] He's my go-to. [46:02] So, yeah. So, the gradient — we just [46:04] call it the gradient, and it's written [46:06] as this. [46:08] All right. So, what we do is we simply [46:10] do gradient descent on every one of the [46:12] Ws, [46:13] using its partial derivative. [46:16] Okay? So, in a gradient step, we [46:19] update W1 using this formula, W2 using [46:21] this formula. [46:23] Finished. [46:25] We've just generalized gradient descent [46:27] to an arbitrary number of variables. [46:30] And of course, as before, this can [46:32] be summarized compactly as this vector [46:35] formula. [46:36] Let me just do this. [46:43] So, what's going on here is that [46:46] I have: [46:47] the new W1 is the [46:50] old W1 minus alpha [46:52] times [46:53] the partial derivative of G [46:55] with respect to W1, then the new W2 is [46:59] W2 minus alpha times [47:02] the partial of G with respect to W2. And then all we're doing is [47:04] we're just stacking them up into a [47:06] vector, [47:08] like that.
[47:15] minus alpha, and this vector [47:21] like that. [47:27] So, this can be written as just this [47:28] vector W, the new vector [47:31] old vector minus alpha [47:34] and the gradient. Finished. [47:37] And you can see if it is, you know, [47:39] GPT-3, [47:40] this vector is going to be 175 billion [47:42] long. [47:44] Okay? But whether it's two or 175 [47:46] billion, who cares? It's the same thing, [47:47] right? [47:50] Okay. [47:52] So, yeah. So, that's what we have here. [47:54] I'm really thrilled by the way this [47:55] whole iPad business is working out. [47:58] I was a little worried about it. Okay. [48:00] Um so, if you look at two dimensions, [48:02] this function, and if you actually look [48:04] at if you plot the function, this is W [48:06] the first W, the second W, and then you [48:08] actually This is actually the loss [48:09] function. That's the function GW. And [48:11] so, you're trying to find the minimum [48:13] here, and so this is how the gradient [48:14] descent will do do do do do. It will [48:16] progress if you're starting from this [48:17] point. [48:18] Or you can also sort of look at it from [48:20] up top down into the function, and [48:22] that's what this picture is, and it [48:23] shows gradient descent starting from [48:24] there and working its way down [48:27] um from here all the way to the center. [48:30] Okay. So, [48:32] All right. Local minima. So, now [48:35] gradient descent will just stop [48:38] near uh hopefully a minimum, [48:41] right? But the problem is it may not be [48:43] a global minimum. It may It may not even [48:45] be a minimum. [48:47] So, um [48:48] so, let's see what what I'm talking [48:49] about here. [48:51] Here are some possibilities. [48:53] So, let's take a simple function. [48:57] Okay? Let's take This is GW. [48:59] This is W. And turns out this function [49:02] is actually looks like this. [49:12] Okay? [49:13] So, you can see here [49:17] Well, [49:19] um this point [49:23] this point here [49:24] is a local minimum. [49:27] This is a local minimum. [49:29] It's a local minimum. [49:30] These are all [49:32] lots of local minima here. [49:34] Okay? And yeah, there's a lot of local [49:37] minima here, too. [49:39] So, these are all places in which the [49:41] derivative is going to be zero. [49:43] So, if you run gradient descent and it [49:46] stops because the gradient is reached [49:48] zero, [49:49] you could be in any of these places. [49:52] Right? So, there's no guarantee. So, [49:54] this in this picture happens to be [49:57] maybe the global minimum because it's [49:59] the lowest of the lot. [50:01] Right? [50:02] But, there's no guarantee you're [50:02] actually going to get there. [50:04] Okay, there's not even a guarantee [50:06] you're going to be in any of these [50:07] places because you could literally be in [50:09] this thing here [50:10] where it's sort of taking a break and [50:12] then continuing on down. [50:14] That, by the way, is called a you know, [50:15] a saddle point. I drew it badly, but [50:17] this sort of coming in sort of taking a [50:19] break and going down again is called a [50:21] saddle point. So, gradient descent can [50:23] stop at a saddle point. It can stop at [50:25] some minima. There's no guarantee it's [50:27] going to be global. [50:28] Okay? [50:33] But, it turns out it has not mattered. [50:37] So, it has not mattered. 
And there are a [50:39] whole bunch of reasons why it has not [50:41] mattered, because when you have these [50:42] very complicated neural networks, [50:44] they're very complex functions. Even [50:46] finding a decent solution, right, to [50:49] these complicated networks is actually [50:50] really good for solving the problem. [50:52] You don't have to go to the best [50:54] possible solution. And in fact, if you [50:57] go to the best possible solution, you [50:58] actually run the risk of overfitting. [51:02] So, that's one reason. The other [51:03] interesting reason, and by the way, this [51:05] is a very hot area of research to figure [51:08] out exactly why it works, [51:09] is sort of like this. Empirically, [51:11] what we have seen is that not worrying [51:12] about local minima, global minima, all [51:13] that stuff has not hurt us, because these [51:16] things are amazing. [51:18] GPT-4, probably they just stopped [51:20] somewhere. It probably wasn't even [51:21] a local minimum. They're like, "All [51:22] right, it's been running for 6 [51:24] days. We've spent 2 million dollars. [51:25] Let's stop." [51:27] Right? Because these are very expensive. [51:29] So, but that's still so magical. [51:31] You don't need to get anywhere close to [51:33] a local minimum. But, there's another [51:34] interesting point which [51:36] I read about. [51:37] People basically hypothesize that [51:40] for you to be at a local minimum, just [51:43] think about what it means. It means that [51:45] you're standing at a particular point where, [51:47] in every direction that you look, [51:49] things are just sloping upward. [51:51] Right? [51:52] Everything is sloping upward. Only if [51:54] everything is sloping upward all around [51:56] you could you be at a local minimum, [51:58] by definition. But, if you have a [52:00] billion dimensions, [52:02] what are the odds that you're going to [52:04] be standing at a point where every one [52:06] of those billion dimensions is going [52:07] upward? [52:08] The odds are really low. [52:10] Chances are some of them are going [52:11] up, some of them are going down, [52:13] others are sort of coming down and going [52:14] another way. It's going to be crazy. [52:16] So, in some sense, the best you can hope [52:18] for in these very high-dimensional [52:20] situations is probably a saddle point. [52:23] And it turns out it's good enough. [52:25] So, for those reasons, we are content [52:29] with just running gradient descent, with [52:30] some tweaks which I'll get to in a [52:31] second. Um and it just performs really [52:34] admirably. [52:36] Um how does alpha depend on like how [52:39] much compute you have? Like, would you [52:41] set the learning rate based on that or [52:44] not really? [52:45] >> No, the learning rate is really [52:47] more a measure of how confident you are in the direction. It's sort of like this. [52:50] When you're at a point where you think [52:52] that the gradient is looking nice, and, [52:54] right, if you take a step in that [52:55] direction it's going to go down. And if [52:57] you further believe that it's going to [53:00] keep going down in that direction for a [53:01] while, [53:02] then you're very confident about taking [53:04] a big step. [53:06] But, if you're like, "I don't know, [53:07] maybe I take a little step, [53:09] maybe I have to go this way, I can't go [53:10] straight anymore," then you don't want [53:12] to take a big step, because then you have [53:13] to backtrack.
[53:14] So, those kinds of considerations go [53:16] into the learning rate. Um and so, [53:19] that's sort of the rough answer to your [53:20] question. It's not so much determined by [53:23] compute and bandwidth and things like [53:24] that. [53:25] But, again, it's a sort of [53:27] complicated thing, because sometimes with [53:29] a given amount of compute, if [53:31] you have a particular kind of data, you [53:33] can have very aggressive learning rates. [53:35] So, it tends to be a bit sort of, you [53:37] know, jumbled up, complicated. So, but [53:39] that's sort of the quick surface-level [53:40] idea of what's going on. [53:43] Um okay. [53:47] 9:31. [53:50] Anyway, folks, this lecture is [53:52] probably one of the driest in the [53:54] semester, because I have to go [53:55] through all the concepts. Um once we [53:57] start doing collabs, you know, things [53:59] get a lot more lively. [54:00] Okay. [54:01] Um all right. So, now let's talk about [54:04] minimizing a loss function with gradient [54:05] descent. So, here is our little binary [54:08] cross entropy loss function that we saw [54:09] before. Right? This is what we want [54:11] to minimize. So, if you look at this [54:13] thing, [54:14] where are the variables we need to [54:16] change to minimize this function? [54:19] Folks, don't look at your phones. [54:21] Okay, with laptop and iPad use, don't [54:23] look at your phones. [54:27] Sorry, we've kind of abstracted um the [54:30] variables W, but just to bring it back, [54:33] those are actually the weights in the [54:35] neural networks, right? Yeah, the [54:36] weights and the biases. I'm just calling [54:38] them weights. So, the outputs of these [54:42] uh minimization functions are going to [54:45] be the actual weights in your model, [54:47] right? [54:47] >> Exactly. Exactly right. [54:49] The whole name of the game is to find [54:51] the weights. [54:52] And so, for example, when you see in the [54:53] press that uh Meta has essentially um [54:57] made the weights of Llama 2 or something [55:00] available, that's basically what they've [55:01] done. [55:02] They basically published the weights. [55:04] The reason that's so valuable is [55:06] >> Microphone, please. Go. [55:07] Cuz if you have a billion parameters, [55:09] the compute time on that is horrendous [55:11] and expensive. That's why the [55:13] weights are so valuable. [55:14] >> Correct. The weights are the crown jewel, [55:16] because they are the result of a lot of [55:18] money and time and smartness being [55:19] spent. [55:21] There is a separate question of why are [55:23] they making it open source, [55:25] which [55:26] I'm happy to chat about offline. [55:28] All right, cool. So, what are the [55:29] variables we need to change to [55:30] minimize? It's basically the parameters, [55:32] and they're hiding inside the model [55:34] term. [55:36] Right? Because what is the model? The [55:38] model is some function like that, right? [55:41] If you look at the simple GPA and [55:42] experience thing we looked at on [55:44] Monday, we finally figured out that the [55:46] actual thing that comes out here is [55:48] going to be this complicated function of [55:50] all the X's and the W's and so on and so [55:52] forth, right? And that complicated thing [55:54] is showing up inside this thing. [55:57] So, [55:58] you know, the W's here are the [56:00] variables we need to change to [56:02] minimize the loss function.
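To make the "which variables do we minimize over" point concrete, here is a minimal sketch, assuming a tiny made-up logistic-style model with two input features; the numbers in X and y are invented for illustration and stand in for the data sitting in the spreadsheet.

```python
import numpy as np

# Made-up data: these X's and y's are fixed numbers handed to us by the problem.
X = np.array([[3.2, 1.0],
              [3.9, 4.0],
              [2.5, 0.5]])       # features
y = np.array([0.0, 1.0, 0.0])    # true labels

def predict(w):
    # A tiny logistic-style stand-in for "the model": sigmoid(X @ w).
    # A real network would have hidden layers here, but the idea is the same.
    return 1.0 / (1.0 + np.exp(-X @ w))

def bce_loss(w):
    # Binary cross entropy, averaged over the n data points.
    # Note the only argument is w: the data is baked in as constants.
    p = predict(w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print(bce_loss(np.array([0.1, -0.2])))   # just a number: the loss at this choice of weights
```

Notice that `bce_loss` takes only `w`. The X's and y's are frozen inside it, which is exactly the point made next.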
And it's [56:05] important for you to note and [56:06] understand that the values of X and Y [56:10] and so on are just data. [56:13] You're not optimizing anything there. [56:14] They're just data. [56:15] What you're optimizing is the W's. [56:17] The weights. [56:22] Okay. So, imagine replacing the model [56:26] here with the mathematical expression [56:27] above wherever it appears in the loss [56:29] function. And once you do that, your [56:31] loss function is just a good old [56:33] function of the W's. [56:35] The fact that it's a loss function is [56:37] kind of irrelevant. [56:39] It's just a function. [56:41] And since it's just a good old function [56:42] of the W's, you can apply gradient [56:43] descent to it as we normally would. [56:45] It's no big deal. [56:49] Which brings us to something called [56:50] backpropagation. [56:52] Um [56:56] Um if you remember nothing else about [56:57] backpropagation, just remember this. [56:59] Never use the word backpropagation [57:01] again. Only use the word backprop. [57:04] You're [57:05] hip and cool to the deep learning [57:06] community. [57:07] Backprop. [57:09] Okay. All right. So, what is backprop? [57:12] Backprop is a very efficient way to [57:14] compute the gradient of the loss [57:16] function. [57:17] So, when you have this loss function, [57:19] and let's say you have a billion W's [57:21] and you have 10 million data points. So, [57:24] the little n we saw was 10 million. [57:27] That is a lot of computation. [57:30] And that is just for one step of [57:32] gradient descent. [57:34] Right? So, backprop is a very [57:37] efficient and clever way to compute the [57:39] gradient of the loss function, which [57:41] takes advantage of the fact that what we [57:44] have here is not some arbitrary model. [57:47] It's a model that came from a particular [57:49] kind of neural network, which has layers [57:51] one after the other, and then there was [57:53] an output at the very end. [57:55] So, what backprop does is [57:57] it organizes the computation in the form [57:59] of something called a computational [58:00] graph, and the book has a good [58:01] discussion about it. And so, what we do [58:03] is we start at the very end. [58:05] We calculate the gradient of the loss [58:08] with respect to the output. [58:10] Then we move left. We calculate the [58:12] gradient of that output with respect to [58:13] the output of just the prior hidden [58:15] layer. [58:17] Step to the left. Calculate the gradient [58:19] of the current thing with respect to the [58:20] previous layer. You get the idea, right? [58:22] It's iterative and it moves backwards, [58:25] and by doing so, you never repeat the [58:27] same computation twice wastefully. [58:30] That's the big advantage. You calculate [58:32] once and reuse it many, many [58:34] times. [58:35] The second advantage is that if you [58:37] organize it this way, it just becomes a [58:39] sequence of matrix multiplications. [58:42] Okay. [58:42] And so [58:45] it's a sequence of matrix [58:46] multiplications, it eliminates redundant [58:48] calculations, and best of all, [58:51] there are these things called GPUs, [58:53] graphics processing units, originally [58:54] invented to accelerate video game [58:56] rendering. [58:57] Uh and as it turns out, to accelerate [58:58] video game rendering, the core math [59:00] operation you do is basically a matrix [59:02] multiplication. Right? Some linear [59:03] algebra uh [59:05] sort of operations.
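To give a feel for what that backward pass looks like, here is a minimal hand-rolled sketch, assuming a made-up one-hidden-layer network with a ReLU hidden layer, a sigmoid output, and the binary cross entropy loss; the shapes and random data are invented for illustration, and real frameworks do all of this for you.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up shapes: 5 data points, 3 input features, 4 hidden neurons.
X = rng.normal(size=(5, 3))
y = rng.integers(0, 2, size=(5, 1)).astype(float)

W1, b1 = 0.5 * rng.normal(size=(3, 4)), np.zeros((1, 4))   # hidden layer
W2, b2 = 0.5 * rng.normal(size=(4, 1)), np.zeros((1, 1))   # output layer

# ---- Forward pass: compute and keep the intermediate quantities ----
z1 = X @ W1 + b1            # matrix multiplication
a1 = np.maximum(z1, 0)      # ReLU
z2 = a1 @ W2 + b2
p = 1 / (1 + np.exp(-z2))   # predicted probability
loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# ---- Backward pass: start at the loss, step left, reuse what's stored ----
n = len(X)
dz2 = (p - y) / n                       # gradient of the loss w.r.t. the output pre-activation
dW2 = a1.T @ dz2                        # reuses a1 saved from the forward pass
db2 = dz2.sum(axis=0, keepdims=True)
da1 = dz2 @ W2.T                        # dz2 computed once, reused for everything earlier
dz1 = da1 * (z1 > 0)                    # ReLU's gradient: 1 where z1 > 0, else 0
dW1 = X.T @ dz1
db1 = dz1.sum(axis=0, keepdims=True)
# Every line above is a matrix multiplication or an elementwise operation.
```

The point isn't the algebra; it's that quantities like `a1` and `dz2` get computed once and reused, and that each step is a matrix multiplication, which is what the next bit about GPUs is really about.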
And so, someone [59:07] at some point had the bright idea: [59:09] for deep learning, for calculating gradients [59:11] and so on, we need to do matrix [59:13] multiplications, and here is some [59:14] specialized hardware [59:17] that does a fast job of matrix [59:19] multiplications. Can we use [59:20] this for that? [59:22] And they did it. And all hell broke [59:24] loose. [59:26] That's literally what happened. [59:28] And that's why Nvidia is valued at what, [59:30] 1.5 trillion or something. [59:32] So, yeah. So, they are really good. And [59:35] so, [59:37] the way you do backprop, plus running it on [59:40] GPUs, leads to fast calculation of loss [59:42] function gradients. [59:44] If this thing were not true, this class [59:47] would not exist. [59:49] Because there wouldn't be any deep learning [59:50] revolution. [59:52] This is a fundamental, seminal reason. [59:57] All right. So, the book has a bunch of [59:59] detail, [01:00:00] um [01:00:01] and I actually hand- [01:00:05] worked out an example [01:00:07] of calculating a gradient the [01:00:09] old-fashioned way and calculating it [01:00:11] using backprop. [01:00:13] So, take a look at it. I'll post it on [01:00:14] Canvas, and you will understand exactly [01:00:17] where the savings come from, where the [01:00:18] efficiency gains come from. Okay? [01:00:21] Because of time, I'm not going to get [01:00:22] into it now. [01:00:26] All right. Any questions so far? [01:00:28] Yep. [01:00:30] Sorry, a follow-up to that: so, we've [01:00:32] done gradient descent, which is [01:00:34] different than calculating the [01:00:36] gradient of the loss function. What [01:00:37] is the purpose of the calculation of the [01:00:39] gradient of the loss function? [01:00:41] >> You calculate the gradient because the [01:00:42] fundamental operation of gradient [01:00:44] descent is to take your current value of [01:00:47] W [01:00:48] and modify it slightly, and the [01:00:50] modification is old value minus learning [01:00:52] rate times gradient. [01:01:03] It'd be cool, right, if I say, "Go [01:01:04] back five slides to this thing," and [01:01:06] it just goes back. Product idea. Anyone? [01:01:08] Startups? [01:01:09] So. [01:01:11] So, this one. [01:01:14] So, this is the fundamental step of [01:01:15] gradient descent. [01:01:16] So, this is the current value of W. [01:01:19] You calculate the gradient at that [01:01:20] current value, [01:01:22] multiply it by alpha, do this thing, and [01:01:24] you get the new value. [01:01:26] And you keep repeating. [01:01:27] Right, but G of W, [01:01:29] that's not the loss function. [01:01:32] >> It is the loss function. That is the [01:01:33] loss function. [01:01:34] >> Yeah, right. Here, I'm just using G as [01:01:35] an arbitrary function, [01:01:37] just to demonstrate the point. But [01:01:39] when you're optimizing, when you're [01:01:41] training a neural network, what you're [01:01:42] actually doing is minimizing a loss [01:01:45] function. Right. [01:01:46] >> Loss of W. Sorry, I got things mixed up. [01:01:49] Thank you. [01:01:51] >> Yeah. [01:01:53] Uh how do we define the initial weights [01:01:54] for the neural network? [01:01:55] >> Ah. [01:01:57] So, yeah, the initial weights um [01:02:02] So, there are many ways to do it. So, [01:02:04] first of all, they are initialized [01:02:04] randomly. [01:02:06] Uh but randomly doesn't mean you can [01:02:08] just pick any random weight.
There are [01:02:09] actually some good ways to randomly pick [01:02:11] the weights. Uh those are called [01:02:13] initialization schemes. Um and there are [01:02:16] a bunch of very effective initialization [01:02:18] schemes people have figured out over the [01:02:19] years, and those things are baked into [01:02:21] Keras as the default. [01:02:22] So, Keras, I believe, uses something [01:02:24] called the [01:02:26] uh He initialization, H-E [01:02:27] initialization, or the Xavier Glorot [01:02:31] initialization. I wouldn't worry about [01:02:33] it. Just go with the default [01:02:33] initialization. [01:02:36] The reason you have to be very [01:02:37] careful about how these weights are [01:02:38] initialized is that if you have a [01:02:40] very big network and you initialize [01:02:43] badly, then [01:02:45] the gradients will just explode as you [01:02:47] calculate them. [01:02:48] In the earlier layers, the weights will [01:02:50] have massive gradients, or the gradients [01:02:52] will vanish. [01:02:53] So, they're called the exploding [01:02:55] gradient problem or the vanishing [01:02:56] gradient problem. To avoid all those [01:02:58] things, researchers have figured out [01:02:59] some clever ways to initialize so that [01:03:00] it's well-behaved throughout. [01:03:03] Yep. [01:03:05] If using um backprop and GPUs was so [01:03:08] critical, I'm just curious, like, who [01:03:10] first did it and when? Was this like a [01:03:12] couple years ago? Was it a company? Was [01:03:14] it a [01:03:15] >> Yeah. Well, GPUs have been used for deep [01:03:17] learning, I want to say, um [01:03:20] I think the first uh case may have been [01:03:22] in the mid-2000s, 2005, 2006 sort of thing. [01:03:26] But I would say that it sort of burst [01:03:27] out onto the world stage and made [01:03:30] everyone take notice when uh a deep [01:03:32] learning model called AlexNet [01:03:35] in 2012 won a very famous [01:03:38] computer vision competition. [01:03:40] Uh and it beat the competition and set a [01:03:43] record for how good it was. [01:03:45] Uh and that's when everyone was like, [01:03:46] "Hey, what is this thing?" And that's [01:03:48] really when it burst onto the world [01:03:49] stage. I'll talk a bit more about it [01:03:50] when I get into the computer vision [01:03:51] segment of the class. [01:03:54] But you can Google AlexNet and you'll [01:03:55] find a whole bunch of history around it. [01:03:59] If you do this, is it [01:04:00] true that if you could get to a global minimum, [01:04:04] that would mean there would be no [01:04:06] hallucinations? [01:04:07] Aha, good question. [01:04:09] So, if you get [01:04:11] to a global minimum. First of [01:04:13] all, a global minimum doesn't mean the [01:04:14] model is perfect, right? It may still [01:04:15] have some loss. [01:04:17] Um [01:04:18] but that global minimum is going to be on the [01:04:21] training data. [01:04:24] You can imagine that the test data, [01:04:26] future data, has its own loss function, [01:04:28] right? [01:04:29] So, what is minimum here may not be [01:04:31] minimum there. That's the problem. [01:04:36] Is that a comment? No, okay. [01:04:38] Just saying that [01:04:40] uh that would also mean that you can be [01:04:42] overfitting, for [01:04:43] >> Correct. Exactly. Exactly. So, if you [01:04:45] overdo it, if you find the best thing on [01:04:47] the training loss, chances are it [01:04:48] doesn't match the best thing on the test [01:04:50] data.
[01:04:52] So, on the test data, you're actually [01:04:53] doing badly. [01:04:56] Okay. So, [01:04:57] uh we'll come back to this. [01:05:03] Okay. Now, uh the final uh twist in the [01:05:06] tale here: uh we're going to go from [01:05:08] gradient descent to something [01:05:10] called stochastic gradient descent. And [01:05:11] stochastic gradient descent, or SGD, is [01:05:14] the workhorse for all deep learning. [01:05:16] Okay? [01:05:17] And funnily enough, SGD is simpler than [01:05:19] GD. [01:05:20] Okay? Just when you thought it couldn't [01:05:21] get simpler, right? [01:05:23] Okay. So, [01:05:25] So, for large data sets, computing the [01:05:27] gradient of the loss function can be [01:05:28] very expensive. Right? Needless to say. [01:05:31] Because it has to be done at every step, [01:05:32] and the cardinality of the data set is [01:05:34] really big. Right? And you may have, I [01:05:36] don't know, billions of parameters. It's [01:05:38] just very, very [01:05:39] tough to compute it, even with backprop. [01:05:43] So, the solution is, at each iteration, [01:05:45] when I say iteration, I'm talking about [01:05:47] this step of gradient descent. [01:05:50] Instead of using all the data, [01:05:52] instead of calculating the loss function [01:05:54] by averaging the loss across all N data [01:05:57] points and then calculating the gradient [01:05:59] of that thing, what you do is you just [01:06:01] choose a small sample randomly. You [01:06:04] choose just a few of the N observations, [01:06:06] and we call it a mini batch. [01:06:08] So, for example, you may have 10 billion [01:06:11] data points, [01:06:12] but in every iteration, you may [01:06:14] literally grab just like 32 or 64, [01:06:16] something really small. [01:06:18] Like absurdly small. [01:06:20] Okay? [01:06:21] And then you pretend that, okay, that's [01:06:23] all the data I have. You calculate the [01:06:24] loss, find the gradient, and just use [01:06:27] that here instead. [01:06:30] Okay? So, this is called stochastic [01:06:33] gradient descent. So, strictly speaking, [01:06:36] theoretically, SGD uses just one data [01:06:39] point. [01:06:40] But in practice, we use what's called a [01:06:42] mini batch, 32, 64, whatever. [01:06:44] Uh and so, mini batch gradient descent [01:06:47] is just loosely called stochastic [01:06:48] gradient descent, SGD. [01:06:52] So, and SGD, as it turns out, [01:06:55] you can see it's clearly very efficient, [01:06:57] right? Because [01:06:58] it's just processing a few at a time. [01:07:00] Uh and in fact, if you have a lot of [01:07:02] data [01:07:03] and you calculate the full gradient of [01:07:05] the loss function, it may not even fit [01:07:07] into memory. [01:07:09] Right? It's really problematic. But with [01:07:11] SGD, it says, "I don't care whether you [01:07:12] have a billion data points or a trillion [01:07:14] data points. Just give me 32 at a time." [01:07:17] Okay? And you just keep on doing it. [01:07:19] And [01:07:20] it turns out, because not all the points [01:07:22] are used in the calculation, this only [01:07:24] approximates the true gradient. Right? [01:07:26] It's only an approximation. It's not the [01:07:27] real thing. It's only an approximation. [01:07:29] But it works extremely well in practice. [01:07:32] Extremely well in practice. [01:07:33] And there's a whole bunch of research [01:07:34] that goes into why is it so effective?
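Here's what that looks like in a minimal sketch, assuming a made-up logistic-style model so it's self-contained; the data, the batch size of 32, and the learning rate are all just illustrative numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up "big" data set (here only 10,000 rows) for a logistic-style model.
X_all = rng.normal(size=(10_000, 5))
y_all = (rng.random((10_000, 1)) < 0.3).astype(float)

def bce_gradient(w, X, y):
    # Gradient of the average binary cross entropy for sigmoid(X @ w),
    # computed only on the rows we are handed.
    p = 1 / (1 + np.exp(-X @ w))
    return X.T @ (p - y) / len(X)

w = np.zeros((5, 1))
alpha, batch_size = 0.1, 32

for step in range(1_000):
    # Grab a small random mini batch and pretend that's all the data there is.
    idx = rng.choice(len(X_all), size=batch_size, replace=False)
    g = bce_gradient(w, X_all[idx], y_all[idx])   # approximate gradient
    w = w - alpha * g                             # one update, then on to the next batch
```

Each mini batch gets exactly one update before the loop moves on to the next one.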
[01:07:37] And you know, people are discovering [01:07:39] interesting things about SGD, but we [01:07:40] don't have like a definitive theory as [01:07:42] to why it's so good yet. We have some [01:07:44] interesting, you know, uh research [01:07:46] threads that have happened. [01:07:47] And very tantalizingly, very [01:07:50] tantalizingly, [01:07:51] because it's only an approximation of [01:07:53] the true gradient, [01:07:55] SGD can actually escape local minima. [01:07:59] So, [01:08:00] in the true loss function, you're [01:08:02] at a local minimum, [01:08:04] but when you're [01:08:06] doing SGD, you're chasing the [01:08:08] minimum of the SGD loss function, [01:08:11] which actually may not be a minimum of the actual [01:08:13] loss function. So, as you're moving [01:08:14] around, you're actually jumping from [01:08:16] local minimum to local minimum of the [01:08:18] actual loss function. [01:08:20] I know that's a mouthful. I'm happy to [01:08:22] tell you more. It's just a side thing [01:08:24] that I just wanted you to be aware of. [01:08:25] Okay? [01:08:26] It's one of the reasons why SGD is actually [01:08:27] effective. It's almost like you work [01:08:30] less and you do better. [01:08:34] How many times does that happen in life? [01:08:35] This is one of them. [01:08:39] Okay? Now, SGD comes in many flavors. [01:08:42] Uh many siblings. It's got a lot of [01:08:44] siblings and variations. It's a big [01:08:45] family. Uh and we're going to use a [01:08:47] particular flavor called Adam [01:08:49] as our default in this course, and I'll [01:08:52] get back to it when we get into the [01:08:53] collabs and things like that. [01:08:56] All right. [01:08:57] Um [01:08:58] By the way, [01:09:00] you know how all these pictures [01:09:01] I've been showing you are a nice little [01:09:02] function like that, a little bowl and so [01:09:04] on? [01:09:05] This is a visualization [01:09:07] of an actual neural network loss [01:09:08] function. [01:09:11] You can see like the hills and valleys [01:09:12] and the cracks and so on and so forth. [01:09:14] Okay? And you can check out the paper to [01:09:16] get more insight into how they actually, [01:09:18] you know, came up with this [01:09:19] visualization. It's crazy. [01:09:21] It's complicated. [01:09:24] Yep. [01:09:25] So, for SGD, do you perform the [01:09:28] iterations until you minimize the loss [01:09:30] function for each mini batch and then [01:09:32] move to another mini batch? [01:09:33] >> Yeah, so [01:09:34] what you do is you take each mini batch, [01:09:36] and then [01:09:37] you calculate the loss for the mini [01:09:39] batch, you find the gradient, [01:09:41] and you use the gradient and update the W. [01:09:43] Then you pick up the next mini batch. [01:09:45] >> So you don't pick a mini batch [01:09:47] and try to perform the iterations on [01:09:48] that mini batch until you reach the [01:09:50] >> Each mini batch, one iteration. Each [01:09:52] mini batch, one iteration. Because if [01:09:54] you do a lot of iterations on one mini [01:09:56] batch, [01:09:57] first of all, you'll never be sure that [01:09:58] you're going to find any optimal [01:09:59] solution, because you're not guaranteed [01:10:00] any global minimum. And secondly, it's [01:10:03] much better for you to get new [01:10:04] information constantly, because what you [01:10:05] can do is you can revisit that mini [01:10:07] batch later on. [01:10:09] Right?
And that gets into these things [01:10:10] called epochs and batch size and so on, [01:10:13] which we'll get into in a lot of gory [01:10:14] detail when we do the collab. [01:10:16] So let's revisit that question then. It's a [01:10:17] good question. [01:10:20] Yeah. [01:10:22] When you do the backprop process... [01:10:25] Very good. Backprop. Not backpropagation. [01:10:26] Nice. I made sure. [01:10:27] >> Yes. [01:10:29] Well, it sounded like you started [01:10:30] from the layers that were closest to the [01:10:32] output and you went backward. Okay. And [01:10:35] um my question is, are you doing that [01:10:36] once, or is it looping multiple times and [01:10:39] then [01:10:39] >> You do it once. Just once. Yeah. So for each [01:10:42] gradient calculation, you do it once. [01:10:44] Why does it want to start [01:10:45] from the layer that's closest, or why do [01:10:47] you want to start it from the layer [01:10:48] that's closest to the output? [01:10:49] >> Yeah. So basically what happens is, let's [01:10:51] say, just for argument, that you go [01:10:53] in the other direction, from the input side. [01:10:54] You will discover that a lot of paths to [01:10:56] go from the left to the right will end [01:10:58] up calculating certain intermediate [01:10:59] quantities, including the very final [01:11:02] gradient sort of item, [01:11:04] again and again and again. [01:11:06] The same thing is going to get calculated [01:11:07] again and again and again. So by [01:11:09] starting from the end and working [01:11:10] backwards, you just reuse stuff you've [01:11:12] already calculated. [01:11:14] So that is sort of the rough idea. But [01:11:15] if you see my PDF, I've actually worked [01:11:17] out the example, and that will [01:11:19] demonstrate what I'm talking about. [01:11:23] By the way, this backprop [01:11:25] is just, sort of... [01:11:28] Like in calculus, we have something [01:11:29] called the chain rule. [01:11:31] To calculate the derivative of a [01:11:32] complicated function, you calculate the [01:11:34] derivative of, like, the outer [01:11:35] function, then the inner function, and so [01:11:37] on and so forth. Backprop is [01:11:39] essentially a way to organize the chain [01:11:40] rule to work with the neural network's [01:11:42] layer-by-layer architecture. That's all. [01:11:49] So is it fair to say that once we [01:11:51] are finding, like, the local minimum, we [01:11:54] are not optimizing over all the G of Ws, [01:11:56] because, like, this local minimum is [01:11:58] coming from different curves, from [01:11:59] different lines? So, [01:12:01] is that fair to say? [01:12:02] >> When we are using stochastic gradient descent, yes. So [01:12:04] in stochastic gradient descent, when you [01:12:06] take, say, 32 data points from a million [01:12:09] and you're calculating the loss for those [01:12:10] 32 data points, you're basically trying [01:12:12] to do a gradient step. [01:12:14] Right? The W equals W minus alpha [01:12:17] gradient thing. You're doing it for [01:12:20] that 32-point loss function. [01:12:22] Right? Which is not the 1-million-point [01:12:24] loss function. [01:12:25] That's why it's approximate. [01:12:27] But the approximation, instead of [01:12:29] hurting you, actually helps you, because [01:12:31] it helps you escape the local minima of [01:12:33] the global loss function.
[01:12:35] So it's sort of an interesting and [01:12:37] somewhat technically subtle point, which [01:12:38] is why I'm not getting into it too much, [01:12:40] but I'm happy to give pointers if people [01:12:41] are interested. Yeah? [01:12:44] Uh when you say you initialize the [01:12:45] weights, you initialize for the whole [01:12:47] network, or just the end layer and then [01:12:50] go backwards, like you [01:12:51] >> No, you initialize everything in one [01:12:52] shot. [01:12:53] Because if you don't initialize [01:12:54] everything in one shot, what's going to [01:12:55] happen is that you can't do, like, the [01:12:57] forward computation to find the [01:12:58] prediction. [01:13:00] Uh and so they are done independently, [01:13:02] and the initialization schemes will take [01:13:05] into account, okay, I'm initializing the [01:13:07] weights between a layer which has 10 [01:13:08] nodes on one side and 32 on the [01:13:10] other side, and the 10 and the 32 [01:13:12] actually play a role in how you [01:13:13] initialize. [01:13:15] Okay. So um so the summary of the [01:13:18] overall training flow [01:13:19] is that, you know, you have an input. [01:13:22] It goes through a bunch of layers. You [01:13:24] come up with a prediction. You compare [01:13:26] it to the true values, and these two [01:13:28] things go into the loss function [01:13:29] calculation. You get a loss number. [01:13:31] Right? And you do it for, say, 10 points [01:13:33] or 32 points or a million points. And [01:13:35] this loss thing goes into the optimizer, [01:13:38] which calculates the gradient. And once [01:13:39] it calculates the gradient, it updates [01:13:41] the weights of every layer using the W [01:13:44] equals W minus alpha times gradient [01:13:45] formula, the gradient descent formula. And [01:13:47] then you keep doing this again and [01:13:48] again and again. [01:13:50] This is the overall flow. [01:13:53] This is how our little network is going [01:13:54] to get built for heart disease [01:13:56] prediction. This is how GPT-4 was built. [01:14:00] And this is how AlphaFold was built. [01:14:02] And AlphaGo was built. [01:14:04] You get the idea. [01:14:07] I mean, it's astonishing, frankly. [01:14:09] If you're not getting goosebumps at the [01:14:10] thought that this simple thing can do [01:14:12] all these complicated things, we really [01:14:14] need to talk offline. [01:14:17] Uh there was a hand raised here. Yeah. [01:14:20] Sorry. Just quickly, this is for each [01:14:23] mini batch, right? So [01:14:25] my question is, if you came up with a [01:14:27] different weight for each mini batch, [01:14:28] how do you [01:14:30] add it up? [01:14:31] Like, okay, this weight is the [01:14:33] perfect combination for this mini batch, [01:14:35] but you have a different [01:14:37] weight for another mini batch. How do [01:14:39] you combine those two? >> No. [01:14:41] At each step, what you do is [01:14:43] you start with [01:14:45] a weight. [01:14:46] You run it through for a mini batch. You [01:14:48] come up with the loss. You [01:14:49] calculate the gradient. [01:14:50] And now, using the gradient, you've [01:14:51] updated the weight. Now you have a new [01:14:53] set of weights, right? Which is the [01:14:54] updated weights. Call it [01:14:55] W2 instead of W1. [01:14:57] Now W2 is your network, and when you [01:14:59] take the next mini batch, it's going to [01:15:00] use W2 to calculate the prediction.
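Since that overall flow maps almost line for line onto Keras, here is a minimal sketch, assuming a structured-data setup like the heart disease example with, say, 13 made-up feature columns; the random stand-in data, layer sizes, and epoch count are arbitrary illustrations, not the settings we'll use in the collab.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Made-up stand-ins for the real data: n patients x 13 feature columns,
# and a 0/1 label for "diagnosed with heart disease within a year".
X_train = np.random.rand(300, 13).astype("float32")
y_train = np.random.randint(0, 2, size=(300,)).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(13,)),
    layers.Dense(16, activation="relu"),    # hidden layer; weights set by Keras' default initialization scheme
    layers.Dense(1, activation="sigmoid"),  # output: probability of heart disease
])

# Optimizer = our flavor of SGD (Adam), loss = binary cross entropy.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# fit() runs the whole loop: forward pass -> loss -> gradients via backprop
# -> weight update, one mini batch of 32 at a time, again and again.
model.fit(X_train, y_train, batch_size=32, epochs=10)
```

Everything in this lecture, the loss, the gradient, backprop, SGD and Adam, the initialization, is hiding inside compile() and fit().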
[01:15:03] And this whole flow will become a [01:15:05] lot clearer when we do the collabs. [01:15:08] Okay. So we have 3 minutes. [01:15:11] I don't want to go into [01:15:13] regularization and overfitting in 3 minutes. [01:15:15] So let's have some more questions. [01:15:19] Yeah. [01:15:20] Can you use any activation function as [01:15:22] long as it gives, like, positive values? [01:15:25] Like X squared or mod X or [01:15:26] something. >> Um you can use a variety of [01:15:29] activation functions. [01:15:31] Um [01:15:33] There's a whole [01:15:35] literature on, you know, the pros and [01:15:37] cons of various activation functions [01:15:38] that you could use. [01:15:39] But in general, you have to make sure of [01:15:42] a couple of things. One is that when you [01:15:44] do backprop, [01:15:46] the gradient is going to flow through [01:15:48] the activation function in the reverse [01:15:49] direction. [01:15:50] And the activation function should [01:15:52] actually sort of make sure the gradient [01:15:53] doesn't get squished. [01:15:55] It shouldn't get squished. It shouldn't [01:15:56] get exploded. [01:15:58] So those are some considerations, and [01:16:00] these are technical considerations, but [01:16:01] all those considerations have to [01:16:02] be taken into account. If you can take [01:16:04] those into account, then you're okay. [01:16:07] That's sort of the key thing to keep in [01:16:08] mind. [01:16:08] And that's in fact why the ReLU is [01:16:10] actually very popular, [01:16:11] because as long as the value is [01:16:13] positive, the gradient of the ReLU is [01:16:15] just one. Right? [01:16:18] Uh because [01:16:22] So if you look at something [01:16:24] Oops. [01:16:28] Was it frozen? [01:16:30] I jinxed it. [01:16:31] So sorry, livestream. [01:16:34] If you have something like this, [01:16:37] the ReLU is like that, right? [01:16:39] So the gradient here [01:16:41] is always going to be one. [01:16:43] Which means that as long as the value is [01:16:44] positive, whatever gradient comes in [01:16:46] like this, it just, like, gets multiplied [01:16:47] by one and gets pushed out the other [01:16:49] side. So it doesn't get [01:16:50] harmed or squished or anything like [01:16:52] that. Um so that's one reason why the [01:16:55] ReLU is very popular: because it [01:16:57] preserves the gradient while injecting [01:16:59] almost, like, the minimum amount of [01:17:00] non-linearity to do interesting things. [01:17:04] Um yeah. [01:17:07] If you have a high number of dimensions, [01:17:10] can you do mini batching on, like, [01:17:13] the feature dimensions instead of just [01:17:14] observations, and keep the same number of [01:17:17] observations but just take a small [01:17:19] sample of the number of features that [01:17:21] you're actually using? >> Oh, I see. I see. [01:17:24] So you're saying, let's say you have 10 [01:17:25] features. [01:17:27] Um instead of taking all data points with [01:17:28] 10 features, what if you choose [01:17:31] five features and just use them and do [01:17:33] the thing? [01:17:34] As long as you can actually compute the [01:17:36] prediction. [01:17:38] To compute the prediction, you may need [01:17:39] all 10 features. [01:17:41] Right? Or you need to have some defaults [01:17:43] for those features. [01:17:44] And if you define defaults for those [01:17:46] other five features, you're basically [01:17:48] using all the features. [01:17:50] So that's the key thing.
Can you [01:17:51] actually calculate the prediction [01:17:53] by manipulating? And typically, you [01:17:55] can't. [01:17:57] All right? [01:17:58] Okay, folks. 9:55. I'm done. Have a [01:18:00] great rest of your week. I'll see you on [01:18:02] Monday.