[00:16] Okay, all right, let's get going. Today is going to be packed. I'm going to spend roughly the first half of the lecture actually building a model, a Keras model, in Colab to solve the heart disease problem we saw earlier, and then switch gears halfway through and talk about how to solve image classification. So we're going to do two Colabs today. I've been talking about Colab, teasing you about it; we'll actually do Colabs today. By the way, I've shut off the lights at the top because when I switch to Colab it's going to be much easier for you folks, particularly the folks in the back, to see it. But I hope you can see the slide right now. Yes. [01:00] Okay, great. So this is just a quick recap of what we did last class. Broadly speaking, training a neural network is essentially no different from training other kinds of models. We have a bunch of parameters, i.e. weights and biases, and we need to use the data to find good values of those weights. And what does "good" mean?
[01:19] Typically it means that we define some measure of discrepancy between what the model predicts for a given set of weights and what the right answer, the ground truth answer, is, and then we try to find weights that minimize this discrepancy. That's it. And this notion of a discrepancy is called a loss function. So, broadly speaking, the overall training flow is: you define some network; it has an input that goes through a bunch of layers, and you come up with some predictions. You take the predictions and the true values, and those two go into the loss function, i.e. the discrepancy function, which gives you the loss score. You send that to the optimizer, which calculates the gradient of the loss function with respect to all the parameters, updates all the weights using that gradient, and then the process repeats. That's it. That is the training flow. Okay, quick recap. [02:04] Now, we also talked about the optimization algorithm we're going to use, which is called gradient descent. In gradient descent, as you noticed, in each iteration every data point is used to make predictions, and therefore to calculate the loss and then the gradient. And then we pointed out that gradient descent is actually not as good as something called stochastic gradient descent, where, instead of taking all the points, we randomly choose a small number of points, pretend for a moment those are the only points we have, make predictions, calculate the loss, calculate the gradient, and go on.
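To make that flow concrete, here is a toy sketch of the predict → loss → gradient → update loop, using a one-parameter model y = w·x with mean squared error. This is just my own minimal illustration, not the Colab code; Keras and TensorFlow will do all of this for us.

```python
# Toy illustration of the training flow: predictions and true values
# go into a loss; the optimizer uses the gradient of that loss to update w.
def predict(w, x):
    return w * x

def loss(w, xs, ys):
    # mean squared discrepancy between predictions and ground truth
    return sum((predict(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grad(w, xs, ys):
    # d(loss)/dw, worked out by hand for this tiny model
    return sum(2 * (predict(w, x) - y) * x for x, y in zip(xs, ys)) / len(xs)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # ground truth here: y = 2x
w, lr = 0.0, 0.05                            # initial weight, learning rate
for _ in range(100):                         # the repeating training loop
    w -= lr * grad(w, xs, ys)                # optimizer step: move w downhill
# w ends up very close to 2.0
```

Every piece of the lecture's diagram appears here: a forward pass, a loss score, a gradient, and a weight update, repeated.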
[02:47] So that was the basic idea behind stochastic gradient descent. Two different kinds of things. Now, what this means is that when we actually start training the model, as we will in a few minutes, because we only take a few points at a time, we have to be a bit careful about what's going on, and I want to make sure you clearly understand the differences before we actually get to the Colab. [03:10] All right. So there is the notion of an epoch. An epoch essentially just means that we make one pass through the training data — all the training data, one pass through it. And what is one pass? If you have something like gradient descent, one pass means every data point is sent through the network: we calculate its predictions, calculate the loss, calculate the gradient. We run every training sample through and calculate the gradient, which is just this term here — I will sometimes say "d of loss," the derivative of the loss with respect to w, and sometimes I might use the nabla symbol; these are all interchangeable. So we calculate the gradient and then update using some version of this rule. But we do it just once, at the end of the epoch: if you have 10 billion data points, every one of them flows through, you get 10 billion outputs, and only at the end do we calculate the gradient and update. One update per epoch. Yes.
[04:15] Now, in stochastic gradient descent, what we do is process the data in batches — small numbers of points at a time. Technically speaking these are called mini-batches, but I don't know about you, I just get tired of saying "mini-batches," so I'm just going to say "batches" from this point on, and in fact that is widely done in the literature. So we take the training data and divide it up into batches: batch one, batch two, all the way to the final batch. And for each batch we basically do gradient descent: we take batch one, run just the training samples in that batch through the network to get predictions, calculate the gradient, update the parameters, then go to batch two, then batch three, and so on and so forth. So pictorially, this is how it's going to look: let's say the first batch is 32 points. We take those 32 points, run them through the network, get all the outputs, calculate the gradient, and update the weights. So when we now get to batch two, the weights have changed — they have been updated. Then we do the same thing for batch two, batch three, all the way to the end. And when we are done with all of that, this whole thing is called — what? An epoch. This whole thing is an epoch. Okay. [05:42] All right. Now, the question, of course, is: if you have a bunch of data points and you're going to run stochastic gradient descent on them in a particular epoch, how many batches are there going to be?
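Continuing the toy one-parameter model from before (again my own illustration, not the Colab code): one SGD epoch shuffles the data, walks through it batch by batch, and updates the weight once per batch rather than once per pass.

```python
import random

random.seed(0)  # fixed seed so the shuffle is reproducible

def batch_grad(w, batch):
    # gradient of mean squared error computed on just this batch
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(float(x), 2.0 * x) for x in range(1, 101)]  # 100 points, true w = 2
w, lr, batch_size = 0.0, 0.0001, 32

for epoch in range(20):
    random.shuffle(data)  # fresh random batches each epoch
    # 100 / 32 rounds up to 4 batches: three of size 32 and one of size 4,
    # so the weight gets updated 4 times per epoch, not once
    for i in range(0, len(data), batch_size):
        w -= lr * batch_grad(w, data[i:i + batch_size])
```

Each per-batch gradient is only an approximation of the full gradient, but as the lecture says, it still tends to move w in the right direction.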
[05:54] Now, Keras is going to calculate all this for you — you don't have to worry about it — but you need to understand exactly what happens. My philosophy, by the way, is that you have to know the details of what's going on. If you don't know the details, if you haven't figured them out at least once, you will not actually be able to think new and creative thoughts about a new problem, because the concepts are not yet manipulable in your head. [06:23] Please use the microphone.
>> So when we talk about SGD and we're only taking some part of it — are we saying we take only some variables, or only some part of the data?
>> We are taking some rows.
>> Okay. So those data points — that means a batch?
>> Exactly. For example, let's say you have a thousand data points — a thousand rows of observations, a thousand patients in the heart disease example, or a thousand images you're trying to classify. You take, say, 32 of those images, 32 of those patients, and that's a batch. Then you go to the next 32, then the next 32, and so on, until you run out of patients or run out of images.
>> And in each iteration you're updating with the new weights you've got, so you keep moving forward?
>> You're basically updating the weights as you go.
>> And what we call the epoch — is that ultimately the loss function equation we're trying to solve?
>> No, an epoch is different.
[07:24] See, the thing to remember is that this whole thing is called an epoch because we make one full pass through the training data. But within that epoch, we update the weights many times — as many times as we have batches. [07:44] All right. So, to count the batches: basically, you take the training set size and divide it by the batch size. You choose the batch size — we'll talk later about how — and once you've chosen it, just divide and round up. For example, as you will see in the Colab, the training set is going to be 194 patients, and we're going to choose a batch size of 32. We typically choose batch sizes like 32 or 64 because they align very well with the nature of the parallel hardware we're going to use. So divide 194 by 32 and you get 6 point something; round it up to seven. What that means is that the first six batches will have 32 samples each, and the final batch has only the two samples left over. And that's okay — it can be a nice little small batch at the end. There's nothing that says every batch has to be the same size. That's it: epochs and batches.
>> And for each batch, do you run through the whole network, all the layers — or is each layer one batch?
>> No, for a batch you run it through the entire network.
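The arithmetic for the Colab's numbers, written out as a quick check:

```python
import math

n_samples, batch_size = 194, 32  # training set size and chosen batch size

# number of batches = number of weight updates per epoch
n_batches = math.ceil(n_samples / batch_size)            # 6.06... rounds up to 7
last_batch = n_samples - (n_batches - 1) * batch_size    # whatever is left over

# 7 batches per epoch: six of 32 samples each plus a final little batch of 2
```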
[09:04] So the way I think about it is: you take a batch, and just momentarily you assume that's all the data you have, and you run it through the network. Because unless you run it through every layer of the network, you can't get a prediction; unless you get a prediction, you can't calculate the loss; unless you calculate the loss, you can't calculate the gradient; and unless you calculate the gradient, you can't update the weights.
>> Last thing — but if you're using all the data, just doing gradient descent, then you just go through the network once, right?
>> Exactly. In gradient descent, one epoch is one pass and one weight update. In stochastic gradient descent, the number of updates you make equals the number of batches you have, which ends up being the training set size divided by the batch size, rounded up.
>> Just to confirm: initially, when we introduced the concept of batches, the whole purpose was to not run through all the data and be able to make predictions from a subset. So now the advantage is that after batch one we are using more accurate coefficients to run through batch two, and so on. Is that really the advantage, or is there something else to it?
>> Perfectly said — that's exactly the advantage. We take a small amount of data and we say: we know this is not all the data, it's just a small subset, so the gradient is not going to be super accurate; it's going to be approximate. But that's okay — we'll still tend to move in the right direction.
[10:28] So instead of waiting for the whole thing to finish and then updating, we're just going to update as we go along. All right. Yes?
>> Building on her question: does doing this process with SGD give us a better solution, or does it require less compute power?
>> Both — and the reasons for both are in the previous lecture. I'm not repeating them just because I'm very pressed for time today. All right, cool. So that's what we have. Are we good? [11:01] Okay, so now we come to the last step before we actually fire up the Colab, which is overfitting and regularization. If you remember from your machine learning background: as your model gets more and more complex — you use a simple model, then a more complex model, and so on — what happens to the error on the training data? Say you have a simple regression model and get some error, and then a regression model with all kinds of interaction terms, logarithms, this and that, super complicated. What do you think happens to the error on the training data? Right — basically, it goes down as the model gets more complex. Now, of course, comes the punch line: what do you think happens to the error on data the model hasn't seen? I showed you the answer. Basically, what happens, at least conceptually, is that it gets better and better up to some point.
[11:59] Then it bottoms out and starts climbing again. We typically refer to the phenomenon where it starts to climb again as overfitting, because the model is essentially fitting to the idiosyncrasies of the training data as opposed to generalizing patterns. And on this side we call it underfitting, because there is still a lot of potential to improve. We really are hoping to find the sweet spot in the middle. That's the basic idea of overfitting and underfitting. To relate this to neural networks: as you've learned so far, you have to learn smart representations of the input data, and to do that, I have argued, you need lots of layers in your network — the more layers you have, the better things get. GPT-3, for example, has 96 layers, if I recall right. But more layers means more parameters, more parameters means more complexity in the model, and therefore more chance of overfitting. So it's really important in neural networks that we think about regularization. Regularization, you will recall from your machine learning background, is the way we handle the risk of overfitting and try to find models that fit just right. Several regularization methods have been developed over the years, and we are going to use just two of them. The first one is called early stopping.
[13:19] This has been famously referred to by Geoff Hinton — one of the pioneers, or, as he's more colorfully known, one of the godfathers of deep learning, who also won the Turing Award a few years ago — as sort of a "beautiful free lunch." The idea is very simple. We take the training data and split it into a training set and a validation set, and then we just keep doing gradient descent. The training error will hopefully keep getting better and better — lower and lower — and we keep track of what's going on in the validation set. At some point, if the validation error starts to flatten out and then climb, we just say: okay, that's when we stop training. What we're going to do in the Colab is actually run it through the whole thing, see where it flattens out, and then say, okay, that's where we should have stopped. But of course, you don't want to go all the way to the end and then go back and say, "Well, I want to stop at the 10th epoch" — there are ways to use Keras to be very efficient about this. The fundamental idea is: take the training data, split it into training and validation, and track what's going on in the validation set to see whether this kind of bottoming out happens. This is called early stopping — we're looking for that bottoming-out point. The other method is called dropout, and I'm going to come back to dropout in Wednesday's lecture, because that's the first time we're going to use it.
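In Keras this is handled by the built-in `keras.callbacks.EarlyStopping` callback, which monitors a metric such as `val_loss` with a `patience` argument. As a pure-Python sketch of the underlying logic — the function name and the exact "patience" rule here are my own illustration, not the Keras source:

```python
def best_stopping_epoch(val_losses, patience=3):
    """Return the epoch with the lowest validation loss once it has failed
    to improve for `patience` consecutive epochs; None means keep training."""
    best = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0  # new best: reset patience
        else:
            waited += 1
            if waited >= patience:  # validation loss has bottomed out and climbed
                return best_epoch
    return None

# Validation loss falls, bottoms out at epoch 2, then starts climbing:
history = [1.00, 0.80, 0.70, 0.75, 0.90, 0.95]
```

Running `best_stopping_epoch(history)` picks epoch 2 — exactly the "stop where the validation curve bottoms out" rule from the slide.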
[14:42] And so I'll come back to dropout and tell you exactly how it works — it's a very, very clever strategy — but we will not use it today; we'll use it on Wednesday. Okay. So, in summary, what do we do? We get the data ready. We design the network: number of hidden layers, number of neurons, and so on. We pick the right output layer and the right loss function. We choose an optimizer — as I mentioned earlier, SGD comes in lots of flavors, lots of variations on the theme. And empirically, much as we tend to use ReLU as the activation function for hidden-layer neurons, for optimization we tend to use a flavor of SGD called Adam as sort of the default, because it's really good. So we'll use Adam. As you'll see, we typically use either early stopping or dropout, and then you just fire it up and start training in Keras and TensorFlow. All right, so that is the training loop. Now I'm going to switch gears and give you a quick intro to TensorFlow and Keras, and then we'll actually fire up the Colab. So, first of all, what's a tensor?
>> Quick question on the previous thing: if you're looking at the validation set to avoid overfitting, aren't you actually overfitting anyway, because you're kind of using the validation set as a training set?
>> No, no — the validation set is never used to calculate any gradients. It's only used to calculate accuracy and loss. It's kept aside and used only for evaluation, not for training. That's what keeps you honest.
>> Right.
[16:23] >> And this will become clear when we actually go to the Colab. So, what's a tensor?
>> A tensor is the input data you're giving to the system. It could be in various formats: if it's an image, we call it a 4D tensor; if it's time-series data, it's 3D. And typically, if you just send numbers in, it becomes a vector, which holds the values of the variables associated with each observation.
>> You're kind of on the right track, but not entirely. It's actually a simpler concept than that.
>> It's like a matrix, but generalized to higher dimensions.
>> That's also correct, but incomplete, because a tensor can be simpler than a matrix. It's not "matrix or higher" — it can actually be simpler. In fact, take a single number: that's a tensor. The simplest case of a tensor is a number. The next case is a vector, which is a list. The next higher case is a table. These are all tensors. So tensors are basically a generalization of the notions of a number, a vector, and a table to higher dimensions. [17:56] Every tensor has something called a rank. A number is just a number — it doesn't have a dimensionality to it, so it has rank zero. A vector is a list of numbers — you can sort of write it down top to bottom — and it has one dimension, right?
[18:17] That one dimension is what we call the rank, so a vector is rank one. A table is 2D, two-dimensional, so it's rank two. And you can have rank three, which is just a bunch of tables — a bunch of tables is a rank-three tensor; we also think of it as a cube. These things are very useful because, obviously, we are all familiar with vectors, and, as you will see very shortly in this class, black-and-white, grayscale, images are usually represented using tables of numbers like this, and color images are represented using three tables. Now, can you think of what might be representable as a tensor of rank four — meaning every element of a rank-four tensor is a color picture? Just shout it out. Video — exactly. What is a video? A video is basically a stream of color images. Each element of that stream — the first dimension of the tensor — tells you which frame it is, and everything else is the actual frame. [19:31] So the way I always think about these tensors is as an array with a number of axes, or dimensions: this is the first one, this is the second one, this is the third one — with four axes, it's a tensor of rank four: one, two, three, four. And if you have a vector, you can imagine the vector living as just a list of numbers.
[20:10] But if it is a rank-two tensor, it looks like a table: the first axis runs one way and the second runs the other. So, for example, if the shape is 7 by 3, that means there are seven rows and three columns. So you get the idea. The way to think about a tensor is always: an open square bracket, a bunch of things, a close square bracket — that's really what a tensor object is. And what that means is that any time you have a tensor, however complicated it is, you can always create a more complicated tensor by taking a list of those tensors. Let's say you have a list of videos. Each video is a rank-four tensor, so a list of videos is what rank? Exactly — five. In general, a tensor of rank, say, 10 is just a list of rank-nine tensors. [21:15] And that is the most important thing you need to understand about tensors: at any point, if I give you a tensor, you can iterate through its first dimension, its first axis, and go through each of those values. So, for example, if you have this tensor here and you want to create a more complicated tensor, no problem: you add another dimension. Say this new dimension has nine values, one through nine. Put a zero in that position, and what do you get? A whole rank-four tensor. Put a one there — another rank-four tensor. A two — another rank-four tensor.
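All of this is easy to poke at in NumPy, whose arrays behave like the tensors Keras works with (NumPy calls the rank `ndim`; the shapes below are arbitrary examples of mine):

```python
import numpy as np

number = np.array(7.0)              # rank 0: a single number
vector = np.array([1.0, 2.0, 3.0])  # rank 1: a list of numbers
table  = np.ones((7, 3))            # rank 2: 7 rows, 3 columns
cube   = np.ones((5, 7, 3))         # rank 3: a stack of 5 tables
video  = np.zeros((9, 32, 32, 3))   # rank 4: 9 frames, each a 32x32 color image

# Indexing the first axis peels off one rank: each element of the
# rank-4 video is a rank-3 tensor (one color frame).
frame = video[0]
```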
[22:11] So for every tensor, the first axis is just a list — a list of tensors of the next rank down. [22:18] Now, this tensor concept is something Einstein famously worked with, and it's simultaneously kind of easy to understand and also slippery. So I would actually encourage you to read the book, which has a really good discussion of tensors; the more you practice with it, the easier it gets. If you feel you kind of understood it, but not quite, you're not alone — it happens to all of us. You have to pay the price, go through the crucible. Okay, all right. [22:48] So, to come back to this: that's what we have, and we already talked about a rank-four tensor — it's a video. Section 2.2 of the text has a lot more detail; you should definitely read it. [23:05] So, TensorFlow is a library. As you can imagine, in neural networks, tensors come in, go through the network, and go out the other end; and since tensors capture everything — numbers, lists, tables, and so on — it's just tensors flowing from input to output. Hence the name: TensorFlow. And it gives you a couple of things that are really, really important, which is why we use it. The first is that it will automatically calculate gradients for you, of arbitrarily complicated loss functions. You don't have to calculate the gradient yourself — which is very painful. It calculates the gradients automatically. That's the best part: you don't have to use the chain rule; you don't do anything. The second thing: it gives you all these optimizers, including SGD and all its variations.
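That first advantage, automatic gradients, is worth seeing once. With TensorFlow's `tf.GradientTape` (a real API; the particular toy function below is just my illustration), you write the expression and TensorFlow applies the chain rule for you:

```python
import tensorflow as tf

w = tf.Variable(3.0)              # a trainable parameter
with tf.GradientTape() as tape:
    loss = w ** 2 + 2.0 * w       # any differentiable expression you like
grad = tape.gradient(loss, w)     # d(loss)/dw = 2w + 2, computed for you
```

At w = 3 the gradient comes out to 8 — no hand-derived calculus, which is exactly what makes arbitrarily deep networks trainable in practice.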
[23:48] So you don't have to worry about the optimization itself; you can just pick and choose what you want. Third, if you have a lot of servers, it will take the computational load and distribute it across all those servers. People here with a CS background know that parallelizing computation is actually a very difficult problem. There are things that are called "embarrassingly parallel," but many things are actually quite tricky to parallelize — we don't always know how to figure it out. TensorFlow will figure it out. And then, finally, I talked about the fact that there are these things called GPUs, graphics processing units, which are parallel hardware. Even if you have just one computer, if it has a GPU, there's a particular way you have to organize your computation to really exploit that GPU, and TensorFlow will do it for you out of the box, automatically — you don't have to worry about any of that. So those are all the advantages. By the way, a TPU is a tensor processing unit — you can think of it as Google's GPU; they came up with their own variation on the theme. [24:47] Now, Keras sits on top of TensorFlow. This is the hardware you have; TensorFlow sits on top of the hardware; Keras sits on top of TensorFlow, and it basically gives you a whole bunch of convenience features. For example, it gives you the notion of a layer — we already saw keras.layers.Dense, a dense layer.
[25:11] It gives you the notion of activation functions, and so on. It gives you easy ways to preprocess the data, easy ways to train the model and report on metrics — validation loss, validation accuracy, training loss, all the metrics we care about. And then it also gives you a whole library of pre-trained models that you can just use and adapt for your particular problem. So it gives you a whole bunch of conveniences, and that's why it's very popular. By the way, many of you might also be familiar with PyTorch, which is a fantastic framework for deep learning as well. The reason we chose to go with TensorFlow for this course rather than PyTorch is that we wanted to make the course accessible to folks who don't have a ton of programming background coming into the class, and PyTorch is a bit more demanding from a CS perspective — it requires more knowledge of object-oriented programming. That's why we decided to go with TensorFlow and Keras: I think it's actually just as powerful in many ways, and it's a little easier to get going. [26:07] One other thing I'll mention: there are three ways you can use Keras — three kinds of APIs: sequential, functional, and subclassing. We'll almost exclusively use the functional API. In fact, the model we built for heart disease prediction uses the functional API, so read section 7.2.2 of the textbook to understand in detail how the API works. I find that in my own work, the functional API is basically all I need.
[26:32] I don't need to do anything more complicated than that. And as you will see when you work on the homeworks and on your project, it's sort of a beautifully designed Lego-block environment for doing these things, and you can create very complicated models very easily. There's a whole bunch of material on these websites, so check them out — lots of Colabs are available. [26:55] So now, going back to the neural model for heart disease prediction: this is what we came up with in the last class. We had an input layer, one dense layer with 16 ReLU neurons, and an output layer with a sigmoid — and boom, that was the model. So let's train this model. The training checklist: we've already designed the network — a hidden layer of 16 neurons and a sigmoid output. We need to use an appropriate loss function based on the type of output. What loss function should we use? What is the output here? It's a binary classification problem, so what should the loss function be? I kind of heard it somewhere — shout it out. Right: the output is a sigmoid, so the loss function is binary cross-entropy. Remember, if you're predicting an arbitrary number, you can use something like mean squared error. If you're predicting a probability, which has to be compared to a 0/1 output — which is what binary classification is all about — we use binary cross-entropy. So that's what we do here: binary cross-entropy. Then we'll go with Adam as the optimizer, and we'll use early stopping to make sure we don't overfit.
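Putting the whole checklist together in the functional API, here is a sketch along the lines of the Colab model. The number of input features (13 here) and the `patience` value are my own assumptions for illustration — they depend on the actual dataset columns and tuning choices in the Colab.

```python
import tensorflow as tf
from tensorflow import keras

n_features = 13  # assumed: however many predictor columns the dataset has

# Functional API: each layer is called on the output of the previous one
inputs = keras.Input(shape=(n_features,))
hidden = keras.layers.Dense(16, activation="relu")(inputs)       # 16 ReLU neurons
outputs = keras.layers.Dense(1, activation="sigmoid")(hidden)    # P(heart disease)
model = keras.Model(inputs=inputs, outputs=outputs)

# Binary classification -> binary cross-entropy loss, Adam optimizer
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Early stopping: watch validation loss and stop once it stops improving
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                           restore_best_weights=True)
# model.fit(X_train, y_train, validation_split=0.2, epochs=100,
#           batch_size=32, callbacks=[early_stop])
```

Note how the checklist maps line by line: network design, output layer, loss, optimizer, and regularization each appear exactly once.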
Okay, I [28:10] know, okay, I promise this is [28:12] literally the last slide before I go [28:13] to the Colab. I feel like one of those [28:16] used-car salesmen here: but wait, there is more. [28:19] So anyway, don't worry if you [28:23] don't understand every detail of what [28:24] I'm going to go through. I'm going to [28:26] link to the Colab as soon as the class [28:27] is over. But once you get your hands on [28:29] the Colab, make sure you actually go [28:31] through every line in the Colab. What I [28:33] typically do when I'm trying to learn [28:34] something new is not to cut and [28:36] paste. I won't [28:39] actually cut and paste the code and run [28:41] it myself; I will retype the code. If [28:44] you retype the code as opposed to [28:45] cutting and pasting, trust me, you'll [28:46] learn a lot more. Right? So I strongly [28:48] encourage you to do it that way. [28:52] For all the Colabs we're going [28:54] to publish in the class, the first [28:56] thing you should do is just [28:57] make your own copy of the notebook, [29:00] right? Copy to Drive. And then, if you're [29:02] using anything other than today's [29:04] Colab, right, anything involving [29:06] natural language processing or vision, [29:08] you probably should use a GPU. So just [29:10] go in here, choose the runtime [29:13] to be a GPU. And then you start your [29:15] notebook and you're done. And from the second [29:17] time onwards, you can just go directly [29:19] to this step; you don't have to do all [29:21] this stuff for that particular notebook. [29:23] And there are numerous tutorials, like [29:24] five-minute videos and so on, on how to [29:26] use Colab. Just do that. I'm not [29:27] going to spend time on it here. [29:30] All right. Okay. So, I just ran it [29:33] a few hours ago. I'm not going to run [29:35] every cell now, because it's going to [29:37] take some time.
It's going to get in the [29:38] way of the class time, but I'm going to [29:39] just, you know, go through it [29:40] slowly and explain what's going on. So, [29:43] this is just an introduction to the [29:45] data set. We already saw this [29:46] introduction last week. We [29:49] have 303 patients, heart [29:51] patients. We have a whole bunch of [29:54] variables here, age, demographics, and a [29:57] whole bunch of biomarker information. [29:59] And this is the target variable, okay? [30:02] Zero or one, heart disease, yes or no. [30:05] And so, by the way, just some technical [30:07] preliminaries here. Basically, [30:10] every time we load these things, we're [30:12] actually going to load these packages. [30:13] So you can see here, these are the two [30:15] key things we need to do. We import [30:16] TensorFlow first, and then from within [30:18] TensorFlow we import Keras. Okay, that's [30:21] what these two lines do here. Okay. And [30:23] then, folks who have done data [30:25] science and machine learning a bit [30:26] before, you'll know this: we will [30:28] also load the [30:30] three packages that are most [30:32] commonly used, which are NumPy, [30:34] pandas, and matplotlib. NumPy [30:37] because it's very easy for manipulating [30:39] matrices and arrays and tensors,
pandas [30:42] because oftentimes you get some [30:44] data in from somewhere and you need to [30:46] massage it and wrangle it to a point [30:48] where we can actually feed it into Keras, [30:49] so you need pandas for that; and matplotlib [30:51] because you just want to plot, you [30:53] know, these loss curves and accuracy [30:55] curves to see whether early stopping is [30:57] needed. Okay, so that's why we use it. [31:00] So we import all these things, and then I [31:02] guess the other thing you have to [31:03] remember is that when we are training [31:04] these deep learning models, there is [31:06] randomness in the process, which enters [31:08] in a few different places. So clearly, the [31:11] starting values for these weights [31:13] are going to be randomly [31:15] initialized, and [31:17] that's obviously a source of randomness. [31:19] Now, we talked about how, [31:22] when you're doing stochastic gradient [31:23] descent, you take all the data and then [31:25] you randomly choose batches from [31:28] this data till we finish a whole pass [31:29] through it. Well, that immediately raises [31:32] the question: what do you mean [31:33] by randomly choose? So typically, what we [31:35] do in practice, and Keras will take [31:37] care of all this for you, is the following:
you [31:39] basically take the data and just shuffle [31:40] it once randomly, and then you just go: [31:42] first 32, next 32, next 32, next 32, like [31:45] that. Okay, but it is a source of [31:47] randomness. And then, when we split the [31:49] data into train, validation, testing, and [31:51] so on, particularly if you want to [31:53] look for early stopping and overfitting, [31:55] we need to again split the data [31:56] randomly, and that's another source of [31:58] randomness. And then, when we do dropout, [32:01] which we'll talk about on Wednesday, [32:02] again, dropout has a little bit of a [32:05] random element to it, and so that's [32:06] another source of randomness. So [32:09] all this means is that if [32:11] you're working with these models, and if [32:13] you want to build a model and you want [32:14] to hand it off to someone so that they [32:16] can reproduce your results, well, you [32:17] better make sure that you, you [32:19] know, make it easy for them to replicate [32:21] what you have, and the way you do it is [32:22] by setting a random seed for [32:24] all these things. Okay, and the way you do [32:26] it is by having this little handy [32:28] function here, set_random_seed. And of [32:31] course, you know, I use 42, just [32:32] like everybody should, right? So okay, so [32:35] that's that. By the way, that's [32:38] just a pop-culture reference to this book [32:39] called The Hitchhiker's Guide to the [32:40] Galaxy. [32:43] Look up the number 42 and you'll know what I mean. [32:45] Okay, so by the way, the question [32:47] inevitably comes at this point: okay, if [32:49] we do exactly this, will you actually [32:51] get the exact same numbers that you have [32:52] in your version of the notebook? And [32:55] the answer is: hopefully, most of the [32:57] time, but it's not guaranteed. This [32:59] is called bitwise reproducibility.
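The seeding step mentioned above can be sketched as follows (assuming the `keras.utils.set_random_seed` helper, which seeds the Python, NumPy, and TensorFlow generators in one call; the `tf.random.uniform` draws are just a stand-in for random weight initialization):

```python
import tensorflow as tf
from tensorflow import keras

keras.utils.set_random_seed(42)   # seeds Python, NumPy, and TF RNGs together
a = tf.random.uniform((3,))       # stand-in for randomly initialized values

keras.utils.set_random_seed(42)   # re-seeding reproduces the same draws
b = tf.random.uniform((3,))
```

After the second seeding, `b` holds the same values as `a`, which is the point: anyone re-running the notebook from the same seed walks through the same random choices.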
Bitwise reproducibility [33:01] is not guaranteed, due to certain hardware [33:03] things and device drivers and stuff like [33:05] that, so we won't get into all that [33:07] stuff, which is why, as you see [33:09] here, I have a bit of a fingers-[33:11]crossed thing. Okay. All right. Cool. So [33:14] that's what we have. As it turns [33:16] out, François Chollet, who wrote the [33:18] textbook, actually made [33:20] this data available in a pandas data [33:21] frame. So we read the CSV file into this [33:24] data frame right there. And it's [33:26] 303 rows, 14 columns, right? [33:30] And you can see here, we'll take a look [33:32] at the first few rows. And these are [33:34] all the columns: age, gender, cholesterol, [33:36] and so on. And then this [33:38] is the target variable, right there. [33:41] And one of the first things I always [33:42] do when I'm working with a binary [33:44] classification problem is to quickly [33:45] check whether the positive and negative [33:47] classes are balanced or not. And so what [33:49] you can do is you can just quickly check [33:51] to see what percent of the data points [33:52] is zero versus one. And you can see here, [33:55] 72.6% [33:57] of the patients don't have heart [33:59] disease. That's a good thing, of course. [34:00] And then 27.4% have heart disease. So [34:03] it's not 50/50 or roughly [34:05] 50/50; it's a little imbalanced. So, by the [34:08] way, quick question: what is a good [34:11] baseline model for this problem? Suppose [34:13] you couldn't use anything [34:14] complicated. What's a good [34:15] baseline model? [34:22] >> Yes. Just predict zero. [34:24] >> Yeah. And why would you do that? [34:25] >> Uh, it would give you a 72.6% accuracy. [34:28] Exactly.
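The class-balance check described above can be sketched with `value_counts` (using a hypothetical stand-in target column with the dataset's 220/83 split, since the real data frame isn't reproduced here):

```python
import pandas as pd

# Hypothetical stand-in for the dataset's 0/1 target column:
# 220 patients without heart disease, 83 with, 303 rows total.
df = pd.DataFrame({"target": [0] * 220 + [1] * 83})

# Fraction of each class rather than raw counts.
balance = df["target"].value_counts(normalize=True)
```

`balance` then shows roughly 0.726 for class 0 and 0.274 for class 1, matching the percentages quoted in the lecture.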
Because 72.6% [34:31] is the share of the majority class. If you just predict that class, [34:35] you'll be right on 72.6% of the [34:37] cases and wrong on the rest, which [34:38] means that the accuracy of this model [34:41] is going to be 72.6%. [34:43] Okay. And so any fancy model we build [34:46] had better, you know, do better [34:48] than this, otherwise it's not worth its [34:49] weight in layers. So, all right, [34:51] we'll come back to this later. So the [34:53] first thing we want to do is [34:54] pre-process it, because this data set has [34:56] both categorical variables and numeric [34:58] variables. And so it's usually [35:01] convenient to group them into [35:03] two different groups. So I have listed [35:05] all the categorical variables here and [35:06] the numeric here. And then we have [35:09] the pre-processing here. We have to take [35:11] the categorical variables and [35:12] one-hot encode them. And the reason is [35:15] that, unlike, say, a decision tree model, a [35:17] neural network cannot handle [35:20] categorical inputs directly; it can only [35:22] handle numeric inputs. Which means that [35:24] we have to numericalize every [35:26] categorical thing that comes in. [35:28] There are many ways to do it, but the [35:29] standard way to do it is one-hot [35:31] encoding. And for the numeric [35:33] variables, we need to normalize them, and [35:35] I'll come to that in a second. So pandas [35:37] has this get_dummies function here, and [35:40] you can just run this thing and it'll [35:41] one-hot encode the whole thing. So once [35:44] you do that, this is what you have.
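A toy sketch of the `get_dummies` step (using a made-up four-row frame with one categorical column, like `thal` in the lecture):

```python
import pandas as pd

# Toy frame with one categorical column, mimicking 'thal' in the dataset.
df = pd.DataFrame({"thal": ["fixed", "normal", "reversible", "normal"]})

# One-hot encode: one new 0/1 column per category.
encoded = pd.get_dummies(df, columns=["thal"])
```

The single `thal` column becomes three columns, `thal_fixed`, `thal_normal`, and `thal_reversible`, exactly the expansion the lecture walks through next.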
So [35:45] you can see here, previously, [35:49] thal had three values: fixed, normal, [35:52] reversible. And then you go [35:54] to the one-hot encoded version, and [35:56] now we can see here thal_fixed, thal_normal, thal_[36:00]reversible. That's three columns, right? [36:02] That's the one-hot encoding in action. [36:04] Okay, now the other thing to remember is [36:07] that neural networks work best when the [36:09] numeric inputs you send them are all in [36:12] a relatively small range; they shouldn't [36:13] have a wide range of variation. [36:15] And so the standard practice is to [36:18] standardize the numerical variables. By [36:20] standardize, I mean typically subtract [36:22] the mean, divide by the standard [36:23] deviation. We should do that. But [36:26] before we do so, we should split the [36:27] data into a training set and a test set, [36:30] right? And why do we want to split off [36:32] a test set? Because at the very end, once [36:33] we've built the model and done all the [36:35] things we want to do with it, we finally [36:36] want to take out the test set and [36:38] evaluate it once, so that we get a [36:41] true measure of how it's going to [36:43] perform in the wild after you deploy it. [36:46] Okay. So you want to divide it, [36:48] say, 80% training and 20% test set. So [36:51] the question is: why should we do the [36:53] splitting now, before we do the [36:54] normalization? Why can't we just do the [36:57] normalization and then do the splitting? [37:02] All right. [37:06] >> Because then your validation set is [37:09] also somewhat dependent on your test set, [37:11] as well as the mean of the test [37:13] set. [37:13] >> Correct. Because the test set has now [37:16] essentially been influenced [37:18] by the training set. Right?
The splitting and the [37:27] standardization [37:28] are part of the modeling process, and if the standardization, which is part [37:30] of the process, uses information about [37:32] the test set, well, then the test set isn't [37:34] really kept away from anything, is it? [37:37] That's why we want to split first: lock away [37:39] the test set somewhere and then proceed [37:41] with the modeling. Again, this is [37:43] like machine learning 101, which is why [37:44] I'm going through it pretty fast. Okay, [37:47] so we do this sampling function: [37:50] take 20% of the data and make it the [37:53] test set, and the remaining is going to [37:55] be the training set. And when we do [37:56] that, you can see the training set is [37:58] now 242 [38:00] rows, while the test is 61 rows. [38:05] And for any of these data frames, you'll [38:07] know that the shape attribute gives [38:08] you the dimensions, the number of rows [38:10] and columns. That's what we're doing [38:12] here. And now that we have [38:14] done the split, we can calculate [38:15] the mean and the standard [38:16] deviation. So I calculate the mean here, [38:18] I calculate the standard deviation, and [38:20] these are all the means. And once I do [38:21] that, I just do, you know, each column [38:24] minus the mean, divided by the standard [38:26] deviation. And once I do that, [38:28] I save them in the train and the test [38:30] data frames. And you can see here, now [38:32] all the numbers are all [38:33] smallish, around -1 to 1, [38:36] and that's ideal when [38:38] you're training a network. Okay. All [38:40] right.
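The split-then-standardize sequence above can be sketched like this (using one hypothetical numeric column in place of the real data frame; the 242/61 split sizes match the lecture):

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column standing in for, e.g., cholesterol; 303 rows.
rng = np.random.default_rng(42)
df = pd.DataFrame({"chol": rng.normal(240.0, 50.0, size=303)})

# Split FIRST: 20% test, 80% train (61 / 242 rows).
test_df = df.sample(frac=0.2, random_state=42)
train_df = df.drop(test_df.index)

# Compute the statistics on the TRAINING data only...
mean, std = train_df.mean(), train_df.std()

# ...then apply those same training statistics to both splits.
train_df = (train_df - mean) / std
test_df = (test_df - mean) / std
```

The key design choice is that `mean` and `std` come from the training rows alone, so no information about the locked-away test set leaks into the pre-processing.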
So at this point the data [38:42] is entirely numeric, and we are [38:44] almost ready to feed it into Keras. And [38:46] the way you do it is: you [38:48] take a pandas data [38:51] frame and you convert it into a [38:52] NumPy array, and then Keras is happy [38:54] to receive it. So [38:56] we use this method called to_numpy, which [39:00] I think is as descriptive as it gets in [39:01] programming. And then you save them as [39:04] train and test. Now train and test are [39:05] two NumPy arrays with exactly the same [39:08] information, and now we can feed them into [39:09] Keras. All right. Now, I guess there's one [39:12] other thing we need to do, which is that [39:13] in these arrays, train and test, our [39:17] independent variables, all the features, [39:18] as well as the target, the 0/1 target, [39:20] are all together, [39:23] right? And we need to now take [39:25] the dependent variable, the [39:27] 0/1 column, and split it out and keep the [39:29] X and the Y separately. Right? That's [39:32] the whole point, right? Because [39:33] you need to feed in the X, do the [39:34] prediction, and then compare it to the [39:36] actual Y and calculate the loss, and so [39:38] on and so forth. So the target [39:41] column is our Y variable, and it's [39:43] column number six from the left; if you [39:45] count it, you can see it. So we just, [39:47] you know, delete it from [39:49] the train and test. And now we have [39:53] 242 rows and 29 columns, 29 features. [39:56] You will recall, from the network that we [39:58] made way back, it had 29 inputs, right? [40:01] 29 nodes in the input layer. And that's [40:03] where the 29 is coming from. And so now [40:06] we just select the sixth column, which [40:07] is the target, and make it the Y variable, [40:09] right, train_y and test_y.
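The to_numpy conversion and the X/Y split above can be sketched on a toy frame (three made-up columns; in the real notebook the target sits at column index 6 of 30, and the column is located by name here just for readability):

```python
import numpy as np
import pandas as pd

# Toy standardized frame; 'target' plays the role of the 0/1 label column.
train_df = pd.DataFrame({
    "age":    [0.1, -0.3, 0.7],
    "target": [1,    0,    1],
    "chol":   [0.5,  0.2, -0.4],
})

train = train_df.to_numpy()               # Keras wants NumPy arrays
col = train_df.columns.get_loc("target")  # locate the label column
train_y = train[:, col]                   # the 0/1 targets
train_x = np.delete(train, col, axis=1)   # everything else is features
```

After this, `train_x` holds only features and `train_y` only labels, which is exactly the shape `model.fit` expects.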
train_y is of [40:12] course a vector, which is 242 long in the [40:14] training set and 61 long in the test set. [40:16] So at this point, all we have done is, to [40:19] be honest, boring pre-processing. Okay, [40:21] we haven't actually gotten to the action [40:22] yet. Finally, let's do something. [40:26] We start with a single hidden layer. [40:29] Since it's a binary classification [40:30] problem, we'll use a sigmoid, as we saw [40:31] earlier. And this is the model we [40:34] created in the last class. [40:36] The only [40:39] difference between that model and this [40:41] model is that I've actually given names [40:43] to these layers. And this name thing is [40:45] totally optional, right? If you want to [40:47] give a name, give a name. It's just a [40:48] little easier to interpret later on. [40:50] Okay? It's just cosmetic. Okay? So [40:53] I've just put it here. And once [40:55] you build the model, you should [40:57] immediately run the model.summary() [40:59] command, because it gives you a nice [41:01] overview of the model, right? For [41:04] each layer, it tells you what the layer [41:05] is, it tells you what's coming into the [41:07] layer, meaning the shape of the tensor [41:09] that's coming in, and what's going out, [41:11] and how many parameters the layer has. [41:13] And it turns out [41:16] this network has 497 parameters. [41:20] And I have told you repeatedly: the first [41:22] few times, just hand-calculate the [41:24] number of parameters to make sure it [41:25] verifies. So we should just make sure [41:27] that it is in fact 497. So let's hand-[41:30]calculate it. It's [41:32] basically what's going on here: 29 [41:34] inputs times 16, right? All the arrows, 29 [41:37] × 16 arrows, right? And then you have a [41:40] bias of another 16. That's why you have [41:42] this expression.
And then the next one [41:43] is 16 × 1, plus one bias for the output [41:46] sigmoid, and you get to 497. Okay? Just [41:49] make sure you follow this later on when [41:50] you work with the Colab. We did this [41:53] in class last week, and you can visualize [41:55] the network graphically as well by using [41:56] the plot_model function. So we do that [41:59] here. And it gives you the [42:02] same information, but in a slightly [42:03] easier form to consume, and when we work [42:06] with larger networks, starting on [42:07] Wednesday, you will see that being able [42:09] to visualize the topology of the network [42:11] is actually quite handy. Okay, we [42:13] finally come to actually trying to [42:16] train this thing. And so, what loss [42:18] function should we use? We need to [42:20] use binary cross-entropy, right [42:23] there. What optimizer to use? Well, as I [42:26] mentioned earlier, we'll use Adam. [42:32] All right. And then the [42:35] final thing is, you can ask Keras [42:37] to report out whatever metrics you care [42:39] about. These metrics are not going to be [42:41] used in any optimization; [42:42] it's just reporting to you. And the most [42:45] common thing people report out for [42:46] binary classification is accuracy. So [42:49] we'll just go with that metric. [42:51] And so what we do is we tell Keras: take [42:54] the model we just built and compile it [42:56] with this choice of optimizer, this [42:58] choice of loss function, and these [43:00] metrics. And this compilation step, what [43:02] it does is, Keras will [43:04] take this information, take the model [43:06] you have built, and reorganize the [43:08] model in such a way that it supports parallel [43:11] computing and distribution of computation [43:13] across many servers and so on. [43:16] That's what's happening in the compile [43:17] step.
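The parameter hand-check from a moment ago, written out as plain arithmetic (each dense layer contributes inputs × neurons weights plus one bias per neuron):

```python
# Hand-check of the parameter count reported by model.summary().
hidden_params = 29 * 16 + 16   # 29 inputs into 16 neurons, plus 16 biases = 480
output_params = 16 * 1 + 1     # 16 weights into the sigmoid, plus its bias = 17
total_params = hidden_params + output_params  # 497
```

Doing this check by hand the first few times, as the lecture suggests, is a cheap way to confirm you understand where every parameter in the summary comes from.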
Reorganizing [43:20] the model so that it becomes amenable [43:21] to parallelization and distribution: [43:23] that's what's going on, and that's why you [43:25] actually have to do something called the [43:26] compile step. Okay. And once we do that, [43:28] we are finally ready to train [43:30] the model. And to do that, we have to [43:34] decide what batch size we're [43:36] going to use. Remember, we're using some [43:37] flavor of SGD, which means we have to [43:38] choose the batch size. And [43:40] typically, 32 [43:43] is a good default for the batch size. [43:45] If you're just [43:46] getting started with something, just use [43:47] 32. And there's a whole bunch of [43:49] literature on what the right batch size [43:51] should be for the number of data points [43:53] you have, the size of the network, and so [43:55] on and so forth. My philosophy is: start [43:56] with 32. And you can always try 32, [43:59] 64, 128. It's kind of like, you know, [44:02] oftentimes what [44:04] researchers tell me is: just use the [44:05] biggest batch size that doesn't make [44:07] your machine die. [44:09] Right? If it can fit into memory, it's [44:11] probably good. Just try the biggest [44:12] size. We'll just start with 32; it's [44:13] just a tiny problem, it's not a big [44:15] deal. And then we also have to decide [44:16] how many epochs through the data we [44:19] want to go, right? How many [44:21] epochs? And, you know, usually 20 to [44:24] 30 epochs is a good starting point. [44:26] And then, because this is a tiny problem, [44:28] just for kicks, I decided to run it for [44:29] 300 epochs, just to see if [44:31] any overfitting is going to happen. And [44:33] then, whether we want to use a [44:34] validation set. Of course we want to [44:36] use a validation set, right?
So we [44:38] will use 20% of the data points as a [44:40] validation set, so that we can look for [44:42] overfitting and underfitting. [44:44] All right. So, with these decisions made, [44:46] we use the model.fit [44:49] command. model.fit is what actually [44:51] trains the neural network. Okay. And you [44:55] have to tell it what the x [44:58] tensor is. You have to tell it what the [45:00] dependent variable y tensor is. You need [45:03] to tell it how many epochs to do, [45:05] what batch size to use. verbose=1 [45:07] just means, you know, print a [45:09] lot of descriptive output as you do this [45:11] thing, and then validation_split means, [45:13] you know, take 20% of the training data [45:16] and set it aside as your validation data [45:18] set; don't use it for training, because I [45:20] want to measure overfitting using that. [45:22] So that's it. So you do that thing, [45:24] it'll run for 300 epochs, and this is the [45:26] reason why, you know, I decided to just [45:28] not actually run it in class. And so [45:31] it keeps going, it gives you a lot of [45:33] output, and finally [45:36] we reach the end. [45:41] Okay. Now let's take a moment to [45:43] understand what's being reported. So [45:44] I'll just take this one line here. [45:46] There is a pair of lines for each epoch.
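Putting the compile and fit steps together on toy random data (a sketch only; the real notebook uses the 242-row heart-disease arrays, 300 epochs, and verbose=1, while here everything is shrunk so it runs in a moment):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Toy stand-ins for train_x / train_y, just to exercise the API.
rng = np.random.default_rng(42)
train_x = rng.normal(size=(40, 29)).astype("float32")
train_y = rng.integers(0, 2, size=40).astype("float32")

inputs = keras.Input(shape=(29,))
h = layers.Dense(16, activation="relu")(inputs)
outputs = layers.Dense(1, activation="sigmoid")(h)
model = keras.Model(inputs, outputs)

# Compile: optimizer + loss + reported-only metrics.
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Fit: 20% of the training rows are held out as the validation set.
history = model.fit(train_x, train_y, epochs=2, batch_size=32,
                    validation_split=0.2, verbose=0)
```

The returned `history` object is the same one the lecture plots from shortly: its `history.history` dictionary collects loss and accuracy per epoch for both the training and validation portions.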
And [45:51] then here it's telling you, you know, [45:53] that in this 300th [45:56] epoch it used seven [46:01] batches, seven out of seven batches. [46:02] And you will recall, from [46:05] the math we did in class, that it's [46:06] actually seven batches, where the first [46:08] six batches are 32 and the last batch is [46:10] just a couple of examples. But we have [46:12] seven batches, right? This is 193 divided by [46:15] 32, rounded up. Okay, so that's why we have [46:19] seven here. And then it tells you how [46:20] long it took, and then [46:22] this is the loss value, the [46:24] binary cross-entropy loss value on the [46:26] training set, on that particular [46:29] batch, that it calculated. This [46:32] is the accuracy that you asked it to [46:33] report out, roughly 98.5% accuracy on [46:36] that batch. And then, at the end of [46:39] this epoch, using whatever weights were [46:42] available in the network, it actually [46:44] calculates the loss on the validation set, [46:46] which is the 20% of the data we have set [46:48] aside, and then this is the accuracy [46:50] on that validation set. Okay, so that's [46:53] what each of these numbers means. Now, [46:55] looking at this wall of numbers is kind [46:57] of painful, so usually you just plot it. [47:00] And the way you do that is, if you [47:02] notice here... okay, I'm not [47:04] going to go back. I said history [47:06] = model.fit(...), [47:08] and that history object has a lot [47:10] of information that we can use for [47:12] plotting and diagnostics and so on. And [47:14] that history object has [47:18] an attribute called [47:19] history.history, which is a dictionary [47:21] with all these values, and that's what [47:23] we're going to plot. Was there a [47:24] question here? Yeah.
[47:25] >> So you prompted it to keep 20% aside [47:28] for validation, but didn't we already [47:30] keep a test set? So that's going to be a [47:33] secondary validation, right? [47:34] >> So basically we have a training set, [47:37] then a validation set, and a test set. The role [47:40] of the validation set is to figure out [47:42] things like early stopping. Should we [47:43] stop here? Should we go back? And, as you [47:45] will see later on, if we use [47:46] hyperparameters, we'll try [47:48] different values of the hyperparameters [47:50] and use the validation set to [47:52] figure out which one is the best one. [47:53] But once we are done with all that, we [47:55] will finally have a model. At that [47:57] point, we open the safe, take out the [47:59] test set, and use it just once with your [48:02] final model. Not because you want [48:04] to improve the model, but because you [48:05] want to have a realistic idea of how it'll [48:07] do when you actually deploy it out in [48:08] the real world. [48:11] >> Yeah. [48:13] >> Instead of accuracy, [48:17] could we use other metrics to [48:20] evaluate, [48:21] like a confusion matrix, let's [48:23] say? [48:24] >> Yeah, you can do whatever you [48:25] want. Like I said, it's not [48:27] used for training, so there is no [48:29] mathematical implication of what you choose, [48:31] right? You can choose error rates, [48:32] accuracy, F1, F-beta, you can do whatever [48:35] you want, and Keras, as you will see, has [48:37] a dizzying list of possible metrics [48:39] you can use for reporting. The key thing [48:41] to remember is you're just reporting [48:43] these metrics; you're not actually using [48:44] them for any training. [48:47] Yeah. [48:49] >> My question is with respect to [48:50] validation. We've got a training [48:52] data set, so when we take out 20% [48:55] as the data for validation,
are we taking it out from the training set [49:00] once, at that level, or do we [49:02] go into each batch and take out 20%? [49:04] >> No, we're taking it out from the [49:05] training set. [49:06] >> So it means the number of [49:08] data points available [49:09] for the batches will [49:11] reduce. [49:12] >> Correct. And in [snorts] fact, once we [49:13] take out the validation set, [49:15] whatever remains is 193. [49:17] >> Okay. And then we divide that into [49:18] batches. And does the [49:21] validation data get re-drawn [49:23] each time? [49:25] >> No: you take out the [49:25] validation set at the very beginning, you [49:27] keep it aside, and then you only evaluate, [49:30] at the end of each epoch, what your loss [49:33] and accuracy are on that validation set. [49:36] >> So you don't have cross-validation. [49:37] >> No, no, we're not doing any of that stuff. [49:39] We're just taking it out once, and we're [49:40] just evaluating at the end of every epoch. [49:43] >> Okay. [49:46] Yeah. Okay. [49:53] >> So I know we both have asked similar [49:54] questions, but just to reconfirm. So here [49:56] my training model is giving me, say, a [49:59] loss of 0.0860, [50:01] and my validation is giving me 0.660. [50:04] That means I've already crossed the bottom of the U. [50:07] So when I actually have to test the [50:11] model, it's the model at that midpoint which I take, [50:13] and that is the model which will get [50:14] deployed in production? [50:16] >> Correct. And as to, okay, what do we do [50:19] to get that model? Do we actually have [50:20] to go back to the beginning and run [50:22] it for a few epochs, or can we do [50:24] something smarter than that? We'll get [50:25] to that. [50:26] >> Yeah. [50:27] >> Is the validation set different for each [50:30] epoch, or is it the same? [50:31] >> It's the same.
So what you do is you [50:33] have a training set. Before you do any [50:35] training, you take out 20% of it, keep [50:37] it aside. Whatever is left over, [50:39] you divide into mini-batches, [50:41] and then start running through each [50:43] epoch. And at the end of each epoch, you [50:45] just evaluate the quality of the [50:47] resulting model using the validation [50:49] set. [50:49] >> What's different between each epoch? Is [50:51] it just the way [50:52] the weights have changed? [50:53] >> Is it the division into [50:55] different batches? [50:56] >> No, the difference in each epoch [51:00] is that the weights have changed. [51:02] After every mini-batch, the weights [51:03] have changed. At the end of one epoch, [51:05] you've gone through all the data points [51:07] you ever had, right, in the training [51:09] set. And then you come back to the [51:10] beginning and you do it again. [51:17] >> How do you identify the sweet spot? [51:20] >> It's coming. [51:22] Yeah. All right. So, I'm going to keep [51:24] going. So, we have this here. And [51:27] there's a little bit of [51:28] matplotlib code. What we do is we [51:31] just plot the training loss and the [51:33] validation loss as a function of the [51:35] number of epochs. Okay? And as you can [51:37] see here, the training loss is these [51:39] points here, and it's steadily going [51:41] down, as you would expect. The validation [51:45] loss goes down here, and then at some [51:47] point it kind of flattens out, and then [51:49] maybe gently starts to rise. Okay. So do [51:53] you think there's overfitting? [51:55] >> Right. There seems to be some level of [51:57] overfitting here.
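The bit of matplotlib code mentioned above can be sketched like this (with hypothetical loss values standing in for the real `history.history` dictionary):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs in scripts
import matplotlib.pyplot as plt

# Hypothetical values standing in for history.history.
hist = {"loss":     [0.69, 0.55, 0.45, 0.40, 0.37],
        "val_loss": [0.70, 0.60, 0.52, 0.51, 0.53]}

epochs = range(1, len(hist["loss"]) + 1)
fig, ax = plt.subplots()
ax.plot(epochs, hist["loss"], label="training loss")
ax.plot(epochs, hist["val_loss"], label="validation loss")
ax.set_xlabel("epoch")
ax.set_ylabel("binary cross-entropy")
ax.legend()
```

The shape to look for is exactly what the lecture describes: training loss falls steadily while validation loss flattens and then creeps back up, the classic overfitting signature.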
But the thing you have [51:59] to always remember is that the binary [52:01] cross-entropy loss is a loss function [52:04] that is convenient for you because it [52:06] captures the thing you want to [52:08] capture, the discrepancy, but also because [52:10] it's mathematically convenient. But what [52:13] you may actually care about in practice [52:15] is something like accuracy, right? That's [52:18] why we're reporting out [52:19] the accuracy when we do these things. So [52:21] you should also plot the accuracy to see [52:23] what's going on, and really you should [52:25] look at the accuracy to figure out [52:26] overfitting and underfitting and all [52:28] that stuff. So let's just do that. [52:34] Okay. So this is how it looks [52:35] for accuracy. Accuracy, of course, as [52:37] you do more and [52:38] more epochs, hopefully gets better and [52:40] better on training. So you can see here, [52:42] training accuracy actually climbs all the way up [52:44] to the low 90s here. [52:47] The validation accuracy gets to [52:50] this point after, like, I don't know, 50 [52:52] epochs maybe, and then it kind of [52:54] flattens out, and then, strangely, it [52:56] climbs up again a bit later, right? So now, [53:00] the fact that the accuracy actually got [53:03] better at the very end suggests that [53:06] maybe we can live with this overfitting. [53:09] Okay? [53:10] Right? It's not the end of the world. [53:12] So you can certainly [53:14] go back and [53:16] say, you know what, no, I'm going to be a [53:17] purist about this: around 50 epochs or [53:20] so, I think, is when it actually [53:22] flattened out for loss. So you can just [53:24] go back and restart the model and [53:26] run it only for 50 epochs, not 300, and [53:29] then stop and just use that model for [53:30] everything from that point on.
[53:31] Or you can say: you know what, it's okay, I can live with this thing. And that's what we're going to do here. Let me just stop for a second; there was a question.
[53:40] >> Originally we were saying 20 to 30 epochs, but we're doing 300, and 50 is over 20 to 30. So when it comes to validation, if you run enough epochs...
[53:52] >> Oh, I see. That's a great question. The question is: I said start with 20 to 30 epochs as a rule of thumb, but here I'm going with 300, and because I'm going with 300, I can actually see some potential evidence of overfitting. If I had done only 20 to 30, maybe I wouldn't even have seen that. What happens next? So what you should do is look at these curves, and if at the end of 30 epochs you find that the validation loss continues to drop, then maybe there's more room for it to drop, so you continue from that point on. The thing about Keras is that you can run the fit command again at that point and it'll continue where it left off. It won't go back to the beginning.
[54:31] Right? So you run 10 epochs: okay, the validation is still getting better. Run another 10: getting better. Run another 10: getting better. Run another 10: oh, it starts to climb up again. Okay, now I'm going to back off. That's what you do.
[54:44] All right.
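The point about `fit` resuming rather than restarting can be demonstrated directly: a second call to `fit` continues from the current weights instead of re-initializing them. A small sketch on synthetic data (shapes and sizes are illustrative):

```python
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 4)).astype("float32")
y = (X[:, 0] > 0).astype("float32")  # toy binary labels

model = keras.Sequential([
    keras.layers.Input(shape=(4,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

model.fit(X, y, epochs=10, verbose=0)
w_after_10 = model.get_weights()[0].copy()  # first layer's kernel after 10 epochs

# Calling fit again picks up where the previous call left off: the weights
# keep moving from their current values rather than being re-initialized.
model.fit(X, y, epochs=10, verbose=0)
w_after_20 = model.get_weights()[0]
print("weights kept changing:", bool((w_after_10 != w_after_20).any()))
```

This is exactly the "run 10 more, check, run 10 more" loop described above.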
Now, all this manual stuff I'm going through just to build intuition. There are these things called callbacks in Keras, which we'll get to later, where you can actually tell it: hey, when the validation loss stops improving, stop everything; or, when it stops improving, save that model for me somewhere. So you don't have to go back and rerun everything; it'll have saved it for you and you can just pick it up and use it. Yeah?
[55:12] >> What's the intuition behind the accuracy continuing to improve when the loss is getting higher?
[55:19] >> Because accuracy and loss are related, but they're not the same thing. It's a really good question, also kind of a profound question, because accuracy is a very discrete measure. If for a particular point we predict its probability to be, say, 0.49, we're going to say: okay, that's a zero, no heart disease. But if it goes to 0.51, we're going to say: oh, that's heart disease. So when you go from 0.49 to 0.51, the binary cross-entropy loss changes very, very slightly, but the accuracy jumps from 0 to 1, a dramatic jump. So accuracy is very jumpy and discrete, and that's why it tends to be a proxy, but sort of a crude proxy, for loss. That's part of the reason, and I can talk more offline.
[56:04] >> You mentioned that if you're a purist, you could stop at 50 and rerun. I was wondering: could you look at the history of the model, take the weights at epoch 50, input them into your model, and get roughly the same result, or would there be differences?
[56:22] >> You could try it.
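The two callbacks alluded to here are Keras's `EarlyStopping` (halt when the monitored metric stops improving) and `ModelCheckpoint` (save the best model seen so far). A sketch on the same kind of tiny synthetic data (data, layer sizes, and the `patience` value are illustrative assumptions):

```python
import os
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8)).astype("float32")
y = (X.sum(axis=1) > 0).astype("float32")

model = keras.Sequential([
    keras.layers.Input(shape=(8,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

callbacks = [
    # Stop when val_loss hasn't improved for 5 epochs, and roll the model
    # back to the best weights it saw.
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                  restore_best_weights=True),
    # Keep a copy of the best-so-far model on disk.
    keras.callbacks.ModelCheckpoint("best_model.keras", monitor="val_loss",
                                    save_best_only=True),
]
history = model.fit(X, y, epochs=100, validation_split=0.2,
                    callbacks=callbacks, verbose=0)

saved = os.path.exists("best_model.keras")
print("epochs actually run:", len(history.history["loss"]),
      "| checkpoint saved:", saved)
```

With early stopping, training typically ends well before the 100-epoch ceiling, which is exactly the "don't rerun everything" convenience described above.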
Yeah, you should just try it, because what happens is that ultimately what we care about is how it performs on the validation set. Here it appears to perform better on the validation set if you stop at 50, but only for the loss; for accuracy, if you wait till the very end, it actually gets better. So my thrust tends to be: what is the measure that's closest to the real-world deployment? It's accuracy, so I tend to go with accuracy.
[56:48] Binary cross-entropy is a beautiful proxy, but an imperfect proxy, for the thing we actually care about in the real world, which is error rate and accuracy. That's why I tend to plot both, and if accuracy is telling me one thing, I kind of tend to believe that.
[57:03] All right. So once we do all this, we have a model, and now we want to evaluate it to see: okay, if we actually deployed it, how good is it going to be? So you use this thing called the model.evaluate function. We call model.evaluate on the test X and test Y data set, which we split off at the very, very beginning and never used from that point on. And when I ran it last night, it came up with 83.6% accuracy for the model. Remember, our baseline model, which just predicts everybody is a zero, has a 72.6% accuracy, and this little neural network gives you 83.6%, which is pretty good; it's beating the baseline model, which is nice.
[57:49] And I guess there's something here about the fact that we did a bunch of preprocessing outside Keras and then sent the results into Keras.
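The final `model.evaluate` step looks like this. A sketch on synthetic data: in the lecture the model scored about 83.6% on the real held-out test set against a 72.6% all-zeros baseline, whereas the numbers below come from toy data and are not comparable:

```python
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 8)).astype("float32")
y_train = (X_train.sum(axis=1) > 0).astype("float32")
X_test = rng.normal(size=(50, 8)).astype("float32")   # held out, never trained on
y_test = (X_test.sum(axis=1) > 0).astype("float32")

model = keras.Sequential([
    keras.layers.Input(shape=(8,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=20, verbose=0)

# evaluate returns [loss, accuracy] computed on data the model never saw.
loss, acc = model.evaluate(X_test, y_test, verbose=0)
print(f"test loss={loss:.3f}, test accuracy={acc:.3f}")
```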
You can actually do all this preprocessing inside Keras automatically; there are layers for that, and I've linked to some material here. So that's it as far as this model is concerned. I know we went through it really fast, but please go through it afterwards and make sure you understand every single line. Change each of these lines, rerun it, see how the output changes. That's how we build some intuition. Okay. All right: computer vision.
[58:20] >> Just one question: is there a way to build a model to have fewer false positives, or fewer false negatives?
[58:27] >> Oh yeah, you can do that. You can report on all those things very easily, but there are also more complex loss functions that will take the asymmetry between false positives and false negatives into account. So the short answer is: it's possible.
[58:45] All right. So first, let's just talk about how you represent an image digitally. This is how grayscale images, black and white images, are represented. The basic idea is very simple. Every location in a picture is a pixel, and each pixel has a light intensity: the amount of light at that location. That light level is measured from 0, no light, up to blinding white light, which is 255. So if you take this five, for example, you can see a lot of no light, all the black regions; those are all zeros. And then wherever there is white light, there's a number, and the more light there is, the closer it gets to 255.
[59:29] In fact, if you just step back and squint at this, you can actually see the five. So that's it; that's how a black and white image is represented. Very simple. Now, yes?
[59:43] >> When you say amount of light, what's the unit being measured? What do you mean?
[59:48] >> So when you take an analog picture, there's a process by which that picture gets read in and mapped to a scale between 0 and 255. That's it. You can think of it as a relative, normalized scale between 0 and 255, and it just roughly maps to the amount of light at that location. The exact lumens-to-number mapping I don't know; my guess is there are a number of variations on it. But for our purposes, just think of it as a normalized scale that runs from 0 to 255.
[01:00:26] All right, so that's what's happening: every pixel is a number between 0 and 255. Now, if you have a color image, each pixel of a color image is represented by three numbers, and these numbers measure the intensity of red light, green light, and blue light, because if you mix red, green, and blue in the right proportions, you can get whatever color you want. Each light intensity is still a number between 0 and 255. Which means that now you have three tables of numbers instead of one table of numbers.
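The "three tables" just described stack into a single array: a color image is a height × width × 3 array, one 2-D table per color. A tiny NumPy sketch (the 4×4 size and pixel values are made up):

```python
import numpy as np

# A tiny 4x4 RGB image: one table per channel, stacked along the last axis.
img = np.zeros((4, 4, 3), dtype=np.uint8)
img[:, :, 0] = 255   # turn the red channel fully on -> a solid red image

print(img.ndim)      # 3 axes: height, width, channel
print(img.shape)     # (4, 4, 3): one 4x4 table each for R, G, B
```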
[01:01:00] And by the way, just some lingo here: in the deep learning world, these colors, RGB, red, green, blue, are sometimes referred to as channels. Okay. So this is what we have here: a picture, and from it the red table, the green table, and the blue table. So for this picture, these three tables form a tensor of rank what?
[01:01:23] Good, rank three.
[01:01:26] All right. Any questions on this?
[01:01:33] So the key task in computer vision, the most basic task, if you will, when you're working with images, is image classification: you have an image and a list of possible objects the image could contain, and you're figuring out which of those possible objects exists in that image. The dog-cat classification is the canonical example that we all know and love, and that's what we'll solve later today and on Wednesday. But there are many other tasks you need to be aware of. There's the case where you not only classify an image, but also localize where in the image the object is. It's not just enough to say "sheep"; you want to figure out where the sheep is. That's called localization, and the way you do localization is you put a little box around the object.
[01:02:18] And then you output not just whether it's a sheep, yes or no, but the coordinates of this box, the top-left and the bottom-right, for example. With those coordinates you can actually draw the box around the object. So you output the numbers, the coordinates of where this box sits in the picture. This is called localization.
[01:02:39] Now, this is object detection, where you may have lots of objects going on, and you want to pick up every one of them and localize it. So here we've gone in and said: okay, sheep one, sheep two, sheep three, and each of these sheep has a little box around it.
[01:02:59] By the way, in self-driving cars, the camera vision system is constantly scanning what's coming in through the cameras and doing object detection, many times a second: pedestrian box, zebra-crossing box, doggy box, stroller box, and so on and so forth.
[01:03:16] And then we have this thing called semantic segmentation, where we take every pixel in the picture and classify every pixel. We are not classifying the whole picture; we're classifying every pixel. So we're saying: okay, all these gray pixels are road, all these pixels are sheep, and all these pixels are grass. Every pixel is being classified. So instead of giving one classification for the whole image, we are solving a multiclass classification problem for every pixel.
[01:03:49] And just when you think it can't get more complicated than this, we have something called instance segmentation, where not only are we classifying every pixel, we are distinguishing between the different sheep. So every pixel is classified, and different instances of the same category are identified.
[01:04:10] Okay. So these are some of the most prevalent and useful categories of image processing problems that are amenable to a deep learning system.
[01:04:23] All right. So let's go to image classification, and we're going to work with this data set called Fashion-MNIST. The idea here is that you have 70,000 images of clothing items across 10 categories, you know, like boots and sweaters and t-shirts; you get the idea, 10 categories of clothing. We have 70,000 images like this, and we'll build a network from scratch to classify them with pretty high accuracy. These classes, by the way: this is a very balanced data set, so 10% of the data is sweaters, 10% is boots, and so on and so forth. So what accuracy would a naive baseline model give you?
[01:05:07] 10%, exactly. So we need to build something that's better than 10%, and I'm glad to report that a simple neural network can actually get you close to 90%.
[01:05:18] Right? So this is the simple network that we have. The input in this case is a 28 x 28 picture.
[01:05:33] So far we have been feeding vectors into our neural network. Now we have a picture which is 28 by 28. It's a tensor of rank two, right? It's a table of numbers. What do we do? How do we feed that in?
[01:05:51] Each image is a table of numbers. Let's just take a single image. What do we do with this table?
[01:06:01] Convert it into a vector. Exactly, and that's called flattening. So we take this table of numbers and we flatten it into a vector. We have 28 rows of 28 numbers. What we do is take the first row and write it out, take the second row and write it next to it, the third row after that, and so on. You get the idea: you take each row, rotate it, and string them all up, and it becomes one long vector. So this is called flattening. That's how you take this table and make it into one long vector.
[01:06:56] So when you do that, 28 by 28 is what? 784. So we get a vector, the flattened input, that is 784 long.
[01:07:15] After the flattening, we have not done anything complicated yet. We have literally taken the numbers and just reorganized them in a different way. And once we do that, we are back in our familiar neural network territory, right? We know how to work with vectors. So we just need to pass it through a hidden layer.
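The row-by-row stringing-up just described is exactly what a row-major reshape does; in Keras it's the `Flatten` layer. A NumPy sketch (the pixel values here are a stand-in, just `0..783` so you can see the row order):

```python
import numpy as np

image = np.arange(28 * 28).reshape(28, 28)  # stand-in for one grayscale image
flat = image.reshape(-1)                    # row after row, one long vector

print(flat.shape)   # (784,) -- 28 x 28 = 784
print(flat[:3])     # start of the first row
print(flat[28])     # position 28 is where the second row begins
```

In a Keras model you would get the same reorganization with `keras.layers.Flatten()` as the first layer after the input.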
[01:07:33] And for this hidden layer, we're going to use ReLU neurons. I tried a few different values, and it turns out that 256 neurons does a really good job, so I'm going to use 256 neurons here. And then we need to think about what the output layer should be. Now we run into a problem, because in the heart disease example we saw before, the output was just zero or one. Here there are 10 possible outputs: it could be a boot, a sweater, a shirt, and so on and so forth, 10 possible categories. So we need some way to handle many possible outputs, not just one binary output.
[01:08:15] By the way, pay attention to this, because this is actually how GPT-4 works.
[01:08:20] Okay. So here's what we have. We know how to output 10 numbers, right? If you want to output 10 numbers, no problem; we can easily output 10 numbers by just using linear activations. We also know how to output 10 probabilities: each one just needs to be a sigmoid. But here we can't use 10 sigmoids as the output. Why is that? Why can't we use 10 sigmoids?
[01:08:47] >> Because the probabilities have to add up to one.
[01:08:50] >> Right. So here, when the output comes, we need to figure out: okay, is it a boot, a sweater, a shirt, and so on and so forth. There's only one right answer.
[01:08:59] Which means that we need to actually figure out which of these 10 is the right answer, which means that we need to produce probabilities, but they have to add up to one, because only one of them can be true.
[01:09:09] So that's the key thing: they have to add up to one. That's the wrinkle. If not for that, we could just use 10 sigmoids. And the way we do it is using something called the softmax function, or the softmax layer. The idea is actually very simple. We have these 10 outputs in the very final layer, which are just linear activations. We take each one of these numbers, run it through the exponential function, and then divide by the total. When you do that, two things happen. First, when you take a number, say a1, and compute e raised to a1, you get a positive number. And now you have a positive number divided by the sum of a bunch of positive numbers, and you can confirm visually that the results will add up to one, because you're literally taking each number and dividing by the total. So they will add up to one; there's no other option. So this is called the softmax function, which means you can take any set of 10 numbers coming out of the network and convert them into probabilities that add up to one.
[01:10:07] And, by the way, the GPT-4 reference: when you actually put a prompt into GPT-4 and it starts giving you the output, every word it's emitting, right? It's actually a token, but we'll get to that later.
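The exponentiate-then-divide recipe just described is a few lines of NumPy. (Subtracting the maximum first is a standard numerical-stability trick not mentioned in lecture; it doesn't change the result, because it multiplies numerator and denominator by the same constant.)

```python
import numpy as np

def softmax(a):
    """Exponentiate each entry, then divide by the total."""
    e = np.exp(a - a.max())   # e^{a_i}, shifted by max(a) for stability
    return e / e.sum()        # each term divided by the sum -> sums to 1

# 10 made-up linear outputs ("logits") from the final layer.
logits = np.array([2.0, 1.0, 0.1, -1.3, 0.5, 3.2, -0.7, 0.0, 1.8, -2.0])
probs = softmax(logits)

print(probs.sum())     # 1, up to floating-point rounding
print(probs.argmax())  # 5 -- the largest logit gets the largest probability
```

Every output is positive (exponentials are positive) and the whole vector sums to one, which is exactly the wrinkle that ruled out 10 independent sigmoids.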
[01:10:20] You imagine it's a word: every word it's emitting comes from a 52,000-way softmax. Think of it as every word in the language being a possible output. So it's a vector which is 52,000 long, but it's actually a softmax, and the model just picks the most probable word and emits that. So this notion of a softmax is actually very powerful. Okay, but we'll come back to that later.
[01:10:45] So, to summarize: if you have a single number, use a simple linear output layer; a single probability, a sigmoid; if you have lots of numbers, just have a stack of these things. And when you have a lot of numbers that have to be probabilities adding up to one, use softmax.
[01:11:06] >> Why do we choose probabilities instead of just numbers, since we know only one is going to be one?
[01:11:14] >> Because you can't force the network to give you ones or zeros. It's going to produce what it's going to produce; you can't force it to be exactly one or zero. It'll give you some number, and what you can do is tame that number so that it comes into a range that you like, like between zero and one.
[01:11:34] So here, very quickly: when we have a binary classification example, yes or no, this is the one-hot encoded version, one or zero; this is what we saw in the heart disease example. When you have something like this Fashion-MNIST example, where you have all these different possibilities, you can encode the labels in one of two ways. You can encode them just using integers, 0 to 9; this is
called the sparse encoded version. Or you can do a one-hot encoded version of the output. And depending on how your data comes into your Colab, pay attention to this, you have to pick the right Keras loss function. If the data comes as a one/zero thing, which is exactly what we had in the heart disease example, you use binary cross-entropy. If your data comes sparse encoded, you use sparse categorical cross-entropy. And if it comes one-hot encoded, you use categorical cross-entropy. These are all equivalent things; it just depends on how the data happens to be encoded by the people who sent it to you. If they send it this way, use this loss function; if they send it that way, use that loss function.
[01:12:46] Now, as it turns out, in our example here the data is actually coming in sparse encoded, so we'll use this thing called sparse categorical cross-entropy. And categorical cross-entropy is a generalization of binary cross-entropy; I'm not going to get into the mathematical details, but the intuition is basically roughly the same.
[01:13:04] Okay, so this is what we have. If your output layer is a single linear number, use mean squared error. If it's a sigmoid, use binary cross-entropy. If you have a stack of linear numbers, you can still use mean squared error. And if your output is a softmax, use categorical cross-entropy or sparse categorical cross-entropy.
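Putting the pieces together: the flatten → 256 ReLU → 10-way softmax network described above, compiled with sparse categorical cross-entropy because the labels arrive as integers 0–9. This sketch trains on a few random 28×28 arrays just to show the mechanics; it is not the lecture's actual Colab, and the random data obviously learns nothing meaningful:

```python
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(64, 28, 28)).astype("float32") / 255.0  # fake images
y = rng.integers(0, 10, size=(64,))   # sparse-encoded labels: one integer each

model = keras.Sequential([
    keras.layers.Input(shape=(28, 28)),
    keras.layers.Flatten(),                       # 28x28 table -> 784 vector
    keras.layers.Dense(256, activation="relu"),   # hidden layer
    keras.layers.Dense(10, activation="softmax"), # 10 class probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # integer labels
              metrics=["accuracy"])
model.fit(X, y, epochs=2, verbose=0)

probs = model.predict(X[:1], verbose=0)
print(probs.shape)                    # (1, 10): one probability per class
print(round(float(probs.sum()), 4))   # softmax outputs sum to 1
```

If the labels had instead arrived one-hot encoded (a 10-long row of zeros with a single one), the only change would be `loss="categorical_crossentropy"`.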
So let's actually run this in Colab. Right, so this is what we have; can folks see this? Okay. All right, so this is the data set we saw earlier. As usual, we load TensorFlow and Keras, we load our usual three packages, and then we set the random seed for reproducibility. And it turns out the Fashion-MNIST data is actually available in Keras; you don't have to go find it somewhere and bring it in. It's one of the standard data sets. We luck out. So we just load the data using this load_data command. And conveniently for us, Keras has not only made the data available, it has already split it into a training and test set, so we don't have to do the splitting. And why would they do that?
[01:14:18] They do that so that different people who are building algorithms for that particular data set can all be evaluated using the same test set. Otherwise, if I split it one way and say, "Hey, look how well I did," you'd say, "I don't know, how did you split it?" That's the reason.
[01:14:32] Okay. So you can see here that the input data is a tensor of rank three. A useful way to think about a rank-three tensor is as just a list of rank-two tensors. So here you have 60,000 images, and each image is a 28 x 28 table of numbers. And then of course the output is just what category it is: a number between 0 and 9.
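The shapes just described can be checked with `.shape` and `.ndim`. The real call is `keras.datasets.fashion_mnist.load_data()`, which returns `(x_train, y_train), (x_test, y_test)` already split; the sketch below uses stand-in zero arrays of the same sizes so it runs without downloading anything:

```python
import numpy as np

# Stand-ins with the same shapes and dtype Fashion-MNIST arrives in.
x_train = np.zeros((60000, 28, 28), dtype=np.uint8)  # rank 3: a list of 28x28 tables
y_train = np.zeros((60000,), dtype=np.uint8)         # one integer label per image
x_test = np.zeros((10000, 28, 28), dtype=np.uint8)

print(x_train.ndim)    # 3 -- a list of rank-2 tensors
print(x_train.shape)   # (60000, 28, 28)
print(y_train.shape)   # (60000,)
print(x_test.shape)    # (10000, 28, 28)
```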
[01:15:09] So you just have 60,000 numbers; it's just a vector of 60,000 labels. There are 60,000 in the training set, and then there are 10,000 in the test set, with the same 28 x 28 structure. That's what we have. So if you look at the first 10 values of the dependent variable y, you get numbers like 9, 0, 0, 3, and so on; they're numbers from 0 to 9. And if you look at the Fashion-MNIST GitHub site, this is what they refer to: zero is a t-shirt, one is a trouser, and so on and so forth, and nine is an ankle boot.
[01:15:41] All right. So whenever I'm working with multiclass classification problems, I always do a little thing here to help me remember that nine corresponds to an ankle boot and so on; it just makes it a little easier to work with this stuff. So I create this little list. And then: what is the very first data point? What is its y-value? It turns out to be an ankle boot. So you can actually look at the raw data for that image, which is just a 28 x 28 table, and these are the numbers you have: see all these 250s, 233s, lots of zeros, and so on and so forth. And you can actually visualize the first 25 images.
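A sketch of that "little list" and the 5×5 image grid: the class names follow the label table on the Fashion-MNIST GitHub page, while the images and labels below are random stand-ins rather than the real data set:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Label names per the Fashion-MNIST repository's class table.
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(25, 28, 28))  # stand-in pixel data
labels = rng.integers(0, 10, size=(25,))          # stand-in labels 0-9

fig, axes = plt.subplots(5, 5, figsize=(8, 8))
for ax, img, lab in zip(axes.flat, images, labels):
    ax.imshow(img, cmap="gray")                   # grayscale: 0 black, 255 white
    ax.set_title(class_names[lab], fontsize=8)    # e.g. 9 -> "Ankle boot"
    ax.axis("off")
fig.tight_layout()
fig.savefig("first_25.png")
```

With the real data you would pass `x_train[i]` and `class_names[y_train[i]]` into the same loop.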
I have a little bit of code here which visualizes that, just matplotlib code, and you can see these are all the images. They're kind of smallish. This, my friends, is an ankle boot. Right? It's like: okay, can the network really make any sense out of this thing? It looks very blurry, and I don't know. Oh, this one is actually a better ankle boot; look at that. Okay, sorry, I'm getting distracted. So this is what we have here.
[01:16:49] Okay, we're at 9:55, so I'm going to stop so you folks are not late for your next class. We'll continue this journey on Wednesday, and then we'll go on to color images the next class as well. Thank you, folks. Have a good one.