3: Deep Learning for Computer Vision – Building Convolutional Neural Networks from Scratch
MIT OpenCourseWare · May 11, 2026
Transcript
0:16
Okay. All right. Let's get going.
0:20
[clears throat] Today is going to be
0:21
packed. I'm going to spend the first
0:23
roughly half of the lecture on
0:25
actually building a model, a Keras
0:28
model in Colab to solve the heart
0:30
disease problem we saw earlier and then
0:32
switch gears halfway and then talk about
0:35
how to solve image classification.
0:37
Okay, so we're going to do two Colabs
0:39
today. I've been talking about Colab,
0:42
Colab, right? I've been teasing you; we'll
0:44
actually do Colabs today. All right. So,
0:46
by the way, I've shut off
0:48
the lights up top, because when I
0:50
switch to Colab it's going to be much
0:52
better for you folks particularly the
0:53
folks in the back to be able to see it.
0:54
Okay, but I hope you can see the slide
0:57
right now. Yes.
1:00
Okay, great. So this is just a quick
1:02
recap of what we did last class. You
1:04
know, broadly speaking, training a neural
1:07
network essentially is no different than
1:08
training other kinds of models. We have
1:10
a bunch of parameters, i.e., weights and
1:12
biases and we need to use the data to
1:14
find good values of those weights. And
1:17
what does good mean? Typically it means
1:19
that we define some measure of
1:21
discrepancy between what the model
1:23
predicts for a given set of weights and
1:24
what the right answer is what the ground
1:26
truth answer is and then we try to find
1:29
weights that minimize this discrepancy
1:30
That's it. And this notion of a
1:32
discrepancy is called a loss function,
1:34
right? So, broadly speaking, the
1:36
overall training flow is that you define
1:38
some network it has an input it goes
1:40
through a bunch of layers you come up
1:41
with some predictions you take the
1:42
predictions you take the true values and
1:44
then those two go into the loss function
1:46
i.e., the discrepancy function, and
1:48
then you come up with the loss score and
1:50
then you send it to the optimizer which
1:52
then proceeds to calculate the gradient
1:54
of this loss function with respect to
1:56
all the parameters and then it updates
1:58
all the weights using that gradient and
2:00
then this process repeats. That's it. So
2:02
that is the training flow. Okay, quick
2:04
recap. Now we also talked about the
2:08
optimization algorithm we're going to
2:09
use which is called gradient descent.
2:12
And in gradient descent, as you noticed, in
2:15
each iteration every data point is
2:17
used to make predictions and
2:20
therefore to calculate the loss and then
2:22
to calculate the gradient. And then we
2:24
pointed out that gradient descent is
2:26
actually not as good as something called
2:28
stochastic gradient descent. Stochastic
2:31
gradient descent, where instead of
2:33
taking all the points, we just
2:35
randomly choose a small number of
2:37
points. Pretend for a moment as if those
2:40
are the only points we have. Make
2:42
predictions, calculate loss, calculate
2:44
gradient and go on. So that was the
2:47
basic idea behind stochastic gradient
2:49
descent, right? Two different kinds of
2:51
things. Now what it means is that when
2:54
we actually start training the model, as
2:56
we will in a few minutes,
2:58
because we only take a few points at a
3:00
time, we have to be a bit careful about
3:02
what's going on. And I want to make sure
3:04
you clearly understand what the
3:06
differences are before we actually get
3:07
to the Colab. Okay. And
3:10
all right. So there is the notion of an
3:13
epoch.
3:14
An epoch essentially just means that we
3:17
make one pass through the training data.
3:20
All the training data we make one pass
3:22
through it. Okay. And so what is one
3:25
pass? If you have something like
3:27
gradient descent, one pass means every
3:30
data point is sent through the network.
3:32
We calculate its predictions, calculate
3:34
the loss, calculate the gradient, right?
3:37
We run every training sample through it.
3:38
We calculate the gradient, which is just
3:40
this thing here, right? I will
3:42
sometimes write dL/dw, the derivative
3:46
of the loss with respect to w; sometimes I
3:48
might use the nabla symbol, ∇L. These are all
3:51
interchangeable. Okay, so we'll calculate
3:54
the gradient and then we update using
3:55
some version of this. Okay, but we just do
3:58
it once at the end of the epoch because
4:01
if you have 10 billion data points, every
4:03
one of them flows through, you get 10
4:05
billion outputs, and then, just once at
4:07
the end of this
4:08
thing, we calculate the gradient and
4:10
update once. One update per epoch. Yes.
4:15
Now in stochastic gradient descent what we
4:18
do is that we process the data in
4:20
batches
4:22
small numbers of points at a time right
4:25
and these are called technically
4:26
speaking they're called mini batches I
4:29
don't know about you I just get tired of
4:30
saying mini batches I'm just going to
4:31
say batches from this point on okay and
4:34
in fact that is widely done in the
4:36
literature. So we'll have to
4:39
process it in batches so we take the
4:41
training data and then we divide it up
4:43
into batches
4:44
batch one, batch two all the way till
4:46
the final batch. And so what we do is we
4:49
for each batch we basically do gradient
4:53
descent. We take batch one
4:56
and then we run just the training
4:57
samples in that batch through the
5:00
network to get predictions. We calculate
5:01
the gradient we update the parameters
5:03
and then we go to batch two then we go
5:05
to batch three and so on and so forth.
5:07
So pictorially this is how it's going to
5:09
look.
5:11
Right, let's say the first batch is, say,
5:12
32 points we take those 32 points we run
5:16
it through the network get all the stuff
5:17
out we calculate the gradient update the
5:19
weights so when we now get to batch two
5:22
the weights have changed
5:25
they have been updated and then we do
5:27
the same thing for batch two batch three
5:29
and all the way till we get to the end
5:30
of the thing and when we are done with
5:32
this thing this whole thing is called a
5:34
what
5:36
an epoch [clears throat]
5:38
This whole thing is an epoch. Okay.
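To make this concrete, here is a minimal numpy sketch of what one epoch of mini-batch SGD does; the toy logistic-regression model, random data, and learning rate are illustrative assumptions, not the lecture's code:

    import numpy as np

    rng = np.random.default_rng(42)
    X = rng.normal(size=(194, 29))            # toy features
    y = rng.integers(0, 2, size=194)          # toy 0/1 labels
    w, b, lr, batch_size = np.zeros(29), 0.0, 0.1, 32

    # One epoch: shuffle once, then walk through the data batch by batch,
    # updating the weights after every batch.
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        rows = order[start:start + batch_size]
        Xb, yb = X[rows], y[rows]
        p = 1 / (1 + np.exp(-(Xb @ w + b)))   # predictions for this batch only
        grad_w = Xb.T @ (p - yb) / len(rows)  # gradient of the loss w.r.t. w
        grad_b = np.mean(p - yb)
        w -= lr * grad_w                      # so batch two sees updated weights
        b -= lr * grad_b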
5:42
All right. Now, so the question of
5:44
course is that if you have a bunch of
5:46
data points and you're going to run
5:47
stochastic gradient descent on it in a
5:50
in a particular epoch, how many batches
5:52
are going to be there? Okay, how many
5:54
batches are going to be there? Now,
5:56
Keras is going to calculate all this
5:58
stuff. You don't have to worry about it,
5:59
but you just need to understand exactly
6:00
what happens. Okay, so my philosophy, by
6:02
the way, is that you have to know the
6:04
details of what's going on. If you don't
6:06
know the details, if you haven't figured
6:08
it out at least once, you will not actually
6:11
be able to think new and creative
6:12
thoughts for a new problem. Okay, it's
6:15
because the concepts are not manipulable
6:17
in your head yet. Okay,
6:23
please use the microphone.
6:27
So when we talk about SGD, we are
6:30
talking about only taking some
6:32
part of it. Are we saying
6:34
that we only take some variables, or are we
6:36
only taking some part of the data?
6:37
>> We are taking some rows.
6:40
Okay, we're taking only rows, right. So those data
6:42
points, that means a batch.
6:44
>> Exactly. So for example, let's say you
6:46
have a thousand data points, right?
6:48
Thousand rows of observations, thousand
6:50
patients in the heart disease example or
6:52
a thousand images that you're trying to
6:53
classify. You take let's say 32 of those
6:56
images, 32 of those patients and that's
6:58
a batch. Then you go to the next 32.
7:00
Then the next 32 and so on and so forth
7:02
till you run out of patients or run out
7:04
of images.
7:05
>> And each iteration you are updating
7:07
the weights with the new weights that you've
7:09
got.
7:09
>> And it means you keep correcting it, or
7:12
keep moving towards
7:13
>> You're basically updating the weights as
7:14
you go.
7:14
>> Updating the weights.
7:17
>> And what we're calling the epoch is
7:19
ultimately the equation of the loss function
7:20
that we are trying to do?
7:21
>> No. An epoch... see, the thing to
7:24
remember is that here this whole thing
7:27
is called an epoch because we have to do
7:30
one full pass through the training data.
7:32
Okay. But within that epoch we update
7:35
the weights many times. Basically we
7:37
update the weights as many times as we
7:40
have batches.
7:44
All right. Um
7:46
So, to go here, let's say, for example,
7:49
basically the idea is that you take the
7:50
training set size, you divide it by the batch
7:52
size. And you choose the batch size, okay,
7:54
you choose the batch size, and we'll talk
7:56
about how you choose that later
7:57
on. You choose the batch size, and once
7:59
you choose it, just divide and round
8:01
it up. So, for example, as you will see in
8:04
the Colab, the training set is going to be 194
8:06
patients and then we're going to choose
8:09
a batch size of 32 and we typically tend
8:12
to choose batch sizes of 32 64 and
8:14
things like that because it actually
8:16
aligns very well with the nature of the
8:18
parallel hardware we're going to use.
8:20
Okay. And so here 32 and so on. So
8:24
divide 194 by 32 you get 6 point
8:27
something. You round it up to seven.
8:29
Okay. And so what that means is that the
8:31
first six batches will have 32 samples
8:33
each. And then the final batch has only
8:36
two samples left. And that's okay. It
8:38
can be a nice little small batch at the
8:40
end.
8:42
There's nothing that says that every
8:43
batch has to be the same size.
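The batch arithmetic from that example, as a quick sketch:

    import math

    n_train, batch_size = 194, 32
    n_batches = math.ceil(n_train / batch_size)          # 194/32 = 6.06... -> 7
    last_batch = n_train - (n_batches - 1) * batch_size  # 194 - 6*32 = 2
    print(n_batches, last_batch)                         # 7 2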
8:46
>> That's it. Epoch batches.
8:53
>> And are you like for each batch you run
8:56
through the whole network like all the
8:58
layers or like each layer is one batch?
9:00
>> No, for a batch you run it through the
9:03
entire network. So the way I think about
9:04
it is that you take a batch right just
9:06
momentarily you assume that's all the
9:08
data you have
9:10
just run it through the network because
9:12
unless you run it through every
9:14
layer of the network you can't get a
9:15
prediction and unless you get a
9:18
prediction you can't calculate the loss
9:19
and unless you calculate the loss you
9:20
can't calculate the gradient unless you
9:22
calculate the gradient you can't update
9:23
the weights
9:25
>> last thing but if you're using like all
9:27
the data just doing the gradient descent
9:29
then you just go through the network
9:31
once right
9:32
>> Okay, exactly. So in gradient descent, one
9:34
epoch is one pass and one weight update.
9:37
In stochastic gradient descent, the
9:40
number of updates you make is equal to
9:41
the number of batches you have which
9:43
ends up being, you know, the training
9:46
set size divided by the batch size, rounded
9:47
up.
9:50
>> So just to confirm so initially when we
9:52
introduced like the concept of batches
9:54
the whole purpose was not to run through
9:56
all the data and be able to do some
9:58
prediction from a subset. So now like
10:00
the advantage is that like after batch
10:02
one we are using more accurate
10:04
coefficients to run through batch two and
10:06
so on. That's really the advantage of it
10:08
or there's something else to it.
10:10
>> Perfectly said. That's exactly the
10:11
advantage. So we take a small amount of
10:13
data and we say hey we know this is not
10:16
all the data. It's just a small subset
10:18
of the data. So therefore it's not going
10:19
to be super accurate. It's going to be
10:21
approximate but it's okay. So we'll
10:23
still tend to move in the right
10:25
direction. So instead of waiting for the
10:28
whole thing to get done and then
10:29
updating it, we're just going to update
10:30
it as we go along.
10:33
All right. Uh yes,
10:35
>> Building on her question, is it that
10:37
doing this process for SGD will
10:40
render us a better solution, or
10:43
require less compute power?
10:45
>> Both
10:46
>> Both, and the reasons for both are in the
10:48
previous lecture. Yeah. And I'm not
10:51
repeating them just
10:52
because I'm very pressed for time
10:54
today. That's why. All right, cool. So
10:57
that's what we have. Are we good?
11:01
Okay, so now we come to the last step
11:04
before we actually fire up the Colab,
11:05
which is overfitting and regularization.
11:07
So if you remember from your machine
11:09
learning background, when your model
11:12
gets more and more complex
11:14
right? If you, you know, first
11:18
use a simple model, then you use a more
11:19
complex model and so on and so forth
11:21
what happens to the error on the
11:23
training data Typically what happens to
11:26
the error on the training data? So let's
11:27
say you have a simple regression model,
11:28
you get some error and then you have a
11:30
regression model in which you use all
11:31
kinds of interaction terms. You use
11:32
logarithms and this and that and make it
11:34
super complicated. What do you think is
11:35
going to happen to the error on the
11:36
training data?
11:39
>> Right? Basically it's going to go down
11:41
as the model gets more and more complex.
11:43
Correct. Now of course comes the punch
11:45
line, which is: what do you think is
11:46
going to happen to the error on new, unseen data? I
11:49
showed you the answer.
11:53
Right? Basically, what's going to happen
11:54
typically, at least conceptually, is
11:56
that it's going to get better and better;
11:57
at some point it's going to bottom out
11:59
and it's going to start climbing again.
12:00
And so, we typically refer to this
12:03
phenomenon here when it starts to climb
12:05
again as overfitting because the model
12:07
is essentially fitting to the
12:09
idiosyncrasies of the training data as
12:11
opposed to generalizing patterns. And
12:14
then over here we call it
12:15
underfitting, because there's still
12:17
a lot of potential to improve,
12:18
and we really are hoping to find the
12:20
sweet spot in the middle, right? That's
12:23
the basic idea of overfitting and
12:24
underfitting. And to
12:27
relate this to neural networks: as
12:29
you've learned so far, you
12:31
have to learn smart representations of
12:33
the input data, and to do that, I have
12:36
argued that you need to have lots of
12:38
layers in your network; the more layers
12:39
you have, the better things get. GPT-3, for
12:42
example, has 96 layers if I recall right.
12:45
More layers the better, but more layers
12:47
means more parameters more parameters
12:50
means more complexity to the model and
12:52
therefore more chance of overfitting
12:54
okay so it's really important in neural
12:57
networks that we think about
12:59
regularization and regularization you
13:01
will recall from your machine learning
13:03
background is the way we handle the risk
13:05
of overfitting and try to find models
13:07
that fit just right okay and so several
13:11
regularization methods have been
13:12
developed over the years and we are
13:14
going to use only two of them. The first
13:16
one is called early stopping. And
13:19
this has been famously referred
13:20
to by Geoff Hinton, who's one of the
13:23
pioneers, or, as he's more colorfully
13:25
known, one of the godfathers of deep
13:27
learning, and who also won the
13:29
Turing Award a few years ago, as sort
13:31
of a beautiful free lunch, right? That's
13:33
what he calls it so the idea is very
13:35
simple: we take
13:37
the training data, we split it into a
13:39
training and a validation set and then
13:41
we just keep, you know, doing gradient
13:42
descent, and the training will
13:45
hopefully keep on getting better and
13:46
better lower and lower error
13:49
And then we just keep track of what's
13:50
going on in the validation set. And then
13:52
at some point if it starts to flatten
13:54
out and starts to climb, we just say,
13:56
"Okay, that's when we stop training."
13:59
Right? And what we're going to do in the
14:01
Colab is actually run it through the
14:02
whole thing, see where it flattens out,
14:03
and then we say, "Okay, that's why we
14:04
should stop." But of course, you don't
14:06
want to go all the way to the end and
14:07
then go back and say, "Well, I want to
14:09
stop at the 10th epoch." And there are
14:12
ways you can use Keras to be very
14:13
efficient about this. But the
14:15
fundamental idea is you take the
14:16
training data, split it into training
14:18
and validation and just track what's
14:20
going on in the validation set to see
14:21
whether this kind of bottoming out
14:23
happens. Okay. So this is called early
14:25
stopping; we're looking for this part.
14:28
The other
14:30
thing we're going to do
14:32
is called dropout. And I'm going
14:35
to come back to dropout
14:39
on Wednesday's lecture because that's
14:40
the first time we're going to use it.
14:42
And so I'll come back to dropout and
14:43
tell you exactly how it works. It's a
14:44
very very clever strategy. But we will
14:46
not use it today. We'll use it on
14:48
Wednesday. Okay. So in summary, uh what
14:51
do we do? We get the data ready. We
14:53
design the network, number of hidden
14:55
layers, number of neurons and so on and
14:57
so forth. We pick the right output
14:58
layer. We pick the right loss function.
15:01
Uh we choose an optimizer. As I
15:04
mentioned earlier, SGD comes in lots of
15:06
flavors, lots of variations on the
15:07
theme. And empirically much like for
15:11
hidden layer neurons, we tend to use
15:13
ReLU as the activation function, for
15:16
optimization we tend to use a flavor of
15:17
SGD called Adam, okay, as sort of the
15:20
default because it's really good so
15:22
we'll use Adam as you'll see we
15:24
typically use either uh early stopping
15:27
or dropout and then you just fire it up
15:29
and start training in Keras and TensorFlow.
15:32
all right so that is the training loop
15:33
now I'm going to switch gears and give
15:35
you a quick intro to Keras and
15:38
TensorFlow. Okay. Keras and Tensor... no,
15:40
TensorFlow and Keras. Thank you. And
15:43
then we'll actually fire up the Colab.
15:45
So, first of all, what's a tensor?
15:49
>> Yeah, I just quick question on the
15:52
previous thing like if you're looking at
15:54
the validation set to avoid overfitting,
15:57
but aren't you actually
15:59
overfitting, because you're
16:02
kind of using the validation set as a
16:03
training set or not?
16:05
>> Uh, no, no, no. The validation set is
16:08
never used to calculate any gradients.
16:10
It's only used to calculate accuracy and
16:12
loss.
16:14
Yeah. Yeah. It's kept aside and only
16:16
used for evaluation, not for training.
16:19
That's what keeps you honest.
16:22
>> Right.
16:23
>> And this will become clear when we
16:24
actually go to the collab. So what's a
16:25
tensor?
16:28
>> All right.
16:30
Okay.
16:33
A tensor is the input data which you're
16:35
giving to the system. It could be in
16:36
various formats: if it's an image, we
16:39
call it a 4D tensor. If it's
16:42
time series data, it's 3D. And
16:45
typically, if you just send numbers in,
16:47
it becomes a vector, which would go
16:49
inside, where each entry gives the
16:52
value of the
16:54
variable, the values of
16:57
the variables associated with it, as well
16:59
as
17:01
the information you
17:05
want to get to.
17:07
>> You're kind of on the right track, but
17:08
not entirely, right? It's actually a
17:10
simpler concept than that. So, uh
17:13
>> it's like a matrix but generalized with
17:15
higher dimensions.
17:16
>> Correct. That's actually correct
17:18
but incomplete. The reason is because it
17:21
can be simpler than a matrix. It's not
17:24
a matrix or higher; it actually could be
17:25
simpler. In fact, you take a number,
17:27
it's actually a tensor.
17:30
All right? The simplest case of a tensor
17:31
is a number. The next simplest case is a
17:34
vector which is a list. The next higher
17:37
case is a table.
17:40
Okay, so these are all tensors. So
17:43
tensors basically are a generalization
17:45
of the notion of both a number, a vector
17:48
and a table to higher dimensions.
17:52
Okay, so you can think of it this way:
17:56
every tensor has
17:59
something called a rank, right? So a
18:03
number is just a number. It doesn't have
18:04
a dimensionality to it. So it has got
18:06
rank zero. Okay. While a vector it's a
18:10
list of numbers. You can sort of write
18:12
it down top to bottom and it's one
18:14
dimension. Right? So that dimension that
18:17
one dimension is called a rank. So it's
18:19
called rank one. A table is 2D
18:22
two-dimensional. So it's called rank
18:24
two.
18:26
And you can have a rank three which is
18:28
just a bunch of tables.
18:32
A bunch of tables is a rank three
18:34
tensor. We also think of it as a cube.
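A quick numpy illustration of rank; ndim is numpy's name for what the lecture calls rank:

    import numpy as np

    np.array(7).ndim                 # 0: a single number, rank zero
    np.array([1, 2, 3]).ndim         # 1: a vector (a list), rank one
    np.array([[1, 2], [3, 4]]).ndim  # 2: a table, rank two
    np.zeros((5, 4, 3)).ndim         # 3: a stack of tables, a cube, rank three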
18:37
Okay. So these things are very useful
18:40
because obviously we are all familiar
18:42
with vectors. Uh as you will see very
18:45
shortly later in this class black and
18:48
white grayscale images are usually
18:49
represented using tables of numbers like
18:51
this. Color images are represented using
18:54
three tables.
18:56
Okay. Can you get think of what might be
18:59
representable as, you know, a tensor of
19:02
rank four? Meaning every element of a
19:06
tensor of rank four is actually a color
19:08
picture.
19:11
Just shout it out. Video. Exactly. What
19:14
is a video? A video is basically a
19:16
stream of color images. A color
19:19
video. So each element of that stream,
19:23
right? What the first dimension of the
19:25
tensor is which frame it is and then
19:28
everything else is the actual frame. So
19:31
the way I think about these tensors
19:34
always is
19:37
tensor, you
19:40
can think of a tensor as being this
19:42
array which has all these axes or
19:45
dimensions. This is the first one. This
19:48
is the second one. This is the third one.
19:51
Right? This is a tensor of rank four.
19:54
Okay? 1 2 3 4. And so if you have a
19:58
vector, right? So you can imagine if
20:02
it's just a vector, you can imagine the
20:03
vector actually living like this, just a
20:06
list of numbers, right?
20:10
But if it is just
20:14
a 2D, a rank-two tensor, right, which is
20:16
just like that right which is just like
20:19
that
20:21
so this thing becomes you know like that
20:24
and that thing becomes like that. So for
20:26
example, if this is a 7 by 3, that means
20:29
that there are
20:31
seven rows and three columns.
20:35
So you get the idea. So the way you
20:36
think about tensor is always as if this
20:38
open square bracket a bunch of things a
20:40
closed square bracket and that's really
20:42
what a tensor object is. So what that
20:44
means is that anytime you have a tensor
20:48
right anytime you have a tensor however
20:49
complicated it is you can always create
20:52
a more complicated tensor by
20:54
taking a list of those tensors. Let's
20:56
say that you have a list of videos
20:59
each video is a rank four tensor so
21:02
which means a list of videos is what
21:04
rank
21:05
Exactly. So a tensor of rank, say, 10 is
21:10
just a list of rank nine tensors.
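The same idea in numpy shapes; the sizes below are made up for illustration:

    import numpy as np

    video = np.zeros((120, 256, 256, 3))  # rank 4: (frame, height, width, channel)
    videos = np.stack([video] * 8)        # a list of videos: rank 5
    print(videos.ndim, videos.shape)      # 5 (8, 120, 256, 256, 3)
    print(videos[0].ndim)                 # 4: each element is one rank-four video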
21:15
So that is this that is the most
21:17
important thing you need to understand
21:18
about tensors. So at any point in time
21:20
if I give you a tensor you can just
21:22
iterate through the first dimension of
21:24
it, the first axis of it, as you
21:27
go through each one of these values. So
21:29
for example here um
21:32
yeah that can do it.
21:35
So
21:39
so if you have this tensor here
21:42
and if you want to create a more
21:43
complicated tensor no problem.
21:46
So you add another dimension here. Okay.
21:52
Now this dimension, let's
21:54
say has nine values.
21:58
one to nine. So you put zero here
22:00
and then what do you get? This whole
22:02
tensor is a rank four tensor. And you
22:04
put a one here, it's another rank four
22:06
tensor. You put a two here, another rank
22:08
four tensor. So every tensor, you take
22:11
the first element, it's just a list, but
22:14
it's a list of the next downrank tensor.
22:18
Okay. Now this tensor concept is
22:20
actually something Einstein came up
22:21
with. And so it's simultaneously
22:26
kind of easy to understand and also
22:28
slippery. So I would actually encourage
22:30
you to read the book which has a really
22:32
good discussion of tensors and the more
22:33
you practice with it the easier it'll
22:35
get. Okay. So if you feel you kind of
22:38
understood but not quite you're not
22:39
alone. It happens to all of us right?
22:42
You have to pay the price or go through
22:43
the crucible. Okay. Okay. All right.
22:48
So to come back to this
22:51
that's what we have
22:55
and we already talked about a rank four
22:56
tensor: it's a video. So Section 2.2 of the text
22:59
has a lot more detail. You should
23:00
definitely read it. So here, TensorFlow
23:05
is a library and as you can imagine
23:08
neural networks tensors come in and go
23:10
through the network and go out the other
23:11
end right and since tensors capture
23:14
everything numbers lists uh tables and
23:16
so on and so forth it's just tensors
23:18
flowing from input to output hence it's
23:20
called tensorflow and it gives you a
23:22
couple of things which are really really
23:23
important which is why we use it. The
23:25
first one is that it'll automatically
23:27
calculate gradients for you of
23:30
arbitrarily complicated loss functions.
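For instance, a minimal sketch of that automatic gradient calculation with a toy loss (not the lecture's code):

    import tensorflow as tf

    w = tf.Variable(3.0)
    with tf.GradientTape() as tape:
        loss = (w - 1.0) ** 2         # stand-in for an arbitrarily complicated loss
    grad = tape.gradient(loss, w)     # TensorFlow applies the chain rule for you
    print(grad.numpy())               # 4.0, i.e. d/dw (w - 1)^2 = 2(w - 1)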
23:32
You don't have to calculate the gradient
23:34
because calculating the gradient is very
23:35
painful, right? It'll automatically
23:37
calculate the gradients for you. That's
23:39
the best part. You don't have to use the
23:40
chain rule. You don't do anything. The
23:42
second thing it'll do, it gives you all
23:44
these optimizers including SGD and all
23:46
its variations. So you don't have to
23:48
worry about the optimization itself.
23:49
It'll just you can just pick and choose
23:50
what you want. Third, if you have a lot
23:53
of servers, it'll actually take the
23:55
computational load and distribute it
23:56
across all those servers. People here
23:58
with the CS background know that
24:00
parallelizing computation is actually a
24:02
very difficult problem, right? There are
24:05
things which are called embarrassingly
24:06
parallel. Many things are not actually
24:09
quite tricky to figure it out. We don't
24:10
know how to figure it out. TensorFlow
24:11
will figure it out. Okay? And then
24:13
finally, I talked about the fact that
24:15
there are these things called GPUs,
24:17
graphics processing units, which are
24:18
parallel hardware. uh and so it'll even
24:21
if you have just one computer but it has
24:23
GPUs there's a particular way in which
24:26
you have to take your computation and
24:28
organize it to really exploit the fact
24:30
that you have a GPU and so TensorFlow
24:33
will actually do it for you out of the
24:35
box automatically you don't have to
24:36
worry about any of that stuff okay so
24:38
those are all the advantages of this
24:39
thing by the way TPU is called a tensor
24:41
processing unit. It's something,
24:43
kind of... you can think of it as Google's
24:45
GPU, right? They came up with their own
24:47
variation on the theme. Okay. Now, Keras
24:50
sits on top of TensorFlow, right?
24:52
TensorFlow, this is the this is the
24:53
hardware you have. TensorFlow sits on
24:56
top of the hardware. Keras sits on top
24:58
of TensorFlow and it basically gives you
25:01
a whole bunch of convenience features.
25:02
So, for example, it gives you the notion
25:04
of a layer, right? We already saw
25:07
keras.layers.Dense is a dense layer, right? It
25:10
gives you the notion of a layer. It
25:11
gives you the notion of activation
25:12
functions and so on and so forth. It
25:14
gives you easy ways to pre-process the
25:16
data, easy ways to train the model,
25:18
report on metrics, you know, calculate
25:20
validation loss, validation accuracy,
25:21
training loss, all the metrics we care
25:23
about. And then it also gives you a
25:25
whole library of pre-trained models that
25:26
you can just use and adapt for your
25:28
particular problem. So it gives you a
25:30
whole bunch of conveniences and that's
25:32
why it's very popular. And by the way,
25:34
you know, many of you might also be
25:35
familiar with PyTorch, which is a
25:37
fantastic framework as well for deep
25:38
learning. And the reason we chose to go
25:41
with TensorFlow for this course rather
25:42
than PyTorch is because we wanted to
25:45
make the course uh sort of accessible to
25:48
folks who don't have a ton of
25:49
programming background before coming to
25:51
the class. And PyTorch is a bit more
25:53
demanding from a CS perspective. It
25:55
requires more knowledge of
25:56
object-oriented programming. Uh which is
25:58
why we decided to go with TensorFlow and
25:59
Keras, because I think it's actually as
26:02
powerful uh in many ways and it's a
26:04
little easier to get going. Okay, so
26:07
that's what we have here. And one other
26:09
thing I will mention is that there are
26:10
three ways in which you can use Keras.
26:12
There are three kinds of APIs.
26:14
Sequential, functional, subclassing. And
26:16
we'll almost exclusively use the
26:18
functional API. Okay. And in fact, the
26:21
model we built for heart disease
26:22
prediction uses the functional API. And
26:24
so just read 7.2.2 of the textbook to
26:26
understand in detail how the API works.
26:28
I find in my own work, the functional
26:30
API is basically all I need. I don't
26:32
need to do anything more complicated
26:33
than that. And as you will see as
26:35
you work on the homeworks uh and on your
26:37
project, it's sort of a
26:39
beautifully designed Lego block
26:41
environment for doing these things and
26:43
you can create very complicated models
26:45
very easily. Okay. Uh there's a whole
26:48
bunch of stuff here on these websites.
26:50
So check them out. There's lots of
26:51
Colabs available. So now
26:55
if you go back to the neural model for
26:57
heart disease prediction, this is what
26:58
we came up with in the last class,
26:59
right? We had an input layer, one
27:02
dense layer with 16 neurons, ReLU
27:04
neurons, an output layer with the
27:05
sigmoid and then boom, that was a model.
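As a sketch, that model in the Keras functional API looks something like this; the layer names are cosmetic, as the lecture notes a bit later:

    from tensorflow import keras

    inputs = keras.Input(shape=(29,), name="features")
    hidden = keras.layers.Dense(16, activation="relu", name="hidden")(inputs)
    outputs = keras.layers.Dense(1, activation="sigmoid", name="output")(hidden)
    model = keras.Model(inputs=inputs, outputs=outputs)
    model.summary()   # 29*16 + 16 = 480, plus 16*1 + 1 = 17, i.e. 497 parameters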
27:08
So let's train this model. Uh and so the
27:10
training checklist is that we have
27:13
already done this: hidden layer of 16
27:14
neurons, output sigmoid. We need to use an
27:17
appropriate loss function based on the
27:19
type of output. What loss function
27:20
should we use?
27:23
What is the output here?
27:26
It's a binary classification problem. So
27:28
what should the the loss function be?
27:33
Kind of heard it somewhere? Just shout it
27:35
out.
27:37
No, the output is a sigmoid. The loss
27:40
function? Binary
27:43
cross entropy.
27:44
Okay, remember if if you're predicting a
27:46
number an arbitrary number, you can use
27:48
something like mean square error. If
27:50
you're predicting a probability which
27:52
has to be compared to a 0/1 output, which
27:55
is what binary classification is all
27:56
about, we use binary cross entropy.
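For reference, for a true label y in {0, 1} and a predicted probability p, the binary cross entropy loss is

    loss = -[ y * log(p) + (1 - y) * log(1 - p) ]

which is small when p is close to y and blows up when the model is confidently wrong.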
27:59
Okay, so that's what we do here. So we
28:01
do binary cross entropy
28:03
and then we will go with Adam, right?
28:06
And then we'll use early stopping to
28:08
make sure we don't overfit. Okay, I
28:10
know, I promise, this is
28:12
literally the last slide before I go
28:13
to the Colab. I feel like one of those
28:16
used car ads: but wait, there is more.
28:19
So anyway, don't worry if you
28:23
don't understand every detail of what
28:24
I'm going to go through. I'm going to
28:26
link to the Colab as soon as the class
28:27
is over. But once you get your hands on
28:29
the Colab, make sure you actually go
28:31
through every line in the Colab. What I
28:33
typically do when I'm trying to learn
28:34
something new is... I could just cut and
28:36
paste, right? I won't do that. I won't
28:39
actually cut and paste the code and run
28:41
it myself. I will retype the code. If
28:44
you retype the code as opposed to
28:45
cutting and pasting, trust me, you'll
28:46
learn a lot more. Right? So I strongly
28:48
encourage you to do it that way.
28:52
And so for all the Colabs we're going
28:54
to publish in the class, the first
28:56
thing you should do is you should just
28:57
make your own copy of the notebook,
29:00
right? Copy to drive. And then if you're
29:02
using anything other than today's
29:04
Colab, right, anything involving
29:06
natural language processing or vision,
29:08
you probably should use a GPU. So just
29:10
go in here, choose the runtime
29:13
to be a GPU. Um and then you start your
29:15
notebook and you're done. And the second
29:17
time onwards, you can just go directly
29:19
to this step. You don't have to do all
29:21
this stuff for that particular notebook.
29:23
And there are numerous tutorials like
29:24
five minute videos and so on on how to
29:26
use collab. Just just do that. I'm not
29:27
going to spend time on it here.
29:30
All right. Okay. So, uh I just ran it um
29:33
a few hours ago. I'm not going to run
29:35
every cell now because it's going to
29:37
take some time. It's going to get in the
29:38
way of the class time, but I'm going to
29:39
just like, you know, go through it
29:40
slowly and explain what's going on. So,
29:43
here this is just an introduction to the
29:45
data set. We already saw this
29:46
introduction last week. We
29:49
have, what, 303 patients, heart
29:51
patients. We have a whole bunch of uh
29:54
variables here, age, demographics, and a
29:57
whole bunch of biomarker information.
29:59
And this is a target variable. Okay? Uh
30:02
zero or one, heart disease, yes or no.
30:05
And so, by the way, just some technical
30:07
preliminaries here. Basically,
30:10
every time we load these things, we're
30:12
actually going to load these packages.
30:13
So you can see here these are the two
30:15
key things we need to do. We import
30:16
tensorflow first and then from within
30:18
tensorflow we import keras. Okay that's
30:21
what these two lines do here. Okay. And
30:23
then and folks who have done data
30:25
science and machine learning a bit
30:26
before, you'll know this. We will
30:28
actually load
30:30
the three packages that are most
30:32
commonly used, right, which is numpy,
30:34
pandas, and matplotlib. Numpy
30:37
because it's very easy for manipulating
30:39
matrices and arrays and tensors. uh
30:42
pandas because often times you get some
30:44
data in from somewhere you need to
30:46
massage it and wrangle it to a point
30:48
where we can actually feed it into Keras,
30:49
so you need pandas for that. And matplotlib,
30:51
because you just want to plot, you
30:53
know uh these loss curves and accuracy
30:55
curves to see whether early stopping is
30:57
needed okay so that's why we use it uh
31:00
so we import all these things and then I
31:02
guess the other thing you have to
31:03
remember is that when we are training
31:04
these deep learning models uh there is
31:06
randomness in the process which enters
31:08
in a few different places so clearly the
31:11
starting values for these weights,
31:13
the
31:14
weights are going to be randomly
31:15
initialized. Uh and therefore that
31:17
that's obviously a source of randomness.
31:19
Now we talked about how,
31:22
when you're doing stochastic gradient
31:23
descent you take all the data and then
31:25
you randomly choose batches right from
31:28
this data till we finish a whole pass
31:29
through it. Well that immediately raised
31:32
the question well well what do you mean
31:33
by randomly choose? So typically what we
31:35
do in practice, and Keras will take
31:37
care of all this for you. um you
31:39
basically take the data and just shuffle
31:40
it once randomly and then you just go
31:42
first 32 next 32 next 32 next 32 like
31:45
that okay but it is a source of
31:47
randomness and then when we split the
31:49
data into train validation testing and
31:51
so on uh particularly if you want to
31:53
look for early stopping and overfitting
31:55
uh we need to again split the data
31:56
randomly and that's another source of
31:58
randomness and then when we do dropout
32:01
which we'll talk about on Wednesday
32:02
again dropout has a little bit of a
32:05
random element to it and so that's
32:06
another source of randomness. So
32:09
all of it all this means is that if
32:11
you're working with these models and if
32:13
you want to build a model and you want
32:14
to hand it off to someone so that they
32:16
can reproduce your results well you
32:17
better make sure that you sort of you
32:19
know make it easy for them to replicate
32:21
what you have and the way you do it is
32:22
by sending a setting a random seat for
32:24
all these things okay and the way you do
32:26
it is by having this little handy
32:28
function here set random seat uh and of
32:31
course you know I use 42 tool like just
32:32
like everybody should right so okay so
32:35
that's that uh by the way just that's
32:38
just a popculture reference to this book
32:39
called The Hitchhiker's Guide to the
32:40
Galaxy.
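The seeding step itself is one line; a minimal sketch:

    from tensorflow import keras

    keras.utils.set_random_seed(42)   # seeds Python, numpy, and TensorFlow at once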
32:43
>> Number 42 and you'll know what I mean.
32:45
Okay, so by the way, um the question
32:47
inevitably comes at this point, okay, if
32:49
we do exactly this, will you actually
32:51
get the exact same numbers that you have
32:52
in your version uh of the notebook? And
32:55
the answer is hopefully most of the
32:57
time, but it's not guaranteed. So this
32:59
is called bitwise reproducibility. It's
33:01
not guaranteed due to certain hardware
33:03
things and device drivers and stuff like
33:05
that. So we won't get into all that
33:07
stuff. uh and which is why as you see
33:09
here uh I have a bit of a fingers
33:11
crossed thing. Okay. All right. Cool. So
33:14
that's what we have. Um so as it turns
33:16
out, François Chollet, who wrote the
33:18
textbook, actually made
33:20
this data available in a pandas data
33:21
frame. So we read the CSV file into this
33:24
data frame right there.
33:26
And it's 303 rows, 14 columns, right,
33:30
and you can see here we'll take a look
33:32
at the first few rows. Uh and these are
33:34
all the columns: age, gender, cholesterol,
33:36
blah blah blah blah blah. And then this
33:38
is the target variable right there.
33:41
And one of the first things I always
33:42
do when I'm working with a binary
33:44
classification problem is to quickly
33:45
check whether the positive and negative
33:47
classes are balanced or not. And so what
33:49
you can do is you can just quickly check
33:51
to see what percent of the data points
33:52
is zero versus one. And you can see here
33:55
uh 72.6%
33:57
of the patients don't have heart
33:59
disease. That's a good thing of course.
34:00
And then 27.4% have heart disease. So
34:03
it's not bad. It's not 50/50 or roughly
34:05
50/50; it's a bit skewed. So, by the
34:08
way, quick question: what is a good
34:11
baseline model for this problem? Suppose
34:13
you couldn't use anything any
34:14
complicated thing. What's a good
34:15
baseline model?
34:22
>> Yes. Just predict zero.
34:24
>> Yeah. And why would you do that?
34:25
>> Uh, it would give you a 72.6% accuracy.
34:28
Exactly. Because 72.6% is the
34:31
class with the
34:33
higher percentage; you just predict it,
34:35
you'll be right on those 72.6% of the
34:37
cases, you'll be wrong on the rest, which
34:38
means that the accuracy of this model
34:41
is going to be 72.6%.
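That check and the baseline, as a sketch; df and the column name target are assumptions about the notebook's names:

    # Fraction of patients in each class: roughly 0.726 vs 0.274 here.
    print(df["target"].value_counts(normalize=True))

    # Always predicting the majority class (0) is right 72.6% of the time,
    # so any real model has to beat that accuracy.
    baseline_accuracy = (df["target"] == 0).mean()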
34:43
Okay. And so any fancy model we build
34:46
had better, you know, do better
34:48
than this, otherwise it's not worth its
34:49
weight in layers. So, all right, so
34:51
we'll come back to this later. So the
34:53
first thing we want to do is we want to
34:54
pre-process it because this data set has
34:56
both categorical variables and numeric
34:58
variables. Um and so it's usually
35:01
convenient to just to group them into
35:03
two different groups. So I have listed
35:05
all the categorical variables here and
35:06
the numeric here. Uh and then we have
35:09
the pre-processing here. We have to take
35:11
the categorical variables and we have to
35:12
one hot encode them. And the reason is
35:15
that unlike say a decision tree model, a
35:17
neural network cannot handle uh
35:20
categorical inputs directly. It can only
35:22
handle numeric inputs. Which means that
35:24
we have to numericalize every
35:26
categorical thing that comes in. And the
35:28
st there are many ways to do it but the
35:29
standard way to do it is one hot
35:31
encoding. Um and for the numeric
35:33
variables we need to normalize them and
35:35
I'll come to that in a second. So pandas
35:37
has this get_dummies function here, and
35:40
you can just run this thing and it'll
35:41
just one-hot encode the whole thing. So once
35:44
you do that this is what you have. So
35:45
you can see here, previously, let's say
35:49
thal had three values, fixed, normal,
35:52
reversible, or something, and then you go
35:54
to the one-hot encoded version, and now
35:56
we can see here thal_fixed, thal_normal,
36:00
thal_reversible. That's three columns, right?
36:02
That's the one-hot encoding in action.
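A sketch of that step with pandas; the column name follows the example above:

    import pandas as pd

    # 'thal' has three values (fixed, normal, reversible), so get_dummies
    # replaces it with three 0/1 columns: thal_fixed, thal_normal, thal_reversible.
    df = pd.get_dummies(df, columns=["thal"])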
36:04
okay now the other thing to remember is
36:07
that neural networks work best when the
36:09
numeric inputs you send them are all in
36:12
a relatively small range they shouldn't
36:13
have a wide range of variation
36:15
Um and so the standard practice is to
36:18
standardize the numerical variables. By
36:20
standardize, I mean typically subtract
36:22
the mean, divide by the standard
36:23
deviation. Um we should do that. But
36:26
before we do so, we should split the
36:27
data into a training set and a test set,
36:30
right? And why do we want to split into
36:32
a test set? Because at the very end once
36:33
we've built the model and done all the
36:35
things we want to do with it, we finally
36:36
want to take out the test set and
36:38
evaluate it once so that we get this
36:41
true measure of how it's going to
36:43
perform in the wild after you deploy it.
36:46
Okay. So you want to divide it,
36:48
say, 80% training and 20% test set. So
36:51
the question is why should we do the
36:53
splitting now before we do the
36:54
normalization? Why can't we just do the
36:57
normalization and then do the splitting?
37:02
Um all right
37:06
>> because then your uh validation set is
37:09
also somewhat dependent on your test set
37:11
results as well as the mean of the test
37:13
set.
37:13
>> Correct. Because the modeling process has now
37:16
essentially sort of been influenced
37:18
by the test set, right? The
37:21
splitting is part of the
37:23
modeling process, and so is the
37:25
standardization. And
37:27
if the standardization,
37:28
which is part
37:30
of the process, uses information about
37:32
the test set, well, the test set is not
37:34
really kept away from anything, is it?
37:37
That's why we want to split it: lock away
37:39
the test set somewhere and then proceed
37:41
with the modeling. Again, this is
37:43
like machine learning 101 which is why
37:44
I'm going through it pretty fast uh okay
37:47
so we do this sampling function,
37:50
take 20% of the data and make it the
37:53
test set and the remaining is going to
37:55
be the training set. And when we do
37:56
that, you can see the training set is
37:58
now 242
38:00
rows, while the test set is 61 rows.
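A sketch of that split and of the standardization that follows; df and numeric_cols (the list of numeric columns grouped earlier) are assumed variable names:

    # Lock away 20% of the rows as the test set before any normalization.
    test = df.sample(frac=0.2, random_state=42)
    train = df.drop(test.index)

    mean = train[numeric_cols].mean()   # statistics from the training data only
    std = train[numeric_cols].std()
    train[numeric_cols] = (train[numeric_cols] - mean) / std
    test[numeric_cols] = (test[numeric_cols] - mean) / std   # reuse train stats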
38:05
and any of these data frames, you'll
38:07
know that the shape attribute gives
38:08
you the dimensions, the number of rows
38:10
and columns. That's what we're doing
38:12
here. And now that we have done that, we
38:14
have done the split, we can calculate
38:15
the mean and the standard
38:16
deviation. So I calculate the mean here.
38:18
I calculate standard deviation. And
38:20
these are all the means. And once I do
38:21
that, I just do you know each column
38:24
minus the mean divide the standard
38:26
deviation. And then once I do that I get
38:28
I save them in the train and the test
38:30
data frames. And you can see here now
38:32
all the numbers are very sort of
38:33
smallish, 0, 1, minus 1, kind of around
38:36
that range, and that's kind of ideal when
38:38
you're training a network. Okay. All
38:40
right. Right. So at this point the data
38:42
is entirely numeric, and then we are
38:44
almost ready to feed it into Keras.
38:46
And the way you do it is you take a
38:48
pandas data
38:51
frame and then you convert it into a
38:52
numpy array, and then Keras is happy to
38:54
receive it. So
38:56
we use this method called to_numpy, which
39:00
I think is as descriptive as it gets in
39:01
programming. And then you save it as
39:04
train and test. Now train and test are
39:05
two numpy arrays with exactly the same
39:08
information, and now we can feed it into
39:09
Keras. All right. Now I guess there's one
39:12
other thing we need to do which is that
39:13
in these data frames, train and test, our
39:17
independent variables, all the features,
39:18
as well as the target, the 0/1 target,
39:20
they're all in there,
39:23
right? And we need to now take
39:25
the dependent variable, the
39:27
0/1 column, split it out, and keep the
39:29
x and the y separately. Right? That's
39:32
the whole point of it, right? Because
39:33
you need to feed the X, do the
39:34
prediction, and then compare it to the
39:36
actual Y and calculate the loss and so
39:38
on and so forth. So, uh, so the target
39:41
column is our Y variable, and it's
39:43
column number six from the left. If you
39:45
count it, you can see it. So, we just,
39:47
you know, delete it from
39:49
the train and test. And now we have
39:53
242 rows and 29 columns, 29 features.
39:56
You will recall from the network that we
39:58
made way back, it had 29 inputs, right?
40:01
29 nodes in the input layer. And that's
40:03
where the 29 is coming from. And so now
40:06
uh we just select the sixth column which
40:07
is the target and make it the Y variable
40:09
right train Y and test Y. And that is of
40:12
course a vector which is 242 long in the
40:14
training set and 61 long in the test set.
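A sketch of that separation; the column name target and the use of pop (which removes the column and returns it) are assumptions about the notebook:

    train_y = train.pop("target").to_numpy()   # 0/1 labels, length 242
    test_y = test.pop("target").to_numpy()     # length 61
    train_x = train.to_numpy()                 # shape (242, 29)
    test_x = test.to_numpy()                   # shape (61, 29)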
40:16
So at this point all we have done is,
40:19
to be honest, boring pre-processing. Okay,
40:21
we haven't actually gotten to the action
40:22
yet. Finally, let's do something. So um
40:26
and we start with a single hidden layer.
40:29
Since it's a binary classification
40:30
problem, we'll use sigmoids as we saw
40:31
earlier. And this is the model we
40:34
created in the last class. This
40:36
is the model we created. Okay. The only
40:39
difference between that model and this
40:41
model is that I've actually given names
40:43
to these layers. And this name thing is
40:45
totally optional. Right? If you want to
40:47
give a name, give a name. It's just a
40:48
little easier to interpret later on.
40:50
Okay? It's just cosmetic. Okay? So, uh,
40:53
but I've just put it here. And once
40:55
you build the model, you should
40:57
immediately run the model.summary()
40:59
command because it gives you a nice
41:01
overview of the model right what are for
41:04
each layer it tells you what the layer
41:05
is it tells you what's coming into the
41:07
layer meaning the shape of the tensor
41:09
that's coming in and what's going out
41:11
and how many parameters the layer has
41:13
and it turns out this layer has sorry
41:16
this network has 497 parameters. Okay,
41:20
and as I have told you repeatedly, the first
41:22
few times, just hand-calculate the
41:24
number of parameters to make sure it
41:25
verifies. So we should just make sure
41:27
that it is in fact 497. So let's hand
41:30
calculate it. It's
41:32
basically what's going on here: 29
41:34
inputs times 16, right? All the arrows, 29
41:37
* 16 arrows, right? And then you have a
41:40
bias of another 16. That's why you have
41:42
this expression. And then the next one
41:43
is 16 * 1 plus one bias for the output
41:46
sigmoid and you get to 497. Okay? Just
41:49
make sure you follow this later on when
41:50
you work with the Colab. We did this
41:53
in class last week and you can visualize
41:55
the network graphically as well by using
41:56
the plot model function. So we do that
41:59
here. Um and let's say it gives you the
42:02
same information but in a slightly
42:03
easier form to consume and when we work
42:06
with larger networks starting on
42:07
Wednesday you will see that being able
42:09
to visualize the topology of the network
42:11
is actually quite handy. Okay, we
42:13
finally come to uh actually trying to
42:16
train this thing and so what loss
42:18
function should we use? We
42:20
need to use binary cross entropy, right
42:23
there. What optimizer to use? Well, as I
42:26
mentioned earlier, uh we'll use Adam.
42:29
Adam.
42:32
All right, Adam. And then
42:35
the final thing is you can ask Keras
42:37
to report out whatever metrics you care
42:39
about. These metrics are not going to be
42:41
used in any optimization; it's
42:42
just reporting to you. And the most
42:45
common thing people report out for
42:46
binary classification is accuracy. So
42:49
we'll just go with that metric. Um and
42:51
so what we do is we tell Keras: take
42:54
the model we just built and compile it
42:56
with this choice of optimizer this
42:58
choice of loss function and these
43:00
metrics. And this compilation step what
43:02
it does is it essentially Keras will
43:04
take this information and take the model
43:06
you have built and it'll reorganize the
43:08
model in such a way that the parallel
43:11
computing uh distribution of computing
43:13
across many servers and so on. That's
43:16
that's what's happening in the compile
43:17
step. Organizing it so that reorganizing
43:20
the model so that it becomes amendable
43:21
to parallelization and distribution.
43:23
That's what's going on. That's why you
43:25
actually have to do something called the
43:26
compile step. Okay. And once we do that,
43:28
we have finally finally ready to train
43:30
the model. And to do that uh we have to
43:34
decide what the batch size is that we're
43:36
going to use. Remember, we're using some
43:37
flavor of SGD, which means we have to
43:38
choose what is the bat size. And
43:40
typically what people do is that uh 32
43:43
is a good default for the batch size.
43:45
Like if you don't if you're not just
43:46
getting started with something, just use
43:47
32. Uh and there's a whole bunch of
43:49
literature on what the right batch size
43:51
should be for the number of data points
43:53
you have, the size of the network and so
43:55
on and so forth. My philosophy is start
43:56
with 32. Um and you can always try 32,
43:59
64, 128. It's kind of like, you know,
44:02
oftentimes what people tell me,
44:04
researchers tell me is that just use the
44:05
biggest batch size that doesn't make
44:07
your machine die.
44:09
Right? If you can fit into memory, it's
44:11
probably good. Just try the biggest
44:12
size. We'll just start with 32. It's
44:13
just a tiny problem. It's not a big
44:15
deal. And then we also have to decide
44:16
how many epochs through the data do we
44:19
want to go through, right? How many
44:21
epochs? And uh you know, usually 20 to
44:24
30 epochs is a good starting point. Um
44:26
and then because this is a tiny problem
44:28
just for kicks, I decided to run it for
44:29
300 epochs. Uh just to see if anything
44:31
any overfitting is going to happen. Uh
44:33
and then whether we want to use a
44:34
validation set. Of course, we want to
44:36
use a validation set. Uh right. So we
44:38
will use 20% of the data points as a
44:40
validation set so that we can look for
44:42
overfitting underfitting.
44:44
All right. So with these decisions made
44:46
we finally use the model.fit
44:49
command. Model.fit is what actually
44:51
trains the neural network. Okay. And you
44:55
have to tell it what the x
44:58
tensor is. You have to tell it what the
45:00
dependent variable y tensor is. We need
45:03
to tell it how many epochs to do this.
45:05
What batch size to use. verbose=1
45:07
just means, you know, put a
45:09
lot of descriptive output as you do this
45:11
thing and then validation split means
45:13
you know take 20% of the training data
45:16
and set it aside as your validation data
45:18
set. Don't use it for training because I
45:20
want to measure overfitting using that.
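A sketch of that fit call, together with the plotting of the returned history that comes up a little later; train_x and train_y follow the naming assumed above:

    import matplotlib.pyplot as plt

    history = model.fit(
        train_x, train_y,
        epochs=300,
        batch_size=32,
        verbose=1,
        validation_split=0.2,   # hold out 20% of the training data
    )

    # history.history is a dict of per-epoch metrics; plot it to spot overfitting.
    plt.plot(history.history["loss"], label="training loss")
    plt.plot(history.history["val_loss"], label="validation loss")
    plt.legend()
    plt.show()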
45:22
So that's it. So you do that thing,
45:24
it'll run for 300 epochs, and this is the
45:26
reason why you know I decided to just
45:28
not actually run it in class. Um and so
45:31
you keep on doing it gives you a lot of
45:33
output and finally
45:36
we reach the end.
45:41
Okay. Now let's take a moment to
45:43
understand what's being reported. So
45:44
I'll just take this one line here. So
45:46
there
45:49
is a pair of lines for each epoch. And
45:51
then here it's telling you uh you know
45:53
it actually... in this 300th
45:56
epoch it used seven batches, seven out of
46:01
seven batches, right? So it used seven
46:02
batches, and you will recall from
46:05
the math we did in the class that it's
46:06
actually seven batches, where the first
46:08
six batches are 32 and the last batch is
46:10
just a couple of examples, but we have
46:12
seven batches, right? This is the 193 divided by
46:15
32, rounded up. Okay, so that's why we have
46:19
seven here. And then it tells you how
46:20
long it took for that, and then
46:22
this is the loss value, the
46:24
binary cross entropy loss value on the
46:26
training set, right, on that particular
46:29
batch, that it calculated. This
46:32
is the accuracy that you asked it to
46:33
report out: 98.4%, 98.5% accuracy on
46:36
that batch. And then at the end of
46:39
this epoch, using whatever weights were
46:42
available in that network, it actually
46:44
calculates the loss on the validation set,
46:46
which is the 20% of the data we have set
46:48
aside, and then this is the accuracy
46:50
on that validation set. Okay, so that's
46:53
what each of these numbers mean now
46:55
looking at this wall of numbers is kind
46:57
of painful so usually you just plot it
47:00
So the way you do that is, if you
47:02
notice here... okay, I'm not
47:04
going to go back here. I said history
47:06
equals model.fit, blah blah blah blah
47:08
blah. And that history object has a lot
47:10
of information that we can use for
47:12
plotting and diagnostics and so on. And
47:14
that history object has
47:18
an attribute, also called history,
47:19
history.history, which is a dictionary
47:21
with all these values, and that's what
47:23
we're going to plot. Was there a
47:24
question here? Yeah.
47:25
>> So you prompted it to keep aside data
47:28
for validation but didn't we already
47:30
keep a test set? So that's going to be a
47:33
secondary validation, right?
47:34
>> So basically we have a training uh and
47:37
then a validation and a test. The role
47:40
of the validation set is to figure out
47:42
things like early stopping. Should we
47:43
stop here? Should we go back? And as you
47:45
will see later on, if we use
47:46
hyperparameters, you know, we'll try
47:48
different values of the hyperparameters
47:50
and figure out use the validation set to
47:52
figure out which one is the best one.
47:53
But once we are done with all that, we
47:55
will finally have a model. At that
47:57
point, we open the safe, take out the
47:59
test set and use it just once with your
48:02
final final model. Not because you want
48:04
to improve the model, but because you
48:05
want to have a realistic idea how it'll
48:07
do when you actually deploy it out in
48:08
the real world.
48:11
>> Uh yeah.
48:13
>> Uh, can we, instead of accuracy,
48:17
use other metrics to
48:20
evaluate whether to—
48:21
>> Absolutely.
48:23
>> Like a confusion matrix, let's say?
48:24
>> Yeah, you can you can do whatever you
48:25
want. You can use like I said it's not
48:27
used for training so there is no
48:29
mathematical implication to what you choose,
48:31
right? You can choose error rate,
48:32
accuracy, F1, F-beta — you can do whatever
48:35
you want and keras as you will see has
48:37
this dizzying list of possible metrics
48:39
you can use for reporting the key thing
48:41
to remember is you're just reporting
48:43
these metrics you're not actually using
48:44
them for any training
48:47
yeah
48:49
>> uh my question is with respect to
48:50
validation like uh we've got a training
48:52
data set — so when we take out 20% as
48:55
the data for validation,
48:58
are we taking it out from the training set
49:00
at that level, or do we
49:02
go to each batch and take out 20% from
49:04
the batch?
49:04
>> No, we're taking it out from the
49:05
training set.
49:06
>> So it means the number of
49:08
data points available
49:09
for forming the batches will
49:11
reduce.
49:12
>> Correct. And in [snorts] fact once we
49:13
take out the validation set,
49:15
whatever remains is 193.
49:17
>> Okay. And then we divide that into
49:18
batches — and then does the
49:21
validation data get drawn differently each
49:23
time? >> No. Once you take out the
49:25
validation set at the very beginning you
49:27
keep it aside and then you only evaluate
49:30
at the end of each epoch what your loss
49:33
and accuracy is on that validation set.
49:36
>> So you don't have cross validation.
49:37
>> No no we're not doing any of that stuff.
49:39
We're just taking it out once and we're
49:40
just evaluating the end of every epoch.
49:43
>> Okay. So
49:46
yeah. Okay.
49:50
49:53
>> So I know we both have asked similar
49:54
questions, but just to reconfirm. So here
49:56
my training model is giving me say a
49:59
loss of 0.0860.
50:01
My validation is giving me 0.660.
50:04
That means I've already crossed the bottom of the U.
50:07
So when I have to actually test the
50:11
model, that is the midpoint which I take,
50:13
and that will be the model which will get
50:14
deployed in production.
50:16
Correct. And as to okay, what do we do
50:19
to get that model? Do we actually have
50:20
to go back to the beginning and run
50:22
it for a few epochs or can we do
50:24
something smarter than that? We'll get
50:25
to that.
50:26
>> Yeah.
50:27
>> Is the validation set different for each
50:30
epoch or is it the same?
50:31
>> It's the same. So what you do is you
50:33
have a training set before you do any
50:35
training. You take out 20% of it, keep
50:37
it aside. You take whatever is left over
50:39
that you divide that into mini batches
50:41
and then start running it through each
50:43
epoch. But at the end of each epoch, you
50:45
just evaluate the quality of that
50:47
resulting model using the validation
50:49
set.
50:49
>> What's different between each epoch? Is
50:51
it just the way the
50:52
>> weights have changed?
50:53
>> Or is it the division into the
50:55
different batches?
50:56
>> No — the difference in each epoch
51:00
is that the weights have changed.
51:02
>> So after every mini batch, the weights
51:03
have changed. At the end of one epoch,
51:05
you've gone through all the data points
51:07
you ever had, right, in the training
51:09
set. And then you come back to the
51:10
beginning and you do it again.
51:17
How do you identify the sweet spot?
51:20
>> It's coming.
51:22
>> Yeah. All right. So, I'm going to keep
51:24
going. So, we have this here. And so,
51:27
you just I mean there's a little bit of
51:28
matplotlib code. So, what we do is we
51:31
just plot the training loss and the
51:33
validation loss as a function of the
51:35
number of epochs. Okay? And as you can
51:37
see here, the training loss is these
51:39
things here. And it's steadily going
51:41
down as you would expect. The validation
51:45
loss goes down here. And then at some
51:47
point it kind of flattens out and then
51:49
maybe gently starts to rise. Okay. So do
51:53
you think there's overfitting?
51:55
>> Right. There seems to be some level of
51:57
overfitting here. But the thing you have
51:59
to always remember is that the binary
52:01
cross entropy loss is a loss function
52:04
that is convenient for you because it
52:06
sort of captures the thing you want to
52:08
capture — the discrepancy — but also because
52:10
it's mathematically convenient. But what
52:13
you may actually care about in practice
52:15
is something like accuracy, right? So
52:18
that's why we're reporting out
52:19
the accuracy when we do these things. So
52:21
you should also plot the accuracy to see
52:23
what's going on and really you should
52:25
look at the accuracy and figure out
52:26
overfitting and underfitting and all
52:28
that stuff. So let's just do that. So I have
52:30
here uh overfitting.
52:34
Uh okay. So this is how it looks
52:35
for accuracy. Accuracy of course, as the
52:37
model gets you know as you do more and
52:38
more epochs, hopefully it gets better and
52:40
better for training. So you can see here
52:42
accuracy actually climbs all the way up
52:44
to the mid-90s — uh, right there, or the
52:47
low 90s here. The validation gets to
52:50
this point after like I don't know 50
52:52
epochs maybe and then it kind of
52:54
flattens out and then strangely it
52:56
climbs up again a bit later right so now
53:00
the fact that the accuracy actually got
53:03
better at the very end suggests that
53:06
maybe we can live with this overfitting
53:09
>> okay
53:10
>> right it's not the end of the world
53:12
right so you can so you can certainly
53:14
what you can do is you can go back and
53:16
say you know what no I'm going to be a
53:17
purist about this around 50 epochs or
53:20
so. I think that's when it actually
53:22
flattened out for loss. So you can just
53:24
go back and just restart the model and
53:26
run it only for 50 epochs, not 300 and
53:29
then stop and just use that model for
53:30
everything from that point on. Or you
53:31
can say, you know what, it's okay. I can
53:33
live with this thing. Uh and so that's
53:35
what we're going to do here. Let me just
53:36
stop for a second. There was a question.
53:39
>> Yeah,
53:40
>> for originally when we were starting
53:42
out, we were saying 20 to 30 epochs, but
53:44
we were going to do 300. 50 is over 20
53:46
to 30. So when it comes to validation,
53:49
if you run enough epochs, are you doing
53:51
like derivative calculations?
53:52
>> Oh, I see. No, that's a great question.
53:54
So the question is I said start with 20
53:56
and 30 epochs as a rule of thumb here,
53:58
I'm just going with 300. And because I'm
54:00
going with 300, I can actually see some
54:01
potential evidence of overfitting. But
54:03
if I had done only 20 to 30, maybe I
54:05
wouldn't have even seen that. What
54:06
happens next? Right? Is that the
54:07
question? Great question. So what you
54:09
should do is when you look at these
54:10
curves if at the end of 30 epochs you
54:13
find that the validation loss continues
54:15
to drop then you know maybe there is
54:18
more room for it to drop. So you
54:20
continue from that point on. The thing
54:21
about keras is that you can actually run
54:24
the fit command at that point
54:27
and it'll continue where it left off. It
54:29
won't go to the beginning again.
54:31
Right? So you can run 10. Okay. The
54:33
validation is still getting better and
54:34
better. Okay. Run for another 10. It's
54:36
getting better and better. Run for
54:38
another 10. Getting better and better.
54:39
Run for another 10. Oh, it starts to
54:40
climb up again. Okay, now I'm going to
54:41
back off. That's what you do.
54:44
All right. Now, all this manual stuff
54:47
I'm going through it just because to
54:48
build intuition, there are these things
54:50
called callbacks in Keras, which we'll get
54:52
to later on in which you can actually
54:54
tell it, hey, when the validation loss,
54:57
you know, uh, stops improving, stop
54:59
everything or when it stops improving,
55:02
save that model for me somewhere. So,
55:04
you don't have to go back and rerun
55:05
everything. It'll just have saved
55:07
it for you and you can just pick it up
55:08
and use it. Uh yeah.
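
A hedged sketch of the callbacks being described; the exact arguments (patience, file name) are illustrative choices, not taken from the lecture:

    from tensorflow import keras

    callbacks = [
        # stop training once validation loss stops improving
        keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                      restore_best_weights=True),
        # save the best model seen so far to disk
        keras.callbacks.ModelCheckpoint("best_model.keras", monitor="val_loss",
                                        save_best_only=True),
    ]
    history = model.fit(X_train, y_train, epochs=300, batch_size=32,
                        validation_split=0.2, callbacks=callbacks)
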
55:12
>> What's the intuition behind um the
55:15
accuracy continuing to improve when the
55:17
loss is getting higher?
55:19
>> Because accuracy and loss are related
55:21
but they're not the same thing. Uh in
55:23
particular, so it's a really good
55:25
question also kind of a profound
55:27
question because accuracy is a very
55:29
discrete measure, right? So if a
55:30
particular point we predict its
55:32
probability to be, say, 0.49, we're going to
55:34
say okay that's a zero no heart disease
55:37
but if it goes to 0.51 we're going to be
55:39
oh that's heart disease. So when you go
55:41
from 0.49 to 0.51 the binary cross
55:44
entropy loss will change very very
55:46
slightly but the accuracy will go from 0
55:48
to one — a dramatic jump. So it's very jumpy
55:51
and discrete, and that's why it tends to
55:53
be a proxy but sort of a crude proxy for
55:56
loss. That's part of the reason and I
55:58
can talk more offline.
56:01
Okay. So yeah,
56:04
>> you mentioned that if you are a purist,
56:06
you could stop at 50. Instead of
56:09
rerunning it and stopping there, I
56:12
was wondering: if you could see the
56:13
history of the model, take the weights at
56:15
epoch 50 and input them into your model, would it
56:18
be roughly the same, or would there be
56:21
certain differences?
56:22
>> You could try it. Yeah, you should just
56:24
try it because what happens is that
56:25
ultimately what we care about is how it
56:27
performs on the validation set. Right.
56:29
Here it appears to perform better on the
56:30
validation set, right? If you stop at 50,
56:33
but only for the loss; for accuracy,
56:34
actually, if you wait till the very end
56:36
it gets better. So my thrust tends to be
56:40
what is the measure that's closest to
56:41
the real world deployment.
56:44
It's accuracy. So I tend to go with
56:45
accuracy.
56:48
Binary cross entropy is a beautiful
56:50
proxy but an imperfect proxy for the
56:53
thing we actually care about in the real
56:54
world which is error rate and accuracy.
56:57
That's why I tend to plot both and if
56:59
accuracy is telling me one thing I kind
57:00
of tend to believe that
57:03
all right so um here that's what we have
57:07
so once we do all this we have a model
57:09
and now we want to evaluate it to see,
57:11
okay, if we actually deployed it, how good
57:13
is it going to be. So you use this thing
57:14
called the model.evaluate function (sketched
57:17
below). Now we
57:19
use the test X and the
57:21
test y data set which we split at the
57:23
very beginning and never used from
57:24
that point on, and we run it. And when I
57:27
ran it uh last night, it came up with a
57:29
83.6% accuracy for the model. And
57:33
remember our baseline model which just
57:35
predicts everybody is a zero is going to
57:36
have a 72.6% accuracy. And this little
57:39
neural network gives you 83.6%, which
57:41
is pretty good, right? So it's actually
57:45
beating the baseline
57:47
model, which is nice. Uh and I guess
57:49
there is something here about you know
57:50
the fact that we did a bunch of
57:52
pre-processing outside Keras and then we
57:53
send stuff into Keras. You can actually
57:55
do all this pre-processing inside Keras
57:57
automatically and there are layers for
57:58
that and I have linked to a bunch of
58:00
stuff here. So that's it as far as this
58:02
model is concerned. I know we went
58:03
through it really fast but please go
58:05
through it afterwards and make sure you
58:07
understand every single line. Change
58:09
each of these lines, rerun it, see how
58:11
the output changes. That's how we build
58:12
some intuition. Okay. All right.
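
The evaluation step sketched, assuming X_test and y_test are the split made at the very beginning of the notebook:

    # evaluate once, on the untouched test set
    test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
    print(f"test accuracy: {test_accuracy:.1%}")  # the lecture run reported 83.6%
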
58:15
computer vision
58:17
>> Uh, before you do —
58:20
>> just one question: is there a way
58:22
to build a model just to have fewer false
58:24
positives or fewer false negatives — or can't
58:27
you do that?
58:27
>> oh yeah yeah you can do that um but
58:29
there are — so you can report on all those
58:31
things very easily, but there are more
58:33
complex loss functions which will take
58:35
the asymmetry between false
58:38
positives and false negatives into account.
58:40
So the short answer is: it's possible,
58:43
yeah
58:45
All right. So, first let's just talk
58:46
about how do you represent an image
58:48
digitally. Okay. Uh and so these are how
58:52
grayscale images are represented.
58:54
Black and white images. So the basic
58:55
idea is very simple. Every picture
58:57
you have — every location in
58:59
that picture is a pixel, and the pixel
59:01
basically has a light intensity.
59:03
The amount of light at that location and
59:06
that light level is measured from zero
59:09
no light to blinding white light which
59:12
is 255. And so all the numbers here, if
59:16
you take this five for example, you can
59:18
see a lot of no light like all the black
59:20
regions, those are all zeros. Okay? And
59:23
then wherever there is white light,
59:24
there's a number and more the amount of
59:27
light, the closer it gets to 255. Okay?
59:29
In fact, if you just step back and
59:30
squint at this, you can actually see the
59:32
five.
59:33
Okay? So that's it. That's
59:35
how a black and white image is represented.
59:37
Very simple. Okay. Now, yeah.
59:42
microphone
59:43
>> just when you say amount of light what's
59:45
the unit that's being measured like what
59:47
do you mean
59:48
>> So here basically the
59:51
computer takes whatever — so when you
59:54
take an analog
59:56
picture, there's a process by
59:58
which you take that analog picture and
59:59
read it in, and it gets mapped to a scale
1:00:02
between 0 and 255 that's it that's all
1:00:04
so you can think of it as like a
1:00:05
relative scale a normalized scale
1:00:07
between 0 and 255 and so um it just
1:00:10
roughly maps to amount of light in that
1:00:12
location the exact like lumens to the
1:00:14
number mapping I don't know how they do
1:00:16
it; my guess is there are a number of
1:00:18
variations on that, but for our
1:00:20
purposes just think of it as it's a
1:00:22
normalized scale which runs from 0 to
1:00:24
255
1:00:26
all right, so
1:00:28
that's what's happening: every pixel is a
1:00:30
number between 0 and 255 — boom, boom. Okay, so
1:00:34
if you have a color image each pixel of
1:00:37
a colored image is represented by three
1:00:38
numbers uh And these numbers measure the
1:00:42
intensity of red light, blue light and
1:00:44
green light because red, blue and green
1:00:46
if you mix them in the right proportion
1:00:47
you can get whatever you want. Okay. So
1:00:50
uh and so each light intensity is still a
1:00:52
number between 0 and 255, and that's what
1:00:54
you have. Which means that now you have
1:00:56
three tables of numbers instead of one
1:00:58
table of numbers. And by the way just
1:01:00
some lingo here uh in the deep learning
1:01:02
world these uh colors RGB, red, blue,
1:01:05
green are sometimes referred to as
1:01:06
channels. Okay. All right. So this is
1:01:10
what we have here. This is a picture of
1:01:11
Kian Cord, and then if you take that
1:01:13
little patch here — the red table, the
1:01:16
green table and the blue table. So for
1:01:18
this picture, these three tables form a
1:01:21
tensor of rank what?
1:01:23
Good — rank three.
1:01:26
All right. Any questions on this?
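
A small NumPy illustration of that rank-three point; the shapes here are illustrative:

    import numpy as np

    img = np.zeros((28, 28, 3), dtype=np.uint8)  # 28 x 28 color image, all black
    img[0, 0] = [255, 0, 0]                      # top-left pixel: pure red
    print(img.shape)                             # (28, 28, 3) -> a rank-3 tensor
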
1:01:33
So the key task in computer vision
1:01:35
obviously the important thing is
1:01:37
image classification right uh the most
1:01:40
basic task if you will uh when you're
1:01:42
working with images is: you have an
1:01:43
image, and you take the image
1:01:45
and figure out — okay, you
1:01:46
have a list of possible objects the
1:01:48
image could contain and you're figuring
1:01:49
out okay which of these possible objects
1:01:51
exists in that image, right? The dog/cat
1:01:53
classification is like the canonical
1:01:54
example right that we all know and love
1:01:57
uh and that's what we will solve uh
1:01:59
later today and on Wednesday but there
1:02:01
are many other tasks that you need to
1:02:02
be aware of. So when you actually not
1:02:05
just classify an image, but you also
1:02:07
localize where in the image it is,
1:02:10
right? It's not just enough to say
1:02:11
sheep, you want to figure out where is
1:02:13
the sheep, right? And that's called
1:02:14
localization. And the way you do
1:02:16
localization is you put this little box
1:02:18
around it. And then you output not just
1:02:21
whether it's a, you know, sheep, yes or
1:02:23
no, but the coordinates of this box, the
1:02:26
top left, uh, and the bottom right, for
1:02:28
example, if you put the coordinates, you
1:02:29
can actually draw a box around it. So
1:02:31
you you output the numbers the
1:02:33
coordinates of where this box is in the
1:02:36
picture. Okay, this is called localization.
1:02:39
Now this is object detection where you
1:02:42
may have lots of objects going on and
1:02:45
you want to pick up every one of them
1:02:47
and you want to localize it.
1:02:49
Okay, this is object detection. So here
1:02:51
we have gone in there and said okay
1:02:53
sheep one, sheep two, sheep three and
1:02:55
each of these sheep has a little box
1:02:57
around it. Okay.
1:02:59
>> By the way, you know, self-driving
1:03:01
cars, the the camera vision system is
1:03:04
constantly scanning what's coming in
1:03:05
through the cameras and doing object
1:03:06
detection constantly, many times a
1:03:08
second,
1:03:09
>> right?
1:03:09
>> Pedestrian box, you know, zebra crossing
1:03:11
box, doggy box, stroller box, and so on
1:03:13
and so forth.
1:03:16
And then we have this thing called
1:03:17
semantic segmentation where we take
1:03:20
every pixel in the picture and classify
1:03:22
every pixel. We are not classifying the
1:03:24
whole picture, we're classifying every
1:03:26
pixel. So we are saying okay all these
1:03:28
gray pixels road all these pixels are
1:03:32
sheep and all these pixels are grass
1:03:34
every pixel is being classified.
1:03:37
So we are taking an image — instead of
1:03:39
giving one classification for the whole image,
1:03:42
for every pixel we are solving a multiclass
1:03:43
classification problem.
1:03:48
Okay, every pixel is classified. And
1:03:49
just when you think it can't get more
1:03:51
complicated than this,
1:03:53
we have something called instance
1:03:54
segmentation where not only are we
1:03:56
classifying every pixel, we are
1:03:58
distinguishing between the different
1:03:59
sheep.
1:04:01
So every pixel is classified and
1:04:04
different instances of the same category
1:04:06
need to be identified.
1:04:10
Okay. So these are all some of the most
1:04:12
sort of, I would say, most
1:04:14
prevalent
1:04:16
and useful categories of image
1:04:18
processing problems that are amenable to
1:04:20
a deep learning system.
1:04:23
All right. So let's go to image
1:04:25
classification and we're going to work
1:04:27
with this application called Fashion-MNIST.
1:04:28
Um
1:04:33
so the idea here is that you have
1:04:35
70,000 images of clothing items across
1:04:38
10 categories. you know like boots and
1:04:40
sweaters and t-shirts and you get the
1:04:43
idea — 10 categories of clothing. Um, we
1:04:45
have 70,000 images like this, and then
1:04:48
we'll build a network from scratch to
1:04:50
classify all these things uh you know
1:04:52
with pretty high accuracy. So these
1:04:54
classes by the way you know this is a
1:04:55
very balanced data set. So 10% of the
1:04:58
data is you know sweaters 10% is boots
1:04:59
and so on and so forth. So a naive
1:05:01
baseline model would give you what
1:05:03
accuracy
1:05:07
10%. Exactly. So we need to build
1:05:10
something that's better than 10% and I'm
1:05:12
glad to report that a simple neural
1:05:13
network can actually get you close to
1:05:14
90%.
1:05:18
Right? So this is the simple network
1:05:21
that we have. The input in this case is
1:05:24
a 28 x 28 picture.
1:05:28
It's a 28 x 28 picture. Uh and
1:05:33
so far we have been feeding vectors into
1:05:36
our neural network. Now we have a
1:05:38
picture which is 28 by 28. It's a tensor
1:05:40
of rank two, right? It's a table of
1:05:43
numbers. What do we do? How do we feed
1:05:45
that in?
1:05:51
>> It's a tensor— >> No, each image is a table
1:05:53
of numbers. Let's just take a single
1:05:54
image.
1:05:57
Like what do we do? What do we
1:05:59
do with this table?
1:06:01
Convert it into a vector. Exactly. And
1:06:04
that's called flattening. So we take
1:06:06
this table of numbers and we flatten it
1:06:08
into a vector. And so what we do is
1:06:11
uh let me just
1:06:13
Okay. So we have um
1:06:17
28 by 28.
1:06:20
So what we can do is we can take each
1:06:22
row right take this row and then write
1:06:25
it like that.
1:06:27
We take the second row oops
1:06:33
write it like that.
1:06:38
third row is here
1:06:41
like that. You get the idea. So you take
1:06:43
each row just rotate it and stack it all
1:06:45
up, right? And string them up. It
1:06:47
becomes one long vector. So this is called
1:06:49
flattening. Okay? So that's how you take
1:06:51
this thing and make it into one long
1:06:52
vector.
1:06:56
So when you do that 28 by 28 is what is
1:07:00
it?
1:07:03
784. So we get a vector.
1:07:07
This is the flattened input and you get
1:07:09
784.
1:07:11
Uh it's a vector that's 784 long.
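
A quick sketch of flattening in NumPy; in a Keras model the same thing is done by a Flatten layer:

    import numpy as np

    image = np.arange(28 * 28).reshape(28, 28)  # stand-in for a 28 x 28 image
    flat = image.reshape(-1)                    # rows strung into one long vector
    print(flat.shape)                           # (784,)
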
1:07:15
Okay. After the flattening, we have not
1:07:17
done anything complicated yet. We have
1:07:18
literally taken the numbers and just
1:07:19
reorganized them in a different way.
1:07:21
Okay. And once we do that, now we are
1:07:24
back in our familiar neural network
1:07:26
territory, right? We know how to work
1:07:27
with vectors. So, we just need to pass
1:07:29
it through a hidden layer, right? And
1:07:33
this hidden layer, we're going to use ReLU
1:07:35
neurons. And I tried a few different
1:07:37
values. And it turns out that 256
1:07:39
neurons does a really good job.
1:07:41
Okay? And so, I'm going to use 256
1:07:43
neurons here. And then we need to now
1:07:46
think about what the output layer should
1:07:48
be. Now we run into a problem
1:07:51
because the output layer before we saw
1:07:54
for the heart disease example, it's just
1:07:55
zero or one. Right? Here there are 10
1:07:58
possible outputs. It could be a you know
1:08:01
boot, a sweater, a shirt and so on so
1:08:02
forth. 10 possible categories. So we
1:08:04
need some way to handle something with
1:08:06
many more than you know one binary
1:08:09
output many possible outputs. So the way
1:08:12
we do that
1:08:15
this is by the way pay attention to this
1:08:16
because this is actually how GPT-4 works.
1:08:20
Okay. So what we do is here's what we
1:08:24
have. We know how to output 10 numbers,
1:08:26
right? If you want to output 10 numbers,
1:08:28
no problem. We just, you know, we have,
1:08:30
we can easily output 10 numbers by just
1:08:31
using a linear activation. We also know
1:08:33
how to output 10 probabilities,
1:08:36
right? Each one just needs to be a
1:08:37
sigmoid. But here we can't use 10
1:08:40
sigmoids as the output. Why is that?
1:08:44
Why can't we use 10 sigmoids?
1:08:47
>> Because the probabilities have to add up to one,
1:08:50
>> right? So here when the output comes we
1:08:52
need to figure out okay is it a boot, a
1:08:54
sweater, a shirt and so on and so forth.
1:08:56
There's only one right answer. Okay,
1:08:59
which means that we need to actually
1:09:00
figure out which of these 10 is the
1:09:01
right answer which means that we need to
1:09:03
produce probabilities but they have to
1:09:05
add up to one because only one of them
1:09:07
can be true.
1:09:09
So that's the key thing. They have to
1:09:10
add up to one. That's the wrinkle. If
1:09:12
not for that we can just use 10
1:09:13
sigmoids, right? And the way we do that
1:09:16
is using something called the
1:09:17
softmax function or the softmax layer.
1:09:20
And the idea is actually very simple. We
1:09:22
have these 10 outputs in the very final
1:09:25
layer which is just linear activations.
1:09:27
And then we take each one of these
1:09:29
numbers and then run it through the
1:09:32
exponential function and then divide by
1:09:34
the total. So when you do that two
1:09:37
things happen. The first one is when you
1:09:39
take these numbers and run it through
1:09:40
say you take a1 and do e raised to a1
1:09:43
you now get a positive number
1:09:45
and now you have a positive number
1:09:47
divide by the sum of a bunch of positive
1:09:48
numbers — and you can see here,
1:09:50
you can confirm visually, that they will
1:09:52
add up to one, because you're literally
1:09:53
taking each number and dividing by
1:09:55
the total. So they will add up to one;
1:09:56
there's no other option. Right? So this is
1:09:59
called the softmax function which means
1:10:00
that you can take any set of 10 numbers
1:10:02
that's coming out of the network and
1:10:04
convert them into probabilities that add
1:10:05
up to one
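
A sketch of the softmax computation just described. Subtracting the max before exponentiating is a standard numerical-stability trick, not something covered in the lecture:

    import numpy as np

    def softmax(a):
        e = np.exp(a - a.max())  # stability trick; doesn't change the result
        return e / e.sum()       # divide each positive number by the total

    scores = np.array([2.0, 1.0, 0.1])
    probs = softmax(scores)
    print(probs, probs.sum())    # positive numbers that add up to 1.0
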
1:10:07
And so, by the way, the GPT-4 reference:
1:10:09
when you actually put a prompt in GPT-4
1:10:12
and it starts giving you the output.
1:10:14
Every word it's emitting, right? It's
1:10:17
actually a token, but we'll get to that
1:10:19
later. You imagine it's a word. Every
1:10:21
word it's emitting, it's actually
1:10:23
doing a 52,000-way softmax.
1:10:27
Think of it as every word in the
1:10:28
language is a possible output. So it's a
1:10:31
vector which is 52,000 long but it's
1:10:34
actually a softmax and it just picks the
1:10:36
most probable word and emits that. So
1:10:39
this notion of a softmax is actually
1:10:41
very powerful.
1:10:43
Okay but we'll come back to that uh
1:10:45
later. So, so to summarize, if you have
1:10:49
a single number, you can use a simple linear
1:10:51
output layer; a single probability, a
1:10:53
sigmoid; if you have lots of numbers, just
1:10:55
have a stack of these things. And when
1:10:57
you have a lot of numbers that have to
1:10:58
add up to one, that have to be
1:10:59
probabilities, use softmax,
1:11:03
>> right? So uh yeah
1:11:06
>> why do we choose probabilities instead
1:11:08
of just a number — zero or
1:11:11
one?
1:11:12
>> sorry
1:11:12
>> then we know it's only going to be one
1:11:14
>> because you can't force the network to
1:11:15
give you ones or zeros
1:11:20
it's going to produce what it's going to
1:11:21
produce
1:11:22
>> you can't force it to be exactly one or
1:11:24
zero
1:11:26
it'll give you some number. What you can do is
1:11:28
tame that number so that it comes
1:11:30
into a range that you like, like between
1:11:32
zero and one.
1:11:34
So here very quickly, um: when
1:11:38
we have a binary classification example
1:11:40
like yes or no this is the one hot
1:11:41
encoded version one or zero this is what
1:11:43
we saw in the heart disease example when
1:11:45
you have something like this example
1:11:46
fashion mn list where you have all these
1:11:48
different possibilities then you can
1:11:51
encode it in one of two ways you can
1:11:52
encode it just using integers like 0 to
1:11:54
9 right this is called the sparse
1:11:56
encoded version or you can do a one hot
1:11:59
encoded version of the output right you
1:12:02
can have a one hot encoded version of
1:12:03
the output and depending on how your
1:12:06
data comes in to you comes into your
1:12:08
collab right just pay attention to this
1:12:11
and depending on what it is you have to
1:12:13
pick the right keras loss function so
1:12:18
data comes like a one zero thing which
1:12:20
is exactly what we had in the how this
1:12:21
example we use binary cross entropy if
1:12:24
your data comes in this form where it's
1:12:26
sparse encoded you use sparse
1:12:28
categorical cross entropy and then if it
1:12:31
comes in this form form you use
1:12:32
categorical cross entropy, right? These
1:12:34
are all equalent things. It just depends
1:12:36
on the data that you get how it happens
1:12:38
to be encoded by the people who sent it
1:12:40
to you. If they send it this way, use
1:12:42
this loss function. If you send that
1:12:43
way, use that loss function.
1:12:46
Now, as it turns out in our example
1:12:47
here, the data is actually coming in in
1:12:49
this form. So, we'll use this thing
1:12:50
called the sparse categorical cross
1:12:52
entropy. And categorical cross entropy
1:12:54
is a generalization of binary cross
1:12:56
entropy which I'm not going to get into
1:12:58
the mathematical details but the in the
1:12:59
the intuition is basically roughly the
1:13:01
same.
1:13:04
Okay so this is what we have. Um if this
1:13:07
is your output layer use mean squared
1:13:09
error. If this is your output layer use
1:13:11
binary cross entropy and if you still
1:13:14
have a stack of these numbers you can
1:13:15
still use mean squed error. And if your
1:13:17
output is a soft max, use categorical
1:13:19
cross entropy or sparse categorical
1:13:22
cross entropy.
1:13:24
Okay. So let's actually run this in
1:13:26
collab. Um
1:13:32
right. So this is what we have. Can
1:13:33
folks see this? Okay. All right. So this
1:13:37
is the data set we saw earlier. Uh down
1:13:40
here as usual, right? We have we load
1:13:44
tensorflow and kas. We load our usual
1:13:47
three packages and then we set the
1:13:49
random seed for reproducibility. And it
1:13:51
turns out that the fashion mnest data is
1:13:53
actually available in keras. You don't
1:13:54
have to go find it somewhere and bring
1:13:56
it in. It's actually available in kas.
1:13:57
It's one of the standard data sets. We
1:13:59
luck out. So we just actually load the
1:14:01
data right using this load data command.
1:14:04
And then you do that and conveniently
1:14:05
for us keras has not only made the data
1:14:08
available it has already split it into a
1:14:10
training and test set. So we don't have
1:14:12
to do the splitting. Okay. And the
1:14:13
reason they do that, why would they do
1:14:15
that?
1:14:18
They do that so that different people
1:14:20
who are building algorithms for that
1:14:21
particular data set can all be evaluated
1:14:23
using the same test set.
1:14:26
Otherwise, if I split it one way and
1:14:28
say, "Hey, look how well I did that like
1:14:29
I don't know how did you split it."
1:14:31
>> That's the reason.
1:14:32
>> Okay. So here and you can see here that
1:14:36
uh we have
1:14:38
the input data is a tensor of rank
1:14:43
three. The first and basically another
1:14:47
way to think about a tensor of rank
1:14:48
three is just a list of rank two
1:14:50
tensors. Right? So here you have 60,000
1:14:52
images. 60,000 images and each image is
1:14:57
a 28x 28 square of numbers. Each image
1:15:02
is a 28 x 28 table. Uh and then of
1:15:04
course the output uh is just what
1:15:07
category it is a number between 0 and 9.
1:15:09
So you just have 60,000 numbers. It's
1:15:11
just a vector of 60,000 numbers. Okay.
1:15:13
Uh so there are 60,000 in the training
1:15:15
set. Oops. Uh and then there are 10,000
1:15:19
in the test set. Same structure 28 by
1:15:21
28. Uh that's what we have. So if you
1:15:23
look at the first 10 rows of the
1:15:25
dependent variable Y, you get these
1:15:27
numbers 9 0 33 like that. There are
1:15:29
numbers from 0 to 9. So if you look at
1:15:31
the fashion mnest GitHub site, this is
1:15:33
what it refers to. Zero is a t-shirt,
1:15:35
one is a trouser, and so on and so
1:15:37
forth. And nine is an ankle boot.
1:15:41
All right. So, uh, whenever I'm working
1:15:43
with multiclass lab classification
1:15:45
problems, I always, you know, do a
1:15:47
little thing here to help me figure out
1:15:49
that nine corresponds to an ankle boot
1:15:51
and so on and so forth. It just makes it
1:15:52
a little easier to to work with this
1:15:53
stuff. So, I create this little list. Um
1:15:56
and then uh turns out if you okay what
1:15:59
is the very first data point? What is
1:16:01
it? What is its y- value? Turns out to
1:16:02
be an ankle boot. Um so you can actually
1:16:05
look at the raw data for that image
1:16:07
which is just a 28x 28 thing and these
1:16:10
are the numbers you have.
1:16:13
See all these 250 233 lots of zeros and
1:16:16
so on and so forth. So you can actually
1:16:19
look at the first visualize the first 25
1:16:20
images. I have a little bit of code here
1:16:22
which visualizes that just matt plot lip
1:16:24
code and you can see these are all the
1:16:25
images they're kind of smallalish this
1:16:28
my friends is an ankle boot
1:16:32
right it's like okay can the network
1:16:34
really make any sense out of this thing
1:16:35
right it looks very blurry and I don't
1:16:37
know
1:16:39
this is uh
1:16:42
oh this is actually a better ankle boot
1:16:43
look at that okay sorry I'm getting
1:16:45
distracted so so this is what we have
1:16:47
here
1:16:49
uh okay we are at 955
1:16:51
I'm going to stop um so you folks are
1:16:53
not late for your next class. So we'll
1:16:54
continue this journey on Wednesday and
1:16:56
then we'll go on to color images the
1:16:58
next class as well. Thank you folks.
1:16:59
Have a good one.
— end of transcript —