WEBVTT

00:00:16.320 --> 00:00:21.519
Okay. All right. Let's get going. Uh

00:00:20.304 --> 00:00:23.358
[clears throat] today is going to be

00:00:21.519 --> 00:00:25.920
packed. uh I'm going to spend the first

00:00:23.359 --> 00:00:28.960
roughly half of the lecture on uh

00:00:25.920 --> 00:00:30.720
actually building a model, a Keras

00:00:28.960 --> 00:00:32.960
model in Colab to solve the heart

00:00:30.719 --> 00:00:35.759
disease problem we saw earlier and then

00:00:32.960 --> 00:00:37.679
switch gears halfway and then talk about

00:00:35.759 --> 00:00:39.839
uh how to solve image classification

00:00:37.679 --> 00:00:42.159
okay so we're going to do two Colabs

00:00:39.840 --> 00:00:44.079
today uh I've been talking about Colab

00:00:42.159 --> 00:00:46.718
Colab right I've been teasing you we'll

00:00:44.079 --> 00:00:48.960
actually do Colabs today all right so

00:00:46.719 --> 00:00:50.320
by the way, I've shut off

00:00:48.960 --> 00:00:52.160
the lights up top because when I

00:00:50.320 --> 00:00:53.280
switch to Colab it's going to be much

00:00:52.159 --> 00:00:54.639
better for you folks particularly the

00:00:53.280 --> 00:00:57.039
folks in the back to be able to see it.

00:00:54.640 --> 00:01:00.000
Okay, but I hope you can see the slide

00:00:57.039 --> 00:01:02.320
right now. Yes.

00:01:00.000 --> 00:01:04.799
Okay, great. So this is just a quick

00:01:02.320 --> 00:01:07.040
recap of what we did last class. Uh, you

00:01:04.799 --> 00:01:08.479
know broadly speaking training a neural

00:01:07.040 --> 00:01:10.240
network essentially is no different than

00:01:08.478 --> 00:01:12.079
training other kinds of models. We have

00:01:10.239 --> 00:01:14.959
a bunch of parameters, i.e., weights and

00:01:12.079 --> 00:01:17.200
biases and we need to use the data to

00:01:14.959 --> 00:01:19.199
find good values of those weights. And

00:01:17.200 --> 00:01:21.040
what does good mean? Typically it means

00:01:19.200 --> 00:01:23.040
that we define some measure of

00:01:21.040 --> 00:01:24.960
discrepancy between what the model

00:01:23.040 --> 00:01:26.880
predicts for a given set of weights and

00:01:24.959 --> 00:01:29.199
what the right answer is what the ground

00:01:26.879 --> 00:01:30.879
truth answer is and then we try to find

00:01:29.200 --> 00:01:32.240
weights that minimize this discrepancy

00:01:30.879 --> 00:01:34.078
that's it and this notion of a

00:01:32.239 --> 00:01:36.239
discrepancy is called a loss function

00:01:34.078 --> 00:01:38.239
right so, broadly speaking, the

00:01:36.239 --> 00:01:40.078
overall training flow is that you define

00:01:38.239 --> 00:01:41.280
some network it has an input it goes

00:01:40.078 --> 00:01:42.719
through a bunch of layers you come up

00:01:41.280 --> 00:01:44.799
with some predictions you take the

00:01:42.719 --> 00:01:46.798
predictions you take the true values and

00:01:44.799 --> 00:01:48.560
then those two go into the loss function

00:01:46.799 --> 00:01:50.320
i.e. the discrepancy function, and

00:01:48.560 --> 00:01:52.399
then you come up with the loss score and

00:01:50.319 --> 00:01:54.639
then you send it to the optimizer which

00:01:52.399 --> 00:01:56.239
then proceeds to calculate the gradient

00:01:54.640 --> 00:01:58.239
of this loss function with respect to

00:01:56.239 --> 00:02:00.078
all the parameters and then it updates

00:01:58.239 --> 00:02:02.718
all the weights using that gradient and

00:02:00.078 --> 00:02:04.718
then this process repeats. That's it. So

00:02:02.718 --> 00:02:08.000
that is the training flow. Okay, quick

00:02:04.718 --> 00:02:09.359
recap. Now we also talked about the

00:02:08.000 --> 00:02:12.719
optimization algorithm we're going to

00:02:09.360 --> 00:02:15.280
use which is called gradient descent.
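
A rough sketch of one full-batch gradient descent update in Python, assuming a toy linear model with a squared-error loss (the model and all names here are illustrative, not from the lecture):

    import numpy as np

    # Toy linear model: y_hat = X @ w, with mean squared error loss.
    # One gradient descent step uses ALL data points for the gradient.
    def gd_step(w, X, y, lr=0.01):
        y_hat = X @ w                          # predictions for every point
        grad = 2 * X.T @ (y_hat - y) / len(y)  # dLoss/dw over the whole set
        return w - lr * grad                   # update the weights once

    X, y = np.random.randn(100, 3), np.random.randn(100)
    w = gd_step(np.zeros(3), X, y)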

00:02:12.719 --> 00:02:17.759
And in gradient descent, as you noticed, in

00:02:15.280 --> 00:02:20.400
each iteration, every data point is

00:02:17.759 --> 00:02:22.399
being used to make predictions and

00:02:20.400 --> 00:02:24.480
therefore to calculate the loss and then

00:02:22.400 --> 00:02:26.719
to calculate the gradient. And then we

00:02:24.479 --> 00:02:28.399
pointed out that gradient descent is

00:02:26.719 --> 00:02:31.120
actually not as good as something called

00:02:28.400 --> 00:02:33.200
stochastic gradient descent. Stochastic

00:02:31.120 --> 00:02:35.759
gradient descent where we instead of

00:02:33.199 --> 00:02:37.439
taking all the points, we just

00:02:35.759 --> 00:02:40.000
randomly choose a small number of

00:02:37.439 --> 00:02:42.239
points. Pretend for a moment as if those

00:02:40.000 --> 00:02:44.479
are the only points we have. Make

00:02:42.239 --> 00:02:47.360
predictions, calculate loss, calculate

00:02:44.479 --> 00:02:49.439
gradient and go on. So that was the

00:02:47.360 --> 00:02:51.840
basic idea behind stochastic gradient

00:02:49.439 --> 00:02:54.479
descent, right? Two different kinds of

00:02:51.840 --> 00:02:56.000
things. Now what it means is that when

00:02:54.479 --> 00:02:58.238
we actually start training the model, as

00:02:56.000 --> 00:03:00.318
we will in a few minutes, because

00:02:58.239 --> 00:03:02.640
we only take a few points at a

00:03:00.318 --> 00:03:04.399
time, we have to be a bit careful about

00:03:02.639 --> 00:03:06.079
what's going on. And I want to make sure

00:03:04.400 --> 00:03:07.439
you clearly understand what the

00:03:06.080 --> 00:03:10.640
differences are before we actually get

00:03:07.439 --> 00:03:13.280
to the Colab. Okay. And

00:03:10.639 --> 00:03:14.878
all right. So there is the notion of an

00:03:13.280 --> 00:03:17.120
epoch.

00:03:14.878 --> 00:03:20.000
An epoch essentially just means that we

00:03:17.120 --> 00:03:22.080
make one pass through the training data.

00:03:20.000 --> 00:03:25.039
All the training data we make one pass

00:03:22.080 --> 00:03:27.519
through it. Okay. And so what is one pass?

00:03:25.039 --> 00:03:30.560
If you have something like

00:03:27.519 --> 00:03:32.400
gradient descent, one pass means every

00:03:30.560 --> 00:03:34.318
data point is sent through the network.

00:03:32.400 --> 00:03:37.039
We calculate its predictions, calculate

00:03:34.318 --> 00:03:38.958
the loss, calculate the gradient, right?

00:03:37.039 --> 00:03:40.560
We run every training sample through it.

00:03:38.959 --> 00:03:42.959
we calculate the gradient which is just

00:03:40.560 --> 00:03:46.878
this thing here right I mean I will

00:03:42.959 --> 00:03:48.799
sometimes say dL/dw, the derivative

00:03:46.878 --> 00:03:51.199
of the loss with respect to w; sometimes I

00:03:48.799 --> 00:03:54.000
might use the nabla symbol ∇; these are all

00:03:51.199 --> 00:03:55.598
interchangeable okay so we'll calculate

00:03:54.000 --> 00:03:58.080
the gradient and then we update using

00:03:55.598 --> 00:04:01.438
some version of this okay but we just do

00:03:58.080 --> 00:04:03.680
it once at the end of the epoch because

00:04:01.438 --> 00:04:05.280
if you have 10 billion data points every

00:04:03.680 --> 00:04:07.200
one of them flows through you get 10

00:04:05.280 --> 00:04:08.959
billion outputs and then we calculate

00:04:07.199 --> 00:04:10.158
the loss just once at the end of this

00:04:08.959 --> 00:04:15.039
thing we calculate the gradient and

00:04:10.158 --> 00:04:18.399
update once: one update per epoch. Yes.

00:04:15.039 --> 00:04:20.319
Now in stochastic gradient descent what we

00:04:18.399 --> 00:04:22.078
do is that we process the data in

00:04:20.319 --> 00:04:25.360
batches

00:04:22.079 --> 00:04:26.800
small numbers of points at a time right

00:04:25.360 --> 00:04:29.280
and these are called technically

00:04:26.800 --> 00:04:30.560
speaking they're called mini-batches I

00:04:29.279 --> 00:04:31.758
don't know about you I just get tired of

00:04:30.560 --> 00:04:34.879
saying mini-batches I'm just going to

00:04:31.759 --> 00:04:36.319
say batches from this point on okay and

00:04:34.879 --> 00:04:39.360
in fact that is widely done in the

00:04:36.319 --> 00:04:41.360
literature so we'll have to

00:04:39.360 --> 00:04:43.199
process it in batches so we take the

00:04:41.360 --> 00:04:44.720
training data and then we divide it up

00:04:43.199 --> 00:04:46.400
into batches

00:04:44.720 --> 00:04:49.840
batch one, batch two all the way till

00:04:46.399 --> 00:04:53.120
the final batch. And so what we do is we

00:04:49.839 --> 00:04:56.399
for each batch we basically do gradient

00:04:53.120 --> 00:04:57.918
descent. We take batch one

00:04:56.399 --> 00:05:00.399
and then we run just the training

00:04:57.918 --> 00:05:01.918
samples in that batch through the

00:05:00.399 --> 00:05:03.758
network to get predictions. We calculate

00:05:01.918 --> 00:05:05.519
the gradient we update the parameters

00:05:03.759 --> 00:05:07.360
and then we go to batch two then we go

00:05:05.519 --> 00:05:09.120
to batch three and so on and so forth.

00:05:07.360 --> 00:05:11.038
So pictorially this is how it's going to

00:05:09.120 --> 00:05:12.800
look.

00:05:11.038 --> 00:05:16.000
right let's say the first batch is say

00:05:12.800 --> 00:05:17.439
32 points we take those 32 points we run

00:05:16.000 --> 00:05:19.519
it through the network get all the stuff

00:05:17.439 --> 00:05:22.719
out we calculate the gradient update the

00:05:19.519 --> 00:05:25.279
weights so when we now get to batch two

00:05:22.720 --> 00:05:27.600
the weights have changed

00:05:25.279 --> 00:05:29.038
they have been updated and then we do

00:05:27.600 --> 00:05:30.720
the same thing for batch two batch three

00:05:29.038 --> 00:05:32.399
and all the way till we get to the end

00:05:30.720 --> 00:05:34.240
of the thing and when we are done with

00:05:32.399 --> 00:05:36.239
this thing this whole thing is called a

00:05:34.240 --> 00:05:38.160
what

00:05:36.240 --> 00:05:42.879
an epoch [clears throat]

00:05:38.160 --> 00:05:44.880
This whole thing is an epoch. Okay.
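
As a sketch, one epoch of mini-batch SGD could be written like this, where update_weights is a hypothetical function doing predict, loss, gradient, and update for a single batch:

    import numpy as np

    def run_epoch(w, X, y, update_weights, batch_size=32):
        idx = np.random.permutation(len(X))            # shuffle the rows once
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]      # next 32 (or fewer) rows
            w = update_weights(w, X[batch], y[batch])  # weights change per batch
        return w  # one full pass through the training data = one epoch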

00:05:42.879 --> 00:05:46.319
All right. Now, so the question of

00:05:44.879 --> 00:05:47.519
course is that if you have a bunch of

00:05:46.319 --> 00:05:50.000
data points and you're going to run

00:05:47.519 --> 00:05:52.319
stochastic gradient descent on it,

00:05:50.000 --> 00:05:54.959
in a particular epoch, how many batches

00:05:52.319 --> 00:05:56.879
are going to be there? Okay, how many

00:05:54.959 --> 00:05:58.159
batches are going to be there? Now,

00:05:56.879 --> 00:05:59.199
Keras is going to calculate all this

00:05:58.160 --> 00:06:00.560
stuff. You don't have to worry about it,

00:05:59.199 --> 00:06:02.960
but you just need to understand exactly

00:06:00.560 --> 00:06:04.879
what happens. Okay, so my philosophy, by

00:06:02.959 --> 00:06:06.799
the way, is that you have to know the

00:06:04.879 --> 00:06:08.560
details of what's going on. If you don't

00:06:06.800 --> 00:06:11.038
know the details, if you haven't figured

00:06:08.560 --> 00:06:12.399
out at least once, you will not actually

00:06:11.038 --> 00:06:15.519
be able to think new and creative

00:06:12.399 --> 00:06:17.198
thoughts for a new problem. Okay, it's

00:06:15.519 --> 00:06:21.719
because the concepts are not manipulable

00:06:17.199 --> 00:06:21.720
in your head yet. Okay,

00:06:23.839 --> 00:06:30.159
please use the microphone.

00:06:27.279 --> 00:06:32.399
So when we talk about SGD, and we are

00:06:30.160 --> 00:06:34.080
talking about uh only taking some

00:06:32.399 --> 00:06:36.239
part of it. Is what we are saying

00:06:34.079 --> 00:06:37.918
that we only take some variables or we

00:06:36.240 --> 00:06:40.319
only taking some part of the data?

00:06:37.918 --> 00:06:42.639
>> We are taking some rows.

00:06:40.319 --> 00:06:44.639
Okay. We are taking only rows, right. So those data

00:06:42.639 --> 00:06:46.319
points, that means a batch.

00:06:44.639 --> 00:06:48.160
>> Exactly. So for example, let's say you

00:06:46.319 --> 00:06:50.479
have a thousand data points, right?

00:06:48.160 --> 00:06:52.400
Thousand rows of observations, thousand

00:06:50.478 --> 00:06:53.839
patients in the heart disease example or

00:06:52.399 --> 00:06:56.478
a thousand images that you're trying to

00:06:53.839 --> 00:06:58.799
classify. You take let's say 32 of those

00:06:56.478 --> 00:07:00.879
images, 32 of those patients and that's

00:06:58.800 --> 00:07:02.800
a batch. Then you go to the next 32.

00:07:00.879 --> 00:07:04.240
Then the next 32 and so on and so forth

00:07:02.800 --> 00:07:05.199
till you run out of patients or run out

00:07:04.240 --> 00:07:07.759
of images.

00:07:05.199 --> 00:07:09.199
>> And each iteration you are updating

00:07:07.759 --> 00:07:09.759
with the new weights that you've

00:07:09.199 --> 00:07:12.478
got.

00:07:09.759 --> 00:07:13.520
>> And it means you keep correcting it or

00:07:12.478 --> 00:07:14.560
keep moving towards

00:07:13.519 --> 00:07:14.799
>> you're basically updating the weights as

00:07:14.560 --> 00:07:17.519
you

00:07:14.800 --> 00:07:19.199
>> updating the weights

00:07:17.519 --> 00:07:20.719
>> and what we are calling the epoch is

00:07:19.199 --> 00:07:21.919
ultimately the equation of the loss function

00:07:20.720 --> 00:07:24.400
that we are trying to do.

00:07:21.918 --> 00:07:27.680
>> No, an epoch. See, the thing to

00:07:24.399 --> 00:07:30.079
remember is that here this whole thing

00:07:27.680 --> 00:07:32.319
is called an epoch because we have to do

00:07:30.079 --> 00:07:35.439
one full pass through the training data.

00:07:32.319 --> 00:07:37.919
Okay. But within that epoch we update

00:07:35.439 --> 00:07:40.160
the weights many times. Basically we

00:07:37.918 --> 00:07:43.318
update the weights as many times as we

00:07:40.160 --> 00:07:43.319
have batches.

00:07:44.079 --> 00:07:49.038
All right. Um

00:07:46.478 --> 00:07:50.240
so going back here, for example,

00:07:49.038 --> 00:07:52.000
basically the idea is that you take the

00:07:50.240 --> 00:07:54.960
training set and divide it by the batch

00:07:52.000 --> 00:07:56.478
size and you choose the batch size okay

00:07:54.959 --> 00:07:57.918
you choose the batch size and we'll talk

00:07:56.478 --> 00:07:59.839
about well how do you choose that later

00:07:57.918 --> 00:08:01.598
on you choose the batch size and once

00:07:59.839 --> 00:08:04.239
you choose the size, just divide it and round

00:08:01.598 --> 00:08:06.878
it up so for example as you will see in

00:08:04.240 --> 00:08:09.199
the Colab, the training set is going to be 194

00:08:06.879 --> 00:08:12.080
patients and then we're going to choose

00:08:09.199 --> 00:08:14.960
a batch size of 32 and we typically tend

00:08:12.079 --> 00:08:16.478
to choose batch sizes of 32, 64, and

00:08:14.959 --> 00:08:18.318
things like that because it actually

00:08:16.478 --> 00:08:20.800
aligns very well with the nature of the

00:08:18.319 --> 00:08:24.479
parallel hardware we're going to use.

00:08:20.800 --> 00:08:27.038
Okay. And so here 32 and so on. So

00:08:24.478 --> 00:08:29.680
divide 194 by 32 you get 6 point

00:08:27.038 --> 00:08:31.439
something. You round it up to seven.

00:08:29.680 --> 00:08:33.839
Okay. And so what that means is that the

00:08:31.439 --> 00:08:36.719
first six batches will have 32 samples

00:08:33.839 --> 00:08:38.800
each. And then the final batch has only

00:08:36.719 --> 00:08:40.560
two samples left. And that's okay. It

00:08:38.799 --> 00:08:42.079
can be a nice little small batch at the

00:08:40.559 --> 00:08:43.278
end.

00:08:42.080 --> 00:08:46.959
There's nothing that says that every

00:08:43.278 --> 00:08:51.320
batch has to be the same size.

00:08:46.958 --> 00:08:51.319
>> That's it. Epochs, batches.
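
In code, the batch arithmetic works out like this (194 patients and a batch size of 32 are the numbers from the lecture):

    import math

    n_train, batch_size = 194, 32
    n_batches = math.ceil(n_train / batch_size)          # 6.06... rounds up to 7
    last_batch = n_train - (n_batches - 1) * batch_size  # 194 - 6*32 = 2
    print(n_batches, last_batch)                         # 7 2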

00:08:53.039 --> 00:08:58.399
>> And are you like for each batch you run

00:08:56.799 --> 00:09:00.799
through the whole network like all the

00:08:58.399 --> 00:09:03.039
layers or like each layer is one batch?

00:09:00.799 --> 00:09:04.719
>> No, for a batch you run it through the

00:09:03.039 --> 00:09:06.958
entire network. So the way I think about

00:09:04.720 --> 00:09:08.560
it is that you take a batch right just

00:09:06.958 --> 00:09:10.879
momentarily you assume that's all the

00:09:08.559 --> 00:09:12.879
data you have

00:09:10.879 --> 00:09:14.399
just run it through the network because

00:09:12.879 --> 00:09:15.838
unless you run it through the every

00:09:14.399 --> 00:09:18.080
layer of the network you can't get a

00:09:15.839 --> 00:09:19.600
prediction and unless you get a

00:09:18.080 --> 00:09:20.879
prediction you can't calculate the loss

00:09:19.600 --> 00:09:22.159
and unless you calculate the loss you

00:09:20.879 --> 00:09:23.200
can't calculate the gradient unless you

00:09:22.159 --> 00:09:25.199
calculate the gradient you can't update

00:09:23.200 --> 00:09:27.120
the weights

00:09:25.200 --> 00:09:29.440
>> last thing but if you're using like all

00:09:27.120 --> 00:09:31.120
the data just doing the gradient descent

00:09:29.440 --> 00:09:32.160
then you just go through the network

00:09:31.120 --> 00:09:34.320
once right

00:09:32.159 --> 00:09:37.679
>> okay exactly so in gradient descent one

00:09:34.320 --> 00:09:40.399
epoch is one pass and one weight update.

00:09:37.679 --> 00:09:41.838
In stochastic gradient descent the

00:09:40.399 --> 00:09:43.440
number of updates you make is equal to

00:09:41.839 --> 00:09:46.080
the number of batches you have which

00:09:43.440 --> 00:09:47.920
ends up being, you know, the training

00:09:46.080 --> 00:09:50.399
set divided by the batch size rounded

00:09:47.919 --> 00:09:52.799
up.

00:09:50.399 --> 00:09:54.639
>> So just to confirm so initially when we

00:09:52.799 --> 00:09:56.559
introduced like the concept of batches

00:09:54.639 --> 00:09:58.559
the whole purpose was not to run through

00:09:56.559 --> 00:10:00.639
all the data and be able to do some

00:09:58.559 --> 00:10:02.319
prediction from a subset. So now like

00:10:00.639 --> 00:10:04.639
the advantage is that like after batch

00:10:02.320 --> 00:10:06.560
one we are using more accurate

00:10:04.639 --> 00:10:08.399
coefficient to run through batch two and

00:10:06.559 --> 00:10:10.319
so on. That's really the advantage of it

00:10:08.399 --> 00:10:11.919
or there's something else to it.

00:10:10.320 --> 00:10:13.920
>> Perfectly said. That's exactly the

00:10:11.919 --> 00:10:16.319
advantage. So we take a small amount of

00:10:13.919 --> 00:10:18.240
data and we say hey we know this is not

00:10:16.320 --> 00:10:19.839
all the data. It's just a small subset

00:10:18.240 --> 00:10:21.120
of the data. So therefore it's not going

00:10:19.839 --> 00:10:23.920
to be super accurate. It's going to be

00:10:21.120 --> 00:10:25.919
approximate but it's okay. So we'll

00:10:23.919 --> 00:10:28.078
still tend to move in the in the right

00:10:25.919 --> 00:10:29.679
direction. So instead of waiting for the

00:10:28.078 --> 00:10:30.958
whole thing to get done and then

00:10:29.679 --> 00:10:33.679
updating it, we're just going to update

00:10:30.958 --> 00:10:35.919
it as we go along.

00:10:33.679 --> 00:10:37.759
All right. Uh yes,

00:10:35.919 --> 00:10:40.799
>> Building on her question, is it that

00:10:37.759 --> 00:10:43.600
uh doing this process for SGD will uh

00:10:40.799 --> 00:10:45.519
render us a better solution or

00:10:43.600 --> 00:10:46.399
require less compute power?

00:10:45.519 --> 00:10:48.240
>> Both

00:10:46.399 --> 00:10:51.278
>> both and the reasons for both are in the

00:10:48.240 --> 00:10:52.480
previous lecture. Yeah. And I'm saying

00:10:51.278 --> 00:10:54.000
that instead of repeating it just

00:10:52.480 --> 00:10:57.120
because I'm like very pressed for time

00:10:54.000 --> 00:11:01.278
today. That's why uh all right cool so

00:10:57.120 --> 00:11:04.000
that's what we have uh are we good

00:11:01.278 --> 00:11:05.519
okay so now we come to the last step

00:11:04.000 --> 00:11:07.600
before we actually fire up the Colab

00:11:05.519 --> 00:11:09.600
which is overfitting and regularization

00:11:07.600 --> 00:11:12.159
um so if you remember from your machine

00:11:09.600 --> 00:11:14.720
learning background um when your model

00:11:12.159 --> 00:11:18.319
gets more and more complex

00:11:14.720 --> 00:11:19.759
right, you know, if you

00:11:18.320 --> 00:11:21.680
use a simple model then you use a more

00:11:19.759 --> 00:11:23.278
complex model and so on and so forth

00:11:21.679 --> 00:11:26.159
what happens to the error on the

00:11:23.278 --> 00:11:27.200
training data? Typically what happens to

00:11:26.159 --> 00:11:28.480
the error on the training data? So let's

00:11:27.200 --> 00:11:30.079
say you have a simple regression model,

00:11:28.480 --> 00:11:31.440
you get some error and then you have a

00:11:30.078 --> 00:11:32.879
regression model in which you use all

00:11:31.440 --> 00:11:34.480
kinds of interaction terms. You use

00:11:32.879 --> 00:11:35.759
logarithms and this and that and make it

00:11:34.480 --> 00:11:36.879
super complicated. What do you think is

00:11:35.759 --> 00:11:39.519
going to happen to the error on the

00:11:36.879 --> 00:11:41.278
training data?

00:11:39.519 --> 00:11:43.200
>> Right? Basically it's going to go down

00:11:41.278 --> 00:11:45.360
as the model gets more complex.

00:11:43.200 --> 00:11:46.959
Correct. Now of course comes the punch

00:11:45.360 --> 00:11:49.440
line, which is: what do you think is

00:11:46.958 --> 00:11:53.000
going to happen to the error on the test data? I

00:11:49.440 --> 00:11:53.000
showed you the answer.

00:11:53.039 --> 00:11:56.078
Right? Basically, what's going to happen

00:11:54.399 --> 00:11:57.360
typically, at least conceptually, is

00:11:56.078 --> 00:11:59.120
that it's going to get better and better

00:11:57.360 --> 00:12:00.879
and at some point it's going to bottom out

00:11:59.120 --> 00:12:03.360
and it's going to start climbing again.

00:12:00.879 --> 00:12:05.039
And so, we typically refer to this

00:12:03.360 --> 00:12:07.440
phenomenon here when it starts to climb

00:12:05.039 --> 00:12:09.120
again as overfitting because the model

00:12:07.440 --> 00:12:11.440
is essentially fitting to the

00:12:09.120 --> 00:12:14.159
idiosyncrasies of the training data as

00:12:11.440 --> 00:12:15.760
opposed to generalizing patterns. And

00:12:14.159 --> 00:12:17.278
then in this thing we call it

00:12:15.759 --> 00:12:18.480
underfitting because

00:12:17.278 --> 00:12:20.399
there's a lot of potential to improve

00:12:18.480 --> 00:12:23.360
and we really are hoping to find the

00:12:20.399 --> 00:12:24.559
sweet spot in the middle right that's

00:12:23.360 --> 00:12:27.360
the basic idea of overfitting

00:12:24.559 --> 00:12:29.359
and underfitting. And to

00:12:27.360 --> 00:12:31.839
relate this to neural networks as you

00:12:29.360 --> 00:12:33.680
have learned so far, you

00:12:31.839 --> 00:12:36.320
have to learn smart representations of

00:12:33.679 --> 00:12:38.078
the input data, and to do that, I have

00:12:36.320 --> 00:12:39.760
argued that you need to have lots of

00:12:38.078 --> 00:12:42.719
layers in your network the more layers

00:12:39.759 --> 00:12:45.439
you have the better things get. GPT-3, for

00:12:42.720 --> 00:12:47.680
example, has 96 layers if I recall right

00:12:45.440 --> 00:12:50.079
more layers the better but more layers

00:12:47.679 --> 00:12:52.000
means more parameters more parameters

00:12:50.078 --> 00:12:54.719
means more complexity to the model and

00:12:52.000 --> 00:12:57.919
therefore more chance of overfitting

00:12:54.720 --> 00:12:59.200
okay so it's really important in neural

00:12:57.919 --> 00:13:01.759
networks that we think about

00:12:59.200 --> 00:13:03.278
regularization and regularization you

00:13:01.759 --> 00:13:05.600
will recall from your machine learning

00:13:03.278 --> 00:13:07.278
background is the way we handle the risk

00:13:05.600 --> 00:13:11.278
of overfitting and try to find models

00:13:07.278 --> 00:13:12.639
that fit just right okay and so several

00:13:11.278 --> 00:13:14.480
regularization methods have been

00:13:12.639 --> 00:13:16.560
developed over the years and we are

00:13:14.480 --> 00:13:19.039
going to use only two of them. The first

00:13:16.559 --> 00:13:20.799
one is called early stopping. uh and

00:13:19.039 --> 00:13:23.120
this has been famously referred

00:13:20.799 --> 00:13:25.199
to uh by Geoff Hinton, who's one of the

00:13:23.120 --> 00:13:27.039
pioneers or as he's more colorfully

00:13:25.200 --> 00:13:29.040
known one of the godfathers of deep

00:13:27.039 --> 00:13:31.199
learning, um, who also won the

00:13:29.039 --> 00:13:33.360
Turing Award a few years ago, as sort

00:13:31.200 --> 00:13:35.120
of a beautiful free lunch right that's

00:13:33.360 --> 00:13:37.839
what he calls it so the idea is very

00:13:35.120 --> 00:13:39.278
simple we take a validation set we take

00:13:37.839 --> 00:13:41.120
the training data we split into a

00:13:39.278 --> 00:13:42.879
training and a validation set and then

00:13:41.120 --> 00:13:45.519
we just keep you know doing gradient

00:13:42.879 --> 00:13:46.720
descent, and the training will

00:13:45.519 --> 00:13:49.200
hopefully keep on getting better and

00:13:46.720 --> 00:13:50.800
better lower and lower error

00:13:49.200 --> 00:13:52.959
And then we just keep track of what's

00:13:50.799 --> 00:13:54.559
going on in the validation set. And then

00:13:52.958 --> 00:13:56.958
at some point if it starts to flatten

00:13:54.559 --> 00:13:59.919
out and start to climb, we just say,

00:13:56.958 --> 00:14:01.359
"Okay, that's when we stop training."

00:13:59.919 --> 00:14:02.639
Right? And what we're going to do in the

00:14:01.360 --> 00:14:03.919
Colab is actually run it through the

00:14:02.639 --> 00:14:04.959
whole thing, see where it flattens out,

00:14:03.919 --> 00:14:06.479
and then we say, "Okay, that's where we

00:14:04.958 --> 00:14:07.759
should stop." But of course, you don't

00:14:06.480 --> 00:14:09.120
want to go all the way to the end and

00:14:07.759 --> 00:14:12.000
then go back and say, "Well, I want to

00:14:09.120 --> 00:14:13.600
stop at the 10th epoch." And there are

00:14:12.000 --> 00:14:15.039
ways you can use Keras to be very

00:14:13.600 --> 00:14:16.320
efficient about this. But the

00:14:15.039 --> 00:14:18.319
fundamental idea is you take the

00:14:16.320 --> 00:14:20.079
training data, split it into training

00:14:18.320 --> 00:14:21.920
and validation and just track what's

00:14:20.078 --> 00:14:23.278
going on in the validation set to see

00:14:21.919 --> 00:14:25.679
whether this kind of bottoming out

00:14:23.278 --> 00:14:28.240
happens. Okay. So this is called early

00:14:25.679 --> 00:14:30.879
stopping. And the other way we're going

00:14:28.240 --> 00:14:32.959
to use, right, this is called early stopping.
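
In Keras, early stopping is a small callback; a minimal sketch, with an arbitrary patience value (not specified in the lecture):

    from tensorflow import keras

    # Watch the validation loss; stop once it stops improving and
    # roll the model back to the weights from its best epoch.
    early_stop = keras.callbacks.EarlyStopping(
        monitor="val_loss",
        patience=10,                # epochs to wait without improvement
        restore_best_weights=True,
    )
    # Later: model.fit(..., validation_split=0.2, callbacks=[early_stop])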

00:14:30.879 --> 00:14:35.838
We're looking for this part. The other

00:14:32.958 --> 00:14:39.039
thing is called dropout. And I'm going

00:14:35.839 --> 00:14:40.560
to come back to dropout

00:14:39.039 --> 00:14:42.000
in Wednesday's lecture because that's

00:14:40.559 --> 00:14:43.518
the first time we're going to use it.

00:14:42.000 --> 00:14:44.958
And so I'll come back to dropout and

00:14:43.519 --> 00:14:46.879
tell you exactly how it works. It's a

00:14:44.958 --> 00:14:48.399
very very clever strategy. But we will

00:14:46.879 --> 00:14:51.519
not use it today. We'll use it on

00:14:48.399 --> 00:14:53.679
Wednesday. Okay. So in summary, uh what

00:14:51.519 --> 00:14:55.679
do we do? We get the data ready. We

00:14:53.679 --> 00:14:57.198
design the network, number of hidden

00:14:55.679 --> 00:14:58.958
layers, number of neurons and so on and

00:14:57.198 --> 00:15:01.359
so forth. We pick the right output

00:14:58.958 --> 00:15:04.078
layer. We pick the right loss function.

00:15:01.360 --> 00:15:06.000
Uh we choose an optimizer. As I

00:15:04.078 --> 00:15:07.919
mentioned earlier, SGD comes in lots of

00:15:06.000 --> 00:15:11.519
flavors, lots of variations on the

00:15:07.919 --> 00:15:13.439
theme. And empirically much like for

00:15:11.519 --> 00:15:16.159
hidden layer neurons we tend to use

00:15:13.440 --> 00:15:17.920
ReLU as the activation function; for

00:15:16.159 --> 00:15:20.559
optimization we tend to use a flavor of

00:15:17.919 --> 00:15:22.240
SGD called Adam okay as sort of the

00:15:20.559 --> 00:15:24.879
default because it's really good so

00:15:22.240 --> 00:15:27.039
we'll use Adam as you'll see we

00:15:24.879 --> 00:15:29.039
typically use either uh early stopping

00:15:27.039 --> 00:15:32.000
or dropout and then you just fire it up

00:15:29.039 --> 00:15:33.838
and start training in Keras and TensorFlow

00:15:32.000 --> 00:15:35.120
all right so that is the training loop

00:15:33.839 --> 00:15:38.079
now I'm going to switch gears and give

00:15:35.120 --> 00:15:40.959
you a quick intro to Keras and

00:15:38.078 --> 00:15:43.278
TensorFlow. Okay. Keras and Tensor. No,

00:15:40.958 --> 00:15:45.119
TensorFlow and Keras. Thank you. Um, and

00:15:43.278 --> 00:15:49.078
then we'll actually fire up the Colab.

00:15:45.120 --> 00:15:49.078
So, first of all, what's a tensor?

00:15:49.919 --> 00:15:54.639
>> Yeah, I just quick question on the

00:15:52.159 --> 00:15:57.679
previous thing like if you're looking at

00:15:54.639 --> 00:15:59.440
the validation set to avoid overfitting,

00:15:57.679 --> 00:16:02.000
but aren't you actually like over

00:15:59.440 --> 00:16:03.920
actually overfitting because like you're

00:16:02.000 --> 00:16:05.919
kind of using the validation set as a

00:16:03.919 --> 00:16:08.000
training set or not?

00:16:05.919 --> 00:16:10.319
>> Uh, no, no, no. The validation set is

00:16:08.000 --> 00:16:12.799
never used to calculate any gradients.

00:16:10.320 --> 00:16:14.480
It's only used to calculate accuracy and

00:16:12.799 --> 00:16:16.078
loss.

00:16:14.480 --> 00:16:19.360
Yeah. Yeah. It's kept aside and only

00:16:16.078 --> 00:16:22.479
used for evaluation, not for training.

00:16:19.360 --> 00:16:23.120
That's what keeps you honest.

00:16:22.480 --> 00:16:24.399
>> Right.

00:16:23.120 --> 00:16:25.600
>> And this will become clear when we

00:16:24.399 --> 00:16:28.600
actually go to the Colab. So what's a

00:16:25.600 --> 00:16:28.600
tensor?

00:16:28.639 --> 00:16:33.120
>> All right.

00:16:30.639 --> 00:16:35.360
Okay.

00:16:33.120 --> 00:16:36.720
Tensor is the input data which you're

00:16:35.360 --> 00:16:39.440
giving to the system. It could be in

00:16:36.720 --> 00:16:42.240
various formats; like if it's an image, it could

00:16:39.440 --> 00:16:45.120
be like we call it a 4D tensor. If it's

00:16:42.240 --> 00:16:47.278
a time series data, it's 3D. And

00:16:45.120 --> 00:16:49.360
typically, if you just send numbers in,

00:16:47.278 --> 00:16:52.480
it becomes a vector which would go

00:16:49.360 --> 00:16:54.480
inside, which gives the

00:16:52.480 --> 00:16:57.278
value of the

00:16:54.480 --> 00:16:59.120
uh the variable, as well as the values of

00:16:57.278 --> 00:17:01.759
the variables associated to it as well

00:16:59.120 --> 00:17:05.599
as

00:17:01.759 --> 00:17:07.120
the information you

00:17:05.599 --> 00:17:08.480
want to get to.

00:17:07.119 --> 00:17:10.159
>> You're kind of on the right track, but

00:17:08.480 --> 00:17:13.439
not entirely, right? It's actually a

00:17:10.160 --> 00:17:15.038
simpler concept than that. So, uh

00:17:13.439 --> 00:17:16.720
>> it's like a matrix but generalized with

00:17:15.038 --> 00:17:18.558
higher dimensions.

00:17:16.720 --> 00:17:21.360
>> Correct? That's also actually correct

00:17:18.558 --> 00:17:24.078
but incomplete. The reason is because it

00:17:21.359 --> 00:17:25.838
can be simpler than a matrix. It's not

00:17:24.078 --> 00:17:27.599
matrix or higher. It actually could be

00:17:25.838 --> 00:17:30.159
simpler. In fact, you take a number,

00:17:27.599 --> 00:17:31.759
it's actually a tensor.

00:17:30.160 --> 00:17:34.400
All right? The simplest case of a tensor

00:17:31.759 --> 00:17:37.359
is a number. The next simplest case is a

00:17:34.400 --> 00:17:40.798
vector which is a list. The next higher

00:17:37.359 --> 00:17:43.038
case is a table.

00:17:40.798 --> 00:17:45.679
Okay, so these are all tensors. So

00:17:43.038 --> 00:17:48.879
tensors basically are a generalization

00:17:45.679 --> 00:17:52.240
of the notion of both a number, a vector

00:17:48.880 --> 00:17:56.799
and a table to higher dimensions.

00:17:52.240 --> 00:17:59.440
Okay, so you can think of a tensor as

00:17:56.798 --> 00:18:03.200
having a rank. Every tensor has

00:17:59.440 --> 00:18:04.720
something called a rank, right? So a

00:18:03.200 --> 00:18:06.720
number is just a number. It doesn't have

00:18:04.720 --> 00:18:10.798
a dimensionality to it. So it has got

00:18:06.720 --> 00:18:12.720
rank zero. Okay. While a vector is a

00:18:10.798 --> 00:18:14.720
list of numbers. You can sort of write

00:18:12.720 --> 00:18:17.600
it down top to bottom and it's one

00:18:14.720 --> 00:18:19.200
dimension. Right? So that dimension that

00:18:17.599 --> 00:18:22.798
one dimension is called a rank. So it's

00:18:19.200 --> 00:18:24.480
called rank one. A table is 2D

00:18:22.798 --> 00:18:26.558
two-dimensional. So it's called rank

00:18:24.480 --> 00:18:28.640
two.

00:18:26.558 --> 00:18:32.079
And you can have a rank three which is

00:18:28.640 --> 00:18:34.080
just a bunch of tables.

00:18:32.079 --> 00:18:37.199
A bunch of tables is a rank three

00:18:34.079 --> 00:18:40.399
tensor. We also think of it as a cube.
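
A quick sketch of those ranks in TensorFlow:

    import tensorflow as tf

    scalar = tf.constant(7)                         # a number: rank 0
    vector = tf.constant([1, 2, 3])                 # a list:   rank 1
    table  = tf.constant([[1, 2], [3, 4]])          # a table:  rank 2
    cube   = tf.constant([[[1], [2]], [[3], [4]]])  # stacked tables: rank 3
    print(scalar.ndim, vector.ndim, table.ndim, cube.ndim)  # 0 1 2 3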

00:18:37.200 --> 00:18:42.240
Okay. So these things are very useful

00:18:40.400 --> 00:18:45.280
because obviously we are all familiar

00:18:42.240 --> 00:18:48.000
with vectors. Uh as you will see very

00:18:45.279 --> 00:18:49.678
shortly later in this class black and

00:18:48.000 --> 00:18:51.679
white grayscale images are usually

00:18:49.679 --> 00:18:54.240
represented using tables of numbers like

00:18:51.679 --> 00:18:56.240
this. Color images are represented using

00:18:54.240 --> 00:18:59.440
three tables.

00:18:56.240 --> 00:19:02.319
Okay. Can you get think of what might be

00:18:59.440 --> 00:19:06.160
representable as, you know, a tensor of

00:19:02.319 --> 00:19:08.720
rank four? Meaning every element of a

00:19:06.160 --> 00:19:11.600
tensor of rank four is actually a color

00:19:08.720 --> 00:19:14.720
picture.

00:19:11.599 --> 00:19:16.959
Just shout it out. Video. Exactly. What

00:19:14.720 --> 00:19:19.440
is a video? A video is basically a

00:19:16.960 --> 00:19:23.519
stream of color images. A color

00:19:19.440 --> 00:19:25.519
video. So each element of that stream,

00:19:23.519 --> 00:19:28.879
right? The first dimension of the

00:19:25.519 --> 00:19:31.440
tensor is which frame it is and then

00:19:28.880 --> 00:19:34.000
everything else is the actual frame. So

00:19:31.440 --> 00:19:37.320
the way I think about these tensors

00:19:34.000 --> 00:19:37.319
always is

00:19:37.359 --> 00:19:42.639
you

00:19:40.480 --> 00:19:45.759
can think of a tensor as being this

00:19:42.640 --> 00:19:48.080
array which has all these axes or

00:19:45.759 --> 00:19:51.359
dimensions. This is the first one. This

00:19:48.079 --> 00:19:54.159
is the second one. This is the third one.

00:19:51.359 --> 00:19:58.639
Right? This is a tensor of rank four.

00:19:54.160 --> 00:20:02.000
Okay? 1 2 3 4. And so if you have a

00:19:58.640 --> 00:20:03.520
vector, right? So you can imagine if

00:20:02.000 --> 00:20:06.480
it's just a vector, you can imagine the

00:20:03.519 --> 00:20:10.240
vector actually living like this, just a

00:20:06.480 --> 00:20:14.000
list of numbers, right?

00:20:10.240 --> 00:20:16.798
But if it is just

00:20:14.000 --> 00:20:19.038
a 2D, a rank two tensor, right, which is

00:20:16.798 --> 00:20:21.200
just like that right which is just like

00:20:19.038 --> 00:20:24.079
that

00:20:21.200 --> 00:20:26.400
so this thing becomes you know like that

00:20:24.079 --> 00:20:29.199
and that thing becomes like that. So for

00:20:26.400 --> 00:20:31.360
example if this is a 7 by 3 that means

00:20:29.200 --> 00:20:35.200
that there are

00:20:31.359 --> 00:20:36.558
seven rows and three columns.

00:20:35.200 --> 00:20:38.558
So you get the idea. So the way you

00:20:36.558 --> 00:20:40.319
think about a tensor is always as this

00:20:38.558 --> 00:20:42.079
open square bracket a bunch of things a

00:20:40.319 --> 00:20:44.639
closed square bracket and that's really

00:20:42.079 --> 00:20:48.158
what a tensor object is. So what that

00:20:44.640 --> 00:20:49.759
means is that anytime you have a tensor

00:20:48.159 --> 00:20:52.480
right anytime you have a tensor however

00:20:49.759 --> 00:20:54.720
complicated it is you can always create

00:20:52.480 --> 00:20:56.319
a more complicated tensor if you want

00:20:54.720 --> 00:20:59.279
to take a list of those tensors let's

00:20:56.319 --> 00:21:02.158
say that you have a list of videos

00:20:59.279 --> 00:21:04.240
each video is a rank four tensor so

00:21:02.159 --> 00:21:05.760
which means a list of videos is what

00:21:04.240 --> 00:21:10.720
rank

00:21:05.759 --> 00:21:15.279
Exactly. So a tensor of rank say 10 is

00:21:10.720 --> 00:21:17.120
just a list of rank nine tensors.
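
In NumPy terms (the shapes here are made up for illustration), stacking a list of tensors always produces a tensor of one higher rank:

    import numpy as np

    frames = [np.zeros((64, 64, 3)) for _ in range(24)]  # 24 color images, rank 3 each
    video = np.stack(frames)            # rank 4: (frame, height, width, channel)
    videos = np.stack([video, video])   # a list of videos: rank 5
    print(video.ndim, videos.ndim)      # 4 5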

00:21:15.279 --> 00:21:18.000
So that is the most

00:21:17.119 --> 00:21:20.719
important thing you need to understand

00:21:18.000 --> 00:21:22.640
about tensors. So at any point in time

00:21:20.720 --> 00:21:24.319
if I give you a tensor you can just

00:21:22.640 --> 00:21:27.520
iterate through the first dimension of

00:21:24.319 --> 00:21:29.119
it, the first axis of it, and as you

00:21:27.519 --> 00:21:32.158
go through each one of these values. So

00:21:29.119 --> 00:21:35.599
for example here um

00:21:32.159 --> 00:21:38.600
yeah that can do it.

00:21:35.599 --> 00:21:38.599
So

00:21:39.038 --> 00:21:43.599
so if you have this tensor here

00:21:42.319 --> 00:21:46.879
and if you want to create a more

00:21:43.599 --> 00:21:52.359
complicated tensor no problem.

00:21:46.880 --> 00:21:52.360
So you add another dimension here. Okay.

00:21:52.558 --> 00:21:58.119
Now it just becomes this dimension let's

00:21:54.480 --> 00:21:58.120
say has nine values.

00:21:58.558 --> 00:22:02.558
one of the nine. So you put zero here

00:22:00.960 --> 00:22:04.720
and then what do you get? This whole

00:22:02.558 --> 00:22:06.798
tensor is a rank four tensor. And you

00:22:04.720 --> 00:22:08.720
put a one here, it's another rank four

00:22:06.798 --> 00:22:11.759
tensor. You put a two here, another rank

00:22:08.720 --> 00:22:14.319
four tensor. So every tensor, you take

00:22:11.759 --> 00:22:18.000
the first element, it's just a list, but

00:22:14.319 --> 00:22:20.480
it's a list of tensors of the next rank down.

00:22:18.000 --> 00:22:21.679
Okay. Now this tensor concept is

00:22:20.480 --> 00:22:26.640
actually something Einstein famously worked

00:22:21.679 --> 00:22:28.480
with. Um and so it's simultaneously

00:22:26.640 --> 00:22:30.559
kind of easy to understand and also

00:22:28.480 --> 00:22:32.400
slippery. So I would actually encourage

00:22:30.558 --> 00:22:33.918
you to read the book which has a really

00:22:32.400 --> 00:22:35.280
good discussion of tensors and the more

00:22:33.919 --> 00:22:38.000
you practice with it the easier it'll

00:22:35.279 --> 00:22:39.759
get. Okay. So if you feel you kind of

00:22:38.000 --> 00:22:42.159
understood but not quite you're not

00:22:39.759 --> 00:22:43.599
alone. It happens to all of us right?

00:22:42.159 --> 00:22:48.640
You have to pay the price or go through

00:22:43.599 --> 00:22:51.519
the crucible. Okay. Okay. All right.

00:22:48.640 --> 00:22:55.038
So to come back to this

00:22:51.519 --> 00:22:56.400
that's what we have

00:22:55.038 --> 00:22:59.519
and we already talked about a rank four

00:22:56.400 --> 00:23:00.720
tensor, it's a video. Section 2.2 of the text

00:22:59.519 --> 00:23:05.119
has a lot more detail. You should

00:23:00.720 --> 00:23:08.079
definitely read it. Uh so here TensorFlow

00:23:05.119 --> 00:23:10.639
is a library and as you can imagine

00:23:08.079 --> 00:23:11.918
neural networks tensors come in and go

00:23:10.640 --> 00:23:14.559
through the network and go out the other

00:23:11.919 --> 00:23:16.880
end right and since tensors capture

00:23:14.558 --> 00:23:18.639
everything numbers lists uh tables and

00:23:16.880 --> 00:23:20.240
so on and so forth it's just tensors

00:23:18.640 --> 00:23:22.640
flowing from input to output hence it's

00:23:20.240 --> 00:23:23.839
called tensorflow and it gives you a

00:23:22.640 --> 00:23:25.360
couple of things which are really really

00:23:23.839 --> 00:23:27.199
important which is why we use it. The

00:23:25.359 --> 00:23:30.000
first one is that it'll automatically

00:23:27.200 --> 00:23:32.640
calculate gradients for you of

00:23:30.000 --> 00:23:34.079
arbitrarily complicated loss functions.
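
A minimal sketch of this automatic differentiation, using TensorFlow's GradientTape on an arbitrary made-up loss:

    import tensorflow as tf

    w = tf.Variable(3.0)
    with tf.GradientTape() as tape:
        loss = (w - 1.0) ** 2 + tf.sin(w)  # any differentiable expression
    grad = tape.gradient(loss, w)          # dLoss/dw, no chain rule by hand
    print(grad.numpy())                    # equals 2*(w-1) + cos(w) at w=3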

00:23:32.640 --> 00:23:35.520
You don't have to calculate the gradient

00:23:34.079 --> 00:23:37.678
because calculating the gradient is very

00:23:35.519 --> 00:23:39.519
painful, right? It'll automatically

00:23:37.679 --> 00:23:40.720
calculate the gradients for you. That's

00:23:39.519 --> 00:23:42.639
the best part. You don't have to use the

00:23:40.720 --> 00:23:44.400
chain rule. You don't do anything. The

00:23:42.640 --> 00:23:46.400
second thing it'll do, it gives you all

00:23:44.400 --> 00:23:48.000
these optimizers including SGD and all

00:23:46.400 --> 00:23:49.360
its variations. So you don't have to

00:23:48.000 --> 00:23:50.558
worry about the optimization itself.

00:23:49.359 --> 00:23:53.359
You can just pick and choose

00:23:50.558 --> 00:23:55.440
what you want. Third, if you have a lot

00:23:53.359 --> 00:23:56.959
of servers, it'll actually take the

00:23:55.440 --> 00:23:58.320
computational load and distribute it

00:23:56.960 --> 00:24:00.480
across all those servers. People here

00:23:58.319 --> 00:24:02.879
with the CS background know that

00:24:00.480 --> 00:24:05.038
parallelizing computation is actually a

00:24:02.880 --> 00:24:06.320
very difficult problem, right? There are

00:24:05.038 --> 00:24:09.119
things which are called embarrassingly

00:24:06.319 --> 00:24:10.798
parallel. Many things are not; they're actually

00:24:09.119 --> 00:24:11.839
quite tricky to figure out. We don't

00:24:10.798 --> 00:24:13.918
know how to figure it out. TensorFlow

00:24:11.839 --> 00:24:15.678
will figure it out. Okay? And then

00:24:13.919 --> 00:24:17.440
finally, I talked about the fact that

00:24:15.679 --> 00:24:18.720
there are these things called GPUs,

00:24:17.440 --> 00:24:21.919
graphics processing units, which are

00:24:18.720 --> 00:24:23.679
parallel hardware. Uh and so even

00:24:21.919 --> 00:24:26.000
if you have just one computer but it has

00:24:23.679 --> 00:24:28.080
GPUs there's a particular way in which

00:24:26.000 --> 00:24:30.079
you have to take your computation and

00:24:28.079 --> 00:24:33.359
organize it to really exploit the fact

00:24:30.079 --> 00:24:35.199
that you have a GPU and so TensorFlow

00:24:33.359 --> 00:24:36.240
will actually do it for you out of the

00:24:35.200 --> 00:24:38.080
box automatically you don't have to

00:24:36.240 --> 00:24:39.278
worry about any of that stuff okay so

00:24:38.079 --> 00:24:41.519
those are all the advantages of this

00:24:39.278 --> 00:24:43.519
thing by the way TPU is called a tensor

00:24:41.519 --> 00:24:45.278
processing unit. You can

00:24:43.519 --> 00:24:47.440
kind of think of it as Google's

00:24:45.278 --> 00:24:50.000
GPU right they came up with their own

00:24:47.440 --> 00:24:52.080
variation on the theme okay now Keras

00:24:50.000 --> 00:24:53.839
sits on top of TensorFlow, right?

00:24:52.079 --> 00:24:56.158
This is the

00:24:53.839 --> 00:24:58.319
hardware you have. TensorFlow sits on

00:24:56.159 --> 00:25:01.200
top of the hardware. Keras sits on top

00:24:58.319 --> 00:25:02.879
of TensorFlow and it basically gives you

00:25:01.200 --> 00:25:04.960
a whole bunch of convenience features.

00:25:02.880 --> 00:25:07.120
So, for example, it gives you the notion

00:25:04.960 --> 00:25:10.079
of a layer, right? We already saw

00:25:07.119 --> 00:25:11.519
keras.layers.Dense is a dense layer, right? It

00:25:10.079 --> 00:25:12.558
gives you the notion of a layer. It

00:25:11.519 --> 00:25:14.558
gives you the notion of activation

00:25:12.558 --> 00:25:16.240
functions and so on and so forth. It

00:25:14.558 --> 00:25:18.079
gives you easy ways to pre-process the

00:25:16.240 --> 00:25:20.000
data, easy ways to train the model,

00:25:18.079 --> 00:25:21.839
report on metrics, you know, calculate

00:25:20.000 --> 00:25:23.519
validation loss, validation accuracy,

00:25:21.839 --> 00:25:25.359
training loss, all the metrics we care

00:25:23.519 --> 00:25:26.960
about. And then it also gives you a

00:25:25.359 --> 00:25:28.558
whole library of pre-trained models that

00:25:26.960 --> 00:25:30.798
you can just use and adapt for your

00:25:28.558 --> 00:25:32.720
particular problem. So it gives you a

00:25:30.798 --> 00:25:34.400
whole bunch of conveniences and that's

00:25:32.720 --> 00:25:35.679
why it's very popular. And by the way,

00:25:34.400 --> 00:25:37.440
you know, many of you might also be

00:25:35.679 --> 00:25:38.960
familiar with PyTorch, which is a

00:25:37.440 --> 00:25:41.038
fantastic framework as well for deep

00:25:38.960 --> 00:25:42.798
learning. And the reason we chose to go

00:25:41.038 --> 00:25:45.679
with TensorFlow for this course rather

00:25:42.798 --> 00:25:48.158
than PyTorch is because we wanted to

00:25:45.679 --> 00:25:49.679
make the course uh sort of accessible to

00:25:48.159 --> 00:25:51.200
folks who don't have a ton of

00:25:49.679 --> 00:25:53.360
programming background before coming to

00:25:51.200 --> 00:25:55.519
the class. And PyTorch is a bit more

00:25:53.359 --> 00:25:56.479
demanding from a CS perspective. It

00:25:55.519 --> 00:25:58.400
requires more knowledge of

00:25:56.480 --> 00:25:59.759
object-oriented programming. Uh which is

00:25:58.400 --> 00:26:02.080
why we decided to go with TensorFlow and

00:25:59.759 --> 00:26:04.720
Keras because I think it's actually as

00:26:02.079 --> 00:26:07.278
powerful uh in many ways and it's a

00:26:04.720 --> 00:26:09.440
little easier to get going. Okay, so

00:26:07.278 --> 00:26:10.960
that's what we have here. And one other

00:26:09.440 --> 00:26:12.720
thing I will mention is that there are

00:26:10.960 --> 00:26:14.480
three ways in which you can use Keras.

00:26:12.720 --> 00:26:16.480
There are three kinds of APIs.

00:26:14.480 --> 00:26:18.079
Sequential, functional, subclassing. And

00:26:16.480 --> 00:26:21.120
we'll almost exclusively use the

00:26:18.079 --> 00:26:22.319
functional API. Okay. And in fact, the

00:26:21.119 --> 00:26:24.399
model we built for heart disease

00:26:22.319 --> 00:26:26.798
prediction uses the functional API. And

00:26:24.400 --> 00:26:28.640
so just read 7.2.2 of the textbook to

00:26:26.798 --> 00:26:30.319
understand in detail how the API works.

00:26:28.640 --> 00:26:32.080
I find in my own work, the functional

00:26:30.319 --> 00:26:33.278
API is basically all I need. I don't

00:26:32.079 --> 00:26:35.519
need to do anything more complicated

00:26:33.278 --> 00:26:37.599
than that. Um and as you will see as

00:26:35.519 --> 00:26:39.679
you work on the homeworks uh and on your

00:26:37.599 --> 00:26:41.199
project that it's sort of a

00:26:39.679 --> 00:26:43.440
beautifully designed Lego block

00:26:41.200 --> 00:26:45.200
environment for doing these things and

00:26:43.440 --> 00:26:48.240
you can create very complicated models

00:26:45.200 --> 00:26:50.159
very easily. Okay. Uh there's a whole

00:26:48.240 --> 00:26:51.759
bunch of stuff here on these websites.

00:26:50.159 --> 00:26:55.600
So check them out. There's lots of

00:26:51.759 --> 00:26:57.038
Colabs uh available. So now

00:26:55.599 --> 00:26:58.158
if you go back to the neural model for

00:26:57.038 --> 00:26:59.519
heart disease prediction, this is what

00:26:58.159 --> 00:27:02.400
we came up with in the last class,

00:26:59.519 --> 00:27:04.319
right? uh we had an input layer, one

00:27:02.400 --> 00:27:05.759
dense layer with 16 neurons, ReLU

00:27:04.319 --> 00:27:08.000
neurons, an output layer with the

00:27:05.759 --> 00:27:10.720
sigmoid and then boom, that was a model.
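
A sketch of that architecture in the Keras functional API, assuming 13 numeric input features (the exact feature count in the Colab may differ):

    from tensorflow import keras

    inputs = keras.Input(shape=(13,))                         # one row per patient
    x = keras.layers.Dense(16, activation="relu")(inputs)     # hidden layer
    outputs = keras.layers.Dense(1, activation="sigmoid")(x)  # probability out
    model = keras.Model(inputs, outputs)
    model.summary()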

00:27:08.000 --> 00:27:13.119
So let's train this model. Uh and so the

00:27:10.720 --> 00:27:14.640
training checklist is that uh we have

00:27:13.119 --> 00:27:17.918
already done this hidden layer of 16

00:27:14.640 --> 00:27:19.200
neurons and the sigmoid output. We need to use an

00:27:17.919 --> 00:27:20.559
appropriate loss function based on the

00:27:19.200 --> 00:27:23.038
type of output. What loss function

00:27:20.558 --> 00:27:26.480
should we use?

00:27:23.038 --> 00:27:28.798
What is the output here?

00:27:26.480 --> 00:27:33.079
It's a binary classification problem. So

00:27:28.798 --> 00:27:33.079
what should the loss function be?

00:27:33.440 --> 00:27:37.360
I kind of heard it somewhere. Just shout it

00:27:35.599 --> 00:27:40.079
out.

00:27:37.359 --> 00:27:43.079
No, the output is a sigmoid. The loss

00:27:40.079 --> 00:27:43.079
function is

00:27:43.200 --> 00:27:46.798
cross entropy.

00:27:44.960 --> 00:27:48.798
Okay, remember if you're predicting a

00:27:46.798 --> 00:27:50.879
number an arbitrary number, you can use

00:27:48.798 --> 00:27:52.400
something like mean square error. If

00:27:50.880 --> 00:27:55.120
you're predicting a probability which

00:27:52.400 --> 00:27:56.720
has to be compared to a 0/1 output, which

00:27:55.119 --> 00:27:59.038
is what binary classification is all

00:27:56.720 --> 00:28:01.120
about, we use binary cross entropy.

00:27:59.038 --> 00:28:03.759
Okay, so that's what we do here. So we

00:28:01.119 --> 00:28:06.239
do binary cross entropy

00:28:03.759 --> 00:28:08.158
and then we will go with Adam, right?

00:28:06.240 --> 00:28:10.880
And then we'll use early stopping to

00:28:08.159 --> 00:28:12.399
make sure we don't overfit.
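
Putting the checklist together, a hedged sketch of the compile-and-fit calls (model is the functional-API model sketched earlier; X_train, y_train, and the epoch count are placeholder choices, not the Colab's actual values):

    from tensorflow import keras

    early_stop = keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=10, restore_best_weights=True)

    model.compile(optimizer="adam",              # the default SGD flavor
                  loss="binary_crossentropy",    # sigmoid output vs. 0/1 target
                  metrics=["accuracy"])
    history = model.fit(X_train, y_train,
                        batch_size=32, epochs=100,
                        validation_split=0.2,    # held-out rows for early stopping
                        callbacks=[early_stop])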

00:28:10.880 --> 00:28:13.679
Okay, I know this is a lot, but I promise this is

00:28:12.398 --> 00:28:16.079
literally the last slide before I go

00:28:13.679 --> 00:28:19.519
to the Colab. I feel like one of those

00:28:16.079 --> 00:28:23.359
used car salesmen: but wait, there's more.

00:28:19.519 --> 00:28:24.720
So anyway, uh, don't worry if you

00:28:23.359 --> 00:28:26.558
don't understand every detail of what

00:28:24.720 --> 00:28:27.919
I'm going to go through. I'm going to

00:28:26.558 --> 00:28:29.839
link to the Colab as soon as the class

00:28:27.919 --> 00:28:31.278
is over. But once you get your hands on

00:28:29.839 --> 00:28:33.519
the Colab, make sure you actually go

00:28:31.278 --> 00:28:34.640
through every line in the Colab. What I

00:28:33.519 --> 00:28:36.558
typically do when I'm trying to learn

00:28:34.640 --> 00:28:39.919
something new is I'll actually cut and

00:28:36.558 --> 00:28:41.359
paste, right? I won't do that. I won't

00:28:39.919 --> 00:28:44.159
actually cut and paste the code and run

00:28:41.359 --> 00:28:45.519
it myself. I will retype the code. If

00:28:44.159 --> 00:28:46.799
you retype the code as opposed to

00:28:45.519 --> 00:28:48.960
cutting and pasting, trust me, you'll

00:28:46.798 --> 00:28:52.079
learn a lot more. Right? So I strongly

00:28:48.960 --> 00:28:54.480
encourage you to do it that way.

00:28:52.079 --> 00:28:56.079
Um and so for all the Colabs we're going

00:28:54.480 --> 00:28:57.519
to publish in the class, uh the first

00:28:56.079 --> 00:29:00.240
thing you should do is you should just

00:28:57.519 --> 00:29:02.879
make your own copy of the notebook,

00:29:00.240 --> 00:29:04.558
right? Copy to drive. And then if you're

00:29:02.880 --> 00:29:06.720
using anything other than today's

00:29:04.558 --> 00:29:08.079
Colab, uh right, anything involving

00:29:06.720 --> 00:29:10.079
natural language processing or vision,

00:29:08.079 --> 00:29:13.038
you probably should use a GPU. So just

00:29:10.079 --> 00:29:15.918
go in here, choose the runtime

00:29:13.038 --> 00:29:17.599
to be a GPU. Um and then you start your

00:29:15.919 --> 00:29:19.038
notebook and you're done. And the second

00:29:17.599 --> 00:29:21.199
time onwards, you can just go directly

00:29:19.038 --> 00:29:23.359
to this step. You don't have to do all

00:29:21.200 --> 00:29:24.880
this stuff for that particular notebook.

00:29:23.359 --> 00:29:26.319
And there are numerous tutorials like

00:29:24.880 --> 00:29:27.919
five minute videos and so on on how to

00:29:26.319 --> 00:29:30.319
use Colab. Just do that. I'm not

00:29:27.919 --> 00:29:33.919
going to spend time on it here.

00:29:30.319 --> 00:29:35.839
All right. Okay. So, uh I just ran it um

00:29:33.919 --> 00:29:37.120
a few hours ago. I'm not going to run

00:29:35.839 --> 00:29:38.079
every cell now because it's going to

00:29:37.119 --> 00:29:39.759
take some time. It's going to get in the

00:29:38.079 --> 00:29:40.960
way of the class time, but I'm going to

00:29:39.759 --> 00:29:43.359
just like, you know, go through it

00:29:40.960 --> 00:29:45.278
slowly and explain what's going on. So,

00:29:43.359 --> 00:29:46.639
here this is just an introduction to the

00:29:45.278 --> 00:29:49.038
data set. We already saw this

00:29:46.640 --> 00:29:51.759
introduction last week. We

00:29:49.038 --> 00:29:54.720
have, whatever, 303 patients, heart

00:29:51.759 --> 00:29:57.919
patients. We have a whole bunch of uh

00:29:54.720 --> 00:29:59.839
variables here, age, demographics, and a

00:29:57.919 --> 00:30:02.960
whole bunch of biomarker information.

00:29:59.839 --> 00:30:05.519
And this is a target variable. Okay? Uh

00:30:02.960 --> 00:30:07.759
zero or one, heart disease, yes or no.

00:30:05.519 --> 00:30:10.319
And so, by the way, just some technical

00:30:07.759 --> 00:30:12.000
preliminaries here. Basically,

00:30:10.319 --> 00:30:13.439
every time we load these things, we're

00:30:12.000 --> 00:30:15.119
actually going to load these packages.

00:30:13.440 --> 00:30:16.558
So you can see here these are the two

00:30:15.119 --> 00:30:18.879
key things we need to do. We import

00:30:16.558 --> 00:30:21.038
tensorflow first and then from within

00:30:18.880 --> 00:30:23.760
TensorFlow we import Keras. Okay, that's

00:30:21.038 --> 00:30:25.759
what these two lines do here. Okay.
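
The two key lines, plus the three common packages he mentions next, look roughly like this:

    import tensorflow as tf          # the underlying library
    from tensorflow import keras     # the high-level API on top of it

    import numpy as np               # manipulating arrays and tensors
    import pandas as pd              # loading and wrangling tabular data
    import matplotlib.pyplot as plt  # plotting loss and accuracy curves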

00:30:23.759 --> 00:30:26.798
And then, folks who have done data

00:30:25.759 --> 00:30:28.640
science and machine learning a bit

00:30:26.798 --> 00:30:30.720
before, you'll know this. We will

00:30:28.640 --> 00:30:32.320
actually load

00:30:30.720 --> 00:30:34.558
the three packages that are the most

00:30:32.319 --> 00:30:37.278
commonly used, right, which are numpy,

00:30:34.558 --> 00:30:39.678
pandas, and matplotlib. Uh numpy

00:30:37.278 --> 00:30:42.079
because it's very easy for manipulating

00:30:39.679 --> 00:30:44.159
matrices and arrays and tensors. uh

00:30:42.079 --> 00:30:46.240
pandas because often times you get some

00:30:44.159 --> 00:30:48.240
data in from somewhere you need to

00:30:46.240 --> 00:30:49.839
massage it and wrangle it to a point

00:30:48.240 --> 00:30:51.839
where we can actually feed it into Keras,

00:30:49.839 --> 00:30:53.678
so you need pandas for that, and

00:30:51.839 --> 00:30:55.839
matplotlib because you just want to plot, you

00:30:53.679 --> 00:30:57.440
know uh these loss curves and accuracy

00:30:55.839 --> 00:31:00.158
curves to see whether early stopping is

00:30:57.440 --> 00:31:02.320
needed. Okay, so that's why we use it.
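
(For reference, a minimal sketch of the import cell being described; the aliases tf, np, pd, and plt are the conventional ones, not necessarily the notebook's exact names.)

```python
import tensorflow as tf
from tensorflow import keras     # Keras, imported from within TensorFlow

import numpy as np               # arrays, matrices, tensors
import pandas as pd              # loading and wrangling tabular data
import matplotlib.pyplot as plt  # plotting loss and accuracy curves
```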

00:31:00.159 --> 00:31:03.200
So we import all these things. And then I

00:31:02.319 --> 00:31:04.558
guess the other thing you have to

00:31:03.200 --> 00:31:06.558
remember is that when we are training

00:31:04.558 --> 00:31:08.639
these deep learning models uh there is

00:31:06.558 --> 00:31:11.278
randomness in the process which enters

00:31:08.640 --> 00:31:13.360
in a few different places so clearly the

00:31:11.278 --> 00:31:14.398
starting values for these weights:

00:31:13.359 --> 00:31:15.439
they're going to be

00:31:14.398 --> 00:31:17.359
randomly

00:31:15.440 --> 00:31:19.600
initialized. Uh and therefore

00:31:17.359 --> 00:31:22.398
that's obviously a source of randomness.

00:31:19.599 --> 00:31:23.599
Uh now we talked about how you take if

00:31:22.398 --> 00:31:25.519
when you're doing stochastic gradient

00:31:23.599 --> 00:31:28.000
descent you take all the data and then

00:31:25.519 --> 00:31:29.839
you randomly choose batches right from

00:31:28.000 --> 00:31:32.398
this data till we finish a whole pass

00:31:29.839 --> 00:31:33.519
through it. Well that immediately raised

00:31:32.398 --> 00:31:35.759
the question: well, what do you mean

00:31:33.519 --> 00:31:37.839
by randomly choose? So typically what we

00:31:35.759 --> 00:31:39.519
do in practice is that, and Keras will take

00:31:37.839 --> 00:31:40.720
care of all this for you, um, you

00:31:39.519 --> 00:31:42.960
basically take the data and just shuffle

00:31:40.720 --> 00:31:45.200
it once randomly and then you just go

00:31:42.960 --> 00:31:47.120
first 32 next 32 next 32 next 32 like

00:31:45.200 --> 00:31:49.278
that okay but it is a source of

00:31:47.119 --> 00:31:51.518
randomness and then when we split the

00:31:49.278 --> 00:31:53.278
data into train validation testing and

00:31:51.519 --> 00:31:55.440
so on uh particularly if you want to

00:31:53.278 --> 00:31:56.880
look for early stopping and overfitting

00:31:55.440 --> 00:31:58.480
uh we need to again split the data

00:31:56.880 --> 00:32:01.120
randomly and that's another source of

00:31:58.480 --> 00:32:02.880
randomness and then when we do dropout

00:32:01.119 --> 00:32:05.119
which we'll talk about on Wednesday

00:32:02.880 --> 00:32:06.799
again dropout has a little bit of a

00:32:05.119 --> 00:32:09.599
random element to it and so that's

00:32:06.798 --> 00:32:11.679
another source of randomness. So

00:32:09.599 --> 00:32:13.038
all this means is that if

00:32:11.679 --> 00:32:14.240
you're working with these models and if

00:32:13.038 --> 00:32:16.000
you want to build a model and you want

00:32:14.240 --> 00:32:17.919
to hand it off to someone so that they

00:32:16.000 --> 00:32:19.759
can reproduce your results well you

00:32:17.919 --> 00:32:21.440
better make sure that you sort of you

00:32:19.759 --> 00:32:22.960
know make it easy for them to replicate

00:32:21.440 --> 00:32:24.798
what you have and the way you do it is

00:32:22.960 --> 00:32:26.960
by setting a random seed for

00:32:24.798 --> 00:32:28.480
all these things okay and the way you do

00:32:26.960 --> 00:32:31.200
it is by having this little handy

00:32:28.480 --> 00:32:32.960
function here, set random seed, uh and of

00:32:31.200 --> 00:32:35.360
course you know I use 42 too, just

00:32:32.960 --> 00:32:38.000
like everybody should right so okay so

00:32:35.359 --> 00:32:39.678
that's that uh by the way just that's

00:32:38.000 --> 00:32:40.880
just a pop-culture reference to this book

00:32:39.679 --> 00:32:43.360
called The Hitchhiker's Guide to the

00:32:40.880 --> 00:32:45.440
Galaxy.

00:32:43.359 --> 00:32:47.678
>> Number 42 and you'll know what I mean.
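
(Keras ships a one-call helper for this; a sketch, assuming the notebook uses it or something equivalent.)

```python
# Seeds the Python, NumPy, and TensorFlow random generators in one call.
keras.utils.set_random_seed(42)
```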

00:32:45.440 --> 00:32:49.759
Okay, so by the way, um the question

00:32:47.679 --> 00:32:51.278
inevitably comes at this point, okay, if

00:32:49.759 --> 00:32:52.558
we do exactly this, will you actually

00:32:51.278 --> 00:32:55.359
get the exact same numbers that you have

00:32:52.558 --> 00:32:57.200
in your version uh of the notebook? And

00:32:55.359 --> 00:32:59.119
the answer is hopefully most of the

00:32:57.200 --> 00:33:01.360
time, but it's not guaranteed. So this

00:32:59.119 --> 00:33:03.359
is called bitwise reproducibility. It's

00:33:01.359 --> 00:33:05.439
not guaranteed due to certain hardware

00:33:03.359 --> 00:33:07.119
things and device drivers and stuff like

00:33:05.440 --> 00:33:09.120
that. So we won't get into all that

00:33:07.119 --> 00:33:11.199
stuff. uh and which is why as you see

00:33:09.119 --> 00:33:14.239
here uh I have a bit of a fingers

00:33:11.200 --> 00:33:16.480
crossed thing. Okay. All right. Cool. So

00:33:14.240 --> 00:33:18.399
that's what we have. Um so as it turns

00:33:16.480 --> 00:33:20.240
out uh François Chollet who wrote the

00:33:18.398 --> 00:33:21.678
book uh the textbook he actually made

00:33:20.240 --> 00:33:24.480
this data available in a pandas data

00:33:21.679 --> 00:33:26.880
frame. So we read the CSV file into this

00:33:24.480 --> 00:33:30.720
data frame right there. Uh and then it's

00:33:26.880 --> 00:33:32.000
uh and it's 303 rows 14 columns right

00:33:30.720 --> 00:33:34.399
and you can see here we'll take a look

00:33:32.000 --> 00:33:36.960
at the first few rows. Uh and these are

00:33:34.398 --> 00:33:38.719
all the columns: age, gender, cholesterol,

00:33:36.960 --> 00:33:41.120
blah blah blah blah blah. And then this

00:33:38.720 --> 00:33:42.880
is the target variable right there. U

00:33:41.119 --> 00:33:44.000
and the one of the first things I always

00:33:42.880 --> 00:33:45.679
do when I'm working with a binary

00:33:44.000 --> 00:33:47.359
classification problem is to quickly

00:33:45.679 --> 00:33:49.759
check whether the positive and negative

00:33:47.359 --> 00:33:51.119
classes are balanced or not. And so what

00:33:49.759 --> 00:33:52.720
you can do is you can just quickly check

00:33:51.119 --> 00:33:55.278
to see what percent of the data points

00:33:52.720 --> 00:33:57.038
is zero versus one. And you can see here

00:33:55.278 --> 00:33:59.038
uh 72.6%

00:33:57.038 --> 00:34:00.720
of the patients don't have heart

00:33:59.038 --> 00:34:03.839
disease. That's a good thing of course.

00:34:00.720 --> 00:34:05.519
Uh and then 27.4% have heart disease. So

00:34:03.839 --> 00:34:08.159
it's not bad. It's not 50/50 or roughly

00:34:05.519 --> 00:34:11.599
50/50. It's a little skewed. So, by the

00:34:08.159 --> 00:34:13.358
way, quick question. What is a good

00:34:11.599 --> 00:34:14.639
baseline model for this problem? Suppose

00:34:13.358 --> 00:34:15.679
you couldn't use anything any

00:34:14.639 --> 00:34:19.159
complicated thing. What's a good

00:34:15.679 --> 00:34:19.159
baseline model?

00:34:22.079 --> 00:34:25.519
>> Yes. Just predict zero.

00:34:24.320 --> 00:34:28.879
>> Yeah. And why would you do that?

00:34:25.519 --> 00:34:31.519
>> Uh, it would give you a 72.6% accuracy.

00:34:28.878 --> 00:34:33.759
Exactly. Because 72.6% is

00:34:31.519 --> 00:34:35.358
the class with the

00:34:33.760 --> 00:34:37.040
higher percentage you just predict it

00:34:35.358 --> 00:34:38.878
you'll be right on those 72.6% of the

00:34:37.039 --> 00:34:41.519
cases you'll be wrong on the rest which

00:34:38.878 --> 00:34:43.838
means that your accuracy of this model

00:34:41.519 --> 00:34:46.559
is going to be 72.6%.
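
(A quick way to do that check, assuming the label column is named "target" as in the printout above.)

```python
# Fraction of 0s and 1s; the majority class sets the baseline accuracy.
print(df["target"].value_counts(normalize=True))  # ~0.726 vs ~0.274
baseline_accuracy = (df["target"] == 0).mean()    # "always predict 0" is right ~72.6% of the time
```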

00:34:43.838 --> 00:34:48.078
Okay. And so any fancy model we build

00:34:46.559 --> 00:34:49.279
better do you know it's got to do better

00:34:48.079 --> 00:34:51.919
than this otherwise it's not worth its

00:34:49.280 --> 00:34:53.760
weight uh in layers. Um so all right so

00:34:51.918 --> 00:34:54.960
we'll come back to this later. So the

00:34:53.760 --> 00:34:56.560
first thing we want to do is we want to

00:34:54.960 --> 00:34:58.880
pre-process it because this data set has

00:34:56.559 --> 00:35:01.599
both categorical variables and numeric

00:34:58.880 --> 00:35:03.119
variables. Um and so it's usually

00:35:01.599 --> 00:35:05.119
convenient to just group them into

00:35:03.119 --> 00:35:06.640
two different groups. So I have listed

00:35:05.119 --> 00:35:09.200
all the categorical variables here and

00:35:06.639 --> 00:35:11.199
the numeric here. Uh and then we have

00:35:09.199 --> 00:35:12.799
the pre-processing here. We have to take

00:35:11.199 --> 00:35:15.118
the categorical variables and we have to

00:35:12.800 --> 00:35:17.920
one hot encode them. And the reason is

00:35:15.119 --> 00:35:20.400
that unlike say a decision tree model, a

00:35:17.920 --> 00:35:22.800
neural network cannot handle uh

00:35:20.400 --> 00:35:24.720
categorical inputs directly. It can only

00:35:22.800 --> 00:35:26.400
handle numeric inputs. Which means that

00:35:24.719 --> 00:35:28.319
we have to numericalize every

00:35:26.400 --> 00:35:29.760
categorical thing that comes in. And

00:35:28.320 --> 00:35:31.200
there are many ways to do it, but the

00:35:29.760 --> 00:35:33.760
standard way to do it is one hot

00:35:31.199 --> 00:35:35.358
encoding. Um and for the numeric

00:35:33.760 --> 00:35:37.839
variables we need to normalize them and

00:35:35.358 --> 00:35:40.400
I'll come to that in a second. So pandas

00:35:37.838 --> 00:35:41.920
has this get_dummies function here and

00:35:40.400 --> 00:35:44.000
you can just run this thing and it'll

00:35:41.920 --> 00:35:45.680
just one-hot encode the whole thing. So once

00:35:44.000 --> 00:35:49.358
you do that this is what you have. So

00:35:45.679 --> 00:35:52.319
you can see here previously um let's say

00:35:49.358 --> 00:35:54.000
thal had three values: fixed, normal,

00:35:52.320 --> 00:35:56.880
reversible or something and then you go

00:35:54.000 --> 00:36:00.079
to the one hot encoded version u and now

00:35:56.880 --> 00:36:02.160
we can see here thal_fixed, thal_normal,

00:36:00.079 --> 00:36:04.720
thal_reversible. That's three columns, right?

00:36:02.159 --> 00:36:07.598
That's the one-hot encoding in action.
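
(A sketch of that call; the full list of categorical columns is in the notebook, only thal is shown here.)

```python
categorical_cols = ["thal"]  # plus the other categorical variables
df = pd.get_dummies(df, columns=categorical_cols)
# thal (fixed/normal/reversible) becomes three 0/1 columns:
# thal_fixed, thal_normal, thal_reversible
```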

00:36:04.719 --> 00:36:09.919
Okay, now the other thing to remember is

00:36:07.599 --> 00:36:12.240
that neural networks work best when the

00:36:09.920 --> 00:36:13.920
numeric inputs you send them are all in

00:36:12.239 --> 00:36:15.838
a relatively small range they shouldn't

00:36:13.920 --> 00:36:18.639
have a wide range of variation

00:36:15.838 --> 00:36:20.239
Um and so the standard practice is to

00:36:18.639 --> 00:36:22.078
standardize the numerical variables. By

00:36:20.239 --> 00:36:23.279
standardize, I mean typically subtract

00:36:22.079 --> 00:36:26.000
the mean, divide by the standard

00:36:23.280 --> 00:36:27.839
deviation. Um we should do that. But

00:36:26.000 --> 00:36:30.719
before we do so, we should split the

00:36:27.838 --> 00:36:32.239
data into a training set and a test set,

00:36:30.719 --> 00:36:33.759
right? And why do we want to split into

00:36:32.239 --> 00:36:35.039
a test set? Because at the very end once

00:36:33.760 --> 00:36:36.800
we've built the model and done all the

00:36:35.039 --> 00:36:38.639
things we want to do with it, we finally

00:36:36.800 --> 00:36:41.519
want to take out the test set and

00:36:38.639 --> 00:36:43.679
evaluate it once so that we get this

00:36:41.519 --> 00:36:46.079
true measure of how it's going to

00:36:43.679 --> 00:36:48.960
perform in the wild after you deploy it.

00:36:46.079 --> 00:36:51.280
Okay. Uh so you want to divide it,

00:36:48.960 --> 00:36:53.119
say 80% training and 20% test set. So

00:36:51.280 --> 00:36:54.640
the question is why should we do the

00:36:53.119 --> 00:36:57.358
splitting now before we do the

00:36:54.639 --> 00:37:01.480
normalization? Why can't we just do the

00:36:57.358 --> 00:37:01.480
normalization and then do the splitting?

00:37:02.800 --> 00:37:09.680
Um all right

00:37:06.239 --> 00:37:11.838
>> because then your uh validation set is

00:37:09.679 --> 00:37:13.440
also somewhat dependent on your test set

00:37:11.838 --> 00:37:13.838
results as well as the mean of the test

00:37:13.440 --> 00:37:16.400
set.

00:37:13.838 --> 00:37:18.799
>> Correct? Because the test set has now

00:37:16.400 --> 00:37:21.920
essentially sort of has been influenced

00:37:18.800 --> 00:37:23.359
by the training set. Right? The

00:37:21.920 --> 00:37:25.200
splitting is part of the

00:37:23.358 --> 00:37:27.039
modeling process, and so is

00:37:25.199 --> 00:37:28.719
the

00:37:27.039 --> 00:37:30.800
standardization.

00:37:28.719 --> 00:37:32.879
If the standardization, which is part

00:37:30.800 --> 00:37:34.720
of the process uses information about

00:37:32.880 --> 00:37:37.440
the test set, well, the test set is not

00:37:34.719 --> 00:37:39.759
really kept away from anything, is it?

00:37:37.440 --> 00:37:41.200
That's why we want to split it: lock away

00:37:39.760 --> 00:37:43.200
the test set somewhere and then proceed

00:37:41.199 --> 00:37:44.799
with the modeling this again this is

00:37:43.199 --> 00:37:47.598
like machine learning 101 which is why

00:37:44.800 --> 00:37:50.800
I'm going through it pretty fast uh okay

00:37:47.599 --> 00:37:53.200
so we do this uh sampling function:

00:37:50.800 --> 00:37:55.039
take 20% of the data and make it the

00:37:53.199 --> 00:37:56.719
test set and the remaining is going to

00:37:55.039 --> 00:37:58.960
be the training set. And when we do

00:37:56.719 --> 00:38:00.959
that, you can see the training set is

00:37:58.960 --> 00:38:05.199
now 242

00:38:00.960 --> 00:38:07.039
rows while the test set is 61 rows.
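
(One plausible version of that sampling step; the exact call in the notebook may differ.)

```python
test = df.sample(frac=0.2, random_state=42)  # hold out 20% as the test set
train = df.drop(test.index)                  # the remaining 80% for training
print(train.shape, test.shape)               # (242, ...) and (61, ...)
```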

00:38:05.199 --> 00:38:08.960
And with any of these data frames, you'll

00:38:07.039 --> 00:38:10.719
know that the shape attribute gives

00:38:08.960 --> 00:38:12.240
you the dimensions, the number of rows

00:38:10.719 --> 00:38:14.000
and columns. That's what we're doing

00:38:12.239 --> 00:38:15.199
here. And now that we have done that, we

00:38:14.000 --> 00:38:16.400
have done the split, we can calculate

00:38:15.199 --> 00:38:18.480
the mean and the standard

00:38:16.400 --> 00:38:20.079
deviation. So I calculate the mean here.

00:38:18.480 --> 00:38:21.760
I calculate standard deviation. And

00:38:20.079 --> 00:38:24.640
these are all the means. And once I do

00:38:21.760 --> 00:38:26.000
that, I just do you know each column

00:38:24.639 --> 00:38:28.319
minus the mean, divided by the standard

00:38:26.000 --> 00:38:30.320
deviation. And then once I do that,

00:38:28.320 --> 00:38:32.160
I save them in the train and the test

00:38:30.320 --> 00:38:33.680
data frames. And you can see here now

00:38:32.159 --> 00:38:36.799
all the numbers are all very sort of

00:38:33.679 --> 00:38:38.480
smallish, 0, 1, minus 1, kind of around

00:38:36.800 --> 00:38:40.880
that range, and that's kind of ideal for

00:38:38.480 --> 00:38:42.159
network training. Okay.
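
(A sketch of that standardization, assuming numeric_cols holds the list of numeric column names; note the mean and std come from the training set only.)

```python
mean = train[numeric_cols].mean()
std = train[numeric_cols].std()
train[numeric_cols] = (train[numeric_cols] - mean) / std
test[numeric_cols] = (test[numeric_cols] - mean) / std  # reuse the train stats
```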

00:38:40.880 --> 00:38:44.640
All right. So at this point the data

00:38:42.159 --> 00:38:46.719
is entirely numeric and then uh we are

00:38:44.639 --> 00:38:48.000
almost ready to feed it into Keras,

00:38:46.719 --> 00:38:51.279
and the way you do it is you take a

00:38:48.000 --> 00:38:52.719
pandas data

00:38:51.280 --> 00:38:54.880
frame and then you convert it into a

00:38:52.719 --> 00:38:56.959
numpy array, and then Keras is happy to

00:38:54.880 --> 00:39:00.079
receive it. So

00:38:56.960 --> 00:39:01.838
we use this method called to_numpy, which

00:39:00.079 --> 00:39:04.160
I think is as descriptive as it gets in

00:39:01.838 --> 00:39:05.838
programming. Um and then you save it as

00:39:04.159 --> 00:39:08.000
train and test. Now train and test are

00:39:05.838 --> 00:39:09.679
two numpy arrays with exactly the same

00:39:08.000 --> 00:39:12.000
information and now we can feed it into

00:39:09.679 --> 00:39:13.838
Keras. All right. Now I guess there's one

00:39:12.000 --> 00:39:17.358
other thing we need to do which is that

00:39:13.838 --> 00:39:18.880
um in this data frame train and test our

00:39:17.358 --> 00:39:20.799
independent variables all the features

00:39:18.880 --> 00:39:23.519
as well as the target, the 0/1 target.

00:39:20.800 --> 00:39:25.280
They're all in this

00:39:23.519 --> 00:39:27.679
right and we need to now take it and

00:39:25.280 --> 00:39:29.839
just take the the dependent variable the

00:39:27.679 --> 00:39:32.000
0/1 column and split it out and keep the

00:39:29.838 --> 00:39:33.519
x and the y separately. Right? That's

00:39:32.000 --> 00:39:34.960
the whole point of it, right? Because

00:39:33.519 --> 00:39:36.320
you need to feed the X, do the

00:39:34.960 --> 00:39:38.240
prediction, and then compare it to the

00:39:36.320 --> 00:39:41.599
actual Y and calculate the loss and so

00:39:38.239 --> 00:39:43.279
on and so forth. So, uh, so the target

00:39:41.599 --> 00:39:45.119
column is our Y variable, and it's

00:39:43.280 --> 00:39:47.119
column number six from the left. If you

00:39:45.119 --> 00:39:49.599
count it, you can see it. So, we just,

00:39:47.119 --> 00:39:53.039
you know, uh, we delete it from the

00:39:49.599 --> 00:39:56.720
the train and test. Um, and now we have

00:39:53.039 --> 00:39:58.320
242 rows and 29 columns, 29 features.

00:39:56.719 --> 00:40:01.039
You will recall from the network that we

00:39:58.320 --> 00:40:03.200
made way back, it had 29 inputs, right?

00:40:01.039 --> 00:40:06.159
29 nodes in the input layer. And that's

00:40:03.199 --> 00:40:07.838
where the 29 is coming from. And so now

00:40:06.159 --> 00:40:09.759
uh we just select the sixth column which

00:40:07.838 --> 00:40:12.559
is the target and make it the Y variable

00:40:09.760 --> 00:40:14.320
right train Y and test Y. And that is of

00:40:12.559 --> 00:40:16.559
course a vector which is 242 long in the

00:40:14.320 --> 00:40:19.359
training set and 61 long in the test set.
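
(A sketch of those two steps, separating the target and converting to NumPy. The notebook drops the column by position; doing it by name is equivalent.)

```python
train_y = train["target"].to_numpy()                 # shape (242,)
train_x = train.drop(columns=["target"]).to_numpy()  # shape (242, 29)
test_y = test["target"].to_numpy()                   # shape (61,)
test_x = test.drop(columns=["target"]).to_numpy()    # shape (61, 29)
```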

00:40:16.559 --> 00:40:21.679
So at this point all we have done is to

00:40:19.358 --> 00:40:22.960
be honest boring pre-processing. Okay,

00:40:21.679 --> 00:40:26.319
we haven't actually gotten to the action

00:40:22.960 --> 00:40:29.039
yet. Finally, let's do something. So um

00:40:26.320 --> 00:40:30.320
and we start with a single hidden layer.

00:40:29.039 --> 00:40:31.920
Since it's a binary classification

00:40:30.320 --> 00:40:34.000
problem, we'll use sigmoids as we saw

00:40:31.920 --> 00:40:36.559
earlier. And this is the model we

00:40:34.000 --> 00:40:39.760
created in the last class. This

00:40:36.559 --> 00:40:41.199
is the model we created. Okay. The only

00:40:39.760 --> 00:40:43.280
difference between that model and this

00:40:41.199 --> 00:40:45.919
model is that I've actually given names

00:40:43.280 --> 00:40:47.599
to these layers. And this name thing is

00:40:45.920 --> 00:40:48.800
totally optional. Right? If you want to

00:40:47.599 --> 00:40:50.240
give a name, give a name. It's just a

00:40:48.800 --> 00:40:53.280
little easier to interpret later on.

00:40:50.239 --> 00:40:55.519
Okay? It's just cosmetic. Okay? So, uh,

00:40:53.280 --> 00:40:57.760
but I've just put it here. U and once

00:40:55.519 --> 00:40:59.599
you build the model u you should

00:40:57.760 --> 00:41:01.680
immediately run the model.summary()

00:40:59.599 --> 00:41:04.079
command because it gives you a nice

00:41:01.679 --> 00:41:05.440
overview of the model right what are for

00:41:04.079 --> 00:41:07.599
each layer it tells you what the layer

00:41:05.440 --> 00:41:09.519
is it tells you what's coming into the

00:41:07.599 --> 00:41:11.280
layer meaning the shape of the tensor

00:41:09.519 --> 00:41:13.440
that's coming in and what's going out

00:41:11.280 --> 00:41:16.240
and how many parameters the layer has

00:41:13.440 --> 00:41:20.720
and it turns out this layer has, sorry,

00:41:16.239 --> 00:41:22.799
this network has 497 parameters. Okay.

00:41:20.719 --> 00:41:24.078
And I have told you repeatedly the first

00:41:22.800 --> 00:41:25.680
few times to just hand-calculate the

00:41:24.079 --> 00:41:27.359
number of parameters to make sure it

00:41:25.679 --> 00:41:30.000
verifies. So we should just make sure

00:41:27.358 --> 00:41:32.078
that it is in fact 497. So let's hand

00:41:30.000 --> 00:41:34.559
calculate it. It's

00:41:32.079 --> 00:41:37.839
basically what's going on here. 29

00:41:34.559 --> 00:41:40.239
inputs times 16, right? All the arrows, 29

00:41:37.838 --> 00:41:42.000
* 16 arrows, right? And then you have a

00:41:40.239 --> 00:41:43.759
bias of another 16. That's why you have

00:41:42.000 --> 00:41:46.318
this expression. And then the next one

00:41:43.760 --> 00:41:49.200
is 16 * 1 plus one bias for the output

00:41:46.318 --> 00:41:50.960
sigmoid and you get to 497. Okay?
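
(The model as described, with the optional layer names; the parameter arithmetic is in the comments.)

```python
model = keras.Sequential([
    keras.Input(shape=(29,)),                              # 29 input features
    keras.layers.Dense(16, activation="sigmoid", name="hidden"),
    keras.layers.Dense(1, activation="sigmoid", name="output"),
])
model.summary()
# hidden: 29*16 weights + 16 biases = 480
# output: 16*1 weights + 1 bias    =  17
# total                            = 497
```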

00:41:49.199 --> 00:41:53.039
Just make sure you follow this later on when

00:41:50.960 --> 00:41:55.358
you work with the Colab. We did this

00:41:53.039 --> 00:41:56.960
in class last week and you can visualize

00:41:55.358 --> 00:41:59.279
the network graphically as well by using

00:41:56.960 --> 00:42:02.240
the plot model function. So we do that

00:41:59.280 --> 00:42:03.760
here. Um and let's say it gives you the

00:42:02.239 --> 00:42:06.159
same information but in a slightly

00:42:03.760 --> 00:42:07.839
easier form to consume and when we work

00:42:06.159 --> 00:42:09.440
with larger networks starting on

00:42:07.838 --> 00:42:11.039
Wednesday you will see that being able

00:42:09.440 --> 00:42:13.838
to visualize the topology of the network

00:42:11.039 --> 00:42:16.239
is actually quite handy.
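
(The graphical view; plot_model needs the pydot and graphviz packages installed.)

```python
keras.utils.plot_model(model, show_shapes=True)
```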

00:42:13.838 --> 00:42:18.400
Okay, we finally come to uh actually trying to

00:42:16.239 --> 00:42:20.719
train this thing and so what loss

00:42:18.400 --> 00:42:23.358
function should we use? We

00:42:20.719 --> 00:42:26.159
need to use binary cross entropy right

00:42:23.358 --> 00:42:29.838
there. What optimizer to use? Well, as I

00:42:26.159 --> 00:42:32.480
mentioned earlier, uh we'll use Adam.

00:42:29.838 --> 00:42:35.679
Adam.

00:42:32.480 --> 00:42:37.920
All right, Adam. Uh and then

00:42:35.679 --> 00:42:39.598
the the final thing is you can ask Keras

00:42:37.920 --> 00:42:41.358
to report out whatever metrics you care

00:42:39.599 --> 00:42:42.960
about. These metrics are not going to be

00:42:41.358 --> 00:42:45.039
used in any optimization. It's

00:42:42.960 --> 00:42:46.800
just reporting it to you. And the most

00:42:45.039 --> 00:42:49.119
common thing people report out for

00:42:46.800 --> 00:42:51.440
binary classification is accuracy. So

00:42:49.119 --> 00:42:54.318
we'll just go with that metric. Um and

00:42:51.440 --> 00:42:56.880
so so what we do is we tell Keras take

00:42:54.318 --> 00:42:58.719
the model we just built and compile it

00:42:56.880 --> 00:43:00.000
with this choice of optimizer this

00:42:58.719 --> 00:43:02.159
choice of loss function and these

00:43:00.000 --> 00:43:04.480
metrics. And this compilation step what

00:43:02.159 --> 00:43:06.480
it does is essentially this: Keras will

00:43:04.480 --> 00:43:08.639
take this information and take the model

00:43:06.480 --> 00:43:11.599
you have built and it'll reorganize the

00:43:08.639 --> 00:43:13.920
model in such a way that it enables parallel

00:43:11.599 --> 00:43:16.000
computing uh distribution of computing

00:43:13.920 --> 00:43:17.519
across many servers and so on. That's

00:43:16.000 --> 00:43:20.159
that's what's happening in the compile

00:43:17.519 --> 00:43:21.838
step: reorganizing

00:43:20.159 --> 00:43:23.679
the model so that it becomes amenable

00:43:21.838 --> 00:43:25.039
to parallelization and distribution.
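
(The compile call being described.)

```python
model.compile(
    optimizer="adam",            # the optimizer chosen above
    loss="binary_crossentropy",  # the discrepancy measure for 0/1 targets
    metrics=["accuracy"],        # reported only, never optimized
)
```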

00:43:23.679 --> 00:43:26.159
That's what's going on. That's why you

00:43:25.039 --> 00:43:28.800
actually have to do something called the

00:43:26.159 --> 00:43:30.879
compile step. Okay. And once we do that,

00:43:28.800 --> 00:43:34.160
we are finally ready to train

00:43:30.880 --> 00:43:36.000
the model. And to do that uh we have to

00:43:34.159 --> 00:43:37.199
decide what the batch size is that we're

00:43:36.000 --> 00:43:38.880
going to use. Remember, we're using some

00:43:37.199 --> 00:43:40.559
flavor of SGD, which means we have to

00:43:38.880 --> 00:43:43.358
choose what the batch size is. And

00:43:40.559 --> 00:43:45.199
typically what people do is that uh 32

00:43:43.358 --> 00:43:46.480
is a good default for the batch size.

00:43:45.199 --> 00:43:47.519
Like if you're just

00:43:46.480 --> 00:43:49.519
getting started with something, just use

00:43:47.519 --> 00:43:51.519
32. Uh and there's a whole bunch of

00:43:49.519 --> 00:43:53.358
literature on what the right batch size

00:43:51.519 --> 00:43:55.119
should be for the number of data points

00:43:53.358 --> 00:43:56.960
you have, the size of the network and so

00:43:55.119 --> 00:43:59.760
on and so forth. My philosophy is start

00:43:56.960 --> 00:44:02.000
with 32. Um and you can always try 32,

00:43:59.760 --> 00:44:04.079
64, 128. It's kind of like, you know,

00:44:02.000 --> 00:44:05.760
oftentimes what people tell me,

00:44:04.079 --> 00:44:07.760
researchers tell me is that just use the

00:44:05.760 --> 00:44:09.920
biggest batch size that doesn't make

00:44:07.760 --> 00:44:11.359
your machine die.

00:44:09.920 --> 00:44:12.400
Right? If you can fit into memory, it's

00:44:11.358 --> 00:44:13.759
probably good. Just try the biggest

00:44:12.400 --> 00:44:15.039
size. We'll just start with 32. It's

00:44:13.760 --> 00:44:16.720
just a tiny problem. It's not a big

00:44:15.039 --> 00:44:19.199
deal. And then we also have to decide

00:44:16.719 --> 00:44:21.519
how many epochs through the data do we

00:44:19.199 --> 00:44:24.318
want to go through, right? How many

00:44:21.519 --> 00:44:26.480
epochs? And uh you know, usually 20 to

00:44:24.318 --> 00:44:28.239
30 epochs is a good starting point. Um

00:44:26.480 --> 00:44:29.679
and then because this is a tiny problem

00:44:28.239 --> 00:44:31.838
just for kicks, I decided to run it for

00:44:29.679 --> 00:44:33.759
300 epochs. Uh just to see if anything

00:44:31.838 --> 00:44:34.960
any overfitting is going to happen. Uh

00:44:33.760 --> 00:44:36.079
and then whether we want to use a

00:44:34.960 --> 00:44:38.639
validation set. Of course, we want to

00:44:36.079 --> 00:44:40.560
use a validation set. Uh right. So we

00:44:38.639 --> 00:44:42.239
will use 20% of the data points as a

00:44:40.559 --> 00:44:44.400
validation set so that we can look for

00:44:42.239 --> 00:44:46.399
overfitting underfitting.

00:44:44.400 --> 00:44:49.519
All right. So with these decisions made

00:44:46.400 --> 00:44:51.920
we finally uh we use the model.fit

00:44:49.519 --> 00:44:55.039
command. Model.fit is what actually

00:44:51.920 --> 00:44:58.000
trains the neural network. Okay. And you

00:44:55.039 --> 00:45:00.318
have to tell it what the x

00:44:58.000 --> 00:45:03.280
tensor is. You have to tell it what the

00:45:00.318 --> 00:45:05.199
dependent variable y tensor is. We need

00:45:03.280 --> 00:45:07.519
to tell it how many epochs to do this.

00:45:05.199 --> 00:45:09.519
What batch size to use. verbose equals

00:45:07.519 --> 00:45:11.199
1 just means, you know, put out a

00:45:09.519 --> 00:45:13.199
lot of descriptive output as you do this

00:45:11.199 --> 00:45:16.318
thing and then validation split means

00:45:13.199 --> 00:45:18.559
you know take 20% of the training data

00:45:16.318 --> 00:45:20.000
and set it aside as your validation data

00:45:18.559 --> 00:45:22.239
set. Don't use it for training because I

00:45:20.000 --> 00:45:24.239
want to measure overfitting using that.
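
(The fit call with those decisions filled in.)

```python
history = model.fit(
    train_x, train_y,
    epochs=300,            # deliberately long, to look for overfitting
    batch_size=32,
    verbose=1,             # print progress for every epoch
    validation_split=0.2,  # hold out 20% of the training data
)
```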

00:45:22.239 --> 00:45:26.318
So that's it. So you do that thing it

00:45:24.239 --> 00:45:28.159
it'll run for 300 epochs and this is the

00:45:26.318 --> 00:45:31.358
reason why you know I decided to just

00:45:28.159 --> 00:45:33.759
not actually run it in class. Um and so

00:45:31.358 --> 00:45:36.318
you keep on doing it gives you a lot of

00:45:33.760 --> 00:45:40.280
output and finally

00:45:36.318 --> 00:45:40.279
we reach the end.

00:45:41.760 --> 00:45:44.640
Okay. Now let's take a moment to

00:45:43.358 --> 00:45:46.559
understand what's being reported. So

00:45:44.639 --> 00:45:49.118
I'll just take this one line here. So

00:45:46.559 --> 00:45:51.279
there

00:45:49.119 --> 00:45:53.920
is a pair of lines for each epoch. And

00:45:51.280 --> 00:45:56.960
then here it's telling you uh you know

00:45:53.920 --> 00:46:01.280
in this 300th

00:45:56.960 --> 00:46:02.800
epoch it used seven batches seven out of

00:46:01.280 --> 00:46:05.040
seven batches right so it used seven

00:46:02.800 --> 00:46:06.960
batches and if you you will recall from

00:46:05.039 --> 00:46:08.318
the math we did in the class that it's

00:46:06.960 --> 00:46:10.559
actually seven batches where the first

00:46:08.318 --> 00:46:12.159
six batches are 32 and the last batch is

00:46:10.559 --> 00:46:15.440
just a couple of examples but we have

00:46:12.159 --> 00:46:19.039
seven batches, right? This is 193 divided by

00:46:15.440 --> 00:46:20.720
32, rounded up. Okay, so that's why we have

00:46:19.039 --> 00:46:22.800
seven here and then it tells you how

00:46:20.719 --> 00:46:24.239
long it took for that and then

00:46:22.800 --> 00:46:26.560
this is the loss value. This is the

00:46:24.239 --> 00:46:29.279
binary cross entropy loss value on the

00:46:26.559 --> 00:46:32.239
training set, on that particular

00:46:29.280 --> 00:46:33.599
batch, that it calculated. This

00:46:32.239 --> 00:46:36.799
is the accuracy that we asked it to

00:46:33.599 --> 00:46:39.838
report out: 98.5% accuracy on

00:46:36.800 --> 00:46:42.480
that batch. And then at the end of

00:46:39.838 --> 00:46:44.480
this epoch using whatever weights were

00:46:42.480 --> 00:46:46.639
available in that network it actually

00:46:44.480 --> 00:46:48.318
calculates the loss on the validation set,

00:46:46.639 --> 00:46:50.480
which is the 20% of the data we have set

00:46:48.318 --> 00:46:53.759
aside, and this is the accuracy

00:46:50.480 --> 00:46:55.920
on that validation set. Okay, so that's

00:46:53.760 --> 00:46:57.599
what each of these numbers means. Now,

00:46:55.920 --> 00:47:00.318
looking at this wall of numbers is kind

00:46:57.599 --> 00:47:02.480
of painful so usually you just plot it

00:47:00.318 --> 00:47:04.719
So the way you do that is, if you

00:47:02.480 --> 00:47:06.800
notice here. Uh, okay, I'm not

00:47:04.719 --> 00:47:08.639
going to go back here. So I said history

00:47:06.800 --> 00:47:10.560
equals model.fit blah blah blah blah

00:47:08.639 --> 00:47:12.000
blah. And that history object has a lot

00:47:10.559 --> 00:47:14.799
of information that we can use for

00:47:12.000 --> 00:47:18.480
plotting and diagnostics and so on. And

00:47:14.800 --> 00:47:19.760
that history object has

00:47:18.480 --> 00:47:21.358
an attribute called

00:47:19.760 --> 00:47:23.040
history.history, which is a dictionary

00:47:21.358 --> 00:47:24.318
with all these values and that's what

00:47:23.039 --> 00:47:25.599
we're going to plot.
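
(A sketch of the loss plot; history.history is keyed by "loss", "val_loss", "accuracy", and "val_accuracy".)

```python
epochs = range(1, len(history.history["loss"]) + 1)
plt.plot(epochs, history.history["loss"], label="training loss")
plt.plot(epochs, history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.legend()
plt.show()
```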

00:47:24.318 --> 00:47:28.639
Was there a question here? Yeah.

00:47:25.599 --> 00:47:30.960
>> Uh so you prompted it to keep a slice

00:47:28.639 --> 00:47:33.679
for validation but didn't we already

00:47:30.960 --> 00:47:34.960
keep a test set? So that's going to be a

00:47:33.679 --> 00:47:37.679
secondary validation, right?

00:47:34.960 --> 00:47:40.079
>> So basically we have a training set, uh, and

00:47:37.679 --> 00:47:42.000
then a validation and a test. The role

00:47:40.079 --> 00:47:43.680
of the validation set is to figure out

00:47:42.000 --> 00:47:45.519
things like early stopping. Should we

00:47:43.679 --> 00:47:46.719
stop here? Should we go back? And as you

00:47:45.519 --> 00:47:48.960
will see later on, if we use

00:47:46.719 --> 00:47:50.399
hyperparameters, you know, we'll try

00:47:48.960 --> 00:47:52.079
different values of the hyperparameters

00:47:50.400 --> 00:47:53.680
and figure out use the validation set to

00:47:52.079 --> 00:47:55.359
figure out which one is the best one.

00:47:53.679 --> 00:47:57.679
But once we are done with all that, we

00:47:55.358 --> 00:47:59.679
will finally have a model. At that

00:47:57.679 --> 00:48:02.399
point, we open the safe, take out the

00:47:59.679 --> 00:48:04.239
test set and use it just once with your

00:48:02.400 --> 00:48:05.519
final model. Not because you want

00:48:04.239 --> 00:48:07.519
to improve the model, but because you

00:48:05.519 --> 00:48:08.880
want to have a realistic idea how it'll

00:48:07.519 --> 00:48:11.679
do when you actually deploy it out in

00:48:08.880 --> 00:48:13.920
the real world.

00:48:11.679 --> 00:48:17.679
>> Uh yeah.

00:48:13.920 --> 00:48:20.000
>> Uh, can we, instead of accuracy,

00:48:17.679 --> 00:48:21.199
could we use other metrics uh to

00:48:20.000 --> 00:48:23.920
evaluate whether to

00:48:21.199 --> 00:48:24.318
>> Absolutely. >> Like a confusion matrix, let's

00:48:23.920 --> 00:48:25.680
say?

00:48:24.318 --> 00:48:27.519
>> Yeah, you can do whatever you

00:48:25.679 --> 00:48:29.118
want. Like I said, it's not

00:48:27.519 --> 00:48:31.280
used for training so there is no

00:48:29.119 --> 00:48:32.720
mathematical implication of what you choose,

00:48:31.280 --> 00:48:35.040
right? You can choose error rates,

00:48:32.719 --> 00:48:37.118
accuracy, F1, F-beta, you can do whatever

00:48:35.039 --> 00:48:39.440
you want. And Keras, as you will see, has

00:48:37.119 --> 00:48:41.760
this dizzying list of possible metrics

00:48:39.440 --> 00:48:43.280
you can use for reporting. The key thing

00:48:41.760 --> 00:48:44.800
to remember is you're just reporting

00:48:43.280 --> 00:48:47.440
these metrics you're not actually using

00:48:44.800 --> 00:48:49.039
them for any training

00:48:47.440 --> 00:48:50.559
yeah

00:48:49.039 --> 00:48:52.800
>> uh my question is with respect to

00:48:50.559 --> 00:48:55.760
validation like uh we've got a training

00:48:52.800 --> 00:48:58.720
data set, so when we take out 20%, this is

00:48:55.760 --> 00:49:00.559
the validation uh data for validation.

00:48:58.719 --> 00:49:02.719
Are we taking out from the training set

00:49:00.559 --> 00:49:04.640
at that level, or do we

00:49:02.719 --> 00:49:04.879
go to each batch and take out 20% from

00:49:04.639 --> 00:49:05.759
the train?

00:49:04.880 --> 00:49:06.400
>> No, we're taking it out from the

00:49:05.760 --> 00:49:08.400
training set.

00:49:06.400 --> 00:49:09.920
>> So it means the number of

00:49:08.400 --> 00:49:11.599
data points available

00:49:09.920 --> 00:49:12.079
for forming the batches will

00:49:11.599 --> 00:49:13.599
reduce.

00:49:12.079 --> 00:49:15.200
>> Correct. And in [snorts] fact once we

00:49:13.599 --> 00:49:17.119
take out the validation set,

00:49:15.199 --> 00:49:18.558
whatever remains is 193.

00:49:17.119 --> 00:49:21.519
>> Okay. And then we divide that into

00:49:18.559 --> 00:49:23.440
batches, and then each epoch the

00:49:21.519 --> 00:49:25.519
validation data gets added differently?

00:49:23.440 --> 00:49:27.519
>> Now, once you take out the

00:49:25.519 --> 00:49:30.960
validation set at the very beginning you

00:49:27.519 --> 00:49:33.440
keep it aside and then you only evaluate

00:49:30.960 --> 00:49:36.000
at the end of each epoch what your loss

00:49:33.440 --> 00:49:37.838
and accuracy is on that validation set.

00:49:36.000 --> 00:49:39.358
>> So you don't have cross validation.

00:49:37.838 --> 00:49:40.558
>> No no we're not doing any of that stuff.

00:49:39.358 --> 00:49:43.519
We're just taking it out once and we're

00:49:40.559 --> 00:49:46.240
just evaluating the end of every epoch.

00:49:43.519 --> 00:49:50.559
>> Okay. So

00:49:46.239 --> 00:49:53.679
yeah. Okay. So I know we both asked

00:49:50.559 --> 00:49:54.960
similar questions but

00:49:53.679 --> 00:49:56.960
>> So I know we both have asked similar

00:49:54.960 --> 00:49:59.440
questions but just to reconfirm. So here

00:49:56.960 --> 00:50:01.760
my training model is giving me say a

00:49:59.440 --> 00:50:04.800
loss of 0.086.

00:50:01.760 --> 00:50:07.680
My validation is giving me 0.66.

00:50:04.800 --> 00:50:11.519
That means I've already crossed the bottom of the U.

00:50:07.679 --> 00:50:13.358
So when I have to actually test the

00:50:11.519 --> 00:50:14.800
model, that is the point which I take,

00:50:13.358 --> 00:50:16.880
and that is the model which will get

00:50:14.800 --> 00:50:19.200
deployed in production.

00:50:16.880 --> 00:50:20.559
>> Correct. And as to, okay, what do we do

00:50:19.199 --> 00:50:22.318
to get that model? Do we actually have

00:50:20.559 --> 00:50:24.720
to go back to the beginning and run

00:50:22.318 --> 00:50:25.920
it for a few epochs or can we do

00:50:24.719 --> 00:50:26.959
something smarter than that? We'll get

00:50:25.920 --> 00:50:27.838
to that.

00:50:26.960 --> 00:50:30.159
>> Yeah.

00:50:27.838 --> 00:50:31.838
>> Is the validation set different for each

00:50:30.159 --> 00:50:33.759
epoch or is it the same?

00:50:31.838 --> 00:50:35.759
>> It's the same. So what you do is you

00:50:33.760 --> 00:50:37.359
have a training set before you do any

00:50:35.760 --> 00:50:39.680
training. You take out 20% of it, keep

00:50:37.358 --> 00:50:41.838
it aside. You take whatever is left over

00:50:39.679 --> 00:50:43.279
that you divide that into mini batches

00:50:41.838 --> 00:50:45.838
and then start running it through each

00:50:43.280 --> 00:50:47.519
epoch. But at the end of each epoch, you

00:50:45.838 --> 00:50:49.119
just evaluate the quality of that

00:50:47.519 --> 00:50:49.920
resulting model using the validation

00:50:49.119 --> 00:50:51.920
set.

00:50:49.920 --> 00:50:52.800
>> What's different between each epoch? Is

00:50:51.920 --> 00:50:53.519
it just the way

00:50:52.800 --> 00:50:55.760
>> weights have changed?

00:50:53.519 --> 00:50:56.960
>> Is it the division into the

00:50:55.760 --> 00:51:00.480
different batches?

00:50:56.960 --> 00:51:02.159
>> Uh no, the difference in each epoch

00:51:00.480 --> 00:51:03.920
is that the weights have changed.

00:51:02.159 --> 00:51:05.440
>> So after every mini batch, the weights

00:51:03.920 --> 00:51:07.200
have changed. At the end of one epoch,

00:51:05.440 --> 00:51:09.200
you've gone through all the data points

00:51:07.199 --> 00:51:10.639
you ever had, right, in the training

00:51:09.199 --> 00:51:14.558
set. And then you come back to the

00:51:10.639 --> 00:51:14.558
beginning and you do it again.

00:51:17.760 --> 00:51:22.480
How do you identify the sweet spot?

00:51:20.800 --> 00:51:24.160
>> It's coming.

00:51:22.480 --> 00:51:27.280
>> Yeah. All right. So, I'm going to keep

00:51:24.159 --> 00:51:28.960
going. So, we have this here. And so,

00:51:27.280 --> 00:51:31.280
you just I mean there's a little bit of

00:51:28.960 --> 00:51:33.440
matplotlib code. So, what we do is we

00:51:31.280 --> 00:51:35.280
just plot the training loss and the

00:51:33.440 --> 00:51:37.760
validation loss as a function of the

00:51:35.280 --> 00:51:39.920
number of epochs. Okay? And as you can

00:51:37.760 --> 00:51:41.920
see here, the training loss is these

00:51:39.920 --> 00:51:45.280
things here. And it's steadily going

00:51:41.920 --> 00:51:47.519
down as you would expect. The validation

00:51:45.280 --> 00:51:49.599
loss goes down here. And then at some

00:51:47.519 --> 00:51:53.358
point it kind of flattens out and then

00:51:49.599 --> 00:51:55.920
maybe gently starts to rise. Okay. So do

00:51:53.358 --> 00:51:57.279
you think there's overfitting?

00:51:55.920 --> 00:51:59.200
>> Right. There seems to be some level of

00:51:57.280 --> 00:52:01.839
overfitting here. But the thing you have

00:51:59.199 --> 00:52:04.799
to always remember is that the binary

00:52:01.838 --> 00:52:06.639
cross entropy loss is a loss function

00:52:04.800 --> 00:52:08.160
that is convenient for you because it

00:52:06.639 --> 00:52:10.879
sort of captures the thing you want to

00:52:08.159 --> 00:52:13.920
capture the discrepancy but also because

00:52:10.880 --> 00:52:15.599
it's mathematically convenient but what

00:52:13.920 --> 00:52:18.400
you may actually care about in practice

00:52:15.599 --> 00:52:19.760
is something like accuracy right so I

00:52:18.400 --> 00:52:21.440
always that's why you're reporting out

00:52:19.760 --> 00:52:23.200
the accuracy when we do these things so

00:52:21.440 --> 00:52:25.358
you should also plot the accuracy to see

00:52:23.199 --> 00:52:26.799
what's going on and really you should

00:52:25.358 --> 00:52:28.239
look at the accuracy and figure out

00:52:26.800 --> 00:52:30.720
overfitting and underfitting and all

00:52:28.239 --> 00:52:34.000
stuff. So let's just do that. So I have

00:52:30.719 --> 00:52:35.519
it here.
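
(The matching accuracy plot, using the same history object as before.)

```python
plt.plot(epochs, history.history["accuracy"], label="training accuracy")
plt.plot(epochs, history.history["val_accuracy"], label="validation accuracy")
plt.xlabel("epoch")
plt.legend()
plt.show()
```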

00:52:34.000 --> 00:52:37.280
Uh okay. So this is how it looks like

00:52:35.519 --> 00:52:38.639
for accuracy. Accuracy of course as the

00:52:37.280 --> 00:52:40.079
model gets you know as you do more and

00:52:38.639 --> 00:52:42.078
more epochs hopefully it gets better and

00:52:40.079 --> 00:52:44.480
better for training. So you can see here

00:52:42.079 --> 00:52:47.440
accuracy actually climbs all the way up

00:52:44.480 --> 00:52:50.079
to the mid-90s right there, well, the

00:52:47.440 --> 00:52:52.639
low 90s here. The validation gets to

00:52:50.079 --> 00:52:54.400
this point after like I don't know 50

00:52:52.639 --> 00:52:56.719
epochs maybe and then it kind of

00:52:54.400 --> 00:53:00.880
flattens out and then strangely it

00:52:56.719 --> 00:53:03.759
climbs up again a bit later right so now

00:53:00.880 --> 00:53:06.800
the fact that the accuracy actually got

00:53:03.760 --> 00:53:09.920
better at the very end suggests that

00:53:06.800 --> 00:53:10.480
maybe we can live with this overfitting

00:53:09.920 --> 00:53:12.000
>> okay

00:53:10.480 --> 00:53:14.559
>> right it's not the end of the world

00:53:12.000 --> 00:53:16.719
right so you can certainly

00:53:14.559 --> 00:53:17.920
what you can do is you can go back and

00:53:16.719 --> 00:53:20.558
say you know what no I'm going to be a

00:53:17.920 --> 00:53:22.240
purist about this around 50 epochs or

00:53:20.559 --> 00:53:24.079
so. I think that's when it actually

00:53:22.239 --> 00:53:26.078
flattened out for loss. So you can just

00:53:24.079 --> 00:53:29.039
go back and just restart the model and

00:53:26.079 --> 00:53:30.318
run it only for 50 epochs, not 300 and

00:53:29.039 --> 00:53:31.920
then stop and just use that model for

00:53:30.318 --> 00:53:33.358
everything from that point on. Or you

00:53:31.920 --> 00:53:35.358
can say, you know what, it's okay. I can

00:53:33.358 --> 00:53:36.558
live with this thing. Uh and so that's

00:53:35.358 --> 00:53:39.838
what we're going to do here. Let me just

00:53:36.559 --> 00:53:40.319
stop for a second. There was a question.

00:53:39.838 --> 00:53:42.000
>> Yeah,

00:53:40.318 --> 00:53:44.000
>> for originally when we were starting

00:53:42.000 --> 00:53:46.880
out, we were saying 20 to 30 epochs, but

00:53:44.000 --> 00:53:49.039
we were going to do 300. 50 is over 20

00:53:46.880 --> 00:53:51.280
to 30. So when it comes to validation of

00:53:49.039 --> 00:53:52.639
if you run enough epochs, are you doing

00:53:51.280 --> 00:53:54.480
like derivative calculations?

00:53:52.639 --> 00:53:56.639
>> Oh, I see. No, that's a great question.

00:53:54.480 --> 00:53:58.240
So the question is I said start with 20

00:53:56.639 --> 00:54:00.000
to 30 epochs as a rule of thumb; here,

00:53:58.239 --> 00:54:01.598
I'm just going with 300. And because I'm

00:54:00.000 --> 00:54:03.199
going with 300, I can actually see some

00:54:01.599 --> 00:54:05.119
potential evidence of overfitting. But

00:54:03.199 --> 00:54:06.239
if I had done only 20 to 30, maybe I

00:54:05.119 --> 00:54:07.280
wouldn't have even seen that. What

00:54:06.239 --> 00:54:09.279
happens next? Right? Is that the

00:54:07.280 --> 00:54:10.559
question? Great question. So what you

00:54:09.280 --> 00:54:13.519
should do is when you look at these

00:54:10.559 --> 00:54:15.680
curves if at the end of 30 epochs you

00:54:13.519 --> 00:54:18.159
find that the validation loss continues

00:54:15.679 --> 00:54:20.078
to drop then you know maybe there is

00:54:18.159 --> 00:54:21.759
more room for it to drop. So you

00:54:20.079 --> 00:54:24.000
continue from that point on. The thing

00:54:21.760 --> 00:54:27.119
about keras is that you can actually run

00:54:24.000 --> 00:54:29.199
the fit command at that point

00:54:27.119 --> 00:54:31.680
and it'll continue where it left off. It

00:54:29.199 --> 00:54:33.598
won't go to the beginning again.

00:54:31.679 --> 00:54:34.799
Right? So you can run 10. Okay. The

00:54:33.599 --> 00:54:36.640
validation is still getting better and

00:54:34.800 --> 00:54:38.240
better. Okay. Run for another 10. It's

00:54:36.639 --> 00:54:39.440
getting better and better. Run for

00:54:38.239 --> 00:54:40.639
another 10. Getting better and better.

00:54:39.440 --> 00:54:41.760
Run for another 10. Oh, it starts to

00:54:40.639 --> 00:54:44.799
climb up again. Okay, now I'm going to

00:54:41.760 --> 00:54:47.119
back off. That's what you do.
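
(That incremental workflow in code: calling fit again continues from the current weights rather than restarting.)

```python
# Run 10 more epochs from wherever training left off.
history2 = model.fit(train_x, train_y, epochs=10, batch_size=32,
                     verbose=1, validation_split=0.2)
```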

00:54:44.800 --> 00:54:48.800
All right. Now, all this manual stuff

00:54:47.119 --> 00:54:50.800
I'm going through it just because to

00:54:48.800 --> 00:54:52.559
build intuition, there are these things

00:54:50.800 --> 00:54:54.640
called callbacks in Keras, which we'll get

00:54:52.559 --> 00:54:57.040
to later on in which you can actually

00:54:54.639 --> 00:54:59.679
tell it, hey, when the validation loss,

00:54:57.039 --> 00:55:02.000
you know, uh, stops improving, stop

00:54:59.679 --> 00:55:04.558
everything or when it stops improving,

00:55:02.000 --> 00:55:05.920
save that model for me somewhere. So,

00:55:04.559 --> 00:55:07.280
they don't have to go back and rerun

00:55:05.920 --> 00:55:08.480
everything. It'll have saved

00:55:07.280 --> 00:55:12.240
it for you and you can just pick it up

00:55:08.480 --> 00:55:15.358
and use it. Uh yeah.
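
(A sketch of the callbacks being described; the checkpoint file name is arbitrary.)

```python
early = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True)
ckpt = keras.callbacks.ModelCheckpoint(
    "best_model.keras", monitor="val_loss", save_best_only=True)
history = model.fit(train_x, train_y, epochs=300, batch_size=32,
                    validation_split=0.2, callbacks=[early, ckpt])
```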

00:55:12.239 --> 00:55:17.838
>> What's the intuition behind um the

00:55:15.358 --> 00:55:19.358
accuracy continuing to improve when the

00:55:17.838 --> 00:55:21.440
loss is getting higher?

00:55:19.358 --> 00:55:23.759
>> Because accuracy and loss are related

00:55:21.440 --> 00:55:25.760
but they're not the same thing. Uh in

00:55:23.760 --> 00:55:27.520
particular, so it's a really good

00:55:25.760 --> 00:55:29.359
question also kind of a profound

00:55:27.519 --> 00:55:30.880
question because accuracy is a very

00:55:29.358 --> 00:55:32.078
discrete measure, right? So if a

00:55:30.880 --> 00:55:34.880
particular point we predict its

00:55:32.079 --> 00:55:37.599
probability to be say 0.49 we're going to

00:55:34.880 --> 00:55:39.599
say okay that's a zero no heart disease

00:55:37.599 --> 00:55:41.599
but if it goes to 0.51 we're going to be

00:55:39.599 --> 00:55:44.559
oh that's heart disease. So when you go

00:55:41.599 --> 00:55:46.079
from 0.49 to 0.51 the binary cross

00:55:44.559 --> 00:55:48.640
entropy loss will change very very

00:55:46.079 --> 00:55:51.359
slightly but the accuracy will go from 0

00:55:48.639 --> 00:55:53.358
to 1, a dramatic jump. So it's very jumpy

00:55:51.358 --> 00:55:56.000
and discrete and that's why it tends to

00:55:53.358 --> 00:55:58.639
be a proxy but sort of a crude proxy for

00:55:56.000 --> 00:56:01.440
loss. That's part of the reason and I

00:55:58.639 --> 00:56:04.558
can talk more offline.

00:56:01.440 --> 00:56:06.480
Okay. So yeah,

00:56:04.559 --> 00:56:09.839
>> you mentioned that if you are a purist,

00:56:06.480 --> 00:56:12.159
you could stop at 50. In this case, I

00:56:09.838 --> 00:56:13.759
would go back and run it and stop it there. I

00:56:12.159 --> 00:56:15.679
was wondering if you could see the

00:56:13.760 --> 00:56:18.079
history of the model, take the weights at

00:56:15.679 --> 00:56:21.358
epoch 50, input them into your model, and would it

00:56:18.079 --> 00:56:22.400
be roughly the same, or would there be

00:56:21.358 --> 00:56:24.318
certain differences?

00:56:22.400 --> 00:56:25.920
>> You could try it. Yeah, you should just

00:56:24.318 --> 00:56:27.599
try it because what happens is that

00:56:25.920 --> 00:56:29.440
ultimately what we care about is how it

00:56:27.599 --> 00:56:30.960
performs on the validation set. Right.

00:56:29.440 --> 00:56:33.200
Here it appears to perform better on the

00:56:30.960 --> 00:56:34.880
validation set, right? If you stop at 50,

00:56:33.199 --> 00:56:36.078
but only for the loss; for accuracy,

00:56:34.880 --> 00:56:40.079
actually if you wait till the very end

00:56:36.079 --> 00:56:41.760
it gets better. So my thrust tends to be

00:56:40.079 --> 00:56:44.079
what is the measure that's closest to

00:56:41.760 --> 00:56:45.599
the real world deployment.

00:56:44.079 --> 00:56:48.599
It's accuracy. So I tend to go with

00:56:45.599 --> 00:56:48.599
accuracy.

00:56:48.639 --> 00:56:53.519
Binary cross entropy is a beautiful

00:56:50.639 --> 00:56:54.960
proxy but an imperfect proxy for the

00:56:53.519 --> 00:56:57.440
thing we actually care about in the real

00:56:54.960 --> 00:56:59.519
world which is error rate and accuracy.

00:56:57.440 --> 00:57:00.960
That's why I tend to plot both and if

00:56:59.519 --> 00:57:03.119
accuracy is telling me one thing I kind

00:57:00.960 --> 00:57:07.920
of tend to believe it.

00:57:03.119 --> 00:57:09.680
All right, so here's what we have.

00:57:07.920 --> 00:57:11.519
so once we do all this we have a model

00:57:09.679 --> 00:57:13.039
and now we need to evaluate it to see,

00:57:11.519 --> 00:57:14.559
okay, if you actually deployed it, how good

00:57:13.039 --> 00:57:17.039
is it going to be. So you use this thing

00:57:14.559 --> 00:57:19.040
called the model.evaluate function. So

00:57:17.039 --> 00:57:21.358
you take the model.evaluate function; now we

00:57:19.039 --> 00:57:23.279
use the test x and the

00:57:21.358 --> 00:57:24.719
test y data set which we split at the

00:57:23.280 --> 00:57:27.040
very beginning and never used from

00:57:24.719 --> 00:57:29.679
that point on. Uh, we run it, and when I

00:57:27.039 --> 00:57:33.039
ran it uh last night, it came up with a

00:57:29.679 --> 00:57:35.118
83.6% accuracy for the model.
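
(The evaluate call being described; it returns the loss plus each requested metric.)

```python
loss, acc = model.evaluate(test_x, test_y, verbose=0)
print(f"test accuracy: {acc:.1%}")  # ~83.6% here vs. the 72.6% baseline
```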

00:57:33.039 --> 00:57:36.798
And remember, our baseline model which just

00:57:35.119 --> 00:57:39.358
predicts everybody is a zero is going to

00:57:36.798 --> 00:57:41.599
have a 72.6% accuracy. And this little

00:57:39.358 --> 00:57:45.598
neural network gives you 83.6, which

00:57:41.599 --> 00:57:47.280
is pretty good right so it's actually uh

00:57:45.599 --> 00:57:49.519
it's beating the baseline

00:57:47.280 --> 00:57:50.720
model which is nice. Uh and I guess

00:57:49.519 --> 00:57:52.159
there is something here about you know

00:57:50.719 --> 00:57:53.919
the fact that we did a bunch of

00:57:52.159 --> 00:57:55.440
pre-processing outside Keras and then we

00:57:53.920 --> 00:57:57.119
send stuff into Keras. You can actually

00:57:55.440 --> 00:57:58.639
do all this pre-processing inside Keras

00:57:57.119 --> 00:58:00.160
automatically and there are layers for

00:57:58.639 --> 00:58:02.239
that and I have linked to a bunch of

00:58:00.159 --> 00:58:03.679
stuff here. So that's it as far as this

00:58:02.239 --> 00:58:05.358
model is concerned. I know we went

00:58:03.679 --> 00:58:07.199
through it really fast but please go

00:58:05.358 --> 00:58:09.039
through it afterwards and make sure you

00:58:07.199 --> 00:58:11.039
understand every single line. Change

00:58:09.039 --> 00:58:12.239
each of these lines, rerun it, see how

00:58:11.039 --> 00:58:15.279
the output changes. That's how we build

00:58:12.239 --> 00:58:17.919
some intuition. Okay. All right.

00:58:15.280 --> 00:58:20.079
computer vision

00:58:17.920 --> 00:58:22.639
>> as I do

00:58:20.079 --> 00:58:24.720
>> Just one question: is there a way

00:58:22.639 --> 00:58:27.118
to build a model just to have less false

00:58:24.719 --> 00:58:27.679
positive or less false negative or you

00:58:27.119 --> 00:58:29.119
don't know that

00:58:27.679 --> 00:58:31.679
>> oh yeah yeah you can do that um but

00:58:29.119 --> 00:58:33.599
there are so you can report on all those

00:58:31.679 --> 00:58:35.759
things very easily but there are more

00:58:33.599 --> 00:58:38.400
complex loss functions which will take

00:58:35.760 --> 00:58:40.960
the asymmetry between the false

00:58:38.400 --> 00:58:43.440
positive/false negative into account,

00:58:40.960 --> 00:58:45.199
you know. Yeah, so the short answer is it's possible.

00:58:43.440 --> 00:58:46.880
yeah

00:58:45.199 --> 00:58:48.318
All right. So, first let's just talk

00:58:46.880 --> 00:58:52.240
about how do you represent an image

00:58:48.318 --> 00:58:54.159
digitally. Okay. Uh and so this is how

00:58:52.239 --> 00:58:55.759
grayscale images are represented:

00:58:54.159 --> 00:58:57.598
black and white images. So the

00:58:55.760 --> 00:58:59.520
basic idea is very simple. Every picture

00:58:57.599 --> 00:59:01.680
you have, every location in

00:58:59.519 --> 00:59:03.599
that picture is a pixel and the pixel

00:59:01.679 --> 00:59:06.159
basically has a light intensity:

00:59:03.599 --> 00:59:09.119
the amount of light at that location. And

00:59:06.159 --> 00:59:12.078
that light level is measured from zero

00:59:09.119 --> 00:59:16.000
no light to blinding white light which

00:59:12.079 --> 00:59:18.559
is 255. And so all the numbers here, if

00:59:16.000 --> 00:59:20.798
you take this five for example, you can

00:59:18.559 --> 00:59:23.599
see a lot of no light like all the black

00:59:20.798 --> 00:59:24.960
regions, those are all zeros. Okay? And

00:59:23.599 --> 00:59:27.119
then wherever there is white light,

00:59:24.960 --> 00:59:29.519
there's a number, and the more the amount of

00:59:27.119 --> 00:59:30.720
light, the closer it gets to 255. Okay?

00:59:29.519 --> 00:59:32.079
In fact, if you just step back and

00:59:30.719 --> 00:59:33.679
squint at this, you can actually see the

00:59:32.079 --> 00:59:35.680
five.

00:59:33.679 --> 00:59:37.440
Okay? So that's it. That's

00:59:35.679 --> 00:59:42.239
how a black and white image is represented.

00:59:37.440 --> 00:59:43.838
Very simple. Okay. Now, yeah.

00:59:42.239 --> 00:59:45.838
microphone

00:59:43.838 --> 00:59:47.679
>> just when you say amount of light what's

00:59:45.838 --> 00:59:48.239
the unit that's being measured like what

00:59:47.679 --> 00:59:51.039
do you mean?

00:59:48.239 --> 00:59:54.639
>> so here basically what we have is uh the

00:59:51.039 --> 00:59:56.318
so when you

00:59:54.639 --> 00:59:58.239
take an analog

00:59:56.318 --> 00:59:59.440
picture, there's a process by

00:59:58.239 --> 01:00:02.000
which you take that analog picture and

00:59:59.440 --> 01:00:04.559
read it in and it gets mapped to a scale

01:00:02.000 --> 01:00:05.599
between 0 and 255. That's it, that's all.

01:00:04.559 --> 01:00:07.119
so you can think of it as like a

01:00:05.599 --> 01:00:10.559
relative scale a normalized scale

01:00:07.119 --> 01:00:12.240
between 0 and 255 and so um it just

01:00:10.559 --> 01:00:14.720
roughly maps to amount of light in that

01:00:12.239 --> 01:00:16.318
location. The exact lumens-to-

01:00:14.719 --> 01:00:18.159
number mapping, I don't know how they do

01:00:16.318 --> 01:00:20.798
it. My guess is there are a number of

01:00:18.159 --> 01:00:22.318
variations on that, but for our

01:00:20.798 --> 01:00:24.079
purposes just think of it as a

01:00:22.318 --> 01:00:26.318
normalized scale which runs from 0 to

01:00:24.079 --> 01:00:28.880
255

01:00:26.318 --> 01:00:30.798
All right. So

01:00:28.880 --> 01:00:34.318
that's what's happening: every pixel is a

01:00:30.798 --> 01:00:37.119
number between 0 and 255, boom, boom. Okay, so

01:00:34.318 --> 01:00:38.880
if you have a color image each pixel of

01:00:37.119 --> 01:00:42.400
a colored image is represented by three

01:00:38.880 --> 01:00:44.480
numbers uh And these numbers measure the

01:00:42.400 --> 01:00:46.480
intensity of red light, blue light and

01:00:44.480 --> 01:00:47.599
green light because red, blue and green

01:00:46.480 --> 01:00:50.480
if you mix them in the right proportion

01:00:47.599 --> 01:00:52.559
you can get whatever you want. Okay. So

01:00:50.480 --> 01:00:54.719
and so each light intensity is still a

01:00:52.559 --> 01:00:56.480
number between 0 and 255, and that's what

01:00:54.719 --> 01:00:58.078
you have. Which means that now you have

01:00:56.480 --> 01:01:00.079
three tables of numbers instead of one

01:00:58.079 --> 01:01:02.240
table of numbers. And by the way just

01:01:00.079 --> 01:01:05.440
some lingo here uh in the deep learning

01:01:02.239 --> 01:01:06.959
world, these colors RGB: red, green,

01:01:05.440 --> 01:01:10.318
and blue, are sometimes referred to as

01:01:06.960 --> 01:01:11.358
channels. Okay. All right. So this is

01:01:10.318 --> 01:01:13.599
what we have here. This is a picture of

01:01:11.358 --> 01:01:16.159
Kian Cord U and then if you take that

01:01:13.599 --> 01:01:18.960
little thing here red the red table the

01:01:16.159 --> 01:01:21.039
green table and the blue table. So for

01:01:18.960 --> 01:01:23.760
this picture, these three tables are a

01:01:21.039 --> 01:01:26.159
tensor of rank what?

01:01:23.760 --> 01:01:30.520
Good.

01:01:26.159 --> 01:01:30.519
All right. Any questions on this?
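A quick sketch of these representations in NumPy (made-up pixel values, not the lecture's own code):

    import numpy as np

    # Grayscale: one table of numbers, each entry a light level
    # from 0 (no light) to 255 (blinding white).
    gray = np.zeros((28, 28), dtype=np.uint8)    # rank-2 tensor: a table
    gray[10, 14] = 255                           # one bright pixel

    # Color: three such tables (the red, green, and blue channels).
    rgb = np.zeros((28, 28, 3), dtype=np.uint8)  # rank-3 tensor
    rgb[10, 14] = [255, 0, 0]                    # a pure red pixel

    print(gray.ndim, rgb.ndim)  # prints: 2 3, so the three tables form a rank-three tensor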

01:01:33.920 --> 01:01:37.599
So the key task in computer vision

01:01:35.838 --> 01:01:40.239
obviously the important thing is

01:01:37.599 --> 01:01:42.160
image classification right uh the most

01:01:40.239 --> 01:01:43.679
basic task if you will uh when you're

01:01:42.159 --> 01:01:45.358
working with images is: you have an

01:01:43.679 --> 01:01:46.719
image and you want to take the image and

01:01:45.358 --> 01:01:48.078
figure out, okay, you

01:01:46.719 --> 01:01:49.519
have a list of possible objects the

01:01:48.079 --> 01:01:51.039
image could contain and you're figuring

01:01:49.519 --> 01:01:53.280
out which of these possible objects

01:01:51.039 --> 01:01:54.960
exists in that image. Right? The dog-cat

01:01:53.280 --> 01:01:57.760
classification is the canonical

01:01:54.960 --> 01:01:59.599
example right that we all know and love

01:01:57.760 --> 01:02:01.280
uh and that's what we will solve uh

01:01:59.599 --> 01:02:02.720
later today and on Wednesday but there

01:02:01.280 --> 01:02:05.680
are many other tasks that you need to

01:02:02.719 --> 01:02:07.358
be aware of. So sometimes you not

01:02:05.679 --> 01:02:10.318
just classify an image, but you also

01:02:07.358 --> 01:02:11.519
localize where in the image it is,

01:02:10.318 --> 01:02:13.039
right? It's not just enough to say

01:02:11.519 --> 01:02:14.639
sheep, you want to figure out where is

01:02:13.039 --> 01:02:16.159
the sheep, right? And that's called

01:02:14.639 --> 01:02:18.239
localization. And the way you do

01:02:16.159 --> 01:02:21.118
localization is you put this little box

01:02:18.239 --> 01:02:23.358
around it. And then you output not just

01:02:21.119 --> 01:02:26.000
whether it's a, you know, sheep, yes or

01:02:23.358 --> 01:02:28.159
no, but the coordinates of this box, the

01:02:26.000 --> 01:02:29.760
top left, uh, and the bottom right, for

01:02:28.159 --> 01:02:31.598
example, if you put the coordinates, you

01:02:29.760 --> 01:02:33.599
can actually draw a box around it. So

01:02:31.599 --> 01:02:36.079
you you output the numbers the

01:02:33.599 --> 01:02:39.760
coordinates of where this box is in the

01:02:36.079 --> 01:02:42.720
picture. Okay, this is called localization.
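As a rough sketch of how such a localization output might be wired up in Keras (the feature size, layer names, and sigmoid head here are illustrative assumptions, not the lecture's code):

    from tensorflow import keras
    from tensorflow.keras import layers

    # Hypothetical localization head: one output says "sheep, yes or no",
    # the other emits the four box coordinates (top-left and bottom-right).
    features = keras.Input(shape=(128,))  # assume a feature vector from earlier layers
    is_sheep = layers.Dense(1, activation="sigmoid", name="what")(features)
    box = layers.Dense(4, activation="linear", name="where")(features)  # x1, y1, x2, y2
    model = keras.Model(features, [is_sheep, box])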

01:02:39.760 --> 01:02:45.040
Now this is object detection where you

01:02:42.719 --> 01:02:47.039
may have lots of objects going on and

01:02:45.039 --> 01:02:49.759
you want to pick up every one of them

01:02:47.039 --> 01:02:51.679
and you want to localize it.

01:02:49.760 --> 01:02:53.359
Okay, this is object detection. So here

01:02:51.679 --> 01:02:55.679
we have gone in there and said okay

01:02:53.358 --> 01:02:57.519
sheep one, sheep two, sheep three and

01:02:55.679 --> 01:02:59.598
each of these sheep has a little box

01:02:57.519 --> 01:03:01.440
around it. Okay.

01:02:59.599 --> 01:03:04.000
>> By the way, u you know, self-driving

01:03:01.440 --> 01:03:05.358
cars, the the camera vision system is

01:03:04.000 --> 01:03:06.960
constantly scanning what's coming in

01:03:05.358 --> 01:03:08.400
through the cameras and doing object

01:03:06.960 --> 01:03:09.039
detection constantly, many times a

01:03:08.400 --> 01:03:09.680
second,

01:03:09.039 --> 01:03:11.599
>> right?

01:03:09.679 --> 01:03:13.838
>> Pedestrian box, you know, zebra crossing

01:03:11.599 --> 01:03:16.240
box, doggy box, stroller box, and so on

01:03:13.838 --> 01:03:17.358
and so forth.

01:03:16.239 --> 01:03:20.479
And then we have this thing called

01:03:17.358 --> 01:03:22.960
semantic segmentation where we take

01:03:20.480 --> 01:03:24.880
every pixel in the picture and classify

01:03:22.960 --> 01:03:26.159
every pixel. We are not classifying the

01:03:24.880 --> 01:03:28.880
whole picture, we're classifying every

01:03:26.159 --> 01:03:32.318
pixel. So we are saying okay all these

01:03:28.880 --> 01:03:34.798
gray pixels road all these pixels are

01:03:32.318 --> 01:03:37.838
sheep and all these pixels are grass

01:03:34.798 --> 01:03:39.838
every pixel is being classified.

01:03:37.838 --> 01:03:42.159
So we are taking an image, and instead of

01:03:39.838 --> 01:03:43.920
giving one classification for the whole image,

01:03:42.159 --> 01:03:47.558
we are solving a multiclass

01:03:43.920 --> 01:03:47.559
classification problem for every pixel.

01:03:48.318 --> 01:03:51.199
Okay, every pixel is classified. And

01:03:49.920 --> 01:03:53.280
just when you think it can't get more

01:03:51.199 --> 01:03:54.480
complicated than this,

01:03:53.280 --> 01:03:56.880
we have something called instance

01:03:54.480 --> 01:03:58.559
segmentation where not only are we

01:03:56.880 --> 01:03:59.838
classifying every pixel, we are

01:03:58.559 --> 01:04:01.920
distinguishing between the different

01:03:59.838 --> 01:04:04.318
sheep.

01:04:01.920 --> 01:04:06.400
So every pixel is classified and

01:04:04.318 --> 01:04:09.960
different instances of the same category

01:04:06.400 --> 01:04:09.960
need to be identified.

01:04:10.480 --> 01:04:14.880
Okay. So these are, I would say, some of the

01:04:12.318 --> 01:04:16.798
most

01:04:14.880 --> 01:04:18.960
prevalent

01:04:16.798 --> 01:04:20.880
and useful categories of image

01:04:18.960 --> 01:04:23.920
processing problems that are amenable to

01:04:20.880 --> 01:04:25.440
a deep learning system.

01:04:23.920 --> 01:04:27.200
All right. So let's go to image

01:04:25.440 --> 01:04:28.559
classification and we're going to work

01:04:27.199 --> 01:04:32.598
with this application called Fashion-

01:04:28.559 --> 01:04:32.599
MNIST.

01:04:33.039 --> 01:04:38.400
so the idea here is that you have

01:04:35.358 --> 01:04:40.960
70,000 images of clothing items across

01:04:38.400 --> 01:04:43.119
10 categories. you know like boots and

01:04:40.960 --> 01:04:45.760
sweaters and t-shirts and you get the

01:04:43.119 --> 01:04:48.559
idea, 10 categories of clothing. We

01:04:45.760 --> 01:04:50.559
have 70,000 images like this, and then

01:04:48.559 --> 01:04:52.559
we'll build a network from scratch to

01:04:50.559 --> 01:04:54.559
classify all these things uh you know

01:04:52.559 --> 01:04:55.920
with pretty high accuracy. So these

01:04:54.559 --> 01:04:58.000
classes by the way you know this is a

01:04:55.920 --> 01:04:59.838
very balanced data set. So 10% of the

01:04:58.000 --> 01:05:01.920
data is you know sweaters 10% is boots

01:04:59.838 --> 01:05:03.519
and so on and so forth. So a naive

01:05:01.920 --> 01:05:06.519
baseline model would give you what

01:05:03.519 --> 01:05:06.519
accuracy

01:05:07.679 --> 01:05:12.078
10%. Exactly. So we need to build

01:05:10.559 --> 01:05:13.440
something that's better than 10% and I'm

01:05:12.079 --> 01:05:14.559
glad to report that a simple neural

01:05:13.440 --> 01:05:17.559
network can actually get you close to

01:05:14.559 --> 01:05:17.559
90%.

01:05:18.559 --> 01:05:24.798
Right? So this is the simple network

01:05:21.838 --> 01:05:28.400
that we have. The input in this case is

01:05:24.798 --> 01:05:33.358
a 28×28 picture.

01:05:28.400 --> 01:05:36.720
And

01:05:33.358 --> 01:05:38.318
so far we have been feeding vectors into

01:05:36.719 --> 01:05:40.239
our neural network. Now we have a

01:05:38.318 --> 01:05:43.759
picture which is 28 by 28. It's a

01:05:40.239 --> 01:05:45.919
tensor of rank two, right? It's a table of

01:05:43.760 --> 01:05:49.160
numbers. What do we do? How do we feed

01:05:45.920 --> 01:05:49.159
that in?

01:05:51.199 --> 01:05:54.960
No, each image is a table

01:05:53.599 --> 01:05:57.519
of numbers. Let's just take a single

01:05:54.960 --> 01:05:59.280
image.

01:05:57.519 --> 01:06:01.679
Like what do we do? How do we what do we

01:05:59.280 --> 01:06:04.079
do with this table?

01:06:01.679 --> 01:06:06.399
Convert it into a vector. Exactly. And

01:06:04.079 --> 01:06:08.079
that's called flattening. So we take

01:06:06.400 --> 01:06:11.440
this table of numbers and we flatten it

01:06:08.079 --> 01:06:13.599
into a vector. And so so what we do is

01:06:11.440 --> 01:06:17.760
uh let me just

01:06:13.599 --> 01:06:20.240
Okay. So we have um

01:06:17.760 --> 01:06:22.400
28 by 28.

01:06:20.239 --> 01:06:25.598
So what we can do is we can take each

01:06:22.400 --> 01:06:27.838
row right take this row and then write

01:06:25.599 --> 01:06:32.599
it like that.

01:06:27.838 --> 01:06:32.599
We take the second row oops

01:06:33.440 --> 01:06:36.639
write it like that.

01:06:38.079 --> 01:06:43.599
third row is here

01:06:41.440 --> 01:06:45.358
like that. You get the idea. So you take

01:06:43.599 --> 01:06:47.039
each row just rotate it and stack it all

01:06:45.358 --> 01:06:49.119
up, right? And string them up. It

01:06:47.039 --> 01:06:51.760
becomes one long vector. So this is called

01:06:49.119 --> 01:06:52.960
flattening. Okay? So that's how you take

01:06:51.760 --> 01:06:55.960
this thing and make it into one long

01:06:52.960 --> 01:06:55.960
vector.

01:06:56.400 --> 01:07:03.400
So when you do that, 28 by 28 is what is

01:07:00.159 --> 01:07:03.399
it?

01:07:03.599 --> 01:07:09.440
784. So we get a vector.

01:07:07.440 --> 01:07:11.119
This is the flattened input and you get

01:07:09.440 --> 01:07:15.039
784.

01:07:11.119 --> 01:07:17.358
Uh it's a vector that's 784 long.
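In code, flattening is just a reshape; a minimal sketch (the image array here is a stand-in, not the lecture's data):

    import numpy as np

    image = np.arange(28 * 28).reshape(28, 28)  # a stand-in 28x28 table of numbers
    flat = image.reshape(-1)                    # rows strung end to end
    print(flat.shape)                           # (784,)

    # Inside a Keras model the same step is a layer:
    # keras.layers.Flatten(input_shape=(28, 28))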

01:07:15.039 --> 01:07:18.799
Okay. After the flattening, we have not

01:07:17.358 --> 01:07:19.920
done anything complicated yet. We have

01:07:18.798 --> 01:07:21.679
literally taken the numbers and just

01:07:19.920 --> 01:07:24.318
reorganized them in a different way.

01:07:21.679 --> 01:07:26.000
Okay. And once we do that, now we are

01:07:24.318 --> 01:07:27.759
back in our familiar neural network

01:07:26.000 --> 01:07:29.760
territory, right? We know how to work

01:07:27.760 --> 01:07:33.760
with vectors. So, we just need to pass

01:07:29.760 --> 01:07:35.520
it through a hidden layer, right? And

01:07:33.760 --> 01:07:37.599
this hidden layer, we're going to use ReLU

01:07:35.519 --> 01:07:39.119
neurons. And I tried a few different

01:07:37.599 --> 01:07:41.680
values. And it turns out that 256

01:07:39.119 --> 01:07:43.680
neurons does a really good job.

01:07:41.679 --> 01:07:46.480
Okay? And so, I'm going to use 256

01:07:43.679 --> 01:07:48.000
neurons here. And then we need to now

01:07:46.480 --> 01:07:51.199
think about what the output layer should

01:07:48.000 --> 01:07:54.159
be. Now we run into a problem

01:07:51.199 --> 01:07:55.759
because the output layer we saw before,

01:07:54.159 --> 01:07:58.239
for the heart disease example, it's just

01:07:55.760 --> 01:08:01.039
zero or one. Right? Here there are 10

01:07:58.239 --> 01:08:02.879
possible outputs. It could be a you know

01:08:01.039 --> 01:08:04.799
boot, a sweater, a shirt and so on so

01:08:02.880 --> 01:08:06.798
forth. 10 possible categories. So we

01:08:04.798 --> 01:08:09.199
need some way to handle something with

01:08:06.798 --> 01:08:12.960
many more than one binary

01:08:09.199 --> 01:08:15.038
output: many possible outputs. So the way

01:08:12.960 --> 01:08:16.880
we do that

01:08:15.039 --> 01:08:20.079
(by the way, pay attention to this

01:08:16.880 --> 01:08:24.000
because this is actually how GPT-4 works).

01:08:20.079 --> 01:08:26.880
Okay. So what we do is here's what we

01:08:24.000 --> 01:08:28.640
have. We know how to output 10 numbers,

01:08:26.880 --> 01:08:30.000
right? If you want to output 10 numbers,

01:08:28.640 --> 01:08:31.440
no problem. We just, you know, we have,

01:08:30.000 --> 01:08:33.600
we can easily output 10 numbers by just

01:08:31.439 --> 01:08:36.559
using a linear activation. We also know

01:08:33.600 --> 01:08:37.838
how to output 10 probabilities,

01:08:36.560 --> 01:08:40.560
right? Each one just needs to be a

01:08:37.838 --> 01:08:44.079
sigmoid. But here we can't use 10

01:08:40.560 --> 01:08:47.839
sigmoids as the output. Why is that?

01:08:44.079 --> 01:08:50.000
Why can't we use 10 sigmoids?

01:08:47.838 --> 01:08:52.798
>> Because the probabilities have to add up to one,

01:08:50.000 --> 01:08:54.640
>> right? So here when the output comes we

01:08:52.798 --> 01:08:56.238
need to figure out okay is it a boot, a

01:08:54.640 --> 01:08:59.199
sweater, a shirt and so on and so forth.

01:08:56.238 --> 01:09:00.479
There's only one right answer. Okay,

01:08:59.198 --> 01:09:01.838
which means that we need to actually

01:09:00.479 --> 01:09:03.519
figure out which of these 10 is the

01:09:01.838 --> 01:09:05.439
right answer which means that we need to

01:09:03.520 --> 01:09:07.520
produce probabilities but they have to

01:09:05.439 --> 01:09:09.599
add up to one because only one of them

01:09:07.520 --> 01:09:10.719
can be true.

01:09:09.600 --> 01:09:12.159
So that's the key thing. They have to

01:09:10.719 --> 01:09:13.279
add up to one. That's the wrinkle. If

01:09:12.158 --> 01:09:16.000
not for that we can just use 10

01:09:13.279 --> 01:09:17.600
sigmoids, right? And the way we do that

01:09:16.000 --> 01:09:20.079
is by using something called the

01:09:17.600 --> 01:09:22.319
softmax function or the softmax layer.

01:09:20.079 --> 01:09:25.198
And the idea is actually very simple. We

01:09:22.319 --> 01:09:27.759
have these 10 outputs in the very final

01:09:25.198 --> 01:09:29.759
layer which is just linear activations.

01:09:27.759 --> 01:09:32.719
And then we take each one of these

01:09:29.759 --> 01:09:34.719
numbers and then run it through the

01:09:32.719 --> 01:09:37.279
exponential function and then divide by

01:09:34.719 --> 01:09:39.279
the total. So when you do that two

01:09:37.279 --> 01:09:40.560
things happen. The first one is when you

01:09:39.279 --> 01:09:43.359
take these numbers and run it through

01:09:40.560 --> 01:09:45.920
say you take a1 and do e raised to a1

01:09:43.359 --> 01:09:47.039
you now get a positive number

01:09:45.920 --> 01:09:48.640
and now you have a positive number

01:09:47.039 --> 01:09:50.319
divide by the sum of a bunch of positive

01:09:48.640 --> 01:09:52.079
numbers and they're all you can see here

01:09:50.319 --> 01:09:53.920
you can confirm visually that they will

01:09:52.079 --> 01:09:55.198
add up to one because you're literally

01:09:53.920 --> 01:09:56.719
taking each number and dividing by

01:09:55.198 --> 01:09:59.439
the total so they will add up to one

01:09:56.719 --> 01:10:00.880
there's no other option right so this is

01:09:59.439 --> 01:10:02.559
called the softmax function which means

01:10:00.880 --> 01:10:04.000
that you can take any set of 10 numbers

01:10:02.560 --> 01:10:05.199
that's coming out of the network and

01:10:04.000 --> 01:10:07.198
convert them into probabilities that add

01:10:05.198 --> 01:10:09.919
up to one.
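A quick numerical sketch of softmax in plain NumPy (three made-up outputs instead of ten):

    import numpy as np

    a = np.array([2.0, 1.0, 0.1])    # raw outputs from the final linear layer
    p = np.exp(a) / np.exp(a).sum()  # exponentiate each, divide by the total
    print(p)                         # [0.659 0.242 0.099], all positive
    print(p.sum())                   # 1.0, they add up to one by construction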

01:10:07.198 --> 01:10:12.639
And so, by the way, the GPT-4 reference:

01:10:09.920 --> 01:10:14.480
when you actually put a prompt into GPT-4

01:10:12.640 --> 01:10:17.760
and it starts giving you the output.

01:10:14.479 --> 01:10:19.359
Every word it's emitting, right? It's

01:10:17.760 --> 01:10:21.199
actually a token, but we'll get to that

01:10:19.359 --> 01:10:23.599
later. You imagine it's a word. Every

01:10:21.198 --> 01:10:27.599
word it's emitting, it's actually

01:10:23.600 --> 01:10:28.960
doing a 52,000-way softmax.

01:10:27.600 --> 01:10:31.840
Think of it as every word in the

01:10:28.960 --> 01:10:34.158
language is a possible output. So it's a

01:10:31.840 --> 01:10:36.560
vector which is 52,000 long but it's

01:10:34.158 --> 01:10:39.839
actually a softmax and it just picks the

01:10:36.560 --> 01:10:41.440
most probable word and emits that. So

01:10:39.840 --> 01:10:43.360
this notion of a softmax is actually

01:10:41.439 --> 01:10:45.039
very powerful.

01:10:43.359 --> 01:10:49.119
Okay but we'll come back to that uh

01:10:45.039 --> 01:10:51.039
later. So, to summarize: if you have

01:10:49.119 --> 01:10:53.519
a single number, you can use a simple

01:10:51.039 --> 01:10:55.519
output layer, a single probability, a

01:10:53.520 --> 01:10:57.440
sigmoid; if you have lots of numbers, just

01:10:55.520 --> 01:10:58.719
have a stack of these things. And when

01:10:57.439 --> 01:10:59.839
you have a lot of numbers that have to

01:10:58.719 --> 01:11:03.640
add up to one, that have to be

01:10:59.840 --> 01:11:03.640
probabilities, use softmax.
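Putting those pieces together, a minimal Keras sketch of the network being described (the 256 ReLU neurons and 10-way softmax are from the lecture; the Sequential wiring is standard):

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        layers.Flatten(input_shape=(28, 28)),    # 28x28 table -> vector of 784
        layers.Dense(256, activation="relu"),    # hidden layer: 256 ReLU neurons
        layers.Dense(10, activation="softmax"),  # 10 probabilities that add up to one
    ])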

01:11:03.679 --> 01:11:08.399
>> right? So uh yeah

01:11:06.640 --> 01:11:11.360
>> why do we choose probabilities instead

01:11:08.399 --> 01:11:12.000
of just the number

01:11:11.359 --> 01:11:12.559
one

01:11:12.000 --> 01:11:14.158
>> sorry

01:11:12.560 --> 01:11:15.760
>> then we know it's only going to be one

01:11:14.158 --> 01:11:19.399
>> because you can't force the network to

01:11:15.760 --> 01:11:19.400
give you ones or zeros

01:11:20.158 --> 01:11:22.639
it's going to produce what it's going to

01:11:21.279 --> 01:11:24.399
produce

01:11:22.640 --> 01:11:26.239
>> you can't force it to be exactly one or

01:11:24.399 --> 01:11:28.479
zero

01:11:26.238 --> 01:11:30.319
it'll give you some number. What you can do is

01:11:28.479 --> 01:11:32.238
to tame that number so that it comes

01:11:30.319 --> 01:11:34.639
into a range that you like, like between

01:11:32.238 --> 01:11:38.399
zero and one.

01:11:34.640 --> 01:11:40.000
So here very quickly: when

01:11:38.399 --> 01:11:41.759
we have a binary classification example

01:11:40.000 --> 01:11:43.279
like yes or no, this is the one-hot

01:11:41.760 --> 01:11:45.440
encoded version, one or zero. This is what

01:11:43.279 --> 01:11:46.719
we saw in the heart disease example when

01:11:45.439 --> 01:11:48.639
you have something like this example,

01:11:46.719 --> 01:11:51.039
Fashion-MNIST, where you have all these

01:11:48.640 --> 01:11:52.560
different possibilities then you can

01:11:51.039 --> 01:11:54.479
encode it in one of two ways you can

01:11:52.560 --> 01:11:56.560
encode it just using integers like 0 to

01:11:54.479 --> 01:11:59.519
9, right? This is called the sparse

01:11:56.560 --> 01:12:02.239
encoded version, or you can do a one-hot

01:11:59.520 --> 01:12:03.760
encoded version of the output. Right? You

01:12:02.238 --> 01:12:06.879
can have a one-hot encoded version of

01:12:03.760 --> 01:12:08.960
the output, and depending on how your

01:12:06.880 --> 01:12:11.760
data comes into your

01:12:08.960 --> 01:12:13.840
collab, just pay attention to this,

01:12:11.760 --> 01:12:18.239
and depending on what it is you have to

01:12:13.840 --> 01:12:20.159
pick the right Keras loss function. So if

01:12:18.238 --> 01:12:21.839
your data comes like a one-zero thing, which

01:12:20.158 --> 01:12:24.079
is exactly what we had in the heart disease

01:12:21.840 --> 01:12:26.400
example, we use binary cross entropy. If

01:12:24.079 --> 01:12:28.719
your data comes in this form where it's

01:12:26.399 --> 01:12:31.279
sparse encoded, you use sparse

01:12:28.719 --> 01:12:32.640
categorical cross entropy. And then if it

01:12:31.279 --> 01:12:34.960
comes in this form, you use

01:12:32.640 --> 01:12:36.640
categorical cross entropy, right? These

01:12:34.960 --> 01:12:38.399
are all equivalent things. It just depends

01:12:36.640 --> 01:12:40.159
on the data that you get how it happens

01:12:38.399 --> 01:12:42.559
to be encoded by the people who sent it

01:12:40.158 --> 01:12:43.759
to you. If they send it this way, use

01:12:42.560 --> 01:12:46.080
this loss function. If they send it that

01:12:43.760 --> 01:12:47.600
way, use that loss function.
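To make the two encodings concrete, a small sketch (keras.utils.to_categorical is the standard converter from sparse to one-hot):

    import numpy as np
    from tensorflow import keras

    sparse = np.array([9, 0, 3])  # sparse encoding: just the class integers
    onehot = keras.utils.to_categorical(sparse, num_classes=10)
    print(onehot[0])  # [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.], a one in position 9

    # sparse labels  -> sparse_categorical_crossentropy
    # one-hot labels -> categorical_crossentropy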

01:12:46.079 --> 01:12:49.359
Now, as it turns out in our example

01:12:47.600 --> 01:12:50.800
here, the data is actually coming in

01:12:49.359 --> 01:12:52.158
this form. So, we'll use this thing

01:12:50.800 --> 01:12:54.880
called the sparse categorical cross

01:12:52.158 --> 01:12:56.399
entropy. And categorical cross entropy

01:12:54.880 --> 01:12:58.159
is a generalization of binary cross

01:12:56.399 --> 01:12:59.839
entropy. I'm not going to get into

01:12:58.158 --> 01:13:01.359
the mathematical details, but the

01:12:59.840 --> 01:13:04.319
intuition is basically roughly the

01:13:01.359 --> 01:13:07.439
same.

01:13:04.319 --> 01:13:09.198
Okay so this is what we have. Um if this

01:13:07.439 --> 01:13:11.599
is your output layer use mean squared

01:13:09.198 --> 01:13:14.079
error. If this is your output layer use

01:13:11.600 --> 01:13:15.360
binary cross entropy. And if you still

01:13:14.079 --> 01:13:17.039
have a stack of these numbers you can

01:13:15.359 --> 01:13:19.519
still use mean squared error. And if your

01:13:17.039 --> 01:13:22.000
output is a softmax, use categorical

01:13:19.520 --> 01:13:24.560
cross entropy or sparse categorical

01:13:22.000 --> 01:13:26.479
cross entropy.
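Given that, the compile step for this notebook would look roughly like this (assuming the model sketched earlier; the adam optimizer is an assumption, the loss is the one just named):

    model.compile(
        optimizer="adam",                        # assumed; any standard optimizer works
        loss="sparse_categorical_crossentropy",  # the labels arrive as integers 0 to 9
        metrics=["accuracy"],
    )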

01:13:24.560 --> 01:13:30.600
Okay. So let's actually run this in

01:13:26.479 --> 01:13:30.599
collab. Um

01:13:32.079 --> 01:13:37.198
right. So this is what we have. Can

01:13:33.679 --> 01:13:40.800
folks see this? Okay. All right. So this

01:13:37.198 --> 01:13:44.399
is the data set we saw earlier. Uh down

01:13:40.800 --> 01:13:47.039
here as usual, right? We load

01:13:44.399 --> 01:13:49.198
TensorFlow and Keras. We load our usual

01:13:47.039 --> 01:13:51.119
three packages and then we set the

01:13:49.198 --> 01:13:53.198
random seed for reproducibility. And it

01:13:51.119 --> 01:13:54.719
turns out that the Fashion-MNIST data is

01:13:53.198 --> 01:13:56.000
actually available in Keras. You don't

01:13:54.719 --> 01:13:57.439
have to go find it somewhere and bring

01:13:56.000 --> 01:13:59.279
it in. It's actually available in Keras.

01:13:57.439 --> 01:14:01.119
It's one of the standard data sets. We

01:13:59.279 --> 01:14:04.079
luck out. So we just actually load the

01:14:01.119 --> 01:14:05.920
data using this load_data command.

01:14:04.079 --> 01:14:08.399
And then you do that and conveniently

01:14:05.920 --> 01:14:10.399
for us, Keras has not only made the data

01:14:08.399 --> 01:14:12.238
available it has already split it into a

01:14:10.399 --> 01:14:13.920
training and test set. So we don't have

01:14:12.238 --> 01:14:15.279
to do the splitting. Okay. And the

01:14:13.920 --> 01:14:18.279
reason they do that, why would they do

01:14:15.279 --> 01:14:18.279
that?

01:14:18.640 --> 01:14:21.679
They do that so that different people

01:14:20.238 --> 01:14:23.678
who are building algorithms for that

01:14:21.679 --> 01:14:26.640
particular data set can all be evaluated

01:14:23.679 --> 01:14:28.079
using the same test set.

01:14:26.640 --> 01:14:29.600
Otherwise, if I split it one way and

01:14:28.079 --> 01:14:31.439
say, "Hey, look how well I did that like

01:14:29.600 --> 01:14:32.480
I don't know how did you split it."

01:14:31.439 --> 01:14:36.000
>> That's the reason.
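The loading step itself is one line; a minimal sketch of it:

    from tensorflow import keras

    # Keras ships the dataset along with the train/test split, so everyone
    # who builds a model for it is evaluated on the same test set.
    (x_train, y_train), (x_test, y_test) = keras.datasets.fashion_mnist.load_data()
    print(x_train.shape, y_train.shape)  # (60000, 28, 28) (60000,)
    print(x_test.shape, y_test.shape)    # (10000, 28, 28) (10000,)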

01:14:32.479 --> 01:14:38.158
>> Okay. So here you can see that

01:14:36.000 --> 01:14:43.760
uh we have

01:14:38.158 --> 01:14:47.039
the input data is a tensor of rank

01:14:43.760 --> 01:14:48.239
three. Basically, another

01:14:47.039 --> 01:14:50.158
way to think about a tensor of rank

01:14:48.238 --> 01:14:52.879
three is just a list of rank two

01:14:50.158 --> 01:14:57.279
tensors. Right? So here you have 60,000

01:14:52.880 --> 01:15:02.079
images, and each image is

01:14:57.279 --> 01:15:04.639
a 28×28 square of numbers. Each image

01:15:02.079 --> 01:15:07.279
is a 28×28 table. And then of

01:15:04.640 --> 01:15:09.920
course the output uh is just what

01:15:07.279 --> 01:15:11.519
category it is: a number between 0 and 9.

01:15:09.920 --> 01:15:13.840
So you just have 60,000 numbers. It's

01:15:11.520 --> 01:15:15.920
just a vector of 60,000 numbers. Okay.

01:15:13.840 --> 01:15:19.039
Uh so there are 60,000 in the training

01:15:15.920 --> 01:15:21.279
set. Oops. Uh and then there are 10,000

01:15:19.039 --> 01:15:23.519
in the test set. Same structure 28 by

01:15:21.279 --> 01:15:25.039
28. Uh that's what we have. So if you

01:15:23.520 --> 01:15:27.040
look at the first 10 rows of the

01:15:25.039 --> 01:15:29.039
dependent variable Y, you get these

01:15:27.039 --> 01:15:31.439
numbers 9 0 33 like that. There are

01:15:29.039 --> 01:15:33.359
numbers from 0 to 9. So if you look at

01:15:31.439 --> 01:15:35.919
the Fashion-MNIST GitHub site, this is

01:15:33.359 --> 01:15:37.839
what it refers to. Zero is a t-shirt,

01:15:35.920 --> 01:15:41.600
one is a trouser, and so on and so

01:15:37.840 --> 01:15:43.760
forth. And nine is an ankle boot.

01:15:41.600 --> 01:15:45.280
All right. So, uh, whenever I'm working

01:15:43.760 --> 01:15:47.520
with multiclass classification

01:15:45.279 --> 01:15:49.439
problems, I always, you know, do a

01:15:47.520 --> 01:15:51.120
little thing here to help me figure out

01:15:49.439 --> 01:15:52.319
that nine corresponds to an ankle boot

01:15:51.119 --> 01:15:53.519
and so on and so forth. It just makes it

01:15:52.319 --> 01:15:56.639
a little easier to work with this

01:15:53.520 --> 01:15:59.679
stuff. So, I create this little list.
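A minimal version of such a list, with the names as given on the Fashion-MNIST GitHub page:

    class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
                   "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]
    print(class_names[9])  # Ankle boot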

01:15:56.640 --> 01:16:01.119
And then it turns out, okay: what

01:15:59.679 --> 01:16:02.960
is the very first data point? What is

01:16:01.119 --> 01:16:05.279
it? What is its y-value? It turns out to

01:16:02.960 --> 01:16:07.679
be an ankle boot. Um so you can actually

01:16:05.279 --> 01:16:10.238
look at the raw data for that image

01:16:07.679 --> 01:16:13.119
which is just a 28×28 thing and these

01:16:10.238 --> 01:16:16.959
are the numbers you have.

01:16:13.119 --> 01:16:19.198
See all these 250s, 233s, lots of zeros and

01:16:16.960 --> 01:16:20.960
so on and so forth. So you can actually

01:16:19.198 --> 01:16:22.639
visualize the first 25

01:16:20.960 --> 01:16:24.560
images. I have a little bit of code here

01:16:22.640 --> 01:16:25.920
which visualizes that, just matplotlib

01:16:24.560 --> 01:16:28.719
code and you can see these are all the

01:16:25.920 --> 01:16:32.319
images. They're kind of smallish. This,

01:16:28.719 --> 01:16:34.560
my friends, is an ankle boot.
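The visualization cell is roughly like this (a sketch of standard matplotlib usage, reusing x_train, y_train, and class_names from the sketches above; not the lecture's exact code):

    import matplotlib.pyplot as plt

    plt.figure(figsize=(8, 8))
    for i in range(25):
        plt.subplot(5, 5, i + 1)
        plt.imshow(x_train[i], cmap="gray")  # render the 28x28 table as a picture
        plt.title(class_names[y_train[i]], fontsize=8)
        plt.axis("off")
    plt.show()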

01:16:32.319 --> 01:16:35.759
Right? It's like, okay, can the network

01:16:34.560 --> 01:16:37.360
really make any sense out of this thing

01:16:35.760 --> 01:16:39.920
right? It looks very blurry and I don't

01:16:37.359 --> 01:16:42.158
know

01:16:39.920 --> 01:16:43.679
this is uh

01:16:42.158 --> 01:16:45.359
oh this is actually a better ankle boot

01:16:43.679 --> 01:16:47.840
look at that okay sorry I'm getting

01:16:45.359 --> 01:16:49.599
distracted. So this is what we have

01:16:47.840 --> 01:16:51.520
here

01:16:49.600 --> 01:16:53.360
uh okay, we are at 9:55.

01:16:51.520 --> 01:16:54.880
I'm going to stop um so you folks are

01:16:53.359 --> 01:16:56.399
not late for your next class. So we'll

01:16:54.880 --> 01:16:58.079
continue this journey on Wednesday and

01:16:56.399 --> 01:16:59.599
then we'll go on to color images the

01:16:58.079 --> 01:17:03.000
next class as well. Thank you folks.

01:16:59.600 --> 01:17:03.000
Have a good one.
