WEBVTT

00:00:21.519 --> 00:00:26.759
Okay. So, let's get going. Today we're

00:00:24.399 --> 00:00:28.879
going to talk about how do you actually

00:00:26.760 --> 00:00:30.320
train a neural network, right? Because

00:00:28.879 --> 00:00:33.439
that is sort of the heart of the game

00:00:30.320 --> 00:00:34.960
here. Um so, just to recap, we looked

00:00:33.439 --> 00:00:36.679
last class

00:00:34.960 --> 00:00:38.560
at what it takes to design a neural

00:00:36.679 --> 00:00:40.880
network, and we made this very important

00:00:38.560 --> 00:00:42.960
distinction between the things that you

00:00:40.880 --> 00:00:44.679
are handed by your problem and the

00:00:42.960 --> 00:00:46.759
things that you have agency over, that

00:00:44.679 --> 00:00:49.079
you have control over. And we noticed

00:00:46.759 --> 00:00:51.599
that, you know, the input layer for your

00:00:49.079 --> 00:00:53.079
problem, the input is the input. Uh the

00:00:51.600 --> 00:00:54.439
output is the output. You got to do

00:00:53.079 --> 00:00:56.519
something with the output, something

00:00:54.439 --> 00:00:58.119
that's expected. But everything that

00:00:56.520 --> 00:01:00.480
happens in the middle is actually in

00:00:58.119 --> 00:01:03.320
your hands. And in particular, we

00:01:00.479 --> 00:01:05.920
noticed that we have to decide how many

00:01:03.320 --> 00:01:08.599
hidden layers we want. We have to decide

00:01:05.920 --> 00:01:11.200
in each layer how many neurons to have.

00:01:08.599 --> 00:01:13.359
And then we had to decide what uh

00:01:11.200 --> 00:01:14.719
activation to use. Even though I'm kind

00:01:13.359 --> 00:01:17.159
of cheating when I say that because I

00:01:14.719 --> 00:01:18.719
told you very clearly on Monday that for

00:01:17.159 --> 00:01:20.679
the hidden layer activation, just go

00:01:18.719 --> 00:01:22.120
with the ReLU activation function. You

00:01:20.680 --> 00:01:23.520
don't have to think deep thoughts about

00:01:22.120 --> 00:01:24.920
this, okay?

00:01:23.519 --> 00:01:26.280
But the other things are all choices you

00:01:24.920 --> 00:01:28.320
have to make, and we will talk a bit

00:01:26.280 --> 00:01:29.920
later about how do you actually make

00:01:28.319 --> 00:01:32.519
those choices.

00:01:29.920 --> 00:01:34.519
Okay. Now, the rule of thumb,

00:01:32.519 --> 00:01:36.439
right? The rule of thumb always is to

00:01:34.519 --> 00:01:37.599
start with the simplest network you can

00:01:36.439 --> 00:01:39.439
think of.

00:01:37.599 --> 00:01:41.519
And if it's if it gets the job done,

00:01:39.439 --> 00:01:42.799
stop working on it.

00:01:41.519 --> 00:01:45.359
If it's not good enough, make it

00:01:42.799 --> 00:01:46.679
slightly more complicated. Okay? So,

00:01:45.359 --> 00:01:48.200
that's sort of the, you know, like the

00:01:46.680 --> 00:01:49.880
meta thing you have to remember always

00:01:48.200 --> 00:01:52.200
when you're designing these things.

00:01:49.879 --> 00:01:53.479
Okay. So, that's sort of, you know, what

00:01:52.200 --> 00:01:55.640
it takes to design a deep neural

00:01:53.480 --> 00:01:57.120
network. So, what we will do in this

00:01:55.640 --> 00:01:59.680
class is we'll actually take a real

00:01:57.120 --> 00:02:01.320
example with real data, and then we

00:01:59.680 --> 00:02:03.280
we'll think through how we would design

00:02:01.319 --> 00:02:05.439
a network to solve this problem.

00:02:03.280 --> 00:02:07.599
And while doing so, we will cover a

00:02:05.439 --> 00:02:09.758
whole bunch of conceptual foundations

00:02:07.599 --> 00:02:11.079
such as optimization, loss functions,

00:02:09.758 --> 00:02:12.039
gradient descent, and all that good

00:02:11.080 --> 00:02:12.960
stuff.

00:02:12.039 --> 00:02:16.199
Okay?

00:02:12.960 --> 00:02:18.760
All right. So, the the case study or the

00:02:16.199 --> 00:02:20.919
scenario here is we have a data set of

00:02:18.759 --> 00:02:23.719
patients uh made available by the

00:02:20.919 --> 00:02:25.599
Cleveland Clinic. And essentially, we

00:02:23.719 --> 00:02:27.359
have a bunch of patients, and for all

00:02:25.599 --> 00:02:29.799
these patients, the setting is that they

00:02:27.360 --> 00:02:31.600
have come into the Cleveland Clinic, and

00:02:29.800 --> 00:02:32.800
they have not come in with a heart

00:02:31.599 --> 00:02:33.879
problem. They have come in for something

00:02:32.800 --> 00:02:36.080
else. Maybe they just came in for a

00:02:33.879 --> 00:02:38.039
physical. And we measured a whole bunch

00:02:36.080 --> 00:02:40.160
of things about them, okay? And the

00:02:38.039 --> 00:02:41.719
kinds of things we measured are, you

00:02:40.159 --> 00:02:44.199
know, demographic information, like

00:02:41.719 --> 00:02:45.680
what's their age, uh gender, whether

00:02:44.199 --> 00:02:47.639
they have any chest pain at all when

00:02:45.680 --> 00:02:50.520
they came in, blood pressure,

00:02:47.639 --> 00:02:52.399
cholesterol, sugar, so on and so forth.

00:02:50.520 --> 00:02:53.920
Right? You get the idea? Demographic

00:02:52.400 --> 00:02:56.439
information and a bunch of biomarker

00:02:53.919 --> 00:02:59.039
information. And then,

00:02:56.439 --> 00:03:01.520
what the Cleveland Clinic uh did was

00:02:59.039 --> 00:03:04.079
they actually tracked these people

00:03:01.520 --> 00:03:05.560
and figured out in the next year,

00:03:04.080 --> 00:03:07.520
did they get diagnosed with heart

00:03:05.560 --> 00:03:09.000
disease or not?

00:03:07.520 --> 00:03:10.760
Okay, in the next year.

00:03:09.000 --> 00:03:12.879
Which means that maybe you can build a

00:03:10.759 --> 00:03:15.199
model when someone comes in, even though

00:03:12.879 --> 00:03:16.519
they didn't come in for a chest problem,

00:03:15.199 --> 00:03:17.439
maybe you can predict that something's

00:03:16.520 --> 00:03:20.120
going to happen to them in the next

00:03:17.439 --> 00:03:23.000
year, right? It's a nice sort of classic

00:03:20.120 --> 00:03:24.879
machine learning setup.

00:03:23.000 --> 00:03:26.439
All right. So, this is the thing. So,

00:03:24.879 --> 00:03:28.199
what we want to do is we can totally

00:03:26.439 --> 00:03:29.719
solve this problem using decision trees,

00:03:28.199 --> 00:03:31.439
random

00:03:29.719 --> 00:03:33.240
forests and gradient boosting and all

00:03:31.439 --> 00:03:35.039
that good stuff you folks have already

00:03:33.240 --> 00:03:36.360
learned from machine learning.

00:03:35.039 --> 00:03:38.959
But we will try to solve it using neural

00:03:36.360 --> 00:03:40.280
networks, okay? Um this is an example,

00:03:38.960 --> 00:03:41.680
of course, of what's called structured

00:03:40.280 --> 00:03:43.879
data because this is all data sitting in

00:03:41.680 --> 00:03:46.159
the columns of a spreadsheet, right? Uh

00:03:43.879 --> 00:03:48.199
so, working with structured data is the

00:03:46.159 --> 00:03:50.199
way we warm up our knowledge of neural

00:03:48.199 --> 00:03:51.879
networks. And then we will do things

00:03:50.199 --> 00:03:53.599
like working with unstructured data

00:03:51.879 --> 00:03:55.079
starting next week with images and then

00:03:53.599 --> 00:03:58.960
later on with text and so on and so

00:03:55.080 --> 00:03:58.960
forth. Okay, any questions on this?

00:04:00.439 --> 00:04:05.560
Okay. Uh yes. Uh just connected even to

00:04:03.319 --> 00:04:07.840
last time's class where we took uh the

00:04:05.560 --> 00:04:10.000
same example and first it was a logistic

00:04:07.840 --> 00:04:12.120
and then we did a neural network. So,

00:04:10.000 --> 00:04:14.759
the probability in case of one was 0.85,

00:04:12.120 --> 00:04:16.840
then was 0.22, and here as well, how do

00:04:14.759 --> 00:04:19.199
you know when to uh

00:04:16.839 --> 00:04:21.919
use what? Usually in textbooks, you know

00:04:19.199 --> 00:04:24.399
when to use logistic or when to use uh

00:04:21.920 --> 00:04:25.560
something else, but in this case,

00:04:24.399 --> 00:04:27.439
uh

00:04:25.560 --> 00:04:29.079
when do I complicate it to neural

00:04:27.439 --> 00:04:30.439
networks vis-à-vis in this case maybe

00:04:29.079 --> 00:04:33.039
just doing a random forest? It's a great

00:04:30.439 --> 00:04:34.480
question. Uh when do you use what? So, I

00:04:33.040 --> 00:04:35.800
think there are two broad dimensions

00:04:34.480 --> 00:04:37.160
that you have to think about. One broad

00:04:35.800 --> 00:04:39.439
dimension is

00:04:37.160 --> 00:04:41.720
uh how important is it that you need to

00:04:39.439 --> 00:04:43.680
explain or interpret what's going on

00:04:41.720 --> 00:04:46.240
inside the model to perhaps a

00:04:43.680 --> 00:04:48.280
non-technical consumer.

00:04:46.240 --> 00:04:50.759
The other dimension is how important is

00:04:48.279 --> 00:04:52.559
sheer predictive accuracy.

00:04:50.759 --> 00:04:54.319
In some situations, predictive accuracy

00:04:52.560 --> 00:04:56.160
trumps everything else. In which case,

00:04:54.319 --> 00:04:57.399
just go with it. In other cases,

00:04:56.160 --> 00:04:59.000
explainability becomes a big deal

00:04:57.399 --> 00:05:00.560
because if they can't understand, they

00:04:59.000 --> 00:05:02.000
won't use it.

00:05:00.560 --> 00:05:04.319
And those cases, it's probably better to

00:05:02.000 --> 00:05:05.800
go with simpler models such as decision

00:05:04.319 --> 00:05:07.800
trees, not neural

00:05:05.800 --> 00:05:09.280
networks, maybe even

00:05:07.800 --> 00:05:10.920
random forests, certainly logistic

00:05:09.279 --> 00:05:12.319
regression. Those are all a little more

00:05:10.920 --> 00:05:15.480
amenable.

00:05:12.319 --> 00:05:17.439
But that said, uh even complex black box

00:05:15.480 --> 00:05:19.280
methods like neural networks, there is a

00:05:17.439 --> 00:05:20.800
whole field called mechanistic

00:05:19.279 --> 00:05:23.199
interpretability,

00:05:20.800 --> 00:05:24.720
which seeks to try to get insight into

00:05:23.199 --> 00:05:28.360
what's going on inside these big black

00:05:24.720 --> 00:05:30.560
boxes. So, the story isn't over, right?

00:05:28.360 --> 00:05:33.360
But that's just the first cut you sort

00:05:30.560 --> 00:05:35.199
of analyze the problem.

00:05:33.360 --> 00:05:37.600
Okay. So,

00:05:35.199 --> 00:05:39.719
um let's get going. So, if you want to

00:05:37.600 --> 00:05:42.080
design a network,

00:05:39.720 --> 00:05:43.160
All right. So, we design the network. Uh

00:05:42.079 --> 00:05:45.039
so, we have to choose the number of

00:05:43.160 --> 00:05:46.439
hidden layers and the number of neurons

00:05:45.040 --> 00:05:49.160
in each layer. Then we have to pick the

00:05:46.439 --> 00:05:51.199
right output layer. So, here,

00:05:49.160 --> 00:05:52.720
the simplest thing you can

00:05:51.199 --> 00:05:53.719
do, of course, is to have no hidden

00:05:52.720 --> 00:05:55.360
layer.

00:05:53.720 --> 00:05:58.120
So, if you have no hidden layers, what

00:05:55.360 --> 00:05:58.120
is that model called?

00:05:58.439 --> 00:06:02.079
Yes, logistic regression.

00:06:00.240 --> 00:06:03.319
Okay? So, of course, we want to do a

00:06:02.079 --> 00:06:05.079
neural network, so I'm going to have one

00:06:03.319 --> 00:06:08.199
hidden layer because that's the simplest

00:06:05.079 --> 00:06:09.879
thing I can do. And then, I'll confess,

00:06:08.199 --> 00:06:12.000
I tried a few different numbers of

00:06:09.879 --> 00:06:14.079
neurons in this thing, and when I had 16

00:06:12.000 --> 00:06:15.480
neurons, it actually did pretty well.

00:06:14.079 --> 00:06:16.959
Okay? So, there was some trial and error

00:06:15.480 --> 00:06:19.280
that went on before I landed on the

00:06:16.959 --> 00:06:20.839
number 16. Right? And for some reason,

00:06:19.279 --> 00:06:22.599
people always use powers of two, so may

00:06:20.839 --> 00:06:24.239
as well do that.

00:06:22.600 --> 00:06:25.439
So, I tried like 4, 8, 16, and 16 was

00:06:24.240 --> 00:06:27.319
really good.

00:06:25.439 --> 00:06:30.800
And as it turns out, when I went above

00:06:27.319 --> 00:06:31.959
16, uh it sort of started to do badly.

00:06:30.800 --> 00:06:33.560
And it started to do badly because

00:06:31.959 --> 00:06:35.039
something called overfitting,

00:06:33.560 --> 00:06:37.439
which we're going to talk about later,

00:06:35.040 --> 00:06:39.960
okay? So, yeah, 16.

00:06:37.439 --> 00:06:42.040
Um and then by default, I use ReLUs,

00:06:39.959 --> 00:06:44.959
okay? So, 16 ReLU neurons. And then

00:06:42.040 --> 00:06:47.160
here, the output is a categorical

00:06:44.959 --> 00:06:49.719
output, right? Heart disease, yes or no,

00:06:47.160 --> 00:06:51.120
one or zero, classification problem,

00:06:49.720 --> 00:06:53.040
which means that we want to emit a

00:06:51.120 --> 00:06:54.840
probability at the very end. Therefore,

00:06:53.040 --> 00:06:57.240
we'll use a sigmoid.

00:06:54.839 --> 00:06:59.159
Okay? So, so far, so good, right? Any

00:06:57.240 --> 00:07:00.360
questions?

00:06:59.160 --> 00:07:02.520
All right.

00:07:00.360 --> 00:07:03.720
So, we're going to lay out this network

00:07:02.519 --> 00:07:06.560
visually.

00:07:03.720 --> 00:07:09.160
Okay? So, we have an input, and so I

00:07:06.560 --> 00:07:10.480
just have an input. And as you will

00:07:09.160 --> 00:07:13.120
see here,

00:07:10.480 --> 00:07:15.240
X1 through X29, that's our input layer.

00:07:13.120 --> 00:07:17.360
And you may be wondering, 29, where did

00:07:15.240 --> 00:07:19.519
he get that from?

00:07:17.360 --> 00:07:22.040
Because there doesn't seem to be like 29

00:07:19.519 --> 00:07:24.759
rows here of independent variables. So,

00:07:22.040 --> 00:07:26.439
it turns out there are only 13 input

00:07:24.759 --> 00:07:29.159
variables here,

00:07:26.439 --> 00:07:31.279
but some of them are categorical.

00:07:29.160 --> 00:07:32.920
So, what I ended up doing is to take

00:07:31.279 --> 00:07:34.039
each categorical variable and one-hot

00:07:32.920 --> 00:07:35.560
encode it.

00:07:34.040 --> 00:07:37.360
Okay?

00:07:35.560 --> 00:07:39.240
And when you do that, you get to

00:07:37.360 --> 00:07:40.800
29.

00:07:39.240 --> 00:07:43.240
All right? And when we actually do the

00:07:40.800 --> 00:07:45.400
Colab later on, I'll show you exactly

00:07:43.240 --> 00:07:46.879
how I one-hot encoded it, but

00:07:45.399 --> 00:07:49.239
that's what I'm doing here.

00:07:46.879 --> 00:07:51.920
That's why you have 29, not 13.

00:07:49.240 --> 00:07:54.079
Okay? Now, obviously, we have decided on

00:07:51.920 --> 00:07:56.199
these hidden units, 16 units,

00:07:54.079 --> 00:07:57.680
with nice ReLUs here.

00:07:56.199 --> 00:07:59.479
Okay? And then we have an output layer

00:07:57.680 --> 00:08:01.319
with a little sigmoid.

00:07:59.480 --> 00:08:02.560
And I got bored of trying to draw all

00:08:01.319 --> 00:08:05.199
these arrows, so I just gave up and

00:08:02.560 --> 00:08:07.839
said, "Assume there are arrows."

00:08:05.199 --> 00:08:09.800
Okay, between all these things.

00:08:07.839 --> 00:08:11.119
Good?

00:08:09.800 --> 00:08:12.439
Yeah.

00:08:11.120 --> 00:08:15.319
Yeah, I'm sorry. I think you already

00:08:12.439 --> 00:08:16.600
mentioned this, but why 16 units? Why

00:08:15.319 --> 00:08:18.159
16? Uh

00:08:16.600 --> 00:08:21.400
I tried a bunch of different numbers of

00:08:18.160 --> 00:08:23.480
units. Uh and at 16, the resulting model

00:08:21.399 --> 00:08:25.879
did well, so I just went with that. And

00:08:23.480 --> 00:08:28.040
the logic of why a ReLU?

00:08:25.879 --> 00:08:29.519
Oh, why a ReLU? Yeah, so

00:08:28.040 --> 00:08:31.960
there's just a mountain of empirical

00:08:29.519 --> 00:08:35.158
evidence that suggests that uh ReLU is a

00:08:31.959 --> 00:08:37.038
really good default option for using as

00:08:35.158 --> 00:08:39.279
activations in hidden layers. There is

00:08:37.038 --> 00:08:41.639
also a really great set of theoretical

00:08:39.279 --> 00:08:42.918
results, and I'll allude to some of them

00:08:41.639 --> 00:08:45.199
when we actually talk about gradient

00:08:42.918 --> 00:08:47.519
descent.

00:08:45.200 --> 00:08:47.520
Yeah.

00:08:47.879 --> 00:08:51.840
Sorry, quick question. You mentioned um

00:08:50.120 --> 00:08:53.919
in the input layer, how how did you get

00:08:51.840 --> 00:08:55.720
to 29 again when you had like 13

00:08:53.919 --> 00:08:58.399
variables? So, some of those 13

00:08:55.720 --> 00:09:00.560
variables are categorical variables like

00:08:58.399 --> 00:09:02.159
uh cholesterol low, medium, high. Right?

00:09:00.559 --> 00:09:04.639
And so, I took them and one-hot encoded

00:09:02.159 --> 00:09:08.079
them. So, if it had like five levels, I

00:09:04.639 --> 00:09:08.080
would get five columns now.
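
A minimal sketch of that encoding step, assuming pandas; the file name and column names below are illustrative placeholders, not the actual Cleveland Clinic fields (the real Colab code comes later):

import pandas as pd

# Hypothetical file and column names, for illustration only
df = pd.read_csv("heart.csv")
categorical_cols = ["chest_pain_type", "cholesterol_level", "thal"]
X = pd.get_dummies(df.drop(columns=["target"]), columns=categorical_cols)
print(X.shape[1])  # 13 raw variables expand to 29 columns after one-hot encoding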

00:09:08.440 --> 00:09:12.720
Uh yeah.

00:09:09.799 --> 00:09:15.359
And by the way, folks, just like

00:09:12.720 --> 00:09:17.080
they just did, please

00:09:15.360 --> 00:09:18.440
use a microphone so that people on the

00:09:17.080 --> 00:09:20.440
live stream can hear your question.

00:09:18.440 --> 00:09:22.080
Yeah, go ahead. Uh sorry, just one

00:09:20.440 --> 00:09:23.800
question. So, the vectors, since you

00:09:22.080 --> 00:09:26.000
didn't represent them, are we assuming

00:09:23.799 --> 00:09:26.599
like every X is connected to all the

00:09:26.000 --> 00:09:28.480
units?

00:09:26.600 --> 00:09:31.000
>> Correct. And this is also a parameter

00:09:28.480 --> 00:09:32.279
that we have to decide or That ends up

00:09:31.000 --> 00:09:33.720
being the default.

00:09:32.279 --> 00:09:36.120
And we will see

00:09:33.720 --> 00:09:37.840
deviations from that assumption when we

00:09:36.120 --> 00:09:39.440
go to image processing and language

00:09:37.840 --> 00:09:40.879
processing and so on. But when you're

00:09:39.440 --> 00:09:43.800
working with structured data like we're

00:09:40.879 --> 00:09:46.039
doing now, that's the default.

00:09:43.799 --> 00:09:47.759
Okay. So, let's keep going.

00:09:46.039 --> 00:09:49.399
So, this is what we have.

00:09:47.759 --> 00:09:50.679
So, remember what I told you in the

00:09:49.399 --> 00:09:52.360
last class? Whenever you're working with

00:09:50.679 --> 00:09:54.239
these networks, right? Get into the

00:09:52.360 --> 00:09:55.919
habit of very quickly calculating the

00:09:54.240 --> 00:09:57.360
number of parameters.

00:09:55.919 --> 00:09:59.839
Right? Just do it a few times, the first

00:09:57.360 --> 00:10:02.279
few times, so that you really know cold

00:09:59.840 --> 00:10:04.600
exactly what's going on. Okay? So, yeah,

00:10:02.279 --> 00:10:06.159
how many parameters do we have here?

00:10:04.600 --> 00:10:08.120
How many weights and biases? You can

00:10:06.159 --> 00:10:09.120
work through it, okay? You don't

00:10:08.120 --> 00:10:13.840
have to tell me the final number. You

00:10:09.120 --> 00:10:13.840
can say x * y + z, stuff like that.

00:10:14.399 --> 00:10:20.199
Yeah.

00:10:15.759 --> 00:10:21.759
65. You have 48 weights and 17 biases.

00:10:20.200 --> 00:10:23.680
Okay, and how did he come up with that?

00:10:21.759 --> 00:10:26.000
So, for the weights, you have like for

00:10:23.679 --> 00:10:28.319
the first layer it's 2 * 16 and for the

00:10:26.000 --> 00:10:30.399
second connection it's 1 * 16 and

00:10:28.320 --> 00:10:32.200
then the biases are the 16 hidden plus

00:10:30.399 --> 00:10:33.439
the outputs.

00:10:32.200 --> 00:10:36.280
Okay.

00:10:33.440 --> 00:10:40.280
Um any other views on this?

00:10:36.279 --> 00:10:43.559
I think it's 29 into 16. Okay, 29

00:10:40.279 --> 00:10:46.600
into 16. And then 16 into

00:10:43.559 --> 00:10:49.839
one there. Yeah. And then

00:10:46.600 --> 00:10:52.320
biases: 16 biases and one bias. Right.

00:10:49.840 --> 00:10:55.240
So, the way it's going to work is we

00:10:52.320 --> 00:10:58.440
have 29 things here, 16 in the middle,

00:10:55.240 --> 00:11:00.279
so 29 into 16 arrows.

00:10:58.440 --> 00:11:02.640
And then for each of these fellows,

00:11:00.279 --> 00:11:05.000
there's a bias coming in.

00:11:02.639 --> 00:11:08.399
So, that's another 16.

00:11:05.000 --> 00:11:10.759
Plus, you have 16 * 1.

00:11:08.399 --> 00:11:12.079
Which is here, plus there is one bias

00:11:10.759 --> 00:11:15.519
for this one.

00:11:12.080 --> 00:11:15.520
So, the total is 497.
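
As a quick arithmetic check of that count:

# Parameter count for a fully connected 29 -> 16 -> 1 network
hidden_weights = 29 * 16   # one weight per input-to-hidden arrow
hidden_biases = 16         # one bias per hidden neuron
output_weights = 16 * 1    # one weight per hidden-to-output arrow
output_bias = 1            # one bias for the output neuron
print(hidden_weights + hidden_biases + output_weights + output_bias)  # 497

Once the model is defined in Keras below, model.summary() should report the same total.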

00:11:16.720 --> 00:11:21.040
So, you can see here there's something

00:11:19.279 --> 00:11:22.838
very interesting going on, which is that

00:11:21.039 --> 00:11:24.000
when you go from one layer to another

00:11:22.839 --> 00:11:26.280
layer,

00:11:24.000 --> 00:11:28.360
the number of weights is roughly on the

00:11:26.279 --> 00:11:30.199
order of a * b.

00:11:28.360 --> 00:11:31.639
the numbers of units, and so that's a

00:11:30.200 --> 00:11:33.400
dramatic explosion in the number of

00:11:31.639 --> 00:11:34.559
parameters.

00:11:33.399 --> 00:11:36.199
Right? And that's something we have to

00:11:34.559 --> 00:11:38.039
watch for later on to prevent

00:11:36.200 --> 00:11:39.720
overfitting.

00:11:38.039 --> 00:11:41.480
Okay, that's where the explosion of

00:11:39.720 --> 00:11:43.080
parameters comes from the fact that each

00:11:41.480 --> 00:11:44.000
layer is fully connected to the next

00:11:43.080 --> 00:11:46.160
layer.

00:11:44.000 --> 00:11:47.200
Okay? But we'll revisit this later on.

00:11:46.159 --> 00:11:48.279
Okay.

00:11:47.200 --> 00:11:50.120
So,

00:11:48.279 --> 00:11:52.240
what I'm going to do now is I'm going to

00:11:50.120 --> 00:11:53.200
actually translate this network, right?

00:11:52.240 --> 00:11:56.039
The one that we have laid out

00:11:53.200 --> 00:11:58.759
graphically, into Keras code

00:11:56.039 --> 00:12:01.159
to demonstrate how easy it is.

00:11:58.759 --> 00:12:03.159
Okay? So, I will give a fuller intro to

00:12:01.159 --> 00:12:06.240
Keras in TensorFlow later on, but for

00:12:03.159 --> 00:12:08.159
now, just suspend your disbelief.

00:12:06.240 --> 00:12:10.560
We'll just try to do it in Keras as if

00:12:08.159 --> 00:12:12.039
we know Keras. Okay? So, let's try that.

00:12:10.559 --> 00:12:14.119
Later on we'll get into all the gory

00:12:12.039 --> 00:12:17.519
details and train it in Colab and so on

00:12:14.120 --> 00:12:19.399
and so forth. Okay. All right. So,

00:12:17.519 --> 00:12:21.319
So, the way we typically do it

00:12:19.399 --> 00:12:23.759
is that once we have a network like

00:12:21.320 --> 00:12:25.800
this, we typically start from the left

00:12:23.759 --> 00:12:27.519
and start defining each layer in Keras

00:12:25.799 --> 00:12:30.120
one after the other. So, we flow left to

00:12:27.519 --> 00:12:32.000
right. Okay? So, let's take the input

00:12:30.120 --> 00:12:34.720
layer. The way you define an input layer

00:12:32.000 --> 00:12:38.360
in Keras is really easy.

00:12:34.720 --> 00:12:41.200
You literally say Keras.input.

00:12:38.360 --> 00:12:43.360
Okay? And then you tell Keras how many

00:12:41.200 --> 00:12:45.120
nodes you have in the input coming in.

00:12:43.360 --> 00:12:47.240
In this case it happens to be 29, so you

00:12:45.120 --> 00:12:49.039
tell it the shape. Shape equals 29. And

00:12:47.240 --> 00:12:51.120
the reason why we say shape as opposed

00:12:49.039 --> 00:12:53.159
to length is because, as you will see

00:12:51.120 --> 00:12:55.519
later on, we don't have to just send

00:12:53.159 --> 00:12:57.279
vectors in, we can send complicated

00:12:55.519 --> 00:12:59.319
things in to Keras.

00:12:57.279 --> 00:13:01.519
And those complicated objects could be

00:12:59.320 --> 00:13:03.600
matrices, it could be 3D cubes, it could

00:13:01.519 --> 00:13:06.199
be 4D tensors and so on and so forth.

00:13:03.600 --> 00:13:07.720
So, it's expecting a shape.

00:13:06.200 --> 00:13:09.040
Right? What is the shape of this

00:13:07.720 --> 00:13:10.800
thing you're going to send me? In this

00:13:09.039 --> 00:13:12.679
particular case it happens to be a nice

00:13:10.799 --> 00:13:15.519
list or a vector, so it's 29. Okay,

00:13:12.679 --> 00:13:17.719
that's it. So, we write this down.

00:13:15.519 --> 00:13:19.720
This creates the input layer.

00:13:17.720 --> 00:13:21.440
Right? And we give it a name. Right? And

00:13:19.720 --> 00:13:23.160
the name here means

00:13:21.440 --> 00:13:26.400
this layer, whatever comes out of this

00:13:23.159 --> 00:13:27.799
layer has a name input.

00:13:26.399 --> 00:13:30.319
Okay?

00:13:27.799 --> 00:13:31.399
Good. Next.

00:13:30.320 --> 00:13:32.920
Let's make sure the shape of the input

00:13:31.399 --> 00:13:34.360
as I mentioned.

00:13:32.919 --> 00:13:36.719
Right there.
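
In code, the spoken "Keras.input" corresponds to keras.Input. A minimal sketch of just this first step (the full four-line model is collected further down):

from tensorflow import keras

# Input layer: each patient is a length-29 vector of one-hot-encoded features
inputs = keras.Input(shape=(29,), name="input")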

00:13:34.360 --> 00:13:39.560
Then we go to the next one. And here and

00:13:36.720 --> 00:13:41.920
we will unpack this. The way you define

00:13:39.559 --> 00:13:43.439
a layer is typically a hidden layer

00:13:41.919 --> 00:13:46.000
Keras.layers.dense

00:13:43.440 --> 00:13:48.760
and all this stuff. Okay? So, what this

00:13:46.000 --> 00:13:50.720
is: first of all, it says

00:13:48.759 --> 00:13:52.480
I want a dense layer. By dense layer I

00:13:50.720 --> 00:13:53.960
mean a layer that's going to fully

00:13:52.480 --> 00:13:55.120
connect to the prior and the later

00:13:53.960 --> 00:13:56.240
layers.

00:13:55.120 --> 00:13:58.120
Fully connect, that's what the word

00:13:56.240 --> 00:13:59.159
dense means. Okay?

00:13:58.120 --> 00:14:02.799
Number two,

00:13:59.159 --> 00:14:06.799
I want 16 nodes here in this layer.

00:14:02.799 --> 00:14:09.559
Okay? Finally, I want to use a ReLU.

00:14:06.799 --> 00:14:11.120
See how compact and parsimonious it is?

00:14:09.559 --> 00:14:13.679
Right? And that is the appeal of Keras.

00:14:11.120 --> 00:14:15.039
It's very easy to get going.

00:14:13.679 --> 00:14:18.239
So, the moment you do that, you've

00:14:15.039 --> 00:14:18.240
actually defined this layer.

00:14:18.600 --> 00:14:23.519
But what you have not done

00:14:20.600 --> 00:14:25.440
is you have not told this layer what

00:14:23.519 --> 00:14:26.439
input it is going to get.

00:14:25.440 --> 00:14:28.440
Because as far as this layer is

00:14:26.440 --> 00:14:30.320
concerned, it doesn't know that this

00:14:28.440 --> 00:14:33.320
other layer exists.

00:14:30.320 --> 00:14:35.800
So, you need to connect them. Yes.

00:14:33.320 --> 00:14:38.079
Um do we need to define for the ReLU

00:14:35.799 --> 00:14:39.039
where the the bends are? Like where you

00:14:38.078 --> 00:14:41.319
take the max?

00:14:39.039 --> 00:14:44.159
>> No, with ReLU the bend is always at zero.

00:14:41.320 --> 00:14:44.160
Okay. Thank you.

00:14:45.559 --> 00:14:48.799
Okay?

00:14:47.320 --> 00:14:51.240
All right.

00:14:48.799 --> 00:14:53.399
So, that's what we have here.

00:14:51.240 --> 00:14:55.959
And then, what we do is we have to tell

00:14:53.399 --> 00:14:57.958
it that we want to feed this layer the

00:14:55.958 --> 00:15:00.239
output of the previous layer, so you

00:14:57.958 --> 00:15:02.000
feed it by taking whatever is coming out

00:15:00.240 --> 00:15:03.120
of this thing, which is called input,

00:15:02.000 --> 00:15:05.480
and you basically

00:15:03.120 --> 00:15:07.759
stick it in here.

00:15:05.480 --> 00:15:09.039
So, the moment you do that, boom, it's

00:15:07.759 --> 00:15:10.519
going to receive the input from the

00:15:09.039 --> 00:15:12.879
previous layer.

00:15:10.519 --> 00:15:15.000
And because this one's output needs to

00:15:12.879 --> 00:15:16.519
go to the final layer, you need to give

00:15:15.000 --> 00:15:17.919
a name to that output.

00:15:16.519 --> 00:15:19.360
So, you give it a name. I'm just calling

00:15:17.919 --> 00:15:20.559
it h for because it's coming out of the

00:15:19.360 --> 00:15:21.600
hidden layer.

00:15:20.559 --> 00:15:24.119
It's just a variable. You can call it

00:15:21.600 --> 00:15:24.120
anything you want.
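
Continuing the sketch from above, the spoken "Keras.layers.dense" is keras.layers.Dense in code, and calling the layer on the input tensor is what wires the two layers together:

# Hidden layer: 16 fully connected ReLU units, fed the tensor named "inputs" above
h = keras.layers.Dense(16, activation="relu", name="hidden")(inputs)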

00:15:25.000 --> 00:15:28.958
Now, what we do, we go to the final

00:15:26.360 --> 00:15:30.360
output layer.

00:15:28.958 --> 00:15:32.799
And this is what we use. The output

00:15:30.360 --> 00:15:34.720
layer is just another dense layer.

00:15:32.799 --> 00:15:36.279
That's why I use the word dense. But we

00:15:34.720 --> 00:15:37.800
say, "Hey, give me just one thing

00:15:36.279 --> 00:15:40.159
because I literally just need one

00:15:37.799 --> 00:15:41.919
unit here because I need to emit just

00:15:40.159 --> 00:15:44.120
one probability.

00:15:41.919 --> 00:15:46.639
And the activation I want to use is a

00:15:44.120 --> 00:15:46.639
sigmoid."

00:15:46.958 --> 00:15:50.399
Done.

00:15:48.720 --> 00:15:52.759
Okay?

00:15:50.399 --> 00:15:54.679
And once you do that, you

00:15:52.759 --> 00:15:57.838
have to feed it the input from the

00:15:54.679 --> 00:16:00.000
second layer. So, you stick an h here.

00:15:57.839 --> 00:16:01.400
Now you have connected the third and the

00:16:00.000 --> 00:16:03.039
second layers.

00:16:01.399 --> 00:16:04.720
And after you do that, you give a name

00:16:03.039 --> 00:16:06.399
to the output coming out of that. We'll

00:16:04.720 --> 00:16:07.360
just call it output. You can call it y,

00:16:06.399 --> 00:16:09.720
you can call it output, you can call it

00:16:07.360 --> 00:16:11.039
whatever you want.

00:16:09.720 --> 00:16:12.000
Okay? So, at this point, what we have

00:16:11.039 --> 00:16:14.399
done

00:16:12.000 --> 00:16:16.200
is we have mapped that picture into

00:16:14.399 --> 00:16:17.759
those three lines.

00:16:16.200 --> 00:16:19.400
That's it.

00:16:17.759 --> 00:16:20.759
Okay?

00:16:19.399 --> 00:16:22.519
But we aren't quite done yet. There's

00:16:20.759 --> 00:16:24.759
one little thing we have to do.

00:16:22.519 --> 00:16:27.919
So, what we have to do is we have to

00:16:24.759 --> 00:16:30.078
formally define a model so that Keras

00:16:27.919 --> 00:16:31.879
can just work with this model object. It

00:16:30.078 --> 00:16:33.199
can train it, it can evaluate it, it can

00:16:31.879 --> 00:16:35.759
use it for prediction and so on and so

00:16:33.200 --> 00:16:38.160
forth. So, we tell Keras, "Hey, uh

00:16:35.759 --> 00:16:40.039
create a model for me, Keras.model,

00:16:38.159 --> 00:16:41.600
and basically where the input is this

00:16:40.039 --> 00:16:42.480
thing here and the output is that thing

00:16:41.600 --> 00:16:43.800
there.

00:16:42.480 --> 00:16:45.879
And then the whole thing we'll just call

00:16:43.799 --> 00:16:48.559
it model."

00:16:45.879 --> 00:16:50.240
Okay? So, that's it.

00:16:48.559 --> 00:16:52.000
We are done. That is the whole model.

00:16:50.240 --> 00:16:53.680
It sounds really fancy, right? A

00:16:52.000 --> 00:16:56.600
neural model for heart disease

00:16:53.679 --> 00:16:58.599
prediction. That's pretty cool.

00:16:56.600 --> 00:17:00.360
Four lines.
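
Putting the pieces together, a self-contained sketch of those four lines (the exact Colab code is shown later; variable and layer names here are just illustrative):

from tensorflow import keras

inputs = keras.Input(shape=(29,), name="input")                          # input layer
h = keras.layers.Dense(16, activation="relu", name="hidden")(inputs)     # hidden layer
output = keras.layers.Dense(1, activation="sigmoid", name="output")(h)   # output layer
model = keras.Model(inputs=inputs, outputs=output)                       # wrap into a model

model.summary()  # should report 497 trainable parameters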

00:16:58.600 --> 00:17:02.839
And we will show how to train this model

00:17:00.360 --> 00:17:05.199
with real data and so on and so forth

00:17:02.839 --> 00:17:06.959
and use it for prediction after we

00:17:05.199 --> 00:17:08.759
switch gears and really get into some

00:17:06.959 --> 00:17:11.320
conceptual building blocks.

00:17:08.759 --> 00:17:11.319
Had a question.

00:17:13.799 --> 00:17:18.599
Can you define a custom activation

00:17:16.319 --> 00:17:21.039
function that is not in the list of

00:17:18.599 --> 00:17:22.319
Keras library? Yes.

00:17:21.039 --> 00:17:23.438
Yeah, you can. The question was,

00:17:22.319 --> 00:17:25.359
can you define a custom activation

00:17:23.439 --> 00:17:27.400
function? You totally can.

00:17:25.359 --> 00:17:30.279
Uh in fact, I mean, the kind of

00:17:27.400 --> 00:17:32.280
flexibility you have here is incredible.

00:17:30.279 --> 00:17:34.480
And this these innocent four lines

00:17:32.279 --> 00:17:36.399
unfortunately sort of hide the

00:17:34.480 --> 00:17:38.640
potential that's possible here, but I

00:17:36.400 --> 00:17:39.759
guarantee you in two to three weeks you

00:17:38.640 --> 00:17:41.440
folks will be thinking in building

00:17:39.759 --> 00:17:43.599
blocks like Legos.

00:17:41.440 --> 00:17:44.600
So, you know, I'm so

00:17:43.599 --> 00:17:46.079
happy when it happens. Students will

00:17:44.599 --> 00:17:47.319
come to my office hours and say, "You

00:17:46.079 --> 00:17:49.399
know, I want to create a network where I

00:17:47.319 --> 00:17:50.879
have a little network going up on top,

00:17:49.400 --> 00:17:52.240
one going in the bottom, then they meet

00:17:50.880 --> 00:17:54.160
in the middle, then they fork again,

00:17:52.240 --> 00:17:55.440
they split." I'm like, "Unbelievable."

00:17:54.160 --> 00:17:56.720
It's fantastic. And you're going to be

00:17:55.440 --> 00:17:58.720
doing this in two weeks, I guarantee

00:17:56.720 --> 00:18:00.319
you.

00:17:58.720 --> 00:18:01.880
Yeah, in the case of a multi-class

00:18:00.319 --> 00:18:04.159
classification problem, are the output

00:18:01.880 --> 00:18:05.320
nodes equal to the number of classes?

00:18:04.160 --> 00:18:07.400
Correct.

00:18:05.319 --> 00:18:09.279
So, we will come to that. This is binary

00:18:07.400 --> 00:18:10.880
classification. And the question is for

00:18:09.279 --> 00:18:12.960
multi-class classification, let's say

00:18:10.880 --> 00:18:14.960
you're trying to classify some input

00:18:12.960 --> 00:18:16.720
into one of 10 possibilities, we will

00:18:14.960 --> 00:18:18.840
have 10 outputs.

00:18:16.720 --> 00:18:20.360
But the way we define it is going to be

00:18:18.839 --> 00:18:21.879
using something called a softmax

00:18:20.359 --> 00:18:24.039
function, which we're going to cover on

00:18:21.880 --> 00:18:25.720
Monday.

00:18:24.039 --> 00:18:27.079
So, for now, we just live with binary

00:18:25.720 --> 00:18:29.120
classification.
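
As a sketch of what that Monday material will look like, assuming the same functional style as above and a hypothetical 10-class problem:

# Multi-class output: 10 units with softmax, emitting 10 probabilities that sum to 1
outputs = keras.layers.Dense(10, activation="softmax", name="output")(h)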

00:18:27.079 --> 00:18:29.119
Uh

00:18:29.159 --> 00:18:33.800
Is there a default activation method in

00:18:31.679 --> 00:18:35.400
Keras or you have to put something? Ah,

00:18:33.799 --> 00:18:37.079
that's a good question. I believe the

00:18:35.400 --> 00:18:39.200
default might be ReLUs for hidden

00:18:37.079 --> 00:18:40.678
layers, but I'm not 100% sure. Let's

00:18:39.200 --> 00:18:42.759
double-check that.
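
For reference, in current Keras versions the Dense layer's activation argument defaults to None, i.e. a linear (identity) activation rather than ReLU, so it is safest to pass the activation explicitly:

# keras.layers.Dense(units, activation=None, ...) is the default signature,
# so spell out the activation you want:
h = keras.layers.Dense(16, activation="relu")(inputs)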

00:18:40.679 --> 00:18:44.960
Uh

00:18:42.759 --> 00:18:47.240
Uh just to get a clearer understanding,

00:18:44.960 --> 00:18:50.000
when you said that beyond 16 when you

00:18:47.240 --> 00:18:52.240
tried working on those neurons, the

00:18:50.000 --> 00:18:53.279
performance uh worsened.

00:18:52.240 --> 00:18:54.919
So, that is where you were playing

00:18:53.279 --> 00:18:58.759
around with initially two and then maybe

00:18:54.919 --> 00:19:01.560
four and six and eight. Exactly. Right.

00:18:58.759 --> 00:19:01.559
Could you use the mic?

00:19:02.200 --> 00:19:05.880
Do we need to define each of the hidden

00:19:04.000 --> 00:19:08.200
layers when the model gets more complex,

00:19:05.880 --> 00:19:09.640
when we have more than one layer? Oh,

00:19:08.200 --> 00:19:11.159
like if you have like 25 layers?

00:19:09.640 --> 00:19:12.640
>> consolidate, yeah. Yeah, yeah, yeah. So,

00:19:11.159 --> 00:19:14.919
what we typically Good question. If you

00:19:12.640 --> 00:19:16.200
have let's say 100 layers, right? Uh do

00:19:14.919 --> 00:19:18.280
you actually have to type in

00:19:16.200 --> 00:19:19.759
each by hand and cut and paste? No. You

00:19:18.279 --> 00:19:20.839
can actually write a little loop which

00:19:19.759 --> 00:19:22.720
will just automatically create them for

00:19:20.839 --> 00:19:24.240
you.

00:19:22.720 --> 00:19:26.000
And so, basically what's going on is

00:19:24.240 --> 00:19:27.640
that this little output thing you see

00:19:26.000 --> 00:19:30.200
here, this variable,

00:19:27.640 --> 00:19:32.880
this output could be the result of a

00:19:30.200 --> 00:19:34.519
thousand layer network with all sorts of

00:19:32.880 --> 00:19:36.080
complicated transformations going on and

00:19:34.519 --> 00:19:38.200
then finally it pops up as a little

00:19:36.079 --> 00:19:39.678
thing called the output. And what Keras

00:19:38.200 --> 00:19:41.919
will do is it'll be like, "Okay, this

00:19:39.679 --> 00:19:43.759
model has this input and has this

00:19:41.919 --> 00:19:45.200
output, but boy, this output came from

00:19:43.759 --> 00:19:47.079
incredible transformations applied to

00:19:45.200 --> 00:19:48.159
the input." And Keras will process all

00:19:47.079 --> 00:19:49.759
that very easily for you. You don't have

00:19:48.159 --> 00:19:51.280
to worry about it.

00:19:49.759 --> 00:19:53.319
Right? It's really a beautiful example

00:19:51.279 --> 00:19:54.440
of the power of abstraction.

00:19:53.319 --> 00:19:55.200
And you will you will see that as we go

00:19:54.440 --> 00:19:56.880
along.

00:19:55.200 --> 00:19:58.640
Okay. So,

00:19:56.880 --> 00:20:00.040
now let's switch gears and say once

00:19:58.640 --> 00:20:01.840
you've written a model like that in

00:20:00.039 --> 00:20:04.240
Keras, how do you actually train it?

00:20:01.839 --> 00:20:05.839
Okay? Now, training is something you've

00:20:04.240 --> 00:20:06.880
been doing a lot, right? So, for

00:20:05.839 --> 00:20:08.720
example, when you have something like

00:20:06.880 --> 00:20:09.800
linear regression, right? Where you have

00:20:08.720 --> 00:20:12.039
all these coefficients you need to

00:20:09.799 --> 00:20:14.039
estimate, you have this model, then you

00:20:12.039 --> 00:20:16.680
have a bunch of data, then you run it

00:20:14.039 --> 00:20:18.559
through something like LM if you use R,

00:20:16.680 --> 00:20:20.480
and what it gives you is actual values

00:20:18.559 --> 00:20:22.559
for these coefficients, right? 2.8, 0.9,

00:20:20.480 --> 00:20:23.880
and so on and so forth. So, the the role

00:20:22.559 --> 00:20:25.399
of the data is to give you the

00:20:23.880 --> 00:20:26.560
coefficients.

00:20:25.400 --> 00:20:28.280
Right? Or you can think of the

00:20:26.559 --> 00:20:30.319
coefficients as really a compressed

00:20:28.279 --> 00:20:31.759
version of the data.

00:20:30.319 --> 00:20:33.799
Okay? Similarly, if you do logistic

00:20:31.759 --> 00:20:35.359
regression, you have a model like that,

00:20:33.799 --> 00:20:37.240
you add some data, you run it through

00:20:35.359 --> 00:20:40.479
some estimation routine like GLM or

00:20:37.240 --> 00:20:42.079
scikit-learn or statsmodels, pick your

00:20:40.480 --> 00:20:43.680
favorite tool, then you'll come up with

00:20:42.079 --> 00:20:45.919
something like that. So, basically

00:20:43.680 --> 00:20:47.519
what's going on here is training simply

00:20:45.920 --> 00:20:49.640
means find the values of the

00:20:47.519 --> 00:20:51.839
coefficients so that the model's

00:20:49.640 --> 00:20:54.680
predictions are as close to the actual

00:20:51.839 --> 00:20:57.559
values as possible. That's it. Okay? And

00:20:54.680 --> 00:20:59.519
so and to find the one that is as close

00:20:57.559 --> 00:21:01.519
to the actual value as possible, a whole

00:20:59.519 --> 00:21:02.200
bunch of optimization is involved. You

00:21:01.519 --> 00:21:03.079
didn't have to worry about the

00:21:02.200 --> 00:21:05.200
optimization when you did the

00:21:03.079 --> 00:21:07.039
regression, linear or logistic, because

00:21:05.200 --> 00:21:08.840
it's all done under the hood for you,

00:21:07.039 --> 00:21:10.879
but for neural networks, we actually get

00:21:08.839 --> 00:21:12.919
to know how it's done.

00:21:10.880 --> 00:21:15.800
Okay, because it's important.

00:21:12.920 --> 00:21:18.279
Okay. So, training a neural network, a

00:21:15.799 --> 00:21:19.680
deep neural network, even GPT-4, it's

00:21:18.279 --> 00:21:21.000
basically the same process as what you

00:21:19.680 --> 00:21:23.320
do for regression.

00:21:21.000 --> 00:21:24.480
Right? It's basically just a very

00:21:23.319 --> 00:21:26.679
complicated function with lots of

00:21:24.480 --> 00:21:28.160
parameters, but ultimately you have a

00:21:26.680 --> 00:21:29.960
network with all these question marks,

00:21:28.160 --> 00:21:32.960
you add some data, you do some training,

00:21:29.960 --> 00:21:32.960
and boom, you get some numbers.

00:21:36.200 --> 00:21:40.480
You may get into this, but are we

00:21:38.279 --> 00:21:43.079
determining the architecture of the

00:21:40.480 --> 00:21:45.319
network before we train it?

00:21:43.079 --> 00:21:46.720
Okay. Yes, because if you don't define

00:21:45.319 --> 00:21:49.279
the architecture,

00:21:46.720 --> 00:21:51.200
um Keras doesn't know how to actually

00:21:49.279 --> 00:21:53.279
calculate the output.

00:21:51.200 --> 00:21:55.880
Given an input. And unless it knows

00:21:53.279 --> 00:21:58.119
input-output pairs, it can't do anything

00:21:55.880 --> 00:22:00.400
more with it.

00:21:58.119 --> 00:22:02.039
Okay. So, um

00:22:00.400 --> 00:22:04.080
so the essence of training is to find

00:22:02.039 --> 00:22:05.440
the best values for the weights and

00:22:04.079 --> 00:22:07.559
biases.

00:22:05.440 --> 00:22:09.440
And the way we think of the best values

00:22:07.559 --> 00:22:11.919
is that we basically set up a little

00:22:09.440 --> 00:22:14.400
function, and this function measures the

00:22:11.920 --> 00:22:16.759
discrepancy between the actual and the

00:22:14.400 --> 00:22:19.640
predicted values. Okay? And I use the

00:22:16.759 --> 00:22:20.960
word discrepancy because the way you

00:22:19.640 --> 00:22:22.320
define discrepancy, there's an

00:22:20.960 --> 00:22:23.279
incredible amount of creativity in the

00:22:22.319 --> 00:22:25.000
field.

00:22:23.279 --> 00:22:27.039
In fact, a lot of breakthroughs in deep

00:22:25.000 --> 00:22:29.519
learning come because people define a

00:22:27.039 --> 00:22:31.079
very clever measure of discrepancy, and

00:22:29.519 --> 00:22:33.039
then turns out it actually gives you all

00:22:31.079 --> 00:22:34.279
sorts of interesting behavior. Okay?

00:22:33.039 --> 00:22:35.879
That's why I use the word discrepancy as

00:22:34.279 --> 00:22:37.399
opposed to the word error, because when

00:22:35.880 --> 00:22:39.960
I say error, you might be just thinking

00:22:37.400 --> 00:22:42.240
something like predicted minus actual.

00:22:39.960 --> 00:22:43.600
That's too limiting.

00:22:42.240 --> 00:22:45.120
Prediction minus actual is too limiting,

00:22:43.599 --> 00:22:48.079
that's why I use the word discrepancy.

00:22:45.119 --> 00:22:49.439
So, we basically define a function

00:22:48.079 --> 00:22:50.639
that captures the discrepancy between

00:22:49.440 --> 00:22:53.000
the actual and the predicted

00:22:50.640 --> 00:22:54.759
values, and these functions are called

00:22:53.000 --> 00:22:55.759
loss functions in the deep learning

00:22:54.759 --> 00:22:58.039
world.

00:22:55.759 --> 00:23:00.200
And every paper that you read, you will

00:22:58.039 --> 00:23:02.519
find interesting loss functions. There

00:23:00.200 --> 00:23:03.920
are hundreds of loss functions, enormous

00:23:02.519 --> 00:23:05.920
research creativity goes into defining

00:23:03.920 --> 00:23:08.519
these loss functions. Okay?

00:23:05.920 --> 00:23:10.039
All right. So, these are loss functions.

00:23:08.519 --> 00:23:12.440
And so a loss function is a function

00:23:10.039 --> 00:23:14.119
that quantifies a discrepancy. So, let's

00:23:12.440 --> 00:23:16.679
say the predictions are really close to

00:23:14.119 --> 00:23:19.039
the actual values, the loss would be

00:23:16.679 --> 00:23:20.720
what?

00:23:19.039 --> 00:23:23.279
It's close to zero. It's close to zero.

00:23:20.720 --> 00:23:26.240
Close to zero. Right? Very small.

00:23:23.279 --> 00:23:27.519
And if if you have a perfect model,

00:23:26.240 --> 00:23:28.799
perfect crystal ball, what would the

00:23:27.519 --> 00:23:30.039
loss be?

00:23:28.799 --> 00:23:32.839
Exactly zero.

00:23:30.039 --> 00:23:35.599
Right? Exactly zero. So, in linear

00:23:32.839 --> 00:23:37.759
regression, the loss function we use

00:23:35.599 --> 00:23:39.159
is called sum of squared errors.

00:23:37.759 --> 00:23:40.640
We didn't call it loss function because

00:23:39.160 --> 00:23:42.200
we were not doing deep learning, just

00:23:40.640 --> 00:23:45.120
linear regression, but that's basically

00:23:42.200 --> 00:23:47.200
the loss function. Right? So,

00:23:45.119 --> 00:23:49.000
the loss function we use must be

00:23:47.200 --> 00:23:51.200
matched very properly with the kind of

00:23:49.000 --> 00:23:53.200
output we have.

00:23:51.200 --> 00:23:55.200
Right? So, if your output is a number

00:23:53.200 --> 00:23:57.480
like 23, right? You're trying to predict

00:23:55.200 --> 00:24:00.319
demand like a product demand for next

00:23:57.480 --> 00:24:02.120
week for a particular product, and uh

00:24:00.319 --> 00:24:03.439
predicted value is 23, the actual value

00:24:02.119 --> 00:24:05.879
is 21,

00:24:03.440 --> 00:24:09.120
it's okay to do 23 minus 21, two as a

00:24:05.880 --> 00:24:11.640
discrepancy, right? The error. Okay? But

00:24:09.119 --> 00:24:13.439
for other kinds of outputs, it's not so

00:24:11.640 --> 00:24:14.800
obvious what the correct loss function

00:24:13.440 --> 00:24:18.160
is, what the correct measure of

00:24:14.799 --> 00:24:20.799
discrepancy is. And so here,

00:24:18.160 --> 00:24:21.759
for the simple case of regression,

00:24:20.799 --> 00:24:23.759
right? Um

00:24:21.759 --> 00:24:26.119
the y^(i), the i here, by the way, is a

00:24:23.759 --> 00:24:29.000
superscript which stands for the ith

00:24:26.119 --> 00:24:31.079
data point, the ith data point. So, what

00:24:29.000 --> 00:24:33.519
I'm saying is that okay, for the ith

00:24:31.079 --> 00:24:36.119
data point, this is the actual value, Y,

00:24:33.519 --> 00:24:39.000
and this is what the model predicted.

00:24:36.119 --> 00:24:41.079
Okay? I take the difference, square it,

00:24:39.000 --> 00:24:43.119
and once I square it for each point, I

00:24:41.079 --> 00:24:45.759
just average all these numbers to get an

00:24:43.119 --> 00:24:48.239
average squared error, i.e. mean squared

00:24:45.759 --> 00:24:50.960
error, MSE. So, this is sort of like the

00:24:48.240 --> 00:24:52.240
easiest loss function.

00:24:50.960 --> 00:24:55.000
Okay?

00:24:52.240 --> 00:24:57.120
Now, let's crank it up a notch.

00:24:55.000 --> 00:24:59.759
In the heart disease example, the heart

00:24:57.119 --> 00:25:01.678
disease neural prediction model,

00:24:59.759 --> 00:25:03.440
the prediction is a number between zero

00:25:01.679 --> 00:25:04.759
and one, right? It's because it's coming

00:25:03.440 --> 00:25:07.720
out of the sigmoid.

00:25:04.759 --> 00:25:09.799
It's a fraction. The actual output is a

00:25:07.720 --> 00:25:11.120
zero or one, one of the two, right? It's

00:25:09.799 --> 00:25:12.720
binary.

00:25:11.119 --> 00:25:14.039
So, how would we compare the

00:25:12.720 --> 00:25:16.640
discrepancy? How would we measure the

00:25:14.039 --> 00:25:18.839
discrepancy between a fraction and the

00:25:16.640 --> 00:25:21.080
numbers zero and one? Right? What is the

00:25:18.839 --> 00:25:22.879
good loss function in this situation?

00:25:21.079 --> 00:25:26.000
Right? That is the key question. So, let's

00:25:22.880 --> 00:25:28.640
build some intuition around this.

00:25:26.000 --> 00:25:31.200
And let's see if my little daisy chain

00:25:28.640 --> 00:25:32.480
iPad thing works.

00:25:31.200 --> 00:25:34.160
I'm doing it on the iPad so that people

00:25:32.480 --> 00:25:35.200
on the live stream can see it, otherwise

00:25:34.160 --> 00:25:37.040
the blackboard is a little tough for

00:25:35.200 --> 00:25:41.039
them.

00:25:37.039 --> 00:25:43.159
Okay. So, let's have a situation here.

00:25:41.039 --> 00:25:45.039
Okay? So, let's say that you

00:25:43.160 --> 00:25:47.000
have a patient who comes in, and let's

00:25:45.039 --> 00:25:50.240
say they have heart disease. Okay? So,

00:25:47.000 --> 00:25:51.960
for that patient, Y equals one.

00:25:50.240 --> 00:25:55.920
Right? The true value is one for that

00:25:51.960 --> 00:25:59.840
patient. And now you have this model.

00:25:55.920 --> 00:26:03.480
Okay? And this is the predicted

00:25:59.839 --> 00:26:03.480
probability from this model.

00:26:04.480 --> 00:26:07.480
Can people see my

00:26:05.960 --> 00:26:08.279
handwriting okay?

00:26:07.480 --> 00:26:11.200
Good.

00:26:08.279 --> 00:26:13.359
I could never be a doctor, right? So.

00:26:11.200 --> 00:26:14.279
So, zero, okay? One, it's going to be

00:26:13.359 --> 00:26:15.479
between zero and one because it's

00:26:14.279 --> 00:26:17.079
probability.

00:26:15.480 --> 00:26:19.079
And then this is the loss we want to

00:26:17.079 --> 00:26:21.759
sort of have, right? This is the loss.

00:26:19.079 --> 00:26:23.839
So, this patient actually had

00:26:21.759 --> 00:26:25.240
heart disease, Y equals one. So, let's

00:26:23.839 --> 00:26:26.919
say that the predicted probability is

00:26:25.240 --> 00:26:28.279
pretty close to one.

00:26:26.920 --> 00:26:29.759
Okay? What do you think the loss should

00:26:28.279 --> 00:26:30.879
be?

00:26:29.759 --> 00:26:32.799
Small.

00:26:30.880 --> 00:26:34.080
Close to zero.

00:26:32.799 --> 00:26:36.480
Sorry?

00:26:34.079 --> 00:26:38.480
Close to zero, exactly. So, here, if the

00:26:36.480 --> 00:26:40.599
prediction comes here, you want the loss

00:26:38.480 --> 00:26:42.279
to be somewhere

00:26:40.599 --> 00:26:44.000
here.

00:26:42.279 --> 00:26:45.599
But if the predicted probability is

00:26:44.000 --> 00:26:47.079
pretty close to zero, even though the

00:26:45.599 --> 00:26:49.319
patient actually has heart disease, what

00:26:47.079 --> 00:26:50.678
do you want the loss to be?

00:26:49.319 --> 00:26:52.599
Really high.

00:26:50.679 --> 00:26:53.720
Because it's screwing up badly, right?

00:26:52.599 --> 00:26:55.319
So, you want the loss to be somewhere

00:26:53.720 --> 00:26:57.440
here.

00:26:55.319 --> 00:27:00.359
So, basically you want a function that's

00:26:57.440 --> 00:27:00.360
kind of like that.

00:27:00.759 --> 00:27:04.319
Right? You want the loss function shape

00:27:02.319 --> 00:27:05.519
to be like that.

00:27:04.319 --> 00:27:07.039
High values of probability should have

00:27:05.519 --> 00:27:08.799
low losses, low values of probability

00:27:07.039 --> 00:27:10.759
should have high losses. Yeah.

00:27:08.799 --> 00:27:12.279
I understand like why it has to be

00:27:10.759 --> 00:27:14.480
increasing or decreasing, but can you

00:27:12.279 --> 00:27:16.279
explain why it has to be curved? Yeah, yeah. So,

00:27:14.480 --> 00:27:18.279
it can be linear, it can certainly be

00:27:16.279 --> 00:27:21.678
linear, but basically what you want to

00:27:18.279 --> 00:27:23.960
do is the more it makes a mistake, the

00:27:21.679 --> 00:27:25.920
more harshly you want to penalize it.

00:27:23.960 --> 00:27:27.720
Right? So, basically what

00:27:25.920 --> 00:27:29.120
what you really want is something where

00:27:27.720 --> 00:27:31.880
if it basically says this person's

00:27:29.119 --> 00:27:33.199
probability, the

00:27:31.880 --> 00:27:34.560
predicted probability, is say one

00:27:33.200 --> 00:27:35.960
over a million,

00:27:34.559 --> 00:27:37.919
basically close to zero, you want the

00:27:35.960 --> 00:27:39.480
loss to be like super high.

00:27:37.920 --> 00:27:41.200
So that it's like a

00:27:39.480 --> 00:27:42.440
huge rap on the knuckles for the model.

00:27:41.200 --> 00:27:43.880
Don't do that.

00:27:42.440 --> 00:27:45.519
That's basically what we're doing, and

00:27:43.880 --> 00:27:47.400
I'm sort of demonstrating that dynamic

00:27:45.519 --> 00:27:49.559
by using a very curved and steep loss

00:27:47.400 --> 00:27:50.960
function.

00:27:49.559 --> 00:27:52.799
But you can absolutely use a linear

00:27:50.960 --> 00:27:54.759
function, it's totally fine. It won't be

00:27:52.799 --> 00:27:56.000
as effective for gradient descent later

00:27:54.759 --> 00:27:57.799
on with a bunch of bunch of technical

00:27:56.000 --> 00:27:59.359
details.

00:27:57.799 --> 00:28:01.440
Are we good with this?

00:27:59.359 --> 00:28:03.919
All right. So, now let's look at the

00:28:01.440 --> 00:28:05.039
case where a patient does not have heart

00:28:03.920 --> 00:28:06.720
disease.

00:28:05.039 --> 00:28:09.000
Y equals zero.

00:28:06.720 --> 00:28:11.920
Same setup, okay?

00:28:09.000 --> 00:28:15.279
Predicted probability,

00:28:11.920 --> 00:28:18.360
zero, one, loss.

00:28:15.279 --> 00:28:20.440
So, for this patient, they don't

00:28:18.359 --> 00:28:22.240
have

00:28:20.440 --> 00:28:24.559
heart disease. If the

00:28:22.240 --> 00:28:26.200
probability is close to zero, what

00:28:24.559 --> 00:28:27.279
should the loss be?

00:28:26.200 --> 00:28:28.720
Close to zero. It should be somewhere

00:28:27.279 --> 00:28:31.079
here, right?

00:28:28.720 --> 00:28:32.440
And the more and more the probability

00:28:31.079 --> 00:28:34.359
gets closer and closer to one, you want

00:28:32.440 --> 00:28:36.120
to penalize it very heavily, which means

00:28:34.359 --> 00:28:37.559
you want the loss to be somewhere here.

00:28:36.119 --> 00:28:39.239
So, you basically want a loss ideally

00:28:37.559 --> 00:28:42.158
that's kind of going up like that and

00:28:39.240 --> 00:28:43.200
climbing higher and higher.

00:28:42.159 --> 00:28:44.640
Are we good?

00:28:43.200 --> 00:28:46.919
Okay, perfect.

00:28:44.640 --> 00:28:48.919
Because we have a perfect loss function

00:28:46.919 --> 00:28:51.360
for that.

00:28:48.919 --> 00:28:53.040
So, just a recap.

00:28:51.359 --> 00:28:54.799
Right? This is what we want.

00:28:53.039 --> 00:28:56.799
For points with Y equals

00:28:54.799 --> 00:28:58.359
one, lower predictions should

00:28:56.799 --> 00:29:02.000
have higher loss. You want something

00:28:58.359 --> 00:29:03.519
like that. And then turns out

00:29:02.000 --> 00:29:04.640
there's a very simple little loss

00:29:03.519 --> 00:29:05.918
function

00:29:04.640 --> 00:29:07.880
which just literally just uses the

00:29:05.919 --> 00:29:09.840
logarithm, which will get the job done.

00:29:07.880 --> 00:29:13.159
So, what you do is you literally do

00:29:09.839 --> 00:29:15.399
minus log of the predicted probability.

00:29:13.159 --> 00:29:16.520
That's it. And that thing has exactly

00:29:15.400 --> 00:29:17.919
that shape.

00:29:16.519 --> 00:29:20.039
Okay? And in fact, you can see it

00:29:17.919 --> 00:29:22.840
numerically. So, if the loss is one,

00:29:20.039 --> 00:29:24.720
it's zero. If it's half, it's 1.0. And

00:29:22.839 --> 00:29:26.599
if it's like one over 1,000, it's almost

00:29:24.720 --> 00:29:27.319
10. If it's one over 10,000, it's going

00:29:26.599 --> 00:29:30.359
to be like

00:29:27.319 --> 00:29:32.519
much higher, right? Very high losses.

00:29:30.359 --> 00:29:34.479
Okay? So, minus log probability, boom,

00:29:32.519 --> 00:29:36.639
done.

00:29:34.480 --> 00:29:38.919
Similarly, this is what we want for

00:29:36.640 --> 00:29:42.400
patients for whom Y equals zero.

00:29:38.919 --> 00:29:44.520
And it turns out if you do minus log of one

00:29:42.400 --> 00:29:46.960
minus predicted probability, it does the

00:29:44.519 --> 00:29:46.960
same thing.

00:29:47.880 --> 00:29:50.160
Okay?

00:29:50.759 --> 00:29:54.640
Mathematicians once again saved us with a

00:29:52.160 --> 00:29:54.640
logarithm.

00:29:54.680 --> 00:29:58.560
So, see in summary

00:29:56.920 --> 00:30:00.400
this is what we have.

00:29:58.559 --> 00:30:01.599
Right? For data points where y equals 1,

00:30:00.400 --> 00:30:03.960
we have this. Data points where y equals

00:30:01.599 --> 00:30:05.919
0, we have this. But, it feels a little

00:30:03.960 --> 00:30:07.279
inelegant

00:30:05.920 --> 00:30:08.400
to say, "Well, if it's y equals 1, I

00:30:07.279 --> 00:30:09.599
want to use this. If y equals 0, I want

00:30:08.400 --> 00:30:11.280
to use that."

00:30:09.599 --> 00:30:12.759
Right? There's There's like an if-then

00:30:11.279 --> 00:30:14.639
thing going on here. And I don't know

00:30:12.759 --> 00:30:15.640
about you folks, but if-then really irks

00:30:14.640 --> 00:30:17.320
me

00:30:15.640 --> 00:30:19.600
mathematically because you can't do

00:30:17.319 --> 00:30:20.279
derivatives and so on very easily.

00:30:19.599 --> 00:30:22.919
Okay?

00:30:20.279 --> 00:30:24.879
But, no worries. This is MIT. We know we

00:30:22.920 --> 00:30:26.720
have our bag of math tricks.

00:30:24.880 --> 00:30:28.680
So, what we do is

00:30:26.720 --> 00:30:30.519
we can actually combine them both into a

00:30:28.680 --> 00:30:32.600
single expression.

00:30:30.519 --> 00:30:35.079
Okay? Like this.

00:30:32.599 --> 00:30:37.000
Okay? And here the yi again is the label of the ith

00:30:35.079 --> 00:30:38.399
data point. Remember, yi is either 1 or

00:30:37.000 --> 00:30:40.359
0 always.

00:30:38.400 --> 00:30:43.360
And this model of xi is the predicted

00:30:40.359 --> 00:30:45.679
probability. Okay? So,

00:30:43.359 --> 00:30:48.439
and I've just taken the minus from the minus log

00:30:45.680 --> 00:30:50.680
and I've just moved it here.

00:30:48.440 --> 00:30:52.680
Okay? And I've taken the the minus that

00:30:50.680 --> 00:30:54.640
was here and just moved it here. Okay?

00:30:52.680 --> 00:30:57.080
That's why you see it like this.

00:30:54.640 --> 00:30:58.560
So, this one is basically

00:30:57.079 --> 00:30:59.960
you can convince yourself what

00:30:58.559 --> 00:31:01.359
happens. This single expression will get

00:30:59.960 --> 00:31:04.039
the job done. So, let's say there is a

00:31:01.359 --> 00:31:05.559
patient for whom y equals 1.

00:31:04.039 --> 00:31:07.799
What's going to happen is that when you

00:31:05.559 --> 00:31:10.519
plug in y equals 1, this becomes 0. The

00:31:07.799 --> 00:31:12.559
whole thing will collapse to 0.

00:31:10.519 --> 00:31:14.319
While here, y equals 1 just means it

00:31:12.559 --> 00:31:16.879
becomes minus log probability, which is

00:31:14.319 --> 00:31:16.879
what we want.

00:31:17.640 --> 00:31:22.120
Conversely, if y equals 0, this whole

00:31:20.200 --> 00:31:23.720
thing is going to disappear.

00:31:22.119 --> 00:31:25.919
And this thing becomes 1 minus 0, which

00:31:23.720 --> 00:31:27.559
is just 1. And so, it becomes minus log

00:31:25.920 --> 00:31:29.680
1 minus probability, which is again what

00:31:27.559 --> 00:31:32.000
we want.

00:31:29.680 --> 00:31:34.720
Simple and neat, right?

00:31:32.000 --> 00:31:36.799
So, in one expression, we have defined

00:31:34.720 --> 00:31:39.360
the perfect loss. No if-thens, none of

00:31:36.799 --> 00:31:39.359
that crap.
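A quick sketch of that single expression in Python (my own illustration, not code from the lecture; using the natural log here, since the base only rescales the numbers):

    import math

    def per_point_loss(y, p):
        # Single expression: -[ y*log(p) + (1 - y)*log(1 - p) ]
        # When y == 1 the second term vanishes; when y == 0 the first term vanishes.
        return -(y * math.log(p) + (1 - y) * math.log(1 - p))

    print(per_point_loss(1, 0.99))   # confident and correct: tiny loss (~0.01)
    print(per_point_loss(1, 0.001))  # confident but wrong: large loss (~6.9)
    print(per_point_loss(0, 0.01))   # y = 0 and probability near 0: tiny loss (~0.01)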

00:31:39.519 --> 00:31:44.079
Good. So, now what we do is that was

00:31:42.200 --> 00:31:45.160
true for every data point.

00:31:44.079 --> 00:31:47.799
But, we obviously have lots of data

00:31:45.160 --> 00:31:50.560
points. So, we just add them all up and

00:31:47.799 --> 00:31:51.919
take the average.

00:31:50.559 --> 00:31:53.519
That's it. We average across all the

00:31:51.920 --> 00:31:55.440
data points we have. So, that we get an

00:31:53.519 --> 00:31:57.119
average loss.

00:31:55.440 --> 00:31:58.679
Okay?

00:31:57.119 --> 00:32:01.239
We call this the binary cross entropy

00:31:58.679 --> 00:32:01.240
loss function.
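As a minimal NumPy sketch of that averaging step (my own illustration; in Keras this is available as the built-in 'binary_crossentropy' loss):

    import numpy as np

    def binary_cross_entropy(y_true, p_pred):
        # y_true: 0/1 labels; p_pred: predicted probabilities, one per data point
        y = np.asarray(y_true, dtype=float)
        p = np.asarray(p_pred, dtype=float)
        per_point = -(y * np.log(p) + (1 - y) * np.log(1 - p))
        return per_point.mean()   # average the per-point losses over the dataset

    print(binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.7]))  # roughly 0.23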

00:32:06.640 --> 00:32:11.440
Is there a way you can um edit the loss

00:32:08.920 --> 00:32:13.560
function so that you penalize like false

00:32:11.440 --> 00:32:15.679
negatives more strongly than false positives?

00:32:13.559 --> 00:32:17.279
>> you can do all of them. Great question.

00:32:15.679 --> 00:32:19.160
Uh I'm just looking at the basic case

00:32:17.279 --> 00:32:21.720
where it's a symmetric

00:32:19.160 --> 00:32:23.240
loss. Um you can actually penalize

00:32:21.720 --> 00:32:25.200
overestimates much more than

00:32:23.240 --> 00:32:26.759
underestimates and things like that.

00:32:25.200 --> 00:32:28.160
Um and if you're curious, you can just

00:32:26.759 --> 00:32:30.599
Google something called the pinball

00:32:28.160 --> 00:32:30.600
loss.
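For the curious, here is a rough sketch of the pinball (quantile) loss mentioned here, which penalizes errors asymmetrically; the function and the choice tau = 0.9 are my own illustration:

    def pinball_loss(y_true, y_pred, tau=0.9):
        # tau in (0, 1): with tau = 0.9, under-predictions cost 9x more than over-predictions
        error = y_true - y_pred
        return tau * error if error >= 0 else (tau - 1) * error

    print(pinball_loss(1.0, 0.2))  # under-prediction: 0.9 * 0.8 = 0.72
    print(pinball_loss(0.2, 1.0))  # over-prediction:  0.1 * 0.8 = 0.08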

00:32:31.519 --> 00:32:34.440
Okay?

00:32:32.599 --> 00:32:36.359
Any other questions on this?

00:32:34.440 --> 00:32:38.120
So, when you see this massive deep

00:32:36.359 --> 00:32:39.959
neural network built by Google for doing

00:32:38.119 --> 00:32:41.839
something or the other, if it's a binary

00:32:39.960 --> 00:32:44.079
classification problem, chances are

00:32:41.839 --> 00:32:45.119
they're using this thing.

00:32:44.079 --> 00:32:45.960
Okay?

00:32:45.119 --> 00:32:48.159
All right.

00:32:45.960 --> 00:32:49.840
So, now let's figure out how to minimize

00:32:48.160 --> 00:32:50.800
these loss functions because the name of

00:32:49.839 --> 00:32:52.199
the game

00:32:50.799 --> 00:32:54.839
is to find a way to minimize these loss

00:32:52.200 --> 00:32:56.880
functions. So, now loss functions are

00:32:54.839 --> 00:32:59.279
just a particular kind of function. So,

00:32:56.880 --> 00:33:02.000
we'll first consider the general problem

00:32:59.279 --> 00:33:02.759
of minimizing some arbitrary function.

00:33:02.000 --> 00:33:03.720
Okay?

00:33:02.759 --> 00:33:05.160
And once we develop a little bit of

00:33:03.720 --> 00:33:07.400
intuition about that, we'll return to

00:33:05.160 --> 00:33:09.920
the specific task of minimizing loss

00:33:07.400 --> 00:33:09.920
functions.

00:33:12.240 --> 00:33:14.920
How's everyone doing?

00:33:15.240 --> 00:33:18.480
Yes, no, good, bad?

00:33:18.679 --> 00:33:23.240
You have a bit of a

00:33:20.480 --> 00:33:24.960
like a tough-to-interpret head shake.

00:33:23.240 --> 00:33:26.559
It's more like um I kind of lost you

00:33:24.960 --> 00:33:28.400
where you said that the loss function

00:33:26.559 --> 00:33:30.119
and the predicted probability

00:33:28.400 --> 00:33:31.560
uh how were they inversely related? Because my

00:33:30.119 --> 00:33:33.839
understanding was that the loss function

00:33:31.559 --> 00:33:35.200
is supposed to be the sum of errors.

00:33:33.839 --> 00:33:36.159
We're averaging the errors. And when you

00:33:35.200 --> 00:33:37.360
said the heart patient

00:33:36.160 --> 00:33:38.880
>> Sorry, sorry. Let me Let me just stop

00:33:37.359 --> 00:33:41.240
there for a second.

00:33:38.880 --> 00:33:42.640
For each point, you define the loss.

00:33:41.240 --> 00:33:44.400
That's the whole point of the game. And

00:33:42.640 --> 00:33:46.640
once you define it, you calculate for

00:33:44.400 --> 00:33:49.440
every point and average it, right? So,

00:33:46.640 --> 00:33:50.960
just focus on a single data point.

00:33:49.440 --> 00:33:53.000
And so, now continue.

00:33:50.960 --> 00:33:56.160
So, now when the heart patient has... There

00:33:53.000 --> 00:33:58.240
is more probability that they... No. So,

00:33:56.160 --> 00:34:00.400
when there is a person who has the heart

00:33:58.240 --> 00:34:02.759
uh disease, you said that you want the

00:34:00.400 --> 00:34:03.960
loss function to be high.

00:34:02.759 --> 00:34:06.440
I think I'm going back to the graph.

00:34:03.960 --> 00:34:08.159
>> You want the loss function to be high if

00:34:06.440 --> 00:34:09.878
I'm predicting that they basically don't

00:34:08.159 --> 00:34:12.079
have heart disease.

00:34:09.878 --> 00:34:13.960
If the prediction is close to 0,

00:34:12.079 --> 00:34:16.878
the predicted probability is close to 0,

00:34:13.960 --> 00:34:18.519
then I'm badly wrong.

00:34:16.878 --> 00:34:19.918
Because in reality, they do have heart

00:34:18.519 --> 00:34:21.039
disease.

00:34:19.918 --> 00:34:23.199
And that's why I want the loss to be

00:34:21.039 --> 00:34:25.519
really high. Okay, so effectively, loss

00:34:23.199 --> 00:34:28.678
is my way of finding out how good my

00:34:25.519 --> 00:34:31.159
model is instead of saying, "Okay." Or

00:34:28.679 --> 00:34:33.119
rather, how bad your model is. Yeah.

00:34:31.159 --> 00:34:34.760
Right? How bad is it? That's really what

00:34:33.119 --> 00:34:37.279
the loss function is. Got it.

00:34:34.760 --> 00:34:39.960
>> And you want to minimize badness.

00:34:37.280 --> 00:34:41.560
That's the whole point of optimization.

00:34:39.960 --> 00:34:43.800
Okay.

00:34:41.559 --> 00:34:45.119
Um I guess I don't have a fully like

00:34:43.800 --> 00:34:46.800
similar to the point where I said but I

00:34:45.119 --> 00:34:48.839
don't have a fully clear intuition of

00:34:46.800 --> 00:34:50.440
why exactly a log function rather than

00:34:48.840 --> 00:34:53.320
something that say

00:34:50.440 --> 00:34:55.519
flatter for small and then really steep

00:34:53.320 --> 00:34:57.640
later. Those are all fantastic things.

00:34:55.519 --> 00:35:00.719
You can totally do it. Uh the reason we

00:34:57.639 --> 00:35:02.759
picked this loss function because, A,

00:35:00.719 --> 00:35:04.079
it's easy to work with. It has good

00:35:02.760 --> 00:35:06.160
gradients. It's well-behaved

00:35:04.079 --> 00:35:07.799
mathematically. But, there are many

00:35:06.159 --> 00:35:09.399
alternatives to it. I don't want you to

00:35:07.800 --> 00:35:11.720
think that this is like the only game in

00:35:09.400 --> 00:35:13.760
town or it's the only choice for us. We

00:35:11.719 --> 00:35:15.919
have many choices. This really

00:35:13.760 --> 00:35:17.320
happens to be a very easy choice, which

00:35:15.920 --> 00:35:18.960
also happens to be empirically very

00:35:17.320 --> 00:35:20.480
effective.

00:35:18.960 --> 00:35:22.840
And I'm happy to give you pointers to

00:35:20.480 --> 00:35:26.000
other crazy loss functions, right? Which

00:35:22.840 --> 00:35:26.000
can actually do all these things, too.

00:35:26.800 --> 00:35:29.120
Okay?

00:35:30.400 --> 00:35:34.440
All right. So, uh minimizing a single

00:35:32.440 --> 00:35:36.559
variable function, we will warm up by

00:35:34.440 --> 00:35:38.358
looking at this little function here.

00:35:36.559 --> 00:35:41.639
Okay? Which is a

00:35:38.358 --> 00:35:41.639
What do you call a fourth power?

00:35:41.840 --> 00:35:45.519
What? Quartic, right? Yeah, thank you.

00:35:43.679 --> 00:35:47.599
Quartic. So, yeah, it's a quartic

00:35:45.519 --> 00:35:50.000
function. Um

00:35:47.599 --> 00:35:51.639
right? And this is what it looks like.

00:35:50.000 --> 00:35:53.199
But, you can see there is like a minimum

00:35:51.639 --> 00:35:54.679
somewhere here, right? Between like

00:35:53.199 --> 00:35:56.799
minus one and minus two. Like maybe

00:35:54.679 --> 00:35:58.519
minus 1.5. Okay?

00:35:56.800 --> 00:36:00.440
So, we want to minimize this function.

00:35:58.519 --> 00:36:02.039
It's obviously a toy function, little

00:36:00.440 --> 00:36:03.599
function with one variable.

00:36:02.039 --> 00:36:06.320
But, the intuition we use here is going

00:36:03.599 --> 00:36:08.239
to be exactly what we use for GPT-4.

00:36:06.320 --> 00:36:09.880
So, pay attention.

00:36:08.239 --> 00:36:11.000
So, how can we go about minimizing this

00:36:09.880 --> 00:36:13.559
function?

00:36:11.000 --> 00:36:13.559
What will we do?

00:36:15.079 --> 00:36:18.159
Yeah.

00:36:16.639 --> 00:36:20.119
Take the derivative and set it equal to

00:36:18.159 --> 00:36:22.039
zero. You take the derivative. Exactly.

00:36:20.119 --> 00:36:23.799
So, you take the derivative, right?

00:36:22.039 --> 00:36:25.559
Um so, when you So, let's look at what

00:36:23.800 --> 00:36:26.640
the derivative does for us.

00:36:25.559 --> 00:36:30.000
But, then

00:36:26.639 --> 00:36:31.920
the second part of what you said

00:36:30.000 --> 00:36:33.960
Yeah. The second part of what you said was to set

00:36:31.920 --> 00:36:35.800
it to zero. Setting it to zero becomes

00:36:33.960 --> 00:36:37.000
problematic

00:36:35.800 --> 00:36:38.840
when you have very complicated

00:36:37.000 --> 00:36:39.960
functions. It's not clear at all what's

00:36:38.840 --> 00:36:41.880
going to make them zero, right?

00:36:39.960 --> 00:36:42.960
Unfortunately. But, the idea of taking

00:36:41.880 --> 00:36:43.840
the derivative is in fact the right

00:36:42.960 --> 00:36:45.440
idea.

00:36:43.840 --> 00:36:46.480
So, we can go about this. We can

00:36:45.440 --> 00:36:47.920
calculate the derivative. And this is

00:36:46.480 --> 00:36:49.480
what we actually get for the derivative.

00:36:47.920 --> 00:36:50.840
You can convince yourself.

00:36:49.480 --> 00:36:53.240
And if you plot the derivative, it looks

00:36:50.840 --> 00:36:53.240
like that.

00:36:53.400 --> 00:36:56.760
And as you would hope, wherever the

00:36:55.079 --> 00:36:58.679
minimum is, in fact, the derivative is

00:36:56.760 --> 00:36:59.760
crossing

00:36:58.679 --> 00:37:01.119
right? The derivative is zero here. It's

00:36:59.760 --> 00:37:02.320
crossing the x-axis.

00:37:01.119 --> 00:37:03.759
Right? In this case, you can actually do

00:37:02.320 --> 00:37:04.800
that.

00:37:03.760 --> 00:37:06.280
So, let's say you have the derivative.

00:37:04.800 --> 00:37:08.359
How can you use it?

00:37:06.280 --> 00:37:09.760
Like, what is the value of a derivative?

00:37:08.358 --> 00:37:11.199
What does it tell you?

00:37:09.760 --> 00:37:13.800
Yeah.

00:37:11.199 --> 00:37:16.159
You use a gradient descent algorithm.

00:37:13.800 --> 00:37:18.240
You are 10 steps ahead of me, my friend.

00:37:16.159 --> 00:37:19.920
I just want the basic answer.

00:37:18.239 --> 00:37:21.239
Like, what what what what good is a

00:37:19.920 --> 00:37:22.200
derivative? What Like, what does it tell

00:37:21.239 --> 00:37:23.919
you? When you calculate the derivative

00:37:22.199 --> 00:37:25.919
of something at a particular point

00:37:23.920 --> 00:37:27.240
>> you the rate of change of the function

00:37:25.920 --> 00:37:29.800
at the place you are. Correct. Exactly

00:37:27.239 --> 00:37:32.119
right. So, here, what the derivative

00:37:29.800 --> 00:37:34.240
tells us is that the slope gives

00:37:32.119 --> 00:37:36.920
us the change in the function for a very

00:37:34.239 --> 00:37:38.319
small increase in w, right?

00:37:36.920 --> 00:37:41.519
And this is high school calculus. I'm

00:37:38.320 --> 00:37:41.519
just doing a quick refresher.

00:37:41.920 --> 00:37:47.720
So, what that means is that

00:37:45.199 --> 00:37:49.480
if the derivative is positive,

00:37:47.719 --> 00:37:52.039
what that means is that increasing w

00:37:49.480 --> 00:37:53.760
slightly will increase the function.

00:37:52.039 --> 00:37:55.000
So, if if you're here,

00:37:53.760 --> 00:37:56.160
you calculate the derivative, the slope

00:37:55.000 --> 00:37:57.480
is positive. It means that if you go

00:37:56.159 --> 00:37:58.799
slightly in this direction, the function

00:37:57.480 --> 00:38:00.199
is going to get higher.

00:37:58.800 --> 00:38:02.560
Right?

00:38:00.199 --> 00:38:03.839
Similarly, if it's negative,

00:38:02.559 --> 00:38:05.039
let's say here, you calculate the

00:38:03.840 --> 00:38:06.680
derivative, it's the the slope is like

00:38:05.039 --> 00:38:08.840
this. It's negative, which means that if

00:38:06.679 --> 00:38:10.239
you increase w, if you go in this

00:38:08.840 --> 00:38:12.519
direction, it's going to decrease the

00:38:10.239 --> 00:38:13.759
function.

00:38:12.519 --> 00:38:15.000
Okay?

00:38:13.760 --> 00:38:17.760
All right.

00:38:15.000 --> 00:38:19.639
And if it's kind of close to zero,

00:38:17.760 --> 00:38:22.240
it means that changing w slightly won't

00:38:19.639 --> 00:38:24.119
change anything.

00:38:22.239 --> 00:38:25.719
So, if you're here, changing it slightly

00:38:24.119 --> 00:38:26.880
won't change anything.

00:38:25.719 --> 00:38:28.079
All right?

00:38:26.880 --> 00:38:29.920
That's it.
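If the refresher feels abstract, a numerical finite difference makes the same point; the function g below is an arbitrary stand-in, not the one on the slide:

    def g(w):
        return w**4 + 2 * w**3 - 3 * w        # arbitrary stand-in function

    def numerical_derivative(f, w, h=1e-6):
        # Rate of change of f for a tiny increase in w
        return (f(w + h) - f(w)) / h

    print(numerical_derivative(g, 2.0))   # positive: increasing w increases g here
    print(numerical_derivative(g, -2.0))  # negative: increasing w decreases g here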

00:38:28.079 --> 00:38:31.599
So,

00:38:29.920 --> 00:38:35.400
So, what we do is this immediately

00:38:31.599 --> 00:38:37.079
suggests an algorithm for minimizing g(w),

00:38:35.400 --> 00:38:38.400
which is let's start with some random

00:38:37.079 --> 00:38:39.400
point w.

00:38:38.400 --> 00:38:40.519
And then,

00:38:39.400 --> 00:38:41.480
let's calculate the derivative at that

00:38:40.519 --> 00:38:42.920
point.

00:38:41.480 --> 00:38:45.000
And once we do that,

00:38:42.920 --> 00:38:46.280
there are three possibilities.

00:38:45.000 --> 00:38:48.320
It could be positive, negative, or kind

00:38:46.280 --> 00:38:49.640
of close to zero.

00:38:48.320 --> 00:38:52.160
And if it's positive, we know that

00:38:49.639 --> 00:38:53.839
increasing w will increase the function.

00:38:52.159 --> 00:38:55.358
But, we want to decrease the function.

00:38:53.840 --> 00:38:56.200
We want to minimize it.

00:38:55.358 --> 00:38:58.920
Which means that we should not be

00:38:56.199 --> 00:39:00.159
increasing w. We should be doing what

00:38:58.920 --> 00:39:01.720
here?

00:39:00.159 --> 00:39:03.519
Decrease.

00:39:01.719 --> 00:39:07.119
Yes. And similarly, if it's negative,

00:39:03.519 --> 00:39:07.119
what should we do here? Increase.

00:39:07.840 --> 00:39:11.358
Exactly. So, in the first case, you

00:39:09.320 --> 00:39:13.240
reduce w slightly. In the second case,

00:39:11.358 --> 00:39:14.400
you increase w slightly. And if the

00:39:13.239 --> 00:39:17.399
thing is close to zero, you just stop

00:39:14.400 --> 00:39:17.400
because there's nothing else you can do.

00:39:17.880 --> 00:39:20.119
Okay?

00:39:21.358 --> 00:39:26.639
This is the basic intuition behind how

00:39:23.599 --> 00:39:28.239
GPT-4 was built.

00:39:26.639 --> 00:39:29.199
Which is kind of shocking if you think

00:39:28.239 --> 00:39:31.279
about it.

00:39:29.199 --> 00:39:32.879
Right? Which means that all the the

00:39:31.280 --> 00:39:35.080
heavy-duty optimization stuff that

00:39:32.880 --> 00:39:37.960
people have figured out over the decades

00:39:35.079 --> 00:39:39.440
is kind of not used.

00:39:37.960 --> 00:39:41.320
Right? This algorithm is what's being

00:39:39.440 --> 00:39:42.200
used with some, you know, flavors on top

00:39:41.320 --> 00:39:44.200
of it.

00:39:42.199 --> 00:39:46.719
So, yeah. So, back to this

00:39:44.199 --> 00:39:48.319
uh and you you do that and then if

00:39:46.719 --> 00:39:49.879
you've sort of run out of time or

00:39:48.320 --> 00:39:52.240
compute

00:39:49.880 --> 00:39:54.119
or right, if you run out of time and so

00:39:52.239 --> 00:39:55.279
on, just stop.

00:39:54.119 --> 00:39:56.839
Otherwise, just go back to step one and

00:39:55.280 --> 00:39:59.720
try again. Of course, if it's close to

00:39:56.840 --> 00:39:59.720
zero, you got to stop anyway.

00:40:00.119 --> 00:40:05.159
Yeah.

00:40:02.280 --> 00:40:09.040
Is there the um concern of a potentially

00:40:05.159 --> 00:40:09.039
local minimum there? It's coming.

00:40:10.039 --> 00:40:12.400
Okay? So, that's the function. It's

00:40:11.320 --> 00:40:13.960
going to find

00:40:12.400 --> 00:40:16.160
you some point where the derivative is

00:40:13.960 --> 00:40:17.639
kind of close to zero. Okay?

00:40:16.159 --> 00:40:19.879
So,

00:40:17.639 --> 00:40:21.759
this is called gradient descent. Right?

00:40:19.880 --> 00:40:23.519
This is gradient descent, this little

00:40:21.760 --> 00:40:26.720
algorithm.

00:40:23.519 --> 00:40:29.360
And this

00:40:26.719 --> 00:40:32.679
this very PowerPoint-y MBA-style table can be

00:40:29.360 --> 00:40:34.039
collapsed into this little expression.

00:40:32.679 --> 00:40:35.519
Basically says,

00:40:34.039 --> 00:40:36.920
calculate the derivative,

00:40:35.519 --> 00:40:38.320
multiplied by a small number which we'll

00:40:36.920 --> 00:40:41.880
get to in a second,

00:40:38.320 --> 00:40:44.039
and then update: the new W

00:40:41.880 --> 00:40:45.680
is the old W minus a little number times

00:40:44.039 --> 00:40:47.800
gradient.

00:40:45.679 --> 00:40:50.480
So, this little one-line formula is

00:40:47.800 --> 00:40:51.560
basically gradient descent.

00:40:50.480 --> 00:40:54.159
Okay?
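Here is that one-line update turned into a loop, as a minimal sketch; the quartic, the starting point, and the learning rate below are stand-ins of my own, not the exact values from the slide:

    def g(w):
        return w**4 + 2 * w**3 - 3 * w        # stand-in quartic to minimize

    def dg(w):
        return 4 * w**3 + 6 * w**2 - 3        # its derivative

    w = 2.5           # start at some random point
    alpha = 0.01      # learning rate: take small steps
    for step in range(200):
        grad = dg(w)
        if abs(grad) < 1e-6:                  # derivative close to zero: stop
            break
        w = w - alpha * grad                  # new W = old W - alpha * derivative
    print(w, g(w))                            # ends up near the minimum of g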

00:40:51.559 --> 00:40:56.400
And what you should do, just to build

00:40:54.159 --> 00:40:58.639
your intuition, is to make sure that

00:40:56.400 --> 00:41:00.119
these three possibilities here map

00:40:58.639 --> 00:41:01.199
nicely to this. Like this thing will

00:41:00.119 --> 00:41:03.559
actually capture these three

00:41:01.199 --> 00:41:04.559
possibilities.

00:41:03.559 --> 00:41:07.079
Any guesses for when gradient descent was

00:41:04.559 --> 00:41:07.079
invented?

00:41:07.599 --> 00:41:10.839
It has some historical fun, right?

00:41:13.199 --> 00:41:17.719
The 19th century?

00:41:15.000 --> 00:41:20.320
19th century. Yeah, okay. Good. Very

00:41:17.719 --> 00:41:22.719
good. Excellent guess.

00:41:20.320 --> 00:41:25.559
1847.

00:41:22.719 --> 00:41:27.919
It was uh invented uh in 1847 by Cauchy,

00:41:25.559 --> 00:41:29.159
the great mathematician. And in fact, if

00:41:27.920 --> 00:41:30.760
you're curious, you can check out the

00:41:29.159 --> 00:41:32.639
paper.

00:41:30.760 --> 00:41:35.880
I've given you the paper

00:41:32.639 --> 00:41:35.879
here for handy reference.

00:41:36.639 --> 00:41:40.839
So, 1847.

00:41:38.159 --> 00:41:43.879
So, GPT-4 is built using an algorithm

00:41:40.840 --> 00:41:43.880
invented in 1847.

00:41:44.280 --> 00:41:51.600
Which I find like astonishing, frankly.

00:41:47.719 --> 00:41:52.959
That this little thing is so capable.

00:41:51.599 --> 00:41:54.639
Okay.

00:41:52.960 --> 00:41:56.599
So, that's gradient descent. And this

00:41:54.639 --> 00:41:58.519
little number alpha

00:41:56.599 --> 00:41:59.920
is called the learning rate. And it's

00:41:58.519 --> 00:42:02.480
our way of sort of essentially

00:41:59.920 --> 00:42:04.880
quantifying the idea of let's not

00:42:02.480 --> 00:42:06.480
increase or decrease W massively, let's

00:42:04.880 --> 00:42:08.640
do it slightly.

00:42:06.480 --> 00:42:11.280
Because the gradient is only valid for

00:42:08.639 --> 00:42:14.839
small movements around your point. If

00:42:11.280 --> 00:42:17.519
you take a big step, all bets are off.

00:42:14.840 --> 00:42:20.000
So, this alpha tells you how small a

00:42:17.519 --> 00:42:20.880
step should you take.

00:42:20.000 --> 00:42:23.360
Okay?

00:42:20.880 --> 00:42:25.880
And typically, it's set to very small

00:42:23.360 --> 00:42:27.240
values like, you know, 0.1, 0.001, and

00:42:25.880 --> 00:42:30.000
so on and so forth. And in fact, if you

00:42:27.239 --> 00:42:31.159
read any deep learning academic papers

00:42:30.000 --> 00:42:32.440
where they have trained like a big model

00:42:31.159 --> 00:42:34.279
to do something,

00:42:32.440 --> 00:42:36.240
right? A lot of researchers will very

00:42:34.280 --> 00:42:37.640
quickly go to the appendix where they

00:42:36.239 --> 00:42:39.559
have described exactly what learning

00:42:37.639 --> 00:42:40.960
rates were used.

00:42:39.559 --> 00:42:44.239
Because sort of the learning rate is

00:42:40.960 --> 00:42:45.480
like part of the IP for how it's built.

00:42:44.239 --> 00:42:47.479
A lot of trial and error goes into

00:42:45.480 --> 00:42:50.280
these learning rates.

00:42:47.480 --> 00:42:53.400
Okay. So, that is gradient descent.

00:42:50.280 --> 00:42:55.080
Um so, if we apply this algorithm to GW,

00:42:53.400 --> 00:42:56.800
our original function,

00:42:55.079 --> 00:42:58.840
right? We just keep on doing this thing

00:42:56.800 --> 00:43:00.560
a few times.

00:42:58.840 --> 00:43:01.880
Right? What you will find is that if

00:43:00.559 --> 00:43:02.639
let's say

00:43:01.880 --> 00:43:05.519
the

00:43:02.639 --> 00:43:07.599
point we randomly pick is 2.5, we

00:43:05.519 --> 00:43:09.759
set the alpha to one, we run this

00:43:07.599 --> 00:43:11.159
algorithm, it starts here, then it goes

00:43:09.760 --> 00:43:12.960
there, it goes there, bup bup bup bup

00:43:11.159 --> 00:43:14.119
bup, and then finally ends up here.

00:43:12.960 --> 00:43:16.440
In like four or five iterations, it

00:43:14.119 --> 00:43:17.679
finds some minimum.

00:43:16.440 --> 00:43:19.639
This is obviously a very simple,

00:43:17.679 --> 00:43:22.279
well-behaved, nice little function, so

00:43:19.639 --> 00:43:23.440
you can easily optimize it.

00:43:22.280 --> 00:43:25.400
Okay? If you want, you can just go to

00:43:23.440 --> 00:43:28.000
this thing. There's a nice animation of

00:43:25.400 --> 00:43:28.000
this thing as well.

00:43:28.119 --> 00:43:31.679
Okay. So, now

00:43:30.119 --> 00:43:33.279
All right. Before we actually go to the

00:43:31.679 --> 00:43:35.000
multi-variable function, I want to go to

00:43:33.280 --> 00:43:36.280
the question that you posed about local

00:43:35.000 --> 00:43:37.480
minima.

00:43:36.280 --> 00:43:38.920
Um actually, you know what? I think I

00:43:37.480 --> 00:43:40.320
may have some slides on it. So, sorry.

00:43:38.920 --> 00:43:41.920
I'll come back to this.

00:43:40.320 --> 00:43:43.080
So, let's actually see. You know,

00:43:41.920 --> 00:43:45.240
we looked at a toy example where

00:43:43.079 --> 00:43:46.440
there was only one variable. What if you

00:43:45.239 --> 00:43:49.319
have

00:43:46.440 --> 00:43:51.639
uh what if it was GPT-3? GPT-3 has 175

00:43:49.320 --> 00:43:53.960
billion parameters.

00:43:51.639 --> 00:43:55.400
175 billion and GPT-4, they haven't

00:43:53.960 --> 00:43:57.720
published it, so we don't know. It's

00:43:55.400 --> 00:43:59.840
supposed to be eight times as much.

00:43:57.719 --> 00:44:02.039
Okay? So, I mean, the number of

00:43:59.840 --> 00:44:04.840
parameters is massive. So, basically,

00:44:02.039 --> 00:44:07.960
our loss function has

00:44:04.840 --> 00:44:10.320
billions of variables, billions of Ws

00:44:07.960 --> 00:44:12.920
that we need to optimize over, minimize

00:44:10.320 --> 00:44:14.760
over. So, we need to use this notion of

00:44:12.920 --> 00:44:16.039
a partial derivative. So, let's take

00:44:14.760 --> 00:44:18.200
baby steps and say, okay, what if you

00:44:16.039 --> 00:44:20.079
have a two-variable function, right?

00:44:18.199 --> 00:44:21.599
Something like this, very simple. So,

00:44:20.079 --> 00:44:23.960
what we can do is we can calculate the

00:44:21.599 --> 00:44:26.400
partial derivative of G with respect to

00:44:23.960 --> 00:44:27.840
each of these Ws.

00:44:26.400 --> 00:44:29.720
And the partial derivative, just to

00:44:27.840 --> 00:44:32.840
quickly refresh your memories,

00:44:29.719 --> 00:44:36.439
is you take a function, you pretend that

00:44:32.840 --> 00:44:38.400
everything other than W1 is a constant.

00:44:36.440 --> 00:44:40.960
Then the function becomes

00:44:38.400 --> 00:44:41.920
a function of just one variable W, W1.

00:44:40.960 --> 00:44:43.760
And then you just differentiate it like

00:44:41.920 --> 00:44:46.159
you do everything else. And you get

00:44:43.760 --> 00:44:48.600
something, and that is

00:44:46.159 --> 00:44:50.039
this thing here.

00:44:48.599 --> 00:44:51.559
And then you do the same thing for W2,

00:44:50.039 --> 00:44:54.239
you get this thing here, and then you

00:44:51.559 --> 00:44:55.079
just stack them up in a nice list.

00:44:54.239 --> 00:44:56.399
Okay?

00:44:55.079 --> 00:44:58.000
This is the vector of partial

00:44:56.400 --> 00:44:59.400
derivatives.

00:44:58.000 --> 00:45:01.559
So, how should we interpret this? The

00:44:59.400 --> 00:45:04.280
same way as before. Basically, for a

00:45:01.559 --> 00:45:06.000
small change in W1, keeping W2 and

00:45:04.280 --> 00:45:08.200
everything else fixed, how does the

00:45:06.000 --> 00:45:11.000
function change if you change just W1

00:45:08.199 --> 00:45:14.039
slightly? And similarly for W2 and all

00:45:11.000 --> 00:45:15.760
the way to W175 billion.

00:45:14.039 --> 00:45:17.119
Same thing. Okay?
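A small sketch of "hold everything else constant and differentiate", using finite differences on a made-up two-variable function (my own illustration):

    def g(w1, w2):
        return w1**2 + 3 * w1 * w2 + 2 * w2**2    # made-up two-variable function

    def gradient(w1, w2, h=1e-6):
        # Partial w.r.t. w1: nudge w1 only, keep w2 fixed
        dg_dw1 = (g(w1 + h, w2) - g(w1, w2)) / h
        # Partial w.r.t. w2: nudge w2 only, keep w1 fixed
        dg_dw2 = (g(w1, w2 + h) - g(w1, w2)) / h
        return [dg_dw1, dg_dw2]                   # stacked into a list: the gradient

    print(gradient(1.0, 2.0))   # roughly [2*1 + 3*2, 3*1 + 4*2] = [8, 11]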

00:45:15.760 --> 00:45:19.359
So, um

00:45:17.119 --> 00:45:22.039
now, when you have these functions with

00:45:19.358 --> 00:45:24.480
many variables, many Ws,

00:45:22.039 --> 00:45:26.759
uh since we have a partial derivative for each one

00:45:24.480 --> 00:45:28.358
of those Ws, we stack them up into a

00:45:26.760 --> 00:45:30.200
nice vector

00:45:28.358 --> 00:45:32.199
of derivatives, and this vector is

00:45:30.199 --> 00:45:33.799
called the gradient.

00:45:32.199 --> 00:45:35.279
And it's denoted

00:45:33.800 --> 00:45:37.240
using

00:45:35.280 --> 00:45:38.720
this uh Anyone know what the symbol is

00:45:37.239 --> 00:45:40.199
called?

00:45:38.719 --> 00:45:41.679
nabla

00:45:40.199 --> 00:45:43.839
Yeah?

00:45:41.679 --> 00:45:45.599
Laplacian

00:45:43.840 --> 00:45:48.880
Maybe. Maybe that's a synonym. But the

00:45:45.599 --> 00:45:50.559
one I'm familiar with is nabla.

00:45:48.880 --> 00:45:52.200
Delta is the right-side-up

00:45:50.559 --> 00:45:53.920
triangle, but I think the upside down

00:45:52.199 --> 00:45:55.960
triangle is called nabla, if I

00:45:53.920 --> 00:45:58.200
recall. Am I right?

00:45:55.960 --> 00:46:00.800
Thank you.

00:45:58.199 --> 00:46:00.799
He's my go-to.

00:46:02.559 --> 00:46:06.440
So, yeah. So, the gradient, um we just

00:46:04.840 --> 00:46:08.519
call it the gradient, and it's written

00:46:06.440 --> 00:46:10.960
as this.

00:46:08.519 --> 00:46:12.358
All right. So, what we do is we simply

00:46:10.960 --> 00:46:13.599
do gradient descent on every one of the

00:46:12.358 --> 00:46:16.519
Ws

00:46:13.599 --> 00:46:19.319
using its partial derivative.

00:46:16.519 --> 00:46:21.519
Okay? So, in a gradient step, we

00:46:19.320 --> 00:46:23.000
update W1 using this formula, W2 using

00:46:21.519 --> 00:46:25.400
this formula.

00:46:23.000 --> 00:46:25.400
Finished.

00:46:25.599 --> 00:46:30.440
We've just generalized gradient descent

00:46:27.000 --> 00:46:30.440
to an arbitrary number of variables.

00:46:30.840 --> 00:46:35.120
So, and of course, as before, this can

00:46:32.480 --> 00:46:36.719
be summarized compactly as this vector

00:46:35.119 --> 00:46:40.358
formula.

00:46:36.719 --> 00:46:40.358
Let me just do this.

00:46:43.000 --> 00:46:46.639
So, what's going on here is that

00:46:46.719 --> 00:46:50.119
I have

00:46:47.599 --> 00:46:52.400
new W1 is

00:46:50.119 --> 00:46:53.639
old W1 minus alpha

00:46:52.400 --> 00:46:55.720
times

00:46:53.639 --> 00:46:59.319
the derivative of G

00:46:55.719 --> 00:47:02.159
with respect to W1; then new W2 is

00:46:59.320 --> 00:47:04.920
old W2 minus alpha times

00:47:02.159 --> 00:47:06.039
dG by dW2. And then all we're doing is

00:47:04.920 --> 00:47:08.358
we're just stacking them up into a

00:47:06.039 --> 00:47:10.880
vector

00:47:08.358 --> 00:47:10.880
like that.

00:47:15.440 --> 00:47:19.559
minus alpha, and this vector

00:47:21.440 --> 00:47:24.159
like that.

00:47:27.719 --> 00:47:31.919
So, this can be written as just this

00:47:28.760 --> 00:47:34.240
vector W, the new vector

00:47:31.920 --> 00:47:37.599
old vector minus alpha

00:47:34.239 --> 00:47:39.119
and the gradient. Finished.

00:47:37.599 --> 00:47:40.400
And you can see if it is, you know,

00:47:39.119 --> 00:47:42.719
GPT-3,

00:47:40.400 --> 00:47:44.880
this vector is going to be 175 billion

00:47:42.719 --> 00:47:46.559
long.

00:47:44.880 --> 00:47:47.920
Okay? But whether it's two or 175

00:47:46.559 --> 00:47:50.199
billion, who cares? It's the same thing,

00:47:47.920 --> 00:47:50.200
right?

00:47:50.358 --> 00:47:52.480
Okay.
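The vector form in a few lines of NumPy (a sketch on a simple "bowl" function of my own choosing; the same code works whether the vector has 3 entries or 175 billion, memory permitting):

    import numpy as np

    def grad_g(w):
        # Gradient of the stand-in bowl g(w) = sum(w**2) is just 2*w
        return 2 * w

    w = np.array([2.5, -1.0, 4.0])    # the whole parameter vector at once
    alpha = 0.1
    for step in range(100):
        w = w - alpha * grad_g(w)     # one vector update: new W = old W - alpha * gradient
    print(w)                          # every entry close to 0, the minimum of the bowl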

00:47:52.559 --> 00:47:55.320
So, yeah. So, that's what we have here.

00:47:54.358 --> 00:47:58.000
I'm really thrilled by the way this

00:47:55.320 --> 00:48:00.200
whole iPad business is working out.

00:47:58.000 --> 00:48:02.199
I was a little worried about it. Okay.

00:48:00.199 --> 00:48:04.000
Um so, if you look at two dimensions,

00:48:02.199 --> 00:48:06.679
this function, and if you actually look

00:48:04.000 --> 00:48:08.239
at it, if you plot the function, this is

00:48:06.679 --> 00:48:09.119
the first W, the second W, and then you

00:48:08.239 --> 00:48:11.679
This is actually the loss

00:48:09.119 --> 00:48:13.000
function. That's the function GW. And

00:48:11.679 --> 00:48:14.960
so, you're trying to find the minimum

00:48:13.000 --> 00:48:16.079
here, and so this is how the gradient

00:48:14.960 --> 00:48:17.400
descent will do do do do do. It will

00:48:16.079 --> 00:48:18.400
progress if you're starting from this

00:48:17.400 --> 00:48:20.000
point.

00:48:18.400 --> 00:48:22.280
Or you can also sort of look at it from

00:48:20.000 --> 00:48:23.480
up top down into the function, and

00:48:22.280 --> 00:48:24.720
that's what this picture is, and it

00:48:23.480 --> 00:48:27.000
shows gradient descent starting from

00:48:24.719 --> 00:48:30.599
there and working its way down

00:48:27.000 --> 00:48:32.840
um from here all the way to the center.

00:48:30.599 --> 00:48:35.119
Okay. So,

00:48:32.840 --> 00:48:38.160
All right. Local minima. So, now

00:48:35.119 --> 00:48:41.358
gradient descent will just stop

00:48:38.159 --> 00:48:43.399
near uh hopefully a minimum,

00:48:41.358 --> 00:48:45.960
right? But the problem is it may not be

00:48:43.400 --> 00:48:47.400
a global minimum. It may not even

00:48:45.960 --> 00:48:48.800
be a minimum.

00:48:47.400 --> 00:48:49.880
So, um

00:48:48.800 --> 00:48:51.160
so, let's see what what I'm talking

00:48:49.880 --> 00:48:53.920
about here.

00:48:51.159 --> 00:48:57.079
Here are some possibilities.

00:48:53.920 --> 00:48:59.960
So, let's take a simple function.

00:48:57.079 --> 00:49:02.159
Okay? Let's say this is G(W).

00:48:59.960 --> 00:49:05.960
This is W. And turns out this function

00:49:02.159 --> 00:49:05.960
actually looks like this.

00:49:12.199 --> 00:49:16.719
Okay?

00:49:13.519 --> 00:49:16.719
So, you can see here

00:49:17.719 --> 00:49:23.159
Well,

00:49:19.679 --> 00:49:24.759
um this point

00:49:23.159 --> 00:49:27.119
this point here

00:49:24.760 --> 00:49:29.359
is a local minimum.

00:49:27.119 --> 00:49:30.880
This is a local minimum.

00:49:29.358 --> 00:49:32.599
It's a local minimum.

00:49:30.880 --> 00:49:34.559
These are all

00:49:32.599 --> 00:49:37.239
lots of local minima here.

00:49:34.559 --> 00:49:39.320
Okay? And yeah, there's a lot of local

00:49:37.239 --> 00:49:41.599
minima here, too.

00:49:39.320 --> 00:49:43.880
So, these are all places in which the

00:49:41.599 --> 00:49:46.079
derivative is going to be zero.

00:49:43.880 --> 00:49:48.160
So, if you run gradient descent and it

00:49:46.079 --> 00:49:49.119
stops because the gradient has reached

00:49:48.159 --> 00:49:52.000
zero,

00:49:49.119 --> 00:49:54.519
you could be in any of these places.

00:49:52.000 --> 00:49:57.480
Right? So, there's no guarantee. So,

00:49:54.519 --> 00:49:59.400
this in this picture happens to be

00:49:57.480 --> 00:50:01.039
maybe the global minimum because it's

00:49:59.400 --> 00:50:02.160
the lowest of the lot.

00:50:01.039 --> 00:50:02.880
Right?

00:50:02.159 --> 00:50:04.639
But, there's no guarantee you're

00:50:02.880 --> 00:50:06.320
actually going to get there.

00:50:04.639 --> 00:50:07.519
Okay, there's not even a guarantee

00:50:06.320 --> 00:50:09.519
you're going to be in any of these

00:50:07.519 --> 00:50:10.920
places because you could literally be in

00:50:09.519 --> 00:50:12.480
this thing here

00:50:10.920 --> 00:50:14.599
where it's sort of taking a break and

00:50:12.480 --> 00:50:15.920
then continuing on down.

00:50:14.599 --> 00:50:17.799
That, by the way, is called a you know,

00:50:15.920 --> 00:50:19.320
a saddle point. I drew it badly, but

00:50:17.800 --> 00:50:21.120
this sort of coming in sort of taking a

00:50:19.320 --> 00:50:23.559
break and going down again is called a

00:50:21.119 --> 00:50:25.679
saddle point. So, gradient descent can

00:50:23.559 --> 00:50:27.239
stop at a saddle point. It can stop at

00:50:25.679 --> 00:50:28.879
some minima. There's no guarantee it's

00:50:27.239 --> 00:50:31.279
going to be global.

00:50:28.880 --> 00:50:31.280
Okay?
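A quick sketch of the "where you stop depends on where you start" point, using a wiggly stand-in function of my own: the same gradient descent run from two different starting points settles into two different local minima:

    import math

    def g(w):
        return math.sin(3 * w) + 0.1 * w**2       # wiggly function with several local minima

    def dg(w):
        return 3 * math.cos(3 * w) + 0.2 * w

    def descend(w, alpha=0.01, steps=2000):
        for _ in range(steps):
            w = w - alpha * dg(w)
        return w

    print(descend(-2.0))   # settles into one local minimum
    print(descend(3.0))    # same algorithm, different start, different minimum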

00:50:33.000 --> 00:50:39.199
But, it turns out it has not mattered.

00:50:37.239 --> 00:50:41.039
So, it has not mattered. And there are a

00:50:39.199 --> 00:50:42.919
whole bunch of reasons why it has not

00:50:41.039 --> 00:50:44.440
mattered because when you have these

00:50:42.920 --> 00:50:46.360
very complicated neural networks,

00:50:44.440 --> 00:50:49.200
they're very complex functions. Even

00:50:46.360 --> 00:50:50.640
finding a decent solution, right, to

00:50:49.199 --> 00:50:52.960
these complicated networks is actually

00:50:50.639 --> 00:50:54.879
really good for solving the problem.

00:50:52.960 --> 00:50:57.199
You don't have to go to the best best

00:50:54.880 --> 00:50:58.680
possible solution. And in fact, if you

00:50:57.199 --> 00:51:01.960
go to the best possible solution, you

00:50:58.679 --> 00:51:01.960
actually run the risk of overfitting.

00:51:02.039 --> 00:51:05.840
So, that's one reason. The other

00:51:03.719 --> 00:51:08.319
interesting reason and by the way, this

00:51:05.840 --> 00:51:09.800
is a very hot area of research to figure

00:51:08.320 --> 00:51:11.120
out exactly

00:51:09.800 --> 00:51:12.600
So, it's sort of like this. Empirically,

00:51:11.119 --> 00:51:13.960
what we have seen is that not worrying

00:51:12.599 --> 00:51:16.239
about local minima, global minima, all

00:51:13.960 --> 00:51:18.119
that stuff has not hurt us because these

00:51:16.239 --> 00:51:20.479
things are amazing.

00:51:18.119 --> 00:51:21.480
GPT-4, probably they just stopped

00:51:20.480 --> 00:51:22.880
somewhere. Probably it wasn't even

00:51:21.480 --> 00:51:24.000
a local minima. They're like, "All

00:51:22.880 --> 00:51:25.000
right, we've It's been running for 6

00:51:24.000 --> 00:51:27.000
days. We've spent 2 million dollars.

00:51:25.000 --> 00:51:29.000
Let's stop."

00:51:27.000 --> 00:51:31.800
Right? Because these are very expensive.

00:51:29.000 --> 00:51:33.199
So, but that's still so magical.

00:51:31.800 --> 00:51:34.600
You don't need to get anywhere close to

00:51:33.199 --> 00:51:36.279
local minimum. But, there's another

00:51:34.599 --> 00:51:37.559
interesting point which

00:51:36.280 --> 00:51:40.880
I read about.

00:51:37.559 --> 00:51:43.279
People basically hypothesize that

00:51:40.880 --> 00:51:45.200
for you to be at a local minimum, just

00:51:43.280 --> 00:51:47.000
think about what it means. It means that

00:51:45.199 --> 00:51:49.439
you're standing at a particular point,

00:51:47.000 --> 00:51:51.800
in every direction that you look,

00:51:49.440 --> 00:51:52.840
things are just sloping upward.

00:51:51.800 --> 00:51:54.760
Right?

00:51:52.840 --> 00:51:56.400
Everything is sloping upward. Only if

00:51:54.760 --> 00:51:58.520
everything is sloping upward all around

00:51:56.400 --> 00:52:00.760
you, could you be at a local minimum

00:51:58.519 --> 00:52:02.880
by definition. But, if you have a

00:52:00.760 --> 00:52:04.560
billion dimensions,

00:52:02.880 --> 00:52:06.200
what are the odds that you're going to

00:52:04.559 --> 00:52:07.199
be standing at a point where every one

00:52:06.199 --> 00:52:08.319
of those billion dimensions is going

00:52:07.199 --> 00:52:10.119
upward?

00:52:08.320 --> 00:52:11.600
The odds are really low.

00:52:10.119 --> 00:52:13.239
Chances are some of them are going to be

00:52:11.599 --> 00:52:14.480
going up, some of them are going down,

00:52:13.239 --> 00:52:16.759
others are sort of coming down and going

00:52:14.480 --> 00:52:18.400
another way. It's going to be crazy.

00:52:16.760 --> 00:52:20.000
So, in some sense, the best you can hope

00:52:18.400 --> 00:52:23.079
for in these very high-dimensional

00:52:20.000 --> 00:52:25.760
situations is probably a saddle point.

00:52:23.079 --> 00:52:29.159
And it turns out it's good enough.

00:52:25.760 --> 00:52:30.920
So, for those reasons, we are content

00:52:29.159 --> 00:52:31.879
with just running gradient descent with

00:52:30.920 --> 00:52:34.320
some tweaks which I'll get to in a

00:52:31.880 --> 00:52:36.880
second. Um and it just performs really

00:52:34.320 --> 00:52:36.880
admirably.

00:52:36.920 --> 00:52:41.680
Um how does alpha depend on like how

00:52:39.840 --> 00:52:44.600
much compute you have? Like, would you

00:52:41.679 --> 00:52:45.359
set the learning rate based on that or

00:52:44.599 --> 00:52:47.960
not really?

00:52:45.360 --> 00:52:50.680
>> No, the learning rate is really

00:52:47.960 --> 00:52:52.519
a measure of... It's sort of like this.

00:52:50.679 --> 00:52:54.759
When you're at a point where you think

00:52:52.519 --> 00:52:55.840
that the gradient is looking nice and

00:52:54.760 --> 00:52:57.600
right, if you take a step in the

00:52:55.840 --> 00:53:00.000
direction it's going to go down. And if

00:52:57.599 --> 00:53:01.519
you further believe that it's going to

00:53:00.000 --> 00:53:02.480
keep going down in the direction for a

00:53:01.519 --> 00:53:04.159
while,

00:53:02.480 --> 00:53:06.000
then you're very confident about taking

00:53:04.159 --> 00:53:07.519
a big step.

00:53:06.000 --> 00:53:09.119
But, if you're like, "I I don't know

00:53:07.519 --> 00:53:10.960
because the maybe I take a little step,

00:53:09.119 --> 00:53:12.159
maybe I have to go this way. I can't go

00:53:10.960 --> 00:53:13.360
straight anymore." Then you don't want

00:53:12.159 --> 00:53:14.639
to take a big step because then you have

00:53:13.360 --> 00:53:16.320
to backtrack.

00:53:14.639 --> 00:53:19.039
So, those kinds of considerations go

00:53:16.320 --> 00:53:20.920
into the learning rate. Um and so,

00:53:19.039 --> 00:53:23.360
that's sort of the rough answer to your

00:53:20.920 --> 00:53:24.920
question. It's not so much determined by

00:53:23.360 --> 00:53:25.840
compute and bandwidth and things like

00:53:24.920 --> 00:53:27.320
that.

00:53:25.840 --> 00:53:29.400
But, again, it's sort of a

00:53:27.320 --> 00:53:31.240
complicated thing because sometimes with

00:53:29.400 --> 00:53:33.079
a given amount of compute, if

00:53:31.239 --> 00:53:35.239
you have a particular kind of data, you

00:53:33.079 --> 00:53:37.079
can have very aggressive learning rates.

00:53:35.239 --> 00:53:39.439
So, it tends to be a bit sort of, you

00:53:37.079 --> 00:53:40.880
know, jumbled up complicated. So, but

00:53:39.440 --> 00:53:43.320
that's sort of the the quick surface

00:53:40.880 --> 00:53:46.240
level idea of what's going on.

00:53:43.320 --> 00:53:46.240
Um okay.

00:53:47.000 --> 00:53:49.679
9:31.

00:53:50.960 --> 00:53:54.119
Anyway, folks, this lecture is like

00:53:52.400 --> 00:53:55.519
probably one of the driest in the like

00:53:54.119 --> 00:53:57.719
semester because of like I have to go

00:53:55.519 --> 00:53:59.159
through all the concepts. Um once we

00:53:57.719 --> 00:54:00.839
start doing Colabs, you know, things

00:53:59.159 --> 00:54:01.759
get a lot more lively.

00:54:00.840 --> 00:54:04.320
Okay.

00:54:01.760 --> 00:54:05.880
Um all right. So, now let's talk about

00:54:04.320 --> 00:54:08.039
minimizing a loss function with gradient

00:54:05.880 --> 00:54:09.519
descent. So, here is our little binary

00:54:08.039 --> 00:54:11.719
cross entropy loss function that we saw

00:54:09.519 --> 00:54:13.519
from before. Right? This is what we want

00:54:11.719 --> 00:54:14.839
to minimize. So, if you look at this

00:54:13.519 --> 00:54:16.800
thing,

00:54:14.840 --> 00:54:19.280
where are the variables we need to

00:54:16.800 --> 00:54:21.880
change to minimize this function?

00:54:19.280 --> 00:54:23.880
Folks, don't look at your phones.

00:54:21.880 --> 00:54:26.480
Okay, laptop and iPad use is fine, but don't

00:54:23.880 --> 00:54:26.480
look at your phones.

00:54:27.559 --> 00:54:33.000
Sorry, we've kind of abstracted um the

00:54:30.639 --> 00:54:35.079
variables W, but just to bring it back,

00:54:33.000 --> 00:54:36.559
those are actually the weights in the

00:54:35.079 --> 00:54:38.519
neural networks, right? Yeah, the

00:54:36.559 --> 00:54:42.480
weights and the biases. I'm just calling

00:54:38.519 --> 00:54:45.440
them as weights. So, the output of these

00:54:42.480 --> 00:54:47.480
uh minimization functions are going to

00:54:45.440 --> 00:54:47.720
be the actual weights in your model,

00:54:47.480 --> 00:54:49.920
right?

00:54:47.719 --> 00:54:51.358
>> Exactly. Exactly right.

00:54:49.920 --> 00:54:52.440
The whole name of the game is to find

00:54:51.358 --> 00:54:53.719
the weights.

00:54:52.440 --> 00:54:57.159
And so, for example, when you see in the

00:54:53.719 --> 00:55:00.279
press that uh Meta has essentially um

00:54:57.159 --> 00:55:01.719
made the weights of Llama 2 or something

00:55:00.280 --> 00:55:02.920
available, that's basically what they've

00:55:01.719 --> 00:55:04.679
done.

00:55:02.920 --> 00:55:06.039
They basically published the weights.

00:55:04.679 --> 00:55:07.480
Reason that's so valuable is

00:55:06.039 --> 00:55:09.320
>> Microphone, please. Go.

00:55:07.480 --> 00:55:11.400
Cuz if you have a billion parameters,

00:55:09.320 --> 00:55:13.320
the compute time on that is horrendous

00:55:11.400 --> 00:55:14.599
and expensive. That's why the

00:55:13.320 --> 00:55:16.559
weights are so valuable.

00:55:14.599 --> 00:55:18.358
>> Correct. The weights are the crown jewel

00:55:16.559 --> 00:55:19.840
because they are the result of a lot of

00:55:18.358 --> 00:55:21.880
money and time and smartness being

00:55:19.840 --> 00:55:23.320
spent.

00:55:21.880 --> 00:55:25.240
There is a separate question of why are

00:55:23.320 --> 00:55:26.000
they making it open source,

00:55:25.239 --> 00:55:28.679
which

00:55:26.000 --> 00:55:29.880
I'm happy to chat about offline.

00:55:28.679 --> 00:55:30.839
All right, cool. So, what are the

00:55:29.880 --> 00:55:32.920
variables we need to change to

00:55:30.840 --> 00:55:34.480
minimize? It's basically the parameters

00:55:32.920 --> 00:55:36.358
and they're hiding inside the model

00:55:34.480 --> 00:55:38.440
term.

00:55:36.358 --> 00:55:41.279
Right? Because what is the model? The

00:55:38.440 --> 00:55:42.639
model is some function like that, right?

00:55:41.280 --> 00:55:44.400
If you look at the simple GPA and

00:55:42.639 --> 00:55:46.400
experience thing we looked at on

00:55:44.400 --> 00:55:48.800
Monday, we finally figured out that the

00:55:46.400 --> 00:55:50.480
actual thing that comes out here is

00:55:48.800 --> 00:55:52.440
going to be this complicated function of

00:55:50.480 --> 00:55:54.960
all the X's and the W's and so on and so

00:55:52.440 --> 00:55:57.599
forth, right? And that complicated thing

00:55:54.960 --> 00:55:58.800
is showing up inside this thing.

00:55:57.599 --> 00:56:00.960
So,

00:55:58.800 --> 00:56:02.920
you know, and the W's here are the

00:56:00.960 --> 00:56:05.119
variables we can we need to change to

00:56:02.920 --> 00:56:06.720
minimize the loss function. And it's

00:56:05.119 --> 00:56:10.159
important for you to note and

00:56:06.719 --> 00:56:13.159
understand that the values of X and Y

00:56:10.159 --> 00:56:14.199
and so on are just data.

00:56:13.159 --> 00:56:15.759
You're not optimizing anything there.

00:56:14.199 --> 00:56:17.919
They're just data.

00:56:15.760 --> 00:56:20.480
What you're optimizing is the W's.

00:56:17.920 --> 00:56:20.480
The weights.

00:56:22.400 --> 00:56:27.639
Okay. So, imagine replacing the model

00:56:26.400 --> 00:56:29.440
here with the mathematical expression

00:56:27.639 --> 00:56:31.920
above, wherever it appears in the loss

00:56:29.440 --> 00:56:33.400
function. And once you do that, your

00:56:31.920 --> 00:56:35.800
loss function is just a good old

00:56:33.400 --> 00:56:37.440
function of the W's.

00:56:35.800 --> 00:56:39.200
The fact that it's a loss function is

00:56:37.440 --> 00:56:41.039
kind of irrelevant.

00:56:39.199 --> 00:56:42.399
It's just a function.

00:56:41.039 --> 00:56:43.880
And since it's just a good old function

00:56:42.400 --> 00:56:45.920
of the W's, you can apply gradient

00:56:43.880 --> 00:56:48.559
descent to it as we normally would.

00:56:45.920 --> 00:56:48.559
It's no big deal.
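As a sketch of "the loss is just a function of the W's", here is gradient descent run directly on the binary cross entropy of a one-weight, one-bias model; the data and settings are made up for illustration, and the gradient is the standard (p - y) form for a sigmoid output:

    import numpy as np

    # Made-up data: one feature per patient, label 0 or 1
    X = np.array([0.5, 1.0, 1.5, 3.0, 3.5, 4.0])
    Y = np.array([0,   0,   0,   1,   1,   1  ])

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    w, b = 0.0, 0.0        # the W's we optimize; X and Y are just data
    alpha = 0.1
    for step in range(5000):
        p = sigmoid(w * X + b)            # model's predicted probabilities
        grad_w = np.mean((p - Y) * X)     # d(average BCE)/dw
        grad_b = np.mean(p - Y)           # d(average BCE)/db
        w, b = w - alpha * grad_w, b - alpha * grad_b

    print(w, b)                                    # weights found by gradient descent
    print(sigmoid(w * np.array([1.0, 4.0]) + b))   # low probability at x=1, high at x=4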

00:56:49.440 --> 00:56:52.920
Which brings us to something called

00:56:50.880 --> 00:56:55.400
backpropagation.

00:56:52.920 --> 00:56:55.400
Um

00:56:56.199 --> 00:56:59.839
Um if you remember nothing else about

00:56:57.639 --> 00:57:01.239
backpropagation, just remember this.

00:56:59.840 --> 00:57:04.680
Never use the word backpropagation

00:57:01.239 --> 00:57:05.479
again. Only use the word backprop.

00:57:04.679 --> 00:57:06.759
You're

00:57:05.480 --> 00:57:07.760
hip and cool to the deep learning

00:57:06.760 --> 00:57:09.320
community.

00:57:07.760 --> 00:57:12.200
Backprop.

00:57:09.320 --> 00:57:14.480
Okay. All right. So, what is backprop?

00:57:12.199 --> 00:57:16.000
Backprop is a very efficient way to

00:57:14.480 --> 00:57:17.519
compute the gradient of the loss

00:57:16.000 --> 00:57:19.239
function.

00:57:17.519 --> 00:57:21.920
So, when you have this loss function,

00:57:19.239 --> 00:57:24.759
and let's say you have a billion W's

00:57:21.920 --> 00:57:27.559
and you have 10 million data points. So,

00:57:24.760 --> 00:57:30.520
the little n we saw was 10 million.

00:57:27.559 --> 00:57:32.279
That is a lot of computation.

00:57:30.519 --> 00:57:34.239
And that is just for one step of

00:57:32.280 --> 00:57:37.480
gradient descent.

00:57:34.239 --> 00:57:39.799
Right? So, backprop is a very

00:57:37.480 --> 00:57:41.639
efficient and clever way to compute the

00:57:39.800 --> 00:57:44.800
gradient of the loss function, which

00:57:41.639 --> 00:57:47.039
takes advantage of the fact that what we

00:57:44.800 --> 00:57:49.480
have here is not some arbitrary model.

00:57:47.039 --> 00:57:51.960
It's a model that came from a particular

00:57:49.480 --> 00:57:53.480
kind of neural network, which has layers

00:57:51.960 --> 00:57:55.119
one after the other, and then there was

00:57:53.480 --> 00:57:57.760
an output at the very end.

00:57:55.119 --> 00:57:59.519
So, what backprop does is

00:57:57.760 --> 00:58:00.440
it organizes the computation in the form

00:57:59.519 --> 00:58:01.920
of something called a computational

00:58:00.440 --> 00:58:03.679
graph, and the book has a good

00:58:01.920 --> 00:58:05.880
discussion about it. And so, what we do

00:58:03.679 --> 00:58:08.039
is we start at the very end.

00:58:05.880 --> 00:58:10.119
We calculate the gradient of the loss

00:58:08.039 --> 00:58:12.119
with respect to the output.

00:58:10.119 --> 00:58:13.960
Then we move left. We calculate the

00:58:12.119 --> 00:58:15.759
gradient of that output with respect to

00:58:13.960 --> 00:58:17.079
the output of just the prior hidden

00:58:15.760 --> 00:58:19.160
layer.

00:58:17.079 --> 00:58:20.559
Step to the left. Calculate the gradient

00:58:19.159 --> 00:58:22.719
of the current thing with respect to the

00:58:20.559 --> 00:58:25.159
previous layer. You get the idea, right?

00:58:22.719 --> 00:58:27.319
It's iterative and it moves backwards,

00:58:25.159 --> 00:58:30.879
and by doing so, you never repeat the

00:58:27.320 --> 00:58:32.920
same computation twice wastefully.

00:58:30.880 --> 00:58:34.400
That's the big advantage. You calculate

00:58:32.920 --> 00:58:35.519
once and reuse it many many many many

00:58:34.400 --> 00:58:37.200
times.

00:58:35.519 --> 00:58:39.639
The second advantage is that if you

00:58:37.199 --> 00:58:42.159
organize it this way, it just becomes a

00:58:39.639 --> 00:58:42.879
sequence of matrix multiplications.

00:58:42.159 --> 00:58:45.239
Okay.
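A minimal sketch of "start at the end, move left, reuse what you already computed" for a tiny one-hidden-layer network; this is my own illustration, not the book's worked example:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(3, 1))           # one input with 3 features
    y = 1.0                               # its label
    W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))   # hidden layer (ReLU)
    W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))   # output layer (sigmoid)

    # Forward pass: save the intermediate values, backprop will reuse them
    z1 = W1 @ x + b1
    h = np.maximum(z1, 0)                 # ReLU
    z2 = W2 @ h + b2
    p = 1.0 / (1.0 + np.exp(-z2))         # predicted probability

    # Backward pass: start at the loss and step left, layer by layer
    dz2 = p - y                           # d(BCE loss)/d(z2) for a sigmoid output
    dW2 = dz2 @ h.T                       # gradient for the output-layer weights
    db2 = dz2
    dh = W2.T @ dz2                       # reuse dz2 to push the gradient one layer left
    dz1 = dh * (z1 > 0)                   # ReLU gate
    dW1 = dz1 @ x.T                       # gradient for the hidden-layer weights
    db1 = dz1

    print(dW1.shape, dW2.shape)           # gradients for every weight, one backward sweep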

00:58:42.880 --> 00:58:46.240
And

00:58:45.239 --> 00:58:48.358
because it's a sequence of matrix

00:58:46.239 --> 00:58:51.879
multiplications and eliminates redundant

00:58:48.358 --> 00:58:53.199
calculations, and best of all,

00:58:51.880 --> 00:58:54.680
there are these things called GPUs,

00:58:53.199 --> 00:58:56.079
graphics processing units, originally

00:58:54.679 --> 00:58:57.119
invented to accelerate video game

00:58:56.079 --> 00:58:58.599
rendering.

00:58:57.119 --> 00:59:00.639
Uh and as it turns out, to accelerate

00:58:58.599 --> 00:59:02.159
video game rendering, the core math

00:59:00.639 --> 00:59:03.960
operation you do is basically a matrix

00:59:02.159 --> 00:59:05.319
multiplication. Right? Some linear

00:59:03.960 --> 00:59:07.599
algebra uh

00:59:05.320 --> 00:59:09.760
sort of operations. And so, someone

00:59:07.599 --> 00:59:11.920
really at some point had the bright idea

00:59:09.760 --> 00:59:13.440
for deep learning, calculating gradients

00:59:11.920 --> 00:59:14.920
and so on, we need to do matrix

00:59:13.440 --> 00:59:17.559
multiplications, and here is some

00:59:14.920 --> 00:59:19.200
specialized hardware that does really

00:59:17.559 --> 00:59:20.960
that does a fast job of matrix

00:59:19.199 --> 00:59:22.319
multiplications. Can we use

00:59:20.960 --> 00:59:24.440
this for that?

00:59:22.320 --> 00:59:26.200
And they did it. And all hell broke

00:59:24.440 --> 00:59:28.039
loose.

00:59:26.199 --> 00:59:30.000
That's literally what happened.

00:59:28.039 --> 00:59:32.279
And that's why Nvidia is valued at what,

00:59:30.000 --> 00:59:35.480
1.5 trillion or something.

00:59:32.280 --> 00:59:37.880
So, yeah. So, they are really good. And

00:59:35.480 --> 00:59:40.079
so, backprop

00:59:37.880 --> 00:59:42.680
the way you do backprop plus using it on

00:59:40.079 --> 00:59:44.400
GPUs leads to fast calculation of loss

00:59:42.679 --> 00:59:47.159
function gradients.

00:59:44.400 --> 00:59:49.039
If this thing were not true, this class

00:59:47.159 --> 00:59:50.279
would not exist.

00:59:49.039 --> 00:59:52.880
Because there won't be any deep learning

00:59:50.280 --> 00:59:56.840
revolution.

00:59:52.880 --> 00:59:56.840
This is a fundamental seminal reason.

00:59:57.880 --> 01:00:00.840
All right. So, the book has a bunch of

00:59:59.760 --> 01:00:01.880
detail

01:00:00.840 --> 01:00:05.600
um

01:00:01.880 --> 01:00:07.559
and I actually hand-

01:00:05.599 --> 01:00:09.599
worked out an example

01:00:07.559 --> 01:00:11.679
of calculating a gradient like the

01:00:09.599 --> 01:00:13.400
old-fashioned way and calculating it

01:00:11.679 --> 01:00:14.879
using backprop.

01:00:13.400 --> 01:00:17.200
So, take a look at it. I'll post it on

01:00:14.880 --> 01:00:18.519
Canvas and you will understand exactly

01:00:17.199 --> 01:00:21.519
where the savings come from, where the

01:00:18.519 --> 01:00:22.800
efficiency gains come from. Okay?

01:00:21.519 --> 01:00:25.239
Because of time, I'm not going to get

01:00:22.800 --> 01:00:25.240
into it now.

01:00:26.400 --> 01:00:30.400
All right. Any questions so far?

01:00:28.840 --> 01:00:32.600
Yep.

01:00:30.400 --> 01:00:34.559
Sorry, a follow-up to that: so, we've

01:00:32.599 --> 01:00:36.239
done gradient descent, which is

01:00:34.559 --> 01:00:37.840
different than calculation of the

01:00:36.239 --> 01:00:39.239
gradient of the loss function. What

01:00:37.840 --> 01:00:41.039
is the purpose of the calculation of the

01:00:39.239 --> 01:00:42.519
gradient of the loss function? You

01:00:41.039 --> 01:00:44.159
calculate the gradient because the

01:00:42.519 --> 01:00:47.039
fundamental operation of gradient

01:00:44.159 --> 01:00:48.199
descent is to take your current value of

01:00:47.039 --> 01:00:50.159
W

01:00:48.199 --> 01:00:52.919
and modify it slightly and the

01:00:50.159 --> 01:00:56.000
modification is old value minus learning

01:00:52.920 --> 01:00:56.000
rate times gradient.

01:01:03.360 --> 01:01:06.280
It'd be cool, right, if I say, "Go mo-

01:01:04.960 --> 01:01:08.400
go back five slides to this thing." and

01:01:06.280 --> 01:01:09.880
it just goes back. Product idea. Anyone

01:01:08.400 --> 01:01:11.840
startups?

01:01:09.880 --> 01:01:14.320
So.

01:01:11.840 --> 01:01:15.360
So, this one.

01:01:14.320 --> 01:01:16.920
So, this is the fundamental step of

01:01:15.360 --> 01:01:19.280
gradient descent.

01:01:16.920 --> 01:01:20.720
So, this is the current value of W.

01:01:19.280 --> 01:01:22.000
You calculate the gradient at that

01:01:20.719 --> 01:01:24.159
current value

01:01:22.000 --> 01:01:26.199
multiplied by alpha do this thing and

01:01:24.159 --> 01:01:27.440
you get the new value.

01:01:26.199 --> 01:01:29.879
And you keep repeating.

01:01:27.440 --> 01:01:32.240
Right, but G(W),

01:01:29.880 --> 01:01:33.559
that's not the loss function.

01:01:32.239 --> 01:01:34.039
>> It is the loss function. That is the

01:01:33.559 --> 01:01:35.960
loss function.

01:01:34.039 --> 01:01:37.880
>> Yeah, right. Here, I'm just using G as

01:01:35.960 --> 01:01:39.880
an arbitrary function

01:01:37.880 --> 01:01:41.599
to just to demonstrate the point. But

01:01:39.880 --> 01:01:42.880
when you're optimizing, when you're

01:01:41.599 --> 01:01:45.519
training a neural network, what you're

01:01:42.880 --> 01:01:46.800
actually doing is minimizing a loss

01:01:45.519 --> 01:01:49.320
function. Right.

01:01:46.800 --> 01:01:51.360
>> Loss of W. Sorry, I got things mixed up.

01:01:49.320 --> 01:01:53.000
Thank you.

01:01:51.360 --> 01:01:54.680
>> Yeah.

01:01:53.000 --> 01:01:55.639
Uh how do we define the initial weights

01:01:54.679 --> 01:01:57.279
for the neural network?

01:01:55.639 --> 01:02:01.639
>> Ah.

01:01:57.280 --> 01:02:01.640
So, yeah, the initial weights um

01:02:02.199 --> 01:02:04.919
So, there are many ways to do this. So,

01:02:04.000 --> 01:02:06.119
first of all, they are initialized

01:02:04.920 --> 01:02:08.119
randomly.

01:02:06.119 --> 01:02:09.920
Uh but randomly doesn't mean you can

01:02:08.119 --> 01:02:11.839
just pick any random weight. There are

01:02:09.920 --> 01:02:13.519
actually some good ways to randomly pick

01:02:11.840 --> 01:02:16.240
the weights. Uh those are called

01:02:13.519 --> 01:02:18.199
initialization schemes. Um and there are

01:02:16.239 --> 01:02:19.359
a bunch of very effective initialization

01:02:18.199 --> 01:02:21.119
schemes people have figured out over the

01:02:19.360 --> 01:02:22.880
years and those things are baked into

01:02:21.119 --> 01:02:24.880
Keras as the default.

01:02:22.880 --> 01:02:26.079
So, Keras, I believe, uses something

01:02:24.880 --> 01:02:27.960
called the

01:02:26.079 --> 01:02:31.199
uh He initialization, H E

01:02:27.960 --> 01:02:33.039
initialization, or the Xavier Glorot

01:02:31.199 --> 01:02:33.839
initialization. I wouldn't worry about

01:02:33.039 --> 01:02:36.000
it. Just go with the default

01:02:33.840 --> 01:02:37.519
initialization.
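A small sketch of what those defaults look like in Keras (the layer sizes here are arbitrary): Dense layers default to Glorot (Xavier) uniform initialization, and He initialization can be requested by name.

```python
from tensorflow import keras

default_layer = keras.layers.Dense(32, activation="relu")       # kernel_initializer defaults to "glorot_uniform"
he_layer = keras.layers.Dense(32, activation="relu",
                              kernel_initializer="he_normal")   # explicit He initialization, often paired with ReLU
```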

01:02:36.000 --> 01:02:38.679
The reason why they have to be very

01:02:37.519 --> 01:02:40.880
careful about how these weights are

01:02:38.679 --> 01:02:43.039
initialized is because if you have a

01:02:40.880 --> 01:02:45.200
very big network and if you initialize

01:02:43.039 --> 01:02:47.679
badly then

01:02:45.199 --> 01:02:48.919
the gradient will just explode as you

01:02:47.679 --> 01:02:50.440
calculate it.

01:02:48.920 --> 01:02:52.480
The earlier layers, the weights will

01:02:50.440 --> 01:02:53.720
have massive gradients or the gradients

01:02:52.480 --> 01:02:55.119
will vanish.

01:02:53.719 --> 01:02:56.319
So, they're called the exploding

01:02:55.119 --> 01:02:58.239
gradient problem or the vanishing

01:02:56.320 --> 01:02:59.240
gradient problem. To avoid all those

01:02:58.239 --> 01:03:00.719
things, researchers have figured out

01:02:59.239 --> 01:03:03.599
some clever way to initialize so that

01:03:00.719 --> 01:03:05.359
it's well-behaved throughout.
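A toy illustration of why the scale matters (the factors are made up): backprop multiplies roughly one factor per layer, so anything consistently above or below 1 compounds exponentially with depth.

```python
depth = 50
print(1.5 ** depth)   # ~6.4e8   -> the "exploding gradient" regime
print(0.5 ** depth)   # ~8.9e-16 -> the "vanishing gradient" regime
```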

01:03:03.599 --> 01:03:08.400
Yep.

01:03:05.360 --> 01:03:10.360
If using um backprop and GPUs was so

01:03:08.400 --> 01:03:12.440
critical, I'm just curious like who

01:03:10.360 --> 01:03:14.760
first did it and when? Was this like a

01:03:12.440 --> 01:03:15.119
couple years ago? Was it a company? Was

01:03:14.760 --> 01:03:17.520
it a Yeah.

01:03:15.119 --> 01:03:20.199
>> Yeah. Well, GPUs have been used for deep

01:03:17.519 --> 01:03:22.400
learning, I want to say um

01:03:20.199 --> 01:03:26.279
I think the first uh case may have been

01:03:22.400 --> 01:03:27.920
around 2005 or 2006, that sort of thing.

01:03:26.280 --> 01:03:30.000
But I would say that it sort of burst

01:03:27.920 --> 01:03:32.800
out onto the world stage and made

01:03:30.000 --> 01:03:35.000
everyone take notice when uh a deep

01:03:32.800 --> 01:03:38.519
learning model called AlexNet

01:03:35.000 --> 01:03:40.440
in 2012 won a very famous

01:03:38.519 --> 01:03:43.320
computer vision competition.

01:03:40.440 --> 01:03:45.079
Uh and it set a world

01:03:43.320 --> 01:03:46.200
record for how good it was.

01:03:45.079 --> 01:03:48.039
Uh and that's when everyone was like,

01:03:46.199 --> 01:03:49.119
"Hey, what is this thing?" And that's

01:03:48.039 --> 01:03:50.719
really when it burst onto the world

01:03:49.119 --> 01:03:51.880
stage. I'll talk a bit more about it

01:03:50.719 --> 01:03:54.119
when I get into the computer vision

01:03:51.880 --> 01:03:55.480
segment of the class.

01:03:54.119 --> 01:03:58.759
But you can Google AlexNet and you'll

01:03:55.480 --> 01:03:58.760
find a whole bunch of history around it.

01:03:59.599 --> 01:04:04.920
I believe that if you do this, is it

01:04:00.760 --> 01:04:06.040
true that you could get to a global minimum,

01:04:04.920 --> 01:04:07.840
that would mean there would be no

01:04:06.039 --> 01:04:09.840
hallucinations?

01:04:07.840 --> 01:04:11.920
Aha, good question.

01:04:09.840 --> 01:04:13.120
So, is it perfect

01:04:11.920 --> 01:04:14.519
if you get to a global minimum? First of

01:04:13.119 --> 01:04:15.880
all, a global minimum doesn't mean the

01:04:14.519 --> 01:04:17.199
model is perfect, right? It may still

01:04:15.880 --> 01:04:18.400
have some loss.

01:04:17.199 --> 01:04:21.119
Um

01:04:18.400 --> 01:04:24.000
but the global minimum is going to be on the

01:04:21.119 --> 01:04:24.000
training data.

01:04:24.199 --> 01:04:28.519
You can imagine that the test data,

01:04:26.280 --> 01:04:29.480
future data has its own loss function,

01:04:28.519 --> 01:04:31.000
right?

01:04:29.480 --> 01:04:34.599
So, what is minimum here may not be

01:04:31.000 --> 01:04:34.599
minimum there. That's the problem.

01:04:36.440 --> 01:04:40.280
Is that a comment? No, okay.

01:04:38.800 --> 01:04:42.280
Just saying that

01:04:40.280 --> 01:04:43.240
uh that would mean that also you can be

01:04:42.280 --> 01:04:45.200
overfitting.

01:04:43.239 --> 01:04:47.119
>> Correct. Exactly. Exactly. So, if you

01:04:45.199 --> 01:04:48.960
overdo, if you find the best thing in

01:04:47.119 --> 01:04:50.960
the training loss function, chances are it

01:04:48.960 --> 01:04:52.000
doesn't match the best thing of the test

01:04:50.960 --> 01:04:53.358
data.

01:04:52.000 --> 01:04:55.880
So, on the test data, you're actually

01:04:53.358 --> 01:04:55.880
doing badly.

01:04:56.440 --> 01:05:00.880
Okay. So,

01:04:57.960 --> 01:05:00.880
uh come back to this.

01:05:03.800 --> 01:05:08.240
Okay. Now, uh the final uh twist in the

01:05:06.199 --> 01:05:10.039
tale here: uh we're going to go from

01:05:08.239 --> 01:05:11.839
gradient descent to something

01:05:10.039 --> 01:05:14.639
called stochastic gradient descent. And

01:05:11.840 --> 01:05:16.400
stochastic gradient descent or SGD is

01:05:14.639 --> 01:05:17.480
the workhorse for all deep learning.

01:05:16.400 --> 01:05:19.639
Okay?

01:05:17.480 --> 01:05:20.679
And funnily enough, SGD is simpler than

01:05:19.639 --> 01:05:21.839
GD.

01:05:20.679 --> 01:05:23.799
Okay? Just when you thought it couldn't

01:05:21.840 --> 01:05:25.280
get simpler, right?

01:05:23.800 --> 01:05:27.400
Okay. So,

01:05:25.280 --> 01:05:28.640
So, for large data sets, computing the

01:05:27.400 --> 01:05:31.440
gradient of the loss function can be

01:05:28.639 --> 01:05:32.920
very expensive. Right? Needless to say.

01:05:31.440 --> 01:05:34.519
Because it has to be done at every step

01:05:32.920 --> 01:05:36.760
and the cardinality of the data set is

01:05:34.519 --> 01:05:38.079
really big. Right? And you may have, I

01:05:36.760 --> 01:05:39.480
don't know, billions of parameters. It's

01:05:38.079 --> 01:05:43.119
just very, very

01:05:39.480 --> 01:05:45.679
tough to compute it even with backprop.

01:05:43.119 --> 01:05:47.519
So, the solution is at each iteration,

01:05:45.679 --> 01:05:50.119
when I say iteration, I'm talking about

01:05:47.519 --> 01:05:52.599
this step of gradient descent.

01:05:50.119 --> 01:05:54.599
Instead of using all the data

01:05:52.599 --> 01:05:57.358
instead of calculating the loss function

01:05:54.599 --> 01:05:59.480
by averaging the loss across all N data

01:05:57.358 --> 01:06:01.880
points and then calculating the gradient

01:05:59.480 --> 01:06:04.440
of that thing, what you do is you just

01:06:01.880 --> 01:06:06.480
choose a small sample randomly. You

01:06:04.440 --> 01:06:08.400
choose just a few of the N observations

01:06:06.480 --> 01:06:10.159
and we call it a mini batch.

01:06:08.400 --> 01:06:11.599
So, for example, the number of data

01:06:10.159 --> 01:06:12.639
points: you may have 10 billion

01:06:11.599 --> 01:06:14.000
data points

01:06:12.639 --> 01:06:16.559
but in every iteration, you may

01:06:14.000 --> 01:06:18.119
literally grab just like 32 or 64,

01:06:16.559 --> 01:06:20.199
something really small.

01:06:18.119 --> 01:06:21.199
Like absurdly small.

01:06:20.199 --> 01:06:23.000
Okay?

01:06:21.199 --> 01:06:24.799
And then you pretend that okay, that's

01:06:23.000 --> 01:06:27.159
all the data I have. You calculate the

01:06:24.800 --> 01:06:30.359
loss, find the gradient and just use

01:06:27.159 --> 01:06:33.199
that here instead.

01:06:30.358 --> 01:06:36.799
Okay? So, this is called stochastic

01:06:33.199 --> 01:06:39.159
gradient descent. So, strictly speaking

01:06:36.800 --> 01:06:40.680
theoretically, SGD uses just one data

01:06:39.159 --> 01:06:42.079
point.

01:06:40.679 --> 01:06:44.599
But in practice, we use what's called a

01:06:42.079 --> 01:06:47.039
mini batch, 32, 64, whatever.

01:06:44.599 --> 01:06:48.319
Uh and so, mini batch gradient descent

01:06:47.039 --> 01:06:51.719
is just loosely called stochastic

01:06:48.320 --> 01:06:51.720
gradient descent, SGD.
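A minimal NumPy sketch of a mini-batch SGD loop (the data, the linear model, and the mean-squared-error loss are placeholders; in a real network, backprop would supply the gradient):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))                              # pretend this is the full dataset
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=10_000)

w = np.zeros(5)                                               # initial weights
alpha, batch_size = 0.01, 32                                  # learning rate, mini-batch size

for step in range(2_000):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # grab a small random mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size              # gradient of the batch's mean squared error
    w = w - alpha * grad                                      # the usual gradient-descent step
```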

01:06:52.719 --> 01:06:57.559
So, SGD, as it turns out

01:06:55.679 --> 01:06:58.799
you can see it's clearly very efficient,

01:06:57.559 --> 01:07:00.960
right? Because

01:06:58.800 --> 01:07:02.519
it's just processing a few at a time.

01:07:00.960 --> 01:07:03.559
Uh and in fact, if you have a lot of

01:07:02.519 --> 01:07:05.159
data

01:07:03.559 --> 01:07:07.119
and you calculate the full gradient of

01:07:05.159 --> 01:07:09.319
the loss function, it may not even fit

01:07:07.119 --> 01:07:11.319
into memory.

01:07:09.320 --> 01:07:12.880
Right? It's really problematic. But with

01:07:11.320 --> 01:07:14.359
SGD, it says, "I don't care whether you

01:07:12.880 --> 01:07:17.400
have a billion data points or a trillion

01:07:14.358 --> 01:07:19.199
data points. Just give me 32 at a time."

01:07:17.400 --> 01:07:20.720
Okay? And you just keep on doing it.

01:07:19.199 --> 01:07:22.639
And

01:07:20.719 --> 01:07:24.719
turns out, because not all the points

01:07:22.639 --> 01:07:26.679
are used in the calculation, this only

01:07:24.719 --> 01:07:27.919
approximates the true gradient. Right?

01:07:26.679 --> 01:07:29.919
It's only an approximation. It's not the

01:07:27.920 --> 01:07:32.079
real thing. It's only an approximation.

01:07:29.920 --> 01:07:33.760
But it works extremely well in practice.

01:07:32.079 --> 01:07:34.960
Extremely well in practice.

01:07:33.760 --> 01:07:37.359
And there's a whole bunch of research

01:07:34.960 --> 01:07:39.079
that goes into why is it so effective?

01:07:37.358 --> 01:07:40.920
And you know, people are discovering

01:07:39.079 --> 01:07:42.599
interesting things about SGD, but we

01:07:40.920 --> 01:07:44.680
don't have like a definitive theory as

01:07:42.599 --> 01:07:46.039
to why it's so good yet. We have some

01:07:44.679 --> 01:07:47.799
interesting, you know, uh research

01:07:46.039 --> 01:07:50.000
threads that have happened.

01:07:47.800 --> 01:07:51.840
And very tantalizingly, very

01:07:50.000 --> 01:07:53.920
tantalizingly

01:07:51.840 --> 01:07:55.640
because it's only an approximation of

01:07:53.920 --> 01:07:59.480
the true gradient

01:07:55.639 --> 01:08:00.480
SGD can actually escape local minima.

01:07:59.480 --> 01:08:02.240
So,

01:08:00.480 --> 01:08:04.159
in the true loss function, you're

01:08:02.239 --> 01:08:06.679
at a local minimum

01:08:04.159 --> 01:08:08.519
but in SGD's loss function, when you're

01:08:06.679 --> 01:08:11.440
doing SGD, you're reaching the

01:08:08.519 --> 01:08:13.159
minimum of the SGD loss function

01:08:11.440 --> 01:08:14.920
which may not be a minimum of the actual

01:08:13.159 --> 01:08:16.798
loss function. So, as you're moving

01:08:14.920 --> 01:08:18.359
around, you're actually jumping from

01:08:16.798 --> 01:08:20.359
local minima to local minima of the

01:08:18.359 --> 01:08:22.039
actual loss function.

01:08:20.359 --> 01:08:24.039
I know that's a mouthful. I'm happy to

01:08:22.039 --> 01:08:25.319
tell you more. It's just a side thing

01:08:24.039 --> 01:08:26.560
that I just wanted you to be aware of.

01:08:25.319 --> 01:08:27.960
Okay?

01:08:26.560 --> 01:08:30.640
One of the reasons why SGD is actually

01:08:27.960 --> 01:08:33.838
effective. It's almost like you work

01:08:30.640 --> 01:08:33.838
less and you do better.

01:08:34.000 --> 01:08:38.159
How many times does it happen in life?

01:08:35.680 --> 01:08:38.159
This is one of them.

01:08:39.520 --> 01:08:44.359
Okay? Now, SGD comes in many flavors.

01:08:42.798 --> 01:08:45.680
Uh many siblings. It's got a lot of

01:08:44.359 --> 01:08:47.520
siblings and variations. It's a big

01:08:45.680 --> 01:08:49.838
family. Uh and we're going to use a

01:08:47.520 --> 01:08:52.040
particular flavor called Adam

01:08:49.838 --> 01:08:53.159
as our default in this course and I'll

01:08:52.039 --> 01:08:56.000
get back to it when we get into the

01:08:53.159 --> 01:08:57.119
Colabs and things like that.
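A hedged sketch of what picking a "flavor" looks like in Keras (the learning rates here are just illustrative defaults):

```python
from tensorflow import keras

sgd = keras.optimizers.SGD(learning_rate=0.01)      # plain mini-batch SGD
adam = keras.optimizers.Adam(learning_rate=0.001)   # Adam: an adaptive SGD variant, our course default
# Either object can be passed as the optimizer argument to model.compile(...)
```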

01:08:56.000 --> 01:08:58.159
All right.

01:08:57.119 --> 01:09:00.039
Um

01:08:58.159 --> 01:09:01.519
By the way

01:09:00.039 --> 01:09:02.600
you know how you know all these pictures

01:09:01.520 --> 01:09:04.600
I've been showing you a nice little

01:09:02.600 --> 01:09:05.440
function like that, a little bowl and so

01:09:04.600 --> 01:09:07.359
on.

01:09:05.439 --> 01:09:08.960
This is a visualization

01:09:07.359 --> 01:09:11.400
of an actual neural network loss

01:09:08.960 --> 01:09:12.838
function.

01:09:11.399 --> 01:09:14.920
You can see like the hills and valleys

01:09:12.838 --> 01:09:16.798
and the cracks and so on and so forth.

01:09:14.920 --> 01:09:18.600
Okay? And you can check out the paper to

01:09:16.798 --> 01:09:19.359
get more insight into how they actually,

01:09:18.600 --> 01:09:21.680
you know, came up with this

01:09:19.359 --> 01:09:24.280
visualization. It's crazy.

01:09:21.680 --> 01:09:25.520
It's complicated.

01:09:24.279 --> 01:09:28.439
Yep.

01:09:25.520 --> 01:09:30.920
So, for SGD, do you perform the

01:09:28.439 --> 01:09:32.599
iterations until you minimize the loss

01:09:30.920 --> 01:09:34.440
function for each mini batch and then

01:09:32.600 --> 01:09:36.520
move to another mini batch? Yeah, so

01:09:34.439 --> 01:09:37.719
what you do is you take each mini batch

01:09:36.520 --> 01:09:39.440
and then

01:09:37.720 --> 01:09:41.560
you calculate the loss for the mini

01:09:39.439 --> 01:09:43.679
batch, you find the gradient.

01:09:41.560 --> 01:09:45.319
And use the gradient and update the W.

01:09:43.680 --> 01:09:47.119
Then you pick up the next mini batch. So

01:09:45.319 --> 01:09:48.920
you don't pick a mini batch

01:09:47.119 --> 01:09:50.920
and try to perform the iterations on

01:09:48.920 --> 01:09:52.838
that mini batch until you reach the

01:09:50.920 --> 01:09:54.840
Each mini batch, one iteration. Each

01:09:52.838 --> 01:09:56.359
mini batch, one iteration. Because if

01:09:54.840 --> 01:09:57.600
you do a lot of iterations on one mini

01:09:56.359 --> 01:09:58.759
batch,

01:09:57.600 --> 01:09:59.640
first of all, you'll never be sure that

01:09:58.760 --> 01:10:00.960
you're going to find any optimal

01:09:59.640 --> 01:10:03.079
solution because you're not guaranteed

01:10:00.960 --> 01:10:04.039
of any global minimum. And secondly, it's

01:10:03.079 --> 01:10:05.960
much better for you to get new

01:10:04.039 --> 01:10:07.399
information constantly because what you

01:10:05.960 --> 01:10:09.439
can do is you can revisit that mini

01:10:07.399 --> 01:10:10.799
batch later on.

01:10:09.439 --> 01:10:13.039
Right? And that gets into these things

01:10:10.800 --> 01:10:14.239
called epochs and batch size and so on,

01:10:13.039 --> 01:10:16.359
which we'll get into a lot of gory

01:10:14.239 --> 01:10:17.880
detail when we do the Colab.
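As a rough sketch of that bookkeeping (the names are illustrative, and grad_fn stands in for whatever backprop computes): each mini-batch gets exactly one update, and one full pass over all mini-batches is an epoch.

```python
import numpy as np

def train(X, y, w, grad_fn, alpha=0.01, batch_size=32, epochs=5, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    for epoch in range(epochs):
        order = rng.permutation(n)                      # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]       # the next mini-batch
            w = w - alpha * grad_fn(X[idx], y[idx], w)  # one gradient step, then move on
    return w
```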

01:10:16.359 --> 01:10:20.359
So let's revisit that question. It's a

01:10:17.880 --> 01:10:20.359
good question.

01:10:20.439 --> 01:10:25.439
Yeah.

01:10:22.520 --> 01:10:26.880
When you do the backprop process, Very

01:10:25.439 --> 01:10:27.960
good. Backprop. Not backpropagation.

01:10:26.880 --> 01:10:29.039
Nice. I made sure.

01:10:27.960 --> 01:10:30.840
>> Yes.

01:10:29.039 --> 01:10:32.760
Well, it sounded like you started

01:10:30.840 --> 01:10:35.159
from the layers that were closest to the

01:10:32.760 --> 01:10:36.920
output and you went backward. Okay. And

01:10:35.159 --> 01:10:39.479
um my question is are you doing that

01:10:36.920 --> 01:10:39.760
once or is it looping multiple times and

01:10:39.479 --> 01:10:42.439
then

01:10:39.760 --> 01:10:44.600
>> do it once. Just once. Yeah. So for each

01:10:42.439 --> 01:10:45.960
gradient calculation, you do it once.

01:10:44.600 --> 01:10:47.680
Why does it want to start

01:10:45.960 --> 01:10:48.560
from the layer that's closest or why do

01:10:47.680 --> 01:10:49.800
you want to start it from the layer

01:10:48.560 --> 01:10:51.280
that's closest to the output?

01:10:49.800 --> 01:10:53.239
>> Yeah. So basically what happens is let's

01:10:51.279 --> 01:10:54.920
say, just for argument, that you

01:10:53.239 --> 01:10:56.800
go in the reverse direction.

01:10:54.920 --> 01:10:58.279
You will discover that a lot of paths to

01:10:56.800 --> 01:10:59.960
go from the left to the right will end

01:10:58.279 --> 01:11:02.439
up calculating certain intermediate

01:10:59.960 --> 01:11:04.720
quantities including the very final

01:11:02.439 --> 01:11:06.559
gradient sort of item

01:11:04.720 --> 01:11:07.760
again and again and again.

01:11:06.560 --> 01:11:09.280
Same thing is going to get calculated

01:11:07.760 --> 01:11:10.520
again and again and again. So by

01:11:09.279 --> 01:11:12.159
starting from the end and working

01:11:10.520 --> 01:11:14.320
backwards, you just reuse stuff you've

01:11:12.159 --> 01:11:15.920
already calculated.

01:11:14.319 --> 01:11:17.960
So that is sort of the rough idea. But

01:11:15.920 --> 01:11:19.440
if you see my PDF, I've actually worked

01:11:17.960 --> 01:11:22.399
out the example and you and that will

01:11:19.439 --> 01:11:22.399
demonstrate what I'm talking about.

01:11:23.359 --> 01:11:28.319
By the way, the backprop

01:11:25.119 --> 01:11:28.319
is just a sort of a

01:11:28.600 --> 01:11:31.760
Like in calculus, we have something

01:11:29.920 --> 01:11:32.600
called the chain rule.

01:11:31.760 --> 01:11:34.400
To calculate the derivative of a

01:11:32.600 --> 01:11:35.960
complicated function, you calculate the

01:11:34.399 --> 01:11:37.479
derivative of, like, the outer

01:11:35.960 --> 01:11:39.239
function then the inner function and so

01:11:37.479 --> 01:11:40.799
on and so forth. The backprop is

01:11:39.239 --> 01:11:42.840
essentially a way to organize the chain

01:11:40.800 --> 01:11:46.279
rule to work with the neural network

01:11:42.840 --> 01:11:46.279
layer-by-layer architecture. That's all.
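For concreteness, a two-layer version of that statement (the notation here is mine, not the book's): for a loss L(w1, w2) = ℓ(f2(f1(x; w1); w2)), the chain rule gives

```latex
\frac{\partial L}{\partial w_2}
  = \frac{\partial \ell}{\partial f_2}\,\frac{\partial f_2}{\partial w_2},
\qquad
\frac{\partial L}{\partial w_1}
  = \underbrace{\frac{\partial \ell}{\partial f_2}\,\frac{\partial f_2}{\partial f_1}}_{\text{reused from the later layer}}
    \;\frac{\partial f_1}{\partial w_1},
```

so the factor computed at the output layer is reused, not recomputed, for every earlier layer.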

01:11:49.520 --> 01:11:54.120
So is it fair to say that once we

01:11:51.960 --> 01:11:56.560
are finding like the local minimum, we

01:11:54.119 --> 01:11:58.079
are not optimizing to all the GWs

01:11:56.560 --> 01:11:59.400
because like this local minimum is

01:11:58.079 --> 01:12:01.239
coming like from different curves, from

01:11:59.399 --> 01:12:02.920
different lines. So

01:12:01.239 --> 01:12:04.760
Is that fair to say? When we are using

01:12:02.920 --> 01:12:06.640
stochastic gradient descent, yes. So

01:12:04.760 --> 01:12:09.360
in stochastic gradient descent, when you

01:12:06.640 --> 01:12:10.880
take say 32 data points from a million

01:12:09.359 --> 01:12:12.960
and you're calculating the loss for that

01:12:10.880 --> 01:12:14.880
32 data points, you're basically trying

01:12:12.960 --> 01:12:17.039
to do a gradient step.

01:12:14.880 --> 01:12:20.000
Right? The W equals W minus alpha

01:12:17.039 --> 01:12:22.680
gradient thing. You're doing it for that

01:12:20.000 --> 01:12:24.720
32-point loss function.

01:12:22.680 --> 01:12:25.840
Right? Which is not the 1 million points

01:12:24.720 --> 01:12:27.680
loss function.

01:12:25.840 --> 01:12:29.279
That's why it's approximate.

01:12:27.680 --> 01:12:31.640
But the approximation, instead of

01:12:29.279 --> 01:12:33.719
hurting you, actually helps you because

01:12:31.640 --> 01:12:35.640
it helps you escape the local minima of

01:12:33.720 --> 01:12:37.000
the global loss function.

01:12:35.640 --> 01:12:38.640
So it's it's sort of an interesting and

01:12:37.000 --> 01:12:40.159
somewhat technically subtle point, which

01:12:38.640 --> 01:12:41.920
is why I'm not getting into it too much,

01:12:40.159 --> 01:12:44.119
but I'm happy to give pointers if people

01:12:41.920 --> 01:12:45.680
are interested. Yeah?

01:12:44.119 --> 01:12:47.319
Uh when you say you initialize the

01:12:45.680 --> 01:12:50.039
weights, you initialize for the whole

01:12:47.319 --> 01:12:51.119
network or just the end layer and then

01:12:50.039 --> 01:12:52.119
go backwards like you

01:12:51.119 --> 01:12:53.880
>> No, you initialize everything in one

01:12:52.119 --> 01:12:54.840
shot.

01:12:53.880 --> 01:12:55.960
Because if you don't initialize

01:12:54.840 --> 01:12:57.760
everything in one shot, what's going to

01:12:55.960 --> 01:12:58.960
happen is that you can't do like the

01:12:57.760 --> 01:13:00.560
forward computation to find the

01:12:58.960 --> 01:13:02.720
prediction.

01:13:00.560 --> 01:13:05.080
Uh and so they are done independently

01:13:02.720 --> 01:13:07.159
and the initialization schemes will take

01:13:05.079 --> 01:13:08.680
into account, okay, I'm initializing the

01:13:07.159 --> 01:13:10.720
weights between a layer which has 10

01:13:08.680 --> 01:13:12.280
nodes on one side and 32 on the

01:13:10.720 --> 01:13:13.240
other side and the 10 and the 32

01:13:12.279 --> 01:13:15.800
actually play a role in how you

01:13:13.239 --> 01:13:15.800
initialize.

01:13:15.960 --> 01:13:19.960
Okay. So um so the summary of the

01:13:18.279 --> 01:13:22.840
overall training flow

01:13:19.960 --> 01:13:24.359
is that, you know, you have an input.

01:13:22.840 --> 01:13:26.079
It goes through a bunch of layers. You

01:13:24.359 --> 01:13:28.319
come up with a prediction. You compare

01:13:26.079 --> 01:13:29.600
it to the true values and these two

01:13:28.319 --> 01:13:31.679
things go into the loss function

01:13:29.600 --> 01:13:33.600
calculation. You get a loss number.

01:13:31.680 --> 01:13:35.480
Right? And you do it for say 10 points

01:13:33.600 --> 01:13:38.000
or 32 points or a million points. And

01:13:35.479 --> 01:13:39.959
this loss thing goes into the optimizer,

01:13:38.000 --> 01:13:41.640
which calculates the gradient. And once

01:13:39.960 --> 01:13:44.159
it calculates the gradient, it updates

01:13:41.640 --> 01:13:45.880
the weights of every layer using the W

01:13:44.159 --> 01:13:47.760
equals W minus alpha times gradient

01:13:45.880 --> 01:13:48.920
formula, gradient descent formula. And

01:13:47.760 --> 01:13:50.440
then you keep it doing this again and

01:13:48.920 --> 01:13:53.000
again and again.

01:13:50.439 --> 01:13:54.439
This is the overall flow.
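A hedged end-to-end sketch of that loop in Keras (the 13 features and the random stand-in data are placeholders, not the actual heart-disease table):

```python
import numpy as np
from tensorflow import keras

X = np.random.rand(303, 13).astype("float32")                # placeholder inputs
y = np.random.randint(0, 2, size=(303,)).astype("float32")   # placeholder 0/1 labels

model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(13,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),             # the prediction
])
model.compile(optimizer="adam",                              # optimizer: computes gradients via backprop
              loss="binary_crossentropy",                    # loss: compares prediction to true values
              metrics=["accuracy"])
model.fit(X, y, batch_size=32, epochs=10)                    # repeat: forward pass -> loss -> gradient -> weight update
```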

01:13:53.000 --> 01:13:56.359
This is how our little network is going

01:13:54.439 --> 01:14:00.039
to get built for heart disease

01:13:56.359 --> 01:14:00.039
prediction. This is how GPT-4 was built.

01:14:00.720 --> 01:14:04.240
And this is how AlphaFold was built.

01:14:02.720 --> 01:14:06.720
And AlphaGo was built.

01:14:04.239 --> 01:14:06.719
You get the idea.

01:14:07.359 --> 01:14:10.799
I mean, it's astonishing, frankly.

01:14:09.479 --> 01:14:12.359
If you're not getting goosebumps at the

01:14:10.800 --> 01:14:14.239
thought that this simple thing can do

01:14:12.359 --> 01:14:17.159
all these complicated things, we really

01:14:14.239 --> 01:14:20.359
need to talk offline.

01:14:17.159 --> 01:14:23.119
Uh there was a hand raised here. Yeah.

01:14:20.359 --> 01:14:25.759
Sorry. Just quickly, this is for each

01:14:23.119 --> 01:14:27.159
mini batch, right? So

01:14:25.760 --> 01:14:28.680
my question is, if you come up with

01:14:27.159 --> 01:14:30.199
different weights for each mini batch,

01:14:28.680 --> 01:14:31.520
how do you

01:14:30.199 --> 01:14:33.800
add it up?

01:14:31.520 --> 01:14:35.400
Like, okay, this weight is the

01:14:33.800 --> 01:14:37.880
perfect combination for this mini batch,

01:14:35.399 --> 01:14:39.559
but you have a different

01:14:37.880 --> 01:14:41.560
weight for another mini batch. How do

01:14:39.560 --> 01:14:43.360
you combine those two? No.

01:14:41.560 --> 01:14:45.400
At each step, what you do is you

01:14:43.359 --> 01:14:46.519
start with

01:14:45.399 --> 01:14:48.000
a weight.

01:14:46.520 --> 01:14:49.320
You run it through for a mini batch. You

01:14:48.000 --> 01:14:50.680
come up with the loss function. You

01:14:49.319 --> 01:14:51.880
calculate the gradient.

01:14:50.680 --> 01:14:53.159
And now using the gradient, you've

01:14:51.880 --> 01:14:54.159
updated the weight. Now you have a new

01:14:53.159 --> 01:14:55.559
set of weights, right? Which is the

01:14:54.159 --> 01:14:57.680
updated weights. Call it

01:14:55.560 --> 01:14:59.480
W2 instead of W1.

01:14:57.680 --> 01:15:00.680
Now W2 is your network, and when you

01:14:59.479 --> 01:15:03.559
take the next mini batch, it's going to

01:15:00.680 --> 01:15:05.240
use W2 to calculate the prediction.

01:15:03.560 --> 01:15:08.800
And this this whole flow will become a

01:15:05.239 --> 01:15:11.840
lot clearer when we do the Colabs.

01:15:08.800 --> 01:15:13.360
Okay. So we have 3 minutes.

01:15:11.840 --> 01:15:15.720
I don't want to go into

01:15:13.359 --> 01:15:19.039
regularization and overfitting in 3 minutes.

01:15:15.720 --> 01:15:19.039
So let's have some more questions.

01:15:19.680 --> 01:15:22.600
Yeah.

01:15:20.640 --> 01:15:25.200
Can you use any activation function as

01:15:22.600 --> 01:15:26.760
long as it gives like positive values?

01:15:25.199 --> 01:15:29.679
For like X squared or mod X or

01:15:26.760 --> 01:15:31.400
something. Um you can use a variety of

01:15:29.680 --> 01:15:33.320
activation functions.

01:15:31.399 --> 01:15:35.519
Um

01:15:33.319 --> 01:15:37.319
Uh, but yeah, there's a whole

01:15:35.520 --> 01:15:38.640
literature on, you know, the pros and

01:15:37.319 --> 01:15:39.840
cons of various activation functions

01:15:38.640 --> 01:15:42.520
that you could use.

01:15:39.840 --> 01:15:44.760
But in general, you have to make sure of

01:15:42.520 --> 01:15:46.880
a couple of things. One is that when you

01:15:44.760 --> 01:15:48.360
do backprop,

01:15:46.880 --> 01:15:49.520
the gradient is going to flow through

01:15:48.359 --> 01:15:50.639
the activation function in the reverse

01:15:49.520 --> 01:15:52.200
direction.

01:15:50.640 --> 01:15:53.720
And the activation function should

01:15:52.199 --> 01:15:55.439
actually sort of make sure the gradient

01:15:53.720 --> 01:15:56.800
doesn't get squished.

01:15:55.439 --> 01:15:58.559
It shouldn't get squished. It shouldn't

01:15:56.800 --> 01:16:00.199
get exploded.

01:15:58.560 --> 01:16:01.280
So those are some considerations and

01:16:00.199 --> 01:16:02.760
these are technical considerations, but

01:16:01.279 --> 01:16:04.239
all those considerations have to

01:16:02.760 --> 01:16:07.000
be taken into account. If you can take

01:16:04.239 --> 01:16:08.039
those into account, then you're okay.

01:16:07.000 --> 01:16:08.960
That's sort of the key thing to keep in

01:16:08.039 --> 01:16:10.479
mind.

01:16:08.960 --> 01:16:11.920
And that's in fact why the ReLU is

01:16:10.479 --> 01:16:13.319
actually very popular

01:16:11.920 --> 01:16:15.640
because as long as the value is

01:16:13.319 --> 01:16:18.000
positive, the gradient of the ReLU is

01:16:15.640 --> 01:16:20.640
just one. Right?

01:16:18.000 --> 01:16:20.640
Uh because

01:16:22.680 --> 01:16:26.600
So if you look at something

01:16:24.239 --> 01:16:26.599
Oops.

01:16:28.720 --> 01:16:31.920
Was it frozen?

01:16:30.359 --> 01:16:34.880
I jinxed it.

01:16:31.920 --> 01:16:37.399
So sorry, livestream.

01:16:34.880 --> 01:16:39.880
If you have something like this,

01:16:37.399 --> 01:16:41.719
the ReLU is like that, right?

01:16:39.880 --> 01:16:43.480
So the gradient here

01:16:41.720 --> 01:16:44.560
is always going to be one.

01:16:43.479 --> 01:16:46.279
Which means that as long as the value is

01:16:44.560 --> 01:16:47.960
positive, whatever gradient comes in

01:16:46.279 --> 01:16:49.000
like this, it just like gets multiplied

01:16:47.960 --> 01:16:50.960
by one and gets pushed out the other

01:16:49.000 --> 01:16:52.840
side. So it doesn't get

01:16:50.960 --> 01:16:55.399
harmed or squished or anything like

01:16:52.840 --> 01:16:57.119
that. Um so that's one reason why the

01:16:55.399 --> 01:16:59.239
ReLU is very popular because it

01:16:57.119 --> 01:17:00.640
preserves the gradient while injecting

01:16:59.239 --> 01:17:04.519
almost like the minimum amount of

01:17:00.640 --> 01:17:04.520
non-linearity to do interesting things.
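A tiny sketch of that point about the ReLU's local gradient (the numbers are arbitrary):

```python
def relu(x):
    return x if x > 0 else 0.0

def relu_grad(x):
    return 1.0 if x > 0 else 0.0     # local gradient: exactly 1 for positive inputs

upstream = 0.37                      # gradient arriving from the layer above
local = relu_grad(2.5)               # the input was positive, so this is 1.0
print(upstream * local)              # 0.37 -- passes through unsquished and unexploded
```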

01:17:04.760 --> 01:17:10.280
Um yeah.

01:17:07.520 --> 01:17:13.080
If you have a high number of dimensions,

01:17:10.279 --> 01:17:14.920
can you do mini batching on like

01:17:13.079 --> 01:17:17.119
features dimensions instead of just

01:17:14.920 --> 01:17:19.840
observations and keep the same number of

01:17:17.119 --> 01:17:21.760
observations, but just take a small

01:17:19.840 --> 01:17:24.000
sample of the number of features that

01:17:21.760 --> 01:17:25.760
you're actually using? Oh, I see. I see.

01:17:24.000 --> 01:17:27.039
So you're saying let's say you have 10

01:17:25.760 --> 01:17:28.720
features.

01:17:27.039 --> 01:17:31.000
Um instead of taking all data points of

01:17:28.720 --> 01:17:33.640
10 features, what if you choose

01:17:31.000 --> 01:17:34.920
five features and just use them and do

01:17:33.640 --> 01:17:36.760
the thing

01:17:34.920 --> 01:17:38.520
as long as you can actually compute the

01:17:36.760 --> 01:17:39.840
prediction.

01:17:38.520 --> 01:17:41.600
To compute the prediction, you may need

01:17:39.840 --> 01:17:43.239
all 10 features.

01:17:41.600 --> 01:17:44.720
Right? Or you need to have some defaults

01:17:43.239 --> 01:17:46.800
for those features.

01:17:44.720 --> 01:17:48.560
And by if you define defaults for those

01:17:46.800 --> 01:17:50.520
other five features, you're basically

01:17:48.560 --> 01:17:51.400
using all all features.

01:17:50.520 --> 01:17:53.400
So that's the key thing. Can you

01:17:51.399 --> 01:17:55.079
actually calculate the prediction

01:17:53.399 --> 01:17:57.399
by manipulating? And typically, you

01:17:55.079 --> 01:17:57.399
can't.

01:17:57.840 --> 01:18:00.960
All right?

01:17:58.960 --> 01:18:02.439
Okay, folks. 9:55. I'm done. Have a

01:18:00.960 --> 01:18:04.800
great rest of your week. I'll see you on

01:18:02.439 --> 01:18:04.799
Monday.
