
2: Training Deep NNs (cont.); Introduction to Keras/Tensorflow; Application to Tabular Data

MIT OpenCourseWare · May 11, 2026
Transcript ~13749 words · 1:18:22
0:21
Okay. So, let's get going. Today we're
0:24
going to talk about how do you actually
0:26
train a neural network, right? Because
0:28
that is sort of the heart of the game
0:30
here. Um so, just to recap, we looked
0:33
last class
0:34
at what it takes to design a neural
0:36
network, and we made this very important
0:38
distinction between the things that you
0:40
are handed by your problem and the
0:42
things that you have agency over, that
0:44
you have control over. And we noticed
0:46
that, you know, the input layer for your
0:49
problem, the input is the input. Uh the
0:51
output is the output. You got to do
0:53
something with the output, something
0:54
that's expected. But everything that
0:56
happens in the middle is actually in
0:58
your hands. And in particular, we
1:00
noticed that we have to decide how many
1:03
hidden layers we want. We have to decide
1:05
in each layer how many neurons to have.
1:08
And then we had to decide what uh
1:11
activation to use. Even though I'm kind
1:13
of cheating when I say that because I
1:14
told you very clearly on Monday that for
1:17
the hidden layer activation, just go
1:18
with the ReLU activation function. You
1:20
don't have to think deep thoughts about
1:22
this, okay?
1:23
But the other things are all choices you
1:24
have to make, and we will talk a bit
1:26
later about how do you actually make
1:28
those choices.
1:29
Okay. Now, the rule of thumb,
1:32
right? The rule of thumb always is to
1:34
start with the simplest network you can
1:36
think of.
1:37
And if it's if it gets the job done,
1:39
stop working on it.
1:41
If it's not good enough, make it
1:42
slightly more complicated. Okay? So,
1:45
that's sort of the, you know, like the
1:46
meta thing you have to remember always
1:48
when you're designing these things.
1:49
Okay. So, that's sort of, you know, what
1:52
it takes to design a deep neural
1:53
network. So, what we will do in this
1:55
class is we'll actually take a real
1:57
example with real data, and then we
1:59
we'll think through how we would design
2:01
a network to solve this problem.
2:03
And while doing so, we will cover a
2:05
whole bunch of conceptual foundations
2:07
such as optimization, loss functions,
2:09
gradient descent, and all that good
2:11
stuff.
2:12
Okay?
2:12
All right. So, the the case study or the
2:16
scenario here is we have a data set of
2:18
patients uh made available by the
2:20
Cleveland Clinic. And essentially, we
2:23
have a bunch of patients, and for all
2:25
these patients, the setting is that they
2:27
have come into the Cleveland Clinic, and
2:29
they have not come in with a heart
2:31
problem. They have come in for something
2:32
else. Maybe they just came in for a
2:33
physical. And we measured a whole bunch
2:36
of things about them, okay? And the
2:38
kinds of things we measured are, you
2:40
know, demographic information, like
2:41
what's their age, uh gender, whether
2:44
they have any chest pain at all when
2:45
they came in, blood pressure,
2:47
cholesterol, sugar, so on and so forth.
2:50
Right? You get the idea? Demographic
2:52
information and a bunch of biomarker
2:53
information. And then,
2:56
what the Cleveland Clinic uh did was
2:59
they actually tracked these people
3:01
and figured out in the next year,
3:04
did they get diagnosed with heart
3:05
disease or not?
3:07
Okay, in the next year.
3:09
Which means that maybe you can build a
3:10
model when someone comes in, even though
3:12
they didn't come in for a chest problem,
3:15
maybe you can predict that something's
3:16
going to happen to them in the next
3:17
year, right? It's a nice sort of classic
3:20
machine learning setup.
3:23
All right. So, this is the thing. So,
3:24
what we want to do is we can totally
3:26
solve this problem using decision trees,
3:28
neural network I mean, sorry, random
3:29
forests and gradient boosting and all
3:31
that good stuff you folks have already
3:33
learned from machine learning.
3:35
But we will try to solve it using neural
3:36
networks, okay? Um this is an example,
3:38
of course, of what's called structured
3:40
data because this is all data sitting in
3:41
the columns of a spreadsheet, right? Uh
3:43
so, working with structured data is the
3:46
way we warm up our knowledge of neural
3:48
networks. And then we will do things
3:50
like working with unstructured data
3:51
starting next week with images and then
3:53
later on with text and so on and so
3:55
forth. Okay, any questions on this?
4:00
Okay. Uh yes. Uh just connected even to
4:03
last time's class where we took uh the
4:05
same example and first it was a logistic
4:07
and then we did a neural network. So,
4:10
the probability in case of one was 0.85,
4:12
then was 0.22, and here as well, how do
4:14
you know when to uh
4:16
use what? Usually in textbooks, you know
4:19
when to use logistic or when to use uh
4:21
something else, but in this case,
4:24
uh
4:25
when do I complicate it to neural
4:27
networks vis-à-vis in this case maybe
4:29
just doing a random It's a great
4:30
question. Uh when do you use what? So, I
4:33
think there are two broad dimensions
4:34
that you have to think about. One broad
4:35
dimension is
4:37
uh how important is it that you need to
4:39
explain or interpret what's going on
4:41
inside the model to perhaps a
4:43
non-technical consumer.
4:46
The other dimension is how important is
4:48
sheer predictive accuracy.
4:50
In some situations, predictive accuracy
4:52
trumps everything else. In which case,
4:54
just go with it. In other cases,
4:56
explainability becomes a big deal
4:57
because if they can't understand, they
4:59
won't use it.
5:00
And those cases, it's probably better to
5:02
go with simpler models such as decision
5:04
trees and neural I mean, not neural
5:05
network decision trees, maybe even
5:07
random forests, certainly logistic
5:09
regression. Those are all a little more
5:10
amenable.
5:12
But that said, uh even complex black box
5:15
methods like neural networks, there is a
5:17
whole field called mechanistic
5:19
interpretability,
5:20
which seeks to try to get insight into
5:23
what's going on inside these big black
5:24
boxes. So, the story isn't over, right?
5:28
But that's just the first cut you sort
5:30
of analyze the problem.
5:33
Okay. So,
5:35
um let's get going. So, if you want to
5:37
design a network,
5:39
All right. So, we design the network. Uh
5:42
so, we have to choose the number of
5:43
hidden layers and the number of neurons
5:45
in each layer. Then we have to pick the
5:46
right output layer. So, here,
5:49
what I did is the simplest thing you can
5:51
do is, of course, is to have no hidden
5:52
layer.
5:53
So, if you have no hidden layers, what
5:55
is that model called?
5:58
Yes, logistic regression.
6:00
Okay? So, of course, we want to do a
6:02
neural network, so I'm going to have one
6:03
hidden layer because that's the simplest
6:05
thing I can do. And then, I'll confess,
6:08
I tried a few different numbers of
6:09
neurons in this thing, and when I had 16
6:12
neurons, it actually did pretty well.
6:14
Okay? So, there was some trial and error
6:15
that went on before I landed on the
6:16
number 16. Right? And for some reason,
6:19
people always use powers of two, so may
6:20
as well do that.
6:22
So, I tried like 4, 8, 16, and 16 was
6:24
really good.
6:25
And as it turns out, when I went above
6:27
16, uh it sort of started to do badly.
6:30
And it started to do badly because of
6:31
something called overfitting,
6:33
which we're going to talk about later,
6:35
okay? So, yeah, 16.
6:37
Um and then by default, I use ReLUs,
6:39
okay? So, 16 ReLU neurons. And then
6:42
here, the output is a categorical
6:44
output, right? Heart disease, yes or no,
6:47
one or zero, classification problem,
6:49
which means that we want to emit a
6:51
probability at the very end. Therefore,
6:53
we'll use a sigmoid.
6:54
Okay? So, so far, so good, right? Any
6:57
questions?
6:59
All right.
7:00
So, we're going to lay out this network
7:02
visually.
7:03
Okay? So, we have an input, and so I
7:06
just have have an input. And as you will
7:09
see here,
7:10
X1 through X29, that's our input layer.
7:13
And you may be wondering, 29, where did
7:15
he get that from?
7:17
Because there doesn't seem to be like 29
7:19
rows here of independent variables. So,
7:22
it turns out there are only 13 input
7:24
variables here,
7:26
but some of them are categorical.
7:29
So, what I ended up doing is to take
7:31
each categorical variable and one-hot
7:32
encode it.
7:34
Okay?
7:35
And when you do that, you get to 39.
7:37
Sorry, 29.
7:39
All right? And when we actually do the
7:40
Colab later on, I'll show you exactly
7:43
how I one-hot encoded it, but
7:45
that's what I'm doing here.
7:46
That's why you have 29, not 13.
7:49
Okay? Now, obviously, we have decided on
7:51
these hidden units, 16 units,
7:54
with nice ReLUs here.
7:56
Okay? And then we have an output layer
7:57
with a little sigmoid.
7:59
And I got bored of trying to draw all
8:01
these arrows, so I just gave up and
8:02
said, "Assume there are arrows."
8:05
Okay, between all these things.
8:07
Good?
8:09
Yeah.
8:11
Yeah, I'm sorry. I think you already
8:12
mentioned this, but why 16 units? Why
8:15
16? Uh
8:16
I tried a bunch of different numbers of
8:18
units. Uh and at 16, the resulting model
8:21
did well, so I just went with that. And
8:23
the logic of why is a ReLU?
8:25
Oh, why a ReLU? Yeah, so there's a
8:28
there's just a mountain of empirical
8:29
evidence that suggests that uh ReLU is a
8:31
really good default option for using as
8:35
activations in hidden layers. There is
8:37
also a really great set of theoretical
8:39
results, and I'll allude to some of them
8:41
when we actually talk about gradient
8:42
descent.
8:45
Yeah.
8:47
Sorry, quick question. You mentioned um
8:50
in the input layer, how how did you get
8:51
to 29 again when you had like 13
8:53
variables? So, some of those 13
8:55
variables are categorical variables like
8:58
uh cholesterol low, medium, high. Right?
9:00
And so, I took them and one-hot encoded
9:02
them. So, if it had like five levels, I
9:04
would get five columns now.
9:08
Uh yeah.
9:09
And by the way, folks, just as was done
9:12
there, please
9:15
use a microphone so that people on the
9:17
live stream can hear your question.
9:18
Yeah, go ahead. Uh sorry, just one
9:20
question. So, the vectors, since you
9:22
didn't represent them, are we assuming
9:23
like every X is connected to all the
9:26
units?
9:26
>> Correct. And this is also a parameter
9:28
that we have to decide or That ends up
9:31
being the default.
9:32
And we will see
9:33
deviations from that assumption when we
9:36
go to image processing and language
9:37
processing and so on. But when you're
9:39
working with structured data like we're
9:40
doing now, that's the default.
9:43
Okay. So, let's keep going.
9:46
So, this is what we have.
9:47
So, what Remember what I told you in the
9:49
last class? Whenever you're working with
9:50
these networks, right? Get into the
9:52
habit of very quickly calculating the
9:54
number of parameters.
9:55
Right? Just do it a few times, the first
9:57
few times, so that you really know cold
9:59
exactly what's going on. Okay? So, yeah,
10:02
how many parameters do we have here?
10:04
How many weights and biases? You can
10:06
work through it, okay? You can You don't
10:08
have to tell me the final number. You
10:09
can say x * y + z, stuff like that.
10:14
Yeah.
10:15
65. You have 48 weights and 17 biases.
10:20
Okay, and how did he come up with that?
10:21
So, for the weights, you have like for
10:23
the first layer it's 2 * 16 and for the
10:26
the second connection it's 1 * 16 and
10:28
then the biases are the 16 hidden plus
10:30
the outputs.
10:32
Okay.
10:33
Um any other views on this?
10:36
I think it's 29 into 16. 29, okay, 29
10:40
into 16. And then 16 into
10:43
uh plus I mean 16 there. Yeah. And then
10:46
biases 16 biases and one bias. Right.
10:49
So, the way it's going to work is we
10:52
have 29 things here, 16 in the middle,
10:55
so 29 into 16 arrows.
10:58
And then for each of these fellows,
11:00
there's a bias coming in.
11:02
So, that's another 16.
11:05
Plus, you have 16 * 1.
11:08
Which is here, plus there is one bias
11:10
for this one.
11:12
So, the total is 497.
11:16
So, you can see here there's something
11:19
very interesting going on, which is that
11:21
when you go from one layer to another
11:22
layer,
11:24
the number of weights is roughly on the
11:26
order of a * b, the product of the
11:28
number of units in the two layers, and so that's a
11:30
dramatic explosion in the number of
11:31
parameters.
11:33
Right? And that's something we have to
11:34
watch for later on to prevent
11:36
overfitting.
11:38
Okay, that's where the explosion of
11:39
parameters comes from the fact that each
11:41
layer is fully connected to the next
11:43
layer.
11:44
Okay? But we'll revisit this later on.
11:46
Okay.
11:47
So,
11:48
what I'm going to do now is I'm going to
11:50
actually translate this network, right?
11:52
The one that we have laid out
11:53
graphically, into Keras code
11:56
to demonstrate how easy it is.
11:58
Okay? So, I will give a fuller intro to
12:01
Keras in TensorFlow later on, but for
12:03
now, just suspend your disbelief.
12:06
We'll just try to do it in Keras as if
12:08
we know Keras. Okay? So, let's try that.
12:10
Later on we'll get into all the gory
12:12
details and train it in Colab and so on
12:14
and so forth. Okay. All right. So,
12:17
So, the way we typically do it
12:19
is that once we have a network like
12:21
this, we typically start from the left
12:23
and start defining each layer in Keras
12:25
one after the other. So, we flow left to
12:27
right. Okay? So, let's take the input
12:30
layer. The way you define an input layer
12:32
in Keras is really easy.
12:34
You literally say keras.Input.
12:38
Okay? And then you tell Keras how many
12:41
nodes you have in the input coming in.
12:43
In this case it happens to be 29, so you
12:45
tell it the shape. Shape equals 29. And
12:47
the reason why we say shape as opposed
12:49
to length is because, as you will see
12:51
later on, we don't have to just send
12:53
vectors in, we can send complicated
12:55
things in to Keras.
12:57
And those complicated objects could be
12:59
matrices, it could be 3D cubes, it could
13:01
be 4D tensors and so on and so forth.
13:03
So, it's expecting a shape.
13:06
Right? What is the shape of this
13:07
thing you're going to send me? In this
13:09
particular case it happens to be a nice
13:10
list or a vector, so it's 29. Okay,
13:12
that's it. So, we write this down.
13:15
This creates the input layer.
13:17
Right? And we give it a name. Right? And
13:19
the name here means
13:21
this layer, whatever comes out of this
13:23
layer has a name input.
13:26
Okay?
13:27
Good. Next.
13:30
Let's make sure the shape of the input
13:31
as I mentioned.
13:32
Right there.
13:34
Then we go to the next one. And here and
13:36
we will unpack this. The way you define
13:39
a layer is typically a hidden layer
13:41
keras.layers.Dense
13:43
and all this stuff. Okay? So, what this
13:46
is is it first of all it says
13:48
I want a dense layer. By dense layer I
13:50
mean a layer that's going to fully
13:52
connect to the prior and the later
13:53
layers.
13:55
Fully connect, that's what the word
13:56
dense means. Okay?
13:58
Number two,
13:59
I want 16 nodes here in this layer.
14:02
Okay? Finally, I want to use a ReLU.
14:06
See how compact and parsimonious it is?
14:09
Right? And that is the appeal of Keras.
14:11
It's very easy to get going.
14:13
So, the moment you do that, you've
14:15
actually defined this layer.
14:18
But what you have not done
14:20
is you have not told this layer what
14:23
input is going to get.
14:25
Because as far as this layer is
14:26
concerned, it doesn't know that this
14:28
other layer exists.
14:30
So, you need to connect them. Yes.
14:33
Um do we need to define for the ReLU
14:35
where the the bends are? Like where you
14:38
take the max?
14:39
>> No, for the ReLU the bend is always at zero.
14:41
Okay. Thank you.
14:45
Okay?
14:47
All right.
14:48
So, that's what we have here.
14:51
And then, what we do is we have to tell
14:53
it that you want to feed this layer the
14:55
output of the previous layer, so you
14:57
feed it by taking whatever is coming out
15:00
of this thing, which is called input,
15:02
and you basically
15:03
stick it in here.
15:05
So, the moment you do that, boom, it's
15:07
going to receive the input from the
15:09
previous layer.
15:10
And because this one's output needs to
15:12
go to the final layer, you need to give
15:15
a name to that output.
15:16
So, you give it a name. I'm just calling
15:17
it h for because it's coming out of the
15:19
hidden layer.
15:20
It's just a variable. You can call it
15:21
anything you want.
15:25
Now, what we do, we go to the final
15:26
output layer.
15:28
And this is what we use. The output
15:30
layer is just another dense layer.
15:32
That's why I use the word dense. But we
15:34
say, "Hey, give me just one thing
15:36
because I just literally just need one
15:37
unit here because I need to emit just
15:40
one probability.
15:41
And the activation I want to use is a
15:44
sigmoid."
15:46
Done.
15:48
Okay?
15:50
And once you do that, you
15:52
have to feed it the input from the
15:54
second layer. So, you stick an h here.
15:57
Now you have connected the third and the
16:00
second layers.
16:01
And after you do that, you give a name
16:03
to the output coming out of that. We'll
16:04
just call it output. You can call it y,
16:06
you can call it output, you can call it
16:07
whatever you want.
16:09
Okay? So, at this point, what we have
16:11
done
16:12
is we have mapped that picture into
16:14
those three lines.
16:16
That's it.
16:17
Okay?
16:19
But we aren't quite done yet. There's
16:20
one little thing we have to do.
16:22
So, what we have to do is we have to
16:24
formally define a model so that Keras
16:27
can just work with this model object. It
16:30
can train it, it can evaluate it, it can
16:31
use it for prediction and so on and so
16:33
forth. So, we tell Keras, "Hey, uh
16:35
create a model for me, keras.Model,
16:38
and basically where the input is this
16:40
thing here and the output is that thing
16:41
there.
16:42
And then the whole thing we'll just call
16:43
it model."
16:45
Okay? So, that's it.
16:48
We are done. That is the whole model.
16:50
That is It sounds really fancy, right? A
16:52
neural model for heart disease
16:53
prediction. That's pretty cool.
16:56
Four lines.
16:58
And we will show how to train this model
17:00
with real data and so on and so forth
17:02
and use it for prediction after we
17:05
switch gears and really get into some
17:06
conceptual building blocks.
17:08
Had a question.
17:13
Can you define a custom activation
17:16
function that is not in the list of
17:18
Keras library? Yes.
17:21
Yeah, you can define The question was,
17:22
can you define a custom activation
17:23
function? You totally can.
17:25
Uh in fact, I mean, the kind of
17:27
flexibility you have here is incredible.
17:30
And these innocent four lines
17:32
unfortunately sort of hide the
17:34
potential that's possible here, but I
17:36
guarantee you in two to three weeks you
17:38
folks will be thinking in building
17:39
blocks like Legos.
17:41
So, you'll be, you know, I'm so
17:43
happy when it happens. Students will
17:44
come to my office hours and say, "You
17:46
know, I want to create a network where I
17:47
have a little network going up on top,
17:49
one going in the bottom, then they meet
17:50
in the middle, then they fork again,
17:52
they split." I'm like, "Unbelievable."
17:54
It's fantastic. And you're going to be
17:55
doing this in two weeks, I guarantee
17:56
you.
17:58
Yeah, in the case of a multi-class
18:00
classification problem, are the output
18:01
nodes equal to the number of classes?
18:04
Correct.
18:05
So, we will come to So, this is binary
18:07
classification. And the question is for
18:09
multi-class classification, let's say
18:10
you're trying to classify some input
18:12
into one of 10 possibilities, we will
18:14
have 10 outputs.
18:16
But the way we define it is going to be
18:18
using something called a softmax
18:20
function, which we're going to cover on
18:21
Monday.
18:24
So, for now, we just live with binary
18:25
classification.
18:27
Uh
18:29
Is there a default activation method in
18:31
Keras or you have to put something? Ah,
18:33
that's a good question. I believe the
18:35
default might be ReLUs for hidden
18:37
layers, but I'm not 100% sure. Let's
18:39
double-check that.
18:40
Uh
18:42
Uh just to get a clearer understanding,
18:44
when you said that beyond 16 when you
18:47
tried working on those neurons, the
18:50
performance uh worsened.
18:52
So, that is where you were playing
18:53
around with initially two and then maybe
18:54
four and six and eight. Exactly. Right.
18:58
Could you use the mic?
19:02
Do we need to define each of the hidden
19:04
layer when the model gets more complex
19:05
when we have more than one layer? Oh,
19:08
like if you have like 25 layers?
19:09
>> consolidate, yeah. Yeah, yeah, yeah. So,
19:11
what we typically Good question. If you
19:12
have let's say 100 layers, right? Uh do
19:14
you actually write I have to type in
19:16
each by hand and cut and paste? No. You
19:18
can actually write a little loop which
19:19
will just automatically create them for
19:20
you.
19:22
And so, basically what's going on is
19:24
that this little output thing you see
19:26
here, this variable,
19:27
this output could be the result of a
19:30
thousand layer network with all sorts of
19:32
complicated transformations going on and
19:34
then finally it pops up as a little
19:36
thing called the output. And what Keras
19:38
will do is it'll be like, "Okay, this
19:39
model has this input and has this
19:41
output, but boy, this output came from
19:43
incredible transformations applied to
19:45
the input." And Keras will process all
19:47
that very easily for you. You don't have
19:48
to worry about it.
19:49
Right? It's really a beautiful example
19:51
of the power of abstraction.
19:53
And you will you will see that as we go
19:54
along.
19:55
Okay. So,
19:56
now let's switch gears and say once
19:58
you've written a model like that in
20:00
Keras, how do you actually train it?
20:01
Okay? Now, training is something you've
20:04
been doing a lot, right? So, for
20:05
example, when you have something like
20:06
linear regression, right? Where you have
20:08
all these coefficients you need to
20:09
estimate, you have this model, then you
20:12
have a bunch of data, then you run it
20:14
through something like lm if you use R,
20:16
and what it gives you is actual values
20:18
for these coefficients, right? 2.8, 0.9,
20:20
and so on and so forth. So, the the role
20:22
of the data is to give you the
20:23
coefficients.
20:25
Right? Or you can think of the
20:26
coefficients as really a compressed
20:28
version of the data.
20:30
Okay? Similarly, if you do logistic
20:31
regression, you have a model like that,
20:33
you add some data, you run it through
20:35
some estimation routine like GLM or
20:37
scikit-learn or statsmodels, pick your
20:40
favorite tool, then you'll come up with
20:42
something like that. So, basically
20:43
what's going on here is training simply
20:45
means find the values of the
20:47
coefficients that so that the model's
20:49
predictions are as close to the actual
20:51
values as possible. That's it. Okay? And
20:54
so and to find the one that is as close
20:57
to the actual value as possible, a whole
20:59
bunch of optimization is involved. You
21:01
didn't have to worry about the
21:02
optimization when you did the
21:03
regression, linear or logistic, because
21:05
it's all done under the hood for you,
21:07
but for neural networks, we actually get
21:08
to know how it's done.
21:10
Okay, because it's important.
21:12
Okay. So, training a neural network, a
21:15
deep neural network, even GPT-4, it's
21:18
basically the same process as what you
21:19
do for regression.
21:21
Right? It's basically just a very
21:23
complicated function with lots of
21:24
parameters, but ultimately you have a
21:26
network with all these question marks,
21:28
you add some data, you do some training,
21:29
and boom, you get some numbers.
21:36
You may get into this, but are we
21:38
determining the architecture of the
21:40
network before we train it?
21:43
Okay. Yes, because if you don't define
21:45
the architecture,
21:46
um Keras doesn't know how to actually
21:49
calculate the output.
21:51
Given an input. And unless it knows
21:53
input-output pairs, it can't do anything
21:55
more with it.
21:58
Okay. So, um
22:00
so the essence of training is to find
22:02
the best values for the weights and
22:04
biases.
22:05
And the way we think of the best values
22:07
is that we basically set up a little
22:09
function, and this function measures the
22:11
discrepancy between the actual and the
22:14
predicted values. Okay? And I use the
22:16
word discrepancy because the way you
22:19
define discrepancy, there's an
22:20
incredible amounts of creativity in the
22:22
field.
22:23
In fact, a lot of breakthroughs in deep
22:25
learning come because people define a
22:27
very clever measure of discrepancy, and
22:29
then turns out it actually gives you all
22:31
sorts of interesting behavior. Okay?
22:33
That's why I use the word discrepancy as
22:34
opposed to the word error, because when
22:35
I say error, you might be just thinking
22:37
something like predicted minus actual.
22:39
That's too limiting.
22:42
Prediction minus actual is too limiting,
22:43
that's why I use the word discrepancy.
22:45
So, we basically define a function
22:48
that captures the discrepancy between
22:49
the actual and the predicted
22:50
values, and these functions are called
22:53
loss functions in the deep learning
22:54
world.
22:55
And every paper that you read, you will
22:58
find interesting loss functions. There
23:00
are hundreds of loss functions, enormous
23:02
research creativity goes into defining
23:03
these loss functions. Okay?
23:05
All right. So, these are loss functions.
23:08
And so a loss function is a function
23:10
that quantifies a discrepancy. So, let's
23:12
say the predictions are really close to
23:14
the actual values, the loss would be
23:16
what?
23:19
It's close to zero. It's close to zero.
23:20
Close to zero. Right? Very small.
23:23
And if if you have a perfect model,
23:26
perfect crystal ball, what would the
23:27
loss be?
23:28
Exactly zero.
23:30
Right? Exactly zero. So, in linear
23:32
regression, the loss function we use
23:35
is called sum of squared errors.
23:37
We didn't call it loss function because
23:39
we were not doing deep learning, just
23:40
linear regression, but that's basically
23:42
the loss function. Right? So,
23:45
the loss function we use must be
23:47
matched properly with the kind of
23:49
output we have.
23:51
Right? So, if your output is a number
23:53
like 23, right? You're trying to predict
23:55
demand like a product demand for next
23:57
week for a particular product, and uh
24:00
predicted value is 23, the actual value
24:02
is 21,
24:03
it's okay to do 23 minus 21, two as a
24:05
discrepancy, right? The error. Okay? But
24:09
for other kinds of outputs, it's not so
24:11
obvious what the correct loss function
24:13
is, what the correct measure of
24:14
discrepancy is. And so here,
24:18
for the simple case of regression,
24:20
right? Um
24:21
the YI, the I here, by the way, is a
24:23
superscript which stands for the ith
24:26
data point, the ith data point. So, what
24:29
I'm saying is that okay, for the ith
24:31
data point, this is the actual value, Y,
24:33
and this is what the model predicted.
24:36
Okay? I take the difference, square it,
24:39
and once I square it for each point, I
24:41
just average all these numbers to get an
24:43
average squared error, i.e. mean squared
24:45
error, MSE. So, this is sort of like the
24:48
easiest loss function.
24:50
Okay?
24:52
Now, let's crank it up a notch.
24:55
In the heart disease example, the
24:57
neural heart disease prediction model,
24:59
the prediction is a number between zero
25:01
and one, right? It's because it's coming
25:03
out of the sigmoid.
25:04
It's a fraction. The actual output is a
25:07
zero or one, one of the two, right? It's
25:09
binary.
25:11
So, how would we compare the
25:12
discrepancy? How would we measure the
25:14
discrepancy between a fraction and the
25:16
numbers zero and one? Right? What is the
25:18
good loss function in this situation?
25:21
Right? Is the key question. So, let's
25:22
build some intuition around this.
25:26
And let's see if my little daisy chain
25:28
iPad thing works.
25:31
I'm doing it on the iPad so that people
25:32
on the live stream can see it, otherwise
25:34
the blackboard is a little tough for
25:35
them.
25:37
Okay. So, let's have a situation here.
25:41
Okay? So, let's say let's say that you
25:43
have a patient who comes in, and let's
25:45
say they have heart disease. Okay? So,
25:47
for that patient, Y equals one.
25:50
Right? The true value is one for that
25:51
patient. And now you have this model.
25:55
Okay? And this is the predicted
25:59
probability from this model.
26:04
Can people see my
26:05
handwriting okay?
26:07
Good.
26:08
I could never be a doctor, right? So.
26:11
So, zero, okay? One, it's going to be
26:13
between zero and one because it's
26:14
probability.
26:15
And then this is the loss we want to
26:17
sort of have, right? This is the loss.
26:19
So, for this this patient actually had
26:21
heart disease, Y equals one. So, let's
26:23
say that the predicted probability is
26:25
pretty close to one.
26:26
Okay? What do you think the loss should
26:28
be?
26:29
Small.
26:30
Close to zero.
26:32
Sorry?
26:34
Close to zero, exactly. So, here, if the
26:36
prediction comes here, you want the loss
26:38
to be somewhere
26:40
here.
26:42
But if the predicted probability is
26:44
pretty close to zero, even though the
26:45
patient actually has heart disease, what
26:47
do you want the loss to be?
26:49
Really high.
26:50
Because it's screwing up badly, right?
26:52
So, you want the loss to be somewhere
26:53
here.
26:55
So, basically you want a function that's
26:57
kind of like that.
27:00
Right? You want the loss function shape
27:02
to be like that.
27:04
High values of probability should have
27:05
low losses, low values of probability
27:07
should have high losses. Yeah.
27:08
I understand like why it has to be
27:10
increasing or decreasing, but can you
27:12
explain why it has to be Yeah, yeah. So,
27:14
it can be linear, it can certainly be
27:16
linear, but basically what you want to
27:18
do is the more it makes a mistake, the
27:21
more harshly you want to penalize it.
27:23
Right? So, basically
27:25
what you really want is something where
27:27
if it basically says this person's
27:29
predicted
27:31
probability is, say, one
27:33
over a million,
27:34
basically close to zero, you want the
27:35
loss to be like super high.
27:37
So that the model is like it's like a
27:39
huge rap on the knuckles for the model.
27:41
Don't do that.
27:42
That's basically what we're doing, and
27:43
I'm sort of demonstrating that dynamic
27:45
by using a very curved and steep loss
27:47
function.
27:49
But you can absolutely use a linear
27:50
function, it's totally fine. It won't be
27:52
as effective for gradient descent later
27:54
on, with a bunch of technical
27:56
details.
27:57
Are we good with this?
27:59
All right. So, now let's look at the
28:01
case where a patient does not have heart
28:03
disease.
28:05
Y equals zero.
28:06
Same setup, okay?
28:09
Predicted probability,
28:11
zero, one, loss.
28:15
So, for this patient, they don't have um
28:18
whatever uh they're not
28:20
uh they don't have heart disease. If the
28:22
probability is close to zero, what
28:24
should the loss be?
28:26
Close to zero. It should be somewhere
28:27
here, right?
28:28
And the more and more the probability
28:31
gets closer and closer to one, you want
28:32
to penalize it very heavily, which means
28:34
you want the loss to be somewhere here.
28:36
So, you basically want a loss ideally
28:37
that's kind of going up like that and
28:39
climbing higher and higher.
28:42
Are we good?
28:43
Okay, perfect.
28:44
Because we have a perfect loss function
28:46
for that.
28:48
So, just a recap.
28:51
Right? This is what we want.
28:53
For points with Y equals
28:54
one, lower predictions should
28:56
have higher loss. You want something
28:58
like that. And then turns out
29:02
there's a very simple little loss
29:03
function
29:04
which literally just uses the
29:05
logarithm, which will get the job done.
29:07
So, what you do is you literally do
29:09
minus log of the predicted probability.
29:13
That's it. And that thing it has exactly
29:15
that shape.
29:16
Okay? And in fact, you can see it
29:17
numerically. So, if the probability is one,
29:20
it's zero. If it's half, it's 1.0. And
29:22
if it's like one over 1,000, it's almost
29:24
10. If it's one over 10,000, it's going
29:26
to be like
29:27
much higher, right? Very high losses.
29:30
Okay? So, minus log probability, boom,
29:32
done.
29:34
Similarly, this is what we want for
29:36
patients for whom Y equals zero.
29:38
And turns out if you do minus log one
29:42
minus predicted probability, it does the
29:44
same thing.
29:47
Okay?
29:50
Mathematicians once again saved with a
29:52
logarithm.
29:54
So, see in summary
29:56
this is what we have.
29:58
Right? For data points where y equals 1,
30:00
we have this. Data points where y equals
30:01
0, we have this. But, it feels a little
30:03
inelegant
30:05
to say, "Well, if it's y equals 1, I
30:07
want to use this. If y equals 0, I want
30:08
to use that."
30:09
Right? There's like an if-then
30:11
thing going on here. And I don't know
30:12
about you folks, but if-then really irks
30:14
me
30:15
mathematically because you can't do
30:17
derivatives and so on very easily.
30:19
Okay?
30:20
But, no worries. This is MIT. We know we
30:22
have our bag of math tricks.
30:24
So, what we do is
30:26
we can actually combine them both into a
30:28
single expression.
30:30
Okay? Like this.
30:32
Okay? And here the yi again is the ith
30:35
data point. Remember, yi is either 1 or
30:37
0 always.
30:38
And this model of xi is the predicted
30:40
probability. Okay? So,
30:43
and I've just taken the
30:45
minus and I've just moved it here.
30:48
Okay? And I've taken the minus that
30:50
was here and just moved it here. Okay?
30:52
That's why you see it like this.
30:54
So, this one is basically
30:57
you can convince yourself what
30:58
happens. This single expression will get
30:59
the job done. So, let's say there is a
31:01
patient for whom y equals 1.
31:04
What's going to happen is that when you
31:05
plug in y equals 1, this becomes 0. The
31:07
whole thing will collapse to 0.
31:10
While here, y equals 1 just means it
31:12
becomes minus log probability, which is
31:14
what we want.
31:17
Conversely, if y equals 0, this whole
31:20
thing is going to disappear.
31:22
And this thing becomes 1 minus 0, which
31:23
is just 1. And so, it becomes minus log
31:25
1 minus probability, which is again what
31:27
we want.
31:29
Simple and neat, right?
31:32
So, in one expression, we have defined
31:34
the perfect loss. No if-thens, none of
31:36
that crap.
31:39
Good. So, now what we do is that was
31:42
true for every data point.
31:44
But, we obviously have lots of data
31:45
points. So, we just add them all up and
31:47
take the average.
31:50
That's it. We average across all the
31:51
data points we have. So, that we get an
31:53
average loss.
31:55
Okay?
31:57
We call this the binary cross entropy
31:58
loss function.
32:06
Is there a way you can um edit the loss
32:08
function so that you penalize like false
32:11
negatives more strongly than false
32:13
>> you can do all of them. Great question.
32:15
Uh I'm just looking at the basic case
32:17
where it's a symmetric
32:19
loss. Um you can actually penalize
32:21
overestimates much more than
32:23
underestimates and things like that.
32:25
Um and if you're curious, you can just
32:26
Google something called the pinball
32:28
loss.
32:31
Okay?
32:32
Any other questions on this?
32:34
So, when you see this massive deep
32:36
neural network built by Google for doing
32:38
something or the other, if it's a binary
32:39
classification problem, chances are
32:41
they're using this thing.
32:44
Okay?
32:45
All right.
32:45
So, now let's figure out how to minimize
32:48
these loss functions because the name of
32:49
the game
32:50
is to find a way to minimize these loss
32:52
functions. So, now loss functions are
32:54
just a particular kind of function. So,
32:56
we'll first consider the general problem
32:59
of minimizing some arbitrary function.
33:02
Okay?
33:02
And once we develop a little bit of
33:03
intuition about that, we'll return to
33:05
the specific task of minimizing loss
33:07
functions.
33:12
How's everyone doing?
33:15
Yes, no, good, bad?
33:18
You have a bit of a
33:20
like a tough-to-interpret head shake.
33:23
It's more like um I kind of lost you
33:24
where you said that the loss function
33:26
and the predicted probability
33:28
uh how were they inversely because my
33:30
understanding was that the loss function
33:31
is supposed to be the sum of errors.
33:33
We're averaging the errors. And when you
33:35
said the heart patient
33:36
>> Sorry, sorry. Let me Let me just stop
33:37
there for a second.
33:38
For each point, you define the loss.
33:41
That's the whole point of the game. And
33:42
once you define it, you calculate for
33:44
every point and average it, right? So,
33:46
just focus on a single data point.
33:49
And so, now continue.
33:50
So, now when the heart patient has There
33:53
is more probability that they No. So,
33:56
when there is a person who has the heart
33:58
uh disease, you said that you want the
34:00
loss function to be high.
34:02
I think I'm going back to the graph.
34:03
>> You want the loss function to be high if
34:06
I'm predicting that they basically don't
34:08
have heart disease.
34:09
If the prediction is close to 0,
34:12
the predicted probability is close to 0,
34:13
then I'm badly wrong.
34:16
Because in reality, they do have heart
34:18
disease.
34:19
And that's why I want the loss to be
34:21
really high. Okay, so effectively, loss
34:23
is my way of finding out how good my
34:25
model is instead of saying, "Okay." Or
34:28
rather, how bad your model is. Yeah.
34:31
Right? How bad is it? That's really what
34:33
the loss function is. Got it.
34:34
>> And you want to minimize badness.
34:37
That's the whole point of optimization.
34:39
Okay.
34:41
Um I guess I don't have a fully like
34:43
similar to the point where I said but I
34:45
don't have a fully clear intuition of
34:46
why exactly a log function rather than
34:48
something that say
34:50
flatter for small and then really steep
34:53
later. Those are all fantastic things.
34:55
You can totally do it. Uh the reason we
34:57
picked this loss function is because, A,
35:00
it's easy to work with. It has good
35:02
gradients. It's well-behaved
35:04
mathematically. But, there are many
35:06
alternatives to it. I don't want you to
35:07
think that this is like the only game in
35:09
town or it's the only choice for us. We
35:11
have many choices. This
35:13
happens to be a very easy choice, which
35:15
also happens to be empirically very
35:17
effective.
35:18
And I'm happy to give you pointers to
35:20
other crazy loss functions, right? Which
35:22
can actually do all these things, too.
35:26
Okay?
35:30
All right. So, uh minimizing a single
35:32
variable function, we will warm up by
35:34
looking at this little function here.
35:36
Okay? Which is a
35:38
What do you call a fourth power?
35:41
What? Quartic, right? Yeah, thank you.
35:43
Quartic. So, yeah, it's a quartic
35:45
function. Um
35:47
right? And this is what it looks like.
35:50
But, you can see there is like a minimum
35:51
somewhere here, right? Between
35:53
minus one and minus two. Like maybe
35:54
minus 1.5. Okay?
35:56
So, we want to minimize this function.
35:58
It's obviously a toy function, little
36:00
function with one variable.
36:02
But, the intuition we use here is going
36:03
to be exactly what we use for GPT-4.
36:06
So, pay attention.
36:08
So, how can we go about minimizing this
36:09
function?
36:11
What will we do?
36:15
Yeah.
36:16
Take the derivative and set it equal to
36:18
zero. You take the derivative. Exactly.
36:20
So, you take the derivative, right?
36:22
Um so, when you So, let's look at what
36:23
the derivative does for us.
36:25
But, then
36:26
the second part of what was said
36:30
Yeah. The second part of what was said was to set
36:31
it to zero. Setting it to zero becomes
36:33
problematic
36:35
when you have very complicated
36:37
functions. It's not clear at all what's
36:38
going to make them zero, right?
36:39
Unfortunately. But, the idea of taking
36:41
the derivative is in fact the right
36:42
idea.
36:43
So, we can go about this. We can
36:45
calculate the derivative. And that
36:46
is actually what you get for the derivative.
36:47
You can convince yourself.
36:49
And if you plot the derivative, it looks
36:50
like that.
36:53
And as you would hope, wherever the
36:55
minimum is, in fact, the derivative is
36:56
crossing
36:58
right? The derivative is zero here. It's
36:59
crossing the x-axis.
37:01
Right? In this case, you can actually do
37:02
that.
37:03
So, let's say you have the derivative.
37:04
How can you use it?
37:06
Like, what is the value of a derivative?
37:08
What does it tell you?
37:09
Yeah.
37:11
You use a gradient descent algorithm.
37:13
You are 10 steps ahead of me, my friend.
37:16
I just want the basic answer.
37:18
Like, what good is a
37:19
derivative? Like, what does it tell
37:21
you? When you calculate the derivative
37:22
of something at a particular point
37:23
>> you the rate of change of the function
37:25
at the place you are. Correct. Exactly
37:27
right. So, here, what the derivative
37:29
tells us is that the slope tells
37:32
us the change in the function for a very
37:34
small increase in w, right?
37:36
And this is high school calculus. I'm
37:38
just doing a quick refresher.
37:41
So, what that means is that
37:45
if the derivative is positive,
37:47
what that means is that increasing w
37:49
slightly will increase the function.
37:52
So, if you're here,
37:53
you calculate the derivative, the slope
37:55
is positive. It means that if you go
37:56
slightly in this direction, the function
37:57
is going to get higher.
37:58
Right?
38:00
Similarly, if it's negative,
38:02
let's say here, you calculate the
38:03
derivative, the slope is like
38:05
this. It's negative, which means that if
38:06
you increase w, if you go in this
38:08
direction, it's going to decrease the
38:10
function.
38:12
Okay?
38:13
All right.
38:15
And if it's kind of close to zero,
38:17
it means that changing w slightly won't
38:19
change anything.
38:22
So, if you're here, changing it slightly
38:24
won't change anything.
38:25
All right?
38:26
That's it.
38:28
So,
38:29
So, what we do is this immediately
38:31
suggests an algorithm for minimizing g(w),
38:35
which is let's start with some random
38:37
point w.
38:38
And then,
38:39
let's calculate the derivative at that
38:40
point.
38:41
And once we do that,
38:42
there are three possibilities.
38:45
It could be positive, negative, or kind
38:46
of close to zero.
38:48
And if it's positive, we know that
38:49
increasing w will increase the function.
38:52
But, we want to decrease the function.
38:53
We want to minimize it.
38:55
Which means that we should not be
38:56
increasing w. We should be doing what
38:58
here?
39:00
Decrease.
39:01
Yes. And similarly, if it's negative,
39:03
what should we do here? Increase.
39:07
Exactly. So, in the first case, you
39:09
reduce w slightly. In the second case,
39:11
you increase w slightly. And if the
39:13
thing is close to zero, you just stop
39:14
because there's nothing else you can do.
39:17
Okay?
39:21
This is the basic intuition behind how
39:23
GPT-4 was built.
39:26
Which is kind of shocking if you think
39:28
about it.
39:29
Right? Which means that all the the
39:31
heavy-duty optimization stuff that
39:32
people have figured out over the decades
39:35
is kind of not used.
39:37
Right? This algorithm is what's being
39:39
used with some, you know, flavors on top
39:41
of it.
39:42
So, yeah. So, back to this
39:44
uh and you you do that and then if
39:46
you've sort of run out of time or
39:48
compute
39:49
or right, if you run out of time and so
39:52
on, just stop.
39:54
Otherwise, just go back to step one and
39:55
try again. Of course, if it's close to
39:56
zero, you got to stop anyway.
40:00
Yeah.
40:02
Is there the um concern of a potentially
40:05
local minimum there? It's coming.
40:10
Okay? So, that's the function. It's
40:11
going to find
40:12
you some point where the derivative is
40:13
kind of close to zero. Okay?
40:16
So,
40:17
this is called gradient descent. Right?
40:19
This is gradient descent, this little
40:21
algorithm.
40:23
And this
40:26
very PowerPoint-y MBA table can be
40:29
collapsed into this little expression.
40:32
Basically says,
40:34
calculate the derivative,
40:35
multiply it by a small number, which we'll
40:36
get to in a second,
40:38
and then update: the new W
40:41
is the old W minus a little number times
40:44
the gradient.
40:45
So, this little one-line formula is
40:47
basically gradient descent.
40:50
Okay?
40:51
And what you should do, just to build
40:54
your intuition, is to make sure that
40:56
these three possibilities here map
40:58
nicely to this. Like this thing will
41:00
actually capture these three
41:01
possibilities.
41:03
This is when gradient descent was
41:04
invented.
41:07
It has some historical fun, right?
41:13
The 19th century?
41:15
19th century. Yeah, okay. Good. Very
41:17
good. Excellent guess.
41:20
1847.
41:22
It was uh invented uh in 1847 by Cauchy,
41:25
the great mathematician. And in fact, if
41:27
you're curious, you can check out the
41:29
paper.
41:30
I have given you the paper
41:32
here for handy reference.
41:36
So, 1847.
41:38
So, GPT-4 is built using an algorithm
41:40
invented in 1847.
41:44
Which I find like astonishing, frankly.
41:47
That this little thing is so capable.
41:51
Okay.
41:52
So, that's gradient descent. And this
41:54
little number alpha
41:56
is called the learning rate. And it's
41:58
our way of sort of essentially
41:59
quantifying the idea of let's not
42:02
increase or decrease W massively, let's
42:04
do it slightly.
42:06
Because the gradient is only valid for
42:08
small movements around your point. If
42:11
you take a big step, all bets are off.
42:14
So, this alpha tells you how how small a
42:17
step should you take.
42:20
Okay?
42:20
And typically, it's set to very small
42:23
values like, you know, 0.1, 0.001, and
42:25
so on and so forth. And in fact, if you
42:27
read any deep learning academic papers
42:30
where they have trained like a big model
42:31
to do something,
42:32
right? A lot of researchers will very
42:34
quickly go to the appendix where they
42:36
have described exactly what learning
42:37
rates were used.
42:39
Because sort of the learning rate is
42:40
like part of the IP for how it's built.
42:44
A lot of trial and error goes into
42:45
these learning rates.
42:47
Okay. So, that is gradient descent.
42:50
Um so, if we apply this algorithm to g(w),
42:53
our original function,
42:55
right? We just keep on doing this thing
42:56
a few times.
42:58
Right? What you will find is that if
43:00
let's say the
43:01
point
43:02
we randomly pick is 2.5, we
43:05
set the alpha to one, we run this
43:07
algorithm, it starts here, then it goes
43:09
there, it goes there, bup bup bup bup
43:11
bup, and then finally ends up here.
43:12
In like four or five iterations, it
43:14
finds some minimum.
43:16
This is obviously a very simple,
43:17
well-behaved, nice little function, so
43:19
you can easily optimize it.
43:22
Okay? If you want, you can just go to
43:23
this thing. There's a nice animation of
43:25
this thing as well.
43:28
Okay. So, now
43:30
All right. Before we actually go to the
43:31
multi-variable function, I want to go to
43:33
the question that you posed about local
43:35
minima.
43:36
Um actually, you know what? I think I
43:37
may have some slides on it. So, sorry.
43:38
I'll come back to this.
43:40
So, let's actually see. You know,
43:41
we looked at a toy example where
43:43
there was only one variable. What if you
43:45
have
43:46
uh what if it was GPT-3? GPT-3 has 175
43:49
billion parameters.
43:51
175 billion and GPT-4, they haven't
43:53
published it, so we don't know. It's
43:55
supposed to be eight times as much.
43:57
Okay? So, I mean, the number of
43:59
parameters is massive. So, basically,
44:02
our loss function has
44:04
billions of variables, billions of Ws
44:07
that we need to optimize over, minimize
44:10
over. So, we need to use this notion of
44:12
a partial derivative. So, let's take
44:14
baby steps and say, okay, what if you
44:16
have a two-variable function, right?
44:18
Something like this, very simple. So,
44:20
what we can do is we can calculate the
44:21
partial derivative of G with respect to
44:23
each of these Ws.
44:26
And the partial derivative, just to
44:27
quickly refresh your memories,
44:29
is you take a function, you pretend that
44:32
everything other than W1 is a constant.
44:36
Then the function becomes a
44:38
function of just one variable, W1.
44:40
And then you just differentiate it like
44:41
you do everything else. And
44:43
you get something, and that is
44:46
this thing here.
44:48
And then you do the same thing for W2,
44:50
you get this thing here, and then you
44:51
just stack them up in a nice list.
44:54
Okay?
44:55
This is the vector of partial
44:56
derivatives.
44:58
So, how should we interpret this? The
44:59
same way as before. Basically, for a
45:01
small change in W1, keeping W2 and
45:04
everything else fixed, how does the
45:06
function change if you change just W1
45:08
slightly? And similarly for W2 and all
45:11
the way to W175 billion.
45:14
Same thing. Okay?
45:15
So, um
45:17
now, when you have these functions with
45:19
many variables, many Ws,
45:22
uh since we have a partial derivative for each one
45:24
of those Ws, we stack them up into a
45:26
nice vector
45:28
of derivatives, and this vector is
45:30
called the gradient.
45:32
And it's denoted
45:33
using
45:35
this uh Anyone know what the symbol is
45:37
called?
45:38
nabla
45:40
Yeah?
45:41
Laplacian
45:43
Maybe. Maybe that's a synonym. But the
45:45
one I'm familiar with is nabla.
45:48
Delta is the other one; the upside down
45:50
triangle, I think,
45:52
is called nabla, if I
45:53
recall. Am I right?
45:55
Thank you.
45:58
He's my go-to.
46:02
So, yeah. So, the gradient, um we just
46:04
call it the gradient, and it's written
46:06
as this.
46:08
All right. So, what we do is we simply
46:10
do gradient descent on every one of the
46:12
Ws
46:13
using its partial derivative.
46:16
Okay? So, in a in a gradient step, we
46:19
update W1 using this formula, W2 using
46:21
this formula.
46:23
Finished.
46:25
We've just generalized gradient descent
46:27
to an arbitrary number of variables.
46:30
So, and of course, as before, this can
46:32
be summarized compactly as this vector
46:35
formula.
46:36
Let me just do this.
46:43
So, what's going on here is that I have: the new W1
is the old W1 minus alpha times the partial derivative of G with respect to W1;
then the new W2 is the old W2 minus alpha times the partial derivative of G with respect to W2.
And then all we're doing is we're just stacking them up into a vector, like that:
the vector of Ws, minus alpha, and this vector of partial derivatives, like that.
So, this can be written as just this: the new vector W is the old vector W
minus alpha times the gradient. Finished.
47:37
And you can see if it is, you know,
47:39
GPT-3,
47:40
this vector is going to be 175 billion
47:42
long.
47:44
Okay? But whether it's two or 175
47:46
billion, who cares? It's the same thing,
47:47
right?
47:50
Okay.
47:52
So, yeah. So, that's what we have here.
47:54
I'm really thrilled by the way this
47:55
whole iPad business is working out.
47:58
I was a little worried about it. Okay.
48:00
Um so, if you look at two dimensions,
48:02
this function, and if you actually look
48:04
at if you plot the function, this is W
48:06
the first W, the second W, and then you
48:08
actually This is actually the loss
48:09
function. That's the function g(W). And
48:11
so, you're trying to find the minimum
48:13
here, and so this is how the gradient
48:14
descent will do do do do do. It will
48:16
progress if you're starting from this
48:17
point.
48:18
Or you can also sort of look at it from
48:20
up top down into the function, and
48:22
that's what this picture is, and it
48:23
shows gradient descent starting from
48:24
there and working its way down
48:27
um from here all the way to the center.
48:30
Okay. So,
48:32
All right. Local minima. So, now
48:35
gradient descent will just stop
48:38
near uh hopefully a minimum,
48:41
right? But the problem is it may not be
48:43
a global minimum. It may not even
48:45
be a minimum.
48:47
So, um
48:48
so, let's see what I'm talking
48:49
about here.
48:51
Here are some possibilities.
48:53
So, let's take a simple function.
48:57
Okay? Let's say this is g(W).
48:59
This is W. And turns out this function
49:02
actually looks like this.
49:12
Okay?
49:13
So, you can see here
49:17
Well,
49:19
um this point
49:23
this point here
49:24
is a local minimum.
49:27
This is a local minimum.
49:29
It's a local minimum.
49:30
These are all
49:32
lots of local minima here.
49:34
Okay? And yeah, there's a lot of local
49:37
minima here, too.
49:39
So, these are all places in which the
49:41
derivative is going to be zero.
49:43
So, if you run gradient descent and it
49:46
stops because the gradient has reached
49:48
zero,
49:49
you could be in any of these places.
49:52
Right? So, there's no guarantee. So,
49:54
this in this picture happens to be
49:57
maybe the global minimum because it's
49:59
the lowest of the lot.
50:01
Right?
50:02
But, there's no guarantee you're
50:02
actually going to get there.
50:04
Okay, there's not even a guarantee
50:06
you're going to be in any of these
50:07
places because you could literally be in
50:09
this thing here
50:10
where it's sort of taking a break and
50:12
then continuing on down.
50:14
That, by the way, is called a you know,
50:15
a saddle point. I drew it badly, but
50:17
this sort of coming in sort of taking a
50:19
break and going down again is called a
50:21
saddle point. So, gradient descent can
50:23
stop at a saddle point. It can stop at
50:25
some minima. There's no guarantee it's
50:27
going to be global.
50:28
Okay?
50:33
But, it turns out it has not mattered.
50:37
So, it has not mattered. And there are a
50:39
whole bunch of reasons why it has not
50:41
mattered because when you have these
50:42
very complicated neural networks,
50:44
they're very complex functions. Even
50:46
finding a decent solution, right, to
50:49
these complicated networks is actually
50:50
really good for solving the problem.
50:52
You don't have to go to the best
50:54
possible solution. And in fact, if you
50:57
go to the best possible solution, you
50:58
actually run the risk of overfitting.
51:02
So, that's one reason. The other
51:03
interesting reason and by the way, this
51:05
is a very hot area of research to figure
51:08
out exactly
51:09
So, it's sort of like this. Empirically,
51:11
what we have seen is that not worrying
51:12
about local minima, global minima, all
51:13
that stuff has not hurt us because these
51:16
things are amazing.
51:18
GPT-4, probably they just stopped
51:20
somewhere. It probably wasn't even
51:21
a local minimum. They're like, "All
51:22
right, it's been running for 6
51:24
days. We've spent 2 million dollars.
51:25
Let's stop."
51:27
Right? Because these are very expensive.
51:29
So, but that's still so magical.
51:31
You don't need to get anywhere close to
51:33
a local minimum. But, there's another
51:34
interesting point which
51:36
I read about.
51:37
People basically hypothesize that
51:40
for you to be at a local minimum, just
51:43
think about what it means. It means that
51:45
you're standing at a particular point,
51:47
in every direction that you look,
51:49
things are just sloping upward.
51:51
Right?
51:52
Everything is sloping upward. Only if
51:54
everything is sloping upward all around
51:56
you, could you be at a local minimum
51:58
by definition. But, if you have a
52:00
billion dimensions,
52:02
what are the odds that you're going to
52:04
be standing at a point where every one
52:06
of those billion dimensions is going
52:07
upward?
52:08
The odds are really low.
52:10
Chances are some of them are going to be
52:11
going up, some of them are going down,
52:13
others are sort of coming down and going
52:14
another way. It's going to be crazy.
52:16
So, in some sense, the best you can hope
52:18
for in these very high-dimensional
52:20
situations is probably a saddle point.
52:23
And it turns out it's good enough.
52:25
So, for those reasons, we are content
52:29
with just running gradient descent with
52:30
some tweaks which I'll get to in a
52:31
second. Um and it just performs really
52:34
admirably.
52:36
Um how does alpha depend on like how
52:39
much compute you have? Like, would you
52:41
set the learning rate based on that or
52:44
not really?
52:45
>> No, the learning rate is really
52:47
a measure of how confident you are. It's sort of like this.
52:50
When you're at a point where you think
52:52
that the gradient is looking nice, so
52:54
that if you take a step in that
52:55
direction it's going to go down. And if
52:57
you further believe that it's going to
53:00
keep going down in that direction for a
53:01
while,
53:02
then you're very confident about taking
53:04
a big step.
53:06
But, if you're like, "I don't know
53:07
because maybe I take a little step,
53:09
maybe I have to go this way. I can't go
53:10
straight anymore." Then you don't want
53:12
to take a big step because then you have
53:13
to backtrack.
53:14
So, those kinds of considerations go
53:16
into the learning rate. Um and so,
53:19
that's sort of the rough answer to your
53:20
question. It's not so much determined by
53:23
compute and bandwidth and things like
53:24
that.
53:25
But, again, it's sort of a
53:27
complicated thing because sometimes with
53:29
a given amount of compute, if
53:31
you have a particular kind of data, you
53:33
can have very aggressive learning rates.
53:35
So, it tends to be a bit sort of, you
53:37
know, jumbled up and complicated. So, but
53:39
that's sort of the the quick surface
53:40
level idea of what's going on.
53:43
Um okay.
53:47
9:31.
53:50
Anyway, folks, this lecture is like
53:52
probably one of the driest in the like
53:54
semester because I have to go
53:55
through all the concepts. Um once we
53:57
start doing Colabs, you know, things
53:59
get a lot more lively.
54:00
Okay.
54:01
Um all right. So, now let's talk about
54:04
minimizing a loss function with gradient
54:05
descent. So, here is our little binary
54:08
cross entropy loss function that we saw
54:09
from before. Right? This is what we want
54:11
to minimize. So, if you look at this
54:13
thing,
54:14
where are the variables we need to
54:16
change to minimize this function?
54:19
Folks, don't look at your phones.
54:21
Laptops and iPads are okay, but don't
54:23
look at your phones.
54:27
Sorry, we've kind of abstracted um the
54:30
variables W, but just to bring it back,
54:33
those are actually the weights in the
54:35
neural networks, right? Yeah, the
54:36
weights and the biases. I'm just calling
54:38
them weights. So, the output of these
54:42
uh minimization functions is going to
54:45
be the actual weights in your model,
54:47
right?
54:47
>> Exactly. Exactly right.
54:49
The whole name of the game is to find
54:51
the weights.
54:52
And so, for example, when you see in the
54:53
press that uh Meta has essentially um
54:57
made the weights of Llama 2 or something
55:00
available, that's basically what they've
55:01
done.
55:02
They basically published the weights.
55:04
Reason that's so valuable is
55:06
>> Microphone, please. Go.
55:07
Cuz if you have a billion parameters,
55:09
the compute time on that is horrendous
55:11
and expensive. That's why the
55:13
weights are so valuable.
55:14
>> Correct. The weights are the crown jewel
55:16
because they are the result of a lot of
55:18
money and time and smartness being
55:19
spent.
55:21
There is a separate question of why are
55:23
they making it open source,
55:25
which
55:26
I'm happy to chat about offline.
55:28
All right, cool. So, what are the
55:29
variables we need to change to
55:30
minimize? It's basically the parameters
55:32
and they're hiding inside the model
55:34
term.
55:36
Right? Because what is the model? The
55:38
model is some function like that, right?
55:41
If you look at the simple GPA and
55:42
experience thing we looked at on
55:44
Monday, we finally figured out that the
55:46
actual thing that comes out here is
55:48
going to be this complicated function of
55:50
all the X's and the W's and so on and so
55:52
forth, right? And that complicated thing
55:54
is showing up inside this thing.
55:57
So,
55:58
you know, and the W's here are the
56:00
variables we need to change to
56:02
minimize the loss function. And it's
56:05
important for you to note and
56:06
understand that the values of X and Y
56:10
and so on are just data.
56:13
You're not optimizing anything there.
56:14
They're just data.
56:15
What you're optimizing is the W's.
56:17
The weights.
56:22
Okay. So, so imagine replacing the model
56:26
here with the mathematical expression
56:27
above, wherever it appears in the loss
56:29
function. And once you do that, your
56:31
loss function is just a good old
56:33
function of the W's.
56:35
The fact that it's a loss function is
56:37
kind of irrelevant.
56:39
It's just a function.
56:41
And since it's just a good old function
56:42
of the W's, you can apply gradient
56:43
descent to it as we normally would.
56:45
It's no big deal.
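As a tiny sketch of that idea, here is the binary cross-entropy loss written as a plain Python function of the weights only, with a made-up one-layer model and made-up data standing in for the real network; only w and b are variables, the data is just plugged in.

import numpy as np

# Fixed data: these never change during training, they are just plugged in.
X = np.array([[3.9, 2.0], [2.8, 0.5], [3.2, 4.0], [2.1, 1.0]])  # made-up features
y = np.array([1.0, 0.0, 1.0, 0.0])                              # made-up labels

def loss(w, b):
    # Binary cross entropy, viewed purely as a function of the weights (w, b).
    z = X @ w + b                    # a tiny one-layer "model" standing in for the network
    p = 1.0 / (1.0 + np.exp(-z))     # sigmoid output = predicted probability
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Different weights give different loss values; training means searching over w and b.
print(loss(np.array([0.1, 0.1]), 0.0))
print(loss(np.array([1.0, -0.5]), -2.0))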
56:49
Which brings us to something called
56:50
backpropagation.
56:52
Um
56:56
Um if you remember nothing else about
56:57
backpropagation, just remember this.
56:59
Never use the word backpropagation
57:01
again. Only use the word backprop.
57:04
You're
57:05
hip and cool to the deep learning
57:06
community.
57:07
Backprop.
57:09
Okay. All right. So, what is backprop?
57:12
Backprop is a very efficient way to
57:14
compute the gradient of the loss
57:16
function.
57:17
So, when you have this loss function,
57:19
and let's say you have a billion W's
57:21
and you have 10 million data points. So,
57:24
the little n we saw was 10 million.
57:27
That is a lot of computation.
57:30
And that is just for one step of
57:32
gradient descent.
57:34
Right? So, backprop is a very
57:37
efficient and clever way to compute the
57:39
gradient of the loss function, which
57:41
takes advantage of the fact that what we
57:44
have here is not some arbitrary model.
57:47
It's a model that came from a particular
57:49
kind of neural network, which has layers
57:51
one after the other, and then there was
57:53
an output at the very end.
57:55
So, what backprop does is
57:57
it organizes the computation in the form
57:59
of something called a computational
58:00
graph, and the book has a good
58:01
discussion about it. And so, what we do
58:03
is we start at the very end.
58:05
We calculate the gradient of the loss
58:08
with respect to the output.
58:10
Then we move left. We calculate the
58:12
gradient of that output with respect to
58:13
the output of just the prior hidden
58:15
layer.
58:17
Step to the left. Calculate the gradient
58:19
of the current thing with respect to the
58:20
previous layer. You get the idea, right?
58:22
It's iterative and it moves backwards,
58:25
and by doing so, you never repeat the
58:27
same computation twice wastefully.
58:30
That's the big advantage. You calculate
58:32
once and reuse it many, many
58:34
times.
58:35
The second advantage is that if you
58:37
organize it this way, it just becomes a
58:39
sequence of matrix multiplications.
58:42
Okay.
58:42
And
58:45
it's a sequence of matrix
58:46
multiplications and eliminates redundant
58:48
calculations. And best of all,
58:51
there are these things called GPUs,
58:53
graphics processing units, originally
58:54
invented to accelerate video game
58:56
rendering.
58:57
Uh and as it turns out, to accelerate
58:58
video game rendering, the core math
59:00
operation you do is basically a matrix
59:02
multiplication. Right? Some linear
59:03
algebra uh
59:05
sort of operations. And so, someone
59:07
really at some point had the bright idea:
59:09
for deep learning, calculating gradients
59:11
and so on, we need to do matrix
59:13
multiplications, and here is some
59:14
specialized hardware
59:17
that does a fast job of matrix
59:19
multiplications. Can we use
59:20
this for that?
59:22
And they did it. And all hell broke
59:24
loose.
59:26
That's literally what happened.
59:28
And that's why Nvidia is valued at what,
59:30
1.5 trillion or something.
59:32
So, yeah. So, they are really good. And
59:35
so,
59:37
the way you do backprop plus using it on
59:40
GPUs leads to fast calculation of loss
59:42
function gradients.
59:44
If this thing were not true, this class
59:47
would not exist.
59:49
Because there wouldn't have been a deep learning
59:50
revolution.
59:52
This is a fundamental seminal reason.
59:57
All right. So, the book has a bunch of
59:59
detail
1:00:00
um
1:00:01
and I actually hand-worked
1:00:05
out an example
1:00:07
of calculating a gradient like the
1:00:09
old-fashioned way and calculating it
1:00:11
using backprop.
1:00:13
So, take a look at it. I'll post it on
1:00:14
Canvas and you will understand exactly
1:00:17
where the savings come from, where the
1:00:18
efficiency gains come from. Okay?
1:00:21
Because of time, I'm not going to get
1:00:22
into it now.
1:00:26
All right. Any questions so far?
1:00:28
Yep.
1:00:30
Sorry, a follow-up. So, we've
1:00:32
done gradient descent, which is
1:00:34
different than calculation of the
1:00:36
gradient of the loss function. What
1:00:37
is the purpose of the calculation of the
1:00:39
gradient of the loss function?
>> You
1:00:41
calculate the gradient because the
1:00:42
fundamental operation of gradient
1:00:44
descent is to take your current value of W
1:00:48
and modify it slightly and the
1:00:50
modification is old value minus learning
1:00:52
rate times gradient.
1:01:03
It'd be cool, right, if I say, "Go
1:01:04
back five slides to this thing," and
1:01:06
it just goes back. Product idea. Anyone
1:01:08
startups?
1:01:09
So.
1:01:11
So, this one.
1:01:14
So, this is the fundamental step of
1:01:15
gradient descent.
1:01:16
So, this is the current value of W.
1:01:19
You calculate the gradient at that
1:01:20
current value
1:01:22
multiplied by alpha do this thing and
1:01:24
you get the new value.
1:01:26
And you keep repeating.
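A bare-bones worked version of that single step, with made-up numbers just to show the arithmetic:

w = 2.0          # current value of a weight
alpha = 0.1      # learning rate
grad = 0.8       # gradient of the loss at w (pretend we just computed it)
w = w - alpha * grad   # the gradient descent update
print(w)         # 1.92 -- nudged against the gradient; now repeat with a fresh gradient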
1:01:27
Right, but GW
1:01:29
that's not the loss function.
1:01:32
>> It is the loss function. That is the
1:01:33
loss function.
1:01:34
>> Yeah, right. Here, I'm just using G as
1:01:35
an arbitrary function
1:01:37
to just to demonstrate the point. But
1:01:39
when you're optimizing, when you're
1:01:41
training a neural network, what you're
1:01:42
actually doing is minimizing a loss
1:01:45
function. Right.
1:01:46
>> Loss of W. Sorry, I got things mixed up.
1:01:49
Thank you.
1:01:51
>> Yeah.
1:01:53
Uh how do we define the initial weights
1:01:54
for the neural network?
1:01:55
>> Ah.
1:01:57
So, yeah, the initial weights um
1:02:02
So, there are many ways to do this. So,
1:02:04
first of all, they are initialized
1:02:04
randomly.
1:02:06
Uh but randomly doesn't mean you can
1:02:08
just pick any random weight. There are
1:02:09
actually some good ways to randomly pick
1:02:11
the weights. Uh those are called
1:02:13
initialization schemes. Um and there are
1:02:16
a bunch of very effective initialization
1:02:18
schemes people have figured out over the
1:02:19
years and those things are baked into
1:02:21
Keras as the default.
1:02:22
So, Keras, I believe, uses something
1:02:24
called the
1:02:26
uh He initialization, H E
1:02:27
initialization, or the Xavier Glorot
1:02:31
initialization. I wouldn't worry about
1:02:33
it. Just go with the default
1:02:33
initialization.
1:02:36
The reason why they have to be very
1:02:37
careful about how these weights are
1:02:38
initialized is because if you have a
1:02:40
very big network and if you initialize
1:02:43
badly then
1:02:45
the gradient will just explode as you
1:02:47
calculate it.
1:02:48
In the earlier layers, the weights will
1:02:50
have massive gradients or the gradients
1:02:52
will vanish.
1:02:53
So, they're called the exploding
1:02:55
gradient problem or the vanishing
1:02:56
gradient problem. To avoid all those
1:02:58
things, researchers have figured out
1:02:59
some clever way to initialize so that
1:03:00
it's well-behaved throughout.
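For what that looks like in Keras, here is a minimal sketch; the layer sizes are made up, and the exact defaults can vary by layer type and version, but Dense layers have generally defaulted to a Glorot-style initializer, and you can ask for He initialization explicitly if you want it.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(13,)),
    # Leaving kernel_initializer alone uses the Keras default (a Glorot/Xavier scheme).
    tf.keras.layers.Dense(16, activation="relu"),
    # Or ask for He initialization explicitly, a common pairing with ReLU.
    tf.keras.layers.Dense(8, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])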
1:03:03
Yep.
1:03:05
If using um backprop and GPUs was so
1:03:08
critical, I'm just curious like who
1:03:10
first did it and when? Was this like a
1:03:12
couple years ago? Was it a company? Was
1:03:14
it a...
1:03:15
>> Yeah. Well, GPUs have been used for deep
1:03:17
learning, I want to say um
1:03:20
I think the first uh case may have been
1:03:22
in the mid-2000s, 2005 or 2006, that sort of thing.
1:03:26
But I would say that it sort of burst
1:03:27
out onto the world stage and made
1:03:30
everyone take notice when uh a deep
1:03:32
learning model called AlexNet
1:03:35
in 2012 won a very famous
1:03:38
computer vision competition.
1:03:40
Uh and it beat the field and set a world
1:03:43
record for how good it was.
1:03:45
Uh and that's when everyone was like,
1:03:46
"Hey, what is this thing?" And that's
1:03:48
really when it burst onto the world
1:03:49
stage. I'll talk a bit more about it
1:03:50
when I get into the computer vision
1:03:51
segment of the class.
1:03:54
But you can Google AlexNet and you'll
1:03:55
find a whole bunch of history around it.
1:03:59
If you do this, is it
1:04:00
true that if you could get to a global minimum,
1:04:04
that would mean there would be no
1:04:06
hallucinations?
1:04:07
Aha, good question.
1:04:09
So, is the model perfect
1:04:11
if you get to a global minimum? First of
1:04:13
all, a global minimum doesn't mean the
1:04:14
model is perfect, right? It may still
1:04:15
have some loss.
1:04:17
Um
1:04:18
but the global minimum is going to be on the
1:04:21
training data.
1:04:24
You can imagine that the test data,
1:04:26
future data has its own loss function,
1:04:28
right?
1:04:29
So, what is minimum here may not be
1:04:31
minimum there. That's the problem.
1:04:36
Is that a comment? No, okay.
1:04:38
Just saying that
1:04:40
uh that would mean that also you can be
1:04:42
overfitting...
1:04:43
>> Correct. Exactly. Exactly. So, if you
1:04:45
overdo, if you find the best thing in
1:04:47
the training loss, chances are it
1:04:48
doesn't match the best thing of the test
1:04:50
data.
1:04:52
So, on the test data, you're actually
1:04:53
doing badly.
1:04:56
Okay. So,
1:04:57
uh come back to this.
1:05:03
Okay. Now, uh the final twist in the
1:05:06
tale here: uh we're going to go from
1:05:08
gradient descent to something
1:05:10
called stochastic gradient descent. And
1:05:11
stochastic gradient descent or SGD is
1:05:14
the workhorse for all deep learning.
1:05:16
Okay?
1:05:17
And funnily enough, SGD is simpler than
1:05:19
GD.
1:05:20
Okay? Just when you thought it couldn't
1:05:21
get simpler, right?
1:05:23
Okay. So,
1:05:25
So, for large data sets, computing the
1:05:27
gradient of the loss function can be
1:05:28
very expensive. Right? Needless to say.
1:05:31
Because it has to be done at every step
1:05:32
and the cardinality of the data set is
1:05:34
really big. Right? And you may have, I
1:05:36
don't know, billions of parameters. It's
1:05:38
just very, very
1:05:39
tough to compute it even with backprop.
1:05:43
So, the solution is at each iteration,
1:05:45
when I say iteration, I'm talking about
1:05:47
this step of gradient descent.
1:05:50
Instead of using all the data
1:05:52
instead of calculating the loss function
1:05:54
by averaging the loss across all N data
1:05:57
points and then calculating the gradient
1:05:59
of that thing, what you do is you just
1:06:01
choose a small sample randomly. You
1:06:04
choose just a few of the N observations
1:06:06
and we call it a mini batch.
1:06:08
So, for example, the number of data
1:06:10
points, you may have 10 billion
1:06:11
data points
1:06:12
but in every iteration, you may
1:06:14
literally grab just like 32 or 64,
1:06:16
something really small.
1:06:18
Like absurdly small.
1:06:20
Okay?
1:06:21
And then you pretend that okay, that's
1:06:23
all the data I have. You calculate the
1:06:24
loss, find the gradient and just use
1:06:27
that here instead.
1:06:30
Okay? So, this is called stochastic
1:06:33
gradient descent. So, strictly speaking
1:06:36
theoretically, SGD uses just one data
1:06:39
point.
1:06:40
But in practice, we use what's called a
1:06:42
mini batch, 32, 64, whatever.
1:06:44
Uh and so, mini batch gradient descent
1:06:47
is just loosely called stochastic
1:06:48
gradient descent, SGD.
1:06:52
So, and SGD, as it turns out
1:06:55
you can see it's clearly very efficient,
1:06:57
right? Because
1:06:58
it's just processing a few at a time.
1:07:00
Uh and in fact, if you have a lot of
1:07:02
data
1:07:03
and you calculate the full gradient of
1:07:05
the loss function, it may not even fit
1:07:07
into memory.
1:07:09
Right? It's really problematic. But with
1:07:11
SGD, it says, "I don't care whether you
1:07:12
have a billion data points or a trillion
1:07:14
data points. Just give me 32 at a time."
1:07:17
Okay? And you just keep on doing it.
1:07:19
And
1:07:20
turns out, because not all the points
1:07:22
are used in the calculation, this only
1:07:24
approximates the true gradient. Right?
1:07:26
It's only an approximation. It's not the
1:07:27
real thing. It's only an approximation.
1:07:29
But it works extremely well in practice.
1:07:32
Extremely well in practice.
1:07:33
And there's a whole bunch of research
1:07:34
that goes into why it is so effective.
1:07:37
And you know, people are discovering
1:07:39
interesting things about SGD, but we
1:07:40
don't have like a definitive theory as
1:07:42
to why it's so good yet. We have some
1:07:44
interesting, you know, uh research
1:07:46
threads that have happened.
1:07:47
And very tantalizingly, very
1:07:50
tantalizingly
1:07:51
because it's only an approximation of
1:07:53
the true gradient
1:07:55
SGD can actually escape local minima.
1:07:59
So,
1:08:00
in the true loss function, you're
1:08:02
at a local minimum
1:08:04
but in SGD's loss function, when you're
1:08:06
doing SGD, you're reaching the
1:08:08
minimum of the SGD loss function
1:08:11
which may not be a minimum of the actual
1:08:13
loss function. So, as you're moving
1:08:14
around, you're actually jumping from
1:08:16
local minima to local minima of the
1:08:18
actual loss function.
1:08:20
I know that's a mouthful. I'm happy to
1:08:22
tell you more. It's just a side thing
1:08:24
that I just wanted you to be aware of.
1:08:25
Okay?
1:08:26
One of the reasons why SGD is actually
1:08:27
effective. It's almost like you work
1:08:30
less and you do better.
1:08:34
How many times does it happen in life?
1:08:35
This is one of them.
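Here is a rough sketch of that loop in plain NumPy, using a made-up logistic-regression-style model so the gradient is easy to write down; the dataset, sizes, and names are all illustrative, but the structure is the whole idea: grab 32 points, compute the loss gradient on just those, take one step, and repeat.

import numpy as np

rng = np.random.default_rng(0)
N, D = 100_000, 13                                  # made-up dataset size and feature count
X = rng.normal(size=(N, D))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)     # synthetic labels

w, b = np.zeros(D), 0.0
alpha, batch_size = 0.1, 32

for step in range(2000):
    idx = rng.integers(0, N, size=batch_size)       # grab a random mini batch...
    Xb, yb = X[idx], y[idx]                         # ...and pretend it's all the data you have
    p = 1.0 / (1.0 + np.exp(-(Xb @ w + b)))         # predictions for just this mini batch
    grad_w = Xb.T @ (p - yb) / batch_size           # gradient of the BCE loss on this batch
    grad_b = np.mean(p - yb)
    w -= alpha * grad_w                             # one gradient descent step per mini batch
    b -= alpha * grad_b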
1:08:39
Okay? Now, SGD comes in many flavors.
1:08:42
Uh many siblings. It's got a lot of
1:08:44
siblings and variations. It's a big
1:08:45
family. Uh and we're going to use a
1:08:47
particular flavor called Adam
1:08:49
as our default in this course and I'll
1:08:52
get back to it when we get into the
1:08:53
Colabs and things like that.
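In Keras, picking that flavor is a one-liner; a minimal sketch (the learning rate shown is just the usual default):

import tensorflow as tf

# Adam is an SGD variant with per-weight adaptive step sizes; in Keras you just name it.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
# model.compile(optimizer=optimizer, loss="binary_crossentropy")  # then hand it to compile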
1:08:56
All right.
1:08:57
Um
1:08:58
By the way
1:09:00
you know all these pictures
1:09:01
I've been showing you, a nice little
1:09:02
function like that, a little bowl and so
1:09:04
on.
1:09:05
This is a visualization
1:09:07
of an actual neural network loss
1:09:08
function.
1:09:11
You can see like the hills and valleys
1:09:12
and the cracks and so on and so forth.
1:09:14
Okay? And you can check out the paper to
1:09:16
get more insight into how they actually,
1:09:18
you know, came up with this
1:09:19
visualization. It's crazy.
1:09:21
It's complicated.
1:09:24
Yep.
1:09:25
So, for SGD, do you perform the
1:09:28
iterations until you minimize the loss
1:09:30
function for each mini batch and then
1:09:32
move to another mini batch?
>> Yeah, so
1:09:34
what you do is you take each mini batch
1:09:36
and then
1:09:37
you calculate the loss for the mini
1:09:39
batch, you find the gradient.
1:09:41
And use the gradient and update the W.
1:09:43
Then you pick up the next mini batch. So
1:09:45
you don't pick a mini batch
1:09:47
and try to perform the iterations on
1:09:48
that mini batch until you reach the...
1:09:50
>> Each mini batch, one iteration. Each
1:09:52
mini batch, one iteration. Because if
1:09:54
you do a lot of iterations on one mini
1:09:56
batch,
1:09:57
first of all, you'll never be sure that
1:09:58
you're going to find any optimal
1:09:59
solution because you're not guaranteed
1:10:00
any global minimum. And secondly, it's
1:10:03
much better for you to get new
1:10:04
information constantly because what you
1:10:05
can do is you can revisit that mini
1:10:07
batch later on.
1:10:09
Right? And that gets into these things
1:10:10
called epochs and batch size and so on,
1:10:13
which we'll get into in a lot of gory
1:10:14
detail when we do the Colab.
1:10:16
So let's revisit that question. It's a
1:10:17
good question.
1:10:20
Yeah.
1:10:22
When you do the backprop process, Very
1:10:25
good. Backprop. Not backpropagation.
1:10:26
Nice. I made sure.
1:10:27
>> Yes.
1:10:29
Well, it sounded like you started
1:10:30
from the layers that were closest to the
1:10:32
output and you went backward. Okay. And
1:10:35
um my question is are you doing that
1:10:36
once or is it looping multiple times and
1:10:39
then
1:10:39
>> You do it once. Just once. Yeah. So for each
1:10:42
gradient calculation, you do it once.
1:10:44
Why does it want to start
1:10:45
from the layer that's closest or why do
1:10:47
you want to start it from the layer
1:10:48
that's closest to the output?
1:10:49
>> Yeah. So basically what happens is let's
1:10:51
say, just for argument, that you
1:10:53
go in the reverse direction.
1:10:54
You will discover that a lot of paths to
1:10:56
go from the left to the right will end
1:10:58
up calculating certain intermediate
1:10:59
quantities including the very final
1:11:02
gradient sort of item
1:11:04
again and again and again.
1:11:06
Same thing is going to get calculated
1:11:07
again and again and again. So by
1:11:09
starting from the end and working
1:11:10
backwards, you just reuse stuff you've
1:11:12
already calculated.
1:11:14
So that is sort of the rough idea. But
1:11:15
if you see my PDF, I've actually worked
1:11:17
out the example, and that will
1:11:19
demonstrate what I'm talking about.
1:11:23
By the way, this backprop
1:11:25
is just sort of the chain rule.
1:11:28
Like in calculus, we have something
1:11:29
called the chain rule.
1:11:31
To calculate the derivative of a
1:11:32
complicated function, you calculate the
1:11:34
derivative of the outer
1:11:35
function then the inner function and so
1:11:37
on and so forth. The backprop is
1:11:39
essentially a way to organize the chain
1:11:40
rule to work with the neural network
1:11:42
layer-by-layer architecture. That's all.
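To make the "chain rule, organized layer by layer" idea concrete, here is a toy hand-worked sketch for a two-layer network with a single neuron per layer; the numbers are made up, and there are no matrices here, just the chain rule with each downstream gradient reused as you move left.

import numpy as np

# Tiny network: x -> h = relu(w1 * x) -> p = sigmoid(w2 * h), loss = (p - y)^2
x, y = 2.0, 1.0
w1, w2 = 0.5, -1.0

# Forward pass, keeping the intermediate values around.
h = max(0.0, w1 * x)
p = 1.0 / (1.0 + np.exp(-w2 * h))
loss = (p - y) ** 2

# Backward pass: start at the loss and walk left, reusing each upstream piece.
dloss_dp = 2 * (p - y)            # gradient at the very end
dp_dz = p * (1 - p)               # sigmoid derivative
dloss_dz = dloss_dp * dp_dz       # computed once, reused for w2 AND for the earlier layer
dloss_dw2 = dloss_dz * h          # gradient for the last layer's weight
dloss_dh = dloss_dz * w2          # passed further back...
dloss_dw1 = dloss_dh * (1.0 if w1 * x > 0 else 0.0) * x   # ...through the ReLU to w1

print(dloss_dw1, dloss_dw2)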
1:11:49
So is it fair to say that once we
1:11:51
are finding the local minimum, we
1:11:54
are not optimizing to all the GWs
1:11:56
because like this local minimum is
1:11:58
coming like from different curves, from
1:11:59
different lines. So
1:12:01
Is that fair to say? When we are using
1:12:02
stochastic gradient descent, yes. So for
1:12:04
in stochastic gradient descent, when you
1:12:06
take say 32 data points from a million
1:12:09
and you're calculating the loss for that
1:12:10
32 data points, you're basically trying
1:12:12
to do a gradient step.
1:12:14
Right? The W equals W minus alpha
1:12:17
gradient thing. You're doing it for that
1:12:20
32-point loss function.
1:12:22
Right? Which is not the 1-million-point
1:12:24
loss function.
1:12:25
That's why it's approximate.
1:12:27
But the approximation, instead of
1:12:29
hurting you, actually helps you because
1:12:31
it helps you escape the local minima of
1:12:33
the global loss function.
1:12:35
So it's it's sort of an interesting and
1:12:37
somewhat technically subtle point, which
1:12:38
is why I'm not getting into it too much,
1:12:40
but I'm happy to give pointers if people
1:12:41
are interested. Yeah?
1:12:44
Uh when you say you initialize the
1:12:45
weights, you initialize for the whole
1:12:47
network or just the end layer and then
1:12:50
go backwards like you
1:12:51
>> No, you initialize everything in one
1:12:52
shot.
1:12:53
Because if you don't initialize
1:12:54
everything in one shot, what's going to
1:12:55
happen is that you can't do like the
1:12:57
forward computation to find the
1:12:58
prediction.
1:13:00
Uh and so they are done independently
1:13:02
and the initialization schemes will take
1:13:05
into account, okay, I'm initializing the
1:13:07
weights between a layer which has 10
1:13:08
nodes on one side and 32 on the
1:13:10
other side and the 10 and the 32
1:13:12
actually play a role in how you
1:13:13
initialize.
1:13:15
Okay. So um so the summary of the
1:13:18
overall training flow
1:13:19
is that, you know, you have an input.
1:13:22
It goes through a bunch of layers. You
1:13:24
come up with a prediction. You compare
1:13:26
it to the true values and these two
1:13:28
things go into the loss function
1:13:29
calculation. You get a loss number.
1:13:31
Right? And you do it for say 10 points
1:13:33
or 32 points or a million points. And
1:13:35
this loss thing goes into the optimizer,
1:13:38
which calculates the gradient. And once
1:13:39
it calculates the gradient, it updates
1:13:41
the weights of every layer using the W
1:13:44
equals W minus alpha times gradient
1:13:45
formula, gradient descent formula. And
1:13:47
then you keep doing this again and
1:13:48
again and again.
1:13:50
This is the overall flow.
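For reference, here is roughly what that whole flow looks like in Keras; the data here is random stand-in data with made-up shapes (13 features, a 0/1 label), not the actual Cleveland Clinic set, so treat it as a sketch of the plumbing rather than the real Colab.

import numpy as np
import tensorflow as tf

# Made-up stand-in for the tabular data: 13 features, binary label.
X = np.random.rand(300, 13).astype("float32")
y = np.random.randint(0, 2, size=(300,)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(13,)),
    tf.keras.layers.Dense(16, activation="relu"),    # hidden layer: your choice
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output: probability of class 1
])

# Loss + optimizer define the "compare prediction to truth, then update weights" loop.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# fit() runs the flow above: mini batches, loss, gradients via backprop, weight updates.
model.fit(X, y, batch_size=32, epochs=5)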
1:13:53
This is how our little network is going
1:13:54
to get built for heart disease
1:13:56
prediction. This is how GPT-4 was built.
1:14:00
And this is how AlphaFold was built.
1:14:02
And AlphaGo was built.
1:14:04
You get the idea.
1:14:07
I mean, it's astonishing, frankly.
1:14:09
If you're not getting goosebumps at the
1:14:10
thought that this simple thing can do
1:14:12
all these complicated things, we really
1:14:14
need to talk offline.
1:14:17
Uh there was a hand raised here. Yeah.
1:14:20
Sorry. Just quickly, this is for each
1:14:23
mini batch, right? So
1:14:25
my question is if you came up with
1:14:27
different weights for each mini batch,
1:14:28
how do you
1:14:30
add it up?
1:14:31
Like, okay, this weight is the
1:14:33
perfect combination for this mini batch,
1:14:35
but you have a different
1:14:37
weight for another mini batch. How do
1:14:39
you combine those two?
>> No.
1:14:41
At each step, what you do is you
1:14:43
start with
1:14:45
a weight.
1:14:46
You run it through for a mini batch. You
1:14:48
come up with the loss function. You
1:14:49
calculate the gradient.
1:14:50
And now using the gradient, you've
1:14:51
updated the weight. Now you have a new
1:14:53
set of weights, right? Which is the
1:14:54
updated weights. Call it
1:14:55
W2 instead of W1.
1:14:57
Now W2 is your network, and when you
1:14:59
take the next mini batch, it's going to
1:15:00
use W2 to calculate the prediction.
1:15:03
And this whole flow will become a
1:15:05
lot clearer when we do the Colabs.
1:15:08
Okay. So we have 3 minutes.
1:15:11
I don't want to go into
1:15:13
regularization and overfitting in 3 minutes.
1:15:15
So let's have some more questions.
1:15:19
Yeah.
1:15:20
Can you use any activation function as
1:15:22
long as it gives like positive values?
1:15:25
For like X squared or mod X or
1:15:26
something. Um you can use a variety of
1:15:29
activation functions.
1:15:31
Um
1:15:33
Uh, but yeah, there's a whole
1:15:35
literature on, you know, the pros and
1:15:37
cons of various activation functions
1:15:38
that you could use.
1:15:39
But in general, you have to make sure of
1:15:42
a couple of things. One is that when you
1:15:44
do backprop,
1:15:46
the gradient is going to flow through
1:15:48
the activation function in the reverse
1:15:49
direction.
1:15:50
And the activation function should
1:15:52
actually sort of make sure the gradient
1:15:53
doesn't get squished.
1:15:55
It shouldn't get squished. It shouldn't
1:15:56
explode either.
1:15:58
So those are some considerations and
1:16:00
these are technical considerations, but
1:16:01
all those considerations have to
1:16:02
be taken into account. If you can take
1:16:04
those into account, then you're okay.
1:16:07
That's sort of the key thing to keep in
1:16:08
mind.
1:16:08
And that's in fact why the ReLU is
1:16:10
actually very popular
1:16:11
because as long as the value is
1:16:13
positive, the gradient of the ReLU is
1:16:15
just one. Right?
1:16:18
Uh because
1:16:22
So if you look at something
1:16:24
Oops.
1:16:28
Was it frozen?
1:16:30
I jinxed it.
1:16:31
So sorry, livestream.
1:16:34
If you have something like this,
1:16:37
the ReLU is like that, right?
1:16:39
So the gradient here
1:16:41
is always going to be one.
1:16:43
Which means that as long as the value is
1:16:44
positive, whatever gradient comes in
1:16:46
like this, it just like gets multiplied
1:16:47
by one and gets pushed out the other
1:16:49
side. So it doesn't get
1:16:50
harmed or squished or anything like
1:16:52
that. Um so that's one reason why the
1:16:55
ReLU is very popular because it
1:16:57
preserves the gradient while injecting
1:16:59
almost like the minimum amount of
1:17:00
non-linearity to do interesting things.
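A tiny sketch of that point: the ReLU's derivative is 1 wherever its input was positive, so whatever gradient arrives from later layers passes through those units unchanged (and is zeroed where the input was negative). The numbers here are made up.

import numpy as np

def relu_grad(z):
    # Derivative of ReLU: 1 where the input is positive, 0 otherwise.
    return (z > 0).astype(float)

z = np.array([-2.0, 0.5, 3.0])           # inputs the ReLU saw on the forward pass
upstream = np.array([0.7, 0.7, 0.7])     # gradient flowing back from later layers
print(upstream * relu_grad(z))           # [0.  0.7 0.7] -- positive inputs pass it through as-is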
1:17:04
Um yeah.
1:17:07
If you have a high number of dimensions,
1:17:10
can you do mini batching on like
1:17:13
feature dimensions instead of just
1:17:14
observations and keep the same number of
1:17:17
observations, but just take a small
1:17:19
sample of the number of features that
1:17:21
you're actually using? Oh, I see. I see.
1:17:24
So you're saying let's say you have 10
1:17:25
features.
1:17:27
Um instead of taking all data points of
1:17:28
10 features, what if you choose
1:17:31
five features and just use them and do
1:17:33
the thing?
1:17:34
as long as you can actually compute the
1:17:36
prediction.
1:17:38
To compute the prediction, you may need
1:17:39
all 10 features.
1:17:41
Right? Or you need to have some defaults
1:17:43
for those features.
1:17:44
And if you define defaults for those
1:17:46
other five features, you're basically
1:17:48
using all the features.
1:17:50
So that's the key thing. Can you
1:17:51
actually calculate the prediction
1:17:53
with only some of the features? And typically, you
1:17:55
can't.
1:17:57
All right?
1:17:58
Okay, folks. 9:55. I'm done. Have a
1:18:00
great rest of your week. I'll see you on
1:18:02
Monday.
— end of transcript —