2: Training Deep NNs (cont.); Introduction to Keras/Tensorflow; Application to Tabular Data
MIT OpenCourseWare
May 11, 2026
Transcript
0:21
Okay. So, let's get going. Today we're
0:24
going to talk about how do you actually
0:26
train a neural network, right? Because
0:28
that is sort of the heart of the game
0:30
here. Um so, just to recap, we looked
0:33
last class
0:34
at what it takes to design a neural
0:36
network, and we made this very important
0:38
distinction between the things that you
0:40
are handed by your problem and the
0:42
things that you have agency over, that
0:44
you have control over. And we noticed
0:46
that, you know, the input layer for your
0:49
problem, the input is the input. Uh the
0:51
output is the output. You got to do
0:53
something with the output, something
0:54
that's expected. But everything that
0:56
happens in the middle is actually in
0:58
your hands. And in particular, we
1:00
noticed that we have to decide how many
1:03
hidden layers we want. We have to decide
1:05
in each layer how many neurons to have.
1:08
And then we had to decide what uh
1:11
activation to use. Even though I'm kind
1:13
of cheating when I say that because I
1:14
told you very clearly on Monday that for
1:17
the hidden layer activation, just go
1:18
with the ReLU activation function. You
1:20
don't have to think deep thoughts about
1:22
this, okay?
1:23
But the other things are all choices you
1:24
have to make, and we will talk a bit
1:26
later about how do you actually make
1:28
those choices.
1:29
Okay. Now, the rule of thumb,
1:32
right? The rule of thumb always is to
1:34
start with the simplest network you can
1:36
think of.
1:37
And if it's if it gets the job done,
1:39
stop working on it.
1:41
If it's not good enough, make it
1:42
slightly more complicated. Okay? So,
1:45
that's sort of the, you know, like the
1:46
meta thing you have to remember always
1:48
when you're designing these things.
1:49
Okay. So, that's sort of, you know, what
1:52
it takes to design a deep neural
1:53
network. So, what we will do in this
1:55
class is we'll actually take a real
1:57
example with real data, and then we
1:59
we'll think through how we would design
2:01
a network to solve this problem.
2:03
And while doing so, we will cover a
2:05
whole bunch of conceptual foundations
2:07
such as optimization, loss functions,
2:09
gradient descent, and all that good
2:11
stuff.
2:12
Okay?
2:12
All right. So, the case study or the
2:16
scenario here is we have a data set of
2:18
patients uh made available by the
2:20
Cleveland Clinic. And essentially, we
2:23
have a bunch of patients, and for all
2:25
these patients, the setting is that they
2:27
have come into the Cleveland Clinic, and
2:29
they have not come in with a heart
2:31
problem. They have come in for something
2:32
else. Maybe they just came in for a
2:33
physical. And we measured a whole bunch
2:36
of things about them, okay? And the
2:38
kinds of things we measured are, you
2:40
know, demographic information, like
2:41
what's their age, uh gender, whether
2:44
they have any chest pain at all when
2:45
they came in, blood pressure,
2:47
cholesterol, sugar, so on and so forth.
2:50
Right? You get the idea? Demographic
2:52
information and a bunch of biomarker
2:53
information. And then,
2:56
what the Cleveland Clinic uh did was
2:59
they actually tracked these people
3:01
and figured out in the next year,
3:04
did they get diagnosed with heart
3:05
disease or not?
3:07
Okay, in the next year.
3:09
Which means that maybe you can build a
3:10
model when someone comes in, even though
3:12
they didn't come in for a chest problem,
3:15
maybe you can predict that something's
3:16
going to happen to them in the next
3:17
year, right? It's a nice sort of classic
3:20
machine learning setup.
3:23
All right. So, this is the thing. So,
3:24
what we want to do is we can totally
3:26
solve this problem using decision trees,
3:28
random
3:29
forests and gradient boosting and all
3:31
that good stuff you folks have already
3:33
learned from machine learning.
3:35
But we will try to solve it using neural
3:36
networks, okay? Um this is an example,
3:38
of course, of what's called structured
3:40
data because this is all data sitting in
3:41
the columns of a spreadsheet, right? Uh
3:43
so, working with structured data is the
3:46
way we warm up our knowledge of neural
3:48
networks. And then we will do things
3:50
like working with unstructured data
3:51
starting next week with images and then
3:53
later on with text and so on and so
3:55
forth. Okay, any questions on this?
4:00
Okay. Uh yes. Uh just connected even to
4:03
last time's class where we took uh the
4:05
same example and first it was a logistic
4:07
and then we did a neural network. So,
4:10
the probability in case of one was 0.85,
4:12
then was 0.22, and here as well, how do
4:14
you know when to uh
4:16
use what? Usually in textbooks, you know
4:19
when to use logistic or when to use uh
4:21
something else, but in this case,
4:24
uh
4:25
when do I complicate it to neural
4:27
networks vis-à-vis in this case maybe
4:29
just doing a random forest? It's a great
4:30
question. Uh when do you use what? So, I
4:33
think there are two broad dimensions
4:34
that you have to think about. One broad
4:35
dimension is
4:37
uh how important is it that you need to
4:39
explain or interpret what's going on
4:41
inside the model to perhaps a
4:43
non-technical consumer.
4:46
The other dimension is how important is
4:48
sheer predictive accuracy.
4:50
In some situations, predictive accuracy
4:52
trumps everything else. In which case,
4:54
just go with it. In other cases,
4:56
explainability becomes a big deal
4:57
because if they can't understand, they
4:59
won't use it.
5:00
And in those cases, it's probably better to
5:02
go with simpler models such as decision
5:04
trees,
5:04
maybe even
5:07
random forests, certainly logistic
5:09
regression. Those are all a little more
5:10
amenable.
5:12
But that said, uh even complex black box
5:15
methods like neural networks, there is a
5:17
whole field called mechanistic
5:19
interpretability,
5:20
which seeks to try to get insight into
5:23
what's going on inside these big black
5:24
boxes. So, the story isn't over, right?
5:28
But that's just the first cut as you sort
5:30
of analyze the problem.
5:33
Okay. So,
5:35
um let's get going. So, if you want to
5:37
design a network,
5:39
All right. So, we design the network. Uh
5:42
so, we have to choose the number of
5:43
hidden layers and the number of neurons
5:45
in each layer. Then we have to pick the
5:46
right output layer. So, here,
5:49
what I did is the simplest thing you can
5:51
do, of course, is to have no hidden
5:52
layer.
5:53
So, if you have no hidden layers, what
5:55
is that model called?
5:58
Yes, logistic regression.
6:00
Okay? So, of course, we want to do a
6:02
neural network, so I'm going to have one
6:03
hidden layer because that's the simplest
6:05
thing I can do. And then, I'll confess,
6:08
I tried a few different numbers of
6:09
neurons in this thing, and when I had 16
6:12
neurons, it actually did pretty well.
6:14
Okay? So, there was some trial and error
6:15
that went on before I landed on the
6:16
number 16. Right? And for some reason,
6:19
people always use powers of two, so may
6:20
as well do that.
6:22
So, I tried like 4, 8, 16, and 16 was
6:24
really good.
6:25
And as it turns out, when I went above
6:27
16, uh it sort of started to do badly.
6:30
And it started to do badly because
6:31
something called overfitting,
6:33
which we're going to talk about later,
6:35
okay? So, yeah, 16.
6:37
Um and then by default, I use ReLUs,
6:39
okay? So, 16 ReLU neurons. And then
6:42
here, the output is a categorical
6:44
output, right? Heart disease, yes or no,
6:47
one or zero, classification problem,
6:49
which means that we want to emit a
6:51
probability at the very end. Therefore,
6:53
we'll use a sigmoid.
6:54
Okay? So, so far, so good, right? Any
6:57
questions?
6:59
All right.
7:00
So, we're going to lay out this network
7:02
visually.
7:03
Okay? So, we have an input, and so I
7:06
just have an input. And as you will
7:09
see here,
7:10
X1 through X29, that's our input layer.
7:13
And you may be wondering, 29, where did
7:15
he get that from?
7:17
Because there doesn't seem to be like 29
7:19
rows here of independent variables. So,
7:22
it turns out there are only 13 input
7:24
variables here,
7:26
but some of them are categorical.
7:29
So, what I ended up doing is to take
7:31
each categorical variable and one-hot
7:32
encode it.
7:34
Okay?
7:35
And when you do that, you get to 39.
7:37
Sorry, 29.
7:39
All right? And when we actually do the
7:40
Colab later on, I'll show you exactly
7:43
how I one-hot encoded it, but
7:45
that's what I'm doing here.
7:46
That's why you have 29, not 13.
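As a rough sketch of that step (not the exact Colab code; the file name "heart.csv" and the label column "target" below are hypothetical), one common way to go from 13 raw columns to 29 one-hot-encoded inputs with pandas:

```python
# Hedged sketch only: the actual Colab may differ; names here are made up.
import pandas as pd

df = pd.read_csv("heart.csv")
X = pd.get_dummies(df.drop(columns=["target"]))  # one-hot encode the categorical columns
print(X.shape[1])  # per the lecture, the 13 raw variables expand to 29 input features
```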
7:49
Okay? Now, obviously, we have decided on
7:51
these hidden units, 16 units,
7:54
with nice ReLUs here.
7:56
Okay? And then we have an output layer
7:57
with a little sigmoid.
7:59
And I got bored of trying to draw all
8:01
these arrows, so I just gave up and
8:02
said, "Assume there are arrows."
8:05
Okay, between all these things.
8:07
Good?
8:09
Yeah.
8:11
Yeah, I'm sorry. I think you already
8:12
mentioned this, but why 16 units? Why
8:15
16? Uh
8:16
I tried a bunch of different numbers of
8:18
units. Uh and at 16, the resulting model
8:21
did well, so I just went with that. And
8:23
the logic of why is a ReLU?
8:25
Oh, why a ReLU? Yeah, so there's a
8:28
there's just a mountain of empirical
8:29
evidence that suggests that uh ReLU is a
8:31
really good default option for using as
8:35
activations in hidden layers. There is
8:37
also a really great set of theoretical
8:39
results, and I'll allude to some of them
8:41
when we actually talk about gradient
8:42
descent.
8:45
Yeah.
8:47
Sorry, quick question. You mentioned um
8:50
in the input layer, how how did you get
8:51
to 29 again when you had like 13
8:53
variables? So, some of those 13
8:55
variables are categorical variables like
8:58
uh cholesterol low, medium, high. Right?
9:00
And so, I took them and one-hot encoded
9:02
them. So, if it had like five levels, I
9:04
would get five columns now.
9:08
Uh yeah.
9:09
And by the way, folks, um just like was
9:12
just done, please
9:15
use a microphone so that people on the
9:17
live stream can hear your question.
9:18
Yeah, go ahead. Uh sorry, just one
9:20
question. So, the vectors, since you
9:22
didn't represent them, are we assuming
9:23
like every X is connected to all the
9:26
units?
9:26
>> Correct. And this is also a parameter
9:28
that we have to decide or That ends up
9:31
being the default.
9:32
And we will see
9:33
deviations from that assumption when we
9:36
go to image processing and language
9:37
processing and so on. But when you're
9:39
working with structured data like we're
9:40
doing now, that's the default.
9:43
Okay. So, let's keep going.
9:46
So, this is what we have.
9:47
So, remember what I told you in the
9:49
last class? Whenever you're working with
9:50
these networks, right? Get into the
9:52
habit of very quickly calculating the
9:54
number of parameters.
9:55
Right? Just do it a few times, the first
9:57
few times, so that you really know cold
9:59
exactly what's going on. Okay? So, yeah,
10:02
how many parameters do we have here?
10:04
How many weights and biases? You can
10:06
work through it, okay? You don't
10:08
have to tell me the final number. You
10:09
can say x * y + z, stuff like that.
10:14
Yeah.
10:15
65. You have 48 weights and 17 biases.
10:20
Okay, and how did he come up with that?
10:21
So, for the weights, you have like for
10:23
the first layer it's 2 * 16 and for the
10:26
second connection it's 1 * 16 and
10:28
then the biases are the 16 hidden plus
10:30
the outputs.
10:32
Okay.
10:33
Um any other views on this?
10:36
I think it's 29 into 16. 29, okay, 29
10:40
into 16. And then 16 into
10:43
uh plus I mean 16 there. Yeah. And then
10:46
biases 16 biases and one bias. Right.
10:49
So, the way it's going to work is we
10:52
have 29 things here, 16 in the middle,
10:55
so 29 into 16 arrows.
10:58
And then for each of these fellows,
11:00
there's a bias coming in.
11:02
So, that's another 16.
11:05
Plus, you have 16 * 1.
11:08
Which is here, plus there is one bias
11:10
for this one.
11:12
So, the total is 497.
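As a quick check of the arithmetic just described:

```latex
\underbrace{29 \times 16}_{\text{input}\to\text{hidden weights}}
+ \underbrace{16}_{\text{hidden biases}}
+ \underbrace{16 \times 1}_{\text{hidden}\to\text{output weights}}
+ \underbrace{1}_{\text{output bias}}
= 464 + 16 + 16 + 1 = 497
```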
11:16
So, you can see here there's something
11:19
very interesting going on, which is that
11:21
when you go from one layer to another
11:22
layer,
11:24
the number of weights is roughly on the
11:26
order of a * b.
11:28
The number of units and so that's a
11:30
dramatic explosion in the number of
11:31
parameters.
11:33
Right? And that's something we have to
11:34
watch for later on to prevent
11:36
overfitting.
11:38
Okay, that's where the explosion of
11:39
parameters comes from the fact that each
11:41
layer is fully connected to the next
11:43
layer.
11:44
Okay? But we'll revisit this later on.
11:46
Okay.
11:47
So,
11:48
what I'm going to do now is I'm going to
11:50
actually translate this network, right?
11:52
The one that we have laid out
11:53
graphically, into Keras code
11:56
to demonstrate how easy it is.
11:58
Okay? So, I will give a fuller intro to
12:01
Keras in TensorFlow later on, but for
12:03
now, just suspend your disbelief.
12:06
We'll just try to do it in Keras as if
12:08
we know Keras. Okay? So, let's try that.
12:10
Later on we'll get into all the gory
12:12
details and train it in Colab and so on
12:14
and so forth. Okay. All right. So,
12:17
So, the way we typically do it
12:19
is that once we have a network like
12:21
this, we typically start from the left
12:23
and start defining each layer in Keras
12:25
one after the other. So, we flow left to
12:27
right. Okay? So, let's take the input
12:30
layer. The way you define an input layer
12:32
in Keras is really easy.
12:34
You literally say keras.Input.
12:38
Okay? And then you tell Keras how many
12:41
nodes you have in the input coming in.
12:43
In this case it happens to be 29, so you
12:45
tell it the shape. Shape equals 29. And
12:47
the reason why we say shape as opposed
12:49
to length is because, as you will see
12:51
later on, we don't have to just send
12:53
vectors in, we can send complicated
12:55
things in to Keras.
12:57
And those complicated objects could be
12:59
matrices, it could be 3D cubes, it could
13:01
be 4D tensors and so on and so forth.
13:03
So, it's expecting a shape.
13:06
Right? What is the shape of this
13:07
thing you're going to send me? In this
13:09
particular case it happens to be a nice
13:10
list or a vector, so it's 29. Okay,
13:12
that's it. So, we write this down.
13:15
This creates the input layer.
13:17
Right? And we give it a name. Right? And
13:19
the name here means
13:21
this layer, whatever comes out of this
13:23
layer has a name input.
13:26
Okay?
13:27
Good. Next.
13:30
We give it the shape of the input,
13:31
as I mentioned.
13:32
Right there.
13:34
Then we go to the next one. And here and
13:36
we will unpack this. The way you define
13:39
a hidden layer is typically
12:41
keras.layers.Dense
13:43
and all this stuff. Okay? So, what this
13:46
is is it first of all it says
13:48
I want a dense layer. By dense layer I
13:50
mean a layer that's going to fully
13:52
connect to the prior and the later
13:53
layers.
13:55
Fully connect, that's what the word
13:56
dense means. Okay?
13:58
Number two,
13:59
I want 16 nodes here in this layer.
14:02
Okay? Finally, I want to use a ReLU.
14:06
See how compact and parsimonious it is?
14:09
Right? And that is the appeal of Keras.
14:11
It's very easy to get going.
14:13
So, the moment you do that, you've
14:15
actually defined this layer.
14:18
But what you have not done
14:20
is you have not told this layer what
14:23
input is going to get.
14:25
Because as far as this layer is
14:26
concerned, it doesn't know that this
14:28
other layer exists.
14:30
So, you need to connect them. Yes.
14:33
Um do we need to define for the ReLU
14:35
where the bends are? Like where you
14:38
take the max?
14:39
>> No, the ReLU the bend is always at zero.
14:41
Okay. Thank you.
14:45
Okay?
14:47
All right.
14:48
So, that's what we have here.
14:51
And then, what we do is we have to tell
14:53
it that you want to feed this layer the
14:55
output of the previous layer, so you
14:57
feed it by taking whatever is coming out
15:00
of this thing, which is called input,
15:02
and you basically
15:03
stick it in here.
15:05
So, the moment you do that, boom, it's
15:07
going to receive the input from the
15:09
previous layer.
15:10
And because this one's output needs to
15:12
go to the final layer, you need to give
15:15
a name to that output.
15:16
So, you give it a name. I'm just calling
15:17
it h because it's coming out of the
15:19
hidden layer.
15:20
It's just a variable. You can call it
15:21
anything you want.
15:25
Now, what we do, we go to the final
15:26
output layer.
15:28
And this is what we use. The output
15:30
layer is just another dense layer.
15:32
That's why I use the word dense. But we
15:34
say, "Hey, give me just one thing
15:36
because I literally just need one
15:37
unit here because I need to emit just
15:40
one probability.
15:41
And the activation I want to use is a
15:44
sigmoid."
15:46
Done.
15:48
Okay?
15:50
And once you do that, you
15:52
have to feed it the input from the
15:54
second layer. So, you stick an h here.
15:57
Now you have connected the third and the
16:00
second layers.
16:01
And after you do that, you give a name
16:03
to the output coming out of that. We'll
16:04
just call it output. You can call it y,
16:06
you can call it output, you can call it
16:07
whatever you want.
16:09
Okay? So, at this point, what we have
16:11
done
16:12
is we have mapped that picture into
16:14
those three lines.
16:16
That's it.
16:17
Okay?
16:19
But we aren't quite done yet. There's
16:20
one little thing we have to do.
16:22
So, what we have to do is we have to
16:24
formally define a model so that Keras
16:27
can just work with this model object. It
16:30
can train it, it can evaluate it, it can
16:31
use it for prediction and so on and so
16:33
forth. So, we tell Keras, "Hey, uh
16:35
create a model for me, keras.Model,
16:38
and basically where the input is this
16:40
thing here and the output is that thing
16:41
there.
16:42
And then the whole thing we'll just call
16:43
it model."
16:45
Okay? So, that's it.
16:48
We are done. That is the whole model.
16:50
That is It sounds really fancy, right? A
16:52
neural model for heart disease
16:53
prediction. That's pretty cool.
16:56
Four lines.
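For reference, here is a minimal sketch of those four lines in the Keras functional API. The variable names are illustrative assumptions; the actual Colab may differ slightly.

```python
# Minimal sketch of the model described above (assumed variable names).
import keras

inputs = keras.Input(shape=(29,), name="input")           # 29 input features
h = keras.layers.Dense(16, activation="relu")(inputs)     # hidden layer: 16 ReLU units
output = keras.layers.Dense(1, activation="sigmoid")(h)   # one unit emitting a probability
model = keras.Model(inputs=inputs, outputs=output)        # wrap input and output into a model
```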
16:58
And we will show how to train this model
17:00
with real data and so on and so forth
17:02
and use it for prediction after we
17:05
switch gears and really get into some
17:06
conceptual building blocks.
17:08
Had a question.
17:13
Can you define a custom activation
17:16
function that is not in the list of
17:18
Keras library? Yes.
17:21
Yeah, you can define one. The question was,
17:22
can you define a custom activation
17:23
function? You totally can.
17:25
Uh in fact, I mean, the the kind of
17:27
flexibility you have here is incredible.
17:30
And these innocent four lines
17:32
unfortunately sort of hide the
17:34
potential that's possible here, but I
17:36
guarantee you in two to three weeks you
17:38
folks will be thinking in building
17:39
blocks like Legos.
17:41
So, you'll be, you know, I'm so
17:43
happy when it happens. Students will
17:44
come to my office hours and say, "You
17:46
know, I want to create a network where I
17:47
have a little network going up on top,
17:49
one going in the bottom, then they meet
17:50
in the middle, then they fork again,
17:52
they split." I'm like, "Unbelievable."
17:54
It's fantastic. And you're going to be
17:55
doing this in two weeks, I guarantee
17:56
you.
17:58
Yeah, in the case of a multi-class
18:00
classification problem, are the output
18:01
nodes equal to the number of classes?
18:04
Correct.
18:05
So, we will come to So, this is binary
18:07
classification. And the question is for
18:09
multi-class classification, let's say
18:10
you're trying to classify some input
18:12
into one of 10 possibilities, we will
18:14
have 10 outputs.
18:16
But the way we define it is going to be
18:18
using something called a softmax
18:20
function, which we're going to cover on
18:21
Monday.
18:24
So, for now, we just live with binary
18:25
classification.
18:27
Uh
18:29
Is there a default activation method in
18:31
Keras or you have to put something? Ah,
18:33
that's a good question. I believe the
18:35
default is actually linear, that is, no
18:37
activation, so you do want to specify ReLU for hidden layers. Let's
18:39
double-check that.
18:40
Uh
18:42
Uh just to get a clearer understanding,
18:44
when you said that beyond 16 when you
18:47
tried working on those neurons, the
18:50
performance uh worsened.
18:52
So, that is where you were playing
18:53
around with initially two and then maybe
18:54
four and six and eight. Exactly. Right.
18:58
Could you use the mic?
19:02
Do we need to define each of the hidden
19:04
layers when the model gets more complex
19:05
when we have more than one layer? Oh,
19:08
like if you have like 25 layers?
19:09
>> consolidate, yeah. Yeah, yeah, yeah. So,
19:11
what we typically Good question. If you
19:12
have let's say 100 layers, right? Uh do
19:14
you actually have to type in
19:16
each by hand and cut and paste? No. You
19:18
can actually write a little loop which
19:19
will just automatically create them for
19:20
you.
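A hedged sketch of that "little loop" idea follows; the layer count and width below are arbitrary, just to show the pattern, not a recommended architecture.

```python
# Sketch only: stack many hidden layers in a loop instead of typing each one.
import keras

x = keras.Input(shape=(29,))
h = x
for _ in range(10):                                   # e.g. 10 hidden layers
    h = keras.layers.Dense(16, activation="relu")(h)
output = keras.layers.Dense(1, activation="sigmoid")(h)
model = keras.Model(inputs=x, outputs=output)         # Keras traces the whole chain from x to output
```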
19:22
And so, basically what's going on is
19:24
that this little output thing you see
19:26
here, this variable,
19:27
this output could be the result of a
19:30
thousand layer network with all sorts of
19:32
complicated transformations going on and
19:34
then finally it pops up as a little
19:36
thing called the output. And what Keras
19:38
will do is it'll be like, "Okay, this
19:39
model has this input and has this
19:41
output, but boy, this output came from
19:43
incredible transformations applied to
19:45
the input." And Keras will process all
19:47
that very easily for you. You don't have
19:48
to worry about it.
19:49
Right? It's really a beautiful example
19:51
of the power of abstraction.
19:53
And you will you will see that as we go
19:54
along.
19:55
Okay. So,
19:56
now let's switch gears and say once
19:58
you've written a model like that in
20:00
Keras, how do you actually train it?
20:01
Okay? Now, training is something you've
20:04
been doing a lot, right? So, for
20:05
example, when you have something like
20:06
linear regression, right? Where you have
20:08
all these coefficients you need to
20:09
estimate, you have this model, then you
20:12
have a bunch of data, then you run it
20:14
through something like lm if you use R,
20:16
and what it gives you is actual values
20:18
for these coefficients, right? 2.8, 0.9,
20:20
and so on and so forth. So, the role
20:22
of the data is to give you the
20:23
coefficients.
20:25
Right? Or you can think of the
20:26
coefficients as really a compressed
20:28
version of the data.
20:30
Okay? Similarly, if you do logistic
20:31
regression, you have a model like that,
20:33
you add some data, you run it through
20:35
some estimation routine like GLM or
20:37
scikit-learn or statsmodels, pick your
20:40
favorite tool, then you'll come up with
20:42
something like that. So, basically
20:43
what's going on here is training simply
20:45
means find the values of the
20:47
coefficients so that the model's
20:49
predictions are as close to the actual
20:51
values as possible. That's it. Okay? And
20:54
so and to find the one that is as close
20:57
to the actual value as possible, a whole
20:59
bunch of optimization is involved. You
21:01
didn't have to worry about the
21:02
optimization when you did the
21:03
regression, linear or logistic, because
21:05
it's all done under the hood for you,
21:07
but for neural networks, we actually get
21:08
to know how it's done.
21:10
Okay, because it's important.
21:12
Okay. So, training a neural network, a
21:15
deep neural network, even GPT-4, it's
21:18
basically the same process as what you
21:19
do for regression.
21:21
Right? Basically you just have a very
21:23
complicated function with lots of
21:24
parameters, but ultimately you have a
21:26
network with all these question marks,
21:28
you add some data, you do some training,
21:29
and boom, you get some numbers.
21:36
You may get into this, but are we
21:38
determining the architecture of the
21:40
network before we train it?
21:43
Okay. Yes, because if you don't define
21:45
the architecture,
21:46
um Keras doesn't know how to actually
21:49
calculate the output.
21:51
Given an input. And unless it knows
21:53
input-output pairs, it can't do anything
21:55
more with it.
21:58
Okay. So, um
22:00
so the essence of training is to find
22:02
the best values for the weights and
22:04
biases.
22:05
And the way we think of the best values
22:07
is that we basically set up a little
22:09
function, and this function measures the
22:11
discrepancy between the actual and the
22:14
predicted values. Okay? And I use the
22:16
word discrepancy because the way you
22:19
define discrepancy, there's an
22:20
incredible amount of creativity in the
22:22
field.
22:23
In fact, a lot of breakthroughs in deep
22:25
learning come because people define a
22:27
very clever measure of discrepancy, and
22:29
then turns out it actually gives you all
22:31
sorts of interesting behavior. Okay?
22:33
That's why I use the word discrepancy as
22:34
opposed to the word error, because when
22:35
I say error, you might be just thinking
22:37
something like predicted minus actual.
22:39
That's too limiting.
22:42
Prediction minus actual is too limiting,
22:43
that's why I use the word discrepancy.
22:45
So, so we we basically define a function
22:48
that captures the discrepancy between
22:49
these the actual and the predicted
22:50
values, and these functions are called
22:53
loss functions in the deep learning
22:54
world.
22:55
And every paper that you read, you will
22:58
find interesting loss functions. There
23:00
are hundreds of loss functions, enormous
23:02
research creativity goes into defining
23:03
these loss functions. Okay?
23:05
All right. So, these are loss functions.
23:08
And so a loss function is a function
23:10
that quantifies a discrepancy. So, let's
23:12
say the predictions are really close to
23:14
the actual values, the loss would be
23:16
what?
23:19
It's close to zero. It's close to zero.
23:20
Close to zero. Right? Very small.
23:23
And if if you have a perfect model,
23:26
perfect crystal ball, what would the
23:27
loss be?
23:28
Exactly zero.
23:30
Right? Exactly zero. So, in linear
23:32
regression, we the loss function we use
23:35
is called sum of squared errors.
23:37
We didn't call it loss function because
23:39
we were not doing deep learning, just
23:40
linear regression, but that's basically
23:42
the loss function. Right? So,
23:45
the loss function we use must be
23:47
matched very properly with the kind of
23:49
output we have.
23:51
Right? So, if your output is a number
23:53
like 23, right? You're trying to predict
23:55
demand like a product demand for next
23:57
week for a particular product, and uh
24:00
predicted value is 23, the actual value
24:02
is 21,
24:03
it's okay to do 23 minus 21, two as a
24:05
discrepancy, right? The error. Okay? But
24:09
for other kinds of outputs, it's not so
24:11
obvious what the correct loss function
24:13
is, what the correct measure of
24:14
discrepancy is. And so here,
24:18
for the simple case of regression,
24:20
right? Um
24:21
the YI, the I here, by the way, is a
24:23
superscript which stands for the ith
24:26
data point. So, what
24:29
I'm saying is that okay, for the ith
24:31
data point, this is the actual value, Y,
24:33
and this is what the model predicted.
24:36
Okay? I take the difference, square it,
24:39
and once I square it for each point, I
24:41
just average all these numbers to get an
24:43
average squared error, i.e. mean squared
24:45
error, MSE. So, this is sort of like the
24:48
easiest loss function.
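Written out with the superscript-i convention just mentioned, where y^(i) is the actual value and f(x^(i)) is the model's prediction for the i-th of n data points:

```latex
\mathrm{MSE} \;=\; \frac{1}{n} \sum_{i=1}^{n} \bigl( y^{(i)} - f(x^{(i)}) \bigr)^2
```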
24:50
Okay?
24:52
Now, let's crank it up a notch.
24:55
In the heart disease example, the heart
24:57
disease the neural prediction model,
24:59
the prediction is a number between zero
25:01
and one, right? It's because it's coming
25:03
out of the sigmoid.
25:04
It's a fraction. The actual output is a
25:07
zero or one, one of the two, right? It's
25:09
binary.
25:11
So, how would we compare the
25:12
discrepancy? How would we measure the
25:14
discrepancy between a fraction and the
25:16
numbers zero and one? Right? What is the
25:18
good loss function in this situation?
25:21
Right? Is the key question. So, let's
25:22
build some intuition around this.
25:26
And let's see if my little daisy chain
25:28
iPad thing works.
25:31
I'm doing it on the iPad so that people
25:32
on the live stream can see it, otherwise
25:34
the blackboard is a little tough for
25:35
them.
25:37
Okay. So, let's have a situation here.
25:41
Okay? So, let's say that you
25:43
have a patient who comes in, and let's
25:45
say they have heart disease. Okay? So,
25:47
for that patient, Y equals one.
25:50
Right? The true value is one for that
25:51
patient. And now you have this model.
25:55
Okay? And this is the predicted
25:59
probability from this model.
26:04
Can people see my
26:05
handwriting okay?
26:07
Good.
26:08
I could never be a doctor, right? So.
26:11
So, zero, okay? One, it's going to be
26:13
between zero and one because it's
26:14
probability.
26:15
And then this is the loss we want to
26:17
sort of have, right? This is the loss.
26:19
So, this patient actually had
26:21
heart disease, Y equals one. So, let's
26:23
say that the predicted probability is
26:25
pretty close to one.
26:26
Okay? What do you think the loss should
26:28
be?
26:29
Small.
26:30
Close to zero.
26:32
Sorry?
26:34
Close to zero, exactly. So, here, if the
26:36
prediction comes here, you want the loss
26:38
to be somewhere
26:40
here.
26:42
But if the predicted probability is
26:44
pretty close to zero, even though the
26:45
patient actually has heart disease, what
26:47
do you want the loss to be?
26:49
Really high.
26:50
Because it's screwing up badly, right?
26:52
So, you want the loss to be somewhere
26:53
here.
26:55
So, basically you want a function that's
26:57
kind of like that.
27:00
Right? You want the loss function shape
27:02
to be like that.
27:04
High values of probability should have
27:05
low losses, low values of probability
27:07
should have high losses. Yeah.
27:08
I understand like why it has to be
27:10
increasing or decreasing, but can you
27:12
explain why it has to be Yeah, yeah. So,
27:14
it can be linear, it can certainly be
27:16
linear, but basically what you want to
27:18
do is the more it makes a mistake, the
27:21
more harshly you want to penalize it.
27:23
Right? So, basically what you're what
27:25
what you really want is something where
27:27
if it basically says this person's
27:29
probability is say uh the probability
27:31
the predicted probability is say one
27:33
over a million,
27:34
basically close to zero, you want the
27:35
loss to be like super high.
27:37
So that the model is like it's like a
27:39
huge rap on the knuckles for the model.
27:41
Don't do that.
27:42
That's basically what we're doing, and
27:43
I'm sort of demonstrating that dynamic
27:45
by using a very curved and steep loss
27:47
function.
27:49
But you can absolutely use a linear
27:50
function, it's totally fine. It won't be
27:52
as effective for gradient descent later
27:54
on with a bunch of bunch of technical
27:56
details.
27:57
Are we good with this?
27:59
All right. So, now let's look at the
28:01
case where a patient does not have heart
28:03
disease.
28:05
Y equals zero.
28:06
Same setup, okay?
28:09
Predicted probability,
28:11
zero, one, loss.
28:15
So, for this patient,
28:18
they don't have
28:20
heart disease. If the
28:22
probability is close to zero, what
28:24
should the loss be?
28:26
Close to zero. It should be somewhere
28:27
here, right?
28:28
And the more and more the probability
28:31
gets closer and closer to one, you want
28:32
to penalize it very heavily, which means
28:34
you want the loss to be somewhere here.
28:36
So, you basically want a loss ideally
28:37
that's kind of going up like that and
28:39
climbing higher and higher.
28:42
Are we good?
28:43
Okay, perfect.
28:44
Because we have a perfect loss function
28:46
for that.
28:48
So, just a recap.
28:51
Right? This is what we want.
28:53
For points with Y equals
28:54
one, lower predictions should
28:56
have higher loss. You want something
28:58
like that. And then turns out
29:02
there's a very simple little loss
29:03
function
29:04
which literally just uses the
29:05
logarithm, which will get the job done.
29:07
So, what you do is you literally do
29:09
minus log of the predicted probability.
29:13
That's it. And that thing it has exactly
29:15
that shape.
29:16
Okay? And in fact, you can see it
29:17
numerically. So, if the probability is one,
29:20
the loss is zero. If it's half, it's 1.0. And
29:22
if it's like one over 1,000, it's almost
29:24
10. If it's one over 10,000, it's going
29:26
to be like
29:27
much higher, right? Very high losses.
29:30
Okay? So, minus log probability, boom,
29:32
done.
29:34
Similarly, this is what we want for
29:36
patients for whom Y equals zero.
29:38
And turns out if you do minus log one
29:42
minus predicted probability, it does the
29:44
same thing.
29:47
Okay?
29:50
Mathematicians once again saved with a
29:52
logarithm.
29:54
So, see in summary
29:56
this is what we have.
29:58
Right? For data points where y equals 1,
30:00
we have this. Data points where y equals
30:01
0, we have this. But, it feels a little
30:03
inelegant
30:05
to say, "Well, if it's y equals 1, I
30:07
want to use this. If y equals 0, I want
30:08
to use that."
30:09
Right? There's There's like an if-then
30:11
thing going on here. And I don't know
30:12
about you folks, but if-then really irks
30:14
me
30:15
mathematically because you can't do
30:17
derivatives and so on very easily.
30:19
Okay?
30:20
But, no worries. This is MIT. We know we
30:22
have our bag of math tricks.
30:24
So, what we do is
30:26
we can actually combine them both into a
30:28
single expression.
30:30
Okay? Like this.
30:32
Okay? And here the yi again is the ith
30:35
data point. Remember, yi is either 1 or
30:37
0 always.
30:38
And this model of xi is the predicted
30:40
probability. Okay? So,
30:43
and I've just taken the minus from the
30:45
log and just moved it here.
30:48
Okay? And I've taken the minus that
30:50
was here and just moved it here. Okay?
30:52
That's why you see it like this.
30:54
So, this one is basically
30:57
you can convince yourself what's
30:58
happening. This single expression will get
30:59
the job done. So, let's say there is a
31:01
patient for whom y equals 1.
31:04
What's going to happen is that when you
31:05
plug in y equals 1, this becomes 0. The
31:07
whole thing will collapse to 0.
31:10
While here, y equals 1 just means it
31:12
becomes minus log probability, which is
31:14
what we want.
31:17
Conversely, if y equals 0, this whole
31:20
thing is going to disappear.
31:22
And this thing becomes 1 minus 0, which
31:23
is just 1. And so, it becomes minus log
31:25
1 minus probability, which is again what
31:27
we want.
31:29
Simple and neat, right?
31:32
So, in one expression, we have defined
31:34
the perfect loss. No if-thens, none of
31:36
that crap.
31:39
Good. So, now what we do is that was
31:42
true for every data point.
31:44
But, we obviously have lots of data
31:45
points. So, we just add them all up and
31:47
take the average.
31:50
That's it. We average across all the
31:51
data points we have. So, that we get an
31:53
average loss.
31:55
Okay?
31:57
We call this the binary cross entropy
31:58
loss function.
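Written out over n data points, with y^(i) either 0 or 1 and f(x^(i)) the predicted probability, the averaged loss is:

```latex
\mathrm{BCE} \;=\; -\frac{1}{n} \sum_{i=1}^{n}
\Bigl[\, y^{(i)} \log f(x^{(i)}) \;+\; \bigl(1 - y^{(i)}\bigr) \log \bigl(1 - f(x^{(i)})\bigr) \Bigr]
```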
32:06
Is there a way you can um edit the loss
32:08
function so that you penalize like false
32:11
negatives more strongly than false positives?
32:13
>> you can do all of them. Great question.
32:15
Uh I'm just looking at the basic case
32:17
where it's a symmetric
32:19
loss. Um you can actually penalize
32:21
overestimates much more than
32:23
underestimates and things like that.
32:25
Um and if you're curious, you can just
32:26
Google something called the pinball
32:28
loss.
32:31
Okay?
32:32
Any other questions on this?
32:34
So, when you see this massive deep
32:36
neural network built by Google for doing
32:38
something or the other, if it's a binary
32:39
classification problem, chances are
32:41
they're using this thing.
32:44
Okay?
32:45
All right.
32:45
So, now let's figure out how to minimize
32:48
these loss functions because the name of
32:49
the game
32:50
is to find a way to minimize these loss
32:52
functions. So, now loss functions are
32:54
just a particular kind of function. So,
32:56
we'll first consider the general problem
32:59
of minimizing some arbitrary function.
33:02
Okay?
33:02
And once we develop a little bit of
33:03
intuition about that, we'll return to
33:05
the specific task of minimizing loss
33:07
functions.
33:12
How's everyone doing?
33:15
Yes, no, good, bad?
33:18
You have a bit of a
33:20
like a tough-to-interpret head shake.
33:23
It's more like um I kind of lost you
33:24
where you said that the loss function
33:26
and the predicted probability
33:28
uh how were they inversely because my
33:30
understanding was that the loss function
33:31
is supposed to be the sum of errors.
33:33
We're averaging the errors. And when you
33:35
said the heart patient
33:36
>> Sorry, sorry. Let me Let me just stop
33:37
there for a second.
33:38
For each point, you define the loss.
33:41
That's the whole point of the game. And
33:42
once you define it, you calculate for
33:44
every point and average it, right? So,
33:46
just focus on a single data point.
33:49
And so, now continue.
33:50
So, now when the heart patient has There
33:53
is more probability that they No. So,
33:56
when there is a person who has the heart
33:58
uh disease, you said that you want the
34:00
loss function to be high.
34:02
I think I'm going back to the graph.
34:03
>> You want the loss function to be high if
34:06
I'm predicting that they basically don't
34:08
have heart disease.
34:09
If the prediction is close to 0,
34:12
the predicted probability is close to 0,
34:13
then I'm badly wrong.
34:16
Because in reality, they do have heart
34:18
disease.
34:19
And that's why I want the loss to be
34:21
really high. Okay, so effectively, loss
34:23
is my way of finding out how good my
34:25
model is instead of saying, "Okay." Or
34:28
rather, how bad your model is. Yeah.
34:31
Right? How bad is it? That's really what
34:33
the loss function is. Got it.
34:34
>> And you want to minimize badness.
34:37
That's the whole point of optimization.
34:39
Okay.
34:41
Um, I guess, similar to the point I
34:43
raised before, I
34:45
don't have a fully clear intuition of
34:46
why exactly a log function rather than
34:48
something that's, say,
34:50
flatter for small and then really steep
34:53
later. Those are all fantastic things.
34:55
You can totally do it. Uh the reason we
34:57
picked the loss this function because A,
35:00
it's easy to work with. It has good
35:02
gradients. It's well-behaved
35:04
mathematically. But, there are many
35:06
alternatives to it. I don't want you to
35:07
think that this is like the only game in
35:09
town or it's the only choice for us. We
35:11
have many choices. This is really This
35:13
happens to be a very easy choice, which
35:15
also happens to be empirically very
35:17
effective.
35:18
And I'm happy to give you pointers to
35:20
other crazy loss functions, right? Which
35:22
can actually do all these things, too.
35:26
Okay?
35:30
All right. So, uh minimizing a single
35:32
variable function, we will warm up by
35:34
looking at this little function here.
35:36
Okay? Which is a
35:38
What do you call a fourth power?
35:41
What? Quartic, right? Yeah, thank you.
35:43
Quartic. So, yeah, it's a quartic
35:45
function. Um
35:47
right? And this is how it looks.
35:50
But, you can see there is like a minimum
35:51
somewhere here, right? Between like
35:53
minus one and minus two. Like maybe
35:54
minus 1.5. Okay?
35:56
So, we want to minimize this function.
35:58
It's obviously a toy function, little
36:00
function with one variable.
36:02
But, the intuition we use here is going
36:03
to be exactly what we use for GPT-4.
36:06
So, pay attention.
36:08
So, how can we go about minimizing this
36:09
function?
36:11
What will we do?
36:15
Yeah.
36:16
Take the derivative and set it equal to
36:18
zero. You take the derivative. Exactly.
36:20
So, you take the derivative, right?
36:22
Um so, when you So, let's look at what
36:23
the derivative does for us.
36:25
But, then
36:26
the second part of what was said
36:30
Yeah. Second part of what was said was set
36:31
it to zero. Setting it to zero becomes
36:33
problematic
36:35
when you have very complicated
36:37
functions. It's not clear at all what's
36:38
going to make them zero, right?
36:39
Unfortunately. But, the idea of taking
36:41
the derivative is in fact the right
36:42
idea.
36:43
So, we can go about this. We can
36:45
calculate the derivative. And that
36:46
is actually the derivative.
36:47
You can convince yourself.
36:49
And if you plot the derivative, it looks
36:50
like that.
36:53
And as you would hope, wherever the
36:55
minimum is, in fact, the derivative is
36:56
crossing
36:58
right? The derivative is zero here. It's
36:59
crossing the x-axis.
37:01
Right? In this case, you can actually do
37:02
that.
37:03
So, let's say you have the derivative.
37:04
How can you use it?
37:06
Like, what is the value of a derivative?
37:08
What does it tell you?
37:09
Yeah.
37:11
You use a gradient descent algorithm.
37:13
You are 10 steps ahead of me, my friend.
37:16
I just want the basic answer.
37:18
Like, what what what what good is a
37:19
derivative? What Like, what does it tell
37:21
you? When you calculate the derivative
37:22
of something at a particular point
37:23
>> It tells you the rate of change of the function
37:25
at the place you are. Correct. Exactly
37:27
right. So, here, what the derivative
37:29
tells us is that the slope tells
37:32
us the change in the function for a very
37:34
small increase in w, right?
37:36
And this is high school calculus. I'm
37:38
just doing a quick refresher.
37:41
So, what that means is that
37:45
if the derivative is positive,
37:47
what that means is that increasing w
37:49
slightly will increase the function.
37:52
So, if if you're here,
37:53
you calculate the derivative, the slope
37:55
is positive. It means that if you go
37:56
slightly in this direction, the function
37:57
is going to get higher.
37:58
Right?
38:00
Similarly, if it's negative,
38:02
let's say here, you calculate the
38:03
derivative, it's the the slope is like
38:05
this. It's negative, which means that if
38:06
you increase w, if you go in this
38:08
direction, it's going to decrease the
38:10
function.
38:12
Okay?
38:13
All right.
38:15
And if it's kind of close to zero,
38:17
it means that changing w slightly won't
38:19
change anything.
38:22
So, if you're here, changing it slightly
38:24
won't change anything.
38:25
All right?
38:26
That's it.
38:28
So,
38:29
So, what we do is this immediately
38:31
suggests an algorithm for minimizing gw,
38:35
which is let's start with some random
38:37
point w.
38:38
And then,
38:39
let's calculate the derivative at that
38:40
point.
38:41
And once we do that,
38:42
there are three possibilities.
38:45
It could be positive, negative, or kind
38:46
of close to zero.
38:48
And if it's positive, we know that
38:49
increasing w will increase the function.
38:52
But, we want to decrease the function.
38:53
We want to minimize it.
38:55
Which means that we should not be
38:56
increasing w. We should be doing what
38:58
here?
39:00
Decrease.
39:01
Yes. And similarly, if it's negative,
39:03
what should we do here? Increase.
39:07
Exactly. So, in the first case, you
39:09
reduce w slightly. In the second case,
39:11
you increase w slightly. And if the
39:13
thing is close to zero, you just stop
39:14
because there's nothing else you can do.
39:17
Okay?
39:21
This is the basic intuition behind how
39:23
GPT-4 was built.
39:26
Which is kind of shocking if you think
39:28
about it.
39:29
Right? Which means that all the the
39:31
heavy-duty optimization stuff that
39:32
people have figured out over the decades
39:35
is kind of not used.
39:37
Right? This algorithm is what's being
39:39
used with some, you know, flavors on top
39:41
of it.
39:42
So, yeah. So, back to this
39:44
uh and you you do that and then if
39:46
you've sort of run out of time or
39:48
compute
39:49
or right, if you run out of time and so
39:52
on, just stop.
39:54
Otherwise, just go back to step one and
39:55
try again. Of course, if it's close to
39:56
zero, you got to stop anyway.
40:00
Yeah.
40:02
Is there the um concern of a potentially
40:05
local minimum there? It's coming.
40:10
Okay? So, that's the function. It's
40:11
going to find
40:12
you some point where the derivative is
40:13
kind of close to zero. Okay?
40:16
So,
40:17
this is called gradient descent. Right?
40:19
This is gradient descent, this little
40:21
algorithm.
40:23
And this
40:26
very PowerPoint-y MBA table can be
40:29
collapsed into this little expression.
40:32
Basically says,
40:34
calculate the derivative,
40:35
multiply it by a small number which we'll
40:36
get to in a second,
40:38
and then change the old W to the new W
40:41
is the old W minus a little number times
40:44
gradient.
40:45
So, this little one-line formula is
40:47
basically gradient descent.
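Written out, with alpha standing for that small number (the learning rate, introduced just below), the one-line update is:

```latex
w_{\text{new}} \;=\; w_{\text{old}} \;-\; \alpha \left.\frac{dG}{dw}\right|_{w = w_{\text{old}}}
```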
40:50
Okay?
40:51
And what you should do, just to build
40:54
your intuition, is to make sure that
40:56
these three possibilities here map
40:58
nicely to this. Like this thing will
41:00
actually capture these three
41:01
possibilities.
41:03
This is when gradient descent was
41:04
invented.
41:07
It has some historical fun, right?
41:13
The 19th century?
41:15
19th century. Yeah, okay. Good. Very
41:17
good. Excellent guess.
41:20
1847.
41:22
It was uh invented uh in 1847 by Cauchy,
41:25
the great mathematician. And in fact, if
41:27
you're curious, you can check out the
41:29
paper.
41:30
I've given you the paper
41:32
here for handy reference.
41:36
So, 1847.
41:38
So, GPT-4 is built using an algorithm
41:40
invented in 1847.
41:44
Which I find like astonishing, frankly.
41:47
That this little thing is so capable.
41:51
Okay.
41:52
So, that's gradient descent. And this
41:54
little number alpha
41:56
is called the learning rate. And it's
41:58
our way of sort of essentially
41:59
quantifying the idea of let's not
42:02
increase or decrease W massively, let's
42:04
do it slightly.
42:06
Because the gradient is only valid for
42:08
small movements around your point. If
42:11
you take a big step, all bets are off.
42:14
So, this alpha tells you how how small a
42:17
step should you take.
42:20
Okay?
42:20
And in typically, it's set to very small
42:23
values like, you know, 0.1, 0.001, and
42:25
so on and so forth. And in fact, if you
42:27
read any deep learning academic papers
42:30
where they have trained like a big model
42:31
to do something,
42:32
right? A lot of researchers will very
42:34
quickly go to the appendix where they
42:36
have described exactly what learning
42:37
rates were used.
42:39
Because sort of the learning rate is
42:40
like part of the IP for how it's built.
42:44
A lot of trial and error that goes into
42:45
these learning rates.
42:47
Okay. So, that is gradient descent.
42:50
Um so, if we apply this algorithm to GW,
42:53
our original function,
42:55
right? We just keep on doing this thing
42:56
a few times.
42:58
Right? What you will find is that if
43:00
let's say
43:01
the
43:02
point we randomly pick is 2.5, we
43:05
set the alpha to one, we run this
43:07
algorithm, it starts here, then it goes
43:09
there, it goes there, bup bup bup bup
43:11
bup, and then finally ends up here.
43:12
In like four or five iterations, it
43:14
finds some minimum.
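A minimal sketch of that loop in code follows. The exact quartic from the slide isn't reproduced here, so the function, start point, and learning rate below are illustrative stand-ins, not the lecture's values.

```python
# Hedged sketch of gradient descent on an arbitrary quartic (not the lecture's exact function).
def g(w):
    return w**4 + 2 * w**3 - 3 * w        # illustrative quartic with a single minimum

def dg(w):
    return 4 * w**3 + 6 * w**2 - 3        # its derivative

w, alpha = 2.5, 0.01                      # assumed start point and learning rate
for _ in range(200):
    grad = dg(w)
    if abs(grad) < 1e-6:                  # derivative close to zero: stop
        break
    w = w - alpha * grad                  # the one-line gradient-descent update
print(w, g(w))                            # w ends up near the function's minimum
```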
43:16
This is obviously a very simple,
43:17
well-behaved, nice little function, so
43:19
you can easily optimize it.
43:22
Okay? If you want, you can just go to
43:23
this thing. There's a nice animation of
43:25
this thing as well.
43:28
Okay. So, now
43:30
All right. Before we actually go to the
43:31
multi-variable function, I want to go to
43:33
the question that you posed about local
43:35
minima.
43:36
Um actually, you know what? I think I
43:37
may have some slides on it. So, sorry.
43:38
I'll come back to this.
43:40
So, let's actually see. You know,
43:41
we looked at a toy example where
43:43
there was only one variable. What if you
43:45
have
43:46
uh what if it was GPT-3? GPT-3 has 175
43:49
billion parameters.
43:51
175 billion and GPT-4, they haven't
43:53
published it, so we don't know. It's
43:55
supposed to be eight times as much.
43:57
Okay? So, I mean, the number of
43:59
parameters is massive. So, basically,
44:02
our loss function has
44:04
billions of variables, billions of Ws
44:07
that we need to optimize over, minimize
44:10
over. So, we need to use this notion of
44:12
a partial derivative. So, let's take
44:14
baby steps and say, okay, what if you
44:16
have a two-variable function, right?
44:18
Something like this, very simple. So,
44:20
what we can do is we can calculate the
44:21
partial derivative of G with respect to
44:23
each of these Ws.
44:26
And the partial derivative, just to
44:27
quickly refresh your memories,
44:29
is you take a function, you pretend that
44:32
everything other than W is a constant.
44:36
Then the function becomes a
44:38
a function of just one variable W, W1.
44:40
And then you just differentiate it like
44:41
you do everything else. And you you get
44:43
you get something, and that is
44:46
this thing here.
44:48
And then you do the same thing for W2,
44:50
you get this thing here, and then you
44:51
just stack them up in a nice list.
44:54
Okay?
44:55
This is the vector of partial
44:56
derivatives.
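For the two-variable example, that stacked-up list is:

```latex
\nabla G(w_1, w_2) \;=\;
\begin{bmatrix}
\dfrac{\partial G}{\partial w_1} \\[1.5ex]
\dfrac{\partial G}{\partial w_2}
\end{bmatrix}
```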
44:58
So, how should we interpret this? The
44:59
same way as before. Basically, for a
45:01
small change in W1, keeping W2 and
45:04
everything else fixed, how does the
45:06
function change if you change just W1
45:08
slightly? And similarly for W2 and all
45:11
the way to W175 billion.
45:14
Same thing. Okay?
45:15
So, um
45:17
now, when you have these functions with
45:19
many variables, many Ws,
45:22
uh since we have a gradient for each one
45:24
of those Ws, we stack them up into a
45:26
nice vector
45:28
of derivatives, and this vector is
45:30
called the gradient.
45:32
And it's denoted
45:33
using
45:35
this uh Anyone know what the symbol is
45:37
called?
45:38
nabla
45:40
Yeah?
45:41
Laplacian
45:43
Maybe. Maybe that's a synonym. But the
45:45
one I'm familiar with is nabla.
45:48
It's like an upside-down
45:50
delta, but I think the upside-down
45:52
triangle is called nabla if I
45:53
recall. Am I right?
45:55
Thank you.
45:58
He's my go-to.
46:02
So, yeah. So, the gradient, um we just
46:04
call it the gradient, and it's written
46:06
as this.
46:08
All right. So, what we do is we simply
46:10
do gradient descent on every one of the
46:12
Ws
46:13
using its partial derivative.
46:16
Okay? So, in a in a gradient step, we
46:19
update W1 using this formula, W2 using
46:21
this formula.
46:23
Finished.
46:25
We've just generalized gradient descent
46:27
to an arbitrary number of variables.
46:30
So, and of course, as before, this can
46:32
be summarized compactly as this vector
46:35
formula.
46:36
Let me just do this.
46:43
So, what's going on here is that
46:46
I have
46:47
W1
46:50
old W1 minus alpha
46:52
times
46:53
the function G
46:55
of W1, then W2
46:59
W2 minus alpha
47:02
G by W2. And then all we're doing is
47:04
we're just stacking them up into a
47:06
vector
47:08
like that.
47:15
minus alpha, and this vector
47:21
like that.
47:27
So, this can be written as just this
47:28
vector W, the new vector
47:31
old vector minus alpha
47:34
and the gradient. Finished.
47:37
And you can see if it is, you know,
47:39
GPT-3,
47:40
this vector is going to be 175 billion
47:42
long.
47:44
Okay? But whether it's two or 175
47:46
billion, who cares? It's the same thing,
47:47
right?
47:50
Okay.
47:52
So, yeah. So, that's what we have here.
47:54
I'm really thrilled by the way this
47:55
whole iPad business is working out.
47:58
I was a little worried about it. Okay.
48:00
Um so, if you look at two dimensions,
48:02
this function, and if you actually
48:02
plot the function, this is
48:04
the first W, this is the second W, and
48:06
this axis is actually the loss
48:09
function. That's the function GW. And
48:11
so, you're trying to find the minimum
48:13
here, and so this is how the gradient
48:14
descent will go, step by step. It will
48:16
progress if you're starting from this
48:17
point.
48:18
Or you can also sort of look at it from
48:20
up top down into the function, and
48:22
that's what this picture is, and it
48:23
shows gradient descent starting from
48:24
there and working its way down
48:27
um from here all the way to the center.
48:30
Okay. So,
48:32
All right. Local minima. So, now
48:35
gradient descent will just stop
48:38
near uh hopefully a minimum,
48:41
right? But the problem is it may not be
48:43
a global minimum. It may It may not even
48:45
be a minimum.
48:47
So, um
48:48
so, let's see what what I'm talking
48:49
about here.
48:51
Here are some possibilities.
48:53
So, let's take a simple function.
48:57
Okay? Let's say this is GW.
48:59
This is W. And turns out this function
49:02
actually looks like this.
49:12
Okay?
49:13
So, you can see here
49:17
Well,
49:19
um this point
49:23
this point here
49:24
is a local minimum.
49:27
This is a local minimum.
49:29
It's a local minimum.
49:30
These are all
49:32
lots of local minima here.
49:34
Okay? And yeah, there's a lot of local
49:37
minima here, too.
49:39
So, these are all places in which the
49:41
derivative is going to be zero.
49:43
So, if you run gradient descent and it
49:46
stops because the gradient has reached
49:48
zero,
49:49
you could be in any of these places.
49:52
Right? So, there's no guarantee. So,
49:54
this in this picture happens to be
49:57
maybe the global minimum because it's
49:59
the lowest of the lot.
50:01
Right?
50:02
But, there's no guarantee you're
50:02
actually going to get there.
50:04
Okay, there's not even a guarantee
50:06
you're going to be in any of these
50:07
places because you could literally be in
50:09
this thing here
50:10
where it's sort of taking a break and
50:12
then continuing on down.
50:14
That, by the way, is called a you know,
50:15
a saddle point. I drew it badly, but
50:17
this sort of coming in sort of taking a
50:19
break and going down again is called a
50:21
saddle point. So, gradient descent can
50:23
stop at a saddle point. It can stop at
50:25
some minima. There's no guarantee it's
50:27
going to be global.
50:28
Okay?
50:33
But, it turns out it has not mattered.
50:37
So, it has not mattered. And there are a
50:39
whole bunch of reasons why it has not
50:41
mattered because when you have these
50:42
very complicated neural networks,
50:44
they're very complex functions. Even
50:46
finding a decent solution, right, to
50:49
these complicated networks is actually
50:50
really good for solving the problem.
50:52
You don't have to go to the best best
50:54
possible solution. And in fact, if you
50:57
go to the best possible solution, you
50:58
actually run the risk of overfitting.
51:02
So, that's one reason. The other
51:03
interesting reason and by the way, this
51:05
is a very hot area of research to figure
51:08
out exactly why.
51:09
So, it's sort of like this. Empirically,
51:11
what we have seen is that not worrying
51:12
about local minima, global minima, all
51:13
that stuff has not hurt us because these
51:16
things are amazing.
51:18
GPT-4, probably they just stopped
51:20
somewhere. It probably wasn't even
51:21
a local minimum. They're like, "All
51:22
right, it's been running for 6
51:24
days. We've spent 2 million dollars.
51:25
Let's stop."
51:27
Right? Because these are very expensive.
51:29
But that's still so magical.
51:31
You don't need to get anywhere close to
51:33
a local minimum. But, there's another
51:34
interesting point which
51:36
I read about.
51:37
People basically hypothesize that
51:40
for you to be at a local minimum, just
51:43
think about what it means. It means that
51:45
you're standing at a particular point, and
51:47
in every direction that you look,
51:49
things are just sloping upward.
51:51
Right?
51:52
Everything is sloping upward. Only if
51:54
everything is sloping upward all around
51:56
you, could you be at a local minimum
51:58
by definition. But, if you have a
52:00
billion dimensions,
52:02
what are the odds that you're going to
52:04
be standing at a point where every one
52:06
of those billion dimensions is going
52:07
upward?
52:08
The odds are really low.
52:10
Chances are some of them are going
52:11
up, some of them are going down,
52:13
others are sort of coming down and going
52:14
another way. It's going to be crazy.
52:16
So, in some sense, the best you can hope
52:18
for in these very high-dimensional
52:20
situations is probably a saddle point.
52:23
And it turns out it's good enough.
52:25
So, for those reasons, we are content
52:29
with just running gradient descent with
52:30
some tweaks which I'll get to in a
52:31
second. Um and it just performs really
52:34
admirably.
52:36
Um how does alpha depend on like how
52:39
much compute you have? Like, would you
52:41
set the learning rate based on that or
52:44
not really?
52:45
>> No, the learning rate is really
52:47
a measure of... It's sort of like this.
52:50
When you're at a point where you think
52:52
that the gradient is looking nice, and
52:54
if you take a step in that
52:55
direction it's going to go down. And if
52:57
you further believe that it's going to
53:00
keep going down in the direction for a
53:01
while,
53:02
then you're very confident about taking
53:04
a big step.
53:06
But, if you're like, "I don't know,
53:07
because maybe I take a little step,
53:09
maybe I have to go this way. I can't go
53:10
straight anymore." Then you don't want
53:12
to take a big step because then you have
53:13
to backtrack.
53:14
So, those kinds of considerations go
53:16
into the learning rate. Um and so,
53:19
that's sort of the rough answer to your
53:20
question. It's not so much determined by
53:23
compute and bandwidth and things like
53:24
that.
53:25
But, again, it's sort of a
53:27
complicated thing because sometimes with
53:29
a given amount of compute, if
53:31
you have a particular kind of data, you
53:33
can have very aggressive learning rates.
53:35
So, it tends to be a bit sort of, you
53:37
know, jumbled up and complicated. But
53:39
that's sort of the quick surface
53:40
level idea of what's going on.
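As a rough illustration of the trade-off being described, here is a tiny sketch, assuming a made-up one-weight function g(w) = w^2, of how the choice of alpha changes the behavior of the update (not from the lecture):

```python
# Illustration only (made-up function): how the learning rate alpha changes
# gradient descent on g(w) = w^2, whose gradient is 2w.
def descend(alpha, steps=20, w=5.0):
    for _ in range(steps):
        w = w - alpha * 2 * w   # one gradient step on g(w) = w^2
    return w

print(descend(alpha=0.05))   # cautious steps: creeps slowly toward 0
print(descend(alpha=0.45))   # bigger steps: gets to 0 much faster
print(descend(alpha=1.10))   # too aggressive: overshoots and diverges
```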
53:43
Um okay.
53:47
9:31.
53:50
Anyway, folks, this lecture is like
53:52
probably one of the driest in the
53:54
semester because I have to go
53:55
through all the concepts. Um once we
53:57
start doing Colabs, you know, things
53:59
get a lot more lively.
54:00
Okay.
54:01
Um all right. So, now let's talk about
54:04
minimizing a loss function with gradient
54:05
descent. So, here is our little binary
54:08
cross entropy loss function that we saw
54:09
from before. Right? This is what we want
54:11
to minimize. So, if you look at this
54:13
thing,
54:14
where are the variables we need to
54:16
change to minimize this function?
54:19
Folks, don't look at your phones.
54:21
I'm okay with laptop and iPad use, but don't
54:23
look at your phones.
54:27
Sorry, we've kind of abstracted um the
54:30
variables W, but just to bring it back,
54:33
those are actually the weights in the
54:35
neural networks, right? Yeah, the
54:36
weights and the biases. I'm just calling
54:38
them weights. So, the output of these
54:42
uh minimization functions are going to
54:45
be the actual weights in your model,
54:47
right?
54:47
>> Exactly. Exactly right.
54:49
The whole name of the game is to find
54:51
the weights.
54:52
And so, for example, when you see in the
54:53
press that uh Meta has essentially um
54:57
made the weights of Llama 2 or something
55:00
available, that's basically what they've
55:01
done.
55:02
They basically published the weights.
55:04
The reason that's so valuable is
55:06
>> Microphone, please. Go.
55:07
Cuz if you have a billion parameters,
55:09
the compute time on that is horrendous
55:11
and expensive. That's why the
55:13
weights are so valuable.
55:14
>> Correct. The weights are the crown jewel
55:16
because they are the result of a lot of
55:18
money and time and smartness being
55:19
spent.
55:21
There is a separate question of why are
55:23
they making it open source,
55:25
which
55:26
I'm happy to chat about offline.
55:28
All right, cool. So, what are the
55:29
variables we need to change to
55:30
minimize? It's basically the parameters
55:32
and they're hiding inside the model
55:34
term.
55:36
Right? Because what is the model? The
55:38
model is some function like that, right?
55:41
If you look at the simple GPA and
55:42
experience thing we looked at on
55:44
Monday, we finally figured out that the
55:46
actual thing that comes out here is
55:48
going to be this complicated function of
55:50
all the X's and the W's and so on and so
55:52
forth, right? And that complicated thing
55:54
is showing up inside this thing.
55:57
So,
55:58
you know, and the W's here are the
56:00
variables we need to change to
56:02
minimize the loss function. And it's
56:05
important for you to note and
56:06
understand that the values of X and Y
56:10
and so on are just data.
56:13
You're not optimizing anything there.
56:14
They're just data.
56:15
What you're optimizing is the W's.
56:17
The weights.
56:22
Okay. So, imagine replacing the model
56:26
here with the mathematical expression
56:27
above wherever it appears in the loss
56:29
function. And once you do that, your
56:31
loss function is just a good old
56:33
function of the W's.
56:35
The fact that it's a loss function is
56:37
kind of irrelevant.
56:39
It's just a function.
56:41
And since it's just a good old function
56:42
of the W's, you can apply gradient
56:43
descent to it as we normally would.
56:45
It's no big deal.
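As a small sketch of that point, here is the binary cross-entropy written out for an assumed one-feature sigmoid model with made-up data (not the Cleveland data); once the model is substituted in, the loss is just a plain function of the weights w and b, while X and y stay fixed:

```python
# Sketch with an assumed one-feature sigmoid model and made-up data:
# substituting the model into the binary cross-entropy makes the loss a
# plain function of the weights.
import numpy as np

X = np.array([0.5, 1.5, 2.5, 3.5])   # data: fixed, not optimized
y = np.array([0.0, 0.0, 1.0, 1.0])   # labels: fixed, not optimized

def loss(w, b):
    p = 1.0 / (1.0 + np.exp(-(w * X + b)))                    # the "model"
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))  # binary cross-entropy

print(loss(w=0.0, b=0.0))    # about 0.693
print(loss(w=2.0, b=-4.0))   # lower: these weights fit the made-up data better
```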
56:49
Which brings us to something called
56:50
backpropagation.
56:52
Um
56:56
Um if you remember nothing else about
56:57
backpropagation, just remember this.
56:59
Never use the word backpropagation
57:01
again. Only use the word backprop.
57:04
You're
57:05
hip and cool to the deep learning
57:06
community.
57:07
Backprop.
57:09
Okay. All right. So, what is backprop?
57:12
Backprop is a very efficient way to
57:14
compute the gradient of the loss
57:16
function.
57:17
So, when you have this loss function,
57:19
and let's say you have a billion W's
57:21
and you have 10 million data points. So,
57:24
the little n we saw was 10 million.
57:27
That is a lot of computation.
57:30
And that is just for one step of
57:32
gradient descent.
57:34
Right? So, backprop is a very
57:37
efficient and clever way to compute the
57:39
gradient of the loss function, which
57:41
takes advantage of the fact that what we
57:44
have here is not some arbitrary model.
57:47
It's a model that came from a particular
57:49
kind of neural network, which has layers
57:51
one after the other, and then there was
57:53
an output at the very end.
57:55
So, what backprop does is
57:57
it organizes the computation in the form
57:59
of something called a computational
58:00
graph, and the book has a good
58:01
discussion about it. And so, what we do
58:03
is we start at the very end.
58:05
We calculate the gradient of the loss
58:08
with respect to the output.
58:10
Then we move left. We calculate the
58:12
gradient of that output with respect to
58:13
the output of just the prior hidden
58:15
layer.
58:17
Step to the left. Calculate the gradient
58:19
of the current thing with respect to the
58:20
previous layer. You get the idea, right?
58:22
It's iterative and it moves backwards,
58:25
and by doing so, you never repeat the
58:27
same computation twice wastefully.
58:30
That's the big advantage. You calculate
58:32
once and reuse it many many many many
58:34
times.
58:35
The second advantage is that if you
58:37
organize it this way, it just becomes a
58:39
sequence of matrix multiplications.
58:42
Okay.
58:42
And
58:45
it's a sequence of matrix
58:46
multiplications, and it eliminates redundant
58:48
calculations. And best of all,
58:51
there are these things called GPUs,
58:53
graphics processing units, originally
58:54
invented to accelerate video game
58:56
rendering.
58:57
Uh and as it turns out, to accelerate
58:58
video game rendering, the core math
59:00
operation you do is basically a matrix
59:02
multiplication. Right? Some linear
59:03
algebra uh
59:05
sort of operations. And so, someone
59:07
at some point had the bright idea:
59:09
for deep learning, calculating gradients
59:11
and so on, we need to do matrix
59:13
multiplications, and here is some
59:14
specialized hardware that does
59:17
a fast job of matrix
59:19
multiplications. Can we use
59:20
this for that?
59:22
And they did it. And all hell broke
59:24
loose.
59:26
That's literally what happened.
59:28
And that's why Nvidia is valued at what,
59:30
1.5 trillion or something.
59:32
So, yeah. So, they are really good. And
59:35
so, backprop
59:37
the way you do backprop plus using it on
59:40
GPUs leads to fast calculation of loss
59:42
function gradients.
59:44
If this thing were not true, this class
59:47
would not exist.
59:49
Because there won't be any deep learning
59:50
revolution.
59:52
This is a fundamental seminal reason.
59:57
All right. So, the book has a bunch of
59:59
detail
1:00:00
um
1:00:01
and I actually hand
1:00:05
worked out an example
1:00:07
of calculating a gradient like the
1:00:09
old-fashioned way and calculating it
1:00:11
using backprop.
1:00:13
So, take a look at it. I'll post it on
1:00:14
Canvas and you will understand exactly
1:00:17
where the savings come from, where the
1:00:18
efficiency gains come from. Okay?
1:00:21
Because of time, I'm not going to get
1:00:22
into it now.
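In the meantime, here is a rough sketch of the idea, using a made-up two-weight network (this is not the hand-worked example posted to Canvas): the forward pass stores intermediate values, and the backward pass starts at the loss and reuses them as it moves left:

```python
# Rough sketch (a made-up two-weight network, not the Canvas hand-out):
# x -> h = relu(w1 * x) -> p = sigmoid(w2 * h), loss = binary cross-entropy.
import numpy as np

x, y = 2.0, 1.0
w1, w2 = 0.5, -0.3

# Forward pass: compute and keep the intermediate values.
h = max(0.0, w1 * x)                  # hidden activation (ReLU)
p = 1.0 / (1.0 + np.exp(-w2 * h))     # prediction (sigmoid)
loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Backward pass: start at the output and move left, reusing what was stored.
dloss_dz = p - y                      # gradient w.r.t. the pre-sigmoid value
dloss_dw2 = dloss_dz * h              # gradient for w2
dloss_dh = dloss_dz * w2              # pass it back through w2
dloss_dw1 = dloss_dh * (1.0 if w1 * x > 0 else 0.0) * x   # through ReLU, then to w1

print(loss, dloss_dw2, dloss_dw1)
```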
1:00:26
All right. Any questions so far?
1:00:28
Yep.
1:00:30
Sorry, a follow-up to that. So, we've
1:00:32
done gradient descent, which is
1:00:34
different than calculation of the
1:00:36
gradient of the loss function. What
1:00:37
is the purpose of the calculation of the
1:00:39
gradient of the loss function? You
1:00:41
calculate the gradient because the
1:00:42
fundamental operation of gradient
1:00:44
descent is to take your current value of
1:00:47
W
1:00:48
and modify it slightly and the
1:00:50
modification is old value minus learning
1:00:52
rate times gradient.
1:01:03
It'd be cool, right, if I say, "Go mo-
1:01:04
go back five slides to this thing." and
1:01:06
it just goes back. Product idea. Anyone
1:01:08
startups?
1:01:09
So.
1:01:11
So, this one.
1:01:14
So, this is the fundamental step of
1:01:15
gradient descent.
1:01:16
So, this is the current value of W.
1:01:19
You calculate the gradient at that
1:01:20
current value
1:01:22
multiplied by alpha do this thing and
1:01:24
you get the new value.
1:01:26
And you keep repeating.
1:01:27
Right, but GW
1:01:29
that's not the loss function.
1:01:32
>> It is the loss function. That is the
1:01:33
loss function.
1:01:34
>> Yeah, right. Here, I'm just using G as
1:01:35
an arbitrary function
1:01:37
just to demonstrate the point. But
1:01:39
when you're optimizing, when you're
1:01:41
training a neural network, what you're
1:01:42
actually doing is minimizing a loss
1:01:45
function. Right.
1:01:46
>> Loss of W. Sorry, I got things mixed up.
1:01:49
Thank you.
1:01:51
>> Yeah.
1:01:53
Uh how do we define the initial weights
1:01:54
for the neural network?
1:01:55
>> Ah.
1:01:57
So, yeah, the initial weights um
1:02:02
So, there are many ways to do it. So,
1:02:04
first of all, they are initialized
1:02:04
randomly.
1:02:06
Uh but randomly doesn't mean you can
1:02:08
just pick any random weight. There are
1:02:09
actually some good ways to randomly pick
1:02:11
the weights. Uh those are called
1:02:13
initialization schemes. Um and there are
1:02:16
a bunch of very effective initialization
1:02:18
schemes people have figured out over the
1:02:19
years and those things are baked into
1:02:21
Keras as the default.
1:02:22
So, Keras, I believe, uses something
1:02:24
called the
1:02:26
uh He initialization, H E
1:02:27
initialization, or the Xavier Glorot
1:02:31
initialization. I wouldn't worry about
1:02:33
it. Just go with the default
1:02:33
initialization.
1:02:36
The reason why they have to be very
1:02:37
careful about how these weights are
1:02:38
initialized is because if you have a
1:02:40
very big network and if you initialize
1:02:43
badly then
1:02:45
the gradient will just explode as you
1:02:47
calculate it.
1:02:48
The earlier layers, the weights will
1:02:50
have massive gradients or the gradients
1:02:52
will vanish.
1:02:53
So, they're called the exploding
1:02:55
gradient problem or the vanishing
1:02:56
gradient problem. To avoid all those
1:02:58
things, researchers have figured out
1:02:59
some clever way to initialize so that
1:03:00
it's well-behaved throughout.
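For reference, this is roughly what that looks like in Keras; Dense layers default to a Glorot initializer, and He initialization can be requested by name (layer sizes here are placeholders):

```python
# Sketch (layer sizes are placeholders): Keras Dense layers default to a
# Glorot initializer; He initialization can be requested by name.
import tensorflow as tf

layer_default = tf.keras.layers.Dense(32, activation="relu")   # glorot_uniform by default
layer_he = tf.keras.layers.Dense(32, activation="relu",
                                 kernel_initializer="he_normal")
```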
1:03:03
Yep.
1:03:05
If using um backprops and GPUs was so
1:03:08
critical, I'm just curious like who
1:03:10
first did it and when? Was this like a
1:03:12
couple years ago? Was it a company? Was
1:03:14
it a...
1:03:15
>> Yeah. Well, GPUs have been used for deep
1:03:17
learning, I want to say um
1:03:20
I think the first uh case may have been
1:03:22
in the mid-2000s, 2005 or 2006, sort of thing.
1:03:26
But I would say that it sort of burst
1:03:27
out onto the world stage and made
1:03:30
everyone take notice when uh a deep
1:03:32
learning model called AlexNet
1:03:35
in 2012 won a very famous
1:03:38
computer vision competition.
1:03:40
Uh and it set a world
1:03:43
record for how good it was.
1:03:45
Uh and that's when everyone was like,
1:03:46
"Hey, what is this thing?" And that's
1:03:48
really when it burst onto the world
1:03:49
stage. I'll talk a bit more about it
1:03:50
when I get into the computer vision
1:03:51
segment of the class.
1:03:54
But you can Google AlexNet and you'll
1:03:55
find a whole bunch of history around it.
1:03:59
I believe that if you do this, is it
1:04:00
true that if you could get to a global minimum,
1:04:04
that would mean there would be no
1:04:06
hallucinations?
1:04:07
Aha, good question.
1:04:09
So, is it perfect
1:04:11
if you get to a global minimum? First of
1:04:13
all, a global minimum doesn't mean the
1:04:14
model is perfect, right? It may still
1:04:15
have some loss.
1:04:17
Um
1:04:18
but the global minimum is going to be on the
1:04:21
training data.
1:04:24
You can imagine that the test data,
1:04:26
future data has its own loss function,
1:04:28
right?
1:04:29
So, what is minimum here may not be
1:04:31
minimum there. That's the problem.
1:04:36
Is that a comment? No, okay.
1:04:38
Just saying that
1:04:40
uh that would mean that also you can be
1:04:42
over-fitting for
1:04:43
>> Correct. Exactly. Exactly. So, if you
1:04:45
overdo, if you find the best thing in
1:04:47
the training loss function, chances are it
1:04:48
doesn't match the best thing of the test
1:04:50
data.
1:04:52
So, on the test data, you're actually
1:04:53
doing badly.
1:04:56
Okay. So,
1:04:57
uh come back to this.
1:05:03
Okay. Now, uh the final twist in the
1:05:06
tale here: we're going to go from
1:05:08
gradient descent to something
1:05:10
called stochastic gradient descent. And
1:05:11
stochastic gradient descent or SGD is
1:05:14
the workhorse for all deep learning.
1:05:16
Okay?
1:05:17
And funnily enough, SGD is simpler than
1:05:19
GD.
1:05:20
Okay? Just when you thought it couldn't
1:05:21
get simpler, right?
1:05:23
Okay. So,
1:05:25
So, for large data sets, computing the
1:05:27
gradient of the loss function can be
1:05:28
very expensive. Right? Needless to say.
1:05:31
Because it has to be done at every step
1:05:32
and the cardinality of the data set is
1:05:34
really big. Right? And you may have, I
1:05:36
don't know, billions of parameters. It's
1:05:38
just very, very
1:05:39
tough to compute it even with backprop.
1:05:43
So, the solution is at each iteration,
1:05:45
when I say iteration, I'm talking about
1:05:47
this step of gradient descent.
1:05:50
Instead of using all the data
1:05:52
instead of calculating the loss function
1:05:54
by averaging the loss across all N data
1:05:57
points and then calculating the gradient
1:05:59
of that thing, what you do is you just
1:06:01
choose a small sample randomly. You
1:06:04
choose just a few of the N observations
1:06:06
and we call it a mini batch.
1:06:08
So, for example, you may
1:06:10
have 10 billion
1:06:11
data points,
1:06:12
but in every iteration, you may
1:06:14
literally grab just like 32 or 64,
1:06:16
something really small.
1:06:18
Like absurdly small.
1:06:20
Okay?
1:06:21
And then you pretend that okay, that's
1:06:23
all the data I have. You calculate the
1:06:24
loss, find the gradient and just use
1:06:27
that here instead.
1:06:30
Okay? So, this is called stochastic
1:06:33
gradient descent. So, strictly speaking
1:06:36
theoretically, SGD uses just one data
1:06:39
point.
1:06:40
But in practice, we use what's called a
1:06:42
mini batch, 32, 64, whatever.
1:06:44
Uh and so, mini batch gradient descent
1:06:47
is just loosely called stochastic
1:06:48
gradient descent, SGD.
1:06:52
So, SGD, as it turns out,
1:06:55
you can see it's clearly very efficient,
1:06:57
right? Because
1:06:58
it's just processing a few at a time.
1:07:00
Uh and in fact, if you have a lot of
1:07:02
data
1:07:03
and you calculate the full gradient of
1:07:05
the loss function, it may not even fit
1:07:07
into memory.
1:07:09
Right? It's really problematic. But with
1:07:11
SGD, it says, "I don't care whether you
1:07:12
have a billion data points or a trillion
1:07:14
data points. Just give me 32 at a time."
1:07:17
Okay? And you just keep on doing it.
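Here is a minimal numpy sketch of that loop, with made-up data and a made-up logistic model, just to show the shape of mini-batch SGD: sample a small batch, compute the gradient on that batch only, take one step, repeat:

```python
# Minimal numpy sketch with made-up data and a made-up logistic model:
# each iteration grabs a small random mini-batch, computes the gradient on
# that batch only, and takes one gradient-descent step.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))      # pretend dataset: 10,000 rows, 5 features
y = (X[:, 0] > 0).astype(float)       # made-up labels
w = np.zeros(5)
alpha, batch_size = 0.1, 32

for _ in range(1_000):
    idx = rng.integers(0, len(X), size=batch_size)   # pick a random mini-batch
    Xb, yb = X[idx], y[idx]
    p = 1.0 / (1.0 + np.exp(-Xb @ w))                # predictions on the batch
    grad = Xb.T @ (p - yb) / batch_size              # approximate gradient (batch only)
    w = w - alpha * grad                             # one step of the update rule

print(w)   # the first weight dominates, matching how the labels were made
```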
1:07:19
And
1:07:20
turns out, because not all the points
1:07:22
are used in the calculation this only
1:07:24
approximates the true gradient. Right?
1:07:26
It's only an approximation. It's not the
1:07:27
real thing. It's only an approximation.
1:07:29
But it works extremely well in practice.
1:07:32
Extremely well in practice.
1:07:33
And there's a whole bunch of research
1:07:34
that goes into why is it so effective?
1:07:37
And you know, people are discovering
1:07:39
interesting things about SGD, but we
1:07:40
don't have like a definitive theory as
1:07:42
to why it's so good yet. We have some
1:07:44
interesting, you know, uh research
1:07:46
threads that have happened.
1:07:47
And very tantalizingly, very
1:07:50
tantalizingly
1:07:51
because it's only an approximation of
1:07:53
the true gradient
1:07:55
SGD can actually escape local minima.
1:07:59
So,
1:08:00
in the true loss function, you might be
1:08:02
at a local minimum,
1:08:04
but when you're doing SGD, you're
1:08:06
descending the mini-batch's
1:08:08
loss function,
1:08:11
whose minimum may not be a minimum of the actual
1:08:13
loss function. So, as you're moving
1:08:14
around, you can actually jump
1:08:16
out of local minima of the
1:08:18
actual loss function.
1:08:20
I know that's a mouthful. I'm happy to
1:08:22
tell you more. It's just a side thing
1:08:24
that I just wanted you to be aware of.
1:08:25
Okay?
1:08:26
That's one of the reasons why SGD is actually
1:08:27
effective. It's almost like you work
1:08:30
less and you do better.
1:08:34
How many times does it happen in life?
1:08:35
This is one of them.
1:08:39
Okay? Now, SGD comes in many flavors.
1:08:42
Uh many siblings. It's got a lot of
1:08:44
siblings and variations. It's a big
1:08:45
family. Uh and we're going to use a
1:08:47
particular flavor called Adam
1:08:49
as our default in this course and I'll
1:08:52
get back to it when we get into the
1:08:53
co-labs and things like that.
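In Keras, picking Adam is just a matter of naming it when compiling the model; a minimal sketch with placeholder layer sizes:

```python
# Sketch (placeholder layer sizes): in Keras you name Adam as the optimizer
# when compiling the model.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```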
1:08:56
All right.
1:08:57
Um
1:08:58
By the way
1:09:00
you know how all these pictures
1:09:01
I've been showing you are a nice little
1:09:02
function like that, a little bowl and so
1:09:04
on?
1:09:05
This is a visualization
1:09:07
of an actual neural network loss
1:09:08
function.
1:09:11
You can see like the hills and valleys
1:09:12
and the cracks and so on and so forth.
1:09:14
Okay? And you can check out the paper to
1:09:16
get more insight into how they actually,
1:09:18
you know, came up with this
1:09:19
visualization. It's crazy.
1:09:21
It's complicated.
1:09:24
Yep.
1:09:25
So, for SGD, do you perform the
1:09:28
iterations until you minimize the loss
1:09:30
function for each mini batch and then
1:09:32
move to another mini batch? Yeah, so
1:09:34
what you do is you take each mini batch
1:09:36
and then
1:09:37
you calculate the loss for the mini
1:09:39
batch, you find the gradient.
1:09:41
And use the gradient and update the W.
1:09:43
Then you pick up the next mini batch. So
1:09:45
you don't pick a mini batch
1:09:47
and try to perform the iterations on
1:09:48
that mini batch until you reach the...
1:09:50
>> Each mini batch, one iteration. Each
1:09:52
mini batch, one iteration. Because if
1:09:54
you do a lot of iterations on one mini
1:09:56
batch,
1:09:57
first of all, you'll never be sure that
1:09:58
you're going to find any optimal
1:09:59
solution because you're not guaranteed
1:10:00
of any global minima. And secondly, it's
1:10:03
much better for you to get new
1:10:04
information constantly because what you
1:10:05
can do is you can revisit that mini
1:10:07
batch later on.
1:10:09
Right? And that gets into these things
1:10:10
called epochs and batch size and so on,
1:10:13
which we'll get into a lot of gory
1:10:14
detail when we do the Colab.
1:10:16
So let's revisit that question. It's a
1:10:17
good question.
1:10:20
Yeah.
1:10:22
When you do the backprop process... >> Very
1:10:25
good. Backprop. Not backpropagation.
1:10:26
Nice. >> I made sure.
1:10:27
>> Yes.
1:10:29
Well, it sounded like you started
1:10:30
from the layers that were closest to the
1:10:32
output and you went backward. Okay. And
1:10:35
um my question is are you doing that
1:10:36
once or is it looping multiple times and
1:10:39
then
1:10:39
>> Do it once. Just once. Yeah. So for each
1:10:42
gradient calculation, you do it once.
1:10:44
Why does it want to start
1:10:45
from the layer that's closest or why do
1:10:47
you want to start it from the layer
1:10:48
that's closest to the output?
1:10:49
>> Yeah. So basically what happens is let's
1:10:51
say, just for argument, that you go
1:10:53
in the reverse direction.
1:10:54
You will discover that a lot of paths to
1:10:56
go from the left to the right will end
1:10:58
up calculating certain intermediate
1:10:59
quantities including the very final
1:11:02
gradient sort of item
1:11:04
again and again and again.
1:11:06
Same thing is going to get calculated
1:11:07
again and again and again. So by
1:11:09
starting from the end and working
1:11:10
backwards, you just reuse stuff you've
1:11:12
already calculated.
1:11:14
So that is sort of the rough idea. But
1:11:15
if you see my PDF, I've actually worked
1:11:17
out the example and you and that will
1:11:19
demonstrate what I'm talking about.
1:11:23
By the way, this gradient computation, the backprop,
1:11:25
is just sort of a...
1:11:28
Like in calculus, we have something
1:11:29
called the chain rule.
1:11:31
To calculate the derivative of a
1:11:32
complicated function, you calculate the
1:11:32
derivative of the outer
1:11:35
function then the inner function and so
1:11:37
on and so forth. The backprop is
1:11:39
essentially a way to organize the chain
1:11:40
rule to work with the neural network
1:11:42
layer-by-layer architecture. That's all.
1:11:49
So is it fair to say that once we
1:11:51
are finding like the local minimum, we
1:11:54
are not optimizing to all the GWs
1:11:56
because like this local minimum is
1:11:58
coming like from different curves, from
1:11:59
different lines. So
1:12:01
Is that fair to say? >> When we are using
1:12:02
stochastic gradient descent, yes. So for
1:12:04
in stochastic gradient descent, when you
1:12:06
take say 32 data points from a million
1:12:09
and you're calculating the loss for that
1:12:10
32 data points, you're basically trying
1:12:12
to do a gradient step.
1:12:14
Right? The W equals W minus alpha
1:12:17
gradient thing. You're doing it for that
1:12:20
that 32 points loss function.
1:12:22
Right? Which is not the 1 million points
1:12:24
loss function.
1:12:25
That's why it's approximate.
1:12:27
But the approximation, instead of
1:12:29
hurting you, actually helps you because
1:12:31
it helps you escape the local minima of
1:12:33
the global loss function.
1:12:35
So it's sort of an interesting and
1:12:37
somewhat technically subtle point, which
1:12:38
is why I'm not getting into it too much,
1:12:40
but I'm happy to give pointers if people
1:12:41
are interested. Yeah?
1:12:44
Uh when you say you initialize the
1:12:45
weights, you initialize for the whole
1:12:47
network or just the end layer and then
1:12:50
go backwards like you
1:12:51
>> No, you initialize everything in one
1:12:52
shot.
1:12:53
Because if you don't initialize
1:12:54
everything in one shot, what's going to
1:12:55
happen is that you can't do like the
1:12:57
forward computation to find the
1:12:58
prediction.
1:13:00
Uh and so they are done independently
1:13:02
and the initialization schemes will take
1:13:05
into account, okay, I'm initializing the
1:13:07
weights between a layer which has 10
1:13:08
nodes on one side and 32 on the
1:13:10
other side and the 10 and the 32
1:13:12
actually play a role in how you
1:13:13
initialize.
1:13:15
Okay. So um so the summary of the
1:13:18
overall training flow
1:13:19
is that, you know, you have an input.
1:13:22
It goes through a bunch of layers. You
1:13:24
come up with a prediction. You compare
1:13:26
it to the true values and these two
1:13:28
things go into the loss function
1:13:29
calculation. You get a loss number.
1:13:31
Right? And you do it for say 10 points
1:13:33
or 32 points or a million points. And
1:13:35
this loss thing goes into the optimizer,
1:13:38
which calculates the gradient. And once
1:13:39
it calculates the gradient, it updates
1:13:41
the weights of every layer using the W
1:13:44
equals W minus alpha times gradient
1:13:45
formula, gradient descent formula. And
1:13:47
then you keep it doing this again and
1:13:48
again and again.
1:13:50
This is the overall flow.
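Putting that whole flow together in Keras, with stand-in data and placeholder layer sizes, it looks roughly like this; fit() runs the loop of mini-batch, prediction, loss, backprop, and weight update:

```python
# Sketch of the whole loop in Keras, with stand-in data and placeholder sizes.
import numpy as np
import tensorflow as tf

X = np.random.rand(500, 13).astype("float32")    # stand-in for the patient features
y = np.random.randint(0, 2, size=(500,))         # stand-in for the 0/1 labels

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# fit() runs the loop: take a mini-batch, predict, compute the loss, backprop
# the gradient, update every layer's weights, and repeat for several epochs.
model.fit(X, y, batch_size=32, epochs=10)
```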
1:13:53
This is how our little network is going
1:13:54
to get built for heart disease
1:13:56
prediction. This is how GPT-4 was built.
1:14:00
And this is how AlphaFold was built.
1:14:02
And AlphaGo was built.
1:14:04
You get the idea.
1:14:07
I mean, it's astonishing, frankly.
1:14:09
If you're not getting goosebumps at the
1:14:10
thought that this simple thing can do
1:14:12
all these complicated things, we really
1:14:14
need to talk offline.
1:14:17
Uh there was a hand raised here. Yeah.
1:14:20
Sorry. Just quickly, this is for each
1:14:23
mini batch, right? So
1:14:25
my question is if you came with
1:14:27
different weight for each mini batch,
1:14:28
how do you
1:14:30
add it up?
1:14:31
Like, okay, this weight is the
1:14:33
perfect combination for this mini batch,
1:14:35
but you have a different
1:14:37
weight for another mini batch. How do
1:14:39
you combine those two? >> No.
1:14:41
At each step, what you do is you
1:14:43
start with
1:14:45
a weight.
1:14:46
You run it through for a mini batch. You
1:14:48
come up with the loss function. You
1:14:49
calculate the gradient.
1:14:50
And now using the gradient, you've
1:14:51
updated the weight. Now you have a new
1:14:53
set of weights, right? Which is the
1:14:54
updated weights. Call it
1:14:55
W2 instead of W1.
1:14:57
Now W2 is your network, and when you
1:14:59
take the next mini batch, it's going to
1:15:00
use W2 to calculate the prediction.
1:15:03
And this whole flow will become a
1:15:05
lot clearer when we do the Colabs.
1:15:08
Okay. So we have 3 minutes.
1:15:11
I don't want to go into
1:15:13
regularization overfitting in 3 minutes.
1:15:15
So let's have some more questions.
1:15:19
Yeah.
1:15:20
Can you use any activation function as
1:15:22
long as it gives like positive values?
1:15:25
For like X squared or mod X or
1:15:26
something. Um you can use a variety of
1:15:29
activation functions.
1:15:31
Um
1:15:33
Uh, but yeah, there's a whole
1:15:35
literature on, you know, the pros and
1:15:37
cons of various activation functions
1:15:38
that you could use.
1:15:39
But in general, you have to make sure of
1:15:42
a couple of things. One is that when you
1:15:44
do backprop,
1:15:46
the gradient is going to flow through
1:15:48
the activation function in the reverse
1:15:49
direction.
1:15:50
And the activation function should
1:15:52
actually sort of make sure the gradient
1:15:53
doesn't get squished.
1:15:55
It shouldn't get squished. It shouldn't
1:15:56
get exploded.
1:15:58
So those are some considerations and
1:16:00
these are technical considerations, but
1:16:01
all those considerations have to
1:16:02
be taken into account. If you can take
1:16:04
those into account, then you're okay.
1:16:07
That's sort of the key thing to keep in
1:16:08
mind.
1:16:08
And that's in fact why the ReLU is
1:16:10
actually very popular
1:16:11
because as long as the value is
1:16:13
positive, the gradient of the ReLU is
1:16:15
just one. Right?
1:16:18
Uh because
1:16:22
So if you look at something
1:16:24
Oops.
1:16:28
Was it frozen?
1:16:30
I jinxed it.
1:16:31
So sorry, livestream.
1:16:34
If you have something like this,
1:16:37
the ReLU is like that, right?
1:16:39
So the gradient here
1:16:41
is always going to be one.
1:16:43
Which means that as long as the value is
1:16:44
positive, whatever gradient comes in
1:16:46
like this, it just like gets multiplied
1:16:47
by one and gets pushed out the other
1:16:49
side. So it doesn't get
1:16:50
harmed or squished or anything like
1:16:52
that. Um so that's one reason why the
1:16:55
ReLU is very popular because it
1:16:57
preserves the gradient while injecting
1:16:59
almost like the minimum amount of
1:17:00
non-linearity to do interesting things.
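A quick way to see this, as an illustration only: ask TensorFlow for the gradient of ReLU at a negative and a positive input:

```python
# Illustration only: the gradient of ReLU is 0 for a negative input and 1 for
# a positive one, so incoming gradients pass through unsquished.
import tensorflow as tf

x = tf.constant([-2.0, 3.0])
with tf.GradientTape() as tape:
    tape.watch(x)            # x is a constant, so ask the tape to track it
    y = tf.nn.relu(x)
print(tape.gradient(y, x))   # tf.Tensor([0. 1.], ...)
```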
1:17:04
Um yeah.
1:17:07
If you have a high number of dimensions,
1:17:10
can you do mini batching on like
1:17:13
features dimensions instead of just
1:17:14
observations and keep the same number of
1:17:17
observations, but just take a small
1:17:19
sample of the number of features that
1:17:21
you're actually using? Oh, I see. I see.
1:17:24
So you're saying let's say you have 10
1:17:25
features.
1:17:27
Um instead of taking all data points of
1:17:28
10 features, what if you choose
1:17:31
five features and just use them and do
1:17:33
the thing
1:17:34
as long as you can actually compute the
1:17:36
prediction.
1:17:38
To compute the prediction, you may need
1:17:39
all 10 features.
1:17:41
Right? Or you need to have some defaults
1:17:43
for those features.
1:17:44
And if you define defaults for those
1:17:46
other five features, you're basically
1:17:48
using all the features.
1:17:50
So that's the key thing. Can you
1:17:51
actually calculate the prediction
1:17:53
by manipulating the features? And typically, you
1:17:55
can't.
1:17:57
All right?
1:17:58
Okay, folks. 9:55. I'm done. Have a
1:18:00
great rest of your week. I'll see you on
1:18:02
Monday.
— end of transcript —