
3: Deep Learning for Computer Vision – Building Convolutional Neural Networks from Scratch

MIT OpenCourseWare · May 11, 2026
Transcript ~14965 words · 1:17:12
0:16
Okay. All right. Let's get going. Uh
0:20
[clears throat] today is going to be
0:21
packed. uh I'm going to spend the first
0:23
roughly half of the lecture on uh
0:25
actually building a model, a Keras
0:28
model, in Colab to solve the heart
0:30
disease problem we saw earlier and then
0:32
switch gears halfway and then talk about
0:35
uh how to solve image classification
0:37
okay so we're going to do two Colabs
0:39
today uh I've been talking about Colab,
0:42
Colab, right, I've been teasing you, we'll
0:44
actually do Colabs today all right so,
0:46
by the way, I've shut off
0:48
the lights at the top because when I
0:50
switch to Colab it's going to be much
0:52
better for you folks particularly the
0:53
folks in the back to be able to see it.
0:54
Okay, but I hope you can see the slide
0:57
right now. Yes.
1:00
Okay, great. So this is just a quick
1:02
recap of what we did last class. U you
1:04
know broadly speaking training a neural
1:07
network essentially is no different than
1:08
training other kinds of models. We have
1:10
a bunch of parameters, i.e., weights and
1:12
biases and we need to use the data to
1:14
find good values of those weights. And
1:17
what does good mean? Typically it means
1:19
that we define some measure of
1:21
discrepancy between what the model
1:23
predicts for a given set of weights and
1:24
what the right answer is what the ground
1:26
truth answer is and then we try to find
1:29
weights that minimize this discrepancy
1:30
that's it and this notion of a
1:32
discrepancy is called a loss function
1:34
right so the broadly speaking the
1:36
overall training flow is that you define
1:38
some network it has an input it goes
1:40
through a bunch of layers you come up
1:41
with some predictions you take the
1:42
predictions you take the true values and
1:44
then those two go into the loss function
1:46
i.e., the discrepancy function, and
1:48
then you come up with the loss score and
1:50
then you send it to the optimizer which
1:52
then proceeds to calculate the gradient
1:54
of this loss function with respect to
1:56
all the parameters and then it updates
1:58
all the weights using that gradient and
2:00
then this process repeats. That's it. So
2:02
that is the training flow. Okay, quick
2:04
recap. Now we also talked about the
2:08
optimization algorithm we're going to
2:09
use which is called gradient descent.
2:12
and gradient descent. As you noticed in
2:15
each iteration, every data point is
2:17
being used to make predictions and
2:20
therefore to calculate the loss and then
2:22
to calculate the gradient. And then we
2:24
pointed out that gradient descent is
2:26
actually not as good as something called
2:28
stochastic gradient descent. Stochastic
2:31
gradient descent where we instead of
2:33
choosing taking all the points, we just
2:35
randomly choose a small number of
2:37
points. Pretend for a moment as if those
2:40
are the only points we have. make
2:42
predictions, calculate loss, calculate
2:44
gradient and go on. So that was the
2:47
basic idea behind stochastic gradient
2:49
descent, right? Two different kinds of
2:51
things. Now what it means is that when
2:54
we actually start training the model, as
2:56
we will in a few minutes, the way
2:58
because we only take a few points at a
3:00
time, we have to be a bit careful in
3:02
what's going on. And I want to make sure
3:04
you clearly understand what the
3:06
differences are before we actually get
3:07
to the Colab. Okay. And
3:10
all right. So there is the notion of an
3:13
epoch.
3:14
An epoch essentially just means that we
3:17
make one pass through the training data.
3:20
All the training data we make one pass
3:22
through it. Okay. And so what is one
3:25
pass is that if you have something like
3:27
gradient descent, one pass means every
3:30
data point is sent through the network.
3:32
We calculate its predictions, calculate
3:34
the loss, calculate the gradient, right?
3:37
We run every training sample through it.
3:38
we calculate the gradient which is just
3:40
this thing here right I mean I will
3:42
sometimes say dL/dw, the derivative
3:46
of the loss with respect to w, and sometimes I
3:48
might use the nabla symbol ∇; these are all
3:51
interchangeable okay so we'll calculate
3:54
the gradient and then we update using
3:55
some version of this okay but we just do
3:58
it once at the end of the epoch because
4:01
if you have 10 billion data points every
4:03
one of them flows through you get 10
4:05
billion outputs and then, at the end of
4:07
the epoch, just once at the end of this
4:08
thing, we calculate the gradient and
4:10
update once one update per epoch. Yes.
4:15
Now in stochastic gradient descent what we
4:18
do is that we process the data in
4:20
batches
4:22
small numbers of points at a time right
4:25
and these are called technically
4:26
speaking they're called mini batches I
4:29
don't know about you I just get tired of
4:30
saying mini batches I'm just going to
4:31
say batches from this point on okay and
4:34
in fact that is widely done in the
4:36
literature so we'll so we'll have to
4:39
process it in batches so we take the
4:41
training data and then we divide it up
4:43
into batches
4:44
batch one, batch two all the way till
4:46
the final batch. And so what we do is we
4:49
for each batch we basically do gradient
4:53
descent for each batch we take batch one
4:56
and then we run just the training
4:57
samples in that batch through the
5:00
network to get predictions. We calculate
5:01
the gradient we update the parameters
5:03
and then we go to batch two then we go
5:05
to batch three and so on and so forth.
5:07
So pictorially this is how it's going to
5:09
look like
5:11
right let's say the first batch is say
5:12
32 points we take those 32 points we run
5:16
it through the network get all the stuff
5:17
out we calculate the gradient update the
5:19
weights so when we now get to batch two
5:22
the weights have changed
5:25
they have been updated and then we do
5:27
the same thing for batch two batch three
5:29
and all the way till we get to the end
5:30
of the thing and when we are done with
5:32
this thing this whole thing is called a
5:34
what
5:36
an epoch [clears throat]
5:38
This whole thing is an epoch. Okay.
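For reference, here is a tiny NumPy sketch of what one epoch of mini-batch SGD amounts to, using a toy logistic-regression model (purely illustrative; Keras does all of this for you, and the sizes simply mirror the lecture's example):

```python
import numpy as np

# Toy data: 194 samples, 29 features (sizes mirror the lecture's example).
rng = np.random.default_rng(0)
X = rng.normal(size=(194, 29))
y = rng.integers(0, 2, size=194).astype(float)

w, b, lr, batch_size = np.zeros(29), 0.0, 0.1, 32

# One epoch = one full pass over the data, processed batch by batch.
for start in range(0, len(X), batch_size):          # 7 batches: six of 32, one of 2
    xb, yb = X[start:start + batch_size], y[start:start + batch_size]
    p = 1.0 / (1.0 + np.exp(-(xb @ w + b)))          # predictions for this batch only
    grad_w = xb.T @ (p - yb) / len(xb)               # gradient of binary cross-entropy
    grad_b = np.mean(p - yb)
    w, b = w - lr * grad_w, b - lr * grad_b          # one weight update per batch
```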
5:42
All right. Now, so the question of
5:44
course is that if you have a bunch of
5:46
data points and you're going to run
5:47
stochastic gradient descent on it in a
5:50
in a particular epoch, how many batches
5:52
are going to be there? Okay, how many
5:54
batches are going to be there? Now,
5:56
Keras is going to calculate all this
5:58
stuff. You don't have to worry about it,
5:59
but you just need to understand exactly
6:00
what happens. Okay, so my philosophy, by
6:02
the way, is that you have to know the
6:04
details of what's going on. If you don't
6:06
know the details, if you haven't figured
6:08
out at least once, you will not actually
6:11
be able to think new and creative
6:12
thoughts for a new problem. Okay, it's
6:15
because the concepts are not manipulable
6:17
in your head yet. Okay,
6:23
please use the microphone.
6:27
So when we talk about SGD, so we're
6:30
talking about uh we are only taking some
6:32
part of it. Is it what we are saying is
6:34
that we only take some variables or we
6:36
only taking some part of the data.
6:37
>> We are taking some rows.
6:40
Okay. We're taking only rows, right. So those data
6:42
points that means a batch.
6:44
>> Exactly. So for example, let's say you
6:46
have a thousand data points, right?
6:48
Thousand rows of observations, thousand
6:50
patients in the heart disease example or
6:52
a thousand images that you're trying to
6:53
classify. You take let's say 32 of those
6:56
images, 32 of those patients and that's
6:58
a batch. Then you go to the next 32.
7:00
Then the next 32 and so on and so forth
7:02
till you run out of patients or run out
7:04
of images.
7:05
>> And each iterative time you are updating
7:07
with the weights new weights that you've
7:09
got.
7:09
>> And it means you keep connecting it or
7:12
keep moving towards
7:13
>> you're basically updating the weights as
7:14
you
7:14
>> updating the weights
7:17
>> and what we calling the epoch is
7:19
ultimately the equation of loss function
7:20
that we are trying to do.
7:21
>> No an epoch. See the the thing to
7:24
remember is that here this whole thing
7:27
is called an epoch because we have to do
7:30
one full pass through the training data.
7:32
Okay. But within that epoch we update
7:35
the weights many times. Basically we
7:37
update the weights as many times as we
7:40
have batches.
7:44
All right. Um
7:46
so to go here let's say for example
7:49
basically the idea is that you take the
7:50
training set you divide it by the batch
7:52
size and you choose the batch size okay
7:54
you choose the batch size and we'll talk
7:56
about well how do you choose that later
7:57
on you choose the batch size and once
7:59
you choose the size just divide it and round
8:01
it up so for example as you will see in
8:04
the Colab, the training set is going to be 194
8:06
patients and then we're going to choose
8:09
a batch size of 32 and we typically tend
8:12
to choose batch sizes of 32 64 and
8:14
things like that because it actually
8:16
aligns very well with the nature of the
8:18
parallel hardware we're going to use.
8:20
Okay. And so here 32 and so on. So
8:24
divide 194 by 32 you get 6 point
8:27
something. You round it up to seven.
8:29
Okay. And so what that means is that the
8:31
first six batches will have 32 samples
8:33
each. And then the final batch has only
8:36
two samples left. And that's okay. It
8:38
can be a nice little small batch at the
8:40
end.
8:42
There's nothing that says that every
8:43
batch has to be the same size.
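In code, that batch count is just a ceiling division (194 and 32 are the numbers quoted in the lecture):

```python
import math

n_train, batch_size = 194, 32
n_batches = math.ceil(n_train / batch_size)
print(n_batches)   # 7: six full batches of 32, plus one small final batch of 2
```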
8:46
>> That's it. Epoch batches.
8:53
>> And are you like for each batch you run
8:56
through the whole network like all the
8:58
layers or like each layer is one batch?
9:00
>> No, for a batch you run it through the
9:03
entire network. So the way I think about
9:04
it is that you take a batch right just
9:06
momentarily you assume that's all the
9:08
data you have
9:10
just run it through the network because
9:12
unless you run it through the every
9:14
layer of the network you can't get a
9:15
prediction and unless you get a
9:18
prediction you can't calculate the loss
9:19
and unless you calculate the loss you
9:20
can't calculate the gradient unless you
9:22
calculate the gradient you can't update
9:23
the weights
9:25
>> last thing but if you're using like all
9:27
the data just doing the gradient descent
9:29
then you just go through the network
9:31
once right
9:32
>> okay exactly so in Gradient descent one
9:34
epoch is one pass and one weight update.
9:37
In stochastic gradient descent the
9:40
number of updates you make is equal to
9:41
the number of batches you have which
9:43
ends up being, you know, the training
9:46
set size divided by the batch size, rounded
9:47
up.
9:50
>> So just to confirm so initially when we
9:52
introduced like the concept of batches
9:54
the whole purpose was not to run through
9:56
all the data and be able to do some
9:58
prediction from a subset. So now like
10:00
the advantage is that like after batch
10:02
one we are using more accurate
10:04
coefficient to run through batch two and
10:06
so on. That's really the advantage of it
10:08
or there's something else to it.
10:10
>> Perfectly said. That's exactly the
10:11
advantage. So we take a small amount of
10:13
data and we say hey we know this is not
10:16
all the data. It's just a small subset
10:18
of the data. So therefore it's not going
10:19
to be super accurate. It's going to be
10:21
approximate but it's okay. So we'll
10:23
still tend to move in the in the right
10:25
direction. So instead of waiting for the
10:28
whole thing to get done and then
10:29
updating it, we're just going to update
10:30
it as we go along.
10:33
All right. Uh yes,
10:35
>> building on to her question, is it that
10:37
uh doing this process for SGD will uh
10:40
render us a better solution or
10:43
requires less compute power?
10:45
>> Both
10:46
>> both and the reasons for both are in the
10:48
previous lecture. Yeah. And I'm saying
10:51
that instead of repeating it just
10:52
because I'm like very pressed for time
10:54
today. That's why uh all right cool so
10:57
that's what we have uh are we good
11:01
okay so now we come to the last step
11:04
before we actually fire up the Colab
11:05
which is overfitting and regularization
11:07
um so if you remember from your machine
11:09
learning background um when your model
11:12
gets more and more complex
11:14
right if you you know using
11:18
use a simple model then you use a more
11:19
complex model and so on and so forth
11:21
what happens to the error on the
11:23
training data Typically what happens to
11:26
the error on the training data? So let's
11:27
say you have a simple regression model,
11:28
you get some error and then you have a
11:30
regression model in which you use all
11:31
kinds of interaction terms. You use
11:32
logarithms and this and that and make it
11:34
super complicated. What do you think is
11:35
going to happen to the error on the
11:36
training data?
11:39
>> Right? Basically it's going to go down
11:41
as the model get more gets more complex.
11:43
Correct. Now of course comes the punch
11:45
line which is what what do you think is
11:46
going to happen to the test data? I
11:49
showed you the answer.
11:53
Right? Basically, what's going to happen
11:54
typically, at least conceptually, is
11:56
that it's going to get better and better
11:57
at some point. It's going to bottom out
11:59
and it's going to start climbing again.
12:00
And so, we typically refer to this
12:03
phenomenon here when it starts to climb
12:05
again as overfitting because the model
12:07
is essentially fitting to the
12:09
idiosyncrasies of the training data as
12:11
opposed to generalizing patterns. And
12:14
then in this thing we call it
12:15
underfitting because it can still
12:17
there's a lot of potential to improve
12:18
and we really are hoping to find the
12:20
sweet spot in the middle right that's
12:23
the basic idea of overfitting
12:24
underfitting and the way we and to to
12:27
relate this to neural networks as you
12:29
see as you as you've learned so far you
12:31
have to learn smart representations of
12:33
the input data and to do that we I have
12:36
argued that you need to have lots of
12:38
layers in your network the more layers
12:39
you have the better things get. GPT3 for
12:42
example has 96 layers if I recall right
12:45
more layers the better but more layers
12:47
means more parameters more parameters
12:50
means more complexity to the model and
12:52
therefore more chance of overfitting
12:54
okay so it's really important in neural
12:57
networks that we think about
12:59
regularization and regularization you
13:01
will recall from your machine learning
13:03
background is the way we handle the risk
13:05
of overfitting and try to find models
13:07
that fit just right okay and so several
13:11
regularization methods have been
13:12
developed over the years and we are
13:14
going to use only two of them. The first
13:16
one is called early stopping. uh and
13:19
this is this has been famously referred
13:20
to uh by Geoffrey Hinton who's one of the
13:23
pioneers or as he's more colorfully
13:25
known one of the godfathers of deep
13:27
learning, um, who also won the
13:29
Turing Award a few years ago, as sort
13:31
of a beautiful free lunch, right, that's
13:33
what he calls it so the idea is very
13:35
simple we take a validation set we take
13:37
the training data we split into a
13:39
training and a validation set and then
13:41
we just keep you know doing gradient
13:42
descent boop b the training will
13:45
hopefully keep on getting better and
13:46
better lower and lower error
13:49
And then we just keep track of what's
13:50
going on in the validation set. And then
13:52
at some point if it starts to flatten
13:54
out and start to climb, we just say,
13:56
"Okay, that's when we stop training."
13:59
Right? And what we're going to do in the
14:01
Colab is actually run it through the
14:02
whole thing, see where it flattens out,
14:03
and then we say, "Okay, that's where we
14:04
should stop." But of course, you don't
14:06
want to go all the way to the end and
14:07
then go back and say, "Well, I want to
14:09
stop at the 10th epoch." And there are
14:12
ways you can use Keras to be very
14:13
efficient about this. But the
14:15
fundamental idea is you take the
14:16
training data, split it into training
14:18
and validation and just track what's
14:20
going on in the validation set to see
14:21
whether this kind of bottoming out
14:23
happens. Okay. So this is called early
14:25
stopping. Again, right, this is called early stopping:
14:28
we're looking for this bottoming-out part.
14:30
The other
14:32
thing is called dropout. And I'm going
14:35
to come back to dropout when we do when
14:39
on Wednesday's lecture because that's
14:40
the first time we're going to use it.
14:42
And so I'll come back to dropout and
14:43
tell you exactly how it works. It's a
14:44
very very clever strategy. But we will
14:46
not use it today. We'll use it on
14:48
Wednesday. Okay. So in summary, uh what
14:51
do we do? We get the data ready. We
14:53
design the network, number of hidden
14:55
layers, number of neurons and so on and
14:57
so forth. We pick the right output
14:58
layer. We pick the right loss function.
15:01
Uh we choose an optimizer. As I
15:04
mentioned earlier, SGD comes in lots of
15:06
flavors, lots of variations on the
15:07
theme. And empirically much like for
15:11
hidden layer neurons we tend to use
15:13
ReLU as the activation function, for
15:16
optimization we tend to use a flavor of
15:17
SGD called Adam okay as sort of the
15:20
default because it's really good so
15:22
we'll use Adam as you'll see we
15:24
typically use either uh early stopping
15:27
or dropout and then you just fire it up
15:29
and start training in Keras and TensorFlow
15:32
all right so that is the training loop
15:33
now I'm going to switch gears and give
15:35
you a quick intro to Keras and
15:38
TensorFlow. Okay. Keras and Tensor... no,
15:40
TensorFlow and Keras. Thank you. Um, and
15:43
then we'll actually fire up the Colab.
15:45
So, first of all, what's a tensor?
15:49
>> Yeah, I just quick question on the
15:52
previous thing like if you're looking at
15:54
the validation set to avoid overfitting,
15:57
but aren't you actually like over
15:59
actually overfitting because like you're
16:02
kind of using the validation set as a
16:03
training set or not?
16:05
>> Uh, no, no, no. The validation set is
16:08
never used to calculate any gradients.
16:10
It's only used to calculate accuracy and
16:12
loss.
16:14
Yeah. Yeah. It's kept aside and only
16:16
used for evaluation, not for training.
16:19
That's what keeps you honest.
16:22
>> Right.
16:23
>> And this will become clear when we
16:24
actually go to the collab. So what's a
16:25
tensor?
16:28
>> All right.
16:30
Okay.
16:33
Tensor is the input data which you're
16:35
giving to the system. It could be in
16:36
various formats like it's image it could
16:39
be like we call it a 4D tensor. If it's
16:42
a time series data, it's 3D. And
16:45
typically, if you just send numbers in,
16:47
it becomes a vector which would go
16:49
inside which each each it gives the
16:52
value of the
16:54
uh uh the variable as well the values of
16:57
the variables associated to it as well
16:59
as
17:01
uh as well as the I mean information you
17:05
want to get to.
17:07
>> You're kind of on the right track, but
17:08
not entirely, right? It's actually a
17:10
simpler concept than that. So, uh
17:13
>> it's like a matrix but generalized with
17:15
higher dimensions.
17:16
>> Correct? That's also actually correct
17:18
but incomplete. The reason is because it
17:21
can be simpler than a matrix. It's not
17:24
matrix or higher. It's actually could be
17:25
simpler. In fact, you take a number,
17:27
it's actually a tensor.
17:30
All right? The simplest case of a tensor
17:31
is a number. The next case is a
17:34
vector which is a list. The next higher
17:37
case is a table.
17:40
Okay, so these are all tensors. So
17:43
tensors basically are a generalization
17:45
of the notion of both a number, a vector
17:48
and a table to higher dimensions.
17:52
Okay, so you can think of a tensor as
17:56
having what are called every tensor has
17:59
something called a rank, right? So a
18:03
number is just a number. It doesn't have
18:04
a dimensionality to it. So it has got
18:06
rank zero. Okay. While a vector it's a
18:10
list of numbers. You can sort of write
18:12
it down top to bottom and it's one
18:14
dimension. Right? So that dimension that
18:17
one dimension is called a rank. So it's
18:19
called rank one. A table is 2D
18:22
two-dimensional. So it's called rank
18:24
two.
18:26
And you can have a rank three which is
18:28
just a bunch of tables.
18:32
A bunch of tables is a rank three
18:34
tensor. We also think of it as a cube.
18:37
Okay. So these things are very useful
18:40
because obviously we are all familiar
18:42
with vectors. Uh as you will see very
18:45
shortly later in this class black and
18:48
white grayscale images are usually
18:49
represented using tables of numbers like
18:51
this. Color images are represented using
18:54
three tables.
18:56
Okay. Can you get think of what might be
18:59
representable as you know a tensor of
19:02
rank four? Meaning every element of a
19:06
tensor of rank four is actually a color
19:08
picture.
19:11
Just shout it out. Video. Exactly. What
19:14
is a video? A video is basically a
19:16
stream of color images, a color
19:19
video. So each element of that stream,
19:23
right? What the first dimension of the
19:25
tensor is which frame it is and then
19:28
everything else is the actual frame. So
19:31
the way I u think about these tensors
19:34
always is
19:37
tensor you can just think of it as a you
19:40
can think of a tensor as being this
19:42
array which has all these axes or
19:45
dimensions. This is the first one. This
19:48
is the second one. This is the third one.
19:51
Right? This is a tensor of rank four.
19:54
Okay? 1 2 3 4. And so if you have a
19:58
vector, right? So you can imagine if
20:02
it's just a vector, you can imagine the
20:03
vector actually living like this, just a
20:06
list of numbers, right?
20:10
But if it's just if it is just
20:14
a 2D a rank two tensor right which is
20:16
just like that right which is just like
20:19
that
20:21
so this thing becomes you know like that
20:24
and that thing becomes like that. So for
20:26
example if this is a 7 by 3 that means
20:29
that there are
20:31
seven rows and three columns.
20:35
So you get the idea. So the way you
20:36
think about tensor is always as if this
20:38
open square bracket a bunch of things a
20:40
closed square bracket and that's really
20:42
what a tensor object is. So what that
20:44
means is that anytime you have a tensor
20:48
right anytime you have a tensor however
20:49
complicated it is you can always create
20:52
a more complicated tensor by if you want
20:54
to take a list of those tensors let's
20:56
say that you have a list of videos
20:59
each video is a rank four tensor so
21:02
which means a list of videos is what
21:04
rank
21:05
Exactly. So a tensor of rank say 10 is
21:10
just a list of rank nine tensors.
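To make the rank idea concrete, here is a small NumPy illustration (the shapes are arbitrary examples):

```python
import numpy as np

scalar = np.array(3.0)              # rank 0: just a number
vector = np.array([1.0, 2.0, 3.0])  # rank 1: a list of numbers
table  = np.zeros((7, 3))           # rank 2: 7 rows, 3 columns
image  = np.zeros((28, 28, 3))      # rank 3: a color image (three tables of numbers)
video  = np.zeros((9, 28, 28, 3))   # rank 4: 9 frames, each frame a color image
print(scalar.ndim, vector.ndim, table.ndim, image.ndim, video.ndim)  # 0 1 2 3 4
```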
21:15
So that is this that is the most
21:17
important thing you need to understand
21:18
about tensors. So at any point in time
21:20
if I give you a tensor you can just
21:22
iterate through the first dimension of
21:24
it the first aspect of it and as as you
21:27
go through each one of these values. So
21:29
for example here um
21:32
yeah that can do it.
21:35
So
21:39
so if you have this tensor here
21:42
and if you want to create a more
21:43
complicated tensor no problem.
21:46
So you add another dimension here. Okay.
21:52
Now it just becomes this dimension let's
21:54
say has nine values.
21:58
one on the nine. So you put zero here
22:00
and then what do you get? This whole
22:02
tensor is a rank four tensor. And you
22:04
put a one here, it's another rank four
22:06
tensor. You put a two here, another rank
22:08
four tensor. So every tensor, you take
22:11
the first element, it's just a list, but
22:14
it's a list of the next downrank tensor.
22:18
Okay. Now this tensor concept is
22:20
actually something Einstein famously made
22:21
heavy use of. Um and so it's simultaneously
22:26
kind of easy to understand and also
22:28
slippery. So I would actually encourage
22:30
you to read the book which has a really
22:32
good discussion of tensors and the more
22:33
you practice with it the easier it'll
22:35
get. Okay. So if you feel you kind of
22:38
understood but not quite you're not
22:39
alone. It happens to all of us right?
22:42
You have to pay the price or go through
22:43
the crucible. Okay. Okay. All right.
22:48
So to come back to this
22:51
that's what we have
22:55
and we already talked about a rank four
22:56
tensor, it's a video. So section 2.2 of the text
22:59
has a lot more detail. You should
23:00
definitely read it. Uh, so here TensorFlow
23:05
is a library and as you can imagine
23:08
neural networks tensors come in and go
23:10
through the network and go out the other
23:11
end right and since tensors capture
23:14
everything numbers lists uh tables and
23:16
so on and so forth it's just tensors
23:18
flowing from input to output hence it's
23:20
called tensorflow and it gives you a
23:22
couple of things which are really really
23:23
important which is why we use it. The
23:25
first one is that it'll automatically
23:27
calculate gradients for you of
23:30
arbitrarily complicated loss functions.
23:32
You don't have to calculate the gradient
23:34
because calculating the gradient is very
23:35
painful, right? It'll automatically
23:37
calculate the gradients for you. That's
23:39
the best part. You don't have to use the
23:40
chain rule. You don't do anything. The
23:42
second thing it'll do, it gives you all
23:44
these optimizers including SGD and all
23:46
its variations. So you don't have to
23:48
worry about the optimization itself.
23:49
It'll just you can just pick and choose
23:50
what you want. Third, if you have a lot
23:53
of servers, it'll actually take the
23:55
computational load and distribute it
23:56
across all those servers. People here
23:58
with the CS background know that
24:00
parallelizing computation is actually a
24:02
very difficult problem, right? There are
24:05
things which are called embarrassingly
24:06
parallel. Many things are not; they're actually
24:09
quite tricky to figure out. We don't
24:10
know how to figure it out. TensorFlow
24:11
will figure it out. Okay? And then
24:13
finally, I talked about the fact that
24:15
there are these things called GPUs,
24:17
graphics processing units, which are
24:18
parallel hardware. uh and so it'll even
24:21
if you have just one computer but it has
24:23
GPUs there's a particular way in which
24:26
you have to take your computation and
24:28
organize it to really exploit the fact
24:30
that you have a GPU and so TensorFlow
24:33
will actually do it for you out of the
24:35
box automatically you don't have to
24:36
worry about any of that stuff okay so
24:38
those are all the advantages of this
24:39
thing by the way TPU is called a tensor
24:41
processing unit it's something that it's
24:43
kind of you can think of it as Google's
24:45
GPU right they came up with their own
24:47
variation on the theme okay now keras
24:50
sits on top of TensorFlow, right?
24:52
TensorFlow, this is the this is the
24:53
hardware you have. TensorFlow sits on
24:56
top of the hardware. Keras sits on top
24:58
of TensorFlow and it basically gives you
25:01
a whole bunch of convenience features.
25:02
So, for example, it gives you the notion
25:04
of a layer, right? We already saw
25:07
keras.layers.Dense is a dense layer, right? It
25:10
gives you the notion of a layer. It
25:11
gives you the notion of activation
25:12
functions and so on and so forth. It
25:14
gives you easy ways to pre-process the
25:16
data, easy ways to train the model,
25:18
report on metrics, you know, calculate
25:20
validation loss, validation accuracy,
25:21
training loss, all the metrics we care
25:23
about. And then it also gives you a
25:25
whole library of pre-trained models that
25:26
you can just use and adapt for your
25:28
particular problem. So it gives you a
25:30
whole bunch of conveniences and that's
25:32
why it's very popular. And by the way,
25:34
you know, many of you might also be
25:35
familiar with PyTorch, which is a
25:37
fantastic framework as well for deep
25:38
learning. And the reason we chose to go
25:41
with TensorFlow for this course rather
25:42
than PyTorch is because we wanted to
25:45
make the course uh sort of accessible to
25:48
folks who don't have a ton of
25:49
programming background before coming to
25:51
the class. And PyTorch is a bit more
25:53
demanding from a CS perspective. It
25:55
requires more knowledge of
25:56
object-oriented programming. Uh which is
25:58
why we decided to go with TensorFlow and
25:59
Keras because I think it's actually as
26:02
powerful uh in many ways and it's a
26:04
little easier to get going. Okay, so
26:07
that's what we have here. And one other
26:09
thing I will mention is that there are
26:10
three ways in which you can use Keras.
26:12
There are three kinds of APIs.
26:14
Sequential, functional, subclassing. And
26:16
we'll almost exclusively use the
26:18
functional API. Okay. And in fact, the
26:21
model we built for heart disease
26:22
prediction uses the functional API. And
26:24
so just read section 7.2.2 of the textbook to
26:26
understand in detail how the API works.
26:28
I find in my own work, the functional
26:30
API is basically all I need. I don't
26:32
need to do anything more complicated
26:33
than that. Um and and as you will see as
26:35
you work on the homeworks uh and on your
26:37
project that it's is it's sort of a
26:39
beautifully designed Lego block
26:41
environment for doing these things and
26:43
you can create very complicated models
26:45
very easily. Okay. Uh there's a whole
26:48
bunch of stuff here on these websites.
26:50
So check them out. There's lots of
26:51
Colabs, uh, available. So now
26:55
if you go back to the neural model for
26:57
heart disease prediction, this is what
26:58
we came up with in the last class,
26:59
right? uh we had an input layer, one
27:02
dense layer with 16 neurons, ReLU
27:04
neurons, an output layer with the
27:05
sigmoid and then boom, that was a model.
27:08
So let's train this model. Uh and so the
27:10
training checklist is that uh we have
27:13
already done this hidden layer of 16
27:14
neurons uh sigmoid. We need to use an
27:17
appropriate loss function based on the
27:19
type of output. What loss function
27:20
should we use?
27:23
What is the output here?
27:26
It's a binary classification problem. So
27:28
what should the the loss function be?
27:33
Kind of heard it somewhere. Just shout it
27:35
out.
27:37
No, the output is a sigmoid. The loss
27:40
function is binary
27:43
cross entropy.
27:44
Okay, remember if if you're predicting a
27:46
number an arbitrary number, you can use
27:48
something like mean square error. If
27:50
you're predicting a probability which
27:52
has to be compared to a 0/1 output, which
27:55
is what binary classification is all
27:56
about. we use binary cross entropy.
27:59
Okay, so that's what we do here. So we
28:01
do binary cross entropy
28:03
and then we will go with Adam, right?
28:06
And then we'll use early stopping to
28:08
make sure we don't over fit. Okay, I
28:10
know this is a lot, okay, I promise this is
28:12
literally the last slide before I go
28:13
to the Colab. I feel like one of those
28:16
used car salesmen: but wait, there is more.
28:19
So anyway, u so uh don't worry if you
28:23
don't understand every detail of what
28:24
I'm going to go through. I'm going to
28:26
link to the Colab as soon as the class
28:27
is over. But once you get your hands on
28:29
the Colab, make sure you actually go
28:31
through every line in the Colab. What I
28:33
typically do when I'm trying to learn
28:34
something new is I'll actually cut and
28:36
paste, right? I won't do that. I won't
28:39
actually cut and paste the code and run
28:41
it myself. I will retype the code. If
28:44
you retype the code as opposed to
28:45
cutting and pasting, trust me, you'll
28:46
learn a lot more. Right? So I strongly
28:48
encourage you to do it that way.
28:52
Um and so for all the Colabs we're going
28:54
to publish in the class, uh the first
28:56
thing you should do is you should just
28:57
make your own copy of the notebook,
29:00
right? Copy to drive. And then if you're
29:02
using anything other than today's
29:04
Colab, uh right, anything involving
29:06
natural language processing or vision,
29:08
you probably should use a GPU. So just
29:10
go into go in here, choose the runtime
29:13
to be a GPU. Um and then you start your
29:15
notebook and you're done. And the second
29:17
time onwards, you can just go directly
29:19
to this step. You don't have to do all
29:21
this stuff for that particular notebook.
29:23
And there are numerous tutorials like
29:24
five minute videos and so on on how to
29:26
use Colab. Just do that. I'm not
29:27
going to spend time on it here.
29:30
All right. Okay. So, uh I just ran it um
29:33
a few hours ago. I'm not going to run
29:35
every cell now because it's going to
29:37
take some time. It's going to get in the
29:38
way of the class time, but I'm going to
29:39
just like, you know, go through it
29:40
slowly and explain what's going on. So,
29:43
here this is just an introduction to the
29:45
data set. We already saw this
29:46
introduction last week. We
29:49
have whatever 303 patients, heart
29:51
patients. We have a whole bunch of uh
29:54
variables here, age, demographics, and a
29:57
whole bunch of biomarker information.
29:59
And this is a target variable. Okay? Uh
30:02
zero or one, heart disease, yes or no.
30:05
And so, by the way, just some technical
30:07
preliminaries here. Basically,
30:10
every time we load these things, we're
30:12
actually going to load these packages.
30:13
So you can see here these are the two
30:15
key things we need to do. We import
30:16
tensorflow first and then from within
30:18
tensorflow we import keras. Okay that's
30:21
what these two lines do here. Okay. And
30:23
then and folks who have done data
30:25
science and machine learning a bit
30:26
before you you'll know this. We will in
30:28
in sort of we will actually load like
30:30
the three packages that were just most
30:32
commonly used right which is numpy
30:34
pandas and matplotlib. Uh numpy
30:37
because it's very easy for manipulating
30:39
matrices and arrays and tensors. uh
30:42
pandas because often times you get some
30:44
data in from somewhere you need to
30:46
massage it and wrangle it to a point
30:48
where we can actually feed it into Keras,
30:48
so you need pandas for that, and matplotlib
30:51
because you just want to plot, you
30:53
know uh these loss curves and accuracy
30:55
curves to see whether early stopping is
30:57
needed. Okay, so that's why we use those.
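The import cell he is describing amounts to something like this (a sketch of the standard setup):

```python
import tensorflow as tf
from tensorflow import keras

import numpy as np               # arrays, matrices, tensors
import pandas as pd              # loading and wrangling the CSV data
import matplotlib.pyplot as plt  # plotting loss and accuracy curves
```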
31:00
so we import all these things and then I
31:02
guess the other thing you have to
31:03
remember is that when we are training
31:04
these deep learning models uh there is
31:06
randomness in the process which enters
31:08
in a few different places so clearly the
31:11
starting values for the these weights
31:13
are going to be they're going the
31:14
weights are going to be randomly
31:15
initialized. Uh and therefore that
31:17
that's obviously a source of randomness.
31:19
Uh now we talked about how you take if
31:22
when you're doing stochastic gradient
31:23
descent you take all the data and then
31:25
you randomly choose batches right from
31:28
this data till we finish a whole pass
31:29
through it. Well that immediately raised
31:32
the question well well what do you mean
31:33
by randomly choose? So typically what we
31:35
do in practice is that, and Keras will take
31:37
care of all this for you. um you
31:39
basically take the data and just shuffle
31:40
it once randomly and then you just go
31:42
first 32 next 32 next 32 next 32 like
31:45
that okay but it is a source of
31:47
randomness and then when we split the
31:49
data into train validation testing and
31:51
so on uh particularly if you want to
31:53
look for early stopping and overfitting
31:55
uh we need to again split the data
31:56
randomly and that's another source of
31:58
randomness and then when we do dropout
32:01
which we'll talk about on Wednesday
32:02
again dropout has a little bit of a
32:05
random element to it and so that's
32:06
another source of randomness this. So
32:09
all of it all this means is that if
32:11
you're working with these models and if
32:13
you want to build a model and you want
32:14
to hand it off to someone so that they
32:16
can reproduce your results well you
32:17
better make sure that you sort of you
32:19
know make it easy for them to replicate
32:21
what you have and the way you do it is
32:22
by setting a random seed for
32:24
all these things okay and the way you do
32:26
it is by having this little handy
32:28
function here, set random seed, uh and of
32:31
course you know I use 42, just
32:32
like everybody should, right. So okay, so
32:35
that's that uh by the way just that's
32:38
just a pop-culture reference to this book
32:39
called The Hitchhiker's Guide to the
32:40
Galaxy.
32:43
>> Number 42 and you'll know what I mean.
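The handy function he mentions is essentially Keras's built-in seeding utility (available in recent TensorFlow versions; one call seeds Python, NumPy, and TensorFlow):

```python
from tensorflow import keras

# Make weight initialization, shuffling, dropout, etc. reproducible
# (up to the hardware-level caveats he mentions next).
keras.utils.set_random_seed(42)
```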
32:45
Okay, so by the way, um the question
32:47
inevitably comes at this point, okay, if
32:49
we do exactly this, will you actually
32:51
get the exact same numbers that you have
32:52
in your version uh of the notebook? And
32:55
the answer is hopefully most of the
32:57
time, but it's not guaranteed. So this
32:59
is called bitwise reproducibility. It's
33:01
not guaranteed due to certain hardware
33:03
things and device drivers and stuff like
33:05
that. So we won't get into all that
33:07
stuff. uh and which is why as you see
33:09
here uh I have a bit of a fingers
33:11
crossed thing. Okay. All right. Cool. So
33:14
that's what we have. Um so as it turns
33:16
out uh François Chollet who wrote the
33:18
book uh the textbook he actually made
33:20
this data available in a pandas data
33:21
frame. So we read the CSV file into this
33:24
data frame right there. Uh and then it's
33:26
uh and it's 303 rows 14 columns right
33:30
and you can see here we'll take a look
33:32
at the first few rows. Uh and these are
33:34
all the rows. age, gender, cholesterol,
33:36
blah blah blah blah blah. And then this
33:38
is the target variable right there. U
33:41
and the one of the first things I always
33:42
do when I'm working with a binary
33:44
classification problem is to quickly
33:45
check whether the positive and negative
33:47
classes are balanced or not. And so what
33:49
you can do is you can just quickly check
33:51
to see what percent of the data points
33:52
is zero versus one. And you can see here
33:55
uh 72.6%
33:57
of the patients don't have heart
33:59
disease. That's a good thing of course.
34:00
Uh and then 27.4% have heart disease. So
34:03
it's not bad. It's not 50/50 or roughly
34:05
50/50. It's a little imbalanced. So, by the
34:08
way, quick question. What is a good
34:11
baseline model for this problem? Suppose
34:13
you couldn't use anything any
34:14
complicated thing. What's a good
34:15
baseline model?
34:22
>> Yes. Just predict zero.
34:24
>> Yeah. And why would you do that?
34:25
>> Uh, it would give you a 72.6% accuracy.
34:28
Exactly. Because 72.6% is, sort of,
34:31
the class with the
34:33
higher percentage you just predict it
34:35
you'll be right on those 72.6% of the
34:37
cases you'll be wrong on the rest which
34:38
means that your accuracy of this model
34:41
is going to be 72.6%.
34:43
Okay. And so any fancy model we build
34:46
better do you know it's got to do better
34:48
than this otherwise it's not worth its
34:49
weight uh in layers. Um so all right so
34:51
we'll come back to this later. So the
34:53
first thing we want to do is we want to
34:54
pre-process it because this data set has
34:56
both categorical variables and numeric
34:58
variables. Um and so it's usually
35:01
convenient to just to group them into
35:03
two different groups. So I have listed
35:05
all the categorical variables here and
35:06
the numeric here. Uh and then we have
35:09
the pre-processing here. We have to take
35:11
the categorical variables and we have to
35:12
one hot encode them. And the reason is
35:15
that unlike say a decision tree model, a
35:17
neural network cannot handle uh
35:20
categorical inputs directly. It can only
35:22
handle numeric inputs. Which means that
35:24
we have to numericalize every
35:26
categorical thing that comes in. And the
35:28
st there are many ways to do it but the
35:29
standard way to do it is one hot
35:31
encoding. Um and for the numeric
35:33
variables we need to normalize them and
35:35
I'll come to that in a second. So pandas
35:37
has this get dummies function here and
35:40
you can just run this thing and it'll
35:41
just hot encode the whole thing. So once
35:44
you do that this is what you have. So
35:45
you can see here previously um let's say
35:49
tal was had three values fixed normal
35:52
reversible or something and then you go
35:54
to the one hot encoded version u and now
35:56
we can see here tal fixed tal normal tal
36:00
reversible that's three columns right
36:02
that's the one hot encoding in action
36:04
okay now the other thing to remember is
36:07
that neural networks work best when the
36:09
numeric inputs you send them are all in
36:12
a relatively small range they shouldn't
36:13
have a wide range of variation
36:15
Um and so the standard practice is to
36:18
standardize the numerical variables. By
36:20
standardize, I mean typically subtract
36:22
the mean, divide by the standard
36:23
deviation. Um we should do that. But
36:26
before we do so, we should split the
36:27
data into a training set and a test set,
36:30
right? And why do we want to split into
36:32
a test set? Because at the very end once
36:33
we've built the model and done all the
36:35
things we want to do with it, we finally
36:36
want to take out the test set and
36:38
evaluate it once so that we get this
36:41
true measure of how it's going to
36:43
perform in the wild after you deploy it.
36:46
Okay. Uh so you want to divide it,
36:48
say 80% training and 20% test set. So
36:51
the question is why should we do the
36:53
splitting now before we do the
36:54
normalization? Why can't we just do the
36:57
normalization and then do the splitting?
37:02
Um all right
37:06
>> because then your uh validation set is
37:09
also somewhat dependent on your test set
37:11
results as well as the mean of the test
37:13
set.
37:13
>> Correct. Because the modeling process has now
37:16
essentially, sort of, been influenced
37:18
by the test set. Right? The splitting
37:21
is part of the modeling process,
37:23
and the standardization is part of
37:25
the modeling process too. And so,
37:27
well,
37:28
if the standardization, which is part
37:30
of the process uses information about
37:32
the test set well the test set not
37:34
really kept away from anything is it
37:37
that's why we want to split it lock away
37:39
the test set somewhere and then proceed
37:41
with the modeling this again this is
37:43
like machine learning 101 which is why
37:44
I'm going through it pretty fast. Okay.
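Here is a toy sketch of the split-then-standardize pattern he is about to walk through (column names and the sampling call are illustrative, not the Colab's exact code):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({"age": rng.normal(54, 9, 303), "chol": rng.normal(246, 52, 303)})
numeric_cols = ["age", "chol"]

test_df = df.sample(frac=0.2, random_state=42)   # lock away ~20% as the test set
train_df = df.drop(test_df.index)

# Mean/std come from the training set only, then get applied to both splits,
# so no test-set information leaks into the modeling process.
mean, std = train_df[numeric_cols].mean(), train_df[numeric_cols].std()
train_df.loc[:, numeric_cols] = (train_df[numeric_cols] - mean) / std
test_df.loc[:, numeric_cols] = (test_df[numeric_cols] - mean) / std
```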
37:47
so we we do this uh sampling function
37:50
take 20% of the data and make it the
37:53
test set and the remaining is going to
37:55
be the training set. And when we do
37:56
that, you can see the training set is
37:58
now 242
38:00
um rows while the test is 61 rows. Uh
38:05
and any of these data frames, you'll
38:07
know that the the shape attribute gives
38:08
you the dimensions of the number of rows
38:10
in the columns. That's what we're doing
38:12
here. And now that we have done that, we
38:14
have done the split, we can calculate
38:15
the the the mean and the standard
38:16
deviation. So I calculate the mean here.
38:18
I calculate standard deviation. And
38:20
these are all the means. And once I do
38:21
that, I just do you know each column
38:24
minus the mean divide the standard
38:26
deviation. And then once I do that I get
38:28
I save them in the train and the test
38:30
data frames. And you can see here now
38:32
all the numbers are all very sort of
38:33
smallish, 0, 1, minus 1, kind of around
38:36
that range and that's kind of ideal when
38:38
you're training the network. Okay. All
38:40
right. Right. So at this point the data
38:42
is entirely numeric and then uh we are
38:44
ready, almost ready, to feed it into Keras
38:46
and the way you do it is you take a
38:48
numpy array: you take a pandas data
38:51
frame and then you convert it into a
38:52
numpy array and then keras is happy to
38:54
take it happy to receive it. So the so
38:56
we use this thing called to_numpy which
39:00
I think is as descriptive as it gets in
39:01
programming. Um and then you save it as
39:04
train and test. Now train and test are
39:05
two numpy arrays with exactly the same
39:08
information and now we can feed it into
39:09
Keras. All right. Now I guess there's one
39:12
other thing we need to do which is that
39:13
um in this data frame train and test our
39:17
independent variables all the features
39:18
as well as the target, the 0/1 target.
39:20
They're all in this
39:23
right and we need to now take it and
39:25
just take the the dependent variable the
39:27
0/1 column and split it out and keep the
39:29
x and the y separately. Right? That's
39:32
the whole point of it, right? Because
39:33
you need to feed the X, do the
39:34
prediction, and then compare it to the
39:36
actual Y and calculate the loss and so
39:38
on and so forth. So, uh, so the target
39:41
column is our Y variable, and it's
39:43
column number six from the left. If you
39:45
count it, you can see it. So, we just,
39:47
you know, uh, we we delete it from the
39:49
the train and test. Um, and now we have
39:53
242 rows and 29 columns, 29 features.
39:56
You will recall from the network that we
39:58
made way back, it had 29 inputs, right?
40:01
29 nodes in the input layer. And that's
40:03
where the 29 is coming from. And so now
40:06
uh we just select the sixth column which
40:07
is the target and make it the Y variable
40:09
right train Y and test Y. And that is of
40:12
course a vector which is 242 long in the
40:14
training set and 61 long in the thing.
40:16
So at this point all we have done is to
40:19
be honest boring pre-processing. Okay,
40:21
we haven't actually gotten to the action
40:22
yet. Finally, let's do something. So um
40:26
and we start with a single hidden layer.
40:29
Since it's a binary classification
40:30
problem, we'll use a sigmoid output as we saw
40:31
earlier. And this is the model we
40:34
created in class last time. This
40:36
is the model we created. Okay. The only
40:39
difference between that model and this
40:41
model is that I've actually given names
40:43
to these layers. And this name thing is
40:45
totally optional. Right? If you want to
40:47
give a name, give a name. It's just a
40:48
little easier to interpret later on.
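A hedged reconstruction of that model in the Keras functional API, with the optional layer names (the Colab's exact code may differ slightly):

```python
from tensorflow import keras

inputs = keras.Input(shape=(29,), name="features")
hidden = keras.layers.Dense(16, activation="relu", name="hidden")(inputs)
outputs = keras.layers.Dense(1, activation="sigmoid", name="output")(hidden)
model = keras.Model(inputs=inputs, outputs=outputs, name="heart_disease_model")

model.summary()   # should show 29*16 + 16 + 16*1 + 1 = 497 parameters
```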
40:50
Okay? It's just cosmetic. Okay? So, uh,
40:53
but I've just put it here. U and once
40:55
you build the model, you should
40:57
immediately run the model.summary()
40:59
command because it gives you a nice
41:01
overview of the model right what are for
41:04
each layer it tells you what the layer
41:05
is it tells you what's coming into the
41:07
layer meaning the shape of the tensor
41:09
that's coming in and what's going out
41:11
and how many parameters the layer has
41:13
and it turns out this layer has sorry
41:16
this network has 497 parameters okay uh
41:20
and I have told you repeatedly the first
41:22
few times, just hand calculate the
41:24
number of parameters to make sure it
41:25
verifies. So we should just make sure
41:27
that it is in fact 497. So let's hand
41:30
calculate it. And you do basically it's
41:32
basically what's going on here. 29
41:34
inputs time 16, right? All the arrows 29
41:37
* 16 arrows, right? And then you have a
41:40
bias of another 16. That's why you have
41:42
this expression. And then the next one
41:43
is 16 * 1 plus one bias for the output
41:46
sigmoid and you get to 497. Okay? Just
41:49
make sure you follow this later on when
41:50
you work with the collab. We we did this
41:53
in class last week and you can visualize
41:55
the network graphically as well by using
41:56
the plot model function. So we do that
41:59
here. Um and let's say it gives you the
42:02
same information but in a slightly
42:03
easier form to consume and when we work
42:06
with larger networks starting on
42:07
Wednesday you will see that being able
42:09
to visualize the topology of the network
42:11
is actually quite handy. Okay, we
42:13
finally come to uh actually trying to
42:16
train this thing and so what loss
42:18
function should we use? uh we need to we
42:20
need to use binary cross entropy right
42:23
there. What optimizer to use? Well, as I
42:26
mentioned earlier, uh we'll use Adam.
42:29
Adam.
42:32
All right, Adam. Uh and then uh and then
42:35
the the final thing is you can ask Keras
42:37
to report out whatever metrics you care
42:39
about. These metrics are not going to be
42:41
used in any optimization. They just it's
42:42
just reporting it to you. And the most
42:45
common thing people report out for
42:46
binary classification is accuracy. So
42:49
we'll just go with that metric. Um and
42:51
so so what we do is we tell Keras take
42:54
the model we just built and compile it
42:56
with this choice of optimizer this
42:58
choice of loss function and these
43:00
metrics. And this compilation step what
43:02
it does is it essentially Keras will
43:04
take this information and take the model
43:06
you have built and it'll reorganize the
43:08
model in such a way that the parallel
43:11
computing uh distribution of computing
43:13
across many servers and so on. That's
43:16
that's what's happening in the compile
43:17
step. Organizing it so that reorganizing
43:20
the model so that it becomes amenable
43:21
to parallelization and distribution.
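Concretely, the compile call being described looks something like this (continuing the sketch above):

```python
model.compile(
    optimizer="adam",              # the Adam flavor of SGD
    loss="binary_crossentropy",    # matches the sigmoid, 0/1 output
    metrics=["accuracy"],          # reported only, never used for training
)
```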
43:23
That's what's going on. That's why you
43:25
actually have to do something called the
43:26
compile step. Okay. And once we do that,
43:28
we are finally, finally ready to train
43:30
the model. And to do that uh we have to
43:34
decide what the batch size is that we're
43:36
going to use. Remember, we're using some
43:37
flavor of SGD, which means we have to
43:38
choose what is the batch size. And
43:40
typically what people do is that uh 32
43:43
is a good default for the batch size.
43:45
Like if you don't know, if you're just
43:46
getting started with something, just use
43:47
32. Uh and there's a whole bunch of
43:49
literature on what the right batch size
43:51
should be for the number of data points
43:53
you have, the size of the network and so
43:55
on and so forth. My philosophy is start
43:56
with 32. Um and you can always try 32,
43:59
64, 128. It's kind of like, you know,
44:02
oftentimes what people tell me,
44:04
researchers tell me is that just use the
44:05
biggest batch size that doesn't make
44:07
your machine die.
44:09
Right? If you can fit into memory, it's
44:11
probably good. Just try the biggest
44:12
size. We'll just start with 32. It's
44:13
just a tiny problem. It's not a big
44:15
deal. And then we also have to decide
44:16
how many epochs through the data do we
44:19
want to go through, right? How many
44:21
epochs? And uh you know, usually 20 to
44:24
30 epochs is a good starting point. Um
44:26
and then because this is a tiny problem
44:28
just for kicks, I decided to run it for
44:29
300 epochs. Uh just to see if anything
44:31
any overfitting is going to happen. Uh
44:33
and then whether we want to use a
44:34
validation set. Of course, we want to
44:36
use a validation set. Uh right. So we
44:38
will use 20% of the data points as a
44:40
validation set so that we can look for
44:42
overfitting underfitting.
44:44
All right. So with these decisions made
44:46
we finally uh we use the model.fit
44:49
command. Model.fit is what actually
44:51
trains the neural network. Okay. And you
44:55
have to tell it what the x
44:58
tensor is. You have to tell it what the
45:00
dependent variable y tensor is. We need
45:03
to tell it how many epochs to do this.
45:05
What batch size to use. Verbose equals
45:07
1 just means like just you know put a
45:09
lot of descriptive output as you do this
45:11
thing and then validation split means
45:13
you know take 20% of the training data
45:16
and set it aside as your validation data
45:18
set. Don't use it for training because I
45:20
want to measure overfitting using that.
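And the fit call, continuing the same sketch (the training arrays and variable names are assumptions):

```python
history = model.fit(
    train_x, train_y,
    epochs=300,            # deliberately long, to look for overfitting
    batch_size=32,
    verbose=1,             # print progress for every epoch
    validation_split=0.2,  # hold out 20% of the training data for validation
)
```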
45:22
So that's it. So you do that thing it
45:24
it'll run for 300 epochs and this is the
45:26
reason why you know I decided to just
45:28
not actually run it in class. Um and so
45:31
you keep on doing it gives you a lot of
45:33
output and finally
45:36
we reach the end.
45:41
Okay. Now let's take a moment to
45:43
understand what's being reported. So
45:44
I'll just take this one line here. So
45:46
this there is a there is these two there
45:49
is a pair of lines for each epoch. And
45:51
then here it's telling you uh you know
45:53
it it actually uses in the in this 300th
45:56
epoch it used seven batches seven out of
46:01
seven batches right so it used seven
46:02
batches and if you you will recall from
46:05
the math we did in the class that it's
46:06
actually seven batches where the first
46:08
six batches are 32 and the last batch is
46:10
just a couple of examples but we have
46:12
seven batches right this is the 193 by
46:15
32 rounded up okay so that's why we have
46:19
seven here and then it tells you how how
46:20
long it took it for that and then it
46:22
this is the loss value. This is the
46:24
binary cross entropy loss value on the
46:26
training set right on on that particular
46:29
batch right uh that it calculated this
46:32
is the accuracy that you asked it to
46:33
report out, 98.4%, 98.5% accuracy on
46:36
that batch and and then at the end of
46:39
this epoch using whatever weights were
46:42
available in that network it actually
46:44
calculate the loss on the validation set
46:46
which is the 20% of the data we have set
46:48
aside and then it this is the accuracy
46:50
on that validation set okay so that's
46:53
what each of these numbers mean now
46:55
looking at these wall of numbers is kind
46:57
of painful so usually you just plot it
47:00
um so and the way you do that is if you
47:02
if you notice here Uh okay, I'm not
47:04
going to go back here. So I said history
47:06
equals model.fit blah blah blah blah
47:08
blah. And that history object has a lot
47:10
of information that we can use for
47:12
plotting and diagnostics and so on. And
47:14
that history thing uh history object has
47:18
another object called history
47:19
history.history, which is a dictionary
47:21
with all these values and that's what
47:23
we're going to plot. Was there a
47:24
question here? Yeah.
47:25
>> Uh so you prompted it to keep the size
47:28
for validation but didn't we already
47:30
keep a test set? So that's going to be a
47:33
secondary validation, right?
47:34
>> So basically we have a training uh and
47:37
then a validation and a test. The role
47:40
of the validation set is to figure out
47:42
things like early stopping. Should we
47:43
stop here? Should we go back? And as you
47:45
will see later on, if we use
47:46
hyperparameters, you know, we we'll try
47:48
different values of the hyperparameters
47:50
and figure out use the validation set to
47:52
figure out which one is the best one.
47:53
But once we are done with all that, we
47:55
will finally have a model. At that
47:57
point, we open the safe, take out the
47:59
test set and use it just once with your
48:02
final final model. Not because you want
48:04
to improve the model, but because you
48:05
want to have a realistic idea how it'll
48:07
do when you actually deploy it out in
48:08
the real world.
48:11
>> Uh yeah.
48:13
>> Uh can we use can we instead of accuracy
48:17
could we use other metrics uh to
48:20
evaluate whether to
48:21
>> absolutely like a confusion matrix let's
48:23
say?
48:24
>> Yeah, you can you can do whatever you
48:25
want. You can use like I said it's not
48:27
used for training so there is no
48:29
mathematical implication what you choose
48:31
right you can choose error rates
48:32
accuracy, F1, F-beta, you can do whatever
48:35
you want and keras as you will see has
48:37
this dizzying list of possible metrics
48:39
you can use for reporting the key thing
48:41
to remember is you're just reporting
48:43
these metrics you're not actually using
48:44
them for any training
48:47
yeah
48:49
>> uh my question is with respect to
48:50
validation like uh we've got a training
48:52
data set so when we take out 20% This is
48:55
the validation uh data for validation.
48:58
Are we taking it out from the training set
49:00
at that level, or do we
49:02
go to each batch and take out 20% from
49:04
the batch?
49:04
>> No, we're taking it out from the
49:05
training set.
49:06
>> So it means the number of
49:08
data points available
49:09
for forming the batches will
49:11
be reduced.
49:12
>> Correct. And in fact once we
49:13
take out the validation set,
49:15
whatever remains is 193.
49:17
>> Okay. And then we divide that into
49:18
batches, and then in every epoch, does that
49:21
validation data get chosen differently?
49:23
>> Once you take out the
49:25
validation set at the very beginning you
49:27
keep it aside and then you only evaluate
49:30
at the end of each epoch what your loss
49:33
and accuracy is on that validation set.
49:36
>> So you don't have cross validation.
49:37
>> No no we're not doing any of that stuff.
49:39
We're just taking it out once and we're
49:40
just evaluating the end of every epoch.
49:43
>> Okay. So
49:46
yeah. Okay. So I know we both asked
49:50
similar questions but
49:53
>> so I know both have asked similar
49:54
questions but just to reconfirm. So here
49:56
my training model is giving me say a
49:59
loss of 0.0860.
50:01
My validation is giving me 0.660.
50:04
That means I've already crossed the U.
50:07
So when I have to actually test the
50:11
model, that is the midpoint which I take,
50:13
and that will be the model which will get
50:14
deployed in production.
50:16
Correct. And as to okay, what do we do
50:19
to get that model? Do we actually have
50:20
to go go back to the beginning and run
50:22
it for a few epochs or can we do
50:24
something smarter than that? We'll get
50:25
to that.
50:26
>> Yeah.
50:27
>> Is the validation set different for each
50:30
epoch or is it the same?
50:31
>> It's the same. So what you do is you
50:33
have a training set before you do any
50:35
training. You take out 20% of it, keep
50:37
it aside. You take whatever is left over
50:39
that you divide that into mini batches
50:41
and then start running it through each
50:43
epoch. But at the end of each epoch, you
50:45
just evaluate the quality of that
50:47
resulting model using the validation
50:49
set.
50:49
>> What's different between each epoch? Is
50:51
it just the way
50:52
>> weights have changed?
50:53
>> It's the division into the
50:55
different
50:56
>> Uh no, so the difference in each epoch
51:00
is the weights have changed.
51:02
>> So after every mini batch, the weights
51:03
have changed. At the end of one epoch,
51:05
you've gone through all the data points
51:07
you ever had, right, in the training
51:09
set. And then you come back to the
51:10
beginning and you do it again.
51:17
How do you identify the sweet spot?
51:20
>> It's coming.
51:22
>> Yeah. All right. So, I'm going to keep
51:24
going. So, we have this here. And so,
51:27
you just I mean there's a little bit of
51:28
matplotlib code. So, what we do is we
51:31
just plot the training loss and the
51:33
validation loss as a function of the
51:35
number of epochs. Okay? And as you can
51:37
see here, the training loss is these
51:39
things here. And it's steadily going
51:41
down as you would expect. The validation
51:45
loss goes down here. And then at some
51:47
point it kind of flattens out and then
51:49
maybe gently starts to rise. Okay. So do
51:53
you think there's overfitting?
51:55
>> Right. There seems to be some level of
51:57
overfitting here. But the thing you have
51:59
to always remember is that the binary
52:01
cross entropy loss is a loss function
52:04
that is convenient for you because it
52:06
sort of captures the thing you want to
52:08
capture the discrepancy but also because
52:10
it's mathematically convenient but what
52:13
you may actually care about in practice
52:15
is something like accuracy right so I
52:18
always that's why you're reporting out
52:19
the accuracy when we do these things so
52:21
you should also plot the accuracy to see
52:23
what's going on and really you should
52:25
look at the accuracy and figure out
52:26
overfitting and underfitting and all that
52:28
stuff. So let's just do that. So I have
52:30
here uh overfitting.
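A rough sketch of the kind of matplotlib plotting being described, assuming the standard Keras history keys ("loss", "val_loss", "accuracy", "val_accuracy"); the history object comes from the fit call above:

import matplotlib.pyplot as plt

h = history.history              # dict of per-epoch metric lists recorded by fit
epochs = range(1, len(h["loss"]) + 1)

plt.plot(epochs, h["loss"], label="training loss")
plt.plot(epochs, h["val_loss"], label="validation loss")
plt.xlabel("epoch"); plt.ylabel("binary cross-entropy loss"); plt.legend(); plt.show()

plt.plot(epochs, h["accuracy"], label="training accuracy")
plt.plot(epochs, h["val_accuracy"], label="validation accuracy")
plt.xlabel("epoch"); plt.ylabel("accuracy"); plt.legend(); plt.show()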
52:34
Uh okay. So this is how it looks like
52:35
for accuracy. Accuracy of course as the
52:37
model gets you know as you do more and
52:38
more epochs hopefully it gets better and
52:40
better for training. So you can see here
52:42
accuracy actually climbs all the way up
52:44
to the mid-90s, uh, right there — sorry, the
52:47
low 90s here. the validation gets to
52:50
this point after like I don't know 50
52:52
epochs maybe and then it kind of
52:54
flattens out and then strangely it
52:56
climbs up again a bit later right so now
53:00
the fact that the accuracy actually got
53:03
better at the very end suggests that
53:06
maybe we can live with this overfitting
53:09
>> okay
53:10
>> right it's not the end of the world
53:12
right so you can so you can certainly
53:14
what you can do is you can go back and
53:16
say you know what no I'm going to be a
53:17
purist about this around 50 epochs or
53:20
so. I think that's when it actually
53:22
flattened out for loss. So you can just
53:24
go back and just restart the model and
53:26
run it only for 50 epochs, not 300 and
53:29
then stop and just use that model for
53:30
everything from that point on. Or you
53:31
can say, you know what, it's okay. I can
53:33
live with this thing. Uh and so that's
53:35
what we're going to do here. Let me just
53:36
stop for a second. There was a question.
53:39
>> Yeah,
53:40
>> for originally when we were starting
53:42
out, we were saying 20 to 30 epochs, but
53:44
we were going to do 300. 50 is over 20
53:46
to 30. So when it comes to validation of
53:49
if you run enough epochs, are you doing
53:51
like derivative calculations?
53:52
>> Oh, I see. No, that's a great question.
53:54
So the question is I said start with 20
53:56
and 30 epochs as a rule of thumb here,
53:58
I'm just going with 300. And because I'm
54:00
going with 300, I can actually see some
54:01
potential evidence of overfitting. But
54:03
if I had done only 20 to 30, maybe I
54:05
wouldn't have even seen that. What
54:06
happens next? Right? Is that the
54:07
question? Great question. So what you
54:09
should do is when you look at these
54:10
curves if at the end of 30 epochs you
54:13
find that the validation loss continues
54:15
to drop then you know maybe there is
54:18
more room for it to drop. So you you
54:20
continue from that point on. The thing
54:21
about keras is that you can actually run
54:24
the fit command at that point
54:27
and it'll continue where it left off. It
54:29
won't go to the beginning again.
54:31
Right? So you can run 10. Okay. The
54:33
validation is still getting better and
54:34
better. Okay. Run for another 10. It's
54:36
getting better and better. Run for
54:38
another 10. Getting better and better.
54:39
Run for another 10. Oh, it starts to
54:40
climb up again. Okay, now I'm going to
54:41
back off. That's what you do.
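In code terms, "run for another 10 and see" is just another fit call on the same model object, something like this sketch (same placeholder names as before):

# Calling fit again does NOT reinitialize the weights; training resumes
# from wherever the previous call left off.
model.fit(X_train, y_train,
          epochs=10, batch_size=32,
          validation_split=0.2, verbose=1)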
54:44
All right. Now, all this manual stuff
54:47
I'm going through it just because to
54:48
build intuition, there are these things
54:50
called callbacks in Keras, which we'll get
54:52
to later on in which you can actually
54:54
tell it, hey, when the validation loss,
54:57
you know, uh, stops improving, stop
54:59
everything or when it stops improving,
55:02
save that model for me somewhere. So,
55:04
you don't have to go back and rerun
55:05
everything. It'll just it'll have saved
55:07
it for you and you can just pick it up
55:08
and use it. Uh yeah.
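A hedged sketch of what those callbacks might look like in Keras; the patience value and the file name best_model.keras are illustrative choices, not ones given in the lecture:

from tensorflow import keras

callbacks = [
    # Stop training once val_loss has not improved for 10 consecutive epochs,
    # and roll back to the best weights seen so far.
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                  restore_best_weights=True),
    # Separately, keep a saved copy of the best model on disk.
    keras.callbacks.ModelCheckpoint("best_model.keras", monitor="val_loss",
                                    save_best_only=True),
]

history = model.fit(X_train, y_train, epochs=300, batch_size=32,
                    validation_split=0.2, callbacks=callbacks, verbose=1)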
55:12
>> What's the intuition behind um the
55:15
accuracy continuing to improve when the
55:17
loss is getting higher?
55:19
>> Because accuracy and loss are related
55:21
but they're not the same thing. Uh in
55:23
particular, so it's a really good
55:25
question also kind of a profound
55:27
question because accuracy is a very
55:29
discrete measure, right? So if for a
55:30
particular point we're predicting its
55:32
probability to be, say, 0.49, we're going to
55:34
say okay that's a zero no heart disease
55:37
but if it goes to 0.51 we're going to be
55:39
oh that's heart disease. So when you go
55:41
from 0.49 to 0.51 the binary cross
55:44
entropy loss will change very very
55:46
slightly but the accuracy will go from 0
55:48
to 1 — a dramatic jump. So it's very jumpy
55:51
and discrete and that's why it tends to
55:53
be a proxy but sort of a crude proxy for
55:56
loss. That's part of the reason and I
55:58
can talk more offline.
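A tiny illustrative calculation (not from the lecture) that makes the point concrete for a single example whose true label is 1:

import math

# Binary cross-entropy for one example with true label y = 1 is -log(p).
for p in (0.49, 0.51):
    print(p, round(-math.log(p), 3))
# 0.49 -> 0.713 and 0.51 -> 0.673: the loss barely moves, but the predicted
# class flips from 0 to 1, so this example's contribution to accuracy jumps
# from 0 to 1.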
56:01
Okay. So yeah,
56:04
>> you mentioned that if you are a purist,
56:06
you could stop at 50. In this case, instead of
56:09
rerunning it and stopping there, I
56:12
was wondering if you could see the
56:13
history of the model, take the weights at
56:15
epoch 50 and input them into your model — would it
56:18
be roughly the same, or would there be
56:21
certain differences?
56:22
>> You could try it. Yeah, you should just
56:24
try it because what happens is that
56:25
ultimately what we care about is how it
56:27
performs on the validation set. Right.
56:29
Here it appears to perform better on the
56:30
validation set, right? If you stop at 50 —
56:33
but only for the loss; for accuracy,
56:34
actually, if you wait till the very end
56:36
it gets better. So my thrust tends to be
56:40
what is the measure that's closest to
56:41
the real world deployment.
56:44
It's accuracy. So I tend to go with
56:45
accuracy.
56:48
Binary cross entropy is a beautiful
56:50
proxy but an imperfect proxy for the
56:53
thing we actually care about in the real
56:54
world which is error rate and accuracy.
56:57
That's why I tend to plot both and if
56:59
accuracy is telling me one thing I kind
57:00
of tend to believe that
57:03
all right so um here that's what we have
57:07
so once we do all this we have a model
57:09
and now we want to evaluate it to see,
57:11
okay, if we actually deployed it, how good
57:13
it is going to be. So you use this thing
57:14
called the model.evaluate function. So
57:17
you take the model.evaluate function, now we
57:19
use the test set — the test X and the
57:21
test y data set which we split at the
57:23
very very beginning and never used from
57:24
that point on uh we run it And when I
57:27
ran it uh last night, it came up with a
57:29
83.6% accuracy for the model. And
57:33
remember our baseline model which just
57:35
predicts everybody is a zero is going to
57:36
have a 72.6% accuracy. And this little
57:39
neural network gives you 83.6, which
57:41
is pretty good, right? So it's actually, uh,
57:45
beating the baseline
57:47
model which is nice. Uh and I guess
57:49
there is something here about you know
57:50
the fact that we did a bunch of
57:52
pre-processing outside Keras and then we
57:53
send stuff into Keras. You can actually
57:55
do all this pre-processing inside Keras
57:57
automatically and there are layers for
57:58
that and I have linked to a bunch of
58:00
stuff here. So that's it as far as this
58:02
model is concerned. I know we went
58:03
through it really fast but please go
58:05
through it afterwards and make sure you
58:07
understand every single line. Change
58:09
each of these lines, rerun it, see how
58:11
the output changes. That's how we build
58:12
some intuition. Okay. All right.
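A minimal sketch of the evaluate step just described, again with placeholder names for the held-out test tensors:

# Used exactly once, at the very end, on the test set that was split off
# at the beginning and never touched during training or tuning.
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"test accuracy: {test_accuracy:.3f}")   # ~0.836 in the run described above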
58:15
computer vision
58:17
>> as I do
58:20
>> Just one question: is there a way
58:22
to build a model just to have fewer false
58:24
positives or fewer false negatives, or
58:27
is that not possible?
58:27
>> oh yeah yeah you can do that um but
58:29
there are so you can report on all those
58:31
things very easily but there are more
58:33
complex loss functions which will take
58:35
the asymmetry between the false
58:38
positive and false negative into account, uh,
58:40
you know. Yeah, so the short answer is it's possible,
58:43
yeah
58:45
All right. So, first let's just talk
58:46
about how do you represent an image
58:48
digitally. Okay. Uh and so these are how
58:52
grayscale images are represented.
58:54
Black and white images. So the basic
58:55
idea is very simple. Every picture
58:57
you have — every location in
58:59
that picture is a pixel, and the
59:01
pixel basically has a light intensity.
59:03
The amount of light at that location and
59:06
that light level is measured from zero
59:09
no light to blinding white light which
59:12
is 255. And so all the numbers here, if
59:16
you take this five for example, you can
59:18
see a lot of no light like all the black
59:20
regions, those are all zeros. Okay? And
59:23
then wherever there is white light,
59:24
there's a number, and the more the amount of
59:27
light, the closer it gets to 255. Okay?
59:29
In fact, if you just step back and
59:30
squint at this, you can actually see the
59:32
five.
59:33
Okay? So that's it. That's how that's
59:35
how black and white image represented.
59:37
Very simple. Okay. Now, yeah.
59:42
microphone
59:43
>> just when you say amount of light what's
59:45
the unit that's being measured like what
59:47
do you mean
59:48
>> so here basically what we have is, uh,
59:51
the computer takes whatever so when you
59:54
send an analog you take an analog
59:56
picture there's a process by
59:58
which you take that analog picture and
59:59
read it in and it gets mapped to a scale
1:00:02
between 0 and 255 that's it that's all
1:00:04
so you can think of it as like a
1:00:05
relative scale a normalized scale
1:00:07
between 0 and 255 and so um it just
1:00:10
roughly maps to amount of light in that
1:00:12
location the exact like lumens to the
1:00:14
number mapping I don't know how they do
1:00:16
it. My guess is there are a number of
1:00:18
variations on that, but for our
1:00:20
purposes just think of it as it's a
1:00:22
normalized scale which runs from 0 to
1:00:24
255
1:00:26
all right so uh if you look at u so
1:00:28
that's what's happening: every pixel is a
1:00:30
number between 0 and 255, boom, boom. Okay, so
1:00:34
if you have a color image each pixel of
1:00:37
a colored image is represented by three
1:00:38
numbers uh And these numbers measure the
1:00:42
intensity of red light, blue light and
1:00:44
green light because red, blue and green
1:00:46
if you mix them in the right proportion
1:00:47
you can get whatever you want. Okay. So
1:00:50
uh and so each light intensity is still a
1:00:52
number between 0 and 255 and that's what
1:00:54
you have. Which means that now you have
1:00:56
three tables of numbers instead of one
1:00:58
table of numbers. And by the way just
1:01:00
some lingo here uh in the deep learning
1:01:02
world these, uh, colors — RGB: red, green,
1:01:05
blue — are sometimes referred to as
1:01:06
channels. Okay. All right. So this is
1:01:10
what we have here. This is a picture of
1:01:11
Kian Cord U and then if you take that
1:01:13
little thing here — the red table, the
1:01:16
green table and the blue table. So for
1:01:18
this picture, these three tables form a
1:01:21
tensor of rank what?
1:01:23
Good.
1:01:26
All right. Any questions on this?
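To make the shapes concrete, here is a small NumPy sketch (with made-up values): a grayscale image is a rank-2 tensor, and a color image with its three channels is a rank-3 tensor.

import numpy as np

# Grayscale: height x width, each entry an intensity from 0 to 255.
gray = np.random.randint(0, 256, size=(28, 28), dtype=np.uint8)

# Color: height x width x 3 channels (red, green, blue), same 0-255 range.
color = np.random.randint(0, 256, size=(28, 28, 3), dtype=np.uint8)

print(gray.ndim, gray.shape)    # 2 (28, 28)
print(color.ndim, color.shape)  # 3 (28, 28, 3)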
1:01:33
So the key task in computer vision
1:01:35
obviously the important thing is
1:01:37
image classification right uh the most
1:01:40
basic task if you will uh when you're
1:01:42
working with images is you have an
1:01:43
image and you want to
1:01:45
take the image and figure out okay you
1:01:46
have a list of possible objects the
1:01:48
image could contain and you're figuring
1:01:49
out okay which of these possible objects
1:01:51
exists in that image, right? The dog-cat
1:01:53
classification is like the canonical
1:01:54
example right that we all know and love
1:01:57
uh and that's what we will solve uh
1:01:59
later today and on Wednesday but there
1:02:01
are many other tasks that you need to
1:02:02
be aware of. So when you actually not
1:02:05
just classify an image, but you also
1:02:07
localize where in the image it is,
1:02:10
right? It's not just enough to say
1:02:11
sheep, you want to figure out where is
1:02:13
the sheep, right? And that's called
1:02:14
localization. And the way you do
1:02:16
localization is you put this little box
1:02:18
around it. And then you output not just
1:02:21
whether it's a, you know, sheep, yes or
1:02:23
no, but the coordinates of this box, the
1:02:26
top left, uh, and the bottom right, for
1:02:28
example, if you put the coordinates, you
1:02:29
can actually draw a box around it. So
1:02:31
you you output the numbers the
1:02:33
coordinates of where this box is in the
1:02:36
picture. Okay, this called localization.
1:02:39
Now this is object detection where you
1:02:42
may have lots of objects going on and
1:02:45
you want to pick up every one of them
1:02:47
and you want to localize it.
1:02:49
Okay, this is object detection. So here
1:02:51
we have gone in there and said okay
1:02:53
sheep one, sheep two, sheep three and
1:02:55
each of these sheep has a little box
1:02:57
around it. Okay.
1:02:59
>> By the way, u you know, self-driving
1:03:01
cars, the the camera vision system is
1:03:04
constantly scanning what's coming in
1:03:05
through the cameras and doing object
1:03:06
detection constantly, many times a
1:03:08
second,
1:03:09
>> right?
1:03:09
>> Pedestrian box, you know, zebra crossing
1:03:11
box, doggy box, stroller box, and so on
1:03:13
and so forth.
1:03:16
And then we have this thing called
1:03:17
semantic segmentation where we take
1:03:20
every pixel in the picture and classify
1:03:22
every pixel. We are not classifying the
1:03:24
whole picture, we're classifying every
1:03:26
pixel. So we are saying okay all these
1:03:28
gray pixels road all these pixels are
1:03:32
sheep and all these pixels are grass
1:03:34
every pixel is being classified.
1:03:37
So we are taking an image and, instead of
1:03:39
giving one classification, for every
1:03:42
pixel we are solving a multiclass
1:03:43
classification problem.
1:03:48
Okay, every pixel is classified. And
1:03:49
just when you think it can't get more
1:03:51
complicated than this,
1:03:53
we have something called instance
1:03:54
segmentation where not only are we
1:03:56
classifying every pixel, we are
1:03:58
distinguishing between the different
1:03:59
sheep.
1:04:01
So every pixel is classified and
1:04:04
different instances of the same category
1:04:06
need to be identified.
1:04:10
Okay. So these are all some of the most
1:04:12
sort of, uh, I would say most popular,
1:04:14
most prevalent
1:04:16
and useful categories of image
1:04:18
processing problems that are amenable to
1:04:20
a deep learning system.
1:04:23
All right. So let's go to image
1:04:25
classification and we're going to work
1:04:27
with this application called Fashion
1:04:28
MNIST. Um
1:04:33
so the idea here is that you have
1:04:35
70,000 images of clothing items across
1:04:38
10 categories. you know like boots and
1:04:40
sweaters and t-shirts and you get the
1:04:43
idea, 10 categories of clothing. Um, we
1:04:45
have 70,000 images like this, uh, and then
1:04:48
we'll build a network from scratch to
1:04:50
classify all these things uh you know
1:04:52
with pretty high accuracy. So these
1:04:54
classes by the way you know this is a
1:04:55
very balanced data set. So 10% of the
1:04:58
data is you know sweaters 10% is boots
1:04:59
and so on and so forth. So a naive
1:05:01
baseline model would give you what
1:05:03
accuracy
1:05:07
10%. Exactly. So we need to build
1:05:10
something that's better than 10% and I'm
1:05:12
glad to report that a simple neural
1:05:13
network can actually get you close to
1:05:14
90%.
1:05:18
Right? So this is the simple network
1:05:21
that we have. The input in this case is
1:05:24
a 28x 28 picture.
1:05:28
It's a 28x 28 picture. Uh and
1:05:33
so far we have been feeding vectors into
1:05:36
our neural network. Now we have a
1:05:38
picture which is 28 by 28. It's a tensor
1:05:40
of rank two, right? It's a table of
1:05:43
numbers. What do we do? How do we feed
1:05:45
that in?
1:05:51
It's a temp. No, each image is a table
1:05:53
of numbers. Let's just take a single
1:05:54
image.
1:05:57
Like what do we do? How do we what do we
1:05:59
do with this table?
1:06:01
Convert it into a vector. Exactly. And
1:06:04
that's called flattening. So we take
1:06:06
this table of numbers and we flatten it
1:06:08
into a vector. And so so what we do is
1:06:11
uh let me just
1:06:13
Okay. So we have um
1:06:17
28 by 28.
1:06:20
So what we can do is we can take each
1:06:22
row right take this row and then write
1:06:25
it like that.
1:06:27
We take the second row oops
1:06:33
write it like that.
1:06:38
third row is here
1:06:41
like that. You get the idea. So you take
1:06:43
each row just rotate it and stack it all
1:06:45
up, right? And string them up. It
1:06:47
becomes one long vector. So this is called
1:06:49
flattening. Okay? So that's how you take
1:06:51
this thing and make it into one long
1:06:52
vector.
1:06:56
So when you do that 28 by 28 is what is
1:07:00
it?
1:07:03
784. So we get a vector.
1:07:07
This is the flattened input and you get
1:07:09
784.
1:07:11
Uh it's a vector that's 784 long.
1:07:15
Okay. After the flattening, we have not
1:07:17
done anything complicated yet. We have
1:07:18
literally taken the numbers and just
1:07:19
reorganized them in a different way.
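A small NumPy sketch of exactly that reorganization (the numbers are stand-ins, not real pixel values):

import numpy as np

image = np.arange(28 * 28).reshape(28, 28)   # one 28 x 28 "image"

flat = image.reshape(-1)   # rows strung end to end into one long vector
print(flat.shape)          # (784,)

# In Keras this step is done by a layer, e.g. keras.layers.Flatten().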
1:07:21
Okay. And once we do that, now we are
1:07:24
back in our familiar neural network
1:07:26
territory, right? We know how to work
1:07:27
with vectors. So, we just need to pass
1:07:29
it through a hidden layer, right? And
1:07:33
this hidden layer, we're going to use ReLU
1:07:35
neurons. And I tried a few different
1:07:37
values. And it turns out that 256
1:07:39
neurons does a really good job.
1:07:41
Okay? And so, I'm going to use 256
1:07:43
neurons here. And then we need to now
1:07:46
think about what the output layer should
1:07:48
be. Now we run into a problem
1:07:51
because the output layer before we saw
1:07:54
for the heart disease example, it's just
1:07:55
zero or one. Right? Here there are 10
1:07:58
possible outputs. It could be a you know
1:08:01
boot, a sweater, a shirt and so on so
1:08:02
forth. 10 possible categories. So we
1:08:04
need some way to handle something with
1:08:06
many more than you know one binary
1:08:09
output many possible outputs. So the way
1:08:12
we do that
1:08:15
this is by the way pay attention to this
1:08:16
because this is actually how GPT-4 works.
1:08:20
Okay. So what we do is here's what we
1:08:24
have. We know how to output 10 numbers,
1:08:26
right? If you want to output 10 numbers,
1:08:28
no problem. We just, you know, we have,
1:08:30
we can easily output 10 numbers by just
1:08:31
using a linear activation. We also know
1:08:33
how to output 10 probabilities,
1:08:36
right? Each one just needs to be a
1:08:37
sigmoid. But here we can't use 10
1:08:40
sigmoids as the output. Why is that?
1:08:44
Why can't we use 10 sigmoids?
1:08:47
>> Because the probabilities have to add up to one,
1:08:50
>> right? So here when the output comes we
1:08:52
need to figure out okay is it a boot, a
1:08:54
sweater, a shirt and so on and so forth.
1:08:56
There's only one right answer. Okay,
1:08:59
which means that we need to actually
1:09:00
figure out which of these 10 is the
1:09:01
right answer which means that we need to
1:09:03
produce probabilities but they have to
1:09:05
add up to one because only one of them
1:09:07
can be true.
1:09:09
So that's the key thing. They have to
1:09:10
add up to one. That's the wrinkle. If
1:09:12
not for that we can just use 10
1:09:13
sigmoids, right? And the way we do that
1:09:16
is something using something called the
1:09:17
softmax function or the softmax layer.
1:09:20
And the idea is actually very simple. We
1:09:22
have these 10 outputs in the very final
1:09:25
layer which is just linear activations.
1:09:27
And then we take each one of these
1:09:29
numbers and then run it through the
1:09:32
exponential function and then divide by
1:09:34
the total. So when you do that two
1:09:37
things happen. The first one is when you
1:09:39
take these numbers and run it through
1:09:40
say you take a1 and do e raised to a1
1:09:43
you now get a positive number
1:09:45
and now you have a positive number
1:09:47
divide by the sum of a bunch of positive
1:09:48
numbers and they're all you can see here
1:09:50
you can confirm visually that they will
1:09:52
add up to one because you're literally
1:09:53
taking each number and dividing by
1:09:55
the total so they will add up to one
1:09:56
there's no other option right so this is
1:09:59
called the softmax function which means
1:10:00
that you can take any set of 10 numbers
1:10:02
that's coming out of the network and
1:10:04
convert them into probabilities that add
1:10:05
up to one
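A short sketch of the softmax computation just described, applied to made-up scores:

import numpy as np

def softmax(a):
    # Exponentiate each score (now everything is positive), then divide by
    # the total so the outputs sum to exactly 1.
    e = np.exp(a - np.max(a))   # subtracting the max is a common stability trick
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1, -1.3, 0.0, 0.5, 1.7, -0.2, 0.3, 0.9])
probs = softmax(scores)
print(probs.sum())      # 1.0
print(probs.argmax())   # index of the most probable class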
1:10:07
and so, by the way, the GPT-4 reference:
1:10:09
when you actually put a prompt in GPT-4
1:10:12
and it starts giving you the output.
1:10:14
Every word it's emitting, right? It's
1:10:17
actually a token, but we'll get to that
1:10:19
later. You imagine it's a word. Every
1:10:21
word it's emitting, uh — it's actually
1:10:23
doing a 52,000-way softmax.
1:10:27
Think of it as every word in the
1:10:28
language is a possible output. So it's a
1:10:31
vector which is 52,000 long but it's
1:10:34
actually a softmax and it just picks the
1:10:36
most probable word and emits that. So
1:10:39
this notion of a softmax is actually
1:10:41
very powerful.
1:10:43
Okay but we'll come back to that uh
1:10:45
later. So, so to summarize, if you have
1:10:49
a single number, you can use a simple
1:10:51
output layer, a single probability, a
1:10:53
sigmoid, you have lots of numbers, just
1:10:55
have a stack of these things. And when
1:10:57
you have a lot of numbers that have to
1:10:58
add up to one, that have to be
1:10:59
probabilities, use softmax,
1:11:03
>> right? So uh yeah
1:11:06
>> why do we choose probabilities instead
1:11:08
of just number
1:11:11
one
1:11:12
>> sorry
1:11:12
>> then we know it's only going to be one
1:11:14
>> because you can't force the network to
1:11:15
give you ones or zeros
1:11:20
it's going to produce what it's going to
1:11:21
produce
1:11:22
>> you can't force it to be exactly one or
1:11:24
zero
1:11:26
it'll give you some number; what you can do is
1:11:28
tame that number so that it comes
1:11:30
into a range that you like, like between
1:11:32
zero and one.
1:11:34
So here very quickly um when
1:11:38
we have a binary classification example
1:11:40
like yes or no this is the one hot
1:11:41
encoded version one or zero this is what
1:11:43
we saw in the heart disease example when
1:11:45
you have something like this example
1:11:46
Fashion MNIST where you have all these
1:11:48
different possibilities then you can
1:11:51
encode it in one of two ways you can
1:11:52
encode it just using integers like 0 to
1:11:54
9 right this is called the sparse
1:11:56
encoded version or you can do a one hot
1:11:59
encoded version of the output right you
1:12:02
can have a one hot encoded version of
1:12:03
the output and depending on how your
1:12:06
data comes in to you, into your
1:12:08
Colab, right — just pay attention to this —
1:12:11
and depending on what it is you have to
1:12:13
pick the right keras loss function so
1:12:18
data comes like a one zero thing which
1:12:20
is exactly what we had in the heart disease
1:12:21
example we use binary cross entropy if
1:12:24
your data comes in this form where it's
1:12:26
sparse encoded you use sparse
1:12:28
categorical cross entropy and then if it
1:12:31
comes in this form you use
1:12:32
categorical cross entropy, right? These
1:12:34
are all equivalent things. It just depends
1:12:36
on the data that you get how it happens
1:12:38
to be encoded by the people who sent it
1:12:40
to you. If they send it this way, use
1:12:42
this loss function. If they send it that
1:12:43
way, use that loss function.
1:12:46
Now, as it turns out in our example
1:12:47
here, the data is actually coming in in
1:12:49
this form. So, we'll use this thing
1:12:50
called the sparse categorical cross
1:12:52
entropy. And categorical cross entropy
1:12:54
is a generalization of binary cross
1:12:56
entropy which I'm not going to get into
1:12:58
the mathematical details, but
1:13:01
the intuition is basically roughly the
1:13:01
same.
1:13:04
Okay so this is what we have. Um if this
1:13:07
is your output layer use mean squared
1:13:09
error. If this is your output layer use
1:13:11
binary cross entropy and if you still
1:13:14
have a stack of these numbers you can
1:13:15
still use mean squared error. And if your
1:13:17
output is a soft max, use categorical
1:13:19
cross entropy or sparse categorical
1:13:22
cross entropy.
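Putting those pieces together, a sketch of how the flatten / 256-ReLU / 10-softmax network described above might be defined and compiled; the layer sizes are the ones from the lecture, while the choice of the Adam optimizer is an assumption for illustration:

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),     # 28 x 28 image -> 784-long vector
    keras.layers.Dense(256, activation="relu"),     # hidden layer of 256 ReLU neurons
    keras.layers.Dense(10, activation="softmax"),   # 10 probabilities that sum to 1
])

# Labels arrive as integers 0..9 (sparse-encoded), hence the sparse loss.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])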
1:13:24
Okay. So let's actually run this in
1:13:26
Colab. Um
1:13:32
right. So this is what we have. Can
1:13:33
folks see this? Okay. All right. So this
1:13:37
is the data set we saw earlier. Uh down
1:13:40
here as usual, right? We have we load
1:13:44
TensorFlow and Keras. We load our usual
1:13:47
three packages and then we set the
1:13:49
random seed for reproducibility. And it
1:13:51
turns out that the Fashion MNIST data is
1:13:53
actually available in keras. You don't
1:13:54
have to go find it somewhere and bring
1:13:56
it in. It's actually available in Keras.
1:13:57
It's one of the standard data sets. We
1:13:59
luck out. So we just actually load the
1:14:01
data right using this load data command.
1:14:04
And then you do that and conveniently
1:14:05
for us keras has not only made the data
1:14:08
available it has already split it into a
1:14:10
training and test set. So we don't have
1:14:12
to do the splitting. Okay. And the
1:14:13
reason they do that, why would they do
1:14:15
that?
1:14:18
They do that so that different people
1:14:20
who are building algorithms for that
1:14:21
particular data set can all be evaluated
1:14:23
using the same test set.
1:14:26
Otherwise, if I split it one way and
1:14:28
say, "Hey, look how well I did that like
1:14:29
I don't know how did you split it."
1:14:31
>> That's the reason.
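The load itself is a one-liner; a sketch with the shapes mentioned in the lecture:

from tensorflow import keras

# Fashion MNIST ships with Keras, already split into train and test sets.
(X_train, y_train), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()

print(X_train.shape)  # (60000, 28, 28) -- 60,000 images, each a 28 x 28 table
print(y_train.shape)  # (60000,)        -- one integer label (0..9) per image
print(X_test.shape)   # (10000, 28, 28)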
1:14:32
>> Okay. So here and you can see here that
1:14:36
uh we have
1:14:38
the input data is a tensor of rank
1:14:43
three. And basically another
1:14:47
way to think about a tensor of rank
1:14:48
three is just a list of rank two
1:14:50
tensors. Right? So here you have 60,000
1:14:52
images. 60,000 images and each image is
1:14:57
a 28x 28 square of numbers. Each image
1:15:02
is a 28 x 28 table. Uh and then of
1:15:04
course the output uh is just what
1:15:07
category it is a number between 0 and 9.
1:15:09
So you just have 60,000 numbers. It's
1:15:11
just a vector of 60,000 numbers. Okay.
1:15:13
Uh so there are 60,000 in the training
1:15:15
set. Oops. Uh and then there are 10,000
1:15:19
in the test set. Same structure 28 by
1:15:21
28. Uh that's what we have. So if you
1:15:23
look at the first 10 rows of the
1:15:25
dependent variable Y, you get these
1:15:27
numbers 9 0 33 like that. There are
1:15:29
numbers from 0 to 9. So if you look at
1:15:31
the Fashion MNIST GitHub site, this is
1:15:33
what it refers to. Zero is a t-shirt,
1:15:35
one is a trouser, and so on and so
1:15:37
forth. And nine is an ankle boot.
1:15:41
All right. So, uh, whenever I'm working
1:15:43
with multiclass classification
1:15:45
problems, I always, you know, do a
1:15:47
little thing here to help me figure out
1:15:49
that nine corresponds to an ankle boot
1:15:51
and so on and so forth. It just makes it
1:15:52
a little easier to work with this
1:15:53
stuff. So, I create this little list. Um
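That little list is typically something like the following; these are the ten label names documented on the Fashion MNIST GitHub page, with index 9 indeed being the ankle boot:

class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]
print(class_names[9])   # Ankle boot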
1:15:56
and then uh turns out if you okay what
1:15:59
is the very first data point? What is
1:16:01
it? What is its y value? Turns out to
1:16:02
be an ankle boot. Um so you can actually
1:16:05
look at the raw data for that image
1:16:07
which is just a 28x 28 thing and these
1:16:10
are the numbers you have.
1:16:13
See all these — 250, 233 — lots of zeros and
1:16:16
so on and so forth. So you can actually
1:16:19
visualize the first 25
1:16:20
images. I have a little bit of code here
1:16:22
which visualizes that, just matplotlib
1:16:24
code and you can see these are all the
1:16:25
images, they're kind of smallish. This
1:16:28
my friends is an ankle boot
1:16:32
right it's like okay can the network
1:16:34
really make any sense out of this thing
1:16:35
right it looks very blurry and I don't
1:16:37
know
1:16:39
this is uh
1:16:42
oh this is actually a better ankle boot
1:16:43
look at that okay sorry I'm getting
1:16:45
distracted so so this is what we have
1:16:47
here
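A sketch of the kind of matplotlib code being referred to, reusing the X_train, y_train, and class_names placeholders from above:

import matplotlib.pyplot as plt

# Show the first 25 training images in a 5 x 5 grid with their class names.
plt.figure(figsize=(8, 8))
for i in range(25):
    plt.subplot(5, 5, i + 1)
    plt.imshow(X_train[i], cmap="gray")
    plt.title(class_names[y_train[i]], fontsize=8)
    plt.axis("off")
plt.show()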
1:16:49
uh okay we are at 9:55
1:16:51
I'm going to stop um so you folks are
1:16:53
not late for your next class. So we'll
1:16:54
continue this journey on Wednesday and
1:16:56
then we'll go on to color images the
1:16:58
next class as well. Thank you folks.
1:16:59
Have a good one.
— end of transcript —