1: Introduction to Neural Networks and Deep Learning; Training Deep NNs
MIT OpenCourseWare · May 11, 2026
Transcript
0:16
All right. So, today's lecture: introduction to neural networks and deep learning. We'll start with a very quick intro to these things, and then we'll switch and dive deep into neural networks. The field of AI originated in 1956. Sadly, it didn't originate at MIT; it originated at Dartmouth, because all these people got together at Dartmouth. I guess it's got a nice quad or whatever. They got together, and they defined the field. But, fortunately for us, MIT was very well represented. We have Marvin Minsky, who founded the MIT AI Lab; John McCarthy, who invented Lisp and then later defected to the West Coast; and Claude Shannon, who invented information theory and was a professor at MIT. So, MIT was well represented. These folks founded the field, and they were so bright that they thought AI was going to be "substantially solved," quote unquote, by that fall. Now, obviously, it turned out a bit differently than what they expected.
1:10
So, it's been 67 or 68 years since the field's founding, and in that time it has gone through, in my opinion, three seminal breakthroughs: starting from the traditional approach, we got machine learning, then deep learning, then generative AI. Let's take a very quick look at each of these breakthroughs and what motivated them.
1:26
So, let's start with the traditional approach to AI. And what is AI? Informally, AI is the ability to imbue computers with the ability to do things that typically only humans can do: cognitive tasks, thinking tasks, and things like that. The most commonsensical way to do that is to say: well, if I want the computer to do something complicated like play chess, I'm just going to sit down with a few chess grandmasters, show them a whole bunch of board positions, and ask them how they figure out the next move. I'll sit down and talk to all these people, and then I'll write down a whole bunch of rules: if this is the board position, move this; if that is the board position, move that; and so on and so forth. Or I might sit down with a cardiologist and ask, "Okay, how do you actually interpret an ECG?" They will similarly give me a bunch of if-then rules. I take all these rules, I put them into the computer, and boom, I have a system that can do what a human can do. Now, this approach, even though it's commonsensical, had success in only a few areas.
2:22
So, the interesting question is: why was it not pervasively successful? It seems like a pretty good idea, right? And the people who came up with these things are smart people, not dumb people. They know what they're doing. So, why did it not work?

>> Because it's time-intensive, in that you have to run through all the scenarios that could ever exist, and still some new scenarios can come up that you didn't cater for initially.
2:51
>> Right. So, there are two aspects to what you said. The first aspect is that it's time-intensive. That, as it turns out, is not a big deal, because computers are getting faster and faster. The second thing is actually the key thing, which is that it doesn't generalize to new situations very well. The problem is that there are an infinite number of things you're going to see when you deploy these systems in the real world. By definition, what you're training on is a small sample of rules, so these rules are very brittle. But there's an even more interesting reason, and that reason is that we know more than we can tell. This is called Polanyi's paradox. The idea is that if I come to you and say, "Hey, here's a picture. Is it a dog or a cat?", you will tell me within, I believe they've measured it, something like 20 milliseconds whether it's a dog or a cat. And then, if I ask you to explain exactly how you figured that out, you'll come up with a bunch of alleged reasons: oh, you know, if it has whiskers, I think it's a cat, or whatever. But the problem is that, first of all, you can't really articulate what's going on in your head, how you do these things. And number two, even if you articulate it, oftentimes your articulation has no correspondence with how your brain actually does it. So, you're incomplete and a liar.
4:01
So, this is Polanyi's paradox. If you can't even tell me how you do something, how am I supposed to take it and put it into a computer? It doesn't work. And second, there is the fact that we can't write down these rules for all possible situations: edge cases, corner cases, etc. And the world is full of edge cases. So, for these reasons, this approach didn't work.
4:21
And so a different approach was developed, and this approach basically said: hey, instead of explicitly telling the computer what to do, why don't we simply give it lots of examples of inputs and outputs? Chess positions and next moves; ECGs and diagnoses. Inputs and outputs. And then, why don't we just use some statistical techniques to learn a mapping, a function, that can go from the input to the output? That was the idea, and this idea is machine learning. So, machine learning is basically just a fancy way of saying, "Learn from input-output examples using statistical techniques."
4:55
Good. All right. Now, there are numerous ways to create machine learning models, and if you've ever done linear regression, congratulations, you've been doing machine learning. Only one of those methods happens to be something called neural networks. There are many other methods, and in fact you've probably used these other methods if you've taken a course like The Analytics Edge or something similar.
5:19
Okay. So, machine learning has had tremendous impact around the world. At this point, it's widely accepted; it's a very, very successful technology. And in fact, whenever people are talking about AI, chances are they're actually talking about machine learning. It's just that AI sounds cooler.
5:38
The only problem is that for machine learning to work really well, the input data has to be structured. What I mean by that is data that can essentially be numericalized and stuck into the columns and rows of a spreadsheet. So, for example, let's say I want to put together a data set of patients, their symptoms and their characteristics, and then whether they had a cardiac event in the year after they showed up at the doctor's office. I might create a data set with age, smoking status, exercise, and so on. Either these values are numerical, or, if they're not numerical, they're categorical: smoking, yes or no, things like that. Which means that if you have categorical variables, you can numericalize them pretty easily. You folks have done some machine learning before, so you know that things like one-hot encoding can be used to make them all numerical. So, the point is, you can render the data into the columns and rows of a spreadsheet pretty easily. That's what I mean by structured data. But the situation is very different if you have unstructured data.
6:41
So, say you have an image of a cute puppy. This is my puppy, by the way, from many years ago. Sadly, he's no more. His name was Google. My DMD alums know Google well. So, this is Google. If you want to take this picture of Google and figure out how to numericalize it, the first thing you need to understand is that if you look at how this picture is represented digitally, inside the computer, every picture like this is represented using three tables of numbers. We'll get to what these numbers mean later on, but the point I'm making is that each number represents an amount of light, on a scale of 0 to 255, in that location, in that pixel. That's all: an amount of light. One table is the amount of red light, one is the amount of green light, and one is the amount of blue light. Now, you will agree with me that if you look at an entry like 251, you can say there is a lot of blue light at that location, because it's 251 out of a possible 255; maybe there's a lot of blue somewhere here. But whether that area is blue because of a piece of sky, some water, or a bunch of blue paint, it could be anything, it's going to say 251. So, the underlying reality, the underlying object that's being described, has nothing to do with the 251. That's the whole problem: the raw form of the data has no intrinsic connection to the underlying thing. So, given that there's no connection between the number and what it's describing, how can any algorithm do anything with it? It can't.
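To make the three-tables idea concrete, here is a toy 2x2 "image" in plain Python, with invented values; each pixel is just an amount of red, green, and blue light on a 0-to-255 scale:

```python
# A 2x2 picture stored as three tables of light amounts (0-255), one per channel.
red   = [[ 12,  40], [200, 180]]
green = [[ 30,  90], [210, 190]]
blue  = [[251, 245], [ 60,  20]]

# The 251 only says "a lot of blue light at this pixel"; it carries no hint of
# whether that blue is sky, water, or paint.
top_left = (red[0][0], green[0][0], blue[0][0])
print(top_left)  # (12, 30, 251): a strongly blue pixel
```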
8:27
So, what you have to do is something called feature engineering, or feature extraction, where you manually take all these things and essentially create a spreadsheet from them. So, let's say you have a bunch of birds, and you're trying to build a bird classifier to figure out which species each one is. You might actually have to take each picture and measure the beak length, the wingspan, the primary color, and so on and so forth. So, you're basically structuring the unstructured data manually.
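That manual structuring step can be sketched as follows; the extractor functions here are hypothetical stubs standing in for the hard part, which is actually measuring these quantities from pixels:

```python
# Manual feature engineering: turn one unstructured bird image into one
# structured row of a spreadsheet. The extractors below are hypothetical stubs.
def beak_length(image):
    return 4.2  # cm; a real extractor would measure this from the pixels

def wingspan(image):
    return 47.0  # cm; also a stub

def primary_color(image):
    return "blue"  # stub

def to_row(image):
    return {
        "beak_length_cm": beak_length(image),
        "wingspan_cm": wingspan(image),
        "primary_color": primary_color(image),
    }

row = to_row(object())  # placeholder standing in for an actual image
print(row)
```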
8:59
And for this process of structuring unstructured data, we use the word representation: we take the raw data and we represent it in a different form. The reason I'm focusing on the word representation is that it becomes really, really important a bit later on, when we get to deep learning. We have to represent the data in a different way for it to work. That's the basic idea.
9:26
So, what that means is that, historically, researchers would manually develop these representations. And once you have representations, you can just use traditional linear regression or logistic regression to get the job done. So, the whole name of the game is the representations. In fact, people doing PhDs in, say, computer vision would spend four years developing amazing representations for solving one particular little problem. Say we have a bunch of CAT scans, and we need to figure out whether there is evidence in a scan for a particular kind of stroke. They might sit and develop all kinds of representations, test them, and so on, and then finally declare victory: "Yay, I'm done with my PhD. Here is this amazing representation, and you can build a classifier with it to predict a particular kind of stroke with high accuracy." So, that's where the world was.
10:18
Now, as you can imagine, developing representations, because it's so manual, is a massive human bottleneck, and this sharply limited the reach and applicability of machine learning, as you would expect.
10:31
To address this problem, a different approach came about, and that's deep learning. Deep learning sits inside machine learning, and it can handle unstructured input data without upfront manual processing. Meaning, it will automatically learn the right representations from the raw input. Automatically is the keyword. It automatically learns representations, which means you could give it structured data, pictures, text, anything you want, and it just learns. And since the representations are being extracted automatically, you can imagine a pipeline where the raw data comes in, a bunch of stuff in the middle learns these representations without your help, and then, boom, you just attach a little linear regression or logistic regression at the end, and the problem is solved. That, in a nutshell, is deep learning: input, a whole bunch of representations being learned, and then piped into a linear or logistic regression model. Okay?
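A minimal sketch of that pipeline in plain Python: a stand-in "representation" stage in the middle and an ordinary logistic regression head at the end. The weights here are invented for illustration; in deep learning they would be learned automatically from data:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Middle of the pipeline: map the raw input to a representation.
# (Invented weights; training would discover these automatically.)
def representation(raw):
    h1 = max(0.0, 0.8 * raw[0] - 0.3 * raw[1])   # simple ReLU-style units
    h2 = max(0.0, -0.2 * raw[0] + 0.9 * raw[1])
    return [h1, h2]

# End of the pipeline: a little logistic regression on the representation.
def logistic_head(features):
    z = 0.1 + 1.5 * features[0] - 0.7 * features[1]
    return sigmoid(z)

p = logistic_head(representation([2.0, 1.0]))
print(p)  # a probability strictly between 0 and 1
```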
11:31
The amazing thing is that this simple idea is just incredibly powerful. That idea has led to ChatGPT, to AlphaGo, to AlphaFold, and so on and so forth. And I kid you not: I've been doing deep learning for about 10 years now, and every so often when I look at it, I literally get goosebumps. That something so simple could be so powerful really boggles the mind. I'm just so lucky to be alive and working during this period.
12:06
And you know, coming from people who have been in the industry a long time, this sort of breathless exclamation is not very rare, particularly because I'm not in marketing. I actually mean it. With all apologies to various marketing folks; I just realized this is being taped. Okay. So, this has demolished the human bottleneck for using machine learning with unstructured data. It comes from the confluence of three forces: new algorithmic ideas, a whole lot of data, and, very importantly, the fact that we have access to parallel computing hardware in the form of these things called GPUs, graphics processing units. These three forces came together, they were applied to an old idea called neural networks, and that's basically deep learning. I'll go through it very quickly, because obviously we're going to spend half the semester looking into this in detail.
12:54
So, what's the immediate application of the ability to automatically handle unstructured data? What is the no-brainer application? It's okay if it's obvious; tell me.

>> Image classification.

>> Right, image classification. You can take an image, a good example of unstructured data, and do some classification on it. But more generally, what I'm getting at is that every sensor in the world can be given the ability to detect, recognize, and classify what it's sensing. Every sensor. Because remember, what does a sensor do? A sensor is just a receptacle for unstructured data. A camera is a receptacle for unstructured video, or unstructured still images. A microphone: unstructured audio. So, you can imagine taking any sensor and sticking a little deep learning system behind it. And now, suddenly, whatever comes out of that sensor, the deep learning system can count, classify, detect; it can do all kinds of stuff. In short, you can analyze, and you can predict. The way I'm describing it right now, you'll say, "Yeah, duh, obviously." But this "obvious" idea is actually not at all obvious when it comes to whether it will help you find interesting applications or not.
14:24
So, here's something I literally saw last week. Actually, I have another slide before that, but we're coming to it. For instance, every time you use Face ID to unlock your phone, this is the basic principle at work: the camera in the iPhone is the sensor, and they stuck a deep learning system behind it to do image classification. Me, not-me: that's what it's classifying. And here you have a breast cancer detection system that works from a mammogram. By the way, this is a very interesting picture. There's a professor in EECS, Regina Barzilay, who's a very well-known expert in this field, and she has built a breast cancer detection system, which has been deployed at Mass General Hospital. It turns out she's actually a breast cancer survivor; she's all good now. But after she built her system, I heard that she ran it against her own mammograms from many years prior, from when she had gone for a mammogram and been told that everything was fine. She ran the system on that mammogram, and it came back and said, "Here is a problem." So, this is a very interesting example where a deep learning system picked up something that a radiologist could not. These things can be quite powerful. Obviously, any self-driving system has numerous deep learning algorithms running under the hood: pedestrian detection, stoplight detection, zebra-crossing detection, and so on and so forth. Deep learning is also being very heavily used in visual inspection in manufacturing. Instead of people looking and saying, "Okay, there is a dent, or there's a scratch," you now have cameras with a little system behind them: a dent detector, a scratch detector, and so on. That's going on right now.
16:09
And now I come to the example I saw last week. This is an example of how you can create dramatically better products if you really internalize this idea. It's almost like you're looking at the world and saying, "Oh, there's a sensor. Can I attach a deep learning thing behind it?" That's the way you should be looking at the world, okay, for startup ideas. So, here's an example: these, apparently, are the world's first smart binoculars, from just two weeks ago. You look at a bird, and it tells you what kind of bird it is, right there. It's a simple idea, but imagine you are the first out of the gate with this feature: you'll have a little bit of an edge until everybody catches up, like three months later. Let's be very clear, there are no long-term monopoly windows in the world. There are only short-term windows, so the hunt is always on for a little monopoly window. This is an example of that. So, I encourage you to always think about the world as: where are the sensors here? And can I attach something behind a sensor to do something useful with it? All right.
17:24
Now, let's turn our attention to the output. We've been talking about structured data, unstructured data, and how deep learning has unlocked the ability to work with unstructured data, but we've been neglecting the output side of the equation. Traditionally, we could predict single numbers, or a few numbers, pretty easily. You've all done the canonical should-this-person-be-given-a-loan exercise in machine learning: you predict the probability that a borrower will repay a loan based on a whole bunch of data. Or, in supply chain, you predict the demand for a product next week. Or you could predict a bunch of numbers: given a picture, which one of 10 kinds of furniture is it? You can predict 10 probabilities that add up to one. You can also predict a whole bunch of numbers that don't have to add up to one, such as the GPS coordinates of an Uber ride. These are all simple structured outputs, just a few numbers. What we could not do very easily was generate pictures like this. We could not generate unstructured data; we could only consume it. Generating text, pictures, audio, and so on: with generative AI, that problem is gone.
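Turning raw scores into "10 probabilities that add up to one," as in the furniture example, is conventionally done with a softmax; a quick sketch with three made-up scores:

```python
import math

def softmax(scores):
    # Subtract the max score for numerical stability, exponentiate, normalize.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # made-up class scores
print(probs)       # three probabilities; highest score gets highest probability
print(sum(probs))  # 1.0, up to floating-point rounding
```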
18:36
So, generative AI is the ability to actually create unstructured data, and therefore it sits within deep learning. It still runs on deep learning; it's just one kind of deep learning. There's plenty of stuff going on in deep learning that has nothing to do with generative AI. Nowadays, of course, if you're a self-respecting entrepreneur who wants to ride this craze, you'll probably declare whatever you're doing as generative AI, and some VCs may actually be ready to fund you, who knows? But the point is, there's plenty of work in deep learning that has nothing to do with generative AI. Anyway, this is the overall picture. Now, here we can produce unstructured outputs, like pictures. You can take this image and come up with a nice description of it. This is actually a very famous picture, by the way, in the world of computer vision; we'll be analyzing it a little later in the semester.
19:27
You can obviously go from a very complicated caption to an image. You can go from text to music. (Can people hear it? Okay.) And of course, we can go from text to text, i.e., ChatGPT. Then, as of a few months ago, things got even more interesting: you can send text and an image in, and get text out. And in fact, as of a few weeks ago, you can send text, image, text, image, in an arbitrary sequence, into the system, and it'll come back to you with text and image.
20:02
So, things are becoming multimodal. I just want to share a really fun example I saw recently. This person sends this picture; can folks see it? It's a very complicated parking sign, apparently in San Francisco. And they ask: "It's Wednesday at 4:00 p.m. Can I park here? Tell me in one line." Because you really don't want GPT-4 giving you a big essay about this; you literally want to park. So, GPT-4 comes back and says, "Yes, you can park here for up to 1 hour starting at 4:00 p.m." And folks, I double-checked this; it's correct. We all know these things hallucinate, right? Can you imagine getting a parking ticket and telling the judge, "I'm sorry, I didn't realize it was hallucinating"? So, you have to double-check. So, yeah, things are getting multimodal very quickly.
20:49
And so, the picture here is that within gen AI, we used to have these separate circles: text to text, text to image, text to music, text to this, text to that, and so on. Those are all beginning to merge now inside gen AI, because multimodal models are going to become the norm this year. We already have really good closed models, and we actually already have very good open-source multimodal models. So, my feeling is that by the end of the year, the idea of using a text-only model is going to feel like a quaint, old-fashioned thing: "Really, you still do that?" I think multimodality is going to become the norm. So, that's where the world is, and this is the landscape. Any questions on the landscape, before we actually start doing some math?
math.
21:35
Okay.
21:37
Yeah.
22:05
You mean the the the evidence of that
22:07
being a problem would have been smaller.
22:09
Yeah.
22:16
So, the question is: in general, how do you train your models so that they give you the right answers, given that over the passage of time the amount of evidence in the data could be highly variable? In this particular case, for the professor I talked about, everything at that point was going through an expert radiologist. Years earlier, the mammogram was seen by a radiologist, and that person concluded there was no problem. So, that was the training label: the wrong training label. Typically, what happens is that training labels can be wrong some small fraction of the time, so you need to have systems that are robust. Your data needs to be complete, it needs to be comprehensive, and it needs to have correct labels. If these conditions are not met, your systems are not going to be that good. But as it turns out, with neural networks, even with some amount of noise in the labels, they still do a pretty good job. So, that's the general idea.
23:11
The verification comes from the human. Remember, when we look at radiology data, the input is, let's say, an image, like a mammogram, and then a human radiologist, or a set of radiologists, has said that it has a problem or does not have a problem. That is called the ground truth. It is this combination of ground-truth image and label that's being used to train these models.
23:39
Embodiment? So, are we going to cover embodiment? Embodiment here refers to the fact that robots need to operate in the real world, so robots are an example of what's called embodied intelligence. Unfortunately, due to the constraints of time, we're not going to get into robotics at all. But I will say that a lot of the deep learning material we're going to talk about provides the fundamental building blocks of modern robotic systems.
24:09
All right. So, in summary: X and Y can be anything, and they can be multimodal. I literally could not have put up this slide maybe two years ago. It's very simple in how it looks, but it's very profound: you can learn a mapping from anything to anything at this point, very easily, as long as you have enough data. Now, note that all this excitement we see around us stems from deep learning. Everything depends on deep learning. And so, if you understand deep learning, a lot of interesting things become possible. So, let's get going.
24:47
All right. We'll start with the very basics: what's a neural network? Recall logistic regression from back in the day. What is logistic regression? You send in a bunch of numbers, a vector of numbers, and you usually get a probability out, between 0 and 1: the probability of something or other. This logistic regression model is also represented in this form, if you recall. Basically, we take all these numbers and run them through a linear function, which gives us a number z, and then we run that through 1 / (1 + e^(-z)). That's guaranteed to give you a number between 0 and 1, which can be interpreted as a probability, and that's logistic regression. The canonical examples, loan approvals and things like that, all fall into this convenient bucket. So, this should be super familiar.
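That squashing function, 1 / (1 + e^(-z)), is a one-liner:

```python
import math

def sigmoid(z):
    # Maps any real number into (0, 1), so the output can be read as a probability.
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0.0))    # 0.5
print(sigmoid(10.0))   # very close to 1
print(sigmoid(-10.0))  # very close to 0
```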
25:44
Now, we're going to look at this simple, modest, humble little operation through the lens of a network of mathematical operations, and the reason we do this will become clear a bit later. We'll take a very simple example with two variables, GPA and experience: the GPA of some graduates, and their number of years of work experience. The dependent variable is either 0 or 1: 0 if they don't get called for an interview, 1 if they do. So, it's a two-input, one-output problem, and it's a classification problem, because we're classifying people by whether they'll get called for an interview, yes or no. That's the setup for this problem.
26:33
And let's say that we actually try to fit a logistic regression model to it.
26:40
So, if you're familiar with R, for
26:41
example, you would use something like
26:43
GLM to fit this model.
26:46
If you use something like statsmodels in Python, there's a similar function for it. Scikit-learn has another one. You get the idea: you can use whatever favorite method you have for logistic regression modeling to get this job done. And if
27:00
you do that with this little data set,
27:02
you're going to get these coefficients.
27:04
Right? The 0.4 is the intercept, 0.2 is
27:06
the coefficient for GPA, 0.5 for
27:08
experience. And that is the resulting
27:09
sigmoid function.
27:11
Okay?
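With scikit-learn, for instance, the fit looks roughly like this. The data set below is made up for illustration (the slide's actual rows aren't in the transcript), so the fitted coefficients will only be loosely analogous to the 0.4/0.2/0.5 on the slide:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for the slide's data: columns are [GPA, years of experience]
X = np.array([[3.9, 2.0], [3.5, 1.0], [2.1, 0.5], [3.8, 3.0],
              [2.5, 0.0], [3.2, 2.5], [2.0, 1.0], [3.7, 0.5]])
y = np.array([1, 1, 0, 1, 0, 1, 0, 0])  # 1 = called for an interview

model = LogisticRegression()
model.fit(X, y)

print("intercept:", model.intercept_)  # analogous to the slide's 0.4
print("coefficients:", model.coef_)    # analogous to the slide's 0.2 and 0.5
print("P(interview):", model.predict_proba([[3.8, 1.2]])[0, 1])
```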
27:12
All right. Cool. So, now let's actually
27:14
rewrite this formula as a network in the
27:17
following way. So, first, what we'll do
27:19
is we'll take GPA and experience and
27:20
stick it here on the left side, and
27:22
we'll put little circles next to them,
27:24
and we'll call them the input nodes.
27:26
Okay? And so, imagine that somebody writes a GPA into the circle, 3.5, or years of experience, 2.0, and then it flows through this arrow, and as it flows through, it gets multiplied by its coefficient, 0.2. The 0.2 is coming from here.
27:42
Similarly, experience gets multiplied by
27:44
0.5, it comes in here, and this node, as
27:47
the plus indicates, is adding everything
27:49
that's coming into it.
27:50
So, it's adding 0.2 * GPA, 0.5 *
27:52
experience, plus the intercept, which is
27:54
the green arrow coming in on its own.
27:57
It comes through here, and what comes
27:58
out of this is just a single number,
28:01
and that number goes into this little
28:02
circle,
28:04
and then out pops a probability.
28:07
Okay?
28:08
So, I've written a simple function in a ridiculously long-winded way. Okay? And the reason why I'm doing it will become clear in a second.
28:21
Okay? So, this is a little network of
28:23
operations for the simple function.
28:25
And so, for instance, to make a prediction with it: let's say someone has a 3.8 GPA and 1.2 years of experience. You just plug it in here,
28:34
do the math, you get 0.76, same thing
28:36
here, comes in here, add them all up,
28:38
you get 1.76, you run 1.76 through the
28:40
sigmoid, you get 0.85, and that is the
28:43
probability that that particular
28:44
individual may get called for an
28:45
interview.
28:46
Okay? At this point, we're just doing
28:48
logistic regression, nothing more
28:49
complicated.
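Tracing that arithmetic in code, with the fitted coefficients from the slide:

```python
import math

intercept, w_gpa, w_exp = 0.4, 0.2, 0.5  # fitted values from the slide

gpa, experience = 3.8, 1.2
z = intercept + w_gpa * gpa + w_exp * experience  # 0.4 + 0.76 + 0.6 = 1.76
p = 1.0 / (1.0 + math.exp(-z))                    # sigmoid of 1.76

print(round(z, 2))  # 1.76
print(round(p, 2))  # 0.85, the probability of an interview call
```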
28:51
Okay? So, now, if you have many variables instead of two, say X1 through XK, the same sort of logic applies. Each one has some
28:59
coefficient, and then there's an
29:01
intercept, they all get added up here,
29:03
run through a sigmoid, and out pops this
29:04
number. Okay? Notice how the data flows
29:07
from left to right.
29:09
Okay?
29:10
All right. Any questions on this?
29:15
All right. Good.
29:16
So, now terminology.
29:18
So, you'll discover that the world of neural networks and deep learning has its own terminology. They have their own ways of referring to things that the rest of the world has been calling something else for the longest time.
29:29
Right? It's kind of annoying sometimes,
29:31
but it's the way it is. So, um
29:35
Remember in regression, we used to call
29:37
those numbers next to each variable as
29:38
coefficients,
29:39
and the constant thing as an intercept?
29:41
Well, guess what? In this world, those coefficients are actually called weights, and the intercepts are called biases. So, in the neural network world, these are called weights and biases. And sometimes, if you're a little lazy, you may just call the whole thing weights.
29:56
Okay? So, when you see in the newspaper
29:58
that, you know, "Oh my god, this amazing
30:00
model's weights have been leaked
30:03
on the internet or on BitTorrent or
30:05
something." That's what's going on,
30:06
right? All these coefficients have been
30:08
leaked. Because once you know what the
30:09
coefficients are and what the
30:11
architecture is, you can just
30:12
reconstruct the model.
30:15
All right. So, that's what's going on
30:16
here.
30:17
Now, why did we do this network
30:19
business? Why did we write it as a
30:20
network?
30:23
Yeah, what is the advantage? Any
30:24
guesses?
30:34
When you have multiple functions, it's just easier to see it that way.
30:40
Right. If you have lots of things going
30:41
on, it's easier to see it if you
30:43
actually write it in graphical form.
30:45
Yes, correct.
30:46
But, so is it only like a usability
30:49
advantage?
30:51
I mean, the thing is you want different
30:53
functions for different layers of that.
30:55
Uh-huh.
30:56
Okay.
30:57
So, maybe we want to use different
30:59
functions in different layers. But, I
31:00
think there's actually even a larger
31:02
sort of a more basic point, which is
31:04
that
31:05
the moment you write it down, you suddenly realize that you could have lots of things in the middle.
31:12
I don't have to go from the input to the
31:13
output directly. I can do lots of things
31:15
in the middle, right? That's sort of the
31:17
key idea. So, remember the notion of learning
31:22
representations of unstructured data,
31:24
right? Where you take a picture and say
31:25
beak length and things like that, right?
31:27
And remember, I said deep learning
31:29
actually automatically learns these
31:30
things. Where is that automatic learning
31:33
coming from?
31:34
Well, this is where it's coming from.
31:36
So, what we do is we take this thing, right? It's just a logistic regression model. Inputs get added up in a linear function, then run through a sigmoid.
31:45
And then
31:46
we are like, "Hmm, if we want to learn
31:48
representations of the raw input, we
31:51
better be doing something in the middle
31:53
here."
31:54
Because the output is the output.
31:56
That is That's not going to change.
31:58
You know, it's it's either a dog or a
32:00
cat. You don't have any choice
32:02
as to what it is. Okay? The only agency
32:05
you have at this point is you can take
32:07
the raw input and do things in the
32:09
middle with it.
32:11
You can do a lot of stuff in the middle
32:12
and then run it through something to get
32:14
the output. Okay? So, in any mathematical discipline, if someone comes to you and says, "Here's a bunch of data. I want you to do something with it," what is the most basic first thing you should do?
32:31
Run it through a linear function.
32:34
The most basic thing in math is a linear
32:36
function. So, given anything, just run
32:37
it through a linear function. See what
32:38
happens.
32:40
So, that's exactly what we can do. So,
32:42
the simplest thing we can do here, we
32:44
can insert a bunch of linear functions.
32:46
So, what we do is we take all this input and run a linear function on it. Think of it as, say, X1 * 2 + X2 * 4, all the way to XK * 9, plus some intercept, and boom, it goes out the other end. So, this little circle here with a plus in it is just shorthand for a linear function. Whenever you see a circle with a plus, it's just shorthand for a linear
33:13
function. Okay? So, you can take this
33:15
whole thing and run through a linear
33:16
function and when you do it, you'll get
33:17
some number right there. You'll get some
33:19
number. So, you've taken these K numbers and you've compressed them, in some way, into one number.
33:25
Okay?
33:26
But, you don't have to stop at one
33:28
number. You can do more.
33:30
So, we can have a stack of linear
33:31
functions in the middle.
33:33
Right? There's a linear function here,
33:35
another one here, another one here. At
33:37
this point, the K numbers you have
33:40
K could be, for example, 1,000.
33:42
Right? It's just the size of your input
33:43
data.
33:44
You've taken these K things and you've
33:45
compressed them into three numbers at
33:47
this point.
33:48
Okay?
33:50
So, okay, maybe three is the right
33:52
number, maybe 10 is the right number. We
33:53
don't know.
33:54
And we'll get to know how do we know
33:55
what the right number is later on.
33:58
So, we can stack as many linear functions as we want.
34:01
So, we have transformed this K-dimensional thing into a three-dimensional vector, right? K numbers become three numbers. And now we can flow these three numbers through some other little function.
34:13
Okay?
34:16
And as you will see in a few minutes,
34:18
that function is called an activation
34:19
function
34:20
and it's chosen to be a non-linear
34:22
function
34:23
because if you don't choose it to be a
34:24
non-linear function, all the effort we
34:26
are doing is going to be a total waste
34:28
of time.
34:30
Okay? For now, just
34:32
take it on faith that you need to have
34:34
non-linear functions here.
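As a preview of why, here's a quick numerical check (with arbitrary random weights) that stacking two linear layers with no non-linearity in between collapses into a single linear layer, so the extra layer buys nothing:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked linear layers, with no activation in between
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=3)  # 5 inputs -> 3 "hidden"
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)  # 3 "hidden" -> 1 output

x = rng.normal(size=5)
two_layers = W2 @ (W1 @ x + b1) + b2

# ...which collapses into one linear layer with these combined weights:
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True: stacking gained us nothing
```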
34:36
But, note that the three numbers here
34:39
are still three numbers. They are three
34:41
different numbers, but they're still
34:42
three numbers.
34:43
And once we do this, we'll be like, "You
34:45
know what? This was fun. Let's do it
34:46
again."
34:48
Okay? So, you can do it again.
34:52
And you can keep on doing it. You can do it 100 times if you want.
34:55
And the key thing is that every time you
34:57
do it, you're giving this network some
35:00
ability, some capacity to learn
35:03
something interesting from the data.
35:05
To learn an interesting representation.
35:07
Now, of course, you're thinking, "Well,
35:09
how do we know it's interesting? How do
35:10
you know it's a useful thing?" And we'll
35:12
come to all that later on.
35:14
Right? We're just giving it the
35:14
capacity, the potential to learn
35:16
interesting things from the data.
35:17
Whether it actually lives up to its
35:19
potential, we don't know yet.
35:21
Okay? We'll give it the potential.
35:23
Because the more transformations of the
35:24
input data you make, the more
35:26
opportunity you have to do interesting
35:27
things with it.
35:29
If I don't even give you the opportunity
35:30
to transform it once, you don't have any
35:31
opportunity, right?
35:32
If I give you 10 chances to transform
35:34
things, you have 10 shots at doing
35:36
something useful.
35:38
So, you can you can do this repeatedly
35:40
and once we are done doing these
35:42
transformations, we just pipe it through
35:44
to our good old logistic regression
35:46
sigmoid here and we are done.
35:50
Okay?
35:51
So, this is the basic idea.
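The whole recipe, transform the input repeatedly and then pipe it into a sigmoid at the end, can be sketched in a few lines of NumPy. All shapes and weights here are arbitrary placeholders, and ReLU (which the lecture names shortly) stands in for the non-linear activation:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    # layers: list of (W, b) pairs; non-linearity between them, sigmoid at the end
    for W, b in layers[:-1]:
        x = relu(W @ x + b)       # repeated transformation of the input
    W, b = layers[-1]
    return sigmoid(W @ x + b)     # good old logistic regression at the end

rng = np.random.default_rng(1)
layers = [(rng.normal(size=(4, 6)), rng.normal(size=4)),   # 6 inputs -> 4
          (rng.normal(size=(3, 4)), rng.normal(size=3)),   # 4 -> 3
          (rng.normal(size=(1, 3)), rng.normal(size=1))]   # 3 -> 1 output
p = forward(rng.normal(size=6), layers)
print(p)  # a single probability
```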
35:53
And so, just to contrast it, this was
35:55
good old logistic regression where we
35:57
take the input,
35:59
we run it through a linear function and
36:00
pop out a number,
36:02
a probability number. But, after we do
36:04
all this stuff, the input stays the
36:06
same, the output stays the same, but in
36:08
the middle you just run through a whole
36:09
bunch of these functions, you know,
36:11
these layers, boop boop boop boop, and
36:12
then we get the output.
36:14
Okay?
36:15
That's all we have done.
36:16
And this is a neural network.
36:19
A neural network is nothing more than
36:21
repeatedly transformed inputs which are
36:25
finally fed to a linear or logistic
36:27
regression model.
36:35
Any questions?
36:37
I have two questions. Could you use the
36:38
thing so that everyone can hear? Yeah.
36:41
I have two questions. Firstly, when we say that there's a lack of explainability, is it that we don't know which arrow it went through? That's one. Second, who's controlling the number of iterations or the number of functions? Is that up to us, or how does that work?
36:59
Right. So, the first question, explainability: for any given input data point, we actually know exactly how it flows through the network. So, there is no problem there. The problem is in ascribing, "Okay, we think this person is going to repay the loan because of this particular attribute." We don't know that, because those attributes all get enmeshed together and go through this complicated thing. So, we know exactly what happens. We just can't give credit to any one thing very easily.
37:31
I'm again, I'm just standing on the
37:33
brink of this vast ocean of something
37:35
called explainability and
37:36
interpretability, uh which I'll get to a
37:38
bit later on in the semester. But,
37:39
that's sort of the quick, right-ish-but-somewhat-wrong answer. Okay? Number two,
37:47
we decide the number of layers. We
37:49
decide a whole bunch of things and as
37:51
we'll see in a few minutes, uh there is
37:52
something that's given to us and
37:53
something we get to design and I'll make
37:55
it very clear which is which.
37:59
Yeah.
38:02
Did I say your name right? Yeah.
38:04
So, which functions have to be linear
38:06
and also like why does it have to be
38:08
linear? Yeah. So, these functions, the f(x) here, have to be non-linear.
38:15
As to why they have to be non-linear,
38:16
we'll get to that in a few minutes.
38:19
Okay. So, these are called neurons.
38:22
Okay?
38:23
These things where there's a linear function followed by a little non-linear function, right? Each one of these things is called a neuron.
38:32
Um
38:34
By the way, this is loosely inspired by the way neurons work in mammalian brains.
38:42
But, the connections between
38:45
neuroscience and deep learning
38:47
are very heavily argued.
38:50
So, I'm going to like stay away from it.
38:52
Suffice it to say that for building practical deep learning systems in industry, you don't worry about this. Okay?
39:01
All right, let's move on.
39:04
Terminology. Uh this vertical stack of
39:06
linear functions or neurons,
39:09
right? This vertical stack is called a
39:10
layer.
39:12
Right? This is a layer, that's a layer.
39:14
Uh and these little non-linear
39:15
functions, which we haven't gotten to
39:17
yet, are called activation functions.
39:20
Uh and we'll get to why they are called
39:22
that in just a second.
39:25
And
39:26
the input
39:29
is called an input layer and I have the
39:31
word layer in double quotes because like
39:34
it's not really doing anything, right?
39:35
It's just the input.
39:36
But we still call it an input layer.
39:39
And the very final thing that
39:41
produces outputs is called the output
39:42
layer, right? Obviously. And everything
39:45
in the middle is called a hidden layer.
39:48
Okay?
39:50
So, the final piece of terminology is
39:52
that when you have a layer like this in
39:54
which say three numbers are coming out
39:56
and there's another layer,
39:58
right? If every neuron in this layer is
40:00
connected to every neuron in this layer,
40:03
it's called a fully connected or dense
40:05
layer. So, for instance, take this arrow here. Whatever number is coming out, let's say the number three is coming out of this thing here. That number three flows on this arrow to this neuron, flows on this arrow to this neuron, and flows on this third arrow to this neuron. That's what I mean. So, every neuron's output is being sent to every neuron in the following layer.
40:25
Okay? That's we call it fully connected
40:27
or dense.
40:29
And then
40:30
if you look at logistic regression,
40:32
right? This is logistic regression. You
40:34
can see basically logistic regression is
40:36
a neural network with no hidden layers.
40:41
So, in some sense, logistic regression
40:42
is like almost the simplest possible
40:43
network you can think of.
40:45
Like barely a neural network.
40:48
Right? It's got no hidden layers.
40:50
That's what makes it logistic
40:51
regression.
40:52
And so, as you might have guessed by
40:54
now, deep learning is just neural
40:56
networks with lots and lots of
40:58
of what?
41:00
Yes, layers.
41:02
So, here are a few.
41:04
Uh and by the way, these are not even
41:07
considered all that, you know,
41:08
impressive these days.
41:10
Okay? Uh but I put them up because this
41:13
this thing here is called ResNet.
41:16
And it's famous because the ResNet
41:18
neural network was I think the first
41:20
network
41:21
to surpass human-level performance in
41:24
image classification.
41:26
It's sort of like the Skynet of image classification. Okay? It
41:31
surpassed human-level performance. And
41:32
I'm putting it up here because we'll
41:34
actually work with ResNet next
41:36
Wednesday. And we'll actually take
41:37
ResNet, we'll fine-tune it, and solve a
41:39
real problem in class.
41:41
All right. So, it's got lots and lots of
41:43
layers. Uh now, let's turn to these
41:46
activation functions. We've been
41:47
ignoring these little guys, right? So
41:48
far.
41:49
So, the activation function at a node is, first of all, a function that receives a single number and outputs a single number. It's not very complicated. This here is a linear function which receives all these inputs, could be 10 inputs, could be 1,000, runs them through, and outputs a number. That single number, a scalar, goes in here, and it comes out as another single number.
42:14
Just remember that.
42:16
And so, these are some of the most
42:18
common activation functions. In fact,
42:19
the sigmoid we saw, which is actually we
42:21
use for the output, is actually a kind
42:23
of activation function where a single
42:25
number comes in and it gets mapped into
42:28
this curve because of this thing. So, the single number that comes in is A, and it gets transformed as 1 / (1 + e^(−A)), and you get a shape like this, and it's called the sigmoid activation function. And as you can see
42:40
here,
42:41
for very small values, for very negative
42:44
values,
42:45
it's going to be pretty close to zero,
42:47
meaning it won't get activated.
42:50
And for very very large values, it's
42:52
going to be
42:53
pretty close to one.
42:55
All the action happens in the middle.
42:57
When your values are somewhere in this range, there's a dramatic increase in what comes out.
43:03
Okay? So, that little thing in the
43:05
middle is a sweet spot for these
43:06
functions.
43:07
Uh
43:08
And this one I'm almost embarrassed to call an activation function, because it's literally not doing anything. It's sort of getting a nice label for free. You just get a number and pass it straight along.
43:20
It's a linear activation function, but
43:22
just for completeness, I want to put it
43:23
here.
43:25
And then we come to the hero of deep
43:28
learning, which is the rectified linear
43:30
unit,
43:32
right? Rectified linear unit. It's
43:34
called ReLU. Uh and ReLU is going to
43:37
become part of your vocabulary very very
43:38
quickly. Uh and so, ReLU is actually a
43:41
very interesting function. So, you write
43:43
it as maximum of whatever number and
43:44
zero,
43:46
which is another way of saying if the
43:48
number is positive, just send it along
43:50
unchanged. If the number is negative,
43:53
send a zero instead. Squish it to zero.
43:56
So, which means if the number is
43:57
negative, nothing happens. If the number
43:59
is positive, it wakes up.
44:03
So, what happens is that you could have a very complicated linear function with millions of variables, and it outputs a single number, and that number unfortunately happens to be negative.
44:12
The ReLU is not impressed. It's going to
44:13
send a zero out.
44:15
Okay? It's a very simple function.
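In code, ReLU is a one-liner:

```python
def relu(a):
    # max(a, 0): if positive, pass it along unchanged; if negative, squish to zero
    return max(a, 0.0)

print(relu(1.76))   # positive, so it "wakes up" and passes through
print(relu(-3.2))   # negative, so the ReLU is not impressed: out comes 0.0
```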
44:17
And many folks who've been in deep learning for a long time believe that the use of ReLUs is one of the key factors
44:26
that led to the amazing success of deep
44:28
learning because it's got some very
44:30
interesting properties,
44:32
uh which we'll get to hopefully on
44:33
Wednesday.
44:35
Okay. So, the shorthand here is that whenever you see this thing, it's just a linear activation: a linear function followed by sending the result straight out. If I put a ReLU in here, I'm going to denote it like that, which mimics how the graph looks. And if I put a sigmoid, I'm just going to use this thing here.
44:55
Okay?
44:56
Just a visual shorthand.
44:59
>> There are many other activation functions, by the way. There's something called the tanh function, the leaky ReLU, the GELU, the Swish. I mean, it's like a menagerie of
45:07
Swish. I mean, it's like a menagerie of
45:10
activation functions because very often
45:12
researchers will be like, "Well, I don't
45:14
like this activation function. Here's a
45:15
little modified version of the function
45:17
which is going to be better for certain
45:18
things." So, you know, people's research creativity on this point has gone a bit unhinged. So, there's lots of
45:24
options. But if you just stick to the
45:26
ReLU
45:27
for your hidden layers, you can
45:29
basically get anything done practically,
45:31
right? You don't have to worry about
45:32
anything else. So, we'll only focus on
45:34
ReLUs for all the intermediate stuff. Uh
45:37
yeah.
45:38
Yeah, how do you gauge which activation
45:40
function is more suited for your use
45:41
case?
45:42
Yeah. So, the rule of thumb here is that
45:45
for your hidden layers, use ReLUs,
45:48
right? Because empirically we have seen
45:49
that they they do an amazing job.
45:51
For your output layer, your very final
45:54
thing, you actually don't have a choice
45:56
because what you have to use depends on
45:57
what kind of output you have to work
45:59
with. If it's an output which is a
46:01
probability number between zero and one,
46:02
you have to use a sigmoid.
46:04
Um if it is
46:05
say 10 numbers, all of which have to be
46:07
probabilities, and they have to add up
46:08
to one,
46:10
you got to use something called the
46:10
softmax, which we'll get to on
46:12
Wednesday. So, it really depends on the
46:13
output, and the nature of the output
46:15
dictates what you use in the output
46:16
layer.
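As a sketch, the two output activations mentioned here look like this (the input numbers are arbitrary; the lecture gets to softmax on Wednesday):

```python
import numpy as np

def sigmoid(z):
    # one probability in (0, 1): use when the output is a single probability
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # several probabilities that add up to 1: use for multi-class outputs
    e = np.exp(z - np.max(z))  # subtracting the max is for numerical stability
    return e / e.sum()

print(sigmoid(1.76))                       # single-probability output layer
probs = softmax(np.array([2.0, 1.0, 0.1])) # a 10-way output works the same
print(probs, probs.sum())                  # the entries sum to 1
```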
46:18
Okay.
46:19
So, coming back to this. So, if you want
46:22
to design a deep neural network,
46:24
uh the input is the input.
46:27
The output is the output. And so, you
46:29
get to choose everything else. You get
46:30
to choose the number of hidden layers,
46:32
the number of neurons in each layer, the
46:35
activation functions you're going to use
46:37
for the hidden layers, and then you have to make sure that what you choose for the output layer matches the kind of output you want to generate.
46:44
Okay? So, this is all in your hands. You decide what happens. But there's a lot of guidance for how to do these things, which we'll cover as we go along.
46:56
Did you have a question?
46:57
Kind of, but I guess I'll do it.
47:00
Is there also exploration in dynamically setting up layers, so that the user determines the number of layers?
47:12
Yeah. So, there's a whole field called
47:14
neural architecture search, NAS,
47:16
where we can actually try a whole bunch
47:18
of different architectures,
47:20
uh and then use some optimization and in
47:22
fact reinforcement learning, which we
47:23
won't get to in this class,
47:25
as a way to figure out really good
47:27
architectures for any particular
47:28
problem. Uh but the
47:32
the question of okay,
47:33
when I'm training a model with a
47:34
particular kind of data,
47:36
the first pass through the training
47:37
data, I'm going to use two layers. The
47:39
second pass, I'm going to do seven
47:40
layers. That is not done. And the reason it's not done is because of certain other constraints we have in how we can do the optimization and the gradient descent and so on. But what you can do, and we'll look at this thing called dropout, is that for certain layers, each time you run data through the network, you can decide, "In this layer I'm not going to use all the nodes; I'm going to drop out a few of the nodes randomly." And it's a very effective technique to prevent overfitting, and we'll come to that a little later on.
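As a rough sketch of the idea (this is the "inverted" dropout variant, where surviving activations are rescaled so their expected value is unchanged; the activation numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(activations, p_drop=0.5):
    # During training, zero out each node's output with probability p_drop;
    # scaling the survivors by 1/(1 - p_drop) keeps the expected value the same.
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

a = np.array([1.72, 0.36, 0.04, 0.9])
print(dropout(a))  # a different random subset is zeroed on every pass
```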
48:09
Uh yeah.
48:11
So, one question regarding neural networks is about the coefficients. Is this something we decide, or do we have to use predefined coefficients for the weights? No, the whole name of the game is that we use the training data, and something
48:29
called a loss function, which I'll get
48:30
to on Wednesday,
48:31
along with an optimization algorithm, so
48:33
that the network figures out by itself
48:36
what the weights need to be, what the
48:37
coefficients need to be, so as to
48:39
minimize prediction error.
48:42
And that's the whole thing. The magic
48:43
here is that we don't have to do
48:45
anything. We only have to set it up, sit
48:47
back, often for many hours, and watch it
48:49
do its thing.
48:51
Yeah.
48:52
Just one quick question. Um you
48:54
mentioned nodes just now when you were
48:56
answering Roland's question. Can you
48:58
just confirm exactly what a node is? I
49:00
have an idea that it's basically any
49:02
circle, but
49:03
>> Yeah, you just added a lot more detail. Sure. When I'm referring to a node, I'm literally referring to something like this: think of it as a linear function followed by a non-linear activation. It reads a bunch of inputs, runs them through a linear function, passes the result through something like a ReLU or a sigmoid, and out pops a number.
49:22
So, in general, a node will have
49:24
many numbers potentially coming in, but
49:26
only one number going out.
49:28
Uh now, that one number may get copied
49:30
to every node in the next layer,
49:32
but what comes out of that particular
49:33
node is just a single number.
49:36
All right. So, let's use a DNN for our interview example. In this problem we had two inputs, right? GPA and experience. The output variable has to be between zero and one, because you're trying to predict the probability that someone will get called for an interview. So, the input size is fixed and the output is fixed. And since it's really the very first network we're actually playing with,
50:02
let's just start simple, right? We'll
50:04
just have one hidden layer and we'll
50:06
have three neurons, right? And as I mentioned in answer to Tommaso's question from before, if you are choosing activation functions in the hidden layers, just go with the ReLU as a default. It usually works really well out of the box. So,
50:19
we'll just use a ReLU and since the
50:21
output has to be between zero and one,
50:23
we don't have a choice. We have to use a
50:25
sigmoid for the output layer.
50:27
Okay? That's it. Those are the design choices, and when we do that, this is what it looks like,
50:32
right? We have two inputs X1 and X2, GPA
50:34
and experience and then it goes through
50:36
these three
50:38
ReLUs and then out comes these three
50:40
numbers and they pass through a sigmoid
50:42
and we get a probability Y at the end.
50:44
All right, quick question. Concept
50:46
check.
50:47
How many parameters, both weights and biases, does this network have?
50:53
Let's take a moment to count.
51:11
All right, any guesses?
51:15
Yeah.
51:16
12.
51:18
I think you're almost there.
51:22
Um
51:23
Are folks going to be doing a binary search on this now? Okay. No.
51:31
Yes? 13. Yes, very good. So, that's 13,
51:35
and my guess is that the reason you came up with 12, and I made the same mistake, that's why I know, is you probably forgot this green thing here.
51:45
So, what folks often forget is the bias. Right? We all count the weights, right?
51:50
Okay. And the easy way to do it is: two things here, three things here, so two times three is six weights, plus three times one is three more, that's nine weights, and then you have to add up all the intercepts, four of them. Right? So, you get 13.
52:04
And so, when we get to very complicated networks, the first two or three times you work with them, and we'll do that starting very soon, just get into the habit of hand-calculating the number of parameters, to make sure you understand what's going on. Once you get it right a couple of times, you don't have to do it anymore. Okay? The first couple of times, hand-calculate to make sure you get it.
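That hand count can be sketched as a small helper. The [2, 3, 1] layout is the slide's network: 2 inputs, one hidden layer of 3 neurons, 1 output:

```python
def count_params(layer_sizes):
    # Each layer contributes (inputs x neurons) weights plus one bias per neuron.
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total

print(count_params([2, 3, 1]))  # (2*3 + 3) + (3*1 + 1) = 13
print(count_params([2, 1]))     # plain logistic regression: 2 weights + 1 bias
```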
52:23
Okay. So, let's say that we have trained this network using techniques which we'll cover on Wednesday, and it comes back to you after training and says, "Okay, these are the best values for the weights and the biases that I have found." So, now your network is ready for action.
52:42
It's ready to be used
52:43
And so, let's say that you want to predict with this network. If you have X1 and X2, what comes out of this top neuron? Let's call it A1. It's basically this.
52:58
Okay? That's what's coming out of this
53:00
thing. For any X1 and X2, this is what's
53:02
coming out. Similarly for A2 and A3
53:05
Okay?
53:06
And then what comes out at the very end
53:08
is
53:09
basically A1 times that plus A2 times
53:11
that plus A3 times that plus 0.05 and
53:14
the whole thing gets run through the
53:15
sigmoid and this is what you get.
53:18
Okay? So, this slide and the one before,
53:20
just make sure you look at them afterwards
53:22
to make sure you totally understand
53:23
the mechanics of it because
53:26
this is really important. If you don't
53:27
fully understand and
53:28
internalize the mechanics, when we get
53:30
to things like transformers, it's going
53:31
to get hard. Okay? So, just make sure
53:33
it's like automatic at this point. It
53:35
should be reflexive.
53:37
Um
53:38
Okay. So, yeah. And so, when you
53:40
want to predict anything, you just run
53:41
some numbers through it, you get all
53:42
these things
53:44
and boom, you calculate it. It turns out
53:45
to be 22.6. That's the answer.
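The forward pass just described can be sketched in plain Python. All the weights and biases below are made up for illustration, not taken from the slide; the only value from the lecture is the 0.05 output bias:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical trained parameters (illustrative only).
W1 = [(0.5, -0.2), (0.8, 0.3), (-0.1, 0.7)]  # (w_x1, w_x2) per hidden neuron
b1 = [0.1, -0.4, 0.2]                        # hidden-layer biases
W2 = [0.6, -0.9, 0.4]                        # hidden -> output weights
b2 = 0.05                                    # output bias (from the lecture)

def predict(x1, x2):
    # A1..A3: each hidden neuron applies the sigmoid to its weighted sum plus bias
    a = [sigmoid(w1 * x1 + w2 * x2 + b) for (w1, w2), b in zip(W1, b1)]
    # Output: weighted sum of the activations plus bias, run through the sigmoid
    return sigmoid(sum(ai * wi for ai, wi in zip(a, W2)) + b2)

print(predict(1.0, 2.0))
```

Running numbers through `predict` is exactly the "run some numbers through it, get all these things, and boom" step.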
53:48
All right. So,
53:50
I just want to say that let's say that
53:51
you built this network
53:53
and now we are like, "Hey,
53:55
given any X1 and X2, I can come up with
53:57
a Y."
53:58
But I'm feeling a little mathy. Can we
54:00
actually write down the function? Yeah,
54:02
you can write down the function. This is
54:03
what it looks like.
54:07
Super interpretable, right?
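For reference, the composed function has this general shape, written with assumed symbol names (w for the hidden-layer weights, b for its biases, v for the output weights; only the 0.05 output bias is from the lecture):

```latex
\hat{y} = \sigma\!\left( \sum_{j=1}^{3} v_j \,\sigma\!\left( w_{j1} x_1 + w_{j2} x_2 + b_j \right) + 0.05 \right)
```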
54:10
So, this goes to the comment, Itai, that
54:12
you made earlier on where the act of
54:16
depicting something using this sort of
54:18
graphical layout makes it so much easier
54:21
to reason with
54:22
and to think about compared to trying to
54:24
figure out what this function is doing.
54:26
Right? The other point I want to make is
54:28
that um
54:30
just contrast what we just saw with the
54:32
logistic regression thing we saw
54:33
earlier, which was this little function
54:35
and so, here
54:38
even this simple network with just three
54:40
nodes in
54:42
that single hidden layer,
54:44
right? It's so much more complicated
54:46
than the logistic regression model. So
54:48
much more complicated, right?
54:50
And from this complexity
54:52
springs the ability of these networks to
54:55
do basically magical things.
54:56
Right? That's where the complexity comes
54:58
from. That's where the magic comes from.
55:00
So, and here in this case, the number of
55:02
variables hasn't even changed. It's
55:03
still only two.
55:05
But we can go from the two inputs to the
55:07
one output in very complicated ways as
55:10
long as we know how to train these
55:11
networks the right way. That's sort of
55:13
the
55:13
the secret sauce which we'll spend a lot
55:15
of time on.
55:16
So, yeah. To summarize, this is what we
55:19
have. It's a deep neural network.
55:20
By the way, this kind of network where
55:22
things just flow from left to right is
55:23
called a feedforward
55:25
neural network
55:27
in contrast to some other kinds of
55:28
networks called recurrent networks which
55:30
you won't get to
55:31
in this class because
55:34
transformers have actually proven to be
55:36
much more capable than recurrent
55:38
networks and those have become the norm,
55:40
so we'll just focus on those instead. Um
55:42
and so, this arrangement of neurons into
55:44
layers and activation functions and all
55:46
that stuff, this is called the architecture
55:48
of the neural network. And as you will
55:50
see later on, the transformer, the
55:51
famous transformer network
55:53
[clears throat] is just an example of a
55:54
particular neural network architecture
55:57
much like convolutional neural networks,
55:59
which we'll get to next week for computer
56:01
vision, are another example of a
56:03
particular network architecture.
56:05
So, we will focus on transformers. They
56:07
are a particular kind of architecture.
56:08
All right. So, in summary, this is what
56:10
we have.
56:11
You know, you get to choose the hidden
56:13
layers, the neurons, activation
56:14
functions, stuff like that.
56:15
The inputs and outputs are what you have
56:17
to work with and so, we will actually
56:19
take this idea and then use it
56:22
to
56:23
to actually solve a problem from start
56:25
to finish on Wednesday. So, I think I'm
56:28
done. I give you three minutes back of
56:29
your day. Thank you.
56:32
>> [applause]
— end of transcript —