
1: Introduction to Neural Networks and Deep Learning; Training Deep NNs

MIT OpenCourseWare · May 11, 2026
Transcript ~10525 words · 57:05
0:16
All right. So, today's lecture,
0:18
introduction to neural networks and deep
0:19
learning.
0:20
Um so, we'll start with a very quick
0:21
intro to these things,
0:23
uh and then we'll switch and dive deep
0:25
into neural networks. All right. So, the
0:27
field of AI originated in 1956. Sadly,
0:30
it didn't originate at MIT, it
0:31
originated at Dartmouth.
0:32
Because all these people got together at
0:33
Dartmouth. I guess it's got a nice
0:35
quad or whatever. They got together,
0:37
they defined the field. But, fortunately
0:40
for us, MIT was very well represented.
0:42
So, we have Marvin Minsky who founded
0:44
the MIT AI Lab, John McCarthy who
0:47
invented Lisp, and then later defected
0:50
to the West Coast, and then Claude
0:51
Shannon who invented information theory,
0:53
right? Who was a professor at MIT. So,
0:55
MIT was well represented. These folks,
0:57
you know, founded the field, and they
0:58
were so bright, they thought that AI was
1:01
going to be substantially solved, quote
1:03
unquote,
1:04
by that fall.
1:06
Okay?
1:07
Now, obviously, it turned out a bit
1:08
differently than what they expected.
1:10
Um so, it's been, whatever, 67, 68 years
1:12
since its founding. So, it's gone
1:14
through, essentially, in my opinion,
1:16
three seminal breakthroughs,
1:18
um starting with the traditional
1:19
approach, then machine learning, deep
1:21
learning, and generative AI. So, let's
1:22
take a very quick look at each of these
1:24
breakthroughs and what motivated them.
1:26
So,
1:27
let's start with the traditional
1:28
approach to AI. And so, what is AI? AI,
1:31
informally, is the ability to imbue
1:33
computers with the
1:34
ability to do things that
1:36
only humans can typically do. Cognitive
1:38
tasks, thinking tasks, and things like
1:39
that. And so, the most sort of common
1:41
sensical way to do that is to say,
1:43
"Well, if I want the computer to do
1:45
something complicated like play chess,
1:46
I'm just going to sit down with a few
1:48
chess grandmasters,
1:49
show them a whole bunch of board moves,
1:51
and ask them how they figure out how to
1:53
respond, how to play the next move." I'm
1:55
going to sort of sit down, talk to all
1:56
these people, and then I'm going to
1:57
write down a whole bunch of rules. If
1:59
this is the board position, move this.
2:01
If this is the board position, move
2:02
this, and so on and so forth. Or I might
2:04
sit down with a cardiologist and tell
2:05
them, "Okay, how do you actually
2:06
interpret an ECG?" They will similarly
2:09
give me a bunch of if-then rules.
2:11
I will take all these rules, I'll put
2:12
them into the computer, and boom, I have
2:13
a system that can do what a human can
2:15
do. Right? Now, this approach, even
2:17
though it's common sensical and kind of
2:19
makes sense, it had success in only a
2:21
few areas.
2:22
Um and so, the interesting question is,
2:24
why was it not pervasively successful?
2:28
Why was it not pervasively successful?
2:29
It seems like a pretty good idea to me,
2:31
right? And the people who came up with
2:32
these things are smart people, they're
2:33
not dumb people. They know what they're
2:35
doing. So, why did it not work?
2:39
Because
2:40
because it's time-intensive,
2:42
since you have to run through
2:44
all these scenarios that can ever exist,
2:46
and still some new scenarios can come up
2:48
that you didn't cater for initially.
2:51
Right. So, there are two aspects to what
2:52
you said, which is the first aspect is
2:54
it's time-intensive. That, as it turns
2:56
out, is not a big deal, because
2:57
computers are getting faster and faster.
2:58
>> [clears throat]
2:59
>> Right? The second thing is actually the
3:01
key thing, which is that it doesn't
3:02
generalize to new situations very well.
3:05
Right? The problem is
3:07
there are an infinite number of things
3:08
that you're going to see when you deploy
3:10
these systems in the real world. By
3:11
definition, what you're training it on
3:13
is a small sample of rules. So, these
3:15
rules are very brittle. But, there's
3:17
actually an even more interesting reason.
3:19
And that reason is that we know more
3:22
than we can tell.
3:23
This is called Polanyi's paradox. So,
3:25
the idea is that if I come to you and
3:27
say, "Hey, uh here's a picture. Is it a
3:29
dog or a cat?" you will tell me within,
3:32
I believe, they've measured it, like 20
3:33
milliseconds or something, you know if
3:34
it's a dog if it's a dog or a cat. And
3:36
then if I ask you to explain to me
3:38
exactly how you figured that out, you'll
3:40
come up with a bunch of sort of reasons,
3:41
right? Alleged reasons. Oh, you know, if
3:43
it has whiskers, I think it's a cat or
3:45
whatever.
3:46
But, the problem is that you actually,
3:47
first of all, can't really articulate
3:49
what's going on in your head, how you do
3:50
these things. And number two, even if
3:51
you articulate it, often times, your
3:54
articulation has no correspondence with
3:55
how your brain actually does it.
3:58
So, you're incomplete and a liar.
4:01
So, this is Polanyi's paradox. So, if
4:03
you can't even
4:04
tell me how you do something, how the
4:06
heck am I supposed to take it and put it
4:08
into a computer? Doesn't work. And
4:10
second is the fact that we can't write
4:11
down these rules for all possible
4:13
situations. Edge cases, corner cases,
4:15
etc. And the world is full of edge
4:17
cases.
4:18
So, for these reasons, this approach
4:20
didn't work.
4:21
And so, a different approach was
4:22
developed, and this approach was, well,
4:24
basically said, "Hey, instead of
4:26
explicitly telling the computer what to
4:27
do, why don't we simply give it lots of
4:30
examples of inputs and outputs, chess
4:32
positions, next move, right? ECG,
4:35
diagnosis, right? Inputs and outputs.
4:37
And then, why don't we just use some
4:39
statistical techniques to learn a
4:41
mapping, a function, that can go from
4:43
the input to the output? Okay? That was
4:44
the idea.
4:45
And this idea is machine learning.
4:48
Okay? So, machine learning is basically
4:49
just a fancy way of saying, "Learn from
4:51
input-output examples using statistical
4:53
techniques."
4:55
Good. All right. So, um
4:59
Now, there are numerous ways to create
5:00
machine learning models, and if you've
5:01
ever done linear regression,
5:02
congratulations, you've been doing
5:03
machine learning.
5:06
Okay? And only one of those methods
5:08
happens to be something called neural
5:09
networks.
5:11
There are many other methods, and in
5:12
fact, you probably have done these other
5:14
methods if you have done a course
5:16
like the Analytics Edge or something
5:17
similar.
5:19
Okay. So, machine learning has had
5:21
tremendous impact around the world,
5:23
right? It's like, at this point, um it's
5:25
widely accepted, it's a very, very
5:27
successful technology.
5:29
And in fact, whenever people are
5:30
actually talking about AI,
5:32
chances are they're actually talking
5:33
about machine learning.
5:35
It's just that AI sounds cooler.
5:38
The only problem is, for machine
5:40
learning to work really well, the input
5:41
data has to be structured.
5:43
Okay? And what I mean by that is data
5:46
that can essentially be sort of
5:47
numericalized and stuck into the columns
5:50
and rows of a spreadsheet.
5:51
Right? So, for example, here, let's say
5:54
I want to put together a data set
5:55
of, you know, uh patients, their
5:58
symptoms, and their characteristics, and
5:59
then in the following year after they
6:01
showed up at the doctor's office whether
6:03
they had a cardiac event or not.
6:05
I might create a data set like this with
6:07
age, smoking status, yes, no, exercise,
6:09
blah blah blah blah blah. Right? And so,
6:11
either these numbers are numbers,
6:13
they're numerical, or if they're not
6:15
numerical, they're categorical.
6:17
Right? Yes, no, uh smoking, yes, no,
6:19
things like that. Which means that if
6:21
you have categorical variables, you can
6:22
just numericalize them pretty easily.
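For instance, here is a minimal sketch of numericalizing categorical columns with pandas one-hot encoding; the columns and values are made up for illustration, not the class data set:

```python
# Minimal sketch (hypothetical column names): turning categorical columns
# into numbers with one-hot encoding using pandas.
import pandas as pd

patients = pd.DataFrame({
    "age": [54, 61, 47],
    "smoker": ["yes", "no", "yes"],            # categorical
    "exercise": ["rarely", "daily", "weekly"]  # categorical
})

# get_dummies replaces each categorical column with 0/1 indicator columns,
# so every column in the result is numerical.
encoded = pd.get_dummies(patients, columns=["smoker", "exercise"])
print(encoded)
```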
6:25
You folks have done some machine
6:26
learning before, so you know, things
6:27
like one-hot encoding and stuff like
6:29
that can be done to make them all
6:30
numerical. So, the point is, you can
6:32
just render the data into the columns
6:35
and rows of a spreadsheet pretty easily,
6:36
right? That's what I mean by structured
6:38
data. But the situation is
6:40
very different if you have unstructured
6:41
data. So, if you have an image of, you
6:43
know, a cute puppy, this is my puppy, by
6:46
the way, um
6:47
from many years ago. Sadly, he's no
6:49
more. Um
6:50
but, his name was Google.
6:54
So, yeah, anyway, uh
6:56
my DMD alums know Google well. So, this
6:58
is Google, right? If you want to take
7:00
Google,
7:01
uh this picture, and figure out how to
7:03
sort of numericalize it, the first thing
7:05
you need to understand is that
7:06
if you actually look at how this picture
7:07
is represented inside, uh digitally, in
7:10
the computer, basically, every picture
7:12
like this is represented using three
7:13
tables of numbers.
7:15
Okay? And these and we'll get to what
7:17
these numbers mean later on, but the
7:19
point I'm making is that each number
7:21
basically represents the amount of
7:23
light,
7:25
right? On a scale of 0 to 255, the
7:27
amount of light in that location, in
7:29
that pixel. That's all the amount of
7:30
light. So, basically, this table is the
7:32
amount of red light, this one is the
7:35
amount of green light, and this one is
7:37
the amount of blue
7:39
light. Okay? Now, you will agree with me
7:41
that if you, for example, look at
7:42
something like this and say, "Okay, 251
7:45
at this location, there is a lot of blue
7:47
light because it's 251 out of a possible
7:49
255, right? Maybe a lot of blue light
7:52
somewhere here. There's a lot of blue
7:53
here."
7:55
Whether that area is blue because of a
7:59
piece of sky,
8:00
some water, or a bunch of blue paint,
8:03
could be anything, it's going to say
8:04
251.
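A minimal sketch of what this looks like in practice, assuming numpy and Pillow; the file name and the pixel location below are made up:

```python
# Minimal sketch: how a color photo looks to the computer.
# "puppy.jpg" is a hypothetical file name.
import numpy as np
from PIL import Image

img = np.array(Image.open("puppy.jpg").convert("RGB"))
print(img.shape)      # (height, width, 3): three tables of numbers
print(img.dtype)      # uint8: each entry is 0-255, the amount of light
print(img[120, 200])  # e.g. [14, 37, 251] -> lots of blue at that pixel,
                      # whether it is sky, water, or paint
```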
8:06
So, the underlying reality, the
8:08
underlying object that's being
8:09
described, has nothing to do with the
8:11
251.
8:12
Right? So, that's the whole problem. The
8:14
raw form of the data has no intrinsic
8:16
connection to the underlying thing.
8:19
So, given that there's no connection
8:20
between the number and what it's
8:21
describing, how the heck can any
8:23
algorithm do anything with it?
8:25
It can't.
8:27
Right? So, what you have to do is
8:30
something called feature engineering or
8:32
feature extraction, right? Where you
8:34
have to manually take all these things
8:36
and create essentially a spreadsheet
8:38
from them. So, basically, let's say that
8:40
you have a bunch of birds, right? And
8:42
you're trying to build a bird
8:43
classifier to figure out what kind of
8:44
bird species it is, you might actually
8:46
have to take this picture, and then you
8:48
have to measure the beak length, the
8:50
wingspan, the primary color, and so on
8:52
and so forth.
8:54
So, you're basically structuring the
8:56
unstructured data manually, right?
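A rough sketch of what that manual structuring might look like; the measurements below are crude stand-ins, not the carefully hand-built routines a vision researcher would actually use:

```python
# Minimal sketch of manual feature engineering ("structuring the unstructured").
# The measurements here are crude placeholders for hand-designed routines
# such as beak length or wingspan.
import numpy as np

def extract_features(bird_image: np.ndarray) -> dict:
    """bird_image: H x W x 3 array of 0-255 pixel values."""
    h, w, _ = bird_image.shape
    return {
        "aspect_ratio": w / h,                           # crude shape cue
        "mean_red":   float(bird_image[..., 0].mean()),  # crude color cues
        "mean_green": float(bird_image[..., 1].mean()),
        "mean_blue":  float(bird_image[..., 2].mean()),
    }

# Each image becomes one row of a spreadsheet-like table; a plain classifier
# (e.g. logistic regression) can then be trained on those rows.
features = extract_features(np.random.randint(0, 256, (300, 400, 3)))
print(features)
```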
8:59
And this process of structuring
9:02
unstructured data is basically called
9:06
we use the word representation. We take
9:08
the raw data and we represent the data
9:10
in a different form. And the reason
9:13
why I'm sort of
9:14
focusing on the use of the word
9:15
representation is because it becomes
9:17
really, really important a bit later on
9:19
when we get to deep learning. Okay? So,
9:22
we have to represent the data in a
9:23
different way for it to work. That's the
9:25
basic idea.
9:26
All right. So, what that means is that,
9:28
uh historically, researchers would
9:31
manually develop these representations.
9:33
And once you develop them, once you have
9:35
representations, you can just use
9:37
traditional linear regression or
9:38
logistic regression to get the job done.
9:40
So, the whole name of the game is the
9:41
representations. So, in fact, people
9:43
doing PhDs, for example, in computer
9:45
vision, would spend like 4 years
9:47
developing amazing representations for
9:49
solving one particular little problem.
9:52
Right? We have a bunch of, say, CAT
9:53
scans, and we need to take the CAT scan
9:55
and figure out whether there is evidence
9:57
for a particular kind of stroke in the
9:58
CAT scan, right? They might actually sit
10:00
and develop all kinds of representations
10:02
and test it and so on. And then they'll
10:04
finally declare victory and say, "Yay,
10:05
I'm done with my PhD. Here is this
10:07
amazing representation, and you can
10:08
build a classifier with it to predict a
10:11
particular kind of stroke with a high
10:12
accuracy." Okay? So,
10:15
that's where the world was.
10:18
Uh now, as you can imagine, developing
10:20
representations, because it's so manual,
10:22
is this massive human bottleneck, and
10:24
this sharply limited the reach
10:27
and applicability of machine learning.
10:29
As you would expect.
10:31
To address this problem,
10:33
a different approach came about, and
10:35
that's deep learning. So, deep learning
10:36
sits inside machine learning. Okay?
10:38
And deep learning
10:40
can handle unstructured input data
10:43
without upfront manual processing.
10:46
Meaning,
10:48
it will automatically learn the right
10:50
representations from the raw input.
10:52
Automatically is the keyword.
10:54
Automatically learn representations,
10:55
which means that you could give it
10:57
structured data, you can give it
10:58
pictures, you can give it text, you can
10:59
give it anything you want, it just learns
11:00
it.
11:01
Okay?
11:02
Um it can automatically extract these
11:05
representations, and since it's being
11:07
automatically extracted, you can imagine
11:09
sort of a pipeline where the raw data
11:11
comes in, you have a bunch of stuff in
11:12
the middle that's learning these
11:14
representations automatically without
11:15
your help, and then boom, you just
11:17
attach a little linear regression or
11:19
logistic regression at the end, problem
11:20
solved.
11:22
That in a nutshell is deep learning.
11:25
Input, a whole bunch of representations
11:26
being learned, and then piped into a
11:28
linear or logistic regression model.
11:30
Okay?
11:31
So, the amazing thing is this
11:34
simple idea
11:36
this simple idea
11:37
is just incredibly powerful. Right? That
11:40
idea has led to ChatGPT, has led to
11:42
AlphaGo, AlphaFold, and so on and so
11:44
forth.
11:45
And
11:46
I kid you not, I'm sort of
11:49
I've been doing deep learning
11:50
for about 10 years now, and every time I
11:52
look at it, I literally get goosebumps
11:54
every so often.
11:56
That that something so simple could be
11:57
so powerful, right? It's really like
11:59
boggles the mind.
12:01
I'm like I'm just so lucky to be alive
12:03
and working during this period.
12:05
Okay?
12:06
And you know, coming from people who
12:07
have been in the industry a long time,
12:08
this sort of breathless exclamation is
12:10
not very rare, particularly because I'm
12:12
not in marketing.
12:14
Okay? I actually mean it.
12:17
With all apologies to various
12:19
marketing folks. So,
12:21
I just realized it's being taped, so uh
12:23
okay. So, so this has demolished the
12:25
human bottleneck for using machine
12:27
learning with unstructured data, uh and
12:29
so it comes from the confluence of three
12:31
forces,
12:32
uh new algorithmic ideas, a whole lot
12:34
of data, and then very importantly, the
12:37
fact that we have access to parallel
12:38
computing hardware in the in the form of
12:40
these things called GPUs, graphics
12:42
processing units. Um and these three
12:44
forces came together, and they were
12:45
applied to an old idea called neural
12:47
networks, and that's basically deep
12:48
learning. And I'll go through it very
12:49
quickly, because obviously we're going to
12:50
spend half the semester looking into
12:52
this thing in detail.
12:54
Uh so, what's the immediate
12:56
application of the ability to
12:58
automatically handle unstructured data?
13:01
What is like the no-brainer application?
13:10
It's okay if it's obvious, tell me.
13:13
Uh sorry.
13:15
Um image classification. Right. So,
13:18
image classification, yes. So, you can
13:19
take an image, a good example of
13:21
unstructured data, you can do some
13:22
classification on it. But more
13:24
generally, more generally, what I'm
13:27
getting at is that every sensor in the
13:30
world
13:31
can be given the ability to detect,
13:33
recognize, and classify what it's
13:35
sensing. Every sensor. Because remember,
13:37
what is a What does a sensor do?
13:39
A sensor is just a receptacle for
13:41
unstructured data.
13:43
A camera is a receptacle for
13:44
unstructured video
13:46
or unstructured, you know, still images.
13:48
Microphone, unstructured audio, right?
13:50
So, every sensor, you can
13:52
imagine taking a sensor and sticking a
13:54
little deep learning system behind it.
13:56
And now, suddenly, with
13:58
what comes out of this sensor, the deep
13:59
learning system can count, it can
14:01
classify, it can detect, it can do all
14:03
kinds of stuff. In short, you can
14:05
analyze.
14:07
And you can predict, right? And this is
14:10
the way I'm describing it right now,
14:12
you'll be like, "Yeah, duh, obviously."
14:15
But you know what, this obviously thing
14:17
is actually not at all obvious
14:19
in terms of whether it'll help you find
14:21
interesting applications or not. Okay?
14:24
So,
14:25
here's something I literally saw
14:28
last week. Okay? Actually, I have
14:30
another slide before that, but we are
14:32
coming to that. So, for instance, every
14:34
time you use Face ID to unlock your
14:36
phone, this is the basic principle at
14:38
work, right? The the camera in the
14:39
iPhone is the sensor, and they stuck a
14:41
deep learning system behind it to do
14:42
image classification, right? Drama,
14:44
non-drama, right? That's what it's
14:45
classifying.
14:46
Um and so here, right, you have a breast
14:49
cancer detection
14:51
system from a mammogram.
14:52
Uh by the way, this picture
14:55
it's a very interesting picture. So, uh
14:57
there's a professor in EECS, uh Regina
15:00
Barzilay, who's a very well-known expert
15:02
in this field, and uh she actually has
15:05
built a breast cancer detection system,
15:07
which is which has been deployed at Mass
15:08
General Hospital.
15:10
And turns out she's actually a breast
15:12
cancer survivor. And uh
15:15
you know, she's good now,
15:16
all good. But after she built
15:19
her system, I heard that she actually
15:21
ran that system against the mammograms
15:25
from many years prior when she went for
15:29
a mammogram and was told that everything
15:30
is fine.
15:32
She ran the system on that mammogram,
15:34
and it came back and said, "Here is a
15:35
problem."
15:37
So, a very interesting example where a
15:38
deep learning system picked up something
15:40
that a radiologist could not, right? So,
15:43
these things can be quite powerful.
15:45
Um obviously, any self-driving system
15:47
has numerous deep learning algorithms
15:50
running under the hood, you know,
15:51
pedestrian detection, you know,
15:52
stoplight detection, zebra crossing
15:54
detection, and so on and so forth. Um
15:57
you know, it's being very heavily used
15:58
in visual inspection manufacturing.
16:00
Uh you have various cameras now instead
16:02
of people looking at saying, "Okay,
16:03
there is a dent or there's a scratch."
16:04
They have a little system, which is a
16:06
dent detector, scratch detector, and so
16:07
on. That's that's going on right now.
16:09
And now I come to the example I saw last
16:11
week,
16:12
which is um So, this is an example of
16:14
you can create dramatically better
16:16
products if you really internalize this
16:18
idea of, "Okay, it's almost like you're
16:20
looking at the world and saying, 'Oh,
16:22
there's a sensor. Can I attach a DL
16:24
thing behind it?'" That's the way you
16:25
should be looking at the world, okay,
16:26
for startup ideas. So, here's an
16:28
example, okay, these apparently are the
16:30
world's first smart binoculars.
16:34
Okay?
16:35
This is the binocular.
16:37
It came out two weeks ago. You look at
16:41
the bird,
16:42
and now it tells you what kind of bird
16:43
it is, right there.
16:47
It's a simple idea, but imagine, right?
16:50
Imagine you are the first out of the
16:51
gate with this feature, you'll have a
16:53
little bit of an edge till everybody
16:54
catches up like 3 months later.
16:57
Let's be very clear, there are no
16:58
long-term monopoly windows in the world.
17:01
There are only short-term windows, so
17:03
the hunt is always on for a little
17:04
monopoly window.
17:06
So, here's an example of that.
17:08
Right? So, I encourage you to always
17:11
think about the world as, you know,
17:13
where are the sensors here?
17:15
And can I attach something behind the
17:16
sensor to do something useful with it?
17:18
Okay?
17:19
All right.
17:24
Now, let's uh turn our attention to the
17:26
output.
17:27
We've been talking about structured
17:28
data, unstructured data, and how deep
17:30
learning has sort of unlocked the
17:32
ability to work with unstructured data,
17:34
but we've sort of been neglecting the
17:35
output side of the equation. So,
17:37
traditionally, uh we could predict
17:40
single numbers or a few numbers pretty
17:42
easily, right? So, you've all done the
17:44
canonical, you know, uh should this
17:46
person be given a loan, in
17:48
machine learning, right? So, you just
17:50
predict the probability that a borrower
17:51
will repay a loan based on a
17:53
whole bunch of data, or supply chain,
17:56
you predict the demand for the product
17:57
next week, or you could predict a bunch
17:58
of numbers. So, given a
18:00
um given a picture, you can say, "Okay,
18:01
which one of the 10 kinds
18:03
of furniture is it?" Right? You can
18:04
predict 10 numbers, 10 probabilities
18:06
that add up to one. You can predict a
18:08
whole bunch of numbers that don't have
18:09
to add up to one, such as the GPS
18:10
coordinates of an Uber ride. So,
18:12
these are all
18:15
simple structured outputs, just a few
18:16
numbers, right? What we could not do
18:18
very easily was to actually generate
18:20
pictures like this.
18:23
We could not generate unstructured data.
18:25
We could only consume unstructured data,
18:27
right?
18:28
Um you can generate text, you can
18:29
generate pictures, and so on, and audio,
18:31
and so on, and so forth.
18:32
So, with generative AI, that problem is
18:35
gone.
18:36
So, generative AI is the ability to
18:37
actually create unstructured data, all
18:39
right? And therefore, it sits within
18:41
deep learning. It still runs on deep
18:43
learning, but it's just one kind of deep
18:45
learning.
18:47
Okay? There's plenty of stuff going on
18:49
in deep learning that's got nothing to
18:50
do with generative AI.
18:51
Nowadays, of course, you know, if you're
18:53
a self-respecting entrepreneur who wants
18:55
to ride this craze, you'll probably
18:57
declare whatever you're doing as
18:58
generative AI.
19:00
Right? Um and some VCs may actually be
19:02
ready to fund you, who knows?
19:04
But the point is, there's plenty of
19:05
stuff going on in deep learning that's
19:06
got nothing to do with generative AI. Uh
19:08
but this is the overall picture. Now,
19:11
here, uh we can produce unstructured
19:13
outputs, like pictures. You can take
19:15
this thing, and then you can actually,
19:17
you know, come up with a nice picture
19:18
description of it. This actually is a
19:19
very famous picture, by the way, in in
19:21
the world of computer vision. So, we are
19:23
actually going to be analyzing this
19:24
picture a little later on
19:26
in the semester.
19:27
Uh you can obviously go from a very
19:29
complicated caption to an image.
19:31
Uh you can go from text to music.
19:36
Can people hear it? Okay. Yeah. Yeah.
19:38
All right. So, and of course, we can go
19:40
from text to text, i.e., ChatGPT. Uh and
19:43
then uh as of a few months ago, things
19:45
have gotten even more interesting, where
19:47
you can actually go you can send text
19:49
and an image in, and you can get text
19:51
out.
19:51
Right? And in fact, as of a few weeks
19:53
ago, you can send text, image, text,
19:55
image, text, image in in an arbitrary
19:56
sequence
19:58
into into the system, and it'll actually
20:00
come back to you with text and image.
20:02
Right? So, things are becoming
20:03
multimodal. I just want to share with
20:05
you like a really fun example I saw
20:07
uh recently. So, this person
20:10
sends this picture. Can folks see this?
20:14
It's this very complicated parking sign.
20:16
Apparently in San Francisco.
20:19
And they're like, it's Wednesday at 4:00
20:20
p.m. Can I park here?
20:22
Tell me in one line. Because you really
20:23
didn't want GPT-4 to be giving you a big
20:25
essay about this.
20:26
Like, you literally want to park.
20:29
So, GPT-4 comes back and says, "Yes, you
20:32
can park here for up to 1 hour starting
20:33
at 4:00 p.m."
20:35
And folks, I double-checked this thing,
20:36
it's correct.
20:38
We all know these things hallucinate,
20:39
right? Can you imagine getting a parking
20:41
ticket and telling the judge, "I'm
20:42
sorry, I didn't realize it was
20:42
hallucinating."
20:44
So,
20:45
so you have to double-check it.
20:46
So, yeah. So, things are getting
20:47
multimodal very quickly.
20:49
Uh and so, the picture here is that
20:51
within gen AI, we used to have these
20:53
separate circles, text to text, text to
20:55
image, text to music, text to this, text
20:57
to that, so on and so forth. Those are
20:59
all beginning to merge now inside gen AI
21:00
because multimodal models are going to
21:02
become the norm this year, right? We
21:04
already have really good closed models.
21:06
We actually already have
21:07
very good open-source multimodal models.
21:10
And so, my feeling is that by the end of
21:12
the year, the idea of using a text-only
21:15
model is going to be like, "Really, you
21:17
do that still?"
21:19
Right? It's going to become like a
21:20
quaint, old-fashioned thing. I think
21:21
multimodality is going to become
21:23
the norm. So, that's where the world is,
21:25
and this is the landscape. So, any
21:26
questions on the landscape?
21:29
Before we actually start doing some
21:29
math.
21:35
Okay.
21:37
Yeah.
22:05
You mean the evidence of that
22:07
being a problem would have been smaller.
22:09
Yeah.
22:16
Yeah. So, I think the question
22:17
is that in general, how do you train
22:19
your models so that it gives you the
22:20
right answers given that over the
22:22
passage of time, the amount of evidence
22:24
in this data could be very highly
22:25
variable. So, in this particular case of
22:28
you know, the professor I talked about,
22:30
uh yeah, everything at that point was
22:32
going through an expert radiologist.
22:34
So, 5 years ago, this mammogram was seen
22:36
by a radiologist, and that person
22:38
concluded there is no problem. So, that
22:40
was the training label, right? The wrong
22:41
training label. Uh so, in typically what
22:44
happens is that training labels could be
22:46
wrong some small fraction of the time.
22:48
So, you need to have systems that are
22:49
robust. So, your data needs to be
22:51
complete, it needs to be comprehensive,
22:53
it needs to have correct labels. If
22:56
these ideas are not met, your systems
22:58
are not going to be that good. But as it
22:59
turns out, with neural networks, even
23:01
with some amount of noise in the labels,
23:04
they still do a pretty good job.
23:06
Right? So, that's sort of the
23:07
general idea.
23:11
The verification comes from
23:12
the human. So, remember, when we
23:15
look at radiology data,
23:17
the data we're working with: the
23:19
input is, let's say, an image, like a
23:21
mammogram or something, and then a
23:23
human radiologist or a set of
23:25
radiologists have said this has a
23:27
problem or does not have a problem. So,
23:29
that is called the ground truth.
23:31
So, it is this ground truth image and
23:33
label, this combination that's being
23:35
used to train these models.
23:39
Yeah.
23:43
Embodiment? So, are we going
23:45
to cover embodiment? So, the
23:47
embodiment here refers to the fact
23:49
that
23:50
if you have robots, right?
23:53
They need to actually operate in the
23:54
real world, and so robots are an example
23:56
of what's called embodied intelligence.
23:58
So, unfortunately, due to the
23:59
constraints of time, we're not going to
24:01
get into robotics at all. But I will say
24:03
that a lot of the deep learning stuff
24:04
we're going to talk about, those are
24:05
all fundamental building blocks in
24:07
modern robotic systems.
24:09
All right. So, um so, in summary,
24:13
X and Y
24:14
can be anything, and it can be
24:15
multimodal.
24:17
Okay? I literally could not have put up
24:19
this slide maybe 2 years ago.
24:21
Right? So, it's very simple in how it
24:23
looks, but it's very profound. You can
24:25
You can learn a mapping from anything to
24:28
anything at this point very easily as
24:29
long as you have enough data.
24:31
Okay? So, um now, note that all this
24:34
excitement that we see around us
24:36
stems from deep
24:38
learning.
24:39
Okay?
24:40
Everything depends on deep
24:42
learning. And so, if you understand deep
24:44
learning, a lot of interesting things
24:45
become possible. So, let's get going.
24:47
All right. So, we'll start with the very
24:49
basics. Uh what's a neural network?
24:51
Uh now, recall logistic regression
24:54
from back in the day.
24:56
So, what is logistic regression?
24:57
You send in a bunch of numbers, a vector
24:59
of numbers, and you usually get a
25:01
probability out, right? Between 0 and 1.
25:03
What is the probability of something or
25:05
the other? Okay? Um and so, this
25:07
logistic regression model is also
25:09
represented in this form,
25:11
if you will recall. So, basically what
25:13
we do is we take all these numbers, we
25:15
run it through a linear function, right?
25:17
We run it through a linear function, you
25:19
get a number, and then we take that
25:20
thing and run it through 1 / 1 + e
25:22
raised to minus that,
25:25
and that's guaranteed to give you a
25:26
number between 0 and 1, which can be
25:27
interpreted as a probability, and that's
25:29
logistic regression. Okay? And the
25:31
canonical, you know,
25:33
uh loan approvals, things like that, all
25:35
fall into this sort of convenient
25:36
bucket.
25:38
Okay? So, this should be super familiar.
25:44
All right. Now, we're going to actually
25:46
look at this, you know, simple, modest,
25:48
humble little operation
25:51
using the lens of a network of
25:53
mathematical operations, and the reason
25:55
why we do it will become clear a bit
25:56
later.
25:57
So, we'll take this very simple example
25:59
where we have uh let's say two
26:02
variables, GPA and experience, right?
26:05
This is the GPA of some graduates, uh
26:07
number of years of work experience, and
26:09
then this is the dependent variable,
26:11
which is either 0 or 1, and 0 if they
26:14
don't get called for an interview, 1 if
26:16
they get called for an interview. Okay?
26:18
It's a two-input variable, one-output
26:20
variable problem. Okay? And it's a
26:22
classification problem because we're
26:24
classifying people into will they get
26:25
called for an interview, yes or no.
26:27
Okay?
26:29
And so, that's the setup for this
26:31
problem.
26:33
And let's say that we actually run it
26:35
through, you know, we actually try to
26:38
fit a logistic regression model to it.
26:40
So, if you're familiar with R, for
26:41
example, you would use something like
26:43
GLM to fit this model.
26:46
Um if you use something like statsmodels
26:48
in Python, there's a similar function
26:49
for it. Scikit-learn, there's another
26:52
function for it. You get the idea,
26:53
right?
26:55
You can use whatever favorite methods
26:57
you have for logistic regression
26:58
modeling to get this job done. And if
27:00
you do that with this little data set,
27:02
you're going to get these coefficients.
27:04
Right? The 0.4 is the intercept, 0.2 is
27:06
the coefficient for GPA, 0.5 for
27:08
experience. And that is the resulting
27:09
sigmoid function.
27:11
Okay?
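A minimal sketch of that fit using scikit-learn; the tiny data set below is made up, so its fitted numbers will differ from the 0.4 / 0.2 / 0.5 on the slide:

```python
# Minimal sketch: fit a logistic regression on [gpa, experience] -> interview.
# The six rows here are made-up stand-ins for the class data set.
from sklearn.linear_model import LogisticRegression
import numpy as np

X = np.array([[3.9, 2.0], [3.1, 0.5], [3.6, 1.5],
              [2.8, 0.0], [3.4, 3.0], [2.5, 1.0]])  # [gpa, experience]
y = np.array([1, 0, 1, 0, 1, 0])                    # 1 = called for interview

model = LogisticRegression().fit(X, y)
print(model.intercept_, model.coef_)  # intercept plus one coefficient per input
```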
27:12
All right. Cool. So, now let's actually
27:14
rewrite this formula as a network in the
27:17
following way. So, first, what we'll do
27:19
is we'll take GPA and experience and
27:20
stick it here on the left side, and
27:22
we'll put little circles next to them,
27:24
and we'll call them the input nodes.
27:26
Okay? And so, imagine that somebody
26:29
writes a GPA into the circle, 3.5 or
27:32
you know, years experience, 2.0, and
27:34
then it flows through this arrow,
27:36
and as it flows through, it gets
27:38
multiplied by its coefficient, 0.2. The
27:40
0.2 is coming from here.
27:42
Similarly, experience gets multiplied by
27:44
0.5, it comes in here, and this node, as
27:47
the plus indicates, is adding everything
27:49
that's coming into it.
27:50
So, it's adding 0.2 * GPA, 0.5 *
27:52
experience, plus the intercept, which is
27:54
the green arrow coming in on its own.
27:57
It comes through here, and what comes
27:58
out of this is just a single number,
28:01
and that number goes into this little
28:02
circle,
28:04
and then out pops a probability.
28:07
Okay?
28:08
So, I've sort of
28:10
done this ridiculously
28:13
long-winded way of writing a simple
28:15
function.
28:16
Okay? And the reason we why I'm doing it
28:18
will become clear in a second.
28:21
Okay? So, this is a little network of
28:23
operations for the simple function.
28:25
And so, for instance, how you would use
28:27
it to make a prediction:
28:29
let's say someone has a 3.8 GPA and 1.2
28:31
years experience, you just plug it in
28:33
here,
28:34
do the math, you get 0.76, same thing
28:36
here, comes in here, add them all up,
28:38
you get 1.76, you run 1.76 through the
28:40
sigmoid, you get 0.85, and that is the
28:43
probability that that particular
28:44
individual may get called for an
28:45
interview.
28:46
Okay? At this point, we're just doing
28:48
logistic regression, nothing more
28:49
complicated.
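A minimal sketch of that prediction step, using the coefficients from the slide (intercept 0.4, GPA 0.2, experience 0.5):

```python
# Minimal sketch of the forward pass described above.
import math

def predict_interview_probability(gpa, experience):
    z = 0.4 + 0.2 * gpa + 0.5 * experience  # linear function (the "+" node)
    return 1 / (1 + math.exp(-z))           # sigmoid squashes it into (0, 1)

print(predict_interview_probability(3.8, 1.2))  # 0.2*3.8 + 0.5*1.2 + 0.4 = 1.76 -> ~0.85
```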
28:51
Okay? So, um now, if you have many
28:54
variables, not two variables like X1
28:56
through XK, the same sort of
28:58
logic applies. Each one has some
28:59
coefficient, and then there's an
29:01
intercept, they all get added up here,
29:03
run through a sigmoid, and out pops this
29:04
number. Okay? Notice how the data flows
29:07
from left to right.
29:09
Okay?
29:10
All right. Any questions on this?
29:15
All right. Good.
29:16
So, now terminology.
29:18
Uh so, you'll discover
29:20
that the world of neural networks and
29:21
deep learning has its own terminology.
29:24
They have their own ways of referring to
29:25
things that the rest of the world has
29:26
been referring to by other names for
29:28
the longest time.
29:29
Right? It's kind of annoying sometimes,
29:31
but it's the way it is. So, um
29:35
Remember in regression, we used to call
29:37
those numbers next to each variable as
29:38
coefficients,
29:39
and the constant thing as an intercept?
29:41
Well, guess what? In this world, these
29:43
coefficients are actually
29:44
called weights,
29:46
and the intercepts are called biases.
29:49
So, in in the neural network world,
29:50
these are called weights and biases.
29:53
And sometimes, if you're a little lazy,
29:54
you may just call the whole thing
29:55
weights.
29:56
Okay? So, when you see in the newspaper
29:58
that, you know, "Oh my god, this amazing
30:00
model's weights have been leaked
30:03
on the internet or on BitTorrent or
30:05
something." That's what's going on,
30:06
right? All these coefficients have been
30:08
leaked. Because once you know what the
30:09
coefficients are and what the
30:11
architecture is, you can just
30:12
reconstruct the model.
30:15
All right. So, that's what's going on
30:16
here.
30:17
Now, why did we do this network
30:19
business? Why did we write it as a
30:20
network?
30:23
Yeah, what is the advantage? Any
30:24
guesses?
30:34
When you have multiple functions for
30:37
So,
30:38
it's just easier to see it that way.
30:40
Right. If you have lots of things going
30:41
on, it's easier to see it if you
30:43
actually write it in graphical form.
30:45
Yes, correct.
30:46
But, so is it only like a usability
30:49
advantage?
30:51
I mean, the thing is you want different
30:53
functions for different layers of that.
30:55
Uh-huh.
30:56
Okay.
30:57
So, maybe we want to use different
30:59
functions in different layers. But, I
31:00
think there's actually even a larger
31:02
sort of a more basic point, which is
31:04
that
31:05
the moment you write it
31:07
down, you suddenly realize
31:09
that I could have lots of things in the
31:10
middle.
31:12
I don't have to go from the input to the
31:13
output directly. I can do lots of things
31:15
in the middle, right? That's sort of the
31:17
key idea. So, what you do is
31:20
So, remember the notion of learning
31:22
representations of unstructured data,
31:24
right? Where you take a picture and say
31:25
beak length and things like that, right?
31:27
And remember, I said deep learning
31:29
actually automatically learns these
31:30
things. Where is that automatic learning
31:33
coming from?
31:34
Well, this is where it's coming from.
31:36
So, what we do is we take this thing,
31:38
right? There's just a logistic
31:39
regression model. Inputs
31:41
get multiplied and added up as a linear
31:43
function, run through a sigmoid.
31:45
And then
31:46
we are like, "Hmm, if we want to learn
31:48
representations of the raw input, we
31:51
better be doing something in the middle
31:53
here."
31:54
Because the output is the output.
31:56
That is That's not going to change.
31:58
You know, it's it's either a dog or a
32:00
cat. You don't have any choice
32:02
as to what it is. Okay? The only agency
32:05
you have at this point is you can take
32:07
the raw input and do things in the
32:09
middle with it.
32:11
You can do a lot of stuff in the middle
32:12
and then run it through something to get
32:14
the output. Okay? So, in
32:18
any mathematical discipline,
32:20
if someone comes to you and says,
32:22
"Here's a bunch of data.
32:23
I want you to do something with it."
32:25
What should the What is like the big the
32:27
most basic first thing you should do?
32:31
Run it through a linear function.
32:34
The most basic thing in math is a linear
32:36
function. So, given anything, just run
32:37
it through a linear function. See what
32:38
happens.
32:40
So, that's exactly what we can do. So,
32:42
the simplest thing we can do here, we
32:44
can insert a bunch of linear functions.
32:46
So, what we do is we take all this input
32:50
and we do a linear
32:52
function on it. So, think of it as
32:52
X1 * 2 + X3 * 4 and all the way to XK *
32:56
9 plus some intercept and boom, it goes
32:58
out the other end. So, this little
33:00
circle here with a plus in it is just
33:05
Thank you.
33:05
Uh
33:06
that is, this is just a
33:08
shorthand for a linear function.
33:10
So, whenever you see a circle with a
33:11
plus, it's just a shorthand for a linear
33:13
function. Okay? So, you can take this
33:15
whole thing and run through a linear
33:16
function and when you do it, you'll get
33:17
some number right there. You'll get some
33:19
number. So, you've taken these K numbers
33:21
and you've sort of compressed them
33:23
in some way into one number.
33:25
Okay?
33:26
But, you don't have to stop at one
33:28
number. You can do more.
33:30
So, we can have a stack of linear
33:31
functions in the middle.
33:33
Right? There's a linear function here,
33:35
another one here, another one here. At
33:37
this point, the K numbers you have
33:40
K could be, for example, 1,000.
33:42
Right? It's just the size of your input
33:43
data.
33:44
You've taken these K things and you've
33:45
compressed them into three numbers at
33:47
this point.
33:48
Okay?
33:50
So, okay, maybe three is the right
33:52
number, maybe 10 is the right number. We
33:53
don't know.
33:54
And we'll get to know how do we know
33:55
what the right number is later on.
33:58
So, we can stack as many linear
33:59
functions as we want.
34:01
So, we have transformed this K thing
34:02
into a three-dimensional vector, right?
34:04
K numbers become three numbers.
34:06
Um
34:07
and now we can flow these
34:10
three numbers through some other little
34:12
function.
34:13
Okay?
34:16
And as you will see in a few minutes,
34:18
that function is called an activation
34:19
function
34:20
and it's chosen to be a non-linear
34:22
function
34:23
because if you don't choose it to be a
34:24
non-linear function, all the effort we
34:26
are doing is going to be a total waste
34:28
of time.
34:30
Okay? For now, just
34:32
take it on faith that you need to have
34:34
non-linear functions here.
34:36
But, note that the three numbers here
34:39
are still three numbers. They are three
34:41
different numbers, but they're still
34:42
three numbers.
34:43
And once we do this, we'll be like, "You
34:45
know what? This was fun. Let's do it
34:46
again."
34:48
Okay? So, you can do it again.
34:52
And you can keep on doing it. You can
34:53
do it 100 times if you want.
34:55
And the key thing is that every time you
34:57
do it, you're giving this network some
35:00
ability, some capacity to learn
35:03
something interesting from the data.
35:05
To learn an interesting representation.
35:07
Now, of course, you're thinking, "Well,
35:09
how do we know it's interesting? How do
35:10
you know it's a useful thing?" And we'll
35:12
come to all that later on.
35:14
Right? We're just giving it the
35:14
capacity, the potential to learn
35:16
interesting things from the data.
35:17
Whether it actually lives up to its
35:19
potential, we don't know yet.
35:21
Okay? We'll give it the potential.
35:23
Because the more transformations of the
35:24
input data you make, the more
35:26
opportunity you have to do interesting
35:27
things with it.
35:29
If I don't even give you the opportunity
35:30
to transform it once, you don't have any
35:31
opportunity, right?
35:32
If I give you 10 chances to transform
35:34
things, you have 10 shots at doing
35:36
something useful.
35:38
So, you can you can do this repeatedly
35:40
and once we are done doing these
35:42
transformations, we just pipe it through
35:44
to our good old logistic regression
35:46
sigmoid here and we are done.
35:50
Okay?
35:51
So, this is the basic idea.
35:53
And so, just to contrast it, this was
35:55
good old logistic regression where we
35:57
take the input,
35:59
we run it through a linear function and
36:00
pop out a number,
36:02
a probability number. But, after we do
36:04
all this stuff, the input stays the
36:06
same, the output stays the same, but in
36:08
the middle you just run through a whole
36:09
bunch of these functions, you know,
36:11
these layers, boop boop boop boop, and
36:12
then we get the output.
36:14
Okay?
36:15
That's all we have done.
36:16
And this is a neural network.
36:19
A neural network is nothing more than
36:21
repeatedly transformed inputs which are
36:25
finally fed to a linear or logistic
36:27
regression model.
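A minimal numpy sketch of that idea; the layer sizes and the random weights are arbitrary stand-ins for what would normally be learned from data:

```python
# Minimal sketch of "repeatedly transformed inputs fed to logistic regression".
# Layer sizes and (random) weights are arbitrary -- in practice they are learned.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)                                 # K = 5 input numbers

W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=3)   # hidden layer 1: 5 -> 3
W2, b2 = rng.normal(size=(3, 3)), rng.normal(size=3)   # hidden layer 2: 3 -> 3
w_out, b_out = rng.normal(size=3), rng.normal()        # output layer: 3 -> 1

relu = lambda a: np.maximum(a, 0.0)
sigmoid = lambda a: 1 / (1 + np.exp(-a))

h1 = relu(W1 @ x + b1)           # linear function, then non-linear activation
h2 = relu(W2 @ h1 + b2)          # another learned representation
p = sigmoid(w_out @ h2 + b_out)  # plain logistic regression on the final representation
print(p)
```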
36:35
Any questions?
36:37
I have two questions. Could you use the
36:38
thing so that everyone can hear? Yeah.
36:41
I have two questions. Firstly, so when
36:43
we say that there isn't chance of
36:45
explainability, is it that we don't know
36:48
which arrow it went through? That's one.
36:51
Second,
36:53
who's controlling the number of
36:54
iterations or the number of functions?
36:57
That's up to us or how does that work?
36:59
Right. So, yeah, so the first
37:01
question, um explainability, we actually
37:03
know exactly, for any given input
37:06
uh data point, how
37:09
it flows through the network. So, there
37:10
is no problem there.
37:12
The problem is in ascribing, "Okay,
37:15
we think this person is going to
37:17
uh repay the loan because
37:20
of this particular attribute." We don't
37:21
know that because those attributes all
37:24
get enmeshed together and go through
37:25
this complicated thing. So, we know
37:27
exactly what happens. We just can't give
37:29
credit to anyone thing very easily.
37:31
I'm again, I'm just standing on the
37:33
brink of this vast ocean of something
37:35
called explainability and
37:36
interpretability, uh which I'll get to a
37:38
bit later on in the semester. But,
37:39
that's sort of the quick
37:42
kind of right-ish kind of wrong answer.
37:44
Okay? Number two, um
37:46
uh
37:47
we decide the number of layers. We
37:49
decide a whole bunch of things and as
37:51
we'll see in a few minutes, uh there is
37:52
something that's given to us and
37:53
something we get to design and I'll make
37:55
it very clear which is which.
37:59
Yeah.
38:02
Did I say your name right? Yeah.
38:04
So, which functions have to be linear
38:06
and also like why does it have to be
38:08
linear? Yeah. So, these functions uh the
38:11
f of x here, they have to be non-linear.
38:15
As to why they have to be non-linear,
38:16
we'll get to that in a few minutes.
38:19
Okay. So, these are called neurons.
38:22
Okay?
38:23
These things where you basically there's
38:25
a linear function followed by uh a
38:27
little non-linear function,
38:29
right? Each one of these
38:31
things is called a neuron.
38:32
Um
38:34
By the way, you know, this is loosely
38:36
inspired by the way, you know, uh
38:39
neurons work in mammalian
38:41
brains.
38:42
But, the connections between
38:45
neuroscience and deep learning
38:47
are very heavily argued.
38:50
So, I'm going to like stay away from it.
38:52
Okay? Uh suffice it to say, I just
38:55
think that for building practical deep
38:57
learning systems in industry,
38:59
you don't worry about this. Okay?
39:01
All right, let's move on.
39:04
Terminology. Uh this vertical stack of
39:06
linear functions or neurons,
39:09
right? This vertical stack is called a
39:10
layer.
39:12
Right? This is a layer, that's a layer.
39:14
Uh and these little non-linear
39:15
functions, which we haven't gotten to
39:17
yet, are called activation functions.
39:20
Uh and we'll get to why they are called
39:22
that in just a second.
39:25
And
39:26
the input
39:29
is called an input layer and I have the
39:31
word layer in double quotes because like
39:34
it's not really doing anything, right?
39:35
It's just the input.
39:36
So, but we call it an input layer.
39:39
And what the very final thing that
39:41
produces outputs is called the output
39:42
layer, right? Obviously. And everything
39:45
in the middle is called a hidden layer.
39:48
Okay?
39:50
So, the final piece of terminology is
39:52
that when you have a layer like this in
39:54
which say three numbers are coming out
39:56
and there's another layer,
39:58
right? If every neuron in this layer is
40:00
connected to every neuron in this layer,
40:03
it's called a fully connected or dense
40:05
layer. So, for instance, here
40:07
this arrow that's
40:08
whatever number is coming
40:10
out. Let's say the number three is
40:11
coming out of this thing here. That
40:12
number three flows on this arrow to
40:15
this thing, flows on this arrow to this
40:17
neuron, and flows on this third arrow to
40:19
this neuron. That's what I mean. So,
40:21
every neuron, its output is being sent
40:23
to every neuron in the following layer.
40:25
Okay? That's we call it fully connected
40:27
or dense.
40:29
And then
40:30
if you look at logistic regression,
40:32
right? This is logistic regression. You
40:34
can see basically logistic regression is
40:36
a neural network with no hidden layers.
40:41
So, in some sense, logistic regression
40:42
is like almost the simplest possible
40:43
network you can think of.
40:45
Like barely a neural network.
40:48
Right? It's got no no hidden layers.
40:50
That's what makes it logistic
40:51
regression.
40:52
And so, as you might have guessed by
40:54
now, deep learning is just neural
40:56
networks with lots and lots of
40:58
of what?
41:00
Yes, layers.
41:02
So, here are a few.
41:04
Uh and by the way, these are not even
41:07
considered all that, you know,
41:08
impressive these days.
41:10
Okay? Uh but I put them up because this
41:13
this thing here is called ResNet.
41:16
And it's famous because the ResNet
41:18
neural network was I think the first
41:20
network
41:21
to surpass human-level performance in
41:24
image classification.
41:26
It's sort of like the Skynet
41:28
of image classification. Okay? It
41:31
surpassed human-level performance. And
41:32
I'm putting it up here because we'll
41:34
actually work with ResNet next
41:36
Wednesday. And we'll actually take
41:37
ResNet, we'll fine-tune it, and solve a
41:39
real problem in class.
41:41
All right. So, it's got lots and lots of
41:43
layers. Uh now, let's turn to these
41:46
activation functions. We've been
41:47
ignoring these little guys, right? So
41:48
far.
41:49
So, the activation function at a node is,
41:52
first of all, a function that
41:54
receives a single number and outputs a
41:56
single number, right? It's not very
41:58
complicated, right? Basically,
42:00
this is a linear function
42:03
which receives all these inputs. It
42:04
could be 10 inputs, 1,000 inputs,
42:06
runs it through a linear function,
42:07
outputs a number, and that single
42:09
number, a scalar, goes in here, and it
42:12
comes out as another single number.
42:14
Just just just remember that.
42:16
And so, these are some of the most
42:18
common activation functions. In fact,
42:19
the sigmoid we saw, which we actually
42:21
use for the output, is actually a kind
42:23
of activation function where a single
42:25
number comes in and it gets mapped into
42:28
this curve because of this thing. So,
42:30
the single number that comes in is A,
42:31
and it gets transformed as 1 / (1
42:33
+ e^-A), and you get a shape like this,
42:37
and it's called the sigmoid activation
42:38
function. And as you can see
42:40
here,
42:41
for very small values, for very negative
42:44
values,
42:45
it's going to be pretty close to zero,
42:47
meaning it won't get activated.
42:50
And for very very large values, it's
42:52
going to be
42:53
pretty close to one.
42:55
All the action happens in the middle.
42:57
When your values are
42:59
somewhere in this range, there's a
43:00
dramatic increase in what comes out.
43:03
Okay? So, that little thing in the
43:05
middle is a sweet spot for these
43:06
functions.
43:07
Uh
43:08
and this
43:10
you know, I'm almost embarrassed
43:11
to call it an activation function
43:12
because it's literally not doing
43:13
anything. It's sort of getting a nice
43:15
label for free.
43:16
Um right? Basically, it says you just
43:18
get a number, just pass it straight
43:19
along.
43:20
It's a linear activation function, but
43:22
just for completeness, I want to put it
43:23
here.
43:25
And then we come to the hero of deep
43:28
learning, which is the rectified linear
43:30
unit,
43:32
right? Rectified linear unit. It's
43:34
called ReLU. Uh and ReLU is going to
43:37
become part of your vocabulary very very
43:38
quickly. Uh and so, ReLU is actually a
43:41
very interesting function. So, you write
43:43
it as maximum of whatever number and
43:44
zero,
43:46
which is another way of saying if the
43:48
number is positive, just send it along
43:50
unchanged. If the number is negative,
43:53
send a zero instead. Squish it to zero.
43:56
So, which means if the number is
43:57
negative, nothing happens. If the number
43:59
is positive, it wakes up.
44:03
So, what happens is that you could have
44:04
a very complicated linear function with
44:07
millions of variables, and then it outputs
44:09
a single number, and that number
44:10
unfortunately happens to be negative.
44:12
The ReLU is not impressed. It's going to
44:13
send a zero out.
44:15
Okay? It's a very simple function.
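A minimal sketch of ReLU, alongside the sigmoid from earlier, as plain functions: each one takes a single number in and sends a single number out:

```python
# Minimal sketch of the activation functions mentioned so far.
import math

def relu(a):
    return max(a, 0.0)             # negative -> 0, positive -> passed along unchanged

def sigmoid(a):
    return 1 / (1 + math.exp(-a))  # squashes any number into (0, 1)

print(relu(-3.2), relu(1.7))       # 0.0  1.7
print(round(sigmoid(0.0), 2))      # 0.5
```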
44:17
And many many folks who've been in deep
44:20
learning for a long long time believe
44:22
that
44:23
the use of the ReLUs is one of the key
44:25
factors
44:26
that led to the amazing success of deep
44:28
learning because it's got some very
44:30
interesting properties,
44:32
uh which we'll get to hopefully on
44:33
Wednesday.
44:35
Okay. So, the shorthand here is that um
44:40
whenever you see this thing, it's just a
44:42
linear activation, linear function
44:43
followed by just sending it straight
44:44
out. If I put a
44:47
ReLU in here, I'm going to denote it
44:49
like that, which mimics
44:51
uh how the graph looks. And if I
44:53
put a sigmoid, I'm just going to use
44:54
this thing here.
44:55
Okay?
44:56
Just a visual shorthand.
44:59
>> [clears throat]
45:00
>> There are many other functions
45:02
activation functions, by the way.
45:03
There's something called the tanh
45:05
function, the leaky ReLU, the GELU, the
45:07
Swish. I mean, it's like a menagerie of
45:10
activation functions because very often
45:12
researchers will be like, "Well, I don't
45:14
like this activation function. Here's a
45:15
little modified version of the function
45:17
which is going to be better for certain
45:18
things." So, you know, people's research
45:20
creativity on this point has sort of
45:22
gone unhinged. Um so, there's lots of
45:24
options. But if you just stick to the
45:26
ReLU
45:27
for your hidden layers, you can
45:29
basically get anything done practically,
45:31
right? You don't have to worry about
45:32
anything else. So, we'll only focus on
45:34
ReLUs for all the intermediate stuff. Uh
45:37
yeah.
45:38
Yeah, how do you gauge which activation
45:40
function is more suited for your use
45:41
case?
45:42
Yeah. So, the rule of thumb here is that
45:45
for your hidden layers, use ReLUs,
45:48
right? Because empirically we have seen
45:49
that they they do an amazing job.
45:51
For your output layer, your very final
45:54
thing, you actually don't have a choice
45:56
because what you have to use depends on
45:57
what kind of output you have to work
45:59
with. If it's an output which is a
46:01
probability number between zero and one,
46:02
you have to use a sigmoid.
46:04
Um if it is
46:05
say 10 numbers, all of which have to be
46:07
probabilities, and they have to add up
46:08
to one,
46:10
you got to use something called the
46:10
softmax, which we'll get to on
46:12
Wednesday. So, it really depends on the
46:13
output, and the nature of the output
46:15
dictates what you use in the output
46:16
layer.
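A minimal numpy sketch of the softmax mentioned here, which turns a handful of raw scores into probabilities that add up to one; the scores below are made up:

```python
# Minimal sketch of the output-layer rule of thumb: sigmoid for one probability,
# softmax for a set of probabilities that must sum to one.
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())  # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 0.5, -1.0])    # e.g. raw scores for 3 furniture classes
probs = softmax(scores)
print(probs, probs.sum())              # three probabilities summing to 1.0
```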
46:18
Okay.
46:19
So, coming back to this. So, if you want
46:22
to design a deep neural network,
46:24
uh the input is the input.
46:27
The output is the output. And so, you
46:29
get to choose everything else. You get
46:30
to choose the number of hidden layers,
46:32
the number of neurons in each layer, the
46:35
activation functions you're going to use
46:37
uh for the hidden layers, and then
46:39
you have to make sure that what you
46:41
choose for the output layer matches the
46:42
kind of output you want to generate.
46:44
Okay? So, this is
46:46
all in your hands. You decide what
46:48
happens. But
46:51
there's a lot of guidance
46:52
for how to do these things, which we'll
46:53
cover as we go along.
46:56
Did you have a question?
46:57
Kind of, but I guess I'll do it.
47:00
Is there also exploration in kind of
47:03
dynamically
47:05
setting up layers, so that you
47:07
determine the number of layers?
47:12
Yeah. So, there's a whole field called
47:14
neural architecture search, NAS,
47:16
where we can actually try a whole bunch
47:18
of different architectures,
47:20
uh and then use some optimization and in
47:22
fact reinforcement learning, which we
47:23
won't get to in this class,
47:25
as a way to figure out really good
47:27
architectures for any particular
47:28
problem. But the
47:32
question of, okay,
47:33
when I'm training a model with a
47:34
particular kind of data,
47:36
the first pass through the training
47:37
data, I'm going to use two layers. The
47:39
second pass, I'm going to do seven
47:40
layers. That is not done.
47:42
Uh and the reason it's not done is
47:44
because of certain other constraints we
47:45
have in how we can do the
47:47
optimization and the gradient descent
47:48
and stuff like that. But what you can
47:50
do, and we'll look at this thing
47:52
called dropout,
47:54
for certain layers, you can actually,
47:56
each time you run it through the
47:58
network, decide: in this layer
48:00
I'm not going to use all the nodes. I'm
48:02
going to drop out a few of the nodes
48:03
randomly. And it's a very effective
48:05
technique to prevent overfitting, and
48:07
we'll come to that a little later on.
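Dropout is covered properly later in the course; purely as a preview, here is a minimal sketch of the idea. The drop probability and the "inverted dropout" rescaling here are standard choices, not something stated in the lecture:

```python
import numpy as np

def dropout(activations, p_drop=0.5, training=True):
    # During training, randomly zero out each node's output with probability p_drop,
    # and rescale the survivors so the expected value stays the same.
    if not training:
        return activations
    keep = (np.random.rand(*activations.shape) >= p_drop)
    return activations * keep / (1.0 - p_drop)

print(dropout(np.array([0.7, 1.2, 0.0, 2.3]), p_drop=0.5))
```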
48:09
Uh yeah.
48:11
So, one question regarding like
48:13
neural networks is about the
48:15
coefficients. Is this something we
48:16
decide
48:17
or do we
48:19
have to use predefined coefficients for
48:21
the weights? No. The whole trick here,
48:23
the whole name of the game, is that we use the
48:25
data, the training data, and something
48:29
called a loss function, which I'll get
48:30
to on Wednesday,
48:31
along with an optimization algorithm, so
48:33
that the network figures out by itself
48:36
what the weights need to be, what the
48:37
coefficients need to be, so as to
48:39
minimize prediction error.
48:42
And that's the whole thing. The magic
48:43
here is that we don't have to do
48:45
anything. We only have to set it up, sit
48:47
back, often for many hours, and watch it
48:49
do its thing.
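Loss functions and optimizers are Wednesday's topic, but to make the "set it up and watch it do its thing" point concrete, here is a heavily simplified sketch of what that setup usually looks like. The framework (PyTorch), the made-up data, and every number here are assumptions for illustration only:

```python
import torch
import torch.nn as nn

# Hypothetical training data: 100 examples, 2 input features, 1 binary label each.
X = torch.rand(100, 2)
y = (X.sum(dim=1, keepdim=True) > 1.0).float()

model = nn.Sequential(nn.Linear(2, 3), nn.ReLU(), nn.Linear(3, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()                                    # a loss function (Wednesday's topic)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # an optimization algorithm

for epoch in range(1000):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)   # how wrong are the current weights?
    loss.backward()               # compute gradients
    optimizer.step()              # nudge the weights to reduce the error
# The network has now "figured out by itself" values for all of its weights and biases.
```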
48:51
Yeah.
48:52
Just one quick question. Um you
48:54
mentioned nodes just now when you were
48:56
answering Roland's question. Can you
48:58
just confirm exactly what a node is? I
49:00
have an idea that it's basically any
49:02
circle, but
49:03
>> Yeah, yeah. You just added a lot more
49:04
detail. Sure. When I'm
49:06
referring to a node, I'm literally
49:07
referring to something like this:
49:09
think of it as a linear function
49:12
followed by a non-linear activation.
49:14
So, it reads a bunch of inputs, runs
49:16
them through a linear function, and passes
49:18
the result through a ReLU or a sigmoid or
49:19
something, and out pops a number.
49:22
So, in general, a node will have
49:24
many numbers potentially coming in, but
49:26
only one number going out.
49:28
Uh now, that one number may get copied
49:30
to every node in the next layer,
49:32
but what comes out of that particular
49:33
node is just a single number.
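In code, a single node really is just those two steps: a linear function, then a non-linearity, with many numbers in and one number out. A minimal sketch with made-up weights:

```python
import numpy as np

def node(inputs, weights, bias):
    # One node: a linear function of the inputs, passed through a non-linearity.
    z = np.dot(weights, inputs) + bias   # linear part
    return max(0.0, z)                   # ReLU here (could be a sigmoid instead)

# Hypothetical numbers, just to show the shape of the computation:
print(node(inputs=[3.8, 2.0], weights=[0.5, -0.2], bias=0.1))  # roughly 1.6
```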
49:36
All right.
49:38
So, let's use a DNN for our interview
49:41
example. So, in this problem we had two
49:44
inputs, right? GPA and experience. The
49:46
output variable has to be between zero
49:48
and one because you're trying to predict
49:48
the probability that someone will get
49:50
called for an interview. So, the input
49:52
size is fixed, and the
49:54
output size is fixed as well.
49:55
And
49:57
since it's really
49:59
the very first network we're actually
50:00
playing with,
50:02
let's just start simple, right? We'll
50:04
just have one hidden layer and we'll
50:06
have three neurons, right? And as I
50:09
mentioned in response to Tommaso's question from
50:11
before, if you are choosing activation
50:13
functions in the hidden layers, just go
50:15
with the ReLU as a default. It usually
50:17
works really well out of the box. So,
50:19
we'll just use a ReLU and since the
50:21
output has to be between zero and one,
50:23
we don't have a choice. We have to use a
50:25
sigmoid for the output layer.
50:27
Okay? That's it. So, those
50:29
are the design choices and when we do
50:31
that, this is what it looks like,
50:32
right? We have two inputs X1 and X2, GPA
50:34
and experience and then it goes through
50:36
these three
50:38
ReLUs and then out comes these three
50:40
numbers and they pass through a sigmoid
50:42
and we get a probability Y at the end.
50:44
All right, quick question. Concept
50:46
check.
50:47
How many parameters,
50:49
both weights and
50:51
biases, does this network have?
50:53
Let's take a moment to count.
51:11
All right, any guesses?
51:15
Yeah.
51:16
12.
51:18
I think you're almost there.
51:22
Um
51:23
Are folks going to be doing a binary
51:25
search on this now? Okay.
51:29
Uh no.
51:31
Yes? 13. Yes, very good.
51:34
So, that's 13,
51:35
and my guess is that the reason you came
51:37
up with 12 (I made the same mistake,
51:39
that's why I know) is that you probably
51:41
forgot this green thing here.
51:45
Um so, what folks often forget is
51:48
the bias.
51:49
Right? We all count the things, right?
51:50
Okay. And the easy way to do it is okay,
51:52
two things here,
51:54
three things here, two times three
51:56
is six,
51:57
three times one is three, so nine weights,
51:59
and then you have to add up all the
52:00
intercepts.
52:02
Right? So, you get 13.
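A quick way to sanity-check that count is the standard rule for a fully connected layer, (inputs + 1 bias) * neurons, summed over the layers:

```python
hidden = (2 + 1) * 3   # 9 parameters: 6 weights + 3 biases
output = (3 + 1) * 1   # 4 parameters: 3 weights + 1 bias
print(hidden + output) # 13
```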
52:04
And so, when we get to very complicated
52:05
networks, the first two or three
52:08
times you work with very complex
52:09
networks
52:10
and we'll do it, you know, starting very
52:11
soon, just get into the habit of hand
52:14
calculating the number of parameters
52:16
just to make sure you understand what's
52:17
going on. Once you get it right a couple
52:18
of times, you don't have to do
52:20
it anymore. Okay? The first couple of
52:21
times, hand calculate to make sure you
52:23
get it.
52:23
Okay. So, yeah. So, let's say that we
52:26
have trained this network using, you
52:28
know, techniques which we'll cover
52:30
on Wednesday, and it comes back to
52:32
you after training and says, "Okay,
52:34
these are the best values
52:36
for the weights and the biases that I
52:38
have found." So, now your network is
52:40
ready for action.
52:42
It's ready to be used
52:43
and so, what you can do is, let's say
52:45
that you want to predict with this
52:47
network,
52:48
you know,
52:49
if you have X1 and X2, what comes out?
52:52
So, what comes out of this top
52:54
neuron, right? Let's call it A1. It's
52:56
basically this.
52:58
Okay? That's what's coming out of this
53:00
thing. For any X1 and X2, this is what's
53:02
coming out. Similarly for A2 and A3.
53:05
Okay?
53:06
And then what comes out at the very end
53:08
is
53:09
basically A1 times that plus A2 times
53:11
that plus A3 times that plus 0.05 and
53:14
the whole thing gets run through the
53:15
sigmoid and this is what you get.
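Putting the pieces together, the whole forward pass is only a few lines. The weights below are hypothetical placeholders standing in for the slide's trained values; the 0.05 output bias is the one mentioned above:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical trained parameters (placeholders, not the slide's actual numbers),
# apart from the 0.05 output bias mentioned in the lecture.
W_hidden = np.array([[ 0.4,  0.3],   # weights into A1
                     [-0.2,  0.8],   # weights into A2
                     [ 0.1, -0.5]])  # weights into A3
b_hidden = np.array([0.1, -0.3, 0.2])
w_out    = np.array([0.6, -0.4, 0.9])
b_out    = 0.05

def predict(x1, x2):
    a = relu(W_hidden @ np.array([x1, x2]) + b_hidden)  # A1, A2, A3
    return sigmoid(w_out @ a + b_out)                   # probability of getting the interview

print(predict(3.8, 2.0))  # some probability between 0 and 1
```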
53:18
Okay? So, this slide and the one before,
53:20
just make sure you look at them afterwards
53:22
and make sure you totally understand
53:23
the mechanics of it because
53:26
this is really important. If you don't
53:27
fully understand and
53:28
internalize the mechanics, when we get
53:30
to things like transformers, it's going
53:31
to get hard. Okay? So, just make sure
53:33
it's like automatic at this point. It
53:35
should be reflexive.
53:37
Um
53:38
Okay. So, yeah. And so, when you
53:40
want to predict anything, you just run
53:41
some numbers through it, you get all
53:42
these things
53:44
and boom, you calculate it. It turns out
53:45
to be 22.6%. That's the answer.
53:48
All right. So,
53:50
I just want to say that let's say that
53:51
you built this network
53:53
and now we are like, "Hey,
53:55
given any X1 and X2, I can come up with
53:57
a Y."
53:58
But I'm feeling a little mathy. Can we
54:00
actually write down the function? Yeah,
54:02
you can write down the function. This is
54:03
what it looks like.
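Putting the exact numbers aside, in general form the function this network computes is the following, with placeholder symbols (w and b for the hidden-layer weights and biases, v and c for the output-layer weights and bias):

\hat{y} = \sigma\left( c + \sum_{j=1}^{3} v_j \, \max\left( 0,\; w_{j1} x_1 + w_{j2} x_2 + b_j \right) \right)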
54:07
Super interpretable, right?
54:10
So, this goes to the comment that Itai
54:12
made earlier on, where the act of
54:16
depicting something using this sort of
54:18
graphical layout makes it so much easier
54:21
to reason with
54:22
and to think about compared to trying to
54:24
figure out what this function is doing.
54:26
Right? The other point I want to make is
54:28
that um
54:30
just contrast what we just saw with the
54:32
logistic regression thing we saw
54:33
earlier, which was this little function.
54:35
And so, here,
54:38
even this simple network with just
54:40
three nodes in
54:42
that single hidden layer,
54:44
right? It's so much more complicated
54:46
than the logistic regression model. So
54:48
much more complicated, right?
54:50
And from this complexity
54:52
springs the ability of these networks to
54:55
do basically magical things.
54:56
Right? That's where the complexity comes
54:58
from. That's where the magic comes from.
55:00
And here, in this case, the number of
55:02
variables hasn't even changed. It's
55:03
still only two.
55:05
But we can go from the two inputs to the
55:07
one output in very complicated ways as
55:10
long as we know how to train these
55:11
networks the right way. That's sort of
55:13
the
55:13
secret sauce, which we'll spend a lot
55:15
of time on.
55:16
So, yeah. To summarize, this is what we
55:19
have. It's a deep neural network.
55:20
By the way, this kind of network where
55:22
things just flow from left to right is
55:23
called a feedforward
55:25
neural network
55:27
in contrast to some other kinds of
55:28
networks called recurrent networks which
55:30
we won't get to
55:31
in this class because
55:34
transformers have actually proven to be
55:36
much more capable than recurrent
55:38
networks and those have become the norm,
55:40
so we'll just focus on those instead. Um
55:42
and so, this arrangement of neurons into
55:44
layers and activation functions and all
55:46
that stuff, is called the architecture
55:48
of the neural network. And as you will
55:50
see later on, the transformer, the
55:51
famous transformer network
55:53
[clears throat] is just an example of a
55:54
particular neural network architecture
55:57
much like convolutional neural networks
55:59
which we'll get to next week for computer
56:01
vision, and which are another example of a
56:03
particular network architecture.
56:05
So, we will focus on transformers. They
56:07
are a particular kind of architecture.
56:08
All right. So, in summary, this is what
56:10
we have.
56:11
You know, you get to choose the hidden
56:13
layers, the neurons, activation
56:14
functions, stuff like that.
56:15
The inputs and outputs are what you have
56:17
to work with and so, we will actually
56:19
take this idea and then use it
56:22
to
56:23
actually solve a problem from start
56:25
to finish on Wednesday. So, I think I'm
56:28
done. I give you three minutes back of
56:29
your day. Thank you.
56:32
>> [applause]
— end of transcript —