WEBVTT

00:00:16.600 --> 00:00:19.320
All right. So, today's lecture,

00:00:18.000 --> 00:00:20.640
introduction to neural networks and deep

00:00:19.320 --> 00:00:21.879
learning.

00:00:20.640 --> 00:00:23.960
Um so, we'll start with a very quick

00:00:21.879 --> 00:00:25.399
intro to these things,

00:00:23.960 --> 00:00:27.480
uh and then we'll switch and dive deep

00:00:25.399 --> 00:00:30.000
into neural networks. All right. So, the

00:00:27.480 --> 00:00:31.000
field of AI originated in 1956. Sadly,

00:00:30.000 --> 00:00:32.520
it didn't originate at MIT, it

00:00:31.000 --> 00:00:33.640
originated at Dartmouth.

00:00:32.520 --> 00:00:35.520
Because all these people got together at

00:00:33.640 --> 00:00:37.480
Dartmouth. I guess it's got a nice

00:00:35.520 --> 00:00:40.000
quad or whatever. They got together,

00:00:37.479 --> 00:00:42.159
they defined the field. But, fortunately

00:00:40.000 --> 00:00:44.719
for us, MIT was very well represented.

00:00:42.159 --> 00:00:47.279
So, we have Marvin Minsky who founded

00:00:44.719 --> 00:00:50.079
the MIT AI Lab, John McCarthy who

00:00:47.280 --> 00:00:51.880
invented Lisp, and then later defected

00:00:50.079 --> 00:00:53.879
to the West Coast, and then Claude

00:00:51.880 --> 00:00:55.800
Shannon who invented information theory,

00:00:53.880 --> 00:00:57.560
right? Who was a professor at MIT. So,

00:00:55.799 --> 00:00:58.919
MIT was well represented. These folks,

00:00:57.560 --> 00:01:01.359
you know, founded the field, and they

00:00:58.920 --> 00:01:03.280
were so bright, they thought that AI was

00:01:01.359 --> 00:01:04.359
going to be substantially solved, quote

00:01:03.280 --> 00:01:06.519
unquote,

00:01:04.359 --> 00:01:07.359
by that fall.

00:01:06.519 --> 00:01:08.959
Okay?

00:01:07.359 --> 00:01:10.840
Now, obviously, it turned out a bit

00:01:08.959 --> 00:01:12.839
differently than what they expected.

00:01:10.840 --> 00:01:14.560
Um so, it's been, whatever, 67, 68 years

00:01:12.840 --> 00:01:16.520
since its founding. So, it's gone

00:01:14.560 --> 00:01:18.680
through, essentially, in my opinion,

00:01:16.519 --> 00:01:19.679
three seminal breakthroughs,

00:01:18.680 --> 00:01:21.280
um starting with the traditional

00:01:19.680 --> 00:01:22.920
approach, then machine learning, deep

00:01:21.280 --> 00:01:24.760
learning, and generative AI. So, let's

00:01:22.920 --> 00:01:26.760
take a very quick look at each of these

00:01:24.760 --> 00:01:27.719
breakthroughs and what motivated them.

00:01:26.760 --> 00:01:28.560
So,

00:01:27.719 --> 00:01:31.120
let's start with the traditional

00:01:28.560 --> 00:01:33.480
approach to AI. And so, what is AI? AI,

00:01:31.120 --> 00:01:34.960
informally, is the ability to imbue

00:01:33.480 --> 00:01:36.799
computers with the

00:01:34.959 --> 00:01:38.439
ability to do things that

00:01:36.799 --> 00:01:39.640
only humans can typically do. Cognitive

00:01:38.439 --> 00:01:41.879
tasks, thinking tasks, and things like

00:01:39.640 --> 00:01:43.640
that. And so, the most sort of

00:01:41.879 --> 00:01:45.199
commonsensical way to do that is to say,

00:01:43.640 --> 00:01:46.920
"Well, if I want the computer to do

00:01:45.200 --> 00:01:48.200
something complicated like play chess,

00:01:46.920 --> 00:01:49.920
I'm just going to sit down with a few

00:01:48.200 --> 00:01:51.640
chess grandmasters,

00:01:49.920 --> 00:01:53.400
show them a whole bunch of board moves,

00:01:51.640 --> 00:01:55.400
and ask them how they figure out how to

00:01:53.400 --> 00:01:56.480
respond, how to play the next move." I'm

00:01:55.400 --> 00:01:57.719
going to sort of sit down, talk to all

00:01:56.480 --> 00:01:59.960
these people, and then I'm going to

00:01:57.719 --> 00:02:01.560
write down a whole bunch of rules. If

00:01:59.959 --> 00:02:02.599
this is the board position, move this.

00:02:01.560 --> 00:02:04.159
If this is the board position, move

00:02:02.599 --> 00:02:05.959
this, and so on and so forth. Or I might

00:02:04.159 --> 00:02:06.959
sit down with a cardiologist and ask

00:02:05.959 --> 00:02:09.478
them, "Okay, how do you actually

00:02:06.959 --> 00:02:11.239
interpret an ECG?" They will similarly

00:02:09.479 --> 00:02:12.360
give me a bunch of if-then rules.

00:02:11.240 --> 00:02:13.920
I will take all these rules, I'll put

00:02:12.360 --> 00:02:15.200
them into the computer, and boom, I have

00:02:13.919 --> 00:02:17.959
a system that can do what a human can do.
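
NOTE
A minimal sketch of that rule-based approach; the function, thresholds,
and rules below are hypothetical, purely for illustration:
def diagnose_ecg(heart_rate, qrs_width_ms):
    # Hand-written if-then rules elicited from an expert.
    if heart_rate > 100:
        return "tachycardia"
    if qrs_width_ms > 120:
        return "conduction abnormality"
    return "normal"
print(diagnose_ecg(heart_rate=110, qrs_width_ms=90))  # tachycardia
# Any scenario the expert didn't anticipate silently falls through
# to "normal": the brittleness discussed next.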

00:02:15.199 --> 00:02:19.319
Right? Now, this approach, even

00:02:17.960 --> 00:02:21.360
though it's commonsensical and kind of

00:02:19.319 --> 00:02:22.560
makes sense, it had success in only a

00:02:21.360 --> 00:02:24.640
few areas.

00:02:22.560 --> 00:02:28.159
Um and so, the interesting question is,

00:02:24.639 --> 00:02:29.559
why was it not pervasively successful?

00:02:28.159 --> 00:02:31.319
Why was it not pervasively successful?

00:02:29.560 --> 00:02:32.599
It seems like a pretty good idea to me,

00:02:31.319 --> 00:02:33.680
right? And the people who came up with

00:02:32.599 --> 00:02:35.560
these things are smart people, they're

00:02:33.680 --> 00:02:39.319
not dumb people. They know what they're

00:02:35.560 --> 00:02:39.319
doing. So, why did it not work?

00:02:39.400 --> 00:02:42.760
Because

00:02:40.719 --> 00:02:44.120
because it's time-intensive,

00:02:42.759 --> 00:02:46.000
since you have to run through

00:02:44.120 --> 00:02:48.280
all these scenarios that can ever exist,

00:02:46.000 --> 00:02:51.120
and still some new scenarios can come up

00:02:48.280 --> 00:02:52.759
that you didn't cater for initially.

00:02:51.120 --> 00:02:54.159
Right. So, there are two aspects to what

00:02:52.759 --> 00:02:56.079
you said. The first aspect is that

00:02:54.159 --> 00:02:57.319
it's time-intensive. That, as it turns

00:02:56.080 --> 00:02:58.816
out, is not a big deal, because

00:02:57.319 --> 00:02:59.359
computers are getting faster and faster.

00:02:58.816 --> 00:03:01.080
>> [clears throat]

00:02:59.360 --> 00:03:02.840
>> Right? The second thing is actually the

00:03:01.080 --> 00:03:05.600
key thing, which is that it doesn't

00:03:02.840 --> 00:03:07.400
generalize to new situations very well.

00:03:05.599 --> 00:03:08.960
Right? The problem is

00:03:07.400 --> 00:03:10.080
there are an infinite number of things

00:03:08.960 --> 00:03:11.719
that you're going to see when you deploy

00:03:10.080 --> 00:03:13.160
these systems in the real world. By

00:03:11.719 --> 00:03:15.479
definition, what you're training it on

00:03:13.159 --> 00:03:17.079
is a small sample of rules. So, these

00:03:15.479 --> 00:03:19.599
rules are very brittle. But, there's

00:03:17.080 --> 00:03:22.040
actually an even more interesting reason.

00:03:19.599 --> 00:03:23.919
And that reason is that we know more

00:03:22.039 --> 00:03:25.599
than we can tell.

00:03:23.919 --> 00:03:27.280
This is called Polanyi's paradox. So,

00:03:25.599 --> 00:03:29.519
the idea is that if I come to you and

00:03:27.280 --> 00:03:32.360
say, "Hey, uh here's a picture. Is it a

00:03:29.520 --> 00:03:33.800
dog or a cat?" you will tell me within,

00:03:32.360 --> 00:03:34.960
I believe, they've measured it, like 20

00:03:33.800 --> 00:03:36.920
milliseconds or something, you know if

00:03:34.960 --> 00:03:38.560
it's a dog or a cat. And

00:03:36.919 --> 00:03:40.039
then if I ask you to explain to me

00:03:38.560 --> 00:03:41.520
exactly how you figured that out, you'll

00:03:40.039 --> 00:03:43.639
come up with a bunch of sort of reasons,

00:03:41.520 --> 00:03:45.120
right? Alleged reasons. Oh, you know, if

00:03:43.639 --> 00:03:46.000
it has whiskers, I think it's a cat or

00:03:45.120 --> 00:03:47.800
whatever.

00:03:46.000 --> 00:03:49.080
But, the problem is that you actually,

00:03:47.800 --> 00:03:50.280
first of all, can't really articulate

00:03:49.080 --> 00:03:51.840
what's going on in your head, how you do

00:03:50.280 --> 00:03:54.000
these things. And number two, even if

00:03:51.840 --> 00:03:55.479
you articulate it, oftentimes, your

00:03:54.000 --> 00:03:58.000
articulation has no correspondence with

00:03:55.479 --> 00:04:01.239
how your brain actually does it.

00:03:58.000 --> 00:04:03.360
So, you're incomplete and a liar.

00:04:01.240 --> 00:04:04.840
So, this is Polanyi's paradox. So, if

00:04:03.360 --> 00:04:06.840
you can't even

00:04:04.840 --> 00:04:08.120
tell me how you do something, how the

00:04:06.840 --> 00:04:10.080
heck am I supposed to take it and put it

00:04:08.120 --> 00:04:11.680
into a computer? Doesn't work. And

00:04:10.080 --> 00:04:13.480
second is the fact that we can't write

00:04:11.680 --> 00:04:15.760
down these rules for all possible

00:04:13.479 --> 00:04:17.279
situations. Edge cases, corner cases,

00:04:15.759 --> 00:04:18.759
etc. And the world is full of edge

00:04:17.279 --> 00:04:20.199
cases.

00:04:18.759 --> 00:04:21.560
So, for these reasons, this approach

00:04:20.199 --> 00:04:22.800
didn't work.

00:04:21.560 --> 00:04:24.959
And so, a different approach was

00:04:22.800 --> 00:04:26.040
developed, and this approach, well,

00:04:24.959 --> 00:04:27.279
basically said, "Hey, instead of

00:04:26.040 --> 00:04:30.040
explicitly telling the computer what to

00:04:27.279 --> 00:04:32.719
do, why don't we simply give it lots of

00:04:30.040 --> 00:04:35.680
examples of inputs and outputs, chess

00:04:32.720 --> 00:04:37.800
positions, next move, right? ECG,

00:04:35.680 --> 00:04:39.319
diagnosis, right? Inputs and outputs.

00:04:37.800 --> 00:04:41.000
And then, why don't we just use some

00:04:39.319 --> 00:04:43.240
statistical techniques to learn a

00:04:41.000 --> 00:04:44.920
mapping, a function, that can go from

00:04:43.240 --> 00:04:45.879
the input to the output? Okay? That was

00:04:44.920 --> 00:04:48.160
the idea.

00:04:45.879 --> 00:04:49.839
And this idea is machine learning.

00:04:48.160 --> 00:04:51.800
Okay? So, machine learning is basically

00:04:49.839 --> 00:04:53.719
just a fancy way of saying, "Learn from

00:04:51.800 --> 00:04:55.920
input-output examples using statistical

00:04:53.720 --> 00:04:59.080
techniques."

00:04:55.920 --> 00:05:00.480
Good. All right. So, um

00:04:59.079 --> 00:05:01.879
Now, there are numerous ways to create

00:05:00.480 --> 00:05:02.840
machine learning models, and if you've

00:05:01.879 --> 00:05:03.759
ever done linear regression,

00:05:02.839 --> 00:05:06.279
congratulations, you've been doing

00:05:03.759 --> 00:05:06.279
machine learning.
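
NOTE
A minimal sketch of that point, with made-up numbers: fitting a linear
regression is already "learning from input-output examples using
statistical techniques."
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[1.0], [2.0], [3.0], [4.0]])  # example inputs
y = np.array([2.1, 3.9, 6.2, 8.1])          # example outputs
model = LinearRegression().fit(X, y)        # learn the mapping
print(model.coef_, model.intercept_)        # the learned function
print(model.predict([[5.0]]))               # apply it to a new input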

00:05:06.439 --> 00:05:09.839
Okay? And only one of those methods

00:05:08.720 --> 00:05:11.360
happens to be something called neural

00:05:09.839 --> 00:05:12.439
networks.

00:05:11.360 --> 00:05:14.280
There are many other methods, and in

00:05:12.439 --> 00:05:16.000
fact, you probably have done these other

00:05:14.279 --> 00:05:17.199
methods if you have done a course

00:05:16.000 --> 00:05:19.279
like the Analytics Edge or something

00:05:17.199 --> 00:05:21.279
similar.

00:05:19.279 --> 00:05:23.039
Okay. So, machine learning has had

00:05:21.279 --> 00:05:25.759
tremendous impact around the world,

00:05:23.040 --> 00:05:27.560
right? It's like, at this point, um it's

00:05:25.759 --> 00:05:29.480
widely accepted, it's a very, very

00:05:27.560 --> 00:05:30.680
successful technology.

00:05:29.480 --> 00:05:32.560
And in fact, whenever people are

00:05:30.680 --> 00:05:33.959
actually talking about AI,

00:05:32.560 --> 00:05:35.639
chances are they're actually talking

00:05:33.959 --> 00:05:38.959
about machine learning.

00:05:35.639 --> 00:05:40.639
It's just that AI sounds cooler.

00:05:38.959 --> 00:05:41.919
The only problem is, for machine

00:05:40.639 --> 00:05:43.680
learning to work really well, the input

00:05:41.920 --> 00:05:46.439
data has to be structured.

00:05:43.680 --> 00:05:47.680
Okay? And what I mean by that is data

00:05:46.439 --> 00:05:50.040
that can essentially be sort of

00:05:47.680 --> 00:05:51.920
numericalized and stuck into the columns

00:05:50.040 --> 00:05:54.000
and rows of a spreadsheet.

00:05:51.920 --> 00:05:55.360
Right? So, for example, here, let's say

00:05:54.000 --> 00:05:58.120
I want to put together a data set

00:05:55.360 --> 00:05:59.960
of, you know, uh patients, their

00:05:58.120 --> 00:06:01.759
symptoms, and their characteristics, and

00:05:59.959 --> 00:06:03.519
then in the following year after they

00:06:01.759 --> 00:06:05.439
showed up at the doctor's office whether

00:06:03.519 --> 00:06:07.279
they had a cardiac event or not.

00:06:05.439 --> 00:06:09.560
I might create a data set like this with

00:06:07.279 --> 00:06:11.799
age, smoking status, yes, no, exercise,

00:06:09.560 --> 00:06:13.480
blah blah blah blah blah. Right? And so,

00:06:11.800 --> 00:06:15.079
either these values are numbers,

00:06:13.480 --> 00:06:17.200
they're numerical, or if they're not

00:06:15.079 --> 00:06:19.680
numerical, they're categorical.

00:06:17.199 --> 00:06:21.479
Right? Yes, no, uh smoking, yes, no,

00:06:19.680 --> 00:06:22.720
things like that. Which means that if

00:06:21.480 --> 00:06:25.240
you have categorical variables, you can

00:06:22.720 --> 00:06:26.600
just numericalize them pretty easily.

00:06:25.240 --> 00:06:27.680
You folks have done some machine

00:06:26.600 --> 00:06:29.040
learning before, so you know, things

00:06:27.680 --> 00:06:30.560
like one-hot encoding and stuff like

00:06:29.040 --> 00:06:32.160
that can be done to make them all

00:06:30.560 --> 00:06:35.040
numerical. So, the point is, you can

00:06:32.160 --> 00:06:36.640
just render the data into the columns

00:06:35.040 --> 00:06:38.080
and rows of a spreadsheet pretty easily,

00:06:36.639 --> 00:06:40.479
right? That's what I mean by structured data.
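
NOTE
A minimal sketch of that numericalization step, on a made-up version of
the patient table, using one-hot encoding for the categorical column:
import pandas as pd
df = pd.DataFrame({"age": [54, 61, 47],
                   "smoking": ["yes", "no", "yes"]})
# get_dummies turns the categorical column into 0/1 indicator columns.
print(pd.get_dummies(df, columns=["smoking"]))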

00:06:38.079 --> 00:06:41.879
But the situation is

00:06:40.480 --> 00:06:43.920
very different if you have unstructured

00:06:41.879 --> 00:06:46.120
data. So, if you have an image of, you

00:06:43.920 --> 00:06:47.720
know, a cute puppy, this is my puppy, by

00:06:46.120 --> 00:06:49.319
the way, um

00:06:47.720 --> 00:06:50.960
from many years ago. Sadly, he's no

00:06:49.319 --> 00:06:54.079
more. Um

00:06:50.959 --> 00:06:54.079
but, his name was Google.

00:06:54.800 --> 00:06:58.759
So, yeah, anyway, uh

00:06:56.600 --> 00:07:00.160
my DMD alums know Google well. So, this

00:06:58.759 --> 00:07:01.240
is Google, right? If you want to take

00:07:00.160 --> 00:07:03.439
Google,

00:07:01.240 --> 00:07:05.280
uh this picture, and figure out how to

00:07:03.439 --> 00:07:06.680
sort of numericalize it, the first thing

00:07:05.279 --> 00:07:07.839
you need to understand is that

00:07:06.680 --> 00:07:10.759
if you actually look at how this picture

00:07:07.839 --> 00:07:12.839
is represented inside, uh digitally, in

00:07:10.759 --> 00:07:13.839
the computer, basically, every picture

00:07:12.839 --> 00:07:15.319
like this is represented using three

00:07:13.839 --> 00:07:17.439
tables of numbers.

00:07:15.319 --> 00:07:19.000
Okay? And these and we'll get to what

00:07:17.439 --> 00:07:21.360
these numbers mean later on, but the

00:07:19.000 --> 00:07:23.279
point I'm making is that each number

00:07:21.360 --> 00:07:25.240
basically represents the amount of

00:07:23.279 --> 00:07:27.319
light,

00:07:25.240 --> 00:07:29.160
right? On a scale of 0 to 255, the

00:07:27.319 --> 00:07:30.639
amount of light in that location, in

00:07:29.160 --> 00:07:32.960
that pixel. That's all the amount of

00:07:30.639 --> 00:07:35.759
light. So, basically, these tables are

00:07:32.959 --> 00:07:37.479
the amount of red light,

00:07:35.759 --> 00:07:39.000
the amount of green light,

00:07:37.480 --> 00:07:41.200
and the amount of blue

00:07:39.000 --> 00:07:42.480
light. Okay? Now, you will agree with me

00:07:41.199 --> 00:07:45.199
that if you, for example, look at

00:07:42.480 --> 00:07:47.200
something like this and say, "Okay, 251

00:07:45.199 --> 00:07:49.639
at this location, there is a lot of blue

00:07:47.199 --> 00:07:52.000
light because it's 251 out of a possible

00:07:49.639 --> 00:07:53.439
255, right? Maybe a lot of blue light

00:07:52.000 --> 00:07:55.600
somewhere here. There's a lot of blue

00:07:53.439 --> 00:07:59.199
here."
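
NOTE
A minimal sketch of that pixel representation; the filename and the
pixel location are hypothetical:
import numpy as np
from PIL import Image
img = np.array(Image.open("puppy.jpg").convert("RGB"))
print(img.shape)    # (height, width, 3): the three tables of numbers
print(img[40, 25])  # e.g. [12, 37, 251]: red, green, blue at one pixel
# A blue value of 251 only says "lots of blue light here"; it says
# nothing about whether the blue is sky, water, or paint.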

00:07:55.600 --> 00:08:00.600
Whether that area is blue because of a

00:07:59.199 --> 00:08:03.560
piece of sky,

00:08:00.600 --> 00:08:04.879
some water, or a bunch of blue paint,

00:08:03.560 --> 00:08:06.240
could be anything, it's going to say

00:08:04.879 --> 00:08:08.079
251.

00:08:06.240 --> 00:08:09.319
So, the underlying reality, the

00:08:08.079 --> 00:08:11.079
underlying object that's being

00:08:09.319 --> 00:08:12.480
described, has nothing to do with the

00:08:11.079 --> 00:08:14.399
251.

00:08:12.480 --> 00:08:16.480
Right? So, that's the whole problem. The

00:08:14.399 --> 00:08:19.039
raw form of the data has no intrinsic

00:08:16.480 --> 00:08:20.240
connection to the underlying thing.

00:08:19.040 --> 00:08:21.560
So, given that there's no connection

00:08:20.240 --> 00:08:23.360
between the number and what it's

00:08:21.560 --> 00:08:25.160
describing, how the heck can any

00:08:23.360 --> 00:08:27.639
algorithm do anything with it?

00:08:25.160 --> 00:08:27.640
It can't.

00:08:27.680 --> 00:08:32.440
Right? So, what you have to do is

00:08:30.319 --> 00:08:34.639
something called feature engineering or

00:08:32.440 --> 00:08:36.919
feature extraction, right? Where you

00:08:34.639 --> 00:08:38.279
have to manually take all these things

00:08:36.918 --> 00:08:40.279
and create essentially a spreadsheet

00:08:38.279 --> 00:08:42.279
from them. So, basically, let's say that

00:08:40.279 --> 00:08:43.478
you have a bunch of birds, right? And

00:08:42.279 --> 00:08:44.839
you're trying to build a bird

00:08:43.479 --> 00:08:46.680
classifier to figure out what kind of

00:08:44.840 --> 00:08:48.759
bird species it is, you might actually

00:08:46.679 --> 00:08:50.799
have to take this picture, and then you

00:08:48.759 --> 00:08:52.759
have to measure the beak length, the

00:08:50.799 --> 00:08:54.479
wingspan, the primary color, and so on

00:08:52.759 --> 00:08:56.720
and so forth.

00:08:54.480 --> 00:08:59.360
So, you're basically structuring the

00:08:56.720 --> 00:09:02.120
unstructured data manually, right?

00:08:59.360 --> 00:09:06.120
And for this process of structuring

00:09:02.120 --> 00:09:08.200
unstructured data, we basically

00:09:06.120 --> 00:09:10.879
use the word representation. We take

00:09:08.200 --> 00:09:13.360
the raw data and we represent the data

00:09:10.879 --> 00:09:14.600
in a different form. And the reason

00:09:13.360 --> 00:09:15.840
why I'm sort of

00:09:14.600 --> 00:09:17.879
focusing on the use of the word

00:09:15.840 --> 00:09:19.519
representation is because it becomes

00:09:17.879 --> 00:09:22.080
really, really important a bit later on

00:09:19.519 --> 00:09:23.399
when we get to deep learning. Okay? So,

00:09:22.080 --> 00:09:25.080
we have to represent the data in a

00:09:23.399 --> 00:09:26.519
different way for it to work. That's the

00:09:25.080 --> 00:09:28.960
basic idea.
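
NOTE
A minimal sketch of that manual structuring step for the bird example;
the measurements and species below are made up:
import pandas as pd
features = pd.DataFrame({
    "beak_length_mm": [18.2, 44.5, 12.1],   # hand-measured features...
    "wingspan_cm":    [25.0, 110.0, 19.5],
    "primary_color":  ["brown", "white", "yellow"],
    "species":        ["sparrow", "gull", "warbler"],  # ...and the label
})
# The unstructured picture is now represented as one structured row per
# bird, ready for a classic machine learning classifier.
print(features)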

00:09:26.519 --> 00:09:31.319
All right. So, what that means is that,

00:09:28.960 --> 00:09:33.519
uh historically, researchers would

00:09:31.320 --> 00:09:35.440
manually develop these representations.

00:09:33.519 --> 00:09:37.159
And once you develop them, once you have

00:09:35.440 --> 00:09:38.320
representations, you can just use

00:09:37.159 --> 00:09:40.559
traditional linear regression or

00:09:38.320 --> 00:09:41.720
logistic regression to get the job done.

00:09:40.559 --> 00:09:43.799
So, the whole name of the game is the

00:09:41.720 --> 00:09:45.440
representations. So, in fact, people

00:09:43.799 --> 00:09:47.959
doing PhDs, for example, in computer

00:09:45.440 --> 00:09:49.840
vision, would spend like 4 years

00:09:47.960 --> 00:09:52.080
developing amazing representations for

00:09:49.840 --> 00:09:53.920
solving one particular little problem.

00:09:52.080 --> 00:09:55.680
Right? We have a bunch of, say, CAT

00:09:53.919 --> 00:09:57.279
scans, and we need to take the CAT scan

00:09:55.679 --> 00:09:58.959
and figure out whether there is evidence

00:09:57.279 --> 00:10:00.879
for a particular kind of stroke in the

00:09:58.960 --> 00:10:02.519
CAT scan, right? They might actually sit

00:10:00.879 --> 00:10:04.000
and develop all kinds of representations

00:10:02.519 --> 00:10:05.960
and test it and so on. And then they'll

00:10:04.000 --> 00:10:07.200
finally declare victory and say, "Yay,

00:10:05.960 --> 00:10:08.920
I'm done with my PhD. Here is this

00:10:07.200 --> 00:10:11.000
amazing representation, and you can

00:10:08.919 --> 00:10:12.639
build a classifier with it to predict a

00:10:11.000 --> 00:10:15.840
particular kind of stroke with a high

00:10:12.639 --> 00:10:18.120
accuracy." Okay? So,

00:10:15.840 --> 00:10:20.680
that's where the world was.

00:10:18.120 --> 00:10:22.240
Uh now, as you can imagine, developing

00:10:20.679 --> 00:10:24.479
representations, because it's so manual,

00:10:22.240 --> 00:10:27.000
is this massive human bottleneck, and

00:10:24.480 --> 00:10:29.120
this sharply limited the reach

00:10:27.000 --> 00:10:31.919
and applicability of machine learning.

00:10:29.120 --> 00:10:31.919
As you would expect.

00:10:31.960 --> 00:10:35.000
To address this problem,

00:10:33.840 --> 00:10:36.120
a different approach came about, and

00:10:35.000 --> 00:10:38.720
that's deep learning. So, deep learning

00:10:36.120 --> 00:10:40.440
sits inside machine learning. Okay?

00:10:38.720 --> 00:10:43.279
And deep learning

00:10:40.440 --> 00:10:46.880
can handle unstructured input data

00:10:43.279 --> 00:10:48.079
without upfront manual processing.

00:10:46.879 --> 00:10:50.439
Meaning,

00:10:48.080 --> 00:10:52.639
it will automatically learn the right

00:10:50.440 --> 00:10:54.000
representations from the raw input.

00:10:52.639 --> 00:10:55.759
Automatically is the keyword.

00:10:54.000 --> 00:10:57.159
Automatically learn representations,

00:10:55.759 --> 00:10:58.200
which means that you could give it

00:10:57.159 --> 00:10:59.279
structured data, you can give it

00:10:58.200 --> 00:11:00.600
pictures, you can give it text, you can

00:10:59.279 --> 00:11:01.559
give it anything you want, it just learns

00:11:00.600 --> 00:11:02.600
it.

00:11:01.559 --> 00:11:05.159
Okay?

00:11:02.600 --> 00:11:07.279
Um it can automatically extract these

00:11:05.159 --> 00:11:09.480
representations, and since it's being

00:11:07.279 --> 00:11:11.240
automatically extracted, you can imagine

00:11:09.480 --> 00:11:12.960
sort of a pipeline where the raw data

00:11:11.240 --> 00:11:14.320
comes in, you have a bunch of stuff in

00:11:12.960 --> 00:11:15.879
the middle that's learning these

00:11:14.320 --> 00:11:17.600
representations automatically without

00:11:15.879 --> 00:11:19.439
your help, and then boom, you just

00:11:17.600 --> 00:11:20.720
attach a little linear regression or

00:11:19.440 --> 00:11:22.880
logistic regression at the end, problem

00:11:20.720 --> 00:11:25.000
solved.

00:11:22.879 --> 00:11:26.679
That in a nutshell is deep learning.

00:11:25.000 --> 00:11:28.440
Input, a whole bunch of representations

00:11:26.679 --> 00:11:30.838
being learned, and then piped into a

00:11:28.440 --> 00:11:31.920
linear or logistic regression model.

00:11:30.839 --> 00:11:34.560
Okay?
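
NOTE
A minimal sketch of that pipeline in PyTorch (an assumed framework
choice; the layer sizes are arbitrary): layers that learn
representations, followed by a logistic-regression-style head.
import torch
import torch.nn as nn
model = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),  # learned representations...
    nn.Linear(128, 32), nn.ReLU(),   # ...and more of them
    nn.Linear(32, 1), nn.Sigmoid(),  # the "little logistic regression"
)
x = torch.rand(1, 784)               # e.g. a flattened 28x28 image
print(model(x))                      # a probability between 0 and 1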

00:11:31.919 --> 00:11:36.000
So, the amazing thing is that this

00:11:34.559 --> 00:11:37.599
simple idea

00:11:36.000 --> 00:11:40.399
this simple idea

00:11:37.600 --> 00:11:42.440
is just incredibly powerful. Right? That

00:11:40.399 --> 00:11:44.639
idea has led to ChatGPT, has led to

00:11:42.440 --> 00:11:45.480
AlphaGo, AlphaFold, and so on and so

00:11:44.639 --> 00:11:46.600
forth.

00:11:45.480 --> 00:11:49.120
And

00:11:46.600 --> 00:11:50.360
I kid you not,

00:11:49.120 --> 00:11:52.600
I've been doing deep learning

00:11:50.360 --> 00:11:54.800
for about 10 years now, and every time I

00:11:52.600 --> 00:11:56.159
look at it, I literally get goosebumps

00:11:54.799 --> 00:11:57.838
every so often.

00:11:56.159 --> 00:11:59.759
That that something so simple could be

00:11:57.839 --> 00:12:01.200
so powerful, right? It's really like

00:11:59.759 --> 00:12:03.080
boggles the mind.

00:12:01.200 --> 00:12:05.360
I'm like I'm just so lucky to be alive

00:12:03.080 --> 00:12:06.360
and working during this period.

00:12:05.360 --> 00:12:07.759
Okay?

00:12:06.360 --> 00:12:08.879
And you know, coming from people who

00:12:07.759 --> 00:12:10.799
have been in the industry a long time,

00:12:08.879 --> 00:12:12.399
this sort of breathless exclamation is

00:12:10.799 --> 00:12:14.559
not very rare, particularly because I'm

00:12:12.399 --> 00:12:17.240
not in marketing.

00:12:14.559 --> 00:12:19.399
Okay? I actually mean it.

00:12:17.240 --> 00:12:21.480
With all due apologies to various

00:12:19.399 --> 00:12:23.319
marketing folks. So,

00:12:21.480 --> 00:12:25.759
just realized it's being taped, so uh

00:12:23.320 --> 00:12:27.560
okay. So, this has demolished the

00:12:25.759 --> 00:12:29.919
human bottleneck for using machine

00:12:27.559 --> 00:12:31.479
learning with unstructured data, uh and

00:12:29.919 --> 00:12:32.639
so it comes from the confluence of three

00:12:31.480 --> 00:12:34.839
forces,

00:12:32.639 --> 00:12:37.159
uh new algorithmic ideas, a whole lot

00:12:34.839 --> 00:12:38.680
of data, and then very importantly, the

00:12:37.159 --> 00:12:40.559
fact that we have access to parallel

00:12:38.679 --> 00:12:42.000
computing hardware in the form of

00:12:40.559 --> 00:12:44.159
these things called GPUs, graphics

00:12:42.000 --> 00:12:45.960
processing units. Um and these three

00:12:44.159 --> 00:12:47.319
forces came together, and they were

00:12:45.960 --> 00:12:48.480
applied to an old idea called neural

00:12:47.320 --> 00:12:49.720
networks, and that's basically deep

00:12:48.480 --> 00:12:50.960
learning. And I'll go through it very

00:12:49.720 --> 00:12:52.639
quickly, because obviously we're going to

00:12:50.960 --> 00:12:54.040
spend half the semester looking into

00:12:52.639 --> 00:12:56.679
this thing in detail.

00:12:54.039 --> 00:12:58.519
Uh so, what's the immediate

00:12:56.679 --> 00:13:01.559
application of the ability to

00:12:58.519 --> 00:13:05.360
automatically handle unstructured data?

00:13:01.559 --> 00:13:05.359
What is like the no-brainer application?

00:13:10.759 --> 00:13:15.879
It's okay if it's obvious, tell me.

00:13:13.639 --> 00:13:18.360
Uh sorry.

00:13:15.879 --> 00:13:19.759
Um image classification. Right. So,

00:13:18.360 --> 00:13:21.279
image classification, yes. So, you can

00:13:19.759 --> 00:13:22.480
take an image, a good example of

00:13:21.279 --> 00:13:24.199
unstructured data, you can do some

00:13:22.480 --> 00:13:27.000
classification on it. But more

00:13:24.200 --> 00:13:30.000
generally, more generally, what I'm

00:13:27.000 --> 00:13:31.399
getting at is that every sensor in the

00:13:30.000 --> 00:13:33.799
world

00:13:31.399 --> 00:13:35.039
can be given the ability to detect,

00:13:33.799 --> 00:13:37.319
recognize, and classify what it's

00:13:35.039 --> 00:13:39.799
sensing. Every sensor. Because remember,

00:13:37.320 --> 00:13:41.520
what does a sensor do?

00:13:39.799 --> 00:13:43.199
A sensor is just a receptacle for

00:13:41.519 --> 00:13:44.679
unstructured data.

00:13:43.200 --> 00:13:46.080
A camera is a receptacle for

00:13:44.679 --> 00:13:48.079
unstructured video

00:13:46.080 --> 00:13:50.480
or unstructured, you know, still images.

00:13:48.080 --> 00:13:52.600
Microphone, unstructured audio, right?

00:13:50.480 --> 00:13:54.600
So, every sensor, you can you can

00:13:52.600 --> 00:13:56.839
imagine taking a sensor and sticking a

00:13:54.600 --> 00:13:58.720
little deep learning system behind it.

00:13:56.839 --> 00:13:59.880
And now suddenly,

00:13:58.720 --> 00:14:01.759
what comes out of the deep

00:13:59.879 --> 00:14:03.320
learning system, you can count, you can

00:14:01.759 --> 00:14:05.360
classify, you can detect, you can do all

00:14:03.320 --> 00:14:07.080
kinds of stuff. In short, you can

00:14:05.360 --> 00:14:10.279
analyze.
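
NOTE
A minimal sketch of the sensor-plus-deep-learning pattern; camera and
classify are hypothetical stand-ins for a real capture API and a real
trained model:
def analyze_stream(camera, classify):
    for frame in camera:       # the sensor yields unstructured data
        yield classify(frame)  # the deep learning system behind it
                               # counts, classifies, detects, etc.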

00:14:07.080 --> 00:14:12.120
And you can predict, right? And

00:14:10.279 --> 00:14:15.600
the way I'm describing it right now,

00:14:12.120 --> 00:14:17.839
you'll be like, "Yeah, duh, obviously."

00:14:15.600 --> 00:14:19.839
But you know what, this obviously thing

00:14:17.839 --> 00:14:21.800
is actually not at all obvious

00:14:19.839 --> 00:14:24.400
in terms of whether it'll help you find

00:14:21.799 --> 00:14:25.719
interesting applications or not. Okay?

00:14:24.399 --> 00:14:28.159
So,

00:14:25.720 --> 00:14:30.920
here's something I literally saw

00:14:28.159 --> 00:14:32.120
last week. Okay? Actually, I have

00:14:30.919 --> 00:14:34.599
another slide before that, but we are

00:14:32.120 --> 00:14:36.399
coming to that. So, for instance, every

00:14:34.600 --> 00:14:38.200
time you use Face ID to unlock your

00:14:36.399 --> 00:14:39.639
phone, this is the basic principle at

00:14:38.200 --> 00:14:41.120
work, right? The the camera in the

00:14:39.639 --> 00:14:42.240
iPhone is the sensor, and they stuck a

00:14:41.120 --> 00:14:44.039
deep learning system behind it to do

00:14:42.240 --> 00:14:45.879
image classification, right? Rama,

00:14:44.039 --> 00:14:46.958
non-Rama, right? That's what it's

00:14:45.879 --> 00:14:49.399
classifying.

00:14:46.958 --> 00:14:51.279
Um and so here, right, you have a

00:14:49.399 --> 00:14:52.799
breast cancer detection

00:14:51.279 --> 00:14:55.319
system that works from a mammogram.

00:14:52.799 --> 00:14:57.759
Uh by the way, this picture

00:14:55.320 --> 00:15:00.320
it's a very interesting picture. So, uh

00:14:57.759 --> 00:15:02.519
there's a professor in EECS, uh Regina

00:15:00.320 --> 00:15:05.879
Barzilay, who's a very well-known expert

00:15:02.519 --> 00:15:07.439
in this field, and uh she actually has

00:15:05.879 --> 00:15:08.919
built a breast cancer detection system,

00:15:07.440 --> 00:15:10.240
which has been deployed at Mass

00:15:08.919 --> 00:15:12.120
General Hospital.

00:15:10.240 --> 00:15:15.399
And turns out she's actually a breast

00:15:12.120 --> 00:15:16.919
cancer survivor. And uh

00:15:15.399 --> 00:15:19.919
you know, she's good now,

00:15:16.919 --> 00:15:21.958
all good. But after she built

00:15:19.919 --> 00:15:25.319
her system, I heard that she actually

00:15:21.958 --> 00:15:29.000
ran that system against the mammograms

00:15:25.320 --> 00:15:30.720
from many years prior when she went for

00:15:29.000 --> 00:15:32.360
a mammogram and was told that everything

00:15:30.720 --> 00:15:34.440
is fine.

00:15:32.360 --> 00:15:35.639
She ran the system on that mammogram,

00:15:34.440 --> 00:15:37.400
and it came back and said, "Here is a

00:15:35.639 --> 00:15:38.879
problem."

00:15:37.399 --> 00:15:40.720
So, a very interesting example where a

00:15:38.879 --> 00:15:43.279
deep learning system picked up something

00:15:40.720 --> 00:15:45.519
that a radiologist could not, right? So,

00:15:43.279 --> 00:15:47.399
these things can be quite powerful.

00:15:45.519 --> 00:15:50.078
Um obviously, any self-driving system

00:15:47.399 --> 00:15:51.399
has numerous deep learning algorithms

00:15:50.078 --> 00:15:52.958
running under the hood, you know,

00:15:51.399 --> 00:15:54.720
pedestrian detection, you know,

00:15:52.958 --> 00:15:57.239
stoplight detection, zebra crossing

00:15:54.720 --> 00:15:58.759
detection, and so on and so forth. Um

00:15:57.240 --> 00:16:00.879
you know, it's being very heavily used

00:15:58.759 --> 00:16:02.159
in visual inspection in manufacturing.

00:16:00.879 --> 00:16:03.279
Uh you have various cameras now instead

00:16:02.159 --> 00:16:04.919
of people looking and saying, "Okay,

00:16:03.279 --> 00:16:06.199
there is a dent or there's a scratch."

00:16:04.919 --> 00:16:07.919
They have a little system, which is a

00:16:06.200 --> 00:16:09.680
dent detector, scratch detector, and so

00:16:07.919 --> 00:16:11.159
on. That's that's going on right now.

00:16:09.679 --> 00:16:12.199
And now I come to the example I saw last

00:16:11.159 --> 00:16:14.759
week,

00:16:12.200 --> 00:16:16.000
which shows how

00:16:14.759 --> 00:16:18.159
you can create dramatically better

00:16:16.000 --> 00:16:20.799
products if you really internalize this

00:16:18.159 --> 00:16:22.519
idea of, "Okay, it's almost like you're

00:16:20.799 --> 00:16:24.078
looking at the world and saying, 'Oh,

00:16:22.519 --> 00:16:25.559
there's a sensor. Can I attach a DL

00:16:24.078 --> 00:16:26.679
thing behind it?'" That's the way you

00:16:25.559 --> 00:16:28.719
should be looking at the world, okay,

00:16:26.679 --> 00:16:30.879
for startup ideas. So, here's an

00:16:28.720 --> 00:16:34.279
example, okay, these apparently are the

00:16:30.879 --> 00:16:35.480
world's first smart binoculars.

00:16:34.279 --> 00:16:37.720
Okay?

00:16:35.480 --> 00:16:39.360
This is the binocular.

00:16:37.720 --> 00:16:41.240
It came out two weeks ago.

00:16:39.360 --> 00:16:42.320
You look at

00:16:41.240 --> 00:16:43.959
the bird,

00:16:42.320 --> 00:16:46.680
and now it tells you what kind of bird

00:16:43.958 --> 00:16:46.679
it is, right there.

00:16:47.360 --> 00:16:51.839
It's a simple idea, but imagine, right?

00:16:50.120 --> 00:16:53.560
Imagine you are the first out of the

00:16:51.839 --> 00:16:54.640
gate with this feature, you'll have a

00:16:53.559 --> 00:16:57.719
little bit of an edge till everybody

00:16:54.639 --> 00:16:58.958
catches up like 3 months later.

00:16:57.720 --> 00:17:01.360
Let's be very clear, there are no

00:16:58.958 --> 00:17:03.479
long-term monopoly windows in the world.

00:17:01.360 --> 00:17:04.838
There are only short-term windows, so

00:17:03.480 --> 00:17:06.720
the hunt is always on for a little

00:17:04.838 --> 00:17:08.838
monopoly window.

00:17:06.720 --> 00:17:11.199
So, here's an example of that.

00:17:08.838 --> 00:17:13.240
Right? So, I encourage you to always

00:17:11.199 --> 00:17:15.079
think about the world as, you know,

00:17:13.240 --> 00:17:16.519
where are the sensors here?

00:17:15.078 --> 00:17:18.198
And can I attach something behind the

00:17:16.519 --> 00:17:19.078
sensor to do something useful with it?

00:17:18.199 --> 00:17:21.439
Okay?

00:17:19.078 --> 00:17:21.438
All right.

00:17:24.799 --> 00:17:27.279
Now, let's uh turn our attention to the

00:17:26.199 --> 00:17:28.759
output.

00:17:27.279 --> 00:17:30.678
We've been talking about structured

00:17:28.759 --> 00:17:32.839
data, unstructured data, and how deep

00:17:30.679 --> 00:17:34.519
learning has sort of unlocked the

00:17:32.839 --> 00:17:35.759
ability to work with unstructured data,

00:17:34.519 --> 00:17:37.879
but we've sort of been neglecting the

00:17:35.759 --> 00:17:40.079
output side of the equation. So,

00:17:37.880 --> 00:17:42.040
traditionally, uh we could predict

00:17:40.079 --> 00:17:44.678
single numbers or a few numbers pretty

00:17:42.039 --> 00:17:46.920
easily, right? So, you've all done the

00:17:44.679 --> 00:17:48.600
canonical, you know, uh "should this

00:17:46.920 --> 00:17:50.600
person be given a loan?" application in

00:17:48.599 --> 00:17:51.919
machine learning, right? So, you just

00:17:50.599 --> 00:17:53.159
predict a probability that a borrower

00:17:51.920 --> 00:17:56.080
will repay a loan based on a

00:17:53.160 --> 00:17:57.240
whole bunch of data, or supply chain,

00:17:56.079 --> 00:17:58.799
you predict the demand for the product

00:17:57.240 --> 00:18:00.480
next week, or you could predict a bunch

00:17:58.799 --> 00:18:01.919
of numbers. So,

00:18:00.480 --> 00:18:03.640
given a picture, you can say, "Okay,

00:18:01.920 --> 00:18:04.920
which one of the 10 kinds

00:18:03.640 --> 00:18:06.360
of furniture is it?" Right? You can

00:18:04.920 --> 00:18:08.000
predict 10 numbers, 10 probabilities

00:18:06.359 --> 00:18:09.199
that add up to one. You can predict a

00:18:08.000 --> 00:18:10.440
whole bunch of numbers that don't have

00:18:09.200 --> 00:18:12.840
to add up to one, such as the GPS

00:18:10.440 --> 00:18:15.279
coordinates of an Uber ride. So,

00:18:12.839 --> 00:18:16.759
these are all

00:18:15.279 --> 00:18:18.839
simple structured outputs, just a few

00:18:16.759 --> 00:18:20.799
numbers, right? What we could not do

00:18:18.839 --> 00:18:23.399
very easily was to actually generate

00:18:20.799 --> 00:18:25.319
pictures like this.

00:18:23.400 --> 00:18:27.560
We could not generate unstructured data.

00:18:25.319 --> 00:18:28.519
We could only consume unstructured data,

00:18:27.559 --> 00:18:29.918
right?

00:18:28.519 --> 00:18:31.440
Um, generating text, generating

00:18:29.919 --> 00:18:32.919
pictures, audio,

00:18:31.440 --> 00:18:35.080
and so on, and so forth.

00:18:32.919 --> 00:18:36.200
So, with generative AI, that problem is

00:18:35.079 --> 00:18:37.519
gone.

00:18:36.200 --> 00:18:39.880
So, generative AI is the ability to

00:18:37.519 --> 00:18:41.599
actually create unstructured data, all

00:18:39.880 --> 00:18:43.840
right? And therefore, it sits within

00:18:41.599 --> 00:18:45.399
deep learning. It still runs on deep

00:18:43.839 --> 00:18:47.079
learning, but it's just one kind of deep

00:18:45.400 --> 00:18:49.000
learning.

00:18:47.079 --> 00:18:50.119
Okay? There's plenty of stuff going on

00:18:49.000 --> 00:18:51.679
in deep learning that's got nothing to

00:18:50.119 --> 00:18:53.399
do with generative AI.

00:18:51.679 --> 00:18:55.080
Nowadays, of course, you know, if you're

00:18:53.400 --> 00:18:57.519
a self-respecting entrepreneur who wants

00:18:55.079 --> 00:18:58.599
to ride this craze, you'll probably

00:18:57.519 --> 00:19:00.240
declare whatever you're doing as

00:18:58.599 --> 00:19:02.480
generative AI.

00:19:00.240 --> 00:19:04.319
Right? Um and some VCs may actually be

00:19:02.480 --> 00:19:05.679
ready to fund you, who knows?

00:19:04.319 --> 00:19:06.759
But the point is, there's plenty of

00:19:05.679 --> 00:19:08.759
stuff going on in deep learning that's

00:19:06.759 --> 00:19:11.079
got nothing to do with generative AI. Uh

00:19:08.759 --> 00:19:13.000
but this is the overall picture. Now,

00:19:11.079 --> 00:19:15.439
here, uh we can produce unstructured

00:19:13.000 --> 00:19:17.359
outputs, like pictures. You can take

00:19:15.440 --> 00:19:18.440
this thing, and then you can actually,

00:19:17.359 --> 00:19:19.519
you know, come up with a nice picture

00:19:18.440 --> 00:19:21.880
description of it. This actually is a

00:19:19.519 --> 00:19:23.200
very famous picture, by the way, in in

00:19:21.880 --> 00:19:24.520
the world of computer vision. So, we are

00:19:23.200 --> 00:19:26.319
actually going to be analyzing this

00:19:24.519 --> 00:19:27.879
picture a little later on

00:19:26.319 --> 00:19:29.639
in the semester.

00:19:27.880 --> 00:19:31.840
Uh you can obviously go from a very

00:19:29.640 --> 00:19:35.560
complicated caption to an image.

00:19:31.839 --> 00:19:35.559
Uh you can go from text to music.

00:19:36.240 --> 00:19:40.359
Can people hear it? Okay. Yeah. Yeah.

00:19:38.359 --> 00:19:43.039
All right. So, and of course, we can go

00:19:40.359 --> 00:19:45.439
from text to text, i.e., ChatGPT. Uh and

00:19:43.039 --> 00:19:47.079
then uh as of a few months ago, things

00:19:45.440 --> 00:19:49.440
have gotten even more interesting, where

00:19:47.079 --> 00:19:51.000
you can actually go you can send text

00:19:49.440 --> 00:19:51.880
and an image in, and you can get text

00:19:51.000 --> 00:19:53.480
out.

00:19:51.880 --> 00:19:55.360
Right? And in fact, as of a few weeks

00:19:53.480 --> 00:19:56.960
ago, you can send text, image, text,

00:19:55.359 --> 00:19:58.119
image, text, image in an arbitrary

00:19:56.960 --> 00:20:00.039
sequence

00:19:58.119 --> 00:20:02.239
into the system, and it'll actually

00:20:00.039 --> 00:20:03.519
come back to you with text and image.

00:20:02.240 --> 00:20:05.200
Right? So, things are becoming

00:20:03.519 --> 00:20:07.839
multimodal. I just want to share with

00:20:05.200 --> 00:20:10.840
you like a really fun example I saw

00:20:07.839 --> 00:20:14.000
uh recently. So, this person

00:20:10.839 --> 00:20:16.879
sends this picture. Can folks see this?

00:20:14.000 --> 00:20:19.000
It's this very complicated parking sign.

00:20:16.880 --> 00:20:20.360
Apparently in San Francisco.

00:20:19.000 --> 00:20:22.519
And they're like, it's Wednesday at 4:00

00:20:20.359 --> 00:20:23.959
p.m. Can I park here?

00:20:22.519 --> 00:20:25.480
Tell me in one line. Because you really

00:20:23.960 --> 00:20:26.880
didn't want GPT-4 to be giving you a big

00:20:25.480 --> 00:20:29.079
essay about this.

00:20:26.880 --> 00:20:32.120
Like, you literally want to park.

00:20:29.079 --> 00:20:33.960
So, GPT-4 comes back and says, "Yes, you

00:20:32.119 --> 00:20:35.439
can park here for up to 1 hour starting

00:20:33.960 --> 00:20:36.720
at 4:00 p.m."

00:20:35.440 --> 00:20:38.559
And folks, I double-checked this thing,

00:20:36.720 --> 00:20:39.640
it's correct.

00:20:38.559 --> 00:20:41.119
We all know these things hallucinate,

00:20:39.640 --> 00:20:42.080
right? Can you imagine getting a parking

00:20:41.119 --> 00:20:42.839
ticket and telling the judge, "I'm

00:20:42.079 --> 00:20:44.359
sorry, I didn't realize it was

00:20:42.839 --> 00:20:45.359
hallucinating."

00:20:44.359 --> 00:20:46.839
So,

00:20:45.359 --> 00:20:47.759
so you have to double-check it.

00:20:46.839 --> 00:20:49.399
So, yeah. So, things are getting

00:20:47.759 --> 00:20:51.759
multimodal very quickly.

00:20:49.400 --> 00:20:53.640
Uh and so, the picture here is that

00:20:51.759 --> 00:20:55.400
within gen AI, we used to have these

00:20:53.640 --> 00:20:57.360
separate circles, text to text, text to

00:20:55.400 --> 00:20:59.040
image, text to music, text to this, text

00:20:57.359 --> 00:21:00.879
to that, so on and so forth. Those are

00:20:59.039 --> 00:21:02.720
all beginning to merge now inside gen AI

00:21:00.880 --> 00:21:04.680
because multimodal models are going to

00:21:02.720 --> 00:21:06.279
become the norm this year, right? We

00:21:04.680 --> 00:21:07.880
already have really good closed models.

00:21:06.279 --> 00:21:10.119
We actually already have

00:21:07.880 --> 00:21:12.320
very good open-source multimodal models.

00:21:10.119 --> 00:21:15.839
And so, my feeling is that by the end of

00:21:12.319 --> 00:21:17.960
the year, the idea of using a text-only

00:21:15.839 --> 00:21:19.359
model is going to be like, "Really, you

00:21:17.960 --> 00:21:20.319
do that still?"

00:21:19.359 --> 00:21:21.919
Right? It's going to become like a

00:21:20.319 --> 00:21:23.720
quaint, old-fashioned thing. I think

00:21:21.920 --> 00:21:25.200
multimodality is going to become

00:21:23.720 --> 00:21:26.680
the norm. So, that's where the world is,

00:21:25.200 --> 00:21:29.000
and this is the landscape. So, any

00:21:26.680 --> 00:21:29.960
questions on the landscape?

00:21:29.000 --> 00:21:32.319
Before we actually start doing some

00:21:29.960 --> 00:21:32.319
math.

00:21:35.519 --> 00:21:40.039
Okay.

00:21:37.799 --> 00:21:40.039
Yeah.

00:22:05.559 --> 00:22:09.519
You mean the evidence of that

00:22:07.400 --> 00:22:11.720
being a problem would have been smaller.

00:22:09.519 --> 00:22:11.720
Yeah.

00:22:16.319 --> 00:22:19.359
Yeah. So, the question

00:22:17.759 --> 00:22:20.480
is that in general, how do you train

00:22:19.359 --> 00:22:22.240
your models so that it gives you the

00:22:20.480 --> 00:22:24.000
right answers given that over the

00:22:22.240 --> 00:22:25.599
passage of time, the amount of evidence

00:22:24.000 --> 00:22:28.119
in this data could be very highly

00:22:25.599 --> 00:22:30.719
variable. So, in this particular case of

00:22:28.119 --> 00:22:32.199
you know, the professor I talked about,

00:22:30.720 --> 00:22:34.400
uh yeah, everything at that point was

00:22:32.200 --> 00:22:36.840
going through an expert radiologist.

00:22:34.400 --> 00:22:38.200
So, 5 years ago, this mammogram was seen

00:22:36.839 --> 00:22:40.240
by a radiologist, and that person

00:22:38.200 --> 00:22:41.759
concluded there is no problem. So, that

00:22:40.240 --> 00:22:44.599
was the training label, right? The wrong

00:22:41.759 --> 00:22:46.400
training label. Uh so, typically what

00:22:44.599 --> 00:22:48.399
happens is that training labels could be

00:22:46.400 --> 00:22:49.400
wrong some small fraction of the time.

00:22:48.400 --> 00:22:51.720
So, you need to have systems that are

00:22:49.400 --> 00:22:53.880
robust. So, your data needs to be

00:22:51.720 --> 00:22:56.120
complete, it needs to be comprehensive,

00:22:53.880 --> 00:22:58.320
it needs to have correct labels. If

00:22:56.119 --> 00:22:59.959
these conditions are not met, your systems

00:22:58.319 --> 00:23:01.960
are not going to be that good. But as it

00:22:59.960 --> 00:23:04.240
turns out, with neural networks, even

00:23:01.960 --> 00:23:06.000
with some amount of noise in the labels,

00:23:04.240 --> 00:23:07.079
they still do a pretty good job.

00:23:06.000 --> 00:23:09.759
Right? So, it's that's sort of the

00:23:07.079 --> 00:23:09.759
general idea.

00:23:11.480 --> 00:23:15.759
The verification comes from

00:23:12.799 --> 00:23:17.599
the human. So, remember when we

00:23:15.759 --> 00:23:19.319
look at radiology data,

00:23:17.599 --> 00:23:21.439
the the data we're working with is the

00:23:19.319 --> 00:23:23.559
input is let's say an image, like a

00:23:21.440 --> 00:23:25.440
mammogram or something, and then a

00:23:23.559 --> 00:23:27.480
human radiologist or a set of

00:23:25.440 --> 00:23:29.400
radiologists have said this has a

00:23:27.480 --> 00:23:31.279
problem or does not have a problem. So,

00:23:29.400 --> 00:23:33.679
that is called the ground truth.

00:23:31.279 --> 00:23:35.440
So, it is this ground truth image and

00:23:33.679 --> 00:23:38.440
label, this combination that's being

00:23:35.440 --> 00:23:38.440
used to train these models.

00:23:39.559 --> 00:23:41.759
Yeah.

00:23:43.160 --> 00:23:47.400
Embodiment? So, are we going

00:23:45.440 --> 00:23:49.080
to cover embodiment? So, the

00:23:47.400 --> 00:23:50.280
the embodiment here refers to the fact

00:23:49.079 --> 00:23:53.039
that

00:23:50.279 --> 00:23:54.359
if you have robots, right?

00:23:53.039 --> 00:23:56.200
They need to actually operate in the

00:23:54.359 --> 00:23:58.559
real world, and so robots are an example

00:23:56.200 --> 00:23:59.920
of what's called embodied intelligence.

00:23:58.559 --> 00:24:01.440
So, unfortunately, due to the

00:23:59.920 --> 00:24:03.720
constraints of time, we're not going to

00:24:01.440 --> 00:24:04.799
get into robotics at all. But I will say

00:24:03.720 --> 00:24:05.880
that a lot of the deep learning stuff

00:24:04.799 --> 00:24:07.359
we're going to talk about, those are

00:24:05.880 --> 00:24:09.880
all fundamental building blocks in

00:24:07.359 --> 00:24:13.039
modern robotic systems.

00:24:09.880 --> 00:24:14.400
All right. So, um so, in summary,

00:24:13.039 --> 00:24:15.639
X and Y

00:24:14.400 --> 00:24:17.200
can be anything, and it can be

00:24:15.640 --> 00:24:19.240
multimodal.

00:24:17.200 --> 00:24:21.679
Okay? I literally could not have put up

00:24:19.240 --> 00:24:23.559
this slide maybe 2 years ago.

00:24:21.679 --> 00:24:25.800
Right? So, it's very simple in how it

00:24:23.559 --> 00:24:28.079
looks, but it's very profound. You can

00:24:25.799 --> 00:24:29.599
You can learn a mapping from anything to

00:24:28.079 --> 00:24:31.559
anything at this point very easily as

00:24:29.599 --> 00:24:34.480
long as you have enough data.

00:24:31.559 --> 00:24:36.599
Okay? So, um now, note that all this

00:24:34.480 --> 00:24:38.640
excitement that we see around us

00:24:36.599 --> 00:24:39.639
all stems from deep

00:24:38.640 --> 00:24:40.640
learning.

00:24:39.640 --> 00:24:42.160
Okay?

00:24:40.640 --> 00:24:44.280
Everything depends on deep

00:24:42.160 --> 00:24:45.679
learning. And so, if you understand deep

00:24:44.279 --> 00:24:47.960
learning, a lot of interesting things

00:24:45.679 --> 00:24:49.080
become possible. So, let's get going.

00:24:47.960 --> 00:24:51.840
All right. So, we'll start with the very

00:24:49.079 --> 00:24:54.599
basics. Uh what's a neural network?

00:24:51.839 --> 00:24:56.039
Uh now, recall logistic regression

00:24:54.599 --> 00:24:57.879
from back in the day.

00:24:56.039 --> 00:24:59.920
So, what is logistic regression?

00:24:57.880 --> 00:25:01.679
You send in a bunch of numbers, a vector

00:24:59.920 --> 00:25:03.960
of numbers, and you usually get a

00:25:01.679 --> 00:25:05.000
probability out, right? Between 0 and 1.

00:25:03.960 --> 00:25:07.559
What is the probability of something or

00:25:05.000 --> 00:25:09.559
the other? Okay? Um and so, this

00:25:07.559 --> 00:25:11.519
logistic regression model is also

00:25:09.559 --> 00:25:13.359
represented in this form,

00:25:11.519 --> 00:25:15.519
if you will recall. So, basically what

00:25:13.359 --> 00:25:17.678
we do is we take all these numbers, we

00:25:15.519 --> 00:25:19.240
run it through a linear function, right?

00:25:17.679 --> 00:25:20.880
We run it through a linear function, you

00:25:19.240 --> 00:25:22.880
get a number, and then we take that

00:25:20.880 --> 00:25:25.000
number, call it z, and run it through

00:25:22.880 --> 00:25:26.120
1 / (1 + e^(-z)),

00:25:25.000 --> 00:25:27.720
and that's guaranteed to give you a

00:25:26.119 --> 00:25:29.719
number between 0 and 1, which can be

00:25:27.720 --> 00:25:31.839
interpreted as a probability, and that's

00:25:29.720 --> 00:25:33.559
logistic regression. Okay? And the

00:25:31.839 --> 00:25:35.399
canonical, you know,

00:25:33.559 --> 00:25:36.720
uh loan approvals, things like that, all

00:25:35.400 --> 00:25:38.480
fall into this sort of convenient

00:25:36.720 --> 00:25:42.799
bucket.

00:25:38.480 --> 00:25:42.799
Okay? So, this should be super familiar.
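
NOTE
A minimal sketch of exactly that computation, with made-up weights: a
linear function of the inputs pushed through 1 / (1 + e^(-z)):
import numpy as np
def logistic_predict(x, w, b):
    z = np.dot(w, x) + b             # the linear function
    return 1.0 / (1.0 + np.exp(-z))  # squashed into (0, 1)
print(logistic_predict(x=np.array([0.3, 0.8]),
                       w=np.array([1.0, -2.0]), b=0.5))  # ~0.31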

00:25:44.400 --> 00:25:48.759
All right. Now, we're going to actually

00:25:46.480 --> 00:25:51.920
look at this, you know, simple, modest,

00:25:48.759 --> 00:25:53.799
humble little operation

00:25:51.920 --> 00:25:55.480
using the lens of a network of

00:25:53.799 --> 00:25:56.879
mathematical operations, and the reason

00:25:55.480 --> 00:25:57.799
why we do it will become clear a bit

00:25:56.880 --> 00:25:59.880
later.

00:25:57.799 --> 00:26:02.240
So, we'll take this very simple example

00:25:59.880 --> 00:26:05.320
where we have uh let's say two

00:26:02.240 --> 00:26:07.759
variables, GPA and experience, right?

00:26:05.319 --> 00:26:09.559
This is the GPA of some graduates, uh

00:26:07.759 --> 00:26:11.799
number of years of work experience, and

00:26:09.559 --> 00:26:14.678
then this is the dependent variable,

00:26:11.799 --> 00:26:16.480
which is either 0 or 1, and 0 if they

00:26:14.679 --> 00:26:18.280
don't get called for an interview, 1 if

00:26:16.480 --> 00:26:20.519
they get called for an interview. Okay?

00:26:18.279 --> 00:26:22.119
It's a two-input variable, one-output

00:26:20.519 --> 00:26:24.000
variable problem. Okay? And it's a

00:26:22.119 --> 00:26:25.719
classification problem because we're

00:26:24.000 --> 00:26:27.880
classifying people into will they get

00:26:25.720 --> 00:26:29.600
called for an interview, yes or no.

00:26:27.880 --> 00:26:31.560
Okay?

00:26:29.599 --> 00:26:33.119
And so, that's the setup for this

00:26:31.559 --> 00:26:35.919
problem.

00:26:33.119 --> 00:26:38.839
And let's say that we actually, you

00:26:35.920 --> 00:26:40.720
know, try to

00:26:38.839 --> 00:26:41.959
fit a logistic regression model to it.

00:26:40.720 --> 00:26:43.559
So, if you're familiar with R, for

00:26:41.960 --> 00:26:46.120
example, you would use something like

00:26:43.559 --> 00:26:48.079
GLM to fit this model.

00:26:46.119 --> 00:26:49.919
Um if you use something like statsmodels

00:26:48.079 --> 00:26:52.000
in Python, there's a similar function

00:26:49.920 --> 00:26:53.560
for it. Scikit-learn, there's another

00:26:52.000 --> 00:26:55.160
function for it. You get the idea,

00:26:53.559 --> 00:26:57.079
right? This

00:26:55.160 --> 00:26:58.160
You can use whatever favorite methods

00:26:57.079 --> 00:27:00.199
you have for logistic regression

00:26:58.160 --> 00:27:02.080
modeling to get this job done. And if

00:27:00.200 --> 00:27:04.120
you do that with this little data set,

00:27:02.079 --> 00:27:06.599
you're going to get these coefficients.

00:27:04.119 --> 00:27:08.199
Right? The 0.4 is the intercept, 0.2 is

00:27:06.599 --> 00:27:09.919
the coefficient for GPA, 0.5 for

00:27:08.200 --> 00:27:11.440
experience. And that is the resulting

00:27:09.920 --> 00:27:12.560
sigmoid function.

00:27:11.440 --> 00:27:14.519
Okay?
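
NOTE
A minimal sketch of that fitting step in Python; the six data points
are made up, not the lecture's actual data set:
import numpy as np
from sklearn.linear_model import LogisticRegression
X = np.array([[3.5, 2.0], [2.8, 0.5], [3.9, 3.0],
              [2.5, 1.0], [3.2, 4.0], [2.1, 0.0]])  # GPA, experience
y = np.array([1, 0, 1, 0, 1, 0])                    # interview: 1 or 0
model = LogisticRegression().fit(X, y)
print(model.intercept_, model.coef_)  # intercept and two coefficients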

00:27:12.559 --> 00:27:17.240
All right. Cool. So, now let's actually

00:27:14.519 --> 00:27:19.319
rewrite this formula as a network in the

00:27:17.240 --> 00:27:20.920
following way. So, first, what we'll do

00:27:19.319 --> 00:27:22.839
is we'll take GPA and experience and

00:27:20.920 --> 00:27:24.600
stick it here on the left side, and

00:27:22.839 --> 00:27:26.799
we'll put little circles next to them,

00:27:24.599 --> 00:27:29.359
and we'll call them the input nodes.

00:27:26.799 --> 00:27:32.000
Okay? And so, imagine that somebody

00:27:29.359 --> 00:27:34.279
writes a GPA into the circle, 3.5, or

00:27:32.000 --> 00:27:36.880
you know, years of experience, 2.0, and

00:27:34.279 --> 00:27:38.000
then it flows through this arrow,

00:27:36.880 --> 00:27:40.400
and as it flows through, it gets

00:27:38.000 --> 00:27:42.880
multiplied by its coefficient, 0.2. The

00:27:40.400 --> 00:27:44.320
0.2 is coming from here.

00:27:42.880 --> 00:27:47.080
Similarly, experience gets multiplied by

00:27:44.319 --> 00:27:49.119
0.5, it comes in here, and this node, as

00:27:47.079 --> 00:27:50.480
the plus indicates, is adding everything

00:27:49.119 --> 00:27:52.919
that's coming into it.

00:27:50.480 --> 00:27:54.519
So, it's adding 0.2 * GPA, 0.5 *

00:27:52.920 --> 00:27:57.200
experience, plus the intercept, which is

00:27:54.519 --> 00:27:58.599
the green arrow coming in on its own.

00:27:57.200 --> 00:28:01.240
It comes through here, and what comes

00:27:58.599 --> 00:28:02.839
out of this is just a single number,

00:28:01.240 --> 00:28:04.640
and that number goes into this little

00:28:02.839 --> 00:28:07.319
circle,

00:28:04.640 --> 00:28:08.560
and then out pops a probability.

00:28:07.319 --> 00:28:10.720
Okay?

00:28:08.559 --> 00:28:13.440
So, I've sort of

00:28:10.720 --> 00:28:15.039
done this ridiculously

00:28:13.440 --> 00:28:16.400
long-winded way of writing a simple

00:28:15.039 --> 00:28:18.000
function.

00:28:16.400 --> 00:28:20.880
Okay? And the reason why I'm doing it

00:28:18.000 --> 00:28:20.880
will become clear in a second.

00:28:21.079 --> 00:28:25.839
Okay? So, this is a little network of

00:28:23.359 --> 00:28:27.678
operations for the simple function.

00:28:25.839 --> 00:28:29.639
And so, for instance, how you would use

00:28:27.679 --> 00:28:31.759
it to make a prediction:

00:28:29.640 --> 00:28:33.600
let's say someone has a 3.8 GPA and 1.2

00:28:31.759 --> 00:28:34.640
years experience, you just plug it in

00:28:33.599 --> 00:28:36.599
here,

00:28:34.640 --> 00:28:38.360
do the math, you get 0.76, same thing

00:28:36.599 --> 00:28:40.918
here, you get 0.6, it comes in here, you add them all up with the 0.4 intercept,

00:28:38.359 --> 00:28:43.279
you get 1.76, you run 1.76 through the

00:28:40.919 --> 00:28:44.480
sigmoid, you get 0.85, and that is the

00:28:43.279 --> 00:28:45.519
probability that that particular

00:28:44.480 --> 00:28:46.839
individual may get called for an

00:28:45.519 --> 00:28:48.240
interview.
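
That arithmetic is easy to check yourself. Here is the same calculation in Python, using the fitted coefficients from the slide:

```python
import math

gpa, experience = 3.8, 1.2
z = 0.4 + 0.2 * gpa + 0.5 * experience     # 0.4 + 0.76 + 0.6 = 1.76
probability = 1.0 / (1.0 + math.exp(-z))   # sigmoid(1.76)
print(round(z, 2), round(probability, 2))  # 1.76 0.85
```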

00:28:46.839 --> 00:28:49.399
Okay? At this point, we're just doing

00:28:48.240 --> 00:28:51.359
logistic regression, nothing more

00:28:49.400 --> 00:28:54.040
complicated.

00:28:51.359 --> 00:28:56.119
Okay? So, um now, if you have many

00:28:54.039 --> 00:28:58.399
variables, not two but, say, X1

00:28:56.119 --> 00:28:59.759
through XK, the same sort of

00:28:58.400 --> 00:29:01.200
logic applies. Each one has some

00:28:59.759 --> 00:29:03.039
coefficient, and then there's an

00:29:01.200 --> 00:29:04.720
intercept, they all get added up here,

00:29:03.039 --> 00:29:07.000
run through a sigmoid, and out pops this

00:29:04.720 --> 00:29:09.240
number. Okay? Notice how the data flows

00:29:07.000 --> 00:29:10.559
from left to right.

00:29:09.240 --> 00:29:14.039
Okay?

00:29:10.559 --> 00:29:14.039
All right. Any questions on this?

00:29:15.119 --> 00:29:18.719
All right. Good.

00:29:16.519 --> 00:29:20.519
So, now terminology.

00:29:18.720 --> 00:29:21.720
Uh so, you'll discover

00:29:20.519 --> 00:29:24.039
that the world of neural networks and

00:29:21.720 --> 00:29:25.440
deep learning has its own terminology.

00:29:24.039 --> 00:29:26.799
They have their own ways of referring to

00:29:25.440 --> 00:29:28.440
things that the rest of the world has

00:29:26.799 --> 00:29:29.799
been calling something else for

00:29:28.440 --> 00:29:31.240
the longest time.

00:29:29.799 --> 00:29:35.000
Right? It's kind of annoying sometimes,

00:29:31.240 --> 00:29:35.000
but it's the way it is. So, um

00:29:35.200 --> 00:29:38.440
Remember in regression, we used to call

00:29:37.000 --> 00:29:39.720
those numbers next to each variable

00:29:38.440 --> 00:29:41.440
coefficients,

00:29:39.720 --> 00:29:43.200
and the constant thing as an intercept?

00:29:41.440 --> 00:29:44.519
Well, guess what? In this world, those

00:29:43.200 --> 00:29:46.960
coefficients are actually

00:29:44.519 --> 00:29:49.160
called weights,

00:29:46.960 --> 00:29:50.840
and the intercepts are called biases.

00:29:49.160 --> 00:29:53.000
So, in in the neural network world,

00:29:50.839 --> 00:29:54.240
these are called weights and biases.

00:29:53.000 --> 00:29:55.240
And sometimes, if you're a little lazy,

00:29:54.240 --> 00:29:56.359
you may just call the whole thing

00:29:55.240 --> 00:29:58.480
weights.

00:29:56.359 --> 00:30:00.799
Okay? So, when you see in the newspaper

00:29:58.480 --> 00:30:03.640
that, you know, "Oh my god, this amazing

00:30:00.799 --> 00:30:05.119
model's weights have been leaked

00:30:03.640 --> 00:30:06.680
on the internet or on BitTorrent or

00:30:05.119 --> 00:30:08.119
something." That's what's going on,

00:30:06.680 --> 00:30:09.960
right? All these coefficients have been

00:30:08.119 --> 00:30:11.559
leaked. Because once you know what the

00:30:09.960 --> 00:30:12.640
coefficients are and what the

00:30:11.559 --> 00:30:15.039
architecture is, you can just

00:30:12.640 --> 00:30:16.360
reconstruct the model.

00:30:15.039 --> 00:30:17.559
All right. So, that's what's going on

00:30:16.359 --> 00:30:19.639
here.

00:30:17.559 --> 00:30:20.799
Now, why did we do this network

00:30:19.640 --> 00:30:23.120
business? Why did we write it as a

00:30:20.799 --> 00:30:24.359
network?

00:30:23.119 --> 00:30:26.919
Yeah, what is the advantage? Any

00:30:24.359 --> 00:30:26.919
guesses?

00:30:34.000 --> 00:30:38.200
When you have multiple functions,

00:30:37.200 --> 00:30:40.360
So,

00:30:38.200 --> 00:30:41.840
it's just easier to see it that way.

00:30:40.359 --> 00:30:43.719
Right. If you have lots of things going

00:30:41.839 --> 00:30:45.240
on, it's easier to see it if you

00:30:43.720 --> 00:30:46.960
actually write it in graphical form.

00:30:45.240 --> 00:30:49.880
Yes, correct.

00:30:46.960 --> 00:30:51.920
But, so is it only like a usability

00:30:49.880 --> 00:30:53.560
advantage?

00:30:51.920 --> 00:30:55.920
I mean, the thing is you want different

00:30:53.559 --> 00:30:56.679
functions for different layers of that.

00:30:55.920 --> 00:30:57.640
Uh-huh.

00:30:56.680 --> 00:30:59.000
Okay.

00:30:57.640 --> 00:31:00.880
So, maybe we want to use different

00:30:59.000 --> 00:31:02.599
functions in different layers. But, I

00:31:00.880 --> 00:31:04.640
think there's actually even a larger

00:31:02.599 --> 00:31:05.559
sort of a more basic point, which is

00:31:04.640 --> 00:31:07.000
that

00:31:05.559 --> 00:31:09.000
then when you the moment you write it

00:31:07.000 --> 00:31:10.480
down, you suddenly realize

00:31:09.000 --> 00:31:12.839
that I could have lots of things in the

00:31:10.480 --> 00:31:12.839
middle.

00:31:12.960 --> 00:31:15.640
I don't have to go from the input to the

00:31:13.960 --> 00:31:17.360
output directly. I can do lots of things

00:31:15.640 --> 00:31:20.560
in the middle, right? That's sort of the

00:31:17.359 --> 00:31:22.359
key idea.

00:31:20.559 --> 00:31:24.799
So, remember the notion of learning

00:31:22.359 --> 00:31:25.959
representations of unstructured data,

00:31:24.799 --> 00:31:27.879
right? Where you take a picture and say

00:31:25.960 --> 00:31:29.400
beak length and things like that, right?

00:31:27.880 --> 00:31:30.800
And remember, I said deep learning

00:31:29.400 --> 00:31:33.000
actually automatically learns these

00:31:30.799 --> 00:31:34.839
things. Where is that automatic learning

00:31:33.000 --> 00:31:36.680
coming from?

00:31:34.839 --> 00:31:38.879
Well, this is where it's coming from.

00:31:36.680 --> 00:31:39.680
So, what we do is we take this thing,

00:31:38.880 --> 00:31:41.560
right? There's just a logistic

00:31:39.680 --> 00:31:43.480
regression model. Inputs

00:31:41.559 --> 00:31:45.720
get multiplied and added up as a linear

00:31:43.480 --> 00:31:46.880
function, run through a sigmoid.

00:31:45.720 --> 00:31:48.799
And then

00:31:46.880 --> 00:31:51.520
we are like, "Hmm, if we want to learn

00:31:48.799 --> 00:31:53.000
representations of the raw input, we

00:31:51.519 --> 00:31:54.720
better be doing something in the middle

00:31:53.000 --> 00:31:56.759
here."

00:31:54.720 --> 00:31:58.720
Because the output is the output.

00:31:56.759 --> 00:32:00.039
That's not going to change.

00:31:58.720 --> 00:32:02.079
You know, it's it's either a dog or a

00:32:00.039 --> 00:32:05.440
cat. You don't have any choice

00:32:02.079 --> 00:32:07.960
as to what it is. Okay? The only agency

00:32:05.440 --> 00:32:09.279
you have at this point is you can take

00:32:07.960 --> 00:32:11.079
the raw input and do things in the

00:32:09.279 --> 00:32:12.678
middle with it.

00:32:11.079 --> 00:32:14.439
You can do a lot of stuff in the middle

00:32:12.679 --> 00:32:18.160
and then run it through something to get

00:32:14.440 --> 00:32:20.679
the output. Okay? So, in

00:32:18.160 --> 00:32:22.120
any mathematical discipline,

00:32:20.679 --> 00:32:23.679
if someone comes to you and says,

00:32:22.119 --> 00:32:25.639
"Here's a bunch of data.

00:32:23.679 --> 00:32:27.280
I want you to do something with it."

00:32:25.640 --> 00:32:30.759
What is the

00:32:27.279 --> 00:32:30.759
most basic first thing you should do?

00:32:31.720 --> 00:32:36.120
Run it through a linear function.

00:32:34.480 --> 00:32:37.759
The most basic thing in math is a linear

00:32:36.119 --> 00:32:38.559
function. So, given anything, just run

00:32:37.759 --> 00:32:40.039
it through a linear function. See what

00:32:38.559 --> 00:32:42.678
happens.

00:32:40.039 --> 00:32:44.399
So, that's exactly what we can do. So,

00:32:42.679 --> 00:32:46.560
the simplest thing we can do here, we

00:32:44.400 --> 00:32:49.400
can insert a bunch of linear functions.

00:32:46.559 --> 00:32:50.960
So, what we do is we take all this input and

00:32:49.400 --> 00:32:52.759
we just run a linear

00:32:50.960 --> 00:32:56.079
function on it. So, think of it as

00:32:52.759 --> 00:32:58.879
X1 * 2 + X2 * 4 and all the way to XK *

00:32:56.079 --> 00:33:00.599
9 plus some intercept and boom, it goes

00:32:58.880 --> 00:33:05.200
out the other end. So, this little

00:33:00.599 --> 00:33:05.959
circle here with a plus in it is just

00:33:05.200 --> 00:33:06.600
Thank you.

00:33:05.960 --> 00:33:08.279
Uh

00:33:06.599 --> 00:33:10.359
This is just a

00:33:08.279 --> 00:33:11.480
shorthand for a linear function.

00:33:10.359 --> 00:33:13.159
So, whenever you see a circle with a

00:33:11.480 --> 00:33:15.360
plus, it's just a shorthand for a linear

00:33:13.160 --> 00:33:16.279
function. Okay? So, you can take this

00:33:15.359 --> 00:33:17.759
whole thing and run it through a linear

00:33:16.279 --> 00:33:19.799
function and when you do it, you'll get

00:33:17.759 --> 00:33:21.960
some number right there. You'll get some

00:33:19.799 --> 00:33:23.399
number. So, you've taken these K numbers

00:33:21.960 --> 00:33:25.559
and you've sort of compressed them

00:33:23.400 --> 00:33:26.840
in some way into one number.

00:33:25.559 --> 00:33:28.319
Okay?

00:33:26.839 --> 00:33:30.079
But, you don't have to stop at one

00:33:28.319 --> 00:33:31.599
number. You can do more.

00:33:30.079 --> 00:33:33.439
So, we can have a stack of linear

00:33:31.599 --> 00:33:35.359
functions in the middle.

00:33:33.440 --> 00:33:37.279
Right? There's a linear function here,

00:33:35.359 --> 00:33:40.159
another one here, another one here. At

00:33:37.279 --> 00:33:42.240
this point, you have K numbers.

00:33:40.160 --> 00:33:43.440
K could be, for example, 1,000.

00:33:42.240 --> 00:33:44.400
Right? It's just the size of your input

00:33:43.440 --> 00:33:45.799
data.

00:33:44.400 --> 00:33:47.280
You've taken these K things and you've

00:33:45.799 --> 00:33:48.839
compressed them into three numbers at

00:33:47.279 --> 00:33:50.359
this point.
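
In code, one such stack of linear functions is just a matrix-vector multiply plus a vector of intercepts. A minimal sketch, with made-up sizes and random numbers standing in for real weights:

```python
import numpy as np

K = 1000                   # size of the input, as in the example
x = np.random.randn(K)     # the K input numbers
W = np.random.randn(3, K)  # three linear functions, K weights each
b = np.random.randn(3)     # one intercept per linear function
h = W @ x + b              # K numbers compressed into three
print(h.shape)             # (3,)
```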

00:33:48.839 --> 00:33:52.079
Okay?

00:33:50.359 --> 00:33:53.079
So, okay, maybe three is the right

00:33:52.079 --> 00:33:54.039
number, maybe 10 is the right number. We

00:33:53.079 --> 00:33:55.480
don't know.

00:33:54.039 --> 00:33:58.079
And we'll get to how we know

00:33:55.480 --> 00:33:59.519
what the right number is later on.

00:33:58.079 --> 00:34:01.159
So, we can stack as many linear

00:33:59.519 --> 00:34:02.720
functions as we want.

00:34:01.160 --> 00:34:04.440
So, we have transformed this K thing

00:34:02.720 --> 00:34:06.600
into a three-dimensional vector, right?

00:34:04.440 --> 00:34:07.519
K numbers become three numbers.

00:34:06.599 --> 00:34:10.279
Um

00:34:07.519 --> 00:34:12.280
and now we can flow these

00:34:10.280 --> 00:34:13.919
three numbers through some other little

00:34:12.280 --> 00:34:16.359
function.

00:34:13.918 --> 00:34:16.358
Okay?

00:34:16.440 --> 00:34:19.559
And as you will see in a few minutes,

00:34:18.039 --> 00:34:20.759
that function is called an activation

00:34:19.559 --> 00:34:22.320
function

00:34:20.760 --> 00:34:23.359
and it's chosen to be a non-linear

00:34:22.320 --> 00:34:24.559
function

00:34:23.358 --> 00:34:26.759
because if you don't choose it to be a

00:34:24.559 --> 00:34:28.719
non-linear function, all the effort we

00:34:26.760 --> 00:34:30.280
are doing is going to be a total waste

00:34:28.719 --> 00:34:32.839
of time.

00:34:30.280 --> 00:34:34.399
Okay? For now, just

00:34:32.840 --> 00:34:36.200
take it on faith that you need to have

00:34:34.398 --> 00:34:39.480
non-linear functions here.

00:34:36.199 --> 00:34:41.039
But, note that the three numbers here

00:34:39.480 --> 00:34:42.079
are still three numbers. They are three

00:34:41.039 --> 00:34:43.398
different numbers, but they're still

00:34:42.079 --> 00:34:45.000
three numbers.
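
Here is why you have to take that on faith (or verify it yourself): without a non-linear function in between, stacking two linear layers collapses into one equivalent linear layer, so the extra layer buys nothing. A quick check with made-up shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)

two_linear_layers = W2 @ (W1 @ x + b1) + b2
one_linear_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)  # same map, single layer
print(np.allclose(two_linear_layers, one_linear_layer))  # True
```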

00:34:43.398 --> 00:34:46.440
And once we do this, we'll be like, "You

00:34:45.000 --> 00:34:48.119
know what? This was fun. Let's do it

00:34:46.440 --> 00:34:51.918
again."

00:34:48.119 --> 00:34:51.918
Okay? So, you can do it again.

00:34:52.320 --> 00:34:55.720
And you can keep on doing it. You can

00:34:53.559 --> 00:34:57.400
do it 100 times if you want.

00:34:55.719 --> 00:35:00.639
And the key thing is that every time you

00:34:57.400 --> 00:35:03.079
do it, you're giving this network some

00:35:00.639 --> 00:35:05.159
ability, some capacity to learn

00:35:03.079 --> 00:35:07.799
something interesting from the data.

00:35:05.159 --> 00:35:09.319
To learn an interesting representation.

00:35:07.800 --> 00:35:10.680
Now, of course, you're thinking, "Well,

00:35:09.320 --> 00:35:12.039
how do we know it's interesting? How do

00:35:10.679 --> 00:35:14.079
you know it's a useful thing?" And we'll

00:35:12.039 --> 00:35:14.840
come to all that later on.

00:35:14.079 --> 00:35:16.840
Right? We're just giving it the

00:35:14.840 --> 00:35:17.960
capacity, the potential to learn

00:35:16.840 --> 00:35:19.240
interesting things from the data.

00:35:17.960 --> 00:35:21.199
Whether it actually lives up to its

00:35:19.239 --> 00:35:23.000
potential, we don't know yet.

00:35:21.199 --> 00:35:24.719
Okay? We'll give it the potential.

00:35:23.000 --> 00:35:26.358
Because the more transformations of the

00:35:24.719 --> 00:35:27.799
input data you make, the more

00:35:26.358 --> 00:35:29.039
opportunity you have to do interesting

00:35:27.800 --> 00:35:30.160
things with it.

00:35:29.039 --> 00:35:31.480
If I don't even give you the opportunity

00:35:30.159 --> 00:35:32.879
to transform it once, you don't have any

00:35:31.480 --> 00:35:34.719
opportunity, right?

00:35:32.880 --> 00:35:36.200
If I give you 10 chances to transform

00:35:34.719 --> 00:35:38.039
things, you have 10 shots at doing

00:35:36.199 --> 00:35:40.239
something useful.

00:35:38.039 --> 00:35:42.159
So, you can do this repeatedly

00:35:40.239 --> 00:35:44.759
and once we are done doing these

00:35:42.159 --> 00:35:46.159
transformations, we just pipe it through

00:35:44.760 --> 00:35:49.920
to our good old logistic regression

00:35:46.159 --> 00:35:49.920
sigmoid here and we are done.

00:35:50.440 --> 00:35:53.960
Okay?

00:35:51.480 --> 00:35:55.960
So, this is the basic idea.

00:35:53.960 --> 00:35:57.800
And so, just to contrast it, this was

00:35:55.960 --> 00:35:59.240
good old logistic regression where we

00:35:57.800 --> 00:36:00.519
take the input,

00:35:59.239 --> 00:36:02.319
we run it through a linear function and

00:36:00.519 --> 00:36:04.599
pop out a number,

00:36:02.320 --> 00:36:06.080
a probability number. But, after we do

00:36:04.599 --> 00:36:08.599
all this stuff, the input stays the

00:36:06.079 --> 00:36:09.679
same, the output stays the same, but in

00:36:08.599 --> 00:36:11.480
the middle you just run through a whole

00:36:09.679 --> 00:36:12.639
bunch of these functions, you know,

00:36:11.480 --> 00:36:14.358
these layers, boop boop boop boop, and

00:36:12.639 --> 00:36:15.239
then we get the output.

00:36:14.358 --> 00:36:16.559
Okay?

00:36:15.239 --> 00:36:19.519
That's all we have done.

00:36:16.559 --> 00:36:21.679
And this is a neural network.

00:36:19.519 --> 00:36:25.079
A neural network is nothing more than

00:36:21.679 --> 00:36:27.519
repeatedly transformed inputs which are

00:36:25.079 --> 00:36:30.159
finally fed to a linear or logistic

00:36:27.519 --> 00:36:30.159
regression model.
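
That one-sentence definition translates almost directly into code. A minimal sketch, with placeholder random weights where trained ones would go:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    # Repeatedly transform the input: linear function, then non-linear.
    h = x
    for W, b in layers[:-1]:
        h = relu(W @ h + b)
    # Finally, feed the result to good old logistic regression.
    W, b = layers[-1]
    return sigmoid(W @ h + b)

rng = np.random.default_rng(1)
layers = [(rng.normal(size=(4, 2)), rng.normal(size=4)),  # hidden layer
          (rng.normal(size=(1, 4)), rng.normal(size=1))]  # output layer
print(forward(np.array([3.8, 1.2]), layers))
```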

00:36:35.400 --> 00:36:38.800
Any questions?

00:36:37.559 --> 00:36:41.799
I have two questions. Could you use the

00:36:38.800 --> 00:36:43.320
thing so that everyone can hear? Yeah.

00:36:41.800 --> 00:36:45.240
I have two questions. Firstly, so when

00:36:43.320 --> 00:36:48.080
we say that there isn't a chance of

00:36:45.239 --> 00:36:51.559
explainability, is it that we don't know

00:36:48.079 --> 00:36:53.239
which arrow it went through? That's one.

00:36:51.559 --> 00:36:54.960
Second,

00:36:53.239 --> 00:36:57.239
who's controlling the number of

00:36:54.960 --> 00:36:59.639
iterations or the number of functions?

00:36:57.239 --> 00:37:01.239
That's up to us or how does that work?

00:36:59.639 --> 00:37:03.960
Right. So, yeah, so the first

00:37:01.239 --> 00:37:06.879
question, um explainability, we actually

00:37:03.960 --> 00:37:09.119
know exactly, for any given input

00:37:06.880 --> 00:37:10.760
data point, how

00:37:09.119 --> 00:37:12.119
it flows through the network. So, there

00:37:10.760 --> 00:37:15.680
is no problem there.

00:37:12.119 --> 00:37:17.599
The problem is in ascribing, "Okay,

00:37:15.679 --> 00:37:20.159
we think this person is going to

00:37:17.599 --> 00:37:21.880
repay the loan because

00:37:20.159 --> 00:37:24.159
of this particular attribute." We don't

00:37:21.880 --> 00:37:25.680
know that because those attributes all

00:37:24.159 --> 00:37:27.358
get enmeshed together and go through

00:37:25.679 --> 00:37:29.119
this complicated thing. So, we know

00:37:27.358 --> 00:37:31.480
exactly what happens. We just can't give

00:37:29.119 --> 00:37:33.319
credit to any one thing very easily.

00:37:31.480 --> 00:37:35.480
Again, I'm just standing on the

00:37:33.320 --> 00:37:36.280
brink of this vast ocean of something

00:37:35.480 --> 00:37:38.519
called explainability and

00:37:36.280 --> 00:37:39.960
interpretability, uh which I'll get to a

00:37:38.519 --> 00:37:42.280
bit later on in the semester. But,

00:37:39.960 --> 00:37:44.280
that's sort of the quick

00:37:42.280 --> 00:37:46.880
kind of right-ish kind of wrong answer.

00:37:44.280 --> 00:37:47.760
Okay? Number two, um

00:37:46.880 --> 00:37:49.559
uh

00:37:47.760 --> 00:37:51.000
we decide the number of layers. We

00:37:49.559 --> 00:37:52.880
decide a whole bunch of things and as

00:37:51.000 --> 00:37:53.920
we'll see in a few minutes, uh there is

00:37:52.880 --> 00:37:55.640
something that's given to us and

00:37:53.920 --> 00:37:58.840
something we get to design and I'll make

00:37:55.639 --> 00:37:58.839
it very clear which is which.

00:37:59.320 --> 00:38:01.600
Yeah.

00:38:02.000 --> 00:38:06.320
Did I say your name right? Yeah.

00:38:04.039 --> 00:38:08.840
So, which functions have to be linear

00:38:06.320 --> 00:38:11.960
and also like why does it have to be

00:38:08.840 --> 00:38:15.200
linear? Yeah. So, these functions uh the

00:38:11.960 --> 00:38:16.920
f of x here, they have to be non-linear.

00:38:15.199 --> 00:38:19.439
As to why they have to be non-linear,

00:38:16.920 --> 00:38:22.559
we'll get to that in a few minutes.

00:38:19.440 --> 00:38:23.480
Okay. So, these are called neurons.

00:38:22.559 --> 00:38:25.239
Okay?

00:38:23.480 --> 00:38:27.559
These things where there's

00:38:25.239 --> 00:38:29.358
a linear function followed by uh a

00:38:27.559 --> 00:38:31.000
little non-linear function,

00:38:29.358 --> 00:38:32.679
right? This is a Each one of these

00:38:31.000 --> 00:38:34.239
things is called a neuron.

00:38:32.679 --> 00:38:36.960
Um

00:38:34.239 --> 00:38:39.719
By the way, you know, this is loosely

00:38:36.960 --> 00:38:41.679
inspired by the way, you know, uh

00:38:39.719 --> 00:38:42.919
neurons work in mammalian

00:38:41.679 --> 00:38:45.599
brains.

00:38:42.920 --> 00:38:47.880
But, the connections between

00:38:45.599 --> 00:38:50.679
neuroscience and deep learning

00:38:47.880 --> 00:38:52.599
are very heavily argued.

00:38:50.679 --> 00:38:55.559
So, I'm going to like stay away from it.

00:38:52.599 --> 00:38:57.559
Okay? Uh suffice it to say that

00:38:55.559 --> 00:38:59.559
for building practical deep

00:38:57.559 --> 00:39:01.880
learning systems in industry, you

00:38:59.559 --> 00:39:04.000
don't worry about this. Okay?

00:39:01.880 --> 00:39:06.880
All right, let's move on.

00:39:04.000 --> 00:39:09.320
Terminology. Uh this vertical stack of

00:39:06.880 --> 00:39:10.760
linear functions or neurons,

00:39:09.320 --> 00:39:12.080
right? This vertical stack is called a

00:39:10.760 --> 00:39:14.080
layer.

00:39:12.079 --> 00:39:15.840
Right? This is a layer, that's a layer.

00:39:14.079 --> 00:39:17.279
Uh and these little non-linear

00:39:15.840 --> 00:39:20.440
functions, which we haven't gotten to

00:39:17.280 --> 00:39:22.280
yet, are called activation functions.

00:39:20.440 --> 00:39:25.240
Uh and we'll get to why they are called

00:39:22.280 --> 00:39:25.240
that in just a second.

00:39:25.320 --> 00:39:29.400
And

00:39:26.920 --> 00:39:31.840
the input

00:39:29.400 --> 00:39:34.079
is called an input layer and I have the

00:39:31.840 --> 00:39:35.640
word layer in double quotes because like

00:39:34.079 --> 00:39:36.759
it's not really doing anything, right?

00:39:35.639 --> 00:39:39.279
It's just the input.

00:39:36.760 --> 00:39:41.480
But we call it an input layer.

00:39:39.280 --> 00:39:42.880
And the very final thing that

00:39:41.480 --> 00:39:45.280
produces outputs is called the output

00:39:42.880 --> 00:39:48.360
layer, right? Obviously. And everything

00:39:45.280 --> 00:39:50.200
in the middle is called a hidden layer.

00:39:48.360 --> 00:39:52.440
Okay?

00:39:50.199 --> 00:39:54.960
So, the final piece of terminology is

00:39:52.440 --> 00:39:56.240
that when you have a layer like this in

00:39:54.960 --> 00:39:58.240
which say three numbers are coming out

00:39:56.239 --> 00:40:00.799
and there's another layer,

00:39:58.239 --> 00:40:03.319
right? If every neuron in this layer is

00:40:00.800 --> 00:40:05.280
connected to every neuron in this layer,

00:40:03.320 --> 00:40:07.280
it's called a fully connected or dense

00:40:05.280 --> 00:40:08.880
layer. So, for instance, here

00:40:07.280 --> 00:40:10.360
this arrow carries

00:40:08.880 --> 00:40:11.240
whatever number is coming

00:40:10.360 --> 00:40:12.720
out. Let's say the number three is

00:40:11.239 --> 00:40:15.239
coming out of this thing here. That

00:40:12.719 --> 00:40:17.399
number three flows on this arrow to

00:40:15.239 --> 00:40:19.559
this thing, flows on this arrow to this

00:40:17.400 --> 00:40:21.200
neuron, and flows on this third arrow to

00:40:19.559 --> 00:40:23.239
this neuron. That's what I mean. So,

00:40:21.199 --> 00:40:25.159
every neuron, its output is being sent

00:40:23.239 --> 00:40:27.559
to every neuron in the following layer.

00:40:25.159 --> 00:40:29.319
Okay? That's what we call fully connected

00:40:27.559 --> 00:40:30.599
or dense.

00:40:29.320 --> 00:40:32.559
And then

00:40:30.599 --> 00:40:34.480
if you look at logistic regression,

00:40:32.559 --> 00:40:36.320
right? This is logistic regression. You

00:40:34.480 --> 00:40:40.440
can see basically logistic regression is

00:40:36.320 --> 00:40:40.440
a neural network with no hidden layers.

00:40:41.000 --> 00:40:43.639
So, in some sense, logistic regression

00:40:42.159 --> 00:40:45.440
is like almost the simplest possible

00:40:43.639 --> 00:40:48.359
network you can think of.

00:40:45.440 --> 00:40:50.280
Like barely a neural network.

00:40:48.360 --> 00:40:51.079
Right? It's got no hidden layers.

00:40:50.280 --> 00:40:52.440
That's what makes it logistic

00:40:51.079 --> 00:40:54.239
regression.

00:40:52.440 --> 00:40:56.119
And so, as you might have guessed by

00:40:54.239 --> 00:40:58.879
now, deep learning is just neural

00:40:56.119 --> 00:41:00.119
networks with lots and lots of

00:40:58.880 --> 00:41:02.400
of what?

00:41:00.119 --> 00:41:04.319
Yes, layers.

00:41:02.400 --> 00:41:07.079
So, here are a few.

00:41:04.320 --> 00:41:08.480
Uh and by the way, these are not even

00:41:07.079 --> 00:41:10.039
considered all that, you know,

00:41:08.480 --> 00:41:13.039
impressive these days.

00:41:10.039 --> 00:41:16.039
Okay? Uh but I put them up because

00:41:13.039 --> 00:41:18.119
this thing here is called ResNet.

00:41:16.039 --> 00:41:20.440
And it's famous because the ResNet

00:41:18.119 --> 00:41:21.559
neural network was I think the first

00:41:20.440 --> 00:41:24.039
network

00:41:21.559 --> 00:41:26.799
to surpass human-level performance in

00:41:24.039 --> 00:41:28.920
image classification.

00:41:26.800 --> 00:41:31.039
It's sort of like the Skynet

00:41:28.920 --> 00:41:32.960
of image classification. Okay? It

00:41:31.039 --> 00:41:34.159
surpassed human-level performance. And

00:41:32.960 --> 00:41:36.320
I'm putting it up here because we'll

00:41:34.159 --> 00:41:37.759
actually work with ResNet next

00:41:36.320 --> 00:41:39.280
Wednesday. And we'll actually take

00:41:37.760 --> 00:41:41.920
ResNet, we'll fine-tune it, and solve a

00:41:39.280 --> 00:41:43.640
real problem in class.

00:41:41.920 --> 00:41:46.000
All right. So, it's got lots and lots of

00:41:43.639 --> 00:41:47.159
layers. Uh now, let's turn to these

00:41:46.000 --> 00:41:48.800
activation functions. We've been

00:41:47.159 --> 00:41:49.839
ignoring these little guys, right? So

00:41:48.800 --> 00:41:52.800
far.

00:41:49.840 --> 00:41:54.920
So, the activation function at a node is

00:41:52.800 --> 00:41:56.960
first of all, a function that

00:41:54.920 --> 00:41:58.639
receives a single number and outputs a

00:41:56.960 --> 00:42:00.760
single number, right? It's not very

00:41:58.639 --> 00:42:03.000
complicated, right? This

00:42:00.760 --> 00:42:04.560
is basically a linear function

00:42:03.000 --> 00:42:06.679
which receives all these inputs. It

00:42:04.559 --> 00:42:07.880
could be 10 inputs, 1,000 inputs,

00:42:06.679 --> 00:42:09.559
runs it through a linear function,

00:42:07.880 --> 00:42:12.200
outputs a number, and that single

00:42:09.559 --> 00:42:14.759
number, a scalar, goes in here, and it

00:42:12.199 --> 00:42:16.599
comes out as another single number.

00:42:14.760 --> 00:42:18.000
Just remember that.

00:42:16.599 --> 00:42:19.480
And so, these are some of the most

00:42:18.000 --> 00:42:21.519
common activation functions. In fact,

00:42:19.480 --> 00:42:23.400
the sigmoid we saw, which we

00:42:21.519 --> 00:42:25.639
use for the output, is actually a kind

00:42:23.400 --> 00:42:28.119
of activation function where a single

00:42:25.639 --> 00:42:30.000
number comes in and it gets mapped into

00:42:28.119 --> 00:42:31.799
this curve because of this thing. So,

00:42:30.000 --> 00:42:33.920
the single number that comes in is A,

00:42:31.800 --> 00:42:37.160
and it gets transformed as 1 / (1

00:42:33.920 --> 00:42:38.880
+ e^(-A)), and you get a shape like this,

00:42:37.159 --> 00:42:40.679
and it's called the sigmoid activation

00:42:38.880 --> 00:42:41.840
function. And as you can see

00:42:40.679 --> 00:42:44.319
here,

00:42:41.840 --> 00:42:45.920
for very small values, for very negative

00:42:44.320 --> 00:42:47.840
values,

00:42:45.920 --> 00:42:50.280
it's going to be pretty close to zero,

00:42:47.840 --> 00:42:52.559
meaning it won't get activated.

00:42:50.280 --> 00:42:53.680
And for very very large values, it's

00:42:52.559 --> 00:42:55.360
going to be

00:42:53.679 --> 00:42:57.759
pretty close to one.

00:42:55.360 --> 00:42:59.079
All the action happens in the middle.

00:42:57.760 --> 00:43:00.160
When your values are

00:42:59.079 --> 00:43:03.119
somewhere in this range, there's a

00:43:00.159 --> 00:43:05.079
dramatic increase in what comes out.

00:43:03.119 --> 00:43:06.440
Okay? So, that little thing in the

00:43:05.079 --> 00:43:07.799
middle is a sweet spot for these

00:43:06.440 --> 00:43:08.639
functions.
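
You can see that behavior numerically with a quick sketch:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Very negative inputs land near 0, very positive inputs near 1,
# and all the action is in the middle.
for a in (-10, -2, 0, 2, 10):
    print(a, round(sigmoid(a), 4))
# -10 -> 0.0, -2 -> 0.1192, 0 -> 0.5, 2 -> 0.8808, 10 -> 1.0 (approximately)
```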

00:43:07.800 --> 00:43:10.000
Uh

00:43:08.639 --> 00:43:11.440
and this

00:43:10.000 --> 00:43:12.760
You know, I'm almost embarrassed

00:43:11.440 --> 00:43:13.880
to call it an activation function

00:43:12.760 --> 00:43:15.520
because it's literally not doing

00:43:13.880 --> 00:43:16.880
anything. It's sort of getting a nice

00:43:15.519 --> 00:43:18.639
label for free.

00:43:16.880 --> 00:43:19.720
Um, right? Basically, you just

00:43:18.639 --> 00:43:20.839
get a number, just pass it straight

00:43:19.719 --> 00:43:22.359
along.

00:43:20.840 --> 00:43:23.720
It's a linear activation function, but

00:43:22.360 --> 00:43:25.599
just for completeness, I want to put it

00:43:23.719 --> 00:43:28.319
here.

00:43:25.599 --> 00:43:30.920
And then we come to the hero of deep

00:43:28.320 --> 00:43:32.000
learning, which is the rectified linear

00:43:30.920 --> 00:43:34.519
unit,

00:43:32.000 --> 00:43:37.079
right? Rectified linear unit. It's

00:43:34.519 --> 00:43:38.519
called ReLU. Uh and ReLU is going to

00:43:37.079 --> 00:43:41.039
become part of your vocabulary very very

00:43:38.519 --> 00:43:43.000
quickly. Uh and so, ReLU is actually a

00:43:41.039 --> 00:43:44.920
very interesting function. So, you write

00:43:43.000 --> 00:43:46.320
it as maximum of whatever number and

00:43:44.920 --> 00:43:48.360
zero,

00:43:46.320 --> 00:43:50.600
which is another way of saying if the

00:43:48.360 --> 00:43:53.480
number is positive, just send it along

00:43:50.599 --> 00:43:56.639
unchanged. If the number is negative,

00:43:53.480 --> 00:43:57.639
send a zero instead. Squish it to zero.

00:43:56.639 --> 00:43:59.799
So, which means if the number is

00:43:57.639 --> 00:44:03.039
negative, nothing happens. If the number

00:43:59.800 --> 00:44:03.039
is positive, it wakes up.

00:44:03.239 --> 00:44:07.159
So, what happens is that you could have

00:44:04.920 --> 00:44:09.320
a very complicated linear function with

00:44:07.159 --> 00:44:10.519
millions of variables, and then it outputs

00:44:09.320 --> 00:44:12.000
a single number, and that number

00:44:10.519 --> 00:44:13.239
unfortunately happens to be negative.

00:44:12.000 --> 00:44:15.199
The ReLU is not impressed. It's going to

00:44:13.239 --> 00:44:17.519
send a zero out.

00:44:15.199 --> 00:44:20.279
Okay? It's a very simple function.
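
Simple enough to write in one line. For instance:

```python
def relu(a):
    # max(a, 0): positive numbers pass through unchanged,
    # negative numbers get squished to zero.
    return max(a, 0.0)

print(relu(1.76))  # 1.76 -- positive, sent along unchanged
print(relu(-3.2))  # 0.0  -- negative, the ReLU is not impressed
```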

00:44:17.519 --> 00:44:22.559
And many many folks who've been in deep

00:44:20.280 --> 00:44:23.480
learning for a long long time believe

00:44:22.559 --> 00:44:25.519
that

00:44:23.480 --> 00:44:26.760
the use of ReLUs is one of the key

00:44:25.519 --> 00:44:28.840
factors

00:44:26.760 --> 00:44:30.440
that led to the amazing success of deep

00:44:28.840 --> 00:44:32.160
learning because it's got some very

00:44:30.440 --> 00:44:33.880
interesting properties,

00:44:32.159 --> 00:44:35.759
uh which we'll get to hopefully on

00:44:33.880 --> 00:44:40.039
Wednesday.

00:44:35.760 --> 00:44:42.000
Okay. So, the shorthand here is that um

00:44:40.039 --> 00:44:43.639
whenever you see this thing, it's just a

00:44:42.000 --> 00:44:44.679
linear activation: a linear function

00:44:43.639 --> 00:44:47.319
followed by just sending it straight

00:44:44.679 --> 00:44:49.119
out. If I put a

00:44:47.320 --> 00:44:51.519
ReLU in here, I'm going to denote it

00:44:49.119 --> 00:44:53.239
like that, which mimics the graph

00:44:51.519 --> 00:44:54.719
uh how it looks. And if I

00:44:53.239 --> 00:44:55.839
put a sigmoid, I'm just going to use

00:44:54.719 --> 00:44:56.839
this thing here.

00:44:55.840 --> 00:44:59.941
Okay?

00:44:56.840 --> 00:45:00.240
Just a visual shorthand.

00:44:59.940 --> 00:45:02.358
>> [clears throat]

00:45:00.239 --> 00:45:03.839
>> There are many other

00:45:02.358 --> 00:45:05.079
activation functions, by the way.

00:45:03.840 --> 00:45:07.840
There's something called the tanh

00:45:05.079 --> 00:45:10.960
function, the leaky ReLU, the GELU, the

00:45:07.840 --> 00:45:12.640
Swish. I mean, it's like a menagerie of

00:45:10.960 --> 00:45:14.280
activation functions because very often

00:45:12.639 --> 00:45:15.799
researchers will be like, "Well, I don't

00:45:14.280 --> 00:45:17.040
like this activation function. Here's a

00:45:15.800 --> 00:45:18.080
little modified version of the function

00:45:17.039 --> 00:45:20.400
which is going to be better for certain

00:45:18.079 --> 00:45:22.480
things." So, you know, people's research

00:45:20.400 --> 00:45:24.400
creativity on this point has sort of

00:45:22.480 --> 00:45:26.519
gone unhinged. Um so, there's lots of

00:45:24.400 --> 00:45:27.760
options. But if you just stick to the

00:45:26.519 --> 00:45:29.519
ReLU

00:45:27.760 --> 00:45:31.720
for your hidden layers, you can

00:45:29.519 --> 00:45:32.519
basically get anything done practically,

00:45:31.719 --> 00:45:34.039
right? You don't have to worry about

00:45:32.519 --> 00:45:37.280
anything else. So, we'll only focus on

00:45:34.039 --> 00:45:38.559
ReLUs for all the intermediate stuff. Uh

00:45:37.280 --> 00:45:40.400
yeah.

00:45:38.559 --> 00:45:41.840
Yeah, how do you gauge which activation

00:45:40.400 --> 00:45:42.720
function is more suited for your use

00:45:41.840 --> 00:45:45.280
case?

00:45:42.719 --> 00:45:48.000
Yeah. So, the rule of thumb here is that

00:45:45.280 --> 00:45:49.680
for your hidden layers, use ReLUs,

00:45:48.000 --> 00:45:51.880
right? Because empirically we have seen

00:45:49.679 --> 00:45:54.199
that they do an amazing job.

00:45:51.880 --> 00:45:56.320
For your output layer, your very final

00:45:54.199 --> 00:45:57.960
thing, you actually don't have a choice

00:45:56.320 --> 00:45:59.640
because what you have to use depends on

00:45:57.960 --> 00:46:01.199
what kind of output you have to work

00:45:59.639 --> 00:46:02.679
with. If it's an output which is a

00:46:01.199 --> 00:46:04.480
probability number between zero and one,

00:46:02.679 --> 00:46:05.839
you have to use a sigmoid.

00:46:04.480 --> 00:46:07.559
Um if it is

00:46:05.840 --> 00:46:08.960
say 10 numbers, all of which have to be

00:46:07.559 --> 00:46:10.119
probabilities, and they have to add up

00:46:08.960 --> 00:46:10.880
to one,

00:46:10.119 --> 00:46:12.199
you got to use something called the

00:46:10.880 --> 00:46:13.960
softmax, which we'll get to on

00:46:12.199 --> 00:46:15.679
Wednesday. So, it really depends on the

00:46:13.960 --> 00:46:16.760
output, and the nature of the output

00:46:15.679 --> 00:46:18.599
dictates what you use in the output

00:46:16.760 --> 00:46:19.920
layer.
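
As one concrete way to express that rule of thumb in code, here is a sketch in Keras; the lecture doesn't prescribe a framework, and the layer sizes here are arbitrary:

```python
import tensorflow as tf

# Binary output: one probability between 0 and 1, so sigmoid at the end.
binary_model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),    # hidden layer: ReLU
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output layer
])

# Ten-class output: ten probabilities that add up to 1, so softmax.
ten_class_model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```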

00:46:18.599 --> 00:46:22.000
Okay.

00:46:19.920 --> 00:46:24.880
So, coming back to this. So, if you want

00:46:22.000 --> 00:46:27.280
to design a deep neural network,

00:46:24.880 --> 00:46:29.599
uh the input is the input.

00:46:27.280 --> 00:46:30.960
The output is the output. And so, you

00:46:29.599 --> 00:46:32.880
get to choose everything else. You get

00:46:30.960 --> 00:46:35.320
to choose the number of hidden layers,

00:46:32.880 --> 00:46:37.559
the number of neurons in each layer, the

00:46:35.320 --> 00:46:39.600
activation functions you're going to use

00:46:37.559 --> 00:46:41.119
for the hidden layers, and then

00:46:39.599 --> 00:46:42.759
you have to make sure that what you

00:46:41.119 --> 00:46:44.279
choose for the output layer matches the

00:46:42.760 --> 00:46:46.840
kind of output you want to generate.

00:46:44.280 --> 00:46:48.680
Okay? So, this is

00:46:46.840 --> 00:46:51.120
all in your hands. You decide what

00:46:48.679 --> 00:46:52.799
happens. But

00:46:51.119 --> 00:46:53.719
there's a lot of guidance

00:46:52.800 --> 00:46:56.080
for how to do these things,

00:46:53.719 --> 00:46:57.679
which we'll cover as we go along.

00:46:56.079 --> 00:47:00.519
Did you have a question?

00:46:57.679 --> 00:47:03.279
Kind of, but I guess I'll do it.

00:47:00.519 --> 00:47:05.400
Is there also exploration in kind of

00:47:03.280 --> 00:47:07.920
dynamically, uh,

00:47:05.400 --> 00:47:11.400
setting up layers so that your users

00:47:07.920 --> 00:47:11.400
determine the number of layers?

00:47:12.599 --> 00:47:16.719
Yeah. So, there's a whole field called

00:47:14.320 --> 00:47:18.680
neural architecture search, NAS,

00:47:16.719 --> 00:47:20.480
where we can actually try a whole bunch

00:47:18.679 --> 00:47:22.319
of different architectures,

00:47:20.480 --> 00:47:23.800
uh and then use some optimization and in

00:47:22.320 --> 00:47:25.640
fact reinforcement learning, which we

00:47:23.800 --> 00:47:27.160
won't get to in this class,

00:47:25.639 --> 00:47:28.440
as a way to figure out really good

00:47:27.159 --> 00:47:32.199
architectures for any particular

00:47:28.440 --> 00:47:33.760
problem. Uh but

00:47:32.199 --> 00:47:34.799
the question of okay,

00:47:33.760 --> 00:47:36.480
when I'm training a model with a

00:47:34.800 --> 00:47:37.840
particular kind of data,

00:47:36.480 --> 00:47:39.039
the first pass through the training

00:47:37.840 --> 00:47:40.240
data, I'm going to use two layers. The

00:47:39.039 --> 00:47:42.440
second pass, I'm going to do seven

00:47:40.239 --> 00:47:44.039
layers. That is not done.

00:47:42.440 --> 00:47:45.840
Uh and the reason it's not done is

00:47:44.039 --> 00:47:47.279
because of certain other constraints we

00:47:45.840 --> 00:47:48.840
have in how we can do the the

00:47:47.280 --> 00:47:50.720
optimization and the gradient descent

00:47:48.840 --> 00:47:52.680
and stuff like that. But what you can

00:47:50.719 --> 00:47:54.319
do, and we'll look at this thing

00:47:52.679 --> 00:47:56.399
called dropout,

00:47:54.320 --> 00:47:58.200
for certain layers, for

00:47:56.400 --> 00:48:00.440
each time you run it through the

00:47:58.199 --> 00:48:02.199
network, you can decide in this layer

00:48:00.440 --> 00:48:03.599
I'm not going to use all the nodes. I'm

00:48:02.199 --> 00:48:05.879
going to drop out a few of the nodes

00:48:03.599 --> 00:48:07.279
randomly. And it's a very effective

00:48:05.880 --> 00:48:09.599
technique to prevent overfitting, and

00:48:07.280 --> 00:48:11.240
we'll come to that a little later on.

00:48:09.599 --> 00:48:13.639
Uh yeah.

00:48:11.239 --> 00:48:15.439
So, one question regarding like

00:48:13.639 --> 00:48:16.960
neural networks is about the

00:48:15.440 --> 00:48:17.920
coefficients. Is this something we

00:48:16.960 --> 00:48:19.358
decide

00:48:17.920 --> 00:48:21.159
or we

00:48:19.358 --> 00:48:23.840
have to use predefined values for

00:48:21.159 --> 00:48:25.920
the weights? No, the whole trick here,

00:48:23.840 --> 00:48:29.079
the whole name of the game is we use the

00:48:25.920 --> 00:48:30.440
data, the training data, and something

00:48:29.079 --> 00:48:31.719
called a loss function, which I'll get

00:48:30.440 --> 00:48:33.760
to on Wednesday,

00:48:31.719 --> 00:48:36.639
along with an optimization algorithm, so

00:48:33.760 --> 00:48:37.880
that the network figures out by itself

00:48:36.639 --> 00:48:39.599
what the weights need to be, what the

00:48:37.880 --> 00:48:42.039
coefficients need to be, so as to

00:48:39.599 --> 00:48:43.920
minimize prediction error.

00:48:42.039 --> 00:48:45.358
And that's the whole thing. The magic

00:48:43.920 --> 00:48:47.559
here is that we don't have to do

00:48:45.358 --> 00:48:49.880
anything. We only have to set it up, sit

00:48:47.559 --> 00:48:51.679
back, often for many hours, and watch it

00:48:49.880 --> 00:48:52.800
do its thing.
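
As a preview of what "watching it do its thing" means, here is a minimal sketch of gradient descent fitting the no-hidden-layer case (logistic regression). The data, learning rate, and iteration count are all made up, and the proper treatment of loss functions comes Wednesday:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))                           # toy inputs
y = (0.4 + X @ np.array([0.2, 0.5]) > 0).astype(float)  # toy labels

w, b = np.zeros(2), 0.0
for _ in range(1000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # current predicted probabilities
    grad_w = X.T @ (p - y) / len(y)         # gradient of the log loss
    grad_b = np.mean(p - y)
    w -= 0.5 * grad_w                       # nudge the weights downhill
    b -= 0.5 * grad_b
print(w, b)                                 # the learned weights and bias
```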

00:48:51.679 --> 00:48:54.279
Yeah.

00:48:52.800 --> 00:48:56.320
Just one quick question. Um you

00:48:54.280 --> 00:48:58.000
mentioned nodes just now when you were

00:48:56.320 --> 00:49:00.920
answering Roland's question. Can you

00:48:58.000 --> 00:49:02.559
just confirm exactly what a node is? I

00:49:00.920 --> 00:49:03.519
have an idea that it's basically any

00:49:02.559 --> 00:49:04.799
circle, but

00:49:03.519 --> 00:49:06.320
>> Yeah, yeah. You just added a lot more

00:49:04.800 --> 00:49:07.560
detail. Sure. No, when I'm

00:49:06.320 --> 00:49:09.760
referring to a node, I'm literally

00:49:07.559 --> 00:49:12.000
referring to something like this:

00:49:09.760 --> 00:49:14.640
think of it as a linear function

00:49:12.000 --> 00:49:16.480
followed by a non-linear activation.

00:49:14.639 --> 00:49:18.239
So, it reads a bunch of inputs, runs

00:49:16.480 --> 00:49:19.920
it through a linear function, and passes

00:49:18.239 --> 00:49:22.119
it through like a ReLU or a sigmoid or

00:49:19.920 --> 00:49:24.119
something, and out pops a number.

00:49:22.119 --> 00:49:26.000
So, in general, a node will have

00:49:24.119 --> 00:49:28.239
many numbers potentially coming in, but

00:49:26.000 --> 00:49:30.000
only one number going out.

00:49:28.239 --> 00:49:32.719
Uh now, that one number may get copied

00:49:30.000 --> 00:49:33.960
to every node in the next layer,

00:49:32.719 --> 00:49:36.639
but what comes out of that particular

00:49:33.960 --> 00:49:38.240
node is just a single number.

00:49:36.639 --> 00:49:38.839
All right. So,

00:49:38.239 --> 00:49:41.799
uh

00:49:38.840 --> 00:49:44.320
So, let's use a DNN for our interview

00:49:41.800 --> 00:49:46.360
example. So, in this problem we had two

00:49:44.320 --> 00:49:48.000
inputs, right? GPA and experience. The

00:49:46.360 --> 00:49:48.880
output variable has to be between zero

00:49:48.000 --> 00:49:50.039
and one because you're trying to predict

00:49:48.880 --> 00:49:52.720
the probability that someone will get

00:49:50.039 --> 00:49:54.079
called for an interview. So, the input

00:49:52.719 --> 00:49:55.319
size is fixed and the

00:49:54.079 --> 00:49:57.039
output size is fixed.

00:49:55.320 --> 00:49:59.440
Uh

00:49:57.039 --> 00:50:00.800
and so, since it's really

00:49:59.440 --> 00:50:02.800
the very first network we're actually

00:50:00.800 --> 00:50:04.360
playing with uh

00:50:02.800 --> 00:50:06.640
let's just start simple, right? We'll

00:50:04.360 --> 00:50:09.480
just have one hidden layer and we'll

00:50:06.639 --> 00:50:11.199
have three neurons, right? And as I

00:50:09.480 --> 00:50:13.719
mentioned to Tommaso's question from

00:50:11.199 --> 00:50:15.839
before if you are choosing activation

00:50:13.719 --> 00:50:17.919
functions in the hidden layers, just go

00:50:15.840 --> 00:50:19.760
with the ReLU as a default. It usually

00:50:17.920 --> 00:50:21.360
works really well out of the box. So,

00:50:19.760 --> 00:50:23.280
we'll just use a ReLU and since the

00:50:21.360 --> 00:50:25.240
output has to be between zero and one,

00:50:23.280 --> 00:50:27.000
we don't have a choice. We have to use a

00:50:25.239 --> 00:50:29.199
sigmoid for the output layer.

00:50:27.000 --> 00:50:31.119
Okay? That's it. So, those

00:50:29.199 --> 00:50:32.919
are the design choices and when we do

00:50:31.119 --> 00:50:34.960
that, this is what it looks like,

00:50:32.920 --> 00:50:36.760
right? We have two inputs X1 and X2, GPA

00:50:34.960 --> 00:50:38.199
and experience and then it goes through

00:50:36.760 --> 00:50:40.440
these three

00:50:38.199 --> 00:50:42.759
ReLUs and then out comes these three

00:50:40.440 --> 00:50:44.960
numbers and they pass through a sigmoid

00:50:42.760 --> 00:50:46.560
and we get a probability Y at the end.

00:50:44.960 --> 00:50:47.440
All right, quick question. Concept

00:50:46.559 --> 00:50:49.320
check.

00:50:47.440 --> 00:50:51.039
How many weights

00:50:49.320 --> 00:50:53.000
how many parameters, both weights and

00:50:51.039 --> 00:50:56.079
biases, does this network have?

00:50:53.000 --> 00:50:56.079
Let's take a moment to count.

00:51:11.199 --> 00:51:14.439
All right, any guesses?

00:51:15.559 --> 00:51:18.440
Yeah.

00:51:16.840 --> 00:51:21.840
12.

00:51:18.440 --> 00:51:21.840
I think you're almost there.

00:51:22.039 --> 00:51:25.400
Um

00:51:23.960 --> 00:51:28.320
Are folks going to be doing a binary

00:51:25.400 --> 00:51:28.320
search on this now? Okay.

00:51:29.320 --> 00:51:34.039
Uh no.

00:51:31.119 --> 00:51:35.679
Yes? 13. Yes, very good.

00:51:34.039 --> 00:51:37.360
So, that's 13,

00:51:35.679 --> 00:51:39.000
and my guess is that the reason you came

00:51:37.360 --> 00:51:41.000
up with 12, and I made the same mistake,

00:51:39.000 --> 00:51:44.400
that's why I know, is that you probably

00:51:41.000 --> 00:51:44.400
forgot this green thing here.

00:51:45.239 --> 00:51:49.319
Um so, what folks often forget is

00:51:48.000 --> 00:51:50.679
the bias.

00:51:49.320 --> 00:51:52.600
Right? We all count the things, right?

00:51:50.679 --> 00:51:54.239
Okay. And the easy way to do it is okay,

00:51:52.599 --> 00:51:56.119
two things here,

00:51:54.239 --> 00:51:57.279
three things here, two times three

00:51:56.119 --> 00:51:59.400
is six,

00:51:57.280 --> 00:52:00.760
three times one is three, so that's nine weights,

00:51:59.400 --> 00:52:02.480
and then you have to add up all the

00:52:00.760 --> 00:52:04.080
intercepts, three here and one more here.

00:52:02.480 --> 00:52:05.840
Right? So, you get 13.

00:52:04.079 --> 00:52:08.079
And so, when we get to very complicated

00:52:05.840 --> 00:52:09.480
networks the first two or three

00:52:08.079 --> 00:52:10.719
times you work with very complex

00:52:09.480 --> 00:52:11.960
networks

00:52:10.719 --> 00:52:14.359
and we'll do it, you know, starting very

00:52:11.960 --> 00:52:16.119
soon, just get into the habit of hand

00:52:14.360 --> 00:52:17.079
calculating the number of parameters

00:52:16.119 --> 00:52:18.880
just to make sure you understand what's

00:52:17.079 --> 00:52:20.039
going on. Once you get it right a couple

00:52:18.880 --> 00:52:21.599
of times, you don't have to do

00:52:20.039 --> 00:52:23.000
it anymore. Okay? The first couple of

00:52:21.599 --> 00:52:23.920
times hand calculate to make sure you

00:52:23.000 --> 00:52:26.239
get it.
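
Here is one way to do that hand-count programmatically, as a cross-check; this small helper is written just for this lecture's fully connected networks:

```python
def count_params(layer_sizes):
    """Count weights and biases for dense layers, inputs listed first."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # a weight from every node to every node
        total += n_out         # one bias per receiving node
    return total

# The interview network: 2 inputs, 3 hidden neurons, 1 output.
print(count_params([2, 3, 1]))  # (2*3 + 3) + (3*1 + 1) = 13
```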

00:52:23.920 --> 00:52:28.840
Okay. So, yeah. So, let's say that we

00:52:26.239 --> 00:52:30.559
have trained this network, you

00:52:28.840 --> 00:52:32.800
know, using techniques which we'll cover

00:52:30.559 --> 00:52:34.119
on Wednesday, and it comes back to

00:52:32.800 --> 00:52:36.360
you after training and says, "Okay,

00:52:34.119 --> 00:52:38.679
these are the best values

00:52:36.360 --> 00:52:40.559
for the weights and the biases that I

00:52:38.679 --> 00:52:42.319
have found." So, now your network is

00:52:40.559 --> 00:52:43.840
ready for action.

00:52:42.320 --> 00:52:45.880
It's ready to be used

00:52:43.840 --> 00:52:47.079
and so, what you can do is let's say

00:52:45.880 --> 00:52:48.640
that you want to predict with this

00:52:47.079 --> 00:52:49.880
network,

00:52:48.639 --> 00:52:52.679
you know,

00:52:49.880 --> 00:52:54.119
if you have X1 and X2,

00:52:52.679 --> 00:52:56.480
what comes out of this top

00:52:54.119 --> 00:52:58.719
neuron, right? Let's call it A1. It's

00:52:56.480 --> 00:53:00.199
basically this.

00:52:58.719 --> 00:53:02.159
Okay? That's what's coming out of this

00:53:00.199 --> 00:53:05.639
thing. For any X1 and X2, this is what's

00:53:02.159 --> 00:53:06.519
coming out. Similarly for A2 and A3

00:53:05.639 --> 00:53:08.519
Okay?

00:53:06.519 --> 00:53:09.559
And then what comes out at the very end

00:53:08.519 --> 00:53:11.840
is

00:53:09.559 --> 00:53:14.880
basically A1 times that plus A2 times

00:53:11.840 --> 00:53:15.880
that plus A3 times that plus 0.05 and

00:53:14.880 --> 00:53:18.240
the whole thing gets run through the

00:53:15.880 --> 00:53:20.920
sigmoid and this is what you get.

00:53:18.239 --> 00:53:22.159
Okay? So, this slide and the one before,

00:53:20.920 --> 00:53:23.840
just make sure you look at it afterwards

00:53:22.159 --> 00:53:26.399
to make sure you totally understand

00:53:23.840 --> 00:53:27.800
the mechanics of it because

00:53:26.400 --> 00:53:28.960
this is really important.

00:53:27.800 --> 00:53:30.720
If you don't fully understand and

00:53:28.960 --> 00:53:31.880
internalize the mechanics, when we get

00:53:30.719 --> 00:53:33.799
to things like transformers, it's going

00:53:31.880 --> 00:53:35.280
to get hard. Okay? So, just make sure

00:53:33.800 --> 00:53:37.080
it's like automatic at this point. It

00:53:35.280 --> 00:53:38.280
should be reflexive.

00:53:37.079 --> 00:53:40.840
Um

00:53:38.280 --> 00:53:41.840
Okay. So, yeah. And so, when you

00:53:40.840 --> 00:53:42.760
want to predict anything, you just run

00:53:41.840 --> 00:53:44.120
some numbers through it, you get all

00:53:42.760 --> 00:53:45.480
these things

00:53:44.119 --> 00:53:48.519
and boom, you calculate it. It turns out

00:53:45.480 --> 00:53:50.000
to be 0.226, a 22.6% chance. That's the answer.
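
Here are the same mechanics traced in code. Everything except the 0.05 output intercept is a placeholder weight invented for illustration, so the final probability won't match the slide's number:

```python
import numpy as np

def relu(z): return np.maximum(z, 0.0)
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

x = np.array([3.8, 1.2])         # X1 = GPA, X2 = years of experience

W1 = np.array([[ 0.3, -0.1],     # hypothetical hidden-layer weights
               [ 0.2,  0.4],
               [-0.5,  0.1]])
b1 = np.array([0.1, -0.2, 0.3])  # hypothetical hidden-layer biases
a = relu(W1 @ x + b1)            # A1, A2, A3 out of the hidden layer

w2 = np.array([0.6, -0.3, 0.2])  # hypothetical output weights
y = sigmoid(w2 @ a + 0.05)       # 0.05 is the output intercept from the slide
print(a, y)
```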

00:53:48.519 --> 00:53:51.800
All right. So,

00:53:50.000 --> 00:53:53.519
I just want to say that let's say that

00:53:51.800 --> 00:53:55.359
you built this network

00:53:53.519 --> 00:53:57.079
and now we are like, "Hey,

00:53:55.358 --> 00:53:58.440
given any X1 and X2, I can come up with

00:53:57.079 --> 00:54:00.239
a Y."

00:53:58.440 --> 00:54:02.159
But I'm feeling a little mathy. Can we

00:54:00.239 --> 00:54:03.358
actually write down the function? Yeah,

00:54:02.159 --> 00:54:06.000
you can write down the function. This is

00:54:03.358 --> 00:54:06.000
what it looks like.

00:54:07.358 --> 00:54:10.358
Super interpretable, right?

00:54:10.480 --> 00:54:16.159
So, this goes to the comment, Itai, that

00:54:12.480 --> 00:54:18.280
you made earlier on, where the act of

00:54:16.159 --> 00:54:21.119
depicting something using this sort of

00:54:18.280 --> 00:54:22.400
graphical layout makes it so much easier

00:54:21.119 --> 00:54:24.440
to reason with

00:54:22.400 --> 00:54:26.559
and to think about compared to trying to

00:54:24.440 --> 00:54:28.519
figure out what this function is doing.

00:54:26.559 --> 00:54:30.559
Right? The other point I want to make is

00:54:28.519 --> 00:54:32.239
that um

00:54:30.559 --> 00:54:33.400
just contrast what we just saw with the

00:54:32.239 --> 00:54:35.599
logistic regression thing we saw

00:54:33.400 --> 00:54:38.200
earlier, which was this little function

00:54:35.599 --> 00:54:40.759
and so, here

00:54:38.199 --> 00:54:42.559
even this simple network with just

00:54:40.760 --> 00:54:44.200
three nodes in

00:54:42.559 --> 00:54:46.519
that single hidden layer

00:54:44.199 --> 00:54:48.480
right? It's so much more complicated

00:54:46.519 --> 00:54:50.280
than the logistic regression model. So

00:54:48.480 --> 00:54:52.760
much more complicated, right?

00:54:50.280 --> 00:54:55.000
And it is from this complexity that

00:54:52.760 --> 00:54:56.800
springs the ability of these networks to

00:54:55.000 --> 00:54:58.159
do basically magical things.

00:54:56.800 --> 00:55:00.000
Right? That's where the complexity comes

00:54:58.159 --> 00:55:02.519
from. That's where the magic comes from.

00:55:00.000 --> 00:55:03.559
So, and here in this case, the number of

00:55:02.519 --> 00:55:05.960
variables hasn't even changed. It's

00:55:03.559 --> 00:55:07.759
still only two.

00:55:05.960 --> 00:55:10.199
But we can go from the two inputs to the

00:55:07.760 --> 00:55:11.800
one output in very complicated ways as

00:55:10.199 --> 00:55:13.159
long as we know how to train these

00:55:11.800 --> 00:55:13.960
networks the right way. That's sort of

00:55:13.159 --> 00:55:15.799
the

00:55:13.960 --> 00:55:16.920
secret sauce which we'll spend a lot

00:55:15.800 --> 00:55:19.039
of time on.

00:55:16.920 --> 00:55:20.920
So, yeah. To summarize, this is what we

00:55:19.039 --> 00:55:22.239
have. It's a deep neural network.

00:55:20.920 --> 00:55:23.639
By the way, this kind of network where

00:55:22.239 --> 00:55:25.599
things just flow from left to right is

00:55:23.639 --> 00:55:27.239
called a feedforward

00:55:25.599 --> 00:55:28.679
neural network

00:55:27.239 --> 00:55:30.599
in contrast to some other kinds of

00:55:28.679 --> 00:55:31.919
networks called recurrent networks which

00:55:30.599 --> 00:55:34.639
you won't get to

00:55:31.920 --> 00:55:36.880
in this class because

00:55:34.639 --> 00:55:38.799
transformers have actually proven to be

00:55:36.880 --> 00:55:40.680
much more capable than recurrent

00:55:38.800 --> 00:55:42.920
networks and those have become the norm,

00:55:40.679 --> 00:55:44.799
so we'll just focus on those instead. Um

00:55:42.920 --> 00:55:46.519
and so, this arrangement of neurons into

00:55:44.800 --> 00:55:48.240
layers and activation functions and all

00:55:46.519 --> 00:55:50.039
that stuff, is called the architecture

00:55:48.239 --> 00:55:51.639
of the neural network. And as you will

00:55:50.039 --> 00:55:53.637
see later on, the transformer, the

00:55:51.639 --> 00:55:54.920
famous transformer network

00:55:53.637 --> 00:55:57.239
[clears throat] is just an example of a

00:55:54.920 --> 00:55:59.280
particular neural network architecture

00:55:57.239 --> 00:56:01.479
much like convolutional neural networks

00:55:59.280 --> 00:56:03.280
which we'll get to next week for computer

00:56:01.480 --> 00:56:05.719
vision, which are another example of a

00:56:03.280 --> 00:56:07.519
particular network architecture.

00:56:05.719 --> 00:56:08.959
So, we will focus on transformers. They

00:56:07.519 --> 00:56:10.559
are a particular kind of architecture.

00:56:08.960 --> 00:56:11.760
All right. So, in summary, this is what

00:56:10.559 --> 00:56:13.239
we have.

00:56:11.760 --> 00:56:14.400
You know, you get to choose the hidden

00:56:13.239 --> 00:56:15.839
layers, the neurons, activation

00:56:14.400 --> 00:56:17.280
functions, stuff like that.

00:56:15.840 --> 00:56:19.200
The inputs and outputs are what you have

00:56:17.280 --> 00:56:22.160
to work with and so, we will actually

00:56:19.199 --> 00:56:23.119
take this idea and then use it

00:56:22.159 --> 00:56:25.920
to

00:56:23.119 --> 00:56:28.319
to actually solve a problem from start

00:56:25.920 --> 00:56:29.559
to finish on Wednesday. So, I think I'm

00:56:28.320 --> 00:56:32.284
done. I give you three minutes back of

00:56:29.559 --> 00:56:34.304
your day. Thank you.

00:56:32.284 --> 00:56:34.304
>> [applause]
