WEBVTT

00:00:05.509 --> 00:00:07.229
This is CS231n.

00:00:07.230 --> 00:00:11.000
And I'm Professor Fei-Fei
Li from computer science

00:00:11.000 --> 00:00:11.820
department.

00:00:11.820 --> 00:00:14.960
I will be co-teaching
this quarter

00:00:14.960 --> 00:00:19.320
with Professor Ehsan Adeli
and my graduate student Zane.

00:00:19.320 --> 00:00:23.300
So you'll meet them as well
as our wonderful TA team

00:00:23.300 --> 00:00:24.900
that you will meet later.

00:00:24.899 --> 00:00:28.250
So I just want to get started.

00:00:28.250 --> 00:00:32.030
So this is what
excites me, that AI

00:00:32.030 --> 00:00:35.520
has become such an
interdisciplinary field,

00:00:35.520 --> 00:00:38.520
that what you're going
to learn in this class,

00:00:38.520 --> 00:00:40.530
of course, is very technical.

00:00:40.530 --> 00:00:42.750
It's about computer
vision and deep learning.

00:00:42.750 --> 00:00:44.600
But I really do
hope that you take

00:00:44.600 --> 00:00:48.590
it to whichever discipline you
work in and are passionate about

00:00:48.590 --> 00:00:49.920
and apply it.

00:00:49.920 --> 00:00:52.800
So we hear a lot
about the field of AI.

00:00:52.799 --> 00:00:56.029
So how do we position
computer vision

00:00:56.030 --> 00:00:57.920
and the scope of this class?

00:00:57.920 --> 00:01:02.250
If you consider AI
as this big bubble,

00:01:02.250 --> 00:01:07.349
computer vision is very
much an integral part of AI.

00:01:07.349 --> 00:01:10.829
Some of you have heard me
saying that not only vision is

00:01:10.829 --> 00:01:13.959
part of intelligence, it's a
cornerstone to intelligence.

00:01:13.959 --> 00:01:16.529
Unlocking the mystery
of visual intelligence

00:01:16.530 --> 00:01:20.079
is unlocking the
mystery of intelligence.

00:01:20.079 --> 00:01:25.250
But one of the most important
tools, mathematical tools,

00:01:25.250 --> 00:01:29.129
to solving AI is machine
learning, or what some people

00:01:29.129 --> 00:01:31.060
call statistical
machine learning.

00:01:31.060 --> 00:01:36.299
And this is exactly what
we will be talking about.

00:01:36.299 --> 00:01:38.670
Within the field of
machine learning,

00:01:38.670 --> 00:01:42.390
in the past 10 plus years, we
have seen a major revolution

00:01:42.390 --> 00:01:43.570
called deep learning.

00:01:43.569 --> 00:01:46.929
And I'll explain a little
bit of what deep learning is.

00:01:46.930 --> 00:01:50.640
Deep learning is a set
of algorithmic techniques

00:01:50.640 --> 00:01:54.120
that is built around
a family of algorithms

00:01:54.120 --> 00:01:55.540
called neural networks.

00:01:55.540 --> 00:02:02.040
And so if you ask me to pinpoint
the scope of this class,

00:02:02.040 --> 00:02:05.250
we'll not be able to cover the
entirety of computer vision.

00:02:05.250 --> 00:02:07.849
We'll not be able to cover
the entirety of machine

00:02:07.849 --> 00:02:09.030
learning or deep learning.

00:02:09.030 --> 00:02:12.409
But we're going to cover the
core intersection of these two

00:02:12.409 --> 00:02:13.289
fields.

00:02:13.289 --> 00:02:18.299
And of course, just
like the entirety of AI,

00:02:18.300 --> 00:02:20.900
computer vision is
becoming more and more

00:02:20.900 --> 00:02:23.340
an interdisciplinary field.

00:02:23.340 --> 00:02:26.060
A lot of the techniques
we use as well as

00:02:26.060 --> 00:02:28.310
the problems we
work with intersect

00:02:28.310 --> 00:02:31.280
with many different other
fields, like natural language

00:02:31.280 --> 00:02:37.849
processing, speech recognition,
robotics, and AI as a whole

00:02:37.849 --> 00:02:41.340
is a field that intersects
with mathematics, neuroscience,

00:02:41.340 --> 00:02:44.159
computer science,
psychology, physics, biology,

00:02:44.159 --> 00:02:46.430
and many application
areas from medicine

00:02:46.430 --> 00:02:49.950
to law to education
and business and so on.

00:02:49.949 --> 00:02:55.169
So what you will get for this
lecture, our first lecture,

00:02:55.169 --> 00:02:58.259
is I'll give a very
brief history of computer

00:02:58.259 --> 00:02:59.889
vision and deep learning.

00:02:59.889 --> 00:03:05.309
And then Professor Adeli will go
over the overview of this course

00:03:05.310 --> 00:03:08.129
and lay the groundwork of
how this course is set up

00:03:08.129 --> 00:03:11.669
and what our expectations are.

00:03:11.669 --> 00:03:19.139
So the history of vision did
not begin when you were born

00:03:19.139 --> 00:03:20.979
or humanity was born.

00:03:20.979 --> 00:03:25.719
The history of vision began
540 million years ago.

00:03:25.719 --> 00:03:29.919
You might ask, what happened
540 million years ago?

00:03:29.919 --> 00:03:34.229
Why are we pinpointing a
relatively specific date or year

00:03:34.229 --> 00:03:35.500
in evolution?

00:03:35.500 --> 00:03:37.830
Well, it's because a
lot of fossil studies

00:03:37.830 --> 00:03:43.380
have shown us that there is a
mystery period called Cambrian

00:03:43.379 --> 00:03:45.549
explosion.

00:03:45.550 --> 00:03:49.200
The fossil studies showed that, within
about 10 million years

00:03:49.199 --> 00:03:52.439
during that time, which is
a very short period of time

00:03:52.439 --> 00:03:53.810
for evolution,

00:03:53.810 --> 00:03:58.340
we see the explosion
of animal species

00:03:58.340 --> 00:04:02.819
in the fossil study, which means
before the Cambrian explosion,

00:04:02.819 --> 00:04:05.219
life on Earth was pretty chill.

00:04:05.219 --> 00:04:06.930
It was actually in the water.

00:04:06.930 --> 00:04:10.319
There's no animals
on the land yet.

00:04:10.319 --> 00:04:13.769
And animals just float around.

00:04:13.770 --> 00:04:18.240
So what caused this explosion
in animal speciation?

00:04:18.240 --> 00:04:21.620
There were many theories, from
climate to chemical composition

00:04:21.620 --> 00:04:23.189
of the ocean water.

00:04:23.189 --> 00:04:29.360
But one of the most compelling
theories was the onset of eyes.

00:04:29.360 --> 00:04:32.569
The first animals,
the trilobites,

00:04:32.569 --> 00:04:34.949
gained photosensitive cells.

00:04:34.949 --> 00:04:37.310
So the eyes we
were talking about

00:04:37.310 --> 00:04:41.009
were not sophisticated lenses
and retinas and nerve cells.

00:04:41.009 --> 00:04:44.519
It was literally a
very simple pinhole.

00:04:44.519 --> 00:04:47.339
And that pinhole
collected light.

00:04:47.339 --> 00:04:53.549
Once you can collect light,
life is completely different.

00:04:53.550 --> 00:04:57.660
Without sensors,
life is metabolism.

00:04:57.660 --> 00:04:59.410
It's very passive.

00:04:59.410 --> 00:05:01.180
It is just metabolism.

00:05:01.180 --> 00:05:02.530
And you come and go.

00:05:02.529 --> 00:05:06.689
With sensors, you become an
integral part of the environment

00:05:06.689 --> 00:05:08.980
that you might want to change.

00:05:08.980 --> 00:05:11.920
You might want to
actually survive in it.

00:05:11.920 --> 00:05:16.170
Some animals or plants
become your dinner.

00:05:16.170 --> 00:05:18.069
And you become
someone else's dinner.

00:05:18.069 --> 00:05:24.120
So evolutionary forces
drive intelligence

00:05:24.120 --> 00:05:27.579
to evolve because of
the onset of sensors,

00:05:27.579 --> 00:05:31.529
because of the onset of
vision, along with haptics

00:05:31.529 --> 00:05:33.219
or tactile sensing.

00:05:33.220 --> 00:05:38.080
Those are the oldest
sensors for animals.

00:05:38.079 --> 00:05:41.879
So that entire course
of 540 million years

00:05:41.879 --> 00:05:46.509
of evolution of vision is the
evolution of intelligence.

00:05:46.509 --> 00:05:49.849
Vision as one of the
primary senses of animals

00:05:49.850 --> 00:05:54.260
drove the development of
nervous system, the development

00:05:54.259 --> 00:05:55.319
of intelligence.

00:05:55.319 --> 00:05:59.449
Almost all animals on
Earth today we know of

00:05:59.449 --> 00:06:03.420
have vision or use vision as
one of the primary senses.

00:06:03.420 --> 00:06:06.449
Humans are especially
visual animals.

00:06:06.449 --> 00:06:08.810
More than half of
our cortical cells

00:06:08.810 --> 00:06:11.819
are involved in
visual processing.

00:06:11.819 --> 00:06:15.930
And we have a very complex
and convoluted visual system.

00:06:15.930 --> 00:06:19.800
So this is what excites me
to enter the field of vision.

00:06:19.800 --> 00:06:21.870
And I hope it excites you.

00:06:21.870 --> 00:06:30.620
So now, let's just fast
forward from Cambrian explosion

00:06:30.620 --> 00:06:33.470
to actually human civilization.

00:06:33.470 --> 00:06:35.850
Humans do innovate.

00:06:35.850 --> 00:06:37.610
And not only do we see.

00:06:37.610 --> 00:06:40.050
We want to build
machines that see.

00:06:40.050 --> 00:06:44.850
So here's a couple of
drawings by, of course,

00:06:44.850 --> 00:06:48.540
Leonardo da Vinci, who
was just forever curious

00:06:48.540 --> 00:06:49.540
about everything.

00:06:49.540 --> 00:06:56.740
He studied the camera obscura and
how to make seeing machines.

00:06:56.740 --> 00:07:01.829
In fact, even way before
him, in ancient Greece

00:07:01.829 --> 00:07:05.159
and in ancient China,
we have seen documents

00:07:05.160 --> 00:07:09.990
about thinkers,
philosophers thinking

00:07:09.990 --> 00:07:15.600
about how to project
objects through pinholes

00:07:15.600 --> 00:07:19.330
and to create images of objects.

00:07:19.329 --> 00:07:22.750
And of course, in
our modern life,

00:07:22.750 --> 00:07:25.990
cameras have truly exploded.

00:07:25.990 --> 00:07:30.780
But cameras are not enough for
seeing, just like eyes are not

00:07:30.779 --> 00:07:31.719
enough for seeing.

00:07:31.720 --> 00:07:33.010
These are apparatus.

00:07:33.009 --> 00:07:35.949
We need to understand how
visual intelligence happens.

00:07:35.949 --> 00:07:38.589
And that's really the
crux of this course.

00:07:38.589 --> 00:07:45.669
So let's just talk a little bit
of the history that brought us

00:07:45.670 --> 00:07:49.790
to this intersection of deep
learning and computer vision.

00:07:49.790 --> 00:07:57.160
So let me go back to the 1950s.

00:07:57.160 --> 00:08:03.370
The 1950s-- a set of very
critically important experiments

00:08:03.370 --> 00:08:05.090
happened in neuroscience.

00:08:05.089 --> 00:08:08.019
And that was the study
of the visual pathways

00:08:08.019 --> 00:08:10.629
of mammals, especially
the seminal work

00:08:10.629 --> 00:08:11.990
by Hubel and Wiesel.

00:08:11.990 --> 00:08:18.410
They put electrodes
into live, anesthetized cats.

00:08:18.410 --> 00:08:21.220
And then they studied
the receptive field

00:08:21.220 --> 00:08:25.760
of neurons that are in
the primary visual cortex.

00:08:25.759 --> 00:08:28.909
What they have learned,
to their surprise,

00:08:28.910 --> 00:08:31.070
are two very important things.

00:08:31.069 --> 00:08:38.740
One is that neurons that
are responsible for seeing

00:08:38.740 --> 00:08:41.860
in the primary
visual cortex have

00:08:41.860 --> 00:08:44.820
their own individual
receptive fields.

00:08:44.820 --> 00:08:48.320
A receptive field means
that for every neuron,

00:08:48.320 --> 00:08:52.590
there is a part of
space it actually sees.

00:08:52.590 --> 00:08:54.870
It's not all the space.

00:08:54.870 --> 00:08:55.799
It's not very big.

00:08:55.799 --> 00:09:00.779
It tends to be a very
confined patch of the space.

00:09:00.779 --> 00:09:06.629
And within that space, it
sees specialized patterns,

00:09:06.629 --> 00:09:12.320
simple patterns, when you're
measuring from the early part

00:09:12.320 --> 00:09:15.470
of the visual pathway.

00:09:15.470 --> 00:09:18.840
And by and large, in the
primary visual cortex,

00:09:18.840 --> 00:09:23.120
which is around here in the back
of the head, not near your eyes,

00:09:23.120 --> 00:09:27.210
it's oriented edges or
moving oriented edges.

00:09:27.210 --> 00:09:28.970
So some neurons will

00:09:28.970 --> 00:09:30.330
be seeing an edge like this.

00:09:30.330 --> 00:09:32.970
Some will be seeing an
edge like this or this.

00:09:32.970 --> 00:09:39.029
And that's how the computation
in the brain begins.

00:09:39.029 --> 00:09:42.370
The second thing they learned
is that visual pathway

00:09:42.370 --> 00:09:43.519
is hierarchical.

00:09:43.519 --> 00:09:47.149
As you move along
the visual pathway,

00:09:47.149 --> 00:09:50.629
the neurons feed
into other neurons.

00:09:50.629 --> 00:09:54.730
And the neurons in
the higher layers

00:09:54.730 --> 00:09:57.460
or deeper layers of
the visual hierarchy

00:09:57.460 --> 00:09:59.990
have more complex
receptive fields.

00:09:59.990 --> 00:10:04.009
So if you begin
with oriented edges,

00:10:04.009 --> 00:10:06.889
you might feed into
a corner receptor.

00:10:06.889 --> 00:10:10.399
You might feed into
an object receptor.

00:10:10.399 --> 00:10:12.199
I'm overly simplifying.

00:10:12.200 --> 00:10:16.360
But that's the concept, is that
neurons feed into each other.

00:10:16.360 --> 00:10:23.360
And then they create this
big network of computation.

00:10:23.360 --> 00:10:25.720
Of course, most of
you sitting here

00:10:25.720 --> 00:10:27.850
are already thinking
the way I've

00:10:27.850 --> 00:10:30.670
been describing this will
have a profound impact

00:10:30.669 --> 00:10:36.019
on the neural network
modeling of visual algorithms.

00:10:36.019 --> 00:10:37.069
Let's keep going.

00:10:37.070 --> 00:10:40.260
That's year 1959.

00:10:40.259 --> 00:10:43.500
It's very early
studies of seeing.

00:10:43.500 --> 00:10:48.289
By the way, about
30 years later--

00:10:48.289 --> 00:10:50.969
maybe not quite-- 20
something years later,

00:10:50.970 --> 00:10:54.769
Hubel and Wiesel won the
Nobel Prize in medicine

00:10:54.769 --> 00:10:59.840
for studying this,
uncovering the principles

00:10:59.840 --> 00:11:01.790
of visual processing.

00:11:01.789 --> 00:11:05.779
Another milestone in the early
history of computer vision

00:11:05.779 --> 00:11:09.179
was the first PhD thesis
of computer vision.

00:11:09.179 --> 00:11:13.039
Most people credit
Larry Roberts, in 1963, with

00:11:13.039 --> 00:11:17.879
writing the first PhD
thesis just studying shape.

00:11:17.879 --> 00:11:21.350
And this is a very, very
caricature representation

00:11:21.350 --> 00:11:22.259
of the world.

00:11:22.259 --> 00:11:26.090
And the idea is that, can
we take a shape like this

00:11:26.090 --> 00:11:30.560
and understand that the surfaces
and the corners and features

00:11:30.559 --> 00:11:32.209
of this shape?

00:11:32.210 --> 00:11:34.230
It's intuitive that humans do.

00:11:34.230 --> 00:11:39.350
So an entire PhD thesis
is devoted to this.

00:11:39.350 --> 00:11:44.980
And that's the beginning
of computer vision.

00:11:44.980 --> 00:11:52.870
And around that time, in
1966, an MIT professor

00:11:52.870 --> 00:11:56.710
created a summer
project in MIT and asked

00:11:56.710 --> 00:12:03.830
to hire a few undergrads, very
smart ones, to study vision.

00:12:03.830 --> 00:12:07.120
And the goal was pretty
much solve computer vision

00:12:07.120 --> 00:12:09.399
or solve vision for one summer.

00:12:09.399 --> 00:12:13.279
Of course, just like the
rest of the history of AI,

00:12:13.279 --> 00:12:18.309
we tend to be overoptimistic
of what we can

00:12:18.309 --> 00:12:20.329
do in a short period of time.

00:12:20.330 --> 00:12:24.530
So vision did not get
solved in that summer.

00:12:24.529 --> 00:12:29.799
In fact, it has blossomed into
an incredible computer science

00:12:29.799 --> 00:12:30.709
field.

00:12:30.710 --> 00:12:33.830
If you go to our annual
conferences every year now,

00:12:33.830 --> 00:12:36.420
it has more than 10,000
people attending.

00:12:36.419 --> 00:12:43.879
But the 1960s, between
Larry Roberts' PhD thesis as well

00:12:43.879 --> 00:12:48.500
as this kind of project, is what
we in our field consider

00:12:48.500 --> 00:12:51.830
the beginning of the
field of computer vision.

00:12:51.830 --> 00:12:55.620
A seminal book was written
in the 1970s by David Marr,

00:12:55.620 --> 00:12:58.470
who unfortunately
died too early.

00:12:58.470 --> 00:13:01.940
He wanted to study vision
systematically and start

00:13:01.940 --> 00:13:05.790
to consider how visual
processing happens.

00:13:05.789 --> 00:13:07.639
Even though this
is not explicitly

00:13:07.639 --> 00:13:10.309
stated, there is
a lot of inspiration

00:13:10.309 --> 00:13:12.929
from neuroscience and
cognitive science.

00:13:12.929 --> 00:13:20.069
He was thinking about, if
you take an input image,

00:13:20.070 --> 00:13:23.580
how do we visually process
and understand the image?

00:13:23.580 --> 00:13:28.730
Maybe the first layer is more
like edges, just like we saw.

00:13:28.730 --> 00:13:30.629
He calls it primal sketch.

00:13:30.629 --> 00:13:37.889
And then there is a 2 and 1/2 D
sketch which separates different

00:13:37.889 --> 00:13:42.909
depth of the objects
in the image.

00:13:42.909 --> 00:13:45.059
So the ball is the
foreground object.

00:13:45.059 --> 00:13:47.859
And then the grass here--

00:13:47.860 --> 00:13:48.820
oh, no, not grass.

00:13:48.820 --> 00:13:51.520
The ground here
is the background.

00:13:51.519 --> 00:13:53.919
So he does these 2
and 1/2 D sketch.

00:13:53.919 --> 00:14:01.439
And then, finally, David Marr
believes the grand holy grail

00:14:01.440 --> 00:14:06.660
victory of solving vision is
to know the entire full 3D

00:14:06.659 --> 00:14:07.959
representation.

00:14:07.960 --> 00:14:12.879
And that is actually the
hardest thing of vision.

00:14:12.879 --> 00:14:15.129
Let me digress for 20 seconds.

00:14:15.129 --> 00:14:20.950
Because if you think about
vision for all animals,

00:14:20.950 --> 00:14:23.350
it's an ill posed problem.

00:14:23.350 --> 00:14:27.389
Since the early trilobites
who collected light

00:14:27.389 --> 00:14:30.659
from underwater, light--

00:14:30.659 --> 00:14:35.809
the world through photons
is projected on something

00:14:35.809 --> 00:14:38.069
on a surface more or less 2D.

00:14:38.070 --> 00:14:40.879
At that time, it was just,
I don't know, some patch

00:14:40.879 --> 00:14:42.059
in the animal.

00:14:42.059 --> 00:14:45.469
But right now, for
us, it's a retina.

00:14:45.470 --> 00:14:47.910
But the actual world is 3D.

00:14:47.909 --> 00:14:55.610
So recovering 3D information,
the entire 3D world,

00:14:55.610 --> 00:15:00.230
from 2D images is the
fundamental problem nature had

00:15:00.230 --> 00:15:02.730
to solve and computer
vision has to solve.

00:15:02.730 --> 00:15:05.840
And mathematically, that's
an ill-posed problem.

00:15:05.840 --> 00:15:07.940
So what did we later do?

00:15:07.940 --> 00:15:09.745
Anybody have a wild guess?

00:15:14.899 --> 00:15:17.299
[INAUDIBLE]

00:15:17.299 --> 00:15:18.799
Yes.

00:15:18.799 --> 00:15:22.199
The trick that nature did is
develop multiple eyes, mostly

00:15:22.200 --> 00:15:22.700
two.

00:15:22.700 --> 00:15:25.259
Some animals have more than two.

00:15:25.259 --> 00:15:28.110
And then you
triangulate information.

00:15:28.110 --> 00:15:29.740
But two eyes are not enough.

00:15:29.740 --> 00:15:33.250
You actually have to understand
correspondences and all that.

00:15:33.250 --> 00:15:35.049
We'll touch on some
of these topics.

00:15:35.049 --> 00:15:38.879
But there are other computer
vision classes that Stanford

00:15:38.879 --> 00:15:42.090
offers that also specifically
talk about 3D vision.

00:15:42.090 --> 00:15:45.660
But the point is it's
a very hard problem.

00:15:45.659 --> 00:15:47.589
And we have to solve it.

00:15:47.590 --> 00:15:48.790
Nature has solved it.

00:15:48.789 --> 00:15:53.110
Humans have solved it but
not to extreme precision.

00:15:53.110 --> 00:15:55.750
In fact, humans are
not that precise.

00:15:55.750 --> 00:15:58.509
I roughly know the 3D shapes.

00:15:58.509 --> 00:16:03.429
But I don't have geometric
precision of all the shapes.

00:16:03.429 --> 00:16:06.779
So that's one thing to
consider and appreciate

00:16:06.779 --> 00:16:08.620
how hard this problem is.

00:16:08.620 --> 00:16:12.419
Another thing that is very
different for computer vision

00:16:12.419 --> 00:16:15.479
and language is
actually something

00:16:15.480 --> 00:16:17.370
philosophically subtle.

00:16:17.370 --> 00:16:20.169
Language doesn't
exist in nature.

00:16:20.169 --> 00:16:24.339
You cannot point to something
and say there is language.

00:16:24.340 --> 00:16:30.090
Language is a purely
generated thing.

00:16:30.090 --> 00:16:31.860
I don't even know
what word to use.

00:16:31.860 --> 00:16:35.460
It comes through our brain.

00:16:35.460 --> 00:16:37.290
It's generated.

00:16:37.289 --> 00:16:38.579
It's 1D.

00:16:38.580 --> 00:16:40.310
It's sequential.

00:16:40.309 --> 00:16:44.449
So this actually has profound
implications in the latest

00:16:44.450 --> 00:16:47.509
wave of GenAI algorithms.

00:16:47.509 --> 00:16:50.419
This is why these
LLMs, which are outside

00:16:50.419 --> 00:16:54.889
of the scope of this class,
are so powerful, because we

00:16:54.889 --> 00:16:56.759
can model language that way.

00:16:56.759 --> 00:16:58.649
But vision is not generated.

00:16:58.649 --> 00:17:01.669
There is actually
a physical world

00:17:01.669 --> 00:17:05.838
out there respecting the laws
of physics and materials and all

00:17:05.838 --> 00:17:06.509
that.

00:17:06.509 --> 00:17:09.420
So vision has very
different tasks.

00:17:09.420 --> 00:17:14.089
So I just want you to appreciate
the difference between language

00:17:14.088 --> 00:17:17.450
and vision and actually,
frankly, appreciate nature,

00:17:17.450 --> 00:17:19.880
how it solved this problem.

00:17:19.880 --> 00:17:21.060
Let's keep going.

00:17:21.059 --> 00:17:28.149
1970s, the early pioneers of
computer vision, without data,

00:17:28.150 --> 00:17:32.320
without really
powerful computers,

00:17:32.319 --> 00:17:36.970
without the mathematical
advances we have seen today,

00:17:36.970 --> 00:17:40.289
are already beginning to solve
some of the harder problems

00:17:40.289 --> 00:17:43.779
of computer vision-- for
example, recognition of objects.

00:17:43.779 --> 00:17:48.119
Here at Stanford, one
of the pioneering works

00:17:48.119 --> 00:17:52.139
is called generalized cylinders
by Rodney Brooks and Tom

00:17:52.140 --> 00:17:52.900
Binford.

00:17:52.900 --> 00:17:58.650
And ironically, Rodney Brooks
today is on campus, actually,

00:17:58.650 --> 00:18:03.519
over there giving a talk
at the robotics conference.

00:18:03.519 --> 00:18:05.759
And he went on to become
one of the greatest

00:18:05.759 --> 00:18:10.079
roboticists of our time
and was a founder of iRobot, which made the Roomba

00:18:10.079 --> 00:18:11.769
and many other robots.

00:18:11.769 --> 00:18:16.529
And then not very far from us
in another part of Palo Alto,

00:18:16.529 --> 00:18:24.759
researchers also worked on
these compositional models

00:18:24.759 --> 00:18:27.859
of human body and objects.

00:18:27.859 --> 00:18:34.250
And then in the 1980s, digital
photos start to appear.

00:18:34.250 --> 00:18:37.220
At least photos start to appear.

00:18:37.220 --> 00:18:39.680
And people can digitize
that a little bit.

00:18:39.680 --> 00:18:43.940
And then there are some
great work in edge detection.

00:18:43.940 --> 00:18:48.190
You look at all this
and probably feel

00:18:48.190 --> 00:18:50.900
a sense of disappointment.

00:18:50.900 --> 00:18:55.540
I mean, it's kind of trivial
to get some sketches and edges.

00:18:55.539 --> 00:18:58.460
And it's not really
going anywhere.

00:18:58.460 --> 00:19:02.059
That's how computer
vision worked at that time.

00:19:02.059 --> 00:19:03.980
And in fact, you're
not so wrong.

00:19:03.980 --> 00:19:07.660
That was around the
time before many of you

00:19:07.660 --> 00:19:10.279
were born that we
entered AI winter.

00:19:10.279 --> 00:19:15.250
The field entered AI winter
because the enthusiasm

00:19:15.250 --> 00:19:18.529
and, hence, funding for AI
research has really dwindled.

00:19:18.529 --> 00:19:20.509
A lot of things didn't deliver.

00:19:20.509 --> 00:19:22.269
Computer vision didn't deliver.

00:19:22.269 --> 00:19:24.460
Expert systems didn't deliver.

00:19:24.460 --> 00:19:26.519
Robotics didn't deliver.

00:19:26.519 --> 00:19:32.309
But under the hood of this
winter, a lot of research

00:19:32.309 --> 00:19:34.529
started to grow in
different fields,

00:19:34.529 --> 00:19:37.509
like computer vision,
NLP, robotics.

00:19:37.509 --> 00:19:40.379
So let's also look at
another strand of research

00:19:40.380 --> 00:19:43.290
that had a profound
implication for computer vision:

00:19:43.289 --> 00:19:45.269
cognitive science
and neuroscience

00:19:45.269 --> 00:19:46.960
continued to blossom.

00:19:46.960 --> 00:19:49.319
And what is really
important, especially

00:19:49.319 --> 00:19:52.480
for the field of computer
vision, is that cognitive science

00:19:52.480 --> 00:19:55.799
and neuroscience started
to point to the North Star

00:19:55.799 --> 00:19:57.490
problems we should work on.

00:19:57.490 --> 00:20:00.029
For example,
psychologists have told us

00:20:00.029 --> 00:20:02.619
there's something special
about seeing nature,

00:20:02.619 --> 00:20:06.359
seeing real world.

00:20:06.359 --> 00:20:09.209
This is a study by
Irv Biederman, who

00:20:09.210 --> 00:20:13.980
shows that the detection
of bicycles on two images

00:20:13.980 --> 00:20:18.819
differ depending on if the
images are scrambled or not.

00:20:18.819 --> 00:20:19.569
Think about it.

00:20:19.569 --> 00:20:22.089
From a photon point of
view, these two bicycles

00:20:22.089 --> 00:20:26.629
land in the same
location on your retina.

00:20:26.630 --> 00:20:28.720
But somehow the
rest of the image

00:20:28.720 --> 00:20:39.079
impacts how the viewer
sees the target objects.

00:20:39.079 --> 00:20:41.439
So there is something
telling us that seeing

00:20:41.440 --> 00:20:44.170
the entire forest
or the entire world

00:20:44.170 --> 00:20:46.730
impacts the way we see objects.

00:20:46.730 --> 00:20:49.819
It also tells us visual
processing is very fast.

00:20:49.819 --> 00:20:55.339
Here's another direct measure
of how fast we detect objects.

00:20:55.339 --> 00:21:00.669
This is an early 1970s
experiment showing people

00:21:00.670 --> 00:21:03.061
a video.

00:21:03.060 --> 00:21:07.629
And the test for the subject
is to detect the human

00:21:07.630 --> 00:21:09.170
in one of the frames.

00:21:09.170 --> 00:21:11.920
I suppose every one of you
have seen that human in one

00:21:11.920 --> 00:21:13.250
of the frames.

00:21:13.250 --> 00:21:15.519
But think about how
remarkable your eyes are

00:21:15.519 --> 00:21:19.079
or your brain is because
you've never seen this video.

00:21:19.079 --> 00:21:22.609
I didn't tell you in which frame
the target object would

00:21:22.609 --> 00:21:23.159
appear.

00:21:23.160 --> 00:21:24.980
I did not tell you
what the target

00:21:24.980 --> 00:21:28.860
object will look like, where it
is, its gestures, and all that.

00:21:28.859 --> 00:21:31.689
Yet, you have no problem
detecting the humans.

00:21:34.569 --> 00:21:37.669
And on top of that,
these frames are

00:21:37.670 --> 00:21:39.860
played at 10 Hertz,
which means you're

00:21:39.859 --> 00:21:43.799
seeing every frame for
only 100 milliseconds.

00:21:43.799 --> 00:21:47.879
And this is how remarkable
our visual system is.

00:21:47.880 --> 00:21:53.700
In fact, Simon Thorpe, another
cognitive neuroscientist,

00:21:53.700 --> 00:21:55.410
has measured the speed.

00:21:55.410 --> 00:21:58.430
If you hook people
up in EEG caps

00:21:58.430 --> 00:22:01.769
and show them complex
natural scenes

00:22:01.769 --> 00:22:05.869
and ask human subjects
to categorize things

00:22:05.869 --> 00:22:07.969
with animals

00:22:07.970 --> 00:22:10.259
versus things without animals--

00:22:10.259 --> 00:22:11.309
hundreds of them.

00:22:11.309 --> 00:22:13.289
And then you measure
the brain wave.

00:22:13.289 --> 00:22:18.909
It turned out, after 150
milliseconds of seeing a photo,

00:22:18.910 --> 00:22:22.540
your brain already has
a differential signal

00:22:22.539 --> 00:22:24.019
that categorizes.

00:22:24.019 --> 00:22:25.990
You might not be so impressed.

00:22:25.990 --> 00:22:29.870
Because compared to today's
GPUs and modern chips,

00:22:29.869 --> 00:22:34.549
150 milliseconds is really
orders of magnitude slower.

00:22:34.549 --> 00:22:37.210
But you got to admire.

00:22:37.210 --> 00:22:40.779
Our wetware, our
brain, our neurons

00:22:40.779 --> 00:22:43.369
don't work as fast
as transistors.

00:22:43.369 --> 00:22:46.609
150 milliseconds is
actually really fast.

00:22:46.609 --> 00:22:49.309
It's only a few
hops in the brain

00:22:49.309 --> 00:22:51.519
in terms of neural processing.

00:22:51.519 --> 00:22:53.950
So yet, again,
this is telling us

00:22:53.950 --> 00:22:59.990
humans are really good at seeing
objects and categorizing them.

00:22:59.990 --> 00:23:02.559
In fact, not only are we
so good at seeing objects

00:23:02.559 --> 00:23:05.829
and categorizing them, we
even develop specialized brain

00:23:05.829 --> 00:23:10.059
areas that have expert
ability in recognizing

00:23:10.059 --> 00:23:13.099
faces or places or body parts.

00:23:13.099 --> 00:23:19.039
And these are discoveries by MIT
neurophysiologist in the 1990s

00:23:19.039 --> 00:23:21.119
and early 21st century.

00:23:21.119 --> 00:23:26.089
So all these studies tell
us, well, we should not just

00:23:26.089 --> 00:23:30.019
be studying these kinds
of caricature shapes

00:23:30.019 --> 00:23:33.660
or the sketches of images.

00:23:33.660 --> 00:23:38.750
We really should go after
important fundamental problems

00:23:38.750 --> 00:23:40.769
that drive visual intelligence.

00:23:40.769 --> 00:23:43.339
And one of those
problems that everything

00:23:43.339 --> 00:23:46.099
has been telling us is
object recognition--

00:23:46.099 --> 00:23:49.829
object recognition
in natural settings.

00:23:49.829 --> 00:23:52.949
There is a lot of objects
out there in the world.

00:23:52.950 --> 00:23:57.740
And studying this
is going to be part

00:23:57.740 --> 00:24:00.299
of the unlocking of
visual intelligence.

00:24:00.299 --> 00:24:01.549
And that's what we did.

00:24:01.549 --> 00:24:04.669
As a field, we
started by looking

00:24:04.670 --> 00:24:08.210
at how we can separate
foreground objects

00:24:08.210 --> 00:24:09.960
from background objects.

00:24:09.960 --> 00:24:14.569
This is called recognition
by grouping in the 1990s.

00:24:14.569 --> 00:24:16.849
Keep in mind, we're
still in AI winter.

00:24:16.849 --> 00:24:20.089
But research is actually
happening and progressing.

00:24:20.089 --> 00:24:24.559
And then there are
studies of features.

00:24:24.559 --> 00:24:27.549
And some of you
might still remember

00:24:27.549 --> 00:24:29.779
SIFT features and matching.

00:24:29.779 --> 00:24:33.609
And when I enter grad school,
the most exciting thing

00:24:33.609 --> 00:24:34.789
was face detection.

00:24:34.789 --> 00:24:37.279
I remembered that first
year in my grad school,

00:24:37.279 --> 00:24:39.379
this paper was published.

00:24:39.380 --> 00:24:42.550
And five years later,
the first digital camera

00:24:42.549 --> 00:24:49.029
used this paper's algorithm and
delivered automatic face focus

00:24:49.029 --> 00:24:51.259
because of face detection.

00:24:51.259 --> 00:24:56.559
So things started to work
and taken into industry.

00:24:56.559 --> 00:25:01.190
And then around the
early 21st century,

00:25:01.190 --> 00:25:04.809
a very important thing
started to happen,

00:25:04.809 --> 00:25:06.819
is internet started to happen.

00:25:06.819 --> 00:25:12.599
When internet started to happen,
data started to proliferate.

00:25:12.599 --> 00:25:16.969
And the combination of
digital cameras and internet

00:25:16.970 --> 00:25:19.850
started to give the
field of computer vision

00:25:19.849 --> 00:25:22.049
some data to work with.

00:25:22.049 --> 00:25:26.419
So in those early days, we were
working with thousands of images

00:25:26.420 --> 00:25:30.470
or tens of thousands of images
to study the visual recognition

00:25:30.470 --> 00:25:32.880
problem or the object
recognition problem.

00:25:32.880 --> 00:25:36.350
So you've got data sets
like Pascal Visual Object

00:25:36.349 --> 00:25:40.759
Challenge or Caltech 101.

00:25:40.759 --> 00:25:43.609
I'm going to pause here.

00:25:43.609 --> 00:25:50.059
And this is where the first
thread of computer vision

00:25:50.059 --> 00:25:51.059
started to progress.

00:25:51.059 --> 00:25:54.419
And you might be wondering,
why is she pausing?

00:25:54.420 --> 00:25:57.300
Because I'm going to come back
and talk about deep learning.

00:25:57.299 --> 00:26:03.169
So while this field of
vision was progressing

00:26:03.170 --> 00:26:06.980
through neurophysiology
to computer vision,

00:26:06.980 --> 00:26:11.490
to cognitive neuroscience,
to computer vision again,

00:26:11.490 --> 00:26:14.980
a separate effort is
going on in parallel.

00:26:14.980 --> 00:26:17.380
And that eventually
became deep learning.

00:26:17.380 --> 00:26:22.870
It started from these early
studies of neural network,

00:26:22.869 --> 00:26:24.269
things like perceptron.

00:26:24.269 --> 00:26:29.799
And people like Rumelhart
started to work.

00:26:29.799 --> 00:26:32.139
And of course, Jeff
Hinton in his early days,

00:26:32.140 --> 00:26:35.400
started to work with a small
number of artificial neurons

00:26:35.400 --> 00:26:41.009
and look at how that can
process information and learn.

00:26:41.009 --> 00:26:48.269
And you've heard people like the
great minds like Marvin Minsky

00:26:48.269 --> 00:26:52.619
and his colleagues working
on different aspects

00:26:52.619 --> 00:26:54.549
of the perceptron.

00:26:54.549 --> 00:27:02.849
But Marvin Minsky did say that
perceptrons cannot learn these

00:27:02.849 --> 00:27:05.219
XOR logic functions.

00:27:05.220 --> 00:27:10.130
And that caused a little bit
of a setback in neural network.

00:27:10.130 --> 00:27:14.670
Well, things continued to
progress despite the setback.

00:27:14.670 --> 00:27:21.529
And one of the most important
work before the first inflection

00:27:21.529 --> 00:27:25.889
point is this neocognitron
work by Fukushima in Japan.

00:27:25.890 --> 00:27:31.980
Fukushima hand-designed a neural
network that looks like this.

00:27:31.980 --> 00:27:35.700
So it has about
five or six layers.

00:27:35.700 --> 00:27:41.779
And then he kind of designed
the different functions

00:27:41.779 --> 00:27:43.700
across the layers,
which you will

00:27:43.700 --> 00:27:46.910
learn more about, and which were
more or less

00:27:46.910 --> 00:27:50.850
inspired by the visual
pathway that I was describing.

00:27:50.849 --> 00:27:54.559
Remember the cat experiment
from simple receptive field

00:27:54.559 --> 00:27:56.789
to more complicated
receptive field.

00:27:56.789 --> 00:27:59.039
And he was doing that here.

00:27:59.039 --> 00:28:01.829
The early layers have
simple functions.

00:28:01.829 --> 00:28:03.269
And then the later layers

00:28:03.269 --> 00:28:05.490
have more complex functions.

00:28:05.490 --> 00:28:08.680
And for the simple ones, you
can call it convolution.

00:28:08.680 --> 00:28:10.710
Or he uses the
convolution function.

00:28:10.710 --> 00:28:13.620
And the more complex one, he
was pooling the information

00:28:13.619 --> 00:28:15.219
from the convolution layers.

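NOTE
Editor's aside (not part of the lecture audio): a minimal Python/NumPy sketch
of the two operations just described -- a convolution ("simple cells") followed
by max pooling ("complex cells"). The function names and tiny sizes are
illustrative assumptions, not Fukushima's actual implementation.
    import numpy as np
    def conv2d(img, kernel):
        # slide the kernel over the image; each output value is a local weighted sum
        k = kernel.shape[0]
        H, W = img.shape
        out = np.zeros((H - k + 1, W - k + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(img[i:i + k, j:j + k] * kernel)
        return out
    def max_pool(fmap, size=2):
        # keep only the strongest response in each local block ("pooling")
        H, W = fmap.shape
        fmap = fmap[:H - H % size, :W - W % size]
        return fmap.reshape(H // size, size, W // size, size).max(axis=(1, 3))
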
00:28:15.220 --> 00:28:19.799
So neocognitron was
really an engineering feat

00:28:19.799 --> 00:28:24.794
because every parameter
was hand-designed.

00:28:24.795 --> 00:28:26.170
There are hundreds
of parameters.

00:28:26.170 --> 00:28:29.430
He has to just meticulously
put them together

00:28:29.430 --> 00:28:32.610
so that this small
neural network can

00:28:32.609 --> 00:28:35.909
recognize digits or letters.

00:28:35.910 --> 00:28:41.130
So the real breakthrough
came around that time in 1986

00:28:41.130 --> 00:28:43.180
with a learning rule.

00:28:43.180 --> 00:28:45.580
That learning rule is
called backpropagation.

00:28:45.579 --> 00:28:47.579
It's going to be one
of our first classes

00:28:47.579 --> 00:28:52.454
to show you that
Rumelhart, Jeff Hinton--

00:28:52.454 --> 00:28:58.019
they took neural
network architecture

00:28:58.019 --> 00:29:04.259
and introduced an error
correcting objective function

00:29:04.259 --> 00:29:07.400
so that if you put in
some input and know

00:29:07.400 --> 00:29:10.280
what the correct
output is, how do you

00:29:10.279 --> 00:29:14.779
take the difference between
what the neural network outputs

00:29:14.779 --> 00:29:17.899
versus the actual
correct answer and then

00:29:17.900 --> 00:29:22.640
propagate the information
back so that you

00:29:22.640 --> 00:29:28.590
can improve the parameters
along the neural network?

00:29:28.589 --> 00:29:31.250
And that propagation
from the output

00:29:31.250 --> 00:29:33.799
back to the entire
neural network

00:29:33.799 --> 00:29:35.849
is called backpropagation.

00:29:35.849 --> 00:29:39.179
It follows some of these
basic calculus chain rules.

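NOTE
Editor's aside (not part of the lecture audio): a minimal sketch of that idea in
Python/NumPy -- compare the network's output with the correct answer, then push
the error back through the layers with the chain rule to update the weights.
The tiny sizes, the ReLU, and the squared-error setup are illustrative assumptions.
    import numpy as np
    rng = np.random.default_rng(0)
    x, y = rng.standard_normal(4), np.array([1.0])   # one input and its correct output
    W1 = rng.standard_normal((3, 4)) * 0.1           # first-layer weights
    W2 = rng.standard_normal((1, 3)) * 0.1           # second-layer weights
    for step in range(100):
        h = np.maximum(0.0, W1 @ x)                  # forward pass: hidden layer (ReLU)
        y_hat = W2 @ h                               # forward pass: network output
        error = y_hat - y                            # difference from the correct answer
        dW2 = np.outer(error, h)                     # chain rule: gradient for W2
        dh = W2.T @ error                            # propagate the error back to the hidden layer
        dW1 = np.outer(dh * (h > 0), x)              # chain rule through the ReLU: gradient for W1
        W1 -= 0.1 * dW1                              # small step that improves the parameters
        W2 -= 0.1 * dW2
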
00:29:39.180 --> 00:29:47.420
And that was a watershed moment
for neural network algorithm.

00:29:47.420 --> 00:29:50.970
And of course, we're still smack
in the middle of AI winter.

00:29:50.970 --> 00:29:54.809
All this work was happening
without public fanfare.

00:29:54.809 --> 00:29:57.929
But of course, in the
world of research,

00:29:57.930 --> 00:29:59.650
these are very
important milestones.

00:29:59.650 --> 00:30:03.720
One of the earliest
applications of this neural

00:30:03.720 --> 00:30:07.019
network with backpropagation
is Yann LeCun's convolutional

00:30:07.019 --> 00:30:10.410
neural network, made in the
1990s when he was working

00:30:10.410 --> 00:30:11.500
in the Bell Labs.

00:30:11.500 --> 00:30:15.970
And what he did is just created
a slightly bigger network,

00:30:15.970 --> 00:30:20.610
about seven layers-ish,
and made it good enough

00:30:20.609 --> 00:30:25.119
with great engineering
capability to recognize letters.

00:30:25.119 --> 00:30:28.709
And it was actually shipped
to some part of the US Postal

00:30:28.710 --> 00:30:33.579
Offices and banks to
read digits and letters.

00:30:33.579 --> 00:30:37.599
So that was an application
of early neural network.

00:30:37.599 --> 00:30:41.250
And then Jeff Hinton
and Yann LeCun

00:30:41.250 --> 00:30:43.390
continued to work
on neural network.

00:30:43.390 --> 00:30:45.720
It didn't go very far.

00:30:45.720 --> 00:30:52.049
Because despite these
improvements and tweaks

00:30:52.049 --> 00:30:57.289
of these neural network, things
more or less just stalled.

00:30:57.289 --> 00:31:00.279
They collected a big data
set of digits and letters.

00:31:00.279 --> 00:31:03.730
And digits and letters
were kind of quasi-solved

00:31:03.730 --> 00:31:05.089
in terms of recognition.

00:31:05.089 --> 00:31:08.019
But if you put the
system through the kind

00:31:08.019 --> 00:31:11.500
of digital photos that the
neuroscientists were using

00:31:11.500 --> 00:31:14.470
to recognize cats and dogs
and microwaves and chairs

00:31:14.470 --> 00:31:17.180
and flowers, it
just didn't work.

00:31:17.180 --> 00:31:22.549
And a huge part of this
problem is the lack of data.

00:31:22.549 --> 00:31:27.500
And lack of data is not
just an inconvenience.

00:31:27.500 --> 00:31:29.990
It's actually a
mathematical problem

00:31:29.990 --> 00:31:36.430
because these algorithms are
high capacity algorithms that

00:31:36.430 --> 00:31:39.850
actually need to be
driven by lots of data

00:31:39.849 --> 00:31:42.349
in order to learn to generalize.

00:31:42.349 --> 00:31:45.009
And there is some deep
mathematical principles

00:31:45.009 --> 00:31:48.379
behind these rules of
generalization and model

00:31:48.380 --> 00:31:49.210
overfitting.

00:31:49.210 --> 00:31:52.660
And data was
underappreciated, was

00:31:52.660 --> 00:31:54.840
overlooked, because
most people are just

00:31:54.839 --> 00:31:56.559
looking at these architectures.

00:31:56.559 --> 00:31:59.190
They did not
realize that data is

00:31:59.190 --> 00:32:02.070
a first-class
citizen for machine

00:32:02.069 --> 00:32:03.490
learning and deep learning.

00:32:03.490 --> 00:32:08.339
So this is part of the work
that my students and I did

00:32:08.339 --> 00:32:14.759
in the early 2000s, that we
recognize this importance

00:32:14.759 --> 00:32:15.640
of data.

00:32:15.640 --> 00:32:21.240
We hypothesized that the
whole field was actually

00:32:21.240 --> 00:32:24.519
missing this-- underappreciating
the importance of data.

00:32:24.519 --> 00:32:27.089
So we went about and
collected a huge data

00:32:27.089 --> 00:32:30.119
set called ImageNet that
has 15 million images

00:32:30.119 --> 00:32:32.259
after cleaning a billion images.

00:32:32.259 --> 00:32:38.309
And these 15 million images were
sorted across 22,000 categories

00:32:38.309 --> 00:32:39.309
of objects.

00:32:39.309 --> 00:32:43.109
We actually studied a lot of
the cognitive and psychology

00:32:43.109 --> 00:32:51.479
literature to appreciate
that 22,000 images were--

00:32:51.480 --> 00:32:54.880
sorry, 22,000 categories
were roughly in the order

00:32:54.880 --> 00:32:58.510
of the number of categories
that humans learned to recognize

00:32:58.509 --> 00:33:00.470
in the early years
of their life.

00:33:00.470 --> 00:33:02.180
And then we open
sourced this data

00:33:02.180 --> 00:33:05.860
set and created an ImageNet
challenge called the Large Scale

00:33:05.859 --> 00:33:07.579
Visual Recognition Challenge.

00:33:07.579 --> 00:33:12.699
We curated a subset of ImageNet
of a million images or a million

00:33:12.700 --> 00:33:16.870
plus images and 1,000
object classes and then ran

00:33:16.869 --> 00:33:21.429
an international object
recognition challenge for many

00:33:21.430 --> 00:33:22.039
years.

00:33:22.039 --> 00:33:26.899
And the goal is that we ask
researchers to participate.

00:33:26.900 --> 00:33:29.420
And their goal is to
create algorithms.

00:33:29.420 --> 00:33:31.430
It doesn't matter which
kind of algorithms.

00:33:31.430 --> 00:33:35.650
And we will test your
algorithm's ability to recognize

00:33:35.650 --> 00:33:40.900
photos and see if you can call
out these 1,000 object classes

00:33:40.900 --> 00:33:42.800
as correctly as possible.

00:33:42.799 --> 00:33:45.039
And here are the errors.

00:33:45.039 --> 00:33:53.069
The first year we ran
this competition,

00:33:53.069 --> 00:33:57.000
the best performing algorithm's
error was nearly 30%.

00:33:57.000 --> 00:34:00.859
And it's really pretty abysmal
because humans can perform

00:34:00.859 --> 00:34:03.509
under like, say, 3% error.

00:34:03.509 --> 00:34:07.259
And then 2011, it
wasn't that exciting.

00:34:07.259 --> 00:34:09.559
But something happened in 2012.

00:34:09.559 --> 00:34:12.389
That was the most exciting year.

00:34:12.389 --> 00:34:16.190
That year, Jeff Hinton
and his students

00:34:16.190 --> 00:34:18.650
participated in
this challenge using

00:34:18.650 --> 00:34:20.340
convolutional neural network.

00:34:20.340 --> 00:34:23.100
And they reduced the
error almost by half.

00:34:23.099 --> 00:34:29.519
And it truly showed the power
of deep learning algorithms.

00:34:29.519 --> 00:34:34.759
And so the participating
algorithm in 2012 ImageNet

00:34:34.760 --> 00:34:36.960
challenge was called AlexNet.

00:34:36.960 --> 00:34:42.559
And the funny thing is,
if you look at AlexNet,

00:34:42.559 --> 00:34:47.449
it's not that different from
Fukushima's neocognitron

00:34:47.449 --> 00:34:49.579
32 years ago.

00:34:49.579 --> 00:34:54.829
But two major things
happened between these two.

00:34:54.829 --> 00:34:57.529
One is that
backpropagation happened.

00:34:57.530 --> 00:35:01.269
It's a principled,
mathematically rigorous learning

00:35:01.269 --> 00:35:04.300
rule so that you never
have to use your hands

00:35:04.300 --> 00:35:06.140
to tune parameters.

00:35:06.139 --> 00:35:09.409
And that was a major
breakthrough theoretically.

00:35:09.409 --> 00:35:14.179
Another breakthrough was data.

00:35:14.179 --> 00:35:19.629
The recognition of data and the
understanding of data driving

00:35:19.630 --> 00:35:23.200
these high capacity models,
which eventually will have

00:35:23.199 --> 00:35:26.109
trillions of parameters, but
at that time had millions

00:35:26.110 --> 00:35:34.831
of parameters-- was critical for
setting off the deep learning

00:35:34.831 --> 00:35:36.410
for this to work.

00:35:36.409 --> 00:35:42.405
And really, many people
consider the year of 2012

00:35:42.405 --> 00:35:46.869
and the AlexNet algorithm
that won the ImageNet

00:35:46.869 --> 00:35:51.019
challenge the historical
moment of the birth

00:35:51.019 --> 00:35:54.409
or rebirth of modern AI or
the birth of deep learning

00:35:54.409 --> 00:35:55.759
revolution.

00:35:55.760 --> 00:35:59.540
And of course, the reason
many of you are here

00:35:59.539 --> 00:36:04.320
is since then, we are in the
era of deep learning explosion.

00:36:04.320 --> 00:36:10.910
If you look at computer vision's
main annual research

00:36:10.909 --> 00:36:13.190
conference, called CVPR--

00:36:13.190 --> 00:36:15.619
the number of papers
have exploded.

00:36:15.619 --> 00:36:18.869
And our arXiv
paper has exploded.

00:36:18.869 --> 00:36:22.730
And many new
algorithms since then

00:36:22.730 --> 00:36:27.349
have been invented to
participate in the ImageNet

00:36:27.349 --> 00:36:28.049
challenge.

00:36:28.050 --> 00:36:29.870
In the following
years, we're going

00:36:29.869 --> 00:36:31.739
to study some of
these algorithms.

00:36:31.739 --> 00:36:34.639
But the point is
that some of these

00:36:34.639 --> 00:36:39.379
algorithms beyond AlexNet
have had a profound impact

00:36:39.380 --> 00:36:43.610
in the progress of the
field of computer vision

00:36:43.610 --> 00:36:49.090
and into the applications
of computer vision.

00:36:49.090 --> 00:36:52.720
So a lot of things
have happened.

00:36:52.719 --> 00:36:54.529
We're going to
cover some of these.

00:36:54.530 --> 00:36:57.340
Not only has the field
of computer vision

00:36:57.340 --> 00:37:01.510
made a major progress
in creating algorithms

00:37:01.510 --> 00:37:06.260
to recognize everyday objects like
cats and dogs and chairs--

00:37:06.260 --> 00:37:10.400
we also quickly, right
after ImageNet challenge,

00:37:10.400 --> 00:37:14.139
the 2012 moment,
we've got algorithms

00:37:14.139 --> 00:37:22.549
that can recognize much
more complicated images,

00:37:22.550 --> 00:37:27.470
can retrieve images, or can
do multiple object detections,

00:37:27.469 --> 00:37:30.559
can do image segmentation.

00:37:30.559 --> 00:37:34.360
These are all different
tasks in visual recognition

00:37:34.360 --> 00:37:36.220
that you'll find
yourself getting

00:37:36.219 --> 00:37:38.689
familiar with
throughout this course

00:37:38.690 --> 00:37:42.139
because vision is not just
calling out cats and dogs.

00:37:42.139 --> 00:37:48.859
There is so much in the nuanced
ability of visual recognition.

00:37:48.860 --> 00:37:52.829
And of course, vision is
not just static images.

00:37:52.829 --> 00:37:57.500
So there is work in video
classification, human activity

00:37:57.500 --> 00:37:58.710
recognition.

00:37:58.710 --> 00:38:00.929
I'm showing you this overview.

00:38:00.929 --> 00:38:04.774
You will learn some of these.

00:38:04.775 --> 00:38:08.460
You don't have to understand
exactly what's going on here.

00:38:08.460 --> 00:38:14.940
But I want you to appreciate
the variety of vision tasks.

00:38:14.940 --> 00:38:20.869
Medical imaging, those of you
who come from a medical field,

00:38:20.869 --> 00:38:24.650
whether it's radiology
or pathology or even

00:38:24.650 --> 00:38:28.260
other aspects of medicine,
is deeply visual.

00:38:28.260 --> 00:38:31.550
And this has a profound impact.

00:38:31.550 --> 00:38:37.550
Scientific discovery--
even the seminal picture

00:38:37.550 --> 00:38:41.700
you probably remember of the
first photograph of a black hole

00:38:41.699 --> 00:38:46.829
uses a lot of computer vision
and computational photography

00:38:46.829 --> 00:38:47.980
techniques.

00:38:47.980 --> 00:38:52.980
Of course, applications in
sustainability and the environment--

00:38:52.980 --> 00:38:58.889
computer vision has also
contributed a lot to that.

00:38:58.889 --> 00:39:02.309
And we also have made
a lot of progress

00:39:02.309 --> 00:39:07.449
in image captioning right after
the image-- that 2012 moment.

00:39:07.449 --> 00:39:09.989
This is actually work by
Andrej Karpathy, where he was

00:39:09.989 --> 00:39:13.799
my student, his thesis work.

00:39:13.800 --> 00:39:19.030
Then we also worked on
relationship understanding.

00:39:19.030 --> 00:39:22.710
So not only is visual
intelligence

00:39:22.710 --> 00:39:24.639
about seeing what's
on the pixel,

00:39:24.639 --> 00:39:26.859
you can also see
what's beyond pixels,

00:39:26.860 --> 00:39:33.360
including relationships of
objects and also style transfer.

00:39:33.360 --> 00:39:35.880
A lot of this work,
you will-- actually,

00:39:35.880 --> 00:39:39.000
Justin Johnson, who will come
to guest lecture this course,

00:39:39.000 --> 00:39:45.320
will tell you all about his
seminal work in style transfer.

00:39:45.320 --> 00:39:48.510
And of course, in
generative AI eras,

00:39:48.510 --> 00:39:53.430
we get these really incredible
results like face generation.

00:39:53.429 --> 00:39:59.239
And this is the very early days
of image generation of Dall-E. I

00:39:59.239 --> 00:40:03.379
think this is the early Dall-E.
Of course, now, Midjourney

00:40:03.380 --> 00:40:08.690
and everything has gone beyond
these avocado and peach chairs.

00:40:08.690 --> 00:40:14.780
But really, we are squarely in
the most exciting modern era

00:40:14.780 --> 00:40:16.246
of AI explosion.

00:40:20.070 --> 00:40:25.370
The three converging forces
of computation, algorithms,

00:40:25.369 --> 00:40:29.719
and data have taken
this field just

00:40:29.719 --> 00:40:32.929
to a whole different
level, where we're now

00:40:32.929 --> 00:40:36.119
totally out of AI winter.

00:40:36.119 --> 00:40:40.259
I would say we're in an
AI global warming period.

00:40:40.260 --> 00:40:46.050
And I don't see any
of this slowing down

00:40:46.050 --> 00:40:48.820
for both good and bad reasons.

00:40:48.820 --> 00:40:53.170
And also, just a word, because
we are in the Silicon Valley,

00:40:53.170 --> 00:40:58.050
we're in the very Huang
building, in the NVIDIA

00:40:58.050 --> 00:41:02.039
lecture hall-- so we cannot
ignore also the progress

00:41:02.039 --> 00:41:05.050
of hardware and
the role that it played.

00:41:05.050 --> 00:41:14.080
So here is just the FLOP per
dollar graph for NVIDIA's GPUs.

00:41:14.079 --> 00:41:19.210
And before 2020, the
progress was steady.

00:41:19.210 --> 00:41:22.800
But as soon as deep
learning started

00:41:22.800 --> 00:41:27.420
to drive these
GPUs and chips, you

00:41:27.420 --> 00:41:33.519
can just see the GFLOPS have
just completely taken off.

00:41:33.519 --> 00:41:40.610
And by any measure, we are
in this accelerated curve

00:41:40.610 --> 00:41:45.360
of lots of compute as
well as lots of AI.

00:41:45.360 --> 00:41:47.360
And these are just
different graphs

00:41:47.360 --> 00:41:50.539
showing you conference
attendees, startups,

00:41:50.539 --> 00:41:54.500
and enterprise applications
in AI all across

00:41:54.500 --> 00:41:55.710
not just computer vision.

00:41:55.710 --> 00:42:02.099
But also, NLP and others
have just exploded.

00:42:02.099 --> 00:42:06.299
So quickly, last but not
least, it's been exciting.

00:42:06.300 --> 00:42:08.070
There has been a
lot of successes.

00:42:08.070 --> 00:42:11.309
But there is still a lot to
be done in computer vision.

00:42:11.309 --> 00:42:14.329
So this problem is still
not totally solved.

00:42:14.329 --> 00:42:19.969
And with great tools come
great consequences as well.

00:42:19.969 --> 00:42:24.449
So computer vision
can do a lot of good.

00:42:24.449 --> 00:42:26.039
But it also can do harm.

00:42:26.039 --> 00:42:28.730
For example, human bias--

00:42:28.730 --> 00:42:32.360
every single AI algorithm
today, the large ones,

00:42:32.360 --> 00:42:33.880
are driven by data.

00:42:33.880 --> 00:42:38.550
And data is an artifact
of human activities

00:42:38.550 --> 00:42:40.360
on Earth and in history.

00:42:40.360 --> 00:42:43.900
And a lot of the
data carry our bias.

00:42:43.900 --> 00:42:47.200
And this gets carried
in AI systems.

00:42:47.199 --> 00:42:50.609
We have seen a lot of face
recognition algorithms having

00:42:50.610 --> 00:42:52.990
the same kind of bias
that humans have.

00:42:52.989 --> 00:42:55.919
And we do have to
really recognize that.

00:42:55.920 --> 00:43:01.450
We can also use AI to impact
human lives, some for the good.

00:43:01.449 --> 00:43:02.889
Think about medical imaging.

00:43:02.889 --> 00:43:05.199
But some are questionable.

00:43:05.199 --> 00:43:09.299
What if AI is solely
behind deciding your job

00:43:09.300 --> 00:43:11.620
or deciding your
financial loans?

00:43:11.619 --> 00:43:15.789
So again, is it totally bad?

00:43:15.789 --> 00:43:17.050
Is it totally good?

00:43:17.050 --> 00:43:19.150
These are very
complicated issues.

00:43:19.150 --> 00:43:23.490
This is also why I always get so
excited when students from HMS

00:43:23.489 --> 00:43:26.549
or law school or education
school or business school

00:43:26.550 --> 00:43:29.670
attend my class
because not all AI

00:43:29.670 --> 00:43:31.789
issues are engineering issues.

00:43:31.789 --> 00:43:36.559
We have a lot of human factors
and societal issues to solve.

00:43:36.559 --> 00:43:40.599
I'm also particularly excited
by AI's use in medicine and health

00:43:40.599 --> 00:43:41.139
care.

00:43:41.139 --> 00:43:43.960
This is something
really dear to my heart.

00:43:43.960 --> 00:43:46.119
Professor Adeli
and Zane, who are

00:43:46.119 --> 00:43:49.630
also co-instructors of
this course -- the three of us

00:43:49.630 --> 00:43:53.500
work on AI for aging
population as well as

00:43:53.500 --> 00:43:59.050
patients, and try to use
computer vision to deliver care

00:43:59.050 --> 00:44:00.170
to people.

00:44:00.170 --> 00:44:01.820
So this is a good use.

00:44:01.820 --> 00:44:04.820
And also, even in
terms of technology,

00:44:04.820 --> 00:44:07.190
human vision is remarkable.

00:44:07.190 --> 00:44:10.670
I want you to come out
of not only today's class

00:44:10.670 --> 00:44:14.240
but also this entire
course to appreciate,

00:44:14.239 --> 00:44:16.969
despite how much
computer vision can do,

00:44:16.969 --> 00:44:22.250
there's just so much more
nuance, subtlety, richness,

00:44:22.250 --> 00:44:26.389
complexity, and also
emotion in human vision.

00:44:26.389 --> 00:44:29.369
Look at these kids
studying whatever

00:44:29.369 --> 00:44:33.159
their curiosity leads them to,
or the humor in this image.

00:44:33.159 --> 00:44:36.129
There's still a lot more that
computer vision cannot do.

00:44:36.130 --> 00:44:38.430
So I hope that
continues to entice

00:44:38.429 --> 00:44:40.869
you to study computer vision.

00:44:40.869 --> 00:44:45.690
At this point, I'm going to give
the podium to Professor Adeli

00:44:45.690 --> 00:44:48.369
to go over the
rest of the class.

00:44:48.369 --> 00:44:49.039
Thank you.

00:44:49.039 --> 00:44:50.759
[APPLAUSE]

00:44:50.760 --> 00:44:51.990
Awesome.

00:44:51.989 --> 00:44:55.139
Thank you, Fei-Fei.

00:44:55.139 --> 00:44:57.089
Great start to the quarter.

00:44:57.090 --> 00:45:00.640
And I hope my microphone
is working right now.

00:45:00.639 --> 00:45:01.389
OK, good.

00:45:01.389 --> 00:45:05.730
I'm seeing some
nodding of heads.

00:45:05.730 --> 00:45:13.079
So very excited to
be here with you all.

00:45:13.079 --> 00:45:18.630
And I'm hoping that
you will have a fun

00:45:18.630 --> 00:45:23.160
and challenging course with an
amazing list of core instructors

00:45:23.159 --> 00:45:26.379
that we have and great TAs.

00:45:26.380 --> 00:45:31.000
So in this class, we
are going to cover

00:45:31.000 --> 00:45:34.690
a wide variety of topics
around computer vision and use

00:45:34.690 --> 00:45:37.659
of deep learning in
this space, categorized

00:45:37.659 --> 00:45:41.569
into four different topics.

00:45:41.570 --> 00:45:45.230
We will start with
deep learning basics.

00:45:45.230 --> 00:45:48.429
And let's start actually
with a simple question of,

00:45:48.429 --> 00:45:52.009
what is computer vision really?

00:45:52.010 --> 00:45:57.610
So at its core, it's
about enabling machines

00:45:57.610 --> 00:46:00.620
to see and understand images.

00:46:00.619 --> 00:46:09.339
And basically, the most
fundamental task in this space

00:46:09.340 --> 00:46:13.390
is image
classification.

00:46:13.389 --> 00:46:17.059
You give the model an
image, say, of a cat.

00:46:17.059 --> 00:46:21.549
And the model should
output a label cat.

00:46:21.550 --> 00:46:23.740
And that's it.

00:46:23.739 --> 00:46:29.479
But this deceptively simple
task is the foundation

00:46:29.480 --> 00:46:32.039
for many of the more
complex applications,

00:46:32.039 --> 00:46:36.409
from self-driving to
medical diagnosis and so on.

00:46:36.409 --> 00:46:40.429
So how do we teach a
machine to do this?

00:46:40.429 --> 00:46:44.639
One of the simplest approaches
is to use linear classification,

00:46:44.639 --> 00:46:48.089
as you can see in this slide.

00:46:48.090 --> 00:46:53.809
So imagine each of the
images in our data set

00:46:53.809 --> 00:46:57.119
is shown with a
dot in that space.

00:46:57.119 --> 00:47:02.779
And each axis shows
some sort of feature

00:47:02.780 --> 00:47:05.280
which was derived from
the image itself.

00:47:05.280 --> 00:47:09.420
Here, we are showing a
2D space for simplicity.

00:47:09.420 --> 00:47:12.470
But the task of a
linear classifier

00:47:12.469 --> 00:47:17.149
is to find the hyperplane
or the linear function

00:47:17.150 --> 00:47:23.470
that separates these
two, say, cats from dogs.
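
To make this concrete, here is a minimal sketch of a linear classifier's score function, s = Wx + b. The feature dimension, class names, and random weights are illustrative assumptions, not course-provided code.

```python
# Minimal linear-classifier sketch (illustrative values, not course code).
import numpy as np

num_classes, num_features = 2, 3072        # assume 2 classes and a 32x32x3 image, flattened
W = 0.01 * np.random.randn(num_classes, num_features)  # one weight row per class
b = np.zeros(num_classes)                               # one bias per class

x = np.random.rand(num_features)           # a flattened input image
scores = W @ x + b                         # one score per class
labels = ["cat", "dog"]
print(labels[int(np.argmax(scores))])      # predict the class with the highest score
```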

00:47:23.469 --> 00:47:26.259
But we all know that
these linear models often

00:47:26.260 --> 00:47:29.110
go only so far.

00:47:29.110 --> 00:47:32.349
They struggle when the data
isn't cleanly separable

00:47:32.349 --> 00:47:33.799
with a straight line.

00:47:33.800 --> 00:47:36.320
So the question is, what's next?

00:47:36.320 --> 00:47:44.090
We'll get into the topics of how
to model more complex patterns.

00:47:44.090 --> 00:47:49.900
And if we do so, we
often face challenges

00:47:49.900 --> 00:47:54.220
of overfitting and
underfitting, which

00:47:54.219 --> 00:47:59.439
are the topics we will cover in
the early lectures of the class.

00:47:59.440 --> 00:48:05.110
And to strike the
right balance, we

00:48:05.110 --> 00:48:08.320
use techniques
like regularization

00:48:08.320 --> 00:48:14.110
to control model complexity and
optimization to find the best

00:48:14.110 --> 00:48:16.059
fit parameters.
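
As a rough sketch of what "regularization plus optimization" looks like in code, here is a softmax loss with an L2 penalty and one gradient-descent step. The loss choice, regularization strength, and learning rate are illustrative assumptions.

```python
# Regularized loss and one optimization step (illustrative sketch).
import numpy as np

def softmax_loss(W, X, y, reg):
    """Average cross-entropy loss over a batch plus an L2 penalty on W."""
    scores = X @ W                                      # (N, C) class scores
    scores -= scores.max(axis=1, keepdims=True)         # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    N = X.shape[0]
    loss = -np.log(probs[np.arange(N), y]).mean() + reg * np.sum(W * W)
    dscores = probs
    dscores[np.arange(N), y] -= 1                        # gradient of the loss w.r.t. scores
    dW = X.T @ dscores / N + 2 * reg * W                 # gradient w.r.t. the weights
    return loss, dW

X = np.random.randn(64, 10)                 # 64 toy examples, 10 features each
y = np.random.randint(0, 3, size=64)        # 3 classes
W = 0.01 * np.random.randn(10, 3)
loss, dW = softmax_loss(W, X, y, reg=1e-3)
W -= 1e-2 * dW                              # one gradient-descent step (learning rate 1e-2)
```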

00:48:16.059 --> 00:48:21.079
So these are the nuts and bolts
of deep learning and creating

00:48:21.079 --> 00:48:26.659
these models, training models,
that not only fit the data

00:48:26.659 --> 00:48:31.319
but also generalize to
unseen and new data as well.

00:48:31.320 --> 00:48:33.539
And now comes the fun part--

00:48:33.539 --> 00:48:34.380
neural networks.

00:48:34.380 --> 00:48:38.059
We've been talking
about them quite a lot.

00:48:38.059 --> 00:48:43.549
And what neural networks do,
unlike the linear classifiers,

00:48:43.550 --> 00:48:47.780
is stack multiple
layers of operations

00:48:47.780 --> 00:48:54.769
to model non-linear
functions to be

00:48:54.769 --> 00:48:59.389
able to classify, to
solve the same problem of image

00:48:59.389 --> 00:49:04.489
classification, and so on.
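
Here is a minimal sketch of that stacking idea, assuming PyTorch and illustrative layer sizes: two linear layers with a ReLU non-linearity in between, which is what lets the model represent functions a single linear classifier cannot.

```python
# Two-layer neural network sketch (illustrative sizes, assuming PyTorch).
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(3072, 100),   # first layer: 3072-dim flattened image -> 100 hidden units
    nn.ReLU(),              # the non-linearity is what goes beyond a linear classifier
    nn.Linear(100, 10),     # second layer: hidden units -> 10 class scores
)

x = torch.randn(1, 3072)    # one flattened 32x32x3 image
scores = model(x)           # shape (1, 10): one score per class
```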

00:49:04.489 --> 00:49:09.869
These are the models powering
everything from Google Photos.

00:49:09.869 --> 00:49:13.429
And now, everybody's familiar
with ChatGPT and ChatGPT's vision

00:49:13.429 --> 00:49:15.440
models, and so on.

00:49:15.440 --> 00:49:24.099
In this course, we will go deep
into the details of how they

00:49:24.099 --> 00:49:26.299
work, how they are trained.

00:49:26.300 --> 00:49:31.090
And we will be looking into
debugging and improving them.

00:49:31.090 --> 00:49:35.030
After looking at the
deep learning basics,

00:49:35.030 --> 00:49:39.280
we will cover the topics of
perceiving and understanding

00:49:39.280 --> 00:49:44.620
the visual world, which
is a complex process that

00:49:44.619 --> 00:49:49.880
involves interpreting a vast
array of visual information.

00:49:49.880 --> 00:49:52.329
And to do so, we
often first define

00:49:52.329 --> 00:49:56.739
tasks that refer to specific
challenges or problems

00:49:56.739 --> 00:49:59.149
we aim to solve.

00:49:59.150 --> 00:50:02.180
Some of the examples are object
detection, scene understanding,

00:50:02.179 --> 00:50:03.619
motion detection, and so on.

00:50:03.619 --> 00:50:10.539
And to solve these tasks, we
use different models, which

00:50:10.539 --> 00:50:13.929
are computational
and theoretical

00:50:13.929 --> 00:50:17.779
frameworks we develop
to mimic or explain

00:50:17.780 --> 00:50:22.350
how our visual system
accomplishes these tasks.

00:50:22.349 --> 00:50:25.610
One of the examples of
these types of models

00:50:25.610 --> 00:50:27.730
is neural networks.

00:50:30.260 --> 00:50:36.150
So by aligning
models with tasks,

00:50:36.150 --> 00:50:41.030
we can create systems
that can see and interpret

00:50:41.030 --> 00:50:43.730
the world around us.

00:50:43.730 --> 00:50:48.740
Speaking of tasks, let's
go back to the topic

00:50:48.739 --> 00:50:53.239
of image classification,
predicting a single label

00:50:53.239 --> 00:50:56.989
for an entire image.

00:50:56.989 --> 00:50:59.359
But we know that real
world computer vision

00:50:59.360 --> 00:51:02.340
is much richer than this.

00:51:02.340 --> 00:51:05.240
And let's walk through
some of the tasks that

00:51:05.239 --> 00:51:06.869
go beyond classification.

00:51:06.869 --> 00:51:13.339
First, semantic segmentation,
where we are not just

00:51:13.340 --> 00:51:17.519
labeling the object
or the entire image

00:51:17.519 --> 00:51:19.739
as cat or tree or whatever.

00:51:19.739 --> 00:51:25.019
Here, we are looking for
labels for every single pixel

00:51:25.019 --> 00:51:25.809
in the image.

00:51:25.809 --> 00:51:30.670
So every pixel is a
grass, cat, tree, or sky.

00:51:30.670 --> 00:51:34.960
But we don't distinguish
between individual objects.

00:51:34.960 --> 00:51:38.280
And next, we have
object detection,

00:51:38.280 --> 00:51:45.580
where we now want to not
only say what is in the image

00:51:45.579 --> 00:51:47.440
but also pinpoint the location.

00:51:47.440 --> 00:51:49.860
And that's why we
create bounding boxes

00:51:49.860 --> 00:51:54.670
around the objects and associate
them with specific labels.

00:51:54.670 --> 00:51:58.269
And finally, we have
instance segmentation.

00:51:58.269 --> 00:52:01.139
We'll go into instance
segmentation, which is

00:52:01.139 --> 00:52:04.409
the most granular of them all.

00:52:04.409 --> 00:52:08.279
It combines the ideas of
detection and segmentation

00:52:08.280 --> 00:52:09.130
together.

00:52:09.130 --> 00:52:13.039
And every object instance
gets its own mask.

00:52:13.039 --> 00:52:20.090
So these tasks require a much
deeper spatial understanding

00:52:20.090 --> 00:52:21.059
of images.

00:52:21.059 --> 00:52:23.809
And they push the models
to do more than just

00:52:23.809 --> 00:52:27.860
recognizing categories.
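
One way to see how these tasks differ is by the shape of their outputs. The tensor shapes below are illustrative conventions (assuming PyTorch), not a specific library's API.

```python
# Illustrative output shapes for the tasks described above (assumed conventions).
import torch

N, C, H, W = 1, 21, 224, 224                  # 1 image, 21 classes, 224x224 pixels

class_scores   = torch.randn(N, C)            # classification: one score vector per image
pixel_scores   = torch.randn(N, C, H, W)      # semantic segmentation: class scores per pixel
boxes          = torch.rand(5, 4)             # detection: 5 boxes as (x1, y1, x2, y2)
box_labels     = torch.randint(0, C, (5,))    # ...each box gets its own class label
instance_masks = torch.rand(5, H, W) > 0.5    # instance segmentation: one mask per object
```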

00:52:27.860 --> 00:52:30.660
The complexity doesn't
stop with static images.

00:52:30.659 --> 00:52:33.269
Let's look at some
temporal dimensions.

00:52:33.269 --> 00:52:36.269
So there's the task of
video classification,

00:52:36.269 --> 00:52:40.429
as Fei-Fei talked about,
where we want to understand

00:52:40.429 --> 00:52:42.349
what's happening in video.

00:52:42.349 --> 00:52:47.210
Is there someone running,
jumping, or dancing?

00:52:47.210 --> 00:52:51.630
There is the topic of
multimodal video understanding,

00:52:51.630 --> 00:52:56.630
which is combining vision and
sound and other modalities.

00:52:56.630 --> 00:53:00.559
For example, in this
example, the person

00:53:00.559 --> 00:53:04.070
is playing a vibraphone.
To really understand

00:53:04.070 --> 00:53:05.039
what's happening here,

00:53:05.039 --> 00:53:08.210
we have to create a
blend of visual features

00:53:08.210 --> 00:53:11.280
and audio features to be able
to understand what's happening.

00:53:11.280 --> 00:53:14.680
And finally, there is the
topic of visualization

00:53:14.679 --> 00:53:19.329
and understanding that we will
be covering in this class, where

00:53:19.329 --> 00:53:24.340
we want to interpret what's
being learned by the models

00:53:24.340 --> 00:53:31.269
and see an attention frame
or attention map of what

00:53:31.269 --> 00:53:35.079
the model is attending to in order
to make a correct classification

00:53:35.079 --> 00:53:36.819
and so on.

00:53:36.820 --> 00:53:39.650
And then we have
models beyond tasks.

00:53:39.650 --> 00:53:41.740
We look into models.

00:53:41.739 --> 00:53:46.509
And the very first topic--
let me introduce to you--

00:53:46.510 --> 00:53:50.170
that we'll be covering is
Convolutional Neural Networks

00:53:50.170 --> 00:53:51.230
or CNNs.

00:53:51.230 --> 00:53:52.760
These involve a number
of operations.

00:53:52.760 --> 00:53:55.930
We will be going
over the details

00:53:55.929 --> 00:53:59.839
in the class, starting from an
image, a number of convolutions,

00:53:59.840 --> 00:54:01.970
sampling and fully
connected operations,

00:54:01.969 --> 00:54:05.980
and, finally,
creating the output.
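
A hedged sketch of that convolution, downsampling, and fully connected pipeline, assuming PyTorch; the channel counts and the 32x32 RGB input size are illustrative assumptions.

```python
# Small CNN sketch: conv -> pool -> conv -> pool -> fully connected (illustrative sizes).
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution over the RGB image
    nn.ReLU(),
    nn.MaxPool2d(2),                             # downsampling: 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                             # downsampling: 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                   # fully connected layer -> 10 class scores
)

x = torch.randn(1, 3, 32, 32)   # one 32x32 RGB image
print(cnn(x).shape)             # torch.Size([1, 10])
```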

00:54:05.980 --> 00:54:08.769
And beyond convolutional
neural networks,

00:54:08.769 --> 00:54:14.719
we will study recurrent neural
networks for sequential data

00:54:14.719 --> 00:54:19.669
and even newer neural architectures,
such as transformers

00:54:19.670 --> 00:54:24.139
and attention-based frameworks.

00:54:24.139 --> 00:54:29.179
So next, we will be covering
some large-scale distributed

00:54:29.179 --> 00:54:34.609
training topics, which is
kind of new this quarter.

00:54:34.610 --> 00:54:38.460
I'm sure you've all heard
about large language models,

00:54:38.460 --> 00:54:40.320
large vision models, and so on.

00:54:40.320 --> 00:54:44.480
And we will be
briefly discussing

00:54:44.480 --> 00:54:47.309
how these models are
actually trained.

00:54:47.309 --> 00:54:51.619
We know that data and data
sets are expanding.

00:54:51.619 --> 00:54:56.429
And models are becoming
larger and larger.

00:54:56.429 --> 00:54:59.819
And in order to
train such models,

00:54:59.820 --> 00:55:02.360
there are some strategies--

00:55:02.360 --> 00:55:04.470
for example, data
parallelization,

00:55:04.469 --> 00:55:07.569
model parallelization-- that
we will cover in this class.
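
As a minimal illustration of the data-parallel strategy (model parallelism and the synchronization issues mentioned next need more machinery), here is a sketch assuming PyTorch. nn.DataParallel is used only because it is the shortest example; large-scale training typically relies on DistributedDataParallel and a full launcher setup.

```python
# Data parallelism sketch: replicate the model, split each batch across devices.
import torch
import torch.nn as nn

model = nn.Linear(1024, 10)                      # stand-in for a large vision model
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)               # each GPU processes a slice of the batch
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

batch = torch.randn(256, 1024, device=device)    # one large batch of inputs
scores = model(batch)                            # forward runs on all replicas;
                                                 # gradients are combined on backward
```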

00:55:07.570 --> 00:55:11.170
But beyond that, there
will be so many challenges,

00:55:11.170 --> 00:55:15.940
such as synchronization between
these models and workers

00:55:15.940 --> 00:55:20.730
and so on, as well as
several other aspects

00:55:20.730 --> 00:55:25.059
that we'll be covering in one
of the lectures this quarter.

00:55:25.059 --> 00:55:31.289
And we will go also over some
of the trends for training

00:55:31.289 --> 00:55:33.070
these large models.

00:55:33.070 --> 00:55:36.210
After completing this
topic, what we will do

00:55:36.210 --> 00:55:44.010
next is looking into generative
and interactive visual

00:55:44.010 --> 00:55:48.690
intelligence, where
we will first start

00:55:48.690 --> 00:55:52.030
with self-supervised learning.

00:55:52.030 --> 00:55:55.960
Self-supervised learning is
a branch of machine learning

00:55:55.960 --> 00:56:00.579
in which models learn to
understand and represent data

00:56:00.579 --> 00:56:04.179
by getting some training
signals from the data itself.
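
A hedged sketch of one such recipe, rotation prediction, assuming PyTorch: the "label" is which rotation was applied, so the training signal comes entirely from the data itself. The task and the tiny model are illustrative choices, not the specific methods covered later.

```python
# Self-supervised sketch: predict the rotation applied to each image (no human labels).
import torch
import torch.nn as nn

def rotate_batch(images):
    """Rotate each image by a random multiple of 90 degrees; the rotation id is the free label."""
    k = torch.randint(0, 4, (images.size(0),))
    rotated = torch.stack(
        [torch.rot90(img, int(ki), dims=(1, 2)) for img, ki in zip(images, k)]
    )
    return rotated, k                                  # k in {0, 1, 2, 3}

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 4))   # predicts the rotation
images = torch.randn(8, 3, 32, 32)                    # a batch of unlabeled images
rotated, targets = rotate_batch(images)
loss = nn.functional.cross_entropy(model(rotated), targets)
loss.backward()                                       # the training signal came from the data itself
```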

00:56:04.179 --> 00:56:06.384
We will cover this topic.

00:56:06.385 --> 00:56:10.180
It's one of the approaches
that has enabled training

00:56:10.179 --> 00:56:15.339
of large scale models using
vast amounts of data that do not

00:56:15.340 --> 00:56:18.880
require labels, that is, unlabeled data.

00:56:18.880 --> 00:56:23.200
And they have played a key
role in recent breakthroughs

00:56:23.199 --> 00:56:26.199
in computer vision in general.

00:56:26.199 --> 00:56:30.799
And we will talk a little
bit about generative models.

00:56:30.800 --> 00:56:33.710
They go beyond recognition.

00:56:33.710 --> 00:56:35.860
They actually generate.

00:56:35.860 --> 00:56:39.340
This is an example where the
content of a Stanford campus

00:56:39.340 --> 00:56:44.380
photo is reimagined in
the style of Van Gogh's Starry

00:56:44.380 --> 00:56:45.490
Night.

00:56:45.489 --> 00:56:49.989
This is known as style
transfer, a classic application

00:56:49.989 --> 00:56:54.369
of neural generative techniques.

00:56:54.369 --> 00:56:58.269
Generative models can
now translate language

00:56:58.269 --> 00:57:03.219
into images given a prompt.

00:57:03.219 --> 00:57:07.289
A model like Dall-E or Dall-E
2 generates an entirely novel

00:57:07.289 --> 00:57:09.059
image.

00:57:09.059 --> 00:57:12.570
This showcases how
generative vision models

00:57:12.570 --> 00:57:16.830
blend understanding,
creativity, and control

00:57:16.829 --> 00:57:19.349
in their generations.

00:57:19.349 --> 00:57:22.589
And you've probably
heard recently

00:57:22.590 --> 00:57:26.620
about the topic of
diffusion models in general.

00:57:26.619 --> 00:57:33.179
That's another thing that we'll
be covering in this quarter.

00:57:33.179 --> 00:57:37.649
They basically learn to
reverse a gradual noising

00:57:37.650 --> 00:57:40.510
process to generate images.
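
To make "reversing a gradual noising process" concrete, here is a sketch of the forward (noising) side under common DDPM-style assumptions; the schedule and step count are illustrative, not the exact setup of assignment 3.

```python
# Forward noising process that a diffusion model learns to reverse (illustrative schedule).
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)      # cumulative signal-retention factors

def add_noise(x0, t):
    """Sample x_t from q(x_t | x_0): scaled clean image plus scaled Gaussian noise."""
    noise = torch.randn_like(x0)
    a = alpha_bars[t].sqrt()
    s = (1.0 - alpha_bars[t]).sqrt()
    return a * x0 + s * noise, noise                # a denoiser is trained to predict `noise`

x0 = torch.randn(1, 3, 64, 64)                      # a clean (here random) image
x_t, eps = add_noise(x0, t=500)                     # heavily noised version at step 500
```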

00:57:40.510 --> 00:57:43.630
And interestingly,
in assignment 3,

00:57:43.630 --> 00:57:46.860
you will actually be
implementing a generative model

00:57:46.860 --> 00:57:53.400
that generates emojis
from text inputs,

00:57:53.400 --> 00:57:57.360
from prompts-- for example, a
face with a cowboy hat, which

00:57:57.360 --> 00:58:01.240
is denoised from pure noise.

00:58:01.239 --> 00:58:06.529
Vision language models are
the next topic of interest

00:58:06.530 --> 00:58:08.890
we will be covering.

00:58:08.889 --> 00:58:16.039
They connect text and images in
a shared representation space.

00:58:16.039 --> 00:58:19.900
And given a caption
or image, the model

00:58:19.900 --> 00:58:24.289
retrieves or generates
its corresponding pair,

00:58:24.289 --> 00:58:25.309
as you can see.
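
A minimal sketch of that shared-representation-space idea, assuming PyTorch: an image encoder and a text encoder project into the same embedding dimension, and retrieval becomes nearest-neighbor search by cosine similarity. The stand-in linear encoders are purely illustrative; real systems use large pretrained backbones.

```python
# Shared embedding space sketch: match images to captions by cosine similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim = 128
image_encoder = nn.Linear(3 * 32 * 32, embed_dim)    # stand-in for a vision backbone
text_encoder = nn.Linear(300, embed_dim)             # stand-in for a text backbone

images = torch.randn(4, 3 * 32 * 32)                 # 4 flattened images
captions = torch.randn(4, 300)                        # 4 caption feature vectors

img_emb = F.normalize(image_encoder(images), dim=-1)  # unit-length image embeddings
txt_emb = F.normalize(text_encoder(captions), dim=-1) # unit-length text embeddings
similarity = img_emb @ txt_emb.T                       # (4, 4) image-caption match scores
best_caption_per_image = similarity.argmax(dim=1)      # retrieve the best caption per image
```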

00:58:25.309 --> 00:58:29.049
So there are a lot of
advances in this area.

00:58:29.050 --> 00:58:32.170
We'll be covering some
of the key examples.

00:58:32.170 --> 00:58:37.750
Again, this is a key task
for cross-modal retrieval

00:58:37.750 --> 00:58:41.119
or understanding and visual
question answering and so on.

00:58:41.119 --> 00:58:44.269
So we'll get to
that in the class, too.

00:58:44.269 --> 00:58:52.809
Moving beyond 2D, models can
now reconstruct and generate 3D

00:58:52.809 --> 00:58:55.549
representations from images.

00:58:55.550 --> 00:59:00.980
And here, you can see some
voxel-based reconstructions,

00:59:00.980 --> 00:59:06.769
shape completion, and even 3D
object detection from single

00:59:06.769 --> 00:59:09.599
view images.

00:59:09.599 --> 00:59:14.809
So 3D vision enables
more spatially grounded

00:59:14.809 --> 00:59:19.699
understanding, which is
crucial for robotics and AR/VR

00:59:19.699 --> 00:59:20.399
applications.

00:59:20.400 --> 00:59:26.900
And finally, vision
empowers embodied agents

00:59:26.900 --> 00:59:30.680
that act in the physical world.

00:59:30.679 --> 00:59:35.279
So these models often
must perceive, plan,

00:59:35.280 --> 00:59:41.390
and execute, whether it's
cleaning up a messy room

00:59:41.389 --> 00:59:44.879
or generalizing from
human demonstrations.

00:59:44.880 --> 00:59:50.210
So with all of these, we will
be covering different topics

00:59:50.210 --> 00:59:53.970
around generative and
interactive visual intelligence.

00:59:53.969 --> 01:00:00.759
And finally, we will cover some
human-centered applications

01:00:00.760 --> 01:00:05.990
and implications, as Fei-Fei
very nicely explained.

01:00:05.989 --> 01:00:08.719
So computer vision

01:00:08.719 --> 01:00:12.069
and AI in general
have been having a lot

01:00:12.070 --> 01:00:16.070
of impact in the past years.

01:00:16.070 --> 01:00:18.280
And it's very
important to understand

01:00:18.280 --> 01:00:21.230
the human-centered
aspects and applications.

01:00:21.230 --> 01:00:24.159
And some of these
impacts are reflected

01:00:24.159 --> 01:00:32.469
by these awards that are going
to researchers in this space.

01:00:32.469 --> 01:00:38.769
It was first recognized by
the Turing Award in 2018, which

01:00:38.769 --> 01:00:41.440
is the most prestigious
technical award given

01:00:41.440 --> 01:00:45.400
for major contributions
of lasting importance

01:00:45.400 --> 01:00:47.090
to computing.

01:00:47.090 --> 01:00:50.890
Geoffrey Hinton, Yoshua
Bengio, and Yann LeCun

01:00:50.889 --> 01:00:54.849
received the award for
conceptual and engineering

01:00:54.849 --> 01:00:57.049
breakthroughs
that have made

01:00:57.050 --> 01:01:01.440
deep neural networks a critical
component of computing.

01:01:01.440 --> 01:01:06.200
Beyond that, last year,
in 2024, Geoffrey Hinton

01:01:06.199 --> 01:01:11.089
was jointly awarded the
Nobel Prize in physics

01:01:11.090 --> 01:01:14.990
alongside John Hopfield for
their foundational contributions

01:01:14.989 --> 01:01:17.459
to neural networks.

01:01:17.460 --> 01:01:21.260
And finally, I want to very
briefly mention that the learning

01:01:21.260 --> 01:01:27.770
objectives for this class will
be formalizing computer vision

01:01:27.769 --> 01:01:30.239
applications into tasks.

01:01:30.239 --> 01:01:33.619
As you can see in some
of the details here,

01:01:33.619 --> 01:01:38.599
we want to develop and
train vision models, models

01:01:38.599 --> 01:01:41.400
that operate on images
and visual data--

01:01:41.400 --> 01:01:43.220
images, videos, and so on--

01:01:43.219 --> 01:01:46.549
gain an understanding
of where the field is

01:01:46.550 --> 01:01:48.990
and where it is headed.

01:01:48.989 --> 01:01:53.619
That's why we have some new
topics also covered specifically

01:01:53.619 --> 01:01:56.920
this year.

01:01:56.920 --> 01:02:01.539
So the four topics that
I mentioned earlier,

01:02:01.539 --> 01:02:06.529
we will be going over the basics
in the very first few weeks.

01:02:06.530 --> 01:02:09.220
Bear with us because these
are important topics.

01:02:09.219 --> 01:02:12.859
And you need to understand
the details first,

01:02:12.860 --> 01:02:15.110
how to build the
models from scratch.

01:02:15.110 --> 01:02:19.180
And then we'll get to more
interesting, exciting topics

01:02:19.179 --> 01:02:20.440
of the day--

01:02:20.440 --> 01:02:21.769
computer vision.

01:02:21.769 --> 01:02:27.969
And finally, we'll have one big
lecture on human-centered AI

01:02:27.969 --> 01:02:30.549
and computer vision.

01:02:30.550 --> 01:02:33.039
I want to just leave
you with what we

01:02:33.039 --> 01:02:34.789
will be covering next session.

01:02:34.789 --> 01:02:38.380
That's going to be
image classification

01:02:38.380 --> 01:02:43.720
and linear classifiers,
which will get us started

01:02:43.719 --> 01:02:45.909
with the world of CS231n.

01:02:45.909 --> 01:02:47.969
Thank you.
