[00:05] This is CS231n. [00:07] And I'm Professor Fei-Fei Li from the computer science [00:11] department. [00:11] I will be co-teaching this quarter [00:14] with Professor Ehsan Adeli and my graduate student Zane. [00:19] So you'll meet them, as well as our wonderful TA team, [00:23] later. [00:24] So I just want to get started. [00:28] So this is what excites me: AI [00:32] has become such an interdisciplinary field. [00:35] What you're going to learn in this class, [00:38] of course, is very technical. [00:40] It's about computer vision and deep learning. [00:42] But I really do hope that you take [00:44] it to whichever discipline you work in and are passionate about [00:48] and apply it. [00:49] So we hear a lot about the field of AI. [00:52] So how do we position computer vision [00:56] and the scope of this class? [00:57] If you consider AI as this big bubble, [01:02] computer vision is very much an integral part of AI. [01:07] Some of you have heard me say that not only is vision [01:10] part of intelligence, it's a cornerstone of intelligence. [01:13] Unlocking the mystery of visual intelligence [01:16] is unlocking the mystery of intelligence. [01:20] But one of the most important tools, mathematical tools, [01:25] for solving AI is machine learning, or what some people [01:29] call statistical machine learning. [01:31] And this is exactly what we will be talking about. [01:36] Within the field of machine learning, [01:38] in the past 10-plus years, we have seen a major revolution [01:42] called deep learning. [01:43] And I'll explain a little bit of what deep learning is. [01:46] Deep learning is a set of algorithmic techniques [01:50] built around a family of algorithms [01:54] called neural networks. [01:55] And so if you ask me to pinpoint the scope of this class: [02:02] we'll not be able to cover the entirety of computer vision. [02:05] We'll not be able to cover the entirety of machine [02:07] learning or deep learning. [02:09] But we're going to cover the core intersection of these two [02:12] fields. [02:13] And of course, just like the entirety of AI, [02:18] computer vision is becoming more and more [02:20] an interdisciplinary field. [02:23] A lot of the techniques we use, as well as [02:26] the problems we work on, intersect [02:28] with many other fields, like natural language [02:31] processing, speech recognition, and robotics. And AI as a whole [02:37] is a field that intersects with mathematics, neuroscience, [02:41] computer science, psychology, physics, biology, [02:44] and many application areas, from medicine [02:46] to law to education and business and so on. [02:49] So in this first lecture, [02:55] I'll give a very brief history of computer [02:58] vision and deep learning. [02:59] And then Professor Adeli will give an overview of this course [03:05] and lay the groundwork for how this course is set up [03:08] and what our expectations are. [03:11] So the history of vision did not begin when you were born [03:19] or when humanity was born. [03:20] The history of vision began 540 million years ago. [03:25] You might ask, what happened 540 million years ago? [03:29] Why are we pinpointing a relatively specific date [03:34] in evolution? [03:35] Well, it's because a lot of fossil studies [03:37] have shown us that there was a mysterious period called the Cambrian [03:43] explosion.
[03:45] The fossil studies show that over about 10 million years [03:49] of evolution during that time, which is a very short period [03:52] for evolution, [03:53] there was an explosion of animal species [03:58] in the fossil record. That means that before the Cambrian explosion, [04:02] life on Earth was pretty chill. [04:05] It was all in the water. [04:06] There were no animals on land yet. [04:10] Animals just floated around. [04:13] So what caused this explosion in animal speciation? [04:18] There were many theories, from climate to the chemical composition [04:21] of the ocean water. [04:23] But one of the most compelling theories was the onset of eyes. [04:29] The first animals, like the trilobite, [04:32] gained photosensitive cells. [04:34] So the eyes we are talking about [04:37] were not sophisticated lenses and retinas and nerve cells. [04:41] It was literally a very simple pinhole. [04:44] And that pinhole collected light. [04:47] Once you can collect light, life is completely different. [04:53] Without sensors, life is metabolism. [04:57] It's very passive. [04:59] It is just metabolism. [05:01] And you come and go. [05:02] With sensors, you become an integral part of the environment, [05:06] one that you might want to change. [05:08] You might want to actually survive in it. [05:11] Some animals or plants become your dinner. [05:16] And you become someone else's dinner. [05:18] So evolutionary forces drive intelligence [05:24] to evolve because of the onset of sensors, [05:27] because of the onset of vision, along with haptics [05:31] or tactile sensing. [05:33] Those are the oldest senses for animals. [05:38] So that entire course of 540 million years [05:41] of the evolution of vision is the evolution of intelligence. [05:46] Vision, as one of the primary senses of animals, [05:49] drove the development of the nervous system, the development [05:54] of intelligence. [05:55] Almost all animals on Earth today that we know of [05:59] have vision or use vision as one of their primary senses. [06:03] Humans are especially visual animals. [06:06] More than half of our cortical cells [06:08] are involved in visual processing. [06:11] And we have a very complex and convoluted visual system. [06:15] So this is what excited me to enter the field of vision. [06:19] And I hope it excites you. [06:21] So now, let's just fast-forward from the Cambrian explosion [06:30] to human civilization. [06:33] Humans do innovate. [06:35] And not only do we see; [06:37] we want to build machines that see. [06:40] So here's a couple of drawings by, of course, [06:44] Leonardo da Vinci, who was just forever curious [06:48] about everything. [06:49] He studied the camera obscura, along with how to make steam machines. [06:56] In fact, even way before him, in ancient Greece [07:01] and in ancient China, we have seen documents [07:05] of thinkers, philosophers, thinking [07:09] about how to project objects through pinholes [07:15] and create images of objects. [07:19] And of course, in our modern life, [07:22] cameras have truly exploded. [07:25] But cameras are not enough for seeing, just like eyes are not [07:30] enough for seeing. [07:31] These are apparatus. [07:33] We need to understand how visual intelligence happens. [07:35] And that's really the crux of this course. [07:38] So let's talk a little bit about the history that brought us [07:45] to this intersection of deep learning and computer vision. [07:49] So let me go back to the 1950s.
[07:57] In the 1950s, a set of critically important experiments [08:03] happened in neuroscience. [08:05] That was the study of the visual pathways [08:08] of mammals, especially the seminal work [08:10] by Hubel and Wiesel. [08:11] They inserted electrodes into live, anesthetized cats. [08:18] And then they studied the receptive fields [08:21] of neurons in the primary visual cortex. [08:25] What they learned, to their surprise, [08:28] were two very important things. [08:31] One is that neurons that are responsible for seeing [08:38] in the primary visual cortex have [08:41] their own individual receptive fields. [08:44] A receptive field means that for every neuron, [08:48] there is a part of space it actually sees. [08:52] It's not all of space. [08:54] It's not very big. [08:55] It tends to be a very confined patch of space. [09:00] And within that space, it sees specialized patterns, [09:06] simple patterns, when you're measuring from the early part [09:12] of the visual pathway. [09:15] And by and large, in the primary visual cortex, [09:18] which is around here at the back of the head, not near your eyes, [09:23] it's oriented edges or moving oriented edges. [09:27] So for every neuron: some neurons will [09:28] be seeing an edge like this. [09:30] Some will be seeing an edge like this or this. [09:32] And that's how the computation in the brain begins. [09:39] The second thing they learned is that the visual pathway [09:42] is hierarchical. [09:43] As you move along the visual pathway, [09:47] the neurons feed into other neurons. [09:50] And the neurons in the higher layers [09:54] or deeper layers of the visual hierarchy [09:57] have more complex receptive fields. [09:59] So if you begin with oriented edges, [10:04] you might feed into a corner receptor. [10:06] You might feed into an object receptor. [10:10] I'm overly simplifying. [10:12] But that's the concept: neurons feed into each other. [10:16] And then they create this big network of computation. [10:23] Of course, most of you sitting here [10:25] are already thinking that the way I've [10:27] been describing this will have a profound impact [10:30] on the neural network modeling of visual algorithms. [10:36] Let's keep going. [10:37] That's the year 1959. [10:40] Those are very early studies of seeing. [10:43] By the way, about 20-something years later, [10:50] Hubel and Wiesel won the Nobel Prize in medicine [10:54] for this work, uncovering the principles [10:59] of visual processing. [11:01] Another milestone in the early history of computer vision [11:05] was the first PhD thesis of computer vision. [11:09] Most people attribute it to Larry Roberts, who in 1963 [11:13] wrote the first PhD thesis, just studying shape. [11:17] And this is a very, very caricature-like representation [11:21] of the world. [11:22] And the idea is, can we take a shape like this [11:26] and understand the surfaces and the corners and features [11:30] of this shape? [11:32] Humans do this intuitively. [11:34] So an entire PhD thesis was devoted to this. [11:39] And that's the beginning of computer vision. [11:44] And around that time, in 1966, an MIT professor [11:52] created a summer project at MIT and asked [11:56] to hire a few undergrads, very smart ones, to study vision. [12:03] And the goal was pretty much to solve computer vision, [12:07] or solve vision, in one summer.
[12:09] Of course, just like the rest of the history of AI, [12:13] we tend to be overoptimistic about what we can [12:18] do in a short period of time. [12:20] So vision did not get solved that summer. [12:24] In fact, it has blossomed into an incredible computer science [12:29] field. [12:30] If you go to our annual conferences now, [12:33] they have more than 10,000 people attending. [12:36] But the 1960s, between Larry Roberts' PhD thesis [12:43] and this kind of project, is what we in our field consider [12:48] the beginning of the field of computer vision. [12:51] A seminal book was written in the 1970s by David Marr, [12:55] who unfortunately died too early. [12:58] He wanted to study vision systematically and start [13:01] to consider how visual processing happens. [13:05] Even though this is not explicitly [13:07] stated, there is a lot of inspiration [13:10] from neuroscience and cognitive science. [13:12] He was thinking about, if you take an input image, [13:20] how do we visually process and understand the image? [13:23] Maybe the first layer is more like edges, just like we saw. [13:28] He called it the primal sketch. [13:30] And then there is a 2-and-1/2-D sketch, which separates the different [13:37] depths of the objects in the image. [13:42] So the ball is the foreground object. [13:45] And then the grass here-- [13:47] oh, no, not grass. [13:48] The ground here is the background. [13:51] So he drew these 2-and-1/2-D sketches. [13:53] And then, finally, David Marr believed the grand holy grail, [14:01] the victory of solving vision, is to recover the entire full 3D [14:06] representation. [14:07] And that is actually the hardest thing in vision. [14:12] Let me digress for 20 seconds. [14:15] Because if you think about vision, for all animals, [14:20] it's an ill-posed problem. [14:23] Ever since the early trilobites collected light [14:27] underwater, light-- [14:30] the world, through photons-- has been projected onto [14:35] a surface that is more or less 2D. [14:38] At that time, it was just, I don't know, some patch [14:40] on the animal. [14:42] But right now, for us, it's a retina. [14:45] But the actual world is 3D. [14:47] So recovering 3D information, the entire 3D world, [14:55] from 2D images is the fundamental problem nature had [15:00] to solve and computer vision has to solve. [15:02] And mathematically, that's an ill-posed problem. [15:05] So what did nature do? [15:07] Anybody have a wild guess? [15:14] [INAUDIBLE] [15:17] Yes. [15:18] The trick that nature used is to develop multiple eyes, mostly [15:22] two. [15:22] Some animals have more than two. [15:25] And then you triangulate information. [15:28] But two eyes are not enough. [15:29] You actually have to understand correspondences and all that. [15:33] We'll touch on some of these topics. [15:35] But there are other computer vision classes that Stanford [15:38] offers that specifically talk about 3D vision. [15:42] But the point is, it's a very hard problem. [15:45] And we have to solve it. [15:47] Nature has solved it. [15:48] Humans have solved it, but not to extreme precision. [15:53] In fact, humans are not that precise. [15:55] I roughly know the 3D shapes. [15:58] But I don't have geometric precision for all the shapes. [16:03] So that's one thing to consider, to appreciate [16:06] how hard this problem is. [16:08] Another thing that is very different between computer vision [16:12] and language is actually something [16:15] philosophically subtle. [16:17] Language doesn't exist in nature.
[16:20] You cannot point to something and say, there is language. [16:24] Language is a purely generated thing. [16:30] I don't even know what word to use. [16:31] It comes through our brain. [16:35] It's generated. [16:37] It's 1D. [16:38] It's sequential. [16:40] So this actually has profound implications for the latest [16:44] wave of GenAI algorithms. [16:47] This is why LLMs, which are outside [16:50] the scope of this class, are so powerful: we [16:54] can model language that way. [16:56] But vision is not generated. [16:58] There is actually a physical world [17:01] out there respecting the laws of physics and materials and all [17:05] that. [17:06] So vision has very different tasks. [17:09] So I just want you to appreciate the difference between language [17:14] and vision and, frankly, appreciate nature [17:17] and how it solved this problem. [17:19] Let's keep going. [17:21] In the 1970s, the early pioneers of computer vision, without data, [17:28] without really much in the way of powerful computers, [17:32] without the mathematical advances we have seen today, [17:36] were already beginning to tackle some of the harder problems [17:40] of computer vision-- for example, recognition of objects. [17:43] Here at Stanford, one of the pioneering works [17:48] is called generalized cylinders, by Rodney Brooks and Tom [17:52] Binford. [17:52] And coincidentally, Rodney Brooks is on campus today, actually, [17:58] over there giving a talk at a robotics conference. [18:03] He went on to become one of the greatest [18:05] roboticists of our time and was a founder of iRobot, maker of the Roomba [18:10] and many other robots. [18:11] And then, not very far from us in another part of Palo Alto, [18:16] researchers worked on similarly compositional models [18:24] of the human body and objects. [18:27] And then in the 1980s, digital photos started to appear. [18:34] At least photos started to appear, [18:37] and people could digitize them a little bit. [18:39] And there was some great work in edge detection. [18:43] You look at all this and probably feel [18:48] a sense of disappointment. [18:50] I mean, it's kind of trivial to get some sketches and edges. [18:55] And it's not really going anywhere. [18:58] That's how computer vision worked at that time. [19:02] And in fact, you're not so wrong. [19:03] That was around the time, before many of you [19:07] were born, that we entered the AI winter. [19:10] The field entered the AI winter because the enthusiasm [19:15] and, hence, funding for AI research had really dwindled. [19:18] A lot of things didn't deliver. [19:20] Computer vision didn't deliver. [19:22] Expert systems didn't deliver. [19:24] Robotics didn't deliver. [19:26] But under the hood of this winter, a lot of research [19:32] started to grow in different fields, [19:34] like computer vision, NLP, and robotics. [19:37] So let's also look at another strand of research [19:40] that had profound implications for computer vision: [19:43] cognitive science and neuroscience [19:45] continued to blossom. [19:46] And what is really important, especially [19:49] for the field of computer vision, is that cognitive science [19:52] and neuroscience started to point to the North Star [19:55] problems we should work on. [19:57] For example, psychologists have told us [20:00] there's something special about seeing nature, [20:02] seeing the real world. [20:06] This is a study by Irv Biederman, who [20:09] showed that the detection of bicycles in two images [20:13] differs depending on whether the images are scrambled or not. [20:18] Think about it.
[20:19] From a photon point of view, these two bicycles [20:22] land in the same location on your retina. [20:26] But somehow the rest of the image [20:28] impacts how the viewer sees the target object. [20:39] So there is something telling us that seeing [20:41] the entire forest, the entire world, [20:44] impacts the way we see objects. [20:46] It also tells us visual processing is very fast. [20:49] Here's another, more direct measure of how fast we detect objects. [20:55] This is an early 1970s experiment showing people [21:00] a video. [21:03] And the test for the subject is to detect the human [21:07] in one of the frames. [21:09] I suppose every one of you has seen that human in one [21:11] of the frames. [21:13] But think about how remarkable your eyes are, [21:15] or your brain is, because you've never seen this video. [21:19] I didn't tell you in which frame the target object would [21:22] appear. [21:23] I did not tell you what the target [21:24] object would look like, where it would be, its gestures, and all that. [21:28] Yet, you have no problem detecting the human. [21:34] And on top of that, these frames are [21:37] played at 10 hertz, which means you're [21:39] seeing every frame for only 100 milliseconds. [21:43] That is how remarkable our visual system is. [21:47] In fact, Simon Thorpe, another cognitive neuroscientist, [21:53] has measured the speed. [21:55] You hook people up with EEG caps, [21:58] show them complex natural scenes, [22:01] and ask the human subjects to categorize scenes [22:05] with animals [22:07] versus scenes without animals-- [22:10] hundreds of them. [22:11] And then you measure the brain waves. [22:13] It turns out, after 150 milliseconds of seeing a photo, [22:18] your brain already has a differential signal [22:22] that categorizes. [22:24] You might not be so impressed, [22:25] because compared to today's GPUs and modern chips, [22:29] 150 milliseconds is orders of magnitude slower. [22:34] But you've got to admire it. [22:37] Our wetware, our brain, our neurons [22:40] don't work as fast as transistors. [22:43] 150 milliseconds is actually really fast. [22:46] It's only a few hops in the brain [22:49] in terms of neural processing. [22:51] So yet again, this is telling us [22:53] humans are really good at seeing objects and categorizing them. [22:59] In fact, not only are we good at seeing objects [23:02] and categorizing them, we even developed specialized brain [23:05] areas that have expert ability in recognizing [23:10] faces or places or body parts. [23:13] And these are discoveries by MIT neurophysiologists in the 1990s [23:19] and early 21st century. [23:21] So all these studies tell us, well, we should not just [23:26] be studying those caricature-like shapes [23:30] or the sketches of images. [23:33] We really should go after the important fundamental problems [23:38] that drive visual intelligence. [23:40] And one of those problems that everything [23:43] has been pointing us to is object recognition-- [23:46] object recognition in natural settings. [23:49] There are a lot of objects out there in the world. [23:52] And studying this is going to be part [23:57] of unlocking visual intelligence. [24:00] And that's what we did. [24:01] As a field, we started by looking [24:04] at how we can separate foreground objects [24:08] from background objects. [24:09] This was called recognition by grouping, in the 1990s. [24:14] Keep in mind, we're still in the AI winter. [24:16] But research is actually happening and progressing.
[24:20] And then there were studies of features. [24:24] And some of you might still remember [24:27] SIFT features and matching. [24:29] And when I entered grad school, the most exciting thing [24:33] was face detection. [24:34] I remember that in my first year of grad school, [24:37] this paper was published. [24:39] And five years later, the first digital camera [24:42] to use this paper's algorithm delivered automatic face focus [24:49] because of face detection. [24:51] So things started to work and be taken into industry. [24:56] And then, around the early 21st century, [25:01] a very important thing started to happen: [25:04] the internet. [25:06] When the internet started to happen, data started to proliferate. [25:12] And the combination of digital cameras and the internet [25:16] started to give the field of computer vision [25:19] some data to work with. [25:22] So in those early days, we were working with thousands of images, [25:26] or tens of thousands of images, to study the visual recognition [25:30] problem, or the object recognition problem. [25:32] So you got data sets like the Pascal Visual Object [25:36] Challenge or Caltech 101. [25:40] I'm going to pause here. [25:43] This is where the first thread of computer vision [25:50] started to progress. [25:51] And you might be wondering, why is she pausing? [25:54] Because I'm going to come back and talk about deep learning. [25:57] So while this field of vision was progressing, [26:03] through neurophysiology to computer vision, [26:06] to cognitive neuroscience, to computer vision again, [26:11] a separate effort was going on in parallel. [26:14] And that eventually became deep learning. [26:17] It started from the early studies of neural networks, [26:22] things like the perceptron. [26:24] People like Rumelhart started to work on this. [26:29] And of course, Jeff Hinton, in his early days, [26:32] started to work with small numbers of artificial neurons [26:35] and look at how they can process information and learn. [26:41] And you've heard of great minds like Marvin Minsky [26:48] and his colleagues working on different aspects [26:52] of the perceptron. [26:54] But Marvin Minsky did say that perceptrons cannot learn [27:02] XOR logic functions. [27:05] And that caused a bit of a setback for neural networks. [27:10] Well, things continued to progress despite the setback. [27:14] And one of the most important works before the first inflection [27:21] point is the neocognitron work by Fukushima in Japan. [27:25] Fukushima hand-designed a neural network that looks like this. [27:31] It has about five or six layers. [27:35] And then he designed the different functions [27:41] across the layers, which you will [27:43] learn more about, and which more or less were [27:46] inspired by the visual pathway that I was describing. [27:50] Remember the cat experiment, from simple receptive fields [27:54] to more complicated receptive fields. [27:56] He was doing that here. [27:59] The early layers have simple functions. [28:01] And then the later layers [28:03] have more complex functions. [28:05] The simple ones we can call convolution-- [28:08] he uses the convolution function. [28:10] And in the more complex ones, he was pooling the information [28:13] from the convolution layers. [28:15] So the neocognitron was really an engineering feat, [28:19] because every parameter was hand-designed. [28:24] There were hundreds of parameters.
[28:26] He had to just meticulously put them together [28:29] so that this small neural network could [28:32] recognize digits or letters. [28:35] The real breakthrough came around that time, in 1986, [28:41] and it is a learning rule. [28:43] That learning rule is called backpropagation. [28:45] It's going to be one of our first classes. [28:47] Rumelhart, Jeff Hinton, and their colleagues [28:52] took the neural network architecture [28:58] and introduced an error-correcting objective function, [29:04] so that if you put in some input and know [29:07] what the correct output is, you can [29:10] take the difference between what the neural network outputs [29:14] and the actual correct answer, and then [29:17] propagate that information back so that you [29:22] can improve the parameters throughout the neural network. [29:28] And that propagation from the output [29:31] back through the entire neural network [29:33] is called backpropagation. [29:35] It follows the basic chain rule of calculus (a minimal code sketch of the idea appears at the end of this segment). [29:39] And that was a watershed moment for neural network algorithms. [29:47] And of course, we're still smack in the middle of the AI winter. [29:50] All this work was happening without public fanfare. [29:54] But of course, in the world of research, [29:57] these are very important milestones. [29:59] One of the earliest applications of neural [30:03] networks with backpropagation is Yann LeCun's convolutional [30:07] neural network, made in the 1990s when he was working [30:10] at Bell Labs. [30:11] What he did is create a slightly bigger network, [30:15] about seven layers-ish, and make it good enough, [30:20] with great engineering capability, to recognize letters. [30:25] And it was actually shipped to some US post [30:28] offices and banks to read digits and letters. [30:33] So that was an application of an early neural network. [30:37] And then Jeff Hinton and Yann LeCun [30:41] continued to work on neural networks. [30:43] It didn't go very far, [30:45] because despite these improvements and tweaks [30:52] to these neural networks, things more or less just stalled. [30:57] They collected a big data set of digits and letters. [31:00] And digit and letter recognition was kind of quasi-solved. [31:05] But if you put the system through the kind [31:08] of digital photos that the neuroscientists were using-- [31:11] to recognize cats and dogs and microwaves and chairs [31:14] and flowers-- it just didn't work. [31:17] And a huge part of this problem was the lack of data. [31:22] And lack of data is not just an inconvenience. [31:27] It's actually a mathematical problem, [31:29] because these algorithms are high-capacity algorithms that [31:36] actually need to be driven by lots of data [31:39] in order to learn to generalize. [31:42] And there are some deep mathematical principles [31:45] behind these rules of generalization and model [31:48] overfitting. [31:49] And data was underappreciated, was [31:52] overlooked, because most people were just [31:54] looking at the architectures. [31:56] They did not realize that data is [31:59] a first-class citizen of machine [32:02] learning and deep learning. [32:03] So this is part of the work that my students and I did [32:08] in the early 2000s: we recognized this importance [32:14] of data. [32:15] We hypothesized that the whole field was actually [32:21] missing this-- underappreciating the importance of data.
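To make the backpropagation idea described above concrete, here is a minimal NumPy sketch: a tiny two-layer network, an error-correcting squared-error objective, and gradients propagated backward through the layers via the chain rule. The sizes, the ReLU nonlinearity, and the learning rate are illustrative modern choices, not details from the 1986 work.

```python
import numpy as np

# A minimal sketch of backpropagation on a tiny two-layer network.
# All sizes and hyperparameters here are illustrative assumptions.
np.random.seed(0)
x = np.random.randn(4)            # one toy input
y = np.array([1.0])               # the known correct output
W1 = 0.1 * np.random.randn(3, 4)  # first-layer weights
W2 = 0.1 * np.random.randn(1, 3)  # second-layer weights

for step in range(100):
    # Forward pass: compute the network's prediction and its error.
    h = np.maximum(0, W1 @ x)               # hidden layer (ReLU)
    y_hat = W2 @ h                          # network output
    loss = 0.5 * np.sum((y_hat - y) ** 2)   # error-correcting objective

    # Backward pass: propagate the error from the output back
    # through each layer using the chain rule.
    d_y_hat = y_hat - y            # dLoss/dy_hat
    dW2 = np.outer(d_y_hat, h)     # dLoss/dW2
    d_h = W2.T @ d_y_hat           # dLoss/dh
    d_h[h <= 0] = 0.0              # chain rule through the ReLU
    dW1 = np.outer(d_h, x)         # dLoss/dW1

    # Improve the parameters by stepping against the gradient.
    W2 -= 0.1 * dW2
    W1 -= 0.1 * dW1

print(loss)  # far smaller than at the first step
```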
[32:24] So we went about and collected a huge data [32:27] set called ImageNet, which has 15 million images, [32:30] after cleaning a billion images. [32:32] And these 15 million images were sorted across 22,000 categories [32:38] of objects. [32:39] We actually studied a lot of the cognitive science and psychology [32:43] literature to appreciate that 22,000 categories [32:54] is roughly on the order [32:54] of the number of categories that humans learn to recognize [32:58] in the early years of their life. [33:00] And then we open-sourced this data [33:02] set and created an ImageNet challenge called the Large Scale [33:05] Visual Recognition Challenge. [33:07] We curated a subset of ImageNet of a million images, or a million [33:12] plus images, and 1,000 object classes, and then ran [33:16] an international object recognition challenge for many [33:21] years. [33:22] We asked researchers to participate, [33:26] and their goal was to create algorithms-- [33:29] it doesn't matter what kind of algorithms. [33:31] And we would test each algorithm's ability to recognize [33:35] photos, and see if it could call out these 1,000 object classes [33:40] as correctly as possible. [33:42] And here are the errors. [33:45] The first year we ran this competition, [33:53] the best-performing algorithm's error was nearly 30%. [33:57] And that's really pretty abysmal, because humans can perform [34:00] at under, say, 3% error. [34:03] And 2011 wasn't that exciting either. [34:07] But something happened in 2012. [34:09] That was the most exciting year. [34:12] That year, Jeff Hinton and his students [34:16] participated in this challenge using [34:18] a convolutional neural network. [34:20] And they reduced the error by almost half. [34:23] It truly showed the power of deep learning algorithms. [34:29] The participating algorithm in the 2012 ImageNet [34:34] challenge was called AlexNet. [34:36] And the funny thing is, if you look at AlexNet, [34:42] it's not that different from Fukushima's neocognitron [34:47] from 32 years earlier. [34:49] But two major things happened between these two. [34:54] One is that backpropagation happened. [34:57] It's a principled, mathematically rigorous learning [35:01] rule, so that you never have to tune [35:04] parameters by hand. [35:06] And that was a major theoretical breakthrough. [35:09] The other breakthrough was data. [35:14] The recognition that data drives [35:19] these high-capacity models-- which eventually would have [35:23] trillions of parameters, but at that time had millions [35:26] of parameters-- was critical for setting off deep learning [35:34] and making this work. [35:36] And really, many people consider the year 2012 [35:42] and the AlexNet algorithm that won the ImageNet [35:46] challenge the historical moment of the birth, [35:51] or rebirth, of modern AI, or the birth of the deep learning [35:54] revolution. [35:55] And of course, the reason many of you are here [35:59] is that since then, we have been in an era of deep learning explosion. [36:04] If you look at computer vision's main annual research [36:10] conference, called CVPR, [36:13] the number of papers has exploded. [36:15] And arXiv papers have exploded. [36:18] And many new algorithms have since [36:22] been invented to participate in the ImageNet [36:27] challenge in the following years. [36:28] We're going [36:29] to study some of these algorithms.
[36:31] But the point is that some of these [36:34] algorithms beyond AlexNet have had a profound impact [36:39] on the progress of the field of computer vision [36:43] and on the applications of computer vision. [36:49] So a lot of things have happened. [36:52] We're going to cover some of these. [36:54] Not only has the field of computer vision [36:57] made major progress in creating algorithms [37:01] to recognize everyday objects like cats and dogs and chairs-- [37:06] quickly, right after the ImageNet challenge, [37:10] the 2012 moment, we also got algorithms [37:14] that can recognize much more complicated images, [37:22] retrieve images, do multiple-object detection, [37:27] and do image segmentation. [37:30] These are all different tasks in visual recognition [37:34] that you'll find yourself getting [37:36] familiar with throughout this course, [37:38] because vision is not just calling out cats and dogs. [37:42] There is so much more in the nuanced ability of visual recognition. [37:48] And of course, vision is not just static images. [37:52] So there is work in video classification and human activity [37:57] recognition. [37:58] I'm showing you this overview. [38:00] You will learn some of these. [38:04] You don't have to understand exactly what's going on here. [38:08] But I want you to appreciate the variety of vision tasks. [38:14] Medical imaging-- as those of you who come from a medical field know, [38:20] whether it's radiology or pathology or even [38:24] other aspects of medicine, medicine is deeply visual. [38:28] And this has a profound impact. [38:31] Scientific discovery-- even the seminal picture [38:37] you probably remember, the first photograph of a black hole, [38:41] used a lot of computer vision and computational photography [38:46] techniques. [38:47] Of course, computer vision has also contributed a lot [38:52] to applications in sustainability and the environment. [38:58] And we also made a lot of progress [39:02] in image captioning right after that 2012 moment. [39:07] This is actually work by Andrej Karpathy when he was [39:09] my student-- his thesis work. [39:13] Then we also worked on relationship understanding. [39:19] So visual intelligence is not only [39:22] about seeing what's in the pixels; [39:24] you can also see what's beyond the pixels, [39:26] including relationships between objects, and also style transfer. [39:33] A lot of this work-- actually, [39:35] Justin Johnson, who will come to guest-lecture this course, [39:39] will tell you all about his seminal work in style transfer. [39:45] And of course, in the generative AI era, [39:48] we get these really incredible results, like face generation. [39:53] And this is from the very early days of image generation with Dall-E. [39:59] I think this is the early Dall-E. Of course, now, Midjourney [40:03] and everything else have gone beyond these avocado and peach chairs. [40:08] But really, we are squarely in the most exciting modern era [40:14] of AI explosion. [40:20] The three converging forces of computation, algorithms, [40:25] and data have taken this field [40:29] to a whole different level, where we're now [40:32] totally out of the AI winter. [40:36] I would say we're in an AI global warming period. [40:40] And I don't see any of this slowing down, [40:46] for both good and bad reasons.
[40:48] And also, just a word, because we are in Silicon Valley, [40:53] and we're in the very Huang building, in the NVIDIA [40:58] lecture hall-- we cannot ignore the progress [41:02] of hardware and the role it played. [41:05] So here is just the FLOPS-per-dollar graph for NVIDIA's GPUs. [41:14] Before 2020, the progress was steady. [41:19] But as soon as deep learning started [41:22] to drive these GPUs and chips, you [41:27] can see the GFLOPS have just completely taken off. [41:33] And by any measure, we are on this accelerating curve [41:40] of lots of compute as well as lots of AI. [41:45] And these are just different graphs [41:47] showing you conference attendees, startups, [41:50] and enterprise applications in AI, across [41:54] not just computer vision [41:55] but also NLP and other fields, all of which have just exploded. [42:02] So quickly, last but not least: it's been exciting, [42:06] and there have been a lot of successes, [42:08] but there is still a lot to be done in computer vision. [42:11] This problem is still not totally solved. [42:14] And with great tools come great consequences as well. [42:19] So computer vision can do a lot of good. [42:24] But it also can do harm. [42:26] For example, human bias-- [42:28] every single AI algorithm today, the large ones, [42:32] is driven by data. [42:33] And data is an artifact of human activities [42:38] on Earth and in history. [42:40] And a lot of the data carries our biases. [42:43] And this gets carried into AI systems. [42:47] We have seen a lot of face recognition algorithms exhibiting [42:50] the same kinds of bias that humans have. [42:52] And we do have to really recognize that. [42:55] We can also use AI to impact human lives, some for the good-- [43:01] think about medical imaging-- [43:02] but some in questionable ways. [43:05] What if AI were solely behind deciding your job [43:09] or deciding your financial loans? [43:11] So again, is it totally bad? [43:15] Is it totally good? [43:17] These are very complicated issues. [43:19] This is also why I always get so excited when students from the med school [43:23] or the law school or the education school or the business school [43:26] attend my class, because not all AI [43:29] issues are engineering issues. [43:31] We have a lot of human factors and societal issues to solve. [43:36] I'm also particularly excited by AI's use in medicine and health [43:40] care. [43:41] This is something really dear to my heart. [43:43] Professor Adeli and Zane, who are [43:46] also co-instructors of this course, and I-- the three of us-- [43:49] work on AI for the aging population as well as [43:53] patients, trying to use computer vision to deliver care [43:59] to people. [44:00] So this is a good use. [44:01] And also, even in terms of technology, [44:04] human vision is remarkable. [44:07] I want you to come out of not only today's class [44:10] but also this entire course appreciating that, [44:14] despite how much computer vision can do, [44:16] there's just so much more nuance, subtlety, richness, [44:22] complexity, and also emotion in human vision. [44:26] Look at these kids studying whatever [44:29] their curiosity leads them to, or the humor in this image. [44:33] There's still a lot that computer vision cannot do. [44:36] So I hope that continues to entice [44:38] you to study computer vision. [44:40] At this point, I'm going to give the podium to Professor Adeli [44:45] to go over the rest of the class. [44:48] Thank you. [44:49] [APPLAUSE] [44:50] Awesome. [44:51] Thank you, Fei-Fei.
[44:55] A great start to the quarter. [44:57] And I hope my microphone is working right now. [45:00] OK, good. [45:01] I'm seeing some nodding of heads. [45:05] So I'm very excited to be here with you all. [45:13] And I'm hoping that you will have a fun [45:18] and challenging course with the amazing list of co-instructors [45:23] that we have and great TAs. [45:26] So in this class, we are going to cover [45:31] a wide variety of topics around computer vision and the use [45:34] of deep learning in this space, categorized [45:37] into four different topics. [45:41] We will start with deep learning basics. [45:45] And let's actually start with a simple question: [45:48] what is computer vision, really? [45:52] At its core, it's about enabling machines [45:57] to see and understand images. [46:00] And basically, the most fundamental task in this space [46:09] is image classification. [46:13] You give the model an image, say, of a cat. [46:17] And the model should output the label "cat." [46:21] And that's it. [46:23] But this deceptively simple task is the foundation [46:29] for many more complex applications, [46:32] from self-driving to medical diagnosis and so on. [46:36] So how do we teach a machine to do this? [46:40] One of the simplest approaches is to use linear classification, [46:44] as you can see in this slide. [46:48] So imagine each of the images in our data set [46:53] is shown as a dot in this space. [46:57] And each axis shows some sort of feature [47:02] that was derived from the image itself. [47:05] Here, we are showing a 2D space for simplicity. [47:09] But the task of a linear classifier [47:12] is to find the hyperplane, or the linear function, [47:17] that separates these two classes, say, cats from dogs. [47:23] But we all know that these linear models often [47:26] only go so far. [47:29] They struggle when the data isn't cleanly separable [47:32] with a straight line. [47:33] So the question is, what's next? [47:36] We'll get into the topics of how to model more complex patterns. [47:44] And if we do so, we often face the challenges [47:49] of overfitting and underfitting, which [47:54] are topics we will cover in the early lectures of the class. [47:59] And to strike the right balance, we [48:05] use techniques like regularization, [48:08] to control model complexity, and optimization, to find the best-fit [48:14] parameters. [48:16] So these are the nuts and bolts of deep learning: creating [48:21] and training models that not only fit the data [48:26] but also generalize to new, unseen data as well. [48:31] And now comes the fun part-- [48:33] neural networks. [48:34] We've been talking about them quite a lot. [48:38] What neural networks do, unlike linear classifiers, [48:43] is stack multiple layers of operations [48:47] to model non-linear functions, to be [48:54] able to solve the same problem of image [48:59] classification, and so on (a minimal sketch of the linear classifier and this non-linear extension follows below). [49:04] These are the models powering everything from Google Photos [49:09] to-- as everybody is now familiar with-- ChatGPT's vision [49:13] models, and so on. [49:15] In this course, we will go deep into the details of how they [49:24] work and how they are trained. [49:26] And we will be looking into debugging and improving them.
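As a concrete illustration of the score function and regularized loss just described, here is a hedged NumPy sketch. The dimensions, the softmax cross-entropy loss, and the regularization strength are illustrative choices, not the exact setup used in the assignments.

```python
import numpy as np

# Sketch of a linear classifier: flatten each image into a feature
# vector x and score every class with f(x) = Wx + b. All numbers
# below are illustrative assumptions.
D, C, N = 3072, 3, 5           # e.g. 32x32x3 images, 3 classes, batch of 5
rng = np.random.default_rng(0)
X = rng.standard_normal((N, D))        # toy stand-ins for flattened images
y = rng.integers(0, C, size=N)         # their correct class labels
W = 0.01 * rng.standard_normal((D, C))
b = np.zeros(C)

scores = X @ W + b                     # one score per class, per image

# Softmax cross-entropy loss plus an L2 regularization term -- the
# "control model complexity" knob mentioned above.
reg = 1e-3
shifted = scores - scores.max(axis=1, keepdims=True)   # numerical stability
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
loss = -np.log(probs[np.arange(N), y]).mean() + reg * np.sum(W * W)
print(loss)

# A neural network swaps the single linear map for stacked layers with
# a nonlinearity in between, e.g.: scores = np.maximum(0, X @ W1) @ W2
```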
[49:31] After looking at the deep learning basics, [49:35] we will cover the topics of perceiving and understanding [49:39] the visual world, which is a complex process that [49:44] involves interpreting a vast array of visual information. [49:49] To do so, we often first define [49:52] tasks, which refer to specific challenges or problems [49:56] we aim to solve. [49:59] Some examples are object detection, scene understanding, [50:02] motion detection, and so on. [50:03] And to solve these tasks, we use different models, which [50:10] are the computational and theoretical [50:13] frameworks we develop to mimic or explain [50:17] how our visual system accomplishes these tasks. [50:22] One example of these types of models [50:25] is the neural network. [50:30] So by aligning models with tasks, [50:36] we can create systems that can see and interpret [50:41] the world around us. [50:43] Speaking of tasks, let's go back to the topic [50:48] of image classification: predicting a single label [50:53] for an entire image. [50:56] But we know that real-world computer vision [50:59] is much richer than this. [51:02] So let's walk through some of the tasks that [51:05] go beyond classification. [51:06] First, semantic segmentation, where we are not just [51:13] labeling the object or the entire image [51:17] as cat or tree or whatever. [51:19] Here, we are looking for labels for every single pixel [51:25] in the image. [51:25] So every pixel is grass, cat, tree, or sky. [51:30] But we don't distinguish between individual objects. [51:34] Next, we have object detection, [51:38] where we now want to not only say what is in the image [51:45] but also pinpoint the locations. [51:47] That's why we create bounding boxes [51:49] around the objects and associate them with specific labels. [51:54] And finally, we have instance segmentation, [52:01] which is the most granular of them all. [52:04] It combines the ideas of detection and segmentation [52:08] together. [52:09] Every object instance gets its own mask. [52:13] So these tasks require much deeper spatial understanding [52:20] of images. [52:21] And they push the models to do more than just [52:23] recognize categories. [52:27] The complexity doesn't stop with static images. [52:30] Let's look at the temporal dimension. [52:33] There's the task of video classification, [52:36] as Fei-Fei talked about, where we want to understand [52:40] what's happening in a video. [52:42] Is someone running, jumping, or dancing? [52:47] There is the topic of multimodal video understanding, [52:51] which combines vision, sound, and other modalities. [52:56] For example, here the person [53:00] is playing a vibraphone; to really understand [53:04] what's happening, [53:05] we have to blend visual features [53:08] and audio features. [53:11] And finally, there is the topic of visualization [53:14] and understanding that we will be covering in this class, where [53:19] we want to interpret what's being learned by the models [53:24] and see an attention map of what [53:31] the model is attending to in order to produce a correct classification, [53:35] and so on. [53:36] And then, beyond tasks, [53:39] we look into models. [53:41] And the very first topic that we'll be covering-- [53:46] let me introduce it to you-- is Convolutional Neural Networks, [53:50] or CNNs (a minimal sketch follows below).
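Before those operations are walked through, here is a hedged PyTorch sketch of a tiny CNN of this general shape. The layer counts, kernel sizes, and the 32x32 input are illustrative choices, not the reference architecture used in the course.

```python
import torch
import torch.nn as nn

# A minimal, illustrative CNN: convolution layers extract local
# patterns, pooling subsamples the feature maps, and a fully
# connected layer maps the result to class scores.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # image -> feature maps
    nn.ReLU(),
    nn.MaxPool2d(2),                             # subsample 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                   # fully connected -> scores
)

x = torch.randn(1, 3, 32, 32)   # one toy 32x32 RGB image
print(model(x).shape)           # torch.Size([1, 10]): one score per class
```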
[53:51] These involve a number of operations, [53:52] and we will be going over the details [53:55] in class: starting from an image, a number of convolution, [53:59] subsampling, and fully connected operations, [54:01] finally producing the output. [54:05] And beyond convolutional neural networks, [54:08] we will study recurrent neural networks, for sequential data, [54:14] and newer architectures, such as transformers [54:19] and attention-based frameworks. [54:24] Next, we will be covering some large-scale distributed [54:29] training topics, which is kind of new this quarter. [54:34] I'm sure you've all heard about large language models, [54:38] large vision models, and so on. [54:40] And we will briefly discuss [54:44] how these models are actually trained. [54:47] We know that data and data sets are expanding, [54:51] and models are becoming larger and larger. [54:56] In order to train such models, [54:59] there are some strategies-- [55:02] for example, data parallelism [55:04] and model parallelism-- that we will cover in this class. [55:07] But beyond that, there are many challenges, [55:11] such as synchronization between these models and workers [55:15] and so on, as well as several other aspects [55:20] that we'll be covering in one of the lectures this quarter. [55:25] And we will also go over some of the trends for training [55:31] these large models. [55:33] After completing this topic, what we will do [55:36] next is look into generative and interactive visual [55:44] intelligence, where we will first start [55:48] with self-supervised learning. [55:52] Self-supervised learning is a branch of machine learning [55:55] in which models learn to understand and represent data [56:00] by getting training signals from the data itself. [56:04] We will cover this topic. [56:06] It's one of the approaches that has enabled the training [56:10] of large-scale models using vast amounts of data that do not [56:15] require labels-- unlabeled data. [56:18] And it has played a key role in recent breakthroughs [56:23] in computer vision in general. [56:26] And we will talk a little bit about generative models. [56:30] They go beyond recognition. [56:33] They actually generate. [56:35] This is an example of the content of a Stanford campus [56:39] photo, reimagined in the style of Van Gogh's Starry [56:44] Night. [56:45] This is known as style transfer, a classic application [56:49] of neural generative techniques. [56:54] Generative models can now translate language [56:58] into images: given a prompt, [57:03] a model like Dall-E or Dall-E 2 generates an entirely novel [57:07] image. [57:09] This showcases how generative vision models [57:12] blend understanding, creativity, and control [57:16] in their generations. [57:19] And you've probably heard recently [57:22] about the topic of diffusion models in general. [57:26] That's another thing that we'll be covering this quarter. [57:33] They basically learn to reverse a gradual noising [57:37] process in order to generate images. [57:40] And interestingly, in assignment 3, [57:43] you will actually be implementing a generative model [57:46] that generates emojis from text inputs, [57:53] from prompts-- for example, a face with a cowboy hat-- which [57:57] is denoised from pure noise. [58:01] Vision-language models are the next topic of interest [58:06] we will be covering. [58:08] They connect text and images in a shared representation space (a minimal sketch follows below).
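To illustrate what a shared representation space means in practice, here is a hedged, CLIP-style sketch. The two linear "encoders," the feature dimensions, and the batch of four pairs are all placeholders; real vision-language models use large pretrained image and text encoders trained contrastively on many pairs.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of a shared image-text embedding space: each encoder
# maps its input into the same embedding dimension, and matching
# pairs are scored by cosine similarity. Both "encoders" below are
# toy stand-ins, not real pretrained models.
image_encoder = torch.nn.Linear(2048, 512)   # stand-in for a vision model
text_encoder = torch.nn.Linear(768, 512)     # stand-in for a text model

image_feats = torch.randn(4, 2048)           # 4 toy image feature vectors
text_feats = torch.randn(4, 768)             # their 4 toy caption features

img_emb = F.normalize(image_encoder(image_feats), dim=-1)
txt_emb = F.normalize(text_encoder(text_feats), dim=-1)

# Cosine similarity between every image and every caption; retrieval
# means taking the argmax along a row (image -> caption) or a column
# (caption -> image).
sim = img_emb @ txt_emb.T                    # shape (4, 4)
print(sim.argmax(dim=1))                     # best caption per image
```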
[58:16] Given a caption or an image, the model [58:19] retrieves or generates its corresponding pair, [58:24] as you can see. [58:25] There are a lot of advances in this area. [58:29] We'll be covering some of the key examples. [58:32] Again, this is a key capability for cross-modal retrieval [58:37] and understanding, visual question answering, and so on. [58:41] So we'll get to that in the class, too. [58:44] Moving beyond 2D, models can now reconstruct and generate 3D [58:52] representations from images. [58:55] Here, you can see some voxel-based reconstructions, [59:00] shape completion, and even 3D object detection from single-view [59:06] images. [59:09] So 3D vision enables more spatially grounded [59:14] understanding, which is crucial for robotics and AR/VR [59:19] applications. [59:20] And finally, vision empowers embodied agents [59:26] that act in the physical world. [59:30] These models often must perceive, plan, [59:35] and execute, whether it's cleaning up a messy room [59:41] or generalizing from human demonstrations. [59:44] So with all of these, we will be covering different topics [59:50] around generative and interactive visual intelligence. [59:53] And finally, we will cover some human-centered applications [01:00:00] and implications, as Fei-Fei very nicely explained. [01:00:05] So computer vision, [01:00:08] and AI generally, have been having a lot [01:00:12] of impact in the past years. [01:00:16] And it's very important to understand [01:00:18] the human-centered aspects and applications. [01:00:21] Some of these impacts are reflected [01:00:24] by the awards that have gone to researchers in this space. [01:00:32] This was first recognized by the 2018 Turing Award, which [01:00:38] is the most prestigious technical award, given [01:00:41] for major contributions of lasting importance [01:00:45] to computing. [01:00:47] Geoffrey Hinton, Yoshua Bengio, and Yann LeCun [01:00:50] received the award for the conceptual and engineering [01:00:54] breakthroughs that have made [01:00:57] deep neural networks a critical component of computing. [01:01:01] Beyond that, last year, in 2024, Geoffrey Hinton [01:01:06] was jointly awarded the Nobel Prize in Physics [01:01:11] alongside John Hopfield for their foundational contributions [01:01:14] to neural networks. [01:01:17] And finally, I want to very briefly mention the learning [01:01:21] objectives for this class: formalizing computer vision [01:01:27] applications into tasks; [01:01:30] as you can see in some of the details here, [01:01:33] developing and training vision models-- models [01:01:38] that operate on images and visual data, [01:01:41] images, videos, and so on-- [01:01:43] and gaining an understanding of where the field is [01:01:46] and where it is headed. [01:01:48] That's why we also have some new topics covered specifically [01:01:53] this year. [01:01:56] So for the four topics that I mentioned earlier, [01:02:01] we will be going over the basics in the very first few weeks. [01:02:06] Bear with us, because these are important topics. [01:02:09] And you need to understand the details first-- [01:02:12] how to build the models from scratch. [01:02:15] And then we'll get to the more interesting, exciting topics [01:02:19] of the day in [01:02:20] computer vision. [01:02:21] And finally, we'll have one big lecture on human-centered AI [01:02:27] and computer vision. [01:02:30] I want to just leave you with what we [01:02:33] will be covering next session.
[01:02:34] That's going to be image classification [01:02:38] and linear classifiers, which will get us started [01:02:43] with the world of CS231n. [01:02:45] Thank you.