This is CS231n, and I'm Professor Fei-Fei Li from the Computer Science Department. I will be co-teaching this quarter with Professor Ehsan Adeli and my graduate student Zane. You'll meet them, as well as our wonderful TA team, later on.

So I just want to get started. This is what excites me: AI has become such an interdisciplinary field. What you're going to learn in this class is, of course, very technical. It's about computer vision and deep learning. But I really do hope that you take it to whichever discipline you work in and are passionate about, and apply it.

We hear a lot about the field of AI. So how do we position computer vision and the scope of this class? If you consider AI as this big bubble, computer vision is very much an integral part of AI. Some of you have heard me say that not only is vision part of intelligence, it's a cornerstone of intelligence. Unlocking the mystery of visual intelligence is unlocking the mystery of intelligence.
But one of the most important mathematical tools for solving AI is machine learning, or what some people call statistical machine learning. And this is exactly what we will be talking about. Within the field of machine learning, in the past 10-plus years, we have seen a major revolution called deep learning. I'll explain a little bit of what deep learning is. Deep learning is a set of algorithmic techniques built around a family of algorithms called neural networks.

So if you ask me to pinpoint the scope of this class: we'll not be able to cover the entirety of computer vision, nor the entirety of machine learning or deep learning. But we're going to cover the core intersection of these two fields. And of course, just like the entirety of AI, computer vision is becoming more and more an interdisciplinary field.
A lot of the techniques we use, as well as the problems we work on, intersect with many other fields, like natural language processing, speech recognition, and robotics. And AI as a whole is a field that intersects with mathematics, neuroscience, computer science, psychology, physics, biology, and many application areas, from medicine to law to education and business and so on.

So what you will get in this lecture, our first lecture, is a very brief history of computer vision and deep learning. Then Professor Adeli will give an overview of this course and lay the groundwork for how the course is set up and what our expectations are.

So, the history of vision did not begin when you were born, or when humanity was born. The history of vision began 540 million years ago. You might ask: what happened 540 million years ago? Why are we pinpointing a relatively specific point in evolution? Well, it's because fossil studies have shown us that there was a mysterious period called the Cambrian explosion.
The fossil record shows that during that time, over about 10 million years, which is a very short period for evolution, there was an explosion of animal species. Before the Cambrian explosion, life on Earth was pretty chill. It was actually in the water; there were no animals on land yet, and animals just floated around. So what caused this explosion in animal speciation? There were many theories, from climate to the chemical composition of the ocean water. But one of the most compelling theories was the onset of eyes. The first animals, trilobites, gained photosensitive cells. The eyes we're talking about were not sophisticated lenses and retinas and nerve cells. It was literally a very simple pinhole, and that pinhole collected light.

Once you collect light, life is completely different. Without sensors, life is metabolism. It's very passive; it is just metabolism, and you come and go. With sensors, you become an integral part of the environment, one you might want to change, one you might want to actually survive in.
Some animals or plants become your dinner, and you become someone else's dinner. So evolutionary forces drive intelligence to evolve, because of the onset of sensors, because of the onset of vision, along with haptic or tactile sensing. Those are the oldest senses for animals. So that entire course of 540 million years of evolution of vision is the evolution of intelligence. Vision, as one of the primary senses of animals, drove the development of the nervous system, the development of intelligence. Almost all animals on Earth today that we know of have vision, or use vision as one of their primary senses. Humans are especially visual animals: more than half of our cortical cells are involved in visual processing, and we have a very complex and convoluted visual system. So this is what excited me to enter the field of vision, and I hope it excites you.

So now, let's fast forward from the Cambrian explosion to human civilization. Humans do innovate. Not only do we see, we want to build machines that see.
So here are a couple of drawings by, of course, Leonardo da Vinci, who was just forever curious about everything. He studied the camera obscura, along with how to make steam machines. In fact, even way before him, in ancient Greece and in ancient China, we have documents of thinkers and philosophers reasoning about how to project objects through pinholes and create images of them. And of course, in our modern life, cameras have truly exploded. But cameras are not enough for seeing, just like eyes are not enough for seeing. These are apparatus. We need to understand how visual intelligence happens, and that's really the crux of this course.

So let's talk a little bit about the history that brought us to this intersection of deep learning and computer vision. Let me go back to the 1950s. In the 1950s, a set of critically important experiments happened in neuroscience: the study of the visual pathways of mammals, especially the seminal work by Hubel and Wiesel. They inserted electrodes into live, anesthetized cats.
Then they studied the receptive fields of neurons in the primary visual cortex. What they learned, to their surprise, were two very important things.

One is that neurons responsible for seeing in the primary visual cortex have their own individual receptive fields. A receptive field means that for every neuron, there is a part of visual space it actually sees. It's not all of the space, and it's not very big; it tends to be a very confined patch. And within that patch, the neuron responds to specialized, simple patterns when you're measuring from the early part of the visual pathway. By and large, in the primary visual cortex, which is around here at the back of the head, not near your eyes, those patterns are oriented edges, or moving oriented edges. So some neurons will see an edge at one orientation, and some will see an edge at another. And that's how the computation in the brain begins.

The second thing they learned is that the visual pathway is hierarchical. As you move along the visual pathway, the neurons feed into other neurons.
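As an aside (this is my own minimal sketch, not lecture code): an oriented-edge receptive field of the kind Hubel and Wiesel found can be modeled as convolving an image with a small oriented filter. The filter values and the toy image below are made up purely for illustration.

```python
# Sketch: model an "oriented edge" receptive field as a small filter
# slid over an image, the operation at the heart of convolutional nets.

def convolve2d(image, kernel):
    """Valid 2D convolution (strictly, cross-correlation, which is what
    neural-network layers actually compute) over nested lists."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(ih - kh + 1):
        row = []
        for x in range(iw - kw + 1):
            s = sum(image[y + j][x + i] * kernel[j][i]
                    for j in range(kh) for i in range(kw))
            row.append(s)
        out.append(row)
    return out

# A vertical-edge filter: responds where brightness changes left-to-right,
# and not at all in uniform regions.
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

# Toy image: dark on the left half, bright on the right half.
image = [[0, 0, 0, 9, 9, 9] for _ in range(6)]

response = convolve2d(image, vertical_edge)
# Each output row is [0, 27, 27, 0]: the strongest response sits
# exactly where the dark-to-bright edge is, zero elsewhere.
```

A filter rotated 90 degrees would instead respond to horizontal edges, which is the sense in which different neurons "see" edges at different orientations.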
And the neurons in the higher, or deeper, layers of the visual hierarchy have more complex receptive fields. So if you begin with oriented edges, those might feed into a corner receptor, which might feed into an object receptor. I'm oversimplifying, but that's the concept: neurons feed into each other, and they create this big network of computation. Of course, most of you sitting here are already thinking that the way I've been describing this will have a profound impact on the neural network modeling of visual algorithms.

Let's keep going. That was the year 1959, a very early study of seeing. By the way, about 30 years later, maybe not quite, 20-something years later, Hubel and Wiesel won the Nobel Prize in Medicine for this work, for uncovering the principles of visual processing.

Another milestone in the early history of computer vision was the first PhD thesis in computer vision. Most people attribute it to Larry Roberts, who in 1963 wrote the first PhD thesis on the topic, studying shape. And this is a very, very simplified representation of the world.
The idea was: can we take a shape like this and understand the surfaces and the corners and the features of that shape? It's intuitive; humans do it. So an entire PhD thesis was devoted to this. And that's the beginning of computer vision.

Around that time, in 1966, an MIT professor created a summer project at MIT and proposed to hire a few very smart undergrads to study vision. The goal was pretty much to solve computer vision, or solve vision, in one summer. Of course, just like the rest of the history of AI, we tend to be overoptimistic about what we can do in a short period of time. So vision did not get solved that summer. In fact, it has blossomed into an incredible computer science field. If you go to our annual conferences now, more than 10,000 people attend. But the 1960s, between Larry Roberts's PhD thesis and this kind of project, is what we in our field consider the beginning of the field of computer vision.

A seminal book was written in the 1970s by David Marr, who unfortunately died too early.
He wanted to study vision systematically and to consider how visual processing happens. Even though it is not explicitly stated, there is a lot of inspiration from neuroscience and cognitive science. He was thinking about: if you take an input image, how do we visually process and understand it? Maybe the first layer is more like edges, just like we saw; he calls it the primal sketch. Then there is a 2-and-1/2-D sketch, which separates the different depths of the objects in the image. So the ball is the foreground object, and the ground here is the background. That's the 2-and-1/2-D sketch. And then, finally, David Marr believed the grand holy-grail victory of solving vision is to recover the entire, full 3D representation. And that is actually the hardest part of vision.

Let me digress for 20 seconds. Because if you think about vision, for all animals it's an ill-posed problem.
Ever since the early trilobites collected light underwater, the world, through photons, has been projected onto a surface that is more or less 2D. Back then it was just some patch on the animal; for us now, it's the retina. But the actual world is 3D. So recovering 3D information, the entire 3D world, from 2D images is the fundamental problem nature had to solve, and computer vision has to solve. And mathematically, that's an ill-posed problem. So what did we later do? Anybody have a wild guess?

[INAUDIBLE]

Yes. The trick nature came up with is to develop multiple eyes, mostly two; some animals have more than two. And then you triangulate information. But two eyes are not enough: you actually have to understand correspondences and all that. We'll touch on some of these topics, but there are other computer vision classes that Stanford offers which specifically cover 3D vision. The point is, it's a very hard problem, and we have to solve it. Nature has solved it.
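To make the triangulation idea concrete (my own sketch, not part of the lecture): in an idealized, rectified two-camera setup, depth follows from similar triangles as Z = f * B / d, where f is the focal length in pixels, B is the baseline between the two "eyes", and d is the disparity, the horizontal shift of the same point between the two images. The camera numbers below are hypothetical; and finding d in the first place is exactly the correspondence problem mentioned above.

```python
# Sketch: depth from stereo disparity under an idealized rectified setup.
#   Z = f * B / d
# Larger disparity (bigger shift between the two views) means closer.

def depth_from_disparity(f_px, baseline_m, disparity_px):
    """Return depth in meters for a rectified stereo pair."""
    if disparity_px <= 0:
        raise ValueError("zero disparity would place the point at infinity")
    return f_px * baseline_m / disparity_px

# Hypothetical camera: 700-pixel focal length, eyes 6.5 cm apart.
z_near = depth_from_disparity(700, 0.065, 91.0)   # large shift -> 0.5 m
z_far = depth_from_disparity(700, 0.065, 4.55)    # small shift -> 10 m
```

Note how quickly disparity shrinks with distance: beyond a few meters, a short baseline like two eyes gives only sub-pixel shifts, which is one reason metric depth from vision alone stays imprecise.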
Humans have solved it, but not to extreme precision. In fact, humans are not that precise. I roughly know the 3D shapes around me, but I don't have geometric precision for all of them. So that's one thing to consider, and to appreciate how hard this problem is.

Another difference between computer vision and language is something philosophically subtle. Language doesn't exist in nature. You cannot point to something and say, there is language. Language is a purely generated thing. I don't even know what word to use; it comes through our brain. It's generated. It's 1D. It's sequential. This actually has profound implications for the latest wave of GenAI algorithms. This is why LLMs, which are outside the scope of this class, are so powerful: because we can model language that way. But vision is not generated. There is actually a physical world out there, respecting the laws of physics and materials and all that. So vision has very different tasks.
So I just want you to appreciate the difference between language and vision, and, frankly, to appreciate how nature solved this problem. Let's keep going.

In the 1970s, the early pioneers of computer vision, without data, without much in the way of powerful computers, without the mathematical advances we have today, were already beginning to attack some of the harder problems of computer vision, for example, the recognition of objects. Here at Stanford, one of the pioneering works was called generalized cylinders, by Rodney Brooks and Tom Binford. And as it happens, Rodney Brooks is on campus today, over there, giving a talk at a robotics conference. He went on to become one of the greatest roboticists of our time and was a founder of iRobot, the company behind Roomba and many other robots. And not very far from us, in another part of Palo Alto, researchers worked on similarly compositional models of human bodies and objects.

Then, in the 1980s, digital photos started to appear, or at least photos that people could digitize a little bit. And there was some great work on edge detection.
You look at all this and probably feel a sense of disappointment. I mean, it seems kind of trivial to get some sketches and edges, and it's not really going anywhere. That's how computer vision worked at that time. And in fact, you're not so wrong. Around that time, before many of you were born, we entered the AI winter. The field entered the AI winter because the enthusiasm, and hence the funding, for AI research really dwindled. A lot of things didn't deliver. Computer vision didn't deliver. Expert systems didn't deliver. Robotics didn't deliver.

But under the hood of this winter, a lot of research started to grow in different fields, like computer vision, NLP, and robotics. So let's also look at another strand of research that had a profound implication for computer vision: cognitive science and neuroscience continued to blossom. And what is really important, especially for the field of computer vision, is that cognitive science and neuroscience started to point to the North Star problems we should work on.
For example, psychologists have told us there's something special about seeing nature, seeing the real world. This is a study by Irv Biederman, which shows that the detection of bicycles in two images differs depending on whether the images are scrambled or not. Think about it: from a photon point of view, these two bicycles land in the same location on your retina. But somehow the rest of the image impacts the viewer's ability to see the target object. So something is telling us that seeing the entire forest, the entire world, impacts the way we see objects.

It also tells us visual processing is very fast. Here's another, more direct measure of how fast we detect objects. This is an early 1970s experiment showing people a video, and the task for the subject is to detect the human in one of the frames. I suppose every one of you has seen that human in one of the frames. But think about how remarkable your eyes are, or your brain is, because you've never seen this video. I didn't tell you in which frame the target object would appear.
I did not tell you what the target object would look like, where it would be, its gestures, and all that. Yet you have no problem detecting the humans. And on top of that, these frames are played at 10 Hertz, which means you're seeing every frame for only 100 milliseconds. That is how remarkable our visual system is.

In fact, Simon Thorpe, another cognitive neuroscientist, has measured the speed. You hook people up with EEG caps, show them hundreds of complex natural scenes, and ask them to categorize the scenes: those with animals versus those without animals. And then you measure the brain waves. It turns out that after just 150 milliseconds of seeing a photo, your brain already has a differential signal that categorizes it. You might not be so impressed, because compared to today's GPUs and modern chips, 150 milliseconds is orders of magnitude slower. But you have to admire it. Our wetware, our brain, our neurons don't work as fast as transistors. 150 milliseconds is actually really fast.
It's only a few hops in the brain in terms of neural processing. So yet again, this tells us humans are really good at seeing objects and categorizing them. In fact, not only are we good at seeing and categorizing objects, we have even developed specialized brain areas with expert ability in recognizing faces, or places, or body parts. These are discoveries by MIT neuroscientists in the 1990s and early 21st century.

So all these studies tell us that we should not just be studying these kinds of simplified shapes or sketches of images. We really should go after the important, fundamental problems that drive visual intelligence. And one of those problems that everything has been pointing us to is object recognition, object recognition in natural settings. There are a lot of objects out there in the world, and studying this is going to be part of unlocking visual intelligence. And that's what we did. As a field, we started by looking at how we can separate foreground objects from background objects. This is called recognition by grouping, in the 1990s.
Keep in mind, we were still in the AI winter, but research was actually happening and progressing. Then there were studies of features; some of you might still remember SIFT features and feature matching. And when I entered grad school, the most exciting thing was face detection. I remember that in my first year of grad school, this paper was published. And five years later, the first digital cameras used this paper's algorithm to deliver automatic face focus, thanks to face detection. So things started to work and to be taken up by industry.

Then, around the early 21st century, a very important thing started to happen: the internet. When the internet started to happen, data started to proliferate. And the combination of digital cameras and the internet started to give the field of computer vision some data to work with. So in those early days, we were working with thousands, or tens of thousands, of images to study the visual recognition problem, or the object recognition problem. That's when you got data sets like the PASCAL Visual Object Classes challenge, or Caltech 101. I'm going to pause here.
454 00:25:43,609 --> 00:25:50,059 And this is where the first thread of computer vision 455 00:25:50,059 --> 00:25:51,059 started to progress. 456 00:25:51,059 --> 00:25:54,419 And you might be wondering, why is she pausing? 457 00:25:54,420 --> 00:25:57,300 Because I'm going to come back and talk about deep learning. 458 00:25:57,299 --> 00:26:03,169 So while this field of vision was progressing 459 00:26:03,170 --> 00:26:06,980 through neurophysiology to computer vision, 460 00:26:06,980 --> 00:26:11,490 to cognitive neuroscience, to computer vision again, 461 00:26:11,490 --> 00:26:14,980 a separate effort was going on in parallel. 462 00:26:14,980 --> 00:26:17,380 And that eventually became deep learning. 463 00:26:17,380 --> 00:26:22,870 It started from these early studies of neural networks, 464 00:26:22,869 --> 00:26:24,269 things like the perceptron. 465 00:26:24,269 --> 00:26:29,799 And people like Rumelhart started to work. 466 00:26:29,799 --> 00:26:32,139 And of course, Jeff Hinton in his early days 467 00:26:32,140 --> 00:26:35,400 started to work with a small number of artificial neurons 468 00:26:35,400 --> 00:26:41,009 and look at how that can process information and learn. 469 00:26:41,009 --> 00:26:48,269 And you've heard of great minds like Marvin Minsky 470 00:26:48,269 --> 00:26:52,619 and his colleagues working on different aspects 471 00:26:52,619 --> 00:26:54,549 of the perceptron. 472 00:26:54,549 --> 00:27:02,849 But Marvin Minsky did say that perceptrons cannot learn 473 00:27:02,849 --> 00:27:05,219 the XOR logic function. 474 00:27:05,220 --> 00:27:10,130 And that caused a little bit of a setback in neural networks. 475 00:27:10,130 --> 00:27:14,670 Well, things continued to progress despite the setback. 476 00:27:14,670 --> 00:27:21,529 And one of the most important works before the first inflection 477 00:27:21,529 --> 00:27:25,889 point is the neocognitron work by Fukushima in Japan. 
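Minsky's XOR observation mentioned above can be made concrete with a few lines of code (an illustrative sketch, not from the lecture): a single perceptron is a linear threshold unit, and no line separates XOR's positive and negative cases, while one extra layer of units solves it.

```python
import numpy as np

# XOR truth table: no single line separates the 0s from the 1s.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

def perceptron(x, w, b):
    # A single linear threshold unit.
    return int(np.dot(w, x) + b > 0)

# Try to train one perceptron with the classic update rule.
w, b = np.zeros(2), 0.0
for _ in range(100):
    for xi, yi in zip(X, y):
        err = yi - perceptron(xi, w, b)
        w += err * xi
        b += err

single_layer = [perceptron(xi, w, b) for xi in X]
print("single perceptron:", single_layer)  # can never equal [0, 1, 1, 0]

# Two layers suffice: XOR = (x1 OR x2) AND NOT (x1 AND x2).
def two_layer_xor(x):
    h1 = perceptron(x, np.array([1, 1]), -0.5)   # OR gate
    h2 = perceptron(x, np.array([1, 1]), -1.5)   # AND gate
    return perceptron(np.array([h1, h2]), np.array([1, -1]), -0.5)

print("two-layer network:", [two_layer_xor(xi) for xi in X])  # [0, 1, 1, 0]
```

The hidden-layer weights here are hand-set for clarity; the learning rule for multi-layer networks is exactly the backpropagation story the lecture turns to next.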
478 00:27:25,890 --> 00:27:31,980 Fukushima hand-designed a neural network that looks like this. 479 00:27:31,980 --> 00:27:35,700 So it has about five or six layers. 480 00:27:35,700 --> 00:27:41,779 And then he designed the different functions 481 00:27:41,779 --> 00:27:43,700 across the layers, which you will 482 00:27:43,700 --> 00:27:46,910 learn more about, that more or less were 483 00:27:46,910 --> 00:27:50,850 inspired by the visual pathway that I was describing. 484 00:27:50,849 --> 00:27:54,559 Remember the cat experiment, from simple receptive fields 485 00:27:54,559 --> 00:27:56,789 to more complicated receptive fields. 486 00:27:56,789 --> 00:27:59,039 And he was doing that here. 487 00:27:59,039 --> 00:28:01,829 The early layers have simple functions. 488 00:28:01,829 --> 00:28:03,269 And then the later layers 489 00:28:03,269 --> 00:28:05,490 have more complex functions. 490 00:28:05,490 --> 00:28:08,680 And the simple ones, you can call convolution-- 491 00:28:08,680 --> 00:28:10,710 he uses the convolution function. 492 00:28:10,710 --> 00:28:13,620 And in the more complex ones, he was pooling the information 493 00:28:13,619 --> 00:28:15,219 from the convolution layers. 494 00:28:15,220 --> 00:28:19,799 So the neocognitron was really an engineering feat 495 00:28:19,799 --> 00:28:24,794 because every parameter was hand-designed. 496 00:28:24,795 --> 00:28:26,170 There are hundreds of parameters. 497 00:28:26,170 --> 00:28:29,430 He had to just meticulously put them together 498 00:28:29,430 --> 00:28:32,610 so that this small neural network could 499 00:28:32,609 --> 00:28:35,909 recognize digits or letters. 500 00:28:35,910 --> 00:28:41,130 So the real breakthrough that came around that time, in 1986, 501 00:28:41,130 --> 00:28:43,180 is a learning rule. 502 00:28:43,180 --> 00:28:45,580 That learning rule is called backpropagation. 
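The simple-to-complex layer pattern just described for the neocognitron — convolution layers followed by layers that pool their responses — can be sketched in a few lines of numpy (a minimal illustration, not Fukushima's actual architecture or parameters):

```python
import numpy as np

def convolve2d(image, kernel):
    # "Simple cell" stage: slide a small filter over the image
    # (valid mode, stride 1) and record its response at each position.
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    # "Complex cell" stage: pool responses over local neighborhoods,
    # giving tolerance to small shifts of the stimulus.
    H, W = feature_map.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = feature_map[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

# A vertical-edge filter applied to a tiny image containing a vertical bar.
image = np.zeros((6, 6))
image[:, 3] = 1.0
kernel = np.array([[1.0, -1.0], [1.0, -1.0]])
pooled = max_pool(convolve2d(image, kernel))
print(pooled.shape)  # (2, 2)
```

The strong pooled response where the bar sits survives small shifts of the bar — the shift tolerance that the complex-cell stage is meant to provide.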
503 00:28:45,579 --> 00:28:47,579 It's going to be one of our first classes 504 00:28:47,579 --> 00:28:52,454 to show you that Rumelhart, Jeff Hinton-- 505 00:28:52,454 --> 00:28:58,019 they took the neural network architecture 506 00:28:58,019 --> 00:29:04,259 and introduced an error-correcting objective function 507 00:29:04,259 --> 00:29:07,400 so that if you put in some input and know 508 00:29:07,400 --> 00:29:10,280 what the correct output is, how do you 509 00:29:10,279 --> 00:29:14,779 take the difference between what the neural network outputs 510 00:29:14,779 --> 00:29:17,899 versus the actual correct answer and then 511 00:29:17,900 --> 00:29:22,640 propagate the information back so that you 512 00:29:22,640 --> 00:29:28,590 can improve the parameters along the neural network? 513 00:29:28,589 --> 00:29:31,250 And that propagation from the output 514 00:29:31,250 --> 00:29:33,799 back through the entire neural network 515 00:29:33,799 --> 00:29:35,849 is called backpropagation. 516 00:29:35,849 --> 00:29:39,179 It follows the basic chain rule from calculus. 517 00:29:39,180 --> 00:29:47,420 And that was a watershed moment for neural network algorithms. 518 00:29:47,420 --> 00:29:50,970 And of course, we're still smack in the middle of AI winter. 519 00:29:50,970 --> 00:29:54,809 All this work was happening without public fanfare. 520 00:29:54,809 --> 00:29:57,929 But of course, in the world of research, 521 00:29:57,930 --> 00:29:59,650 these are very important milestones. 522 00:29:59,650 --> 00:30:03,720 One of the earliest applications of this neural 523 00:30:03,720 --> 00:30:07,019 network with backpropagation is Yann LeCun's convolutional 524 00:30:07,019 --> 00:30:10,410 neural network, made in the 1990s when he was working 525 00:30:10,410 --> 00:30:11,500 at Bell Labs. 
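The error-correcting idea just described — compare the network's output to the correct answer and push the difference back through the layers via the chain rule — can be sketched for a tiny two-layer network (a minimal illustration with made-up sizes and targets, not the lecture's exact formulation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])  # known correct outputs

W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X):
    h = sigmoid(X @ W1 + b1)        # hidden layer
    return h, sigmoid(h @ W2 + b2)  # network output

_, out = forward(X)
loss_before = np.mean((out - y) ** 2)

for _ in range(5000):
    h, out = forward(X)
    # Error at the output: derivative of the squared error
    # through the output sigmoid.
    d_out = (out - y) * out * (1 - out)
    # Chain rule: propagate the error back to the hidden layer.
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient descent step on every parameter.
    W2 -= h.T @ d_out;  b2 -= d_out.sum(axis=0)
    W1 -= X.T @ d_h;    b1 -= d_h.sum(axis=0)

_, out = forward(X)
loss_after = np.mean((out - y) ** 2)
print(loss_before, "->", loss_after)  # the error shrinks as the rule trains the net
```

Every modern deep learning framework automates exactly these two passes — a forward pass to compute the output and a backward pass applying the chain rule — for networks of any depth.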
526 00:30:11,500 --> 00:30:15,970 And what he did is just create a slightly bigger network, 527 00:30:15,970 --> 00:30:20,610 about seven layers-ish, and make it good enough, 528 00:30:20,609 --> 00:30:25,119 with great engineering capability, to recognize letters. 529 00:30:25,119 --> 00:30:28,709 And it was actually shipped to some part of the US Postal 530 00:30:28,710 --> 00:30:33,579 Service and banks to read digits and letters. 531 00:30:33,579 --> 00:30:37,599 So that was an application of an early neural network. 532 00:30:37,599 --> 00:30:41,250 And then Jeff Hinton and Yann LeCun 533 00:30:41,250 --> 00:30:43,390 continued to work on neural networks. 534 00:30:43,390 --> 00:30:45,720 It didn't go very far 535 00:30:45,720 --> 00:30:52,049 because, despite these improvements and tweaks 536 00:30:52,049 --> 00:30:57,289 of these neural networks, things more or less just stalled. 537 00:30:57,289 --> 00:31:00,279 They collected a big data set of digits and letters. 538 00:31:00,279 --> 00:31:03,730 And digits and letters were kind of quasi-solved 539 00:31:03,730 --> 00:31:05,089 in terms of recognition. 540 00:31:05,089 --> 00:31:08,019 But if you put the system through the kind 541 00:31:08,019 --> 00:31:11,500 of digital photos that the neuroscientists were using 542 00:31:11,500 --> 00:31:14,470 to recognize cats and dogs and microwaves and chairs 543 00:31:14,470 --> 00:31:17,180 and flowers, it just didn't work. 544 00:31:17,180 --> 00:31:22,549 And a huge part of this problem is the lack of data. 545 00:31:22,549 --> 00:31:27,500 And lack of data is not just an inconvenience. 546 00:31:27,500 --> 00:31:29,990 It's actually a mathematical problem 547 00:31:29,990 --> 00:31:36,430 because these algorithms are high-capacity algorithms that 548 00:31:36,430 --> 00:31:39,850 actually need to be driven by lots of data 549 00:31:39,849 --> 00:31:42,349 in order to learn to generalize. 
550 00:31:42,349 --> 00:31:45,009 And there are some deep mathematical principles 551 00:31:45,009 --> 00:31:48,379 behind these rules of generalization and model 552 00:31:48,380 --> 00:31:49,210 overfitting. 553 00:31:49,210 --> 00:31:52,660 And data was underappreciated, was 554 00:31:52,660 --> 00:31:54,840 overlooked, because most people were just 555 00:31:54,839 --> 00:31:56,559 looking at these architectures. 556 00:31:56,559 --> 00:31:59,190 They did not realize that data is 557 00:31:59,190 --> 00:32:02,070 a first-class citizen for machine 558 00:32:02,069 --> 00:32:03,490 learning and deep learning. 559 00:32:03,490 --> 00:32:08,339 So this is part of the work that my students and I did 560 00:32:08,339 --> 00:32:14,759 in the early 2000s, where we recognized this importance 561 00:32:14,759 --> 00:32:15,640 of data. 562 00:32:15,640 --> 00:32:21,240 We hypothesized that the whole field was actually 563 00:32:21,240 --> 00:32:24,519 missing this-- underappreciating the importance of data. 564 00:32:24,519 --> 00:32:27,089 So we went about and collected a huge data 565 00:32:27,089 --> 00:32:30,119 set called ImageNet that has 15 million images, 566 00:32:30,119 --> 00:32:32,259 after cleaning a billion images. 567 00:32:32,259 --> 00:32:38,309 And these 15 million images were sorted across 22,000 categories 568 00:32:38,309 --> 00:32:39,309 of objects. 569 00:32:39,309 --> 00:32:43,109 We actually studied a lot of the cognitive and psychology 570 00:32:43,109 --> 00:32:51,479 literature to appreciate that 22,000 categories 571 00:32:51,480 --> 00:32:54,880 were roughly in the order 572 00:32:54,880 --> 00:32:58,510 of the number of categories that humans learn to recognize 573 00:32:58,509 --> 00:33:00,470 in the early years of their life. 
574 00:33:00,470 --> 00:33:02,180 And then we open sourced this data 575 00:33:02,180 --> 00:33:05,860 set and created an ImageNet challenge called the Large Scale 576 00:33:05,859 --> 00:33:07,579 Visual Recognition Challenge. 577 00:33:07,579 --> 00:33:12,699 We curated a subset of ImageNet of a million images or a million 578 00:33:12,700 --> 00:33:16,870 plus images and 1,000 object classes and then ran 579 00:33:16,869 --> 00:33:21,429 an international object recognition challenge for many 580 00:33:21,430 --> 00:33:22,039 years. 581 00:33:22,039 --> 00:33:26,899 And the goal is that we ask researchers to participate. 582 00:33:26,900 --> 00:33:29,420 And their goal is to create algorithms. 583 00:33:29,420 --> 00:33:31,430 It doesn't matter which kind of algorithms. 584 00:33:31,430 --> 00:33:35,650 And we will test your algorithm's ability to recognize 585 00:33:35,650 --> 00:33:40,900 photos and see if it can call out these 1,000 object classes 586 00:33:40,900 --> 00:33:42,800 as correctly as possible. 587 00:33:42,799 --> 00:33:45,039 And here are the errors. 588 00:33:45,039 --> 00:33:53,069 The first year we ran this competition, 589 00:33:53,069 --> 00:33:57,000 the best performing algorithm's error was nearly 30%. 590 00:33:57,000 --> 00:34:00,859 And that's really pretty abysmal because humans can perform 591 00:34:00,859 --> 00:34:03,509 at under, say, 3% error. 592 00:34:03,509 --> 00:34:07,259 And then 2011, it wasn't that exciting. 593 00:34:07,259 --> 00:34:09,559 But something happened in 2012. 594 00:34:09,559 --> 00:34:12,389 That was the most exciting year. 595 00:34:12,389 --> 00:34:16,190 That year, Jeff Hinton and his students 596 00:34:16,190 --> 00:34:18,650 participated in this challenge using 597 00:34:18,650 --> 00:34:20,340 a convolutional neural network. 598 00:34:20,340 --> 00:34:23,100 And they reduced the error almost by half. 599 00:34:23,099 --> 00:34:29,519 And it truly showed the power of deep learning algorithms. 
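The error numbers just mentioned come from scoring each entry against the 1,000 ground-truth labels. A minimal sketch of how such a top-k classification error can be computed (a toy illustration with made-up scores, not the challenge's official evaluation code):

```python
import numpy as np

def topk_error(scores, labels, k=5):
    # scores: (n_images, n_classes) model confidence per class;
    # labels: (n_images,) ground-truth class indices.
    # An image counts as correct if the true label appears among
    # the model's k highest-scoring classes.
    topk = np.argsort(-scores, axis=1)[:, :k]
    correct = (topk == labels[:, None]).any(axis=1)
    return 1.0 - correct.mean()

# Toy scores for 3 images over 4 classes (made-up numbers).
scores = np.array([[0.1, 0.6, 0.2, 0.1],
                   [0.7, 0.1, 0.1, 0.1],
                   [0.2, 0.2, 0.3, 0.3]])
labels = np.array([1, 2, 3])
print(topk_error(scores, labels, k=1))  # 2 of 3 images wrong at top-1
print(topk_error(scores, labels, k=2))  # 1 of 3 wrong once the top 2 guesses count
```

Letting the model's five best guesses count (top-5 error) is the metric most ImageNet results, including the ones quoted here, are reported in.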
600 00:34:29,519 --> 00:34:34,759 And so the participating algorithm in the 2012 ImageNet 601 00:34:34,760 --> 00:34:36,960 challenge was called AlexNet. 602 00:34:36,960 --> 00:34:42,559 And the funny thing is, if you look at AlexNet, 603 00:34:42,559 --> 00:34:47,449 it's not that different from Fukushima's neocognitron 604 00:34:47,449 --> 00:34:49,579 32 years ago. 605 00:34:49,579 --> 00:34:54,829 But two major things happened between these two. 606 00:34:54,829 --> 00:34:57,529 One is that backpropagation happened. 607 00:34:57,530 --> 00:35:01,269 It's a principled, mathematically rigorous learning 608 00:35:01,269 --> 00:35:04,300 rule so that you don't ever have 609 00:35:04,300 --> 00:35:06,140 to hand-tune parameters. 610 00:35:06,139 --> 00:35:09,409 And that was a major breakthrough theoretically. 611 00:35:09,409 --> 00:35:14,179 Another breakthrough was data. 612 00:35:14,179 --> 00:35:19,629 The recognition of data and the understanding of data driving 613 00:35:19,630 --> 00:35:23,200 these high-capacity models-- which eventually would have 614 00:35:23,199 --> 00:35:26,109 trillions of parameters, but at that time had millions 615 00:35:26,110 --> 00:35:34,831 of parameters-- was critical for setting off deep learning 616 00:35:34,831 --> 00:35:36,410 and making this work. 617 00:35:36,409 --> 00:35:42,405 And really, many people consider the year 2012 618 00:35:42,405 --> 00:35:46,869 and the AlexNet algorithm that won the ImageNet 619 00:35:46,869 --> 00:35:51,019 challenge the historical moment of the birth, 620 00:35:51,019 --> 00:35:54,409 or rebirth, of modern AI, or the birth of the deep learning 621 00:35:54,409 --> 00:35:55,759 revolution. 622 00:35:55,760 --> 00:35:59,540 And of course, the reason many of you are here 623 00:35:59,539 --> 00:36:04,320 is that since then, we have been in the era of deep learning explosion. 
624 00:36:04,320 --> 00:36:10,910 If you look at computer vision's main annual research 625 00:36:10,909 --> 00:36:13,190 conference, called CVPR-- 626 00:36:13,190 --> 00:36:15,619 the number of papers has exploded. 627 00:36:15,619 --> 00:36:18,869 And arXiv papers have exploded. 628 00:36:18,869 --> 00:36:22,730 And many new algorithms since then 629 00:36:22,730 --> 00:36:27,349 have been invented to participate in the ImageNet 630 00:36:27,349 --> 00:36:28,049 challenge. 631 00:36:28,050 --> 00:36:29,870 In the following weeks, we're going 632 00:36:29,869 --> 00:36:31,739 to study some of these algorithms. 633 00:36:31,739 --> 00:36:34,639 But the point is that some of these 634 00:36:34,639 --> 00:36:39,379 algorithms beyond AlexNet have had a profound impact 635 00:36:39,380 --> 00:36:43,610 on the progress of the field of computer vision 636 00:36:43,610 --> 00:36:49,090 and on the applications of computer vision. 637 00:36:49,090 --> 00:36:52,720 So a lot of things have happened. 638 00:36:52,719 --> 00:36:54,529 We're going to cover some of these. 639 00:36:54,530 --> 00:36:57,340 Not only has the field of computer vision 640 00:36:57,340 --> 00:37:01,510 made major progress in creating algorithms 641 00:37:01,510 --> 00:37:06,260 to recognize everyday objects like cats and dogs and chairs-- 642 00:37:06,260 --> 00:37:10,400 quickly after the ImageNet challenge, 643 00:37:10,400 --> 00:37:14,139 the 2012 moment, we also got algorithms 644 00:37:14,139 --> 00:37:22,549 that can recognize much more complicated images, 645 00:37:22,550 --> 00:37:27,470 can retrieve images, can do multiple object detection, 646 00:37:27,469 --> 00:37:30,559 and can do image segmentation. 
647 00:37:30,559 --> 00:37:34,360 These are all different tasks in visual recognition 648 00:37:34,360 --> 00:37:36,220 that you'll find yourself getting 649 00:37:36,219 --> 00:37:38,689 familiar with throughout this course 650 00:37:38,690 --> 00:37:42,139 because vision is not just calling out cats and dogs. 651 00:37:42,139 --> 00:37:48,859 There is so much in the nuanced ability of visual recognition. 652 00:37:48,860 --> 00:37:52,829 And of course, vision is not just static images. 653 00:37:52,829 --> 00:37:57,500 So there is work in video classification, human activity 654 00:37:57,500 --> 00:37:58,710 recognition. 655 00:37:58,710 --> 00:38:00,929 I'm showing you this overview. 656 00:38:00,929 --> 00:38:04,774 You will learn some of these. 657 00:38:04,775 --> 00:38:08,460 You don't have to understand exactly what's going on here. 658 00:38:08,460 --> 00:38:14,940 But I want you to appreciate the variety of vision tasks. 659 00:38:14,940 --> 00:38:20,869 Medical imaging-- for those of you who come from a medical field, 660 00:38:20,869 --> 00:38:24,650 whether it's radiology or pathology or even 661 00:38:24,650 --> 00:38:28,260 other aspects of medicine-- is deeply visual. 662 00:38:28,260 --> 00:38:31,550 And this has a profound impact. 663 00:38:31,550 --> 00:38:37,550 Scientific discovery-- even the seminal picture 664 00:38:37,550 --> 00:38:41,700 you probably remember of the first photograph of a black hole 665 00:38:41,699 --> 00:38:46,829 uses a lot of computer vision and computational photography 666 00:38:46,829 --> 00:38:47,980 techniques. 667 00:38:47,980 --> 00:38:52,980 Of course, computer vision has also contributed 668 00:38:52,980 --> 00:38:58,889 a lot to applications in sustainability and the environment. 669 00:38:58,889 --> 00:39:02,309 And we also have made a lot of progress 670 00:39:02,309 --> 00:39:07,449 in image captioning right after that 2012 moment. 
671 00:39:07,449 --> 00:39:09,989 This is actually work by Andrej Karpathy when he was 672 00:39:09,989 --> 00:39:13,799 my student-- his thesis work. 673 00:39:13,800 --> 00:39:19,030 Then we also worked on relationship understanding. 674 00:39:19,030 --> 00:39:22,710 So visual intelligence is 675 00:39:22,710 --> 00:39:24,639 not only about seeing what's in the pixels; 676 00:39:24,639 --> 00:39:26,859 you can also see what's beyond pixels, 677 00:39:26,860 --> 00:39:33,360 including relationships of objects, and also style transfer. 678 00:39:33,360 --> 00:39:35,880 A lot of this work-- actually, 679 00:39:35,880 --> 00:39:39,000 Justin Johnson, who will come to guest lecture in this course, 680 00:39:39,000 --> 00:39:45,320 will tell you all about his seminal work in style transfer. 681 00:39:45,320 --> 00:39:48,510 And of course, in the generative AI era, 682 00:39:48,510 --> 00:39:53,430 we get these really incredible results like face generation. 683 00:39:53,429 --> 00:39:59,239 And this is from the very early days of image generation with 684 00:39:59,239 --> 00:40:03,379 DALL-E-- I think this is the early DALL-E. Of course, now, Midjourney 685 00:40:03,380 --> 00:40:08,690 and everything else has gone beyond these avocado and peach chairs. 686 00:40:08,690 --> 00:40:14,780 But really, we are squarely in the most exciting modern era 687 00:40:14,780 --> 00:40:16,246 of AI explosion. 688 00:40:20,070 --> 00:40:25,370 The three converging forces of computation, algorithms, 689 00:40:25,369 --> 00:40:29,719 and data have taken this field 690 00:40:29,719 --> 00:40:32,929 to a whole different level, where we're now 691 00:40:32,929 --> 00:40:36,119 totally out of AI winter. 692 00:40:36,119 --> 00:40:40,259 I would say we're in an AI global warming period. 693 00:40:40,260 --> 00:40:46,050 And I don't see any of this slowing down, 694 00:40:46,050 --> 00:40:48,820 for both good and bad reasons. 
695 00:40:48,820 --> 00:40:53,170 And also, just a word, because we are in Silicon Valley, 696 00:40:53,170 --> 00:40:58,050 we're in the very Huang building, in the NVIDIA 697 00:40:58,050 --> 00:41:02,039 lecture hall-- so we cannot ignore the progress 698 00:41:02,039 --> 00:41:05,050 of hardware and the role that it played. 699 00:41:05,050 --> 00:41:14,080 So here is just the FLOPS per dollar graph for NVIDIA's GPUs. 700 00:41:14,079 --> 00:41:19,210 And before 2020, the progress was steady. 701 00:41:19,210 --> 00:41:22,800 But as soon as deep learning started 702 00:41:22,800 --> 00:41:27,420 to drive these GPUs and chips, you 703 00:41:27,420 --> 00:41:33,519 can just see the GFLOPS have completely taken off. 704 00:41:33,519 --> 00:41:40,610 And by any measure, we are in this accelerated curve 705 00:41:40,610 --> 00:41:45,360 of lots of compute as well as lots of AI. 706 00:41:45,360 --> 00:41:47,360 And these are just different graphs 707 00:41:47,360 --> 00:41:50,539 showing you conference attendees, startups, 708 00:41:50,539 --> 00:41:54,500 and enterprise applications in AI, all across 709 00:41:54,500 --> 00:41:55,710 not just computer vision-- 710 00:41:55,710 --> 00:42:02,099 NLP and other areas have also just exploded. 711 00:42:02,099 --> 00:42:06,299 So quickly, last but not least-- it's been exciting. 712 00:42:06,300 --> 00:42:08,070 There have been a lot of successes. 713 00:42:08,070 --> 00:42:11,309 But there is still a lot to be done in computer vision. 714 00:42:11,309 --> 00:42:14,329 So this problem is still not totally solved. 715 00:42:14,329 --> 00:42:19,969 And with great tools come great consequences as well. 716 00:42:19,969 --> 00:42:24,449 So computer vision can do a lot of good. 717 00:42:24,449 --> 00:42:26,039 But it also can do harm. 
718 00:42:26,039 --> 00:42:28,730 For example, human bias-- 719 00:42:28,730 --> 00:42:32,360 every single AI algorithm today, the large ones, 720 00:42:32,360 --> 00:42:33,880 is driven by data. 721 00:42:33,880 --> 00:42:38,550 And data is an artifact of human activities 722 00:42:38,550 --> 00:42:40,360 on Earth and in history. 723 00:42:40,360 --> 00:42:43,900 And a lot of the data carry our bias. 724 00:42:43,900 --> 00:42:47,200 And this gets carried into AI systems. 725 00:42:47,199 --> 00:42:50,609 We have seen a lot of face recognition algorithms having 726 00:42:50,610 --> 00:42:52,990 the same kind of bias that humans have. 727 00:42:52,989 --> 00:42:55,919 And we do have to really recognize that. 728 00:42:55,920 --> 00:43:01,450 We can also use AI to impact human lives, some for the good. 729 00:43:01,449 --> 00:43:02,889 Think about medical imaging. 730 00:43:02,889 --> 00:43:05,199 But some are questionable. 731 00:43:05,199 --> 00:43:09,299 What if AI is solely behind deciding your job 732 00:43:09,300 --> 00:43:11,620 or deciding your financial loans? 733 00:43:11,619 --> 00:43:15,789 So again, is it totally bad? 734 00:43:15,789 --> 00:43:17,050 Is it totally good? 735 00:43:17,050 --> 00:43:19,150 These are very complicated issues. 736 00:43:19,150 --> 00:43:23,490 This is also why I always get so excited when students from HMS 737 00:43:23,489 --> 00:43:26,549 or law school or education school or business school 738 00:43:26,550 --> 00:43:29,670 attend my class, because not all AI 739 00:43:29,670 --> 00:43:31,789 issues are engineering issues. 740 00:43:31,789 --> 00:43:36,559 We have a lot of human factors and societal issues to solve. 741 00:43:36,559 --> 00:43:40,599 I'm also particularly excited by AI's use in medicine and health 742 00:43:40,599 --> 00:43:41,139 care. 743 00:43:41,139 --> 00:43:43,960 This is something really dear to my heart. 
744 00:43:43,960 --> 00:43:46,119 Professor Adeli and Zane, who are 745 00:43:46,119 --> 00:43:49,630 also co-instructors of this course-- the three of us 746 00:43:49,630 --> 00:43:53,500 work on AI for the aging population as well as 747 00:43:53,500 --> 00:43:59,050 patients, to try to use computer vision to deliver care 748 00:43:59,050 --> 00:44:00,170 to people. 749 00:44:00,170 --> 00:44:01,820 So this is a good use. 750 00:44:01,820 --> 00:44:04,820 And also, even in terms of technology, 751 00:44:04,820 --> 00:44:07,190 human vision is remarkable. 752 00:44:07,190 --> 00:44:10,670 I want you to come out of not only today's class 753 00:44:10,670 --> 00:44:14,240 but also this entire course appreciating that, 754 00:44:14,239 --> 00:44:16,969 despite how much computer vision can do, 755 00:44:16,969 --> 00:44:22,250 there's just so much more nuance, subtlety, richness, 756 00:44:22,250 --> 00:44:26,389 complexity, and also emotion in human vision. 757 00:44:26,389 --> 00:44:29,369 Look at these kids studying whatever 758 00:44:29,369 --> 00:44:33,159 their curiosity leads them to, or the humor in this image. 759 00:44:33,159 --> 00:44:36,129 There's still a lot more that computer vision cannot do. 760 00:44:36,130 --> 00:44:38,430 So I hope that continues to entice 761 00:44:38,429 --> 00:44:40,869 you to study computer vision. 762 00:44:40,869 --> 00:44:45,690 At this point, I'm going to give the podium to Professor Adeli 763 00:44:45,690 --> 00:44:48,369 to go over the rest of the class. 764 00:44:48,369 --> 00:44:49,039 Thank you. 765 00:44:49,039 --> 00:44:50,759 [APPLAUSE] 766 00:44:50,760 --> 00:44:51,990 Awesome. 767 00:44:51,989 --> 00:44:55,139 Thank you, Fei-Fei. 768 00:44:55,139 --> 00:44:57,089 A great start to the quarter. 769 00:44:57,090 --> 00:45:00,640 And I hope my microphone is working right now. 770 00:45:00,639 --> 00:45:01,389 OK, good. 771 00:45:01,389 --> 00:45:05,730 I'm seeing some nodding of heads. 
772 00:45:05,730 --> 00:45:13,079 So very excited to be here with you all. 773 00:45:13,079 --> 00:45:18,630 And I'm hoping that you will have a fun 774 00:45:18,630 --> 00:45:23,160 and challenging course with the amazing list of co-instructors 775 00:45:23,159 --> 00:45:26,379 that we have and great TAs. 776 00:45:26,380 --> 00:45:31,000 So in this class, we are going to cover 777 00:45:31,000 --> 00:45:34,690 a wide variety of topics around computer vision and the use 778 00:45:34,690 --> 00:45:37,659 of deep learning in this space, categorized 779 00:45:37,659 --> 00:45:41,569 into four different topics. 780 00:45:41,570 --> 00:45:45,230 We will start with deep learning basics. 781 00:45:45,230 --> 00:45:48,429 And let's start actually with a simple question: 782 00:45:48,429 --> 00:45:52,009 what is computer vision, really? 783 00:45:52,010 --> 00:45:57,610 So at its core, it's about enabling machines 784 00:45:57,610 --> 00:46:00,620 to see and understand images. 785 00:46:00,619 --> 00:46:09,339 And basically, the most fundamental task 786 00:46:09,340 --> 00:46:13,390 in this space is image classification. 787 00:46:13,389 --> 00:46:17,059 You give the model an image, say, of a cat. 788 00:46:17,059 --> 00:46:21,549 And the model should output the label cat. 789 00:46:21,550 --> 00:46:23,740 And that's it. 790 00:46:23,739 --> 00:46:29,479 But this deceptively simple task is the foundation 791 00:46:29,480 --> 00:46:32,039 for many more complex applications, 792 00:46:32,039 --> 00:46:36,409 from self-driving to medical diagnosis and so on. 793 00:46:36,409 --> 00:46:40,429 So how do we teach a machine to do this? 794 00:46:40,429 --> 00:46:44,639 One of the simplest approaches is to use linear classification, 795 00:46:44,639 --> 00:46:48,089 as you can see in this slide. 796 00:46:48,090 --> 00:46:53,809 So imagine each of the images in our data set 797 00:46:53,809 --> 00:46:57,119 is shown as a dot in that space. 
798 00:46:57,119 --> 00:47:02,779 And each axis shows some sort of feature 799 00:47:02,780 --> 00:47:05,280 which was derived from the image itself. 800 00:47:05,280 --> 00:47:09,420 Here, we are showing a 2D space for simplicity. 801 00:47:09,420 --> 00:47:12,470 But the task of a linear classifier 802 00:47:12,469 --> 00:47:17,149 is to find the hyperplane or the linear function 803 00:47:17,150 --> 00:47:23,470 that separates these two classes, say, cats from dogs. 804 00:47:23,469 --> 00:47:26,259 But we all know that these linear models often 805 00:47:26,260 --> 00:47:29,110 only go so far. 806 00:47:29,110 --> 00:47:32,349 They struggle when the data isn't cleanly separable 807 00:47:32,349 --> 00:47:33,799 with a straight line. 808 00:47:33,800 --> 00:47:36,320 So the question is, what's next? 809 00:47:36,320 --> 00:47:44,090 We'll get into the topics of how to model more complex patterns. 810 00:47:44,090 --> 00:47:49,900 And if we do so, we often face the challenges 811 00:47:49,900 --> 00:47:54,220 of overfitting and underfitting, which 812 00:47:54,219 --> 00:47:59,439 are topics we will cover in the early lectures of the class. 813 00:47:59,440 --> 00:48:05,110 And to strike the right balance, we 814 00:48:05,110 --> 00:48:08,320 use techniques like regularization 815 00:48:08,320 --> 00:48:14,110 to control model complexity and optimization to find the 816 00:48:14,110 --> 00:48:16,059 best-fit parameters. 817 00:48:16,059 --> 00:48:21,079 So these are the nuts and bolts of deep learning: creating 818 00:48:21,079 --> 00:48:26,659 and training models that not only fit the data 819 00:48:26,659 --> 00:48:31,319 but also generalize to unseen and new data as well. 820 00:48:31,320 --> 00:48:33,539 And now comes the fun part-- 821 00:48:33,539 --> 00:48:34,380 neural networks. 822 00:48:34,380 --> 00:48:38,059 We've been talking about them quite a lot. 
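The linear classifier picture above — dots in a feature space, split by a learned line — can be sketched concretely. This is a minimal illustration using synthetic 2D features and logistic regression (the cluster positions and hyperparameters are made up for the demo):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two synthetic 2D feature clusters standing in for "cat" and "dog".
cats = rng.normal([1.0, 1.0], 0.3, (50, 2))
dogs = rng.normal([3.0, 3.0], 0.3, (50, 2))
X = np.vstack([cats, dogs])
y = np.array([0] * 50 + [1] * 50)

# Logistic regression: learn a line w.x + b = 0 separating the classes.
w, b = np.zeros(2), 0.0
lr = 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probability of "dog"
    w -= lr * (X.T @ (p - y)) / len(y)       # gradient step on the weights
    b -= lr * np.mean(p - y)                 # gradient step on the bias

pred = (X @ w + b > 0).astype(int)
print("training accuracy:", (pred == y).mean())
```

Because these clusters are far apart, the learned line separates them cleanly; overlapping or ring-shaped classes are exactly the case where a single line fails and the non-linear models discussed next are needed.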
823 00:48:38,059 --> 00:48:43,549 And what neural networks do, unlike linear classifiers, 824 00:48:43,550 --> 00:48:47,780 is stack multiple layers of operations 825 00:48:47,780 --> 00:48:54,769 to model non-linear functions-- to be 826 00:48:54,769 --> 00:48:59,389 able to solve the same problem of image 827 00:48:59,389 --> 00:49:04,489 classification, and so on. 828 00:49:04,489 --> 00:49:09,869 These are the models powering everything from Google Photos 829 00:49:09,869 --> 00:49:13,429 to-- now everybody's familiar with ChatGPT-- ChatGPT's vision 830 00:49:13,429 --> 00:49:15,440 models, and so on. 831 00:49:15,440 --> 00:49:24,099 In this course, we will go deep into the details of how they 832 00:49:24,099 --> 00:49:26,299 work and how they are trained. 833 00:49:26,300 --> 00:49:31,090 And we will be looking into debugging and improving them. 834 00:49:31,090 --> 00:49:35,030 After looking at the deep learning basics, 835 00:49:35,030 --> 00:49:39,280 we will cover the topics of perceiving and understanding 836 00:49:39,280 --> 00:49:44,620 the visual world, which is a complex process that 837 00:49:44,619 --> 00:49:49,880 involves interpreting a vast array of visual information. 838 00:49:49,880 --> 00:49:52,329 And to do so, we often first define 839 00:49:52,329 --> 00:49:56,739 tasks that refer to specific challenges or problems 840 00:49:56,739 --> 00:49:59,149 we aim to solve-- 841 00:49:59,150 --> 00:50:02,180 some of the examples are object detection, scene understanding, 842 00:50:02,179 --> 00:50:03,619 motion detection, and so on. 843 00:50:03,619 --> 00:50:10,539 And to solve these tasks, we use different models, which 844 00:50:10,539 --> 00:50:13,929 are computational and theoretical 845 00:50:13,929 --> 00:50:17,779 frameworks we develop to mimic or explain 846 00:50:17,780 --> 00:50:22,350 how our visual system accomplishes these tasks. 
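Why stacking layers only helps when a nonlinearity sits between them can be shown in a few lines (a sketch with arbitrary random weights, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
W1, W2 = rng.normal(size=(3, 5)), rng.normal(size=(5, 2))
x = rng.normal(size=3)

# Two stacked linear layers with nothing in between collapse into
# a single linear layer, by associativity: (x @ W1) @ W2 == x @ (W1 @ W2).
stacked = (x @ W1) @ W2
collapsed = x @ (W1 @ W2)
print(np.allclose(stacked, collapsed))  # True

# A nonlinearity between the layers (here ReLU) breaks that collapse,
# which is what lets depth model non-linear functions.
nonlinear = np.maximum(x @ W1, 0) @ W2
print(nonlinear.shape)
```

This is the reason every hidden layer in the networks covered in this course is followed by an activation function such as ReLU or sigmoid.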
847 00:50:22,349 --> 00:50:25,610 One example of these types of models 848 00:50:25,610 --> 00:50:27,730 is neural networks. 849 00:50:30,260 --> 00:50:36,150 So by aligning models with tasks, 850 00:50:36,150 --> 00:50:41,030 we can create systems that can see and interpret 851 00:50:41,030 --> 00:50:43,730 the world around us. 852 00:50:43,730 --> 00:50:48,740 Speaking of tasks, let's go back to the topic 853 00:50:48,739 --> 00:50:53,239 of image classification: predicting a single label 854 00:50:53,239 --> 00:50:56,989 for an entire image. 855 00:50:56,989 --> 00:50:59,359 But we know that real-world computer vision 856 00:50:59,360 --> 00:51:02,340 is much richer than this. 857 00:51:02,340 --> 00:51:05,240 And let's walk through some of the tasks that 858 00:51:05,239 --> 00:51:06,869 go beyond classification. 859 00:51:06,869 --> 00:51:13,339 First, semantic segmentation, where we are not just 860 00:51:13,340 --> 00:51:17,519 labeling the object or the entire image 861 00:51:17,519 --> 00:51:19,739 as cat or tree or whatever. 862 00:51:19,739 --> 00:51:25,019 Here, we are looking for labels for every single pixel 863 00:51:25,019 --> 00:51:25,809 in the image. 864 00:51:25,809 --> 00:51:30,670 So every pixel is grass, cat, tree, or sky. 865 00:51:30,670 --> 00:51:34,960 But we don't distinguish between individual objects. 866 00:51:34,960 --> 00:51:38,280 And next, we have object detection, 867 00:51:38,280 --> 00:51:45,580 where we now want to not only say what is in the image 868 00:51:45,579 --> 00:51:47,440 but also pinpoint the location. 869 00:51:47,440 --> 00:51:49,860 And that's why we create bounding boxes 870 00:51:49,860 --> 00:51:54,670 around the objects and associate them with specific labels. 871 00:51:54,670 --> 00:51:58,269 And finally, we have instance segmentation. 872 00:51:58,269 --> 00:52:01,139 We'll go into instance segmentation, which is 873 00:52:01,139 --> 00:52:04,409 the most granular of them all. 
874 00:52:04,409 --> 00:52:08,279 It combines the ideas of detection and segmentation 875 00:52:08,280 --> 00:52:09,130 together. 876 00:52:09,130 --> 00:52:13,039 And every object instance gets its own mask. 877 00:52:13,039 --> 00:52:20,090 So these tasks require a much deeper spatial understanding 878 00:52:20,090 --> 00:52:21,059 of images. 879 00:52:21,059 --> 00:52:23,809 And they push the models to do more than just 880 00:52:23,809 --> 00:52:27,860 recognize categories. 881 00:52:27,860 --> 00:52:30,660 The complexity doesn't stop with static images. 882 00:52:30,659 --> 00:52:33,269 Let's look at some temporal dimensions. 883 00:52:33,269 --> 00:52:36,269 So there's the task of video classification, 884 00:52:36,269 --> 00:52:40,429 as Fei-Fei talked about, where we want to understand 885 00:52:40,429 --> 00:52:42,349 what's happening in a video. 886 00:52:42,349 --> 00:52:47,210 Is there someone running, jumping, or dancing? 887 00:52:47,210 --> 00:52:51,630 There is also the topic of multimodal video understanding, 888 00:52:51,630 --> 00:52:56,630 which is combining vision and sound and other modalities. 889 00:52:56,630 --> 00:53:00,559 For example, here, the person 890 00:53:00,559 --> 00:53:04,070 is playing a vibraphone. To really understand 891 00:53:04,070 --> 00:53:05,039 what's happening, 892 00:53:05,039 --> 00:53:08,210 we have to create a blend of visual features 893 00:53:08,210 --> 00:53:11,280 and audio features. 
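One way to keep the static-image tasks above straight is by the shape of their outputs. Here is a toy sketch, with tiny hand-made arrays standing in for model predictions, of what each task produces for a 4x4 image:

```python
import numpy as np

H, W = 4, 4  # a tiny 4x4 "image"; class 2 will mean "cat"

# Image classification: one label for the whole image.
cls_label = 2

# Semantic segmentation: one class label per pixel,
# with no notion of separate object instances.
seg_map = np.zeros((H, W), dtype=int)
seg_map[1:3, 1:3] = 2                     # a 2x2 patch of "cat" pixels

# Object detection: a bounding box (x, y, w, h) plus a label
# for each object in the image.
detections = [((1, 1, 2, 2), 2)]

# Instance segmentation: one binary mask per object instance,
# combining detection (which object) with segmentation (which pixels).
instance_masks = [np.zeros((H, W), dtype=bool)]
instance_masks[0][1:3, 1:3] = True

print(seg_map.shape, len(detections), int(instance_masks[0].sum()))
# (4, 4) 1 4
```

The progression from one label, to one label per pixel, to boxes, to per-instance masks is exactly the increase in spatial granularity the lecture describes.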
894 00:53:11,280 --> 00:53:14,680 And finally, there is the topic of visualization 895 00:53:14,679 --> 00:53:19,329 and understanding that we will be covering in this class, where 896 00:53:19,329 --> 00:53:24,340 we want to interpret what's being learned by the models 897 00:53:24,340 --> 00:53:31,269 and see an attention frame or attention map of what 898 00:53:31,269 --> 00:53:35,079 the model is attending to in order to make a correct classification, 899 00:53:35,079 --> 00:53:36,819 and so on. 900 00:53:36,820 --> 00:53:39,650 And then, beyond tasks, 901 00:53:39,650 --> 00:53:41,740 we look into models. 902 00:53:41,739 --> 00:53:46,509 And the very first topic-- let me introduce it to you-- 903 00:53:46,510 --> 00:53:50,170 that we'll be covering is Convolutional Neural Networks, 904 00:53:50,170 --> 00:53:51,230 or CNNs. 905 00:53:51,230 --> 00:53:52,760 These involve a number of operations. 906 00:53:52,760 --> 00:53:55,930 We will be going over the details 907 00:53:55,929 --> 00:53:59,839 in the class: starting from an image, applying a number of convolution, 908 00:53:59,840 --> 00:54:01,970 subsampling, and fully connected operations, 909 00:54:01,969 --> 00:54:05,980 and, finally, producing the output. 910 00:54:05,980 --> 00:54:08,769 And beyond convolutional neural networks, 911 00:54:08,769 --> 00:54:14,719 we will study recurrent neural networks for sequential data 912 00:54:14,719 --> 00:54:19,669 and even newer neural architectures, such as transformers 913 00:54:19,670 --> 00:54:24,139 and attention-based frameworks. 914 00:54:24,139 --> 00:54:29,179 Next, we will be covering some large-scale distributed 915 00:54:29,179 --> 00:54:34,609 training topics, which are new this quarter. 916 00:54:34,610 --> 00:54:38,460 I'm sure you've all heard about large language models, 917 00:54:38,460 --> 00:54:40,320 large vision models, and so on. 
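Going back to the CNN pipeline just described (image, then convolution, subsampling, and fully connected operations), here is a minimal NumPy sketch. The sizes and random weights are made up for illustration; real CNNs use many channels, learned filters, and optimized library implementations.

```python
import numpy as np

def conv2d(img, kernel):
    """Naive valid 2-D convolution (technically cross-correlation,
    as in most deep learning libraries)."""
    H, W = img.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + k, j:j + k] * kernel)
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling: the 'subsampling' step."""
    H, W = x.shape
    return x[:H - H % size, :W - W % size] \
        .reshape(H // size, size, W // size, size).max(axis=(1, 3))

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))                    # toy grayscale image
feat = np.maximum(0, conv2d(img, rng.standard_normal((3, 3))))  # conv + ReLU
pooled = max_pool(feat)                              # 6x6 -> 3x3
W_fc = rng.standard_normal((pooled.size, 10))
scores = pooled.reshape(1, -1) @ W_fc                # fully connected -> scores
print(scores.shape)  # (1, 10)
```

Each stage mirrors one box in the pipeline diagram: convolution extracts local features, pooling shrinks the spatial resolution, and the fully connected layer maps the remaining features to class scores.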
918 00:54:40,320 --> 00:54:44,480 And we will be briefly discussing 919 00:54:44,480 --> 00:54:47,309 how these models are actually trained. 920 00:54:47,309 --> 00:54:51,619 We know that data and datasets are expanding, 921 00:54:51,619 --> 00:54:56,429 and models are becoming larger and larger. 922 00:54:56,429 --> 00:54:59,819 And in order to train such models, 923 00:54:59,820 --> 00:55:02,360 there are some strategies-- 924 00:55:02,360 --> 00:55:04,470 for example, data parallelism, 925 00:55:04,469 --> 00:55:07,569 model parallelism-- that we will cover in this class. 926 00:55:07,570 --> 00:55:11,170 But beyond that, there are many challenges, 927 00:55:11,170 --> 00:55:15,940 such as synchronization between these models and workers 928 00:55:15,940 --> 00:55:20,730 and so on, as well as several other aspects 929 00:55:20,730 --> 00:55:25,059 that we'll be covering in one of the lectures this quarter. 930 00:55:25,059 --> 00:55:31,289 And we will also go over some of the trends for training 931 00:55:31,289 --> 00:55:33,070 these large models. 932 00:55:33,070 --> 00:55:36,210 After completing this topic, what we will do 933 00:55:36,210 --> 00:55:44,010 next is look into generative and interactive visual 934 00:55:44,010 --> 00:55:48,690 intelligence, where we will first start 935 00:55:48,690 --> 00:55:52,030 with self-supervised learning. 936 00:55:52,030 --> 00:55:55,960 Self-supervised learning is a branch of machine learning 937 00:55:55,960 --> 00:56:00,579 in which models learn to understand and represent data 938 00:56:00,579 --> 00:56:04,179 by getting training signals from the data itself. 939 00:56:04,179 --> 00:56:06,384 We will cover this topic. 940 00:56:06,385 --> 00:56:10,180 It's one of the approaches that has enabled training 941 00:56:10,179 --> 00:56:15,339 of large-scale models using vast amounts of data that do not 942 00:56:15,340 --> 00:56:18,880 require labels: unlabeled data. 
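The data-parallelism strategy mentioned above can be illustrated with a toy NumPy sketch: each simulated worker computes the gradient on its own shard of the batch, and averaging the shard gradients (the role an all-reduce synchronization step plays in real systems) recovers the full-batch gradient. The linear model and sizes here are made up for illustration.

```python
import numpy as np

def grad_linear(W, X, y):
    """Gradient of mean squared error for a linear model y ~ X @ W."""
    return 2 * X.T @ (X @ W - y) / len(X)

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))
y = rng.standard_normal((8, 1))
W = np.zeros((3, 1))

# Data parallelism: split the batch across two "workers",
# each computing gradients only on its shard.
shards = np.split(np.arange(8), 2)             # two workers, 4 examples each
worker_grads = [grad_linear(W, X[s], y[s]) for s in shards]
avg_grad = np.mean(worker_grads, axis=0)       # the all-reduce step, in spirit

# With equal-sized shards, the average matches the full-batch gradient.
print(np.allclose(avg_grad, grad_linear(W, X, y)))  # True
```

Model parallelism, by contrast, would split the parameters themselves across workers; both strategies, and the synchronization challenges they raise, come up later in the quarter.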
943 00:56:18,880 --> 00:56:23,200 And it has played a key role in recent breakthroughs 944 00:56:23,199 --> 00:56:26,199 in computer vision in general. 945 00:56:26,199 --> 00:56:30,799 We will also talk a little bit about generative models. 946 00:56:30,800 --> 00:56:33,710 They go beyond recognition. 947 00:56:33,710 --> 00:56:35,860 They actually generate. 948 00:56:35,860 --> 00:56:39,340 This is an example: the content of a Stanford campus 949 00:56:39,340 --> 00:56:44,380 photo reimagined in the style of Van Gogh's Starry 950 00:56:44,380 --> 00:56:45,490 Night. 951 00:56:45,489 --> 00:56:49,989 This is known as style transfer, a classic application 952 00:56:49,989 --> 00:56:54,369 of neural generative techniques. 953 00:56:54,369 --> 00:56:58,269 Generative models can now also translate language 954 00:56:58,269 --> 00:57:03,219 into images. Given a prompt, 955 00:57:03,219 --> 00:57:07,289 a model like DALL-E or DALL-E 2 generates an entirely novel 956 00:57:07,289 --> 00:57:09,059 image. 957 00:57:09,059 --> 00:57:12,570 This showcases how generative vision models 958 00:57:12,570 --> 00:57:16,830 blend understanding, creativity, and control 959 00:57:16,829 --> 00:57:19,349 in their generations. 960 00:57:19,349 --> 00:57:22,589 And you've probably heard recently 961 00:57:22,590 --> 00:57:26,620 about the topic of diffusion models in general. 962 00:57:26,619 --> 00:57:33,179 That's another thing that we'll be covering this quarter. 963 00:57:33,179 --> 00:57:37,649 They basically learn to reverse a gradual noising 964 00:57:37,650 --> 00:57:40,510 process to generate images. 
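The gradual noising process that diffusion models learn to reverse can be sketched as follows. This is a DDPM-style closed-form sample of a noised image at step t; the noise schedule here is an illustrative assumption, not a tuned one.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))        # toy "image"

# Forward (noising) process: at each step, blend the image with a
# bit of Gaussian noise. alpha_bar[t] tracks how much of the
# original signal survives after t steps.
betas = np.linspace(1e-4, 0.2, 50)      # illustrative noise schedule
alpha_bar = np.cumprod(1.0 - betas)

def noisy_sample(x0, t):
    """Closed-form sample of x_t given x_0 (DDPM-style)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

# Early steps stay close to the image; by the last step, alpha_bar
# is tiny and the sample is dominated by noise.
xt = noisy_sample(x0, 49)
```

Generation then runs this process in reverse: a network trained to predict the added noise is applied step by step, turning pure noise back into an image.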
965 00:57:40,510 --> 00:57:43,630 And interestingly, in assignment 3, 966 00:57:43,630 --> 00:57:46,860 you will actually be implementing a generative model 967 00:57:46,860 --> 00:57:53,400 that generates emojis from text inputs, 968 00:57:53,400 --> 00:57:57,360 from prompts-- for example, a face with a cowboy hat, which 969 00:57:57,360 --> 00:58:01,240 is denoised from pure noise. 970 00:58:01,239 --> 00:58:06,529 Vision-language models are the next topic of interest 971 00:58:06,530 --> 00:58:08,890 we will be covering. 972 00:58:08,889 --> 00:58:16,039 They connect text and images in a shared representation space. 973 00:58:16,039 --> 00:58:19,900 And given a caption or an image, the model 974 00:58:19,900 --> 00:58:24,289 retrieves or generates its corresponding pair, 975 00:58:24,289 --> 00:58:25,309 as you can see. 976 00:58:25,309 --> 00:58:29,049 So there are a lot of advances in this area. 977 00:58:29,050 --> 00:58:32,170 We'll be covering some of the key examples. 978 00:58:32,170 --> 00:58:37,750 Again, this is a key task for cross-modal retrieval 979 00:58:37,750 --> 00:58:41,119 and understanding, visual question answering, and so on. 980 00:58:41,119 --> 00:58:44,269 So we'll get to that in the class, too. 981 00:58:44,269 --> 00:58:52,809 Moving beyond 2D, models can now reconstruct and generate 3D 982 00:58:52,809 --> 00:58:55,549 representations from images. 983 00:58:55,550 --> 00:59:00,980 And here, you can see some voxel-based reconstructions, 984 00:59:00,980 --> 00:59:06,769 shape completion, and even 3D object detection from single- 985 00:59:06,769 --> 00:59:09,599 view images. 986 00:59:09,599 --> 00:59:14,809 So 3D vision enables a more spatially grounded 987 00:59:14,809 --> 00:59:19,699 understanding, which is crucial for robotics and AR/VR 988 00:59:19,699 --> 00:59:20,399 applications. 989 00:59:20,400 --> 00:59:26,900 And finally, vision empowers embodied agents 990 00:59:26,900 --> 00:59:30,680 that act in the physical world. 
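Going back to vision-language models for a moment, the shared representation space can be sketched in a CLIP-style toy example: both modalities are embedded as vectors in one space, and retrieval is nearest-neighbor by cosine similarity. The embedding vectors below are made-up stand-ins for real encoder outputs.

```python
import numpy as np

def normalize(v):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Hypothetical embeddings of two images and one caption,
# standing in for the outputs of trained image/text encoders.
image_embs = normalize(np.array([[1.0, 0.1, 0.0],    # photo of a dog
                                 [0.0, 1.0, 0.2]]))  # photo of a car
caption_emb = normalize(np.array([0.9, 0.2, 0.0]))   # "a dog in the park"

sims = image_embs @ caption_emb          # cosine similarity to each image
best = int(np.argmax(sims))
print(best)  # 0: the caption retrieves the dog photo
```

The same nearest-neighbor machinery supports both directions (caption-to-image and image-to-caption), which is what makes cross-modal retrieval and visual question answering possible on top of one shared space.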
991 00:59:30,679 --> 00:59:35,279 So these models often must perceive, plan, 992 00:59:35,280 --> 00:59:41,390 and execute, whether it's cleaning up a messy room 993 00:59:41,389 --> 00:59:44,879 or generalizing from human demonstrations. 994 00:59:44,880 --> 00:59:50,210 So with all of these, we will be covering different topics 995 00:59:50,210 --> 00:59:53,970 around generative and interactive visual intelligence. 996 00:59:53,969 --> 01:00:00,759 And finally, we will cover some human-centered applications 997 01:00:00,760 --> 01:00:05,990 and implications, as Fei-Fei very nicely explained. 998 01:00:05,989 --> 01:00:08,719 Computer vision, 999 01:00:08,719 --> 01:00:12,069 and AI more generally, has been having a lot 1000 01:00:12,070 --> 01:00:16,070 of impact in the past years. 1001 01:00:16,070 --> 01:00:18,280 And it's very important to understand 1002 01:00:18,280 --> 01:00:21,230 the human-centered aspects and applications. 1003 01:00:21,230 --> 01:00:24,159 Some of these impacts are reflected 1004 01:00:24,159 --> 01:00:32,469 by the awards that are going to researchers in this space. 1005 01:00:32,469 --> 01:00:38,769 This was first recognized by the 2018 Turing Award, which 1006 01:00:38,769 --> 01:00:41,440 is the most prestigious technical award, given 1007 01:00:41,440 --> 01:00:45,400 for major contributions of lasting importance 1008 01:00:45,400 --> 01:00:47,090 to computing. 1009 01:00:47,090 --> 01:00:50,890 Geoffrey Hinton, Yoshua Bengio, and Yann LeCun 1010 01:00:50,889 --> 01:00:54,849 received the award for conceptual and engineering 1011 01:00:54,849 --> 01:00:57,049 breakthroughs that have made 1012 01:00:57,050 --> 01:01:01,440 deep neural networks a critical component of computing. 
1013 01:01:01,440 --> 01:01:06,200 Beyond that, last year, in 2024, Geoffrey Hinton 1014 01:01:06,199 --> 01:01:11,089 was jointly awarded the Nobel Prize in Physics 1015 01:01:11,090 --> 01:01:14,990 alongside John Hopfield for their foundational contributions 1016 01:01:14,989 --> 01:01:17,459 to neural networks. 1017 01:01:17,460 --> 01:01:21,260 And finally, I want to very briefly mention the learning 1018 01:01:21,260 --> 01:01:27,770 objectives for this class: formalizing computer vision 1019 01:01:27,769 --> 01:01:30,239 applications into tasks-- 1020 01:01:30,239 --> 01:01:33,619 you can see some of the details here-- 1021 01:01:33,619 --> 01:01:38,599 developing and training vision models, models 1022 01:01:38,599 --> 01:01:41,400 that operate on visual data-- 1023 01:01:41,400 --> 01:01:43,220 images, videos, and so on-- 1024 01:01:43,219 --> 01:01:46,549 and gaining an understanding of where the field is 1025 01:01:46,550 --> 01:01:48,990 and where it is headed. 1026 01:01:48,989 --> 01:01:53,619 That's why we also have some new topics covered specifically 1027 01:01:53,619 --> 01:01:56,920 this year. 1028 01:01:56,920 --> 01:02:01,539 So for the four topics that I mentioned earlier, 1029 01:02:01,539 --> 01:02:06,529 we will be going over the basics in the very first few weeks. 1030 01:02:06,530 --> 01:02:09,220 Bear with us, because these are important topics. 1031 01:02:09,219 --> 01:02:12,859 And you need to understand the details first, 1032 01:02:12,860 --> 01:02:15,110 how to build the models from scratch. 1033 01:02:15,110 --> 01:02:19,180 And then we'll get to the more interesting, exciting topics 1034 01:02:19,179 --> 01:02:20,440 of the day 1035 01:02:20,440 --> 01:02:21,769 in computer vision. 1036 01:02:21,769 --> 01:02:27,969 And finally, we'll have one big lecture on human-centered AI 1037 01:02:27,969 --> 01:02:30,549 and computer vision. 
1038 01:02:30,550 --> 01:02:33,039 I want to just leave you with what we 1039 01:02:33,039 --> 01:02:34,789 will be covering next session. 1040 01:02:34,789 --> 01:02:38,380 That's going to be image classification 1041 01:02:38,380 --> 01:02:43,720 and linear classifiers, which will get us started 1042 01:02:43,719 --> 01:02:45,909 with the world of CS231n. 1043 01:02:45,909 --> 01:02:47,969 Thank you.