This is CS231n, and I'm Professor Fei-Fei Li from the Computer Science Department. I will be co-teaching this quarter with Professor Ehsan Adeli and my graduate student Zane. You'll meet them, as well as our wonderful TA team, later on.

So I just want to get started. This is what excites me: AI has become such an interdisciplinary field. What you're going to learn in this class is, of course, very technical. It's about computer vision and deep learning. But I really do hope that you take it to whichever discipline you work in and are passionate about, and apply it.

We hear a lot about the field of AI. So how do we position computer vision and the scope of this class? If you consider AI as this big bubble, computer vision is very much an integral part of AI. Some of you have heard me say that not only is vision part of intelligence, it's a cornerstone of intelligence. Unlocking the mystery of visual intelligence is unlocking the mystery of intelligence.
But one of the most important mathematical tools for solving AI is machine learning, or what some people call statistical machine learning. And this is exactly what we will be talking about. Within the field of machine learning, in the past 10-plus years, we have seen a major revolution called deep learning. I'll explain a little bit of what deep learning is. Deep learning is a set of algorithmic techniques built around a family of algorithms called neural networks.

So if you ask me to pinpoint the scope of this class: we'll not be able to cover the entirety of computer vision, nor the entirety of machine learning or deep learning. But we're going to cover the core intersection of these two fields. And of course, just like the entirety of AI, computer vision is becoming more and more an interdisciplinary field.
A lot of the techniques we use, as well as the problems we work on, intersect with many other fields, like natural language processing, speech recognition, and robotics. And AI as a whole is a field that intersects with mathematics, neuroscience, computer science, psychology, physics, biology, and many application areas, from medicine to law to education and business and so on.

So what you will get in this lecture, our first lecture, is a very brief history of computer vision and deep learning. Then Professor Adeli will give an overview of this course and lay the groundwork for how the course is set up and what our expectations are.

So, the history of vision did not begin when you were born, or when humanity was born. The history of vision began 540 million years ago. You might ask: what happened 540 million years ago? Why are we pinpointing a relatively specific point in evolution? Well, it's because fossil studies have shown us that there was a mysterious period called the Cambrian explosion.
The fossil record shows that during that time, over about 10 million years, which is a very short period for evolution, there was an explosion of animal species. Before the Cambrian explosion, life on Earth was pretty chill. It was actually in the water; there were no animals on land yet, and animals just floated around. So what caused this explosion in animal speciation? There were many theories, from climate to the chemical composition of the ocean water. But one of the most compelling theories was the onset of eyes. The first animals, trilobites, gained photosensitive cells. The eyes we're talking about were not sophisticated lenses and retinas and nerve cells. It was literally a very simple pinhole, and that pinhole collected light.

Once you collect light, life is completely different. Without sensors, life is metabolism. It's very passive; it is just metabolism, and you come and go. With sensors, you become an integral part of the environment, one you might want to change, one you might want to actually survive in.
Some animals or plants become your dinner, and you become someone else's dinner. So evolutionary forces drive intelligence to evolve, because of the onset of sensors, because of the onset of vision, along with haptic or tactile sensing. Those are the oldest senses for animals. So that entire course of 540 million years of evolution of vision is the evolution of intelligence. Vision, as one of the primary senses of animals, drove the development of the nervous system, the development of intelligence. Almost all animals on Earth today that we know of have vision, or use vision as one of their primary senses. Humans are especially visual animals: more than half of our cortical cells are involved in visual processing, and we have a very complex and convoluted visual system. So this is what excited me to enter the field of vision, and I hope it excites you.

So now, let's fast forward from the Cambrian explosion to human civilization. Humans do innovate. Not only do we see, we want to build machines that see.
So here are a couple of drawings by, of course, Leonardo da Vinci, who was just forever curious about everything. He studied the camera obscura, along with how to make steam machines. In fact, even way before him, in ancient Greece and in ancient China, we have documents of thinkers and philosophers reasoning about how to project objects through pinholes and create images of them. And of course, in our modern life, cameras have truly exploded. But cameras are not enough for seeing, just like eyes are not enough for seeing. These are apparatus. We need to understand how visual intelligence happens, and that's really the crux of this course.

So let's talk a little bit about the history that brought us to this intersection of deep learning and computer vision. Let me go back to the 1950s. In the 1950s, a set of critically important experiments happened in neuroscience: the study of the visual pathways of mammals, especially the seminal work by Hubel and Wiesel. They inserted electrodes into live, anesthetized cats.
Then they studied the receptive fields of neurons in the primary visual cortex. What they learned, to their surprise, were two very important things.

One is that neurons responsible for seeing in the primary visual cortex have their own individual receptive fields. A receptive field means that for every neuron, there is a part of visual space it actually sees. It's not all of the space, and it's not very big; it tends to be a very confined patch. And within that patch, the neuron responds to specialized, simple patterns when you're measuring from the early part of the visual pathway. By and large, in the primary visual cortex, which is around here at the back of the head, not near your eyes, those patterns are oriented edges, or moving oriented edges. So some neurons will see an edge at one orientation, and some will see an edge at another. And that's how the computation in the brain begins.

The second thing they learned is that the visual pathway is hierarchical. As you move along the visual pathway, the neurons feed into other neurons.
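As an aside (this is my own minimal sketch, not lecture code): an oriented-edge receptive field of the kind Hubel and Wiesel found can be modeled as convolving an image with a small oriented filter. The filter values and the toy image below are made up purely for illustration.

```python
# Sketch: model an "oriented edge" receptive field as a small filter
# slid over an image, the operation at the heart of convolutional nets.

def convolve2d(image, kernel):
    """Valid 2D convolution (strictly, cross-correlation, which is what
    neural-network layers actually compute) over nested lists."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(ih - kh + 1):
        row = []
        for x in range(iw - kw + 1):
            s = sum(image[y + j][x + i] * kernel[j][i]
                    for j in range(kh) for i in range(kw))
            row.append(s)
        out.append(row)
    return out

# A vertical-edge filter: responds where brightness changes left-to-right,
# and not at all in uniform regions.
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

# Toy image: dark on the left half, bright on the right half.
image = [[0, 0, 0, 9, 9, 9] for _ in range(6)]

response = convolve2d(image, vertical_edge)
# Each output row is [0, 27, 27, 0]: the strongest response sits
# exactly where the dark-to-bright edge is, zero elsewhere.
```

A filter rotated 90 degrees would instead respond to horizontal edges, which is the sense in which different neurons "see" edges at different orientations.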
And the neurons in the higher, or deeper, layers of the visual hierarchy have more complex receptive fields. So if you begin with oriented edges, those might feed into a corner receptor, which might feed into an object receptor. I'm oversimplifying, but that's the concept: neurons feed into each other, and they create this big network of computation. Of course, most of you sitting here are already thinking that the way I've been describing this will have a profound impact on the neural network modeling of visual algorithms.

Let's keep going. That was the year 1959, a very early study of seeing. By the way, about 30 years later, maybe not quite, 20-something years later, Hubel and Wiesel won the Nobel Prize in Medicine for this work, for uncovering the principles of visual processing.

Another milestone in the early history of computer vision was the first PhD thesis in computer vision. Most people attribute it to Larry Roberts, who in 1963 wrote the first PhD thesis on the topic, studying shape. And this is a very, very simplified representation of the world.
The idea was: can we take a shape like this and understand the surfaces and the corners and the features of that shape? It's intuitive; humans do it. So an entire PhD thesis was devoted to this. And that's the beginning of computer vision.

Around that time, in 1966, an MIT professor created a summer project at MIT and proposed to hire a few very smart undergrads to study vision. The goal was pretty much to solve computer vision, or solve vision, in one summer. Of course, just like the rest of the history of AI, we tend to be overoptimistic about what we can do in a short period of time. So vision did not get solved that summer. In fact, it has blossomed into an incredible computer science field. If you go to our annual conferences now, more than 10,000 people attend. But the 1960s, between Larry Roberts's PhD thesis and this kind of project, is what we in our field consider the beginning of the field of computer vision.

A seminal book was written in the 1970s by David Marr, who unfortunately died too early.
He wanted to study vision systematically and to consider how visual processing happens. Even though it is not explicitly stated, there is a lot of inspiration from neuroscience and cognitive science. He was thinking about: if you take an input image, how do we visually process and understand it? Maybe the first layer is more like edges, just like we saw; he calls it the primal sketch. Then there is a 2-and-1/2-D sketch, which separates the different depths of the objects in the image. So the ball is the foreground object, and the ground here is the background. That's the 2-and-1/2-D sketch. And then, finally, David Marr believed the grand holy-grail victory of solving vision is to recover the entire, full 3D representation. And that is actually the hardest part of vision.

Let me digress for 20 seconds. Because if you think about vision, for all animals it's an ill-posed problem.
Ever since the early trilobites collected light underwater, the world, through photons, has been projected onto a surface that is more or less 2D. Back then it was just some patch on the animal; for us now, it's the retina. But the actual world is 3D. So recovering 3D information, the entire 3D world, from 2D images is the fundamental problem nature had to solve, and computer vision has to solve. And mathematically, that's an ill-posed problem. So what did we later do? Anybody have a wild guess?

[INAUDIBLE]

Yes. The trick nature came up with is to develop multiple eyes, mostly two; some animals have more than two. And then you triangulate information. But two eyes are not enough: you actually have to understand correspondences and all that. We'll touch on some of these topics, but there are other computer vision classes that Stanford offers which specifically cover 3D vision. The point is, it's a very hard problem, and we have to solve it. Nature has solved it.
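To make the triangulation idea concrete (my own sketch, not part of the lecture): in an idealized, rectified two-camera setup, depth follows from similar triangles as Z = f * B / d, where f is the focal length in pixels, B is the baseline between the two "eyes", and d is the disparity, the horizontal shift of the same point between the two images. The camera numbers below are hypothetical; and finding d in the first place is exactly the correspondence problem mentioned above.

```python
# Sketch: depth from stereo disparity under an idealized rectified setup.
#   Z = f * B / d
# Larger disparity (bigger shift between the two views) means closer.

def depth_from_disparity(f_px, baseline_m, disparity_px):
    """Return depth in meters for a rectified stereo pair."""
    if disparity_px <= 0:
        raise ValueError("zero disparity would place the point at infinity")
    return f_px * baseline_m / disparity_px

# Hypothetical camera: 700-pixel focal length, eyes 6.5 cm apart.
z_near = depth_from_disparity(700, 0.065, 91.0)   # large shift -> 0.5 m
z_far = depth_from_disparity(700, 0.065, 4.55)    # small shift -> 10 m
```

Note how quickly disparity shrinks with distance: beyond a few meters, a short baseline like two eyes gives only sub-pixel shifts, which is one reason metric depth from vision alone stays imprecise.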
Humans have solved it, but not to extreme precision. In fact, humans are not that precise. I roughly know the 3D shapes around me, but I don't have geometric precision for all of them. So that's one thing to consider, and to appreciate how hard this problem is.

Another difference between computer vision and language is something philosophically subtle. Language doesn't exist in nature. You cannot point to something and say, there is language. Language is a purely generated thing. I don't even know what word to use; it comes through our brain. It's generated. It's 1D. It's sequential. This actually has profound implications for the latest wave of GenAI algorithms. This is why LLMs, which are outside the scope of this class, are so powerful: because we can model language that way. But vision is not generated. There is actually a physical world out there, respecting the laws of physics and materials and all that. So vision has very different tasks.
So I just want you to appreciate the difference between language and vision, and, frankly, to appreciate how nature solved this problem. Let's keep going.

In the 1970s, the early pioneers of computer vision, without data, without much in the way of powerful computers, without the mathematical advances we have today, were already beginning to attack some of the harder problems of computer vision, for example, the recognition of objects. Here at Stanford, one of the pioneering works was called generalized cylinders, by Rodney Brooks and Tom Binford. And as it happens, Rodney Brooks is on campus today, over there, giving a talk at a robotics conference. He went on to become one of the greatest roboticists of our time and was a founder of iRobot, the company behind Roomba and many other robots. And not very far from us, in another part of Palo Alto, researchers worked on similarly compositional models of human bodies and objects.

Then, in the 1980s, digital photos started to appear, or at least photos that people could digitize a little bit. And there was some great work on edge detection.
You look at all this and probably feel a sense of disappointment. I mean, it seems kind of trivial to get some sketches and edges, and it's not really going anywhere. That's how computer vision worked at that time. And in fact, you're not so wrong. Around that time, before many of you were born, we entered the AI winter. The field entered the AI winter because the enthusiasm, and hence the funding, for AI research really dwindled. A lot of things didn't deliver. Computer vision didn't deliver. Expert systems didn't deliver. Robotics didn't deliver.

But under the hood of this winter, a lot of research started to grow in different fields, like computer vision, NLP, and robotics. So let's also look at another strand of research that had a profound implication for computer vision: cognitive science and neuroscience continued to blossom. And what is really important, especially for the field of computer vision, is that cognitive science and neuroscience started to point to the North Star problems we should work on.
For example, psychologists have told us there's something special about seeing nature, seeing the real world. This is a study by Irv Biederman, which shows that the detection of bicycles in two images differs depending on whether the images are scrambled or not. Think about it: from a photon point of view, these two bicycles land in the same location on your retina. But somehow the rest of the image impacts the viewer's ability to see the target object. So something is telling us that seeing the entire forest, the entire world, impacts the way we see objects.

It also tells us visual processing is very fast. Here's another, more direct measure of how fast we detect objects. This is an early 1970s experiment showing people a video, and the task for the subject is to detect the human in one of the frames. I suppose every one of you has seen that human in one of the frames. But think about how remarkable your eyes are, or your brain is, because you've never seen this video. I didn't tell you in which frame the target object would appear.
I did not tell you what the target object would look like, where it would be, its gestures, and all that. Yet you have no problem detecting the humans. And on top of that, these frames are played at 10 Hertz, which means you're seeing every frame for only 100 milliseconds. That is how remarkable our visual system is.

In fact, Simon Thorpe, another cognitive neuroscientist, has measured the speed. You hook people up with EEG caps, show them hundreds of complex natural scenes, and ask them to categorize the scenes: those with animals versus those without animals. And then you measure the brain waves. It turns out that after just 150 milliseconds of seeing a photo, your brain already has a differential signal that categorizes it. You might not be so impressed, because compared to today's GPUs and modern chips, 150 milliseconds is orders of magnitude slower. But you have to admire it. Our wetware, our brain, our neurons don't work as fast as transistors. 150 milliseconds is actually really fast.
It's only a few hops in the brain in terms of neural processing. So yet again, this tells us humans are really good at seeing objects and categorizing them. In fact, not only are we good at seeing and categorizing objects, we have even developed specialized brain areas with expert ability in recognizing faces, or places, or body parts. These are discoveries by MIT neuroscientists in the 1990s and early 21st century.

So all these studies tell us that we should not just be studying these kinds of simplified shapes or sketches of images. We really should go after the important, fundamental problems that drive visual intelligence. And one of those problems that everything has been pointing us to is object recognition, object recognition in natural settings. There are a lot of objects out there in the world, and studying this is going to be part of unlocking visual intelligence. And that's what we did. As a field, we started by looking at how we can separate foreground objects from background objects. This is called recognition by grouping, in the 1990s.
Keep in mind, we were still in the AI winter, but research was actually happening and progressing. Then there were studies of features; some of you might still remember SIFT features and feature matching. And when I entered grad school, the most exciting thing was face detection. I remember that in my first year of grad school, this paper was published. And five years later, the first digital cameras used this paper's algorithm to deliver automatic face focus, thanks to face detection. So things started to work and to be taken up by industry.

Then, around the early 21st century, a very important thing started to happen: the internet. When the internet started to happen, data started to proliferate. And the combination of digital cameras and the internet started to give the field of computer vision some data to work with. So in those early days, we were working with thousands, or tens of thousands, of images to study the visual recognition problem, or the object recognition problem. That's when you got data sets like the PASCAL Visual Object Classes challenge, or Caltech 101. I'm going to pause here.
454 00:25:43,609 --> 00:25:50,059 And this is where the first thread of computer vision 455 00:25:50,059 --> 00:25:51,059 started to progress. 456 00:25:51,059 --> 00:25:54,419 And you might be wondering, why is she pausing? 457 00:25:54,420 --> 00:25:57,300 Because I'm going to come back and talk about deep learning. 458 00:25:57,299 --> 00:26:03,169 So while this field of vision was progressing 459 00:26:03,170 --> 00:26:06,980 through neurophysiology to computer vision, 460 00:26:06,980 --> 00:26:11,490 to cognitive neuroscience, to computer vision again, 461 00:26:11,490 --> 00:26:14,980 a separate effort was going on in parallel. 462 00:26:14,980 --> 00:26:17,380 And that eventually became deep learning. 463 00:26:17,380 --> 00:26:22,870 It started from these early studies of neural networks, 464 00:26:22,869 --> 00:26:24,269 things like the perceptron. 465 00:26:24,269 --> 00:26:29,799 And people like Rumelhart started to work. 466 00:26:29,799 --> 00:26:32,139 And of course, Jeff Hinton in his early days 467 00:26:32,140 --> 00:26:35,400 started to work with a small number of artificial neurons 468 00:26:35,400 --> 00:26:41,009 and look at how that can process information and learn. 469 00:26:41,009 --> 00:26:48,269 And you've heard of great minds like Marvin Minsky 470 00:26:48,269 --> 00:26:52,619 and his colleagues working on different aspects 471 00:26:52,619 --> 00:26:54,549 of the perceptron. 472 00:26:54,549 --> 00:27:02,849 But Marvin Minsky did say that perceptrons cannot learn 473 00:27:02,849 --> 00:27:05,219 the XOR logic function. 474 00:27:05,220 --> 00:27:10,130 And that caused a little bit of a setback in neural networks. 475 00:27:10,130 --> 00:27:14,670 Well, things continued to progress despite the setback. 476 00:27:14,670 --> 00:27:21,529 And one of the most important works before the first inflection 477 00:27:21,529 --> 00:27:25,889 point is the neocognitron work by Fukushima in Japan. 
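Minsky's XOR observation mentioned above can be made concrete with a few lines of code (an illustrative sketch, not from the lecture): a single perceptron is a linear threshold unit, and no line separates XOR's positive and negative cases, while one extra layer of units solves it.

```python
import numpy as np

# XOR truth table: no single line separates the 0s from the 1s.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

def perceptron(x, w, b):
    # A single linear threshold unit.
    return int(np.dot(w, x) + b > 0)

# Try to train one perceptron with the classic update rule.
w, b = np.zeros(2), 0.0
for _ in range(100):
    for xi, yi in zip(X, y):
        err = yi - perceptron(xi, w, b)
        w += err * xi
        b += err

single_layer = [perceptron(xi, w, b) for xi in X]
print("single perceptron:", single_layer)  # can never equal [0, 1, 1, 0]

# Two layers suffice: XOR = (x1 OR x2) AND NOT (x1 AND x2).
def two_layer_xor(x):
    h1 = perceptron(x, np.array([1, 1]), -0.5)   # OR gate
    h2 = perceptron(x, np.array([1, 1]), -1.5)   # AND gate
    return perceptron(np.array([h1, h2]), np.array([1, -1]), -0.5)

print("two-layer network:", [two_layer_xor(xi) for xi in X])  # [0, 1, 1, 0]
```

The hidden-layer weights here are hand-set for clarity; the learning rule for multi-layer networks is exactly the backpropagation story the lecture turns to next.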
478 00:27:25,890 --> 00:27:31,980 Fukushima hand-designed a neural network that looks like this. 479 00:27:31,980 --> 00:27:35,700 So it has about five or six layers. 480 00:27:35,700 --> 00:27:41,779 And then he designed the different functions 481 00:27:41,779 --> 00:27:43,700 across the layers, which you will 482 00:27:43,700 --> 00:27:46,910 learn more about, that more or less were 483 00:27:46,910 --> 00:27:50,850 inspired by the visual pathway that I was describing. 484 00:27:50,849 --> 00:27:54,559 Remember the cat experiment, from simple receptive fields 485 00:27:54,559 --> 00:27:56,789 to more complicated receptive fields. 486 00:27:56,789 --> 00:27:59,039 And he was doing that here. 487 00:27:59,039 --> 00:28:01,829 The early layers have simple functions. 488 00:28:01,829 --> 00:28:03,269 And then the later layers 489 00:28:03,269 --> 00:28:05,490 have more complex functions. 490 00:28:05,490 --> 00:28:08,680 And the simple ones, you can call convolution-- 491 00:28:08,680 --> 00:28:10,710 he uses the convolution function. 492 00:28:10,710 --> 00:28:13,620 And in the more complex ones, he was pooling the information 493 00:28:13,619 --> 00:28:15,219 from the convolution layers. 494 00:28:15,220 --> 00:28:19,799 So the neocognitron was really an engineering feat 495 00:28:19,799 --> 00:28:24,794 because every parameter was hand-designed. 496 00:28:24,795 --> 00:28:26,170 There are hundreds of parameters. 497 00:28:26,170 --> 00:28:29,430 He had to just meticulously put them together 498 00:28:29,430 --> 00:28:32,610 so that this small neural network could 499 00:28:32,609 --> 00:28:35,909 recognize digits or letters. 500 00:28:35,910 --> 00:28:41,130 So the real breakthrough that came around that time, in 1986, 501 00:28:41,130 --> 00:28:43,180 is a learning rule. 502 00:28:43,180 --> 00:28:45,580 That learning rule is called backpropagation. 
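The simple-to-complex layer pattern just described for the neocognitron — convolution layers followed by layers that pool their responses — can be sketched in a few lines of numpy (a minimal illustration, not Fukushima's actual architecture or parameters):

```python
import numpy as np

def convolve2d(image, kernel):
    # "Simple cell" stage: slide a small filter over the image
    # (valid mode, stride 1) and record its response at each position.
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    # "Complex cell" stage: pool responses over local neighborhoods,
    # giving tolerance to small shifts of the stimulus.
    H, W = feature_map.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = feature_map[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

# A vertical-edge filter applied to a tiny image containing a vertical bar.
image = np.zeros((6, 6))
image[:, 3] = 1.0
kernel = np.array([[1.0, -1.0], [1.0, -1.0]])
pooled = max_pool(convolve2d(image, kernel))
print(pooled.shape)  # (2, 2)
```

The strong pooled response where the bar sits survives small shifts of the bar — the shift tolerance that the complex-cell stage is meant to provide.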
503 00:28:45,579 --> 00:28:47,579 It's going to be one of our first classes 504 00:28:47,579 --> 00:28:52,454 to show you that Rumelhart, Jeff Hinton-- 505 00:28:52,454 --> 00:28:58,019 they took the neural network architecture 506 00:28:58,019 --> 00:29:04,259 and introduced an error-correcting objective function 507 00:29:04,259 --> 00:29:07,400 so that if you put in some input and know 508 00:29:07,400 --> 00:29:10,280 what the correct output is, how do you 509 00:29:10,279 --> 00:29:14,779 take the difference between what the neural network outputs 510 00:29:14,779 --> 00:29:17,899 versus the actual correct answer and then 511 00:29:17,900 --> 00:29:22,640 propagate the information back so that you 512 00:29:22,640 --> 00:29:28,590 can improve the parameters along the neural network? 513 00:29:28,589 --> 00:29:31,250 And that propagation from the output 514 00:29:31,250 --> 00:29:33,799 back through the entire neural network 515 00:29:33,799 --> 00:29:35,849 is called backpropagation. 516 00:29:35,849 --> 00:29:39,179 It follows the basic chain rule from calculus. 517 00:29:39,180 --> 00:29:47,420 And that was a watershed moment for neural network algorithms. 518 00:29:47,420 --> 00:29:50,970 And of course, we're still smack in the middle of AI winter. 519 00:29:50,970 --> 00:29:54,809 All this work was happening without public fanfare. 520 00:29:54,809 --> 00:29:57,929 But of course, in the world of research, 521 00:29:57,930 --> 00:29:59,650 these are very important milestones. 522 00:29:59,650 --> 00:30:03,720 One of the earliest applications of this neural 523 00:30:03,720 --> 00:30:07,019 network with backpropagation is Yann LeCun's convolutional 524 00:30:07,019 --> 00:30:10,410 neural network, made in the 1990s when he was working 525 00:30:10,410 --> 00:30:11,500 at Bell Labs. 
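The error-correcting idea just described — compare the network's output to the correct answer and push the difference back through the layers via the chain rule — can be sketched for a tiny two-layer network (a minimal illustration with made-up sizes and targets, not the lecture's exact formulation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])  # known correct outputs

W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X):
    h = sigmoid(X @ W1 + b1)        # hidden layer
    return h, sigmoid(h @ W2 + b2)  # network output

_, out = forward(X)
loss_before = np.mean((out - y) ** 2)

for _ in range(5000):
    h, out = forward(X)
    # Error at the output: derivative of the squared error
    # through the output sigmoid.
    d_out = (out - y) * out * (1 - out)
    # Chain rule: propagate the error back to the hidden layer.
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient descent step on every parameter.
    W2 -= h.T @ d_out;  b2 -= d_out.sum(axis=0)
    W1 -= X.T @ d_h;    b1 -= d_h.sum(axis=0)

_, out = forward(X)
loss_after = np.mean((out - y) ** 2)
print(loss_before, "->", loss_after)  # the error shrinks as the rule trains the net
```

Every modern deep learning framework automates exactly these two passes — a forward pass to compute the output and a backward pass applying the chain rule — for networks of any depth.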
526 00:30:11,500 --> 00:30:15,970 And what he did is just create a slightly bigger network, 527 00:30:15,970 --> 00:30:20,610 about seven layers-ish, and make it good enough, 528 00:30:20,609 --> 00:30:25,119 with great engineering capability, to recognize letters. 529 00:30:25,119 --> 00:30:28,709 And it was actually shipped to some part of the US Postal 530 00:30:28,710 --> 00:30:33,579 Service and banks to read digits and letters. 531 00:30:33,579 --> 00:30:37,599 So that was an application of an early neural network. 532 00:30:37,599 --> 00:30:41,250 And then Jeff Hinton and Yann LeCun 533 00:30:41,250 --> 00:30:43,390 continued to work on neural networks. 534 00:30:43,390 --> 00:30:45,720 It didn't go very far 535 00:30:45,720 --> 00:30:52,049 because, despite these improvements and tweaks 536 00:30:52,049 --> 00:30:57,289 of these neural networks, things more or less just stalled. 537 00:30:57,289 --> 00:31:00,279 They collected a big data set of digits and letters. 538 00:31:00,279 --> 00:31:03,730 And digits and letters were kind of quasi-solved 539 00:31:03,730 --> 00:31:05,089 in terms of recognition. 540 00:31:05,089 --> 00:31:08,019 But if you put the system through the kind 541 00:31:08,019 --> 00:31:11,500 of digital photos that the neuroscientists were using 542 00:31:11,500 --> 00:31:14,470 to recognize cats and dogs and microwaves and chairs 543 00:31:14,470 --> 00:31:17,180 and flowers, it just didn't work. 544 00:31:17,180 --> 00:31:22,549 And a huge part of this problem is the lack of data. 545 00:31:22,549 --> 00:31:27,500 And lack of data is not just an inconvenience. 546 00:31:27,500 --> 00:31:29,990 It's actually a mathematical problem 547 00:31:29,990 --> 00:31:36,430 because these algorithms are high-capacity algorithms that 548 00:31:36,430 --> 00:31:39,850 actually need to be driven by lots of data 549 00:31:39,849 --> 00:31:42,349 in order to learn to generalize. 
550 00:31:42,349 --> 00:31:45,009 And there are some deep mathematical principles 551 00:31:45,009 --> 00:31:48,379 behind these rules of generalization and model 552 00:31:48,380 --> 00:31:49,210 overfitting. 553 00:31:49,210 --> 00:31:52,660 And data was underappreciated, was 554 00:31:52,660 --> 00:31:54,840 overlooked, because most people were just 555 00:31:54,839 --> 00:31:56,559 looking at these architectures. 556 00:31:56,559 --> 00:31:59,190 They did not realize that data is 557 00:31:59,190 --> 00:32:02,070 a first-class citizen for machine 558 00:32:02,069 --> 00:32:03,490 learning and deep learning. 559 00:32:03,490 --> 00:32:08,339 So this is part of the work that my students and I did 560 00:32:08,339 --> 00:32:14,759 in the early 2000s, where we recognized this importance 561 00:32:14,759 --> 00:32:15,640 of data. 562 00:32:15,640 --> 00:32:21,240 We hypothesized that the whole field was actually 563 00:32:21,240 --> 00:32:24,519 missing this-- underappreciating the importance of data. 564 00:32:24,519 --> 00:32:27,089 So we went about and collected a huge data 565 00:32:27,089 --> 00:32:30,119 set called ImageNet that has 15 million images, 566 00:32:30,119 --> 00:32:32,259 after cleaning a billion images. 567 00:32:32,259 --> 00:32:38,309 And these 15 million images were sorted across 22,000 categories 568 00:32:38,309 --> 00:32:39,309 of objects. 569 00:32:39,309 --> 00:32:43,109 We actually studied a lot of the cognitive and psychology 570 00:32:43,109 --> 00:32:51,479 literature to appreciate that 22,000 categories 571 00:32:51,480 --> 00:32:54,880 were roughly in the order 572 00:32:54,880 --> 00:32:58,510 of the number of categories that humans learn to recognize 573 00:32:58,509 --> 00:33:00,470 in the early years of their life. 
574 00:33:00,470 --> 00:33:02,180 And then we open sourced this data 575 00:33:02,180 --> 00:33:05,860 set and created an ImageNet challenge called the Large Scale 576 00:33:05,859 --> 00:33:07,579 Visual Recognition Challenge. 577 00:33:07,579 --> 00:33:12,699 We curated a subset of ImageNet of a million images or a million 578 00:33:12,700 --> 00:33:16,870 plus images and 1,000 object classes and then ran 579 00:33:16,869 --> 00:33:21,429 an international object recognition challenge for many 580 00:33:21,430 --> 00:33:22,039 years. 581 00:33:22,039 --> 00:33:26,899 And the goal is that we ask researchers to participate. 582 00:33:26,900 --> 00:33:29,420 And their goal is to create algorithms. 583 00:33:29,420 --> 00:33:31,430 It doesn't matter which kind of algorithms. 584 00:33:31,430 --> 00:33:35,650 And we will test your algorithm's ability to recognize 585 00:33:35,650 --> 00:33:40,900 photos and see if it can call out these 1,000 object classes 586 00:33:40,900 --> 00:33:42,800 as correctly as possible. 587 00:33:42,799 --> 00:33:45,039 And here are the errors. 588 00:33:45,039 --> 00:33:53,069 The first year we ran this competition, 589 00:33:53,069 --> 00:33:57,000 the best performing algorithm's error was nearly 30%. 590 00:33:57,000 --> 00:34:00,859 And that's really pretty abysmal because humans can perform 591 00:34:00,859 --> 00:34:03,509 at under, say, 3% error. 592 00:34:03,509 --> 00:34:07,259 And then 2011, it wasn't that exciting. 593 00:34:07,259 --> 00:34:09,559 But something happened in 2012. 594 00:34:09,559 --> 00:34:12,389 That was the most exciting year. 595 00:34:12,389 --> 00:34:16,190 That year, Jeff Hinton and his students 596 00:34:16,190 --> 00:34:18,650 participated in this challenge using 597 00:34:18,650 --> 00:34:20,340 a convolutional neural network. 598 00:34:20,340 --> 00:34:23,100 And they reduced the error almost by half. 599 00:34:23,099 --> 00:34:29,519 And it truly showed the power of deep learning algorithms. 
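The error numbers just mentioned come from scoring each entry against the 1,000 ground-truth labels. A minimal sketch of how such a top-k classification error can be computed (a toy illustration with made-up scores, not the challenge's official evaluation code):

```python
import numpy as np

def topk_error(scores, labels, k=5):
    # scores: (n_images, n_classes) model confidence per class;
    # labels: (n_images,) ground-truth class indices.
    # An image counts as correct if the true label appears among
    # the model's k highest-scoring classes.
    topk = np.argsort(-scores, axis=1)[:, :k]
    correct = (topk == labels[:, None]).any(axis=1)
    return 1.0 - correct.mean()

# Toy scores for 3 images over 4 classes (made-up numbers).
scores = np.array([[0.1, 0.6, 0.2, 0.1],
                   [0.7, 0.1, 0.1, 0.1],
                   [0.2, 0.2, 0.3, 0.3]])
labels = np.array([1, 2, 3])
print(topk_error(scores, labels, k=1))  # 2 of 3 images wrong at top-1
print(topk_error(scores, labels, k=2))  # 1 of 3 wrong once the top 2 guesses count
```

Letting the model's five best guesses count (top-5 error) is the metric most ImageNet results, including the ones quoted here, are reported in.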
600 00:34:29,519 --> 00:34:34,759 And so the participating algorithm in the 2012 ImageNet 601 00:34:34,760 --> 00:34:36,960 challenge was called AlexNet. 602 00:34:36,960 --> 00:34:42,559 And the funny thing is, if you look at AlexNet, 603 00:34:42,559 --> 00:34:47,449 it's not that different from Fukushima's neocognitron 604 00:34:47,449 --> 00:34:49,579 32 years ago. 605 00:34:49,579 --> 00:34:54,829 But two major things happened between these two. 606 00:34:54,829 --> 00:34:57,529 One is that backpropagation happened. 607 00:34:57,530 --> 00:35:01,269 It's a principled, mathematically rigorous learning 608 00:35:01,269 --> 00:35:04,300 rule so that you don't ever have 609 00:35:04,300 --> 00:35:06,140 to hand-tune parameters. 610 00:35:06,139 --> 00:35:09,409 And that was a major breakthrough theoretically. 611 00:35:09,409 --> 00:35:14,179 Another breakthrough was data. 612 00:35:14,179 --> 00:35:19,629 The recognition of data and the understanding of data driving 613 00:35:19,630 --> 00:35:23,200 these high-capacity models-- which eventually would have 614 00:35:23,199 --> 00:35:26,109 trillions of parameters, but at that time had millions 615 00:35:26,110 --> 00:35:34,831 of parameters-- was critical for setting off deep learning 616 00:35:34,831 --> 00:35:36,410 and making this work. 617 00:35:36,409 --> 00:35:42,405 And really, many people consider the year 2012 618 00:35:42,405 --> 00:35:46,869 and the AlexNet algorithm that won the ImageNet 619 00:35:46,869 --> 00:35:51,019 challenge the historical moment of the birth, 620 00:35:51,019 --> 00:35:54,409 or rebirth, of modern AI, or the birth of the deep learning 621 00:35:54,409 --> 00:35:55,759 revolution. 622 00:35:55,760 --> 00:35:59,540 And of course, the reason many of you are here 623 00:35:59,539 --> 00:36:04,320 is that since then, we have been in the era of deep learning explosion. 
624 00:36:04,320 --> 00:36:10,910 If you look at computer vision's main annual research 625 00:36:10,909 --> 00:36:13,190 conference, called CVPR-- 626 00:36:13,190 --> 00:36:15,619 the number of papers has exploded. 627 00:36:15,619 --> 00:36:18,869 And arXiv papers have exploded. 628 00:36:18,869 --> 00:36:22,730 And many new algorithms since then 629 00:36:22,730 --> 00:36:27,349 have been invented to participate in the ImageNet 630 00:36:27,349 --> 00:36:28,049 challenge. 631 00:36:28,050 --> 00:36:29,870 In the following weeks, we're going 632 00:36:29,869 --> 00:36:31,739 to study some of these algorithms. 633 00:36:31,739 --> 00:36:34,639 But the point is that some of these 634 00:36:34,639 --> 00:36:39,379 algorithms beyond AlexNet have had a profound impact 635 00:36:39,380 --> 00:36:43,610 on the progress of the field of computer vision 636 00:36:43,610 --> 00:36:49,090 and on the applications of computer vision. 637 00:36:49,090 --> 00:36:52,720 So a lot of things have happened. 638 00:36:52,719 --> 00:36:54,529 We're going to cover some of these. 639 00:36:54,530 --> 00:36:57,340 Not only has the field of computer vision 640 00:36:57,340 --> 00:37:01,510 made major progress in creating algorithms 641 00:37:01,510 --> 00:37:06,260 to recognize everyday objects like cats and dogs and chairs-- 642 00:37:06,260 --> 00:37:10,400 quickly after the ImageNet challenge, 643 00:37:10,400 --> 00:37:14,139 the 2012 moment, we also got algorithms 644 00:37:14,139 --> 00:37:22,549 that can recognize much more complicated images, 645 00:37:22,550 --> 00:37:27,470 can retrieve images, can do multiple object detection, 646 00:37:27,469 --> 00:37:30,559 and can do image segmentation. 
647 00:37:30,559 --> 00:37:34,360 These are all different tasks in visual recognition 648 00:37:34,360 --> 00:37:36,220 that you'll find yourself getting 649 00:37:36,219 --> 00:37:38,689 familiar with throughout this course 650 00:37:38,690 --> 00:37:42,139 because vision is not just calling out cats and dogs. 651 00:37:42,139 --> 00:37:48,859 There is so much in the nuanced ability of visual recognition. 652 00:37:48,860 --> 00:37:52,829 And of course, vision is not just static images. 653 00:37:52,829 --> 00:37:57,500 So there is work in video classification, human activity 654 00:37:57,500 --> 00:37:58,710 recognition. 655 00:37:58,710 --> 00:38:00,929 I'm showing you this overview. 656 00:38:00,929 --> 00:38:04,774 You will learn some of these. 657 00:38:04,775 --> 00:38:08,460 You don't have to understand exactly what's going on here. 658 00:38:08,460 --> 00:38:14,940 But I want you to appreciate the variety of vision tasks. 659 00:38:14,940 --> 00:38:20,869 Medical imaging-- for those of you who come from a medical field, 660 00:38:20,869 --> 00:38:24,650 whether it's radiology or pathology or even 661 00:38:24,650 --> 00:38:28,260 other aspects of medicine-- is deeply visual. 662 00:38:28,260 --> 00:38:31,550 And this has a profound impact. 663 00:38:31,550 --> 00:38:37,550 Scientific discovery-- even the seminal picture 664 00:38:37,550 --> 00:38:41,700 you probably remember of the first photograph of a black hole 665 00:38:41,699 --> 00:38:46,829 uses a lot of computer vision and computational photography 666 00:38:46,829 --> 00:38:47,980 techniques. 667 00:38:47,980 --> 00:38:52,980 Of course, computer vision has also contributed 668 00:38:52,980 --> 00:38:58,889 a lot to applications in sustainability and the environment. 669 00:38:58,889 --> 00:39:02,309 And we also have made a lot of progress 670 00:39:02,309 --> 00:39:07,449 in image captioning right after that 2012 moment. 
671 00:39:07,449 --> 00:39:09,989 This is actually work by Andrej Karpathy when he was 672 00:39:09,989 --> 00:39:13,799 my student-- his thesis work. 673 00:39:13,800 --> 00:39:19,030 Then we also worked on relationship understanding. 674 00:39:19,030 --> 00:39:22,710 So visual intelligence is 675 00:39:22,710 --> 00:39:24,639 not only about seeing what's in the pixels; 676 00:39:24,639 --> 00:39:26,859 you can also see what's beyond pixels, 677 00:39:26,860 --> 00:39:33,360 including relationships of objects, and also style transfer. 678 00:39:33,360 --> 00:39:35,880 A lot of this work-- actually, 679 00:39:35,880 --> 00:39:39,000 Justin Johnson, who will come to guest lecture in this course, 680 00:39:39,000 --> 00:39:45,320 will tell you all about his seminal work in style transfer. 681 00:39:45,320 --> 00:39:48,510 And of course, in the generative AI era, 682 00:39:48,510 --> 00:39:53,430 we get these really incredible results like face generation. 683 00:39:53,429 --> 00:39:59,239 And this is from the very early days of image generation with 684 00:39:59,239 --> 00:40:03,379 DALL-E-- I think this is the early DALL-E. Of course, now, Midjourney 685 00:40:03,380 --> 00:40:08,690 and everything else has gone beyond these avocado and peach chairs. 686 00:40:08,690 --> 00:40:14,780 But really, we are squarely in the most exciting modern era 687 00:40:14,780 --> 00:40:16,246 of AI explosion. 688 00:40:20,070 --> 00:40:25,370 The three converging forces of computation, algorithms, 689 00:40:25,369 --> 00:40:29,719 and data have taken this field 690 00:40:29,719 --> 00:40:32,929 to a whole different level, where we're now 691 00:40:32,929 --> 00:40:36,119 totally out of AI winter. 692 00:40:36,119 --> 00:40:40,259 I would say we're in an AI global warming period. 693 00:40:40,260 --> 00:40:46,050 And I don't see any of this slowing down, 694 00:40:46,050 --> 00:40:48,820 for both good and bad reasons. 
695 00:40:48,820 --> 00:40:53,170 And also, just a word, because we are in Silicon Valley, 696 00:40:53,170 --> 00:40:58,050 we're in the very Huang building, in the NVIDIA 697 00:40:58,050 --> 00:41:02,039 lecture hall-- so we cannot ignore the progress 698 00:41:02,039 --> 00:41:05,050 of hardware and the role that it played. 699 00:41:05,050 --> 00:41:14,080 So here is just the FLOPS per dollar graph for NVIDIA's GPUs. 700 00:41:14,079 --> 00:41:19,210 And before 2020, the progress was steady. 701 00:41:19,210 --> 00:41:22,800 But as soon as deep learning started 702 00:41:22,800 --> 00:41:27,420 to drive these GPUs and chips, you 703 00:41:27,420 --> 00:41:33,519 can just see the GFLOPS have completely taken off. 704 00:41:33,519 --> 00:41:40,610 And by any measure, we are in this accelerated curve 705 00:41:40,610 --> 00:41:45,360 of lots of compute as well as lots of AI. 706 00:41:45,360 --> 00:41:47,360 And these are just different graphs 707 00:41:47,360 --> 00:41:50,539 showing you conference attendees, startups, 708 00:41:50,539 --> 00:41:54,500 and enterprise applications in AI, all across 709 00:41:54,500 --> 00:41:55,710 not just computer vision-- 710 00:41:55,710 --> 00:42:02,099 NLP and other areas have also just exploded. 711 00:42:02,099 --> 00:42:06,299 So quickly, last but not least-- it's been exciting. 712 00:42:06,300 --> 00:42:08,070 There have been a lot of successes. 713 00:42:08,070 --> 00:42:11,309 But there is still a lot to be done in computer vision. 714 00:42:11,309 --> 00:42:14,329 So this problem is still not totally solved. 715 00:42:14,329 --> 00:42:19,969 And with great tools come great consequences as well. 716 00:42:19,969 --> 00:42:24,449 So computer vision can do a lot of good. 717 00:42:24,449 --> 00:42:26,039 But it also can do harm. 
718 00:42:26,039 --> 00:42:28,730 For example, human bias-- 719 00:42:28,730 --> 00:42:32,360 every single AI algorithm today, the large ones, 720 00:42:32,360 --> 00:42:33,880 is driven by data. 721 00:42:33,880 --> 00:42:38,550 And data is an artifact of human activities 722 00:42:38,550 --> 00:42:40,360 on Earth and in history. 723 00:42:40,360 --> 00:42:43,900 And a lot of the data carry our bias. 724 00:42:43,900 --> 00:42:47,200 And this gets carried into AI systems. 725 00:42:47,199 --> 00:42:50,609 We have seen a lot of face recognition algorithms having 726 00:42:50,610 --> 00:42:52,990 the same kind of bias that humans have. 727 00:42:52,989 --> 00:42:55,919 And we do have to really recognize that. 728 00:42:55,920 --> 00:43:01,450 We can also use AI to impact human lives, some for the good. 729 00:43:01,449 --> 00:43:02,889 Think about medical imaging. 730 00:43:02,889 --> 00:43:05,199 But some are questionable. 731 00:43:05,199 --> 00:43:09,299 What if AI is solely behind deciding your job 732 00:43:09,300 --> 00:43:11,620 or deciding your financial loans? 733 00:43:11,619 --> 00:43:15,789 So again, is it totally bad? 734 00:43:15,789 --> 00:43:17,050 Is it totally good? 735 00:43:17,050 --> 00:43:19,150 These are very complicated issues. 736 00:43:19,150 --> 00:43:23,490 This is also why I always get so excited when students from HMS 737 00:43:23,489 --> 00:43:26,549 or law school or education school or business school 738 00:43:26,550 --> 00:43:29,670 attend my class, because not all AI 739 00:43:29,670 --> 00:43:31,789 issues are engineering issues. 740 00:43:31,789 --> 00:43:36,559 We have a lot of human factors and societal issues to solve. 741 00:43:36,559 --> 00:43:40,599 I'm also particularly excited by AI's use in medicine and health 742 00:43:40,599 --> 00:43:41,139 care. 743 00:43:41,139 --> 00:43:43,960 This is something really dear to my heart. 
744 00:43:43,960 --> 00:43:46,119 Professor Adeli and Zane, who are 745 00:43:46,119 --> 00:43:49,630 also co-instructors of this course-- the three of us 746 00:43:49,630 --> 00:43:53,500 work on AI for the aging population as well as 747 00:43:53,500 --> 00:43:59,050 patients, to try to use computer vision to deliver care 748 00:43:59,050 --> 00:44:00,170 to people. 749 00:44:00,170 --> 00:44:01,820 So this is a good use. 750 00:44:01,820 --> 00:44:04,820 And also, even in terms of technology, 751 00:44:04,820 --> 00:44:07,190 human vision is remarkable. 752 00:44:07,190 --> 00:44:10,670 I want you to come out of not only today's class 753 00:44:10,670 --> 00:44:14,240 but also this entire course appreciating that, 754 00:44:14,239 --> 00:44:16,969 despite how much computer vision can do, 755 00:44:16,969 --> 00:44:22,250 there's just so much more nuance, subtlety, richness, 756 00:44:22,250 --> 00:44:26,389 complexity, and also emotion in human vision. 757 00:44:26,389 --> 00:44:29,369 Look at these kids studying whatever 758 00:44:29,369 --> 00:44:33,159 their curiosity leads them to, or the humor in this image. 759 00:44:33,159 --> 00:44:36,129 There's still a lot more that computer vision cannot do. 760 00:44:36,130 --> 00:44:38,430 So I hope that continues to entice 761 00:44:38,429 --> 00:44:40,869 you to study computer vision. 762 00:44:40,869 --> 00:44:45,690 At this point, I'm going to give the podium to Professor Adeli 763 00:44:45,690 --> 00:44:48,369 to go over the rest of the class. 764 00:44:48,369 --> 00:44:49,039 Thank you. 765 00:44:49,039 --> 00:44:50,759 [APPLAUSE] 766 00:44:50,760 --> 00:44:51,990 Awesome. 767 00:44:51,989 --> 00:44:55,139 Thank you, Fei-Fei. 768 00:44:55,139 --> 00:44:57,089 A great start to the quarter. 769 00:44:57,090 --> 00:45:00,640 And I hope my microphone is working right now. 770 00:45:00,639 --> 00:45:01,389 OK, good. 771 00:45:01,389 --> 00:45:05,730 I'm seeing some nodding of heads. 
772 00:45:05,730 --> 00:45:13,079 So very excited to be here with you all. 773 00:45:13,079 --> 00:45:18,630 And I'm hoping that you will have a fun 774 00:45:18,630 --> 00:45:23,160 and challenging course with the amazing list of co-instructors 775 00:45:23,159 --> 00:45:26,379 that we have and great TAs. 776 00:45:26,380 --> 00:45:31,000 So in this class, we are going to cover 777 00:45:31,000 --> 00:45:34,690 a wide variety of topics around computer vision and the use 778 00:45:34,690 --> 00:45:37,659 of deep learning in this space, categorized 779 00:45:37,659 --> 00:45:41,569 into four different topics. 780 00:45:41,570 --> 00:45:45,230 We will start with deep learning basics. 781 00:45:45,230 --> 00:45:48,429 And let's start actually with a simple question: 782 00:45:48,429 --> 00:45:52,009 what is computer vision, really? 783 00:45:52,010 --> 00:45:57,610 So at its core, it's about enabling machines 784 00:45:57,610 --> 00:46:00,620 to see and understand images. 785 00:46:00,619 --> 00:46:09,339 And basically, the most fundamental task 786 00:46:09,340 --> 00:46:13,390 in this space is image classification. 787 00:46:13,389 --> 00:46:17,059 You give the model an image, say, of a cat. 788 00:46:17,059 --> 00:46:21,549 And the model should output the label cat. 789 00:46:21,550 --> 00:46:23,740 And that's it. 790 00:46:23,739 --> 00:46:29,479 But this deceptively simple task is the foundation 791 00:46:29,480 --> 00:46:32,039 for many more complex applications, 792 00:46:32,039 --> 00:46:36,409 from self-driving to medical diagnosis and so on. 793 00:46:36,409 --> 00:46:40,429 So how do we teach a machine to do this? 794 00:46:40,429 --> 00:46:44,639 One of the simplest approaches is to use linear classification, 795 00:46:44,639 --> 00:46:48,089 as you can see in this slide. 796 00:46:48,090 --> 00:46:53,809 So imagine each of the images in our data set 797 00:46:53,809 --> 00:46:57,119 is shown as a dot in that space. 
798 00:46:57,119 --> 00:47:02,779 And each axis shows some sort of feature 799 00:47:02,780 --> 00:47:05,280 which was derived from the image itself. 800 00:47:05,280 --> 00:47:09,420 Here, we are showing a 2D space for simplicity. 801 00:47:09,420 --> 00:47:12,470 But the task of a linear classifier 802 00:47:12,469 --> 00:47:17,149 is to find the hyperplane or the linear function 803 00:47:17,150 --> 00:47:23,470 that separates these two classes, say, cats from dogs. 804 00:47:23,469 --> 00:47:26,259 But we all know that these linear models often 805 00:47:26,260 --> 00:47:29,110 only go so far. 806 00:47:29,110 --> 00:47:32,349 They struggle when the data isn't cleanly separable 807 00:47:32,349 --> 00:47:33,799 with a straight line. 808 00:47:33,800 --> 00:47:36,320 So the question is, what's next? 809 00:47:36,320 --> 00:47:44,090 We'll get into the topics of how to model more complex patterns. 810 00:47:44,090 --> 00:47:49,900 And if we do so, we often face the challenges 811 00:47:49,900 --> 00:47:54,220 of overfitting and underfitting, which 812 00:47:54,219 --> 00:47:59,439 are topics we will cover in the early lectures of the class. 813 00:47:59,440 --> 00:48:05,110 And to strike the right balance, we 814 00:48:05,110 --> 00:48:08,320 use techniques like regularization 815 00:48:08,320 --> 00:48:14,110 to control model complexity and optimization to find the 816 00:48:14,110 --> 00:48:16,059 best-fit parameters. 817 00:48:16,059 --> 00:48:21,079 So these are the nuts and bolts of deep learning: creating 818 00:48:21,079 --> 00:48:26,659 and training models that not only fit the data 819 00:48:26,659 --> 00:48:31,319 but also generalize to unseen and new data as well. 820 00:48:31,320 --> 00:48:33,539 And now comes the fun part-- 821 00:48:33,539 --> 00:48:34,380 neural networks. 822 00:48:34,380 --> 00:48:38,059 We've been talking about them quite a lot. 
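The linear classifier picture above — dots in a feature space, split by a learned line — can be sketched concretely. This is a minimal illustration using synthetic 2D features and logistic regression (the cluster positions and hyperparameters are made up for the demo):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two synthetic 2D feature clusters standing in for "cat" and "dog".
cats = rng.normal([1.0, 1.0], 0.3, (50, 2))
dogs = rng.normal([3.0, 3.0], 0.3, (50, 2))
X = np.vstack([cats, dogs])
y = np.array([0] * 50 + [1] * 50)

# Logistic regression: learn a line w.x + b = 0 separating the classes.
w, b = np.zeros(2), 0.0
lr = 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probability of "dog"
    w -= lr * (X.T @ (p - y)) / len(y)       # gradient step on the weights
    b -= lr * np.mean(p - y)                 # gradient step on the bias

pred = (X @ w + b > 0).astype(int)
print("training accuracy:", (pred == y).mean())
```

Because these clusters are far apart, the learned line separates them cleanly; overlapping or ring-shaped classes are exactly the case where a single line fails and the non-linear models discussed next are needed.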
823 00:48:38,059 --> 00:48:43,549 And what neural networks do, unlike linear classifiers, 824 00:48:43,550 --> 00:48:47,780 is stack multiple layers of operations 825 00:48:47,780 --> 00:48:54,769 to model non-linear functions-- to be 826 00:48:54,769 --> 00:48:59,389 able to solve the same problem of image 827 00:48:59,389 --> 00:49:04,489 classification, and so on. 828 00:49:04,489 --> 00:49:09,869 These are the models powering everything from Google Photos 829 00:49:09,869 --> 00:49:13,429 to-- now everybody's familiar with ChatGPT-- ChatGPT's vision 830 00:49:13,429 --> 00:49:15,440 models, and so on. 831 00:49:15,440 --> 00:49:24,099 In this course, we will go deep into the details of how they 832 00:49:24,099 --> 00:49:26,299 work and how they are trained. 833 00:49:26,300 --> 00:49:31,090 And we will be looking into debugging and improving them. 834 00:49:31,090 --> 00:49:35,030 After looking at the deep learning basics, 835 00:49:35,030 --> 00:49:39,280 we will cover the topics of perceiving and understanding 836 00:49:39,280 --> 00:49:44,620 the visual world, which is a complex process that 837 00:49:44,619 --> 00:49:49,880 involves interpreting a vast array of visual information. 838 00:49:49,880 --> 00:49:52,329 And to do so, we often first define 839 00:49:52,329 --> 00:49:56,739 tasks that refer to specific challenges or problems 840 00:49:56,739 --> 00:49:59,149 we aim to solve-- 841 00:49:59,150 --> 00:50:02,180 some of the examples are object detection, scene understanding, 842 00:50:02,179 --> 00:50:03,619 motion detection, and so on. 843 00:50:03,619 --> 00:50:10,539 And to solve these tasks, we use different models, which 844 00:50:10,539 --> 00:50:13,929 are computational and theoretical 845 00:50:13,929 --> 00:50:17,779 frameworks we develop to mimic or explain 846 00:50:17,780 --> 00:50:22,350 how our visual system accomplishes these tasks. 
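Why stacking layers only helps when a nonlinearity sits between them can be shown in a few lines (a sketch with arbitrary random weights, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
W1, W2 = rng.normal(size=(3, 5)), rng.normal(size=(5, 2))
x = rng.normal(size=3)

# Two stacked linear layers with nothing in between collapse into
# a single linear layer, by associativity: (x @ W1) @ W2 == x @ (W1 @ W2).
stacked = (x @ W1) @ W2
collapsed = x @ (W1 @ W2)
print(np.allclose(stacked, collapsed))  # True

# A nonlinearity between the layers (here ReLU) breaks that collapse,
# which is what lets depth model non-linear functions.
nonlinear = np.maximum(x @ W1, 0) @ W2
print(nonlinear.shape)
```

This is the reason every hidden layer in the networks covered in this course is followed by an activation function such as ReLU or sigmoid.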
847 00:50:22,349 --> 00:50:25,610 One example of these types of models 848 00:50:25,610 --> 00:50:27,730 is neural networks. 849 00:50:30,260 --> 00:50:36,150 So by aligning models with tasks, 850 00:50:36,150 --> 00:50:41,030 we can create systems that can see and interpret 851 00:50:41,030 --> 00:50:43,730 the world around us. 852 00:50:43,730 --> 00:50:48,740 Speaking of tasks, let's go back to the topic 853 00:50:48,739 --> 00:50:53,239 of image classification: predicting a single label 854 00:50:53,239 --> 00:50:56,989 for an entire image. 855 00:50:56,989 --> 00:50:59,359 But we know that real-world computer vision 856 00:50:59,360 --> 00:51:02,340 is much richer than this. 857 00:51:02,340 --> 00:51:05,240 And let's walk through some of the tasks that 858 00:51:05,239 --> 00:51:06,869 go beyond classification. 859 00:51:06,869 --> 00:51:13,339 First, semantic segmentation, where we are not just 860 00:51:13,340 --> 00:51:17,519 labeling the object or the entire image 861 00:51:17,519 --> 00:51:19,739 as cat or tree or whatever. 862 00:51:19,739 --> 00:51:25,019 Here, we are looking for labels for every single pixel 863 00:51:25,019 --> 00:51:25,809 in the image. 864 00:51:25,809 --> 00:51:30,670 So every pixel is grass, cat, tree, or sky. 865 00:51:30,670 --> 00:51:34,960 But we don't distinguish between individual objects. 866 00:51:34,960 --> 00:51:38,280 And next, we have object detection, 867 00:51:38,280 --> 00:51:45,580 where we now want to not only say what is in the image 868 00:51:45,579 --> 00:51:47,440 but also pinpoint the location. 869 00:51:47,440 --> 00:51:49,860 And that's why we create bounding boxes 870 00:51:49,860 --> 00:51:54,670 around the objects and associate them with specific labels. 871 00:51:54,670 --> 00:51:58,269 And finally, we have instance segmentation. 872 00:51:58,269 --> 00:52:01,139 We'll go into instance segmentation, which is 873 00:52:01,139 --> 00:52:04,409 the most granular of them all. 
874 00:52:04,409 --> 00:52:08,279 It combines the ideas of detection and segmentation 875 00:52:08,280 --> 00:52:09,130 together. 876 00:52:09,130 --> 00:52:13,039 And every object instance gets its own mask. 877 00:52:13,039 --> 00:52:20,090 So these tasks require a much deeper spatial understanding 878 00:52:20,090 --> 00:52:21,059 of images. 879 00:52:21,059 --> 00:52:23,809 And they push the models to do more than just 880 00:52:23,809 --> 00:52:27,860 recognize categories. 881 00:52:27,860 --> 00:52:30,660 The complexity doesn't stop with static images. 882 00:52:30,659 --> 00:52:33,269 Let's look at some temporal dimensions. 883 00:52:33,269 --> 00:52:36,269 So there's the task of video classification, 884 00:52:36,269 --> 00:52:40,429 as Fei-Fei talked about, where we want to understand 885 00:52:40,429 --> 00:52:42,349 what's happening in a video. 886 00:52:42,349 --> 00:52:47,210 Is there someone running, jumping, or dancing? 887 00:52:47,210 --> 00:52:51,630 There is also the topic of multimodal video understanding, 888 00:52:51,630 --> 00:52:56,630 which is combining vision and sound and other modalities. 889 00:52:56,630 --> 00:53:00,559 For example, here, the person 890 00:53:00,559 --> 00:53:04,070 is playing a vibraphone. To really understand 891 00:53:04,070 --> 00:53:05,039 what's happening, 892 00:53:05,039 --> 00:53:08,210 we have to create a blend of visual features 893 00:53:08,210 --> 00:53:11,280 and audio features. 
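One way to keep the static-image tasks above straight is by the shape of their outputs. Here is a toy sketch, with tiny hand-made arrays standing in for model predictions, of what each task produces for a 4x4 image:

```python
import numpy as np

H, W = 4, 4  # a tiny 4x4 "image"; class 2 will mean "cat"

# Image classification: one label for the whole image.
cls_label = 2

# Semantic segmentation: one class label per pixel,
# with no notion of separate object instances.
seg_map = np.zeros((H, W), dtype=int)
seg_map[1:3, 1:3] = 2                     # a 2x2 patch of "cat" pixels

# Object detection: a bounding box (x, y, w, h) plus a label
# for each object in the image.
detections = [((1, 1, 2, 2), 2)]

# Instance segmentation: one binary mask per object instance,
# combining detection (which object) with segmentation (which pixels).
instance_masks = [np.zeros((H, W), dtype=bool)]
instance_masks[0][1:3, 1:3] = True

print(seg_map.shape, len(detections), int(instance_masks[0].sum()))
# (4, 4) 1 4
```

The progression from one label, to one label per pixel, to boxes, to per-instance masks is exactly the increase in spatial granularity the lecture describes.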
894 00:53:11,280 --> 00:53:14,680 And finally, there is the topic of visualization 895 00:53:14,679 --> 00:53:19,329 and understanding that we will be covering in this class, where 896 00:53:19,329 --> 00:53:24,340 we want to interpret what's being learned by the models 897 00:53:24,340 --> 00:53:31,269 and see an attention frame or attention map of what 898 00:53:31,269 --> 00:53:35,079 the model is attending to in order to make a correct classification, 899 00:53:35,079 --> 00:53:36,819 and so on. 900 00:53:36,820 --> 00:53:39,650 And then, beyond tasks, 901 00:53:39,650 --> 00:53:41,740 we look into models. 902 00:53:41,739 --> 00:53:46,509 And the very first topic-- let me introduce it to you-- 903 00:53:46,510 --> 00:53:50,170 that we'll be covering is Convolutional Neural Networks, 904 00:53:50,170 --> 00:53:51,230 or CNNs. 905 00:53:51,230 --> 00:53:52,760 These involve a number of operations. 906 00:53:52,760 --> 00:53:55,930 We will be going over the details 907 00:53:55,929 --> 00:53:59,839 in the class: starting from an image, applying a number of convolution, 908 00:53:59,840 --> 00:54:01,970 subsampling, and fully connected operations, 909 00:54:01,969 --> 00:54:05,980 and, finally, producing the output. 910 00:54:05,980 --> 00:54:08,769 And beyond convolutional neural networks, 911 00:54:08,769 --> 00:54:14,719 we will study recurrent neural networks for sequential data 912 00:54:14,719 --> 00:54:19,669 and even newer neural architectures, such as transformers 913 00:54:19,670 --> 00:54:24,139 and attention-based frameworks. 914 00:54:24,139 --> 00:54:29,179 Next, we will be covering some large-scale distributed 915 00:54:29,179 --> 00:54:34,609 training topics, which are new this quarter. 916 00:54:34,610 --> 00:54:38,460 I'm sure you've all heard about large language models, 917 00:54:38,460 --> 00:54:40,320 large vision models, and so on. 
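Going back to the CNN pipeline just described (image, then convolution, subsampling, and fully connected operations), here is a minimal NumPy sketch. The sizes and random weights are made up for illustration; real CNNs use many channels, learned filters, and optimized library implementations.

```python
import numpy as np

def conv2d(img, kernel):
    """Naive valid 2-D convolution (technically cross-correlation,
    as in most deep learning libraries)."""
    H, W = img.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + k, j:j + k] * kernel)
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling: the 'subsampling' step."""
    H, W = x.shape
    return x[:H - H % size, :W - W % size] \
        .reshape(H // size, size, W // size, size).max(axis=(1, 3))

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))                    # toy grayscale image
feat = np.maximum(0, conv2d(img, rng.standard_normal((3, 3))))  # conv + ReLU
pooled = max_pool(feat)                              # 6x6 -> 3x3
W_fc = rng.standard_normal((pooled.size, 10))
scores = pooled.reshape(1, -1) @ W_fc                # fully connected -> scores
print(scores.shape)  # (1, 10)
```

Each stage mirrors one box in the pipeline diagram: convolution extracts local features, pooling shrinks the spatial resolution, and the fully connected layer maps the remaining features to class scores.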
918 00:54:40,320 --> 00:54:44,480 And we will be briefly discussing 919 00:54:44,480 --> 00:54:47,309 how these models are actually trained. 920 00:54:47,309 --> 00:54:51,619 We know that data and datasets are expanding, 921 00:54:51,619 --> 00:54:56,429 and models are becoming larger and larger. 922 00:54:56,429 --> 00:54:59,819 And in order to train such models, 923 00:54:59,820 --> 00:55:02,360 there are some strategies-- 924 00:55:02,360 --> 00:55:04,470 for example, data parallelism, 925 00:55:04,469 --> 00:55:07,569 model parallelism-- that we will cover in this class. 926 00:55:07,570 --> 00:55:11,170 But beyond that, there are many challenges, 927 00:55:11,170 --> 00:55:15,940 such as synchronization between these models and workers 928 00:55:15,940 --> 00:55:20,730 and so on, as well as several other aspects 929 00:55:20,730 --> 00:55:25,059 that we'll be covering in one of the lectures this quarter. 930 00:55:25,059 --> 00:55:31,289 And we will also go over some of the trends for training 931 00:55:31,289 --> 00:55:33,070 these large models. 932 00:55:33,070 --> 00:55:36,210 After completing this topic, what we will do 933 00:55:36,210 --> 00:55:44,010 next is look into generative and interactive visual 934 00:55:44,010 --> 00:55:48,690 intelligence, where we will first start 935 00:55:48,690 --> 00:55:52,030 with self-supervised learning. 936 00:55:52,030 --> 00:55:55,960 Self-supervised learning is a branch of machine learning 937 00:55:55,960 --> 00:56:00,579 in which models learn to understand and represent data 938 00:56:00,579 --> 00:56:04,179 by getting training signals from the data itself. 939 00:56:04,179 --> 00:56:06,384 We will cover this topic. 940 00:56:06,385 --> 00:56:10,180 It's one of the approaches that has enabled training 941 00:56:10,179 --> 00:56:15,339 of large-scale models using vast amounts of data that do not 942 00:56:15,340 --> 00:56:18,880 require labels: unlabeled data. 
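The data-parallelism strategy mentioned above can be illustrated with a toy NumPy sketch: each simulated worker computes the gradient on its own shard of the batch, and averaging the shard gradients (the role an all-reduce synchronization step plays in real systems) recovers the full-batch gradient. The linear model and sizes here are made up for illustration.

```python
import numpy as np

def grad_linear(W, X, y):
    """Gradient of mean squared error for a linear model y ~ X @ W."""
    return 2 * X.T @ (X @ W - y) / len(X)

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))
y = rng.standard_normal((8, 1))
W = np.zeros((3, 1))

# Data parallelism: split the batch across two "workers",
# each computing gradients only on its shard.
shards = np.split(np.arange(8), 2)             # two workers, 4 examples each
worker_grads = [grad_linear(W, X[s], y[s]) for s in shards]
avg_grad = np.mean(worker_grads, axis=0)       # the all-reduce step, in spirit

# With equal-sized shards, the average matches the full-batch gradient.
print(np.allclose(avg_grad, grad_linear(W, X, y)))  # True
```

Model parallelism, by contrast, would split the parameters themselves across workers; both strategies, and the synchronization challenges they raise, come up later in the quarter.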
943 00:56:18,880 --> 00:56:23,200 And it has played a key role in recent breakthroughs 944 00:56:23,199 --> 00:56:26,199 in computer vision in general. 945 00:56:26,199 --> 00:56:30,799 We will also talk a little bit about generative models. 946 00:56:30,800 --> 00:56:33,710 They go beyond recognition. 947 00:56:33,710 --> 00:56:35,860 They actually generate. 948 00:56:35,860 --> 00:56:39,340 This is an example: the content of a Stanford campus 949 00:56:39,340 --> 00:56:44,380 photo reimagined in the style of Van Gogh's Starry 950 00:56:44,380 --> 00:56:45,490 Night. 951 00:56:45,489 --> 00:56:49,989 This is known as style transfer, a classic application 952 00:56:49,989 --> 00:56:54,369 of neural generative techniques. 953 00:56:54,369 --> 00:56:58,269 Generative models can now also translate language 954 00:56:58,269 --> 00:57:03,219 into images. Given a prompt, 955 00:57:03,219 --> 00:57:07,289 a model like DALL-E or DALL-E 2 generates an entirely novel 956 00:57:07,289 --> 00:57:09,059 image. 957 00:57:09,059 --> 00:57:12,570 This showcases how generative vision models 958 00:57:12,570 --> 00:57:16,830 blend understanding, creativity, and control 959 00:57:16,829 --> 00:57:19,349 in their generations. 960 00:57:19,349 --> 00:57:22,589 And you've probably heard recently 961 00:57:22,590 --> 00:57:26,620 about the topic of diffusion models in general. 962 00:57:26,619 --> 00:57:33,179 That's another thing that we'll be covering this quarter. 963 00:57:33,179 --> 00:57:37,649 They basically learn to reverse a gradual noising 964 00:57:37,650 --> 00:57:40,510 process to generate images. 
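The gradual noising process that diffusion models learn to reverse can be sketched as follows. This is a DDPM-style closed-form sample of a noised image at step t; the noise schedule here is an illustrative assumption, not a tuned one.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))        # toy "image"

# Forward (noising) process: at each step, blend the image with a
# bit of Gaussian noise. alpha_bar[t] tracks how much of the
# original signal survives after t steps.
betas = np.linspace(1e-4, 0.2, 50)      # illustrative noise schedule
alpha_bar = np.cumprod(1.0 - betas)

def noisy_sample(x0, t):
    """Closed-form sample of x_t given x_0 (DDPM-style)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

# Early steps stay close to the image; by the last step, alpha_bar
# is tiny and the sample is dominated by noise.
xt = noisy_sample(x0, 49)
```

Generation then runs this process in reverse: a network trained to predict the added noise is applied step by step, turning pure noise back into an image.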
965 00:57:40,510 --> 00:57:43,630 And interestingly, in assignment 3, 966 00:57:43,630 --> 00:57:46,860 you will actually be implementing a generative model 967 00:57:46,860 --> 00:57:53,400 that generates emojis from text inputs, 968 00:57:53,400 --> 00:57:57,360 from prompts-- for example, a face with a cowboy hat, which 969 00:57:57,360 --> 00:58:01,240 is denoised from pure noise. 970 00:58:01,239 --> 00:58:06,529 Vision-language models are the next topic of interest 971 00:58:06,530 --> 00:58:08,890 we will be covering. 972 00:58:08,889 --> 00:58:16,039 They connect text and images in a shared representation space. 973 00:58:16,039 --> 00:58:19,900 And given a caption or an image, the model 974 00:58:19,900 --> 00:58:24,289 retrieves or generates its corresponding pair, 975 00:58:24,289 --> 00:58:25,309 as you can see. 976 00:58:25,309 --> 00:58:29,049 So there are a lot of advances in this area. 977 00:58:29,050 --> 00:58:32,170 We'll be covering some of the key examples. 978 00:58:32,170 --> 00:58:37,750 Again, this is a key task for cross-modal retrieval 979 00:58:37,750 --> 00:58:41,119 and understanding, visual question answering, and so on. 980 00:58:41,119 --> 00:58:44,269 So we'll get to that in the class, too. 981 00:58:44,269 --> 00:58:52,809 Moving beyond 2D, models can now reconstruct and generate 3D 982 00:58:52,809 --> 00:58:55,549 representations from images. 983 00:58:55,550 --> 00:59:00,980 And here, you can see some voxel-based reconstructions, 984 00:59:00,980 --> 00:59:06,769 shape completion, and even 3D object detection from single- 985 00:59:06,769 --> 00:59:09,599 view images. 986 00:59:09,599 --> 00:59:14,809 So 3D vision enables a more spatially grounded 987 00:59:14,809 --> 00:59:19,699 understanding, which is crucial for robotics and AR/VR 988 00:59:19,699 --> 00:59:20,399 applications. 989 00:59:20,400 --> 00:59:26,900 And finally, vision empowers embodied agents 990 00:59:26,900 --> 00:59:30,680 that act in the physical world. 
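Going back to vision-language models for a moment, the shared representation space can be sketched in a CLIP-style toy example: both modalities are embedded as vectors in one space, and retrieval is nearest-neighbor by cosine similarity. The embedding vectors below are made-up stand-ins for real encoder outputs.

```python
import numpy as np

def normalize(v):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Hypothetical embeddings of two images and one caption,
# standing in for the outputs of trained image/text encoders.
image_embs = normalize(np.array([[1.0, 0.1, 0.0],    # photo of a dog
                                 [0.0, 1.0, 0.2]]))  # photo of a car
caption_emb = normalize(np.array([0.9, 0.2, 0.0]))   # "a dog in the park"

sims = image_embs @ caption_emb          # cosine similarity to each image
best = int(np.argmax(sims))
print(best)  # 0: the caption retrieves the dog photo
```

The same nearest-neighbor machinery supports both directions (caption-to-image and image-to-caption), which is what makes cross-modal retrieval and visual question answering possible on top of one shared space.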
991 00:59:30,679 --> 00:59:35,279 So these models often must perceive, plan, 992 00:59:35,280 --> 00:59:41,390 and execute, whether it's cleaning up a messy room 993 00:59:41,389 --> 00:59:44,879 or generalizing from human demonstrations. 994 00:59:44,880 --> 00:59:50,210 So with all of these, we will be covering different topics 995 00:59:50,210 --> 00:59:53,970 around generative and interactive visual intelligence. 996 00:59:53,969 --> 01:00:00,759 And finally, we will cover some human-centered applications 997 01:00:00,760 --> 01:00:05,990 and implications, as Fei-Fei very nicely explained. 998 01:00:05,989 --> 01:00:08,719 Computer vision, 999 01:00:08,719 --> 01:00:12,069 and AI more generally, has been having a lot 1000 01:00:12,070 --> 01:00:16,070 of impact in the past years. 1001 01:00:16,070 --> 01:00:18,280 And it's very important to understand 1002 01:00:18,280 --> 01:00:21,230 the human-centered aspects and applications. 1003 01:00:21,230 --> 01:00:24,159 Some of these impacts are reflected 1004 01:00:24,159 --> 01:00:32,469 by the awards that are going to researchers in this space. 1005 01:00:32,469 --> 01:00:38,769 This was first recognized by the 2018 Turing Award, which 1006 01:00:38,769 --> 01:00:41,440 is the most prestigious technical award, given 1007 01:00:41,440 --> 01:00:45,400 for major contributions of lasting importance 1008 01:00:45,400 --> 01:00:47,090 to computing. 1009 01:00:47,090 --> 01:00:50,890 Geoffrey Hinton, Yoshua Bengio, and Yann LeCun 1010 01:00:50,889 --> 01:00:54,849 received the award for conceptual and engineering 1011 01:00:54,849 --> 01:00:57,049 breakthroughs that have made 1012 01:00:57,050 --> 01:01:01,440 deep neural networks a critical component of computing. 
1013 01:01:01,440 --> 01:01:06,200 Beyond that, last year, in 2024, Geoffrey Hinton 1014 01:01:06,199 --> 01:01:11,089 was jointly awarded the Nobel Prize in Physics 1015 01:01:11,090 --> 01:01:14,990 alongside John Hopfield for their foundational contributions 1016 01:01:14,989 --> 01:01:17,459 to neural networks. 1017 01:01:17,460 --> 01:01:21,260 And finally, I want to very briefly mention the learning 1018 01:01:21,260 --> 01:01:27,770 objectives for this class: formalizing computer vision 1019 01:01:27,769 --> 01:01:30,239 applications into tasks-- 1020 01:01:30,239 --> 01:01:33,619 you can see some of the details here-- 1021 01:01:33,619 --> 01:01:38,599 developing and training vision models, models 1022 01:01:38,599 --> 01:01:41,400 that operate on visual data-- 1023 01:01:41,400 --> 01:01:43,220 images, videos, and so on-- 1024 01:01:43,219 --> 01:01:46,549 and gaining an understanding of where the field is 1025 01:01:46,550 --> 01:01:48,990 and where it is headed. 1026 01:01:48,989 --> 01:01:53,619 That's why we also have some new topics covered specifically 1027 01:01:53,619 --> 01:01:56,920 this year. 1028 01:01:56,920 --> 01:02:01,539 So for the four topics that I mentioned earlier, 1029 01:02:01,539 --> 01:02:06,529 we will be going over the basics in the very first few weeks. 1030 01:02:06,530 --> 01:02:09,220 Bear with us, because these are important topics. 1031 01:02:09,219 --> 01:02:12,859 And you need to understand the details first, 1032 01:02:12,860 --> 01:02:15,110 how to build the models from scratch. 1033 01:02:15,110 --> 01:02:19,180 And then we'll get to the more interesting, exciting topics 1034 01:02:19,179 --> 01:02:20,440 of the day 1035 01:02:20,440 --> 01:02:21,769 in computer vision. 1036 01:02:21,769 --> 01:02:27,969 And finally, we'll have one big lecture on human-centered AI 1037 01:02:27,969 --> 01:02:30,549 and computer vision. 
1038 01:02:30,550 --> 01:02:33,039 I want to just leave you with what we 1039 01:02:33,039 --> 01:02:34,789 will be covering next session. 1040 01:02:34,789 --> 01:02:38,380 That's going to be image classification 1041 01:02:38,380 --> 01:02:43,720 and linear classifiers, which will get us started 1042 01:02:43,719 --> 01:02:45,909 with the world of CS231n. 1043 01:02:45,909 --> 01:02:47,969 Thank you.