So, all right: today we actually come to the last lecture of the class, because Wednesday is going to be project presentations. I want to talk to you about diffusion models today, which is an incredibly exciting area that I don't think gets the same amount of attention, in some ways, compared to large language models, but it has enormous potential. So I'm very excited to talk to you about it.

Just for kicks, last night I asked ChatGPT to create a photorealistic image of graduate students in a class on deep learning, and this is what it came back with. There is a noticeable absence of an instructor, plus various students are facing in various directions, but apart from that, it's not bad. And here is an example of a Midjourney text-to-image diffusion model, which produces this amazing picture from the prompt: "a quaint Italian seaside village with colorful buildings," blah blah blah, "rendered in the style of Claude Monet," and so on and so forth, and that's what you get. It's pretty unbelievable. I'm sure you folks have played around with these things, and you have your favorite pictures and prompts and whatnot.

Now, on February 15th, OpenAI released a text-to-video model called Sora, which you folks may have seen, and which I find frankly just stunning in what it can do. It can produce a one-minute video from a text prompt. So if you actually give it this prompt, "In an ornate historical hall, a massive tidal wave peaks and begins to crash, and two surfers, seizing the moment, skillfully navigate the wave," I think we can all agree that such a thing has never happened in history, and therefore it was not in the training data, right?
So then you get this picture, this video. [video plays] And then some random person comes walking back into a completely dry [laughter] hall. So anyway, it's pretty amazing, I think you would agree.

Now, if you actually look at the Sora technical report, you find this opening paragraph where they say that they train text-conditional diffusion models, blah blah blah, using a transformer architecture. Okay, so we know what a transformer architecture is; you've been working with it, and you're quite familiar with it at this point. So today's class is really about text-conditional diffusion models, the other building block. Let's get to it.

What I'm going to do is divide this into two parts. First, I'm going to talk about how you get a model to simply generate an image for you: if you want to generate an image from a class of potential images, how can it just generate one? Then we'll talk about: okay, great, now that you can do that, how do you actually control, or steer, the model to produce an image based on whatever prompt you give it? How do you condition it? How do you control it? How do you steer it? You'll find all these synonyms used heavily in the literature, and they basically mean the same thing: how do you give it a prompt and then steer what gets produced?

All right, so let's say we want to build a model that can be used to generate images of stately college buildings. Obviously, our very own Killian Court is the finest example of such a thing. So what you do, as we always do with machine learning, is collect a bunch of data. In this particular case, we collect a whole bunch of images of stately college buildings.
What you see here is literally me doing a Google image search with the query "stately college buildings." This is the kind of stuff you get. So you have your training data at your disposal; it's ready to go.

Now, the question is this. Say you have such a model (and obviously we'll talk about how to build one very soon). Every time you sample this model, every time you ask it, "Hey, give me an image," you obviously want it to give a different image, right? Otherwise it's kind of boring. Maybe you want Killian Court, maybe you want the Rotunda from the University of Virginia. Any UVA alums here? Nobody? Okay. So the question is: how can we get it to randomly give us different images, where they all have to be stately college buildings? It can't be just some random stuff, right? So how do you do that?

The way we do that, and I still find it really astonishing that this approach actually works, is that we actually give it noise. I will define very precisely what I mean by noise in just a bit; basically, assume an image in which all the pixel values are randomly picked. So every time, you generate a random image and give it to the model, and it uses that random starting point to create an image for you. Because, by definition, if you choose the noise randomly, the starting points are obviously going to be different each time, it's hopefully going to generate a different image. But if the model is trained on stately college buildings, it will produce images of stately college buildings. It's not going to produce a picture of a Labrador retriever. Okay, so that's basically what we're going to do.
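To make "an image in which all the pixel values are randomly picked" concrete, here is a minimal NumPy sketch. The shape matches the Killian Court image used later in the demo; both the shape and the choice of a standard normal are illustrative assumptions, not anything prescribed in the lecture:

```python
import numpy as np

# A "pure noise" image: every pixel value in every channel is drawn
# independently at random. Shape is (height, width, channels).
noise_image = np.random.normal(loc=0.0, scale=1.0, size=(411, 583, 3))
```

Every call produces a different array, which is exactly why each generation can come out different.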
Now, if you look at something like this, the first question, of course, is: how can we train a model to generate an image from pure noise? This just sounds ridiculous, right? You basically give it a bunch of random numbers and say, "Give me Killian Court." It feels really ridiculous, and at that point folks can come to a stop and say, "All right, this approach is probably not going to take me anywhere; it's a bit of a dead end." But then some clever people had a very interesting idea. They said: it's not clear how to do this, so...

Just a quick aside: there's this really amazing book, published maybe 50 years ago, maybe earlier than that, called How to Solve It, by George Pólya. Pólya was an eminent mathematician, and he wrote this small book listing a whole bunch of heuristics that mathematicians use when they solve problems. Perhaps the most commonly used heuristic is: just reverse the question. Just reverse the question and see if anything comes out of it. Most of the time nothing will come out of it, but occasionally something amazing does. This is a great example of that heuristic at work. We don't know how to do this, so the question is: can we do the reverse? If I give you Killian Court, can you produce noise out of it for me?

And the answer is: yeah, of course we can do that. Given an image, we can easily create a noisy version of it. You can take the original image, add some noise to it to get this, and keep adding more and more noise, and finally you get something where you basically can't tell there's a clean Killian Court in it anymore. This reverse direction is actually very easy to do.
By the way, for folks who may not be very familiar with this notion of adding noise to an image, or making an image noisy, let me just show you in a Colab how easy it is.

All right. So we import a bunch of things. As usual we have NumPy, and there is this thing called the Python Imaging Library, PIL, which is very handy for image manipulation, so we import that too. Then I literally just read this image in. I uploaded it before class; let's make sure it's here. Okay, good: killian.png. So I read this image, and once I've read it, I convert it into a NumPy array. Remember, in any color image you have three tables of numbers: a number for each pixel for red, green, and blue, and each number is between 0 and 255. So here we divide everything by 255 just to normalize it, so it's all between zero and one; we have done this in the past. So let me read this in and convert it, and if you look at the shape, it's basically 411 x 583 x 3, three channels, as we have seen before. And then I'll just show it. All right, that's the picture.

Now what we want to do is add noise to this picture. All we have to do, for each pixel, is randomly pick a normally distributed random variable with a mean of zero and a small standard deviation, so it's a small number, and then literally add that number to the pixel. But we sample for every pixel; it's not like we sample once and add the same number to all the pixels. And the way you do that is basically, literally, np.random.normal,
where this 0.3 here is the standard deviation, and we tell it to generate as many of these numbers as the shape of the image I gave it. Then you add each of these numbers to the original image and you get this noisy image. So if this is the original image, with all the values between 0 and 1, and you make the noisy image, you can see the numbers have become different: the 0.23 has become 0.18, the 0.15 has become -0.17, and so on. You just added a small random number to everything. But as you can see, now you have some negative numbers, and you may have some numbers greater than one, and we do want everything to be between 0 and 1. So all we do is clip it: values smaller than zero are set to zero, and values greater than one are set to one. That's it. Everything over one is squashed to one, everything under zero is set to zero, and everything else is left unchanged. Now it's well behaved between 0 and 1 again, and we can just plot it, and you get this. That's it. That's all it takes to add noise to an image: one line of NumPy. Obviously, you can put this whole thing in a loop and keep increasing that standard deviation, 0.3, 0.4, 0.5, and so on, and when you do that, you get this nice sequence from clean Killian Court all the way to a very, very noisy version of Killian Court. That's the basic idea of adding noise.
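Putting the Colab steps together, a minimal end-to-end sketch looks something like this. The filename and the standard deviations are from the demo; treat the details as an illustrative reconstruction rather than the exact notebook:

```python
import numpy as np
from PIL import Image

# Load the image and normalize pixel values from [0, 255] down to [0, 1].
img = np.asarray(Image.open("killian.png").convert("RGB")) / 255.0
print(img.shape)  # (411, 583, 3): height x width x three color channels

# Add zero-mean Gaussian noise, sampled independently for every pixel and
# channel, then clip back into the valid [0, 1] range.
noisy = np.clip(img + np.random.normal(0, 0.3, size=img.shape), 0.0, 1.0)

# Increasing the standard deviation gives a sequence of noisier versions.
noisy_sequence = [
    np.clip(img + np.random.normal(0, s, size=img.shape), 0.0, 1.0)
    for s in (0.1, 0.2, 0.3, 0.4, 0.5)
]
```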
Any questions on the mechanics? Okay, good. So we can add random numbers, and by increasing the magnitude of the standard deviation of these normal random variables, we can make the image noisier. That suggests a really interesting idea. What idea would that be?

>> Doing the opposite.
>> Could you use the microphone, please?
>> Doing the opposite, like recreating the image from the noise.
>> So we are trying to create the image from the noise, but that feels a little hard. So what exactly can we do? Be a little more specific. Here we have the ability to take any image and add any amount of noise to it, right? That's the data we have. There is Killian Court, and there are various noisy versions of Killian Court, and likewise for the Rotunda at the University of Virginia, and so on.
>> I would assume you would do some kind of loss function for the final image that you get, compare it with the original image that you trained on, and then refine as you go.
>> Okay, you're on the right track. Any other proposals?
>> I think we could try to train a neural network to reconstruct the image, going from the noisy one back. We could have a whole dataset of images, find their noisy counterparts, and train a network to do the opposite task.
>> Yeah, that's definitely on the right track. Good ideas. So what we do, more concretely, is take each image in the training data and create noisy versions of it, as we have seen before. And then we say: we can create (x, y) training data pairs, input-output pairs, from all these images. Specifically, we take the slightly noisy version of Killian Court and call it the input, and we take the clean version and call it the output. That's the (x1, y1) pair; then we get (x2, y2), (x3, y3), and so on, all the way down. So at any point in this chain, what's the relationship between x and y?
If you set it up like this, as the input and the output?

>> It's the set of standard deviations, the values by which you change each pixel; those are like the weights by which you transform it.
>> Right, though maybe I was looking for something simpler, which is also correct: the relationship is that x is an image, any image, and y happens to be a slightly less noisy version of that image. The "slightly less noisy" part is really, really important. You're not going from Killian Court to full noise in one step, right? That's an impossible leap. You're going from the image to a slightly noisy version of the image. It is that "slightly" that allows all the magic to happen.

So that's what we have. And here's the thing; this is a larger comment about machine learning and deep learning. What machine learning and deep learning really are is this black box where, if you can find interesting input-output pairs, you can learn a function to go from the input to the output. That's it. This sounds kind of simple when I describe it like that, but there are some incredibly non-obvious ways of applying the idea. For example, a few years ago Google had this feature, which may actually be in production in Google Sheets now, where whenever you select a range of numbers in a spreadsheet and then go into another cell, it immediately suggests a formula for you. Where is that coming from? It's because Google Sheets users all over the world have been creating all these numbers with formulas. So someone said, "Look, wait a second.
We have all this data on people choosing a range of numbers and then entering a formula. Let's treat the range as the input and the formula as the output, give a model a million examples of this pair, and see if anything comes out of it." And boom, you get that feature.

So, similarly, here x is an image and y is a slightly less noisy version of the image. What that means is that we can build a denoising network: we can take an image and, using all these (x, y) pairs, build a network that slightly denoises it. And how do we do it? We just run stochastic gradient descent on the data. We have a network; it takes x and produces y, where y is a slightly less noisy version of x. The network has a bunch of weights, we have the right answers in terms of what the images need to be, and we can run stochastic gradient descent, or Adam, or something like that. Before you know it, if you have enough data, you have a network that can slightly denoise anything you give it.

>> Why slightly?
>> Why slightly? We'll come back to that question. The reason is that, in general, you have to do what you can to help the model. It's the proverbial old adage: you can't cross a ditch in two jumps. The ditch is too big, so you can't do it that way; instead, you build a bridge to go from here to there. And if you can slightly denoise something really well, then, as you will see in a second, I can actually denoise anything you want really well using that fundamental capability.
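As a concrete sketch of that training setup, here is what the pair construction and the training call might look like. The tiny convolutional model, the noise levels, and the mean-squared-error loss are illustrative assumptions (the lecture does not specify an architecture at this point), written in Keras since the lecture refers to Keras-style layer names later on:

```python
import numpy as np
import tensorflow as tf

def make_pairs(images, sigmas=(0.1, 0.2, 0.3, 0.4, 0.5)):
    """Build (x, y) pairs: x is a noisier image, y the slightly less noisy one."""
    xs, ys = [], []
    for img in images:
        prev = img  # start from the clean image
        for s in sigmas:
            noisy = np.clip(img + np.random.normal(0, s, img.shape), 0, 1)
            xs.append(noisy)  # input: the noisier version
            ys.append(prev)   # target: one step less noisy
            prev = noisy
    return np.array(xs, np.float32), np.array(ys, np.float32)

# A deliberately tiny stand-in for the real denoising network.
denoiser = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(3, 3, padding="same", activation="sigmoid"),
])
denoiser.compile(optimizer="adam", loss="mse")

# images: float array of shape (N, H, W, 3) with values in [0, 1]
# x, y = make_pairs(images)
# denoiser.fit(x, y, batch_size=16, epochs=10)
```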
>> Just to follow up: if you go back to the last slide, I could have created pairs where that noisy one is my x1 and the clean image is my y, and then the next noisier one is x2 and the clean image is still the y. Effectively, there would be learning there too: it could take those pairs, work out that noise matrix, and subtract it.
>> Yeah. The thing is, you want to make sure that each time, the amount of learning the model has to do is as bounded and as small as possible. If you fix the clean image as the ending point and keep moving the starting point further and further away, the gap is really large for the noisiest of those starting points. That's the problem.

Okay, so to come back to this: we can build a denoising model. We can do this. And once you have built such a thing, you give it something noisy and it gives you a slightly less noisy version of it; the quality goes up slightly each time you do that. This, of course, suggests the obvious way in which you would use it: once you train it, we can solve this problem. And how? You start with pure noise and then repeatedly denoise it. You get that, then you get that, and before you know it, Killian Court has emerged from the fog. It's pretty insane that this idea actually works.

So the model will generate a sequence of less and less noisy images, and the final one is the answer. Now, there's a whole bunch of detail here which I'm glossing over, such as: how many times must we run this loop to get a really good picture? The short answer is that initially you had to run it something like a thousand times; each denoising step was a baby step, and you had to take a thousand of them to get a really good answer. Research here has been, and continues to be, very active;
Now you can I think do it like 496 00:19:24,000 --> 00:19:29,038 50 steps or 100 steps. Right? But 497 00:19:26,400 --> 00:19:31,679 diffusion models like this uh they tend 498 00:19:29,038 --> 00:19:33,599 to take more time than a large language 499 00:19:31,679 --> 00:19:35,280 model which is why if you give a prompt 500 00:19:33,599 --> 00:19:36,639 to one of these models like midjourney 501 00:19:35,279 --> 00:19:38,960 it will take some time for it to come 502 00:19:36,640 --> 00:19:40,320 back with an image and and that the 503 00:19:38,960 --> 00:19:42,079 reason for the delay is because it's 504 00:19:40,319 --> 00:19:45,200 going through this you know incremental 505 00:19:42,079 --> 00:19:47,759 dnoising loop. Yeah. 506 00:19:45,200 --> 00:19:49,840 >> Uh from this we understand that each uh 507 00:19:47,759 --> 00:19:51,440 the final noise output sample would be 508 00:19:49,839 --> 00:19:55,199 very particular to each image in the 509 00:19:51,440 --> 00:19:57,279 matrix. So I mean like say two if you 510 00:19:55,200 --> 00:19:59,840 take two images the final we are getting 511 00:19:57,279 --> 00:20:02,319 is the image in the after when we start 512 00:19:59,839 --> 00:20:04,319 voicing it and the final output we get 513 00:20:02,319 --> 00:20:05,359 is the noise sample will be too distinct 514 00:20:04,319 --> 00:20:05,918 for each of them right 515 00:20:05,359 --> 00:20:08,558 >> correct 516 00:20:05,919 --> 00:20:10,720 >> so but when we are picking up image to 517 00:20:08,558 --> 00:20:12,879 generate a diffusion model and we work 518 00:20:10,720 --> 00:20:14,798 backwards we may not have the exact 519 00:20:12,880 --> 00:20:15,679 thing available to us what was there 520 00:20:14,798 --> 00:20:17,200 initially 521 00:20:15,679 --> 00:20:18,960 >> no no the thing is we don't want to 522 00:20:17,200 --> 00:20:21,120 necessarily regenerate images that were 523 00:20:18,960 --> 00:20:22,558 in the training data right that's kind 524 00:20:21,119 --> 00:20:24,159 of pointless we want to geneneral new 525 00:20:22,558 --> 00:20:26,720 images 526 00:20:24,160 --> 00:20:29,519 and for new images we just use start use 527 00:20:26,720 --> 00:20:31,200 noise as a starting point 528 00:20:29,519 --> 00:20:32,879 you know the fact that Killian code was 529 00:20:31,200 --> 00:20:35,279 here and then the fully noised version 530 00:20:32,880 --> 00:20:36,159 of Kian code is here that is used for 531 00:20:35,279 --> 00:20:37,918 training and once you use it for 532 00:20:36,159 --> 00:20:39,039 training you don't need it anymore 533 00:20:37,919 --> 00:20:41,120 because you're not trying to recreate 534 00:20:39,038 --> 00:20:43,440 Killian code again you want to create 535 00:20:41,119 --> 00:20:45,359 new images which belong to the category 536 00:20:43,440 --> 00:20:48,000 of stately college buildings and for 537 00:20:45,359 --> 00:20:49,199 that all you you just grab noise send it 538 00:20:48,000 --> 00:20:51,919 in it gives you a stately college 539 00:20:49,200 --> 00:20:51,919 building end of 540 00:20:53,759 --> 00:20:57,839 And because noise by definition is 541 00:20:55,519 --> 00:20:59,200 different each time you pick it, it's 542 00:20:57,839 --> 00:21:01,839 going to come up with a different 543 00:20:59,200 --> 00:21:06,679 stately college building. 
So the way I think about it is this. Think of this as the noise distribution. Each time you sample, you pick a little point from it; another time you sample, maybe you get a point over there. It's just a nice, simple distribution. What these models are actually doing is mapping it to the distribution of stately college buildings, which might be some strange, crazy distribution. So each time you sample, you start from a point here and you land at a point there, and when you start from a different point here, you land at a different point there. What you have done is this: when you took the training data, you created points in the image distribution, found the matching noise for each, and then flipped the pairs for training, as we have seen. And once you're done, you have a mechanism for transforming any entry in this distribution into an entry in that distribution. So it's a way to transform one distribution into another distribution. That's what's going on.

All right. So there was a question? Yeah, and then we'll go.

>> I understand the going from the image to noise, and how the training works. My question is: in some of these models today, when you give it noise to generate an image, it could, for example, generate a human with four fingers, or stuff like that. So is it that the training data is not quite enough, or not robust enough, to generate that kind of detail? [cough] Can you talk through what's going on?
>> Yeah. So, fundamentally, it actually does not understand the notion of fingers and things like that, right?
We haven't injected any domain knowledge into this whole process. We haven't said, "Hey, you need to generate a human body, and here are the semantics of what a human body is: it's got five fingers and all the anatomical stuff." We're not giving it anything like that; we're literally giving it pixel values, a bunch of pictures. So everything you're seeing is coming out of that very blind statistical transformation process. You would expect it to get macro-level details right, because there are so many right answers. Imagine it's creating the roof of a house: there could be all kinds of variations in the roof, and you would still think it's the roof of a house, because there are many possible right answers. But when it comes to five fingers, there are not many possible right answers, which is why you notice the error very quickly. As far as the model is concerned, it doesn't know; it's just producing a statistically plausible sample from that distribution. And since we haven't forced it to obey constraints like five fingers, it's not going to do any of that. It's an unconstrained process.

Now, over time these things have gotten better and better, and that's partly because the data has gotten better, to your point. But our approaches are also getting better: there are lots of ways now to steer and control the model so it behaves the right way, and that is part of what's happening as well. When we talk about how you give a text prompt and have it build the image for that particular prompt, we'll revisit this question. Okay, there were more questions? Yeah.
>> Is there some randomness in the model itself? If you gave it the same noise image twice, would it actually produce the same final image, or would it...
>> Yeah, there is randomness in the process as well.
>> In the process. Exactly.
>> Actually, that's a really good point, but now I'm afraid to open my laptop; I'm on the iPad. One second. All right. So what's going on here is this. I talked about how we transform from here to some crazy distribution over there. Let's say this is the starting point, your noise input. At each step, what you actually do is go here, take this point, and then draw a small sample next to it: you use the predicted point as the mean value and sample around it, and that sample is what actually gets shown in the user interface. That's where the randomness comes in.
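In code, that "use the prediction as the mean and sample around it" step might look like the following sketch. The fixed sigma is an illustrative assumption; real samplers derive a per-step variance from the noise schedule:

```python
import numpy as np

def stochastic_denoise_step(denoiser, x, sigma=0.05):
    # Treat the denoiser's output as the mean of a distribution...
    mean = denoiser.predict(x[np.newaxis, ...], verbose=0)[0]
    # ...and sample around that mean, so even the same starting noise
    # can lead to a different final image.
    return np.clip(mean + np.random.normal(0, sigma, size=mean.shape), 0, 1)
```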
And in fact, in practice, 696 00:26:44,240 --> 00:26:48,720 you will see it's not an issue. 697 00:26:47,119 --> 00:26:52,079 Um, 698 00:26:48,720 --> 00:26:53,440 okay. So, oh yeah, go ahead. 699 00:26:52,079 --> 00:26:58,158 >> There's a quick question. when you're 700 00:26:53,440 --> 00:27:01,120 doing uh like text to text, let's say 701 00:26:58,159 --> 00:27:03,120 you're uh tokenizing the input, but here 702 00:27:01,119 --> 00:27:06,558 you somehow have to identify that this 703 00:27:03,119 --> 00:27:09,119 is Killian Cord and like a stately home 704 00:27:06,558 --> 00:27:13,119 and this is just going from pixel image 705 00:27:09,119 --> 00:27:16,319 to or like decoding a pixel image. Um 706 00:27:13,119 --> 00:27:20,399 where does the the tag or tokenization 707 00:27:16,319 --> 00:27:21,839 of like columns or fingernails or like 708 00:27:20,400 --> 00:27:23,200 >> does nothing. It's learning everything 709 00:27:21,839 --> 00:27:23,918 from the pixel values. 710 00:27:23,200 --> 00:27:25,600 >> Everything. 711 00:27:23,919 --> 00:27:27,120 >> Yeah. And this is sort of what I was, 712 00:27:25,599 --> 00:27:28,480 you know, when I when Ike asked the 713 00:27:27,119 --> 00:27:30,798 question about the four fingers, five 714 00:27:28,480 --> 00:27:33,038 fingers thing, it has no idea of 715 00:27:30,798 --> 00:27:34,319 fingers. It has zero knowledge about any 716 00:27:33,038 --> 00:27:36,319 of these things. All it's seeing is a 717 00:27:34,319 --> 00:27:38,558 bunch of photographs. 718 00:27:36,319 --> 00:27:40,639 >> Okay. So when you when you type in say I 719 00:27:38,558 --> 00:27:42,960 want a hand with green. 720 00:27:40,640 --> 00:27:44,880 >> Oh, I see. So we haven't yet come to the 721 00:27:42,960 --> 00:27:47,120 stage of okay, how do you actually steer 722 00:27:44,880 --> 00:27:48,080 this image using your text prompt? It's 723 00:27:47,119 --> 00:27:49,519 coming 724 00:27:48,079 --> 00:27:51,278 >> right now. All we're saying is that 725 00:27:49,519 --> 00:27:52,960 look, I'm going to give you a bunch of 726 00:27:51,278 --> 00:27:55,119 uh photographs of a particular kind of 727 00:27:52,960 --> 00:27:56,480 thing, stately college buildings and I 728 00:27:55,119 --> 00:27:58,239 want to have a model which at the end of 729 00:27:56,480 --> 00:27:59,360 the day I just poke it. Every time I 730 00:27:58,240 --> 00:28:01,278 poke it, it gives me a stately college 731 00:27:59,359 --> 00:28:02,879 building. That's it. Now I'm going to 732 00:28:01,278 --> 00:28:04,558 actually start giving it text and saying 733 00:28:02,880 --> 00:28:06,320 okay build the you know create the thing 734 00:28:04,558 --> 00:28:08,398 I'm just telling you about that's coming 735 00:28:06,319 --> 00:28:12,000 and that's sort of some additional magic 736 00:28:08,398 --> 00:28:14,558 is going on to get that done. U okay so 737 00:28:12,000 --> 00:28:16,720 this is what we have u and this is 738 00:28:14,558 --> 00:28:18,158 called a diffusion model. Okay. And this 739 00:28:16,720 --> 00:28:21,519 is the original paper that figured this 740 00:28:18,159 --> 00:28:24,799 out. Um, and 741 00:28:21,519 --> 00:28:26,639 the the process of actually creating 742 00:28:24,798 --> 00:28:28,639 taking an image and creating noisy 743 00:28:26,640 --> 00:28:30,880 versions of it to create a training data 744 00:28:28,640 --> 00:28:32,480 is called the forward process. 
and what we did in reverse is called the reverse process. Check out the paper; it's actually really well written, and I recommend it.

Now, in practice, some other researchers came along shortly after this and made a small improvement, which turns out to be a big improvement in practice in terms of the quality of what gets produced. What they said is: hey, instead of training the model to predict the less noisy version of the image, let's ask it to predict just the noise in the input, and then we will simply subtract that noise from the input to get the image. So instead of saying "here is x, a noisy image, and y, the slightly less noisy image," we tell it "here is a noisy image, and here is the noise we added to produce it; predict that noise for me." Once we have the prediction, we just compute x minus the predicted noise, and we get the less noisy version of the image. This feels arithmetically equivalent, but in practice it ends up generating much higher quality images, and there's some very interesting theory as to why; you can read this paper if you're interested. So if you look at what's going on in most diffusion models today, they're basically using an approach like this: at each step they predict the noise and subtract it away. Iterative subtraction of predicted noise. That's what's going on.
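To make the difference concrete, here is a sketch of the noise-prediction setup, again with an illustrative stand-in network (the actual papers use a U-Net and a carefully chosen noise schedule):

```python
import numpy as np
import tensorflow as tf

# A network that takes a noisy image and predicts the noise inside it.
eps_model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(3, 3, padding="same"),  # predicted noise (unbounded)
])
eps_model.compile(optimizer="adam", loss="mse")

def make_noise_pairs(images, sigma=0.3):
    """x: noisy images; y: the exact noise that was added (the new target)."""
    noise = np.random.normal(0, sigma, size=images.shape)
    return (images + noise).astype(np.float32), noise.astype(np.float32)

# Training:         eps_model.fit(*make_noise_pairs(images))
# One denoise step: x_less_noisy = x - eps_model.predict(x[np.newaxis, ...])[0]
```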
All right, so that's what we have. Now, at this point you may be wondering: so far in the semester we have learned how to take an image and classify it into one of 20 things, 10 things, whatever, and we have taken text and figured out how to do things with it. But we haven't yet talked about how you take an image as the input and get the output to also be an image. We haven't done image-to-image. How do you build a neural network for that? In the interest of time, we're not going to get into it massively, but I want to give you a quick idea of how it works.

The dominant architecture for taking an image as input and producing an image as output is called the U-Net, and that's the architecture we see here. Fundamentally, there's a left half to the network and a right half, hence the U. The left half is a good old convolutional neural network, the kind we know and love and are very familiar with. You take an input image, run it through a bunch of convolutional blocks, do some max pooling, and keep going, and the representation becomes smaller and smaller: the big image with three channels gets smaller and smaller spatially, but the number of channels gets wider and wider. It becomes much smaller but much deeper, like a 3D volume, and we have seen that again and again. Then you come to the middle, and from that point on we essentially reverse the process: we go from the small things that are really deep to slightly bigger things that are a little less deep, and so on, until we get the original size back again.
And we do that using an inverse of the convolution layer, called an up-convolution or deconvolution layer. You can check out Section 9.2 in the textbook to understand how it's done; it's also called Conv2DTranspose. It's a very similar idea, and I'm not going to get into the details here, but you essentially do an inverse of a convolutional operation to bring the size back up, and you do it gradually until the output matches the size of the input that came in. So the image gets smaller and smaller into a compact thing, and then you just blow it back up again to get an image back. That is the U-Net.
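As a quick illustration of the shape bookkeeping (nothing specific to the paper, just a toy example), here is a strided convolution shrinking a feature map and a ConvTranspose2d growing it back:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 16, 16)   # a feature map: 64 channels, 16x16

down = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)
up   = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)

h = down(x)   # (1, 128, 8, 8): smaller but deeper
y = up(h)     # (1, 64, 16, 16): back to the original spatial size
print(h.shape, y.shape)
```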
Now, there's one very important thing that happens in the U-Net, which is these connections you see here. At every step, as you come back up on the right half, you take whatever was at the mirror-image stage of the left half and attach it on the right side as well. Remember the notion of a residual connection from many classes ago? When an input goes through each layer of a neural network, say you're at the tenth layer, you only see what the ninth layer produced for you; that's all you're working with. But wouldn't it be nice if the tenth layer also had access to the eighth layer, the seventh, the sixth, the fifth, heck, why not the input itself? The more information it has, the better it can probably do with the input it's given. Why restrict it to only the output of the previous layer? Why can't we give it everything that came before it? Now, giving it everything is too much, but we can be selective in what we give it. So what these folks decided, I'm sure after much experimentation, is that if they attach whatever comes out of a layer on the left to the corresponding layer on the right, before it goes through toward the output, it really helps. Similarly, this one gets attached, and so on. And it kind of makes sense: why force the right half to figure out everything from just the one thing coming up through the middle? Let's also give it a little from here and a little from there. These residual connections are a huge building block for why these things work as well as they do. In general, giving a layer as much information as you can is a good idea, but you can't go nuts, because then you have many more parameters and all kinds of stuff happens. So there's a balance to strike, and this was the balance these researchers struck. This thing was originally invented for medical image segmentation use cases, but it's heavily used for everything now. It's a really powerful architecture. Questions?

>> Can we have an example of the scenarios where we'd use this kind of model?

>> Anytime you have image to image.

>> Like what kinds of image-to-image use cases?

>> Let's say you want to take a black-and-white image and colorize it: boom, U-Net. You want to take an image and make a higher-resolution version of it: U-Net. You want to take an image and classify every pixel into one of ten categories: U-Net. Anytime you want the shape of the output to be basically the same shape as the input, but with different data in it, you need something like this; a minimal sketch follows below. Yeah.
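Here is that sketch: a deliberately tiny U-Net-style network in PyTorch. Real U-Nets stack many more levels, but the down path, the up path, and the concatenated skip connection are all visible. Everything here is illustrative.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """A one-level U-Net-style sketch: down once, up once, with a skip."""
    def __init__(self, ch=3):
        super().__init__()
        self.enc  = nn.Sequential(nn.Conv2d(ch, 32, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.mid  = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.up   = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec  = nn.Conv2d(64, ch, 3, padding=1)  # 32 upsampled + 32 skip channels in

    def forward(self, x):
        e = self.enc(x)              # left half: full-resolution features
        m = self.mid(self.down(e))   # bottom of the U: smaller, deeper
        u = self.up(m)               # right half: blow it back up
        u = torch.cat([u, e], 1)     # the skip connection from the mirror stage
        return self.dec(u)           # output has the input's spatial shape

out = TinyUNet()(torch.randn(1, 3, 32, 32))
print(out.shape)   # torch.Size([1, 3, 32, 32])
```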
>> But this logic of having access to all the previous iterations...

>> Not iterations, all the previous layers.

>> Right, the outputs of the previous layers. But would this also help clean things up and give better categorization? Does it always have to be image to image?

>> No, no. In fact, look at ResNet; ResNet is the one that pioneered the idea of the residual connection, so we use it there. We also use it in the transformer stack: if you remember, the input goes through the self-attention layer, comes out the other end, and then we add the input back to it and send it through a layer norm. So this residual connection sits in two different places inside a single transformer block; it's extremely heavily used. There's also something called the wide-and-deep network, if I remember right, and DenseNet, which use the same trick. In fact, when you're working with structured data, good old linear regression say, and you've looked at your data and come up with all kinds of clever features, price per square foot and so on, you do a bunch of feature engineering and you have a bunch of new features. Well, you should take your old features and your new features and send both in. Why send in only the new stuff you've concocted? Why not send everything in? That's the idea. A quick sketch of those two residual additions in a transformer block follows.
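This is roughly what the two residual additions look like in code. It's a generic pre-norm transformer block for illustration, not any particular model's exact implementation.

```python
import torch.nn as nn

class Block(nn.Module):
    """A generic transformer block with its two residual connections."""
    def __init__(self, d=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mlp  = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual #1: add the input back
        x = x + self.mlp(self.ln2(x))                      # residual #2: add it back again
        return x
```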
All right, so let's come back here. We have now seen how to generate a good image; now let's figure out how to steer it, or condition it, with a text prompt, because that's sort of the holy grail. So here's some intuition. We want to take the text prompt into account and generate an image. Now imagine we had a rough image that corresponds to the text prompt. Just imagine. Say the text prompt is "cute labrador retriever" and you happen to have a very noisy image of a labrador retriever handy. Well, now you're in good shape, because you just feed that in, your system denoises it for you, and you get a better image. That's pretty easy. But obviously in reality you don't have a rough image; you're trying to create one of those things in the first place. So what if we had an embedding for the prompt that's close to the embeddings of all the images that correspond to the prompt? Take a prompt, and imagine all the images in the universe that correspond to it. Now further imagine, because everything is a vector, everything is an embedding in our world, that the text prompt has an embedding, every image has an embedding, and we have somehow calculated these embeddings so that the text prompt's embedding sits smack in the middle of where all those image embeddings are. We'll get to how we actually do that in just a moment, but conceptually, imagine you could calculate embeddings for text and embeddings for images so that they all live in the same space. If we feed this embedding to a denoising model, then because the text embedding sits in the same space as the embeddings of all the images it corresponds to, maybe our model can just denoise that embedding and give you what you want. Since this embedding is already close to the embeddings of the things we want to generate, maybe it will just get it done. So ultimately we want to generate an image, and if we had an embedding for that image, we could generate the image from the embedding, and we get to that embedding using the text.
So we go from text to an embedding, which happens to live in the same space as the embeddings of all the images we care about, and then from that image embedding we go to the final image. This is a bunch of me talking and hand-waving; it will all become very clear, but that's the rough intuition. So what we'll do is describe an approach for calculating an embedding for any piece of text that is close to the embeddings of the images corresponding to that piece of text. This is the problem we're going to solve: there's a piece of text, conceptually there's a whole bunch of images that the text describes, and we're going to create embeddings so that the text's embedding is close to the embeddings of all those images. It feels almost impossible that you could do something like this, but there's a very clever idea that OpenAI came up with that tells you how. So here's what we're going to do. Let's say we have an image and a caption. We need some way to take that piece of text, run it through some network, and create a nice embedding from it. Similarly, we want to take the image, run it through some network, and create an embedding from it. Now, first question: how can we compute an embedding from a piece of text? You know the answer: run it through a transformer. Piece of cake; we know how to do that. In particular, you can do something like BERT. And for an image encoder, you run the image through something like ResNet and take the penultimate layer; one of those final layers is going to be a very good representation of the image, and you get another embedding.
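As a sketch of what that looks like with off-the-shelf pieces (the specific checkpoints here are just examples, not the ones CLIP used):

```python
import torch
import torch.nn as nn
from torchvision import models
from transformers import AutoTokenizer, AutoModel

# Image encoder: ResNet-50 with the classification head chopped off,
# so the penultimate 2048-dim activation becomes the embedding.
resnet = models.resnet50(weights="IMAGENET1K_V2")
resnet.fc = nn.Identity()
resnet.eval()

# Text encoder: BERT, mean-pooling the token states into one vector.
tok  = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

with torch.no_grad():
    image   = torch.randn(1, 3, 224, 224)    # stand-in for a real image
    img_emb = resnet(image)                  # shape (1, 2048)

    enc     = tok("a cute labrador retriever", return_tensors="pt")
    txt_emb = bert(**enc).last_hidden_state.mean(dim=1)   # shape (1, 768)

# Nothing ties these two spaces together yet; that is exactly the problem.
```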
So, using building blocks we already know, we can create embeddings from these things very quickly. But if you just take a piece of text and run it through a BERT, and take an image and run it through a ResNet, you will get some embeddings, but why on earth should they be related? They were not trained together, so there's no basis for them to be related. They would just be two embeddings. Maybe they're kind of similar, maybe they're not; we don't know, and there's no reason to expect they will be. They're just two embeddings. So once we have these two encoders, we need to make sure the embeddings that come out of them satisfy two very important requirements. First, if you give them an image and a caption that describes that image, we want the two embeddings that come out to be as close to each other as possible. Given an image and a caption that describes it, that's the connection: they have to be close. And conversely, if you have an image and a caption that's totally irrelevant to it, say "a train rounding a bend with beautiful fall foliage all around", clearly irrelevant here, those embeddings should be far apart. That's what it takes for this to really make sense: pairs of related things should be close together, and irrelevant things should be far apart. If we can find embeddings that satisfy these two criteria, maybe we're in the game. This ensures that the text embedding and the image embedding refer to the same underlying concept; these two requirements enforce that.
And so the embedding for any text prompt will be close to the embeddings of all the images that correspond to that prompt. So the question is, how do we do this? First of all, how can we tell how close two embeddings are? You know the answer to this. What is it?

>> Cosine similarity.

>> Correct, cosine similarity. We use the cosine similarity of the embeddings, so we know how to measure closeness.
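In code that's one call: cos(u, v) = u.v / (|u| |v|), which is 1 for vectors pointing the same way, 0 for orthogonal ones, and -1 for opposite ones.

```python
import torch
import torch.nn.functional as F

u = torch.randn(512)
v = torch.randn(512)
sim = F.cosine_similarity(u, v, dim=0)   # scalar in [-1, 1]
print(sim.item())
```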
So the question is how we can compute embeddings that satisfy the two requirements, and OpenAI built a very famous model called CLIP to solve this problem. It stands for Contrastive Language-Image Pre-training, and it forms the basis for a whole bunch of models that have sprung up since, called BLIP and BLIP-2 and so on, but this is the fundamental idea. Here's how CLIP works. They took a 12-layer, 8-head transformer causal encoder stack as the text encoder. You understand this now, right? That's what it is: an 8-head, 12-layer causal transformer stack. You send any piece of text through it, you get the next-word-prediction embedding, and that's the embedding you're going to use. And they took ResNet-50 and made it the image encoder: they chopped off the top, and whatever was left is the image encoder. Then they initialized these things with random weights, and they grabbed a batch of image-caption pairs. In this example, let's say we have these three images and captions to go with them. And this is the key step: they run the images through the image encoder and the captions through the text encoder and get the embeddings. It's a forward pass: you send things through the networks, you get the two sets of embeddings. Then, with these embeddings, they calculate the cosine similarity for every image-caption pair. So imagine something like this: you have these three captions, you have these three images, those are the embeddings, and they calculate the cosine similarity for every one of those combinations. It took me five or ten minutes to do this PowerPoint slide; you're welcome. Getting that comma to line up was a real pain in the neck. So, all right, we have this matrix. Now, we want the scores on the diagonal to be as high as possible, because the diagonal scores are the ones for a matching picture and caption. Those are the scores for the matching pairs of embeddings, and we want them as high as possible. So we want to maximize the sum of the green cells, the diagonal. If you write it as a loss function, and a loss function is always a minimization, we say: minimize the negative sum of the green cells. So the question is, would this loss function do the trick? It seems reasonable: you want related things really close together, so you maximize.

>> If that was the only part of the loss function, wouldn't it just squish everything to the same spot in the space?

>> Correct. What it's going to do is basically ignore the input. The optimizer can simply ignore the input and make all the embeddings identical, mapping everything to the same constant vector.
That's it, and then we have a perfect cosine similarity for everything: for any pair of image and caption, the cosine similarity is one. Perfect, right? So clearly that's not enough. This, by the way, is called model collapse. To prevent it from doing that, we need one more thing in the loss function. Any guesses?

>> Yeah. Make the images that aren't related not have a high cosine similarity.

>> Exactly right. We want the scores of the red cells to be as small as possible. We want the green stuff to be as large as possible and the red stuff to be as small as possible; together, that gets the job done. So we want to maximize the sum of the green cells and minimize the sum of the red cells, and the equivalent loss function is: minimize the sum of the red cells plus the negative sum of the green cells. That's it. So all CLIP does is grab a batch of image-caption pairs, run them through the two networks, calculate the embeddings, compute this quantity as the loss, and backpropagate through the networks. Boom. Batch, batch, batch; do it a whole bunch of times. This is the official picture from the paper, which is worth reading, by the way: text comes into the text encoder and you get these embedding vectors, images go into the image encoder, and then boom, the diagonal is maximized and the off-diagonals are minimized. And they did it with 400 million image-caption pairs scraped from the internet. Four hundred million.
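The lecture states the objective as "red cells down, green cells up"; the published CLIP paper implements that intuition as a symmetric cross-entropy over the similarity matrix, which applies the same pressure. A sketch:

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Contrastive loss sketch for N matched image-caption pairs:
    push the N diagonal similarities up, the off-diagonal ones down."""
    img = F.normalize(img_emb, dim=-1)          # unit vectors, so the dot product
    txt = F.normalize(txt_emb, dim=-1)          # equals cosine similarity
    logits = img @ txt.t() / temperature        # (N, N) similarity matrix
    targets = torch.arange(img.shape[0])        # the correct match is the diagonal
    # cross-entropy over rows (image -> caption) and columns (caption -> image)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```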
By the way, those of you who work in this space may know this really well, but there's one very easy way to get a caption for an image. We see the images, but where do you think the captions come from? They obviously didn't ask people to manually label each image with a caption. Where do you think they got them?

>> Google search.

>> Google search can help, but why does Google search actually find a caption? Google search isn't creating the captions.

>> They take it from the alt text on the images.

>> Correct, alt text. For accessibility reasons, a lot of people put alt text on the images they publish on the web, and that's what gets used. And alt text actually ends up being a more verbose description of the image than a typical caption, which tends to be much briefer. For us, more verbose is better, the longer the better, because there's more for the model to learn from. So that's how they built CLIP. And now we can use CLIP's text encoder by itself: we can send in any text and get an embedding that is close to the embeddings of the images described by that text. Now, by the way, CLIP can also be used for zero-shot image classification. What I mean by zero-shot image classification, and I'll walk through the picture in just a second, is this: typically, when you want to build an image classifier, you get a whole bunch of training data of images and their labels and you train. Maybe you take something like ResNet, chop off the top, attach your own output head, and train, train, train. Boom, you have a classifier.
But the only problem with that is, let's say today you had five classes in your problem, and tomorrow somebody comes along and says, actually, we have a sixth category. What do you do then? You have to go back to the drawing board and retrain the whole thing with six labels instead of five, because your problem has changed. Wouldn't it be great if you had a classifier where you just come to it and say: here's an image, and here are the six possible labels I want you to pick from; pick one for me. You could give it a different set of labels each time, and it would just use the labels you gave it, plus the image, and figure out which label corresponds to that image. That would be an insanely flexible image classification system, right? That's what I mean by zero-shot image classification, and you can use CLIP to do it. How you do it is actually in this picture, though not very clearly. Anyone want to try? How can you use CLIP to build an infinitely flexible image classifier?

>> The text input was a trained BERT, right? So in the same way BERT can handle words it's never seen before, does it essentially do that?

>> Sorry, say the second part again.

>> You're saying it sees a text input with something it's never seen before, right?

>> Yeah. In the BERT model, which is where the text encoding came from, I think we talked about how, when it sees a word it doesn't know, it can use the context words around it to try to...

>> Right, right. But here, just to be clear, I want you to use the CLIP we just built, and assume CLIP knows all the words, because it's been trained on a big vocabulary.
You can give it any text you want, and it will create an embedding from it. That's the key capability.

>> So it creates a text embedding for...

>> Yeah.

>> ...and then one for your image. So it compares similarity scores between the two. But the image is complete while the text is not complete; there will be missing pieces, and then it makes some prediction using this?

>> Why is there a missing piece in the text?

>> Because the text does not contain the class. Whereas the image, the way it was trained, was trained in pairs, with the class included.

>> Right, but we actually know the classes now, because the use case is that I come to you with an image and say: here are the seven possible labels for this image, and each label is a piece of text. So you actually have seven pieces of text and one image, and all I want CLIP to do is tell me, okay, the fourth label is the right one for this image. But you're on the right track. Once you see how it's done, you'll go, yeah, of course.

>> I might not be understanding something, but wouldn't you just pick the text embedding that's closest to the image embedding?

>> Correct. You're not missing anything; that's the right answer. Well done. Come on, people, can you applaud our fellow here? [applause] You folks are hard to impress. That's exactly what we do. The key thing to keep in your head is that a label is just text. Dog, cat: it's just text.
So you can imagine taking each label, which in this case is plane, car, dog, whatever, and creating an embedding for each one: you get t1 through tn if you have n labels. For the image you have one embedding, i. Then you calculate the cosine similarity between the image embedding and each label embedding, and whichever gives the highest number, you say: okay, it's a dog. That's it. Just imagine the level of flexibility here. So that's a side use of CLIP, unrelated to diffusion models, but I thought it was really clever, so I wanted to share it; a quick sketch with a public CLIP checkpoint follows.
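Here is that sketch, using one public CLIP checkpoint through the Hugging Face transformers API; the image path and label set are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a plane", "a photo of a car", "a photo of a dog"]
image = Image.open("some_image.jpg")   # any image you have handy

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    sims = model(**inputs).logits_per_image   # scaled cosine similarities, shape (1, 3)
print(labels[sims.argmax().item()])           # the closest label wins
```

Change the label list and you have a different classifier, with no retraining.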
Okay, good. Now let's see how we can use this capability to solve the original problem we set out to solve: can we steer the diffusion model to create an image based on a particular prompt we give it? Remember how we did it before: we created all these training pairs of X and Y based on noising, where X is the noisy image and Y is the less noisy version. What we can simply do is change the input so that it becomes the image plus the CLIP text embedding of the caption for that image. You have an image and a caption; you run the caption through CLIP and get an embedding. By definition, that embedding lives in the same space as all the images corresponding to that caption. So you concatenate the CLIP embedding of the caption with the image and make that the new input. Y continues to be the less noisy version of the image, or, as we saw earlier, just the noise component. That's the new X-Y pair, and you keep training on pairs like that for a while. Once your model is trained and you want to use it for inference on a new prompt, you just give it, say, "Killian Court at MIT during the springtime" along with a bunch of noise, and it starts denoising. Because the embedding of that prompt, thanks to CLIP, lives in the same space as the embeddings of all the Killian Court images, if you keep going, at some point you'll get Killian Court. That's how they do it; that's how they steer the image. It's a two-step process: you create all these CLIP embeddings, and CLIP was a breakthrough in my opinion, because it was one of the early examples, maybe the first, of saying: we have different kinds of data, images, captions, text; how do we create embeddings for these very different data types so they all live in the same concept space? That was the key idea. And if you look at modern multimodal large language models, they are all based on the same exact idea. So it's a very powerful approach. A rough sketch of one such conditioned training step follows.
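In this sketch, the function names, the conditioning interface, and the schedule are all illustrative assumptions; real systems differ in how the text embedding is fed in.

```python
import torch
import torch.nn.functional as F

def conditioned_step(model, x0, captions, clip_text_encoder, alpha_bar):
    """Noise-prediction step where a frozen CLIP text embedding rides along.

    model is assumed to accept (noisy image, timestep, condition vector);
    clip_text_encoder is assumed to map a list of captions to (B, D) embeddings.
    """
    B = x0.shape[0]
    with torch.no_grad():
        cond = clip_text_encoder(captions)        # (B, D) caption embeddings
    t = torch.randint(0, alpha_bar.shape[0], (B,))
    noise = torch.randn_like(x0)
    a = alpha_bar[t].view(B, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise  # noised input, as before
    pred = model(x_t, t, cond)                    # the denoiser sees the text too
    return F.mse_loss(pred, noise)
```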
>> Now, I understand this for images, but for video generation models like Sora, do they have some sort of underlying physics structure, or do they learn the physical representations?

>> There's a lot of debate on the internet about this. They haven't published the full technical report yet, so we don't know for sure, but the consensus seems to be no, they are not using a physics engine. What they have done, and again this may be wrong, and once the report comes out we'll know for sure, but what people are saying, computer vision experts, is that it has been trained on a lot of video game data along with actual videos, and the training corpus is so massive that it has basically learned to mimic certain aspects of physics as a side effect. Much like LLMs: you train them on a large amount of text data and they begin to do things you didn't anticipate. I read a really great example of this: what's surprising about large language models is not that you train them on a bunch of high school math problems and then they can solve a new high school math problem. That's not surprising. What's surprising is that you give them a bunch of high school math problems in English, then you have them read a bunch of French literature, and then you give them a French high school math problem and they solve it. That is the real news. Similarly here, the expectation is that Sora is not actually using a physics engine under the hood. A physics engine may have been used to produce some of the training videos and renderings, but there are no physics constraints in the model itself; it just comes out of the training process. That's the current view. Once the technical report comes out, we'll know what they actually did.
>> A quick question about Stability: they're claiming to be a bit more real-time in their image generation.

>> You mean Stable Diffusion?

>> Yeah, Stable Diffusion. So are they jumping through the noise more quickly, or are they pre-prompting it, some kind of trick?

>> Very good question, and there is a key trick. It's coming.

>> So here the noise comes from a normal distribution. If we changed the noise distribution, would it change the result?

>> Oh, you mean if you changed it to a Poisson or some other distribution? It would definitely change the results, because if you look at the underlying math of why this works, it depends heavily on the Gaussian assumption.

>> There was another question somewhere here.

>> You may not know the answer because the technical report isn't out, but could video generation be sort of analogous to going from one noisy image to another? Like you're almost doing a series of still images and learning how to...

>> No, I think that is how people are fairly sure it's done. Think of the video as just a series of frames. Each frame is an image, and there's a sequentiality to it, which is where the transformer stack comes in, because it handles sequentiality. So in general, video models typically operate frame by frame, and a frame is just an image. That part is definitely there.
What we don't know is whether they also built in an understanding of the fact that, for example, a dropped object has to fall to the earth at a certain rate, or that once an object goes behind another object you can't see it anymore; things we take for granted. The question is whether they are using constraints like that, and the consensus, in the absence of an actual technical report, seems to be that they are not, because there are lots of examples on Twitter where people show a Sora video that is not obeying the laws of physics. You take a beach chair and put it in the sand, and you see the sand come through the base of the chair. Or you put an object behind an opaque object and you can still see it. So you do see evidence that, no, it's not obeying the laws of physics; what you're seeing is just an amazing mimic, like hands generated with extra fingers because nothing told the model there have to be only five. Okay, all right, let's keep going. So there was another paper that came afterwards, and this is the original paper, which took that idea of the diffusion model and addressed the fact that diffusion is very slow, as Olivia pointed out. The question is: can we make it much faster? I'm not going to go through the whole thing here; I just want to highlight a couple of points. The first is that you see a U-Net here: they are using a U-Net to go from image to image. The second is that the CLIP embedding of the text prompt is woven in, meaning it's incorporated into the U-Net through an attention mechanism, a transformer mechanism, and you can see the Q, K, V business here, which should be familiar at this point.
So the CLIP embedding is integrated directly into the transformer stack; that's the second thing I want to point out. And thirdly, and this is where the speed-up comes from: instead of taking the image, running it through the whole network, and creating a slightly less noisy version of the image, you take the image, run it through an image encoder, and get an embedding, and from then on you work only with the embedding. You take the embedding, create a slightly less noisy version of the embedding, and keep doing that. These embeddings are much smaller than images, so they are much faster to process. Once you've done it, say, a thousand times, you have an almost pure, noiseless version of the embedding, and then you run it through an image decoder to get the final image. So the idea is that you operate in the latent space, meaning the embedding space, and hence it's called a latent diffusion model. That's where the speed-up comes from. But research on making this even faster continues to be very active, because for a lot of consumer applications people are obviously not going to wait around; who wants to wait ten seconds? So there's a lot of pressure to make it faster still. A rough sketch of the latent-space loop follows.
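Everything in this sketch is illustrative: the latent shape, the denoiser interface, and especially the update rule, which real samplers replace with a properly derived step.

```python
import torch

@torch.no_grad()
def generate(denoiser, decoder, text_emb, steps=50, latent_shape=(1, 4, 64, 64)):
    """Latent-space sampling sketch: denoise small latents, decode once."""
    z = torch.randn(latent_shape)              # pure noise, but in latent space
    for t in reversed(range(steps)):
        pred_noise = denoiser(z, t, text_emb)  # cheap: latents are small
        z = z - pred_noise / steps             # crude stand-in for a real sampler step
    return decoder(z)                          # one expensive decode to pixel space
```

All the repeated work happens on the small latent; the expensive trip back to pixels happens exactly once.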
All right, so that's where things stand. These models are transforming everything. By the way, this site lexica.art — you can go check it out — has a whole bunch of very interesting images along with the prompts that created them, so if you're working in this space, it gives you a lot of interesting ideas. But it's not just consumer fun applications. These models are being used for real science. Recall AlphaFold: if you give it an amino acid sequence, it can create the 3D structure. I don't think they use a diffusion model there, but you can imagine using a diffusion model to create these complicated objects. Meaning, the objects you create don't have to be images — they can be arbitrarily complicated things. As long as you have enough data about such things for training, and the notion of noising the input is meaningful, you can create some very interesting structures: 3D objects, protein structures — there's a whole bunch of very interesting applications in the biomedical sciences. So this is really just the tip of the iceberg. And there are now ways to use diffusion models to do large language modeling as well, so there's a lot of overlap and blending going on in the space.

So now I'm going to do a quick demo. If you look at Hugging Face, there is something called the diffusers library, which, as the name suggests, is a library for a lot of diffusion models. Let's take a quick look. The diffusers library has a whole bunch of diffusion models; we're going to work with Stable Diffusion, which is one of the better-known models. So let's install diffusers. You'll recall, from the quick lightning tour of the Hugging Face ecosystem for language, that Hugging Face has a whole bunch of capabilities built out of the box, and you use this thing called the pipeline function to very quickly use any model you want. The same exact philosophy applies here — you still use the pipeline. So I'm going to import a bunch of stuff.
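Concretely, the setup is just an install and a couple of imports — a minimal sketch of what's being run here:

```python
# One-time setup; in a notebook, prefix with "!".
#   pip install diffusers transformers accelerate

import torch
from diffusers import StableDiffusionPipeline  # the pipeline we'll use below
```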
All right — oh, I see, I have to do this thing first. Okay. Great.

So, you'll remember that when we worked with text, we would grab a pre-trained model, run it through a pipeline, and then do all the inference we wanted on it. The same exact philosophy applies here — this is very similar to what we did in Lecture 8 for NLP. We use this command, StableDiffusionPipeline.from_pretrained, with the version 1.4 Stable Diffusion model. So let's create the pipeline. Now, we have used TensorFlow rather than PyTorch in this class, but a lot of these models happen to be in PyTorch, so knowing a little PyTorch is actually very helpful for working with these things. And while it's downloading: we're using the fp16 storage format for the model weights, because 16-bit weights are smaller than 32-bit ones, so the download is faster.

All right, it has downloaded fine. Now we just give it a prompt — and this is actually one of the original famous meme prompts: "a photograph of an astronaut riding a horse." Once the pipeline is set up, I set a seed for reproducibility, and then I literally call pipe(prompt). You can see the 50 here — it's going through 50 denoising steps — and out comes an astronaut riding a horse. You can change the seed and get a different image: the seed basically sets the random starting point for the image, so you'd expect a different astronaut. Yep — this is an astronaut riding another horse.
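In code, that whole demo is a few lines — a sketch assuming the public CompVis/stable-diffusion-v1-4 checkpoint and a CUDA GPU; the seed value here is arbitrary, not the one used in lecture:

```python
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,          # fp16 weights: half the download, half the memory
)
pipe = pipe.to("cuda")                  # sampling on CPU is painfully slow

generator = torch.Generator("cuda").manual_seed(1024)   # fixes the random starting noise
result = pipe(
    "a photograph of an astronaut riding a horse",
    num_inference_steps=50,             # the 50 denoising steps in the progress bar
    generator=generator,
)
result.images[0].save("astronaut.png")  # change the seed to get a different astronaut
```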
I think people came up with these kinds of fun examples because they're guaranteed not to be in the training data, right? So whatever the model is doing — remember — it's not regurgitating what it has already seen. All right, give me a prompt. Prompts, anyone?

>> [An audience member suggests a prompt — apparently MIT professors.]

All right — riding a horse, then... There are two of them, and clearly MIT professors don't really... [laughter] Yeah, moving on.

So, by the way, you should spend some time with the diffusers library. They have a bunch of tutorials which are really interesting, because this core capability — give a prompt, get an image out — can be manipulated for all sorts of very interesting use cases. For example, there is this thing called negative prompting. The idea is that you give the model two prompts and say: create an image that embodies the first prompt but not the second — essentially, subtract the second prompt from the first. You might be wondering what use that is. There are lots of fun uses. Here, the first prompt is going to be "a labrador in the style of Vermeer," with 50 steps. Look at that — amazing, right? But maybe you don't care for the blue scarf. So you give it a negative prompt, and the negative prompt is simply "blue," meaning: remove everything that's blue, but otherwise keep the Labrador thing going. So you run it... look at that — the blue is gone. Negative prompting.
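The negative-prompting version is the same call with one extra argument — a sketch continuing with the `pipe` from above (again, the seed value is made up):

```python
generator = torch.Generator("cuda").manual_seed(7)
image = pipe(
    "a labrador in the style of Vermeer",   # what we want
    negative_prompt="blue",                 # what we want subtracted out
    num_inference_steps=50,
    generator=generator,
).images[0]
```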
>> If you change that from 50 to a thousand, will it become less pixelated, or will it eventually just keep going and iterating?

>> No — typically, if you do more steps, it gets better. The quality is much better, because each step denoises only very slightly, so errors don't accumulate, things like that. And the diffusers library gives you lots of controls for fiddling around with all of these settings. Okay, so that's what we had — it's 9:49. Check out this tutorial if you're curious about how this stuff works.

Now I'm going to do one other thing, because I didn't get to do it earlier on. We spent some time with the Hugging Face hub, and I walked you through a few use cases for text, where you take a text model and use it for classification, summarization, and so on and so forth. You can do the same thing for computer vision models: if you have a computer vision problem that maps to a standard computer vision task, you can just use the Hugging Face hub as well. So let me show you very quickly that the same kind of thing works here.

Let's say you want to classify something. You just import the pipeline as before, and once you've imported it, you can literally give it the standard task you care about, like image classification, and start using it right from that point on. Now I'm going to grab this image — a very famous image — and we're going to ask the model to classify it. We just run it through the pipeline, and it says the most likely label, at 94% probability, is "Egyptian cat." Seems reasonable. I mean, it's a tough picture, right? There are lots of things going on in it — it's not one image, one object.
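A minimal sketch of the classification demo; I'm assuming the famous image is the COCO cats-on-a-couch picture that the Hugging Face documentation uses everywhere:

```python
from transformers import pipeline
from PIL import Image
import requests

# The well-known two-cats-with-remotes COCO image.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

classifier = pipeline("image-classification")   # downloads a default ViT-style model
print(classifier(image)[0])                     # e.g. {'label': 'Egyptian cat', 'score': 0.94}
```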
Okay, so you don't have to use the default model — you can give it your own model. For example, you can go to the Hugging Face hub and say: I want image classification. These are all the models — 10,487 of them — and you can sort by, I don't know, most downloads, or maybe most likes, and pick any one of them. Say you want Microsoft's ResNet — that's what I tried here. You just set model= to it, run it, and the pipeline takes care of all the preprocessing, this, that, and whatnot. It's really very handy. Run it through the pipeline again, and it says "tiger cat," 94% probability, according to ResNet.

Now let's try a more interesting example, where you want to detect all the objects in the picture — object detection, which we didn't talk about in class. Just create an object-detection pipeline, same thing as before. When you run this command, an astonishing amount of complicated stuff is going on under the hood, and we are all the beneficiaries of that. So, thank you. We run the image through the pipeline, and it looks at all the possible objects that might be sitting in the picture. The raw results are hard to read, so let's visualize them — I got some nice code from this site for how to do that; let's just reuse it. If you plot the results... look at that. It has picked up the cat — 100% probability, I guess — the remote, the couch, the other remote, and then the other cat. Pretty good, right? Off the shelf, ready to go — no heavy lifting required.
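And a sketch of the detection step, reusing `image` from above. Swapping in a specific hub model is one keyword argument (e.g. model="microsoft/resnet-50" on the classifier); the drawing code below is a simple stand-in for the visualization snippet borrowed in the lecture:

```python
from transformers import pipeline
from PIL import ImageDraw

detector = pipeline("object-detection")          # downloads a default DETR-style model
results = detector(image)                        # list of {'label', 'score', 'box'} dicts

draw = ImageDraw.Draw(image)
for r in results:
    box = r["box"]                               # pixel coords: xmin, ymin, xmax, ymax
    draw.rectangle([box["xmin"], box["ymin"], box["xmax"], box["ymax"]],
                   outline="red", width=3)
    draw.text((box["xmin"], box["ymin"] - 12), f'{r["label"]} {r["score"]:.2f}', fill="red")
image.save("detections.png")
```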
Now, in this case we're putting boxes — called bounding boxes — around each object. But what if you don't want a bounding box? What if you want the exact contour of the cat or the remote? No problem — we do something called image segmentation. So let's create an image segmentation pipeline and run the image through it. It takes some time... all right, let's visualize it. For each object it finds, it gives you a mask: it tells you what the object is, and then which pixels are on for that object and off for everything else. That's the mask — it tells you where the object sits. And you can see the first object it has found is this thing here, and it's perfectly delineated, right? It's pretty amazing. We can overlay this on the original image and see what it has found. Let's look at the other objects — oh, it has found the remote; that's the second object. And the third — the other remote — and the fourth. Do you think any other objects are remaining?

>> The couch.

Good. All right, let's find the couch. And look — the couch is pretty good, except that the middle part has gotten confused. Still pretty good, right? So that's Hugging Face: it does all of these things, and you should definitely check it out if you're not already very familiar with it. We have one minute left. Any questions? No questions — okay. All right, folks, see you on Wednesday. Thanks.