So, all right, today we actually come to the last lecture of the class, because Wednesday is going to be project presentations. I want to talk to you about diffusion models today, which is an incredibly exciting area that I don't think gets the same amount of attention, in some ways, as large language models, but it has enormous potential. So I'm very excited to talk to you about it. Just for kicks, last night I asked ChatGPT to create a photorealistic image of graduate students in a class on deep learning, and this is what it came back with. There is a noticeable absence of an instructor, plus various students are facing in various directions, but apart from that it's not bad. And here is an example of a Midjourney text-to-image diffusion model, which produces this amazing picture from the prompt: a quaint Italian seaside village with colorful buildings, and so on, rendered in the style of Claude Monet — and that's what you get. It's pretty unbelievable. I'm sure you folks have played around with these things and have your favorite pictures and prompts.

Now, on February 15th OpenAI released a text-to-video model called Sora, which you folks may have seen, and which I find frankly just stunning. It can produce a one-minute video from a text prompt. So if you give it this prompt — "In an ornate, historical hall, a massive tidal wave peaks and begins to crash. Two surfers, seizing the moment, skillfully navigate the face of the wave" — I think we can all agree that such a thing has never happened in history, and therefore it was not in the training data, right? And then you get this video, and then some random person comes walking back through a completely dry [laughter] hall. So anyway, it's pretty amazing, I think you would agree.

If you look at the Sora technical report, you find an opening paragraph where they say they train text-conditional diffusion models, and so on, using a transformer architecture. Now, we know what a transformer architecture is; you've been working with it and you're quite familiar with it at this point. So today's class is really about text-conditional diffusion models — the other building block. So let's get to it. I'm going to divide this into two parts. The first part is about how you get a model to just generate an image for you: if you wanted to generate an image from a class of potential images, how can it just generate one? And then we'll talk about, okay, now that you can do that, how do you actually control or steer the model to produce an image based on whatever prompt you give it? How do you condition it?
How do you control it? How do you steer it? You'll find all these synonyms used heavily in the literature; they basically mean the same thing: how do you give it a prompt and then steer what gets produced?

All right, so let's say we want to build a model that can be used to generate images of stately college buildings. Obviously our very own Killian Court is the finest example of such a thing. But let's say you want to do that. As we always do with machine learning, we collect a bunch of data; in this particular case, a whole bunch of images of stately college buildings. What you see here is literally me doing a Google image search with the query "stately college buildings," and this is the kind of stuff you get. So you have your training data at your disposal; it's ready to go.

Now, the question is this: if you have such a model (and we'll talk about how to build it very soon), every time you sample it — every time you ask the model, "hey, give me an image" — you obviously want it to give a different image, right? Otherwise it's kind of boring. Maybe you want Killian Court, maybe you want the Rotunda from the University of Virginia. Any UVA alums here? Nobody. Okay. So the question is, how can we get it to randomly give us different images, while they all remain stately college buildings? It can't just be random stuff. How do you do that?

And the way we do that — and I still find it really astonishing that this approach works — is that we give it noise. I will define very precisely what I mean by noise in just a bit; basically, assume an image in which all the pixel values are randomly picked. Each time, you generate a random image and give it to the model, it uses that random starting point and creates an image for you. And because, by definition, noise chosen randomly is different each time, it will hopefully generate a different image each time. But if the model is trained on stately college buildings, it will produce images of stately college buildings; it's not going to produce a picture of a Labrador retriever.

Okay, so that's basically what we're going to do. Now, the first question of course is: how can we train a model to generate an image from pure noise? This just sounds ridiculous, right? You basically give it a bunch of random numbers and say, "give me Killian Court." It feels really ridiculous. And at that point folks can come to a stop and say, "all right, this approach is probably not going to take me anywhere.
It's a bit of a dead end." But then some clever people had a very interesting idea. They said: it's not clear how to do this directly. Just a quick aside — there's this really amazing book, published maybe fifty years ago, maybe earlier, called How to Solve It by George Pólya. Pólya was an eminent mathematician, and he wrote this small book listing a whole bunch of heuristics that mathematicians use when they solve problems. Perhaps the most commonly used heuristic is: just reverse the question and see if anything comes out of it. Most of the time nothing will come of it, but occasionally something amazing does. This is a great example of that heuristic at work. We don't know how to do this, so the question is: can we do the reverse? If I give you Killian Court, can you produce noise out of it for me?

And the answer is: yeah, of course we can do that. Given an image, we can easily create a noisy version of it. You take the original image, add some noise to it to get this, and you keep adding noise until finally you get something where you can't tell that Killian Court is there anymore. This reverse process is very easy to do. By the way, for folks who may not be familiar with this notion of adding noise to an image, let me just show you in a Colab notebook how easy it is.

All right. We import a bunch of things; as usual we have NumPy, and there is this thing called the Python Imaging Library, PIL, which is very handy for image manipulations, so we import that. Then I literally just read this image in; I uploaded it before class. Let's make sure it's here — okay, good: killian.png. So I read this image, and once I read it, I convert it into a NumPy array. Remember, in any color image you have three tables of numbers: there's a number for each pixel for red, green, and blue, and each number is between 0 and 255. What we do here is divide everything by 255 to normalize it so it's all between zero and one; we have done this in the past. So I read it back in, convert it, and if you look at the shape it's basically 411 × 583 × 3 — three channels, as we have seen before — and then I'll just show it. All right, that's the picture.

Now we want to add noise to this picture. All we have to do is, for each pixel, randomly pick a normally distributed random variable with a mean of zero and a small standard deviation — so it's a small number — and then we just literally add that number to every pixel.
But for every pixel, we sample. It's not that we sample once and add the same value to all the pixels; we sample for every pixel. And the way you do that is literally np.random.normal: the 0.3 here is the standard deviation, and we tell it to generate as many of these values as the shape of the image I gave it. Then you add each of these numbers to the original image and you get this noisy image. So this is the original image, these are all values between 0 and 1, and when you add the noise the numbers change: 0.23 has become 0.18, 0.15 has become −0.17, and so on. You just added a small random number to everything. But as you can see, now you have some negative numbers, and you may have some numbers greater than one, and we do want everything to stay between 0 and 1. So all we do is clip it: values smaller than zero are set to zero, values greater than one are set to one. That's it — everything over one is squashed to one, everything under zero is set to zero, everything else is left unchanged. Now it's again well behaved between 0 and 1, we plot it, and you get this.

That's it. That's all it takes to add noise to an image: one line of NumPy. Obviously you can put the whole thing in a loop and keep increasing that standard deviation from 0.3 to 0.4 to 0.5 and so on, and when you do that you get this nice sequence from clean Killian Court all the way to a very, very noisy version of Killian Court. That's the basic idea of adding noise. Any questions on the mechanics? Okay, good. So we can add random numbers, and by increasing the standard deviation of these normal random variables, we can make the image noisier.
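A minimal sketch of that Colab demo, assuming the uploaded file is killian.png and a standard deviation of 0.3 as in the lecture (the exact values are illustrative):

    import numpy as np
    import matplotlib.pyplot as plt
    from PIL import Image

    # Read the image and scale pixel values to [0, 1].
    img = np.asarray(Image.open("killian.png").convert("RGB")) / 255.0
    print(img.shape)   # e.g. (411, 583, 3): height x width x 3 color channels

    # Add an independent Gaussian sample (mean 0, small standard deviation)
    # to every pixel of every channel -- one line of NumPy.
    noisy = img + np.random.normal(0.0, 0.3, img.shape)

    # Clip back into [0, 1] so it is a valid image again, then show it.
    noisy = np.clip(noisy, 0.0, 1.0)
    plt.imshow(noisy); plt.axis("off"); plt.show()

    # Increasing the standard deviation step by step gives the sequence from
    # clean Killian Court to something that is essentially pure noise.
    sequence = [np.clip(img + np.random.normal(0.0, s, img.shape), 0.0, 1.0)
                for s in (0.1, 0.2, 0.3, 0.4, 0.5)]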
Okay, so that suggests a really interesting idea. What idea would that be?

Yeah, doing the opposite — could you use the microphone, please?

>> Doing the opposite, like recreating the image from the noise.

>> So we are trying to create the image from the noise, but that feels a little hard. What exactly can we do? Be a little more specific. Here we have the ability to take any image and add any amount of noise to it. That's the data we have: there is Killian Court, and there are various noisy versions of Killian Court, and likewise for the Rotunda at the University of Virginia, and so on.

>> I would assume you would do some kind of loss function for the final image that you get, compare it with the original image that you trained on, and then refine as you go.

>> Okay, you're on the right track. Any other proposals?

>> I think we could try to train a neural network to reconstruct the image, going from the noisy one back to the clean one. We could take a whole dataset of images, create their noisy counterparts, and train a network to do the opposite task.

>> Yeah, that's definitely on the right track. Good ideas. So, what we do more concretely is this: we take each image in the training data and create noisy versions of it, as we have seen before. Then we create (x, y) training pairs — input-output pairs — from all these images. Specifically, we take the slightly noisy version of Killian Court and call it the input, and we take the clean version of Killian Court and call it the output. That's the (x1, y1) pair; then we get (x2, y2), (x3, y3), and so on. So at any point, what's the relationship between X and Y, if you set it up like this as the input and the output?

>> It's the set of standard deviations and the values by which you change each pixel; those are like the weights by which you transform.

>> Right — that's correct, but maybe I was looking for something simpler. What I'm really after is the relationship between X and Y: X is an image, any image, and Y happens to be a slightly less noisy version of that image. The "slightly" is really, really important. You're not going from the image to full noise — that's an impossible leap. You're going from the image to a slightly noisy version of the image. It is that "slightly" that allows all the magic to happen.

So that's what we have. And here's the thing — this is a larger comment about machine learning and deep learning. What machine learning and deep learning really are is this black box where, if you can find interesting input-output pairs, you can learn a function to go from the input to the output. That's it. It sounds simple when I describe it like that, but there are some incredibly non-obvious ways of applying the idea. For example, a few years ago Google had this feature, which may actually be in production in Google Sheets now, where whenever you select a range of numbers in a spreadsheet and then go into another cell, it immediately suggests a formula for you. Where is that coming from? It's because Google Sheets users all over the world have been creating all these numbers with formulas. So someone says, "wait a second — we have all this data on people choosing a range of numbers and then entering a formula.
So [15:38] let's imagine the range is the input and [15:40] the formula as the output [15:43] and let's just give a million examples [15:45] of this pair and see if anything comes [15:46] out of it and boom you get that feature. [15:50] Okay. So similarly here [15:53] X is an image less noisy version of the [15:55] image. What that means is that we can [15:58] build a dnoising network. [16:02] Okay, we can take an image and we can [16:04] build a network using all these XY pairs [16:06] to slightly dn noiseise it. [16:10] Okay. Um and so all how do we do it? We [16:15] just run stocastic gradient to sit on [16:16] the data. We have a network. It has X [16:19] and Y and then Y is a slightly less [16:22] noisy version and then B. [16:26] Okay, you're just a network. It has a [16:27] bunch of weights. we have the we have [16:29] the right answer in terms of what the [16:30] images need to be u we can do stocastic [16:33] gradient descent or atom or something [16:34] and before you know it if you have [16:36] enough data you have a network which can [16:37] d noiseise anything you give it okay um [16:40] you had a question [16:41] >> why slightly [16:43] >> why slightly um we'll come back to that [16:45] question the the reason is that u in [16:48] general you you have to do what you can [16:51] to help the model and this is sort of [16:53] the proverbial there is an old adage you [16:56] can't cross a ditch in two jumps. [16:59] It's too big. So, right. So, you can't [17:02] do it. So, what you do is you create a [17:03] bridge to go from here to there. And so, [17:05] what you do is if you can slightly d [17:07] noiseise something really well. Well, I [17:10] can actually den noiseise anything you [17:11] want really well using that fundamental [17:13] capability as you will see in a second. [17:17] >> Just to follow up. So, if you go back [17:18] the last slide, I could have created the [17:21] same thing as that is my x1 and that is [17:24] my y. Then the second one is x2 and [17:26] still this is the y. So there is [17:28] effectively there is a learning there [17:30] that it could have taken from those [17:33] pairs and come back with okay this is [17:35] also a possibility this is also a [17:37] possibility and it found out that noise [17:40] matrix and it can subtract. [17:42] >> Yeah. So the thing is you want to make [17:44] sure that each time the amount of [17:46] learning it has to do is as bounded and [17:48] small as possible. If you give it some [17:51] starting point and an ending point and [17:52] keep moving this ending point, the gap [17:55] is still really high for the first [17:56] several of those starting points. That's [17:59] the problem. [18:01] Okay. So to come back to this, so we can [18:04] build a dinoising model. We can do this. [18:07] And now when you have once you have [18:08] built such a thing, you give it some [18:10] noisy thing and then it'll you know give [18:13] you a slightly less noisy version of it. [18:15] Okay, the resolution is going to go up [18:16] slightly if you do that. This of course [18:19] suggests the obvious way in which you [18:20] would use it which is that once you [18:22] train it we can solve this problem. [18:26] Okay. And how can we solve this problem? [18:29] So what you do is you start with pure [18:32] noise and then repeatedly dn noiseise [18:35] it. [18:37] Okay. You get that, you get that, and [18:39] then before you know it, Killian Kurt [18:41] has emerged from the fog, [18:43] right? 
It's pretty insane that it [18:46] actually works this idea. [18:52] So, so the model will generate a [18:54] sequence of less noisy images and the [18:56] final one you have is the answer. Okay. [18:59] Now there's a whole bunch of detail here [19:01] which I'm glossing over about okay how [19:05] many times must we run this loop to get [19:08] to a really good picture. The short [19:09] answer is you it initially it was like [19:12] you have to run it like a thousand [19:13] times. Each each each doising step was [19:16] like a baby step. You have to do it a [19:17] thousand times to get a really good [19:18] answer. Again research has been very [19:21] active in the area continues to be very [19:22] active. Now you can I think do it like [19:24] 50 steps or 100 steps. Right? But [19:26] diffusion models like this uh they tend [19:29] to take more time than a large language [19:31] model which is why if you give a prompt [19:33] to one of these models like midjourney [19:35] it will take some time for it to come [19:36] back with an image and and that the [19:38] reason for the delay is because it's [19:40] going through this you know incremental [19:42] dnoising loop. Yeah. [19:45] >> Uh from this we understand that each uh [19:47] the final noise output sample would be [19:49] very particular to each image in the [19:51] matrix. So I mean like say two if you [19:55] take two images the final we are getting [19:57] is the image in the after when we start [19:59] voicing it and the final output we get [20:02] is the noise sample will be too distinct [20:04] for each of them right [20:05] >> correct [20:05] >> so but when we are picking up image to [20:08] generate a diffusion model and we work [20:10] backwards we may not have the exact [20:12] thing available to us what was there [20:14] initially [20:15] >> no no the thing is we don't want to [20:17] necessarily regenerate images that were [20:18] in the training data right that's kind [20:21] of pointless we want to geneneral new [20:22] images [20:24] and for new images we just use start use [20:26] noise as a starting point [20:29] you know the fact that Killian code was [20:31] here and then the fully noised version [20:32] of Kian code is here that is used for [20:35] training and once you use it for [20:36] training you don't need it anymore [20:37] because you're not trying to recreate [20:39] Killian code again you want to create [20:41] new images which belong to the category [20:43] of stately college buildings and for [20:45] that all you you just grab noise send it [20:48] in it gives you a stately college [20:49] building end of [20:53] And because noise by definition is [20:55] different each time you pick it, it's [20:57] going to come up with a different [20:59] stately college building. [21:01] So the way I think about it is that uh [21:07] all right so you can think of it as this [21:09] right this is [21:12] so when you sample think of this as like [21:14] the noise distribution [21:17] each time you sample right there's a [21:20] little point you pick from here another [21:22] time you sample maybe you get a point [21:24] here right each is just you know nice [21:26] distribution that's it what actually [21:29] these things are doing is they are [21:31] mapping mapping it [21:34] to the distribution of stately college [21:35] buildings which might be in a you know [21:38] strange crazy distribution. 
So each time you sample, you start from a point here and land at a point there, and when you start from a different point here, you land at a different point there. What you have done is this: when you took the training data, you created points here, found the matching noise there, and flipped it for training, as we have seen. Once you're done, you have a mechanism for transforming any entry in this noise distribution into an entry in this distribution of images. So it's a way to transform one distribution into another distribution. That's what's going on.

All right — there was a question. Yeah?

>> I understand going from noise to the image and back, and how the training works. My question is: in some of these models today, when you give it noise to generate an image, it could generate, for example, a human with four fingers, stuff like that. Is it that the training data is not quite enough, or not robust enough, to generate that kind of detail? Can you talk through that?

>> Yeah. So fundamentally, it does not understand the notion of fingers and things like that, because we haven't injected any domain knowledge into this whole process. We're not saying, "hey, you need to generate a human body, and here are the semantics of what a human body is — it's got five fingers and all the anatomical structure." We're not giving it anything like that; we're literally giving it pixel values, a bunch of pictures. So everything you're seeing comes out of that very blind statistical transformation process. You'd expect it will probably get macro-level details right, because there are so many right answers. Imagine it's creating the roof of a house: there could be all kinds of variations in the roof and you would still think it's the roof of a house, because there are many possible right answers. But when it comes to five fingers, there are not many possible right answers, which is why you notice the error very quickly. As far as the model is concerned, it doesn't know; it's just producing a statistically plausible sample from that distribution, and since we haven't forced it to obey constraints like five fingers, it's not going to. It's an unconstrained process. Now, over time these things have gotten better and better, and that's partly because the data has gotten better, to your point, but our approach to doing these things is also getting better: there are lots of ways to steer and control the model so it behaves the right way, and that is part of what's happening as well.
When we talk about how you actually give a text prompt and have it build the image for that particular prompt, we'll revisit this question. Okay, there were more questions. Yeah?

>> Is there some randomness in the model itself? If you gave it the same noise image twice, would it actually produce the same final image?

>> Yeah, there is randomness in the process as well.

>> In the process — exactly.

>> That's a really good point, but now I'm afraid to open my laptop — I'm on the iPad, one second. All right. So what's going on is this: I talked about how we are transforming from here to some crazy distribution over there. Let's say this is the starting point, your noise input. What actually happens is you go over here, you take this point, and then you do a small sample next to it: you use this point as the mean value and sample around it, and that is what actually gets shown in the user interface. That's where the randomness comes in.

Okay. So, back to this — was there another question somewhere? Yeah?

>> I was wondering about training on a clear picture to go to a noisy image. We pull from a random sample, but it's probably pseudo-random. Is it learning relationships that depend on that pseudo-randomness, so that going from a noisy image back to a clean image depends on it — or does it matter at all?

>> I see. So if I understand your question, you're saying it's pseudo-random, not actually random, and therefore there is some signal in the supposedly random generation — is the model glomming onto that signal? Theoretically it's probably possible, but in practice it really doesn't matter, because pseudo-random is good enough for our purposes. In practice you will see it's not an issue.

Okay. Oh yeah, go ahead.

>> A quick question: when you're doing text-to-text, you're tokenizing the input. But here you somehow have to identify that this is Killian Court, or a stately home, while you're just going from pixel image to pixel image. Where does the tagging or tokenization of things like columns or fingernails come in?

>> It doesn't. It's learning everything from the pixel values.

>> Everything?

>> Yeah. And this is what I was getting at when Ike asked the question about four fingers versus five fingers: it has no idea of fingers. It has zero knowledge about any of these things. All it's seeing is a bunch of photographs.

>> Okay. So when you type in, say, "I want a hand with green..."

>> Oh, I see.
So we haven't yet come to the [27:42] stage of okay, how do you actually steer [27:44] this image using your text prompt? It's [27:47] coming [27:48] >> right now. All we're saying is that [27:49] look, I'm going to give you a bunch of [27:51] uh photographs of a particular kind of [27:52] thing, stately college buildings and I [27:55] want to have a model which at the end of [27:56] the day I just poke it. Every time I [27:58] poke it, it gives me a stately college [27:59] building. That's it. Now I'm going to [28:01] actually start giving it text and saying [28:02] okay build the you know create the thing [28:04] I'm just telling you about that's coming [28:06] and that's sort of some additional magic [28:08] is going on to get that done. U okay so [28:12] this is what we have u and this is [28:14] called a diffusion model. Okay. And this [28:16] is the original paper that figured this [28:18] out. Um, and [28:21] the the process of actually creating [28:24] taking an image and creating noisy [28:26] versions of it to create a training data [28:28] is called the forward process. And then [28:30] what we did in reverse is called the [28:32] reverse process. Uh, check out the [28:34] paper. It's actually really well [28:35] written. Uh, and I recommend it. Now, in [28:38] practice, uh, some other researchers [28:40] came along shortly after this and made a [28:42] small improvement. turns out to be [28:45] actually a big improvement in practice [28:46] in terms of improving the quality of [28:48] what's being produced. And so what they [28:50] said is hey instead of training the [28:52] model to predict the less noisy version [28:53] of the image we actually ask it to [28:55] predict just the noise [28:58] in the input and then we will just [29:01] simply subtract the noise from the input [29:03] to get the image. So instead of saying [29:05] here is an X X is an image Y is the [29:08] noisy image we actually tell it here is [29:10] an image here is the noise that we added [29:12] to X to get the the noisy version and [29:14] then just predict the noise for me and [29:16] then once I get it I just do X minus [29:17] noise and I get the less noisy version [29:19] of the image. Okay, this feels [29:21] arithmetically equivalent but in [29:24] practice it ends up generating much [29:26] higher quality images and there's some [29:28] very interesting theory as to why that [29:29] works and so on and so forth and you can [29:31] read this paper if you're interested. [29:33] Okay, so if you actually look at what's [29:34] going on in most diffusion models today, [29:36] they're basically using an approach like [29:38] this. They're actually predicting each [29:40] time they predict noise and take it [29:41] away, subtract it. So iterative [29:43] subtracting of predicted noise. [29:47] That's what's going on. So all right, so [29:49] that's what we have. U now at this point [29:52] you may be wondering, okay, so far in [29:55] the semester, uh we have actually [29:57] learned how to take an image and then [29:59] classify it into one of you know 20 [30:01] things, 10 things, whatever. We also [30:03] taken text and figured out what to do [30:05] things with it. We haven't yet talked [30:07] about how do you actually take an image [30:09] and the how can we get the output also [30:11] to be another image. We haven't done [30:13] that yet. Okay. So we have actually not [30:16] done image to image. How do you actually [30:18] build a neural network to do image to [30:20] images? 
All right, so that's what we have. Now, at this point you may be wondering: so far in the semester we have learned how to take an image and classify it into one of 10 or 20 categories, and we have taken text and figured out what to do with it, but we haven't yet talked about how to take an image as the input and get another image as the output. We haven't done image-to-image. How do you build a neural network that goes from an image to an image? In the interest of time we're not going to get into it massively, but I want to give you a quick idea of how it works.

The dominant architecture for taking an image as input and producing an image as output is called the U-Net, and that's the architecture you see here. Fundamentally, there's a left half to the network and a right half — hence the U. The left half is a good old convolutional neural network of the kind we know and love and are very familiar with: you take an input image, run it through a bunch of convolutional blocks, do some max pooling, and keep going. At some point it becomes smaller and smaller, and you get something we're very familiar with: the big image with three channels gets smaller and smaller while the number of channels gets wider and wider; it becomes much smaller but much deeper, like a 3D volume. We have seen that again and again. So the left part is just a good old convolutional network with pooling layers.

Then you come to the middle, and from that point on, we take whatever is there and essentially reverse the process: we go from the small things that are really deep to slightly bigger things that are a little less deep, and so on, until we get the original size back. We do that using an inverse of the convolution layer called an up-convolution or deconvolution layer; it's also called Conv2DTranspose. You can check out section 9.2 in the textbook to understand how it's done. It's a very similar idea, and I'm not going to get into the details here, but you essentially do an inverse of a convolutional operation to bring the size back up, and you do it gradually until the output matches the size of the input that came in. So the image gets smaller and smaller into a small, deep volume, and then you blow it back up again to get an image back. That is the U-Net.

Now, there's one very important thing that happens in the U-Net, which is these connections you see here. At every step, as you come back up the right half, you take whatever was at the mirror-image stage on the left side, as the input was being processed, and attach it on this side as well. Remember, I talked about the notion of a residual connection many classes ago, where I said that when an input goes through the layers of a neural network — say you're in the tenth layer — you're only seeing what the ninth layer has produced for you. That's all you're working with.
But wouldn't it be nice if the tenth layer actually had access to the eighth layer, the seventh layer, the sixth layer, the fifth layer — heck, why not the input itself? The more information it has, the better it can probably do with the input it's given. Why restrict it to only the output of the previous layer? Why not give it everything that came before it? Now, giving it everything is too much, but we can be selective in what we give it. So what these folks decided, I'm sure after much experimentation, is that if they attach whatever was coming out of this layer over here to this layer over there, before it goes through to the output, it really helps. Similarly, this one gets attached, and so on. And it kind of makes sense: why force it to figure everything out just from the one thing that came in? Let's also give it a little from here and a little from there. These residual connections are a huge building block for why these things work as well as they do. In general, giving a layer as much information as you can is a good idea, but you can't go nuts, because then you have many more parameters and all kinds of things happen; there's a balance to strike, and this was the balance struck by these researchers. This architecture was originally invented for medical image segmentation, but it's heavily used for everything now. It's a really powerful architecture. Questions?

>> Can we have an example of the scenarios where we use this kind of thing?

>> Anytime you have image-to-image.

>> Like what kind of conversion? What use cases?

>> Let's say you want to take a black-and-white image and colorize it: boom, U-Net. You want to take an image and make a higher-resolution version of it: U-Net. You want to take an image and, for every pixel, classify it into one of ten categories: U-Net. Anytime you want the shape of the output to be basically the same as the shape of the input, but holding different data, you use something like this.
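A minimal Keras sketch of such a U-shaped network, with the shrinking convolutional left half, the Conv2DTranspose right half, and the skip connections that concatenate the mirror-image feature maps (layer counts and sizes are illustrative, not the architecture from the original paper):

    from tensorflow import keras
    from tensorflow.keras import layers

    def tiny_unet(h=128, w=128, channels=3):
        inp = keras.Input(shape=(h, w, channels))

        # Left half: ordinary convolutions + pooling (image shrinks, gets deeper).
        c1 = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
        p1 = layers.MaxPooling2D()(c1)
        c2 = layers.Conv2D(64, 3, padding="same", activation="relu")(p1)
        p2 = layers.MaxPooling2D()(c2)

        # Bottom of the U.
        b = layers.Conv2D(128, 3, padding="same", activation="relu")(p2)

        # Right half: up-convolutions blow the image back up, and each step
        # concatenates the mirror-image feature map from the left half.
        u2 = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(b)
        u2 = layers.Concatenate()([u2, c2])   # skip connection
        u1 = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(u2)
        u1 = layers.Concatenate()([u1, c1])   # skip connection

        out = layers.Conv2D(channels, 3, padding="same", activation="sigmoid")(u1)
        return keras.Model(inp, out)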
>> But this logic of having access to all the previous iterations —

>> Not iterations: all the previous layers, the outputs of the previous layers.

>> Layers. But would this also help clean up and give better categorization? Does it always have to be image-to-image?

>> No, no. In fact, ResNet is the one that pioneered the idea of the residual connection, so we use it in ResNet. We also use it in the transformer stack: if you remember, the input goes through the self-attention layer, comes out the other end, and then we add the input back to it before sending it on through the rest of the block. So you will see this residual connection sitting in two different places in a single transformer block. It's extremely heavily used. There's also something called the wide-and-deep network, if I remember correctly, and DenseNet, which use the same trick. In fact, when you're working with structured data — good old linear regression, say — and you've looked at your data and come up with all kinds of clever features, like price per square foot, after a bunch of feature engineering, you should take your old features and your new features and send both in. Why send in only the new stuff you've concocted? Why not send everything in? That's the idea.

All right, so let's come back here. Now we have seen how to generate a good image. Let's figure out how to steer it, or condition it, with a text prompt, because that's the holy grail. Here's some intuition. We want to take the text prompt into account and generate the image. Now, imagine we had a rough image that corresponds to the text prompt. Just imagine: the text prompt is "cute Labrador retriever," and you happen to have a very noisy image of a Labrador retriever handy. Well, now you're in good shape, because you just feed that in and your system will denoise it for you, and you get a better image. That's pretty easy. But obviously, in reality, you don't have a rough image; you're trying to create one of those things in the first place.

So: what if we had an embedding for the prompt that's close to the embeddings of all the images that correspond to the prompt? Take a prompt, and imagine all the images in the universe that correspond to it. Now further imagine — because everything is a vector, everything is an embedding in our world — that the text prompt has an embedding, every image has an embedding, and we have somehow calculated these embeddings so that the text prompt's embedding sits smack in the middle of where all the image embeddings are. We will get to how we actually do this in just a moment, but conceptually, imagine we could calculate embeddings for text and embeddings for images so they all live in the same space. Then, if we feed this text embedding to a denoising model — because it sits in the same space as all the image embeddings it corresponds to — maybe our model can just denoise that embedding and give you what you want. Since this embedding is already close to the embeddings of the things we want to generate, maybe you'll just get it done.
So ultimately we want to generate an image, and if we had an embedding for that image, we could generate the image from the embedding — and we get there using the text. We go from the text to an embedding that happens to live in the same space as the embeddings of all the images we care about, and then from that embedding we go to the final image. This is a bunch of me talking and handwaving; it will all become very clear, but that's the rough intuition.

So what we'll do now is describe an approach to calculate an embedding for any piece of text that is close to the embeddings of the images that correspond to that piece of text. This is the problem we're going to solve: there's a piece of text, and conceptually there are a whole bunch of images that that text describes, and we want to create embeddings so that the text's embedding is close to the embeddings of all those images. It feels almost impossible that you could do something like this, but there's a very clever idea that OpenAI came up with that tells you how.

Here's what we're going to do. Let's say we have an image and a caption. We need some way to take that piece of text, run it through some network, and create a nice embedding from it; similarly, we want to take the image, run it through some network, and create an embedding from it. First question: how can we compute an embedding from a piece of text? You know the answer: run it through a transformer. Piece of cake — we know how to do that. In particular, you can do something like BERT. And for an image encoder, you run the image through something like ResNet; the penultimate layer, one of the final layers, is a very good representation of that image, and you get another embedding. So, using building blocks we already know, we can create embeddings from these things very quickly.

But if you just take a piece of text and run it through BERT, and take an image and run it through ResNet, you're going to get some embeddings — and why on earth should they be related? They were not trained together, so there's no basis for them to be related. They would just be two embeddings. Maybe they're kind of similar, maybe they're not; there's no reason to expect that they will be. They're just two embeddings.

So once we have these two encoders, we need to make sure the embeddings that come out of them satisfy two very important requirements.
First, if you have an image and a caption that describes that image, we want the embeddings that come out of these two boxes to be as close to each other as possible. Given an image and a caption that describes it, that's the connection: they have to be close to each other. And conversely, if you have an image and a caption that's totally irrelevant to it — "a train rounding a bend with beautiful fall foliage all around," clearly irrelevant here — those embeddings should be far apart. For this to really make sense, pairs of related things should be close together and irrelevant things should be far apart. If we can find embeddings that satisfy these two criteria, maybe we're in the game. This ensures that the text embedding and the image embedding refer to the same underlying concept — these requirements enforce that — so the embedding for any text prompt is close to the embeddings of all the images that correspond to that prompt.

So the question is: how do we do this? First of all, how can we tell how close two embeddings are? You know the answer — what is it? Correct: cosine similarity. We use the cosine similarity of the embeddings, so we know how to measure closeness. The question is how to compute embeddings that satisfy the two requirements, and OpenAI built a very famous model called CLIP to solve this problem. It stands for Contrastive Language-Image Pre-training, and it forms the basis for a whole bunch of models that have sprung up since, such as BLIP and BLIP-2, but this is the fundamental idea.

Okay, so this is how CLIP works. They took a 12-layer, 8-head transformer causal encoder stack as the text encoder — you understand what that is now. We send any piece of text through it, take the next-word-prediction embedding, and that's the embedding we're going to use. And they took ResNet-50 and made it the image encoder: they chopped off the top, and whatever was left is the image encoder. Then they initialized both of these with random weights, and they grab a batch of image-caption pairs. In this example, let's say we have these three images, and I have captions to go with them. And this is the key step: they run the images through the image encoder and the captions through the text encoder and get the embeddings — it's a forward pass; you send each through its network and get the two embeddings. Then, with these embeddings, they calculate the cosine similarity for every image-caption pair.
So imagine something like this: you have these three captions, you have these three images, those are the embeddings, and then you calculate the cosine similarity for every one of those combinations. It took me five or ten minutes to do this PowerPoint — you're welcome; getting this comma to line up was a real pain in the neck. All right, so we have this. Now, we want these scores to be as high as possible — the scores on the diagonal, because those are the ones for the matching picture and caption. Those are the scores for the matching pairs of embeddings, and we want them as high as possible. So we want to maximize the sum of the green cells, the diagonal. If you want to write it as a loss function — and a loss function is always a minimization — we say: minimize the negative sum of the green cells.

Okay, so the question is: would this loss function do the trick? It seems reasonable; you want to make sure the related things are really close together.

>> If that were the only part of the loss function, wouldn't it just squish everything to the same spot in the space?

>> Correct. What it's going to do is basically ignore the input. The optimizer can simply ignore the input and make all the embeddings the same — for example, map everything to the same constant vector. Then we have a perfect cosine similarity for everything: for any pair of image and caption, the cosine similarity is going to be one. Perfect, right? So clearly that's not enough. This, by the way, is called model collapse. To prevent it from doing that, we need to do one more thing to the loss function. Any guesses? Yeah?

>> Make the images that aren't related not have a high cosine similarity.

>> Exactly right. We want the scores of the red cells to be as small as possible: the green stuff as large as possible and the red stuff as small as possible. Together, that gets the job done. So we want to maximize the sum of the green cells and minimize the sum of the red cells, and the equivalent loss function is: minimize the sum of the red cells plus the negative sum of the green cells. That's it. So all CLIP does is grab a batch of image-caption pairs, run them through the two networks, calculate the embeddings, calculate this sum, and that is your loss; then it backpropagates through the networks. Batch after batch after batch, a whole bunch of times.
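Here is a sketch of that loss on one batch, written in the lecture's simplified sum-of-cells form (the actual CLIP paper uses a temperature-scaled softmax cross-entropy over the same similarity matrix, but the push-together / push-apart idea is the same). The embedding tensors are assumed to come from the two encoders:

    import tensorflow as tf

    def clip_style_loss(text_emb, image_emb):
        """text_emb, image_emb: (batch, dim) embeddings for matching pairs,
        where row k of each corresponds to the k-th image-caption pair."""
        # L2-normalize so the dot product is cosine similarity.
        t = tf.math.l2_normalize(text_emb, axis=-1)
        i = tf.math.l2_normalize(image_emb, axis=-1)
        sim = tf.matmul(t, i, transpose_b=True)        # (batch, batch) cosine similarities

        diag = tf.linalg.diag_part(sim)                # green cells: matching pairs
        off_diag = tf.reduce_sum(sim) - tf.reduce_sum(diag)   # red cells: mismatched pairs
        return off_diag - tf.reduce_sum(diag)          # minimize red, maximize green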
And OpenAI did this — here is the official figure from the paper, which is worth reading, by the way: text comes in through the text encoder, you get these embedding vectors, images go through the image encoder, and then the diagonal is maximized and the off-diagonals are minimized. They did it with 400 million image-caption pairs scraped from the internet. 400 million.

By the way, those of you who work in this space may know this well, but there's one very easy way to get a caption for an image. We see the images, but where do you think the captions come from? They obviously didn't ask people to manually label each image with a caption. Where did they get them?

>> Google search.

>> Google search can help, but why does Google search actually find a caption? Google search isn't creating the caption.

>> Take it from the alt text on the images.

>> Correct: alt text. A lot of people, for accessibility reasons, put alt text on the images they publish on the web, and that's what gets used. And the alt text often ends up being a more verbose description of the image than a typical caption, which tends to be much briefer. For us, the more verbose and longer the better, because there's more for the model to learn from. So that's how they built CLIP.

And now we can use CLIP's text encoder by itself: we can send in any text and get an embedding that is close to the embedding of any image described by that text.

Now, by the way, CLIP can also be used for zero-shot image classification. What I mean by zero-shot image classification — I'll walk through the picture in just a second — is this. Typically, when you want to build an image classifier, you get a whole bunch of training data of images and their labels, and then you train: maybe you take something like ResNet, chop off the top, attach your own output head, and train, train, train. Boom, you have a classifier.
But the [49:58] only problem with that is, let's say that— [50:00] so today, for example, you had [50:02] five classes in your problem, and [50:04] tomorrow somebody comes along and says, [50:06] oh, actually we have a sixth category. [50:09] Right, what do you do then? Well, you have [50:10] to go back to the drawing board and [50:11] retrain the whole thing with six labels [50:13] now, not five, because your problem has [50:15] changed. Wouldn't it be great if you had a [50:17] classifier where you just come to it and [50:20] say: here's an image, and here are the six [50:22] possible labels I want you to pick from— [50:23] pick one for me. And you want to be able [50:26] to give it a different set of labels [50:27] each time, and it'll just use the [50:30] labels you're giving it and the image [50:32] and figure out which label [50:33] corresponds to the image you just fed it. [50:35] That would be an insanely flexible image [50:38] classification system, right? And that's [50:40] what I mean by zero-shot image [50:42] classification, and you can use CLIP to [50:44] do zero-shot image classification. [50:47] Now, how you do it is actually in the [50:50] picture, though not very clearly. Does [50:52] anyone want to guess? [50:58] How can you use CLIP to build an [51:01] infinitely flexible image classifier? [51:12] >> Um, I mean the text input was [51:14] trained like BERT, right? So in the same way [51:16] BERT can handle words it's never seen before, [51:19] does it essentially do that? [51:21] >> Sorry, say that again—the second part. [51:22] You're saying it sees a [51:24] text input with something it's never [51:25] seen before, right? Yeah. [51:26] >> Okay. So, in the BERT model, which is [51:28] where it came from—in the text [51:30] encoding in the BERT model—I think we [51:32] talked about how when it sees a word it [51:35] doesn't know, that it's never seen [51:36] before, it can use the context words [51:39] around it to try to— [51:41] >> Right. Right. But here, just to [51:43] be clear, I want you to use the CLIP that [51:46] we just built, right? And assume CLIP [51:49] knows all the words, because [51:51] it's been trained on a big vocabulary. [51:53] You can give it any text you want. It'll [51:54] create an embedding from it. That's the [51:57] key capability. [52:02] >> So it creates a text embedding for— [52:06] >> Yeah. [52:06] >> —and then one for your image. [52:11] So comparing similarity scores between [52:14] the two: the image is complete but the [52:15] text is not complete. There'll be [52:17] missing pieces, and then it makes some [52:18] prediction using this. [52:21] >> Why is there a missing piece in the [52:22] text? [52:24] >> Because, um, the text [52:28] does not contain the class. Um, [52:31] but for the image, the way it [52:34] was trained, it was trained with [52:36] pairs, with the class included. [52:38] >> Right, but we actually know the class now, [52:40] because the use case is that I come [52:42] to you with an image and I say: here are [52:45] the seven possible labels for this image, [52:48] and each label is a piece of text.
[52:51] So you actually have seven [52:53] pieces of text and an image, and all I [52:55] want CLIP to do is to tell me, okay, the [52:58] fourth label is the right [53:00] one for this image. [53:03] But you're on the right track. [53:08] Once you see how it's done, you'll be [53:09] like, yeah, of course. [53:13] >> I might not be understanding something, but [53:15] wouldn't you just pick the [53:16] text embedding that's [53:18] the closest to the [53:20] image embedding? [53:20] >> Correct. You're not [53:22] missing anything. That's the right [53:23] answer. Well done. [53:26] Come on, people. Can you applaud our [53:27] fellow here? [applause] [53:30] You folks are hard to impress. [53:32] That's exactly what we do. So here, [53:38] the key [53:40] thing to keep in your head is that [53:42] a label is just text— [53:45] dog, cat, right? It's just text. So you [53:47] can imagine taking each label, [53:50] which in this case is plane, car, dog, [53:52] whatever, and for each one of them you create [53:54] an embedding—you get t1 through tn [53:57] if you have n labels. For the image you [53:59] just have one embedding, i, and then you [54:01] just calculate the [54:03] cosine similarity, and whichever is the [54:04] highest number, you say, okay, it's a dog. [54:06] That's it. [54:09] Just imagine the level of [54:11] flexibility here. [54:15] So that's a side use of CLIP, unrelated [54:18] to diffusion models, but I just [54:20] thought it's really clever, so I wanted [54:21] to share that. Okay, good.
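To make that concrete, here is one way to run this zero-shot trick with the publicly released CLIP checkpoint on the Hugging Face hub; the label list and the image path are placeholders you'd swap for your own.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# openai/clip-vit-base-patch32 is the public CLIP checkpoint; "my_image.jpg"
# and the labels below are just illustrative.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a plane", "a photo of a car", "a photo of a dog", "a photo of a cat"]
image = Image.open("my_image.jpg")

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the (temperature-scaled) similarity between the image
# embedding and each label embedding; the largest one is the prediction.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
print(labels[probs.argmax().item()])
```

Change the label list and the same model classifies against a completely different set of categories, with no retraining.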
Now let's see [54:23] how we can actually use this entire [54:25] capability to solve the original [54:27] problem we set out to solve, which is: can [54:29] we steer the diffusion model to create [54:31] an image based on a particular prompt we [54:33] give it? Um, so now, remember, if you go [54:37] back to how we did it, we created all [54:39] these training pairs of x and y based on [54:41] noising the image: x is [54:44] the image, y is the less noisy version of the [54:46] image. So what we can simply do is [54:51] change the input so it [54:53] becomes the image plus the CLIP text [54:56] embedding of the caption for that image. [54:59] So you have an image and you have a [55:00] caption. You take the caption, run it [55:02] through CLIP, you get an embedding. By [55:05] definition, that embedding [55:07] lives in the same space as all the [55:09] images that correspond to that caption. [55:13] Right? So you just [55:15] concatenate the CLIP [55:18] embedding of the caption along with the [55:20] image, and you make that the new input. [55:22] Now y continues to be the less noisy [55:24] version of the image, or, as we saw [55:26] earlier, it could be just the noise [55:27] component of the image. Okay, this is [55:30] the new x–y pair that we have. And so now [55:34] you send the CLIP [55:36] embedding together with the [55:39] noisy version of the image through the model, and you keep [55:41] on training it for a while. Once your [55:43] model is trained, when you want to [55:44] use it for inference on a new [55:46] prompt, you just give it, you know, [55:49] "Killian Court at MIT during the springtime," [55:51] along with a bunch of noise; it goes in and the model [55:55] starts denoising it. But because this [55:57] embedding, thanks to CLIP, [56:00] lives in the same space as all the Killian [56:02] Court images, if you keep [56:05] on doing it for a while, at [56:07] some point you'll get Killian Court. [56:11] That's how they do it. That's how they [56:12] steer the image. It's a two-step [56:15] process. You create all these CLIP [56:16] embeddings—and CLIP was a [56:19] breakthrough, in my opinion, because [56:21] it was one of the, maybe the first [56:22] example—I don't know if it's the very [56:24] first, but one of the early examples—of [56:26] saying: we have different kinds of data. [56:28] We have images, we have captions, we [56:30] have text. How do we create embeddings [56:32] for every one of these very different [56:34] data types that all happen to live in [56:36] the same space, the same concept space? [56:38] That was the key idea. And if you look [56:40] at the modern multimodal large language [56:42] models, they are all based on the same [56:44] exact idea. [56:46] So it's a very powerful approach.
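As a minimal sketch of that new training pair—assuming a standard DDPM-style forward noising process given by cumulative noise-schedule terms—the construction looks something like this (the function and argument names are illustrative, not any particular library's API):

```python
import torch

def make_conditioned_pair(image, text_emb, t, alphas_cumprod):
    """Build one (input, target) pair for a text-conditioned diffusion model.
    image: clean training image (C, H, W); text_emb: CLIP embedding of its caption;
    t: integer timestep; alphas_cumprod: 1-D tensor of cumulative noise-schedule terms."""
    noise = torch.randn_like(image)                                    # Gaussian noise, as the math assumes
    a_bar = alphas_cumprod[t]
    noisy_image = a_bar.sqrt() * image + (1 - a_bar).sqrt() * noise    # forward noising step
    x = (noisy_image, text_emb)   # the model now sees the noisy image *and* the caption embedding
    y = noise                     # target: the noise component, the formulation from earlier
    return x, y
```

At inference time you run the reverse: start from pure noise, feed in the CLIP embedding of the new prompt, and repeatedly subtract the predicted noise.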
>> Yeah. Now I understand this for images, [56:51] but for video generation models like [56:54] Sora, do they have some sort of [56:56] underlying physics structure, or do they [56:58] learn the physical representations? [57:00] >> There's a lot of debate on the internet [57:02] about this stuff. Um, they haven't [57:04] published the full [57:05] technical report yet, so we don't know [57:07] for sure, but the consensus seems to be [57:09] no, they are not using a physics [57:11] engine. What they have done—and again, [57:14] this may be wrong; once the report comes [57:15] out we'll know for sure—but what [57:17] people are saying, computer vision [57:19] experts, is that it has been trained [57:22] on a lot of video game data, [57:25] along with actual videos and so on, [57:28] and the corpus of training data is [57:30] so massive that it has basically learned [57:32] to mimic certain physics aspects [57:35] just as a side effect. Much like LLMs: you [57:38] train them on a large amount of text [57:39] data and they begin to do things [57:41] which you didn't anticipate that they'd [57:43] do, right? So for example, I read this—I [57:46] thought it was a really great example—that [57:48] what is surprising about large language [57:50] models is not that you train [57:52] them on a bunch of high school math [57:54] problems and then you give them a new high [57:56] school math problem and they can actually [57:57] solve it; that's not surprising. You give [57:59] them a whole bunch of high school math [58:00] problems in English, then you ask them to [58:03] read a bunch of French literature, and [58:05] then you give them French high school math and [58:07] they'll solve it. That is the new [58:08] news, right? So similarly here, I think [58:12] the expectation is that it's not [58:13] actually using a physics engine under [58:15] the hood. It may have used a physics [58:16] engine to actually come up with the training [58:17] videos and renderings, but there are no [58:20] physics constraints in the model itself. [58:22] It just comes out of the training [58:23] process. That's the current view. Once [58:26] the technical report comes out, we'll [58:27] know for sure what they actually did. [58:30] [58:33] >> So, a quick question about Stability. It's [58:36] claiming to be a little bit more real- [58:37] time in their image generation. Um, so— [58:40] >> You mean Stable Diffusion? [58:41] >> Yeah, Stable Diffusion. So, are they [58:43] jumping through the noise more quickly, [58:45] or are they kind of pre-prompting [58:46] it, or is there some kind of trick? [58:47] >> Very good question, and there is a very [58:48] key trick. It's coming. [58:50] >> Um, [58:52] so here, the noise in the example is from a [58:55] normal distribution. However, if we [58:57] changed the noise distribution, would it [59:00] change the result? [59:00] >> Oh, you mean if you [59:02] change it to, like, a Poisson or some other [59:04] distribution? It'll definitely change [59:05] the results, because if you look at the [59:08] underlying math of why this works, it [59:10] heavily depends on the Gaussian [59:11] assumption. [59:13] Yeah. Um, there was another question [59:15] somewhere here. [59:18] >> Um, you may not know the answer because [59:20] the technical report isn't out, but could it [59:21] be, in terms of video generation, sort of [59:23] analogous to going from [59:26] one noisy image to another? Like you're [59:28] almost doing a series of still images [59:30] and learning how to— [59:31] >> I think that is how people are fairly sure [59:33] it's done. So, basically, [59:35] think of the video as just a [59:36] series of frames, right? And each frame [59:39] is an image, and there is a sequentiality [59:41] to it. Um, which is where the [59:43] transformer stack will come in, because [59:44] it handles sequentiality. So, in general, [59:47] video stuff typically operates frame [59:50] by frame, which is just an image. So, [59:53] that is definitely there. What we don't [59:54] know is if they also used some [59:57] understanding of the fact that, for [59:59] example, if an object is dropped it [01:00:02] has to fall to the earth at a certain [01:00:04] rate, or if an object goes behind another [01:00:06] object you can't see that object anymore, [01:00:08] right? Things like that, which we take for [01:00:10] granted. Um, the question is, are they [01:00:12] using that, and the consensus seems to be, [01:00:15] in the absence of an actual technical [01:00:17] report, that no, they're not doing it, [01:00:18] because there are lots of examples on [01:00:20] Twitter where people will show a Sora [01:00:22] video in which it's not obeying the laws [01:00:24] of physics. So you take, like, a beach [01:00:26] chair and put it in the sand. You [01:00:28] see the sand come through the base of [01:00:30] the beach chair, right? Or you take an [01:00:32] object and put it behind another object. You [01:00:33] can still see the object even though the [01:00:35] object in front is opaque. So you see [01:00:37] some evidence that no, it's not [01:00:38] obeying the laws of physics. What you're [01:00:39] seeing is just an amazing imitation—like drawing [01:00:46] fingers without knowing there have to be [01:00:47] only five fingers. [01:00:50] Um, [01:00:51] okay. All right. So let's keep going [01:00:55] now. Um, so there was another paper [01:00:58] afterwards—and this is the original [01:01:00] paper—which took that idea of the [01:01:02] diffusion model. And diffusion is [01:01:05] very slow, as Olivia, you pointed out. So [01:01:07] the question is, can we make it much [01:01:08] faster? Right? So, what they did—and I'm [01:01:11] not going to get into this whole thing [01:01:12] here; I just want to highlight a couple [01:01:14] of things. The first one is that, [01:01:18] first of all, notice that you see a U-Net [01:01:20] here. So they are using a U-Net, right, [01:01:23] to go from image to image.
[01:01:25] The second thing is that the CLIP [01:01:28] embedding of the text prompt is [01:01:30] basically woven in, meaning it's [01:01:32] incorporated into [01:01:34] the U-Net through an attention [01:01:36] mechanism, a transformer mechanism, and [01:01:38] you can see the Q-K-V business here, which [01:01:41] should be familiar at this point. So the CLIP embedding [01:01:43] is integrated into the transformer stack [01:01:45] directly as an input— [01:01:47] that's the second thing I want to point [01:01:48] out. And then thirdly— [01:01:50] and this is where the speed-up comes. So [01:01:52] what you do, instead of taking the [01:01:54] image, running it through the whole [01:01:56] network, and creating a slightly less [01:01:57] noisy version of the image, what you [01:01:59] do is you take the image, you run it [01:02:02] through an image encoder, you get an [01:02:03] embedding, and now you only work with the [01:02:05] embedding. You take the embedding and [01:02:07] create a slightly less noisy version of the [01:02:09] embedding, and keep on doing it. And these [01:02:11] embeddings are much smaller than images, [01:02:13] therefore they're much faster to process, [01:02:14] and once you've done it like a thousand [01:02:16] times, you get an almost pure, [01:02:18] noiseless version of the embedding. Now you [01:02:20] run it through an image decoder to get the image back. [01:02:24] So the idea here is that you [01:02:26] operate [01:02:29] in the latent space, meaning the [01:02:31] embedding space, and hence it's called a [01:02:32] latent diffusion model. So that's where [01:02:35] the speed-up comes from (there's a small sketch of this loop below), but research [01:02:36] continues to be very strong to make it [01:02:38] even faster, because for a lot of [01:02:40] consumer applications people are [01:02:41] obviously not going to wait around—I [01:02:43] mean, who wants to wait for 10 seconds, [01:02:44] right? So there's a lot of [01:02:46] pressure to make it even faster. [01:02:49] Um, [01:02:52] all right, so that's what we have. [01:02:53] Obviously, these [01:02:56] models are transforming everything, and, [01:02:58] by the way, this site here, lexica.art— [01:03:00] you can go check it out. It has [01:03:01] a whole bunch of very interesting images [01:03:03] and the prompts that created the images. So [01:03:06] if you're working in the space, it gives [01:03:07] you a lot of interesting ideas. But it's [01:03:09] not just for consumer fun [01:03:11] applications. You know, these models [01:03:13] are being used to actually—you know, [01:03:15] AlphaFold, if you'll recall: if you give [01:03:18] it an amino acid sequence, it can [01:03:19] actually create the 3D structure, right? [01:03:21] That's an example where [01:03:24] I don't think they use a diffusion [01:03:25] model, but you can imagine using a [01:03:27] diffusion model to create these [01:03:28] complicated objects. Meaning the objects [01:03:32] you create don't have to be images. [01:03:34] They can be arbitrarily complicated [01:03:36] things. As long as you have enough data [01:03:39] about such things to use for training, [01:03:41] and the notion of noising the input is [01:03:43] meaningful, you can create some very [01:03:45] interesting structures—you can create [01:03:47] 3D things and, you know, protein [01:03:49] structures—and there's a whole bunch of [01:03:51] very interesting applications in [01:03:52] biomedical sciences.
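Here is the small sketch of that latent-space denoising loop mentioned above. The U-Net, scheduler, and VAE decoder are hypothetical stand-ins rather than any specific library's objects—the point is just to show where the speed-up comes from: every step operates on a small latent, and pixels only appear at the very end.

```python
import torch

def latent_diffusion_sample(text_emb, unet, scheduler, vae_decoder,
                            latent_shape=(1, 4, 64, 64)):
    """Sketch of text-conditioned sampling in latent space.
    unet(latent, t, text_emb) -> predicted noise; scheduler.step(...) -> slightly
    less noisy latent; vae_decoder(latent) -> image. All three are stand-ins."""
    latent = torch.randn(latent_shape)                  # start from pure noise in the latent space
    for t in scheduler.timesteps:                       # e.g. 50 denoising steps
        noise_pred = unet(latent, t, text_emb)          # U-Net cross-attends to the prompt embedding
        latent = scheduler.step(noise_pred, t, latent)  # remove a little of the predicted noise
    return vae_decoder(latent)                          # decode the nearly noiseless latent into pixels
```

Because a 64×64×4 latent is far smaller than a 512×512×3 image, each denoising step is correspondingly cheaper.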
So this is [01:03:55] really just the tip of the iceberg, and [01:03:57] now there are [01:03:59] ways in which you can use diffusion [01:04:00] models to do large language [01:04:03] modeling as well. So there's a lot of [01:04:05] overlap and blending going on [01:04:07] in the space. So I'm going to do a [01:04:10] quick demo. Um, if you look at Hugging [01:04:11] Face, there is something called the [01:04:12] diffusers library, which, [01:04:15] as the name suggests, is a library for [01:04:17] a lot of diffusion models, [01:04:20] so let's take a quick look. [01:04:25] All right, so the diffusers [01:04:27] library has a whole bunch of diffusion [01:04:28] models. We're going to work with Stable [01:04:30] Diffusion, which is one of the [01:04:32] better-known models. So let's [01:04:34] install diffusers. [01:04:38] You will recall when I did [01:04:41] the quick lightning tour of the Hugging [01:04:42] Face ecosystem for language: Hugging [01:04:45] Face has a whole bunch of capabilities [01:04:48] built out of the box, and you use [01:04:50] this thing called the pipeline function [01:04:52] to very quickly use any model you want. [01:04:54] The same exact philosophy applies here. [01:04:56] You still use the pipeline. So I'm going [01:04:59] to import a bunch of stuff. [01:05:09] All right. So, oh, I see I have to do [01:05:11] this thing. Okay. [01:05:16] Great. [01:05:21] Okay. So, all right, here's what we have [01:05:24] here. You'll remember that [01:05:26] when we worked with text, [01:05:28] we would grab a pre-trained model and [01:05:30] then we'd actually run it through a [01:05:31] pipeline, and we can do all the inference [01:05:33] we want on it. The same exact philosophy [01:05:36] applies here. So, this is very [01:05:39] similar to what we did in lecture 8 for [01:05:41] NLP. What we're going to do is use [01:05:44] this command, the stable diffusion [01:05:46] pipeline from_pretrained, and we use [01:05:48] this version 1.4 Stable Diffusion model. [01:05:50] Um, so let's just create the pipeline. [01:05:56] And obviously we have used TensorFlow, [01:05:58] not PyTorch, here, but a lot of these [01:06:00] models unfortunately happen to be in [01:06:02] PyTorch, so knowing a little bit of PyTorch [01:06:05] is actually very helpful to be able [01:06:07] to work with these things. And what we're [01:06:09] doing here, while it's downloading: [01:06:12] we are using this fp16 [01:06:15] storage format for the model [01:06:18] weights, because it's going to be a [01:06:19] little smaller than using 32 bits, so [01:06:22] it'll download faster. So that's what's [01:06:24] happening here. All right, it's [01:06:25] downloaded fine. So now we just give it [01:06:28] a prompt, and this is actually one of the [01:06:29] original famous meme prompts: a [01:06:32] photograph of an astronaut riding a [01:06:34] horse. And so, once we have the [01:06:36] pipeline set up, I'll just set a seed for [01:06:38] reproducibility. And then literally I do [01:06:40] pipe of prompt, and you [01:06:44] can see here 50—so it's going [01:06:46] through 50 denoising steps. Okay. And [01:06:50] you come up with an astronaut riding a [01:06:52] horse. Okay. So that's that.
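For reference, the demo as described would look roughly like this with the diffusers library. The exact checkpoint name and the seed value here are assumptions—the lecture only says "version 1.4" and "a seed"—so treat this as a sketch rather than the notebook itself.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the v1.4 Stable Diffusion checkpoint with fp16 weights (smaller download).
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompt = "a photograph of an astronaut riding a horse"
generator = torch.Generator("cuda").manual_seed(42)   # seed for reproducibility (42 is arbitrary)

# 50 denoising steps, as in the demo; .images[0] is the generated PIL image.
image = pipe(prompt, num_inference_steps=50, generator=generator).images[0]
image.save("astronaut.png")
```

Using float16 halves the size of the model weights, which is why it downloads and runs a bit faster than full 32-bit precision.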
Um, you can [01:06:54] actually change the seed and you can [01:06:56] get a different— um, the seed basically [01:06:59] sets the random starting point [01:07:01] for the image. So therefore you would [01:07:03] expect a different astronaut. Yep. This [01:07:05] is an astronaut riding another horse. So, [01:07:08] um, I think people came up with these [01:07:09] kinds of fun examples because it's [01:07:11] guaranteed not to be in the training [01:07:12] data, right? So whatever the model is [01:07:15] doing, remember, it's not [01:07:16] regurgitating what it has already seen. [01:07:18] Uh, all right. Give me a prompt. [01:07:26] Prompts. Anyone? [01:07:29] Wow. [01:07:34] >> Okay, [01:07:38] that might be a— [01:07:40] All right. Riding a horse. [01:07:48] All right, [01:07:56] there are two of them, and clearly MIT [01:07:59] professors don't have— really. [01:08:03] Yeah, moving on. [laughter] [01:08:06] So, by the way, you should [01:08:10] spend some time with the diffusers [01:08:11] library. They have a bunch of tutorials [01:08:12] which are really interesting, because [01:08:14] this core capability of giving a prompt [01:08:16] and getting an image out can actually be [01:08:18] manipulated for all sorts of very [01:08:20] interesting use cases. So, for example, [01:08:22] there is this thing called negative [01:08:23] prompting. And the idea of negative [01:08:25] prompting is that you can give it two [01:08:28] prompts and say: create an image which [01:08:31] embodies the first prompt but not the [01:08:33] second prompt. Essentially, subtract the [01:08:36] second prompt from the first one. That's [01:08:37] called negative prompting. And you might [01:08:39] be wondering, like, what use is that? [01:08:41] There are lots of fun uses. So here, [01:08:45] the prompt is going to be "a [01:08:46] labrador in the style of Vermeer." Okay, [01:08:49] that's the first prompt. 50 steps. [01:08:53] Look at that. Amazing, right? But [01:08:57] maybe you don't care for the blue scarf. [01:09:00] So you basically give it a negative [01:09:02] prompt. And the negative [01:09:04] prompt is "blue," meaning remove everything [01:09:06] that's blue—I don't like this—but otherwise [01:09:09] keep the Labrador thing going. So you [01:09:11] run it. [01:09:16] Look at that. The blue is gone. Negative [01:09:18] prompting. Okay. Yeah. [01:09:22] >> If you change that from 50 to [01:09:26] a thousand, will it become less pixelated, [01:09:28] or will it eventually just keep going [01:09:30] and iterating? [01:09:31] >> No—typically, if you do more of these [01:09:32] steps, it gets better. The quality is [01:09:34] much better, because each step will [01:09:36] denoise it very slightly, so errors [01:09:38] won't accumulate and things like that. [01:09:40] And the diffusers library gives you lots [01:09:42] of controls for fiddling around with all [01:09:44] these things. Um, okay. So, that's what [01:09:47] we had. Uh, 949. [01:09:50] Okay. So, check out this tutorial if [01:09:52] you're curious about how this stuff [01:09:54] works.
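Reusing the `pipe` object from the earlier sketch, the negative-prompting calls would look roughly like this; the prompt strings are the ones from the demo, and `negative_prompt` is the diffusers argument for the "subtract this" prompt.

```python
prompt = "a labrador in the style of Vermeer"

# First pass: just the prompt, 50 denoising steps.
with_scarf = pipe(prompt, num_inference_steps=50).images[0]

# Second pass: same prompt, but steer the generation away from anything "blue".
no_blue = pipe(prompt, negative_prompt="blue",
               num_inference_steps=50).images[0]
```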
And I'm going to do one other [01:09:56] thing, because I didn't get to do it [01:09:58] earlier on. So we spent some time [01:10:01] with the Hugging Face hub, and I walked [01:10:03] you through a few use cases for text, [01:10:05] where you can take a text model and use [01:10:07] it for, you know, classification, things [01:10:10] like that, summarization, and so on and so [01:10:11] forth. You can do the same thing for [01:10:13] computer vision models. So if you have a [01:10:16] computer vision problem that just maps [01:10:17] to a standard computer vision task, [01:10:20] you can just use the Hugging Face hub as [01:10:21] well. So let me just show you very [01:10:25] quickly that the same kind of thing actually [01:10:27] works here. [01:10:32] All right. Okay. So, [01:10:35] let's say that you want to classify [01:10:37] something. You just import the pipeline [01:10:38] as before. [01:10:40] And once you import it, you can just [01:10:43] literally give it the standard task that [01:10:45] you care about, like image [01:10:46] classification. [01:10:48] And then you can start using it [01:10:50] right from that point on. [01:10:53] Okay. [01:10:59] All right. Okay. So now I'm going to [01:11:02] just get this image. It's a very [01:11:04] famous image. Um, right. And we're going [01:11:06] to ask it to classify this image. So we [01:11:08] just literally run it through the [01:11:09] pipeline. [01:11:12] And it says the most likely label, with 94% [01:11:15] probability, is an Egyptian cat. Seems [01:11:18] reasonable. Okay. I mean, it's a [01:11:20] tough picture, right? Because there are [01:11:21] lots of things going on in that picture. [01:11:22] It's not like one image, one object. [01:11:25] Um, okay, so you don't have to use the [01:11:27] default model; you can actually give it [01:11:29] your own model that you want. So for [01:11:31] example, you can go—sorry— [01:11:35] you can go to the Hugging Face hub, [01:11:38] and you can go in there and say, all [01:11:40] right, I want image classification. These [01:11:42] are all the models—10,487 models. Let's [01:11:45] sort by, I don't know, most downloads, or [01:11:49] maybe most likes, [01:11:51] and you have all these models; you can [01:11:53] pick any one of them. So, for example, [01:11:54] let's say you want to pick Microsoft's [01:11:56] ResNet as your model—that's what I tried [01:11:57] here. So I have Microsoft ResNet; you [01:12:00] just say model equals that, run it, and it [01:12:04] takes care of all the tokenization, this, [01:12:05] that, and whatnot. It's really very handy. [01:12:08] And then you run it through the pipeline [01:12:09] again and it says tiger cat, 94% [01:12:12] probability, according to ResNet. So [01:12:15] yeah, that's how you do it.
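A minimal sketch of those two classification calls—assuming the default checkpoint for the task, microsoft/resnet-50 as the "Microsoft ResNet" picked from the hub, and a placeholder image path:

```python
from transformers import pipeline

# Default model for the task.
clf = pipeline("image-classification")
print(clf("cats_on_couch.jpg"))           # e.g. [{'label': 'Egyptian cat', 'score': 0.94}, ...]

# Same pipeline, but with a specific model picked from the hub.
resnet_clf = pipeline("image-classification", model="microsoft/resnet-50")
print(resnet_clf("cats_on_couch.jpg"))    # e.g. [{'label': 'tiger cat', 'score': 0.94}, ...]
```

The object detection and segmentation demos that follow use the same pattern, just with a different task name passed to the pipeline.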
Now let's [01:12:17] actually try a more interesting example [01:12:18] where you want to detect all the objects [01:12:20] in the picture, which we didn't talk [01:12:21] about in class—object detection. So just [01:12:23] create an object detection pipeline. [01:12:27] Same thing as before. When you actually [01:12:29] run this command, an astonishing [01:12:31] amount of complicated stuff is going on [01:12:33] under the hood. Okay, and we are all the [01:12:35] beneficiaries of that. So, thank you. [01:12:37] Um, so yeah, so we have this here, and [01:12:39] then we run it through the pipeline. [01:12:42] It's looking at all the possible things [01:12:44] that might be sitting in the picture. [01:12:45] The results are hard to read, so let's [01:12:46] actually visualize them. Um, [01:12:49] I got some nice code from this site [01:12:51] for how to visualize them; let's just [01:12:53] reuse it. So, yeah. So if you plot the [01:12:56] results, [01:12:58] look at that. [01:13:03] Okay, so it has picked up the cat—100% [01:13:06] probability, I guess—the remote, the [01:13:09] couch, the other remote, and then the [01:13:12] cat. Pretty good, right? Off the shelf, [01:13:14] ready to go. No heavy lifting [01:13:17] required. Now, in this case, we are [01:13:19] actually putting these boxes, called [01:13:20] bounding boxes, around each object. But [01:13:22] what if you actually don't want a [01:13:23] bounding box? What if you want to actually [01:13:25] find the exact contour of that cat or [01:13:28] the remote? No problem. We do something [01:13:30] called image segmentation. So let's do [01:13:32] an image segmentation pipeline [01:13:36] and run it through. [01:13:42] It takes some time. Um, all right. All [01:13:46] right. Let's visualize it. So, for [01:13:49] each object it finds, it gives you a [01:13:51] mask. It basically tells you, for each [01:13:53] object, what object it is and then which [01:13:56] pixels are on for that object and off [01:13:58] for everything else. It's a mask. It [01:14:00] tells you where the object is. And you can [01:14:02] see here, the first object it has [01:14:04] found is this thing here. And it's [01:14:06] perfectly delineated, right? It's pretty [01:14:08] amazing. So we can overlay this on the [01:14:10] original image and see it has found that. [01:14:14] Let's look at the other [01:14:15] objects. Oh, it has found the remote. [01:14:17] That's the second object. [01:14:20] And the third—another remote— [01:14:24] and the fourth. You think any other [01:14:27] objects are remaining? [01:14:28] >> Couch. [01:14:28] >> Good. All right, let's find the [01:14:32] couch. [01:14:33] And look, the couch is pretty good, [01:14:36] except that the middle part has gotten [01:14:37] confused. [01:14:39] All right, but it's still pretty good, [01:14:41] right? So, yeah. So, [01:14:44] Hugging Face has all these things, and [01:14:46] you should definitely check it out [01:14:48] if you're not already very familiar [01:14:49] with it. So, uh, we have one minute [01:14:51] left. Any questions? [01:14:58] No questions. Okay. All right, folks. [01:15:00] See you on Wednesday. Thanks.