WEBVTT

00:00:16.480 --> 00:00:19.760
So all right so today we actually come

00:00:18.160 --> 00:00:20.879
to the last lecture of the class because

00:00:19.760 --> 00:00:23.920
Wednesday it's going to be project

00:00:20.879 --> 00:00:25.679
presentations and um so I want to talk

00:00:23.920 --> 00:00:28.640
to you about diffusion models today

00:00:25.679 --> 00:00:30.320
which is an incredibly exciting area

00:00:28.640 --> 00:00:32.399
which I don't think gets the same

00:00:30.320 --> 00:00:34.719
amount of attention in some ways

00:00:32.399 --> 00:00:37.679
compared to large language models. Uh

00:00:34.719 --> 00:00:39.280
but it's got enormous potential. Um so

00:00:37.679 --> 00:00:42.000
I'm very excited to talk to you about

00:00:39.280 --> 00:00:44.480
it. So you know just for kicks last

00:00:42.000 --> 00:00:46.079
night I asked ChatGPT to create a

00:00:44.479 --> 00:00:47.599
photorealistic image of graduate

00:00:46.079 --> 00:00:49.439
students in a class on deep

00:00:47.600 --> 00:00:51.679
learning and this is what it came back

00:00:49.439 --> 00:00:53.759
with.

00:00:51.679 --> 00:00:56.759
There is a noticeable absence of an

00:00:53.759 --> 00:00:56.759
instructor

00:00:57.280 --> 00:01:01.359
plus various students are facing in

00:00:59.039 --> 00:01:05.359
various directions

00:01:01.359 --> 00:01:08.960
but apart from that it's not bad. Um and

00:01:05.359 --> 00:01:12.079
uh here is an example of a midjourney

00:01:08.959 --> 00:01:14.798
text-to-image diffusion model uh which

00:01:12.079 --> 00:01:16.879
produces the amazing picture from this

00:01:14.799 --> 00:01:18.400
prompt: a quaint Italian seaside village

00:01:16.879 --> 00:01:21.599
with colorful buildings blah blah blah

00:01:18.400 --> 00:01:24.000
blah blah uh rendered in the style of

00:01:21.599 --> 00:01:25.280
Claude Monet and so on so forth and

00:01:24.000 --> 00:01:27.118
that's what you get. It's pretty

00:01:25.280 --> 00:01:28.560
unbelievable.

00:01:27.118 --> 00:01:29.680
Uh and I'm sure you folks have played

00:01:28.560 --> 00:01:31.439
around with these things and you have

00:01:29.680 --> 00:01:33.118
your favorite pictures and prompts and

00:01:31.438 --> 00:01:35.118
whatnot.

00:01:33.118 --> 00:01:38.400
Um now

00:01:35.118 --> 00:01:41.759
uh February 15th um OpenAI released a

00:01:38.400 --> 00:01:44.478
text-to-video model called Sora which you

00:01:41.759 --> 00:01:46.640
folks may have seen uh which I find

00:01:44.478 --> 00:01:49.599
frankly just stunning what it can do. It

00:01:46.640 --> 00:01:52.960
can produce a one minute uh video from a

00:01:49.599 --> 00:01:54.798
text prompt. And so,

00:01:52.959 --> 00:01:56.959
so if you actually give it this prompt,

00:01:54.799 --> 00:02:00.159
in an ornate historical hall, a massive

00:01:56.959 --> 00:02:01.679
tidal wave peaks and begins to crash and

00:02:00.159 --> 00:02:03.600
two surfers seizing the moment

00:02:01.680 --> 00:02:06.000
skillfully navigate the wave.

00:02:03.599 --> 00:02:07.199
Okay. Uh I think we can all agree that

00:02:06.000 --> 00:02:09.280
such a thing has never happened in

00:02:07.200 --> 00:02:12.400
history and therefore it was not

00:02:09.280 --> 00:02:17.000
in the training data, right? So and then

00:02:12.400 --> 00:02:17.000
you get this picture, this video

00:02:26.878 --> 00:02:31.120
and then some random person is coming

00:02:28.878 --> 00:02:32.799
back in a completely dry [laughter]

00:02:31.120 --> 00:02:37.280
hall. So anyway, but it's pretty

00:02:32.800 --> 00:02:39.519
amazing. I think you would agree. So

00:02:37.280 --> 00:02:42.479
if you actually look at the OpenAI Sora

00:02:39.519 --> 00:02:45.519
technical report, you actually find this

00:02:42.479 --> 00:02:48.079
uh opening paragraph where they say that

00:02:45.519 --> 00:02:51.120
we train text conditional diffusion

00:02:48.080 --> 00:02:54.879
models blah blah blah using a

00:02:51.120 --> 00:02:56.239
transformer architecture. Okay, so now

00:02:54.878 --> 00:02:57.919
we know what a transformer architecture

00:02:56.239 --> 00:03:00.080
is. You've been working with it. You're

00:02:57.919 --> 00:03:02.158
quite familiar with it at this point. So

00:03:00.080 --> 00:03:04.560
today's class is really about text

00:03:02.158 --> 00:03:06.959
conditional diffusion models. Okay, so

00:03:04.560 --> 00:03:09.280
the other building block. Okay, so let's

00:03:06.959 --> 00:03:11.120
get to it. Uh what I'm going to do is

00:03:09.280 --> 00:03:12.640
I'm going to sort of uh divide this into

00:03:11.120 --> 00:03:14.158
two parts. The first part is I'm just

00:03:12.639 --> 00:03:16.158
going to talk about how do you get a

00:03:14.158 --> 00:03:17.759
model to just generate an image for you?

00:03:16.158 --> 00:03:20.158
Right? If you wanted to generate an

00:03:17.759 --> 00:03:21.840
image from a class of potential images,

00:03:20.158 --> 00:03:24.158
how can it just generate an image? And

00:03:21.840 --> 00:03:25.519
then next we talk about okay, great. Now

00:03:24.158 --> 00:03:27.919
that you can do that, how do you

00:03:25.519 --> 00:03:29.680
actually control or steer the model to

00:03:27.919 --> 00:03:31.679
do an image based on whatever prompting

00:03:29.680 --> 00:03:33.200
you give it? Okay, how do you condition

00:03:31.680 --> 00:03:34.879
it? How do you control it? Those are all

00:03:33.199 --> 00:03:36.079
the words. How do you steer it? You'll

00:03:34.878 --> 00:03:37.518
find all these synonyms being used

00:03:36.080 --> 00:03:38.560
heavily in the literature. That's

00:03:37.519 --> 00:03:40.239
basically what they mean. How do you

00:03:38.560 --> 00:03:43.680
give it a prompt and then steer what

00:03:40.239 --> 00:03:44.959
gets produced? All right, so let's say

00:03:43.680 --> 00:03:47.280
we want to build a model that can be

00:03:44.959 --> 00:03:49.120
used to generate images of stately

00:03:47.280 --> 00:03:51.519
college buildings.

00:03:49.120 --> 00:03:53.200
Okay, obviously our very own Killian

00:03:51.519 --> 00:03:56.080
Court is the finest example of such a

00:03:53.199 --> 00:03:58.399
thing. Um, and uh, but let's say you

00:03:56.080 --> 00:03:59.920
want to do that. So what you do is you

00:03:58.400 --> 00:04:01.760
as we always do with machine

00:03:59.919 --> 00:04:03.359
learning, we collect a bunch of data. In

00:04:01.759 --> 00:04:05.199
this particular case, we collect a whole

00:04:03.360 --> 00:04:07.360
bunch of images of stately college

00:04:05.199 --> 00:04:08.878
buildings. Uh, and what you see here is

00:04:07.360 --> 00:04:10.480
literally me just doing a Google image

00:04:08.878 --> 00:04:12.719
search with the query stately college

00:04:10.479 --> 00:04:14.639
buildings. Okay, so this is the kind of

00:04:12.719 --> 00:04:15.919
stuff you get. Uh, so you have your

00:04:14.639 --> 00:04:19.120
training data at your disposal. It's

00:04:15.919 --> 00:04:20.319
ready to go. Now the question is if you

00:04:19.120 --> 00:04:21.439
have such a model, let's say, and

00:04:20.319 --> 00:04:23.519
obviously we'll talk about how to build

00:04:21.439 --> 00:04:25.839
such a model very soon. But let's say

00:04:23.519 --> 00:04:27.359
you have such a model and every time you

00:04:25.839 --> 00:04:28.879
sort of sample this model, every time

00:04:27.360 --> 00:04:30.560
you ask the model, hey, give me an

00:04:28.879 --> 00:04:31.918
image, you obviously want it to give a

00:04:30.560 --> 00:04:34.639
different image, right? Otherwise, it's

00:04:31.918 --> 00:04:36.560
kind of boring. All right? So, you know

00:04:34.639 --> 00:04:37.759
maybe you want the Killian Court, maybe

00:04:36.560 --> 00:04:42.319
you want the rotunda from the University

00:04:37.759 --> 00:04:45.520
of Virginia. Anybody any UVA alums here?

00:04:42.319 --> 00:04:47.120
Nobody. Okay. Um, so and right. So the

00:04:45.519 --> 00:04:49.359
question is how can we actually get it

00:04:47.120 --> 00:04:50.959
to randomly give us different images?

00:04:49.360 --> 00:04:52.960
But but they all have to be stately

00:04:50.959 --> 00:04:54.959
college buildings. It can't be just some

00:04:52.959 --> 00:04:58.159
random stuff, right? So, how do you do

00:04:54.959 --> 00:04:59.918
that? And the way we do that, and I

00:04:58.160 --> 00:05:02.080
still find it really astonishing that

00:04:59.918 --> 00:05:03.758
this approach actually works. The way we

00:05:02.079 --> 00:05:05.839
do that is that we actually give it

00:05:03.759 --> 00:05:07.840
noise.

00:05:05.839 --> 00:05:10.079
And I will define very precisely what I

00:05:07.839 --> 00:05:13.038
mean by noise in just a just a bit.

00:05:10.079 --> 00:05:15.038
Okay, basically assume

00:05:13.038 --> 00:05:17.279
an image in which all the pixel values

00:05:15.038 --> 00:05:19.839
are randomly picked.

00:05:17.279 --> 00:05:21.198
Right? So every time you generate a

00:05:19.839 --> 00:05:23.119
random image and you give it to the

00:05:21.199 --> 00:05:25.600
model, it'll use that random

00:05:23.120 --> 00:05:27.680
starting point and then create an image

00:05:25.600 --> 00:05:30.479
for you. And because by definition, if

00:05:27.680 --> 00:05:31.600
you choose noise randomly, they are, you

00:05:30.478 --> 00:05:33.038
know, obviously going to be different

00:05:31.600 --> 00:05:35.840
each time. It's hopefully going to

00:05:33.038 --> 00:05:37.759
generate a different image. But if the

00:05:35.839 --> 00:05:39.599
model is trained on stately college

00:05:37.759 --> 00:05:41.520
buildings, it will produce images of

00:05:39.600 --> 00:05:42.879
stately college buildings. It's not

00:05:41.519 --> 00:05:44.399
going to produce a picture of a Labrador

00:05:42.879 --> 00:05:46.719
retriever.

00:05:44.399 --> 00:05:49.038
Okay, so that's basically what we're

00:05:46.720 --> 00:05:51.600
going to do. Now, if you look at

00:05:49.038 --> 00:05:53.279
something like this, the first question

00:05:51.600 --> 00:05:54.960
of course is that how can we train a

00:05:53.279 --> 00:05:58.799
model to generate an image from pure

00:05:54.959 --> 00:06:00.560
noise? This just sounds ridiculous,

00:05:58.800 --> 00:06:04.639
right? You basically give it a bunch of

00:06:00.560 --> 00:06:06.639
random numbers and say, give me Killian Court.

00:06:04.639 --> 00:06:08.720
It feels really ridiculous. And at that

00:06:06.639 --> 00:06:10.240
point, you know, folks can sort of come

00:06:08.720 --> 00:06:11.440
to a stop and say, "All right, this

00:06:10.240 --> 00:06:14.079
approach is probably not going to take

00:06:11.439 --> 00:06:16.319
me anywhere. It's a bit of a dead end."

00:06:14.079 --> 00:06:18.399
But then some clever people had this

00:06:16.319 --> 00:06:20.720
very interesting idea.

00:06:18.399 --> 00:06:24.399
They said

00:06:20.720 --> 00:06:26.960
um it's not clear how to do this you

00:06:24.399 --> 00:06:28.799
know um just a quick aside there's this

00:06:26.959 --> 00:06:31.120
really amazing book which is published

00:06:28.800 --> 00:06:33.840
maybe 50 years ago maybe earlier than

00:06:31.120 --> 00:06:36.240
that called How to Solve It by George

00:06:33.839 --> 00:06:37.758
Pólya. George Pólya was an eminent

00:06:36.240 --> 00:06:39.680
mathematician

00:06:37.759 --> 00:06:41.600
um and he wrote this small book called

00:06:39.680 --> 00:06:44.240
How to Solve It and it lists a whole

00:06:41.600 --> 00:06:46.800
bunch of heuristics that mathematicians

00:06:44.240 --> 00:06:49.439
use when they solve problems and perhaps

00:06:46.800 --> 00:06:52.079
the most commonly used heuristic is just

00:06:49.439 --> 00:06:53.279
reverse the question

00:06:52.079 --> 00:06:55.038
just reverse the question and see if

00:06:53.279 --> 00:06:56.559
anything comes out of it most of the

00:06:55.038 --> 00:06:58.079
time nothing will come out of it but

00:06:56.560 --> 00:06:59.759
maybe some other time something amazing

00:06:58.079 --> 00:07:01.680
comes out right this is a great example

00:06:59.759 --> 00:07:03.840
of that heuristic at work we don't know

00:07:01.680 --> 00:07:05.680
how to do this so the question is can we

00:07:03.839 --> 00:07:07.279
do the reverse

00:07:05.680 --> 00:07:10.400
If I give you Killian Court, can you

00:07:07.279 --> 00:07:12.239
produce noise out of it for me?

00:07:10.399 --> 00:07:14.159
And the answer is yeah, of course we can

00:07:12.240 --> 00:07:16.800
do that.

00:07:14.160 --> 00:07:19.360
Right? Given an image, we can easily

00:07:16.800 --> 00:07:21.598
create a noisy version of it. So you can

00:07:19.360 --> 00:07:23.360
take the original image, you can add

00:07:21.598 --> 00:07:24.560
some noise to it to get this and you

00:07:23.360 --> 00:07:25.520
keep on adding a lot of noise and

00:07:24.560 --> 00:07:27.120
finally you'll get something that's

00:07:25.519 --> 00:07:29.439
basically you can't tell that there is

00:07:27.120 --> 00:07:31.360
Killian Court anymore. Right?

00:07:29.439 --> 00:07:33.680
This process, the reverse process is

00:07:31.360 --> 00:07:36.160
actually very easy to do. Okay? So the

00:07:33.680 --> 00:07:37.680
question... by the way, for folks who

00:07:36.160 --> 00:07:39.360
may not be very familiar with this

00:07:37.680 --> 00:07:41.120
notion of adding noise to an image or

00:07:39.360 --> 00:07:44.319
making an image noisy. Let me just show

00:07:41.120 --> 00:07:47.478
you in a Colab in just a minute how easy

00:07:44.319 --> 00:07:47.479
it is.

00:07:47.519 --> 00:07:52.959
All right. So um we let's say we import

00:07:51.199 --> 00:07:54.960
a bunch of these things. As usual we

00:07:52.959 --> 00:07:57.598
have numpy and so there is this thing

00:07:54.959 --> 00:07:58.959
called the Python Imaging Library, PIL,

00:07:57.598 --> 00:08:01.439
which is very handy for image

00:07:58.959 --> 00:08:03.038
manipulations. So we import that and

00:08:01.439 --> 00:08:04.959
then I just literally read this

00:08:03.038 --> 00:08:06.079
image in. I uploaded it before class.

00:08:04.959 --> 00:08:07.918
Let's just make sure it's here. Okay,

00:08:06.079 --> 00:08:11.120
good. Killian.png.

00:08:07.918 --> 00:08:13.680
So I read this image. Okay. Uh and

00:08:11.120 --> 00:08:16.478
then once I read it, I convert it into a

00:08:13.680 --> 00:08:18.639
numpy array. And then remember in

00:08:16.478 --> 00:08:20.800
any color image, you have three tables

00:08:18.639 --> 00:08:23.439
of numbers. There's a

00:08:20.800 --> 00:08:25.120
number for each pixel for red, blue, and

00:08:23.439 --> 00:08:28.319
green. And then each number is between 0

00:08:25.120 --> 00:08:29.759
and 255. U and so here what we do is we

00:08:28.319 --> 00:08:31.280
divide everything by 255 just to

00:08:29.759 --> 00:08:32.719
normalize it so it's all between zero

00:08:31.279 --> 00:08:36.240
and one and we have done this in the

00:08:32.719 --> 00:08:38.479
past right I do that here uh all right

00:08:36.240 --> 00:08:40.399
so let me just read this back in convert

00:08:38.479 --> 00:08:45.120
it and then if you look at the shape

00:08:40.399 --> 00:08:47.200
it's basically 411 × 583 × 3, um, three

00:08:45.120 --> 00:08:50.000
channels as we have seen before and then

00:08:47.200 --> 00:08:52.320
I'll just show it all right that's the

00:08:50.000 --> 00:08:54.799
picture so now what we want to do is we

00:08:52.320 --> 00:08:59.040
want to add noise to this picture all we

00:08:54.799 --> 00:09:02.479
have to do is, okay, for each pixel,

00:08:59.039 --> 00:09:03.919
we basically randomly pick a normal

00:09:02.480 --> 00:09:05.759
variable, a normal distribution,

00:09:03.919 --> 00:09:08.240
normally distributed random variable

00:09:05.759 --> 00:09:10.240
with a mean of zero and a small standard

00:09:08.240 --> 00:09:11.839
deviation. So it's like a small number

00:09:10.240 --> 00:09:14.320
and then we just literally add that

00:09:11.839 --> 00:09:16.560
number to every pixel. But for every

00:09:14.320 --> 00:09:17.839
pixel, we sample. Every pixel we sample.

00:09:16.559 --> 00:09:19.439
It's not like we sample once and add it

00:09:17.839 --> 00:09:22.560
to all the pixels. We sample for every

00:09:19.440 --> 00:09:25.279
pixel. And so the way you do that is

00:09:22.559 --> 00:09:28.639
basically literally np.random.normal,

00:09:25.278 --> 00:09:30.720
and then this .3 here is the

00:09:28.639 --> 00:09:33.838
standard deviation and we tell it

00:09:30.720 --> 00:09:35.680
generate as many of these things as the

00:09:33.839 --> 00:09:38.320
shape of the image that I gave

00:09:35.679 --> 00:09:40.079
you. Okay. And then add each one of

00:09:38.320 --> 00:09:42.959
these numbers to the original image you

00:09:40.080 --> 00:09:44.480
get this noisy image. Okay. So if you

00:09:42.958 --> 00:09:46.479
this is the original image these are all

00:09:44.480 --> 00:09:48.560
the values between 0 and one. And then

00:09:46.480 --> 00:09:50.159
you get this noisy image. You can see

00:09:48.559 --> 00:09:52.319
the numbers have become different. The

00:09:50.159 --> 00:09:54.319
.23 has become .18, the .15

00:09:52.320 --> 00:09:56.160
has become -.17 and so on and so

00:09:54.320 --> 00:09:58.080
forth. Right? You just added a small

00:09:56.159 --> 00:09:59.519
random number to everything. But as you

00:09:58.080 --> 00:10:01.040
can see here now you have some negative

00:09:59.519 --> 00:10:02.959
numbers. You may have some numbers

00:10:01.039 --> 00:10:05.039
that's greater than one. And we do want

00:10:02.958 --> 00:10:06.639
everything to be between 0 and one. So

00:10:05.039 --> 00:10:10.079
all we do is we do this thing called

00:10:06.639 --> 00:10:11.600
clip it where essentially values smaller

00:10:10.080 --> 00:10:13.759
than zero are set to zero. Values

00:10:11.600 --> 00:10:16.079
greater than one are set to one. And so

00:10:13.759 --> 00:10:17.838
we'll just do that. That's it.

00:10:16.078 --> 00:10:19.359
Everything over one squashed to one.

00:10:17.839 --> 00:10:21.519
Everything under zero set to zero.

00:10:19.360 --> 00:10:23.519
Others leave it unchanged. Now it's

00:10:21.519 --> 00:10:28.320
again well behaved between 0 and one and

00:10:23.519 --> 00:10:29.759
we can just plot it and you get this.

00:10:28.320 --> 00:10:31.440
That's it. That's all it takes to

00:10:29.759 --> 00:10:34.399
actually add noise to an image. One line

00:10:31.440 --> 00:10:36.160
of numpy. Okay. Uh obviously you can

00:10:34.399 --> 00:10:37.679
just put this whole thing in a loop and

00:10:36.159 --> 00:10:39.919
keep increasing that standard deviation

00:10:37.679 --> 00:10:41.679
number from .3 to .4, .5, and so on and so

00:10:39.919 --> 00:10:44.078
forth. And when you do that you get this

00:10:41.679 --> 00:10:45.838
nice sequence from the clean Killian Court all the

00:10:44.078 --> 00:10:48.639
way to some very very noisy version of

00:10:45.839 --> 00:10:52.440
Killian Court. That's it. So that's the basic

00:10:48.639 --> 00:10:52.439
idea of adding noise.
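
NOTE
A minimal sketch of the noise-adding step just shown, assuming a local file named Killian.png (the filename and the sigma values are illustrative):
import numpy as np
from PIL import Image
img = np.asarray(Image.open("Killian.png").convert("RGB")) / 255.0   # normalize pixel values to [0, 1]
print(img.shape)                                                      # e.g. (411, 583, 3): rows x columns x 3 color channels
noisy = img + np.random.normal(0.0, 0.3, size=img.shape)              # one Gaussian sample added per pixel value
noisy = np.clip(noisy, 0.0, 1.0)                                      # squash everything back into [0, 1]
# Put this in a loop with a growing standard deviation to get the clean-to-very-noisy sequence.
sequence = [np.clip(img + np.random.normal(0.0, s, size=img.shape), 0.0, 1.0) for s in (0.1, 0.3, 0.5, 0.8)]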

00:10:52.958 --> 00:10:57.078
Any questions on the mechanics?

00:10:57.200 --> 00:11:02.320
Okay, good. Um so we can add random

00:11:00.320 --> 00:11:04.640
numbers, right? And we can by increasing

00:11:02.320 --> 00:11:06.480
the magnitude of the standard deviation

00:11:04.639 --> 00:11:08.000
of these normal random

00:11:06.480 --> 00:11:12.240
variables, we can make the image

00:11:08.000 --> 00:11:14.958
noisier. Okay, so that suggests a really

00:11:12.240 --> 00:11:18.079
interesting idea.

00:11:14.958 --> 00:11:18.078
What idea would that be?

00:11:19.600 --> 00:11:25.278
Yeah, doing the opposite. Could you

00:11:21.600 --> 00:11:26.959
please uh microphone please?

00:11:25.278 --> 00:11:29.600
>> Uh doing the opposite like recreating

00:11:26.958 --> 00:11:31.518
the image from the noise.

00:11:29.600 --> 00:11:34.800
>> So we are trying to create the image

00:11:31.519 --> 00:11:37.360
from the noise. But

00:11:34.799 --> 00:11:38.958
that feels a little hard. So what

00:11:37.360 --> 00:11:41.959
exactly can we do? Be a little more

00:11:38.958 --> 00:11:41.958
specific.

00:11:44.480 --> 00:11:48.399
So here we have the ability to take any

00:11:46.559 --> 00:11:51.838
image and add any amount of noise to it.

00:11:48.399 --> 00:11:54.078
Right? That's the data we have. There is

00:11:51.839 --> 00:11:56.160
Killian Court and there are various noisy

00:11:54.078 --> 00:11:57.599
versions of Killian Court, likewise for the

00:11:56.159 --> 00:11:58.240
rotunda at the University of Virginia and so on and

00:11:57.600 --> 00:12:00.000
so forth.

00:11:58.240 --> 00:12:02.079
>> I would assume you would do some kind of

00:12:00.000 --> 00:12:04.559
loss function for the the final image

00:12:02.078 --> 00:12:06.958
that you get and compare it with the the

00:12:04.559 --> 00:12:10.958
original image that you train it on and

00:12:06.958 --> 00:12:14.638
then refine it as you go. Okay,

00:12:10.958 --> 00:12:17.638
you're on the right track. Uh, any other

00:12:14.639 --> 00:12:17.639
proposals?

00:12:18.480 --> 00:12:22.720
>> I think we could try to train a neural

00:12:20.639 --> 00:12:25.440
network to reconstruct the image going

00:12:22.720 --> 00:12:27.519
from the noise to the non-noisy one.

00:12:25.440 --> 00:12:30.959
Like we could have a whole data set with

00:12:27.519 --> 00:12:34.480
images, find their noise counterpart and

00:12:30.958 --> 00:12:38.159
train a

00:12:34.480 --> 00:12:39.759
network to do the opposite task.

00:12:38.159 --> 00:12:41.039
Yeah, that's definitely on the right

00:12:39.759 --> 00:12:44.159
track. That's definitely on the right

00:12:41.039 --> 00:12:47.519
track. Yep, good ideas. So, what we do

00:12:44.159 --> 00:12:49.679
more concretely is

00:12:47.519 --> 00:12:51.679
we can take each image in the

00:12:49.679 --> 00:12:54.239
training data and create noisy versions

00:12:51.679 --> 00:12:57.599
of it as we have seen before. And then

00:12:54.240 --> 00:13:00.879
what we do is that we say uh we can

00:12:57.600 --> 00:13:04.240
create XY training data pairs input

00:13:00.879 --> 00:13:09.679
output pairs from all these images. So

00:13:04.240 --> 00:13:11.839
specifically what we do is we take

00:13:09.679 --> 00:13:14.559
the slightly noisy version of

00:13:11.839 --> 00:13:16.240
Killian Court and call it the input and

00:13:14.559 --> 00:13:19.199
we take the nice clean version of

00:13:16.240 --> 00:13:22.639
Killian Court and call it the output.

00:13:19.200 --> 00:13:27.360
Okay, that's the y1 x1 pair

00:13:22.639 --> 00:13:30.959
and then we get y2 x2 we get y3 x3 and

00:13:27.360 --> 00:13:33.919
all the way. So at any point in time,

00:13:30.958 --> 00:13:36.239
the relationship between X and Y, what's

00:13:33.919 --> 00:13:37.759
the relationship between X and Y? If you

00:13:36.240 --> 00:13:40.759
set it up like this as the input and the

00:13:37.759 --> 00:13:40.759
output,

00:13:43.039 --> 00:13:48.159
>> it's the set of uh standard deviations

00:13:45.679 --> 00:13:51.039
and uh the values which you change for

00:13:48.159 --> 00:13:53.120
each pixel. Those are like weights to

00:13:51.039 --> 00:13:54.879
which you transform,

00:13:53.120 --> 00:13:56.560
>> right? Or maybe I was looking for

00:13:54.879 --> 00:13:58.320
something simpler which was that that's

00:13:56.559 --> 00:14:00.078
correct. So what he's looking for is

00:13:58.320 --> 00:14:03.360
really the relationship between X

00:14:00.078 --> 00:14:05.759
and Y. X is an image, any image, and Y

00:14:03.360 --> 00:14:07.919
happens to be a slightly less noisy

00:14:05.759 --> 00:14:09.838
version of the image.

00:14:07.919 --> 00:14:12.399
The slightly less noisy is really,

00:14:09.839 --> 00:14:14.399
really important.

00:14:12.399 --> 00:14:16.639
You're not going from Killian Court,

00:14:14.399 --> 00:14:19.519
right? You're not going from the image

00:14:16.639 --> 00:14:21.919
to full noise. That's an impossible

00:14:19.519 --> 00:14:24.720
leap. You're going from the image to a

00:14:21.919 --> 00:14:27.679
slightly noisy version of the image.

00:14:24.720 --> 00:14:30.560
Okay, it is that slightly that allows

00:14:27.679 --> 00:14:33.120
all the magic to happen.

00:14:30.559 --> 00:14:35.838
So that's what we have.
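
NOTE
A small sketch of how the (x, y) pairs could be assembled from one training image, following the scheme described here; the noise levels are illustrative, and each x is a noisy image whose target y is the slightly less noisy version one step before it:
import numpy as np
def make_pairs(clean, sigmas=(0.1, 0.2, 0.3, 0.4)):
    versions = [clean]                     # versions[0] is the clean image
    for s in sigmas:                       # progressively noisier copies
        versions.append(np.clip(clean + np.random.normal(0.0, s, size=clean.shape), 0.0, 1.0))
    xs = versions[1:]                      # inputs: the noisier image at each step
    ys = versions[:-1]                     # targets: the slightly less noisy neighbour
    return xs, ys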

00:14:33.120 --> 00:14:38.560
And so here what we can do with these XY

00:14:35.839 --> 00:14:40.000
pairs. So here's the

00:14:38.559 --> 00:14:41.439
thing, right? This is like a larger

00:14:40.000 --> 00:14:43.759
comment about machine learning and deep

00:14:41.440 --> 00:14:46.560
learning. Um

00:14:43.759 --> 00:14:47.838
whenever you have... basically, what machine

00:14:46.559 --> 00:14:50.799
learning and deep learning are, really,

00:14:47.839 --> 00:14:52.720
is like this black box where if

00:14:50.799 --> 00:14:55.278
you can find interesting input output

00:14:52.720 --> 00:14:57.759
pairs you can learn a function to go

00:14:55.278 --> 00:14:59.919
from the input to the output that's it

00:14:57.759 --> 00:15:01.759
but this sounds kind of simple when I

00:14:59.919 --> 00:15:04.399
describe it like that but there are like

00:15:01.759 --> 00:15:06.879
some incredibly non-obvious ways of

00:15:04.399 --> 00:15:08.799
applying this idea right so for example

00:15:06.879 --> 00:15:10.320
a few years ago Google had this uh thing

00:15:08.799 --> 00:15:13.759
which may actually be in production in

00:15:10.320 --> 00:15:15.680
Google Sheets now where whenever you um

00:15:13.759 --> 00:15:17.519
sort of choose a bunch of numbers, a

00:15:15.679 --> 00:15:19.198
range of numbers in a spreadsheet and

00:15:17.519 --> 00:15:21.519
and then go into another cell, it'll

00:15:19.198 --> 00:15:24.159
immediately suggest a formula for you.

00:15:21.519 --> 00:15:26.399
Where is that coming from?

00:15:24.159 --> 00:15:28.559
It's because all the Google Sheets users

00:15:26.399 --> 00:15:30.159
all over the world, they have been

00:15:28.559 --> 00:15:32.078
creating all these numbers with

00:15:30.159 --> 00:15:33.919
formulas, right? So, someone says,

00:15:32.078 --> 00:15:36.078
"Look, wait a second. We have all this

00:15:33.919 --> 00:15:38.559
data on people choosing a range of

00:15:36.078 --> 00:15:40.479
numbers and then entering a formula. So

00:15:38.559 --> 00:15:43.119
let's imagine the range is the input and

00:15:40.480 --> 00:15:45.199
the formula as the output

00:15:43.120 --> 00:15:46.959
and let's just give a million examples

00:15:45.198 --> 00:15:50.719
of this pair and see if anything comes

00:15:46.958 --> 00:15:53.599
out of it," and boom you get that feature.

00:15:50.720 --> 00:15:55.600
Okay. So similarly here

00:15:53.600 --> 00:15:58.240
X is an image, Y a less noisy version of the

00:15:55.600 --> 00:16:02.000
image. What that means is that we can

00:15:58.240 --> 00:16:04.480
build a denoising network.

00:16:02.000 --> 00:16:06.799
Okay, we can take an image and we can

00:16:04.480 --> 00:16:10.639
build a network using all these XY pairs

00:16:06.799 --> 00:16:15.278
to slightly denoise it.

00:16:10.639 --> 00:16:16.799
Okay. Um and so how do we do it? We

00:16:15.278 --> 00:16:19.679
just run stochastic gradient descent on

00:16:16.799 --> 00:16:22.240
the data. We have a network. It has X

00:16:19.679 --> 00:16:26.319
and Y and then Y is a slightly less

00:16:22.240 --> 00:16:27.519
noisy version and then B.

00:16:26.320 --> 00:16:29.199
Okay, it's just a network. It has a

00:16:27.519 --> 00:16:30.399
bunch of weights. We have

00:16:29.198 --> 00:16:33.359
the right answer in terms of what the

00:16:30.399 --> 00:16:34.799
images need to be, we can do stochastic

00:16:33.360 --> 00:16:36.240
gradient descent or Adam or something

00:16:34.799 --> 00:16:37.278
and before you know it if you have

00:16:36.240 --> 00:16:40.159
enough data you have a network which can

00:16:37.278 --> 00:16:41.919
denoise anything you give it. Okay, um
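
NOTE
A rough sketch of training such a denoiser on those pairs with Adam, assuming the pairs have been stacked into x_train and y_train arrays; the tiny all-convolutional network is only a placeholder, not the architecture used in practice:
from tensorflow import keras
from tensorflow.keras import layers
denoiser = keras.Sequential([
    layers.Input(shape=(None, None, 3)),
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.Conv2D(3, 3, padding="same", activation="sigmoid"),   # keeps the output in [0, 1]
])
denoiser.compile(optimizer="adam", loss="mse")                    # stochastic-gradient-style training
# denoiser.fit(x_train, y_train, batch_size=32, epochs=10)        # x: noisy images, y: slightly less noisy ones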

00:16:40.159 --> 00:16:43.278
you had a question

00:16:41.919 --> 00:16:45.759
>> why slightly

00:16:43.278 --> 00:16:48.879
>> why slightly um we'll come back to that

00:16:45.759 --> 00:16:51.199
question. The reason is that in

00:16:48.879 --> 00:16:53.679
general you have to do what you can

00:16:51.198 --> 00:16:56.559
to help the model and this is sort of

00:16:53.679 --> 00:16:59.919
the proverbial there is an old adage you

00:16:56.559 --> 00:17:02.078
can't cross a ditch in two jumps.

00:16:59.919 --> 00:17:03.679
It's too big. So, right. So, you can't

00:17:02.078 --> 00:17:05.678
do it. So, what you do is you create a

00:17:03.679 --> 00:17:07.599
bridge to go from here to there. And so,

00:17:05.679 --> 00:17:10.319
what you do is if you can slightly

00:17:07.599 --> 00:17:11.519
denoise something really well. Well, I

00:17:10.318 --> 00:17:13.599
can actually denoise anything you

00:17:11.519 --> 00:17:17.199
want really well using that fundamental

00:17:13.599 --> 00:17:18.958
capability as you will see in a second.

00:17:17.199 --> 00:17:21.360
>> Just to follow up. So, if you go back

00:17:18.959 --> 00:17:24.480
the last slide, I could have created the

00:17:21.359 --> 00:17:26.958
same thing as that is my x1 and that is

00:17:24.480 --> 00:17:28.640
my y. Then the second one is x2 and

00:17:26.959 --> 00:17:30.400
still this is the y. So there is

00:17:28.640 --> 00:17:33.120
effectively there is a learning there

00:17:30.400 --> 00:17:35.840
that it could have taken from those

00:17:33.119 --> 00:17:37.439
pairs and come back with okay this is

00:17:35.839 --> 00:17:40.159
also a possibility this is also a

00:17:37.440 --> 00:17:42.400
possibility and it found out that noise

00:17:40.160 --> 00:17:44.400
matrix and it can subtract.

00:17:42.400 --> 00:17:46.240
>> Yeah. So the thing is you want to make

00:17:44.400 --> 00:17:48.880
sure that each time the amount of

00:17:46.240 --> 00:17:51.359
learning it has to do is as bounded and

00:17:48.880 --> 00:17:52.960
small as possible. If you give it some

00:17:51.359 --> 00:17:55.439
starting point and an ending point and

00:17:52.960 --> 00:17:56.798
keep moving this ending point, the gap

00:17:55.440 --> 00:17:59.519
is still really high for the first

00:17:56.798 --> 00:18:01.599
several of those starting points. That's

00:17:59.519 --> 00:18:04.319
the problem.

00:18:01.599 --> 00:18:07.119
Okay. So to come back to this, so we can

00:18:04.319 --> 00:18:08.399
build a denoising model. We can do this.

00:18:07.119 --> 00:18:10.399
And now once you have

00:18:08.400 --> 00:18:13.038
built such a thing, you give it some

00:18:10.400 --> 00:18:15.440
noisy thing and then it'll you know give

00:18:13.038 --> 00:18:16.879
you a slightly less noisy version of it.

00:18:15.440 --> 00:18:19.200
Okay, the resolution is going to go up

00:18:16.880 --> 00:18:20.720
slightly if you do that. This of course

00:18:19.200 --> 00:18:22.960
suggests the obvious way in which you

00:18:20.720 --> 00:18:26.880
would use it which is that once you

00:18:22.960 --> 00:18:29.600
train it we can solve this problem.

00:18:26.880 --> 00:18:32.160
Okay. And how can we solve this problem?

00:18:29.599 --> 00:18:35.279
So what you do is you start with pure

00:18:32.160 --> 00:18:37.120
noise and then repeatedly denoise

00:18:35.279 --> 00:18:39.759
it.

00:18:37.119 --> 00:18:41.038
Okay. You get that, you get that, and

00:18:39.759 --> 00:18:43.599
then before you know it, Killian Court

00:18:41.038 --> 00:18:46.319
has emerged from the fog,

00:18:43.599 --> 00:18:50.439
right? It's pretty insane that this

00:18:46.319 --> 00:18:50.439
idea actually works.
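
NOTE
A minimal sketch of the generation loop just described: start from pure noise and repeatedly apply the trained denoiser; the step count and the denoiser model are assumptions carried over from the earlier sketches:
import numpy as np
x = np.random.uniform(0.0, 1.0, size=(1, 411, 583, 3))   # an image whose pixel values are all randomly picked
for step in range(100):                                   # originally ~1000 baby steps; newer samplers need far fewer
    x = denoiser.predict(x, verbose=0)                    # each pass peels away a little more noise
image = x[0]                                              # the final, least noisy image is the answer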

00:18:52.480 --> 00:18:56.720
So, so the model will generate a

00:18:54.960 --> 00:18:59.840
sequence of less noisy images and the

00:18:56.720 --> 00:19:01.919
final one you have is the answer. Okay.

00:18:59.839 --> 00:19:05.279
Now there's a whole bunch of detail here

00:19:01.919 --> 00:19:08.400
which I'm glossing over about okay how

00:19:05.279 --> 00:19:09.759
many times must we run this loop to get

00:19:08.400 --> 00:19:12.160
to a really good picture. The short

00:19:09.759 --> 00:19:13.200
answer is, initially it was like

00:19:12.160 --> 00:19:16.080
you have to run it like a thousand

00:19:13.200 --> 00:19:17.759
times. Each denoising step was

00:19:16.079 --> 00:19:18.960
like a baby step. You have to do it a

00:19:17.759 --> 00:19:21.038
thousand times to get a really good

00:19:18.960 --> 00:19:22.558
answer. Again research has been very

00:19:21.038 --> 00:19:24.000
active in the area continues to be very

00:19:22.558 --> 00:19:26.399
active. Now you can I think do it like

00:19:24.000 --> 00:19:29.038
50 steps or 100 steps. Right? But

00:19:26.400 --> 00:19:31.679
diffusion models like this uh they tend

00:19:29.038 --> 00:19:33.599
to take more time than a large language

00:19:31.679 --> 00:19:35.280
model which is why if you give a prompt

00:19:33.599 --> 00:19:36.639
to one of these models like midjourney

00:19:35.279 --> 00:19:38.960
it will take some time for it to come

00:19:36.640 --> 00:19:40.320
back with an image, and the

00:19:38.960 --> 00:19:42.079
reason for the delay is because it's

00:19:40.319 --> 00:19:45.200
going through this you know incremental

00:19:42.079 --> 00:19:47.759
denoising loop. Yeah.

00:19:45.200 --> 00:19:49.840
>> Uh from this we understand that each uh

00:19:47.759 --> 00:19:51.440
the final noise output sample would be

00:19:49.839 --> 00:19:55.199
very particular to each image in the

00:19:51.440 --> 00:19:57.279
matrix. So I mean like say two if you

00:19:55.200 --> 00:19:59.840
take two images the final we are getting

00:19:57.279 --> 00:20:02.319
is the image in the after when we start

00:19:59.839 --> 00:20:04.319
noising it and the final output we get

00:20:02.319 --> 00:20:05.359
is the noise sample will be too distinct

00:20:04.319 --> 00:20:05.918
for each of them right

00:20:05.359 --> 00:20:08.558
>> correct

00:20:05.919 --> 00:20:10.720
>> so but when we are picking up image to

00:20:08.558 --> 00:20:12.879
generate a diffusion model and we work

00:20:10.720 --> 00:20:14.798
backwards we may not have the exact

00:20:12.880 --> 00:20:15.679
thing available to us what was there

00:20:14.798 --> 00:20:17.200
initially

00:20:15.679 --> 00:20:18.960
>> no no the thing is we don't want to

00:20:17.200 --> 00:20:21.120
necessarily regenerate images that were

00:20:18.960 --> 00:20:22.558
in the training data right that's kind

00:20:21.119 --> 00:20:24.159
of pointless we want to generate new

00:20:22.558 --> 00:20:26.720
images

00:20:24.160 --> 00:20:29.519
and for new images we just use

00:20:26.720 --> 00:20:31.200
noise as a starting point

00:20:29.519 --> 00:20:32.879
you know the fact that Killian Court was

00:20:31.200 --> 00:20:35.279
here and then the fully noised version

00:20:32.880 --> 00:20:36.159
of Killian Court is here, that is used for

00:20:35.279 --> 00:20:37.918
training and once you use it for

00:20:36.159 --> 00:20:39.039
training you don't need it anymore

00:20:37.919 --> 00:20:41.120
because you're not trying to recreate

00:20:39.038 --> 00:20:43.440
Killian Court again you want to create

00:20:41.119 --> 00:20:45.359
new images which belong to the category

00:20:43.440 --> 00:20:48.000
of stately college buildings and for

00:20:45.359 --> 00:20:49.199
that, all you do is grab noise, send it

00:20:48.000 --> 00:20:51.919
in it gives you a stately college

00:20:49.200 --> 00:20:51.919
building end of

00:20:53.759 --> 00:20:57.839
And because noise by definition is

00:20:55.519 --> 00:20:59.200
different each time you pick it, it's

00:20:57.839 --> 00:21:01.839
going to come up with a different

00:20:59.200 --> 00:21:06.679
stately college building.

00:21:01.839 --> 00:21:06.678
So the way I think about it is that uh

00:21:07.038 --> 00:21:12.558
all right so you can think of it as this

00:21:09.279 --> 00:21:14.960
right this is

00:21:12.558 --> 00:21:17.359
so when you sample think of this as like

00:21:14.960 --> 00:21:20.319
the noise distribution

00:21:17.359 --> 00:21:22.879
each time you sample right there's a

00:21:20.319 --> 00:21:24.480
little point you pick from here another

00:21:22.880 --> 00:21:26.960
time you sample maybe you get a point

00:21:24.480 --> 00:21:29.279
here right each is just you know nice

00:21:26.960 --> 00:21:31.200
distribution that's it what actually

00:21:29.279 --> 00:21:34.079
these things are doing is they are

00:21:31.200 --> 00:21:35.919
mapping it

00:21:34.079 --> 00:21:38.158
to the distribution of stately college

00:21:35.919 --> 00:21:41.120
buildings which might be in a you know

00:21:38.159 --> 00:21:43.360
strange crazy distribution.

00:21:41.119 --> 00:21:47.678
So each time you sample you just go from

00:21:43.359 --> 00:21:49.599
here and you land at a point here

00:21:47.679 --> 00:21:53.200
and when you go from here you know you

00:21:49.599 --> 00:21:54.480
land at a point there.

00:21:53.200 --> 00:21:56.240
That's what so what you have done is

00:21:54.480 --> 00:21:59.360
when you when you take the training data

00:21:56.240 --> 00:22:01.519
you basically created points here and

00:21:59.359 --> 00:22:03.199
then found the matching noise here and

00:22:01.519 --> 00:22:05.038
then flipped it for training as we have

00:22:03.200 --> 00:22:07.519
seen before and once you're done with it

00:22:05.038 --> 00:22:09.919
you basically have a mechanism for

00:22:07.519 --> 00:22:12.798
transforming any entry in this

00:22:09.919 --> 00:22:15.120
distribution of images to an entry in

00:22:12.798 --> 00:22:17.440
this distribution of images. So it's a

00:22:15.119 --> 00:22:18.479
way to transform one distribution to

00:22:17.440 --> 00:22:22.320
another distribution. That's what's

00:22:18.480 --> 00:22:26.319
going on. Um all right. Um so there was

00:22:22.319 --> 00:22:28.639
a question. Yeah. And then we'll go.

00:22:26.319 --> 00:22:30.639
>> I understand the going from noise to to

00:22:28.640 --> 00:22:33.360
the image and back how you how the

00:22:30.640 --> 00:22:35.360
training works. So my question is you

00:22:33.359 --> 00:22:37.519
know in some of these models today you

00:22:35.359 --> 00:22:40.240
have you know when you give it the noise

00:22:37.519 --> 00:22:42.960
now to generate with an image for

00:22:40.240 --> 00:22:44.960
example it could generate a human with

00:22:42.960 --> 00:22:47.840
four fingers or you know stuff like

00:22:44.960 --> 00:22:49.919
that. So is it that the model,

00:22:47.839 --> 00:22:53.599
that the training data, is not quite

00:22:49.919 --> 00:22:56.400
enough, or not robust enough, to uh

00:22:53.599 --> 00:22:57.918
generate that kind of detail? [cough]

00:22:56.400 --> 00:22:58.400
Can you kind of talk through like what's

00:22:57.919 --> 00:23:00.799
more?

00:22:58.400 --> 00:23:03.038
>> Yeah. So so fundamentally what it's

00:23:00.798 --> 00:23:04.960
doing is it actually does not understand

00:23:03.038 --> 00:23:07.119
the notion of fingers and things like

00:23:04.960 --> 00:23:09.759
that. Right? Because there is like we

00:23:07.119 --> 00:23:12.079
haven't injected any domain knowledge

00:23:09.759 --> 00:23:13.679
into this whole process by saying that

00:23:12.079 --> 00:23:16.158
hey we need to generate you need to

00:23:13.679 --> 00:23:17.759
generate a human body and here are the

00:23:16.159 --> 00:23:20.159
semantics of what the human body is

00:23:17.759 --> 00:23:21.679
right it's got uh five fingers and all

00:23:20.159 --> 00:23:23.440
the anatomical stuff we're not giving

00:23:21.679 --> 00:23:26.080
anything we literally giving it pixel

00:23:23.440 --> 00:23:27.759
values bunch of pictures so everything

00:23:26.079 --> 00:23:29.439
you're seeing is basically just coming

00:23:27.759 --> 00:23:32.558
out of that very blind statistical

00:23:29.440 --> 00:23:34.798
transformation process so it's so you

00:23:32.558 --> 00:23:36.639
would expect that macrolevel details

00:23:34.798 --> 00:23:38.720
it will probably get right. Because

00:23:36.640 --> 00:23:40.159
there are so many right answers. So

00:23:38.720 --> 00:23:43.120
imagine it's actually, you know, it's

00:23:40.159 --> 00:23:45.120
creating um the roof of a house. There

00:23:43.119 --> 00:23:46.639
could be all kinds of variations in the

00:23:45.119 --> 00:23:48.479
roof of the house and you would still

00:23:46.640 --> 00:23:49.679
think it's a roof of a house, right?

00:23:48.480 --> 00:23:51.279
Because there are many possible right

00:23:49.679 --> 00:23:52.640
answers. But when it comes to five

00:23:51.279 --> 00:23:53.918
fingers, there are not many possible

00:23:52.640 --> 00:23:55.759
right answers, which is why you notice

00:23:53.919 --> 00:23:56.880
the error very quickly. As far as the

00:23:55.759 --> 00:23:58.158
model is concerned, it doesn't know,

00:23:56.880 --> 00:24:00.960
right? It's just producing a

00:23:58.159 --> 00:24:03.200
statistically plausible sample from that

00:24:00.960 --> 00:24:05.679
distribution. And since we haven't

00:24:03.200 --> 00:24:06.960
forced it to obey constraints like five

00:24:05.679 --> 00:24:08.080
fingers and so on and so forth, it's not

00:24:06.960 --> 00:24:10.640
going to do any of that stuff. It's an

00:24:08.079 --> 00:24:11.918
unconstrained process. Now over time,

00:24:10.640 --> 00:24:14.000
these things have gotten better and

00:24:11.919 --> 00:24:15.840
better and that's because the data has

00:24:14.000 --> 00:24:17.599
gotten better to your point. But I think

00:24:15.839 --> 00:24:19.599
our approach to doing these things is

00:24:17.599 --> 00:24:21.839
also getting better, right? There are

00:24:19.599 --> 00:24:23.839
lots of ways to now steer it and control

00:24:21.839 --> 00:24:25.199
it so it behaves the right way. And that

00:24:23.839 --> 00:24:27.278
is actually part of what's happening as

00:24:25.200 --> 00:24:29.200
well. So when we talk about how do you

00:24:27.278 --> 00:24:30.720
actually give a text prompt and have it

00:24:29.200 --> 00:24:32.558
build the image for that particular

00:24:30.720 --> 00:24:35.519
prompt, we would we'll revisit this

00:24:32.558 --> 00:24:38.319
question. Um okay, there was there were

00:24:35.519 --> 00:24:40.480
more questions. Yeah.

00:24:38.319 --> 00:24:42.240
>> Is there some randomness in the model

00:24:40.480 --> 00:24:44.558
itself? Right. So if you gave it the

00:24:42.240 --> 00:24:47.519
same noise image twice, will it actually

00:24:44.558 --> 00:24:49.440
produce the same final image or will it

00:24:47.519 --> 00:24:49.839
>> Yeah, there is randomness in the process

00:24:49.440 --> 00:24:53.679
as well.

00:24:49.839 --> 00:24:56.720
>> In the process, exactly.

00:24:53.679 --> 00:24:59.360
Um, so to actually that's a really good

00:24:56.720 --> 00:25:02.960
point, but now I'm afraid to open my

00:24:59.359 --> 00:25:04.719
laptop. I'm on an iPad. One second.

00:25:02.960 --> 00:25:06.880
All right.

00:25:04.720 --> 00:25:10.798
Okay. So, what's going on here is that

00:25:06.880 --> 00:25:13.840
if you um go to this thing

00:25:10.798 --> 00:25:16.319
so I talked about we are transforming

00:25:13.839 --> 00:25:18.720
from here to some crazy distribution

00:25:16.319 --> 00:25:20.240
here, right? So, what happens that let's

00:25:18.720 --> 00:25:22.480
say that this is the starting point for

00:25:20.240 --> 00:25:25.120
the noise input. This is your noise

00:25:22.480 --> 00:25:28.079
input and then what it does what you

00:25:25.119 --> 00:25:29.918
actually do is you go here

00:25:28.079 --> 00:25:33.759
and then you take this point and then

00:25:29.919 --> 00:25:35.759
you do a small sample next to it. So you

00:25:33.759 --> 00:25:37.519
use this as like the mean value and then

00:25:35.759 --> 00:25:39.119
sample around it and that's actually

00:25:37.519 --> 00:25:40.720
what gets published in the user

00:25:39.119 --> 00:25:42.959
interface. That's where the randomness

00:25:40.720 --> 00:25:47.640
comes in.
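
NOTE
A small sketch of where this extra randomness can enter, reusing the denoiser and x from the earlier sketches: the denoiser's output is treated as a mean and a little fresh noise is sampled around it at each step (the 0.05 scale is illustrative):
mean = denoiser.predict(x, verbose=0)
x = mean + np.random.normal(0.0, 0.05, size=mean.shape)   # two runs from the same noise input now diverge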

00:25:42.960 --> 00:25:47.640
Okay. So um

00:25:48.319 --> 00:25:52.480
so back to this was there another

00:25:49.919 --> 00:25:53.600
question somewhere.

00:25:52.480 --> 00:25:56.400
>> Yeah.

00:25:53.599 --> 00:25:59.359
>> Um it's okay.

00:25:56.400 --> 00:26:02.080
>> Uh I was just wondering about the when

00:25:59.359 --> 00:26:05.359
going when training on a on a clear

00:26:02.079 --> 00:26:08.240
picture to go to a noisy image uh to

00:26:05.359 --> 00:26:10.399
pull from a random sample like random

00:26:08.240 --> 00:26:12.240
this sample probably pseudo random. I

00:26:10.400 --> 00:26:13.840
was just wondering if it's like learning

00:26:12.240 --> 00:26:16.240
relationships that are dependent on

00:26:13.839 --> 00:26:19.278
pseudo randomness and so when it goes

00:26:16.240 --> 00:26:22.000
from a noisy image back to pure image

00:26:19.278 --> 00:26:22.400
it's dependent on that or it matters at

00:26:22.000 --> 00:26:23.839
all.

00:26:22.400 --> 00:26:24.960
>> Oh I see. So if I understand your

00:26:23.839 --> 00:26:27.119
question what you're saying is that it's

00:26:24.960 --> 00:26:29.600
pseudo random not actually random

00:26:27.119 --> 00:26:32.239
>> and so therefore there is some signal in

00:26:29.599 --> 00:26:34.480
the supposedly random generation is it

00:26:32.240 --> 00:26:37.120
actually glomming onto that signal right

00:26:34.480 --> 00:26:38.798
is the question. Theoretically, it's

00:26:37.119 --> 00:26:40.158
probably possible, but in practice, it

00:26:38.798 --> 00:26:42.400
really doesn't matter because we

00:26:40.159 --> 00:26:44.240
basically say random is good enough for

00:26:42.400 --> 00:26:47.120
our purposes. And in fact, in practice,

00:26:44.240 --> 00:26:48.720
you will see it's not an issue.

00:26:47.119 --> 00:26:52.079
Um,

00:26:48.720 --> 00:26:53.440
okay. So, oh yeah, go ahead.

00:26:52.079 --> 00:26:58.158
>> There's a quick question. when you're

00:26:53.440 --> 00:27:01.120
doing uh like text to text, let's say

00:26:58.159 --> 00:27:03.120
you're uh tokenizing the input, but here

00:27:01.119 --> 00:27:06.558
you somehow have to identify that this

00:27:03.119 --> 00:27:09.119
is Killian Court and like a stately home

00:27:06.558 --> 00:27:13.119
and this is just going from pixel image

00:27:09.119 --> 00:27:16.319
to or like decoding a pixel image. Um

00:27:13.119 --> 00:27:20.399
where does the tag or tokenization

00:27:16.319 --> 00:27:21.839
of like columns or fingernails or like

00:27:20.400 --> 00:27:23.200
>> There isn't any. It's learning everything

00:27:21.839 --> 00:27:23.918
from the pixel values.

00:27:23.200 --> 00:27:25.600
>> Everything.

00:27:23.919 --> 00:27:27.120
>> Yeah. And this is sort of what I was,

00:27:25.599 --> 00:27:28.480
you know, when I when Ike asked the

00:27:27.119 --> 00:27:30.798
question about the four fingers, five

00:27:28.480 --> 00:27:33.038
fingers thing, it has no idea of

00:27:30.798 --> 00:27:34.319
fingers. It has zero knowledge about any

00:27:33.038 --> 00:27:36.319
of these things. All it's seeing is a

00:27:34.319 --> 00:27:38.558
bunch of photographs.

00:27:36.319 --> 00:27:40.639
>> Okay. So when you when you type in say I

00:27:38.558 --> 00:27:42.960
want a hand with green.

00:27:40.640 --> 00:27:44.880
>> Oh, I see. So we haven't yet come to the

00:27:42.960 --> 00:27:47.120
stage of okay, how do you actually steer

00:27:44.880 --> 00:27:48.080
this image using your text prompt? It's

00:27:47.119 --> 00:27:49.519
coming

00:27:48.079 --> 00:27:51.278
>> right now. All we're saying is that

00:27:49.519 --> 00:27:52.960
look, I'm going to give you a bunch of

00:27:51.278 --> 00:27:55.119
uh photographs of a particular kind of

00:27:52.960 --> 00:27:56.480
thing, stately college buildings and I

00:27:55.119 --> 00:27:58.239
want to have a model which at the end of

00:27:56.480 --> 00:27:59.360
the day I just poke it. Every time I

00:27:58.240 --> 00:28:01.278
poke it, it gives me a stately college

00:27:59.359 --> 00:28:02.879
building. That's it. Now I'm going to

00:28:01.278 --> 00:28:04.558
actually start giving it text and saying

00:28:02.880 --> 00:28:06.320
okay build the you know create the thing

00:28:04.558 --> 00:28:08.398
I'm just telling you about that's coming

00:28:06.319 --> 00:28:12.000
and that's sort of some additional magic

00:28:08.398 --> 00:28:14.558
is going on to get that done. U okay so

00:28:12.000 --> 00:28:16.720
this is what we have u and this is

00:28:14.558 --> 00:28:18.158
called a diffusion model. Okay. And this

00:28:16.720 --> 00:28:21.519
is the original paper that figured this

00:28:18.159 --> 00:28:24.799
out. Um, and

00:28:21.519 --> 00:28:26.639
the process of actually

00:28:24.798 --> 00:28:28.639
taking an image and creating noisy

00:28:26.640 --> 00:28:30.880
versions of it to create training data

00:28:28.640 --> 00:28:32.480
is called the forward process. And then

00:28:30.880 --> 00:28:34.399
what we did in reverse is called the

00:28:32.480 --> 00:28:35.839
reverse process. Uh, check out the

00:28:34.398 --> 00:28:38.479
paper. It's actually really well

00:28:35.839 --> 00:28:40.879
written. Uh, and I recommend it. Now, in

00:28:38.480 --> 00:28:42.960
practice, uh, some other researchers

00:28:40.880 --> 00:28:45.440
came along shortly after this and made a

00:28:42.960 --> 00:28:46.558
small improvement. It turns out to be

00:28:45.440 --> 00:28:48.240
actually a big improvement in practice

00:28:46.558 --> 00:28:50.000
in terms of improving the quality of

00:28:48.240 --> 00:28:52.000
what's being produced. And so what they

00:28:50.000 --> 00:28:53.919
said is hey instead of training the

00:28:52.000 --> 00:28:55.919
model to predict the less noisy version

00:28:53.919 --> 00:28:58.640
of the image we actually ask it to

00:28:55.919 --> 00:29:01.360
predict just the noise

00:28:58.640 --> 00:29:03.038
in the input and then we will just

00:29:01.359 --> 00:29:05.839
simply subtract the noise from the input

00:29:03.038 --> 00:29:08.158
to get the image. So instead of saying

00:29:05.839 --> 00:29:10.000
here is an X, X is an image, Y is the less

00:29:08.159 --> 00:29:12.159
noisy image, we actually tell it here is

00:29:10.000 --> 00:29:14.398
an image here is the noise that we added

00:29:12.159 --> 00:29:16.000
to X to get the noisy version and

00:29:14.398 --> 00:29:17.759
then just predict the noise for me and

00:29:16.000 --> 00:29:19.119
then once I get it I just do X minus

00:29:17.759 --> 00:29:21.919
noise and I get the less noisy version

00:29:19.119 --> 00:29:24.398
of the image. Okay, this feels

00:29:21.919 --> 00:29:26.240
arithmetically equivalent but in

00:29:24.398 --> 00:29:28.000
practice it ends up generating much

00:29:26.240 --> 00:29:29.278
higher quality images and there's some

00:29:28.000 --> 00:29:31.038
very interesting theory as to why that

00:29:29.278 --> 00:29:33.278
works and so on and so forth and you can

00:29:31.038 --> 00:29:34.798
read this paper if you're interested.
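
NOTE
A minimal sketch of the noise-prediction variant described here; the names are illustrative, and clipping is skipped so the target really is the exact noise that was added:
import numpy as np
noise = np.random.normal(0.0, 0.3, size=img.shape)   # the noise we add
x_noisy = img + noise                                # the noisy input the model sees
# training pair: input is x_noisy, target is the noise itself
# noise_model.fit(x_noisy[None], noise[None], ...)
# at sampling time: predicted = noise_model.predict(x_noisy[None])[0]; less_noisy = x_noisy - predicted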

00:29:33.278 --> 00:29:36.960
Okay, so if you actually look at what's

00:29:34.798 --> 00:29:38.558
going on in most diffusion models today,

00:29:36.960 --> 00:29:40.480
they're basically using an approach like

00:29:38.558 --> 00:29:41.599
this. They're actually predicting each

00:29:40.480 --> 00:29:43.919
time they predict noise and take it

00:29:41.599 --> 00:29:47.119
away, subtract it. So iterative

00:29:43.919 --> 00:29:49.679
subtracting of predicted noise.

00:29:47.119 --> 00:29:52.879
That's what's going on. So all right, so

00:29:49.679 --> 00:29:55.919
that's what we have. U now at this point

00:29:52.880 --> 00:29:57.520
you may be wondering, okay, so far in

00:29:55.919 --> 00:29:59.759
the semester, uh we have actually

00:29:57.519 --> 00:30:01.200
learned how to take an image and then

00:29:59.759 --> 00:30:03.278
classify it into one of you know 20

00:30:01.200 --> 00:30:05.360
things, 10 things, whatever. We also

00:30:03.278 --> 00:30:07.200
taken text and figured out what to do

00:30:05.359 --> 00:30:09.439
things with it. We haven't yet talked

00:30:07.200 --> 00:30:11.519
about how do you actually take an image

00:30:09.440 --> 00:30:13.440
and how can we get the output also

00:30:11.519 --> 00:30:16.240
to be another image. We haven't done

00:30:13.440 --> 00:30:18.798
that yet. Okay. So we have actually not

00:30:16.240 --> 00:30:20.240
done image to image. How do you actually

00:30:18.798 --> 00:30:22.879
build a neural network to do image to

00:30:20.240 --> 00:30:23.919
images? And in the interest of time

00:30:22.880 --> 00:30:25.360
you're not going to get into it

00:30:23.919 --> 00:30:29.440
massively but I want to give you a quick

00:30:25.359 --> 00:30:31.759
idea of how it works. So the most

00:30:29.440 --> 00:30:33.440
sort of I would say the dominant

00:30:31.759 --> 00:30:35.519
architecture

00:30:33.440 --> 00:30:36.960
to take an in image as an input and

00:30:35.519 --> 00:30:39.359
produce an image as an output is called

00:30:36.960 --> 00:30:42.880
the U-Net. Okay. And that's the

00:30:39.359 --> 00:30:45.439
architecture we see here. So

00:30:42.880 --> 00:30:47.039
so fundamentally if you look at the left

00:30:45.440 --> 00:30:48.640
half so there's a left half to the

00:30:47.038 --> 00:30:50.640
network and a right half to the network

00:30:48.640 --> 00:30:53.360
hence the U. If you look at the left

00:30:50.640 --> 00:30:55.200
half of the network, it's a good old

00:30:53.359 --> 00:30:58.558
convolutional neural network like the

00:30:55.200 --> 00:31:00.319
kind we know and love. Okay. The

00:30:58.558 --> 00:31:02.480
kind that we are very familiar with. So

00:31:00.319 --> 00:31:04.720
you take an input image and then you run

00:31:02.480 --> 00:31:07.599
it through a bunch of convolutional

00:31:04.720 --> 00:31:09.919
blocks and then we do

00:31:07.599 --> 00:31:11.759
some max pooling and then we keep on

00:31:09.919 --> 00:31:13.919
doing it and at some point it becomes

00:31:11.759 --> 00:31:15.839
smaller and smaller and we get something

00:31:13.919 --> 00:31:17.919
you know like this which we are very

00:31:15.839 --> 00:31:20.000
familiar with right the the big image

00:31:17.919 --> 00:31:21.200
with three channels gets smaller and

00:31:20.000 --> 00:31:22.640
smaller smaller but the number of

00:31:21.200 --> 00:31:24.399
channels gets wider and wider. it

00:31:22.640 --> 00:31:26.960
becomes sort of much smaller but much

00:31:24.398 --> 00:31:29.119
deeper right it becomes like a 3D volume

00:31:26.960 --> 00:31:31.200
and we have seen that again and again

00:31:29.119 --> 00:31:33.278
right the left part is just a good old

00:31:31.200 --> 00:31:35.519
convolutional with pooling layers and

00:31:33.278 --> 00:31:37.599
then you come to the middle and then

00:31:35.519 --> 00:31:40.480
from this point on what we do is we take

00:31:37.599 --> 00:31:43.038
whatever this thing here and then we

00:31:40.480 --> 00:31:44.720
essentially reverse the process we go

00:31:43.038 --> 00:31:46.960
from the small things which are really

00:31:44.720 --> 00:31:49.038
deep to slightly bigger things that are

00:31:46.960 --> 00:31:50.880
a little less deep and so on and so

00:31:49.038 --> 00:31:54.879
forth till we get the original size back

00:31:50.880 --> 00:31:57.039
again. Okay. And we do that using

00:31:54.880 --> 00:31:59.360
an inverse of the convolution layer

00:31:57.038 --> 00:32:02.798
called an upconvolution or deconvolution

00:31:59.359 --> 00:32:05.119
layer. Okay. And you can check out 9.2

00:32:02.798 --> 00:32:07.759
in the textbook to understand how

00:32:05.119 --> 00:32:09.278
it's done. It's also called

00:32:07.759 --> 00:32:12.079
Conv2DTranspose.

00:32:09.278 --> 00:32:13.440
Okay. It's a very similar idea and I'm

00:32:12.079 --> 00:32:15.119
not going to get into the details here

00:32:13.440 --> 00:32:17.038
but you essentially do an inverse of a

00:32:15.119 --> 00:32:19.119
convolutional operation to get the size

00:32:17.038 --> 00:32:22.480
to come back to the bigger size and you

00:32:19.119 --> 00:32:24.239
do it gradually till the output you have

00:32:22.480 --> 00:32:25.839
matches the size of the input that came

00:32:24.240 --> 00:32:27.919
in.
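As a rough sketch of that up-convolution step in PyTorch (the tensor sizes here are made up purely for illustration): a stride-2 transposed convolution takes a small, deep feature map and produces one that is twice as large spatially and shallower.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 16, 16)   # small but deep feature map: (batch, channels, height, width)

# the "inverse" of a strided convolution: a stride-2 transposed conv doubles the spatial size
up = nn.ConvTranspose2d(in_channels=64, out_channels=32, kernel_size=2, stride=2)

y = up(x)
print(y.shape)                   # torch.Size([1, 32, 32, 32]) -- bigger spatially, fewer channels
```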

00:32:25.839 --> 00:32:29.759
Okay, so image gets smaller and smaller

00:32:27.919 --> 00:32:31.440
into a thing and then you just blow it

00:32:29.759 --> 00:32:34.558
back up again to get an image back. So

00:32:31.440 --> 00:32:36.240
that is the U-Net. Now there's one

00:32:34.558 --> 00:32:39.599
very important thing that happens in the

00:32:36.240 --> 00:32:43.440
U-Net, right? Which is

00:32:39.599 --> 00:32:45.519
you see these connections, right?

00:32:43.440 --> 00:32:47.278
Basically, what they do is at every step

00:32:45.519 --> 00:32:50.960
when you're coming back up in the right

00:32:47.278 --> 00:32:53.038
half, you actually attach whatever was

00:32:50.960 --> 00:32:54.798
in sort of the mirror image of the

00:32:53.038 --> 00:32:56.720
original input as we processed on the

00:32:54.798 --> 00:32:59.200
left side, we attach it to this side as

00:32:56.720 --> 00:33:01.360
well. Remember I talked about this whole

00:32:59.200 --> 00:33:03.600
notion of a residual connection back,

00:33:01.359 --> 00:33:06.798
you know, many classes ago where I said

00:33:03.599 --> 00:33:09.278
when uh when an input goes through each

00:33:06.798 --> 00:33:10.960
layer of a neural network at one point,

00:33:09.278 --> 00:33:13.038
let's say you're in the 10th layer,

00:33:10.960 --> 00:33:14.399
you're only seeing what is the ninth

00:33:13.038 --> 00:33:16.079
layer is produced for you. That's all

00:33:14.398 --> 00:33:18.158
you're working with. But would it be

00:33:16.079 --> 00:33:19.839
nice if the the the 10th layer actually

00:33:18.159 --> 00:33:21.600
had access to the eighth layer, the

00:33:19.839 --> 00:33:23.439
seventh layer, the sixth layer, the

00:33:21.599 --> 00:33:25.439
fifth layer? Heck, why not the input,

00:33:23.440 --> 00:33:27.600
right? Because the more information it

00:33:25.440 --> 00:33:28.960
has, the more able it probably is to do

00:33:27.599 --> 00:33:31.278
whatever it can with the input it's

00:33:28.960 --> 00:33:33.200
given. Why restrict it to only the

00:33:31.278 --> 00:33:34.319
output of the previous

00:33:33.200 --> 00:33:36.080
layer? Why can't we give it everything

00:33:34.319 --> 00:33:37.918
that has come before it? Now giving

00:33:36.079 --> 00:33:40.158
everything is too much. But we can be

00:33:37.919 --> 00:33:41.919
selective in what we give it. Right? So

00:33:40.159 --> 00:33:44.240
what these folks decided I'm sure after

00:33:41.919 --> 00:33:46.799
much experimentation is that if they

00:33:44.240 --> 00:33:49.839
actually attach whatever was coming out

00:33:46.798 --> 00:33:51.440
of this layer to this layer before it

00:33:49.839 --> 00:33:53.278
goes through the output, it really

00:33:51.440 --> 00:33:55.360
helped. Similarly, this thing gets

00:33:53.278 --> 00:33:57.679
attached and so on and so forth. And it

00:33:55.359 --> 00:34:00.000
kind of makes sense. You know, why force

00:33:57.679 --> 00:34:01.440
it to figure out everything it has to

00:34:00.000 --> 00:34:03.919
figure out just from this thing that

00:34:01.440 --> 00:34:06.159
came in, right? Let's give this that

00:34:03.919 --> 00:34:07.759
that. Let's also give a little here, a

00:34:06.159 --> 00:34:09.358
little here. So, these residual

00:34:07.759 --> 00:34:10.800
connections are a huge building block

00:34:09.358 --> 00:34:14.159
for why these things work as well as

00:34:10.800 --> 00:34:15.599
they do. Okay? And in general, giving a

00:34:14.159 --> 00:34:17.760
layer as much information as you can

00:34:15.599 --> 00:34:19.200
give it is always a good idea, but you

00:34:17.760 --> 00:34:20.560
can't go nuts, right? Because then you

00:34:19.199 --> 00:34:22.078
have much more parameters and all kinds

00:34:20.559 --> 00:34:23.519
of stuff happens. So there is a bit of a

00:34:22.079 --> 00:34:25.760
balance you have to strike and this was

00:34:23.519 --> 00:34:27.918
the balance struck by these researchers.
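Putting the two halves and the skip connections together, here is a minimal one-level U-Net sketch in PyTorch; the channel counts and layer sizes are arbitrary illustrations, not the configuration from the original paper.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """One-level U-Net: conv -> pool -> conv -> up-conv -> concat skip -> conv."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)                        # image gets smaller
        self.mid = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)  # blow it back up
        # after concatenating the skip, channels are 32 (up) + 32 (skip) = 64
        self.dec = nn.Sequential(nn.Conv2d(64, 32, 3, padding=1), nn.ReLU())
        self.out = nn.Conv2d(32, 3, 1)                     # same spatial size as the input

    def forward(self, x):
        skip = self.enc(x)                 # left half: features at full resolution
        h = self.mid(self.pool(skip))      # bottom of the U: smaller but deeper
        h = self.up(h)                     # right half: back to full resolution
        h = torch.cat([h, skip], dim=1)    # the skip connection across the U
        return self.out(self.dec(h))

img = torch.randn(1, 3, 64, 64)
print(TinyUNet()(img).shape)               # torch.Size([1, 3, 64, 64])
```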

00:34:25.760 --> 00:34:30.399
And so this thing was originally

00:34:27.918 --> 00:34:32.559
invented for some medical segmentation

00:34:30.398 --> 00:34:35.358
use cases but it's just heavily used for

00:34:32.559 --> 00:34:39.599
everything now. It's a really powerful

00:34:35.358 --> 00:34:41.918
architecture. Uh questions

00:34:39.599 --> 00:34:44.240
>> uh can we have example of like in what

00:34:41.918 --> 00:34:46.559
scenarios we use this kind of

00:34:44.239 --> 00:34:49.439
>> anytime you have an image to image

00:34:46.559 --> 00:34:50.878
>> like what kind of conversion do you get

00:34:49.440 --> 00:34:52.559
image to image? or like what kind of

00:34:50.878 --> 00:34:54.078
examples of use cases. Let's say that

00:34:52.559 --> 00:34:55.759
for example you want to take an image

00:34:54.079 --> 00:34:58.000
and like a black and white image and you

00:34:55.760 --> 00:35:00.560
want to colorize it

00:34:58.000 --> 00:35:02.000
for instance, boom, U-Net. You want to

00:35:00.559 --> 00:35:04.239
take an image and make it a higher

00:35:02.000 --> 00:35:06.480
resolution image: U-Net. You want to take

00:35:04.239 --> 00:35:08.719
an image and for every pixel in the

00:35:06.480 --> 00:35:12.079
image you want to classify it into you

00:35:08.719 --> 00:35:14.480
know one of 10 things. So anytime when

00:35:12.079 --> 00:35:16.640
you want the output shape the shape of

00:35:14.480 --> 00:35:18.800
the output to be basically the same

00:35:16.639 --> 00:35:20.960
shape as the input but with other data

00:35:18.800 --> 00:35:23.960
you need to use this.

00:35:20.960 --> 00:35:23.960
Yeah.

00:35:25.199 --> 00:35:30.879
>> But this logic of having access to all

00:35:28.639 --> 00:35:31.838
the previous iterations

00:35:30.880 --> 00:35:33.519
>> not iterations

00:35:31.838 --> 00:35:35.440
>> all the previous layers

00:35:33.519 --> 00:35:40.639
>> right the outputs of the previous layers

00:35:35.440 --> 00:35:42.720
>> layers. Uh but this would also help uh

00:35:40.639 --> 00:35:44.239
clean up and give better categorization

00:35:42.719 --> 00:35:45.199
like does it always have to be an image

00:35:44.239 --> 00:35:47.679
to image?

00:35:45.199 --> 00:35:49.118
>> No. No. In fact, if you look at ResNet,

00:35:47.679 --> 00:35:50.639
ResNet is the one in fact that

00:35:49.119 --> 00:35:53.200
pioneered the idea of the residual

00:35:50.639 --> 00:35:56.078
connection. So we use it for ResNet. We

00:35:53.199 --> 00:35:58.319
actually use it in the transformer stack

00:35:56.079 --> 00:36:00.000
if you remember it goes through the self

00:35:58.320 --> 00:36:03.280
attention layer. It comes out the other

00:36:00.000 --> 00:36:05.920
end and then we add the input back to it

00:36:03.280 --> 00:36:07.599
and then we send it through the layer norm.

00:36:05.920 --> 00:36:08.800
So you will see that this residual

00:36:07.599 --> 00:36:11.599
connection is sitting in two different

00:36:08.800 --> 00:36:13.680
places in a single transformer block. So

00:36:11.599 --> 00:36:15.440
it's extremely heavily used. There is

00:36:13.679 --> 00:36:17.598
something called deep and wide network

00:36:15.440 --> 00:36:20.559
if I remember, or DenseNet, which uses the

00:36:17.599 --> 00:36:22.960
same trick. In fact if you when you're

00:36:20.559 --> 00:36:25.279
working with structured data right good

00:36:22.960 --> 00:36:26.720
old say linear regression and you've

00:36:25.280 --> 00:36:28.400
looked at your data and you come up with

00:36:26.719 --> 00:36:30.239
all kinds of very clever features you

00:36:28.400 --> 00:36:32.000
know I'm going to look at price per

00:36:30.239 --> 00:36:33.439
square foot right you do a bunch of

00:36:32.000 --> 00:36:36.239
feature engineering and you have a bunch

00:36:33.440 --> 00:36:38.559
of new features. Well, you should take

00:36:36.239 --> 00:36:40.719
your old features and your new features

00:36:38.559 --> 00:36:42.480
and send both in.

00:36:40.719 --> 00:36:43.679
Why send only the new stuff that you

00:36:42.480 --> 00:36:47.519
have concocted? Why can't you send

00:36:43.679 --> 00:36:52.919
everything in? That's the idea.
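For reference, the two residual connections in a single transformer block mentioned above look roughly like this (a post-norm sketch with placeholder sizes, not any specific library's block):

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # residual connection #1: the input goes through self-attention, gets added
        # back to the input, and the sum goes through the layer norm
        x = self.norm1(x + self.attn(x, x, x)[0])
        # residual connection #2: the same trick around the feed-forward sub-layer
        x = self.norm2(x + self.mlp(x))
        return x
```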

00:36:47.519 --> 00:36:52.920
All right. Um, so let's come back here.

00:36:53.039 --> 00:36:57.599
Now we have seen how to generate a good

00:36:54.559 --> 00:36:59.279
image. Okay. Now let's figure out how to

00:36:57.599 --> 00:37:00.800
steer it or condition it with a text

00:36:59.280 --> 00:37:02.720
prompt, right? Because that's sort of

00:37:00.800 --> 00:37:05.920
the holy grail.

00:37:02.719 --> 00:37:08.480
So we want to take

00:37:05.920 --> 00:37:09.838
so here's some intuition. We want to

00:37:08.480 --> 00:37:11.920
take the text prompt into account and

00:37:09.838 --> 00:37:14.719
obviously generate the image. Now

00:37:11.920 --> 00:37:16.800
imagine if we had like a rough image

00:37:14.719 --> 00:37:18.879
that corresponds to the text prompt.

00:37:16.800 --> 00:37:21.359
Just imagine. So the text prompt is you

00:37:18.880 --> 00:37:22.880
know, cute Labrador retriever, and you

00:37:21.358 --> 00:37:24.159
have like a very noisy image of a

00:37:22.880 --> 00:37:26.720
Labrador retriever. This just happens

00:37:24.159 --> 00:37:28.000
to be handy. You have it. Well now

00:37:26.719 --> 00:37:30.239
you're in good shape because you just

00:37:28.000 --> 00:37:32.159
feed that in and your system will denoise

00:37:30.239 --> 00:37:34.319
it for you. Right? Right? You can get a

00:37:32.159 --> 00:37:36.000
better image. That's pretty easy. So,

00:37:34.320 --> 00:37:37.280
but obviously in reality, you don't have

00:37:36.000 --> 00:37:38.480
a rough image. In fact, you're trying to

00:37:37.280 --> 00:37:41.599
create one of those things in the first

00:37:38.480 --> 00:37:45.199
place. We don't. So, but what if we had

00:37:41.599 --> 00:37:47.599
an embedding for the prompt that's close

00:37:45.199 --> 00:37:49.199
to the embeddings of all the images that

00:37:47.599 --> 00:37:52.160
correspond to the prompt. So, let's take

00:37:49.199 --> 00:37:54.239
a prompt and let's imagine all the

00:37:52.159 --> 00:37:57.199
images in the in the universe that

00:37:54.239 --> 00:37:58.959
correspond to that prompt. Okay?

00:37:57.199 --> 00:38:00.319
And now further imagine because

00:37:58.960 --> 00:38:02.559
everything is a vector. Everything is

00:38:00.320 --> 00:38:04.559
embedding in our world that that image

00:38:02.559 --> 00:38:06.639
has an embedding.

00:38:04.559 --> 00:38:09.920
Ah, sorry, the text prompt has an

00:38:06.639 --> 00:38:12.559
embedding. Every image has an embedding

00:38:09.920 --> 00:38:14.720
and we have somehow calculated these

00:38:12.559 --> 00:38:17.679
embeddings so that the text prompts

00:38:14.719 --> 00:38:20.000
embedding is smack where all the image

00:38:17.679 --> 00:38:21.279
embeddings are.

00:38:20.000 --> 00:38:23.920
We will get to how we actually do it in

00:38:21.280 --> 00:38:26.000
a in just a moment. But conceptually

00:38:23.920 --> 00:38:28.159
imagine if we had an embedding if you

00:38:26.000 --> 00:38:30.079
could calculate embeddings for text and

00:38:28.159 --> 00:38:32.239
embeddings for images. So they all live

00:38:30.079 --> 00:38:36.000
in the same space.

00:38:32.239 --> 00:38:39.118
Okay. So if we feed this embedding to a

00:38:36.000 --> 00:38:41.920
denoising model because that text

00:38:39.119 --> 00:38:44.320
embedding is sitting in the same space

00:38:41.920 --> 00:38:47.280
as all the image embeddings that it

00:38:44.320 --> 00:38:49.119
corresponds to. Maybe our model can just

00:38:47.280 --> 00:38:51.599
denoise that embedding and give you

00:38:49.119 --> 00:38:54.320
what you want.

00:38:51.599 --> 00:38:55.680
Okay, so since this embedding is already

00:38:54.320 --> 00:38:57.200
close to the embeddings of the things we

00:38:55.679 --> 00:38:59.199
want to generate, maybe you'll just get

00:38:57.199 --> 00:39:00.639
it done.

00:38:59.199 --> 00:39:02.559
So ultimately we want to generate an

00:39:00.639 --> 00:39:03.920
image and if we had an embedding for

00:39:02.559 --> 00:39:07.119
that image, we could generate the image

00:39:03.920 --> 00:39:09.599
from the embedding and we use the text.

00:39:07.119 --> 00:39:11.039
So we go from text to embedding which

00:39:09.599 --> 00:39:12.640
happens to live in the same space as all

00:39:11.039 --> 00:39:14.000
the embeddings of the images we care

00:39:12.639 --> 00:39:15.759
about. And then from that image

00:39:14.000 --> 00:39:18.320
embedding, we go to the final image.

00:39:15.760 --> 00:39:20.079
Okay, this is a bunch of me talking and

00:39:18.320 --> 00:39:22.000
handwaving. it'll all become very clear

00:39:20.079 --> 00:39:25.200
but that's sort of the rough intuition.

00:39:22.000 --> 00:39:26.960
Okay. So, so what we'll know is we'll

00:39:25.199 --> 00:39:29.598
describe an approach to calculate an

00:39:26.960 --> 00:39:31.920
embedding for any text any piece of text

00:39:29.599 --> 00:39:34.160
that is close to the embeddings of the

00:39:31.920 --> 00:39:36.720
images that correspond to that piece of

00:39:34.159 --> 00:39:38.639
text. So this is the problem we're going

00:39:36.719 --> 00:39:39.838
to solve. There's a bunch of text

00:39:38.639 --> 00:39:42.000
conceptually there are a whole bunch of

00:39:39.838 --> 00:39:43.920
images that are describe that text and

00:39:42.000 --> 00:39:46.719
we're going to now create embeddings so

00:39:43.920 --> 00:39:48.880
that that is close to all the embeddings

00:39:46.719 --> 00:39:50.399
of those images. Right? It feels kind of

00:39:48.880 --> 00:39:52.480
like almost impossible that you can

00:39:50.400 --> 00:39:56.000
actually do something like this, but

00:39:52.480 --> 00:39:58.000
there's a very clever idea uh that

00:39:56.000 --> 00:39:59.838
OpenAI came up with that tells you how

00:39:58.000 --> 00:40:02.000
to do it. So, here's what we're going to

00:39:59.838 --> 00:40:05.199
do. So, let's say we have an image and a

00:40:02.000 --> 00:40:08.559
caption. So, here's an image. Uh here's

00:40:05.199 --> 00:40:10.879
a caption, right? And we need some way

00:40:08.559 --> 00:40:12.320
to take that piece of text and run it

00:40:10.880 --> 00:40:15.039
through some network and create a nice

00:40:12.320 --> 00:40:16.160
embedding from it. Okay? Similarly, we

00:40:15.039 --> 00:40:17.279
want to take this image, run it through

00:40:16.159 --> 00:40:19.358
some network and create an embedding

00:40:17.280 --> 00:40:20.800
from it. Okay. Now, first first

00:40:19.358 --> 00:40:22.719
question, how can we compute embeddings

00:40:20.800 --> 00:40:23.920
from a piece of text? First question,

00:40:22.719 --> 00:40:27.838
how can we compute an embedding from a

00:40:23.920 --> 00:40:30.159
piece of text? You know the answer.

00:40:27.838 --> 00:40:34.480
Run through a transformer. Piece of

00:40:30.159 --> 00:40:35.598
cake. We know how to do that, right? Uh,

00:40:34.480 --> 00:40:37.599
right in particular, you can do

00:40:35.599 --> 00:40:38.720
something like BERT. And for an image

00:40:37.599 --> 00:40:41.039
encoder, you just run it through

00:40:38.719 --> 00:40:42.959
something like ResNet, like the

00:40:41.039 --> 00:40:44.800
penultimate layer, right? one of the

00:40:42.960 --> 00:40:46.159
final layer is going to be a very good

00:40:44.800 --> 00:40:48.720
representation of that image. You get

00:40:46.159 --> 00:40:50.480
another embedding. So using the building

00:40:48.719 --> 00:40:52.480
blocks we already know, we can create

00:40:50.480 --> 00:40:55.358
embeddings very quickly from these

00:40:52.480 --> 00:40:56.639
things. Okay, but if you just take a

00:40:55.358 --> 00:40:58.000
piece of text and run it through BERT

00:40:56.639 --> 00:40:59.199
and you take an image and run it through

00:40:58.000 --> 00:41:01.679
ResNet, you're going to get some

00:40:59.199 --> 00:41:04.639
embeddings. But why the heck should they

00:41:01.679 --> 00:41:06.159
be related?

00:41:04.639 --> 00:41:08.239
They were not trained together. So

00:41:06.159 --> 00:41:10.239
there's no basis for them to be related.

00:41:08.239 --> 00:41:11.919
They would just be some two embeddings.

00:41:10.239 --> 00:41:13.439
Maybe they are kind of similar. Maybe

00:41:11.920 --> 00:41:14.639
they're not. We don't know. There's no

00:41:13.440 --> 00:41:16.880
reason to expect that they're going to

00:41:14.639 --> 00:41:19.879
be similar. Okay, they're just two

00:41:16.880 --> 00:41:19.880
embeddings.

00:41:20.239 --> 00:41:24.399
Now, what we want to do is but once we

00:41:22.960 --> 00:41:26.000
have these, we need to make sure the

00:41:24.400 --> 00:41:27.838
embeddings that comes out of these two

00:41:26.000 --> 00:41:30.838
things satisfy two very important

00:41:27.838 --> 00:41:30.838
requirements.

00:41:32.159 --> 00:41:35.279
We want to make sure that if you give it

00:41:33.599 --> 00:41:39.119
an image

00:41:35.280 --> 00:41:40.480
and a caption that describes that image.

00:41:39.119 --> 00:41:42.318
So you have an image and a caption that

00:41:40.480 --> 00:41:43.838
describes that image, we want to make

00:41:42.318 --> 00:41:45.920
sure that the embeddings that come out

00:41:43.838 --> 00:41:47.920
of these two boxes, they are as close to

00:41:45.920 --> 00:41:50.240
each other as possible.

00:41:47.920 --> 00:41:51.680
Okay? Given an image and a

00:41:50.239 --> 00:41:53.199
caption that describes it, that's the

00:41:51.679 --> 00:41:56.239
connection. They have to be close to

00:41:53.199 --> 00:41:58.399
each other. And conversely, if you have

00:41:56.239 --> 00:42:00.479
an image and a caption that's totally

00:41:58.400 --> 00:42:02.318
irrelevant,

00:42:00.480 --> 00:42:03.920
right? A train rounding a bend with a

00:42:02.318 --> 00:42:05.519
beautiful fall foliage all around,

00:42:03.920 --> 00:42:08.000
right? Clearly irrelevant. Those

00:42:05.519 --> 00:42:10.639
embeddings should be far apart.

00:42:08.000 --> 00:42:12.559
for this to really make sense,

00:42:10.639 --> 00:42:13.759
right? Pairs of related things should be

00:42:12.559 --> 00:42:16.400
together, irrelevant things should be

00:42:13.760 --> 00:42:18.640
far apart. So if you can find embeddings

00:42:16.400 --> 00:42:23.039
that satisfy these two criteria, maybe

00:42:18.639 --> 00:42:24.719
we will be in the game. Okay. So now

00:42:23.039 --> 00:42:26.159
this ensures that the text embedding and

00:42:24.719 --> 00:42:28.480
the image embedding are referring to the

00:42:26.159 --> 00:42:31.199
same underlying concept. Right? This

00:42:28.480 --> 00:42:32.960
these requirements will enforce that. Uh

00:42:31.199 --> 00:42:34.879
and so the embedding for any text prompt

00:42:32.960 --> 00:42:38.559
is close to the embedding for all the

00:42:34.880 --> 00:42:41.358
images that correspond to that prompt.

00:42:38.559 --> 00:42:43.039
So the question is how do we do this? Uh

00:42:41.358 --> 00:42:44.400
how can first of all how can we tell how

00:42:43.039 --> 00:42:47.199
close two embeddings are? You know the

00:42:44.400 --> 00:42:49.280
answer to this what's the answer

00:42:47.199 --> 00:42:51.838
>> correct cosine similarity right? We use

00:42:49.280 --> 00:42:54.160
the cosine similarity of the embeddings.
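Concretely, the cosine similarity is just the dot product of the two embeddings after normalizing each one to unit length; a quick sketch with a made-up embedding size:

```python
import torch
import torch.nn.functional as F

text_emb = torch.randn(512)    # embedding of the caption
image_emb = torch.randn(512)   # embedding of the image

sim = F.cosine_similarity(text_emb, image_emb, dim=0)  # in [-1, 1]; near 1 means "close"
```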

00:42:51.838 --> 00:42:55.519
Uh so we know how to measure closeness.

00:42:54.159 --> 00:42:56.719
So the question is how can we compute

00:42:55.519 --> 00:42:59.759
embeddings that satisfy the two

00:42:56.719 --> 00:43:02.000
requirements and openai uh built a model

00:42:59.760 --> 00:43:04.240
called clip which is very famous uh to

00:43:02.000 --> 00:43:07.119
solve this problem right it stands for

00:43:04.239 --> 00:43:08.959
contrastive language image pre-training

00:43:07.119 --> 00:43:10.160
uh and this forms the basis for a whole

00:43:08.960 --> 00:43:12.240
bunch of models that have sprung up

00:43:10.159 --> 00:43:13.358
after this called blip and blip 2 and so

00:43:12.239 --> 00:43:15.279
on and so forth but this is the

00:43:13.358 --> 00:43:17.358
fundamental idea

00:43:15.280 --> 00:43:20.319
okay so

00:43:17.358 --> 00:43:25.838
this is how clip works we uh what they

00:43:20.318 --> 00:43:28.639
did is they took a 12 layer, 8

00:43:25.838 --> 00:43:30.559
head transformer causal encoder stack as

00:43:28.639 --> 00:43:33.199
a text encoder

00:43:30.559 --> 00:43:35.119
uh okay now you understand this right

00:43:33.199 --> 00:43:36.719
that's what it is eight layer I mean

00:43:35.119 --> 00:43:39.838
sorry 8 head 12 layer transformer causal

00:43:36.719 --> 00:43:41.358
encoder TC stack um and that's a

00:43:39.838 --> 00:43:43.679
text encoder so we send any piece of

00:43:41.358 --> 00:43:45.679
text through it right you get the next

00:43:43.679 --> 00:43:48.000
word prediction embedding and that's the

00:43:45.679 --> 00:43:50.318
embedding you're going to use uh and

00:43:48.000 --> 00:43:53.039
they took ResNet-50 and made it the

00:43:50.318 --> 00:43:55.679
image encoder. They took ResNet-50, chopped

00:43:53.039 --> 00:43:59.119
off the top and whatever was left is the

00:43:55.679 --> 00:44:00.960
the image encoder. Okay,

00:43:59.119 --> 00:44:03.760
then they initialized these things with

00:44:00.960 --> 00:44:05.920
random weights and then they

00:44:03.760 --> 00:44:07.599
grabbed a batch of image

00:44:05.920 --> 00:44:09.838
caption pairs. So in this example, let's

00:44:07.599 --> 00:44:11.519
say that we have these three images u

00:44:09.838 --> 00:44:14.960
and I have captions to go with these

00:44:11.519 --> 00:44:18.079
images. Okay, we have these three things

00:44:14.960 --> 00:44:20.559
and this is the key step. They run the

00:44:18.079 --> 00:44:22.000
images through the image encoder and the

00:44:20.559 --> 00:44:23.920
captions through the text encoder and

00:44:22.000 --> 00:44:26.318
get these embeddings. Okay, it's a

00:44:23.920 --> 00:44:29.838
forward pass. You send it through this

00:44:26.318 --> 00:44:32.400
network, you get two embeddings. Um, and

00:44:29.838 --> 00:44:34.078
then this is what they do. With these

00:44:32.400 --> 00:44:36.480
embeddings, they calculate the cosine

00:44:34.079 --> 00:44:38.800
similarity for every image caption pair.

00:44:36.480 --> 00:44:41.519
Okay? And so imagine something like

00:44:38.800 --> 00:44:43.519
this. So you have these three captions,

00:44:41.519 --> 00:44:45.440
you have these three images, and those

00:44:43.519 --> 00:44:47.599
are the embeddings.

00:44:45.440 --> 00:44:49.039
uh and then they calculate the cosine

00:44:47.599 --> 00:44:51.039
similarity for every one of those

00:44:49.039 --> 00:44:52.639
things.

00:44:51.039 --> 00:44:57.639
It took me like 5 minutes or 10 minutes

00:44:52.639 --> 00:44:57.639
to do this PowerPoint. You're welcome.

00:45:00.719 --> 00:45:05.519
Particularly trying to get this comma to

00:45:02.400 --> 00:45:08.559
line up is a real pain in the neck. So,

00:45:05.519 --> 00:45:11.519
so all right. So, we have this here.

00:45:08.559 --> 00:45:13.679
Okay. And now what we want to do is uh

00:45:11.519 --> 00:45:16.480
we want these scores to be as high as

00:45:13.679 --> 00:45:18.480
possible, right? Because the scores in

00:45:16.480 --> 00:45:21.679
the diagonal are the ones where for the

00:45:18.480 --> 00:45:23.358
matching picture and caption,

00:45:21.679 --> 00:45:24.960
right?

00:45:23.358 --> 00:45:26.799
Those are

00:45:24.960 --> 00:45:28.480
the scores for the matching pairs of

00:45:26.800 --> 00:45:30.318
embeddings. We want them to be as high

00:45:28.480 --> 00:45:32.880
as possible.

00:45:30.318 --> 00:45:35.199
Okay. Um

00:45:32.880 --> 00:45:37.599
so so we want to maximize the sum of the

00:45:35.199 --> 00:45:40.960
green cells, right? These are the green

00:45:37.599 --> 00:45:42.160
cells the diagonal. So, so if I if you

00:45:40.960 --> 00:45:43.280
want to write it as a loss function

00:45:42.159 --> 00:45:46.000
because the loss function is always

00:45:43.280 --> 00:45:50.000
minimization, we basically say minimize

00:45:46.000 --> 00:45:52.800
the negative sum of the green cells.

00:45:50.000 --> 00:45:56.440
Okay, so the question is would this loss

00:45:52.800 --> 00:45:56.440
function do the trick?

00:45:58.800 --> 00:46:03.039
Seems reasonable. You want to make sure

00:46:00.960 --> 00:46:07.159
the related things are really close

00:46:03.039 --> 00:46:07.159
together. So you want to maximize

00:46:07.760 --> 00:46:10.640
uh if that was the only part of the loss

00:46:09.280 --> 00:46:12.480
function, wouldn't it just kind of

00:46:10.639 --> 00:46:13.358
squish everything to the same spot in

00:46:12.480 --> 00:46:14.960
the space?

00:46:13.358 --> 00:46:16.880
>> Correct.

00:46:14.960 --> 00:46:20.159
What it's going to do is it's going to

00:46:16.880 --> 00:46:21.838
basically ignore the input.

00:46:20.159 --> 00:46:24.559
The optimizer can simply ignore the

00:46:21.838 --> 00:46:25.920
input, make all the embeddings the same.

00:46:24.559 --> 00:46:28.480
For example, it can just make all the

00:46:25.920 --> 00:46:30.318
embedding zero.

00:46:28.480 --> 00:46:32.000
That's it. And then now we have a

00:46:30.318 --> 00:46:35.039
perfect cosine similarity for

00:46:32.000 --> 00:46:36.400
everything. For any pair of image and

00:46:35.039 --> 00:46:38.880
captions, the cosine similarity is going

00:46:36.400 --> 00:46:41.440
to be one. It's perfect, right? So

00:46:38.880 --> 00:46:44.318
clearly that's not enough. This is by

00:46:41.440 --> 00:46:46.159
the way is called model collapse, right?

00:46:44.318 --> 00:46:47.519
So to prevent it from doing that, we

00:46:46.159 --> 00:46:51.598
need to do one more thing to the loss

00:46:47.519 --> 00:46:53.039
function. Any guesses?

00:46:51.599 --> 00:46:56.000
>> Yeah.

00:46:53.039 --> 00:46:58.318
>> Uh make the images that aren't related

00:46:56.000 --> 00:47:00.639
not have a cosine similarity.

00:46:58.318 --> 00:47:02.639
>> Exactly. Right. Exactly right. So what

00:47:00.639 --> 00:47:05.598
we want to do is we want the scores of

00:47:02.639 --> 00:47:07.279
the red stuff to be as small as

00:47:05.599 --> 00:47:09.119
possible.

00:47:07.280 --> 00:47:10.800
We want the green stuff to be as much as

00:47:09.119 --> 00:47:12.720
possible and the red stuff to be as

00:47:10.800 --> 00:47:16.560
small as possible.

00:47:12.719 --> 00:47:20.639
Together it'll get the job done.

00:47:16.559 --> 00:47:22.078
Okay. And so um so we want to maximize

00:47:20.639 --> 00:47:24.000
the sum of the green cells and minimize

00:47:22.079 --> 00:47:26.640
the sum of the red cells. So the

00:47:24.000 --> 00:47:28.159
equivalent loss function is minimize the

00:47:26.639 --> 00:47:31.199
sum of the red cells and the negative

00:47:28.159 --> 00:47:34.159
sum of the green cells. That's it. So

00:47:31.199 --> 00:47:37.439
all clip does is that it just grabs a

00:47:34.159 --> 00:47:38.960
batch of image caption pairs, runs it

00:47:37.440 --> 00:47:41.119
through the networks, calculates the

00:47:38.960 --> 00:47:44.159
embeddings and calculates this sum of

00:47:41.119 --> 00:47:45.519
the stuff here and that is your loss and

00:47:44.159 --> 00:47:48.480
then back propagates through the

00:47:45.519 --> 00:47:50.880
network. Boom. Batch batch batch. Do it

00:47:48.480 --> 00:47:53.119
a whole bunch of times. And OpenAI did

00:47:50.880 --> 00:47:55.200
this with uh oh this is the official

00:47:53.119 --> 00:47:57.200
picture from the OpenAI paper

00:47:55.199 --> 00:47:59.919
which is worth reading by the way right

00:47:57.199 --> 00:48:02.480
it comes in text encoder you get these

00:47:59.920 --> 00:48:05.280
uh embedding vectors image encoder and

00:48:02.480 --> 00:48:07.838
then boom the diagonal is maximized and

00:48:05.280 --> 00:48:10.480
the off diagonals are minimized

00:48:07.838 --> 00:48:14.559
and they did it with 400 million image

00:48:10.480 --> 00:48:16.480
caption pairs scraped from the internet.

00:48:14.559 --> 00:48:18.559
400 million.
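A minimal sketch of one such training step is below, with `text_encoder` and `image_encoder` outputs as placeholders; it implements the loss exactly as described above, the sum of the red (off-diagonal) cells minus the sum of the green (diagonal) cells.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(text_embs, image_embs):
    """Row i of text_embs and image_embs comes from the same image-caption pair."""
    t = F.normalize(text_embs, dim=-1)
    i = F.normalize(image_embs, dim=-1)
    sims = t @ i.T                              # (batch, batch) matrix of cosine similarities
    green = sims.diagonal().sum()               # matching pairs: push these up
    red = sims.sum() - sims.diagonal().sum()    # mismatched pairs: push these down
    return red - green                          # minimizing this does both at once

# one (fake) batch of three caption/image embeddings
text_embs = torch.randn(3, 512, requires_grad=True)
image_embs = torch.randn(3, 512, requires_grad=True)
clip_style_loss(text_embs, image_embs).backward()   # then repeat, batch after batch
```

The released CLIP model actually optimizes a softmax cross-entropy version of this same push-the-diagonal-up, push-the-rest-down objective, with a learned temperature, but the intuition is exactly the one above.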

00:48:16.480 --> 00:48:20.880
By the way, you folks who work in the

00:48:18.559 --> 00:48:23.599
space may know this really well, but uh

00:48:20.880 --> 00:48:26.318
one very easy way to get a caption for

00:48:23.599 --> 00:48:27.519
an image, right? You we see images, but

00:48:26.318 --> 00:48:29.440
where do you think the captions come

00:48:27.519 --> 00:48:30.639
from? Where did they get those captions?

00:48:29.440 --> 00:48:32.079
They didn't obviously they didn't ask

00:48:30.639 --> 00:48:33.519
people to manually label each image of

00:48:32.079 --> 00:48:35.039
the caption. Where do you think they got

00:48:33.519 --> 00:48:36.159
it from?

00:48:35.039 --> 00:48:39.440
>> Google search.

00:48:36.159 --> 00:48:41.118
>> Uh Google search can help but why does

00:48:39.440 --> 00:48:42.639
Google search actually find the caption?

00:48:41.119 --> 00:48:45.440
How does it because Google search is not

00:48:42.639 --> 00:48:47.440
creating the caption? um

00:48:45.440 --> 00:48:50.480
>> take it from the alt text on the images.

00:48:47.440 --> 00:48:52.800
>> Correct. Alt text. So a lot of folks for

00:48:50.480 --> 00:48:54.559
accessibility reasons they have alt text

00:48:52.800 --> 00:48:56.000
right on all the images they create. A

00:48:54.559 --> 00:48:58.079
lot of people have alt text in their

00:48:56.000 --> 00:49:00.480
images they publish on the web and

00:48:58.079 --> 00:49:03.359
that's what we use. And the alt text

00:49:00.480 --> 00:49:05.280
actually ends up being a a more verbose

00:49:03.358 --> 00:49:07.440
description of the image than a typical

00:49:05.280 --> 00:49:10.079
caption which tends to be much briefer.

00:49:07.440 --> 00:49:11.679
And for us more verbose longer the

00:49:10.079 --> 00:49:14.160
better because there's more stuff for

00:49:11.679 --> 00:49:17.199
the model to learn from.

00:49:14.159 --> 00:49:19.440
Um, so that's how they built clip.

00:49:17.199 --> 00:49:22.078
And so now what we do is we can

00:49:19.440 --> 00:49:24.400
use clip's text encoder by itself,

00:49:22.079 --> 00:49:25.920
right? We can send in any text and get

00:49:24.400 --> 00:49:28.240
an embedding that is close to the

00:49:25.920 --> 00:49:31.119
embedding of any image that is described by

00:49:28.239 --> 00:49:33.919
the text.

00:49:31.119 --> 00:49:37.440
Okay. Now, by the way, clip can be used

00:49:33.920 --> 00:49:39.200
for zero-shot image classification.

00:49:37.440 --> 00:49:40.639
And what I mean by zeroshot image

00:49:39.199 --> 00:49:42.719
classification, I'll I'll walk through

00:49:40.639 --> 00:49:43.838
the picture in just a second, is that

00:49:42.719 --> 00:49:45.759
typically when you want to build an

00:49:43.838 --> 00:49:47.838
image classifier, right, you can get a

00:49:45.760 --> 00:49:50.319
whole bunch of training data of images

00:49:47.838 --> 00:49:51.838
and their labels and then we train them,

00:49:50.318 --> 00:49:54.639
right? Maybe you take something like

00:49:51.838 --> 00:49:56.639
ResNet, chop off the top, attach our

00:49:54.639 --> 00:49:58.558
own output head and train, train, train.

00:49:56.639 --> 00:50:00.078
Boom, you have a classifier. But the

00:49:58.559 --> 00:50:02.400
only problem with that is let's say that

00:50:00.079 --> 00:50:04.800
tomorrow so today for example you had

00:50:02.400 --> 00:50:06.400
five classes in your problem and

00:50:04.800 --> 00:50:09.039
tomorrow somebody comes along and says

00:50:06.400 --> 00:50:10.559
oh actually we have a sixth category

00:50:09.039 --> 00:50:11.920
right what do you do then well you have

00:50:10.559 --> 00:50:13.599
to go back to the drawing board and

00:50:11.920 --> 00:50:15.599
retrain the whole thing with six labels

00:50:13.599 --> 00:50:17.680
now not five because your problem has

00:50:15.599 --> 00:50:20.079
changed would it be great if you had a

00:50:17.679 --> 00:50:22.239
classifier where you just come to it and

00:50:20.079 --> 00:50:23.839
say here's an image and here are the six

00:50:22.239 --> 00:50:26.318
possible labels I want you to pick from

00:50:23.838 --> 00:50:27.759
pick one from me and you want to be able

00:50:26.318 --> 00:50:30.558
to give it a different set of labels

00:50:27.760 --> 00:50:32.319
those each time and it'll just use the

00:50:30.559 --> 00:50:33.760
labels you're giving it and the image

00:50:32.318 --> 00:50:35.920
and figures out which which label

00:50:33.760 --> 00:50:38.480
corresponds to the image you just fed it

00:50:35.920 --> 00:50:40.880
that would be an insanely flexible image

00:50:38.480 --> 00:50:42.639
classification system right and that's

00:50:40.880 --> 00:50:44.960
what I mean by zeroshot image

00:50:42.639 --> 00:50:47.838
classification and you can use clip to

00:50:44.960 --> 00:50:50.000
do zero-shot image classification

00:50:47.838 --> 00:50:52.400
the now how you do it is actually in the

00:50:50.000 --> 00:50:55.039
picture though not very clearly done

00:50:52.400 --> 00:50:55.039
anyone wants to

00:50:58.159 --> 00:51:05.399
How can you use clip to build like an

00:51:01.039 --> 00:51:05.400
infinitely flexible image classifier?

00:51:12.079 --> 00:51:16.640
>> Um I mean the text input was like was

00:51:14.480 --> 00:51:19.119
trained BERT right? So in the same way

00:51:16.639 --> 00:51:21.358
BERT can handle words never seen before

00:51:19.119 --> 00:51:22.720
does it essentially do that? Sorry, say

00:51:21.358 --> 00:51:24.000
that again. The second part

00:51:22.719 --> 00:51:25.439
>> you're saying you're saying it sees a

00:51:24.000 --> 00:51:26.559
text input with something it's never

00:51:25.440 --> 00:51:28.880
seen before, right? Yeah.

00:51:26.559 --> 00:51:30.960
>> Okay. So, in the BERT model, which is

00:51:28.880 --> 00:51:32.720
where where it came from, in the text

00:51:30.960 --> 00:51:35.039
encoding in the BERT model, I think we

00:51:32.719 --> 00:51:36.318
talked about when it sees a word it

00:51:35.039 --> 00:51:39.599
doesn't know that it's never seen

00:51:36.318 --> 00:51:41.119
before, it can use the the context words

00:51:39.599 --> 00:51:43.920
around it to try to

00:51:41.119 --> 00:51:46.559
>> Right. Right. So, but but here, just to

00:51:43.920 --> 00:51:49.760
be clear, I I want you to use clip that

00:51:46.559 --> 00:51:51.280
we just built, right? And assume clip

00:51:49.760 --> 00:51:53.040
knows all the words because

00:51:51.280 --> 00:51:54.720
it's been trained on a big vocabulary.

00:51:53.039 --> 00:51:57.358
You can give it any text you want. It'll

00:51:54.719 --> 00:52:00.519
create an embedding from it. That's the

00:51:57.358 --> 00:52:00.519
key capability.

00:52:02.318 --> 00:52:06.880
>> So it creates a text embedding for

00:52:06.239 --> 00:52:11.358
>> Yeah.

00:52:06.880 --> 00:52:14.000
>> because like and then for your image.

00:52:11.358 --> 00:52:15.838
So comparing similarity scores between

00:52:14.000 --> 00:52:17.199
the two the image is complete but the

00:52:15.838 --> 00:52:18.960
text is not complete. there'll be

00:52:17.199 --> 00:52:21.199
missing pieces and then make some

00:52:18.960 --> 00:52:22.720
prediction using this.

00:52:21.199 --> 00:52:24.159
Why is there a missing piece in the

00:52:22.719 --> 00:52:28.318
text?

00:52:24.159 --> 00:52:31.838
>> Because um the image the the text

00:52:28.318 --> 00:52:34.318
the text does not contain the class. Um

00:52:31.838 --> 00:52:36.400
and then but for the image the way it

00:52:34.318 --> 00:52:38.639
was trained it was trained like with

00:52:36.400 --> 00:52:40.480
pairs with class including

00:52:38.639 --> 00:52:42.558
>> right but we actually know the class now

00:52:40.480 --> 00:52:45.119
because so the use case is that I come

00:52:42.559 --> 00:52:48.000
to you with an image and I say here are

00:52:45.119 --> 00:52:51.519
the seven possible labels for this image

00:52:48.000 --> 00:52:53.119
and each label is a piece of text.

00:52:51.519 --> 00:52:55.920
So you can you actually have seven

00:52:53.119 --> 00:52:58.318
pieces of text and an image and all I

00:52:55.920 --> 00:53:00.318
want clip to do is to tell me okay the

00:52:58.318 --> 00:53:03.440
seventh the fourth label is the right

00:53:00.318 --> 00:53:07.159
one for this image

00:53:03.440 --> 00:53:07.159
but you're on the right track

00:53:08.079 --> 00:53:12.920
once you see how it's done you'll be

00:53:09.358 --> 00:53:12.920
like yeah of course

00:53:13.679 --> 00:53:16.159
>> I might not be understanding something but

00:53:15.119 --> 00:53:18.880
wouldn't you just pick the embedding

00:53:16.159 --> 00:53:20.399
that's the closest to the like the the

00:53:18.880 --> 00:53:22.480
text embedding that's the closest to the

00:53:20.400 --> 00:53:23.519
image embedding Correct. You're not

00:53:22.480 --> 00:53:26.318
missing anything. That's the right

00:53:23.519 --> 00:53:27.838
answer. Well done.

00:53:26.318 --> 00:53:30.239
Come on people. Can you applaud our

00:53:27.838 --> 00:53:32.880
fellow here? [applause]

00:53:30.239 --> 00:53:38.118
You folks are hard to impress.

00:53:32.880 --> 00:53:38.119
That's exactly what we do. So here

00:53:38.400 --> 00:53:42.480
the key thing to remember, the key

00:53:40.559 --> 00:53:45.280
thing to keep in your head is that

00:53:42.480 --> 00:53:47.760
a label is just text,

00:53:45.280 --> 00:53:50.240
dog, cat, right? It's just text. So you

00:53:47.760 --> 00:53:52.960
can just imagine taking each label with

00:53:50.239 --> 00:53:54.879
which in this case is plane car dog

00:53:52.960 --> 00:53:57.440
whatever for each one of them you create

00:53:54.880 --> 00:53:59.519
an embedding you get t1 through whatever

00:53:57.440 --> 00:54:01.519
if you have n labels for the image you

00:53:59.519 --> 00:54:03.119
just have one embedding i and then you

00:54:01.519 --> 00:54:04.880
just calculate the

00:54:03.119 --> 00:54:06.800
cosine similarity and whichever is the

00:54:04.880 --> 00:54:09.280
highest number you say okay it's a dog

00:54:06.800 --> 00:54:11.119
that's it

00:54:09.280 --> 00:54:14.599
it's super just imagine the level of

00:54:11.119 --> 00:54:14.599
flexibility here
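A sketch of that recipe is below; `clip_text_encoder` and `clip_image_encoder` are placeholder stubs standing in for whatever trained CLIP checkpoint you actually load.

```python
import torch
import torch.nn.functional as F

# Placeholders standing in for a trained CLIP text/image encoder pair.
# In practice you would load a real CLIP checkpoint here.
def clip_text_encoder(text: str) -> torch.Tensor:
    return torch.randn(512)

def clip_image_encoder(image: torch.Tensor) -> torch.Tensor:
    return torch.randn(512)

labels = ["a photo of a plane", "a photo of a car", "a photo of a dog"]
image = torch.randn(3, 224, 224)

# one embedding per candidate label, one embedding for the image, all unit-normalized
text_embs = F.normalize(torch.stack([clip_text_encoder(t) for t in labels]), dim=-1)
img_emb = F.normalize(clip_image_encoder(image), dim=-1)

sims = text_embs @ img_emb                # cosine similarity of the image against every label
print(labels[sims.argmax().item()])       # whichever label scores highest wins
```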

00:54:15.280 --> 00:54:20.079
so that's a side use of clip, unrelated

00:54:18.239 --> 00:54:21.279
to diffusion models but that's just

00:54:20.079 --> 00:54:23.680
thought it's really clever so I wanted

00:54:21.280 --> 00:54:25.359
to share that okay good u now let's see

00:54:23.679 --> 00:54:27.759
how we can actually use this entire

00:54:25.358 --> 00:54:29.519
capability to go to solve the original

00:54:27.760 --> 00:54:31.920
problem we set out to solve which is can

00:54:29.519 --> 00:54:33.679
we steer the diffusion model to create

00:54:31.920 --> 00:54:37.280
an image based on a particular prompt we

00:54:33.679 --> 00:54:39.358
give it um so now remember if you go

00:54:37.280 --> 00:54:41.519
back to how we did it we created all

00:54:39.358 --> 00:54:44.639
these training pairs of x and y based on

00:54:41.519 --> 00:54:46.318
you know the noising: x is the noisy image,

00:54:44.639 --> 00:54:51.279
y is the less noisy version of

00:54:46.318 --> 00:54:53.119
the image. So what we can simply do is we

00:54:51.280 --> 00:54:56.079
can actually change the input so it

00:54:53.119 --> 00:54:59.280
becomes the image and then the clip text

00:54:56.079 --> 00:55:00.480
embedding of the caption for that image.

00:54:59.280 --> 00:55:02.559
So you have an image and you have a

00:55:00.480 --> 00:55:05.199
caption. You take the caption run it

00:55:02.559 --> 00:55:07.760
through clip you get an embedding. By

00:55:05.199 --> 00:55:09.759
definition that embedding is in the

00:55:07.760 --> 00:55:13.200
lives in the same space as all the

00:55:09.760 --> 00:55:15.440
images that correspond to that caption.

00:55:13.199 --> 00:55:18.480
Right? So you just attach you

00:55:15.440 --> 00:55:20.318
concatenate the embedding of the clip

00:55:18.480 --> 00:55:22.639
output of a caption along with the

00:55:20.318 --> 00:55:24.880
image. You say make that the new input.

00:55:22.639 --> 00:55:26.558
Now Y continues to be the less noisy

00:55:24.880 --> 00:55:27.838
version of the image or as we saw

00:55:26.559 --> 00:55:30.319
earlier it could be just the noise

00:55:27.838 --> 00:55:34.000
component of the image. Okay, this is

00:55:30.318 --> 00:55:36.800
the new XY pair that we have. And so now

00:55:34.000 --> 00:55:39.519
the model is you send in X, the clip

00:55:36.800 --> 00:55:41.039
embedding concatenated with the

00:55:39.519 --> 00:55:43.039
noisy version of the image, and you keep

00:55:41.039 --> 00:55:44.960
on training it for a while. Once your

00:55:43.039 --> 00:55:46.880
model is trained for when you want to

00:55:44.960 --> 00:55:49.679
use it for inference for a new uh

00:55:46.880 --> 00:55:51.920
prompt, you just give it you know

00:55:49.679 --> 00:55:55.199
Killian coding at MIT during the springtime

00:55:51.920 --> 00:55:57.760
along with a bunch of noise goes in it

00:55:55.199 --> 00:56:00.399
starts denoising it. But because this

00:55:57.760 --> 00:56:02.880
embedding of this thing thanks to clip

00:56:00.400 --> 00:56:05.119
lives in the same space as all the

00:56:02.880 --> 00:56:07.119
embeddings of the images of Killian coding,

00:56:05.119 --> 00:56:11.160
keep on doing it for a while and at

00:56:07.119 --> 00:56:11.160
some point you'll get Killian coding at MIT.
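A rough sketch of that conditioned training step, with made-up names (`denoiser`, `clip_text_encoder`, `noise_schedule`) standing in for the real networks; X is the noisy image together with the caption's CLIP embedding, and Y is the noise component the network has to predict.

```python
import torch
import torch.nn.functional as F

def training_step(image, caption, t, denoiser, clip_text_encoder, noise_schedule):
    """One conditioned training step: X = (noisy image, caption embedding), Y = the added noise."""
    noise = torch.randn_like(image)
    noisy = noise_schedule(image, noise, t)        # the image with t steps of noise mixed in
    text_emb = clip_text_encoder(caption)          # sits near the embeddings of matching images
    pred = denoiser(noisy, text_emb, t)            # the U-Net, now also fed the text embedding
    return F.mse_loss(pred, noise)                 # train it to recover the noise component

# At inference time you start from pure noise plus the prompt's CLIP embedding
# and apply the trained denoiser over and over until an image emerges.
```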

00:56:11.280 --> 00:56:15.359
That's how they do it. That's how they

00:56:12.798 --> 00:56:16.798
steer the image. It's a two-step

00:56:15.358 --> 00:56:19.598
process. You create all these clip

00:56:16.798 --> 00:56:21.358
embeddings uh which clip was a

00:56:19.599 --> 00:56:22.960
breakthrough in my opinion because they

00:56:21.358 --> 00:56:24.159
it was one of the maybe the first

00:56:22.960 --> 00:56:26.079
example. I don't know if it's the very

00:56:24.159 --> 00:56:28.480
first but one of the early examples of

00:56:26.079 --> 00:56:30.400
saying we have different kinds of data.

00:56:28.480 --> 00:56:32.559
We have images, we have captions, we

00:56:30.400 --> 00:56:34.000
have text. How do we create embeddings

00:56:32.559 --> 00:56:36.240
for every one of these very different

00:56:34.000 --> 00:56:38.318
data types that all happen to live in

00:56:36.239 --> 00:56:40.399
the same space, the same concept space?

00:56:38.318 --> 00:56:42.480
That was the key idea. And if you look

00:56:40.400 --> 00:56:44.318
at the modern multimodal large language

00:56:42.480 --> 00:56:46.318
models, they are all based on the same

00:56:44.318 --> 00:56:49.759
exact idea.

00:56:46.318 --> 00:56:51.519
So it's very powerful this approach.

00:56:49.760 --> 00:56:54.000
Yeah. Now I understand this for images,

00:56:51.519 --> 00:56:56.559
but for video generation models like

00:56:54.000 --> 00:56:58.960
Sora, do they have some sort of

00:56:56.559 --> 00:57:00.960
underlying physics structure or do they

00:56:58.960 --> 00:57:02.318
learn the physical representations?

00:57:00.960 --> 00:57:04.559
>> There's a lot of debate on the internet

00:57:02.318 --> 00:57:05.838
about this stuff. Um they haven't

00:57:04.559 --> 00:57:07.359
published the results, the full

00:57:05.838 --> 00:57:09.599
technical report yet. So we don't know

00:57:07.358 --> 00:57:11.440
for sure but the consensus seems to be

00:57:09.599 --> 00:57:14.240
no it's not they are not using a physics

00:57:11.440 --> 00:57:15.599
engine what they have done uh and again

00:57:14.239 --> 00:57:17.919
this may be wrong once the report comes

00:57:15.599 --> 00:57:19.920
out we'll know for sure but uh people

00:57:17.920 --> 00:57:22.480
what people are saying computer vision

00:57:19.920 --> 00:57:25.838
experts is that it was has been trained

00:57:22.480 --> 00:57:28.400
on a lot of video game data

00:57:25.838 --> 00:57:30.400
uh along with actual videos and so on

00:57:28.400 --> 00:57:32.559
and if you and the corpus of training is

00:57:30.400 --> 00:57:35.280
so massive that it has basically learned

00:57:32.559 --> 00:57:38.000
to mimic certain physics aspects to it

00:57:35.280 --> 00:57:39.280
just as a side effect much like LLM you

00:57:38.000 --> 00:57:41.838
train them on a large amount of text

00:57:39.280 --> 00:57:43.359
data they begin to start to do things

00:57:41.838 --> 00:57:46.239
which you didn't anticipate that they'll

00:57:43.358 --> 00:57:48.558
do right so for example I read this I

00:57:46.239 --> 00:57:50.399
thought it's a really great example of

00:57:48.559 --> 00:57:52.798
what is surprising about large language

00:57:50.400 --> 00:57:54.798
models is not that you know you train

00:57:52.798 --> 00:57:56.159
them on a b bunch of high school math

00:57:54.798 --> 00:57:57.199
problems and then you give it a new high

00:57:56.159 --> 00:57:59.679
school math problem it can actually

00:57:57.199 --> 00:58:00.879
solve it that's not surprising you give

00:57:59.679 --> 00:58:03.199
it a whole bunch of high school math

00:58:00.880 --> 00:58:05.200
problems in English then you ask it to

00:58:03.199 --> 00:58:07.199
read a bunch of French literature and

00:58:05.199 --> 00:58:08.960
then you give French high school math

00:58:07.199 --> 00:58:12.318
will solve it. That is that is the new

00:58:08.960 --> 00:58:13.679
news, right? So similarly here I think

00:58:12.318 --> 00:58:15.199
the expectation is that it's not

00:58:13.679 --> 00:58:16.798
actually using a physics engine under

00:58:15.199 --> 00:58:17.838
the hood. It may have used a physics

00:58:16.798 --> 00:58:20.159
engine to actually come up with the

00:58:17.838 --> 00:58:22.798
videos and renderings but there are no

00:58:20.159 --> 00:58:23.920
physics constraints in the model itself.

00:58:22.798 --> 00:58:26.000
It just comes out of the training

00:58:23.920 --> 00:58:27.440
process. That's the current view. Once

00:58:26.000 --> 00:58:30.639
the technical report comes out, we'll

00:58:27.440 --> 00:58:33.639
know for sure what they actually did.

00:58:30.639 --> 00:58:33.639
U

00:58:33.838 --> 00:58:37.920
>> so quick question about stability. It's

00:58:36.318 --> 00:58:40.400
claiming to be a little bit more real

00:58:37.920 --> 00:58:41.599
time in their image generation. Um, so

00:58:40.400 --> 00:58:43.599
>> you mean stable diffusion?

00:58:41.599 --> 00:58:45.200
>> Yeah, stable diffusion. So, are they

00:58:43.599 --> 00:58:46.798
jumping through the noise more quickly

00:58:45.199 --> 00:58:47.679
or are they kind of like pre-prompting

00:58:46.798 --> 00:58:48.960
it and kind of trick?

00:58:47.679 --> 00:58:50.480
>> Very good question and there's a very

00:58:48.960 --> 00:58:52.798
key trick. It's coming.

00:58:50.480 --> 00:58:55.119
>> Um,

00:58:52.798 --> 00:58:57.920
>> so here the example of the noise is

00:58:55.119 --> 00:59:00.559
normal distribution. However, if we have

00:58:57.920 --> 00:59:02.240
changed the noise distribution, is it

00:59:00.559 --> 00:59:04.000
change the result? Oh, you mean if you

00:59:02.239 --> 00:59:05.519
change it to like a Poisson or some other

00:59:04.000 --> 00:59:08.079
distribution, it'll definitely change

00:59:05.519 --> 00:59:10.318
the results because u if you look at the

00:59:08.079 --> 00:59:11.839
underlying math of why this works, it

00:59:10.318 --> 00:59:13.279
heavily depends on the Gaussian

00:59:11.838 --> 00:59:15.599
assumption.

00:59:13.280 --> 00:59:18.559
>> Yeah. Um there was another question

00:59:15.599 --> 00:59:20.000
somewhere here.

00:59:18.559 --> 00:59:21.599
>> Um you may not know the answer because

00:59:20.000 --> 00:59:23.599
the technical report out, but could it

00:59:21.599 --> 00:59:26.240
be in terms of video generation sort of

00:59:23.599 --> 00:59:28.160
analogous to going from like one fuzzy,

00:59:26.239 --> 00:59:30.639
one noisy image to another? like you're

00:59:28.159 --> 00:59:31.598
almost doing a series of still images

00:59:30.639 --> 00:59:33.920
and learning how to

00:59:31.599 --> 00:59:35.599
>> No, I think that, people are pretty sure,

00:59:33.920 --> 00:59:36.960
is how it's done. So, basically you

00:59:35.599 --> 00:59:39.280
think think of the video as just a

00:59:36.960 --> 00:59:41.599
series of frames, right? And each frame

00:59:39.280 --> 00:59:43.440
is an image and there is a sequentiality

00:59:41.599 --> 00:59:44.880
to it. Um, which is where the

00:59:43.440 --> 00:59:47.760
transformer stack will come in because

00:59:44.880 --> 00:59:50.720
it handles sequentiality. So, in general

00:59:47.760 --> 00:59:53.280
video stuff typically operates on frame

00:59:50.719 --> 00:59:54.959
by frame which is just an image. So,

00:59:53.280 --> 00:59:57.839
that is definitely there. What we don't

00:59:54.960 --> 00:59:59.519
know is if they also used some

00:59:57.838 --> 01:00:02.239
understanding of the fact that for

00:59:59.519 --> 01:00:04.239
example that if an object is dropped it

01:00:02.239 --> 01:00:06.798
has to fall to the earth in a certain

01:00:04.239 --> 01:00:08.639
rate or if an object goes behind another

01:00:06.798 --> 01:00:10.159
object you can't see the object anymore

01:00:08.639 --> 01:00:12.960
right things like that which we take for

01:00:10.159 --> 01:00:15.838
granted um the question is are they

01:00:12.960 --> 01:00:17.280
using it and the consensus seems to be

01:00:15.838 --> 01:00:18.960
uh in the absence of an actual technical

01:00:17.280 --> 01:00:20.240
report that no they're not doing it

01:00:18.960 --> 01:00:22.880
because there are lots of examples on

01:00:20.239 --> 01:00:24.479
Twitter where people will show a Sora

01:00:22.880 --> 01:00:26.559
video in which it's not obeying the laws

01:00:24.480 --> 01:00:28.559
of physics. So you take like a beach

01:00:26.559 --> 01:00:30.000
chair and then put it in the sand. You

01:00:28.559 --> 01:00:32.640
see the sand come through the base of

01:00:30.000 --> 01:00:33.920
the beach chair, right? Or you take an

01:00:32.639 --> 01:00:35.118
object and put it behind an object. You

01:00:33.920 --> 01:00:37.440
can still see the object even though the

01:00:35.119 --> 01:00:38.720
original object is opaque. So you be

01:00:37.440 --> 01:00:39.920
seeing some evidence that no no it's not

01:00:38.719 --> 01:00:43.879
obeying the laws of physics. What you're

01:00:39.920 --> 01:00:43.880
seeing is just an amaz

01:00:46.318 --> 01:00:50.000
fingers without knowing there has to be

01:00:47.599 --> 01:00:51.599
only five fingers.

01:00:50.000 --> 01:00:55.679
Um

01:00:51.599 --> 01:00:58.880
okay. All right. So we let's keep going

01:00:55.679 --> 01:01:00.879
now. Um so there was another paper

01:00:58.880 --> 01:01:02.798
afterwards and this is the original

01:01:00.880 --> 01:01:05.680
paper which took that idea of the

01:01:02.798 --> 01:01:07.519
diffusion model and then diffusion is

01:01:05.679 --> 01:01:08.719
very slow as Olivia you pointed out. So

01:01:07.519 --> 01:01:11.119
the question is can we make it much

01:01:08.719 --> 01:01:12.239
faster? Right? So what they did and I'm

01:01:11.119 --> 01:01:14.079
not going to get into this whole thing

01:01:12.239 --> 01:01:18.159
here. I just want to highlight a couple

01:01:14.079 --> 01:01:20.960
of things. The first one is that um

01:01:18.159 --> 01:01:23.279
first of all notice that you see a U-Net

01:01:20.960 --> 01:01:25.838
here. So they are using a U-Net, right,

01:01:23.280 --> 01:01:28.000
to go from image to image.

01:01:25.838 --> 01:01:30.000
The second thing is that the clip

01:01:28.000 --> 01:01:32.400
embedding of the text prompt is

01:01:30.000 --> 01:01:34.559
basically woven in, meaning it's

01:01:32.400 --> 01:01:36.559
incorporated

01:01:34.559 --> 01:01:38.559
into the U-Net through an attention

01:01:36.559 --> 01:01:41.040
mechanism a transformer mechanism and

01:01:38.559 --> 01:01:43.200
you can see the QKV business here which

01:01:41.039 --> 01:01:45.039
should be familiar at this point. So it

01:01:43.199 --> 01:01:47.279
is integrated into the transformer stack

01:01:45.039 --> 01:01:48.480
directly that input the clip embedding

01:01:47.280 --> 01:01:50.960
that's the second thing I want to point

01:01:48.480 --> 01:01:52.880
out. And then thirdly

01:01:50.960 --> 01:01:54.480
and this is where the speed up comes. So

01:01:52.880 --> 01:01:56.240
what you do is instead of taking the

01:01:54.480 --> 01:01:57.760
image running it through the whole

01:01:56.239 --> 01:01:59.598
network and creating a slightly less

01:01:57.760 --> 01:02:02.000
noisy version of the image here what you

01:01:59.599 --> 01:02:03.359
do is you take the image you run it

01:02:02.000 --> 01:02:05.679
through an image encoder you get an

01:02:03.358 --> 01:02:07.519
embedding and now you only work with the

01:02:05.679 --> 01:02:09.440
embedding you take the embedding and

01:02:07.519 --> 01:02:11.358
create a slightly less noisy version

01:02:09.440 --> 01:02:13.119
embedding keep on doing it and these

01:02:11.358 --> 01:02:14.719
embeddings are much smaller than images

01:02:13.119 --> 01:02:16.079
therefore they're much faster to process

01:02:14.719 --> 01:02:18.798
and once you've done it like a thousand

01:02:16.079 --> 01:02:20.559
times you get a very sort of almost pure

01:02:18.798 --> 01:02:24.079
noiseless version of the embedding. Now you

01:02:20.559 --> 01:02:26.400
run it through an image decoder to get the final image back.

01:02:24.079 --> 01:02:29.119
So the idea here is that you

01:02:26.400 --> 01:02:31.200
operate um

01:02:29.119 --> 01:02:32.480
uh in the latent space, meaning the

01:02:31.199 --> 01:02:35.118
embedding space and hence it's called a

01:02:32.480 --> 01:02:36.719
latent diffusion model. So that's where

01:02:35.119 --> 01:02:38.640
the speed up comes but research

01:02:36.719 --> 01:02:40.239
continues to be very strong to make it

01:02:38.639 --> 01:02:41.440
even faster because for a lot of

01:02:40.239 --> 01:02:43.039
consumer applications people are

01:02:41.440 --> 01:02:44.480
obviously not going to wait around for I

01:02:43.039 --> 01:02:46.880
mean who wants to wait for 10 seconds

01:02:44.480 --> 01:02:49.920
right so uh and so there a lot of

01:02:46.880 --> 01:02:52.240
pressure to make it even faster

01:02:49.920 --> 01:02:53.760
um
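To tie the three pieces together, here is a heavily simplified sketch of the latent-diffusion inference loop; `text_encoder`, `latent_denoiser`, `image_decoder`, and the latent shape are placeholders, and the text embedding would enter the denoiser through the cross-attention (QKV) mechanism mentioned above.

```python
import torch

@torch.no_grad()
def generate(prompt, text_encoder, latent_denoiser, image_decoder,
             num_steps=50, latent_shape=(1, 4, 64, 64)):
    """Denoise a small latent instead of the full-size image, then decode once at the end."""
    text_emb = text_encoder(prompt)            # CLIP-style embedding of the prompt
    z = torch.randn(latent_shape)              # start from pure noise in the latent space
    for t in reversed(range(num_steps)):
        z = latent_denoiser(z, text_emb, t)    # every step works on the small, cheap latent
    return image_decoder(z)                    # one decode back up to pixel space
```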

01:02:52.239 --> 01:02:56.078
all right so that's what we have

01:02:53.760 --> 01:02:58.160
obviously um you know they're these

01:02:56.079 --> 01:03:00.000
models are transforming everything and

01:02:58.159 --> 01:03:01.920
uh by the way this site here, lexica.art.

01:03:00.000 --> 01:03:03.760
You can go check it out. Uh it has

01:03:01.920 --> 01:03:06.318
a whole bunch of very interesting images

01:03:03.760 --> 01:03:07.599
and prompts that created the images. So

01:03:06.318 --> 01:03:09.519
if you're working in the space, it gives

01:03:07.599 --> 01:03:11.119
you a lot of interesting ideas. But it's

01:03:09.519 --> 01:03:13.838
not just for you know consumer fun

01:03:11.119 --> 01:03:15.838
applications. U you know these models

01:03:13.838 --> 01:03:18.318
are being used to actually you know

01:03:15.838 --> 01:03:19.519
alpha fold if you'll recall if you give

01:03:18.318 --> 01:03:21.519
it an amino acid sequence it can

01:03:19.519 --> 01:03:24.480
actually create the 3D structure. Right?

01:03:21.519 --> 01:03:25.838
So that's an example of they they don't

01:03:24.480 --> 01:03:27.440
I don't think they use a diffusion

01:03:25.838 --> 01:03:28.960
model. But you can imagine using a

01:03:27.440 --> 01:03:32.159
diffusion model to create these

01:03:28.960 --> 01:03:34.798
complicated objects. Meaning the objects

01:03:32.159 --> 01:03:36.558
you create don't have to be images.

01:03:34.798 --> 01:03:39.038
They can be arbitrarily complicated

01:03:36.559 --> 01:03:41.280
things. As long as you have enough data

01:03:39.039 --> 01:03:43.760
about such things to use for training

01:03:41.280 --> 01:03:45.359
and the notion of noising the input is

01:03:43.760 --> 01:03:47.039
meaningful, you can create some very

01:03:45.358 --> 01:03:49.598
interesting structures. You can create

01:03:47.039 --> 01:03:51.039
3D things and, you know, protein

01:03:49.599 --> 01:03:52.240
structures and there's a whole bunch of

01:03:51.039 --> 01:03:55.440
very interesting applications in

01:03:52.239 --> 01:03:57.038
biomedical uh sciences. So this is

01:03:55.440 --> 01:03:59.519
really just the tip of the iceberg and

01:03:57.039 --> 01:04:00.960
now there are

01:03:59.519 --> 01:04:03.358
ways in which you can use diffusion

01:04:00.960 --> 01:04:05.199
models to do large language

01:04:03.358 --> 01:04:07.358
modeling as well. So there's a lot of

01:04:05.199 --> 01:04:10.159
overlap and blending and so on going on

01:04:07.358 --> 01:04:11.920
in the space. So I'm going to do a

01:04:10.159 --> 01:04:12.879
quick demo. Um if you look at hugging

01:04:11.920 --> 01:04:15.519
face there is something called the

01:04:12.880 --> 01:04:17.519
diffusers library which, as the

01:04:15.519 --> 01:04:20.079
name suggests, is a library for

01:04:17.519 --> 01:04:24.599
a lot of diffusion models

01:04:20.079 --> 01:04:24.599
and let's take a quick look.

01:04:25.838 --> 01:04:28.880
All right. So, the diffusers

01:04:27.519 --> 01:04:30.719
library has a whole bunch of diffusion

01:04:28.880 --> 01:04:32.400
models. We're going to work with Stable

01:04:30.719 --> 01:04:34.959
Diffusion, which is one of the

01:04:32.400 --> 01:04:38.599
better-known models. So let's

01:04:34.960 --> 01:04:38.599
install diffusers.

01:04:38.960 --> 01:04:42.880
Uh, you will recall when I did

01:04:41.358 --> 01:04:45.679
the quick lightning tour of the hugging

01:04:42.880 --> 01:04:48.480
face ecosystem for language. Uh hugging

01:04:45.679 --> 01:04:50.480
face has a whole bunch of capabilities

01:04:48.480 --> 01:04:52.159
sort of built out of the box and you use

01:04:50.480 --> 01:04:54.719
this thing called the pipeline function

01:04:52.159 --> 01:04:56.879
to very quickly use any model you want.

01:04:54.719 --> 01:04:59.679
The same exact philosophy applies here.

01:04:56.880 --> 01:05:03.559
You still use the pipeline. So I'm going

01:04:59.679 --> 01:05:03.558
to import a bunch of stuff.

01:05:09.358 --> 01:05:14.759
All right. So, oh, I see I have to do

01:05:11.199 --> 01:05:14.759
this thing. Okay.

01:05:16.079 --> 01:05:20.119
Great. F.

01:05:21.519 --> 01:05:26.639
Okay. So, uh, all right, here's what we have

01:05:24.239 --> 01:05:28.639
here. So you'll remember that

01:05:26.639 --> 01:05:30.639
when we worked with text, we

01:05:28.639 --> 01:05:31.759
would grab a pre-trained model and

01:05:30.639 --> 01:05:33.598
then we actually run it through a

01:05:31.760 --> 01:05:36.079
pipeline and we can do all the inference

01:05:33.599 --> 01:05:39.440
we want on it. The same exact philosophy

01:05:36.079 --> 01:05:41.519
applies here. So um and this very

01:05:39.440 --> 01:05:44.400
similar to what we did in lecture 8 for

01:05:41.519 --> 01:05:46.079
NLP. So what we're going to do is we use

01:05:44.400 --> 01:05:48.000
this command the stable diffusion

01:05:46.079 --> 01:05:50.798
pipeline from pre-trained and we use

01:05:48.000 --> 01:05:56.318
this version 1.4 stable diffusion model.

01:05:50.798 --> 01:05:58.079
Um, so let's just create the pipeline.

01:05:56.318 --> 01:06:00.318
And obviously we have used TensorFlow,

01:05:58.079 --> 01:06:02.079
not PyTorch, here, but a lot of these

01:06:00.318 --> 01:06:05.279
models unfortunately happen to be in

01:06:02.079 --> 01:06:07.280
PyTorch, so knowing a little bit of PyTorch

01:06:05.280 --> 01:06:09.280
is actually very helpful um to be able

01:06:07.280 --> 01:06:12.240
to work with these things and what we're

01:06:09.280 --> 01:06:15.280
doing here uh while it's downloading uh

01:06:12.239 --> 01:06:18.078
we are using this fp16

01:06:15.280 --> 01:06:19.519
um storage format for the model

01:06:18.079 --> 01:06:22.318
weights because it's going to be a

01:06:19.519 --> 01:06:24.318
little smaller than using 32 bits so

01:06:22.318 --> 01:06:25.599
it'll download faster. So that's what's

01:06:24.318 --> 01:06:28.480
happening here. So all right, it's

01:06:25.599 --> 01:06:29.920
downloaded fine. So now we just give it

01:06:28.480 --> 01:06:32.880
a prompt and this is actually one of the

01:06:29.920 --> 01:06:34.400
original famous uh meme prompts a

01:06:32.880 --> 01:06:36.640
photograph of an astronaut riding a

01:06:34.400 --> 01:06:38.880
horse. And so uh once we have the

01:06:36.639 --> 01:06:40.558
pipeline set up, I'll just set a seed for

01:06:38.880 --> 01:06:44.640
reproducibility. And then literally I do

01:06:40.559 --> 01:06:46.960
pipe of prompt and then it's actually

01:06:44.639 --> 01:06:50.879
you can see here 50. So it's going

01:06:46.960 --> 01:06:52.000
through 50 denoising steps. Okay. Um and

01:06:50.880 --> 01:06:54.960
you come up with an astronaut riding a

01:06:52.000 --> 01:06:56.880
horse. Okay. So that's that. Um you can

01:06:54.960 --> 01:06:59.838
actually change the seed and you can get

01:06:56.880 --> 01:07:01.599
a different one. Um, the seed basically

01:06:59.838 --> 01:07:03.199
sets the random starting point

01:07:01.599 --> 01:07:05.440
for the image. So therefore you would

01:07:03.199 --> 01:07:08.159
expect a different astronaut. Yep. This

01:07:05.440 --> 01:07:09.679
is an astronaut riding another horse. So

01:07:08.159 --> 01:07:11.199
um I think people came up with these

01:07:09.679 --> 01:07:12.480
kinds of fun examples because it's

01:07:11.199 --> 01:07:15.358
guaranteed not to be in the training

01:07:12.480 --> 01:07:16.559
data, right? So whatever the model is

01:07:15.358 --> 01:07:18.480
doing, remember, it's not

01:07:16.559 --> 01:07:23.160
regurgitating what it has already seen.
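
For reference, a minimal sketch of what these notebook cells do, assuming the CompVis/stable-diffusion-v1-4 checkpoint on the Hugging Face Hub and a CUDA GPU; the prompt is the one from the demo, while the seed value is just illustrative.

```python
import torch
from diffusers import StableDiffusionPipeline

# fp16 weights: smaller download and less GPU memory than 32-bit.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompt = "a photograph of an astronaut riding a horse"
generator = torch.Generator("cuda").manual_seed(1024)  # seed for reproducibility
image = pipe(prompt, generator=generator).images[0]    # 50 denoising steps by default
image.save("astronaut.png")
```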

01:07:18.480 --> 01:07:23.159
Uh, all right. Give me a prompt.

01:07:26.639 --> 01:07:32.920
Prompts. Anyone?

01:07:29.920 --> 01:07:32.920
Wow.

01:07:34.798 --> 01:07:37.798
>> Okay,

01:07:38.559 --> 01:07:44.839
that might be a

01:07:40.639 --> 01:07:44.838
All right. Riding a horse.

01:07:48.880 --> 01:07:51.960
All right,

01:07:56.559 --> 01:08:03.319
there are two of them and clearly MIT

01:07:59.358 --> 01:08:03.318
professors don't have really.

01:08:03.599 --> 01:08:10.240
Yeah, moving on. [laughter]

01:08:06.559 --> 01:08:11.519
So, by the way, um, you should

01:08:10.239 --> 01:08:12.798
spend some time with the diffusers

01:08:11.519 --> 01:08:14.400
library, they have a bunch of tutorials

01:08:12.798 --> 01:08:16.319
which are really interesting because

01:08:14.400 --> 01:08:18.719
this core capability of giving a prompt

01:08:16.319 --> 01:08:20.560
and getting an image out can actually be

01:08:18.719 --> 01:08:22.640
manipulated for all sorts of very

01:08:20.560 --> 01:08:23.920
interesting use cases. So, for example,

01:08:22.640 --> 01:08:25.679
there is this thing called negative

01:08:23.920 --> 01:08:28.560
prompting. And the idea of negative

01:08:25.679 --> 01:08:31.838
prompting is that you can give it two

01:08:28.560 --> 01:08:33.679
prompts and say create an image which

01:08:31.838 --> 01:08:36.318
embodies the first prompt but not the

01:08:33.679 --> 01:08:37.920
second prompt. Essentially, subtract the

01:08:36.319 --> 01:08:39.520
second prompt from the first one. That's

01:08:37.920 --> 01:08:41.440
called negative prompting. And you might

01:08:39.520 --> 01:08:45.440
be wondering like what use is that?

01:08:41.439 --> 01:08:46.559
There are lots of fun uses. So here we

01:08:45.439 --> 01:08:49.119
are going to do this: the prompt is going to be a

01:08:46.560 --> 01:08:53.120
labrador in the style of Vermeer. Okay,

01:08:49.119 --> 01:08:57.119
that's the first prompt. 50 steps.

01:08:53.119 --> 01:09:00.719
Uh look at that. Amazing, right? Uh but

01:08:57.119 --> 01:09:02.158
maybe you don't care for the blue scarf.

01:09:00.719 --> 01:09:04.079
So you basically give it a negative

01:09:02.158 --> 01:09:06.879
prompt. And basically the negative

01:09:04.079 --> 01:09:09.439
prompt is "blue", meaning remove everything

01:09:06.880 --> 01:09:11.920
that's blue. I don't like this. Otherwise,

01:09:09.439 --> 01:09:15.079
keep the Labrador thing going. So you

01:09:11.920 --> 01:09:15.079
run it.

01:09:16.479 --> 01:09:22.399
Look at that. The blue is gone. Negative

01:09:18.399 --> 01:09:26.000
prompting. Okay. Yeah.
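
The negative-prompt call is just the same pipeline with one extra argument; a minimal sketch, reusing the pipe object from the earlier snippet (the seed value is illustrative):

```python
import torch

image = pipe(
    "a labrador in the style of Vermeer",
    negative_prompt="blue",          # subtract anything matching this prompt
    num_inference_steps=50,
    generator=torch.Generator("cuda").manual_seed(1024),
).images[0]
image.save("labrador_no_blue.png")
```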

01:09:22.399 --> 01:09:28.318
>> If you change that from 50 to

01:09:26.000 --> 01:09:30.479
a thousand, will it become less pixelated

01:09:28.319 --> 01:09:31.359
or will it eventually just keep going

01:09:30.479 --> 01:09:32.798
and iterating?

01:09:31.359 --> 01:09:34.400
>> No. Typically, if you do more of these

01:09:32.798 --> 01:09:36.640
things, it gets better. The quality is

01:09:34.399 --> 01:09:38.719
much better because each step will

01:09:36.640 --> 01:09:40.480
denoise it very slightly. So, errors

01:09:38.719 --> 01:09:42.158
won't accumulate and things like that.
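
As a quick sketch, the number of denoising steps is just another argument on the same call (reusing pipe and prompt from the earlier snippet; the values here are illustrative):

```python
quick = pipe(prompt, num_inference_steps=20).images[0]     # faster, rougher
careful = pipe(prompt, num_inference_steps=100).images[0]  # slower, usually cleaner
```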

01:09:40.479 --> 01:09:44.158
And the diffusers library gives you lots

01:09:42.158 --> 01:09:47.119
of controls for fiddling around with all

01:09:44.158 --> 01:09:50.559
these things. Um, okay. So, that's what

01:09:47.119 --> 01:09:52.559
we had. Uh, 949.

01:09:50.560 --> 01:09:54.159
Okay. So, check out this tutorial if

01:09:52.560 --> 01:09:56.239
you're curious about how this stuff

01:09:54.158 --> 01:09:58.479
works. And I'm going to do one other

01:09:56.238 --> 01:10:01.759
thing um because I didn't get to do it

01:09:58.479 --> 01:10:03.919
earlier on. So uh we spent some time

01:10:01.760 --> 01:10:05.920
with the hugging face hub and I walked

01:10:03.920 --> 01:10:07.679
you through a few use cases for text uh

01:10:05.920 --> 01:10:10.158
where you can take a text model and use

01:10:07.679 --> 01:10:11.760
it for you know classification uh things

01:10:10.158 --> 01:10:13.519
like that summarization and so on and so

01:10:11.760 --> 01:10:16.000
forth. You can do the same thing for

01:10:13.520 --> 01:10:17.600
computer vision models. So if you have a

01:10:16.000 --> 01:10:20.079
computer vision problem that just maps

01:10:17.600 --> 01:10:21.920
to a standard computer vision task,

01:10:20.079 --> 01:10:25.439
you can just use the hugging face hub as

01:10:21.920 --> 01:10:27.359
well. So um let me just show you very

01:10:25.439 --> 01:10:30.678
quickly the same kind of thing actually

01:10:27.359 --> 01:10:30.679
works here.

01:10:32.560 --> 01:10:37.360
All right. Okay. So,

01:10:35.600 --> 01:10:38.719
so let's say that you want to classify

01:10:37.359 --> 01:10:40.639
something. You just import the pipeline

01:10:38.719 --> 01:10:43.279
as before.

01:10:40.640 --> 01:10:45.280
And once you import it, you can just

01:10:43.279 --> 01:10:46.319
literally give it the standard task that

01:10:45.279 --> 01:10:48.800
you care about like image

01:10:46.319 --> 01:10:50.158
classification.

01:10:48.800 --> 01:10:53.520
And and then you can start using it

01:10:50.158 --> 01:10:56.519
right from that point on.

01:10:53.520 --> 01:10:56.520
Okay.

01:10:59.840 --> 01:11:04.480
All right. Okay. So now I'm going to

01:11:02.319 --> 01:11:06.719
just get this image. So it's a very

01:11:04.479 --> 01:11:08.718
famous image. Um, right. And we're going

01:11:06.719 --> 01:11:09.760
to ask it to classify this image. So we

01:11:08.719 --> 01:11:12.399
just literally run it through the

01:11:09.760 --> 01:11:15.039
pipeline.

01:11:12.399 --> 01:11:18.799
And it says the most likely label is 94%

01:11:15.039 --> 01:11:20.880
probability. It's an Egyptian cat. Seems

01:11:18.800 --> 01:11:21.679
reasonable. Okay. I mean, it's a

01:11:20.880 --> 01:11:22.960
tough picture, right? Because there are

01:11:21.679 --> 01:11:25.520
lots of things going on in that picture.

01:11:22.960 --> 01:11:27.679
It's not like one image, one object.
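
A minimal sketch of the classification cells; the URL is an assumption (the well-known COCO cats-on-a-couch picture used throughout the Hugging Face docs) and may not be the exact image shown in lecture.

```python
from transformers import pipeline

classifier = pipeline("image-classification")  # default pretrained vision model
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
print(classifier(url)[:3])  # e.g. [{'label': 'Egyptian cat', 'score': 0.94}, ...]
```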

01:11:25.520 --> 01:11:29.520
Um okay so you don't have to use the

01:11:27.679 --> 01:11:31.279
default model you can actually give it

01:11:29.520 --> 01:11:35.440
your own model that you want. So for

01:11:31.279 --> 01:11:38.880
example, you can go um sorry

01:11:35.439 --> 01:11:40.238
you can go to the hugging face hub

01:11:38.880 --> 01:11:42.560
and you can go in there and say all

01:11:40.238 --> 01:11:45.359
right I want image classification these

01:11:42.560 --> 01:11:49.120
are all the models 10,487 models let's

01:11:45.359 --> 01:11:51.759
sort by I don't know most downloads or

01:11:49.119 --> 01:11:53.599
maybe most likes

01:11:51.760 --> 01:11:54.800
and you have all these models; you can

01:11:53.600 --> 01:11:56.000
pick any one of them so for example

01:11:54.800 --> 01:11:57.920
let's say you want to pick Microsoft

01:11:56.000 --> 01:12:00.000
ResNet as your model, that's what I tried

01:11:57.920 --> 01:12:04.000
here. So I have Microsoft ResNet; you

01:12:00.000 --> 01:12:05.920
just say model equals that, run it, and it

01:12:04.000 --> 01:12:08.238
takes care of all the tokenization this

01:12:05.920 --> 01:12:09.840
that and whatnot. It's really very handy

01:12:08.238 --> 01:12:12.639
and then you run it through the pipeline

01:12:09.840 --> 01:12:15.760
again and it says tiger cat 94%

01:12:12.640 --> 01:12:17.440
probability, according to ResNet.
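
A sketch of swapping in your own checkpoint; "microsoft/resnet-50" is a plausible Hub model name for what is described here, not confirmed from the notebook, and url is the one from the classification sketch above.

```python
from transformers import pipeline

resnet_classifier = pipeline("image-classification", model="microsoft/resnet-50")
print(resnet_classifier(url)[0])  # e.g. {'label': 'tiger cat', 'score': ...}
```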

01:12:15.760 --> 01:12:18.640
So that's how you do it. Now let's

01:12:17.439 --> 01:12:20.000
actually try a more interesting example

01:12:18.640 --> 01:12:21.199
where you want to detect all the objects

01:12:20.000 --> 01:12:23.760
in the picture which we didn't talk

01:12:21.198 --> 01:12:27.439
about in class: object detection. So just

01:12:23.760 --> 01:12:29.600
create an object detection pipeline.

01:12:27.439 --> 01:12:31.678
Same thing as before. When you actually

01:12:29.600 --> 01:12:33.120
run this command, an astonishing

01:12:31.679 --> 01:12:35.359
amount of complicated stuff is going on

01:12:33.119 --> 01:12:37.279
under the hood. Okay, and we are all the

01:12:35.359 --> 01:12:39.439
beneficiaries of that. So, thank you.

01:12:37.279 --> 01:12:42.639
Um, so yeah, so we have this here and

01:12:39.439 --> 01:12:44.079
then we run it through um the pipeline.

01:12:42.640 --> 01:12:45.360
It's looking at all the possible things

01:12:44.079 --> 01:12:46.800
that might be sitting in the picture.

01:12:45.359 --> 01:12:49.920
The results are hard to read. So, let's

01:12:46.800 --> 01:12:51.760
actually visualize them. Um,

01:12:49.920 --> 01:12:53.359
and I got some nice code from this site

01:12:51.760 --> 01:12:56.239
for how to visualize them. Let's just

01:12:53.359 --> 01:12:58.719
reuse it. So, yeah. So if you plot the

01:12:56.238 --> 01:13:02.039
results,

01:12:58.719 --> 01:13:02.039
look at that.
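
A sketch of the detection cells; the drawing code is an illustrative stand-in for the visualization snippet borrowed in the lecture, and the image URL is the same assumed COCO picture as before.

```python
import requests
from PIL import Image, ImageDraw
from transformers import pipeline

detector = pipeline("object-detection")  # default DETR-style detector
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
img = Image.open(requests.get(url, stream=True).raw)

draw = ImageDraw.Draw(img)
for r in detector(img):
    box = r["box"]  # each result: score, label, and a box with xmin/ymin/xmax/ymax
    draw.rectangle((box["xmin"], box["ymin"], box["xmax"], box["ymax"]),
                   outline="red", width=3)
    draw.text((box["xmin"], box["ymin"]), f'{r["label"]} {r["score"]:.2f}', fill="red")
img.save("detections.png")
```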

01:13:03.760 --> 01:13:09.600
Okay, so it has picked up the cat. 100%

01:13:06.800 --> 01:13:12.079
probability, I guess. The remote, the

01:13:09.600 --> 01:13:14.320
couch, the other remote, and then the

01:13:12.079 --> 01:13:17.039
cat. Pretty good, right? Off the shelf,

01:13:14.319 --> 01:13:19.439
ready to go. No heavy lifting

01:13:17.039 --> 01:13:20.719
required. Now, in this case, we are

01:13:19.439 --> 01:13:22.719
actually putting these boxes called

01:13:20.719 --> 01:13:23.840
bounding boxes around each object. But

01:13:22.719 --> 01:13:25.119
what if you actually don't want a

01:13:23.840 --> 01:13:28.159
bounding box? What if you want to actually

01:13:25.119 --> 01:13:30.158
find the exact contour of that cat or

01:13:28.158 --> 01:13:32.799
the remote. No problem. We do something

01:13:30.158 --> 01:13:36.399
called image segmentation. So let's do

01:13:32.800 --> 01:13:40.119
an image segmentation pipeline

01:13:36.399 --> 01:13:40.119
uh and run it through.
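
A sketch of the segmentation cells, reusing the img loaded in the detection snippet; each result carries a label plus a binary mask stored as a PIL image.

```python
from transformers import pipeline

segmenter = pipeline("image-segmentation")
for i, seg in enumerate(segmenter(img)):
    print(i, seg["label"], seg["score"])
    # The mask is white where the object is and black everywhere else;
    # overlay it on the original image to see the exact contour.
    seg["mask"].save(f"mask_{i}_{seg['label']}.png")
```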

01:13:42.960 --> 01:13:49.439
It takes some time. Um all right. All

01:13:46.800 --> 01:13:51.199
right. Let's visualize it. So for

01:13:49.439 --> 01:13:53.359
each object it finds, it gives you a

01:13:51.198 --> 01:13:56.399
mask. It basically tells you for each

01:13:53.359 --> 01:13:58.799
object what object it is and then which

01:13:56.399 --> 01:14:00.479
pixels are on for that object and off

01:13:58.800 --> 01:14:02.159
for everything else. It's a mask. It

01:14:00.479 --> 01:14:04.879
tells you where the object is. And you can

01:14:02.158 --> 01:14:06.479
see here, the first object it has

01:14:04.880 --> 01:14:08.719
found is this thing here. And it's

01:14:06.479 --> 01:14:10.718
perfectly delineated, right? It's pretty

01:14:08.719 --> 01:14:14.000
amazing. So we can overlay this on the

01:14:10.719 --> 01:14:15.760
original image and see it has found that

01:14:14.000 --> 01:14:17.920
and there it is. Let's look at the other

01:14:15.760 --> 01:14:20.719
objects. Oh, it has found the remote.

01:14:17.920 --> 01:14:24.719
That's the second object.

01:14:20.719 --> 01:14:27.119
And the third remote

01:14:24.719 --> 01:14:28.880
and the fourth. You think any other

01:14:27.119 --> 01:14:32.000
objects are remaining?

01:14:28.880 --> 01:14:33.679
>> Couch. Good. All right, let's find the

01:14:32.000 --> 01:14:36.238
couch.

01:14:33.679 --> 01:14:37.920
And look, the couch is pretty good

01:14:36.238 --> 01:14:39.678
except that the middle part has gotten

01:14:37.920 --> 01:14:41.039
confused.

01:14:39.679 --> 01:14:44.239
All right, but it's still pretty good,

01:14:41.039 --> 01:14:46.319
right? So, yeah. So,

01:14:44.238 --> 01:14:48.319
hugging face has all these things, and

01:14:46.319 --> 01:14:49.599
so you should definitely check it out

01:14:48.319 --> 01:14:51.439
if you're not already very familiar

01:14:49.600 --> 01:14:55.000
with it. So, uh, we have one minute

01:14:51.439 --> 01:14:55.000
left. Any questions?

01:14:58.238 --> 01:15:04.039
No questions. Okay. All right, folks.

01:15:00.238 --> 01:15:04.039
See you on Wednesday. Thanks.
