
11: Generative AI – Text-to-Image Models

MIT OpenCourseWare · May 11, 2026
Transcript ~13639 words · 1:15:38
0:16
So all right so today we actually come
0:18
to the last lecture of the class because
0:19
Wednesday it's going to be project
0:20
presentations and um so I want to talk
0:23
to you about diffusion models today
0:25
which is an incredibly exciting area
0:28
which I don't think gets the same
0:30
amount of attention in some ways
0:32
compared to large language models. Uh
0:34
but it's got enormous potential. Um so
0:37
I'm very excited to talk to you about
0:39
it. So you know just for kicks last
0:42
night I asked ChatGPT to create a
0:44
photorealistic image of graduate
0:46
students in class in a class on deep
0:47
learning and this is what it came back
0:49
with.
0:51
There is a noticeable absence of an
0:53
instructor
0:57
plus various students are facing in
0:59
various directions
1:01
but apart from that it's not bad. Um and
1:05
uh here is an example of a Midjourney
1:08
text-to-image diffusion model uh which
1:12
produces the amazing picture from this
1:14
prompt. a quaint Italian seaside village
1:16
with colorful buildings blah blah blah
1:18
blah blah uh rendered in the style of
1:21
Claude Monet and so on so forth and
1:24
that's what you get. It's pretty
1:25
unbelievable.
1:27
Uh and I'm sure you folks have played
1:28
around with these things and you have
1:29
your favorite pictures and prompts and
1:31
whatnot.
1:33
Um now
1:35
uh February 15th um OpenAI released a
1:38
text-to-video model called Sora which you
1:41
folks may have seen uh which I find
1:44
frankly just stunning what it can do. It
1:46
can produce a one minute uh video from a
1:49
text prompt. And so,
1:52
so if you actually give it this prompt,
1:54
in an ornate historical hall, a massive
1:56
tidal wave peaks and begins to crash and
2:00
two surfers seizing the moment
2:01
skillfully navigate the wave.
2:03
Okay. Uh I think we can all agree that
2:06
such a thing has never happened in
2:07
history and therefore there it was not
2:09
in the training data, right? So and then
2:12
you get this picture, this video
2:26
and then some random person is coming
2:28
back in a completely dry [laughter]
2:31
hall. So anyway, but it's pretty
2:32
amazing. I think you would agree. So
2:37
if you actually look at the Sora
2:39
technical report, you actually find this
2:42
uh opening paragraph where they say that
2:45
we train text conditional diffusion
2:48
models blah blah blah using a
2:51
transformer architecture. Okay, so now
2:54
we know what a transformer architecture
2:56
is. You've been working with it. You're
2:57
quite familiar with it at this point. So
3:00
today's class is really about text
3:02
conditional diffusion models. Okay, so
3:04
the other building block. Okay, so let's
3:06
get to it. Uh what I'm going to do is
3:09
I'm going to sort of uh divide this into
3:11
two parts. The first part is I'm just
3:12
going to talk about how do you get a
3:14
model to just generate an image for you?
3:16
Right? If you wanted to generate an
3:17
image from a class of potential images,
3:20
how can it just generate an image? And
3:21
then next we talk about okay, great. Now
3:24
that you can do that, how do you
3:25
actually control or steer the model to
3:27
do an image based on whatever prompting
3:29
you give it? Okay, how do you condition
3:31
it? How do you control it? Those are all
3:33
the words. How do you steer it? You'll
3:34
find all these synonyms being used
3:36
heavily in the literature. That's
3:37
basically what they mean. How do you
3:38
give it a prompt and then steer what
3:40
gets produced? All right, so let's say
3:43
we want to build a model that can be
3:44
used to generate images of stately
3:47
college buildings.
3:49
Okay, obviously our very own Killian
3:51
Court is the finest example of such a
3:53
thing. Um, and uh, but let's say you
3:56
want to do that. So what you do is you
3:58
as as we always do with machine
3:59
learning, we collect a bunch of data. In
4:01
this particular case, we collect a whole
4:03
bunch of images of stately college
4:05
buildings. Uh, and what you see here is
4:07
literally me just doing a Google image
4:08
search with the query stately college
4:10
buildings. Okay, so this is the kind of
4:12
stuff you get. Uh, so you have your
4:14
training data at your disposal. It's
4:15
ready to go. Now the question is if you
4:19
have such a model, let's say, and
4:20
obviously we'll talk about how to build
4:21
such a model very soon. But let's say
4:23
you have such a model and every time you
4:25
sort of sample this model, every time
4:27
you ask the model, hey, give me an
4:28
image, you obviously wanted to give a
4:30
different image, right? Otherwise, it's
4:31
kind of boring. All right? Some you know
4:34
maybe you want the Killian Court, maybe
4:36
you want the rotunda from the University
4:37
of Virginia. Anybody any UVA alums here?
4:42
Nobody. Okay. Um, so and right. So the
4:45
question is how can we actually get it
4:47
to randomly give us different images?
4:49
But but they all have to be stately
4:50
college buildings. It can't be just some
4:52
random stuff, right? So, how do you do
4:54
that? And the way we do that, and I
4:58
still find it really astonishing that
4:59
this approach actually works. The way we
5:02
do that is that we actually give it
5:03
noise.
5:05
And I will define very precisely what I
5:07
mean by noise in just a just a bit.
5:10
Okay, basically assume
5:13
an image in which all the pixel values
5:15
are randomly picked.
5:17
Right? So every time you generate a
5:19
random image and you give it to the
5:21
model, it'll use that random
5:23
starting point and then create an image
5:25
for you. And because by definition, if
5:27
you choose noise randomly, they are, you
5:30
know, obviously going to be different
5:31
each time. It's hopefully going to
5:33
generate a different image. But if the
5:35
model is trained on stately college
5:37
buildings, it will produce images of
5:39
stately college buildings. It's not
5:41
going to produce a picture of a Labrador
5:42
retriever.
5:44
Okay, so that's basically what we're
5:46
going to do. Now, if you look at
5:49
something like this, the first question
5:51
of course is that how can we train a
5:53
model to generate an image from pure
5:54
noise? This just sounds ridiculous,
5:58
right? You basically give it a bunch of
6:00
random numbers and say, give me Killian Court.
6:04
It feels really ridiculous. And at that
6:06
point, you know, folks can sort of come
6:08
to a stop and say, "All right, this
6:10
approach is probably not going to take
6:11
me anywhere. It's a bit of a dead end.
6:14
But then some clever people had this
6:16
very interesting idea.
6:18
They said
6:20
um it's not clear how to do this you
6:24
know um just a quick aside there's this
6:26
really amazing book which is published
6:28
maybe 50 years ago maybe earlier than
6:31
that called how to solve it by George
6:33
Polia. George Poliov was a eminent
6:36
mathematician
6:37
um and he wrote this small book called
6:39
how to solve it and it lists a whole
6:41
bunch of huristics that mathematicians
6:44
use when they solve problems and perhaps
6:46
the most commonly used heristic is just
6:49
reverse the question
6:52
just reverse the question and see if
6:53
anything comes out of it most of the
6:55
time nothing will come out of it but
6:56
maybe some other time something amazing
6:58
comes out right this is a great example
6:59
of that heristic at work we don't know
7:01
how to do this so the question is can we
7:03
do the reverse
7:05
If I give you Killian Court, can you
7:07
produce noise out of it for me?
7:10
And the answer is yeah, of course we can
7:12
do that.
7:14
Right? Given an image, we can easily
7:16
create a noisy version of it. So you can
7:19
take the original image, you can add
7:21
some noise to it to get this and you
7:23
keep on adding a lot of noise and
7:24
finally you'll get something that's
7:25
basically you can't tell that there is
7:27
a clean, clear Killian Court anymore. Right?
7:29
This process, the reverse process is
7:31
actually very easy to do. Okay? So the
7:33
question. By the way, for folks who
7:36
may not be very familiar with this
7:37
notion of adding noise to an image or
7:39
making an image noisy. Let me just show
7:41
you in a Colab in just a minute how easy
7:44
it is.
7:47
All right. So um we let's say we import
7:51
a bunch of these things. As usual we
7:52
have numpy and so there is this thing
7:54
called the Python Imaging Library, PIL,
7:57
which is very handy for image
7:58
manipulations. So we import that and
8:01
then I just literally read this
8:03
image in. I uploaded it before class.
8:04
Let's just make sure it's here. Okay,
8:06
good. Killian.png.
8:07
So I I read this image. Okay. Uh and
8:11
then once I read it, I convert it into a
8:13
numpy array. And then remember, in
8:16
any color image, you have three tables
8:18
of numbers. There's a
8:20
number for each pixel for red, blue, and
8:23
green. And then each number is between 0
8:25
and 255. And so here what we do is we
8:28
divide everything by 255 just to
8:29
normalize it so it's all between zero
8:31
and one and we have done this in the
8:32
past right I do that here uh all right
8:36
so let me just read this back in convert
8:38
it and then if you look at the shape
8:40
it's basically 411 × 583 × 3, three
8:45
channels as we have seen before and then
8:47
I'll just show it all right that's the
8:50
picture so now what we want to do is we
8:52
want to add noise to this picture all we
8:54
have to do Okay, for each pixel,
8:59
we basically randomly pick a normal
9:02
variable, a normal distribution,
9:03
normally distributed random variable
9:05
with a mean of zero and a small standard
9:08
deviation. So it's like a small number
9:10
and then we just literally add that
9:11
number to every pixel. But for every
9:14
pixel, we sample. Every pixel we sample.
9:16
It's not like we sample once and add it
9:17
to all the pixels. We sample for every
9:19
pixel. And so the way you do that is
9:22
basically literally np.random.normal.
9:25
and then this .3 here is the
9:28
standard deviation and we tell it
9:30
generate as many of these things as the
9:33
shape of the image that I gave
9:35
you. Okay. And then add each one of
9:38
these numbers to the original image you
9:40
get this noisy image. Okay. So if you
9:42
this is the original image these are all
9:44
the values between 0 and one. And then
9:46
you get this noisy image. You can see
9:48
the numbers have become different. The
9:50
.23 has become .18, the .15
9:52
has become -.17, and so on and so
9:54
forth. Right? You just added a small
9:56
random number to everything. But as you
9:58
can see here now you have some negative
9:59
numbers. You may have some numbers
10:01
that's greater than one. And we do want
10:02
everything to be between 0 and one. So
10:05
all we do is we do this thing called
10:06
clipping, where essentially values smaller
10:10
than zero are set to zero. Values
10:11
greater than one are set to one. And so
10:13
we'll just do that. That's it.
10:16
Everything over one squashed to one.
10:17
Everything under zero set to zero.
10:19
Others leave it unchanged. Now it's
10:21
again well behaved between 0 and one and
10:23
we can just plot it and you get this.
10:28
That's it. That's all it takes to
10:29
actually add noise to an image. One line
10:31
of numpy. Okay. Uh obviously you can
10:34
just put this whole thing in a loop and
10:36
keep increasing that standard deviation
10:37
number from .3 to .4, .5, and so on and so
10:39
forth. And when you do that you get this
10:41
nice sequence from clean Killian Court all the
10:44
way to some very very noisy version of
10:45
Killian Court. That's it. So that's the basic
10:48
idea of adding noise.
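Put together, the Colab recipe just described fits in a few lines. A minimal sketch, assuming matplotlib for display and that the uploaded file is named Killian.png; the standard deviation values are illustrative:

```python
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt

# Read the image and normalize every pixel value to [0, 1]
img = np.asarray(Image.open("Killian.png").convert("RGB")).astype(np.float32) / 255.0
print(img.shape)  # (height, width, 3): one number per pixel per R/G/B channel

# Add an independently sampled Gaussian value to every pixel of every channel,
# then clip back into [0, 1] so the result is still a valid image
sigma = 0.3
noisy = np.clip(img + np.random.normal(0.0, sigma, img.shape), 0.0, 1.0)
plt.imshow(noisy); plt.axis("off"); plt.show()

# Repeating with a growing standard deviation gives the clean-to-pure-noise sequence
for sigma in (0.3, 0.4, 0.5, 0.7, 1.0):
    plt.imshow(np.clip(img + np.random.normal(0.0, sigma, img.shape), 0.0, 1.0))
    plt.axis("off"); plt.show()
```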
10:52
Any questions on the mechanics?
10:57
Okay, good. Um so so we can add random
11:00
numbers, right? And we can by increasing
11:02
the magnitude of the standard deviation
11:04
of these normal random
11:06
variables, we can make the image
11:08
noisier. Okay, so that suggests a really
11:12
interesting idea.
11:14
What idea would that be?
11:19
Yeah, doing the opposite. Could you
11:21
please uh microphone please?
11:25
>> Uh doing the opposite like recreating
11:26
the image from the noise.
11:29
>> So we are trying to create the image
11:31
from the noise. But
11:34
that feels a little hard. So what
11:37
exactly can we do? Be a little more
11:38
specific.
11:44
So here we have the ability to take any
11:46
image and add any amount of noise to it.
11:48
Right? That's the data we have. There is
11:51
Killian Court and there are various noisy
11:54
versions of Killian Court, likewise for
11:56
the Rotunda at the University of Virginia and so on and
11:57
so forth.
11:58
>> I would assume you would do some kind of
12:00
loss function for the the final image
12:02
that you get and compare it with the the
12:04
original image that you train it on and
12:06
then refine as you go. Okay,
12:10
you're on the right track. Uh, any other
12:14
proposals?
12:18
>> I think we could try to train a neural
12:20
network to reconstruct the image going
12:22
from the noise to the non-noisy one.
12:25
Like we could have a whole data set with
12:27
images, find their noisy counterparts and
12:30
train a
12:34
network to do the opposite task.
12:38
Yeah, that's definitely on the right
12:39
track. That's definitely on the right
12:41
track. Yep, good ideas. So, what we do
12:44
more concretely is
12:47
we we can take each image in the
12:49
training data and create noisy versions
12:51
of it as we have seen before. And then
12:54
what we do is that we say uh we can
12:57
create XY training data pairs input
13:00
output pairs from all these images. So
13:04
specifically what we do is we take
13:09
the slightly noisy version of
13:11
Killian Court and call it the input and
13:14
we take the clean version of
13:16
Killian Court and call it the output.
13:19
Okay, that's the y1 x1 pair
13:22
and then we get y2 x2 we get y3 x3 and
13:27
all the way. So at any point in time,
13:30
the relationship between X and Y, what's
13:33
the relationship between X and Y? If you
13:36
set it up like this as the input and the
13:37
output,
13:43
>> it's the set of uh standard deviations
13:45
and uh the values which you change for
13:48
each pixels. Those are like weights to
13:51
which you transform,
13:53
>> right? Or maybe I was looking for
13:54
something simpler which was that that's
13:56
correct. So what he's looking for is
13:58
really the relationship between X
14:00
and Y. X is an image, any image, and Y
14:03
happens to be a slightly less noisy
14:05
version of the image.
14:07
The slightly less noisy is really,
14:09
really important.
14:12
You're not going from Killian Court,
14:14
right? You're not going from the image
14:16
to full noise. That's an impossible
14:19
leap. You're going from the image to a
14:21
slightly noisy version of the image.
14:24
Okay, it is that slightly that allows
14:27
all the magic to happen.
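One way to materialize those (x, y) pairs in code, as a rough sketch: walk each training image through progressively noisier versions, and at every step the noisier image is the input and the version one small step earlier is the target. The step count and noise size here are arbitrary illustrative values.

```python
import numpy as np

def make_denoising_pairs(clean_images, num_steps=20, sigma_step=0.05):
    """clean_images: array of (H, W, 3) images with values in [0, 1].
    Returns (xs, ys) where each x is a slightly noisier version of its y."""
    xs, ys = [], []
    for img in clean_images:
        prev = img
        for _ in range(num_steps):
            noisy = np.clip(prev + np.random.normal(0.0, sigma_step, img.shape), 0.0, 1.0)
            xs.append(noisy)   # input: the noisier image
            ys.append(prev)    # target: the slightly less noisy image
            prev = noisy
    return np.array(xs), np.array(ys)
```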
14:30
So that's what we have.
14:33
And so here what we can do with these XY
14:35
pairs. So here's the
14:38
thing, right? This is like a larger
14:40
comment about machine learning and deep
14:41
learning. Um
14:43
Basically, what machine
14:46
learning and deep learning really are is
14:47
this black box where, if
14:50
you can find interesting input-output
14:52
pairs, you can learn a function to go
14:55
from the input to the output. That's it.
14:57
but this sounds kind of simple when I
14:59
describe it like that but there are like
15:01
some incredibly non-obvious ways of
15:04
applying this idea right so for example
15:06
a few years ago Google had this uh thing
15:08
which may actually be in production in
15:10
Google Sheets now where whenever you um
15:13
sort of choose a bunch of numbers, a
15:15
range of numbers in a spreadsheet and
15:17
and then go into another cell, it'll
15:19
immediately suggest a formula for you.
15:21
Where is that coming from?
15:24
It's because all the Google Sheets users
15:26
all over the world, they have been
15:28
creating all these numbers with
15:30
formulas, right? So, someone says,
15:32
"Look, wait a second. We have all this
15:33
data on people choosing a range of
15:36
numbers and then entering a formula. So
15:38
let's imagine the range is the input and
15:40
the formula as the output
15:43
and let's just give a million examples
15:45
of this pair and see if anything comes
15:46
out of it and boom you get that feature.
15:50
Okay. So similarly here
15:53
X is an image, Y a less noisy version of the
15:55
image. What that means is that we can
15:58
build a denoising network.
16:02
Okay, we can take an image and we can
16:04
build a network using all these XY pairs
16:06
to slightly denoise it.
16:10
Okay. Um and so how do we do it? We
16:15
just run stochastic gradient descent on
16:16
the data. We have a network. It has X
16:19
and Y and then Y is a slightly less
16:22
noisy version, and then boom.
16:26
Okay, it's just a network. It has a
16:27
bunch of weights. We have
16:29
the right answer in terms of what the
16:30
images need to be. We can do stochastic
16:33
gradient descent or Adam or something
16:34
and before you know it if you have
16:36
enough data you have a network which can
16:37
denoise anything you give it. Okay, um,
16:40
you had a question
16:41
>> why slightly
16:43
>> why slightly um we'll come back to that
16:45
question the the reason is that u in
16:48
general you have to do what you can
16:51
to help the model and this is sort of
16:53
the proverbial there is an old adage you
16:56
can't cross a ditch in two jumps.
16:59
It's too big. So, right. So, you can't
17:02
do it. So, what you do is you create a
17:03
bridge to go from here to there. And so,
17:05
what you do is if you can slightly
17:07
denoise something really well. Well, I
17:10
can actually denoise anything you
17:11
want really well using that fundamental
17:13
capability as you will see in a second.
17:17
>> Just to follow up. So, if you go back
17:18
the last slide, I could have created the
17:21
same thing as that is my x1 and that is
17:24
my y. Then the second one is x2 and
17:26
still this is the y. So there is
17:28
effectively there is a learning there
17:30
that it could have taken from those
17:33
pairs and come back with okay this is
17:35
also a possibility this is also a
17:37
possibility and it found out that noise
17:40
matrix and it can subtract.
17:42
>> Yeah. So the thing is you want to make
17:44
sure that each time the amount of
17:46
learning it has to do is as bounded and
17:48
small as possible. If you give it some
17:51
starting point and an ending point and
17:52
keep moving this ending point, the gap
17:55
is still really high for the first
17:56
several of those starting points. That's
17:59
the problem.
18:01
Okay. So to come back to this, so we can
18:04
build a denoising model. We can do this.
18:07
And now when you have once you have
18:08
built such a thing, you give it some
18:10
noisy thing and then it'll you know give
18:13
you a slightly less noisy version of it.
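Training that denoiser is then just supervised regression on those pairs. A minimal Keras sketch with a deliberately tiny stand-in network (in practice you would use something like the U-Net discussed later); xs and ys are the hypothetical arrays from the pair-building sketch above.

```python
import tensorflow as tf
from tensorflow.keras import layers

# A tiny stand-in denoiser: image in, same-sized image out
denoiser = tf.keras.Sequential([
    layers.Input(shape=(None, None, 3)),
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.Conv2D(3, 3, padding="same", activation="sigmoid"),
])

# xs = noisier images, ys = slightly less noisy targets (see the earlier sketch)
denoiser.compile(optimizer="adam", loss="mse")   # pixel-wise squared error
denoiser.fit(xs, ys, batch_size=32, epochs=10)
```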
18:15
Okay, the resolution is going to go up
18:16
slightly if you do that. This of course
18:19
suggests the obvious way in which you
18:20
would use it which is that once you
18:22
train it we can solve this problem.
18:26
Okay. And how can we solve this problem?
18:29
So what you do is you start with pure
18:32
noise and then repeatedly denoise
18:35
it.
18:37
Okay. You get that, you get that, and
18:39
then before you know it, Killian Court
18:41
has emerged from the fog,
18:43
right? It's pretty insane that it
18:46
actually works this idea.
18:52
So, so the model will generate a
18:54
sequence of less noisy images and the
18:56
final one you have is the answer. Okay.
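In code, that generation loop is just: start from a random image and keep feeding the output back in. A minimal sketch, reusing the hypothetical denoiser from the training sketch above; the image size and step count are arbitrary.

```python
import numpy as np

num_denoising_steps = 100          # originally ~1000 baby steps; fewer with newer methods
x = np.random.rand(1, 64, 64, 3)   # pure noise: every pixel value picked at random
for _ in range(num_denoising_steps):
    x = denoiser.predict(x, verbose=0)      # each pass removes a little more noise
generated_image = np.clip(x[0], 0.0, 1.0)   # the final, least-noisy image is the sample
```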
18:59
Now there's a whole bunch of detail here
19:01
which I'm glossing over about okay how
19:05
many times must we run this loop to get
19:08
to a really good picture. The short
19:09
answer is, initially it was like
19:12
you have to run it like a thousand
19:13
times. Each denoising step was
19:16
like a baby step. You have to do it a
19:17
thousand times to get a really good
19:18
answer. Again research has been very
19:21
active in the area continues to be very
19:22
active. Now you can I think do it like
19:24
50 steps or 100 steps. Right? But
19:26
diffusion models like this uh they tend
19:29
to take more time than a large language
19:31
model which is why if you give a prompt
19:33
to one of these models like midjourney
19:35
it will take some time for it to come
19:36
back with an image and and that the
19:38
reason for the delay is because it's
19:40
going through this you know incremental
19:42
denoising loop. Yeah.
19:45
>> Uh from this we understand that each uh
19:47
the final noise output sample would be
19:49
very particular to each image in the
19:51
matrix. So I mean like say two if you
19:55
take two images the final we are getting
19:57
is the image in the after when we start
19:59
noising it and the final output we get
20:02
is the noise sample will be too distinct
20:04
for each of them right
20:05
>> correct
20:05
>> so but when we are picking up image to
20:08
generate a diffusion model and we work
20:10
backwards we may not have the exact
20:12
thing available to us what was there
20:14
initially
20:15
>> no no the thing is we don't want to
20:17
necessarily regenerate images that were
20:18
in the training data right that's kind
20:21
of pointless. We want to generate new
20:22
images
20:24
and for new images we just use
20:26
noise as a starting point
20:29
you know the fact that Killian Court was
20:31
here and then the fully noised version
20:32
of Killian Court is here, that is used for
20:35
training and once you use it for
20:36
training you don't need it anymore
20:37
because you're not trying to recreate
20:39
Killian Court again, you want to create
20:41
new images which belong to the category
20:43
of stately college buildings and for
20:45
that, all you do is just grab noise, send it
20:48
in it gives you a stately college
20:49
building. End of story.
20:53
And because noise by definition is
20:55
different each time you pick it, it's
20:57
going to come up with a different
20:59
stately college building.
21:01
So the way I think about it is that uh
21:07
all right so you can think of it as this
21:09
right this is
21:12
so when you sample think of this as like
21:14
the noise distribution
21:17
each time you sample right there's a
21:20
little point you pick from here another
21:22
time you sample maybe you get a point
21:24
here right each is just you know nice
21:26
distribution that's it what actually
21:29
these things are doing is they are
21:31
mapping it
21:34
to the distribution of stately college
21:35
buildings which might be in a you know
21:38
strange crazy distribution.
21:41
So each time you sample you just go from
21:43
here and you land at a point here
21:47
and when you go from here you know you
21:49
land at a point there.
21:53
That's what so what you have done is
21:54
when you when you take the training data
21:56
you basically created points here and
21:59
then found the matching noise here and
22:01
then flipped it for training as we have
22:03
seen before and once you're done with it
22:05
you basically have a mechanism for
22:07
transforming any entry in this
22:09
distribution of images to an entry in
22:12
this distribution of images. So it's a
22:15
way to transform one distribution to
22:17
another distribution. That's what's
22:18
going on. Um all right. Um so there was
22:22
a question. Yeah. And then we'll go.
22:26
>> I understand the going from noise to to
22:28
the image and back how you how the
22:30
training works. So my question is you
22:33
know in some of these models today you
22:35
have you know when you give it the noise
22:37
now to generate with an image for
22:40
example it could generate a human with
22:42
four fingers or you know stuff like
22:44
that. So is it that the that the model
22:47
that the training data is not just quite
22:49
enough to or more as robust enough to uh
22:53
generate that kind of detail? [cough]
22:56
Can you kind of talk through like what's
22:57
more?
22:58
>> Yeah. So so fundamentally what it's
23:00
doing is it actually does not understand
23:03
the notion of fingers and things like
23:04
that. Right? Because there is like we
23:07
haven't injected any domain knowledge
23:09
into this whole process by saying that
23:12
hey, you need to
23:13
generate a human body and here are the
23:16
semantics of what the human body is
23:17
right it's got uh five fingers and all
23:20
the anatomical stuff we're not giving
23:21
anything. We're literally giving it pixel
23:23
values bunch of pictures so everything
23:26
you're seeing is basically just coming
23:27
out of that very blind statistical
23:29
transformation process. So you
23:32
would expect that macro-level details
23:34
it will probably get right. Because
23:36
there are so many right answers. So
23:38
imagine it's actually, you know, it's
23:40
creating um the roof of a house. There
23:43
could be all kinds of variations in the
23:45
roof of the house and you would still
23:46
think it's a roof of a house, right?
23:48
Because there are many possible right
23:49
answers. But when it comes to five
23:51
fingers, there are not many possible
23:52
right answers, which is why you notice
23:53
the error very quickly. As far as the
23:55
model is concerned, it doesn't know,
23:56
right? It's just producing a
23:58
statistically plausible sample from that
24:00
distribution. And since we haven't
24:03
forced it to obey constraints like five
24:05
fingers and so on and so forth, it's not
24:06
going to do any of that stuff. It's an
24:08
unconstrained process. Now over time,
24:10
these things have gotten better and
24:11
better and that's because the data has
24:14
gotten better to your point. But I think
24:15
our approach to doing these things is
24:17
also getting better, right? There are
24:19
lots of ways to now steer it and control
24:21
it so it behaves the right way. And that
24:23
is actually part of what's happening as
24:25
well. So when we talk about how do you
24:27
actually give a text prompt and have it
24:29
build the image for that particular
24:30
prompt, we would we'll revisit this
24:32
question. Um okay, there was there were
24:35
more questions. Yeah.
24:38
>> Is there some randomness in the model
24:40
itself? Right. So if you gave it the
24:42
same noise image twice, will it actually
24:44
produce the same final image or will it
24:47
>> Yeah, there is randomness in the process
24:49
as well.
24:49
>> In the process process, exactly.
24:53
Um, so to actually that's a really good
24:56
point, but now I'm afraid to open my
24:59
laptop. I'm an iPad. One second.
25:02
All right.
25:04
Okay. So, what's going on here is that
25:06
if you um go to this thing
25:10
so I talked about we are transforming
25:13
from here to some crazy distribution
25:16
here, right? So, what happens that let's
25:18
say that this is the starting point for
25:20
the the noise input. This is your noise
25:22
input and then what it does what you
25:25
actually do is you go here
25:28
and then you take this point and then
25:29
you do a small sample next to it. So you
25:33
use this as like the mean value and then
25:35
sample around it and that's actually
25:37
what gets published in the user
25:39
interface. That's where the randomness
25:40
comes in.
25:42
Okay. So um
25:48
so back to this was there another
25:49
question somewhere.
25:52
>> Yeah.
25:53
>> Um it's okay.
25:56
>> Uh I was just wondering about the when
25:59
going when training on a on a clear
26:02
picture to go to a noisy image uh to
26:05
pull from a random sample like random
26:08
this sample probably pseudo random. I
26:10
was just wondering if it's like learning
26:12
relationships that are dependent on
26:13
pseudo randomness and so when it goes
26:16
from a noisy image back to pure image
26:19
it's dependent on that or it matters at
26:22
all.
26:22
>> Oh I see. So if I understand your
26:23
question what you're saying is that it's
26:24
pseudo random not actually random
26:27
>> and so therefore there is some signal in
26:29
the supposedly random generation is it
26:32
actually glomming onto that signal right
26:34
is the question. Theoretically, it's
26:37
probably possible, but in practice, it
26:38
really doesn't matter because we
26:40
basically say random is good enough for
26:42
our purposes. And in fact, in practice,
26:44
you will see it's not an issue.
26:47
Um,
26:48
okay. So, oh yeah, go ahead.
26:52
>> There's a quick question. when you're
26:53
doing uh like text to text, let's say
26:58
you're uh tokenizing the input, but here
27:01
you somehow have to identify that this
27:03
is Killian Court and like a stately home
27:06
and this is just going from pixel image
27:09
to or like decoding a pixel image. Um
27:13
where does the the tag or tokenization
27:16
of like columns or fingernails or like
27:20
>> Nothing. It's learning everything
27:21
from the pixel values.
27:23
>> Everything.
27:23
>> Yeah. And this is sort of what I was,
27:25
you know, when I when Ike asked the
27:27
question about the four fingers, five
27:28
fingers thing, it has no idea of
27:30
fingers. It has zero knowledge about any
27:33
of these things. All it's seeing is a
27:34
bunch of photographs.
27:36
>> Okay. So when you when you type in say I
27:38
want a hand with green.
27:40
>> Oh, I see. So we haven't yet come to the
27:42
stage of okay, how do you actually steer
27:44
this image using your text prompt? It's
27:47
coming
27:48
>> right now. All we're saying is that
27:49
look, I'm going to give you a bunch of
27:51
uh photographs of a particular kind of
27:52
thing, stately college buildings and I
27:55
want to have a model which at the end of
27:56
the day I just poke it. Every time I
27:58
poke it, it gives me a stately college
27:59
building. That's it. Now I'm going to
28:01
actually start giving it text and saying
28:02
okay build the you know create the thing
28:04
I'm just telling you about that's coming
28:06
and that's sort of some additional magic
28:08
is going on to get that done. Okay, so
28:12
this is what we have, and this is
28:14
called a diffusion model. Okay. And this
28:16
is the original paper that figured this
28:18
out. Um, and
28:21
The process of
28:24
taking an image and creating noisy
28:26
versions of it to create training data
28:28
is called the forward process. And then
28:30
what we did in reverse is called the
28:32
reverse process. Uh, check out the
28:34
paper. It's actually really well
28:35
written. Uh, and I recommend it. Now, in
28:38
practice, uh, some other researchers
28:40
came along shortly after this and made a
28:42
small improvement. It turns out to be
28:45
actually a big improvement in practice
28:46
in terms of improving the quality of
28:48
what's being produced. And so what they
28:50
said is hey instead of training the
28:52
model to predict the less noisy version
28:53
of the image we actually ask it to
28:55
predict just the noise
28:58
in the input and then we will just
29:01
simply subtract the noise from the input
29:03
to get the image. So instead of saying
29:05
here is an X X is an image Y is the
29:08
noisy image we actually tell it here is
29:10
an image here is the noise that we added
29:12
to X to get the noisy version and
29:14
then just predict the noise for me and
29:16
then once I get it I just do X minus
29:17
noise and I get the less noisy version
29:19
of the image. Okay, this feels
29:21
arithmetically equivalent but in
29:24
practice it ends up generating much
29:26
higher quality images and there's some
29:28
very interesting theory as to why that
29:29
works and so on and so forth and you can
29:31
read this paper if you're interested.
29:33
Okay, so if you actually look at what's
29:34
going on in most diffusion models today,
29:36
they're basically using an approach like
29:38
this. They're actually predicting each
29:40
time they predict noise and take it
29:41
away, subtract it. So iterative
29:43
subtraction of predicted noise.
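As a sketch of that variant: the training target becomes the noise that was added, and a denoising step becomes "predict the noise, subtract it." This leaves out the noise schedule and rescaling that real DDPM-style models use at every step; it is only the bare idea.

```python
import numpy as np

def make_noise_prediction_pairs(clean_images, sigma=0.1):
    """Inputs are noisy images; targets are the exact noise that was added."""
    xs, ys = [], []
    for img in clean_images:
        eps = np.random.normal(0.0, sigma, img.shape)
        xs.append(img + eps)   # x: the noisy image
        ys.append(eps)         # y: the noise itself
    return np.array(xs), np.array(ys)

def denoise_step(noise_predictor, x):
    """One generation step: subtract the predicted noise from the input."""
    predicted_noise = noise_predictor.predict(x, verbose=0)
    return np.clip(x - predicted_noise, 0.0, 1.0)
```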
29:47
That's what's going on. So all right, so
29:49
that's what we have. Now at this point
29:52
you may be wondering, okay, so far in
29:55
the semester, uh we have actually
29:57
learned how to take an image and then
29:59
classify it into one of you know 20
30:01
things, 10 things, whatever. We have also
30:03
taken text and figured out how to do
30:05
things with it. We haven't yet talked
30:07
about how do you actually take an image
30:09
and how can we get the output also
30:11
to be another image. We haven't done
30:13
that yet. Okay. So we have actually not
30:16
done image to image. How do you actually
30:18
build a neural network to do image to
30:20
image? And in the interest of time
30:22
we're not going to get into it
30:23
massively but I want to give you a quick
30:25
idea of how it works. So the most
30:29
sort of I would say the dominant
30:31
architecture
30:33
to take an in image as an input and
30:35
produce an image as an output is called
30:36
the U-Net. Okay. And that's the
30:39
architecture we see here. So
30:42
so fundamentally if you look at the left
30:45
half so there's a left half to the
30:47
network and a right half to the network
30:48
hence the U. If you look at the left
30:50
half of the network it's it's a good old
30:53
convolutional neural network like the
30:55
kind we know and love. Okay. And the
30:58
kind that we are very familiar with. So
31:00
you take an input image and then you run
31:02
it through a bunch of convolutional
31:04
blocks and then we do
31:07
some max pooling and then we keep on
31:09
doing it and at some point it becomes
31:11
smaller and smaller and we get something
31:13
you know like this which we are very
31:15
familiar with right the the big image
31:17
with three channels gets smaller and
31:20
smaller smaller but the number of
31:21
channels gets wider and wider. it
31:22
becomes sort of much smaller but much
31:24
deeper right it becomes like a 3D volume
31:26
and we have seen that again and again
31:29
right the left part is just a good old
31:31
convolutional with pooling layers and
31:33
then you come to the middle and then
31:35
from this point on what we do is we take
31:37
whatever this thing here and then we
31:40
essentially reverse the process we go
31:43
from the small things which are really
31:44
deep to slightly bigger things that are
31:46
a little less deep and so on and so
31:49
forth till we get the original size back
31:50
again. Okay. And we do that using
31:54
an inverse of the convolution layer
31:57
called an upconvolution or deconvolution
31:59
layer. Okay. And you can check out 9.2
32:02
in the textbook to understand how
32:05
it's done. It's also called
32:07
Conv2DTranspose.
32:09
Okay. It's a very similar idea and I'm
32:12
not going to get into the details here
32:13
but you essentially do an inverse of a
32:15
convolutional operation to get the size
32:17
to come back to the bigger size and you
32:19
do it gradually till the output you have
32:22
matches the size of the input that came
32:24
in.
32:25
Okay, so image gets smaller and smaller
32:27
into a thing and then you just blow it
32:29
back up again to get an image back. So
32:31
that is the U-Net. Now there's one
32:34
very important thing that happens in the
32:36
U-Net, right? Which is
32:39
you see these connections, right?
32:43
Basically, what they do is at every step
32:45
when you're coming back up in the right
32:47
half, you actually attach whatever was
32:50
in sort of the mirror image of the
32:53
original input as we processed on the
32:54
left side, we attach it to this side as
32:56
well. Remember I talked about this whole
32:59
notion of a residual connection back,
33:01
you know, many classes ago where I said
33:03
when uh when an input goes through each
33:06
layer of a neural network at one point,
33:09
let's say you're in the 10th layer,
33:10
you're only seeing what is the ninth
33:13
layer is produced for you. That's all
33:14
you're working with. But would it be
33:16
nice if the the the 10th layer actually
33:18
had access to the eighth layer, the
33:19
seventh layer, the sixth layer, the
33:21
fifth layer? Heck, why not the input,
33:23
right? Because the more information it
33:25
has, the more able it's probably to do
33:27
whatever it can with the input it's
33:28
given. Why restrict it to only the
33:31
output of the previous
33:33
layer? Why can't we give it everything
33:34
that has come before it? Now giving
33:36
everything is too much. But we can be
33:37
selective in what we give it. Right? So
33:40
what these folks decided I'm sure after
33:41
much experimentation is that if they
33:44
actually attach whatever was coming out
33:46
of this layer to this layer before it
33:49
goes through the output, it really
33:51
helped. Similarly, this thing gets
33:53
attached and so on and so forth. And it
33:55
kind of makes sense. You know, why force
33:57
it to figure out everything it has to
34:00
figure out just from this thing that
34:01
came in, right? Let's give this that
34:03
that. Let's also give a little here, a
34:06
little here. So, these residual
34:07
connections are a huge building block
34:09
for why these things work as well as
34:10
they do. Okay? And in general, giving a
34:14
layer as much information as you can
34:15
give it is always a good idea, but you
34:17
can't go nuts, right? Because then you
34:19
have much more parameters and all kinds
34:20
of stuff happens. So there is a bit of a
34:22
balance you have to strike and this was
34:23
the balance struck by these researchers.
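Putting the two halves and the skip connections together, here is a deliberately small U-Net sketch in Keras (filter counts and the input size are arbitrary; see section 9.2 of the textbook or the original U-Net paper for the real architecture):

```python
import tensorflow as tf
from tensorflow.keras import layers

def tiny_unet(size=64, channels=3):
    inp = layers.Input(shape=(size, size, channels))

    # Left half: convolutions and pooling; the image gets smaller, the channels deeper
    c1 = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
    p1 = layers.MaxPooling2D()(c1)
    c2 = layers.Conv2D(64, 3, padding="same", activation="relu")(p1)
    p2 = layers.MaxPooling2D()(c2)

    # Bottom of the U
    b = layers.Conv2D(128, 3, padding="same", activation="relu")(p2)

    # Right half: up-convolutions back to the original size, concatenating the
    # mirror-image features from the left half at each step (the skip connections)
    u2 = layers.Conv2DTranspose(64, 2, strides=2, padding="same")(b)
    c3 = layers.Conv2D(64, 3, padding="same", activation="relu")(layers.Concatenate()([u2, c2]))
    u1 = layers.Conv2DTranspose(32, 2, strides=2, padding="same")(c3)
    c4 = layers.Conv2D(32, 3, padding="same", activation="relu")(layers.Concatenate()([u1, c1]))

    out = layers.Conv2D(channels, 1, activation="sigmoid")(c4)  # same shape as the input
    return tf.keras.Model(inp, out)
```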
34:25
And so this thing was originally
34:27
invented for some medical segmentation
34:30
use cases but it's just heavily used for
34:32
everything now. It's a really powerful
34:35
architecture. Questions?
34:39
>> uh can we have example of like in what
34:41
scenarios we use this kind of
34:44
>> anytime you have an image to image
34:46
>> like what kind of conversion do you get
34:49
image to image? or like what kind of
34:50
examples of use cases. Let's say that
34:52
for example you want to take an image
34:54
like a black-and-white image and you
34:55
want to colorize it
34:58
for instance, boom, U-Net. You want to
35:00
take an image and make it a higher
35:02
resolution image? U-Net. You want to take
35:04
an image and for every pixel in the
35:06
image you want to classify it into you
35:08
know one of 10 things. So anytime when
35:12
you want the output shape the shape of
35:14
the output to be basically the same
35:16
shape as the input but with other data
35:18
you need to use this.
35:20
Yeah.
35:25
>> But this logic of having access to all
35:28
the previous iterations
35:30
>> not iterations
35:31
>> all the previous layers
35:33
>> right the outputs of the previous layers
35:35
>> layers. Uh but this would also help uh
35:40
clean up and give better categorization
35:42
like does it always have to be an image
35:44
to image?
35:45
>> No. No. In fact, if you look at restnet,
35:47
restnet is the one in fact that
35:49
pioneered the idea of the residual
35:50
connection. So we use it in ResNet. We
35:53
also use it in the transformer stack:
35:56
if you remember it goes through the self
35:58
attention layer. It comes out the other
36:00
end and then we add the input back to it
36:03
and then we send it through the layer norm.
36:05
So you will see that this residual
36:07
connection is sitting in two different
36:08
places in a single transformer block. So
36:11
it's extremely heavily used. There is
36:13
something called deep and wide network
36:15
if I remember, or DenseNet, which uses the
36:17
same trick. In fact if you when you're
36:20
working with structured data right good
36:22
old say linear regression and you've
36:25
looked at your data and you come up with
36:26
all kinds of very clever features you
36:28
know I'm going to look at price per
36:30
square foot right you do a bunch of
36:32
feature engineering and you have a bunch
36:33
of new features. Well, you should take
36:36
your old features and your new features
36:38
and send both in.
36:40
Why send only the new stuff that you
36:42
have concocted? Why can't you send
36:43
everything in? That's the idea.
36:47
All right. Um, so let's come back here.
36:53
Now we have seen how to generate a good
36:54
image. Okay. Now let's figure out how to
36:57
steer it or condition it with a text
36:59
prompt, right? Because that's sort of
37:00
the holy grail.
37:02
So we want to take
37:05
so here's some intuition. We want to
37:08
take the text prompt into account and
37:09
obviously generate the image. Now
37:11
imagine if we had like a rough image
37:14
that corresponds to the text prompt.
37:16
Just imagine. So the text prompt is you
37:18
know, cute Labrador retriever, and you
37:21
have like a very noisy image of a
37:22
Labrador retriever. This just happens
37:24
to be handy. You have it. Well now
37:26
you're in good shape because you just
37:28
feed that in and your system will denoise
37:30
it for you. Right? You can get a
37:32
better image. That's pretty easy. So,
37:34
but obviously in reality, you don't have
37:36
a rough image. In fact, you're trying to
37:37
create one of those things in the first
37:38
place. We don't. So, but what if we had
37:41
an embedding for the prompt that's close
37:45
to the embeddings of all the images that
37:47
correspond to the prompt. So, let's take
37:49
a prompt and let's imagine all the
37:52
images in the in the universe that
37:54
correspond to that prompt. Okay?
37:57
And now further imagine because
37:58
everything is a vector. Everything is
38:00
an embedding in our world, that that image
38:02
has an embedding.
38:04
Sorry, the text prompt has an
38:06
embedding. Every image has an embedding
38:09
and we have somehow calculated these
38:12
embeddings so that the text prompts
38:14
embedding is smack where all the image
38:17
embeddings are.
38:20
We will get to how we actually do it in
38:21
a in just a moment. But conceptually
38:23
imagine if we had an embedding if you
38:26
could calculate embeddings for text and
38:28
embeddings for images. So they all live
38:30
in the same space.
38:32
Okay. So if we feed this embedding to a
38:36
denoising model, because that text
38:39
embedding is sitting in the same space
38:41
as all the image embeddings that it
38:44
corresponds to. Maybe our model can just
38:47
denoise that embedding and give you
38:49
what you want.
38:51
Okay, so since this embedding is already
38:54
close to the embeddings of the things we
38:55
want to generate, maybe you'll just get
38:57
it done.
38:59
So ultimately we want to generate an
39:00
image and if we had an embedding for
39:02
that image, we could generate the image
39:03
from the embedding and we use the text.
39:07
So we go from text to embedding which
39:09
happens to live in the same space as all
39:11
the embeddings of the images we care
39:12
about. And then from that image
39:14
embedding, we go to the final image.
39:15
Okay, this is a bunch of me talking and
39:18
handwaving. It'll all become very clear
39:20
but that's sort of the rough intuition.
39:22
Okay. So what we'll do now is we'll
39:25
describe an approach to calculate an
39:26
embedding for any text any piece of text
39:29
that is close to the embeddings of the
39:31
images that correspond to that piece of
39:34
text. So this is the problem we're going
39:36
to solve. There's a bunch of text
39:38
conceptually there are a whole bunch of
39:39
images that correspond to that text and
39:42
we're going to now create embeddings so
39:43
that that is close to all the embeddings
39:46
of those images. Right? It feels kind of
39:48
like almost impossible that you can
39:50
actually do something like this, but
39:52
there's a very clever idea uh that
39:56
OpenAI came up with that tells you how
39:58
to do it. So, here's what we're going to
39:59
do. So, let's say we have an image and a
40:02
caption. So, here's an image. Uh here's
40:05
a caption, right? And we need some way
40:08
to take that piece of text and run it
40:10
through some network and create a nice
40:12
embedding from it. Okay? Similarly, we
40:15
want to take this image, run it through
40:16
some network and create an embedding
40:17
from it. Okay. Now, first
40:19
question, how can we compute embeddings
40:20
from a piece of text? First question,
40:22
how can we compute an embedding from a
40:23
piece of text? You know the answer.
40:27
Run through a transformer. Piece of
40:30
cake. We know how to do that, right?
40:34
Right, in particular, you can do
40:35
something like BERT. And for an image
40:37
encoder, you just run it through
40:38
something like ResNet, like the
40:41
penultimate layer, right? One of the
40:42
final layers is going to be a very good
40:44
representation of that image. You get
40:46
another embedding. So using the building
40:48
blocks we already know, we can create
40:50
embeddings very quickly from these
40:52
things. Okay, but if you just take a
40:55
piece of text and run it through a BERT
40:56
and you take an image and run it through
40:58
a ResNet, you're going to get some
40:59
embeddings. But why the heck should they
41:01
be related?
41:04
They were not trained together. So
41:06
there's no basis for them to be related.
41:08
They would just be some two embeddings.
41:10
Maybe they are kind of similar. Maybe
41:11
they're not. We don't know. There's no
41:13
reason to expect that they're going to
41:14
be similar. Okay, they're just two
41:16
embeddings.
41:20
Now, what we want to do is but once we
41:22
have these, we need to make sure the
41:24
embeddings that comes out of these two
41:26
things satisfy two very important
41:27
requirements.
41:32
We want to make sure that if you give it
41:33
an image
41:35
and a caption that describes that image.
41:39
So you have an image and a caption that
41:40
describes that image, we want to make
41:42
sure that the embeddings that come out
41:43
of these two boxes, they are as close to
41:45
each other as possible.
41:47
Okay? Given an image and a
41:50
caption that describes it, that's the
41:51
connection. They have to be close to
41:53
each other. And conversely, if you have
41:56
an image and a caption that's totally
41:58
irrelevant,
42:00
right? A train rounding a bend with a
42:02
beautiful fall foliage all around,
42:03
right? Clearly irrelevant. Those
42:05
embeddings should be far apart.
42:08
that it to really make sense,
42:10
right? Pairs of related things should be
42:12
together, irrelevant things should be
42:13
far apart. So if you can find embeddings
42:16
that satisfy these two criteria, maybe
42:18
we will be in the game. Okay. So now
42:23
this ensures that the text embedding and
42:24
the image embedding are referring to the
42:26
same underlying concept. Right? This
42:28
these requirements will enforce that. Uh
42:31
and so the embedding for any text prompt
42:32
is close to the embedding for all the
42:34
images that correspond to that prompt.
42:38
So the question is how do we do this? Uh
42:41
how can first of all how can we tell how
42:43
close two embeddings are? You know the
42:44
answer to this what's the answer
42:47
>> Correct, cosine similarity, right? We use
42:49
the cosine similarity of the embeddings.
42:51
So we know how to measure closeness.
42:54
So the question is how can we compute
42:55
embeddings that satisfy the two
42:56
requirements and OpenAI built a model
42:59
called CLIP which is very famous uh to
43:02
solve this problem right it stands for
43:04
Contrastive Language-Image Pre-training
43:07
uh and this forms the basis for a whole
43:08
bunch of models that have sprung up
43:10
after this called BLIP and BLIP-2 and so
43:12
on and so forth but this is the
43:13
fundamental idea
43:15
okay so
43:17
this is how CLIP works. What they
43:20
did is they took a 12-layer, 8-head
43:25
transformer causal encoder stack as
43:28
a text encoder
43:30
uh okay now you understand this right
43:33
that's what it is, eight-layer, I mean
43:35
sorry, 8-head, 12-layer transformer causal
43:36
encoder stack, and that's a
43:39
text encoder so we send any piece of
43:41
text through it right you get the next
43:43
word prediction embedding and that's the
43:45
embedding you're going to use uh and
43:48
they took ResNet-50 and made it the
43:50
image encoder. They took ResNet-50, chopped
43:53
off the top and whatever was left is the
43:55
image encoder. Okay,
43:59
then they initialized these things with random
44:00
weights, and then they
44:03
grab a batch of image
44:05
caption pairs. So in this example, let's
44:07
say that we have these three images
44:09
and I have captions to go with these
44:11
images. Okay, we have these three things
44:14
and this is the key step. They run the
44:18
images through the image encoder and the
44:20
captions through the text encoder and
44:22
get these embeddings. Okay, it's a
44:23
forward pass. You send it through this
44:26
network, you get two embeddings. Um, and
44:29
then this is what they do. With these
44:32
embeddings, they calculate the cosine
44:34
similarity for every image caption pair.
44:36
Okay? And so imagine something like
44:38
this. So you have these three captions,
44:41
you have these three images, and those
44:43
are the embeddings.
44:45
uh and then they calculate the cosine
44:47
similarity for every one of those
44:49
things.
44:51
It took me like 5 minutes or 10 minutes
44:52
to do this PowerPoint. You're welcome.
45:00
Particularly trying to get this comma to
45:02
line up is a real pain in the neck. So,
45:05
so all right. So, we have this here.
45:08
Okay. And now what we want to do is uh
45:11
we want these scores to be as high as
45:13
possible, right? Because the scores in
45:16
the diagonal are the ones for the
45:18
matching picture and caption,
45:21
right?
45:23
Those are
45:24
the scores for the matching pairs of
45:26
embeddings. We want them to be as high
45:28
as possible.
45:30
Okay. Um
45:32
So we want to maximize the sum of the
45:35
green cells, right? These are the green
45:37
cells, the diagonal. So if you
45:40
want to write it as a loss function
45:42
because the loss function is always
45:43
minimization, we basically say minimize
45:46
the negative sum of the green cells.
45:50
Okay, so the question is would this loss
45:52
function do the trick?
45:58
Seems reasonable. You want to make sure
46:00
the related things are really close
46:03
together. So you want to maximize
46:07
uh if that was the only part of the loss
46:09
function, wouldn't it just kind of
46:10
squish everything to the same spot in
46:12
the space?
46:13
>> Correct.
46:14
What it's going to do is it's going to
46:16
basically ignore the input.
46:20
The optimizer can simply ignore the
46:21
input, make all the embeddings the same.
46:24
For example, it can just make all the
46:25
embedding zero.
46:28
That's it. And then now we have a
46:30
perfect cosine similarity for
46:32
everything. For a any pair of image and
46:35
captions, the cosine similarity is going
46:36
to be one. It's perfect, right? So
46:38
clearly that's not enough. This is by
46:41
the way is called model collapse, right?
46:44
So to prevent it from doing that, we
46:46
need to do one more thing to the loss
46:47
function. Any guesses?
46:51
>> Yeah.
46:53
>> Uh make the images that aren't related
46:56
not have a high cosine similarity.
46:58
>> Exactly. Right. Exactly right. So what
47:00
we want to do is we want the scores of
47:02
the red stuff to be as small as
47:05
possible.
47:07
We want the green stuff to be as large as
47:09
possible and the red stuff to be as
47:10
small as possible.
47:12
Together it'll get the job done.
47:16
Okay. And so um so we want to maximize
47:20
the sum of the green cells and minimize
47:22
the sum of the red cells. So the
47:24
equivalent loss function is minimize the
47:26
sum of the red cells and the negative
47:28
sum of the green cells. That's it. So
47:31
all CLIP does is that it just grabs a
47:34
batch of image caption pairs, runs it
47:37
through the networks, calculates the
47:38
embeddings and calculates this sum of
47:41
the stuff here and that is your loss and
47:44
then back propagates through the
47:45
network. Boom. Batch batch batch. Do it
47:48
a whole bunch of times. And OpenAI did
47:50
this with uh oh this is the official
47:53
picture from the paper
47:55
which is worth reading by the way right
47:57
it comes in text encoder you get these
47:59
uh embedding vectors image encoder and
48:02
then boom the diagonal is maximized and
48:05
the off diagonals are minimized
48:07
and they did it with 400 million image
48:10
caption pairs scraped from the internet.
48:14
400 million.
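As a sketch, the batch loss exactly as described here is: compute the cosine-similarity matrix, maximize the diagonal, minimize the off-diagonal. (The actual CLIP paper uses a temperature-scaled softmax cross-entropy over this same matrix, but the intuition is this.) Here image_embeddings and text_embeddings stand for the outputs of the two encoders on a batch of matching pairs.

```python
import tensorflow as tf

def clip_style_batch_loss(image_embeddings, text_embeddings):
    """Both inputs: (batch, dim) tensors, row i of each coming from the
    same image-caption pair."""
    # Cosine similarity = dot product of L2-normalized embeddings
    img = tf.math.l2_normalize(image_embeddings, axis=-1)
    txt = tf.math.l2_normalize(text_embeddings, axis=-1)
    sim = tf.matmul(img, txt, transpose_b=True)          # (batch, batch) similarity matrix

    green = tf.reduce_sum(tf.linalg.diag_part(sim))      # matching pairs (the diagonal)
    red = tf.reduce_sum(sim) - green                     # mismatched pairs (off-diagonal)

    # Minimize the sum of the red cells and the negative sum of the green cells
    return red - green
```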
48:16
By the way, you folks who work in the
48:18
space may know this really well, but uh
48:20
one very easy way to get a caption for
48:23
an image, right? You we see images, but
48:26
where do you think the captions come
48:27
from? Where did they get those captions?
48:29
They didn't obviously they didn't ask
48:30
people to manually label each image of
48:32
the caption. Where do you think they got
48:33
it from?
48:35
>> Google search.
48:36
>> Uh Google search can help but why does
48:39
Google search actually find the caption?
48:41
How does it because Google search is not
48:42
creating the caption? um
48:45
>> take it from the alt text on the images.
48:47
>> Correct. Alt text. So a lot of folks for
48:50
accessibility reasons they have alt text
48:52
right on all the images they create. A
48:54
lot of people have alt text in their
48:56
images they publish on the web and
48:58
that's what we use. And the alt text
49:00
actually ends up being a more verbose
49:03
description of the image than a typical
49:05
caption which tends to be much briefer.
49:07
And for us, more verbose, the longer the
49:10
better because there's more stuff for
49:11
the model to learn from.
49:14
Um, so that's how they built CLIP.
49:17
And so now what we can do is
49:19
use CLIP's text encoder by itself,
49:22
right? We can send in any text and get
49:24
an embedding that is close to the
49:25
embedding of any image that is described by
49:28
the text.
49:31
Okay. Now, by the way, CLIP can be used
49:33
for zero-shot image classification.
49:37
And what I mean by zero-shot image
49:39
classification, I'll walk through
49:40
the picture in just a second, is that
49:42
typically when you want to build an
49:43
image classifier, right, you can get a
49:45
whole bunch of training data of images
49:47
and their labels and then we train them,
49:50
right? Maybe you take something like
49:51
restnet, chop off the top, attach our
49:54
own output head and train, train, train.
49:56
Boom, you have a classifier. But the
49:58
only problem with that is let's say that
50:00
tomorrow so today for example you had
50:02
five classes in your problem and
50:04
tomorrow somebody comes along and says
50:06
oh actually we have a sixth category
50:09
right what do you do then well you have
50:10
to go back to the drawing board and
50:11
retrain the whole thing with six labels
50:13
now not five because your problem has
50:15
changed would it be great if you had a
50:17
classifier where you just come to it and
50:20
say here's an image and here are the six
50:22
possible labels I want you to pick from
50:23
pick one from me and you want to be able
50:26
to give it a different set of labels
50:27
those each time and it'll just use the
50:30
labels you're giving it and the image
50:32
and figures out which which label
50:33
corresponds to the image you just fed it
50:35
that would be an insanely flexible image
50:38
classification system right and that's
50:40
what I mean by zero-shot image
50:42
classification and you can use CLIP to
50:44
do zero-shot image classification.
50:47
Now, how you do it is actually in the
50:50
picture though not very clearly done
50:52
anyone wants to
50:58
How can you use CLIP to build, like, an
51:01
infinitely flexible image classifier?
51:12
>> Um, I mean the text input was, like,
51:14
trained like BERT, right? So in the same way
51:16
BERT can handle words never seen before,
51:19
does it essentially do that? Sorry, say
51:21
that again. The second part
51:22
>> you're saying you're saying it sees a
51:24
text input with something it's never
51:25
seen before, right? Yeah.
51:26
>> Okay. So, in the BERT model, which is
51:28
where it came from, in the text
51:30
encoding in the BERT model, I think we
51:32
talked about when it sees a word it
51:35
doesn't know that it's never seen
51:36
before, it can use the context words
51:39
around it to try to
51:41
>> Right. Right. So, but here, just to
51:43
be clear, I want you to use the CLIP that
51:46
we just built, right? And assume CLIP
51:49
knows all the words, because
51:51
it's been trained on a big vocabulary.
51:53
You can give it any text you want. It'll
51:54
create an embedding from it. That's the
51:57
key capability.
52:02
>> So it creates a text embedding for
52:06
>> Yeah.
52:06
>> because like and then for your image.
52:11
So comparing similarity scores between
52:14
the two. The image is complete, but the
52:15
text is not complete. There'll be
52:17
missing pieces and then make some
52:18
prediction using this.
52:21
Why is there a missing piece in the
52:22
text?
52:24
>> Because, um, the text
52:28
does not contain the class. Um,
52:31
but for the image, the way it
52:34
was trained, it was trained with
52:36
pairs with the class included.
52:38
>> right but we actually know the class now
52:40
because so the use case is that I come
52:42
to you with an image and I say here are
52:45
the seven possible labels for this image
52:48
and each label is a piece of text.
52:51
So you actually have seven
52:53
pieces of text and an image, and all I
52:55
want CLIP to do is to tell me, okay, the
52:58
fourth label, say, is the right
53:00
one for this image,
53:03
but you're on the right track
53:08
once you see how it's done you'll be
53:09
like yeah of course
53:13
>> I might not be understanding something, but
53:15
wouldn't you just pick the embedding
53:16
that's the closest, like the
53:18
text embedding that's the closest to the
53:20
image embedding?
53:22
>> Correct. You're not missing anything. That's the right
53:23
answer. Well done.
53:26
Come on people. Can you applaud our
53:27
fellow here? [applause]
53:30
You folks are hard to impress.
53:32
That's exactly what we do. So here,
53:38
the key thing to remember, the key
53:40
thing to keep in your head, is that
53:42
a label is just text:
53:45
dog, cat, right? It's just text. So you
53:47
can just imagine taking each label,
53:50
which in this case is plane, car, dog,
53:52
whatever. For each one of them you create
53:54
an embedding; you get T1 through TN
53:57
if you have N labels. For the image, you
53:59
just have one embedding, I. And then you
54:01
just calculate the
54:03
cosine similarity, and whichever is the
54:04
highest number, you say, okay, it's a dog.
54:06
That's it.
54:09
it's super just imagine the level of
54:11
flexibility here
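For reference, here is roughly what that zero-shot recipe looks like with the publicly released CLIP checkpoint on the Hugging Face Hub; the checkpoint name, the candidate labels, and the test image URL are plausible examples, not ones given in the lecture:

```python
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a plane", "a photo of a car", "a photo of a dog", "a photo of a cat"]
url = "http://images.cocodataset.org/val2017/000000039769.jpg"   # any image you like
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)   # similarities turned into probabilities
print(dict(zip(labels, probs[0].tolist())))                  # the highest score is the predicted label
```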
54:15
So that's a side use of CLIP, unrelated
54:18
to diffusion models, but I just
54:20
thought it's really clever, so I wanted
54:21
to share that. Okay, good. Now let's see
54:23
how we can actually use this entire
54:25
capability to solve the original
54:27
problem we set out to solve which is can
54:29
we steer the diffusion model to create
54:31
an image based on a particular prompt we
54:33
give it um so now remember if you go
54:37
back to how we did it we created all
54:39
these training pairs of X and Y based on,
54:41
you know, noising the image. X is
54:44
the image, Y is the less noisy version of
54:46
the image. So what we can simply do is we
54:51
can actually change the input so it
54:53
becomes the image and then the CLIP text
54:56
embedding of the caption for that image.
54:59
So you have an image and you have a
55:00
caption. You take the caption, run it
55:02
through CLIP, you get an embedding. By
55:05
definition, that embedding
55:07
lives in the same space as all the
55:09
images that correspond to that caption.
55:13
Right? So you just
55:15
concatenate the CLIP
55:18
embedding of the caption along with the
55:20
image. You say, make that the new input.
55:22
Now Y continues to be the less noisy
55:24
version of the image or as we saw
55:26
earlier it could be just the noise
55:27
component of the image. Okay, this is
55:30
the new XY pair that we have. And so now
55:34
to train the model, you send in the CLIP
55:36
embedding along with the
55:39
noisy version of the image X, and you keep
55:41
on training it for a while. Once your
55:43
model is trained, when you want to
55:44
use it for inference on a new
55:46
prompt, you just give it, you know,
55:49
Killian Court at MIT during the springtime,
55:51
along with a bunch of noise. It goes in, it
55:55
starts denoising it. But because this
55:57
embedding, thanks to CLIP,
56:00
lives in the same space as all the Killian
56:02
Court embeddings, the Killian Court images,
56:05
you keep on doing it for a while, and at
56:07
some point you'll get Killian Court.
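For concreteness, here is a toy sketch of that training setup, not the actual Stable Diffusion code: a small denoiser takes a noisy image plus a caption embedding and is trained to predict the noise that was added. The module, the single fixed noise level, and the random tensors are all simplified stand-ins.

```python
import torch
import torch.nn as nn

class TextConditionedDenoiser(nn.Module):
    """Toy stand-in for the real denoising network: noisy image + caption embedding -> predicted noise."""
    def __init__(self, img_channels=3, text_dim=512, hidden=64):
        super().__init__()
        self.conv_in = nn.Conv2d(img_channels, hidden, 3, padding=1)
        self.text_proj = nn.Linear(text_dim, hidden)                    # project the CLIP text embedding
        self.conv_mid = nn.Conv2d(hidden, hidden, 3, padding=1)
        self.conv_out = nn.Conv2d(hidden, img_channels, 3, padding=1)   # predict the added noise

    def forward(self, noisy_img, text_emb):
        h = torch.relu(self.conv_in(noisy_img))
        # condition on the caption: add the projected text embedding at every spatial location
        h = h + self.text_proj(text_emb)[:, :, None, None]
        h = torch.relu(self.conv_mid(h))
        return self.conv_out(h)

model = TextConditionedDenoiser()
x0 = torch.rand(8, 3, 64, 64)            # clean training images
text_emb = torch.randn(8, 512)            # CLIP embeddings of their captions (assumed precomputed)
noise = torch.randn_like(x0)
noisy = x0 + 0.5 * noise                  # one fixed noise level, for simplicity
loss = nn.functional.mse_loss(model(noisy, text_emb), noise)   # Y here is the noise component
loss.backward()
print(loss.item())
```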
56:11
That's how they do it. That's how they
56:12
steer the image. It's a two-step
56:15
process. You create all these CLIP
56:16
embeddings. CLIP was a
56:19
breakthrough, in my opinion, because it
56:21
was one of the, maybe the first
56:22
example. I don't know if it's the very
56:24
first but one of the early examples of
56:26
saying we have different kinds of data.
56:28
We have images, we have captions, we
56:30
have text. How do we create embeddings
56:32
for every one of these very different
56:34
data types that all happen to live in
56:36
the same space, the same concept space?
56:38
That was the key idea. And if you look
56:40
at the modern multimodal large language
56:42
models, they are all based on the same
56:44
exact idea.
56:46
So it's very powerful this approach.
56:49
Yeah. Now I understand this for images,
56:51
but for video generation models like
56:54
Sora, do they have some sort of
56:56
underlying physics structure or do they
56:58
learn the physical representations?
57:00
>> There's a lot of debate on the internet
57:02
about this stuff. Um they haven't
57:04
published the results, the full
57:05
technical report yet. So we don't know
57:07
for sure but the consensus seems to be
57:09
no, they are not using a physics
57:11
engine. What they have done, uh, and again
57:14
this may be wrong once the report comes
57:15
out we'll know for sure but uh people
57:17
what people are saying, computer vision
57:19
experts, is that it has been trained
57:22
on a lot of video game data
57:25
uh along with actual videos and so on
57:28
and the corpus of training is
57:30
so massive that it has basically learned
57:32
to mimic certain physics aspects to it
57:35
just as a side effect, much like LLMs: you
57:38
train them on a large amount of text
57:39
data, they begin to do things
57:41
which you didn't anticipate that they'll
57:43
do right so for example I read this I
57:46
thought it's a really great example of
57:48
what is surprising about large language
57:50
models is not that you know you train
57:52
them on a bunch of high school math
57:54
problems and then you give it a new high
57:56
school math problem it can actually
57:57
solve it that's not surprising you give
57:59
it a whole bunch of high school math
58:00
problems in English then you ask it to
58:03
read a bunch of French literature and
58:05
then you give it French high school math problems, and it
58:07
will solve them. That is the new
58:08
news, right? So similarly here I think
58:12
the expectation is that it's not
58:13
actually using a physics engine under
58:15
the hood. It may have used a physics
58:16
engine to actually come up with the
58:17
videos and renderings but there are no
58:20
physics constraints in the model itself.
58:22
It just comes out of the training
58:23
process. That's the current view. Once
58:26
the technical report comes out, we'll
58:27
know for sure what they actually did.
58:30
U
58:33
>> so quick question about stability. It's
58:36
claiming to be a little bit more real
58:37
time in their image generation. Um, so
58:40
>> you mean stable diffusion?
58:41
>> Yeah, stable diffusion. So, are they
58:43
jumping through the noise more quickly
58:45
or are they kind of like pre-prompting
58:46
it and kind of trick?
58:47
>> Very good question and there's a very
58:48
key trick. It's coming.
58:50
>> Um,
58:52
>> So here, the example of the noise is a
58:55
normal distribution. However, if we have
58:57
changed the noise distribution, does it
59:00
change the result?
59:02
>> Oh, you mean if you change it to, like, a Poisson or some other
59:04
distribution? It'll definitely change
59:05
the results, because if you look at the
59:08
underlying math of why this works, it
59:10
heavily depends on the Gaussian
59:11
assumption.
59:13
>> Yeah. Um there was another question
59:15
somewhere here.
59:18
>> Um you may not know the answer because
59:20
the technical report isn't out, but could it
59:21
be in terms of video generation sort of
59:23
analogous to going from, like, one fuzzy,
59:26
noisy image to another? Like you're
59:28
almost doing a series of still images
59:30
and learning how to
59:31
>> No, actually, I think people are pretty sure that
59:33
is how it's done. So, basically you
59:35
think of the video as just a
59:36
series of frames, right? And each frame
59:39
is an image and there is a sequentiality
59:41
to it. Um, which is where the
59:43
transformer stack will come in because
59:44
it handles sequentiality. So, in general
59:47
video stuff typically operates on frame
59:50
by frame which is just an image. So,
59:53
that is definitely there. What we don't
59:54
know is if they also used some
59:57
understanding of the fact that for
59:59
example that if an object is dropped it
1:00:02
has to fall to the earth at a certain
1:00:04
rate or if an object goes behind another
1:00:06
object you can't see the object anymore
1:00:08
right things like that which we take for
1:00:10
granted um the question is are they
1:00:12
using it and the consensus seems to be
1:00:15
uh in the absence of an actual technical
1:00:17
report that no they're not doing it
1:00:18
because there are lots of examples on
1:00:20
Twitter where people will show a Sora
1:00:22
video in which it's not obeying the laws
1:00:24
of physics. So you take like a beach
1:00:26
chair and then put it in the sand. You
1:00:28
see the sand come through the base of
1:00:30
the beach chair, right? Or you take an
1:00:32
object and put it behind an object. You
1:00:33
can still see the object even though the
1:00:35
original object is opaque. So you'll be
1:00:37
seeing some evidence that, no, it's not
1:00:38
obeying the laws of physics. What you're
1:00:39
seeing is just an amazing
1:00:46
fingers without knowing there have to be
1:00:47
only five fingers.
1:00:50
Um
1:00:51
Okay. All right. So let's keep going
1:00:55
now. Um, so there was another paper
1:00:58
afterwards, and this is the original
1:01:00
paper, which took that idea of the
1:01:02
diffusion model. And diffusion is
1:01:05
very slow, as Olivia pointed out. So
1:01:07
the question is can we make it much
1:01:08
faster? Right? So what they did and I'm
1:01:11
not going to get into this whole thing
1:01:12
here. I just want to highlight a couple
1:01:14
of things. The first one is that um
1:01:18
first of all, notice that you see a U-Net
1:01:20
here. So they are using a U-Net, right,
1:01:23
to go from image to image.
1:01:25
The second thing is that the CLIP
1:01:28
embedding of the text prompt is
1:01:30
basically woven in, meaning it's
1:01:32
incorporated into the
1:01:34
U-Net through an attention
1:01:36
mechanism, a transformer mechanism, and
1:01:38
you can see the QKV business here which
1:01:41
should be familiar at this point. So it
1:01:43
is integrated into the transformer stack
1:01:45
directly, that input, the CLIP embedding.
1:01:47
That's the second thing I want to point
1:01:48
out. And then thirdly
1:01:50
and this is where the speed up comes. So
1:01:52
what you do is instead of taking the
1:01:54
image running it through the whole
1:01:56
network and creating a slightly less
1:01:57
noisy version of the image here what you
1:01:59
do is you take the image you run it
1:02:02
through an image encoder you get an
1:02:03
embedding and now you only work with the
1:02:05
embedding you take the embedding and
1:02:07
create a slightly less noisy version
1:02:09
embedding keep on doing it and these
1:02:11
embeddings are much smaller than images
1:02:13
therefore they're much faster to process
1:02:14
and once you've done it like a thousand
1:02:16
times you get a very sort of almost pure
1:02:18
noiseless version of the embedding. Now you
1:02:20
run it through an image decoder to get the image back.
1:02:24
So the idea here is that you
1:02:26
operate
1:02:29
in the latent space, meaning the
1:02:31
embedding space and hence it's called a
1:02:32
latent diffusion model. So that's where
1:02:35
the speed up comes but research
1:02:36
continues to be very strong to make it
1:02:38
even faster because for a lot of
1:02:40
consumer applications people are
1:02:41
obviously not going to wait around. I
1:02:43
mean who wants to wait for 10 seconds
1:02:44
right? And so there's a lot of
1:02:46
pressure to make it even faster
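To make that concrete, here is a toy sketch of the latent-diffusion idea; the encoder, decoder, and denoiser below are stand-in modules, not the real Stable Diffusion components, and the update rule is deliberately crude:

```python
import torch
import torch.nn as nn

# stand-in modules, not the real Stable Diffusion VAE / U-Net
encoder = nn.Conv2d(3, 4, kernel_size=8, stride=8)            # 512x512x3 image -> 64x64x4 latent
decoder = nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)    # latent -> 512x512x3 image
denoiser = nn.Conv2d(4, 4, kernel_size=3, padding=1)           # stands in for the text-conditioned U-Net

with torch.no_grad():
    print(encoder(torch.rand(1, 3, 512, 512)).shape)            # ~48x fewer numbers than the full image

    latent = torch.randn(1, 4, 64, 64)            # generation starts from pure noise in latent space
    for step in range(50):                         # all the iterative work happens on the small latent
        latent = latent - 0.02 * denoiser(latent)  # crude "remove a little noise" update
    image = decoder(latent)                        # decode to pixels only once, at the end
    print(image.shape)                             # torch.Size([1, 3, 512, 512])
```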
1:02:49
um
1:02:52
all right so that's what we have
1:02:53
obviously, um, you know, these
1:02:56
models are transforming everything and
1:02:58
uh by the way, this site here, lexica.art.
1:03:00
You can go check it out. Uh, it has
1:03:01
a whole bunch of very interesting images
1:03:03
and prompts that created the images. So
1:03:06
if you're working in the space, it gives
1:03:07
you a lot of interesting ideas. But it's
1:03:09
not just for you know consumer fun
1:03:11
applications. You know, these models
1:03:13
are actually being used. You know,
1:03:15
AlphaFold, if you'll recall: if you give
1:03:18
it an amino acid sequence it can
1:03:19
actually create the 3D structure. Right?
1:03:21
So that's an example of that, though
1:03:24
I don't think they use a diffusion
1:03:25
model. But you can imagine using a
1:03:27
diffusion model to create these
1:03:28
complicated objects. Meaning the objects
1:03:32
you create don't have to be images.
1:03:34
They can be arbitrarily complicated
1:03:36
things. As long as you have enough data
1:03:39
about such things to use for training
1:03:41
and the notion of noising the input is
1:03:43
meaningful, you can create some very
1:03:45
interesting structures. you can create
1:03:47
3D things and, you know, protein
1:03:49
structures and there's a whole bunch of
1:03:51
very interesting applications in
1:03:52
biomedical uh sciences. So this is
1:03:55
really just the tip of the iceberg and
1:03:57
now there are these things um there are
1:03:59
ways in which you can use diffusion
1:04:00
models to do large language
1:04:03
modeling as well. So there's a lot of
1:04:05
overlap and blending and so on going on
1:04:07
in the space. So so I'm going to do a
1:04:10
quick demo. Um if you look at hugging
1:04:11
face there is something called the
1:04:12
diffusers library which is like the the
1:04:15
as the name suggests it's a library for
1:04:17
a lot of diffusion models
1:04:20
and let's take a quick look.
1:04:25
All right, so the diffusers
1:04:27
library has a whole bunch of diffusion
1:04:28
models. We're going to work with stable
1:04:30
diffusion, which is one of, you know,
1:04:32
the better known models. So let's
1:04:34
install diffusers.
1:04:38
Uh, you will recall when I did
1:04:41
the quick lightning tour of the hugging
1:04:42
face ecosystem for language. Uh, Hugging
1:04:45
Face has a whole bunch of capabilities
1:04:48
sort of built out of the box, and you use
1:04:50
this thing called the pipeline function
1:04:52
to very quickly use any model you want.
1:04:54
The same exact philosophy applies here.
1:04:56
You still use the pipeline. So I'm going
1:04:59
to import a bunch of stuff.
1:05:09
All right. So, oh, I see I have to do
1:05:11
this thing. Okay.
1:05:16
Great. F.
1:05:21
Okay. So, uh, all right. Here's what we have
1:05:24
here. So you'll remember that,
1:05:26
when we worked with text,
1:05:28
we would grab a pre-trained model and
1:05:30
then we'd actually run it through a
1:05:31
pipeline, and we could do all the inference
1:05:33
we want on it. The same exact philosophy
1:05:36
applies here. So, um, and this is very
1:05:39
similar to what we did in lecture 8 for
1:05:41
NLP. So what we're going to do is we use
1:05:44
this command the stable diffusion
1:05:46
pipeline from pre-trained and we use
1:05:48
this version 1.4 stable diffusion model.
1:05:50
Um so let's just create the pipeline and
1:05:56
and obviously we have used TensorFlow,
1:05:58
not PyTorch, here, but a lot of these
1:06:00
models unfortunately happen to be in
1:06:02
PyTorch, so knowing a little bit of PyTorch
1:06:05
is actually very helpful, um, to be able
1:06:07
to work with these things and what we're
1:06:09
doing here uh while it's downloading uh
1:06:12
we are using this fp16
1:06:15
um storage format for the model
1:06:18
weights because it's going to be a
1:06:19
little smaller than using 32 bits so
1:06:22
it'll download faster. So that's what's
1:06:24
happening here. So all right, it's
1:06:25
downloaded fine. So now we just give it
1:06:28
a prompt and this is actually one of the
1:06:29
original famous uh meme prompts a
1:06:32
photograph of an astronaut riding a
1:06:34
horse. And so uh once we have the
1:06:36
pipeline set up, uh, I'll just set a seed for
1:06:38
reproducibility. And then literally I do
1:06:40
pipe of prompt and then it's actually
1:06:44
you can see here 50. So it's going
1:06:46
through 50 denoising steps. Okay. Um, and
1:06:50
you come up with an astronaut riding a
1:06:52
horse. Okay. So that's that. Um, you can
1:06:54
actually change the seed and you can
1:06:56
get a different one. Um, the seed basically
1:06:59
sets the random starting point
1:07:01
for the image. So therefore you would
1:07:03
expect a different astronaut. Yep. This
1:07:05
is an astronaut riding another horse. So
1:07:08
um I think people came up with these
1:07:09
kinds of fun examples because it's
1:07:11
guaranteed not to be in the training
1:07:12
data, right? So whatever the model is
1:07:15
doing, remember, it's not
1:07:16
regurgitating what it has already seen.
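The notebook cells in this part of the demo amount to roughly the following (a paraphrase; the exact seed value and arguments are assumptions):

```python
# pip install diffusers transformers accelerate
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,              # fp16 weights: smaller download, less memory
).to("cuda")

generator = torch.Generator("cuda").manual_seed(42)   # the seed fixes the random starting noise
prompt = "a photograph of an astronaut riding a horse"
image = pipe(prompt, num_inference_steps=50, generator=generator).images[0]
image.save("astronaut.png")
```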
1:07:18
Uh, all right. Give me a prompt.
1:07:26
Prompts. Anyone?
1:07:29
Wow.
1:07:34
>> Okay,
1:07:38
that might be a
1:07:40
All right. Riding a horse.
1:07:48
All right,
1:07:56
there are two of them and clearly MIT
1:07:59
professors don't have really.
1:08:03
Yeah, moving on. [laughter]
1:08:06
So, by the way, um, you should
1:08:10
spend some time with the diffusers
1:08:11
library, they have a bunch of tutorials
1:08:12
which are really interesting because
1:08:14
this core capability of giving a prompt
1:08:16
and getting an image out can actually be
1:08:18
manipulated for all sorts of very
1:08:20
interesting use cases. So, for example,
1:08:22
there is this thing called negative
1:08:23
prompting. And the idea of negative
1:08:25
prompting is that you can give it two
1:08:28
prompts and say create an image which
1:08:31
embodies the first prompt but not the
1:08:33
second prompt. Essentially, subtract the
1:08:36
second prompt from the first one. That's
1:08:37
called negative prompting. And you might
1:08:39
be wondering like what use is that?
1:08:41
There are lots of fun uses. So here we
1:08:45
are going to, the prompt is going to be, a
1:08:46
Labrador in the style of Vermeer. Okay,
1:08:49
that's the first prompt. 50 steps.
1:08:53
Uh look at that. Amazing, right? Uh but
1:08:57
maybe you don't care for the blue scarf.
1:09:00
So you basically give it a negative
1:09:02
prompt. And you basically the negative
1:09:04
prompt is blue meaning remove everything
1:09:06
that's blue. I don't like this otherwise
1:09:09
keep the Labrador thing going. So you
1:09:11
run it.
1:09:16
Look at that. The blue is gone. Negative
1:09:18
prompting. Okay. Yeah.
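In the diffusers API, the negative prompt is just an extra argument on the same call. Roughly, and reusing the pipe object from the earlier sketch (the exact prompt wording is an assumption):

```python
prompt = "a labrador in the style of Vermeer"
image = pipe(prompt, num_inference_steps=50).images[0]            # the original: blue scarf and all

# same prompt, but steer the sampling away from anything "blue"
image_no_blue = pipe(prompt, negative_prompt="blue",
                     num_inference_steps=50).images[0]
```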
1:09:22
>> If you change that from 50 to
1:09:26
a thousand, will it become less pixelated
1:09:28
or will it eventually just keep going
1:09:30
and iterating?
1:09:31
>> No. Typically, if you do more of these
1:09:32
things, it gets better. The quality is
1:09:34
much better because each step will
1:09:36
denoise it very slightly. So, errors
1:09:38
won't accumulate and things like that.
1:09:40
And the diffusers library gives you lots
1:09:42
of controls for fiddling around with all
1:09:44
these things. Um, okay. So, that's what
1:09:47
we had. Uh, 949.
1:09:50
Okay. So, check out this tutorial if
1:09:52
you're curious about how this stuff
1:09:54
works. And I'm going to do one other
1:09:56
thing um because I didn't get to do it
1:09:58
earlier on. So uh we spent some time
1:10:01
with the hugging face hub and I walked
1:10:03
you through a few use cases for text uh
1:10:05
where you can take a text model and use
1:10:07
it for you know classification uh things
1:10:10
like that summarization and so on and so
1:10:11
forth. You can do the same thing for
1:10:13
computer vision models. So if you have a
1:10:16
computer vision problem that just maps
1:10:17
to a standard, uh, computer vision task,
1:10:20
you can just use the Hugging Face Hub as
1:10:21
well. So um let me just show you very
1:10:25
quickly the same kind of thing actually
1:10:27
works here.
1:10:32
All right. Okay. So,
1:10:35
so let's say that you want to classify
1:10:37
something. You just import the pipeline
1:10:38
as before.
1:10:40
And once you import it, you can just
1:10:43
literally give it the standard task that
1:10:45
you care about like image
1:10:46
classification.
1:10:48
And then you can start using it
1:10:50
right from that point on.
1:10:53
Okay.
1:10:59
All right. Okay. So now I'm going to
1:11:02
just get this image. So it's a very
1:11:04
famous image. Um, right. And we're going
1:11:06
to ask it to classify this image. So we
1:11:08
just literally run it through the
1:11:09
pipeline.
1:11:12
And it says the most likely label, at 94%
1:11:15
probability, is an Egyptian cat. Seems
1:11:18
reasonable. Okay. I mean, it's a
1:11:20
tough picture, right? Because there are
1:11:21
lots of things going on in that picture.
1:11:22
It's not like one image, one object.
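The classification part of this demo boils down to roughly the following; the image URL is assumed to be the standard COCO cats-on-a-couch picture used throughout the Hugging Face docs, since the exact URL isn't shown in the transcript:

```python
from transformers import pipeline

url = "http://images.cocodataset.org/val2017/000000039769.jpg"

clf = pipeline("image-classification")                  # default checkpoint chosen for you
print(clf(url)[0])                                       # top label and its probability

# or pick any checkpoint you like from the Hub instead of the default
clf_resnet = pipeline("image-classification", model="microsoft/resnet-50")
print(clf_resnet(url)[0])
```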
1:11:25
Um okay so you don't have to use the
1:11:27
default model you can actually give it
1:11:29
your own model that you want. So for
1:11:31
example, you can go um sorry
1:11:35
you can go to the Hugging Face Hub
1:11:38
and you can go in there and say all
1:11:40
right I want image classification these
1:11:42
are all the models 10,487 models let's
1:11:45
sort by I don't know most downloads or
1:11:49
maybe most likes
1:11:51
u and you have all these models you can
1:11:53
pick any one of them so for example
1:11:54
let's say you want to pick Microsoft
1:11:56
ResNet as your model. That's what I tried
1:11:57
here. So I have Microsoft ResNet; you
1:12:00
just set model equals that, run it, and it
1:12:04
takes care of all the tokenization this
1:12:05
that and whatnot. It's really very handy
1:12:08
and then you run it through the pipeline
1:12:09
again and it says tiger cat 94%
1:12:12
probability, according to ResNet. Um, so
1:12:15
yeah so that's how you do it. Now let's
1:12:17
actually try a more interesting example
1:12:18
where you want to detect all the objects
1:12:20
in the picture which we didn't talk
1:12:21
about in class object detection. So just
1:12:23
create an object detection pipeline.
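Roughly, and with the same assumed cats image:

```python
from transformers import pipeline

detector = pipeline("object-detection")
detections = detector("http://images.cocodataset.org/val2017/000000039769.jpg")
for d in detections:
    print(d["label"], round(d["score"], 3), d["box"])    # label, confidence, bounding box
```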
1:12:27
Same thing as before. When you actually
1:12:31
run this command, an astonishing
1:12:31
amount of complicated stuff is going on
1:12:33
under the hood. Okay, and we are all the
1:12:35
beneficiaries of that. So, thank you.
1:12:37
Um, so yeah, so we have this here and
1:12:39
then we run it through um the pipeline.
1:12:42
It's looking at all the possible things
1:12:44
that might be sitting in the picture.
1:12:45
The results are hard to read. So, let's
1:12:46
actually visualize them. Um,
1:12:49
and I got some nice code from this site
1:12:51
for how to visualize them. Let's just
1:12:53
reuse it. So, yeah. So if you plot the
1:12:56
results,
1:12:58
look at that.
1:13:03
Okay, so it has picked up the cat. 100%
1:13:06
probability, I guess. The remote, the
1:13:09
couch, the other remote, and then the
1:13:12
other cat. Pretty good, right? Off the shelf,
1:13:14
ready to go. No, no heavy lifting
1:13:17
required. Now, in in this case, we are
1:13:19
actually putting these boxes called
1:13:20
bounding boxes around each picture. But
1:13:22
what if you actually don't want a
1:13:23
bounding box? What you want is to actually
1:13:25
find the exact contour of that cat or
1:13:28
the remote. No problem. We do something
1:13:30
called image segmentation. So let's do
1:13:32
an image segmentation pipeline
1:13:36
uh and run it through.
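Roughly, and again with the assumed cats image:

```python
from transformers import pipeline

segmenter = pipeline("image-segmentation")
results = segmenter("http://images.cocodataset.org/val2017/000000039769.jpg")
for r in results:
    print(r["label"], r["score"])
    # r["mask"] is a PIL image: white where the object's pixels are, black elsewhere
```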
1:13:42
It takes some time. Um all right. All
1:13:46
right. Let's visualize it. So for
1:13:49
each object it finds, it gives you a
1:13:51
mask. It basically tells you for each
1:13:53
object what object it is and then which
1:13:56
pixels are on for that object and off
1:13:58
for everything else. It's a mask. It
1:14:00
tells you where it stands. And you can
1:14:02
see here, the first object it has
1:14:04
found is this thing here. And it's
1:14:06
perfectly delineated, right? It's pretty
1:14:08
amazing. So we can overlay this on the
1:14:10
original image and see it has found that
1:14:14
and it is Let's look at the other
1:14:15
objects. Oh, it has found the remote.
1:14:17
That's the second object.
1:14:20
And the third remote
1:14:24
and the fourth. You think any other
1:14:27
objects are remaining?
1:14:28
>> Couch. Good. All right, let's find the
1:14:32
couch.
1:14:33
And look, the couch is pretty good
1:14:36
except that the middle part has gotten
1:14:37
confused.
1:14:39
All right, but it's still pretty good,
1:14:41
right? So, yeah. So, that is, um, so
1:14:44
Hugging Face has all these things, and
1:14:46
you should definitely check it out
1:14:48
if you're not already very familiar
1:14:49
with it. So, uh, we have one minute
1:14:51
left. Any questions?
1:14:58
No questions. Okay. All right, folks.
1:15:00
See you on Wednesday. Thanks.
— end of transcript —