
11: Generative AI – Text-to-Image Models

MIT OpenCourseWare · May 11, 2026
Transcript ~13639 words · 1:15:38
0:16
So all right so today we actually come
0:18
to the last lecture of the class because
0:19
Wednesday it's going to be project
0:20
presentations and um so I want to talk
0:23
to you about diffusion models today
0:25
which is an incredibly exciting area
0:28
which I don't think gets the same
0:30
amount of attention in some ways
0:32
compared to large language models. Uh
0:34
but it's got enormous potential. Um so
0:37
I'm very excited to talk to you about
0:39
it. So you know just for kicks last
0:42
night I asked ChatGPT to create a
0:44
photorealistic image of graduate
0:46
students in class in a class on deep
0:47
learning and this is what it came back
0:49
with.
0:51
There is a noticeable absence of an
0:53
instructor
0:57
plus various students are facing in
0:59
various directions
1:01
but apart from that it's not bad. Um and
1:05
uh here is an example of a Midjourney
1:08
text-to-image diffusion model uh which
1:12
produces the amazing picture from this
1:14
prompt. a quaint Italian seaside village
1:16
with colorful buildings blah blah blah
1:18
blah blah uh rendered in the style of
1:21
Claude Monet and so on so forth and
1:24
that's what you get. It's pretty
1:25
unbelievable.
1:27
Uh and I'm sure you folks have played
1:28
around with these things and you have
1:29
your favorite pictures and prompts and
1:31
whatnot.
1:33
Um now
1:35
uh February 15th um OpenAI released a
1:38
text-to-video model called Sora which you
1:41
folks may have seen uh which I find
1:44
frankly just stunning what it can do. It
1:46
can produce a one minute uh video from a
1:49
text prompt. And so,
1:52
so if you actually give it this prompt,
1:54
in an ornate historical hall, a massive
1:56
tidal wave peaks and begins to crash and
2:00
two surfers seizing the moment
2:01
skillfully navigate the wave.
2:03
Okay. Uh I think we can all agree that
2:06
such a thing has never happened in
2:07
history and therefore there it was not
2:09
in the training data, right? So and then
2:12
you get this picture, this video
2:26
and then some random person is coming
2:28
back in a completely dry [laughter]
2:31
hall. So anyway, but it's pretty
2:32
amazing. I think you would agree. So
2:37
if you actually look at the Sora
2:39
technical report, you actually find this
2:42
uh opening paragraph where they say that
2:45
we train text conditional diffusion
2:48
models blah blah blah using a
2:51
transformer architecture. Okay, so now
2:54
we know what a transformer architecture
2:56
is. You've been working with it. You're
2:57
quite familiar with it at this point. So
3:00
today's class is really about text
3:02
conditional diffusion models. Okay, so
3:04
the other building block. Okay, so let's
3:06
get to it. Uh what I'm going to do is
3:09
I'm going to sort of uh divide this into
3:11
two parts. The first part is I'm just
3:12
going to talk about how do you get a
3:14
model to just generate an image for you?
3:16
Right? If you wanted to generate an
3:17
image from a class of potential images,
3:20
how can it just generate an image? And
3:21
then next we talk about okay, great. Now
3:24
that you can do that, how do you
3:25
actually control or steer the model to
3:27
do an image based on whatever prompting
3:29
you give it? Okay, how do you condition
3:31
it? How do you control it? Those are all
3:33
the words. How do you steer it? You'll
3:34
find all these synonyms being used
3:36
heavily in the literature. That's
3:37
basically what they mean. How do you
3:38
give it a prompt and then steer what
3:40
gets produced? All right, so let's say
3:43
we want to build a model that can be
3:44
used to generate images of stately
3:47
college buildings.
3:49
Okay, obviously our very own Killian
3:51
Court is the finest example of such a
3:53
thing. Um, and uh, but let's say you
3:56
want to do that. So what you do is you
3:58
as as we always do with machine
3:59
learning, we collect a bunch of data. In
4:01
this particular case, we collect a whole
4:03
bunch of images of stately college
4:05
buildings. Uh, and what you see here is
4:07
literally me just doing a Google image
4:08
search with the query stately college
4:10
buildings. Okay, so this is the kind of
4:12
stuff you get. Uh, so you have your
4:14
training data at your disposal. It's
4:15
ready to go. Now the question is if you
4:19
have such a model, let's say, and
4:20
obviously we'll talk about how to build
4:21
such a model very soon. But let's say
4:23
you have such a model and every time you
4:25
sort of sample this model, every time
4:27
you ask the model, hey, give me an
4:28
image, you obviously wanted to give a
4:30
different image, right? Otherwise, it's
4:31
kind of boring. All right? Some you know
4:34
maybe you want the Killian Court, maybe
4:36
you want the rotunda from the University
4:37
of Virginia. Anybody any UVA alums here?
4:42
Nobody. Okay. Um, so and right. So the
4:45
question is how can we actually get it
4:47
to randomly give us different images?
4:49
But but they all have to be stately
4:50
college buildings. It can't be just some
4:52
random stuff, right? So, how do you do
4:54
that? And the way we do that, and I
4:58
still find it really astonishing that
4:59
this approach actually works. The way we
5:02
do that is that we actually give it
5:03
noise.
5:05
And I will define very precisely what I
5:07
mean by noise in just a just a bit.
5:10
Okay, basically assume
5:13
an image in which all the pixel values
5:15
are randomly picked.
5:17
Right? So every time you generate a
5:19
random image and you give it to the
5:21
model, it'll use that random
5:23
starting point and then create an image
5:25
for you. And because by definition, if
5:27
you choose noise randomly, they are, you
5:30
know, obviously going to be different
5:31
each time. It's hopefully going to
5:33
generate a different image. But if the
5:35
model is trained on stately college
5:37
buildings, it will produce images of
5:39
stately college buildings. It's not
5:41
going to produce a picture of a Labrador
5:42
retriever.
5:44
Okay, so that's basically what we're
5:46
going to do. Now, if you look at
5:49
something like this, the first question
5:51
of course is that how can we train a
5:53
model to generate an image from pure
5:54
noise? This just sounds ridiculous,
5:58
right? You basically give it a bunch of
6:00
random numbers and say, give me Killian Court.
6:04
It feels really ridiculous. And at that
6:06
point, you know, folks can sort of come
6:08
to a stop and say, "All right, this
6:10
approach is probably not going to take
6:11
me anywhere. It's a bit of a dead end.
6:14
But then some clever people had this
6:16
very interesting idea.
6:18
They said
6:20
um it's not clear how to do this you
6:24
know um just a quick aside there's this
6:26
really amazing book which is published
6:28
maybe 50 years ago maybe earlier than
6:31
that called how to solve it by George
6:33
Polia. George Poliov was a eminent
6:36
mathematician
6:37
um and he wrote this small book called
6:39
how to solve it and it lists a whole
6:41
bunch of huristics that mathematicians
6:44
use when they solve problems and perhaps
6:46
the most commonly used heristic is just
6:49
reverse the question
6:52
just reverse the question and see if
6:53
anything comes out of it most of the
6:55
time nothing will come out of it but
6:56
maybe some other time something amazing
6:58
comes out right this is a great example
6:59
of that heristic at work we don't know
7:01
how to do this so the question is can we
7:03
do the reverse
7:05
If I give you Killian Court, can you
7:07
produce noise out of it for me?
7:10
And the answer is yeah, of course we can
7:12
do that.
7:14
Right? Given an image, we can easily
7:16
create a noisy version of it. So you can
7:19
take the original image, you can add
7:21
some noise to it to get this and you
7:23
keep on adding a lot of noise and
7:24
finally you'll get something that's
7:25
basically you can't tell that there is
7:27
a clean, clear Killian Court anymore. Right?
7:29
This process, the reverse process is
7:31
actually very easy to do. Okay? So the
7:33
question. By the way, for folks who
7:36
may not be very familiar with this
7:37
notion of adding noise to an image or
7:39
making an image noisy. Let me just show
7:41
you in a Colab in just a minute how easy
7:44
it is.
7:47
All right. So um we let's say we import
7:51
a bunch of these things. As usual we
7:52
have numpy and so there is this thing
7:54
called the Python Imaging Library, PIL,
7:57
which is very handy for image
7:58
manipulations. So we import that and
8:01
then I just literally read this
8:03
image in. I uploaded it before class.
8:04
Let's just make sure it's here. Okay,
8:06
good. Killian.png.
8:07
So I I read this image. Okay. Uh and
8:11
then once I read it, I convert it into a
8:13
numpy array. And then remember, in
8:16
any color image, you have three tables
8:18
of numbers. There's a
8:20
number for each pixel for red, blue, and
8:23
green. And then each number is between 0
8:25
and 255. And so here what we do is we
8:28
divide everything by 255 just to
8:29
normalize it so it's all between zero
8:31
and one and we have done this in the
8:32
past right I do that here uh all right
8:36
so let me just read this back in convert
8:38
it and then if you look at the shape
8:40
it's basically 411 × 583 × 3, three
8:45
channels as we have seen before and then
8:47
I'll just show it all right that's the
8:50
picture so now what we want to do is we
8:52
want to add noise to this picture all we
8:54
have to do Okay, for each pixel,
8:59
we basically randomly pick a normal
9:02
variable, a normal distribution,
9:03
normally distributed random variable
9:05
with a mean of zero and a small standard
9:08
deviation. So it's like a small number
9:10
and then we just literally add that
9:11
number to every pixel. But for every
9:14
pixel, we sample. Every pixel we sample.
9:16
It's not like we sample once and add it
9:17
to all the pixels. We sample for every
9:19
pixel. And so the way you do that is
9:22
basically literally np.random.normal.
9:25
and then this .3 here is the
9:28
standard deviation and we tell it
9:30
generate as many of these things as the
9:33
shape of the image that I gave
9:35
you. Okay. And then add each one of
9:38
these numbers to the original image you
9:40
get this noisy image. Okay. So if you
9:42
this is the original image these are all
9:44
the values between 0 and one. And then
9:46
you get this noisy image. You can see
9:48
the numbers have become different. The
9:50
.23 has become .18, the .15
9:52
has become -.17, and so on and so
9:54
forth. Right? You just added a small
9:56
random number to everything. But as you
9:58
can see here now you have some negative
9:59
numbers. You may have some numbers
10:01
that's greater than one. And we do want
10:02
everything to be between 0 and one. So
10:05
all we do is we do this thing called
10:06
clipping, where essentially values smaller
10:10
than zero are set to zero. Values
10:11
greater than one are set to one. And so
10:13
we'll just do that. That's it.
10:16
Everything over one squashed to one.
10:17
Everything under zero set to zero.
10:19
Others leave it unchanged. Now it's
10:21
again well behaved between 0 and one and
10:23
we can just plot it and you get this.
10:28
That's it. That's all it takes to
10:29
actually add noise to an image. One line
10:31
of numpy. Okay. Uh obviously you can
10:34
just put this whole thing in a loop and
10:36
keep increasing that standard deviation
10:37
number from .3 to .4, .5, and so on and so
10:39
forth. And when you do that you get this
10:41
nice sequence from clean Killian Court all the
10:44
way to some very very noisy version of
10:45
Killian Court. That's it. So that's the basic
10:48
idea of adding noise.
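Put together, the Colab recipe just described fits in a few lines. A minimal sketch, assuming matplotlib for display and that the uploaded file is named Killian.png; the standard deviation values are illustrative:

```python
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt

# Read the image and normalize every pixel value to [0, 1]
img = np.asarray(Image.open("Killian.png").convert("RGB")).astype(np.float32) / 255.0
print(img.shape)  # (height, width, 3): one number per pixel per R/G/B channel

# Add an independently sampled Gaussian value to every pixel of every channel,
# then clip back into [0, 1] so the result is still a valid image
sigma = 0.3
noisy = np.clip(img + np.random.normal(0.0, sigma, img.shape), 0.0, 1.0)
plt.imshow(noisy); plt.axis("off"); plt.show()

# Repeating with a growing standard deviation gives the clean-to-pure-noise sequence
for sigma in (0.3, 0.4, 0.5, 0.7, 1.0):
    plt.imshow(np.clip(img + np.random.normal(0.0, sigma, img.shape), 0.0, 1.0))
    plt.axis("off"); plt.show()
```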
10:52
Any questions on the mechanics?
10:57
Okay, good. Um so so we can add random
11:00
numbers, right? And we can by increasing
11:02
the magnitude of the standard deviation
11:04
of these normal random
11:06
variables, we can make the image
11:08
noisier. Okay, so that suggests a really
11:12
interesting idea.
11:14
What idea would that be?
11:19
Yeah, doing the opposite. Could you
11:21
please uh microphone please?
11:25
>> Uh doing the opposite like recreating
11:26
the image from the noise.
11:29
>> So we are trying to create the image
11:31
from the noise. But
11:34
that feels a little hard. So what
11:37
exactly can we do? Be a little more
11:38
specific.
11:44
So here we have the ability to take any
11:46
image and add any amount of noise to it.
11:48
Right? That's the data we have. There is
11:51
Killian Court and there are various noisy
11:54
versions of Killian Court, likewise for
11:56
the Rotunda at the University of Virginia and so on and
11:57
so forth.
11:58
>> I would assume you would do some kind of
12:00
loss function for the the final image
12:02
that you get and compare it with the the
12:04
original image that you train it on and
12:06
then refine as you go. Okay,
12:10
you're on the right track. Uh, any other
12:14
proposals?
12:18
>> I think we could try to train a neural
12:20
network to reconstruct the image going
12:22
from the noise to the non-noisy one.
12:25
Like we could have a whole data set with
12:27
images, find their noisy counterparts and
12:30
train a
12:34
network to do the opposite task.
12:38
Yeah, that's definitely on the right
12:39
track. That's definitely on the right
12:41
track. Yep, good ideas. So, what we do
12:44
more concretely is
12:47
we we can take each image in the
12:49
training data and create noisy versions
12:51
of it as we have seen before. And then
12:54
what we do is that we say uh we can
12:57
create XY training data pairs input
13:00
output pairs from all these images. So
13:04
specifically what we do is we take
13:09
the slightly noisy version of
13:11
Killian Court and call it the input and
13:14
we take the clean version of
13:16
Killian Court and call it the output.
13:19
Okay, that's the y1 x1 pair
13:22
and then we get y2 x2 we get y3 x3 and
13:27
all the way. So at any point in time,
13:30
the relationship between X and Y, what's
13:33
the relationship between X and Y? If you
13:36
set it up like this as the input and the
13:37
output,
13:43
>> it's the set of uh standard deviations
13:45
and uh the values which you change for
13:48
each pixels. Those are like weights to
13:51
which you transform,
13:53
>> right? Or maybe I was looking for
13:54
something simpler which was that that's
13:56
correct. So what he's looking for is
13:58
really the relationship between X
14:00
and Y. X is an image, any image, and Y
14:03
happens to be a slightly less noisy
14:05
version of the image.
14:07
The slightly less noisy is really,
14:09
really important.
14:12
You're not going from Killian Court,
14:14
right? You're not going from the image
14:16
to full noise. That's an impossible
14:19
leap. You're going from the image to a
14:21
slightly noisy version of the image.
14:24
Okay, it is that slightly that allows
14:27
all the magic to happen.
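One way to materialize those (x, y) pairs in code, as a rough sketch: walk each training image through progressively noisier versions, and at every step the noisier image is the input and the version one small step earlier is the target. The step count and noise size here are arbitrary illustrative values.

```python
import numpy as np

def make_denoising_pairs(clean_images, num_steps=20, sigma_step=0.05):
    """clean_images: array of (H, W, 3) images with values in [0, 1].
    Returns (xs, ys) where each x is a slightly noisier version of its y."""
    xs, ys = [], []
    for img in clean_images:
        prev = img
        for _ in range(num_steps):
            noisy = np.clip(prev + np.random.normal(0.0, sigma_step, img.shape), 0.0, 1.0)
            xs.append(noisy)   # input: the noisier image
            ys.append(prev)    # target: the slightly less noisy image
            prev = noisy
    return np.array(xs), np.array(ys)
```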
14:30
So that's what we have.
14:33
And so here what we can do with these XY
14:35
pairs. So here's the
14:38
thing, right? This is like a larger
14:40
comment about machine learning and deep
14:41
learning. Um
14:43
Basically, what machine
14:46
learning and deep learning really are is
14:47
this black box where, if
14:50
you can find interesting input-output
14:52
pairs, you can learn a function to go
14:55
from the input to the output. That's it.
14:57
but this sounds kind of simple when I
14:59
describe it like that but there are like
15:01
some incredibly non-obvious ways of
15:04
applying this idea right so for example
15:06
a few years ago Google had this uh thing
15:08
which may actually be in production in
15:10
Google Sheets now where whenever you um
15:13
sort of choose a bunch of numbers, a
15:15
range of numbers in a spreadsheet and
15:17
and then go into another cell, it'll
15:19
immediately suggest a formula for you.
15:21
Where is that coming from?
15:24
It's because all the Google Sheets users
15:26
all over the world, they have been
15:28
creating all these numbers with
15:30
formulas, right? So, someone says,
15:32
"Look, wait a second. We have all this
15:33
data on people choosing a range of
15:36
numbers and then entering a formula. So
15:38
let's imagine the range is the input and
15:40
the formula as the output
15:43
and let's just give a million examples
15:45
of this pair and see if anything comes
15:46
out of it and boom you get that feature.
15:50
Okay. So similarly here
15:53
X is an image, Y a less noisy version of the
15:55
image. What that means is that we can
15:58
build a denoising network.
16:02
Okay, we can take an image and we can
16:04
build a network using all these XY pairs
16:06
to slightly denoise it.
16:10
Okay. Um and so how do we do it? We
16:15
just run stochastic gradient descent on
16:16
the data. We have a network. It has X
16:19
and Y and then Y is a slightly less
16:22
noisy version, and then boom.
16:26
Okay, it's just a network. It has a
16:27
bunch of weights. We have
16:29
the right answer in terms of what the
16:30
images need to be. We can do stochastic
16:33
gradient descent or Adam or something
16:34
and before you know it if you have
16:36
enough data you have a network which can
16:37
denoise anything you give it. Okay, um,
16:40
you had a question
16:41
>> why slightly
16:43
>> why slightly um we'll come back to that
16:45
question the the reason is that u in
16:48
general you have to do what you can
16:51
to help the model and this is sort of
16:53
the proverbial there is an old adage you
16:56
can't cross a ditch in two jumps.
16:59
It's too big. So, right. So, you can't
17:02
do it. So, what you do is you create a
17:03
bridge to go from here to there. And so,
17:05
what you do is if you can slightly
17:07
denoise something really well. Well, I
17:10
can actually denoise anything you
17:11
want really well using that fundamental
17:13
capability as you will see in a second.
17:17
>> Just to follow up. So, if you go back
17:18
the last slide, I could have created the
17:21
same thing as that is my x1 and that is
17:24
my y. Then the second one is x2 and
17:26
still this is the y. So there is
17:28
effectively there is a learning there
17:30
that it could have taken from those
17:33
pairs and come back with okay this is
17:35
also a possibility this is also a
17:37
possibility and it found out that noise
17:40
matrix and it can subtract.
17:42
>> Yeah. So the thing is you want to make
17:44
sure that each time the amount of
17:46
learning it has to do is as bounded and
17:48
small as possible. If you give it some
17:51
starting point and an ending point and
17:52
keep moving this ending point, the gap
17:55
is still really high for the first
17:56
several of those starting points. That's
17:59
the problem.
18:01
Okay. So to come back to this, so we can
18:04
build a denoising model. We can do this.
18:07
And now when you have once you have
18:08
built such a thing, you give it some
18:10
noisy thing and then it'll you know give
18:13
you a slightly less noisy version of it.
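Training that denoiser is then just supervised regression on those pairs. A minimal Keras sketch with a deliberately tiny stand-in network (in practice you would use something like the U-Net discussed later); xs and ys are the hypothetical arrays from the pair-building sketch above.

```python
import tensorflow as tf
from tensorflow.keras import layers

# A tiny stand-in denoiser: image in, same-sized image out
denoiser = tf.keras.Sequential([
    layers.Input(shape=(None, None, 3)),
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.Conv2D(3, 3, padding="same", activation="sigmoid"),
])

# xs = noisier images, ys = slightly less noisy targets (see the earlier sketch)
denoiser.compile(optimizer="adam", loss="mse")   # pixel-wise squared error
denoiser.fit(xs, ys, batch_size=32, epochs=10)
```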
18:15
Okay, the resolution is going to go up
18:16
slightly if you do that. This of course
18:19
suggests the obvious way in which you
18:20
would use it which is that once you
18:22
train it we can solve this problem.
18:26
Okay. And how can we solve this problem?
18:29
So what you do is you start with pure
18:32
noise and then repeatedly denoise
18:35
it.
18:37
Okay. You get that, you get that, and
18:39
then before you know it, Killian Court
18:41
has emerged from the fog,
18:43
right? It's pretty insane that it
18:46
actually works this idea.
18:52
So, so the model will generate a
18:54
sequence of less noisy images and the
18:56
final one you have is the answer. Okay.
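In code, that generation loop is just: start from a random image and keep feeding the output back in. A minimal sketch, reusing the hypothetical denoiser from the training sketch above; the image size and step count are arbitrary.

```python
import numpy as np

num_denoising_steps = 100          # originally ~1000 baby steps; fewer with newer methods
x = np.random.rand(1, 64, 64, 3)   # pure noise: every pixel value picked at random
for _ in range(num_denoising_steps):
    x = denoiser.predict(x, verbose=0)      # each pass removes a little more noise
generated_image = np.clip(x[0], 0.0, 1.0)   # the final, least-noisy image is the sample
```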
18:59
Now there's a whole bunch of detail here
19:01
which I'm glossing over about okay how
19:05
many times must we run this loop to get
19:08
to a really good picture. The short
19:09
answer is, initially it was like
19:12
you have to run it like a thousand
19:13
times. Each denoising step was
19:16
like a baby step. You have to do it a
19:17
thousand times to get a really good
19:18
answer. Again research has been very
19:21
active in the area continues to be very
19:22
active. Now you can I think do it like
19:24
50 steps or 100 steps. Right? But
19:26
diffusion models like this uh they tend
19:29
to take more time than a large language
19:31
model which is why if you give a prompt
19:33
to one of these models like midjourney
19:35
it will take some time for it to come
19:36
back with an image and and that the
19:38
reason for the delay is because it's
19:40
going through this you know incremental
19:42
denoising loop. Yeah.
19:45
>> Uh from this we understand that each uh
19:47
the final noise output sample would be
19:49
very particular to each image in the
19:51
matrix. So I mean like say two if you
19:55
take two images the final we are getting
19:57
is the image in the after when we start
19:59
noising it and the final output we get
20:02
is the noise sample will be too distinct
20:04
for each of them right
20:05
>> correct
20:05
>> so but when we are picking up image to
20:08
generate a diffusion model and we work
20:10
backwards we may not have the exact
20:12
thing available to us what was there
20:14
initially
20:15
>> no no the thing is we don't want to
20:17
necessarily regenerate images that were
20:18
in the training data right that's kind
20:21
of pointless. We want to generate new
20:22
images
20:24
and for new images we just use
20:26
noise as a starting point
20:29
you know the fact that Killian Court was
20:31
here and then the fully noised version
20:32
of Killian Court is here, that is used for
20:35
training and once you use it for
20:36
training you don't need it anymore
20:37
because you're not trying to recreate
20:39
Killian Court again, you want to create
20:41
new images which belong to the category
20:43
of stately college buildings and for
20:45
that, all you do is just grab noise, send it
20:48
in it gives you a stately college
20:49
building. End of story.
20:53
And because noise by definition is
20:55
different each time you pick it, it's
20:57
going to come up with a different
20:59
stately college building.
21:01
So the way I think about it is that uh
21:07
all right so you can think of it as this
21:09
right this is
21:12
so when you sample think of this as like
21:14
the noise distribution
21:17
each time you sample right there's a
21:20
little point you pick from here another
21:22
time you sample maybe you get a point
21:24
here right each is just you know nice
21:26
distribution that's it what actually
21:29
these things are doing is they are
21:31
mapping it
21:34
to the distribution of stately college
21:35
buildings which might be in a you know
21:38
strange crazy distribution.
21:41
So each time you sample you just go from
21:43
here and you land at a point here
21:47
and when you go from here you know you
21:49
land at a point there.
21:53
That's what so what you have done is
21:54
when you when you take the training data
21:56
you basically created points here and
21:59
then found the matching noise here and
22:01
then flipped it for training as we have
22:03
seen before and once you're done with it
22:05
you basically have a mechanism for
22:07
transforming any entry in this
22:09
distribution of images to an entry in
22:12
this distribution of images. So it's a
22:15
way to transform one distribution to
22:17
another distribution. That's what's
22:18
going on. Um all right. Um so there was
22:22
a question. Yeah. And then we'll go.
22:26
>> I understand the going from noise to to
22:28
the image and back how you how the
22:30
training works. So my question is you
22:33
know in some of these models today you
22:35
have you know when you give it the noise
22:37
now to generate with an image for
22:40
example it could generate a human with
22:42
four fingers or you know stuff like
22:44
that. So is it that the that the model
22:47
that the training data is not just quite
22:49
enough to or more as robust enough to uh
22:53
generate that kind of detail? [cough]
22:56
Can you kind of talk through like what's
22:57
more?
22:58
>> Yeah. So so fundamentally what it's
23:00
doing is it actually does not understand
23:03
the notion of fingers and things like
23:04
that. Right? Because there is like we
23:07
haven't injected any domain knowledge
23:09
into this whole process by saying that
23:12
hey, you need to
23:13
generate a human body and here are the
23:16
semantics of what the human body is
23:17
right it's got uh five fingers and all
23:20
the anatomical stuff we're not giving
23:21
anything. We're literally giving it pixel
23:23
values bunch of pictures so everything
23:26
you're seeing is basically just coming
23:27
out of that very blind statistical
23:29
transformation process. So you
23:32
would expect that macro-level details
23:34
it will probably get right. Because
23:36
there are so many right answers. So
23:38
imagine it's actually, you know, it's
23:40
creating um the roof of a house. There
23:43
could be all kinds of variations in the
23:45
roof of the house and you would still
23:46
think it's a roof of a house, right?
23:48
Because there are many possible right
23:49
answers. But when it comes to five
23:51
fingers, there are not many possible
23:52
right answers, which is why you notice
23:53
the error very quickly. As far as the
23:55
model is concerned, it doesn't know,
23:56
right? It's just producing a
23:58
statistically plausible sample from that
24:00
distribution. And since we haven't
24:03
forced it to obey constraints like five
24:05
fingers and so on and so forth, it's not
24:06
going to do any of that stuff. It's an
24:08
unconstrained process. Now over time,
24:10
these things have gotten better and
24:11
better and that's because the data has
24:14
gotten better to your point. But I think
24:15
our approach to doing these things is
24:17
also getting better, right? There are
24:19
lots of ways to now steer it and control
24:21
it so it behaves the right way. And that
24:23
is actually part of what's happening as
24:25
well. So when we talk about how do you
24:27
actually give a text prompt and have it
24:29
build the image for that particular
24:30
prompt, we would we'll revisit this
24:32
question. Um okay, there was there were
24:35
more questions. Yeah.
24:38
>> Is there some randomness in the model
24:40
itself? Right. So if you gave it the
24:42
same noise image twice, will it actually
24:44
produce the same final image or will it
24:47
>> Yeah, there is randomness in the process
24:49
as well.
24:49
>> In the process process, exactly.
24:53
Um, so to actually that's a really good
24:56
point, but now I'm afraid to open my
24:59
laptop. I'm an iPad. One second.
25:02
All right.
25:04
Okay. So, what's going on here is that
25:06
if you um go to this thing
25:10
so I talked about we are transforming
25:13
from here to some crazy distribution
25:16
here, right? So, what happens that let's
25:18
say that this is the starting point for
25:20
the the noise input. This is your noise
25:22
input and then what it does what you
25:25
actually do is you go here
25:28
and then you take this point and then
25:29
you do a small sample next to it. So you
25:33
use this as like the mean value and then
25:35
sample around it and that's actually
25:37
what gets published in the user
25:39
interface. That's where the randomness
25:40
comes in.
25:42
Okay. So um
25:48
so back to this was there another
25:49
question somewhere.
25:52
>> Yeah.
25:53
>> Um it's okay.
25:56
>> Uh I was just wondering about the when
25:59
going when training on a on a clear
26:02
picture to go to a noisy image uh to
26:05
pull from a random sample like random
26:08
this sample probably pseudo random. I
26:10
was just wondering if it's like learning
26:12
relationships that are dependent on
26:13
pseudo randomness and so when it goes
26:16
from a noisy image back to pure image
26:19
it's dependent on that or it matters at
26:22
all.
26:22
>> Oh I see. So if I understand your
26:23
question what you're saying is that it's
26:24
pseudo random not actually random
26:27
>> and so therefore there is some signal in
26:29
the supposedly random generation is it
26:32
actually glomming onto that signal right
26:34
is the question. Theoretically, it's
26:37
probably possible, but in practice, it
26:38
really doesn't matter because we
26:40
basically say random is good enough for
26:42
our purposes. And in fact, in practice,
26:44
you will see it's not an issue.
26:47
Um,
26:48
okay. So, oh yeah, go ahead.
26:52
>> There's a quick question. when you're
26:53
doing uh like text to text, let's say
26:58
you're uh tokenizing the input, but here
27:01
you somehow have to identify that this
27:03
is Killian Court and like a stately home
27:06
and this is just going from pixel image
27:09
to or like decoding a pixel image. Um
27:13
where does the the tag or tokenization
27:16
of like columns or fingernails or like
27:20
>> Nothing. It's learning everything
27:21
from the pixel values.
27:23
>> Everything.
27:23
>> Yeah. And this is sort of what I was,
27:25
you know, when I when Ike asked the
27:27
question about the four fingers, five
27:28
fingers thing, it has no idea of
27:30
fingers. It has zero knowledge about any
27:33
of these things. All it's seeing is a
27:34
bunch of photographs.
27:36
>> Okay. So when you when you type in say I
27:38
want a hand with green.
27:40
>> Oh, I see. So we haven't yet come to the
27:42
stage of okay, how do you actually steer
27:44
this image using your text prompt? It's
27:47
coming
27:48
>> right now. All we're saying is that
27:49
look, I'm going to give you a bunch of
27:51
uh photographs of a particular kind of
27:52
thing, stately college buildings and I
27:55
want to have a model which at the end of
27:56
the day I just poke it. Every time I
27:58
poke it, it gives me a stately college
27:59
building. That's it. Now I'm going to
28:01
actually start giving it text and saying
28:02
okay build the you know create the thing
28:04
I'm just telling you about that's coming
28:06
and that's sort of some additional magic
28:08
is going on to get that done. Okay, so
28:12
this is what we have, and this is
28:14
called a diffusion model. Okay. And this
28:16
is the original paper that figured this
28:18
out. Um, and
28:21
The process of
28:24
taking an image and creating noisy
28:26
versions of it to create training data
28:28
is called the forward process. And then
28:30
what we did in reverse is called the
28:32
reverse process. Uh, check out the
28:34
paper. It's actually really well
28:35
written. Uh, and I recommend it. Now, in
28:38
practice, uh, some other researchers
28:40
came along shortly after this and made a
28:42
small improvement. It turns out to be
28:45
actually a big improvement in practice
28:46
in terms of improving the quality of
28:48
what's being produced. And so what they
28:50
said is hey instead of training the
28:52
model to predict the less noisy version
28:53
of the image we actually ask it to
28:55
predict just the noise
28:58
in the input and then we will just
29:01
simply subtract the noise from the input
29:03
to get the image. So instead of saying
29:05
here is an X X is an image Y is the
29:08
noisy image we actually tell it here is
29:10
an image here is the noise that we added
29:12
to X to get the noisy version and
29:14
then just predict the noise for me and
29:16
then once I get it I just do X minus
29:17
noise and I get the less noisy version
29:19
of the image. Okay, this feels
29:21
arithmetically equivalent but in
29:24
practice it ends up generating much
29:26
higher quality images and there's some
29:28
very interesting theory as to why that
29:29
works and so on and so forth and you can
29:31
read this paper if you're interested.
29:33
Okay, so if you actually look at what's
29:34
going on in most diffusion models today,
29:36
they're basically using an approach like
29:38
this. They're actually predicting each
29:40
time they predict noise and take it
29:41
away, subtract it. So iterative
29:43
subtraction of predicted noise.
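As a sketch of that variant: the training target becomes the noise that was added, and a denoising step becomes "predict the noise, subtract it." This leaves out the noise schedule and rescaling that real DDPM-style models use at every step; it is only the bare idea.

```python
import numpy as np

def make_noise_prediction_pairs(clean_images, sigma=0.1):
    """Inputs are noisy images; targets are the exact noise that was added."""
    xs, ys = [], []
    for img in clean_images:
        eps = np.random.normal(0.0, sigma, img.shape)
        xs.append(img + eps)   # x: the noisy image
        ys.append(eps)         # y: the noise itself
    return np.array(xs), np.array(ys)

def denoise_step(noise_predictor, x):
    """One generation step: subtract the predicted noise from the input."""
    predicted_noise = noise_predictor.predict(x, verbose=0)
    return np.clip(x - predicted_noise, 0.0, 1.0)
```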
29:47
That's what's going on. So all right, so
29:49
that's what we have. Now at this point
29:52
you may be wondering, okay, so far in
29:55
the semester, uh we have actually
29:57
learned how to take an image and then
29:59
classify it into one of you know 20
30:01
things, 10 things, whatever. We have also
30:03
taken text and figured out how to do
30:05
things with it. We haven't yet talked
30:07
about how do you actually take an image
30:09
and how can we get the output also
30:11
to be another image. We haven't done
30:13
that yet. Okay. So we have actually not
30:16
done image to image. How do you actually
30:18
build a neural network to do image to
30:20
image? And in the interest of time
30:22
we're not going to get into it
30:23
massively but I want to give you a quick
30:25
idea of how it works. So the most
30:29
sort of I would say the dominant
30:31
architecture
30:33
to take an in image as an input and
30:35
produce an image as an output is called
30:36
the U-Net. Okay. And that's the
30:39
architecture we see here. So
30:42
so fundamentally if you look at the left
30:45
half so there's a left half to the
30:47
network and a right half to the network
30:48
hence the U. If you look at the left
30:50
half of the network it's it's a good old
30:53
convolutional neural network like the
30:55
kind we know and love. Okay. And the
30:58
kind that we are very familiar with. So
31:00
you take an input image and then you run
31:02
it through a bunch of convolutional
31:04
blocks and then we do
31:07
some max pooling and then we keep on
31:09
doing it and at some point it becomes
31:11
smaller and smaller and we get something
31:13
you know like this which we are very
31:15
familiar with right the the big image
31:17
with three channels gets smaller and
31:20
smaller smaller but the number of
31:21
channels gets wider and wider. it
31:22
becomes sort of much smaller but much
31:24
deeper right it becomes like a 3D volume
31:26
and we have seen that again and again
31:29
right the left part is just a good old
31:31
convolutional with pooling layers and
31:33
then you come to the middle and then
31:35
from this point on what we do is we take
31:37
whatever this thing here and then we
31:40
essentially reverse the process we go
31:43
from the small things which are really
31:44
deep to slightly bigger things that are
31:46
a little less deep and so on and so
31:49
forth till we get the original size back
31:50
again. Okay. And we do that using
31:54
an inverse of the convolution layer
31:57
called an upconvolution or deconvolution
31:59
layer. Okay. And you can check out 9.2
32:02
in the textbook to understand how
32:05
it's done. It's also called
32:07
Conv2DTranspose.
32:09
Okay. It's a very similar idea and I'm
32:12
not going to get into the details here
32:13
but you essentially do an inverse of a
32:15
convolutional operation to get the size
32:17
to come back to the bigger size and you
32:19
do it gradually till the output you have
32:22
matches the size of the input that came
32:24
in.
32:25
Okay, so image gets smaller and smaller
32:27
into a thing and then you just blow it
32:29
back up again to get an image back. So
32:31
that is the U-Net. Now there's one
32:34
very important thing that happens in the
32:36
U-Net, right? Which is
32:39
you see these connections, right?
32:43
Basically, what they do is at every step
32:45
when you're coming back up in the right
32:47
half, you actually attach whatever was
32:50
in sort of the mirror image of the
32:53
original input as we processed on the
32:54
left side, we attach it to this side as
32:56
well. Remember I talked about this whole
32:59
notion of a residual connection back,
33:01
you know, many classes ago where I said
33:03
when uh when an input goes through each
33:06
layer of a neural network at one point,
33:09
let's say you're in the 10th layer,
33:10
you're only seeing what is the ninth
33:13
layer is produced for you. That's all
33:14
you're working with. But would it be
33:16
nice if the the the 10th layer actually
33:18
had access to the eighth layer, the
33:19
seventh layer, the sixth layer, the
33:21
fifth layer? Heck, why not the input,
33:23
right? Because the more information it
33:25
has, the more able it's probably to do
33:27
whatever it can with the input it's
33:28
given. Why restrict it to only the
33:31
output of the previous
33:33
layer? Why can't we give it everything
33:34
that has come before it? Now giving
33:36
everything is too much. But we can be
33:37
selective in what we give it. Right? So
33:40
what these folks decided I'm sure after
33:41
much experimentation is that if they
33:44
actually attach whatever was coming out
33:46
of this layer to this layer before it
33:49
goes through the output, it really
33:51
helped. Similarly, this thing gets
33:53
attached and so on and so forth. And it
33:55
kind of makes sense. You know, why force
33:57
it to figure out everything it has to
34:00
figure out just from this thing that
34:01
came in, right? Let's give this that
34:03
that. Let's also give a little here, a
34:06
little here. So, these residual
34:07
connections are a huge building block
34:09
for why these things work as well as
34:10
they do. Okay? And in general, giving a
34:14
layer as much information as you can
34:15
give it is always a good idea, but you
34:17
can't go nuts, right? Because then you
34:19
have much more parameters and all kinds
34:20
of stuff happens. So there is a bit of a
34:22
balance you have to strike and this was
34:23
the balance struck by these researchers.
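Putting the two halves and the skip connections together, here is a deliberately small U-Net sketch in Keras (filter counts and the input size are arbitrary; see section 9.2 of the textbook or the original U-Net paper for the real architecture):

```python
import tensorflow as tf
from tensorflow.keras import layers

def tiny_unet(size=64, channels=3):
    inp = layers.Input(shape=(size, size, channels))

    # Left half: convolutions and pooling; the image gets smaller, the channels deeper
    c1 = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
    p1 = layers.MaxPooling2D()(c1)
    c2 = layers.Conv2D(64, 3, padding="same", activation="relu")(p1)
    p2 = layers.MaxPooling2D()(c2)

    # Bottom of the U
    b = layers.Conv2D(128, 3, padding="same", activation="relu")(p2)

    # Right half: up-convolutions back to the original size, concatenating the
    # mirror-image features from the left half at each step (the skip connections)
    u2 = layers.Conv2DTranspose(64, 2, strides=2, padding="same")(b)
    c3 = layers.Conv2D(64, 3, padding="same", activation="relu")(layers.Concatenate()([u2, c2]))
    u1 = layers.Conv2DTranspose(32, 2, strides=2, padding="same")(c3)
    c4 = layers.Conv2D(32, 3, padding="same", activation="relu")(layers.Concatenate()([u1, c1]))

    out = layers.Conv2D(channels, 1, activation="sigmoid")(c4)  # same shape as the input
    return tf.keras.Model(inp, out)
```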
34:25
And so this thing was originally
34:27
invented for some medical segmentation
34:30
use cases but it's just heavily used for
34:32
everything now. It's a really powerful
34:35
architecture. Questions?
34:39
>> uh can we have example of like in what
34:41
scenarios we use this kind of
34:44
>> anytime you have an image to image
34:46
>> like what kind of conversion do you get
34:49
image to image? or like what kind of
34:50
examples of use cases. Let's say that
34:52
for example you want to take an image
34:54
like a black-and-white image and you
34:55
want to colorize it
34:58
for instance, boom, U-Net. You want to
35:00
take an image and make it a higher
35:02
resolution image? U-Net. You want to take
35:04
an image and for every pixel in the
35:06
image you want to classify it into you
35:08
know one of 10 things. So anytime when
35:12
you want the output shape the shape of
35:14
the output to be basically the same
35:16
shape as the input but with other data
35:18
you need to use this.
35:20
Yeah.
35:25
>> But this logic of having access to all
35:28
the previous iterations
35:30
>> not iterations
35:31
>> all the previous layers
35:33
>> right the outputs of the previous layers
35:35
>> layers. Uh but this would also help uh
35:40
clean up and give better categorization
35:42
like does it always have to be an image
35:44
to image?
35:45
>> No. No. In fact, if you look at restnet,
35:47
restnet is the one in fact that
35:49
pioneered the idea of the residual
35:50
connection. So we use it in ResNet. We
35:53
also use it in the transformer stack:
35:56
if you remember it goes through the self
35:58
attention layer. It comes out the other
36:00
end and then we add the input back to it
36:03
and then we send it through the layer norm.
36:05
So you will see that this residual
36:07
connection is sitting in two different
36:08
places in a single transformer block. So
36:11
it's extremely heavily used. There is
36:13
something called deep and wide network
36:15
if I remember, or DenseNet, which uses the
36:17
same trick. In fact if you when you're
36:20
working with structured data right good
36:22
old say linear regression and you've
36:25
looked at your data and you come up with
36:26
all kinds of very clever features you
36:28
know I'm going to look at price per
36:30
square foot right you do a bunch of
36:32
feature engineering and you have a bunch
36:33
of new features. Well, you should take
36:36
your old features and your new features
36:38
and send both in.
36:40
Why send only the new stuff that you
36:42
have concocted? Why can't you send
36:43
everything in? That's the idea.
36:47
All right. Um, so let's come back here.
36:53
Now we have seen how to generate a good
36:54
image. Okay. Now let's figure out how to
36:57
steer it or condition it with a text
36:59
prompt, right? Because that's sort of
37:00
the holy grail.
37:02
So we want to take
37:05
so here's some intuition. We want to
37:08
take the text prompt into account and
37:09
obviously generate the image. Now
37:11
imagine if we had like a rough image
37:14
that corresponds to the text prompt.
37:16
Just imagine. So the text prompt is you
37:18
know, cute Labrador retriever, and you
37:21
have like a very noisy image of a
37:22
Labrador retriever. This just happens
37:24
to be handy. You have it. Well now
37:26
you're in good shape because you just
37:28
feed that in and your system will denoise
37:30
it for you. Right? You can get a
37:32
better image. That's pretty easy. So,
37:34
but obviously in reality, you don't have
37:36
a rough image. In fact, you're trying to
37:37
create one of those things in the first
37:38
place. We don't. So, but what if we had
37:41
an embedding for the prompt that's close
37:45
to the embeddings of all the images that
37:47
correspond to the prompt. So, let's take
37:49
a prompt and let's imagine all the
37:52
images in the in the universe that
37:54
correspond to that prompt. Okay?
37:57
And now further imagine because
37:58
everything is a vector. Everything is
38:00
an embedding in our world, that that image
38:02
has an embedding.
38:04
Sorry, the text prompt has an
38:06
embedding. Every image has an embedding
38:09
and we have somehow calculated these
38:12
embeddings so that the text prompts
38:14
embedding is smack where all the image
38:17
embeddings are.
38:20
We will get to how we actually do it in
38:21
a in just a moment. But conceptually
38:23
imagine if we had an embedding if you
38:26
could calculate embeddings for text and
38:28
embeddings for images. So they all live
38:30
in the same space.
38:32
Okay. So if we feed this embedding to a
38:36
denoising model, because that text
38:39
embedding is sitting in the same space
38:41
as all the image embeddings that it
38:44
corresponds to. Maybe our model can just
38:47
denoise that embedding and give you
38:49
what you want.
38:51
Okay, so since this embedding is already
38:54
close to the embeddings of the things we
38:55
want to generate, maybe you'll just get
38:57
it done.
38:59
So ultimately we want to generate an
39:00
image and if we had an embedding for
39:02
that image, we could generate the image
39:03
from the embedding and we use the text.
39:07
So we go from text to embedding which
39:09
happens to live in the same space as all
39:11
the embeddings of the images we care
39:12
about. And then from that image
39:14
embedding, we go to the final image.
39:15
Okay, this is a bunch of me talking and
39:18
handwaving. It'll all become very clear
39:20
but that's sort of the rough intuition.
39:22
Okay. So what we'll do now is we'll
39:25
describe an approach to calculate an
39:26
embedding for any text any piece of text
39:29
that is close to the embeddings of the
39:31
images that correspond to that piece of
39:34
text. So this is the problem we're going
39:36
to solve. There's a bunch of text
39:38
conceptually there are a whole bunch of
39:39
images that correspond to that text and
39:42
we're going to now create embeddings so
39:43
that that is close to all the embeddings
39:46
of those images. Right? It feels kind of
39:48
like almost impossible that you can
39:50
actually do something like this, but
39:52
there's a very clever idea uh that
39:56
OpenAI came up with that tells you how
39:58
to do it. So, here's what we're going to
39:59
do. So, let's say we have an image and a
40:02
caption. So, here's an image. Uh here's
40:05
a caption, right? And we need some way
40:08
to take that piece of text and run it
40:10
through some network and create a nice
40:12
embedding from it. Okay? Similarly, we
40:15
want to take this image, run it through
40:16
some network and create an embedding
40:17
from it. Okay. Now, first
40:19
question, how can we compute embeddings
40:20
from a piece of text? First question,
40:22
how can we compute an embedding from a
40:23
piece of text? You know the answer.
40:27
Run through a transformer. Piece of
40:30
cake. We know how to do that, right?
40:34
Right, in particular, you can do
40:35
something like BERT. And for an image
40:37
encoder, you just run it through
40:38
something like ResNet, like the
40:41
penultimate layer, right? One of the
40:42
final layers is going to be a very good
40:44
representation of that image. You get
40:46
another embedding. So using the building
40:48
blocks we already know, we can create
40:50
embeddings very quickly from these
40:52
things. Okay, but if you just take a
40:55
piece of text and run it through a BERT
40:56
and you take an image and run it through
40:58
a ResNet, you're going to get some
40:59
embeddings. But why the heck should they
41:01
be related?
41:04
They were not trained together. So
41:06
there's no basis for them to be related.
41:08
They would just be some two embeddings.
41:10
Maybe they are kind of similar. Maybe
41:11
they're not. We don't know. There's no
41:13
reason to expect that they're going to
41:14
be similar. Okay, they're just two
41:16
embeddings.
41:20
Now, what we want to do is but once we
41:22
have these, we need to make sure the
41:24
embeddings that comes out of these two
41:26
things satisfy two very important
41:27
requirements.
41:32
We want to make sure that if you give it
41:33
an image
41:35
and a caption that describes that image.
41:39
So you have an image and a caption that
41:40
describes that image, we want to make
41:42
sure that the embeddings that come out
41:43
of these two boxes, they are as close to
41:45
each other as possible.
41:47
Okay? Given an image and a
41:50
caption that describes it, that's the
41:51
connection. They have to be close to
41:53
each other. And conversely, if you have
41:56
an image and a caption that's totally
41:58
irrelevant,
42:00
right? A train rounding a bend with a
42:02
beautiful fall foliage all around,
42:03
right? Clearly irrelevant. Those
42:05
embeddings should be far apart.
42:08
that it to really make sense,
42:10
right? Pairs of related things should be
42:12
together, irrelevant things should be
42:13
far apart. So if you can find embeddings
42:16
that satisfy these two criteria, maybe
42:18
we will be in the game. Okay. So now
42:23
this ensures that the text embedding and
42:24
the image embedding are referring to the
42:26
same underlying concept. Right? This
42:28
these requirements will enforce that. Uh
42:31
and so the embedding for any text prompt
42:32
is close to the embedding for all the
42:34
images that correspond to that prompt.
42:38
So the question is how do we do this? Uh
42:41
how can first of all how can we tell how
42:43
close two embeddings are? You know the
42:44
answer to this what's the answer
42:47
>> Correct, cosine similarity, right? We use
42:49
the cosine similarity of the embeddings.
42:51
So we know how to measure closeness.
42:54
So the question is how can we compute
42:55
embeddings that satisfy the two
42:56
requirements and OpenAI built a model
42:59
called CLIP which is very famous uh to
43:02
solve this problem right it stands for
43:04
Contrastive Language-Image Pre-training
43:07
uh and this forms the basis for a whole
43:08
bunch of models that have sprung up
43:10
after this called BLIP and BLIP-2 and so
43:12
on and so forth but this is the
43:13
fundamental idea
43:15
okay so
43:17
this is how CLIP works. What they
43:20
did is they took a 12-layer, 8-head
43:25
transformer causal encoder stack as
43:28
a text encoder
43:30
uh okay now you understand this right
43:33
that's what it is, eight-layer, I mean
43:35
sorry, 8-head, 12-layer transformer causal
43:36
encoder stack, and that's a
43:39
text encoder so we send any piece of
43:41
text through it right you get the next
43:43
word prediction embedding and that's the
43:45
embedding you're going to use uh and
43:48
they took ResNet-50 and made it the
43:50
image encoder. They took ResNet-50, chopped
43:53
off the top and whatever was left is the
43:55
image encoder. Okay,
43:59
then they initialized these things with random
44:00
weights, and then they
44:03
grab a batch of image
44:05
caption pairs. So in this example, let's
44:07
say that we have these three images
44:09
and I have captions to go with these
44:11
images. Okay, we have these three things
44:14
and this is the key step. They run the
44:18
images through the image encoder and the
44:20
captions through the text encoder and
44:22
get these embeddings. Okay, it's a
44:23
forward pass. You send it through this
44:26
network, you get two embeddings. Um, and
44:29
then this is what they do. With these
44:32
embeddings, they calculate the cosine
44:34
similarity for every image caption pair.
44:36
Okay? And so imagine something like
44:38
this. So you have these three captions,
44:41
you have these three images, and those
44:43
are the embeddings.
44:45
uh and then they calculate the cosine
44:47
similarity for every one of those
44:49
things.
44:51
It took me like 5 minutes or 10 minutes
44:52
to do this PowerPoint. You're welcome.
45:00
Particularly trying to get this comma to
45:02
line up is a real pain in the neck. So,
45:05
so all right. So, we have this here.
45:08
Okay. And now what we want to do is uh
45:11
we want these scores to be as high as
45:13
possible, right? Because the scores in
45:16
the diagonal are the ones for the
45:18
matching picture and caption,
45:21
right?
45:23
Those are
45:24
the scores for the matching pairs of
45:26
embeddings. We want them to be as high
45:28
as possible.
45:30
Okay. Um
45:32
So we want to maximize the sum of the
45:35
green cells, right? These are the green
45:37
cells, the diagonal. So if you
45:40
want to write it as a loss function
45:42
because the loss function is always
45:43
minimization, we basically say minimize
45:46
the negative sum of the green cells.
45:50
Okay, so the question is would this loss
45:52
function do the trick?
45:58
Seems reasonable. You want to make sure
46:00
the related things are really close
46:03
together. So you want to maximize
46:07
uh if that was the only part of the loss
46:09
function, wouldn't it just kind of
46:10
squish everything to the same spot in
46:12
the space?
46:13
>> Correct.
46:14
What it's going to do is it's going to
46:16
basically ignore the input.
46:20
The optimizer can simply ignore the
46:21
input, make all the embeddings the same.
46:24
For example, it can just make all the
46:25
embedding zero.
46:28
That's it. And then now we have a
46:30
perfect cosine similarity for
46:32
everything. For a any pair of image and
46:35
captions, the cosine similarity is going
46:36
to be one. It's perfect, right? So
46:38
clearly that's not enough. This is by
46:41
the way is called model collapse, right?
46:44
So to prevent it from doing that, we
46:46
need to do one more thing to the loss
46:47
function. Any guesses?
46:51
>> Yeah.
46:53
>> Uh make the images that aren't related
46:56
not have a high cosine similarity.
46:58
>> Exactly. Right. Exactly right. So what
47:00
we want to do is we want the scores of
47:02
the red stuff to be as small as
47:05
possible.
47:07
We want the green stuff to be as large as
47:09
possible and the red stuff to be as
47:10
small as possible.
47:12
Together it'll get the job done.
47:16
Okay. And so um so we want to maximize
47:20
the sum of the green cells and minimize
47:22
the sum of the red cells. So the
47:24
equivalent loss function is minimize the
47:26
sum of the red cells and the negative
47:28
sum of the green cells. That's it. So
47:31
all CLIP does is that it just grabs a
47:34
batch of image caption pairs, runs it
47:37
through the networks, calculates the
47:38
embeddings and calculates this sum of
47:41
the stuff here and that is your loss and
47:44
then back propagates through the
47:45
network. Boom. Batch batch batch. Do it
47:48
a whole bunch of times. And OpenAI did
47:50
this with uh oh this is the official
47:53
picture from the paper
47:55
which is worth reading by the way right
47:57
it comes in text encoder you get these
47:59
uh embedding vectors image encoder and
48:02
then boom the diagonal is maximized and
48:05
the off diagonals are minimized
48:07
and they did it with 400 million image
48:10
caption pairs scraped from the internet.
48:14
400 million.
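As a sketch, the batch loss exactly as described here is: compute the cosine-similarity matrix, maximize the diagonal, minimize the off-diagonal. (The actual CLIP paper uses a temperature-scaled softmax cross-entropy over this same matrix, but the intuition is this.) Here image_embeddings and text_embeddings stand for the outputs of the two encoders on a batch of matching pairs.

```python
import tensorflow as tf

def clip_style_batch_loss(image_embeddings, text_embeddings):
    """Both inputs: (batch, dim) tensors, row i of each coming from the
    same image-caption pair."""
    # Cosine similarity = dot product of L2-normalized embeddings
    img = tf.math.l2_normalize(image_embeddings, axis=-1)
    txt = tf.math.l2_normalize(text_embeddings, axis=-1)
    sim = tf.matmul(img, txt, transpose_b=True)          # (batch, batch) similarity matrix

    green = tf.reduce_sum(tf.linalg.diag_part(sim))      # matching pairs (the diagonal)
    red = tf.reduce_sum(sim) - green                     # mismatched pairs (off-diagonal)

    # Minimize the sum of the red cells and the negative sum of the green cells
    return red - green
```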
48:16
By the way, you folks who work in the
48:18
space may know this really well, but uh
48:20
one very easy way to get a caption for
48:23
an image, right? You we see images, but
48:26
where do you think the captions come
48:27
from? Where did they get those captions?
48:29
They didn't obviously they didn't ask
48:30
people to manually label each image of
48:32
the caption. Where do you think they got
48:33
it from?
48:35
>> Google search.
48:36
>> Uh Google search can help but why does
48:39
Google search actually find the caption?
48:41
How does it because Google search is not
48:42
creating the caption? um
48:45
>> take it from the alt text on the images.
48:47
>> Correct. Alt text. So a lot of folks for
48:50
accessibility reasons they have alt text
48:52
right on all the images they create. A
48:54
lot of people have alt text in their
48:56
images they publish on the web and
48:58
that's what we use. And the alt text
49:00
actually ends up being a more verbose
49:03
description of the image than a typical
49:05
caption which tends to be much briefer.
49:07
And for us, more verbose, the longer the
49:10
better because there's more stuff for
49:11
the model to learn from.
49:14
Um, so that's how they built CLIP.
49:17
And so now what we can do is
49:19
use CLIP's text encoder by itself,
49:22
right? We can send in any text and get
49:24
an embedding that is close to the
49:25
embedding of any image that is described by
49:28
the text.
49:31
Okay. Now, by the way, CLIP can be used
49:33
for zero-shot image classification.
49:37
And what I mean by zero-shot image
49:39
classification, I'll walk through
49:40
the picture in just a second, is that
49:42
typically when you want to build an
49:43
image classifier, right, you can get a
49:45
whole bunch of training data of images
49:47
and their labels and then we train them,
49:50
right? Maybe you take something like
49:51
restnet, chop off the top, attach our
49:54
own output head and train, train, train.
49:56
Boom, you have a classifier. But the
49:58
only problem with that is let's say that
50:00
tomorrow so today for example you had
50:02
five classes in your problem and
50:04
tomorrow somebody comes along and says
50:06
oh actually we have a sixth category
50:09
right what do you do then well you have
50:10
to go back to the drawing board and
50:11
retrain the whole thing with six labels
50:13
now not five because your problem has
50:15
changed would it be great if you had a
50:17
classifier where you just come to it and
50:20
say here's an image and here are the six
50:22
possible labels I want you to pick from
50:23
pick one from me and you want to be able
50:26
to give it a different set of labels
50:27
those each time and it'll just use the
50:30
labels you're giving it and the image
50:32
and figures out which which label
50:33
corresponds to the image you just fed it
50:35
that would be an insanely flexible image
50:38
classification system right and that's
50:40
what I mean by zero-shot image
50:42
classification and you can use CLIP to
50:44
do zero-shot image classification.
50:47
Now, how you do it is actually in the
50:50
picture though not very clearly done
50:52
anyone wants to
50:58
How can you use CLIP to build, like, an
51:01
infinitely flexible image classifier?
51:12
>> Um, I mean the text input was, like,
51:14
trained like BERT, right? So in the same way
51:16
BERT can handle words never seen before,
51:19
does it essentially do that? Sorry, say
51:21
that again. The second part
51:22
>> you're saying you're saying it sees a
51:24
text input with something it's never
51:25
seen before, right? Yeah.
51:26
>> Okay. So, in the BERT model, which is
51:28
where it came from, in the text
51:30
encoding in the BERT model, I think we
51:32
talked about when it sees a word it
51:35
doesn't know that it's never seen
51:36
before, it can use the context words
51:39
around it to try to
51:41
>> Right. Right. So, but here, just to
51:43
be clear, I want you to use the CLIP that
51:46
we just built, right? And assume CLIP
51:49
knows all the words, because
51:51
it's been trained on a big vocabulary.
51:53
You can give it any text you want. It'll
51:54
create an embedding from it. That's the
51:57
key capability.
52:02
>> So it creates a text embedding for
52:06
>> Yeah.
52:06
>> because like and then for your image.
52:11
So comparing similarity scores between
52:14
the two. The image is complete, but the
52:15
text is not complete. There'll be
52:17
missing pieces and then make some
52:18
prediction using this.
52:21
Why is there a missing piece in the
52:22
text?
52:24
>> Because, um, the text
52:28
does not contain the class. Um,
52:31
but for the image, the way it
52:34
was trained, it was trained with
52:36
pairs with the class included.
52:38
>> right but we actually know the class now
52:40
because so the use case is that I come
52:42
to you with an image and I say here are
52:45
the seven possible labels for this image
52:48
and each label is a piece of text.
52:51
So you actually have seven
52:53
pieces of text and an image, and all I
52:55
want CLIP to do is to tell me, okay, the
52:58
fourth label, say, is the right
53:00
one for this image,
53:03
but you're on the right track
53:08
once you see how it's done you'll be
53:09
like yeah of course
53:13
>> I might not be understanding something, but
53:15
wouldn't you just pick the embedding
53:16
that's the closest, like the
53:18
text embedding that's the closest to the
53:20
image embedding?
53:22
>> Correct. You're not missing anything. That's the right
53:23
answer. Well done.
53:26
Come on people. Can you applaud our
53:27
fellow here? [applause]
53:30
You folks are hard to impress.
53:32
That's exactly what we do. So here,
53:38
the key thing to remember, the key
53:40
thing to keep in your head, is that
53:42
a label is just text:
53:45
dog, cat, right? It's just text. So you
53:47
can just imagine taking each label,
53:50
which in this case is plane, car, dog,
53:52
whatever. For each one of them you create
53:54
an embedding; you get T1 through TN
53:57
if you have N labels. For the image, you
53:59
just have one embedding, I. And then you
54:01
just calculate the
54:03
cosine similarity, and whichever is the
54:04
highest number, you say, okay, it's a dog.
54:06
That's it.
54:09
it's super just imagine the level of
54:11
flexibility here
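For reference, here is roughly what that zero-shot recipe looks like with the publicly released CLIP checkpoint on the Hugging Face Hub; the checkpoint name, the candidate labels, and the test image URL are plausible examples, not ones given in the lecture:

```python
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a plane", "a photo of a car", "a photo of a dog", "a photo of a cat"]
url = "http://images.cocodataset.org/val2017/000000039769.jpg"   # any image you like
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)   # similarities turned into probabilities
print(dict(zip(labels, probs[0].tolist())))                  # the highest score is the predicted label
```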
54:15
So that's a side use of CLIP, unrelated
54:18
to diffusion models, but I just
54:20
thought it's really clever, so I wanted
54:21
to share that. Okay, good. Now let's see
54:23
how we can actually use this entire
54:25
capability to solve the original
54:27
problem we set out to solve which is can
54:29
we steer the diffusion model to create
54:31
an image based on a particular prompt we
54:33
give it um so now remember if you go
54:37
back to how we did it we created all
54:39
these training pairs of X and Y based on,
54:41
you know, noising the image. X is
54:44
the image, Y is the less noisy version of
54:46
the image. So what we can simply do is we
54:51
can actually change the input so it
54:53
becomes the image and then the CLIP text
54:56
embedding of the caption for that image.
54:59
So you have an image and you have a
55:00
caption. You take the caption, run it
55:02
through CLIP, you get an embedding. By
55:05
definition, that embedding
55:07
lives in the same space as all the
55:09
images that correspond to that caption.
55:13
Right? So you just
55:15
concatenate the CLIP
55:18
embedding of the caption along with the
55:20
image. You say, make that the new input.
55:22
Now Y continues to be the less noisy
55:24
version of the image or as we saw
55:26
earlier it could be just the noise
55:27
component of the image. Okay, this is
55:30
the new XY pair that we have. And so now
55:34
to train the model, you send in the CLIP
55:36
embedding along with the
55:39
noisy version of the image X, and you keep
55:41
on training it for a while. Once your
55:43
model is trained, when you want to
55:44
use it for inference on a new
55:46
prompt, you just give it, you know,
55:49
Killian Court at MIT during the springtime,
55:51
along with a bunch of noise. It goes in, it
55:55
starts denoising it. But because this
55:57
embedding, thanks to CLIP,
56:00
lives in the same space as all the Killian
56:02
Court embeddings, the Killian Court images,
56:05
you keep on doing it for a while, and at
56:07
some point you'll get Killian Court.
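For concreteness, here is a toy sketch of that training setup, not the actual Stable Diffusion code: a small denoiser takes a noisy image plus a caption embedding and is trained to predict the noise that was added. The module, the single fixed noise level, and the random tensors are all simplified stand-ins.

```python
import torch
import torch.nn as nn

class TextConditionedDenoiser(nn.Module):
    """Toy stand-in for the real denoising network: noisy image + caption embedding -> predicted noise."""
    def __init__(self, img_channels=3, text_dim=512, hidden=64):
        super().__init__()
        self.conv_in = nn.Conv2d(img_channels, hidden, 3, padding=1)
        self.text_proj = nn.Linear(text_dim, hidden)                    # project the CLIP text embedding
        self.conv_mid = nn.Conv2d(hidden, hidden, 3, padding=1)
        self.conv_out = nn.Conv2d(hidden, img_channels, 3, padding=1)   # predict the added noise

    def forward(self, noisy_img, text_emb):
        h = torch.relu(self.conv_in(noisy_img))
        # condition on the caption: add the projected text embedding at every spatial location
        h = h + self.text_proj(text_emb)[:, :, None, None]
        h = torch.relu(self.conv_mid(h))
        return self.conv_out(h)

model = TextConditionedDenoiser()
x0 = torch.rand(8, 3, 64, 64)            # clean training images
text_emb = torch.randn(8, 512)            # CLIP embeddings of their captions (assumed precomputed)
noise = torch.randn_like(x0)
noisy = x0 + 0.5 * noise                  # one fixed noise level, for simplicity
loss = nn.functional.mse_loss(model(noisy, text_emb), noise)   # Y here is the noise component
loss.backward()
print(loss.item())
```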
56:11
That's how they do it. That's how they
56:12
steer the image. It's a two-step
56:15
process. You create all these CLIP
56:16
embeddings. CLIP was a
56:19
breakthrough, in my opinion, because it
56:21
was one of the, maybe the first
56:22
example. I don't know if it's the very
56:24
first but one of the early examples of
56:26
saying we have different kinds of data.
56:28
We have images, we have captions, we
56:30
have text. How do we create embeddings
56:32
for every one of these very different
56:34
data types that all happen to live in
56:36
the same space, the same concept space?
56:38
That was the key idea. And if you look
56:40
at the modern multimodal large language
56:42
models, they are all based on the same
56:44
exact idea.
56:46
So it's very powerful this approach.
56:49
Yeah. Now I understand this for images,
56:51
but for video generation models like
56:54
Sora, do they have some sort of
56:56
underlying physics structure or do they
56:58
learn the physical representations?
57:00
>> There's a lot of debate on the internet
57:02
about this stuff. Um they haven't
57:04
published the results, the full
57:05
technical report yet. So we don't know
57:07
for sure but the consensus seems to be
57:09
no, they are not using a physics
57:11
engine. What they have done, uh, and again
57:14
this may be wrong once the report comes
57:15
out we'll know for sure but uh people
57:17
what people are saying, computer vision
57:19
experts, is that it has been trained
57:22
on a lot of video game data
57:25
uh along with actual videos and so on
57:28
and the corpus of training is
57:30
so massive that it has basically learned
57:32
to mimic certain physics aspects to it
57:35
just as a side effect, much like LLMs: you
57:38
train them on a large amount of text
57:39
data, they begin to do things
57:41
which you didn't anticipate that they'll
57:43
do right so for example I read this I
57:46
thought it's a really great example of
57:48
what is surprising about large language
57:50
models is not that you know you train
57:52
them on a bunch of high school math
57:54
problems and then you give it a new high
57:56
school math problem it can actually
57:57
solve it that's not surprising you give
57:59
it a whole bunch of high school math
58:00
problems in English then you ask it to
58:03
read a bunch of French literature and
58:05
then you give it French high school math problems, and it
58:07
will solve them. That is the new
58:08
news, right? So similarly here I think
58:12
the expectation is that it's not
58:13
actually using a physics engine under
58:15
the hood. It may have used a physics
58:16
engine to actually come up with the
58:17
videos and renderings but there are no
58:20
physics constraints in the model itself.
58:22
It just comes out of the training
58:23
process. That's the current view. Once
58:26
the technical report comes out, we'll
58:27
know for sure what they actually did.
58:30
U
58:33
>> so quick question about stability. It's
58:36
claiming to be a little bit more real
58:37
time in their image generation. Um, so
58:40
>> you mean stable diffusion?
58:41
>> Yeah, stable diffusion. So, are they
58:43
jumping through the noise more quickly
58:45
or are they kind of like pre-prompting
58:46
it and kind of trick?
58:47
>> Very good question and there's a very
58:48
key trick. It's coming.
58:50
>> Um,
58:52
>> So here, the example of the noise is a
58:55
normal distribution. However, if we have
58:57
changed the noise distribution, does it
59:00
change the result?
59:02
>> Oh, you mean if you change it to, like, a Poisson or some other
59:04
distribution? It'll definitely change
59:05
the results, because if you look at the
59:08
underlying math of why this works, it
59:10
heavily depends on the Gaussian
59:11
assumption.
59:13
>> Yeah. Um there was another question
59:15
somewhere here.
59:18
>> Um you may not know the answer because
59:20
the technical report isn't out, but could it
59:21
be in terms of video generation sort of
59:23
analogous to going from, like, one fuzzy,
59:26
noisy image to another? Like you're
59:28
almost doing a series of still images
59:30
and learning how to
59:31
>> No, actually, I think people are pretty sure that
59:33
is how it's done. So, basically you
59:35
think of the video as just a
59:36
series of frames, right? And each frame
59:39
is an image and there is a sequentiality
59:41
to it. Um, which is where the
59:43
transformer stack will come in because
59:44
it handles sequentiality. So, in general
59:47
video stuff typically operates on frame
59:50
by frame which is just an image. So,
59:53
that is definitely there. What we don't
59:54
know is if they also used some
59:57
understanding of the fact that for
59:59
example that if an object is dropped it
1:00:02
has to fall to the earth at a certain
1:00:04
rate or if an object goes behind another
1:00:06
object you can't see the object anymore
1:00:08
right things like that which we take for
1:00:10
granted um the question is are they
1:00:12
using it and the consensus seems to be
1:00:15
uh in the absence of an actual technical
1:00:17
report that no they're not doing it
1:00:18
because there are lots of examples on
1:00:20
Twitter where people will show a Sora
1:00:22
video in which it's not obeying the laws
1:00:24
of physics. So you take like a beach
1:00:26
chair and then put it in the sand. You
1:00:28
see the sand come through the base of
1:00:30
the beach chair, right? Or you take an
1:00:32
object and put it behind an object. You
1:00:33
can still see the object even though the
1:00:35
original object is opaque. So you'll be
1:00:37
seeing some evidence that, no, it's not
1:00:38
obeying the laws of physics. What you're
1:00:39
seeing is just an amazing
1:00:46
fingers without knowing there have to be
1:00:47
only five fingers.
1:00:50
Um
1:00:51
Okay. All right. So let's keep going
1:00:55
now. Um, so there was another paper
1:00:58
afterwards, and this is the original
1:01:00
paper, which took that idea of the
1:01:02
diffusion model. And diffusion is
1:01:05
very slow, as Olivia pointed out. So
1:01:07
the question is can we make it much
1:01:08
faster? Right? So what they did and I'm
1:01:11
not going to get into this whole thing
1:01:12
here. I just want to highlight a couple
1:01:14
of things. The first one is that um
1:01:18
first of all, notice that you see a U-Net
1:01:20
here. So they are using a U-Net, right,
1:01:23
to go from image to image.
1:01:25
The second thing is that the CLIP
1:01:28
embedding of the text prompt is
1:01:30
basically woven in, meaning it's
1:01:32
incorporated into the
1:01:34
U-Net through an attention
1:01:36
mechanism, a transformer mechanism, and
1:01:38
you can see the QKV business here which
1:01:41
should be familiar at this point. So it
1:01:43
is integrated into the transformer stack
1:01:45
directly, that input, the CLIP embedding.
1:01:47
That's the second thing I want to point
1:01:48
out. And then thirdly
1:01:50
and this is where the speed up comes. So
1:01:52
what you do is instead of taking the
1:01:54
image running it through the whole
1:01:56
network and creating a slightly less
1:01:57
noisy version of the image here what you
1:01:59
do is you take the image you run it
1:02:02
through an image encoder you get an
1:02:03
embedding and now you only work with the
1:02:05
embedding you take the embedding and
1:02:07
create a slightly less noisy version
1:02:09
embedding keep on doing it and these
1:02:11
embeddings are much smaller than images
1:02:13
therefore they're much faster to process
1:02:14
and once you've done it like a thousand
1:02:16
times you get a very sort of almost pure
1:02:18
noiseless version of the embedding. Now you
1:02:20
run it through an image decoder to get the image back.
1:02:24
So the idea here is that you
1:02:26
operate
1:02:29
in the latent space, meaning the
1:02:31
embedding space and hence it's called a
1:02:32
latent diffusion model. So that's where
1:02:35
the speed up comes but research
1:02:36
continues to be very strong to make it
1:02:38
even faster because for a lot of
1:02:40
consumer applications people are
1:02:41
obviously not going to wait around. I
1:02:43
mean who wants to wait for 10 seconds
1:02:44
right? And so there's a lot of
1:02:46
pressure to make it even faster
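To make that concrete, here is a toy sketch of the latent-diffusion idea; the encoder, decoder, and denoiser below are stand-in modules, not the real Stable Diffusion components, and the update rule is deliberately crude:

```python
import torch
import torch.nn as nn

# stand-in modules, not the real Stable Diffusion VAE / U-Net
encoder = nn.Conv2d(3, 4, kernel_size=8, stride=8)            # 512x512x3 image -> 64x64x4 latent
decoder = nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)    # latent -> 512x512x3 image
denoiser = nn.Conv2d(4, 4, kernel_size=3, padding=1)           # stands in for the text-conditioned U-Net

with torch.no_grad():
    print(encoder(torch.rand(1, 3, 512, 512)).shape)            # ~48x fewer numbers than the full image

    latent = torch.randn(1, 4, 64, 64)            # generation starts from pure noise in latent space
    for step in range(50):                         # all the iterative work happens on the small latent
        latent = latent - 0.02 * denoiser(latent)  # crude "remove a little noise" update
    image = decoder(latent)                        # decode to pixels only once, at the end
    print(image.shape)                             # torch.Size([1, 3, 512, 512])
```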
1:02:49
um
1:02:52
all right so that's what we have
1:02:53
obviously, um, you know, these
1:02:56
models are transforming everything and
1:02:58
uh by the way, this site here, lexica.art.
1:03:00
You can go check it out. Uh, it has
1:03:01
a whole bunch of very interesting images
1:03:03
and prompts that created the images. So
1:03:06
if you're working in the space, it gives
1:03:07
you a lot of interesting ideas. But it's
1:03:09
not just for you know consumer fun
1:03:11
applications. You know, these models
1:03:13
are actually being used. You know,
1:03:15
AlphaFold, if you'll recall: if you give
1:03:18
it an amino acid sequence it can
1:03:19
actually create the 3D structure. Right?
1:03:21
So that's an example of that, though
1:03:24
I don't think they use a diffusion
1:03:25
model. But you can imagine using a
1:03:27
diffusion model to create these
1:03:28
complicated objects. Meaning the objects
1:03:32
you create don't have to be images.
1:03:34
They can be arbitrarily complicated
1:03:36
things. As long as you have enough data
1:03:39
about such things to use for training
1:03:41
and the notion of noising the input is
1:03:43
meaningful, you can create some very
1:03:45
interesting structures. you can create
1:03:47
3D things and, you know, protein
1:03:49
structures and there's a whole bunch of
1:03:51
very interesting applications in
1:03:52
biomedical uh sciences. So this is
1:03:55
really just the tip of the iceberg and
1:03:57
now there are these things um there are
1:03:59
ways in which you can use diffusion
1:04:00
models to do large language
1:04:03
modeling as well. So there's a lot of
1:04:05
overlap and blending and so on going on
1:04:07
in the space. So so I'm going to do a
1:04:10
quick demo. Um if you look at hugging
1:04:11
face there is something called the
1:04:12
diffusers library which is like the the
1:04:15
as the name suggests it's a library for
1:04:17
a lot of diffusion models
1:04:20
and let's take a quick look.
1:04:25
All right, so the diffusers
1:04:27
library has a whole bunch of diffusion
1:04:28
models. We're going to work with stable
1:04:30
diffusion, which is one of, you know,
1:04:32
the better known models. So let's
1:04:34
install diffusers.
1:04:38
Uh, you will recall when I did
1:04:41
the quick lightning tour of the hugging
1:04:42
face ecosystem for language. Uh, Hugging
1:04:45
Face has a whole bunch of capabilities
1:04:48
sort of built out of the box, and you use
1:04:50
this thing called the pipeline function
1:04:52
to very quickly use any model you want.
1:04:54
The same exact philosophy applies here.
1:04:56
You still use the pipeline. So I'm going
1:04:59
to import a bunch of stuff.
1:05:09
All right. So, oh, I see I have to do
1:05:11
this thing. Okay.
1:05:16
Great. F.
1:05:21
Okay. So, uh, all right. Here's what we have
1:05:24
here. So you'll remember that,
1:05:26
when we worked with text,
1:05:28
we would grab a pre-trained model and
1:05:30
then we'd actually run it through a
1:05:31
pipeline, and we could do all the inference
1:05:33
we want on it. The same exact philosophy
1:05:36
applies here. So, um, and this is very
1:05:39
similar to what we did in lecture 8 for
1:05:41
NLP. So what we're going to do is we use
1:05:44
this command the stable diffusion
1:05:46
pipeline from pre-trained and we use
1:05:48
this version 1.4 stable diffusion model.
1:05:50
Um so let's just create the pipeline and
1:05:56
and obviously we have used TensorFlow,
1:05:58
not PyTorch, here, but a lot of these
1:06:00
models unfortunately happen to be in
1:06:02
PyTorch, so knowing a little bit of PyTorch
1:06:05
is actually very helpful, um, to be able
1:06:07
to work with these things and what we're
1:06:09
doing here uh while it's downloading uh
1:06:12
we are using this fp16
1:06:15
um storage format for the model
1:06:18
weights because it's going to be a
1:06:19
little smaller than using 32 bits so
1:06:22
it'll download faster. So that's what's
1:06:24
happening here. So all right, it's
1:06:25
downloaded fine. So now we just give it
1:06:28
a prompt and this is actually one of the
1:06:29
original famous uh meme prompts a
1:06:32
photograph of an astronaut riding a
1:06:34
horse. And so uh once we have the
1:06:36
pipeline set up, uh, I'll just set a seed for
1:06:38
reproducibility. And then literally I do
1:06:40
pipe of prompt and then it's actually
1:06:44
you can see here 50. So it's going
1:06:46
through 50 denoising steps. Okay. Um, and
1:06:50
you come up with an astronaut riding a
1:06:52
horse. Okay. So that's that. Um, you can
1:06:54
actually change the seed and you can
1:06:56
get a different one. Um, the seed basically
1:06:59
sets the random starting point
1:07:01
for the image. So therefore you would
1:07:03
expect a different astronaut. Yep. This
1:07:05
is an astronaut riding another horse. So
1:07:08
um I think people came up with these
1:07:09
kinds of fun examples because it's
1:07:11
guaranteed not to be in the training
1:07:12
data, right? So whatever the model is
1:07:15
doing, remember, it's not
1:07:16
regurgitating what it has already seen.
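The notebook cells in this part of the demo amount to roughly the following (a paraphrase; the exact seed value and arguments are assumptions):

```python
# pip install diffusers transformers accelerate
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,              # fp16 weights: smaller download, less memory
).to("cuda")

generator = torch.Generator("cuda").manual_seed(42)   # the seed fixes the random starting noise
prompt = "a photograph of an astronaut riding a horse"
image = pipe(prompt, num_inference_steps=50, generator=generator).images[0]
image.save("astronaut.png")
```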
1:07:18
Uh, all right. Give me a prompt.
1:07:26
Prompts. Anyone?
1:07:29
Wow.
1:07:34
>> Okay,
1:07:38
that might be a
1:07:40
All right. Riding a horse.
1:07:48
All right,
1:07:56
there are two of them and clearly MIT
1:07:59
professors don't have really.
1:08:03
Yeah, moving on. [laughter]
1:08:06
So, by the way, um, you should
1:08:10
spend some time with the diffusers
1:08:11
library, they have a bunch of tutorials
1:08:12
which are really interesting because
1:08:14
this core capability of giving a prompt
1:08:16
and getting an image out can actually be
1:08:18
manipulated for all sorts of very
1:08:20
interesting use cases. So, for example,
1:08:22
there is this thing called negative
1:08:23
prompting. And the idea of negative
1:08:25
prompting is that you can give it two
1:08:28
prompts and say create an image which
1:08:31
embodies the first prompt but not the
1:08:33
second prompt. Essentially, subtract the
1:08:36
second prompt from the first one. That's
1:08:37
called negative prompting. And you might
1:08:39
be wondering like what use is that?
1:08:41
There are lots of fun uses. So here we
1:08:45
are going to, the prompt is going to be, a
1:08:46
Labrador in the style of Vermeer. Okay,
1:08:49
that's the first prompt. 50 steps.
1:08:53
Uh look at that. Amazing, right? Uh but
1:08:57
maybe you don't care for the blue scarf.
1:09:00
So you basically give it a negative
1:09:02
prompt. And you basically the negative
1:09:04
prompt is blue meaning remove everything
1:09:06
that's blue. I don't like this otherwise
1:09:09
keep the Labrador thing going. So you
1:09:11
run it.
1:09:16
Look at that. The blue is gone. Negative
1:09:18
prompting. Okay. Yeah.
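In the diffusers API, the negative prompt is just an extra argument on the same call. Roughly, and reusing the pipe object from the earlier sketch (the exact prompt wording is an assumption):

```python
prompt = "a labrador in the style of Vermeer"
image = pipe(prompt, num_inference_steps=50).images[0]            # the original: blue scarf and all

# same prompt, but steer the sampling away from anything "blue"
image_no_blue = pipe(prompt, negative_prompt="blue",
                     num_inference_steps=50).images[0]
```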
1:09:22
>> If you change that from 50 to
1:09:26
a thousand, will it become less pixelated
1:09:28
or will it eventually just keep going
1:09:30
and iterating?
1:09:31
>> No. Typically, if you do more of these
1:09:32
things, it gets better. The quality is
1:09:34
much better because each step will
1:09:36
denoise it very slightly. So, errors
1:09:38
won't accumulate and things like that.
1:09:40
And the diffusers library gives you lots
1:09:42
of controls for fiddling around with all
1:09:44
these things. Um, okay. So, that's what
1:09:47
we had. Uh, 949.
1:09:50
Okay. So, check out this tutorial if
1:09:52
you're curious about how this stuff
1:09:54
works. And I'm going to do one other
1:09:56
thing um because I didn't get to do it
1:09:58
earlier on. So uh we spent some time
1:10:01
with the hugging face hub and I walked
1:10:03
you through a few use cases for text uh
1:10:05
where you can take a text model and use
1:10:07
it for you know classification uh things
1:10:10
like that summarization and so on and so
1:10:11
forth. You can do the same thing for
1:10:13
computer vision models. So if you have a
1:10:16
computer vision problem that just maps
1:10:17
to a standard, uh, computer vision task,
1:10:20
you can just use the Hugging Face Hub as
1:10:21
well. So um let me just show you very
1:10:25
quickly the same kind of thing actually
1:10:27
works here.
1:10:32
All right. Okay. So,
1:10:35
so let's say that you want to classify
1:10:37
something. You just import the pipeline
1:10:38
as before.
1:10:40
And once you import it, you can just
1:10:43
literally give it the standard task that
1:10:45
you care about like image
1:10:46
classification.
1:10:48
And then you can start using it
1:10:50
right from that point on.
1:10:53
Okay.
1:10:59
All right. Okay. So now I'm going to
1:11:02
just get this image. So it's a very
1:11:04
famous image. Um, right. And we're going
1:11:06
to ask it to classify this image. So we
1:11:08
just literally run it through the
1:11:09
pipeline.
1:11:12
And it says the most likely label, at 94%
1:11:15
probability, is an Egyptian cat. Seems
1:11:18
reasonable. Okay. I mean, it's a
1:11:20
tough picture, right? Because there are
1:11:21
lots of things going on in that picture.
1:11:22
It's not like one image, one object.
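The classification part of this demo boils down to roughly the following; the image URL is assumed to be the standard COCO cats-on-a-couch picture used throughout the Hugging Face docs, since the exact URL isn't shown in the transcript:

```python
from transformers import pipeline

url = "http://images.cocodataset.org/val2017/000000039769.jpg"

clf = pipeline("image-classification")                  # default checkpoint chosen for you
print(clf(url)[0])                                       # top label and its probability

# or pick any checkpoint you like from the Hub instead of the default
clf_resnet = pipeline("image-classification", model="microsoft/resnet-50")
print(clf_resnet(url)[0])
```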
1:11:25
Um okay so you don't have to use the
1:11:27
default model you can actually give it
1:11:29
your own model that you want. So for
1:11:31
example, you can go um sorry
1:11:35
you can go to the Hugging Face Hub
1:11:38
and you can go in there and say all
1:11:40
right I want image classification these
1:11:42
are all the models 10,487 models let's
1:11:45
sort by I don't know most downloads or
1:11:49
maybe most likes
1:11:51
u and you have all these models you can
1:11:53
pick any one of them so for example
1:11:54
let's say you want to pick Microsoft
1:11:56
ResNet as your model. That's what I tried
1:11:57
here. So I have Microsoft ResNet; you
1:12:00
just set model equals that, run it, and it
1:12:04
takes care of all the tokenization this
1:12:05
that and whatnot. It's really very handy
1:12:08
and then you run it through the pipeline
1:12:09
again and it says tiger cat 94%
1:12:12
probability, according to ResNet. Um, so
1:12:15
yeah so that's how you do it. Now let's
1:12:17
actually try a more interesting example
1:12:18
where you want to detect all the objects
1:12:20
in the picture which we didn't talk
1:12:21
about in class object detection. So just
1:12:23
create an object detection pipeline.
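Roughly, and with the same assumed cats image:

```python
from transformers import pipeline

detector = pipeline("object-detection")
detections = detector("http://images.cocodataset.org/val2017/000000039769.jpg")
for d in detections:
    print(d["label"], round(d["score"], 3), d["box"])    # label, confidence, bounding box
```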
1:12:27
Same thing as before. When you actually
1:12:31
run this command, an astonishing
1:12:31
amount of complicated stuff is going on
1:12:33
under the hood. Okay, and we are all the
1:12:35
beneficiaries of that. So, thank you.
1:12:37
Um, so yeah, so we have this here and
1:12:39
then we run it through um the pipeline.
1:12:42
It's looking at all the possible things
1:12:44
that might be sitting in the picture.
1:12:45
The results are hard to read. So, let's
1:12:46
actually visualize them. Um,
1:12:49
and I got some nice code from this site
1:12:51
for how to visualize them. Let's just
1:12:53
reuse it. So, yeah. So if you plot the
1:12:56
results,
1:12:58
look at that.
1:13:03
Okay, so it has picked up the cat. 100%
1:13:06
probability, I guess. The remote, the
1:13:09
couch, the other remote, and then the
1:13:12
other cat. Pretty good, right? Off the shelf,
1:13:14
ready to go. No, no heavy lifting
1:13:17
required. Now, in in this case, we are
1:13:19
actually putting these boxes called
1:13:20
bounding boxes around each picture. But
1:13:22
what if you actually don't want a
1:13:23
bounding box? What you want is to actually
1:13:25
find the exact contour of that cat or
1:13:28
the remote. No problem. We do something
1:13:30
called image segmentation. So let's do
1:13:32
an image segmentation pipeline
1:13:36
uh and run it through.
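Roughly, and again with the assumed cats image:

```python
from transformers import pipeline

segmenter = pipeline("image-segmentation")
results = segmenter("http://images.cocodataset.org/val2017/000000039769.jpg")
for r in results:
    print(r["label"], r["score"])
    # r["mask"] is a PIL image: white where the object's pixels are, black elsewhere
```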
1:13:42
It takes some time. Um all right. All
1:13:46
right. Let's visualize it. So for
1:13:49
each object it finds, it gives you a
1:13:51
mask. It basically tells you for each
1:13:53
object what object it is and then which
1:13:56
pixels are on for that object and off
1:13:58
for everything else. It's a mask. It
1:14:00
tells you where it stands. And you can
1:14:02
see here, the first object it has
1:14:04
found is this thing here. And it's
1:14:06
perfectly delineated, right? It's pretty
1:14:08
amazing. So we can overlay this on the
1:14:10
original image and see it has found that
1:14:14
and it is Let's look at the other
1:14:15
objects. Oh, it has found the remote.
1:14:17
That's the second object.
1:14:20
And the third remote
1:14:24
and the fourth. You think any other
1:14:27
objects are remaining?
1:14:28
>> Couch. Good. All right, let's find the
1:14:32
couch.
1:14:33
And look, the couch is pretty good
1:14:36
except that the middle part has gotten
1:14:37
confused.
1:14:39
All right, but it's still pretty good,
1:14:41
right? So, yeah. So, that is, um, so
1:14:44
Hugging Face has all these things, and
1:14:46
you should definitely check it out
1:14:48
if you're not already very familiar
1:14:49
with it. So, uh, we have one minute
1:14:51
left. Any questions?
1:14:58
No questions. Okay. All right, folks.
1:15:00
See you on Wednesday. Thanks.
— end of transcript —