So, all right: today we actually come to the last lecture of the class, because Wednesday is going to be project presentations. I want to talk to you about diffusion models today, which is an incredibly exciting area that I don't think gets the same amount of attention, in some ways, compared to large language models, but it has enormous potential. So I'm very excited to talk to you about it.

Just for kicks, last night I asked ChatGPT to create a photorealistic image of graduate students in a class on deep learning, and this is what it came back with. There is a noticeable absence of an instructor, plus various students are facing in various directions, but apart from that, it's not bad. And here is an example of a Midjourney text-to-image diffusion model, which produces this amazing picture from the prompt: "a quaint Italian seaside village with colorful buildings," blah blah blah, "rendered in the style of Claude Monet," and so on and so forth, and that's what you get. It's pretty unbelievable. I'm sure you folks have played around with these things, and you have your favorite pictures and prompts and whatnot.

Now, on February 15th, OpenAI released a text-to-video model called Sora, which you folks may have seen, and which I find frankly just stunning in what it can do. It can produce a one-minute video from a text prompt. So if you actually give it this prompt, "In an ornate historical hall, a massive tidal wave peaks and begins to crash, and two surfers, seizing the moment, skillfully navigate the wave," I think we can all agree that such a thing has never happened in history, and therefore it was not in the training data, right?
So then you get this picture, this video. [video plays] And then some random person comes walking back into a completely dry [laughter] hall. So anyway, it's pretty amazing, I think you would agree.

Now, if you actually look at the Sora technical report, you find this opening paragraph where they say that they train text-conditional diffusion models, blah blah blah, using a transformer architecture. Okay, so we know what a transformer architecture is; you've been working with it, and you're quite familiar with it at this point. So today's class is really about text-conditional diffusion models, the other building block. Let's get to it.

What I'm going to do is divide this into two parts. First, I'm going to talk about how you get a model to simply generate an image for you: if you want to generate an image from a class of potential images, how can it just generate one? Then we'll talk about: okay, great, now that you can do that, how do you actually control, or steer, the model to produce an image based on whatever prompt you give it? How do you condition it? How do you control it? How do you steer it? You'll find all these synonyms used heavily in the literature, and they basically mean the same thing: how do you give it a prompt and then steer what gets produced?

All right, so let's say we want to build a model that can be used to generate images of stately college buildings. Obviously, our very own Killian Court is the finest example of such a thing. So what you do, as we always do with machine learning, is collect a bunch of data. In this particular case, we collect a whole bunch of images of stately college buildings.
What you see here is literally me doing a Google image search with the query "stately college buildings." This is the kind of stuff you get. So you have your training data at your disposal; it's ready to go.

Now, the question is this. Say you have such a model (and obviously we'll talk about how to build one very soon). Every time you sample this model, every time you ask it, "Hey, give me an image," you obviously want it to give a different image, right? Otherwise it's kind of boring. Maybe you want Killian Court, maybe you want the Rotunda from the University of Virginia. Any UVA alums here? Nobody? Okay. So the question is: how can we get it to randomly give us different images, where they all have to be stately college buildings? It can't be just some random stuff, right? So how do you do that?

The way we do that, and I still find it really astonishing that this approach actually works, is that we actually give it noise. I will define very precisely what I mean by noise in just a bit; basically, assume an image in which all the pixel values are randomly picked. So every time, you generate a random image and give it to the model, and it uses that random starting point to create an image for you. Because, by definition, if you choose the noise randomly, the starting points are obviously going to be different each time, it's hopefully going to generate a different image. But if the model is trained on stately college buildings, it will produce images of stately college buildings. It's not going to produce a picture of a Labrador retriever. Okay, so that's basically what we're going to do.
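To make "an image in which all the pixel values are randomly picked" concrete, here is a minimal NumPy sketch. The shape matches the Killian Court image used later in the demo; both the shape and the choice of a standard normal are illustrative assumptions, not anything prescribed in the lecture:

```python
import numpy as np

# A "pure noise" image: every pixel value in every channel is drawn
# independently at random. Shape is (height, width, channels).
noise_image = np.random.normal(loc=0.0, scale=1.0, size=(411, 583, 3))
```

Every call produces a different array, which is exactly why each generation can come out different.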
Now, if you look at something like this, the first question, of course, is: how can we train a model to generate an image from pure noise? This just sounds ridiculous, right? You basically give it a bunch of random numbers and say, "Give me Killian Court." It feels really ridiculous, and at that point folks can come to a stop and say, "All right, this approach is probably not going to take me anywhere; it's a bit of a dead end." But then some clever people had a very interesting idea. They said: it's not clear how to do this, so...

Just a quick aside: there's this really amazing book, published maybe 50 years ago, maybe earlier than that, called How to Solve It, by George Pólya. Pólya was an eminent mathematician, and he wrote this small book listing a whole bunch of heuristics that mathematicians use when they solve problems. Perhaps the most commonly used heuristic is: just reverse the question. Just reverse the question and see if anything comes out of it. Most of the time nothing will come out of it, but occasionally something amazing does. This is a great example of that heuristic at work. We don't know how to do this, so the question is: can we do the reverse? If I give you Killian Court, can you produce noise out of it for me?

And the answer is: yeah, of course we can do that. Given an image, we can easily create a noisy version of it. You can take the original image, add some noise to it to get this, and keep adding more and more noise, and finally you get something where you basically can't tell there's a clean Killian Court in it anymore. This reverse direction is actually very easy to do.
By the way, for folks who may not be very familiar with this notion of adding noise to an image, or making an image noisy, let me just show you in a Colab how easy it is.

All right. So we import a bunch of things. As usual we have NumPy, and there is this thing called the Python Imaging Library, PIL, which is very handy for image manipulation, so we import that too. Then I literally just read this image in. I uploaded it before class; let's make sure it's here. Okay, good: killian.png. So I read this image, and once I've read it, I convert it into a NumPy array. Remember, in any color image you have three tables of numbers: a number for each pixel for red, green, and blue, and each number is between 0 and 255. So here we divide everything by 255 just to normalize it, so it's all between zero and one; we have done this in the past. So let me read this in and convert it, and if you look at the shape, it's basically 411 x 583 x 3, three channels, as we have seen before. And then I'll just show it. All right, that's the picture.

Now what we want to do is add noise to this picture. All we have to do, for each pixel, is randomly pick a normally distributed random variable with a mean of zero and a small standard deviation, so it's a small number, and then literally add that number to the pixel. But we sample for every pixel; it's not like we sample once and add the same number to all the pixels. And the way you do that is basically, literally, np.random.normal,
where this 0.3 here is the standard deviation, and we tell it to generate as many of these numbers as the shape of the image I gave it. Then you add each of these numbers to the original image and you get this noisy image. So if this is the original image, with all the values between 0 and 1, and you make the noisy image, you can see the numbers have become different: the 0.23 has become 0.18, the 0.15 has become -0.17, and so on. You just added a small random number to everything. But as you can see, now you have some negative numbers, and you may have some numbers greater than one, and we do want everything to be between 0 and 1. So all we do is clip it: values smaller than zero are set to zero, and values greater than one are set to one. That's it. Everything over one is squashed to one, everything under zero is set to zero, and everything else is left unchanged. Now it's well behaved between 0 and 1 again, and we can just plot it, and you get this. That's it. That's all it takes to add noise to an image: one line of NumPy. Obviously, you can put this whole thing in a loop and keep increasing that standard deviation, 0.3, 0.4, 0.5, and so on, and when you do that, you get this nice sequence from clean Killian Court all the way to a very, very noisy version of Killian Court. That's the basic idea of adding noise.
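Putting the Colab steps together, a minimal end-to-end sketch looks something like this. The filename and the standard deviations are from the demo; treat the details as an illustrative reconstruction rather than the exact notebook:

```python
import numpy as np
from PIL import Image

# Load the image and normalize pixel values from [0, 255] down to [0, 1].
img = np.asarray(Image.open("killian.png").convert("RGB")) / 255.0
print(img.shape)  # (411, 583, 3): height x width x three color channels

# Add zero-mean Gaussian noise, sampled independently for every pixel and
# channel, then clip back into the valid [0, 1] range.
noisy = np.clip(img + np.random.normal(0, 0.3, size=img.shape), 0.0, 1.0)

# Increasing the standard deviation gives a sequence of noisier versions.
noisy_sequence = [
    np.clip(img + np.random.normal(0, s, size=img.shape), 0.0, 1.0)
    for s in (0.1, 0.2, 0.3, 0.4, 0.5)
]
```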
Any questions on the mechanics? Okay, good. So we can add random numbers, and by increasing the magnitude of the standard deviation of these normal random variables, we can make the image noisier. That suggests a really interesting idea. What idea would that be?

>> Doing the opposite.
>> Could you use the microphone, please?
>> Doing the opposite, like recreating the image from the noise.
>> So we are trying to create the image from the noise, but that feels a little hard. So what exactly can we do? Be a little more specific. Here we have the ability to take any image and add any amount of noise to it, right? That's the data we have. There is Killian Court, and there are various noisy versions of Killian Court, and likewise for the Rotunda at the University of Virginia, and so on.
>> I would assume you would do some kind of loss function for the final image that you get, compare it with the original image that you trained on, and then refine as you go.
>> Okay, you're on the right track. Any other proposals?
>> I think we could try to train a neural network to reconstruct the image, going from the noisy one back. We could have a whole dataset of images, find their noisy counterparts, and train a network to do the opposite task.
>> Yeah, that's definitely on the right track. Good ideas. So what we do, more concretely, is take each image in the training data and create noisy versions of it, as we have seen before. And then we say: we can create (x, y) training data pairs, input-output pairs, from all these images. Specifically, we take the slightly noisy version of Killian Court and call it the input, and we take the clean version and call it the output. That's the (x1, y1) pair; then we get (x2, y2), (x3, y3), and so on, all the way down. So at any point in this chain, what's the relationship between x and y?
If you set it up like this, as the input and the output?

>> It's the set of standard deviations, the values by which you change each pixel; those are like the weights by which you transform it.
>> Right, though maybe I was looking for something simpler, which is also correct: the relationship is that x is an image, any image, and y happens to be a slightly less noisy version of that image. The "slightly less noisy" part is really, really important. You're not going from Killian Court to full noise in one step, right? That's an impossible leap. You're going from the image to a slightly noisy version of the image. It is that "slightly" that allows all the magic to happen.

So that's what we have. And here's the thing; this is a larger comment about machine learning and deep learning. What machine learning and deep learning really are is this black box where, if you can find interesting input-output pairs, you can learn a function to go from the input to the output. That's it. This sounds kind of simple when I describe it like that, but there are some incredibly non-obvious ways of applying the idea. For example, a few years ago Google had this feature, which may actually be in production in Google Sheets now, where whenever you select a range of numbers in a spreadsheet and then go into another cell, it immediately suggests a formula for you. Where is that coming from? It's because Google Sheets users all over the world have been creating all these numbers with formulas. So someone said, "Look, wait a second.
We have all this data on people choosing a range of numbers and then entering a formula. Let's treat the range as the input and the formula as the output, give a model a million examples of this pair, and see if anything comes out of it." And boom, you get that feature.

So, similarly, here x is an image and y is a slightly less noisy version of the image. What that means is that we can build a denoising network: we can take an image and, using all these (x, y) pairs, build a network that slightly denoises it. And how do we do it? We just run stochastic gradient descent on the data. We have a network; it takes x and produces y, where y is a slightly less noisy version of x. The network has a bunch of weights, we have the right answers in terms of what the images need to be, and we can run stochastic gradient descent, or Adam, or something like that. Before you know it, if you have enough data, you have a network that can slightly denoise anything you give it.

>> Why slightly?
>> Why slightly? We'll come back to that question. The reason is that, in general, you have to do what you can to help the model. It's the proverbial old adage: you can't cross a ditch in two jumps. The ditch is too big, so you can't do it that way; instead, you build a bridge to go from here to there. And if you can slightly denoise something really well, then, as you will see in a second, I can actually denoise anything you want really well using that fundamental capability.
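As a concrete sketch of that training setup, here is what the pair construction and the training call might look like. The tiny convolutional model, the noise levels, and the mean-squared-error loss are illustrative assumptions (the lecture does not specify an architecture at this point), written in Keras since the lecture refers to Keras-style layer names later on:

```python
import numpy as np
import tensorflow as tf

def make_pairs(images, sigmas=(0.1, 0.2, 0.3, 0.4, 0.5)):
    """Build (x, y) pairs: x is a noisier image, y the slightly less noisy one."""
    xs, ys = [], []
    for img in images:
        prev = img  # start from the clean image
        for s in sigmas:
            noisy = np.clip(img + np.random.normal(0, s, img.shape), 0, 1)
            xs.append(noisy)  # input: the noisier version
            ys.append(prev)   # target: one step less noisy
            prev = noisy
    return np.array(xs, np.float32), np.array(ys, np.float32)

# A deliberately tiny stand-in for the real denoising network.
denoiser = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(3, 3, padding="same", activation="sigmoid"),
])
denoiser.compile(optimizer="adam", loss="mse")

# images: float array of shape (N, H, W, 3) with values in [0, 1]
# x, y = make_pairs(images)
# denoiser.fit(x, y, batch_size=16, epochs=10)
```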
>> Just to follow up: if you go back to the last slide, I could have created pairs where that noisy one is my x1 and the clean image is my y, and then the next noisier one is x2 and the clean image is still the y. Effectively, there would be learning there too: it could take those pairs, work out that noise matrix, and subtract it.
>> Yeah. The thing is, you want to make sure that each time, the amount of learning the model has to do is as bounded and as small as possible. If you fix the clean image as the ending point and keep moving the starting point further and further away, the gap is really large for the noisiest of those starting points. That's the problem.

Okay, so to come back to this: we can build a denoising model. We can do this. And once you have built such a thing, you give it something noisy and it gives you a slightly less noisy version of it; the quality goes up slightly each time you do that. This, of course, suggests the obvious way in which you would use it: once you train it, we can solve this problem. And how? You start with pure noise and then repeatedly denoise it. You get that, then you get that, and before you know it, Killian Court has emerged from the fog. It's pretty insane that this idea actually works.

So the model will generate a sequence of less and less noisy images, and the final one is the answer. Now, there's a whole bunch of detail here which I'm glossing over, such as: how many times must we run this loop to get a really good picture? The short answer is that initially you had to run it something like a thousand times; each denoising step was a baby step, and you had to take a thousand of them to get a really good answer. Research here has been, and continues to be, very active;
Now you can I think do it like 496 00:19:24,000 --> 00:19:29,038 50 steps or 100 steps. Right? But 497 00:19:26,400 --> 00:19:31,679 diffusion models like this uh they tend 498 00:19:29,038 --> 00:19:33,599 to take more time than a large language 499 00:19:31,679 --> 00:19:35,280 model which is why if you give a prompt 500 00:19:33,599 --> 00:19:36,639 to one of these models like midjourney 501 00:19:35,279 --> 00:19:38,960 it will take some time for it to come 502 00:19:36,640 --> 00:19:40,320 back with an image and and that the 503 00:19:38,960 --> 00:19:42,079 reason for the delay is because it's 504 00:19:40,319 --> 00:19:45,200 going through this you know incremental 505 00:19:42,079 --> 00:19:47,759 dnoising loop. Yeah. 506 00:19:45,200 --> 00:19:49,840 >> Uh from this we understand that each uh 507 00:19:47,759 --> 00:19:51,440 the final noise output sample would be 508 00:19:49,839 --> 00:19:55,199 very particular to each image in the 509 00:19:51,440 --> 00:19:57,279 matrix. So I mean like say two if you 510 00:19:55,200 --> 00:19:59,840 take two images the final we are getting 511 00:19:57,279 --> 00:20:02,319 is the image in the after when we start 512 00:19:59,839 --> 00:20:04,319 voicing it and the final output we get 513 00:20:02,319 --> 00:20:05,359 is the noise sample will be too distinct 514 00:20:04,319 --> 00:20:05,918 for each of them right 515 00:20:05,359 --> 00:20:08,558 >> correct 516 00:20:05,919 --> 00:20:10,720 >> so but when we are picking up image to 517 00:20:08,558 --> 00:20:12,879 generate a diffusion model and we work 518 00:20:10,720 --> 00:20:14,798 backwards we may not have the exact 519 00:20:12,880 --> 00:20:15,679 thing available to us what was there 520 00:20:14,798 --> 00:20:17,200 initially 521 00:20:15,679 --> 00:20:18,960 >> no no the thing is we don't want to 522 00:20:17,200 --> 00:20:21,120 necessarily regenerate images that were 523 00:20:18,960 --> 00:20:22,558 in the training data right that's kind 524 00:20:21,119 --> 00:20:24,159 of pointless we want to geneneral new 525 00:20:22,558 --> 00:20:26,720 images 526 00:20:24,160 --> 00:20:29,519 and for new images we just use start use 527 00:20:26,720 --> 00:20:31,200 noise as a starting point 528 00:20:29,519 --> 00:20:32,879 you know the fact that Killian code was 529 00:20:31,200 --> 00:20:35,279 here and then the fully noised version 530 00:20:32,880 --> 00:20:36,159 of Kian code is here that is used for 531 00:20:35,279 --> 00:20:37,918 training and once you use it for 532 00:20:36,159 --> 00:20:39,039 training you don't need it anymore 533 00:20:37,919 --> 00:20:41,120 because you're not trying to recreate 534 00:20:39,038 --> 00:20:43,440 Killian code again you want to create 535 00:20:41,119 --> 00:20:45,359 new images which belong to the category 536 00:20:43,440 --> 00:20:48,000 of stately college buildings and for 537 00:20:45,359 --> 00:20:49,199 that all you you just grab noise send it 538 00:20:48,000 --> 00:20:51,919 in it gives you a stately college 539 00:20:49,200 --> 00:20:51,919 building end of 540 00:20:53,759 --> 00:20:57,839 And because noise by definition is 541 00:20:55,519 --> 00:20:59,200 different each time you pick it, it's 542 00:20:57,839 --> 00:21:01,839 going to come up with a different 543 00:20:59,200 --> 00:21:06,679 stately college building. 
So the way I think about it is this. Think of this as the noise distribution. Each time you sample, you pick a little point from it; another time you sample, maybe you get a point over there. It's just a nice, simple distribution. What these models are actually doing is mapping it to the distribution of stately college buildings, which might be some strange, crazy distribution. So each time you sample, you start from a point here and you land at a point there, and when you start from a different point here, you land at a different point there. What you have done is this: when you took the training data, you created points in the image distribution, found the matching noise for each, and then flipped the pairs for training, as we have seen. And once you're done, you have a mechanism for transforming any entry in this distribution into an entry in that distribution. So it's a way to transform one distribution into another distribution. That's what's going on.

All right. So there was a question? Yeah, and then we'll go.

>> I understand the going from the image to noise, and how the training works. My question is: in some of these models today, when you give it noise to generate an image, it could, for example, generate a human with four fingers, or stuff like that. So is it that the training data is not quite enough, or not robust enough, to generate that kind of detail? [cough] Can you talk through what's going on?
>> Yeah. So, fundamentally, it actually does not understand the notion of fingers and things like that, right?
We haven't injected any domain knowledge into this whole process. We haven't said, "Hey, you need to generate a human body, and here are the semantics of what a human body is: it's got five fingers and all the anatomical stuff." We're not giving it anything like that; we're literally giving it pixel values, a bunch of pictures. So everything you're seeing is coming out of that very blind statistical transformation process. You would expect it to get macro-level details right, because there are so many right answers. Imagine it's creating the roof of a house: there could be all kinds of variations in the roof, and you would still think it's the roof of a house, because there are many possible right answers. But when it comes to five fingers, there are not many possible right answers, which is why you notice the error very quickly. As far as the model is concerned, it doesn't know; it's just producing a statistically plausible sample from that distribution. And since we haven't forced it to obey constraints like five fingers, it's not going to do any of that. It's an unconstrained process.

Now, over time these things have gotten better and better, and that's partly because the data has gotten better, to your point. But our approaches are also getting better: there are lots of ways now to steer and control the model so it behaves the right way, and that is part of what's happening as well. When we talk about how you give a text prompt and have it build the image for that particular prompt, we'll revisit this question. Okay, there were more questions? Yeah.
>> Is there some randomness in the model itself? If you gave it the same noise image twice, would it actually produce the same final image, or would it...
>> Yeah, there is randomness in the process as well.
>> In the process. Exactly.
>> Actually, that's a really good point, but now I'm afraid to open my laptop; I'm on the iPad. One second. All right. So what's going on here is this. I talked about how we transform from here to some crazy distribution over there. Let's say this is the starting point, your noise input. At each step, what you actually do is go here, take this point, and then draw a small sample next to it: you use the predicted point as the mean value and sample around it, and that sample is what actually gets shown in the user interface. That's where the randomness comes in.
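In code, that "use the prediction as the mean and sample around it" step might look like the following sketch. The fixed sigma is an illustrative assumption; real samplers derive a per-step variance from the noise schedule:

```python
import numpy as np

def stochastic_denoise_step(denoiser, x, sigma=0.05):
    # Treat the denoiser's output as the mean of a distribution...
    mean = denoiser.predict(x[np.newaxis, ...], verbose=0)[0]
    # ...and sample around that mean, so even the same starting noise
    # can lead to a different final image.
    return np.clip(mean + np.random.normal(0, sigma, size=mean.shape), 0, 1)
```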
And in fact, in practice, 696 00:26:44,240 --> 00:26:48,720 you will see it's not an issue. 697 00:26:47,119 --> 00:26:52,079 Um, 698 00:26:48,720 --> 00:26:53,440 okay. So, oh yeah, go ahead. 699 00:26:52,079 --> 00:26:58,158 >> There's a quick question. when you're 700 00:26:53,440 --> 00:27:01,120 doing uh like text to text, let's say 701 00:26:58,159 --> 00:27:03,120 you're uh tokenizing the input, but here 702 00:27:01,119 --> 00:27:06,558 you somehow have to identify that this 703 00:27:03,119 --> 00:27:09,119 is Killian Cord and like a stately home 704 00:27:06,558 --> 00:27:13,119 and this is just going from pixel image 705 00:27:09,119 --> 00:27:16,319 to or like decoding a pixel image. Um 706 00:27:13,119 --> 00:27:20,399 where does the the tag or tokenization 707 00:27:16,319 --> 00:27:21,839 of like columns or fingernails or like 708 00:27:20,400 --> 00:27:23,200 >> does nothing. It's learning everything 709 00:27:21,839 --> 00:27:23,918 from the pixel values. 710 00:27:23,200 --> 00:27:25,600 >> Everything. 711 00:27:23,919 --> 00:27:27,120 >> Yeah. And this is sort of what I was, 712 00:27:25,599 --> 00:27:28,480 you know, when I when Ike asked the 713 00:27:27,119 --> 00:27:30,798 question about the four fingers, five 714 00:27:28,480 --> 00:27:33,038 fingers thing, it has no idea of 715 00:27:30,798 --> 00:27:34,319 fingers. It has zero knowledge about any 716 00:27:33,038 --> 00:27:36,319 of these things. All it's seeing is a 717 00:27:34,319 --> 00:27:38,558 bunch of photographs. 718 00:27:36,319 --> 00:27:40,639 >> Okay. So when you when you type in say I 719 00:27:38,558 --> 00:27:42,960 want a hand with green. 720 00:27:40,640 --> 00:27:44,880 >> Oh, I see. So we haven't yet come to the 721 00:27:42,960 --> 00:27:47,120 stage of okay, how do you actually steer 722 00:27:44,880 --> 00:27:48,080 this image using your text prompt? It's 723 00:27:47,119 --> 00:27:49,519 coming 724 00:27:48,079 --> 00:27:51,278 >> right now. All we're saying is that 725 00:27:49,519 --> 00:27:52,960 look, I'm going to give you a bunch of 726 00:27:51,278 --> 00:27:55,119 uh photographs of a particular kind of 727 00:27:52,960 --> 00:27:56,480 thing, stately college buildings and I 728 00:27:55,119 --> 00:27:58,239 want to have a model which at the end of 729 00:27:56,480 --> 00:27:59,360 the day I just poke it. Every time I 730 00:27:58,240 --> 00:28:01,278 poke it, it gives me a stately college 731 00:27:59,359 --> 00:28:02,879 building. That's it. Now I'm going to 732 00:28:01,278 --> 00:28:04,558 actually start giving it text and saying 733 00:28:02,880 --> 00:28:06,320 okay build the you know create the thing 734 00:28:04,558 --> 00:28:08,398 I'm just telling you about that's coming 735 00:28:06,319 --> 00:28:12,000 and that's sort of some additional magic 736 00:28:08,398 --> 00:28:14,558 is going on to get that done. U okay so 737 00:28:12,000 --> 00:28:16,720 this is what we have u and this is 738 00:28:14,558 --> 00:28:18,158 called a diffusion model. Okay. And this 739 00:28:16,720 --> 00:28:21,519 is the original paper that figured this 740 00:28:18,159 --> 00:28:24,799 out. Um, and 741 00:28:21,519 --> 00:28:26,639 the the process of actually creating 742 00:28:24,798 --> 00:28:28,639 taking an image and creating noisy 743 00:28:26,640 --> 00:28:30,880 versions of it to create a training data 744 00:28:28,640 --> 00:28:32,480 is called the forward process. 
and what we did in reverse is called the reverse process. Check out the paper; it's actually really well written, and I recommend it.

Now, in practice, some other researchers came along shortly after this and made a small improvement, which turns out to be a big improvement in practice in terms of the quality of what gets produced. What they said is: hey, instead of training the model to predict the less noisy version of the image, let's ask it to predict just the noise in the input, and then we will simply subtract that noise from the input to get the image. So instead of saying "here is x, a noisy image, and y, the slightly less noisy image," we tell it "here is a noisy image, and here is the noise we added to produce it; predict that noise for me." Once we have the prediction, we just compute x minus the predicted noise, and we get the less noisy version of the image. This feels arithmetically equivalent, but in practice it ends up generating much higher quality images, and there's some very interesting theory as to why; you can read this paper if you're interested. So if you look at what's going on in most diffusion models today, they're basically using an approach like this: at each step they predict the noise and subtract it away. Iterative subtraction of predicted noise. That's what's going on.
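To make the difference concrete, here is a sketch of the noise-prediction setup, again with an illustrative stand-in network (the actual papers use a U-Net and a carefully chosen noise schedule):

```python
import numpy as np
import tensorflow as tf

# A network that takes a noisy image and predicts the noise inside it.
eps_model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(3, 3, padding="same"),  # predicted noise (unbounded)
])
eps_model.compile(optimizer="adam", loss="mse")

def make_noise_pairs(images, sigma=0.3):
    """x: noisy images; y: the exact noise that was added (the new target)."""
    noise = np.random.normal(0, sigma, size=images.shape)
    return (images + noise).astype(np.float32), noise.astype(np.float32)

# Training:         eps_model.fit(*make_noise_pairs(images))
# One denoise step: x_less_noisy = x - eps_model.predict(x[np.newaxis, ...])[0]
```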
All right, so that's what we have. Now, at this point you may be wondering: so far in the semester we have learned how to take an image and classify it into one of 20 things, 10 things, whatever, and we have taken text and figured out how to do things with it. But we haven't yet talked about how you take an image as the input and get the output to also be an image. We haven't done image-to-image. How do you build a neural network for that? In the interest of time, we're not going to get into it massively, but I want to give you a quick idea of how it works.

The dominant architecture for taking an image as input and producing an image as output is called the U-Net, and that's the architecture we see here. Fundamentally, there's a left half to the network and a right half, hence the U. The left half is a good old convolutional neural network, the kind we know and love and are very familiar with. You take an input image, run it through a bunch of convolutional blocks, do some max pooling, and keep going, and the representation becomes smaller and smaller: the big image with three channels gets smaller and smaller spatially, but the number of channels gets wider and wider. It becomes much smaller but much deeper, like a 3D volume, and we have seen that again and again. Then you come to the middle, and from that point on we essentially reverse the process: we go from the small things that are really deep to slightly bigger things that are a little less deep, and so on, until we get the original size back again.
And we do that using an inverse of the convolution layer, called an up-convolution or deconvolution layer. You can check out Section 9.2 in the textbook to understand how it's done; it's also called Conv2DTranspose. It's a very similar idea, and I'm not going to get into the details here, but you essentially do an inverse of a convolutional operation to bring the size back up, and you do it gradually until the output matches the size of the input that came in. So the image gets smaller and smaller into a compact thing, and then you just blow it back up again to get an image back. That is the U-Net.
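As a quick illustration of the shape bookkeeping (nothing specific to the paper, just a toy example), here is a strided convolution shrinking a feature map and a ConvTranspose2d growing it back:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 16, 16)   # a feature map: 64 channels, 16x16

down = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)
up   = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)

h = down(x)   # (1, 128, 8, 8): smaller but deeper
y = up(h)     # (1, 64, 16, 16): back to the original spatial size
print(h.shape, y.shape)
```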
Now, there's one very important thing that happens in the U-Net, which is these connections you see here. At every step, as you come back up on the right half, you take whatever was at the mirror-image stage of the left half and attach it on the right side as well. Remember the notion of a residual connection from many classes ago? When an input goes through each layer of a neural network, say you're at the tenth layer, you only see what the ninth layer produced for you; that's all you're working with. But wouldn't it be nice if the tenth layer also had access to the eighth layer, the seventh, the sixth, the fifth, heck, why not the input itself? The more information it has, the better it can probably do with the input it's given. Why restrict it to only the output of the previous layer? Why can't we give it everything that came before it? Now, giving it everything is too much, but we can be selective in what we give it. So what these folks decided, I'm sure after much experimentation, is that if they attach whatever comes out of a layer on the left to the corresponding layer on the right, before it goes through toward the output, it really helps. Similarly, this one gets attached, and so on. And it kind of makes sense: why force the right half to figure out everything from just the one thing coming up through the middle? Let's also give it a little from here and a little from there. These residual connections are a huge building block for why these things work as well as they do. In general, giving a layer as much information as you can is a good idea, but you can't go nuts, because then you have many more parameters and all kinds of stuff happens. So there's a balance to strike, and this was the balance these researchers struck. This thing was originally invented for medical image segmentation use cases, but it's heavily used for everything now. It's a really powerful architecture. Questions?

>> Can we have an example of the scenarios where we'd use this kind of model?

>> Anytime you have image to image.

>> Like what kinds of image-to-image use cases?

>> Let's say you want to take a black-and-white image and colorize it: boom, U-Net. You want to take an image and make a higher-resolution version of it: U-Net. You want to take an image and classify every pixel into one of ten categories: U-Net. Anytime you want the shape of the output to be basically the same shape as the input, but with different data in it, you need something like this; a minimal sketch follows below. Yeah.
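Here is that sketch: a deliberately tiny U-Net-style network in PyTorch. Real U-Nets stack many more levels, but the down path, the up path, and the concatenated skip connection are all visible. Everything here is illustrative.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """A one-level U-Net-style sketch: down once, up once, with a skip."""
    def __init__(self, ch=3):
        super().__init__()
        self.enc  = nn.Sequential(nn.Conv2d(ch, 32, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.mid  = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.up   = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec  = nn.Conv2d(64, ch, 3, padding=1)  # 32 upsampled + 32 skip channels in

    def forward(self, x):
        e = self.enc(x)              # left half: full-resolution features
        m = self.mid(self.down(e))   # bottom of the U: smaller, deeper
        u = self.up(m)               # right half: blow it back up
        u = torch.cat([u, e], 1)     # the skip connection from the mirror stage
        return self.dec(u)           # output has the input's spatial shape

out = TinyUNet()(torch.randn(1, 3, 32, 32))
print(out.shape)   # torch.Size([1, 3, 32, 32])
```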
>> But this logic of having access to all the previous iterations...

>> Not iterations, all the previous layers.

>> Right, the outputs of the previous layers. But would this also help clean things up and give better categorization? Does it always have to be image to image?

>> No, no. In fact, look at ResNet; ResNet is the one that pioneered the idea of the residual connection, so we use it there. We also use it in the transformer stack: if you remember, the input goes through the self-attention layer, comes out the other end, and then we add the input back to it and send it through a layer norm. So this residual connection sits in two different places inside a single transformer block; it's extremely heavily used. There's also something called the wide-and-deep network, if I remember right, and DenseNet, which use the same trick. In fact, when you're working with structured data, good old linear regression say, and you've looked at your data and come up with all kinds of clever features, price per square foot and so on, you do a bunch of feature engineering and you have a bunch of new features. Well, you should take your old features and your new features and send both in. Why send in only the new stuff you've concocted? Why not send everything in? That's the idea. A quick sketch of those two residual additions in a transformer block follows.
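This is roughly what the two residual additions look like in code. It's a generic pre-norm transformer block for illustration, not any particular model's exact implementation.

```python
import torch.nn as nn

class Block(nn.Module):
    """A generic transformer block with its two residual connections."""
    def __init__(self, d=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mlp  = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual #1: add the input back
        x = x + self.mlp(self.ln2(x))                      # residual #2: add it back again
        return x
```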
All right, so let's come back here. We have now seen how to generate a good image; now let's figure out how to steer it, or condition it, with a text prompt, because that's sort of the holy grail. So here's some intuition. We want to take the text prompt into account and generate an image. Now imagine we had a rough image that corresponds to the text prompt. Just imagine. Say the text prompt is "cute labrador retriever" and you happen to have a very noisy image of a labrador retriever handy. Well, now you're in good shape, because you just feed that in, your system denoises it for you, and you get a better image. That's pretty easy. But obviously in reality you don't have a rough image; you're trying to create one of those things in the first place. So what if we had an embedding for the prompt that's close to the embeddings of all the images that correspond to the prompt? Take a prompt, and imagine all the images in the universe that correspond to it. Now further imagine, because everything is a vector, everything is an embedding in our world, that the text prompt has an embedding, every image has an embedding, and we have somehow calculated these embeddings so that the text prompt's embedding sits smack in the middle of where all those image embeddings are. We'll get to how we actually do that in just a moment, but conceptually, imagine you could calculate embeddings for text and embeddings for images so that they all live in the same space. If we feed this embedding to a denoising model, then because the text embedding sits in the same space as the embeddings of all the images it corresponds to, maybe our model can just denoise that embedding and give you what you want. Since this embedding is already close to the embeddings of the things we want to generate, maybe it will just get it done. So ultimately we want to generate an image, and if we had an embedding for that image, we could generate the image from the embedding, and we get to that embedding using the text.
So we go from text to an embedding, which happens to live in the same space as the embeddings of all the images we care about, and then from that image embedding we go to the final image. This is a bunch of me talking and hand-waving; it will all become very clear, but that's the rough intuition. So what we'll do is describe an approach for calculating an embedding for any piece of text that is close to the embeddings of the images corresponding to that piece of text. This is the problem we're going to solve: there's a piece of text, conceptually there's a whole bunch of images that the text describes, and we're going to create embeddings so that the text's embedding is close to the embeddings of all those images. It feels almost impossible that you could do something like this, but there's a very clever idea that OpenAI came up with that tells you how. So here's what we're going to do. Let's say we have an image and a caption. We need some way to take that piece of text, run it through some network, and create a nice embedding from it. Similarly, we want to take the image, run it through some network, and create an embedding from it. Now, first question: how can we compute an embedding from a piece of text? You know the answer: run it through a transformer. Piece of cake; we know how to do that. In particular, you can do something like BERT. And for an image encoder, you run the image through something like ResNet and take the penultimate layer; one of those final layers is going to be a very good representation of the image, and you get another embedding.
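As a sketch of what that looks like with off-the-shelf pieces (the specific checkpoints here are just examples, not the ones CLIP used):

```python
import torch
import torch.nn as nn
from torchvision import models
from transformers import AutoTokenizer, AutoModel

# Image encoder: ResNet-50 with the classification head chopped off,
# so the penultimate 2048-dim activation becomes the embedding.
resnet = models.resnet50(weights="IMAGENET1K_V2")
resnet.fc = nn.Identity()
resnet.eval()

# Text encoder: BERT, mean-pooling the token states into one vector.
tok  = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

with torch.no_grad():
    image   = torch.randn(1, 3, 224, 224)    # stand-in for a real image
    img_emb = resnet(image)                  # shape (1, 2048)

    enc     = tok("a cute labrador retriever", return_tensors="pt")
    txt_emb = bert(**enc).last_hidden_state.mean(dim=1)   # shape (1, 768)

# Nothing ties these two spaces together yet; that is exactly the problem.
```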
So, using building blocks we already know, we can create embeddings from these things very quickly. But if you just take a piece of text and run it through a BERT, and take an image and run it through a ResNet, you will get some embeddings, but why on earth should they be related? They were not trained together, so there's no basis for them to be related. They would just be two embeddings. Maybe they're kind of similar, maybe they're not; we don't know, and there's no reason to expect they will be. They're just two embeddings. So once we have these two encoders, we need to make sure the embeddings that come out of them satisfy two very important requirements. First, if you give them an image and a caption that describes that image, we want the two embeddings that come out to be as close to each other as possible. Given an image and a caption that describes it, that's the connection: they have to be close. And conversely, if you have an image and a caption that's totally irrelevant to it, say "a train rounding a bend with beautiful fall foliage all around", clearly irrelevant here, those embeddings should be far apart. That's what it takes for this to really make sense: pairs of related things should be close together, and irrelevant things should be far apart. If we can find embeddings that satisfy these two criteria, maybe we're in the game. This ensures that the text embedding and the image embedding refer to the same underlying concept; these two requirements enforce that.
And so the embedding for any text prompt will be close to the embeddings of all the images that correspond to that prompt. So the question is, how do we do this? First of all, how can we tell how close two embeddings are? You know the answer to this. What is it?

>> Cosine similarity.

>> Correct, cosine similarity. We use the cosine similarity of the embeddings, so we know how to measure closeness.
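In code that's one call: cos(u, v) = u.v / (|u| |v|), which is 1 for vectors pointing the same way, 0 for orthogonal ones, and -1 for opposite ones.

```python
import torch
import torch.nn.functional as F

u = torch.randn(512)
v = torch.randn(512)
sim = F.cosine_similarity(u, v, dim=0)   # scalar in [-1, 1]
print(sim.item())
```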
So the question is how we can compute embeddings that satisfy the two requirements, and OpenAI built a very famous model called CLIP to solve this problem. It stands for Contrastive Language-Image Pre-training, and it forms the basis for a whole bunch of models that have sprung up since, called BLIP and BLIP-2 and so on, but this is the fundamental idea. Here's how CLIP works. They took a 12-layer, 8-head transformer causal encoder stack as the text encoder. You understand this now, right? That's what it is: an 8-head, 12-layer causal transformer stack. You send any piece of text through it, you get the next-word-prediction embedding, and that's the embedding you're going to use. And they took ResNet-50 and made it the image encoder: they chopped off the top, and whatever was left is the image encoder. Then they initialized these things with random weights, and they grabbed a batch of image-caption pairs. In this example, let's say we have these three images and captions to go with them. And this is the key step: they run the images through the image encoder and the captions through the text encoder and get the embeddings. It's a forward pass: you send things through the networks, you get the two sets of embeddings. Then, with these embeddings, they calculate the cosine similarity for every image-caption pair. So imagine something like this: you have these three captions, you have these three images, those are the embeddings, and they calculate the cosine similarity for every one of those combinations. It took me five or ten minutes to do this PowerPoint slide; you're welcome. Getting that comma to line up was a real pain in the neck. So, all right, we have this matrix. Now, we want the scores on the diagonal to be as high as possible, because the diagonal scores are the ones for a matching picture and caption. Those are the scores for the matching pairs of embeddings, and we want them as high as possible. So we want to maximize the sum of the green cells, the diagonal. If you write it as a loss function, and a loss function is always a minimization, we say: minimize the negative sum of the green cells. So the question is, would this loss function do the trick? It seems reasonable: you want related things really close together, so you maximize.

>> If that was the only part of the loss function, wouldn't it just squish everything to the same spot in the space?

>> Correct. What it's going to do is basically ignore the input. The optimizer can simply ignore the input and make all the embeddings identical, mapping everything to the same constant vector.
That's it, and then we have a perfect cosine similarity for everything: for any pair of image and caption, the cosine similarity is one. Perfect, right? So clearly that's not enough. This, by the way, is called model collapse. To prevent it from doing that, we need one more thing in the loss function. Any guesses?

>> Yeah. Make the images that aren't related not have a high cosine similarity.

>> Exactly right. We want the scores of the red cells to be as small as possible. We want the green stuff to be as large as possible and the red stuff to be as small as possible; together, that gets the job done. So we want to maximize the sum of the green cells and minimize the sum of the red cells, and the equivalent loss function is: minimize the sum of the red cells plus the negative sum of the green cells. That's it. So all CLIP does is grab a batch of image-caption pairs, run them through the two networks, calculate the embeddings, compute this quantity as the loss, and backpropagate through the networks. Boom. Batch, batch, batch; do it a whole bunch of times. This is the official picture from the paper, which is worth reading, by the way: text comes into the text encoder and you get these embedding vectors, images go into the image encoder, and then boom, the diagonal is maximized and the off-diagonals are minimized. And they did it with 400 million image-caption pairs scraped from the internet. Four hundred million.
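The lecture states the objective as "red cells down, green cells up"; the published CLIP paper implements that intuition as a symmetric cross-entropy over the similarity matrix, which applies the same pressure. A sketch:

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Contrastive loss sketch for N matched image-caption pairs:
    push the N diagonal similarities up, the off-diagonal ones down."""
    img = F.normalize(img_emb, dim=-1)          # unit vectors, so the dot product
    txt = F.normalize(txt_emb, dim=-1)          # equals cosine similarity
    logits = img @ txt.t() / temperature        # (N, N) similarity matrix
    targets = torch.arange(img.shape[0])        # the correct match is the diagonal
    # cross-entropy over rows (image -> caption) and columns (caption -> image)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```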
By the way, those of you who work in this space may know this really well, but there's one very easy way to get a caption for an image. We see the images, but where do you think the captions come from? They obviously didn't ask people to manually label each image with a caption. Where do you think they got them?

>> Google search.

>> Google search can help, but why does Google search actually find a caption? Google search isn't creating the captions.

>> They take it from the alt text on the images.

>> Correct, alt text. For accessibility reasons, a lot of people put alt text on the images they publish on the web, and that's what gets used. And alt text actually ends up being a more verbose description of the image than a typical caption, which tends to be much briefer. For us, more verbose is better, the longer the better, because there's more for the model to learn from. So that's how they built CLIP. And now we can use CLIP's text encoder by itself: we can send in any text and get an embedding that is close to the embeddings of the images described by that text. Now, by the way, CLIP can also be used for zero-shot image classification. What I mean by zero-shot image classification, and I'll walk through the picture in just a second, is this: typically, when you want to build an image classifier, you get a whole bunch of training data of images and their labels and you train. Maybe you take something like ResNet, chop off the top, attach your own output head, and train, train, train. Boom, you have a classifier.
But the only problem with that is, let's say today you had five classes in your problem, and tomorrow somebody comes along and says, actually, we have a sixth category. What do you do then? You have to go back to the drawing board and retrain the whole thing with six labels instead of five, because your problem has changed. Wouldn't it be great if you had a classifier where you just come to it and say: here's an image, and here are the six possible labels I want you to pick from; pick one for me. You could give it a different set of labels each time, and it would just use the labels you gave it, plus the image, and figure out which label corresponds to that image. That would be an insanely flexible image classification system, right? That's what I mean by zero-shot image classification, and you can use CLIP to do it. How you do it is actually in this picture, though not very clearly. Anyone want to try? How can you use CLIP to build an infinitely flexible image classifier?

>> The text input was a trained BERT, right? So in the same way BERT can handle words it's never seen before, does it essentially do that?

>> Sorry, say the second part again.

>> You're saying it sees a text input with something it's never seen before, right?

>> Yeah. In the BERT model, which is where the text encoding came from, I think we talked about how, when it sees a word it doesn't know, it can use the context words around it to try to...

>> Right, right. But here, just to be clear, I want you to use the CLIP we just built, and assume CLIP knows all the words, because it's been trained on a big vocabulary.
You can give it any text you want, and it will create an embedding from it. That's the key capability.

>> So it creates a text embedding for...

>> Yeah.

>> ...and then one for your image. So it compares similarity scores between the two. But the image is complete while the text is not complete; there will be missing pieces, and then it makes some prediction using this?

>> Why is there a missing piece in the text?

>> Because the text does not contain the class. Whereas the image, the way it was trained, was trained in pairs, with the class included.

>> Right, but we actually know the classes now, because the use case is that I come to you with an image and say: here are the seven possible labels for this image, and each label is a piece of text. So you actually have seven pieces of text and one image, and all I want CLIP to do is tell me, okay, the fourth label is the right one for this image. But you're on the right track. Once you see how it's done, you'll go, yeah, of course.

>> I might not be understanding something, but wouldn't you just pick the text embedding that's closest to the image embedding?

>> Correct. You're not missing anything; that's the right answer. Well done. Come on, people, can you applaud our fellow here? [applause] You folks are hard to impress. That's exactly what we do. The key thing to keep in your head is that a label is just text. Dog, cat: it's just text.
So you can imagine taking each label, which in this case is plane, car, dog, whatever, and creating an embedding for each one: you get t1 through tn if you have n labels. For the image you have one embedding, i. Then you calculate the cosine similarity between the image embedding and each label embedding, and whichever gives the highest number, you say: okay, it's a dog. That's it. Just imagine the level of flexibility here. So that's a side use of CLIP, unrelated to diffusion models, but I thought it was really clever, so I wanted to share it; a quick sketch with a public CLIP checkpoint follows.
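Here is that sketch, using one public CLIP checkpoint through the Hugging Face transformers API; the image path and label set are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a plane", "a photo of a car", "a photo of a dog"]
image = Image.open("some_image.jpg")   # any image you have handy

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    sims = model(**inputs).logits_per_image   # scaled cosine similarities, shape (1, 3)
print(labels[sims.argmax().item()])           # the closest label wins
```

Change the label list and you have a different classifier, with no retraining.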
Okay, good. Now let's see how we can use this capability to solve the original problem we set out to solve: can we steer the diffusion model to create an image based on a particular prompt we give it? Remember how we did it before: we created all these training pairs of X and Y based on noising, where X is the noisy image and Y is the less noisy version. What we can simply do is change the input so that it becomes the image plus the CLIP text embedding of the caption for that image. You have an image and a caption; you run the caption through CLIP and get an embedding. By definition, that embedding lives in the same space as all the images corresponding to that caption. So you concatenate the CLIP embedding of the caption with the image and make that the new input. Y continues to be the less noisy version of the image, or, as we saw earlier, just the noise component. That's the new X-Y pair, and you keep training on pairs like that for a while. Once your model is trained and you want to use it for inference on a new prompt, you just give it, say, "Killian Court at MIT during the springtime" along with a bunch of noise, and it starts denoising. Because the embedding of that prompt, thanks to CLIP, lives in the same space as the embeddings of all the Killian Court images, if you keep going, at some point you'll get Killian Court. That's how they do it; that's how they steer the image. It's a two-step process: you create all these CLIP embeddings, and CLIP was a breakthrough in my opinion, because it was one of the early examples, maybe the first, of saying: we have different kinds of data, images, captions, text; how do we create embeddings for these very different data types so they all live in the same concept space? That was the key idea. And if you look at modern multimodal large language models, they are all based on the same exact idea. So it's a very powerful approach. A rough sketch of one such conditioned training step follows.
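In this sketch, the function names, the conditioning interface, and the schedule are all illustrative assumptions; real systems differ in how the text embedding is fed in.

```python
import torch
import torch.nn.functional as F

def conditioned_step(model, x0, captions, clip_text_encoder, alpha_bar):
    """Noise-prediction step where a frozen CLIP text embedding rides along.

    model is assumed to accept (noisy image, timestep, condition vector);
    clip_text_encoder is assumed to map a list of captions to (B, D) embeddings.
    """
    B = x0.shape[0]
    with torch.no_grad():
        cond = clip_text_encoder(captions)        # (B, D) caption embeddings
    t = torch.randint(0, alpha_bar.shape[0], (B,))
    noise = torch.randn_like(x0)
    a = alpha_bar[t].view(B, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise  # noised input, as before
    pred = model(x_t, t, cond)                    # the denoiser sees the text too
    return F.mse_loss(pred, noise)
```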
>> Now, I understand this for images, but for video generation models like Sora, do they have some sort of underlying physics structure, or do they learn the physical representations?

>> There's a lot of debate on the internet about this. They haven't published the full technical report yet, so we don't know for sure, but the consensus seems to be no, they are not using a physics engine. What they have done, and again this may be wrong, and once the report comes out we'll know for sure, but what people are saying, computer vision experts, is that it has been trained on a lot of video game data along with actual videos, and the training corpus is so massive that it has basically learned to mimic certain aspects of physics as a side effect. Much like LLMs: you train them on a large amount of text data and they begin to do things you didn't anticipate. I read a really great example of this: what's surprising about large language models is not that you train them on a bunch of high school math problems and then they can solve a new high school math problem. That's not surprising. What's surprising is that you give them a bunch of high school math problems in English, then you have them read a bunch of French literature, and then you give them a French high school math problem and they solve it. That is the real news. Similarly here, the expectation is that Sora is not actually using a physics engine under the hood. A physics engine may have been used to produce some of the training videos and renderings, but there are no physics constraints in the model itself; it just comes out of the training process. That's the current view. Once the technical report comes out, we'll know what they actually did.
>> A quick question about Stability: they're claiming to be a bit more real-time in their image generation.

>> You mean Stable Diffusion?

>> Yeah, Stable Diffusion. So are they jumping through the noise more quickly, or are they pre-prompting it, some kind of trick?

>> Very good question, and there is a key trick. It's coming.

>> So here the noise comes from a normal distribution. If we changed the noise distribution, would it change the result?

>> Oh, you mean if you changed it to a Poisson or some other distribution? It would definitely change the results, because if you look at the underlying math of why this works, it depends heavily on the Gaussian assumption.

>> There was another question somewhere here.

>> You may not know the answer because the technical report isn't out, but could video generation be sort of analogous to going from one noisy image to another? Like you're almost doing a series of still images and learning how to...

>> No, I think that is how people are fairly sure it's done. Think of the video as just a series of frames. Each frame is an image, and there's a sequentiality to it, which is where the transformer stack comes in, because it handles sequentiality. So in general, video models typically operate frame by frame, and a frame is just an image. That part is definitely there.
What we don't know is whether they also built in an understanding of the fact that, for example, a dropped object has to fall to the earth at a certain rate, or that once an object goes behind another object you can't see it anymore; things we take for granted. The question is whether they are using constraints like that, and the consensus, in the absence of an actual technical report, seems to be that they are not, because there are lots of examples on Twitter where people show a Sora video that is not obeying the laws of physics. You take a beach chair and put it in the sand, and you see the sand come through the base of the chair. Or you put an object behind an opaque object and you can still see it. So you do see evidence that, no, it's not obeying the laws of physics; what you're seeing is just an amazing mimic, like hands generated with extra fingers because nothing told the model there have to be only five. Okay, all right, let's keep going. So there was another paper that came afterwards, and this is the original paper, which took that idea of the diffusion model and addressed the fact that diffusion is very slow, as Olivia pointed out. The question is: can we make it much faster? I'm not going to go through the whole thing here; I just want to highlight a couple of points. The first is that you see a U-Net here: they are using a U-Net to go from image to image. The second is that the CLIP embedding of the text prompt is woven in, meaning it's incorporated into the U-Net through an attention mechanism, a transformer mechanism, and you can see the Q, K, V business here, which should be familiar at this point.
So the CLIP embedding is integrated directly into the transformer stack; that's the second thing I want to point out. And thirdly, and this is where the speed-up comes from: instead of taking the image, running it through the whole network, and creating a slightly less noisy version of the image, you take the image, run it through an image encoder, and get an embedding, and from then on you work only with the embedding. You take the embedding, create a slightly less noisy version of the embedding, and keep doing that. These embeddings are much smaller than images, so they are much faster to process. Once you've done it, say, a thousand times, you have an almost pure, noiseless version of the embedding, and then you run it through an image decoder to get the final image. So the idea is that you operate in the latent space, meaning the embedding space, and hence it's called a latent diffusion model. That's where the speed-up comes from. But research on making this even faster continues to be very active, because for a lot of consumer applications people are obviously not going to wait around; who wants to wait ten seconds? So there's a lot of pressure to make it faster still. A rough sketch of the latent-space loop follows.
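Everything in this sketch is illustrative: the latent shape, the denoiser interface, and especially the update rule, which real samplers replace with a properly derived step.

```python
import torch

@torch.no_grad()
def generate(denoiser, decoder, text_emb, steps=50, latent_shape=(1, 4, 64, 64)):
    """Latent-space sampling sketch: denoise small latents, decode once."""
    z = torch.randn(latent_shape)              # pure noise, but in latent space
    for t in reversed(range(steps)):
        pred_noise = denoiser(z, t, text_emb)  # cheap: latents are small
        z = z - pred_noise / steps             # crude stand-in for a real sampler step
    return decoder(z)                          # one expensive decode to pixel space
```

All the repeated work happens on the small latent; the expensive trip back to pixels happens exactly once.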
All right, so that's where things stand. These models are transforming everything. By the way, this site lexica.art — you can go check it out — has a whole bunch of very interesting images along with the prompts that created them, so if you're working in this space, it gives you a lot of interesting ideas. But it's not just consumer fun applications. These models are being used for real science. Recall AlphaFold: if you give it an amino acid sequence, it can create the 3D structure. I don't think they use a diffusion model there, but you can imagine using a diffusion model to create these complicated objects. Meaning, the objects you create don't have to be images — they can be arbitrarily complicated things. As long as you have enough data about such things for training, and the notion of noising the input is meaningful, you can create some very interesting structures: 3D objects, protein structures — there's a whole bunch of very interesting applications in the biomedical sciences. So this is really just the tip of the iceberg. And there are now ways to use diffusion models to do large language modeling as well, so there's a lot of overlap and blending going on in the space.

So now I'm going to do a quick demo. If you look at Hugging Face, there is something called the diffusers library, which, as the name suggests, is a library for a lot of diffusion models. Let's take a quick look. The diffusers library has a whole bunch of diffusion models; we're going to work with Stable Diffusion, which is one of the better-known models. So let's install diffusers. You'll recall, from the quick lightning tour of the Hugging Face ecosystem for language, that Hugging Face has a whole bunch of capabilities built out of the box, and you use this thing called the pipeline function to very quickly use any model you want. The same exact philosophy applies here — you still use the pipeline. So I'm going to import a bunch of stuff.
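Concretely, the setup is just an install and a couple of imports — a minimal sketch of what's being run here:

```python
# One-time setup; in a notebook, prefix with "!".
#   pip install diffusers transformers accelerate

import torch
from diffusers import StableDiffusionPipeline  # the pipeline we'll use below
```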
All right — oh, I see, I have to do this thing first. Okay. Great.

So, you'll remember that when we worked with text, we would grab a pre-trained model, run it through a pipeline, and then do all the inference we wanted on it. The same exact philosophy applies here — this is very similar to what we did in Lecture 8 for NLP. We use this command, StableDiffusionPipeline.from_pretrained, with the version 1.4 Stable Diffusion model. So let's create the pipeline. Now, we have used TensorFlow rather than PyTorch in this class, but a lot of these models happen to be in PyTorch, so knowing a little PyTorch is actually very helpful for working with these things. And while it's downloading: we're using the fp16 storage format for the model weights, because 16-bit weights are smaller than 32-bit ones, so the download is faster.

All right, it has downloaded fine. Now we just give it a prompt — and this is actually one of the original famous meme prompts: "a photograph of an astronaut riding a horse." Once the pipeline is set up, I set a seed for reproducibility, and then I literally call pipe(prompt). You can see the 50 here — it's going through 50 denoising steps — and out comes an astronaut riding a horse. You can change the seed and get a different image: the seed basically sets the random starting point for the image, so you'd expect a different astronaut. Yep — this is an astronaut riding another horse.
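In code, that whole demo is a few lines — a sketch assuming the public CompVis/stable-diffusion-v1-4 checkpoint and a CUDA GPU; the seed value here is arbitrary, not the one used in lecture:

```python
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,          # fp16 weights: half the download, half the memory
)
pipe = pipe.to("cuda")                  # sampling on CPU is painfully slow

generator = torch.Generator("cuda").manual_seed(1024)   # fixes the random starting noise
result = pipe(
    "a photograph of an astronaut riding a horse",
    num_inference_steps=50,             # the 50 denoising steps in the progress bar
    generator=generator,
)
result.images[0].save("astronaut.png")  # change the seed to get a different astronaut
```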
I think people came up with these kinds of fun examples because they're guaranteed not to be in the training data, right? So whatever the model is doing — remember — it's not regurgitating what it has already seen. All right, give me a prompt. Prompts, anyone?

>> [An audience member suggests a prompt — apparently MIT professors.]

All right — riding a horse, then... There are two of them, and clearly MIT professors don't really... [laughter] Yeah, moving on.

So, by the way, you should spend some time with the diffusers library. They have a bunch of tutorials which are really interesting, because this core capability — give a prompt, get an image out — can be manipulated for all sorts of very interesting use cases. For example, there is this thing called negative prompting. The idea is that you give the model two prompts and say: create an image that embodies the first prompt but not the second — essentially, subtract the second prompt from the first. You might be wondering what use that is. There are lots of fun uses. Here, the first prompt is going to be "a labrador in the style of Vermeer," with 50 steps. Look at that — amazing, right? But maybe you don't care for the blue scarf. So you give it a negative prompt, and the negative prompt is simply "blue," meaning: remove everything that's blue, but otherwise keep the Labrador thing going. So you run it... look at that — the blue is gone. Negative prompting.
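The negative-prompting version is the same call with one extra argument — a sketch continuing with the `pipe` from above (again, the seed value is made up):

```python
generator = torch.Generator("cuda").manual_seed(7)
image = pipe(
    "a labrador in the style of Vermeer",   # what we want
    negative_prompt="blue",                 # what we want subtracted out
    num_inference_steps=50,
    generator=generator,
).images[0]
```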
>> If you change that from 50 to a thousand, will it become less pixelated, or will it eventually just keep going and iterating?

>> No — typically, if you do more steps, it gets better. The quality is much better, because each step denoises only very slightly, so errors don't accumulate, things like that. And the diffusers library gives you lots of controls for fiddling around with all of these settings. Okay, so that's what we had — it's 9:49. Check out this tutorial if you're curious about how this stuff works.

Now I'm going to do one other thing, because I didn't get to do it earlier on. We spent some time with the Hugging Face hub, and I walked you through a few use cases for text, where you take a text model and use it for classification, summarization, and so on and so forth. You can do the same thing for computer vision models: if you have a computer vision problem that maps to a standard computer vision task, you can just use the Hugging Face hub as well. So let me show you very quickly that the same kind of thing works here.

Let's say you want to classify something. You just import the pipeline as before, and once you've imported it, you can literally give it the standard task you care about, like image classification, and start using it right from that point on. Now I'm going to grab this image — a very famous image — and we're going to ask the model to classify it. We just run it through the pipeline, and it says the most likely label, at 94% probability, is "Egyptian cat." Seems reasonable. I mean, it's a tough picture, right? There are lots of things going on in it — it's not one image, one object.
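A minimal sketch of the classification demo; I'm assuming the famous image is the COCO cats-on-a-couch picture that the Hugging Face documentation uses everywhere:

```python
from transformers import pipeline
from PIL import Image
import requests

# The well-known two-cats-with-remotes COCO image.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

classifier = pipeline("image-classification")   # downloads a default ViT-style model
print(classifier(image)[0])                     # e.g. {'label': 'Egyptian cat', 'score': 0.94}
```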
Okay, so you don't have to use the default model — you can give it your own model. For example, you can go to the Hugging Face hub and say: I want image classification. These are all the models — 10,487 of them — and you can sort by, I don't know, most downloads, or maybe most likes, and pick any one of them. Say you want Microsoft's ResNet — that's what I tried here. You just set model= to it, run it, and the pipeline takes care of all the preprocessing, this, that, and whatnot. It's really very handy. Run it through the pipeline again, and it says "tiger cat," 94% probability, according to ResNet.

Now let's try a more interesting example, where you want to detect all the objects in the picture — object detection, which we didn't talk about in class. Just create an object-detection pipeline, same thing as before. When you run this command, an astonishing amount of complicated stuff is going on under the hood, and we are all the beneficiaries of that. So, thank you. We run the image through the pipeline, and it looks at all the possible objects that might be sitting in the picture. The raw results are hard to read, so let's visualize them — I got some nice code from this site for how to do that; let's just reuse it. If you plot the results... look at that. It has picked up the cat — 100% probability, I guess — the remote, the couch, the other remote, and then the other cat. Pretty good, right? Off the shelf, ready to go — no heavy lifting required.
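And a sketch of the detection step, reusing `image` from above. Swapping in a specific hub model is one keyword argument (e.g. model="microsoft/resnet-50" on the classifier); the drawing code below is a simple stand-in for the visualization snippet borrowed in the lecture:

```python
from transformers import pipeline
from PIL import ImageDraw

detector = pipeline("object-detection")          # downloads a default DETR-style model
results = detector(image)                        # list of {'label', 'score', 'box'} dicts

draw = ImageDraw.Draw(image)
for r in results:
    box = r["box"]                               # pixel coords: xmin, ymin, xmax, ymax
    draw.rectangle([box["xmin"], box["ymin"], box["xmax"], box["ymax"]],
                   outline="red", width=3)
    draw.text((box["xmin"], box["ymin"] - 12), f'{r["label"]} {r["score"]:.2f}', fill="red")
image.save("detections.png")
```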
Now, in this case we're putting boxes — called bounding boxes — around each object. But what if you don't want a bounding box? What if you want the exact contour of the cat or the remote? No problem — we do something called image segmentation. So let's create an image segmentation pipeline and run the image through it. It takes some time... all right, let's visualize it. For each object it finds, it gives you a mask: it tells you what the object is, and then which pixels are on for that object and off for everything else. That's the mask — it tells you where the object sits. And you can see the first object it has found is this thing here, and it's perfectly delineated, right? It's pretty amazing. We can overlay this on the original image and see what it has found. Let's look at the other objects — oh, it has found the remote; that's the second object. And the third — the other remote — and the fourth. Do you think any other objects are remaining?

>> The couch.

Good. All right, let's find the couch. And look — the couch is pretty good, except that the middle part has gotten confused. Still pretty good, right? So that's Hugging Face: it does all of these things, and you should definitely check it out if you're not already very familiar with it. We have one minute left. Any questions? No questions — okay. All right, folks, see you on Wednesday. Thanks.