Okay, so let's continue the journey we started last time. If you remember, in the last class we showed how we can build an autoregressive large language model, aka a causal large language model, using the idea of a transformer causal encoder. Then we showed how you can take a bunch of sentences, use next-word prediction, just run it through, and boom, you get GPT-3. So that's what we saw last time. I want to point out an important clarification slash correction, which is that when we work with these kinds of causal large language models, unlike when we work with BERT for instance, when the contextual embeddings come out you don't actually have to use ReLU activations. You can literally just run them through a single dense layer with linear activations, then pass that into a softmax, and boom, you're done. That's how GPT-3 and all these models are trained. And the other thing I want to
point out, which may not have been clear, is that what is coming out of this dense layer, this vector, is as long as your vocabulary. Only then, when it goes into the softmax, do you get probabilities which are as long as your vocabulary, which means that you get to pick one word or token out of that entire 50,000-long vocabulary. I just want to point that out, because I think it's easy to get a little confused by this difference between the way masked language models like BERT work and causal language models like GPT-3 work. Okay, so now let's continue. We know how to build GPT-3. So what about GPT and GPT-2? What's up with them? Why is GPT-3 so famous and not GPT-2? Well, first of all, you folks know that GPT stands for generative pre-trained transformer. Now, GPT-3, GPT-2, and GPT-1 were trained in basically the same fashion.
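Before moving on: the head just described, one dense layer with linear activation whose output length equals the vocabulary size, followed by a softmax, can be sketched with toy dimensions. The sizes and weights below are made up for illustration, nothing like GPT-3's real ones:

```python
import math

def lm_head(contextual_embedding, W, b):
    """One dense layer with linear activation, then a softmax.

    contextual_embedding: a single token's transformer output (d floats).
    W: vocab_size x d weight matrix, b: vocab_size biases.
    Returns one probability per token in the vocabulary.
    """
    logits = [sum(w * x for w, x in zip(row, contextual_embedding)) + bj
              for row, bj in zip(W, b)]
    # numerically stable softmax over the vocab-sized logit vector
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy numbers: embedding dimension 2, vocabulary of 3 tokens.
probs = lm_head([1.0, 2.0],
                W=[[0.1, 0.2], [0.3, 0.1], [0.0, 0.5]],
                b=[0.0, 0.0, 0.0])
```

The key point is that `probs` has exactly one entry per vocabulary token, so taking an argmax or sampling over it picks the next token out of the whole vocabulary.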
Predict the next word, in the same fashion, with the same sort of transformer stack, except that GPT-3 was trained on much more data, because the underlying transformer stack had many more layers. It is a much bigger stack, meaning lots more parameters, and therefore you need lots more data to train it well. That was really the only difference. The difference was literally one of scale: scale of network and scale of data. And unlike GPT and GPT-2, GPT-3, even though it was trained basically the same way with the same kind of network, was one of those situations where more became different. There was almost some sort of phase change between two and three. Unlike GPT and GPT-2, GPT-3 could do amazingly coherent continuations of any starting prompt. So for example, if you take this little prompt, "The Importance of Being on Twitter" by Jerome K. Jerome, who was a famous humorist, and you give it this prompt ending with the word "it," it produces this continuation, which is really strikingly good.
And if any of you have read Jerome K. Jerome and you read this thing, you'll be like, "Wow, that actually sounds like Jerome K. Jerome." So, amazing continuations. But the interesting thing here is not so much the continuation; it's the fact that if you give the same prompt to GPT-2 or GPT, it won't be very good. In fact, after the first one, two, or three sentences it'll become incoherent, meander, and start rambling. This thing can keep faking it for a lot longer. That's the amazing thing that was unexpected; researchers did not expect this. But it wasn't good at following your instructions. So for instance, if you ask it, "Help me write a short note to introduce myself to my neighbor," this is the kind of thing it'll come up with. And you can actually run it yourself. You can go to GPT-3 on the playground; I think GPT-3 is still available in the playground. If it is, you can try running these prompts. You will start getting garbage very quickly.
So for example here, "Help me write a short note." It says, "What's a good introduction to a resume?" "Résumé" for some reason has glommed onto "note"; I have no idea why. But the reason it's doing stuff like this is that a lot of the training data it was trained on is basically lots of lists of things. So when you say, for example, "The capital of France," and ask it to continue, it'll come back with "The capital of France is Paris, the capital of Hungary is Budapest," and so on. It just starts coming up with a list. So it's very list-driven; it thinks that you need to complete some sort of list. That's what's going on here. And so it's not very good. It doesn't realize that you're actually asking it to do something specific.
This is the problem when you have an autocomplete that doesn't realize what you're asking it; it just thinks it's an autocomplete. Now, in addition to these unhelpful answers, it can also produce offensive answers, factually incorrect answers, and so on and so forth. The list of bad things it can do is long. So why does it do that? Why does it produce unhelpful answers? Well, as you recall, it was only trained to predict the next word. It wasn't explicitly trained to follow instructions. So it seems reasonable that if it's simply trying to guess the next word repeatedly, it can't really do anything more. How could it figure out that there's an instruction it needs to follow, unless the training data on the net was all instructional, which it clearly is not? So, light bulb idea: let's explicitly train it with instruction data. And so OpenAI developed an approach called instruction tuning to do exactly this. And this paper is the paper that was the breakthrough. This is what actually put ChatGPT on the map.
And it's very readable, so I would encourage you to check it out if you're curious. So we had GPT, GPT-2, GPT-3, just bigger and bigger models trained the same way. Then we run into the problem that it can't handle instructions. So we do instruction tuning to get to 3.5, also called InstructGPT. And then a small tweak after that gets you ChatGPT. And by the way, there are really two things going on in this step, as you will soon see. I'm just calling it instruction tuning so that I don't have to say some long thing every single time; this is not a consistent piece of terminology, so just be aware of that. All right, first step: they got a bunch of people to write high-quality answers to questions, and they created about 12,500 such question-answer pairs. So for example, let's say this was the question: "Explain the moon landing to a six-year-old in a few sentences."
Believe it or not, GPT-3's answer to that question was another question, because it thinks there's a list of questions it needs to autocomplete. So it comes up with "Explain the theory of gravity to a six-year-old." It's like one of those people who, when you ask them a question, ask you a question back. So what they did is they said, "Okay, let's create a nice answer to this question." And here's a human-created answer: "People went to the moon in a big rocket, walked around," and so on. A much better answer to that question. And once you create these 12,500 question-answer pairs as training data, we just train GPT-3 some more, using next-word prediction as before. No difference. So here is the input, "Explain the moon landing..." and so on; this is the question, and then we have the answer right there. And then we take that answer, move it to the right, and shift it up, so that when it finishes "sentences," it needs to predict "People."
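The shifting being described here can be sketched in a few lines; the helper name below is ours, not from any particular library:

```python
def make_training_pairs(tokens):
    """Next-word prediction: at each position the model sees tokens[t]
    (plus everything before it) and must predict tokens[t + 1], so the
    targets are simply the input sequence shifted by one position."""
    inputs = tokens[:-1]
    targets = tokens[1:]
    return list(zip(inputs, targets))

# The lecture's example: after "sentences" the model must predict "People",
# after "People" it must predict "went", and so on.
pairs = make_training_pairs(["sentences", "People", "went", "to", "the", "moon"])
```

Each pair is (what the model sees at a position, what it must predict there), which is exactly the supervision signal used both for pre-training and for this fine-tuning step.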
And then you give it "People," it needs to predict "went," and so on and so forth. Just like we saw before: "the cat sat on the mat" became "the cat sat on" paired with "cat sat on the mat," shifted right. That's what makes prediction possible and necessary. So that's what they did. This is step one, same as before. And once you do that, and this step is called supervised fine-tuning, it turns out it really helped. Once you supervised fine-tuned GPT-3, it was much, much better at following instructions. But there's a small problem with this approach: it takes a lot of money and effort to have humans write high-quality answers to thousands of questions. It takes a lot of money. So the question is, what can we do? What is easier than writing a good answer to a question? Well, what? Okay, how about somebody from this side?
>> Yeah, Joseph.
>> Perhaps writing a question for an answer.
>> Oh, that's actually a good one. Yeah, I like that.
So given an answer, find a question. And while that is not what I'm going to talk about here, that technique is actually used very heavily in LLMs. So that's great, very creative. Mark?
>> Thumbs up, thumbs down.
>> Sorry?
>> Thumbs up or thumbs down?
>> Thumbs up or thumbs down, exactly. Because everyone loves to be a critic. It's much easier to be a critic than to be a creator. Right? So what do we do? We basically say, let's rank answers written by somebody else. Which begs the question: who's going to write those answers? And there's a brilliant answer to that question. Wikipedia? Reddit? No: we will just ask GPT-3 to write the answers. They might be crap, but we don't care, because we can rank them. So we ask GPT-3 to generate several answers to the question. And how can we generate several answers? Because we can do sampling. We can do sampling.
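Sampling is what makes several different answers possible from a single model. A toy sampler in pure Python (the logits below are made up for illustration) looks like this; a temperature near 1 keeps the draw random, so repeated runs give different continuations:

```python
import math
import random

def sample_token(logits, temperature=1.0, rng=random):
    """Draw one token index from temperature-scaled logits.

    Higher temperature flattens the distribution (more random);
    lower temperature sharpens it (closer to greedy argmax)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # inverse-CDF sampling from the categorical distribution
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r <= cumulative:
            return i
    return len(probs) - 1

# Made-up logits over a 3-token vocabulary; repeated calls vary.
token = sample_token([2.0, 0.5, 0.1], temperature=1.0)
```

Run generation like this three times on the same question and you get three different candidate answers to put in front of the human rankers.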
The fact that we had these stochastic outputs because of sampling is now a feature, not a bug. Okay, so we create lots of different answers to the question. We feed it a question and get, say, three answers out. Just run it three times, get three answers out, with a nice temperature of like 1 or 1.1 or something, so that it's nice and random. And then we literally have humans rank them: do the thumbs up, thumbs down; rank them from most useful to least useful. This step is step two of instruction tuning. So OpenAI collected 33,000 instructions, fed them to GPT-3, generated answers, and had humans rank them. And once you do this, you can assemble a beautiful training data set. So basically what we have is an instruction and, let's say, just two answers, A and B.
In practice you can have many answers which we rank, but just for simplicity I'll go with Mark's thumbs-up/thumbs-down sort of answer: let's assume you have only two answers to every question. And the human has said, "I prefer this to that." That's it. So we now have a data set where each data point is: the instruction, the preferred answer A, and the other answer B. Yeah?
>> The thumbs-up/thumbs-down technique that we're talking about, is that why ChatGPT now also uses thumbs up, thumbs down? Is it using our answers to train?
>> Exactly, right. Yeah, all the models have the thumbs-up/thumbs-down stuff going on somewhere. They are all collecting data for this step.
>> Thank you.
>> Yeah, it's sort of the old adage: if you're not sure who the product is, you are the product. So it's one of those things. Yeah.
>> So if we understand correctly, when we see thumbs up/thumbs down, it does mean that ChatGPT is going to train on our data, right?
>> Unless you opt out, yeah. So if you actually go to the ChatGPT settings, there is something called data controls or something; you can toggle it to off. But I think, when I last checked, if you toggle it to off, you lose your chat history. So they have hobbled that feature to discourage people from setting it to off as much as possible. Clever. But you can opt out, and if you use the API as opposed to the web interface, you're automatically opted out; you have to deliberately opt in. And if you use the versions that are available through Microsoft Azure and so on and so forth, there are all kinds of very safe controls and such. In fact, I think for the Microsoft Copilot license that MIT has, the default is opted out. Okay. So, to go back: once you have this data point, you can build something called a reward model. And this is a very clever piece of work.
So what you do is you have an instruction, you have a preferred answer, and you have the other answer. You feed them to a network. This is just a nice language model. And the language model produces a number which measures how good this thing is: how good an answer is this to that particular instruction. So you get a rating here and a rating here, and then what you do is run them through a little loss function which essentially encourages the model to give higher numbers to the better answer. It's the same model. You just run the question with the first answer, then the question with the second answer, and you get these two numbers. Initially those numbers are just random. But then you tell the model: hey, this is the preferred thing; make sure the preferred answer's rating, the R value, is higher than the other number, because more is better. Higher is better. Okay?
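A minimal sketch of such a pairwise loss, the one the lecture describes (take the difference of the two ratings, apply a sigmoid, take the logarithm, and negate):

```python
import math

def reward_model_loss(r_preferred, r_other):
    """Pairwise preference loss: -log(sigmoid(r_preferred - r_other)).
    The bigger the margin by which the preferred answer outscores the
    other one, the lower the loss."""
    diff = r_preferred - r_other
    sigmoid = 1.0 / (1.0 + math.exp(-diff))
    return -math.log(sigmoid)

# Ratings the model assigns to (instruction, answer A) and (instruction, answer B):
loss_right_order = reward_model_loss(3.2, 1.5)  # preferred answer rated higher
loss_wrong_order = reward_model_loss(1.5, 3.2)  # preferred answer rated lower
```

Minimizing this over all the human preference pairs is exactly the "give higher numbers to the better answer" objective: `loss_right_order` comes out smaller than `loss_wrong_order`, so gradient descent pushes the ratings in the right direction.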
And this thing is just a sigmoid: you basically take the difference of those two ratings, apply a sigmoid, and take the logarithm. You can convince yourself afterwards, and I encourage you to check for yourself, that if we give a higher number to the better answer, the loss will be lower. And since we are minimizing loss, we're essentially training the network to try to give higher ratings to better answers. That's it; that's the approach. Did you have a... yeah, Ben?
>> So you could imagine training the model on only the good answers. Is the idea of having both that the model is actually learning what makes an answer good?
>> Correct, exactly. Much like if you want to build a dog/cat classifier, you have to show pictures of both.
>> Yeah.
>> So, I understand the feedback mechanism of thumbs up/thumbs down, but there are a lot of times when the popular response is not the accurate one.
So is there a way that they actually have a layer to correct for that?
>> Yeah, good question, Swati. So as it turns out, all these companies like OpenAI have a huge document, 100 or 200 pages long, a very bulky document, which instructs and teaches the labelers, the rankers, how to rank these things. They have to follow these very strict guidelines to precisely handle strange corner cases and things like that. And that document is on the web; you can dig it up. It's actually very instructive to read through it. I think they put it out on the web because they wanted to convince people that they go to inordinate trouble to make sure the rankings are actually good. Do you have a question? Comment? Okay. All right. So, back to this: how do you train this thing?
SGD. Because you have a network, it's coming up with an answer, and you have some way to know if that answer is good or bad: better answers give lower loss. Backpropagate through the network, keep updating the weights, and boom, you're done. And once you do that, this reward model can provide a numerical rating for any instruction-answer pair. You just give it an instruction and an answer, could be a crappy answer or a good answer, and it tells you how good it is. So in this case, for example, maybe it's going to give a nice number like 1.5 for this answer, but then a better answer comes along and gets a 3.2. What we have done by doing this whole modeling exercise is that we have essentially learned how humans rank responses, because we can only have humans rank responses for some finite number of questions.
What we really want is to automate that ranking process, so that we can do it for tens of thousands of questions really fast. So we have essentially built a model of how humans rank things, which is beautiful. A lot of the stuff here is very self-referential, which I find very elegant. Anyway, this can be used to improve GPT-3 even further. So we take the instruction as before and feed it in; it gives you some answer. Then we feed this instruction and the answer to our newly minted reward model, and it gives us a numerical rating. And then, this is the key step: we take this numerical rating and use it to nudge the internal weights of GPT-3 in the right direction. This nudging uses a technique called reinforcement learning, which, just in the interest of time, we can't get into in this lecture. But that's the technique you use to nudge these things in the right direction. So that's what we do. That's reinforcement learning.
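The real InstructGPT step uses PPO on GPT-3's weights, which is beyond this lecture. The following is only a toy REINFORCE-style sketch on a two-answer "policy," with a stubbed-in reward model, just to show what "use the rating to nudge the weights" means; every name and number here is made up for illustration:

```python
import math
import random

def reward_model(answer):
    # Stand-in for the learned reward model: rates an answer with a number.
    return 1.0 if answer == "helpful" else -1.0

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rlhf_step(logits, answers, lr=0.5, rng=random):
    """Sample an answer from the policy, score it with the reward model,
    and nudge the logits so high-reward answers become more likely."""
    probs = softmax(logits)
    i = 0 if rng.random() < probs[0] else 1
    r = reward_model(answers[i])
    for j in range(len(logits)):
        # REINFORCE: reward times the gradient of the log-prob of the sample
        grad = (1.0 if j == i else 0.0) - probs[j]
        logits[j] += lr * r * grad

rng = random.Random(0)
logits = [0.0, 0.0]                     # the policy starts indifferent
answers = ["helpful", "unhelpful"]
for _ in range(200):
    rlhf_step(logits, answers, rng=rng)
# after many nudges, the policy strongly prefers the high-reward answer
```

In the real system the "policy" is GPT-3 itself and each nudge is a gradient update over all its parameters, but the loop is the same shape: generate, score with the reward model, nudge toward higher reward.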
We nudge it in the right direction. And OpenAI did this with 31,000 questions. Okay. Nudge, nudge, nudge, nudge, nudge. And when you do that, you get GPT-3.5, aka InstructGPT. Okay, that's it. And by the way, this step here is called reinforcement learning with human feedback, because we use reinforcement learning, and since humans rank the answers, which led to the building of the reward model, we get human feedback. Okay, that's reinforcement learning with human feedback. Yeah?
>> Yeah, I have a question regarding the type of questions that they're using. I can imagine maybe there are very simple questions to answer, because now you can ask GPT to, for example, respond as a pirate or something like that. It's going to be harder to train if you have a bunch of questions that involve only small interactions, and then there is the question like...
>> That's a good question.
So the quality of the questions in the dataset clearly is a big factor, because if you have simplistic questions, it won't be able to handle complex questions later on. That actually begs the question of where they got these questions from. They actually got them from their API. People were asking GPT-3 questions on the API before it became 3.5; the API was already fully commercially available, and a lot of people were building products on it by then. So they collected all those questions and filtered them for quality, and that was the question set they used. Then they judiciously added to it with human-created questions, but they couldn't do a lot of that because it's expensive. Collecting stuff that somebody else is already asking your API is very easy.
Yeah, Tomaso?
>> Uh, this might be more of a philosophical question, but the human bias that's present in the small
subset of human labelers that they've chosen gets eventually compounded in this model that we often consider the source of objective truth.
>> Yes. Yeah, that's very true. I think the reward model probably learns all the biases of the human labelers very faithfully, which is why they have these very complex frameworks and guidelines to try to mitigate bias. For example, they might give the same question and set of possible answers to many different labelers, and only if people pick the same ranking might they use it, so that at least inter-labeler bias can be minimized. But if everybody is biased in the same direction, it won't protect you against that. So yeah, in general there's a whole body of work on trying to debias these things and build them without too much bias in them. It's a whole world unto itself, which we just don't have time to get into.
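That agreement filter can be sketched in a few lines; the data layout here is an assumption, purely for illustration:

```python
# Sketch of the inter-labeler agreement filter described above: a
# preference pair is kept for training only when every labeler produced
# the same ranking. This removes inter-labeler disagreement, though not
# a bias shared by all labelers.

def unanimous(rankings):
    # rankings: one (preferred, rejected) tuple per labeler.
    return all(r == rankings[0] for r in rankings)

def filter_pairs(labeled_pairs):
    # Keep only the pairs on which all labelers agree.
    return [rankings[0] for rankings in labeled_pairs if unanimous(rankings)]

kept = filter_pairs([
    [("A", "B"), ("A", "B"), ("A", "B")],  # all agree: keep
    [("C", "D"), ("D", "C"), ("C", "D")],  # disagreement: drop
])
assert kept == [("A", "B")]
```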
Uh, Olivia?
>> Um, depending on the medium that's being returned by these models, would there be more than one reward model? Because isn't this what Gemini is running into issues with right now with their image generation, the bias that they try to...
>> Yeah. So the Gemini business that's going on, it's unclear what's causing it. It may be in this step; maybe they were a little overzealous in preventing certain things from happening. Some of these systems will also actually intercept the question that you ask and route it differently based on what they sense is in the question. So there could be pre-processing, post-processing, a lot of stuff that goes on. It's unclear to me where in the pipeline, and it could be more than one place, these things may be entering.
So yes, this may very well be where it enters: a situation where people are told, if you see this kind of answer, downrank it, don't uprank it. Then the model learns that ranking very faithfully and proceeds to apply it where it should not be applied. So that does happen. Uh, Joselyn, you had a question?
>> Um, I think I still don't totally understand why, when I ask ChatGPT a question, even in a lengthy response it doesn't wander away from the topic that I'm asking about. Understanding that it's predicting each word, it's sort of taking a random walk from one word to the next in some sense.
>> But each word it utters now becomes part of the input to the next word it utters, right? So it's not truly a random walk in that sense; the next step is not independent of the previous step. It depends on the journey so far, so it's going to try to be very consistent with the journey so far.
>> Okay. Does this part, the fine-tuning on these question-answer sets, play some role in it being able to constrain itself and not meander away?
>> I don't think so. I think this is more to make sure that the weights generally tend to produce the right answer. Now, one thing that is possible is this: when I'm a ranker looking at a few different answers, I have to figure out if the answer is helpful, if it is accurate, if it is non-toxic, things like that, and part of the rubric for evaluating these answers could be their coherence. So it could also be that they are saying short coherent answers are better than long ones, but once you adjust for length, maybe coherence is more important. It could be any number of these things. So it could play a role in that.
>> So just sort of one small follow-up.
So in other words, when it's learning from these question-and-answer pairs, it's able to look at the whole response and learn something about the whole response, rather than just one word at a time, right?
>> Correct. Yeah, the entire response is being ranked.
>> Yeah.
>> Correct.
>> Yeah. On a related note, when it's generating a new word on a topic, does the attention pertain to the entire prior text, or can you have, like, traveling attention, say the last five words?
>> So yeah, the short answer is you can; it's called sliding window attention. It can be done. They typically do it not so much because they want to focus more on the recent words, but because it makes the computation very efficient. That's why they do it. It's called sliding window attention; you can Google it.
>> So normally it's full attention?
>> Normally the default is full attention.
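The difference between full causal attention and sliding window attention is just the mask; here is a minimal sketch (real implementations apply this mask inside the attention computation):

```python
# Position i may attend to position j when the mask entry is True.

def causal_mask(n):
    # Full causal attention: each token sees itself and everything before it.
    return [[j <= i for j in range(n)] for i in range(n)]

def sliding_window_mask(n, w):
    # Sliding window attention: each token sees only the last w tokens
    # (itself included), which caps the per-token cost of attention.
    return [[i - w < j <= i for j in range(n)] for i in range(n)]

full = causal_mask(5)
windowed = sliding_window_mask(5, 3)
assert full[4] == [True, True, True, True, True]        # token 5 sees all
assert windowed[4] == [False, False, True, True, True]  # last 3 only
```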
Okay. So that's what they did. And by the way, as I think you pointed out, that's exactly what's going on: you're training the reward model with these thumbs up and thumbs down. Hold the questions for a moment. And so if you give the same question to GPT-3.5, aka InstructGPT, you get an amazing answer. Like night and day difference, an amazingly good answer. And then to go from 3.5 to ChatGPT, they basically followed the exact same playbook, except that they wanted a chatbot, meaning something that could carry on question answer, question answer, as opposed to just a single question and answer. They wanted a conversation. So they trained it on conversations. That's it. Instead of training it on instruction-answer data, they trained it on instruction answer, instruction answer, instruction answer, a sequence of such pairs strung into a conversation. That's it.
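The change in training data can be sketched like this; the role tags below are illustrative, not the actual format OpenAI used:

```python
# Sketch: the only change from instruction tuning to chat tuning is the
# shape of the training text. A single instruction-answer pair becomes a
# whole conversation strung into one sequence.

def to_training_text(turns):
    # turns: [(role, text), ...] alternating "user" / "assistant"
    return "\n".join(f"<|{role}|> {text}" for role, text in turns)

conversation = [
    ("user", "Write a short thank-you note."),
    ("assistant", "Thank you so much for your help!"),
    ("user", "Can you make it more formal?"),
    ("assistant", "I sincerely appreciate your assistance."),
]
text = to_training_text(conversation)
assert text.count("<|user|>") == 2 and text.count("<|assistant|>") == 2
```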
That is the only difference to go from 3.5 to ChatGPT. And now, given you do that, ChatGPT gives you a much nicer response, and then you can ask a follow-on question: can you make it more formal? Boom, it gives you a nice response, because now it knows about conversations; it's been trained on conversational data. So that's it. That's how they built ChatGPT, and all the things we are seeing later on are continuations of this sort of approach. So let's pause for a couple of quick questions. Swati, you had a question, then we'll go to you, and then to you. Yeah.
>> So does it make a difference if a new question-answer pair, or new training data, comes early in the building of the model or later in the building of the model?
>> You mean the order of the questions, does it matter?
>> So I might have, let's say, 5,000 images to start with. Now, after my model is trained and developed, I have a new use case that has come in.
Will it make a difference if I add it in now?
>> So if you have a new use case for which you want to adapt the model, there's a whole set of techniques you use, which is going to be the next section.
>> But it's not...
>> Yeah, because what you have out of the box is just a generally good chatbot. It knows about a lot of stuff because it's been trained on those 30 billion sentences; it can answer a lot of questions reasonably well using common sense and world knowledge. But any specific use case, like medical and so on, it may not know. So you'll need to adapt it to your particular situation, and that's coming. All right. Yes?
>> Uh, what determines whether a whole conversation is ranked positively versus a specific answer within it? Is it if the first answer doesn't get a positive response, but then after a follow-up the second one does? Is that correct?
>> Exactly.
So if you're a human and you read the transcript of an exchange between two people, and I give you two exchanges which both start with the same question, you'll be able to assess which one is the better transcript. That's basically what's going on. Uh, there was a question over here, right? Yeah.
>> So I was wondering: when you ask a question, very often you can kind of tell that something was not written by an actual person. Do you think that comes from the reinforcement learning part, or where do you think it comes from?
>> It's a good question. I don't know, because I know that part of the ranking rubric they use is to favor responses which sound more humanlike rather than robotlike. So if anything, I'm hoping that reinforcement learning would actually make it sound more humanlike, because the rankers would have prioritized that. So if it still comes up with robotic stuff, it's something else that's going on.
Maybe a lot of the text on the internet is not literature; it's just people writing some crap, right? So it could be that. Yeah.
>> How much of this instruction tuning or conversational tuning is happening in real time, within a conversation?
>> None of it.
>> None of it. So as you give feedback to the model, it's just basically regenerating, like, I don't like that answer, come up with something else?
>> No, it's not doing it in real time. Basically, whatever signals you're giving it with this thumbs-up, thumbs-down business get added to the training logs, and they periodically retrain it.
Okay. So, by the way, this is instruction tuning in a nutshell, and I want to point this out; you don't have to read the whole thing, but just quickly: this was where we had to have human involvement. In the first step, writing a lot of responses to these questions, and then ranking the answers. So these two steps are still human labor-intensive.
Now, it turns out you can actually use helper LLMs to automate this too, right? This is not what OpenAI did in the beginning with ChatGPT, but now you can do it this way, because there are lots of really good LLMs available for you to automate many of these things. We don't have time, but if you're curious, I have a little blog post on this; check it out. Okay, so now we come to the question of, well, if you want to take a base LLM like GPT-3 and make it useful and able to respond to instructions, we have seen that we had to adapt it with high-quality instruction-answer data, using supervised fine-tuning and reinforcement learning with human feedback. That's what made GPT-3 actually useful and turned it into ChatGPT. By the same token, this holds true more generally: if you want to take a large language model and make it useful for a medical use case, a legal use case, or some other narrow business use case, you have to adapt it with domain-specific data. Okay. So let's look at techniques for doing so. All right.
So adaptation is the rough name for the process of taking a base large language model and tailoring it for your particular use case. And there's sort of a ladder of things you can do, and we're going to look at every one of them. You can do this thing called zero-shot prompting, which is just, you literally ask the LLM nicely and clearly for what you want, and maybe it just gives it to you. This is the use case we're all used to in the web interface. You can also do something called few-shot prompting, where you ask it something and you also give a few examples of the kind of thing you want, and that helps it a great deal. And then there's retrieval-augmented generation, and fine-tuning; we'll look at all of them, and I'll explain all these things as we go along. Okay, so let's start with zero-shot prompting, where, by the way, the word shot is a synonym for example. So, zero-example prompting.
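Sketched in code, a zero-shot prompt is nothing but a clear instruction plus the raw input; the wording here is illustrative, not the exact prompt from the slides:

```python
# Sketch: a zero-shot prompt contains an instruction and the input,
# with no worked examples at all.

def zero_shot_prompt(review):
    return (
        "Tell me if a product defect is being described in the following "
        "product review. Answer Yes or No.\n\n"
        f"Review: {review}"
    )

prompt = zero_shot_prompt(
    "The curve of the back of the chair does not leave enough room "
    "to sit comfortably."
)
assert "Yes or No" in prompt and "Review:" in prompt
```

This string would then be sent to the model as-is; the model's reply (ideally "Yes" here) is the classification.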
You literally ask in the prompt for what you want, without giving even a single example. Okay. So let's say we want to look at product reviews and build a detector to figure out, not whether the review contains sentiment, that's kind of boring, but whether it contains some description of a potential product defect. Here is something I actually pulled off Wayfair, with apologies to Wayfair. It says: the curve of the back of the chair does not leave enough room to sit comfortably. Sounds like a defect-ish kind of thing, right? Back in the day, you would have collected all these reviews and built a special-purpose NLP-based classifier to figure out defect, yes or no. Here, you can literally just feed this thing into GPT-3 and ask it: tell me if a product defect is being described in this product review, then the review about the curve of the back, boom, and it comes back and says, yep, that's a
Okay so this zero shot 890 00:31:37,359 --> 00:31:41,199 you just ask a question you get the 891 00:31:38,398 --> 00:31:43,759 answer back. Okay and it actually works 892 00:31:41,200 --> 00:31:45,360 remarkably well and the better models 893 00:31:43,759 --> 00:31:47,359 the bigger models tend to be much better 894 00:31:45,359 --> 00:31:50,000 than the smaller simpler models for 895 00:31:47,359 --> 00:31:52,639 doing zero shot. Okay. All right. Now 896 00:31:50,000 --> 00:31:54,079 when you adapt an LLM to a specific task 897 00:31:52,640 --> 00:31:55,919 obviously you need to carefully design 898 00:31:54,079 --> 00:31:57,759 the prompt as you folks know this is 899 00:31:55,919 --> 00:31:58,799 called prompt engineering and we're not 900 00:31:57,759 --> 00:32:00,640 going to spend much time on prompt 901 00:31:58,798 --> 00:32:02,720 engineering except I just want to give a 902 00:32:00,640 --> 00:32:04,960 simple example. So if you actually ask 903 00:32:02,720 --> 00:32:07,919 Jubid this question what is the fifth 904 00:32:04,960 --> 00:32:09,919 word of the sentence very often it'll 905 00:32:07,919 --> 00:32:11,679 give the wrong answer. 906 00:32:09,919 --> 00:32:12,960 It's very strange why it can't get this 907 00:32:11,679 --> 00:32:14,880 answer question right. It's a very 908 00:32:12,960 --> 00:32:17,440 simple question. So if it's the fifth 909 00:32:14,880 --> 00:32:18,559 word of the sentence is s right uh 910 00:32:17,440 --> 00:32:20,640 sometimes it gets it right but very 911 00:32:18,558 --> 00:32:22,000 often it'll get it wrong okay but now 912 00:32:20,640 --> 00:32:23,600 you can do a little prompt engineering 913 00:32:22,000 --> 00:32:25,278 and it'll always get it right. So for 914 00:32:23,599 --> 00:32:26,798 example you can say I'll give you a 915 00:32:25,278 --> 00:32:27,919 sentence first list all the words that 916 00:32:26,798 --> 00:32:30,398 are in the sentence then tell me the 917 00:32:27,919 --> 00:32:33,200 fifth word. 
Okay, here is a sentence, boom, it gets it right. So it's an example of how you can help it along by being very prescriptive about what you want it to do and breaking down all the steps. Don't make it guess things; then it does a great job. Anyway, there are lots of other tricks people have figured out over the last couple of years. For a long time this one was pretty hot, where you give it a question and say, let's think step by step. That actually gives it a better shot at giving you an accurate answer back. Now, this kind of thing is already baked into the LLMs. When you ask ChatGPT a question, your prompt gets appended to what's called the system prompt, and the whole thing goes into the LLM.
You never see the system prompt, and the system prompt is telling ChatGPT: think step by step, take your time, don't blurt out an answer, stuff like that. And you can just Google it; the system prompts have been jailbroken, and you can find them on the web. All right. And this is funny, this came out maybe a month or two ago: apparently, take a deep breath and work on the problem step by step works better than just work on the problem step by step. And then more recently, I literally read this two nights ago: apparently, if you have a math or reasoning question and you tell it, you are an officer on the starship Enterprise, now solve this problem for me, it's more likely to get it right.
>> Go figure. Thomas?
>> I read two more that were super fun. One was promising it a reward if it solves the problem correctly. And the other one: when the answer was, I cannot do that, I tried it on Gemini, and that turned out to be the way to solve it.
>> Nice, sort of a back-and-forth with ChatGPT: can you solve this, can you solve this.
>> Yeah, very good, excellent. One thing just on that, let's have some fun: you can say "I'm going to tip you a thousand bucks if you solve this." So this person apparently kept using this tip, and at one point it said, "You keep promising me tips and you never give me the tip, so I'm not going to solve this problem for you." Okay. So, there are many prompt engineering resources; this one came out a couple of weeks ago and I thought it was pretty good, so I just put a link to it here. So now let's look at few-shot prompting, where you give it a few examples. Let's say we want to build a grammar corrector. What you can do is give it examples of poor English and good English. You can see: poor English, "I eated the purple berries"; good English, "I ate the purple berries." And similarly, three examples, and then you end the prompt with just the poor English input.
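A few-shot prompt like the one just described can be assembled in a few lines of code. This is a minimal sketch: the "Poor English:"/"Good English:" labels mirror the slide, but the exact formatting GPT-3 was shown may differ, and the second example pair here is made up for illustration.

```python
# Minimal sketch of assembling a few-shot grammar-correction prompt:
# a list of (poor, good) example pairs, followed by the new input and an
# open "Good English:" cue for the model to complete.

def build_few_shot_prompt(examples, new_input):
    """Turn (poor, good) pairs plus a new poor-English input into one prompt."""
    parts = []
    for poor, good in examples:
        parts.append(f"Poor English: {poor}")
        parts.append(f"Good English: {good}")
    # End with the new input; the model continues after the final label.
    parts.append(f"Poor English: {new_input}")
    parts.append("Good English:")
    return "\n".join(parts)

examples = [
    ("I eated the purple berries.", "I ate the purple berries."),
    ("He go to school yesterday.", "He went to school yesterday."),
]
prompt = build_few_shot_prompt(examples, "The patient was died.")
print(prompt)
```

The model then learns the intended mapping on the fly from the pattern alone, with no weight updates.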
And then the response from GPT-3 is the good-English output with the error fixed. So this is an example of giving it a few examples of what you want, and it just learns on the fly what you have in mind, what your intention is. Okay, so that's that. Now, the ability of LLMs to learn from just a few examples, or even no examples and just a clear instruction, is called in-context learning, and that was something GPT and GPT-2 could not do. It was new in GPT-3, and it's what they call an emergent capability: it was completely unanticipated by the people who built it. All right, so that's that. Now let's look at retrieval-augmented generation; by the way, this is also sometimes called indexing. It's called RAG for short, and the idea of RAG is actually very simple. Let's say we want to ask a question to a chatbot, but we want the chatbot to leverage proprietary data that we might have. Maybe it's customer support, sort of a call center kind of
operation, and you have this massive FAQ database, a content database, and you want to give that FAQ to the chatbot along with your question, so that it can leverage the FAQ to answer the question for you, as opposed to whatever it has learned previously in its general training. So can't we just include the entire FAQ, the whole data set, in the prompt and send it in? Maybe we just take our question, take everything potentially relevant to the question, everything we have in the database, and attach it to the question. The whole thing becomes a prompt. Feed it in and say, "Hey, find it out for me." Can't you just do that?
>> Theoretically? I think something stops us.
>> The reason you can't do it is this pesky thing called the context window. For any LLM, the length of the prompt plus the output cannot exceed a predefined limit. This is called the context window.
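The constraint just described can be sketched as a simple check. This is a toy sketch: the whitespace split is a crude stand-in for a real tokenizer (such as tiktoken, used later in the Colab), and the 16,385-token limit is the GPT-3.5 Turbo figure quoted later in the lecture.

```python
# Rough sketch of the context-window constraint: the prompt's tokens plus
# the tokens reserved for the reply must fit within the model's limit.
# Splitting on whitespace is only an approximation of real tokenization.

CONTEXT_WINDOW = 16_385  # GPT-3.5 Turbo's limit, input + output combined

def fits_in_context(prompt, max_output_tokens, limit=CONTEXT_WINDOW):
    approx_prompt_tokens = len(prompt.split())  # crude token count
    return approx_prompt_tokens + max_output_tokens <= limit

print(fits_in_context("Which teams won gold in curling?", 500))  # True
print(fits_in_context("word " * 20_000, 500))                    # False
```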
Remember the max sequence length we had in our earlier models, which was the size of the sentence that could be fed in? Basically, there's a limit like that for any of these models; it's called the context window. There are only so many tokens it can accommodate, and since what goes in shapes what comes out, the limit covers the input and the output together. That's the context window. Furthermore, when you have a conversation with one of these chatbots, the entire conversation is fed in every single time. That's how it remembers what happened earlier in the conversation; it doesn't have any memory per se. Each time you ask a question, the entire thread is fed in. So initially you ask what's the square root of 17 and it gives you an answer; at first you only send in the new question, the red stuff on the slide. Then for the next question, the first question, its answer, and the second question are all fed in. Then all of these are fed in.
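The re-feeding behavior just described can be sketched like this. The message roles follow the common chat-API convention (system/user/assistant), and the replies here are canned stand-ins rather than real model output.

```python
# Sketch of how a chat thread is re-sent on every turn: the client keeps
# the full message history and sends all of it with each new question.

history = [{"role": "system", "content": "You are a helpful assistant."}]

def ask(question, fake_reply):
    history.append({"role": "user", "content": question})
    # In a real app, this is where you'd call the LLM with the whole
    # `history`; here we just record a canned reply.
    history.append({"role": "assistant", "content": fake_reply})
    return fake_reply

ask("What is the square root of 17?", "About 4.123.")
ask("And of 18?", "About 4.243.")
# The second call sent the whole thread: system + Q1 + A1 + Q2.
print(len(history))  # 5 messages so far
```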
So as the conversation goes on, you're consuming more and more of the context window. Can you imagine taking a whole FAQ, asking a question, then saying, "Well, I didn't mean that, I wanted something else," and before you know it, boom, you've blown out the context window? It's going to come back and give you an error.
>> Once you've exceeded it, does it take everything together, or does it take specific windows of it?
>> Yeah. So there's a whole research cottage industry around what to pick when your input is longer than the context window. The simplest case is a moving window: if you have a thousand tokens, you just look at the last thousand tokens. But there are some cleverer schemes where you take the earlier stuff that doesn't fit into the window, use another LLM to summarize it for you, and then attach the summary to your current prompt. I know, it gets crazy. Okay.
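The simplest "moving window" scheme just mentioned can be sketched in a few lines. As before, whitespace-separated words stand in for real tokens.

```python
# Minimal sketch of a moving context window: when the history exceeds the
# token budget, keep only the most recent tokens.

def truncate_to_window(text, max_tokens):
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return text
    return " ".join(tokens[-max_tokens:])  # keep only the tail

long_history = " ".join(f"tok{i}" for i in range(2000))
clipped = truncate_to_window(long_history, 1000)
print(len(clipped.split()))  # 1000
print(clipped.split()[0])    # tok1000
```

The summarize-the-overflow schemes replace the dropped prefix with an LLM-generated summary instead of discarding it outright.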
So for all these reasons, we need to pick and choose what we send in to answer a particular question. Since we can't include the whole thing, we first retrieve the relevant content from the database or the FAQ and then send it to the LLM along with our question. So: retrieval-augmented generation. That's what's going on. Make sense? Pictorially, let's say this is our external set of documents; think of it as the FAQ. We take each question-and-answer pair in the FAQ, treat it as its own little unit of text, and calculate a contextual embedding for each of those question-answer pairs. Remember, we know how to do contextual embeddings, right? That's a piece of cake at this point. You folks know how to do it: run it through something like BERT and you're done; you get a contextual embedding.
So you get embeddings for everything in your FAQ. Now, when a new question comes in, you take that question and calculate a contextual embedding for it too. Then you look at the FAQ chunks you have and see which of them are the most similar to your question. You grab the most similar ones, pack them into the prompt, and send it in. Maybe you have 10,000 questions but can only accommodate five of them in your prompt because the context window is small, so you pick the five you think are the most relevant to your particular question and feed those in. That's the idea; that is retrieval-augmented generation. Yeah, Rolando?
>> So does this tie in, for example, if I were to prompt and say "Help me work on my startup pitch, but in the voice of Steve Jobs"? Is it then kind of going out there, reducing the subset of data to things that have been written by Steve Jobs, and then generating its response based on that?
>> Not as a default, typically, because there's a lot of Steve Jobs material on the web, and it's just using that because it's all part of its pre-training data. But this tends to be more useful for very targeted applications where you don't expect it to know the answer, because the answer is not on the public internet. It's your proprietary data, you want it to use that proprietary data, and this is how you do it. Yeah?
>> Sure, but there will be some loss.
>> There will be some loss, because you have to figure out how to chunk it right. Maybe you have a 300-page PDF; maybe you look for each section and make it a chunk.
Maybe you look for each paragraph and make it a chunk. Again, there's a whole empirical cottage industry of techniques for doing these things better or worse depending on the use case and so on. But the conceptual idea is: chunk and embed.
>> Chunking is another issue.
>> Yeah. In fact, we're going to do it ourselves in the Colab right now.
>> Yeah.
>> Can we give more weightage to certain chunks? [laughter]
>> In the default implementation, no. But in some sense, by picking the five most relevant chunks out of 10,000, you're giving the other 9,995 chunks a weight of zero and these a weight of one. So in some sense you are weighting it.
>> Yeah.
>> I was just curious how much structure you have to have in an external document, say for a hospital or something. Do you have to do a bunch of labeling?
>> No, you just need to make sure it's relatively clean.
But you will see in the Colab that it can be kind of crappy and it still works, because there is so much crap on the internet it has been trained on already. Okay, so let's look at the Colab. By the way, retrieval-augmented generation is, in my opinion, the most prevalent business application of LLMs that I've seen to date, and there's a huge ecosystem of tools and vendors around it. I'm going to skip through the verbiage here. So, you have to install the OpenAI library and this thing called tiktoken, which we'll get to in a bit. I've already installed them before class because it takes some time, so I'll just make sure all these things are already in. Good, so we don't have to wait for this.
So I've imported pandas as before, and you can read through these things. Basically, I have an OpenAI key, an API key rather, that I have to use. I'm not showing you the key, obviously; I have to remember to delete it before I upload the Colab. You have to get your own key to make it all work, but the instructions are here. We're going to use GPT-3.5 Turbo to demonstrate RAG, so I give it the name of the model. OpenAI also has a whole bunch of different models for embeddings: you can feed one a sentence or a chunk of text and it will give you a contextual embedding back. It's a nice little API; you don't have to use your own BERT and so on, you can just use the OpenAI embeddings. Obviously you have to pay OpenAI every time you make a request, but it's really, really cheap at this point. Yep, a question?
>> About dealing with proprietary data: a lot of companies are like, we need to invest in our own LLM because we don't want our data going down this
kind of path. In that context, how good is the cybersecurity, or the compliance and legal side?
>> I think each vendor has their own set of rules and contractual commitments they're willing to sign up for, so you just have to check.
>> If you use the data here, does it go into the public domain or not?
>> No, but the vendor gets to see it.
>> Okay.
>> Right, meaning the vendor's systems get to see it. But do the vendor's employees get to see it if they need to? Unclear. Those are the legal nitty-gritty details you have to worry about. The other thing you can do is just download an open-source LLM and do it all on your own premises. That's totally possible to do. In fact, I probably won't have time today; I have a whole section on how you actually do fine-tuning with an open-source LLM, which I'll do as a video if we don't have time. Okay.
So this model, text-embedding-ada-002, is the name of the OpenAI model that gives you contextual embeddings, and we're going to use that. The use case here is that we want to create a chatbot that can answer questions about the 2022 Olympics, random questions you might have about the Olympics. So let's first ask it a question about the 2020 Summer Olympics. That's the query, and this is the API request we have to make; you can read through it, and I've linked to the documentation here for how it works. It says that Barshim of Qatar and Tamberi of Italy both won the gold, and you can fact-check this: it's actually accurate, it's correct. So now let's change the query and ask about the 2022 Winter Olympics, and why '22 versus '20 will become clear in just a moment.
So: which athletes won the gold in curling in the 2022 Olympics? And it says the gold medal in curling was won by the Swedish men's team and the South Korean women's team. Turns out, if you fact-check this, wait for it: Sweden did win the men's gold, and yes, the South Korean team participated, but Great Britain actually won the women's gold. So it got it wrong. It sounds like GPT-3.5 Turbo could use some help. The reason GPT-3.5 Turbo didn't know about this is that its training cutoff date was September 2021. As far as it's concerned, the 2022 Olympics haven't happened yet, so it confidently gave you the wrong answer, as it is often prone to do. This, by the way, is called hallucination: it gives you a very eloquent, confident, wrong answer. Or, as some folks have said about another business school that shall remain nameless: often in error, but never in doubt.
So, all right, back to this. One simple thing we can try right off the bat is to ask GPT-3.5 Turbo to say "I don't know" when it doesn't know, rather than just make stuff up. And how do you do it? It's very simple. You say in your prompt: "Answer the question as truthfully as possible, and if you're unsure of the answer, say 'Sorry, I don't know.'" Then here's the question; this is the query. Let's run it through. "Sorry, I don't know." Not bad, huh? So it worked. It's sort of trying to be humble and honest and self-aware and things like that. It's more like a Sloan at this point. All right.
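The prompt pattern just shown can be sketched in a couple of lines: prepend an instruction telling the model to answer truthfully and to admit when it doesn't know. The exact wording in the notebook may differ slightly.

```python
# Sketch of a "say I don't know" prompt wrapper: the instruction is
# prepended to the user's question before it is sent to the model.

def truthful_prompt(question):
    instruction = (
        "Answer the question as truthfully as possible, and if you're "
        "unsure of the answer, say \"Sorry, I don't know\"."
    )
    return f"{instruction}\n\nQuestion: {question}\nAnswer:"

prompt = truthful_prompt(
    "Which athletes won the gold medal in curling at the 2022 Winter Olympics?"
)
print(prompt)
```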
Now, as I mentioned earlier, you can check the cutoff date and see it's 2021. Actually, you know what, let me just open a new tab. All these cutoff dates are for the training data; for 3.5 Turbo, which is what we're using, the cutoff date is 2021. That's why. All right, so now what we can do is provide the relevant data in the prompt itself, sort of leading up to RAG here. By the way, the extra information we provide in the prompt to help it answer a question is called context; that's the lingo for it. We'll first do it manually. We'll use the Wikipedia article for the 2022 Winter Olympics, and we tell it explicitly to make use of this context, because telling it things explicitly always seems to help. So this is the thing we cut and pasted here: the Wikipedia article on curling. It's a pretty long article, it's got all kinds of stuff, and it's not even all that cleanly formatted. It's very strange; look at that.
So, to answer your question, Spencer: it can be in pretty bad shape and it still seems to work. Okay. So now: "Use the article below on the Olympics to answer the subsequent question. If you don't know, say you don't know." That's what we have; that's the query. And by the way, before I send it into the LLM, this is the actual query that's going to be sent; I'm printing it out. Look at how long the query is: "Use the article below," and here is the article, scroll, scroll, scroll, there's a whole thing, and it keeps going on, and then finally I say, "Which teams won the gold?" Okay, so let's run it. Look at that: women's curling, Great Britain. It got it right. Pretty good, right? I mean, it had to parse all that crap to find the nuggets. Nicely done. But maybe it wasn't super hard, because we literally gave it the answer. So let's make it a bit harder.
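The context-stuffed query just shown can be assembled like this. The wording is approximate; the notebook's exact template may differ, and the article string here is just a placeholder.

```python
# Sketch of manually injecting context: concatenate the instruction, the
# pasted article, and the question into one long prompt.

def build_context_prompt(article, question):
    return (
        "Use the article below on the Olympics to answer the subsequent "
        "question. If you don't know the answer, say you don't know.\n\n"
        f"Article:\n{article}\n\n"
        f"Question: {question}"
    )

article = "<full Wikipedia article on curling at the 2022 Winter Olympics>"
query = build_context_prompt(article, "Which teams won the gold in curling?")
print(query[:60])
```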
So, I noticed that this person, Oskar Eriksson, won two medals in the event. So let's ask if any athlete won multiple medals; that requires a little bit of abstraction, right? All right, same query: "Did any athlete win multiple medals in curling?" The question has changed; everything else hasn't. Hit it, let's see what happens. "Yes, Oskar Eriksson won multiple medals in curling. He won a gold in the men's event and a bronze in the mixed doubles." Pretty cool, right? Take that, Google.
So, all right, now we come to retrieval-augmented generation, where instead of doing this manually, which obviously doesn't scale, we will do it automatically. The thing you have to remember, as I mentioned just a few minutes ago, is that every LLM has a context window. For GPT-3.5 Turbo the context window is 16,385 tokens; that is the combined length of the input and the output, so we can't exceed it. By the way, GPT-4 Turbo's context window is, I think, up to 128,000 tokens, and Google Gemini 1.5 Pro (they really need to work on their names) has a context window of 1 million tokens. And in research they have tested 10 million tokens. Crazy times. All that means is that you can upload entire videos and ask it questions about the video. So, to come back to this: what we'll do is grab only the data from the Wikipedia articles about the Olympics that is relevant to our question, by using pre-trained embeddings.
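As a back-of-the-envelope sketch of that budget arithmetic: the window covers input and output together. The limits below are the ones quoted above (treat them as a point-in-time snapshot and verify against current docs), and the token counts in the example are made up.

```python
# Context windows quoted in the lecture (tokens); verify against current docs.
CONTEXT_WINDOW = {
    "gpt-3.5-turbo": 16_385,
    "gpt-4-turbo": 128_000,
    "gemini-1.5-pro": 1_000_000,
}

def fits(model, prompt_tokens, max_output_tokens):
    """The window covers input AND output, so both must fit together."""
    return prompt_tokens + max_output_tokens <= CONTEXT_WINDOW[model]

print(fits("gpt-3.5-turbo", 15_000, 2_000))  # False: 17,000 > 16,385
print(fits("gpt-4-turbo", 15_000, 2_000))    # True
```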
So again, this is the thing we talked about earlier, right? This is the picture we saw in class. The only thing I want to point out is that if you have an embedding for a question and an embedding for a chunk of text in your database, you have to figure out how related they are. And for that we can use the dot product, or something closely related that's easier for us to work with: cosine similarity. We have done cosine similarity previously; I've explained it in class. We're just going to use cosine similarity: how similar are these vectors? So that's what we're going to do. All right, so, the same picture as we saw in class. The first thing we need to do is break up the dataset into sections and then run each section through the embedding model. Fortunately, I have code here that does the chunking for you manually, which you can play around with later. But OpenAI has already given us the chunked dataset.
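As a quick sketch (this is illustrative code, not the notebook's), cosine similarity is just the dot product normalized by the two vector lengths:

```python
import numpy as np

def cosine_similarity(a, b):
    """Normalized dot product: 1.0 means same direction, 0.0 means orthogonal."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Only direction matters: scaling a vector leaves the score unchanged.
print(round(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]), 6))  # 1.0
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 6))            # 0.0
```

That scale invariance is why cosine similarity is often preferred over the raw dot product for comparing embeddings.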
So we just use that, because it's easy for us. And I've already downloaded it, because it takes about five minutes to download, and stuck it in a particular data frame here. So let's print out five randomly chosen chunks. You can see here, this is the first chunk, and look at all this messy stuff: the formatting is off, but these are all basically paragraphs and sections grabbed straight from Wikipedia with no cleaning.

Okay, now we define a simple function to send any arbitrary piece of text into the embedding model and get the contextual embedding vector out, right? And there is this little function that does that: using an embedding model, we send in a text and it gives us the vector back. So let's try it on "hodle is amazing." You should get a vector back.

Oh, come on. Don't fail me now.

All right. How long is it? 1536. Now, how about I say "hodle is incredible" instead?
Hopefully the two vectors will be quite similar in terms of cosine similarity, right? So, to calculate it, I use this particular function from SciPy. It just computes the cosine similarity, and, hit it: 0.9934. The maximum is one, right? So 0.9934 means they're very, very similar, which is comforting, because "amazing" and "incredible" are obviously synonyms. Okay. So now, given a data frame with a column of text chunks in it, we can use this function on every one of them to calculate the embeddings, and you have a function here that basically does it for you. I'm not going to run it, because it takes a long time, but you can run it later on; just be prepared to go get a cup of coffee while it works. But happily for us, OpenAI has actually already done this step, so we don't have to. It's already available in this data frame, so if you actually look at this...
And you can see here there is a text column and then there is an embedding sitting right there next to it. Okay. And these embeddings are, how long is it? 1536. 1536-long vectors. Okay. All right, so that's what we have.

Okay. So now that we have this, whenever we get a question, we calculate the question's embedding and then compute its cosine similarity with all the embeddings sitting in this data frame. Okay. So to do that we're going to define a couple of helper functions here. You can read through the Python later; it's basic Python manipulation. So let's just test this function. We have a little function called strings_ranked_by_relatedness, where you give it any input question or text, and it gives you back the top five most related chunks of text from its data frame. Okay, so let me just run this thing. Okay.
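A minimal sketch of what a function like `strings_ranked_by_relatedness` does. Note the `embed` function here is a toy stand-in for the real embedding model (a lookup of made-up 2-D vectors), so the texts and scores are illustrative only; the ranking logic is the part that mirrors the notebook.

```python
import numpy as np

# Toy stand-in for the real embedding model: maps each text to a fixed vector.
FAKE_EMBEDDINGS = {
    "curling gold medal": np.array([0.9, 0.1]),
    "Curling at the 2022 Winter Olympics": np.array([0.8, 0.2]),
    "Speed skating results": np.array([0.1, 0.9]),
    "Medal table": np.array([0.7, 0.4]),
}

def embed(text):
    return FAKE_EMBEDDINGS[text]

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def strings_ranked_by_relatedness(query, corpus, top_n=5):
    """Return (text, similarity) pairs, most related chunk first."""
    q = embed(query)
    scored = [(text, cosine_sim(q, embed(text))) for text in corpus]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

corpus = ["Curling at the 2022 Winter Olympics",
          "Speed skating results",
          "Medal table"]
ranked = strings_ranked_by_relatedness("curling gold medal", corpus)
print(ranked[0][0])  # the curling chunk ranks first
```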
So, for "curling," the things it pulls back had better involve curling and medals and so on. This one has a cosine similarity of 0.888: curling at the 2022 Olympics. That's good. Results summary, medal summary, results summary: it's all pretty good, right? Even the fifth one has a cosine similarity of 0.867, which is pretty high. So it's doing the right thing. The input text was "curling gold medal," and it's picked up the right chunks for it.

Now let's see what we can do with the original question. So here is the header I'm going to use in the prompt. I'm going to say: use the below articles to answer the subsequent question; answer the question as truthfully as possible, and if you're unsure of the answer, say "Sorry, I don't know." As before. Okay, that's our prompt. And now here's the thing: we don't want to exceed the context window, right? So we need to count the tokens we're sending in, plus the likely number of tokens we're going to get back, so that we don't exceed the budget.
So, we use this package called tiktoken for this, and it just helps you count the tokens. You can read through this; it's again some basic Python for counting tokens. And now, this is where we actually assemble the prompt. We start with the header, right? We have the header, which says be truthful and all that. Then we say: here is the question I'm going to ask you. And then we go in there and keep grabbing Wikipedia articles until the number of tokens in the prompt is about to exceed the token budget, and then we stop. Right? When you're about to exceed the budget, you stop, because you can't exceed the budget. And that's the whole thing. So, all right, let's run this function. Now, it turns out, as you saw, we can go up to 16,000-something tokens in the context window. I'm just using 3,700 as my budget, partly just to show you how to use this thing.
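The grab-until-the-budget-runs-out loop just described can be sketched like this. The real notebook counts tokens with tiktoken against a real model's vocabulary; `num_tokens` below is a crude whitespace stand-in so the sketch runs anywhere, and the header, chunks, and budget are made up.

```python
# Crude stand-in for tiktoken-based counting: one token per whitespace word.
def num_tokens(text):
    return len(text.split())

def build_prompt(question, ranked_chunks, header, token_budget):
    """Greedily append the most related chunks until the budget would be exceeded."""
    question_part = f"\n\nQuestion: {question}"
    prompt = header
    for chunk in ranked_chunks:
        article = f'\n\nWikipedia article section:\n"""\n{chunk}\n"""'
        if num_tokens(prompt + article + question_part) > token_budget:
            break  # about to exceed the budget, so stop adding articles
        prompt += article
    return prompt + question_part

header = "Use the below articles to answer the subsequent question."
chunks = ["curling results " * 10, "medal table " * 10, "speed skating " * 500]
prompt = build_prompt("Which teams won the gold?", chunks, header, token_budget=60)
print(num_tokens(prompt))  # stays within the 60-token budget
```

The chunks are assumed to arrive already ranked by relatedness, so stopping early keeps the most useful ones.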
And also because OpenAI is charging my credit card for every token I'm using, right? So I'm just being careful. It charges by the token; it's a beautiful business model. Anyway, back here. So let's ask the question: which athletes won the gold medal in curling at the Olympics? Here is the data frame to use, here is the GPT model, and don't exceed 3,700 tokens. Okay, that's the query, or the prompt. It's going to compose the prompt now, and this is the whole prompt. Okay, let's just scroll to the very top. It's really long.
Okay. So, all right: "Use the below articles to answer the subsequent question... as truthfully as possible," and boom, boom, boom, it has all these things. It's added a whole bunch of paragraphs from the Wikipedia pages, okay, and then it finally ends with the question: which athletes won the gold? Okay. All right, now let's just ask it. This is just a little function to send stuff into the API, and now we are finally ready to ask GPT the question. Fingers crossed.

All right: curling. Stefania Constantini in the mixed doubles, and the team consisting of blah blah blah in the men's tournament. And, oh, interesting, it has actually ignored the Great Britain people completely this time, I think. Last night it didn't. Welcome to stochasticity. So when you try it, it might actually give you the full answer. And now let's ask it a question about the 2016 Winter Olympics, which, by the way, didn't happen; there were no Winter Olympics in 2016. So if you ask it: "Sorry, I don't know." All right.
Now let's change the header so that we don't say "be truthful." We will remove the requirement for it to be truthful and see what happens. All right: which athletes won the gold?

Oh, now it's telling you about the 2022 Olympics. So it answered an irrelevant question accurately, once you remove the requirement for it to be truthful. So I guess the moral of the story is: first of all, you can use RAG to grab stuff from massive databases, and it's very heavily used in industry. That's number one. Number two, you have to be careful about these token budgets and so on and so forth. And small wording changes in the prompt can actually dramatically alter behavior, which makes it very difficult to do QA on this stuff in enterprise settings. Okay. So a lot of care has to go into it. And you have seen examples; for instance, Air Canada had a chatbot that gave the wrong advice to a customer.
The customer sued Air Canada, the court ruled in favor of the passenger, and then they pulled the chatbot off the website. Right? So you've got to be very careful. I think that without a human in the loop checking these answers, it's kind of dangerous, in my opinion, in its current state. Hopefully it'll get better; there's a lot of potential, but you have to be careful. All right. So this is what we have. And you can actually take this code here and use it. You can take, say, a thousand-page PDF that you might have, chunk it, and use this approach. I've done it for a whole bunch of different things, and it actually works really well. It'll make errors here and there, but most of the time it works really well. Okay. So, um, yeah.

>> Sorry, just a question.
>> When GPT-4 now lets you upload PDFs, is it chunking them, or is it actually ingesting the whole thing?

>> No. When you upload something, because GPT-4 Turbo has 128,000 tokens, which means it can accommodate a whole long batch of documents, it's not doing any chunking. The chunking we're talking about, you have to do yourself. The LLM doesn't even know you're doing it. As far as the LLM is concerned, it's only seeing the prompt it sees, and the prompt says: "Hey, here's a bunch of information, here's a question. Answer it for me using this information. Be truthful." That's it.

Now, when you ask these things a question about something more recent than their training data, you will actually see GPT-4 doing a Bing search and things like that.
What's actually going on is that there's a pre-processing step: a program does a Bing search, gathers a bunch of Bing results, takes the top few, chunks them, embeds them, packs them into a prompt, and sends it into GPT-4, and you don't know that all this is going on under the hood. So when it's "thinking" and says "Bing search," that's what's happening under the hood.

Was there a question somewhere here? No? Oh, sorry. Yeah.

>> I have a question about formatting. It seems to be able to understand and ignore irrelevant formatting, even when there are colloquial tables rather than really well-defined tables. And when it outputs formats, it does so really naturally. Is that something it's figuring out through the neural network, or is it somehow explicitly programmed in?

>> There is no explicit programming going on.
It's typically because, in a lot of the question-answer pairs used for supervised fine-tuning, instruction tuning, and reinforcement learning, the better answers to the same sort of badly formatted input are rewarded, ranked higher. That's what's going on. But on a related note, one thing that's very useful is that you can ask it to give you the answer back in certain formats, like Markdown and JSON and things like that. And by forcing it to adhere to a certain well-defined format, you actually increase the chance of it getting the right answer in the first place.

Again, there's a whole tangent here we could go into, but those are some of the things that are part of prompt engineering. All right, so that's what we have here. Back to the PowerPoint.

So that's retrieval-augmented generation, and we finally come to fine-tuning. Up to this point, all the things we have seen don't alter the internals of the LLM.
You have not messed around with the weights or changed them at all; you're just using the model as a black box. Right? With fine-tuning, you actually will train it further, meaning the weights are going to change. Okay. So now remember, we take something like a causal LLM, like GPT, right? (And I haven't fixed this slide yet: there is no ReLU here, as I mentioned earlier; just remember that.) And then, if you have domain-specific input-output examples, you can just train it like this, okay: input, and then the shifted output. And that will update these weights, right, all these weights. So this is basically fine-tuning, exactly like we saw with BERT and so on, and even with ResNet it's the same sort of thing. Okay, that is fine-tuning. Now, before we discuss the mechanics of how to do it, I want to show you a quick example of the usefulness of fine-tuning. So imagine for a second that we want to generate synthetic product reviews from product descriptions.
So we are building some product which can simulate customer behavior in e-commerce, and for that we need to be able to generate the kinds of reviews that customers might come up with, right? And writing a lot of reviews by hand is very time-consuming. But what you can do is get a whole bunch of product descriptions from the internet. So let's say you ask an LLM: "Hey, write a positive product review using this information," product description here, and it comes up with this: timeless, authentic, iconic. Right? Seriously, do product reviewers actually write stuff like this? No. This looks like marketing copy. It reads like marketing copy because there's a whole bunch of marketing copy on the internet. So it's not good. It doesn't feel like a review. It's not authentic, right? Here's another example, for Urban Outfitters, and it says the boxy and cropped silhouette is "flattering on all body types." Come on. Okay, so it's not going to work. So, what we do is fine-tune the LLM.
We can take an LLM and fine-tune it with (instruction, product description, product review) examples. Okay, that's what we can do. So, for instance, we can take something like this. Let me zoom into this thing.

So it says here: "Write a positive review for the following product," and then the description is the input, and the output is the review itself: the best, my husband's favorite, they fit well. Right? These feel like product reviews. So you just have to get a few hundred of these product-review examples. Okay, just a few hundred, and you may not even need that many. And once you do that, you basically do the fine-tuning like I showed earlier, you know: instruction, input, output, and then you take that output, shift it by one position, and make it the actual label, the actual output. Fine-tune a bunch of times, gradient descent, the weights get updated. Now you have a new, updated LLM.
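The "take the output and shift it by one to make the label" step can be shown concretely. With made-up token IDs standing in for words, the training target at each position is simply the next token: the same next-word-prediction setup as pre-training, now applied to your domain examples. This is a framework-agnostic sketch, not any particular library's API.

```python
def make_training_pair(token_ids):
    """Next-token prediction: the model sees tokens [0..n-1] and must
    predict tokens [1..n], i.e. the same sequence shifted left by one."""
    inputs = token_ids[:-1]
    labels = token_ids[1:]
    return inputs, labels

# e.g. hypothetical token IDs for "<instruction> <description> best jeans ever <eos>"
sequence = [101, 57, 903, 411, 88, 102]
inputs, labels = make_training_pair(sequence)
print(inputs)  # [101, 57, 903, 411, 88]
print(labels)  # [57, 903, 411, 88, 102]
```

At every position the label is the token one step ahead, which is exactly what "input and then the shifted output" on the slide means.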
And when you do that, now, for the same prompts, here's what you get. Write a review: "These are the best jeans I've ever owned. I've been wearing them for a few weeks and they still look brand new." Right? It looks much better. It doesn't read like marketing. And this is completely fake, by the way; the model came up with it after the fine-tuning. And then we say, "Write a horrible review," because we want to be balanced: "These are the worst jeans I've ever worn. They're too tight here and there. I'm going to return them and try a 30, but I'm not optimistic. I'm going to stick with Levi's." Phew. Okay.

So these read like real reviews. Just by taking a few hundred examples and fine-tuning on them, you completely change the behavior to what you want for your particular use case. That's the key thing.
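For reference, hosted fine-tuning services typically accept a file of such examples with one JSON object per line (JSONL). The `messages` shape below follows OpenAI's chat fine-tuning format as of this writing; treat the exact schema as something to verify against current documentation, and note the review text is invented for illustration.

```python
import json

# One (instruction + description -> review) training example, chat-style.
example = {
    "messages": [
        {"role": "user",
         "content": "Write a positive review for the following product:\n"
                    "Slim-fit stretch jeans, mid rise, five pockets."},
        {"role": "assistant",
         "content": "Best jeans I've bought in years. They fit well and "
                    "still look brand new after weeks of wear."},
    ]
}

# A fine-tuning file is just a few hundred of these, one per line.
line = json.dumps(example)
print(json.loads(line)["messages"][1]["role"])  # assistant
```

The assistant turn is what the model learns to produce; the user turn is the conditioning input.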
So for me, the biggest benefit here is that while it took billions of sentences to pre-train the original LLM, and then tens of thousands of examples to do supervised fine-tuning and RLHF and so on and so forth, to make it work for your narrow business use case you only had to spend a couple hundred examples. That's it. It's amazing. Imagine if you had to collect, say, 30,000 examples to make it work — nobody's going to do that; it's too much work. But a couple hundred, anybody can do. That's why it's so powerful to fine-tune these things. Yeah?

>> You talked about being able to — you know, in industries where you don't want to put some of this stuff on the internet — download the pre-trained model and do this on your own.
Would you still need — talking about compute power, with some of the computers we have now, GPUs, I don't know how powerful they are — are you able to do some of these very small use cases on those types of devices?

>> Perfect question — and we're going to get to that, because the short answer is: it's hard. Yes, it's just a few hundred examples, but actually trying to fine-tune these big models on consumer-grade hardware is not easy. You have to make certain tricks and simplifications, which is the next topic.

>> Is tuning always supervised — like, you need those pairs — or could you do it if the company has less structured data?

>> No, you can. The thing is, it depends on whether you want to make it generally smart about the company's business details, in which case you can just take a whole bunch of text and do next-word prediction on it. It's going to get smarter about things generally. But that doesn't mean it's going to specifically follow your instructions on your particular business problem.
So if you want it to follow instructions, you need supervision.

Okay. All right — those three were great reviews. So, for small LLMs like GPT-2, fine-tuning isn't difficult — to go back to your question, you can actually do this with small models. For example, Google has released this thing called Gemma, which came out recently. It's a small model — something like two billion parameters for the smallest one, if I remember right — and those things will typically fit into one GPU, and you can fine-tune them. You still need GPUs, just to be clear, but they will fit into one. But if you want to use a larger model, it won't fit, so to make this work you have to do other things — and that's what we're going to talk about now. There's a family of models called Llama — Llama 2. These are open-source LLMs, and they're widely used for fine-tuning, because you can just download the model and do whatever you want with it. It's open.
I mean, it's not strictly open, because there are some footnote considerations you've got to worry about, but for most purposes it's open enough, in my opinion. So let's see how hard it is to build the biggest model in this family, which is the Llama 2 model with 70 billion parameters. Okay — 70 billion parameters. First of all, the model is gigantic. 70 billion parameters, and let's say we store each parameter in two bytes. And then on each parameter we'll actually need a multiplier, to store various details about how the optimization is done — we won't get into the details here. The one thing I do want to point out is that the "3 to 4x" on the slide should really be "1 to 6x"; I didn't have a chance to change it this morning. But the point is that it's going to be a huge model: even with these numbers, it's something like 480 to 560 gigabytes just to hold the model in memory and manipulate it.
So if you use a GPU like an A100 or an H100 — these are Nvidia GPUs — each of them typically has 80 GB of memory. So we need between six and seven of them to accommodate this thing. Six to seven GPUs just to accommodate it. That's the first problem: the model is big, and just to hold it and work with it you need lots of GPUs. The second problem: Llama 2 was trained on two trillion tokens of text. Two trillion tokens. Now, these GPUs can process about 400 tokens per GPU per second — by "process" I mean the forward pass through the network. So if you use seven GPUs, it's going to take you about 8,000 days. Say you want to do it in about a month instead: you'd need on the order of 2,000 GPUs, and at a cost of $2.25 per GPU per hour, this will cost you around $4 million. And we'd expect the actual cost to be a lot higher than this, because it's very optimistic — it assumes you do just one pass through and you're all done. In general, you'll make some mistakes.
You have to 2004 01:11:20,640 --> 01:11:23,440 do it a bunch of times and so on and so 2005 01:11:21,920 --> 01:11:25,920 forth. So this is overly optimistic 2006 01:11:23,439 --> 01:11:27,439 estimate and that is 4 million. So you 2007 01:11:25,920 --> 01:11:29,679 need lots of GPUs and you need to spend 2008 01:11:27,439 --> 01:11:32,000 a lot of money for it. Now what can we 2009 01:11:29,679 --> 01:11:34,000 do with fewer resources? 2010 01:11:32,000 --> 01:11:35,760 First of all, you you need to reduce the 2011 01:11:34,000 --> 01:11:36,880 size of the data set. The second thing 2012 01:11:35,760 --> 01:11:38,960 is you want to reduce the memory 2013 01:11:36,880 --> 01:11:41,199 required. So we can ideally do it on 2014 01:11:38,960 --> 01:11:45,600 many fewer GPUs, hopefully even one GPU 2015 01:11:41,198 --> 01:11:47,119 literally on Collab. Okay. And so now we 2016 01:11:45,600 --> 01:11:49,360 have good news on the data front because 2017 01:11:47,119 --> 01:11:51,519 as I mentioned earlier, while it takes a 2018 01:11:49,359 --> 01:11:53,599 lot of data to build these models, to 2019 01:11:51,520 --> 01:11:55,440 fine-tune them for your specific data 2020 01:11:53,600 --> 01:11:57,520 for use case, you may just need a few 2021 01:11:55,439 --> 01:11:59,839 hundred examples. Okay, it's no problem 2022 01:11:57,520 --> 01:12:01,440 at all. So the data for fine-tuning is 2023 01:11:59,840 --> 01:12:02,800 actually not a problem. Only for 2024 01:12:01,439 --> 01:12:05,359 building it in the first place, it's a 2025 01:12:02,800 --> 01:12:07,360 problem. So in fact, there's this famous 2026 01:12:05,359 --> 01:12:11,119 alpaca fine tune data set. It is 50,000 2027 01:12:07,359 --> 01:12:13,039 instruction on pairs and so for that 2028 01:12:11,119 --> 01:12:14,559 way less than the two trillion tokens 2029 01:12:13,039 --> 01:12:17,920 and that can actually be done in about 2030 01:12:14,560 --> 01:12:19,520 20 hours. 
You can fine-tune on a 50,000-example fine-tuning dataset in just 20 hours. Okay — Tomaso?

>> Could Microsoft's one-bit model drastically reduce the amount of compute?

>> Yeah. There's a whole bunch of approximations and simplifications to make all these things fit into smaller GPUs and so on and so forth, and that's one of them. So the short answer is yes, there are many possibilities, and we have to look at them very carefully, because every one of these simplifications will cost you something in terms of accuracy and the model's ability to do what it needs to do. So there's always a trade-off you have to worry about. For folks who are interested, there's a whole field called LLM quantization — Google it, and that's an entry point into the whole area. Okay. So now: how do we reduce the memory required, so that we can process the data using fewer GPUs — ideally just one GPU on Colab?
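To give a flavor of the quantization idea mentioned a moment ago: here is a toy symmetric int8 scheme, the simplest member of that family (the one-bit model asked about is a far more aggressive relative of the same idea). A pure-Python sketch, illustrative only:

```python
# Toy symmetric int8 quantization: store each weight as a signed byte plus
# one shared scale factor, instead of a 2- or 4-byte float.

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] with one shared scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.8, -1.27, 0.031, 0.5]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# Each weight now costs 1 byte, at the price of a bounded rounding error:
assert all(abs(a - b) <= s / 2 + 1e-9 for a, b in zip(w, w_hat))
```

The trade-off he mentions is visible even here: memory drops by 2–4x, but every weight is rounded to the nearest step of the scale, and the coarser the quantization, the larger that error becomes.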
So, if you look at what actually consumes memory: you have all the model parameters — 70 billion parameters times two bytes each is 140 GB. Gradient computations are another 140 GB, to hold the gradients. And then the optimizer state is 2x. As I mentioned earlier, it could really be anywhere from 1x to 6x rather than 3x to 4x, but we'll just go with these numbers for the moment. So the total is 560 gigabytes if you just naively want to use it. Now, it turns out you can't do anything about the first 140 — that's just the model. But by using a trick called gradient checkpointing, the gradient memory can be squashed close to zero. Basically you say: hey, I don't mind it running longer, but I don't want to use as much memory. That trick is called gradient checkpointing — we won't go into the technical details, but it can go to zero. And then this thing here, the optimizer state — it turns out even this can be squashed very close to zero, and that was actually a breakthrough from maybe a year ago.
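The memory budget he just walked through, written out in code (using the slide's 2x optimizer multiplier, which as he notes could really be anywhere from 1x to 6x):

```python
# Naive fine-tuning memory budget for Llama 2 70B, per the lecture's figures.
params = 70e9                # 70 billion parameters
bytes_per_param = 2          # 16-bit storage

weights_gb = params * bytes_per_param / 1e9      # 140 GB for the model itself
gradients_gb = weights_gb                        # another 140 GB for gradients
optimizer_gb = 2 * weights_gb                    # 2x multiplier -> 280 GB
total_gb = weights_gb + gradients_gb + optimizer_gb

gpus = total_gb / 80                             # 80 GB per A100/H100
print(f"{total_gb:.0f} GB total -> {gpus:.0f} GPUs")  # 560 GB total -> 7 GPUs
```

The tricks that follow attack the second and third terms: gradient checkpointing shrinks the gradient/activation term, and the LoRA-style approach shrinks the optimizer-state term by making almost every parameter a frozen one that needs no optimizer state at all.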
To do that, what we're going to do is say: look, you know what? There are a whole bunch of weights here, but we're only going to take the matrices inside each attention layer — we're going to look only at those matrices and freeze everything else. So we take only a small set of parameters, unfreeze them, update them, and see if that's any good — if it actually gets the job done — instead of unfreezing everything and updating it all. And if you look at one of these weight matrices — say the key weight matrix, call it A_K — in Llama 2, this is roughly an 8,000-by-8,000 matrix, which means there are about 64 million parameters inside each of these matrices. 64 million. Okay. So imagine this matrix A_K, and as a thought experiment, you do the fine-tuning and the numbers have changed, right?
As a result of the fine-tuning, you can imagine that the resulting matrix is just the original matrix you had plus the changes — the original plus the changes — and we call the changes delta A_K. Of course, in general this change matrix is also going to be 8,000 by 8,000, another 64 million numbers. So the question is: can we make this change matrix smaller? And making it smaller seems reasonable, because a fine-tune should only make small changes to just a few weights — by definition: a couple hundred examples, you do some fine-tuning, hopefully only a few weights change, and maybe they won't change a whole lot, right? So the key insight here is that maybe we can force this change matrix to be kind of simple and still get the job done. And it turns out you can. What you do is think of this change matrix as really coming from two thin, skinny matrices which, if you multiply them the right way, give you back (approximately) that matrix. I'm not going to get into the mathematical details here.
This is called a low-rank approximation. The point is that you can take two very small matrices and, if you multiply them the right way, you can approximate the original matrix. And these two matrices are much smaller, because each one is only about 8,000 values — so the pair has on the order of 16,000 parameters, which is about 0.02% of the original 64 million.

This technique is called low-rank adaptation, or LoRA, and it's incredibly widely used in industry. So what we do is: we freeze all the original parameters, we initialize these change matrices to zero, and then we update just those two skinny matrices using gradient descent — we update only those. And when you do that, everything will fit into memory, which means the whole thing fits in and you can just use, say, two GPUs and get the job done.
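A minimal sketch of the LoRA idea just described: freeze the big matrix and learn only two skinny matrices whose product is the change. The names, sizes, and rank here are toy choices (real setups use something like rank 8–64 on 8,192-wide matrices); one skinny matrix starts at zero so the product — and hence the model — is initially unchanged:

```python
import random

d, r = 8, 1                      # toy stand-ins for ~8,192 and a small rank
A_K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]  # frozen base weights

B = [[0.0] * r for _ in range(d)]                                 # d x r, zero-initialized
C = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(r)]  # r x d

def effective_weight(A, B, C):
    """A + B @ C: the fine-tuned matrix, without ever storing a full d x d delta."""
    return [[A[i][j] + sum(B[i][k] * C[k][j] for k in range(r))
             for j in range(d)] for i in range(d)]

W = effective_weight(A_K, B, C)
assert W == A_K                  # B is all zeros, so nothing has changed yet
trainable = d * r + r * d        # only B and C receive gradient updates
assert trainable < d * d         # 16 trainable values vs 64 frozen ones
```

During fine-tuning, gradient descent touches only B and C, so the optimizer state that dominated the memory budget above now scales with 2·d·r instead of d², which is what lets the whole thing fit on a GPU or two.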
And if you use Llama's smaller models — 7 billion or 13 billion parameters — they can be fine-tuned comfortably on a single GPU, a single Colab GPU. So, all right — it's 9:54 and time does not permit, so: I have a Colab notebook on how to do the fine-tuning using this technique, and I'll do a video walkthrough tomorrow or the day after. And I'm done. Thanks, folks. Have a good rest of your week. [applause] Thank you.