Okay, so let's continue the journey we started last time. If you remember, in the last class we showed how we can build an autoregressive large language model, aka a causal large language model, using the idea of a transformer causal encoder. Then we showed how you can take a bunch of sentences, use next-word prediction, just run it through, and boom, you get GPT-3. So that's what we saw last time. I want to point out an important clarification slash correction, which is that when we work with these kinds of causal large language models, unlike when we work with BERT for instance, when the contextual embeddings come out you don't actually have to use ReLU activations. You can literally just run them through a single dense layer with linear activations, then pass that into a softmax, and boom, you're done. That's how GPT-3 and all these models are trained. And the other thing I want to
point out, which may not have been clear, is that what is coming out of this dense layer, this vector, is as long as your vocabulary. Only then, when it goes into the softmax, do you get probabilities which are as long as your vocabulary, which means that you get to pick one word or token out of that entire 50,000-long vocabulary. I just want to point that out, because I think it's easy to get a little confused by this difference between the way masked language models like BERT work and causal language models like GPT-3 work. Okay, so now let's continue. We know how to build GPT-3. So what about GPT and GPT-2? What's up with them? Why is GPT-3 so famous and not GPT-2? Well, first of all, you folks know that GPT stands for generative pre-trained transformer. Now, GPT-3, GPT-2, and GPT-1 were trained in basically the same fashion.
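Before moving on: the head just described, one dense layer with linear activation whose output length equals the vocabulary size, followed by a softmax, can be sketched with toy dimensions. The sizes and weights below are made up for illustration, nothing like GPT-3's real ones:

```python
import math

def lm_head(contextual_embedding, W, b):
    """One dense layer with linear activation, then a softmax.

    contextual_embedding: a single token's transformer output (d floats).
    W: vocab_size x d weight matrix, b: vocab_size biases.
    Returns one probability per token in the vocabulary.
    """
    logits = [sum(w * x for w, x in zip(row, contextual_embedding)) + bj
              for row, bj in zip(W, b)]
    # numerically stable softmax over the vocab-sized logit vector
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy numbers: embedding dimension 2, vocabulary of 3 tokens.
probs = lm_head([1.0, 2.0],
                W=[[0.1, 0.2], [0.3, 0.1], [0.0, 0.5]],
                b=[0.0, 0.0, 0.0])
```

The key point is that `probs` has exactly one entry per vocabulary token, so taking an argmax or sampling over it picks the next token out of the whole vocabulary.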
Predict the next word, in the same fashion, with the same sort of transformer stack, except that GPT-3 was trained on much more data, because the underlying transformer stack had many more layers. It is a much bigger stack, meaning lots more parameters, and therefore you need lots more data to train it well. That was really the only difference. The difference was literally one of scale: scale of network and scale of data. And unlike GPT and GPT-2, GPT-3, even though it was trained basically the same way with the same kind of network, was one of those situations where more became different. There was almost some sort of phase change between two and three. Unlike GPT and GPT-2, GPT-3 could do amazingly coherent continuations of any starting prompt. So for example, if you take this little prompt, "The Importance of Being on Twitter" by Jerome K. Jerome, who was a famous humorist, and you give it this prompt ending with the word "it," it produces this continuation, which is really strikingly good.
And if any of you have read Jerome K. Jerome and you read this thing, you'll be like, "Wow, that actually sounds like Jerome K. Jerome." So, amazing continuations. But the interesting thing here is not so much the continuation; it's the fact that if you give the same prompt to GPT-2 or GPT, it won't be very good. In fact, after the first one, two, or three sentences it'll become incoherent, meander, and start rambling. This thing can keep faking it for a lot longer. That's the amazing thing that was unexpected; researchers did not expect this. But it wasn't good at following your instructions. So for instance, if you ask it, "Help me write a short note to introduce myself to my neighbor," this is the kind of thing it'll come up with. And you can actually run it yourself. You can go to GPT-3 on the playground; I think GPT-3 is still available in the playground. If it is, you can try running these prompts. You will start getting garbage very quickly.
So for example here, "Help me write a short note." It says, "What's a good introduction to a resume?" "Résumé" for some reason has glommed onto "note"; I have no idea why. But the reason it's doing stuff like this is that a lot of the training data it was trained on is basically lots of lists of things. So when you say, for example, "The capital of France," and ask it to continue, it'll come back with "The capital of France is Paris, the capital of Hungary is Budapest," and so on. It just starts coming up with a list. So it's very list-driven; it thinks that you need to complete some sort of list. That's what's going on here. And so it's not very good. It doesn't realize that you're actually asking it to do something specific.
This is the problem when you have an autocomplete that doesn't realize what you're asking it; it just thinks it's an autocomplete. Now, in addition to these unhelpful answers, it can also produce offensive answers, factually incorrect answers, and so on and so forth. The list of bad things it can do is long. So why does it do that? Why does it produce unhelpful answers? Well, as you recall, it was only trained to predict the next word. It wasn't explicitly trained to follow instructions. So it seems reasonable that if it's simply trying to guess the next word repeatedly, it can't really do anything more. How could it figure out that there's an instruction it needs to follow, unless the training data on the net was all instructional, which it clearly is not? So, light bulb idea: let's explicitly train it with instruction data. And so OpenAI developed an approach called instruction tuning to do exactly this. And this paper is the paper that was the breakthrough. This is what actually put ChatGPT on the map.
And it's very readable, so I would encourage you to check it out if you're curious. So we had GPT, GPT-2, GPT-3, just bigger and bigger models trained the same way. Then we run into the problem that it can't handle instructions. So we do instruction tuning to get to 3.5, also called InstructGPT. And then a small tweak after that gets you ChatGPT. And by the way, there are really two things going on in this step, as you will soon see. I'm just calling it instruction tuning so that I don't have to say some long thing every single time; this is not a consistent piece of terminology, so just be aware of that. All right, first step: they got a bunch of people to write high-quality answers to questions, and they created about 12,500 such question-answer pairs. So for example, let's say this was the question: "Explain the moon landing to a six-year-old in a few sentences."
Believe it or not, GPT-3's answer to that question was another question, because it thinks there's a list of questions it needs to autocomplete. So it comes up with "Explain the theory of gravity to a six-year-old." It's like one of those people who, when you ask them a question, ask you a question back. So what they did is they said, "Okay, let's create a nice answer to this question." And here's a human-created answer: "People went to the moon in a big rocket, walked around," and so on. A much better answer to that question. And once you create these 12,500 question-answer pairs as training data, we just train GPT-3 some more, using next-word prediction as before. No difference. So here is the input, "Explain the moon landing..." and so on; this is the question, and then we have the answer right there. And then we take that answer, move it to the right, and shift it up, so that when it finishes "sentences," it needs to predict "People."
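The shifting being described here can be sketched in a few lines; the helper name below is ours, not from any particular library:

```python
def make_training_pairs(tokens):
    """Next-word prediction: at each position the model sees tokens[t]
    (plus everything before it) and must predict tokens[t + 1], so the
    targets are simply the input sequence shifted by one position."""
    inputs = tokens[:-1]
    targets = tokens[1:]
    return list(zip(inputs, targets))

# The lecture's example: after "sentences" the model must predict "People",
# after "People" it must predict "went", and so on.
pairs = make_training_pairs(["sentences", "People", "went", "to", "the", "moon"])
```

Each pair is (what the model sees at a position, what it must predict there), which is exactly the supervision signal used both for pre-training and for this fine-tuning step.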
And then you give it "People," it needs to predict "went," and so on and so forth. Just like we saw before: "the cat sat on the mat" became "the cat sat on" paired with "cat sat on the mat," shifted right. That's what makes prediction possible and necessary. So that's what they did. This is step one, same as before. And once you do that, and this step is called supervised fine-tuning, it turns out it really helped. Once you supervised fine-tuned GPT-3, it was much, much better at following instructions. But there's a small problem with this approach: it takes a lot of money and effort to have humans write high-quality answers to thousands of questions. It takes a lot of money. So the question is, what can we do? What is easier than writing a good answer to a question? Well, what? Okay, how about somebody from this side?
>> Yeah, Joseph.
>> Perhaps writing a question for an answer.
>> Oh, that's actually a good one. Yeah, I like that.
So given an answer, find a question. And while that is not what I'm going to talk about here, that technique is actually used very heavily in LLMs. So that's great, very creative. Mark?
>> Thumbs up, thumbs down.
>> Sorry?
>> Thumbs up or thumbs down?
>> Thumbs up or thumbs down, exactly. Because everyone loves to be a critic. It's much easier to be a critic than to be a creator. Right? So what do we do? We basically say, let's rank answers written by somebody else. Which begs the question: who's going to write those answers? And there's a brilliant answer to that question. Wikipedia? Reddit? No: we will just ask GPT-3 to write the answers. They might be crap, but we don't care, because we can rank them. So we ask GPT-3 to generate several answers to the question. And how can we generate several answers? Because we can do sampling. We can do sampling.
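Sampling is what makes several different answers possible from a single model. A toy sampler in pure Python (the logits below are made up for illustration) looks like this; a temperature near 1 keeps the draw random, so repeated runs give different continuations:

```python
import math
import random

def sample_token(logits, temperature=1.0, rng=random):
    """Draw one token index from temperature-scaled logits.

    Higher temperature flattens the distribution (more random);
    lower temperature sharpens it (closer to greedy argmax)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # inverse-CDF sampling from the categorical distribution
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r <= cumulative:
            return i
    return len(probs) - 1

# Made-up logits over a 3-token vocabulary; repeated calls vary.
token = sample_token([2.0, 0.5, 0.1], temperature=1.0)
```

Run generation like this three times on the same question and you get three different candidate answers to put in front of the human rankers.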
The fact that we had these stochastic outputs because of sampling is now a feature, not a bug. Okay, so we create lots of different answers to the question. We feed it a question and get, say, three answers out. Just run it three times, get three answers out, with a nice temperature of like 1 or 1.1 or something, so that it's nice and random. And then we literally have humans rank them: do the thumbs up, thumbs down; rank them from most useful to least useful. This step is step two of instruction tuning. So OpenAI collected 33,000 instructions, fed them to GPT-3, generated answers, and had humans rank them. And once you do this, you can assemble a beautiful training data set. So basically what we have is an instruction and, let's say, just two answers, A and B.
In practice you can have many answers which we rank, but just for simplicity I'll go with Mark's thumbs-up/thumbs-down sort of answer: let's assume you have only two answers to every question. And the human has said, "I prefer this to that." That's it. So we now have a data set where each data point is: the instruction, the preferred answer A, and the other answer B. Yeah?
>> The thumbs-up/thumbs-down technique that we're talking about, is that why ChatGPT now also uses thumbs up, thumbs down? Is it using our answers to train?
>> Exactly, right. Yeah, all the models have the thumbs-up/thumbs-down stuff going on somewhere. They are all collecting data for this step.
>> Thank you.
>> Yeah, it's sort of the old adage: if you're not sure who the product is, you are the product. So it's one of those things. Yeah.
>> So if we understand correctly, when we see thumbs up/thumbs down, it does mean that ChatGPT is going to train on our data, right?
>> Unless you opt out, yeah. So if you actually go to the ChatGPT settings, there is something called data controls or something; you can toggle it to off. But I think, when I last checked, if you toggle it to off, you lose your chat history. So they have hobbled that feature to discourage people from setting it to off as much as possible. Clever. But you can opt out, and if you use the API as opposed to the web interface, you're automatically opted out; you have to deliberately opt in. And if you use the versions that are available through Microsoft Azure and so on and so forth, there are all kinds of very safe controls and such. In fact, I think for the Microsoft Copilot license that MIT has, the default is opted out. Okay. So, to go back: once you have this data point, you can build something called a reward model. And this is a very clever piece of work.
So what you do is you have an instruction, you have a preferred answer, and you have the other answer. You feed them to a network. This is just a nice language model. And the language model produces a number which measures how good this thing is: how good an answer is this to that particular instruction. So you get a rating here and a rating here, and then what you do is run them through a little loss function which essentially encourages the model to give higher numbers to the better answer. It's the same model. You just run the question with the first answer, then the question with the second answer, and you get these two numbers. Initially those numbers are just random. But then you tell the model: hey, this is the preferred thing; make sure the preferred answer's rating, the R value, is higher than the other number, because more is better. Higher is better. Okay?
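A minimal sketch of such a pairwise loss, the one the lecture describes (take the difference of the two ratings, apply a sigmoid, take the logarithm, and negate):

```python
import math

def reward_model_loss(r_preferred, r_other):
    """Pairwise preference loss: -log(sigmoid(r_preferred - r_other)).
    The bigger the margin by which the preferred answer outscores the
    other one, the lower the loss."""
    diff = r_preferred - r_other
    sigmoid = 1.0 / (1.0 + math.exp(-diff))
    return -math.log(sigmoid)

# Ratings the model assigns to (instruction, answer A) and (instruction, answer B):
loss_right_order = reward_model_loss(3.2, 1.5)  # preferred answer rated higher
loss_wrong_order = reward_model_loss(1.5, 3.2)  # preferred answer rated lower
```

Minimizing this over all the human preference pairs is exactly the "give higher numbers to the better answer" objective: `loss_right_order` comes out smaller than `loss_wrong_order`, so gradient descent pushes the ratings in the right direction.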
And this thing is just a sigmoid: you basically take the difference of those two ratings, apply a sigmoid, and take the logarithm. You can convince yourself afterwards, and I encourage you to check for yourself, that if we give a higher number to the better answer, the loss will be lower. And since we are minimizing loss, we're essentially training the network to try to give higher ratings to better answers. That's it; that's the approach. Did you have a... yeah, Ben?
>> So you could imagine training the model on only the good answers. Is the idea of having both that the model is actually learning what makes an answer good?
>> Correct, exactly. Much like if you want to build a dog/cat classifier, you have to show pictures of both.
>> Yeah.
>> So, I understand the feedback mechanism of thumbs up/thumbs down, but there are a lot of times when the popular response is not the accurate one.
So is there a way that they actually have a layer to correct for that?
>> Yeah, good question, Swati. So as it turns out, all these companies like OpenAI have a huge document, 100 or 200 pages long, a very bulky document, which instructs and teaches the labelers, the rankers, how to rank these things. They have to follow these very strict guidelines to precisely handle strange corner cases and things like that. And that document is on the web; you can dig it up. It's actually very instructive to read through it. I think they put it out on the web because they wanted to convince people that they go to inordinate trouble to make sure the rankings are actually good. Do you have a question? Comment? Okay. All right. So, back to this: how do you train this thing?
SGD. Because you have a network, it's coming up with an answer, and you have some way to know if that answer is good or bad: better answers give lower loss. Backpropagate through the network, keep updating the weights, and boom, you're done. And once you do that, this reward model can provide a numerical rating for any instruction-answer pair. You just give it an instruction and an answer, could be a crappy answer or a good answer, and it tells you how good it is. So in this case, for example, maybe it's going to give a nice number like 1.5 for this answer, but then a better answer comes along and gets a 3.2. What we have done by doing this whole modeling exercise is that we have essentially learned how humans rank responses, because we can only have humans rank responses for some finite number of questions.
What we really want is to automate that ranking process, so that we can do it for tens of thousands of questions really fast. So we have essentially built a model of how humans rank things, which is beautiful. A lot of the stuff here is very self-referential, which I find very elegant. Anyway, this can be used to improve GPT-3 even further. So we take the instruction as before and feed it in; it gives you some answer. Then we feed this instruction and the answer to our newly minted reward model, and it gives us a numerical rating. And then, this is the key step: we take this numerical rating and use it to nudge the internal weights of GPT-3 in the right direction. This nudging uses a technique called reinforcement learning, which, just in the interest of time, we can't get into in this lecture. But that's the technique you use to nudge these things in the right direction. So that's what we do. That's reinforcement learning.
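The real InstructGPT step uses PPO on GPT-3's weights, which is beyond this lecture. The following is only a toy REINFORCE-style sketch on a two-answer "policy," with a stubbed-in reward model, just to show what "use the rating to nudge the weights" means; every name and number here is made up for illustration:

```python
import math
import random

def reward_model(answer):
    # Stand-in for the learned reward model: rates an answer with a number.
    return 1.0 if answer == "helpful" else -1.0

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rlhf_step(logits, answers, lr=0.5, rng=random):
    """Sample an answer from the policy, score it with the reward model,
    and nudge the logits so high-reward answers become more likely."""
    probs = softmax(logits)
    i = 0 if rng.random() < probs[0] else 1
    r = reward_model(answers[i])
    for j in range(len(logits)):
        # REINFORCE: reward times the gradient of the log-prob of the sample
        grad = (1.0 if j == i else 0.0) - probs[j]
        logits[j] += lr * r * grad

rng = random.Random(0)
logits = [0.0, 0.0]                     # the policy starts indifferent
answers = ["helpful", "unhelpful"]
for _ in range(200):
    rlhf_step(logits, answers, rng=rng)
# after many nudges, the policy strongly prefers the high-reward answer
```

In the real system the "policy" is GPT-3 itself and each nudge is a gradient update over all its parameters, but the loop is the same shape: generate, score with the reward model, nudge toward higher reward.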
We nudge it in the right direction. And OpenAI did this with 31,000 questions. Okay. Nudge, nudge, nudge, nudge, nudge. And when you do that, you get GPT-3.5, aka InstructGPT. Okay, that's it. And by the way, this step here is called reinforcement learning with human feedback, because we use reinforcement learning, and since humans rank the answers, which led to the building of the reward model, we get human feedback. Okay, that's reinforcement learning with human feedback. Yeah?
>> Yeah, I have a question regarding the type of questions that they're using. I can imagine maybe there are very simple questions to answer, because now you can ask GPT to, for example, respond as a pirate or something like that. It's going to be harder to train if you have a bunch of questions that involve only small interactions, and then there is the question like...
>> That's a good question.
So the quality of the questions in the dataset clearly is a big factor, because if you have simplistic questions, it won't be able to handle complex questions later on. That actually begs the question of where they got these questions from. They actually got them from their API. People were asking GPT-3 questions on the API before it became 3.5; the API was already fully commercially available, and a lot of people were building products on it by then. So they collected all those questions and filtered them for quality, and that was the question set they used. Then they judiciously added to it with human-created questions, but they couldn't do a lot of that because it's expensive. Collecting stuff that somebody else is already asking your API is very easy.
Yeah, Tomaso?
>> Uh, this might be more of a philosophical question, but the human bias that's present in the small
subset of human labelers that they've chosen gets eventually compounded in this model that we often consider the source of objective truth.
>> Yes. Yeah, that's very true. I think the reward model probably learns all the biases of the human labelers very faithfully, which is why they have these very complex frameworks and guidelines to try to mitigate bias. For example, they might give the same question and set of possible answers to many different labelers, and only if people pick the same ranking might they use it, so that at least inter-labeler bias can be minimized. But if everybody is biased in the same direction, it won't protect you against that. So yeah, in general there's a whole body of work on trying to debias these things and build them without too much bias in them. It's a whole world unto itself, which we just don't have time to get into.
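That agreement filter can be sketched in a few lines; the data layout here is an assumption, purely for illustration:

```python
# Sketch of the inter-labeler agreement filter described above: a
# preference pair is kept for training only when every labeler produced
# the same ranking. This removes inter-labeler disagreement, though not
# a bias shared by all labelers.

def unanimous(rankings):
    # rankings: one (preferred, rejected) tuple per labeler.
    return all(r == rankings[0] for r in rankings)

def filter_pairs(labeled_pairs):
    # Keep only the pairs on which all labelers agree.
    return [rankings[0] for rankings in labeled_pairs if unanimous(rankings)]

kept = filter_pairs([
    [("A", "B"), ("A", "B"), ("A", "B")],  # all agree: keep
    [("C", "D"), ("D", "C"), ("C", "D")],  # disagreement: drop
])
assert kept == [("A", "B")]
```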
Uh, Olivia?
>> Um, depending on the medium that's being returned by these models, would there be more than one reward model? Because isn't this what Gemini is running into issues with right now with their image generation, the bias that they try to...
>> Yeah. So the Gemini business that's going on, it's unclear what's causing it. It may be in this step; maybe they were a little overzealous in preventing certain things from happening. Some of these systems will also actually intercept the question that you ask and route it differently based on what they sense is in the question. So there could be pre-processing, post-processing, a lot of stuff that goes on. It's unclear to me where in the pipeline, and it could be more than one place, these things may be entering.
So yes, this may very well be where it enters: a situation where people are told, if you see this kind of answer, downrank it, don't uprank it. Then the model learns that ranking very faithfully and proceeds to apply it where it should not be applied. So that does happen. Uh, Joselyn, you had a question?
>> Um, I think I still don't totally understand why, when I ask ChatGPT a question, even in a lengthy response it doesn't wander away from the topic that I'm asking about. Understanding that it's predicting each word, it's sort of taking a random walk from one word to the next in some sense.
>> But each word it utters now becomes part of the input to the next word it utters, right? So it's not truly a random walk in that sense; the next step is not independent of the previous step. It depends on the journey so far, so it's going to try to be very consistent with the journey so far.
>> Okay. Does this part, the fine-tuning on these question-answer sets, play some role in it being able to constrain itself and not meander away?
>> I don't think so. I think this is more to make sure that the weights generally tend to produce the right answer. Now, one thing that is possible is this: when I'm a ranker looking at a few different answers, I have to figure out if the answer is helpful, if it is accurate, if it is non-toxic, things like that, and part of the rubric for evaluating these answers could be their coherence. So it could also be that they are saying short coherent answers are better than long ones, but once you adjust for length, maybe coherence is more important. It could be any number of these things. So it could play a role in that.
>> So just sort of one small follow-up.
So in other words, when it's learning from these question-and-answer pairs, it's able to look at the whole response and learn something about the whole response, rather than just one word at a time, right?
>> Correct. Yeah, the entire response is being ranked.
>> Yeah.
>> Correct.
>> Yeah. On a related note, when it's generating a new word on a topic, does the attention pertain to the entire prior text, or can you have, like, traveling attention, say the last five words?
>> So yeah, the short answer is you can; it's called sliding window attention. It can be done. They typically do it not so much because they want to focus more on the recent words, but because it makes the computation very efficient. That's why they do it. It's called sliding window attention; you can Google it.
>> So normally it's full attention?
>> Normally the default is full attention.
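The difference between full causal attention and sliding window attention is just the mask; here is a minimal sketch (real implementations apply this mask inside the attention computation):

```python
# Position i may attend to position j when the mask entry is True.

def causal_mask(n):
    # Full causal attention: each token sees itself and everything before it.
    return [[j <= i for j in range(n)] for i in range(n)]

def sliding_window_mask(n, w):
    # Sliding window attention: each token sees only the last w tokens
    # (itself included), which caps the per-token cost of attention.
    return [[i - w < j <= i for j in range(n)] for i in range(n)]

full = causal_mask(5)
windowed = sliding_window_mask(5, 3)
assert full[4] == [True, True, True, True, True]        # token 5 sees all
assert windowed[4] == [False, False, True, True, True]  # last 3 only
```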
Okay. So that's what they did. And by the way, as I think you pointed out, that's exactly what's going on: you're training the reward model with these thumbs up and thumbs down. Hold the questions for a moment. And so if you give the same question to GPT-3.5, aka InstructGPT, you get an amazing answer. Like night and day difference, an amazingly good answer. And then to go from 3.5 to ChatGPT, they basically followed the exact same playbook, except that they wanted a chatbot, meaning something that could carry on question answer, question answer, as opposed to just a single question and answer. They wanted a conversation. So they trained it on conversations. That's it. Instead of training it on instruction-answer data, they trained it on instruction answer, instruction answer, instruction answer, a sequence of such pairs strung into a conversation. That's it.
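The change in training data can be sketched like this; the role tags below are illustrative, not the actual format OpenAI used:

```python
# Sketch: the only change from instruction tuning to chat tuning is the
# shape of the training text. A single instruction-answer pair becomes a
# whole conversation strung into one sequence.

def to_training_text(turns):
    # turns: [(role, text), ...] alternating "user" / "assistant"
    return "\n".join(f"<|{role}|> {text}" for role, text in turns)

conversation = [
    ("user", "Write a short thank-you note."),
    ("assistant", "Thank you so much for your help!"),
    ("user", "Can you make it more formal?"),
    ("assistant", "I sincerely appreciate your assistance."),
]
text = to_training_text(conversation)
assert text.count("<|user|>") == 2 and text.count("<|assistant|>") == 2
```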
That is the only difference to go from 3.5 to ChatGPT. And now, given you do that, ChatGPT gives you a much nicer response, and then you can ask a follow-on question: can you make it more formal? Boom, it gives you a nice response, because now it knows about conversations; it's been trained on conversational data. So that's it. That's how they built ChatGPT, and all the things we are seeing later on are continuations of this sort of approach. So let's pause for a couple of quick questions. Swati, you had a question, then we'll go to you, and then to you. Yeah.
>> So does it make a difference if a new question-answer pair, or new training data, comes early in the building of the model or later in the building of the model?
>> You mean the order of the questions, does it matter?
>> So I might have, let's say, 5,000 images to start with. Now, after my model is trained and developed, I have a new use case that has come in.
Will it make a difference if I add it in now?
>> So if you have a new use case for which you want to adapt the model, there's a whole set of techniques you use, which is going to be the next section.
>> But it's not...
>> Yeah, because what you have out of the box is just a generally good chatbot. It knows about a lot of stuff because it's been trained on those 30 billion sentences; it can answer a lot of questions reasonably well using common sense and world knowledge. But any specific use case, like medical and so on, it may not know. So you'll need to adapt it to your particular situation, and that's coming. All right. Yes?
>> Uh, what determines whether a whole conversation is ranked positively versus a specific answer within it? Is it if the first answer doesn't get a positive response, but then after a follow-up the second one does? Is that correct?
>> Exactly.
So if you're a human and you read the transcript of an exchange between two people, and I give you two exchanges which both start with the same question, you'll be able to assess which one is the better transcript. That's basically what's going on. Uh, there was a question over here, right? Yeah.
>> So I was wondering: when you ask a question, very often you can kind of tell that something was not written by an actual person. Do you think that comes from the reinforcement learning part, or where do you think it comes from?
>> It's a good question. I don't know, because I know that part of the ranking rubric they use is to favor responses which sound more humanlike rather than robotlike. So if anything, I'm hoping that reinforcement learning would actually make it sound more humanlike, because the rankers would have prioritized that. So if it still comes up with robotic stuff, it's something else that's going on.
Maybe a lot of the text on the internet is not literature; it's just people writing some crap, right? So it could be that. Yeah.
>> How much of this instruction tuning or conversational tuning is happening in real time, within a conversation?
>> None of it.
>> None of it. So as you give feedback to the model, it's just basically regenerating, like, I don't like that answer, come up with something else?
>> No, it's not doing it in real time. Basically, whatever signals you're giving it with this thumbs-up, thumbs-down business get added to the training logs, and they periodically retrain it.
Okay. So, by the way, this is instruction tuning in a nutshell, and I want to point this out; you don't have to read the whole thing, but just quickly: this was where we had to have human involvement. In the first step, writing a lot of responses to these questions, and then ranking the answers. So these two steps are still human labor-intensive.
Now, it turns out you can actually use helper LLMs to automate this too, right? This is not what OpenAI did in the beginning with ChatGPT, but now you can do it this way, because there are lots of really good LLMs available for you to automate many of these things. We don't have time, but if you're curious, I have a little blog post on this; check it out. Okay, so now we come to the question of, well, if you want to take a base LLM like GPT-3 and make it useful and able to respond to instructions, we have seen that we had to adapt it with high-quality instruction-answer data, using supervised fine-tuning and reinforcement learning with human feedback. That's what made GPT-3 actually useful and turned it into ChatGPT. By the same token, this holds true more generally: if you want to take a large language model and make it useful for a medical use case, a legal use case, or some other narrow business use case, you have to adapt it with domain-specific data. Okay. So let's look at techniques for doing so. All right.
So adaptation is the rough name for the process of taking a base large language model and tailoring it for your particular use case. And there's sort of a ladder of things you can do, and we're going to look at every one of them. You can do this thing called zero-shot prompting, which is just, you literally ask the LLM nicely and clearly for what you want, and maybe it just gives it to you. This is the use case we're all used to in the web interface. You can also do something called few-shot prompting, where you ask it something and you also give a few examples of the kind of thing you want, and that helps it a great deal. And then there's retrieval-augmented generation, and fine-tuning; we'll look at all of them, and I'll explain all these things as we go along. Okay, so let's start with zero-shot prompting, where, by the way, the word shot is a synonym for example. So, zero-example prompting.
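Sketched in code, a zero-shot prompt is nothing but a clear instruction plus the raw input; the wording here is illustrative, not the exact prompt from the slides:

```python
# Sketch: a zero-shot prompt contains an instruction and the input,
# with no worked examples at all.

def zero_shot_prompt(review):
    return (
        "Tell me if a product defect is being described in the following "
        "product review. Answer Yes or No.\n\n"
        f"Review: {review}"
    )

prompt = zero_shot_prompt(
    "The curve of the back of the chair does not leave enough room "
    "to sit comfortably."
)
assert "Yes or No" in prompt and "Review:" in prompt
```

This string would then be sent to the model as-is; the model's reply (ideally "Yes" here) is the classification.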
You literally ask in the prompt for what you want, without giving even a single example. Okay. So let's say we want to look at product reviews and build a detector to figure out, not whether the review contains sentiment, that's kind of boring, but whether it contains some description of a potential product defect. Here is something I actually pulled off Wayfair, with apologies to Wayfair. It says: the curve of the back of the chair does not leave enough room to sit comfortably. Sounds like a defect-ish kind of thing, right? Back in the day, you would have collected all these reviews and built a special-purpose NLP-based classifier to figure out defect, yes or no. Here, you can literally just feed this thing into GPT-3 and ask it: tell me if a product defect is being described in this product review, then the review about the curve of the back, boom, and it comes back and says, yep, that's a
Okay so this zero shot 890 00:31:37,359 --> 00:31:41,199 you just ask a question you get the 891 00:31:38,398 --> 00:31:43,759 answer back. Okay and it actually works 892 00:31:41,200 --> 00:31:45,360 remarkably well and the better models 893 00:31:43,759 --> 00:31:47,359 the bigger models tend to be much better 894 00:31:45,359 --> 00:31:50,000 than the smaller simpler models for 895 00:31:47,359 --> 00:31:52,639 doing zero shot. Okay. All right. Now 896 00:31:50,000 --> 00:31:54,079 when you adapt an LLM to a specific task 897 00:31:52,640 --> 00:31:55,919 obviously you need to carefully design 898 00:31:54,079 --> 00:31:57,759 the prompt as you folks know this is 899 00:31:55,919 --> 00:31:58,799 called prompt engineering and we're not 900 00:31:57,759 --> 00:32:00,640 going to spend much time on prompt 901 00:31:58,798 --> 00:32:02,720 engineering except I just want to give a 902 00:32:00,640 --> 00:32:04,960 simple example. So if you actually ask 903 00:32:02,720 --> 00:32:07,919 Jubid this question what is the fifth 904 00:32:04,960 --> 00:32:09,919 word of the sentence very often it'll 905 00:32:07,919 --> 00:32:11,679 give the wrong answer. 906 00:32:09,919 --> 00:32:12,960 It's very strange why it can't get this 907 00:32:11,679 --> 00:32:14,880 answer question right. It's a very 908 00:32:12,960 --> 00:32:17,440 simple question. So if it's the fifth 909 00:32:14,880 --> 00:32:18,559 word of the sentence is s right uh 910 00:32:17,440 --> 00:32:20,640 sometimes it gets it right but very 911 00:32:18,558 --> 00:32:22,000 often it'll get it wrong okay but now 912 00:32:20,640 --> 00:32:23,600 you can do a little prompt engineering 913 00:32:22,000 --> 00:32:25,278 and it'll always get it right. So for 914 00:32:23,599 --> 00:32:26,798 example you can say I'll give you a 915 00:32:25,278 --> 00:32:27,919 sentence first list all the words that 916 00:32:26,798 --> 00:32:30,398 are in the sentence then tell me the 917 00:32:27,919 --> 00:32:33,200 fifth word. 
Okay, here is a sentence, boom, it gets it right. So it's an example of how you can help it along by being very prescriptive about what you want it to do and breaking down all the steps. Don't make it guess things; then it does a great job. Anyway, there are lots of other tricks people have figured out over the last couple of years. For a long time this one was pretty hot, where you give it a question and say, let's think step by step. That actually gives it a better shot at giving you an accurate answer back. Now, this kind of thing is already baked into the LLMs. When you ask ChatGPT a question, your prompt gets appended to what's called the system prompt, and the whole thing goes into the LLM.
You never see the system prompt, and the system prompt is telling ChatGPT: think step by step, take your time, don't blurt out an answer, stuff like that. And you can just Google it; the system prompts have been jailbroken, and you can find them on the web. All right. And this is funny, this came out maybe a month or two ago: apparently, take a deep breath and work on the problem step by step works better than just work on the problem step by step. And then more recently, I literally read this two nights ago: apparently, if you have a math or reasoning question and you tell it, you are an officer on the starship Enterprise, now solve this problem for me, it's more likely to get it right.
>> Go figure. Thomas?
>> I read two more that were super fun. One was promising it a reward if it solves the problem correctly. And the other one: when the answer was, I cannot do that, I tried it on Gemini, and that turned out to be the way to solve it.
>> Nice, sort of a back-and-forth with ChatGPT: can you solve this, can you solve this.
>> Yeah, very good, excellent. One thing just on that, let's have some fun: you can say "I'm going to tip you a thousand bucks if you solve this." So this person apparently kept using this tip, and at one point it said, "You keep promising me tips and you never give me the tip, so I'm not going to solve this problem for you." Okay. So, there are many prompt engineering resources; this one came out a couple of weeks ago and I thought it was pretty good, so I just put a link to it here. So now let's look at few-shot prompting, where you give it a few examples. Let's say we want to build a grammar corrector. What you can do is give it examples of poor English and good English. You can see: poor English, "I eated the purple berries"; good English, "I ate the purple berries." And similarly, three examples, and then you end the prompt with just the poor English input.
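A few-shot prompt like the one just described can be assembled in a few lines of code. This is a minimal sketch: the "Poor English:"/"Good English:" labels mirror the slide, but the exact formatting GPT-3 was shown may differ, and the second example pair here is made up for illustration.

```python
# Minimal sketch of assembling a few-shot grammar-correction prompt:
# a list of (poor, good) example pairs, followed by the new input and an
# open "Good English:" cue for the model to complete.

def build_few_shot_prompt(examples, new_input):
    """Turn (poor, good) pairs plus a new poor-English input into one prompt."""
    parts = []
    for poor, good in examples:
        parts.append(f"Poor English: {poor}")
        parts.append(f"Good English: {good}")
    # End with the new input; the model continues after the final label.
    parts.append(f"Poor English: {new_input}")
    parts.append("Good English:")
    return "\n".join(parts)

examples = [
    ("I eated the purple berries.", "I ate the purple berries."),
    ("He go to school yesterday.", "He went to school yesterday."),
]
prompt = build_few_shot_prompt(examples, "The patient was died.")
print(prompt)
```

The model then learns the intended mapping on the fly from the pattern alone, with no weight updates.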
And then the response from GPT-3 is the good-English output with the error fixed. So this is an example of giving it a few examples of what you want, and it just learns on the fly what you have in mind, what your intention is. Okay, so that's that. Now, the ability of LLMs to learn from just a few examples, or even no examples and just a clear instruction, is called in-context learning, and that was something GPT and GPT-2 could not do. It was new in GPT-3, and it's what they call an emergent capability: it was completely unanticipated by the people who built it. All right, so that's that. Now let's look at retrieval-augmented generation; by the way, this is also sometimes called indexing. It's called RAG for short, and the idea of RAG is actually very simple. Let's say we want to ask a question to a chatbot, but we want the chatbot to leverage proprietary data that we might have. Maybe it's customer support, sort of a call center kind of
operation, and you have this massive FAQ database, a content database, and you want to give that FAQ to the chatbot along with your question, so that it can leverage the FAQ to answer the question for you, as opposed to whatever it has learned previously in its general training. So can't we just include the entire FAQ, the whole data set, in the prompt and send it in? Maybe we just take our question, take everything potentially relevant to the question, everything we have in the database, and attach it to the question. The whole thing becomes a prompt. Feed it in and say, "Hey, find it out for me." Can't you just do that?
>> Theoretically? I think something stops us.
>> The reason you can't do it is this pesky thing called the context window. For any LLM, the length of the prompt plus the output cannot exceed a predefined limit. This is called the context window.
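The constraint just described can be sketched as a simple check. This is a toy sketch: the whitespace split is a crude stand-in for a real tokenizer (such as tiktoken, used later in the Colab), and the 16,385-token limit is the GPT-3.5 Turbo figure quoted later in the lecture.

```python
# Rough sketch of the context-window constraint: the prompt's tokens plus
# the tokens reserved for the reply must fit within the model's limit.
# Splitting on whitespace is only an approximation of real tokenization.

CONTEXT_WINDOW = 16_385  # GPT-3.5 Turbo's limit, input + output combined

def fits_in_context(prompt, max_output_tokens, limit=CONTEXT_WINDOW):
    approx_prompt_tokens = len(prompt.split())  # crude token count
    return approx_prompt_tokens + max_output_tokens <= limit

print(fits_in_context("Which teams won gold in curling?", 500))  # True
print(fits_in_context("word " * 20_000, 500))                    # False
```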
Remember the max sequence length we had in our earlier models, which was the size of the sentence that could be fed in? Basically, there's a limit like that for any of these models; it's called the context window. There are only so many tokens it can accommodate, and since what goes in shapes what comes out, the limit covers the input and the output together. That's the context window. Furthermore, when you have a conversation with one of these chatbots, the entire conversation is fed in every single time. That's how it remembers what happened earlier in the conversation; it doesn't have any memory per se. Each time you ask a question, the entire thread is fed in. So initially you ask what's the square root of 17 and it gives you an answer; at first you only send in the new question, the red stuff on the slide. Then for the next question, the first question, its answer, and the second question are all fed in. Then all of these are fed in.
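The re-feeding behavior just described can be sketched like this. The message roles follow the common chat-API convention (system/user/assistant), and the replies here are canned stand-ins rather than real model output.

```python
# Sketch of how a chat thread is re-sent on every turn: the client keeps
# the full message history and sends all of it with each new question.

history = [{"role": "system", "content": "You are a helpful assistant."}]

def ask(question, fake_reply):
    history.append({"role": "user", "content": question})
    # In a real app, this is where you'd call the LLM with the whole
    # `history`; here we just record a canned reply.
    history.append({"role": "assistant", "content": fake_reply})
    return fake_reply

ask("What is the square root of 17?", "About 4.123.")
ask("And of 18?", "About 4.243.")
# The second call sent the whole thread: system + Q1 + A1 + Q2.
print(len(history))  # 5 messages so far
```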
So as the conversation goes on, you're consuming more and more of the context window. Can you imagine taking a whole FAQ, asking a question, then saying, "Well, I didn't mean that, I wanted something else," and before you know it, boom, you've blown out the context window? It's going to come back and give you an error.
>> Once you've exceeded it, does it take everything together, or does it take specific windows of it?
>> Yeah. So there's a whole research cottage industry around what to pick when your input is longer than the context window. The simplest case is a moving window: if you have a thousand tokens, you just look at the last thousand tokens. But there are some cleverer schemes where you take the earlier stuff that doesn't fit into the window, use another LLM to summarize it for you, and then attach the summary to your current prompt. I know, it gets crazy. Okay.
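The simplest "moving window" scheme just mentioned can be sketched in a few lines. As before, whitespace-separated words stand in for real tokens.

```python
# Minimal sketch of a moving context window: when the history exceeds the
# token budget, keep only the most recent tokens.

def truncate_to_window(text, max_tokens):
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return text
    return " ".join(tokens[-max_tokens:])  # keep only the tail

long_history = " ".join(f"tok{i}" for i in range(2000))
clipped = truncate_to_window(long_history, 1000)
print(len(clipped.split()))  # 1000
print(clipped.split()[0])    # tok1000
```

The summarize-the-overflow schemes replace the dropped prefix with an LLM-generated summary instead of discarding it outright.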
So for all these reasons, we need to pick and choose what we send in to answer a particular question. Since we can't include the whole thing, we first retrieve the relevant content from the database or the FAQ and then send it to the LLM along with our question. So: retrieval-augmented generation. That's what's going on. Make sense? Pictorially, let's say this is our external set of documents; think of it as the FAQ. We take each question-and-answer pair in the FAQ, treat it as its own little unit of text, and calculate a contextual embedding for each of those question-answer pairs. Remember, we know how to do contextual embeddings, right? That's a piece of cake at this point. You folks know how to do it: run it through something like BERT and you're done; you get a contextual embedding.
So you get embeddings for everything in your FAQ. Now, when a new question comes in, you take that question and calculate a contextual embedding for it too. Then you look at the FAQ chunks you have and see which of them are the most similar to your question. You grab the most similar ones, pack them into the prompt, and send it in. Maybe you have 10,000 questions but can only accommodate five of them in your prompt because the context window is small, so you pick the five you think are the most relevant to your particular question and feed those in. That's the idea; that is retrieval-augmented generation. Yeah, Rolando?
>> So does this tie in, for example, if I were to prompt and say "Help me work on my startup pitch, but in the voice of Steve Jobs"? Is it then kind of going out there, reducing the subset of data to things that have been written by Steve Jobs, and then generating its response based on that?
>> Not as a default, typically, because there's a lot of Steve Jobs material on the web, and it's just using that because it's all part of its pre-training data. But this tends to be more useful for very targeted applications where you don't expect it to know the answer, because the answer is not on the public internet. It's your proprietary data, you want it to use that proprietary data, and this is how you do it. Yeah?
>> Sure, but there will be some loss.
>> There will be some loss, because you have to figure out how to chunk it right. Maybe you have a 300-page PDF; maybe you look for each section and make it a chunk.
Maybe you look for each paragraph and make it a chunk. Again, there's a whole empirical cottage industry of techniques for doing these things better or worse depending on the use case and so on. But the conceptual idea is: chunk and embed.
>> Chunking is another issue.
>> Yeah. In fact, we're going to do it ourselves in the Colab right now.
>> Yeah.
>> Can we give more weightage to certain chunks? [laughter]
>> In the default implementation, no. But in some sense, by picking the five most relevant chunks out of 10,000, you're giving the other 9,995 chunks a weight of zero and these a weight of one. So in some sense you are weighting it.
>> Yeah.
>> I was just curious how much structure you have to have in an external document, say for a hospital or something. Do you have to do a bunch of labeling?
>> No, you just need to make sure it's relatively clean.
But you will see in the Colab that it can be kind of crappy and it still works, because there is so much crap on the internet it has been trained on already. Okay, so let's look at the Colab. By the way, retrieval-augmented generation is, in my opinion, the most prevalent business application of LLMs that I've seen to date, and there's a huge ecosystem of tools and vendors around it. I'm going to skip through the verbiage here. So, you have to install the OpenAI library and this thing called tiktoken, which we'll get to in a bit. I've already installed them before class because it takes some time, so I'll just make sure all these things are already in. Good, so we don't have to wait for this.
So I've imported pandas as before, and you can read through these things. Basically, I have an OpenAI key, an API key rather, that I have to use. I'm not showing you the key, obviously; I have to remember to delete it before I upload the Colab. You have to get your own key to make it all work, but the instructions are here. We're going to use GPT-3.5 Turbo to demonstrate RAG, so I give it the name of the model. OpenAI also has a whole bunch of different models for embeddings: you can feed one a sentence or a chunk of text and it will give you a contextual embedding back. It's a nice little API; you don't have to use your own BERT and so on, you can just use the OpenAI embeddings. Obviously you have to pay OpenAI every time you make a request, but it's really, really cheap at this point. Yep, a question?
>> About dealing with proprietary data: a lot of companies are like, we need to invest in our own LLM because we don't want our data going down this
kind of path. In that context, how good is the cybersecurity, or the compliance and legal side?
>> I think each vendor has their own set of rules and contractual commitments they're willing to sign up for, so you just have to check.
>> If you use the data here, does it go into the public domain or not?
>> No, but the vendor gets to see it.
>> Okay.
>> Right, meaning the vendor's systems get to see it. But do the vendor's employees get to see it if they need to? Unclear. Those are the legal nitty-gritty details you have to worry about. The other thing you can do is just download an open-source LLM and do it all on your own premises. That's totally possible to do. In fact, I probably won't have time today; I have a whole section on how you actually do fine-tuning with an open-source LLM, which I'll do as a video if we don't have time. Okay.
So this model, text-embedding-ada-002, is the name of the OpenAI model that gives you contextual embeddings, and we're going to use that. The use case here is that we want to create a chatbot that can answer questions about the 2022 Olympics, random questions you might have about the Olympics. So let's first ask it a question about the 2020 Summer Olympics. That's the query, and this is the API request we have to make; you can read through it, and I've linked to the documentation here for how it works. It says that Barshim of Qatar and Tamberi of Italy both won the gold, and you can fact-check this: it's actually accurate, it's correct. So now let's change the query and ask about the 2022 Winter Olympics, and why '22 versus '20 will become clear in just a moment.
So: which athletes won the gold in curling in the 2022 Olympics? And it says the gold medal in curling was won by the Swedish men's team and the South Korean women's team. Turns out, if you fact-check this, wait for it: Sweden did win the men's gold, and yes, the South Korean team participated, but Great Britain actually won the women's gold. So it got it wrong. It sounds like GPT-3.5 Turbo could use some help. The reason GPT-3.5 Turbo didn't know about this is that its training cutoff date was September 2021. As far as it's concerned, the 2022 Olympics haven't happened yet, so it confidently gave you the wrong answer, as it is often prone to do. This, by the way, is called hallucination: it gives you a very eloquent, confident, wrong answer. Or, as some folks have said about another business school that shall remain nameless: often in error, but never in doubt.
So, all right, back to this. One simple thing we can try right off the bat is to ask GPT-3.5 Turbo to say "I don't know" when it doesn't know, rather than just make stuff up. And how do you do it? It's very simple. You say in your prompt: "Answer the question as truthfully as possible, and if you're unsure of the answer, say 'Sorry, I don't know.'" Then here's the question; this is the query. Let's run it through. "Sorry, I don't know." Not bad, huh? So it worked. It's sort of trying to be humble and honest and self-aware and things like that. It's more like a Sloan at this point. All right.
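The prompt pattern just shown can be sketched in a couple of lines: prepend an instruction telling the model to answer truthfully and to admit when it doesn't know. The exact wording in the notebook may differ slightly.

```python
# Sketch of a "say I don't know" prompt wrapper: the instruction is
# prepended to the user's question before it is sent to the model.

def truthful_prompt(question):
    instruction = (
        "Answer the question as truthfully as possible, and if you're "
        "unsure of the answer, say \"Sorry, I don't know\"."
    )
    return f"{instruction}\n\nQuestion: {question}\nAnswer:"

prompt = truthful_prompt(
    "Which athletes won the gold medal in curling at the 2022 Winter Olympics?"
)
print(prompt)
```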
Now, as I mentioned earlier, you can check the cutoff date and see it's 2021. Actually, you know what, let me just open a new tab. All these cutoff dates are for the training data; for 3.5 Turbo, which is what we're using, the cutoff date is 2021. That's why. All right, so now what we can do is provide the relevant data in the prompt itself, sort of leading up to RAG here. By the way, the extra information we provide in the prompt to help it answer a question is called context; that's the lingo for it. We'll first do it manually. We'll use the Wikipedia article for the 2022 Winter Olympics, and we tell it explicitly to make use of this context, because telling it things explicitly always seems to help. So this is the thing we cut and pasted here: the Wikipedia article on curling. It's a pretty long article, it's got all kinds of stuff, and it's not even all that cleanly formatted. It's very strange; look at that.
So, to answer your question, Spencer: it can be in pretty bad shape and it still seems to work. Okay. So now: "Use the article below on the Olympics to answer the subsequent question. If you don't know, say you don't know." That's what we have; that's the query. And by the way, before I send it into the LLM, this is the actual query that's going to be sent; I'm printing it out. Look at how long the query is: "Use the article below," and here is the article, scroll, scroll, scroll, there's a whole thing, and it keeps going on, and then finally I say, "Which teams won the gold?" Okay, so let's run it. Look at that: women's curling, Great Britain. It got it right. Pretty good, right? I mean, it had to parse all that crap to find the nuggets. Nicely done. But maybe it wasn't super hard, because we literally gave it the answer. So let's make it a bit harder.
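The context-stuffed query just shown can be assembled like this. The wording is approximate; the notebook's exact template may differ, and the article string here is just a placeholder.

```python
# Sketch of manually injecting context: concatenate the instruction, the
# pasted article, and the question into one long prompt.

def build_context_prompt(article, question):
    return (
        "Use the article below on the Olympics to answer the subsequent "
        "question. If you don't know the answer, say you don't know.\n\n"
        f"Article:\n{article}\n\n"
        f"Question: {question}"
    )

article = "<full Wikipedia article on curling at the 2022 Winter Olympics>"
query = build_context_prompt(article, "Which teams won the gold in curling?")
print(query[:60])
```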
So, I noticed that this person, Oskar Eriksson, won two medals in the event. So let's ask if any athlete won multiple medals; that requires a little bit of abstraction, right? All right, same query: "Did any athlete win multiple medals in curling?" The question has changed; everything else hasn't. Hit it, let's see what happens. "Yes, Oskar Eriksson won multiple medals in curling. He won a gold in the men's event and a bronze in the mixed doubles." Pretty cool, right? Take that, Google.
So, all right, now we come to retrieval-augmented generation, where instead of doing this manually, which obviously doesn't scale, we will do it automatically. The thing you have to remember, as I mentioned just a few minutes ago, is that every LLM has a context window. For GPT-3.5 Turbo the context window is 16,385 tokens; that is the combined length of the input and the output, so we can't exceed it. By the way, GPT-4 Turbo's context window is, I think, up to 128,000 tokens, and Google Gemini 1.5 Pro (they really need to work on their names) has a context window of 1 million tokens. And in research they have tested 10 million tokens. Crazy times. All that means is that you can upload entire videos and ask it questions about the video. So, to come back to this: what we'll do is grab only the data from the Wikipedia articles about the Olympics that is relevant to our question, by using pre-trained embeddings.
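As a back-of-the-envelope sketch of that budget arithmetic: the window covers input and output together. The limits below are the ones quoted above (treat them as a point-in-time snapshot and verify against current docs), and the token counts in the example are made up.

```python
# Context windows quoted in the lecture (tokens); verify against current docs.
CONTEXT_WINDOW = {
    "gpt-3.5-turbo": 16_385,
    "gpt-4-turbo": 128_000,
    "gemini-1.5-pro": 1_000_000,
}

def fits(model, prompt_tokens, max_output_tokens):
    """The window covers input AND output, so both must fit together."""
    return prompt_tokens + max_output_tokens <= CONTEXT_WINDOW[model]

print(fits("gpt-3.5-turbo", 15_000, 2_000))  # False: 17,000 > 16,385
print(fits("gpt-4-turbo", 15_000, 2_000))    # True
```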
So again, this is the thing we talked about earlier, right? This is the picture we saw in class. The only thing I want to point out is that if you have an embedding for a question and an embedding for a chunk of text in your database, you have to figure out how related they are. And for that we can use the dot product, or something closely related that's easier for us to work with: cosine similarity. We have done cosine similarity previously; I've explained it in class. We're just going to use cosine similarity: how similar are these vectors? So that's what we're going to do. All right, so, the same picture as we saw in class. The first thing we need to do is break up the dataset into sections and then run each section through the embedding model. Fortunately, I have code here that does the chunking for you manually, which you can play around with later. But OpenAI has already given us the chunked dataset.
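As a quick sketch (this is illustrative code, not the notebook's), cosine similarity is just the dot product normalized by the two vector lengths:

```python
import numpy as np

def cosine_similarity(a, b):
    """Normalized dot product: 1.0 means same direction, 0.0 means orthogonal."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Only direction matters: scaling a vector leaves the score unchanged.
print(round(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]), 6))  # 1.0
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 6))            # 0.0
```

That scale invariance is why cosine similarity is often preferred over the raw dot product for comparing embeddings.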
So we just use that, because it's easy for us. And I've already downloaded it, because it takes about five minutes to download, and stuck it in a particular data frame here. So let's print out five randomly chosen chunks. You can see here, this is the first chunk, and look at all this messy stuff: the formatting is off, but these are all basically paragraphs and sections grabbed straight from Wikipedia with no cleaning.

Okay, now we define a simple function to send any arbitrary piece of text into the embedding model and get the contextual embedding vector out, right? And there is this little function that does that: using an embedding model, we send in a text and it gives us the vector back. So let's try it on "hodle is amazing." You should get a vector back.

Oh, come on. Don't fail me now.

All right. How long is it? 1536. Now, how about I say "hodle is incredible" instead?
Hopefully the two vectors will be quite similar in terms of cosine similarity, right? So, to calculate it, I use this particular function from SciPy. It just computes the cosine similarity, and, hit it: 0.9934. The maximum is one, right? So 0.9934 means they're very, very similar, which is comforting, because "amazing" and "incredible" are obviously synonyms. Okay. So now, given a data frame with a column of text chunks in it, we can use this function on every one of them to calculate the embeddings, and you have a function here that basically does it for you. I'm not going to run it, because it takes a long time, but you can run it later on; just be prepared to go get a cup of coffee while it works. But happily for us, OpenAI has actually already done this step, so we don't have to. It's already available in this data frame, so if you actually look at this...
And you can see here there is a text column and then there is an embedding sitting right there next to it. Okay. And these embeddings are, how long is it? 1536. 1536-long vectors. Okay. All right, so that's what we have.

Okay. So now that we have this, whenever we get a question, we calculate the question's embedding and then compute its cosine similarity with all the embeddings sitting in this data frame. Okay. So to do that we're going to define a couple of helper functions here. You can read through the Python later; it's basic Python manipulation. So let's just test this function. We have a little function called strings_ranked_by_relatedness, where you give it any input question or text, and it gives you back the top five most related chunks of text from its data frame. Okay, so let me just run this thing. Okay.
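A minimal sketch of what a function like `strings_ranked_by_relatedness` does. Note the `embed` function here is a toy stand-in for the real embedding model (a lookup of made-up 2-D vectors), so the texts and scores are illustrative only; the ranking logic is the part that mirrors the notebook.

```python
import numpy as np

# Toy stand-in for the real embedding model: maps each text to a fixed vector.
FAKE_EMBEDDINGS = {
    "curling gold medal": np.array([0.9, 0.1]),
    "Curling at the 2022 Winter Olympics": np.array([0.8, 0.2]),
    "Speed skating results": np.array([0.1, 0.9]),
    "Medal table": np.array([0.7, 0.4]),
}

def embed(text):
    return FAKE_EMBEDDINGS[text]

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def strings_ranked_by_relatedness(query, corpus, top_n=5):
    """Return (text, similarity) pairs, most related chunk first."""
    q = embed(query)
    scored = [(text, cosine_sim(q, embed(text))) for text in corpus]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

corpus = ["Curling at the 2022 Winter Olympics",
          "Speed skating results",
          "Medal table"]
ranked = strings_ranked_by_relatedness("curling gold medal", corpus)
print(ranked[0][0])  # the curling chunk ranks first
```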
So, for "curling," the things it pulls back had better involve curling and medals and so on. This one has a cosine similarity of 0.888: curling at the 2022 Olympics. That's good. Results summary, medal summary, results summary: it's all pretty good, right? Even the fifth one has a cosine similarity of 0.867, which is pretty high. So it's doing the right thing. The input text was "curling gold medal," and it's picked up the right chunks for it.

Now let's see what we can do with the original question. So here is the header I'm going to use in the prompt. I'm going to say: use the below articles to answer the subsequent question; answer the question as truthfully as possible, and if you're unsure of the answer, say "Sorry, I don't know." As before. Okay, that's our prompt. And now here's the thing: we don't want to exceed the context window, right? So we need to count the tokens we're sending in, plus the likely number of tokens we're going to get back, so that we don't exceed the budget.
So, we use this package called tiktoken for this, and it just helps you count the tokens. You can read through this; it's again some basic Python for counting tokens. And now, this is where we actually assemble the prompt. We start with the header, right? We have the header, which says be truthful and all that. Then we say: here is the question I'm going to ask you. And then we go in there and keep grabbing Wikipedia articles until the number of tokens in the prompt is about to exceed the token budget, and then we stop. Right? When you're about to exceed the budget, you stop, because you can't exceed the budget. And that's the whole thing. So, all right, let's run this function. Now, it turns out, as you saw, we can go up to 16,000-something tokens in the context window. I'm just using 3,700 as my budget, partly just to show you how to use this thing.
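The grab-until-the-budget-runs-out loop just described can be sketched like this. The real notebook counts tokens with tiktoken against a real model's vocabulary; `num_tokens` below is a crude whitespace stand-in so the sketch runs anywhere, and the header, chunks, and budget are made up.

```python
# Crude stand-in for tiktoken-based counting: one token per whitespace word.
def num_tokens(text):
    return len(text.split())

def build_prompt(question, ranked_chunks, header, token_budget):
    """Greedily append the most related chunks until the budget would be exceeded."""
    question_part = f"\n\nQuestion: {question}"
    prompt = header
    for chunk in ranked_chunks:
        article = f'\n\nWikipedia article section:\n"""\n{chunk}\n"""'
        if num_tokens(prompt + article + question_part) > token_budget:
            break  # about to exceed the budget, so stop adding articles
        prompt += article
    return prompt + question_part

header = "Use the below articles to answer the subsequent question."
chunks = ["curling results " * 10, "medal table " * 10, "speed skating " * 500]
prompt = build_prompt("Which teams won the gold?", chunks, header, token_budget=60)
print(num_tokens(prompt))  # stays within the 60-token budget
```

The chunks are assumed to arrive already ranked by relatedness, so stopping early keeps the most useful ones.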
And also because OpenAI is charging my credit card for every token I'm using, right? So I'm just being careful. It charges by the token; it's a beautiful business model. Anyway, back here. So let's ask the question: which athletes won the gold medal in curling at the Olympics? Here is the data frame to use, here is the GPT model, and don't exceed 3,700 tokens. Okay, that's the query, or the prompt. It's going to compose the prompt now, and this is the whole prompt. Okay, let's just scroll to the very top. It's really long.
Okay. So, all right: "Use the below articles to answer the subsequent question... as truthfully as possible," and boom, boom, boom, it has all these things. It's added a whole bunch of paragraphs from the Wikipedia pages, okay, and then it finally ends with the question: which athletes won the gold? Okay. All right, now let's just ask it. This is just a little function to send stuff into the API, and now we are finally ready to ask GPT the question. Fingers crossed.

All right: curling. Stefania Constantini in the mixed doubles, and the team consisting of blah blah blah in the men's tournament. And, oh, interesting, it has actually ignored the Great Britain people completely this time, I think. Last night it didn't. Welcome to stochasticity. So when you try it, it might actually give you the full answer. And now let's ask it a question about the 2016 Winter Olympics, which, by the way, didn't happen; there were no Winter Olympics in 2016. So if you ask it: "Sorry, I don't know." All right.
Now let's change the header so that we don't say "be truthful." We will remove the requirement for it to be truthful and see what happens. All right: which athletes won the gold?

Oh, now it's telling you about the 2022 Olympics. So it answered an irrelevant question accurately, once you remove the requirement for it to be truthful. So I guess the moral of the story is: first of all, you can use RAG to grab stuff from massive databases, and it's very heavily used in industry. That's number one. Number two, you have to be careful about these token budgets and so on and so forth. And small wording changes in the prompt can actually dramatically alter behavior, which makes it very difficult to do QA on this stuff in enterprise settings. Okay. So a lot of care has to go into it. And you have seen examples; for instance, Air Canada had a chatbot that gave the wrong advice to a customer.
The customer sued Air Canada, the court ruled in favor of the passenger, and then they pulled the chatbot off the website. Right? So you've got to be very careful. I think that without a human in the loop checking these answers, it's kind of dangerous, in my opinion, in its current state. Hopefully it'll get better; there's a lot of potential, but you have to be careful. All right. So this is what we have. And you can actually take this code here and use it. You can take, say, a thousand-page PDF that you might have, chunk it, and use this approach. I've done it for a whole bunch of different things, and it actually works really well. It'll make errors here and there, but most of the time it works really well. Okay. So, um, yeah.

>> Sorry, just a question.
>> When GPT-4 now lets you upload PDFs, is it chunking them, or is it actually ingesting the whole thing?

>> No. When you upload something, because GPT-4 Turbo has 128,000 tokens, which means it can accommodate a whole long batch of documents, it's not doing any chunking. The chunking we're talking about, you have to do yourself. The LLM doesn't even know you're doing it. As far as the LLM is concerned, it's only seeing the prompt it sees, and the prompt says: "Hey, here's a bunch of information, here's a question. Answer it for me using this information. Be truthful." That's it.

Now, when you ask these things a question about something more recent than their training data, you will actually see GPT-4 doing a Bing search and things like that.
What's actually going on is that there's a pre-processing step: a program does a Bing search, gathers a bunch of Bing results, takes the top few, chunks them, embeds them, packs them into a prompt, and sends it into GPT-4, and you don't know that all this is going on under the hood. So when it's "thinking" and says "Bing search," that's what's happening under the hood.

Was there a question somewhere here? No? Oh, sorry. Yeah.

>> I have a question about formatting. It seems to be able to understand and ignore irrelevant formatting, even when there are colloquial tables rather than really well-defined tables. And when it outputs formats, it does so really naturally. Is that something it's figuring out through the neural network, or is it somehow explicitly programmed in?

>> There is no explicit programming going on.
It's typically because, in a lot of the question-answer pairs used for supervised fine-tuning, instruction tuning, and reinforcement learning, the better answers to the same sort of badly formatted input are rewarded, ranked higher. That's what's going on. But on a related note, one thing that's very useful is that you can ask it to give you the answer back in certain formats, like Markdown and JSON and things like that. And by forcing it to adhere to a certain well-defined format, you actually increase the chance of it getting the right answer in the first place.

Again, there's a whole tangent here we could go into, but those are some of the things that are part of prompt engineering. All right, so that's what we have here. Back to the PowerPoint.

So that's retrieval-augmented generation, and we finally come to fine-tuning. Up to this point, all the things we have seen don't alter the internals of the LLM.
You have not messed around with the weights or changed them at all; you're just using the model as a black box. Right? With fine-tuning, you actually will train it further, meaning the weights are going to change. Okay. So now remember, we take something like a causal LLM, like GPT, right? (And I haven't fixed this slide yet: there is no ReLU here, as I mentioned earlier; just remember that.) And then, if you have domain-specific input-output examples, you can just train it like this, okay: input, and then the shifted output. And that will update these weights, right, all these weights. So this is basically fine-tuning, exactly like we saw with BERT and so on, and even with ResNet it's the same sort of thing. Okay, that is fine-tuning. Now, before we discuss the mechanics of how to do it, I want to show you a quick example of the usefulness of fine-tuning. So imagine for a second that we want to generate synthetic product reviews from product descriptions.
So we are building some product which can simulate customer behavior in e-commerce, and for that we need to be able to generate the kinds of reviews that customers might come up with, right? And writing a lot of reviews by hand is very time-consuming. But what you can do is get a whole bunch of product descriptions from the internet. So let's say you ask an LLM: "Hey, write a positive product review using this information," product description here, and it comes up with this: timeless, authentic, iconic. Right? Seriously, do product reviewers actually write stuff like this? No. This looks like marketing copy. It reads like marketing copy because there's a whole bunch of marketing copy on the internet. So it's not good. It doesn't feel like a review. It's not authentic, right? Here's another example, for Urban Outfitters, and it says the boxy and cropped silhouette is "flattering on all body types." Come on. Okay, so it's not going to work. So, what we do is fine-tune the LLM.
We can take an LLM and fine-tune it with (instruction, product description, product review) examples. Okay, that's what we can do. So, for instance, we can take something like this. Let me zoom into this thing.

So it says here: "Write a positive review for the following product," and then the description is the input, and the output is the review itself: the best, my husband's favorite, they fit well. Right? These feel like product reviews. So you just have to get a few hundred of these product-review examples. Okay, just a few hundred, and you may not even need that many. And once you do that, you basically do the fine-tuning like I showed earlier, you know: instruction, input, output, and then you take that output, shift it by one position, and make it the actual label, the actual output. Fine-tune a bunch of times, gradient descent, the weights get updated. Now you have a new, updated LLM.
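The "take the output and shift it by one to make the label" step can be shown concretely. With made-up token IDs standing in for words, the training target at each position is simply the next token: the same next-word-prediction setup as pre-training, now applied to your domain examples. This is a framework-agnostic sketch, not any particular library's API.

```python
def make_training_pair(token_ids):
    """Next-token prediction: the model sees tokens [0..n-1] and must
    predict tokens [1..n], i.e. the same sequence shifted left by one."""
    inputs = token_ids[:-1]
    labels = token_ids[1:]
    return inputs, labels

# e.g. hypothetical token IDs for "<instruction> <description> best jeans ever <eos>"
sequence = [101, 57, 903, 411, 88, 102]
inputs, labels = make_training_pair(sequence)
print(inputs)  # [101, 57, 903, 411, 88]
print(labels)  # [57, 903, 411, 88, 102]
```

At every position the label is the token one step ahead, which is exactly what "input and then the shifted output" on the slide means.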
And when you do that, now, for the same prompts, here's what you get. Write a review: "These are the best jeans I've ever owned. I've been wearing them for a few weeks and they still look brand new." Right? It looks much better. It doesn't read like marketing. And this is completely fake, by the way; the model came up with it after the fine-tuning. And then we say, "Write a horrible review," because we want to be balanced: "These are the worst jeans I've ever worn. They're too tight here and there. I'm going to return them and try a 30, but I'm not optimistic. I'm going to stick with Levi's." Phew. Okay.

So these read like real reviews. Just by taking a few hundred examples and fine-tuning on them, you completely change the behavior to what you want for your particular use case. That's the key thing.
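For reference, hosted fine-tuning services typically accept a file of such examples with one JSON object per line (JSONL). The `messages` shape below follows OpenAI's chat fine-tuning format as of this writing; treat the exact schema as something to verify against current documentation, and note the review text is invented for illustration.

```python
import json

# One (instruction + description -> review) training example, chat-style.
example = {
    "messages": [
        {"role": "user",
         "content": "Write a positive review for the following product:\n"
                    "Slim-fit stretch jeans, mid rise, five pockets."},
        {"role": "assistant",
         "content": "Best jeans I've bought in years. They fit well and "
                    "still look brand new after weeks of wear."},
    ]
}

# A fine-tuning file is just a few hundred of these, one per line.
line = json.dumps(example)
print(json.loads(line)["messages"][1]["role"])  # assistant
```

The assistant turn is what the model learns to produce; the user turn is the conditioning input.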
So for me, the biggest benefit here is that while it took billions of sentences to pre-train the original LLM, and then tens of thousands of examples to do supervised fine-tuning and RLHF and so on and so forth, to make it work for your narrow business use case you only had to spend a couple hundred examples. That's it. It's amazing. Imagine if you had to collect, say, 30,000 examples to make it work — nobody's going to do that; it's too much work. But a couple hundred, anybody can do. That's why it's so powerful to fine-tune these things. Yeah?

>> You talked about being able to — you know, in industries where you don't want to put some of this stuff on the internet — download the pre-trained model and do this on your own.
Would you still need — talking about compute power, with some of the computers we have now, GPUs, I don't know how powerful they are — are you able to do some of these very small use cases on those types of devices?

>> Perfect question — and we're going to get to that, because the short answer is: it's hard. Yes, it's just a few hundred examples, but actually trying to fine-tune these big models on consumer-grade hardware is not easy. You have to make certain tricks and simplifications, which is the next topic.

>> Is tuning always supervised — like, you need those pairs — or could you do it if the company has less structured data?

>> No, you can. The thing is, it depends on whether you want to make it generally smart about the company's business details, in which case you can just take a whole bunch of text and do next-word prediction on it. It's going to get smarter about things generally. But that doesn't mean it's going to specifically follow your instructions on your particular business problem.
So if you want it to follow instructions, you need supervision.

Okay. All right — those three were great reviews. So, for small LLMs like GPT-2, fine-tuning isn't difficult — to go back to your question, you can actually do this with small models. For example, Google has released this thing called Gemma, which came out recently. It's a small model — something like two billion parameters for the smallest one, if I remember right — and those things will typically fit into one GPU, and you can fine-tune them. You still need GPUs, just to be clear, but they will fit into one. But if you want to use a larger model, it won't fit, so to make this work you have to do other things — and that's what we're going to talk about now. There's a family of models called Llama — Llama 2. These are open-source LLMs, and they're widely used for fine-tuning, because you can just download the model and do whatever you want with it. It's open.
I mean, it's not strictly open, because there are some footnote considerations you've got to worry about, but for most purposes it's open enough, in my opinion. So let's see how hard it is to build the biggest model in this family, which is the Llama 2 model with 70 billion parameters. Okay — 70 billion parameters. First of all, the model is gigantic. 70 billion parameters, and let's say we store each parameter in two bytes. And then on each parameter we'll actually need a multiplier, to store various details about how the optimization is done — we won't get into the details here. The one thing I do want to point out is that the "3 to 4x" on the slide should really be "1 to 6x"; I didn't have a chance to change it this morning. But the point is that it's going to be a huge model: even with these numbers, it's something like 480 to 560 gigabytes just to hold the model in memory and manipulate it.
So if you use a GPU like an A100 or an H100 — these are Nvidia GPUs — each of them typically has 80 GB of memory. So we need between six and seven of them to accommodate this thing. Six to seven GPUs just to accommodate it. That's the first problem: the model is big, and just to hold it and work with it you need lots of GPUs. The second problem: Llama 2 was trained on two trillion tokens of text. Two trillion tokens. Now, these GPUs can process about 400 tokens per GPU per second — by "process" I mean the forward pass through the network. So if you use seven GPUs, it's going to take you about 8,000 days. Say you want to do it in about a month instead: you'd need on the order of 2,000 GPUs, and at a cost of $2.25 per GPU per hour, this will cost you around $4 million. And we'd expect the actual cost to be a lot higher than this, because it's very optimistic — it assumes you do just one pass through and you're all done. In general, you'll make some mistakes.
You have to 2004 01:11:20,640 --> 01:11:23,440 do it a bunch of times and so on and so 2005 01:11:21,920 --> 01:11:25,920 forth. So this is overly optimistic 2006 01:11:23,439 --> 01:11:27,439 estimate and that is 4 million. So you 2007 01:11:25,920 --> 01:11:29,679 need lots of GPUs and you need to spend 2008 01:11:27,439 --> 01:11:32,000 a lot of money for it. Now what can we 2009 01:11:29,679 --> 01:11:34,000 do with fewer resources? 2010 01:11:32,000 --> 01:11:35,760 First of all, you you need to reduce the 2011 01:11:34,000 --> 01:11:36,880 size of the data set. The second thing 2012 01:11:35,760 --> 01:11:38,960 is you want to reduce the memory 2013 01:11:36,880 --> 01:11:41,199 required. So we can ideally do it on 2014 01:11:38,960 --> 01:11:45,600 many fewer GPUs, hopefully even one GPU 2015 01:11:41,198 --> 01:11:47,119 literally on Collab. Okay. And so now we 2016 01:11:45,600 --> 01:11:49,360 have good news on the data front because 2017 01:11:47,119 --> 01:11:51,519 as I mentioned earlier, while it takes a 2018 01:11:49,359 --> 01:11:53,599 lot of data to build these models, to 2019 01:11:51,520 --> 01:11:55,440 fine-tune them for your specific data 2020 01:11:53,600 --> 01:11:57,520 for use case, you may just need a few 2021 01:11:55,439 --> 01:11:59,839 hundred examples. Okay, it's no problem 2022 01:11:57,520 --> 01:12:01,440 at all. So the data for fine-tuning is 2023 01:11:59,840 --> 01:12:02,800 actually not a problem. Only for 2024 01:12:01,439 --> 01:12:05,359 building it in the first place, it's a 2025 01:12:02,800 --> 01:12:07,360 problem. So in fact, there's this famous 2026 01:12:05,359 --> 01:12:11,119 alpaca fine tune data set. It is 50,000 2027 01:12:07,359 --> 01:12:13,039 instruction on pairs and so for that 2028 01:12:11,119 --> 01:12:14,559 way less than the two trillion tokens 2029 01:12:13,039 --> 01:12:17,920 and that can actually be done in about 2030 01:12:14,560 --> 01:12:19,520 20 hours. 
You can fine-tune on a 50,000-example fine-tuning dataset in just 20 hours. Okay — Tomaso?

>> Could Microsoft's one-bit model drastically reduce the amount of compute?

>> Yeah. There's a whole bunch of approximations and simplifications to make all these things fit into smaller GPUs and so on and so forth, and that's one of them. So the short answer is yes, there are many possibilities, and we have to look at them very carefully, because every one of these simplifications will cost you something in terms of accuracy and the model's ability to do what it needs to do. So there's always a trade-off you have to worry about. For folks who are interested, there's a whole field called LLM quantization — Google it, and that's an entry point into the whole area. Okay. So now: how do we reduce the memory required, so that we can process the data using fewer GPUs — ideally just one GPU on Colab?
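To give a flavor of the quantization idea mentioned a moment ago: here is a toy symmetric int8 scheme, the simplest member of that family (the one-bit model asked about is a far more aggressive relative of the same idea). A pure-Python sketch, illustrative only:

```python
# Toy symmetric int8 quantization: store each weight as a signed byte plus
# one shared scale factor, instead of a 2- or 4-byte float.

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] with one shared scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.8, -1.27, 0.031, 0.5]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# Each weight now costs 1 byte, at the price of a bounded rounding error:
assert all(abs(a - b) <= s / 2 + 1e-9 for a, b in zip(w, w_hat))
```

The trade-off he mentions is visible even here: memory drops by 2–4x, but every weight is rounded to the nearest step of the scale, and the coarser the quantization, the larger that error becomes.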
So, if you look at what actually consumes memory: you have all the model parameters — 70 billion parameters times two bytes each is 140 GB. Gradient computations are another 140 GB, to hold the gradients. And then the optimizer state is 2x. As I mentioned earlier, it could really be anywhere from 1x to 6x rather than 3x to 4x, but we'll just go with these numbers for the moment. So the total is 560 gigabytes if you just naively want to use it. Now, it turns out you can't do anything about the first 140 — that's just the model. But by using a trick called gradient checkpointing, the gradient memory can be squashed close to zero. Basically you say: hey, I don't mind it running longer, but I don't want to use as much memory. That trick is called gradient checkpointing — we won't go into the technical details, but it can go to zero. And then this thing here, the optimizer state — it turns out even this can be squashed very close to zero, and that was actually a breakthrough from maybe a year ago.
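The memory budget he just walked through, written out in code (using the slide's 2x optimizer multiplier, which as he notes could really be anywhere from 1x to 6x):

```python
# Naive fine-tuning memory budget for Llama 2 70B, per the lecture's figures.
params = 70e9                # 70 billion parameters
bytes_per_param = 2          # 16-bit storage

weights_gb = params * bytes_per_param / 1e9      # 140 GB for the model itself
gradients_gb = weights_gb                        # another 140 GB for gradients
optimizer_gb = 2 * weights_gb                    # 2x multiplier -> 280 GB
total_gb = weights_gb + gradients_gb + optimizer_gb

gpus = total_gb / 80                             # 80 GB per A100/H100
print(f"{total_gb:.0f} GB total -> {gpus:.0f} GPUs")  # 560 GB total -> 7 GPUs
```

The tricks that follow attack the second and third terms: gradient checkpointing shrinks the gradient/activation term, and the LoRA-style approach shrinks the optimizer-state term by making almost every parameter a frozen one that needs no optimizer state at all.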
To do that, what we're going to do is say: look, you know what? There are a whole bunch of weights here, but we're only going to take the matrices inside each attention layer — we're going to look only at those matrices and freeze everything else. So we take only a small set of parameters, unfreeze them, update them, and see if that's any good — if it actually gets the job done — instead of unfreezing everything and updating it all. And if you look at one of these weight matrices — say the key weight matrix, call it A_K — in Llama 2, this is roughly an 8,000-by-8,000 matrix, which means there are about 64 million parameters inside each of these matrices. 64 million. Okay. So imagine this matrix A_K, and as a thought experiment, you do the fine-tuning and the numbers have changed, right?
As a result of the fine-tuning, you can imagine that the resulting matrix is just the original matrix you had plus the changes — the original plus the changes — and we call the changes delta A_K. Of course, in general this change matrix is also going to be 8,000 by 8,000, another 64 million numbers. So the question is: can we make this change matrix smaller? And making it smaller seems reasonable, because a fine-tune should only make small changes to just a few weights — by definition: a couple hundred examples, you do some fine-tuning, hopefully only a few weights change, and maybe they won't change a whole lot, right? So the key insight here is that maybe we can force this change matrix to be kind of simple and still get the job done. And it turns out you can. What you do is think of this change matrix as really coming from two thin, skinny matrices which, if you multiply them the right way, give you back (approximately) that matrix. I'm not going to get into the mathematical details here.
This is called a low-rank approximation. The point is that you can take two very small matrices and, if you multiply them the right way, you can approximate the original matrix. And these two matrices are much smaller, because each one is only about 8,000 values — so the pair has on the order of 16,000 parameters, which is about 0.02% of the original 64 million.

This technique is called low-rank adaptation, or LoRA, and it's incredibly widely used in industry. So what we do is: we freeze all the original parameters, we initialize these change matrices to zero, and then we update just those two skinny matrices using gradient descent — we update only those. And when you do that, everything will fit into memory, which means the whole thing fits in and you can just use, say, two GPUs and get the job done.
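A minimal sketch of the LoRA idea just described: freeze the big matrix and learn only two skinny matrices whose product is the change. The names, sizes, and rank here are toy choices (real setups use something like rank 8–64 on 8,192-wide matrices); one skinny matrix starts at zero so the product — and hence the model — is initially unchanged:

```python
import random

d, r = 8, 1                      # toy stand-ins for ~8,192 and a small rank
A_K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]  # frozen base weights

B = [[0.0] * r for _ in range(d)]                                 # d x r, zero-initialized
C = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(r)]  # r x d

def effective_weight(A, B, C):
    """A + B @ C: the fine-tuned matrix, without ever storing a full d x d delta."""
    return [[A[i][j] + sum(B[i][k] * C[k][j] for k in range(r))
             for j in range(d)] for i in range(d)]

W = effective_weight(A_K, B, C)
assert W == A_K                  # B is all zeros, so nothing has changed yet
trainable = d * r + r * d        # only B and C receive gradient updates
assert trainable < d * d         # 16 trainable values vs 64 frozen ones
```

During fine-tuning, gradient descent touches only B and C, so the optimizer state that dominated the memory budget above now scales with 2·d·r instead of d², which is what lets the whole thing fit on a GPU or two.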
And if you use Llama's smaller models — 7 billion or 13 billion parameters — they can be fine-tuned comfortably on a single GPU, a single Colab GPU. So, all right — it's 9:54 and time does not permit, so: I have a Colab notebook on how to do the fine-tuning using this technique, and I'll do a video walkthrough tomorrow or the day after. And I'm done. Thanks, folks. Have a good rest of your week. [applause] Thank you.