Hi, everyone. Welcome to another lecture for CS230 Deep Learning. Today, we're going to talk about enhancing large language model applications, and I call this lecture Beyond LLM. It has a lot of newer content. The idea behind this lecture is: we started by learning about neurons, then we learned about layers, then deep neural networks, and then we learned a little bit about how to structure projects in C3. And now we're going one level beyond, into what it would look like if you were building agentic AI systems at work, in a startup, in a company. It's probably one of the more practical lectures. Again, the goal is not to build a product end to end in the next hour or so, but rather to show you all the techniques that AI engineers have cracked, figured out, or are exploring, so that after the class, you have a breadth of view of different prompting techniques, different agentic workflows, multi-agent systems, and evals. And then when you want to dive deeper, you have the background to dive in and learn faster. Let's try to make it as interactive as possible, as usual.

Looking at the agenda: we're going to start with the core idea behind the challenges and opportunities of augmenting LLMs. So we start from a base model: how do we maximize the performance of that base model? Then we'll dive deep into the first line of optimization, which is prompting methods, and we'll see a variety of them. Then we'll go slightly deeper: if we were to get our hands under the hood and do some fine-tuning, what would that look like? I'm not a fan of fine-tuning, and I talk a lot about that, but I'll explain why I try to avoid fine-tuning as much as possible. Then we'll do a section 4 on Retrieval-Augmented Generation, or RAG, which you've probably heard of in the news. Maybe some of you have played with RAGs. We're going to unpack what a RAG is, how it works, and then the different methods within RAGs. And then we'll talk about agentic AI workflows.
I'll define it. Andrew Ng is one of the first to have called this trend agentic AI workflows. So we'll look at the definition that Andrew gives to agentic workflows, and then we'll start seeing examples. Section 6 is very practical. It's a case study where we will think through an agentic workflow, and I'll ask you to measure whether the agent actually works, and we'll brainstorm how we can measure whether an agentic workflow is working the way you want it to work. There are plenty of methods, called evals, that solve that problem. Then we'll look briefly at multi-agent workflows. And then we can have an open-ended discussion where I share some thoughts on what's next in AI. I'm looking forward to hearing from you all, as well, on that one.

So let's get started with the problem of augmenting LLMs. An open-ended question for you: you are all familiar with pre-trained models like GPT-3.5 Turbo or GPT-4o. What's the limitation of using just a base model? What are the typical issues that might arise as you're using a vanilla pre-trained model?

Yes. It lacks some domain knowledge.

Lacks some domain knowledge. You're perfectly right. We had a group of students a few years ago. It was not LLM related, but they were building an autonomous farming device, or vehicle, that had a camera underneath, taking pictures of crops to determine if a crop was sick or not, if it should be thrown away or used. And that data set is not a data set you find out there. The base model, a pre-trained computer vision model, would lack that knowledge, of course. What else?

Yes. [INAUDIBLE] pictures are very dark [INAUDIBLE]

OK, so just to repeat for people online: you're saying the model might have been trained on high-quality data, but the data in the wild is actually not that high quality.
And in fact, yes, the distribution of the real world might differ from the training set, as we've seen with GANs, and that might create an issue with pre-trained models, although pre-trained LLMs are getting better at handling all sorts of data inputs. Yes.

Lacks current information.

Lacks what? Current information. Lacks current information: the LLM is not up to date. And in fact, you're right. Imagine you had to retrain your LLM from scratch every couple of months. One story that I found funny, from probably three years ago, or maybe more, five years ago: during his first presidency, President Trump one day tweeted, "Covfefe." You remember that tweet or no? Just "Covfefe." It was probably a typo, or the phone was in his pocket, I don't know. But that word did not exist. The LLMs that Twitter was running at the time, in fact, could not recognize that word. And so the recommender system sort of went wild, because suddenly everybody was making fun of that tweet using the word "Covfefe," and the LLM was so confused about what it meant. Where should we show it? To whom should we show it? This is an example of a broader issue: nowadays, especially on social media, there are so many new trends, and it's very hard to retrain an LLM to match a new trend and understand the new words out there. You oftentimes hear Gen Z words like "rizz" or "mid" or whatever; I don't know all of them. But you probably want to find a way to allow the LLM to understand those trends without retraining the LLM from scratch. What else?

It's trained to have a breadth of knowledge. And if you wanted to do something specialized, that might limit [INAUDIBLE].

Yeah, it might be trained on a breadth of knowledge, but it might fail or not perform adequately on a narrow task that is very well defined.
Think about enterprise applications: you need high precision, high fidelity, low latency. And maybe the model is not great at that specific thing. It might do fine, but just not well enough, and you might want to augment it in a certain way. Yeah.

Maybe it has [INAUDIBLE] so it makes the model a lot heavier, a lot slower. [INAUDIBLE]

So maybe it has a lot of broad domain knowledge that might not be needed for your application. And so you're using a massive, heavy model when you're actually only using 2% of the model's capability. You're perfectly right. You might not need all of it, so you might find ways to prune the model, quantize it, modify it. All of these are good points. I'm going to add a few more as well.

LLMs are very difficult to control. Your last point is actually an example of that: you want to control the LLM to use part of its knowledge, but it's, in fact, getting confused. We've seen that in history. In 2016, Microsoft created a notorious Twitter bot that learned from users, and it quickly became a racist jerk. Microsoft ended up removing the bot 16 hours after launching it. The community was really fast at determining that this was a racist bot. And you can empathize with Microsoft in the sense that it is actually hard to control an LLM. They might have done a better job of qualifying it before launching, but it is really hard to control an LLM.

Even more recently, there is a tweet from Sam Altman from last November, where there was this debate between Elon Musk and Sam Altman about whose LLM is the left-wing propaganda machine or the right-wing propaganda machine, and they were hating on each other's LLMs. But that tells you, at the end of the day, that even those two teams, Grok and OpenAI, which are probably the best-funded teams with a lot of talent, are not doing a great job at controlling their LLMs.
And from time to time, if you hang out on X, you might see screenshots of users interacting with LLMs where the LLM says something really controversial or racist, or something that would not be considered great by social standards, I guess. And that tells you that the model is really hard to control.

The second aspect is something that you mentioned earlier: LLMs may underperform on your task. That might include specific knowledge gaps, such as medical diagnosis. If you're doing medical diagnosis, you would rather have an LLM that is specialized for that and great at it, and, something we haven't mentioned as a group, that has sources, so the answer is specifically sourced. You have a hard time believing something unless you have the actual source of the research that backs it up.

Inconsistencies in style and format: imagine you're building a legal AI agentic workflow. Legal has a very specific way of writing and reading, where every word counts. If you're negotiating a large contract, every word on that contract might mean something else when it comes to court. So it's very important that you use an LLM that is very good at it. The precision matters.

Then there's task-specific understanding, such as doing classification in a niche field. Here I pulled an example: let's say a biotech company is trying to use an LLM to categorize user reviews into positive, neutral, or negative. Maybe for that company, something that would typically be considered a negative review is actually considered a neutral review, because the NPS (Net Promoter Score) of that industry tends to be way lower than in other industries, let's say. That's task-specific understanding, and the LLM needs to be aligned to what the company believes the categorization should be. We will see an example of how to solve that problem in a second.

And then limited context handling: a lot of AI applications, especially in the enterprise, require data that has a lot of context.
Just to give you a simple example, knowledge management is an important space; enterprises buy a lot of knowledge management tools. When you go on your drive and you have all your documents, ideally, you could have an LLM running on top of that drive. You could ask any question, and it would immediately read thousands of documents and answer: what was our Q4 performance in sales? It was x dollars. It finds it super quickly. In practice, because LLMs do not have a large enough context, you cannot use a standalone vanilla pre-trained LLM to solve that problem. You will have to augment it. Does that make sense?

The other aspect of context windows is that they are, in fact, limited. If you look at the context windows of the models from the last five years, even the best models today will range in context window, the number of tokens they can take as input, somewhere in the hundreds of thousands of tokens at most. Just to give you a sense, 200,000 tokens is roughly two books. So that's how much you can upload and have it read, pretty much. And you can imagine that when you're dealing with video understanding or heavier data files, that is, of course, an issue. So you might have to chunk the data. You might have to embed it. You might have to find other ways to get the LLM to handle larger contexts.
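To make the chunking idea concrete, here is a minimal sketch in Python. It is an illustration, not a production approach: it approximates tokens with a rough characters-per-token ratio instead of a real tokenizer, and the file name is made up.

```python
# A minimal chunking sketch. Real systems would count tokens with the
# model's tokenizer; the 4-characters-per-token ratio here is a rough
# assumption, and the overlap keeps context across chunk boundaries.

def chunk_document(text: str, max_tokens: int = 2000, overlap_tokens: int = 200):
    chars_per_token = 4  # crude approximation
    size = max_tokens * chars_per_token
    step = (max_tokens - overlap_tokens) * chars_per_token
    return [text[i:i + size] for i in range(0, len(text), step)]

# Hypothetical usage: each chunk can then be embedded or summarized
# independently, since the whole report won't fit in one context window.
chunks = chunk_document(open("q4_sales_report.txt").read())
```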
The attention mechanism is also powerful but problematic, because it does not do a great job of attending over very large contexts. There is actually an interesting problem, call it a benchmark, named needle in a haystack. To test whether your LLM is good at putting attention on a very specific fact within a large corpus, researchers randomly insert about one sentence that states a certain fact, such as "Arun and Max are having coffee at Blue Bottle," in the middle of the Bible, let's say, or some other very long text. And then you ask the LLM: what were Arun and Max having at Blue Bottle? And you see if it remembers that it was coffee. It's actually a complex problem, not because the question is complex, but because you're asking the model to find a fact within a very large corpus, and that's complicated. So, again, this is a limiting factor for LLMs.
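As a rough sketch of how such a test can be set up, assuming a hypothetical query_llm helper around whatever model API you use, and a long corpus already loaded as the string long_corpus:

```python
import random

def build_haystack(corpus: str, needle: str) -> str:
    # Insert the needle sentence at a random position in the corpus.
    sentences = corpus.split(". ")
    sentences.insert(random.randint(0, len(sentences)), needle)
    return ". ".join(sentences)

needle = "Arun and Max are having coffee at Blue Bottle."
haystack = build_haystack(long_corpus, needle)  # long_corpus: assumed loaded

answer = query_llm(  # hypothetical model call
    haystack + "\n\nQuestion: What were Arun and Max having at Blue Bottle?"
)
print("coffee" in answer.lower())  # did the model attend to the needle?
```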
We'll talk about RAG in a second, but I want to preview something: there are debates around whether RAG is the right long-term approach for AI systems. As a high-level idea, a RAG is a mechanism, if you will, that embeds documents that an LLM can retrieve and then add as context to its initial prompt in order to answer a question. It has lots of applications; knowledge management is one. So imagine you have your drive again, but every document is compressed into a representation, and the LLM has access to that lower-dimensional representation. The debate that this tweet from [INAUDIBLE] outlines is: in theory, if we have infinite compute, then RAG is useless, because you can just read a massive corpus immediately and answer your question. But even in that case, latency might be an issue. Imagine the time it would take for an AI to read your entire drive every single time you ask a question. It doesn't make sense. So RAG has advantages even beyond accuracy. On top of that, sourcing matters as well: RAG allows you to cite sources. We'll talk about all that later. But there's always this debate in the community about whether a certain method is actually future proof. Because in practice, as compute power doubles every year, let's say, some of the methods we're learning right now might not be relevant three years from now. We don't know, essentially.

And the analogy he makes about context windows, and why RAG approaches might be relevant even a long time from now, is search. When you search on a search engine, you still find sources of information. And in fact, in the background, there are very detailed traversal algorithms that rank and find the specific links that might be the best ones to present to you. Versus, imagine you had to read the entire web every single time you do a search query, without being able to narrow down to a certain portion of the space. That might, again, not be reasonable.

OK. When we're thinking of improving LLMs, the easiest way to think about it is along two dimensions. One dimension is improving the foundation model itself. For example, we move from GPT-3.5 Turbo, to GPT-4, to GPT-4o, to GPT-5. Each of those is supposed to improve the base model. GPT-5 is another debate, because it's packaging other models within itself. But if you're thinking about 3.5, 4, and 4o, that's really what it is: the pre-trained model improves, and so you should see your performance improve on your tasks. The other dimension is that we can engineer around the LLM, leverage it, in a way that makes it better. You can simply prompt GPT-4o, change some prompts and improve the prompt, and it will improve the performance; that's been shown. You can put a RAG around it. You can put an agentic workflow around it. You can even put a multi-agent system around it. And that is another dimension for you to improve performance. So that's how I want you to think about it: which LLM am I using, and then how can I maximize the performance of that LLM? This lecture is about the vertical axis. Those are the methods that we will see together.

Sounds good for the introduction. So let's move to prompt engineering. I'm going to start with an interesting study, just to motivate why prompt engineering matters. There is a study from Harvard Business School and Wharton at UPenn that took a subset of BCG consultants, individual contributors, and split them into three groups. One group had no access to AI.
One group had access to, I think it was GPT-4. And one group had access to the LLM but also a training on how to prompt better. And then they observed the performance of these consultants across a wide variety of tasks. There are a few things they noticed that I thought were interesting. One is something they called the jagged frontier: certain tasks that consultants do fall beyond the jagged frontier, meaning AI is not good enough; it's not improving human performance, and in fact, it's actually making it worse. And some tasks are within the frontier, meaning that AI is significantly improving the performance, the speed, the quality of the consultant's work. Many tasks fell within and many fell without, and they shared their insights. But the TLDR is: there is a frontier within which AI is absolutely helping, and beyond it they call out this behavior of falling asleep at the wheel, where people relied on AI for a task that was beyond the frontier, and it ended up going worse, because the human was not reviewing the outputs carefully enough.

They did note that the group that was trained was the best, better than the group that was not trained on prompt engineering, which also motivates why this lecture matters, so that you're within that group afterwards. Another insight was the centaurs and the cyborgs. They noticed that consultants had a tendency to work with AI in one of two ways, and you might, yourself, be part of one of these groups. Centaurs are mythical creatures that are half human, half, I think, half, what, horses? Yeah? Horses. Half human, half horse. Those were individuals that would divide and delegate. They might give a pretty big task to the AI. So imagine you're working on a PowerPoint, which consultants are known to do.
You might write a very long prompt on how you want it to do your PowerPoint, then let it work for some time, then come back, and it's done. Others would act as cyborgs. Cyborgs are fully blended human-robot hybrids, humans augmented with robotic parts. Those individuals would not fully delegate a task; they would work super quickly with the model, back and forth. I find that a lot of students actually work more like cyborgs than centaurs, while maybe in the enterprise, when you're trying to automate a workflow, you're thinking more like a centaur. That's just something good to keep in mind. Also, a lot of companies will tell you, oh, we're hiring prompt engineers, et cetera. It's a career. I don't buy that. I think it's just a skill that everybody should have. You're not going to make a career out of prompt engineering, but you're probably going to use it as a very powerful skill in your career.

So let's talk about basic prompt design principles. I'm giving you a very simple prompt here: summarize this document, and the document is uploaded alongside it. The model doesn't have much context: what should the summary be? How long should it be? What should it talk about, et cetera? You can improve this prompt by doing something like: summarize this 10-page scientific paper on renewable energy in five bullet points, focusing on key findings and implications for policymakers. That's already better. You're sharing the audience, and it's going to tailor the summary to that audience. You're saying that you want five bullet points, and you want to focus only on key findings. That's a better prompt, you would argue. How could you make this prompt even better? What are other techniques that you've heard of, or tried yourself, that could make this one-shot prompt better?

Yeah. [INAUDIBLE]

OK. Right, an example.
So you mean, say: here is an example of a great summary. Yeah, you're right. That's a good idea.

[INAUDIBLE]

Very popular technique. Act like a renewable energy expert giving a talk at Davos, let's say. Yeah, that's great. Someone, yeah. Say you're really good at it. Yeah. You are the best in the world at this. Explain. Yeah, actually, these things work. It's funny, but it does work to say act like x, y, z. It's a very popular prompt template. We'll see a few examples. What else could you do?

Yes. Of course, you'd like it to critique its own output. Critique your own output: so you're using reflection. You might actually generate one output, then ask the model to critique it, and then give it back. Yeah, we'll see that. That's a great one; that's probably the one that works best among these, typically. But we'll see some examples. What else? Yeah. Break the task down into steps. OK, break the task down into steps. Do you know what that's called? No? OK. Chain of thought. This is actually a popular method that has been shown in research to improve performance. You give a clear instruction and also encourage the model to think step by step: approach the task step by step, and do not skip any step. And then you give it some steps, such as: step one, identify the three most important findings. Step two, explain how each key finding impacts renewable energy policy. Step three, write the five-bullet summary, with each point addressing a finding, et cetera. So, chain of thought: I linked the paper from 2023 that popularized chain of thought.
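As a rough sketch, the step-by-step instruction above could be wrapped into a prompt like this; the exact wording is illustrative, and paper_text is assumed to be loaded elsewhere.

```python
# A chain-of-thought style prompt for the summarization task discussed above.
COT_PROMPT = """Summarize this 10-page scientific paper on renewable energy
for policymakers. Approach the task step by step, and do not skip any step.

Step 1: Identify the three most important findings.
Step 2: Explain how each key finding impacts renewable energy policy.
Step 3: Write the five-bullet summary, each point addressing one finding.

Paper:
{paper}
"""

prompt = COT_PROMPT.format(paper=paper_text)  # paper_text: assumed loaded
```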
Chain of thought is very popular right now, especially in AI startups that are trying to control their LLMs.

OK. To go back to your example about act like XYZ, what I like to do, and Andrew Ng also talks about this, is to look at other people's prompts. Online, you have a lot of free prompt repositories on GitHub. I linked the awesome prompts repo on GitHub, where you have so many examples of great prompts that engineers have built. They said, this works great for us, and they published it online. A lot of them start with act as: act as a Linux terminal, act as an English translator, act as a position interviewer, et cetera.

The advantage of a prompt template is that you can put it in your code and scale it across many user requests. Let me give you an example from Workera. Workera evaluates skills, some of you have taken the assessments already, and tries to personalize to the user. If you read from the HR system in an enterprise, you might have: Jane is a product manager, level 3, she is in the US, and her preferred language is English. That metadata can be inserted into a prompt template that will personalize the experience for Jane. And similarly for Joe, whose preferred language is Spanish, it will tailor things to Joe. And that's called a prompt template.
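A minimal sketch of what such a template might look like in code; the field names and wording here are made up for illustration, not Workera's actual template.

```python
TEMPLATE = (
    "Act as a great AI mentor who helps people in their career.\n"
    "The user is {name}, a {role} (level {level}) based in {country}.\n"
    "Always answer in {language}.\n\n"
    "User question: {question}"
)

# Metadata pulled from a hypothetical HR system record.
jane = {"name": "Jane", "role": "product manager", "level": 3,
        "country": "US", "language": "English"}

prompt = TEMPLATE.format(question="How do I grow into a level 4 role?", **jane)
# Filled with Joe's record instead, the same template would answer in Spanish.
```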
[INAUDIBLE]

So the question is: do the foundation models use prompt templates, or do you have to integrate one yourself? The foundation models probably use a system prompt that you don't see. When you type on ChatGPT, it is possible, it's not public, that OpenAI behind the scenes has something like: act like a very helpful assistant for this user, and by the way, here are your memories about the user that we kept in a database. You can actually check your memories. Then your prompt goes underneath, and then the generation starts. So they're probably using something like that. But it doesn't mean you can't add one yourself. In fact, if you think about a prompt template for the Workera example I was showing, maybe it starts, when you call OpenAI, with act like a helpful assistant, and then underneath, act like a great AI mentor that helps people in their career. And OpenAI's template also has something like, follow the instructions from the creator. It's possible.

Any questions about prompt templates? Again, I would encourage you to go and read examples of prompts. Some of them are quite thoughtful.

Let's talk about zero-shot versus few-shot prompting. It came up earlier. Here's an example, going back to the categorization of product reviews. Let's say we're working on a task where the prompt is: classify the tone of the sentence as positive, negative, or neutral. And then you paste the review, which is: the product is fine, but I was expecting more.

If I were to survey the room, I would bet that some of you would say it's negative and some of you would say it's neutral. You have a first part that is relatively positive: it's fine. And then a second part, I was expecting more, which is relatively negative. So where do you land? This can be a subjective question. Maybe in one industry, this would be considered amazing, and in another, it would be considered really bad, because people are used to really flourishing reviews. And so the way you can align the model to your task is by converting that zero-shot prompt (zero-shot refers to the fact that it's not given any examples) into a few-shot prompt, where the model is given, in the prompt, a set of examples to align it to what you want it to do. In the example here, you paste the same prompt as before with the user review, and then you add: here are examples of tone classifications.
"This exceeded my expectations completely." Positive. "It's OK, but I wish it had more features." Negative. "The service was adequate, neither good nor bad." Neutral. Now classify the tone of this sentence. After it has seen these examples, the model says negative. And the reason it says negative, of course, is likely the second example, "it's OK, but I wish it had more features," which we told the model was negative. Because the model saw that, it's now aligned with your expectations.

Few-shot prompts are very popular. In fact, at AI startups that are slightly more sophisticated, you might see them keep a prompt up to date: whenever a user says something, they might have a human label it and then add it as a few-shot example in the relevant prompts in their code base. You can think of that as almost building a data set. But instead of building a separate data set, like we've seen with supervised fine-tuning, and then fine-tuning the model on it, you're putting the examples directly in the prompt. It turns out it's probably faster to do that if you want to experiment quickly, because you don't touch the model parameters; you just update your prompts. And if they're text examples, you can concatenate a lot of examples in a single prompt. At some point, it will be too long, and you will not have the necessary context window. But it's a pretty strong approach that is quick to align an LLM.
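Put into code, the zero-shot-to-few-shot conversion might look like this, again assuming a hypothetical query_llm helper around your model API.

```python
FEW_SHOT_PROMPT = """Classify the tone of the sentence as positive, negative, or neutral.

Here are examples of tone classifications:
"This exceeded my expectations completely." -> positive
"It's OK, but I wish it had more features." -> negative
"The service was adequate, neither good nor bad." -> neutral

Now classify the tone of this sentence:
"{review}" ->"""

review = "The product is fine, but I was expecting more."
label = query_llm(FEW_SHOT_PROMPT.format(review=review))  # expected: negative
```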
OK? Yes.

[INAUDIBLE]

So the question was: is there any research on how long the prompt can be before the model essentially loses itself or doesn't follow instructions anymore? There is. The problem is that that research is outdated every few months, because models get better, so I don't know where the state of the art is. You can probably find it online in benchmarks. But I'll give you an example. On the Workera product, for those of you who have tried it, you have a voice conversation where you're asked to explain something (that's the prompt), then you explain, and there's a scoring algorithm behind it. We know that after eight turns, the model loses itself. After eight turns, because you always paste in the previous user responses, it just starts going wild. So the technique we use in the background is to create chapters of the conversation. Maybe one chapter is the first eight prompts, and then you start over with another prompt: you summarize the first part of the conversation, insert the summary, and keep going. Those are engineering hacks that engineers figure out in the background. Because eight turns makes a prompt quite long, actually.
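A minimal sketch of that chaptering hack, assuming hypothetical summarize() and query_llm() helpers around your model API; the turn limit is the empirical one mentioned above.

```python
MAX_TURNS = 8  # the point where, empirically, the model started drifting

def chat_turn(history: list[str], user_msg: str) -> str:
    # Once the chapter is full, compress it into a summary and start over.
    if len(history) >= MAX_TURNS * 2:  # each turn adds a user + assistant line
        summary = summarize("\n".join(history))  # hypothetical helper
        history[:] = [f"Summary of the conversation so far: {summary}"]
    history.append(f"User: {user_msg}")
    reply = query_llm("\n".join(history))  # hypothetical model call
    history.append(f"Assistant: {reply}")
    return reply
```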
Let's move on to chaining. Chaining is the most popular technique out of everything we've seen so far in prompt engineering. It's not chain of thought. Chain of thought, as we've seen, is: think step by step, step 1, step 2, step 3, do not skip any step. This is different. This is chaining complex prompts to improve performance, and here is what it looks like. You take a single-step prompt, such as: read this customer review and write a professional response that acknowledges their concern, explains the issue, and offers a resolution. Then you paste the customer review, which is: I ordered a laptop. It arrived three days late. The packaging was damaged. Very disappointing. I needed it urgently for work. And the output is an email that is immediately given to you by the LLM after it reads the prompt.

This might work, but it might be hard to control. Think about it: there are multiple steps that you have listed, and everything is embedded in the same prompt. If you wanted to debug step by step and know which step is weaker, you couldn't; you would have everything mixed together. So one advantage of chaining is that you separate the prompts so you can debug them separately, and it also gives you an easier way to improve your workflow. Let's say the first prompt is: extract the key issues. Identify the key concerns mentioned in this customer review. Paste the customer review. Second prompt: using these issues (you paste back the issues), draft an outline for a professional response that acknowledges concerns, explains possible reasons, and offers a resolution. Prompt number 3: write the full response. Using the outline, write the professional response. And then you get your final output.

So in theory, you could tell me the second approach is better than the first. But what you can notice is that we can test those three prompts separately from each other and determine whether we will get the most gains out of engineering and optimizing the first prompt, the second one, or the third one. We now have three prompts that are independent of each other. And maybe if the outline were better, the performance of the email (the open rate, or user satisfaction with the response) would actually get higher. So chaining improves performance, but most importantly, it helps you control your workflow and debug it more seamlessly.

Yes. So if we know that the three prompts independently work really well, and we combine them into one prompt and highlight a step-by-step thinking process, do we, on average, get a [INAUDIBLE] by itself, or do we still have to do that breakdown?

Let me try to rephrase. You're saying: let's say we look at the first prompt, which has all three tasks built into it. What exactly do you mean?
You mean, if we evaluate the output and measure some user insight, satisfaction, et cetera? Why don't we just modify that one prompt and see how it improves user satisfaction? Yeah.

[INAUDIBLE]

I see. So why do we need the three steps? I mean, think about it: the intermediate output is what you want to see. If I'm debugging the first approach, the way I would do it is to capture user insights: here's the email, how good was the response, thumbs up, thumbs down; was your issue resolved, thumbs up, thumbs down. Those would tell me how good my prompt is. I can engineer that prompt, optimize it, and I would probably drive some gains. But I would not easily be able to trace back to what the problem was. With the second approach, not only can I use the end-to-end metrics to improve my process, I can also use the intermediate steps. For example, if I look at prompt 2 and I look at the outline, and I see the outline is actually, meh, not great, then I think I can get a lot of gains out of the outline. Or the outline is actually really good, but the last prompt doesn't do a good job of translating it into an email. So the outline is exactly what I want the LLM to do, but the translation into a customer-facing email is not good; in fact, it doesn't follow our internal vocabulary. Then I know the third prompt is where I would get the most gains. That's what it allows me to do: have intermediate steps to review.

Are there any latency [INAUDIBLE]?

We'll talk about it. Are there any latency concerns? Yes. In certain applications, you don't want to use a chain, or at least not a long chain, because it adds latency. We'll talk about that later. Good point.
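Here's a minimal sketch of that three-prompt chain in code, assuming the same hypothetical query_llm helper; the point is that each intermediate output (issues, outline) can be logged and inspected on its own.

```python
def respond_to_review(review: str) -> str:
    # Prompt 1: extract the key issues.
    issues = query_llm(
        f"Identify the key concerns mentioned in this customer review:\n{review}"
    )
    # Prompt 2: draft an outline from those issues.
    outline = query_llm(
        "Using these issues, draft an outline for a professional response "
        "that acknowledges concerns, explains possible reasons, and offers "
        f"a resolution:\n{issues}"
    )
    # Prompt 3: write the full response from the outline.
    return query_llm(
        f"Using this outline, write the professional response:\n{outline}"
    )
```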
803 00:37:33,280 --> 00:37:35,640 You have your first prompt with your first task. 804 00:37:35,639 --> 00:37:36,460 It outputs. 805 00:37:36,460 --> 00:37:39,079 The output is pasted into the second prompt 806 00:37:39,079 --> 00:37:41,199 with the second task being defined. 807 00:37:41,199 --> 00:37:43,699 The output is then pasted into the third prompt 808 00:37:43,699 --> 00:37:46,559 with the third task being defined, and so on. 809 00:37:46,559 --> 00:37:48,170 That's what it looks like in practice. 810 00:37:52,179 --> 00:37:52,679 Super. 811 00:37:55,860 --> 00:37:58,559 We'll talk more later about testing your prompts, 812 00:37:58,559 --> 00:38:00,799 but there are methods now to do it, 813 00:38:00,800 --> 00:38:03,380 and we'll see later in this lecture with our case study 814 00:38:03,380 --> 00:38:06,300 how we can test our prompts. 815 00:38:06,300 --> 00:38:11,900 But here is an example of how you might do it. 816 00:38:11,900 --> 00:38:18,220 You might have a summarization workflow prompt 817 00:38:18,219 --> 00:38:19,359 that is the baseline. 818 00:38:19,360 --> 00:38:21,420 It's a single prompt. 819 00:38:21,420 --> 00:38:23,659 You might have a refined summarization, 820 00:38:23,659 --> 00:38:26,199 which is a modified version of this prompt, 821 00:38:26,199 --> 00:38:30,460 or a workflow with a chain. 822 00:38:30,460 --> 00:38:34,380 And then you have your test case, which is the input 823 00:38:34,380 --> 00:38:36,780 that you want to summarize, let's say. 824 00:38:36,780 --> 00:38:38,900 And then you have the generated output. 825 00:38:38,900 --> 00:38:42,559 And you can have humans go and rate these outputs. 826 00:38:42,559 --> 00:38:46,380 And you would notice that the baseline is better or worse 827 00:38:46,380 --> 00:38:47,780 than the refined prompt. 828 00:38:47,780 --> 00:38:51,260 Of course, this manual approach takes time, 829 00:38:51,260 --> 00:38:53,560 but it's a good way to start. 830 00:38:53,559 --> 00:38:56,994 And usually, the advice is get hands-on at the beginning, 831 00:38:56,994 --> 00:38:58,869 because you would quickly notice some issues, 832 00:38:58,869 --> 00:39:01,589 and it will give you better intuition on what tweaks 833 00:39:01,590 --> 00:39:03,470 can lead to better performance. 834 00:39:03,469 --> 00:39:05,549 However, if you wanted to scale that system 835 00:39:05,550 --> 00:39:08,110 across many products, many parts of your code base, 836 00:39:08,110 --> 00:39:10,910 you might want to find a way to do that automatically 837 00:39:10,909 --> 00:39:14,369 without asking humans to review and grade summaries. 838 00:39:14,369 --> 00:39:19,309 One approach is to use platforms. 839 00:39:19,309 --> 00:39:23,630 At Workera, our team uses a platform called promptfoo that 840 00:39:23,630 --> 00:39:26,950 allows you to actually automate part of this testing. 841 00:39:26,949 --> 00:39:30,469 In a nutshell, what it does is it 842 00:39:30,469 --> 00:39:35,489 can allow you to run the same prompt with five different LLMs 843 00:39:35,489 --> 00:39:37,269 immediately and put everything in a table. 844 00:39:37,269 --> 00:39:40,429 That makes it super easy for a human to grade, let's say. 845 00:39:40,429 --> 00:39:46,659 Or alternatively, it might allow you to define LLM judges. 846 00:39:46,659 --> 00:39:50,149 LLM judges can come in different flavors. 847 00:39:50,150 --> 00:39:52,450 For example, I can have an LLM judge that 848 00:39:52,449 --> 00:39:54,789 does a pairwise comparison, as sketched below.
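Here is a minimal sketch of such a pairwise judge. The judge prompt and the call_llm helper are hypothetical; this is the shape of the idea, not any particular platform's API.

```python
# Sketch of a pairwise LLM judge. call_llm is a hypothetical placeholder
# for your LLM client.

def call_llm(prompt: str) -> str:
    """Placeholder for your LLM client call."""
    raise NotImplementedError

def pairwise_judge(document: str, summary_a: str, summary_b: str) -> str:
    """Ask a judge LLM which of two summaries is better; returns 'A' or 'B'."""
    verdict = call_llm(
        "You are judging two summaries of the same document.\n"
        f"Document:\n{document}\n\n"
        f"Summary A:\n{summary_a}\n\n"
        f"Summary B:\n{summary_b}\n\n"
        "Answer with exactly one letter, A or B, for the better summary."
    )
    return verdict.strip().upper()[:1]

# A cheap sanity check on such a judge: swap A and B and judge again,
# to detect position bias before trusting the verdicts.
```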
849 00:39:54,789 --> 00:39:58,090 So what the LLM is asked to do is here are two summaries. 850 00:39:58,090 --> 00:40:01,210 Just tell me which one is better than the other one. 851 00:40:01,210 --> 00:40:02,630 That's what the LLM does. 852 00:40:02,630 --> 00:40:04,690 And that can be used as a proxy for how good 853 00:40:04,690 --> 00:40:08,329 the summarization baseline versus the refined version is. 854 00:40:08,329 --> 00:40:11,889 Another way to do an LLM judge is 855 00:40:11,889 --> 00:40:14,349 to do it for single-answer grading, 856 00:40:14,349 --> 00:40:18,489 so here's a summary, grade it from 1 to 5. 857 00:40:18,489 --> 00:40:21,769 And then you can go even deeper and do 858 00:40:21,769 --> 00:40:24,550 a reference-guided pairwise comparison. 859 00:40:24,550 --> 00:40:25,870 Or you also add a rubric. 860 00:40:25,869 --> 00:40:30,697 You say a 5 is when a summary is below 100 characters. 861 00:40:30,697 --> 00:40:31,489 I'm just making this up. 862 00:40:31,489 --> 00:40:33,029 Below 100 characters. 863 00:40:33,030 --> 00:40:35,010 Mentions at least three key points 864 00:40:35,010 --> 00:40:38,182 that are distinct and starts with a first sentence that 865 00:40:38,182 --> 00:40:40,349 displays the overview and then goes into the detail. 866 00:40:40,349 --> 00:40:42,190 That's a great summary, a 5 out of 5. 867 00:40:42,190 --> 00:40:48,909 A 0 is when the LLM failed to summarize and actually was very verbose, 868 00:40:48,909 --> 00:40:49,609 let's say. 869 00:40:49,610 --> 00:40:52,539 And so you put a rubric behind it, 870 00:40:52,539 --> 00:40:55,059 and you have an LLM judge following the rubric. 871 00:40:55,059 --> 00:40:57,199 Of course, you can now pair different techniques. 872 00:40:57,199 --> 00:40:58,879 You can do few-shot for the rubric. 873 00:40:58,880 --> 00:41:02,960 You can actually give examples of 5 out of 5s, 4 out of 5s, 874 00:41:02,960 --> 00:41:06,460 3 out of 5s, because now you can mix multiple techniques. 875 00:41:06,460 --> 00:41:11,220 Does that make sense? 876 00:41:11,219 --> 00:41:11,819 Yeah. 877 00:41:11,820 --> 00:41:12,620 OK. 878 00:41:12,619 --> 00:41:15,460 So that was the second section, on prompt engineering, 879 00:41:15,460 --> 00:41:19,179 or the first line of optimization. 880 00:41:19,179 --> 00:41:22,619 Now, let's say you've exhausted all your chances 881 00:41:22,619 --> 00:41:24,779 for prompt engineering, and you're 882 00:41:24,780 --> 00:41:28,300 thinking about actually touching the model, modifying its weights, 883 00:41:28,300 --> 00:41:31,580 or fine tuning it, in other words. 884 00:41:31,579 --> 00:41:34,900 I was telling you, I'm not a fan of fine tuning. 885 00:41:34,900 --> 00:41:37,940 There's a few reasons why. 886 00:41:37,940 --> 00:41:42,220 One, it typically requires substantial labeled data 887 00:41:42,219 --> 00:41:43,079 to fine tune. 888 00:41:43,079 --> 00:41:46,500 Although now, there are approaches 889 00:41:46,500 --> 00:41:48,699 that are getting better at fine tuning and that 890 00:41:48,699 --> 00:41:52,299 look more like few-shot prompting, actually, than fine tuning. 891 00:41:52,300 --> 00:41:54,600 It's sort of merging. 892 00:41:54,599 --> 00:41:56,097 Although one modifies the weights, 893 00:41:56,097 --> 00:41:57,639 the other doesn't modify the weights. 894 00:41:57,639 --> 00:42:01,099 Fine-tuned models may also overfit to specific data, 895 00:42:01,099 --> 00:42:04,000 losing their general-purpose utility. 896 00:42:04,000 --> 00:42:06,579 We're going to see a funny example, actually.
897 00:42:06,579 --> 00:42:08,480 So you might fine tune a model. 898 00:42:08,480 --> 00:42:11,300 And actually, when someone asks a pretty generic question, 899 00:42:11,300 --> 00:42:12,840 it doesn't do well anymore. 900 00:42:12,840 --> 00:42:14,220 It might do well on your task. 901 00:42:14,219 --> 00:42:15,699 So it might be relevant or not. 902 00:42:15,699 --> 00:42:17,659 And then it's time- and cost-intensive. 903 00:42:17,659 --> 00:42:19,159 That's my main problem. 904 00:42:19,159 --> 00:42:24,639 And at Workera, we steer away from fine 905 00:42:24,639 --> 00:42:26,440 tuning as much as possible. 906 00:42:26,440 --> 00:42:28,932 Because by the time you're done fine tuning your model, 907 00:42:28,932 --> 00:42:30,599 the next model is out, and it's actually 908 00:42:30,599 --> 00:42:33,559 beating your fine-tuned version of the previous model. 909 00:42:33,559 --> 00:42:36,719 So I would steer away from fine tuning as much as you can. 910 00:42:36,719 --> 00:42:39,399 The advantage of the prompt engineering methods we've seen 911 00:42:39,400 --> 00:42:43,800 is you can put the next best pre-trained model directly 912 00:42:43,800 --> 00:42:44,917 in your code. 913 00:42:44,916 --> 00:42:46,500 It will update everything immediately. 914 00:42:46,500 --> 00:42:50,449 Fine tuning doesn't work like that. 915 00:42:50,449 --> 00:42:53,250 There are cases, though, where it still makes sense: 916 00:42:53,250 --> 00:42:56,130 if the task requires repeated high-precision outputs, 917 00:42:56,130 --> 00:42:58,570 such as legal or scientific explanations, 918 00:42:58,570 --> 00:43:01,289 or if the general-purpose LLM struggles 919 00:43:01,289 --> 00:43:03,449 with domain-specific language. 920 00:43:03,449 --> 00:43:07,649 So let's look at a quick example together, 921 00:43:07,650 --> 00:43:12,690 which is an example from Ross Lazerowitz. 922 00:43:12,690 --> 00:43:15,929 I think it was a couple of years ago, September 2023, 923 00:43:15,929 --> 00:43:22,829 when Ross tried to do Slack fine tuning. 924 00:43:22,829 --> 00:43:26,489 So he looked at a lot of Slack messages within his company. 925 00:43:26,489 --> 00:43:28,609 And he was like, I'm going to fine tune 926 00:43:28,610 --> 00:43:32,090 a model that speaks like us or operates like us, because this 927 00:43:32,090 --> 00:43:33,090 is how we work. 928 00:43:33,090 --> 00:43:37,970 This is the data that represents how people work at the company. 929 00:43:37,969 --> 00:43:42,529 And so he actually went ahead and fine tuned the model, 930 00:43:42,530 --> 00:43:44,769 and gave it a prompt-- 931 00:43:44,769 --> 00:43:47,030 he was delegating to the model-- 932 00:43:47,030 --> 00:43:50,510 like, hey, write a 500-word blog post on prompt engineering. 933 00:43:50,510 --> 00:43:55,990 And the model responded, I shall work on that in the morning. 934 00:43:55,989 --> 00:44:00,049 And then he tries to push the model a little further and says, 935 00:44:00,050 --> 00:44:01,950 it's morning now. 936 00:44:01,949 --> 00:44:04,489 And the model said, I'm writing right now. 937 00:44:04,489 --> 00:44:06,789 It's 6:30 AM here. 938 00:44:06,789 --> 00:44:07,610 Write it now. 939 00:44:10,670 --> 00:44:12,130 OK, I shall write it now. 940 00:44:12,130 --> 00:44:14,110 I actually don't know what you would like me to say 941 00:44:14,110 --> 00:44:15,590 about prompt engineering. 942 00:44:15,590 --> 00:44:17,030 I can only describe the process.
943 00:44:17,030 --> 00:44:19,030 The only thing that comes to mind for a headline 944 00:44:19,030 --> 00:44:21,030 is how do we build prompts? 945 00:44:21,030 --> 00:44:25,670 It's kind of a funny example for fine tuning, because it truly 946 00:44:25,670 --> 00:44:27,630 went wrong. 947 00:44:27,630 --> 00:44:29,630 He was thinking, I want 948 00:44:29,630 --> 00:44:32,269 the model to speak like us at work. 949 00:44:32,269 --> 00:44:34,829 And it ended up acting like people 950 00:44:34,829 --> 00:44:36,929 and not actually following instructions. 951 00:44:40,190 --> 00:44:42,860 So that's one example of why I would steer away from fine tuning. 952 00:44:47,300 --> 00:44:47,800 Super. 953 00:44:51,679 --> 00:44:54,199 Let's talk about RAGs. 954 00:44:54,199 --> 00:44:55,500 RAG is important. 955 00:44:55,500 --> 00:44:58,420 It's important to know what's out there and at least have the basics. 956 00:44:58,420 --> 00:45:00,579 It's a very common interview question, by the way. 957 00:45:00,579 --> 00:45:02,799 If you go interview for a job, they 958 00:45:02,800 --> 00:45:04,720 might ask you to explain, in a nutshell, 959 00:45:04,719 --> 00:45:06,659 to a five-year-old what a RAG is. 960 00:45:06,659 --> 00:45:09,480 And hopefully after that, you'll be able to do it. 961 00:45:09,480 --> 00:45:14,880 So we've seen some of the challenges with standalone LLMs. 962 00:45:14,880 --> 00:45:19,200 Those challenges include the context window being small, 963 00:45:19,199 --> 00:45:21,559 the fact that it's hard to remember details 964 00:45:21,559 --> 00:45:26,960 within a large context window, knowledge gaps, and cutoff dates, 965 00:45:26,960 --> 00:45:28,059 as you mentioned earlier. 966 00:45:28,059 --> 00:45:29,779 The model might be trained up to a date, 967 00:45:29,780 --> 00:45:33,040 and then it cannot follow the trends or be up to date. 968 00:45:33,039 --> 00:45:34,440 Hallucinations. 969 00:45:34,440 --> 00:45:35,920 There are some fields-- 970 00:45:35,920 --> 00:45:37,639 think about medical diagnosis-- where 971 00:45:37,639 --> 00:45:39,139 hallucinations are very costly. 972 00:45:39,139 --> 00:45:41,440 You can't afford a hallucination. 973 00:45:41,440 --> 00:45:45,450 Even in education, imagine deploying a model for US 974 00:45:45,449 --> 00:45:47,937 youth education, and it hallucinates, 975 00:45:47,938 --> 00:45:49,730 and it teaches millions of people something 976 00:45:49,730 --> 00:45:50,730 completely wrong. 977 00:45:50,730 --> 00:45:52,690 It's a problem. 978 00:45:52,690 --> 00:45:54,889 And then, lack of sources. 979 00:45:54,889 --> 00:45:57,389 A lot of fields love sources. 980 00:45:57,389 --> 00:45:59,609 Research fields love sources. 981 00:45:59,610 --> 00:46:01,650 Education loves sources. 982 00:46:01,650 --> 00:46:04,490 Legal loves sources as well. 983 00:46:04,489 --> 00:46:08,969 And the pre-trained LLM doesn't do a good job of sourcing. 984 00:46:08,969 --> 00:46:13,609 And in fact, if you have tried to find sources on a plain LLM, 985 00:46:13,610 --> 00:46:15,190 it actually hallucinates a lot. 986 00:46:15,190 --> 00:46:16,710 It makes up research papers. 987 00:46:16,710 --> 00:46:20,170 It just lists completely fake stuff. 988 00:46:20,170 --> 00:46:23,490 So how do we solve that with a RAG? 989 00:46:23,489 --> 00:46:28,049 RAG integrates the LLM with external knowledge sources: databases, 990 00:46:28,050 --> 00:46:31,010 documents, APIs.
991 00:46:31,010 --> 00:46:35,270 It ensures that answers are more accurate, up to date, 992 00:46:35,269 --> 00:46:38,150 and grounded, because you can actually update your documents. 993 00:46:38,150 --> 00:46:40,630 Your drive is always up to date. 994 00:46:40,630 --> 00:46:43,849 I mean, ideally, you're always pushing new documents to it. 995 00:46:43,849 --> 00:46:47,730 And when you query, what is our Q4 performance in sales? 996 00:46:47,730 --> 00:46:51,230 Hopefully the last board deck is in the drive, 997 00:46:51,230 --> 00:46:54,630 and it can read the last board deck. 998 00:46:54,630 --> 00:46:56,210 And more developer control. 999 00:46:56,210 --> 00:47:00,309 We'll see why RAGs allow for targeted customization 1000 00:47:00,309 --> 00:47:02,730 without actually requiring the retraining of the model. 1001 00:47:02,730 --> 00:47:05,309 In fact, you don't touch the model with RAGs. 1002 00:47:05,309 --> 00:47:08,829 It's really a technique that is put on top of the model. 1003 00:47:08,829 --> 00:47:11,789 So to see an example of a RAG, this 1004 00:47:11,789 --> 00:47:16,070 is a question answering application where 1005 00:47:16,070 --> 00:47:21,710 we're in the medical field, and a user is asking a query, 1006 00:47:21,710 --> 00:47:26,190 what are the side effects of drug X? 1007 00:47:26,190 --> 00:47:27,490 This is an important question. 1008 00:47:27,489 --> 00:47:28,689 You can't hallucinate. 1009 00:47:28,690 --> 00:47:29,690 You need to source. 1010 00:47:29,690 --> 00:47:31,050 You need to be up to date. 1011 00:47:31,050 --> 00:47:35,390 Maybe there is a new update to that drug that 1012 00:47:35,389 --> 00:47:37,769 is now in the database, and you need to read that. 1013 00:47:37,769 --> 00:47:41,920 So a RAG is a great example of what you would want to use here. 1014 00:47:41,920 --> 00:47:43,960 The way it works is you have your knowledge 1015 00:47:43,960 --> 00:47:46,840 base of a bunch of documents. 1016 00:47:46,840 --> 00:47:49,960 What you do is you use an embedding model 1017 00:47:49,960 --> 00:47:52,079 to embed those documents into lower- 1018 00:47:52,079 --> 00:47:54,519 dimensional representations. 1019 00:47:54,519 --> 00:47:59,679 So for example, if the document is a PDF, a long PDF, 1020 00:47:59,679 --> 00:48:02,940 you might read the PDF, understand it, 1021 00:48:02,940 --> 00:48:03,820 and then embed it. 1022 00:48:03,820 --> 00:48:05,800 We've seen plenty of embedding approaches 1023 00:48:05,800 --> 00:48:09,120 together, triplet loss, et cetera, you remember? 1024 00:48:09,119 --> 00:48:11,719 So imagine one of them here, for LLMs, 1025 00:48:11,719 --> 00:48:15,719 embedding those documents into a lower-dimensional representation. 1026 00:48:15,719 --> 00:48:18,439 If the representation is too small, 1027 00:48:18,440 --> 00:48:19,900 you will lose information. 1028 00:48:19,900 --> 00:48:22,840 If it's too big, you will add latency. 1029 00:48:22,840 --> 00:48:25,760 It's a tradeoff. 1030 00:48:25,760 --> 00:48:28,360 You will typically store those representations 1031 00:48:28,360 --> 00:48:31,880 in a database called a vector database. 1032 00:48:31,880 --> 00:48:35,280 There are a lot of vector database providers out there. 1033 00:48:38,579 --> 00:48:41,880 I think I've listed a couple that are very common. 1034 00:48:41,880 --> 00:48:44,811 No, I haven't listed them, but I can share afterwards.
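Here is a minimal sketch of that whole loop: embed documents, store the vectors, retrieve by a distance metric, and build a grounded prompt. The embed function is a stand-in for a real embedding model, and the in-memory list is a stand-in for a real vector database.

```python
# Minimal vanilla-RAG sketch. embed() stands in for a real embedding
# model; the in-memory index stands in for a vector database.
import math

def embed(text: str) -> list[float]:
    """Placeholder: return a fixed-size vector for the text."""
    raise NotImplementedError

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

index: list[tuple[list[float], str]] = []  # (vector, document) pairs

def add_document(doc: str) -> None:
    index.append((embed(doc), doc))

def retrieve(query: str, k: int = 3) -> list[str]:
    # Embed the query with the same model, then rank stored documents
    # by similarity to the query embedding.
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [doc for _, doc in ranked[:k]]

def grounded_prompt(query: str) -> str:
    # Paste the retrieved documents and the user query into a template.
    docs = "\n---\n".join(retrieve(query))
    return (
        "Answer the user query based on these documents. If the answer is "
        f"not in the documents, say 'I don't know'.\n\nDocuments:\n{docs}\n\n"
        f"Query: {query}"
    )
```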
1035 00:48:44,811 --> 00:48:47,019 A vector database is essentially storing those vectors 1036 00:48:47,019 --> 00:48:50,139 in a very efficient manner, allowing fast retrieval 1037 00:48:50,139 --> 00:48:52,859 with a certain distance metric. 1038 00:48:52,860 --> 00:48:56,260 So what you do is you also embed, usually 1039 00:48:56,260 --> 00:49:00,140 with the same algorithm, the user prompt. 1040 00:49:00,139 --> 00:49:03,579 And you run a retrieval process, which is essentially 1041 00:49:03,579 --> 00:49:07,779 saying, based on the embedding from the user 1042 00:49:07,780 --> 00:49:12,540 query and the vector database, find the relevant documents 1043 00:49:12,539 --> 00:49:15,500 based on the distance between those embeddings. 1044 00:49:15,500 --> 00:49:18,420 Once you've found the relevant documents, you pull them, 1045 00:49:18,420 --> 00:49:22,460 and then you add them to the user query with a system prompt 1046 00:49:22,460 --> 00:49:24,300 or a prompt template on top. 1047 00:49:24,300 --> 00:49:29,300 So the prompt template can be: answer the user query 1048 00:49:29,300 --> 00:49:32,900 based on this list of documents. 1049 00:49:32,900 --> 00:49:36,829 If the answer is not in the documents, say I don't know. 1050 00:49:36,829 --> 00:49:40,590 That's your prompt template, where the user query is pasted, 1051 00:49:40,590 --> 00:49:42,630 the documents are pasted, and then 1052 00:49:42,630 --> 00:49:45,829 your output should be what you want, because it's now 1053 00:49:45,829 --> 00:49:47,389 grounded in the documents. 1054 00:49:47,389 --> 00:49:50,549 You can also add to this prompt template: 1055 00:49:50,550 --> 00:49:53,150 tell me the exact page, chapter, and line 1056 00:49:53,150 --> 00:49:55,110 of the document that was relevant, and in fact, 1057 00:49:55,110 --> 00:49:57,380 link it as well, just to be more precise. 1058 00:50:02,150 --> 00:50:03,829 Any questions on RAGs? 1059 00:50:03,829 --> 00:50:07,389 This is a simple, vanilla RAG. 1060 00:50:07,389 --> 00:50:09,119 Yes. 1061 00:50:09,119 --> 00:50:12,789 Do document embeddings still retain information [INAUDIBLE] 1062 00:50:15,630 --> 00:50:18,230 The question is, do the document embeddings still 1063 00:50:18,230 --> 00:50:21,789 retain the information of the location of the information 1064 00:50:21,789 --> 00:50:24,789 within that document, especially in big documents? 1065 00:50:24,789 --> 00:50:26,029 Great question. 1066 00:50:26,030 --> 00:50:27,950 We'll get to it in a second. 1067 00:50:27,949 --> 00:50:29,949 Because you're right that the vanilla RAG 1068 00:50:29,949 --> 00:50:32,289 might not do a good job with very large documents. 1069 00:50:32,289 --> 00:50:36,469 So let's say, when you open a medication box 1070 00:50:36,469 --> 00:50:41,129 and you have this gigantic white paper with all the information, 1071 00:50:41,130 --> 00:50:45,829 and it's very long, maybe a vanilla RAG would not cut it. 1072 00:50:45,829 --> 00:50:48,009 So what people have figured out is a bunch 1073 00:50:48,010 --> 00:50:49,830 of techniques to improve RAGs. 1074 00:50:49,829 --> 00:50:53,150 And in fact, chunking is a great technique that is very popular. 1075 00:50:53,150 --> 00:50:55,730 So you might actually store in the vector database 1076 00:50:55,730 --> 00:50:57,670 the embedding of the full document. 1077 00:50:57,670 --> 00:50:59,409 And on top of that, you will also 1078 00:50:59,409 --> 00:51:02,619 store a chapter-level vector. 1079 00:51:02,619 --> 00:51:04,869 And when you retrieve, you will retrieve the document.
1080 00:51:04,869 --> 00:51:06,289 You retrieve the chapter. 1081 00:51:06,289 --> 00:51:09,190 And that allows you to be more precise with the sourcing. 1082 00:51:09,190 --> 00:51:11,690 It's one example. 1083 00:51:11,690 --> 00:51:16,130 Another technique that's popular is HyDE, 1084 00:51:16,130 --> 00:51:18,970 hypothetical document embeddings, 1085 00:51:18,969 --> 00:51:23,529 where a group of researchers published a paper 1086 00:51:23,530 --> 00:51:26,790 showing that when you get your user query, 1087 00:51:26,789 --> 00:51:29,090 one of the main problems is that the user query 1088 00:51:29,090 --> 00:51:32,370 actually does not look like your documents. 1089 00:51:32,369 --> 00:51:34,139 For example, the user query might 1090 00:51:34,139 --> 00:51:37,779 be what are the side effects of drug X, when actually, 1091 00:51:37,780 --> 00:51:40,080 in the vector database, 1092 00:51:40,079 --> 00:51:43,099 the vectors represent very long documents. 1093 00:51:43,099 --> 00:51:44,900 So how do you guarantee that the query 1094 00:51:44,900 --> 00:51:47,619 embedding is going to be close to the document embedding? 1095 00:51:47,619 --> 00:51:50,819 What they do is they use the user query to generate 1096 00:51:50,820 --> 00:51:53,780 a fake, hallucinated document. 1097 00:51:53,780 --> 00:51:56,180 They embed that document, and then they 1098 00:51:56,179 --> 00:52:01,379 compare it to the vectors in the vector database. 1099 00:52:01,380 --> 00:52:02,460 That makes sense? 1100 00:52:02,460 --> 00:52:04,780 So for example, the user says, what 1101 00:52:04,780 --> 00:52:06,682 are the side effects of drug X? 1102 00:52:06,682 --> 00:52:09,099 That user query is given to another prompt that 1103 00:52:09,099 --> 00:52:13,739 says, based on this user query, generate a five-page report 1104 00:52:13,739 --> 00:52:15,579 answering the user query. 1105 00:52:15,579 --> 00:52:20,980 It generates a potentially completely fake answer. 1106 00:52:20,980 --> 00:52:24,557 You embed that, and it will likely be closer to the document 1107 00:52:24,557 --> 00:52:25,849 that you're looking for. 1108 00:52:28,940 --> 00:52:31,800 It's one example of a RAG approach. 1109 00:52:31,800 --> 00:52:33,640 Again, the purpose of this lecture 1110 00:52:33,639 --> 00:52:36,039 is not to go through this whole tree and explain 1111 00:52:36,039 --> 00:52:38,922 to you every single method that has been discovered for RAGs. 1112 00:52:38,922 --> 00:52:40,880 But I just wanted to show you how much research 1113 00:52:40,880 --> 00:52:44,780 has been done between 2020 and 2025 in RAGs 1114 00:52:44,780 --> 00:52:47,960 and how many branches of research you now have 1115 00:52:47,960 --> 00:52:50,679 that you can learn from. 1116 00:52:50,679 --> 00:52:52,899 The survey paper is linked in the slides, by the way, 1117 00:52:52,900 --> 00:52:54,483 and I'll share them after the lecture. 1118 00:53:01,519 --> 00:53:02,019 Super. 1119 00:53:05,559 --> 00:53:08,840 So we've made some progress. 1120 00:53:08,840 --> 00:53:10,600 Hopefully now, you feel that if you were 1121 00:53:10,599 --> 00:53:14,317 to start an LLM application, you know how to do better prompts. 1122 00:53:14,317 --> 00:53:15,400 You know how to do chains. 1123 00:53:15,400 --> 00:53:17,240 You know how to do fine tuning. 1124 00:53:17,239 --> 00:53:19,159 You also know how to do retrieval.
1125 00:53:19,159 --> 00:53:20,799 And you have the baggage of techniques 1126 00:53:20,800 --> 00:53:23,100 that you can go and read, and find the code base, 1127 00:53:23,099 --> 00:53:24,779 pull the code, vibe code it. 1128 00:53:24,780 --> 00:53:26,820 But you have the breadth now. 1129 00:53:30,329 --> 00:53:34,009 The next set of topics we're going to see 1130 00:53:34,010 --> 00:53:36,770 is around the question of how could we 1131 00:53:36,769 --> 00:53:40,449 extend the capabilities of LLMs from performing single tasks, 1132 00:53:40,449 --> 00:53:42,250 enhanced with external knowledge, 1133 00:53:42,250 --> 00:53:47,409 to handling multi-step, autonomous workflows? 1134 00:53:47,409 --> 00:53:50,389 And this is where we get into proper agentic AI. 1135 00:53:53,210 --> 00:53:56,650 So let's talk about agentic AI workflows, 1136 00:53:56,650 --> 00:54:00,130 towards autonomous and specialized systems. 1137 00:54:00,130 --> 00:54:01,630 Then we'll talk about evals. 1138 00:54:01,630 --> 00:54:03,869 Then we'll see multi-agent systems. 1139 00:54:03,869 --> 00:54:11,769 And we'll end with a few thoughts on what's next in AI. 1140 00:54:11,769 --> 00:54:20,329 So Andrew Ng actually coined the term agentic AI workflows. 1141 00:54:20,329 --> 00:54:25,610 And his reason was that a lot of companies say, agents. 1142 00:54:25,610 --> 00:54:28,750 Agents, agents everywhere, agents everywhere. 1143 00:54:28,750 --> 00:54:30,670 If you go and work at these companies, 1144 00:54:30,670 --> 00:54:33,372 you would notice that they mean very different things by agents. 1145 00:54:33,371 --> 00:54:34,829 Some people actually have a prompt, 1146 00:54:34,829 --> 00:54:36,829 and they call it an agent. 1147 00:54:36,829 --> 00:54:41,529 Other people, they have a very complex multi-agent system, 1148 00:54:41,530 --> 00:54:42,450 and they call it an agent. 1149 00:54:42,449 --> 00:54:45,549 And so calling everything an agent doesn't do it justice. 1150 00:54:45,550 --> 00:54:49,810 So Andrew says, let's call it agentic workflows. 1151 00:54:49,809 --> 00:54:53,989 Because in practice, it's a bunch of prompts with tools, 1152 00:54:53,989 --> 00:54:57,029 with additional resources, API calls 1153 00:54:57,030 --> 00:54:59,390 that ultimately are put in a workflow, 1154 00:54:59,389 --> 00:55:02,629 and you can call that workflow agentic. 1155 00:55:02,630 --> 00:55:08,099 So it's all about the multi-step process to complete a task. 1156 00:55:11,269 --> 00:55:13,230 Also, calling it an agentic workflow 1157 00:55:13,230 --> 00:55:14,869 allows us to not mix it up with what 1158 00:55:14,869 --> 00:55:17,909 I called an agent, in the last lecture, 1159 00:55:17,909 --> 00:55:19,309 in reinforcement learning. 1160 00:55:19,309 --> 00:55:22,029 Because in RL, an agent has a very specific definition: 1161 00:55:22,030 --> 00:55:24,670 it interacts with an environment, passes from one state 1162 00:55:24,670 --> 00:55:26,708 to the other, gets a reward and an observation. 1163 00:55:26,708 --> 00:55:28,000 You remember that chart, right? 1164 00:55:32,000 --> 00:55:35,440 So here's an example of how we move from a one-step 1165 00:55:35,440 --> 00:55:39,760 prompt to a multi-step agentic workflow. 1166 00:55:39,760 --> 00:55:44,920 Let's say a user queries a product chatbot: 1167 00:55:44,920 --> 00:55:48,200 what is your refund policy?
1168 00:55:48,199 --> 00:55:51,039 And the response, using a RAG, says 1169 00:55:51,039 --> 00:55:53,779 refunds are available within 30 days of purchase, 1170 00:55:53,780 --> 00:55:57,440 and maybe the RAG can even link to the policy documents. 1171 00:55:57,440 --> 00:55:59,639 That's what we learned so far. 1172 00:55:59,639 --> 00:56:04,119 Instead, an agentic workflow can function like this. 1173 00:56:04,119 --> 00:56:07,559 The user says, can I get a refund for my order? 1174 00:56:07,559 --> 00:56:11,239 And the response via the agentic workflow 1175 00:56:11,239 --> 00:56:14,239 is: the agent retrieves the refund policy using a RAG. 1176 00:56:14,239 --> 00:56:17,299 The agent then follows up with the user and says, 1177 00:56:17,300 --> 00:56:19,720 can you provide your order number? 1178 00:56:19,719 --> 00:56:23,019 Then the agent queries an API to check the order details. 1179 00:56:23,019 --> 00:56:25,139 And finally, it comes back to the user 1180 00:56:25,139 --> 00:56:28,199 and confirms, your order qualifies for a refund. 1181 00:56:28,199 --> 00:56:31,179 The amount will be processed in three to five business days. 1182 00:56:31,179 --> 00:56:33,799 This is much more thoughtful than the first version, 1183 00:56:33,800 --> 00:56:35,164 which is sort of vanilla. 1184 00:56:37,682 --> 00:56:39,099 So that's what we're going to talk 1185 00:56:39,099 --> 00:56:40,900 about in the next couple of slides: 1186 00:56:40,900 --> 00:56:43,240 how do we get from the first one to the second one? 1187 00:56:46,619 --> 00:56:50,139 There are plenty of specialized agentic workflows online. 1188 00:56:50,139 --> 00:56:52,239 You've heard of them, and if you hang out in SF, 1189 00:56:52,239 --> 00:56:55,659 you probably see a bunch of billboards: an AI software 1190 00:56:55,659 --> 00:56:57,819 engineer, the AI skills mentor you've 1191 00:56:57,820 --> 00:56:59,920 interacted with in the class through Workera, 1192 00:56:59,920 --> 00:57:08,099 AI SDRs, AI lawyers, AI specialized cloud engineers. 1193 00:57:08,099 --> 00:57:10,679 It would be a stretch to say that everything works, 1194 00:57:10,679 --> 00:57:12,940 but there's work being done towards that. 1195 00:57:17,860 --> 00:57:19,460 I'm not personally a fan of putting 1196 00:57:19,460 --> 00:57:20,920 a face behind those things. 1197 00:57:20,920 --> 00:57:21,920 I think it's gimmicky. 1198 00:57:21,920 --> 00:57:24,090 And I think a few years from now, actually, 1199 00:57:24,090 --> 00:57:27,750 very few products will have a human face behind them, 1200 00:57:27,750 --> 00:57:32,070 but it might be a marketing tactic from some startups. 1201 00:57:32,070 --> 00:57:35,809 It's more scary than it is engaging, frankly. 1202 00:57:35,809 --> 00:57:36,309 OK. 1203 00:57:36,309 --> 00:57:38,670 I want to talk about the paradigm shift. 1204 00:57:38,670 --> 00:57:40,110 That's especially useful. 1205 00:57:40,110 --> 00:57:41,870 Let's say you're a software engineer 1206 00:57:41,869 --> 00:57:43,777 or you're planning to be a software engineer. 1207 00:57:43,777 --> 00:57:45,610 Because software engineering as a discipline 1208 00:57:45,610 --> 00:57:47,210 is sort of shifting. 1209 00:57:47,210 --> 00:57:49,070 Or at least the best engineers I've 1210 00:57:49,070 --> 00:57:53,350 worked with are able to move from a deterministic mindset 1211 00:57:53,349 --> 00:57:57,110 to a fuzzy mindset and balance between the two 1212 00:57:57,110 --> 00:57:58,890 whenever they need to get something done.
1213 00:57:58,889 --> 00:58:01,949 So here's the paradigm shift between traditional software 1214 00:58:01,949 --> 00:58:04,549 and agentic AI software. 1215 00:58:04,550 --> 00:58:07,670 The first one is the way you handle data. 1216 00:58:07,670 --> 00:58:10,210 Traditional software deals with structured data. 1217 00:58:10,210 --> 00:58:11,130 You have JSONs. 1218 00:58:11,130 --> 00:58:12,670 You have databases. 1219 00:58:12,670 --> 00:58:15,670 They're passed around in a very structured manner 1220 00:58:15,670 --> 00:58:17,811 in a data engineering pipeline. 1221 00:58:17,811 --> 00:58:19,269 And then they would be displayed 1222 00:58:19,269 --> 00:58:21,170 on a certain interface. 1223 00:58:21,170 --> 00:58:24,690 The user might fill a form that is then retrieved and pasted 1224 00:58:24,690 --> 00:58:25,470 into the database. 1225 00:58:25,469 --> 00:58:28,250 All of that, historically, has been structured data. 1226 00:58:28,250 --> 00:58:34,250 Now, more and more companies are handling free-form text, images, 1227 00:58:34,250 --> 00:58:39,289 and all of that requires dynamic interpretation to transform 1228 00:58:39,289 --> 00:58:41,690 an input into an output. 1229 00:58:41,690 --> 00:58:45,429 The software itself used to be deterministic. 1230 00:58:45,429 --> 00:58:47,529 Now you have a lot of software that is fuzzy. 1231 00:58:47,530 --> 00:58:51,290 And fuzzy software creates so many issues. 1232 00:58:51,289 --> 00:58:54,250 I mean, imagine if you let your user ask anything 1233 00:58:54,250 --> 00:58:56,250 on your website. 1234 00:58:56,250 --> 00:58:58,590 The chances that it breaks are tremendous. 1235 00:58:58,590 --> 00:59:00,710 The chances that you're attacked are tremendous. 1236 00:59:00,710 --> 00:59:03,150 The chances-- it's really, really complicated. 1237 00:59:03,150 --> 00:59:07,650 It's more complicated than people make it seem on Twitter. 1238 00:59:07,650 --> 00:59:09,809 Fuzzy engineering is truly hard. 1239 00:59:09,809 --> 00:59:14,090 You might get hate as a company because one user did something 1240 00:59:14,090 --> 00:59:16,530 that you authorized them to do that ended up breaking 1241 00:59:16,530 --> 00:59:18,130 the database and ended up-- 1242 00:59:18,130 --> 00:59:19,740 we've seen that with many companies 1243 00:59:19,739 --> 00:59:21,099 in the last couple of years. 1244 00:59:21,099 --> 00:59:23,980 So it takes a very specialized engineering mindset 1245 00:59:23,980 --> 00:59:25,460 to do fuzzy engineering, but also 1246 00:59:25,460 --> 00:59:29,340 to know when you need to be deterministic. 1247 00:59:29,340 --> 00:59:33,820 The other thing I'd call out is, with agentic AI software, 1248 00:59:33,820 --> 00:59:39,019 you want to think about your software the way a manager would. 1249 00:59:39,019 --> 00:59:44,059 So you're familiar with the monolith or microservices 1250 00:59:44,059 --> 00:59:48,099 approaches in software, where you structure your software 1251 00:59:48,099 --> 00:59:51,799 in different boxes that can talk to each other, 1252 00:59:51,800 --> 00:59:55,140 and it allows teams to debug one section at a time. 1253 00:59:55,139 --> 00:59:59,039 Now the equivalent with agentic AI is you think as a manager. 1254 00:59:59,039 --> 01:00:02,460 So you think, OK, if I were to delegate my product 1255 01:00:02,460 --> 01:00:06,000 to be done by a group of humans, what would those roles be?
1256 01:00:06,000 --> 01:00:09,659 Would I have a graphic designer that puts together a chart 1257 01:00:09,659 --> 01:00:12,420 and then sends it to a marketing manager that converts it 1258 01:00:12,420 --> 01:00:15,420 into a nice blog post, that then gives it to the performance 1259 01:00:15,420 --> 01:00:18,680 marketing expert, that then publishes the work, the blog 1260 01:00:18,679 --> 01:00:20,899 post, and then optimizes and A/B tests? 1261 01:00:20,900 --> 01:00:23,440 Then on to a data scientist that analyzes the data 1262 01:00:23,440 --> 01:00:25,880 and then forms hypotheses and validates 1263 01:00:25,880 --> 01:00:27,320 them or invalidates them. 1264 01:00:27,320 --> 01:00:29,920 That's how you would typically think if you're building 1265 01:00:29,920 --> 01:00:32,639 agentic AI software. 1266 01:00:32,639 --> 01:00:35,769 When actually, the equivalent of that in traditional software 1267 01:00:35,769 --> 01:00:37,019 might be completely different. 1268 01:00:37,019 --> 01:00:39,759 It might be: we have a data engineering box 1269 01:00:39,760 --> 01:00:42,560 right here that handles all our data engineering. 1270 01:00:42,559 --> 01:00:45,860 And then here, we have the UI/UX stuff. 1271 01:00:45,860 --> 01:00:47,940 Everything UI/UX related goes here. 1272 01:00:47,940 --> 01:00:51,019 And companies might structure it in very different ways. 1273 01:00:51,019 --> 01:00:53,684 And here is the business logic that we care about. 1274 01:00:53,684 --> 01:00:56,059 And there's five engineers working on the business logic, 1275 01:00:56,059 --> 01:00:56,559 let's say. 1276 01:00:59,239 --> 01:01:01,159 OK. 1277 01:01:01,159 --> 01:01:04,559 Testing and debugging are also very different. 1278 01:01:04,559 --> 01:01:06,409 And we'll talk about that in the next section. 1279 01:01:09,440 --> 01:01:13,679 The other thing that I feel matters 1280 01:01:13,679 --> 01:01:17,409 is, with AI in engineering, the cost of experimentation 1281 01:01:17,409 --> 01:01:19,210 is going down drastically. 1282 01:01:19,210 --> 01:01:22,010 And so people, I feel, should be more comfortable 1283 01:01:22,010 --> 01:01:23,690 throwing away code. 1284 01:01:23,690 --> 01:01:27,429 It's like, in traditional software engineering, 1285 01:01:27,429 --> 01:01:29,469 you probably don't throw away code a ton. 1286 01:01:29,469 --> 01:01:32,309 You build code, and it's solid, and it's bulletproof, 1287 01:01:32,309 --> 01:01:35,329 and then you update it over time. 1288 01:01:35,329 --> 01:01:39,009 We've seen AI companies be more comfortable throwing away 1289 01:01:39,010 --> 01:01:43,810 code, which has advantages in terms of the speed at which you 1290 01:01:43,809 --> 01:01:46,329 move, but also disadvantages in terms 1291 01:01:46,329 --> 01:01:49,509 of the quality of your software, which can break more. 1292 01:01:52,530 --> 01:01:56,890 So anyway, I just wanted to give an update on the paradigm shift 1293 01:01:56,889 --> 01:01:59,150 from deterministic to fuzzy engineering. 1294 01:02:04,570 --> 01:02:08,370 Oh, and actually, I can give you an example from Workera 1295 01:02:08,369 --> 01:02:11,250 that we learned probably over the last 12 1296 01:02:11,250 --> 01:02:13,750 months, which is, if you've used Workera, 1297 01:02:13,750 --> 01:02:18,070 you might have seen that the interface sometimes asks you 1298 01:02:18,070 --> 01:02:19,590 multiple-choice questions. 1299 01:02:19,590 --> 01:02:21,450 And sometimes, it asks you multiple select.
1300 01:02:21,449 --> 01:02:24,169 And sometimes, it asks you drag and drop, ordering, matching, 1301 01:02:24,170 --> 01:02:25,349 whatever. 1302 01:02:25,349 --> 01:02:28,610 Those are examples of deterministic item types, 1303 01:02:28,610 --> 01:02:31,329 meaning you answer the question on a multiple choice. 1304 01:02:31,329 --> 01:02:32,710 There is one correct answer. 1305 01:02:32,710 --> 01:02:34,510 It's fully deterministic. 1306 01:02:34,510 --> 01:02:38,350 On the other hand, you sometimes have voice questions, 1307 01:02:38,349 --> 01:02:40,309 where you go through a role play, or you 1308 01:02:40,309 --> 01:02:42,029 have voice-plus-coding questions, 1309 01:02:42,030 --> 01:02:45,790 where your code is being read by the interface, or whatever. 1310 01:02:45,789 --> 01:02:49,550 Those are fuzzy, meaning the scoring algorithm 1311 01:02:49,550 --> 01:02:52,269 might actually make mistakes, and those mistakes 1312 01:02:52,269 --> 01:02:53,509 might be costly. 1313 01:02:53,510 --> 01:02:56,190 And so companies have to figure out 1314 01:02:56,190 --> 01:02:58,318 a human-in-the-loop system, which 1315 01:02:58,318 --> 01:03:00,610 you might have seen with the appeal feature at the end. 1316 01:03:00,610 --> 01:03:03,318 So at the end of the assessment, you have an appeal feature 1317 01:03:03,318 --> 01:03:06,430 that allows you to say, I want to appeal, 1318 01:03:06,429 --> 01:03:09,690 because I want to challenge what the agent said about my answer, 1319 01:03:09,690 --> 01:03:12,365 because I thought I was better than what the agent thought. 1320 01:03:12,364 --> 01:03:14,239 And then you bring in the human in the loop, who 1321 01:03:14,239 --> 01:03:16,447 then can fix the agent, can tell the agent, actually, 1322 01:03:16,447 --> 01:03:20,279 you were too harsh on the answer of this person. 1323 01:03:20,280 --> 01:03:24,360 And that's an example of a fuzzy engineered system 1324 01:03:24,360 --> 01:03:28,200 that then adds a human in the loop to make it more aligned. 1325 01:03:28,199 --> 01:03:29,699 And so if you're building a company, 1326 01:03:29,699 --> 01:03:32,279 I would encourage you to think about: what can I 1327 01:03:32,280 --> 01:03:33,800 get done with determinism? 1328 01:03:33,800 --> 01:03:35,100 And let's get that done. 1329 01:03:35,099 --> 01:03:38,000 And then the fuzzy stuff-- I want to do fuzzy 1330 01:03:38,000 --> 01:03:39,900 because it allows more interaction. 1331 01:03:39,900 --> 01:03:42,079 It allows more back and forth, but I need 1332 01:03:42,079 --> 01:03:43,739 to put guardrails around it. 1333 01:03:43,739 --> 01:03:45,739 And how am I going to design those guardrails? 1334 01:03:45,739 --> 01:03:46,639 Pretty much. 1335 01:03:46,639 --> 01:03:49,219 OK? 1336 01:03:49,219 --> 01:03:54,039 Here's another example, from enterprise workflows, 1337 01:03:54,039 --> 01:03:57,519 which are likely to change due to agentic AI. 1338 01:03:57,519 --> 01:04:01,619 This is a paper from McKinsey, I believe from last year, 1339 01:04:01,619 --> 01:04:05,199 where they looked at a financial institution, and they said, 1340 01:04:05,199 --> 01:04:07,599 we observed that they often spend one to four weeks 1341 01:04:07,599 --> 01:04:10,119 to create a credit risk memo. 1342 01:04:10,119 --> 01:04:11,859 And here's the process. 1343 01:04:11,860 --> 01:04:16,539 A relationship manager gathers data from 15 1344 01:04:16,539 --> 01:04:19,699 or more sources on the borrower, 1345 01:04:19,699 --> 01:04:22,699 loan type, and other factors.
1346 01:04:22,699 --> 01:04:25,339 Then the relationship manager and the credit analyst 1347 01:04:25,340 --> 01:04:28,780 collaboratively analyze that data from these sources. 1348 01:04:28,780 --> 01:04:33,620 Then the credit analyst typically spends 20 hours 1349 01:04:33,619 --> 01:04:36,019 or more writing a memo and then goes back 1350 01:04:36,019 --> 01:04:37,860 to the relationship manager. 1351 01:04:37,860 --> 01:04:40,260 They give feedback, and then they go through this loop 1352 01:04:40,260 --> 01:04:41,540 again and again. 1353 01:04:41,539 --> 01:04:46,139 And it takes a long time to get a credit memo out. 1354 01:04:46,139 --> 01:04:50,639 And then they ran a research study where they changed the process. 1355 01:04:50,639 --> 01:04:56,139 They said gen AI agents could actually cut time by 20% to 60% 1356 01:04:56,139 --> 01:04:58,500 on credit risk memos. 1357 01:04:58,500 --> 01:05:01,059 And the process changed to: the relationship manager 1358 01:05:01,059 --> 01:05:03,219 works directly with the gen AI agent system and 1359 01:05:03,219 --> 01:05:07,139 provides the relevant materials it needs to produce the memo. 1360 01:05:07,139 --> 01:05:10,069 The agent subdivides the project into tasks 1361 01:05:10,070 --> 01:05:12,269 that are assigned to specialist agents, 1362 01:05:12,269 --> 01:05:15,309 gathers and analyzes the data from multiple sources, 1363 01:05:15,309 --> 01:05:16,710 and drafts a memo. 1364 01:05:16,710 --> 01:05:19,309 Then the relationship manager and the credit analyst 1365 01:05:19,309 --> 01:05:20,969 sit down together, review the memo, 1366 01:05:20,969 --> 01:05:22,489 and give feedback to the agent. 1367 01:05:22,489 --> 01:05:26,869 And they are done in 20% to 60% less time. 1368 01:05:26,869 --> 01:05:30,029 And so this is an example where you're actually not changing 1369 01:05:30,030 --> 01:05:31,290 the human stakeholders. 1370 01:05:31,289 --> 01:05:33,909 You're just changing the process and adding 1371 01:05:33,909 --> 01:05:38,589 gen AI to reduce the time it takes to get a credit memo out. 1372 01:05:38,590 --> 01:05:42,350 It turns out that-- imagine you're an enterprise, 1373 01:05:42,349 --> 01:05:47,429 and you have 100,000 employees, and there's a lot of enterprises 1374 01:05:47,429 --> 01:05:50,309 with 100,000 employees out there-- 1375 01:05:50,309 --> 01:05:52,509 you are currently in a crisis in terms 1376 01:05:52,510 --> 01:05:55,855 of redesigning your workflows. 1377 01:05:55,855 --> 01:05:57,230 It turns out that if you actually 1378 01:05:57,230 --> 01:06:00,550 pull the job descriptions from the HR system 1379 01:06:00,550 --> 01:06:02,630 and you interpret them, and you also pull 1380 01:06:02,630 --> 01:06:04,590 the business process workflows that you 1381 01:06:04,590 --> 01:06:07,150 have encoded in your drive, 1382 01:06:07,150 --> 01:06:10,960 you actually can find gains in multiple places. 1383 01:06:10,960 --> 01:06:12,519 And in the next few years, you're 1384 01:06:12,519 --> 01:06:14,320 probably going to see workflows being 1385 01:06:14,320 --> 01:06:17,039 more optimized to add gen AI. 1386 01:06:17,039 --> 01:06:20,179 Even if that happens, the hardest part is changing people. 1387 01:06:20,179 --> 01:06:23,480 We know this is great in theory, but now, 1388 01:06:23,480 --> 01:06:28,360 let's try to fit that second workflow to 10,000 credit 1389 01:06:28,360 --> 01:06:31,680 risk analysts and relationship managers. 1390 01:06:31,679 --> 01:06:33,379 My guess is it will take years.
1391 01:06:33,380 --> 01:06:37,519 It will take 10, 20 years to get this actually done 1392 01:06:37,519 --> 01:06:40,280 at scale within an organization. 1393 01:06:40,280 --> 01:06:42,320 Because change is so hard. 1394 01:06:42,320 --> 01:06:47,400 It's so hard to rewire businesses, workflows, job descriptions, 1395 01:06:47,400 --> 01:06:50,119 incentivize people to do things differently, and be different, 1396 01:06:50,119 --> 01:06:50,900 and train them. 1397 01:06:50,900 --> 01:06:55,220 And so this is what the world is going towards, 1398 01:06:55,219 --> 01:06:59,480 but it's going to take a long time, I think. 1399 01:06:59,480 --> 01:07:00,219 OK. 1400 01:07:00,219 --> 01:07:02,759 Then I want to talk about how the agent actually works 1401 01:07:02,760 --> 01:07:07,100 and what the core components of an agent are. 1402 01:07:07,099 --> 01:07:10,219 Imagine a travel booking agent. That's 1403 01:07:10,219 --> 01:07:12,439 an easy example you've all thought about. 1404 01:07:12,440 --> 01:07:16,039 I still haven't been able to get an agent to book a trip for me, 1405 01:07:16,039 --> 01:07:18,340 or I was scared because it was going to book 1406 01:07:18,340 --> 01:07:20,680 a very expensive or long trip. 1407 01:07:20,679 --> 01:07:24,819 But in theory, you can have a travel booking 1408 01:07:24,820 --> 01:07:26,400 agent that has prompts. 1409 01:07:26,400 --> 01:07:28,700 So the prompts we've seen-- we know the methods 1410 01:07:28,699 --> 01:07:30,539 to optimize those prompts. 1411 01:07:30,539 --> 01:07:34,880 That travel agent also has a context management system, 1412 01:07:34,880 --> 01:07:38,420 which is essentially the memory of what it knows about the user. 1413 01:07:38,420 --> 01:07:40,659 That context management system might 1414 01:07:40,659 --> 01:07:45,799 include a core memory, or working memory, and an archival memory, 1415 01:07:45,800 --> 01:07:46,860 OK? 1416 01:07:46,860 --> 01:07:51,059 The difference within memory 1417 01:07:51,059 --> 01:07:54,940 is that not every memory needs to be fast to access. 1418 01:07:54,940 --> 01:07:56,159 Think about it. 1419 01:07:56,159 --> 01:07:59,659 You're onboarded on a product, and the first question is, hi, 1420 01:07:59,659 --> 01:08:00,599 what's your name? 1421 01:08:00,599 --> 01:08:02,900 And I say, my name is Kian. 1422 01:08:02,900 --> 01:08:05,037 That's probably going to sit in the working memory, 1423 01:08:05,036 --> 01:08:07,369 because the agent, every time it's going to talk to me, 1424 01:08:07,369 --> 01:08:08,786 is going to want to use my name. 1425 01:08:08,786 --> 01:08:10,829 But then maybe the second question 1426 01:08:10,829 --> 01:08:12,409 is, what's your birthday? 1427 01:08:12,409 --> 01:08:13,750 And I give it my birthday. 1428 01:08:13,750 --> 01:08:15,489 Does it need my birthday every day? 1429 01:08:15,489 --> 01:08:16,210 Probably not. 1430 01:08:16,210 --> 01:08:18,670 So it's probably going to park it in the long-term 1431 01:08:18,670 --> 01:08:20,949 memory, or the archival memory. 1432 01:08:20,949 --> 01:08:24,250 And those memories are slower to access. 1433 01:08:24,250 --> 01:08:26,750 They're farther down the stack. 1434 01:08:26,750 --> 01:08:28,789 And that structure allows the agent 1435 01:08:28,789 --> 01:08:30,829 to determine: what's the working memory, 1436 01:08:30,829 --> 01:08:33,189 and what's the long-term memory? 1437 01:08:33,189 --> 01:08:36,090 And that makes it easier for the agent to retrieve super fast.
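A toy sketch of that two-tier memory, with illustrative names rather than any specific framework's API:

```python
# Toy two-tier agent memory: a small working memory that is always pasted
# into the prompt, and an archival store that is only searched on demand.
# Purely illustrative; real systems add eviction, summarization, etc.

class AgentMemory:
    def __init__(self, working_capacity: int = 5):
        self.working: dict[str, str] = {}   # hot facts, always in context
        self.archival: dict[str, str] = {}  # cold facts, fetched on demand
        self.working_capacity = working_capacity

    def remember(self, key: str, value: str, hot: bool = False) -> None:
        if hot and len(self.working) < self.working_capacity:
            self.working[key] = value   # e.g., the user's name
        else:
            self.archival[key] = value  # e.g., the user's birthday

    def context_block(self) -> str:
        """Cheap path: prepended to every prompt."""
        return "\n".join(f"{k}: {v}" for k, v in self.working.items())

    def lookup(self, key: str) -> str | None:
        """Slower path: only used when the agent decides it needs it."""
        return self.working.get(key) or self.archival.get(key)

memory = AgentMemory()
memory.remember("name", "Kian", hot=True)   # working memory
memory.remember("birthday", "June 1")       # parked in archival memory
```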
1438 01:08:36,090 --> 01:08:37,289 Because think about it. 1439 01:08:37,289 --> 01:08:39,390 When you interact with ChatGPT, you 1440 01:08:39,390 --> 01:08:41,270 feel that it's very personal at times. 1441 01:08:41,270 --> 01:08:43,750 You feel like it understands you. 1442 01:08:43,750 --> 01:08:47,510 Imagine every time you call it, it has to read the memories. 1443 01:08:47,510 --> 01:08:48,909 And that can be costly. 1444 01:08:48,909 --> 01:08:52,510 It's a very burdensome cost, because it happens 1445 01:08:52,510 --> 01:08:54,649 every time you talk to it. 1446 01:08:54,649 --> 01:08:57,270 So you want to be highly optimized with the working 1447 01:08:57,270 --> 01:08:59,095 memory. 1448 01:08:59,095 --> 01:09:00,470 If it takes three seconds to look 1449 01:09:00,470 --> 01:09:03,069 in the memory, every time you're going to talk to your LLM, 1450 01:09:03,069 --> 01:09:06,210 it's going to take three seconds, which you don't want. 1451 01:09:06,210 --> 01:09:06,890 Anyway. 1452 01:09:06,890 --> 01:09:08,189 And then you have the tools. 1453 01:09:08,189 --> 01:09:11,490 The tools can include APIs like a flight search 1454 01:09:11,489 --> 01:09:15,688 API, a hotel booking API, a car rental API, a weather API, 1455 01:09:15,689 --> 01:09:18,450 and then a payment processing API. 1456 01:09:18,449 --> 01:09:21,688 And typically, you would want to tell your agent 1457 01:09:21,689 --> 01:09:23,430 how that API works. 1458 01:09:23,430 --> 01:09:27,010 It turns out that agents, or LLMs, I should say, 1459 01:09:27,010 --> 01:09:29,590 are very good at reading API documentation. 1460 01:09:29,590 --> 01:09:31,210 So you give it the API documentation, 1461 01:09:31,210 --> 01:09:33,590 and it reads the JSON, and it reads 1462 01:09:33,590 --> 01:09:35,609 what a GET request looks like. 1463 01:09:35,609 --> 01:09:38,189 And this is the format that I need to push. 1464 01:09:38,189 --> 01:09:41,569 And then it pushes it in that format, let's say. 1465 01:09:41,569 --> 01:09:45,090 And then it retrieves something. 1466 01:09:45,090 --> 01:09:49,170 Does that make sense, those different components? 1467 01:09:49,170 --> 01:09:51,750 Anthropic also talks about resources. 1468 01:09:51,750 --> 01:09:55,369 Resources are data that is sitting somewhere that you 1469 01:09:55,369 --> 01:09:57,309 might let your agent read. 1470 01:09:57,310 --> 01:10:00,770 For example, if you're building your startup, you have a CRM. 1471 01:10:00,770 --> 01:10:05,000 A CRM has data in it, and you want to do lookups in that data. 1472 01:10:05,000 --> 01:10:07,859 You will probably give it a lookup tool, 1473 01:10:07,859 --> 01:10:10,359 and you will give it access to the resource, 1474 01:10:10,359 --> 01:10:12,609 and it will do lookups whenever you want, super fast. 1475 01:10:16,300 --> 01:10:19,020 This type of architecture can be built 1476 01:10:19,020 --> 01:10:21,080 with different degrees of autonomy, 1477 01:10:21,079 --> 01:10:23,659 from the least autonomous to the most autonomous. 1478 01:10:23,659 --> 01:10:26,260 And I'll give you a few examples. 1479 01:10:26,260 --> 01:10:29,560 Least autonomous would be: you've hard-coded the steps. 1480 01:10:29,560 --> 01:10:35,020 So let's say I tell the travel agent, first identify the intent. 1481 01:10:35,020 --> 01:10:39,300 Then look up in the database the history 1482 01:10:39,300 --> 01:10:42,460 of this customer with us and their preferences. 1483 01:10:42,460 --> 01:10:45,239 Then go to the flight API, blah, blah, blah.
1484 01:10:45,239 --> 01:10:45,979 Then go to the-- 1485 01:10:45,979 --> 01:10:47,619 I would hard code the steps. 1486 01:10:47,619 --> 01:10:48,220 OK. 1487 01:10:48,220 --> 01:10:50,539 That's the least autonomous. 1488 01:10:50,539 --> 01:10:54,659 The semi-autonomous is: I might hard code the tools, 1489 01:10:54,659 --> 01:10:57,059 but we're not going to hard code the steps. 1490 01:10:57,060 --> 01:11:02,120 So I'm going to tell the agent, you act like a travel agent. 1491 01:11:02,119 --> 01:11:10,199 And your task is to help the person book a trip. 1492 01:11:10,199 --> 01:11:13,279 And these are the tools that you have accessible to you. 1493 01:11:13,279 --> 01:11:14,939 And so I'm not hard coding the steps. 1494 01:11:14,939 --> 01:11:17,064 I'm just hard coding the tools that you have access 1495 01:11:17,064 --> 01:11:18,919 to. 1496 01:11:18,920 --> 01:11:22,480 The most autonomous is: the agent decides the steps 1497 01:11:22,479 --> 01:11:24,722 and can create the tools. 1498 01:11:24,722 --> 01:11:26,640 So that's where you might actually give the agent 1499 01:11:26,640 --> 01:11:28,980 access to a code editor. 1500 01:11:28,979 --> 01:11:33,219 And the agent might actually be able to ping any API on the web, 1501 01:11:33,220 --> 01:11:34,800 perform some web search. 1502 01:11:34,800 --> 01:11:37,079 It might even be able to create some code 1503 01:11:37,079 --> 01:11:39,039 to display data to the user. 1504 01:11:39,039 --> 01:11:42,159 It might even be able to perform some calculations. 1505 01:11:42,159 --> 01:11:44,760 Like, oh, I'm going to calculate the fastest route 1506 01:11:44,760 --> 01:11:48,000 to get from San Francisco to New York, 1507 01:11:48,000 --> 01:11:50,760 and which one might be the most appropriate 1508 01:11:50,760 --> 01:11:52,378 for what the user is looking for. 1509 01:11:52,377 --> 01:11:54,920 And then I want to calculate the distance between the airport 1510 01:11:54,920 --> 01:11:56,899 and that hotel versus that hotel. 1511 01:11:56,899 --> 01:11:58,769 And I'm going to write code to do that. 1512 01:11:58,770 --> 01:12:00,650 So it's actually fully autonomous 1513 01:12:00,649 --> 01:12:02,210 from that perspective. 1514 01:12:05,210 --> 01:12:07,409 So yeah. 1515 01:12:07,409 --> 01:12:08,849 Remember those keywords: 1516 01:12:08,850 --> 01:12:14,530 memory, prompts, tools, et cetera. 1517 01:12:14,529 --> 01:12:18,409 Now, I presented the flight API, but it does not 1518 01:12:18,409 --> 01:12:19,729 have to be an API. 1519 01:12:19,729 --> 01:12:23,329 You've probably heard the term MCP, or Model Context Protocol, 1520 01:12:23,329 --> 01:12:25,229 which was coined by Anthropic. 1521 01:12:25,229 --> 01:12:29,649 I pasted the seminal article on MCP at the bottom of this slide. 1522 01:12:29,649 --> 01:12:34,689 But let me explain in a nutshell why those things differ. 1523 01:12:34,689 --> 01:12:39,649 In the API case, you would actually 1524 01:12:39,649 --> 01:12:42,710 teach your LLM to ping an API. 1525 01:12:42,710 --> 01:12:45,670 So you would say, this is how you ping this API, 1526 01:12:45,670 --> 01:12:48,050 and this is the data that it will send you back. 1527 01:12:48,050 --> 01:12:51,430 And you would have to do that in a one-off manner. 1528 01:12:51,430 --> 01:12:53,610 So you would have to build or give 1529 01:12:53,609 --> 01:12:56,670 the API documentation of your flight API, 1530 01:12:56,670 --> 01:13:00,750 your hotel booking API, your car rental API.
1531 01:13:00,750 --> 01:13:03,029 And then you would give tools for your model 1532 01:13:03,029 --> 01:13:06,630 to communicate with those APIs. 1533 01:13:06,630 --> 01:13:11,150 It doesn't scale very well, versus MCP. 1534 01:13:11,149 --> 01:13:19,429 MCP is really about putting a system in the middle that 1535 01:13:19,430 --> 01:13:22,270 would make it simpler for your LLM to communicate 1536 01:13:22,270 --> 01:13:23,750 with that endpoint. 1537 01:13:23,750 --> 01:13:28,789 So for instance, you might have an MCP server and an MCP client, 1538 01:13:28,789 --> 01:13:30,550 where you're trying to communicate 1539 01:13:30,550 --> 01:13:35,510 with that travel database or the flight API over MCP. 1540 01:13:35,510 --> 01:13:38,430 And your agent might actually just communicate with it 1541 01:13:38,430 --> 01:13:42,030 and say, hey, what do you need in order to give me more flight 1542 01:13:42,029 --> 01:13:43,109 information? 1543 01:13:43,109 --> 01:13:47,069 And that side will respond: I would like you to tell me 1544 01:13:47,069 --> 01:13:49,429 the origin of the flight, the destination, 1545 01:13:49,430 --> 01:13:51,289 and what you're looking for at a high level. 1546 01:13:51,289 --> 01:13:52,250 This is my requirement. 1547 01:13:52,250 --> 01:13:52,750 OK. 1548 01:13:52,750 --> 01:13:55,159 Let me get back to you with my requirements. 1549 01:13:55,159 --> 01:13:55,659 Oh, 1550 01:13:55,659 --> 01:13:57,880 you forgot to tell me your budget, whatever. 1551 01:13:57,880 --> 01:13:58,380 Oh, 1552 01:13:58,380 --> 01:14:00,720 let me give you my budget, et cetera. 1553 01:14:00,720 --> 01:14:04,740 And it's agent-to-agent communication, 1554 01:14:04,739 --> 01:14:06,739 which allows more scalability. 1555 01:14:06,739 --> 01:14:09,099 You don't need to hard code everything. 1556 01:14:09,100 --> 01:14:11,920 Companies have published their MCPs out there, 1557 01:14:11,920 --> 01:14:14,279 and your agent can communicate with them 1558 01:14:14,279 --> 01:14:16,899 and figure out how to get the data it needs. 1559 01:14:16,899 --> 01:14:18,639 Does that make sense? 1560 01:14:18,640 --> 01:14:21,020 Yeah. 1561 01:14:21,020 --> 01:14:23,373 [INAUDIBLE] rewriting any [INAUDIBLE] 1562 01:14:36,880 --> 01:14:39,507 I think it is, ultimately. 1563 01:14:39,507 --> 01:14:41,300 The question is, isn't it just shifting the issue? 1564 01:14:41,300 --> 01:14:43,380 Because anyway, if an API has to be updated, 1565 01:14:43,380 --> 01:14:45,600 the MCP has to be updated, is what you say, right? 1566 01:14:45,600 --> 01:14:46,900 Yes, that's correct. 1567 01:14:46,899 --> 01:14:51,119 But at least it allows the agent to go back and forth 1568 01:14:51,119 --> 01:14:52,960 and figure out what the requirements are. 1569 01:14:52,960 --> 01:14:56,340 But at the end of the day, ideally, if you're a startup, 1570 01:14:56,340 --> 01:14:57,779 you have some documentation. 1571 01:14:57,779 --> 01:15:00,859 And automatically, you have an agent or an LLM workflow 1572 01:15:00,859 --> 01:15:03,099 that reads that documentation and updates the code 1573 01:15:03,100 --> 01:15:04,500 accordingly. 1574 01:15:04,500 --> 01:15:05,720 But I agree. 1575 01:15:05,720 --> 01:15:08,980 It's not something that is fully autonomous. 1576 01:15:08,979 --> 01:15:09,519 Yeah. 1577 01:15:09,520 --> 01:15:12,680 I've seen some security issues. 1578 01:15:12,680 --> 01:15:14,539 Why is that possible? 1579 01:15:14,539 --> 01:15:16,909 Which security issues specifically?
1561 01:14:21,020 --> 01:14:23,373 [INAUDIBLE] rewriting any [INAUDIBLE] 1562 01:14:36,880 --> 01:14:39,507 I think it is, ultimately. 1563 01:14:39,507 --> 01:14:41,300 The question is, isn't it just shifting the issue? 1564 01:14:41,300 --> 01:14:43,380 Because anyway, if an API has to be updated, 1565 01:14:43,380 --> 01:14:45,600 the MCP has to be updated, is what you say, right? 1566 01:14:45,600 --> 01:14:46,900 Yes, that's correct. 1567 01:14:46,899 --> 01:14:51,119 But at least it allows the agent to go back and forth 1568 01:14:51,119 --> 01:14:52,960 and figure out what the requirements are. 1569 01:14:52,960 --> 01:14:56,340 But at the end of the day, ideally, if you're a startup, 1570 01:14:56,340 --> 01:14:57,779 you have some documentation. 1571 01:14:57,779 --> 01:15:00,859 And automatically, you have an agent or an LLM workflow 1572 01:15:00,859 --> 01:15:03,099 that reads that documentation and updates the code 1573 01:15:03,100 --> 01:15:04,500 accordingly. 1574 01:15:04,500 --> 01:15:05,720 But I agree. 1575 01:15:05,720 --> 01:15:08,980 It's not something that is fully autonomous. 1576 01:15:08,979 --> 01:15:09,519 Yeah. 1577 01:15:09,520 --> 01:15:12,680 I've seen some security issues. 1578 01:15:12,680 --> 01:15:14,539 Why is that possible? 1579 01:15:14,539 --> 01:15:16,909 Which security specifically? 1580 01:15:16,909 --> 01:15:18,840 [INAUDIBLE] 1581 01:15:18,840 --> 01:15:19,340 Yeah. 1582 01:15:19,340 --> 01:15:23,300 So are there security issues with MCPs? 1583 01:15:23,300 --> 01:15:25,779 So think about it this way. 1584 01:15:25,779 --> 01:15:28,979 MCPs, depending on the data that you get access to, 1585 01:15:28,979 --> 01:15:30,939 might have different requirements, lower stake 1586 01:15:30,939 --> 01:15:31,879 or higher stake. 1587 01:15:31,880 --> 01:15:34,380 I'm not an expert at the full range. 1588 01:15:34,380 --> 01:15:42,539 But it wouldn't surprise me that when you expose an MCP-- 1589 01:15:42,539 --> 01:15:45,600 I think a lot of MCPs would have authentication. 1590 01:15:45,600 --> 01:15:47,660 So you might actually need a code 1591 01:15:47,659 --> 01:15:50,340 to actually talk to it, just like you would with an API, 1592 01:15:50,340 --> 01:15:52,190 or a key. 1593 01:15:52,189 --> 01:15:53,869 Yeah, but that's a good question. 1594 01:15:53,869 --> 01:15:56,729 I'm not an expert at the security of these systems, 1595 01:15:56,729 --> 01:15:59,049 but we can look into it. 1596 01:16:02,670 --> 01:16:04,670 Any other questions on what we've 1597 01:16:04,670 --> 01:16:10,470 seen with the agentic workflows, APIs, tools, MCPs, memory? 1598 01:16:10,470 --> 01:16:11,750 All of that is still a work in progress. 1599 01:16:11,750 --> 01:16:14,289 So even memory is not a solved problem by any means. 1600 01:16:14,289 --> 01:16:16,510 It's pretty hard actually. 1601 01:16:16,510 --> 01:16:18,350 Yes. 1602 01:16:18,350 --> 01:16:24,510 You don't need an [INAUDIBLE] The MCP just 1603 01:16:24,510 --> 01:16:28,481 makes it easier to access the API, but technically, 1604 01:16:28,481 --> 01:16:29,689 [INAUDIBLE] 1605 01:16:40,829 --> 01:16:42,109 Exactly, exactly. 1606 01:16:42,109 --> 01:16:45,289 Is MCP about efficiency or accessing more data? 1607 01:16:45,289 --> 01:16:47,109 It's about efficiency. 1608 01:16:47,109 --> 01:16:53,710 Let's say you have a coding agent, and it has an MCP client, 1609 01:16:53,710 --> 01:16:57,850 and there's multiple MCP servers that are exposed out there. 1610 01:16:57,850 --> 01:17:00,690 That agent can communicate very efficiently with them 1611 01:17:00,689 --> 01:17:03,529 and find what it needs. 1612 01:17:03,529 --> 01:17:05,170 And it's a more efficient process 1613 01:17:05,170 --> 01:17:09,690 than actually spelling out the APIs on each side, 1614 01:17:09,689 --> 01:17:12,169 how to ping them, and what the protocol is. 1615 01:17:12,170 --> 01:17:13,810 But it's not about the data that is 1616 01:17:13,810 --> 01:17:15,370 being exposed because ultimately, you control 1617 01:17:15,369 --> 01:17:16,662 the data that is being exposed. 1618 01:17:19,090 --> 01:17:22,069 You probably, depending on how the MCP is built, 1619 01:17:22,069 --> 01:17:24,569 my guess is you probably expose yourself to other risks 1620 01:17:24,569 --> 01:17:31,529 because your MCP server can see any input pretty much 1621 01:17:31,529 --> 01:17:32,434 from another LLM. 1622 01:17:32,435 --> 01:17:33,560 And so it has to be robust. 1623 01:17:36,130 --> 01:17:37,529 But yeah. 1624 01:17:37,529 --> 01:17:39,329 Super. 1625 01:17:39,329 --> 01:17:41,449 So let's look at an example of a step 1626 01:17:41,449 --> 01:17:45,069 by step workflow for the travel agent.
1627 01:17:45,069 --> 01:17:50,819 So let's say the user says, I want to plan a trip to Paris 1628 01:17:50,819 --> 01:17:56,099 from December 15 to 20th with flights, 1629 01:17:56,100 --> 01:18:00,579 hotels near the Eiffel Tower, and then an itinerary of 1630 01:18:00,579 --> 01:18:01,819 must-visit places. 1631 01:18:01,819 --> 01:18:04,019 That's the task for the travel agent. 1632 01:18:04,020 --> 01:18:06,500 Step two, the agent plans the steps. 1633 01:18:06,500 --> 01:18:08,640 So it says, I'm going to find flights. 1634 01:18:08,640 --> 01:18:12,400 Use the flight search API to get options for December 15. 1635 01:18:12,399 --> 01:18:15,059 Search hotels, generate recommendations for places 1636 01:18:15,060 --> 01:18:20,039 to visit, validate preferences, budget, et cetera. 1637 01:18:20,039 --> 01:18:24,060 Book the trip with the payment processing API. 1638 01:18:24,060 --> 01:18:25,760 That's just the planning, by the way. 1639 01:18:25,760 --> 01:18:28,680 Step three, execute the plan, use your tools, 1640 01:18:28,680 --> 01:18:31,420 combine the results, and then proactive 1641 01:18:31,420 --> 01:18:33,260 user interaction and booking. 1642 01:18:33,260 --> 01:18:35,900 It might make a first proposal to the user 1643 01:18:35,899 --> 01:18:38,479 and ask the user to validate or invalidate 1644 01:18:38,479 --> 01:18:42,699 and then may repeat that planning and execution process. 1645 01:18:42,699 --> 01:18:46,079 And then finally, it might actually update the memory. 1646 01:18:46,079 --> 01:18:49,000 It might say, oh, I just learned through this interaction 1647 01:18:49,000 --> 01:18:51,880 that the user only likes direct flights. 1648 01:18:51,880 --> 01:18:55,640 Next time, I'll only give direct flights. 1649 01:18:55,640 --> 01:19:01,160 Or I noticed users are fine with three-star hotels or four-star 1650 01:19:01,159 --> 01:19:01,739 hotels. 1651 01:19:01,739 --> 01:19:05,000 And in fact, they don't want to go above budget or something 1652 01:19:05,000 --> 01:19:08,000 like that. 1653 01:19:08,000 --> 01:19:11,739 So that hopefully makes sense by now on how you might do that.
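A minimal sketch of that loop is below. Every LLM call and every tool is stubbed out as a plain Python function, and all of the names are made up for illustration; in a real system each stub would be a model call or an API hit.

```python
# A minimal sketch of the plan -> execute -> validate -> update-memory
# loop just described. All names and return values are hypothetical stubs.

memory = {"preferences": []}

def llm_plan(task: str) -> list:
    # Stub for step two: a real agent would ask the LLM to produce a plan.
    return ["find_flights", "search_hotels", "draft_itinerary"]

TOOLS = {
    # Stubs for step three: real tools would call the flight/hotel APIs.
    "find_flights": lambda: "3 flight options for December 15",
    "search_hotels": lambda: "2 hotels near the Eiffel Tower",
    "draft_itinerary": lambda: "Louvre, Musee d'Orsay, Montmartre",
}

def user_validates(proposal: str) -> bool:
    # Stub for step four, the proactive user interaction.
    return "Eiffel" in proposal

def run_agent(task: str) -> None:
    for _ in range(2):  # replan at most once if the user rejects
        results = [TOOLS[step]() for step in llm_plan(task)]
        proposal = " | ".join(results)
        if user_validates(proposal):
            break
    # Step five: update memory with what the interaction taught us.
    memory["preferences"].append("prefers direct flights")

run_agent("Plan a trip to Paris, December 15 to 20")
print(memory)  # {'preferences': ['prefers direct flights']}
```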
1654 01:19:11,739 --> 01:19:16,420 My question for you is, how would you know if this works? 1655 01:19:16,420 --> 01:19:19,600 And if you had such a system running in production, how 1656 01:19:19,600 --> 01:19:20,860 would you improve it? 1657 01:19:28,420 --> 01:19:28,920 Yeah. 1658 01:19:28,920 --> 01:19:31,800 Let users rate their experience. 1659 01:19:31,800 --> 01:19:33,579 So that's an example. 1660 01:19:33,579 --> 01:19:37,399 So let users rate their experience at the end. 1661 01:19:37,399 --> 01:19:39,699 That would be an end to end test, right? 1662 01:19:39,699 --> 01:19:42,960 You're looking at the user experience through the steps 1663 01:19:42,960 --> 01:19:46,069 and say how good was it from 1 to 5, let's say. 1664 01:19:46,069 --> 01:19:46,722 Yeah. 1665 01:19:46,722 --> 01:19:47,390 It's a good way. 1666 01:19:47,390 --> 01:19:50,730 And then if you learn that a user says 1, 1667 01:19:50,729 --> 01:19:53,679 how do you improve the workflow? 1668 01:19:56,855 --> 01:19:58,010 [INAUDIBLE] 1669 01:19:59,390 --> 01:19:59,890 OK. 1670 01:19:59,890 --> 01:20:04,329 So you would go down a tree and say, OK, you said 1. 1671 01:20:04,329 --> 01:20:06,069 What was your issue? 1672 01:20:06,069 --> 01:20:10,170 And then the user says the prices were too high, let's say. 1673 01:20:10,170 --> 01:20:14,690 And then you would go back and fix that specific tool or prompt 1674 01:20:14,689 --> 01:20:15,789 or, yeah, OK. 1675 01:20:15,789 --> 01:20:18,582 Any other ideas? 1676 01:20:18,582 --> 01:20:19,690 [INAUDIBLE] 1677 01:20:29,130 --> 01:20:29,750 Yeah, good. 1678 01:20:29,750 --> 01:20:30,949 So that's a good insight. 1679 01:20:30,949 --> 01:20:34,309 Separate the LLM-related stuff from the non-LLM-related stuff, 1680 01:20:34,310 --> 01:20:35,553 the deterministic stuff. 1681 01:20:35,552 --> 01:20:36,970 The deterministic stuff, you might 1682 01:20:36,970 --> 01:20:41,530 be able to fix more objectively, essentially. 1683 01:20:41,529 --> 01:20:43,590 Yeah. 1684 01:20:43,590 --> 01:20:44,329 What else? 1685 01:20:56,670 --> 01:21:00,909 So give me an example of an objective issue 1686 01:21:00,909 --> 01:21:03,149 that you can notice and how you would fix it, 1687 01:21:03,149 --> 01:21:06,269 versus a subjective issue. 1688 01:21:06,270 --> 01:21:06,810 Yeah. 1689 01:21:06,810 --> 01:21:08,550 [INAUDIBLE] 1690 01:21:16,050 --> 01:21:19,090 So let's say you see there's the same flight, 1691 01:21:19,090 --> 01:21:21,550 but one is cheaper than the other, let's say. 1692 01:21:21,550 --> 01:21:23,010 It's objectively worse. 1693 01:21:23,010 --> 01:21:25,690 And so you can capture that almost automatically. 1694 01:21:25,689 --> 01:21:26,189 Yeah. 1695 01:21:26,189 --> 01:21:27,869 So you could actually build evals 1696 01:21:27,869 --> 01:21:32,529 that are objective, that are tracked across your users. 1697 01:21:32,529 --> 01:21:34,949 And you might actually run an analysis after 1698 01:21:34,949 --> 01:21:37,170 and see that for the objective stuff, 1699 01:21:37,170 --> 01:21:43,640 we notice that our LLM AI agent workflow is bad with pricing. 1700 01:21:43,640 --> 01:21:46,000 It just doesn't read price as well because it always 1701 01:21:46,000 --> 01:21:48,079 gives a more expensive option. 1702 01:21:48,079 --> 01:21:48,579 Yeah. 1703 01:21:48,579 --> 01:21:49,698 You're perfectly right. 1704 01:21:49,698 --> 01:21:50,990 How about the subjective stuff? 1705 01:21:59,600 --> 01:22:01,920 Do you choose a direct or indirect flight 1706 01:22:01,920 --> 01:22:05,060 if the indirect is a little bit cheaper? 1707 01:22:05,060 --> 01:22:05,560 Yeah. 1708 01:22:05,560 --> 01:22:06,380 Good one. 1709 01:22:06,380 --> 01:22:09,079 Do you choose a direct flight or an indirect flight 1710 01:22:09,079 --> 01:22:12,960 if the indirect is cheaper but the direct is more comfortable? 1711 01:22:12,960 --> 01:22:13,460 Yeah. 1712 01:22:13,460 --> 01:22:16,000 That's a good one actually. 1713 01:22:16,000 --> 01:22:18,739 So how would you capture that information? 1714 01:22:18,739 --> 01:22:20,809 Let's say this is used by thousands of users. 1715 01:22:24,279 --> 01:22:28,920 Could you feed something in [INAUDIBLE] 1716 01:22:28,920 --> 01:22:30,220 Could you feed something in? 1717 01:22:30,220 --> 01:22:32,690 Yeah, I mean, you could-- 1718 01:22:32,689 --> 01:22:36,279 you could feed something in about the user preferences. 1719 01:22:36,279 --> 01:22:39,380 Well, you could build a data set that 1720 01:22:39,380 --> 01:22:40,800 has some of that information. 1721 01:22:40,800 --> 01:22:44,739 So you build 10 prompts, where the user is asking specifically 1722 01:22:44,739 --> 01:22:46,639 for a direct-- 1723 01:22:46,640 --> 01:22:48,940 saying that I prefer direct flights because I 1724 01:22:48,939 --> 01:22:50,979 care about my time, let's say.
1725 01:22:50,979 --> 01:22:53,219 And then you look at the output and you actually 1726 01:22:53,220 --> 01:22:56,340 give a good example of a good output, 1727 01:22:56,340 --> 01:22:58,699 and you probably are able to capture 1728 01:22:58,699 --> 01:23:04,019 the performance of your agentic workflow on this specific eval. 1729 01:23:04,020 --> 01:23:05,320 Does it prioritize? 1730 01:23:05,319 --> 01:23:07,159 Does it understand price conscious-- 1731 01:23:07,159 --> 01:23:08,979 is it price conscious, essentially, 1732 01:23:08,979 --> 01:23:10,659 and comfort conscious? 1733 01:23:10,659 --> 01:23:13,300 Yeah. 1734 01:23:13,300 --> 01:23:14,360 What about the tone? 1735 01:23:14,359 --> 01:23:18,819 Let's say the LLM right now is not very friendly. 1736 01:23:18,819 --> 01:23:23,000 How would you notice that, and how would you fix it? 1737 01:23:26,119 --> 01:23:26,619 Yeah. 1738 01:23:26,619 --> 01:23:29,500 Have the test user run the prompt 1739 01:23:29,500 --> 01:23:33,020 and see if there's something wrong with that. 1740 01:23:33,020 --> 01:23:33,520 OK. 1741 01:23:33,520 --> 01:23:36,037 Have a test user run the prompt and see if there's 1742 01:23:36,037 --> 01:23:37,119 something wrong with that. 1743 01:23:37,119 --> 01:23:38,287 Tell me about the last step. 1744 01:23:38,287 --> 01:23:40,829 How would you notice that something is wrong? 1745 01:23:40,829 --> 01:23:48,550 So a couple of tests [INAUDIBLE] evaluates 1746 01:23:48,550 --> 01:23:51,670 the response and [INAUDIBLE] 1747 01:23:51,670 --> 01:23:52,210 Yeah. 1748 01:23:52,210 --> 01:23:53,609 I agree with your approach. 1749 01:23:53,609 --> 01:23:55,750 Have LLM judges that evaluate the response 1750 01:23:55,750 --> 01:23:58,603 against a certain rubric of what politeness looks like. 1751 01:23:58,603 --> 01:24:00,270 So here in this case, you could actually 1752 01:24:00,270 --> 01:24:02,850 start with error analysis. 1753 01:24:02,850 --> 01:24:05,210 So you start, you have 1,000 users. 1754 01:24:05,210 --> 01:24:07,789 And you can pull up 20 user interactions 1755 01:24:07,789 --> 01:24:09,010 and read through it. 1756 01:24:09,010 --> 01:24:11,630 And you might notice, at first sight, 1757 01:24:11,630 --> 01:24:14,470 the LLM seems to be very rude. 1758 01:24:14,470 --> 01:24:18,430 It's just super, super short in its answers, 1759 01:24:18,430 --> 01:24:20,510 and it's not very helpful. 1760 01:24:20,510 --> 01:24:23,310 You notice that with your error analysis manually. 1761 01:24:23,310 --> 01:24:24,650 Then you go to the next stage. 1762 01:24:24,649 --> 01:24:26,449 You actually put evals behind it. 1763 01:24:26,449 --> 01:24:33,309 You say, I'm going to create a set of LLM judges 1764 01:24:33,310 --> 01:24:35,710 that are going to look at the user interaction 1765 01:24:35,710 --> 01:24:38,890 and are going to rate how polite it is. 1766 01:24:38,890 --> 01:24:40,690 And I'm going to give it a rubric. 1767 01:24:40,689 --> 01:24:42,989 Then what I'm going to do is I'm going to flip my LLM. 1768 01:24:42,989 --> 01:24:45,769 Instead of using GPT-4, I'm going to use Grok. 1769 01:24:45,770 --> 01:24:48,010 And instead of using Grok, I'm using Llama. 1770 01:24:48,010 --> 01:24:51,470 And then I'm going to run those three LLMs side by side, 1771 01:24:51,470 --> 01:24:56,329 give it to my LLM judges, and then get my subjective score 1772 01:24:56,329 --> 01:25:02,390 at the end to say, oh, x model was more polite on average. 1773 01:25:02,390 --> 01:25:02,890 Yeah. 
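A minimal sketch of that judging harness is below. Every model call is stubbed out: the model names are placeholders and the scores are hard-coded, just to show the shape of the comparison. In practice, the candidate and the judges would be real model calls scoring against your written politeness rubric.

```python
# A minimal sketch of the LLM-as-judge comparison just described: run the
# same prompts through several candidate models, have judge models score
# each reply against a politeness rubric, and average. All calls are stubs.

PROMPTS = ["Change my flight to tomorrow, please.", "My hotel was overbooked."]

def candidate_reply(model: str, prompt: str) -> str:
    # Stub for the travel-agent LLM under test.
    return f"[{model}] Sure. Regarding '{prompt}', here is what I found."

def judge_score(judge: str, reply: str) -> int:
    # Stub for an LLM judge scoring the reply 1-5 against a politeness
    # rubric; hard-coded here only to make the example runnable.
    return 4 if "Sure" in reply else 2

def politeness(model: str, judges=("judge_a", "judge_b")) -> float:
    scores = [judge_score(j, candidate_reply(model, p))
              for p in PROMPTS for j in judges]
    return sum(scores) / len(scores)

for model in ("GPT-4", "Grok", "Llama"):
    print(model, politeness(model))
```

That per-model average is the "X model was more polite on average" number at the end of the comparison.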
1774 01:25:02,890 --> 01:25:03,630 Perfectly right. 1775 01:25:03,630 --> 01:25:05,850 That's an example of an eval that is very specific 1776 01:25:05,850 --> 01:25:07,730 and allows you to choose between LLMs. 1777 01:25:07,729 --> 01:25:10,869 You could actually do the same eval not across LLMs, 1778 01:25:10,869 --> 01:25:12,976 but fix the LLM and change the prompt. 1779 01:25:12,976 --> 01:25:15,309 So instead of saying act like a travel agent, 1780 01:25:15,310 --> 01:25:17,870 you say act like a helpful travel agent. 1781 01:25:17,869 --> 01:25:21,090 And then you see the influence of that word on your eval 1782 01:25:21,090 --> 01:25:22,390 with the LLMs as judges. 1783 01:25:22,390 --> 01:25:24,170 Does that make sense? 1784 01:25:24,170 --> 01:25:25,970 OK. 1785 01:25:25,970 --> 01:25:26,470 Super. 1786 01:25:26,470 --> 01:25:29,670 So let's move forward and do a case study with evals. 1787 01:25:29,670 --> 01:25:33,369 And then we're almost done for today. 1788 01:25:33,369 --> 01:25:38,300 Let's say your product manager asks you to build an AI 1789 01:25:38,300 --> 01:25:41,860 agent for customer support, OK? 1790 01:25:41,859 --> 01:25:42,960 Where do you start? 1791 01:25:42,960 --> 01:25:45,079 And here is an example of the user prompt. 1792 01:25:45,079 --> 01:25:48,000 I need to change my shipping address for order, blah, blah, 1793 01:25:48,000 --> 01:25:48,500 blah. 1794 01:25:48,500 --> 01:25:51,739 I moved to a new address. 1795 01:25:51,739 --> 01:25:54,779 So where do you start if I'm giving you that project? 1796 01:26:04,659 --> 01:26:05,859 Yes. 1797 01:26:05,859 --> 01:26:10,420 We search online for existing models and [INAUDIBLE] 1798 01:26:16,260 --> 01:26:17,720 So do some research. 1799 01:26:17,720 --> 01:26:20,420 See benchmarks and how different models 1800 01:26:20,420 --> 01:26:22,119 perform at customer support. 1801 01:26:22,119 --> 01:26:23,284 And then pick a model. 1802 01:26:23,284 --> 01:26:24,159 That's what you mean? 1803 01:26:24,159 --> 01:26:24,779 Yeah. 1804 01:26:24,779 --> 01:26:25,960 It's true, you could do that. 1805 01:26:25,960 --> 01:26:28,020 What else could you do? 1806 01:26:28,020 --> 01:26:28,908 Yeah. 1807 01:26:28,908 --> 01:26:34,360 [INAUDIBLE] 1808 01:26:34,359 --> 01:26:34,859 OK. 1809 01:26:34,859 --> 01:26:35,880 Yeah, I like that. 1810 01:26:35,880 --> 01:26:39,840 Try to decompose the different tasks that it will need 1811 01:26:39,840 --> 01:26:42,685 and try to guess which ones will be more of a struggle, which 1812 01:26:42,685 --> 01:26:45,060 ones should be fuzzy, which ones should be deterministic. 1813 01:26:45,060 --> 01:26:46,350 Yeah, you're right. 1814 01:26:46,350 --> 01:26:47,520 [INAUDIBLE] 1815 01:26:55,819 --> 01:26:56,319 Yeah. 1816 01:26:56,319 --> 01:26:58,516 Similar to what you said. 1817 01:26:58,516 --> 01:27:00,099 That's what I would recommend as well. 1818 01:27:00,100 --> 01:27:02,320 You say, I would sit down with a customer support 1819 01:27:02,319 --> 01:27:04,822 agent for a day or two, and I would decompose the tasks 1820 01:27:04,822 --> 01:27:05,779 that they're going through. 1821 01:27:05,779 --> 01:27:07,500 I will ask them, where do they struggle? 1822 01:27:07,500 --> 01:27:08,819 How much time does it take? 1823 01:27:08,819 --> 01:27:09,319 Yes. 1824 01:27:09,319 --> 01:27:12,679 That's usually where you want to start, with task decomposition. 1825 01:27:12,680 --> 01:27:16,659 So let's say we've done that work, and we have this list.
1826 01:27:16,659 --> 01:27:17,500 I'm simplifying. 1827 01:27:17,500 --> 01:27:20,239 But the customer support agent, human, typically 1828 01:27:20,239 --> 01:27:23,000 would extract key info, then look up 1829 01:27:23,000 --> 01:27:25,680 in the database to retrieve the customer record. 1830 01:27:25,680 --> 01:27:27,360 Then check the policy. 1831 01:27:27,359 --> 01:27:29,960 Are we allowed to update the address, 1832 01:27:29,960 --> 01:27:32,409 or is it a fixed data point? 1833 01:27:32,409 --> 01:27:35,569 And then draft a response email and send the email. 1834 01:27:35,569 --> 01:27:37,019 So we've decomposed that task. 1835 01:27:39,770 --> 01:27:42,490 Once you've decomposed that task, 1836 01:27:42,489 --> 01:27:45,159 how do you design your agentic workflow? 1837 01:28:03,850 --> 01:28:04,710 Yes. 1838 01:28:04,710 --> 01:28:06,404 [INAUDIBLE] 1839 01:28:17,770 --> 01:28:18,330 Exactly. 1840 01:28:18,329 --> 01:28:20,409 So to repeat, you're going to look 1841 01:28:20,409 --> 01:28:24,949 at the decomposition of tasks, get an instinct of what's fuzzy, 1842 01:28:24,949 --> 01:28:28,010 what's deterministic, and then determine 1843 01:28:28,010 --> 01:28:33,300 which line is going to be an LLM one-shot, which one will require 1844 01:28:33,300 --> 01:28:36,779 maybe a RAG, which one will require a tool, which one will 1845 01:28:36,779 --> 01:28:38,519 require memory, which one-- 1846 01:28:38,520 --> 01:28:41,060 So you will start designing that map. 1847 01:28:41,060 --> 01:28:41,880 Completely right. 1848 01:28:41,880 --> 01:28:43,600 That's also what I would recommend. 1849 01:28:43,600 --> 01:28:48,260 You might actually draft it and say, OK, I take the user prompt. 1850 01:28:48,260 --> 01:28:52,500 And the first step of my task decomposition 1851 01:28:52,500 --> 01:28:57,479 was extract information--that seems to be a vanilla LLM call. 1852 01:28:57,479 --> 01:29:00,099 You can guess that the vanilla LLM would probably 1853 01:29:00,100 --> 01:29:03,220 be good enough at extracting that the user wants 1854 01:29:03,220 --> 01:29:05,632 to change their address, and this is the order number, 1855 01:29:05,632 --> 01:29:06,800 and this is the new address. 1856 01:29:06,800 --> 01:29:08,940 You probably don't need too much technology 1857 01:29:08,939 --> 01:29:11,579 there other than the LLM. 1858 01:29:11,579 --> 01:29:14,899 The next step, it feels like you need a tool because you're 1859 01:29:14,899 --> 01:29:17,539 actually going to have to look up in the database 1860 01:29:17,539 --> 01:29:21,380 and also update the address. 1861 01:29:21,380 --> 01:29:23,020 So that might be a tool, and you might 1862 01:29:23,020 --> 01:29:25,020 have to build a custom tool for the LLM 1863 01:29:25,020 --> 01:29:27,260 to say, let me connect you to that database, 1864 01:29:27,260 --> 01:29:29,869 or let me give you access to that resource with an MCP. 1865 01:29:32,840 --> 01:29:35,940 After that, you probably need an LLM again to draft the email, 1866 01:29:35,939 --> 01:29:38,156 but you would probably paste the confirmation. 1867 01:29:38,157 --> 01:29:40,239 You would paste the confirmation that your address 1868 01:29:40,239 --> 01:29:42,279 has been updated from x to y. 1869 01:29:42,279 --> 01:29:44,559 And then the LLM will draft an answer. 1870 01:29:44,560 --> 01:29:46,380 And of course, just to not forget, 1871 01:29:46,380 --> 01:29:49,279 you might need a tool to send the email.
1872 01:29:49,279 --> 01:29:54,439 You might actually need to post something 1873 01:29:54,439 --> 01:29:57,399 for the email to go out. 1874 01:29:57,399 --> 01:29:59,079 And then you'll get the output. 1875 01:29:59,079 --> 01:30:02,199 Does that make sense? So exactly what you described. 1876 01:30:02,199 --> 01:30:03,939 Now moving to the next step. 1877 01:30:03,939 --> 01:30:06,279 Once we have-- we've decomposed our tasks. 1878 01:30:06,279 --> 01:30:09,300 Then we have designed an agentic workflow around it. 1879 01:30:09,300 --> 01:30:10,641 It took us five minutes. 1880 01:30:10,641 --> 01:30:12,099 In practice, it would take you more 1881 01:30:12,100 --> 01:30:13,280 if you're building your startup on that. 1882 01:30:13,279 --> 01:30:15,697 You want to make sure your task decomposition is accurate, 1883 01:30:15,697 --> 01:30:17,480 your design is accurate here, and then 1884 01:30:17,479 --> 01:30:20,239 you can have a lot of work done on every tool 1885 01:30:20,239 --> 01:30:22,880 and optimize it for latency and cost. 1886 01:30:22,880 --> 01:30:27,810 But let's say, now we want to know if it works. 1887 01:30:27,810 --> 01:30:30,960 And I'm going to assume that you have LLM traces. 1888 01:30:30,960 --> 01:30:33,449 LLM traces are very important. 1889 01:30:33,449 --> 01:30:36,010 Actually, if you're interviewing with an AI startup, 1890 01:30:36,010 --> 01:30:39,289 I would recommend you in the interview process to ask them, 1891 01:30:39,289 --> 01:30:40,949 do you have LLM traces? 1892 01:30:40,949 --> 01:30:42,970 Because if they don't have LLM traces, 1893 01:30:42,970 --> 01:30:46,530 it is pretty hard to debug an LLM system because you don't 1894 01:30:46,529 --> 01:30:50,649 have visibility on the chain of complex prompts that were called 1895 01:30:50,649 --> 01:30:52,210 and where the bug is. 1896 01:30:52,210 --> 01:30:57,329 And so it's a basic part of an AI startup 1897 01:30:57,329 --> 01:31:00,850 stack to have LLM traces. 1898 01:31:00,850 --> 01:31:02,730 So let's assume you have traces. 1899 01:31:02,729 --> 01:31:04,869 How would you know if your system works? 1900 01:31:04,869 --> 01:31:11,289 I'm going to summarize some of the things I heard earlier. 1901 01:31:11,289 --> 01:31:15,550 You gave us an example of an end to end metric. 1902 01:31:15,550 --> 01:31:18,369 You look at the user satisfaction at the end. 1903 01:31:18,369 --> 01:31:21,130 You can also do a component-based approach 1904 01:31:21,130 --> 01:31:25,210 where you actually will look at the tool, the database updates, 1905 01:31:25,210 --> 01:31:28,430 and you will manually do an error analysis and see, 1906 01:31:28,430 --> 01:31:32,010 oh, the tool actually always forgets to update the email. 1907 01:31:32,010 --> 01:31:33,806 It just fails at writing. 1908 01:31:33,806 --> 01:31:34,889 And I'm going to fix that. 1909 01:31:34,890 --> 01:31:37,470 This is deterministic, pretty much. 1910 01:31:37,470 --> 01:31:40,990 Or when it tries to send the email 1911 01:31:40,989 --> 01:31:44,469 and ping the system that is supposed to send the email, 1912 01:31:44,470 --> 01:31:46,890 it doesn't send it in the right format. 1913 01:31:46,890 --> 01:31:48,869 And so it bugs at that point. 1914 01:31:48,869 --> 01:31:51,390 Again, you could fix that. 1915 01:31:51,390 --> 01:31:52,570 Or the draft of the email. 1916 01:31:52,569 --> 01:31:53,929 The LLM doesn't do a great job. 1917 01:31:53,930 --> 01:31:56,909 It's not very polite at drafting the email.
1918 01:31:56,909 --> 01:31:59,342 So you could look at it component by component, 1919 01:31:59,342 --> 01:32:01,510 and it's actually easier to debug than to look at it 1920 01:32:01,510 --> 01:32:02,289 end to end. 1921 01:32:02,289 --> 01:32:05,750 You would probably do a mix of both. 1922 01:32:05,750 --> 01:32:08,430 Another way to look at it is, what is objective 1923 01:32:08,430 --> 01:32:10,530 versus what is subjective? 1924 01:32:10,529 --> 01:32:12,989 So for example, an objective example 1925 01:32:12,989 --> 01:32:18,229 would be, the LLM extracted the wrong order ID. 1926 01:32:18,229 --> 01:32:21,789 The user said my order ID is X, and the LLM, 1927 01:32:21,789 --> 01:32:24,500 when it actually looked up in the database, 1928 01:32:24,500 --> 01:32:26,279 it used the wrong order ID. 1929 01:32:26,279 --> 01:32:27,779 This is objectively wrong. 1930 01:32:27,779 --> 01:32:29,800 You can actually write Python code 1931 01:32:29,800 --> 01:32:32,239 that checks that--checks just the alignment between what 1932 01:32:32,239 --> 01:32:36,260 the user mentioned and what was actually passed to the database 1933 01:32:36,260 --> 01:32:38,199 for the lookup. 1934 01:32:38,199 --> 01:32:40,460 You also have subjective stuff, which we talked about, 1935 01:32:40,460 --> 01:32:43,279 where you probably want to do either human rating or LLM 1936 01:32:43,279 --> 01:32:44,139 as judges. 1937 01:32:44,140 --> 01:32:49,560 It's very relevant for subjective evals. 1938 01:32:49,560 --> 01:32:51,840 And finally, you will find yourself 1939 01:32:51,840 --> 01:32:55,980 having quantitative evals and more qualitative evals. 1940 01:32:55,979 --> 01:32:59,399 So quantitative would be percentage of successful address 1941 01:32:59,399 --> 01:33:00,279 updates. 1942 01:33:00,279 --> 01:33:00,939 Or the latency. 1943 01:33:00,939 --> 01:33:03,719 You could actually track the latency component by component 1944 01:33:03,720 --> 01:33:05,680 and see which one is the slowest. 1945 01:33:05,680 --> 01:33:08,480 Let's say sending the email is five seconds. 1946 01:33:08,479 --> 01:33:10,159 It's too long, let's say. 1947 01:33:10,159 --> 01:33:13,119 You would notice that per component or for the full workflow. 1948 01:33:13,119 --> 01:33:15,880 And then you will decide, where am I optimizing my latency, 1949 01:33:15,880 --> 01:33:17,680 and how am I going to do that? 1950 01:33:17,680 --> 01:33:20,240 And then finally, qualitative. 1951 01:33:20,239 --> 01:33:23,099 You might actually do some error analysis 1952 01:33:23,100 --> 01:33:27,940 and look at, where are the hallucinations? 1953 01:33:27,939 --> 01:33:31,579 Where are the tone mismatches? 1954 01:33:31,579 --> 01:33:34,779 Are the users confused, and what are they confused by? 1955 01:33:34,779 --> 01:33:36,579 That would be more qualitative. 1956 01:33:36,579 --> 01:33:41,019 And typically, it would take more white-glove approaches 1957 01:33:41,020 --> 01:33:42,460 to do that. 1958 01:33:42,460 --> 01:33:44,539 So here's what it could look like. 1959 01:33:44,539 --> 01:33:46,000 I gave you some examples. 1960 01:33:46,000 --> 01:33:50,140 But you would build evals to determine, 1961 01:33:50,140 --> 01:33:53,300 objectively, subjectively, component-based, end 1962 01:33:53,300 --> 01:33:55,060 to end based, and then quantitatively and 1963 01:33:55,060 --> 01:33:57,700 qualitatively, where your LLM is failing 1964 01:33:57,699 --> 01:33:59,000 and where it's doing well.
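As an example of the objective kind, here is a small sketch of the order-ID alignment check mentioned above. The trace dictionary format is hypothetical; the point is just that a few lines of deterministic Python can grade this failure mode across all your traces and give you a quantitative pass rate.

```python
import re

# A sketch of the objective order-ID check: compare the ID the user
# mentioned against the ID the workflow actually used for its database
# lookup. The trace format here is a made-up example.

def check_order_id(trace: dict) -> bool:
    """Return True if the DB lookup used the order ID the user mentioned."""
    mentioned = re.search(r"order\s+#?(\w+)", trace["user_message"], re.I)
    return bool(mentioned) and mentioned.group(1) == trace["db_lookup"]["order_id"]

traces = [
    {
        "user_message": "I need to change my shipping address for order #A1234.",
        "db_lookup": {"order_id": "A1234"},
    },
    {
        "user_message": "Please update order #B777 to my new address.",
        "db_lookup": {"order_id": "B778"},  # wrong ID: objectively a failure
    },
]

pass_rate = sum(check_order_id(t) for t in traces) / len(traces)
print(f"{pass_rate:.0%} of traces used the right order ID")  # 50%
```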
1965 01:34:02,582 --> 01:34:04,539 Does that give you a sense of the type of stuff 1966 01:34:04,539 --> 01:34:09,939 you could do to fix or improve that agentic workflow? 1967 01:34:09,939 --> 01:34:10,739 Super. 1968 01:34:10,739 --> 01:34:12,439 Well, that was our case study on evals. 1969 01:34:12,439 --> 01:34:14,106 We're not going to delve deeper into it. 1970 01:34:14,106 --> 01:34:16,899 But hopefully, it gave you a sense of the type of stuff 1971 01:34:16,899 --> 01:34:21,529 you can do with LLM judges, with objective, 1972 01:34:21,529 --> 01:34:25,829 subjective, component-based, end to end, et cetera. 1973 01:34:25,829 --> 01:34:29,269 Last section, on multi-agent workflows. 1974 01:34:29,270 --> 01:34:36,030 So you might ask, hey, why do we need a multi-agent workflow when 1975 01:34:36,029 --> 01:34:38,670 the workflow already has multiple steps, 1976 01:34:38,670 --> 01:34:42,449 already calls the LLM multiple times, already gives it tools? 1977 01:34:42,449 --> 01:34:45,104 Why do we need multiple agents? 1978 01:34:45,104 --> 01:34:47,729 And so many people are talking about multi-agent systems online. 1979 01:34:47,729 --> 01:34:49,309 It's not even a new thing, frankly. 1980 01:34:49,310 --> 01:34:52,350 Multi-agent systems have been around for a long time. 1981 01:34:52,350 --> 01:34:55,070 The main advantage of a multi-agent system 1982 01:34:55,069 --> 01:34:57,489 is going to be parallelism. 1983 01:34:57,489 --> 01:34:59,590 It's like, is there something that I 1984 01:34:59,590 --> 01:35:04,890 wish I could run in parallel, sort of independently, 1985 01:35:04,890 --> 01:35:07,430 even if maybe there are some things in the middle? 1986 01:35:07,430 --> 01:35:09,930 That's where you want to put a multi-agent system. 1987 01:35:09,930 --> 01:35:12,270 It's when it's parallel. 1988 01:35:12,270 --> 01:35:14,950 The other advantage that some companies 1989 01:35:14,949 --> 01:35:19,164 have with multi-agent systems is an agent can be reused. 1990 01:35:19,164 --> 01:35:21,289 So let's say in a company, you have an agent that's 1991 01:35:21,289 --> 01:35:22,970 been built for design. 1992 01:35:22,970 --> 01:35:25,289 That agent can be used in the marketing team, 1993 01:35:25,289 --> 01:35:27,930 and it can be used in the product team. 1994 01:35:27,930 --> 01:35:30,050 And so now you're optimizing an agent 1995 01:35:30,050 --> 01:35:33,170 which has multiple stakeholders that can communicate with it 1996 01:35:33,170 --> 01:35:35,510 and benefit from its performance. 1997 01:35:38,382 --> 01:35:40,050 Actually, I'm going to ask you a question 1998 01:35:40,050 --> 01:35:43,010 and take a few seconds, maybe a minute, to think about it. 1999 01:35:43,010 --> 01:35:46,489 Let's say you were building smart home 2000 01:35:46,489 --> 01:35:50,130 automation for your apartment or your home. 2001 01:35:50,130 --> 01:35:52,810 What agents would you want to build? 2002 01:35:52,810 --> 01:35:53,530 Yeah. 2003 01:35:53,529 --> 01:35:54,889 Write it down. 2004 01:35:54,890 --> 01:35:57,130 And then I'm going to ask you in a minute 2005 01:35:57,130 --> 01:36:00,090 to share some of the agents that you would build. 2006 01:36:00,090 --> 01:36:03,050 Also, think about how you would put 2007 01:36:03,050 --> 01:36:04,570 a hierarchy between these agents, 2008 01:36:04,569 --> 01:36:06,210 or how you would organize them, or who 2009 01:36:06,210 --> 01:36:07,770 should communicate with who. 2010 01:36:07,770 --> 01:36:08,450 OK? 2011 01:36:08,449 --> 01:36:08,949 OK.
2012 01:36:08,949 --> 01:36:12,170 Take a minute for that. 2013 01:36:12,170 --> 01:36:14,850 Be creative also, because I'm going to ask about all of your agents, 2014 01:36:14,850 --> 01:36:17,440 and maybe you have an agent that nobody has thought of. 2015 01:36:21,939 --> 01:36:22,479 OK. 2016 01:36:22,479 --> 01:36:24,259 Let's get started. 2017 01:36:24,260 --> 01:36:26,940 Who wants to give me a set of agents 2018 01:36:26,939 --> 01:36:29,559 that you would want for your smart home? 2019 01:36:29,560 --> 01:36:30,060 Yes. 2020 01:36:32,739 --> 01:36:35,519 The first is like a set of agents [INAUDIBLE] 2021 01:37:00,619 --> 01:37:01,119 OK. 2022 01:37:01,119 --> 01:37:02,279 So let me repeat. 2023 01:37:02,279 --> 01:37:05,099 You have four agents, I think, roughly. 2024 01:37:05,100 --> 01:37:09,520 One that tracks biometrics, like, where are you in the home? 2025 01:37:09,520 --> 01:37:10,560 Where are you moving? 2026 01:37:10,560 --> 01:37:12,220 How you're moving, things like that. 2027 01:37:12,220 --> 01:37:15,240 That sort of knows your location. 2028 01:37:15,239 --> 01:37:21,199 The second one determines the temperature of the rooms 2029 01:37:21,199 --> 01:37:23,960 and has the ability to change it. 2030 01:37:23,960 --> 01:37:26,800 The third one tracks energy efficiency 2031 01:37:26,800 --> 01:37:31,060 and might give feedback on energy and energy usage. 2032 01:37:31,060 --> 01:37:32,600 And might be, I don't know, maybe 2033 01:37:32,600 --> 01:37:34,883 it has control over the temperature as well. 2034 01:37:34,882 --> 01:37:35,800 I don't know actually. 2035 01:37:35,800 --> 01:37:43,079 Or the gas or the water--it might cut your water at some point. 2036 01:37:43,079 --> 01:37:44,859 And then you have an orchestrator agent. 2037 01:37:44,859 --> 01:37:48,688 What exactly is the orchestrator doing? 2038 01:37:48,688 --> 01:37:53,180 It passes instructions [INAUDIBLE] 2039 01:37:53,180 --> 01:37:53,680 OK. 2040 01:37:53,680 --> 01:37:55,060 Passes instructions. 2041 01:37:55,060 --> 01:37:58,240 So is that the agent that communicates mainly 2042 01:37:58,239 --> 01:38:00,000 with the user? 2043 01:38:00,000 --> 01:38:02,279 So if I'm coming back home and I'm 2044 01:38:02,279 --> 01:38:05,679 saying I want the oven to be preheated, 2045 01:38:05,680 --> 01:38:07,360 I communicate with the orchestrator, 2046 01:38:07,359 --> 01:38:09,859 and then it would funnel to another agent. 2047 01:38:09,859 --> 01:38:10,599 OK. 2048 01:38:10,600 --> 01:38:11,140 Sounds good. 2049 01:38:11,140 --> 01:38:11,640 Yeah. 2050 01:38:11,640 --> 01:38:14,230 So that's an example of, I want to say, 2051 01:38:14,229 --> 01:38:17,519 a hierarchical multi-agent system. 2052 01:38:20,770 --> 01:38:21,590 What else? 2053 01:38:21,590 --> 01:38:22,510 Any other ideas? 2054 01:38:22,510 --> 01:38:24,170 What would you add to that? 2055 01:38:24,170 --> 01:38:25,615 Yeah. 2056 01:38:25,615 --> 01:38:27,909 [INAUDIBLE] 2057 01:38:55,329 --> 01:38:56,189 Oh, I like that. 2058 01:38:56,189 --> 01:38:57,429 That's a really good one. 2059 01:38:57,430 --> 01:38:58,890 So let me summarize. 2060 01:38:58,890 --> 01:39:02,250 You have a security agent that determines if you can enter 2061 01:39:02,250 --> 01:39:03,090 or not. 2062 01:39:03,090 --> 01:39:06,489 And when you enter, it understands who you are.
2063 01:39:06,489 --> 01:39:08,329 And then it gives you certain sets 2064 01:39:08,329 --> 01:39:11,309 of permissions that might be different depending 2065 01:39:11,310 --> 01:39:13,030 on whether you're a parent or a kid. 2066 01:39:13,029 --> 01:39:17,689 Or you might have access to certain cars and not others. 2067 01:39:17,689 --> 01:39:20,109 Or your kid cannot open the fridge, or I don't know. 2068 01:39:20,109 --> 01:39:21,250 Something like that. 2069 01:39:21,250 --> 01:39:22,390 Yeah. 2070 01:39:22,390 --> 01:39:23,250 OK, I like that. 2071 01:39:23,250 --> 01:39:24,229 That's a good one. 2072 01:39:24,229 --> 01:39:28,469 And it does feel like it's a complex enough workflow where 2073 01:39:28,470 --> 01:39:32,289 you want a specific agent tied to that. 2074 01:39:32,289 --> 01:39:34,510 I agree. 2075 01:39:34,510 --> 01:39:35,520 What else? 2076 01:39:39,750 --> 01:39:41,579 Yes. 2077 01:39:41,579 --> 01:39:43,970 [INAUDIBLE] So you can get more complicated. 2078 01:39:43,970 --> 01:39:50,230 So, like, energy savings, with whether or not you 2079 01:39:50,229 --> 01:39:55,989 or someone else want the blinds down in the house, or also 2080 01:39:55,989 --> 01:39:57,329 when you tap into the grid. 2081 01:39:57,329 --> 01:40:04,510 Yeah. So another thought I have as well--it's much harder 2082 01:40:04,510 --> 01:40:06,909 to track than the grocery store, 2083 01:40:06,909 --> 01:40:08,949 but understanding what's in your fridge. 2084 01:40:08,949 --> 01:40:12,762 OK. 2085 01:40:12,762 --> 01:40:14,180 Well, that's really good actually. 2086 01:40:14,180 --> 01:40:16,240 So you mentioned two of them. 2087 01:40:16,239 --> 01:40:20,719 One is maybe an agent that has access to external APIs that 2088 01:40:20,720 --> 01:40:24,320 can understand the weather out there, the wind, the sun, 2089 01:40:24,319 --> 01:40:28,539 and then has control over certain devices at home. 2090 01:40:28,539 --> 01:40:31,560 Temperature, blinds, things like that, and also understands 2091 01:40:31,560 --> 01:40:33,100 your preferences for it. 2092 01:40:33,100 --> 01:40:36,039 That does feel like it's a good use case because you could give 2093 01:40:36,039 --> 01:40:38,840 that to the orchestrator, but it might lose itself 2094 01:40:38,840 --> 01:40:41,039 because it's doing too much. 2095 01:40:41,039 --> 01:40:43,039 And also, these problems are tied together, 2096 01:40:43,039 --> 01:40:45,479 like the outdoor temperature from the weather API 2097 01:40:45,479 --> 01:40:48,359 might influence the temperature inside, 2098 01:40:48,359 --> 01:40:50,199 how you want it, et cetera. 2099 01:40:50,199 --> 01:40:52,800 And then the second one, which I also like, 2100 01:40:52,800 --> 01:40:55,920 is you might have an agent that looks at your fridge 2101 01:40:55,920 --> 01:40:57,185 and what's inside. 2102 01:40:57,185 --> 01:40:58,560 And it might actually have access 2103 01:40:58,560 --> 01:41:01,410 to the camera in the fridge, for example, 2104 01:41:01,409 --> 01:41:03,720 and know your preferences, and also has 2105 01:41:03,720 --> 01:41:06,800 access to the e-commerce API to order 2106 01:41:06,800 --> 01:41:09,539 Amazon groceries ahead of time. 2107 01:41:09,539 --> 01:41:10,319 I agree. 2108 01:41:10,319 --> 01:41:12,859 And maybe the orchestrator will be the communication line 2109 01:41:12,859 --> 01:41:16,139 with the user, but it might communicate with that agent 2110 01:41:16,140 --> 01:41:17,880 in order to get it done. 2111 01:41:17,880 --> 01:41:18,380 Yeah.
2112 01:41:18,380 --> 01:41:19,079 I like those. 2113 01:41:19,079 --> 01:41:21,760 So those are all really good examples. 2114 01:41:21,760 --> 01:41:25,500 Here is the list I had up there. 2115 01:41:25,500 --> 01:41:30,079 So climate control, lighting, security, energy management, 2116 01:41:30,079 --> 01:41:32,180 entertainment, a notification agent, 2117 01:41:32,180 --> 01:41:35,400 alerts about the system updates, energy saving, and an orchestrator. 2118 01:41:35,399 --> 01:41:38,019 So all of them you mentioned, actually. 2119 01:41:38,020 --> 01:41:41,260 And then we didn't talk about the different interaction 2120 01:41:41,260 --> 01:41:45,220 patterns, but you do have different ways to organize 2121 01:41:45,220 --> 01:41:46,900 a multi-agent system. 2122 01:41:46,899 --> 01:41:48,519 Flat, hierarchical. 2123 01:41:48,520 --> 01:41:51,300 It sounds like this would be hierarchical. 2124 01:41:51,300 --> 01:41:52,079 I agree. 2125 01:41:52,079 --> 01:41:55,420 And the reason is UI/UX: I would rather 2126 01:41:55,420 --> 01:41:57,680 only have to talk to the orchestrator, 2127 01:41:57,680 --> 01:42:00,579 rather than have to go to a specialized application 2128 01:42:00,579 --> 01:42:01,362 to do something. 2129 01:42:01,362 --> 01:42:02,819 Like, it feels like the orchestrator 2130 01:42:02,819 --> 01:42:04,439 could be responsible for that. 2131 01:42:04,439 --> 01:42:07,669 And so I agree, I would probably go for a hierarchical setup 2132 01:42:07,670 --> 01:42:08,329 here. 2133 01:42:08,329 --> 01:42:11,430 But maybe you might also add some connections 2134 01:42:11,430 --> 01:42:13,670 between other agents, like in the flat system 2135 01:42:13,670 --> 01:42:15,069 where it's all-to-all. 2136 01:42:15,069 --> 01:42:17,994 For example, with climate control and energy, 2137 01:42:17,994 --> 01:42:19,369 if you want to connect those two, 2138 01:42:19,369 --> 01:42:21,909 you might actually allow them to speak with each other. 2139 01:42:21,909 --> 01:42:24,210 When you allow agents to speak with each other, 2140 01:42:24,210 --> 01:42:26,970 it is basically an MCP protocol, by the way. 2141 01:42:26,970 --> 01:42:30,530 So you treat the agent like a tool, exactly like a tool. 2142 01:42:30,529 --> 01:42:32,649 Here is how you interact with this agent. 2143 01:42:32,649 --> 01:42:34,049 Here is what it can tell you. 2144 01:42:34,050 --> 01:42:37,390 Here is what it needs from you, essentially. 2145 01:42:37,390 --> 01:42:38,850 OK, super. 2146 01:42:38,850 --> 01:42:40,910 And then, without going into the details, 2147 01:42:40,909 --> 01:42:43,670 there are advantages to multi-agent workflows 2148 01:42:43,670 --> 01:42:47,690 versus single agents, such as debugging. 2149 01:42:47,689 --> 01:42:50,509 It's easier to debug a specialized agent 2150 01:42:50,510 --> 01:42:52,789 than to debug an entire system. 2151 01:42:52,789 --> 01:42:54,329 Parallelization as well. 2152 01:42:54,329 --> 01:42:56,909 It's easier to have things run in parallel, 2153 01:42:56,909 --> 01:42:59,349 and you can save time. 2154 01:42:59,350 --> 01:43:01,610 There are some advantages to doing that, 2155 01:43:01,609 --> 01:43:04,789 and I'll leave you with this slide if you want to go deeper. 2156 01:43:04,789 --> 01:43:05,289 Super. 2157 01:43:05,289 --> 01:43:08,930 So we've learned so many techniques to optimize LLMs, 2158 01:43:08,930 --> 01:43:12,130 from prompts to chains to fine tuning, retrieval, 2159 01:43:12,130 --> 01:43:14,529 and to multi-agent systems as well.
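As a tiny code illustration of the hierarchical pattern discussed above: the orchestrator is the single point of contact and routes each request to a specialist agent, treating that agent like a tool. This is a minimal sketch; keyword matching stands in for an LLM router, and all the agent names are made up.

```python
# A tiny sketch of a hierarchical multi-agent setup: the orchestrator is
# the only thing the user talks to, and it routes requests to specialist
# agents. Keyword routing is a stub standing in for an LLM router.

class Agent:
    def __init__(self, name: str, keywords: tuple):
        self.name, self.keywords = name, keywords

    def handle(self, request: str) -> str:
        # A real agent would run its own prompts and tools here.
        return f"{self.name} is handling: {request!r}"

class Orchestrator:
    def __init__(self, agents: list):
        self.agents = agents

    def route(self, request: str) -> str:
        for agent in self.agents:
            if any(k in request.lower() for k in agent.keywords):
                return agent.handle(request)
        return "Orchestrator answers directly (no specialist matched)."

home = Orchestrator([
    Agent("climate_control", ("temperature", "heat", "cool")),
    Agent("security", ("door", "lock", "camera")),
    Agent("energy", ("power", "grid", "savings")),
])

print(home.route("Set the temperature to 21 degrees"))
print(home.route("Lock the front door"))
```

Letting two specialists, say climate control and energy, call each other directly would add the flat, all-to-all connections mentioned above, with each agent exposed to the others like a tool.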
2160 01:43:14,529 --> 01:43:19,489 And then, just to end on a couple of trends I want you to watch. 2161 01:43:19,489 --> 01:43:21,689 I think next week is Thanksgiving, is that it? 2162 01:43:21,689 --> 01:43:22,889 It's Thanksgiving break. 2163 01:43:22,890 --> 01:43:23,869 No, the week after. 2164 01:43:23,869 --> 01:43:24,529 OK. 2165 01:43:24,529 --> 01:43:26,149 Well ahead of the Thanksgiving break. 2166 01:43:26,149 --> 01:43:29,489 So if you're traveling, you can think about these things. 2167 01:43:29,489 --> 01:43:34,289 On what's next in AI, I wanted to call out a couple of trends. 2168 01:43:34,289 --> 01:43:40,769 So Ilya Sutskever, one of the OGs of LLMs and an OpenAI 2169 01:43:40,770 --> 01:43:45,790 co-founder, raised that question about whether we are plateauing or not. 2170 01:43:45,789 --> 01:43:50,489 The question is, are we going to see, in the coming years, LLMs sort 2171 01:43:50,489 --> 01:43:54,649 of not improve as fast as we've seen in the past? 2172 01:43:54,649 --> 01:43:56,769 It's been the feeling in the community 2173 01:43:56,770 --> 01:44:00,610 probably that the last version of GPT 2174 01:44:00,609 --> 01:44:03,579 did not bring the level of performance 2175 01:44:03,579 --> 01:44:06,859 that people were expecting, although it did make 2176 01:44:06,859 --> 01:44:09,500 it so much easier to use for consumers because you don't need 2177 01:44:09,500 --> 01:44:10,920 to interact with different models. 2178 01:44:10,920 --> 01:44:12,279 It's all under the same hood. 2179 01:44:12,279 --> 01:44:14,659 So it seems that it's progressing, 2180 01:44:14,659 --> 01:44:17,019 but the plateau is unclear. 2181 01:44:17,020 --> 01:44:22,860 The way I would think about it is, the LLM scaling laws tell us 2182 01:44:22,859 --> 01:44:26,380 that if we continue to improve compute and energy, 2183 01:44:26,380 --> 01:44:28,132 then LLMs should continue to improve. 2184 01:44:28,131 --> 01:44:29,839 But at some point, it's going to plateau. 2185 01:44:29,840 --> 01:44:32,380 So what's going to take us to the next step? 2186 01:44:32,380 --> 01:44:35,060 It's probably architecture search. 2187 01:44:35,060 --> 01:44:36,700 Still, a lot of LLMs, even if we don't 2188 01:44:36,699 --> 01:44:38,539 fully see what's under the hood, are probably 2189 01:44:38,539 --> 01:44:40,319 transformer-based today. 2190 01:44:40,319 --> 01:44:43,439 But we know that the human brain does not operate the same way. 2191 01:44:43,439 --> 01:44:45,099 There are just certain things that we 2192 01:44:45,100 --> 01:44:47,640 do that are much more efficient, much faster. 2193 01:44:47,640 --> 01:44:49,180 We don't need as much data. 2194 01:44:49,180 --> 01:44:51,260 So theoretically, we have so much 2195 01:44:51,260 --> 01:44:53,020 to learn in terms of architecture search 2196 01:44:53,020 --> 01:44:54,780 that we haven't figured out. 2197 01:44:54,779 --> 01:44:57,300 It's not a surprise that you see those labs hire 2198 01:44:57,300 --> 01:44:58,779 so many engineers. 2199 01:44:58,779 --> 01:45:01,676 Because it is possible that in the next few years, 2200 01:45:01,676 --> 01:45:03,759 you're going to have thousands of engineers trying 2201 01:45:03,760 --> 01:45:06,382 to figure out the different engineering hacks and tactics 2202 01:45:06,381 --> 01:45:07,839 and architecture searches that are 2203 01:45:07,840 --> 01:45:10,480 going to lead to better models.
2204 01:45:10,479 --> 01:45:13,419 And one of them suddenly will find the next transformer, 2205 01:45:13,420 --> 01:45:17,000 and it will reduce by 10x the need for compute and the need 2206 01:45:17,000 --> 01:45:18,560 for energy. 2207 01:45:18,560 --> 01:45:24,560 It's sort of like if you read Isaac Asimov's Foundation series. 2208 01:45:24,560 --> 01:45:27,920 Individuals can have an amazing impact on the future because 2209 01:45:27,920 --> 01:45:29,279 of their decisions. 2210 01:45:29,279 --> 01:45:33,519 Whoever discovered transformers had a tremendous impact 2211 01:45:33,520 --> 01:45:34,832 on the direction of AI. 2212 01:45:34,832 --> 01:45:37,039 I think we're going to see more of that in the coming 2213 01:45:37,039 --> 01:45:40,239 years, where some group of researchers that is iterating 2214 01:45:40,239 --> 01:45:43,399 fast might discover certain things that would suddenly 2215 01:45:43,399 --> 01:45:45,500 unlock that plateau and take us to the next step, 2216 01:45:45,500 --> 01:45:47,500 and it's going to continue to improve like that. 2217 01:45:47,500 --> 01:45:50,239 And so it doesn't surprise me that there are so many companies 2218 01:45:50,239 --> 01:45:52,519 hiring engineers right now to figure out 2219 01:45:52,520 --> 01:45:56,360 those hacks and those techniques. 2220 01:45:56,359 --> 01:45:58,119 The other set of gains that we might see 2221 01:45:58,119 --> 01:45:59,479 is from multi-modality. 2222 01:45:59,479 --> 01:46:04,929 So the way to think about it is, we've had LLMs first text-based, 2223 01:46:04,930 --> 01:46:06,750 and then we've added images. 2224 01:46:06,750 --> 01:46:09,430 And today, models are very good at images. 2225 01:46:09,430 --> 01:46:10,730 They're very good at text. 2226 01:46:10,729 --> 01:46:13,929 It turns out that being good at images and being good at text 2227 01:46:13,930 --> 01:46:15,510 makes the whole model better. 2228 01:46:15,510 --> 01:46:18,329 So the fact that you're good at understanding a cat image 2229 01:46:18,329 --> 01:46:21,449 makes you better at text as well for a cat. 2230 01:46:21,449 --> 01:46:24,630 Now you add another modality like audio or video. 2231 01:46:24,630 --> 01:46:26,109 The whole system gets better. 2232 01:46:26,109 --> 01:46:28,569 So you're better at writing about a cat 2233 01:46:28,569 --> 01:46:30,114 if you know what a cat sounds like, 2234 01:46:30,114 --> 01:46:31,989 if you can look at a cat in an image as well. 2235 01:46:31,989 --> 01:46:32,864 Does that make sense? 2236 01:46:32,864 --> 01:46:35,569 So we see gains that are translated from one modality 2237 01:46:35,569 --> 01:46:38,409 to another, and that might lead to the pinnacle of robotics, 2238 01:46:38,409 --> 01:46:40,430 where all these modalities come together. 2239 01:46:40,430 --> 01:46:42,329 And suddenly, the robot is better at 2240 01:46:42,329 --> 01:46:44,890 running away from a cat because it understands 2241 01:46:44,890 --> 01:46:46,630 what a cat is, what it sounds like, 2242 01:46:46,630 --> 01:46:48,170 what it looks like, et cetera. 2243 01:46:48,170 --> 01:46:49,930 That makes sense? 2244 01:46:49,930 --> 01:46:53,090 The other one is multiple methods working in harmony. 2245 01:46:53,090 --> 01:46:56,750 In the Tuesday lectures, we've seen supervised learning, 2246 01:46:56,750 --> 01:46:58,930 unsupervised learning, self-supervised learning, 2247 01:46:58,930 --> 01:47:02,230 reinforcement learning, prompt engineering, RAGs, et cetera.
2248 01:47:02,229 --> 01:47:06,269 If you look at how babies learn, it 2249 01:47:06,270 --> 01:47:09,250 is probably a mix of those different approaches. 2250 01:47:09,250 --> 01:47:13,909 Like, a baby might have some meta learning, meaning it 2251 01:47:13,909 --> 01:47:16,670 has some survival instinct that is 2252 01:47:16,670 --> 01:47:19,430 encoded in the DNA, most likely. 2253 01:47:19,430 --> 01:47:22,630 And that's like the baby's pre-training, if you will. 2254 01:47:22,630 --> 01:47:27,430 On top of that, the mom or the dad is pointing at stuff 2255 01:47:27,430 --> 01:47:29,570 and saying bad, good, bad, good. 2256 01:47:29,569 --> 01:47:30,769 Supervised learning. 2257 01:47:30,770 --> 01:47:33,470 On top of that, the baby is falling on the ground 2258 01:47:33,470 --> 01:47:34,449 and getting hurt. 2259 01:47:34,449 --> 01:47:36,929 And that's a reward signal for reinforcement learning. 2260 01:47:36,930 --> 01:47:39,390 On top of that, the baby is observing other people 2261 01:47:39,390 --> 01:47:42,030 doing stuff or other babies doing 2262 01:47:42,029 --> 01:47:43,409 stuff--unsupervised learning. 2263 01:47:43,409 --> 01:47:44,349 You see what I mean? 2264 01:47:44,350 --> 01:47:47,090 We're probably a mix of all these methods, 2265 01:47:47,090 --> 01:47:49,630 and I think that's where the trend is going: 2266 01:47:49,630 --> 01:47:52,350 where those methods that you've seen in CS230 2267 01:47:52,350 --> 01:47:56,780 come together in order to build an AI system that learns fast, 2268 01:47:56,779 --> 01:48:00,340 is low latency, is cheap, energy-efficient, 2269 01:48:00,340 --> 01:48:03,360 and makes the most out of all of these methods. 2270 01:48:03,359 --> 01:48:06,920 Finally, and this is especially true at Stanford, 2271 01:48:06,920 --> 01:48:11,079 you have research going on that you would consider human-centric 2272 01:48:11,079 --> 01:48:13,800 and some research that is non-human-centric. 2273 01:48:13,800 --> 01:48:16,360 By human-centric, I mean approaches 2274 01:48:16,359 --> 01:48:19,159 that are modeled after the brain, versus approaches that 2275 01:48:19,159 --> 01:48:20,619 are not modeled after humans. 2276 01:48:20,619 --> 01:48:24,420 Because it turns out that the human body is very limiting. 2277 01:48:24,420 --> 01:48:26,680 And so if you actually only do research 2278 01:48:26,680 --> 01:48:28,220 on what the human brain looks like, 2279 01:48:28,220 --> 01:48:30,860 you're probably missing out on compute and energy and stuff 2280 01:48:30,859 --> 01:48:32,359 like that that you can optimize even 2281 01:48:32,359 --> 01:48:35,139 beyond neuronal connections in the brain. 2282 01:48:35,140 --> 01:48:37,380 But you still can learn a lot from the human brain. 2283 01:48:37,380 --> 01:48:40,319 And that's why there are professors that are running labs 2284 01:48:40,319 --> 01:48:42,519 right now that try to understand, 2285 01:48:42,520 --> 01:48:45,140 how does back propagation work for humans? 2286 01:48:45,140 --> 01:48:48,140 And in fact, it's probably that we don't have back propagation. 2287 01:48:48,140 --> 01:48:51,300 We don't use back propagation; we only do forward propagation, 2288 01:48:51,300 --> 01:48:51,840 let's say. 2289 01:48:51,840 --> 01:48:54,079 So this type of stuff is interesting research 2290 01:48:54,079 --> 01:48:56,500 that I would encourage you to read if you're curious 2291 01:48:56,500 --> 01:48:59,500 about the direction of AI.
2292 01:48:59,500 --> 01:49:02,640 And then finally, one thing that's going to be pretty clear-- 2293 01:49:02,640 --> 01:49:05,420 I say it all the time--but it's the velocity 2294 01:49:05,420 --> 01:49:06,899 at which things are moving. 2295 01:49:06,899 --> 01:49:08,699 You're noticing, part of the reason 2296 01:49:08,699 --> 01:49:10,882 we're giving you a breadth in CS230 2297 01:49:10,882 --> 01:49:12,800 is because these methods are changing so fast. 2298 01:49:12,800 --> 01:49:15,100 So I don't want to bother going and teaching you 2299 01:49:15,100 --> 01:49:17,940 method number 17 on RAG that 2300 01:49:17,939 --> 01:49:19,639 optimizes the RAG, because in two years, 2301 01:49:19,640 --> 01:49:20,940 you're not going to need it. 2302 01:49:20,939 --> 01:49:23,419 So I would rather you think about what 2303 01:49:23,420 --> 01:49:25,539 is the breadth of things you want to understand. 2304 01:49:25,539 --> 01:49:27,819 And when you need it, you are sprinting and learning 2305 01:49:27,819 --> 01:49:30,939 the exact thing you need faster, because the half-life of skills 2306 01:49:30,939 --> 01:49:31,679 is so short. 2307 01:49:31,680 --> 01:49:34,500 You want to come out of the class with a good breadth 2308 01:49:34,500 --> 01:49:36,739 and then have the ability to go deep whenever 2309 01:49:36,739 --> 01:49:38,159 you need after the class. 2310 01:49:38,159 --> 01:49:41,199 And so that's sort of how that class is designed as well. 2311 01:49:41,199 --> 01:49:41,699 Yeah. 2312 01:49:41,699 --> 01:49:43,500 That's it for today. 2313 01:49:43,500 --> 01:49:45,819 So thank you. 2314 01:49:45,819 --> 01:49:48,889 Thank you for participating.