[00:05] Hi, everyone. [00:06] Welcome to another lecture for CS230 Deep Learning. [00:11] Today, we're going to talk about enhancing large language model applications. [00:19] And I call this lecture Beyond LLM. [00:23] It has a lot of newer content. [00:26] And the idea behind this lecture is we started to learn about neurons, and then we learned about layers, and then we learned about deep neural networks, and then we learned a little bit about how to structure projects in C3. [00:44] And now we're going one level beyond into, what would it look like if you were building agentic AI systems at work, in a startup, in a company? [00:58] And it's probably one of the more practical lectures. [01:02] Again, the goal is not to build a product end to end in the next hour or so, but rather to tell you all the techniques that AI engineers have cracked, figured out, are exploring, so that after the class, you have the breadth of view of different prompting techniques, different agentic workflows, multi-agent systems, evals. [01:25] And then when you want to dive deeper, you have the baggage to dive deeper and learn faster about it. [01:32] Let's try to make it as interactive as possible, as usual. [01:37] When we look at the agenda, the agenda is going to start with the core idea behind challenges and opportunities for augmenting LLMs. [01:48] So we start from a base model. [01:50] How do we maximize the performance of that base model? [01:55] Then we'll dive deep into the first line of optimization, which is prompting methods, and we'll see a variety of them. [02:02] Then we'll go slightly deeper. [02:04] If we were to get our hands under the hood and do some fine tuning, what would it look like? [02:09] I'm not a fan of fine tuning, and I talk a lot about that, but I'll explain why I try to avoid fine tuning as much as possible. [02:18] And then we'll do a section 4 on Retrieval-Augmented Generation, or RAG, which you've probably heard of in the news. [02:26] Maybe some of you have played with RAGs. [02:28] We're going to unpack what a RAG is and how it works, and then the different methods within RAGs. [02:36] And then we'll talk about agentic AI workflows. [02:40] I'll define it. [02:42] Andrew Ng is one of the first ones to have called this trend agentic AI workflows. [02:49] And so we'll look at the definition that Andrew gives to agentic workflows, and then we'll start seeing examples. [02:56] Section 6 is very practical. [02:59] It's a case study where we will think about an agentic workflow, and I'll ask you to measure if the agent actually works, and we'll brainstorm how we can measure if an agentic workflow is working the way you want it to work. [03:16] There are plenty of methods, called evals, that solve that problem. [03:22] And then we'll look briefly at multi-agent workflows. [03:24] And then we can have an open-ended discussion where I share some thoughts on what's next in AI. [03:31] And I'm looking forward to hearing from you all, as well, on that one. [03:36] So let's get started with the problem of augmenting LLMs. [03:42] So open-ended question for you-- you are all familiar with pre-trained models like GPT-3.5 Turbo or GPT-4o. [03:52] What's the limitation of using just a base model?
[03:56] What are the typical issues that might arise as you're using a vanilla pre-trained model? [04:07] Yes. [04:08] It lacks some domain knowledge. [04:10] Lacks some domain knowledge. [04:11] You're perfectly right. [04:13] We had a group of students a few years ago. [04:16] It was not LLM related, but they were building an autonomous farming device or vehicle that had a camera underneath, taking pictures of crops to determine if the crop is sick or not, if it should be thrown away, if it should be used or not. [04:35] And that data set is not a data set you find out there. [04:40] And the base model or pre-trained computer vision model would lack that knowledge, of course. [04:47] What else? [04:49] Yes. [04:50] [INAUDIBLE] pictures are very dark [INAUDIBLE] [04:57] OK, maybe the-- you're saying-- so just to repeat for people online, you're saying the model might have been trained on high-quality data, but the data in the wild is actually not that high quality. [05:08] And in fact, yes, the distribution of the real world might differ, as we've seen with GANs, from the training set, and that might create an issue with pre-trained models. [05:18] Although pre-trained LLMs are getting better at handling all sorts of data inputs. [05:25] Yes. [05:26] Lacks current information. [05:28] Lack what? [05:28] Current information. [05:30] Lacks current information. [05:32] The LLM is not up to date. [05:34] And in fact, you're right. [05:35] Imagine you have to retrain your LLM from scratch every couple of months. [05:39] One story that I found funny-- it's from probably three years ago, or maybe more, five years ago, where during his first presidency, President Trump one day tweeted, "Covfefe." [05:53] You remember that tweet or no? [05:56] Just "Covfefe." [05:57] And it was probably a typo, or the phone was in his pocket. [05:59] I don't know. [06:00] But that word did not exist. [06:03] The language models, in fact, that Twitter was running at the time could not recognize that word. [06:08] And so the recommender system sort of went wild, because suddenly everybody was making fun of that tweet using the word "Covfefe," and the LLM was so confused about, what does that mean? [06:20] Where should we show it? [06:21] To whom should we show it? [06:22] And this is an example of a-- nowadays, especially on social media, there are so many new trends, and it's very hard to retrain an LLM to match the new trend and understand the new words out there. [06:34] I mean, you oftentimes hear Gen Z words like "rizz" or "mid" or whatever. [06:40] I don't know all of them. [06:41] But you probably want to find a way that can allow the LLM to understand those trends without retraining the LLM from scratch. [06:51] What else? [06:53] It's trained to have a breadth of knowledge. [06:56] And if you wanted to do something specialized, that might limit [INAUDIBLE]. [06:59] Yeah, it might be trained on a breadth of knowledge, but it might fail or not perform adequately on a narrow task that is very well defined. [07:09] Think about enterprise applications-- yeah, enterprise applications. [07:13] You need high precision, high fidelity, low latency. [07:17] And maybe the model is not great at that specific thing. [07:20] It might do fine, but just not good enough.
[07:22] And you might want to augment it in a certain way. [07:24] Yeah. [07:25] Maybe it has [INAUDIBLE] so it makes the model a lot heavier, a lot slower. [07:32] [INAUDIBLE] [07:33] So maybe it has a lot of broad domain knowledge that might not be needed for your application. [07:39] And so you're using a massive, heavy model when you actually are only using 2% of the model's capability. [07:44] You're perfectly right. [07:45] You might not need all of it. [07:46] So you might find ways to prune or quantize the model, modify it. [07:51] All of these are good points. [07:53] I'm going to add a few more, as well. [07:55] LLMs are very difficult to control. [07:58] Your last point is actually an example of that. [08:00] You want to control the LLM to use a part of its knowledge, but it's not-- it's, in fact, getting confused. [08:06] We've seen that in history. [08:08] In 2016, Microsoft created a notorious Twitter bot that learned from users, and it quickly became a racist jerk. [08:18] Microsoft ended up removing the bot 16 hours after launching it. [08:22] The community was really fast at determining that this was a racist bot. [08:28] And you can empathize with Microsoft in the sense that it is actually hard to control an LLM. [08:34] They might have done a better job of vetting it before launching, but it is really hard to control an LLM. [08:40] Even more recently, this is a tweet from Sam Altman last November, where there was this debate between Elon Musk and Sam Altman on whose LLM is the left-wing propaganda machine or the right-wing propaganda machine, and they were hating on each other's LLMs. [08:59] But that tells you, at the end of the day, that even those two teams, Grok and OpenAI, which are probably the best-funded teams with a lot of talent, are not doing a great job at controlling their LLMs. [09:14] And from time to time, if you hang out on X, you might see screenshots of users interacting with LLMs and the LLM saying something really controversial or racist or something that would not be considered great by social standards, I guess. [09:33] And that tells you that the model is really hard to control. [09:39] The second aspect of it is something that you mentioned earlier. [09:43] LLMs may underperform on your task, and that might include specific knowledge gaps, such as medical diagnosis. [09:51] If you're doing medical diagnosis, you would rather have an LLM that is specialized for that and is great at it and, in fact, something that we haven't mentioned as a group, cites sources. [10:00] So the answer is specifically sourced. [10:03] You have a hard time believing something unless you have the actual source of the research that backs it up. [10:10] Inconsistencies in style and format-- so imagine you're building a legal AI agentic workflow. [10:17] Legal has a very specific way to write and read, where every word counts. [10:22] If you're negotiating a large contract, every word on that contract might mean something else when it comes to the court. [10:29] And so it's very important that you use an LLM that is very good at it. [10:34] The precision matters.
[10:35] And then task-specific understanding, such as doing classification in a niche field. [10:40] Here I pulled an example where-- let's say a biotech company is trying to use an LLM to categorize user reviews into positive, neutral, or negative. [10:54] Maybe for that company, something that would typically be considered a negative review is actually considered a neutral review, because the NPS of that industry tends to be way lower than other industries, let's say. [11:10] That's a task-specific understanding, and the LLM needs to be aligned to what the company believes is the categorization that it wants. [11:17] We will see an example of how to solve that problem in a second. [11:21] And then limited context handling-- a lot of AI applications, especially in the enterprise, require data that has a lot of context. [11:33] Just to give you a simple example, knowledge management is an important space; enterprises buy a lot of knowledge management tools. [11:40] When you go on your drive and you have all your documents, ideally, you could have an LLM running on top of that drive. [11:47] You could ask any question, and it would immediately read thousands of documents and answer: what was our Q4 performance in sales? [11:56] It was x dollars. [11:58] It finds it super quickly. [11:59] In practice, because LLMs do not have a large enough context, you cannot use a standalone vanilla pre-trained LLM to solve that problem. [12:08] You will have to augment it. [12:11] Does that make sense? [12:13] The other aspect around context windows is they are, in fact, limited. [12:17] If you look at the context windows of the models from the last five years, even the best models today will range in context window, or the number of tokens they can take as input, somewhere in the hundreds of thousands of tokens max. [12:36] Just to give you a sense, 200,000 tokens is roughly two books. [12:42] So that's how much you can upload and it can read, pretty much. [12:47] And you can imagine that when you're dealing with video understanding or heavier data files, that is, of course, an issue. [12:56] So you might have to chunk it. [12:58] You might have to embed it. [12:59] You might have to find other ways to get the LLM to handle larger contexts. [13:06] The attention mechanism is also powerful, but problematic, because it does not do a great job at attending over very large contexts. [13:16] There is actually an interesting problem called needle in a haystack. [13:21] It's an AI problem-- or call it a benchmark-- where, in order to test if your LLM is good at putting attention on a very specific fact within a large corpus, researchers randomly insert a sentence that states a certain fact, such as "Arun and Max are having coffee at Blue Bottle," in the middle of the Bible, let's say, or some very long text. [13:54] And then you ask the LLM, what were Arun and Max having at Blue Bottle? [14:02] And you see if it remembers that it was coffee. [14:04] It's actually a complex problem, not because the question is complex, but because you're asking the model to find a fact within a very large corpus, and that's complicated.
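To make the setup concrete, here is a minimal sketch of how such a needle-in-a-haystack test might be constructed. The corpus is a stand-in, and `query_llm` is a hypothetical helper for whatever LLM API you use; this is an illustration of the idea, not a specific published benchmark implementation.

```python
import random

def build_needle_test(corpus: str, needle: str, question: str) -> str:
    """Hide a 'needle' sentence at a random position inside a long corpus."""
    sentences = corpus.split(". ")
    sentences.insert(random.randint(0, len(sentences)), needle)
    haystack = ". ".join(sentences)
    return f"{haystack}\n\nBased only on the text above: {question}"

very_long_text = "The quick brown fox jumps over the lazy dog. " * 5000  # stand-in corpus
prompt = build_needle_test(
    corpus=very_long_text,
    needle="Arun and Max are having coffee at Blue Bottle",
    question="What were Arun and Max having at Blue Bottle?",
)
# answer = query_llm(prompt)           # hypothetical LLM call
# passed = "coffee" in answer.lower()  # did the model attend to the needle?
```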
[14:16] So, again, this is a limiting factor for LLMs. [14:19] We'll talk about RAG in a second. [14:21] But I want to preview-- there are debates around whether RAG is the right long-term approach for AI systems. [14:29] So as a high-level idea, a RAG is a mechanism, if you will, that embeds documents that an LLM can retrieve and then add as context to its initial prompt to answer a question. [14:44] It has lots of applications. [14:45] Knowledge management is an example. [14:47] So imagine you have your drive again. [14:49] But every document is compressed into a representation, and the LLM has access to that lower-dimensional representation. [14:59] The debate that this tweet from [INAUDIBLE] outlines is: in theory, if we have infinite compute, then RAG is useless. [15:09] Because you can just read a massive corpus immediately and answer your question. [15:15] But even in that case, latency might be an issue. [15:19] Imagine the time it takes for an AI to read your entire drive every single time you ask a question. [15:24] It doesn't make sense. [15:25] So RAG has other advantages beyond even the accuracy. [15:30] On top of that, the sourcing matters, as well. [15:33] So RAG allows you to source. [15:35] We'll talk about all that later. [15:38] But there's always this debate in the community whether a certain method is actually future proof. [15:46] Because in practice, as compute power doubles every year, let's say, some of the methods we're learning right now might not be relevant three years from now. [15:54] We don't know, essentially. [15:59] And the analogy that he makes on context windows, and why RAG approaches might be relevant even a long time from now, is search. [16:09] When you search on a search engine, you still find sources of information. [16:14] And in fact, in the background, there are very detailed traversal algorithms that rank and find the specific links that might be the best to present you. [16:25] Versus, imagine you had to read the entire web every single time you're doing a search query, without being able to narrow to a certain portion of the space. [16:36] That might, again, not be reasonable. [16:41] OK, when we're thinking of improving LLMs, the easiest way to think of it is along two dimensions. [16:50] One dimension is we are going to improve the foundation model itself. [16:54] So, for example, we move from GPT-3.5 Turbo, to GPT-4, to GPT-4o, to GPT-5. [17:04] Each of those is supposed to improve the base model. [17:07] GPT-5 is another debate because it's packaging other models within itself. [17:12] But if you're thinking about 3.5, 4, and 4o, that's really what it is. [17:16] The pre-trained model improves. [17:18] And so you should see your performance improve on your tasks. [17:22] But the other dimension is we can actually engineer-- leverage the LLM in a way that makes it better. [17:30] So you can simply prompt GPT-4o. [17:34] You can change some prompts and improve the prompt, and it will improve the performance. [17:40] It's been shown. [17:41] You can even put a RAG around it. [17:42] You can put an agentic workflow around it. [17:45] You can even put a multi-agent system around it. [17:49] And that is another dimension for you to improve performance.
[17:52] So that's how I want you to think about it-- which LLM am I using, and then how can I maximize the performance of that LLM? [17:59] This lecture is about the vertical axis. [18:02] Those are the methods that we will see together. [18:08] Sounds good for the introduction. [18:11] So let's move to prompt engineering. [18:14] I'm going to start with an interesting study just to motivate why prompt engineering matters. [18:20] There is a study from Harvard Business School, as well as Wharton at UPenn, that took a subset of BCG consultants, individual contributors, and split them into three groups. [18:37] One group had no access to AI. [18:39] One group had access to-- I think it was GPT-4. [18:44] And then one group had access to the LLM, but also a training on how to prompt better. [18:50] And then they observed the performance of these consultants across a wide variety of tasks. [18:56] There are a few things that they noticed that I thought were interesting. [18:59] One is something they called the jagged frontier, meaning that certain tasks that consultants are doing fall beyond the jagged frontier, meaning AI is not good enough. [19:14] It's not improving human performance. [19:18] In fact, it's actually making it worse. [19:20] And some tasks are within the frontier, meaning that AI is actually significantly improving the performance, the speed, the quality of the consultant. [19:32] Many tasks fell within and many tasks fell outside, and they shared their insights. [19:37] But the TLDR is-- there is a frontier within which AI is absolutely helping, and one outside of it where they call out this behavior of falling asleep at the wheel, where people relied on AI on a task that was beyond the frontier. [19:52] And in fact, it ended up going worse because the human was not reviewing the outputs carefully enough. [20:01] They did note that the group that was trained was the best, better than the group that was not trained on prompt engineering, which also motivates why this lecture matters, so that you're within that group afterwards. [20:15] Another insight was the centaurs and the cyborgs. [20:20] They noticed that consultants had the tendency to work with AI in one of two ways, and you might, yourself, be part of one of these groups. [20:29] The centaurs are mythical creatures that are half human, half-- I think, half, what, horses? [20:38] Yeah? [20:39] Horses? [20:39] Half human, half horse. [20:42] And those were individuals that would divide and delegate. [20:45] They might give a pretty big task to the AI. [20:48] So imagine you're working on a PowerPoint, which consultants are known to do. [20:52] You might actually write a very long prompt on how you want it to do your PowerPoint, and then let it work for some time, and then come back and it's done, while others would act as cyborgs. [21:02] Cyborgs are fully blended bionic humans-- human and robot, augmented with robotic parts. [21:10] And those individuals would not fully delegate a task. [21:13] They would actually work super quickly with the model back and forth.
[21:17] I find that a lot of students actually work more like cyborgs than centaurs, while maybe in the enterprise, when you're trying to automate a workflow, you're thinking more like a centaur. [21:29] That's just something good to keep in mind. [21:31] Also, a lot of companies will tell you, oh, we're hiring prompt engineers, et cetera. [21:34] It's a career. [21:35] I don't buy that. [21:36] I think it's just a skill that everybody should have. [21:39] You're not going to make a career out of prompt engineering, but you're probably going to use it as a very powerful skill in your career. [21:49] So let's talk about basic prompt design principles. [21:52] I'm giving you a very simple prompt here. [21:56] Summarize this document, and then the document is uploaded alongside it. [22:00] And the model doesn't have much context around: what should the summary be? [22:06] How long should the summary be? [22:07] What should it talk about, et cetera? [22:09] You can actually improve this prompt by doing something like: summarize this 10-page scientific paper on renewable energy in five bullet points, focusing on key findings and implications for policymakers. [22:25] That's already better. [22:26] You're sharing the audience, and it's going to tailor it to the audience. [22:30] You're saying that you want five bullet points, and you want to focus only on key findings. [22:35] That's a better prompt, you would argue. [22:39] How could you make this prompt even better? [22:41] What are other techniques that you've heard of or tried yourself that could make this one-shot prompt better? [22:53] Yeah. [22:53] [INAUDIBLE] [22:57] OK. [22:58] Provide an example. [22:58] So you mean, say, here is an example of a great summary. [23:02] Yeah. [23:03] You're right. [23:03] That's a good idea. [23:05] [INAUDIBLE] [23:08] Very popular technique. [23:10] Act like a renewable energy expert giving a talk at Davos, let's say, yeah. [23:17] That's great. [23:18] Someone-- yeah. [23:20] Say you're really good at it. [23:22] Yeah. [23:23] You are the best in the world at this. [23:25] Explain. [23:26] Yeah. [23:26] Actually, I mean, these things work. [23:28] It's funny, but it does work to say act like x, y, z. [23:32] It's a very popular prompt template. [23:34] We'll see a few examples. [23:36] What else could you do? [23:40] Yes. [23:41] Of course, you could ask it to critique its own output. [23:46] Critique its own output. [23:47] So you're using reflection. [23:48] So you might actually produce one output, and then ask the model to critique it, and then give it back. [23:52] Yeah. [23:53] We see that. [23:53] That's a great one. [23:54] That's the one that probably works best among these, typically, but we'll see some examples. [23:59] What else? [24:00] Yeah. [24:01] Break the task down into steps. [24:03] OK. [24:03] Break the task down into steps. [24:05] You know what that is called? [24:06] No. [24:07] OK. [24:08] Chain of thought. [24:09] So this is actually a popular method that research has shown improves performance. [24:15] You could actually give a clear instruction and also encourage the model to think step by step: approach the task step by step, and do not skip any step. [24:24] And then you give it some steps, such as: step one, identify the three most important findings. [24:29] Step two, explain how each finding impacts renewable energy policy. [24:33] Step three, write the five-bullet summary with each point addressing a finding, et cetera.
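Putting the pieces together, here is what that improved summarization prompt with chain-of-thought steps might look like as a reusable string in code; the exact wording is illustrative, not a prescribed template.

```python
# Illustrative chain-of-thought prompt for the summarization task above.
COT_SUMMARY_PROMPT = """Act like a renewable energy expert.
Summarize this 10-page scientific paper on renewable energy in five bullet
points, focusing on key findings and implications for policymakers.

Think step by step. Approach the task step by step. Do not skip any step.
Step 1: Identify the three most important findings.
Step 2: Explain how each finding impacts renewable energy policy.
Step 3: Write the five-bullet summary, with each point addressing a finding.

Paper:
{paper_text}
"""

# prompt = COT_SUMMARY_PROMPT.format(paper_text=paper_text)  # then send to the LLM
```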
[24:39] So chain of thought-- I linked the paper from 2023 that popularized chain of thought. [24:46] Chain of thought is very popular right now, especially in AI startups that are trying to control their LLMs. [24:55] OK. [24:56] To go back to your example about act like XYZ, what I like to do-- Andrew Ng also talks about that-- is to look at other people's prompts. [25:06] And in fact, online, you have a lot of free prompt repositories on GitHub. [25:11] In fact, I linked the awesome prompt template repo on GitHub, where you have so many examples of great prompts that engineers have built. [25:19] They said it works great for us, and they published it online. [25:23] And a lot of them start with act as. [25:27] Act as a Linux terminal. [25:29] Act as an English translator. [25:31] Act like a position interviewer, et cetera. [25:37] The advantage of a prompt template is that you can actually put it in your code and scale it for many user requests. [25:44] So let me give you an example from Workera. [25:48] Workera evaluates skills. [25:50] Some of you have taken the assessments already. [25:52] And it tries to personalize to the user. [25:56] And in fact, if you read an HR system in an enterprise, you might have: Jane is a product manager, level 3, she is in the US, and her preferred language is English. [26:10] And actually, that metadata can be inserted in a prompt template that will be personalized for Jane. [26:16] And similarly for Joe, whose preferred language is Spanish, it will tailor it to Joe. [26:24] And that's called a prompt template. [26:26] [INAUDIBLE] [26:34] So the question is: do the foundation models use prompt templates, or do you have to integrate them yourself? [26:42] So the foundation models probably use a system prompt that you don't see. [26:47] When you type on ChatGPT, it is possible-- it's not public-- that OpenAI behind the scenes has something like: act like a very helpful assistant for this user. [26:59] And by the way, here are your memories about the user that we kept in a database. [27:05] You can actually check your memories. [27:07] And then your prompt goes under, and then the generation starts. [27:10] So probably, they're using something like that. [27:12] But it doesn't mean you can't add one yourself. [27:15] So in fact, if you think about a prompt template for the Workera example I was showing, maybe it starts, when you call OpenAI, with act like a helpful assistant. [27:25] And then underneath, it's like act like a great AI mentor that helps people in their career. [27:31] And OpenAI's own template also has something like follow the instructions from the creator. [27:37] It's possible. [27:41] Questions about prompt templates? [27:42] Again, I would encourage you to go and read examples of prompts. [27:45] Some of them are quite thoughtful.
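Here is a minimal sketch of what such a prompt template might look like in code. The template wording and the metadata fields (role, level, country, language) mirror the Jane and Joe example above but are made up for illustration; Joe's role and country are assumptions.

```python
# Hypothetical prompt template personalized from HR-system metadata.
MENTOR_TEMPLATE = (
    "Act like a great AI mentor who helps people in their career.\n"
    "The user is a {role}, level {level}, based in {country}.\n"
    "Always answer in {language}.\n\n"
    "User question: {question}"
)

def build_prompt(user: dict, question: str) -> str:
    """Fill the template with per-user metadata pulled from the HR system."""
    return MENTOR_TEMPLATE.format(**user, question=question)

jane = {"role": "product manager", "level": 3, "country": "US", "language": "English"}
joe = {"role": "sales engineer", "level": 2, "country": "Spain", "language": "Spanish"}

print(build_prompt(jane, "How do I grow into a senior product role?"))
```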
[27:48] Let's talk about zero-shot versus few-shot prompting. [27:51] It came up earlier. [27:53] Here's an example. [27:54] Again, going back to the categorization of product reviews, let's say that we're working on a task where the prompt is: classify the tone of the sentence as positive, negative, or neutral. [28:07] And then you paste the review, which is: the product is fine, but I was expecting more. [28:16] If I were to survey the room, I would bet that some of you would say it's negative. [28:21] Some of you would say it's neutral. [28:23] Because you actually have a first part that is relatively positive. [28:27] It's fine. [28:28] And then the second part, I was expecting more, which is relatively negative. [28:31] So where do you land? [28:33] This can be a subjective question. [28:35] And maybe in one industry, this would be considered amazing. [28:37] And in another one, it would be considered really bad, because people are used to really glowing reviews. [28:44] And so the way you can actually align the model to your task is by converting that zero-shot prompt-- zero-shot refers to the fact that it's not being given any examples-- into a few-shot prompt, where the model is given, in the prompt, a set of examples to align it with what you want it to do. [29:01] So the example here is: again, you paste the same prompt as before with the user review. [29:06] And then you add, here are examples of tone classifications. [29:10] This exceeded my expectations completely. [29:12] Positive. [29:14] It's OK, but I wish it had more features. [29:17] Negative. [29:18] The service was adequate. [29:20] Neither good nor bad. [29:22] Neutral. [29:23] Now classify the tone of this sentence. [29:26] After you've given it these examples, the model then says negative. [29:31] And the reason it says negative, of course, is likely because of the second example, which was it's OK, but I wish it had more features, which we told the model was negative. [29:43] Because the model saw that example, it's now aligned with your expectations. [29:47] Few-shot prompts are very popular. [29:50] And in fact, for AI startups that are slightly more sophisticated, you might see them keep a prompt up to date: whenever a user says something, they might have a human label it and then add it as a few-shot example in the relevant prompts in their code base. [30:08] You can think of that as almost building a data set. [30:10] But instead of actually building a separate data set, like we've seen with supervised fine tuning and then fine tuning the model on it, you're just putting it directly in the prompt. [30:19] It turns out it's probably faster to do that if you want to experiment quickly, because you don't touch the model parameters. [30:25] You just update your prompts. [30:27] And if it's text examples, you can actually concatenate so many examples in a single prompt. [30:34] At some point, it will be too long, and you will not have the necessary context window. [30:39] But it's a pretty strong approach that is quick to align an LLM. [30:48] OK?
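As a concrete illustration, here is what the few-shot version of the tone classifier might look like; the labeled examples are the ones from the slide, and `query_llm` is again a hypothetical helper for your LLM API.

```python
# Few-shot tone classification: the in-prompt examples align the model
# with the company's own definition of positive/negative/neutral.
FEW_SHOT_PROMPT = """Classify the tone of the sentence as positive, negative, or neutral.

Here are examples of tone classifications:
"This exceeded my expectations completely." -> positive
"It's OK, but I wish it had more features." -> negative
"The service was adequate, neither good nor bad." -> neutral

Now classify the tone of this sentence:
"{review}" ->"""

prompt = FEW_SHOT_PROMPT.format(review="The product is fine, but I was expecting more.")
# label = query_llm(prompt).strip()  # hypothetical call; expected output: "negative"
```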
[30:49] Yes. [30:50] [INAUDIBLE] [30:57] So the question was: is there any research on how long the prompt can be before the model essentially loses itself or doesn't follow instructions anymore? [31:06] There is. [31:08] The problem is that research is outdated every few months because models get better. [31:14] And so I don't know where the state of the art is. [31:16] You can probably find it online in benchmarks. [31:20] But I'll give you an example. [31:23] On the Workera product, you have a voice conversation-- for some of you that have tried it-- where the prompt asks you to explain something. [31:30] And then you explain, and then there's a scoring algorithm behind it. [31:33] We know that after eight turns, the model loses itself. [31:38] After eight turns, because you always paste the previous user response, it just starts going wild. [31:44] And so the technique we use in the background is we actually create chapters of the conversation. [31:49] Maybe one chapter is the first eight prompts. [31:51] And then you actually start over from another prompt. [31:53] You can summarize the first part of the conversation, insert the summary, and then keep going. [31:59] Those are engineering hacks that engineers might have figured out in the background. [32:04] Because eight turns makes a prompt quite long, actually. [32:13] Let's move on to chaining. [32:15] Chaining is the most popular technique out of everything we've seen so far in prompt engineering. [32:22] It's not chain of thought. [32:23] So chain of thought, we've seen, is think step by step, step 1, step 2, step 3, do not skip any step. [32:28] This is different. [32:30] This is chaining complex prompts to improve performance, and this is what it looks like. [32:37] You take a single-step prompt, such as: read this customer review and write a professional response that acknowledges their concern, explains the issue, offers a resolution. [32:48] And then you paste the customer review, which is: I ordered a laptop. [32:51] It arrived three days late. [32:52] The packaging was damaged. [32:54] Very disappointing. [32:56] I needed it urgently for work. [32:59] And then the output is an email that is immediately given to you by the LLM after it reads the prompt. [33:08] So this might work, but it might be hard to control. [33:14] Because think about it. [33:15] There are multiple steps that you have listed, and everything is embedded in the same prompt. [33:20] And if you wanted to debug step by step and know which step is weaker, you couldn't. [33:24] You would have everything mixed together. [33:27] So one advantage of chaining is you separate the prompts, so that you can debug them separately. [33:35] And it also gives you an easier way to improve your workflow. [33:41] Let's say a first prompt is: extract the key issues. [33:44] Identify the key concerns mentioned in this customer review. [33:47] Paste the customer review. [33:49] Second prompt: using these issues-- so you paste back the issues-- draft an outline for a professional response that acknowledges concerns, explains possible reasons, and offers a resolution. [34:04] So this is not-- [34:06] Prompt number 3: write the full response. [34:09] So using the outline, write the professional response. [34:14] And then you get your final output. [34:18] So in theory, you can tell me, oh, the second approach is better than the first one at first glance.
[34:23] But what you can notice is that we can actually test those three prompts separately from each other and determine if we will get the most gains out of engineering the first prompt, optimizing it, or the second one, or the third one. [34:39] We now have three prompts that are independent from each other. [34:43] And maybe if the outline were better, the performance of the email-- say, the open rate, or the user satisfaction with the response-- would actually get higher. [34:57] And so chaining improves performance, but most importantly, it helps you control your workflow and debug it more seamlessly. [35:07] Yes. [35:09] So if we know that the three prompts independently work really well, if we combine them into one prompt and we highlight a step-by-step thinking process, do we, on average, get a [INAUDIBLE] by itself, or do we still have to do that breakdown? [35:28] So let me try to rephrase. [35:30] You say, let's say we look at the first prompt, which has all three tasks built into that prompt. [35:37] What exactly do you mean? [35:39] You mean, like, if we evaluate the output and we measure some user insight, satisfaction, et cetera? [35:45] Why don't we just modify that prompt and essentially see how it improves user satisfaction? [35:51] Yeah. [35:51] [INAUDIBLE] [35:54] I see. [35:55] So why do we need the three steps? [35:57] I mean, think about it. [35:59] The intermediate output is what you want to see. [36:02] Like, if I'm debugging the first approach, the way I would do it is I would capture user insights. [36:09] Like, here's the email. [36:10] How good was the response? [36:11] Thumbs up, thumbs down. [36:13] Was your issue resolved? [36:16] Thumbs up, thumbs down. [36:17] Those would tell me how good my prompt is. [36:19] And I can engineer that prompt, optimize it, and I would probably drive some gains. [36:23] But I will not easily be able to trace back to what the problem was. [36:28] While in the second approach, not only can I use the end-to-end metrics to improve my process, I can also use the intermediate steps. [36:35] For example, if I look at prompt 2 and I look at the outline, and I see the outline is actually, meh, it's not great, then I think I can get a lot of gains out of the outline. [36:45] Or the outline is actually really good, but the last prompt doesn't do a good job at translating it into an email. [36:51] So the outline is exactly what I want the LLM to do, but the translation into a customer-facing email is not good. [36:58] In fact, it doesn't follow our internal vocabulary. [37:01] Then I know the third prompt is where I would get the most gains. [37:06] So that's what it allows me to do: have intermediate steps to review. [37:10] Are there any latency [INAUDIBLE] [37:13] We'll talk about it. [37:14] Are there any latency concerns? [37:16] Yes. [37:17] In certain applications, you don't want to use a chain, or you don't want to use a long chain, because it adds latency. [37:26] We'll talk about that later. [37:27] Good point. [37:28] So practically, this is what chaining complex prompts looks like. [37:33] You have your first prompt with your first task. [37:35] It outputs. [37:36] The output is pasted in the second prompt with the second task being defined. [37:41] The output is then pasted into the third prompt with the third task being defined, and so on. [37:46] That's what it looks like in practice.
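Here is a minimal sketch of that chain in code, assuming the same hypothetical `query_llm` helper; each intermediate output can be logged and inspected on its own, which is exactly the debugging benefit discussed above.

```python
def respond_to_review(review: str) -> str:
    # Prompt 1: extract the key issues.
    issues = query_llm(
        "Identify the key concerns mentioned in this customer review:\n" + review
    )
    # Prompt 2: draft an outline that addresses those issues.
    outline = query_llm(
        "Using these issues, draft an outline for a professional response "
        "that acknowledges concerns, explains possible reasons, and offers "
        "a resolution:\n" + issues
    )
    # Prompt 3: write the full response from the outline.
    return query_llm(
        "Using this outline, write the full professional response:\n" + outline
    )
```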
[37:52] Super. [37:55] We'll talk more later about testing your prompts, but there are methods now to do it, and we'll see later in this lecture, with our case study, how we can test our prompts. [38:06] But here is an example of how you might do it. [38:11] You might have a summarization workflow prompt that is the baseline. [38:19] It's a single prompt. [38:21] You might have a refined summarization, which is a modified version of this prompt, or a workflow with a chain. [38:30] And then you have your test case, which is the input that you want to summarize, let's say. [38:36] And then you have the generated output. [38:38] And you can have humans go and rate these outputs. [38:42] And you would notice that the baseline is better or worse than the refined prompt. [38:47] Of course, this manual approach takes time, but it's a good way to start. [38:53] And usually, the advice is: get hands-on at the beginning, because you will quickly notice some issues, and it will give you better intuition on what tweaks can lead to better performance. [39:03] However, if you wanted to scale that system across many products, many parts of your code base, you might want to find a way to do it automatically, without asking humans to review and grade summaries. [39:14] One approach is to use platforms-- at Workera, our team uses a platform called promptfoo that allows you to actually automate part of this testing. [39:26] In a nutshell, what it does is it allows you to run the same prompt with five different LLMs immediately and put everything in a table. [39:37] That makes it super easy for a human to grade, let's say. [39:40] Or alternatively, it might allow you to define LLM judges. [39:46] LLM judges can come in different flavors. [39:50] For example, I can have an LLM judge that does a pairwise comparison. [39:54] So what the LLM is asked to do is: here are two summaries. [39:58] Just tell me which one is better than the other one. [40:01] That's what the LLM does. [40:02] And that can be used as a proxy for how good the summarization baseline versus the refined version is. [40:08] Another way to do an LLM judge is single-answer grading: here's a summary, grade it from 1 to 5. [40:18] And then you can go even deeper and do a reference-guided pairwise comparison. [40:24] Or you also add a rubric. [40:25] You say a 5 is when a summary is below 100 characters-- I'm just making this up-- below 100 characters, mentions at least three key points that are distinct, and starts with a first sentence that gives the overview and then goes into the detail. [40:40] That's a great summary, a 5 out of 5. [40:42] A 0 is when the LLM failed to summarize and was actually very verbose, let's say. [40:49] And so you put a rubric behind it, and you have an LLM judge following the rubric. [40:55] Of course, you can now pair different techniques. [40:57] You can do few-shot for the rubric. [40:58] You can actually give examples of 5-out-of-5s, 4-out-of-5s, 3-out-of-5s, because now you can combine multiple techniques. [41:06] Does that make sense? [41:11] Yeah. [41:11] OK.
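As a small illustration of the pairwise flavor, here is what an LLM judge might look like, again with the hypothetical `query_llm` helper; platforms like promptfoo wrap this kind of pattern with test suites, tables, and reporting.

```python
# A minimal LLM-as-judge doing pairwise comparison of two summaries.
JUDGE_PROMPT = """You are grading two summaries of the same source text.

Source text:
{source}

Summary A:
{summary_a}

Summary B:
{summary_b}

Which summary is better? Answer with exactly "A" or "B"."""

def pairwise_judge(source: str, summary_a: str, summary_b: str) -> str:
    verdict = query_llm(JUDGE_PROMPT.format(
        source=source, summary_a=summary_a, summary_b=summary_b))
    return verdict.strip()  # "A" (baseline) or "B" (refined prompt)
```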
[41:12] So that was the second section, on prompt engineering, or the first line of optimization. [41:19] Now, let's say you've exhausted all your chances for prompt engineering, and you're thinking about actually touching the model, modifying its weights-- or fine tuning it, in other words. [41:31] I was telling you, I'm not a fan of fine tuning. [41:34] There are a few reasons why. [41:37] One, it typically requires substantial labeled data to fine tune. [41:43] Although now, there are approaches that are getting better at fine tuning, that actually look more like few-shot prompting than fine tuning. [41:52] It's sort of merging, although one modifies the weights and the other doesn't. [41:57] Fine tuned models may also overfit to specific data-- we're going to see a funny example, actually-- losing their general-purpose utility. [42:06] So you might fine tune a model, and actually, when someone asks a pretty generic question, it doesn't do well anymore. [42:12] It might do well on your task. [42:14] So it might be relevant or not. [42:15] And then it's time- and cost-intensive. [42:17] That's my main problem. [42:19] And at Workera, we steer away from fine tuning as much as possible. [42:26] Because by the time you're done fine tuning your model, the next model is out, and it's actually beating your fine tuned version of the previous model. [42:33] So I would steer away from fine tuning as much as you can. [42:36] The advantage of the prompt engineering methods we've seen is you can put the next best pre-trained model directly in your code. [42:44] It will update everything immediately. [42:46] Fine tuning doesn't work like that. [42:50] There are advantages, though, where it still makes sense: if the task requires repeated high-precision outputs, such as legal or scientific explanations, and if the general-purpose LLM struggles with domain-specific language. [43:03] So let's look at a quick example together, which is an example from Ross Lazerowitz. [43:12] I think it was a couple of years ago, September 2023, where Ross tried to do Slack fine tuning. [43:22] So he looked at a lot of Slack messages within his company. [43:26] And he was like, I'm going to fine tune a model that speaks like us or operates like us, because this is how we work. [43:33] This is the data that represents how people work at the company. [43:37] And so he actually went ahead and fine tuned the model, and gave it a prompt, like, hey, write-- he was delegating to the model-- a 500-word blog post on prompt engineering. [43:50] And the model responded, I shall work on that in the morning. [43:55] And then he tried to push the model a little further and said, it's morning now. [44:01] And the model said, I'm writing right now. [44:04] It's 6:30 AM here. [44:06] Write it now. [44:10] OK, I shall write it now. [44:12] I actually don't know what you would like me to say about prompt engineering. [44:15] I can only describe the process. [44:17] The only thing that comes to mind for a headline is: how do we build prompts? [44:21] It's kind of a funny example for fine tuning, because it truly went wrong. [44:27] The idea was, I want the model to speak like us at work.
[44:32] And it ended up acting like the people and not actually following instructions. [44:40] So that's one example of why I would steer away from fine tuning. [44:47] Super. [44:51] Let's talk about RAGs. [44:54] RAG is important. [44:55] It's important to be out there with at least the basics. [44:58] It's a very common interview question, by the way. [45:00] If you go interview for a job, they might ask you to explain, in a nutshell, to a five-year-old, what a RAG is. [45:06] And hopefully after this, you'll be able to do it. [45:09] So we've seen some of the challenges with standalone LLMs. [45:14] Those challenges include the context window being small; the fact that it's hard to remember details within a large context window; knowledge gaps and cutoff dates, which you mentioned earlier. [45:28] The model might be trained up to a date, and then it cannot follow the trends or be up to date. [45:33] Hallucinations. [45:34] There are some fields-- think about medical diagnosis-- where hallucinations are very costly. [45:39] You can't afford a hallucination. [45:41] Even in education, imagine deploying a model for US youth education, and it hallucinates, and it teaches millions of people something completely wrong. [45:50] It's a problem. [45:52] And then, lack of sources. [45:54] A lot of fields love sources. [45:57] Research fields love sources. [45:59] Education loves sources. [46:01] Legal loves sources as well. [46:04] And the pre-trained LLM doesn't do a good job of sourcing. [46:08] And in fact, if you have tried to find sources with a plain LLM, it actually hallucinates a lot. [46:15] It makes up research papers. [46:16] It just lists completely fake stuff. [46:20] So how do we solve that with a RAG? [46:23] RAG integrates with external knowledge sources: databases, documents, APIs. [46:31] It ensures that answers are more accurate, up to date, and grounded, because you can actually update your documents. [46:38] Your drive is always up to date. [46:40] I mean, ideally, you're always pushing new documents to it. [46:43] And when you query, what is our Q4 performance in sales? [46:47] Hopefully the last board deck is in the drive, and it can read the last board deck. [46:54] And more developer control. [46:56] We'll see why RAGs allow for targeted customization without actually requiring the retraining of the model. [47:02] In fact, you don't touch the model with RAGs. [47:05] It's really a technique that is put on top of the model. [47:08] So to see an example of a RAG: this is a question answering application where we're in the medical field, and a user is asking a query, what are the side effects of drug X? [47:26] This is an important question. [47:27] You can't hallucinate. [47:28] You need to source. [47:29] You need to be up to date. [47:31] Maybe there is a new update to that drug that is now in the database, and you need to read that. [47:37] So a RAG is a great example of what you would want to use here. [47:41] The way it works is: you have your knowledge base of a bunch of documents. [47:46] What you do is you use an embedding model to embed those documents into lower-dimensional representations. [47:54] So for example, if the document is a PDF, a long PDF, you might read the PDF, understand it, and then embed it.
[48:03] We've seen plenty of embedding approaches together-- triplet loss, et cetera, you remember? [48:09] So imagine one of them here, for LLMs, embedding those documents into a lower-dimensional representation. [48:15] If the representation is too small, you will lose information. [48:19] If it's too big, you will add latency. [48:22] It's a tradeoff. [48:25] You will typically store those representations in a database called a vector database. [48:31] There are a lot of vector database providers out there. [48:38] I think I've listed a couple that are very common-- no, I haven't listed them, but I can share afterwards. [48:44] A vector database essentially stores those vectors in a very efficient manner, allowing fast retrieval with a certain distance metric. [48:52] So what you do is you also embed, usually with the same algorithm, the user prompt. [49:00] And you run a retrieval process, which is essentially saying: based on the embedding from the user query and the vector database, find the relevant documents based on the distance between those embeddings. [49:15] Once you've found the relevant documents, you pull them, and then you add them to the user query with a system prompt or a prompt template on top. [49:24] So the prompt template can be: answer the user query based on this list of documents. [49:32] If the answer is not in the documents, say I don't know. [49:36] That's your prompt template, where the user query is pasted, the documents are pasted, and then your output should be what you want, because it's now grounded in the documents. [49:47] You can also add to this prompt template: tell me the exact page, chapter, and line of the document that was relevant, and in fact, link it as well, just to be more precise. [50:02] Any questions on RAGs? [50:03] This is a simple, vanilla RAG. [50:07] Yes. [50:09] Do document embeddings still retain information [INAUDIBLE] [50:15] The question is: do the document embeddings still retain the location of the information within the document, especially in big documents? [50:24] Great question. [50:26] We'll get to it in a second. [50:27] Because you're right that the vanilla RAG might not do a good job with very large documents. [50:32] So let's say, when you open a medication box and you have this gigantic white paper with all the information, and it's very long, maybe a vanilla RAG would not cut it. [50:45] So what people have figured out is a bunch of techniques to improve RAGs. [50:49] And in fact, chunking is a great technique that is very popular. [50:53] So you might actually store in the vector database the embedding of the full document. [50:57] And on top of that, you will also store chapter-level vectors. [51:02] And when you retrieve, you will retrieve the document, and you retrieve the chapter. [51:06] And that allows you to be more precise with the sourcing. [51:09] It's one example. [51:11] Another technique that's popular is HyDE, hypothetical document embeddings, where a group of researchers published a paper showing that when you get your user query, one of the main problems is that the user query actually does not look like your documents. [51:32] For example, the user query might be, what are the side effects of drug X, when actually, in the vector database, the vectors represent very long documents. [51:43] So how do you guarantee that the query embedding is going to be close to the document embedding? [51:47] What they do is they use the user query to generate a fake, hallucinated document. [51:53] They embed that document, and then they compare it to the vectors in the vector database. [52:01] Does that make sense? [52:02] So for example, the user says, what are the side effects of drug X? [52:06] That query is given to another prompt that says: based on this user query, generate a five-page report answering it. [52:15] It generates a potentially completely fake answer. [52:20] You embed that, and it will likely be closer to the document that you're looking for. [52:28] It's one example of a RAG approach.
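Here is a minimal sketch pulling together the vanilla RAG flow and the HyDE variant just described. `embed`, `vector_search`, and `query_llm` are hypothetical helpers standing in for your embedding model, vector database client, and LLM API.

```python
RAG_TEMPLATE = """Answer the user query based on the documents below.
If the answer is not in the documents, say "I don't know."
Cite the exact document and chapter that you used.

Documents:
{documents}

User query: {query}"""

def rag_answer(query: str, use_hyde: bool = False) -> str:
    if use_hyde:
        # HyDE: generate a (possibly fake) document from the query and embed
        # that instead, so the query looks more like the stored documents.
        hypothetical = query_llm(
            "Based on this user query, generate a five-page report "
            "answering it:\n" + query
        )
        query_vector = embed(hypothetical)
    else:
        query_vector = embed(query)  # same embedding model as the documents

    # Retrieve the nearest chunks (e.g. chapter-level vectors) by distance.
    chunks = vector_search(query_vector, top_k=3)
    return query_llm(RAG_TEMPLATE.format(documents="\n\n".join(chunks), query=query))

# rag_answer("What are the side effects of drug X?", use_hyde=True)
```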
[52:31] Again, the purpose of this lecture is not to go through all three of these branches and explain every single method that has been discovered for RAGs. [52:38] But I just wanted to show you how much research has been done between 2020 and 2025 in RAGs, and how many branches of research you now have that you can learn from. [52:50] The survey paper is linked in the slides, by the way, and I'll share them after the lecture. [53:01] Super. [53:05] So we've made some progress. [53:08] Hopefully now, you feel that if you were to start an LLM application, you know how to do better prompts. [53:14] You know how to do chains. [53:15] You know how to do fine tuning. [53:17] You also know how to do retrieval. [53:19] And you have the baggage of techniques that you can go and read, and find the code base, pull the code, vibe code it. [53:24] But you have the breadth now. [53:30] The next set of topics we're going to see is around the question of: how could we extend the capabilities of LLMs from performing single tasks, enhanced with external knowledge, to handling multi-step, autonomous workflows? [53:47] And this is where we get into proper agentic AI. [53:53] So let's talk about agentic AI workflows, towards autonomous and specialized systems. [54:00] Then we'll talk about evals. [54:01] Then we'll see multi-agent systems. [54:03] And we'll end with a few thoughts on what's next in AI. [54:11] So Andrew Ng actually coined the term agentic AI workflows. [54:20] And his reason was that a lot of companies say agents-- agents, agents everywhere, agents everywhere. [54:28] If you go and work at these companies, you would notice that they mean very different things by agents. [54:33] Some people actually have a prompt, and they call it an agent. [54:36] Other people have a very complex multi-agent system, and they call it an agent. [54:42] And so calling everything an agent doesn't do it justice. [54:45] So Andrew says, let's call it agentic workflows. [54:49] Because in practice, it's a bunch of prompts with tools, with additional resources, API calls, that ultimately are put in a workflow, and you can call that workflow agentic. [55:02] So it's all about the multi-step process to complete a task.
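To make "a bunch of prompts with tools put in a workflow" concrete, here is a minimal sketch, previewing the refund example discussed next. `rag_answer` is the retrieval sketch from earlier; `query_llm`, `ask_user`, and `orders_api` are hypothetical helpers.

```python
# An agentic workflow: prompts plus tools (RAG lookup, user follow-up,
# API call) chained into one multi-step process.
def refund_workflow(user_message: str) -> str:
    # Step 1: retrieve the refund policy (RAG as a tool).
    policy = rag_answer("What is the refund policy?")
    # Step 2: follow up with the user for missing information.
    order_id = ask_user("Can you provide your order number?")
    # Step 3: call an API tool to check the order details.
    order = orders_api.get_order(order_id)
    # Step 4: let the LLM decide and draft the final confirmation.
    return query_llm(
        f"Refund policy: {policy}\n"
        f"Order details: {order}\n"
        f"User request: {user_message}\n"
        "Decide whether the order qualifies for a refund and reply to the user."
    )
```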
[55:11] Also, calling it an agentic workflow allows us to not mix it up with what I called an agent, in the last lecture, with reinforcement learning. [55:19] Because in RL, an agent has a very specific definition: it interacts with an environment, passes from one state to the next, has a reward and an observation. [55:26] You remember that chart, right? [55:32] So here's an example of how we move from a one-step prompt to a multi-step agentic workflow. [55:39] Let's say a user queries a product chatbot: what is your refund policy? [55:48] And the response, using a RAG, says refunds are available within 30 days of purchase, and maybe the RAG can even link to the policy document. [55:57] That's what we learned so far. [55:59] Instead, an agentic workflow can function like this. [56:04] The user says, can I get a refund for my order? [56:07] And the response via the agentic workflow is: the agent retrieves the refund policy using a RAG. [56:14] The agent then follows up with the user and says, can you provide your order number? [56:19] Then the agent queries an API to check the order details. [56:23] And finally, it comes back to the user and confirms: your order qualifies for a refund. [56:28] The amount will be processed in three to five business days. [56:31] This is much more thoughtful than the first version, which is sort of vanilla. [56:37] So that's what we're going to talk about in the next couple of slides: how do we get from the first one to the second one? [56:46] There are plenty of specialized agentic workflows out there. [56:50] You've heard of them, and if you hang out in SF, you probably see a bunch of billboards: AI software engineer; AI skills mentor, which you've interacted with in the class through Workera; AI SDR; AI lawyers; AI specialized cloud engineer. [57:08] It would be a stretch to say that everything works, but there's work being done towards that. [57:17] I'm not personally a fan of putting a face behind those things. [57:20] I think it's gimmicky. [57:21] And I think a few years from now, actually, very few products will have a human face behind them, but it might be a marketing tactic from some startups. [57:32] It's more scary than it is engaging, frankly. [57:35] OK. [57:36] I want to talk about the paradigm shift. [57:38] That's especially useful-- let's say you're a software engineer, or you're planning to be a software engineer-- because software engineering as a discipline is sort of shifting. [57:47] Or at least, the best engineers I've worked with are able to move from a deterministic mindset to a fuzzy mindset and balance between the two whenever they need to get something done. [57:58] So here's the paradigm shift between traditional software and agentic AI software. [58:04] The first one is the way you handle data. [58:07] Traditional software deals with structured data. [58:10] You have JSONs. [58:11] You have databases. [58:12] They're passed around in a very structured manner in a data engineering pipeline, and then they would be displayed on a certain interface. [58:21] The user might fill out a form that is then retrieved and written to the database. [58:25] All of that, historically, has been structured data.
[56:46] There are plenty of specialized agentic workflows online. [56:50] If you hang out in SF, [56:52] you've probably seen a bunch of billboards: AI software [56:55] engineer, AI skills mentor, which you've [56:57] interacted with in this class through Workera, [56:59] AI SDR, AI lawyer, AI specialized cloud engineer. [57:08] It would be a stretch to say that everything works, [57:10] but there's work being done towards all of that. [57:17] I'm not personally a fan of putting [57:19] a face behind those things. [57:20] I think it's gimmicky. [57:21] And I think a few years from now, actually, [57:24] very few products will have a human face on them, [57:27] but it might be a marketing tactic for some startups. [57:32] It's more scary than it is engaging, frankly. [57:35] OK. [57:36] I want to talk about the paradigm shift. [57:38] That's especially useful [57:40] if you're a software engineer [57:41] or you're planning to be a software engineer, [57:43] because software engineering as a discipline [57:45] is shifting. [57:47] Or at least the best engineers I've [57:49] worked with are able to move from a deterministic mindset [57:53] to a fuzzy mindset and balance between the two [57:57] whenever they need to get something done. [57:58] So here's the paradigm shift between traditional software [58:01] and agentic AI software. [58:04] The first one is the way you handle data. [58:07] Traditional software deals with structured data. [58:10] You have JSONs. [58:11] You have databases. [58:12] Data is passed around in a very structured manner [58:15] through a data engineering pipeline [58:17] and then displayed [58:19] on a certain interface. [58:21] The user might fill a form that is then stored [58:24] in the database. [58:25] All of that historically has been structured data. [58:28] Now, more and more companies are handling free-form text and images, [58:34] and all of that requires dynamic interpretation to transform [58:39] an input into an output. [58:41] The software itself used to be deterministic. [58:45] Now you have a lot of software that is fuzzy. [58:47] And fuzzy software creates so many issues. [58:51] I mean, imagine if you let your users ask anything [58:54] on your website. [58:56] The chances that it breaks are tremendous. [58:58] The chances that you're attacked are tremendous. [59:00] It's really, really complicated. [59:03] It's more complicated than people make it seem on Twitter. [59:07] Fuzzy engineering is truly hard. [59:09] You might get hate as a company because one user did something [59:14] that you authorized them to do that ended up breaking [59:16] the database. [59:18] We've seen that with many companies [59:19] in the last couple of years. [59:21] So it takes a very specialized engineering mindset [59:23] to do fuzzy engineering, but also to [59:25] know when you need to be deterministic. [59:29] The other thing I'd call out is that with agentic AI software, [59:33] you want to think about your software as a manager would. [59:39] So you're familiar with the monolith and microservices [59:44] approaches in software, where you structure your software [59:48] in different boxes that can talk to each other, [59:51] and it allows teams to debug one section at a time. [59:55] Now, the equivalent with agentic AI is that you think as a manager. [59:59] So you think, OK, if I were to delegate my product [01:00:02] to be done by a group of humans, what would those roles be? [01:00:06] Would I have a graphic designer that puts together a chart [01:00:09] and then sends it to a marketing manager that converts it [01:00:12] into a nice blog post, that then gives it to the performance [01:00:15] marketing expert, that then publishes the blog [01:00:18] post and optimizes and A/B tests it, [01:00:20] and then hands off to a data scientist that analyzes the data, [01:00:23] forms hypotheses, and validates [01:00:25] or invalidates them? [01:00:27] That's how you would typically think if you're building [01:00:29] agentic AI software. [01:00:32] The equivalent in traditional software [01:00:35] might be completely different. [01:00:37] It might be: we have a data engineering box [01:00:39] right here that handles all our data engineering. [01:00:42] And then here, we have the UI/UX stuff. [01:00:45] Everything UI/UX related goes here. [01:00:47] And companies might structure it in very different ways. [01:00:51] And here is the business logic that we care about, [01:00:53] and there are five engineers working on the business logic, [01:00:56] let's say. [01:00:59] OK. [01:01:01] Testing and debugging are also very different, [01:01:04] and we'll talk about that in the next section. [01:01:09] The other thing that I feel matters [01:01:13] is that with AI in engineering, the cost of experimentation [01:01:17] is going down drastically. [01:01:19] And so people, I feel, should be more comfortable [01:01:22] throwing away code. [01:01:23] In traditional software engineering, [01:01:27] you probably don't throw away code a ton. [01:01:29] You build the code, and it's solid, and it's bulletproof, [01:01:32] and then you update it over time.
[01:01:35] We've seen AI companies be more comfortable throwing away [01:01:39] code, which has advantages in terms of the speed at which you [01:01:43] move but also disadvantages in terms [01:01:46] of the quality of your software, which can break more. [01:01:52] So anyway, I just wanted to give an update on the paradigm shift [01:01:56] from deterministic to fuzzy engineering. [01:02:04] Oh, and actually, I can give you an example from Workera [01:02:08] that we learned over the last 12 [01:02:11] months. If you've used Workera, [01:02:13] you might have seen that the interface sometimes asks you [01:02:18] multiple-choice questions. [01:02:19] And sometimes, it asks you multiple-select, [01:02:21] and sometimes drag and drop, ordering, matching, [01:02:24] whatever. [01:02:25] Those are examples of deterministic item types, [01:02:28] meaning you answer the question on a multiple choice, [01:02:31] there is one correct answer, [01:02:32] and it's fully deterministic. [01:02:34] On the other hand, you sometimes have voice questions, [01:02:38] where you go through a role play, or you [01:02:40] have voice-plus-coding questions, [01:02:42] where your code is being read by the interface, or whatever. [01:02:45] Those are fuzzy, meaning the scoring algorithm [01:02:49] might actually make mistakes, and those mistakes [01:02:52] might be costly. [01:02:53] And so companies have to figure out [01:02:56] a human-in-the-loop system, which [01:02:58] you might have seen with the appeal feature at the end. [01:03:00] So at the end of the assessment, you have an appeal feature that [01:03:03] allows you to say, I want to appeal [01:03:06] because I want to challenge what the agent said about my answer, [01:03:09] because I thought I did better than what the agent thought. [01:03:12] And then you bring in the human in the loop, who [01:03:14] can fix the agent, can tell the agent, actually, [01:03:16] you were too harsh on this person's answer. [01:03:20] And that's an example of a fuzzy engineered system [01:03:24] that adds a human in the loop to make it more aligned. [01:03:28] And so if you're building a company, [01:03:29] I would encourage you to think about, what can I [01:03:32] get done with determinism? [01:03:33] Let's get that done. [01:03:35] And then for the fuzzy stuff: I want to do fuzzy [01:03:38] because it allows more interaction, [01:03:39] more back and forth, but I need [01:03:42] to put guardrails around it. [01:03:43] And how am I going to design those guardrails? [01:03:45] Pretty much. [01:03:46] OK?
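Here's a minimal sketch of that pattern: a fuzzy grader wrapped in a guardrail, with low-confidence scores and appeals routed to a human. The threshold and the `fuzzy_grader` callable (returning a score and a confidence) are illustrative assumptions, not Workera's actual design:

```python
def grade_with_guardrail(answer, fuzzy_grader, threshold=0.9):
    # fuzzy_grader is assumed to return (score, confidence in [0, 1]).
    score, confidence = fuzzy_grader(answer)
    if confidence >= threshold:
        return {"score": score, "status": "auto"}             # deterministic enough
    return {"score": score, "status": "needs_human_review"}   # route to a human

def handle_appeal(result, human_score):
    # The human in the loop overrides the agent; keeping both scores lets you
    # recalibrate the grader on exactly the cases where it was too harsh.
    return {**result, "agent_score": result["score"],
            "score": human_score, "status": "human_reviewed"}
```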
[01:03:49] Here's another example, from enterprise workflows, [01:03:54] which are likely to change due to agentic AI. [01:03:57] This is a paper from McKinsey, I believe from last year, [01:04:01] where they looked at a financial institution and said, [01:04:05] we observed that they often spend one to four weeks [01:04:07] creating a credit risk memo. [01:04:10] And here's the process. [01:04:11] A relationship manager gathers data from [01:04:16] more than 15 sources on the borrower, [01:04:19] the loan type, and other factors. [01:04:22] Then the relationship manager and the credit analyst [01:04:25] collaboratively analyze the data from these sources. [01:04:28] Then the credit analyst typically spends 20 hours [01:04:33] or more writing a memo and goes back [01:04:36] to the relationship manager. [01:04:37] They give feedback, and then they go through this loop [01:04:40] again and again. [01:04:41] And it takes a long time to get a credit memo out. [01:04:46] Then they ran a research study where they changed the process. [01:04:50] They said gen AI agents could actually cut the time spent [01:04:56] on credit risk memos by 20% to 60%. [01:04:58] And the process changed: the relationship manager [01:05:01] works directly with the gen AI agent system and [01:05:03] provides the relevant materials it needs to produce the memo. [01:05:07] The agent subdivides the project into tasks [01:05:10] that are assigned to specialist agents, [01:05:12] gathers and analyzes the data from multiple sources, [01:05:15] and drafts a memo. [01:05:16] Then the relationship manager and the credit analyst [01:05:19] sit down together, review the memo, [01:05:20] and give feedback to the agent. [01:05:22] And they're done in 20% to 60% less time. [01:05:26] And so this is an example where you're actually not changing [01:05:30] the human stakeholders. [01:05:31] You're just changing the process and adding [01:05:33] gen AI to reduce the time it takes to get a credit memo out. [01:05:38] Now, imagine you're an enterprise [01:05:42] with 100,000 employees, and there are a lot of enterprises [01:05:47] with 100,000 employees out there. [01:05:50] You are currently under real pressure [01:05:52] to redesign your workflows. [01:05:55] It turns out that if you actually [01:05:57] pull the job descriptions from the HR system [01:06:00] and interpret them, and you also pull [01:06:02] the business process workflows that you [01:06:04] have encoded in your drive, [01:06:07] you can actually find gains in multiple places. [01:06:10] And in the next few years, you're [01:06:12] probably going to see workflows being [01:06:14] optimized to add gen AI. [01:06:17] Even if that happens, the hardest part is changing people. [01:06:20] We know this is great in theory, but now [01:06:23] let's try to roll out that second workflow to 10,000 credit [01:06:28] risk analysts and relationship managers. [01:06:31] My guess is it will take years. [01:06:33] It will take 10, 20 years for this to actually be done [01:06:37] at scale within an organization. [01:06:40] Because change is so hard. [01:06:42] It's so hard to rewire businesses, workflows, job descriptions, [01:06:47] to incentivize people to do things differently and be different, [01:06:50] and to train them. [01:06:50] So this is what the world is going towards, [01:06:55] but it's going to take a long time, I think. [01:06:59] OK. [01:07:00] Now I want to talk about how an agent actually works [01:07:02] and what the core components of an agent are. [01:07:07] Imagine a travel booking agent. That's [01:07:10] an easy example you've all thought about. [01:07:12] I still haven't been able to get an agent to book a trip for me, [01:07:16] or rather I was scared it was going to book [01:07:18] a very expensive or long trip. [01:07:20] But in theory, you can have a travel booking [01:07:24] agent that has prompts. [01:07:26] So, the prompts we've seen; we know the methods [01:07:28] to optimize those prompts. [01:07:30] That travel agent also has a context management system, [01:07:34] which is essentially the memory of what it knows about the user. [01:07:38] That context management system might [01:07:40] include a core memory, or working memory, and an archival memory, [01:07:45] OK? [01:07:46] The difference within memory [01:07:51] is that not every memory needs to be fast to access. [01:07:54] Think about it.
[01:07:56] You're onboarded on a product, and the first question is, hi, [01:07:59] what's your name? [01:08:00] And I say, my name is Kian. [01:08:02] That's probably going to sit in the working memory, [01:08:05] because every time the agent talks to me, [01:08:07] it's going to want to use my name. [01:08:08] But then maybe the second question [01:08:10] is, what's your birthday? [01:08:12] And I give it my birthday. [01:08:13] Does it need my birthday every day? [01:08:15] Probably not. [01:08:16] So it's probably going to park it in the long-term [01:08:18] memory, or the archival memory. [01:08:20] And those memories are slower to access. [01:08:24] They're farther down the stack. [01:08:26] And that structure allows the agent [01:08:28] to determine what goes in the working memory [01:08:30] and what goes in the long-term memory. [01:08:33] And that makes it easier for the agent to retrieve super fast. [01:08:36] Because think about it. [01:08:37] When you interact with ChatGPT, you [01:08:39] feel that it's very personal at times. [01:08:41] You feel like it understands you. [01:08:43] Imagine if every time you called it, it had to read all the memories. [01:08:47] That can be costly. [01:08:48] It's a very burdensome cost because it happens [01:08:52] every time you talk to it. [01:08:54] So you want to be highly optimized with the working [01:08:57] memory. [01:08:59] If it takes three seconds to look [01:09:00] in the memory, then every time you talk to your LLM, [01:09:03] it's going to take three seconds, which you don't want. [01:09:06] Anyway. [01:09:06] And then you have the tools. [01:09:08] The tools can include APIs like a flight search [01:09:11] API, a hotel booking API, a car rental API, a weather API, [01:09:15] and a payment processing API. [01:09:18] And typically, you would want to tell your agent [01:09:21] how each API works. [01:09:23] It turns out that agents, or LLMs, I should say, [01:09:27] are very good at reading API documentation. [01:09:29] So you give it the API documentation, [01:09:31] and it reads the JSON, it learns [01:09:33] what a GET request looks like [01:09:35] and the format it needs to push. [01:09:38] And then it pushes in that format, let's say, [01:09:41] and it retrieves something. [01:09:45] Does that make sense, those different components? [01:09:49] Anthropic also talks about resources. [01:09:51] Resources are data sitting somewhere that you [01:09:55] might let your agent read. [01:09:57] For example, if you're building your startup, you have a CRM. [01:10:00] A CRM has data in it, and you want to do lookups in that data. [01:10:05] You would probably give the agent a lookup tool [01:10:07] and access to the resource, [01:10:10] and it will do lookups whenever needed, super fast.
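Going back to the memory tiers for a second, here's a minimal sketch of the idea: a small working memory injected into every prompt, and a slower archival store queried only on demand. The hot/cold policy here is an illustrative assumption, not a production design:

```python
class AgentMemory:
    def __init__(self, working_slots=5):
        self.working = {}   # hot facts, prepended to every model call
        self.archive = {}   # cold facts, fetched only when the agent asks
        self.working_slots = working_slots

    def remember(self, key, value, hot=False):
        if hot and len(self.working) < self.working_slots:
            self.working[key] = value   # e.g. the user's name
        else:
            self.archive[key] = value   # e.g. the user's birthday

    def prompt_context(self):
        # Cheap path: this string rides along with every single prompt.
        return "\n".join(f"{k}: {v}" for k, v in self.working.items())

    def recall(self, key):
        # Slow path: explicit lookup, used only when a fact is actually needed.
        return self.working.get(key, self.archive.get(key))

mem = AgentMemory()
mem.remember("name", "Kian", hot=True)         # working memory
mem.remember("birthday", "June 1", hot=False)  # archival memory
```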
[01:10:16] This type of architecture can be built [01:10:19] with different degrees of autonomy, [01:10:21] from the least autonomous to the most autonomous, [01:10:23] and I'll give you a few examples. [01:10:26] Least autonomous would be: you've hard-coded the steps. [01:10:29] So let's say I tell the travel agent, first, identify the intent. [01:10:35] Then look up in the database the history [01:10:39] this customer has with us and their preferences. [01:10:42] Then go to the flight API, and so on. [01:10:45] I would hard-code the steps. [01:10:47] OK. [01:10:48] That's the least autonomous. [01:10:50] Semi-autonomous is: I might hard-code the tools, [01:10:54] but I'm not going to hard-code the steps. [01:10:57] So I tell the agent, you act like a travel agent, [01:11:02] and your task is to help the person book a trip. [01:11:10] And these are the tools you have access to. [01:11:13] So I'm not hard-coding the steps, [01:11:14] just the tools that the agent has access [01:11:17] to. [01:11:18] The most autonomous is: the agent decides the steps [01:11:22] and can create its own tools. [01:11:24] That's where you might actually give the agent access [01:11:26] to a code editor. [01:11:28] And the agent might be able to ping any API on the web, [01:11:33] perform some web search. [01:11:34] It might even be able to write some code [01:11:37] to display data to the user. [01:11:39] It might even be able to perform some calculations. [01:11:42] Like, oh, I'm going to calculate the fastest route [01:11:44] to get from San Francisco to New York [01:11:48] and which one might be the most appropriate [01:11:50] for what the user is looking for. [01:11:52] And then I want to calculate the distance between the airport [01:11:54] and this hotel versus that hotel. [01:11:56] And I'm going to write code to do that. [01:11:58] So it's fully autonomous [01:12:00] from that perspective. [01:12:05] So, yeah. [01:12:07] Remember those keywords: [01:12:08] memory, prompts, tools, et cetera. [01:12:14] Now, I presented the flight API, but it does not [01:12:18] have to be an API. [01:12:19] You've probably heard the term MCP, or Model Context Protocol, [01:12:23] which was coined by Anthropic. [01:12:25] I pasted the seminal article on MCP at the bottom of this slide. [01:12:29] But let me explain in a nutshell how those things differ. [01:12:34] In the API case, you would actually [01:12:39] teach your LLM to ping an API. [01:12:42] So you would say, this is how you ping this API, [01:12:45] and this is the data that it will send you back. [01:12:48] And you would have to do that in a one-off manner. [01:12:51] So you would have to provide [01:12:53] the API documentation for your flight API, [01:12:56] your hotel booking API, your car rental API. [01:13:00] And then you would give tools for your model [01:13:03] to communicate with those APIs. [01:13:06] It doesn't scale very well. MCP does. [01:13:11] MCP is really about putting a system in the middle that [01:13:19] makes it simpler for your LLM to communicate [01:13:22] with that endpoint. [01:13:23] So for instance, you might have an MCP server and an MCP client, [01:13:28] where you're trying to communicate [01:13:30] with that travel database or the flight API over MCP. [01:13:35] And your agent might just communicate with it [01:13:38] and say, hey, what do you need in order to give me flight [01:13:42] information? [01:13:43] And the server will respond, I would like you to tell me [01:13:47] the origin, the destination, [01:13:49] and what you're looking for at a high level. [01:13:51] These are my requirements. [01:13:52] OK, [01:13:52] let me get back to you with those requirements. [01:13:55] Oh, [01:13:55] you forgot to tell me your budget, whatever. [01:13:57] Oh, [01:13:58] let me give you my budget, et cetera. [01:14:00] And it's agent-to-agent communication, [01:14:04] which allows more scalability. [01:14:06] You don't need to hard-code everything.
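Here's a toy illustration of that back-and-forth. To be clear, this is not the actual MCP protocol; it only mimics the idea that the client discovers what the server needs, attempts the call, and fills in whatever is missing. All class and field names are made up for the example:

```python
class FlightMCPServer:
    required = ["origin", "destination", "budget"]

    def describe(self):
        # The server advertises its capability and what it needs.
        return {"capability": "flight_search", "required": self.required}

    def call(self, params):
        missing = [f for f in self.required if f not in params]
        if missing:
            return {"error": f"missing fields: {missing}"}  # server pushes back
        return {"flights": [f"{params['origin']} -> {params['destination']}"]}

def agent_side(server, known_facts):
    spec = server.describe()                          # 1. discover requirements
    params = {f: known_facts[f] for f in spec["required"] if f in known_facts}
    result = server.call(params)                      # 2. attempt the call
    if "error" in result:                             # 3. go back for the gaps
        for field in spec["required"]:
            if field not in params:
                params[field] = input(f"What is your {field}? ")
        result = server.call(params)
    return result

print(agent_side(FlightMCPServer(), {"origin": "SFO", "destination": "CDG"}))
```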
[01:14:09] Companies have published their MCP servers out there, [01:14:11] and your agent can communicate with them [01:14:14] and figure out how to get the data it needs. [01:14:16] Does that make sense? [01:14:18] Yeah. [01:14:21] [INAUDIBLE] rewriting any [INAUDIBLE] [01:14:36] I think it is, ultimately. [01:14:39] The question is, isn't it just shifting the issue? [01:14:41] Because if an API has to be updated, [01:14:43] the MCP server has to be updated too, is what you're saying, right? [01:14:45] Yes, that's correct. [01:14:46] But at least it allows the agent to go back and forth [01:14:51] and figure out what the requirements are. [01:14:52] But at the end of the day, ideally, if you're a startup, [01:14:56] you have some documentation, [01:14:57] and automatically, you have an agent or an LLM workflow [01:15:00] that reads that documentation and updates the code [01:15:03] accordingly. [01:15:04] But I agree. [01:15:05] It's not something that is fully autonomous. [01:15:08] Yeah. [01:15:09] I've seen some security issues. [01:15:12] Why is that possible? [01:15:14] Which security issues specifically? [01:15:16] [INAUDIBLE] [01:15:18] Yeah. [01:15:19] So, are there security issues with MCPs? [01:15:23] Think about it this way. [01:15:25] MCPs, depending on the data that you get access to, [01:15:28] might have different requirements, lower stakes [01:15:30] or higher stakes. [01:15:31] I'm not an expert on the full range. [01:15:34] But it wouldn't surprise me that when you expose an MCP-- [01:15:42] I think a lot of MCPs have authentication. [01:15:45] So you might actually need a code or a key [01:15:47] to talk to it, just like you would with an API. [01:15:52] Yeah, that's a good question. [01:15:53] I'm not an expert on the security of these systems, [01:15:56] but we can look into it. [01:16:02] Any other questions on what we've [01:16:04] seen with the agentic workflows, APIs, tools, MCPs, memory? [01:16:10] All of that is work in progress. [01:16:11] Even memory is not a solved problem by any means. [01:16:14] It's pretty hard, actually. [01:16:16] Yes. [01:16:18] You don't need an [INAUDIBLE] The MCP just [01:16:24] makes it easier to access the API, but technically, [01:16:28] [INAUDIBLE] [01:16:40] Exactly, exactly. [01:16:42] Is MCP about efficiency or about accessing more data? [01:16:45] It's about efficiency. [01:16:47] Let's say you have a coding agent, and it has an MCP client, [01:16:53] and there are multiple MCP servers exposed out there. [01:16:57] That agent can communicate very efficiently with them [01:17:00] and find what it needs. [01:17:03] And it's a more efficient process [01:17:05] than documenting the APIs on each side, [01:17:09] how to ping them, and what the protocol is. [01:17:12] But it's not about the data that is [01:17:13] being exposed, because ultimately, you control [01:17:15] the data that is being exposed. [01:17:19] Depending on how the MCP is built, [01:17:22] my guess is you probably expose yourself to other risks, [01:17:24] because your MCP server can see pretty much any input [01:17:31] from another LLM. [01:17:32] And so it has to be robust. [01:17:36] But yeah. [01:17:37] Super. [01:17:39] So let's look at an example of a step-by-step [01:17:41] workflow for the travel agent. [01:17:45] Let's say the user says, I want to plan a trip to Paris [01:17:50] from December 15th to 20th, with flights, [01:17:56] hotels near the Eiffel Tower, and an itinerary of must-visit [01:18:00] places.
[01:18:01] That's the task for the travel agent. [01:18:04] Step two, the agent plans the steps. [01:18:06] So it says, I'm going to find flights. [01:18:08] Use the flight search API to get options for December 15th. [01:18:12] Search hotels, generate recommendations for places [01:18:15] to visit, validate preferences, budget, et cetera. [01:18:20] Book the trip with the payment processing API. [01:18:24] That's just the planning, by the way. [01:18:25] Step three, execute the plan: use your tools [01:18:28] and combine the results. Then comes proactive [01:18:31] user interaction and booking. [01:18:33] It might make a first proposal to the user, [01:18:35] ask the user to validate or invalidate it, [01:18:38] and then repeat that planning and execution process. [01:18:42] And then finally, it might actually update its memory. [01:18:46] It might say, oh, I just learned through this interaction [01:18:49] that the user only likes direct flights. [01:18:51] Next time, I'll only offer direct flights. [01:18:55] Or, I noticed the user is fine with three-star or four-star [01:19:01] hotels, [01:19:01] and in fact, they don't want to go over budget, or something [01:19:05] like that. [01:19:08] So hopefully that makes sense by now, how you might do that. [01:19:11] My question for you is, how would you know if this works? [01:19:16] And if you had such a system running in production, how [01:19:19] would you improve it? [01:19:28] Yeah. [01:19:28] Let users rate their experience. [01:19:31] So that's one example. [01:19:33] Let users rate their experience at the end. [01:19:37] That would be an end-to-end test, right? [01:19:39] You're looking at the user experience through the steps [01:19:42] and asking, how good was it from 1 to 5, let's say. [01:19:46] Yeah. [01:19:46] It's a good way. [01:19:47] And then if you learn that a user says 1, [01:19:50] how do you improve the workflow? [01:19:56] [INAUDIBLE] [01:19:59] OK. [01:19:59] So you would go down a tree and say, OK, you said 1. [01:20:04] What was your issue? [01:20:06] And then the user says the prices were too high, let's say. [01:20:10] And then you would go back and fix that specific tool or prompt. [01:20:14] OK. [01:20:15] Any other ideas? [01:20:18] [INAUDIBLE] [01:20:29] Yeah, good. [01:20:29] So that's a good insight. [01:20:30] Separate the LLM-related stuff from the non-LLM-related stuff, [01:20:34] the deterministic stuff. [01:20:35] The deterministic stuff, you might [01:20:36] be able to fix more objectively, essentially. [01:20:41] Yeah. [01:20:43] What else? [01:20:56] So give me an example of an objective issue [01:21:00] that you can notice and how you would fix it, [01:21:03] versus a subjective issue. [01:21:06] Yeah. [01:21:06] [INAUDIBLE] [01:21:16] So let's say it's the same flight, [01:21:19] but one option is cheaper than the other. [01:21:21] Picking the expensive one is objectively worse, [01:21:23] and so you can capture that almost automatically. [01:21:25] Yeah. [01:21:26] So you could actually build evals [01:21:27] that are objective and tracked across your users. [01:21:32] And you might run an analysis afterwards [01:21:34] and see that, for the objective stuff, [01:21:37] our agentic workflow is bad with pricing. [01:21:43] It just doesn't read prices well, because it always [01:21:46] gives the more expensive option. [01:21:48] Yeah. [01:21:48] You're perfectly right. [01:21:49] How about the subjective stuff?
[01:21:59] Do you choose a direct or indirect flight [01:22:01] if the indirect one is a little bit cheaper? [01:22:05] Yeah. [01:22:05] Good one. [01:22:06] Do you choose a direct flight or an indirect flight [01:22:09] if the indirect one is cheaper but the direct one is more comfortable? [01:22:12] Yeah. [01:22:13] That's a good one, actually. [01:22:16] So how would you capture that information? [01:22:18] Let's say this is used by thousands of users. [01:22:24] Could you feed something in [INAUDIBLE] [01:22:28] Could you feed something in? [01:22:30] Yeah-- [01:22:32] could you feed something in about the user preferences? [01:22:36] Well, you could build a data set that [01:22:39] has some of that information. [01:22:40] So you build 10 prompts where the user is asking specifically [01:22:44] for a direct flight, [01:22:46] saying, I prefer direct flights because I [01:22:48] care about my time, let's say. [01:22:50] Then you look at the output, you [01:22:53] provide an example of a good output, [01:22:56] and you're probably able to capture [01:22:58] the performance of your agentic workflow on this specific eval. [01:23:04] Does it prioritize correctly? [01:23:05] Is it price-conscious, essentially, [01:23:08] and comfort-conscious? [01:23:10] Yeah. [01:23:13] What about the tone? [01:23:14] Let's say the LLM right now is not very friendly. [01:23:18] How would you notice that, and how would you fix it? [01:23:26] Yeah. [01:23:26] Have a test user run the prompt [01:23:29] and see if there's something wrong with it. [01:23:33] OK. [01:23:33] Have a test user run the prompt and see if there's [01:23:36] something wrong with it. [01:23:37] Tell me about the last step. [01:23:38] How would you notice that something is wrong? [01:23:40] So a couple of tests [INAUDIBLE] evaluates [01:23:48] the response and [INAUDIBLE] [01:23:51] Yeah. [01:23:52] I agree with your approach. [01:23:53] Have LLM judges that evaluate the response [01:23:55] against a certain rubric of what politeness looks like. [01:23:58] So here, in this case, you could actually [01:24:00] start with error analysis. [01:24:02] You have 1,000 users. [01:24:05] You can pull up 20 user interactions [01:24:07] and read through them. [01:24:09] And you might notice, at first sight, that [01:24:11] the LLM seems to be very rude. [01:24:14] It's just super, super short in its answers, [01:24:18] and it's not very helpful. [01:24:20] You notice that with your manual error analysis. [01:24:23] Then you go to the next stage. [01:24:24] You actually put evals behind it. [01:24:26] You say, I'm going to create a set of LLM judges [01:24:33] that are going to look at the user interaction [01:24:35] and rate how polite it is, [01:24:38] and I'm going to give them a rubric. [01:24:40] Then what I'm going to do is swap my LLM. [01:24:42] Instead of using GPT-4, I'm going to use Grok. [01:24:45] And instead of Grok, Llama. [01:24:48] And then I'm going to run those three LLMs side by side, [01:24:51] give the outputs to my LLM judges, and get my subjective score [01:24:56] at the end that says, oh, model X was more polite on average. [01:25:02] Yeah. [01:25:02] Perfectly right. [01:25:03] That's an example of an eval that is very specific [01:25:05] and allows you to choose between LLMs. [01:25:07] You could actually do the same eval not across LLMs, [01:25:10] but fixing the LLM and changing the prompt. [01:25:12] Instead of saying, act like a travel agent, [01:25:15] you say, act like a helpful travel agent. [01:25:17] And then you see the influence of that word on your eval [01:25:21] with the LLMs as judges. [01:25:22] Does that make sense? [01:25:24] OK. [01:25:25] Super.
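Here's a minimal sketch of that LLM-as-judge setup. The `llm(model, prompt)` wrapper, the model names, and the rubric are illustrative assumptions; a real harness would also handle judges that don't return a clean integer:

```python
RUBRIC = ("Rate the following assistant reply for politeness from 1 (rude) "
          "to 5 (very polite), considering greeting, tone, and helpfulness. "
          "Answer with a single integer.")

def judge_politeness(llm, reply, judge_model="judge-model"):
    # Assumes the judge complies with the single-integer output format.
    verdict = llm(judge_model, f"{RUBRIC}\n\nReply:\n{reply}")
    return int(verdict.strip())

def compare_models(llm, candidate_models, eval_prompts):
    # Same prompts, different candidate LLMs, same judge and rubric.
    return {
        m: sum(judge_politeness(llm, llm(m, p)) for p in eval_prompts)
           / len(eval_prompts)
        for m in candidate_models
    }
```

The same idea works for comparing prompts instead of models: fix the model and vary the system instruction across the candidates.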
[01:25:26] So let's move forward and do a case study on evals, [01:25:29] and then we're almost done for today. [01:25:33] Let's say your product manager asks you to build an AI [01:25:38] agent for customer support, OK? [01:25:41] Where do you start? [01:25:42] And here is an example of the user prompt: [01:25:45] I need to change my shipping address for order blah, blah, [01:25:48] blah. [01:25:48] I moved to a new address. [01:25:51] So where do you start if I give you that project? [01:26:04] Yes. [01:26:05] We search online for existing models and [INAUDIBLE] [01:26:16] So, do some research. [01:26:17] See benchmarks and how different models [01:26:20] perform at customer support, [01:26:22] and then pick a model. [01:26:23] That's what you mean. [01:26:24] Yeah. [01:26:24] It's true, you could do that. [01:26:25] What else could you do? [01:26:28] Yeah. [01:26:28] [INAUDIBLE] [01:26:34] OK. [01:26:34] Yeah, I like that. [01:26:35] Try to decompose the different tasks it will need to do [01:26:39] and try to guess which ones will be more of a struggle, which [01:26:42] ones should be fuzzy, which ones should be deterministic. [01:26:45] Yeah, you're right. [01:26:46] [INAUDIBLE] [01:26:55] Yeah. [01:26:56] Similar to what you said. [01:26:58] That's what I would recommend as well. [01:27:00] You say, I would sit down with a customer support [01:27:02] agent for a day or two, and I would decompose the tasks [01:27:04] they're going through. [01:27:05] I would ask them, where do they struggle? [01:27:07] How much time does it take? [01:27:08] Yes. [01:27:09] That's usually where you want to start: task decomposition. [01:27:12] So let's say we've done that work, and we have this list. [01:27:16] I'm simplifying. [01:27:17] But the human customer support agent typically [01:27:20] would extract key info, then look up [01:27:23] in the database to retrieve the customer record, [01:27:25] then check the policy: [01:27:27] are we allowed to update the address, [01:27:29] or is it a fixed data point? [01:27:32] And then they draft a response email and send it. [01:27:35] So we've decomposed the task. [01:27:39] Once you've decomposed that task, [01:27:42] how do you design your agentic workflow? [01:28:03] Yes. [01:28:04] [INAUDIBLE] [01:28:17] Exactly. [01:28:18] So to repeat, you're going to look [01:28:20] at the decomposition of tasks, get an instinct for what's fuzzy [01:28:24] and what's deterministic, and then determine [01:28:28] which step is going to be a one-shot LLM call, which one will require [01:28:33] maybe a RAG, which one will require a tool, which one will [01:28:36] require memory, and so on. [01:28:38] So you will start designing that map. [01:28:41] Completely right. [01:28:41] That's also what I would recommend. [01:28:43] You might actually draft it and say, OK, I take the user prompt. [01:28:48] And the first step of my task decomposition [01:28:52] was extract information; that seems to be a vanilla LLM call. [01:28:57] You can guess that a vanilla LLM would probably [01:29:00] be good enough at extracting that the user wants [01:29:03] to change their address, this is the order number, [01:29:05] and this is the new address.
[01:29:06] You probably don't need too much technology [01:29:08] there other than the LLM. [01:29:11] For the next step, it feels like you need a tool, because you're [01:29:14] actually going to have to look up the record in the database [01:29:17] and also update the address. [01:29:21] So that might be a tool, and you might [01:29:23] have to build a custom tool for the LLM, [01:29:25] to say, let me connect you to that database, [01:29:27] or let me give you access to that resource with an MCP. [01:29:32] After that, you probably need an LLM again to draft the email, [01:29:35] and you would paste in the confirmation, [01:29:38] the confirmation that the address [01:29:40] has been updated from x to y, [01:29:42] and then the LLM will draft an answer. [01:29:44] And of course, not to forget, [01:29:46] you might need a tool to send the email. [01:29:49] You might actually need to post something somewhere [01:29:54] for the email to go out. [01:29:57] And then you'll get the output. [01:29:59] Does that make sense? So, exactly what you described. [01:30:02] Now, moving to the next step. [01:30:03] We've decomposed our tasks, [01:30:06] and we've designed an agentic workflow around them. [01:30:09] It took us five minutes. [01:30:10] In practice, it would take you more [01:30:12] if you're building your startup on this. [01:30:13] You want to make sure your task decomposition is accurate, [01:30:15] your design is accurate, and then [01:30:17] there's a lot of work to be done on every tool [01:30:20] to optimize it for latency and cost. [01:30:22] But let's say now we want to know if it works. [01:30:27] And I'm going to assume that you have LLM traces. [01:30:30] LLM traces are very important. [01:30:33] Actually, if you're interviewing with an AI startup, [01:30:36] I would recommend that in the interview process you ask them, [01:30:39] do you have LLM traces? [01:30:40] Because if they don't have LLM traces, [01:30:42] it is pretty hard to debug an LLM system, because you don't [01:30:46] have visibility into the chain of complex prompts that were called [01:30:50] and where the bug is. [01:30:52] And so it's a basic part of an AI startup's [01:30:57] stack to have LLM traces. [01:31:00] So let's assume you have traces. [01:31:02] How would you know if your system works? [01:31:04] I'm going to summarize some of the things I heard earlier. [01:31:11] You gave an example of an end-to-end metric: [01:31:15] you look at user satisfaction at the end. [01:31:18] You can also do a component-based approach, [01:31:21] where you look at the tool, the database updates, [01:31:25] and you manually do an error analysis and see, [01:31:28] oh, the tool actually always forgets to update the email. [01:31:32] It just fails at writing. [01:31:33] And I'm going to fix that. [01:31:34] This is pretty much deterministic. [01:31:37] Or, when it tries to send the email [01:31:40] and pings the system that is supposed to send it, [01:31:44] it doesn't send it in the right format, [01:31:46] and so it fails at that point. [01:31:48] Again, you could fix that. [01:31:51] Or the draft of the email: the LLM doesn't do a great job; [01:31:53] it's not very polite at drafting the email. [01:31:56] So you could look at it component by component, [01:31:59] and that's actually easier to debug than looking at it [01:32:01] end to end. [01:32:02] You would probably do a mix of both. [01:32:05] Another way to look at it is, what is objective [01:32:08] versus what is subjective? [01:32:10] So for example, an objective failure [01:32:12] would be: the LLM extracted the wrong order ID. [01:32:18] The user said, my order ID is X, and the LLM, [01:32:21] when it actually did the lookup in the database, [01:32:24] used the wrong order ID. [01:32:26] This is objectively wrong. [01:32:27] You can actually write Python code [01:32:29] that checks that, that checks the alignment between what [01:32:32] the user mentioned and what was actually passed to the database [01:32:36] for the lookup.
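For instance, here's a minimal sketch of that check, assuming each trace records the raw user message and the arguments the agent passed to the lookup tool. The trace fields and the order-ID pattern are illustrative assumptions:

```python
import re

def order_id_consistent(trace):
    # Pull the order ID the user stated (the format here is made up).
    stated = re.search(r"order\s*#?\s*([A-Z0-9-]+)", trace["user_message"], re.I)
    # The ID the agent actually passed to the database lookup tool.
    used = trace["tool_calls"]["lookup_order"]["order_id"]
    return stated is not None and stated.group(1) == used

def order_id_error_rate(traces):
    # An objective, fully deterministic eval you can track across all users.
    return sum(not order_id_consistent(t) for t in traces) / len(traces)
```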
[01:32:38] You also have the subjective stuff, which we talked about, [01:32:40] where you probably want to do either human rating or LLM [01:32:43] as judges. [01:32:44] LLM judges are very relevant for subjective evals. [01:32:49] And finally, you will find yourself [01:32:51] having quantitative evals and more qualitative evals. [01:32:55] So quantitative would be the percentage of successful address [01:32:59] updates, [01:33:00] or the latency. [01:33:00] You could actually track latency per component [01:33:03] and see which one is the slowest. [01:33:05] Let's say sending the email takes five seconds; [01:33:08] that's too long, let's say. [01:33:10] You would notice that per component or over the full workflow, [01:33:13] and then you would decide, where am I optimizing my latency, [01:33:15] and how am I going to do that? [01:33:17] And then finally, qualitative. [01:33:20] You might actually do some error analysis [01:33:23] and look at, where are the hallucinations? [01:33:27] Where are the tone mismatches? [01:33:31] Are users confused, and what are they confused by? [01:33:34] That would be more qualitative, [01:33:36] and typically, it takes more white-glove approaches [01:33:41] to do that. [01:33:42] So here's what it could look like. [01:33:44] I gave you some examples. [01:33:46] But you would build evals to determine, [01:33:50] objectively and subjectively, component-based and end [01:33:53] to end, quantitatively and [01:33:55] qualitatively, where your LLM is failing [01:33:57] and where it's doing well. [01:34:02] Does that give you a sense of the type of things [01:34:04] you could do to fix or improve that agentic workflow? [01:34:09] Super. [01:34:10] Well, that was our case study on evals. [01:34:12] We're not going to delve deeper into it. [01:34:14] But hopefully, it gave you a sense of the kinds of things [01:34:16] you can do with LLM judges: objective versus [01:34:21] subjective, component-based versus end to end, et cetera. [01:34:25] Last section: multi-agent workflows. [01:34:29] So you might ask, hey, why do we need multi-agent workflows when [01:34:36] the workflow already has multiple steps, [01:34:38] already calls the LLM multiple times, already gives it tools? [01:34:42] Why do we need multiple agents? [01:34:45] So many people are talking about multi-agent systems online. [01:34:47] It's not even a new thing, frankly. [01:34:49] Multi-agent systems have been around for a long time. [01:34:52] The main advantage of a multi-agent system [01:34:55] is going to be parallelism. [01:34:57] The question is, is there something that I [01:34:59] wish I could run in parallel, sort of independently, [01:35:04] even if there are some touchpoints in the middle? [01:35:07] That's where you want to put a multi-agent system: [01:35:09] when the work is parallel. [01:35:12] The other advantage that some companies [01:35:14] get from multi-agent systems is that an agent can be reused. [01:35:19] So let's say in a company, you have an agent that's [01:35:21] been built for design.
[01:35:22] That agent can be used in the marketing team, [01:35:25] and it can be used in the product team. [01:35:27] And so now you're optimizing one agent [01:35:30] that has multiple stakeholders who can communicate with it [01:35:33] and benefit from its performance. [01:35:38] Actually, I'm going to ask you a question [01:35:40] and give you maybe a minute to think about it. [01:35:43] Let's say you were building smart home [01:35:46] automation for your apartment or your house. [01:35:50] What agents would you want to build? [01:35:52] Yeah. [01:35:53] Write it down. [01:35:54] And then I'm going to ask you in a minute [01:35:57] to share some of the agents that you would build. [01:36:00] Also, think about how you would put [01:36:03] a hierarchy between these agents, [01:36:04] or how you would organize them, or who [01:36:06] should communicate with whom. [01:36:07] OK? [01:36:08] OK. [01:36:08] Take a minute for that. [01:36:12] Be creative, also, because I'm going to ask for all of your agents, [01:36:14] and maybe you have an agent that nobody else has thought of. [01:36:21] OK. [01:36:22] Let's get started. [01:36:24] Who wants to give me a set of agents [01:36:26] that you would want for your smart home? [01:36:29] Yes. [01:36:32] The first is like a set of agents [INAUDIBLE] [01:37:00] OK. [01:37:01] So let me repeat. [01:37:02] You have four agents, I think, roughly. [01:37:05] One that tracks biometrics: where are you in the home? [01:37:09] Where are you moving? [01:37:10] How are you moving, things like that. [01:37:12] It sort of knows your location. [01:37:15] The second one determines the temperature of the rooms [01:37:21] and has the ability to change it. [01:37:23] The third one tracks energy efficiency [01:37:26] and might give feedback on energy usage. [01:37:31] And maybe, I don't know, [01:37:32] it has control over the temperature as well. [01:37:34] I don't know, actually. [01:37:35] Or the gas or the water; it might cut your water at some point. [01:37:43] And then you have an orchestrator agent. [01:37:44] What exactly is the orchestrator doing? [01:37:48] It passes instructions [INAUDIBLE] [01:37:53] OK. [01:37:53] Passes instructions. [01:37:55] So is that the agent that communicates mainly [01:37:58] with the user? [01:38:00] So if I'm coming back home and I [01:38:02] say, I want the oven to be preheated, [01:38:05] I communicate with the orchestrator, [01:38:07] and then it funnels that to another agent. [01:38:09] OK. [01:38:10] Sounds good. [01:38:11] Yeah. [01:38:11] So that's an example of, I want to say, [01:38:14] a hierarchical multi-agent system. [01:38:20] What else? [01:38:21] Any other ideas? [01:38:22] What would you add to that? [01:38:24] Yeah. [01:38:25] [INAUDIBLE] [01:38:55] Oh, I like that. [01:38:56] That's a really good one. [01:38:57] So let me summarize. [01:38:58] You have a security agent that determines if you can enter [01:39:02] or not. [01:39:03] And when you enter, it understands who you are. [01:39:06] And then it gives you certain sets [01:39:08] of permissions that might be different depending [01:39:11] on whether you're a parent or a kid. [01:39:13] Or you might have access to certain cars and not others. [01:39:17] Or your kid cannot open the fridge, or I don't know. [01:39:20] Something like that. [01:39:21] Yeah. [01:39:22] OK, I like that. [01:39:23] That's a good one.
[01:39:24] And it does feel like it's a complex enough problem that [01:39:28] you want a specific workflow tied to it. [01:39:32] I agree. [01:39:34] What else? [01:39:39] Yes. [01:39:41] [INAUDIBLE] So you can get more complicated. [01:39:43] So, energy savings, tied to whether or not you [01:39:50] or someone else wants the blinds open in the house, or [01:39:55] when you tap into the grid. [01:39:57] Yeah. And another thought as well, it's much harder [01:40:04] to track than the grocery store, [01:40:06] but understanding what's in your fridge. [01:40:08] OK. [01:40:12] Well, those are really good, actually. [01:40:14] So you mentioned two of them. [01:40:16] One is maybe an agent that has access to external APIs, that [01:40:20] can understand the weather out there, the wind, the sun, [01:40:24] and then has control over certain devices at home: [01:40:28] temperature, blinds, things like that, and also understands [01:40:31] your preferences for them. [01:40:33] That does feel like a good use case, because you could give [01:40:36] that to the orchestrator, but it might lose itself [01:40:38] because it's doing too much. [01:40:41] And also, these problems are tied together: [01:40:43] the outdoor temperature from the weather API [01:40:45] might influence the temperature inside, [01:40:48] how you want it, et cetera. [01:40:50] And then the second one, which I also like, [01:40:52] is you might have an agent that looks at your fridge [01:40:55] and what's inside. [01:40:57] It might actually have access [01:40:58] to a camera in the fridge, for example, [01:41:01] know your preferences, and also have [01:41:03] access to an e-commerce API to order [01:41:06] groceries from Amazon ahead of time. [01:41:09] I agree. [01:41:10] And maybe the orchestrator will be the communication line [01:41:12] with the user, but it might communicate with that agent [01:41:16] in order to get it done. [01:41:17] Yeah. [01:41:18] I like those. [01:41:19] Those are all really good examples. [01:41:21] Here is the list I had up there: [01:41:25] climate control, lighting, security, energy management, [01:41:30] entertainment, a notification agent that [01:41:32] alerts you about system updates and energy savings, and an orchestrator. [01:41:35] So you actually mentioned all of them. [01:41:38] And then, we didn't talk about the different interaction [01:41:41] patterns, but you do have different ways to organize [01:41:45] a multi-agent system: [01:41:46] flat, hierarchical. [01:41:48] It sounds like this one would be hierarchical. [01:41:51] I agree. [01:41:52] And the reason is UI/UX: I would rather [01:41:55] only have to talk to the orchestrator [01:41:57] than have to go to a specialized application [01:42:00] to do something. [01:42:01] It feels like the orchestrator [01:42:02] could be responsible for that. [01:42:04] And so I agree, I would probably go for a hierarchical setup [01:42:07] here. [01:42:08] But maybe you might also add some connections [01:42:11] between other agents, like in a flat system [01:42:13] where it's all-to-all. [01:42:15] For example, with climate control and energy, [01:42:17] if you want to connect those two, [01:42:19] you might actually allow them to speak with each other. [01:42:21] When you allow agents to speak with each other, [01:42:24] it's basically an MCP-style protocol, by the way. [01:42:26] You treat the agent like a tool, exactly like a tool. [01:42:30] Here is how you interact with this agent. [01:42:32] Here is what it can tell you. [01:42:34] Here is what it needs from you, essentially.
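Here's a minimal sketch of that hierarchical setup: the orchestrator is the only agent the user talks to, and it treats the specialist agents like tools. Each specialist is a plain callable here, and the keyword routing stands in for an LLM; in practice each agent would wrap its own model, tools, and memory:

```python
def climate_agent(request):
    return f"[climate] adjusting temperature for: {request}"   # stub specialist

def grocery_agent(request):
    return f"[grocery] checking fridge and ordering for: {request}"  # stub

SPECIALISTS = {"climate": climate_agent, "grocery": grocery_agent}

def orchestrator(user_message):
    # In a real system an LLM would do the routing; a keyword check stands in.
    route = "grocery" if "fridge" in user_message else "climate"
    return SPECIALISTS[route](user_message)

print(orchestrator("it's cold, warm up the living room"))   # -> climate agent
print(orchestrator("restock the fridge for the weekend"))   # -> grocery agent
```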
[01:42:37] OK, super. [01:42:38] And then, without going into the details, [01:42:40] there are advantages to multi-agent workflows [01:42:43] versus single agents, such as debugging: [01:42:47] it's easier to debug a specialized agent [01:42:50] than to debug an entire system. [01:42:52] Parallelization as well: [01:42:54] it's easier to have things run in parallel, [01:42:56] and you can save time. [01:42:59] There are some advantages to doing that, [01:43:01] and I'll leave you with this slide if you want to go deeper. [01:43:04] Super. [01:43:05] So we've learned so many techniques to optimize LLMs, [01:43:08] from prompts to chains to fine tuning, retrieval, [01:43:12] and multi-agent systems as well. [01:43:14] And then, just to end, a couple of trends I want you to watch. [01:43:19] I think next week is Thanksgiving, is that it? [01:43:21] It's Thanksgiving break. [01:43:22] No, the week after. [01:43:23] OK. [01:43:24] Well, ahead of the Thanksgiving break, [01:43:26] if you're traveling, you can think about these things. [01:43:29] On what's next in AI, I wanted to call out a couple of trends. [01:43:34] So Ilya Sutskever, one of the OGs of LLMs and an OpenAI [01:43:40] co-founder, raised the question of whether we are plateauing or not. [01:43:45] The question is, are we going to see, in the coming years, LLMs [01:43:50] not improving as fast as we've seen in the past? [01:43:54] It's probably been the feeling in the community [01:43:56] that the last version of GPT [01:44:00] did not bring the level of performance [01:44:03] that people were expecting, although it did make [01:44:06] things much easier for consumers, because you don't need [01:44:09] to interact with different models. [01:44:10] It's all under the same hood. [01:44:12] So it seems that it's progressing, [01:44:14] but the plateau question is unclear. [01:44:17] The way I would think about it is, the LLM scaling laws tell us [01:44:22] that if we continue to scale compute and energy, [01:44:26] then LLMs should continue to improve. [01:44:28] But at some point, it's going to plateau. [01:44:29] So what's going to take us to the next step? [01:44:32] It's probably architecture search. [01:44:35] A lot of LLMs, even if we don't [01:44:36] know exactly what's under the hood, are probably [01:44:38] transformer-based today. [01:44:40] But we know that the human brain does not operate the same way. [01:44:43] There are just certain things that we [01:44:45] do that are much more efficient, much faster. [01:44:47] We don't need as much data. [01:44:49] So theoretically, we have so much [01:44:51] to learn in terms of architecture search [01:44:53] that we haven't figured out. [01:44:54] It's not a surprise that you see those labs hiring [01:44:57] so many engineers. [01:44:58] Because it is possible that in the next few years, [01:45:01] you're going to have thousands of engineers trying [01:45:03] to figure out the different engineering hacks and tactics [01:45:06] and architecture searches that are [01:45:07] going to lead to better models. [01:45:10] And one of them will suddenly find the next transformer, [01:45:13] and it will reduce the need for compute and [01:45:17] energy by 10x. [01:45:18] It's sort of like if you read Isaac Asimov's Foundation series: [01:45:24] individuals can have an amazing impact on the future because [01:45:27] of their decisions. [01:45:29] Whoever discovered transformers had a tremendous impact [01:45:33] on the direction of AI.
[01:45:34] I think we're going to see more of that in the coming [01:45:37] years, where some group of researchers that is iterating [01:45:40] fast might discover certain things that suddenly [01:45:43] unlock that plateau and take us to the next step, [01:45:45] and it's going to continue to improve like that. [01:45:47] And so it doesn't surprise me that there are so many companies [01:45:50] hiring engineers right now to figure out [01:45:52] those hacks and those techniques. [01:45:56] The other set of gains that we might see [01:45:58] is from multi-modality. [01:45:59] The way to think about it is, we had LLMs that were first text-based, [01:46:04] and then we added images. [01:46:06] And today, models are very good at images. [01:46:09] They're very good at text. [01:46:10] It turns out that being good at images and being good at text [01:46:13] makes the whole model better. [01:46:15] So the fact that you're good at understanding a cat image [01:46:18] makes you better at text about a cat as well. [01:46:21] Now you add another modality, like audio or video, [01:46:24] and the whole system gets better. [01:46:26] So you're better at writing about a cat [01:46:28] if you know what a cat sounds like [01:46:30] and if you can look at a cat in an image as well. [01:46:31] Does that make sense? [01:46:32] So we see gains that transfer from one modality [01:46:35] to another, and that might culminate in robotics, [01:46:38] where all these modalities come together. [01:46:40] And suddenly, the robot is better at [01:46:42] running away from a cat because it understands [01:46:44] what a cat is, what it sounds like, [01:46:46] what it looks like, et cetera. [01:46:48] That makes sense? [01:46:49] The other one is multiple methods working in harmony. [01:46:53] In the Tuesday lectures, we've seen supervised learning, [01:46:56] unsupervised learning, self-supervised learning, [01:46:58] reinforcement learning, prompt engineering, RAGs, et cetera. [01:47:02] If you look at how babies learn, it [01:47:06] is probably a mix of those different approaches. [01:47:09] A baby might have some meta-learning, meaning it [01:47:13] has some survival instinct that is [01:47:16] most likely encoded in its DNA. [01:47:19] And that's the baby's pre-training, if you will. [01:47:22] On top of that, the mom or the dad is pointing at stuff [01:47:27] and saying bad, good, bad, good: [01:47:29] supervised learning. [01:47:30] On top of that, the baby is falling on the ground [01:47:33] and getting hurt, [01:47:34] and that's a reward signal for reinforcement learning. [01:47:36] On top of that, the baby is observing other people [01:47:39] doing stuff or other babies doing [01:47:42] stuff: unsupervised learning. [01:47:43] You see what I mean? [01:47:44] We're probably a mix of all these methods, [01:47:47] and I think that's where the trend is going: [01:47:49] those methods that you've seen in CS230 [01:47:52] come together in order to build an AI system that learns fast, [01:47:56] is low latency, is cheap, is energy-efficient, [01:48:00] and makes the most out of all of these methods. [01:48:03] Finally, and this is especially true at Stanford, [01:48:06] you have research going on that you would consider human-centric [01:48:11] and some research that is non-human-centric. [01:48:13] By human-centric, I mean approaches [01:48:16] that are modeled after the brain, versus approaches that [01:48:19] are not modeled after humans.
[01:48:20] Because it turns out that the human body is very limiting. [01:48:24] And so if you only do research [01:48:26] on what the human brain looks like, [01:48:28] you're probably missing out on compute and energy and things [01:48:30] like that that you can optimize even [01:48:32] beyond the neuronal connections in the brain. [01:48:35] But you still can learn a lot from the human brain. [01:48:37] And that's why there are professors running labs [01:48:40] right now that try to understand, [01:48:42] does backpropagation happen in humans? [01:48:45] And in fact, it's probably the case that we don't have backpropagation. [01:48:48] We may not use backpropagation; we may only do forward passes, [01:48:51] let's say. [01:48:51] So this type of work is interesting research [01:48:54] that I would encourage you to read if you're curious [01:48:56] about the direction of AI. [01:48:59] And then finally, one thing that's pretty clear, [01:49:02] and I call it out all the time, is the velocity [01:49:05] at which things are moving. [01:49:06] You're noticing that part of the reason [01:49:08] we're giving you breadth in CS230 [01:49:10] is because these methods are changing so fast. [01:49:12] So I don't want to bother teaching you [01:49:15] RAG method number 17 that [01:49:17] optimizes the RAG, because in two years, [01:49:19] you're not going to need it. [01:49:20] I would rather you think about the [01:49:23] breadth of things you want to understand, [01:49:25] and when you need it, you sprint and learn [01:49:27] the exact thing you need, faster, because the half-life of skills [01:49:30] is so short. [01:49:31] You want to come out of the class with good breadth [01:49:34] and then have the ability to go deep whenever [01:49:36] you need to after the class. [01:49:38] And that's sort of how this class is designed as well. [01:49:41] Yeah. [01:49:41] That's it for today. [01:49:43] So thank you. [01:49:45] Thank you for participating.