
Stanford CS230 | Autumn 2025 | Lecture 8: Agents, Prompts, and RAG

Stanford Online · May 10, 2026
0:05
Hi, everyone.
0:06
Welcome to another lecture for CS230 Deep Learning.
0:11
Today, we're going to talk about enhancing large language model
0:17
applications.
0:19
And I call this lecture Beyond LLM.
0:23
It has a lot of newer content.
0:26
And the idea behind this lecture is
0:31
we started to learn about neurons,
0:34
and then we learned about layers,
0:35
and then we learned about deep neural networks,
0:38
and then we learned a little bit about how to structure projects
0:43
in C3.
0:44
And now we're going one level beyond into, what would it
0:48
look like if you were building agentic AI systems at work,
0:54
in a startup, in a company?
0:58
And it's probably one of the more practical lectures.
1:02
Again, the goal is not to build a product
1:05
end to end in the next hour or so,
1:07
but rather to tell you all the techniques
1:09
that AI engineers have cracked, figured out, are exploring,
1:15
so that after the class, you have the breadth of view
1:18
of different prompting techniques,
1:20
different agentic workflows, multi-agent systems, evals.
1:25
And then when you want to dive deeper,
1:26
you have the background to dive deeper and learn faster
1:29
about it.
1:32
Let's try to make it as interactive as possible, as
1:36
usual.
1:37
When we look at the agenda, the agenda
1:40
is going to start with the core idea behind challenges
1:45
and opportunities for augmenting LLMs.
1:48
So we start from a base model.
1:50
How do we maximize the performance of that base model?
1:55
Then we'll dive deep into the first line of optimization,
1:59
which is prompting methods, and we'll see a variety of them.
2:02
Then we'll go slightly deeper.
2:04
If we were to get our hands under the hood
2:06
and do some fine tuning, what would it look like?
2:09
I'm not a fan of fine tuning, and I talk a lot about that,
2:12
but I'll explain why I try to avoid fine tuning as much as
2:16
possible.
2:18
And then we'll do a section 4 on Retrieval-Augmented Generation,
2:22
or RAG, which you've probably heard of in the news.
2:26
Maybe some of you have played with RAGs.
2:28
We're going to unpack what a RAG is
2:31
and how it works and then the different methods within RAGs.
2:36
And then we'll talk about agentic AI workflows.
2:40
I'll define it.
2:42
Andrew Ng is one of the first ones
2:45
to have called this trend agentic AI workflows.
2:49
And so we look at the definition that Andrew
2:51
gives to agentic workflows, and then we'll
2:54
start seeing examples.
2:56
The section 6 is very practical.
2:59
It's a case study where we will think about an agentic workflow,
3:05
and I'll ask you to measure if the agent actually works,
3:10
and we brainstorm how we can measure
3:13
if an agentic workflow is working
3:15
the way you want it to work.
3:16
There's plenty of methods called evals that solve that problem.
3:22
And then we'll look briefly at multi-agent workflow.
3:24
And then we can have a open-ended discussion
3:27
where I share some thoughts on what's next in AI.
3:31
And I'm looking forward to hearing from you all,
3:34
as well, on that one.
3:36
So let's get started with the problem of augmenting LLMs.
3:42
So open-ended question for you--
3:44
you are all familiar with pre-trained models
3:47
like GPT-3.5 Turbo or GPT-4o.
3:52
What's the limitation of using just a base model?
3:56
What are the typical issues that might
3:59
arise as you're using a vanilla pre-trained model?
4:07
Yes.
4:08
It lacks some domain knowledge.
4:10
Lacks some domain knowledge.
4:11
You're perfectly right.
4:13
We had a group of students a few years ago.
4:16
It was not LLM related, but they were building an autonomous
4:22
farming device or vehicle that had a camera underneath, taking
4:26
pictures of crops to determine if the crop is
4:30
sick or not, if it should be thrown away,
4:32
if it should be used or not.
4:35
And that data set is not a data set you find out there.
4:40
And the base model or pre-trained computer vision
4:44
model would lack that knowledge, of course.
4:47
What else?
4:49
Yes.
4:50
[INAUDIBLE] pictures are very dark [INAUDIBLE]
4:57
OK, maybe the-- you're saying--
4:59
so just to repeat for people online,
5:02
you're saying the model might have been trained
5:04
on high-quality data, but the data in the wild
5:06
is actually not that high quality.
5:08
And in fact, yes, the distribution of the real world
5:11
might differ, as we've seen with GANs, from the training set,
5:16
and that might create an issue with pre-trained models.
5:18
Although pre-trained LLMs are getting better
5:20
at handling all sorts of data inputs.
5:25
Yes.
5:26
Lacks current information.
5:28
Lack what?
5:28
Current information.
5:30
Lacks current information.
5:32
The LLM is not up to date.
5:34
And in fact, you're right.
5:35
Imagine you have to retrain from scratch your LLM
5:38
every couple of months.
5:39
One story that I found funny--
5:42
it's from probably three years ago, or maybe more like five years
5:45
ago, where during his first presidency,
5:49
President Trump one day tweeted, "Covfefe."
5:53
You remember that tweet or no?
5:56
Just "Covfefe."
5:57
And it was probably a typo, or the phone was in his pocket.
5:59
I don't know.
6:00
But that word did not exist.
6:03
The LLMs, in fact, that Twitter was running at the time
6:06
could not recognize that word.
6:08
And so the recommender system sort of went wild,
6:11
because suddenly everybody was making fun of that tweet using
6:15
the word "Covfefe," and the LLM was so confused on, what does
6:19
that mean?
6:20
Where should we show it?
6:21
To whom should we show it?
6:22
And this is an example of a-- nowadays,
6:25
especially on social media, there's so many new trends,
6:28
and it's very hard to retrain an LLM to match the new trend
6:33
and understand the new words out there.
6:34
I mean, you oftentimes hear Gen Z words like "rizz" or "mid"
6:39
or whatever.
6:40
I don't know all of them.
6:41
But you probably want to find a way that
6:45
can allow the LLM to understand those trends without retraining
6:49
the LLM from scratch.
6:51
What else?
6:53
It's trained to have a breadth of knowledge.
6:56
And if you wanted to do something specialized,
6:58
that might limit [INAUDIBLE].
6:59
Yeah, it might be trained on a breadth of knowledge,
7:02
but it might fail or not perform adequately
7:05
on a narrow task that is very well defined.
7:09
Think about enterprise applications that--
7:11
yeah, enterprise application.
7:13
You need high precision, high fidelity, low latency.
7:17
And maybe the model is not great at that specific thing.
7:20
It might do fine, but just not good enough.
7:22
And you might want to augment it in a certain way.
7:24
Yeah.
7:25
Maybe it has [INAUDIBLE] so it makes the model
7:29
a lot heavier, a lot slower.
7:32
[INAUDIBLE]
7:33
So maybe it has a lot of broad domain knowledge that might not
7:37
be needed for your application.
7:39
And so you're using a massive, heavy model
7:41
when you actually are only using 2% of the model capability.
7:44
You're perfectly right.
7:45
You might not need all of it.
7:46
So you might find ways to prune, quantize the model, modify it.
7:51
All of these are good points.
7:53
I'm going to add a few more, as well.
7:55
LLMs are very difficult to control.
7:58
Your last point is actually an example of that.
8:00
You want to control the LLM to use a part of its knowledge,
8:03
but it's not--
8:04
it's, in fact, getting confused.
8:06
We've seen that in history.
8:08
In 2016, Microsoft created a notorious Twitter
8:13
bot that learned from users, and it quickly became a racist jerk.
8:18
Microsoft ended up removing the bot 16 hours after launching it.
8:22
The community was really fast at determining
8:25
that this was a racist bot.
8:28
And you can empathize with Microsoft in the sense
8:31
that it is actually hard to control an LLM.
8:34
They might have done a better job qualifying it before launching,
8:37
but it is really hard to control an LLM.
8:40
Even more recently, this is a tweet
8:42
from Sam Altman last November, where
8:46
there was this debate between Elon Musk and Sam
8:50
Altman on whose LLM is the left wing propaganda
8:54
machine or the right wing propaganda machine,
8:57
and they were hating on each other's LLMs.
8:59
But that tells you, at the end of the day,
9:01
that even those two teams, Grok and OpenAI, which are probably
9:05
the best funded team with a lot of talent,
9:08
are not doing a great job at controlling their LLMs.
9:14
And from time to time, if you hang out on X,
9:16
you might see screenshots of users interacting with LLMs
9:21
and the LLM saying something really controversial
9:24
or racist or something that would not be considered great
9:31
by social standards, I guess.
9:33
And that tells you that the model is really hard to control.
9:39
The second aspect of it is something
9:41
that you mentioned earlier.
9:43
LLMs may underperform in your task,
9:47
and that might include specific knowledge gaps,
9:49
such as medical diagnosis.
9:51
If you're doing medical diagnosis,
9:52
you would rather have an LLM that is specialized for that
9:55
and is great at it and, in fact, something
9:57
that we haven't mentioned as a group, has sources.
10:00
So the answer is sourced specifically.
10:03
You have a hard time believing something
10:05
unless you have the actual source of the research that
10:08
backs it up.
10:10
Inconsistencies in style and format--
10:12
so imagine you're building a legal AI agentic workflow.
10:17
Legal has a very specific way to write and read,
10:21
where every word counts.
10:22
If you're negotiating a large contract,
10:25
every word on that contract might mean something else
10:28
when it comes to the court.
10:29
And so it's very important that you use
10:31
an LLM that is very good at it.
10:34
The precision matters.
10:35
And then task-specific understanding,
10:38
such as doing a classification on a niche field,
10:40
here I pulled an example where-- let's say a biotech product is
10:45
trying to use an LLM to categorize
10:48
user reviews into positive, neutral, or negative.
10:54
Maybe for that company, something
10:56
that would be considered a negative review typically
11:01
is actually considered a neutral review
11:04
because the NPS of that industry tends
11:06
to be way lower than other industries, let's say.
11:10
That's a task-specific understanding,
11:12
and the LLM needs to be aligned to what
11:14
the company believes is the categorization that it wants.
11:17
We will see an example of how to solve that problem in a second.
11:21
And then limited context handling--
11:24
a lot of AI applications, especially in the enterprise,
11:28
have required data that has a lot of context.
11:33
Just to give you a simple example,
11:35
knowledge management is an important space;
11:40
enterprises buy a lot of knowledge management tools.
11:40
When you go on your drive and you have all your documents,
11:43
ideally, you could have an LLM running on top of that drive.
11:47
You can ask any question, and it will read immediately
11:50
thousands of documents and answer, what was
11:53
our Q4 performance in sales?
11:56
It was x dollars.
11:58
It finds it super quickly.
11:59
In practice, because LLMs do not have a large enough context,
12:04
you cannot use a standalone vanilla pre-trained LLM to solve
12:07
that problem.
12:08
You will have to augment it.
12:11
Does that make sense?
12:13
The other aspect around context windows is they are, in fact,
12:16
limited.
12:17
If you look at the context windows of the models
12:20
from the last five years, even the best models
12:25
today will range in context window, the number of tokens
12:30
they can take as input, somewhere in the hundreds of thousands
12:35
of tokens max.
12:36
Just to give you a sense, 200,000 tokens is roughly two
12:40
books.
12:42
So that's how much you can upload
12:45
and it can read, pretty much.
12:47
And you can imagine that when you're
12:48
dealing with video understanding or heavier data
12:52
files, that is, of course, an issue.
12:56
So you might have to chunk it.
12:58
You might have to embed it.
12:59
You might have to find other ways
13:00
to get the LLM to handle larger contexts.
13:06
The attention mechanism is also powerful, but problematic,
13:10
because it does not do a great job at attending
13:13
in very large contexts.
13:16
There is actually an interesting problem
13:19
called needle in a haystack.
13:21
It's an AI problem where--
13:23
or call it a benchmark--
13:25
where, in order to test if your LLM is good at putting attention
13:30
on a very specific fact within a large corpus,
13:35
researchers might randomly insert
13:38
one sentence that outlines
13:44
a certain fact, such as Arun and Max
13:47
are having coffee at Blue Bottle,
13:48
in the middle of the Bible, let's say,
13:51
or some very long text.
13:54
And then you ask the LLM, what were Arun and Max having
14:01
at Blue Bottle?
14:02
And you see if it remembers that it was coffee.
14:04
It's actually a complex problem, not because the question
14:07
is complex, but because you're asking the model
14:09
to find a fact within a very large corpus,
14:12
and that's complicated.
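As a minimal sketch of what such a needle-in-a-haystack test could look like (not from the lecture; call_llm is a hypothetical stand-in for any chat-completion API):

```python
import random

def call_llm(prompt: str) -> str:
    # Placeholder for a real chat-completion API call.
    return "They were having coffee."

def needle_in_haystack_test(haystack: str, needle: str,
                            question: str, expected: str) -> bool:
    # Plant the needle sentence at a random position in the corpus.
    sentences = haystack.split(". ")
    sentences.insert(random.randrange(len(sentences)), needle)
    corpus = ". ".join(sentences)
    # Ask the model to recall the planted fact from the full corpus.
    answer = call_llm(f"{corpus}\n\nQuestion: {question}")
    return expected.lower() in answer.lower()

passed = needle_in_haystack_test(
    haystack="Filler sentence about something else. " * 10000,
    needle="Arun and Max are having coffee at Blue Bottle.",
    question="What were Arun and Max having at Blue Bottle?",
    expected="coffee",
)
print("needle found" if passed else "needle missed")
```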
14:16
So, again, this is a limiting factor for LLMs.
14:19
We'll talk about RAG in a second.
14:21
But I want to preview--
14:22
there are debates around whether RAG
14:26
is the right long-term approach for AI systems.
14:29
So as a high-level idea, a RAG is a mechanism, if you will,
14:34
that embeds documents that an LLM can retrieve and then
14:39
add as context to its initial prompt and answer a question.
14:44
It has lots of applications.
14:45
Knowledge management is an example.
14:47
So imagine you have your drive again.
14:49
But every document is compressed in representation,
14:53
and the LLM has access to that lower
14:55
dimensional representation.
14:59
The debate that this tweet from [INAUDIBLE] outlines
15:03
is, in theory, if we have infinite compute,
15:08
then RAG is useless.
15:09
Because you can just read a massive corpus immediately
15:13
and answer your question.
15:15
But even in that case, latency might be an issue.
15:19
Imagine the time it takes for an AI
15:20
to read all your drive every single time you ask a question.
15:24
It doesn't make sense.
15:25
So RAG has other advantages beyond even the accuracy.
15:30
On top of that, the sourcing matters, as well.
15:33
So it might-- RAG allows you to source.
15:35
We'll talk about all that later.
15:38
But there's always this debate in the community
15:42
whether a certain method is actually future proof.
15:46
Because in practice, as compute power doubles every year,
15:49
let's say, some of the methods we're learning right now
15:52
might not be relevant three years from now.
15:54
We don't know, essentially.
15:59
And the analogy that he makes on context windows
16:04
and why RAG approaches might be relevant even a long time
16:07
from now is search.
16:09
When you search on a search engine,
16:12
you still find sources of information.
16:14
And in fact, in the background, there
16:16
is very detailed traversal algorithms
16:20
that rank and find the specific links that might be the best
16:25
to present you versus if you had to read-- imagine you had
16:29
to read the entire web every single time you're doing
16:31
a search query, without being able to narrow
16:34
to a certain portion of the space.
16:36
That might, again, not be reasonable.
16:41
OK, when we're thinking of improving LLMs,
16:46
the easiest way we think of it is two dimensions.
16:50
One dimension is we are going to improve the foundation
16:53
model itself.
16:54
So, for example, we move from GPT-3.5 Turbo, to GPT-4,
17:01
to GPT-4o, to GPT-5.
17:04
Each of that is supposed to improve the base model.
17:07
GPT 5 is another debate because it's packaging other models
17:11
within itself.
17:12
But if you're thinking about 3.5, 4, and 4o,
17:15
that's really what it is.
17:16
The pre-trained model improves.
17:18
And so you should see your performance
17:20
improve on your tasks.
17:22
But the other dimension is we can actually engineer--
17:27
leverage the LLM in a way that makes it better.
17:30
So you can simply prompt GPT-4o.
17:34
You can change some prompts and improve the prompt,
17:38
and it will improve the performance.
17:40
It's shown.
17:41
You can even put a RAG around it.
17:42
You can put an agentic workflow around it.
17:45
You can even put a multi-agent system around it.
17:49
And that is another dimension for you to improve performance.
17:52
So that's how I want you to think about it-- which
17:54
LLM I'm using, and then how can I maximize
17:56
the performance of that LLM?
17:59
This lecture is about the vertical axis.
18:02
Those are the methods that we will see together.
18:08
Sounds good for the introduction.
18:11
So let's move to prompt engineering.
18:14
I'm going to start with an interesting study just
18:17
to motivate why prompt engineering matters.
18:20
There is a study from Harvard Business School
18:26
and Wharton at UPenn
18:31
that took a subset of BCG consultants,
18:34
individual contributors, split them into three groups.
18:37
One group had no access to AI.
18:39
One group had access to--
18:41
I think it was GPT 4.
18:44
And then one group had access to the LLM,
18:46
but also a training on how to prompt better.
18:50
And then they observed the performance of these consultants
18:53
across a wide variety of tasks.
18:56
There's a few things that they noticed
18:57
that I thought was interesting.
18:59
One is something they called the jagged frontier,
19:02
meaning that certain tasks that consultants are doing fall
19:07
beyond the jagged frontier, meaning AI is not good enough.
19:14
It's not improving human performance.
19:18
In fact, it's actually making it worse.
19:20
And some tasks are within the frontier,
19:23
meaning that AI is actually significantly improving
19:27
the performance, the speed, the quality of the consultant.
19:32
Many tasks fell within and many tasks fell without,
19:35
and they shared their insights.
19:37
But the TLDR is--
19:39
there is a frontier within which AI is absolutely helping
19:42
and, beyond it, a behavior they call falling asleep
19:47
at the wheel, where people relied on AI on a task that
19:51
was beyond the frontier.
19:52
And in fact, it ended up going worse
19:55
because the human was not reviewing the outputs carefully
19:58
enough.
20:01
They did note that the group that was trained
20:04
was the best, better than the group that was not trained
20:08
on prompt engineering, which also motivates why
20:10
this lecture matters, so that you're within that group
20:14
afterwards.
20:15
Another insight was the centaurs and the cyborgs.
20:20
They noticed that consultants had the tendency
20:22
to work with AI in one of two ways,
20:24
and you might, yourself, be part of one of these groups.
20:29
The centaurs are mythical creatures
20:31
that are half human, half horse.
20:42
And those were individuals that would divide and delegate.
20:45
They might give a pretty big task to the AI.
20:48
So imagine you're working on a PowerPoint, which consultants
20:51
are known to do.
20:52
You might actually write a very long prompt on how
20:55
you want it to do your PowerPoint and then let it
20:57
work for some time and then come back
20:59
and it's done, when others would act as cyborgs.
21:02
Cyborgs are fully blended, bionic beings:
21:06
human and robot, a human augmented with robotic parts.
21:10
And those individuals will not delegate fully a task.
21:13
They would actually work super quickly with the model
21:16
back and forth.
21:17
I find that a lot of students actually work more
21:20
like cyborgs than centaurs, while maybe in the enterprise,
21:24
when you're trying to automate a workflow,
21:26
you're thinking more like a centaur.
21:29
That's just something good to keep in mind.
21:31
Also, a lot of companies will tell you, oh, we're
21:33
hiring prompt engineers, et cetera.
21:34
It's a career. I don't buy that.
21:36
I think it's just a skill that everybody should have.
21:39
You're not going to make a career out
21:40
of prompt engineering, but you're probably
21:42
going to use it as a very powerful skill in your career.
21:49
So let's talk about basic prompt design principles.
21:52
I'm giving you a very simple prompt here.
21:56
Summarize this document, and then the document
21:58
is uploaded alongside it.
22:00
And the model has not much context around
22:04
what the summary should be,
22:06
how long it should be,
22:07
what it should talk about, et cetera.
22:09
You can actually improve these prompts by doing something like
22:14
summarize this 10-page scientific paper on renewable
22:18
energy in five bullet points, focusing on key findings
22:22
and implications for policymakers.
22:25
That's already better.
22:26
You're sharing the audience, and it's
22:28
going to tailor it to the audience.
22:30
You're saying that you want five bullet points,
22:33
and you want to focus only on key findings.
22:35
That's a better prompt, you would argue.
22:39
How could you even make this prompt better?
22:41
What are other techniques that you've
22:43
heard of or tried yourself that could make this one-shot prompt
22:47
better?
22:53
Yeah.
22:53
[INAUDIBLE]
22:57
OK.
22:58
Right example.
22:58
So say, you mean, here is an example of a great summary.
23:02
Yeah.
23:03
You're right.
23:03
That's a good idea.
23:05
[INAUDIBLE]
23:08
Very popular technique.
23:10
Act like a renewable energy expert giving a conference
23:15
at Davos, let's say, yeah.
23:17
That's great.
23:18
Someone-- yeah.
23:20
Say you're really good at it.
23:22
Yeah.
23:23
You are the best in the world at this.
23:25
Explain.
23:26
Yeah.
23:26
Actually, I mean, these things work.
23:28
It's funny, but it does work to say act like x, y, z.
23:32
It's a very popular prompt template.
23:34
We'll see a few examples.
23:36
What else could you do?
23:40
Yes.
23:41
Of course, you'd like it to critique its own output.
23:46
Critique its own output.
23:47
So you're using reflection.
23:48
So you might actually do one output
23:50
and then ask it to critique it and then give it back.
23:52
Yeah.
23:53
We see that.
23:53
That's a great one.
23:54
That's the one that probably works best
23:56
within those typically, but we see some examples.
23:59
What else?
24:00
Yeah.
24:01
Break the task down into steps.
24:03
OK.
24:03
Break the task down into steps.
24:05
Do you know what that is called?
24:06
No.
24:07
OK.
24:08
Chain of thought.
24:09
So this is actually a popular method
24:12
that's been shown in research to improve performance.
24:15
You could actually give a clear instruction
24:17
and also encourage the model to think step
24:19
by step approach, the task step by step,
24:22
and do not skip any step.
24:24
And then you give it some steps, such as step one,
24:26
identify the three most important findings.
24:29
Step two, explain how each key finding
24:31
impacts renewable energy policy.
24:33
Step three, write the five-bullet summary
24:36
with each point addressing a finding, et cetera.
24:39
So chain of thought, I linked the paper from 2022 that
24:45
popularized chain of thought.
24:46
Chain of thought is very popular
24:48
right now, especially in AI startups
24:50
that are trying to control their LLMs.
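A minimal sketch of that step-by-step prompt assembled in code (the wording mirrors the example above; call_llm is a hypothetical stand-in for a real chat-completion call):

```python
def call_llm(prompt: str) -> str:
    return "..."  # placeholder for a real chat-completion API call

COT_PROMPT = """Summarize this scientific paper on renewable energy \
in five bullet points for policymakers.
Think step by step. Approach the task step by step. Do not skip any step.
Step 1: Identify the three most important findings.
Step 2: Explain how each key finding impacts renewable energy policy.
Step 3: Write the five-bullet summary, each point addressing a finding.

Paper:
{paper_text}"""

paper_text = "..."  # the 10-page paper would be pasted here
summary = call_llm(COT_PROMPT.format(paper_text=paper_text))
```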
24:55
OK.
24:56
To go back to your examples about act like XYZ, what
25:01
I like to do, Andrew Ng also talks about that,
25:03
is to look at other people's prompts.
25:06
And in fact, in online, you have a lot of prompt repositories
25:10
for free on GitHub.
25:11
In fact, I linked the awesome prompt template repo on GitHub,
25:16
where you have so many examples of great prompts
25:19
that engineers have built. They said it works great for us,
25:22
and they published it online.
25:23
And a lot of them start with act as.
25:27
Act as a Linux terminal.
25:29
Act as an English translator.
25:31
Act like a position interviewer, et cetera.
25:37
The advantage of a prompt template
25:38
is that you can actually put it in your code
25:42
and scale it for many user requests.
25:44
So let me give you an example from Workera.
25:48
Workera evaluates skills.
25:50
Some of you have taken the assessments already.
25:52
And it tries to personalize them to the user.
25:56
And in fact, if you actually read an HR system
25:59
in an enterprise,
26:01
you might have: Jane is a product manager, level 3,
26:06
and she is in the US, and her preferred language is English.
26:10
And actually, that metadata can be
26:13
inserted in a prompt template that will be
26:15
personalized for Jane.
26:16
And similarly for Joe, whose preferred language is Spanish,
26:22
it will tailor it to Joe.
26:24
And that's called a prompt template.
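A minimal sketch of such a template in code; the HR metadata and the mentor wording here are hypothetical, following the Jane and Joe example:

```python
# Hypothetical HR metadata, as in the Jane and Joe example.
users = [
    {"name": "Jane", "role": "product manager", "level": 3,
     "country": "US", "language": "English"},
    {"name": "Joe", "role": "product manager", "level": 2,
     "country": "Spain", "language": "Spanish"},
]

TEMPLATE = (
    "Act as a great AI mentor that helps people in their career.\n"
    "The user is {name}, a {role} (level {level}) based in {country}.\n"
    "Always answer in {language}.\n"
    "User question: {question}"
)

for user in users:
    # One template, personalized and scaled across many user requests.
    prompt = TEMPLATE.format(**user, question="How do I grow my skills?")
```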
26:26
[INAUDIBLE]
26:34
So the question is do the foundation models
26:39
use prompt templates, or do you
26:41
have to integrate it yourself?
26:42
So the foundation models probably
26:45
use a system prompt that you don't see.
26:47
Like when actually, you type on ChatGPT,
26:50
it is possible, it's not public, that OpenAI behind the scenes
26:55
has like act like a very helpful assistant for this user.
26:59
And by the way, here is your memories about the user
27:03
that we kept in a database.
27:05
You can actually check your memories.
27:07
And then your prompt goes under, and then the generation starts.
27:10
So probably, they're using something like that.
27:12
But it doesn't mean you can't add one yourself.
27:15
So in fact, if you think about a prompt template for the Workera
27:19
example I was showing, maybe it starts
27:22
when you call OpenAI by act like a helpful assistant.
27:25
And then underneath, it's like act like a great AI mentor that
27:29
helps people in their career.
27:31
And OpenAI's own prompt template also
27:33
has, follow the instructions from the creator,
27:36
or something like that.
27:37
It's possible.
27:41
Questions about prompt templates.
27:42
Again, I would encourage you to go and read examples of prompts.
27:45
Some of them are quite thoughtful.
27:48
Let's talk about zero shot versus few shot prompting.
27:51
It came up earlier.
27:53
Here's an example.
27:54
Again, going back to the categorization of product
27:57
reviews, let's say that we're working on a task
28:01
where the prompt is classify the tone of the sentence
28:05
as positive, negative, or neutral.
28:07
And then you paste the review, which is the product is fine,
28:12
but I was expecting more.
28:16
If I were to survey the room, I would bet that some of you
28:19
would say it's negative.
28:21
Some of you would say it's neutral.
28:23
Because you actually have a first part
28:24
that is relatively positive.
28:27
It's fine.
28:28
And then the second part, I was expecting more,
28:30
which is relatively negative.
28:31
So where do you land?
28:33
This can be a subjective question.
28:35
And maybe in one industry, this would be considered amazing.
28:37
And another one, it would be considered really bad
28:40
because people are used to really flourishing reviews.
28:44
And so the way you can actually align the model to your task
28:47
is by converting that zero-shot prompt.
28:49
Zero-shot refers to the fact that it's not
28:51
being given any example.
28:53
Into a few-shot prompt, where the model
28:56
is given in the prompt, a set of examples to align it to what
29:00
you want it to do.
29:01
So the example here is again, you
29:03
paste the same prompt as before with the user review.
29:06
And then you add, here are examples
29:08
of tone classifications.
29:10
These exceeded my expectation completely.
29:12
Positive.
29:14
It's OK, but I wish it had more features.
29:17
Negative.
29:18
The service was adequate.
29:20
Neither good nor bad.
29:22
Neutral.
29:23
Now classify the tone of this sentence
29:26
after you've heard about these things,
29:28
and the model then says negative.
29:31
And the reason it says negative, of course,
29:33
is likely because of the second example, which was it's OK,
29:39
but I wish it had more features, which we told the model that
29:42
was negative.
29:43
Because the model saw that it's aligned now
29:45
with your expectations.
29:47
Few-shot prompts are very popular.
29:50
And in fact, for AI startups that
29:52
are slightly more sophisticated, you
29:54
might see them keep a prompt up to date.
29:57
Whenever a user says something, they
30:00
might have a human label it and then
30:02
add it as a few-shot example in the relevant
30:05
prompts in their code base.
30:08
You can think of that as almost building a data set.
30:10
But instead of actually building a separate data set
30:12
like we've seen with supervised fine tuning
30:15
and then fine tuning the model on it,
30:17
you're just putting it directly in the prompt.
30:19
It turns out it's probably faster
30:21
to do that if you want to experiment quickly
30:23
because you don't touch the model parameters.
30:25
You just update your prompts.
30:27
And if it's text examples, you can actually
30:30
concatenate so many examples in a single prompt.
30:34
At some point, it will be too long,
30:36
and you will not have the necessary context window.
30:39
But it's a pretty strong approach
30:40
that is quick to align an LLM.
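As a minimal sketch (not from the lecture) of how such a few-shot prompt can be assembled in code, with a hypothetical call_llm helper standing in for any chat-completion API:

```python
def call_llm(prompt: str) -> str:
    return "Negative"  # placeholder for a real chat-completion API call

# Labeled examples; teams may keep appending human-labeled cases here.
FEW_SHOT_EXAMPLES = [
    ("This exceeded my expectations completely.", "Positive"),
    ("It's OK, but I wish it had more features.", "Negative"),
    ("The service was adequate, neither good nor bad.", "Neutral"),
]

def classify_tone(review: str) -> str:
    shots = "\n".join(f'"{text}" -> {label}'
                      for text, label in FEW_SHOT_EXAMPLES)
    prompt = ("Classify the tone of the sentence as Positive, Negative, "
              "or Neutral.\n"
              "Here are examples of tone classifications:\n"
              f"{shots}\n"
              f'Now classify the tone of this sentence: "{review}"')
    return call_llm(prompt)

print(classify_tone("The product is fine, but I was expecting more."))
```

Adding a newly labeled example is just appending to the list, which is why this iterates faster than fine tuning: no model parameters are touched.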
30:48
OK?
30:49
Yes.
30:50
[INAUDIBLE]
30:57
So the question was is there any research on how long
31:00
the prompt can be before the model essentially loses
31:03
itself or doesn't follow instructions anymore?
31:06
There is.
31:08
The problem is that research is outdated every few months
31:11
because models get better.
31:14
And so I don't know where the state of the art is.
31:16
You can probably find it online on benchmarks
31:18
on like we see that--
31:20
I give you an example.
31:23
On the Workera product, you have a voice conversation
31:27
for some of you that have tried it,
31:28
where you're asked to explain what is the prompt.
31:30
And then you explain, and then there's
31:31
a scoring algorithm behind it.
31:33
We know that after eight turns, the model loses itself.
31:38
After eight turns, because you always
31:40
paste the previous user response,
31:42
it just starts going wild.
31:44
And so the techniques we use in the background
31:46
is we actually create chapters of the conversation.
31:49
Maybe one chapter is the first eight prompts.
31:51
And then you actually start over from another prompt.
31:53
You can summarize the first part of the conversation,
31:56
insert the summary, and then keep going.
31:59
Those are engineering hacks that engineers might have figured out
32:02
in the background.
32:04
Because eight turns makes a prompt quite long actually.
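A minimal sketch of that chaptering hack (hypothetical names; the eight-turn threshold is the empirical number mentioned above, and call_llm stands in for a real API call):

```python
def call_llm(prompt: str) -> str:
    return "..."  # placeholder for a real chat-completion API call

MAX_TURNS = 8  # empirically, the model loses the thread past this

def chat_turn(history: list[str], user_message: str) -> str:
    if len(history) >= 2 * MAX_TURNS:
        # Close the chapter: compress everything so far into a summary
        # and start over from it, instead of pasting all turns back.
        summary = call_llm("Summarize the first part of this "
                           "conversation:\n" + "\n".join(history))
        history[:] = [f"Summary of the conversation so far: {summary}"]
    history.append(f"User: {user_message}")
    reply = call_llm("\n".join(history) + "\nAssistant:")
    history.append(f"Assistant: {reply}")
    return reply
```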
32:13
Let's move on to chaining.
32:15
Chaining is the most popular technique out of everything
32:17
we've seen so far in prompt engineering.
32:22
It's not chain of thought.
32:23
So chain of thought we've seen is think step by step,
32:26
step 1, step 2, step 3.
32:27
Do not skip any step.
32:28
This is different.
32:30
This is chaining complex prompt to improve performance,
32:34
and this is what it looks like.
32:37
You take a single step prompt, such as read this customer
32:40
review and write a professional response that
32:43
acknowledges their concern, explains the issue,
32:46
offers a resolution, and then you
32:48
paste the customer review, which is I ordered a laptop.
32:51
It arrived three days late.
32:52
The packaging was damaged.
32:54
Very disappointing.
32:56
I needed that urgently for work.
32:59
And then the output is an email that
33:01
is immediately given to you by the LLM
33:04
after it reads the prompt.
33:08
So this might work, but it might be hard to control.
33:14
Because think about it.
33:15
There's multiple steps that you have listed,
33:18
and everything is embedded in the same prompt.
33:20
And if you wanted to debug step by step and know which step is
33:24
weaker, you couldn't.
33:24
You would have everything mixed together.
33:27
So one advantage of chaining is you would separate the prompts,
33:32
so that you can debug them separately.
33:35
And it will also lead to an easier manner
33:38
to improve your workflow.
33:41
Let's say a first prompt is extract the key issues.
33:44
Identify the key concerns mentioned
33:46
in this customer review.
33:47
Pace the customer review.
33:49
Second prompt.
33:50
Using these issues, so you paste back the issues,
33:54
draft an outline for a professional response that
33:57
acknowledges concerns, explains possible reasons,
34:00
and offers a resolution.
34:04
So this is not--
34:06
Prompt number 3, write the full response.
34:09
So using the outline, write the professional response.
34:14
And then you get your final output.
34:18
So in theory, you can tell me, oh, the second approach
34:22
is better than the first one at first.
34:23
But what you can notice is that we can actually
34:27
test those three prompts separately from each other
34:29
and determine if we will get the most gains out of engineering
34:35
the first prompt, optimizing it, or the second one,
34:38
or the third one.
34:39
We now have three prompts that are independent from each other.
34:43
And maybe if the outline was better,
34:47
the performance of the email, say the open rate
34:53
or the user satisfaction with the response,
34:55
will actually get higher.
34:57
And so chaining improves performance,
35:00
but most importantly, helps you control your workflow
35:04
and debug it more seamlessly.
35:07
Yes.
35:09
So if we know that the three prompts independently work really well,
35:15
if we combine them into one prompt,
35:17
and we highlight a step-by-step thinking process,
35:21
do we, on average, get a [INAUDIBLE] by itself,
35:24
or do we still have to do that breakdown?
35:28
So let me try to rephrase.
35:30
You say, let's say we look at the first prompt which
35:32
has all three tasks built in that prompt.
35:37
What exactly do you mean?
35:39
You mean like if we evaluate the output
35:41
and we measure some user insight, satisfaction,
35:43
et cetera?
35:45
Why don't we just modify that prompt and essentially see how
35:49
it improves user satisfaction?
35:51
Yeah.
35:51
[INAUDIBLE]
35:54
I see.
35:55
So why do we need the three steps?
35:57
I mean, think about it.
35:59
The intermediate output is what you want to see.
36:02
Like if I'm debugging the first approach,
36:06
the way I would do it is I would capture user insights.
36:09
Like here's the email.
36:10
How good was the response?
36:11
Thumbs up, thumbs down.
36:13
Was your issue resolved?
36:16
Thumbs up, thumbs down.
36:17
Those would tell me how good is my prompt.
36:19
And I can engineer that prompt, optimize it,
36:21
and I would probably drive some gains.
36:23
But I will not be able easily to trace back
36:26
to what the problem was.
36:28
While in the second approach, not only can I
36:30
use the end-to-end metrics to improve my process,
36:33
I can also use the intermediate steps.
36:35
For example, if I look at prompt 2 and I look at the outline
36:38
and I see the outline is actually, meh, it's not great,
36:41
then I think I can get a lot of gains out of the outline.
36:45
Or the outline is actually really good,
36:47
but the last prompt doesn't do a good job at translating it
36:50
into an email.
36:51
So the outline is exactly what I want the LLM to do,
36:54
but the translation in a customer facing email
36:57
is not good.
36:58
In fact, it doesn't follow our vocabulary internally.
37:01
Then I know the third prompt is where
37:03
I would get the most gains.
37:06
So that's what it allows me to do,
37:07
have intermediate steps to review.
37:10
Are there any latency [INAUDIBLE]
37:13
We'll talk about it.
37:14
Are there any latency concerns?
37:16
Yes.
37:17
In certain applications, you don't want to use a chain
37:20
or you don't want to use a long chain because it adds latency.
37:26
We'll talk about that later.
37:27
Good point.
37:28
So practically, this is what chaining complex
37:32
prompts looks like.
37:33
You have your first prompt with your first task.
37:35
It outputs.
37:36
The output is pasted in the second prompt
37:39
with the second task being defined.
37:41
The output is then pasted into the third prompt
37:43
with the third task being defined and so on.
37:46
That's what it looks like in practice.
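A minimal sketch of that chain in code (call_llm is a hypothetical stand-in); the point is that issues and outline are intermediate outputs you can log and debug separately:

```python
def call_llm(prompt: str) -> str:
    return "..."  # placeholder for a real chat-completion API call

def respond_to_review(review: str) -> str:
    # Prompt 1: extract the key issues.
    issues = call_llm("Identify the key concerns mentioned in this "
                      "customer review:\n" + review)
    # Prompt 2: draft an outline from those issues.
    outline = call_llm("Using these issues, draft an outline for a "
                       "professional response that acknowledges concerns, "
                       "explains possible reasons, and offers a "
                       "resolution:\n" + issues)
    # Prompt 3: write the full response from the outline.
    return call_llm("Using the outline, write the professional "
                    "response:\n" + outline)
```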
37:52
Super.
37:55
We'll talk more later about testing your prompts,
37:58
but there are methods now to do it,
38:00
and we'll see later in this lecture with our case study
38:03
how we can test our prompts.
38:06
But here is an example of how you might do it.
38:11
You might have a summarization workflow prompt
38:18
that is the baseline.
38:19
It's a single prompt.
38:21
You might have a refined summarization
38:23
which is a modified prompt of this,
38:26
or a workflow with a chain.
38:30
And then you have your test case, which is the input
38:34
that you want to summarize, let's say.
38:36
And then you have the generated output.
38:38
And you can have humans go and rate these outputs.
38:42
And you would notice that the baseline is better or worse
38:46
than the refined prompt.
38:47
Of course, this manual approach takes time,
38:51
but it's a good way to start.
38:53
And usually, the advice is get hands on at the beginning
38:56
because you would quickly notice some issues,
38:58
and it will give you better intuition on what tweaks
39:01
can lead to better performance.
39:03
However, if you wanted to scale that system
39:05
across many products, many parts of your code base,
39:08
you might want to find a way to do that automatically
39:10
without asking humans to review and grade summaries.
39:14
One approach is to use platforms,
39:19
like at Workera, our team uses a platform called promptfoo that
39:23
allows you to actually automate part of this testing.
39:26
In a nutshell, what it does is it
39:30
can allow you to run the same prompt with five different LLMs
39:35
immediately, put everything in a table.
39:37
That makes it super easy for a human to grade, let's say.
39:40
Or alternatively, it might allow you to define LLM judges.
39:46
LLM judges can come in different flavors.
39:50
For example, I can have an LLM judge that
39:52
does a pairwise comparison.
39:54
So what the LLM is asked to do is here are two summaries.
39:58
Just tell me which one is better than the other one.
40:01
That's what the LLM does.
40:02
And that can be used as a proxy for how good
40:04
the summarization baseline versus the refined version is.
40:08
Another way to do an LLM judge is
40:11
if you do it for a single answer grading,
40:14
so here's a summary graded from 1 to 5.
40:18
And then you can go even deeper and do
40:21
a reference-guided pairwise comparison.
40:24
Or you also add a rubric.
40:25
You say a 5 is when a summary is below 100 characters.
40:30
I'm just making that up.
40:31
Below 100 characters.
40:33
Mentions at least three key points
40:35
that are distinct and starts with a first sentence that
40:38
displays the overview and then goes into the detail.
40:40
That's a great summary, a 5 out of 5.
40:42
0 is the LLM failed to summarize and actually was very verbose,
40:48
let's say.
40:49
And so you put a rubric behind it,
40:52
and you have an LLM judge applying the rubric.
40:55
Of course, you can now pair different techniques.
40:57
You can do few-shot for the rubric.
40:58
You can actually give examples of 5 out of 5s, 4 out of 5s,
41:02
3 out of 5s, because now you can combine multiple techniques.
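A minimal sketch of both judge flavors (the rubric wording and the call_llm helper are hypothetical):

```python
def call_llm(prompt: str) -> str:
    return "4"  # placeholder for a real chat-completion API call

RUBRIC = """Grade the summary from 0 to 5.
5: below 100 characters, mentions at least three distinct key points,
   and the first sentence gives the overview before the detail.
0: fails to summarize, or is very verbose."""

def judge_single(summary: str) -> int:
    # Single-answer grading against a rubric.
    return int(call_llm(f"{RUBRIC}\n\nSummary:\n{summary}\n\n"
                        "Reply with the grade only."))

def judge_pairwise(a: str, b: str) -> str:
    # Pairwise comparison: reply which of two summaries is better.
    return call_llm("Here are two summaries. Reply 'A' or 'B' for "
                    f"the better one.\n\nA:\n{a}\n\nB:\n{b}")
```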
41:06
Does that make sense?
41:11
Yeah.
41:11
OK.
41:12
So that was the second section on prompt engineering
41:15
or the first line of optimization.
41:19
Now, let's say you've exhausted all your chances
41:22
for prompt engineering, and you're
41:24
thinking about actually touching the model, modifying its weights
41:28
or fine tuning it in other words.
41:31
I was telling you, I'm not a fan of fine tuning.
41:34
There's a few reasons why.
41:37
One, it requires substantial labeled data typically
41:42
to fine tune.
41:43
Although now, there are approaches
41:46
that are getting better at fine tuning that
41:48
look more like few-shot prompting, actually, than fine tuning.
41:52
It's sort of merging.
41:54
Although one modifies the weight,
41:56
the other doesn't modify the weights.
41:57
Fine tuned models may also overfit to specific data.
42:01
We're going to see a funny example actually.
42:04
Losing their general purpose utility.
42:06
So you might fine tune a model.
42:08
And actually, when someone asks a pretty generic question,
42:11
it doesn't do well anymore.
42:12
It might do well on your task.
42:14
So it might be relevant or not.
42:15
And then it's time and cost-intensive.
42:17
That's my main problem.
42:19
And at Workera, we steer away from fine
42:24
tuning as much as possible.
42:26
Because by the time you're done fine tuning your model,
42:28
the next model is out, and it's actually
42:30
beating your fine tuned version of the previous model.
42:33
So I would steer away from fine tuning as much as you can.
42:36
The advantage of the prompt engineering methods we've seen
42:39
is you can put the next best pre-trained model directly
42:43
in your code.
42:44
It will update everything immediately.
42:46
Fine tuning doesn't work like that.
42:50
There are advantages though where it still makes sense.
42:53
If the task requires repeated high precision outputs
42:56
such as legal, scientific explanation
42:58
and if the general purpose LLM struggles
43:01
with domain-specific language.
43:03
So let's look at a quick example together,
43:07
which is an example from Ros Lazerowitz.
43:12
I think it was a couple of years ago, September 2023,
43:15
where Ros tried to do Slack fine tuning.
43:22
So he looked at a lot of Slack messages within his company.
43:26
And he was like, I'm going to fine tune
43:28
a model that speaks like us or operates like us because this
43:32
is how we work.
43:33
This is the data that represents how people work at the company.
43:37
And so he actually went ahead and fine tuned the model,
43:42
gave it a prompt, like, hey, write--
43:44
he was delegating to the model.
43:47
A 500-word blog post on prompt engineering.
43:50
And the model responded, I shall work on that in the morning.
43:55
And then he tries to push the model a little further and say,
44:00
it's morning now.
44:01
And the model said, I'm writing right now.
44:04
It's 6:30 AM here.
44:06
Write it now.
44:10
OK, I shall write it now.
44:12
I actually don't know what you would like me to say
44:14
about prompt engineering.
44:15
I can only describe the process.
44:17
The only thing that comes to mind for a headline
44:19
is how do we build prompt?
44:21
It's kind of a funny example for fine tuning because it's true
44:25
that it went wrong.
44:27
Like, he was thinking, I want
44:29
the model to speak like us at work.
44:32
And it ended up acting like people
44:34
and not actually following instructions.
44:40
So one example why I would steer away from fine tuning.
44:47
Super.
44:51
Let's talk about RAGs.
44:54
RAG is important.
44:55
It's important out there, and you should at least have the basics.
44:58
It's a very common interview question, by the way.
45:00
If you go interview for a job, they
45:02
might ask you to explain in a nutshell
45:04
to a five-year-old what is a RAG.
45:06
And hopefully after that, you'll be able to do it.
45:09
So we've seen some of the challenges with standalone LLMs.
45:14
Those challenges include the context window being small,
45:19
the fact that it's hard to remember details
45:21
within a large context window, knowledge gaps, cutoff dates,
45:26
you mentioned earlier.
45:28
The model might be trained up to a date,
45:29
and then it cannot follow the trends or be up to date.
45:33
Hallucinations.
45:34
There are some fields.
45:35
Think about medical diagnosis, where
45:37
hallucinations are very costly.
45:39
You can't afford a hallucination.
45:41
Even in education, imagine deploying a model for the US
45:45
youth education, and it hallucinates,
45:47
and it teaches millions of people something
45:49
completely wrong.
45:50
It's a problem.
45:52
And then lack of sources.
45:54
A lot of fields love sources.
45:57
Research fields love sources.
45:59
Education loves sources.
46:01
Legal loves sources as well.
46:04
And so the pre-trained LLM doesn't do a good job to source.
46:08
And in fact, if you have tried to find sources on a plain LLM,
46:13
it actually hallucinates a lot.
46:15
It makes up research papers.
46:16
It just lists like completely fake stuff.
46:20
So how do we solve that with a RAG?
46:23
RAG integrates with external knowledge sources, databases,
46:28
documents, APIs.
46:31
It ensures that answers are more accurate, up to date,
46:35
and grounded because you can actually update your document.
46:38
Your drive is always up to date.
46:40
I mean, ideally, you're always pushing new documents to it.
46:43
And when you query, what is our Q4 performance in sales?
46:47
Hopefully there is the last board deck in the drive,
46:51
and it can read the last board deck.
46:54
And more developer control.
46:56
We'll see why RAGs allow for targeted customization
47:00
without actually requiring the retraining of the model.
47:02
In fact, you don't touch the model with RAGs.
47:05
It's really a technique that is put on top of the model.
47:08
So to see an example of a RAG, this
47:11
is a question answering application where
47:16
we're in the medical field, and a user is asking a query,
47:21
what are the side effects of drug X?
47:26
This is an important question.
47:27
You can't hallucinate.
47:28
You need to source.
47:29
You need to be up to date.
47:31
Maybe there is a new update to that drug that
47:35
is now in the database, and you need to read that.
47:37
So a RAG is a great example of what you would want to use here.
47:41
The way it works is you have your knowledge
47:43
base of a bunch of documents.
47:46
What you do is you use an embedding
47:49
to embed those documents into lower
47:52
dimensional representations.
47:54
So for example, if the document is a PDF, a long PDF,
47:59
you might read the PDF, understand it,
48:02
and then embed it.
48:03
We've seen plenty of embedding approaches
48:05
together, triplet loss, et cetera, you remember?
48:09
So imagine one of them here for LLMs
48:11
is embedding those documents into lower representation.
48:15
If the representation is too small,
48:18
you will lose information.
48:19
If it's too big, you will add latency.
48:22
It's a tradeoff.
48:25
You will store typically those representations
48:28
into a database called a vector database.
48:31
There's a lot of vector database providers out there.
48:38
I think I've listed a couple that are very common.
48:41
No, I haven't listed, but I can share afterwards.
48:44
A vector database is essentially storing those vector
48:47
in a very efficient manner, allowing the fast retrieval
48:50
with a certain distance metric.
48:52
So what you do is you also embed, usually
48:56
with the same algorithm, the user prompts.
49:00
And you run a retrieval process, which is essentially
49:03
saying, based on the embedding from the user
49:07
query and the vector database, find the relevant documents
49:12
based on the distance between those embeddings.
49:15
Once you've found the relevant documents, you pull them,
49:18
and then you add them to the user query with a system prompt
49:22
or a prompt template on top.
49:24
So the prompt template can be: answer the user query
49:29
based on this list of documents.
49:32
If answer not in the documents, say I don't know.
49:36
That's your prompt templates where the user query is pasted,
49:40
the documents are pasted, and then
49:42
your output should be what you want because it's now
49:45
grounded in the documents.
49:47
You can also add to this prompt template.
49:50
Tell me the exact page, chapter, line
49:53
of the document that was relevant, and in fact,
49:55
link it as well, just to be more precise.
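A minimal sketch of this whole loop; the bag-of-words embedding is a toy stand-in for a real embedding model, and a real system would store the vectors in a vector database:

```python
import numpy as np

def call_llm(prompt: str) -> str:
    return "..."  # placeholder for a real chat-completion API call

documents = [
    "Drug X label: common side effects include headache and nausea.",
    "Drug Y label: dosage guidelines and interactions.",
]
vocab = sorted({w for d in documents for w in d.lower().split()})

def embed(text: str) -> np.ndarray:
    # Toy bag-of-words embedding, normalized so the dot product
    # below behaves like cosine similarity.
    vec = np.array([text.lower().count(w) for w in vocab], dtype=float)
    n = np.linalg.norm(vec)
    return vec / n if n else vec

doc_vectors = [embed(d) for d in documents]

def answer(query: str, k: int = 1) -> str:
    # Retrieval: embed the query and rank documents by similarity.
    q = embed(query)
    ranked = sorted(range(len(documents)),
                    key=lambda i: -float(q @ doc_vectors[i]))
    context = "\n".join(documents[i] for i in ranked[:k])
    # Grounded generation via the prompt template described above.
    return call_llm("Answer the user query based on these documents. "
                    "If the answer is not in the documents, say "
                    f"'I don't know'.\nDocuments:\n{context}\n"
                    f"Query: {query}")

print(answer("What are the side effects of drug X?"))
```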
50:02
Any question on RAGs?
50:03
This is a simple, vanilla RAG.
50:07
Yes.
50:09
Do document embeddings still retain information [INAUDIBLE]
50:15
Question is do the document embeddings still
50:18
retain the location of the information
50:21
within that document, especially in big documents?
50:24
Great question.
50:26
We'll get to it in a second.
50:27
Because you're right that the vanilla RAG
50:29
might not do a good job with very large documents.
50:32
So let's say, when you open a medication box
50:36
and you have this gigantic white paper with all the information,
50:41
and it's very long, maybe a vanilla RAG would not cut it.
50:45
So what people have figured out is a bunch
50:48
of techniques to improve RAGs.
50:49
And in fact, chunking is a great technique that is very popular.
50:53
So you might actually store in the vector database
50:55
the embedding of the full document.
50:57
And on top of that, you will also
50:59
store a chapter level vector.
51:02
And when you retrieve, you will retrieve the document.
51:04
You retrieve the chapter.
51:06
And that allows you to be more precise with the sourcing.
51:09
It's one example.
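A minimal sketch of that two-level indexing (the hash-bucket embedding is a toy stand-in for a real embedding model, and the list stands in for a vector database):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy hash-bucket embedding, for illustration only.
    vec = np.zeros(64)
    for word in text.lower().split():
        vec[hash(word) % 64] += 1.0
    n = np.linalg.norm(vec)
    return vec / n if n else vec

index = []  # would be a vector database in practice

def add_document(doc_id: str, text: str, chunk_size: int = 500):
    # Store a whole-document vector plus chapter/chunk-level vectors.
    index.append({"doc": doc_id, "chunk": None, "vec": embed(text)})
    for i in range(0, len(text), chunk_size):
        index.append({"doc": doc_id, "chunk": i // chunk_size,
                      "vec": embed(text[i:i + chunk_size])})

def retrieve(query: str):
    q = embed(query)
    best = max(index, key=lambda e: float(q @ e["vec"]))
    # Returns the document and the precise chunk, for precise sourcing.
    return best["doc"], best["chunk"]
```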
51:11
Another technique that's popular is HyDE.
51:16
Hypothetical document embeddings,
51:18
where a group of researchers published a paper
51:23
showing that when you get your user query,
51:26
one of the main problem is the user query
51:29
actually does not look like your documents.
51:32
For example, the user query might
51:34
be what are the side effects of drug X, when actually,
51:37
in the vector database,
51:40
the vectors represent very long documents.
51:43
So how do you guarantee that the query
51:44
embedding is going to be close to the document embedding?
51:47
What they do is they use the user query to generate
51:50
a fake hallucinated document.
51:53
They embed that document, and then they
51:56
compare it to the vector in the vector database.
52:01
That makes sense?
52:02
So for example, the user says what
52:04
is the side effect of drug X?
52:06
There's a prompt that this is given to another prompt that
52:09
says, based on this user query, generates a five-page report
52:13
answering the user query.
52:15
It generates potentially a completely fake answer.
52:20
You embed that, and it will be closer to the document
52:24
that you're looking for likely.
52:28
It's one example of a RAG approach.
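A minimal sketch of HyDE (toy embed and call_llm helpers; the generated report is deliberately allowed to be partly wrong, since it is only used to move the query into document space):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Same toy embedding idea as above, for illustration only.
    vec = np.zeros(64)
    for word in text.lower().split():
        vec[hash(word) % 64] += 1.0
    n = np.linalg.norm(vec)
    return vec / n if n else vec

def call_llm(prompt: str) -> str:
    return "Report: common side effects of drug X include..."  # stub

documents = ["Drug X label: side effects include headache and nausea."]

def hyde_retrieve(query: str) -> str:
    # Step 1: generate a hypothetical, possibly hallucinated document.
    fake_doc = call_llm("Based on this user query, generate a detailed "
                        "report answering it:\n" + query)
    # Step 2: embed the fake document instead of the short query,
    # then retrieve the real document closest to it.
    q = embed(fake_doc)
    return max(documents, key=lambda d: float(q @ embed(d)))
```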
52:31
Again, the purpose of this lecture
52:33
is not to go through all of this tree and explain
52:36
to you every single method that has been discovered for RAGs.
52:38
But I just wanted to show you how much research
52:40
has been done between 2020 and 2025 in RAGs
52:44
and how many branches of research you now have
52:47
that you can learn from.
52:50
The survey paper is linked in the slides, by the way,
52:52
and I'll share them after the lecture.
53:01
Super.
53:05
So we've made some progress.
53:08
Hopefully now, you feel if you were
53:10
to start an LLM application, you know how to do better prompts.
53:14
You know how to do chains.
53:15
You know how to do fine tuning.
53:17
You also know how to do retrieval.
53:19
And you have the background of techniques
53:20
that you can go and read and find the code base,
53:23
pull the code, vibe code it.
53:24
But you have the breadth now.
53:30
The next set of topics we're going to see
53:34
is around the question of how could we
53:36
extend the capabilities of LLMs from performing single tasks,
53:40
enhanced with external knowledge,
53:42
to handling multi-step, autonomous workflows?
53:47
And this is where we get into proper agentic AI.
53:53
So let's talk about agentic AI workflows
53:56
towards autonomous and specialized systems.
54:00
Then we'll talk about evals.
54:01
Then we'll see multi-agent systems.
54:03
And we'll end with a little thoughts on what's next in AI.
54:11
So Andrew Ng actually coined the term agentic AI workflows.
54:20
And his reason was that a lot of companies just say agents.
54:25
Agents, agents everywhere, agents everywhere.
54:28
If you go and work at these companies,
54:30
you would notice that they mean very different things by agents.
54:33
Some people actually have a prompt,
54:34
and they call it an agent.
54:36
Other people, they have a very complex multi-agent system,
54:41
they call it an agent.
54:42
And so calling everything an agent doesn't do it justice.
54:45
So Andrew says let's call it agentic workflows.
54:49
Because in practice, it's a bunch of prompts with tools,
54:53
with additional resources, API calls
54:57
that ultimately are put in a workflow,
54:59
and you can call that workflow agentic.
55:02
So it's all about the multi-step process to complete a task.
55:11
Also, calling it agentic workflow
55:13
allows us to not mix it up with what
55:14
I called agent, in the last lecture,
55:17
with reinforcement learning.
55:19
Because in RL, agent has a very specific definition,
55:22
interacts with an environment, passes from one state
55:24
to the other, has a reward and an observation.
55:26
You remember that chart, right?
55:32
So here's an example of how we move from a one step
55:35
prompt to a multi-step agentic workflow.
55:39
Let's say a user queries a product chatbot:
55:44
What is your refund policy?
55:48
And the response, using a RAG, says
55:51
refunds are available within 30 days of purchase,
55:53
and maybe the RAG can even link to the policy document.
55:57
That's what we learned so far.
55:59
Instead, an agentic workflow can function like this.
56:04
The user says, can I get a refund for my order?
56:07
And the response via the agentic workflow
56:11
is the agent retrieves the refund policy using a RAG.
56:14
The agent then follows up with the users and says,
56:17
can you provide your order number?
56:19
Then the agent queries an API to check the order details.
56:23
And finally, it comes back to the user
56:25
and confirms your order qualifies for a refund.
56:28
The amount will be processed in three to five business days.
56:31
This is much more thoughtful than the first version,
56:33
which is sort of vanilla.
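A minimal sketch of that multi-step workflow; the policy retrieval, the order API, and all names here are hypothetical stubs:

```python
def call_llm(prompt: str) -> str:
    return "..."  # placeholder for a real chat-completion API call

def retrieve_refund_policy() -> str:
    # RAG step, stubbed: would retrieve the policy document.
    return "Refunds are available within 30 days of purchase."

def get_order(order_number: str) -> dict:
    # Tool/API call, stubbed: would query the order system.
    return {"order": order_number, "days_since_purchase": 12}

def handle_refund_request(ask_user) -> str:
    policy = retrieve_refund_policy()                        # step 1
    number = ask_user("Can you provide your order number?")  # step 2
    order = get_order(number)                                # step 3
    # Step 4: ground the final answer in the policy and order details.
    return call_llm(f"Policy: {policy}\nOrder details: {order}\n"
                    "Tell the user whether the order qualifies for a "
                    "refund and when the amount will be processed.")

reply = handle_refund_request(ask_user=lambda q: "ORDER-123")  # demo stub
```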
56:37
So that's what we're going to talk
56:39
about in the next couple of slides,
56:40
is how do we get from the first one to the second one?
56:46
There are plenty of specialized agentic workflows online.
56:50
You've heard, and if you hang out in SF,
56:52
you probably see a bunch of billboards: AI software
56:55
engineer, the AI skills mentor you've
56:57
interacted with in the class through Workera,
56:59
AI SDR, AI lawyers, AI specialized cloud engineer.
57:08
It would be a stretch to say that everything works,
57:10
but there's work being done towards that.
57:17
I'm not personally a fan of putting
57:19
a face behind those things.
57:20
I think it's gimmicky.
57:21
And I think in a few years from now, actually,
57:24
very few products will have a human face behind it,
57:27
but it might be a marketing tactic from some startups.
57:32
It's more scary than it is engaging, frankly.
57:35
OK.
57:36
I want to talk about the paradigm shift.
57:38
That's especially useful.
57:40
Let's say you're a software engineer
57:41
or you're planning to be a software engineer.
57:43
Because software engineering as a discipline
57:45
is sort of shifting.
57:47
Or at least the best engineers I've
57:49
worked with are able to move from a deterministic mindset
57:53
to a fuzzy mindset and balance between the two
57:57
whenever they need to get something done.
57:58
So here's the paradigm shift between traditional software
58:01
and agentic AI software.
58:04
The first one is the way you handle data.
58:07
Traditional software deals with structured data.
58:10
You have JSONs.
58:11
You have databases.
58:12
They're pasted in a very structured manner
58:15
in a data engineering pipeline.
58:17
And then there used to be displayed
58:19
on a certain interface.
58:21
The user might fill a form that is then retrieved and pasted
58:24
in the database.
58:25
All of that historically has been structured data.
58:28
Now, more and more companies are handling free form text, images,
58:34
and all of that requires dynamic interpretation to transform
58:39
an input into an output.
58:41
The software itself used to be deterministic.
58:45
Now you have a lot of software that is fuzzy.
58:47
And fuzzy software creates so many issues.
58:51
I mean, imagine if you let your user ask anything
58:54
on your website.
58:56
The chances that it breaks are tremendous.
58:58
The chances that you're attacked are tremendous.
59:00
The chances-- it's really, really complicated.
59:03
It's more complicated than people make it seem on Twitter.
59:07
Fuzzy engineering is truly hard.
59:09
You might get hate as a company because one user did something
59:14
that you authorized them to do that ended up breaking
59:16
the database and ended up--
59:18
we've seen that with many companies
59:19
in the last couple of years.
59:21
So it takes a very specialized engineering mindset
59:23
to do fuzzy engineering, but also
59:25
know when you need to be deterministic.
59:29
The other thing I'd call out is, with agentic AI software,
59:33
you want to think about your software the way a manager thinks about a team.
59:39
So you're familiar with the monolith or microservices
59:44
approaches in software, where you structure your software
59:48
in different boxes that can talk to each other,
59:51
and it allows teams to debug one section at a time.
59:55
Now the equivalent with agentic AI is you think as a manager.
59:59
So you think, OK, if I was to delegate my product
1:00:02
to be done by a group of humans, what would be those roles?
1:00:06
Would I have a graphic designer that then puts together a chart
1:00:09
and then sends it to a marketing manager that converts it
1:00:12
into a nice blog post, that then gives it to the performance
1:00:15
marketing expert, that then publishes the work, the blog
1:00:18
post, and then optimizes and A/B tests?
1:00:20
Then to a data scientist that analyzes the data
1:00:23
and then puts hypotheses and validates
1:00:25
them or invalidates them.
1:00:27
That's how you would typically think if you're building
1:00:29
agentic AI software.
1:00:32
When actually, the equivalent of that in traditional software
1:00:35
might be completely different.
1:00:37
It might be: we have a data engineering box
1:00:39
right here that handles all our data engineering.
1:00:42
And then here, we have the UI/UX stuff.
1:00:45
Everything UI/UX related goes here.
1:00:47
And companies might structure it in very different ways.
1:00:51
And here is the business logic that we want to care about.
1:00:53
And there's five engineers working on the business logic,
1:00:56
let's say.
1:00:59
OK.
1:01:01
Testing and debugging is also very different.
1:01:04
And we'll talk about it in the next section.
1:01:09
The other thing that I feel matters
1:01:13
is with AI in engineering, the cost of experimentation
1:01:17
is going down drastically.
1:01:19
And so people, I feel, should be more comfortable
1:01:22
throwing away code.
1:01:23
It's like in traditional software engineering,
1:01:27
you probably don't throw away code a ton.
1:01:29
You build a code, and it's solid, and it's bulletproof,
1:01:32
and then you update it over time.
1:01:35
We've seen AI companies be more comfortable throwing away
1:01:39
code, which has advantages in terms of the speed at which you
1:01:43
move but also disadvantages in terms
1:01:46
of the quality of your software that can break more.
1:01:52
So anyway, just wanted to do an update on the paradigm shift
1:01:56
from deterministic to fuzzy engineering.
1:02:04
Oh, and actually, I can give you an example from Workera
1:02:08
that we learned probably over the last 12
1:02:11
months. If you've used Workera,
1:02:13
you might have seen that the interface sometimes asks you
1:02:18
multiple choice questions.
1:02:19
And sometimes, it asks you multiple select.
1:02:21
And sometimes, it asks you drag and drop, ordering, matching,
1:02:24
whatever.
1:02:25
Those are examples of deterministic item types,
1:02:28
meaning you answer the question on a multiple choice.
1:02:31
There is one correct answer.
1:02:32
It's fully deterministic.
1:02:34
On the other hand, you sometimes have voice questions,
1:02:38
where you go to a role play or you
1:02:40
have voice plus coding questions,
1:02:42
where your code is being read by the interface or whatever.
1:02:45
Those are fuzzy, meaning the scoring algorithm
1:02:49
might actually make mistakes, and those mistakes
1:02:52
might be costly.
1:02:53
And so companies have to figure out
1:02:56
a human in the loop system, which
1:02:58
you might have seen with the appeal feature at the end.
1:03:00
So at the end of the assessment, you have an appeal feature where
1:03:03
you can say, I want to appeal the agent
1:03:06
because I want to challenge what the agent said on my answer
1:03:09
because I thought I was better than what the agent thought.
1:03:12
And then you bring the human in the loop that
1:03:14
then can fix the agent, can tell the agent, actually,
1:03:16
you were too harsh on the answer of this person.
1:03:20
And that's an example of a fuzzy engineered system
1:03:24
that then adds a human in the loop to make it more aligned.
1:03:28
And so if you're building a company,
1:03:29
I would encourage you to think about what can I
1:03:32
get done with determinism?
1:03:33
And let's get that done.
1:03:35
And then the fuzzy stuff, I want to do fuzzy
1:03:38
because it allows more interaction.
1:03:39
It allows more back and forth, but I need
1:03:42
to put guardrails around it.
1:03:43
And how am I going to design those guardrails?
1:03:45
Pretty much.
1:03:46
OK?
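To make that concrete, here is a toy sketch of deterministic guardrails wrapped around one fuzzy LLM call. The topic allow-list, the length cap, and llm_answer are all hypothetical choices, not a recipe.

    ALLOWED_TOPICS = {"refund", "shipping", "order"}

    def llm_answer(message: str) -> str:
        # Stand-in for the fuzzy part: a real LLM call.
        return "Stubbed model answer about: " + message

    def guarded_answer(user_message: str) -> str:
        # Deterministic guardrail 1: cap input size before calling the model.
        if len(user_message) > 2000:
            return "Your message is too long. Please shorten it."
        # Deterministic guardrail 2: only route on-topic questions to the LLM.
        if not any(topic in user_message.lower() for topic in ALLOWED_TOPICS):
            return "I can help with refunds, shipping, and orders."
        draft = llm_answer(user_message)
        # Deterministic guardrail 3: never leak internal markers to the user.
        if "INTERNAL" in draft:
            return "Sorry, something went wrong. A human will follow up."
        return draft

    print(guarded_answer("Where is my refund?"))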
1:03:49
Here's another example from enterprise workflows,
1:03:54
which are likely to change due to agentic AI.
1:03:57
This is a paper from McKinsey, I believe from last year,
1:04:01
where they looked at a financial institution, and they said,
1:04:05
we observed that they often spend one to four weeks
1:04:07
to create a credit risk memo.
1:04:10
And here's the process.
1:04:11
A relationship manager gathers data from 15
1:04:16
or more sources on the borrower,
1:04:19
loan type, and other factors.
1:04:22
Then the relationship manager and the credit analyst
1:04:25
collaboratively analyze that data from these sources.
1:04:28
Then the credit analyst typically spends 20 hours
1:04:33
or more writing a memo and then goes back
1:04:36
to the relationship manager.
1:04:37
They give feedback, and then they go through this loop
1:04:40
again and again.
1:04:41
And it takes a long time to get a credit memo out.
1:04:46
And then they ran a research study, where they changed the process.
1:04:50
They said gen AI agents could actually cut time by 20% to 60%
1:04:56
on credit risk memos.
1:04:58
And the process changed: the relationship manager
1:05:01
works directly with the Gen AI agent system,
1:05:03
provides the relevant materials it needs to produce the memo.
1:05:07
The agent subdivides the project into tasks
1:05:10
that are assigned to specialist agents,
1:05:12
gathers and analyzes the data from multiple sources,
1:05:15
drafts a memo.
1:05:16
Then the relationship manager and the credit analyst
1:05:19
sit down together, review the memo,
1:05:20
give feedback to the agent.
1:05:22
And they are done in 20% to 60% less time.
1:05:26
And so this is an example where you're actually not changing
1:05:30
the human stakeholders.
1:05:31
You're just changing the process and adding
1:05:33
Gen AI to reduce the time it takes to get a credit memo out.
1:05:38
It turns out that, imagine you're an enterprise,
1:05:42
and you have 100,000 employees, and there's a lot of enterprises
1:05:47
with 100,000 employees out there.
1:05:50
You are currently under crisis in terms
1:05:52
of redesigning your workflows.
1:05:55
It turns out that if you actually
1:05:57
pull the job descriptions from the HR system
1:06:00
and you interpret them, you also pull
1:06:02
the business process workflows that you
1:06:04
have encoded in your drive.
1:06:07
You actually can find gains in multiple places.
1:06:10
And in the next few years, you're
1:06:12
probably going to see workflows being
1:06:14
more optimized to add Gen AI.
1:06:17
Even if that happens, the hardest part is changing people.
1:06:20
What we know, this is great in theory, but now,
1:06:23
let's try to fit that second workflow for 10,000 credit
1:06:28
risk analysts and relationship managers.
1:06:31
My guess is it will take years.
1:06:33
It will take 10, 20 years to get to this being actually done
1:06:37
at scale within an organization.
1:06:40
Because change is so hard.
1:06:42
It's so hard to rewire businesses, workflows, job descriptions,
1:06:47
incentivize people to do things differently and be different,
1:06:50
and train them.
1:06:50
And so this is what the world is going towards,
1:06:55
but it's going to take a long time I think.
1:06:59
OK.
1:07:00
Then I want to talk about how the agent actually works
1:07:02
and what are the core components of an agent.
1:07:07
Imagine a travel booking agent. That's
1:07:10
an easy example you've all thought about.
1:07:12
I still haven't been able to get an agent to book a trip for me,
1:07:16
or I was scared because it was going to book
1:07:18
a very expensive or long trip.
1:07:20
But in theory, you can have a travel booking
1:07:24
agent that has prompts.
1:07:26
So the prompts we've seen, we know the methods
1:07:28
to optimize those prompts.
1:07:30
That travel agent also has a context management system,
1:07:34
which is essentially the memory of what it knows about the user.
1:07:38
That context management system might
1:07:40
include a core memory or working memory and an archival memory,
1:07:45
OK?
1:07:46
What the difference is within memory
1:07:51
is not every memory needs to be fast to access.
1:07:54
Think about it.
1:07:56
You're onboarded on a product, and the first question is hi,
1:07:59
what's your name?
1:08:00
And I say, my name is Kian.
1:08:02
That's probably going to sit in the working memory
1:08:05
because the agent, every time it talks to me,
1:08:07
is going to want to use my name.
1:08:08
But then maybe the second question
1:08:10
is what's your birthday?
1:08:12
And I give it my birthday.
1:08:13
Does it need my birthday every day?
1:08:15
Probably not.
1:08:16
So it's probably going to park it on the long term
1:08:18
memory or the archival memory.
1:08:20
And those memories are slower to access.
1:08:24
They're farther down the stack.
1:08:26
And that structure allows the agent
1:08:28
to determine what's the working memory,
1:08:30
and what's the long term memory?
1:08:33
And that makes it easier for the agent to retrieve super fast.
1:08:36
Because think about it.
1:08:37
When you interact with ChatGPT, you
1:08:39
feel that it's very personal at times.
1:08:41
You feel like it understands you.
1:08:43
Imagine every time you call it, it has to read the memories.
1:08:47
And that can be costly.
1:08:48
It's a very burdensome cost because it happens
1:08:52
every time you talk to it.
1:08:54
So you want to be highly optimized with the working
1:08:57
memory.
1:08:59
If it takes three seconds to look
1:09:00
in the memory, every time you're going to talk to your LLM,
1:09:03
it's going to take three seconds, which you don't want.
1:09:06
Anyway.
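Here is one way the working-versus-archival split could look in code. This is a sketch of the access pattern only; real context-management systems are far more sophisticated, and the capacity and search logic below are arbitrary.

    class AgentMemory:
        def __init__(self, working_capacity: int = 5):
            self.working: dict[str, str] = {}   # injected into every prompt
            self.archive: dict[str, str] = {}   # searched only on demand
            self.capacity = working_capacity

        def remember(self, key: str, value: str, hot: bool = False):
            if hot and len(self.working) < self.capacity:
                self.working[key] = value       # e.g., the user's name
            else:
                self.archive[key] = value       # e.g., the user's birthday

        def prompt_context(self) -> str:
            # Cheap path: this string is prepended to every LLM call.
            return "; ".join(f"{k}={v}" for k, v in self.working.items())

        def recall(self, query: str) -> list[str]:
            # Slow path: naive key search standing in for vector search.
            return [v for k, v in self.archive.items() if query in k]

    mem = AgentMemory()
    mem.remember("name", "Kian", hot=True)
    mem.remember("birthday", "January 1")
    print(mem.prompt_context())    # name=Kian
    print(mem.recall("birthday"))  # ['January 1']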
1:09:06
And then you have the tools.
1:09:08
The tools can include APIs like a flight search
1:09:11
API, hotel booking API, car rental API, weather API,
1:09:15
and then the payment processing API.
1:09:18
And typically, you would want to tell your agent
1:09:21
how that API works.
1:09:23
It turns out that agents or LLMs, I should say,
1:09:27
are very good at reading API documentation.
1:09:29
So you give it the API documentation,
1:09:31
and it reads the JSON, and it reads,
1:09:33
what a GET request looks like.
1:09:35
And this is the format that I need to push.
1:09:38
And then it pushes it in that format, let's say.
1:09:41
And then it retrieves something.
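For instance, the flight search tool might be described to the model with a schema like the one below. This follows the common function-calling shape many providers use, but the exact field names vary by vendor, and search_flights here is just a stub.

    flight_search_tool = {
        "name": "search_flights",
        "description": "Find flights between two airports on a given date.",
        "parameters": {
            "type": "object",
            "properties": {
                "origin": {"type": "string", "description": "IATA code, e.g. SFO"},
                "destination": {"type": "string", "description": "IATA code, e.g. JFK"},
                "date": {"type": "string", "description": "YYYY-MM-DD"},
            },
            "required": ["origin", "destination", "date"],
        },
    }

    def search_flights(origin: str, destination: str, date: str) -> list[dict]:
        # Stand-in for the real GET request to the flight API.
        return [{"flight": "XY123", "origin": origin,
                 "destination": destination, "date": date, "price_usd": 420}]

    # The model reads the schema, emits the arguments, and your code dispatches:
    args = {"origin": "SFO", "destination": "JFK", "date": "2025-12-15"}
    print(search_flights(**args))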
1:09:45
Does that make sense, those different components?
1:09:49
Anthropic also talks about resources.
1:09:51
Resources is data that is sitting somewhere that you
1:09:55
might let your agent read.
1:09:57
For example, if you're building your startup, you have a CRM.
1:10:00
A CRM has data in it, and you want to do lookups in that data.
1:10:05
You will probably give a lookup tool,
1:10:07
and you will give access to the resource,
1:10:10
and it will do lookups whenever you want super fast.
1:10:16
This type of architecture can be built
1:10:19
with different degrees of autonomy,
1:10:21
from the least autonomous to the most autonomous.
1:10:23
And I'll give you a few examples.
1:10:26
Less autonomous would be you've hard coded the steps.
1:10:29
So let's say I tell the travel agent first identify the intent.
1:10:35
Then look up in the database the history
1:10:39
of this customer with us and their preferences.
1:10:42
Then go to the flight API, blah, blah, blah.
1:10:45
Then go to the--
1:10:45
I would hard code the steps.
1:10:47
OK.
1:10:48
That's the least autonomous.
1:10:50
The semi-autonomous is I might hard code the tools,
1:10:54
but we're not going to hard code the steps.
1:10:57
So I'm going to tell the agent, you act like a travel agent.
1:11:02
And your task is to help the person book a travel.
1:11:10
And these are the tools that you have accessible to yourself.
1:11:13
And so I'm not hard coding the steps.
1:11:14
I'm just hard coding the tools that you have access
1:11:17
to for yourself.
1:11:18
The most autonomous is the agent decides the steps
1:11:22
and can create the tools.
1:11:24
So that's where you might give actually access
1:11:26
to a code editor, to the agent.
1:11:28
And the agent might actually be able to ping any API in the web,
1:11:33
perform some web search.
1:11:34
It might even be able to create some code
1:11:37
to display data to the user.
1:11:39
It might even be able to perform some calculations.
1:11:42
Like oh, I'm going to calculate the fastest route
1:11:44
to get from San Francisco to New York,
1:11:48
and which one might be the most appropriate
1:11:50
for what the user is looking for.
1:11:52
And then I want to calculate the distance between the airport
1:11:54
and that hotel versus that hotel.
1:11:56
And I'm going to write code to do that.
1:11:58
So it's actually fully autonomous
1:12:00
from that perspective.
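To contrast the first two levels, here is a toy sketch: the same three stub tools, first with hard-coded steps, then with a chooser deciding the order one step at a time. scripted_chooser stands in for the LLM's decision.

    TOOLS = {
        "lookup_preferences": lambda s: {**s, "prefs": "direct flights"},
        "search_flights":     lambda s: {**s, "flights": ["XY123"]},
        "book":               lambda s: {**s, "booked": True},
    }

    def hardcoded_agent(state: dict) -> dict:
        # Least autonomous: the engineer fixes the step order.
        for step in ["lookup_preferences", "search_flights", "book"]:
            state = TOOLS[step](state)
        return state

    def semi_autonomous_agent(state: dict, choose_next_action) -> dict:
        # Semi-autonomous: tools are fixed, but the model picks the order.
        for _ in range(10):  # hard cap so a confused model cannot loop forever
            action = choose_next_action(state, list(TOOLS))
            if action == "done":
                break
            state = TOOLS[action](state)
        return state

    def scripted_chooser(state, tools):
        # Stand-in for the LLM's reasoning about what to do next.
        if "prefs" not in state:
            return "lookup_preferences"
        if "flights" not in state:
            return "search_flights"
        if "booked" not in state:
            return "book"
        return "done"

    print(hardcoded_agent({"user": "book me SFO to JFK"}))
    print(semi_autonomous_agent({"user": "book me SFO to JFK"}, scripted_chooser))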
1:12:05
So yeah.
1:12:07
Remember those keywords.
1:12:08
Memory, prompts, tools, et cetera.
1:12:14
Now, I presented the flight API, but it does not
1:12:18
have to be an API.
1:12:19
You probably have heard the term MCP or model context protocol
1:12:23
that was coined by Anthropic.
1:12:25
I pasted the seminal article on MCP at the bottom of this slide.
1:12:29
But let me explain in a nutshell why those things would differ.
1:12:34
In the API case, you would actually
1:12:39
teach your LLM to ping an API.
1:12:42
So you would say this is how you ping this API,
1:12:45
and this is the data that it will send you back.
1:12:48
And you would have to do that in a one off manner.
1:12:51
So you would have to build or give
1:12:53
the API documentation of your flight API.
1:12:56
You're booking hotel API, your car rental API.
1:13:00
And then you would give tools for your model
1:13:03
to communicate with those APIs.
1:13:06
It doesn't scale very well versus MCP.
1:13:11
MCP, it's really about putting a system in the middle that
1:13:19
would make it simpler for your LLM to communicate
1:13:22
with that endpoint.
1:13:23
So for instance, you might have an MCP server and an MCP client,
1:13:28
where you're trying to communicate
1:13:30
with that travel database or the flight API over MCP.
1:13:35
And your agent might actually just communicate with it
1:13:38
and say, hey, what do you need in order to give me more flight
1:13:42
information?
1:13:43
And that agent will respond: I would like you to tell me
1:13:47
what the flight's origin is, what the destination is,
1:13:51
and what you're looking for at a high level.
1:13:51
This is my requirement.
1:13:52
OK.
1:13:52
Let me get back to you with my requirement.
1:13:55
Oh.
1:13:55
You forgot to tell me your budget, whatever.
1:13:57
Oh.
1:13:58
Let me give you my budget, et cetera.
1:14:00
And it's agent to agent communication,
1:14:04
which allows more scalability.
1:14:06
You don't need to hard code everything.
1:14:09
Companies have displayed their MCPs out there,
1:14:11
and your agent can communicate with them
1:14:14
and figure out how to get the data it needs.
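Here is a toy illustration of that list-then-call, back-and-forth shape. This deliberately does not use the real MCP SDK (the actual protocol speaks JSON-RPC over stdio or HTTP); it just mimics the idea that the server advertises its tools and their requirements instead of you hard-coding each API's format.

    class ToyMCPServer:
        def list_tools(self) -> list[dict]:
            # The server advertises what it offers and what each tool needs.
            return [{"name": "search_flights",
                     "required": ["origin", "destination", "budget_usd"]}]

        def call(self, name: str, args: dict) -> dict:
            missing = [k for k in ("origin", "destination", "budget_usd")
                       if k not in args]
            if missing:
                # The "you forgot to tell me your budget" exchange above.
                return {"error": f"missing fields: {missing}"}
            return {"flights": ["XY123"], "within_budget": True}

    server = ToyMCPServer()
    print(server.list_tools())
    print(server.call("search_flights",
                      {"origin": "SFO", "destination": "CDG"}))
    print(server.call("search_flights",
                      {"origin": "SFO", "destination": "CDG", "budget_usd": 800}))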
1:14:16
Does that make sense?
1:14:18
Yeah.
1:14:21
[INAUDIBLE] rewriting any [INAUDIBLE]
1:14:36
I think it is, ultimately.
1:14:39
The question is, isn't it just shifting the issue?
1:14:41
Because anyway, if an API has to be updated,
1:14:43
the MCP has to be updated, is what you say, right?
1:14:45
Yes, that's correct.
1:14:46
But at least it allows the agent to go back and forth
1:14:51
and figure out what the requirements are.
1:14:52
But at the end of the day, ideally, if you're a startup,
1:14:56
you have some documentation.
1:14:57
And automatically, you have an agent or an LLM workflow
1:15:00
that reads that documentation and updates the code
1:15:03
accordingly.
1:15:04
But I agree.
1:15:05
It's not something that is fully autonomous.
1:15:08
Yeah.
1:15:09
I've seen some security issues.
1:15:12
Why is that possible?
1:15:14
Which security specifically?
1:15:16
[INAUDIBLE]
1:15:18
Yeah.
1:15:19
So are there security issues with MCPs?
1:15:23
So think about it this way.
1:15:25
MCPs, depending on the data that you get access to,
1:15:28
might have different requirements, lower stake
1:15:30
or higher stake.
1:15:31
I'm not an expert at the full range.
1:15:34
But it wouldn't surprise me that when you expose an MCP to--
1:15:42
I think a lot of MCPs would have authentication.
1:15:45
So you might actually need a code
1:15:47
to actually talk to it, just like you would with an API,
1:15:50
or a key.
1:15:52
Yeah, but that's a good question.
1:15:53
I'm not an expert at the security of these systems,
1:15:56
but we can look into it.
1:16:02
Any other questions on what we've
1:16:04
seen with the agentic workflows, APIs, tools, MCPs, memory?
1:16:10
All of that is under progress.
1:16:11
So even memory is not a solved problem by any means.
1:16:14
It's pretty hard actually.
1:16:16
Yes.
1:16:18
You don't need an [INAUDIBLE] The MCP just
1:16:24
makes it easier to access the API, but technically,
1:16:28
[INAUDIBLE]
1:16:40
Exactly, exactly.
1:16:42
Is MCP about efficiency or accessing more data?
1:16:45
It's about efficiency.
1:16:47
Let's say you have a coding agent, and it has an MCP client,
1:16:53
and there's multiple MCP servers that are exposed out there.
1:16:57
That agent can communicate very efficiently with them
1:17:00
and find what it needs.
1:17:03
And it's a more efficient process
1:17:05
than exposing APIs and documenting, on the other side,
1:17:09
how to ping them and what the protocol is.
1:17:12
But it's not about the data that is
1:17:13
being exposed because ultimately, you control
1:17:15
the data that is being exposed.
1:17:19
Depending on how the MCP is built,
1:17:22
my guess is you probably expose yourself to other risks
1:17:24
because your MCP server can receive pretty much any input
1:17:31
from another LLM.
1:17:32
And so it has to be robust.
1:17:36
But yeah.
1:17:37
Super.
1:17:39
So let's look at an example of a step
1:17:41
by step workflow for the travel agent.
1:17:45
So let's say the user says, I want to plan a trip to Paris
1:17:50
from December 15 to 20th with flights,
1:17:56
hotels near the Eiffel Tower, and then an itinerary of must
1:18:00
visit places.
1:18:01
That's the task to the travel agent.
1:18:04
Step two, the agent plans the steps.
1:18:06
So it says, I'm going to find flights.
1:18:08
Use the flight search API to get options for December 15.
1:18:12
Search hotels, generate recommendations for places
1:18:15
to visit, validate preferences, budget, et cetera.
1:18:20
Book the trip with the payment processing API.
1:18:24
That's just the planning, by the way.
1:18:25
Step three, execute the plan, use your tools,
1:18:28
combine the results, and then proactive
1:18:31
user interaction and booking.
1:18:33
It might make a first proposal to the user
1:18:35
and ask the user to validate or invalidate
1:18:38
and then may repeat that planning and execution process.
1:18:42
And then finally, it might actually update the memory.
1:18:46
It might say, oh, I just learned through this interaction
1:18:49
that the user only likes direct flights.
1:18:51
Next time, I'll only give direct flights.
1:18:55
Or I noticed users are fine with three star hotels or four star
1:19:01
hotels.
1:19:01
And in fact, they don't want to go above budget or something
1:19:05
like that.
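Compressed into code, that plan, execute, confirm, and update-memory loop could look like the sketch below. plan_trip and execute are hypothetical stubs for LLM and tool calls, and the memory update is just a log entry.

    def plan_trip(request: str) -> list[str]:
        # Stand-in for the LLM's planning step.
        return ["search_flights", "search_hotels", "draft_itinerary"]

    def execute(step: str) -> str:
        # Each step would call a real tool or API.
        return f"result of {step}"

    def travel_agent(request: str, memory: dict, ask_user) -> dict:
        proposal = [execute(step) for step in plan_trip(request)]
        answer = ask_user(f"Here is my proposal: {proposal}. OK? (yes/no + why)")
        if answer.startswith("no"):
            # A real agent would re-plan with this feedback; we just store it.
            memory["learned"] = answer   # e.g., "no, only direct flights"
        return memory

    memory = travel_agent(
        "Paris, December 15 to 20, hotel near the Eiffel Tower",
        memory={},
        ask_user=lambda q: "no, only direct flights",
    )
    print(memory)   # {'learned': 'no, only direct flights'}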
1:19:08
So that hopefully makes sense by now on how you might do that.
1:19:11
My question for you is, how would you know if this works?
1:19:16
And if you had such a system running in production, how
1:19:19
would you improve it?
1:19:28
Yeah.
1:19:28
Let users rate their experience.
1:19:31
So that's an example.
1:19:33
So let users rate their experience at the end.
1:19:37
That would be an end to end test, right?
1:19:39
You're looking at the user experience through the steps
1:19:42
and say how good was it from 1 to 5, let's say.
1:19:46
Yeah.
1:19:46
It's a good way.
1:19:47
And then if you learn that a user says 1,
1:19:50
how do you improve the workflow?
1:19:56
[INAUDIBLE]
1:19:59
OK.
1:19:59
So you would go down a tree and say, OK, you said 1.
1:20:04
What was your issue?
1:20:06
And then the user says the prices were too high, let's say.
1:20:10
And then you would go back and fix that specific tool or prompt
1:20:14
or, yeah, OK.
1:20:15
Any other ideas?
1:20:18
[INAUDIBLE]
1:20:29
Yeah, good.
1:20:29
So that's a good insight.
1:20:30
Separate the LLM related stuff from the non-LLM related stuff,
1:20:34
the deterministic stuff.
1:20:35
The deterministic stuff, you might
1:20:36
be able to fix it more objectively essentially.
1:20:41
Yeah.
1:20:43
What else?
1:20:56
So give me an example of an objective issue
1:21:00
that you can notice and how you would fix it
1:21:03
versus a subjective issue.
1:21:06
Yeah.
1:21:06
[INAUDIBLE]
1:21:16
So let's say you say there's the same flight,
1:21:19
but one is cheaper than the other, let's say.
1:21:21
It's objectively worse.
1:21:23
And so you can capture that almost automatically.
1:21:25
Yeah.
1:21:26
So you could actually build evals
1:21:27
that are objective, that are tracked across your users.
1:21:32
And you might actually run an analysis after
1:21:34
and see that for the objective stuff,
1:21:37
we notice that our LLM AI agent workflow is bad with pricing.
1:21:43
It just doesn't read price as well because it always
1:21:46
gives a more expensive option.
1:21:48
Yeah.
1:21:48
You're perfectly right.
1:21:49
How about the subjective stuff?
1:21:59
Do you choose a direct or indirect flight
1:22:01
if the indirect is a little bit cheaper?
1:22:05
Yeah.
1:22:05
Good one.
1:22:06
Do you choose a direct flight or an indirect flight
1:22:09
if the indirect is cheaper but the direct is more comfortable?
1:22:12
Yeah.
1:22:13
That's a good one actually.
1:22:16
So how would you capture that information?
1:22:18
Let's say this is used by thousands of users.
1:22:24
Could you feed something in [INAUDIBLE]
1:22:28
Could you feed something in?
1:22:30
Yeah, I mean, you could--
1:22:32
could feed something in about the user preferences?
1:22:36
Well, you could build a data set that
1:22:39
has some of that information.
1:22:40
So you build 10 prompts, where the user is asking specifically
1:22:44
for a direct--
1:22:46
saying that I prefer direct flights because I
1:22:48
care about my time, let's say.
1:22:50
And then you look at the output and you actually
1:22:53
give a good example of a good output,
1:22:56
and you probably are able to capture
1:22:58
the performance of your agentic workflow on this specific eval.
1:23:04
Does it prioritize?
1:23:05
Does it understand price conscious--
1:23:07
is it price conscious, essentially,
1:23:08
and comfort conscious?
1:23:10
Yeah.
1:23:13
What about the tone?
1:23:14
Let's say the LLM right now is not very friendly.
1:23:18
How would you notice that, and how would you fix it?
1:23:26
Yeah.
1:23:26
Have the test user run the prompt
1:23:29
and see if there's something wrong with that.
1:23:33
OK.
1:23:33
Have a test user run the prompt and see if there's
1:23:36
something wrong with that.
1:23:37
Tell me about the last step.
1:23:38
How would you notice that something is wrong?
1:23:40
So a couple of tests [INAUDIBLE] evaluates
1:23:48
the response and [INAUDIBLE]
1:23:51
Yeah.
1:23:52
I agree with your approach.
1:23:53
Have LLM judges that evaluate the response
1:23:55
against a certain rubric of what politeness looks like.
1:23:58
So here in this case, you could actually
1:24:00
start with error analysis.
1:24:02
So you start, you have 1,000 users.
1:24:05
And you can pull up 20 user interactions
1:24:07
and read through it.
1:24:09
And you might notice, at first sight,
1:24:11
the LLM seems to be very rude.
1:24:14
It's just super, super short in its answers,
1:24:18
and it's not very helpful.
1:24:20
You notice that with your error analysis manually.
1:24:23
Then you go to the next stage.
1:24:24
You actually put evals behind it.
1:24:26
You say, I'm going to create a set of LLM judges
1:24:33
that are going to look at the user interaction
1:24:35
and are going to rate how polite it is.
1:24:38
And I'm going to give it a rubric.
1:24:40
Then what I'm going to do is I'm going to flip my LLM.
1:24:42
Instead of using GPT-4, I'm going to use Grok.
1:24:45
And instead of using Grok, I'm using Llama.
1:24:48
And then I'm going to run those three LLMs side by side,
1:24:51
give it to my LLM judges, and then get my subjective score
1:24:56
at the end to say, oh, x model was more polite on average.
1:25:02
Yeah.
1:25:02
Perfectly right.
1:25:03
That's an example of an eval that is very specific
1:25:05
and allows you to choose between LLMs.
1:25:07
You could actually do the same eval not across LLMs,
1:25:10
but fix the LLM and change the prompt.
1:25:12
You actually, instead of saying act like a travel agent,
1:25:15
you say act like a helpful travel agent.
1:25:17
And then you see the influence of that word on your eval
1:25:21
with the LLM as judges.
1:25:22
Does that make sense?
1:25:24
OK.
1:25:25
Super.
1:25:26
So let's move forward and do a case study with evals.
1:25:29
And then we're almost done for today.
1:25:33
Let's say your product manager asks you to build an AI
1:25:38
agent for customer support, OK?
1:25:41
Where do you start?
1:25:42
And here is an example of the user prompt.
1:25:45
I need to change my shipping address for order, blah, blah,
1:25:48
blah.
1:25:48
I move to a new address.
1:25:51
So where do you start if I'm giving you that project?
1:26:04
Yes.
1:26:05
We search online for existing models and [INAUDIBLE]
1:26:16
So do some research.
1:26:17
See benchmarks and how different models
1:26:20
perform at customer support.
1:26:22
And then pick a model.
1:26:23
That's what you mean.
1:26:24
Yeah.
1:26:24
It's true you could do that.
1:26:25
What else could you do?
1:26:28
Yeah.
1:26:28
[INAUDIBLE]
1:26:34
OK.
1:26:34
Yeah, I like that.
1:26:35
Try to decompose the different tasks that it will need
1:26:39
and try to guess which ones will be more of a struggle, which
1:26:42
ones should be fuzzy, which ones should be deterministic.
1:26:45
Yeah, you're right.
1:26:46
[INAUDIBLE]
1:26:55
Yeah.
1:26:56
Similar to what you said.
1:26:58
That's what I would recommend as well.
1:27:00
You say I would sit down with a customer support
1:27:02
agent for a day or two, and I would decompose the tasks
1:27:04
that are going through.
1:27:05
I will ask them, where do they struggle?
1:27:07
How much time does it take?
1:27:08
Yes.
1:27:09
That's usually where you want to start with task decomposition.
1:27:12
So let's say we've done that work, and we have this list.
1:27:16
I'm simplifying.
1:27:17
But the human customer support agent typically
1:27:20
would extract key info, then look up
1:27:23
in the database to retrieve the customer record.
1:27:25
Then check the policy.
1:27:27
Are we allowed to update the address,
1:27:29
or is it a fixed data point?
1:27:32
And then draft a response email and send it.
1:27:35
So we've decomposed that task.
1:27:39
Once you've decomposed that task,
1:27:42
how do you design your agentic workflow?
1:28:03
Yes.
1:28:04
[INAUDIBLE]
1:28:17
Exactly.
1:28:18
So to repeat, you're going to look
1:28:20
at the decomposition of tasks, get an instinct of what's fuzzy,
1:28:24
what's deterministic, and then determine
1:28:28
which line is going to be an LLM one shot, which one will require
1:28:33
maybe a RAG, which one will require a tool, which one will
1:28:36
require memory, which one--
1:28:38
So you will start designing that map.
1:28:41
Completely right.
1:28:41
That's also what I would recommend.
1:28:43
You might actually draft it and say, OK, I take the user prompt.
1:28:48
And the first step of my task decomposition
1:28:52
was to extract information, which seems to be a vanilla LLM task.
1:28:57
You can guess that the vanilla LLM would probably
1:29:00
be good enough at extracting the user wants
1:29:03
to change their address, and this is the order number
1:29:05
and this is the new address.
1:29:06
You probably don't need too much technology
1:29:08
there other than the LLM.
1:29:11
The next step, it feels like you need a tool because you're
1:29:14
actually going to have to look up in the database
1:29:17
and also update the address.
1:29:21
So that might be a tool, and you might
1:29:23
have to build a custom tool for the LLM
1:29:25
to say, let me connect you to that database
1:29:27
or let me give you access to that resource with an MCP.
1:29:32
After that, you probably need an LLM again to draft the email,
1:29:35
but you would probably paste in the confirmation.
1:29:38
You would paste the confirmation that your address
1:29:40
has been updated from x to y.
1:29:42
And then the LLM will draft an answer.
1:29:44
And of course, just to not forget,
1:29:46
you might need a tool to send the email.
1:29:49
You might actually need to post something
1:29:54
for the email to go out.
1:29:57
And then you'll get the output.
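Laid out as code, that design might look like the sketch below, with the fuzzy steps (extraction, drafting) separated from the deterministic ones (lookup, update). The regex extractor and in-memory database are stand-ins for an LLM call and a real tool.

    import re

    DATABASE = {"A1": {"address": "old address"}}

    def extract_fields(message: str) -> dict:
        # Fuzzy in production (LLM extraction); a regex stands in here.
        order = re.search(r"order\s+#?(\w+)", message, re.I)
        return {"order_id": order.group(1) if order else None}

    def update_address(order_id: str, new_address: str) -> bool:
        # Deterministic: lookup, policy check, and write.
        if order_id not in DATABASE:
            return False
        DATABASE[order_id]["address"] = new_address
        return True

    def draft_email(order_id: str, new_address: str) -> str:
        # Fuzzy in production; note the confirmation is pasted in, not guessed.
        return f"Your address for order {order_id} is now: {new_address}."

    def handle(message: str, new_address: str) -> str:
        fields = extract_fields(message)
        if not fields["order_id"] or not update_address(fields["order_id"], new_address):
            return "Sorry, I could not find that order."
        return draft_email(fields["order_id"], new_address)  # then a send-email tool

    print(handle("I need to change my shipping address for order #A1", "1 Main St"))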
1:29:59
Does that make sense? So, exactly what you described.
1:30:02
Now moving to the next step.
1:30:03
Once we have decomposed our tasks,
1:30:06
then we have designed an agentic workflow around it.
1:30:09
It took us five minutes.
1:30:10
In practice, it would take you more
1:30:12
if you're building your startup on that.
1:30:13
You want to make sure your task decomposition is accurate,
1:30:15
your design is accurate here, and then
1:30:17
there's a lot of work to be done on every tool
1:30:20
to optimize it for latency and cost.
1:30:22
But let's say, now we want to know if it works.
1:30:27
And I'm going to assume that you have LLM traces.
1:30:30
LLM traces are very important.
1:30:33
Actually, if you're interviewing with an AI startup,
1:30:36
I would recommend you in the interview process to ask them,
1:30:39
do you have LLM traces?
1:30:40
Because if they don't have LLM traces,
1:30:42
it is pretty hard to debug an LLM system because you don't
1:30:46
have visibility on the chain of complex prompts that were called
1:30:50
and where the bug is.
1:30:52
And so it's a basic part of an AI startup
1:30:57
stack to have LLM traces.
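At its core, a trace is just a structured log of every model call in a chain. Real tracing tools add spans, token counts, and dashboards; this bare-bones sketch only shows the idea, with model_fn as a stand-in for your actual LLM call.

    import time, uuid

    TRACES: list[dict] = []

    def traced_llm_call(step: str, prompt: str, model_fn) -> str:
        start = time.time()
        output = model_fn(prompt)   # your actual LLM call goes here
        TRACES.append({
            "trace_id": str(uuid.uuid4()),
            "step": step,
            "prompt": prompt,
            "output": output,
            "latency_s": round(time.time() - start, 3),
        })
        return output

    fake_model = lambda p: "stub output"
    traced_llm_call("extract_info", "Extract the order id from ...", fake_model)
    traced_llm_call("draft_email", "Draft a polite reply ...", fake_model)
    for t in TRACES:
        print(t["step"], t["latency_s"])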
1:31:00
So let's assume you have traces.
1:31:02
How would you know if your system works?
1:31:04
I'm going to summarize some of the things I heard earlier.
1:31:11
You gave us an example of an end to end metric.
1:31:15
You look at the user satisfaction at the end.
1:31:18
You can also do a component-based approach
1:31:21
where you actually will look at the tool, the database updates,
1:31:25
and you will manually do an error analysis and see,
1:31:28
oh, the tool actually always forgets to update the email.
1:31:32
It just fails at writing.
1:31:33
And I'm going to fix that.
1:31:34
This is deterministic pretty much.
1:31:37
Or when it tries to send the email
1:31:40
and ping the system that is supposed to send the email,
1:31:44
it doesn't send it in the right format.
1:31:46
And so it bugs at that point.
1:31:48
Again, you could fix that.
1:31:51
Draft of the email.
1:31:52
The LLM doesn't do a great job.
1:31:53
It's not very polite at drafting the email.
1:31:56
So you could look at component by component,
1:31:59
and it's actually easier to debug than to look at it
1:32:01
end to end.
1:32:02
You would probably do a mix of both.
1:32:05
Another way to look at it is what is objective
1:32:08
versus what is subjective?
1:32:10
So for example, an objective example
1:32:12
would be the LLM extracted the wrong order ID.
1:32:18
The user said my order ID is X, and the LLM,
1:32:21
when it actually looked up in the database,
1:32:24
it used the wrong order ID.
1:32:26
This is objectively wrong.
1:32:27
You can actually write Python code
1:32:29
that checks that, checks just the alignment between what
1:32:32
the user mentioned and what was actually passed to the database
1:32:36
or for the lookup.
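For example, assuming order IDs look like a letter followed by digits, that check could be as small as this; in production you would extract the stated ID more carefully and compare it against what the trace shows was sent to the database.

    import re

    def order_id_matches(user_message: str, queried_order_id: str) -> bool:
        # Hypothetical ID format: one letter followed by digits, e.g. X42.
        stated = re.search(r"\b([A-Za-z]\d+)\b", user_message)
        return bool(stated) and stated.group(1) == queried_order_id

    assert order_id_matches("My order ID is X42.", "X42")
    assert not order_id_matches("My order ID is X42.", "X43")  # flag as a failure
    print("order-ID eval passed")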
1:32:38
You also have subjective stuff, which we talked about,
1:32:40
where you probably want to do either human rating or LLM
1:32:43
as judges.
1:32:44
It's very relevant for subjective evals.
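Here is what an LLM-as-judge eval for the politeness case might look like. judge_llm is a hypothetical model call, and the rubric and 1-to-5 scale are choices you would tune; the pattern is simply rubric plus reply in, score out, averaged over a test set.

    RUBRIC = ("Rate the assistant reply for politeness from 1 (rude) to 5 "
              "(very polite). Reply with only the number.")

    def judge_llm(prompt: str) -> str:
        return "4"   # stand-in for a real model call

    def politeness_score(reply: str) -> int:
        return int(judge_llm(f"{RUBRIC}\n\nReply to rate:\n{reply}"))

    # Compare two candidate models (or prompts) on the same replies:
    replies_a = ["No.", "Not possible."]
    replies_b = ["Happy to help! Let me check that for you."]
    avg = lambda rs: sum(map(politeness_score, rs)) / len(rs)
    print(avg(replies_a), avg(replies_b))  # pick the higher average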
1:32:49
And finally, you will find yourself
1:32:51
having quantitative evals and more qualitative evals.
1:32:55
So quantitative would be percentage of successful address
1:32:59
updates.
1:33:00
The latency.
1:33:00
You could actually track the latency component-based
1:33:03
and see which one is the slowest.
1:33:05
Let's say sending the email is five seconds.
1:33:08
It's too long, let's say.
1:33:10
You would notice it component-based or across the full workflow.
1:33:13
And then you will decide, where am I optimizing my latency,
1:33:15
and how am I going to do that?
1:33:17
And then finally, qualitative.
1:33:20
You might actually do some error analysis
1:33:23
and look at where are the hallucinations?
1:33:27
Where are the tone mismatches?
1:33:31
Are the users confused, and what are they confused by?
1:33:34
That would be more qualitative.
1:33:36
And typically, it would take more white glove approaches
1:33:41
to do that.
1:33:42
So here's what it could look like.
1:33:44
I gave you some examples.
1:33:46
But you would build evals to determine
1:33:50
objectively, subjectively, component-based, end
1:33:53
to end based, and then quantitatively and
1:33:55
qualitatively, where's your LLM failing
1:33:57
and where it's doing well.
1:34:02
Does that give you a sense of the type of stuff
1:34:04
you could do to fix or improve that agentic workflow?
1:34:09
Super.
1:34:10
Well, that was our case study on evals.
1:34:12
We're not going to delve deeper into it.
1:34:14
But hopefully, it gave you a sense of the type of stuff
1:34:16
you can do with LLM judges, with objective,
1:34:21
subjective, component-based, end to end, et cetera.
1:34:25
Last section on multi-agent workflows.
1:34:29
So you might ask, hey, why do we need multi-agent workflows when
1:34:36
the workflow already has multiple steps,
1:34:38
already calls the LLM multiple times, and already gives it tools?
1:34:42
Why do we need multiple agents?
1:34:45
And so many people are talking about multi-agent systems online.
1:34:47
It's not even a new thing, frankly.
1:34:49
Multi-agent systems have been around for a long time.
1:34:52
The main advantage of a multi-agent system
1:34:55
is going to be parallelism.
1:34:57
It's like is there something that I
1:34:59
wish I would run in parallel, sort of independently,
1:35:04
but maybe there are some things in the middle?
1:35:07
But that's where you want to put a multi-agent system.
1:35:09
It's when it's parallel.
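The parallelism argument in one sketch: three independent stub agents run concurrently with asyncio, so the wall-clock time is roughly one sleep instead of three. In practice each coroutine would be a full LLM workflow of its own.

    import asyncio

    async def research_agent(topic: str) -> str:
        await asyncio.sleep(1)   # stands in for a slow LLM and tool chain
        return f"notes on {topic}"

    async def main():
        results = await asyncio.gather(
            research_agent("flights"),
            research_agent("hotels"),
            research_agent("restaurants"),
        )
        print(results)   # about 1 second total instead of about 3

    asyncio.run(main())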
1:35:12
The other advantage that some companies
1:35:14
have with multi-agent systems is an agent can be reused.
1:35:19
So let's say in a company, you have an agent that's
1:35:21
been built for design.
1:35:22
That agent can be used in the marketing team,
1:35:25
and it can be used in the product team.
1:35:27
And so now you're optimizing an agent,
1:35:30
which has multiple stakeholders that can communicate with it
1:35:33
and benefit from its performance.
1:35:38
Actually I'm going to ask you a question
1:35:40
and take a few, maybe a minute to think about it.
1:35:43
Let's say you were building smart home
1:35:46
automation for your apartment or your home.
1:35:50
What agents would you want to build?
1:35:52
Yeah.
1:35:53
Write it down.
1:35:54
And then I'm going to ask you in a minute
1:35:57
to share some of the agents that you will build.
1:36:00
Also, think about how you would put
1:36:03
a hierarchy between these agents,
1:36:04
or how you would organize them, or who
1:36:06
should communicate with who.
1:36:07
OK?
1:36:08
OK.
1:36:08
Take a minute for that.
1:36:12
Be creative also because I'm going to ask all of your agents,
1:36:14
and maybe you have an agent that nobody has thought of.
1:36:21
OK.
1:36:22
Let's get started.
1:36:24
Who wants to give me a set of agents
1:36:26
that you would want for your smart home?
1:36:29
Yes.
1:36:32
The first is like a set of agents [INAUDIBLE]
1:37:00
OK.
1:37:01
So let me repeat.
1:37:02
You have four agents, I think, roughly.
1:37:05
One that tracks biometrics, like where are you in the home?
1:37:09
Where are you moving?
1:37:10
How you're moving, things like that.
1:37:12
That sort of knows your location.
1:37:15
The second one determines the temperature of the rooms
1:37:21
and has the ability to change it.
1:37:23
The third one tracks energy efficiency
1:37:26
and might give feedback on energy and energy usage.
1:37:31
And might be, I don't know, maybe
1:37:32
it has the control over the temperature as well.
1:37:34
I don't know actually.
1:37:35
Or the gas or the water, might cut your water at some point.
1:37:43
And then you have an orchestrator agent.
1:37:44
What is exactly the orchestrator doing?
1:37:48
It passes instructions [INAUDIBLE]
1:37:53
OK.
1:37:53
Passes instructions.
1:37:55
So is that the agent that communicates mainly
1:37:58
with the user?
1:38:00
So if I'm coming back home and I'm
1:38:02
saying I want the oven to be preheated,
1:38:05
I communicate with the orchestrator,
1:38:07
and then it would funnel to another agent.
1:38:09
OK.
1:38:10
Sounds good.
1:38:11
Yeah.
1:38:11
So that's an example of, I want to say,
1:38:14
a hierarchical agentic multi-agent system.
1:38:20
What else?
1:38:21
Any other ideas?
1:38:22
What would you add to that?
1:38:24
Yeah.
1:38:25
[INAUDIBLE]
1:38:55
Oh, I like that.
1:38:56
That's a really good one.
1:38:57
So let me summarize.
1:38:58
You have a security agent that determines if you can enter
1:39:02
or not.
1:39:03
And when you enter, it understands who you are.
1:39:06
And then it gives you certain sets
1:39:08
of permissions that might be different depending
1:39:11
of if you're a parent or a kid.
1:39:13
Or you might have access to certain cars and not others.
1:39:17
Or your kid cannot open the fridge, or I don't know.
1:39:20
Something like that.
1:39:21
Yeah.
1:39:22
OK, I like that.
1:39:23
That's a good one.
1:39:24
And it does feel like it's a complex enough workflow where
1:39:28
you want a specific workflow tied to that.
1:39:32
I agree.
1:39:34
What else?
1:39:39
Yes.
1:39:41
[INAUDIBLE] So you can get more complicated.
1:39:43
So high energy savings, with whether or not you
1:39:50
or someone else controls the blinds in the house, or also
1:39:55
when you tap into the grid.
1:39:57
Yeah. So another thought I have as well is much harder
1:40:04
to track in the grocery store.
1:40:06
But understanding what's in your fridge.
1:40:08
OK
1:40:12
Well, that's really good actually.
1:40:14
So you mentioned two of them.
1:40:16
One is maybe an agent that has access to external APIs that
1:40:20
can understand the weather out there, the wind, the sun,
1:40:24
and then has control over certain devices at home.
1:40:28
Temperature, blinds, things like that, and also understands
1:40:31
your preferences for it.
1:40:33
That does feel like it's a good use case because you could give
1:40:36
that to the orchestrator, but it might lose itself
1:40:38
because it's doing too much.
1:40:41
And also, these problems are tied together,
1:40:43
like temperature outdoor with the weather API
1:40:45
might influence the temperature inside,
1:40:48
how you want it, et cetera.
1:40:50
And then the second one, which I also like,
1:40:52
is you might have an agent that looks at your fridge
1:40:55
and what's inside.
1:40:57
And it might actually have access
1:40:58
to the camera in the fridge, for example,
1:41:01
and know your preferences and also has
1:41:03
access to the e-commerce API to order
1:41:06
Amazon groceries ahead of time.
1:41:09
I agree.
1:41:10
And maybe the orchestrator will be the communication line
1:41:12
with the user, but it might communicate with that agent
1:41:16
in order to get it done.
1:41:17
Yeah.
1:41:18
I like those.
1:41:19
So those are all really good examples.
1:41:21
Here is the list I had up there.
1:41:25
So climate control, lighting security, energy management,
1:41:30
entertainment, notification agent,
1:41:32
alerts about the system updates, energy saving, and orchestrator.
1:41:35
So all of them you mentioned actually.
1:41:38
And then we didn't talk about the different interaction
1:41:41
patterns, but you do have different ways to organize
1:41:45
a multi-agent system.
1:41:46
Flat, hierarchical.
1:41:48
It sounds like this would be hierarchical.
1:41:51
I agree.
1:41:52
And the reason is UI/UX: I would rather
1:41:55
have to only talk to the orchestrator,
1:41:57
rather than have to go to a specialized application
1:42:00
to do something.
1:42:01
Like it feels like the orchestrator
1:42:02
could be responsible for that.
1:42:04
And so I agree, I would probably go for a hierarchical setup
1:42:07
here.
1:42:08
But maybe you might also add some connections
1:42:11
between other agents, like in the flat system
1:42:13
where it's all to all.
1:42:15
For example, with climate control and energy,
1:42:17
if you want to connect those two,
1:42:19
you might actually allow them to speak with each other.
1:42:21
When you allow agents to speak with each other,
1:42:24
it is basically an MCP protocol, by the way.
1:42:26
So you treat the agent like a tool, exactly like a tool.
1:42:30
Here is how you interact with this agent.
1:42:32
Here is what it can tell you.
1:42:34
Here is what it needs from you, essentially.
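As a sketch, the hierarchical pattern is just routing: the user talks only to the orchestrator, which picks a specialist and forwards the message. route stands in for an LLM routing decision, and each specialist here is a stub that would be its own agent in practice.

    AGENTS = {
        "climate":  lambda msg: "setting the living room to 21C",
        "security": lambda msg: "front door locked",
        "fridge":   lambda msg: "added milk to the grocery order",
    }

    def route(message: str) -> str:
        # Stand-in for the orchestrator LLM choosing a specialist.
        for name in AGENTS:
            if name in message.lower():
                return name
        return "climate"   # arbitrary default for this toy example

    def orchestrator(message: str) -> str:
        return AGENTS[route(message)](message)

    print(orchestrator("check the fridge before I go shopping"))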
1:42:37
OK super.
1:42:38
And then without going into the details,
1:42:40
there are advantages to multi-agent workflows
1:42:43
versus single agents, such as debugging.
1:42:47
It's easier to debug a specialized agent
1:42:50
than to debug an entire system.
1:42:52
Parallelization as well.
1:42:54
It's easier to have things run in parallel,
1:42:56
and you can save time.
1:42:59
There are some advantages to doing that,
1:43:01
and I'll leave you with this slide if you want to go deeper.
1:43:04
Super.
1:43:05
So we've learned so many techniques to optimize LLMs,
1:43:08
from prompts to chains to fine tuning, retrieval,
1:43:12
and to multi-agent system as well.
1:43:14
And then just to end on a couple of trends I want you to watch.
1:43:19
I think next week is Thanksgiving, is that it?
1:43:21
It's Thanksgiving break.
1:43:22
No, the week after.
1:43:23
OK.
1:43:24
Well ahead of the Thanksgiving break.
1:43:26
So if you're traveling, you can think about these things.
1:43:29
What's next is in AI, I wanted to call out a couple of trends.
1:43:34
So Ilya Sutskever, one of the OGs of LLMs and OpenAI
1:43:40
co-founder, raised the question of whether we are plateauing or not.
1:43:45
The question is, are we going to see in the coming years LLMs sort
1:43:50
of not improve as fast as we've seen in the past?
1:43:54
It's been the feeling in the community
1:43:56
probably that the last version of GPT
1:44:00
did not bring the level of performance
1:44:03
that people were expecting, although it did make
1:44:06
it so much easier to use for consumers because you don't need
1:44:09
to interact with different models.
1:44:10
It's all under the same hood.
1:44:12
So it seems that it's progressing,
1:44:14
but the plateau is unclear.
1:44:17
The way I would think about it is the LLM scaling laws tell us
1:44:22
that if we continue to improve compute and energy,
1:44:26
then LLMs should continue to improve.
1:44:28
But at some point, it's going to plateau.
1:44:29
So what's going to take us to the next step?
1:44:32
It's probably architecture search.
1:44:35
Still, a lot of LLMs, even if we don't
1:44:36
fully understand what's under the hood, are probably
1:44:38
transformer-based today.
1:44:40
But we know that the human brain does not operate the same way.
1:44:43
There's just certain things that we
1:44:45
do that are much more efficient, much faster.
1:44:47
We don't need as much data.
1:44:49
So theoretically, we have so much
1:44:51
to learn in terms of architecture search
1:44:53
that we haven't figured out.
1:44:54
It's not a surprise that you see those labs hire
1:44:57
so many engineers.
1:44:58
Because it is possible that in the next few years,
1:45:01
you're going to have thousands of engineers trying
1:45:03
to figure out the different engineering hacks and tactics
1:45:06
and architectural searches that are
1:45:07
going to lead to better models.
1:45:10
And one of them suddenly will find the next transformer,
1:45:13
and it will reduce by 10x the need for compute and the need
1:45:17
for energy.
1:45:18
It's sort of like if you read Isaac Asimov's Foundation series.
1:45:24
Individuals can have an amazing impact on the future because
1:45:27
of their decisions.
1:45:29
Whoever discovered transformers had a tremendous impact
1:45:33
on the direction of AI.
1:45:34
I think we're going to see more of that in the coming
1:45:37
years, where some group of researchers that is iterating
1:45:40
fast might discover certain things that would suddenly
1:45:43
unlock that plateau and take us to the next step,
1:45:45
and it's going to continue to improve like that.
1:45:47
And so it doesn't surprise me that there's so many companies
1:45:50
hiring engineers right now to figure out
1:45:52
those hacks and those techniques.
1:45:56
The other set of gains that we might see
1:45:58
is from multi-modality.
1:45:59
So the way to think about it is we've had LLMs first text-based,
1:46:04
and then we've added imaging.
1:46:06
And today, models are very good at images.
1:46:09
They're very good at text.
1:46:10
It turns out that being good at images and being good at text
1:46:13
makes the whole model better.
1:46:15
So the fact that you're good at understanding a cat image
1:46:18
makes you better at text as well for a cat.
1:46:21
Now you add another modality like audio or video.
1:46:24
The whole system gets better.
1:46:26
So you're better at writing about a cat
1:46:28
if you know what a cat sounds like,
1:46:30
if you can look at a cat on an image as well.
1:46:31
Does that make sense?
1:46:32
So we see gains that are translated from one modality
1:46:35
to another, and that might lead in the pinnacle of robotics
1:46:38
where all these modalities come together.
1:46:40
And suddenly, the robot is better at
1:46:42
running away from a cat because it understands
1:46:44
what a cat is, what it sounds like,
1:46:46
what it looks like, et cetera.
1:46:48
That makes sense?
1:46:49
The other one is multiple methods working in harmony.
1:46:53
In the Tuesday lectures, we've seen supervised learning,
1:46:56
unsupervised learning, self-supervised learning,
1:46:58
reinforcement learning, prompt engineering, RAGs, et cetera.
1:47:02
If you look at how babies learn, it
1:47:06
is probably a mix of those different approaches.
1:47:09
Like a baby might have some meta learning, meaning it
1:47:13
has some survival instinct that is
1:47:16
encoded in the DNA most likely.
1:47:19
And that's like the baby's pre-training, if you will.
1:47:22
On top of that, the mom or the dad is pointing at stuff
1:47:27
and saying bad, good, bad, good.
1:47:29
Supervised learning.
1:47:30
On top of that, the baby is falling on the ground
1:47:33
and getting hurt.
1:47:34
And that's a reward signal for reinforcement learning.
1:47:36
On top of that, the baby is observing other people
1:47:39
doing stuff or other babies doing
1:47:42
stuff, unsupervised learning.
1:47:43
You see what I mean?
1:47:44
We're probably a mix of all these methods,
1:47:47
and I think that's where the trend is going, is
1:47:49
where those methods that you've seen in CS230
1:47:52
come together in order to build an AI system that learns fast,
1:47:56
is low latency, is cheap, energy-efficient,
1:48:00
and makes the most out of all of these methods.
1:48:03
Finally, and this is especially true at Stanford,
1:48:06
you have research going on that you would consider human-centric
1:48:11
and some research that is non-human centric.
1:48:13
By human-centric, I should say human approaches
1:48:16
that are modeled after the brain and approaches that
1:48:19
are not modeled after humans.
1:48:20
Because it turns out that the human body is very limiting.
1:48:24
And so if you actually only do research
1:48:26
on what the human brain looks like,
1:48:28
you're probably missing out on compute and energy and stuff
1:48:30
like that that you can optimize even
1:48:32
beyond neuronal connections in the brain,
1:48:35
but you still can learn a lot from the human brain.
1:48:37
And that's why there are professors that are running labs
1:48:40
right now that try to understand,
1:48:42
how does back propagation work for humans?
1:48:45
And in fact, it's probably that we don't have back propagation.
1:48:48
We don't use back propagation, we only do forward propagation,
1:48:51
let's say.
1:48:51
So this type of stuff is interesting research
1:48:54
that I would encourage you to read if you're curious
1:48:56
about the direction of AI.
1:48:59
And then finally, one thing that's going to be pretty clear,
1:49:02
I call it out all the time, but it's the velocity
1:49:05
at which things are moving.
1:49:06
You're noticing, part of the reason
1:49:08
we're giving you a breadth in CS230
1:49:10
is because these methods are changing so fast.
1:49:12
So I don't want to bother going and teaching you
1:49:15
method number 17 on RAG that
1:49:17
optimizes the RAG because in two years,
1:49:19
you're not going to need it.
1:49:20
So I would rather you think about what
1:49:23
is the breadth of things you want to understand.
1:49:25
And when you need it, you are sprinting and learning
1:49:27
the exact thing you need faster, because the half-life of a skill
1:49:30
is so short.
1:49:31
You want to come out of the class with a good breadth
1:49:34
and then have the ability to go deep whenever
1:49:36
you need after the class.
1:49:38
And so that's sort of how that class is designed as well.
1:49:41
Yeah.
1:49:41
That's it for today.
1:49:43
So thank you.
1:49:45
Thank you for participating.
— end of transcript —