Stanford CS230 | Autumn 2025 | Lecture 8: Agents, Prompts, and RAG
Stanford Online · May 10, 2026
Transcript
0:05
Hi, everyone.
0:06
Welcome to another lecture
for CS230 Deep Learning.
0:11
Today, we're going to talk about
enhancing large language model
0:17
applications.
0:19
And I call this
lecture Beyond LLM.
0:23
It has a lot of newer content.
0:26
And the idea behind
this lecture is
0:31
we started to learn
about neurons,
0:34
and then we learned
about layers,
0:35
and then we learned about
deep neural networks,
0:38
and then we learned a little bit
about how to structure projects
0:43
in C3.
0:44
And now we're going one level
beyond into, what would it
0:48
look like if you were building
agentic AI systems at work,
0:54
in a startup, in a company?
0:58
And it's probably one of
the more practical lectures.
1:02
Again, the goal is
not to build a product
1:05
end to end in the
next hour or so,
1:07
but rather to tell
you all the techniques
1:09
that AI engineers have cracked,
figured out, are exploring,
1:15
so that after the class,
you have the breadth of view
1:18
of different
prompting techniques,
1:20
different agentic workflows,
multi-agent systems, evals.
1:25
And then when you
want to dive deeper,
1:26
you have the baggage to
dive deeper and learn faster
1:29
about it.
1:32
Let's try to make it as
interactive as possible, as
1:36
usual.
1:37
When we look at the
agenda, the agenda
1:40
is going to start with the
core idea behind challenges
1:45
and opportunities
for augmenting LLMs.
1:48
So we start from a base model.
1:50
How do we maximize the
performance of that base model?
1:55
Then we'll dive deep into the
first line of optimization,
1:59
which is prompting methods, and
we'll see a variety of them.
2:02
Then we'll go slightly deeper.
2:04
If we were to get our
hands under the hood
2:06
and do some fine tuning,
what would it look like?
2:09
I'm not a fan of fine tuning,
and I talk a lot about that,
2:12
but I'll explain why I try to
avoid fine tuning as much as
2:16
possible.
2:18
And then we'll do a section 4 on
Retrieval-Augmented Generation,
2:22
or RAG, which you've probably
heard of in the news.
2:26
Maybe some of you
have played with RAGs.
2:28
We're going to
unpack what a RAG is
2:31
and how it works and then the
different methods within RAGs.
2:36
And then we'll talk about
agentic AI workflows.
2:40
I'll define it.
2:42
Andrew Ng is one
of the first ones
2:45
to have called this trend
agentic AI workflows.
2:49
And so we look at the
definition that Andrew
2:51
gives to agentic
workflows, and then we'll
2:54
start seeing examples.
2:56
The section 6 is very practical.
2:59
It's a case study where we will
think about an agentic workflow,
3:05
and I'll ask you to measure
if the agent actually works,
3:10
and we brainstorm
how we can measure
3:13
if an agentic
workflow is working
3:15
the way you want it to work.
3:16
There's plenty of methods called
evals that solve that problem.
3:22
And then we'll look briefly
at multi-agent workflow.
3:24
And then we can have a
open-ended discussion
3:27
where I share some thoughts
on what's next in AI.
3:31
And I'm looking forward
to hearing from you all,
3:34
as well, on that one.
3:36
So let's get started with the
problem of augmenting LLMs.
3:42
So open-ended question for you--
3:44
you are all familiar
with pre-trained models
3:47
like GPT-3.5 Turbo or GPT-4o.
3:52
What's the limitation of
using just a base model?
3:56
What are the typical
issues that might
3:59
arise as you're using a
vanilla pre-trained model?
4:07
Yes.
4:08
It lacks some domain knowledge.
4:10
Lacks some domain knowledge.
4:11
You're perfectly right.
4:13
We had a group of
students a few years ago.
4:16
It was not LLM related, but
they were building an autonomous
4:22
farming device or vehicle that
had a camera underneath, taking
4:26
pictures of crops to
determine if the crop is
4:30
sick or not, if it
should be thrown away,
4:32
if it should be used or not.
4:35
And that data set is not a
data set you find out there.
4:40
And the base model or
pre-trained computer vision
4:44
model would lack that
knowledge, of course.
4:47
What else?
4:49
Yes.
4:50
[INAUDIBLE] pictures are
very dark [INAUDIBLE]
4:57
OK, maybe the-- you're saying--
4:59
so just to repeat
for people online,
5:02
you're saying the model
might have been trained
5:04
on high-quality data,
but the data in the wild
5:06
is actually not
that high quality.
5:08
And in fact, yes, the
distribution of the real world
5:11
might differ, as we've seen with
GANs, from the training set,
5:16
and that might create an
issue with pre-trained models.
5:18
Although pre-trained
LLMs are getting better
5:20
at handling all
sorts of data inputs.
5:25
Yes.
5:26
Lacks current information.
5:28
Lack what?
5:28
Current information.
5:30
Lacks current information.
5:32
The LLM is not up to date.
5:34
And in fact, you're right.
5:35
Imagine you have to retrain
from scratch your LLM
5:38
every couple of months.
5:39
One story that I found funny--
5:42
it's from probably three years
ago, or maybe more, five years
5:45
ago, where during
his first presidency,
5:49
President Trump one
day tweeted, "Covfefe."
5:53
You remember that tweet or no?
5:56
Just "Covfefe."
5:57
And it was probably a typo
or it was in his pocket.
5:59
I don't know.
6:00
But that word did not exist.
6:03
The LLMs, in fact, that
Twitter was running at the time
6:06
could not recognize that word.
6:08
And so the recommender
system sort of went wild,
6:11
because suddenly everybody was
making fun of that tweet using
6:15
the word "Covfefe," and the LLM
was so confused on, what does
6:19
that mean?
6:20
Where should we show it?
6:21
To whom should we show it?
6:22
And this is an example
of a-- nowadays,
6:25
especially on social media,
there's so many new trends,
6:28
and it's very hard to retrain
an LLM to match the new trend
6:33
and understand the
new words out there.
6:34
I mean, you oftentimes hear Gen
Z words like "rizz" or "mid"
6:39
or whatever.
6:40
I don't know all of them.
6:41
But you probably want
to find a way that
6:45
can allow the LLM to understand
those trends without retraining
6:49
the LLM from scratch.
6:51
What else?
6:53
It's trained to have a
breadth of knowledge.
6:56
And if you wanted to do
something specialized,
6:58
that might limit [INAUDIBLE].
6:59
Yeah, it might be trained
on a breadth of knowledge,
7:02
but it might fail or
not perform adequately
7:05
on a narrow task that
is very well defined.
7:09
Think about enterprise
applications that--
7:11
yeah, enterprise application.
7:13
You need high precision,
high fidelity, low latency.
7:17
And maybe the model is not
great at that specific thing.
7:20
It might do fine, but
just not good enough.
7:22
And you might want to
augment it in a certain way.
7:24
Yeah.
7:25
Maybe it has [INAUDIBLE]
so it makes the model
7:29
a lot heavier, a lot slower.
7:32
[INAUDIBLE]
7:33
So maybe it has a lot of broad
domain knowledge that might not
7:37
be needed for your application.
7:39
And so you're using a
massive, heavy model
7:41
when you actually are only using
2% of the model capability.
7:44
You're perfectly right.
7:45
You might not need all of it.
7:46
So you might find ways to prune,
quantize the model, modify it.
7:51
All of these are good points.
7:53
I'm going to add a
few more, as well.
7:55
LLMs are very
difficult to control.
7:58
Your last point is actually
an example of that.
8:00
You want to control the LLM to
use a part of its knowledge,
8:03
but it's not--
8:04
it's, in fact, getting confused.
8:06
We've seen that in history.
8:08
In 2016, Microsoft created
a notorious Twitter
8:13
bot that learned from users, and
it quickly became a racist jerk.
8:18
Microsoft ended up removing the
bot 16 hours after launching it.
8:22
The community was really
fast at determining
8:25
that this was a racist bot.
8:28
And you can empathize with
Microsoft in the sense
8:31
that it is actually
hard to control an LLM.
8:34
They might have done a better
job to qualify before launching,
8:37
but it is really hard
to control an LLM.
8:40
Even more recently,
this is a tweet
8:42
from Sam Altman
last November, where
8:46
there was this debate
between Elon Musk and Sam
8:50
Altman on whose LLM is
the left wing propaganda
8:54
machine or the right
wing propaganda machine,
8:57
and they were hating
on each other's LLMs.
8:59
But that tells you,
at the end of the day,
9:01
that even those two teams, Grok
and OpenAI, which are probably
9:05
the best-funded teams
with a lot of talent,
9:08
are not doing a great job
at controlling their LLMs.
9:14
And from time to time,
if you hang out on X,
9:16
you might see screenshots of
users interacting with LLMs
9:21
and the LLM saying something
really controversial
9:24
or racist or something that
would not be considered great
9:31
by social standards, I guess.
9:33
And that tells you that the
model is really hard to control.
9:39
The second aspect
of it is something
9:41
that you mentioned earlier.
9:43
LLMs may underperform
in your task,
9:47
and that might include
specific knowledge gaps,
9:49
such as medical diagnosis.
9:51
If you're doing
medical diagnosis,
9:52
you would rather have an LLM
that is specialized for that
9:55
and is great at it
and, in fact, something
9:57
that we haven't mentioned
as a group, has sources.
10:00
So the answer is
sourced specifically.
10:03
You have a hard time
believing something
10:05
unless you have the actual
source of the research that
10:08
backs it up.
10:10
Inconsistencies in
style and format--
10:12
so imagine you're building
a legal AI agentic workflow.
10:17
Legal has a very specific
way to write and read,
10:21
where every word counts.
10:22
If you're negotiating
a large contract,
10:25
every word on that contract
might mean something else
10:28
when it comes to the court.
10:29
And so it's very
important that you use
10:31
an LLM that is very good at it.
10:34
The precision matters.
10:35
And then task-specific
understanding,
10:38
such as doing a classification
on a niche field,
10:40
here I pulled an example where--
let's say a biotech product is
10:45
trying to use an
LLM to categorize
10:48
user reviews into positive,
neutral, or negative.
10:54
Maybe for that
company, something
10:56
that would be considered a
negative review typically
11:01
is actually considered
a neutral review
11:04
because the NPS of
that industry tends
11:06
to be way lower than other
industries, let's say.
11:10
That's a task-specific
understanding,
11:12
and the LLM needs to
be aligned to what
11:14
the company believes is the
categorization that it wants.
11:17
We will see an example of how to
solve that problem in a second.
11:21
And then limited
context handling--
11:24
a lot of AI applications,
especially in the enterprise,
11:28
have required data that
has a lot of context.
11:33
Just to give you
a simple example,
11:35
knowledge management
is an important space
11:37
where enterprises buy a lot
of knowledge management tools.
11:40
When you go on your drive and
you have all your documents,
11:43
ideally, you could have an LLM
running on top of that drive.
11:47
You can ask any question,
and it will read immediately
11:50
thousands of documents
and answer, what was
11:53
our Q4 performance in sales?
11:56
It was x dollars.
11:58
It finds it super quickly.
11:59
In practice, because LLMs do
not have a large enough context,
12:04
you cannot use a standalone
vanilla pre-trained LLM to solve
12:07
that problem.
12:08
You will have to augment it.
12:11
Does that make sense?
12:13
The other aspect around context
windows is they are, in fact,
12:16
limited.
12:17
If you look at the context
windows of the models
12:20
from the last five years,
even the best models
12:25
today will range in context,
window, or number of tokens
12:30
it can take as input, somewhere
in the hundreds of thousands
12:35
of tokens max.
12:36
Just to give you a sense,
200,000 tokens is roughly two
12:40
books.
12:42
So that's how much
you can upload
12:45
and it can read, pretty much.
12:47
And you can imagine
that when you're
12:48
dealing with video
understanding or heavier data
12:52
files, that is, of
course, an issue.
12:56
So you might have to chunk it.
12:58
You might have to embed it.
12:59
You might have to
find other ways
13:00
to get the LLM to
handle larger contexts.
13:06
The attention mechanism is
also powerful, but problematic,
13:10
because it does not do
a great job at attending
13:13
in very large contexts.
13:16
There is actually an
interesting problem
13:19
called needle in a haystack.
13:21
It's an AI problem where--
13:23
or call it a benchmark--
13:25
where, in order to test if your
LLM is good at putting attention
13:30
on a very specific fact
within a large corpus,
13:35
researchers might
randomly insert
13:38
about one sentence
that outlines
13:44
a certain fact,
such as Arun and Max
13:47
are having coffee
at Blue Bottle,
13:48
in the middle of the
Bible, let's say,
13:51
or some very long text.
13:54
And then you ask the LLM,
what were Arun and Max having
14:01
at Blue Bottle?
14:02
And you see if it remembers
that it was coffee.
14:04
It's actually a complex problem,
not because the question
14:07
is complex, but because
you're asking the model
14:09
to find a fact within
a very large corpus,
14:12
and that's complicated.
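A minimal sketch of how such a needle-in-a-haystack test can be constructed; `call_llm` is a hypothetical helper standing in for whatever model API you use:

```python
import random

def call_llm(prompt: str) -> str:
    # Hypothetical stub: wire this to your model provider of choice.
    return "They were having coffee."

def needle_in_haystack_test(corpus: str, needle: str,
                            question: str, answer: str) -> bool:
    # Plant the needle sentence at a random position in the long corpus.
    sentences = corpus.split(". ")
    sentences.insert(random.randint(0, len(sentences)), needle)
    haystack = ". ".join(sentences)

    # Ask the model to recover the planted fact from the full context.
    response = call_llm(f"{haystack}\n\nQuestion: {question}")
    return answer.lower() in response.lower()

ok = needle_in_haystack_test(
    corpus="Some very long text. " * 10_000,
    needle="Arun and Max are having coffee at Blue Bottle",
    question="What were Arun and Max having at Blue Bottle?",
    answer="coffee",
)
print("needle recovered:", ok)
```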
14:16
So, again, this is a
limiting factor for LLMs.
14:19
We'll talk about
RAG in a second.
14:21
But I want to preview--
14:22
there is debates
around whether RAG
14:26
is the right long-term
approach for AI systems.
14:29
So as a high-level idea, a RAG
is a mechanism, if you will,
14:34
that embeds documents that
an LLM can retrieve and then
14:39
add as context to its initial
prompt and answer a question.
14:44
It has lots of application.
14:45
Knowledge management
is an example.
14:47
So imagine you have
your drive again.
14:49
But every document is
compressed in representation,
14:53
and the LLM has
access to that lower
14:55
dimensional representation.
14:59
The debates that this tweet
from [INAUDIBLE] outlines
15:03
is, in theory, if we
have infinite compute,
15:08
then RAG is useless.
15:09
Because you can just read a
massive corpus immediately
15:13
and answer your question.
15:15
But even in that case,
latency might be an issue.
15:19
Imagine the time
it takes for an AI
15:20
to read all your drive every
single time you ask a question.
15:24
It doesn't make sense.
15:25
So RAG has other advantages
beyond even the accuracy.
15:30
On top of that, the
sourcing matters, as well.
15:33
So it might-- RAG
allows you to source.
15:35
We'll talk about all that later.
15:38
But there's always this
debate in the community
15:42
whether a certain method
is actually future proof.
15:46
Because in practice, as compute
power doubles every year,
15:49
let's say, some of the methods
we're learning right now
15:52
might not be relevant
three years from now.
15:54
We don't know, essentially.
15:59
And the analogy that he
makes on context windows
16:04
and why RAG approaches might
be relevant even a long time
16:07
from now is search.
16:09
When you search on
a search engine,
16:12
you still find sources
of information.
16:14
And in fact, in the
background, there
16:16
is very detailed
traversal algorithms
16:20
that rank and find the specific
links that might be the best
16:25
to present you versus if you
had to read-- imagine you had
16:29
to read the entire web every
single time you're doing
16:31
a search query, without
being able to narrow
16:34
to a certain portion
of the space.
16:36
That might, again,
not be reasonable.
16:41
OK, when we're thinking
of improving LLMs,
16:46
the easiest way we think
of it is two dimensions.
16:50
One dimension is we are going
to improve the foundation
16:53
model itself.
16:54
So, for example, we move
from GPT 3.5 Turbo, to GPT 4,
17:01
to GPT-4o, to GPT-5.
17:04
Each of that is supposed
to improve the base model.
17:07
GPT 5 is another debate because
it's packaging other models
17:11
within itself.
17:12
But if you're thinking
about 3.5, 4, and 4o,
17:15
that's really what it is.
17:16
The pre-trained model improves.
17:18
And so you should
see your performance
17:20
improve on your tasks.
17:22
But the other dimension is
we can actually engineer--
17:27
leverage the LLM in a
way that makes it better.
17:30
So you can prompt
simply GPT-4o.
17:34
You can change some prompts
and improve the prompt,
17:38
and it will improve
the performance.
17:40
It's shown.
17:41
You can even put
a RAG around it.
17:42
You can put an agentic
workflow around it.
17:45
You can even put a
multi-agent system around it.
17:49
And that is another dimension
for you to improve performance.
17:52
So that's how I want you
to think about it-- which
17:54
LLM I'm using, and
then how can I maximize
17:56
the performance of that LLM?
17:59
This lecture is about
the vertical axis.
18:02
Those are the methods
that we will see together.
18:08
Sounds good for
the introduction.
18:11
So let's move to
prompt engineering.
18:14
I'm going to start with
an interesting study just
18:17
to motivate why prompt
engineering matters.
18:20
There is a study
from Harvard Business School
and Wharton at UPenn
18:31
that took a subset
of BCG consultants,
18:34
individual contributors,
and split them into three groups.
18:37
One group had no access to AI.
18:39
One group had access to--
18:41
I think it was GPT 4.
18:44
And then one group
had access to the LLM,
18:46
but also a training on
how to prompt better.
18:50
And then they observed the
performance of these consultants
18:53
across a wide variety of tasks.
18:56
There's a few things
that they noticed
18:57
that I thought was interesting.
18:59
One is something they
called the jagged frontier,
19:02
meaning that certain tasks
that consultants are doing fall
19:07
beyond the jagged frontier,
meaning AI is not good enough.
19:14
It's not improving
human performance.
19:18
In fact, it's actually
making it worse.
19:20
And some tasks are
within the frontier,
19:23
meaning that AI is actually
significantly improving
19:27
the performance, the speed,
the quality of the consultant.
19:32
Many tasks fell within and
many tasks fell without,
19:35
and they shared their insights.
19:37
But the TLDR is--
19:39
there is a frontier within
which AI is absolutely helping
19:42
and one where they call out
this behavior of falling asleep
19:47
at the wheel, where people
relied on AI on a task that
19:51
was beyond the frontier.
19:52
And in fact, it
ended up going worse
19:55
because the human was not
reviewing the outputs carefully
19:58
enough.
20:01
They did note that the
group that was trained
20:04
was the best, better than the
group that was not trained
20:08
on prompt engineering,
which also motivates why
20:10
this lecture matters, so
that you're within that group
20:14
afterwards.
20:15
Another insight was the
centaurs and the cyborgs.
20:20
They noticed that
consultants had the tendency
20:22
to work with AI in
one of two ways,
20:24
and you might, yourself, be
part of one of these groups.
20:29
The centaurs are
mythical creatures
20:31
that are half human, half--
20:35
I think, half, what, horses?
20:38
Yeah?
20:39
Horses?
20:39
Half horses, half something.
20:42
And those were individuals
that would divide and delegate.
20:45
They might give a pretty
big task to the AI.
20:48
So imagine you're working on a
PowerPoint, which consultants
20:51
are known to do.
20:52
You might actually write
a very long prompt on how
20:55
you want it to do your
PowerPoint and then let it
20:57
work for some time
and then come back
20:59
and it's done, when others
would act as cyborgs.
21:02
Cyborgs are fully blended,
bionic human robots,
21:06
human and robot, augmented
with robotic parts.
21:10
And those individuals will
not delegate fully a task.
21:13
They would actually work
super quickly with the model
21:16
back and forth.
21:17
I find that a lot of students
actually work more
21:20
like cyborgs than centaurs,
while in the enterprise,
21:24
when you're trying to
automate the workflow,
21:26
you're thinking
more like a centaur.
21:29
That's just something
good to keep in mind.
21:31
Also, a lot of companies
will tell you, oh, we're
21:33
hiring prompt
engineers, et cetera.
21:34
It's a career.
I don't buy that.
I don't buy that.
21:36
I think it's just a skill
that everybody should have.
21:39
You're not going to
make a [? cure ?] out
21:40
of prompt engineering,
but you're probably
21:42
going to use it as a very
powerful skill in your career.
21:49
So let's talk about basic
prompt design principles.
21:52
I'm giving you a very
simple prompt here.
21:56
Summarize this document,
and then the document
21:58
is uploaded alongside it.
22:00
And the model has not
much context around
22:04
what should be the summary?
22:06
How long should be the summary?
22:07
What should it talk
about, et cetera?
22:09
You can actually improve these
prompts by doing something like
22:14
summarize this 10-page
scientific paper on renewable
22:18
energy in five bullet points,
focusing on key findings
22:22
and implications
for policymakers.
22:25
That's already better.
22:26
You're sharing the
audience, and it's
22:28
going to tailor it
to the audience.
22:30
You're saying that you
want five bullet points,
22:33
and you want to focus
only on key findings.
22:35
That's a better prompt,
you would argue.
22:39
How could you even make
this prompt better?
22:41
What are other
techniques that you've
22:43
heard of or tried yourself that
could make this one shot prompt
22:47
better?
22:53
Yeah.
22:53
[INAUDIBLE]
22:57
OK.
22:58
Right, an example.
22:58
So say, you mean, here is an
example of a great summary.
23:02
Yeah.
23:03
You're right.
23:03
That's a good idea.
23:05
[INAUDIBLE]
23:08
Very popular technique.
23:10
Act like a renewable energy
expert giving a conference
23:15
at Davos, let's say, yeah.
23:17
That's great.
23:18
Someone-- yeah.
23:20
Say you're really good at it.
23:22
Yeah.
23:23
You are the best in
the world at this.
23:25
Explain.
23:26
Yeah.
23:26
Actually, I mean,
these things work.
23:28
It's funny, but it does work
to say act like x, y, z.
23:32
It's a very popular
prompt template.
23:34
We'll see a few examples.
23:36
What else could you do?
23:40
Yes.
23:41
Of course, you'd like to
critique your own model.
23:46
Critique your own project.
23:47
So you're using reflection.
23:48
So you might actually
do one output
23:50
and then ask it to critique
it and then give it back.
23:52
Yeah.
23:53
We see that.
23:53
That's a great one.
23:54
That's the one that
probably works best
23:56
within those typically,
but we see some examples.
23:59
What else?
24:00
Yeah.
24:01
Break the task down into steps.
24:03
OK.
24:03
Break the task down into steps.
24:05
You know how that is called?
24:06
No.
24:07
OK.
24:08
Chain of thoughts.
24:09
So this is actually
a popular method
24:12
that's been shown in
research that it improves.
24:15
You could actually give
a clear instruction
24:17
and also encourage the
model to think step
24:19
by step approach, the
task step by step,
24:22
and do not skip any step.
24:24
And then you give it some
steps, such as step one,
24:26
identify the three most
important findings.
24:29
Step two, explain
how each finding
24:31
impacts renewable energy policy.
24:33
Step three, write the
five-bullet summary
24:36
with each point addressing
a finding, et cetera.
24:39
So chain of thoughts, I linked
the paper from 2023 that
24:45
popularized chain of thoughts.
24:46
Chain of thoughts
is very popular
24:48
right now, especially
in AI startups
24:50
that are trying to
control their LLMs.
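As a rough illustration, the step-by-step instructions can be baked directly into the prompt string; `call_llm` is again a hypothetical stand-in for your model API:

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stub for your model API.
    return "..."

# Chain-of-thought prompt: explicit steps, plus an instruction not to skip any.
cot_prompt = """Summarize this 10-page scientific paper on renewable energy
in five bullet points, focusing on key findings and implications for policymakers.

Approach the task step by step and do not skip any step.
Step 1: Identify the three most important findings.
Step 2: Explain how each finding impacts renewable energy policy.
Step 3: Write the five-bullet summary, with each point addressing a finding.

Paper:
{paper_text}
"""

summary = call_llm(cot_prompt.format(paper_text="<paste the paper here>"))
```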
24:55
OK.
24:56
To go back to your examples
about act like XYZ, what
25:01
I like to do, Andrew Ng
also talks about that,
25:03
is to look at other
people's prompts.
25:06
And in fact, in online, you have
a lot of prompt repositories
25:10
for free on GitHub.
25:11
In fact, I linked the awesome
prompt template repo on GitHub,
25:16
where you have so many
examples of great prompts
25:19
that engineers have built. They
said it works great for us,
25:22
and they published it online.
25:23
And a lot of them
start with act as.
25:27
Act as a Linux terminal.
25:29
Act as an English translator.
25:31
Act like a position
interviewer, et cetera.
25:37
The advantage of
a prompt template
25:38
is that you can actually
put it in your code
25:42
and scale it for
many user requests.
25:44
So let me give you an
example from Workera.
25:48
Workera evaluates skills.
25:50
Some of you have taken
the assessments already.
25:52
And it tries to personalize
it to the user.
25:56
And in fact, if you look
in an HR system
25:59
at an enterprise,
26:01
you might have: Jane is
a product manager, level 3,
26:06
and she is in the US, and her
preferred language is English.
26:10
And actually, that
metadata can be
26:13
inserted in a prompt template
that will be
26:15
personalized for Jane.
26:16
And similarly for Joe, whose
preferred language is Spanish,
26:22
it will tailor it to Joe.
26:24
And that's called
a prompt template.
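A minimal sketch of that idea, with made-up metadata fields standing in for whatever the HR system actually stores:

```python
# Hypothetical metadata pulled from an enterprise HR system.
users = [
    {"name": "Jane", "role": "product manager", "level": 3, "language": "English"},
    {"name": "Joe",  "role": "product manager", "level": 2, "language": "Spanish"},
]

# One template, scaled across many user requests by filling in the metadata.
TEMPLATE = (
    "Act like a great AI mentor that helps people in their career. "
    "The user is {name}, a {role} (level {level}) whose preferred "
    "language is {language}. Answer in {language}."
)

for user in users:
    print(TEMPLATE.format(**user))
```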
26:26
[INAUDIBLE]
26:34
So the question is do
the foundation models
26:39
use prompt
templates, or do you
26:41
have to integrate it yourself?
26:42
So the foundation
models probably
26:45
use a system prompt
that you don't see.
26:47
Like when actually,
you type on ChatGPT,
26:50
it is possible, it's not public,
that OpenAI behind the scenes
26:55
has like act like a very
helpful assistant for this user.
26:59
And by the way, here is
your memories about the user
27:03
that we kept in a database.
27:05
You can actually
check your memories.
27:07
And then your prompt goes under,
and then the generation starts.
27:10
So probably, they're
using something like that.
27:12
But it doesn't mean you
can't add one yourself.
27:15
So in fact, if you think about a
prompt template for the Workera
27:19
example I was showing,
maybe it starts
27:22
when you call OpenAI by act
like a helpful assistant.
27:25
And then underneath, it's like
act like a great AI mentor that
27:29
helps people in their career.
27:31
And OpenAI's
prompt template also
27:33
has "follow the instructions
from the creator"
27:36
or something like that.
27:37
It's possible.
27:41
Questions about
prompt templates?
27:42
Again, I would encourage you to
go and read examples of prompts.
27:45
Some of them are
quite thoughtful.
27:48
Let's talk about zero shot
versus few shot prompting.
27:51
It came up earlier.
27:53
Here's an example.
27:54
Again, going back to the
categorization of product
27:57
reviews, let's say that
we're working on a task
28:01
where the prompt is classify
the tone of the sentence
28:05
as positive,
negative, or neutral.
28:07
And then you paste the review,
which is the product is fine,
28:12
but I was expecting more.
28:16
If I were to survey the room,
I would bet that some of you
28:19
would say it's negative.
28:21
Some of you would
say it's neutral.
28:23
Because you actually
have a first part
28:24
that is relatively positive.
28:27
It's fine.
28:28
And then the second part,
I was expecting more,
28:30
which is relatively negative.
28:31
So where do you land?
28:33
This can be a
subjective question.
28:35
And maybe in one industry, this
would be considered amazing.
28:37
And another one, it would
be considered really bad
28:40
because people are used to
really flourishing reviews.
28:44
And so the way you can actually
align the model to your task
28:47
is by converting that
zero shot prompt.
28:49
Zero shot refers to
the fact that it's not
28:51
being given any example.
28:53
Into a few-shot
prompt, where the model
28:56
is given in the prompt, a set
of examples to align it to what
29:00
you want it to do.
29:01
So the example
here is again, you
29:03
paste the same prompt as
before with the user review.
29:06
And then you add,
here are examples
29:08
of tone classifications.
29:10
These exceeded my
expectation completely.
29:12
Positive.
29:14
It's OK, but I wish
it had more features.
29:17
Negative.
29:18
The service was adequate.
29:20
Neither good nor bad.
29:22
Neutral.
29:23
Now classify the
tone of this sentence
29:26
after you've heard
about these things,
29:28
and the model then
says negative.
29:31
And the reason it says
negative, of course,
29:33
is likely because of the second
example, which was it's OK,
29:39
but I wish it had more features,
which we told the model that
29:42
was negative.
29:43
Because the model saw
that it's aligned now
29:45
with your expectations.
29:47
Few-shot prompts
are very popular.
29:50
And in fact, for
AI startups that
29:52
are slightly more
sophisticated, you
29:54
might see them keep
a prompt up to date.
29:57
Whenever a user says
something and they
30:00
might have a human
label it and then
30:02
add it as few-shot examples
in their relevant
30:05
prompts in their code base.
30:08
You can think of that as
almost building a data set.
30:10
But instead of actually
building a separate data set
30:12
like we've seen with
supervised fine tuning
30:15
and then fine tuning
the model on it,
30:17
you're just putting it
directly in the prompt.
30:19
It turns out it's
probably faster
30:21
to do that if you want
to experiment quickly
30:23
because you don't touch
the model parameters.
30:25
You just update your prompts.
30:27
And if it's text
examples, you can actually
30:30
concatenate so many
examples in a single prompt.
30:34
At some point, it
will be too long,
30:36
and you will not have the
necessary context window.
30:39
But it's a pretty
strong approach
30:40
that is quick to align an LLM.
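A sketch of what that looks like in code, using the review example from above; `call_llm` is a hypothetical helper:

```python
def call_llm(prompt: str) -> str:
    return "Negative"  # hypothetical stub; wire to your provider

# Labeled examples align the model with how *your* company labels reviews.
FEW_SHOT_EXAMPLES = [
    ("This exceeded my expectations completely.", "Positive"),
    ("It's OK, but I wish it had more features.", "Negative"),
    ("The service was adequate, neither good nor bad.", "Neutral"),
]

def classify_tone(review: str) -> str:
    lines = ["Classify the tone of the sentence as Positive, Negative, or Neutral.",
             "Here are examples of tone classifications:"]
    lines += [f'"{text}" -> {label}' for text, label in FEW_SHOT_EXAMPLES]
    lines.append(f'Now classify the tone of this sentence: "{review}"')
    return call_llm("\n".join(lines))

print(classify_tone("The product is fine, but I was expecting more."))
```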
30:48
OK?
30:49
Yes.
30:50
[INAUDIBLE]
30:57
So the question was is there
any research on how long
31:00
the prompt can be before
the model essentially loses
31:03
itself or doesn't follow
instructions anymore?
31:06
There is.
31:08
The problem is that research
is outdated every few months
31:11
because models get better.
31:14
And so I don't know where
the state of the art is.
31:16
You can probably find
it online on benchmarks
31:18
on like we see that--
31:20
I give you an example.
31:23
On the Workera product, you
have a voice conversation
31:27
for some of you
that have tried it,
31:28
where you're asked to
explain what is the prompt.
31:30
And then you explain,
and then there's
31:31
a scoring algorithm in behind.
31:33
We know that after eight
turns, the model loses itself.
31:38
After eight turns,
because you always
31:40
paste the previous
user response,
31:42
it just starts going wild.
31:44
And so the techniques
we use in the background
31:46
is we actually create
chapters of the conversation.
31:49
Maybe one chapter is
the first eight prompts.
31:51
And then you actually start
over from another prompt.
31:53
You can summarize the first
part of the conversation,
31:56
insert the summary,
and then keep going.
31:59
Those are engineering hacks that
engineers might have figured out
32:02
in the background.
32:04
Because eight turns makes a
prompt quite long actually.
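A rough sketch of that chaptering hack; the eight-turn threshold and the helper names are illustrative, not Workera's actual implementation:

```python
def call_llm(prompt: str) -> str:
    return "..."  # hypothetical stub

MAX_TURNS = 8  # illustrative threshold; beyond it, the model starts to drift

def respond(history: list[str], user_message: str) -> tuple[list[str], str]:
    history = history + [f"User: {user_message}"]
    if len(history) > MAX_TURNS:
        # Close the chapter: compress the conversation so far into a summary
        # and start a fresh, short history seeded with that summary.
        summary = call_llm("Summarize this conversation:\n" + "\n".join(history))
        history = [f"Summary of the conversation so far: {summary}",
                   f"User: {user_message}"]
    reply = call_llm("\n".join(history) + "\nAssistant:")
    return history + [f"Assistant: {reply}"], reply
```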
32:13
Let's move on to chaining.
32:15
Chaining is the most popular
technique out of everything
32:17
we've seen so far in
prompt engineering.
32:22
It's not chain of thought.
32:23
So chain of thought we've
seen is think step by step,
32:26
step 1, step 2, step 3.
32:27
Do not skip any step.
32:28
This is different.
32:30
This is chaining complex
prompt to improve performance,
32:34
and this is what it looks like.
32:37
You take a single step prompt,
such as read this customer
32:40
review and write a
professional response that
32:43
acknowledges their concern,
explains the issue,
32:46
offers a resolution,
and then you
32:48
paste the customer review,
which is I ordered a laptop.
32:51
It arrived three days late.
32:52
The packaging was damaged.
32:54
Very disappointing.
32:56
I needed that urgently for work.
32:59
And then the output
is an email that
33:01
is immediately given
to you by the LLM
33:04
after it reads the prompt.
33:08
So this might work, but it
might be hard to control.
33:14
Because think about it.
33:15
There's multiple steps
that you have listed,
33:18
and everything is embedded
in the same prompt.
33:20
And if you wanted to debug step
by step and know which step is
33:24
weaker, you couldn't.
33:24
You would have everything
mixed together.
33:27
So one advantage of chaining is
you would separate the prompts,
33:32
so that you can debug
them separately.
33:35
And it will also lead
to an easier manner
33:38
to improve your workflow.
33:41
Let's say a first prompt
is extract the key issues.
33:44
Identify the key
concerns mentioned
33:46
in this customer review.
33:47
Paste the customer review.
33:49
Second prompt.
33:50
Using these issues, so
you paste back the issues,
33:54
draft an outline for a
professional response that
33:57
acknowledges concerns,
explains possible reasons,
34:00
and offer a resolution.
34:04
So this is not--
34:06
Prompt number 3, write
the full response.
34:09
So using the outline, write
the professional response.
34:14
And then you get
your final output.
34:18
So in theory, you can tell
me, oh, the second approach
34:22
is better than the
first one at first.
34:23
But what you can notice
is that we can actually
34:27
test those three prompts
separately from each other
34:29
and determine if we will get the
most gains out of engineering
34:35
the first prompt, optimizing
it, or the second one,
34:38
or the third one.
34:39
We now have three prompts that
are independent from each other.
34:43
And maybe if the
outline was better,
34:47
the performance of the email,
how much the open rate will be
34:53
or the user satisfaction
on the response
34:55
will actually get higher.
34:57
And so chaining improves
performance,
35:00
but most importantly, helps
you control your workflow
35:04
and debug it more seamlessly.
35:07
Yes.
35:09
So if we know that the three prompts
independently work really well,
35:15
if we combine them
into one prompt,
35:17
and we highlight a step
by step thinking process,
35:21
does on average, we get
a [INAUDIBLE] by itself,
35:24
or do we still have
to do that breakdown?
35:28
So let me try to rephrase.
35:30
You say, let's say we look
at the first prompt which
35:32
has all three tasks
built in that prompt.
35:37
What exactly do you mean?
35:39
You mean like if we
evaluate the output
35:41
and we measure some user
insight, satisfaction,
35:43
et cetera?
35:45
Why don't we just modify that
prompt and essentially see how
35:49
it improves user satisfaction?
35:51
Yeah.
35:51
[INAUDIBLE]
35:54
I see.
35:55
So why do we need
the three steps?
35:57
I mean, think about it.
35:59
The intermediate output
is what you want to see.
36:02
Like if I'm debugging
the first approach,
36:06
the way I would do it is I
would capture user insights.
36:09
Like here's the email.
36:10
How good was the response?
36:11
Thumbs up, thumbs down.
36:13
Was your issue resolved?
36:16
Thumbs up, thumbs down.
36:17
Those would tell me
how good is my prompt.
36:19
And I can engineer that
prompt, optimize it,
36:21
and I would probably
drive some gains.
36:23
But I will not be able
easily to trace back
36:26
to what the problem was.
36:28
While in the second
approach, not only I
36:30
can use the end to end
metrics to improve my process.
36:33
I can also use the
intermediate steps.
36:35
For example, if I look at prompt
2 and I look at the outline
36:38
and I see the outline is
actually, meh, it's not great,
36:41
then I think I can get a lot
of gains out of the outline.
36:45
Or the outline is
actually really good,
36:47
but the last prompt doesn't do
a good job at translating it
36:50
into an email.
36:51
So the outline is exactly
what I want the LLM to do,
36:54
but the translation in
a customer facing email
36:57
is not good.
36:58
In fact, it doesn't follow
our vocabulary internally.
37:01
Then I know the
third prompt is where
37:03
I would get the most gains.
37:06
So that's what it
allows me to do,
37:07
have intermediate
steps to review.
37:10
Are there any
latency [INAUDIBLE]
37:13
We'll talk about it.
37:14
Are there any latency concerns?
37:16
Yes.
37:17
In certain applications, you
don't want to use a chain
37:20
or you don't want to use a long
chain because it adds latency.
37:26
We'll talk about that later.
37:27
Good point.
37:28
So practically, this is
what chaining complex
37:32
prompts look like.
37:33
You have your first prompt
with your first task.
37:35
It outputs.
37:36
The output is pasted
in the second prompt
37:39
with the second
task being defined.
37:41
The output is then pasted
into the third prompt
37:43
with the third task
being defined and so on.
37:46
That's what it looks
like in practice.
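In code, a minimal sketch of that three-prompt chain might look like this; `call_llm` is a hypothetical helper:

```python
def call_llm(prompt: str) -> str:
    return "..."  # hypothetical stub

def handle_review(review: str) -> str:
    # Prompt 1: extract the key issues.
    issues = call_llm(
        "Identify the key concerns mentioned in this customer review:\n" + review)
    # Prompt 2: draft an outline from the extracted issues.
    outline = call_llm(
        "Using these issues, draft an outline for a professional response that "
        "acknowledges concerns, explains possible reasons, and offers a "
        "resolution:\n" + issues)
    # Prompt 3: write the full response from the outline. Each intermediate
    # output (issues, outline) can be logged and debugged on its own.
    return call_llm(
        "Using this outline, write the professional response:\n" + outline)

email = handle_review(
    "I ordered a laptop. It arrived three days late. The packaging was "
    "damaged. Very disappointing. I needed it urgently for work.")
```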
37:52
Super.
37:55
We'll talk more later
about testing your prompts,
37:58
but there are
methods now to do it,
38:00
and we'll see later in this
lecture with our case study
38:03
how we can test our prompts.
38:06
But here is an example
of how you might do it.
38:11
You might have a
summarization workflow prompt
38:18
that is the baseline.
38:19
It's a single prompt.
38:21
You might have a
refined summarization
38:23
which is a modified
prompt of this,
38:26
or a workflow with a chain.
38:30
And then you have your test
case, which is the input
38:34
that you want to
summarize, let's say.
38:36
And then you have
the generated output.
38:38
And you can have humans
go and rate these outputs.
38:42
And you would notice that the
baseline is better or worse
38:46
than the refined prompt.
38:47
Of course, this manual
approach takes time,
38:51
but it's a good way to start.
38:53
And usually, the advice is
get hands on at the beginning
38:56
because you would quickly
notice some issues,
38:58
and it will give you better
intuition on what tweaks
39:01
can lead to better performance.
39:03
However, if you wanted
to scale that system
39:05
across many products, many
parts of your code base,
39:08
you might want to find a
way to do that automatically
39:10
without asking humans to
review and grade summaries.
39:14
One approach is
to use platforms,
39:19
like at Workera, our team uses a
platform called promptfoo that
39:23
allows you to actually
automate part of this testing.
39:26
In a nutshell,
what it does is it
39:30
can allow you to run the same
prompt with five different LLMs
39:35
immediately, put
everything in a table.
39:37
That makes it super easy for
a human to grade, let's say.
39:40
Or alternatively, it might
allow you to define LLM judges.
39:46
LLM judges can come
in different flavors.
39:50
For example, I can
have an LLM judge that
39:52
does a pairwise comparison.
39:54
So what the LLM is asked to
do is here are two summaries.
39:58
Just tell me which one is
better than the other one.
40:01
That's what the LLM does.
40:02
And that can be used
as a proxy for how good
40:04
the summarization baseline
versus the refined version is.
40:08
Another way to do
an LLM judge is
40:11
if you do it for a
single answer grading,
40:14
so here's a summary
graded from 1 to 5.
40:18
And then you can go
even deeper and do
40:21
a reference-guided
pairwise comparison.
40:24
Or you add also a rubric.
40:25
You say a 5 is when a summary
is below 100 characters.
40:30
I'm just making this up.
40:31
Below 100 characters.
40:33
Mentions at least
three key points
40:35
that are distinct and starts
with a first sentence that
40:38
displays the overview and
then goes into the detail.
40:40
That's a great summary,
number 5 out of a 5.
40:42
0 is the LLM failed to summarize
and actually was very verbose,
40:48
let's say.
40:49
And so you put a
rubric behind it,
40:52
and you have an LLM judge
following the rubric.
40:55
Of course, you can now
pair different techniques.
40:57
You can do a few
shots for the rubric.
40:58
You can actually give examples
of a 5 out of 5, a 4 out of 5,
41:02
a 3 out of 5, because now
you can combine multiple techniques.
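Hand-rolled sketches of the two judge flavors follow; platforms like promptfoo wrap this kind of thing for you. `call_llm` is a hypothetical helper, and the rubric thresholds are the made-up ones from above:

```python
def call_llm(prompt: str) -> str:
    return "A"  # hypothetical stub; wire to your provider

def pairwise_judge(summary_a: str, summary_b: str) -> str:
    # Pairwise comparison: the judge only has to pick the better of two
    # outputs, e.g. the baseline prompt versus the refined prompt.
    return call_llm(
        "Here are two summaries of the same document.\n\n"
        f"Summary A:\n{summary_a}\n\nSummary B:\n{summary_b}\n\n"
        "Which summary is better? Answer with exactly 'A' or 'B'.")

# Single-answer grading against a rubric.
RUBRIC = (
    "5: under 100 characters, mentions at least three distinct key points, "
    "and opens with an overview sentence before going into detail.\n"
    "0: fails to summarize, or is very verbose.")

def rubric_judge(summary: str) -> str:
    return call_llm(
        "Grade this summary from 0 to 5 using the rubric below.\n"
        f"Rubric:\n{RUBRIC}\n\nSummary:\n{summary}\n\nGrade:")
```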
41:06
Does that make sense?
41:11
Yeah.
41:11
OK.
41:12
So that was the second
section on prompt engineering
41:15
or the first line
of optimization.
41:19
Now, let's say you've
exhausted all your chances
41:22
for prompt
engineering, and you're
41:24
thinking about actually touching
the model, modifying its weights
41:28
or fine tuning it
in other words.
41:31
I was telling you, I'm
not a fan of fine tuning.
41:34
There's a few reasons why.
41:37
One, it requires substantial
labeled data typically
41:42
to fine tune.
41:43
Although now, there
are approaches
41:46
that are getting better
at fine tuning that
41:48
look more like few-shot prompting
than fine tuning, actually.
41:52
It's sort of merging.
41:54
Although one
modifies the weight,
41:56
the other doesn't
modify the weights.
41:57
Fine tuned models may also
overfit to specific data.
42:01
We're going to see a
funny example actually.
42:04
Losing their general
purpose utility.
42:06
So you might fine tune a model.
42:08
And actually, when someone
asks a pretty generic question,
42:11
it doesn't do well anymore.
42:12
It might do well on your task.
42:14
So it might be relevant or not.
42:15
And then it's time
and cost-intensive.
42:17
That's my main problem.
42:19
And at Workera, we
steer away from fine
42:24
tuning as much as possible.
42:26
Because by the time you're
done fine tuning your model,
42:28
the next model is
out, and it's actually
42:30
beating your fine tuned
version of the previous model.
42:33
So I would steer away from
fine tuning as much as you can.
42:36
The advantage of the prompt
engineering methods we've seen
42:39
is you can put the next best
pre-trained model directly
42:43
in your code.
42:44
It will update
everything immediately.
42:46
Fine tuning doesn't
work like that.
42:50
There are advantages though
where it still makes sense.
42:53
If the task requires repeated
high precision outputs
42:56
such as legal,
scientific explanation
42:58
and if the general
purpose LLM struggles
43:01
with domain-specific language.
43:03
So let's look at a
quick example together,
43:07
which is an example
from Ross Lazerowitz.
43:12
I think it was a couple of
years ago, September 2023,
43:15
where Ros tried to
do Slack fine tuning.
43:22
So he looked at a lot of Slack
messages within his company.
43:26
And he was like, I'm
going to fine tune
43:28
a model that speaks like us or
operates like us because this
43:32
is how we work.
43:33
This is the data that represents
how people work at the company.
43:37
And so he actually went ahead
and fine tuned the model,
43:42
gave it a prompt,
like, hey, write--
43:44
he was delegating to the model.
43:47
A 500-word blog post
on prompt engineering.
43:50
And the model responded, I shall
work on that in the morning.
43:55
And then he tries to push the
model a little further and say,
44:00
it's morning now.
44:01
And the model said,
I'm writing right now.
44:04
It's 6:30 AM here.
44:06
Write it now.
44:10
OK, I shall write it now.
44:12
I actually don't know what
you would like me to say
44:14
about prompt engineering.
44:15
I can only describe the process.
44:17
The only thing that comes
to mind for a headline
44:19
is how do we build prompts?
44:21
It's kind of a funny example for
fine tuning because it's true
44:25
that it went wrong.
44:27
Like he was supposed
to think like I want
44:29
the model to speak
like us at work.
44:32
And it ended up
acting like people
44:34
and not actually
following instructions.
44:40
So one example why I would
steer away from fine tuning.
44:47
Super.
44:51
Let's talk about RAGs.
44:54
RAGs are important.
44:55
It's important to be out there
with at least the basics.
44:58
It's a very common interview
question, by the way.
45:00
If you go interview
for a job, they
45:02
might ask you to
explain in a nutshell
45:04
to a five-year-old
what is a RAG.
45:06
And hopefully after that,
you'll be able to do it.
45:09
So we've seen some of the
challenges with standalone LLMs.
45:14
Those challenges include the
context window being small,
45:19
the fact that it's hard
to remember details
45:21
within a large context window,
knowledge gaps, cutoff dates,
45:26
you mentioned earlier.
45:28
The model might be
trained up to a date,
45:29
and then it cannot follow
the trends or be up to date.
45:33
Hallucinations.
45:34
There are some fields.
45:35
Think about medical
diagnosis, where
45:37
hallucinations are very costly.
45:39
You can't afford
a hallucination.
45:41
Even in education, imagine
deploying a model for the US
45:45
youth education,
and it hallucinates,
45:47
and it teaches millions
of people something
45:49
completely wrong.
45:50
It's a problem.
45:52
And then lack of sources.
45:54
A lot of fields love sources.
45:57
Research fields love sources.
45:59
Education loves sources.
46:01
Legal loves sources as well.
46:04
And so the pre-trained LLM
doesn't do a good job of sourcing.
46:08
And in fact, if you have tried
to find sources on a plain LLM,
46:13
it actually hallucinates a lot.
46:15
It makes up research papers.
46:16
It just lists like
completely fake stuff.
46:20
So how do we solve
that with a RAG?
46:23
RAG integrates with external
knowledge sources, databases,
46:28
documents, APIs.
46:31
It ensures that answers are
more accurate, up to date,
46:35
and grounded because you can
actually update your document.
46:38
Your drive is always up to date.
46:40
I mean, ideally, you're always
pushing new documents to it.
46:43
And when you query, what is
our Q4 performance in sales?
46:47
Hopefully there is the last
board deck in the drive,
46:51
and it can read the
last board deck.
46:54
And more developer control.
46:56
We'll see why RAGs allow
for targeted customization
47:00
without actually requiring
the retraining of the model.
47:02
In fact, you don't touch
the model with RAGs.
47:05
It's really a technique that
is put on top of the model.
47:08
So to see an example
of a RAG, this
47:11
is a question answering
application where
47:16
we're in the medical field,
and a user is asking a query,
47:21
what are the side
effects of drug X?
47:26
This is an important question.
47:27
You can't hallucinate.
47:28
You need to source.
47:29
You need to be up to date.
47:31
Maybe there is a new
update to that drug that
47:35
is now in the database,
and you need to read that.
47:37
So a RAG is a great example of
what you would want to use here.
47:41
The way it works is
you have your knowledge
47:43
base of a bunch of documents.
47:46
What you do is you
use an embedding
47:49
to embed those
documents into lower
47:52
dimensional representations.
47:54
So for example, if the
document is a PDF, a long PDF,
47:59
you might read the
PDF, understand it,
48:02
and then embed it.
48:03
We've seen plenty of
embedding approaches
48:05
together, triplet loss,
et cetera, you remember?
48:09
So imagine one of
them here for LLMs
48:11
is embedding those documents
into lower representation.
48:15
If the representation
is too small,
48:18
you will lose information.
48:19
If it's too big, you
will add latency.
48:22
It's a tradeoff.
48:25
You will store typically
those representations
48:28
into a database called
a vector database.
48:31
There's a lot of vector
database providers out there.
48:38
I think I've listed a
couple that are very common.
48:41
No, I haven't listed, but
I can share afterwards.
48:44
A vector database is
essentially storing those vectors
48:47
in a very efficient manner,
allowing the fast retrieval
48:50
with a certain distance metric.
48:52
So what you do is you
also embed, usually
48:56
with the same algorithm,
the user prompts.
49:00
And you run a retrieval
process, which is essentially
49:03
saying, based on the
embedding from the user
49:07
query and the vector database,
find the relevant documents
49:12
based on the distance
between those embeddings.
49:15
Once you've found the relevant
documents, you pull them,
49:18
and then you add them to the
user query with a system prompt
49:22
or a prompt template on top.
49:24
So the prompt template
can be answer user query
49:29
based on list of documents.
49:32
If answer not in the
documents, say I don't know.
49:36
That's your prompt templates
where the user query is pasted,
49:40
the documents are
pasted, and then
49:42
your output should be what
you want because it's not
49:45
grounded in the documents.
49:47
You can also add to
this prompt template.
49:50
Tell me the exact
page, chapter, line
49:53
of the document that was
relevant, and in fact,
49:55
link it as well, just
to be more precise.
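Putting the whole loop together, here is a deliberately tiny sketch of a vanilla RAG; the bag-of-characters `embed` is a toy stand-in for a real embedding model, and `call_llm` is a hypothetical helper:

```python
import math

def embed(text: str) -> list[float]:
    # Toy stand-in for a real embedding model: a bag-of-characters vector,
    # just so the sketch runs end to end.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def call_llm(prompt: str) -> str:
    return "..."  # hypothetical stub

# 1. Index: embed every document and store the vectors (the "vector database").
documents = [
    "Drug X: known side effects include headaches and mild nausea.",
    "Quarterly sales report, unrelated to any medication.",
]
index = [(doc, embed(doc)) for doc in documents]

# 2. Retrieve: embed the query with the same model, rank by similarity.
query = "What are the side effects of drug X?"
query_vec = embed(query)
top_doc = max(index, key=lambda pair: cosine(query_vec, pair[1]))[0]

# 3. Generate: ground the answer in the retrieved document.
answer = call_llm(
    "Answer the user query based on the documents below. "
    "If the answer is not in the documents, say 'I don't know'.\n\n"
    f"Documents:\n{top_doc}\n\nQuery: {query}")
```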
50:02
Any question on RAGs?
50:03
This is a simple, vanilla RAG.
50:07
Yes.
50:09
Do document embeddings still
retain information [INAUDIBLE]
50:15
Question is do the
document embeddings still
50:18
retain the information of the
location of the information
50:21
within that document,
especially in big documents?
50:24
Great question.
50:26
We'll get to it in a second.
50:27
Because you're right
that the vanilla RAG
50:29
might not do a good job
with very large documents.
50:32
So let's say, when you
open a medication box
50:36
and you have this gigantic white
paper with all the information,
50:41
and it's very long, maybe a
vanilla RAG would not cut it.
50:45
So what people have
figured out is a bunch
50:48
of techniques to improve RAGs.
50:49
And in fact, chunking is a great
technique that is very popular.
50:53
So you might actually store
in the vector database
50:55
the embedding of
the full document.
50:57
And on top of
that, you will also
50:59
store a chapter level vector.
51:02
And when you retrieve, you
will retrieve the document.
51:04
You retrieve the chapter.
51:06
And that allows you to be more
precise with the sourcing.
51:09
It's one example.
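A short sketch of that multi-granularity indexing, reusing the toy `embed` helper from the RAG sketch above; the record schema here is illustrative:

```python
# Store one vector for the whole document plus one per chapter, so retrieval
# can point at the exact passage rather than just the document.
def index_document(doc_id: str, chapters: dict[str, str], index: list) -> None:
    full_text = "\n".join(chapters.values())
    index.append({"doc": doc_id, "chapter": None, "vector": embed(full_text)})
    for title, text in chapters.items():
        index.append({"doc": doc_id, "chapter": title, "vector": embed(text)})
```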
51:11
Another technique
that's popular is HyDE.
51:16
Hypothetical
document embeddings,
51:18
where a group of researchers
published a paper
51:23
showing that when you
get your user query,
51:26
one of the main problems
is that the user query
51:29
actually does not look
like your documents.
51:32
For example, the
user query might
51:34
be what are the side effects
of drug X, when actually,
51:37
in the document in
the vector database,
51:40
the vectors represent
very long documents.
51:43
So how do you guarantee
that the vector
51:44
embedding is going to be close
to the document embedding?
51:47
What they do is they use
the user query to generate
51:50
a fake hallucinated document.
51:53
They embed that
document, and then they
51:56
compare it to the vector
in the vector database.
52:01
That makes sense?
52:02
So for example,
the user says what
52:04
is the side effect of drug X?
52:06
This query is
given to another prompt that
52:09
says, based on this user query,
generate a five-page report
52:13
answering the user query.
52:15
It generates potentially
a completely fake answer.
52:20
You embed that, and it will
be closer to the document
52:24
that you're looking for likely.
52:28
It's one example
of a RAG approach.
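A sketch of the HyDE retrieval step, reusing the `call_llm`, `embed`, and `cosine` helpers from the sketches above:

```python
def hyde_retrieve(query: str, index: list) -> str:
    # HyDE: instead of embedding the short query directly, ask the LLM to
    # hallucinate a document that answers the query, and embed that instead.
    # The fake document looks much more like the real documents in the
    # vector database, so its embedding lands closer to the right one.
    fake_doc = call_llm(
        "Based on this user query, generate a five-page report answering it:\n"
        + query)
    fake_vec = embed(fake_doc)
    return max(index, key=lambda pair: cosine(fake_vec, pair[1]))[0]
```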
52:31
Again, the purpose
of this lecture
52:33
is not to go through all
these three and explain
52:36
you every single method that
has been discovered for RAGs.
52:38
But I just wanted to show
you how much research
52:40
has been done between
2020 and 2025 in RAGs
52:44
and how many branches
of research you now have
52:47
that you can learn from.
52:50
The survey paper is linked in
the slides, by the way,
52:52
and I'll share them
after the lecture.
53:01
Super.
53:05
So we've made some progress.
53:08
Hopefully now, you
feel if you were
53:10
to start an LLM application, you
know how to do better prompts.
53:14
You know how to do chains.
53:15
You know how to do fine tuning.
53:17
You also know how to do retrieval.
53:19
And you have the
baggage of techniques
53:20
that you can go and read
and find the code base,
53:23
pull the code, vibe code it.
53:24
But you have the breadth now.
53:30
The next set of topics
we're going to see
53:34
is around the question
of how could we
53:36
extend the capabilities of LLMs
from performing single tasks,
53:40
enhanced with
external knowledge,
53:42
to handling multi-step,
autonomous workflows?
53:47
And this is where we get
into proper agentic AI.
53:53
So let's talk about
agentic AI workflows
53:56
towards autonomous and
specialized systems.
54:00
Then we'll talk about evals.
54:01
Then we'll see
multi-agent systems.
54:03
And we'll end with a little
thoughts on what's next in AI.
54:11
So Andrew Ng actually coined
the term agentic AI workflows.
54:20
And his reason was that a lot
of companies say, agents.
54:25
Agents, agents everywhere,
agents everywhere.
54:28
If you go and work
at these companies,
54:30
you would notice that they mean
very different things by agents.
54:33
Some people actually
have a prompt,
54:34
and they call it an agent.
54:36
Other people, they have a very
complex multi-agent system,
54:41
they call it an agent.
54:42
And so calling everything an
agent doesn't do it justice.
54:45
So Andrew says let's call
it agentic workflows.
54:49
Because in practice, it's a
bunch of prompts with tools,
54:53
with additional
resources, API calls
54:57
that ultimately are
put in a workflow,
54:59
and you can call that
workflow agentic.
55:02
So it's all about the multi-step
process to complete a task.
55:11
Also, calling it
agentic workflow
55:13
allows us to not
mix it up with what
55:14
I called agent, in
the last lecture,
55:17
with reinforcement learning.
55:19
Because in RL, agent has a
very specific definition,
55:22
interacts with an environment,
passes from one state
55:24
to the other, has a
reward and an observation.
55:26
You remember that chart, right?
55:32
So here's an example of
how we move from a one step
55:35
prompt to a multi-step
agentic workflow.
55:39
Let's say a user
queries a product.
55:44
What is your refund
policy on a chatbot?
55:48
And the response,
using a RAG, says
55:51
refunds are available
within 30 days of purchase,
55:53
and maybe the RAG can even
link to the policy documents.
55:57
That's what we learned so far.
55:59
Instead, an agentic workflow
can function like this.
56:04
The user says, can I get
a refund for my order?
56:07
And the response via
the agentic workflow
56:11
is the agent retrieves the
refund policy using a RAG.
56:14
The agent then follows up
with the user and says,
56:17
can you provide
your order number?
56:19
Then the agent queries an API
to check the order details.
56:23
And finally, it comes
back to the user
56:25
and confirms your order
qualifies for a refund.
56:28
The amount will be processed
in three to five business days.
56:31
This is much more thoughtful
than the first version,
56:33
which is sort of vanilla.
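A sketch of that refund workflow, with the step sequence hard-coded for clarity; a real agent would let the model decide which tool to call next, and every helper here is hypothetical:

```python
def call_llm(prompt: str) -> str:
    return "..."  # hypothetical stub

def rag_lookup(query: str) -> str:
    # Hypothetical RAG call into the policy knowledge base.
    return "Refunds are available within 30 days of purchase."

def order_api(order_number: str) -> dict:
    # Hypothetical API call into the order system.
    return {"order_number": order_number, "days_since_purchase": 12}

def refund_agent(user_message: str) -> str:
    # Step 1: retrieve the refund policy using a RAG.
    policy = rag_lookup("refund policy")
    # Step 2: follow up with the user for the missing detail.
    order_number = input("Can you provide your order number? ")
    # Step 3: query an API to check the order details.
    order = order_api(order_number)
    # Step 4: let the LLM combine policy and order data into a reply.
    return call_llm(
        f"Policy: {policy}\nOrder: {order}\nUser asked: {user_message}\n"
        "Decide whether the order qualifies for a refund and reply to the user.")
```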
56:37
So that's what
we're going to talk
56:39
about in the next
couple of slides,
56:40
is how do we get from the
first one to the second one?
56:46
There are plenty of specialized
agentic workflows online.
56:50
You've heard, and if
you hang out in SF,
56:52
you probably see a bunch
of billboards, AI software
56:55
engineer, AI skills
mentor you've
56:57
interacted with in the
class through Workera.
56:59
AI SDR, AI lawyers, AI
specialized cloud engineer.
57:08
It would be a stretch to
say that everything works,
57:10
but there's work being
done towards that.
57:17
I'm not personally
a fan of putting
57:19
a face behind those things.
57:20
I think it's gimmicky.
57:21
And I think in a few
years from now, actually,
57:24
very few products will have
a human face behind it,
57:27
but it might be a marketing
tactic from some startups.
57:32
It's more scary than it
is engaging, frankly.
57:35
OK.
57:36
I want to talk about
the paradigm shift.
57:38
That's especially useful.
57:40
Let's say you're a
software engineer
57:41
or you're planning to
be a software engineer.
57:43
Because software
engineering as a discipline
57:45
is sort of shifting.
57:47
Or at least the
best engineers I've
57:49
worked with are able to move
from a deterministic mindset
57:53
to a fuzzy mindset and
balance between the two
57:57
whenever they need to
get something done.
57:58
So here's the paradigm shift
between traditional software
58:01
and agentic AI software.
58:04
The first one is the
way you handle data.
58:07
Traditional software deals
with structured data.
58:10
You have JSONs.
58:11
You have databases.
58:12
They're pasted in a
very structured manner
58:15
in a data engineering pipeline.
58:17
And then it used
to be displayed
58:19
on a certain interface.
58:21
The user might fill a form that
is then retrieved and pasted
58:24
in the database.
58:25
All of that historically
has been structured data.
58:28
Now, more and more companies are
handling free form text, images,
58:34
and all of that requires dynamic
interpretation to transform
58:39
an input into an output.
58:41
The software itself used
to be deterministic.
58:45
Now you have a lot of
software that is fuzzy.
58:47
And fuzzy software
creates so many issues.
58:51
I mean, imagine if you
let your user ask anything
58:54
on your website.
58:56
The chances that it
breaks are tremendous.
58:58
The chances that you're
attacked are tremendous.
59:00
The chances-- it's really,
really complicated.
59:03
It's more complicated than
people make it seem on Twitter.
59:07
Fuzzy engineering is truly hard.
59:09
You might get hate as a company
because one user did something
59:14
that you authorized them to
do that ended up breaking
59:16
the database and ended up--
59:18
we've seen that
with many companies
59:19
in the last couple of years.
59:21
So it takes a very specialized
engineering mindset
59:23
to do fuzzy
engineering, but also
59:25
know when you need
to be deterministic.
59:29
The other thing I'd call out is,
with agentic AI software,
59:33
you want to think about your
software the way a manager would.
59:39
So you're familiar with the
monolith or microservices
59:44
approaches in software, where
you structure your software
59:48
in different boxes that
can talk to each other,
59:51
and it allows teams to
debug one section at a time.
59:55
Now the equivalent with agentic
AI is you think as a manager.
59:59
So you think, OK, if I
was to delegate my product
1:00:02
to be done by a group of humans,
what would be those roles?
1:00:06
Would I have a graphic designer
that then puts together a chart
1:00:09
and then sends it to a marketing
manager that converts it
1:00:12
into a nice blog post, that
then gives it to the performance
1:00:15
marketing expert, that then
publishes the work, the blog
1:00:18
post, and then
optimizes and A/B tests?
1:00:20
Then to a data scientist
that analyzes the data
1:00:23
and then puts
hypotheses and validates
1:00:25
them or invalidates them.
1:00:27
That's how you would typically
think if you're building
1:00:29
agentic AI software.
1:00:32
When actually, the equivalent
of that in traditional software
1:00:35
might be completely different.
1:00:37
It might be: we have
a data engineering box
1:00:39
right here that handles
all our data engineering.
1:00:42
And then here, we
have the UI/UX stuff.
1:00:45
Everything UI/UX
related goes here.
1:00:47
And companies might structure
it in very different ways.
1:00:51
And here is the business logic
that we want to care about.
1:00:53
And there's five engineers
working on the business logic,
1:00:56
let's say.
1:00:59
OK.
1:01:01
Testing and debugging
is also very different.
1:01:04
And we'll talk about
it in the next section.
1:01:09
The other thing
that I feel matters
1:01:13
is with AI in engineering,
the cost of experimentation
1:01:17
is going down drastically.
1:01:19
And so people, I feel,
should be more comfortable
1:01:22
throwing away code.
1:01:23
It's like in traditional
software engineering,
1:01:27
you probably don't
throw away code a ton.
1:01:29
You build a code, and it's
solid, and it's bulletproof,
1:01:32
and then you update
it over time.
1:01:35
We've seen AI companies be
more comfortable throwing away
1:01:39
code, which has advantages in
terms of the speed at which you
1:01:43
move but also
disadvantages in terms
1:01:46
of the quality of your
software that can break more.
1:01:52
So anyway, just wanted to do
an update on the paradigm shift
1:01:56
from deterministic
to fuzzy engineering.
1:02:04
Oh, and actually, I can give
you an example from Workera
1:02:08
that we learned probably
over the last 12
1:02:11
months. If
you've used Workera,
1:02:13
you might have seen that the
interface sometimes asks you
1:02:18
multiple choice questions.
1:02:19
And sometimes, it asks
you multiple select.
1:02:21
And sometimes, it asks you drag
and drop, ordering, matching,
1:02:24
whatever.
1:02:25
Those are examples of
deterministic item types,
1:02:28
meaning you answer the
question on a multiple choice.
1:02:31
There is one correct answer.
1:02:32
It's fully deterministic.
1:02:34
On the other hand, you sometimes
have voice questions,
1:02:38
where you go to a
role play or you
1:02:40
have voice plus
coding questions,
1:02:42
where your code is being read
by the interface or whatever.
1:02:45
Those are fuzzy, meaning
the scoring algorithm
1:02:49
might actually make
mistakes, and those mistakes
1:02:52
might be costly.
1:02:53
And so companies
have to figure out
1:02:56
a human in the
loop system, which
1:02:58
you might have seen with the
appeal feature at the end.
1:03:00
So at the end of the assessment,
you have an appeal feature that
1:03:03
allows you to say, I
want to appeal the agent
1:03:06
because I want to challenge
what the agent said on my answer
1:03:09
because I thought I was better
than what the agent thought.
1:03:12
And then you bring the
human in the loop that
1:03:14
then can fix the agent, can
tell the agent, actually,
1:03:16
you were too harsh on the
answer of this person.
1:03:20
And that's an example of
a fuzzy engineered system
1:03:24
that then adds a human in the
loop to make it more aligned.
1:03:28
And so if you're
building a company,
1:03:29
I would encourage you to
think about what can I
1:03:32
get done with determinism?
1:03:33
And let's get that done.
1:03:35
And then the fuzzy
stuff, I want to do fuzzy
1:03:38
because it allows
more interaction.
1:03:39
It allows more back
and forth, but I need
1:03:42
to put guardrails around it.
1:03:43
And how am I going to
design those guardrails?
1:03:45
Pretty much.
1:03:46
OK?
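To make that deterministic/fuzzy split concrete, here's a minimal sketch: the multiple choice grader is fully deterministic, the voice scorer is fuzzy and wrapped in a guardrail, and the appeal path brings the human in the loop. score_with_llm() is a hypothetical scorer, not a real API:

```python
# Sketch of deterministic vs. fuzzy grading with a guardrail and a
# human-in-the-loop appeal. score_with_llm() is hypothetical.

def grade_multiple_choice(answer, key):
    return answer == key            # deterministic: one correct answer

def grade_voice_response(transcript):
    score = score_with_llm(transcript)   # fuzzy: the scorer can be wrong
    # Guardrail: clamp to the valid range and keep an appeal path open.
    score = max(0, min(100, score))
    return {"score": score, "appealable": True}

def handle_appeal(item, human_score):
    # Human in the loop: the reviewer's decision overrides the agent's.
    item["score"] = human_score
    item["appealed"] = True
    return item
```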
1:03:49
Here's another example
from enterprise workflows,
1:03:54
which are likely to
change due to agentic AI.
1:03:57
This is a paper from McKinsey,
I believe from last year,
1:04:01
where they looked at a financial
institution, and they said,
1:04:05
we observed that they often
spend one to four weeks
1:04:07
to create a credit risk memo.
1:04:10
And here's the process.
1:04:11
A relationship manager
gathers data from 15
1:04:16
or more
sources on the borrower,
1:04:19
loan type, and other factors.
1:04:22
Then the relationship manager
and the credit analyst
1:04:25
collaboratively analyze that
data from these sources.
1:04:28
Then the credit analyst
typically spends 20 hours
1:04:33
or more writing a memo
and then goes back
1:04:36
to the relationship manager.
1:04:37
They give feedback, and then
they go through this loop
1:04:40
again and again.
1:04:41
And it takes a long time
to get a credit memo out.
1:04:46
And then they ran a research study
where they changed the process.
1:04:50
They said gen AI agents could
actually cut time by 20% to 60%
1:04:56
on credit risk memos.
1:04:58
And the process has changed:
the relationship manager
1:05:01
directly works with the
gen AI agent system and
1:05:03
provides the relevant materials
needed to produce the memo.
1:05:07
The agent subdivides
the project into tasks
1:05:10
that are assigned to
specialist agents,
1:05:12
gathers and analyzes the
data from multiple sources,
1:05:15
and drafts a memo.
1:05:16
Then the relationship manager
and the credit analyst
1:05:19
sit down together,
review the memo,
1:05:20
give feedback to the agent.
1:05:22
And they're done in 20% to 60%
less time.
1:05:26
And so this is an example where
you're actually not changing
1:05:30
the human stakeholders.
1:05:31
You're just changing
the process and adding
1:05:33
Gen AI to reduce the time it
takes to get a credit memo out.
1:05:38
It turns out that, imagine
you're an enterprise,
1:05:42
and you have 100,000 employees,
and there's a lot of enterprises
1:05:47
with 100,000
employees out there.
1:05:50
You are currently
in a crisis in terms
1:05:52
of redesigning your workflows.
1:05:55
It turns out that
if you actually
1:05:57
pull the job descriptions
from the HR system
1:06:00
and you interpret
them, you also pull
1:06:02
the business process
workflows that you
1:06:04
have encoded in your drive.
1:06:07
You actually can find
gains in multiple places.
1:06:10
And in the next
few years, you're
1:06:12
probably going to
see workflows being
1:06:14
more optimized to add Gen AI.
1:06:17
Even if that happens, the
hardest part is changing people.
1:06:20
We know this is
great in theory, but now,
1:06:23
let's try to fit that second
workflow to 10,000 credit
1:06:28
risk analysts and
relationship managers.
1:06:31
My guess is it will take years.
1:06:33
It will take 10, 20 years to
get to this being actually done
1:06:37
at scale within an organization.
1:06:40
Because change is so hard.
1:06:42
It's so hard to rewire business
workflows and job descriptions,
1:06:47
incentivize people to do things
differently, and be different,
1:06:50
and train them.
1:06:50
And so this is what the
world is going towards,
1:06:55
but it's going to take
a long time I think.
1:06:59
OK.
1:07:00
Then I want to talk about
how the agent actually works
1:07:02
and what are the core
components of an agent.
1:07:07
Imagine a travel
booking agent. That's
1:07:10
an easy example you've
all thought about.
1:07:12
I still haven't been able to get
an agent to book a trip for me,
1:07:16
or I was scared because
it was going to book
1:07:18
a very expensive or long trip.
1:07:20
But in theory, you can
have a travel booking
1:07:24
agent that has prompts.
1:07:26
So the prompts we've
seen, we know the methods
1:07:28
to optimize those prompts.
1:07:30
That travel agent also has
a context management system,
1:07:34
which is essentially the memory
of what it knows about the user.
1:07:38
That context
management system might
1:07:40
include a core memory or working
memory and an archival memory,
1:07:45
OK?
1:07:46
The difference
within memory
1:07:51
is that not every memory needs
to be fast to access.
1:07:54
Think about it.
1:07:56
You're onboarded on a product,
and the first question is hi,
1:07:59
what's your name?
1:08:00
And I say, my name is Kian.
1:08:02
That's probably going to
sit in the working memory
1:08:05
because the agent, every
time it talks to me,
1:08:07
is going to want
to use my name.
1:08:08
But then maybe the
second question
1:08:10
is what's your birthday?
1:08:12
And I give it my birthday.
1:08:13
Does it need my
birthday every day?
1:08:15
Probably not.
1:08:16
So it's probably going to
park it on the long term
1:08:18
memory or the archival memory.
1:08:20
And those memories
are slower to access.
1:08:24
They're farther down the stack.
1:08:26
And that structure
allows the agent
1:08:28
to determine what's
the working memory,
1:08:30
and what's the long term memory?
1:08:33
And that makes it easier for the
agent to retrieve super fast.
1:08:36
Because think about it.
1:08:37
When you interact
with ChatGPT, you
1:08:39
feel that it's very
personal at times.
1:08:41
You feel like it
understands you.
1:08:43
Imagine every time you call it,
it has to read the memories.
1:08:47
And that can be costly.
1:08:48
It's a very burdensome
cost because it happens
1:08:52
every time you talk to it.
1:08:54
So you want to be highly
optimized with the working
1:08:57
memory.
1:08:59
If it takes three
seconds to look
1:09:00
in the memory, every time you're
going to talk to your LLM,
1:09:03
it's going to take three
seconds, which you don't want.
1:09:06
Anyway.
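Here's a minimal sketch of that two-tier memory, assuming a hypothetical ContextManager you'd write yourself: a small working memory that is prepended to every prompt, and an archival memory that is only searched on demand:

```python
# Sketch of a two-tier context management system. All names are
# illustrative; a real archival memory would likely be a vector store.

class ContextManager:
    def __init__(self):
        self.working = {}   # tiny, always in the prompt (e.g. the user's name)
        self.archive = []   # larger, slower: searched only when needed

    def remember(self, key, value, important=False):
        if important:
            self.working[key] = value
        else:
            self.archive.append((key, value))

    def prompt_prefix(self):
        # Cheap, but paid on every single LLM call, so keep it small.
        return "\n".join(f"{k}: {v}" for k, v in self.working.items())

    def recall(self, query):
        # Expensive: a lookup (or vector search) done only when relevant.
        return [v for k, v in self.archive if query in k]

mem = ContextManager()
mem.remember("name", "Kian", important=True)        # working memory
mem.remember("birthday", "Jan 1", important=False)  # archival memory
```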
1:09:06
And then you have the tools.
1:09:08
The tools can include
APIs like a flight search
1:09:11
API, hotel booking API, car
rental API, weather API,
1:09:15
and then the payment
processing API.
1:09:18
And typically, you would
want to tell your agent
1:09:21
how that API works.
1:09:23
It turns out that agents
or LLMs, I should say,
1:09:27
are very good at reading
API documentation.
1:09:29
So you give it the
API documentation,
1:09:31
and it reads the
JSON, and it reads,
1:09:33
what does a GET
request look like.
1:09:35
And this is the format
that I need to push.
1:09:38
And then it pushes it in
that format, let's say.
1:09:41
And then it retrieves something.
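Here's roughly what handing the model that documentation looks like as a tool definition. The schema shape mirrors common function-calling formats, but the exact field names vary by provider, and the endpoint here is made up:

```python
# Sketch of describing a flight search API to an LLM as a tool.
# The URL and schema are illustrative, not a real provider format.

flight_search_tool = {
    "name": "search_flights",
    "description": "GET https://api.example.com/flights - returns flight options",
    "parameters": {
        "type": "object",
        "properties": {
            "origin": {"type": "string", "description": "IATA code, e.g. SFO"},
            "destination": {"type": "string", "description": "IATA code, e.g. CDG"},
            "date": {"type": "string", "description": "YYYY-MM-DD"},
        },
        "required": ["origin", "destination", "date"],
    },
}
# The model reads this the way it would read API docs: it learns what a
# valid request looks like and emits its arguments in that exact format.
```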
1:09:45
Does that make sense,
those different components?
1:09:49
Anthropic also talks
about resources.
1:09:51
Resources is data that is
sitting somewhere that you
1:09:55
might let your agent read.
1:09:57
For example, if you're building
your startups, you have a CRM.
1:10:00
A CRM has data in it, and you
want to do lookups in that data.
1:10:05
You will probably
give a lookup tool,
1:10:07
and you will give
access to the resource,
1:10:10
and it will do lookups
whenever you want super fast.
1:10:16
This type of
architecture can be built
1:10:19
with different
degrees of autonomy,
1:10:21
from the least autonomous
to the most autonomous.
1:10:23
And I'll give you
a few examples.
1:10:26
Less autonomous would be
you've hard coded the steps.
1:10:29
So let's say I tell the travel
agent first identify the intent.
1:10:35
Then look up in the
database the history
1:10:39
of this customer with us
and their preferences.
1:10:42
Then go to the flight
API, blah, blah, blah.
1:10:45
Then go to the--
1:10:45
I would hard code the steps.
1:10:47
OK.
1:10:48
That's the least autonomous.
1:10:50
The semi-autonomous is I
might hard code the tools,
1:10:54
but we're not going to
hard code the steps.
1:10:57
So I'm going to tell the agent,
you act like a travel agent.
1:11:02
And your task is to help
the person book a travel.
1:11:10
And these are the tools that
you have accessible to yourself.
1:11:13
And so I'm not hard
coding the steps.
1:11:14
I'm just hard coding the
tools that you have access
1:11:17
to for yourself.
1:11:18
The more autonomous is the
agent decides the steps
1:11:22
and can create the tools.
1:11:24
So that's where you might
give actually access
1:11:26
to a code editor, to the agent.
1:11:28
And the agent might actually be
able to ping any API in the web,
1:11:33
perform some web search.
1:11:34
It might even be able
to create some code
1:11:37
to display data to the user.
1:11:39
It might even be able to
perform some calculations.
1:11:42
Like oh, I'm going to
calculate the fastest route
1:11:44
to get from San
Francisco to New York,
1:11:48
and which one might be
the most appropriate
1:11:50
for what the user
is looking for.
1:11:52
And then I want to calculate
the distance between the airport
1:11:54
and that hotel
versus that hotel.
1:11:56
And I'm going to
write code to do that.
1:11:58
So it's actually
fully autonomous
1:12:00
from that perspective.
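Here's a rough sketch of those three degrees of autonomy, with hypothetical helpers (llm, db_lookup, flight_api, run_in_sandbox) standing in for the real pieces:

```python
# Sketch of three degrees of autonomy. All helpers are hypothetical.

# 1. Least autonomous: the steps are hard coded.
def hardcoded_agent(user_msg):
    intent = llm(f"Identify the intent: {user_msg}")   # fixed step 1
    prefs = db_lookup(user_msg)                        # fixed step 2
    options = flight_api(intent, prefs)                # fixed step 3
    return llm(f"Summarize these options: {options}")  # fixed step 4

# 2. Semi-autonomous: the tools are hard coded, the steps are not.
TOOLS = ["search_flights", "search_hotels", "process_payment"]
def semi_autonomous_agent(user_msg):
    return llm("You act like a travel agent. Help the user book a trip. "
               f"You have access to these tools: {TOOLS}",
               messages=[user_msg])                    # the model picks the order

# 3. Most autonomous: the agent decides the steps and can create tools,
#    for example by writing code that it runs in a sandbox.
def fully_autonomous_agent(user_msg):
    code = llm(f"Write Python that solves: {user_msg}")
    return run_in_sandbox(code)                        # guardrails are essential
```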
1:12:05
So yeah.
1:12:07
Remember those keywords.
1:12:08
Memory, prompts,
tools, et cetera.
1:12:14
Now, I presented the
flight API, but it does not
1:12:18
have to be an API.
1:12:19
You probably have heard the term
MCP or model context protocol
1:12:23
that was coined by Anthropic.
1:12:25
I pasted the seminal article on
MCP at the bottom of this slide.
1:12:29
But let me explain in a nutshell
why those things would differ.
1:12:34
In the API case,
you would actually
1:12:39
teach your LLM to ping an API.
1:12:42
So you would say this is
how you ping this API,
1:12:45
and this is the data that
it will send you back.
1:12:48
And you would have to do
that in a one off manner.
1:12:51
So you would have
to build or give
1:12:53
the API documentation
of your flight API.
1:12:56
your hotel booking
API, your car rental API.
1:13:00
And then you would give
tools for your model
1:13:03
to communicate with those APIs.
1:13:06
It doesn't scale
very well versus MCP.
1:13:11
MCP, it's really about putting
a system in the middle that
1:13:19
would make it simpler for
your LLM to communicate
1:13:22
with that endpoint.
1:13:23
So for instance, you might have
an MCP server and an MCP client,
1:13:28
where you're trying
to communicate
1:13:30
with that travel database
or the flight API through MCP.
1:13:35
And your agent might actually
just communicate with it
1:13:38
and say, hey, what do you need
in order to give me more flight
1:13:42
information?
1:13:43
And that agent will respond:
I would like you to tell me
1:13:47
where the origin is,
where the destination is,
1:13:49
and what you're looking
for at a high level.
1:13:51
This is my requirement.
1:13:52
OK.
1:13:52
Let me get back to you
with my requirements.
1:13:55
Oh.
1:13:55
You forgot to tell me
your budget, whatever.
1:13:57
Oh.
1:13:58
Let me give you my
budget, et cetera.
1:14:00
And it's agent to
agent communication,
1:14:04
which allows more scalability.
1:14:06
You don't need to
hard code everything.
1:14:09
Companies have exposed
their MCPs out there,
1:14:11
and your agent can
communicate with them
1:14:14
and figure out how to
get the data it needs.
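Here's the shape of that back-and-forth, heavily simplified. Real MCP is a JSON-RPC protocol with official SDKs; this stub only shows why the negotiation scales better than hard coding each API:

```python
# A toy stub of an MCP-style exchange, not the real protocol.

class MCPServerStub:
    """Stands in front of the flight API and advertises what it needs."""
    def list_tools(self):
        return [{"name": "find_flights",
                 "required": ["origin", "destination", "budget"]}]

    def call(self, name, args):
        missing = [r for r in self.list_tools()[0]["required"] if r not in args]
        if missing:
            return {"error": f"You forgot to tell me: {missing}"}
        return {"result": f"flights for {args}"}

server = MCPServerStub()
print(server.list_tools())   # the agent discovers the requirements
print(server.call("find_flights", {"origin": "SFO", "destination": "CDG"}))
# -> error: missing 'budget'; the agent supplies it and retries
print(server.call("find_flights",
                  {"origin": "SFO", "destination": "CDG", "budget": 800}))
```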
1:14:16
Does that make sense?
1:14:18
Yeah.
1:14:21
[INAUDIBLE] rewriting
any [INAUDIBLE]
1:14:36
I think it is, ultimately.
1:14:39
The question is, isn't
it a shifting issue?
1:14:41
Because anyway, if an
API has to be updated,
1:14:43
the MCP has to be updated,
is what you say, right?
1:14:45
Yes, that's correct.
1:14:46
But at least it allows the
agent to go back and forth
1:14:51
and figure out what
the requirements are.
1:14:52
But at the end of the day,
ideally, if you're a startup,
1:14:56
you have some documentation.
1:14:57
And automatically, you have
an agent or an LLM workflow
1:15:00
that reads that documentation
and updates the code
1:15:03
accordingly.
1:15:04
But I agree.
1:15:05
It's not something that
is fully autonomous.
1:15:08
Yeah.
1:15:09
I've seen some
security issues.
1:15:12
Why is that possible?
1:15:14
Which security issues specifically?
1:15:16
[INAUDIBLE]
1:15:18
Yeah.
1:15:19
So are there security
issues with MCPs?
1:15:23
So think about it this way.
1:15:25
MCPs, depending on the data
that you get access to,
1:15:28
might have different
requirements, lower stake
1:15:30
or higher stake.
1:15:31
I'm not an expert
at the full range.
1:15:34
But it wouldn't surprise me
that when you expose an MCP to--
1:15:42
I think a lot of
MCPs have authentication.
1:15:45
So you might
actually need a code
1:15:47
to actually talk to it, just
like you would with an API,
1:15:50
or a key.
1:15:52
Yeah, but that's
a good question.
1:15:53
I'm not an expert at the
security of these systems,
1:15:56
but we can look into it.
1:16:02
Any other questions
on what we've
1:16:04
seen with the agentic workflows,
APIs, tools, MCPs, memory?
1:16:10
All of that is under progress.
1:16:11
So even memory is not a
solved problem by any means.
1:16:14
It's pretty hard actually.
1:16:16
Yes.
1:16:18
You don't need an
[INAUDIBLE] The MCP just
1:16:24
makes it easier to access
the API, but technically,
1:16:28
[INAUDIBLE]
1:16:40
Exactly, exactly.
1:16:42
Is MCP about efficiency
or accessing more data?
1:16:45
It's about efficiency.
1:16:47
Let's say you have a coding
agent, and it has an MCP client,
1:16:53
and there's multiple MCP servers
that are exposed out there.
1:16:57
That agent can communicate
very efficiently with them
1:17:00
and find what it needs.
1:17:03
And it's a more
efficient process
1:17:05
than actually exposing APIs,
and reading the APIs on that side,
1:17:09
and how to ping them and
what the protocol is.
1:17:12
But it's not about
the data that is
1:17:13
being exposed because
ultimately, you control
1:17:15
the data that is being exposed.
1:17:19
Depending
on how the MCP is built,
1:17:22
my guess is you probably
expose yourself to other risks
1:17:24
because your MCP server can
see any input pretty much
1:17:31
from another LLM.
1:17:32
And so it has to be robust.
1:17:36
But yeah.
1:17:37
Super.
1:17:39
So let's look at an
example of a step
1:17:41
by step workflow for
the travel agent.
1:17:45
So let's say the user says, I
want to plan a trip to Paris
1:17:50
from December 15 to
20th with flights,
1:17:56
hotels near the Eiffel Tower,
and then an itinerary of
1:18:00
must-visit places.
1:18:01
That's the task to
the travel agent.
1:18:04
Step two, the agent
plans the steps.
1:18:06
So it says, I'm going
to find flights.
1:18:08
Use the flight search API to
get options for December 15.
1:18:12
Search hotels, generate
recommendations for places
1:18:15
to visit, validate
preferences, budget, et cetera.
1:18:20
Book the trip with the
payment processing API.
1:18:24
That's just the
planning, by the way.
1:18:25
Step three, execute the
plan, use your tools,
1:18:28
combine the results,
and then proactive
1:18:31
user interaction and booking.
1:18:33
It might make a first
proposal to the user
1:18:35
and ask the user to
validate or invalidate
1:18:38
and then may repeat that
planning and execution process.
1:18:42
And then finally, it might
actually update the memory.
1:18:46
It might say, oh, I just
learned through this interaction
1:18:49
that the user only
likes direct flights.
1:18:51
Next time, I'll only
give direct flights.
1:18:55
Or I noticed users are fine with
three star hotels or four star
1:19:01
hotels.
1:19:01
And in fact, they don't want
to go above budget or something
1:19:05
like that.
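That memory update step might look something like this, as a sketch; llm() is a hypothetical helper and memory is just a dict here:

```python
# Sketch of the final step: after the interaction, the agent distills
# durable preferences and writes them to memory for next time.

def update_memory_after_trip(transcript, memory):
    learned = llm(
        "From this conversation, list preferences worth remembering for "
        "future trips (e.g. 'direct flights only', 'stay within budget'):\n"
        + transcript
    )
    memory.setdefault("preferences", []).append(learned)
    return memory
```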
1:19:08
So that hopefully makes sense
by now on how you might do that.
1:19:11
My question for you is how
would you know if this works?
1:19:16
And if you had such a system
running in production, how
1:19:19
would you improve it?
1:19:28
Yeah.
1:19:28
Let users rate
their experience.
1:19:31
So that's an example.
1:19:33
So let users rate their
experience at the end.
1:19:37
That would be an end
to end test, right?
1:19:39
You're looking at the user
experience through the steps
1:19:42
and say how good was it
from 1 to 5, let's say.
1:19:46
Yeah.
1:19:46
It's a good way.
1:19:47
And then if you learn
that a user says 1,
1:19:50
how do you improve the workflow?
1:19:56
[INAUDIBLE]
1:19:59
OK.
1:19:59
So you would go down a tree
and say, OK, you said 1.
1:20:04
What was your issue?
1:20:06
And then the user says the
prices were too high, let's say.
1:20:10
And then you would go back and
fix that specific tool or prompt
1:20:14
or, yeah, OK.
1:20:15
Any other ideas?
1:20:18
[INAUDIBLE]
1:20:29
Yeah, good.
1:20:29
So that's a good insight.
1:20:30
Separate the LLM related stuff
from the non-LLM related stuff,
1:20:34
the deterministic stuff.
1:20:35
The deterministic
stuff, you might
1:20:36
be able to fix it more
objectively essentially.
1:20:41
Yeah.
1:20:43
What else?
1:20:56
So give me an example
of an objective issue
1:21:00
that you can notice and
how you would fix it
1:21:03
versus a subjective issue.
1:21:06
Yeah.
1:21:06
[INAUDIBLE]
1:21:16
So let's say you say
there's the same flight,
1:21:19
but one is cheaper than
the other, let's say.
1:21:21
It's objectively worse.
1:21:23
And so you can capture
that almost automatically.
1:21:25
Yeah.
1:21:26
So you could
actually build evals
1:21:27
that are objective, that are
tracked across your users.
1:21:32
And you might actually
run an analysis after
1:21:34
and see that for
the objective stuff,
1:21:37
we notice that our LLM AI agent
workflow is bad with pricing.
1:21:43
It just doesn't read price
as well because it always
1:21:46
gives a more expensive option.
1:21:48
Yeah.
1:21:48
You're perfectly right.
1:21:49
How about the subjective stuff?
1:21:59
Do you choose a direct
or indirect flight
1:22:01
if the indirect is a
little bit cheaper?
1:22:05
Yeah.
1:22:05
Good one.
1:22:06
Do you choose a direct
flight or an indirect flight
1:22:09
if the indirect is cheaper but
the direct is more comfortable?
1:22:12
Yeah.
1:22:13
That's a good one actually.
1:22:16
So how would you capture
that information?
1:22:18
Let's say this is used
by thousands of users.
1:22:24
Could you feed
something in [INAUDIBLE]
1:22:28
Could you feed something in?
1:22:30
Yeah, I mean, you could--
1:22:32
could feed something in
about the user preferences?
1:22:36
Well, you could
build a data set that
1:22:39
has some of that information.
1:22:40
So you build 10 prompts, where
the user is asking specifically
1:22:44
for a direct--
1:22:46
saying that I prefer
direct flights because I
1:22:48
care about my time, let's say.
1:22:50
And then you look at the
output and you actually
1:22:53
give a good example
of a good output,
1:22:56
and you probably
are able to capture
1:22:58
the performance of your agentic
workflow on this specific eval.
1:23:04
Does it prioritize?
1:23:05
Does it understand
price conscious--
1:23:07
is it price conscious,
essentially,
1:23:08
and comfort conscious?
1:23:10
Yeah.
1:23:13
What about the tone?
1:23:14
Let's say the LLM right
now is not very friendly.
1:23:18
How would you notice that,
and how would you fix it?
1:23:26
Yeah.
1:23:26
Have the test user
run the prompt
1:23:29
and see if there's
something wrong with that.
1:23:33
OK.
1:23:33
Have a test user run the
prompt and see if there's
1:23:36
something wrong with that.
1:23:37
Tell me about the last step.
1:23:38
How would you notice
that something is wrong?
1:23:40
So a couple of tests
[INAUDIBLE] evaluates
1:23:48
the response and [INAUDIBLE]
1:23:51
Yeah.
1:23:52
I agree with your approach.
1:23:53
Have LLM judges that
evaluate the response
1:23:55
against a certain rubric of
what politeness looks like.
1:23:58
So here in this case,
you could actually
1:24:00
start with error analysis.
1:24:02
So you start, you
have 1,000 users.
1:24:05
And you can pull up
20 user interactions
1:24:07
and read through it.
1:24:09
And you might notice,
at first sight,
1:24:11
the LLM seems to be very rude.
1:24:14
It's just super, super
short in its answers,
1:24:18
and it's not very helpful.
1:24:20
You notice that with your
error analysis manually.
1:24:23
Then you go to the next stage.
1:24:24
You actually put
evals behind it.
1:24:26
You say, I'm going to
create a set of LLM judges
1:24:33
that are going to look
at the user interaction
1:24:35
and are going to rate
how polite it is.
1:24:38
And I'm going to
give it a rubric.
1:24:40
Then what I'm going to do
is I'm going to flip my LLM.
1:24:42
Instead of using GPT-4,
I'm going to use Grok.
1:24:45
And instead of using
Grok, I'm using Llama.
1:24:48
And then I'm going to run
those three LLMs side by side,
1:24:51
give it to my LLM judges, and
then get my subjective score
1:24:56
at the end to say, oh, x model
was more polite on average.
1:25:02
Yeah.
1:25:02
Perfectly right.
1:25:03
That's an example of an
eval that is very specific
1:25:05
and allows you to
choose between LLMs.
1:25:07
You could actually do the
same eval not across LLMs,
1:25:10
but fix the LLM,
change the prompt.
1:25:12
You actually, instead of
saying act like a travel agent,
1:25:15
you say act like a
helpful travel agent.
1:25:17
And then you see the influence
of that word on your eval
1:25:21
with the LLM as judges.
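A minimal sketch of that setup, assuming a hypothetical judge_llm() call and candidate models passed in as callables:

```python
# Sketch of an LLM-as-judge eval: run the same prompts through several
# candidate models, score each answer against a politeness rubric, and
# compare averages. judge_llm() and the candidates are hypothetical.

RUBRIC = ("Rate the assistant's politeness from 1 (rude) to 5 (warm and "
          "helpful). Consider tone, length, and acknowledgment of the "
          "user. Reply with the number only.")

def judge(answer):
    return int(judge_llm(f"{RUBRIC}\n\nAnswer:\n{answer}"))

def compare(prompts, models):
    # models maps a name to a callable,
    # e.g. {"gpt": ..., "grok": ..., "llama": ...}
    return {name: sum(judge(model(p)) for p in prompts) / len(prompts)
            for name, model in models.items()}
```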
1:25:22
Does that make sense?
1:25:24
OK.
1:25:25
Super.
1:25:26
So let's move forward and
do a case study with evals.
1:25:29
And then we're almost
done for today.
1:25:33
Let's say your product manager
asks you to build an AI
1:25:38
agent for customer support, OK?
1:25:41
Where do you start?
1:25:42
And here is an example
of the user prompt.
1:25:45
I need to change my shipping
address for order, blah, blah,
1:25:48
blah.
1:25:48
I move to a new address.
1:25:51
So where do you start if I'm
giving you that project?
1:26:04
Yes.
1:26:05
We search online for existing
models and [INAUDIBLE]
1:26:16
So do some research.
1:26:17
See benchmarks and
how different models
1:26:20
perform at customer support.
1:26:22
And then pick a model.
1:26:23
That's what you mean.
1:26:24
Yeah.
1:26:24
It's true you could do that.
1:26:25
What else could you do?
1:26:28
Yeah.
1:26:28
[INAUDIBLE]
1:26:34
OK.
1:26:34
Yeah, I like that.
1:26:35
Try to decompose the different
tasks that it will need
1:26:39
and try to guess which ones will
be more of a struggle, which
1:26:42
ones should be fuzzy, which
ones should be deterministic.
1:26:45
Yeah, you're right.
1:26:46
[INAUDIBLE]
1:26:55
Yeah.
1:26:56
Similar to what you said.
1:26:58
That's what I would
recommend as well.
1:27:00
You say I would sit down
with a customer support
1:27:02
agent for a day or two, and
I would decompose the tasks
1:27:04
that are going through.
1:27:05
I will ask them, where
do they struggle?
1:27:07
How much time it takes?
1:27:08
Yes.
1:27:09
That's usually where you want to
start with task decomposition.
1:27:12
So let's say we've done that
work, and we have this list.
1:27:16
I'm simplifying.
1:27:17
But the customer support
agent, human, typically
1:27:20
would extract key
info, then look up
1:27:23
in the database to retrieve
the customer record.
1:27:25
Then check the policy.
1:27:27
Are we allowed to
update the address,
1:27:29
or is it a fixed data point?
1:27:32
And then drafts a response
email and sends the email.
1:27:35
So we've decomposed that task.
1:27:39
Once you've
decomposed that task,
1:27:42
how do you design
your agentic workflow?
1:28:03
Yes.
1:28:04
[INAUDIBLE]
1:28:17
Exactly.
1:28:18
So to repeat,
you're going to look
1:28:20
at the decomposition of tasks,
get an instinct of what's fuzzy,
1:28:24
what's deterministic,
and then determine
1:28:28
which line is going to be an LLM
one shot, which one will require
1:28:33
maybe a RAG, which one will
require a tool, which one will
1:28:36
require memory, which one--
1:28:38
So you will start
designing that map.
1:28:41
Completely right.
1:28:41
That's also what
I would recommend.
1:28:43
You might actually draft it and
say, OK, I take the user prompt.
1:28:48
And the first step of
my task decomposition
1:28:52
was extract information that
seems to be a vanilla LLM.
1:28:57
You can guess that the
vanilla LLM would probably
1:29:00
be good enough at
extracting the user wants
1:29:03
to change their address,
and this is the order number
1:29:05
and this is the new address.
1:29:06
You probably don't need
too much technology
1:29:08
there other than the LLM.
1:29:11
The next step, it feels like
you need a tool because you're
1:29:14
actually going to have to
look up in the database
1:29:17
and also update the address.
1:29:21
So that might be a
tool, and you might
1:29:23
have to build a custom
tool for the LLM
1:29:25
to say, let me connect
you to that database
1:29:27
or let me give you access to
that resource with an MCP.
1:29:32
After that, you probably need an
LLM again to draft the email,
1:29:35
and you would probably
paste in the confirmation.
1:29:38
You would paste in the
confirmation that your address
1:29:40
has been updated from x to y.
1:29:42
And then the LLM
will draft an answer.
1:29:44
And of course,
just to not forget,
1:29:46
you might need a tool
to send the email.
1:29:49
You might actually need
to post something somewhere
1:29:54
for the email to go out.
1:29:57
And then you'll get the output.
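Here's that whole workflow as a sketch, with llm(), llm_json(), orders_db, and send_email() as hypothetical stand-ins:

```python
# Sketch of the designed workflow: a vanilla LLM for extraction, a
# database tool in the middle, an LLM again for drafting, and a send
# tool at the end. All helpers are hypothetical.

def handle_address_change(user_msg):
    # 1. Extraction: a vanilla LLM is probably good enough here.
    info = llm_json(f"Extract order_id and new_address as JSON: {user_msg}")

    # 2. Tool: look up the record and check the policy (deterministic).
    record = orders_db.lookup(info["order_id"])
    if not record["address_editable"]:
        return "Sorry, the address on this order can no longer be changed."
    orders_db.update_address(info["order_id"], info["new_address"])

    # 3. LLM again, with the confirmation pasted into the prompt.
    draft = llm(
        "Draft a polite confirmation email. Fact to include: the address "
        f"was updated from {record['address']} to {info['new_address']}."
    )

    # 4. Tool: actually post the email so it goes out.
    send_email(record["customer_email"], draft)
    return draft
```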
1:29:59
Does that make sense? So,
exactly what you described.
1:30:02
Now moving to the next step.
1:30:03
Once we have-- we've
decomposed our tasks.
1:30:06
Then we have designed an
agentic workflow around it.
1:30:09
It took us five minutes.
1:30:10
In practice, it
would take you more
1:30:12
if you're building
your startup on that.
1:30:13
You want to make sure your
task decomposition is accurate,
1:30:15
your design is accurate
here, and then
1:30:17
there's a lot of
work to be done on every tool
1:30:20
to optimize it for
latency and cost.
1:30:22
But let's say, now we
want to know if it works.
1:30:27
And I'm going to assume
that you have LLM traces.
1:30:30
LLM traces are very important.
1:30:33
Actually, if you're
interviewing with an AI startup,
1:30:36
I would recommend that during the
interview process you ask them,
1:30:39
do you have LLM traces?
1:30:40
Because if they don't
have LLM traces,
1:30:42
it is pretty hard to debug an
LLM system because you don't
1:30:46
have visibility on the chain of
complex prompts that were called
1:30:50
and where the bug is.
1:30:52
And so it's a basic
part of an AI startup
1:30:57
stack to have LLM traces.
1:31:00
So let's assume you have traces.
1:31:02
How would you know
if your system works?
1:31:04
I'm going to summarize some
of the things I heard earlier.
1:31:11
You gave us an example
of an end to end metric.
1:31:15
You look at the user
satisfaction at the end.
1:31:18
You can also do a
component-based approach
1:31:21
where you actually will look at
the tool, the database updates,
1:31:25
and you will manually do
an error analysis and see,
1:31:28
oh, the tool actually always
forgets to update the address.
1:31:32
It just fails at writing.
1:31:33
And I'm going to fix that.
1:31:34
This is deterministic
pretty much.
1:31:37
Or when it tries
to send the email
1:31:40
and ping the system that is
supposed to send the email,
1:31:44
it doesn't send it
in the right format.
1:31:46
And so it bugs at that point.
1:31:48
Again, you could fix that.
1:31:51
Draft of the email.
1:31:52
The LLM doesn't do a great job.
1:31:53
It's not very polite
at drafting the email.
1:31:56
So you could look at
component by component,
1:31:59
and it's actually easier
to debug than to look at it
1:32:01
end to end.
1:32:02
You would probably
do a mix of both.
1:32:05
Another way to look at
it is what is objective
1:32:08
versus what is subjective?
1:32:10
So for example, an
objective example
1:32:12
would be the LLM extracted
the wrong order ID.
1:32:18
The user said my order
ID is X, and the LLM,
1:32:21
when it actually looked
up in the database,
1:32:24
it used the wrong order ID.
1:32:26
This is objectively wrong.
1:32:27
You can actually
write Python code
1:32:29
that checks that, checks just
the alignment between what
1:32:32
the user mentioned and what was
actually passed to the database
1:32:36
or used for the lookup.
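For instance, a minimal sketch of that check over traces, assuming an illustrative trace format and order-ID pattern:

```python
import re

# Objective eval over LLM traces: did the order ID the user mentioned
# match the order ID actually used in the database lookup? The trace
# field names and ID format are illustrative.

ORDER_ID = re.compile(r"\b(ORD-\d+)\b")   # assumed ID format

def order_id_matches(trace):
    mentioned = ORDER_ID.search(trace["user_message"])
    used = trace["db_lookup"]["order_id"]
    return bool(mentioned) and mentioned.group(1) == used

traces = [
    {"user_message": "My order ID is ORD-1042, please update the address",
     "db_lookup": {"order_id": "ORD-1042"}},
    {"user_message": "Order ORD-77 needs a new address",
     "db_lookup": {"order_id": "ORD-770"}},          # objectively wrong
]
accuracy = sum(order_id_matches(t) for t in traces) / len(traces)
print(f"order-id extraction accuracy: {accuracy:.0%}")   # 50%
```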
1:32:38
You also have subjective
stuff, which we talked about,
1:32:40
where you probably want to
do either human rating or LLM
1:32:43
as judges.
1:32:44
It's very relevant
for subjective evals.
1:32:49
And finally, you
will find yourself
1:32:51
having quantitative evals
and more qualitative evals.
1:32:55
So quantitative would be
percentage of successful address
1:32:59
updates.
1:33:00
The latency.
1:33:00
You could actually track
the latency component-based
1:33:03
and see which one
is the slowest.
1:33:05
Let's say sending the
email is five seconds.
1:33:08
It's too long, let's say.
1:33:10
You would notice component
based or the full workflow.
1:33:13
And then you will decide, where
am I optimizing my latency,
1:33:15
and how am I going to do that?
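And a sketch of the quantitative side, computing per-component latency from traces with an illustrative trace structure:

```python
from collections import defaultdict

# Quantitative eval from traces: average latency per component, so the
# bottleneck stands out. The trace structure is illustrative.

def latency_report(traces):
    per_step = defaultdict(list)
    for t in traces:
        for step in t["steps"]:   # e.g. extract, db_update, send_email
            per_step[step["name"]].append(step["ms"])
    return {name: sum(ms) / len(ms) for name, ms in per_step.items()}

traces = [{"steps": [{"name": "extract", "ms": 300},
                     {"name": "db_update", "ms": 120},
                     {"name": "send_email", "ms": 5000}]}]
print(latency_report(traces))   # send_email at 5s stands out as the bottleneck
```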
1:33:17
And then finally, qualitative.
1:33:20
You might actually do
some error analysis
1:33:23
and look at where are
the hallucinations?
1:33:27
Where are the tone mismatches?
1:33:31
Are the users confused, and
what are they confused by?
1:33:34
That would be more qualitative.
1:33:36
And typically, it would take
more white glove approaches
1:33:41
to do that.
1:33:42
So here's what it
could look like.
1:33:44
I gave you some examples.
1:33:46
But you would build
evals to determine
1:33:50
objectively, subjectively,
component-based, end
1:33:53
to end based, and then
quantitatively and
1:33:55
qualitatively, where's
your LLM failing
1:33:57
and where it's doing well.
1:34:02
Does that give you a
sense of the type of stuff
1:34:04
you could do to fix or
improve that agentic workflow?
1:34:09
Super.
1:34:10
Well, that was our
case study on evals.
1:34:12
We're not going to
delve deeper into it.
1:34:14
But hopefully, it gave you
a sense of the type of stuff
1:34:16
you can do with LLM
judges, with objective,
1:34:21
subjective, component-based,
end to end, et cetera.
1:34:25
Last section on
multi-agent workflows.
1:34:29
So you might ask, hey, why do we
need multi-agent workflows when
1:34:36
the workflow already
has multiple steps,
1:34:38
already calls the LLM multiple
times, already gives it tools?
1:34:42
Why do we need multiple agents?
1:34:45
And so many people are talking
about multi-agent systems online.
1:34:47
It's not even a
new thing, frankly.
1:34:49
Multi-agent systems have
been around for a long time.
1:34:52
The main advantage of
a multi-agent system
1:34:55
is going to be parallelism.
1:34:57
It's like is there
something that I
1:34:59
wish I would run in parallel,
sort of independently,
1:35:04
but maybe there are some
things in the middle?
1:35:07
But that's where you want
to put a multi-agent system.
1:35:09
It's when it's parallel.
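A minimal sketch of that parallelism argument, using asyncio and a hypothetical run_agent() that stands in for one agent's LLM-plus-tool round trip:

```python
import asyncio

# Independent sub-agents run concurrently; the orchestrator gathers
# their results. run_agent() is a stand-in for a real agent's workflow.

async def run_agent(name, task):
    await asyncio.sleep(1)          # stand-in for an LLM + tool round trip
    return f"{name} finished: {task}"

async def orchestrate():
    results = await asyncio.gather(
        run_agent("flights", "find SFO->CDG options"),
        run_agent("hotels", "find hotels near the Eiffel Tower"),
        run_agent("itinerary", "draft must-visit places"),
    )
    return results                   # ~1s total instead of ~3s sequentially

print(asyncio.run(orchestrate()))
```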
1:35:12
The other advantage
that some companies
1:35:14
have with multi-agent systems
is an agent can be reused.
1:35:19
So let's say in a company,
you have an agent that's
1:35:21
been built for design.
1:35:22
That agent can be used
in the marketing team,
1:35:25
and it can be used
in the product team.
1:35:27
And so now you're
optimizing an agent,
1:35:30
which has multiple stakeholders
that can communicate with it
1:35:33
and benefit from
its performance.
1:35:38
Actually I'm going
to ask you a question
1:35:40
and give you a few seconds, maybe a
minute, to think about it.
1:35:43
Let's say you were
building smart home
1:35:46
automation for your
apartment or your home.
1:35:50
What agents would
you want to build?
1:35:52
Yeah.
1:35:53
Write it down.
1:35:54
And then I'm going to
ask you in a minute
1:35:57
to share some of the
agents that you will build.
1:36:00
Also, think about
how you would put
1:36:03
a hierarchy between
these agents,
1:36:04
or how you would
organize them, or who
1:36:06
should communicate with who.
1:36:07
OK?
1:36:08
OK.
1:36:08
Take a minute for that.
1:36:12
Be creative also because I'm
going to ask for all of your agents,
1:36:14
and maybe you have an agent
that nobody has thought of.
1:36:21
OK.
1:36:22
Let's get started.
1:36:24
Who wants to give
me a set of agents
1:36:26
that you would want for
your home, your smart home?
1:36:29
Yes.
1:36:32
The first is like a set
of agents [INAUDIBLE]
1:37:00
OK.
1:37:01
So let me repeat.
1:37:02
You have four agents,
I think, roughly.
1:37:05
One that tracks biometrics,
like where are you in the home?
1:37:09
Where are you moving?
1:37:10
How you're moving,
things like that.
1:37:12
That sort of knows
your location.
1:37:15
The second one determines
the temperature of the rooms
1:37:21
and has the ability
to change it.
1:37:23
The third one tracks
energy efficiency
1:37:26
and might give feedback on
energy and energy usage.
1:37:31
And might be, I
don't know, maybe
1:37:32
it has the control over
the temperature as well.
1:37:34
I don't know actually.
1:37:35
Or the gas or the water, might
cut your water at some point.
1:37:43
And then you have an
orchestrator agent.
1:37:44
What is exactly the
orchestrator doing?
1:37:48
It passes instructions
[INAUDIBLE]
1:37:53
OK.
1:37:53
Passes instructions.
1:37:55
So is that the agent
that communicates mainly
1:37:58
with the user?
1:38:00
So if I'm coming
back home and I'm
1:38:02
saying I want the
oven to be preheated,
1:38:05
I communicate with
the orchestrator,
1:38:07
and then it would
funnel to another agent.
1:38:09
OK.
1:38:10
Sounds good.
1:38:11
Yeah.
1:38:11
So that's an example
of, I want to say,
1:38:14
a hierarchical
multi-agent system.
1:38:20
What else?
1:38:21
Any other ideas?
1:38:22
What would you add to that?
1:38:24
Yeah.
1:38:25
[INAUDIBLE]
1:38:55
Oh, I like that.
1:38:56
That's a really good one.
1:38:57
So let me summarize.
1:38:58
You have a security agent that
determines if you can enter
1:39:02
or not.
1:39:03
And when you enter, it
understands who you are.
1:39:06
And then it gives
you certain sets
1:39:08
of permissions that might
be different depending
1:39:11
of if you're a parent or a kid.
1:39:13
Or you might have access to
certain cars and not others.
1:39:17
Or your kid cannot open the
fridge, or I don't know.
1:39:20
Something like that.
1:39:21
Yeah.
1:39:22
OK, I like that.
1:39:23
That's a good one.
1:39:24
And it does feel like it's a
complex enough workflow where
1:39:28
you want a specific
workflow tied to that.
1:39:32
I agree.
1:39:34
What else?
1:39:39
Yes.
1:39:41
[INAUDIBLE] So you can
get more complicated.
1:39:43
So, high energy savings
tied to whether or not you
1:39:50
or someone else is home, the blinds
in the house, or also
1:39:55
when you tap into the grid.
1:39:57
Yeah. So another thought I
have as well, which is much harder
1:40:04
to track than in the grocery store:
1:40:06
understanding
what's in your fridge.
1:40:08
OK
1:40:12
Well, that's really
good actually.
1:40:14
So you mentioned two of them.
1:40:16
One is maybe an agent that has
access to external APIs that
1:40:20
can understand the weather
out there, the wind, the sun,
1:40:24
and then has control over
certain devices at home.
1:40:28
Temperature, blinds, things
like that, and also understands
1:40:31
your preferences for it.
1:40:33
That does feel like it's a good
use case because you could give
1:40:36
that to the orchestrator,
but it might lose itself
1:40:38
because it's doing too much.
1:40:41
And also, these problems
are tied together,
1:40:43
like temperature outdoor
with the weather API
1:40:45
might influence the
temperature inside,
1:40:48
how you want it, et cetera.
1:40:50
And then the second
one, which I also like,
1:40:52
is you might have an agent
that looks at your fridge
1:40:55
and what's inside.
1:40:57
And it might
actually have access
1:40:58
to the camera in the
fridge, for example,
1:41:01
and know your
preferences and also has
1:41:03
access to the
e-commerce API to order
1:41:06
Amazon groceries ahead of time.
1:41:09
I agree.
1:41:10
And maybe the orchestrator
will be the communication line
1:41:12
with the user, but it might
communicate with that agent
1:41:16
in order to get it done.
1:41:17
Yeah.
1:41:18
I like those.
1:41:19
So those are all
really good examples.
1:41:21
Here is the list I had up there.
1:41:25
So climate control, lighting,
security, energy management,
1:41:30
entertainment,
a notification agent,
1:41:32
alerts about system updates,
energy saving, and an orchestrator.
1:41:35
So all of them you
mentioned actually.
1:41:38
And then we didn't talk about
the different interaction
1:41:41
patterns, but you do have
different ways to organize
1:41:45
a multi-agent system.
1:41:46
Flat, hierarchical.
1:41:48
It sounds like this
would be hierarchical.
1:41:51
I agree.
1:41:52
And the reason is
UI/UX: I would rather
1:41:55
have to only talk
to the orchestrator
1:41:57
than have to go to
a specialized application
1:42:00
to do something.
1:42:01
Like it feels like
the orchestrator
1:42:02
could be responsible for that.
1:42:04
And so I agree, I would probably
go for a hierarchical setup
1:42:07
here.
1:42:08
But maybe you might also
add some connections
1:42:11
between other agents,
like in the flat system
1:42:13
where it's all to all.
1:42:15
For example, with climate
control and energy,
1:42:17
if you want to
connect those two,
1:42:19
you might actually allow them
to speak with each other.
1:42:21
When you allow agents to
speak with each other,
1:42:24
it is basically an MCP
protocol, by the way.
1:42:26
So you treat the agent like
a tool, exactly like a tool.
1:42:30
Here is how you interact
with this agent.
1:42:32
Here is what it can tell you.
1:42:34
Here is what it needs
from you, essentially.
1:42:37
OK super.
1:42:38
And then without going
into the details,
1:42:40
there are advantages to
multi-agent workflows
1:42:43
versus single agents,
such as debugging.
1:42:47
It's easier to debug
a specialized agent
1:42:50
than to debug an entire system.
1:42:52
Parallelization as well.
1:42:54
It's easier to have
things run in parallel,
1:42:56
and you can save time.
1:42:59
There are some
advantages to doing that,
1:43:01
and I'll leave you with this
slide if you want to go deeper.
1:43:04
Super.
1:43:05
So we've learned so many
techniques to optimize LLMs,
1:43:08
from prompts to chains to
fine tuning, retrieval,
1:43:12
and to multi-agent
systems as well.
1:43:14
And then just to end on a couple
of trends I want you to watch.
1:43:19
I think next week is
Thanksgiving, is that it?
1:43:21
It's Thanksgiving break.
1:43:22
No, the week after.
1:43:23
OK.
1:43:24
Well ahead of the
Thanksgiving break.
1:43:26
So if you're traveling, you
can think about these things.
1:43:29
On what's next in AI, I wanted
to call out a couple of trends.
1:43:34
So Ilya Sutskever, one of
the OGs of LLMs and OpenAI
1:43:40
co-founder, raised that question
about are we plateauing or not.
1:43:45
The question is, are we going to
see in the coming years LLMs sort
1:43:50
of not improve as fast as
we've seen in the past?
1:43:54
It's been the feeling
in the community
1:43:56
probably that the
last version of GPT
1:44:00
did not bring the
level of performance
1:44:03
that people were expecting,
although it did make
1:44:06
it so much easier to use for
consumers because you don't need
1:44:09
to interact with
different models.
1:44:10
It's all under the same hood.
1:44:12
So it seems that
it's progressing,
1:44:14
but the plateau is unclear.
1:44:17
The way I would think about it
is the LLM scaling laws tell us
1:44:22
that if we continue to
improve compute and energy,
1:44:26
then LLMs should
continue to improve.
1:44:28
But at some point,
it's going to plateau.
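One common empirical form of those scaling laws, from the Chinchilla line of work (Hoffmann et al., 2022), writes the loss as a power law in parameters and data with an irreducible floor:

```latex
% Chinchilla-style scaling law: loss falls as a power law in
% parameters N and training tokens D, down to an irreducible term E.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```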
1:44:29
So what's going to take
us to the next step?
1:44:32
It's probably
architecture search.
1:44:35
Still, a lot of LLMs,
even if we don't
1:44:36
understand what's under
the hood, are probably
1:44:38
transformer-based today.
1:44:40
But we know that the human brain
does not operate the same way.
1:44:43
There's just certain
things that we
1:44:45
do that are much more
efficient, much faster.
1:44:47
We don't need as much data.
1:44:49
So theoretically,
we have so much
1:44:51
to learn in terms of
architecture search
1:44:53
that we haven't figured out.
1:44:54
It's not a surprise that
you see those labs hire
1:44:57
so many engineers.
1:44:58
Because it is possible
that in the next few years,
1:45:01
you're going to have
thousands of engineers trying
1:45:03
to figure out the different
engineering hacks and tactics
1:45:06
and architectural
searches that are
1:45:07
going to lead to better models.
1:45:10
And one of them suddenly will
find the next transformer,
1:45:13
and it will reduce by 10x the
need for compute and the need
1:45:17
for energy.
1:45:18
It's sort of like if you read Isaac
Asimov's Foundation series.
1:45:24
Individuals can have an amazing
impact on the future because
1:45:27
of their decisions.
1:45:29
Whoever discovered transformers
had a tremendous impact
1:45:33
on the direction of AI.
1:45:34
I think we're going to see
more of that in the coming
1:45:37
years, where some group of
researchers that is iterating
1:45:40
fast might discover certain
things that would suddenly
1:45:43
unlock that plateau and
take us to the next step,
1:45:45
and it's going to continue
to improve like that.
1:45:47
And so it doesn't surprise me
that there's so many companies
1:45:50
hiring engineers right
now to figure out
1:45:52
those hacks and
those techniques.
1:45:56
The other set of gains
that we might see
1:45:58
is from multi-modality.
1:45:59
So the way to think about it is
we've had LLMs first text-based,
1:46:04
and then we've added images.
1:46:06
And today, models are
very good at images.
1:46:09
They're very good at text.
1:46:10
It turns out that being good at
images and being good at text
1:46:13
makes the whole model better.
1:46:15
So the fact that you're good
at understanding a cat image
1:46:18
makes you better at
text as well for a cat.
1:46:21
Now you add another modality
like audio or video.
1:46:24
The whole system gets better.
1:46:26
So you're better at
writing about a cat
1:46:28
if you know what
a cat sounds like,
1:46:30
if you can look at a
cat on an image as well.
1:46:31
Does that make sense?
1:46:32
So we see gains that are
translated from one modality
1:46:35
to another, and that might lead
1:46:38
to the pinnacle of robotics
where all these
modalities come together.
1:46:40
And suddenly, the
robot is better at
1:46:42
running away from a cat
because it understands
1:46:44
what a cat is, what
it sounds like,
1:46:46
what it looks like, et cetera.
1:46:48
That makes sense?
1:46:49
The other one is the multiple
methods working in harmony.
1:46:53
In the Tuesday lectures, we've
seen supervised learning,
1:46:56
unsupervised learning,
self-supervised learning,
1:46:58
reinforcement learning, prompt
engineering, RAGs, et cetera.
1:47:02
If you look at how
babies learn, it
1:47:06
is probably a mix of those
different approaches.
1:47:09
Like a baby might have some
meta learning, meaning it
1:47:13
has some survival
instinct that is
1:47:16
encoded in the DNA most likely.
1:47:19
And that's like the baby's
pre-training, if you will.
1:47:22
On top of that, the mom or
the dad is pointing at stuff
1:47:27
and saying bad, good, bad, good.
1:47:29
Supervised learning.
1:47:30
On top of that, the baby
is falling on the ground
1:47:33
and getting hurt.
1:47:34
And that's a reward signal
for reinforcement learning.
1:47:36
On top of that, the baby
is observing other people
1:47:39
doing stuff or
other babies doing
1:47:42
stuff, unsupervised learning.
1:47:43
You see what I mean?
1:47:44
We're probably a mix
of all these methods,
1:47:47
and I think that's where
the trend is going, is
1:47:49
where those methods that
you've seen in CS230
1:47:52
come together in order to build
an AI system that learns fast,
1:47:56
is low latency, is
cheap, energy-efficient,
1:48:00
and makes the most out
of all of these methods.
1:48:03
Finally, and this is
especially true at Stanford,
1:48:06
you have research going on that
you would consider human-centric
1:48:11
and some research that
is non-human centric.
1:48:13
By human-centric, I should
say, approaches
1:48:16
that are modeled after the
brain, versus approaches that
1:48:19
are not modeled after humans.
1:48:20
Because it turns out that the
human body is very limiting.
1:48:24
And so if you actually
only do research
1:48:26
on what the human
brain looks like,
1:48:28
you're probably missing out on
compute and energy and stuff
1:48:30
like that that you
can optimize even
1:48:32
beyond neuronal
connections in the brain,
1:48:35
but you still can learn a
lot from the human brain.
1:48:37
And that's why there are
professors that are running labs
1:48:40
right now that
try to understand,
1:48:42
how does back propagation
work for humans?
1:48:45
And in fact, it's probably that
we don't have back propagation.
1:48:48
We don't use back propagation,
we only do forward propagation,
1:48:51
let's say.
1:48:51
So this type of stuff
is interesting research
1:48:54
that I would encourage you
to read if you're curious
1:48:56
about the direction of AI.
1:48:59
And then finally, one thing
that's going to be pretty clear,
1:49:02
I call it out all the time,
but it's the velocity
1:49:05
at which things are moving.
1:49:06
You're noticing,
part of the reason
1:49:08
we're giving you
a breadth in CS230
1:49:10
is because these methods
are changing so fast.
1:49:12
So I don't want to bother
going and teaching you
1:49:15
method number 17
on RAG that
1:49:17
optimizes the RAG,
because in two years,
1:49:19
you're not going to need it.
1:49:20
So I would rather
you think about what
1:49:23
is the breadth of things
you want to understand.
1:49:25
And when you need it, you
are sprinting and learning
1:49:27
the exact thing you need faster
because the half-life of a skill
1:49:30
is so short.
1:49:31
You want to come out of the
class with a good breadth
1:49:34
and then have the ability
to go deep whenever
1:49:36
you need after the class.
1:49:38
And so that's sort of how that
class is designed as well.
1:49:41
Yeah.
1:49:41
That's it for today.
1:49:43
So thank you.
1:49:45
Thank you for participating.
— end of transcript —