Stanford CS230 | Autumn 2025 | Lecture 8: Agents, Prompts, and RAG
Stanford Online · May 10, 2026
Transcript
0:05
Hi, everyone.
0:06
Welcome to another lecture
for CS230 Deep Learning.
0:11
Today, we're going to talk about
enhancing large language model
0:17
applications.
0:19
And I call this
lecture Beyond LLM.
0:23
It has a lot of newer content.
0:26
And the idea behind
this lecture is
0:31
we started to learn
about neurons,
0:34
and then we learned
about layers,
0:35
and then we learned about
deep neural networks,
0:38
and then we learned a little bit
about how to structure projects
0:43
in C3.
0:44
And now we're going one level
beyond into, what would it
0:48
look like if you were building
agentic AI systems at work,
0:54
in a startup, in a company?
0:58
And it's probably one of
the more practical lectures.
1:02
Again, the goal is
not to build a product
1:05
end to end in the
next hour or so,
1:07
but rather to tell
you all the techniques
1:09
that AI engineers have cracked,
figured out, are exploring,
1:15
so that after the class,
you have the breadth of view
1:18
of different
prompting techniques,
1:20
different agentic workflows,
multi-agent systems, evals.
1:25
And then when you
want to dive deeper,
1:26
you have the baggage to
dive deeper and learn faster
1:29
about it.
1:32
Let's try to make it as
interactive as possible, as
1:36
usual.
1:37
When we look at the
agenda, the agenda
1:40
is going to start with the
core idea behind challenges
1:45
and opportunities
for augmenting LLMs.
1:48
So we start from a base model.
1:50
How do we maximize the
performance of that base model?
1:55
Then we'll dive deep into the
first line of optimization,
1:59
which is prompting methods, and
we'll see a variety of them.
2:02
Then we'll go slightly deeper.
2:04
If we were to get our
hands under the hood
2:06
and do some fine tuning,
what would it look like?
2:09
I'm not a fan of fine tuning,
and I talk a lot about that,
2:12
but I'll explain why I try to
avoid fine tuning as much as
2:16
possible.
2:18
And then we'll do a section 4 on
Retrieval-Augmented Generation,
2:22
or RAG, which you've probably
heard of in the news.
2:26
Maybe some of you
have played with RAGs.
2:28
We're going to
unpack what a RAG is
2:31
and how it works and then the
different methods within RAGs.
2:36
And then we'll talk about
agentic AI workflows.
2:40
I'll define it.
2:42
Andrew Ng is one
of the first ones
2:45
to have called this trend
agentic AI workflows.
2:49
And so we look at the
definition that Andrew
2:51
gives to agentic
workflows, and then we'll
2:54
start seeing examples.
2:56
The section 6 is very practical.
2:59
It's a case study where we will
think about an agentic workflow,
3:05
and I'll ask you to measure
if the agent actually works,
3:10
and we brainstorm
how we can measure
3:13
if an agentic
workflow is working
3:15
the way you want it to work.
3:16
There's plenty of methods called
evals that solve that problem.
3:22
And then we'll look briefly
at multi-agent workflow.
3:24
And then we can have a
open-ended discussion
3:27
where I share some thoughts
on what's next in AI.
3:31
And I'm looking forward
to hearing from you all,
3:34
as well, on that one.
3:36
So let's get started with the
problem of augmenting LLMs.
3:42
So open-ended question for you--
3:44
you are all familiar
with pre-trained models
3:47
like GPT-3.5 Turbo or GPT-4o.
3:52
What's the limitation of
using just a base model?
3:56
What are the typical
issues that might
3:59
arise as you're using a
vanilla pre-trained model?
4:07
Yes.
4:08
It lacks some domain knowledge.
4:10
Lacks some domain knowledge.
4:11
You're perfectly right.
4:13
We had a group of
students a few years ago.
4:16
It was not LLM related, but
they were building an autonomous
4:22
farming device or vehicle that
had a camera underneath, taking
4:26
pictures of crops to
determine if the crop is
4:30
sick or not, if it
should be thrown away,
4:32
if it should be used or not.
4:35
And that data set is not a
data set you find out there.
4:40
And the base model or
pre-trained computer vision
4:44
model would lack that
knowledge, of course.
4:47
What else?
4:49
Yes.
4:50
[INAUDIBLE] pictures are
very dark [INAUDIBLE]
4:57
OK, maybe the-- you're saying--
4:59
so just to repeat
for people online,
5:02
you're saying the model
might have been trained
5:04
on high-quality data,
but the data in the wild
5:06
is actually not
that high quality.
5:08
And in fact, yes, the
distribution of the real world
5:11
might differ, as we've seen with
GANs, from the training set,
5:16
and that might create an
issue with pre-trained models.
5:18
Although pre-trained
LLMs are getting better
5:20
at handling all
sorts of data inputs.
5:25
Yes.
5:26
Lacks current information.
5:28
Lack what?
5:28
Current information.
5:30
Lacks current information.
5:32
The LLM is not up to date.
5:34
And in fact, you're right.
5:35
Imagine you have to retrain
from scratch your LLM
5:38
every couple of months.
5:39
One story that I found funny--
5:42
it's from probably three years
ago, or maybe more, five years
5:45
ago, where during
his first presidency,
5:49
President Trump one
day tweeted, "Covfefe."
5:53
You remember that tweet or no?
5:56
Just "Covfefe."
5:57
And it was probably a typo
or it was in his pocket.
5:59
I don't know.
6:00
But that word did not exist.
6:03
The LLMs, in fact, that
Twitter was running at the time
6:06
could not recognize that word.
6:08
And so the recommender
system sort of went wild,
6:11
because suddenly everybody was
making fun of that tweet using
6:15
the word "Covfefe," and the LLM
was so confused on, what does
6:19
that mean?
6:20
Where should we show it?
6:21
To whom should we show it?
6:22
And this is an example
of a-- nowadays,
6:25
especially on social media,
there's so many new trends,
6:28
and it's very hard to retrain
an LLM to match the new trend
6:33
and understand the
new words out there.
6:34
I mean, you oftentimes hear Gen
Z words like "rizz" or "mid"
6:39
or whatever.
6:40
I don't know all of them.
6:41
But you probably want
to find a way that
6:45
can allow the LLM to understand
those trends without retraining
6:49
the LLM from scratch.
6:51
What else?
6:53
It's trained to have a
breadth of knowledge.
6:56
And if you wanted to do
something specialized,
6:58
that might limit [INAUDIBLE].
6:59
Yeah, it might be trained
on a breadth of knowledge,
7:02
but it might fail or
not perform adequately
7:05
on a narrow task that
is very well defined.
7:09
Think about enterprise
applications that--
7:11
yeah, enterprise application.
7:13
You need high precision,
high fidelity, low latency.
7:17
And maybe the model is not
great at that specific thing.
7:20
It might do fine, but
just not good enough.
7:22
And you might want to
augment it in a certain way.
7:24
Yeah.
7:25
Maybe it has [INAUDIBLE]
so it makes the model
7:29
a lot heavier, a lot slower.
7:32
[INAUDIBLE]
7:33
So maybe it has a lot of broad
domain knowledge that might not
7:37
be needed for your application.
7:39
And so you're using a
massive, heavy model
7:41
when you actually are only using
2% of the model capability.
7:44
You're perfectly right.
7:45
You might not need all of it.
7:46
So you might find ways to prune,
quantize the model, modify it.
7:51
All of these are good points.
7:53
I'm going to add a
few more, as well.
7:55
LLMs are very
difficult to control.
7:58
Your last point is actually
an example of that.
8:00
You want to control the LLM to
use a part of its knowledge,
8:03
but it's not--
8:04
it's, in fact, getting confused.
8:06
We've seen that in history.
8:08
In 2016, Microsoft created
a notorious Twitter
8:13
bot that learned from users, and
it quickly became a racist jerk.
8:18
Microsoft ended up removing the
bot 16 hours after launching it.
8:22
The community was really
fast at determining
8:25
that this was a racist bot.
8:28
And you can empathize with
Microsoft in the sense
8:31
that it is actually
hard to control an LLM.
8:34
They might have done a better
job to qualify before launching,
8:37
but it is really hard
to control an LLM.
8:40
Even more recently,
this is a tweet
8:42
from Sam Altman
last November, where
8:46
there was this debate
between Elon Musk and Sam
8:50
Altman on whose LLM is
the left wing propaganda
8:54
machine or the right
wing propaganda machine,
8:57
and they were hating
on each other's LLMs.
8:59
But that tells you,
at the end of the day,
9:01
that even those two teams, Grok
and OpenAI, which are probably
9:05
the best-funded teams
with a lot of talent,
9:08
are not doing a great job
at controlling their LLMs.
9:14
And from time to time,
if you hang out on X,
9:16
you might see screenshots of
users interacting with LLMs
9:21
and the LLM saying something
really controversial
9:24
or racist or something that
would not be considered great
9:31
by social standards, I guess.
9:33
And that tells you that the
model is really hard to control.
9:39
The second aspect
of it is something
9:41
that you mentioned earlier.
9:43
LLMs may underperform
in your task,
9:47
and that might include
specific knowledge gaps,
9:49
such as medical diagnosis.
9:51
If you're doing
medical diagnosis,
9:52
you would rather have an LLM
that is specialized for that
9:55
and is great at it
and, in fact, something
9:57
that we haven't mentioned
as a group, has sources.
10:00
So the answer is
sourced specifically.
10:03
You have a hard time
believing something
10:05
unless you have the actual
source of the research that
10:08
backs it up.
10:10
Inconsistencies in
style and format--
10:12
so imagine you're building
a legal AI agentic workflow.
10:17
Legal has a very specific
way to write and read,
10:21
where every word counts.
10:22
If you're negotiating
a large contract,
10:25
every word on that contract
might mean something else
10:28
when it comes to the court.
10:29
And so it's very
important that you use
10:31
an LLM that is very good at it.
10:34
The precision matters.
10:35
And then task-specific
understanding,
10:38
such as doing a classification
on a niche field,
10:40
here I pulled an example where--
let's say a biotech product is
10:45
trying to use an
LLM to categorize
10:48
user reviews into positive,
neutral, or negative.
10:54
Maybe for that
company, something
10:56
that would be considered a
negative review typically
11:01
is actually considered
a neutral review
11:04
because the NPS of
that industry tends
11:06
to be way lower than other
industries, let's say.
11:10
That's a task-specific
understanding,
11:12
and the LLM needs to
be aligned to what
11:14
the company believes is the
categorization that it wants.
11:17
We will see an example of how to
solve that problem in a second.
11:21
And then limited
context handling--
11:24
a lot of AI applications,
especially in the enterprise,
11:28
have required data that
has a lot of context.
11:33
Just to give you
a simple example,
11:35
knowledge management
is an important space
11:37
where enterprises buy a lot
of knowledge management tools.
11:40
When you go on your drive and
you have all your documents,
11:43
ideally, you could have an LLM
running on top of that drive.
11:47
You can ask any question,
and it will read immediately
11:50
thousands of documents
and answer, what was
11:53
our Q4 performance in sales?
11:56
It was x dollars.
11:58
It finds it super quickly.
11:59
In practice, because LLMs do
not have a large enough context,
12:04
you cannot use a standalone
vanilla pre-trained LLM to solve
12:07
that problem.
12:08
You will have to augment it.
12:11
Does that make sense?
12:13
The other aspect around context
windows is they are, in fact,
12:16
limited.
12:17
If you look at the context
windows of the models
12:20
from the last five years,
even the best models
12:25
today will range in context,
window, or number of tokens
12:30
it can take as input, somewhere
in the hundreds of thousands
12:35
of tokens max.
12:36
Just to give you a sense,
200,000 tokens is roughly two
12:40
books.
12:42
So that's how much
you can upload
12:45
and it can read, pretty much.
12:47
And you can imagine
that when you're
12:48
dealing with video
understanding or heavier data
12:52
files, that is, of
course, an issue.
12:56
So you might have to chunk it.
12:58
You might have to embed it.
12:59
You might have to
find other ways
13:00
to get the LLM to
handle larger contexts.
13:06
The attention mechanism is
also powerful, but problematic,
13:10
because it does not do
a great job at attending
13:13
in very large contexts.
13:16
There is actually an
interesting problem
13:19
called needle in a haystack.
13:21
It's an AI problem where--
13:23
or call it a benchmark--
13:25
where, in order to test if your
LLM is good at putting attention
13:30
on a very specific fact
within a large corpus,
13:35
researchers might
randomly insert
13:38
about one sentence
that outlines
13:44
a certain fact,
such as Arun and Max
13:47
are having coffee
at Blue Bottle,
13:48
in the middle of the
Bible, let's say,
13:51
or some very long text.
13:54
And then you ask the LLM,
what were Arun and Max having
14:01
at Blue Bottle?
14:02
And you see if it remembers
that it was coffee.
14:04
It's actually a complex problem,
not because the question
14:07
is complex, but because
you're asking the model
14:09
to find a fact within
a very large corpus,
14:12
and that's complicated.
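A minimal sketch of how such a needle-in-a-haystack test can be constructed; `call_llm` is a hypothetical helper standing in for whatever model API you use:

```python
import random

def call_llm(prompt: str) -> str:
    # Hypothetical stub: wire this to your model provider of choice.
    return "They were having coffee."

def needle_in_haystack_test(corpus: str, needle: str,
                            question: str, answer: str) -> bool:
    # Plant the needle sentence at a random position in the long corpus.
    sentences = corpus.split(". ")
    sentences.insert(random.randint(0, len(sentences)), needle)
    haystack = ". ".join(sentences)

    # Ask the model to recover the planted fact from the full context.
    response = call_llm(f"{haystack}\n\nQuestion: {question}")
    return answer.lower() in response.lower()

ok = needle_in_haystack_test(
    corpus="Some very long text. " * 10_000,
    needle="Arun and Max are having coffee at Blue Bottle",
    question="What were Arun and Max having at Blue Bottle?",
    answer="coffee",
)
print("needle recovered:", ok)
```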
14:16
So, again, this is a
limiting factor for LLMs.
14:19
We'll talk about
RAG in a second.
14:21
But I want to preview--
14:22
there is debates
around whether RAG
14:26
is the right long-term
approach for AI systems.
14:29
So as a high-level idea, a RAG
is a mechanism, if you will,
14:34
that embeds documents that
an LLM can retrieve and then
14:39
add as context to its initial
prompt and answer a question.
14:44
It has lots of application.
14:45
Knowledge management
is an example.
14:47
So imagine you have
your drive again.
14:49
But every document is
compressed in representation,
14:53
and the LLM has
access to that lower
14:55
dimensional representation.
14:59
The debates that this tweet
from [INAUDIBLE] outlines
15:03
is, in theory, if we
have infinite compute,
15:08
then RAG is useless.
15:09
Because you can just read a
massive corpus immediately
15:13
and answer your question.
15:15
But even in that case,
latency might be an issue.
15:19
Imagine the time
it takes for an AI
15:20
to read all your drive every
single time you ask a question.
15:24
It doesn't make sense.
15:25
So RAG has other advantages
beyond even the accuracy.
15:30
On top of that, the
sourcing matters, as well.
15:33
So it might-- RAG
allows you to source.
15:35
We'll talk about all that later.
15:38
But there's always this
debate in the community
15:42
whether a certain method
is actually future proof.
15:46
Because in practice, as compute
power doubles every year,
15:49
let's say, some of the methods
we're learning right now
15:52
might not be relevant
three years from now.
15:54
We don't know, essentially.
15:59
And the analogy that he
makes on context windows
16:04
and why RAG approaches might
be relevant even a long time
16:07
from now is search.
16:09
When you search on
a search engine,
16:12
you still find sources
of information.
16:14
And in fact, in the
background, there
16:16
is very detailed
traversal algorithms
16:20
that rank and find the specific
links that might be the best
16:25
to present you versus if you
had to read-- imagine you had
16:29
to read the entire web every
single time you're doing
16:31
a search query, without
being able to narrow
16:34
to a certain portion
of the space.
16:36
That might, again,
not be reasonable.
16:41
OK, when we're thinking
of improving LLMs,
16:46
the easiest way we think
of it is two dimensions.
16:50
One dimension is we are going
to improve the foundation
16:53
model itself.
16:54
So, for example, we move
from GPT 3.5 Turbo, to GPT 4,
17:01
to GPT-4o, to GPT-5.
17:04
Each of that is supposed
to improve the base model.
17:07
GPT 5 is another debate because
it's packaging other models
17:11
within itself.
17:12
But if you're thinking
about 3.5, 4, and 4o,
17:15
that's really what it is.
17:16
The pre-trained model improves.
17:18
And so you should
see your performance
17:20
improve on your tasks.
17:22
But the other dimension is
we can actually engineer--
17:27
leverage the LLM in a
way that makes it better.
17:30
So you can prompt
simply GPT-4o.
17:34
You can change some prompts
and improve the prompt,
17:38
and it will improve
the performance.
17:40
It's shown.
17:41
You can even put
a RAG around it.
17:42
You can put an agentic
workflow around it.
17:45
You can even put a
multi-agent system around it.
17:49
And that is another dimension
for you to improve performance.
17:52
So that's how I want you
to think about it-- which
17:54
LLM I'm using, and
then how can I maximize
17:56
the performance of that LLM?
17:59
This lecture is about
the vertical axis.
18:02
Those are the methods
that we will see together.
18:08
Sounds good for
the introduction.
18:11
So let's move to
prompt engineering.
18:14
I'm going to start with
an interesting study just
18:17
to motivate why prompt
engineering matters.
18:20
There is a study
from Harvard Business School
and Wharton at UPenn
18:31
that took a subset
of BCG consultants,
18:34
individual contributors,
and split them into three groups.
18:37
One group had no access to AI.
18:39
One group had access to--
18:41
I think it was GPT 4.
18:44
And then one group
had access to the LLM,
18:46
but also a training on
how to prompt better.
18:50
And then they observed the
performance of these consultants
18:53
across a wide variety of tasks.
18:56
There's a few things
that they noticed
18:57
that I thought was interesting.
18:59
One is something they
called the jagged frontier,
19:02
meaning that certain tasks
that consultants are doing fall
19:07
beyond the jagged frontier,
meaning AI is not good enough.
19:14
It's not improving
human performance.
19:18
In fact, it's actually
making it worse.
19:20
And some tasks are
within the frontier,
19:23
meaning that AI is actually
significantly improving
19:27
the performance, the speed,
the quality of the consultant.
19:32
Many tasks fell within and
many tasks fell without,
19:35
and they shared their insights.
19:37
But the TLDR is--
19:39
there is a frontier within
which AI is absolutely helping
19:42
and one where they call out
this behavior of falling asleep
19:47
at the wheel, where people
relied on AI on a task that
19:51
was beyond the frontier.
19:52
And in fact, it
ended up going worse
19:55
because the human was not
reviewing the outputs carefully
19:58
enough.
20:01
They did note that the
group that was trained
20:04
was the best, better than the
group that was not trained
20:08
on prompt engineering,
which also motivates why
20:10
this lecture matters, so
that you're within that group
20:14
afterwards.
20:15
Another insight was the
centaurs and the cyborgs.
20:20
They noticed that
consultants had the tendency
20:22
to work with AI in
one of two ways,
20:24
and you might, yourself, be
part of one of these groups.
20:29
The centaurs are
mythical creatures
20:31
that are half human, half--
20:35
I think, half, what, horses?
20:38
Yeah?
20:39
Horses?
20:39
Half horses, half something.
20:42
And those were individuals
that would divide and delegate.
20:45
They might give a pretty
big task to the AI.
20:48
So imagine you're working on a
PowerPoint, which consultants
20:51
are known to do.
20:52
You might actually write
a very long prompt on how
20:55
you want it to do your
PowerPoint and then let it
20:57
work for some time
and then come back
20:59
and it's done, when others
would act as cyborgs.
21:02
Cyborgs are fully blended,
bionic human robots,
21:06
human and robot, augmented
with robotic parts.
21:10
And those individuals will
not delegate fully a task.
21:13
They would actually work
super quickly with the model
21:16
back and forth.
21:17
I find that a lot of students
actually work more
21:20
like cyborgs than centaurs,
while in the enterprise,
21:24
when you're trying to
automate the workflow,
21:26
you're thinking
more like a centaur.
21:29
That's just something
good to keep in mind.
21:31
Also, a lot of companies
will tell you, oh, we're
21:33
hiring prompt
engineers, et cetera.
21:34
It's a career.
I don't buy that.
I don't buy that.
21:36
I think it's just a skill
that everybody should have.
21:39
You're not going to
make a [? cure ?] out
21:40
of prompt engineering,
but you're probably
21:42
going to use it as a very
powerful skill in your career.
21:49
So let's talk about basic
prompt design principles.
21:52
I'm giving you a very
simple prompt here.
21:56
Summarize this document,
and then the document
21:58
is uploaded alongside it.
22:00
And the model has not
much context around
22:04
what should be the summary?
22:06
How long should be the summary?
22:07
What should it talk
about, et cetera?
22:09
You can actually improve these
prompts by doing something like
22:14
summarize this 10-page
scientific paper on renewable
22:18
energy in five bullet points,
focusing on key findings
22:22
and implications
for policymakers.
22:25
That's already better.
22:26
You're sharing the
audience, and it's
22:28
going to tailor it
to the audience.
22:30
You're saying that you
want five bullet points,
22:33
and you want to focus
only on key findings.
22:35
That's a better prompt,
you would argue.
22:39
How could you even make
this prompt better?
22:41
What are other
techniques that you've
22:43
heard of or tried yourself that
could make this one shot prompt
22:47
better?
22:53
Yeah.
22:53
[INAUDIBLE]
22:57
OK.
22:58
Right, an example.
22:58
So say, you mean, here is an
example of a great summary.
23:02
Yeah.
23:03
You're right.
23:03
That's a good idea.
23:05
[INAUDIBLE]
23:08
Very popular technique.
23:10
Act like a renewable energy
expert giving a conference
23:15
at Davos, let's say, yeah.
23:17
That's great.
23:18
Someone-- yeah.
23:20
Say you're really good at it.
23:22
Yeah.
23:23
You are the best in
the world at this.
23:25
Explain.
23:26
Yeah.
23:26
Actually, I mean,
these things work.
23:28
It's funny, but it does work
to say act like x, y, z.
23:32
It's a very popular
prompt template.
23:34
We'll see a few examples.
23:36
What else could you do?
23:40
Yes.
23:41
Of course, you'd like to
critique your own model.
23:46
Critique your own project.
23:47
So you're using reflection.
23:48
So you might actually
do one output
23:50
and then ask it to critique
it and then give it back.
23:52
Yeah.
23:53
We see that.
23:53
That's a great one.
23:54
That's the one that
probably works best
23:56
within those typically,
but we see some examples.
23:59
What else?
24:00
Yeah.
24:01
Break the task down into steps.
24:03
OK.
24:03
Break the task down into steps.
24:05
You know how that is called?
24:06
No.
24:07
OK.
24:08
Chain of thoughts.
24:09
So this is actually
a popular method
24:12
that's been shown in
research that it improves.
24:15
You could actually give
a clear instruction
24:17
and also encourage the
model to think step
24:19
by step approach, the
task step by step,
24:22
and do not skip any step.
24:24
And then you give it some
steps, such as step one,
24:26
identify the three most
important findings.
24:29
Step two, explain
how each finding
24:31
impacts renewable energy policy.
24:33
Step three, write the
five-bullet summary
24:36
with each point addressing
a finding, et cetera.
24:39
So chain of thoughts, I linked
the paper from 2023 that
24:45
popularized chain of thoughts.
24:46
Chain of thoughts
is very popular
24:48
right now, especially
in AI startups
24:50
that are trying to
control their LLMs.
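As a rough illustration, the step-by-step instructions can be baked directly into the prompt string; `call_llm` is again a hypothetical stand-in for your model API:

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stub for your model API.
    return "..."

# Chain-of-thought prompt: explicit steps, plus an instruction not to skip any.
cot_prompt = """Summarize this 10-page scientific paper on renewable energy
in five bullet points, focusing on key findings and implications for policymakers.

Approach the task step by step and do not skip any step.
Step 1: Identify the three most important findings.
Step 2: Explain how each finding impacts renewable energy policy.
Step 3: Write the five-bullet summary, with each point addressing a finding.

Paper:
{paper_text}
"""

summary = call_llm(cot_prompt.format(paper_text="<paste the paper here>"))
```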
24:55
OK.
24:56
To go back to your examples
about act like XYZ, what
25:01
I like to do, Andrew Ng
also talks about that,
25:03
is to look at other
people's prompts.
25:06
And in fact, in online, you have
a lot of prompt repositories
25:10
for free on GitHub.
25:11
In fact, I linked the awesome
prompt template repo on GitHub,
25:16
where you have so many
examples of great prompts
25:19
that engineers have built. They
said it works great for us,
25:22
and they published it online.
25:23
And a lot of them
start with act as.
25:27
Act as a Linux terminal.
25:29
Act as an English translator.
25:31
Act like a position
interviewer, et cetera.
25:37
The advantage of
a prompt template
25:38
is that you can actually
put it in your code
25:42
and scale it for
many user requests.
25:44
So let me give you an
example from Workera.
25:48
Workera evaluates skills.
25:50
Some of you have taken
the assessments already.
25:52
And it tries to personalize
it to the user.
25:56
And in fact, if you look
in an HR system
25:59
at an enterprise,
26:01
you might have: Jane is
a product manager, level 3,
26:06
and she is in the US, and her
preferred language is English.
26:10
And actually, that
metadata can be
26:13
inserted in a prompt template
that will be
26:15
personalized for Jane.
26:16
And similarly for Joe, whose
preferred language is Spanish,
26:22
it will tailor it to Joe.
26:24
And that's called
a prompt template.
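A minimal sketch of that idea, with made-up metadata fields standing in for whatever the HR system actually stores:

```python
# Hypothetical metadata pulled from an enterprise HR system.
users = [
    {"name": "Jane", "role": "product manager", "level": 3, "language": "English"},
    {"name": "Joe",  "role": "product manager", "level": 2, "language": "Spanish"},
]

# One template, scaled across many user requests by filling in the metadata.
TEMPLATE = (
    "Act like a great AI mentor that helps people in their career. "
    "The user is {name}, a {role} (level {level}) whose preferred "
    "language is {language}. Answer in {language}."
)

for user in users:
    print(TEMPLATE.format(**user))
```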
26:26
[INAUDIBLE]
26:34
So the question is do
the foundation models
26:39
use prompt
templates, or do you
26:41
have to integrate it yourself?
26:42
So the foundation
models probably
26:45
use a system prompt
that you don't see.
26:47
Like when actually,
you type on ChatGPT,
26:50
it is possible, it's not public,
that OpenAI behind the scenes
26:55
has like act like a very
helpful assistant for this user.
26:59
And by the way, here is
your memories about the user
27:03
that we kept in a database.
27:05
You can actually
check your memories.
27:07
And then your prompt goes under,
and then the generation starts.
27:10
So probably, they're
using something like that.
27:12
But it doesn't mean you
can't add one yourself.
27:15
So in fact, if you think about a
prompt template for the Workera
27:19
example I was showing,
maybe it starts
27:22
when you call OpenAI by act
like a helpful assistant.
27:25
And then underneath, it's like
act like a great AI mentor that
27:29
helps people in their career.
27:31
And OpenAI's
prompt template also
27:33
has "follow the instructions
from the creator"
27:36
or something like that.
27:37
It's possible.
27:41
Questions about
prompt templates?
27:42
Again, I would encourage you to
go and read examples of prompts.
27:45
Some of them are
quite thoughtful.
27:48
Let's talk about zero shot
versus few shot prompting.
27:51
It came up earlier.
27:53
Here's an example.
27:54
Again, going back to the
categorization of product
27:57
reviews, let's say that
we're working on a task
28:01
where the prompt is classify
the tone of the sentence
28:05
as positive,
negative, or neutral.
28:07
And then you paste the review,
which is the product is fine,
28:12
but I was expecting more.
28:16
If I were to survey the room,
I would bet that some of you
28:19
would say it's negative.
28:21
Some of you would
say it's neutral.
28:23
Because you actually
have a first part
28:24
that is relatively positive.
28:27
It's fine.
28:28
And then the second part,
I was expecting more,
28:30
which is relatively negative.
28:31
So where do you land?
28:33
This can be a
subjective question.
28:35
And maybe in one industry, this
would be considered amazing.
28:37
And another one, it would
be considered really bad
28:40
because people are used to
really flourishing reviews.
28:44
And so the way you can actually
align the model to your task
28:47
is by converting that
zero shot prompt.
28:49
Zero shot refers to
the fact that it's not
28:51
being given any example.
28:53
Into a few-shot
prompt, where the model
28:56
is given in the prompt, a set
of examples to align it to what
29:00
you want it to do.
29:01
So the example
here is again, you
29:03
paste the same prompt as
before with the user review.
29:06
And then you add,
here are examples
29:08
of tone classifications.
29:10
These exceeded my
expectation completely.
29:12
Positive.
29:14
It's OK, but I wish
it had more features.
29:17
Negative.
29:18
The service was adequate.
29:20
Neither good nor bad.
29:22
Neutral.
29:23
Now classify the
tone of this sentence
29:26
after you've heard
about these things,
29:28
and the model then
says negative.
29:31
And the reason it says
negative, of course,
29:33
is likely because of the second
example, which was it's OK,
29:39
but I wish it had more features,
which we told the model that
29:42
was negative.
29:43
Because the model saw
that it's aligned now
29:45
with your expectations.
29:47
Few-shot prompts
are very popular.
29:50
And in fact, for
AI startups that
29:52
are slightly more
sophisticated, you
29:54
might see them keep
a prompt up to date.
29:57
Whenever a user says
something and they
30:00
might have a human
label it and then
30:02
add it as few-shot examples
in their relevant
30:05
prompts in their code base.
30:08
You can think of that as
almost building a data set.
30:10
But instead of actually
building a separate data set
30:12
like we've seen with
supervised fine tuning
30:15
and then fine tuning
the model on it,
30:17
you're just putting it
directly in the prompt.
30:19
It turns out it's
probably faster
30:21
to do that if you want
to experiment quickly
30:23
because you don't touch
the model parameters.
30:25
You just update your prompts.
30:27
And if it's text
examples, you can actually
30:30
concatenate so many
examples in a single prompt.
30:34
At some point, it
will be too long,
30:36
and you will not have the
necessary context window.
30:39
But it's a pretty
strong approach
30:40
that is quick to align an LLM.
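A sketch of what that looks like in code, using the review example from above; `call_llm` is a hypothetical helper:

```python
def call_llm(prompt: str) -> str:
    return "Negative"  # hypothetical stub; wire to your provider

# Labeled examples align the model with how *your* company labels reviews.
FEW_SHOT_EXAMPLES = [
    ("This exceeded my expectations completely.", "Positive"),
    ("It's OK, but I wish it had more features.", "Negative"),
    ("The service was adequate, neither good nor bad.", "Neutral"),
]

def classify_tone(review: str) -> str:
    lines = ["Classify the tone of the sentence as Positive, Negative, or Neutral.",
             "Here are examples of tone classifications:"]
    lines += [f'"{text}" -> {label}' for text, label in FEW_SHOT_EXAMPLES]
    lines.append(f'Now classify the tone of this sentence: "{review}"')
    return call_llm("\n".join(lines))

print(classify_tone("The product is fine, but I was expecting more."))
```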
30:48
OK?
30:49
Yes.
30:50
[INAUDIBLE]
30:57
So the question was is there
any research on how long
31:00
the prompt can be before
the model essentially loses
31:03
itself or doesn't follow
instructions anymore?
31:06
There is.
31:08
The problem is that research
is outdated every few months
31:11
because models get better.
31:14
And so I don't know where
the state of the art is.
31:16
You can probably find
it online on benchmarks
31:18
on like we see that--
31:20
I give you an example.
31:23
On the Workera product, you
have a voice conversation
31:27
for some of you
that have tried it,
31:28
where you're asked to
explain what is the prompt.
31:30
And then you explain,
and then there's
31:31
a scoring algorithm in behind.
31:33
We know that after eight
turns, the model loses itself.
31:38
After eight turns,
because you always
31:40
paste the previous
user response,
31:42
it just starts going wild.
31:44
And so the techniques
we use in the background
31:46
is we actually create
chapters of the conversation.
31:49
Maybe one chapter is
the first eight prompts.
31:51
And then you actually start
over from another prompt.
31:53
You can summarize the first
part of the conversation,
31:56
insert the summary,
and then keep going.
31:59
Those are engineering hacks that
engineers might have figured out
32:02
in the background.
32:04
Because eight turns makes a
prompt quite long actually.
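A rough sketch of that chaptering hack; the eight-turn threshold and the helper names are illustrative, not Workera's actual implementation:

```python
def call_llm(prompt: str) -> str:
    return "..."  # hypothetical stub

MAX_TURNS = 8  # illustrative threshold; beyond it, the model starts to drift

def respond(history: list[str], user_message: str) -> tuple[list[str], str]:
    history = history + [f"User: {user_message}"]
    if len(history) > MAX_TURNS:
        # Close the chapter: compress the conversation so far into a summary
        # and start a fresh, short history seeded with that summary.
        summary = call_llm("Summarize this conversation:\n" + "\n".join(history))
        history = [f"Summary of the conversation so far: {summary}",
                   f"User: {user_message}"]
    reply = call_llm("\n".join(history) + "\nAssistant:")
    return history + [f"Assistant: {reply}"], reply
```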
32:13
Let's move on to chaining.
32:15
Chaining is the most popular
technique out of everything
32:17
we've seen so far in
prompt engineering.
32:22
It's not chain of thought.
32:23
So chain of thought we've
seen is think step by step,
32:26
step 1, step 2, step 3.
32:27
Do not skip any step.
32:28
This is different.
32:30
This is chaining complex
prompt to improve performance,
32:34
and this is what it looks like.
32:37
You take a single step prompt,
such as read this customer
32:40
review and write a
professional response that
32:43
acknowledges their concern,
explains the issue,
32:46
offers a resolution,
and then you
32:48
paste the customer review,
which is I ordered a laptop.
32:51
It arrived three days late.
32:52
The packaging was damaged.
32:54
Very disappointing.
32:56
I needed that urgently for work.
32:59
And then the output
is an email that
33:01
is immediately given
to you by the LLM
33:04
after it reads the prompt.
33:08
So this might work, but it
might be hard to control.
33:14
Because think about it.
33:15
There's multiple steps
that you have listed,
33:18
and everything is embedded
in the same prompt.
33:20
And if you wanted to debug step
by step and know which step is
33:24
weaker, you couldn't.
33:24
You would have everything
mixed together.
33:27
So one advantage of chaining is
you would separate the prompts,
33:32
so that you can debug
them separately.
33:35
And it will also lead
to an easier manner
33:38
to improve your workflow.
33:41
Let's say a first prompt
is extract the key issues.
33:44
Identify the key
concerns mentioned
33:46
in this customer review.
33:47
Paste the customer review.
33:49
Second prompt.
33:50
Using these issues, so
you paste back the issues,
33:54
draft an outline for a
professional response that
33:57
acknowledges concerns,
explains possible reasons,
34:00
and offer a resolution.
34:04
So this is not--
34:06
Prompt number 3, write
the full response.
34:09
So using the outline, write
the professional response.
34:14
And then you get
your final output.
34:18
So in theory, you can tell
me, oh, the second approach
34:22
is better than the
first one at first.
34:23
But what you can notice
is that we can actually
34:27
test those three prompts
separately from each other
34:29
and determine if we will get the
most gains out of engineering
34:35
the first prompt, optimizing
it, or the second one,
34:38
or the third one.
34:39
We now have three prompts that
are independent from each other.
34:43
And maybe if the
outline was better,
34:47
the performance of the email,
how much the open rate will be
34:53
or the user satisfaction
on the response
34:55
will actually get higher.
34:57
And so chaining improves
performance,
35:00
but most importantly, helps
you control your workflow
35:04
and debug it more seamlessly.
35:07
Yes.
35:09
So if we know that the three prompts
independently work really well,
35:15
if we combine them
into one prompt,
35:17
and we highlight a step
by step thinking process,
35:21
does on average, we get
a [INAUDIBLE] by itself,
35:24
or do we still have
to do that breakdown?
35:28
So let me try to rephrase.
35:30
You say, let's say we look
at the first prompt which
35:32
has all three tasks
built in that prompt.
35:37
What exactly do you mean?
35:39
You mean like if we
evaluate the output
35:41
and we measure some user
insight, satisfaction,
35:43
et cetera?
35:45
Why don't we just modify that
prompt and essentially see how
35:49
it improves user satisfaction?
35:51
Yeah.
35:51
[INAUDIBLE]
35:54
I see.
35:55
So why do we need
the three steps?
35:57
I mean, think about it.
35:59
The intermediate output
is what you want to see.
36:02
Like if I'm debugging
the first approach,
36:06
the way I would do it is I
would capture user insights.
36:09
Like here's the email.
36:10
How good was the response?
36:11
Thumbs up, thumbs down.
36:13
Was your issue resolved?
36:16
Thumbs up, thumbs down.
36:17
Those would tell me
how good is my prompt.
36:19
And I can engineer that
prompt, optimize it,
36:21
and I would probably
drive some gains.
36:23
But I will not be able
easily to trace back
36:26
to what the problem was.
36:28
While in the second
approach, not only I
36:30
can use the end to end
metrics to improve my process.
36:33
I can also use the
intermediate steps.
36:35
For example, if I look at prompt
2 and I look at the outline
36:38
and I see the outline is
actually, meh, it's not great,
36:41
then I think I can get a lot
of gains out of the outline.
36:45
Or the outline is
actually really good,
36:47
but the last prompt doesn't do
a good job at translating it
36:50
into an email.
36:51
So the outline is exactly
what I want the LLM to do,
36:54
but the translation in
a customer facing email
36:57
is not good.
36:58
In fact, it doesn't follow
our vocabulary internally.
37:01
Then I know the
third prompt is where
37:03
I would get the most gains.
37:06
So that's what it
allows me to do,
37:07
have intermediate
steps to review.
37:10
Are there any
latency [INAUDIBLE]
37:13
We'll talk about it.
37:14
Are there any latency concerns?
37:16
Yes.
37:17
In certain applications, you
don't want to use a chain
37:20
or you don't want to use a long
chain because it adds latency.
37:26
We'll talk about that later.
37:27
Good point.
37:28
So practically, this is
what chaining complex
37:32
prompts look like.
37:33
You have your first prompt
with your first task.
37:35
It outputs.
37:36
The output is pasted
in the second prompt
37:39
with the second
task being defined.
37:41
The output is then pasted
into the third prompt
37:43
with the third task
being defined and so on.
37:46
That's what it looks
like in practice.
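In code, a minimal sketch of that three-prompt chain might look like this; `call_llm` is a hypothetical helper:

```python
def call_llm(prompt: str) -> str:
    return "..."  # hypothetical stub

def handle_review(review: str) -> str:
    # Prompt 1: extract the key issues.
    issues = call_llm(
        "Identify the key concerns mentioned in this customer review:\n" + review)
    # Prompt 2: draft an outline from the extracted issues.
    outline = call_llm(
        "Using these issues, draft an outline for a professional response that "
        "acknowledges concerns, explains possible reasons, and offers a "
        "resolution:\n" + issues)
    # Prompt 3: write the full response from the outline. Each intermediate
    # output (issues, outline) can be logged and debugged on its own.
    return call_llm(
        "Using this outline, write the professional response:\n" + outline)

email = handle_review(
    "I ordered a laptop. It arrived three days late. The packaging was "
    "damaged. Very disappointing. I needed it urgently for work.")
```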
37:52
Super.
37:55
We'll talk more later
about testing your prompts,
37:58
but there are
methods now to do it,
38:00
and we'll see later in this
lecture with our case study
38:03
how we can test our prompts.
38:06
But here is an example
of how you might do it.
38:11
You might have a
summarization workflow prompt
38:18
that is the baseline.
38:19
It's a single prompt.
38:21
You might have a
refined summarization
38:23
which is a modified
prompt of this,
38:26
or a workflow with a chain.
38:30
And then you have your test
case, which is the input
38:34
that you want to
summarize, let's say.
38:36
And then you have
the generated output.
38:38
And you can have humans
go and rate these outputs.
38:42
And you would notice that the
baseline is better or worse
38:46
than the refined prompt.
38:47
Of course, this manual
approach takes time,
38:51
but it's a good way to start.
38:53
And usually, the advice is
get hands on at the beginning
38:56
because you would quickly
notice some issues,
38:58
and it will give you better
intuition on what tweaks
39:01
can lead to better performance.
39:03
However, if you wanted
to scale that system
39:05
across many products, many
parts of your code base,
39:08
you might want to find a
way to do that automatically
39:10
without asking humans to
review and grade summaries.
39:14
One approach is
to use platforms,
39:19
like at Workera, our team uses a
platform called promptfoo that
39:23
allows you to actually
automate part of this testing.
39:26
In a nutshell,
what it does is it
39:30
can allow you to run the same
prompt with five different LLMs
39:35
immediately, put
everything in a table.
39:37
That makes it super easy for
a human to grade, let's say.
39:40
Or alternatively, it might
allow you to define LLM judges.
39:46
LLM judges can come
in different flavors.
39:50
For example, I can
have an LLM judge that
39:52
does a pairwise comparison.
39:54
So what the LLM is asked to
do is here are two summaries.
39:58
Just tell me which one is
better than the other one.
40:01
That's what the LLM does.
40:02
And that can be used
as a proxy for how good
40:04
the summarization baseline
versus the refined version is.
40:08
Another way to do
an LLM judge is
40:11
if you do it for a
single answer grading,
40:14
so here's a summary
graded from 1 to 5.
40:18
And then you can go
even deeper and do
40:21
a reference-guided
pairwise comparison.
40:24
Or you add also a rubric.
40:25
You say a 5 is when a summary
is below 100 characters.
40:30
I'm just making this up.
40:31
Below 100 characters.
40:33
Mentions at least
three key points
40:35
that are distinct and starts
with a first sentence that
40:38
displays the overview and
then goes into the detail.
40:40
That's a great summary,
number 5 out of a 5.
40:42
0 is the LLM failed to summarize
and actually was very verbose,
40:48
let's say.
40:49
And so you put a
rubric behind it,
40:52
and you have an LLM judge
following the rubric.
40:55
Of course, you can now
pair different techniques.
40:57
You can do a few
shots for the rubric.
40:58
You can actually give examples
of a 5 out of 5, a 4 out of 5,
41:02
a 3 out of 5, because now
you can combine multiple techniques.
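Hand-rolled sketches of the two judge flavors follow; platforms like promptfoo wrap this kind of thing for you. `call_llm` is a hypothetical helper, and the rubric thresholds are the made-up ones from above:

```python
def call_llm(prompt: str) -> str:
    return "A"  # hypothetical stub; wire to your provider

def pairwise_judge(summary_a: str, summary_b: str) -> str:
    # Pairwise comparison: the judge only has to pick the better of two
    # outputs, e.g. the baseline prompt versus the refined prompt.
    return call_llm(
        "Here are two summaries of the same document.\n\n"
        f"Summary A:\n{summary_a}\n\nSummary B:\n{summary_b}\n\n"
        "Which summary is better? Answer with exactly 'A' or 'B'.")

# Single-answer grading against a rubric.
RUBRIC = (
    "5: under 100 characters, mentions at least three distinct key points, "
    "and opens with an overview sentence before going into detail.\n"
    "0: fails to summarize, or is very verbose.")

def rubric_judge(summary: str) -> str:
    return call_llm(
        "Grade this summary from 0 to 5 using the rubric below.\n"
        f"Rubric:\n{RUBRIC}\n\nSummary:\n{summary}\n\nGrade:")
```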
41:06
Does that make sense?
41:11
Yeah.
41:11
OK.
41:12
So that was the second
section on prompt engineering
41:15
or the first line
of optimization.
41:19
Now, let's say you've
exhausted all your chances
41:22
for prompt
engineering, and you're
41:24
thinking about actually touching
the model, modifying its weights
41:28
or fine tuning it
in other words.
41:31
I was telling you, I'm
not a fan of fine tuning.
41:34
There's a few reasons why.
41:37
One, it requires substantial
labeled data typically
41:42
to fine tune.
41:43
Although now, there
are approaches
41:46
that are getting better
at fine tuning that
41:48
look more like few-shot prompting
than fine tuning, actually.
41:52
It's sort of merging.
41:54
Although one
modifies the weight,
41:56
the other doesn't
modify the weights.
41:57
Fine tuned models may also
overfit to specific data.
42:01
We're going to see a
funny example actually.
42:04
Losing their general
purpose utility.
42:06
So you might fine tune a model.
42:08
And actually, when someone
asks a pretty generic question,
42:11
it doesn't do well anymore.
42:12
It might do well on your task.
42:14
So it might be relevant or not.
42:15
And then it's time
and cost-intensive.
42:17
That's my main problem.
42:19
And at Workera, we
steer away from fine
42:24
tuning as much as possible.
42:26
Because by the time you're
done fine tuning your model,
42:28
the next model is
out, and it's actually
42:30
beating your fine tuned
version of the previous model.
42:33
So I would steer away from
fine tuning as much as you can.
42:36
The advantage of the prompt
engineering methods we've seen
42:39
is you can put the next best
pre-trained model directly
42:43
in your code.
42:44
It will update
everything immediately.
42:46
Fine tuning doesn't
work like that.
42:50
There are advantages though
where it still makes sense.
42:53
If the task requires repeated
high precision outputs
42:56
such as legal,
scientific explanation
42:58
and if the general
purpose LLM struggles
43:01
with domain-specific language.
43:03
So let's look at a
quick example together,
43:07
which is an example
from Ross Lazerowitz.
43:12
I think it was a couple of
years ago, September 2023,
43:15
where Ros tried to
do Slack fine tuning.
43:22
So he looked at a lot of Slack
messages within his company.
43:26
And he was like, I'm
going to fine tune
43:28
a model that speaks like us or
operates like us because this
43:32
is how we work.
43:33
This is the data that represents
how people work at the company.
43:37
And so he actually went ahead
and fine tuned the model,
43:42
gave it a prompt,
like, hey, write--
43:44
he was delegating to the model.
43:47
A 500-word blog post
on prompt engineering.
43:50
And the model responded, I shall
work on that in the morning.
43:55
And then he tries to push the
model a little further and say,
44:00
it's morning now.
44:01
And the model said,
I'm writing right now.
44:04
It's 6:30 AM here.
44:06
Write it now.
44:10
OK, I shall write it now.
44:12
I actually don't know what
you would like me to say
44:14
about prompt engineering.
44:15
I can only describe the process.
44:17
The only thing that comes
to mind for a headline
44:19
is how do we build prompts?
44:21
It's kind of a funny example for
fine tuning because it's true
44:25
that it went wrong.
44:27
Like he was supposed
to think like I want
44:29
the model to speak
like us at work.
44:32
And it ended up
acting like people
44:34
and not actually
following instructions.
44:40
So one example why I would
steer away from fine tuning.
44:47
Super.
44:51
Let's talk about RAGs.
44:54
RAGs are important.
44:55
It's important to be out there
with at least the basics.
44:58
It's a very common interview
question, by the way.
45:00
If you go interview
for a job, they
45:02
might ask you to
explain in a nutshell
45:04
to a five-year-old
what is a RAG.
45:06
And hopefully after that,
you'll be able to do it.
45:09
So we've seen some of the
challenges with standalone LLMs.
45:14
Those challenges include the
context window being small,
45:19
the fact that it's hard
to remember details
45:21
within a large context window,
knowledge gaps, cutoff dates,
45:26
you mentioned earlier.
45:28
The model might be
trained up to a date,
45:29
and then it cannot follow
the trends or be up to date.
45:33
Hallucinations.
45:34
There are some fields.
45:35
Think about medical
diagnosis, where
45:37
hallucinations are very costly.
45:39
You can't afford
a hallucination.
45:41
Even in education, imagine
deploying a model for the US
45:45
youth education,
and it hallucinates,
45:47
and it teaches millions
of people something
45:49
completely wrong.
45:50
It's a problem.
45:52
And then lack of sources.
45:54
A lot of fields love sources.
45:57
Research fields love sources.
45:59
Education loves sources.
46:01
Legal loves sources as well.
46:04
And so the pre-trained LLM
doesn't do a good job of sourcing.
46:08
And in fact, if you have tried
to find sources on a plain LLM,
46:13
it actually hallucinates a lot.
46:15
It makes up research papers.
46:16
It just lists like
completely fake stuff.
46:20
So how do we solve
that with a RAG?
46:23
RAG integrates with external
knowledge sources, databases,
46:28
documents, APIs.
46:31
It ensures that answers are
more accurate, up to date,
46:35
and grounded because you can
actually update your document.
46:38
Your drive is always up to date.
46:40
I mean, ideally, you're always
pushing new documents to it.
46:43
And when you query, what is
our Q4 performance in sales?
46:47
Hopefully there is the last
board deck in the drive,
46:51
and it can read the
last board deck.
46:54
And more developer control.
46:56
We'll see why RAGs allow
for targeted customization
47:00
without actually requiring
the retraining of the model.
47:02
In fact, you don't touch
the model with RAGs.
47:05
It's really a technique that
is put on top of the model.
47:08
So to see an example
of a RAG, this
47:11
is a question answering
application where
47:16
we're in the medical field,
and a user is asking a query,
47:21
what are the side
effects of drug X?
47:26
This is an important question.
47:27
You can't hallucinate.
47:28
You need to source.
47:29
You need to be up to date.
47:31
Maybe there is a new
update to that drug that
47:35
is now in the database,
and you need to read that.
47:37
So a RAG is a great example of
what you would want to use here.
47:41
The way it works is
you have your knowledge
47:43
base of a bunch of documents.
47:46
What you do is you
use an embedding
47:49
to embed those
documents into lower
47:52
dimensional representations.
47:54
So for example, if the
document is a PDF, a long PDF,
47:59
you might read the
PDF, understand it,
48:02
and then embed it.
48:03
We've seen plenty of
embedding approaches
48:05
together, triplet loss,
et cetera, you remember?
48:09
So imagine one of
them here for LLMs
48:11
is embedding those documents
into lower representation.
48:15
If the representation
is too small,
48:18
you will lose information.
48:19
If it's too big, you
will add latency.
48:22
It's a tradeoff.
48:25
You will store typically
those representations
48:28
into a database called
a vector database.
48:31
There's a lot of vector
database providers out there.
48:38
I think I've listed a
couple that are very common.
48:41
No, I haven't listed, but
I can share afterwards.
48:44
A vector database is
essentially storing those vectors
48:47
in a very efficient manner,
allowing the fast retrieval
48:50
with a certain distance metric.
48:52
So what you do is you
also embed, usually
48:56
with the same algorithm,
the user prompts.
49:00
And you run a retrieval
process, which is essentially
49:03
saying, based on the
embedding from the user
49:07
query and the vector database,
find the relevant documents
49:12
based on the distance
between those embeddings.
49:15
Once you've found the relevant
documents, you pull them,
49:18
and then you add them to the
user query with a system prompt
49:22
or a prompt template on top.
49:24
So the prompt template
can be answer user query
49:29
based on list of documents.
49:32
If answer not in the
documents, say I don't know.
49:36
That's your prompt templates
where the user query is pasted,
49:40
the documents are
pasted, and then
49:42
your output should be what
you want because it's not
49:45
grounded in the documents.
49:47
You can also add to
this prompt template.
49:50
Tell me the exact
page, chapter, line
49:53
of the document that was
relevant, and in fact,
49:55
link it as well, just
to be more precise.
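Putting the whole loop together, here is a deliberately tiny sketch of a vanilla RAG; the bag-of-characters `embed` is a toy stand-in for a real embedding model, and `call_llm` is a hypothetical helper:

```python
import math

def embed(text: str) -> list[float]:
    # Toy stand-in for a real embedding model: a bag-of-characters vector,
    # just so the sketch runs end to end.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def call_llm(prompt: str) -> str:
    return "..."  # hypothetical stub

# 1. Index: embed every document and store the vectors (the "vector database").
documents = [
    "Drug X: known side effects include headaches and mild nausea.",
    "Quarterly sales report, unrelated to any medication.",
]
index = [(doc, embed(doc)) for doc in documents]

# 2. Retrieve: embed the query with the same model, rank by similarity.
query = "What are the side effects of drug X?"
query_vec = embed(query)
top_doc = max(index, key=lambda pair: cosine(query_vec, pair[1]))[0]

# 3. Generate: ground the answer in the retrieved document.
answer = call_llm(
    "Answer the user query based on the documents below. "
    "If the answer is not in the documents, say 'I don't know'.\n\n"
    f"Documents:\n{top_doc}\n\nQuery: {query}")
```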
50:02
Any question on RAGs?
50:03
This is a simple, vanilla RAG.
50:07
Yes.
50:09
Do document embeddings still
retain information [INAUDIBLE]
50:15
Question is do the
document embeddings still
50:18
retain the information of the
location of the information
50:21
within that document,
especially in big documents?
50:24
Great question.
50:26
We'll get to it in a second.
50:27
Because you're right
that the vanilla RAG
50:29
might not do a good job
with very large documents.
50:32
So let's say, when you
open a medication box
50:36
and you have this gigantic white
paper with all the information,
50:41
and it's very long, maybe a
vanilla RAG would not cut it.
50:45
So what people have
figured out is a bunch
50:48
of techniques to improve RAGs.
50:49
And in fact, chunking is a great
technique that is very popular.
50:53
So you might actually store
in the vector database
50:55
the embedding of
the full document.
50:57
And on top of
that, you will also
50:59
store a chapter level vector.
51:02
And when you retrieve, you
will retrieve the document.
51:04
You retrieve the chapter.
51:06
And that allows you to be more
precise with the sourcing.
51:09
It's one example.
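A short sketch of that multi-granularity indexing, reusing the toy `embed` helper from the RAG sketch above; the record schema here is illustrative:

```python
# Store one vector for the whole document plus one per chapter, so retrieval
# can point at the exact passage rather than just the document.
def index_document(doc_id: str, chapters: dict[str, str], index: list) -> None:
    full_text = "\n".join(chapters.values())
    index.append({"doc": doc_id, "chapter": None, "vector": embed(full_text)})
    for title, text in chapters.items():
        index.append({"doc": doc_id, "chapter": title, "vector": embed(text)})
```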
51:11
Another technique
that's popular is HyDE.
51:16
Hypothetical
document embeddings,
51:18
where a group of researchers
published a paper
51:23
showing that when you
get your user query,
51:26
one of the main problems
is that the user query
51:29
actually does not look
like your documents.
51:32
For example, the
user query might
51:34
be what are the side effects
of drug X, when actually,
51:37
in the document in
the vector database,
51:40
the vectors represent
very long documents.
51:43
So how do you guarantee
that the vector
51:44
embedding is going to be close
to the document embedding?
51:47
What they do is they use
the user query to generate
51:50
a fake hallucinated document.
51:53
They embed that
document, and then they
51:56
compare it to the vector
in the vector database.
52:01
That makes sense?
52:02
So for example,
the user says what
52:04
is the side effect of drug X?
52:06
This query is
given to another prompt that
52:09
says, based on this user query,
generate a five-page report
52:13
answering the user query.
52:15
It generates potentially
a completely fake answer.
52:20
You embed that, and it will
be closer to the document
52:24
that you're looking for likely.
52:28
It's one example
of a RAG approach.
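A sketch of the HyDE retrieval step, reusing the `call_llm`, `embed`, and `cosine` helpers from the sketches above:

```python
def hyde_retrieve(query: str, index: list) -> str:
    # HyDE: instead of embedding the short query directly, ask the LLM to
    # hallucinate a document that answers the query, and embed that instead.
    # The fake document looks much more like the real documents in the
    # vector database, so its embedding lands closer to the right one.
    fake_doc = call_llm(
        "Based on this user query, generate a five-page report answering it:\n"
        + query)
    fake_vec = embed(fake_doc)
    return max(index, key=lambda pair: cosine(fake_vec, pair[1]))[0]
```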
52:31
Again, the purpose
of this lecture
52:33
is not to go through all
these three and explain
52:36
you every single method that
has been discovered for RAGs.
52:38
But I just wanted to show
you how much research
52:40
has been done between
2020 and 2025 in RAGs
52:44
and how many branches
of research you now have
52:47
that you can learn from.
52:50
The survey paper is linked in
the slides, by the way,
52:52
and I'll share them
after the lecture.
53:01
Super.
53:05
So we've made some progress.
53:08
Hopefully now, you
feel if you were
53:10
to start an LLM application, you
know how to do better prompts.
53:14
You know how to do chains.
53:15
You know how to do fine tuning.
53:17
You also know how to do retrieval.
53:19
And you have the
baggage of techniques
53:20
that you can go and read
and find the code base,
53:23
pull the code, vibe code it.
53:24
But you have the breadth now.
53:30
The next set of topics
we're going to see
53:34
is around the question
of how could we
53:36
extend the capabilities of LLMs
from performing single tasks,
53:40
enhanced with
external knowledge,
53:42
to handling multi-step,
autonomous workflows?
53:47
And this is where we get
into proper agentic AI.
53:53
So let's talk about
agentic AI workflows
53:56
towards autonomous and
specialized systems.
54:00
Then we'll talk about evals.
54:01
Then we'll see
multi-agent systems.
54:03
And we'll end with a little
thoughts on what's next in AI.
54:11
So Andrew Ng actually coined
the term agentic AI workflows.
54:20
And his reason was that a lot
of companies say, agents.
54:25
Agents, agents everywhere,
agents everywhere.
54:28
If you go and work
at these companies,
54:30
you would notice that they mean
very different things by agents.
54:33
Some people actually
have a prompt,
54:34
and they call it an agent.
54:36
Other people, they have a very
complex multi-agent system,
54:41
they call it an agent.
54:42
And so calling everything an
agent doesn't do it justice.
54:45
So Andrew says let's call
it agentic workflows.
54:49
Because in practice, it's a
bunch of prompts with tools,
54:53
with additional
resources, API calls
54:57
that ultimately are
put in a workflow,
54:59
and you can call that
workflow agentic.
55:02
So it's all about the multi-step
process to complete a task.
55:11
Also, calling it
agentic workflow
55:13
allows us to not
mix it up with what
55:14
I called agent, in
the last lecture,
55:17
with reinforcement learning.
55:19
Because in RL, agent has a
very specific definition,
55:22
interacts with an environment,
passes from one state
55:24
to the other, has a
reward and an observation.
55:26
You remember that chart, right?
55:32
So here's an example of
how we move from a one step
55:35
prompt to a multi-step
agentic workflow.
55:39
Let's say a user
queries a product.
55:44
What is your refund
policy on a chatbot?
55:48
And the response,
using a RAG, says
55:51
refunds are available
within 30 days of purchase,
55:53
and maybe the RAG can even
link to the policy documents.
55:57
That's what we learned so far.
55:59
Instead, an agentic workflow
can function like this.
56:04
The user says, can I get
a refund for my order?
56:07
And the response via
the agentic workflow
56:11
is the agent retrieves the
refund policy using a RAG.
56:14
The agent then follows up
with the user and says,
56:17
can you provide
your order number?
56:19
Then the agent queries an API
to check the order details.
56:23
And finally, it comes
back to the user
56:25
and confirms your order
qualifies for a refund.
56:28
The amount will be processed
in three to five business days.
56:31
This is much more thoughtful
than the first version,
56:33
which is sort of vanilla.
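A sketch of that refund workflow, with the step sequence hard-coded for clarity; a real agent would let the model decide which tool to call next, and every helper here is hypothetical:

```python
def call_llm(prompt: str) -> str:
    return "..."  # hypothetical stub

def rag_lookup(query: str) -> str:
    # Hypothetical RAG call into the policy knowledge base.
    return "Refunds are available within 30 days of purchase."

def order_api(order_number: str) -> dict:
    # Hypothetical API call into the order system.
    return {"order_number": order_number, "days_since_purchase": 12}

def refund_agent(user_message: str) -> str:
    # Step 1: retrieve the refund policy using a RAG.
    policy = rag_lookup("refund policy")
    # Step 2: follow up with the user for the missing detail.
    order_number = input("Can you provide your order number? ")
    # Step 3: query an API to check the order details.
    order = order_api(order_number)
    # Step 4: let the LLM combine policy and order data into a reply.
    return call_llm(
        f"Policy: {policy}\nOrder: {order}\nUser asked: {user_message}\n"
        "Decide whether the order qualifies for a refund and reply to the user.")
```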
56:37
So that's what
we're going to talk
56:39
about in the next
couple of slides,
56:40
is how do we get from the
first one to the second one?
56:46
There are plenty of specialized
agentic workflows online.
56:50
You've heard, and if
you hang out in SF,
56:52
you probably see a bunch
of billboards, AI software
56:55
engineer, AI skills
mentor you've
56:57
interacted with in the
class through Workera.
56:59
AI SDR, AI lawyers, AI
specialized cloud engineer.
57:08
It would be a stretch to
say that everything works,
57:10
but there's work being
done towards that.
57:17
I'm not personally
a fan of putting
57:19
a face behind those things.
57:20
I think it's gimmicky.
57:21
And I think in a few
years from now, actually,
57:24
very few products will have
a human face behind it,
57:27
but it might be a marketing
tactic from some startups.
57:32
It's more scary than it
is engaging, frankly.
57:35
OK.
57:36
I want to talk about
the paradigm shift.
57:38
That's especially useful.
57:40
Let's say you're a
software engineer
57:41
or you're planning to
be a software engineer.
57:43
Because software
engineering as a discipline
57:45
is sort of shifting.
57:47
Or at least the
best engineers I've
57:49
worked with are able to move
from a deterministic mindset
57:53
to a fuzzy mindset and
balance between the two
57:57
whenever they need to
get something done.
57:58
So here's the paradigm shift
between traditional software
58:01
and agentic AI software.
58:04
The first one is the
way you handle data.
58:07
Traditional software deals
with structured data.
58:10
You have JSONs.
58:11
You have databases.
58:12
They're pasted in a
very structured manner
58:15
in a data engineering pipeline.
58:17
And then it used
to be displayed
58:19
on a certain interface.
58:21
The user might fill a form that
is then retrieved and pasted
58:24
in the database.
58:25
All of that historically
has been structured data.
58:28
Now, more and more companies are
handling free form text, images,
58:34
and all of that requires dynamic
interpretation to transform
58:39
an input into an output.
58:41
The software itself used
to be deterministic.
58:45
Now you have a lot of
software that is fuzzy.
58:47
And fuzzy software
creates so many issues.
58:51
I mean, imagine if you
let your user ask anything
58:54
on your website.
58:56
The chances that it
breaks are tremendous.
58:58
The chances that you're
attacked are tremendous.
59:00
The chances-- it's really,
really complicated.
59:03
It's more complicated than
people make it seem on Twitter.
59:07
Fuzzy engineering is truly hard.
59:09
You might get hate as a company
because one user did something
59:14
that you authorized them to
do that ended up breaking
59:16
the database and ended up--
59:18
we've seen that
with many companies
59:19
in the last couple of years.
59:21
So it takes a very specialized
engineering mindset
59:23
to do fuzzy
engineering, but also
59:25
know when you need
to be deterministic.
59:29
The other thing I'd call out is,
with agentic AI software,
59:33
you want to think about your
software the way a manager would.
59:39
So you're familiar with the
monolith or microservices
59:44
approaches in software, where
you structure your software
59:48
in different boxes that
can talk to each other,
59:51
and it allows teams to
debug one section at a time.
59:55
Now the equivalent with agentic
AI is you think as a manager.
59:59
So you think, OK, if I
was to delegate my product
1:00:02
to be done by a group of humans,
what would be those roles?
1:00:06
Would I have a graphic designer
that then puts together a chart
1:00:09
and then sends it to a marketing
manager that converts it
1:00:12
into a nice blog post, that
then gives it to the performance
1:00:15
marketing expert, that then
publishes the work, the blog
1:00:18
post, and then
optimizes and A/B tests?
1:00:20
Then to a data scientist
that analyzes the data
1:00:23
and then puts
hypotheses and validates
1:00:25
them or invalidates them.
1:00:27
That's how you would typically
think if you're building
1:00:29
agentic AI software.
1:00:32
When actually, the equivalent
of that in traditional software
1:00:35
might be completely different.
1:00:37
It might be: we have
a data engineering box
1:00:39
right here that handles
all our data engineering.
1:00:42
And then here, we
have the UI/UX stuff.
1:00:45
Everything UI/UX
related goes here.
1:00:47
And companies might structure
it in very different ways.
1:00:51
And here is the business logic
that we want to care about.
1:00:53
And there's five engineers
working on the business logic,
1:00:56
let's say.
1:00:59
OK.
1:01:01
Testing and debugging
is also very different.
1:01:04
And we'll talk about
it in the next section.
1:01:09
The other thing
that I feel matters
1:01:13
is with AI in engineering,
the cost of experimentation
1:01:17
is going down drastically.
1:01:19
And so people, I feel,
should be more comfortable
1:01:22
throwing away code.
1:01:23
It's like in traditional
software engineering,
1:01:27
you probably don't
throw away code a ton.
1:01:29
You build a code, and it's
solid, and it's bulletproof,
1:01:32
and then you update
it over time.
1:01:35
We've seen AI companies be
more comfortable throwing away
1:01:39
code, which has advantages in
terms of the speed at which you
1:01:43
move but also
disadvantages in terms
1:01:46
of the quality of your
software that can break more.
1:01:52
So anyway, just wanted to do
an update on the paradigm shift
1:01:56
from deterministic
to fuzzy engineering.
1:02:04
Oh, and actually, I can give
you an example from Workera
1:02:08
that we learned probably
over the last 12
1:02:11
months. If
you've used Workera,
1:02:13
you might have seen that the
interface sometimes asks you
1:02:18
multiple choice questions.
1:02:19
And sometimes, it asks
you multiple select.
1:02:21
And sometimes, it asks you drag
and drop, ordering, matching,
1:02:24
whatever.
1:02:25
Those are examples of
deterministic item types,
1:02:28
meaning you answer the
question on a multiple choice.
1:02:31
There is one correct answer.
1:02:32
It's fully deterministic.
1:02:34
On the other hand, you sometimes
have voice questions,
1:02:38
where you go to a
role play or you
1:02:40
have voice plus
coding questions,
1:02:42
where your code is being read
by the interface or whatever.
1:02:45
Those are fuzzy, meaning
the scoring algorithm
1:02:49
might actually make
mistakes, and those mistakes
1:02:52
might be costly.
1:02:53
And so companies
have to figure out
1:02:56
a human in the
loop system, which
1:02:58
you might have seen with the
appeal feature at the end.
1:03:00
So at the end of the assessment,
you have an appeal feature that
1:03:03
allows you to say, I
want to appeal the agent
1:03:06
because I want to challenge
what the agent said on my answer
1:03:09
because I thought I was better
than what the agent thought.
1:03:12
And then you bring the
human in the loop that
1:03:14
then can fix the agent, can
tell the agent, actually,
1:03:16
you were too harsh on the
answer of this person.
1:03:20
And that's an example of
a fuzzy engineered system
1:03:24
that then adds a human in the
loop to make it more aligned.
1:03:28
And so if you're
building a company,
1:03:29
I would encourage you to
think about what can I
1:03:32
get done with determinism?
1:03:33
And let's get that done.
1:03:35
And then the fuzzy
stuff, I want to do fuzzy
1:03:38
because it allows
more interaction.
1:03:39
It allows more back
and forth, but I need
1:03:42
to put guardrails around it.
1:03:43
And how am I going to
design those guardrails?
1:03:45
Pretty much.
1:03:46
OK?
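To make that deterministic/fuzzy split concrete, here's a minimal sketch: the multiple choice grader is fully deterministic, the voice scorer is fuzzy and wrapped in a guardrail, and the appeal path brings the human in the loop. score_with_llm() is a hypothetical scorer, not a real API:

```python
# Sketch of deterministic vs. fuzzy grading with a guardrail and a
# human-in-the-loop appeal. score_with_llm() is hypothetical.

def grade_multiple_choice(answer, key):
    return answer == key            # deterministic: one correct answer

def grade_voice_response(transcript):
    score = score_with_llm(transcript)   # fuzzy: the scorer can be wrong
    # Guardrail: clamp to the valid range and keep an appeal path open.
    score = max(0, min(100, score))
    return {"score": score, "appealable": True}

def handle_appeal(item, human_score):
    # Human in the loop: the reviewer's decision overrides the agent's.
    item["score"] = human_score
    item["appealed"] = True
    return item
```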
1:03:49
Here's another example
from enterprise workflows,
1:03:54
which are likely to
change due to agentic AI.
1:03:57
This is a paper from McKinsey,
I believe from last year,
1:04:01
where they looked at a financial
institution, and they said,
1:04:05
we observed that they often
spend one to four weeks
1:04:07
to create a credit risk memo.
1:04:10
And here's the process.
1:04:11
A relationship manager
gathers data from 15
1:04:16
or more
sources on the borrower,
1:04:19
loan type, and other factors.
1:04:22
Then the relationship manager
and the credit analyst
1:04:25
collaboratively analyze that
data from these sources.
1:04:28
Then the credit analyst
typically spends 20 hours
1:04:33
or more writing a memo
and then goes back
1:04:36
to the relationship manager.
1:04:37
They give feedback, and then
they go through this loop
1:04:40
again and again.
1:04:41
And it takes a long time
to get a credit memo out.
1:04:46
And then they ran a research study
where they changed the process.
1:04:50
They said gen AI agents could
actually cut time by 20% to 60%
1:04:56
on credit risk memos.
1:04:58
And the process has changed:
the relationship manager
1:05:01
directly works with the
gen AI agent system and
1:05:03
provides the relevant materials
needed to produce the memo.
1:05:07
The agent subdivides
the project into tasks
1:05:10
that are assigned to
specialist agents,
1:05:12
gathers and analyzes the
data from multiple sources,
1:05:15
and drafts a memo.
1:05:16
Then the relationship manager
and the credit analyst
1:05:19
sit down together,
review the memo,
1:05:20
give feedback to the agent.
1:05:22
And they're done in 20% to 60%
less time.
1:05:26
And so this is an example where
you're actually not changing
1:05:30
the human stakeholders.
1:05:31
You're just changing
the process and adding
1:05:33
Gen AI to reduce the time it
takes to get a credit memo out.
1:05:38
It turns out that, imagine
you're an enterprise,
1:05:42
and you have 100,000 employees,
and there's a lot of enterprises
1:05:47
with 100,000
employees out there.
1:05:50
You are currently
in a crisis in terms
1:05:52
of redesigning your workflows.
1:05:55
It turns out that
if you actually
1:05:57
pull the job descriptions
from the HR system
1:06:00
and you interpret
them, you also pull
1:06:02
the business process
workflows that you
1:06:04
have encoded in your drive.
1:06:07
You actually can find
gains in multiple places.
1:06:10
And in the next
few years, you're
1:06:12
probably going to
see workflows being
1:06:14
more optimized to add Gen AI.
1:06:17
Even if that happens, the
hardest part is changing people.
1:06:20
We know this is
great in theory, but now,
1:06:23
let's try to fit that second
workflow to 10,000 credit
1:06:28
risk analysts and
relationship managers.
1:06:31
My guess is it will take years.
1:06:33
It will take 10, 20 years to
get to this being actually done
1:06:37
at scale within an organization.
1:06:40
Because change is so hard.
1:06:42
It's so hard to rewire business
workflows and job descriptions,
1:06:47
incentivize people to do things
differently, and be different,
1:06:50
and train them.
1:06:50
And so this is what the
world is going towards,
1:06:55
but it's going to take
a long time I think.
1:06:59
OK.
1:07:00
Then I want to talk about
how the agent actually works
1:07:02
and what are the core
components of an agent.
1:07:07
Imagine a travel
booking agent. That's
1:07:10
an easy example you've
all thought about.
1:07:12
I still haven't been able to get
an agent to book a trip for me,
1:07:16
or I was scared because
it was going to book
1:07:18
a very expensive or long trip.
1:07:20
But in theory, you can
have a travel booking
1:07:24
agent that has prompts.
1:07:26
So the prompts we've
seen, we know the methods
1:07:28
to optimize those prompts.
1:07:30
That travel agent also has
a context management system,
1:07:34
which is essentially the memory
of what it knows about the user.
1:07:38
That context
management system might
1:07:40
include a core memory or working
memory and an archival memory,
1:07:45
OK?
1:07:46
The difference
within memory
1:07:51
is that not every memory needs
to be fast to access.
1:07:54
Think about it.
1:07:56
You're onboarded on a product,
and the first question is hi,
1:07:59
what's your name?
1:08:00
And I say, my name is Kian.
1:08:02
That's probably going to
sit in the working memory
1:08:05
because the agent, every
time it talks to me,
1:08:07
is going to want
to use my name.
1:08:08
But then maybe the
second question
1:08:10
is what's your birthday?
1:08:12
And I give it my birthday.
1:08:13
Does it need my
birthday every day?
1:08:15
Probably not.
1:08:16
So it's probably going to
park it on the long term
1:08:18
memory or the archival memory.
1:08:20
And those memories
are slower to access.
1:08:24
They're farther down the stack.
1:08:26
And that structure
allows the agent
1:08:28
to determine what's
the working memory,
1:08:30
and what's the long term memory?
1:08:33
And that makes it easier for the
agent to retrieve super fast.
1:08:36
Because think about it.
1:08:37
When you interact
with ChatGPT, you
1:08:39
feel that it's very
personal at times.
1:08:41
You feel like it
understands you.
1:08:43
Imagine every time you call it,
it has to read the memories.
1:08:47
And that can be costly.
1:08:48
It's a very burdensome
cost because it happens
1:08:52
every time you talk to it.
1:08:54
So you want to be highly
optimized with the working
1:08:57
memory.
1:08:59
If it takes three
seconds to look
1:09:00
in the memory, every time you're
going to talk to your LLM,
1:09:03
it's going to take three
seconds, which you don't want.
1:09:06
Anyway.
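Here's a minimal sketch of that two-tier memory, assuming a hypothetical ContextManager you'd write yourself: a small working memory that is prepended to every prompt, and an archival memory that is only searched on demand:

```python
# Sketch of a two-tier context management system. All names are
# illustrative; a real archival memory would likely be a vector store.

class ContextManager:
    def __init__(self):
        self.working = {}   # tiny, always in the prompt (e.g. the user's name)
        self.archive = []   # larger, slower: searched only when needed

    def remember(self, key, value, important=False):
        if important:
            self.working[key] = value
        else:
            self.archive.append((key, value))

    def prompt_prefix(self):
        # Cheap, but paid on every single LLM call, so keep it small.
        return "\n".join(f"{k}: {v}" for k, v in self.working.items())

    def recall(self, query):
        # Expensive: a lookup (or vector search) done only when relevant.
        return [v for k, v in self.archive if query in k]

mem = ContextManager()
mem.remember("name", "Kian", important=True)        # working memory
mem.remember("birthday", "Jan 1", important=False)  # archival memory
```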
1:09:06
And then you have the tools.
1:09:08
The tools can include
APIs like a flight search
1:09:11
API, hotel booking API, car
rental API, weather API,
1:09:15
and then the payment
processing API.
1:09:18
And typically, you would
want to tell your agent
1:09:21
how that API works.
1:09:23
It turns out that agents
or LLMs, I should say,
1:09:27
are very good at reading
API documentation.
1:09:29
So you give it the
API documentation,
1:09:31
and it reads the
JSON, and it reads,
1:09:33
what does a GET
request look like.
1:09:35
And this is the format
that I need to push.
1:09:38
And then it pushes it in
that format, let's say.
1:09:41
And then it retrieves something.
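Here's roughly what handing the model that documentation looks like as a tool definition. The schema shape mirrors common function-calling formats, but the exact field names vary by provider, and the endpoint here is made up:

```python
# Sketch of describing a flight search API to an LLM as a tool.
# The URL and schema are illustrative, not a real provider format.

flight_search_tool = {
    "name": "search_flights",
    "description": "GET https://api.example.com/flights - returns flight options",
    "parameters": {
        "type": "object",
        "properties": {
            "origin": {"type": "string", "description": "IATA code, e.g. SFO"},
            "destination": {"type": "string", "description": "IATA code, e.g. CDG"},
            "date": {"type": "string", "description": "YYYY-MM-DD"},
        },
        "required": ["origin", "destination", "date"],
    },
}
# The model reads this the way it would read API docs: it learns what a
# valid request looks like and emits its arguments in that exact format.
```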
1:09:45
Does that make sense,
those different components?
1:09:49
Anthropic also talks
about resources.
1:09:51
Resources is data that is
sitting somewhere that you
1:09:55
might let your agent read.
1:09:57
For example, if you're building
your startups, you have a CRM.
1:10:00
A CRM has data in it, and you
want to do lookups in that data.
1:10:05
You will probably
give a lookup tool,
1:10:07
and you will give
access to the resource,
1:10:10
and it will do lookups
whenever you want super fast.
1:10:16
This type of
architecture can be built
1:10:19
with different
degrees of autonomy,
1:10:21
from the least autonomous
to the most autonomous.
1:10:23
And I'll give you
a few examples.
1:10:26
Less autonomous would be
you've hard coded the steps.
1:10:29
So let's say I tell the travel
agent first identify the intent.
1:10:35
Then look up in the
database the history
1:10:39
of this customer with us
and their preferences.
1:10:42
Then go to the flight
API, blah, blah, blah.
1:10:45
Then go to the--
1:10:45
I would hard code the steps.
1:10:47
OK.
1:10:48
That's the least autonomous.
1:10:50
The semi-autonomous is I
might hard code the tools,
1:10:54
but we're not going to
hard code the steps.
1:10:57
So I'm going to tell the agent,
you act like a travel agent.
1:11:02
And your task is to help
the person book a travel.
1:11:10
And these are the tools that
you have accessible to yourself.
1:11:13
And so I'm not hard
coding the steps.
1:11:14
I'm just hard coding the
tools that you have access
1:11:17
to for yourself.
1:11:18
The more autonomous is the
agent decides the steps
1:11:22
and can create the tools.
1:11:24
So that's where you might
give actually access
1:11:26
to a code editor, to the agent.
1:11:28
And the agent might actually be
able to ping any API in the web,
1:11:33
perform some web search.
1:11:34
It might even be able
to create some code
1:11:37
to display data to the user.
1:11:39
It might even be able to
perform some calculations.
1:11:42
Like oh, I'm going to
calculate the fastest route
1:11:44
to get from San
Francisco to New York,
1:11:48
and which one might be
the most appropriate
1:11:50
for what the user
is looking for.
1:11:52
And then I want to calculate
the distance between the airport
1:11:54
and that hotel
versus that hotel.
1:11:56
And I'm going to
write code to do that.
1:11:58
So it's actually
fully autonomous
1:12:00
from that perspective.
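Here's a rough sketch of those three degrees of autonomy, with hypothetical helpers (llm, db_lookup, flight_api, run_in_sandbox) standing in for the real pieces:

```python
# Sketch of three degrees of autonomy. All helpers are hypothetical.

# 1. Least autonomous: the steps are hard coded.
def hardcoded_agent(user_msg):
    intent = llm(f"Identify the intent: {user_msg}")   # fixed step 1
    prefs = db_lookup(user_msg)                        # fixed step 2
    options = flight_api(intent, prefs)                # fixed step 3
    return llm(f"Summarize these options: {options}")  # fixed step 4

# 2. Semi-autonomous: the tools are hard coded, the steps are not.
TOOLS = ["search_flights", "search_hotels", "process_payment"]
def semi_autonomous_agent(user_msg):
    return llm("You act like a travel agent. Help the user book a trip. "
               f"You have access to these tools: {TOOLS}",
               messages=[user_msg])                    # the model picks the order

# 3. Most autonomous: the agent decides the steps and can create tools,
#    for example by writing code that it runs in a sandbox.
def fully_autonomous_agent(user_msg):
    code = llm(f"Write Python that solves: {user_msg}")
    return run_in_sandbox(code)                        # guardrails are essential
```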
1:12:05
So yeah.
1:12:07
Remember those keywords.
1:12:08
Memory, prompts,
tools, et cetera.
1:12:14
Now, I presented the
flight API, but it does not
1:12:18
have to be an API.
1:12:19
You probably have heard the term
MCP or model context protocol
1:12:23
that was coined by Anthropic.
1:12:25
I pasted the seminal article on
MCP at the bottom of this slide.
1:12:29
But let me explain in a nutshell
why those things would differ.
1:12:34
In the API case,
you would actually
1:12:39
teach your LLM to ping an API.
1:12:42
So you would say this is
how you ping this API,
1:12:45
and this is the data that
it will send you back.
1:12:48
And you would have to do
that in a one off manner.
1:12:51
So you would have
to build or give
1:12:53
the API documentation
of your flight API.
1:12:56
your hotel booking
API, your car rental API.
1:13:00
And then you would give
tools for your model
1:13:03
to communicate with those APIs.
1:13:06
It doesn't scale
very well versus MCP.
1:13:11
MCP, it's really about putting
a system in the middle that
1:13:19
would make it simpler for
your LLM to communicate
1:13:22
with that endpoint.
1:13:23
So for instance, you might have
an MCP server and an MCP client,
1:13:28
where you're trying
to communicate
1:13:30
with that travel database
or the flight API through MCP.
1:13:35
And your agent might actually
just communicate with it
1:13:38
and say, hey, what do you need
in order to give me more flight
1:13:42
information?
1:13:43
And that agent will respond:
I would like you to tell me
1:13:47
where the origin is,
where the destination is,
1:13:49
and what you're looking
for at a high level.
1:13:51
This is my requirement.
1:13:52
OK.
1:13:52
Let me get back to you
with my requirements.
1:13:55
Oh.
1:13:55
You forgot to tell me
your budget, whatever.
1:13:57
Oh.
1:13:58
Let me give you my
budget, et cetera.
1:14:00
And it's agent to
agent communication,
1:14:04
which allows more scalability.
1:14:06
You don't need to
hard code everything.
1:14:09
Companies have exposed
their MCPs out there,
1:14:11
and your agent can
communicate with them
1:14:14
and figure out how to
get the data it needs.
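Here's the shape of that back-and-forth, heavily simplified. Real MCP is a JSON-RPC protocol with official SDKs; this stub only shows why the negotiation scales better than hard coding each API:

```python
# A toy stub of an MCP-style exchange, not the real protocol.

class MCPServerStub:
    """Stands in front of the flight API and advertises what it needs."""
    def list_tools(self):
        return [{"name": "find_flights",
                 "required": ["origin", "destination", "budget"]}]

    def call(self, name, args):
        missing = [r for r in self.list_tools()[0]["required"] if r not in args]
        if missing:
            return {"error": f"You forgot to tell me: {missing}"}
        return {"result": f"flights for {args}"}

server = MCPServerStub()
print(server.list_tools())   # the agent discovers the requirements
print(server.call("find_flights", {"origin": "SFO", "destination": "CDG"}))
# -> error: missing 'budget'; the agent supplies it and retries
print(server.call("find_flights",
                  {"origin": "SFO", "destination": "CDG", "budget": 800}))
```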
1:14:16
Does that make sense?
1:14:18
Yeah.
1:14:21
[INAUDIBLE] rewriting
any [INAUDIBLE]
1:14:36
I think it is, ultimately.
1:14:39
The question is, isn't
it a shifting issue?
1:14:41
Because anyway, if an
API has to be updated,
1:14:43
the MCP has to be updated,
is what you say, right?
1:14:45
Yes, that's correct.
1:14:46
But at least it allows the
agent to go back and forth
1:14:51
and figure out what
the requirements are.
1:14:52
But at the end of the day,
ideally, if you're a startup,
1:14:56
you have some documentation.
1:14:57
And automatically, you have
an agent or an LLM workflow
1:15:00
that reads that documentation
and updates the code
1:15:03
accordingly.
1:15:04
But I agree.
1:15:05
It's not something that
is fully autonomous.
1:15:08
Yeah.
1:15:09
I've seen some
security issues.
1:15:12
Why is that possible?
1:15:14
Which security issues specifically?
1:15:16
[INAUDIBLE]
1:15:18
Yeah.
1:15:19
So are there security
issues with MCPs?
1:15:23
So think about it this way.
1:15:25
MCPs, depending on the data
that you get access to,
1:15:28
might have different
requirements, lower stake
1:15:30
or higher stake.
1:15:31
I'm not an expert
at the full range.
1:15:34
But it wouldn't surprise me
that when you expose an MCP to--
1:15:42
I think a lot of
MCPs have authentication.
1:15:45
So you might
actually need a code
1:15:47
to actually talk to it, just
like you would with an API,
1:15:50
or a key.
1:15:52
Yeah, but that's
a good question.
1:15:53
I'm not an expert at the
security of these systems,
1:15:56
but we can look into it.
1:16:02
Any other questions
on what we've
1:16:04
seen with the agentic workflows,
APIs, tools, MCPs, memory?
1:16:10
All of that is under progress.
1:16:11
So even memory is not a
solved problem by any means.
1:16:14
It's pretty hard actually.
1:16:16
Yes.
1:16:18
You don't need an
[INAUDIBLE] The MCP just
1:16:24
makes it easier to access
the API, but technically,
1:16:28
[INAUDIBLE]
1:16:40
Exactly, exactly.
1:16:42
Is MCP about efficiency
or accessing more data?
1:16:45
It's about efficiency.
1:16:47
Let's say you have a coding
agent, and it has an MCP client,
1:16:53
and there's multiple MCP servers
that are exposed out there.
1:16:57
That agent can communicate
very efficiently with them
1:17:00
and find what it needs.
1:17:03
And it's a more
efficient process
1:17:05
than actually exposing APIs,
and reading the APIs on that side,
1:17:09
and how to ping them and
what the protocol is.
1:17:12
But it's not about
the data that is
1:17:13
being exposed because
ultimately, you control
1:17:15
the data that is being exposed.
1:17:19
Depending
on how the MCP is built,
1:17:22
my guess is you probably
expose yourself to other risks
1:17:24
because your MCP server can
see any input pretty much
1:17:31
from another LLM.
1:17:32
And so it has to be robust.
1:17:36
But yeah.
1:17:37
Super.
1:17:39
So let's look at an
example of a step
1:17:41
by step workflow for
the travel agent.
1:17:45
So let's say the user says, I
want to plan a trip to Paris
1:17:50
from December 15 to
20th with flights,
1:17:56
hotels near the Eiffel Tower,
and then an itinerary of
1:18:00
must-visit places.
1:18:01
That's the task to
the travel agent.
1:18:04
Step two, the agent
plans the steps.
1:18:06
So it says, I'm going
to find flights.
1:18:08
Use the flight search API to
get options for December 15.
1:18:12
Search hotels, generate
recommendations for places
1:18:15
to visit, validate
preferences, budget, et cetera.
1:18:20
Book the trip with the
payment processing API.
1:18:24
That's just the
planning, by the way.
1:18:25
Step three, execute the
plan, use your tools,
1:18:28
combine the results,
and then proactive
1:18:31
user interaction and booking.
1:18:33
It might make a first
proposal to the user
1:18:35
and ask the user to
validate or invalidate
1:18:38
and then may repeat that
planning and execution process.
1:18:42
And then finally, it might
actually update the memory.
1:18:46
It might say, oh, I just
learned through this interaction
1:18:49
that the user only
likes direct flights.
1:18:51
Next time, I'll only
give direct flights.
1:18:55
Or I noticed users are fine with
three star hotels or four star
1:19:01
hotels.
1:19:01
And in fact, they don't want
to go above budget or something
1:19:05
like that.
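That memory update step might look something like this, as a sketch; llm() is a hypothetical helper and memory is just a dict here:

```python
# Sketch of the final step: after the interaction, the agent distills
# durable preferences and writes them to memory for next time.

def update_memory_after_trip(transcript, memory):
    learned = llm(
        "From this conversation, list preferences worth remembering for "
        "future trips (e.g. 'direct flights only', 'stay within budget'):\n"
        + transcript
    )
    memory.setdefault("preferences", []).append(learned)
    return memory
```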
1:19:08
So that hopefully makes sense
by now on how you might do that.
1:19:11
My question for you is how
would you know if this works?
1:19:16
And if you had such a system
running in production, how
1:19:19
would you improve it?
1:19:28
Yeah.
1:19:28
Let users rate
their experience.
1:19:31
So that's an example.
1:19:33
So let users rate their
experience at the end.
1:19:37
That would be an end
to end test, right?
1:19:39
You're looking at the user
experience through the steps
1:19:42
and say how good was it
from 1 to 5, let's say.
1:19:46
Yeah.
1:19:46
It's a good way.
1:19:47
And then if you learn
that a user says 1,
1:19:50
how do you improve the workflow?
1:19:56
[INAUDIBLE]
1:19:59
OK.
1:19:59
So you would go down a tree
and say, OK, you said 1.
1:20:04
What was your issue?
1:20:06
And then the user says the
prices were too high, let's say.
1:20:10
And then you would go back and
fix that specific tool or prompt
1:20:14
or, yeah, OK.
1:20:15
Any other ideas?
1:20:18
[INAUDIBLE]
1:20:29
Yeah, good.
1:20:29
So that's a good insight.
1:20:30
Separate the LLM related stuff
from the non-LLM related stuff,
1:20:34
the deterministic stuff.
1:20:35
The deterministic
stuff, you might
1:20:36
be able to fix it more
objectively essentially.
1:20:41
Yeah.
1:20:43
What else?
1:20:56
So give me an example
of an objective issue
1:21:00
that you can notice and
how you would fix it
1:21:03
versus a subjective issue.
1:21:06
Yeah.
1:21:06
[INAUDIBLE]
1:21:16
So let's say you say
there's the same flight,
1:21:19
but one is cheaper than
the other, let's say.
1:21:21
It's objectively worse.
1:21:23
And so you can capture
that almost automatically.
1:21:25
Yeah.
1:21:26
So you could
actually build evals
1:21:27
that are objective, that are
tracked across your users.
1:21:32
And you might actually
run an analysis after
1:21:34
and see that for
the objective stuff,
1:21:37
we notice that our LLM AI agent
workflow is bad with pricing.
1:21:43
It just doesn't read price
as well because it always
1:21:46
gives a more expensive option.
1:21:48
Yeah.
1:21:48
You're perfectly right.
1:21:49
How about the subjective stuff?
1:21:59
Do you choose a direct
or indirect flight
1:22:01
if the indirect is a
little bit cheaper?
1:22:05
Yeah.
1:22:05
Good one.
1:22:06
Do you choose a direct
flight or an indirect flight
1:22:09
if the indirect is cheaper but
the direct is more comfortable?
1:22:12
Yeah.
1:22:13
That's a good one actually.
1:22:16
So how would you capture
that information?
1:22:18
Let's say this is used
by thousands of users.
1:22:24
Could you feed
something in [INAUDIBLE]
1:22:28
Could you feed something in?
1:22:30
Yeah, I mean, you could--
1:22:32
could feed something in
about the user preferences?
1:22:36
Well, you could
build a data set that
1:22:39
has some of that information.
1:22:40
So you build 10 prompts, where
the user is asking specifically
1:22:44
for a direct--
1:22:46
saying that I prefer
direct flights because I
1:22:48
care about my time, let's say.
1:22:50
And then you look at the
output and you actually
1:22:53
give a good example
of a good output,
1:22:56
and you probably
are able to capture
1:22:58
the performance of your agentic
workflow on this specific eval.
1:23:04
Does it prioritize?
1:23:05
Does it understand
price conscious--
1:23:07
is it price conscious,
essentially,
1:23:08
and comfort conscious?
1:23:10
Yeah.
1:23:13
What about the tone?
1:23:14
Let's say the LLM right
now is not very friendly.
1:23:18
How would you notice that,
and how would you fix it?
1:23:26
Yeah.
1:23:26
Have the test user
run the prompt
1:23:29
and see if there's
something wrong with that.
1:23:33
OK.
1:23:33
Have a test user run the
prompt and see if there's
1:23:36
something wrong with that.
1:23:37
Tell me about the last step.
1:23:38
How would you notice
that something is wrong?
1:23:40
So a couple of tests
[INAUDIBLE] evaluates
1:23:48
the response and [INAUDIBLE]
1:23:51
Yeah.
1:23:52
I agree with your approach.
1:23:53
Have LLM judges that
evaluate the response
1:23:55
against a certain rubric of
what politeness looks like.
1:23:58
So here in this case,
you could actually
1:24:00
start with error analysis.
1:24:02
So you start, you
have 1,000 users.
1:24:05
And you can pull up
20 user interactions
1:24:07
and read through it.
1:24:09
And you might notice,
at first sight,
1:24:11
the LLM seems to be very rude.
1:24:14
It's just super, super
short in its answers,
1:24:18
and it's not very helpful.
1:24:20
You notice that with your
error analysis manually.
1:24:23
Then you go to the next stage.
1:24:24
You actually put
evals behind it.
1:24:26
You say, I'm going to
create a set of LLM judges
1:24:33
that are going to look
at the user interaction
1:24:35
and are going to rate
how polite it is.
1:24:38
And I'm going to
give it a rubric.
1:24:40
Then what I'm going to do
is I'm going to flip my LLM.
1:24:42
Instead of using GPT-4,
I'm going to use Grok.
1:24:45
And instead of using
Grok, I'm using Llama.
1:24:48
And then I'm going to run
those three LLMs side by side,
1:24:51
give it to my LLM judges, and
then get my subjective score
1:24:56
at the end to say, oh, x model
was more polite on average.
1:25:02
Yeah.
1:25:02
Perfectly right.
1:25:03
That's an example of an
eval that is very specific
1:25:05
and allows you to
choose between LLMs.
1:25:07
You could actually do the
same eval not across LLMs,
1:25:10
but fix the LLM,
change the prompt.
1:25:12
You actually, instead of
saying act like a travel agent,
1:25:15
you say act like a
helpful travel agent.
1:25:17
And then you see the influence
of that word on your eval
1:25:21
with the LLM as judges.
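A minimal sketch of that setup, assuming a hypothetical judge_llm() call and candidate models passed in as callables:

```python
# Sketch of an LLM-as-judge eval: run the same prompts through several
# candidate models, score each answer against a politeness rubric, and
# compare averages. judge_llm() and the candidates are hypothetical.

RUBRIC = ("Rate the assistant's politeness from 1 (rude) to 5 (warm and "
          "helpful). Consider tone, length, and acknowledgment of the "
          "user. Reply with the number only.")

def judge(answer):
    return int(judge_llm(f"{RUBRIC}\n\nAnswer:\n{answer}"))

def compare(prompts, models):
    # models maps a name to a callable,
    # e.g. {"gpt": ..., "grok": ..., "llama": ...}
    return {name: sum(judge(model(p)) for p in prompts) / len(prompts)
            for name, model in models.items()}
```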
1:25:22
Does that make sense?
1:25:24
OK.
1:25:25
Super.
1:25:26
So let's move forward and
do a case study with evals.
1:25:29
And then we're almost
done for today.
1:25:33
Let's say your product manager
asks you to build an AI
1:25:38
agent for customer support, OK?
1:25:41
Where do you start?
1:25:42
And here is an example
of the user prompt.
1:25:45
I need to change my shipping
address for order, blah, blah,
1:25:48
blah.
1:25:48
I move to a new address.
1:25:51
So where do you start if I'm
giving you that project?
1:26:04
Yes.
1:26:05
We search online for existing
models and [INAUDIBLE]
1:26:16
So do some research.
1:26:17
See benchmarks and
how different models
1:26:20
perform at customer support.
1:26:22
And then pick a model.
1:26:23
That's what you mean.
1:26:24
Yeah.
1:26:24
It's true you could do that.
1:26:25
What else could you do?
1:26:28
Yeah.
1:26:28
[INAUDIBLE]
1:26:34
OK.
1:26:34
Yeah, I like that.
1:26:35
Try to decompose the different
tasks that it will need
1:26:39
and try to guess which ones will
be more of a struggle, which
1:26:42
ones should be fuzzy, which
ones should be deterministic.
1:26:45
Yeah, you're right.
1:26:46
[INAUDIBLE]
1:26:55
Yeah.
1:26:56
Similar to what you said.
1:26:58
That's what I would
recommend as well.
1:27:00
You say I would sit down
with a customer support
1:27:02
agent for a day or two, and
I would decompose the tasks
1:27:04
that are going through.
1:27:05
I will ask them, where
do they struggle?
1:27:07
How much time it takes?
1:27:08
Yes.
1:27:09
That's usually where you want to
start with task decomposition.
1:27:12
So let's say we've done that
work, and we have this list.
1:27:16
I'm simplifying.
1:27:17
But the customer support
agent, human, typically
1:27:20
would extract key
info, then look up
1:27:23
in the database to retrieve
the customer record.
1:27:25
Then check the policy.
1:27:27
Are we allowed to
update the address,
1:27:29
or is it a fixed data point?
1:27:32
And then drafts a response
email and sends the email.
1:27:35
So we've decomposed that task.
1:27:39
Once you've
decomposed that task,
1:27:42
how do you design
your agentic workflow?
1:28:03
Yes.
1:28:04
[INAUDIBLE]
1:28:17
Exactly.
1:28:18
So to repeat,
you're going to look
1:28:20
at the decomposition of tasks,
get an instinct of what's fuzzy,
1:28:24
what's deterministic,
and then determine
1:28:28
which line is going to be an LLM
one shot, which one will require
1:28:33
maybe a RAG, which one will
require a tool, which one will
1:28:36
require memory, which one--
1:28:38
So you will start
designing that map.
1:28:41
Completely right.
1:28:41
That's also what
I would recommend.
1:28:43
You might actually draft it and
say, OK, I take the user prompt.
1:28:48
And the first step of
my task decomposition
1:28:52
was extract information that
seems to be a vanilla LLM.
1:28:57
You can guess that the
vanilla LLM would probably
1:29:00
be good enough at
extracting the user wants
1:29:03
to change their address,
and this is the order number
1:29:05
and this is the new address.
1:29:06
You probably don't need
too much technology
1:29:08
there other than the LLM.
1:29:11
The next step, it feels like
you need a tool because you're
1:29:14
actually going to have to
look up in the database
1:29:17
and also update the address.
1:29:21
So that might be a
tool, and you might
1:29:23
have to build a custom
tool for the LLM
1:29:25
to say, let me connect
you to that database
1:29:27
or let me give you access to
that resource with an MCP.
1:29:32
After that, you probably need an
LLM again to draft the email,
1:29:35
and you would probably
paste in the confirmation.
1:29:38
You would paste in the
confirmation that your address
1:29:40
has been updated from x to y.
1:29:42
And then the LLM
will draft an answer.
1:29:44
And of course,
just to not forget,
1:29:46
you might need a tool
to send the email.
1:29:49
You might actually need
to post something somewhere
1:29:54
for the email to go out.
1:29:57
And then you'll get the output.
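Here's that whole workflow as a sketch, with llm(), llm_json(), orders_db, and send_email() as hypothetical stand-ins:

```python
# Sketch of the designed workflow: a vanilla LLM for extraction, a
# database tool in the middle, an LLM again for drafting, and a send
# tool at the end. All helpers are hypothetical.

def handle_address_change(user_msg):
    # 1. Extraction: a vanilla LLM is probably good enough here.
    info = llm_json(f"Extract order_id and new_address as JSON: {user_msg}")

    # 2. Tool: look up the record and check the policy (deterministic).
    record = orders_db.lookup(info["order_id"])
    if not record["address_editable"]:
        return "Sorry, the address on this order can no longer be changed."
    orders_db.update_address(info["order_id"], info["new_address"])

    # 3. LLM again, with the confirmation pasted into the prompt.
    draft = llm(
        "Draft a polite confirmation email. Fact to include: the address "
        f"was updated from {record['address']} to {info['new_address']}."
    )

    # 4. Tool: actually post the email so it goes out.
    send_email(record["customer_email"], draft)
    return draft
```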
1:29:59
Does that make sense? So,
exactly what you described.
1:30:02
Now moving to the next step.
1:30:03
Once we have-- we've
decomposed our tasks.
1:30:06
Then we have designed an
agentic workflow around it.
1:30:09
It took us five minutes.
1:30:10
In practice, it
would take you more
1:30:12
if you're building
your startup on that.
1:30:13
You want to make sure your
task decomposition is accurate,
1:30:15
your design is accurate
here, and then
1:30:17
there's a lot of
work to be done on every tool
1:30:20
to optimize it for
latency and cost.
1:30:22
But let's say, now we
want to know if it works.
1:30:27
And I'm going to assume
that you have LLM traces.
1:30:30
LLM traces are very important.
1:30:33
Actually, if you're
interviewing with an AI startup,
1:30:36
I would recommend that during the
interview process you ask them,
1:30:39
do you have LLM traces?
1:30:40
Because if they don't
have LLM traces,
1:30:42
it is pretty hard to debug an
LLM system because you don't
1:30:46
have visibility on the chain of
complex prompts that were called
1:30:50
and where the bug is.
1:30:52
And so it's a basic
part of an AI startup
1:30:57
stack to have LLM traces.
1:31:00
So let's assume you have traces.
1:31:02
How would you know
if your system works?
1:31:04
I'm going to summarize some
of the things I heard earlier.
1:31:11
You gave us an example
of an end to end metric.
1:31:15
You look at the user
satisfaction at the end.
1:31:18
You can also do a
component-based approach
1:31:21
where you actually will look at
the tool, the database updates,
1:31:25
and you will manually do
an error analysis and see,
1:31:28
oh, the tool actually always
forgets to update the address.
1:31:32
It just fails at writing.
1:31:33
And I'm going to fix that.
1:31:34
This is deterministic
pretty much.
1:31:37
Or when it tries
to send the email
1:31:40
and ping the system that is
supposed to send the email,
1:31:44
it doesn't send it
in the right format.
1:31:46
And so it bugs at that point.
1:31:48
Again, you could fix that.
1:31:51
Draft of the email.
1:31:52
The LLM doesn't do a great job.
1:31:53
It's not very polite
at drafting the email.
1:31:56
So you could look at
component by component,
1:31:59
and it's actually easier
to debug than to look at it
1:32:01
end to end.
1:32:02
You would probably
do a mix of both.
1:32:05
Another way to look at
it is what is objective
1:32:08
versus what is subjective?
1:32:10
So for example, an
objective example
1:32:12
would be the LLM extracted
the wrong order ID.
1:32:18
The user said my order
ID is X, and the LLM,
1:32:21
when it actually looked
up in the database,
1:32:24
it used the wrong order ID.
1:32:26
This is objectively wrong.
1:32:27
You can actually
write Python code
1:32:29
that checks that, checks just
the alignment between what
1:32:32
the user mentioned and what was
actually passed to the database
1:32:36
or used for the lookup.
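For instance, a minimal sketch of that check over traces, assuming an illustrative trace format and order-ID pattern:

```python
import re

# Objective eval over LLM traces: did the order ID the user mentioned
# match the order ID actually used in the database lookup? The trace
# field names and ID format are illustrative.

ORDER_ID = re.compile(r"\b(ORD-\d+)\b")   # assumed ID format

def order_id_matches(trace):
    mentioned = ORDER_ID.search(trace["user_message"])
    used = trace["db_lookup"]["order_id"]
    return bool(mentioned) and mentioned.group(1) == used

traces = [
    {"user_message": "My order ID is ORD-1042, please update the address",
     "db_lookup": {"order_id": "ORD-1042"}},
    {"user_message": "Order ORD-77 needs a new address",
     "db_lookup": {"order_id": "ORD-770"}},          # objectively wrong
]
accuracy = sum(order_id_matches(t) for t in traces) / len(traces)
print(f"order-id extraction accuracy: {accuracy:.0%}")   # 50%
```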
1:32:38
You also have subjective
stuff, which we talked about,
1:32:40
where you probably want to
do either human rating or LLM
1:32:43
as judges.
1:32:44
It's very relevant
for subjective evals.
1:32:49
And finally, you
will find yourself
1:32:51
having quantitative evals
and more qualitative evals.
1:32:55
So quantitative would be
percentage of successful address
1:32:59
updates.
1:33:00
The latency.
1:33:00
You could actually track
the latency component-based
1:33:03
and see which one
is the slowest.
1:33:05
Let's say sending the
email is five seconds.
1:33:08
It's too long, let's say.
1:33:10
You would notice component
based or the full workflow.
1:33:13
And then you will decide, where
am I optimizing my latency,
1:33:15
and how am I going to do that?
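And a sketch of the quantitative side, computing per-component latency from traces with an illustrative trace structure:

```python
from collections import defaultdict

# Quantitative eval from traces: average latency per component, so the
# bottleneck stands out. The trace structure is illustrative.

def latency_report(traces):
    per_step = defaultdict(list)
    for t in traces:
        for step in t["steps"]:   # e.g. extract, db_update, send_email
            per_step[step["name"]].append(step["ms"])
    return {name: sum(ms) / len(ms) for name, ms in per_step.items()}

traces = [{"steps": [{"name": "extract", "ms": 300},
                     {"name": "db_update", "ms": 120},
                     {"name": "send_email", "ms": 5000}]}]
print(latency_report(traces))   # send_email at 5s stands out as the bottleneck
```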
1:33:17
And then finally, qualitative.
1:33:20
You might actually do
some error analysis
1:33:23
and look at where are
the hallucinations?
1:33:27
Where are the tone mismatches?
1:33:31
Are the users confused, and
what are they confused by?
1:33:34
That would be more qualitative.
1:33:36
And typically, it would take
more white glove approaches
1:33:41
to do that.
1:33:42
So here's what it
could look like.
1:33:44
I gave you some examples.
1:33:46
But you would build
evals to determine
1:33:50
objectively, subjectively,
component-based, end
1:33:53
to end based, and then
quantitatively and
1:33:55
qualitatively, where's
your LLM failing
1:33:57
and where it's doing well.
1:34:02
Does that give you a
sense of the type of stuff
1:34:04
you could do to fix or
improve that agentic workflow?
1:34:09
Super.
1:34:10
Well, that was our
case study on evals.
1:34:12
We're not going to
delve deeper into it.
1:34:14
But hopefully, it gave you
a sense of the type of stuff
1:34:16
you can do with LLM
judges, with objective,
1:34:21
subjective, component-based,
end to end, et cetera.
1:34:25
Last section on
multi-agent workflows.
1:34:29
So you might ask, hey, why do we
need multi-agent workflows when
1:34:36
the workflow already
has multiple steps,
1:34:38
already calls the LLM multiple
times, already gives it tools?
1:34:42
Why do we need multiple agents?
1:34:45
And so many people are talking
about multi-agent systems online.
1:34:47
It's not even a
new thing, frankly.
1:34:49
Multi-agent systems have
been around for a long time.
1:34:52
The main advantage of
a multi-agent system
1:34:55
is going to be parallelism.
1:34:57
It's like is there
something that I
1:34:59
wish I would run in parallel,
sort of independently,
1:35:04
but maybe there are some
things in the middle?
1:35:07
But that's where you want
to put a multi-agent system.
1:35:09
It's when it's parallel.
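A minimal sketch of that parallelism argument, using asyncio and a hypothetical run_agent() that stands in for one agent's LLM-plus-tool round trip:

```python
import asyncio

# Independent sub-agents run concurrently; the orchestrator gathers
# their results. run_agent() is a stand-in for a real agent's workflow.

async def run_agent(name, task):
    await asyncio.sleep(1)          # stand-in for an LLM + tool round trip
    return f"{name} finished: {task}"

async def orchestrate():
    results = await asyncio.gather(
        run_agent("flights", "find SFO->CDG options"),
        run_agent("hotels", "find hotels near the Eiffel Tower"),
        run_agent("itinerary", "draft must-visit places"),
    )
    return results                   # ~1s total instead of ~3s sequentially

print(asyncio.run(orchestrate()))
```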
1:35:12
The other advantage
that some companies
1:35:14
have with multi-agent systems
is an agent can be reused.
1:35:19
So let's say in a company,
you have an agent that's
1:35:21
been built for design.
1:35:22
That agent can be used
in the marketing team,
1:35:25
and it can be used
in the product team.
1:35:27
And so now you're
optimizing an agent,
1:35:30
which has multiple stakeholders
that can communicate with it
1:35:33
and benefit from
its performance.
1:35:38
Actually I'm going
to ask you a question
1:35:40
and give you a few seconds, maybe a
minute, to think about it.
1:35:43
Let's say you were
building smart home
1:35:46
automation for your
apartment or your home.
1:35:50
What agents would
you want to build?
1:35:52
Yeah.
1:35:53
Write it down.
1:35:54
And then I'm going to
ask you in a minute
1:35:57
to share some of the
agents that you will build.
1:36:00
Also, think about
how you would put
1:36:03
a hierarchy between
these agents,
1:36:04
or how you would
organize them, or who
1:36:06
should communicate with who.
1:36:07
OK?
1:36:08
OK.
1:36:08
Take a minute for that.
1:36:12
Be creative also because I'm
going to ask for all of your agents,
1:36:14
and maybe you have an agent
that nobody has thought of.
1:36:21
OK.
1:36:22
Let's get started.
1:36:24
Who wants to give
me a set of agents
1:36:26
that you would want for
your home, your smart home?
1:36:29
Yes.
1:36:32
The first is like a set
of agents [INAUDIBLE]
1:37:00
OK.
1:37:01
So let me repeat.
1:37:02
You have four agents,
I think, roughly.
1:37:05
One that tracks biometrics,
like where are you in the home?
1:37:09
Where are you moving?
1:37:10
How you're moving,
things like that.
1:37:12
That sort of knows
your location.
1:37:15
The second one determines
the temperature of the rooms
1:37:21
and has the ability
to change it.
1:37:23
The third one tracks
energy efficiency
1:37:26
and might give feedback on
energy and energy usage.
1:37:31
And might be, I
don't know, maybe
1:37:32
it has the control over
the temperature as well.
1:37:34
I don't know actually.
1:37:35
Or the gas or the water, might
cut your water at some point.
1:37:43
And then you have an
orchestrator agent.
1:37:44
What is exactly the
orchestrator doing?
1:37:48
It passes instructions
[INAUDIBLE]
1:37:53
OK.
1:37:53
Passes instructions.
1:37:55
So is that the agent
that communicates mainly
1:37:58
with the user?
1:38:00
So if I'm coming
back home and I'm
1:38:02
saying I want the
oven to be preheated,
1:38:05
I communicate with
the orchestrator,
1:38:07
and then it would
funnel to another agent.
1:38:09
OK.
1:38:10
Sounds good.
1:38:11
Yeah.
1:38:11
So that's an example
of, I want to say,
1:38:14
a hierarchical
multi-agent system.
1:38:20
What else?
1:38:21
Any other ideas?
1:38:22
What would you add to that?
1:38:24
Yeah.
1:38:25
[INAUDIBLE]
1:38:55
Oh, I like that.
1:38:56
That's a really good one.
1:38:57
So let me summarize.
1:38:58
You have a security agent that
determines if you can enter
1:39:02
or not.
1:39:03
And when you enter, it
understands who you are.
1:39:06
And then it gives
you certain sets
1:39:08
of permissions that might
be different depending
1:39:11
of if you're a parent or a kid.
1:39:13
Or you might have access to
certain cars and not others.
1:39:17
Or your kid cannot open the
fridge, or I don't know.
1:39:20
Something like that.
1:39:21
Yeah.
1:39:22
OK, I like that.
1:39:23
That's a good one.
1:39:24
And it does feel like it's a
complex enough workflow where
1:39:28
you want a specific
workflow tied to that.
1:39:32
I agree.
1:39:34
What else?
1:39:39
Yes.
1:39:41
[INAUDIBLE] So you can
get more complicated.
1:39:43
So, high energy savings
tied to whether or not you
1:39:50
or someone else is home, the blinds
in the house, or also
1:39:55
when you tap into the grid.
1:39:57
Yeah. So another thought I
have as well, which is much harder
1:40:04
to track than in the grocery store:
1:40:06
understanding
what's in your fridge.
1:40:08
OK
1:40:12
Well, that's really
good actually.
1:40:14
So you mentioned two of them.
1:40:16
One is maybe an agent that has
access to external APIs that
1:40:20
can understand the weather
out there, the wind, the sun,
1:40:24
and then has control over
certain devices at home.
1:40:28
Temperature, blinds, things
like that, and also understands
1:40:31
your preferences for it.
1:40:33
That does feel like it's a good
use case because you could give
1:40:36
that to the orchestrator,
but it might lose itself
1:40:38
because it's doing too much.
1:40:41
And also, these problems
are tied together,
1:40:43
like temperature outdoor
with the weather API
1:40:45
might influence the
temperature inside,
1:40:48
how you want it, et cetera.
1:40:50
And then the second
one, which I also like,
1:40:52
is you might have an agent
that looks at your fridge
1:40:55
and what's inside.
1:40:57
And it might
actually have access
1:40:58
to the camera in the
fridge, for example,
1:41:01
and know your
preferences and also has
1:41:03
access to the
e-commerce API to order
1:41:06
Amazon groceries ahead of time.
1:41:09
I agree.
1:41:10
And maybe the orchestrator
will be the communication line
1:41:12
with the user, but it might
communicate with that agent
1:41:16
in order to get it done.
1:41:17
Yeah.
1:41:18
I like those.
1:41:19
So those are all
really good examples.
1:41:21
Here is the list I had up there.
1:41:25
So climate control, lighting,
security, energy management,
1:41:30
entertainment,
a notification agent,
1:41:32
alerts about system updates,
energy saving, and an orchestrator.
1:41:35
So all of them you
mentioned actually.
1:41:38
And then we didn't talk about
the different interaction
1:41:41
patterns, but you do have
different ways to organize
1:41:45
a multi-agent system.
1:41:46
Flat, hierarchical.
1:41:48
It sounds like this
would be hierarchical.
1:41:51
I agree.
1:41:52
And the reason is
UI/UX: I would rather
1:41:55
have to only talk
to the orchestrator
1:41:57
than have to go to
a specialized application
1:42:00
to do something.
1:42:01
Like it feels like
the orchestrator
1:42:02
could be responsible for that.
1:42:04
And so I agree, I would probably
go for a hierarchical setup
1:42:07
here.
1:42:08
But maybe you might also
add some connections
1:42:11
between other agents,
like in the flat system
1:42:13
where it's all to all.
1:42:15
For example, with climate
control and energy,
1:42:17
if you want to
connect those two,
1:42:19
you might actually allow them
to speak with each other.
1:42:21
When you allow agents to
speak with each other,
1:42:24
it is basically an MCP
protocol, by the way.
1:42:26
So you treat the agent like
a tool, exactly like a tool.
1:42:30
Here is how you interact
with this agent.
1:42:32
Here is what it can tell you.
1:42:34
Here is what it needs
from you, essentially.
1:42:37
OK super.
1:42:38
And then without going
into the details,
1:42:40
there are advantages to
multi-agent workflows
1:42:43
versus single agents,
such as debugging.
1:42:47
It's easier to debug
a specialized agent
1:42:50
than to debug an entire system.
1:42:52
Parallelization as well.
1:42:54
It's easier to have
things run in parallel,
1:42:56
and you can save time.
1:42:59
There are some
advantages to doing that,
1:43:01
and I'll leave you with this
slide if you want to go deeper.
1:43:04
Super.
1:43:05
So we've learned so many
techniques to optimize LLMs,
1:43:08
from prompts to chains to
fine tuning, retrieval,
1:43:12
and to multi-agent
systems as well.
1:43:14
And then just to end on a couple
of trends I want you to watch.
1:43:19
I think next week is
Thanksgiving, is that it?
1:43:21
It's Thanksgiving break.
1:43:22
No, the week after.
1:43:23
OK.
1:43:24
Well ahead of the
Thanksgiving break.
1:43:26
So if you're traveling, you
can think about these things.
1:43:29
On what's next in AI, I wanted
to call out a couple of trends.
1:43:34
So Ilya Sutskever, one of
the OGs of LLMs and OpenAI
1:43:40
co-founder, raised that question
about are we plateauing or not.
1:43:45
The question is, are we going to
see in the coming years LLMs sort
1:43:50
of not improve as fast as
we've seen in the past?
1:43:54
It's been the feeling
in the community
1:43:56
probably that the
last version of GPT
1:44:00
did not bring the
level of performance
1:44:03
that people were expecting,
although it did make
1:44:06
it so much easier to use for
consumers because you don't need
1:44:09
to interact with
different models.
1:44:10
It's all under the same hood.
1:44:12
So it seems that
it's progressing,
1:44:14
but the plateau is unclear.
1:44:17
The way I would think about it
is the LLM scaling laws tell us
1:44:22
that if we continue to
improve compute and energy,
1:44:26
then LLMs should
continue to improve.
1:44:28
But at some point,
it's going to plateau.
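One common empirical form of those scaling laws, from the Chinchilla line of work (Hoffmann et al., 2022), writes the loss as a power law in parameters and data with an irreducible floor:

```latex
% Chinchilla-style scaling law: loss falls as a power law in
% parameters N and training tokens D, down to an irreducible term E.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```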
1:44:29
So what's going to take
us to the next step?
1:44:32
It's probably
architecture search.
1:44:35
Still, a lot of LLMs,
even if we don't
1:44:36
understand what's under
the hood, are probably
1:44:38
transformer-based today.
1:44:40
But we know that the human brain
does not operate the same way.
1:44:43
There's just certain
things that we
1:44:45
do that are much more
efficient, much faster.
1:44:47
We don't need as much data.
1:44:49
So theoretically,
we have so much
1:44:51
to learn in terms of
architecture search
1:44:53
that we haven't figured out.
1:44:54
It's not a surprise that
you see those labs hire
1:44:57
so many engineers.
1:44:58
Because it is possible
that in the next few years,
1:45:01
you're going to have
thousands of engineers trying
1:45:03
to figure out the different
engineering hacks and tactics
1:45:06
and architectural
searches that are
1:45:07
going to lead to better models.
1:45:10
And one of them suddenly will
find the next transformer,
1:45:13
and it will reduce by 10x the
need for compute and the need
1:45:17
for energy.
1:45:18
It's sort of like if you read Isaac
Asimov's Foundation series.
1:45:24
Individuals can have an amazing
impact on the future because
1:45:27
of their decisions.
1:45:29
Whoever discovered transformers
had a tremendous impact
1:45:33
on the direction of AI.
1:45:34
I think we're going to see
more of that in the coming
1:45:37
years, where some group of
researchers that is iterating
1:45:40
fast might discover certain
things that would suddenly
1:45:43
unlock that plateau and
take us to the next step,
1:45:45
and it's going to continue
to improve like that.
1:45:47
And so it doesn't surprise me
that there's so many companies
1:45:50
hiring engineers right
now to figure out
1:45:52
those hacks and
those techniques.
1:45:56
The other set of gains
that we might see
1:45:58
is from multi-modality.
1:45:59
So the way to think about it is
we've had LLMs first text-based,
1:46:04
and then we've added images.
1:46:06
And today, models are
very good at images.
1:46:09
They're very good at text.
1:46:10
It turns out that being good at
images and being good at text
1:46:13
makes the whole model better.
1:46:15
So the fact that you're good
at understanding a cat image
1:46:18
makes you better at
text as well for a cat.
1:46:21
Now you add another modality
like audio or video.
1:46:24
The whole system gets better.
1:46:26
So you're better at
writing about a cat
1:46:28
if you know what
a cat sounds like,
1:46:30
if you can look at a
cat on an image as well.
1:46:31
Does that make sense?
1:46:32
So we see gains that are
translated from one modality
1:46:35
to another, and that might lead
1:46:38
to the pinnacle of robotics
where all these
modalities come together.
1:46:40
And suddenly, the
robot is better at
1:46:42
running away from a cat
because it understands
1:46:44
what a cat is, what
it sounds like,
1:46:46
what it looks like, et cetera.
1:46:48
That makes sense?
1:46:49
The other one is the multiple
methods working in harmony.
1:46:53
In the Tuesday lectures, we've
seen supervised learning,
1:46:56
unsupervised learning,
self-supervised learning,
1:46:58
reinforcement learning, prompt
engineering, RAGs, et cetera.
1:47:02
If you look at how
babies learn, it
1:47:06
is probably a mix of those
different approaches.
1:47:09
Like a baby might have some
meta learning, meaning it
1:47:13
has some survival
instinct that is
1:47:16
encoded in the DNA most likely.
1:47:19
And that's like the baby's
pre-training, if you will.
1:47:22
On top of that, the mom or
the dad is pointing at stuff
1:47:27
and saying bad, good, bad, good.
1:47:29
Supervised learning.
1:47:30
On top of that, the baby
is falling on the ground
1:47:33
and getting hurt.
1:47:34
And that's a reward signal
for reinforcement learning.
1:47:36
On top of that, the baby
is observing other people
1:47:39
doing stuff or
other babies doing
1:47:42
stuff, unsupervised learning.
1:47:43
You see what I mean?
1:47:44
We're probably a mix
of all these methods,
1:47:47
and I think that's where
the trend is going, is
1:47:49
where those methods that
you've seen in CS230
1:47:52
come together in order to build
an AI system that learns fast,
1:47:56
is low latency, is
cheap, energy-efficient,
1:48:00
and makes the most out
of all of these methods.
1:48:03
Finally, and this is
especially true at Stanford,
1:48:06
you have research going on that
you would consider human-centric
1:48:11
and some research that
is non-human centric.
1:48:13
By human-centric, I should
say, approaches
1:48:16
that are modeled after the
brain, versus approaches that
1:48:19
are not modeled after humans.
1:48:20
Because it turns out that the
human body is very limiting.
1:48:24
And so if you actually
only do research
1:48:26
on what the human
brain looks like,
1:48:28
you're probably missing out on
compute and energy and stuff
1:48:30
like that that you
can optimize even
1:48:32
beyond neuronal
connections in the brain,
1:48:35
but you still can learn a
lot from the human brain.
1:48:37
And that's why there are
professors that are running labs
1:48:40
right now that
try to understand,
1:48:42
how does back propagation
work for humans?
1:48:45
And in fact, it's probably that
we don't have back propagation.
1:48:48
We don't use back propagation,
we only do forward propagation,
1:48:51
let's say.
1:48:51
So this type of stuff
is interesting research
1:48:54
that I would encourage you
to read if you're curious
1:48:56
about the direction of AI.
1:48:59
And then finally, one thing
that's going to be pretty clear,
1:49:02
I call it out all the time,
but it's the velocity
1:49:05
at which things are moving.
1:49:06
You're noticing,
part of the reason
1:49:08
we're giving you
a breadth in CS230
1:49:10
is because these methods
are changing so fast.
1:49:12
So I don't want to bother
going and teaching you
1:49:15
method number 17
on RAG that
1:49:17
optimizes the RAG,
because in two years,
1:49:19
you're not going to need it.
1:49:20
So I would rather
you think about what
1:49:23
is the breadth of things
you want to understand.
1:49:25
And when you need it, you
are sprinting and learning
1:49:27
the exact thing you need faster
because the half-life of a skill
1:49:30
is so short.
1:49:31
You want to come out of the
class with a good breadth
1:49:34
and then have the ability
to go deep whenever
1:49:36
you need after the class.
1:49:38
And so that's sort of how that
class is designed as well.
1:49:41
Yeah.
1:49:41
That's it for today.
1:49:43
So thank you.
1:49:45
Thank you for participating.
— end of transcript —