[00:05] Hi, everyone. [00:06] Welcome to another lecture for CS230 Deep Learning. [00:11] Today, we're going to talk about enhancing large language model applications. [00:19] And I call this lecture Beyond LLM. [00:23] It has a lot of newer content. [00:26] And the idea behind this lecture is we started to learn about neurons, and then we learned about layers, and then we learned about deep neural networks, and then we learned a little bit about how to structure projects in C3. [00:44] And now we're going one level beyond into, what would it look like if you were building agentic AI systems at work, in a startup, in a company? [00:58] And it's probably one of the more practical lectures. [01:02] Again, the goal is not to build a product end to end in the next hour or so, but rather to tell you all the techniques that AI engineers have cracked, figured out, are exploring, so that after the class, you have the breadth of view of different prompting techniques, different agentic workflows, multi-agent systems, evals. [01:25] And then when you want to dive deeper, you have the baggage to dive deeper and learn faster about it. [01:32] Let's try to make it as interactive as possible, as usual. [01:37] When we look at the agenda, the agenda is going to start with the core idea behind challenges and opportunities for augmenting LLMs. [01:48] So we start from a base model. [01:50] How do we maximize the performance of that base model? [01:55] Then we'll dive deep into the first line of optimization, which is prompting methods, and we'll see a variety of them. [02:02] Then we'll go slightly deeper. [02:04] If we were to get our hands under the hood and do some fine tuning, what would it look like? [02:09] I'm not a fan of fine tuning, and I talk a lot about that, but I'll explain why I try to avoid fine tuning as much as possible. [02:18] And then we'll do a section 4 on Retrieval-Augmented Generation, or RAG, which you've probably heard of in the news. [02:26] Maybe some of you have played with RAGs. [02:28] We're going to unpack what a RAG is and how it works, and then the different methods within RAGs. [02:36] And then we'll talk about agentic AI workflows. [02:40] I'll define it. [02:42] Andrew Ng is one of the first ones to have called this trend agentic AI workflows. [02:49] And so we'll look at the definition that Andrew gives to agentic workflows, and then we'll start seeing examples. [02:56] Section 6 is very practical. [02:59] It's a case study where we will think about an agentic workflow, and I'll ask you to measure if the agent actually works, and we'll brainstorm how we can measure if an agentic workflow is working the way you want it to work. [03:16] There are plenty of methods, called evals, that solve that problem. [03:22] And then we'll look briefly at multi-agent workflows. [03:24] And then we can have an open-ended discussion where I share some thoughts on what's next in AI. [03:31] And I'm looking forward to hearing from you all, as well, on that one. [03:36] So let's get started with the problem of augmenting LLMs. [03:42] So open-ended question for you-- you are all familiar with pre-trained models like GPT-3.5 Turbo or GPT-4o. [03:52] What's the limitation of using just a base model?
[03:56] What are the typical issues that might arise as you're using a vanilla pre-trained model? [04:07] Yes. [04:08] It lacks some domain knowledge. [04:10] Lacks some domain knowledge. [04:11] You're perfectly right. [04:13] We had a group of students a few years ago. [04:16] It was not LLM related, but they were building an autonomous farming device or vehicle that had a camera underneath, taking pictures of crops to determine if the crop is sick or not, if it should be thrown away, if it should be used or not. [04:35] And that data set is not a data set you find out there. [04:40] And the base model or pre-trained computer vision model would lack that knowledge, of course. [04:47] What else? [04:49] Yes. [04:50] [INAUDIBLE] pictures are very dark [INAUDIBLE] [04:57] OK, maybe the-- you're saying-- so just to repeat for people online, you're saying the model might have been trained on high-quality data, but the data in the wild is actually not that high quality. [05:08] And in fact, yes, the distribution of the real world might differ, as we've seen with GANs, from the training set, and that might create an issue with pre-trained models. [05:18] Although pre-trained LLMs are getting better at handling all sorts of data inputs. [05:25] Yes. [05:26] Lacks current information. [05:28] Lack what? [05:28] Current information. [05:30] Lacks current information. [05:32] The LLM is not up to date. [05:34] And in fact, you're right. [05:35] Imagine you have to retrain your LLM from scratch every couple of months. [05:39] One story that I found funny-- it's from probably three years ago, or maybe more, five years ago, where during his first presidency, President Trump one day tweeted, "Covfefe." [05:53] You remember that tweet or no? [05:56] Just "Covfefe." [05:57] And it was probably a typo, or the phone was in his pocket. [05:59] I don't know. [06:00] But that word did not exist. [06:03] The language models, in fact, that Twitter was running at the time could not recognize that word. [06:08] And so the recommender system sort of went wild, because suddenly everybody was making fun of that tweet using the word "Covfefe," and the LLM was so confused about, what does that mean? [06:20] Where should we show it? [06:21] To whom should we show it? [06:22] And this is an example of a-- nowadays, especially on social media, there are so many new trends, and it's very hard to retrain an LLM to match the new trend and understand the new words out there. [06:34] I mean, you oftentimes hear Gen Z words like "rizz" or "mid" or whatever. [06:40] I don't know all of them. [06:41] But you probably want to find a way that can allow the LLM to understand those trends without retraining the LLM from scratch. [06:51] What else? [06:53] It's trained to have a breadth of knowledge. [06:56] And if you wanted to do something specialized, that might limit [INAUDIBLE]. [06:59] Yeah, it might be trained on a breadth of knowledge, but it might fail or not perform adequately on a narrow task that is very well defined. [07:09] Think about enterprise applications-- yeah, enterprise applications. [07:13] You need high precision, high fidelity, low latency. [07:17] And maybe the model is not great at that specific thing. [07:20] It might do fine, but just not good enough.
[07:22] And you might want to augment it in a certain way. [07:24] Yeah. [07:25] Maybe it has [INAUDIBLE] so it makes the model a lot heavier, a lot slower. [07:32] [INAUDIBLE] [07:33] So maybe it has a lot of broad domain knowledge that might not be needed for your application. [07:39] And so you're using a massive, heavy model when you actually are only using 2% of the model's capability. [07:44] You're perfectly right. [07:45] You might not need all of it. [07:46] So you might find ways to prune or quantize the model, modify it. [07:51] All of these are good points. [07:53] I'm going to add a few more, as well. [07:55] LLMs are very difficult to control. [07:58] Your last point is actually an example of that. [08:00] You want to control the LLM to use a part of its knowledge, but it's not-- it's, in fact, getting confused. [08:06] We've seen that in history. [08:08] In 2016, Microsoft created a notorious Twitter bot that learned from users, and it quickly became a racist jerk. [08:18] Microsoft ended up removing the bot 16 hours after launching it. [08:22] The community was really fast at determining that this was a racist bot. [08:28] And you can empathize with Microsoft in the sense that it is actually hard to control an LLM. [08:34] They might have done a better job of vetting it before launching, but it is really hard to control an LLM. [08:40] Even more recently, this is a tweet from Sam Altman last November, where there was this debate between Elon Musk and Sam Altman on whose LLM is the left-wing propaganda machine or the right-wing propaganda machine, and they were hating on each other's LLMs. [08:59] But that tells you, at the end of the day, that even those two teams, Grok and OpenAI, which are probably the best-funded teams with a lot of talent, are not doing a great job at controlling their LLMs. [09:14] And from time to time, if you hang out on X, you might see screenshots of users interacting with LLMs and the LLM saying something really controversial or racist or something that would not be considered great by social standards, I guess. [09:33] And that tells you that the model is really hard to control. [09:39] The second aspect of it is something that you mentioned earlier. [09:43] LLMs may underperform on your task, and that might include specific knowledge gaps, such as medical diagnosis. [09:51] If you're doing medical diagnosis, you would rather have an LLM that is specialized for that and is great at it and, in fact, something that we haven't mentioned as a group, cites sources. [10:00] So the answer is specifically sourced. [10:03] You have a hard time believing something unless you have the actual source of the research that backs it up. [10:10] Inconsistencies in style and format-- so imagine you're building a legal AI agentic workflow. [10:17] Legal has a very specific way to write and read, where every word counts. [10:22] If you're negotiating a large contract, every word on that contract might mean something else when it comes to the court. [10:29] And so it's very important that you use an LLM that is very good at it. [10:34] The precision matters.
[10:35] And then task-specific understanding, such as doing classification in a niche field. [10:40] Here I pulled an example where-- let's say a biotech company is trying to use an LLM to categorize user reviews into positive, neutral, or negative. [10:54] Maybe for that company, something that would typically be considered a negative review is actually considered a neutral review, because the NPS of that industry tends to be way lower than other industries, let's say. [11:10] That's a task-specific understanding, and the LLM needs to be aligned to what the company believes is the categorization that it wants. [11:17] We will see an example of how to solve that problem in a second. [11:21] And then limited context handling-- a lot of AI applications, especially in the enterprise, require data that has a lot of context. [11:33] Just to give you a simple example, knowledge management is an important space; enterprises buy a lot of knowledge management tools. [11:40] When you go on your drive and you have all your documents, ideally, you could have an LLM running on top of that drive. [11:47] You could ask any question, and it would immediately read thousands of documents and answer: what was our Q4 performance in sales? [11:56] It was x dollars. [11:58] It finds it super quickly. [11:59] In practice, because LLMs do not have a large enough context, you cannot use a standalone vanilla pre-trained LLM to solve that problem. [12:08] You will have to augment it. [12:11] Does that make sense? [12:13] The other aspect around context windows is they are, in fact, limited. [12:17] If you look at the context windows of the models from the last five years, even the best models today will range in context window, or the number of tokens they can take as input, somewhere in the hundreds of thousands of tokens max. [12:36] Just to give you a sense, 200,000 tokens is roughly two books. [12:42] So that's how much you can upload and it can read, pretty much. [12:47] And you can imagine that when you're dealing with video understanding or heavier data files, that is, of course, an issue. [12:56] So you might have to chunk it. [12:58] You might have to embed it. [12:59] You might have to find other ways to get the LLM to handle larger contexts. [13:06] The attention mechanism is also powerful, but problematic, because it does not do a great job at attending over very large contexts. [13:16] There is actually an interesting problem called needle in a haystack. [13:21] It's an AI problem-- or call it a benchmark-- where, in order to test if your LLM is good at putting attention on a very specific fact within a large corpus, researchers randomly insert a sentence that states a certain fact, such as "Arun and Max are having coffee at Blue Bottle," in the middle of the Bible, let's say, or some very long text. [13:54] And then you ask the LLM, what were Arun and Max having at Blue Bottle? [14:02] And you see if it remembers that it was coffee. [14:04] It's actually a complex problem, not because the question is complex, but because you're asking the model to find a fact within a very large corpus, and that's complicated.
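To make the setup concrete, here is a minimal sketch of how such a needle-in-a-haystack test might be constructed. The corpus is a stand-in, and `query_llm` is a hypothetical helper for whatever LLM API you use; this is an illustration of the idea, not a specific published benchmark implementation.

```python
import random

def build_needle_test(corpus: str, needle: str, question: str) -> str:
    """Hide a 'needle' sentence at a random position inside a long corpus."""
    sentences = corpus.split(". ")
    sentences.insert(random.randint(0, len(sentences)), needle)
    haystack = ". ".join(sentences)
    return f"{haystack}\n\nBased only on the text above: {question}"

very_long_text = "The quick brown fox jumps over the lazy dog. " * 5000  # stand-in corpus
prompt = build_needle_test(
    corpus=very_long_text,
    needle="Arun and Max are having coffee at Blue Bottle",
    question="What were Arun and Max having at Blue Bottle?",
)
# answer = query_llm(prompt)           # hypothetical LLM call
# passed = "coffee" in answer.lower()  # did the model attend to the needle?
```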
[14:16] So, again, this is a limiting factor for LLMs. [14:19] We'll talk about RAG in a second. [14:21] But I want to preview-- there are debates around whether RAG is the right long-term approach for AI systems. [14:29] So as a high-level idea, a RAG is a mechanism, if you will, that embeds documents that an LLM can retrieve and then add as context to its initial prompt to answer a question. [14:44] It has lots of applications. [14:45] Knowledge management is an example. [14:47] So imagine you have your drive again. [14:49] But every document is compressed into a representation, and the LLM has access to that lower-dimensional representation. [14:59] The debate that this tweet from [INAUDIBLE] outlines is: in theory, if we have infinite compute, then RAG is useless. [15:09] Because you can just read a massive corpus immediately and answer your question. [15:15] But even in that case, latency might be an issue. [15:19] Imagine the time it takes for an AI to read your entire drive every single time you ask a question. [15:24] It doesn't make sense. [15:25] So RAG has other advantages beyond even the accuracy. [15:30] On top of that, the sourcing matters, as well. [15:33] So RAG allows you to source. [15:35] We'll talk about all that later. [15:38] But there's always this debate in the community whether a certain method is actually future proof. [15:46] Because in practice, as compute power doubles every year, let's say, some of the methods we're learning right now might not be relevant three years from now. [15:54] We don't know, essentially. [15:59] And the analogy that he makes on context windows, and why RAG approaches might be relevant even a long time from now, is search. [16:09] When you search on a search engine, you still find sources of information. [16:14] And in fact, in the background, there are very detailed traversal algorithms that rank and find the specific links that might be the best to present you. [16:25] Versus, imagine you had to read the entire web every single time you're doing a search query, without being able to narrow to a certain portion of the space. [16:36] That might, again, not be reasonable. [16:41] OK, when we're thinking of improving LLMs, the easiest way to think of it is along two dimensions. [16:50] One dimension is we are going to improve the foundation model itself. [16:54] So, for example, we move from GPT-3.5 Turbo, to GPT-4, to GPT-4o, to GPT-5. [17:04] Each of those is supposed to improve the base model. [17:07] GPT-5 is another debate because it's packaging other models within itself. [17:12] But if you're thinking about 3.5, 4, and 4o, that's really what it is. [17:16] The pre-trained model improves. [17:18] And so you should see your performance improve on your tasks. [17:22] But the other dimension is we can actually engineer-- leverage the LLM in a way that makes it better. [17:30] So you can simply prompt GPT-4o. [17:34] You can change some prompts and improve the prompt, and it will improve the performance. [17:40] It's been shown. [17:41] You can even put a RAG around it. [17:42] You can put an agentic workflow around it. [17:45] You can even put a multi-agent system around it. [17:49] And that is another dimension for you to improve performance.
[17:52] So that's how I want you to think about it-- which LLM am I using, and then how can I maximize the performance of that LLM? [17:59] This lecture is about the vertical axis. [18:02] Those are the methods that we will see together. [18:08] Sounds good for the introduction. [18:11] So let's move to prompt engineering. [18:14] I'm going to start with an interesting study just to motivate why prompt engineering matters. [18:20] There is a study from Harvard Business School, as well as Wharton at UPenn, that took a subset of BCG consultants, individual contributors, and split them into three groups. [18:37] One group had no access to AI. [18:39] One group had access to-- I think it was GPT-4. [18:44] And then one group had access to the LLM, but also a training on how to prompt better. [18:50] And then they observed the performance of these consultants across a wide variety of tasks. [18:56] There are a few things that they noticed that I thought were interesting. [18:59] One is something they called the jagged frontier, meaning that certain tasks that consultants are doing fall beyond the jagged frontier, meaning AI is not good enough. [19:14] It's not improving human performance. [19:18] In fact, it's actually making it worse. [19:20] And some tasks are within the frontier, meaning that AI is actually significantly improving the performance, the speed, the quality of the consultant. [19:32] Many tasks fell within and many tasks fell outside, and they shared their insights. [19:37] But the TLDR is-- there is a frontier within which AI is absolutely helping, and one outside of it where they call out this behavior of falling asleep at the wheel, where people relied on AI on a task that was beyond the frontier. [19:52] And in fact, it ended up going worse because the human was not reviewing the outputs carefully enough. [20:01] They did note that the group that was trained was the best, better than the group that was not trained on prompt engineering, which also motivates why this lecture matters, so that you're within that group afterwards. [20:15] Another insight was the centaurs and the cyborgs. [20:20] They noticed that consultants had the tendency to work with AI in one of two ways, and you might, yourself, be part of one of these groups. [20:29] The centaurs are mythical creatures that are half human, half-- I think, half, what, horses? [20:38] Yeah? [20:39] Horses? [20:39] Half human, half horse. [20:42] And those were individuals that would divide and delegate. [20:45] They might give a pretty big task to the AI. [20:48] So imagine you're working on a PowerPoint, which consultants are known to do. [20:52] You might actually write a very long prompt on how you want it to do your PowerPoint, and then let it work for some time, and then come back and it's done, while others would act as cyborgs. [21:02] Cyborgs are fully blended bionic humans-- human and robot, augmented with robotic parts. [21:10] And those individuals would not fully delegate a task. [21:13] They would actually work super quickly with the model back and forth.
[21:17] I find that a lot of students actually work more like cyborgs than centaurs, while maybe in the enterprise, when you're trying to automate a workflow, you're thinking more like a centaur. [21:29] That's just something good to keep in mind. [21:31] Also, a lot of companies will tell you, oh, we're hiring prompt engineers, et cetera. [21:34] It's a career. [21:35] I don't buy that. [21:36] I think it's just a skill that everybody should have. [21:39] You're not going to make a career out of prompt engineering, but you're probably going to use it as a very powerful skill in your career. [21:49] So let's talk about basic prompt design principles. [21:52] I'm giving you a very simple prompt here. [21:56] Summarize this document, and then the document is uploaded alongside it. [22:00] And the model doesn't have much context around: what should the summary be? [22:06] How long should the summary be? [22:07] What should it talk about, et cetera? [22:09] You can actually improve this prompt by doing something like: summarize this 10-page scientific paper on renewable energy in five bullet points, focusing on key findings and implications for policymakers. [22:25] That's already better. [22:26] You're sharing the audience, and it's going to tailor it to the audience. [22:30] You're saying that you want five bullet points, and you want to focus only on key findings. [22:35] That's a better prompt, you would argue. [22:39] How could you make this prompt even better? [22:41] What are other techniques that you've heard of or tried yourself that could make this one-shot prompt better? [22:53] Yeah. [22:53] [INAUDIBLE] [22:57] OK. [22:58] Provide an example. [22:58] So you mean, say, here is an example of a great summary. [23:02] Yeah. [23:03] You're right. [23:03] That's a good idea. [23:05] [INAUDIBLE] [23:08] Very popular technique. [23:10] Act like a renewable energy expert giving a talk at Davos, let's say, yeah. [23:17] That's great. [23:18] Someone-- yeah. [23:20] Say you're really good at it. [23:22] Yeah. [23:23] You are the best in the world at this. [23:25] Explain. [23:26] Yeah. [23:26] Actually, I mean, these things work. [23:28] It's funny, but it does work to say act like x, y, z. [23:32] It's a very popular prompt template. [23:34] We'll see a few examples. [23:36] What else could you do? [23:40] Yes. [23:41] Of course, you could ask it to critique its own output. [23:46] Critique its own output. [23:47] So you're using reflection. [23:48] So you might actually produce one output, and then ask the model to critique it, and then give it back. [23:52] Yeah. [23:53] We see that. [23:53] That's a great one. [23:54] That's the one that probably works best among these, typically, but we'll see some examples. [23:59] What else? [24:00] Yeah. [24:01] Break the task down into steps. [24:03] OK. [24:03] Break the task down into steps. [24:05] You know what that is called? [24:06] No. [24:07] OK. [24:08] Chain of thought. [24:09] So this is actually a popular method that research has shown improves performance. [24:15] You could actually give a clear instruction and also encourage the model to think step by step: approach the task step by step, and do not skip any step. [24:24] And then you give it some steps, such as: step one, identify the three most important findings. [24:29] Step two, explain how each finding impacts renewable energy policy. [24:33] Step three, write the five-bullet summary with each point addressing a finding, et cetera.
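Putting the pieces together, here is what that improved summarization prompt with chain-of-thought steps might look like as a reusable string in code; the exact wording is illustrative, not a prescribed template.

```python
# Illustrative chain-of-thought prompt for the summarization task above.
COT_SUMMARY_PROMPT = """Act like a renewable energy expert.
Summarize this 10-page scientific paper on renewable energy in five bullet
points, focusing on key findings and implications for policymakers.

Think step by step. Approach the task step by step. Do not skip any step.
Step 1: Identify the three most important findings.
Step 2: Explain how each finding impacts renewable energy policy.
Step 3: Write the five-bullet summary, with each point addressing a finding.

Paper:
{paper_text}
"""

# prompt = COT_SUMMARY_PROMPT.format(paper_text=paper_text)  # then send to the LLM
```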
[24:39] So chain of thought-- I linked the paper from 2023 that popularized chain of thought. [24:46] Chain of thought is very popular right now, especially in AI startups that are trying to control their LLMs. [24:55] OK. [24:56] To go back to your example about act like XYZ, what I like to do-- Andrew Ng also talks about that-- is to look at other people's prompts. [25:06] And in fact, online, you have a lot of free prompt repositories on GitHub. [25:11] In fact, I linked the awesome prompt template repo on GitHub, where you have so many examples of great prompts that engineers have built. [25:19] They said it works great for us, and they published it online. [25:23] And a lot of them start with act as. [25:27] Act as a Linux terminal. [25:29] Act as an English translator. [25:31] Act like a position interviewer, et cetera. [25:37] The advantage of a prompt template is that you can actually put it in your code and scale it for many user requests. [25:44] So let me give you an example from Workera. [25:48] Workera evaluates skills. [25:50] Some of you have taken the assessments already. [25:52] And it tries to personalize to the user. [25:56] And in fact, if you read an HR system in an enterprise, you might have: Jane is a product manager, level 3, she is in the US, and her preferred language is English. [26:10] And actually, that metadata can be inserted in a prompt template that will be personalized for Jane. [26:16] And similarly for Joe, whose preferred language is Spanish, it will tailor it to Joe. [26:24] And that's called a prompt template. [26:26] [INAUDIBLE] [26:34] So the question is: do the foundation models use prompt templates, or do you have to integrate them yourself? [26:42] So the foundation models probably use a system prompt that you don't see. [26:47] When you type on ChatGPT, it is possible-- it's not public-- that OpenAI behind the scenes has something like: act like a very helpful assistant for this user. [26:59] And by the way, here are your memories about the user that we kept in a database. [27:05] You can actually check your memories. [27:07] And then your prompt goes under, and then the generation starts. [27:10] So probably, they're using something like that. [27:12] But it doesn't mean you can't add one yourself. [27:15] So in fact, if you think about a prompt template for the Workera example I was showing, maybe it starts, when you call OpenAI, with act like a helpful assistant. [27:25] And then underneath, it's like act like a great AI mentor that helps people in their career. [27:31] And OpenAI's own template also has something like follow the instructions from the creator. [27:37] It's possible. [27:41] Questions about prompt templates? [27:42] Again, I would encourage you to go and read examples of prompts. [27:45] Some of them are quite thoughtful.
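Here is a minimal sketch of what such a prompt template might look like in code. The template wording and the metadata fields (role, level, country, language) mirror the Jane and Joe example above but are made up for illustration; Joe's role and country are assumptions.

```python
# Hypothetical prompt template personalized from HR-system metadata.
MENTOR_TEMPLATE = (
    "Act like a great AI mentor who helps people in their career.\n"
    "The user is a {role}, level {level}, based in {country}.\n"
    "Always answer in {language}.\n\n"
    "User question: {question}"
)

def build_prompt(user: dict, question: str) -> str:
    """Fill the template with per-user metadata pulled from the HR system."""
    return MENTOR_TEMPLATE.format(**user, question=question)

jane = {"role": "product manager", "level": 3, "country": "US", "language": "English"}
joe = {"role": "sales engineer", "level": 2, "country": "Spain", "language": "Spanish"}

print(build_prompt(jane, "How do I grow into a senior product role?"))
```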
[27:48] Let's talk about zero-shot versus few-shot prompting. [27:51] It came up earlier. [27:53] Here's an example. [27:54] Again, going back to the categorization of product reviews, let's say that we're working on a task where the prompt is: classify the tone of the sentence as positive, negative, or neutral. [28:07] And then you paste the review, which is: the product is fine, but I was expecting more. [28:16] If I were to survey the room, I would bet that some of you would say it's negative. [28:21] Some of you would say it's neutral. [28:23] Because you actually have a first part that is relatively positive. [28:27] It's fine. [28:28] And then the second part, I was expecting more, which is relatively negative. [28:31] So where do you land? [28:33] This can be a subjective question. [28:35] And maybe in one industry, this would be considered amazing. [28:37] And in another one, it would be considered really bad, because people are used to really glowing reviews. [28:44] And so the way you can actually align the model to your task is by converting that zero-shot prompt-- zero-shot refers to the fact that it's not being given any examples-- into a few-shot prompt, where the model is given, in the prompt, a set of examples to align it with what you want it to do. [29:01] So the example here is: again, you paste the same prompt as before with the user review. [29:06] And then you add, here are examples of tone classifications. [29:10] This exceeded my expectations completely. [29:12] Positive. [29:14] It's OK, but I wish it had more features. [29:17] Negative. [29:18] The service was adequate. [29:20] Neither good nor bad. [29:22] Neutral. [29:23] Now classify the tone of this sentence. [29:26] After you've given it these examples, the model then says negative. [29:31] And the reason it says negative, of course, is likely because of the second example, which was it's OK, but I wish it had more features, which we told the model was negative. [29:43] Because the model saw that example, it's now aligned with your expectations. [29:47] Few-shot prompts are very popular. [29:50] And in fact, for AI startups that are slightly more sophisticated, you might see them keep a prompt up to date: whenever a user says something, they might have a human label it and then add it as a few-shot example in the relevant prompts in their code base. [30:08] You can think of that as almost building a data set. [30:10] But instead of actually building a separate data set, like we've seen with supervised fine tuning and then fine tuning the model on it, you're just putting it directly in the prompt. [30:19] It turns out it's probably faster to do that if you want to experiment quickly, because you don't touch the model parameters. [30:25] You just update your prompts. [30:27] And if it's text examples, you can actually concatenate so many examples in a single prompt. [30:34] At some point, it will be too long, and you will not have the necessary context window. [30:39] But it's a pretty strong approach that is quick to align an LLM. [30:48] OK?
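As a concrete illustration, here is what the few-shot version of the tone classifier might look like; the labeled examples are the ones from the slide, and `query_llm` is again a hypothetical helper for your LLM API.

```python
# Few-shot tone classification: the in-prompt examples align the model
# with the company's own definition of positive/negative/neutral.
FEW_SHOT_PROMPT = """Classify the tone of the sentence as positive, negative, or neutral.

Here are examples of tone classifications:
"This exceeded my expectations completely." -> positive
"It's OK, but I wish it had more features." -> negative
"The service was adequate, neither good nor bad." -> neutral

Now classify the tone of this sentence:
"{review}" ->"""

prompt = FEW_SHOT_PROMPT.format(review="The product is fine, but I was expecting more.")
# label = query_llm(prompt).strip()  # hypothetical call; expected output: "negative"
```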
[30:49] Yes. [30:50] [INAUDIBLE] [30:57] So the question was: is there any research on how long the prompt can be before the model essentially loses itself or doesn't follow instructions anymore? [31:06] There is. [31:08] The problem is that research is outdated every few months because models get better. [31:14] And so I don't know where the state of the art is. [31:16] You can probably find it online in benchmarks. [31:20] But I'll give you an example. [31:23] On the Workera product, you have a voice conversation-- for some of you that have tried it-- where the prompt asks you to explain something. [31:30] And then you explain, and then there's a scoring algorithm behind it. [31:33] We know that after eight turns, the model loses itself. [31:38] After eight turns, because you always paste the previous user response, it just starts going wild. [31:44] And so the technique we use in the background is we actually create chapters of the conversation. [31:49] Maybe one chapter is the first eight prompts. [31:51] And then you actually start over from another prompt. [31:53] You can summarize the first part of the conversation, insert the summary, and then keep going. [31:59] Those are engineering hacks that engineers might have figured out in the background. [32:04] Because eight turns makes a prompt quite long, actually. [32:13] Let's move on to chaining. [32:15] Chaining is the most popular technique out of everything we've seen so far in prompt engineering. [32:22] It's not chain of thought. [32:23] So chain of thought, we've seen, is think step by step, step 1, step 2, step 3, do not skip any step. [32:28] This is different. [32:30] This is chaining complex prompts to improve performance, and this is what it looks like. [32:37] You take a single-step prompt, such as: read this customer review and write a professional response that acknowledges their concern, explains the issue, offers a resolution. [32:48] And then you paste the customer review, which is: I ordered a laptop. [32:51] It arrived three days late. [32:52] The packaging was damaged. [32:54] Very disappointing. [32:56] I needed it urgently for work. [32:59] And then the output is an email that is immediately given to you by the LLM after it reads the prompt. [33:08] So this might work, but it might be hard to control. [33:14] Because think about it. [33:15] There are multiple steps that you have listed, and everything is embedded in the same prompt. [33:20] And if you wanted to debug step by step and know which step is weaker, you couldn't. [33:24] You would have everything mixed together. [33:27] So one advantage of chaining is you separate the prompts, so that you can debug them separately. [33:35] And it also gives you an easier way to improve your workflow. [33:41] Let's say a first prompt is: extract the key issues. [33:44] Identify the key concerns mentioned in this customer review. [33:47] Paste the customer review. [33:49] Second prompt: using these issues-- so you paste back the issues-- draft an outline for a professional response that acknowledges concerns, explains possible reasons, and offers a resolution. [34:04] So this is not-- [34:06] Prompt number 3: write the full response. [34:09] So using the outline, write the professional response. [34:14] And then you get your final output. [34:18] So in theory, you can tell me, oh, the second approach is better than the first one at first glance.
[34:23] But what you can notice is that we can actually test those three prompts separately from each other and determine if we will get the most gains out of engineering the first prompt, optimizing it, or the second one, or the third one. [34:39] We now have three prompts that are independent from each other. [34:43] And maybe if the outline were better, the performance of the email-- say, the open rate, or the user satisfaction with the response-- would actually get higher. [34:57] And so chaining improves performance, but most importantly, it helps you control your workflow and debug it more seamlessly. [35:07] Yes. [35:09] So if we know that the three prompts independently work really well, if we combine them into one prompt and we highlight a step-by-step thinking process, do we, on average, get a [INAUDIBLE] by itself, or do we still have to do that breakdown? [35:28] So let me try to rephrase. [35:30] You say, let's say we look at the first prompt, which has all three tasks built into that prompt. [35:37] What exactly do you mean? [35:39] You mean, like, if we evaluate the output and we measure some user insight, satisfaction, et cetera? [35:45] Why don't we just modify that prompt and essentially see how it improves user satisfaction? [35:51] Yeah. [35:51] [INAUDIBLE] [35:54] I see. [35:55] So why do we need the three steps? [35:57] I mean, think about it. [35:59] The intermediate output is what you want to see. [36:02] Like, if I'm debugging the first approach, the way I would do it is I would capture user insights. [36:09] Like, here's the email. [36:10] How good was the response? [36:11] Thumbs up, thumbs down. [36:13] Was your issue resolved? [36:16] Thumbs up, thumbs down. [36:17] Those would tell me how good my prompt is. [36:19] And I can engineer that prompt, optimize it, and I would probably drive some gains. [36:23] But I will not easily be able to trace back to what the problem was. [36:28] While in the second approach, not only can I use the end-to-end metrics to improve my process, I can also use the intermediate steps. [36:35] For example, if I look at prompt 2 and I look at the outline, and I see the outline is actually, meh, it's not great, then I think I can get a lot of gains out of the outline. [36:45] Or the outline is actually really good, but the last prompt doesn't do a good job at translating it into an email. [36:51] So the outline is exactly what I want the LLM to do, but the translation into a customer-facing email is not good. [36:58] In fact, it doesn't follow our internal vocabulary. [37:01] Then I know the third prompt is where I would get the most gains. [37:06] So that's what it allows me to do: have intermediate steps to review. [37:10] Are there any latency [INAUDIBLE] [37:13] We'll talk about it. [37:14] Are there any latency concerns? [37:16] Yes. [37:17] In certain applications, you don't want to use a chain, or you don't want to use a long chain, because it adds latency. [37:26] We'll talk about that later. [37:27] Good point. [37:28] So practically, this is what chaining complex prompts looks like. [37:33] You have your first prompt with your first task. [37:35] It outputs. [37:36] The output is pasted in the second prompt with the second task being defined. [37:41] The output is then pasted into the third prompt with the third task being defined, and so on. [37:46] That's what it looks like in practice.
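Here is a minimal sketch of that chain in code, assuming the same hypothetical `query_llm` helper; each intermediate output can be logged and inspected on its own, which is exactly the debugging benefit discussed above.

```python
def respond_to_review(review: str) -> str:
    # Prompt 1: extract the key issues.
    issues = query_llm(
        "Identify the key concerns mentioned in this customer review:\n" + review
    )
    # Prompt 2: draft an outline that addresses those issues.
    outline = query_llm(
        "Using these issues, draft an outline for a professional response "
        "that acknowledges concerns, explains possible reasons, and offers "
        "a resolution:\n" + issues
    )
    # Prompt 3: write the full response from the outline.
    return query_llm(
        "Using this outline, write the full professional response:\n" + outline
    )
```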
[37:52] Super. [37:55] We'll talk more later about testing your prompts, but there are methods now to do it, and we'll see later in this lecture, with our case study, how we can test our prompts. [38:06] But here is an example of how you might do it. [38:11] You might have a summarization workflow prompt that is the baseline. [38:19] It's a single prompt. [38:21] You might have a refined summarization, which is a modified version of this prompt, or a workflow with a chain. [38:30] And then you have your test case, which is the input that you want to summarize, let's say. [38:36] And then you have the generated output. [38:38] And you can have humans go and rate these outputs. [38:42] And you would notice that the baseline is better or worse than the refined prompt. [38:47] Of course, this manual approach takes time, but it's a good way to start. [38:53] And usually, the advice is: get hands-on at the beginning, because you will quickly notice some issues, and it will give you better intuition on what tweaks can lead to better performance. [39:03] However, if you wanted to scale that system across many products, many parts of your code base, you might want to find a way to do it automatically, without asking humans to review and grade summaries. [39:14] One approach is to use platforms-- at Workera, our team uses a platform called promptfoo that allows you to actually automate part of this testing. [39:26] In a nutshell, what it does is it allows you to run the same prompt with five different LLMs immediately and put everything in a table. [39:37] That makes it super easy for a human to grade, let's say. [39:40] Or alternatively, it might allow you to define LLM judges. [39:46] LLM judges can come in different flavors. [39:50] For example, I can have an LLM judge that does a pairwise comparison. [39:54] So what the LLM is asked to do is: here are two summaries. [39:58] Just tell me which one is better than the other one. [40:01] That's what the LLM does. [40:02] And that can be used as a proxy for how good the summarization baseline versus the refined version is. [40:08] Another way to do an LLM judge is single-answer grading: here's a summary, grade it from 1 to 5. [40:18] And then you can go even deeper and do a reference-guided pairwise comparison. [40:24] Or you also add a rubric. [40:25] You say a 5 is when a summary is below 100 characters-- I'm just making this up-- below 100 characters, mentions at least three key points that are distinct, and starts with a first sentence that gives the overview and then goes into the detail. [40:40] That's a great summary, a 5 out of 5. [40:42] A 0 is when the LLM failed to summarize and was actually very verbose, let's say. [40:49] And so you put a rubric behind it, and you have an LLM judge following the rubric. [40:55] Of course, you can now pair different techniques. [40:57] You can do few-shot for the rubric. [40:58] You can actually give examples of 5-out-of-5s, 4-out-of-5s, 3-out-of-5s, because now you can combine multiple techniques. [41:06] Does that make sense? [41:11] Yeah. [41:11] OK.
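As a small illustration of the pairwise flavor, here is what an LLM judge might look like, again with the hypothetical `query_llm` helper; platforms like promptfoo wrap this kind of pattern with test suites, tables, and reporting.

```python
# A minimal LLM-as-judge doing pairwise comparison of two summaries.
JUDGE_PROMPT = """You are grading two summaries of the same source text.

Source text:
{source}

Summary A:
{summary_a}

Summary B:
{summary_b}

Which summary is better? Answer with exactly "A" or "B"."""

def pairwise_judge(source: str, summary_a: str, summary_b: str) -> str:
    verdict = query_llm(JUDGE_PROMPT.format(
        source=source, summary_a=summary_a, summary_b=summary_b))
    return verdict.strip()  # "A" (baseline) or "B" (refined prompt)
```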
[41:12] So that was the second section, on prompt engineering, or the first line of optimization. [41:19] Now, let's say you've exhausted all your chances for prompt engineering, and you're thinking about actually touching the model, modifying its weights-- or fine tuning it, in other words. [41:31] I was telling you, I'm not a fan of fine tuning. [41:34] There are a few reasons why. [41:37] One, it typically requires substantial labeled data to fine tune. [41:43] Although now, there are approaches that are getting better at fine tuning, that actually look more like few-shot prompting than fine tuning. [41:52] It's sort of merging, although one modifies the weights and the other doesn't. [41:57] Fine tuned models may also overfit to specific data-- we're going to see a funny example, actually-- losing their general-purpose utility. [42:06] So you might fine tune a model, and actually, when someone asks a pretty generic question, it doesn't do well anymore. [42:12] It might do well on your task. [42:14] So it might be relevant or not. [42:15] And then it's time- and cost-intensive. [42:17] That's my main problem. [42:19] And at Workera, we steer away from fine tuning as much as possible. [42:26] Because by the time you're done fine tuning your model, the next model is out, and it's actually beating your fine tuned version of the previous model. [42:33] So I would steer away from fine tuning as much as you can. [42:36] The advantage of the prompt engineering methods we've seen is you can put the next best pre-trained model directly in your code. [42:44] It will update everything immediately. [42:46] Fine tuning doesn't work like that. [42:50] There are advantages, though, where it still makes sense: if the task requires repeated high-precision outputs, such as legal or scientific explanations, and if the general-purpose LLM struggles with domain-specific language. [43:03] So let's look at a quick example together, which is an example from Ross Lazerowitz. [43:12] I think it was a couple of years ago, September 2023, where Ross tried to do Slack fine tuning. [43:22] So he looked at a lot of Slack messages within his company. [43:26] And he was like, I'm going to fine tune a model that speaks like us or operates like us, because this is how we work. [43:33] This is the data that represents how people work at the company. [43:37] And so he actually went ahead and fine tuned the model, and gave it a prompt, like, hey, write-- he was delegating to the model-- a 500-word blog post on prompt engineering. [43:50] And the model responded, I shall work on that in the morning. [43:55] And then he tried to push the model a little further and said, it's morning now. [44:01] And the model said, I'm writing right now. [44:04] It's 6:30 AM here. [44:06] Write it now. [44:10] OK, I shall write it now. [44:12] I actually don't know what you would like me to say about prompt engineering. [44:15] I can only describe the process. [44:17] The only thing that comes to mind for a headline is: how do we build prompts? [44:21] It's kind of a funny example for fine tuning, because it truly went wrong. [44:27] The idea was, I want the model to speak like us at work.
[44:32] And it ended up acting like the people and not actually following instructions. [44:40] So that's one example of why I would steer away from fine tuning. [44:47] Super. [44:51] Let's talk about RAGs. [44:54] RAG is important. [44:55] It's important to be out there with at least the basics. [44:58] It's a very common interview question, by the way. [45:00] If you go interview for a job, they might ask you to explain, in a nutshell, to a five-year-old, what a RAG is. [45:06] And hopefully after this, you'll be able to do it. [45:09] So we've seen some of the challenges with standalone LLMs. [45:14] Those challenges include the context window being small; the fact that it's hard to remember details within a large context window; knowledge gaps and cutoff dates, which you mentioned earlier. [45:28] The model might be trained up to a date, and then it cannot follow the trends or be up to date. [45:33] Hallucinations. [45:34] There are some fields-- think about medical diagnosis-- where hallucinations are very costly. [45:39] You can't afford a hallucination. [45:41] Even in education, imagine deploying a model for US youth education, and it hallucinates, and it teaches millions of people something completely wrong. [45:50] It's a problem. [45:52] And then, lack of sources. [45:54] A lot of fields love sources. [45:57] Research fields love sources. [45:59] Education loves sources. [46:01] Legal loves sources as well. [46:04] And the pre-trained LLM doesn't do a good job of sourcing. [46:08] And in fact, if you have tried to find sources with a plain LLM, it actually hallucinates a lot. [46:15] It makes up research papers. [46:16] It just lists completely fake stuff. [46:20] So how do we solve that with a RAG? [46:23] RAG integrates with external knowledge sources: databases, documents, APIs. [46:31] It ensures that answers are more accurate, up to date, and grounded, because you can actually update your documents. [46:38] Your drive is always up to date. [46:40] I mean, ideally, you're always pushing new documents to it. [46:43] And when you query, what is our Q4 performance in sales? [46:47] Hopefully the last board deck is in the drive, and it can read the last board deck. [46:54] And more developer control. [46:56] We'll see why RAGs allow for targeted customization without actually requiring the retraining of the model. [47:02] In fact, you don't touch the model with RAGs. [47:05] It's really a technique that is put on top of the model. [47:08] So to see an example of a RAG: this is a question answering application where we're in the medical field, and a user is asking a query, what are the side effects of drug X? [47:26] This is an important question. [47:27] You can't hallucinate. [47:28] You need to source. [47:29] You need to be up to date. [47:31] Maybe there is a new update to that drug that is now in the database, and you need to read that. [47:37] So a RAG is a great example of what you would want to use here. [47:41] The way it works is: you have your knowledge base of a bunch of documents. [47:46] What you do is you use an embedding model to embed those documents into lower-dimensional representations. [47:54] So for example, if the document is a PDF, a long PDF, you might read the PDF, understand it, and then embed it.
[48:03] We've seen plenty of embedding approaches together-- triplet loss, et cetera, you remember? [48:09] So imagine one of them here, for LLMs, embedding those documents into a lower-dimensional representation. [48:15] If the representation is too small, you will lose information. [48:19] If it's too big, you will add latency. [48:22] It's a tradeoff. [48:25] You will typically store those representations in a database called a vector database. [48:31] There are a lot of vector database providers out there. [48:38] I think I've listed a couple that are very common-- no, I haven't listed them, but I can share afterwards. [48:44] A vector database essentially stores those vectors in a very efficient manner, allowing fast retrieval with a certain distance metric. [48:52] So what you do is you also embed, usually with the same algorithm, the user prompt. [49:00] And you run a retrieval process, which is essentially saying: based on the embedding from the user query and the vector database, find the relevant documents based on the distance between those embeddings. [49:15] Once you've found the relevant documents, you pull them, and then you add them to the user query with a system prompt or a prompt template on top. [49:24] So the prompt template can be: answer the user query based on this list of documents. [49:32] If the answer is not in the documents, say I don't know. [49:36] That's your prompt template, where the user query is pasted, the documents are pasted, and then your output should be what you want, because it's now grounded in the documents. [49:47] You can also add to this prompt template: tell me the exact page, chapter, and line of the document that was relevant, and in fact, link it as well, just to be more precise. [50:02] Any questions on RAGs? [50:03] This is a simple, vanilla RAG. [50:07] Yes. [50:09] Do document embeddings still retain information [INAUDIBLE] [50:15] The question is: do the document embeddings still retain the location of the information within the document, especially in big documents? [50:24] Great question. [50:26] We'll get to it in a second. [50:27] Because you're right that the vanilla RAG might not do a good job with very large documents. [50:32] So let's say, when you open a medication box and you have this gigantic white paper with all the information, and it's very long, maybe a vanilla RAG would not cut it. [50:45] So what people have figured out is a bunch of techniques to improve RAGs. [50:49] And in fact, chunking is a great technique that is very popular. [50:53] So you might actually store in the vector database the embedding of the full document. [50:57] And on top of that, you will also store chapter-level vectors. [51:02] And when you retrieve, you will retrieve the document, and you retrieve the chapter. [51:06] And that allows you to be more precise with the sourcing. [51:09] It's one example. [51:11] Another technique that's popular is HyDE, hypothetical document embeddings, where a group of researchers published a paper showing that when you get your user query, one of the main problems is that the user query actually does not look like your documents. [51:32] For example, the user query might be, what are the side effects of drug X, when actually, in the vector database, the vectors represent very long documents. [51:43] So how do you guarantee that the query embedding is going to be close to the document embedding? [51:47] What they do is they use the user query to generate a fake, hallucinated document. [51:53] They embed that document, and then they compare it to the vectors in the vector database. [52:01] Does that make sense? [52:02] So for example, the user says, what are the side effects of drug X? [52:06] That query is given to another prompt that says: based on this user query, generate a five-page report answering it. [52:15] It generates a potentially completely fake answer. [52:20] You embed that, and it will likely be closer to the document that you're looking for. [52:28] It's one example of a RAG approach.
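Here is a minimal sketch pulling together the vanilla RAG flow and the HyDE variant just described. `embed`, `vector_search`, and `query_llm` are hypothetical helpers standing in for your embedding model, vector database client, and LLM API.

```python
RAG_TEMPLATE = """Answer the user query based on the documents below.
If the answer is not in the documents, say "I don't know."
Cite the exact document and chapter that you used.

Documents:
{documents}

User query: {query}"""

def rag_answer(query: str, use_hyde: bool = False) -> str:
    if use_hyde:
        # HyDE: generate a (possibly fake) document from the query and embed
        # that instead, so the query looks more like the stored documents.
        hypothetical = query_llm(
            "Based on this user query, generate a five-page report "
            "answering it:\n" + query
        )
        query_vector = embed(hypothetical)
    else:
        query_vector = embed(query)  # same embedding model as the documents

    # Retrieve the nearest chunks (e.g. chapter-level vectors) by distance.
    chunks = vector_search(query_vector, top_k=3)
    return query_llm(RAG_TEMPLATE.format(documents="\n\n".join(chunks), query=query))

# rag_answer("What are the side effects of drug X?", use_hyde=True)
```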
[52:31] Again, the purpose of this lecture is not to go through all three of these branches and explain every single method that has been discovered for RAGs. [52:38] But I just wanted to show you how much research has been done between 2020 and 2025 in RAGs, and how many branches of research you now have that you can learn from. [52:50] The survey paper is linked in the slides, by the way, and I'll share them after the lecture. [53:01] Super. [53:05] So we've made some progress. [53:08] Hopefully now, you feel that if you were to start an LLM application, you know how to do better prompts. [53:14] You know how to do chains. [53:15] You know how to do fine tuning. [53:17] You also know how to do retrieval. [53:19] And you have the baggage of techniques that you can go and read, and find the code base, pull the code, vibe code it. [53:24] But you have the breadth now. [53:30] The next set of topics we're going to see is around the question of: how could we extend the capabilities of LLMs from performing single tasks, enhanced with external knowledge, to handling multi-step, autonomous workflows? [53:47] And this is where we get into proper agentic AI. [53:53] So let's talk about agentic AI workflows, towards autonomous and specialized systems. [54:00] Then we'll talk about evals. [54:01] Then we'll see multi-agent systems. [54:03] And we'll end with a few thoughts on what's next in AI. [54:11] So Andrew Ng actually coined the term agentic AI workflows. [54:20] And his reason was that a lot of companies say agents-- agents, agents everywhere, agents everywhere. [54:28] If you go and work at these companies, you would notice that they mean very different things by agents. [54:33] Some people actually have a prompt, and they call it an agent. [54:36] Other people have a very complex multi-agent system, and they call it an agent. [54:42] And so calling everything an agent doesn't do it justice. [54:45] So Andrew says, let's call it agentic workflows. [54:49] Because in practice, it's a bunch of prompts with tools, with additional resources, API calls, that ultimately are put in a workflow, and you can call that workflow agentic. [55:02] So it's all about the multi-step process to complete a task.
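To make "a bunch of prompts with tools put in a workflow" concrete, here is a minimal sketch, previewing the refund example discussed next. `rag_answer` is the retrieval sketch from earlier; `query_llm`, `ask_user`, and `orders_api` are hypothetical helpers.

```python
# An agentic workflow: prompts plus tools (RAG lookup, user follow-up,
# API call) chained into one multi-step process.
def refund_workflow(user_message: str) -> str:
    # Step 1: retrieve the refund policy (RAG as a tool).
    policy = rag_answer("What is the refund policy?")
    # Step 2: follow up with the user for missing information.
    order_id = ask_user("Can you provide your order number?")
    # Step 3: call an API tool to check the order details.
    order = orders_api.get_order(order_id)
    # Step 4: let the LLM decide and draft the final confirmation.
    return query_llm(
        f"Refund policy: {policy}\n"
        f"Order details: {order}\n"
        f"User request: {user_message}\n"
        "Decide whether the order qualifies for a refund and reply to the user."
    )
```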
[55:11] Also, calling it an agentic workflow allows us to not mix it up with what I called an agent, in the last lecture, with reinforcement learning. [55:19] Because in RL, an agent has a very specific definition: it interacts with an environment, passes from one state to the next, has a reward and an observation. [55:26] You remember that chart, right? [55:32] So here's an example of how we move from a one-step prompt to a multi-step agentic workflow. [55:39] Let's say a user queries a product chatbot: what is your refund policy? [55:48] And the response, using a RAG, says refunds are available within 30 days of purchase, and maybe the RAG can even link to the policy document. [55:57] That's what we learned so far. [55:59] Instead, an agentic workflow can function like this. [56:04] The user says, can I get a refund for my order? [56:07] And the response via the agentic workflow is: the agent retrieves the refund policy using a RAG. [56:14] The agent then follows up with the user and says, can you provide your order number? [56:19] Then the agent queries an API to check the order details. [56:23] And finally, it comes back to the user and confirms: your order qualifies for a refund. [56:28] The amount will be processed in three to five business days. [56:31] This is much more thoughtful than the first version, which is sort of vanilla. [56:37] So that's what we're going to talk about in the next couple of slides: how do we get from the first one to the second one? [56:46] There are plenty of specialized agentic workflows out there. [56:50] You've heard of them, and if you hang out in SF, you probably see a bunch of billboards: AI software engineer; AI skills mentor, which you've interacted with in the class through Workera; AI SDR; AI lawyers; AI specialized cloud engineer. [57:08] It would be a stretch to say that everything works, but there's work being done towards that. [57:17] I'm not personally a fan of putting a face behind those things. [57:20] I think it's gimmicky. [57:21] And I think a few years from now, actually, very few products will have a human face behind them, but it might be a marketing tactic from some startups. [57:32] It's more scary than it is engaging, frankly. [57:35] OK. [57:36] I want to talk about the paradigm shift. [57:38] That's especially useful-- let's say you're a software engineer, or you're planning to be a software engineer-- because software engineering as a discipline is sort of shifting. [57:47] Or at least, the best engineers I've worked with are able to move from a deterministic mindset to a fuzzy mindset and balance between the two whenever they need to get something done. [57:58] So here's the paradigm shift between traditional software and agentic AI software. [58:04] The first one is the way you handle data. [58:07] Traditional software deals with structured data. [58:10] You have JSONs. [58:11] You have databases. [58:12] They're passed around in a very structured manner in a data engineering pipeline, and then they would be displayed on a certain interface. [58:21] The user might fill out a form that is then retrieved and written to the database. [58:25] All of that, historically, has been structured data.
[56:46] There are plenty of specialized agentic workflows online. [56:50] If you hang out in SF, [56:52] you've probably seen a bunch of billboards: AI software [56:55] engineer, AI skills mentor, which you've [56:57] interacted with in this class through Workera, [56:59] AI SDR, AI lawyer, AI specialized cloud engineer. [57:08] It would be a stretch to say that everything works, [57:10] but there's work being done towards all of that. [57:17] I'm not personally a fan of putting [57:19] a face behind those things. [57:20] I think it's gimmicky. [57:21] And I think a few years from now, actually, [57:24] very few products will have a human face on them, [57:27] but it might be a marketing tactic for some startups. [57:32] It's more scary than it is engaging, frankly. [57:35] OK. [57:36] I want to talk about the paradigm shift. [57:38] That's especially useful [57:40] if you're a software engineer [57:41] or you're planning to be a software engineer, [57:43] because software engineering as a discipline [57:45] is shifting. [57:47] Or at least the best engineers I've [57:49] worked with are able to move from a deterministic mindset [57:53] to a fuzzy mindset and balance between the two [57:57] whenever they need to get something done. [57:58] So here's the paradigm shift between traditional software [58:01] and agentic AI software. [58:04] The first one is the way you handle data. [58:07] Traditional software deals with structured data. [58:10] You have JSONs. [58:11] You have databases. [58:12] Data is passed around in a very structured manner [58:15] through a data engineering pipeline [58:17] and then displayed [58:19] on a certain interface. [58:21] The user might fill a form that is then stored [58:24] in the database. [58:25] All of that historically has been structured data. [58:28] Now, more and more companies are handling free-form text and images, [58:34] and all of that requires dynamic interpretation to transform [58:39] an input into an output. [58:41] The software itself used to be deterministic. [58:45] Now you have a lot of software that is fuzzy. [58:47] And fuzzy software creates so many issues. [58:51] I mean, imagine if you let your users ask anything [58:54] on your website. [58:56] The chances that it breaks are tremendous. [58:58] The chances that you're attacked are tremendous. [59:00] It's really, really complicated. [59:03] It's more complicated than people make it seem on Twitter. [59:07] Fuzzy engineering is truly hard. [59:09] You might get hate as a company because one user did something [59:14] that you authorized them to do that ended up breaking [59:16] the database. [59:18] We've seen that with many companies [59:19] in the last couple of years. [59:21] So it takes a very specialized engineering mindset [59:23] to do fuzzy engineering, but also to [59:25] know when you need to be deterministic. [59:29] The other thing I'd call out is that with agentic AI software, [59:33] you want to think about your software as a manager would. [59:39] So you're familiar with the monolith and microservices [59:44] approaches in software, where you structure your software [59:48] in different boxes that can talk to each other, [59:51] and it allows teams to debug one section at a time. [59:55] Now, the equivalent with agentic AI is that you think as a manager. [59:59] So you think, OK, if I were to delegate my product [01:00:02] to be done by a group of humans, what would those roles be? [01:00:06] Would I have a graphic designer that puts together a chart [01:00:09] and then sends it to a marketing manager that converts it [01:00:12] into a nice blog post, that then gives it to the performance [01:00:15] marketing expert, that then publishes the blog [01:00:18] post and optimizes and A/B tests it, [01:00:20] and then hands off to a data scientist that analyzes the data, [01:00:23] forms hypotheses, and validates [01:00:25] or invalidates them? [01:00:27] That's how you would typically think if you're building [01:00:29] agentic AI software. [01:00:32] The equivalent in traditional software [01:00:35] might be completely different. [01:00:37] It might be: we have a data engineering box [01:00:39] right here that handles all our data engineering. [01:00:42] And then here, we have the UI/UX stuff. [01:00:45] Everything UI/UX related goes here. [01:00:47] And companies might structure it in very different ways. [01:00:51] And here is the business logic that we care about, [01:00:53] and there are five engineers working on the business logic, [01:00:56] let's say. [01:00:59] OK. [01:01:01] Testing and debugging are also very different, [01:01:04] and we'll talk about that in the next section. [01:01:09] The other thing that I feel matters [01:01:13] is that with AI in engineering, the cost of experimentation [01:01:17] is going down drastically. [01:01:19] And so people, I feel, should be more comfortable [01:01:22] throwing away code. [01:01:23] In traditional software engineering, [01:01:27] you probably don't throw away code a ton. [01:01:29] You build the code, and it's solid, and it's bulletproof, [01:01:32] and then you update it over time.
[01:01:35] We've seen AI companies be more comfortable throwing away [01:01:39] code, which has advantages in terms of the speed at which you [01:01:43] move but also disadvantages in terms [01:01:46] of the quality of your software, which can break more. [01:01:52] So anyway, I just wanted to give an update on the paradigm shift [01:01:56] from deterministic to fuzzy engineering. [01:02:04] Oh, and actually, I can give you an example from Workera [01:02:08] that we learned over the last 12 [01:02:11] months. If you've used Workera, [01:02:13] you might have seen that the interface sometimes asks you [01:02:18] multiple-choice questions. [01:02:19] And sometimes, it asks you multiple-select, [01:02:21] and sometimes drag and drop, ordering, matching, [01:02:24] whatever. [01:02:25] Those are examples of deterministic item types, [01:02:28] meaning you answer the question on a multiple choice, [01:02:31] there is one correct answer, [01:02:32] and it's fully deterministic. [01:02:34] On the other hand, you sometimes have voice questions, [01:02:38] where you go through a role play, or you [01:02:40] have voice-plus-coding questions, [01:02:42] where your code is being read by the interface, or whatever. [01:02:45] Those are fuzzy, meaning the scoring algorithm [01:02:49] might actually make mistakes, and those mistakes [01:02:52] might be costly. [01:02:53] And so companies have to figure out [01:02:56] a human-in-the-loop system, which [01:02:58] you might have seen with the appeal feature at the end. [01:03:00] So at the end of the assessment, you have an appeal feature that [01:03:03] allows you to say, I want to appeal [01:03:06] because I want to challenge what the agent said about my answer, [01:03:09] because I thought I did better than what the agent thought. [01:03:12] And then you bring in the human in the loop, who [01:03:14] can fix the agent, can tell the agent, actually, [01:03:16] you were too harsh on this person's answer. [01:03:20] And that's an example of a fuzzy engineered system [01:03:24] that adds a human in the loop to make it more aligned. [01:03:28] And so if you're building a company, [01:03:29] I would encourage you to think about, what can I [01:03:32] get done with determinism? [01:03:33] Let's get that done. [01:03:35] And then for the fuzzy stuff: I want to do fuzzy [01:03:38] because it allows more interaction, [01:03:39] more back and forth, but I need [01:03:42] to put guardrails around it. [01:03:43] And how am I going to design those guardrails? [01:03:45] Pretty much. [01:03:46] OK?
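Here's a minimal sketch of that pattern: a fuzzy grader wrapped in a guardrail, with low-confidence scores and appeals routed to a human. The threshold and the `fuzzy_grader` callable (returning a score and a confidence) are illustrative assumptions, not Workera's actual design:

```python
def grade_with_guardrail(answer, fuzzy_grader, threshold=0.9):
    # fuzzy_grader is assumed to return (score, confidence in [0, 1]).
    score, confidence = fuzzy_grader(answer)
    if confidence >= threshold:
        return {"score": score, "status": "auto"}             # deterministic enough
    return {"score": score, "status": "needs_human_review"}   # route to a human

def handle_appeal(result, human_score):
    # The human in the loop overrides the agent; keeping both scores lets you
    # recalibrate the grader on exactly the cases where it was too harsh.
    return {**result, "agent_score": result["score"],
            "score": human_score, "status": "human_reviewed"}
```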
[01:03:49] Here's another example, from enterprise workflows, [01:03:54] which are likely to change due to agentic AI. [01:03:57] This is a paper from McKinsey, I believe from last year, [01:04:01] where they looked at a financial institution and said, [01:04:05] we observed that they often spend one to four weeks [01:04:07] creating a credit risk memo. [01:04:10] And here's the process. [01:04:11] A relationship manager gathers data from [01:04:16] more than 15 sources on the borrower, [01:04:19] the loan type, and other factors. [01:04:22] Then the relationship manager and the credit analyst [01:04:25] collaboratively analyze the data from these sources. [01:04:28] Then the credit analyst typically spends 20 hours [01:04:33] or more writing a memo and goes back [01:04:36] to the relationship manager. [01:04:37] They give feedback, and then they go through this loop [01:04:40] again and again. [01:04:41] And it takes a long time to get a credit memo out. [01:04:46] Then they ran a research study where they changed the process. [01:04:50] They said gen AI agents could actually cut the time spent [01:04:56] on credit risk memos by 20% to 60%. [01:04:58] And the process changed: the relationship manager [01:05:01] works directly with the gen AI agent system and [01:05:03] provides the relevant materials it needs to produce the memo. [01:05:07] The agent subdivides the project into tasks [01:05:10] that are assigned to specialist agents, [01:05:12] gathers and analyzes the data from multiple sources, [01:05:15] and drafts a memo. [01:05:16] Then the relationship manager and the credit analyst [01:05:19] sit down together, review the memo, [01:05:20] and give feedback to the agent. [01:05:22] And they're done in 20% to 60% less time. [01:05:26] And so this is an example where you're actually not changing [01:05:30] the human stakeholders. [01:05:31] You're just changing the process and adding [01:05:33] gen AI to reduce the time it takes to get a credit memo out. [01:05:38] Now, imagine you're an enterprise [01:05:42] with 100,000 employees, and there are a lot of enterprises [01:05:47] with 100,000 employees out there. [01:05:50] You are currently under real pressure [01:05:52] to redesign your workflows. [01:05:55] It turns out that if you actually [01:05:57] pull the job descriptions from the HR system [01:06:00] and interpret them, and you also pull [01:06:02] the business process workflows that you [01:06:04] have encoded in your drive, [01:06:07] you can actually find gains in multiple places. [01:06:10] And in the next few years, you're [01:06:12] probably going to see workflows being [01:06:14] optimized to add gen AI. [01:06:17] Even if that happens, the hardest part is changing people. [01:06:20] We know this is great in theory, but now [01:06:23] let's try to roll out that second workflow to 10,000 credit [01:06:28] risk analysts and relationship managers. [01:06:31] My guess is it will take years. [01:06:33] It will take 10, 20 years for this to actually be done [01:06:37] at scale within an organization. [01:06:40] Because change is so hard. [01:06:42] It's so hard to rewire businesses, workflows, job descriptions, [01:06:47] to incentivize people to do things differently and be different, [01:06:50] and to train them. [01:06:50] So this is what the world is going towards, [01:06:55] but it's going to take a long time, I think. [01:06:59] OK. [01:07:00] Now I want to talk about how an agent actually works [01:07:02] and what the core components of an agent are. [01:07:07] Imagine a travel booking agent. That's [01:07:10] an easy example you've all thought about. [01:07:12] I still haven't been able to get an agent to book a trip for me, [01:07:16] or rather I was scared it was going to book [01:07:18] a very expensive or long trip. [01:07:20] But in theory, you can have a travel booking [01:07:24] agent that has prompts. [01:07:26] So, the prompts we've seen; we know the methods [01:07:28] to optimize those prompts. [01:07:30] That travel agent also has a context management system, [01:07:34] which is essentially the memory of what it knows about the user. [01:07:38] That context management system might [01:07:40] include a core memory, or working memory, and an archival memory, [01:07:45] OK? [01:07:46] The difference within memory [01:07:51] is that not every memory needs to be fast to access. [01:07:54] Think about it.
[01:07:56] You're onboarded on a product, and the first question is, hi, [01:07:59] what's your name? [01:08:00] And I say, my name is Kian. [01:08:02] That's probably going to sit in the working memory, [01:08:05] because every time the agent talks to me, [01:08:07] it's going to want to use my name. [01:08:08] But then maybe the second question [01:08:10] is, what's your birthday? [01:08:12] And I give it my birthday. [01:08:13] Does it need my birthday every day? [01:08:15] Probably not. [01:08:16] So it's probably going to park it in the long-term [01:08:18] memory, or the archival memory. [01:08:20] And those memories are slower to access. [01:08:24] They're farther down the stack. [01:08:26] And that structure allows the agent [01:08:28] to determine what goes in the working memory [01:08:30] and what goes in the long-term memory. [01:08:33] And that makes it easier for the agent to retrieve super fast. [01:08:36] Because think about it. [01:08:37] When you interact with ChatGPT, you [01:08:39] feel that it's very personal at times. [01:08:41] You feel like it understands you. [01:08:43] Imagine if every time you called it, it had to read all the memories. [01:08:47] That can be costly. [01:08:48] It's a very burdensome cost because it happens [01:08:52] every time you talk to it. [01:08:54] So you want to be highly optimized with the working [01:08:57] memory. [01:08:59] If it takes three seconds to look [01:09:00] in the memory, then every time you talk to your LLM, [01:09:03] it's going to take three seconds, which you don't want. [01:09:06] Anyway. [01:09:06] And then you have the tools. [01:09:08] The tools can include APIs like a flight search [01:09:11] API, a hotel booking API, a car rental API, a weather API, [01:09:15] and a payment processing API. [01:09:18] And typically, you would want to tell your agent [01:09:21] how each API works. [01:09:23] It turns out that agents, or LLMs, I should say, [01:09:27] are very good at reading API documentation. [01:09:29] So you give it the API documentation, [01:09:31] and it reads the JSON, it learns [01:09:33] what a GET request looks like [01:09:35] and the format it needs to push. [01:09:38] And then it pushes in that format, let's say, [01:09:41] and it retrieves something. [01:09:45] Does that make sense, those different components? [01:09:49] Anthropic also talks about resources. [01:09:51] Resources are data sitting somewhere that you [01:09:55] might let your agent read. [01:09:57] For example, if you're building your startup, you have a CRM. [01:10:00] A CRM has data in it, and you want to do lookups in that data. [01:10:05] You would probably give the agent a lookup tool [01:10:07] and access to the resource, [01:10:10] and it will do lookups whenever needed, super fast.
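Going back to the memory tiers for a second, here's a minimal sketch of the idea: a small working memory injected into every prompt, and a slower archival store queried only on demand. The hot/cold policy here is an illustrative assumption, not a production design:

```python
class AgentMemory:
    def __init__(self, working_slots=5):
        self.working = {}   # hot facts, prepended to every model call
        self.archive = {}   # cold facts, fetched only when the agent asks
        self.working_slots = working_slots

    def remember(self, key, value, hot=False):
        if hot and len(self.working) < self.working_slots:
            self.working[key] = value   # e.g. the user's name
        else:
            self.archive[key] = value   # e.g. the user's birthday

    def prompt_context(self):
        # Cheap path: this string rides along with every single prompt.
        return "\n".join(f"{k}: {v}" for k, v in self.working.items())

    def recall(self, key):
        # Slow path: explicit lookup, used only when a fact is actually needed.
        return self.working.get(key, self.archive.get(key))

mem = AgentMemory()
mem.remember("name", "Kian", hot=True)         # working memory
mem.remember("birthday", "June 1", hot=False)  # archival memory
```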
[01:10:16] This type of architecture can be built [01:10:19] with different degrees of autonomy, [01:10:21] from the least autonomous to the most autonomous, [01:10:23] and I'll give you a few examples. [01:10:26] Least autonomous would be: you've hard-coded the steps. [01:10:29] So let's say I tell the travel agent, first, identify the intent. [01:10:35] Then look up in the database the history [01:10:39] this customer has with us and their preferences. [01:10:42] Then go to the flight API, and so on. [01:10:45] I would hard-code the steps. [01:10:47] OK. [01:10:48] That's the least autonomous. [01:10:50] Semi-autonomous is: I might hard-code the tools, [01:10:54] but I'm not going to hard-code the steps. [01:10:57] So I tell the agent, you act like a travel agent, [01:11:02] and your task is to help the person book a trip. [01:11:10] And these are the tools you have access to. [01:11:13] So I'm not hard-coding the steps, [01:11:14] just the tools that the agent has access [01:11:17] to. [01:11:18] The most autonomous is: the agent decides the steps [01:11:22] and can create its own tools. [01:11:24] That's where you might actually give the agent access [01:11:26] to a code editor. [01:11:28] And the agent might be able to ping any API on the web, [01:11:33] perform some web search. [01:11:34] It might even be able to write some code [01:11:37] to display data to the user. [01:11:39] It might even be able to perform some calculations. [01:11:42] Like, oh, I'm going to calculate the fastest route [01:11:44] to get from San Francisco to New York [01:11:48] and which one might be the most appropriate [01:11:50] for what the user is looking for. [01:11:52] And then I want to calculate the distance between the airport [01:11:54] and this hotel versus that hotel. [01:11:56] And I'm going to write code to do that. [01:11:58] So it's fully autonomous [01:12:00] from that perspective. [01:12:05] So, yeah. [01:12:07] Remember those keywords: [01:12:08] memory, prompts, tools, et cetera. [01:12:14] Now, I presented the flight API, but it does not [01:12:18] have to be an API. [01:12:19] You've probably heard the term MCP, or Model Context Protocol, [01:12:23] which was coined by Anthropic. [01:12:25] I pasted the seminal article on MCP at the bottom of this slide. [01:12:29] But let me explain in a nutshell how those things differ. [01:12:34] In the API case, you would actually [01:12:39] teach your LLM to ping an API. [01:12:42] So you would say, this is how you ping this API, [01:12:45] and this is the data that it will send you back. [01:12:48] And you would have to do that in a one-off manner. [01:12:51] So you would have to provide [01:12:53] the API documentation for your flight API, [01:12:56] your hotel booking API, your car rental API. [01:13:00] And then you would give tools for your model [01:13:03] to communicate with those APIs. [01:13:06] It doesn't scale very well. MCP does. [01:13:11] MCP is really about putting a system in the middle that [01:13:19] makes it simpler for your LLM to communicate [01:13:22] with that endpoint. [01:13:23] So for instance, you might have an MCP server and an MCP client, [01:13:28] where you're trying to communicate [01:13:30] with that travel database or the flight API over MCP. [01:13:35] And your agent might just communicate with it [01:13:38] and say, hey, what do you need in order to give me flight [01:13:42] information? [01:13:43] And the server will respond, I would like you to tell me [01:13:47] the origin, the destination, [01:13:49] and what you're looking for at a high level. [01:13:51] These are my requirements. [01:13:52] OK, [01:13:52] let me get back to you with those requirements. [01:13:55] Oh, [01:13:55] you forgot to tell me your budget, whatever. [01:13:57] Oh, [01:13:58] let me give you my budget, et cetera. [01:14:00] And it's agent-to-agent communication, [01:14:04] which allows more scalability. [01:14:06] You don't need to hard-code everything.
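Here's a toy illustration of that back-and-forth. To be clear, this is not the actual MCP protocol; it only mimics the idea that the client discovers what the server needs, attempts the call, and fills in whatever is missing. All class and field names are made up for the example:

```python
class FlightMCPServer:
    required = ["origin", "destination", "budget"]

    def describe(self):
        # The server advertises its capability and what it needs.
        return {"capability": "flight_search", "required": self.required}

    def call(self, params):
        missing = [f for f in self.required if f not in params]
        if missing:
            return {"error": f"missing fields: {missing}"}  # server pushes back
        return {"flights": [f"{params['origin']} -> {params['destination']}"]}

def agent_side(server, known_facts):
    spec = server.describe()                          # 1. discover requirements
    params = {f: known_facts[f] for f in spec["required"] if f in known_facts}
    result = server.call(params)                      # 2. attempt the call
    if "error" in result:                             # 3. go back for the gaps
        for field in spec["required"]:
            if field not in params:
                params[field] = input(f"What is your {field}? ")
        result = server.call(params)
    return result

print(agent_side(FlightMCPServer(), {"origin": "SFO", "destination": "CDG"}))
```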
[01:14:09] Companies have published their MCP servers out there, [01:14:11] and your agent can communicate with them [01:14:14] and figure out how to get the data it needs. [01:14:16] Does that make sense? [01:14:18] Yeah. [01:14:21] [INAUDIBLE] rewriting any [INAUDIBLE] [01:14:36] I think it is, ultimately. [01:14:39] The question is, isn't it just shifting the issue? [01:14:41] Because if an API has to be updated, [01:14:43] the MCP server has to be updated too, is what you're saying, right? [01:14:45] Yes, that's correct. [01:14:46] But at least it allows the agent to go back and forth [01:14:51] and figure out what the requirements are. [01:14:52] But at the end of the day, ideally, if you're a startup, [01:14:56] you have some documentation, [01:14:57] and automatically, you have an agent or an LLM workflow [01:15:00] that reads that documentation and updates the code [01:15:03] accordingly. [01:15:04] But I agree. [01:15:05] It's not something that is fully autonomous. [01:15:08] Yeah. [01:15:09] I've seen some security issues. [01:15:12] Why is that possible? [01:15:14] Which security issues specifically? [01:15:16] [INAUDIBLE] [01:15:18] Yeah. [01:15:19] So, are there security issues with MCPs? [01:15:23] Think about it this way. [01:15:25] MCPs, depending on the data that you get access to, [01:15:28] might have different requirements, lower stakes [01:15:30] or higher stakes. [01:15:31] I'm not an expert on the full range. [01:15:34] But it wouldn't surprise me that when you expose an MCP-- [01:15:42] I think a lot of MCPs have authentication. [01:15:45] So you might actually need a code or a key [01:15:47] to talk to it, just like you would with an API. [01:15:52] Yeah, that's a good question. [01:15:53] I'm not an expert on the security of these systems, [01:15:56] but we can look into it. [01:16:02] Any other questions on what we've [01:16:04] seen with the agentic workflows, APIs, tools, MCPs, memory? [01:16:10] All of that is work in progress. [01:16:11] Even memory is not a solved problem by any means. [01:16:14] It's pretty hard, actually. [01:16:16] Yes. [01:16:18] You don't need an [INAUDIBLE] The MCP just [01:16:24] makes it easier to access the API, but technically, [01:16:28] [INAUDIBLE] [01:16:40] Exactly, exactly. [01:16:42] Is MCP about efficiency or about accessing more data? [01:16:45] It's about efficiency. [01:16:47] Let's say you have a coding agent, and it has an MCP client, [01:16:53] and there are multiple MCP servers exposed out there. [01:16:57] That agent can communicate very efficiently with them [01:17:00] and find what it needs. [01:17:03] And it's a more efficient process [01:17:05] than documenting the APIs on each side, [01:17:09] how to ping them, and what the protocol is. [01:17:12] But it's not about the data that is [01:17:13] being exposed, because ultimately, you control [01:17:15] the data that is being exposed. [01:17:19] Depending on how the MCP is built, [01:17:22] my guess is you probably expose yourself to other risks, [01:17:24] because your MCP server can see pretty much any input [01:17:31] from another LLM. [01:17:32] And so it has to be robust. [01:17:36] But yeah. [01:17:37] Super. [01:17:39] So let's look at an example of a step-by-step [01:17:41] workflow for the travel agent. [01:17:45] Let's say the user says, I want to plan a trip to Paris [01:17:50] from December 15th to 20th, with flights, [01:17:56] hotels near the Eiffel Tower, and an itinerary of must-visit [01:18:00] places.
[01:18:01] That's the task for the travel agent. [01:18:04] Step two, the agent plans the steps. [01:18:06] So it says, I'm going to find flights. [01:18:08] Use the flight search API to get options for December 15th. [01:18:12] Search hotels, generate recommendations for places [01:18:15] to visit, validate preferences, budget, et cetera. [01:18:20] Book the trip with the payment processing API. [01:18:24] That's just the planning, by the way. [01:18:25] Step three, execute the plan: use your tools [01:18:28] and combine the results. Then comes proactive [01:18:31] user interaction and booking. [01:18:33] It might make a first proposal to the user, [01:18:35] ask the user to validate or invalidate it, [01:18:38] and then repeat that planning and execution process. [01:18:42] And then finally, it might actually update its memory. [01:18:46] It might say, oh, I just learned through this interaction [01:18:49] that the user only likes direct flights. [01:18:51] Next time, I'll only offer direct flights. [01:18:55] Or, I noticed the user is fine with three-star or four-star [01:19:01] hotels, [01:19:01] and in fact, they don't want to go over budget, or something [01:19:05] like that. [01:19:08] So hopefully that makes sense by now, how you might do that. [01:19:11] My question for you is, how would you know if this works? [01:19:16] And if you had such a system running in production, how [01:19:19] would you improve it? [01:19:28] Yeah. [01:19:28] Let users rate their experience. [01:19:31] So that's one example. [01:19:33] Let users rate their experience at the end. [01:19:37] That would be an end-to-end test, right? [01:19:39] You're looking at the user experience through the steps [01:19:42] and asking, how good was it from 1 to 5, let's say. [01:19:46] Yeah. [01:19:46] It's a good way. [01:19:47] And then if you learn that a user says 1, [01:19:50] how do you improve the workflow? [01:19:56] [INAUDIBLE] [01:19:59] OK. [01:19:59] So you would go down a tree and say, OK, you said 1. [01:20:04] What was your issue? [01:20:06] And then the user says the prices were too high, let's say. [01:20:10] And then you would go back and fix that specific tool or prompt. [01:20:14] OK. [01:20:15] Any other ideas? [01:20:18] [INAUDIBLE] [01:20:29] Yeah, good. [01:20:29] So that's a good insight. [01:20:30] Separate the LLM-related stuff from the non-LLM-related stuff, [01:20:34] the deterministic stuff. [01:20:35] The deterministic stuff, you might [01:20:36] be able to fix more objectively, essentially. [01:20:41] Yeah. [01:20:43] What else? [01:20:56] So give me an example of an objective issue [01:21:00] that you can notice and how you would fix it, [01:21:03] versus a subjective issue. [01:21:06] Yeah. [01:21:06] [INAUDIBLE] [01:21:16] So let's say it's the same flight, [01:21:19] but one option is cheaper than the other. [01:21:21] Picking the expensive one is objectively worse, [01:21:23] and so you can capture that almost automatically. [01:21:25] Yeah. [01:21:26] So you could actually build evals [01:21:27] that are objective and tracked across your users. [01:21:32] And you might run an analysis afterwards [01:21:34] and see that, for the objective stuff, [01:21:37] our agentic workflow is bad with pricing. [01:21:43] It just doesn't read prices well, because it always [01:21:46] gives the more expensive option. [01:21:48] Yeah. [01:21:48] You're perfectly right. [01:21:49] How about the subjective stuff?
[01:21:59] Do you choose a direct or indirect flight [01:22:01] if the indirect one is a little bit cheaper? [01:22:05] Yeah. [01:22:05] Good one. [01:22:06] Do you choose a direct flight or an indirect flight [01:22:09] if the indirect one is cheaper but the direct one is more comfortable? [01:22:12] Yeah. [01:22:13] That's a good one, actually. [01:22:16] So how would you capture that information? [01:22:18] Let's say this is used by thousands of users. [01:22:24] Could you feed something in [INAUDIBLE] [01:22:28] Could you feed something in? [01:22:30] Yeah-- [01:22:32] could you feed something in about the user preferences? [01:22:36] Well, you could build a data set that [01:22:39] has some of that information. [01:22:40] So you build 10 prompts where the user is asking specifically [01:22:44] for a direct flight, [01:22:46] saying, I prefer direct flights because I [01:22:48] care about my time, let's say. [01:22:50] Then you look at the output, you [01:22:53] provide an example of a good output, [01:22:56] and you're probably able to capture [01:22:58] the performance of your agentic workflow on this specific eval. [01:23:04] Does it prioritize correctly? [01:23:05] Is it price-conscious, essentially, [01:23:08] and comfort-conscious? [01:23:10] Yeah. [01:23:13] What about the tone? [01:23:14] Let's say the LLM right now is not very friendly. [01:23:18] How would you notice that, and how would you fix it? [01:23:26] Yeah. [01:23:26] Have a test user run the prompt [01:23:29] and see if there's something wrong with it. [01:23:33] OK. [01:23:33] Have a test user run the prompt and see if there's [01:23:36] something wrong with it. [01:23:37] Tell me about the last step. [01:23:38] How would you notice that something is wrong? [01:23:40] So a couple of tests [INAUDIBLE] evaluates [01:23:48] the response and [INAUDIBLE] [01:23:51] Yeah. [01:23:52] I agree with your approach. [01:23:53] Have LLM judges that evaluate the response [01:23:55] against a certain rubric of what politeness looks like. [01:23:58] So here, in this case, you could actually [01:24:00] start with error analysis. [01:24:02] You have 1,000 users. [01:24:05] You can pull up 20 user interactions [01:24:07] and read through them. [01:24:09] And you might notice, at first sight, that [01:24:11] the LLM seems to be very rude. [01:24:14] It's just super, super short in its answers, [01:24:18] and it's not very helpful. [01:24:20] You notice that with your manual error analysis. [01:24:23] Then you go to the next stage. [01:24:24] You actually put evals behind it. [01:24:26] You say, I'm going to create a set of LLM judges [01:24:33] that are going to look at the user interaction [01:24:35] and rate how polite it is, [01:24:38] and I'm going to give them a rubric. [01:24:40] Then what I'm going to do is swap my LLM. [01:24:42] Instead of using GPT-4, I'm going to use Grok. [01:24:45] And instead of Grok, Llama. [01:24:48] And then I'm going to run those three LLMs side by side, [01:24:51] give the outputs to my LLM judges, and get my subjective score [01:24:56] at the end that says, oh, model X was more polite on average. [01:25:02] Yeah. [01:25:02] Perfectly right. [01:25:03] That's an example of an eval that is very specific [01:25:05] and allows you to choose between LLMs. [01:25:07] You could actually do the same eval not across LLMs, [01:25:10] but fixing the LLM and changing the prompt. [01:25:12] Instead of saying, act like a travel agent, [01:25:15] you say, act like a helpful travel agent. [01:25:17] And then you see the influence of that word on your eval [01:25:21] with the LLMs as judges. [01:25:22] Does that make sense? [01:25:24] OK. [01:25:25] Super.
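Here's a minimal sketch of that LLM-as-judge setup. The `llm(model, prompt)` wrapper, the model names, and the rubric are illustrative assumptions; a real harness would also handle judges that don't return a clean integer:

```python
RUBRIC = ("Rate the following assistant reply for politeness from 1 (rude) "
          "to 5 (very polite), considering greeting, tone, and helpfulness. "
          "Answer with a single integer.")

def judge_politeness(llm, reply, judge_model="judge-model"):
    # Assumes the judge complies with the single-integer output format.
    verdict = llm(judge_model, f"{RUBRIC}\n\nReply:\n{reply}")
    return int(verdict.strip())

def compare_models(llm, candidate_models, eval_prompts):
    # Same prompts, different candidate LLMs, same judge and rubric.
    return {
        m: sum(judge_politeness(llm, llm(m, p)) for p in eval_prompts)
           / len(eval_prompts)
        for m in candidate_models
    }
```

The same idea works for comparing prompts instead of models: fix the model and vary the system instruction across the candidates.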
[01:25:26] So let's move forward and do a case study on evals, [01:25:29] and then we're almost done for today. [01:25:33] Let's say your product manager asks you to build an AI [01:25:38] agent for customer support, OK? [01:25:41] Where do you start? [01:25:42] And here is an example of the user prompt: [01:25:45] I need to change my shipping address for order blah, blah, [01:25:48] blah. [01:25:48] I moved to a new address. [01:25:51] So where do you start if I give you that project? [01:26:04] Yes. [01:26:05] We search online for existing models and [INAUDIBLE] [01:26:16] So, do some research. [01:26:17] See benchmarks and how different models [01:26:20] perform at customer support, [01:26:22] and then pick a model. [01:26:23] That's what you mean. [01:26:24] Yeah. [01:26:24] It's true, you could do that. [01:26:25] What else could you do? [01:26:28] Yeah. [01:26:28] [INAUDIBLE] [01:26:34] OK. [01:26:34] Yeah, I like that. [01:26:35] Try to decompose the different tasks it will need to do [01:26:39] and try to guess which ones will be more of a struggle, which [01:26:42] ones should be fuzzy, which ones should be deterministic. [01:26:45] Yeah, you're right. [01:26:46] [INAUDIBLE] [01:26:55] Yeah. [01:26:56] Similar to what you said. [01:26:58] That's what I would recommend as well. [01:27:00] You say, I would sit down with a customer support [01:27:02] agent for a day or two, and I would decompose the tasks [01:27:04] they're going through. [01:27:05] I would ask them, where do they struggle? [01:27:07] How much time does it take? [01:27:08] Yes. [01:27:09] That's usually where you want to start: task decomposition. [01:27:12] So let's say we've done that work, and we have this list. [01:27:16] I'm simplifying. [01:27:17] But the human customer support agent typically [01:27:20] would extract key info, then look up [01:27:23] in the database to retrieve the customer record, [01:27:25] then check the policy: [01:27:27] are we allowed to update the address, [01:27:29] or is it a fixed data point? [01:27:32] And then they draft a response email and send it. [01:27:35] So we've decomposed the task. [01:27:39] Once you've decomposed that task, [01:27:42] how do you design your agentic workflow? [01:28:03] Yes. [01:28:04] [INAUDIBLE] [01:28:17] Exactly. [01:28:18] So to repeat, you're going to look [01:28:20] at the decomposition of tasks, get an instinct for what's fuzzy [01:28:24] and what's deterministic, and then determine [01:28:28] which step is going to be a one-shot LLM call, which one will require [01:28:33] maybe a RAG, which one will require a tool, which one will [01:28:36] require memory, and so on. [01:28:38] So you will start designing that map. [01:28:41] Completely right. [01:28:41] That's also what I would recommend. [01:28:43] You might actually draft it and say, OK, I take the user prompt. [01:28:48] And the first step of my task decomposition [01:28:52] was extract information; that seems to be a vanilla LLM call. [01:28:57] You can guess that a vanilla LLM would probably [01:29:00] be good enough at extracting that the user wants [01:29:03] to change their address, this is the order number, [01:29:05] and this is the new address.
[01:29:06] You probably don't need too much technology [01:29:08] there other than the LLM. [01:29:11] For the next step, it feels like you need a tool, because you're [01:29:14] actually going to have to look up the record in the database [01:29:17] and also update the address. [01:29:21] So that might be a tool, and you might [01:29:23] have to build a custom tool for the LLM, [01:29:25] to say, let me connect you to that database, [01:29:27] or let me give you access to that resource with an MCP. [01:29:32] After that, you probably need an LLM again to draft the email, [01:29:35] and you would paste in the confirmation, [01:29:38] the confirmation that the address [01:29:40] has been updated from x to y, [01:29:42] and then the LLM will draft an answer. [01:29:44] And of course, not to forget, [01:29:46] you might need a tool to send the email. [01:29:49] You might actually need to post something somewhere [01:29:54] for the email to go out. [01:29:57] And then you'll get the output. [01:29:59] Does that make sense? So, exactly what you described. [01:30:02] Now, moving to the next step. [01:30:03] We've decomposed our tasks, [01:30:06] and we've designed an agentic workflow around them. [01:30:09] It took us five minutes. [01:30:10] In practice, it would take you more [01:30:12] if you're building your startup on this. [01:30:13] You want to make sure your task decomposition is accurate, [01:30:15] your design is accurate, and then [01:30:17] there's a lot of work to be done on every tool [01:30:20] to optimize it for latency and cost. [01:30:22] But let's say now we want to know if it works. [01:30:27] And I'm going to assume that you have LLM traces. [01:30:30] LLM traces are very important. [01:30:33] Actually, if you're interviewing with an AI startup, [01:30:36] I would recommend that in the interview process you ask them, [01:30:39] do you have LLM traces? [01:30:40] Because if they don't have LLM traces, [01:30:42] it is pretty hard to debug an LLM system, because you don't [01:30:46] have visibility into the chain of complex prompts that were called [01:30:50] and where the bug is. [01:30:52] And so it's a basic part of an AI startup's [01:30:57] stack to have LLM traces. [01:31:00] So let's assume you have traces. [01:31:02] How would you know if your system works? [01:31:04] I'm going to summarize some of the things I heard earlier. [01:31:11] You gave an example of an end-to-end metric: [01:31:15] you look at user satisfaction at the end. [01:31:18] You can also do a component-based approach, [01:31:21] where you look at the tool, the database updates, [01:31:25] and you manually do an error analysis and see, [01:31:28] oh, the tool actually always forgets to update the email. [01:31:32] It just fails at writing. [01:31:33] And I'm going to fix that. [01:31:34] This is pretty much deterministic. [01:31:37] Or, when it tries to send the email [01:31:40] and pings the system that is supposed to send it, [01:31:44] it doesn't send it in the right format, [01:31:46] and so it fails at that point. [01:31:48] Again, you could fix that. [01:31:51] Or the draft of the email: the LLM doesn't do a great job; [01:31:53] it's not very polite at drafting the email. [01:31:56] So you could look at it component by component, [01:31:59] and that's actually easier to debug than looking at it [01:32:01] end to end. [01:32:02] You would probably do a mix of both. [01:32:05] Another way to look at it is, what is objective [01:32:08] versus what is subjective? [01:32:10] So for example, an objective failure [01:32:12] would be: the LLM extracted the wrong order ID. [01:32:18] The user said, my order ID is X, and the LLM, [01:32:21] when it actually did the lookup in the database, [01:32:24] used the wrong order ID. [01:32:26] This is objectively wrong. [01:32:27] You can actually write Python code [01:32:29] that checks that, that checks the alignment between what [01:32:32] the user mentioned and what was actually passed to the database [01:32:36] for the lookup.
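For instance, here's a minimal sketch of that check, assuming each trace records the raw user message and the arguments the agent passed to the lookup tool. The trace fields and the order-ID pattern are illustrative assumptions:

```python
import re

def order_id_consistent(trace):
    # Pull the order ID the user stated (the format here is made up).
    stated = re.search(r"order\s*#?\s*([A-Z0-9-]+)", trace["user_message"], re.I)
    # The ID the agent actually passed to the database lookup tool.
    used = trace["tool_calls"]["lookup_order"]["order_id"]
    return stated is not None and stated.group(1) == used

def order_id_error_rate(traces):
    # An objective, fully deterministic eval you can track across all users.
    return sum(not order_id_consistent(t) for t in traces) / len(traces)
```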
[01:32:38] You also have the subjective stuff, which we talked about, [01:32:40] where you probably want to do either human rating or LLM [01:32:43] as judges. [01:32:44] LLM judges are very relevant for subjective evals. [01:32:49] And finally, you will find yourself [01:32:51] having quantitative evals and more qualitative evals. [01:32:55] So quantitative would be the percentage of successful address [01:32:59] updates, [01:33:00] or the latency. [01:33:00] You could actually track latency per component [01:33:03] and see which one is the slowest. [01:33:05] Let's say sending the email takes five seconds; [01:33:08] that's too long, let's say. [01:33:10] You would notice that per component or over the full workflow, [01:33:13] and then you would decide, where am I optimizing my latency, [01:33:15] and how am I going to do that? [01:33:17] And then finally, qualitative. [01:33:20] You might actually do some error analysis [01:33:23] and look at, where are the hallucinations? [01:33:27] Where are the tone mismatches? [01:33:31] Are users confused, and what are they confused by? [01:33:34] That would be more qualitative, [01:33:36] and typically, it takes more white-glove approaches [01:33:41] to do that. [01:33:42] So here's what it could look like. [01:33:44] I gave you some examples. [01:33:46] But you would build evals to determine, [01:33:50] objectively and subjectively, component-based and end [01:33:53] to end, quantitatively and [01:33:55] qualitatively, where your LLM is failing [01:33:57] and where it's doing well. [01:34:02] Does that give you a sense of the type of things [01:34:04] you could do to fix or improve that agentic workflow? [01:34:09] Super. [01:34:10] Well, that was our case study on evals. [01:34:12] We're not going to delve deeper into it. [01:34:14] But hopefully, it gave you a sense of the kinds of things [01:34:16] you can do with LLM judges: objective versus [01:34:21] subjective, component-based versus end to end, et cetera. [01:34:25] Last section: multi-agent workflows. [01:34:29] So you might ask, hey, why do we need multi-agent workflows when [01:34:36] the workflow already has multiple steps, [01:34:38] already calls the LLM multiple times, already gives it tools? [01:34:42] Why do we need multiple agents? [01:34:45] So many people are talking about multi-agent systems online. [01:34:47] It's not even a new thing, frankly. [01:34:49] Multi-agent systems have been around for a long time. [01:34:52] The main advantage of a multi-agent system [01:34:55] is going to be parallelism. [01:34:57] The question is, is there something that I [01:34:59] wish I could run in parallel, sort of independently, [01:35:04] even if there are some touchpoints in the middle? [01:35:07] That's where you want to put a multi-agent system: [01:35:09] when the work is parallel. [01:35:12] The other advantage that some companies [01:35:14] get from multi-agent systems is that an agent can be reused. [01:35:19] So let's say in a company, you have an agent that's [01:35:21] been built for design.
[01:35:22] That agent can be used in the marketing team, [01:35:25] and it can be used in the product team. [01:35:27] And so now you're optimizing one agent [01:35:30] that has multiple stakeholders who can communicate with it [01:35:33] and benefit from its performance. [01:35:38] Actually, I'm going to ask you a question [01:35:40] and give you maybe a minute to think about it. [01:35:43] Let's say you were building smart home [01:35:46] automation for your apartment or your house. [01:35:50] What agents would you want to build? [01:35:52] Yeah. [01:35:53] Write it down. [01:35:54] And then I'm going to ask you in a minute [01:35:57] to share some of the agents that you would build. [01:36:00] Also, think about how you would put [01:36:03] a hierarchy between these agents, [01:36:04] or how you would organize them, or who [01:36:06] should communicate with whom. [01:36:07] OK? [01:36:08] OK. [01:36:08] Take a minute for that. [01:36:12] Be creative, also, because I'm going to ask for all of your agents, [01:36:14] and maybe you have an agent that nobody else has thought of. [01:36:21] OK. [01:36:22] Let's get started. [01:36:24] Who wants to give me a set of agents [01:36:26] that you would want for your smart home? [01:36:29] Yes. [01:36:32] The first is like a set of agents [INAUDIBLE] [01:37:00] OK. [01:37:01] So let me repeat. [01:37:02] You have four agents, I think, roughly. [01:37:05] One that tracks biometrics: where are you in the home? [01:37:09] Where are you moving? [01:37:10] How are you moving, things like that. [01:37:12] It sort of knows your location. [01:37:15] The second one determines the temperature of the rooms [01:37:21] and has the ability to change it. [01:37:23] The third one tracks energy efficiency [01:37:26] and might give feedback on energy usage. [01:37:31] And maybe, I don't know, [01:37:32] it has control over the temperature as well. [01:37:34] I don't know, actually. [01:37:35] Or the gas or the water; it might cut your water at some point. [01:37:43] And then you have an orchestrator agent. [01:37:44] What exactly is the orchestrator doing? [01:37:48] It passes instructions [INAUDIBLE] [01:37:53] OK. [01:37:53] Passes instructions. [01:37:55] So is that the agent that communicates mainly [01:37:58] with the user? [01:38:00] So if I'm coming back home and I [01:38:02] say, I want the oven to be preheated, [01:38:05] I communicate with the orchestrator, [01:38:07] and then it funnels that to another agent. [01:38:09] OK. [01:38:10] Sounds good. [01:38:11] Yeah. [01:38:11] So that's an example of, I want to say, [01:38:14] a hierarchical multi-agent system. [01:38:20] What else? [01:38:21] Any other ideas? [01:38:22] What would you add to that? [01:38:24] Yeah. [01:38:25] [INAUDIBLE] [01:38:55] Oh, I like that. [01:38:56] That's a really good one. [01:38:57] So let me summarize. [01:38:58] You have a security agent that determines if you can enter [01:39:02] or not. [01:39:03] And when you enter, it understands who you are. [01:39:06] And then it gives you certain sets [01:39:08] of permissions that might be different depending [01:39:11] on whether you're a parent or a kid. [01:39:13] Or you might have access to certain cars and not others. [01:39:17] Or your kid cannot open the fridge, or I don't know. [01:39:20] Something like that. [01:39:21] Yeah. [01:39:22] OK, I like that. [01:39:23] That's a good one.
[01:39:24] And it does feel like it's a complex enough problem that [01:39:28] you want a specific workflow tied to it. [01:39:32] I agree. [01:39:34] What else? [01:39:39] Yes. [01:39:41] [INAUDIBLE] So you can get more complicated. [01:39:43] So, energy savings, tied to whether or not you [01:39:50] or someone else wants the blinds open in the house, or [01:39:55] when you tap into the grid. [01:39:57] Yeah. And another thought as well, it's much harder [01:40:04] to track than the grocery store, [01:40:06] but understanding what's in your fridge. [01:40:08] OK. [01:40:12] Well, those are really good, actually. [01:40:14] So you mentioned two of them. [01:40:16] One is maybe an agent that has access to external APIs, that [01:40:20] can understand the weather out there, the wind, the sun, [01:40:24] and then has control over certain devices at home: [01:40:28] temperature, blinds, things like that, and also understands [01:40:31] your preferences for them. [01:40:33] That does feel like a good use case, because you could give [01:40:36] that to the orchestrator, but it might lose itself [01:40:38] because it's doing too much. [01:40:41] And also, these problems are tied together: [01:40:43] the outdoor temperature from the weather API [01:40:45] might influence the temperature inside, [01:40:48] how you want it, et cetera. [01:40:50] And then the second one, which I also like, [01:40:52] is you might have an agent that looks at your fridge [01:40:55] and what's inside. [01:40:57] It might actually have access [01:40:58] to a camera in the fridge, for example, [01:41:01] know your preferences, and also have [01:41:03] access to an e-commerce API to order [01:41:06] groceries from Amazon ahead of time. [01:41:09] I agree. [01:41:10] And maybe the orchestrator will be the communication line [01:41:12] with the user, but it might communicate with that agent [01:41:16] in order to get it done. [01:41:17] Yeah. [01:41:18] I like those. [01:41:19] Those are all really good examples. [01:41:21] Here is the list I had up there: [01:41:25] climate control, lighting, security, energy management, [01:41:30] entertainment, a notification agent that [01:41:32] alerts you about system updates and energy savings, and an orchestrator. [01:41:35] So you actually mentioned all of them. [01:41:38] And then, we didn't talk about the different interaction [01:41:41] patterns, but you do have different ways to organize [01:41:45] a multi-agent system: [01:41:46] flat, hierarchical. [01:41:48] It sounds like this one would be hierarchical. [01:41:51] I agree. [01:41:52] And the reason is UI/UX: I would rather [01:41:55] only have to talk to the orchestrator [01:41:57] than have to go to a specialized application [01:42:00] to do something. [01:42:01] It feels like the orchestrator [01:42:02] could be responsible for that. [01:42:04] And so I agree, I would probably go for a hierarchical setup [01:42:07] here. [01:42:08] But maybe you might also add some connections [01:42:11] between other agents, like in a flat system [01:42:13] where it's all-to-all. [01:42:15] For example, with climate control and energy, [01:42:17] if you want to connect those two, [01:42:19] you might actually allow them to speak with each other. [01:42:21] When you allow agents to speak with each other, [01:42:24] it's basically an MCP-style protocol, by the way. [01:42:26] You treat the agent like a tool, exactly like a tool. [01:42:30] Here is how you interact with this agent. [01:42:32] Here is what it can tell you. [01:42:34] Here is what it needs from you, essentially.
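Here's a minimal sketch of that hierarchical setup: the orchestrator is the only agent the user talks to, and it treats the specialist agents like tools. Each specialist is a plain callable here, and the keyword routing stands in for an LLM; in practice each agent would wrap its own model, tools, and memory:

```python
def climate_agent(request):
    return f"[climate] adjusting temperature for: {request}"   # stub specialist

def grocery_agent(request):
    return f"[grocery] checking fridge and ordering for: {request}"  # stub

SPECIALISTS = {"climate": climate_agent, "grocery": grocery_agent}

def orchestrator(user_message):
    # In a real system an LLM would do the routing; a keyword check stands in.
    route = "grocery" if "fridge" in user_message else "climate"
    return SPECIALISTS[route](user_message)

print(orchestrator("it's cold, warm up the living room"))   # -> climate agent
print(orchestrator("restock the fridge for the weekend"))   # -> grocery agent
```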
[01:42:37] OK, super. [01:42:38] And then, without going into the details, [01:42:40] there are advantages to multi-agent workflows [01:42:43] versus single agents, such as debugging: [01:42:47] it's easier to debug a specialized agent [01:42:50] than to debug an entire system. [01:42:52] Parallelization as well: [01:42:54] it's easier to have things run in parallel, [01:42:56] and you can save time. [01:42:59] There are some advantages to doing that, [01:43:01] and I'll leave you with this slide if you want to go deeper. [01:43:04] Super. [01:43:05] So we've learned so many techniques to optimize LLMs, [01:43:08] from prompts to chains to fine tuning, retrieval, [01:43:12] and multi-agent systems as well. [01:43:14] And then, just to end, a couple of trends I want you to watch. [01:43:19] I think next week is Thanksgiving, is that it? [01:43:21] It's Thanksgiving break. [01:43:22] No, the week after. [01:43:23] OK. [01:43:24] Well, ahead of the Thanksgiving break, [01:43:26] if you're traveling, you can think about these things. [01:43:29] On what's next in AI, I wanted to call out a couple of trends. [01:43:34] So Ilya Sutskever, one of the OGs of LLMs and an OpenAI [01:43:40] co-founder, raised the question of whether we are plateauing or not. [01:43:45] The question is, are we going to see, in the coming years, LLMs [01:43:50] not improving as fast as we've seen in the past? [01:43:54] It's probably been the feeling in the community [01:43:56] that the last version of GPT [01:44:00] did not bring the level of performance [01:44:03] that people were expecting, although it did make [01:44:06] things much easier for consumers, because you don't need [01:44:09] to interact with different models. [01:44:10] It's all under the same hood. [01:44:12] So it seems that it's progressing, [01:44:14] but the plateau question is unclear. [01:44:17] The way I would think about it is, the LLM scaling laws tell us [01:44:22] that if we continue to scale compute and energy, [01:44:26] then LLMs should continue to improve. [01:44:28] But at some point, it's going to plateau. [01:44:29] So what's going to take us to the next step? [01:44:32] It's probably architecture search. [01:44:35] A lot of LLMs, even if we don't [01:44:36] know exactly what's under the hood, are probably [01:44:38] transformer-based today. [01:44:40] But we know that the human brain does not operate the same way. [01:44:43] There are just certain things that we [01:44:45] do that are much more efficient, much faster. [01:44:47] We don't need as much data. [01:44:49] So theoretically, we have so much [01:44:51] to learn in terms of architecture search [01:44:53] that we haven't figured out. [01:44:54] It's not a surprise that you see those labs hiring [01:44:57] so many engineers. [01:44:58] Because it is possible that in the next few years, [01:45:01] you're going to have thousands of engineers trying [01:45:03] to figure out the different engineering hacks and tactics [01:45:06] and architecture searches that are [01:45:07] going to lead to better models. [01:45:10] And one of them will suddenly find the next transformer, [01:45:13] and it will reduce the need for compute and [01:45:17] energy by 10x. [01:45:18] It's sort of like if you read Isaac Asimov's Foundation series: [01:45:24] individuals can have an amazing impact on the future because [01:45:27] of their decisions. [01:45:29] Whoever discovered transformers had a tremendous impact [01:45:33] on the direction of AI.
[01:45:34] I think we're going to see more of that in the coming [01:45:37] years, where some group of researchers that is iterating [01:45:40] fast might discover certain things that suddenly [01:45:43] unlock that plateau and take us to the next step, [01:45:45] and it's going to continue to improve like that. [01:45:47] And so it doesn't surprise me that there are so many companies [01:45:50] hiring engineers right now to figure out [01:45:52] those hacks and those techniques. [01:45:56] The other set of gains that we might see [01:45:58] is from multi-modality. [01:45:59] The way to think about it is, we had LLMs that were first text-based, [01:46:04] and then we added images. [01:46:06] And today, models are very good at images. [01:46:09] They're very good at text. [01:46:10] It turns out that being good at images and being good at text [01:46:13] makes the whole model better. [01:46:15] So the fact that you're good at understanding a cat image [01:46:18] makes you better at text about a cat as well. [01:46:21] Now you add another modality, like audio or video, [01:46:24] and the whole system gets better. [01:46:26] So you're better at writing about a cat [01:46:28] if you know what a cat sounds like [01:46:30] and if you can look at a cat in an image as well. [01:46:31] Does that make sense? [01:46:32] So we see gains that transfer from one modality [01:46:35] to another, and that might culminate in robotics, [01:46:38] where all these modalities come together. [01:46:40] And suddenly, the robot is better at [01:46:42] running away from a cat because it understands [01:46:44] what a cat is, what it sounds like, [01:46:46] what it looks like, et cetera. [01:46:48] That makes sense? [01:46:49] The other one is multiple methods working in harmony. [01:46:53] In the Tuesday lectures, we've seen supervised learning, [01:46:56] unsupervised learning, self-supervised learning, [01:46:58] reinforcement learning, prompt engineering, RAGs, et cetera. [01:47:02] If you look at how babies learn, it [01:47:06] is probably a mix of those different approaches. [01:47:09] A baby might have some meta-learning, meaning it [01:47:13] has some survival instinct that is [01:47:16] most likely encoded in its DNA. [01:47:19] And that's the baby's pre-training, if you will. [01:47:22] On top of that, the mom or the dad is pointing at stuff [01:47:27] and saying bad, good, bad, good: [01:47:29] supervised learning. [01:47:30] On top of that, the baby is falling on the ground [01:47:33] and getting hurt, [01:47:34] and that's a reward signal for reinforcement learning. [01:47:36] On top of that, the baby is observing other people [01:47:39] doing stuff or other babies doing [01:47:42] stuff: unsupervised learning. [01:47:43] You see what I mean? [01:47:44] We're probably a mix of all these methods, [01:47:47] and I think that's where the trend is going: [01:47:49] those methods that you've seen in CS230 [01:47:52] come together in order to build an AI system that learns fast, [01:47:56] is low latency, is cheap, is energy-efficient, [01:48:00] and makes the most out of all of these methods. [01:48:03] Finally, and this is especially true at Stanford, [01:48:06] you have research going on that you would consider human-centric [01:48:11] and some research that is non-human-centric. [01:48:13] By human-centric, I mean approaches [01:48:16] that are modeled after the brain, versus approaches that [01:48:19] are not modeled after humans.
[01:48:20] Because it turns out that the human body is very limiting. [01:48:24] And so if you only do research [01:48:26] on what the human brain looks like, [01:48:28] you're probably missing out on compute and energy and things [01:48:30] like that that you can optimize even [01:48:32] beyond the neuronal connections in the brain. [01:48:35] But you still can learn a lot from the human brain. [01:48:37] And that's why there are professors running labs [01:48:40] right now that try to understand, [01:48:42] does backpropagation happen in humans? [01:48:45] And in fact, it's probably the case that we don't have backpropagation. [01:48:48] We may not use backpropagation; we may only do forward passes, [01:48:51] let's say. [01:48:51] So this type of work is interesting research [01:48:54] that I would encourage you to read if you're curious [01:48:56] about the direction of AI. [01:48:59] And then finally, one thing that's pretty clear, [01:49:02] and I call it out all the time, is the velocity [01:49:05] at which things are moving. [01:49:06] You're noticing that part of the reason [01:49:08] we're giving you breadth in CS230 [01:49:10] is because these methods are changing so fast. [01:49:12] So I don't want to bother teaching you [01:49:15] RAG method number 17 that [01:49:17] optimizes the RAG, because in two years, [01:49:19] you're not going to need it. [01:49:20] I would rather you think about the [01:49:23] breadth of things you want to understand, [01:49:25] and when you need it, you sprint and learn [01:49:27] the exact thing you need, faster, because the half-life of skills [01:49:30] is so short. [01:49:31] You want to come out of the class with good breadth [01:49:34] and then have the ability to go deep whenever [01:49:36] you need to after the class. [01:49:38] And that's sort of how this class is designed as well. [01:49:41] Yeah. [01:49:41] That's it for today. [01:49:43] So thank you. [01:49:45] Thank you for participating.