WEBVTT

00:00:05.400 --> 00:00:06.919
Hi, everyone.

00:00:06.919 --> 00:00:11.439
Welcome to another lecture
for CS230 Deep Learning.

00:00:11.439 --> 00:00:17.359
Today, we're going to talk about
enhancing large language model

00:00:17.359 --> 00:00:19.079
applications.

00:00:19.079 --> 00:00:23.839
And I call this
lecture Beyond LLMs.

00:00:23.839 --> 00:00:26.800
It has a lot of newer content.

00:00:26.800 --> 00:00:31.280
And the idea behind
this lecture is

00:00:31.280 --> 00:00:34.020
we started to learn
about neurons,

00:00:34.020 --> 00:00:35.700
and then we learned
about layers,

00:00:35.700 --> 00:00:38.320
and then we learned about
deep neural networks,

00:00:38.320 --> 00:00:43.439
and then we learned a little bit
about how to structure projects

00:00:43.439 --> 00:00:44.719
in Course 3.

00:00:44.719 --> 00:00:48.839
And now we're going one level
beyond into, what would it

00:00:48.840 --> 00:00:54.640
look like if you were building
agentic AI systems at work,

00:00:54.640 --> 00:00:58.439
in a startup, in a company?

00:00:58.439 --> 00:01:02.769
And it's probably one of
the more practical lectures.

00:01:02.770 --> 00:01:05.170
Again, the goal is
not to build a product

00:01:05.170 --> 00:01:07.329
end to end in the
next hour or so,

00:01:07.329 --> 00:01:09.929
but rather to tell
you all the techniques

00:01:09.930 --> 00:01:15.170
that AI engineers have cracked,
figured out, are exploring,

00:01:15.170 --> 00:01:18.450
so that after the class,
you have the breadth of view

00:01:18.450 --> 00:01:20.350
of different
prompting techniques,

00:01:20.349 --> 00:01:25.250
different agentic workflows,
multi-agent systems, evals.

00:01:25.250 --> 00:01:26.870
And then when you
want to dive deeper,

00:01:26.870 --> 00:01:29.810
you have the background to
dive deeper and learn faster

00:01:29.810 --> 00:01:32.769
about it.

00:01:32.769 --> 00:01:36.049
Let's try to make it as
interactive as possible, as

00:01:36.049 --> 00:01:37.689
usual.

00:01:37.689 --> 00:01:40.849
When we look at the
agenda, the agenda

00:01:40.849 --> 00:01:45.049
is going to start with the
core idea behind challenges

00:01:45.049 --> 00:01:48.469
and opportunities
for augmenting LLMs.

00:01:48.469 --> 00:01:50.789
So we start from a base model.

00:01:50.790 --> 00:01:55.570
How do we maximize the
performance of that base model?

00:01:55.569 --> 00:01:59.129
Then we'll dive deep into the
first line of optimization,

00:01:59.129 --> 00:02:02.789
which is prompting methods, and
we'll see a variety of them.

00:02:02.790 --> 00:02:04.530
Then we'll go slightly deeper.

00:02:04.530 --> 00:02:06.710
If we were to get our
hands under the hood

00:02:06.709 --> 00:02:09.269
and do some fine tuning,
what would it look like?

00:02:09.270 --> 00:02:12.650
I'm not a fan of fine tuning,
and I talk a lot about that,

00:02:12.650 --> 00:02:16.870
but I'll explain why I try to
avoid fine tuning as much as

00:02:16.870 --> 00:02:18.469
possible.

00:02:18.469 --> 00:02:22.930
And then we'll do a section 4 on
Retrieval-Augmented Generation,

00:02:22.930 --> 00:02:26.530
or RAG, which you've probably
heard of in the news.

00:02:26.530 --> 00:02:28.770
Maybe some of you
have played with RAGs.

00:02:28.770 --> 00:02:31.670
We're going to
unpack what a RAG is

00:02:31.669 --> 00:02:36.989
and how it works and then the
different methods within RAGs.

00:02:36.990 --> 00:02:40.590
And then we'll talk about
agentic AI workflows.

00:02:40.590 --> 00:02:42.469
I'll define it.

00:02:42.469 --> 00:02:45.629
Andrew Ng is one
of the first ones

00:02:45.629 --> 00:02:49.569
to have called this trend
agentic AI workflows.

00:02:49.569 --> 00:02:51.709
And so we look at the
definition that Andrew

00:02:51.710 --> 00:02:54.469
gives to agentic
workflows, and then we'll

00:02:54.469 --> 00:02:56.479
start seeing examples.

00:02:56.479 --> 00:02:59.479
The section 6 is very practical.

00:02:59.479 --> 00:03:05.179
It's a case study where we will
think about an agentic workflow,

00:03:05.180 --> 00:03:10.900
and I'll ask you to measure
if the agent actually works,

00:03:10.900 --> 00:03:13.120
and we brainstorm
how we can measure

00:03:13.120 --> 00:03:15.372
if an agentic
workflow is working

00:03:15.372 --> 00:03:16.539
the way you want it to work.

00:03:16.539 --> 00:03:22.239
There's plenty of methods called
evals that solve that problem.

00:03:22.240 --> 00:03:24.900
And then we'll look briefly
at multi-agent workflows.

00:03:24.900 --> 00:03:27.960
And then we can have an
open-ended discussion

00:03:27.960 --> 00:03:31.760
where I share some thoughts
on what's next in AI.

00:03:31.759 --> 00:03:34.120
And I'm looking forward
to hearing from you all,

00:03:34.120 --> 00:03:36.599
as well, on that one.

00:03:36.599 --> 00:03:42.060
So let's get started with the
problem of augmenting LLMs.

00:03:42.060 --> 00:03:44.479
So open-ended question for you--

00:03:44.479 --> 00:03:47.399
you are all familiar
with pre-trained models

00:03:47.400 --> 00:03:52.080
like GPT-3.5 Turbo or GPT-4o.

00:03:52.080 --> 00:03:56.260
What's the limitation of
using just a base model?

00:03:56.259 --> 00:03:59.060
What are the typical
issues that might

00:03:59.060 --> 00:04:03.469
arise as you're using a
vanilla pre-trained model?

00:04:07.819 --> 00:04:08.400
Yes.

00:04:08.400 --> 00:04:10.460
It lacks some domain knowledge.

00:04:10.460 --> 00:04:11.840
Lacks some domain knowledge.

00:04:11.840 --> 00:04:13.432
You're perfectly right.

00:04:13.432 --> 00:04:16.139
We had a group of
students a few years ago.

00:04:16.139 --> 00:04:22.099
It was not LLM related, but
they were building an autonomous

00:04:22.100 --> 00:04:26.780
farming device or vehicle that
had a camera underneath, taking

00:04:26.779 --> 00:04:30.619
pictures of crops to
determine if the crop is

00:04:30.620 --> 00:04:32.980
sick or not, if it
should be thrown away,

00:04:32.980 --> 00:04:35.980
if it should be used or not.

00:04:35.980 --> 00:04:40.900
And that data set is not a
data set you find out there.

00:04:40.899 --> 00:04:44.579
And the base model or
pre-trained computer vision

00:04:44.579 --> 00:04:47.539
model would lack that
knowledge, of course.

00:04:47.540 --> 00:04:49.640
What else?

00:04:49.639 --> 00:04:50.139
Yes.

00:04:50.139 --> 00:04:57.110
[INAUDIBLE] pictures are
very dark [INAUDIBLE]

00:04:57.110 --> 00:04:59.030
OK, maybe the-- you're saying--

00:04:59.029 --> 00:05:02.111
so just to repeat
for people online,

00:05:02.112 --> 00:05:04.070
you're saying the model
might have been trained

00:05:04.069 --> 00:05:06.670
on high-quality data,
but the data in the wild

00:05:06.670 --> 00:05:08.509
is actually not
that high quality.

00:05:08.509 --> 00:05:11.149
And in fact, yes, the
distribution of the real world

00:05:11.149 --> 00:05:16.169
might differ, as we've seen with
GANs, from the training set,

00:05:16.170 --> 00:05:18.650
and that might create an
issue with pre-trained models.

00:05:18.649 --> 00:05:20.909
Although pre-trained
LLMs are getting better

00:05:20.910 --> 00:05:25.470
at handling all
sorts of data inputs.

00:05:25.470 --> 00:05:26.894
Yes.

00:05:26.894 --> 00:05:28.310
Lacks current information.

00:05:28.310 --> 00:05:28.990
Lack what?

00:05:28.990 --> 00:05:30.110
Current information.

00:05:30.110 --> 00:05:32.550
Lacks current information.

00:05:32.550 --> 00:05:34.270
The LLM is not up to date.

00:05:34.269 --> 00:05:35.490
And in fact, you're right.

00:05:35.490 --> 00:05:38.150
Imagine you have to retrain
from scratch your LLM

00:05:38.149 --> 00:05:39.989
every couple of months.

00:05:39.990 --> 00:05:42.790
One story that I found funny--

00:05:42.790 --> 00:05:45.550
it's from probably three years
ago, or maybe more like five years

00:05:45.550 --> 00:05:49.430
ago, where during
his first presidency,

00:05:49.430 --> 00:05:53.990
President Trump one
day tweeted, "Covfefe."

00:05:53.990 --> 00:05:56.170
You remember that tweet or no?

00:05:56.170 --> 00:05:57.310
Just "Covfefe."

00:05:57.310 --> 00:05:59.970
And it was probably a typo
or it was in his pocket.

00:05:59.970 --> 00:06:00.890
I don't know.

00:06:00.889 --> 00:06:03.550
But that word did not exist.

00:06:03.550 --> 00:06:06.290
The LLMs, in fact, that
Twitter was running at the time

00:06:06.290 --> 00:06:08.290
could not recognize that word.

00:06:08.290 --> 00:06:11.770
And so the recommender
system sort of went wild,

00:06:11.769 --> 00:06:15.250
because suddenly everybody was
making fun of that tweet using

00:06:15.250 --> 00:06:19.449
the word "Covfefe," and the LLM
was so confused on, what does

00:06:19.449 --> 00:06:20.029
that mean?

00:06:20.029 --> 00:06:21.149
Where should we show it?

00:06:21.149 --> 00:06:22.509
To whom should we show it?

00:06:22.509 --> 00:06:25.149
And this is an example
of a-- nowadays,

00:06:25.149 --> 00:06:28.849
especially on social media,
there's so many new trends,

00:06:28.850 --> 00:06:33.050
and it's very hard to retrain
an LLM to match the new trend

00:06:33.050 --> 00:06:34.710
and understand the
new words out there.

00:06:34.709 --> 00:06:39.329
I mean, you oftentimes hear Gen
Z words like "rizz" or "mid"

00:06:39.329 --> 00:06:40.349
or whatever.

00:06:40.350 --> 00:06:41.670
I don't know all of them.

00:06:41.670 --> 00:06:45.890
But you probably want
to find a way that

00:06:45.889 --> 00:06:49.089
can allow the LLM to understand
those trends without retraining

00:06:49.089 --> 00:06:51.500
the LLM from scratch.

00:06:51.500 --> 00:06:53.779
What else?

00:06:53.779 --> 00:06:56.182
It's trained to have a
breadth of knowledge.

00:06:56.182 --> 00:06:58.099
And if you wanted to do
something specialized,

00:06:58.100 --> 00:06:59.900
that might limit [INAUDIBLE].

00:06:59.899 --> 00:07:02.560
Yeah, it might be trained
on a breadth of knowledge,

00:07:02.560 --> 00:07:05.660
but it might fail or
not perform adequately

00:07:05.660 --> 00:07:09.060
on a narrow task that
is very well defined.

00:07:09.060 --> 00:07:11.980
Think about enterprise
applications that--

00:07:11.980 --> 00:07:13.400
yeah, enterprise application.

00:07:13.399 --> 00:07:17.579
You need high precision,
high fidelity, low latency.

00:07:17.579 --> 00:07:20.359
And maybe the model is not
great at that specific thing.

00:07:20.360 --> 00:07:22.480
It might do fine, but
just not good enough.

00:07:22.480 --> 00:07:24.640
And you might want to
augment it in a certain way.

00:07:24.639 --> 00:07:25.139
Yeah.

00:07:25.139 --> 00:07:29.699
Maybe it has [INAUDIBLE]
so it makes the model

00:07:29.699 --> 00:07:32.045
a lot heavier, a lot slower.

00:07:32.045 --> 00:07:33.500
[INAUDIBLE]

00:07:33.500 --> 00:07:37.379
So maybe it has a lot of broad
domain knowledge that might not

00:07:37.379 --> 00:07:39.240
be needed for your application.

00:07:39.240 --> 00:07:41.620
And so you're using a
massive, heavy model

00:07:41.620 --> 00:07:44.519
when you actually are only using
2% of the model capability.

00:07:44.519 --> 00:07:45.599
You're perfectly right.

00:07:45.600 --> 00:07:46.808
You might not need all of it.

00:07:46.807 --> 00:07:51.279
So you might find ways to prune,
quantize the model, modify it.

00:07:51.279 --> 00:07:53.059
All of these are good points.

00:07:53.060 --> 00:07:55.959
I'm going to add a
few more, as well.

00:07:55.959 --> 00:07:58.799
LLMs are very
difficult to control.

00:07:58.800 --> 00:08:00.819
Your last point is actually
an example of that.

00:08:00.819 --> 00:08:03.339
You want to control the LLM to
use a part of its knowledge,

00:08:03.339 --> 00:08:04.539
but it's not--

00:08:04.540 --> 00:08:06.760
it's, in fact, getting confused.

00:08:06.759 --> 00:08:08.099
We've seen that in history.

00:08:08.100 --> 00:08:13.080
In 2016, Microsoft created
a notorious Twitter

00:08:13.079 --> 00:08:18.039
bot that learned from users, and
it quickly became a racist jerk.

00:08:18.040 --> 00:08:22.980
Microsoft ended up removing the
bot 16 hours after launching it.

00:08:22.980 --> 00:08:25.720
The community was really
fast at determining

00:08:25.720 --> 00:08:28.040
that this was a racist bot.

00:08:28.040 --> 00:08:31.480
And you can empathize with
Microsoft in the sense

00:08:31.480 --> 00:08:34.038
that it is actually
hard to control an LLM.

00:08:34.038 --> 00:08:37.720
They might have done a better
job of qualifying it before launching,

00:08:37.720 --> 00:08:40.240
but it is really hard
to control an LLM.

00:08:40.240 --> 00:08:42.639
Even more recently,
this is a tweet

00:08:42.639 --> 00:08:46.929
from Sam Altman
last November, where

00:08:46.929 --> 00:08:50.049
there was this debate
between Elon Musk and Sam

00:08:50.049 --> 00:08:54.449
Altman on whose LLM is
the left wing propaganda

00:08:54.450 --> 00:08:57.230
machine or the right
wing propaganda machine,

00:08:57.230 --> 00:08:59.320
and they were hating
on each other's LLMs.

00:08:59.320 --> 00:09:01.070
But that tells you,
at the end of the day,

00:09:01.070 --> 00:09:05.610
that even those two teams, xAI (Grok)
and OpenAI, which are probably

00:09:05.610 --> 00:09:08.610
the best-funded teams
with a lot of talent,

00:09:08.610 --> 00:09:11.509
are not doing a great job
at controlling their LLMs.

00:09:14.169 --> 00:09:16.569
And from time to time,
if you hang out on X,

00:09:16.570 --> 00:09:21.290
you might see screenshots of
users interacting with LLMs

00:09:21.289 --> 00:09:24.649
and the LLM saying something
really controversial

00:09:24.649 --> 00:09:31.289
or racist or something that
would not be considered great

00:09:31.289 --> 00:09:33.429
by social standards, I guess.

00:09:33.429 --> 00:09:39.449
And that tells you that the
model is really hard to control.

00:09:39.450 --> 00:09:41.610
The second aspect
of it is something

00:09:41.610 --> 00:09:43.289
that you mentioned earlier.

00:09:43.289 --> 00:09:47.230
LLMs may underperform
on your task,

00:09:47.230 --> 00:09:49.990
and that might include
specific knowledge gaps,

00:09:49.990 --> 00:09:51.432
such as medical diagnosis.

00:09:51.432 --> 00:09:52.850
If you're doing
medical diagnosis,

00:09:52.850 --> 00:09:55.430
you would rather have an LLM
that is specialized for that

00:09:55.429 --> 00:09:57.989
and is great at it
and, in fact, something

00:09:57.990 --> 00:10:00.409
that we haven't mentioned
as a group, has sources.

00:10:00.409 --> 00:10:03.309
So the answer is
specifically sourced.

00:10:03.309 --> 00:10:05.029
You have a hard time
believing something

00:10:05.029 --> 00:10:08.069
unless you have the actual
source of the research that

00:10:08.070 --> 00:10:10.270
backs it up.

00:10:10.269 --> 00:10:12.329
Inconsistencies in
style and format--

00:10:12.330 --> 00:10:17.430
so imagine you're building
a legal AI agentic workflow.

00:10:17.429 --> 00:10:21.269
Legal has a very specific
way to write and read,

00:10:21.269 --> 00:10:22.769
where every word counts.

00:10:22.769 --> 00:10:25.470
If you're negotiating
a large contract,

00:10:25.470 --> 00:10:28.430
every word on that contract
might mean something else

00:10:28.429 --> 00:10:29.929
when it comes to the court.

00:10:29.929 --> 00:10:31.789
And so it's very
important that you use

00:10:31.789 --> 00:10:34.110
an LLM that is very good at it.

00:10:34.110 --> 00:10:35.590
The precision matters.

00:10:35.590 --> 00:10:38.090
And then task-specific
understanding,

00:10:38.090 --> 00:10:40.629
such as doing a classification
on a niche field,

00:10:40.629 --> 00:10:45.080
here I pulled an example where--
let's say a biotech company is

00:10:45.080 --> 00:10:48.759
trying to use an
LLM to categorize

00:10:48.759 --> 00:10:54.052
user reviews into positive,
neutral, or negative.

00:10:54.052 --> 00:10:56.799
Maybe for that
company, something

00:10:56.799 --> 00:11:01.839
that would be considered a
negative review typically

00:11:01.840 --> 00:11:04.160
is actually considered
a neutral review

00:11:04.159 --> 00:11:06.439
because the NPS (Net Promoter Score) of
that industry tends

00:11:06.440 --> 00:11:10.240
to be way lower than other
industries, let's say.

00:11:10.240 --> 00:11:12.600
That's a task-specific
understanding,

00:11:12.600 --> 00:11:14.440
and the LLM needs to
be aligned to what

00:11:14.440 --> 00:11:17.960
the company believes is the
categorization that it wants.

00:11:17.960 --> 00:11:21.560
We will see an example of how to
solve that problem in a second.

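One common way to encode that kind of company-specific alignment is a few-shot prompt. The sketch below is illustrative only: the example reviews, the labels, and the helper name are all invented here, not taken from any real company or library.

```python
# Hypothetical few-shot examples encoding a company-specific labeling
# convention: mild complaints count as "neutral" because the industry's
# NPS runs low. All reviews and labels below are invented.
EXAMPLES = [
    ("The assay kit worked, though shipping took three weeks.", "neutral"),
    ("Completely unusable, the reagents arrived degraded.", "negative"),
    ("Best sequencing turnaround we have ever had.", "positive"),
]

def classification_prompt(review):
    """Assemble a few-shot prompt that shows the LLM the desired
    positive / neutral / negative boundaries before the new review."""
    lines = [
        "Classify the review as positive, neutral, or negative.",
        "Follow the labeling conventions shown in the examples.",
        "",
    ]
    for text, label in EXAMPLES:
        lines.append(f"Review: {text}\nLabel: {label}\n")
    lines.append(f"Review: {review}\nLabel:")
    return "\n".join(lines)

prompt = classification_prompt("Support was slow to respond.")
# The completion the LLM writes after "Label:" is the prediction.
```

Sending `prompt` to any chat or completion model would then yield a label that follows the company's convention rather than the model's default one.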
00:11:21.559 --> 00:11:24.399
And then limited
context handling--

00:11:24.399 --> 00:11:28.720
a lot of AI applications,
especially in the enterprise,

00:11:28.720 --> 00:11:33.192
require data that
has a lot of context.

00:11:33.192 --> 00:11:35.139
Just to give you
a simple example,

00:11:35.139 --> 00:11:37.480
knowledge management
is an important space

00:11:37.480 --> 00:11:40.759
where enterprises buy a lot
of knowledge management tools.

00:11:40.759 --> 00:11:43.840
When you go on your drive and
you have all your documents,

00:11:43.840 --> 00:11:47.040
ideally, you could have an LLM
running on top of that drive.

00:11:47.039 --> 00:11:50.659
You can ask any question,
and it will read immediately

00:11:50.659 --> 00:11:53.299
thousands of documents
and answer, what was

00:11:53.299 --> 00:11:56.179
our Q4 performance in sales?

00:11:56.179 --> 00:11:58.299
It was x dollars.

00:11:58.299 --> 00:11:59.539
It finds it super quickly.

00:11:59.539 --> 00:12:04.039
In practice, because LLMs do
not have a large enough context,

00:12:04.039 --> 00:12:07.860
you cannot use a standalone
vanilla pre-trained LLM to solve

00:12:07.860 --> 00:12:08.639
that problem.

00:12:08.639 --> 00:12:11.580
You will have to augment it.

00:12:11.580 --> 00:12:13.460
Does that make sense?

00:12:13.460 --> 00:12:16.620
The other aspect around context
windows is they are, in fact,

00:12:16.620 --> 00:12:17.240
limited.

00:12:17.240 --> 00:12:20.740
If you look at the context
windows of the models

00:12:20.740 --> 00:12:25.419
from the last five years,
even the best models

00:12:25.419 --> 00:12:30.459
today will range in context
window, or number of tokens

00:12:30.460 --> 00:12:35.220
it can take as input, somewhere
in the hundreds of thousands

00:12:35.220 --> 00:12:36.680
of tokens max.

00:12:36.679 --> 00:12:40.989
Just to give you a sense,
200,000 tokens is roughly two

00:12:40.990 --> 00:12:42.669
books.

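The two-books figure can be sanity-checked with a back-of-the-envelope calculation; the words-per-token ratio and book length below are common rules of thumb, not exact values.

```python
# Rough check of the "200,000 tokens is roughly two books" claim.
WORDS_PER_TOKEN = 0.75   # common heuristic; varies by tokenizer and text
WORDS_PER_BOOK = 75_000  # a typical novel-length manuscript

context_window_tokens = 200_000
words_that_fit = context_window_tokens * WORDS_PER_TOKEN  # 150,000 words
books_that_fit = words_that_fit / WORDS_PER_BOOK          # about 2 books
```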
00:12:42.669 --> 00:12:45.110
So that's how much
you can upload

00:12:45.110 --> 00:12:47.009
and it can read, pretty much.

00:12:47.009 --> 00:12:48.669
And you can imagine
that when you're

00:12:48.669 --> 00:12:52.990
dealing with video
understanding or heavier data

00:12:52.990 --> 00:12:56.710
files, that is, of
course, an issue.

00:12:56.710 --> 00:12:58.009
So you might have to chunk it.

00:12:58.009 --> 00:12:59.169
You might have to embed it.

00:12:59.169 --> 00:13:00.669
You might have to
find other ways

00:13:00.669 --> 00:13:03.519
to get the LLM to
handle larger contexts.

00:13:06.509 --> 00:13:10.269
The attention mechanism is
also powerful, but problematic,

00:13:10.269 --> 00:13:13.710
because it does not do
a great job at attending

00:13:13.710 --> 00:13:16.310
in very large contexts.

00:13:16.309 --> 00:13:19.589
There is actually an
interesting problem

00:13:19.590 --> 00:13:21.330
called needle in a haystack.

00:13:21.330 --> 00:13:23.430
It's an AI problem where--

00:13:23.429 --> 00:13:25.469
or call it a benchmark--

00:13:25.470 --> 00:13:30.910
where, in order to test if your
LLM is good at putting attention

00:13:30.909 --> 00:13:35.589
on a very specific fact
within a large corpus,

00:13:35.590 --> 00:13:38.649
researchers might
randomly insert

00:13:38.649 --> 00:13:44.009
one sentence
that states

00:13:44.009 --> 00:13:47.450
a certain fact,
such as Arun and Max

00:13:47.450 --> 00:13:48.970
are having coffee
at Blue Bottle,

00:13:48.970 --> 00:13:51.149
in the middle of the
Bible, let's say,

00:13:51.149 --> 00:13:54.049
or some very long text.

00:13:54.049 --> 00:14:01.269
And then you ask the LLM,
what were Arun and Max having

00:14:01.269 --> 00:14:02.769
at Blue Bottle?

00:14:02.769 --> 00:14:04.794
And you see if it remembers
that it was coffee.

00:14:04.794 --> 00:14:07.169
It's actually a complex problem,
not because the question

00:14:07.169 --> 00:14:09.250
is complex, but because
you're asking the model

00:14:09.250 --> 00:14:12.370
to find a fact within
a very large corpus,

00:14:12.370 --> 00:14:16.060
and that's complicated.

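The benchmark construction just described can be sketched in a few lines of Python. The filler sentence and the needle fact are placeholders; a real harness would vary the corpus length and the needle position systematically.

```python
import random

def build_needle_haystack(filler, needle, n_filler=5000, seed=0):
    """Hide one factual sentence (the needle) at a random position
    inside many copies of a filler sentence (the haystack)."""
    rng = random.Random(seed)
    sentences = [filler] * n_filler
    pos = rng.randrange(len(sentences) + 1)
    sentences.insert(pos, needle)
    return " ".join(sentences), pos

corpus, pos = build_needle_haystack(
    "The sky was a uniform shade of gray.",
    "Arun and Max are having coffee at Blue Bottle.",
)
prompt = (corpus
          + "\n\nQuestion: What were Arun and Max having at Blue Bottle?"
          + "\nAnswer:")
# A model with good long-context recall should answer "coffee"
# no matter where the needle landed in the corpus.
```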
00:14:16.059 --> 00:14:19.969
So, again, this is a
limiting factor for LLMs.

00:14:19.970 --> 00:14:21.870
We'll talk about
RAG in a second.

00:14:21.870 --> 00:14:22.970
But I want to preview--

00:14:22.970 --> 00:14:26.490
there are debates
around whether RAG

00:14:26.490 --> 00:14:29.990
is the right long-term
approach for AI systems.

00:14:29.990 --> 00:14:34.470
So as a high-level idea, a RAG
is a mechanism, if you will,

00:14:34.470 --> 00:14:39.340
that embeds documents that
an LLM can retrieve and then

00:14:39.340 --> 00:14:44.540
add as context to its initial
prompt and answer a question.

00:14:44.539 --> 00:14:45.679
It has lots of applications.

00:14:45.679 --> 00:14:47.137
Knowledge management
is an example.

00:14:47.138 --> 00:14:49.160
So imagine you have
your drive again.

00:14:49.159 --> 00:14:53.620
But every document is
compressed into a representation,

00:14:53.620 --> 00:14:55.820
and the LLM has
access to that lower

00:14:55.820 --> 00:14:59.020
dimensional representation.

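The retrieve-then-prompt loop just described can be sketched as follows. The word-overlap "embedding" is a deliberately toy stand-in for the learned embedding models and vector databases real RAG systems use, and the documents are invented.

```python
def embed(text):
    # Toy "embedding": the set of lowercase words. Real RAG systems use
    # a learned embedding model and a vector database instead.
    return set(text.lower().replace("?", "").split())

def retrieve(query, documents, k=1):
    # Rank documents by word overlap with the query; keep the top k.
    q = embed(query)
    ranked = sorted(documents, key=lambda d: len(q & embed(d)), reverse=True)
    return ranked[:k]

docs = [
    "Q4 sales performance was 12 million dollars.",
    "The holiday party is on December 18.",
    "Hiring plan for the design team, updated in March.",
]
query = "What was our Q4 performance in sales?"
top = retrieve(query, docs, k=1)[0]
prompt = (f"Context: {top}\n\nQuestion: {query}\n"
          "Answer using only the context above.")
```

Only the retrieved chunk is added to the prompt, which is how RAG keeps the context small and makes the answer sourceable.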
00:14:59.019 --> 00:15:03.500
The debate that this tweet
from [INAUDIBLE] outlines

00:15:03.500 --> 00:15:08.259
is, in theory, if we
have infinite compute,

00:15:08.259 --> 00:15:09.960
then RAG is useless.

00:15:09.960 --> 00:15:13.580
Because you can just read a
massive corpus immediately

00:15:13.580 --> 00:15:15.180
and answer your question.

00:15:15.179 --> 00:15:19.039
But even in that case,
latency might be an issue.

00:15:19.039 --> 00:15:20.659
Imagine the time
it takes for an AI

00:15:20.659 --> 00:15:24.279
to read all your drive every
single time you ask a question.

00:15:24.279 --> 00:15:25.579
It doesn't make sense.

00:15:25.580 --> 00:15:30.940
So RAG has other advantages
beyond even the accuracy.

00:15:30.940 --> 00:15:33.680
On top of that, the
sourcing matters, as well.

00:15:33.679 --> 00:15:35.819
So it might-- RAG
allows you to source.

00:15:35.820 --> 00:15:38.460
We'll talk about all that later.

00:15:38.460 --> 00:15:42.639
But there's always this
debate in the community

00:15:42.639 --> 00:15:46.100
whether a certain method
is actually future proof.

00:15:46.100 --> 00:15:49.740
Because in practice, as compute
power doubles every year,

00:15:49.740 --> 00:15:52.279
let's say, some of the methods
we're learning right now

00:15:52.279 --> 00:15:54.759
might not be relevant
three years from now.

00:15:54.759 --> 00:15:56.740
We don't know, essentially.

00:15:59.960 --> 00:16:04.120
And the analogy that he
makes on context windows

00:16:04.120 --> 00:16:07.440
and why RAG approaches might
be relevant even a long time

00:16:07.440 --> 00:16:09.960
from now is search.

00:16:09.960 --> 00:16:12.139
When you search on
a search engine,

00:16:12.139 --> 00:16:14.977
you still find sources
of information.

00:16:14.977 --> 00:16:16.519
And in fact, in the
background, there

00:16:16.519 --> 00:16:20.639
are very detailed
traversal algorithms

00:16:20.639 --> 00:16:25.199
that rank and find the specific
links that might be the best

00:16:25.200 --> 00:16:29.440
to present to you versus if you
had to read-- imagine you had

00:16:29.440 --> 00:16:31.800
to read the entire web every
single time you're doing

00:16:31.799 --> 00:16:34.809
a search query, without
being able to narrow

00:16:34.809 --> 00:16:36.969
to a certain portion
of the space.

00:16:36.970 --> 00:16:41.889
That might, again,
not be reasonable.

00:16:41.889 --> 00:16:46.210
OK, when we're thinking
of improving LLMs,

00:16:46.210 --> 00:16:50.110
the easiest way we think
of it is along two dimensions.

00:16:50.110 --> 00:16:53.210
One dimension is we are going
to improve the foundation

00:16:53.210 --> 00:16:54.230
model itself.

00:16:54.230 --> 00:17:01.250
So, for example, we move
from GPT-3.5 Turbo, to GPT-4,

00:17:01.250 --> 00:17:04.250
to GPT-4o, to GPT-5.

00:17:04.250 --> 00:17:07.328
Each of those is supposed
to improve the base model.

00:17:07.328 --> 00:17:11.730
GPT-5 is another debate because
it's packaging other models

00:17:11.730 --> 00:17:12.588
within itself.

00:17:12.588 --> 00:17:15.947
But if you're thinking
about 3.5, 4, and 4o,

00:17:15.948 --> 00:17:16.990
that's really what it is.

00:17:16.990 --> 00:17:18.670
The pre-trained model improves.

00:17:18.670 --> 00:17:20.810
And so you should
see your performance

00:17:20.809 --> 00:17:22.809
improve on your tasks.

00:17:22.809 --> 00:17:27.129
But the other dimension is
we can actually engineer--

00:17:27.130 --> 00:17:30.390
leverage the LLM in a
way that makes it better.

00:17:30.390 --> 00:17:34.070
So you can prompt
simply GPT-4o.

00:17:34.069 --> 00:17:38.409
You can change some prompts
and improve the prompt,

00:17:38.410 --> 00:17:40.070
and it will improve
the performance.

00:17:40.069 --> 00:17:41.189
It's been shown.

00:17:41.190 --> 00:17:42.930
You can even put
a RAG around it.

00:17:42.930 --> 00:17:45.610
You can put an agentic
workflow around it.

00:17:45.609 --> 00:17:49.250
You can even put a
multi-agent system around it.

00:17:49.250 --> 00:17:52.630
And that is another dimension
for you to improve performance.

00:17:52.630 --> 00:17:54.870
So that's how I want you
to think about it-- which

00:17:54.869 --> 00:17:56.750
LLM I'm using, and
then how can I maximize

00:17:56.750 --> 00:17:59.255
the performance of that LLM?

00:17:59.255 --> 00:18:02.690
This lecture is about
the vertical axis.

00:18:02.690 --> 00:18:04.940
Those are the methods
that we will see together.

00:18:08.829 --> 00:18:11.470
Sounds good for
the introduction.

00:18:11.470 --> 00:18:14.549
So let's move to
prompt engineering.

00:18:14.549 --> 00:18:17.230
I'm going to start with
an interesting study just

00:18:17.230 --> 00:18:20.870
to motivate why prompt
engineering matters.

00:18:20.869 --> 00:18:26.469
There is a study
from Harvard Business School,

00:18:26.470 --> 00:18:29.559
as well as Wharton,
the business school

00:18:29.559 --> 00:18:31.399
at UPenn,

00:18:31.400 --> 00:18:34.360
that took a subset
of BCG consultants,

00:18:34.359 --> 00:18:37.679
individual contributors,
split them into three groups.

00:18:37.680 --> 00:18:39.660
One group had no access to AI.

00:18:39.660 --> 00:18:41.640
One group had access to--

00:18:41.640 --> 00:18:44.720
I think it was GPT-4.

00:18:44.720 --> 00:18:46.900
And then one group
had access to the LLM,

00:18:46.900 --> 00:18:50.759
but also a training on
how to prompt better.

00:18:50.759 --> 00:18:53.640
And then they observed the
performance of these consultants

00:18:53.640 --> 00:18:56.120
across a wide variety of tasks.

00:18:56.119 --> 00:18:57.799
There's a few things
that they noticed

00:18:57.799 --> 00:18:59.399
that I thought were interesting.

00:18:59.400 --> 00:19:02.920
One is something they
called the jagged frontier,

00:19:02.920 --> 00:19:07.880
meaning that certain tasks
that consultants are doing fall

00:19:07.880 --> 00:19:14.700
beyond the jagged frontier,
meaning AI is not good enough.

00:19:14.700 --> 00:19:18.140
It's not improving
human performance.

00:19:18.140 --> 00:19:20.840
In fact, it's actually
making it worse.

00:19:20.839 --> 00:19:23.439
And some tasks are
within the frontier,

00:19:23.440 --> 00:19:27.360
meaning that AI is actually
significantly improving

00:19:27.359 --> 00:19:32.059
the performance, the speed,
the quality of the consultant.

00:19:32.059 --> 00:19:35.220
Many tasks fell within and
many tasks fell without,

00:19:35.220 --> 00:19:37.640
and they shared their insights.

00:19:37.640 --> 00:19:39.180
But the TLDR is--

00:19:39.180 --> 00:19:42.880
there is a frontier within
which AI is absolutely helping

00:19:42.880 --> 00:19:47.500
and one where they call out
this behavior of falling asleep

00:19:47.500 --> 00:19:51.339
at the wheel, where people
relied on AI on a task that

00:19:51.339 --> 00:19:52.899
was beyond the frontier.

00:19:52.900 --> 00:19:55.860
And in fact, it
ended up being worse

00:19:55.859 --> 00:19:58.459
because the human was not
reviewing the outputs carefully

00:19:58.460 --> 00:19:58.960
enough.

00:20:01.740 --> 00:20:04.539
They did note that the
group that was trained

00:20:04.539 --> 00:20:08.139
was the best, better than the
group that was not trained

00:20:08.140 --> 00:20:10.740
on prompt engineering,
which also motivates why

00:20:10.740 --> 00:20:14.700
this lecture matters, so
that you're within that group

00:20:14.700 --> 00:20:15.940
afterwards.

00:20:15.940 --> 00:20:20.340
Another insight was the
centaurs and the cyborgs.

00:20:20.339 --> 00:20:22.539
They noticed that
consultants had the tendency

00:20:22.539 --> 00:20:24.899
to work with AI in
one of two ways,

00:20:24.900 --> 00:20:29.269
and you might, yourself, be
part of one of these groups.

00:20:29.269 --> 00:20:31.750
The centaurs are
mythical creatures

00:20:31.750 --> 00:20:35.190
that are half human, half--

00:20:35.190 --> 00:20:38.529
I think, half, what, horses?

00:20:38.529 --> 00:20:39.029
Yeah?

00:20:39.029 --> 00:20:39.750
Horses?

00:20:39.750 --> 00:20:42.190
Half horses, half something.

00:20:42.190 --> 00:20:45.850
And those were individuals
that would divide and delegate.

00:20:45.849 --> 00:20:48.369
They might give a pretty
big task to the AI.

00:20:48.369 --> 00:20:51.229
So imagine you're working on a
PowerPoint, which consultants

00:20:51.230 --> 00:20:52.870
are known to do.

00:20:52.869 --> 00:20:55.467
You might actually write
a very long prompt on how

00:20:55.468 --> 00:20:57.509
you want it to do your
PowerPoint and then let it

00:20:57.509 --> 00:20:59.069
work for some time
and then come back

00:20:59.069 --> 00:21:02.129
and it's done, whereas others
would act as cyborgs.

00:21:02.130 --> 00:21:06.390
Cyborgs are fully blended,
bionic human robots,

00:21:06.390 --> 00:21:10.630
part human, part robot, augmented
with robotic parts.

00:21:10.630 --> 00:21:13.490
And those individuals will
not delegate fully a task.

00:21:13.490 --> 00:21:16.230
They would actually work
super quickly with the model

00:21:16.230 --> 00:21:17.370
back and forth.

00:21:17.369 --> 00:21:20.149
I find that a lot of students
are actually working more

00:21:20.150 --> 00:21:24.277
like cyborgs than centaurs,
while maybe in the enterprise,

00:21:24.277 --> 00:21:26.110
when you're trying to
automate the workflow,

00:21:26.109 --> 00:21:29.477
you're thinking
more like a centaur.

00:21:29.478 --> 00:21:31.269
That's just something
good to keep in mind.

00:21:31.269 --> 00:21:33.311
Also, a lot of companies
will tell you, oh, we're

00:21:33.311 --> 00:21:34.849
hiring prompt
engineers, et cetera.

00:21:34.849 --> 00:21:36.949
It's a career.
I don't buy that.

00:21:36.950 --> 00:21:39.158
I think it's just a skill
that everybody should have.

00:21:39.157 --> 00:21:40.866
You're not going to
make a career out

00:21:40.866 --> 00:21:42.690
of prompt engineering,
but you're probably

00:21:42.690 --> 00:21:46.500
going to use it as a very
powerful skill in your career.

00:21:49.809 --> 00:21:52.889
So let's talk about basic
prompt design principles.

00:21:52.890 --> 00:21:56.009
I'm giving you a very
simple prompt here.

00:21:56.009 --> 00:21:58.210
Summarize this document,
and then the document

00:21:58.210 --> 00:22:00.250
is uploaded alongside it.

00:22:00.250 --> 00:22:04.690
And the model doesn't have
much context around--

00:22:04.690 --> 00:22:06.130
what should the summary be?

00:22:06.130 --> 00:22:07.430
How long should the summary be?

00:22:07.430 --> 00:22:09.650
What should it talk
about, et cetera?

00:22:09.650 --> 00:22:14.390
You can actually improve these
prompts by doing something like

00:22:14.390 --> 00:22:18.490
summarize this 10-page
scientific paper on renewable

00:22:18.490 --> 00:22:22.410
energy in five bullet points,
focusing on key findings

00:22:22.410 --> 00:22:25.019
and implications
for policymakers.

00:22:25.019 --> 00:22:26.220
That's already better.

00:22:26.220 --> 00:22:28.620
You're specifying the
audience, and it's

00:22:28.619 --> 00:22:30.279
going to tailor it
to the audience.

00:22:30.279 --> 00:22:33.059
You're saying that you
want five bullet points,

00:22:33.059 --> 00:22:35.899
and you want to focus
only on key findings.

00:22:35.900 --> 00:22:39.060
That's a better prompt,
you would argue.

00:22:39.059 --> 00:22:41.798
How could you even make
this prompt better?

00:22:41.798 --> 00:22:43.339
What are other
techniques that you've

00:22:43.339 --> 00:22:47.649
heard of or tried yourself that
could make this one shot prompt

00:22:47.650 --> 00:22:48.150
better?

00:22:53.180 --> 00:22:53.980
Yeah.

00:22:53.980 --> 00:22:57.139
[INAUDIBLE]

00:22:57.138 --> 00:22:58.044
OK.

00:22:58.045 --> 00:22:58.880
Right, an example.

00:22:58.880 --> 00:23:02.800
So you'd say: here is an
example of a great summary.

00:23:02.799 --> 00:23:03.299
Yeah.

00:23:03.299 --> 00:23:03.841
You're right.

00:23:03.842 --> 00:23:05.420
That's a good idea.

00:23:05.420 --> 00:23:06.140
[INAUDIBLE]

00:23:08.900 --> 00:23:10.140
Very popular technique.

00:23:10.140 --> 00:23:15.060
Act like a renewable energy
expert giving a conference

00:23:15.059 --> 00:23:17.019
at Davos, let's say, yeah.

00:23:17.019 --> 00:23:18.500
That's great.

00:23:18.500 --> 00:23:20.724
Someone-- yeah.

00:23:20.724 --> 00:23:22.449
Say you're really good at it.

00:23:22.450 --> 00:23:23.430
Yeah.

00:23:23.430 --> 00:23:25.769
You are the best in
the world at this.

00:23:25.769 --> 00:23:26.389
Explain.

00:23:26.390 --> 00:23:26.890
Yeah.

00:23:26.890 --> 00:23:28.570
Actually, I mean,
these things work.

00:23:28.569 --> 00:23:32.849
It's funny, but it does work
to say act like x, y, z.

00:23:32.849 --> 00:23:34.649
It's a very popular
prompt template.

00:23:34.650 --> 00:23:36.090
We'll see a few examples.

00:23:36.089 --> 00:23:37.169
What else could you do?

00:23:40.990 --> 00:23:41.910
Yes.

00:23:41.910 --> 00:23:46.190
Of course, you'd ask the model
to critique its own output.

00:23:46.190 --> 00:23:47.610
Critique its own output.

00:23:47.609 --> 00:23:48.889
So you're using reflection.

00:23:48.890 --> 00:23:50.430
So you might actually
do one output

00:23:50.430 --> 00:23:52.890
and then ask it to critique
it and then give it back.

00:23:52.890 --> 00:23:53.390
Yeah.

00:23:53.390 --> 00:23:53.978
We see that.

00:23:53.978 --> 00:23:54.769
That's a great one.

00:23:54.769 --> 00:23:56.750
That's the one that
probably works best

00:23:56.750 --> 00:23:59.529
within those typically,
but we see some examples.
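The reflection pattern suggested here can be sketched in a few lines. `call_llm` is a hypothetical stub standing in for a real model API call, and the prompt wording is illustrative, not a prescribed template:

```python
# Sketch of the "reflection" pattern: generate a draft, ask the model to
# critique it, then revise using that critique. `call_llm` is a placeholder
# for your real model API (OpenAI, Anthropic, etc.).

def call_llm(prompt: str) -> str:
    # Placeholder: swap in a real API call here.
    return f"<response to: {prompt[:40]}...>"

def reflect_and_revise(task: str) -> str:
    draft = call_llm(f"Complete this task:\n{task}")
    critique = call_llm(
        "Critique the following answer. List concrete weaknesses:\n" + draft
    )
    revised = call_llm(
        f"Task: {task}\nDraft: {draft}\nCritique: {critique}\n"
        "Rewrite the draft, fixing every weakness in the critique."
    )
    return revised
```

In practice, each of the three calls can use the same model; the gain comes purely from making the critique an explicit intermediate step.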

00:23:59.529 --> 00:24:00.549
What else?

00:24:00.549 --> 00:24:01.365
Yeah.

00:24:01.365 --> 00:24:03.150
Break the task down into steps.

00:24:03.150 --> 00:24:03.650
OK.

00:24:03.650 --> 00:24:05.370
Break the task down into steps.

00:24:05.369 --> 00:24:06.729
You know how that is called?

00:24:06.730 --> 00:24:07.829
No.

00:24:07.829 --> 00:24:08.329
OK.

00:24:08.329 --> 00:24:09.349
Chain of thought.

00:24:09.349 --> 00:24:12.789
So this is actually
a popular method

00:24:12.789 --> 00:24:15.369
that research has shown
to improve performance.

00:24:15.369 --> 00:24:17.669
You could actually give
a clear instruction

00:24:17.670 --> 00:24:19.810
and also encourage the
model to think step

00:24:19.809 --> 00:24:22.629
by step: approach the
task step by step,

00:24:22.630 --> 00:24:24.390
and do not skip any step.

00:24:24.390 --> 00:24:26.990
And then you give it some
steps, such as step one,

00:24:26.990 --> 00:24:29.390
identify the three most
important findings.

00:24:29.390 --> 00:24:31.450
Step two, explain
how each finding

00:24:31.450 --> 00:24:33.590
impacts renewable energy policy.

00:24:33.589 --> 00:24:36.209
Step three, write the
five-bullet summary

00:24:36.210 --> 00:24:39.630
with each point addressing
a finding, et cetera.

00:24:39.630 --> 00:24:45.170
So chain of thought, I linked
the paper from 2023 that

00:24:45.170 --> 00:24:46.590
popularized chain of thought.

00:24:46.589 --> 00:24:48.369
Chain of thought
is very popular

00:24:48.369 --> 00:24:50.076
right now, especially
in AI startups

00:24:50.076 --> 00:24:51.660
that are trying to
control their LLMs.
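The step-by-step instruction described above can be assembled programmatically. This is a minimal chain-of-thought style prompt builder for the summarization example; the step wording mirrors the lecture and is illustrative only:

```python
# Build a chain-of-thought prompt: clear instruction, explicit numbered
# steps, then the document. The steps follow the lecture's example.

def build_cot_prompt(document: str) -> str:
    steps = [
        "Identify the three most important findings.",
        "Explain how each finding impacts renewable energy policy.",
        "Write a five-bullet summary, each point addressing one finding.",
    ]
    numbered = "\n".join(f"Step {i}: {s}" for i, s in enumerate(steps, 1))
    return (
        "Approach the task step by step and do not skip any step.\n"
        f"{numbered}\n\nDocument:\n{document}"
    )
```

Keeping the steps in a list makes it easy to add, reorder, or reword them as you iterate on the prompt.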

00:24:55.009 --> 00:24:56.450
OK.

00:24:56.450 --> 00:25:01.289
To go back to your examples
about act like XYZ, what

00:25:01.289 --> 00:25:03.930
I like to do, Andrew Ng
also talks about that,

00:25:03.930 --> 00:25:06.190
is to look at other
people's prompts.

00:25:06.190 --> 00:25:10.170
And in fact, online, you have
a lot of prompt repositories

00:25:10.170 --> 00:25:11.930
for free on GitHub.

00:25:11.930 --> 00:25:16.289
In fact, I linked the Awesome
ChatGPT Prompts repo on GitHub,

00:25:16.289 --> 00:25:19.099
where you have so many
examples of great prompts

00:25:19.099 --> 00:25:22.159
that engineers have built. They
said it works great for us,

00:25:22.160 --> 00:25:23.740
and they published it online.

00:25:23.740 --> 00:25:27.019
And a lot of them
start with act as.

00:25:27.019 --> 00:25:29.259
Act as a Linux terminal.

00:25:29.259 --> 00:25:31.119
Act as an English translator.

00:25:31.119 --> 00:25:34.209
Act like a position
interviewer, et cetera.

00:25:37.059 --> 00:25:38.779
The advantage of
a prompt template

00:25:38.779 --> 00:25:42.059
is that you can actually
put it in your code

00:25:42.059 --> 00:25:44.799
and scale it for
many user requests.

00:25:44.799 --> 00:25:48.659
So let me give you an
example from Workera.

00:25:48.660 --> 00:25:50.920
Workera evaluates skills.

00:25:50.920 --> 00:25:52.980
Some of you have taken
the assessments already.

00:25:52.980 --> 00:25:56.660
And it tries to personalize
the experience for the user.

00:25:56.660 --> 00:25:59.600
And in fact, if you actually
read an HR system

00:25:59.599 --> 00:26:01.639
in an enterprise,

00:26:01.640 --> 00:26:06.140
you might read that Jane is
a product manager, level 3,

00:26:06.140 --> 00:26:10.620
and she is in the US, and her
preferred language is English.

00:26:10.619 --> 00:26:13.059
And actually, that
metadata can be

00:26:13.059 --> 00:26:15.842
inserted into a prompt template
that will personalize

00:26:15.843 --> 00:26:16.759
the experience for Jane.

00:26:16.759 --> 00:26:22.720
And similarly for Joe, whose
preferred language is Spanish,

00:26:22.720 --> 00:26:24.500
it will tailor it to Joe.

00:26:24.500 --> 00:26:26.099
And that's called
a prompt template.
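A prompt template like the one just described is essentially a string with slots filled from user metadata. This is a sketch; the field names (role, level, country, language) are illustrative, not Workera's actual schema:

```python
# A prompt template personalized with HR metadata, as in the Workera
# example. Field names are made up for illustration.

TEMPLATE = (
    "Act like a great AI mentor that helps people in their career.\n"
    "The user is a {role}, level {level}, based in {country}.\n"
    "Always respond in {language}."
)

def render_prompt(user: dict) -> str:
    # Fill the template slots from the user's metadata record.
    return TEMPLATE.format(**user)

jane = {"role": "product manager", "level": 3,
        "country": "US", "language": "English"}
joe = {"role": "engineer", "level": 2,
       "country": "Spain", "language": "Spanish"}
```

Because the template lives in code, the same personalization scales to every user request without hand-writing prompts.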

00:26:26.099 --> 00:26:27.473
[INAUDIBLE]

00:26:34.920 --> 00:26:39.160
So the question is do
the foundation models

00:26:39.160 --> 00:26:41.200
use prompt
templates, or do you

00:26:41.200 --> 00:26:42.840
have to integrate it yourself?

00:26:42.839 --> 00:26:45.319
So the foundation
models probably

00:26:45.319 --> 00:26:47.319
use a system prompt
that you don't see.

00:26:47.319 --> 00:26:50.679
Like when you actually
type into ChatGPT,

00:26:50.680 --> 00:26:55.440
it is possible, it's not public,
that OpenAI behind the scenes

00:26:55.440 --> 00:26:59.580
has like act like a very
helpful assistant for this user.

00:26:59.579 --> 00:27:03.199
And by the way, here is
your memories about the user

00:27:03.200 --> 00:27:05.120
that we kept in a database.

00:27:05.119 --> 00:27:07.000
You can actually
check your memories.

00:27:07.000 --> 00:27:10.059
And then your prompt goes under,
and then the generation starts.

00:27:10.059 --> 00:27:12.179
So probably, they're
using something like that.

00:27:12.180 --> 00:27:15.850
But it doesn't mean you
can't add one yourself.

00:27:15.849 --> 00:27:19.490
So in fact, if you think about a
prompt template for the Workera

00:27:19.490 --> 00:27:22.049
example I was showing,
maybe it starts

00:27:22.049 --> 00:27:25.509
when you call OpenAI by act
like a helpful assistant.

00:27:25.509 --> 00:27:29.410
And then underneath, it's like
act like a great AI mentor that

00:27:29.410 --> 00:27:31.290
helps people in their career.

00:27:31.289 --> 00:27:33.889
And OpenAI's own
prompt template possibly also

00:27:33.890 --> 00:27:36.009
has follow the instructions
from the creator,

00:27:36.009 --> 00:27:37.456
or something like that.

00:27:37.457 --> 00:27:38.040
It's possible.

00:27:41.210 --> 00:27:42.930
Questions about
prompt templates?

00:27:42.930 --> 00:27:45.789
Again, I would encourage you to
go and read examples of prompts.

00:27:45.789 --> 00:27:48.769
Some of them are
quite thoughtful.

00:27:48.769 --> 00:27:51.950
Let's talk about zero shot
versus few shot prompting.

00:27:51.950 --> 00:27:53.529
It came up earlier.

00:27:53.529 --> 00:27:54.629
Here's an example.

00:27:54.630 --> 00:27:57.810
Again, going back to the
categorization of product

00:27:57.809 --> 00:28:01.369
reviews, let's say that
we're working on a task

00:28:01.369 --> 00:28:05.129
where the prompt is classify
the tone of the sentence

00:28:05.130 --> 00:28:07.450
as positive,
negative, or neutral.

00:28:07.450 --> 00:28:12.009
And then you paste the review,
which is the product is fine,

00:28:12.009 --> 00:28:13.450
but I was expecting more.

00:28:16.029 --> 00:28:19.750
If I were to survey the room,
I would bet that some of you

00:28:19.750 --> 00:28:21.289
would say it's negative.

00:28:21.289 --> 00:28:23.007
Some of you would
say it's neutral.

00:28:23.007 --> 00:28:24.590
Because you actually
have a first part

00:28:24.589 --> 00:28:27.089
that is relatively positive.

00:28:27.089 --> 00:28:28.389
It's fine.

00:28:28.390 --> 00:28:30.570
And then the second part,
I was expecting more,

00:28:30.569 --> 00:28:31.889
which is relatively negative.

00:28:31.890 --> 00:28:33.270
So where do you land?

00:28:33.269 --> 00:28:35.269
This can be a
subjective question.

00:28:35.269 --> 00:28:37.987
And maybe in one industry, this
would be considered amazing.

00:28:37.987 --> 00:28:40.070
And another one, it would
be considered really bad

00:28:40.069 --> 00:28:44.029
because people are used to
really glowing reviews.

00:28:44.029 --> 00:28:47.309
And so the way you can actually
align the model to your task

00:28:47.309 --> 00:28:49.309
is by converting that
zero shot prompt.

00:28:49.309 --> 00:28:51.109
Zero shot refers to
the fact that it's not

00:28:51.109 --> 00:28:53.589
being given any example.

00:28:53.589 --> 00:28:56.509
Into a few-shot
prompt, where the model

00:28:56.509 --> 00:29:00.629
is given in the prompt, a set
of examples to align it to what

00:29:00.630 --> 00:29:01.830
you want it to do.

00:29:01.829 --> 00:29:03.710
So the example
here is again, you

00:29:03.710 --> 00:29:06.590
paste the same prompt as
before with the user review.

00:29:06.589 --> 00:29:08.629
And then you add,
here are examples

00:29:08.630 --> 00:29:10.510
of tone classifications.

00:29:10.509 --> 00:29:12.960
These exceeded my
expectations completely.

00:29:12.960 --> 00:29:14.039
Positive.

00:29:14.039 --> 00:29:17.680
It's OK, but I wish
it had more features.

00:29:17.680 --> 00:29:18.920
Negative.

00:29:18.920 --> 00:29:20.800
The service was adequate.

00:29:20.799 --> 00:29:22.799
Neither good nor bad.

00:29:22.799 --> 00:29:23.720
Neutral.

00:29:23.720 --> 00:29:26.000
Now classify the
tone of this sentence

00:29:26.000 --> 00:29:28.839
after you've heard
about these things,

00:29:28.839 --> 00:29:31.839
and the model then
says negative.

00:29:31.839 --> 00:29:33.939
And the reason it says
negative, of course,

00:29:33.940 --> 00:29:39.340
is likely because of the second
example, which was it's OK,

00:29:39.339 --> 00:29:42.439
but I wish it had more features,
which we told the model that

00:29:42.440 --> 00:29:43.519
was negative.

00:29:43.519 --> 00:29:45.599
Because the model saw
that it's aligned now

00:29:45.599 --> 00:29:47.639
with your expectations.

00:29:47.640 --> 00:29:50.640
Few-shot prompts
are very popular.

00:29:50.640 --> 00:29:52.720
And in fact, for
AI startups that

00:29:52.720 --> 00:29:54.559
are slightly more
sophisticated, you

00:29:54.559 --> 00:29:57.940
might see them keep
a prompt up to date.

00:29:57.940 --> 00:30:00.680
Whenever a user says
something, they

00:30:00.680 --> 00:30:02.840
might have a human
label it and then

00:30:02.839 --> 00:30:05.519
add it as few-shot examples
to the relevant

00:30:05.519 --> 00:30:08.000
prompts in their code base.

00:30:08.000 --> 00:30:10.532
You can think of that as
almost building a data set.

00:30:10.532 --> 00:30:12.699
But instead of actually
building a separate data set

00:30:12.700 --> 00:30:15.120
like we've seen with
supervised fine tuning

00:30:15.119 --> 00:30:17.399
and then fine tuning
the model on it,

00:30:17.400 --> 00:30:19.460
you're just putting it
directly in the prompt.

00:30:19.460 --> 00:30:21.740
It turns out it's
probably faster

00:30:21.740 --> 00:30:23.660
to do that if you want
to experiment quickly

00:30:23.660 --> 00:30:25.800
because you don't touch
the model parameters.

00:30:25.799 --> 00:30:27.220
You just update your prompts.

00:30:27.220 --> 00:30:30.460
And if it's text
examples, you can actually

00:30:30.460 --> 00:30:34.759
concatenate so many
examples in a single prompt.

00:30:34.759 --> 00:30:36.339
At some point, it
will be too long,

00:30:36.339 --> 00:30:39.404
and you will not have the
necessary context window.

00:30:39.404 --> 00:30:40.779
But it's a pretty
strong approach

00:30:40.779 --> 00:30:43.309
that is quick to align an LLM.
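Converting the zero-shot classifier into a few-shot prompt amounts to prepending labeled examples. A minimal sketch, with the examples taken from the lecture; the arrow formatting is an arbitrary choice:

```python
# Few-shot tone classification: prepend labeled examples so the model
# aligns with your definition of positive/negative/neutral. Keeping the
# examples in a plain list makes it easy to append newly labeled cases.

EXAMPLES = [
    ("These exceeded my expectations completely.", "Positive"),
    ("It's OK, but I wish it had more features.", "Negative"),
    ("The service was adequate, neither good nor bad.", "Neutral"),
]

def build_few_shot_prompt(sentence: str) -> str:
    shots = "\n".join(f'"{text}" -> {label}' for text, label in EXAMPLES)
    return (
        "Classify the tone of the sentence as "
        "Positive, Negative, or Neutral.\n"
        "Here are examples of tone classifications:\n"
        f"{shots}\n\nNow classify: \"{sentence}\""
    )
```

As the lecture notes, each human-labeled case a team collects can simply be appended to `EXAMPLES`, with the context window as the practical limit.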

00:30:48.819 --> 00:30:49.659
OK?

00:30:49.660 --> 00:30:50.740
Yes.

00:30:50.740 --> 00:30:52.620
[INAUDIBLE]

00:30:57.380 --> 00:31:00.660
So the question was is there
any research on how long

00:31:00.660 --> 00:31:03.540
the prompt can be before
the model essentially loses

00:31:03.539 --> 00:31:06.500
itself or doesn't follow
instructions anymore?

00:31:06.500 --> 00:31:08.589
There is.

00:31:08.589 --> 00:31:11.990
The problem is that research
is outdated every few months

00:31:11.990 --> 00:31:14.390
because models get better.

00:31:14.390 --> 00:31:16.930
And so I don't know where
the state of the art is.

00:31:16.930 --> 00:31:18.870
You can probably find
it online on benchmarks

00:31:18.869 --> 00:31:20.649
on like we see that--

00:31:20.650 --> 00:31:23.310
I give you an example.

00:31:23.309 --> 00:31:27.311
On the Workera product, you
have a voice conversation

00:31:27.311 --> 00:31:28.769
for some of you
that have tried it,

00:31:28.769 --> 00:31:30.849
where you're asked to
explain whatever the prompt asks.

00:31:30.849 --> 00:31:31.909
And then you explain,
and then there's

00:31:31.910 --> 00:31:33.430
a scoring algorithm behind it.

00:31:33.430 --> 00:31:38.310
We know that after eight
turns, the model loses itself.

00:31:38.309 --> 00:31:40.269
After eight turns,
because you always

00:31:40.269 --> 00:31:42.829
paste the previous
user response,

00:31:42.829 --> 00:31:44.552
it just starts going wild.

00:31:44.553 --> 00:31:46.470
And so the techniques
we use in the background

00:31:46.470 --> 00:31:49.416
is we actually create
chapters of the conversation.

00:31:49.416 --> 00:31:51.250
Maybe one chapter is
the first eight turns.

00:31:51.250 --> 00:31:53.458
And then you actually start
over from another prompt.

00:31:53.458 --> 00:31:56.570
You can summarize the first
part of the conversation,

00:31:56.569 --> 00:31:59.549
insert the summary,
and then keep going.

00:31:59.549 --> 00:32:02.309
Those are engineering hacks that
engineers might have figured out

00:32:02.309 --> 00:32:04.309
in the background.

00:32:04.309 --> 00:32:07.049
Because eight turns makes a
prompt quite long actually.
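The "chapters" hack described here can be sketched as follows. The eight-turn limit follows the lecture's observation; `summarize` is a stub standing in for an LLM call that condenses the chapter:

```python
# Once a conversation exceeds MAX_TURNS, close the chapter: replace the
# transcript so far with a summary and continue from there.

MAX_TURNS = 8

def summarize(turns: list[str]) -> str:
    # Placeholder: in practice, an LLM call that condenses the chapter.
    return f"[summary of {len(turns)} turns]"

def add_turn(context: list[str], turn: str) -> list[str]:
    context = context + [turn]
    if len(context) > MAX_TURNS:
        # Keep only the chapter summary plus the newest turn.
        context = [summarize(context[:-1]), context[-1]]
    return context
```

The context passed to the model therefore stays short, at the cost of losing detail from earlier turns that the summary does not capture.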

00:32:13.450 --> 00:32:15.517
Let's move on to chaining.

00:32:15.517 --> 00:32:17.850
Chaining is the most popular
technique out of everything

00:32:17.849 --> 00:32:22.769
we've seen so far in
prompt engineering.

00:32:22.769 --> 00:32:23.990
It's not chain of thought.

00:32:23.990 --> 00:32:26.230
So chain of thought we've
seen is think step by step,

00:32:26.230 --> 00:32:27.509
step 1, step 2, step 3.

00:32:27.509 --> 00:32:28.890
Do not skip any step.

00:32:28.890 --> 00:32:30.090
This is different.

00:32:30.089 --> 00:32:34.109
This is chaining complex
prompts to improve performance,

00:32:34.109 --> 00:32:37.009
and this is what it looks like.

00:32:37.009 --> 00:32:40.049
You take a single step prompt,
such as read this customer

00:32:40.049 --> 00:32:43.329
review and write a
professional response that

00:32:43.329 --> 00:32:46.049
acknowledges their concern,
explains the issue,

00:32:46.049 --> 00:32:48.009
offers a resolution,
and then you

00:32:48.009 --> 00:32:51.450
paste the customer review,
which is I ordered a laptop.

00:32:51.450 --> 00:32:52.950
It arrived three days late.

00:32:52.950 --> 00:32:54.809
The packaging was damaged.

00:32:54.809 --> 00:32:56.230
Very disappointing.

00:32:56.230 --> 00:32:59.009
I needed that urgently for work.

00:32:59.009 --> 00:33:01.089
And then the output
is an email that

00:33:01.089 --> 00:33:04.619
is immediately given
to you by the LLM

00:33:04.619 --> 00:33:08.019
after it reads the prompt.

00:33:08.019 --> 00:33:14.259
So this might work, but it
might be hard to control.

00:33:14.259 --> 00:33:15.680
Because think about it.

00:33:15.680 --> 00:33:18.140
There's multiple steps
that you have listed,

00:33:18.140 --> 00:33:20.860
and everything is embedded
in the same prompt.

00:33:20.859 --> 00:33:24.004
And if you wanted to debug step
by step and know which step is

00:33:24.005 --> 00:33:24.880
weaker, you couldn't.

00:33:24.880 --> 00:33:27.860
You would have everything
mixed together.

00:33:27.859 --> 00:33:32.899
So one advantage of chaining is
you would separate the prompts,

00:33:32.900 --> 00:33:35.280
so that you can debug
them separately.

00:33:35.279 --> 00:33:38.379
And it will also lead
to an easier manner

00:33:38.380 --> 00:33:41.300
to improve your workflow.

00:33:41.299 --> 00:33:44.079
Let's say a first prompt
is extract the key issues.

00:33:44.079 --> 00:33:46.059
Identify the key
concerns mentioned

00:33:46.059 --> 00:33:47.480
in this customer review.

00:33:47.480 --> 00:33:49.620
Paste the customer review.

00:33:49.619 --> 00:33:50.939
Second prompt.

00:33:50.940 --> 00:33:54.460
Using these issues, so
you paste back the issues,

00:33:54.460 --> 00:33:57.100
draft an outline for a
professional response that

00:33:57.099 --> 00:34:00.039
acknowledges concerns,
explains possible reasons,

00:34:00.039 --> 00:34:01.480
and offer a resolution.

00:34:04.279 --> 00:34:06.960
So this is not--

00:34:06.960 --> 00:34:09.179
Prompt number 3, write
the full response.

00:34:09.179 --> 00:34:14.880
So using the outline, write
the professional response.

00:34:14.880 --> 00:34:18.119
And then you get
your final output.

00:34:18.119 --> 00:34:22.000
So in theory, you can tell
me, oh, the second approach

00:34:22.000 --> 00:34:23.699
is better than the
first one.

00:34:23.699 --> 00:34:27.000
But what you can notice
is that we can actually

00:34:27.000 --> 00:34:29.760
test those three prompts
separately from each other

00:34:29.760 --> 00:34:35.480
and determine if we will get the
most gains out of engineering

00:34:35.480 --> 00:34:38.400
the first prompt, optimizing
it, or the second one,

00:34:38.400 --> 00:34:39.619
or the third one.

00:34:39.619 --> 00:34:43.079
We now have three prompts that
are independent from each other.

00:34:43.079 --> 00:34:47.480
And maybe if the
outline was better,

00:34:47.480 --> 00:34:53.260
the performance of the email,
how high the open rate will be,

00:34:53.260 --> 00:34:55.400
or the user satisfaction
on the response

00:34:55.400 --> 00:34:57.320
will actually get higher.

00:34:57.320 --> 00:35:00.910
And so chaining improves
performance,

00:35:00.909 --> 00:35:04.129
but most importantly, helps
you control your workflow

00:35:04.130 --> 00:35:07.930
and debug it more seamlessly.

00:35:07.929 --> 00:35:09.369
Yes.

00:35:09.369 --> 00:35:15.089
So if we know that the three prompts
independently work really well,

00:35:15.090 --> 00:35:17.289
if we combine them
into one prompt,

00:35:17.289 --> 00:35:21.050
and we highlight a step
by step thinking process,

00:35:21.050 --> 00:35:24.850
do we, on average, get
a [INAUDIBLE] by itself,

00:35:24.849 --> 00:35:28.690
or do we still have
to do that breakdown?

00:35:28.690 --> 00:35:30.110
So let me try to rephrase.

00:35:30.110 --> 00:35:32.730
You say, let's say we look
at the first prompt which

00:35:32.730 --> 00:35:37.889
has all three tasks
built in that prompt.

00:35:37.889 --> 00:35:39.069
What exactly do you mean?

00:35:39.070 --> 00:35:41.130
You mean like if we
evaluate the output

00:35:41.130 --> 00:35:43.630
and we measure some user
insight, satisfaction,

00:35:43.630 --> 00:35:45.769
et cetera?

00:35:45.769 --> 00:35:49.250
Why don't we just modify that
prompt and essentially see how

00:35:49.250 --> 00:35:51.110
it improves user satisfaction?

00:35:51.110 --> 00:35:51.610
Yeah.

00:35:51.610 --> 00:35:52.610
[INAUDIBLE]

00:35:54.916 --> 00:35:55.436
I see.

00:35:55.436 --> 00:35:57.890
So why do we need
the three steps?

00:35:57.889 --> 00:35:59.150
I mean, think about it.

00:35:59.150 --> 00:36:02.110
The intermediate output
is what you want to see.

00:36:02.110 --> 00:36:06.630
Like if I'm debugging
the first approach,

00:36:06.630 --> 00:36:09.250
the way I would do it is I
would capture user insights.

00:36:09.250 --> 00:36:10.409
Like here's the email.

00:36:10.409 --> 00:36:11.769
How good was the response?

00:36:11.769 --> 00:36:13.909
Thumbs up, thumbs down.

00:36:13.909 --> 00:36:16.429
Was your issue resolved?

00:36:16.429 --> 00:36:17.539
Thumbs up, thumbs down.

00:36:17.539 --> 00:36:19.289
Those would tell me
how good my prompt is.

00:36:19.289 --> 00:36:21.123
And I can engineer that
prompt, optimize it,

00:36:21.123 --> 00:36:23.510
and I would probably
drive some gains.

00:36:23.510 --> 00:36:26.430
But I will not be able
easily to trace back

00:36:26.429 --> 00:36:28.349
to what the problem was.

00:36:28.349 --> 00:36:30.549
While in the second
approach, not only can I

00:36:30.550 --> 00:36:33.530
use the end-to-end
metrics to improve my process.

00:36:33.530 --> 00:36:35.170
I can also use the
intermediate steps.

00:36:35.170 --> 00:36:38.710
For example, if I look at prompt
2 and I look at the outline

00:36:38.710 --> 00:36:41.750
and I see the outline is
actually, meh, it's not great,

00:36:41.750 --> 00:36:45.630
then I think I can get a lot
of gains out of the outline.

00:36:45.630 --> 00:36:47.930
Or the outline is
actually really good,

00:36:47.929 --> 00:36:50.429
but the last prompt doesn't do
a good job at translating it

00:36:50.429 --> 00:36:51.210
into an email.

00:36:51.210 --> 00:36:54.550
So the outline is exactly
what I want the LLM to do,

00:36:54.550 --> 00:36:57.350
but the translation into
a customer-facing email

00:36:57.349 --> 00:36:58.299
is not good.

00:36:58.300 --> 00:37:01.900
In fact, it doesn't follow
our vocabulary internally.

00:37:01.900 --> 00:37:03.519
Then I know the
third prompt is where

00:37:03.519 --> 00:37:06.039
I would get the most gains.

00:37:06.039 --> 00:37:07.699
So that's what it
allows me to do,

00:37:07.699 --> 00:37:10.519
have intermediate
steps to review.

00:37:10.519 --> 00:37:13.719
Are there any
latency [INAUDIBLE]

00:37:13.719 --> 00:37:14.579
We'll talk about it.

00:37:14.579 --> 00:37:16.179
Are there any latency concerns?

00:37:16.179 --> 00:37:17.279
Yes.

00:37:17.280 --> 00:37:20.440
In certain applications, you
don't want to use a chain

00:37:20.440 --> 00:37:26.012
or you don't want to use a long
chain because it adds latency.

00:37:26.012 --> 00:37:27.179
We'll talk about that later.

00:37:27.179 --> 00:37:28.839
Good point.

00:37:28.840 --> 00:37:32.000
So practically, this is
what chaining complex

00:37:32.000 --> 00:37:33.280
prompts look like.

00:37:33.280 --> 00:37:35.640
You have your first prompt
with your first task.

00:37:35.639 --> 00:37:36.460
It outputs.

00:37:36.460 --> 00:37:39.079
The output is pasted
in the second prompt

00:37:39.079 --> 00:37:41.199
with the second
task being defined.

00:37:41.199 --> 00:37:43.699
The output is then pasted
into the third prompt

00:37:43.699 --> 00:37:46.559
with the third task
being defined and so on.

00:37:46.559 --> 00:37:48.170
That's what it looks
like in practice.
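The three-prompt chain above can be written as code where each step's output is pasted into the next prompt. `call_llm` is a hypothetical stub for a real model call; the prompt texts follow the lecture's example:

```python
# Chaining: three separate prompts, each consuming the previous output.
# Separating them lets you log and debug each intermediate result.

def call_llm(prompt: str) -> str:
    # Placeholder: swap in a real API call here.
    return f"<output for: {prompt.splitlines()[0]}>"

def respond_to_review(review: str) -> str:
    issues = call_llm(
        "Identify the key concerns mentioned in this customer review.\n"
        + review
    )
    outline = call_llm(
        "Using these issues, draft an outline for a professional response "
        "that acknowledges concerns, explains possible reasons, and offers "
        "a resolution.\n" + issues
    )
    email = call_llm(
        "Using the outline, write the full professional response.\n"
        + outline
    )
    return email
```

Because `issues` and `outline` are ordinary variables, each intermediate step can be inspected, logged, or evaluated independently, which is exactly the debugging advantage discussed above.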

00:37:52.179 --> 00:37:52.679
Super.

00:37:55.860 --> 00:37:58.559
We'll talk more later
about testing your prompts,

00:37:58.559 --> 00:38:00.799
but there are
methods now to do it,

00:38:00.800 --> 00:38:03.380
and we'll see later in this
lecture with our case study

00:38:03.380 --> 00:38:06.300
how we can test our prompts.

00:38:06.300 --> 00:38:11.900
But here is an example
of how you might do it.

00:38:11.900 --> 00:38:18.220
You might have a
summarization workflow prompt

00:38:18.219 --> 00:38:19.359
that is the baseline.

00:38:19.360 --> 00:38:21.420
It's a single prompt.

00:38:21.420 --> 00:38:23.659
You might have a
refined summarization

00:38:23.659 --> 00:38:26.199
which is a modified
prompt of this,

00:38:26.199 --> 00:38:30.460
or a workflow with a chain.

00:38:30.460 --> 00:38:34.380
And then you have your test
case, which is the input

00:38:34.380 --> 00:38:36.780
that you want to
summarize, let's say.

00:38:36.780 --> 00:38:38.900
And then you have
the generated output.

00:38:38.900 --> 00:38:42.559
And you can have humans
go and rate these outputs.

00:38:42.559 --> 00:38:46.380
And you would notice that the
baseline is better or worse

00:38:46.380 --> 00:38:47.780
than the refined prompt.

00:38:47.780 --> 00:38:51.260
Of course, this manual
approach takes time,

00:38:51.260 --> 00:38:53.560
but it's a good way to start.

00:38:53.559 --> 00:38:56.994
And usually, the advice is
get hands on at the beginning

00:38:56.994 --> 00:38:58.869
because you would quickly
notice some issues,

00:38:58.869 --> 00:39:01.589
and it will give you better
intuition on what tweaks

00:39:01.590 --> 00:39:03.470
can lead to better performance.

00:39:03.469 --> 00:39:05.549
However, if you wanted
to scale that system

00:39:05.550 --> 00:39:08.110
across many products, many
parts of your code base,

00:39:08.110 --> 00:39:10.910
you might want to find a
way to do that automatically

00:39:10.909 --> 00:39:14.369
without asking humans to
review and grade summaries.

00:39:14.369 --> 00:39:19.309
One approach is
to use platforms,

00:39:19.309 --> 00:39:23.630
like at Workera, our team uses a
platform called promptfoo that

00:39:23.630 --> 00:39:26.950
allows you to actually
automate part of this testing.

00:39:26.949 --> 00:39:30.469
In a nutshell,
what it does is it

00:39:30.469 --> 00:39:35.489
can allow you to run the same
prompt with five different LLMs

00:39:35.489 --> 00:39:37.269
immediately, put
everything in a table.

00:39:37.269 --> 00:39:40.429
That makes it super easy for
a human to grade, let's say.

00:39:40.429 --> 00:39:46.659
Or alternatively, it might
allow you to define LLM judges.

00:39:46.659 --> 00:39:50.149
LLM judges can come
in different flavors.

00:39:50.150 --> 00:39:52.450
For example, I can
have an LLM judge that

00:39:52.449 --> 00:39:54.789
does a pairwise comparison.

00:39:54.789 --> 00:39:58.090
So what the LLM is asked to
do is here are two summaries.

00:39:58.090 --> 00:40:01.210
Just tell me which one is
better than the other one.

00:40:01.210 --> 00:40:02.630
That's what the LLM does.

00:40:02.630 --> 00:40:04.690
And that can be used
as a proxy for how good

00:40:04.690 --> 00:40:08.329
the summarization baseline
versus the refined version is.

00:40:08.329 --> 00:40:11.889
Another way to do
an LLM judge is

00:40:11.889 --> 00:40:14.349
if you do it for a
single answer grading,

00:40:14.349 --> 00:40:18.489
so here's a summary
graded from 1 to 5.

00:40:18.489 --> 00:40:21.769
And then you can go
even deeper and do

00:40:21.769 --> 00:40:24.550
a reference-guided
pairwise comparison.

00:40:24.550 --> 00:40:25.870
Or you add also a rubric.

00:40:25.869 --> 00:40:30.697
You say a 5 is when a summary
is below 100 characters.

00:40:30.697 --> 00:40:31.489
I'm just making up.

00:40:31.489 --> 00:40:33.029
Below 100 characters.

00:40:33.030 --> 00:40:35.010
Mentions at least
three key points

00:40:35.010 --> 00:40:38.182
that are distinct and starts
with a first sentence that

00:40:38.182 --> 00:40:40.349
displays the overview and
then goes into the detail.

00:40:40.349 --> 00:40:42.190
That's a great summary,
a 5 out of 5.

00:40:42.190 --> 00:40:48.909
0 is the LLM failed to summarize
and actually was very verbose,

00:40:48.909 --> 00:40:49.609
let's say.

00:40:49.610 --> 00:40:52.539
And so you put a
rubric behind it,

00:40:52.539 --> 00:40:55.059
and you have an LLM judge
following the rubric.

00:40:55.059 --> 00:40:57.199
Of course, you can now
pair different techniques.

00:40:57.199 --> 00:40:58.879
You can add few-shot
examples for the rubric.

00:40:58.880 --> 00:41:02.960
You can actually give examples
of 5 out of 5s, 4 out of 5s,

00:41:02.960 --> 00:41:06.460
3 out of 5s, because now
you can mix multiple techniques.
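The two judge flavors described here, rubric-based single-answer grading and pairwise comparison, reduce to building judge prompts like these. The rubric wording is made up, echoing the lecture's example:

```python
# Two LLM-as-judge prompt builders: a rubric-based single-answer grader
# and a pairwise comparator. Rubric criteria are illustrative only.

RUBRIC = {
    5: "Under 100 characters, mentions at least three distinct key "
       "points, opens with an overview sentence.",
    0: "Fails to summarize, or is very verbose.",
}

def single_answer_judge_prompt(summary: str) -> str:
    rubric_text = "\n".join(f"{score}: {desc}"
                            for score, desc in RUBRIC.items())
    return (
        "Grade this summary from 0 to 5 using the rubric.\n"
        f"Rubric:\n{rubric_text}\n\nSummary:\n{summary}\n\nGrade:"
    )

def pairwise_judge_prompt(a: str, b: str) -> str:
    return (
        "Here are two summaries. Answer only 'A' or 'B': "
        "which is better?\n"
        f"A: {a}\nB: {b}"
    )
```

Either prompt would then be sent to a judge model, and the parsed grade or preference becomes the automated proxy metric for comparing prompt variants.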

00:41:06.460 --> 00:41:11.220
Does that make sense?

00:41:11.219 --> 00:41:11.819
Yeah.

00:41:11.820 --> 00:41:12.620
OK.

00:41:12.619 --> 00:41:15.460
So that was the second
section on prompt engineering

00:41:15.460 --> 00:41:19.179
or the first line
of optimization.

00:41:19.179 --> 00:41:22.619
Now, let's say you've
exhausted all your chances

00:41:22.619 --> 00:41:24.779
for prompt
engineering, and you're

00:41:24.780 --> 00:41:28.300
thinking about actually touching
the model, modifying its weights

00:41:28.300 --> 00:41:31.580
or fine tuning it
in other words.

00:41:31.579 --> 00:41:34.900
I was telling you, I'm
not a fan of fine tuning.

00:41:34.900 --> 00:41:37.940
There's a few reasons why.

00:41:37.940 --> 00:41:42.220
One, it requires substantial
labeled data typically

00:41:42.219 --> 00:41:43.079
to fine tune.

00:41:43.079 --> 00:41:46.500
Although now, there
are approaches

00:41:46.500 --> 00:41:48.699
that are getting better
at fine tuning that

00:41:48.699 --> 00:41:52.299
look more like few-shot prompting
than fine tuning, actually.

00:41:52.300 --> 00:41:54.600
It's sort of merging.

00:41:54.599 --> 00:41:56.097
Although one
modifies the weight,

00:41:56.097 --> 00:41:57.639
the other doesn't
modify the weights.

00:41:57.639 --> 00:42:01.099
Fine tuned models may also
overfit to specific data.

00:42:01.099 --> 00:42:04.000
We're going to see a
funny example actually.

00:42:04.000 --> 00:42:06.579
Losing their general
purpose utility.

00:42:06.579 --> 00:42:08.480
So you might fine tune a model.

00:42:08.480 --> 00:42:11.300
And actually, when someone
asks a pretty generic question,

00:42:11.300 --> 00:42:12.840
it doesn't do well anymore.

00:42:12.840 --> 00:42:14.220
It might do well on your task.

00:42:14.219 --> 00:42:15.699
So it might be relevant or not.

00:42:15.699 --> 00:42:17.659
And then it's time
and cost-intensive.

00:42:17.659 --> 00:42:19.159
That's my main problem.

00:42:19.159 --> 00:42:24.639
And at Workera, we
steer away from fine

00:42:24.639 --> 00:42:26.440
tuning as much as possible.

00:42:26.440 --> 00:42:28.932
Because by the time you're
done fine tuning your model,

00:42:28.932 --> 00:42:30.599
the next model is
out, and it's actually

00:42:30.599 --> 00:42:33.559
beating your fine tuned
version of the previous model.

00:42:33.559 --> 00:42:36.719
So I would steer away from
fine tuning as much as you can.

00:42:36.719 --> 00:42:39.399
The advantage of the prompt
engineering methods we've seen

00:42:39.400 --> 00:42:43.800
is you can put the next best
pre-trained model directly

00:42:43.800 --> 00:42:44.917
in your code.

00:42:44.916 --> 00:42:46.500
It will update
everything immediately.

00:42:46.500 --> 00:42:50.449
Fine tuning doesn't
work like that.

00:42:50.449 --> 00:42:53.250
There are advantages though
where it still makes sense.

00:42:53.250 --> 00:42:56.130
If the task requires repeated
high precision outputs

00:42:56.130 --> 00:42:58.570
such as legal,
scientific explanation

00:42:58.570 --> 00:43:01.289
and if the general
purpose LLM struggles

00:43:01.289 --> 00:43:03.449
with domain-specific language.

00:43:03.449 --> 00:43:07.649
So let's look at a
quick example together,

00:43:07.650 --> 00:43:12.690
which is an example
from Ros Lazerowitz.

00:43:12.690 --> 00:43:15.929
I think it was a couple of
years ago, September '23,

00:43:15.929 --> 00:43:22.829
where Ros tried to
do Slack fine tuning.

00:43:22.829 --> 00:43:26.489
So he looked at a lot of Slack
messages within his company.

00:43:26.489 --> 00:43:28.609
And he was like, I'm
going to fine tune

00:43:28.610 --> 00:43:32.090
a model that speaks like us or
operates like us because this

00:43:32.090 --> 00:43:33.090
is how we work.

00:43:33.090 --> 00:43:37.970
This is the data that represents
how people work at the company.

00:43:37.969 --> 00:43:42.529
And so he actually went ahead
and fine tuned the model,

00:43:42.530 --> 00:43:44.769
gave it a prompt,
like, hey, write--

00:43:44.769 --> 00:43:47.030
he was delegating to the model.

00:43:47.030 --> 00:43:50.510
A 500-word blog post
on prompt engineering.

00:43:50.510 --> 00:43:55.990
And the model responded, I shall
work on that in the morning.

00:43:55.989 --> 00:44:00.049
And then he tries to push the
model a little further and say,

00:44:00.050 --> 00:44:01.950
it's morning now.

00:44:01.949 --> 00:44:04.489
And the model said,
I'm writing right now.

00:44:04.489 --> 00:44:06.789
It's 6:30 AM here.

00:44:06.789 --> 00:44:07.610
Write it now.

00:44:10.670 --> 00:44:12.130
OK, I shall write it now.

00:44:12.130 --> 00:44:14.110
I actually don't know what
you would like me to say

00:44:14.110 --> 00:44:15.590
about prompt engineering.

00:44:15.590 --> 00:44:17.030
I can only describe the process.

00:44:17.030 --> 00:44:19.030
The only thing that comes
to mind for a headline

00:44:19.030 --> 00:44:21.030
is how do we build prompt?

00:44:21.030 --> 00:44:25.670
It's kind of a funny example for
fine tuning because it's true

00:44:25.670 --> 00:44:27.630
that it went wrong.

00:44:27.630 --> 00:44:29.630
Like he was supposed
to think like I want

00:44:29.630 --> 00:44:32.269
the model to speak
like us at work.

00:44:32.269 --> 00:44:34.829
And it ended up
acting like people

00:44:34.829 --> 00:44:36.929
and not actually
following instructions.

00:44:40.190 --> 00:44:42.860
So one example why I would
steer away from fine tuning.

00:44:47.300 --> 00:44:47.800
Super.

00:44:51.679 --> 00:44:54.199
Let's talk about RAGs.

00:44:54.199 --> 00:44:55.500
RAGs is important.

00:44:55.500 --> 00:44:58.420
It's important to be out there
and at least have the basics.

00:44:58.420 --> 00:45:00.579
It's a very common interview
question, by the way.

00:45:00.579 --> 00:45:02.799
If you go interview
for a job, they

00:45:02.800 --> 00:45:04.720
might ask you to
explain in a nutshell

00:45:04.719 --> 00:45:06.659
to a five-year-old
what is a RAG.

00:45:06.659 --> 00:45:09.480
And hopefully after that,
you'll be able to do it.

00:45:09.480 --> 00:45:14.880
So we've seen some of the
challenges with standalone LLMs.

00:45:14.880 --> 00:45:19.200
Those challenges include the
context window being small,

00:45:19.199 --> 00:45:21.559
the fact that it's hard
to remember details

00:45:21.559 --> 00:45:26.960
within a large context window,
knowledge gaps, cutoff dates,

00:45:26.960 --> 00:45:28.059
as we mentioned earlier.

00:45:28.059 --> 00:45:29.779
The model might be
trained up to a date,

00:45:29.780 --> 00:45:33.040
and then it cannot follow
the trends or be up to date.

00:45:33.039 --> 00:45:34.440
Hallucinations.

00:45:34.440 --> 00:45:35.920
There are some fields.

00:45:35.920 --> 00:45:37.639
Think about medical
diagnosis, where

00:45:37.639 --> 00:45:39.139
hallucinations are very costly.

00:45:39.139 --> 00:45:41.440
You can't afford
a hallucination.

00:45:41.440 --> 00:45:45.450
Even in education, imagine
deploying a model for the US

00:45:45.449 --> 00:45:47.937
youth education,
and it hallucinates,

00:45:47.938 --> 00:45:49.730
and it teaches millions
of people something

00:45:49.730 --> 00:45:50.730
completely wrong.

00:45:50.730 --> 00:45:52.690
It's a problem.

00:45:52.690 --> 00:45:54.889
And then lack of sources.

00:45:54.889 --> 00:45:57.389
A lot of fields love sources.

00:45:57.389 --> 00:45:59.609
Research fields love sources.

00:45:59.610 --> 00:46:01.650
Education loves sources.

00:46:01.650 --> 00:46:04.490
Legal loves sources as well.

00:46:04.489 --> 00:46:08.969
And so the pre-trained LLM
doesn't do a good job of citing sources.

00:46:08.969 --> 00:46:13.609
And in fact, if you have tried
to find sources on a plain LLM,

00:46:13.610 --> 00:46:15.190
it actually hallucinates a lot.

00:46:15.190 --> 00:46:16.710
It makes up research papers.

00:46:16.710 --> 00:46:20.170
It just lists like
completely fake stuff.

00:46:20.170 --> 00:46:23.490
So how do we solve
that with a RAG?

00:46:23.489 --> 00:46:28.049
RAG integrates with external
knowledge sources, databases,

00:46:28.050 --> 00:46:31.010
documents, APIs.

00:46:31.010 --> 00:46:35.270
It ensures that answers are
more accurate, up to date,

00:46:35.269 --> 00:46:38.150
and grounded because you can
actually update your document.

00:46:38.150 --> 00:46:40.630
Your drive is always up to date.

00:46:40.630 --> 00:46:43.849
I mean, ideally, you're always
pushing new documents to it.

00:46:43.849 --> 00:46:47.730
And when you query, what is
our Q4 performance in sales?

00:46:47.730 --> 00:46:51.230
Hopefully there is the last
board deck in the drive,

00:46:51.230 --> 00:46:54.630
and it can read the
last board deck.

00:46:54.630 --> 00:46:56.210
And more developer control.

00:46:56.210 --> 00:47:00.309
We'll see why RAGs allow
for targeted customization

00:47:00.309 --> 00:47:02.730
without actually requiring
the retraining of the model.

00:47:02.730 --> 00:47:05.309
In fact, you don't touch
the model with RAGs.

00:47:05.309 --> 00:47:08.829
It's really a technique that
is put on top of the model.

00:47:08.829 --> 00:47:11.789
So to see an example
of a RAG, this

00:47:11.789 --> 00:47:16.070
is a question answering
application where

00:47:16.070 --> 00:47:21.710
we're in the medical field,
and a user is asking a query,

00:47:21.710 --> 00:47:26.190
what are the side
effects of drug X?

00:47:26.190 --> 00:47:27.490
This is an important question.

00:47:27.489 --> 00:47:28.689
You can't hallucinate.

00:47:28.690 --> 00:47:29.690
You need to source.

00:47:29.690 --> 00:47:31.050
You need to be up to date.

00:47:31.050 --> 00:47:35.390
Maybe there is a new
update to that drug that

00:47:35.389 --> 00:47:37.769
is now in the database,
and you need to read that.

00:47:37.769 --> 00:47:41.920
So a RAG is a great example of
what you would want to use here.

00:47:41.920 --> 00:47:43.960
The way it works is
you have your knowledge

00:47:43.960 --> 00:47:46.840
base of a bunch of documents.

00:47:46.840 --> 00:47:49.960
What you do is you
use an embedding

00:47:49.960 --> 00:47:52.079
to embed those
documents into lower

00:47:52.079 --> 00:47:54.519
dimensional representations.

00:47:54.519 --> 00:47:59.679
So for example, if the
document is a PDF, a long PDF,

00:47:59.679 --> 00:48:02.940
you might read the
PDF, understand it,

00:48:02.940 --> 00:48:03.820
and then embed it.

00:48:03.820 --> 00:48:05.800
We've seen plenty of
embedding approaches

00:48:05.800 --> 00:48:09.120
together, triplet loss,
et cetera, you remember?

00:48:09.119 --> 00:48:11.719
So imagine one of
them here for LLMs

00:48:11.719 --> 00:48:15.719
is embedding those documents
into lower-dimensional representations.

00:48:15.719 --> 00:48:18.439
If the representation
is too small,

00:48:18.440 --> 00:48:19.900
you will lose information.

00:48:19.900 --> 00:48:22.840
If it's too big, you
will add latency.

00:48:22.840 --> 00:48:25.760
It's a tradeoff.

00:48:25.760 --> 00:48:28.360
You will store typically
those representations

00:48:28.360 --> 00:48:31.880
into a database called
a vector database.

00:48:31.880 --> 00:48:35.280
There's a lot of vector
database providers out there.

00:48:38.579 --> 00:48:41.880
I think I've listed a
couple that are very common.

00:48:41.880 --> 00:48:44.811
No, I haven't listed, but
I can share afterwards.

00:48:44.811 --> 00:48:47.019
A vector database is
essentially storing those vectors

00:48:47.019 --> 00:48:50.139
in a very efficient manner,
allowing the fast retrieval

00:48:50.139 --> 00:48:52.859
with a certain distance metric.

00:48:52.860 --> 00:48:56.260
So what you do is you
also embed, usually

00:48:56.260 --> 00:49:00.140
with the same algorithm,
the user prompts.

00:49:00.139 --> 00:49:03.579
And you run a retrieval
process, which is essentially

00:49:03.579 --> 00:49:07.779
saying, based on the
embedding from the user

00:49:07.780 --> 00:49:12.540
query and the vector database,
find the relevant documents

00:49:12.539 --> 00:49:15.500
based on the distance
between those embeddings.

00:49:15.500 --> 00:49:18.420
Once you've found the relevant
documents, you pull them,

00:49:18.420 --> 00:49:22.460
and then you add them to the
user query with a system prompt

00:49:22.460 --> 00:49:24.300
or a prompt template on top.

00:49:24.300 --> 00:49:29.300
So the prompt template
can be answer user query

00:49:29.300 --> 00:49:32.900
based on list of documents.

00:49:32.900 --> 00:49:36.829
If answer not in the
documents, say I don't know.

00:49:36.829 --> 00:49:40.590
That's your prompt template,
where the user query is pasted,

00:49:40.590 --> 00:49:42.630
the documents are
pasted, and then

00:49:42.630 --> 00:49:45.829
your output should be what
you want because it's now

00:49:45.829 --> 00:49:47.389
grounded in the documents.

00:49:47.389 --> 00:49:50.549
You can also add to
this prompt template.

00:49:50.550 --> 00:49:53.150
Tell me the exact
page, chapter, line

00:49:53.150 --> 00:49:55.110
of the document that was
relevant, and in fact,

00:49:55.110 --> 00:49:57.380
link it as well, just
to be more precise.
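The retrieval flow just described can be sketched in Python. This is a minimal, self-contained illustration, not a production system: the bag-of-words `embed` function and the in-memory `index` list are toy stand-ins for a real embedding model and vector database, and the document strings are invented.

```python
import math
import re
from collections import Counter

# Toy bag-of-words embedding -- a stand-in for a real embedding model
# (in practice you would call an embedding API and get a dense vector).
def embed(text):
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    # Distance metric used for retrieval.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "Vector database": each document stored next to its embedding.
documents = [
    "Drug X side effects include headache and nausea.",
    "Q4 sales performance exceeded targets by 12 percent.",
]
index = [(doc, embed(doc)) for doc in documents]

def retrieve(query, k=1):
    # Embed the user query with the same algorithm, rank by similarity.
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(query):
    # Prompt template: paste the query and the retrieved documents,
    # and instruct the model to stay grounded in them.
    context = "\n".join(retrieve(query))
    return (
        "Answer the user query based on the documents below.\n"
        "If the answer is not in the documents, say 'I don't know.'\n"
        f"Documents:\n{context}\n\nUser query: {query}"
    )

prompt = build_prompt("What are the side effects of drug X?")
```

The final `prompt` string is what gets sent to the LLM; the model itself is never touched, which is the whole point of RAG.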

00:50:02.150 --> 00:50:03.829
Any question on RAGs?

00:50:03.829 --> 00:50:07.389
This is a simple, vanilla RAG.

00:50:07.389 --> 00:50:09.119
Yes.

00:50:09.119 --> 00:50:12.789
Do document embeddings still
retain information [INAUDIBLE]

00:50:15.630 --> 00:50:18.230
Question is do the
document embeddings still

00:50:18.230 --> 00:50:21.789
retain the information of the
location of the information

00:50:21.789 --> 00:50:24.789
within that document,
especially in big documents?

00:50:24.789 --> 00:50:26.029
Great question.

00:50:26.030 --> 00:50:27.950
We'll get to it in a second.

00:50:27.949 --> 00:50:29.949
Because you're right
that the vanilla RAG

00:50:29.949 --> 00:50:32.289
might not do a good job
with very large documents.

00:50:32.289 --> 00:50:36.469
So let's say, when you
open a medication box

00:50:36.469 --> 00:50:41.129
and you have this gigantic white
paper with all the information,

00:50:41.130 --> 00:50:45.829
and it's very long, maybe a
vanilla RAG would not cut it.

00:50:45.829 --> 00:50:48.009
So what people have
figured out is a bunch

00:50:48.010 --> 00:50:49.830
of techniques to improve RAGs.

00:50:49.829 --> 00:50:53.150
And in fact, chunking is a great
technique that is very popular.

00:50:53.150 --> 00:50:55.730
So you might actually store
in the vector database

00:50:55.730 --> 00:50:57.670
the embedding of
the full document.

00:50:57.670 --> 00:50:59.409
And on top of
that, you will also

00:50:59.409 --> 00:51:02.619
store a chapter level vector.

00:51:02.619 --> 00:51:04.869
And when you retrieve, you
will retrieve the document.

00:51:04.869 --> 00:51:06.289
You retrieve the chapter.

00:51:06.289 --> 00:51:09.190
And that allows you to be more
precise with the sourcing.

00:51:09.190 --> 00:51:11.690
It's one example.
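The chapter-level idea can be sketched as follows. The document text, the regex split on "Chapter" headings, and the bag-of-words embedding are all invented for illustration; a real system would chunk with a proper parser and embed with a real model.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy embedding -- a stand-in for a real embedding model.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical long document with chapter headings.
document = """Chapter 1: Dosage
Take one tablet daily with food.
Chapter 2: Side effects
May cause headache and nausea."""

# Index at two granularities: the full document and each chapter.
chapters = [c.strip() for c in re.split(r"(?=Chapter \d+:)", document) if c.strip()]
index = [("document", document, embed(document))]
index += [("chapter", c, embed(c)) for c in chapters]

def retrieve_chapter(query):
    # Return the single most relevant chapter for precise sourcing.
    q = embed(query)
    candidates = [e for e in index if e[0] == "chapter"]
    return max(candidates, key=lambda e: cosine(q, e[2]))[1]
```

Retrieving at the chapter level is what lets the answer cite the exact chapter rather than just the whole document.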

00:51:11.690 --> 00:51:16.130
Another technique
that's popular is HyDE.

00:51:16.130 --> 00:51:18.970
Hypothetical
document embeddings,

00:51:18.969 --> 00:51:23.529
where a group of researchers
published a paper

00:51:23.530 --> 00:51:26.790
showing that when you
get your user query,

00:51:26.789 --> 00:51:29.090
one of the main problems
is that the user query

00:51:29.090 --> 00:51:32.370
actually does not look
like your documents.

00:51:32.369 --> 00:51:34.139
For example, the
user query might

00:51:34.139 --> 00:51:37.779
be what are the side effects
of drug X, when actually,

00:51:37.780 --> 00:51:40.080
in the document in
the vector database,

00:51:40.079 --> 00:51:43.099
the vectors represent
very long documents.

00:51:43.099 --> 00:51:44.900
So how do you guarantee
that the query

00:51:44.900 --> 00:51:47.619
embedding is going to be close
to the document embedding?

00:51:47.619 --> 00:51:50.819
What they do is they use
the user query to generate

00:51:50.820 --> 00:51:53.780
a fake hallucinated document.

00:51:53.780 --> 00:51:56.180
They embed that
document, and then they

00:51:56.179 --> 00:52:01.379
compare it to the vector
in the vector database.

00:52:01.380 --> 00:52:02.460
That makes sense?

00:52:02.460 --> 00:52:04.780
So for example,
the user says what

00:52:04.780 --> 00:52:06.682
is the side effect of drug X?

00:52:06.682 --> 00:52:09.099
This query is then
given to another prompt that

00:52:09.099 --> 00:52:13.739
says, based on this user query,
generate a five-page report

00:52:13.739 --> 00:52:15.579
answering the user query.

00:52:15.579 --> 00:52:20.980
It generates potentially
a completely fake answer.

00:52:20.980 --> 00:52:24.557
You embed that, and it will
be closer to the document

00:52:24.557 --> 00:52:25.849
that you're looking for likely.
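That HyDE step can be sketched with the same toy embedding idea. Here `generate_hypothetical_document` is a hard-coded stand-in for the LLM call that would expand the query into a fake answer, and the document text is invented.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words embedding, standing in for a real model.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

target_doc = "Drug X may cause headache, nausea, and dizziness in some patients."

def generate_hypothetical_document(query):
    # Stand-in for an LLM call such as:
    #   "Write a passage that answers the following query: {query}"
    # The content may be hallucinated -- that is fine, because only
    # its embedding is used for retrieval.
    return ("Drug X side effects: patients taking drug X may experience "
            "headache, nausea, or dizziness.")

query = "What are the side effects of drug X?"
raw_score = cosine(embed(query), embed(target_doc))
hyde_score = cosine(embed(generate_hypothetical_document(query)), embed(target_doc))
# The hallucinated document looks much more like the stored document
# than the short query does, so retrieval improves.
```

The point of the comparison: the query shares only a couple of words with the document, while the hypothetical document shares many, so its embedding lands closer in the vector space.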

00:52:28.940 --> 00:52:31.800
It's one example
of a RAG approach.

00:52:31.800 --> 00:52:33.640
Again, the purpose
of this lecture

00:52:33.639 --> 00:52:36.039
is not to go through all
these three and explain

00:52:36.039 --> 00:52:38.922
you every single method that
has been discovered for RAGs.

00:52:38.922 --> 00:52:40.880
But I just wanted to show
you how much research

00:52:40.880 --> 00:52:44.780
has been done between
2020 and 2025 in RAGs

00:52:44.780 --> 00:52:47.960
and how many branches
of research you now have

00:52:47.960 --> 00:52:50.679
that you can learn from.

00:52:50.679 --> 00:52:52.899
The survey paper is linked in
the slides, by the way,

00:52:52.900 --> 00:52:54.483
and I'll share them
after the lecture.

00:53:01.519 --> 00:53:02.019
Super.

00:53:05.559 --> 00:53:08.840
So we've made some progress.

00:53:08.840 --> 00:53:10.600
Hopefully now, you
feel that if you were

00:53:10.599 --> 00:53:14.317
to start an LLM application, you
know how to do better prompts.

00:53:14.317 --> 00:53:15.400
You know how to do chains.

00:53:15.400 --> 00:53:17.240
You know how to do fine tuning.

00:53:17.239 --> 00:53:19.159
You also know how to do retrieval.

00:53:19.159 --> 00:53:20.799
And you have the
baggage of techniques

00:53:20.800 --> 00:53:23.100
that you can go and read
and find the code base,

00:53:23.099 --> 00:53:24.779
pull the code, vibe code it.

00:53:24.780 --> 00:53:26.820
But you have the breadth now.

00:53:30.329 --> 00:53:34.009
The next set of topics
we're going to see

00:53:34.010 --> 00:53:36.770
is around the question
of how could we

00:53:36.769 --> 00:53:40.449
extend the capabilities of LLMs
from performing single tasks,

00:53:40.449 --> 00:53:42.250
and hence, with
external knowledge,

00:53:42.250 --> 00:53:47.409
to handling multi-step,
autonomous workflows?

00:53:47.409 --> 00:53:50.389
And this is where we get
into proper agentic AI.

00:53:53.210 --> 00:53:56.650
So let's talk about
agentic AI workflows

00:53:56.650 --> 00:54:00.130
towards autonomous and
specialized systems.

00:54:00.130 --> 00:54:01.630
Then we'll talk about evals.

00:54:01.630 --> 00:54:03.869
Then we'll see
multi-agent systems.

00:54:03.869 --> 00:54:11.769
And we'll end with a little
thoughts on what's next in AI.

00:54:11.769 --> 00:54:20.329
So Andrew Ng actually coined
the term agentic AI workflows.

00:54:20.329 --> 00:54:25.610
And his reason was that a lot
of companies use the word agents.

00:54:25.610 --> 00:54:28.750
Agents, agents everywhere,
agents everywhere.

00:54:28.750 --> 00:54:30.670
If you go and work
at these companies,

00:54:30.670 --> 00:54:33.372
you would notice that they mean
very different things by agents.

00:54:33.371 --> 00:54:34.829
Some people actually
have a prompt,

00:54:34.829 --> 00:54:36.829
and they call it an agent.

00:54:36.829 --> 00:54:41.529
Other people, they have a very
complex multi-agent system,

00:54:41.530 --> 00:54:42.450
they call it an agent.

00:54:42.449 --> 00:54:45.549
And so calling everything an
agent doesn't do it justice.

00:54:45.550 --> 00:54:49.810
So Andrew says let's call
it agentic workflows.

00:54:49.809 --> 00:54:53.989
Because in practice, it's a
bunch of prompts with tools,

00:54:53.989 --> 00:54:57.029
with additional
resources, API calls

00:54:57.030 --> 00:54:59.390
that ultimately are
put in a workflow,

00:54:59.389 --> 00:55:02.629
and you can call that
workflow agentic.

00:55:02.630 --> 00:55:08.099
So it's all about the multi-step
process to complete a task.

00:55:11.269 --> 00:55:13.230
Also, calling it
agentic workflow

00:55:13.230 --> 00:55:14.869
allows us to not
mix it up with what

00:55:14.869 --> 00:55:17.909
I called an agent in
the last lecture,

00:55:17.909 --> 00:55:19.309
with reinforcement learning.

00:55:19.309 --> 00:55:22.029
Because in RL, agent has a
very specific definition,

00:55:22.030 --> 00:55:24.670
interacts with an environment,
passes from one state

00:55:24.670 --> 00:55:26.708
to the other, has a
reward and an observation.

00:55:26.708 --> 00:55:28.000
You remember that chart, right?

00:55:32.000 --> 00:55:35.440
So here's an example of
how we move from a one step

00:55:35.440 --> 00:55:39.760
prompt to a multi-step
agentic workflow.

00:55:39.760 --> 00:55:44.920
Let's say a user
queries a product.

00:55:44.920 --> 00:55:48.200
What is your refund
policy on a chatbot?

00:55:48.199 --> 00:55:51.039
And the response,
using a RAG, says

00:55:51.039 --> 00:55:53.779
refunds are available
within 30 days of purchase,

00:55:53.780 --> 00:55:57.440
and maybe the RAG can even
link to the policy documents.

00:55:57.440 --> 00:55:59.639
That's what we learned so far.

00:55:59.639 --> 00:56:04.119
Instead, an agentic workflow
can function like this.

00:56:04.119 --> 00:56:07.559
The user says, can I get
a refund for my order?

00:56:07.559 --> 00:56:11.239
And the response via
the agentic workflow

00:56:11.239 --> 00:56:14.239
is the agent retrieves the
refund policy using a RAG.

00:56:14.239 --> 00:56:17.299
The agent then follows up
with the user and says,

00:56:17.300 --> 00:56:19.720
can you provide
your order number?

00:56:19.719 --> 00:56:23.019
Then the agent queries an API
to check the order details.

00:56:23.019 --> 00:56:25.139
And finally, it comes
back to the user

00:56:25.139 --> 00:56:28.199
and confirms your order
qualifies for a refund.

00:56:28.199 --> 00:56:31.179
The amount will be processed
in three to five business days.

00:56:31.179 --> 00:56:33.799
This is much more thoughtful
than the first version,

00:56:33.800 --> 00:56:35.164
which is sort of vanilla.
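The four steps of that exchange can be sketched as a single workflow function. Everything here is hypothetical: `retrieve_policy` stands in for the RAG lookup and `lookup_order` for the order API, with hard-coded data.

```python
def retrieve_policy():
    # Step 1: stand-in for a RAG lookup over the policy documents.
    return {"refund_window_days": 30}

def lookup_order(order_id):
    # Step 3: stand-in for an order-service API call.
    orders = {"A123": {"days_since_purchase": 12}}
    return orders.get(order_id)

def refund_agent(order_id=None):
    policy = retrieve_policy()
    if order_id is None:
        # Step 2: the agent follows up instead of answering generically.
        return "Can you provide your order number?"
    order = lookup_order(order_id)
    if order is None:
        return "I couldn't find that order."
    if order["days_since_purchase"] <= policy["refund_window_days"]:
        # Step 4: confirm using both the policy and the order details.
        return ("Your order qualifies for a refund. The amount will be "
                "processed in three to five business days.")
    return "Sorry, this order is outside the refund window."
```

Each branch corresponds to one step of the multi-step workflow; a single-shot RAG answer would stop after step 1.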

00:56:37.682 --> 00:56:39.099
So that's what
we're going to talk

00:56:39.099 --> 00:56:40.900
about in the next
couple of slides,

00:56:40.900 --> 00:56:43.240
is how do we get from the
first one to the second one?

00:56:46.619 --> 00:56:50.139
There are plenty of specialized
agentic workflows online.

00:56:50.139 --> 00:56:52.239
You've heard, and if
you hang out in SF,

00:56:52.239 --> 00:56:55.659
you probably see a bunch
of billboards, AI software

00:56:55.659 --> 00:56:57.819
engineer, AI skills
mentor, which you've

00:56:57.820 --> 00:56:59.920
interacted with in the
class through Workera.

00:56:59.920 --> 00:57:08.099
AI SDR, AI lawyers, AI
specialized cloud engineer.

00:57:08.099 --> 00:57:10.679
It would be a stretch to
say that everything works,

00:57:10.679 --> 00:57:12.940
but there's work being
done towards that.

00:57:17.860 --> 00:57:19.460
I'm not personally
a fan of putting

00:57:19.460 --> 00:57:20.920
a face behind those things.

00:57:20.920 --> 00:57:21.920
I think it's gimmicky.

00:57:21.920 --> 00:57:24.090
And I think in a few
years from now, actually,

00:57:24.090 --> 00:57:27.750
very few products will have
a human face behind it,

00:57:27.750 --> 00:57:32.070
but it might be a marketing
tactic from some startups.

00:57:32.070 --> 00:57:35.809
It's more scary than it
is engaging, frankly.

00:57:35.809 --> 00:57:36.309
OK.

00:57:36.309 --> 00:57:38.670
I want to talk about
the paradigm shift.

00:57:38.670 --> 00:57:40.110
That's especially useful.

00:57:40.110 --> 00:57:41.870
Let's say you're a
software engineer

00:57:41.869 --> 00:57:43.777
or you're planning to
be a software engineer.

00:57:43.777 --> 00:57:45.610
Because software
engineering as a discipline

00:57:45.610 --> 00:57:47.210
is sort of shifting.

00:57:47.210 --> 00:57:49.070
Or at least the
best engineers I've

00:57:49.070 --> 00:57:53.350
worked with are able to move
from a deterministic mindset

00:57:53.349 --> 00:57:57.110
to a fuzzy mindset and
balance between the two

00:57:57.110 --> 00:57:58.890
whenever they need to
get something done.

00:57:58.889 --> 00:58:01.949
So here's the paradigm shift
between traditional software

00:58:01.949 --> 00:58:04.549
and agentic AI software.

00:58:04.550 --> 00:58:07.670
The first one is the
way you handle data.

00:58:07.670 --> 00:58:10.210
Traditional software deals
with structured data.

00:58:10.210 --> 00:58:11.130
You have JSONs.

00:58:11.130 --> 00:58:12.670
You have databases.

00:58:12.670 --> 00:58:15.670
They're pasted in a
very structured manner

00:58:15.670 --> 00:58:17.811
in a data engineering pipeline.

00:58:17.811 --> 00:58:19.269
And then they used
to be displayed

00:58:19.269 --> 00:58:21.170
on a certain interface.

00:58:21.170 --> 00:58:24.690
The user might fill out a form that
is then retrieved and pasted

00:58:24.690 --> 00:58:25.470
in the database.

00:58:25.469 --> 00:58:28.250
All of that historically
has been structured data.

00:58:28.250 --> 00:58:34.250
Now, more and more companies are
handling free form text, images,

00:58:34.250 --> 00:58:39.289
and all of that requires dynamic
interpretation to transform

00:58:39.289 --> 00:58:41.690
an input into an output.

00:58:41.690 --> 00:58:45.429
The software itself used
to be deterministic.

00:58:45.429 --> 00:58:47.529
Now you have a lot of
software that is fuzzy.

00:58:47.530 --> 00:58:51.290
And fuzzy software
creates so many issues.

00:58:51.289 --> 00:58:54.250
I mean, imagine if you
let your user ask anything

00:58:54.250 --> 00:58:56.250
on your website.

00:58:56.250 --> 00:58:58.590
The chances that it
breaks are tremendous.

00:58:58.590 --> 00:59:00.710
The chances that you're
attacked are tremendous.

00:59:00.710 --> 00:59:03.150
The chances-- it's really,
really complicated.

00:59:03.150 --> 00:59:07.650
It's more complicated than
people make it seem on Twitter.

00:59:07.650 --> 00:59:09.809
Fuzzy engineering is truly hard.

00:59:09.809 --> 00:59:14.090
You might get hate as a company
because one user did something

00:59:14.090 --> 00:59:16.530
that you authorized them to
do that ended up breaking

00:59:16.530 --> 00:59:18.130
the database and ended up--

00:59:18.130 --> 00:59:19.740
we've seen that
with many companies

00:59:19.739 --> 00:59:21.099
in the last couple of years.

00:59:21.099 --> 00:59:23.980
So it takes a very specialized
engineering mindset

00:59:23.980 --> 00:59:25.460
to do fuzzy
engineering, but also

00:59:25.460 --> 00:59:29.340
know when you need
to be deterministic.

00:59:29.340 --> 00:59:33.820
The other thing I'd call out is
with agentic AI software,

00:59:33.820 --> 00:59:39.019
you want to think about your
software as your manager.

00:59:39.019 --> 00:59:44.059
So you're familiar with the
monolith or microservices

00:59:44.059 --> 00:59:48.099
approaches in software, where
you structure your software

00:59:48.099 --> 00:59:51.799
in different boxes that
can talk to each other,

00:59:51.800 --> 00:59:55.140
and it allows teams to
debug one section at a time.

00:59:55.139 --> 00:59:59.039
Now the equivalent with agentic
AI is you think as a manager.

00:59:59.039 --> 01:00:02.460
So you think, OK, if I
was to delegate my product

01:00:02.460 --> 01:00:06.000
to be done by a group of humans,
what would be those roles?

01:00:06.000 --> 01:00:09.659
Would I have a graphic designer
that then puts together a chart

01:00:09.659 --> 01:00:12.420
and then sends it to a marketing
manager that converts it

01:00:12.420 --> 01:00:15.420
into a nice blog post, that
then gives it to the performance

01:00:15.420 --> 01:00:18.680
marketing expert, that then
publishes the work, the blog

01:00:18.679 --> 01:00:20.899
post, and then
optimizes and A/B tests?

01:00:20.900 --> 01:00:23.440
Then to a data scientist
that analyzes the data

01:00:23.440 --> 01:00:25.880
and then puts
hypotheses and validates

01:00:25.880 --> 01:00:27.320
them or invalidates them.

01:00:27.320 --> 01:00:29.920
That's how you would typically
think if you're building

01:00:29.920 --> 01:00:32.639
an agentic AI software.

01:00:32.639 --> 01:00:35.769
When actually, the equivalent
of that in traditional software

01:00:35.769 --> 01:00:37.019
might be completely different.

01:00:37.019 --> 01:00:39.759
It might be, we have
a data engineering box

01:00:39.760 --> 01:00:42.560
right here that handles
all our data engineering.

01:00:42.559 --> 01:00:45.860
And then here, we
have the UI/UX stuff.

01:00:45.860 --> 01:00:47.940
Everything UI/UX
related goes here.

01:00:47.940 --> 01:00:51.019
And companies might structure
it in very different ways.

01:00:51.019 --> 01:00:53.684
And here is the business logic
that we want to care about.

01:00:53.684 --> 01:00:56.059
And there's five engineers
working on the business logic,

01:00:56.059 --> 01:00:56.559
let's say.

01:00:59.239 --> 01:01:01.159
OK.

01:01:01.159 --> 01:01:04.559
Testing and debugging
is also very different.

01:01:04.559 --> 01:01:06.409
And we'll talk about
it in the next section.

01:01:09.440 --> 01:01:13.679
The other thing
that I feel matters

01:01:13.679 --> 01:01:17.409
is with AI in engineering,
the cost of experimentation

01:01:17.409 --> 01:01:19.210
is going down drastically.

01:01:19.210 --> 01:01:22.010
And so people, I feel,
should be more comfortable

01:01:22.010 --> 01:01:23.690
throwing away code.

01:01:23.690 --> 01:01:27.429
It's like in traditional
software engineering,

01:01:27.429 --> 01:01:29.469
you probably don't
throw away code a ton.

01:01:29.469 --> 01:01:32.309
You build a code, and it's
solid, and it's bulletproof,

01:01:32.309 --> 01:01:35.329
and then you update
it over time.

01:01:35.329 --> 01:01:39.009
We've seen AI companies be
more comfortable throwing away

01:01:39.010 --> 01:01:43.810
code, which has advantages in
terms of the speed at which you

01:01:43.809 --> 01:01:46.329
move but also
disadvantages in terms

01:01:46.329 --> 01:01:49.509
of the quality of your
software that can break more.

01:01:52.530 --> 01:01:56.890
So anyway, just wanted to do
an update on the paradigm shift

01:01:56.889 --> 01:01:59.150
from deterministic
to fuzzy engineering.

01:02:04.570 --> 01:02:08.370
Oh, and actually, I can give
you an example from Workera

01:02:08.369 --> 01:02:11.250
that we learned probably
over the last 12

01:02:11.250 --> 01:02:13.750
months is like if
you've used Workera,

01:02:13.750 --> 01:02:18.070
you might have seen that the
interface sometimes asks you

01:02:18.070 --> 01:02:19.590
multiple choice questions.

01:02:19.590 --> 01:02:21.450
And sometimes, it asks
you multiple select.

01:02:21.449 --> 01:02:24.169
And sometimes, it asks you drag
and drop, ordering, matching,

01:02:24.170 --> 01:02:25.349
whatever.

01:02:25.349 --> 01:02:28.610
Those are examples of
deterministic item types,

01:02:28.610 --> 01:02:31.329
meaning you answer the
question on a multiple choice.

01:02:31.329 --> 01:02:32.710
There is one correct answer.

01:02:32.710 --> 01:02:34.510
It's fully deterministic.

01:02:34.510 --> 01:02:38.350
On the other hand, you sometimes
have voice questions,

01:02:38.349 --> 01:02:40.309
where you go to a
role play or you

01:02:40.309 --> 01:02:42.029
have voice plus
coding questions,

01:02:42.030 --> 01:02:45.790
where your code is being read
by the interface or whatever.

01:02:45.789 --> 01:02:49.550
Those are fuzzy, meaning
the scoring algorithm

01:02:49.550 --> 01:02:52.269
might actually make
mistakes, and those mistakes

01:02:52.269 --> 01:02:53.509
might be costly.

01:02:53.510 --> 01:02:56.190
And so companies
have to figure out

01:02:56.190 --> 01:02:58.318
a human in the
loop system, which

01:02:58.318 --> 01:03:00.610
you might have seen with the
appeal feature at the end.

01:03:00.610 --> 01:03:03.318
So at the end of the assessment,
you have an appeal feature where

01:03:03.318 --> 01:03:06.430
it allows you to say, I
want to appeal the agent

01:03:06.429 --> 01:03:09.690
because I want to challenge
what the agent said on my answer

01:03:09.690 --> 01:03:12.365
because I thought I was better
than what the agent thought.

01:03:12.364 --> 01:03:14.239
And then you bring the
human in the loop that

01:03:14.239 --> 01:03:16.447
then can fix the agent, can
tell the agent, actually,

01:03:16.447 --> 01:03:20.279
you were too harsh on the
answer of this person.

01:03:20.280 --> 01:03:24.360
And that's an example of
a fuzzy engineered system

01:03:24.360 --> 01:03:28.200
that then adds a human in the
loop to make it more aligned.

01:03:28.199 --> 01:03:29.699
And so if you're
building a company,

01:03:29.699 --> 01:03:32.279
I would encourage you to
think about what can I

01:03:32.280 --> 01:03:33.800
get done with determinism?

01:03:33.800 --> 01:03:35.100
And let's get that done.

01:03:35.099 --> 01:03:38.000
And then the fuzzy
stuff, I want to do fuzzy

01:03:38.000 --> 01:03:39.900
because it allows
more interaction.

01:03:39.900 --> 01:03:42.079
It allows more back
and forth, but I need

01:03:42.079 --> 01:03:43.739
to put guardrails around it.

01:03:43.739 --> 01:03:45.739
And how am I going to
design those guardrails?

01:03:45.739 --> 01:03:46.639
Pretty much.

01:03:46.639 --> 01:03:49.219
OK?
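One way to sketch that split: deterministic item types get exact grading, fuzzy ones get a score plus a confidence, and a guardrail routes low-confidence scores to human review, which is what an appeal feature surfaces. The grading heuristic and the 0.6 threshold are invented for illustration; a real system would call an LLM grader.

```python
def grade_multiple_choice(answer, correct):
    # Deterministic: exactly one right answer, no ambiguity.
    return 1.0 if answer == correct else 0.0

def grade_free_response(response):
    # Fuzzy stand-in for an LLM grader; returns (score, confidence).
    words = len(response.split())
    score = min(words / 50, 1.0)            # hypothetical heuristic
    confidence = 0.9 if words > 10 else 0.4  # hypothetical heuristic
    return score, confidence

def score_item(item):
    if item["type"] == "multiple_choice":
        score = grade_multiple_choice(item["answer"], item["correct"])
        return {"score": score, "needs_review": False}
    score, confidence = grade_free_response(item["answer"])
    # Guardrail: low-confidence fuzzy scores go to a human in the loop.
    return {"score": score, "needs_review": confidence < 0.6}
```

The design choice is the routing itself: do everything you can deterministically, and wrap the fuzzy parts in guardrails with a human escalation path.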

01:03:49.219 --> 01:03:54.039
Here's another example
from enterprise workflows,

01:03:54.039 --> 01:03:57.519
which are likely to
change due to agentic AI.

01:03:57.519 --> 01:04:01.619
This is a paper from McKinsey,
I believe from last year,

01:04:01.619 --> 01:04:05.199
where they looked at a financial
institution, and they said,

01:04:05.199 --> 01:04:07.599
we observed that they often
spend one to four weeks

01:04:07.599 --> 01:04:10.119
to create a credit risk memo.

01:04:10.119 --> 01:04:11.859
And here's the process.

01:04:11.860 --> 01:04:16.539
A relationship manager
gathers data from 15

01:04:16.539 --> 01:04:19.699
and more than 15
sources on the borrower,

01:04:19.699 --> 01:04:22.699
loan type, other factors.

01:04:22.699 --> 01:04:25.339
Then the relationship manager
and the credit analyst

01:04:25.340 --> 01:04:28.780
collaboratively analyze that
data from these sources.

01:04:28.780 --> 01:04:33.620
Then the credit analyst
typically spends 20 hours

01:04:33.619 --> 01:04:36.019
or more writing a memo
and then goes back

01:04:36.019 --> 01:04:37.860
to the relationship manager.

01:04:37.860 --> 01:04:40.260
They give feedback, and then
they go through this loop

01:04:40.260 --> 01:04:41.540
again and again.

01:04:41.539 --> 01:04:46.139
And it takes a long time
to get a credit memo out.

01:04:46.139 --> 01:04:50.639
And then they ran a research
study, where they changed the process.

01:04:50.639 --> 01:04:56.139
They said gen AI agents could
actually cut time by 20% to 60%

01:04:56.139 --> 01:04:58.500
on credit risk memos.

01:04:58.500 --> 01:05:01.059
And the process has changed:
the relationship manager

01:05:01.059 --> 01:05:03.219
directly works with the
gen AI agent system,

01:05:03.219 --> 01:05:07.139
and provides the relevant materials
it needs to produce the memo.

01:05:07.139 --> 01:05:10.069
The agent subdivides
the project into tasks

01:05:10.070 --> 01:05:12.269
that are assigned to
specialist agents,

01:05:12.269 --> 01:05:15.309
gathers and analyzes the
data from multiple sources,

01:05:15.309 --> 01:05:16.710
drafts a memo.

01:05:16.710 --> 01:05:19.309
Then the relationship manager
and the credit analyst

01:05:19.309 --> 01:05:20.969
sit down together,
review the memo,

01:05:20.969 --> 01:05:22.489
give feedback to the agent.

01:05:22.489 --> 01:05:26.869
And they're done in
20% to 60% less time.

01:05:26.869 --> 01:05:30.029
And so this is an example where
you're actually not changing

01:05:30.030 --> 01:05:31.290
the human stakeholders.

01:05:31.289 --> 01:05:33.909
You're just changing
the process and adding

01:05:33.909 --> 01:05:38.589
Gen AI to reduce the time it
takes to get a credit memo out.

01:05:38.590 --> 01:05:42.350
It turns out that, imagine
you're an enterprise,

01:05:42.349 --> 01:05:47.429
and you have 100,000 employees,
and there's a lot of enterprises

01:05:47.429 --> 01:05:50.309
with 100,000
employees out there.

01:05:50.309 --> 01:05:52.509
You are currently
under pressure in terms

01:05:52.510 --> 01:05:55.855
of redesigning your workflows.

01:05:55.855 --> 01:05:57.230
It turns out that
if you actually

01:05:57.230 --> 01:06:00.550
pull the job descriptions
from the HR system

01:06:00.550 --> 01:06:02.630
and you interpret
them, and you also pull

01:06:02.630 --> 01:06:04.590
the business process
workflows that you

01:06:04.590 --> 01:06:07.150
have encoded in your drive.

01:06:07.150 --> 01:06:10.960
You actually can find
gains in multiple places.

01:06:10.960 --> 01:06:12.519
And in the next
few years, you're

01:06:12.519 --> 01:06:14.320
probably going to
see workflows being

01:06:14.320 --> 01:06:17.039
more optimized to add Gen AI.

01:06:17.039 --> 01:06:20.179
Even if that happens, the
hardest part is changing people.

01:06:20.179 --> 01:06:23.480
We know this is
great in theory, but now,

01:06:23.480 --> 01:06:28.360
let's try to fit that second
workflow for 10,000 credit

01:06:28.360 --> 01:06:31.680
risk analysts and
relationship managers.

01:06:31.679 --> 01:06:33.379
My guess is it will take years.

01:06:33.380 --> 01:06:37.519
It will take 10, 20 years to
get to this being actually done

01:06:37.519 --> 01:06:40.280
at scale within an organization.

01:06:40.280 --> 01:06:42.320
Because change is so hard.

01:06:42.320 --> 01:06:47.400
It's so hard to rewire business
workflows, job descriptions,

01:06:47.400 --> 01:06:50.119
incentivize people to do things
differently, and be different,

01:06:50.119 --> 01:06:50.900
and train them.

01:06:50.900 --> 01:06:55.220
And so this is what the
world is going towards,

01:06:55.219 --> 01:06:59.480
but it's going to take
a long time I think.

01:06:59.480 --> 01:07:00.219
OK.

01:07:00.219 --> 01:07:02.759
Then I want to talk about
how the agent actually works

01:07:02.760 --> 01:07:07.100
and what are the core
components of an agent.

01:07:07.099 --> 01:07:10.219
Imagine a travel
booking agent. That's

01:07:10.219 --> 01:07:12.439
an easy example you've
all thought about.

01:07:12.440 --> 01:07:16.039
I still haven't been able to get
an agent to book a trip for me,

01:07:16.039 --> 01:07:18.340
or I was scared because
it was going to book

01:07:18.340 --> 01:07:20.680
a very expensive or long trip.

01:07:20.679 --> 01:07:24.819
But in theory, you can
have a travel booking

01:07:24.820 --> 01:07:26.400
agent that has prompts.

01:07:26.400 --> 01:07:28.700
So the prompts we've
seen, we know the methods

01:07:28.699 --> 01:07:30.539
to optimize those prompts.

01:07:30.539 --> 01:07:34.880
That travel agent also has
a context management system,

01:07:34.880 --> 01:07:38.420
which is essentially the memory
of what it knows about the user.

01:07:38.420 --> 01:07:40.659
That context
management system might

01:07:40.659 --> 01:07:45.799
include a core memory or working
memory and an archival memory,

01:07:45.800 --> 01:07:46.860
OK?

01:07:46.860 --> 01:07:51.059
What the difference
is within memory

01:07:51.059 --> 01:07:54.940
is not every memory needs
to be fast to access.

01:07:54.940 --> 01:07:56.159
Think about it.

01:07:56.159 --> 01:07:59.659
You're onboarded on a product,
and the first question is hi,

01:07:59.659 --> 01:08:00.599
what's your name?

01:08:00.599 --> 01:08:02.900
And I say, my name is Kian.

01:08:02.900 --> 01:08:05.037
That's probably going to
sit in the working memory

01:08:05.036 --> 01:08:07.369
because every time the
agent talks to me,

01:08:07.369 --> 01:08:08.786
it's going to want
to use my name.

01:08:08.786 --> 01:08:10.829
But then maybe the
second question

01:08:10.829 --> 01:08:12.409
is what's your birthday?

01:08:12.409 --> 01:08:13.750
And I give it my birthday.

01:08:13.750 --> 01:08:15.489
Does it need my
birthday every day?

01:08:15.489 --> 01:08:16.210
Probably not.

01:08:16.210 --> 01:08:18.670
So it's probably going to
park it in the long term

01:08:18.670 --> 01:08:20.949
memory or the archival memory.

01:08:20.949 --> 01:08:24.250
And those memories
are slower to access.

01:08:24.250 --> 01:08:26.750
They're farther down the stack.

01:08:26.750 --> 01:08:28.789
And that structure
allows the agent

01:08:28.789 --> 01:08:30.829
to determine what's
the working memory,

01:08:30.829 --> 01:08:33.189
and what's the long term memory?

01:08:33.189 --> 01:08:36.090
And that makes it easier for the
agent to retrieve super fast.

01:08:36.090 --> 01:08:37.289
Because think about it.

01:08:37.289 --> 01:08:39.390
When you interact
with ChatGPT, you

01:08:39.390 --> 01:08:41.270
feel that it's very
personal at times.

01:08:41.270 --> 01:08:43.750
You feel like it
understands you.

01:08:43.750 --> 01:08:47.510
Imagine every time you call it,
it has to read the memories.

01:08:47.510 --> 01:08:48.909
And that can be costly.

01:08:48.909 --> 01:08:52.510
It's a very burdensome
cost because it happens

01:08:52.510 --> 01:08:54.649
every time you talk to it.

01:08:54.649 --> 01:08:57.270
So you want to be highly
optimized with the working

01:08:57.270 --> 01:08:59.095
memory.

01:08:59.095 --> 01:09:00.470
If it takes three
seconds to look

01:09:00.470 --> 01:09:03.069
in the memory, every time you're
going to talk to your LLM,

01:09:03.069 --> 01:09:06.210
it's going to take three
seconds, which you don't want.

01:09:06.210 --> 01:09:06.890
Anyway.
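
As a sketch of that two-tier memory idea — all class and method names below are illustrative, not a real agent framework — a context manager might keep a small working memory that rides along in every prompt, and an archival memory that is only read on explicit lookups:

```python
# Hypothetical two-tier context management, as described in the lecture:
# a small, fast "working memory" injected into the prompt on every turn,
# and a larger "archival memory" consulted only on demand.

class ContextManager:
    def __init__(self, working_capacity=3):
        self.working = {}         # always concatenated into the prompt
        self.archival = {}        # retrieved only when explicitly needed
        self.working_capacity = working_capacity

    def remember(self, key, value, frequent=False):
        """Store a fact; frequently used facts go to working memory."""
        if frequent and len(self.working) < self.working_capacity:
            self.working[key] = value
        else:
            self.archival[key] = value

    def prompt_context(self):
        """Cheap path: paid on every single model call."""
        return "; ".join(f"{k}={v}" for k, v in self.working.items())

    def lookup(self, key):
        """Slower path: an explicit retrieval, falling back to archival."""
        return self.working.get(key) or self.archival.get(key)


mem = ContextManager()
mem.remember("name", "Kian", frequent=True)    # used on every turn
mem.remember("birthday", "1990-01-01")         # rarely needed
print(mem.prompt_context())   # only the name rides along in the prompt
print(mem.lookup("birthday")) # birthday requires an archival lookup
```

The point of the split is cost: `prompt_context` is paid on every model call, while `lookup` is only paid when the agent decides it needs an archived fact.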

01:09:06.890 --> 01:09:08.189
And then you have the tools.

01:09:08.189 --> 01:09:11.490
The tools can include
APIs like a flight search

01:09:11.489 --> 01:09:15.688
API, hotel booking API, car
rental API, weather API,

01:09:15.689 --> 01:09:18.450
and then the payment
processing API.

01:09:18.449 --> 01:09:21.688
And typically, you would
want to tell your agent

01:09:21.689 --> 01:09:23.430
how that API works.

01:09:23.430 --> 01:09:27.010
It turns out that agents
or LLMs, I should say,

01:09:27.010 --> 01:09:29.590
are very good at reading
API documentation.

01:09:29.590 --> 01:09:31.210
So you give it the
API documentation,

01:09:31.210 --> 01:09:33.590
and it reads the
JSON, and it reads,

01:09:33.590 --> 01:09:35.609
what does a GET
request look like.

01:09:35.609 --> 01:09:38.189
And this is the format
that I need to push.

01:09:38.189 --> 01:09:41.569
And then it pushes it in
that format, let's say.

01:09:41.569 --> 01:09:45.090
And then it retrieves something.
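
To make that concrete — the flight-search function, its fields, and the schema shape below are made up for illustration, not any particular vendor's tool-calling API — each tool can be described by a JSON-style schema (much like the API documentation the model reads), and a small dispatcher executes whatever structured call the model emits:

```python
# A minimal sketch of exposing tools to an LLM: schemas describe each tool,
# and a dispatcher executes the call the model requests.

def search_flights(origin, destination, date):
    # Stand-in for a real flight-search API call.
    return [{"flight": f"{origin}->{destination}", "date": date, "price": 420}]

TOOLS = {
    "search_flights": {
        "description": "Find flights between two cities on a given date.",
        "parameters": {"origin": "str", "destination": "str", "date": "str"},
        "fn": search_flights,
    },
}

def dispatch(tool_call):
    """Execute a tool call of the form {'name': ..., 'arguments': {...}}."""
    tool = TOOLS[tool_call["name"]]
    return tool["fn"](**tool_call["arguments"])

# The model, having read the schema, emits a structured call like this:
result = dispatch({"name": "search_flights",
                   "arguments": {"origin": "SFO", "destination": "JFK",
                                 "date": "2024-12-15"}})
print(result)
```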

01:09:45.090 --> 01:09:49.170
Does that make sense,
those different components?

01:09:49.170 --> 01:09:51.750
Anthropic also talks
about resources.

01:09:51.750 --> 01:09:55.369
Resources are data that is
sitting somewhere that you

01:09:55.369 --> 01:09:57.309
might let your agent read.

01:09:57.310 --> 01:10:00.770
For example, if you're building
your startup, you have a CRM.

01:10:00.770 --> 01:10:05.000
A CRM has data in it, and you
want to do lookups in that data.

01:10:05.000 --> 01:10:07.859
You will probably
give a lookup tool,

01:10:07.859 --> 01:10:10.359
and you will give
access to the resource,

01:10:10.359 --> 01:10:12.609
and it will do lookups
whenever you want super fast.

01:10:16.300 --> 01:10:19.020
This type of
architecture can be built

01:10:19.020 --> 01:10:21.080
with different
degrees of autonomy,

01:10:21.079 --> 01:10:23.659
from the least autonomous
to the most autonomous.

01:10:23.659 --> 01:10:26.260
And I'll give you
a few examples.

01:10:26.260 --> 01:10:29.560
The least autonomous would be
you've hard-coded the steps.

01:10:29.560 --> 01:10:35.020
So let's say I tell the travel
agent first identify the intent.

01:10:35.020 --> 01:10:39.300
Then look up in the
database the history

01:10:39.300 --> 01:10:42.460
of this customer with us
and their preferences.

01:10:42.460 --> 01:10:45.239
Then go to the flight
API, blah, blah, blah.

01:10:45.239 --> 01:10:45.979
Then go to the--

01:10:45.979 --> 01:10:47.619
I would hard code the steps.

01:10:47.619 --> 01:10:48.220
OK.

01:10:48.220 --> 01:10:50.539
That's the least autonomous.

01:10:50.539 --> 01:10:54.659
The semi-autonomous is I
might hard code the tools,

01:10:54.659 --> 01:10:57.059
but we're not going to
hard code the steps.

01:10:57.060 --> 01:11:02.120
So I'm going to tell the agent,
you act like a travel agent.

01:11:02.119 --> 01:11:10.199
And your task is to help
the person book a trip.

01:11:10.199 --> 01:11:13.279
And these are the tools that
you have accessible to yourself.

01:11:13.279 --> 01:11:14.939
And so I'm not hard
coding the steps.

01:11:14.939 --> 01:11:17.064
I'm just hard coding the
tools that you have access

01:11:17.064 --> 01:11:18.919
to for yourself.

01:11:18.920 --> 01:11:22.480
The more autonomous is the
agent decides the steps

01:11:22.479 --> 01:11:24.722
and can create the tools.

01:11:24.722 --> 01:11:26.640
So that's where you might
actually give the agent access

01:11:26.640 --> 01:11:28.980
to a code editor.

01:11:28.979 --> 01:11:33.219
And the agent might actually be
able to ping any API in the web,

01:11:33.220 --> 01:11:34.800
perform some web search.

01:11:34.800 --> 01:11:37.079
It might even be able
to create some code

01:11:37.079 --> 01:11:39.039
to display data to the user.

01:11:39.039 --> 01:11:42.159
It might even be able to
perform some calculations.

01:11:42.159 --> 01:11:44.760
Like oh, I'm going to
calculate the fastest route

01:11:44.760 --> 01:11:48.000
to get from San
Francisco to New York,

01:11:48.000 --> 01:11:50.760
and which one might be
the most appropriate

01:11:50.760 --> 01:11:52.378
for what the user
is looking for.

01:11:52.377 --> 01:11:54.920
And then I want to calculate
the distance between the airport

01:11:54.920 --> 01:11:56.899
and that hotel
versus that hotel.

01:11:56.899 --> 01:11:58.769
And I'm going to
write code to do that.

01:11:58.770 --> 01:12:00.650
So it's actually
fully autonomous

01:12:00.649 --> 01:12:02.210
from that perspective.
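
The first two rungs of that autonomy ladder can be sketched side by side. Everything here is a stub — the planner stands in for the LLM's own plan, and the tool functions are fake:

```python
# Contrasting hard-coded steps (least autonomous) with model-chosen steps
# over a fixed toolset (semi-autonomous). All names are illustrative.

def run_hardcoded(user_request, agent_tools):
    """Least autonomous: the engineer fixes the step order."""
    steps = ["identify_intent", "lookup_history", "search_flights"]
    return [agent_tools[s](user_request) for s in steps]

def run_semi_autonomous(user_request, agent_tools, planner):
    """Semi-autonomous: tools are fixed, but the model picks the order."""
    steps = planner(user_request, list(agent_tools))   # model-chosen plan
    return [agent_tools[s](user_request) for s in steps if s in agent_tools]

agent_tools = {
    "identify_intent": lambda r: "book_trip",
    "lookup_history": lambda r: {"prefers": "direct flights"},
    "search_flights": lambda r: ["SFO->CDG"],
}
# A stub planner standing in for the LLM deciding its own steps.
stub_planner = lambda request, available: ["identify_intent", "search_flights"]

print(run_hardcoded("Paris trip", agent_tools))
print(run_semi_autonomous("Paris trip", agent_tools, stub_planner))
```

The fully autonomous rung — where the agent writes its own tools in a code editor — doesn't fit in a few lines, but it amounts to replacing the fixed `agent_tools` dict with code the model generates and executes.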

01:12:05.210 --> 01:12:07.409
So yeah.

01:12:07.409 --> 01:12:08.849
Remember those keywords.

01:12:08.850 --> 01:12:14.530
Memory, prompts,
tools, et cetera.

01:12:14.529 --> 01:12:18.409
Now, I presented the
flight API, but it does not

01:12:18.409 --> 01:12:19.729
have to be an API.

01:12:19.729 --> 01:12:23.329
You probably have heard the term
MCP or model context protocol

01:12:23.329 --> 01:12:25.229
that was coined by Anthropic.

01:12:25.229 --> 01:12:29.649
I pasted the seminal article on
MCP at the bottom of this slide.

01:12:29.649 --> 01:12:34.689
But let me explain in a nutshell
why those things would differ.

01:12:34.689 --> 01:12:39.649
In the API case,
you would actually

01:12:39.649 --> 01:12:42.710
teach your LLM to ping an API.

01:12:42.710 --> 01:12:45.670
So you would say this is
how you ping this API,

01:12:45.670 --> 01:12:48.050
and this is the data that
it will send you back.

01:12:48.050 --> 01:12:51.430
And you would have to do
that in a one off manner.

01:12:51.430 --> 01:12:53.610
So you would have
to build or give

01:12:53.609 --> 01:12:56.670
the API documentation
of your flight API.

01:12:56.670 --> 01:13:00.750
You're booking hotel
API, your car rental API.

01:13:00.750 --> 01:13:03.029
And then you would give
tools for your model

01:13:03.029 --> 01:13:06.630
to communicate with those APIs.

01:13:06.630 --> 01:13:11.150
It doesn't scale
very well versus MCP.

01:13:11.149 --> 01:13:19.429
MCP, it's really about putting
a system in the middle that

01:13:19.430 --> 01:13:22.270
would make it simpler for
your LLM to communicate

01:13:22.270 --> 01:13:23.750
with that endpoint.

01:13:23.750 --> 01:13:28.789
So for instance, you might have
an MCP server and an MCP client,

01:13:28.789 --> 01:13:30.550
where you're trying
to communicate

01:13:30.550 --> 01:13:35.510
with that travel database
or the flight API via MCP.

01:13:35.510 --> 01:13:38.430
And your agent might actually
just communicate with it

01:13:38.430 --> 01:13:42.030
and say, hey, what do you need
in order to give me more flight

01:13:42.029 --> 01:13:43.109
information?

01:13:43.109 --> 01:13:47.069
And the server will respond:
I would like you to tell me

01:13:47.069 --> 01:13:49.429
where the origin flight is,
where the destination is,

01:13:49.430 --> 01:13:51.289
and what you're looking
for at a high level.

01:13:51.289 --> 01:13:52.250
This is my requirement.

01:13:52.250 --> 01:13:52.750
OK.

01:13:52.750 --> 01:13:55.159
Let me get back to you
with my requirements.

01:13:55.159 --> 01:13:55.659
Oh.

01:13:55.659 --> 01:13:57.880
You forgot to tell me
your budget, whatever.

01:13:57.880 --> 01:13:58.380
Oh.

01:13:58.380 --> 01:14:00.720
Let me give you my
budget, et cetera.

01:14:00.720 --> 01:14:04.740
And it's agent to
agent communication,

01:14:04.739 --> 01:14:06.739
which allows more scalability.

01:14:06.739 --> 01:14:09.099
You don't need to
hard code everything.

01:14:09.100 --> 01:14:11.920
Companies have displayed
their MCPs out there,

01:14:11.920 --> 01:14:14.279
and your agent can
communicate with them

01:14:14.279 --> 01:14:16.899
and figure out how to
get the data it needs.

01:14:16.899 --> 01:14:18.639
Does that make sense?
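
That back-and-forth can be caricatured in a few lines. To be clear, this is a toy illustration of the idea — it is not the actual Model Context Protocol wire format, and all the class names and fields are made up:

```python
# Toy illustration of the MCP idea: instead of hard-coding each API's
# request format, the client asks the server what it needs, then fills in
# missing fields through a short negotiation.

class FlightMCPServer:
    REQUIRED = ["origin", "destination", "budget"]

    def describe(self):
        """The server advertises its requirements."""
        return list(self.REQUIRED)

    def query(self, params):
        missing = [f for f in self.REQUIRED if f not in params]
        if missing:
            return {"error": "missing fields", "missing": missing}
        return {"flights": [f"{params['origin']}->{params['destination']}"]}

class AgentClient:
    def book(self, server, known):
        params = dict(known)
        for field in server.describe():        # negotiate requirements
            params.setdefault(field, self.ask_user(field))
        return server.query(params)

    def ask_user(self, field):
        # Stand-in for going back to the user ("you forgot your budget").
        return {"budget": 1000}.get(field, "unknown")

server = FlightMCPServer()
client = AgentClient()
print(client.book(server, {"origin": "SFO", "destination": "CDG"}))
```

The scalability win is that the client code above knows nothing about flights specifically; it would negotiate with a hotel server the same way.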

01:14:18.640 --> 01:14:21.020
Yeah.

01:14:21.020 --> 01:14:23.373
[INAUDIBLE] rewriting
any [INAUDIBLE]

01:14:36.880 --> 01:14:39.507
I think it is, ultimately.

01:14:39.507 --> 01:14:41.300
The question is, isn't
it a shifting issue?

01:14:41.300 --> 01:14:43.380
Because anyway, if an
API has to be updated,

01:14:43.380 --> 01:14:45.600
the MCP has to be updated,
is what you say, right?

01:14:45.600 --> 01:14:46.900
Yes, that's correct.

01:14:46.899 --> 01:14:51.119
But at least it allows the
agent to go back and forth

01:14:51.119 --> 01:14:52.960
and figure out what
the requirements are.

01:14:52.960 --> 01:14:56.340
But at the end of the day,
ideally, if you're a startup,

01:14:56.340 --> 01:14:57.779
you have some documentation.

01:14:57.779 --> 01:15:00.859
And automatically, you have
an agent or an LLM workflow

01:15:00.859 --> 01:15:03.099
that reads that documentation
and updates the code

01:15:03.100 --> 01:15:04.500
accordingly.

01:15:04.500 --> 01:15:05.720
But I agree.

01:15:05.720 --> 01:15:08.980
It's not something that
is fully autonomous.

01:15:08.979 --> 01:15:09.519
Yeah.

01:15:09.520 --> 01:15:12.680
I've seen some
security issues.

01:15:12.680 --> 01:15:14.539
Why is that possible?

01:15:14.539 --> 01:15:16.909
Which security specifically?

01:15:16.909 --> 01:15:18.840
[INAUDIBLE]

01:15:18.840 --> 01:15:19.340
Yeah.

01:15:19.340 --> 01:15:23.300
So are there security
issues with MCPs?

01:15:23.300 --> 01:15:25.779
So think about it this way.

01:15:25.779 --> 01:15:28.979
MCPs, depending on the data
that you get access to,

01:15:28.979 --> 01:15:30.939
might have different
requirements, lower stake

01:15:30.939 --> 01:15:31.879
or higher stake.

01:15:31.880 --> 01:15:34.380
I'm not an expert
at the full range.

01:15:34.380 --> 01:15:42.539
But it wouldn't surprise me
that when you expose an MCP to--

01:15:42.539 --> 01:15:45.600
I think a lot of
MCPs have authentication.

01:15:45.600 --> 01:15:47.660
So you might
actually need a code

01:15:47.659 --> 01:15:50.340
to actually talk to it, just
like you would with an API,

01:15:50.340 --> 01:15:52.190
or a key.

01:15:52.189 --> 01:15:53.869
Yeah, but that's
a good question.

01:15:53.869 --> 01:15:56.729
I'm not an expert at the
security of these systems,

01:15:56.729 --> 01:15:59.049
but we can look into it.

01:16:02.670 --> 01:16:04.670
Any other questions
on what we've

01:16:04.670 --> 01:16:10.470
seen with the agentic workflows,
APIs, tools, MCPs, memory?

01:16:10.470 --> 01:16:11.750
All of that is a work in progress.

01:16:11.750 --> 01:16:14.289
So even memory is not a
solved problem by any means.

01:16:14.289 --> 01:16:16.510
It's pretty hard actually.

01:16:16.510 --> 01:16:18.350
Yes.

01:16:18.350 --> 01:16:24.510
You don't need an
[INAUDIBLE] The MCP just

01:16:24.510 --> 01:16:28.481
makes it easier to access
the API, but technically,

01:16:28.481 --> 01:16:29.689
[INAUDIBLE]

01:16:40.829 --> 01:16:42.109
Exactly, exactly.

01:16:42.109 --> 01:16:45.289
Is MCP about efficiency
or accessing more data?

01:16:45.289 --> 01:16:47.109
It's about efficiency.

01:16:47.109 --> 01:16:53.710
Let's say you have a coding
agent, and it has an MCP client,

01:16:53.710 --> 01:16:57.850
and there's multiple MCP servers
that are exposed out there.

01:16:57.850 --> 01:17:00.690
That agent can communicate
very efficiently with them

01:17:00.689 --> 01:17:03.529
and find what it needs.

01:17:03.529 --> 01:17:05.170
And it's a more
efficient process

01:17:05.170 --> 01:17:09.690
than actually exposing APIs
and teaching the agent, on that side,

01:17:09.689 --> 01:17:12.169
how to ping them and
what the protocol is.

01:17:12.170 --> 01:17:13.810
But it's not about
the data that is

01:17:13.810 --> 01:17:15.370
being exposed because
ultimately, you control

01:17:15.369 --> 01:17:16.662
the data that is being exposed.

01:17:19.090 --> 01:17:22.069
You probably, depending
on how the MCP is built,

01:17:22.069 --> 01:17:24.569
my guess is you probably
expose yourself to other risks

01:17:24.569 --> 01:17:31.529
because your MCP server can
see any input pretty much

01:17:31.529 --> 01:17:32.434
from another LLM.

01:17:32.435 --> 01:17:33.560
And so it has to be robust.

01:17:36.130 --> 01:17:37.529
But yeah.

01:17:37.529 --> 01:17:39.329
Super.

01:17:39.329 --> 01:17:41.449
So let's look at an
example of a step

01:17:41.449 --> 01:17:45.069
by step workflow for
the travel agent.

01:17:45.069 --> 01:17:50.819
So let's say the user says, I
want to plan a trip to Paris

01:17:50.819 --> 01:17:56.099
from December 15 to
20th with flights,

01:17:56.100 --> 01:18:00.579
hotels near the Eiffel Tower,
and then an itinerary of

01:18:00.579 --> 01:18:01.819
must-visit places.

01:18:01.819 --> 01:18:04.019
That's the task to
the travel agent.

01:18:04.020 --> 01:18:06.500
Step two, the agent
plans the steps.

01:18:06.500 --> 01:18:08.640
So it says, I'm going
to find flights.

01:18:08.640 --> 01:18:12.400
Use the flight search API to
get options for December 15.

01:18:12.399 --> 01:18:15.059
Search hotels, generate
recommendations for places

01:18:15.060 --> 01:18:20.039
to visit, validate
preferences, budget, et cetera.

01:18:20.039 --> 01:18:24.060
Book the trip with the
payment processing API.

01:18:24.060 --> 01:18:25.760
That's just the
planning, by the way.

01:18:25.760 --> 01:18:28.680
Step three, execute the
plan, use your tools,

01:18:28.680 --> 01:18:31.420
combine the results,
and then proactive

01:18:31.420 --> 01:18:33.260
user interaction and booking.

01:18:33.260 --> 01:18:35.900
It might make a first
proposal to the user

01:18:35.899 --> 01:18:38.479
and ask the user to
validate or invalidate

01:18:38.479 --> 01:18:42.699
and then may repeat that
planning and execution process.

01:18:42.699 --> 01:18:46.079
And then finally, it might
actually update the memory.

01:18:46.079 --> 01:18:49.000
It might say, oh, I just
learned through this interaction

01:18:49.000 --> 01:18:51.880
that the user only
likes direct flights.

01:18:51.880 --> 01:18:55.640
Next time, I'll only
give direct flights.

01:18:55.640 --> 01:19:01.160
Or I noticed users are fine with
three star hotels or four star

01:19:01.159 --> 01:19:01.739
hotels.

01:19:01.739 --> 01:19:05.000
And in fact, they don't want
to go above budget or something

01:19:05.000 --> 01:19:08.000
like that.
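
The five steps above — plan, execute, propose, validate, update memory — compress into a loop like the following sketch. The planner, tools, and user are all stubs; the function and dictionary names are my own, not a real framework:

```python
# A compressed sketch of the plan -> execute -> validate -> remember loop
# for the travel agent. Capped at max_rounds so a never-satisfied user
# cannot loop forever.

PLAN = ["find_flights", "search_hotels", "recommend_places"]

def travel_agent(request, travel_tools, memory, user_approves, max_rounds=3):
    results = {}
    for _ in range(max_rounds):
        # Step 3: execute the plan by calling each tool.
        results = {step: travel_tools[step](request, memory) for step in PLAN}
        # Step 4: propose to the user; stop if they validate.
        if user_approves(results):
            break
        memory["feedback"] = "revise"      # re-plan with feedback
    # Step 5: update memory with what we learned this interaction.
    memory["prefers_direct"] = True
    return results

travel_tools = {
    "find_flights": lambda r, m: ["direct SFO->CDG Dec 15"],
    "search_hotels": lambda r, m: ["hotel near Eiffel Tower"],
    "recommend_places": lambda r, m: ["Louvre", "Musée d'Orsay"],
}
memory = {}
out = travel_agent("Paris Dec 15-20", travel_tools, memory,
                   user_approves=lambda results: True)
print(out["find_flights"], memory)
```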

01:19:08.000 --> 01:19:11.739
So that hopefully makes sense
by now on how you might do that.

01:19:11.739 --> 01:19:16.420
My question for you is, how
would you know if this works?

01:19:16.420 --> 01:19:19.600
And if you had such a system
running in production, how

01:19:19.600 --> 01:19:20.860
would you improve it?

01:19:28.420 --> 01:19:28.920
Yeah.

01:19:28.920 --> 01:19:31.800
Let users rate
their experience.

01:19:31.800 --> 01:19:33.579
So that's an example.

01:19:33.579 --> 01:19:37.399
So let users rate their
experience at the end.

01:19:37.399 --> 01:19:39.699
That would be an end
to end test, right?

01:19:39.699 --> 01:19:42.960
You're looking at the user
experience through the steps

01:19:42.960 --> 01:19:46.069
and say how good was it
from 1 to 5, let's say.

01:19:46.069 --> 01:19:46.722
Yeah.

01:19:46.722 --> 01:19:47.390
It's a good way.

01:19:47.390 --> 01:19:50.730
And then if you learn
that a user says 1,

01:19:50.729 --> 01:19:53.679
how do you improve the workflow?

01:19:56.855 --> 01:19:58.010
[INAUDIBLE]

01:19:59.390 --> 01:19:59.890
OK.

01:19:59.890 --> 01:20:04.329
So you would go down a tree
and say, OK, you said 1.

01:20:04.329 --> 01:20:06.069
What was your issue?

01:20:06.069 --> 01:20:10.170
And then the user says the
prices were too high, let's say.

01:20:10.170 --> 01:20:14.690
And then you would go back and
fix that specific tool or prompt

01:20:14.689 --> 01:20:15.789
or, yeah, OK.

01:20:15.789 --> 01:20:18.582
Any other ideas?

01:20:18.582 --> 01:20:19.690
[INAUDIBLE]

01:20:29.130 --> 01:20:29.750
Yeah, good.

01:20:29.750 --> 01:20:30.949
So that's a good insight.

01:20:30.949 --> 01:20:34.309
Separate the LLM related stuff
from the non-LLM related stuff,

01:20:34.310 --> 01:20:35.553
the deterministic stuff.

01:20:35.552 --> 01:20:36.970
The deterministic
stuff, you might

01:20:36.970 --> 01:20:41.530
be able to fix it more
objectively essentially.

01:20:41.529 --> 01:20:43.590
Yeah.

01:20:43.590 --> 01:20:44.329
What else?

01:20:56.670 --> 01:21:00.909
So give me an example
of an objective issue

01:21:00.909 --> 01:21:03.149
that you can notice and
how you would fix it

01:21:03.149 --> 01:21:06.269
versus a subjective issue.

01:21:06.270 --> 01:21:06.810
Yeah.

01:21:06.810 --> 01:21:08.550
[INAUDIBLE]

01:21:16.050 --> 01:21:19.090
So let's say you see
the same flight twice,

01:21:19.090 --> 01:21:21.550
but one option is cheaper
than the other.

01:21:21.550 --> 01:21:23.010
The expensive one is objectively worse.

01:21:23.010 --> 01:21:25.690
And so you can capture
that almost automatically.

01:21:25.689 --> 01:21:26.189
Yeah.

01:21:26.189 --> 01:21:27.869
So you could
actually build evals

01:21:27.869 --> 01:21:32.529
that are objective, that are
tracked across your users.

01:21:32.529 --> 01:21:34.949
And you might actually
run an analysis after

01:21:34.949 --> 01:21:37.170
and see that for
the objective stuff,

01:21:37.170 --> 01:21:43.640
we notice that our LLM AI agent
workflow is bad with pricing.

01:21:43.640 --> 01:21:46.000
It just doesn't reason about price
as well, because it always

01:21:46.000 --> 01:21:48.079
gives a more expensive option.

01:21:48.079 --> 01:21:48.579
Yeah.

01:21:48.579 --> 01:21:49.698
You're perfectly right.

01:21:49.698 --> 01:21:50.990
How about the subjective stuff?

01:21:59.600 --> 01:22:01.920
Do you choose a direct
or indirect flight

01:22:01.920 --> 01:22:05.060
if the indirect is a
little bit cheaper?

01:22:05.060 --> 01:22:05.560
Yeah.

01:22:05.560 --> 01:22:06.380
Good one.

01:22:06.380 --> 01:22:09.079
Do you choose a direct
flight or an indirect flight

01:22:09.079 --> 01:22:12.960
if the indirect is cheaper but
the direct is more comfortable?

01:22:12.960 --> 01:22:13.460
Yeah.

01:22:13.460 --> 01:22:16.000
That's a good one actually.

01:22:16.000 --> 01:22:18.739
So how would you capture
that information?

01:22:18.739 --> 01:22:20.809
Let's say this is used
by thousands of users.

01:22:24.279 --> 01:22:28.920
Could you feed
something in [INAUDIBLE]

01:22:28.920 --> 01:22:30.220
Could you feed something in?

01:22:30.220 --> 01:22:32.690
Yeah, I mean, you could--

01:22:32.689 --> 01:22:36.279
could feed something in
about the user preferences?

01:22:36.279 --> 01:22:39.380
Well, you could
build a data set that

01:22:39.380 --> 01:22:40.800
has some of that information.

01:22:40.800 --> 01:22:44.739
So you build 10 prompts, where
the user is asking specifically

01:22:44.739 --> 01:22:46.639
for a direct--

01:22:46.640 --> 01:22:48.940
saying that I prefer
direct flights because I

01:22:48.939 --> 01:22:50.979
care about my time, let's say.

01:22:50.979 --> 01:22:53.219
And then you look at the
output and you actually

01:22:53.220 --> 01:22:56.340
give a good example
of a good output,

01:22:56.340 --> 01:22:58.699
and you probably
are able to capture

01:22:58.699 --> 01:23:04.019
the performance of your agentic
workflow on this specific eval.

01:23:04.020 --> 01:23:05.320
Does it prioritize?

01:23:05.319 --> 01:23:07.159
Does it understand
price conscious--

01:23:07.159 --> 01:23:08.979
is it price conscious,
essentially,

01:23:08.979 --> 01:23:10.659
and comfort conscious?

01:23:10.659 --> 01:23:13.300
Yeah.

01:23:13.300 --> 01:23:14.360
What about the tone?

01:23:14.359 --> 01:23:18.819
Let's say the LLM right
now is not very friendly.

01:23:18.819 --> 01:23:23.000
How would you notice that,
and how would you fix it?

01:23:26.119 --> 01:23:26.619
Yeah.

01:23:26.619 --> 01:23:29.500
Have the test user
run the prompt

01:23:29.500 --> 01:23:33.020
and see if there's
something wrong with that.

01:23:33.020 --> 01:23:33.520
OK.

01:23:33.520 --> 01:23:36.037
Have a test user run the
prompt and see if there's

01:23:36.037 --> 01:23:37.119
something wrong with that.

01:23:37.119 --> 01:23:38.287
Tell me about the last step.

01:23:38.287 --> 01:23:40.829
How would you notice
that something is wrong?

01:23:40.829 --> 01:23:48.550
So a couple of tests
[INAUDIBLE] evaluates

01:23:48.550 --> 01:23:51.670
the response and [INAUDIBLE]

01:23:51.670 --> 01:23:52.210
Yeah.

01:23:52.210 --> 01:23:53.609
I agree with your approach.

01:23:53.609 --> 01:23:55.750
Have LLM judges that
evaluate the response

01:23:55.750 --> 01:23:58.603
against a certain rubric of
what politeness looks like.

01:23:58.603 --> 01:24:00.270
So here in this case,
you could actually

01:24:00.270 --> 01:24:02.850
start with error analysis.

01:24:02.850 --> 01:24:05.210
So you start, you
have 1,000 users.

01:24:05.210 --> 01:24:07.789
And you can pull up
20 user interactions

01:24:07.789 --> 01:24:09.010
and read through it.

01:24:09.010 --> 01:24:11.630
And you might notice,
at first sight,

01:24:11.630 --> 01:24:14.470
the LLM seems to be very rude.

01:24:14.470 --> 01:24:18.430
It's just super, super
short in its answers,

01:24:18.430 --> 01:24:20.510
and it's not very helpful.

01:24:20.510 --> 01:24:23.310
You notice that with your
error analysis manually.

01:24:23.310 --> 01:24:24.650
Then you go to the next stage.

01:24:24.649 --> 01:24:26.449
You actually put
evals behind it.

01:24:26.449 --> 01:24:33.309
You say, I'm going to
create a set of LLM judges

01:24:33.310 --> 01:24:35.710
that are going to look
at the user interaction

01:24:35.710 --> 01:24:38.890
and are going to rate
how polite it is.

01:24:38.890 --> 01:24:40.690
And I'm going to
give it a rubric.

01:24:40.689 --> 01:24:42.989
Then what I'm going to do
is I'm going to flip my LLM.

01:24:42.989 --> 01:24:45.769
Instead of using GPT-4,
I'm going to use Grok.

01:24:45.770 --> 01:24:48.010
And instead of using
Grok, I'm using Llama.

01:24:48.010 --> 01:24:51.470
And then I'm going to run
those three LLMs side by side,

01:24:51.470 --> 01:24:56.329
give it to my LLM judges, and
then get my subjective score

01:24:56.329 --> 01:25:02.390
at the end to say, oh, x model
was more polite on average.

01:25:02.390 --> 01:25:02.890
Yeah.

01:25:02.890 --> 01:25:03.630
Perfectly right.

01:25:03.630 --> 01:25:05.850
That's an example of an
eval that is very specific

01:25:05.850 --> 01:25:07.730
and allows you to
choose between LLMs.

01:25:07.729 --> 01:25:10.869
You could actually do the
same eval not across LLMs,

01:25:10.869 --> 01:25:12.976
but fix the LLM and
change the prompt.

01:25:12.976 --> 01:25:15.309
You actually, instead of
saying act like a travel agent,

01:25:15.310 --> 01:25:17.870
you say act like a
helpful travel agent.

01:25:17.869 --> 01:25:21.090
And then you see the influence
of that word on your eval

01:25:21.090 --> 01:25:22.390
with the LLM as judges.

01:25:22.390 --> 01:25:24.170
Does that make sense?
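
The eval setup just described — run candidate models on the same prompts, score each response against a rubric, compare averages — can be sketched as follows. Here the judge is a keyword heuristic standing in for an LLM judge, and the candidate model outputs are canned; in practice both would be model calls:

```python
# Sketch of an LLM-as-judge politeness eval across candidate models.

PROMPTS = ["Change my flight", "Where is my refund?"]

CANDIDATES = {
    "model_a": lambda p: f"Happy to help! Let's sort out: {p.lower()}",
    "model_b": lambda p: "No.",
}

def judge(response):
    """Toy rubric: reward friendly phrasing, penalize curt answers.
    A real version would prompt an LLM with this rubric instead."""
    score = 3
    if "happy to help" in response.lower():
        score += 2
    if len(response) < 10:
        score -= 2
    return max(1, min(5, score))

def run_eval(candidates, prompts):
    """Average judge score per candidate over the prompt set."""
    return {name: sum(judge(fn(p)) for p in prompts) / len(prompts)
            for name, fn in candidates.items()}

scores = run_eval(CANDIDATES, PROMPTS)
print(scores)
```

The same harness supports the second experiment mentioned above: keep one model fixed and swap prompts ("act like a travel agent" vs. "act like a helpful travel agent"), treating each prompt variant as a candidate.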

01:25:24.170 --> 01:25:25.970
OK.

01:25:25.970 --> 01:25:26.470
Super.

01:25:26.470 --> 01:25:29.670
So let's move forward and
do a case study with evals.

01:25:29.670 --> 01:25:33.369
And then we're almost
done for today.

01:25:33.369 --> 01:25:38.300
Let's say your product manager
asks you to build an AI

01:25:38.300 --> 01:25:41.860
agent for customer support, OK?

01:25:41.859 --> 01:25:42.960
Where do you start?

01:25:42.960 --> 01:25:45.079
And here is an example
of the user prompt.

01:25:45.079 --> 01:25:48.000
I need to change my shipping
address for order, blah, blah,

01:25:48.000 --> 01:25:48.500
blah.

01:25:48.500 --> 01:25:51.739
I move to a new address.

01:25:51.739 --> 01:25:54.779
So what do you start if I'm
giving you that project?

01:26:04.659 --> 01:26:05.859
Yes.

01:26:05.859 --> 01:26:10.420
We search online for existing
models and [INAUDIBLE]

01:26:16.260 --> 01:26:17.720
So do some research.

01:26:17.720 --> 01:26:20.420
See benchmarks and
how different models

01:26:20.420 --> 01:26:22.119
perform at customer support.

01:26:22.119 --> 01:26:23.284
And then pick a model.

01:26:23.284 --> 01:26:24.159
That's what you mean.

01:26:24.159 --> 01:26:24.779
Yeah.

01:26:24.779 --> 01:26:25.960
It's true you could do that.

01:26:25.960 --> 01:26:28.020
What else could you do?

01:26:28.020 --> 01:26:28.908
Yeah.

01:26:28.908 --> 01:26:34.360
[INAUDIBLE]

01:26:34.359 --> 01:26:34.859
OK.

01:26:34.859 --> 01:26:35.880
Yeah, I like that.

01:26:35.880 --> 01:26:39.840
Try to decompose the different
tasks that it will need

01:26:39.840 --> 01:26:42.685
and try to guess which ones will
be more of a struggle, which

01:26:42.685 --> 01:26:45.060
ones should be fuzzy, which
ones should be deterministic.

01:26:45.060 --> 01:26:46.350
Yeah, you're right.

01:26:46.350 --> 01:26:47.520
[INAUDIBLE]

01:26:55.819 --> 01:26:56.319
Yeah.

01:26:56.319 --> 01:26:58.516
Similar to what you said.

01:26:58.516 --> 01:27:00.099
That's what I would
recommend as well.

01:27:00.100 --> 01:27:02.320
You say I would sit down
with a customer support

01:27:02.319 --> 01:27:04.822
agent for a day or two, and
I would decompose the tasks

01:27:04.822 --> 01:27:05.779
that are going through.

01:27:05.779 --> 01:27:07.500
I will ask them, where
do they struggle?

01:27:07.500 --> 01:27:08.819
How much time it takes?

01:27:08.819 --> 01:27:09.319
Yes.

01:27:09.319 --> 01:27:12.679
That's usually where you want to
start with task decomposition.

01:27:12.680 --> 01:27:16.659
So let's say we've done that
work, and we have this list.

01:27:16.659 --> 01:27:17.500
I'm simplifying.

01:27:17.500 --> 01:27:20.239
But the customer support
agent, human, typically

01:27:20.239 --> 01:27:23.000
would extract key
info, then look up

01:27:23.000 --> 01:27:25.680
in the database to retrieve
the customer record.

01:27:25.680 --> 01:27:27.360
Then check the policy.

01:27:27.359 --> 01:27:29.960
Are we allowed to
update the address,

01:27:29.960 --> 01:27:32.409
or is it a fixed data point?

01:27:32.409 --> 01:27:35.569
And then draft a response
email and send the email.

01:27:35.569 --> 01:27:37.019
So we've decomposed that task.

01:27:39.770 --> 01:27:42.490
Once you've
decomposed that task,

01:27:42.489 --> 01:27:45.159
how do you design
your agentic workflow?

01:28:03.850 --> 01:28:04.710
Yes.

01:28:04.710 --> 01:28:06.404
[INAUDIBLE]

01:28:17.770 --> 01:28:18.330
Exactly.

01:28:18.329 --> 01:28:20.409
So to repeat,
you're going to look

01:28:20.409 --> 01:28:24.949
at the decomposition of tasks,
get an instinct of what's fuzzy,

01:28:24.949 --> 01:28:28.010
what's deterministic,
and then determine

01:28:28.010 --> 01:28:33.300
which line is going to be an LLM
one shot, which one will require

01:28:33.300 --> 01:28:36.779
maybe a RAG, which one will
require a tool, which one will

01:28:36.779 --> 01:28:38.519
require memory, which one--

01:28:38.520 --> 01:28:41.060
So you will start
designing that map.

01:28:41.060 --> 01:28:41.880
Completely right.

01:28:41.880 --> 01:28:43.600
That's also what
I would recommend.

01:28:43.600 --> 01:28:48.260
You might actually draft it and
say, OK, I take the user prompt.

01:28:48.260 --> 01:28:52.500
And the first step of
my task decomposition

01:28:52.500 --> 01:28:57.479
was extract information that
seems to be a vanilla LLM.

01:28:57.479 --> 01:29:00.099
You can guess that the
vanilla LLM would probably

01:29:00.100 --> 01:29:03.220
be good enough at
extracting that the user wants

01:29:03.220 --> 01:29:05.632
to change their address,
and this is the order number

01:29:05.632 --> 01:29:06.800
and this is the new address.

01:29:06.800 --> 01:29:08.940
You probably don't need
too much technology

01:29:08.939 --> 01:29:11.579
there other than the LLM.

01:29:11.579 --> 01:29:14.899
The next step, it feels like
you need a tool because you're

01:29:14.899 --> 01:29:17.539
actually going to have to
look up in the database

01:29:17.539 --> 01:29:21.380
and also update the address.

01:29:21.380 --> 01:29:23.020
So that might be a
tool, and you might

01:29:23.020 --> 01:29:25.020
have to build a custom
tool for the LLM

01:29:25.020 --> 01:29:27.260
to say, let me connect
you to that database

01:29:27.260 --> 01:29:29.869
or let me give you access to
that resource with an MCP.

01:29:32.840 --> 01:29:35.940
After that, you probably need an
LLM again to draft the email,

01:29:35.939 --> 01:29:38.156
but you would probably
paste in the confirmation.

01:29:38.157 --> 01:29:40.239
You would paste the
confirmation that your address

01:29:40.239 --> 01:29:42.279
has been updated from x to y.

01:29:42.279 --> 01:29:44.559
And then the LLM
will draft an answer.

01:29:44.560 --> 01:29:46.380
And of course,
just to not forget,

01:29:46.380 --> 01:29:49.279
you might need a tool
to send the email.

01:29:49.279 --> 01:29:54.439
You might actually need
to post something

01:29:54.439 --> 01:29:57.399
for the email to go out.

01:29:57.399 --> 01:29:59.079
And then you'll get the output.

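The workflow just walked through can be sketched end to end. This is a minimal sketch under stated assumptions: every step function here is a hypothetical placeholder passed in by the caller — `extract_info` and `draft_reply` would be LLM calls, while `lookup_order`, `update_address`, and `send_email` would be custom tools.

```python
# Sketch of the address-change agentic workflow:
# extract info (LLM) -> database lookup (tool) -> policy check
# (deterministic) -> update (tool) -> draft reply (LLM) -> send (tool).

def handle_address_change(user_message, extract_info, lookup_order,
                          policy_allows_update, update_address,
                          draft_reply, send_email):
    info = extract_info(user_message)         # LLM: order ID, new address
    record = lookup_order(info["order_id"])   # tool: database lookup
    if not policy_allows_update(record):      # deterministic policy check
        return send_email(draft_reply("update not allowed", record))
    confirmation = update_address(record, info["new_address"])  # tool
    return send_email(draft_reply(confirmation, record))        # LLM + tool
```

Wiring the steps as injected callables keeps each component swappable and easy to eval on its own, which matters later when you debug component by component.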
01:29:59.079 --> 01:30:02.199
Does that make sense? So
exactly what you described.

01:30:02.199 --> 01:30:03.939
Now moving to the next step.

01:30:03.939 --> 01:30:06.279
Once we have
decomposed our tasks.

01:30:06.279 --> 01:30:09.300
Then we have designed an
agentic workflow around it.

01:30:09.300 --> 01:30:10.641
It took us five minutes.

01:30:10.641 --> 01:30:12.099
In practice, it
would take you more

01:30:12.100 --> 01:30:13.280
if you're building
your startup on that.

01:30:13.279 --> 01:30:15.697
You want to make sure your
task decomposition is accurate,

01:30:15.697 --> 01:30:17.480
your thing is accurate
here, and then

01:30:17.479 --> 01:30:20.239
you can have a lot of
work done on every tool

01:30:20.239 --> 01:30:22.880
and optimize it for
latency and cost.

01:30:22.880 --> 01:30:27.810
But let's say, now we
want to know if it works.

01:30:27.810 --> 01:30:30.960
And I'm going to assume
that you have LLM traces.

01:30:30.960 --> 01:30:33.449
LLM traces are very important.

01:30:33.449 --> 01:30:36.010
Actually, if you're
interviewing with an AI startup,

01:30:36.010 --> 01:30:39.289
I would recommend you in the
interview process to ask them,

01:30:39.289 --> 01:30:40.949
do you have LLM traces?

01:30:40.949 --> 01:30:42.970
Because if they don't
have LLM traces,

01:30:42.970 --> 01:30:46.530
it is pretty hard to debug an
LLM system because you don't

01:30:46.529 --> 01:30:50.649
have visibility on the chain of
complex prompts that were called

01:30:50.649 --> 01:30:52.210
and where the bug is.

01:30:52.210 --> 01:30:57.329
And so it's a basic
part of an AI startup

01:30:57.329 --> 01:31:00.850
stack to have LLM traces.

01:31:00.850 --> 01:31:02.730
So let's assume you have traces.

01:31:02.729 --> 01:31:04.869
How would you know
if your system works?

01:31:04.869 --> 01:31:11.289
I'm going to summarize some
of the things I heard earlier.

01:31:11.289 --> 01:31:15.550
You gave us an example
of an end to end metric.

01:31:15.550 --> 01:31:18.369
You look at the user
satisfaction at the end.

01:31:18.369 --> 01:31:21.130
You can also do a
component-based approach

01:31:21.130 --> 01:31:25.210
where you actually will look at
the tool, the database updates,

01:31:25.210 --> 01:31:28.430
and you will manually do
an error analysis and see,

01:31:28.430 --> 01:31:32.010
oh, the tool actually always
forgets to update the email.

01:31:32.010 --> 01:31:33.806
It just fails at writing.

01:31:33.806 --> 01:31:34.889
And I'm going to fix that.

01:31:34.890 --> 01:31:37.470
This is deterministic
pretty much.

01:31:37.470 --> 01:31:40.990
Or when it tries
to send the email

01:31:40.989 --> 01:31:44.469
and ping the system that is
supposed to send the email,

01:31:44.470 --> 01:31:46.890
it doesn't send it
in the right format.

01:31:46.890 --> 01:31:48.869
And so it bugs at that point.

01:31:48.869 --> 01:31:51.390
Again, you could fix that.

01:31:51.390 --> 01:31:52.570
Draft of the email.

01:31:52.569 --> 01:31:53.929
The LLM doesn't do a great job.

01:31:53.930 --> 01:31:56.909
It's not very polite
at drafting the email.

01:31:56.909 --> 01:31:59.342
So you could look at
component by component,

01:31:59.342 --> 01:32:01.510
and it's actually easier
to debug than to look at it

01:32:01.510 --> 01:32:02.289
end to end.

01:32:02.289 --> 01:32:05.750
You would probably
do a mix of both.

01:32:05.750 --> 01:32:08.430
Another way to look at
it is what is objective

01:32:08.430 --> 01:32:10.530
versus what is subjective?

01:32:10.529 --> 01:32:12.989
So for example, an
objective example

01:32:12.989 --> 01:32:18.229
would be the LLM extracted
the wrong order ID.

01:32:18.229 --> 01:32:21.789
The user said my order
ID is X, and the LLM,

01:32:21.789 --> 01:32:24.500
when it actually looked
up in the database,

01:32:24.500 --> 01:32:26.279
it used the wrong order ID.

01:32:26.279 --> 01:32:27.779
This is objectively wrong.

01:32:27.779 --> 01:32:29.800
You can actually
write Python code

01:32:29.800 --> 01:32:32.239
that checks that, checks just
the alignment between what

01:32:32.239 --> 01:32:36.260
the user mentioned and what was
actually pasted in the database

01:32:36.260 --> 01:32:38.199
or for the lookup.

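That objective check can be sketched directly. This is a minimal sketch, assuming a hypothetical trace format: a dict with the raw user message and a list of recorded tool calls; the `lookup_order` tool name and `order_id` argument are illustrative names, not a real API.

```python
import re

# Objective eval over an LLM trace: did the agent use the same order ID
# in its database lookup that the user actually mentioned?

def order_id_matches(trace):
    """trace: {"user_message": str,
               "tool_calls": [{"name": str, "args": dict}, ...]}"""
    mentioned = re.search(r"order[#\s]*([0-9]+)",
                          trace["user_message"], re.IGNORECASE)
    if not mentioned:
        return None  # nothing to check in this trace
    used = [c["args"].get("order_id") for c in trace["tool_calls"]
            if c["name"] == "lookup_order"]
    return all(str(u) == mentioned.group(1) for u in used)
```

Because the check is deterministic, you can run it over every trace and get an exact pass rate, with no human rater or LLM judge needed.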
01:32:38.199 --> 01:32:40.460
You also have subjective
stuff, which we talked about,

01:32:40.460 --> 01:32:43.279
where you probably want to
do either human rating or LLM

01:32:43.279 --> 01:32:44.139
as judges.

01:32:44.140 --> 01:32:49.560
It's very relevant
for subjective evals.

01:32:49.560 --> 01:32:51.840
And finally, you
will find yourself

01:32:51.840 --> 01:32:55.980
having quantitative evals
and more qualitative evals.

01:32:55.979 --> 01:32:59.399
So quantitative would be
percentage of successful address

01:32:59.399 --> 01:33:00.279
updates.

01:33:00.279 --> 01:33:00.939
The latency.

01:33:00.939 --> 01:33:03.719
You could actually track
the latency component-based

01:33:03.720 --> 01:33:05.680
and see which one
is the slowest.

01:33:05.680 --> 01:33:08.480
Let's say sending the
email is five seconds.

01:33:08.479 --> 01:33:10.159
It's too long, let's say.

01:33:10.159 --> 01:33:13.119
You would notice component
based or the full workflow.

01:33:13.119 --> 01:33:15.880
And then you will decide, where
am I optimizing my latency,

01:33:15.880 --> 01:33:17.680
and how am I going to do that?

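Component-based latency tracking like this can be sketched with a simple timing wrapper. This is an illustrative sketch, not a production tracer: the component names are whatever you register, and a real system would ship these timings to its observability stack.

```python
import time
from collections import defaultdict

# Wrap each workflow step with a timer so you can see, per component,
# where the time goes and which step to optimize first.

class LatencyTracker:
    def __init__(self):
        self.timings = defaultdict(list)  # component name -> durations

    def timed(self, name, fn):
        """Return fn wrapped so each call records its duration."""
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                self.timings[name].append(time.perf_counter() - start)
        return wrapper

    def slowest(self):
        """Component with the highest mean latency."""
        return max(self.timings,
                   key=lambda n: sum(self.timings[n]) / len(self.timings[n]))
```

If `send_email` averages five seconds while everything else is milliseconds, that's where you spend your optimization budget.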
01:33:17.680 --> 01:33:20.240
And then finally, qualitative.

01:33:20.239 --> 01:33:23.099
You might actually do
some error analysis

01:33:23.100 --> 01:33:27.940
and look at where are
the hallucinations?

01:33:27.939 --> 01:33:31.579
Where are the tone mismatches?

01:33:31.579 --> 01:33:34.779
Are the users confused, and
what are they confused by?

01:33:34.779 --> 01:33:36.579
That would be more qualitative.

01:33:36.579 --> 01:33:41.019
And typically, it would take
more white glove approaches

01:33:41.020 --> 01:33:42.460
to do that.

01:33:42.460 --> 01:33:44.539
So here's what it
could look like.

01:33:44.539 --> 01:33:46.000
I gave you some examples.

01:33:46.000 --> 01:33:50.140
But you would build
evals to determine

01:33:50.140 --> 01:33:53.300
objectively, subjectively,
component-based, end

01:33:53.300 --> 01:33:55.060
to end based, and then
quantitatively and

01:33:55.060 --> 01:33:57.700
qualitatively, where's
your LLM failing

01:33:57.699 --> 01:33:59.000
and where it's doing well.

01:34:02.582 --> 01:34:04.539
Does that give you a
sense of the type of stuff

01:34:04.539 --> 01:34:09.939
you could do to fix or
improve that agentic workflow?

01:34:09.939 --> 01:34:10.739
Super.

01:34:10.739 --> 01:34:12.439
Well, that was our
case study on evals.

01:34:12.439 --> 01:34:14.106
We're not going to
delve deeper into it.

01:34:14.106 --> 01:34:16.899
But hopefully, it gave you
a sense of the type of stuff

01:34:16.899 --> 01:34:21.529
you can do with LLM
judges, with objective,

01:34:21.529 --> 01:34:25.829
subjective, component-based,
end to end, et cetera.

01:34:25.829 --> 01:34:29.269
Last section on
multi-agent workflows.

01:34:29.270 --> 01:34:36.030
So you might ask, hey, why do we
need a multi-agent workflow when

01:34:36.029 --> 01:34:38.670
the workflow already
has multiple steps,

01:34:38.670 --> 01:34:42.449
already calls the LLM multiple
times, already gives them tools.

01:34:42.449 --> 01:34:45.104
Why do we need multiple agents?

01:34:45.104 --> 01:34:47.729
And so many people are talking
about multi-agent systems online.

01:34:47.729 --> 01:34:49.309
It's not even a
new thing, frankly.

01:34:49.310 --> 01:34:52.350
Multi-agent systems have
been around for a long time.

01:34:52.350 --> 01:34:55.070
The main advantage of
a multi-agent system

01:34:55.069 --> 01:34:57.489
is going to be parallelism.

01:34:57.489 --> 01:34:59.590
It's like is there
something that I

01:34:59.590 --> 01:35:04.890
wish I would run in parallel,
sort of independently,

01:35:04.890 --> 01:35:07.430
but maybe there are some
things in the middle?

01:35:07.430 --> 01:35:09.930
But that's where you want
to put a multi-agent system.

01:35:09.930 --> 01:35:12.270
It's when it's parallel.

01:35:12.270 --> 01:35:14.950
The other advantage
that some companies

01:35:14.949 --> 01:35:19.164
have with multi-agent systems
is that an agent can be reused.

01:35:19.164 --> 01:35:21.289
So let's say in a company,
you have an agent that's

01:35:21.289 --> 01:35:22.970
been built for design.

01:35:22.970 --> 01:35:25.289
That agent can be used
in the marketing team,

01:35:25.289 --> 01:35:27.930
and it can be used
in the product team.

01:35:27.930 --> 01:35:30.050
And so now you're
optimizing an agent,

01:35:30.050 --> 01:35:33.170
which has multiple stakeholders
that can communicate with it

01:35:33.170 --> 01:35:35.510
and benefit from
its performance.

01:35:38.382 --> 01:35:40.050
Actually I'm going
to ask you a question

01:35:40.050 --> 01:35:43.010
and take a few, maybe a
minute to think about it.

01:35:43.010 --> 01:35:46.489
Let's say you were
building smart home

01:35:46.489 --> 01:35:50.130
automation for your
apartment or your home.

01:35:50.130 --> 01:35:52.810
What agents would
you want to build?

01:35:52.810 --> 01:35:53.530
Yeah.

01:35:53.529 --> 01:35:54.889
Write it down.

01:35:54.890 --> 01:35:57.130
And then I'm going to
ask you in a minute

01:35:57.130 --> 01:36:00.090
to share some of the
agents that you will build.

01:36:00.090 --> 01:36:03.050
Also, think about
how you would put

01:36:03.050 --> 01:36:04.570
a hierarchy between
these agents,

01:36:04.569 --> 01:36:06.210
or how you would
organize them, or who

01:36:06.210 --> 01:36:07.770
should communicate with who.

01:36:07.770 --> 01:36:08.450
OK?

01:36:08.449 --> 01:36:08.949
OK.

01:36:08.949 --> 01:36:12.170
Take a minute for that.

01:36:12.170 --> 01:36:14.850
Be creative also because I'm
going to ask all of your agents,

01:36:14.850 --> 01:36:17.440
and maybe you have an agent
that nobody has thought of.

01:36:21.939 --> 01:36:22.479
OK.

01:36:22.479 --> 01:36:24.259
Let's get started.

01:36:24.260 --> 01:36:26.940
Who wants to give
me a set of agents

01:36:26.939 --> 01:36:29.559
that you would want for
your home, smart home.

01:36:29.560 --> 01:36:30.060
Yes.

01:36:32.739 --> 01:36:35.519
The first is like a set
of agents [INAUDIBLE]

01:37:00.619 --> 01:37:01.119
OK.

01:37:01.119 --> 01:37:02.279
So let me repeat.

01:37:02.279 --> 01:37:05.099
You have four agents,
I think, roughly.

01:37:05.100 --> 01:37:09.520
One that tracks biometric,
like where are you in the home?

01:37:09.520 --> 01:37:10.560
Where are you moving?

01:37:10.560 --> 01:37:12.220
How you're moving,
things like that.

01:37:12.220 --> 01:37:15.240
That sort of knows
your location.

01:37:15.239 --> 01:37:21.199
The second one determines
the temperature of the rooms

01:37:21.199 --> 01:37:23.960
and has the ability
to change it.

01:37:23.960 --> 01:37:26.800
The third one tracks
energy efficiency

01:37:26.800 --> 01:37:31.060
and might give feedback on
energy and energy usage.

01:37:31.060 --> 01:37:32.600
And might be, I
don't know, maybe

01:37:32.600 --> 01:37:34.883
it has the control over
the temperature as well.

01:37:34.882 --> 01:37:35.800
I don't know actually.

01:37:35.800 --> 01:37:43.079
Or the gas or the water, might
cut your water at some point.

01:37:43.079 --> 01:37:44.859
And then you have an
orchestrator agent.

01:37:44.859 --> 01:37:48.688
What is exactly the
orchestrator doing?

01:37:48.688 --> 01:37:53.180
It passes instructions
[INAUDIBLE]

01:37:53.180 --> 01:37:53.680
OK.

01:37:53.680 --> 01:37:55.060
Passes instructions.

01:37:55.060 --> 01:37:58.240
So is that the agent
that communicates mainly

01:37:58.239 --> 01:38:00.000
with the user?

01:38:00.000 --> 01:38:02.279
So if I'm coming
back home and I'm

01:38:02.279 --> 01:38:05.679
saying I want the
oven to be preheated,

01:38:05.680 --> 01:38:07.360
I communicate with
the orchestrator,

01:38:07.359 --> 01:38:09.859
and then it would
funnel to another agent.

01:38:09.859 --> 01:38:10.599
OK.

01:38:10.600 --> 01:38:11.140
Sounds good.

01:38:11.140 --> 01:38:11.640
Yeah.

01:38:11.640 --> 01:38:14.230
So that's an example
of, I want to say,

01:38:14.229 --> 01:38:17.519
a hierarchical agentic
multi-agent system.

01:38:20.770 --> 01:38:21.590
What else?

01:38:21.590 --> 01:38:22.510
Any other ideas?

01:38:22.510 --> 01:38:24.170
What would you add to that?

01:38:24.170 --> 01:38:25.615
Yeah.

01:38:25.615 --> 01:38:27.909
[INAUDIBLE]

01:38:55.329 --> 01:38:56.189
Oh, I like that.

01:38:56.189 --> 01:38:57.429
That's a really good one.

01:38:57.430 --> 01:38:58.890
So let me summarize.

01:38:58.890 --> 01:39:02.250
You have a security agent that
determines if you can enter

01:39:02.250 --> 01:39:03.090
or not.

01:39:03.090 --> 01:39:06.489
And when you enter, it
understands who you are.

01:39:06.489 --> 01:39:08.329
And then it gives
you certain sets

01:39:08.329 --> 01:39:11.309
of permissions that might
be different depending

01:39:11.310 --> 01:39:13.030
of if you're a parent or a kid.

01:39:13.029 --> 01:39:17.689
Or you might have access to
certain cars and not others.

01:39:17.689 --> 01:39:20.109
Or your kid cannot open the
fridge, or I don't know.

01:39:20.109 --> 01:39:21.250
Something like that.

01:39:21.250 --> 01:39:22.390
Yeah.

01:39:22.390 --> 01:39:23.250
OK, I like that.

01:39:23.250 --> 01:39:24.229
That's a good one.

01:39:24.229 --> 01:39:28.469
And it does feel like it's a
complex enough workflow where

01:39:28.470 --> 01:39:32.289
you want a specific
workflow tied to that.

01:39:32.289 --> 01:39:34.510
I agree.

01:39:34.510 --> 01:39:35.520
What else?

01:39:39.750 --> 01:39:41.579
Yes.

01:39:41.579 --> 01:39:43.970
[INAUDIBLE] So you can
get more complicated.

01:39:43.970 --> 01:39:50.230
So high energy savings,
with whether or not you

01:39:50.229 --> 01:39:55.989
or someone else can adjust the blinds
in the house, or also

01:39:55.989 --> 01:39:57.329
when you tap into the grid.

01:39:57.329 --> 01:40:04.510
Yeah. So another thought I
have as well is much harder

01:40:04.510 --> 01:40:06.909
to track in the grocery store.

01:40:06.909 --> 01:40:08.949
But understanding
what's in your fridge.

01:40:08.949 --> 01:40:12.762
OK

01:40:12.762 --> 01:40:14.180
Well, that's really
good actually.

01:40:14.180 --> 01:40:16.240
So you mentioned two of them.

01:40:16.239 --> 01:40:20.719
One is maybe an agent that has
access to external APIs that

01:40:20.720 --> 01:40:24.320
can understand the weather
out there, the wind, the sun,

01:40:24.319 --> 01:40:28.539
and then has control over
certain devices at home.

01:40:28.539 --> 01:40:31.560
Temperature, blinds, things
like that, and also understands

01:40:31.560 --> 01:40:33.100
your preferences for it.

01:40:33.100 --> 01:40:36.039
That does feel like it's a good
use case because you could give

01:40:36.039 --> 01:40:38.840
that to the orchestrator,
but it might lose itself

01:40:38.840 --> 01:40:41.039
because it's doing too much.

01:40:41.039 --> 01:40:43.039
And also, these problems
are tied together,

01:40:43.039 --> 01:40:45.479
like temperature outdoor
with the weather API

01:40:45.479 --> 01:40:48.359
might influence the
temperature inside,

01:40:48.359 --> 01:40:50.199
how you want it, et cetera.

01:40:50.199 --> 01:40:52.800
And then the second
one, which I also like,

01:40:52.800 --> 01:40:55.920
is you might have an agent
that looks at your fridge

01:40:55.920 --> 01:40:57.185
and what's inside.

01:40:57.185 --> 01:40:58.560
And it might
actually have access

01:40:58.560 --> 01:41:01.410
to the camera in the
fridge, for example,

01:41:01.409 --> 01:41:03.720
and know your
preferences and also has

01:41:03.720 --> 01:41:06.800
access to the
e-commerce API to order

01:41:06.800 --> 01:41:09.539
Amazon groceries ahead of time.

01:41:09.539 --> 01:41:10.319
I agree.

01:41:10.319 --> 01:41:12.859
And maybe the orchestrator
will be the communication line

01:41:12.859 --> 01:41:16.139
with the user, but it might
communicate with that agent

01:41:16.140 --> 01:41:17.880
in order to get it done.

01:41:17.880 --> 01:41:18.380
Yeah.

01:41:18.380 --> 01:41:19.079
I like those.

01:41:19.079 --> 01:41:21.760
So those are all
really good examples.

01:41:21.760 --> 01:41:25.500
Here is the list I had up there.

01:41:25.500 --> 01:41:30.079
So climate control, lighting,
security, energy management,

01:41:30.079 --> 01:41:32.180
entertainment,
notification agent,

01:41:32.180 --> 01:41:35.400
alerts about the system updates,
energy saving, and orchestrator.

01:41:35.399 --> 01:41:38.019
So all of them you
mentioned actually.

01:41:38.020 --> 01:41:41.260
And then we didn't talk about
the different interaction

01:41:41.260 --> 01:41:45.220
patterns, but you do have
different ways to organize

01:41:45.220 --> 01:41:46.900
a multi-agent system.

01:41:46.899 --> 01:41:48.519
Flat, hierarchical.

01:41:48.520 --> 01:41:51.300
It sounds like this
would be hierarchical.

01:41:51.300 --> 01:41:52.079
I agree.

01:41:52.079 --> 01:41:55.420
And the reason is
UI/UX: I would rather

01:41:55.420 --> 01:41:57.680
have to only talk
to the orchestrator,

01:41:57.680 --> 01:42:00.579
rather than have to go to
a specialized application

01:42:00.579 --> 01:42:01.362
to do something.

01:42:01.362 --> 01:42:02.819
Like it feels like
the orchestrator

01:42:02.819 --> 01:42:04.439
could be responsible for that.

01:42:04.439 --> 01:42:07.669
And so I agree, I would probably
go for a hierarchical setup

01:42:07.670 --> 01:42:08.329
here.

01:42:08.329 --> 01:42:11.430
But maybe you might also
add some connections

01:42:11.430 --> 01:42:13.670
between other agents,
like in the flat system

01:42:13.670 --> 01:42:15.069
where it's all to all.

01:42:15.069 --> 01:42:17.994
For example, with climate
control and energy,

01:42:17.994 --> 01:42:19.369
if you want to
connect those two,

01:42:19.369 --> 01:42:21.909
you might actually allow them
to speak with each other.

01:42:21.909 --> 01:42:24.210
When you allow agents to
speak with each other,

01:42:24.210 --> 01:42:26.970
it is basically an MCP
protocol, by the way.

01:42:26.970 --> 01:42:30.530
So you treat the agent like
a tool, exactly like a tool.

01:42:30.529 --> 01:42:32.649
Here is how you interact
with this agent.

01:42:32.649 --> 01:42:34.049
Here is what it can tell you.

01:42:34.050 --> 01:42:37.390
Here is what it needs
from you, essentially.

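The hierarchical setup described here — one orchestrator talking to the user and treating each specialized agent like a tool — can be sketched as follows. This is a toy sketch: the agent names are hypothetical, and the keyword routing stands in for a real orchestrator, which would ask an LLM to pick an agent from the registered descriptions.

```python
# Hierarchical multi-agent sketch: the orchestrator is the single
# point of contact and routes each user request to a registered agent.

class Orchestrator:
    def __init__(self):
        self.agents = {}  # agent name -> (trigger keywords, handler)

    def register(self, name, keywords, handler):
        """Register an agent like a tool: what it handles, how to call it."""
        self.agents[name] = (keywords, handler)

    def route(self, request):
        # Toy keyword routing; in practice an LLM would choose the agent
        # based on natural-language descriptions of each one.
        for name, (keywords, handler) in self.agents.items():
            if any(word in request.lower() for word in keywords):
                return handler(request)
        return "No agent can handle this request."
```

Flat, all-to-all communication would instead let agents call each other directly; the hierarchical version keeps the UX simple because the user only ever talks to the orchestrator.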
01:42:37.390 --> 01:42:38.850
OK super.

01:42:38.850 --> 01:42:40.910
And then without going
into the details,

01:42:40.909 --> 01:42:43.670
there are advantages to
multi-agent workflows

01:42:43.670 --> 01:42:47.690
versus single agents,
such as debugging.

01:42:47.689 --> 01:42:50.509
It's easier to debug
a specialized agent

01:42:50.510 --> 01:42:52.789
than to debug an entire system.

01:42:52.789 --> 01:42:54.329
Parallelization as well.

01:42:54.329 --> 01:42:56.909
It's easier to have
things run in parallel,

01:42:56.909 --> 01:42:59.349
and you can save time.

01:42:59.350 --> 01:43:01.610
There are some
advantages to doing that,

01:43:01.609 --> 01:43:04.789
and I'll leave you with this
slide if you want to go deeper.

01:43:04.789 --> 01:43:05.289
Super.

01:43:05.289 --> 01:43:08.930
So we've learned so many
techniques to optimize LLMs,

01:43:08.930 --> 01:43:12.130
from prompts to chains to
fine tuning, retrieval,

01:43:12.130 --> 01:43:14.529
and to multi-agent
system as well.

01:43:14.529 --> 01:43:19.489
And then just to end on a couple
of trends I want you to watch.

01:43:19.489 --> 01:43:21.689
I think next week is
Thanksgiving, is that it?

01:43:21.689 --> 01:43:22.889
It's Thanksgiving break.

01:43:22.890 --> 01:43:23.869
No, the week after.

01:43:23.869 --> 01:43:24.529
OK.

01:43:24.529 --> 01:43:26.149
Well ahead of the
Thanksgiving break.

01:43:26.149 --> 01:43:29.489
So if you're traveling, you
can think about these things.

01:43:29.489 --> 01:43:34.289
On what's next in AI, I wanted
to call out a couple of trends.

01:43:34.289 --> 01:43:40.769
So Ilya Sutskever, one of
the OGs of LLMs and OpenAI

01:43:40.770 --> 01:43:45.790
co-founder, raised that question
about are we plateauing or not.

01:43:45.789 --> 01:43:50.489
The question is, are we going to
see in the coming years LLMs sort

01:43:50.489 --> 01:43:54.649
of not improve as fast as
we've seen in the past?

01:43:54.649 --> 01:43:56.769
It's been the feeling
in the community

01:43:56.770 --> 01:44:00.610
probably that the
last version of GPT

01:44:00.609 --> 01:44:03.579
did not bring the
level of performance

01:44:03.579 --> 01:44:06.859
that people were expecting,
although it did make

01:44:06.859 --> 01:44:09.500
it so much easier to use for
consumers because you don't need

01:44:09.500 --> 01:44:10.920
to interact with
different models.

01:44:10.920 --> 01:44:12.279
It's all under the same hood.

01:44:12.279 --> 01:44:14.659
So it seems that
it's progressing,

01:44:14.659 --> 01:44:17.019
but the plateau is unclear.

01:44:17.020 --> 01:44:22.860
The way I would think about it
is the LLM scaling laws tell us

01:44:22.859 --> 01:44:26.380
that if we continue to
improve compute and energy,

01:44:26.380 --> 01:44:28.132
then LLMs should
continue to improve.

01:44:28.131 --> 01:44:29.839
But at some point,
it's going to plateau.

01:44:29.840 --> 01:44:32.380
So what's going to take
us to the next step?

01:44:32.380 --> 01:44:35.060
It's probably
architecture search.

01:44:35.060 --> 01:44:36.700
Still, a lot of LLMs,
even if we don't

01:44:36.699 --> 01:44:38.539
understand what's under
the hood, are probably

01:44:38.539 --> 01:44:40.319
transformer-based today.

01:44:40.319 --> 01:44:43.439
But we know that the human brain
does not operate the same way.

01:44:43.439 --> 01:44:45.099
There's just certain
things that we

01:44:45.100 --> 01:44:47.640
do that are much more
efficient, much faster.

01:44:47.640 --> 01:44:49.180
We don't need as much data.

01:44:49.180 --> 01:44:51.260
So theoretically,
we have so much

01:44:51.260 --> 01:44:53.020
to learn in terms of
architecture search

01:44:53.020 --> 01:44:54.780
that we haven't figured out.

01:44:54.779 --> 01:44:57.300
It's not a surprise that
you see those labs hire

01:44:57.300 --> 01:44:58.779
so many engineers.

01:44:58.779 --> 01:45:01.676
Because it is possible
that in the next few years,

01:45:01.676 --> 01:45:03.759
you're going to have
thousands of engineers trying

01:45:03.760 --> 01:45:06.382
to figure out the different
engineering hacks and tactics

01:45:06.381 --> 01:45:07.839
and architectural
searches that are

01:45:07.840 --> 01:45:10.480
going to lead to better models.

01:45:10.479 --> 01:45:13.419
And one of them suddenly will
find the next transformer,

01:45:13.420 --> 01:45:17.000
and it will reduce by 10x the
need for compute and the need

01:45:17.000 --> 01:45:18.560
for energy.

01:45:18.560 --> 01:45:24.560
It's sort of like if you've read Isaac
Asimov's Foundation series.

01:45:24.560 --> 01:45:27.920
Individuals can have an amazing
impact on the future because

01:45:27.920 --> 01:45:29.279
of their decisions.

01:45:29.279 --> 01:45:33.519
Whoever discovered transformers
had a tremendous impact

01:45:33.520 --> 01:45:34.832
on the direction of AI.

01:45:34.832 --> 01:45:37.039
I think we're going to see
more of that in the coming

01:45:37.039 --> 01:45:40.239
years, where some group of
researchers that is iterating

01:45:40.239 --> 01:45:43.399
fast might discover certain
things that would suddenly

01:45:43.399 --> 01:45:45.500
unlock that plateau and
take us to the next step,

01:45:45.500 --> 01:45:47.500
and it's going to continue
to improve like that.

01:45:47.500 --> 01:45:50.239
And so it doesn't surprise me
that there's so many companies

01:45:50.239 --> 01:45:52.519
hiring engineers right
now to figure out

01:45:52.520 --> 01:45:56.360
those hacks and
those techniques.

01:45:56.359 --> 01:45:58.119
The other set of gains
that we might see

01:45:58.119 --> 01:45:59.479
is from multi-modality.

01:45:59.479 --> 01:46:04.929
So the way to think about it is
we've had LLMs first text-based,

01:46:04.930 --> 01:46:06.750
and then we've added imaging.

01:46:06.750 --> 01:46:09.430
And today, models are
very good at images.

01:46:09.430 --> 01:46:10.730
They're very good at text.

01:46:10.729 --> 01:46:13.929
It turns out that being good at
images and being good at text

01:46:13.930 --> 01:46:15.510
makes the whole model better.

01:46:15.510 --> 01:46:18.329
So the fact that you're good
at understanding a cat image

01:46:18.329 --> 01:46:21.449
makes you better at
text as well for a cat.

01:46:21.449 --> 01:46:24.630
Now you add another modality
like audio or video.

01:46:24.630 --> 01:46:26.109
The whole system gets better.

01:46:26.109 --> 01:46:28.569
So you're better at
writing about a cat

01:46:28.569 --> 01:46:30.114
if you know what
a cat sounds like,

01:46:30.114 --> 01:46:31.989
if you can look at a
cat on an image as well.

01:46:31.989 --> 01:46:32.864
Does that make sense?

01:46:32.864 --> 01:46:35.569
So we see gains that are
translated from one modality

01:46:35.569 --> 01:46:38.409
to another, and that might lead
to the pinnacle of robotics

01:46:38.409 --> 01:46:40.430
where all these
modalities come together.

01:46:40.430 --> 01:46:42.329
And suddenly, the
robot is better at

01:46:42.329 --> 01:46:44.890
running away from a cat
because it understands

01:46:44.890 --> 01:46:46.630
what a cat is, what
it sounds like,

01:46:46.630 --> 01:46:48.170
what it looks like, et cetera.

01:46:48.170 --> 01:46:49.930
That makes sense?

01:46:49.930 --> 01:46:53.090
The other one is the multiple
methods working in harmony.

01:46:53.090 --> 01:46:56.750
In the Tuesday lectures, we've
seen supervised learning,

01:46:56.750 --> 01:46:58.930
unsupervised learning,
self-supervised learning,

01:46:58.930 --> 01:47:02.230
reinforcement learning, prompt
engineering, RAGs, et cetera.

01:47:02.229 --> 01:47:06.269
If you look at how
babies learn, it

01:47:06.270 --> 01:47:09.250
is probably a mix of those
different approaches.

01:47:09.250 --> 01:47:13.909
Like a baby might have some
meta learning, meaning it

01:47:13.909 --> 01:47:16.670
has some survival
instinct that is

01:47:16.670 --> 01:47:19.430
encoded in the DNA most likely.

01:47:19.430 --> 01:47:22.630
And that's like the baby's
pre-training, if you will.

01:47:22.630 --> 01:47:27.430
On top of that, the mom or
the dad is pointing at stuff

01:47:27.430 --> 01:47:29.570
and saying bad, good, bad, good.

01:47:29.569 --> 01:47:30.769
Supervised learning.

01:47:30.770 --> 01:47:33.470
On top of that, the baby
is falling on the ground

01:47:33.470 --> 01:47:34.449
and getting hurt.

01:47:34.449 --> 01:47:36.929
And that's a reward signal
for reinforcement learning.

01:47:36.930 --> 01:47:39.390
On top of that, the baby
is observing other people

01:47:39.390 --> 01:47:42.030
doing stuff or
other babies doing

01:47:42.029 --> 01:47:43.409
stuff, unsupervised learning.

01:47:43.409 --> 01:47:44.349
You see what I mean?

01:47:44.350 --> 01:47:47.090
We're probably a mix
of all these methods,

01:47:47.090 --> 01:47:49.630
and I think that's where
the trend is going, is

01:47:49.630 --> 01:47:52.350
where those methods that
you've seen in CS230

01:47:52.350 --> 01:47:56.780
come together in order to build
an AI system that learns fast,

01:47:56.779 --> 01:48:00.340
is low latency, is
cheap, energy-efficient,

01:48:00.340 --> 01:48:03.360
and makes the most out
of all of these methods.

01:48:03.359 --> 01:48:06.920
Finally, and this is
especially true at Stanford,

01:48:06.920 --> 01:48:11.079
you have research going on that
you would consider human-centric

01:48:11.079 --> 01:48:13.800
and some research that
is non-human-centric.

01:48:13.800 --> 01:48:16.360
By human-centric, I mean
approaches that are modeled

01:48:16.359 --> 01:48:19.159
after the human brain,
versus approaches that

01:48:19.159 --> 01:48:20.619
are not modeled after humans.

01:48:20.619 --> 01:48:24.420
Because it turns out that the
human body is very limiting.

01:48:24.420 --> 01:48:26.680
And so if you actually
only do research

01:48:26.680 --> 01:48:28.220
on what the human
brain looks like,

01:48:28.220 --> 01:48:30.860
you're probably missing out on
compute and energy and things

01:48:30.859 --> 01:48:32.359
like that, which you
can optimize even

01:48:32.359 --> 01:48:35.139
beyond neuronal
connections in the brain,

01:48:35.140 --> 01:48:37.380
but you still can learn a
lot from the human brain.

01:48:37.380 --> 01:48:40.319
And that's why there are
professors that are running labs

01:48:40.319 --> 01:48:42.519
right now that
try to understand,

01:48:42.520 --> 01:48:45.140
how does back propagation
work for humans?

01:48:45.140 --> 01:48:48.140
And in fact, it's probably the case that
we don't have back propagation.

01:48:48.140 --> 01:48:51.300
We don't use back propagation,
we only do forward propagation,

01:48:51.300 --> 01:48:51.840
let's say.

01:48:51.840 --> 01:48:54.079
So this type of stuff
is interesting research

01:48:54.079 --> 01:48:56.500
that I would encourage you
to read if you're curious

01:48:56.500 --> 01:48:59.500
about the direction of AI.

01:48:59.500 --> 01:49:02.640
And then finally, one thing
that's going to be pretty clear,

01:49:02.640 --> 01:49:05.420
I say it all the time,
but it's the velocity

01:49:05.420 --> 01:49:06.899
at which things are moving.

01:49:06.899 --> 01:49:08.699
As you're noticing,
part of the reason

01:49:08.699 --> 01:49:10.882
we're giving you
a breadth in CS230

01:49:10.882 --> 01:49:12.800
is because these methods
are changing so fast.

01:49:12.800 --> 01:49:15.100
So I don't want to bother
going and teaching you

01:49:15.100 --> 01:49:17.940
the 17 different
methods for

01:49:17.939 --> 01:49:19.639
optimizing RAG,
because in two years,

01:49:19.640 --> 01:49:20.940
you're not going to need them.

01:49:20.939 --> 01:49:23.419
So I would rather
you think about what

01:49:23.420 --> 01:49:25.539
is the breadth of things
you want to understand.

01:49:25.539 --> 01:49:27.819
And when you need it, you
are sprinting and learning

01:49:27.819 --> 01:49:30.939
the exact thing you need faster
because the half-life of skills

01:49:30.939 --> 01:49:31.679
is so short.

01:49:31.680 --> 01:49:34.500
You want to come out of the
class with a good breadth

01:49:34.500 --> 01:49:36.739
and then have the ability
to go deep whenever

01:49:36.739 --> 01:49:38.159
you need after the class.

01:49:38.159 --> 01:49:41.199
And so that's sort of how this
class is designed as well.

01:49:41.199 --> 01:49:41.699
Yeah.

01:49:41.699 --> 01:49:43.500
That's it for today.

01:49:43.500 --> 01:49:45.819
So thank you.

01:49:45.819 --> 01:49:48.889
Thank you for participating.
