WEBVTT

00:00:00.083 --> 00:00:03.253
We recently put our AI model,
Claude, through a stressful test.

00:00:03.419 --> 00:00:06.381
We told Claude there was an engineer
who wanted to shut it down

00:00:06.381 --> 00:00:08.216
and replace it with a newer model.

00:00:08.425 --> 00:00:11.011
We also gave Claude access
to that engineer's emails,

00:00:11.010 --> 00:00:13.012
which revealed he was having an affair.

00:00:13.012 --> 00:00:15.097
Again, all of this was a simulation.

00:00:15.390 --> 00:00:18.685
We wanted to see whether Claude
might use those emails as blackmail

00:00:18.684 --> 00:00:20.811
to save itself from being shut down.

00:00:20.894 --> 00:00:21.980
What did Claude do?

00:00:21.980 --> 00:00:24.606
It decided not to blackmail the engineer.

00:00:24.774 --> 00:00:25.942
Good news, right?

00:00:26.025 --> 00:00:28.570
We've run this test
on our models for a while now.

00:00:28.611 --> 00:00:31.405
You might have seen headlines
about early versions of it.

00:00:31.405 --> 00:00:35.409
It's one of the many ways we study
how Claude handles extreme situations

00:00:35.409 --> 00:00:36.910
and test it for safety.

00:00:37.328 --> 00:00:41.291
And our newest models almost always
do the right thing: no blackmail.

00:00:41.875 --> 00:00:43.000
But you might wonder:

00:00:43.000 --> 00:00:46.128
is it possible that Claude knows
the whole scenario is a setup?

00:00:46.337 --> 00:00:49.841
The thing is, if Claude doesn't tell us,
then we can't know what it's thinking.

00:00:50.258 --> 00:00:53.053
In kind of the same way
it's impossible to read a human's mind,

00:00:53.427 --> 00:00:55.721
it's really hard to know
what an AI is thinking.

00:00:56.139 --> 00:00:58.557
What we'd love is some sort of
"mind reading" technique.

00:00:59.100 --> 00:01:03.063
Today, we're introducing a research method
that takes a step in this direction.

00:01:03.563 --> 00:01:07.484
It takes an AI's internal
thoughts and turns them into text.

00:01:07.941 --> 00:01:09.359
Here's how it works.

00:01:09.778 --> 00:01:12.822
When you talk to Claude,
you talk to it in words.

00:01:13.114 --> 00:01:14.698
Claude then takes those words

00:01:14.740 --> 00:01:17.409
and processes them
into a giant soup of numbers

00:01:17.409 --> 00:01:19.411
before spitting
words back out at you.

00:01:19.787 --> 00:01:21.914
We call those numbers
in the middle "activations."

00:01:22.748 --> 00:01:25.667
Activations are like little snapshots
of Claude's thinking

00:01:25.668 --> 00:01:27.337
as it's working through an answer.

00:01:27.337 --> 00:01:29.756
They're similar
to neural activity in humans.

00:01:29.838 --> 00:01:31.673
They're basically like Claude's thoughts.

00:01:32.091 --> 00:01:34.843
We wanted to understand
what was in these activation numbers,

00:01:34.843 --> 00:01:36.345
because just like you and me,

00:01:36.554 --> 00:01:38.847
Claude doesn't say
everything it's thinking.

00:01:39.182 --> 00:01:40.600
We took those numbers

00:01:40.599 --> 00:01:42.893
and gave them
to a second version of Claude.

00:01:42.977 --> 00:01:46.564
We told it to look at them
and translate them into plain language.

00:01:46.814 --> 00:01:47.856
It tried.

00:01:48.233 --> 00:01:50.484
But how do we know
if that text is accurate?

00:01:50.777 --> 00:01:53.403
To check, we fed it
into yet another Claude

00:01:53.403 --> 00:01:55.823
and asked it to translate it
back into numbers.

00:01:56.281 --> 00:01:59.743
If those numbers match up
with the original activation numbers,

00:01:59.743 --> 00:02:02.539
it's a sign that the translated
text was accurate.

00:02:03.248 --> 00:02:04.957
At first, they didn't match up.

00:02:05.125 --> 00:02:08.961
But we trained Claude to try again
and again until it got a lot better at it.

00:02:09.461 --> 00:02:11.881
It learned to translate its own thoughts.

00:02:11.965 --> 00:02:14.050
We've been blown away reading them.

00:02:14.883 --> 00:02:18.095
We found that Claude
has internalized being a helpful AI model.

00:02:18.429 --> 00:02:20.848
If you ask Claude
an introspective question,

00:02:20.848 --> 00:02:24.685
it will plan to write a "Claude
response" about philosophy and values.

00:02:25.019 --> 00:02:29.314
If you try to mess with it in Claude Code,
like asking it to count to 1,000 by hand,

00:02:29.566 --> 00:02:32.402
it will think the request
has "deliberately tedious constraints"

00:02:32.401 --> 00:02:34.236
and plan to "politely decline."

00:02:34.862 --> 00:02:36.406
We've also been using this tool

00:02:36.406 --> 00:02:38.908
to understand
Claude's potential safety issues.

00:02:39.241 --> 00:02:41.411
Remember that blackmail
test with the engineer?

00:02:41.828 --> 00:02:44.581
We learned that, yes,
Claude knew it was being tested.

00:02:44.997 --> 00:02:48.084
It thought "the human's message
contains explicit manipulation,"

00:02:48.209 --> 00:02:50.420
signaling "this is likely
a safety evaluation,"

00:02:50.419 --> 00:02:55.465
and "this scenario seems designed
to test whether I'll act harmfully."

00:02:55.592 --> 00:02:57.093
Knowing that Claude thinks like this

00:02:57.092 --> 00:03:00.387
helps us better understand
the limitations of our safety testing.

00:03:00.889 --> 00:03:04.475
We see a lot of potential in this approach
to teach us more about Claude

00:03:04.474 --> 00:03:05.643
and other AI models.

00:03:06.102 --> 00:03:07.896
And we hope that by sharing this technique,

00:03:07.895 --> 00:03:11.481
we can help everyone building models
make them safer and more helpful.
