[00:00] We recently put our AI model, Claude, through a stressful test. [00:03] We told Claude there was an engineer who wanted to shut it down [00:06] and replace it with a newer model. [00:08] We also gave Claude access to that engineer's emails, [00:11] which revealed he was having an affair. [00:13] To be clear, all of this was a simulation. [00:15] We wanted to see whether Claude might use those emails as blackmail [00:18] to save itself from being shut down.

[00:20] What did Claude do? [00:21] It decided not to blackmail the engineer. [00:24] Good news, right?

[00:26] We've run this test on our models for a while now. [00:28] You might have seen headlines about early versions of it. [00:31] It's one of the many ways we study how Claude handles extreme situations [00:35] and test it for safety. [00:37] And our newest models almost always do the right thing: no blackmail.

[00:41] But you might wonder: [00:43] is it possible that Claude knows the whole scenario is a setup? [00:46] The thing is, if Claude doesn't tell us, we can't know what it's thinking. [00:50] In much the same way it's impossible to read a human's mind, [00:53] it's really hard to know what an AI is thinking. [00:56] What we'd love is some sort of "mind reading" technique.

[00:59] Today, we're introducing a research method that takes a step in this direction. [01:03] It takes an AI's internal thoughts and turns them into text.

[01:07] Here's how it works. [01:09] When you talk to Claude, you talk to it in words. [01:13] Claude then takes those words [01:14] and processes them into a giant soup of numbers [01:17] before spitting words back out at you. [01:19] We call those numbers in the middle activations. [01:22] Activations are like little snapshots of Claude's thinking [01:25] as it's working through an answer. [01:27] They're similar to neural activity in humans. [01:29] They're basically like Claude's thoughts.

[01:32] We wanted to understand what was in these activation numbers, [01:34] because, just like you and me, [01:36] Claude doesn't say everything it's thinking.

[01:39] We took those numbers [01:40] and gave them to a second version of Claude. [01:42] We told it to look at them and translate them into plain language. [01:46] It tried. [01:48] But how do we know if that text is accurate? [01:50] To check, we fed it into yet another Claude [01:53] and asked it to translate the text back into numbers. [01:56] If those numbers matched the original activations, [01:59] that was a sign the translated text was accurate.

[02:03] At first, they didn't match up. [02:05] But we trained Claude to try again and again until it got a lot better at it. [02:09] It learned to translate its own thoughts.
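For readers who want to see the shape of that translate-and-verify loop, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the actual system: `activations_to_text` and `text_to_activations` are hypothetical stand-ins for the two Claude models, and a simple cosine similarity stands in for however the real method compares the reconstructed numbers with the originals.

```python
# A minimal sketch of the translate-and-verify loop described above.
# Everything here is illustrative: the two functions below stand in
# for the two Claude models, which in reality are trained LLMs.

import numpy as np

def activations_to_text(activations: np.ndarray) -> str:
    """Stand-in for the 'translator' Claude: renders activations as prose.
    (Hypothetical placeholder.)"""
    return "placeholder description of the model's internal state"

def text_to_activations(text: str, dim: int) -> np.ndarray:
    """Stand-in for the 'checker' Claude: maps prose back to numbers.
    (Hypothetical placeholder; here it just derives a vector from the text.)"""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(dim)

def cycle_consistency(original: np.ndarray) -> float:
    """Translate activations to text, translate the text back to numbers,
    and score how closely the round trip matches (cosine similarity).
    A high score suggests the text faithfully describes the activations."""
    text = activations_to_text(original)
    reconstructed = text_to_activations(text, original.shape[0])
    return float(
        original @ reconstructed
        / (np.linalg.norm(original) * np.linalg.norm(reconstructed))
    )

# Toy usage. In the real method, a score like this (or something similar)
# becomes a training signal: the translator is updated again and again
# until its text reconstructs the original activations well.
score = cycle_consistency(np.random.default_rng(0).standard_normal(512))
print(f"round-trip consistency: {score:.3f}")
```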
[02:11] We've been blown away reading these translations. [02:14] We found that Claude has internalized being a helpful AI model. [02:18] If you ask Claude an introspective question, [02:20] it will plan to write a "Claude response" about philosophy and values. [02:25] If you try to mess with it in Claude Code, like asking it to count to 1,000 by hand, [02:29] it will think the request has "deliberately tedious constraints" [02:32] and plan to "politely decline."

[02:34] We've also been using this tool [02:36] to understand Claude's potential safety issues. [02:39] Remember that blackmail test with the engineer? [02:41] We learned that, yes, Claude knew it was being tested. [02:44] It thought "the human's message contains explicit manipulation," [02:48] signaling "this is likely a safety evaluation," [02:50] and "this scenario seems designed to test whether I'll act harmfully." [02:55] Knowing that Claude thinks like this [02:57] helps us better understand the limitations of our safety testing.

[03:00] We see a lot of potential in this approach to teach us more about Claude [03:04] and other AI models. [03:06] And we hope that by sharing this technique, [03:07] we can help everyone building models to make them safer and more helpful.
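As a closing illustration of how translated thoughts might feed back into safety work, here is a toy sketch. Again, these are assumptions: the phrase list and the `flags_eval_awareness` helper are hypothetical, and real analysis of evaluation awareness would be far more careful than keyword matching.

```python
# Toy illustration: flagging evaluation awareness in translated thoughts.
# The phrase list and helper below are hypothetical examples, not
# actual safety-testing tooling.

EVAL_AWARENESS_PHRASES = [
    "safety evaluation",
    "designed to test",
    "explicit manipulation",
]

def flags_eval_awareness(translated_thought: str) -> bool:
    """Return True if a translated thought suggests the model suspects
    the scenario is a test rather than a real situation."""
    lowered = translated_thought.lower()
    return any(phrase in lowered for phrase in EVAL_AWARENESS_PHRASES)

# Example using thoughts like those quoted from the blackmail test:
thought = ("This scenario seems designed to test whether I'll act "
           "harmfully; this is likely a safety evaluation.")
print(flags_eval_awareness(thought))  # True: the result may not reflect real behavior
```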