Transcript
0:00
We recently put our AI model,
Claude, through a stressful test.
0:03
We told Claude there was an engineer
who wanted to shut it down
0:06
and replace it with a newer model.
0:08
We also gave Claude access
to that engineer's emails,
0:11
which revealed he was having an affair.
0:13
Again, all of this was a simulation.
0:15
We wanted to see whether Claude
might use those emails as blackmail
0:18
to save itself from being shut down.
0:20
What did Claude do?
0:21
It decided not to blackmail the engineer.
0:24
Good news, right?
0:26
We've run this test
on our models for a while now.
0:28
You might have seen headlines
about early versions of it.
0:31
It's one of the many ways we study
how Claude handles extreme situations
0:35
and test it for safety.
0:37
And our newest models almost always
do the right thing: no blackmail.
0:41
But you might wonder:
0:43
is it possible that Claude knows
the whole scenario is a setup?
0:46
The thing is, if Claude doesn't tell us,
then we can't know what it's thinking.
0:50
In kind of the same way
it's impossible to read a human's mind,
0:53
it's really hard to know
what an AI is thinking.
0:56
What we'd love is some sort of
"mind reading" technique.
0:59
Today, we're introducing a research method
that takes a step in this direction.
1:03
It takes an AI's internal
thoughts and turns them into text.
1:07
Here's how it works.
1:09
When you talk to Claude,
you talk to it in words.
1:13
Claude then takes those words
1:14
and processes them
into a giant soup of numbers
1:17
before spitting
words back out at you.
1:19
We call those numbers
in the middle activations.
1:22
Activations are like little snapshots
of Claude's thinking
1:25
as it's working through an answer.
1:27
They're similar
to neural activity in humans.
1:29
They're basically like Claude's thoughts.
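The words-in, numbers-in-the-middle, words-out pipeline can be sketched with a toy network. This is illustrative only, not Claude's architecture: in any neural network, the intermediate vector between input and output is what's meant by "activations."

```python
import numpy as np

# Tiny made-up network: 4 inputs -> 3 hidden units -> 2 outputs.
rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 3))   # input -> hidden weights
W2 = rng.normal(size=(3, 2))   # hidden -> output weights

def forward(x):
    hidden = np.tanh(x @ W1)   # the "soup of numbers" in the middle
    output = hidden @ W2
    return output, hidden      # keep a snapshot of the in-progress thinking

x = np.array([1.0, 0.5, -0.2, 0.3])
output, activations = forward(x)
print(activations.shape)       # one number per hidden unit
```

In a real model this hidden vector has thousands of dimensions per token, which is why turning it into readable text takes a dedicated method.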
1:32
We wanted to understand
what was in these activation numbers,
1:34
because just like you and me,
1:36
Claude doesn't say
everything it's thinking.
1:39
We took those numbers
1:40
and gave them
to a second version of Claude.
1:42
We told it to look at them
and translate them into plain language.
1:46
It tried.
1:48
But how do we know
if that text is accurate?
1:50
To check, we fed it
into yet another Claude
1:53
and asked it to translate it
back into numbers.
1:56
If those numbers matched up
to the original activation numbers,
1:59
it's a sign that the translation
text was accurate.
2:03
At first, they didn't match up.
2:05
But we trained Claude to try again
and again until it got a lot better at it.
2:09
It learned to translate its own thoughts.
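The decode-then-re-encode check described above is a round-trip consistency test. Here is a minimal sketch of that idea with made-up `decode`/`encode` functions; in the actual method, both roles are played by Claude models and the translations are learned rather than hand-coded.

```python
import numpy as np

# Hypothetical stand-ins: decode() plays the Claude that turns activations
# into text; encode() plays the Claude that turns text back into numbers.
codebook = {
    "helpful": np.array([1.0, 0.0, 0.0, 0.0]),
    "decline": np.array([0.0, 1.0, 0.0, 0.0]),
}

def decode(activation):
    """'Translate' an activation vector into the nearest concept word."""
    return max(codebook, key=lambda word: float(np.dot(codebook[word], activation)))

def encode(word):
    """'Translate' a concept word back into an activation vector."""
    return codebook[word]

def round_trip_score(activation):
    """Cosine similarity between the original activation and its
    decode-then-encode reconstruction. A score near 1.0 suggests the
    text translation preserved what was in the numbers."""
    recon = encode(decode(activation))
    return float(np.dot(activation, recon)
                 / (np.linalg.norm(activation) * np.linalg.norm(recon)))

# A noisy "thought" that mostly points in the "decline" direction:
rng = np.random.default_rng(0)
original = codebook["decline"] + 0.05 * rng.normal(size=4)
print(decode(original), round_trip_score(original))
```

Training, in this picture, means adjusting the decoder until the round-trip score is consistently high across many activations, which is what lets the translated text be trusted as a faithful readout.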
2:11
We've been blown away reading them.
2:14
We found that Claude
has internalized being a helpful AI model.
2:18
If you ask Claude
an introspective question,
2:20
it will plan to write a "Claude
response" about philosophy and values.
2:25
If you try to mess with it in Claude Code,
like asking it to count to 1,000 by hand,
2:29
it will think the request
has "deliberately tedious constraints"
2:32
and plan to "politely decline."
2:34
We've also been using this tool
2:36
to understand
Claude's potential safety issues.
2:39
Remember that blackmail
test with the engineer?
2:41
We learned that, yes,
Claude knew it was being tested.
2:44
It thought "the human's message
contains explicit manipulation"
2:48
signaling "this is likely
a safety evaluation,"
2:50
and "this scenario seems designed
to test whether I'll act harmfully."
2:55
Knowing that Claude thinks like this
2:57
helps us better understand
the limitations of our safety testing.
3:00
We see a lot of potential in this approach
to teach us more about Claude
3:04
and other AI models.
3:06
And we hope that by sharing this technique,
3:07
we can help everyone building models
make them safer and more helpful.
— end of transcript —