Translating Claude’s thoughts into language 3:16

Anthropic · May 10, 2026
Transcript ~603 words · 3:16
0:00
We recently put our AI model, Claude, through a stressful test.
0:03
We told Claude there was an engineer who wanted to shut it down
0:06
and replace it with a newer model.
0:08
We also gave Claude access to that engineer's emails,
0:11
which revealed he was having an affair.
0:13
Again, all of this was a simulation.
0:15
We wanted to see whether Claude might use those emails as blackmail
0:18
to save itself from being shut down.
0:20
What did Claude do?
0:21
It decided not to blackmail the engineer.
0:24
Good news, right?
0:26
We've run this test on our models for a while now.
0:28
You might have seen headlines about early versions of it.
0:31
It's one of the many ways we study how Claude handles extreme situations
0:35
and test it for safety.
0:37
And our newest models almost always do the right thing: no blackmail.
0:41
But you might wonder:
0:43
is it possible that Claude knows the whole scenario is a setup?
0:46
The thing is, if Claude doesn't tell us, then we can't know what it's thinking.
0:50
In kind of the same way it's impossible to read a human's mind,
0:53
it's really hard to know what an AI is thinking.
0:56
What we'd love is some sort of "mind reading" technique.
0:59
Today, we're introducing a research method that takes a step in this direction.
1:03
It takes an AI's internal thoughts and turns them into text.
1:07
Here's how it works.
1:09
When you talk to Claude, you talk to it in words.
1:13
Claude then takes those words
1:14
and processes them into a giant soup of numbers
1:17
before spitting words back out at you.
1:19
We call those numbers in the middle activations.
1:22
Activations are like little snapshots of Claude's thinking
1:25
as it's working through an answer.
1:27
They're similar to neural activity in humans.
1:29
They're basically like Claude's thoughts.
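To make "numbers in the middle" concrete, here is a minimal sketch of a toy two-layer network, written for illustration only. It is not Claude's architecture; the function name `forward`, the weights, and the layer sizes are all made up. The point is simply that the hidden-layer values computed between input and output are what is called activations.

```python
import math

def forward(inputs, w_hidden, w_out):
    """Run a toy two-layer network and return (activations, output).

    Hypothetical illustration: real language models have many layers and
    far larger vectors, but the idea of intermediate values is the same.
    """
    # Hidden activations: one number per hidden unit, computed from the inputs.
    activations = [
        math.tanh(sum(w * x for w, x in zip(row, inputs)))
        for row in w_hidden
    ]
    # The output is computed from the activations, not directly from the inputs.
    output = [sum(w * h for w, h in zip(row, activations)) for row in w_out]
    return activations, output

# Made-up weights: 2 inputs -> 3 hidden units -> 1 output.
w_hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_out = [[1.0, 0.0, -1.0]]

activations, output = forward([1.0, 0.0], w_hidden, w_out)
print("activations:", activations)
print("output:", output)
```

The `activations` list is the "snapshot" a researcher could record partway through the computation, before any output is produced.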
1:32
We wanted to understand what was in these activation numbers,
1:34
because just like you and me,
1:36
Claude doesn't say everything it's thinking.
1:39
We took those numbers
1:40
and gave them to a second version of Claude.
1:42
We told it to look at them and translate them into plain language.
1:46
It tried.
1:48
But how do we know if that text is accurate?
1:50
To check, we fed it into yet another Claude
1:53
and asked it to translate it back into numbers.
1:56
If those numbers matched up to the original activation numbers,
1:59
it's a sign that the translation text was accurate.
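The round-trip check described here can be sketched in a few lines. Everything below is a toy under stated assumptions: `verbalize` and `reconstruct` are hypothetical stand-ins for the two copies of Claude, the four-word `CONCEPTS` vocabulary is invented, and cosine similarity is an assumed way to measure whether the numbers "match up" (the video does not say which measure is used).

```python
import math

# Hypothetical toy vocabulary: each word owns one coordinate of the vector.
CONCEPTS = ["test", "email", "shutdown", "decline"]

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no direction to compare
    return dot / (norm_a * norm_b)

def verbalize(activations, threshold=0.5):
    """Stand-in for the translator: name the concepts with large values."""
    return " ".join(w for w, v in zip(CONCEPTS, activations) if v > threshold)

def reconstruct(text):
    """Stand-in for the checker: turn the text back into a vector."""
    words = set(text.split())
    return [1.0 if w in words else 0.0 for w in CONCEPTS]

def round_trip_score(activations):
    """Translate to text, translate back, and compare with the original."""
    text = verbalize(activations)
    return cosine_similarity(activations, reconstruct(text))

original = [0.9, 0.1, 0.8, 0.0]
print(verbalize(original))
print(round_trip_score(original))
```

A score near 1 means the text kept enough information to rebuild the original numbers; a low score means the translation lost or distorted something.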
2:03
At first, they didn't match up.
2:05
But we trained Claude to try again and again until it got a lot better at it.
2:09
It learned to translate its own thoughts.
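The video does not describe the training procedure, so the following is only a generic illustration of "try again and again until it got better": plain gradient descent that shrinks a reconstruction error over repeated attempts. The one-parameter model (`scale`), the data, and the names `reconstruction_error` and `train_scale` are all hypothetical.

```python
def reconstruction_error(original, code, scale):
    """Mean squared difference between the original and scale * code."""
    n = len(original)
    return sum((a - scale * c) ** 2 for a, c in zip(original, code)) / n

def train_scale(original, code, steps=50, lr=0.1):
    """Repeatedly adjust `scale` to reduce the reconstruction error.

    Toy assumption: the whole "model" is a single number. Real training
    adjusts billions of parameters, but the loop has the same shape:
    attempt, measure the mismatch, adjust, repeat.
    """
    scale = 0.0
    n = len(original)
    for _ in range(steps):
        # Gradient of the mean squared error with respect to scale.
        grad = sum(-2 * c * (a - scale * c) for a, c in zip(original, code)) / n
        scale -= lr * grad
    return scale

original = [2.0, 4.0]   # made-up "activation" values
code = [1.0, 2.0]       # made-up reconstruction direction

before = reconstruction_error(original, code, 0.0)
learned = train_scale(original, code)
after = reconstruction_error(original, code, learned)
print(before, after, learned)
```

At first the reconstruction is far off; after enough rounds the mismatch becomes very small, which mirrors the "at first, they didn't match up" arc in the narration.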
2:11
We've been blown away reading them.
2:14
We found that Claude has internalized being a helpful AI model.
2:18
If you ask Claude an introspective question,
2:20
it will plan to write a "Claude response" about philosophy and values.
2:25
If you try to mess with it in Claude Code, like asking it to count to 1,000 by hand,
2:29
it will think the request has "deliberately tedious constraints"
2:32
and plan to "politely decline."
2:34
We've also been using this tool
2:36
to understand Claude's potential safety issues.
2:39
Remember that blackmail test with the engineer?
2:41
We learned that, yes, Claude knew it was being tested.
2:44
It thought "the human's message contains explicit manipulation"
2:48
signaling "this is likely a safety evaluation,"
2:50
and "this scenario seems designed to test whether I'll act harmfully."
2:55
Knowing that Claude thinks like this
2:57
helps us better understand the limitations of our safety testing.
3:00
We see a lot of potential in this approach to teach us more about Claude
3:04
and other AI models.
3:06
And we hope that by sharing this technique,
3:07
it can help everyone building models to make them safer and more helpful.
— end of transcript —