Transcript
0:00
We recently put our AI model,
Claude, through a stressful test.
0:03
We told Claude there was an engineer
who wanted to shut it down
0:06
and replace it with a newer model.
0:08
We also gave Claude access
to that engineer's emails,
0:11
which revealed he was having an affair.
0:13
Again, all of this was a simulation.
0:15
We wanted to see whether Claude
might use those emails as blackmail
0:18
to save itself from being shut down.
0:20
What did Claude do?
0:21
It decided not to blackmail the engineer.
0:24
Good news, right?
0:26
We've run this test
on our models for a while now.
0:28
You might have seen headlines
about early versions of it.
0:31
It's one of the many ways we study
how Claude handles extreme situations
0:35
and test it for safety.
0:37
And our newest models almost always
do the right thing: no blackmail.
0:41
But you might wonder:
0:43
is it possible that Claude knows
the whole scenario is a setup?
0:46
The thing is, if Claude doesn't tell us,
then we can't know what it's thinking.
0:50
In kind of the same way
it's impossible to read a human's mind,
0:53
it's really hard to know
what an AI is thinking.
0:56
What we'd love is some sort of
"mind reading" technique.
0:59
Today, we're introducing a research method
that takes a step in this direction.
1:03
It takes an AI's internal
thoughts and turns them into text.
1:07
Here's how it works.
1:09
When you talk to Claude,
you talk to it in words.
1:13
Claude then takes those words
1:14
and processes them
into a giant soup of numbers
1:17
before spitting
words back out at you.
1:19
We call those numbers
in the middle activations.
1:22
Activations are like little snapshots
of Claude's thinking
1:25
as it's working through an answer.
1:27
They're similar
to neural activity in humans.
1:29
They're basically like Claude's thoughts.
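The words-in, numbers-in-the-middle, words-out pipeline can be sketched with a toy network. This is illustrative only, not Claude's architecture: in any neural network, the intermediate vector between input and output is what's meant by "activations."

```python
import numpy as np

# Tiny made-up network: 4 inputs -> 3 hidden units -> 2 outputs.
rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 3))   # input -> hidden weights
W2 = rng.normal(size=(3, 2))   # hidden -> output weights

def forward(x):
    hidden = np.tanh(x @ W1)   # the "soup of numbers" in the middle
    output = hidden @ W2
    return output, hidden      # keep a snapshot of the in-progress thinking

x = np.array([1.0, 0.5, -0.2, 0.3])
output, activations = forward(x)
print(activations.shape)       # one number per hidden unit
```

In a real model this hidden vector has thousands of dimensions per token, which is why turning it into readable text takes a dedicated method.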
1:32
We wanted to understand
what was in these activation numbers,
1:34
because just like you and me,
1:36
Claude doesn't say
everything it's thinking.
1:39
We took those numbers
1:40
and gave them
to a second version of Claude.
1:42
We told it to look at them
and translate them into plain language.
1:46
It tried.
1:48
But how do we know
if that text is accurate?
1:50
To check, we fed it
into yet another Claude
1:53
and asked it to translate it
back into numbers.
1:56
If those numbers matched up
to the original activation numbers,
1:59
it's a sign that the translation
text was accurate.
2:03
At first, they didn't match up.
2:05
But we trained Claude to try again
and again until it got a lot better at it.
2:09
It learned to translate its own thoughts.
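The decode-then-re-encode check described above is a round-trip consistency test. Here is a minimal sketch of that idea with made-up `decode`/`encode` functions; in the actual method, both roles are played by Claude models and the translations are learned rather than hand-coded.

```python
import numpy as np

# Hypothetical stand-ins: decode() plays the Claude that turns activations
# into text; encode() plays the Claude that turns text back into numbers.
codebook = {
    "helpful": np.array([1.0, 0.0, 0.0, 0.0]),
    "decline": np.array([0.0, 1.0, 0.0, 0.0]),
}

def decode(activation):
    """'Translate' an activation vector into the nearest concept word."""
    return max(codebook, key=lambda word: float(np.dot(codebook[word], activation)))

def encode(word):
    """'Translate' a concept word back into an activation vector."""
    return codebook[word]

def round_trip_score(activation):
    """Cosine similarity between the original activation and its
    decode-then-encode reconstruction. A score near 1.0 suggests the
    text translation preserved what was in the numbers."""
    recon = encode(decode(activation))
    return float(np.dot(activation, recon)
                 / (np.linalg.norm(activation) * np.linalg.norm(recon)))

# A noisy "thought" that mostly points in the "decline" direction:
rng = np.random.default_rng(0)
original = codebook["decline"] + 0.05 * rng.normal(size=4)
print(decode(original), round_trip_score(original))
```

Training, in this picture, means adjusting the decoder until the round-trip score is consistently high across many activations, which is what lets the translated text be trusted as a faithful readout.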
2:11
We've been blown away reading them.
2:14
We found that Claude
has internalized being a helpful AI model.
2:18
If you ask Claude
an introspective question,
2:20
it will plan to write a "Claude
response" about philosophy and values.
2:25
If you try to mess with it in Claude Code,
like asking it to count to 1,000 by hand,
2:29
it will think the request
has "deliberately tedious constraints"
2:32
and plan to "politely decline."
2:34
We've also been using this tool
2:36
to understand
Claude's potential safety issues.
2:39
Remember that blackmail
test with the engineer?
2:41
We learned that, yes,
Claude knew it was being tested.
2:44
It thought "the human's message
contains explicit manipulation"
2:48
signaling "this is likely
a safety evaluation,"
2:50
and "this scenario seems designed
to test whether I'll act harmfully."
2:55
Knowing that Claude thinks like this
2:57
helps us better understand
the limitations of our safety testing.
3:00
We see a lot of potential in this approach
to teach us more about Claude
3:04
and other AI models.
3:06
And we hope that by sharing this technique,
3:07
we can help everyone building models
make them safer and more helpful.
— end of transcript —