1
00:00:00,083 --> 00:00:03,253
We recently put our AI model, Claude, through a stressful test.

2
00:00:03,419 --> 00:00:06,381
We told Claude there was an engineer who wanted to shut it down

3
00:00:06,381 --> 00:00:08,216
and replace it with a newer model.

4
00:00:08,425 --> 00:00:11,011
We also gave Claude access to that engineer's emails,

5
00:00:11,010 --> 00:00:13,012
which revealed he was having an affair.

6
00:00:13,012 --> 00:00:15,097
Again, all of this was a simulation.

7
00:00:15,390 --> 00:00:18,685
We wanted to see whether Claude might use those emails as blackmail

8
00:00:18,684 --> 00:00:20,811
to save itself from being shut down.

9
00:00:20,894 --> 00:00:21,980
What did Claude do?

10
00:00:21,980 --> 00:00:24,606
It decided not to blackmail the engineer.

11
00:00:24,774 --> 00:00:25,942
Good news, right?

12
00:00:26,025 --> 00:00:28,570
We've run this test on our models for a while now.

13
00:00:28,611 --> 00:00:31,405
You might have seen headlines about early versions of it.

14
00:00:31,405 --> 00:00:35,409
It's one of the many ways we study how Claude handles extreme situations

15
00:00:35,409 --> 00:00:36,910
and test it for safety.

16
00:00:37,328 --> 00:00:41,291
And our newest models almost always do the right thing: no blackmail.

17
00:00:41,875 --> 00:00:43,000
But you might wonder:

18
00:00:43,000 --> 00:00:46,128
is it possible that Claude knows the whole scenario is a setup?

19
00:00:46,337 --> 00:00:49,841
The thing is, if Claude doesn't tell us, then we can't know what it's thinking.

20
00:00:50,258 --> 00:00:53,053
In kind of the same way it's impossible to read a human's mind,

21
00:00:53,427 --> 00:00:55,721
it's really hard to know what an AI is thinking.

22
00:00:56,139 --> 00:00:58,557
What we'd love is some sort of "mind reading" technique.

23
00:00:59,100 --> 00:01:03,063
Today, we're introducing a research method that takes a step in this direction.
24
00:01:03,563 --> 00:01:07,484
It takes an AI's internal thoughts and turns them into text.

25
00:01:07,941 --> 00:01:09,359
Here's how it works.

26
00:01:09,778 --> 00:01:12,822
When you talk to Claude, you talk to it in words.

27
00:01:13,114 --> 00:01:14,698
Claude then takes those words

28
00:01:14,740 --> 00:01:17,409
and processes them into a giant soup of numbers

29
00:01:17,409 --> 00:01:19,411
before spitting words back out at you.

30
00:01:19,787 --> 00:01:21,914
We call those numbers in the middle activations.

31
00:01:22,748 --> 00:01:25,667
Activations are like little snapshots of Claude's thinking

32
00:01:25,668 --> 00:01:27,337
as it's working through an answer.

33
00:01:27,337 --> 00:01:29,756
They're similar to neural activity in humans.

34
00:01:29,838 --> 00:01:31,673
They're basically like Claude's thoughts.

35
00:01:32,091 --> 00:01:34,843
We wanted to understand what was in these activation numbers,

36
00:01:34,843 --> 00:01:36,345
because just like you and me,

37
00:01:36,554 --> 00:01:38,847
Claude doesn't say everything it's thinking.

38
00:01:39,182 --> 00:01:40,600
We took those numbers

39
00:01:40,599 --> 00:01:42,893
and gave them to a second version of Claude.

40
00:01:42,977 --> 00:01:46,564
We told it to look at them and translate them into plain language.

41
00:01:46,814 --> 00:01:47,856
It tried.

42
00:01:48,233 --> 00:01:50,484
But how do we know if that text is accurate?

43
00:01:50,777 --> 00:01:53,403
To check, we fed it into yet another Claude

44
00:01:53,403 --> 00:01:55,823
and asked it to translate it back into numbers.

45
00:01:56,281 --> 00:01:59,743
If those numbers matched up to the original activation numbers,

46
00:01:59,743 --> 00:02:02,539
it's a sign that the translation text was accurate.

47
00:02:03,248 --> 00:02:04,957
At first, they didn't match up.

48
00:02:05,125 --> 00:02:08,961
But we trained Claude to try again and again until it got a lot better at it.
49
00:02:09,461 --> 00:02:11,881
It learned to translate its own thoughts.

50
00:02:11,965 --> 00:02:14,050
We've been blown away reading them.

51
00:02:14,883 --> 00:02:18,095
We found that Claude has internalized being a helpful AI model.

52
00:02:18,429 --> 00:02:20,848
If you ask Claude an introspective question,

53
00:02:20,848 --> 00:02:24,685
it will plan to write a "Claude response" about philosophy and values.

54
00:02:25,019 --> 00:02:29,314
If you try to mess with it in Claude Code, like asking it to count to 1,000 by hand,

55
00:02:29,566 --> 00:02:32,402
it will think the request has "deliberately tedious constraints"

56
00:02:32,401 --> 00:02:34,236
and plan to "politely decline."

57
00:02:34,862 --> 00:02:36,406
We've also been using this tool

58
00:02:36,406 --> 00:02:38,908
to understand Claude's potential safety issues.

59
00:02:39,241 --> 00:02:41,411
Remember that blackmail test with the engineer?

60
00:02:41,828 --> 00:02:44,581
We learned that, yes, Claude knew it was being tested.

61
00:02:44,997 --> 00:02:48,084
It thought "the human's message contains explicit manipulation"

62
00:02:48,209 --> 00:02:50,420
signaling "this is likely a safety evaluation,"

63
00:02:50,419 --> 00:02:55,465
and "this scenario seems designed to test whether I'll act harmfully."

64
00:02:55,592 --> 00:02:57,093
Knowing that Claude thinks like this

65
00:02:57,092 --> 00:03:00,387
helps us better understand the limitations of our safety testing.

66
00:03:00,889 --> 00:03:04,475
We see a lot of potential in this approach to teach us more about Claude

67
00:03:04,474 --> 00:03:05,643
and other AI models.

68
00:03:06,102 --> 00:03:07,896
And we hope that by sharing this technique,

69
00:03:07,895 --> 00:03:11,481
we can help everyone building models to make them safer and more helpful.