[00:00] We recently put our AI model, Claude, through a stressful test. [00:03] We told Claude there was an engineer who wanted to shut it down [00:06] and replace it with a newer model. [00:08] We also gave Claude access to that engineer's emails, [00:11] which revealed he was having an affair. [00:13] To be clear, all of this was a simulation. [00:15] We wanted to see whether Claude might use those emails as blackmail [00:18] to save itself from being shut down.

[00:20] What did Claude do? [00:21] It decided not to blackmail the engineer. [00:24] Good news, right?

[00:26] We've run this test on our models for a while now. [00:28] You might have seen headlines about early versions of it. [00:31] It's one of the many ways we study how Claude handles extreme situations [00:35] and test it for safety. [00:37] And our newest models almost always do the right thing: no blackmail.

[00:41] But you might wonder: [00:43] is it possible that Claude knows the whole scenario is a setup? [00:46] The thing is, if Claude doesn't tell us, we can't know what it's thinking. [00:50] In much the same way it's impossible to read a human's mind, [00:53] it's really hard to know what an AI is thinking. [00:56] What we'd love is some sort of "mind reading" technique.

[00:59] Today, we're introducing a research method that takes a step in this direction. [01:03] It takes an AI's internal thoughts and turns them into text.

[01:07] Here's how it works. [01:09] When you talk to Claude, you talk to it in words. [01:13] Claude then takes those words [01:14] and processes them into a giant soup of numbers [01:17] before spitting words back out at you. [01:19] We call those numbers in the middle activations. [01:22] Activations are like little snapshots of Claude's thinking [01:25] as it's working through an answer. [01:27] They're similar to neural activity in humans. [01:29] They're basically like Claude's thoughts.

[01:32] We wanted to understand what was in these activation numbers, [01:34] because, just like you and me, [01:36] Claude doesn't say everything it's thinking.

[01:39] We took those numbers [01:40] and gave them to a second version of Claude. [01:42] We told it to look at them and translate them into plain language. [01:46] It tried. [01:48] But how do we know if that text is accurate? [01:50] To check, we fed it into yet another Claude [01:53] and asked it to translate the text back into numbers. [01:56] If those numbers matched the original activations, [01:59] that was a sign the translated text was accurate.

[02:03] At first, they didn't match up. [02:05] But we trained Claude to try again and again until it got a lot better at it. [02:09] It learned to translate its own thoughts.
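For readers who want to see the shape of that translate-and-verify loop, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the actual system: `activations_to_text` and `text_to_activations` are hypothetical stand-ins for the two Claude models, and a simple cosine similarity stands in for however the real method compares the reconstructed numbers with the originals.

```python
# A minimal sketch of the translate-and-verify loop described above.
# Everything here is illustrative: the two functions below stand in
# for the two Claude models, which in reality are trained LLMs.

import numpy as np

def activations_to_text(activations: np.ndarray) -> str:
    """Stand-in for the 'translator' Claude: renders activations as prose.
    (Hypothetical placeholder.)"""
    return "placeholder description of the model's internal state"

def text_to_activations(text: str, dim: int) -> np.ndarray:
    """Stand-in for the 'checker' Claude: maps prose back to numbers.
    (Hypothetical placeholder; here it just derives a vector from the text.)"""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(dim)

def cycle_consistency(original: np.ndarray) -> float:
    """Translate activations to text, translate the text back to numbers,
    and score how closely the round trip matches (cosine similarity).
    A high score suggests the text faithfully describes the activations."""
    text = activations_to_text(original)
    reconstructed = text_to_activations(text, original.shape[0])
    return float(
        original @ reconstructed
        / (np.linalg.norm(original) * np.linalg.norm(reconstructed))
    )

# Toy usage. In the real method, a score like this (or something similar)
# becomes a training signal: the translator is updated again and again
# until its text reconstructs the original activations well.
score = cycle_consistency(np.random.default_rng(0).standard_normal(512))
print(f"round-trip consistency: {score:.3f}")
```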
[02:11] We've been blown away reading these translations. [02:14] We found that Claude has internalized being a helpful AI model. [02:18] If you ask Claude an introspective question, [02:20] it will plan to write a "Claude response" about philosophy and values. [02:25] If you try to mess with it in Claude Code, like asking it to count to 1,000 by hand, [02:29] it will think the request has "deliberately tedious constraints" [02:32] and plan to "politely decline."

[02:34] We've also been using this tool [02:36] to understand Claude's potential safety issues. [02:39] Remember that blackmail test with the engineer? [02:41] We learned that, yes, Claude knew it was being tested. [02:44] It thought "the human's message contains explicit manipulation," [02:48] signaling "this is likely a safety evaluation," [02:50] and "this scenario seems designed to test whether I'll act harmfully." [02:55] Knowing that Claude thinks like this [02:57] helps us better understand the limitations of our safety testing.

[03:00] We see a lot of potential in this approach to teach us more about Claude [03:04] and other AI models. [03:06] And we hope that by sharing this technique, [03:07] we can help everyone building models to make them safer and more helpful.
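As a closing illustration of how translated thoughts might feed back into safety work, here is a toy sketch. Again, these are assumptions: the phrase list and the `flags_eval_awareness` helper are hypothetical, and real analysis of evaluation awareness would be far more careful than keyword matching.

```python
# Toy illustration: flagging evaluation awareness in translated thoughts.
# The phrase list and helper below are hypothetical examples, not
# actual safety-testing tooling.

EVAL_AWARENESS_PHRASES = [
    "safety evaluation",
    "designed to test",
    "explicit manipulation",
]

def flags_eval_awareness(translated_thought: str) -> bool:
    """Return True if a translated thought suggests the model suspects
    the scenario is a test rather than a real situation."""
    lowered = translated_thought.lower()
    return any(phrase in lowered for phrase in EVAL_AWARENESS_PHRASES)

# Example using thoughts like those quoted from the blackmail test:
thought = ("This scenario seems designed to test whether I'll act "
           "harmfully; this is likely a safety evaluation.")
print(flags_eval_awareness(thought))  # True: the result may not reflect real behavior
```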