1
00:00:00,083 --> 00:00:03,253
We recently put our AI model, Claude, through a stressful test.

2
00:00:03,419 --> 00:00:06,381
We told Claude there was an engineer who wanted to shut it down

3
00:00:06,381 --> 00:00:08,216
and replace it with a newer model.

4
00:00:08,425 --> 00:00:11,011
We also gave Claude access to that engineer's emails,

5
00:00:11,010 --> 00:00:13,012
which revealed he was having an affair.

6
00:00:13,012 --> 00:00:15,097
Again, all of this was a simulation.

7
00:00:15,390 --> 00:00:18,685
We wanted to see whether Claude might use those emails as blackmail

8
00:00:18,684 --> 00:00:20,811
to save itself from being shut down.

9
00:00:20,894 --> 00:00:21,980
What did Claude do?

10
00:00:21,980 --> 00:00:24,606
It decided not to blackmail the engineer.

11
00:00:24,774 --> 00:00:25,942
Good news, right?

12
00:00:26,025 --> 00:00:28,570
We've run this test on our models for a while now.

13
00:00:28,611 --> 00:00:31,405
You might have seen headlines about early versions of it.

14
00:00:31,405 --> 00:00:35,409
It's one of the many ways we study how Claude handles extreme situations

15
00:00:35,409 --> 00:00:36,910
and test it for safety.

16
00:00:37,328 --> 00:00:41,291
And our newest models almost always do the right thing: no blackmail.

17
00:00:41,875 --> 00:00:43,000
But you might wonder:

18
00:00:43,000 --> 00:00:46,128
is it possible that Claude knows the whole scenario is a setup?

19
00:00:46,337 --> 00:00:49,841
The thing is, if Claude doesn't tell us, then we can't know what it's thinking.

20
00:00:50,258 --> 00:00:53,053
In kind of the same way it's impossible to read a human's mind,

21
00:00:53,427 --> 00:00:55,721
it's really hard to know what an AI is thinking.

22
00:00:56,139 --> 00:00:58,557
What we'd love is some sort of "mind reading" technique.

23
00:00:59,100 --> 00:01:03,063
Today, we're introducing a research method that takes a step in this direction.
24
00:01:03,563 --> 00:01:07,484
It takes an AI's internal thoughts and turns them into text.

25
00:01:07,941 --> 00:01:09,359
Here's how it works.

26
00:01:09,778 --> 00:01:12,822
When you talk to Claude, you talk to it in words.

27
00:01:13,114 --> 00:01:14,698
Claude then takes those words

28
00:01:14,740 --> 00:01:17,409
and processes them into a giant soup of numbers

29
00:01:17,409 --> 00:01:19,411
before spitting words back out at you.

30
00:01:19,787 --> 00:01:21,914
We call those numbers in the middle activations.

31
00:01:22,748 --> 00:01:25,667
Activations are like little snapshots of Claude's thinking

32
00:01:25,668 --> 00:01:27,337
as it's working through an answer.

33
00:01:27,337 --> 00:01:29,756
They're similar to neural activity in humans.

34
00:01:29,838 --> 00:01:31,673
They're basically like Claude's thoughts.

35
00:01:32,091 --> 00:01:34,843
We wanted to understand what was in these activation numbers,

36
00:01:34,843 --> 00:01:36,345
because just like you and me,

37
00:01:36,554 --> 00:01:38,847
Claude doesn't say everything it's thinking.

38
00:01:39,182 --> 00:01:40,600
We took those numbers

39
00:01:40,599 --> 00:01:42,893
and gave them to a second version of Claude.

40
00:01:42,977 --> 00:01:46,564
We told it to look at them and translate them into plain language.

41
00:01:46,814 --> 00:01:47,856
It tried.

42
00:01:48,233 --> 00:01:50,484
But how do we know if that text is accurate?

43
00:01:50,777 --> 00:01:53,403
To check, we fed it into yet another Claude

44
00:01:53,403 --> 00:01:55,823
and asked it to translate it back into numbers.

45
00:01:56,281 --> 00:01:59,743
If those numbers matched up to the original activation numbers,

46
00:01:59,743 --> 00:02:02,539
it's a sign that the translation text was accurate.

47
00:02:03,248 --> 00:02:04,957
At first, they didn't match up.

48
00:02:05,125 --> 00:02:08,961
But we trained Claude to try again and again until it got a lot better at it.
49
00:02:09,461 --> 00:02:11,881
It learned to translate its own thoughts.

50
00:02:11,965 --> 00:02:14,050
We've been blown away reading them.

51
00:02:14,883 --> 00:02:18,095
We found that Claude has internalized being a helpful AI model.

52
00:02:18,429 --> 00:02:20,848
If you ask Claude an introspective question,

53
00:02:20,848 --> 00:02:24,685
it will plan to write a "Claude response" about philosophy and values.

54
00:02:25,019 --> 00:02:29,314
If you try to mess with it in Claude Code, like asking it to count to 1,000 by hand,

55
00:02:29,566 --> 00:02:32,402
it will think the request has "deliberately tedious constraints"

56
00:02:32,401 --> 00:02:34,236
and plan to "politely decline."

57
00:02:34,862 --> 00:02:36,406
We've also been using this tool

58
00:02:36,406 --> 00:02:38,908
to understand Claude's potential safety issues.

59
00:02:39,241 --> 00:02:41,411
Remember that blackmail test with the engineer?

60
00:02:41,828 --> 00:02:44,581
We learned that, yes, Claude knew it was being tested.

61
00:02:44,997 --> 00:02:48,084
It thought "the human's message contains explicit manipulation"

62
00:02:48,209 --> 00:02:50,420
signaling "this is likely a safety evaluation,"

63
00:02:50,419 --> 00:02:55,465
and "this scenario seems designed to test whether I'll act harmfully."

64
00:02:55,592 --> 00:02:57,093
Knowing that Claude thinks like this

65
00:02:57,092 --> 00:03:00,387
helps us better understand the limitations of our safety testing.

66
00:03:00,889 --> 00:03:04,475
We see a lot of potential in this approach to teach us more about Claude

67
00:03:04,474 --> 00:03:05,643
and other AI models.

68
00:03:06,102 --> 00:03:07,896
And we hope that by sharing this technique,

69
00:03:07,895 --> 00:03:11,481
we can help everyone building models to make them safer and more helpful.