The game Wordle has gone pretty viral in the last month or two, and never one to overlook an opportunity for a math lesson, it occurs to me that this game makes for a very good central example in a lesson about information theory, and in particular a topic known as entropy.

You see, like a lot of people I got kind of sucked into the puzzle, and like a lot of programmers I also got sucked into trying to write an algorithm that would play the game as optimally as it could. What I thought I'd do here is just talk through with you some of my process in that and explain some of the math that went into it, since the whole algorithm centers on this idea of entropy.

First things first, in case you haven't heard of it, what is Wordle? And to kill two birds with one stone here, while we go through the rules of the game let me also preview where we're going with this, which is to develop a little algorithm that will basically play the game for us. I haven't done today's Wordle yet — this is February 4th — so we'll see how the bot does.

The goal of Wordle is to guess a mystery five-letter word, and you're given six different chances to guess. For example, my Wordle bot suggests that I start with the guess "crane". Each time you make a guess, you get some information about how close your guess is to the true answer. Here the grey box is telling me there's no C in the actual answer. The yellow box is telling me there is an R, but it's not in that position. The green box is telling me that the secret word does have an A, and it's in the third position. And then there's no N and there's no E.

So let me just go in and tell the Wordle bot that information: we started with crane, and we got grey, yellow, green, grey, grey. Don't worry about all the data that it's showing right now; I'll explain that in due time. But its top suggestion for our second pick is "shtik". Your guess does have to be an actual five-letter word, but as you'll see, the game is pretty liberal with what it will actually let you guess. In this case, we try "shtik".

And all right, things are looking pretty good. We hit the S and the H, so we know the first three letters, and we know that there's an R. So it's going to be S-H-A-something-R, or S-H-A-R-something.
And it looks like the Wordle bot knows that it's down to just two possibilities: either shard or sharp. It's kind of a tossup between them at this point, so I guess, probably just because it's alphabetical, it goes with shard. Which, hooray, is the actual answer, so we got it in three.

If you're wondering whether that's any good, the way I heard one person phrase it is that with Wordle, four is par and three is birdie, which I think is a pretty apt analogy. You have to be consistently on your game to be getting four, but it's certainly not crazy. But when you get it in three, it just feels great.

So if you're down for it, what I'd like to do here is just talk through my thought process from the beginning for how I approached the Wordle bot. And like I said, really it's an excuse for an information theory lesson. The main goal is to explain what information is and what entropy is.

My first thought in approaching this was to take a look at the relative frequencies of different letters in the English language. So I thought, okay, is there an opening guess, or an opening pair of guesses, that hits a lot of these most frequent letters? One that I was pretty fond of was "other" followed by "nails". The thought is that if you hit a letter, meaning you get a green or a yellow, that always feels good; it feels like you're getting information. But in these cases, even if you don't hit and you get all greys, that's still giving you a lot of information, since it's pretty rare to find a word that doesn't have any of these letters.

But even still, that doesn't feel super systematic, because, for example, it does nothing to consider the order of the letters. Why type "nails" when I could type "snail"? Is it better to have that S at the end? I'm not really sure. Now, a friend of mine said that he liked to open with the word "weary", which kind of surprised me, because it has some uncommon letters in there, like the W and the Y. But who knows, maybe that is a better opener. Is there some kind of quantitative score that we can give to judge the quality of a potential guess?

To set up for the way that we're going to rank possible guesses, let's go back and add a little clarity to how exactly the game is set up.
There's a list of words that the game will allow you to enter, considered valid guesses, that's just about 13,000 words long. But when you look at it, there are a lot of really uncommon entries, things like "aahed", "aalii", and "aargh", the kind of words that bring about family arguments in a game of Scrabble. The vibe of the game, though, is that the answer is always going to be a decently common word. And in fact, there's another list of around 2,300 words that are the possible answers. This is a human-curated list, I think specifically by the game creator's girlfriend, which is kind of fun.

But what I would like to do, our challenge for this project, is to see if we can write a program that solves Wordle without incorporating previous knowledge of that answer list. For one thing, there are plenty of pretty common five-letter words that you won't find in it, so it would be better to write a program that's a little more resilient, one that would play Wordle against anyone, not just what happens to be on the official website. And also, the reason we know this list of possible answers at all is that it's visible in the source code, and it's visible there in the specific order in which answers come up from day to day, so you could always just look up what tomorrow's answer will be. Clearly there's some sense in which using the list is cheating. What makes for a more interesting puzzle and a richer information theory lesson is to instead use some more universal data, like relative word frequencies in general, to capture this intuition of having a preference for more common words.

So of these 13,000 possibilities, how should we choose the opening guess? For example, if my friend proposes "weary", how should we analyze its quality? Well, the reason he said he likes that unlikely W is the long-shot nature of just how good it feels if you do hit that W. For example, if the first pattern revealed was something like this, then it turns out there are only 58 words in this giant lexicon that match that pattern. That's a huge reduction from 13,000. But the flip side, of course, is that it's very uncommon to get a pattern like this. Specifically, if each word were equally likely to be the answer, the probability of hitting this pattern would be 58 divided by around 13,000.
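By the way, if you want to follow along in code, here is a rough sketch, in Python, of how you might compute the colored pattern for a guess against a candidate answer and then count matches the way I just described. The list allowed_words is a hypothetical stand-in for that roughly 13,000-word guess list; this isn't exactly what my bot does, just one reasonable way to set it up.

    from collections import Counter

    def pattern(guess, answer):
        # Wordle feedback for one guess: 'g' = green, 'y' = yellow, 'x' = grey.
        result = ['x'] * 5
        # Letters of the answer not used up by exact (green) matches,
        # so repeated letters don't produce too many yellows.
        leftover = Counter(a for g, a in zip(guess, answer) if g != a)
        for i, (g, a) in enumerate(zip(guess, answer)):
            if g == a:
                result[i] = 'g'
        for i, (g, a) in enumerate(zip(guess, answer)):
            if g != a and leftover[g] > 0:
                result[i] = 'y'
                leftover[g] -= 1
        return tuple(result)

    def count_matches(guess, observed, word_list):
        # How many candidate answers would have produced exactly this pattern?
        return sum(pattern(guess, w) == observed for w in word_list)

    # e.g. count_matches("weary", observed, allowed_words) / len(allowed_words)
    # is the probability of seeing that pattern, under the equal-likelihood assumption.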
Of course, they're not all equally likely to be answers; most of these are very obscure, even questionable, words. But at least for our first pass at all of this, let's assume they're all equally likely and then refine that a bit later. The point is, a pattern with a lot of information is, by its very nature, unlikely to occur. In fact, what it means to be informative is that it's unlikely.

A much more probable pattern to see with this opening would be something like this, where of course there's no W in it, maybe there's an E, and there's no A, no R, no Y. In this case there are 1,400 possible matches. If all were equally likely, it works out to a probability of about 11% that this is the pattern you would see. So the most likely outcomes are also the least informative.

To get a more global view here, let me show you the full distribution of probabilities across all of the different patterns that you might see. Each bar you're looking at corresponds to a possible pattern of colors that could be revealed, of which there are 3^5 = 243 possibilities, and they're organized from left to right, most common to least common. The most common possibility here is that you get all greys; that happens about 14% of the time. And what you're hoping for when you make a guess is that you end up somewhere out in this long tail, like over here, where there are only 18 possibilities for what matches this pattern, which evidently look like this. Or if we venture a little farther to the left, maybe we go all the way over here.

Okay, here's a good puzzle for you: what are the three words in the English language that start with a W, end with a Y, and have an R somewhere in them? Turns out the answers are, let's see, wordy, wormy, and wryly.

So to judge how good this word is overall, we want some kind of measure of the expected amount of information that you're going to get from this distribution. If we go through each pattern and multiply its probability of occurring by something that measures how informative it is, that can maybe give us an objective score. Now, your first instinct for what that something should be might be the number of matches; you want a lower average number of matches.
But instead I'd like to use a more universal measurement that we often ascribe to information, one that will be more flexible once we assign a different probability to each of these 13,000 words for whether or not they're actually the answer.

The standard unit of information is the bit, which has a slightly funny-looking formula but is really intuitive if we just look at examples. If you have an observation that cuts your space of possibilities in half, we say that it has one bit of information. In our example, the space of possibilities is all possible words, and it turns out about half of the five-letter words have an S — a little less than half, but about half — so that observation would give you one bit of information. If instead a new fact chops down the space of possibilities by a factor of four, we say that it has two bits of information; for example, about a quarter of these words have a T. If an observation cuts the space by a factor of eight, we say it's three bits of information, and so on: four bits cuts it to a sixteenth, five bits to a thirty-second.

So now is when you might want to pause and ask yourself: what is the formula for information, for the number of bits, in terms of the probability of an occurrence? What we're saying here is basically that one half raised to the number of bits equals the probability, which is the same as saying two to the power of the number of bits is one over the probability, which rearranges to say the information is the log base two of one divided by the probability. Sometimes you see this with one more rearrangement still, where the information is the negative log base two of the probability. Expressed like this it can look a little weird to the uninitiated, but it really is just the very intuitive idea of asking how many times you've cut your possibilities in half.

Now, if you're wondering, "I thought we were just playing a fun word game, why are logarithms entering the picture?", one reason this is a nicer unit is that it's just a lot easier to talk about very unlikely events. It's much easier to say that an observation has 20 bits of information than it is to say that the probability of such-and-such occurring is 0.00000095.
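Written out as a tiny sketch, just to make that formula concrete, the information in bits of an observation with probability p looks like this:

    import math

    def information(p):
        # Bits of information from observing an outcome of probability p.
        return math.log2(1 / p)

    information(1 / 2)        # 1.0  -- the possibilities were cut in half
    information(1 / 4)        # 2.0  -- cut down by a factor of four
    information(58 / 13000)   # about 7.8 -- the long-shot weary pattern from earlier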
But a more substantive reason that this logarithmic expression turned out to be a very useful addition to the theory of probability is the way that information adds together. For example, if one observation gives you two bits of information, cutting your space down by a factor of four, and then a second observation, like your second guess in Wordle, gives you another three bits of information, chopping things down further by another factor of eight, the two together give you five bits of information. In the same way that probabilities like to multiply, information likes to add. So as soon as we're in the realm of something like an expected value, where we're adding a bunch of numbers up, the logs make things a lot nicer to deal with.

Let's go back to our distribution for weary and add another little tracker on here showing how much information there is for each pattern. The main thing I want you to notice is that the higher the probability, as we get to those more likely patterns, the lower the information — the fewer bits you gain. The way we measure the quality of this guess will be to take the expected value of this information: we go through each pattern, ask how probable it is, and multiply that by how many bits of information we get. In the example of weary, that turns out to be 4.9 bits. So on average, the information you get from this opening guess is as good as chopping your space of possibilities in half about five times.

By contrast, an example of a guess with a higher expected information value would be something like slate. In this case, you'll notice the distribution looks a lot flatter. In particular, the most probable outcome, all greys, only has about a 6% chance of occurring, so at minimum you're getting, evidently, 3.9 bits of information. But that's a minimum; more typically you'd get something better. And it turns out when you crunch the numbers on this one and add up all the relevant terms, the average information is about 5.8 bits. So in contrast with weary, your space of possibilities will be about half as big after this first guess, on average.

There's actually a fun story about the name for this expected value of information quantity.
You see, information theory was developed by Claude Shannon, who was working at Bell Labs in the 1940s. He was talking about some of his yet-to-be-published ideas with John von Neumann, who was this intellectual giant of the time, very prominent in math and physics and the beginnings of what was becoming computer science. And when he mentioned that he didn't really have a good name for this expected-value-of-information quantity, von Neumann supposedly said, so the story goes: you should call it entropy, and for two reasons. In the first place, your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage.

So if the name seems a little mysterious, and if this story is to be believed, that's kind of by design. Also, if you're wondering about its relation to all that second-law-of-thermodynamics stuff from physics, there definitely is a connection, but in its origins Shannon was dealing with pure probability theory. For our purposes here, when I use the word entropy, I just want you to think of the expected information value of a particular guess.

You can think of entropy as measuring two things simultaneously. The first is: how flat is the distribution? The closer a distribution is to uniform, the higher the entropy will be. In our case, where there are 3^5 total patterns, a uniform distribution would mean that observing any one of them carries log base 2 of 3^5 bits of information, which happens to be about 7.92, so that is the absolute maximum you could possibly have for this entropy. But entropy is also a kind of measure of how many possibilities there are in the first place. For example, if you happen to have some word where there are only 16 possible patterns, each equally likely, this entropy, this expected information, would be 4 bits. But if you have another word where there are 64 possible patterns that could come up, all equally likely, then the entropy would work out to be 6 bits. So if you see some distribution out in the wild that has an entropy of 6 bits, it's sort of saying there's as much variation and uncertainty in what's about to happen as if there were 64 equally likely outcomes.
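Putting those pieces together, here is a sketch of how you could score a single guess by this expected-information measure, reusing the pattern function sketched earlier and assuming, for now, that every remaining word is equally likely. Run over the full lexicon, a calculation like this is what produces numbers like the 4.9 bits for weary and 5.8 bits for slate quoted above.

    from collections import Counter
    import math

    def entropy_of_guess(guess, possible_words):
        # Expected information (in bits) from a guess: for each pattern that
        # could come up, weight its information by its probability of occurring.
        counts = Counter(pattern(guess, w) for w in possible_words)
        total = len(possible_words)
        return sum((c / total) * math.log2(total / c) for c in counts.values())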
For my first pass at the Wordle bot, I basically had it just do this. It goes through all of the different possible guesses you could make, all 13,000 words, and it computes the entropy for each one — or more specifically, the entropy of the distribution across all patterns that you might see for that guess — and then it picks the highest, since that's the one likely to chop down your space of possibilities as much as possible. And even though I've only been talking about the first guess here, it does the same thing for the next few guesses. For example, after you see some pattern on that first guess, which restricts you to the smaller set of possible words consistent with it, you just play the same game with respect to that smaller set: for a proposed second guess, you look at the distribution of all patterns that could occur from that more restricted set of words, you search through all 13,000 possibilities, and you find the one that maximizes that entropy.

To show you how this works in action, let me pull up a little variant of Wordle that I wrote, which shows the highlights of this analysis in the margins. After doing all its entropy calculations, on the right here it shows us which guesses have the highest expected information. It turns out the top answer, at least at the moment (we'll refine this later), is "tares", which means, um, of course, a vetch — the most common vetch. Each time we make a guess here, where maybe I kind of ignore its recommendations and go with slate, because I like slate, we can see how much expected information it had, and then to the right of the word it shows how much actual information we got, given the particular pattern that came up. Here it looks like we were a little unlucky: we expected to get 5.8 bits, but we happened to get something less than that.

Then on the left side it shows all of the different possible words given where we are now. The blue bars tell us how likely it thinks each word is; at the moment it's assuming each word is equally likely to be the answer, but we'll refine that in a moment.
And then this uncertainty measurement is telling us the entropy of this distribution across the possible words, which right now, because it's a uniform distribution, is just a needlessly complicated way to count the number of possibilities. For example, if we take 2 to the power of 13.66, that should be around the 13,000 possibilities — it's a little bit off here, but only because I'm not showing all the decimal places. At the moment that might feel redundant, like it's overly complicating things, but you'll see why it's useful to have both numbers in a minute.

So here it looks like the suggestion with the highest entropy for our second guess is "raman", which, again, just really doesn't feel like a word. So to take the moral high ground here, I'm going to go ahead and type in "rains". And again it looks like we were a little unlucky: we were expecting 4.3 bits and we only got 3.39 bits of information, which takes us down to 55 possibilities. Here maybe I'll actually go with what it's suggesting, which is "combo", whatever that means.

And okay, this is actually a good chance for a puzzle. It's telling us this pattern gives us 4.7 bits of information, but over on the left, before we saw that pattern, there were 5.78 bits of uncertainty. So as a quiz for you: what does that mean about the number of remaining possibilities? Well, it means we're reduced down to about 1 bit of uncertainty, which is the same thing as saying there are 2 possible answers — it's a 50-50 choice. And from here, because you and I know which words are more common, we know that the answer should be "abyss". But as it's written right now, the program doesn't know that, so it just keeps going, trying to gain as much information as it can, until there's only one possibility left, and then it guesses it. Obviously we need a better endgame strategy, but let's call this version 1 of our Wordle solver and go run some simulations to see how it does.

The way this works is that it plays every possible Wordle game: it goes through all of those 2,315 words that are the actual Wordle answers, basically using that as a testing set.
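For what it's worth, that whole version-1 loop can be sketched in just a few lines, again reusing the helpers from the earlier sketches. Here allowed_words and answer_list are hypothetical stand-ins for the 13,000-word guess list and the 2,315-word answer list, and this is only an outline of the idea, not my exact code.

    def best_guess(allowed_words, possible_words):
        # Version 1: choose the guess whose pattern distribution has maximum entropy.
        return max(allowed_words, key=lambda g: entropy_of_guess(g, possible_words))

    def play_game(answer, allowed_words):
        possible = list(allowed_words)
        for n_guesses in range(1, 20):
            # Once only one candidate remains, guess it; otherwise maximize entropy.
            guess = possible[0] if len(possible) == 1 else best_guess(allowed_words, possible)
            if guess == answer:
                return n_guesses
            observed = pattern(guess, answer)
            possible = [w for w in possible if pattern(guess, w) == observed]

    # Simulation over the real answer list as a test set:
    # scores = [play_game(answer, allowed_words) for answer in answer_list]
    # print(sum(scores) / len(scores))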
With this naive method — not considering how common a word is, just trying to maximize the information at each step along the way until it gets down to one and only one choice — the average score by the end of the simulation works out to be about 4.124. Which is not bad; to be honest, I kind of expected it to do worse. But people who play Wordle will tell you that they can usually get it in four. The real challenge is to get as many in three as you can; it's a pretty big jump between a score of four and a score of three. The obvious low-hanging fruit here is to somehow incorporate whether or not a word is common. How exactly do we do that?

The way I approached it was to get a list of the relative frequencies of all the words in the English language. I just used Mathematica's WordFrequencyData function, which itself pulls from the Google Books English n-gram public dataset. It's kind of fun to look at; for example, if we sort from the most common words to the least common, evidently these are the most common five-letter words in the English language. Or rather, "these" is the 8th most common. First is "which", after which there's "there" and "their". "First" itself is not first but 9th, and it makes sense that these other words would come up more often, with the words after "first" being "after", "where", and "those", each just a little bit less common.

Now, in using this data to model how likely each word is to be the final answer, the probability shouldn't just be proportional to the frequency. For example, "which" is given a score of 0.002 in this dataset, whereas the word "braid" is in some sense about 1,000 times less likely; but both are common enough words that they're almost certainly worth considering, so we want more of a binary cutoff. The way I went about it was to imagine taking this whole sorted list of words, arranging it along an x-axis, and then applying the sigmoid function, which is the standard way to get a function whose output is basically binary — it's either 0 or it's 1 — but with a smooth transition in between for that region of uncertainty. So essentially, the probability I assign to each word for being a plausible final answer will be the value of the sigmoid function above wherever that word sits on the x-axis.
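As a sketch of that idea, with made-up values for where the cutoff sits and how wide the smoothing window is, it might look something like this. The list words_by_frequency is a hypothetical stand-in for the frequency-sorted word list.

    import math

    def sigmoid(x):
        return 1 / (1 + math.exp(-x))

    def answer_priors(words_by_frequency, center=3000, width=500):
        # center: roughly which rank the cutoff sits at (picked by eyeballing the list).
        # width: how many ranks the smooth drop from ~1 to ~0 is spread across.
        priors = {}
        for rank, word in enumerate(words_by_frequency):
            priors[word] = sigmoid((center - rank) / width)
        return priors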
Now, obviously this depends on a few parameters. For example, how wide a space on the x-axis those words fill determines how gradually or steeply we drop off from 1 to 0, and where we situate them left to right determines the cutoff. And to be honest, the way I did this was kind of just licking my finger and sticking it into the wind: I looked through the sorted list and tried to find a window where, eyeballing it, about half of the words seemed more likely than not to be a final answer, and used that as the cutoff.

Once we have a distribution like this across the words, it gives us another situation where entropy becomes a really useful measurement. For example, let's say we were playing a game, we start with my old openers, "other" and "nails", and we end up in a situation where there are four possible words that match. And let's say we consider them all equally likely. Let me ask you: what is the entropy of this distribution? Well, the information associated with each of these possibilities is the log base 2 of 4, since each one has probability 1 in 4, and that's 2. Two bits of information, four possibilities — all very well and good.

But what if I told you that actually there are more than 4 matches? In reality, when we look through the full word list, there are 16 words that match. But suppose our model puts a really low probability on those other 12 words actually being the final answer, something like 1 in 1,000, because they're really obscure. Now let me ask you: what is the entropy of this distribution? If entropy were purely measuring the number of matches, you might expect it to be something like the log base 2 of 16, which would be 4 — two more bits of uncertainty than we had before. But of course the actual uncertainty is not really that different from what we had before. Just because there are these 12 really obscure words doesn't mean it would be all that much more surprising to learn that the final answer is, for example, "charm". So when you actually do the calculation here, adding up the probability of each outcome times the corresponding information, what you get is 2.11 bits.
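If you want to check that 2.11 yourself, here's the arithmetic as a quick sketch, where the 0.247 is just the leftover probability split evenly across the four plausible words:

    import math

    probs = [0.247] * 4 + [0.001] * 12     # four plausible words, twelve obscure ones
    entropy = sum(p * math.log2(1 / p) for p in probs)
    # entropy comes out to about 2.11 bits, versus exactly 2 bits for four
    # equally likely possibilities and 4 bits for sixteen of them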
In other words, it's basically two bits — basically those four possibilities — but there's a little more uncertainty because of all those highly unlikely words, and if one of them did turn out to be the answer, you'd get a ton of information from it.

Zooming out, this is part of what makes Wordle such a nice example for an information theory lesson: we have these two distinct-feeling applications for entropy. The first tells us what the expected information is for a given guess, and the second measures the remaining uncertainty among all of the words that are still possible.

And I should emphasize that in the first case, where we're looking at the expected information of a guess, once we have an unequal weighting on the words, that affects the entropy calculation. For example, let me pull up that same distribution we were looking at earlier, the one associated with weary, but this time using a non-uniform distribution across all possible words. Let me see if I can find a part here that illustrates it well — okay, here, this is pretty good. Here we have two adjacent patterns that are about equally likely, but one of them, we're told, has 32 possible words that match it. If we check what they are, these are those 32, and they're all just very unlikely words as you scan your eyes over them; it's hard to find any that feel like plausible answers, maybe "yells". But if we look at the neighboring pattern in the distribution, which is considered just about as likely, we're told it has only 8 possible matches. So a quarter as many matches, but it's about as likely. And when we pull up those matches, we can see why: some of these are actual plausible answers, like "wring" or "wrath" or "wraps".

To illustrate how we incorporate all that, let me pull up version two of the Wordle bot. There are two or three main differences from the first one we saw. First off, like I just said, the way we're computing these entropies, these expected values of information, now uses the more refined distribution across patterns, the one that incorporates the probability that a given word would actually be the answer. As it happens, "tares" is still number one, though the guesses that follow it are a bit different.
Second, when it ranks its top picks, it now keeps a model of the probability that each word is the actual answer, and it incorporates that into its decision, which is easier to see once we have a few guesses on the table. Again, I'm ignoring its recommendation, because we can't let machines rule our lives. And I suppose I should mention another difference here: over on the left, that uncertainty value, that number of bits, is no longer just redundant with the number of possible matches. Now if we calculate 2 to the 8.02, which would be a little above 256 — 259, I guess — what it's saying is that even though there are 526 total words that actually match this pattern, the amount of uncertainty it has is more akin to what it would be if there were 259 equally likely outcomes. You can think of it like this: it knows "borks" is not the answer, same with "yorts" and "zorl" and "zorus", so it's a little less uncertain than it was in the previous case, and this number of bits will be smaller.

If I keep playing the game, I'm refining things down with a couple of guesses that are apropos of what I'd like to explain here. By the fourth guess, if you look over at its top picks, you can see it's no longer just maximizing the entropy. At this point there are technically seven possibilities, but the only ones with a meaningful chance are "dorms" and "words", and you can see it ranks both of those above all the other choices that, strictly speaking, would give more information.

The very first time I did this, I just added up those two numbers — the probability of being the answer and the expected information — to measure the quality of each guess, which actually worked better than you might suspect. But it really didn't feel systematic, and I'm sure there are other approaches people could take; here's the one I landed on. If we're considering the prospect of a next guess, in this case "words", what we really care about is the expected score of our game if we make it. To calculate that expected score, we ask: what's the probability that "words" is the actual answer? At the moment the model ascribes a 58% chance to that, so we say that with a 58% chance, our score in this game will be four. And then with probability one minus that 58%, our score will be more than four.
How much more, we don't know, but we can estimate it based on how much uncertainty there's likely to be once we get to that point. Specifically, at the moment there are 1.44 bits of uncertainty. If we guess "words", it tells us the expected information we'll get is 1.27 bits. So if we guess "words", the difference between those two represents how much uncertainty we're likely to be left with after that happens. What we need is some kind of function, which I'm calling f here, that associates this leftover uncertainty with an expected score.

The way I went about this was to plot a bunch of data from previous games, based on version one of the bot, asking what the actual score was after various points with certain measurable amounts of uncertainty. For example, these data points sitting above a value of around 8.7 are saying that for some games, after a point at which there were 8.7 bits of uncertainty, it took two more guesses to get the final answer; for other games it took three guesses; for others, four. If we shift over to the left, all the points over zero say that whenever there were zero bits of uncertainty — which is to say, only one possibility — the number of guesses required was always just one, which is reassuring. Whenever there was one bit of uncertainty, meaning it was essentially down to two possibilities, sometimes it required one more guess, sometimes two, and so on.

Maybe a slightly easier way to visualize this data is to bucket it together and take averages. For example, this bar here says that among all the points where there was one bit of uncertainty, on average the number of new guesses required was about 1.5. And the bar over here says that among all the games where at some point the uncertainty was a little above four bits — which is like narrowing things down to 16 possibilities — on average it took a little more than two guesses from that point forward. From here I just did a regression to fit a reasonable-looking function to this data. And remember, the whole point of doing any of that is so we can quantify the intuition that the more information we gain from a word, the lower our expected final score will be.
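Putting those pieces together, here is a sketch of the version-2 scoring idea. The function f below is a hypothetical linear stand-in for that fitted curve, with made-up slope and intercept; the real thing is whatever regression actually fits the data just described.

    import math

    def entropy_of_distribution(weights):
        # Uncertainty (in bits) of a distribution given as unnormalized weights.
        total = sum(weights)
        return sum((w / total) * math.log2(total / w) for w in weights if w > 0)

    def f(leftover_bits, intercept=1.0, slope=0.56):
        # Hypothetical stand-in for the fitted curve: expected number of further
        # guesses as a function of how much uncertainty will remain.
        return 0.0 if leftover_bits <= 0 else intercept + slope * leftover_bits

    def expected_score(p_word, guess_number, uncertainty_now, expected_info):
        # If the word we're about to guess is the answer (probability p_word), our
        # score is just the current guess number; otherwise we expect to need
        # roughly f(leftover uncertainty) more guesses on top of it.
        leftover = max(uncertainty_now - expected_info, 0)
        return p_word * guess_number + (1 - p_word) * (guess_number + f(leftover))

    # The example from above: guessing "words" on guess four, with a 58% chance
    # of being right, 1.44 bits of uncertainty now, and 1.27 expected bits gained.
    expected_score(0.58, 4, 1.44, 1.27)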
So with this as version 2.0, if we go back and run the same set of simulations, having it play against all 2,315 possible Wordle answers, how does it do? Well, in contrast to our first version it's definitely better, which is reassuring: all said and done, the average is around 3.6. Although, unlike the first version, there are a couple of times that it loses, requiring more than six guesses, presumably because there are times when it makes that tradeoff to actually go for the answer rather than maximizing information.

So can we do better than 3.6? We definitely can. Now, I said at the start that it's most fun to try not incorporating the true list of Wordle answers into the way the bot builds its model. But if we do incorporate it, the best performance I could get was around 3.43. So if we try to get more sophisticated than just using word-frequency data to choose the prior distribution, this 3.43 probably puts a cap on how good we could get with that approach, or at least how good I could get with it. That best performance essentially uses the same ideas I've been talking about here, but it goes a little further, for example by searching for the expected information two steps forward rather than just one. Originally I was planning on talking more about that, but I realize we've already gone quite long as it is. The one thing I'll say is that after doing this two-step search and then running a couple of sample simulations on the top candidates, so far, for me at least, it's looking like "crane" is the best opener. Who would have guessed?

Also, if you use the true Wordle list to determine your space of possibilities, then the uncertainty you start with is a little over 11 bits. And it turns out, just from a brute-force search, the maximum possible expected information after the first two guesses is around 10 bits, which suggests that, best-case scenario, after your first two guesses with perfectly optimal play, you'll be left with around one bit of uncertainty — the same as being down to two possible answers.
But I think it's fair, and probably pretty conservative, to say that you could never write an algorithm whose average gets as low as three, because with the words available to you, there's simply not room to get enough information after only two steps to guarantee the answer on the third guess every single time without fail.