[00:00] The game Wordle has gone pretty viral in the last month or two, [00:02] and never one to overlook an opportunity for a math lesson, [00:05] it occurs to me that this game makes for a very good central example [00:08] in a lesson about information theory, and in particular a topic known as entropy. [00:13] You see, like a lot of people I got kind of sucked into the puzzle, [00:16] and like a lot of programmers I also got sucked into trying to [00:19] write an algorithm that would play the game as optimally as it could. [00:23] And what I thought I'd do here is just talk through with you some of my process in that, [00:26] and explain some of the math that went into it, [00:28] since the whole algorithm centers on this idea of entropy. [00:38] First things first, in case you haven't heard of it, what is Wordle? [00:42] And to kill two birds with one stone here while we go through the rules of the game, [00:45] let me also preview where we're going with this, [00:47] which is to develop a little algorithm that will basically play the game for us. [00:51] Though I haven't done today's Wordle, this is February 4th, [00:53] and we'll see how the bot does. [00:55] The goal of Wordle is to guess a mystery five letter word, [00:58] and you're given six different chances to guess. [01:00] For example, my Wordle bot suggests that I start with the guess "crane". [01:05] Each time that you make a guess, you get some information [01:07] about how close your guess is to the true answer. [01:10] Here the gray box is telling me there's no C in the actual answer. [01:14] The yellow box is telling me there is an R, but it's not in that position. [01:18] The green box is telling me that the secret word does have an A, [01:20] and it's in the third position. [01:22] And then there's no N and there's no E. [01:25] So let me just go in and tell the Wordle bot that information. [01:27] We started with crane, we got gray, yellow, green, gray, gray. [01:31] Don't worry about all the data that it's showing right now, I'll explain that in due time. [01:35] But its top suggestion for our second pick is "shtik". [01:39] And your guess does have to be an actual five letter word, but as you'll see, [01:42] it's pretty liberal with what it will actually let you guess. [01:46] In this case, we try "shtik". [01:48] And alright, things are looking pretty good. [01:50] We hit the S and the H, so we know the first three letters, we know that there's an R. [01:53] And so it's going to be like S-H-A something R, or S-H-A R something. [01:59] And it looks like the Wordle bot knows that it's down to just two possibilities, [02:03] either shard or sharp. [02:05] That's kind of a tossup between them at this point, [02:07] so I guess probably just because it's alphabetical it goes with shard. [02:11] Which hooray, is the actual answer, so we got it in three. [02:14] If you're wondering if that's any good, the way I heard one person [02:17] phrase it is that with Wordle, four is par and three is birdie. [02:20] Which I think is a pretty apt analogy. [02:22] You have to be consistently on your game to be getting four, but it's certainly not crazy. [02:27] But when you get it in three, it just feels great. [02:30] So if you're down for it, what I'd like to do here is just talk through [02:33] my thought process from the beginning for how I approached the Wordle bot. [02:36] And like I said, really it's an excuse for an information theory lesson. [02:39] The main goal is to explain what is information and what is entropy.
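Since the bot will lean on it constantly, it helps to pin down that feedback rule in code. Below is a minimal sketch of a function that, given a guess and the true answer, produces the gray/yellow/green pattern just described; the name wordle_pattern and the 'G'/'Y'/'-' encoding are my own choices for illustration, not anything from the official game.

```python
from collections import Counter

def wordle_pattern(guess: str, answer: str) -> tuple:
    """Feedback for one guess: 'G' = green, 'Y' = yellow, '-' = gray."""
    pattern = ['-'] * 5
    unmatched = Counter()
    # First pass: mark greens, and tally answer letters not matched in place.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            pattern[i] = 'G'
        else:
            unmatched[a] += 1
    # Second pass: mark yellows, each consuming one unmatched answer letter,
    # so repeated letters in the guess aren't over-counted.
    for i, g in enumerate(guess):
        if pattern[i] != 'G' and unmatched[g] > 0:
            pattern[i] = 'Y'
            unmatched[g] -= 1
    return tuple(pattern)

# Matches the opening example: crane against shard gives gray, yellow, green, gray, gray.
print(wordle_pattern("crane", "shard"))  # ('-', 'Y', 'G', '-', '-')
```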
[02:48] My first thought in approaching this was to take a look at the [02:50] relative frequencies of different letters in the English language. [02:54] So I thought, okay, is there an opening guess or an opening [02:56] pair of guesses that hits a lot of these most frequent letters? [02:59] And one that I was pretty fond of was doing "other" followed by "nails". [03:03] The thought is that if you hit a letter, you know, you get a green or a yellow, [03:06] that always feels good, it feels like you're getting information. [03:09] But in these cases, even if you don't hit and you always get grays, [03:12] that's still giving you a lot of information, [03:14] since it's pretty rare to find a word that doesn't have any of these letters. [03:18] But even still, that doesn't feel super systematic, because for example, [03:21] it does nothing to consider the order of the letters. [03:23] Why type "nails" when I could type "snail"? [03:26] Is it better to have that S at the end? [03:27] I'm not really sure. [03:29] Now, a friend of mine said that he liked to open with the word "weary", [03:32] which kind of surprised me because it has some uncommon letters in there like the [03:35] W and the Y. [03:37] But who knows, maybe that is a better opener. [03:39] Is there some kind of quantitative score that we [03:41] can give to judge the quality of a potential guess? [03:45] Now to set up for the way that we're going to rank possible guesses, [03:48] let's go back and add a little clarity to how exactly the game is set up. [03:51] So there's a list of words that it will allow you to enter [03:54] as valid guesses, which is just about 13,000 words long. [03:58] But when you look at it, there's a lot of really uncommon things, [04:01] things like "aahed", "aalii", and "aargh", the kind of words [04:03] that bring about family arguments in a game of Scrabble. [04:06] But the vibe of the game is that the answer is always going to be a decently common word. [04:10] And in fact, there's another list of around 2300 words that are the possible answers. [04:15] And this is a human-curated list, I think specifically by the game creator's girlfriend, [04:20] which is kind of fun. [04:21] But what I would like to do, our challenge for this project, is to see if we can write [04:25] a program solving Wordle that doesn't incorporate previous knowledge about this list. [04:30] For one thing, there's plenty of pretty common five [04:32] letter words that you won't find in that list. [04:34] So it would be better to write a program that's a little more resilient and would [04:38] play Wordle against anyone, not just what happens to be the official website. [04:41] And also, the reason that we know what this list of possible answers is, [04:45] is because it's visible in the source code. [04:47] But the way that it's visible in the source code is in the [04:49] specific order in which answers come up from day to day, [04:52] so you could always just look up what tomorrow's answer will be. [04:56] So clearly, there's some sense in which using the list is cheating. [04:59] And what makes for a more interesting puzzle and a richer information theory lesson [05:03] is to instead use some more universal data like relative word frequencies in [05:06] general to capture this intuition of having a preference for more common words. [05:11] So of these 13,000 possibilities, how should we choose the opening guess? [05:16] For example, if my friend proposes "weary", how should we analyze its quality?
[05:20] Well, the reason he said he likes that unlikely W is that he likes [05:23] the long shot nature of just how good it feels if you do hit that W. [05:27] For example, if the first pattern revealed was something like this, [05:31] then it turns out there are only 58 words in this giant lexicon that match that pattern. [05:36] So that's a huge reduction from 13,000. [05:38] But the flip side of that, of course, is that [05:40] it's very uncommon to get a pattern like this. [05:43] Specifically, if each word was equally likely to be the answer, [05:46] the probability of hitting this pattern would be 58 divided by around 13,000. [05:51] Of course, they're not equally likely to be answers. [05:53] Most of these are very obscure and even questionable words. [05:56] But at least for our first pass at all of this, [05:58] let's assume that they're all equally likely and then refine that a bit later. [06:02] The point is, the pattern with a lot of information is, [06:04] by its very nature, unlikely to occur. [06:07] In fact, what it means to be informative is that it's unlikely. [06:11] A much more probable pattern to see with this opening would be something like this, [06:16] where, of course, there's not a W in it. [06:18] Maybe there's an E, and maybe there's no A, there's no R, there's no Y. [06:22] In this case, there are 1400 possible matches. [06:25] If all were equally likely, it works out to be a probability [06:28] of about 11% that this is the pattern you would see. [06:30] So the most likely outcomes are also the least informative. [06:34] To get a more global view here, let me show you the full distribution of [06:37] probabilities across all of the different patterns that you might see. [06:41] So each bar that you're looking at corresponds to a possible pattern of [06:45] colors that could be revealed, of which there are 3 to the 5th possibilities, [06:48] and they're organized from left to right, most common to least common. [06:52] So the most common possibility here is that you get all grays. [06:56] That happens about 14% of the time. [06:58] And what you're hoping for when you make a guess is that you end up [07:01] somewhere out in this long tail, like over here, [07:04] where there's only 18 possibilities for what matches this pattern, [07:07] that evidently look like this. [07:09] Or if we venture a little farther to the left, [07:11] you know, maybe we go all the way over here. [07:14] Okay, here's a good puzzle for you. [07:16] What are the three words in the English language that start with a W, [07:19] end with a Y, and have an R somewhere in them? [07:22] Turns out, the answers are, let's see, wordy, wormy, and wryly. [07:27] So to judge how good this word is overall, we want some kind of measure of the [07:31] expected amount of information that you're going to get from this distribution. [07:35] If we go through each pattern and we multiply its probability of occurring times [07:39] something that measures how informative it is, that can maybe give us an objective score. [07:45] Now your first instinct for what that something should be might be the number of matches. [07:50] You want a lower average number of matches. [07:52] But instead I'd like to use a more universal measurement that we often ascribe to [07:56] information, and one that will be more flexible once we have a different probability [08:00] assigned to each of these 13,000 words for whether or not they're actually the answer. 
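To make the "probability of a pattern" idea concrete, here's a small sketch, building on the wordle_pattern helper above, that tallies how often each feedback pattern would show up for a given guess if every word in the list were an equally likely answer. The function name is mine; the logic is just the counting described in the paragraph above.

```python
from collections import Counter

def pattern_distribution(guess, possible_answers):
    """Probability of each feedback pattern for `guess`, treating every word
    in `possible_answers` as an equally likely candidate answer."""
    counts = Counter(wordle_pattern(guess, answer) for answer in possible_answers)
    total = len(possible_answers)
    return {pattern: count / total for pattern, count in counts.items()}

# The probability of any one specific pattern is then just
# (number of words matching that pattern) / (size of the list),
# e.g. 58 / 13000 for the long-shot pattern discussed above.
```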
[08:10] The standard unit of information is the bit, which has a little bit of a funny formula, [08:14] but is really intuitive if we just look at examples. [08:17] If you have an observation that cuts your space of possibilities in half, [08:21] we say that it has one bit of information. [08:24] In our example, the space of possibilities is all possible words, [08:26] and it turns out about half of the five letter words have an S, a little less than that, [08:30] but about half. [08:31] So that observation would give you one bit of information. [08:34] If instead a new fact chops down that space of possibilities by a factor of four, [08:39] we say that it has two bits of information. [08:41] For example, it turns out about a quarter of these words have a T. [08:45] If the observation cuts that space by a factor of eight, [08:47] we say it's three bits of information, and so on and so forth. [08:50] Four bits cuts it into a sixteenth, five bits cuts it into a thirty-second. [08:54] So now's when you might want to take a moment and pause and ask yourself, [08:58] what is the formula for information, the number of bits, [09:00] in terms of the probability of an occurrence? [09:03] Well, what we're saying here is basically that when you take one half to the number of [09:07] bits, that's the same thing as the probability, [09:09] which is the same thing as saying two to the power of the number of bits is one over [09:13] the probability, which rearranges further to saying the information is the log base [09:17] two of one divided by the probability. [09:19] And sometimes you see this with one more rearrangement still where [09:22] the information is the negative log base two of the probability. [09:25] Expressed like this, it can look a little bit weird to the uninitiated, [09:28] but it really is just the very intuitive idea of asking how [09:31] many times you've cut down your possibilities in half. [09:35] Now if you're wondering, you know, I thought we were just playing a fun word game, [09:37] why are logarithms entering the picture? [09:39] One reason this is a nicer unit is it's just a lot easier to talk about very [09:43] unlikely events, much easier to say that an observation has 20 bits of information [09:48] than it is to say that the probability of such and such occurring is 0.00000095. [09:53] But a more substantive reason that this logarithmic expression turned out to be a very [09:57] useful addition to the theory of probability is the way that information adds together. [10:02] For example, if one observation gives you two bits of information, [10:05] cutting your space down by four, and then a second observation like your second [10:08] guess in Wordle gives you another three bits of information, [10:11] chopping you down further by another factor of eight, [10:14] the two together give you five bits of information. [10:17] In the same way that probabilities like to multiply, information likes to add. [10:21] So as soon as we're in the realm of something like an expected value, [10:24] where we're adding a bunch of numbers up, the logs make it a lot nicer to deal with. [10:28] Let's go back to our distribution for weary and add another little tracker on here, [10:32] showing us how much information there is for each pattern. [10:35] The main thing I want you to notice is that the higher the probability as we get [10:39] to those more likely patterns, the lower the information, the fewer bits you gain.
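In symbols, what those last few paragraphs are saying is

$$\Big(\tfrac{1}{2}\Big)^{I} = p \quad\Longleftrightarrow\quad 2^{I} = \frac{1}{p} \quad\Longleftrightarrow\quad I = \log_2\frac{1}{p} = -\log_2 p,$$

and the way information adds comes straight from the way probabilities multiply:

$$I(p_1 \cdot p_2) = \log_2\frac{1}{p_1 p_2} = \log_2\frac{1}{p_1} + \log_2\frac{1}{p_2} = I(p_1) + I(p_2).$$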
[10:43] The way we measure the quality of this guess will [10:45] be to take the expected value of this information. [10:48] When we go through each pattern, we say how probable is it and [10:51] then we multiply that by how many bits of information do we get. [10:54] And in the example of weary, that turns out to be 4.9 bits. [10:58] So on average, the information you get from this opening guess is as [11:01] good as chopping your space of possibilities in half about five times. [11:05] By contrast, an example of a guess with a higher expected [11:09] information value would be something like slate. [11:13] In this case, you'll notice the distribution looks a lot flatter. [11:15] In particular, the most probable occurrence of all grays only has about a 6% chance [11:20] of occurring, so at minimum you're getting evidently 3.9 bits of information. [11:25] But that's a minimum, more typically you'd get something better than that. [11:29] And it turns out when you crunch the numbers on this one and add [11:32] up all the relevant terms, the average information is about 5.8. [11:37] So in contrast with weary, your space of possibilities will [11:40] be about half as big after this first guess, on average. [11:44] There's actually a fun story about the name for [11:46] this expected value of information quantity. [11:48] You see, information theory was developed by Claude Shannon, [11:51] who was working at Bell Labs in the 1940s, but he was talking about some of his [11:54] yet-to-be-published ideas with John von Neumann, [11:57] who was this intellectual giant of the time, very prominent in math and physics [12:00] and the beginnings of what was becoming computer science. [12:04] And when he mentioned that he didn't really have a good name for this [12:07] expected value of information quantity, von Neumann supposedly said, [12:10] so the story goes, well, you should call it entropy, and for two reasons. [12:14] In the first place, your uncertainty function has been used in statistical mechanics [12:18] under that name, so it already has a name, and in the second place, and more important, [12:22] nobody knows what entropy really is, so in a debate you'll always have the advantage. [12:27] So if the name seems a little bit mysterious, and if this story is to be believed, [12:31] that's kind of by design. [12:33] Also if you're wondering about its relation to all of that second law of thermodynamics [12:37] stuff from physics, there definitely is a connection, [12:39] but in its origins Shannon was just dealing with pure probability theory, [12:43] and for our purposes here, when I use the word entropy, [12:45] I just want you to think the expected information value of a particular guess. [12:50] You can think of entropy as measuring two things simultaneously. [12:54] The first one is how flat is the distribution? [12:57] The closer a distribution is to uniform, the higher that entropy will be. [13:01] In our case, where there are 3 to the 5th total patterns, for a uniform distribution, [13:06] observing any one of them would have information log base 2 of 3 to the 5th, [13:11] which happens to be 7.92, so that is the absolute maximum that you could possibly have [13:16] for this entropy. [13:17] But entropy is also kind of a measure of how many [13:20] possibilities there are in the first place. [13:22] For example, if you happen to have some word where there's only 16 possible patterns, [13:27] and each one is equally likely, this entropy, this expected information, would be 4 bits. 
[13:32] But if you have another word where there's 64 possible patterns that could come up, [13:36] and they're all equally likely, then the entropy would work out to be 6 bits. [13:41] So if you see some distribution out in the wild that has an entropy of 6 bits, [13:45] it's sort of like it's saying there's as much variation and uncertainty [13:49] in what's about to happen as if there were 64 equally likely outcomes. [13:54] For my first pass at the Wordle bot, I basically had it just do this. [13:57] It goes through all of the different possible guesses that you could have, [14:01] all 13,000 words, it computes the entropy for each one, or more specifically, [14:05] the entropy of the distribution across all patterns that you might see for each one, [14:09] and then it picks the highest, since that's the one that's likely to chop [14:13] down your space of possibilities as much as possible. [14:17] And even though I've only been talking about the first guess here, [14:19] it does the same thing for the next few guesses. [14:21] For example, after you see some pattern on that first guess, [14:24] which would restrict you to a smaller number of possible words based on what [14:27] matches with that, you just play the same game with respect to that smaller set of words. [14:32] For a proposed second guess, you look at the distribution of all patterns [14:36] that could occur from that more restricted set of words, [14:38] you search through all 13,000 possibilities, and you find the one that [14:42] maximizes that entropy. [14:45] To show you how this works in action, let me just pull up a little variant of [14:48] Wordle that I wrote that shows the highlights of this analysis in the margins. [14:53] So after doing all its entropy calculations, on the right here [14:56] it's showing us which ones have the highest expected information. [15:00] Turns out the top answer, at least at the moment, we'll refine this later, [15:05] is "tares", which means, um, of course, a vetch, the most common vetch. [15:11] Each time we make a guess here, where maybe I kind of ignore its [15:13] recommendations and go with slate, because I like slate, [15:16] we can see how much expected information it had, [15:18] but then on the right of the word here it's showing us how much actual [15:22] information we got given this particular pattern. [15:25] So here it looks like we were a little unlucky, we expected to get 5.8, [15:27] but we happened to get something with less than that. [15:30] And then on the left side here it's showing us all of [15:32] the different possible words given where we are now. [15:35] The blue bars are telling us how likely it thinks each word is, [15:38] so at the moment it's assuming each word is equally likely to occur, [15:41] but we'll refine that in a moment. [15:44] And then this uncertainty measurement is telling us the entropy of this distribution [15:48] across the possible words, which right now, because it's a uniform distribution, [15:52] is just a needlessly complicated way to count the number of possibilities. [15:56] For example, if we were to take 2 to the power of 13.66, [15:59] that should be around the 13,000 possibilities. [16:02] Um, a little bit off here, but only because I'm not showing all the decimal places. [16:06] At the moment that might feel redundant and like it's overly complicating things, [16:09] but you'll see why it's useful to have both numbers in a minute.
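Putting those pieces together, the first-pass strategy described above amounts to something like the following sketch, reusing the wordle_pattern and pattern_distribution helpers from earlier. The function names are mine, not necessarily what the actual bot uses.

```python
import math

def entropy_of_guess(guess, possible_answers):
    """Expected information of a guess, in bits, assuming every remaining
    answer is equally likely: the sum over patterns of p * log2(1/p)."""
    distribution = pattern_distribution(guess, possible_answers)
    return sum(p * math.log2(1 / p) for p in distribution.values())

def best_guess(allowed_guesses, possible_answers):
    """Version 1 of the bot: pick the allowed word whose pattern
    distribution has the highest entropy."""
    return max(allowed_guesses, key=lambda g: entropy_of_guess(g, possible_answers))

def filter_answers(possible_answers, guess, observed_pattern):
    """After seeing the feedback, keep only the words consistent with it,
    then play the same game on that smaller set."""
    return [w for w in possible_answers
            if wordle_pattern(guess, w) == observed_pattern]
```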
[16:12] So here it looks like it's suggesting the highest entropy for our second guess is Raman, [16:16] which again just really doesn't feel like a word. [16:19] So to take the moral high ground here I'm going to go ahead and type in Rains. [16:25] And again it looks like we were a little unlucky. [16:27] We were expecting 4.3 bits and we only got 3.39 bits of information. [16:31] So that takes us down to 55 possibilities. [16:34] And here maybe I'll just actually go with what it's suggesting, [16:37] which is "combo", whatever that means. [16:40] And, okay, this is actually a good chance for a puzzle. [16:42] It's telling us this pattern gives us 4.7 bits of information. [16:47] But over on the left, before we see that pattern, there were 5.78 bits of uncertainty. [16:52] So as a quiz for you, what does that mean about the number of remaining possibilities? [16:58] Well it means that we're reduced down to 1 bit of uncertainty, [17:01] which is the same thing as saying that there's 2 possible answers. [17:04] It's a 50-50 choice. [17:06] And from here, because you and I know which words are more common, [17:09] we know that the answer should be abyss. [17:11] But as it's written right now, the program doesn't know that. [17:13] So it just keeps going, trying to gain as much information as it can, [17:16] until there's only one possibility left, and then it guesses it. [17:20] So obviously we need a better endgame strategy, [17:22] but let's say we call this version 1 of our Wordle solver, [17:25] and then we go and run some simulations to see how it does. [17:30] So the way this is working is it's playing every possible Wordle game. [17:34] It's going through all of those 2315 words that are the actual Wordle answers. [17:38] It's basically using that as a testing set. [17:41] And it's using this naive method of not considering how common a word is, [17:44] just trying to maximize the information at each step along the way, [17:47] until it gets down to one and only one choice. [17:50] By the end of the simulation, the average score works out to be about 4.124. [17:55] Which is not bad, to be honest, I kind of expected it to do worse. [17:59] But the people who play Wordle will tell you that they can usually get it in 4. [18:02] The real challenge is to get as many in 3 as you can. [18:05] It's a pretty big jump between the score of 4 and the score of 3. [18:08] The obvious low-hanging fruit here is to somehow incorporate [18:11] whether or not a word is common. So how exactly do we do that? [18:22] The way I approached it is to get a list of the relative frequencies for all of [18:26] the words in the English language, and I just used Mathematica's word frequency [18:30] data function, which itself pulls from the Google Books English Ngram public dataset. [18:35] And it's kind of fun to look at, for example if we sort [18:37] it from the most common words to the least common words. [18:40] Evidently these are the most common 5-letter words in the English language. [18:43] Or rather, "these" is the 8th most common. [18:46] First is "which", after which there's "there" and "their". [18:49] "First" itself is not first, but 9th, and it makes sense that these [18:52] other words could come about more often, with those after "first" being "after", [18:55] "where", and "those", each just a little bit less common.
[18:59] Now, in using this data to model how likely each of these words is to [19:03] be the final answer, it shouldn't just be proportional to the frequency, [19:07] because for example "which" is given a score of 0.002 in this dataset, [19:11] whereas the word "braid" is in some sense about 1000 times less likely. [19:15] But both of these are common enough words that they're almost [19:18] certainly worth considering, so we want more of a binary cutoff. [19:21] The way I went about it is to imagine taking this whole sorted list of words, [19:25] and then arranging it on an x-axis, and then applying the sigmoid function, [19:29] which is the standard way to have a function whose output is basically binary, [19:33] it's either 0 or it's 1, but there's a smoothing in between for that region of [19:37] uncertainty. [19:39] So essentially, the probability that I'm assigning to each word of being in the final [19:43] list will be the value of the sigmoid function above wherever it sits on the x-axis. [19:49] Now obviously this depends on a few parameters, [19:52] for example how wide a space on the x-axis those words fill determines how [19:56] gradually or steeply we drop off from 1 to 0, [19:58] and where we situate them left to right determines the cutoff. [20:02] And to be honest the way I did this was kind of just [20:05] licking my finger and sticking it into the wind. [20:07] I looked through the sorted list and tried to find a window where [20:10] when I looked at it I figured about half of these words are more [20:13] likely than not to be the final answer, and used that as the cutoff. [20:17] Now once we have a distribution like this across the words, [20:19] it gives us another situation where entropy becomes this really useful measurement. [20:24] For example, let's say we were playing a game and we start with my old openers, [20:28] which were "other" and "nails", and we end up with a situation [20:30] where there's four possible words that match it. [20:33] And let's say we consider them all equally likely, [20:36] let me ask you, what is the entropy of this distribution? [20:41] Well, the information associated with each one of these possibilities is [20:45] going to be the log base 2 of 4, since each one is 1 in 4, and that's 2. [20:50] 2 bits of information, 4 possibilities. [20:52] All very well and good. [20:54] But what if I told you that actually there's more than 4 matches? [20:58] In reality, when we look through the full word list, there are 16 words that match it. [21:02] But suppose our model puts a really low probability on those other 12 words of actually [21:06] being the final answer, something like 1 in 1000 because they're really obscure. [21:11] Now let me ask you, what is the entropy of this distribution? [21:15] If entropy was purely measuring the number of matches here, [21:18] then you might expect it to be something like the log base 2 of 16, [21:22] which would be 4, two more bits of uncertainty than we had before. [21:26] But of course the actual uncertainty is not really that different from what we had before. [21:30] Just because there's these 12 really obscure words doesn't mean that it would be [21:33] all that more surprising to learn that the final answer is "charm", for example. [21:38] So when you actually do the calculation here and you add up the probability of [21:42] each occurrence times the corresponding information, what you get is 2.11 bits.
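Both halves of this are easy to sketch in code: a sigmoid-shaped prior over the frequency-sorted word list, and the entropy of a non-uniform distribution. The function names, the normalization step, and the particular center/width parameters below are my own stand-ins rather than the exact choices in the real bot; the last few lines just check the 2.11-bit figure from the example above.

```python
import math

def sigmoid(x: float) -> float:
    """Standard logistic function, squashing the real line into (0, 1)."""
    return 1 / (1 + math.exp(-x))

def answer_prior(words_by_frequency, center_rank=3000, width=500):
    """Assign each word a probability of being the answer.

    words_by_frequency is sorted from most to least common.  The sigmoid gives
    common words a weight near 1 and obscure words a weight near 0, with a
    smooth drop-off around center_rank; the weights are then normalized.  The
    center and width here are made-up placeholders, not the video's values.
    """
    weights = [sigmoid((center_rank - rank) / width)
               for rank, _word in enumerate(words_by_frequency)]
    total = sum(weights)
    return {w: wt / total for w, wt in zip(words_by_frequency, weights)}

def entropy(probabilities) -> float:
    """Expected information in bits: the sum of p * log2(1/p)."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

# The 4-common-plus-12-obscure example: each obscure word gets probability
# 1/1000, the rest is split evenly among the four plausible ones.
p_obscure = 0.001
p_common = (1 - 12 * p_obscure) / 4
print(entropy([p_common] * 4 + [p_obscure] * 12))  # about 2.11 bits
```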
[21:46] Just saying, it's basically two bits, basically those four possibilities, [21:49] but there's a little more uncertainty because of all of those highly unlikely events, [21:53] though if you did learn them you'd get a ton of information from it. [21:57] So zooming out, this is part of what makes Wordle [21:59] such a nice example for an information theory lesson. [22:01] We have these two distinct-feeling applications for entropy. [22:05] The first one telling us what's the expected information we'll get from a given guess, [22:09] and the second one measuring the remaining uncertainty [22:13] among all of the words that are still possible. [22:16] And I should emphasize, in that first case where we're looking [22:19] at the expected information of a guess, once we have an unequal weighting to the words, [22:22] that affects the entropy calculation. [22:24] For example, let me pull up that same case we were looking at [22:27] earlier of the distribution associated with weary, [22:30] but this time using a non-uniform distribution across all possible words. [22:34] So let me see if I can find a part here that illustrates it pretty well. [22:40] Okay, here, this is pretty good. [22:42] Here we have two adjacent patterns that are about equally likely, [22:45] but one of them we're told has 32 possible words that match it. [22:49] And if we check what they are, these are those 32, [22:51] which are all just very unlikely words as you scan your eyes over them. [22:55] It's hard to find any that feel like plausible answers, maybe "yells", [22:59] but if we look at the neighboring pattern in the distribution, [23:02] which is considered just about as likely, we're told that it only has 8 possible matches. [23:06] So a quarter as many matches, but it's about as likely. [23:09] And when we pull up those matches, we can see why. [23:12] Some of these are actual plausible answers like "wring" or "wrath" or "wraps". [23:17] To illustrate how we incorporate all that, let [23:20] me pull up version two of the Wordle bot here. [23:22] And there are two or three main differences from the first one that we saw. [23:25] First off, like I just said, the way that we're computing these entropies, [23:29] these expected values of information, is now using the more refined distributions across [23:33] the patterns, which incorporate the probability that a given word would actually be the [23:37] answer. [23:38] As it happens, "tares" is still number one, though the ones following are a bit different. [23:44] Second, when it ranks its top picks, it's now going to keep a model of the [23:47] probability that each word is the actual answer, [23:50] and it'll incorporate that into its decision, [23:52] which is easier to see once we have a few guesses on the table. [23:55] Again, ignoring its recommendation because we can't let machines rule our lives. [24:01] And I suppose I should mention another thing different here is over on the left, [24:04] that uncertainty value, that number of bits, is no longer just [24:07] redundant with the number of possible matches. [24:10] Now if we pull it up and calculate 2 to the 8.02, which would be a little above 256, [24:15] I guess 259, what it's saying is even though there are 526 total words that [24:20] actually match this pattern, the amount of uncertainty it has is more akin [24:24] to what it would be if there were 259 equally likely outcomes. [24:29] You can think of it like this. [24:31] It knows borks is not the answer, same with yorts and zorl and zorus.
[24:34] So it's a little less uncertain than it was in the previous case. [24:37] This number of bits will be smaller. [24:40] And if I keep playing the game, I'm refining this down with a couple [24:43] guesses that are apropos of what I would like to explain here. [24:48] By the fourth guess, if you look over at its top picks, [24:51] you can see it's no longer just maximizing the entropy. [24:54] So at this point, there's technically seven possibilities, [24:57] but the only ones with a meaningful chance are "dorms" and "words". [25:00] And you can see it ranks choosing both of those above all of these [25:03] other values that strictly speaking would give more information. [25:07] The very first time I did this, I just added up these two numbers to measure [25:10] the quality of each guess, which actually worked better than you might suspect. [25:14] But it really didn't feel systematic. [25:16] And I'm sure there's other approaches people could take. [25:17] But here's the one I landed on. [25:19] If we're considering the prospect of a next guess, like in this case "words", [25:23] what we really care about is the expected score of our game if we do that. [25:28] And to calculate that expected score, we say what's the probability that [25:32] "words" is the actual answer, which at the moment it ascribes 58% to. [25:36] We say with a 58% chance, our score in this game would be four. [25:40] And then with the probability of one minus that 58%, [25:43] our score will be more than that four. [25:46] How much more we don't know, but we can estimate it based on how [25:49] much uncertainty there's likely to be once we get to that point. [25:52] Specifically, at the moment, there's 1.44 bits of uncertainty. [25:56] If we guess "words", it's telling us the expected information we'll get is 1.27 bits. [26:01] So if we guess "words", this difference represents how much [26:04] uncertainty we're likely to be left with after that happens. [26:08] What we need is some kind of function, which I'm calling f here, [26:11] that associates this uncertainty with an expected score. [26:14] And the way I went about this was to just plot a bunch of the data from previous [26:18] games based on version one of the bot to say, hey, [26:21] what was the actual score after various points with certain very measurable [26:25] amounts of uncertainty? [26:27] For example, these data points here that are sitting above a value that's [26:30] around like 8.7 or so are saying for some games, [26:33] after a point at which there were 8.7 bits of uncertainty, [26:36] it took two guesses to get the final answer. [26:39] For other games, it took three guesses. [26:40] For other games, it took four guesses. [26:43] If we shift over to the left here, all the points over zero are saying whenever [26:46] there's zero bits of uncertainty, which is to say there's only one possibility, [26:50] then the number of guesses required is always just one, which is reassuring. [26:54] Whenever there was one bit of uncertainty, meaning it was essentially [26:58] just down to two possibilities, then sometimes it required one more guess, [27:01] sometimes it required two more guesses, and so on and so forth here. [27:05] Maybe a slightly easier way to visualize this [27:07] data is to bucket it together and take averages. [27:11] For example, this bar here is saying among all the points where we had one bit [27:15] of uncertainty, on average the number of new guesses required was about 1.5.
[27:22] And the bar over here is saying among all of the different games where [27:25] at some point the uncertainty was a little above four bits, [27:28] which is like narrowing it down to 16 different possibilities, [27:31] then on average it requires a little more than two guesses from that point forward. [27:36] And from here I just did a regression to fit a function that seemed reasonable to this. [27:39] And remember, the whole point of doing any of that is so that we can quantify this [27:44] intuition that the more information we gain from a word, [27:47] the lower the expected score will be. [27:49] So, with this as version 2.0, if we go back and run the same set of simulations, [27:54] having it play against all 2315 possible Wordle answers, how does it do? [28:00] Well in contrast to our first version, it's definitely better, which is reassuring. [28:04] All said and done, the average is around 3.6. [28:06] Although unlike the first version, there are a couple times that it loses, [28:09] requiring more than six guesses in those cases. [28:12] Presumably that's because there are times when it's making that tradeoff [28:15] to actually go for the goal rather than maximizing information. [28:19] So can we do better than 3.6? [28:22] We definitely can. [28:23] Now, I said at the start that it's most fun to try not incorporating [28:26] the true list of Wordle answers into the way that it builds its model. [28:29] But if we do incorporate it, the best performance I could get was around 3.43. [28:35] So if we try to get more sophisticated than just using word frequency data [28:38] to choose this prior distribution, this 3.43 probably puts a cap on how [28:42] good we could get with that, or at least how good I could get with that. [28:46] That best performance essentially just uses the ideas that I've been talking about here, [28:49] but it goes a little farther, like it does a search for the expected [28:52] information two steps forward rather than just one. [28:55] Originally I was planning on talking more about that, [28:57] but I realize we've actually gone quite long as it is. [29:00] The one thing I'll say is after doing this two-step search and [29:03] then running a couple sample simulations on the top candidates, [29:06] so far for me at least, it's looking like "crane" is the best opener. [29:09] Who would have guessed? [29:10] Also if you use the true Wordle list to determine your space of possibilities, [29:14] then the uncertainty you start with is a little over 11 bits. [29:18] And it turns out just from a brute-force search, [29:20] the maximum possible expected information after the first two guesses is around 10 bits. [29:26] Which suggests that best case scenario, after your first two guesses, [29:30] with perfectly optimal play, you'll be left with around one bit of uncertainty. [29:34] Which is the same as being down to two possible guesses. [29:37] But I think it's fair and probably pretty conservative to say that you could never [29:41] possibly write an algorithm that gets this average as low as three, [29:44] because with the words available to you, there's simply not room to get enough [29:48] information after only two steps to be able to guarantee the answer in the third slot [29:51] every single time without fail.
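To tie the version-2 pieces together, here is a rough sketch of the expected-score rule described a few paragraphs back, the one that trades off the chance that a guess is simply correct against the uncertainty it would leave behind. The helper names are stand-ins, and guesses_to_finish plays the role of the fitted function f; the exact bookkeeping in the real bot may differ.

```python
def expected_score(guess, n_guesses_so_far, prior, current_entropy,
                   expected_information, guesses_to_finish):
    """Rank a candidate guess by the expected final score of the game.

    prior[guess]            -- probability this guess is the actual answer
    current_entropy         -- bits of uncertainty before making the guess
    expected_information    -- E[bits gained] from this guess (its entropy)
    guesses_to_finish(bits) -- fitted function f: remaining uncertainty ->
                               expected number of further guesses needed
    """
    p_correct = prior.get(guess, 0.0)
    this_turn = n_guesses_so_far + 1
    # If the guess is right, the game ends on this turn.  Otherwise we expect
    # to be left with roughly (current_entropy - expected_information) bits,
    # and f estimates how many more guesses that tends to cost on average.
    remaining_bits = max(current_entropy - expected_information, 0.0)
    return (p_correct * this_turn
            + (1 - p_correct) * (this_turn + guesses_to_finish(remaining_bits)))
```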