[00:00] The game Wordle has gone pretty viral in the last month or two, [00:02] and never one to overlook an opportunity for a math lesson, [00:05] it occurs to me that this game makes for a very good central example [00:08] in a lesson about information theory, and in particular a topic known as entropy. [00:13] You see, like a lot of people I got kind of sucked into the puzzle, [00:16] and like a lot of programmers I also got sucked into trying to [00:19] write an algorithm that would play the game as optimally as it could. [00:23] And what I thought I'd do here is just talk through with you some of my process in that, [00:26] and explain some of the math that went into it, [00:28] since the whole algorithm centers on this idea of entropy. [00:38] First things first, in case you haven't heard of it, what is Wordle? [00:42] And to kill two birds with one stone here while we go through the rules of the game, [00:45] let me also preview where we're going with this, [00:47] which is to develop a little algorithm that will basically play the game for us. [00:51] Though I haven't done today's Wordle, this is February 4th, [00:53] and we'll see how the bot does. [00:55] The goal of Wordle is to guess a mystery five letter word, [00:58] and you're given six different chances to guess. [01:00] For example, my Wordle bot suggests that I start with the guess "crane". [01:05] Each time that you make a guess, you get some information [01:07] about how close your guess is to the true answer. [01:10] Here the gray box is telling me there's no C in the actual answer. [01:14] The yellow box is telling me there is an R, but it's not in that position. [01:18] The green box is telling me that the secret word does have an A, [01:20] and it's in the third position. [01:22] And then there's no N and there's no E. [01:25] So let me just go in and tell the Wordle bot that information. [01:27] We started with crane, we got gray, yellow, green, gray, gray. [01:31] Don't worry about all the data that it's showing right now, I'll explain that in due time. [01:35] But its top suggestion for our second pick is "shtik". [01:39] And your guess does have to be an actual five letter word, but as you'll see, [01:42] it's pretty liberal with what it will actually let you guess. [01:46] In this case, we try "shtik". [01:48] And alright, things are looking pretty good. [01:50] We hit the S and the H, so we know the first three letters, we know that there's an R. [01:53] And so it's going to be like S-H-A something R, or S-H-A R something. [01:59] And it looks like the Wordle bot knows that it's down to just two possibilities, [02:03] either shard or sharp. [02:05] That's kind of a tossup between them at this point, [02:07] so I guess probably just because it's alphabetical it goes with shard. [02:11] Which hooray, is the actual answer, so we got it in three. [02:14] If you're wondering if that's any good, the way I heard one person [02:17] phrase it is that with Wordle, four is par and three is birdie. [02:20] Which I think is a pretty apt analogy. [02:22] You have to be consistently on your game to be getting four, but it's certainly not crazy. [02:27] But when you get it in three, it just feels great. [02:30] So if you're down for it, what I'd like to do here is just talk through [02:33] my thought process from the beginning for how I approached the Wordle bot. [02:36] And like I said, really it's an excuse for an information theory lesson. [02:39] The main goal is to explain what is information and what is entropy.
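Since the bot will lean on it constantly, it helps to pin down that feedback rule in code. Below is a minimal sketch of a function that, given a guess and the true answer, produces the gray/yellow/green pattern just described; the name wordle_pattern and the 'G'/'Y'/'-' encoding are my own choices for illustration, not anything from the official game.

```python
from collections import Counter

def wordle_pattern(guess: str, answer: str) -> tuple:
    """Feedback for one guess: 'G' = green, 'Y' = yellow, '-' = gray."""
    pattern = ['-'] * 5
    unmatched = Counter()
    # First pass: mark greens, and tally answer letters not matched in place.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            pattern[i] = 'G'
        else:
            unmatched[a] += 1
    # Second pass: mark yellows, each consuming one unmatched answer letter,
    # so repeated letters in the guess aren't over-counted.
    for i, g in enumerate(guess):
        if pattern[i] != 'G' and unmatched[g] > 0:
            pattern[i] = 'Y'
            unmatched[g] -= 1
    return tuple(pattern)

# Matches the opening example: crane against shard gives gray, yellow, green, gray, gray.
print(wordle_pattern("crane", "shard"))  # ('-', 'Y', 'G', '-', '-')
```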
[02:48] My first thought in approaching this was to take a look at the [02:50] relative frequencies of different letters in the English language. [02:54] So I thought, okay, is there an opening guess or an opening [02:56] pair of guesses that hits a lot of these most frequent letters? [02:59] And one that I was pretty fond of was doing "other" followed by "nails". [03:03] The thought is that if you hit a letter, you know, you get a green or a yellow, [03:06] that always feels good, it feels like you're getting information. [03:09] But in these cases, even if you don't hit and you always get grays, [03:12] that's still giving you a lot of information, [03:14] since it's pretty rare to find a word that doesn't have any of these letters. [03:18] But even still, that doesn't feel super systematic, because for example, [03:21] it does nothing to consider the order of the letters. [03:23] Why type "nails" when I could type "snail"? [03:26] Is it better to have that S at the end? [03:27] I'm not really sure. [03:29] Now, a friend of mine said that he liked to open with the word "weary", [03:32] which kind of surprised me because it has some uncommon letters in there like the [03:35] W and the Y. [03:37] But who knows, maybe that is a better opener. [03:39] Is there some kind of quantitative score that we [03:41] can give to judge the quality of a potential guess? [03:45] Now to set up for the way that we're going to rank possible guesses, [03:48] let's go back and add a little clarity to how exactly the game is set up. [03:51] So there's a list of words that it will allow you to enter [03:54] as valid guesses, which is just about 13,000 words long. [03:58] But when you look at it, there's a lot of really uncommon things, [04:01] things like "aahed", "aalii", and "aargh", the kind of words [04:03] that bring about family arguments in a game of Scrabble. [04:06] But the vibe of the game is that the answer is always going to be a decently common word. [04:10] And in fact, there's another list of around 2300 words that are the possible answers. [04:15] And this is a human-curated list, I think specifically by the game creator's girlfriend, [04:20] which is kind of fun. [04:21] But what I would like to do, our challenge for this project, is to see if we can write [04:25] a program solving Wordle that doesn't incorporate previous knowledge about this list. [04:30] For one thing, there's plenty of pretty common five [04:32] letter words that you won't find in that list. [04:34] So it would be better to write a program that's a little more resilient and would [04:38] play Wordle against anyone, not just what happens to be the official website. [04:41] And also, the reason that we know what this list of possible answers is, [04:45] is because it's visible in the source code. [04:47] But the way that it's visible in the source code is in the [04:49] specific order in which answers come up from day to day, [04:52] so you could always just look up what tomorrow's answer will be. [04:56] So clearly, there's some sense in which using the list is cheating. [04:59] And what makes for a more interesting puzzle and a richer information theory lesson [05:03] is to instead use some more universal data like relative word frequencies in [05:06] general to capture this intuition of having a preference for more common words. [05:11] So of these 13,000 possibilities, how should we choose the opening guess? [05:16] For example, if my friend proposes "weary", how should we analyze its quality?
[05:20] Well, the reason he said he likes that unlikely W is that he likes [05:23] the long shot nature of just how good it feels if you do hit that W. [05:27] For example, if the first pattern revealed was something like this, [05:31] then it turns out there are only 58 words in this giant lexicon that match that pattern. [05:36] So that's a huge reduction from 13,000. [05:38] But the flip side of that, of course, is that [05:40] it's very uncommon to get a pattern like this. [05:43] Specifically, if each word was equally likely to be the answer, [05:46] the probability of hitting this pattern would be 58 divided by around 13,000. [05:51] Of course, they're not equally likely to be answers. [05:53] Most of these are very obscure and even questionable words. [05:56] But at least for our first pass at all of this, [05:58] let's assume that they're all equally likely and then refine that a bit later. [06:02] The point is, the pattern with a lot of information is, [06:04] by its very nature, unlikely to occur. [06:07] In fact, what it means to be informative is that it's unlikely. [06:11] A much more probable pattern to see with this opening would be something like this, [06:16] where, of course, there's not a W in it. [06:18] Maybe there's an E, and maybe there's no A, there's no R, there's no Y. [06:22] In this case, there are 1400 possible matches. [06:25] If all were equally likely, it works out to be a probability [06:28] of about 11% that this is the pattern you would see. [06:30] So the most likely outcomes are also the least informative. [06:34] To get a more global view here, let me show you the full distribution of [06:37] probabilities across all of the different patterns that you might see. [06:41] So each bar that you're looking at corresponds to a possible pattern of [06:45] colors that could be revealed, of which there are 3 to the 5th possibilities, [06:48] and they're organized from left to right, most common to least common. [06:52] So the most common possibility here is that you get all grays. [06:56] That happens about 14% of the time. [06:58] And what you're hoping for when you make a guess is that you end up [07:01] somewhere out in this long tail, like over here, [07:04] where there's only 18 possibilities for what matches this pattern, [07:07] that evidently look like this. [07:09] Or if we venture a little farther to the left, [07:11] you know, maybe we go all the way over here. [07:14] Okay, here's a good puzzle for you. [07:16] What are the three words in the English language that start with a W, [07:19] end with a Y, and have an R somewhere in them? [07:22] Turns out, the answers are, let's see, wordy, wormy, and wryly. [07:27] So to judge how good this word is overall, we want some kind of measure of the [07:31] expected amount of information that you're going to get from this distribution. [07:35] If we go through each pattern and we multiply its probability of occurring times [07:39] something that measures how informative it is, that can maybe give us an objective score. [07:45] Now your first instinct for what that something should be might be the number of matches. [07:50] You want a lower average number of matches. [07:52] But instead I'd like to use a more universal measurement that we often ascribe to [07:56] information, and one that will be more flexible once we have a different probability [08:00] assigned to each of these 13,000 words for whether or not they're actually the answer. 
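To make the "probability of a pattern" idea concrete, here's a small sketch, building on the wordle_pattern helper above, that tallies how often each feedback pattern would show up for a given guess if every word in the list were an equally likely answer. The function name is mine; the logic is just the counting described in the paragraph above.

```python
from collections import Counter

def pattern_distribution(guess, possible_answers):
    """Probability of each feedback pattern for `guess`, treating every word
    in `possible_answers` as an equally likely candidate answer."""
    counts = Counter(wordle_pattern(guess, answer) for answer in possible_answers)
    total = len(possible_answers)
    return {pattern: count / total for pattern, count in counts.items()}

# The probability of any one specific pattern is then just
# (number of words matching that pattern) / (size of the list),
# e.g. 58 / 13000 for the long-shot pattern discussed above.
```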
[08:10] The standard unit of information is the bit, which has a little bit of a funny formula, [08:14] but is really intuitive if we just look at examples. [08:17] If you have an observation that cuts your space of possibilities in half, [08:21] we say that it has one bit of information. [08:24] In our example, the space of possibilities is all possible words, [08:26] and it turns out about half of the five letter words have an S, a little less than that, [08:30] but about half. [08:31] So that observation would give you one bit of information. [08:34] If instead a new fact chops down that space of possibilities by a factor of four, [08:39] we say that it has two bits of information. [08:41] For example, it turns out about a quarter of these words have a T. [08:45] If the observation cuts that space by a factor of eight, [08:47] we say it's three bits of information, and so on and so forth. [08:50] Four bits cuts it into a sixteenth, five bits cuts it into a thirty-second. [08:54] So now's when you might want to take a moment and pause and ask yourself, [08:58] what is the formula for information, the number of bits, [09:00] in terms of the probability of an occurrence? [09:03] Well, what we're saying here is basically that when you take one half to the number of [09:07] bits, that's the same thing as the probability, [09:09] which is the same thing as saying two to the power of the number of bits is one over [09:13] the probability, which rearranges further to saying the information is the log base [09:17] two of one divided by the probability. [09:19] And sometimes you see this with one more rearrangement still where [09:22] the information is the negative log base two of the probability. [09:25] Expressed like this, it can look a little bit weird to the uninitiated, [09:28] but it really is just the very intuitive idea of asking how [09:31] many times you've cut down your possibilities in half. [09:35] Now if you're wondering, you know, I thought we were just playing a fun word game, [09:37] why are logarithms entering the picture? [09:39] One reason this is a nicer unit is it's just a lot easier to talk about very [09:43] unlikely events, much easier to say that an observation has 20 bits of information [09:48] than it is to say that the probability of such and such occurring is 0.00000095. [09:53] But a more substantive reason that this logarithmic expression turned out to be a very [09:57] useful addition to the theory of probability is the way that information adds together. [10:02] For example, if one observation gives you two bits of information, [10:05] cutting your space down by four, and then a second observation like your second [10:08] guess in Wordle gives you another three bits of information, [10:11] chopping you down further by another factor of eight, [10:14] the two together give you five bits of information. [10:17] In the same way that probabilities like to multiply, information likes to add. [10:21] So as soon as we're in the realm of something like an expected value, [10:24] where we're adding a bunch of numbers up, the logs make it a lot nicer to deal with. [10:28] Let's go back to our distribution for weary and add another little tracker on here, [10:32] showing us how much information there is for each pattern. [10:35] The main thing I want you to notice is that the higher the probability as we get [10:39] to those more likely patterns, the lower the information, the fewer bits you gain.
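In symbols, what those last few paragraphs are saying is

$$\Big(\tfrac{1}{2}\Big)^{I} = p \quad\Longleftrightarrow\quad 2^{I} = \frac{1}{p} \quad\Longleftrightarrow\quad I = \log_2\frac{1}{p} = -\log_2 p,$$

and the way information adds comes straight from the way probabilities multiply:

$$I(p_1 \cdot p_2) = \log_2\frac{1}{p_1 p_2} = \log_2\frac{1}{p_1} + \log_2\frac{1}{p_2} = I(p_1) + I(p_2).$$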
[10:43] The way we measure the quality of this guess will [10:45] be to take the expected value of this information. [10:48] When we go through each pattern, we say how probable is it and [10:51] then we multiply that by how many bits of information do we get. [10:54] And in the example of weary, that turns out to be 4.9 bits. [10:58] So on average, the information you get from this opening guess is as [11:01] good as chopping your space of possibilities in half about five times. [11:05] By contrast, an example of a guess with a higher expected [11:09] information value would be something like slate. [11:13] In this case, you'll notice the distribution looks a lot flatter. [11:15] In particular, the most probable occurrence of all grays only has about a 6% chance [11:20] of occurring, so at minimum you're getting evidently 3.9 bits of information. [11:25] But that's a minimum, more typically you'd get something better than that. [11:29] And it turns out when you crunch the numbers on this one and add [11:32] up all the relevant terms, the average information is about 5.8. [11:37] So in contrast with weary, your space of possibilities will [11:40] be about half as big after this first guess, on average. [11:44] There's actually a fun story about the name for [11:46] this expected value of information quantity. [11:48] You see, information theory was developed by Claude Shannon, [11:51] who was working at Bell Labs in the 1940s, but he was talking about some of his [11:54] yet-to-be-published ideas with John von Neumann, [11:57] who was this intellectual giant of the time, very prominent in math and physics [12:00] and the beginnings of what was becoming computer science. [12:04] And when he mentioned that he didn't really have a good name for this [12:07] expected value of information quantity, von Neumann supposedly said, [12:10] so the story goes, well, you should call it entropy, and for two reasons. [12:14] In the first place, your uncertainty function has been used in statistical mechanics [12:18] under that name, so it already has a name, and in the second place, and more important, [12:22] nobody knows what entropy really is, so in a debate you'll always have the advantage. [12:27] So if the name seems a little bit mysterious, and if this story is to be believed, [12:31] that's kind of by design. [12:33] Also if you're wondering about its relation to all of that second law of thermodynamics [12:37] stuff from physics, there definitely is a connection, [12:39] but in its origins Shannon was just dealing with pure probability theory, [12:43] and for our purposes here, when I use the word entropy, [12:45] I just want you to think the expected information value of a particular guess. [12:50] You can think of entropy as measuring two things simultaneously. [12:54] The first one is how flat is the distribution? [12:57] The closer a distribution is to uniform, the higher that entropy will be. [13:01] In our case, where there are 3 to the 5th total patterns, for a uniform distribution, [13:06] observing any one of them would have information log base 2 of 3 to the 5th, [13:11] which happens to be 7.92, so that is the absolute maximum that you could possibly have [13:16] for this entropy. [13:17] But entropy is also kind of a measure of how many [13:20] possibilities there are in the first place. [13:22] For example, if you happen to have some word where there's only 16 possible patterns, [13:27] and each one is equally likely, this entropy, this expected information, would be 4 bits. 
[13:32] But if you have another word where there's 64 possible patterns that could come up, [13:36] and they're all equally likely, then the entropy would work out to be 6 bits. [13:41] So if you see some distribution out in the wild that has an entropy of 6 bits, [13:45] it's sort of like it's saying there's as much variation and uncertainty [13:49] in what's about to happen as if there were 64 equally likely outcomes. [13:54] For my first pass at the Wordle bot, I basically had it just do this. [13:57] It goes through all of the different possible guesses that you could have, [14:01] all 13,000 words, it computes the entropy for each one, or more specifically, [14:05] the entropy of the distribution across all patterns that you might see for each one, [14:09] and then it picks the highest, since that's the one that's likely to chop [14:13] down your space of possibilities as much as possible. [14:17] And even though I've only been talking about the first guess here, [14:19] it does the same thing for the next few guesses. [14:21] For example, after you see some pattern on that first guess, [14:24] which would restrict you to a smaller number of possible words based on what [14:27] matches with that, you just play the same game with respect to that smaller set of words. [14:32] For a proposed second guess, you look at the distribution of all patterns [14:36] that could occur from that more restricted set of words, [14:38] you search through all 13,000 possibilities, and you find the one that [14:42] maximizes that entropy. [14:45] To show you how this works in action, let me just pull up a little variant of [14:48] Wordle that I wrote that shows the highlights of this analysis in the margins. [14:53] So after doing all its entropy calculations, on the right here [14:56] it's showing us which ones have the highest expected information. [15:00] Turns out the top answer, at least at the moment, we'll refine this later, [15:05] is "tares", which means, um, of course, a vetch, the most common vetch. [15:11] Each time we make a guess here, where maybe I kind of ignore its [15:13] recommendations and go with slate, because I like slate, [15:16] we can see how much expected information it had, [15:18] but then on the right of the word here it's showing us how much actual [15:22] information we got given this particular pattern. [15:25] So here it looks like we were a little unlucky, we expected to get 5.8, [15:27] but we happened to get something with less than that. [15:30] And then on the left side here it's showing us all of [15:32] the different possible words given where we are now. [15:35] The blue bars are telling us how likely it thinks each word is, [15:38] so at the moment it's assuming each word is equally likely to occur, [15:41] but we'll refine that in a moment. [15:44] And then this uncertainty measurement is telling us the entropy of this distribution [15:48] across the possible words, which right now, because it's a uniform distribution, [15:52] is just a needlessly complicated way to count the number of possibilities. [15:56] For example, if we were to take 2 to the power of 13.66, [15:59] that should be around the 13,000 possibilities. [16:02] Um, a little bit off here, but only because I'm not showing all the decimal places. [16:06] At the moment that might feel redundant and like it's overly complicating things, [16:09] but you'll see why it's useful to have both numbers in a minute.
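Putting those pieces together, the first-pass strategy described above amounts to something like the following sketch, reusing the wordle_pattern and pattern_distribution helpers from earlier. The function names are mine, not necessarily what the actual bot uses.

```python
import math

def entropy_of_guess(guess, possible_answers):
    """Expected information of a guess, in bits, assuming every remaining
    answer is equally likely: the sum over patterns of p * log2(1/p)."""
    distribution = pattern_distribution(guess, possible_answers)
    return sum(p * math.log2(1 / p) for p in distribution.values())

def best_guess(allowed_guesses, possible_answers):
    """Version 1 of the bot: pick the allowed word whose pattern
    distribution has the highest entropy."""
    return max(allowed_guesses, key=lambda g: entropy_of_guess(g, possible_answers))

def filter_answers(possible_answers, guess, observed_pattern):
    """After seeing the feedback, keep only the words consistent with it,
    then play the same game on that smaller set."""
    return [w for w in possible_answers
            if wordle_pattern(guess, w) == observed_pattern]
```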
[16:12] So here it looks like it's suggesting the highest entropy for our second guess is Raman, [16:16] which again just really doesn't feel like a word. [16:19] So to take the moral high ground here I'm going to go ahead and type in Rains. [16:25] And again it looks like we were a little unlucky. [16:27] We were expecting 4.3 bits and we only got 3.39 bits of information. [16:31] So that takes us down to 55 possibilities. [16:34] And here maybe I'll just actually go with what it's suggesting, [16:37] which is "combo", whatever that means. [16:40] And, okay, this is actually a good chance for a puzzle. [16:42] It's telling us this pattern gives us 4.7 bits of information. [16:47] But over on the left, before we see that pattern, there were 5.78 bits of uncertainty. [16:52] So as a quiz for you, what does that mean about the number of remaining possibilities? [16:58] Well it means that we're reduced down to 1 bit of uncertainty, [17:01] which is the same thing as saying that there's 2 possible answers. [17:04] It's a 50-50 choice. [17:06] And from here, because you and I know which words are more common, [17:09] we know that the answer should be abyss. [17:11] But as it's written right now, the program doesn't know that. [17:13] So it just keeps going, trying to gain as much information as it can, [17:16] until there's only one possibility left, and then it guesses it. [17:20] So obviously we need a better endgame strategy, [17:22] but let's say we call this version 1 of our Wordle solver, [17:25] and then we go and run some simulations to see how it does. [17:30] So the way this is working is it's playing every possible Wordle game. [17:34] It's going through all of those 2315 words that are the actual Wordle answers. [17:38] It's basically using that as a testing set. [17:41] And it's using this naive method of not considering how common a word is, [17:44] just trying to maximize the information at each step along the way, [17:47] until it gets down to one and only one choice. [17:50] By the end of the simulation, the average score works out to be about 4.124. [17:55] Which is not bad, to be honest, I kind of expected it to do worse. [17:59] But the people who play Wordle will tell you that they can usually get it in 4. [18:02] The real challenge is to get as many in 3 as you can. [18:05] It's a pretty big jump between the score of 4 and the score of 3. [18:08] The obvious low-hanging fruit here is to somehow incorporate [18:11] whether or not a word is common. So how exactly do we do that? [18:22] The way I approached it is to get a list of the relative frequencies for all of [18:26] the words in the English language, and I just used Mathematica's word frequency [18:30] data function, which itself pulls from the Google Books English Ngram public dataset. [18:35] And it's kind of fun to look at, for example if we sort [18:37] it from the most common words to the least common words. [18:40] Evidently these are the most common 5-letter words in the English language. [18:43] Or rather, "these" is the 8th most common. [18:46] First is "which", after which there's "there" and "their". [18:49] "First" itself is not first, but 9th, and it makes sense that these [18:52] other words could come about more often, with those after "first" being "after", [18:55] "where", and "those", each just a little bit less common.
[18:59] Now, in using this data to model how likely each of these words is to [19:03] be the final answer, it shouldn't just be proportional to the frequency, [19:07] because for example "which" is given a score of 0.002 in this dataset, [19:11] whereas the word "braid" is in some sense about 1000 times less likely. [19:15] But both of these are common enough words that they're almost [19:18] certainly worth considering, so we want more of a binary cutoff. [19:21] The way I went about it is to imagine taking this whole sorted list of words, [19:25] and then arranging it on an x-axis, and then applying the sigmoid function, [19:29] which is the standard way to have a function whose output is basically binary, [19:33] it's either 0 or it's 1, but there's a smoothing in between for that region of [19:37] uncertainty. [19:39] So essentially, the probability that I'm assigning to each word of being in the final [19:43] list will be the value of the sigmoid function above wherever it sits on the x-axis. [19:49] Now obviously this depends on a few parameters, [19:52] for example how wide a space on the x-axis those words fill determines how [19:56] gradually or steeply we drop off from 1 to 0, [19:58] and where we situate them left to right determines the cutoff. [20:02] And to be honest the way I did this was kind of just [20:05] licking my finger and sticking it into the wind. [20:07] I looked through the sorted list and tried to find a window where [20:10] when I looked at it I figured about half of these words are more [20:13] likely than not to be the final answer, and used that as the cutoff. [20:17] Now once we have a distribution like this across the words, [20:19] it gives us another situation where entropy becomes this really useful measurement. [20:24] For example, let's say we were playing a game and we start with my old openers, [20:28] which were "other" and "nails", and we end up with a situation [20:30] where there's four possible words that match it. [20:33] And let's say we consider them all equally likely, [20:36] let me ask you, what is the entropy of this distribution? [20:41] Well, the information associated with each one of these possibilities is [20:45] going to be the log base 2 of 4, since each one is 1 in 4, and that's 2. [20:50] 2 bits of information, 4 possibilities. [20:52] All very well and good. [20:54] But what if I told you that actually there's more than 4 matches? [20:58] In reality, when we look through the full word list, there are 16 words that match it. [21:02] But suppose our model puts a really low probability on those other 12 words of actually [21:06] being the final answer, something like 1 in 1000 because they're really obscure. [21:11] Now let me ask you, what is the entropy of this distribution? [21:15] If entropy was purely measuring the number of matches here, [21:18] then you might expect it to be something like the log base 2 of 16, [21:22] which would be 4, two more bits of uncertainty than we had before. [21:26] But of course the actual uncertainty is not really that different from what we had before. [21:30] Just because there's these 12 really obscure words doesn't mean that it would be [21:33] all that more surprising to learn that the final answer is "charm", for example. [21:38] So when you actually do the calculation here and you add up the probability of [21:42] each occurrence times the corresponding information, what you get is 2.11 bits.
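Both halves of this are easy to sketch in code: a sigmoid-shaped prior over the frequency-sorted word list, and the entropy of a non-uniform distribution. The function names, the normalization step, and the particular center/width parameters below are my own stand-ins rather than the exact choices in the real bot; the last few lines just check the 2.11-bit figure from the example above.

```python
import math

def sigmoid(x: float) -> float:
    """Standard logistic function, squashing the real line into (0, 1)."""
    return 1 / (1 + math.exp(-x))

def answer_prior(words_by_frequency, center_rank=3000, width=500):
    """Assign each word a probability of being the answer.

    words_by_frequency is sorted from most to least common.  The sigmoid gives
    common words a weight near 1 and obscure words a weight near 0, with a
    smooth drop-off around center_rank; the weights are then normalized.  The
    center and width here are made-up placeholders, not the video's values.
    """
    weights = [sigmoid((center_rank - rank) / width)
               for rank, _word in enumerate(words_by_frequency)]
    total = sum(weights)
    return {w: wt / total for w, wt in zip(words_by_frequency, weights)}

def entropy(probabilities) -> float:
    """Expected information in bits: the sum of p * log2(1/p)."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

# The 4-common-plus-12-obscure example: each obscure word gets probability
# 1/1000, the rest is split evenly among the four plausible ones.
p_obscure = 0.001
p_common = (1 - 12 * p_obscure) / 4
print(entropy([p_common] * 4 + [p_obscure] * 12))  # about 2.11 bits
```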
[21:46] Just saying, it's basically two bits, basically those four possibilities, [21:49] but there's a little more uncertainty because of all of those highly unlikely events, [21:53] though if you did learn them you'd get a ton of information from it. [21:57] So zooming out, this is part of what makes Wordle [21:59] such a nice example for an information theory lesson. [22:01] We have these two distinct-feeling applications for entropy. [22:05] The first one telling us what's the expected information we'll get from a given guess, [22:09] and the second one measuring the remaining uncertainty [22:13] among all of the words that are still possible. [22:16] And I should emphasize, in that first case where we're looking [22:19] at the expected information of a guess, once we have an unequal weighting to the words, [22:22] that affects the entropy calculation. [22:24] For example, let me pull up that same case we were looking at [22:27] earlier of the distribution associated with weary, [22:30] but this time using a non-uniform distribution across all possible words. [22:34] So let me see if I can find a part here that illustrates it pretty well. [22:40] Okay, here, this is pretty good. [22:42] Here we have two adjacent patterns that are about equally likely, [22:45] but one of them we're told has 32 possible words that match it. [22:49] And if we check what they are, these are those 32, [22:51] which are all just very unlikely words as you scan your eyes over them. [22:55] It's hard to find any that feel like plausible answers, maybe "yells", [22:59] but if we look at the neighboring pattern in the distribution, [23:02] which is considered just about as likely, we're told that it only has 8 possible matches. [23:06] So a quarter as many matches, but it's about as likely. [23:09] And when we pull up those matches, we can see why. [23:12] Some of these are actual plausible answers like "wring" or "wrath" or "wraps". [23:17] To illustrate how we incorporate all that, let [23:20] me pull up version two of the Wordle bot here. [23:22] And there are two or three main differences from the first one that we saw. [23:25] First off, like I just said, the way that we're computing these entropies, [23:29] these expected values of information, is now using the more refined distributions across [23:33] the patterns, which incorporate the probability that a given word would actually be the [23:37] answer. [23:38] As it happens, "tares" is still number one, though the ones following are a bit different. [23:44] Second, when it ranks its top picks, it's now going to keep a model of the [23:47] probability that each word is the actual answer, [23:50] and it'll incorporate that into its decision, [23:52] which is easier to see once we have a few guesses on the table. [23:55] Again, ignoring its recommendation because we can't let machines rule our lives. [24:01] And I suppose I should mention another thing different here is over on the left, [24:04] that uncertainty value, that number of bits, is no longer just [24:07] redundant with the number of possible matches. [24:10] Now if we pull it up and calculate 2 to the 8.02, which would be a little above 256, [24:15] I guess 259, what it's saying is even though there are 526 total words that [24:20] actually match this pattern, the amount of uncertainty it has is more akin [24:24] to what it would be if there were 259 equally likely outcomes. [24:29] You can think of it like this. [24:31] It knows borks is not the answer, same with yorts and zorl and zorus.
[24:34] So it's a little less uncertain than it was in the previous case. [24:37] This number of bits will be smaller. [24:40] And if I keep playing the game, I'm refining this down with a couple [24:43] guesses that are apropos of what I would like to explain here. [24:48] By the fourth guess, if you look over at its top picks, [24:51] you can see it's no longer just maximizing the entropy. [24:54] So at this point, there's technically seven possibilities, [24:57] but the only ones with a meaningful chance are "dorms" and "words". [25:00] And you can see it ranks choosing both of those above all of these [25:03] other values that strictly speaking would give more information. [25:07] The very first time I did this, I just added up these two numbers to measure [25:10] the quality of each guess, which actually worked better than you might suspect. [25:14] But it really didn't feel systematic. [25:16] And I'm sure there's other approaches people could take. [25:17] But here's the one I landed on. [25:19] If we're considering the prospect of a next guess, like in this case "words", [25:23] what we really care about is the expected score of our game if we do that. [25:28] And to calculate that expected score, we say what's the probability that [25:32] "words" is the actual answer, which at the moment it ascribes 58% to. [25:36] We say with a 58% chance, our score in this game would be four. [25:40] And then with the probability of one minus that 58%, [25:43] our score will be more than that four. [25:46] How much more we don't know, but we can estimate it based on how [25:49] much uncertainty there's likely to be once we get to that point. [25:52] Specifically, at the moment, there's 1.44 bits of uncertainty. [25:56] If we guess "words", it's telling us the expected information we'll get is 1.27 bits. [26:01] So if we guess "words", this difference represents how much [26:04] uncertainty we're likely to be left with after that happens. [26:08] What we need is some kind of function, which I'm calling f here, [26:11] that associates this uncertainty with an expected score. [26:14] And the way I went about this was to just plot a bunch of the data from previous [26:18] games based on version one of the bot to say, hey, [26:21] what was the actual score after various points with certain very measurable [26:25] amounts of uncertainty? [26:27] For example, these data points here that are sitting above a value that's [26:30] around like 8.7 or so are saying for some games, [26:33] after a point at which there were 8.7 bits of uncertainty, [26:36] it took two guesses to get the final answer. [26:39] For other games, it took three guesses. [26:40] For other games, it took four guesses. [26:43] If we shift over to the left here, all the points over zero are saying whenever [26:46] there's zero bits of uncertainty, which is to say there's only one possibility, [26:50] then the number of guesses required is always just one, which is reassuring. [26:54] Whenever there was one bit of uncertainty, meaning it was essentially [26:58] just down to two possibilities, then sometimes it required one more guess, [27:01] sometimes it required two more guesses, and so on and so forth here. [27:05] Maybe a slightly easier way to visualize this [27:07] data is to bucket it together and take averages. [27:11] For example, this bar here is saying among all the points where we had one bit [27:15] of uncertainty, on average the number of new guesses required was about 1.5.
[27:22] And the bar over here is saying among all of the different games where [27:25] at some point the uncertainty was a little above four bits, [27:28] which is like narrowing it down to 16 different possibilities, [27:31] then on average it requires a little more than two guesses from that point forward. [27:36] And from here I just did a regression to fit a function that seemed reasonable to this. [27:39] And remember, the whole point of doing any of that is so that we can quantify this [27:44] intuition that the more information we gain from a word, [27:47] the lower the expected score will be. [27:49] So, with this as version 2.0, if we go back and run the same set of simulations, [27:54] having it play against all 2315 possible Wordle answers, how does it do? [28:00] Well in contrast to our first version, it's definitely better, which is reassuring. [28:04] All said and done, the average is around 3.6. [28:06] Although unlike the first version, there are a couple times that it loses, [28:09] requiring more than six guesses in those cases. [28:12] Presumably that's because there are times when it's making that tradeoff [28:15] to actually go for the goal rather than maximizing information. [28:19] So can we do better than 3.6? [28:22] We definitely can. [28:23] Now, I said at the start that it's most fun to try not incorporating [28:26] the true list of Wordle answers into the way that it builds its model. [28:29] But if we do incorporate it, the best performance I could get was around 3.43. [28:35] So if we try to get more sophisticated than just using word frequency data [28:38] to choose this prior distribution, this 3.43 probably puts a cap on how [28:42] good we could get with that, or at least how good I could get with that. [28:46] That best performance essentially just uses the ideas that I've been talking about here, [28:49] but it goes a little farther, like it does a search for the expected [28:52] information two steps forward rather than just one. [28:55] Originally I was planning on talking more about that, [28:57] but I realize we've actually gone quite long as it is. [29:00] The one thing I'll say is after doing this two-step search and [29:03] then running a couple sample simulations on the top candidates, [29:06] so far for me at least, it's looking like "crane" is the best opener. [29:09] Who would have guessed? [29:10] Also if you use the true Wordle list to determine your space of possibilities, [29:14] then the uncertainty you start with is a little over 11 bits. [29:18] And it turns out just from a brute-force search, [29:20] the maximum possible expected information after the first two guesses is around 10 bits. [29:26] Which suggests that best case scenario, after your first two guesses, [29:30] with perfectly optimal play, you'll be left with around one bit of uncertainty. [29:34] Which is the same as being down to two possible guesses. [29:37] But I think it's fair and probably pretty conservative to say that you could never [29:41] possibly write an algorithm that gets this average as low as three, [29:44] because with the words available to you, there's simply not room to get enough [29:48] information after only two steps to be able to guarantee the answer in the third slot [29:51] every single time without fail.
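To tie the version-2 pieces together, here is a rough sketch of the expected-score rule described a few paragraphs back, the one that trades off the chance that a guess is simply correct against the uncertainty it would leave behind. The helper names are stand-ins, and guesses_to_finish plays the role of the fitted function f; the exact bookkeeping in the real bot may differ.

```python
def expected_score(guess, n_guesses_so_far, prior, current_entropy,
                   expected_information, guesses_to_finish):
    """Rank a candidate guess by the expected final score of the game.

    prior[guess]            -- probability this guess is the actual answer
    current_entropy         -- bits of uncertainty before making the guess
    expected_information    -- E[bits gained] from this guess (its entropy)
    guesses_to_finish(bits) -- fitted function f: remaining uncertainty ->
                               expected number of further guesses needed
    """
    p_correct = prior.get(guess, 0.0)
    this_turn = n_guesses_so_far + 1
    # If the guess is right, the game ends on this turn.  Otherwise we expect
    # to be left with roughly (current_entropy - expected_information) bits,
    # and f estimates how many more guesses that tends to cost on average.
    remaining_bits = max(current_entropy - expected_information, 0.0)
    return (p_correct * this_turn
            + (1 - p_correct) * (this_turn + guesses_to_finish(remaining_bits)))
```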