Transcript
0:00
The game Wordle has gone pretty viral in the last month or two,
0:02
and never one to overlook an opportunity for a math lesson,
0:05
it occurs to me that this game makes for a very good central example
0:08
in a lesson about information theory, and in particular a topic known as entropy.
0:13
You see, like a lot of people I got kind of sucked into the puzzle,
0:16
and like a lot of programmers I also got sucked into trying to
0:19
write an algorithm that would play the game as optimally as it could.
0:23
And what I thought I'd do here is just talk through with you some of my process in that,
0:26
and explain some of the math that went into it,
0:28
since the whole algorithm centers on this idea of entropy.
0:38
First things first, in case you haven't heard of it, what is Wordle?
0:42
And to kill two birds with one stone here while we go through the rules of the game,
0:45
let me also preview where we're going with this,
0:47
which is to develop a little algorithm that will basically play the game for us.
0:51
Though I haven't done today's Wordle, this is February 4th,
0:53
and we'll see how the bot does.
0:55
The goal of Wordle is to guess a mystery five letter word,
0:58
and you're given six different chances to guess.
1:00
For example, my Wordle bot suggests that I start with the guess "crane".
1:05
Each time that you make a guess, you get some information
1:07
about how close your guess is to the true answer.
1:10
Here the grey box is telling me there's no C in the actual answer.
1:14
The yellow box is telling me there is an R, but it's not in that position.
1:18
The green box is telling me that the secret word does have an A,
1:20
and it's in the third position.
1:22
And then there's no N and there's no E.
1:25
So let me just go in and tell the Wordle bot that information.
1:27
We started with crane, we got grey, yellow, green, grey, grey.
1:31
Don't worry about all the data that it's showing right now, I'll explain that in due time.
1:35
But its top suggestion for our second pick is "shtik".
1:39
And your guess does have to be an actual five letter word, but as you'll see,
1:42
it's pretty liberal with what it will actually let you guess.
1:46
In this case, we try "shtik".
1:48
And alright, things are looking pretty good.
1:50
We hit the S and the H, so we know the first three letters, we know that there's an R.
1:53
And so it's going to be like S-H-A something R, or S-H-A R something.
1:59
And it looks like the Wordle bot knows that it's down to just two possibilities,
2:03
either shard or sharp.
2:05
That's kind of a tossup between them at this point,
2:07
so I guess probably just because it's alphabetical it goes with shard.
2:11
Which hooray, is the actual answer, so we got it in three.
2:14
If you're wondering if that's any good, the way I heard one person
2:17
phrase it is that with Wordle, four is par and three is birdie.
2:20
Which I think is a pretty apt analogy.
2:22
You have to be consistently on your game to be getting four, but it's certainly not crazy.
2:27
But when you get it in three, it just feels great.
2:30
So if you're down for it, what I'd like to do here is just talk through
2:33
my thought process from the beginning for how I approach the Wordle bot.
2:36
And like I said, really it's an excuse for an information theory lesson.
2:39
The main goal is to explain what is information and what is entropy.
2:48
My first thought in approaching this was to take a look at the
2:50
relative frequencies of different letters in the English language.
2:54
So I thought, okay, is there an opening guess or an opening
2:56
pair of guesses that hits a lot of these most frequent letters?
2:59
And one pair that I was pretty fond of was "other" followed by "nails".
3:03
The thought is that if you hit a letter, you know, you get a green or a yellow,
3:06
that always feels good, it feels like you're getting information.
3:09
But in these cases, even if you don't hit and you always get grays,
3:12
that's still giving you a lot of information,
3:14
since it's pretty rare to find a word that doesn't have any of these letters.
3:18
But even still, that doesn't feel super systematic, because for example,
3:21
it does nothing to consider the order of the letters.
3:23
Why type nails when I could type snail?
3:26
Is it better to have that S at the end?
3:27
I'm not really sure.
3:29
Now, a friend of mine said that he liked to open with the word weary,
3:32
which kind of surprised me because it has some uncommon letters in there like the
3:35
W and the Y.
3:37
But who knows, maybe that is a better opener.
3:39
Is there some kind of quantitative score that we
3:41
can give to judge the quality of a potential guess?
3:45
Now to set up for the way that we're going to rank possible guesses,
3:48
let's go back and add a little clarity to how exactly the game is set up.
3:51
So there's a list of words that it will allow you to enter that
3:54
are considered valid guesses that's just about 13,000 words long.
3:58
But when you look at it, there's a lot of really uncommon things,
4:01
things like "aahed" or "aalii" and "aargh", the kind of words
4:03
that bring about family arguments in a game of Scrabble.
4:06
But the vibe of the game is that the answer is always going to be a decently common word.
4:10
And in fact, there's another list of around 2300 words that are the possible answers.
4:15
And this is a human curated list, I think specifically by the game creator's girlfriend,
4:20
which is kind of fun.
4:21
But what I would like to do, our challenge for this project is to see if we can write
4:25
a program that solves Wordle without incorporating previous knowledge about this list.
4:30
For one thing, there's plenty of pretty common five
4:32
letter words that you won't find in that list.
4:34
So it would be better to write a program that's a little more resilient and would
4:38
play Wordle against anyone, not just the version that happens to be on the official website.
4:41
And also, the reason that we know what this list of possible answers is,
4:45
is because it's visible in the source code.
4:47
But the way that it's visible in the source code is in the
4:49
specific order in which answers come up from day to day,
4:52
so you could always just look up what tomorrow's answer will be.
4:56
So clearly, there's some sense in which using the list is cheating.
4:59
And what makes for a more interesting puzzle and a richer information theory lesson
5:03
is to instead use some more universal data like relative word frequencies in
5:06
general to capture this intuition of having a preference for more common words.
5:11
So of these 13,000 possibilities, how should we choose the opening guess?
5:16
For example, if my friend proposes weary, how should we analyze its quality?
5:20
Well, the reason he said he likes that unlikely W is that he likes
5:23
the long shot nature of just how good it feels if you do hit that W.
5:27
For example, if the first pattern revealed was something like this,
5:31
then it turns out there are only 58 words in this giant lexicon that match that pattern.
5:36
So that's a huge reduction from 13,000.
5:38
But the flip side of that, of course, is that
5:40
it's very uncommon to get a pattern like this.
5:43
Specifically, if each word was equally likely to be the answer,
5:46
the probability of hitting this pattern would be 58 divided by around 13,000.
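(As a side note, here is a minimal Python sketch of how you might compute the colored pattern for a guess against a candidate answer, and from that the probability of seeing a given pattern under this equal-likelihood assumption. The function names and the word_list variable are placeholders for illustration, not anything from the actual bot.)

    from collections import Counter

    def feedback(guess, answer):
        # Returns a tuple of five colors: 2 = green, 1 = yellow, 0 = gray.
        colors = [0] * 5
        unmatched = Counter()
        for i, (g, a) in enumerate(zip(guess, answer)):
            if g == a:
                colors[i] = 2        # right letter, right position
            else:
                unmatched[a] += 1    # answer letters still available for yellows
        for i, g in enumerate(guess):
            if colors[i] == 0 and unmatched[g] > 0:
                colors[i] = 1        # right letter, wrong position
                unmatched[g] -= 1
        return tuple(colors)

    def pattern_probability(guess, pattern, word_list):
        # Fraction of the word list that would produce this pattern,
        # treating every word as equally likely to be the answer.
        matches = sum(1 for w in word_list if feedback(guess, w) == pattern)
        return matches / len(word_list)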
5:51
Of course, they're not equally likely to be answers.
5:53
Most of these are very obscure and even questionable words.
5:56
But at least for our first pass at all of this,
5:58
let's assume that they're all equally likely and then refine that a bit later.
6:02
The point is, the pattern with a lot of information is,
6:04
by its very nature, unlikely to occur.
6:07
In fact, what it means to be informative is that it's unlikely.
6:11
A much more probable pattern to see with this opening would be something like this,
6:16
where, of course, there's not a W in it.
6:18
Maybe there's an E, and maybe there's no A, there's no R, there's no Y.
6:22
In this case, there are 1400 possible matches.
6:25
If all were equally likely, it works out to be a probability
6:28
of about 11% that this is the pattern you would see.
6:30
So the most likely outcomes are also the least informative.
6:34
To get a more global view here, let me show you the full distribution of
6:37
probabilities across all of the different patterns that you might see.
6:41
So each bar that you're looking at corresponds to a possible pattern of
6:45
colors that could be revealed, of which there are 3 to the 5th possibilities,
6:48
and they're organized from left to right, most common to least common.
6:52
So the most common possibility here is that you get all grays.
6:56
That happens about 14% of the time.
6:58
And what you're hoping for when you make a guess is that you end up
7:01
somewhere out in this long tail, like over here,
7:04
where there's only 18 possibilities for what matches this pattern,
7:07
that evidently look like this.
7:09
Or if we venture a little farther to the left,
7:11
you know, maybe we go all the way over here.
7:14
Okay, here's a good puzzle for you.
7:16
What are the three words in the English language that start with a W,
7:19
end with a Y, and have an R somewhere in them?
7:22
Turns out, the answers are, let's see, wordy, wormy, and wryly.
7:27
So to judge how good this word is overall, we want some kind of measure of the
7:31
expected amount of information that you're going to get from this distribution.
7:35
If we go through each pattern and we multiply its probability of occurring times
7:39
something that measures how informative it is, that can maybe give us an objective score.
7:45
Now your first instinct for what that something should be might be the number of matches.
7:50
You want a lower average number of matches.
7:52
But instead I'd like to use a more universal measurement that we often ascribe to
7:56
information, and one that will be more flexible once we have a different probability
8:00
assigned to each of these 13,000 words for whether or not they're actually the answer.
8:10
The standard unit of information is the bit, which has a little bit of a funny formula,
8:14
but is really intuitive if we just look at examples.
8:17
If you have an observation that cuts your space of possibilities in half,
8:21
we say that it has one bit of information.
8:24
In our example, the space of possibilities is all possible words,
8:26
and it turns out about half of the five letter words have an S, a little less than that,
8:30
but about half.
8:31
So that observation would give you one bit of information.
8:34
If instead a new fact chops down that space of possibilities by a factor of four,
8:39
we say that it has two bits of information.
8:41
For example, it turns out about a quarter of these words have a T.
8:45
If the observation cuts that space by a factor of eight,
8:47
we say it's three bits of information, and so on and so forth.
8:50
Four bits cuts it into a sixteenth, five bits cuts it into a thirty second.
8:54
So now's when you might want to take a moment and pause and ask for yourself,
8:58
what is the formula for information for the number of bits
9:00
in terms of the probability of an occurrence?
9:03
Well, what we're saying here is basically that when you take one half to the number of
9:07
bits, that's the same thing as the probability,
9:09
which is the same thing as saying two to the power of the number of bits is one over
9:13
the probability, which rearranges further to saying the information is the log base
9:17
two of one divided by the probability.
9:19
And sometimes you see this with one more rearrangement still where
9:22
the information is the negative log base two of the probability.
9:25
Expressed like this, it can look a little bit weird to the uninitiated,
9:28
but it really is just the very intuitive idea of asking how
9:31
many times you've cut down your possibilities in half.
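In code, that formula is a one-liner; here is a quick sanity check of the examples above (purely illustrative):

    from math import log2

    def information(p):
        # Bits of information from observing an event of probability p.
        return log2(1 / p)

    print(information(1 / 2))   # 1.0 bit: cuts the space of possibilities in half
    print(information(1 / 4))   # 2.0 bits: cuts it by a factor of four
    print(information(1 / 32))  # 5.0 bits: cuts it by a factor of thirty-two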
9:35
Now if you're wondering, you know, I thought we were just playing a fun word game,
9:37
why are logarithms entering the picture?
9:39
One reason this is a nicer unit is it's just a lot easier to talk about very
9:43
unlikely events, much easier to say that an observation has 20 bits of information
9:48
than it is to say that the probability of such-and-such occurring is 0.00000095.
9:53
But a more substantive reason that this logarithmic expression turned out to be a very
9:57
useful addition to the theory of probability is the way that information adds together.
10:02
For example, if one observation gives you two bits of information,
10:05
cutting your space down by four, and then a second observation like your second
10:08
guess in Wordle gives you another three bits of information,
10:11
chopping you down further by another factor of eight,
10:14
the two together give you five bits of information.
10:17
In the same way that probabilities like to multiply, information likes to add.
10:21
So as soon as we're in the realm of something like an expected value,
10:24
where we're adding a bunch of numbers up, the logs make it a lot nicer to deal with.
10:28
Let's go back to our distribution for weary and add another little tracker on here,
10:32
showing us how much information there is for each pattern.
10:35
The main thing I want you to notice is that the higher the probability as we get
10:39
to those more likely patterns, the lower the information, the fewer bits you gain.
10:43
The way we measure the quality of this guess will
10:45
be to take the expected value of this information.
10:48
When we go through each pattern, we say how probable is it and
10:51
then we multiply that by how many bits of information do we get.
10:54
And in the example of weary, that turns out to be 4.9 bits.
10:58
So on average, the information you get from this opening guess is as
11:01
good as chopping your space of possibilities in half about five times.
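Putting those two pieces together, a sketch of that expected-information calculation might look like the following, reusing the hypothetical feedback helper from earlier and still assuming every remaining word is equally likely to be the answer:

    from collections import Counter
    from math import log2

    def expected_information(guess, possible_words):
        # Entropy of the pattern distribution for this guess, in bits:
        # the sum over patterns of p(pattern) * log2(1 / p(pattern)).
        pattern_counts = Counter(feedback(guess, w) for w in possible_words)
        n = len(possible_words)
        return sum((c / n) * log2(n / c) for c in pattern_counts.values())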
11:05
By contrast, an example of a guess with a higher expected
11:09
information value would be something like slate.
11:13
In this case, you'll notice the distribution looks a lot flatter.
11:15
In particular, the most probable occurrence of all grays only has about a 6% chance
11:20
of occurring, so at minimum you're getting evidently 3.9 bits of information.
11:25
But that's a minimum, more typically you'd get something better than that.
11:29
And it turns out when you crunch the numbers on this one and add
11:32
up all the relevant terms, the average information is about 5.8.
11:37
So in contrast with weary, your space of possibilities will
11:40
be about half as big after this first guess, on average.
11:44
There's actually a fun story about the name for
11:46
this expected value of information quantity.
11:48
You see, information theory was developed by Claude Shannon,
11:51
who was working at Bell Labs in the 1940s, but he was talking about some of his
11:54
yet-to-be-published ideas with John von Neumann,
11:57
who was this intellectual giant of the time, very prominent in math and physics
12:00
and the beginnings of what was becoming computer science.
12:04
And when he mentioned that he didn't really have a good name for this
12:07
expected value of information quantity, von Neumann supposedly said,
12:10
so the story goes, well, you should call it entropy, and for two reasons.
12:14
In the first place, your uncertainty function has been used in statistical mechanics
12:18
under that name, so it already has a name, and in the second place, and more important,
12:22
nobody knows what entropy really is, so in a debate you'll always have the advantage.
12:27
So if the name seems a little bit mysterious, and if this story is to be believed,
12:31
that's kind of by design.
12:33
Also if you're wondering about its relation to all of that second law of thermodynamics
12:37
stuff from physics, there definitely is a connection,
12:39
but in its origins Shannon was just dealing with pure probability theory,
12:43
and for our purposes here, when I use the word entropy,
12:45
I just want you to think the expected information value of a particular guess.
12:50
You can think of entropy as measuring two things simultaneously.
12:54
The first one is how flat is the distribution?
12:57
The closer a distribution is to uniform, the higher that entropy will be.
13:01
In our case, where there are 3 to the 5th total patterns, for a uniform distribution,
13:06
observing any one of them would have information log base 2 of 3 to the 5th,
13:11
which happens to be 7.92, so that is the absolute maximum that you could possibly have
13:16
for this entropy.
13:17
But entropy is also kind of a measure of how many
13:20
possibilities there are in the first place.
13:22
For example, if you happen to have some word where there's only 16 possible patterns,
13:27
and each one is equally likely, this entropy, this expected information, would be 4 bits.
13:32
But if you have another word where there's 64 possible patterns that could come up,
13:36
and they're all equally likely, then the entropy would work out to be 6 bits.
13:41
So if you see some distribution out in the wild that has an entropy of 6 bits,
13:45
it's sort of like it's saying there's as much variation and uncertainty
13:49
in what's about to happen as if there were 64 equally likely outcomes.
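The same formula applies to any distribution at all; here is a tiny illustration of the 16-outcome and 64-outcome examples just mentioned:

    from math import log2

    def entropy(distribution):
        # Expected information of a probability distribution, in bits.
        return sum(p * log2(1 / p) for p in distribution if p > 0)

    print(entropy([1 / 16] * 16))  # 4.0 bits
    print(entropy([1 / 64] * 64))  # 6.0 bits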
13:54
For my first pass at the Wordle bot, I basically had it just do this.
13:57
It goes through all of the different possible guesses that you could have,
14:01
all 13,000 words, it computes the entropy for each one, or more specifically,
14:05
the entropy of the distribution across all patterns that you might see for each one,
14:09
and then it picks the highest, since that's the one that's likely to chop
14:13
down your space of possibilities as much as possible.
14:17
And even though I've only been talking about the first guess here,
14:19
it does the same thing for the next few guesses.
14:21
For example, after you see some pattern on that first guess,
14:24
which would restrict you to a smaller number of possible words based on what
14:27
matches with that, you just play the same game with respect to that smaller set of words.
14:32
For a proposed second guess, you look at the distribution of all patterns
14:36
that could occur from that more restricted set of words,
14:38
you search through all 13,000 possibilities, and you find the one that
14:42
maximizes that entropy.
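A rough sketch of that version-1 loop, again with placeholder names and the equal-likelihood assumption, could look like this:

    def filter_words(words, guess, pattern):
        # Keep only the words still consistent with the observed pattern.
        return [w for w in words if feedback(guess, w) == pattern]

    def play_game_v1(answer, allowed_guesses, candidates, max_guesses=6):
        # Version 1: always guess whatever maximizes expected information,
        # until only one candidate remains, then guess it.
        candidates = list(candidates)
        for turn in range(1, max_guesses + 1):
            if len(candidates) == 1:
                guess = candidates[0]
            else:
                guess = max(allowed_guesses,
                            key=lambda g: expected_information(g, candidates))
            if guess == answer:
                return turn
            candidates = filter_words(candidates, guess, feedback(guess, answer))
        return None  # ran out of guesses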
14:45
To show you how this works in action, let me just pull up a little variant of
14:48
Wordle that I wrote that shows the highlights of this analysis in the margins.
14:53
So after doing all its entropy calculations, on the right here
14:56
it's showing us which ones have the highest expected information.
15:00
Turns out the top answer, at least at the moment, we'll refine this later,
15:05
is Tares, which means, um, of course, a vetch, the most common vetch.
15:11
Each time we make a guess here, where maybe I kind of ignore its
15:13
recommendations and go with slate, because I like slate,
15:16
we can see how much expected information it had,
15:18
but then on the right of the word here it's showing us how much actual
15:22
information we got given this particular pattern.
15:25
So here it looks like we were a little unlucky; we expected to get 5.8,
15:27
but we happened to get something less than that.
15:30
And then on the left side here it's showing us all of
15:32
the different possible words given where we are now.
15:35
The blue bars are telling us how likely it thinks each word is,
15:38
so at the moment it's assuming each word is equally likely to occur,
15:41
but we'll refine that in a moment.
15:44
And then this uncertainty measurement is telling us the entropy of this distribution
15:48
across the possible words, which right now, because it's a uniform distribution,
15:52
is just a needlessly complicated way to count the number of possibilities.
15:56
For example, if we were to take 2 to the power of 13.66,
15:59
that should be around the 13,000 possibilities.
16:02
Um, a little bit off here, but only because I'm not showing all the decimal places.
16:06
At the moment that might feel redundant and like it's overly complicating things,
16:09
but you'll see why it's useful to have both numbers in a minute.
16:12
So here it looks like it's suggesting the highest entropy for our second guess is Raman,
16:16
which again just really doesn't feel like a word.
16:19
So to take the moral high ground here I'm going to go ahead and type in Rains.
16:25
And again it looks like we were a little unlucky.
16:27
We were expecting 4.3 bits and we only got 3.39 bits of information.
16:31
So that takes us down to 55 possibilities.
16:34
And here maybe I'll just actually go with what it's suggesting,
16:37
which is combo, whatever that means.
16:40
And, okay, this is actually a good chance for a puzzle.
16:42
It's telling us this pattern gives us 4.7 bits of information.
16:47
But over on the left, before we see that pattern, there were 5.78 bits of uncertainty.
16:52
So as a quiz for you, what does that mean about the number of remaining possibilities?
16:58
Well it means that we're reduced down to 1 bit of uncertainty,
17:01
which is the same thing as saying that there's 2 possible answers.
17:04
It's a 50-50 choice.
17:06
And from here, because you and I know which words are more common,
17:09
we know that the answer should be abyss.
17:11
But as it's written right now, the program doesn't know that.
17:13
So it just keeps going, trying to gain as much information as it can,
17:16
until it's only one possibility left, and then it guesses it.
17:20
So obviously we need a better endgame strategy,
17:22
but let's say we call this version 1 of our Wordle solver,
17:25
and then we go and run some simulations to see how it does.
17:30
So the way this is working is it's playing every possible Wordle game.
17:34
It's going through all of those 2315 words that are the actual Wordle answers.
17:38
It's basically using that as a testing set.
17:41
And with this naive method of not considering how common a word is,
17:44
and just trying to maximize the information at each step along the way,
17:47
until it gets down to one and only one choice.
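A simulation harness along those lines, using the same placeholder functions from the sketches above, might be:

    def average_score(allowed_guesses, answer_list):
        # Play every answer in the test set and average the number of guesses taken;
        # games that run past six guesses are counted separately as failures.
        scores, failures = [], 0
        for answer in answer_list:
            result = play_game_v1(answer, allowed_guesses, answer_list)
            if result is None:
                failures += 1
            else:
                scores.append(result)
        return sum(scores) / len(scores), failures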
17:50
By the end of the simulation, the average score works out to be about 4.124.
17:55
Which is not bad; to be honest, I kind of expected it to do worse.
17:59
But the people who play Wordle will tell you that they can usually get it in 4.
18:02
The real challenge is to get as many in 3 as you can.
18:05
It's a pretty big jump between the score of 4 and the score of 3.
18:08
The obvious low-hanging fruit here is to somehow incorporate
18:11
whether or not a word is common, and how exactly do we do that.
18:22
The way I approached it is to get a list of the relative frequencies for all of
18:26
the words in the English language, and I just used Mathematica's WordFrequencyData
18:35
function, which itself pulls from the Google Books English n-gram public dataset.
18:35
And it's kind of fun to look at, for example if we sort
18:37
it from the most common words to the least common words.
18:40
Evidently these are the most common five-letter words in the English language.
18:43
Or rather, "these" is the 8th most common.
18:46
First is "which", after which there's "there" and "their".
18:49
"First" itself is not first, but 9th, and it makes sense that these
18:52
other words would come up more often, where those after "first" are "after",
18:55
"where", and "those", being just a little bit less common.
18:59
Now, in using this data to model how likely each of these words is to
19:03
be the final answer, it shouldn't just be proportional to the frequency,
19:07
because, for example, "which" is given a score of 0.002 in this dataset,
19:11
whereas the word "braid" is in some sense about 1000 times less likely.
19:15
But both of these are common enough words that they're almost
19:18
certainly worth considering, so we want more of a binary cutoff.
19:21
The way I went about it is to imagine taking this whole sorted list of words,
19:25
and then arranging it on an x-axis, and then applying the sigmoid function,
19:29
which is the standard way to have a function whose output is basically binary,
19:33
it's either 0 or it's 1, but there's a smoothing in between for that region of
19:37
uncertainty.
19:39
So essentially, the probability that I'm assigning to each word for being in the final
19:43
list will be the value of the sigmoid function above wherever it sits on the x-axis.
19:49
Now obviously this depends on a few parameters,
19:52
for example how wide a space on the x-axis those words fill determines how
19:56
gradually or steeply we drop off from 1 to 0,
19:58
and where we situate them left to right determines the cutoff.
20:02
And to be honest the way I did this was kind of just
20:05
licking my finger and sticking it into the wind.
20:07
I looked through the sorted list and tried to find a window where
20:10
when I looked at it I figured about half of these words are more
20:13
likely than not to be the final answer, and used that as the cutoff.
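One way to implement that sigmoid cutoff is sketched below; the parameter values are made up for illustration, since as described the real cutoff was eyeballed:

    from math import exp

    def word_priors(words_sorted_by_frequency, n_common=3000, steepness=10.0):
        # words_sorted_by_frequency runs from least common to most common.
        # Each word's weight is a sigmoid of its position, so the most common few
        # thousand words get weight near 1 and the rest get weight near 0.
        # n_common and steepness are illustrative, hand-picked values.
        n = len(words_sorted_by_frequency)
        weights = {}
        for rank, word in enumerate(words_sorted_by_frequency):
            x = steepness * (rank - (n - n_common)) / n_common
            weights[word] = 1 / (1 + exp(-x))
        total = sum(weights.values())
        return {w: v / total for w, v in weights.items()}  # normalize to a distribution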
20:17
Now once we have a distribution like this across the words,
20:19
it gives us another situation where entropy becomes this really useful measurement.
20:24
For example, let's say we were playing a game and we start with my old openers,
20:28
which were "other" and "nails", and we end up with a situation
20:30
where there's four possible words that match it.
20:33
And let's say we consider them all equally likely,
20:36
let me ask you, what is the entropy of this distribution?
20:41
Well, the information associated with each one of these possibilities is
20:45
going to be the log base 2 of 4, since each one is 1 in 4, and that's 2.
20:50
2 bits of information, 4 possibilities.
20:52
All very well and good.
20:54
But what if I told you that actually there's more than 4 matches?
20:58
In reality, when we look through the full word list, there are 16 words that match it.
21:02
But suppose our model puts a really low probability on those other 12 words of actually
21:06
being the final answer, something like 1 in 1000 because they're really obscure.
21:11
Now let me ask you, what is the entropy of this distribution?
21:15
If entropy was purely measuring the number of matches here,
21:18
then you might expect it to be something like the log base 2 of 16,
21:22
which would be 4, two more bits of uncertainty than we had before.
21:26
But of course the actual uncertainty is not really that different from what we had before.
21:30
Just because there's these 12 really obscure words doesn't mean that it would be
21:33
all that more surprising to learn that the final answer is charm, for example.
21:38
So when you actually do the calculation here and you add up the probability of
21:42
each occurrence times the corresponding information, what you get is 2.11 bits.
21:46
Which is just saying it's basically two bits, basically those four possibilities,
21:49
but there's a little more uncertainty because of all of those highly unlikely events,
21:53
though if you did learn them you'd get a ton of information from it.
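Just to check that 2.11 figure, here is one way the numbers could work out, with twelve obscure words at roughly 1-in-1000 each and the remaining probability split evenly among the four plausible ones (the exact values are illustrative):

    from math import log2

    obscure = [0.001] * 12                     # twelve very unlikely matches
    plausible = [(1 - sum(obscure)) / 4] * 4   # four words share the remaining 98.8%
    bits = sum(p * log2(1 / p) for p in obscure + plausible)
    print(round(bits, 2))                      # 2.11 bits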
21:57
So zooming out, this is part of what makes Wordle
21:59
such a nice example for an information theory lesson.
22:01
We have these two distinct feeling applications for entropy.
22:05
The first one telling us what's the expected information we'll get from a given guess,
22:09
and the second one saying can we measure the remaining uncertainty
22:13
among all of the words that are still possible.
22:16
And I should emphasize, in that first case where we're looking
22:19
at the expected information of a guess, once we have an unequal weighting to the words,
22:22
that affects the entropy calculation.
22:24
For example, let me pull up that same case we were looking at
22:27
earlier of the distribution associated with weary,
22:30
but this time using a non-uniform distribution across all possible words.
22:34
So let me see if I can find a part here that illustrates it pretty well.
22:40
Okay, here, this is pretty good.
22:42
Here we have two adjacent patterns that are about equally likely,
22:45
but one of them we're told has 32 possible words that match it.
22:49
And if we check what they are, these are those 32,
22:51
which are all just very unlikely words as you scan your eyes over them.
22:55
It's hard to find any that feel like plausible answers, maybe yells,
22:59
but if we look at the neighboring pattern in the distribution,
23:02
which is considered just about as likely, we're told that it only has 8 possible matches.
23:06
So a quarter as many matches, but it's about as likely.
23:09
And when we pull up those matches, we can see why.
23:12
Some of these are actual plausible answers like ring or wrath or raps.
23:17
To illustrate how we incorporate all that, let
23:20
me pull up version two of the Wordle bot here.
23:22
And there are two or three main differences from the first one that we saw.
23:25
First off, like I just said, the way that we're computing these entropies,
23:29
these expected values of information, is now using the more refined distributions across
23:33
the patterns that incorporates the probability that a given word would actually be the
23:37
answer.
23:38
As it happens, tares is still number one, though the ones following are a bit different.
23:44
Second, when it ranks its top picks, it's now going to keep a model of the
23:47
probability that each word is the actual answer,
23:50
and it'll incorporate that into its decision,
23:52
which is easier to see once we have a few guesses on the table.
23:55
Again, ignoring its recommendation because we can't let machines rule our lives.
24:01
And I suppose I should mention another thing different here is over on the left,
24:04
that uncertainty value, that number of bits, is no longer just
24:07
redundant with the number of possible matches.
24:10
Now if we pull it up and calculate 2 to the 8.02, which would be a little above 256,
24:15
I guess 259, what it's saying is even though there are 526 total words that
24:20
actually match this pattern, the amount of uncertainty it has is more akin
24:24
to what it would be if there were 259 equally likely outcomes.
24:29
You can think of it like this.
24:31
It knows borks is not the answer, same with yorts and zorl and zorus.
24:34
So it's a little less uncertain than it was in the previous case.
24:37
This number of bits will be smaller.
24:40
And if I keep playing the game, I'm refining this down with a couple
24:43
guesses that are apropos of what I would like to explain here.
24:48
By the fourth guess, if you look over at its top picks,
24:51
you can see it's no longer just maximizing the entropy.
24:54
So at this point, there's technically seven possibilities,
24:57
but the only ones with a meaningful chance are "dorms" and "words".
25:00
And you can see it ranks choosing both of those above all of these
25:03
other values that strictly speaking would give more information.
25:07
The very first time I did this, I just added up these two numbers to measure
25:10
the quality of each guess, which actually worked better than you might suspect.
25:14
But it really didn't feel systematic.
25:16
And I'm sure there's other approaches people could take.
25:17
But here's the one I landed on.
25:19
If we're considering the prospect of a next guess, like in this case "words",
25:23
what we really care about is the expected score of our game if we do that.
25:28
And to calculate that expected score, we say what's the probability that
25:32
"words" is the actual answer, which at the moment it ascribes 58% to.
25:36
We say with a 58% chance, our score in this game would be four.
25:40
And then with the probability of one minus that 58%,
25:43
our score will be more than that four.
25:46
How much more we don't know, but we can estimate it based on how
25:49
much uncertainty there's likely to be once we get to that point.
25:52
Specifically, at the moment, there's 1.44 bits of uncertainty.
25:56
If we guess "words", it's telling us the expected information we'll get is 1.27 bits.
26:01
So if we guess "words", this difference represents how much
26:04
uncertainty we're likely to be left with after that happens.
26:08
What we need is some kind of function, which I'm calling f here,
26:11
that associates this uncertainty with an expected score.
26:14
And the way I went about this was to just plot a bunch of the data from previous
26:18
games based on version one of the bot to say, hey,
26:21
what was the actual score after various points with certain very measurable
26:25
amounts of uncertainty?
26:27
For example, these data points here that are sitting above a value that's
26:30
around like 8.7 or so are saying for some games,
26:33
after a point at which there were 8.7 bits of uncertainty,
26:36
it took two guesses to get the final answer.
26:39
For other games, it took three guesses.
26:40
For other games, it took four guesses.
26:43
If we shift over to the left here, all the points over zero are saying whenever
26:46
there's zero bits of uncertainty, which is to say there's only one possibility,
26:50
then the number of guesses required is always just one, which is reassuring.
26:54
Whenever there was one bit of uncertainty, meaning it was essentially
26:58
just down to two possibilities, then sometimes it required one more guess,
27:01
sometimes it required two more guesses, and so on and so forth here.
27:05
Maybe a slightly easier way to visualize this
27:07
data is to bucket it together and take averages.
27:11
For example, this bar here is saying among all the points where we had one bit
27:15
of uncertainty, on average the number of new guesses required was about 1.5.
27:22
And the bar over here is saying among all of the different games where
27:25
at some point the uncertainty was a little above four bits,
27:28
which is like narrowing it down to 16 different possibilities,
27:31
then on average it requires a little more than two guesses from that point forward.
27:36
And from here I just did a regression to fit a function that seemed reasonable to this.
27:39
And remember, the whole point of doing any of that is so that we can quantify this
27:44
intuition that the more information we gain from a word,
27:47
the lower the expected score will be.
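Put together, the version-2 ranking rule might be sketched like this, where priors is the normalized probability of each remaining candidate being the answer, entropy and expected_information are the helpers sketched earlier (for simplicity this reuses the uniform-weighted expected_information, though as described version 2 weights the pattern distribution by the priors), and f is the fitted uncertainty-to-guesses function. This is a hedged reconstruction of the description, not the exact code; the bot would pick whichever guess minimizes this value.

    def expected_score(guess, turn, candidates, priors, f):
        # Probability this guess is simply the answer, under the current model.
        p_right = priors.get(guess, 0.0)
        # Uncertainty we have now, minus what this guess is expected to reveal,
        # estimates how much uncertainty would be left afterwards.
        uncertainty_now = entropy(priors.values())
        remaining = max(uncertainty_now - expected_information(guess, candidates), 0.0)
        # With probability p_right we finish on this turn; otherwise we expect to
        # need roughly f(remaining) more guesses after it.
        return p_right * turn + (1 - p_right) * (turn + f(remaining))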
27:49
So, with this as version 2.0, if we go back and run the same set of simulations,
27:54
having it play against all 2315 possible Wordle answers, how does it do?
28:00
Well in contrast to our first version, it's definitely better, which is reassuring.
28:04
All said and done, the average is around 3.6.
28:06
Although unlike the first version, there are a couple times that it loses,
28:09
and requires more than six guesses in those circumstances.
28:12
Presumably because there's times when it's making that tradeoff
28:15
to actually go for the goal rather than maximizing information.
28:19
So can we do better than 3.6?
28:22
We definitely can.
28:23
Now, I said at the start that it's most fun to try not incorporating
28:26
the true list of Wordle answers into the way that it builds its model.
28:29
But if we do incorporate it, the best performance I could get was around 3.43.
28:35
So if we try to get more sophisticated than just using word frequency data
28:38
to choose this prior distribution, this 3.43 probably puts a ceiling on how
28:42
good we could get with that, or at least how good I could get with that.
28:46
That best performance essentially just uses the ideas that I've been talking about here,
28:49
but it goes a little farther, like it does a search for the expected
28:52
information two steps forward rather than just one.
28:55
Originally I was planning on talking more about that,
28:57
but I realize we've actually gone quite long as it is.
29:00
The one thing I'll say is after doing this two-step search and
29:03
then running a couple sample simulations in the top candidates,
29:06
so far for me at least, it's looking like Crane is the best opener.
29:09
Who would have guessed?
29:10
Also if you use the true Wordle list to determine your space of possibilities,
29:14
then the uncertainty you start with is a little over 11 bits.
29:18
And it turns out just from a brute force search,
29:20
the maximum possible expected information after the first two guesses is around 10 bits.
29:26
Which suggests that best case scenario, after your first two guesses,
29:30
with perfectly optimal play, you'll be left with around one bit of uncertainty.
29:34
Which is the same as being down to two possible guesses.
29:37
But I think it's fair and probably pretty conservative to say that you could never
29:41
possibly write an algorithm that gets this average as low as three,
29:44
because with the words available to you, there's simply not room to get enough
29:48
information after only two steps to be able to guarantee the answer in the third slot
29:51
every single time without fail.
— end of transcript —