WEBVTT

00:00:00.000 --> 00:00:02.911
The game Wordle has gone pretty viral in the last month or two,

00:00:02.911 --> 00:00:05.682
and never one to overlook an opportunity for a math lesson,

00:00:05.682 --> 00:00:08.871
it occurs to me that this game makes for a very good central example

00:00:08.871 --> 00:00:12.660
in a lesson about information theory, and in particular a topic known as entropy.

00:00:13.919 --> 00:00:16.875
You see, like a lot of people I got kind of sucked into the puzzle,

00:00:16.875 --> 00:00:19.652
and like a lot of programmers I also got sucked into trying to

00:00:19.652 --> 00:00:22.739
write an algorithm that would play the game as optimally as it could.

00:00:23.179 --> 00:00:26.745
And what I thought I'd do here is just talk through with you some of my process in that,

00:00:26.745 --> 00:00:28.690
and explain some of the math that went into it,

00:00:28.690 --> 00:00:31.080
since the whole algorithm centers on this idea of entropy.

00:00:38.700 --> 00:00:41.640
First things first, in case you haven't heard of it, what is Wordle?

00:00:42.039 --> 00:00:45.573
And to kill two birds with one stone here while we go through the rules of the game,

00:00:45.573 --> 00:00:47.633
let me also preview where we're going with this,

00:00:47.633 --> 00:00:51.040
which is to develop a little algorithm that will basically play the game for us.

00:00:51.359 --> 00:00:53.784
I haven't done today's Wordle yet, this is February 4th,

00:00:53.784 --> 00:00:55.099
and we'll see how the bot does.

00:00:55.479 --> 00:00:58.113
The goal of Wordle is to guess a mystery five letter word,

00:00:58.113 --> 00:01:00.339
and you're given six different chances to guess.

00:01:00.840 --> 00:01:04.380
For example, my Wordle bot suggests that I start with the guess "crane".

00:01:05.180 --> 00:01:07.865
Each time that you make a guess, you get some information

00:01:07.864 --> 00:01:10.219
about how close your guess is to the true answer.

00:01:10.920 --> 00:01:14.100
Here the gray box is telling me there's no C in the actual answer.

00:01:14.519 --> 00:01:17.839
The yellow box is telling me there is an R, but it's not in that position.

00:01:18.239 --> 00:01:20.906
The green box is telling me that the secret word does have an A,

00:01:20.906 --> 00:01:22.239
and it's in the third position.

00:01:22.719 --> 00:01:24.579
And then there's no N and there's no E.
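
For anyone following along in code, here is a minimal sketch of that feedback rule in Python. The two-pass structure, greens first, then yellows consuming leftover answer letters, is how Wordle handles repeated letters; the function name and representation are my own.

```python
def pattern(guess: str, answer: str) -> tuple:
    """Return Wordle feedback as a tuple of 'green'/'yellow'/'gray'."""
    result = ["gray"] * 5
    leftover = []  # answer letters not matched exactly
    # First pass: greens (right letter, right position).
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "green"
        else:
            leftover.append(a)
    # Second pass: yellows (right letter, wrong position), consuming
    # each unmatched answer letter at most once.
    for i, g in enumerate(guess):
        if result[i] == "gray" and g in leftover:
            result[i] = "yellow"
            leftover.remove(g)
    return tuple(result)

# pattern("crane", "shard") -> ('gray', 'yellow', 'green', 'gray', 'gray')
```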

00:01:25.200 --> 00:01:27.340
So let me just go in and tell the Wordle bot that information.

00:01:27.340 --> 00:01:30.320
We started with "crane", we got gray, yellow, green, gray, gray.

00:01:31.420 --> 00:01:34.939
Don't worry about all the data that it's showing right now, I'll explain that in due time.

00:01:35.459 --> 00:01:38.819
But its top suggestion for our second pick is "shtik".

00:01:39.560 --> 00:01:42.795
And your guess does have to be an actual five letter word, but as you'll see,

00:01:42.795 --> 00:01:45.400
it's pretty liberal with what it will actually let you guess.

00:01:46.200 --> 00:01:47.439
In this case, we try "shtik".

00:01:48.780 --> 00:01:50.180
And alright, things are looking pretty good.

00:01:50.260 --> 00:01:53.980
We hit the S and the H, so we know the first three letters, we know that there's an R.

00:01:53.980 --> 00:01:58.700
And so it's going to be like S-H-A something R, or S-H-A R something.

00:01:59.620 --> 00:02:03.207
And it looks like the Wordle bot knows that it's down to just two possibilities,

00:02:03.207 --> 00:02:04.239
either shard or sharp.

00:02:05.099 --> 00:02:07.181
That's kind of a tossup between them at this point,

00:02:07.182 --> 00:02:10.080
so I guess probably just because it's alphabetical it goes with shard.

00:02:11.219 --> 00:02:13.780
Which hooray, is the actual answer, so we got it in three.

00:02:14.599 --> 00:02:17.524
If you're wondering if that's any good, the way I heard one person

00:02:17.524 --> 00:02:20.360
phrase it is that with Wordle, four is par and three is birdie.

00:02:20.680 --> 00:02:22.480
Which I think is a pretty apt analogy.

00:02:22.479 --> 00:02:27.019
You have to be consistently on your game to be getting four, but it's certainly not crazy.

00:02:27.180 --> 00:02:29.920
But when you get it in three, it just feels great.

00:02:30.879 --> 00:02:33.384
So if you're down for it, what I'd like to do here is just talk through

00:02:33.384 --> 00:02:35.959
my thought process from the beginning for how I approach the Wordle bot.

00:02:36.479 --> 00:02:39.439
And like I said, really it's an excuse for an information theory lesson.

00:02:39.740 --> 00:02:42.820
The main goal is to explain what is information and what is entropy.

00:02:48.219 --> 00:02:50.862
My first thought in approaching this was to take a look at the

00:02:50.862 --> 00:02:53.719
relative frequencies of different letters in the English language.

00:02:54.379 --> 00:02:56.721
So I thought, okay, is there an opening guess or an opening

00:02:56.721 --> 00:02:59.259
pair of guesses that hits a lot of these most frequent letters?

00:02:59.960 --> 00:03:03.000
And one that I was pretty fond of was doing "other" followed by "nails".

00:03:03.759 --> 00:03:06.527
The thought is that if you hit a letter, you know, you get a green or a yellow,

00:03:06.527 --> 00:03:08.840
that always feels good, it feels like you're getting information.

00:03:09.340 --> 00:03:12.167
But in these cases, even if you don't hit and you always get grays,

00:03:12.167 --> 00:03:14.108
that's still giving you a lot of information,

00:03:14.108 --> 00:03:17.400
since it's pretty rare to find a word that doesn't have any of these letters.

00:03:18.139 --> 00:03:21.030
But even still, that doesn't feel super systematic, because for example,

00:03:21.031 --> 00:03:23.200
it does nothing to consider the order of the letters.

00:03:23.560 --> 00:03:25.300
Why type "nails" when I could type "snail"?

00:03:26.080 --> 00:03:27.500
Is it better to have that S at the end?

00:03:27.819 --> 00:03:28.680
I'm not really sure.

00:03:29.240 --> 00:03:32.311
Now, a friend of mine said that he liked to open with the word "weary",

00:03:32.311 --> 00:03:35.961
which kind of surprised me because it has some uncommon letters in there like the

00:03:35.961 --> 00:03:36.540
W and the Y.

00:03:37.120 --> 00:03:39.000
But who knows, maybe that is a better opener.

00:03:39.319 --> 00:03:41.719
Is there some kind of quantitative score that we

00:03:41.719 --> 00:03:44.319
can give to judge the quality of a potential guess?

00:03:45.340 --> 00:03:48.252
Now to set up for the way that we're going to rank possible guesses,

00:03:48.252 --> 00:03:51.420
let's go back and add a little clarity to how exactly the game is set up.

00:03:51.419 --> 00:03:54.574
So there's a list of words that it will allow you to enter that

00:03:54.574 --> 00:03:57.879
are considered valid guesses that's just about 13,000 words long.

00:03:58.319 --> 00:04:01.187
But when you look at it, there's a lot of really uncommon things,

00:04:01.187 --> 00:04:03.924
things like "aahed" or "aalii" and "aargh", the kind of words

00:04:03.925 --> 00:04:06.439
that bring about family arguments in a game of Scrabble.

00:04:06.960 --> 00:04:10.540
But the vibe of the game is that the answer is always going to be a decently common word.

00:04:10.960 --> 00:04:15.360
And in fact, there's another list of around 2300 words that are the possible answers.

00:04:15.939 --> 00:04:20.115
And this is a human curated list, I think specifically by the game creator's girlfriend,

00:04:20.115 --> 00:04:21.159
which is kind of fun.

00:04:21.819 --> 00:04:25.975
But what I would like to do, our challenge for this project, is to see if we can write

00:04:25.975 --> 00:04:30.180
a program that solves Wordle without incorporating prior knowledge of this list.

00:04:30.720 --> 00:04:32.760
For one thing, there's plenty of pretty common five

00:04:32.759 --> 00:04:34.639
letter words that you won't find in that list.

00:04:34.939 --> 00:04:38.262
So it would be better to write a program that's a little more resilient and would

00:04:38.262 --> 00:04:41.459
play Wordle against anyone, not just what happens to be the official website.

00:04:41.920 --> 00:04:45.073
And also, the reason that we know what this list of possible answers is,

00:04:45.072 --> 00:04:47.000
is because it's visible in the source code.

00:04:47.000 --> 00:04:49.817
But the way that it's visible in the source code is in the

00:04:49.817 --> 00:04:52.586
specific order in which answers come up from day to day,

00:04:52.586 --> 00:04:55.840
that you could always just look up what tomorrow's answer will be.

00:04:56.420 --> 00:04:58.879
So clearly, there's some sense in which using the list is cheating.

00:04:59.100 --> 00:05:03.022
And what makes for a more interesting puzzle and a richer information theory lesson

00:05:03.021 --> 00:05:06.659
is to instead use some more universal data like relative word frequencies in

00:05:06.660 --> 00:05:10.439
general to capture this intuition of having a preference for more common words.

00:05:11.600 --> 00:05:15.900
So of these 13,000 possibilities, how should we choose the opening guess?

00:05:16.399 --> 00:05:19.779
For example, if my friend proposes weary, how should we analyze its quality?

00:05:20.519 --> 00:05:23.853
Well, the reason he said he likes that unlikely W is that he likes

00:05:23.853 --> 00:05:27.339
the long shot nature of just how good it feels if you do hit that W.

00:05:27.920 --> 00:05:31.218
For example, if the first pattern revealed was something like this,

00:05:31.218 --> 00:05:35.600
then it turns out there are only 58 words in this giant lexicon that match that pattern.

00:05:36.060 --> 00:05:38.399
So that's a huge reduction from 13,000.

00:05:38.779 --> 00:05:40.853
But the flip side of that, of course, is that

00:05:40.853 --> 00:05:43.019
it's very uncommon to get a pattern like this.

00:05:43.019 --> 00:05:46.603
Specifically, if each word was equally likely to be the answer,

00:05:46.603 --> 00:05:51.040
the probability of hitting this pattern would be 58 divided by around 13,000.
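
Counting those matches is a short continuation of the sketch above; the filename is a stand-in for the roughly 13,000 allowed guesses, one per line.

```python
# "allowed_words.txt" is a stand-in for the ~13,000-word guess list.
with open("allowed_words.txt") as f:
    words = [line.strip() for line in f]

def matches(guess: str, observed: tuple, candidates: list) -> list:
    """Words that could still be the answer, given the observed pattern."""
    return [w for w in candidates if pattern(guess, w) == observed]

observed = pattern("weary", "wordy")  # green W, yellow R, green Y
remaining = matches("weary", observed, words)
probability = len(remaining) / len(words)  # e.g. 58 / ~13,000 if uniform
```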

00:05:51.579 --> 00:05:53.599
Of course, they're not equally likely to be answers.

00:05:53.720 --> 00:05:56.220
Most of these are very obscure and even questionable words.

00:05:56.600 --> 00:05:58.465
But at least for our first pass at all of this,

00:05:58.464 --> 00:06:01.599
let's assume that they're all equally likely and then refine that a bit later.

00:06:02.019 --> 00:06:04.769
The point is, the pattern with a lot of information is,

00:06:04.769 --> 00:06:06.719
by its very nature, unlikely to occur.

00:06:07.279 --> 00:06:10.799
In fact, what it means to be informative is that it's unlikely.

00:06:11.720 --> 00:06:16.004
A much more probable pattern to see with this opening would be something like this,

00:06:16.004 --> 00:06:18.120
where, of course, there's not a W in it.

00:06:18.240 --> 00:06:21.400
Maybe there's an E, but there's no A, there's no R, there's no Y.

00:06:22.079 --> 00:06:24.560
In this case, there are 1400 possible matches.

00:06:25.079 --> 00:06:28.010
If all were equally likely, it works out to be a probability

00:06:28.011 --> 00:06:30.600
of about 11% that this is the pattern you would see.

00:06:30.899 --> 00:06:33.339
So the most likely outcomes are also the least informative.

00:06:34.240 --> 00:06:37.713
To get a more global view here, let me show you the full distribution of

00:06:37.713 --> 00:06:41.139
probabilities across all of the different patterns that you might see.

00:06:41.740 --> 00:06:45.161
So each bar that you're looking at corresponds to a possible pattern of

00:06:45.161 --> 00:06:48.918
colors that could be revealed, of which there are 3 to the 5th, or 243, possibilities,

00:06:48.918 --> 00:06:52.339
and they're organized from left to right, most common to least common.

00:06:52.920 --> 00:06:56.000
So the most common possibility here is that you get all grays.

00:06:56.100 --> 00:06:58.120
That happens about 14% of the time.
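
The whole bar chart comes from bucketing every candidate answer by the pattern it would produce, still under the equally-likely assumption and still leaning on the helpers sketched earlier.

```python
from collections import Counter

def pattern_distribution(guess: str, candidates: list) -> dict:
    """Probability of each feedback pattern if every candidate is an
    equally likely answer; at most 3**5 = 243 distinct patterns."""
    counts = Counter(pattern(guess, answer) for answer in candidates)
    total = len(candidates)
    return {pat: n / total for pat, n in counts.items()}

dist = pattern_distribution("weary", words)
# The biggest bucket is the all-gray pattern, around 14% for "weary".
```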

00:06:58.579 --> 00:07:01.885
And what you're hoping for when you make a guess is that you end up

00:07:01.886 --> 00:07:04.304
somewhere out in this long tail, like over here,

00:07:04.303 --> 00:07:07.609
where there's only 18 possibilities for what matches this pattern,

00:07:07.610 --> 00:07:09.139
that evidently look like this.

00:07:09.920 --> 00:07:11.881
Or if we venture a little farther to the left,

00:07:11.880 --> 00:07:13.799
you know, maybe we go all the way over here.

00:07:14.939 --> 00:07:16.180
Okay, here's a good puzzle for you.

00:07:16.540 --> 00:07:19.788
What are the three words in the English language that start with a W,

00:07:19.788 --> 00:07:22.000
end with a Y, and have an R somewhere in them?

00:07:22.480 --> 00:07:26.800
Turns out, the answers are, let's see, "wordy", "wormy", and "wryly".

00:07:27.500 --> 00:07:31.567
So to judge how good this word is overall, we want some kind of measure of the

00:07:31.567 --> 00:07:35.740
expected amount of information that you're going to get from this distribution.

00:07:35.740 --> 00:07:39.966
If we go through each pattern and we multiply its probability of occurring times

00:07:39.966 --> 00:07:44.720
something that measures how informative it is, that can maybe give us an objective score.

00:07:45.959 --> 00:07:49.839
Now your first instinct for what that something should be might be the number of matches.

00:07:50.160 --> 00:07:52.400
You want a lower average number of matches.

00:07:52.800 --> 00:07:56.468
But instead I'd like to use a more universal measurement that we often ascribe to

00:07:56.468 --> 00:08:00.319
information, and one that will be more flexible once we have a different probability

00:08:00.319 --> 00:08:04.259
assigned to each of these 13,000 words for whether or not they're actually the answer.

00:08:10.319 --> 00:08:14.459
The standard unit of information is the bit, which has a little bit of a funny formula,

00:08:14.459 --> 00:08:16.980
but is really intuitive if we just look at examples.

00:08:17.779 --> 00:08:21.379
If you have an observation that cuts your space of possibilities in half,

00:08:21.379 --> 00:08:23.500
we say that it has one bit of information.

00:08:24.180 --> 00:08:26.887
In our example, the space of possibilities is all possible words,

00:08:26.887 --> 00:08:30.593
and it turns out about half of the five letter words have an S, a little less than that,

00:08:30.593 --> 00:08:31.259
but about half.

00:08:31.779 --> 00:08:34.319
So that observation would give you one bit of information.

00:08:34.879 --> 00:08:39.169
If instead a new fact chops down that space of possibilities by a factor of four,

00:08:39.169 --> 00:08:41.500
we say that it has two bits of information.

00:08:41.980 --> 00:08:44.460
For example, it turns out about a quarter of these words have a T.

00:08:45.019 --> 00:08:47.701
If the observation cuts that space by a factor of eight,

00:08:47.701 --> 00:08:50.720
we say it's three bits of information, and so on and so forth.

00:08:50.899 --> 00:08:53.879
Four bits cuts it into a sixteenth, five bits cuts it into a thirty-second.

00:08:54.960 --> 00:08:58.353
So now's when you might want to take a moment, pause, and ask yourself:

00:08:58.352 --> 00:09:00.952
what is the formula for information, the number of bits,

00:09:00.952 --> 00:09:02.980
in terms of the probability of an occurrence?

00:09:03.919 --> 00:09:07.692
Well, what we're saying here is basically that when you take one half to the number of

00:09:07.692 --> 00:09:09.797
bits, that's the same thing as the probability,

00:09:09.797 --> 00:09:13.524
which is the same thing as saying two to the power of the number of bits is one over

00:09:13.524 --> 00:09:17.208
the probability, which rearranges further to saying the information is the log base

00:09:17.208 --> 00:09:18.919
two of one divided by the probability.

00:09:19.620 --> 00:09:22.279
And sometimes you see this with one more rearrangement still where

00:09:22.279 --> 00:09:24.899
the information is the negative log base two of the probability.
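
Writing that chain of rearrangements out in symbols, with I for the number of bits and p for the probability of the observation:

$$\left(\tfrac{1}{2}\right)^{I} = p \iff 2^{I} = \frac{1}{p} \iff I = \log_2\frac{1}{p} = -\log_2 p$$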

00:09:25.659 --> 00:09:28.874
Expressed like this, it can look a little bit weird to the uninitiated,

00:09:28.874 --> 00:09:31.590
but it really is just the very intuitive idea of asking how

00:09:31.590 --> 00:09:34.080
many times you've cut down your possibilities in half.

00:09:35.179 --> 00:09:37.926
Now if you're wondering, you know, I thought we were just playing a fun word game,

00:09:37.927 --> 00:09:39.300
why are logarithms entering the picture?

00:09:39.779 --> 00:09:43.964
One reason this is a nicer unit is it's just a lot easier to talk about very

00:09:43.965 --> 00:09:48.535
unlikely events, much easier to say that an observation has 20 bits of information

00:09:48.534 --> 00:09:52.939
than it is to say that the probability of such and such occurring is 0.00000095.

00:09:53.299 --> 00:09:57.332
But a more substantive reason that this logarithmic expression turned out to be a very

00:09:57.332 --> 00:10:01.459
useful addition to the theory of probability is the way that information adds together.

00:10:02.059 --> 00:10:05.154
For example, if one observation gives you two bits of information,

00:10:05.154 --> 00:10:08.908
cutting your space down by four, and then a second observation like your second

00:10:08.908 --> 00:10:11.768
guess in Wordle gives you another three bits of information,

00:10:11.768 --> 00:10:14.301
chopping you down further by another factor of eight,

00:10:14.301 --> 00:10:16.740
the two together give you five bits of information.

00:10:17.159 --> 00:10:21.019
In the same way that probabilities like to multiply, information likes to add.
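
In symbols: for independent observations the probabilities multiply, so the logs, the amounts of information, add.

$$I(p_1 p_2) = \log_2\frac{1}{p_1 p_2} = \log_2\frac{1}{p_1} + \log_2\frac{1}{p_2} = I(p_1) + I(p_2)$$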

00:10:21.960 --> 00:10:24.657
So as soon as we're in the realm of something like an expected value,

00:10:24.657 --> 00:10:27.980
where we're adding a bunch of numbers up, the logs make it a lot nicer to deal with.

00:10:28.480 --> 00:10:32.255
Let's go back to our distribution for weary and add another little tracker on here,

00:10:32.255 --> 00:10:34.939
showing us how much information there is for each pattern.

00:10:35.580 --> 00:10:39.114
The main thing I want you to notice is that the higher the probability as we get

00:10:39.114 --> 00:10:42.780
to those more likely patterns, the lower the information, the fewer bits you gain.

00:10:43.500 --> 00:10:45.715
The way we measure the quality of this guess will

00:10:45.715 --> 00:10:48.019
be to take the expected value of this information.

00:10:48.419 --> 00:10:51.173
When we go through each pattern, we say how probable is it and

00:10:51.173 --> 00:10:54.059
then we multiply that by how many bits of information do we get.

00:10:54.710 --> 00:10:58.120
And in the example of weary, that turns out to be 4.9 bits.
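
In code, this expected value (the quantity that gets named entropy shortly) is a few lines on top of the pattern distribution sketched earlier; the 4.9 and 5.8 figures are the ones quoted here, under the uniform model.

```python
from math import log2

def entropy(distribution: dict) -> float:
    """Expected information in bits: sum over patterns of p * log2(1/p)."""
    return sum(p * log2(1 / p) for p in distribution.values() if p > 0)

entropy(pattern_distribution("weary", words))  # about 4.9 bits
entropy(pattern_distribution("slate", words))  # about 5.8 bits
```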

00:10:58.559 --> 00:11:01.944
So on average, the information you get from this opening guess is as

00:11:01.945 --> 00:11:05.480
good as chopping your space of possibilities in half about five times.

00:11:05.960 --> 00:11:09.014
By contrast, an example of a guess with a higher expected

00:11:09.014 --> 00:11:11.639
information value would be something like "slate".

00:11:13.120 --> 00:11:15.620
In this case, you'll notice the distribution looks a lot flatter.

00:11:15.940 --> 00:11:20.745
In particular, the most probable occurrence of all grays only has about a 6% chance

00:11:20.745 --> 00:11:25.259
of occurring, so at minimum you're getting evidently 3.9 bits of information.

00:11:25.919 --> 00:11:28.559
But that's a minimum, more typically you'd get something better than that.

00:11:29.100 --> 00:11:32.474
And it turns out when you crunch the numbers on this one and add

00:11:32.474 --> 00:11:35.900
up all the relevant terms, the average information is about 5.8.

00:11:37.360 --> 00:11:40.503
So in contrast with weary, your space of possibilities will

00:11:40.503 --> 00:11:43.540
be about half as big after this first guess, on average.

00:11:44.419 --> 00:11:46.401
There's actually a fun story about the name for

00:11:46.402 --> 00:11:48.300
this expected value of information quantity.

00:11:48.299 --> 00:11:51.099
You see, information theory was developed by Claude Shannon,

00:11:51.100 --> 00:11:54.832
who was working at Bell Labs in the 1940s, but he was talking about some of his

00:11:54.832 --> 00:11:57.120
yet-to-be-published ideas with John von Neumann,

00:11:57.120 --> 00:12:00.852
who was this intellectual giant of the time, very prominent in math and physics

00:12:00.852 --> 00:12:03.559
and the beginnings of what was becoming computer science.

00:12:04.100 --> 00:12:07.387
And when he mentioned that he didn't really have a good name for this

00:12:07.386 --> 00:12:10.674
expected value of information quantity, von Neumann supposedly said,

00:12:10.674 --> 00:12:14.199
so the story goes, well, you should call it entropy, and for two reasons.

00:12:14.539 --> 00:12:18.519
In the first place, your uncertainty function has been used in statistical mechanics

00:12:18.519 --> 00:12:22.687
under that name, so it already has a name, and in the second place, and more important,

00:12:22.687 --> 00:12:26.759
nobody knows what entropy really is, so in a debate you'll always have the advantage.

00:12:27.700 --> 00:12:31.314
So if the name seems a little bit mysterious, and if this story is to be believed,

00:12:31.313 --> 00:12:32.459
that's kind of by design.

00:12:33.279 --> 00:12:37.331
Also if you're wondering about its relation to all of that second law of thermodynamics

00:12:37.331 --> 00:12:39.846
stuff from physics, there definitely is a connection,

00:12:39.846 --> 00:12:43.293
but in its origins Shannon was just dealing with pure probability theory,

00:12:43.293 --> 00:12:45.900
and for our purposes here, when I use the word entropy,

00:12:45.900 --> 00:12:49.579
I just want you to think the expected information value of a particular guess.

00:12:50.700 --> 00:12:53.780
You can think of entropy as measuring two things simultaneously.

00:12:54.240 --> 00:12:56.779
The first one is how flat is the distribution?

00:12:57.320 --> 00:13:01.120
The closer a distribution is to uniform, the higher that entropy will be.

00:13:01.580 --> 00:13:06.584
In our case, where there are 3 to the 5th total patterns, for a uniform distribution,

00:13:06.583 --> 00:13:11.117
observing any one of them would have information log base 2 of 3 to the 5th,

00:13:11.118 --> 00:13:16.240
which happens to be 7.92, so that is the absolute maximum that you could possibly have

00:13:16.240 --> 00:13:17.299
for this entropy.

00:13:17.840 --> 00:13:20.074
But entropy is also kind of a measure of how many

00:13:20.073 --> 00:13:22.079
possibilities there are in the first place.

00:13:22.320 --> 00:13:27.109
For example, if you happen to have some word where there's only 16 possible patterns,

00:13:27.109 --> 00:13:32.180
and each one is equally likely, this entropy, this expected information, would be 4 bits.

00:13:32.580 --> 00:13:36.653
But if you have another word where there's 64 possible patterns that could come up,

00:13:36.653 --> 00:13:40.480
and they're all equally likely, then the entropy would work out to be 6 bits.

00:13:41.500 --> 00:13:45.735
So if you see some distribution out in the wild that has an entropy of 6 bits,

00:13:45.735 --> 00:13:49.644
it's sort of like it's saying there's as much variation and uncertainty

00:13:49.644 --> 00:13:53.500
in what's about to happen as if there were 64 equally likely outcomes.

00:13:54.360 --> 00:13:57.960
For my first pass at the Wordle bot, I basically had it just do this.

00:13:57.960 --> 00:14:01.646
It goes through all of the different possible guesses that you could have,

00:14:01.645 --> 00:14:05.530
all 13,000 words, it computes the entropy for each one, or more specifically,

00:14:05.530 --> 00:14:09.764
the entropy of the distribution across all patterns that you might see for each one,

00:14:09.764 --> 00:14:13.449
and then it picks the highest, since that's the one that's likely to chop

00:14:13.450 --> 00:14:16.140
down your space of possibilities as much as possible.
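
Version 1, then, is just an argmax over the helpers we already have. This is my compressed reading of the description above, not the actual source of the bot.

```python
def best_guess_v1(allowed: list, candidates: list) -> str:
    """Pick the allowed guess whose pattern distribution over the
    remaining candidates has the highest entropy."""
    return max(allowed,
               key=lambda g: entropy(pattern_distribution(g, candidates)))

# One round of play: guess, observe a pattern, shrink the candidate set.
candidates = list(words)
guess = best_guess_v1(words, candidates)
feedback = pattern(guess, "shard")  # "shard" stands in for the secret word
candidates = matches(guess, feedback, candidates)
```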

00:14:17.139 --> 00:14:19.413
And even though I've only been talking about the first guess here,

00:14:19.413 --> 00:14:21.100
it does the same thing for the next few guesses.

00:14:21.559 --> 00:14:24.266
For example, after you see some pattern on that first guess,

00:14:24.267 --> 00:14:27.740
which would restrict you to a smaller number of possible words based on what

00:14:27.740 --> 00:14:31.799
matches with that, you just play the same game with respect to that smaller set of words.

00:14:32.259 --> 00:14:36.016
For a proposed second guess, you look at the distribution of all patterns

00:14:36.017 --> 00:14:38.951
that could occur from that more restricted set of words,

00:14:38.951 --> 00:14:42.605
you search through all 13,000 possibilities, and you find the one that

00:14:42.605 --> 00:14:43.840
maximizes that entropy.

00:14:45.419 --> 00:14:48.735
To show you how this works in action, let me just pull up a little variant of

00:14:48.735 --> 00:14:52.180
Wordle that I wrote, which shows the highlights of this analysis in the margins.

00:14:53.679 --> 00:14:56.576
So after doing all its entropy calculations, on the right here

00:14:56.576 --> 00:14:59.659
it's showing us which ones have the highest expected information.

00:15:00.279 --> 00:15:05.572
Turns out the top answer, at least at the moment, we'll refine this later,

00:15:05.572 --> 00:15:10.579
is "tares", which means, um, of course, a vetch, the most common vetch.

00:15:11.039 --> 00:15:13.982
Each time we make a guess here, where maybe I kind of ignore its

00:15:13.982 --> 00:15:16.603
recommendations and go with slate, because I like slate,

00:15:16.604 --> 00:15:18.855
we can see how much expected information it had,

00:15:18.855 --> 00:15:22.120
but then on the right of the word here it's showing us how much actual

00:15:22.120 --> 00:15:24.419
information we got given this particular pattern.

00:15:25.000 --> 00:15:27.993
So here it looks like we were a little unlucky, we expected to get 5.8,

00:15:27.993 --> 00:15:30.120
but we happened to get something with less than that.

00:15:30.600 --> 00:15:32.810
And then on the left side here it's showing us all of

00:15:32.809 --> 00:15:35.019
the different possible words given where we are now.

00:15:35.799 --> 00:15:38.651
The blue bars are telling us how likely it thinks each word is,

00:15:38.652 --> 00:15:41.776
so at the moment it's assuming each word is equally likely to occur,

00:15:41.775 --> 00:15:43.359
but we'll refine that in a moment.

00:15:44.059 --> 00:15:48.224
And then this uncertainty measurement is telling us the entropy of this distribution

00:15:48.225 --> 00:15:52.240
across the possible words, which right now, because it's a uniform distribution,

00:15:52.240 --> 00:15:55.960
is just a needlessly complicated way to count the number of possibilities.

00:15:56.559 --> 00:15:59.585
For example, if we were to take 2 to the power of 13.66,

00:15:59.586 --> 00:16:02.180
that should be around the 13,000 possibilities.

00:16:02.899 --> 00:16:06.139
Um, a little bit off here, but only because I'm not showing all the decimal places.

00:16:06.720 --> 00:16:09.838
At the moment that might feel redundant and like it's overly complicating things,

00:16:09.837 --> 00:16:12.339
but you'll see why it's useful to have both numbers in a minute.

00:16:12.759 --> 00:16:16.994
So here it looks like it's suggesting the highest entropy for our second guess is "raman",

00:16:16.994 --> 00:16:19.399
which again just really doesn't feel like a word.

00:16:19.980 --> 00:16:24.060
So to take the moral high ground here I'm going to go ahead and type in "rains".

00:16:25.440 --> 00:16:27.340
And again it looks like we were a little unlucky.

00:16:27.519 --> 00:16:31.360
We were expecting 4.3 bits and we only got 3.39 bits of information.

00:16:31.940 --> 00:16:33.940
So that takes us down to 55 possibilities.

00:16:34.899 --> 00:16:37.759
And here maybe I'll just actually go with what it's suggesting,

00:16:37.759 --> 00:16:39.439
which is "combo", whatever that means.

00:16:40.039 --> 00:16:42.919
And, okay, this is actually a good chance for a puzzle.

00:16:42.919 --> 00:16:46.379
It's telling us this pattern gives us 4.7 bits of information.

00:16:47.059 --> 00:16:51.719
But over on the left, before we see that pattern, there were 5.78 bits of uncertainty.

00:16:52.419 --> 00:16:56.339
So as a quiz for you, what does that mean about the number of remaining possibilities?

00:16:58.039 --> 00:17:01.163
Well it means that we're reduced down to 1 bit of uncertainty,

00:17:01.163 --> 00:17:04.539
which is the same thing as saying that there's 2 possible answers.

00:17:04.700 --> 00:17:05.700
It's a 50-50 choice.

00:17:06.500 --> 00:17:09.054
And from here, because you and I know which words are more common,

00:17:09.054 --> 00:17:10.640
we know that the answer should be "abyss".

00:17:11.180 --> 00:17:13.279
But as it's written right now, the program doesn't know that.

00:17:13.539 --> 00:17:16.868
So it just keeps going, trying to gain as much information as it can,

00:17:16.868 --> 00:17:19.859
until it's only one possibility left, and then it guesses it.

00:17:20.380 --> 00:17:22.611
So obviously we need a better endgame strategy,

00:17:22.611 --> 00:17:25.412
but let's say we call this version 1 of our Wordle solver,

00:17:25.412 --> 00:17:28.259
and then we go and run some simulations to see how it does.

00:17:30.359 --> 00:17:34.119
So the way this is working is it's playing every possible Wordle game.

00:17:34.240 --> 00:17:38.539
It's going through all of those 2315 words that are the actual Wordle answers.

00:17:38.539 --> 00:17:40.579
It's basically using that as a testing set.

00:17:41.359 --> 00:17:44.406
And it's using this naive method of not considering how common a word is,

00:17:44.406 --> 00:17:47.682
and just trying to maximize the information at each step along the way,

00:17:47.682 --> 00:17:49.819
until it gets down to one and only one choice.

00:17:50.359 --> 00:17:54.299
By the end of the simulation, the average score works out to be about 4.124.
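
The simulation harness is equally short, if wildly unoptimized; here `answer_list` is a stand-in for those 2,315 official answers used as the testing set.

```python
def play(answer: str, allowed: list) -> int:
    """Number of guesses version 1 needs to find `answer`."""
    candidates = list(allowed)
    for turn in range(1, 20):
        # Guess the last candidate once only one remains; otherwise
        # take the entropy-maximizing guess.
        if len(candidates) == 1:
            guess = candidates[0]
        else:
            guess = best_guess_v1(allowed, candidates)
        if guess == answer:
            return turn
        candidates = matches(guess, pattern(guess, answer), candidates)

scores = [play(answer, words) for answer in answer_list]
print(sum(scores) / len(scores))  # about 4.124 for version 1
```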

00:17:55.319 --> 00:17:59.240
Which is not bad, to be honest, I kind of expected it to do worse.

00:17:59.660 --> 00:18:02.600
But the people who play Wordle will tell you that they can usually get it in 4.

00:18:02.859 --> 00:18:05.379
The real challenge is to get as many in 3 as you can.

00:18:05.380 --> 00:18:08.080
It's a pretty big jump between the score of 4 and the score of 3.

00:18:08.859 --> 00:18:11.820
The obvious low-hanging fruit here is to somehow incorporate

00:18:11.820 --> 00:18:14.980
whether or not a word is common. So how exactly do we do that?

00:18:22.799 --> 00:18:26.688
The way I approached it is to get a list of the relative frequencies for all of

00:18:26.689 --> 00:18:30.627
the words in the English language, and I just used Mathematica's WordFrequencyData

00:18:26.689 --> 00:18:30.627
function, which itself pulls from the Google Books English Ngram public dataset.

00:18:35.460 --> 00:18:37.670
And it's kind of fun to look at, for example if we sort

00:18:37.670 --> 00:18:39.960
it from the most common words to the least common words.

00:18:40.119 --> 00:18:43.079
Evidently these are the most common 5-letter words in the English language.

00:18:43.700 --> 00:18:45.840
Or rather, "these" is the 8th most common.

00:18:46.279 --> 00:18:48.879
First is "which", after which there's "there" and "their".

00:18:49.259 --> 00:18:52.366
"First" itself is not first, but 9th, and it makes sense that these

00:18:52.366 --> 00:18:55.999
other words could come up more often, where those after "first" are "after",

00:18:55.999 --> 00:18:58.579
"where", and "those", being just a little bit less common.

00:18:59.160 --> 00:19:03.064
Now, in using this data to model how likely each of these words is to

00:19:03.064 --> 00:19:07.195
be the final answer, it shouldn't just be proportional to the frequency,

00:19:07.194 --> 00:19:11.098
because, for example, "which" is given a score of 0.002 in this dataset,

00:19:11.098 --> 00:19:15.059
whereas the word "braid" is in some sense about 1000 times less likely.

00:19:15.559 --> 00:19:18.193
But both of these are common enough words that they're almost

00:19:18.193 --> 00:19:21.000
certainly worth considering, so we want more of a binary cutoff.

00:19:21.859 --> 00:19:25.757
The way I went about it is to imagine taking this whole sorted list of words,

00:19:25.758 --> 00:19:29.604
and then arranging it on an x-axis, and then applying the sigmoid function,

00:19:29.604 --> 00:19:33.603
which is the standard way to have a function whose output is basically binary,

00:19:33.603 --> 00:19:37.602
it's either 0 or it's 1, but there's a smoothing in between for that region of

00:19:37.602 --> 00:19:38.259
uncertainty.

00:19:39.160 --> 00:19:43.826
So essentially, the probability that I'm assigning to each word for being in the final

00:19:43.826 --> 00:19:48.440
list will be the value of the sigmoid function above wherever it sits on the x-axis.
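
A sketch of that prior, with made-up values for the two knobs discussed next, where the cutoff sits and how steeply it falls.

```python
import math

def sigmoid(x: float) -> float:
    return 1 / (1 + math.exp(-x))

# sorted_words is a stand-in: the allowed guesses sorted from most to
# least frequent. CENTER and WIDTH are hand-tuned, exactly the
# finger-in-the-wind calibration described below.
CENTER, WIDTH = 3000, 500

raw = {w: sigmoid((CENTER - rank) / WIDTH)
       for rank, w in enumerate(sorted_words)}
total = sum(raw.values())
prior = {w: v / total for w, v in raw.items()}  # normalize to sum to 1
```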

00:19:49.519 --> 00:19:52.087
Now obviously this depends on a few parameters,

00:19:52.087 --> 00:19:56.184
for example how wide a space on the x-axis those words fill determines how

00:19:56.184 --> 00:19:58.697
gradually or steeply we drop off from 1 to 0,

00:19:58.698 --> 00:20:02.140
and where we situate them left to right determines the cutoff.

00:20:02.980 --> 00:20:05.009
And to be honest the way I did this was kind of just

00:20:05.009 --> 00:20:06.920
licking my finger and sticking it into the wind.

00:20:07.140 --> 00:20:10.073
I looked through the sorted list and tried to find a window where

00:20:10.073 --> 00:20:13.006
when I looked at it I figured about half of these words are more

00:20:13.006 --> 00:20:16.120
likely than not to be the final answer, and used that as the cutoff.

00:20:17.099 --> 00:20:19.888
Now once we have a distribution like this across the words,

00:20:19.888 --> 00:20:23.859
it gives us another situation where entropy becomes this really useful measurement.

00:20:24.500 --> 00:20:28.192
For example, let's say we were playing a game and we start with my old openers,

00:20:28.192 --> 00:20:30.950
which were "other" and "nails", and we end up with a situation

00:20:30.950 --> 00:20:33.240
where there's four possible words that match it.

00:20:33.559 --> 00:20:36.022
And let's say we consider them all equally likely,

00:20:36.022 --> 00:20:38.879
let me ask you, what is the entropy of this distribution?

00:20:41.079 --> 00:20:45.606
Well, the information associated with each one of these possibilities is

00:20:45.606 --> 00:20:50.259
going to be the log base 2 of 4, since each one is 1 in 4, and that's 2.

00:20:50.640 --> 00:20:52.460
2 bits of information, 4 possibilities.

00:20:52.759 --> 00:20:53.579
All very well and good.

00:20:54.299 --> 00:20:57.799
But what if I told you that actually there's more than 4 matches?

00:20:58.259 --> 00:21:02.460
In reality, when we look through the full word list, there are 16 words that match it.

00:21:02.579 --> 00:21:06.816
But suppose our model puts a really low probability on those other 12 words of actually

00:21:06.816 --> 00:21:10.759
being the final answer, something like 1 in 1000 because they're really obscure.

00:21:11.500 --> 00:21:14.259
Now let me ask you, what is the entropy of this distribution?

00:21:15.420 --> 00:21:18.546
If entropy was purely measuring the number of matches here,

00:21:18.546 --> 00:21:22.150
then you might expect it to be something like the log base 2 of 16,

00:21:22.150 --> 00:21:25.700
which would be 4, two more bits of uncertainty than we had before.

00:21:26.180 --> 00:21:29.860
But of course the actual uncertainty is not really that different from what we had before.

00:21:30.160 --> 00:21:33.783
Just because there's these 12 really obscure words doesn't mean that it would be

00:21:33.782 --> 00:21:37.359
all that more surprising to learn that the final answer is "charm", for example.

00:21:38.180 --> 00:21:42.049
So when you actually do the calculation here and you add up the probability of

00:21:42.049 --> 00:21:46.019
each occurrence times the corresponding information, what you get is 2.11 bits.

00:21:46.019 --> 00:21:49.375
It's just saying it's basically two bits, basically those four possibilities,

00:21:49.375 --> 00:21:53.327
but there's a little more uncertainty because of all of those highly unlikely events,

00:21:53.327 --> 00:21:56.500
though if one of them did turn out to be the answer, you'd get a ton of information from it.
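
Checking that 2.11 figure directly, under the stated assumption of four roughly even front-runners and twelve 1-in-1000 long shots:

```python
from math import log2

probs = [(1 - 12 * 0.001) / 4] * 4 + [0.001] * 12  # sums to 1
print(sum(p * log2(1 / p) for p in probs))  # about 2.11 bits
```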

00:21:57.160 --> 00:21:59.177
So zooming out, this is part of what makes Wordle

00:21:59.176 --> 00:22:01.399
such a nice example for an information theory lesson.

00:22:01.599 --> 00:22:04.639
We have these two distinct-feeling applications for entropy.

00:22:05.160 --> 00:22:09.726
The first one telling us what's the expected information we'll get from a given guess,

00:22:09.726 --> 00:22:13.283
and the second one saying can we measure the remaining uncertainty

00:22:13.282 --> 00:22:15.459
among all of the words still possible.

00:22:16.460 --> 00:22:19.125
And I should emphasize, in that first case where we're looking

00:22:19.125 --> 00:22:22.906
at the expected information of a guess, once we have an unequal weighting to the words,

00:22:22.906 --> 00:22:24.539
that affects the entropy calculation.

00:22:24.980 --> 00:22:27.846
For example, let me pull up that same case we were looking at

00:22:27.846 --> 00:22:30.242
earlier of the distribution associated with "weary",

00:22:30.242 --> 00:22:33.720
but this time using a non-uniform distribution across all possible words.

00:22:34.500 --> 00:22:38.279
So let me see if I can find a part here that illustrates it pretty well.

00:22:40.940 --> 00:22:42.360
Okay, here, this is pretty good.

00:22:42.359 --> 00:22:45.755
Here we have two adjacent patterns that are about equally likely,

00:22:45.756 --> 00:22:49.100
but one of them we're told has 32 possible words that match it.

00:22:49.279 --> 00:22:51.869
And if we check what they are, these are those 32,

00:22:51.869 --> 00:22:55.599
which are all just very unlikely words as you scan your eyes over them.

00:22:55.839 --> 00:22:59.168
It's hard to find any that feel like plausible answers, maybe "yells",

00:22:59.169 --> 00:23:02.254
but if we look at the neighboring pattern in the distribution,

00:23:02.253 --> 00:23:06.659
which is considered just about as likely, we're told that it only has 8 possible matches.

00:23:06.880 --> 00:23:09.520
So a quarter as many matches, but it's about as likely.

00:23:09.859 --> 00:23:12.139
And when we pull up those matches, we can see why.

00:23:12.500 --> 00:23:16.299
Some of these are actual plausible answers like "wring" or "wrath" or "wraps".

00:23:17.900 --> 00:23:20.100
To illustrate how we incorporate all that, let

00:23:20.099 --> 00:23:22.299
me pull up version two of the Wordle bot here.

00:23:22.559 --> 00:23:25.279
And there are two or three main differences from the first one that we saw.

00:23:25.859 --> 00:23:29.410
First off, like I just said, the way that we're computing these entropies,

00:23:29.411 --> 00:23:33.681
these expected values of information, is now using the more refined distributions across

00:23:33.681 --> 00:23:37.855
the patterns that incorporates the probability that a given word would actually be the

00:23:37.855 --> 00:23:38.240
answer.

00:23:38.880 --> 00:23:43.820
As it happens, "tares" is still number one, though the ones following are a bit different.

00:23:44.359 --> 00:23:47.764
Second, when it ranks its top picks, it's now going to keep a model of the

00:23:47.765 --> 00:23:50.019
probability that each word is the actual answer,

00:23:50.019 --> 00:23:52.134
and it'll incorporate that into its decision,

00:23:52.134 --> 00:23:55.079
which is easier to see once we have a few guesses on the table.

00:23:55.859 --> 00:23:59.779
Again, ignoring its recommendation because we can't let machines rule our lives.

00:24:01.140 --> 00:24:04.719
And I suppose I should mention another thing different here is over on the left,

00:24:04.719 --> 00:24:07.537
that uncertainty value, that number of bits, is no longer just

00:24:07.537 --> 00:24:09.640
redundant with the number of possible matches.

00:24:10.079 --> 00:24:15.407
Now if we pull it up and calculate 2 to the 8.02, which would be a little above 256,

00:24:15.407 --> 00:24:20.227
I guess 259, what it's saying is even though there are 526 total words that

00:24:20.228 --> 00:24:24.984
actually match this pattern, the amount of uncertainty it has is more akin

00:24:24.983 --> 00:24:28.980
to what it would be if there were 259 equally likely outcomes.

00:24:29.720 --> 00:24:30.740
You can think of it like this.

00:24:31.019 --> 00:24:34.660
It knows "borks" is not the answer, same with "yorts" and "zorl" and "zorus".

00:24:34.660 --> 00:24:37.680
So it's a little less uncertain than it was in the previous case.

00:24:37.819 --> 00:24:39.279
This number of bits will be smaller.

00:24:40.220 --> 00:24:43.500
And if I keep playing the game, I'm refining this down with a couple

00:24:43.500 --> 00:24:46.539
guesses that are apropos of what I would like to explain here.

00:24:48.359 --> 00:24:51.035
By the fourth guess, if you look over at its top picks,

00:24:51.036 --> 00:24:53.759
you can see it's no longer just maximizing the entropy.

00:24:54.460 --> 00:24:57.236
So at this point, there's technically seven possibilities,

00:24:57.236 --> 00:25:00.300
but the only ones with a meaningful chance are "dorms" and "words".

00:25:00.299 --> 00:25:03.534
And you can see it ranks choosing both of those above all of these

00:25:03.535 --> 00:25:06.720
other options that, strictly speaking, would give more information.

00:25:07.240 --> 00:25:10.484
The very first time I did this, I just added up these two numbers to measure

00:25:10.484 --> 00:25:13.899
the quality of each guess, which actually worked better than you might suspect.

00:25:14.299 --> 00:25:15.899
But it really didn't feel systematic.

00:25:16.099 --> 00:25:17.879
And I'm sure there's other approaches people could take.

00:25:17.900 --> 00:25:19.340
But here's the one I landed on.

00:25:19.759 --> 00:25:23.829
If we're considering the prospect of a next guess, like in this case "words",

00:25:23.829 --> 00:25:27.899
what we really care about is the expected score of our game if we do that.

00:25:28.230 --> 00:25:32.146
And to calculate that expected score, we say what's the probability that

00:25:32.146 --> 00:25:35.899
"words" is the actual answer, which at the moment it ascribes 58% to.

00:25:36.039 --> 00:25:39.539
We say with a 58% chance, our score in this game would be four.

00:25:40.319 --> 00:25:43.359
And then with the probability of one minus that 58%,

00:25:43.359 --> 00:25:45.639
our score will be more than four.

00:25:46.220 --> 00:25:49.316
How much more we don't know, but we can estimate it based on how

00:25:49.316 --> 00:25:52.460
much uncertainty there's likely to be once we get to that point.

00:25:52.960 --> 00:25:55.940
Specifically, at the moment, there's 1.44 bits of uncertainty.

00:25:56.440 --> 00:26:01.120
If we guess "words", it's telling us the expected information we'll get is 1.27 bits.

00:26:01.619 --> 00:26:04.537
So if we guess "words", this difference represents how much

00:26:04.538 --> 00:26:07.660
uncertainty we're likely to be left with after that happens.

00:26:08.259 --> 00:26:11.158
What we need is some kind of function, which I'm calling f here,

00:26:11.159 --> 00:26:13.740
that associates this uncertainty with an expected score.
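
Here is one way those pieces might fit together, reusing the earlier helpers; `f` is the uncertainty-to-score function fit below, and the bot's actual bookkeeping may differ from this sketch.

```python
from math import log2

def word_entropy(candidates: list, prior: dict) -> float:
    """Uncertainty, in bits, of the renormalized answer distribution."""
    total = sum(prior[w] for w in candidates)
    return sum((prior[w] / total) * log2(total / prior[w])
               for w in candidates if prior[w] > 0)

def expected_score(guess: str, candidates: list, prior: dict,
                   guesses_so_far: int, f) -> float:
    """With probability p this guess is the answer and the game ends at
    guesses_so_far + 1; otherwise we expect to pay roughly f() of the
    uncertainty likely to remain after the guess."""
    total = sum(prior[w] for w in candidates)
    p = prior.get(guess, 0.0) / total
    remaining = (word_entropy(candidates, prior)
                 - entropy(pattern_distribution(guess, candidates)))
    return (p * (guesses_so_far + 1)
            + (1 - p) * (guesses_so_far + 1 + f(remaining)))
```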

00:26:14.240 --> 00:26:18.458
And the way I went about this was to just plot a bunch of the data from previous

00:26:18.458 --> 00:26:21.113
games based on version one of the bot to say, hey,

00:26:21.113 --> 00:26:25.070
what was the actual score after various points with certain very measurable

00:26:25.069 --> 00:26:26.319
amounts of uncertainty?

00:26:27.019 --> 00:26:30.876
For example, these data points here that are sitting above a value that's

00:26:30.876 --> 00:26:33.464
around like 8.7 or so are saying for some games,

00:26:33.464 --> 00:26:36.582
after a point at which there were 8.7 bits of uncertainty,

00:26:36.583 --> 00:26:38.960
it took two guesses to get the final answer.

00:26:39.319 --> 00:26:40.659
For other games, it took three guesses.

00:26:40.819 --> 00:26:42.240
For other games, it took four guesses.

00:26:43.140 --> 00:26:46.862
If we shift over to the left here, all the points over zero are saying whenever

00:26:46.862 --> 00:26:50.632
there's zero bits of uncertainty, which is to say there's only one possibility,

00:26:50.632 --> 00:26:54.259
then the number of guesses required is always just one, which is reassuring.

00:26:54.779 --> 00:26:58.167
Whenever there was one bit of uncertainty, meaning it was essentially

00:26:58.167 --> 00:27:01.851
just down to two possibilities, then sometimes it required one more guess,

00:27:01.852 --> 00:27:05.240
sometimes it required two more guesses, and so on and so forth here.

00:27:05.740 --> 00:27:07.884
Maybe a slightly easier way to visualize this

00:27:07.884 --> 00:27:10.220
data is to bucket it together and take averages.

00:27:11.000 --> 00:27:15.509
For example, this bar here is saying among all the points where we had one bit

00:27:15.509 --> 00:27:19.960
of uncertainty, on average the number of new guesses required was about 1.5.

00:27:22.140 --> 00:27:25.486
And the bar over here is saying among all of the different games where

00:27:25.486 --> 00:27:28.354
at some point the uncertainty was a little above four bits,

00:27:28.354 --> 00:27:31.365
which is like narrowing it down to 16 different possibilities,

00:27:31.365 --> 00:27:35.380
then on average it requires a little more than two guesses from that point forward.

00:27:36.059 --> 00:27:39.460
And from here I just did a regression to fit a function that seemed reasonable to this.
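
As a sketch of that last step: bucket the logged pairs of (uncertainty, guesses still needed) and fit a curve. `history` is a stand-in for the version-1 game logs, and the functional form is only my guess at "a function that seemed reasonable", anchored so that zero uncertainty means exactly one guess.

```python
import numpy as np
from scipy.optimize import curve_fit

# history: list of (bits_of_uncertainty, guesses_still_needed) pairs
# logged from version-1 games (a stand-in here).
bits, needed = np.array(history).T

# Bucket by rounded bit count and average, as in the bar chart.
for b in np.unique(np.round(bits)):
    print(b, needed[np.round(bits) == b].mean())

# Fit f(bits) ~ 1 + a * log2(1 + bits), so that f(0) = 1.
model = lambda x, a: 1 + a * np.log2(1 + x)
(a_fit,), _ = curve_fit(model, bits, needed)
f = lambda x: model(x, a_fit)
```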

00:27:39.980 --> 00:27:44.140
And remember, the whole point of doing any of that is so that we can quantify this

00:27:44.140 --> 00:27:47.032
intuition that the more information we gain from a word,

00:27:47.031 --> 00:27:48.960
the lower the expected score will be.

00:27:49.680 --> 00:27:54.679
So, with this as version 2.0, if we go back and run the same set of simulations,

00:27:54.679 --> 00:27:59.240
having it play against all 2315 possible Wordle answers, how does it do?

00:28:00.279 --> 00:28:03.420
Well in contrast to our first version, it's definitely better, which is reassuring.

00:28:04.019 --> 00:28:06.180
All said and done, the average is around 3.6.

00:28:06.539 --> 00:28:09.896
Although unlike the first version, there are a couple times that it loses,

00:28:09.896 --> 00:28:12.119
and requires more than six guesses in those circumstances.

00:28:12.640 --> 00:28:15.269
Presumably because there's times when it's making that tradeoff

00:28:15.269 --> 00:28:17.940
to actually go for the goal rather than maximizing information.

00:28:19.039 --> 00:28:21.000
So can we do better than 3.6?

00:28:22.079 --> 00:28:22.919
We definitely can.

00:28:23.279 --> 00:28:26.253
Now, I said at the start that it's most fun to try not incorporating

00:28:26.253 --> 00:28:29.359
the true list of Wordle answers into the way that it builds its model.

00:28:29.880 --> 00:28:34.180
But if we do incorporate it, the best performance I could get was around 3.43.

00:28:35.160 --> 00:28:38.719
So if we try to get more sophisticated than just using word frequency data

00:28:38.719 --> 00:28:42.229
to choose this prior distribution, this 3.43 probably gives a cap on how

00:28:42.229 --> 00:28:45.740
good we could get with that, or at least how good I could get with that.

00:28:46.240 --> 00:28:49.979
That best performance essentially just uses the ideas that I've been talking about here,

00:28:49.979 --> 00:28:52.911
but it goes a little farther, like it does a search for the expected

00:28:52.911 --> 00:28:55.120
information two steps forward rather than just one.

00:28:55.619 --> 00:28:57.876
Originally I was planning on talking more about that,

00:28:57.876 --> 00:29:00.220
but I realize we've actually gone quite long as it is.

00:29:00.579 --> 00:29:03.302
The one thing I'll say is after doing this two-step search and

00:29:03.303 --> 00:29:06.114
then running a couple sample simulations on the top candidates,

00:29:06.114 --> 00:29:09.100
so far for me at least, it's looking like "crane" is the best opener.

00:29:09.099 --> 00:29:10.059
Who would have guessed?

00:29:10.920 --> 00:29:14.764
Also if you use the true Wordle list to determine your space of possibilities,

00:29:14.763 --> 00:29:17.819
then the uncertainty you start with is a little over 11 bits.

00:29:18.299 --> 00:29:20.955
And it turns out just from a brute force search,

00:29:20.955 --> 00:29:25.879
the maximum possible expected information after the first two guesses is around 10 bits.

00:29:26.500 --> 00:29:30.231
Which suggests that best case scenario, after your first two guesses,

00:29:30.231 --> 00:29:34.559
with perfectly optimal play, you'll be left with around one bit of uncertainty.

00:29:34.799 --> 00:29:37.319
Which is the same as being down to two possible guesses.

00:29:37.740 --> 00:29:41.431
But I think it's fair and probably pretty conservative to say that you could never

00:29:41.431 --> 00:29:44.491
possibly write an algorithm that gets this average as low as three,

00:29:44.491 --> 00:29:48.048
because with the words available to you, there's simply not room to get enough

00:29:48.048 --> 00:29:51.920
information after only two steps to be able to guarantee the answer in the third slot

00:29:51.920 --> 00:29:53.360
every single time without fail.
