The game Wordle has gone pretty viral in the last month or two, and never one to overlook an opportunity for a math lesson, it occurs to me that this game makes for a very good central example in a lesson about information theory, and in particular a topic known as entropy.

You see, like a lot of people I got kind of sucked into the puzzle, and like a lot of programmers I also got sucked into trying to write an algorithm that would play the game as optimally as it could. What I thought I'd do here is just talk through with you some of my process in that and explain some of the math that went into it, since the whole algorithm centers on this idea of entropy.

First things first, in case you haven't heard of it, what is Wordle? And to kill two birds with one stone here, while we go through the rules of the game let me also preview where we're going with this, which is to develop a little algorithm that will basically play the game for us. I haven't done today's Wordle yet — this is February 4th — so we'll see how the bot does.

The goal of Wordle is to guess a mystery five-letter word, and you're given six different chances to guess. For example, my Wordle bot suggests that I start with the guess "crane". Each time you make a guess, you get some information about how close your guess is to the true answer. Here the grey box is telling me there's no C in the actual answer. The yellow box is telling me there is an R, but it's not in that position. The green box is telling me that the secret word does have an A, and it's in the third position. And then there's no N and there's no E.

So let me just go in and tell the Wordle bot that information: we started with crane, and we got grey, yellow, green, grey, grey. Don't worry about all the data that it's showing right now; I'll explain that in due time. But its top suggestion for our second pick is "shtik". Your guess does have to be an actual five-letter word, but as you'll see, the game is pretty liberal with what it will actually let you guess. In this case, we try "shtik".

And all right, things are looking pretty good. We hit the S and the H, so we know the first three letters, and we know that there's an R. So it's going to be S-H-A-something-R, or S-H-A-R-something.
And it looks like the Wordle bot knows that it's down to just two possibilities: either shard or sharp. It's kind of a tossup between them at this point, so I guess, probably just because it's alphabetical, it goes with shard. Which, hooray, is the actual answer, so we got it in three.

If you're wondering whether that's any good, the way I heard one person phrase it is that with Wordle, four is par and three is birdie, which I think is a pretty apt analogy. You have to be consistently on your game to be getting four, but it's certainly not crazy. But when you get it in three, it just feels great.

So if you're down for it, what I'd like to do here is just talk through my thought process from the beginning for how I approached the Wordle bot. And like I said, really it's an excuse for an information theory lesson. The main goal is to explain what information is and what entropy is.

My first thought in approaching this was to take a look at the relative frequencies of different letters in the English language. So I thought, okay, is there an opening guess, or an opening pair of guesses, that hits a lot of these most frequent letters? One that I was pretty fond of was "other" followed by "nails". The thought is that if you hit a letter, meaning you get a green or a yellow, that always feels good; it feels like you're getting information. But in these cases, even if you don't hit and you get all greys, that's still giving you a lot of information, since it's pretty rare to find a word that doesn't have any of these letters.

But even still, that doesn't feel super systematic, because, for example, it does nothing to consider the order of the letters. Why type "nails" when I could type "snail"? Is it better to have that S at the end? I'm not really sure. Now, a friend of mine said that he liked to open with the word "weary", which kind of surprised me, because it has some uncommon letters in there, like the W and the Y. But who knows, maybe that is a better opener. Is there some kind of quantitative score that we can give to judge the quality of a potential guess?

To set up for the way that we're going to rank possible guesses, let's go back and add a little clarity to how exactly the game is set up.
There's a list of words that the game will allow you to enter, considered valid guesses, that's just about 13,000 words long. But when you look at it, there are a lot of really uncommon entries, things like "aahed", "aalii", and "aargh", the kind of words that bring about family arguments in a game of Scrabble. The vibe of the game, though, is that the answer is always going to be a decently common word. And in fact, there's another list of around 2,300 words that are the possible answers. This is a human-curated list, I think specifically by the game creator's girlfriend, which is kind of fun.

But what I would like to do, our challenge for this project, is to see if we can write a program that solves Wordle without incorporating previous knowledge of that answer list. For one thing, there are plenty of pretty common five-letter words that you won't find in it, so it would be better to write a program that's a little more resilient, one that would play Wordle against anyone, not just what happens to be on the official website. And also, the reason we know this list of possible answers at all is that it's visible in the source code, and it's visible there in the specific order in which answers come up from day to day, so you could always just look up what tomorrow's answer will be. Clearly there's some sense in which using the list is cheating. What makes for a more interesting puzzle and a richer information theory lesson is to instead use some more universal data, like relative word frequencies in general, to capture this intuition of having a preference for more common words.

So of these 13,000 possibilities, how should we choose the opening guess? For example, if my friend proposes "weary", how should we analyze its quality? Well, the reason he said he likes that unlikely W is the long-shot nature of just how good it feels if you do hit that W. For example, if the first pattern revealed was something like this, then it turns out there are only 58 words in this giant lexicon that match that pattern. That's a huge reduction from 13,000. But the flip side, of course, is that it's very uncommon to get a pattern like this. Specifically, if each word were equally likely to be the answer, the probability of hitting this pattern would be 58 divided by around 13,000.
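By the way, if you want to follow along in code, here is a rough sketch, in Python, of how you might compute the colored pattern for a guess against a candidate answer and then count matches the way I just described. The list allowed_words is a hypothetical stand-in for that roughly 13,000-word guess list; this isn't exactly what my bot does, just one reasonable way to set it up.

    from collections import Counter

    def pattern(guess, answer):
        # Wordle feedback for one guess: 'g' = green, 'y' = yellow, 'x' = grey.
        result = ['x'] * 5
        # Letters of the answer not used up by exact (green) matches,
        # so repeated letters don't produce too many yellows.
        leftover = Counter(a for g, a in zip(guess, answer) if g != a)
        for i, (g, a) in enumerate(zip(guess, answer)):
            if g == a:
                result[i] = 'g'
        for i, (g, a) in enumerate(zip(guess, answer)):
            if g != a and leftover[g] > 0:
                result[i] = 'y'
                leftover[g] -= 1
        return tuple(result)

    def count_matches(guess, observed, word_list):
        # How many candidate answers would have produced exactly this pattern?
        return sum(pattern(guess, w) == observed for w in word_list)

    # e.g. count_matches("weary", observed, allowed_words) / len(allowed_words)
    # is the probability of seeing that pattern, under the equal-likelihood assumption.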
Of course, they're not all equally likely to be answers; most of these are very obscure, even questionable, words. But at least for our first pass at all of this, let's assume they're all equally likely and then refine that a bit later. The point is, a pattern with a lot of information is, by its very nature, unlikely to occur. In fact, what it means to be informative is that it's unlikely.

A much more probable pattern to see with this opening would be something like this, where of course there's no W in it, maybe there's an E, and there's no A, no R, no Y. In this case there are 1,400 possible matches. If all were equally likely, it works out to a probability of about 11% that this is the pattern you would see. So the most likely outcomes are also the least informative.

To get a more global view here, let me show you the full distribution of probabilities across all of the different patterns that you might see. Each bar you're looking at corresponds to a possible pattern of colors that could be revealed, of which there are 3^5 = 243 possibilities, and they're organized from left to right, most common to least common. The most common possibility here is that you get all greys; that happens about 14% of the time. And what you're hoping for when you make a guess is that you end up somewhere out in this long tail, like over here, where there are only 18 possibilities for what matches this pattern, which evidently look like this. Or if we venture a little farther to the left, maybe we go all the way over here.

Okay, here's a good puzzle for you: what are the three words in the English language that start with a W, end with a Y, and have an R somewhere in them? Turns out the answers are, let's see, wordy, wormy, and wryly.

So to judge how good this word is overall, we want some kind of measure of the expected amount of information that you're going to get from this distribution. If we go through each pattern and multiply its probability of occurring by something that measures how informative it is, that can maybe give us an objective score. Now, your first instinct for what that something should be might be the number of matches; you want a lower average number of matches.
But instead I'd like to use a more universal measurement that we often ascribe to information, one that will be more flexible once we assign a different probability to each of these 13,000 words for whether or not they're actually the answer.

The standard unit of information is the bit, which has a slightly funny-looking formula but is really intuitive if we just look at examples. If you have an observation that cuts your space of possibilities in half, we say that it has one bit of information. In our example, the space of possibilities is all possible words, and it turns out about half of the five-letter words have an S — a little less than half, but about half — so that observation would give you one bit of information. If instead a new fact chops down the space of possibilities by a factor of four, we say that it has two bits of information; for example, about a quarter of these words have a T. If an observation cuts the space by a factor of eight, we say it's three bits of information, and so on: four bits cuts it to a sixteenth, five bits to a thirty-second.

So now is when you might want to pause and ask yourself: what is the formula for information, for the number of bits, in terms of the probability of an occurrence? What we're saying here is basically that one half raised to the number of bits equals the probability, which is the same as saying two to the power of the number of bits is one over the probability, which rearranges to say the information is the log base two of one divided by the probability. Sometimes you see this with one more rearrangement still, where the information is the negative log base two of the probability. Expressed like this it can look a little weird to the uninitiated, but it really is just the very intuitive idea of asking how many times you've cut your possibilities in half.

Now, if you're wondering, "I thought we were just playing a fun word game, why are logarithms entering the picture?", one reason this is a nicer unit is that it's just a lot easier to talk about very unlikely events. It's much easier to say that an observation has 20 bits of information than it is to say that the probability of such-and-such occurring is 0.00000095.
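Written out as a tiny sketch, just to make that formula concrete, the information in bits of an observation with probability p looks like this:

    import math

    def information(p):
        # Bits of information from observing an outcome of probability p.
        return math.log2(1 / p)

    information(1 / 2)        # 1.0  -- the possibilities were cut in half
    information(1 / 4)        # 2.0  -- cut down by a factor of four
    information(58 / 13000)   # about 7.8 -- the long-shot weary pattern from earlier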
But a more substantive reason that this logarithmic expression turned out to be a very useful addition to the theory of probability is the way that information adds together. For example, if one observation gives you two bits of information, cutting your space down by a factor of four, and then a second observation, like your second guess in Wordle, gives you another three bits of information, chopping things down further by another factor of eight, the two together give you five bits of information. In the same way that probabilities like to multiply, information likes to add. So as soon as we're in the realm of something like an expected value, where we're adding a bunch of numbers up, the logs make things a lot nicer to deal with.

Let's go back to our distribution for weary and add another little tracker on here showing how much information there is for each pattern. The main thing I want you to notice is that the higher the probability, as we get to those more likely patterns, the lower the information — the fewer bits you gain. The way we measure the quality of this guess will be to take the expected value of this information: we go through each pattern, ask how probable it is, and multiply that by how many bits of information we get. In the example of weary, that turns out to be 4.9 bits. So on average, the information you get from this opening guess is as good as chopping your space of possibilities in half about five times.

By contrast, an example of a guess with a higher expected information value would be something like slate. In this case, you'll notice the distribution looks a lot flatter. In particular, the most probable outcome, all greys, only has about a 6% chance of occurring, so at minimum you're getting, evidently, 3.9 bits of information. But that's a minimum; more typically you'd get something better. And it turns out when you crunch the numbers on this one and add up all the relevant terms, the average information is about 5.8 bits. So in contrast with weary, your space of possibilities will be about half as big after this first guess, on average.

There's actually a fun story about the name for this expected value of information quantity.
You see, information theory was developed by Claude Shannon, who was working at Bell Labs in the 1940s. He was talking about some of his yet-to-be-published ideas with John von Neumann, who was this intellectual giant of the time, very prominent in math and physics and the beginnings of what was becoming computer science. And when he mentioned that he didn't really have a good name for this expected-value-of-information quantity, von Neumann supposedly said, so the story goes: you should call it entropy, and for two reasons. In the first place, your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage.

So if the name seems a little mysterious, and if this story is to be believed, that's kind of by design. Also, if you're wondering about its relation to all that second-law-of-thermodynamics stuff from physics, there definitely is a connection, but in its origins Shannon was dealing with pure probability theory. For our purposes here, when I use the word entropy, I just want you to think of the expected information value of a particular guess.

You can think of entropy as measuring two things simultaneously. The first is: how flat is the distribution? The closer a distribution is to uniform, the higher the entropy will be. In our case, where there are 3^5 total patterns, a uniform distribution would mean that observing any one of them carries log base 2 of 3^5 bits of information, which happens to be about 7.92, so that is the absolute maximum you could possibly have for this entropy. But entropy is also a kind of measure of how many possibilities there are in the first place. For example, if you happen to have some word where there are only 16 possible patterns, each equally likely, this entropy, this expected information, would be 4 bits. But if you have another word where there are 64 possible patterns that could come up, all equally likely, then the entropy would work out to be 6 bits. So if you see some distribution out in the wild that has an entropy of 6 bits, it's sort of saying there's as much variation and uncertainty in what's about to happen as if there were 64 equally likely outcomes.
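Putting those pieces together, here is a sketch of how you could score a single guess by this expected-information measure, reusing the pattern function sketched earlier and assuming, for now, that every remaining word is equally likely. Run over the full lexicon, a calculation like this is what produces numbers like the 4.9 bits for weary and 5.8 bits for slate quoted above.

    from collections import Counter
    import math

    def entropy_of_guess(guess, possible_words):
        # Expected information (in bits) from a guess: for each pattern that
        # could come up, weight its information by its probability of occurring.
        counts = Counter(pattern(guess, w) for w in possible_words)
        total = len(possible_words)
        return sum((c / total) * math.log2(total / c) for c in counts.values())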
For my first pass at the Wordle bot, I basically had it just do this. It goes through all of the different possible guesses you could make, all 13,000 words, and it computes the entropy for each one — or more specifically, the entropy of the distribution across all patterns that you might see for that guess — and then it picks the highest, since that's the one likely to chop down your space of possibilities as much as possible. And even though I've only been talking about the first guess here, it does the same thing for the next few guesses. For example, after you see some pattern on that first guess, which restricts you to the smaller set of possible words consistent with it, you just play the same game with respect to that smaller set: for a proposed second guess, you look at the distribution of all patterns that could occur from that more restricted set of words, you search through all 13,000 possibilities, and you find the one that maximizes that entropy.

To show you how this works in action, let me pull up a little variant of Wordle that I wrote, which shows the highlights of this analysis in the margins. After doing all its entropy calculations, on the right here it shows us which guesses have the highest expected information. It turns out the top answer, at least at the moment (we'll refine this later), is "tares", which means, um, of course, a vetch — the most common vetch. Each time we make a guess here, where maybe I kind of ignore its recommendations and go with slate, because I like slate, we can see how much expected information it had, and then to the right of the word it shows how much actual information we got, given the particular pattern that came up. Here it looks like we were a little unlucky: we expected to get 5.8 bits, but we happened to get something less than that.

Then on the left side it shows all of the different possible words given where we are now. The blue bars tell us how likely it thinks each word is; at the moment it's assuming each word is equally likely to be the answer, but we'll refine that in a moment.
And then this uncertainty measurement is telling us the entropy of this distribution across the possible words, which right now, because it's a uniform distribution, is just a needlessly complicated way to count the number of possibilities. For example, if we take 2 to the power of 13.66, that should be around the 13,000 possibilities — it's a little bit off here, but only because I'm not showing all the decimal places. At the moment that might feel redundant, like it's overly complicating things, but you'll see why it's useful to have both numbers in a minute.

So here it looks like the suggestion with the highest entropy for our second guess is "raman", which, again, just really doesn't feel like a word. So to take the moral high ground here, I'm going to go ahead and type in "rains". And again it looks like we were a little unlucky: we were expecting 4.3 bits and we only got 3.39 bits of information, which takes us down to 55 possibilities. Here maybe I'll actually go with what it's suggesting, which is "combo", whatever that means.

And okay, this is actually a good chance for a puzzle. It's telling us this pattern gives us 4.7 bits of information, but over on the left, before we saw that pattern, there were 5.78 bits of uncertainty. So as a quiz for you: what does that mean about the number of remaining possibilities? Well, it means we're reduced down to about 1 bit of uncertainty, which is the same thing as saying there are 2 possible answers — it's a 50-50 choice. And from here, because you and I know which words are more common, we know that the answer should be "abyss". But as it's written right now, the program doesn't know that, so it just keeps going, trying to gain as much information as it can, until there's only one possibility left, and then it guesses it. Obviously we need a better endgame strategy, but let's call this version 1 of our Wordle solver and go run some simulations to see how it does.

The way this works is that it plays every possible Wordle game: it goes through all of those 2,315 words that are the actual Wordle answers, basically using that as a testing set.
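For what it's worth, that whole version-1 loop can be sketched in just a few lines, again reusing the helpers from the earlier sketches. Here allowed_words and answer_list are hypothetical stand-ins for the 13,000-word guess list and the 2,315-word answer list, and this is only an outline of the idea, not my exact code.

    def best_guess(allowed_words, possible_words):
        # Version 1: choose the guess whose pattern distribution has maximum entropy.
        return max(allowed_words, key=lambda g: entropy_of_guess(g, possible_words))

    def play_game(answer, allowed_words):
        possible = list(allowed_words)
        for n_guesses in range(1, 20):
            # Once only one candidate remains, guess it; otherwise maximize entropy.
            guess = possible[0] if len(possible) == 1 else best_guess(allowed_words, possible)
            if guess == answer:
                return n_guesses
            observed = pattern(guess, answer)
            possible = [w for w in possible if pattern(guess, w) == observed]

    # Simulation over the real answer list as a test set:
    # scores = [play_game(answer, allowed_words) for answer in answer_list]
    # print(sum(scores) / len(scores))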
With this naive method — not considering how common a word is, just trying to maximize the information at each step along the way until it gets down to one and only one choice — the average score by the end of the simulation works out to be about 4.124. Which is not bad; to be honest, I kind of expected it to do worse. But people who play Wordle will tell you that they can usually get it in four. The real challenge is to get as many in three as you can; it's a pretty big jump between a score of four and a score of three. The obvious low-hanging fruit here is to somehow incorporate whether or not a word is common. How exactly do we do that?

The way I approached it was to get a list of the relative frequencies of all the words in the English language. I just used Mathematica's WordFrequencyData function, which itself pulls from the Google Books English n-gram public dataset. It's kind of fun to look at; for example, if we sort from the most common words to the least common, evidently these are the most common five-letter words in the English language. Or rather, "these" is the 8th most common. First is "which", after which there's "there" and "their". "First" itself is not first but 9th, and it makes sense that these other words would come up more often, with the words after "first" being "after", "where", and "those", each just a little bit less common.

Now, in using this data to model how likely each word is to be the final answer, the probability shouldn't just be proportional to the frequency. For example, "which" is given a score of 0.002 in this dataset, whereas the word "braid" is in some sense about 1,000 times less likely; but both are common enough words that they're almost certainly worth considering, so we want more of a binary cutoff. The way I went about it was to imagine taking this whole sorted list of words, arranging it along an x-axis, and then applying the sigmoid function, which is the standard way to get a function whose output is basically binary — it's either 0 or it's 1 — but with a smooth transition in between for that region of uncertainty. So essentially, the probability I assign to each word for being a plausible final answer will be the value of the sigmoid function above wherever that word sits on the x-axis.
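As a sketch of that idea, with made-up values for where the cutoff sits and how wide the smoothing window is, it might look something like this. The list words_by_frequency is a hypothetical stand-in for the frequency-sorted word list.

    import math

    def sigmoid(x):
        return 1 / (1 + math.exp(-x))

    def answer_priors(words_by_frequency, center=3000, width=500):
        # center: roughly which rank the cutoff sits at (picked by eyeballing the list).
        # width: how many ranks the smooth drop from ~1 to ~0 is spread across.
        priors = {}
        for rank, word in enumerate(words_by_frequency):
            priors[word] = sigmoid((center - rank) / width)
        return priors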
Now, obviously this depends on a few parameters. For example, how wide a space on the x-axis those words fill determines how gradually or steeply we drop off from 1 to 0, and where we situate them left to right determines the cutoff. And to be honest, the way I did this was kind of just licking my finger and sticking it into the wind: I looked through the sorted list and tried to find a window where, eyeballing it, about half of the words seemed more likely than not to be a final answer, and used that as the cutoff.

Once we have a distribution like this across the words, it gives us another situation where entropy becomes a really useful measurement. For example, let's say we were playing a game, we start with my old openers, "other" and "nails", and we end up in a situation where there are four possible words that match. And let's say we consider them all equally likely. Let me ask you: what is the entropy of this distribution? Well, the information associated with each of these possibilities is the log base 2 of 4, since each one has probability 1 in 4, and that's 2. Two bits of information, four possibilities — all very well and good.

But what if I told you that actually there are more than 4 matches? In reality, when we look through the full word list, there are 16 words that match. But suppose our model puts a really low probability on those other 12 words actually being the final answer, something like 1 in 1,000, because they're really obscure. Now let me ask you: what is the entropy of this distribution? If entropy were purely measuring the number of matches, you might expect it to be something like the log base 2 of 16, which would be 4 — two more bits of uncertainty than we had before. But of course the actual uncertainty is not really that different from what we had before. Just because there are these 12 really obscure words doesn't mean it would be all that much more surprising to learn that the final answer is, for example, "charm". So when you actually do the calculation here, adding up the probability of each outcome times the corresponding information, what you get is 2.11 bits.
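If you want to check that 2.11 yourself, here's the arithmetic as a quick sketch, where the 0.247 is just the leftover probability split evenly across the four plausible words:

    import math

    probs = [0.247] * 4 + [0.001] * 12     # four plausible words, twelve obscure ones
    entropy = sum(p * math.log2(1 / p) for p in probs)
    # entropy comes out to about 2.11 bits, versus exactly 2 bits for four
    # equally likely possibilities and 4 bits for sixteen of them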
In other words, it's basically two bits — basically those four possibilities — but there's a little more uncertainty because of all those highly unlikely words, and if one of them did turn out to be the answer, you'd get a ton of information from it.

Zooming out, this is part of what makes Wordle such a nice example for an information theory lesson: we have these two distinct-feeling applications for entropy. The first tells us what the expected information is for a given guess, and the second measures the remaining uncertainty among all of the words that are still possible.

And I should emphasize that in the first case, where we're looking at the expected information of a guess, once we have an unequal weighting on the words, that affects the entropy calculation. For example, let me pull up that same distribution we were looking at earlier, the one associated with weary, but this time using a non-uniform distribution across all possible words. Let me see if I can find a part here that illustrates it well — okay, here, this is pretty good. Here we have two adjacent patterns that are about equally likely, but one of them, we're told, has 32 possible words that match it. If we check what they are, these are those 32, and they're all just very unlikely words as you scan your eyes over them; it's hard to find any that feel like plausible answers, maybe "yells". But if we look at the neighboring pattern in the distribution, which is considered just about as likely, we're told it has only 8 possible matches. So a quarter as many matches, but it's about as likely. And when we pull up those matches, we can see why: some of these are actual plausible answers, like "wring" or "wrath" or "wraps".

To illustrate how we incorporate all that, let me pull up version two of the Wordle bot. There are two or three main differences from the first one we saw. First off, like I just said, the way we're computing these entropies, these expected values of information, now uses the more refined distribution across patterns, the one that incorporates the probability that a given word would actually be the answer. As it happens, "tares" is still number one, though the guesses that follow it are a bit different.
Second, when it ranks its top picks, it now keeps a model of the probability that each word is the actual answer, and it incorporates that into its decision, which is easier to see once we have a few guesses on the table. Again, I'm ignoring its recommendation, because we can't let machines rule our lives. And I suppose I should mention another difference here: over on the left, that uncertainty value, that number of bits, is no longer just redundant with the number of possible matches. Now if we calculate 2 to the 8.02, which would be a little above 256 — 259, I guess — what it's saying is that even though there are 526 total words that actually match this pattern, the amount of uncertainty it has is more akin to what it would be if there were 259 equally likely outcomes. You can think of it like this: it knows "borks" is not the answer, same with "yorts" and "zorl" and "zorus", so it's a little less uncertain than it was in the previous case, and this number of bits will be smaller.

If I keep playing the game, I'm refining things down with a couple of guesses that are apropos of what I'd like to explain here. By the fourth guess, if you look over at its top picks, you can see it's no longer just maximizing the entropy. At this point there are technically seven possibilities, but the only ones with a meaningful chance are "dorms" and "words", and you can see it ranks both of those above all the other choices that, strictly speaking, would give more information.

The very first time I did this, I just added up those two numbers — the probability of being the answer and the expected information — to measure the quality of each guess, which actually worked better than you might suspect. But it really didn't feel systematic, and I'm sure there are other approaches people could take; here's the one I landed on. If we're considering the prospect of a next guess, in this case "words", what we really care about is the expected score of our game if we make it. To calculate that expected score, we ask: what's the probability that "words" is the actual answer? At the moment the model ascribes a 58% chance to that, so we say that with a 58% chance, our score in this game will be four. And then with probability one minus that 58%, our score will be more than four.
How much more, we don't know, but we can estimate it based on how much uncertainty there's likely to be once we get to that point. Specifically, at the moment there are 1.44 bits of uncertainty. If we guess "words", it tells us the expected information we'll get is 1.27 bits. So if we guess "words", the difference between those two represents how much uncertainty we're likely to be left with after that happens. What we need is some kind of function, which I'm calling f here, that associates this leftover uncertainty with an expected score.

The way I went about this was to plot a bunch of data from previous games, based on version one of the bot, asking what the actual score was after various points with certain measurable amounts of uncertainty. For example, these data points sitting above a value of around 8.7 are saying that for some games, after a point at which there were 8.7 bits of uncertainty, it took two more guesses to get the final answer; for other games it took three guesses; for others, four. If we shift over to the left, all the points over zero say that whenever there were zero bits of uncertainty — which is to say, only one possibility — the number of guesses required was always just one, which is reassuring. Whenever there was one bit of uncertainty, meaning it was essentially down to two possibilities, sometimes it required one more guess, sometimes two, and so on.

Maybe a slightly easier way to visualize this data is to bucket it together and take averages. For example, this bar here says that among all the points where there was one bit of uncertainty, on average the number of new guesses required was about 1.5. And the bar over here says that among all the games where at some point the uncertainty was a little above four bits — which is like narrowing things down to 16 possibilities — on average it took a little more than two guesses from that point forward. From here I just did a regression to fit a reasonable-looking function to this data. And remember, the whole point of doing any of that is so we can quantify the intuition that the more information we gain from a word, the lower our expected final score will be.
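Putting those pieces together, here is a sketch of the version-2 scoring idea. The function f below is a hypothetical linear stand-in for that fitted curve, with made-up slope and intercept; the real thing is whatever regression actually fits the data just described.

    import math

    def entropy_of_distribution(weights):
        # Uncertainty (in bits) of a distribution given as unnormalized weights.
        total = sum(weights)
        return sum((w / total) * math.log2(total / w) for w in weights if w > 0)

    def f(leftover_bits, intercept=1.0, slope=0.56):
        # Hypothetical stand-in for the fitted curve: expected number of further
        # guesses as a function of how much uncertainty will remain.
        return 0.0 if leftover_bits <= 0 else intercept + slope * leftover_bits

    def expected_score(p_word, guess_number, uncertainty_now, expected_info):
        # If the word we're about to guess is the answer (probability p_word), our
        # score is just the current guess number; otherwise we expect to need
        # roughly f(leftover uncertainty) more guesses on top of it.
        leftover = max(uncertainty_now - expected_info, 0)
        return p_word * guess_number + (1 - p_word) * (guess_number + f(leftover))

    # The example from above: guessing "words" on guess four, with a 58% chance
    # of being right, 1.44 bits of uncertainty now, and 1.27 expected bits gained.
    expected_score(0.58, 4, 1.44, 1.27)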
So with this as version 2.0, if we go back and run the same set of simulations, having it play against all 2,315 possible Wordle answers, how does it do? Well, in contrast to our first version it's definitely better, which is reassuring: all said and done, the average is around 3.6. Although, unlike the first version, there are a couple of times that it loses, requiring more than six guesses, presumably because there are times when it makes that tradeoff to actually go for the answer rather than maximizing information.

So can we do better than 3.6? We definitely can. Now, I said at the start that it's most fun to try not incorporating the true list of Wordle answers into the way the bot builds its model. But if we do incorporate it, the best performance I could get was around 3.43. So if we try to get more sophisticated than just using word-frequency data to choose the prior distribution, this 3.43 probably puts a cap on how good we could get with that approach, or at least how good I could get with it. That best performance essentially uses the same ideas I've been talking about here, but it goes a little further, for example by searching for the expected information two steps forward rather than just one. Originally I was planning on talking more about that, but I realize we've already gone quite long as it is. The one thing I'll say is that after doing this two-step search and then running a couple of sample simulations on the top candidates, so far, for me at least, it's looking like "crane" is the best opener. Who would have guessed?

Also, if you use the true Wordle list to determine your space of possibilities, then the uncertainty you start with is a little over 11 bits. And it turns out, just from a brute-force search, the maximum possible expected information after the first two guesses is around 10 bits, which suggests that, best-case scenario, after your first two guesses with perfectly optimal play, you'll be left with around one bit of uncertainty — the same as being down to two possible answers.
But I think it's fair, and probably pretty conservative, to say that you could never write an algorithm whose average gets as low as three, because with the words available to you, there's simply not room to get enough information after only two steps to guarantee the answer on the third guess every single time without fail.