Friday, 4 February 2022

Cheating at Wordle with Python, NLTK and a Raspberry Pi

At the time of writing, the game Wordle is taking the world by storm.  Quite rightly, it's genius and so is the guy who invented it.  

Here's one I did earlier:

Two things to know about me:

  • I'm rubbish with word games.
  • I like to trick my children into thinking I'm cleverer than I am.

So having struggled a bit with Wordle I wondered if I could write some code to more efficiently (for that read cheat) find Wordle answers.  For this I used my trusty Raspberry Pi, Python and the Natural Language ToolKit (NLTK) module.  I'd used NLTK previously for some online data science courses so I knew it could give me a "corpus" (so set) of words to play with. 

First I installed NLTK for use with Python 3 on the Raspberry Pi using this command: sudo pip3 install nltk

Then I looked at which word corpus NLTK has that I could use.  There's some details here and a few places pointed me to the "Brown" corpus as a good place to start.  To make this corpus available for NLTK in a Python script I opened a Python3 shell and ran these commands:

Python 3.7.3 (default, Jan 22 2021, 20:04:44)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk

So that sets a corpus of words ready to use in a Python script.  I won't go into the detail of how Wordle is played but overall you get 6 goes to guess a 5 letter word.  With each guess, the game tells you for each letter:
  • Whether it is in the final word in the exact position you guessed it it,  I call these exact matches.
  • Whether it is in the final word but not in the exact position you guessed it in.  I call these partial matches.
  • Whether it is not in the final word.  I call these non-matches.
I then played around with snippets of code to examine words from the corpus and rule them in or out based upon whether they had exact matches in them or non-matches not in them.  That led me overall to an algorithm of:

-Load word corpuses (at the time of writing I use Brown, Webtext and Gutenberg)
-Build a dictionary of 5 letter words and their frequency of occurrence in the corpus
-Setup data structures to log exact, partial and non-matches.

-Loop at least six times doing:
1-Make a prediction (which I then enter into Wordle) based on the data structures
2-Input the result of the prediction from Wordle
3-Update some data structures

(Full code is at the end of this post)

Taking step 2 above first, I simply enter the result of each Wordle round in a coded string.  So take this result:

I enter this as S_E,U_N,G_N,A_P,R_P.  Where _E is for exact, _N is for non-matched and _P is for partial.

Looking at step 3, I update 3 data structures based upon the input the user provides.  The data structures are:
  • A tuple of exact matches where the tuple contains the letter and the exact match position.  So from the above it would be [('S', 0)].
  • A dictionary of partial matches where the key is the letter and the value is a list of positions that the letter is not in.  S from the above it would be {'A': [3], 'R': [4]}
  • A list of non-matches.  So from the above it would be ['U', 'G']
If it's an exact match I do two things:
1)Update the tuple of exact matches with the new matched letter/position combination.
2)Update the partial match dictionary as this is a a)new position a partial matched letter can't be in and b)it may require removal of an entry as what was previously partially matched is now fully matched.

If it's a partial match I do two things:
1)Add a new partial (with the letter position) or update the list of positions for an existing partial
2)If it's a new partial, don't just add the position it is in, add the position of all the other exact matches as the letter can't be in this position either.

If it's a non-match I just add the letter to the list of non-matches.

So for game above, the log output showed this:
Enter the result of that round in format A_E,B_P,C_N where _E for exact, _P for partial, _N for non matched:S_E,U_N,G_N,A_P,R_P
########Processing an exact match for letter S
########Processing a non match for letter U
########Processing a non match for letter G
########Processing a partial match for letter A
########Processing a partial match for letter R
########Exact matches [('S', 0)]
########Partial matches {'A': [0, 1, 3], 'R': [1, 2, 4]}
########Non matches ['O', 'E', 'P', 'T', 'U', 'G']

Finally looking at step 1, for round 1 I recommend a initial word based upon the following logic:
1)Do a letter frequency count across all the 5 letter words in the corpus.
2)Find a word in the corpus that has each of the 5 most common letters.  This leads me to use "AROSE".

Then for subsequent rounds I do the following:
1)Eliminate any words from the corpus that have the letters in the non-matching list.
2)Eliminate any words from the corpus that don't have the exact matching letters in the exact matching position.
3)Eliminate any words from corpus that don't have the partial matching letters or, if they do, have the partial matching letters in the positions logged that they can't be in.

...which results in a list of words which I augment with the frequency of occurrence in the corpus.  I then print this and let the use choose the word to enter next.  

The story so far is that I have played 4 games with this code and:
1)I have got the right answer every time, but
2)My 17 year old daughter has got the right answer in fewer guesses 3 times out of 4!

Full code listing:

#!/usr/bin/env python3
from nltk.corpus import brown
from nltk.corpus import webtext
from nltk.corpus import gutenberg
import sys

#Our main 5 letter word list
five_letters = {}    #A dictionary of 5 letter words with frequencies

#Gets a list of 5 letter words
def add_to_list(in_word_list):
  for word in in_word_list:
    if word.isalpha():
      if len(word) == 5:
        if word.upper() not in five_letters:    #as variety of case of same word could be present
          five_letters[word.upper()] = 1
          five_letters[word.upper()] += 1

#Compares our 5 letter word corpus with the list of letter frequencies to find the best start word
#The best word has all the highest frequency letters just once
def get_start_words(letter_frequencies, start_pos):
  words_found = 0               #Counts when our set of 5 high frequency letters matches a word from our corpus  
  list_start_pos  = start_pos   #Used to track where we are in our letter frequency list
  word_list = []

  #Loop until we've found words from the word corpus or we've exhausted the
  while words_found == 0 and list_start_pos < 22:    #It's 22 as this will mean we've got to positions 21,22,23,24,25 in the character frequency list
    #Loop for each of the words in the corpus
    for my_word in five_letters:
      letter_count = 0     #Incremented if we find a letter in the word
      #Then for each of the five letters identified by the outer while loop
      for i in range (list_start_pos, list_start_pos + 5):
        my_letter = letter_frequencies[i][0]
        if my_letter in my_word:
          letter_count += 1 #Count if the letter is in the word
      #return word_list
      #See if we found all our letters in the word.  So if one letter is there twice we should not 
      if letter_count == 5:
        words_found += 1
    list_start_pos += 1

  #Return our word list
  return word_list

print ("########Building five letter word list")
print ("Added Brown corpus and word list has {} entries".format(len(five_letters)))
print ("Added Webtext corpus and word list has {} entries".format(len(five_letters)))
print ("Added Gutenberg corpus and word list has {} entries".format(len(five_letters)))

#Calculate letter frequencies
print ("########Calculating letter frequencies")
freq_dict = {}
for my_word in five_letters:
  #Build letter frequencies
  for my_letter in my_word:
    #See of the letter is a key in the dictionary
    if my_letter in freq_dict.keys():
      freq_dict[my_letter] += 1    #Increment value
      freq_dict[my_letter] = 1     #Add value

#Sort the dictionary to get the highest probability letters and show the user
sorted_values = sorted(freq_dict.items(), key=lambda x:x[1], reverse=True)
print (sorted_values)

#Holds the round number
round_number = 1

#Holds the exact matches.  A list of tuples
exact_matches = []
#Holds the partial matches.  A dictionary of key is letter, value is list of positions letter is not in
partial_matches = {}
#Holds the non-matched letters.  A list
non_matches = []

#Loop for all the rounds
while round_number < 7 and len(exact_matches) < 5:
  print ("######################################################")
  print ("########This is round number {}".format(round_number))
  #Special case for round 1
  if round_number == 1:
    #Get a list of possible starting words based upon the letter frequencies
    print ("########Getting a list of words to start off with")

    #Imagine we get 6 wrong starting words!  Accoutn for each
    for k in range (0,3):
      start_words = get_start_words(sorted_values, k)
      print ("Where we start at position {} of the letter frequencies the start words are: {}".format(k, start_words))
    #Step 1, rule out a bunch of words that have eliminated letters in them
    print ("########Assessing data from previous round to make a recommendation")
    print ("########First rule out words based on letters found not to exist in the answer")
    after_non_match_check = []
    for my_word in five_letters:
      has_ruled_out = False
      for letter in non_matches:
        if letter in my_word:
          has_ruled_out = True
      if not has_ruled_out:
    print ("########At the end of this we are down to {} words".format(len(after_non_match_check)))

    #Step 2, rule in a set of words based on the matched list
    print ("########Second rule in words based on exact matched letters")
    after_full_match_check = []
    for my_word in after_non_match_check:
      has_ruled_out = False
      for match_tuple in exact_matches:
        if my_word[match_tuple[1]] != match_tuple[0]:
          has_ruled_out = True
      if not has_ruled_out:
    print ("########At the end of this we are down to {} words".format(len(after_full_match_check)))
    #print (after_full_match_check)

    #Step 3, rule out a set of words where letters are in partial match positions
    print ("########Third rule in words based on partial matched letters")
    after_partial_match_check = []
    for my_word in after_full_match_check:
      has_ruled_out = False
      for my_partial in partial_matches:  #Loop through each dictionary entry
        #First simply check the partial is in the word
        if my_partial in my_word:
          #Now check for the partial positions
          for my_partial_pos in partial_matches[my_partial]:    #Loop through each item in the partial match
            if my_partial == my_word[my_partial_pos]:
              has_ruled_out = True
          has_ruled_out = True

      if not has_ruled_out:
    print ("########At the end of this we are down to {} words.  Recommendation:".format(len(after_partial_match_check)))

    #Form an ordered list of words based on overall word frequency from the corpus that was built right at the start
    suggestion_dict = {}
    ordered_suggestion = {}
    for word_suggestion in after_partial_match_check:
      suggestion_dict[word_suggestion] = five_letters[word_suggestion]

    #Order the suggestions
    ordered_suggestion = sorted(suggestion_dict.items(), key=lambda x:x[1], reverse=True)
    #Tell the user
    if len(ordered_suggestion) < 11:
      print (ordered_suggestion)
      #Print just the first 10
      print (list(ordered_suggestion)[:10])

  #Get input from the user as to what happened in that round
  round_result = input("Enter the result of that round in format A_E,B_P,C_N where _E for exact, _P for partial, _N for non matched:")
  #Pull round result apart and process each
  result_list = round_result.split(",")
  #Loop for each result
  letter_pos = 0        #Holds which letter result position we're looking at
  for result in result_list:
    #Get the entered letter and the result
    entered_letter = result[0]
    letter_result = result[2]
    if letter_result == "E":
      print("########Processing an exact match for letter {}".format(entered_letter))
      #See if we already have this exact match.  If not, add it
      letter_found = False
      for my_tuple in exact_matches:
        if my_tuple[0] == entered_letter and my_tuple[1] == letter_pos:    #So we could have double letters so this checks for existence of the letter in the given position
          letter_found = True
      if not letter_found:
      #Update existing partial matches as well, i.e. 1)They can't be in the position of the found letter.  Also, if there is already a partial for what is now exact, remove it
      if entered_letter in partial_matches:
        for partial in partial_matches:
          if letter_pos not in partial_matches[partial]:

    elif letter_result == "P":
      print("########Processing a partial match for letter {}".format(entered_letter))
      if entered_letter in partial_matches:
        #Look partial record, seeing if the current position is in the position list.  If not, add it
        if letter_pos not in partial_matches[entered_letter]:
        #Entered letter not in partial matches dictionary.  Add it together with the position
        partial_list = []
        #But as this is a new partial we also need to add all existing exact matches which the letter can also not be in that position
        for exact in exact_matches:
        #FInal update of the partial list
        partial_matches[entered_letter] = partial_list

    elif letter_result == "N":
      print("########Processing a non match for letter {}".format(entered_letter))
      #See if we already have this non-match.  If not, add it
      if entered_letter not in non_matches:
    #Update so we get the next letter position
    letter_pos +=1

  #Show what the structures are at the end of this round
  print ("########Exact matches {}".format(exact_matches))
  print ("########Partial matches {}".format(partial_matches))
  print ("########Non matches {}".format(non_matches))

  #End of turn Update for next loop
  round_number += 1

#See how we came out of the main loop
if len(exact_matches) == 5:
  print ("########You won!  Way to go/cheat")
  print ("########All rounds completed, you lost!")

Saturday, 3 April 2021

Excel Basics - Autofill

Ever noticed the tiny little square on the bottom right of the rectangle that goes around the cell you have selected?   This is called Autofill and it's super useful!  It's circled in the image below:

So what does it do?  Hover over it and a little black plus appears.  Click and drag it down (or across) and it acts like copy and paste.  Here's a simple example.  I typed 100 in cell A1, clicked and dragged the Autofill control down to A9 and the value 100 was populated in every cell:

But autofill is cleverer than that.  I put a 1 in cell B1, a 2 in cell B2, selected both cells, clicked the Autofill control and dragged down:

Here autofill has said "hang on a minute, there's a 1 in the first row, a 2 in the second row, I'll continue the sequence".

You can use it to Autofill formulae in cells.  In the image below I put "=A1+B1" in cell C1 then used Autofill to copy it down:

Autofill will also make intelligent decisions if you have text in a cell.  In the example below, I typed "Row 1" in cell D1 then Autofilled down.  Autofill decides to increment the number at the end of the text as you go down:

If you've been watching carefully you'll have seen this little icon appear at the bottom of the set of cells you've been using Autofill on: 
Click on it and you get a bunch of different options for how Autofill works.  The example above is "Fill Series".  If you select "Copy Cells" you get: a straight copy of what you first entered.

If you select "Fill Formatting Only" you get:

 in this case no obvious formatting.  But if you did have some specific formatting in cell D1 it would be replicated to the cells below.

"Fill Without Formatting" looks to give you the same as "Fill Series" but (assuming you had some obvious formatting in cell D1) it would not be replicated down.

The final "Flash Fill" option is very cool.  Say you have set up a worksheet like this; here you have a name of James Bond actors in column A and you've given Excel as hint as to what you want to do with these values in cells B1 and C1:

Autofill down and you get a default "Copy Cells" view:

But then select "Flash Fill" and you get:

Repeat for the second name and you get:

Clever Excel!

Friday, 2 April 2021

Excel Hack - Getting a Newline in an Excel Cell When Entering Text

Excel is best for numbers but sometimes you find yourself entering lots of text in a cell.  To format the text it's helpful to be able to put in a newline character (aka do a carriage return).  So here's how a bunch of text can look without doing this:

Put in newline characters and hey presto, all a lot easier to read:

So how to do this?  The answer is:

  1. Position the cursor where you want the newline, 
  2. Hold down the alt key, 
  3. Press the return key, 
  4. Release the alt key.

Here's where to find the alt key:

...and here's return (also known as Enter):

Hope this is useful.

Tuesday, 30 March 2021

Excel Hack - Using a "Magic" Apostrophe to Force a Cell to be Text

In Excel you can prefix values in cells with an apostrophe to force Excel to interpret them as text.  Why do this you may ask?

To explain, let's say I want to enter James Bond's telephone number into a spreadsheet.  All I have to do is type 01234007007 (which everyone knows is his number) into a cell, right?  Wrong!  Let's see it in action.  First I type his number (numbers in the UK always start with a 0, that includes MI6 spies):

Because of the leading zero, when I press return, Excel strips the first 0 from what I've typed:

I can get around this by prefixing 007's number with an apostrophe (so a ').  Here we go:

In the image above you can see how the apostrophe is not visible in the cell itself but is in the formula bar at the top right.

Note you also get a little helpful hint warning you there's an apostrophe there.  If you hover over the exclamation mark you get a pop up that says "The number in this cell is formatted as text or preceded by an apostrophe".

So I often use the apostrophe trick for phone numbers or to put formatting in cells that Excel may complain about.  Say I want to write some text in a cell with several sub-points.  You can't do bullet points in a cell so I often prefix lines with a hyphen.  So I type this in:

Hit return and I get:

Oh no!  Because I start with a hyphen, Excel thinks it's a formula with a minus sign, can't interpret the rest of the formula and throws an error.  Put in a magic apostrophe and you get:

Again you can see the apostrophe in the formula bar (top right of the image) but not the cell itself.

Hope that's useful!

Sunday, 28 March 2021

Excel Hack - Concatenating (Adding Together) Text in Different Cells

Sometimes in Excel you have values (generally text) in different cells that you want to piece together into a single cell.  In the example below I have a set of fruits and tasty things I want to join together into a single cell:

Lovely old Excel gives us two ways to do this.  The first way, and the one I always remember, is the "&" operator.  In the example below I've already put the formula in cell C1 and am showing the formula structure in cell C2:

The formula simply means "takes what's in cell A2, add a space to it, then add what's in cell B2".

The second way is the "CONCAT" formula.  Here's an example:

Here the CONCAT formula takes a set of cell references and actual strings and joins them all together into a single result.

For (Geek Dad style) fun I can first transpose the column B values to a row (using this method) and then use mixed references (explained here) to create a crazy set of food ideas:

The mixed references in the formula here are used to keep to column A and row 1 as you copy the formula to other cells.  Specifically:

  • $A2 - Keep to column A as you copy across columns, allow the row to increment as you copy down.
  • B$1 - Allow the column to increment as you copy across, keep the row as 1 as you copy down.

Resulting in:

Comment below if you've used CONCAT or &. 

Excel Basics - Paste Special + Transpose to Turn a Column of Values into A Row

Sometimes in Excel you have values lined up in a single column and you want to get all the values in a single row.  Maybe you want to create a matrix where rows and columns are compared with each other.

Here's an image with made up data to illustrate this:

Let's say I want all the values in column B to actual be across row 1.  So for example "Turnover" goes in cell B1, "Juice" goes in cell C1, "Split" goes in cell D1 and so on.  I could move them one by one but that would be a real pain, especially as the number of values in the column grows.  Instead I use Paste Special and Transpose.  Transpose is the fancy name for changing a row to a column and vice versa.

First, copy the values you want to transpose by selecting them, right clicking and selecting "Copy":

Next select the cell where you want the row of values to be Transposed to. In my case I want the first value in cell B1 so I select that.  I then right click and select "Paste Special...".  The form shown below comes up.  The part of the form I'm interested in is circled:

Tick the "Transpose" box then hit OK.  Lo and behold, what was in the column is now in the row!

You still have your original column of data but you can delete those values and start doing some form of row column comparison!

...and so on!

Note you can also transpose rows to columns.  Comment below if you use this feature!

Excel Hack - Error Checking

A wise person once said "To err is human".  Well Excel sometimes creates errors as well.  Luckily there's a built-in way to check for errors and deal with them.  A common error in computing is dividing by zero.  Here's an example:

Cell C2 shows the formula that's been used.  The "/" part of the formula is how you do divide in Excel and so "=A2/B2" means "take the value in cell A2 and divide it by the value in cell B2".  

You can see how in cell B5 there is the number 0.  In maths (or math for our American friends), a divide by zero is either called infinity or "undefined".  Either way, it can't be computed.  Hence Excel puts an error in a cell where you do a divide by zero, in this case "#DIV/0!".

You can test for a divide by zero using the "ISERROR" formula.  Here's an example:

So the formula "=ISERROR(C2)" simply means "look at the value in cell C2, if it's an error print TRUE, if it's not an error print FALSE".

You can then use an IF formula (see here for an explanation) to test for the error like this:

So the formula above says "look at the value in cell D2.  If it's TRUE print a blank cell, otherwise print the result of the division that was in cell C2.

In the above image I've applied the formula to row 5 where we have the divide by zero and it's come out as blank.  What to set the cell to where there is an error is a design choice.  For example:

  • Where there is a divide by zero error you could set the result cell to 0. However if you're also computing something like an average of the column overall you will get a skewed result.  (The average of 3, 3, 9 and 0 is 5.25, the average of 3, 3, 9 and blank is 7).
  • You could set the cell to something like "ERROR - FIX ME" then you can search and fix these errors.

What I've shown in the image above is something I often do in Excel to make things more maintainable.  Specifically there is one formula per column (so divide in column C, error check in column D, IF statement in column E).  

You can also chain together more than one formula in a cell like this:

In the above formula in column F we have a error check inside an IF statement.  The way to interpret the formula is:

  • So "ISERROR(A2/B2)" to first check if the result is an error.
  • If it is an error, make the cell blank.
  • If it is not an error, compute A2 / B2 and put it in the cell.

It's up to you how you choose to work.  You can use multiple columns to build up the logic one column at a time or do one big multiple formula entry in a single cell.  I choose the multiple columns option as it makes it easier weeks or months later when you come back to a spreadsheet and try and work out what you actually did!

There's also an "IFERROR" formula that does what I've done above in a single formula.  Additionally there's a bunch of other errors you can check for.  Have a play, see what you can come up with and comment below!