What are the requirements for a word list to be used for Bayesian inference?

Intro

I need an input file of five-letter English words to train my Bayesian model to infer the stochastic dependencies between positions. For instance, does the probability of a letter at position 5 depend on the letter at position 1, and so on? Ultimately, I want to train this Bayesian network so that it can solve the Wordle game.

What is Wordle?

It's a game where you guess five-letter words, and it tells you which letters you got correct and whether they are in the right positions. You only have six attempts. In short, Wordle is about narrowing down the distribution of what the true word could be.
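For concreteness, here is a minimal sketch of the feedback logic as I understand it (green = right letter, right position; yellow = right letter, wrong position; grey = absent), including the usual handling of repeated letters. The function name and encoding are my own:

```python
from collections import Counter

def feedback(guess: str, answer: str) -> str:
    """Return Wordle-style feedback: 'G' (green), 'Y' (yellow), '.' (grey)."""
    result = ["."] * 5
    # Letters of the answer that are not exact matches remain
    # available to be marked yellow elsewhere.
    remaining = Counter(a for g, a in zip(guess, answer) if g != a)
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "G"
        elif remaining[g] > 0:
            result[i] = "Y"
            remaining[g] -= 1
    return "".join(result)

print(feedback("crane", "caper"))  # GYY.Y
```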

Problem

What requirements should such a word list meet?

  • Should I mix US and British English?
  • Should I include all possible words? Even very exotic ones that nobody knows/uses?
  • Should these words be processed/normalized in some way?
  • Does it make sense to use multiple sources? Is there any way to ensure completeness and correctness?

What I have done so far

  • I modeled a Bayesian network consisting of five random variables, one for the letter at each position: $L_1, L_2, L_3, L_4, L_5$
  • I came to the conclusion that the probability of the target word is the joint distribution $P(L_1, L_2, L_3, L_4, L_5)$ (see the sketch after this list).
  • In order to estimate this joint probability distribution I need a word list, which is why I asked myself the above questions.
  • I've found many sources for word lists, but I'm not sure whether I should use one or all of them.
  • I have verified that both US and British English spellings occur in Wordle.
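To make the estimation concrete, below is a minimal sketch of how the empirical positional distributions could be computed from a word list, and how a dependency such as $P(L_5 \mid L_1)$ could be checked against the marginal. The file name `words.txt` is a placeholder:

```python
from collections import Counter

# Placeholder input: one five-letter word per line.
with open("words.txt") as f:
    words = [w.strip().lower() for w in f if len(w.strip()) == 5]
n = len(words)

# Empirical marginals P(L_i = letter) for each position i.
marginals = [Counter(w[i] for w in words) for i in range(5)]

# Joint counts over (L_1, L_5), to test whether L_5 depends on L_1.
pair = Counter((w[0], w[4]) for w in words)

def p_l5_given_l1(l5: str, l1: str) -> float:
    """Empirical P(L_5 = l5 | L_1 = l1); compare with marginals[4][l5] / n."""
    return pair[(l1, l5)] / marginals[0][l1]
```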

PS: I know that the list of all possible solution words has been leaked. But I don't want to use such a list, because what if the makers of Wordle change the list again?



Well, it totally depends on what you want to do with the resulting probability model. If you're planning to use the model for spelling correction, for example, you should probably use a vocabulary as large as the kind of text you expect to process.

In general this is actually done not from a list of words but from a large corpus of text, taking all the n-grams up to length 5 in the text into account. Restricting the data to five-letter words might not reproduce the same probabilities as the full language. But again, this choice depends on the target task for the model.
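If these are character n-grams, a minimal counting sketch over a raw corpus could look as follows (the file name `corpus.txt` and the crude tokenization are assumptions):

```python
import re
from collections import Counter

with open("corpus.txt") as f:
    text = f.read().lower()

# Crude tokenization: runs of letters only.
tokens = re.findall(r"[a-z]+", text)

# Character n-gram counts for orders 1 through 5, pooled over all tokens.
ngrams = {k: Counter() for k in range(1, 6)}
for tok in tokens:
    for k in ngrams:
        ngrams[k].update(tok[i:i + k] for i in range(len(tok) - k + 1))
```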


Answering the updated questions:

Ideally, you would use the same word list as Wordle itself, though as you said, that could backfire if the list is changed later. As far as I know (I happen to play it too!), the game works with fairly standard English vocabulary, so I would guess that any standard vocabulary would fit.

Should I mix US and British English?

I don't know.

Should I include all possible words? Even very exotic ones that nobody knows/uses?

For the sake of completeness I think you can, but your model could include the global probability of each word, in order to make common words more likely than rare ones. One option that comes to mind is to use the Google Ngrams data (here unigrams) and extract only the five-letter words.
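As a sketch of that option, the following aggregates counts for five-letter words from a Google Books unigram export. The file name and the assumed TSV layout (ngram, year, match_count, volume_count) should be checked against the actual files you download:

```python
from collections import Counter

freq = Counter()
with open("googlebooks-eng-1gram.tsv") as f:  # placeholder file name
    for line in f:
        word, _year, count, _volumes = line.rstrip("\n").split("\t")
        word = word.lower()
        if len(word) == 5 and word.isalpha():  # also drops POS-tagged entries
            freq[word] += int(count)

total = sum(freq.values())
prior = {w: c / total for w, c in freq.items()}  # global word prior P(W = w)
```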

Should these words be processed/normalized in some way?

Only for capitalization, I think.
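In practice that means lowercasing everything; if a source happens to contain accented spellings (e.g. "naïve"), stripping diacritics is a possible extra step, though that goes beyond what I suggested above:

```python
import unicodedata

def normalize(word: str) -> str | None:
    """Lowercase and strip diacritics; return None unless the result
    is exactly five ASCII letters."""
    w = unicodedata.normalize("NFKD", word.lower())
    w = "".join(c for c in w if not unicodedata.combining(c))
    return w if len(w) == 5 and w.isascii() and w.isalpha() else None

print(normalize("Naïve"))  # naive
```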

Does it make sense to use multiple sources? Is there any way to ensure completeness and correctness?

This could be tricky, because mixing different sources can bias the n-gram probabilities.
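If you do combine sources, one simple heuristic (my own suggestion, not a standard recipe) is to keep only the words that a majority of the lists agree on, trading some completeness for correctness:

```python
# Placeholder inputs: three independently sourced word lists.
paths = ["list_a.txt", "list_b.txt", "list_c.txt"]

lists = []
for path in paths:
    with open(path) as f:
        lists.append({w.strip().lower() for w in f if len(w.strip()) == 5})

# Keep words that at least two of the three sources contain.
merged = {w for w in set().union(*lists) if sum(w in s for s in lists) >= 2}
```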
