What are the requirements for a word list to be used for Bayesian inference?
Intro
I need an input file of five-letter English words to train a Bayesian model that infers the stochastic dependencies between letter positions. For instance, does the probability of a letter at position 5 depend on the letter at position 1? Ultimately, I want to use the trained Bayesian network to solve the Wordle game.
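To make the dependency question concrete, here is a minimal sketch of one way to measure it: the empirical mutual information between the letters at two positions. The function name `mutual_information` and the toy word list are illustrative assumptions, not part of any existing model. A value near zero suggests the two positions are close to independent in the given list.

```python
from collections import Counter
from math import log2

def mutual_information(words, i, j):
    """Empirical mutual information (in bits) between the letters
    at 0-based positions i and j of the given words."""
    n = len(words)
    joint = Counter((w[i], w[j]) for w in words)
    left = Counter(w[i] for w in words)
    right = Counter(w[j] for w in words)
    mi = 0.0
    for (a, b), count in joint.items():
        # p(a, b) * log2( p(a, b) / (p(a) * p(b)) ), written with raw counts
        mi += (count / n) * log2(count * n / (left[a] * right[b]))
    return mi

# Toy list, purely illustrative; a real estimate needs a full word list.
words = ["crane", "crate", "slate", "plate", "stone"]
print(mutual_information(words, 0, 4))  # first letter vs. last letter
print(mutual_information(words, 0, 1))  # first letter vs. second letter
```

Note that this estimate is biased upward on small samples, so the size of the word list directly affects how trustworthy the inferred dependencies are.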
What is Wordle?
It’s a game where you guess five-letter words, and after each guess it tells you which letters are correct and whether they are in the right positions. You only have six attempts. In short, Wordle is about narrowing down the distribution of what the true word could be.
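The narrowing-down step can be sketched as follows. The helper names `feedback` and `consistent` and the `G`/`Y`/`-` encoding are assumptions made for illustration, and the handling of repeated letters reflects my understanding of the game rather than an official specification.

```python
from collections import Counter

def feedback(guess, answer):
    """Return one character per letter: 'G' (right letter, right place),
    'Y' (letter in the word, wrong place), '-' (letter not available)."""
    result = ["-"] * len(guess)
    remaining = Counter()
    # First pass: mark exact matches and count unmatched answer letters.
    for k, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[k] = "G"
        else:
            remaining[a] += 1
    # Second pass: mark misplaced letters, consuming each answer letter once.
    for k, g in enumerate(guess):
        if result[k] == "-" and remaining[g] > 0:
            result[k] = "Y"
            remaining[g] -= 1
    return "".join(result)

def consistent(candidates, guess, pattern):
    """Keep only the candidates that would have produced this pattern."""
    return [w for w in candidates if feedback(guess, w) == pattern]

print(feedback("crane", "crate"))
print(consistent(["crate", "slate", "stone"], "crane", "GGG-G"))
```

Each guess therefore conditions the distribution over words on the observed pattern, which is why the coverage of the word list matters: a word missing from the list can never survive the filter.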
Problem
What requirements should such a word list meet?
- Should I mix US and British English?
- Should I include all possible words? Even very exotic ones that nobody knows/uses?
- Should these words be processed/normalized in some way?
- Does it make sense to use multiple sources? Is there any way to ensure the completeness and correctness?
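To illustrate what processing and merging could look like, here is a minimal sketch. The functions `normalize` and `build_word_list` are hypothetical, and the choices made (lowercasing, stripping diacritics, keeping only ASCII letters, keeping both US and British spellings) are assumptions, not established requirements.

```python
import unicodedata

def normalize(word):
    """Lowercase, strip surrounding whitespace, and remove diacritics."""
    decomposed = unicodedata.normalize("NFKD", word.strip().lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_word_list(*sources, length=5):
    """Merge several iterables of raw words into one sorted,
    de-duplicated list of words with exactly `length` ASCII letters."""
    merged = set()
    for source in sources:
        for raw in source:
            word = normalize(raw)
            if len(word) == length and word.isascii() and word.isalpha():
                merged.add(word)
    return sorted(merged)

# Two toy sources standing in for real word-list files.
source_a = ["Crane", " \u00c9clat ", "it's", "color"]
source_b = ["crane", "colour", "fibre", "fiber"]
print(build_word_list(source_a, source_b))
```

Merging sources this way only removes exact duplicates after normalization. It does not verify that a word is real or commonly known, so completeness and correctness would still need a separate check.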
What I have done so far
- I modeled a Bayesian network with five random variables, one for the letter at each position: $L1$, $L2$, $L3$, $L4$, $L5$
- I concluded that the probability of a candidate word is the joint probability $P(L1, L2, L3, L4, L5)$.
- To estimate this joint probability distribution I need a word list, which led to the questions above
- I've found many sources for word lists, but I'm not sure if I should use one or all
- I have verified that both US and British English spellings occur among the Wordle solutions.
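As a rough sketch of how a word list feeds into the joint probability: by the chain rule, $P(L1, L2, L3, L4, L5) = P(L1) P(L2 | L1) P(L3 | L1, L2) \dots$, and the example below simplifies this by assuming each position depends only on the previous one. That chain structure, the add-alpha smoothing, and the names `fit_chain` and `word_probability` are all assumptions for illustration, not the network described above.

```python
from collections import Counter
import string

ALPHABET = string.ascii_lowercase

def fit_chain(words, alpha=1.0):
    """Count first letters and adjacent letter pairs per position."""
    first = Counter(w[0] for w in words)
    pairs = [Counter() for _ in range(4)]  # pairs[k][(a, b)]: a at k, b at k+1
    prev = [Counter() for _ in range(4)]   # prev[k][a]: a at position k
    for w in words:
        for k in range(4):
            pairs[k][(w[k], w[k + 1])] += 1
            prev[k][w[k]] += 1
    return {"n": len(words), "first": first, "pairs": pairs,
            "prev": prev, "alpha": alpha}

def word_probability(model, word):
    """Smoothed estimate of P(L1) * P(L2|L1) * P(L3|L2) * P(L4|L3) * P(L5|L4)."""
    alpha, v = model["alpha"], len(ALPHABET)
    p = (model["first"][word[0]] + alpha) / (model["n"] + alpha * v)
    for k in range(4):
        pair_count = model["pairs"][k][(word[k], word[k + 1])]
        p *= (pair_count + alpha) / (model["prev"][k][word[k]] + alpha * v)
    return p

# Toy list, purely illustrative.
model = fit_chain(["crane", "crate", "slate", "plate", "stone"])
print(word_probability(model, "crane"))
print(word_probability(model, "zzzzz"))
```

The smoothing term is where the word-list questions bite: rare or exotic words shift the counts, and any word absent from the list only receives probability mass through `alpha`.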
PS: I know that the list of all possible solution words has been leaked. But I don't want to use such a list, because what if the makers of Wordle change the list again?