Hidden Markov Models: Linking states to labels after EM training
The tl;dr version first:
I have the following problem: I implemented Baum-Welch for ergodic HMMs. I do it like this:
I pass the model two numbers, C1 and C2; it builds a fully connected state machine with C1 states and C2 emissions. I map all tokens from my training data onto the range [0, C2) and each label the HMM is supposed to assign to a token during inference onto [0, C1). Then the HMM runs Baum-Welch learning. When it is done, it has configured its state machine to (locally) maximize the likelihood of the training data.
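Roughly like this (a simplified NumPy sketch, not my actual code; all names are made up):

```python
import numpy as np

def init_ergodic_hmm(C1, C2, seed=0):
    """Randomly initialize a fully connected HMM with C1 states and C2 emission symbols."""
    rng = np.random.default_rng(seed)
    pi = rng.random(C1)                     # initial state probabilities
    A = rng.random((C1, C1))                # transitions: every state -> every state
    B = rng.random((C1, C2))                # emission probabilities per state
    pi /= pi.sum()
    A /= A.sum(axis=1, keepdims=True)       # each row sums to 1
    B /= B.sum(axis=1, keepdims=True)
    return pi, A, B

# tokens are already mapped onto [0, C2), e.g. via a vocabulary dict
pi, A, B = init_ergodic_hmm(C1=12, C2=5000)
# ... Baum-Welch then re-estimates pi, A and B on the training sequences
```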
Now to my problem:
Assume I had two isomorphic initial state machines (isomorphic taking all the probabilities into account; structurally all the HMMs are isomorphic anyway, because they are ergodic). They differ only in their state IDs, i.e. the IDs have been permuted from one machine to the other. After training on the same data, both HMMs will be isomorphic again. That means there is absolutely no connection between the IDs I map my labels to and the IDs of the HMM's states. So how can I interpret the HMM after training? How do I know which state corresponds to which POS tag? It seems impossible, so I guess I am missing some crucial point.
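To illustrate what I mean by "no connection between the IDs": permuting the state IDs of a model leaves the likelihood of any observation sequence unchanged. A small sketch (hypothetical names, scaled forward algorithm) that demonstrates this:

```python
import numpy as np

def forward_loglik(pi, A, B, obs):
    """Scaled forward algorithm: log P(obs | pi, A, B)."""
    alpha = pi * B[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        loglik += np.log(alpha.sum())
        alpha /= alpha.sum()
    return loglik

rng = np.random.default_rng(1)
C1, C2 = 4, 6
pi = rng.dirichlet(np.ones(C1))
A = rng.dirichlet(np.ones(C1), size=C1)
B = rng.dirichlet(np.ones(C2), size=C1)
obs = rng.integers(0, C2, size=20)

perm = rng.permutation(C1)                          # scramble the state IDs
pi_p, A_p, B_p = pi[perm], A[np.ix_(perm, perm)], B[perm]

# identical likelihoods: the state IDs carry no intrinsic meaning
print(np.isclose(forward_loglik(pi, A, B, obs),
                 forward_loglik(pi_p, A_p, B_p, obs)))  # -> True
```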
Now a little more detail if the above was unclear:
I take my training data (texts, e.g. newspaper articles) and count the number of different words (types).
Then I pass count(types) and count(labels) to the model, the labels being a set of POS tags. It then randomly constructs a probabilistic fully connected state machine with pow(count(labels), order_of_model) different states, order_of_model being the number of hidden variables (POS-tag n-grams) that get combined into an individual state. It also assigns each of these states an initial probability and an emission probability for each of the types.
The model assumes a mapping from [0, pow(count(labels), order_of_model)) as state IDs onto external tuples of labels, and a mapping from [0, count(types)) for the emissions onto words.
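The mapping between state IDs and label tuples I have in mind is just a base-count(labels) encoding, roughly like this (again a simplified sketch, identifiers are hypothetical):

```python
def tuple_to_state_id(label_tuple, num_labels):
    """Encode an n-gram of label indices as a single state ID in [0, num_labels**order)."""
    sid = 0
    for lab in label_tuple:
        sid = sid * num_labels + lab
    return sid

def state_id_to_tuple(sid, num_labels, order):
    """Decode a state ID back into its n-gram of label indices."""
    labels = []
    for _ in range(order):
        labels.append(sid % num_labels)
        sid //= num_labels
    return tuple(reversed(labels))

# e.g. with 12 POS tags and order 2: state 37 <-> the tag bigram (3, 1)
assert tuple_to_state_id((3, 1), 12) == 37
assert state_id_to_tuple(37, 12, 2) == (3, 1)
```

But since training can permute the state IDs arbitrarily, this mapping is exactly the part that seems meaningless to me after Baum-Welch has run.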