What are the hidden and observed states when building an acoustic model?

I have been trying to learn how to build ASRs and have been researching for a while now, but I can't seem to get a straight answer. From what I understand, an ASR requires an acoustic model. That acoustic model can be trained via Baum-Welch or Viterbi training. Those algorithms estimate the parameters of a Hidden Markov Model.
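To make my current understanding concrete, here is a toy sketch of the HMM parameters I believe Baum-Welch/Viterbi training would estimate. All the numbers, state counts, and observation symbols below are invented for illustration; this uses a discrete-observation HMM rather than the Gaussian emissions a real acoustic model would have.

```python
import numpy as np

# Hypothetical HMM parameters -- exactly the quantities that Baum-Welch
# training re-estimates from data (all values here are made up):
pi = np.array([1.0, 0.0])            # initial state probabilities
A = np.array([[0.7, 0.3],            # transition probabilities A[i, j]
              [0.0, 1.0]])           #   = P(next state j | current state i)
B = np.array([[0.9, 0.1],            # emission probabilities B[s, o]
              [0.2, 0.8]])           #   = P(observation o | state s)

def forward_likelihood(obs):
    """Forward algorithm: P(observation sequence | HMM parameters).

    Baum-Welch runs this (plus a backward pass) on every training
    utterance and nudges pi, A, B to raise this likelihood.
    """
    alpha = pi * B[:, obs[0]]                 # initialize with first symbol
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]         # propagate and emit
    return alpha.sum()

print(forward_likelihood([0, 0, 1]))          # likelihood of a toy sequence
```

My understanding is that training only ever adjusts `pi`, `A`, and `B`; the question below is about what the observations and hidden states actually are in the speech setting.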

From what I gather, to train the parameters we need the Wav files, from which the MFCC feature vectors can be extracted. We also need the words each Wav file is supposed to contain. From what I've read online, the words the Wav files are supposed to be pronouncing are known, so those are the observed states. Does that make the feature vectors the hidden states? But we can extract the feature vectors from the Wav files, so those are known as well. Some articles also mention that training is required because we do not know the start and end times at which each word is pronounced in the Wav file. So maybe the hidden states are the start and end timings? How would timing be converted into states, though?
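Here is the toy picture I currently have of the timing question, so my confusion is concrete. The state names (`sil`, `phone_a`, `phone_b`), all probabilities, and the fake per-frame scores are invented; the per-frame scores stand in for the likelihood of each frame's MFCC vector under each state. If the hidden state at every audio frame is "which unit is active", then the decoded state sequence itself would be the timing:

```python
import numpy as np

states = ["sil", "phone_a", "phone_b"]       # hypothetical units
pi = np.log(np.array([0.9, 0.05, 0.05]))     # log initial probabilities
A = np.log(np.array([[0.6, 0.4, 1e-9],       # sil can move to phone_a
                     [1e-9, 0.6, 0.4],       # phone_a can move to phone_b
                     [1e-9, 1e-9, 1.0]]))    # phone_b self-loops

def viterbi(log_emissions):
    """log_emissions[t, s] = log P(frame t's feature vector | state s).

    Returns the most likely hidden-state index for every frame.
    """
    T, S = log_emissions.shape
    delta = pi + log_emissions[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + A          # scores[i, j]: prev i -> next j
        back[t] = scores.argmax(axis=0)      # best predecessor per state
        delta = scores.max(axis=0) + log_emissions[t]
    path = [int(delta.argmax())]             # backtrack from the best end state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Fake per-frame emission scores standing in for MFCC likelihoods:
log_em = np.log(np.array([[0.8, 0.1, 0.1],
                          [0.7, 0.2, 0.1],
                          [0.1, 0.8, 0.1],
                          [0.1, 0.7, 0.2],
                          [0.1, 0.1, 0.8],
                          [0.1, 0.1, 0.8]]))
alignment = [states[i] for i in viterbi(log_em)]
print(alignment)  # one hidden-state label per frame, i.e. the alignment
```

If this picture is right, then the frame at which the decoded sequence switches from `sil` to `phone_a` is the start time of that phone, so the "timing" would just be the hidden-state sequence. Is that the correct way to think about it?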

I would appreciate it if anyone could give a clear, straight answer. Thanks!

Topic markov-hidden-model speech-to-text nlp

Category Data Science
