Hidden Markov Models in Speech Recognition
My first question here. I am trying to build a sign language translator (from signs to text) and noticed that the problem is quite similar to speech recognition, so I started researching that. Right now, the one thing I can't figure out is how exactly Hidden Markov models are used in speech recognition. I can understand how an HMM is used, for example, in part-of-speech tagging, where we get one of the states for each word. But in speech recognition, do we get one of the states (each represented by a phoneme, or part of a phoneme) for each frame? A pronounced phoneme can span many frames, right? What do we do in that situation? Should I just merge consecutively repeated states into one?

Some tutorials talk about building a separate HMM for each word or phoneme. What does that mean? An HMM models a sequence of observables, right? So what would the dataset, observables, and states be in an HMM for a single word, say "a"?
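To show where I am at, here is the picture I currently have in my head of the "one HMM per word" setup, sketched with the hmmlearn library. The library choice, the feature shapes, the 3-state count, and all variable names are just my assumptions, not something taken from a tutorial:

```python
# Sketch of the "one HMM per word" idea (pip install hmmlearn).
# Each training example is assumed to be a sequence of per-frame feature
# vectors (e.g. MFCCs for speech, or hand-landmark coordinates for signs),
# with shape (n_frames, n_features); n_frames varies per example.
import numpy as np
from hmmlearn import hmm

def train_word_hmm(sequences, n_states=3):
    """Fit one HMM on all example sequences of a single word."""
    X = np.concatenate(sequences)            # stack the frames of all examples
    lengths = [len(s) for s in sequences]    # tell hmmlearn where each example ends
    model = hmm.GaussianHMM(n_components=n_states,
                            covariance_type="diag",
                            n_iter=50)
    model.fit(X, lengths)                    # Baum-Welch: the "learning problem"
    return model

def recognize(frames, word_models):
    """Score a new frame sequence against every word's HMM (the "likelihood problem")."""
    scores = {word: m.score(frames) for word, m in word_models.items()}
    return max(scores, key=scores.get)       # the word whose HMM explains the frames best
```

Is this roughly the right idea, i.e. each word gets its own model, and recognition just means scoring a new frame sequence against all of them and picking the best one?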
Maybe my understanding of HMMs is wrong, so please check it. We have observations o_1, ..., o_T, hidden states q_1, ..., q_N, state transition probabilities A, and output (emission) probabilities B. We also have the likelihood problem, where we compute the probability of the observations given A and B; the decoding problem, where we find the most probable hidden state sequence, so there is a state (right?) behind each observation; and the learning problem, where, given the observations and the number of states, we estimate A and B.
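Written out the way I understand it (with the model denoted $\lambda = (A, B, \pi)$; please correct me if any of this is off):

$$P(O \mid \lambda) = \sum_{Q} \pi_{q_1}\, b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(o_t) \quad \text{(likelihood)}$$

$$Q^{*} = \arg\max_{q_1, \dots, q_T} P(q_1, \dots, q_T \mid O, \lambda) \quad \text{(decoding: one state per observation/frame)}$$

$$\lambda^{*} = \arg\max_{\lambda} P(O \mid \lambda) \quad \text{(learning: only the observations and the number of states are given)}$$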
Please help. Sorry for the bad formatting; this is my first question here. Thanks!
Topic markov-hidden-model speech-to-text nlp machine-learning
Category Data Science