How to do phoneme segmentation using dynamic time warping?
Background Information:
- Dynamic Time Warping (DTW):
In time series analysis, dynamic time warping (DTW) is one of the algorithms for measuring similarity between two temporal sequences, which may vary in speed. (Source: Wikipedia)
- Phoneme Segmentation:
Phoneme segmentation is the ability to break words down into individual sounds. For example, a child may break the word “sand” into its component sounds – /sss/, /aaa/, /nnn/, and /d/. (Source)
The Question(s):
(a) How can we do phoneme segmentation using DTW?
(b) Which type of data do we need to implement this idea? (by asking about the type of the data, we mean which features should be available for each sample in the training dataset).
My try:
Assume we have some audio files in which the phoneme segments are fully labeled. For instance, we know that from $t_1$ to $t_2$ the speaker is saying only /a/, and the other parts carry the same kind of label. If we get a new sample that the system has not seen yet, a simple approach would be to calculate the DTW distance between the new sample and each of the training segments. This approach would be like a KNN (K-nearest-neighbors) algorithm: we just cast a vote among the nearest segments to see which phoneme wins.
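To make the idea concrete, here is a minimal sketch of that DTW + KNN scheme. It is not a full implementation: real systems would compare frame-level feature vectors (e.g., MFCCs) per segment, while the toy test below uses 1-D sequences as stand-ins, and the function names (`dtw_distance`, `knn_phoneme`) are mine, not from any standard library.

```python
import numpy as np
from collections import Counter

def dtw_distance(a, b):
    """Classic DTW between two sequences of feature frames.

    a, b: arrays of shape (T,) or (T, d). Returns the accumulated
    Euclidean cost along the optimal warping path.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.ndim == 1:
        a = a[:, None]  # treat a 1-D signal as (T, 1) frames
    if b.ndim == 1:
        b = b[:, None]
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def knn_phoneme(query, templates, k=3):
    """templates: list of (feature_sequence, phoneme_label) pairs.
    Returns the majority label among the k DTW-nearest templates."""
    ranked = sorted((dtw_distance(query, seq), lab) for seq, lab in templates)
    votes = Counter(lab for _, lab in ranked[:k])
    return votes.most_common(1)[0][0]
```

Because every query is compared against every labeled segment, this is O(N) full DTW computations per query, which is exactly the inefficiency you point out.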
Another case is when the data is not labeled. Here, I think we may be able to do some kind of clustering (e.g., K-means) to extract cluster centroids and use those as phoneme templates. We could then calculate the distance between the new sample only against the cluster centroids, which would be much faster than comparing against every training segment as in the previous case.
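One wrinkle with this idea: under DTW the segments have different lengths, so the arithmetic "mean" of a cluster is not well defined (true DTW k-means needs something like barycenter averaging). A common workaround is k-medoids, where each cluster is represented by an actual segment. Below is a minimal sketch under that assumption; `dtw` and `dtw_kmedoids` are my own toy names, the 1-D sequences stand in for real feature frames, and the farthest-first initialization is just one simple deterministic choice.

```python
import numpy as np

def dtw(a, b):
    """DTW on 1-D sequences with absolute-difference frame cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def dtw_kmedoids(segments, k, n_iter=20):
    """k-medoids clustering under DTW distance.

    Each cluster is represented by a real member segment (its medoid),
    sidestepping the undefined mean of variable-length sequences.
    Returns (medoid indices, cluster label per segment).
    """
    n = len(segments)
    # Precompute the symmetric pairwise DTW distance matrix once.
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = dtw(segments[i], segments[j])
    # Farthest-first initialization: deterministic, spreads medoids out.
    medoids = [0]
    while len(medoids) < k:
        medoids.append(int(np.argmax(D[:, medoids].min(axis=1))))
    for _ in range(n_iter):
        # Assign each segment to its nearest medoid.
        labels = np.argmin(D[:, medoids], axis=1)
        # Update: the member minimizing total within-cluster distance.
        new_medoids = []
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members) == 0:
                new_medoids.append(medoids[c])
            else:
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids.append(int(members[np.argmin(within)]))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, labels
```

A new segment would then be compared against only the k medoids rather than the whole training set, which gives the speedup you describe.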
The problem is that these approaches seem too simplistic and inefficient to me. Is there a better (or smarter) way to tackle this problem of segmenting phonemes using DTW? Should the samples have any other kind of features? (By "other", I mean beyond having the time boundaries of each phoneme specified.)
Topic dynamic-time-warping speech-to-text
Category Data Science