How to gather training data for simple voice commands?
I'm trying to build a machine learning model for recognizing simple voice commands like up, down, left, etc.
On similar problems based on images, I'd just take the picture and assign a label to it.
I can generate features and visualize them using librosa, and I hear CNNs are amazing at this task. What I can't figure out is how to gather training data for such audio-based systems: I can't just record an entire clip and label it, since my commands are only going to be a few milliseconds long (and at prediction time the input will only be a few milliseconds long too).
My first idea was to generate an image (using matplotlib and librosa.feature) for every few milliseconds of audio, then label the images where I'm saying the command as positive and the rest as negative. That seems tedious, though, and there might be a better way to do it.
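To make the idea concrete, here is a rough sketch of the windowing and labeling step I have in mind. Everything in it is an assumption on my part: `label_windows`, its parameters, and the window sizes are made-up names and values, and it expects that I have already noted the start and end time of each command by hand (as non-overlapping spans in seconds). It works on any sequence of samples, for example the array returned by `librosa.load`.

```python
from typing import List, Sequence, Tuple


def label_windows(
    samples: Sequence[float],
    sample_rate: int,
    command_spans: List[Tuple[float, float]],
    window_s: float = 0.5,
    hop_s: float = 0.25,
    min_overlap: float = 0.5,
) -> List[Tuple[int, int, int]]:
    """Slice audio into fixed-length windows and label each one.

    Returns (start_sample, end_sample, label) tuples. A window gets label 1
    when at least `min_overlap` of its length is covered by the hand-marked
    command spans (assumed not to overlap each other), and 0 otherwise.
    """
    win = int(round(window_s * sample_rate))
    hop = max(1, int(round(hop_s * sample_rate)))
    # Convert the annotated spans from seconds to sample indices.
    spans = [
        (int(round(s * sample_rate)), int(round(e * sample_rate)))
        for s, e in command_spans
    ]
    windows = []
    for start in range(0, len(samples) - win + 1, hop):
        end = start + win
        # Number of samples in this window that fall inside any command span.
        covered = sum(max(0, min(end, e) - max(start, s)) for s, e in spans)
        label = 1 if covered >= min_overlap * win else 0
        windows.append((start, end, label))
    return windows


# Hypothetical example: 2 s of silence at 16 kHz, with a command marked
# between 0.5 s and 1.0 s.
windows = label_windows([0.0] * 32000, 16000, [(0.5, 1.0)])
print([label for _, _, label in windows])
```

Each `(start, end)` pair could then be used to slice the waveform and compute a spectrogram for that window. The part that still looks tedious is producing `command_spans` by hand for every recording.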
Is there an effective way to achieve this?
Topic training dataset machine-learning
Category Data Science