How to gather training data for simple voice commands?

I'm trying to build a machine learning model for recognizing simple voice commands like up, down, left, etc.

On similar problems based on images, I'd just take the picture and assign a label to it.

I can generate features and visualize them using librosa. And I hear CNNs are amazing at this task. So, I was wondering how I'd gather training data for such audio-based systems, since I can't record an entire clip considering my commands are only going to be a few milliseconds long (and while predicting, it'll only be a few milliseconds too).

My first idea was to generate an image (using matplotlib and librosa.features) for every few milliseconds and just use the ones where I'm saying the command as positive labels (and the rest negative labels), but it seems tedious and there might be a better way to do it.
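For reference, this is roughly how I generate an image for a clip at the moment (the file path, sample rate, and mel parameters below are just placeholders, not settled choices):

    import numpy as np
    import librosa
    import librosa.display
    import matplotlib.pyplot as plt

    # Load a short clip (placeholder path)
    y, sr = librosa.load("clip.wav", sr=16000)

    # Log-mel spectrogram, a common CNN input for audio
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
    S_db = librosa.power_to_db(S, ref=np.max)

    # Render and save the spectrogram as an image
    librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="mel")
    plt.savefig("clip.png")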

Is there an effective way to achieve this?

Topic training dataset machine-learning

Category Data Science


To train (and evaluate) a classifier for fixed-vocabulary speech commands you should build a curated dataset that:

  • Has a well-defined list of commands (the classes for the classifier)
  • Has enough samples of each command (at least 10-100, more if possible)
  • Each audio sample contains only one command
  • Each sample is roughly the same length
  • The onset of the command is positioned at the start of the audio sample

The dataset would then be a collection of .wav audio files plus a .csv file that describes them:

filename,command,...
dd690d11-2412-4a16-8103-f1c2d178eca8.wav,up
df36c75b-d30d-49c0-a103-f9fc76827625.wav,down
dab2f090-4493-4756-85ee-b6cd62c7bae8.wav,left
...
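As a rough sketch of how such a dataset could later be loaded for training, assuming 16 kHz audio and a fixed clip length of 1 second (these numbers, and the paths, are assumptions to adjust for your own recordings):

    import numpy as np
    import pandas as pd
    import librosa

    SAMPLE_RATE = 16000
    CLIP_SECONDS = 1.0  # assumed fixed length per command

    def load_dataset(csv_path, audio_dir):
        """Load the audio samples and labels described by the CSV."""
        df = pd.read_csv(csv_path)
        target_len = int(SAMPLE_RATE * CLIP_SECONDS)
        X, labels = [], []
        for _, row in df.iterrows():
            audio, _ = librosa.load(f"{audio_dir}/{row['filename']}", sr=SAMPLE_RATE)
            # Pad or trim so every sample ends up the same length
            audio = librosa.util.fix_length(audio, size=target_len)
            X.append(audio)
            labels.append(row["command"])
        return np.stack(X), np.array(labels)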

Recording process

A basic process for the recording would be to use an existing audio program (such as Audacity), and make one audio recording for each word.

One thing that may speed up the collection is to record many utterances in one go: say a single command word ("up") many times in a row, with plenty of silence (around 0.5 sec) between repetitions. An automatic audio segmentation algorithm can then split the recording into individual commands at the silences. Example programs for splitting on silence are pyAudioAnalysis and auditok.
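As a rough alternative sketch using librosa's silence detection (the threshold, sample rate, and file names are assumptions you would need to tune):

    import librosa
    import soundfile as sf

    # One long recording of a single word ("up") repeated many times,
    # with silence between the repetitions
    y, sr = librosa.load("up_many_times.wav", sr=16000)

    # Find non-silent intervals; top_db controls how quiet counts as silence
    intervals = librosa.effects.split(y, top_db=30)

    # Save each interval as its own one-command sample
    for i, (start, end) in enumerate(intervals):
        sf.write(f"up_{i:03d}.wav", y[start:end], sr)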

Capturing variation

A good data collection process makes sure to capture most of the naturally occurring variation in the classes. For speech that usually means recording many different speakers, since everyone says things slightly differently. But even with a single speaker you can introduce some variation yourself: do additional recording passes where you say the words fast or slow, and passes where you say them loudly (near a shout) or softly (near a whisper).

You may also want out-of-class examples, at least for evaluation. You can collect these in much the same way. To get many different words, you could for example read a Wikipedia article word by word (with silence in between), and then go through the recording and remove your target words.


What would librosa generate? If the sound clips it gives you aren't actual words, I would worry about the effectiveness of the training. It would be better to record samples of yourself saying the positively labelled words, plus a bunch of other words to use as negative labels.

I recognize this is likely to be a tedious process. One approach I have read about for maximising the use of small amounts of training data is to apply transformations and distortions to the training data to increase the effective number of samples. In that case the training data were pictures of animals; the researchers tweaked the images slightly by rotating them by various degrees, squeezing them (reducing the height or width without preserving the aspect ratio), stretching them, and so on. You might be able to do the same: record a smallish number of samples and then use distortions to raise the effective number of samples to a useful level.
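A rough audio analogue of those image distortions, sketched with librosa (the stretch rates, pitch steps, and noise level are arbitrary illustrative choices, not recommended values):

    import numpy as np
    import librosa

    def augment(y, sr):
        """Yield distorted variants of one recorded command."""
        # Slightly faster and slower versions (like stretching an image)
        for rate in (0.9, 1.1):
            yield librosa.effects.time_stretch(y, rate=rate)
        # Shift pitch down/up by one semitone
        for steps in (-1, 1):
            yield librosa.effects.pitch_shift(y, sr=sr, n_steps=steps)
        # Add a little background noise
        yield y + 0.005 * np.random.randn(len(y))

Each variant keeps the original label, so a small set of recordings can be expanded several-fold.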
