Data augmentation for CNN-based recognition of dog barks

I am training CNNs to recognize dog barking, and for this I would like to augment the data set I have (~30,000 clips of 10 s each, labelled either bark or no-bark).

The straightforward idea was to mix the barking clips with the no-barking clips (maybe some leaves rustling or whatever), so that the resulting remix is again a barking clip. I did this by simply adding the two waveforms (from .wav files) in a random ratio, e.g.

mix_barking_clip = 0.81 * barking_clip + 0.27 * no_barking_clip
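A minimal sketch of this mixing in Python, assuming both waveforms are already loaded as equal-length float arrays scaled to [-1, 1] (the function and variable names here are illustrative, not from my actual pipeline). One detail worth noting: after summing, the peak amplitude can exceed 1.0, so I also renormalize to avoid clipping:

```python
import numpy as np

def mix_clips(barking, no_barking, rng=None):
    """Mix a barking clip with a no-barking background clip at a random ratio.

    Both inputs are 1-D float waveforms of equal length in [-1, 1].
    """
    if rng is None:
        rng = np.random.default_rng()
    a = rng.uniform(0.5, 1.0)  # weight for the bark (kept dominant)
    b = rng.uniform(0.0, 0.5)  # weight for the background
    mix = a * barking + b * no_barking
    # Renormalize only if the sum clips beyond full scale
    peak = np.max(np.abs(mix))
    if peak > 1.0:
        mix = mix / peak
    return mix
```

With a = 0.81 and b = 0.27 this reproduces the example ratio above; drawing the weights randomly per clip gives a different mix each time.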

However, this turns out to decrease the F1-score on the test set by quite a lot. My question is: is this mixing technique wrong, i.e. am I producing nonsense data with it?

Topic cnn data-augmentation audio-recognition deep-learning machine-learning

Category Data Science
