Data augmentation for dog-bark recognition with CNNs
I am training CNNs to recognize dog barking, and I would like to augment my dataset (~30,000 10-second clips, each containing either barks or no barks).
The straightforward idea was to mix the barking clips with the no-barking clips (leaves rustling, and so on), so that the resulting mix is again a barking clip. I did this by simply adding the two waveforms (from .wav files) in a random ratio, e.g.
mix_barking_clip = 0.81 * barking_clip + 0.27 * no_barking_clip
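Concretely, the mixing step looks roughly like the sketch below (a minimal version of what I do; I'm assuming mono .wav files at the same sample rate, loaded here with soundfile, and the peak rescaling at the end is only there to avoid clipping when writing the result back out):

import numpy as np
import soundfile as sf  # any WAV loader that returns float samples works

def mix_clips(barking_path, no_barking_path):
    """Mix a barking clip with a background clip at a random ratio."""
    bark, sr_bark = sf.read(barking_path)
    background, sr_bg = sf.read(no_barking_path)
    assert sr_bark == sr_bg, "clips must share a sample rate"

    # random mixing weights, e.g. 0.81 and 0.27 as in the line above
    a = np.random.uniform(0.5, 1.0)
    b = np.random.uniform(0.1, 0.5)

    # truncate to the shorter clip so the waveforms line up
    n = min(len(bark), len(background))
    mix = a * bark[:n] + b * background[:n]

    # rescale so the sum stays in [-1, 1] and does not clip on export
    peak = np.abs(mix).max()
    if peak > 1.0:
        mix = mix / peak
    return mix, sr_bark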
However, this turns out to decrease the F1 score on the test set quite substantially. My question: is this mixing technique wrong, and am I producing nonsense data with it?
Topic cnn data-augmentation audio-recognition deep-learning machine-learning
Category Data Science