Downsampling audio files for use in Machine Learning
I'm trying to use the work (Neural Networks) done in this repo:
It says this:
Note: To apply this toolkit to other speech data, the speech data should be sampled with 16kHz sampling frequency.
I've got speech data at 48 kHz. I've read in places that reducing the sampling rate is a complicated process: you can't just drop samples (say, keep only every 3rd one to get from 48 kHz to 16 kHz), you have to filter the signal first...
Is this necessary if I only intend to use the data in the Neural Network toolkit provided by the repo I linked? If so, is there an industry standard method for changing sample rate?
I realise that it probably depends on what features are being used. However, the feature that is used is this:
MRCG (multi-resolution cochleagram) concatenates the cochleagram features at multiple spectrotemporal resolutions
This is a ruddy complicated feature! Let's pretend we're just using a mel spectrogram (unless you're willing to answer the question from the perspective of MRCG).
Neural networks are likely to pick up on features of a mel spectrogram that we wouldn't think of. This makes me think it is unwise to train the neural net on downsampled speech data unless we intend to predict using 48 kHz data downsampled to 16 kHz forever after...
What do you think? Can I use my 48 kHz data, downsampled with no filtering, with the expectation that the model will work for prediction on real 16 kHz data?
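To make concrete what I mean by "no filtering", here is a minimal sketch in pure Python. It only looks at a single synthetic 10 kHz tone, not at real speech or at what the network does with it. The helper names (`tone`, `dominant_freq`, `lowpass_fir`, `filter_valid`, `rms`), the 7 kHz cutoff and the 101-tap Hamming-windowed sinc filter are my own illustrative choices, not anything taken from the repo. In practice I assume a library routine such as `scipy.signal.resample_poly(x, up, down)` would be the sensible way to do the filtered version.

```python
import math

def tone(freq_hz, rate_hz, n):
    """n samples of a unit-amplitude sine at freq_hz, sampled at rate_hz."""
    return [math.sin(2 * math.pi * freq_hz * i / rate_hz) for i in range(n)]

def dominant_freq(x, rate_hz):
    """Frequency (Hz) of the strongest non-DC bin of a plain O(n^2) DFT."""
    n = len(x)
    best_k, best_power = 0, -1.0
    for k in range(1, n // 2 + 1):
        re = sum(v * math.cos(2 * math.pi * k * i / n) for i, v in enumerate(x))
        im = sum(v * math.sin(2 * math.pi * k * i / n) for i, v in enumerate(x))
        power = re * re + im * im
        if power > best_power:
            best_k, best_power = k, power
    return best_k * rate_hz / n

def lowpass_fir(cutoff_hz, rate_hz, taps=101):
    """Hamming-windowed sinc low-pass coefficients, normalised to unit DC gain."""
    fc = cutoff_hz / rate_hz  # cutoff in cycles per sample
    mid = (taps - 1) / 2
    h = []
    for i in range(taps):
        t = i - mid
        sinc = 2 * fc if t == 0 else math.sin(2 * math.pi * fc * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (taps - 1))
        h.append(sinc * window)
    gain = sum(h)
    return [c / gain for c in h]

def filter_valid(x, h):
    """Convolve x with h, keeping only outputs where h fully overlaps x."""
    m = len(h)
    return [sum(h[j] * x[i + m - 1 - j] for j in range(m))
            for i in range(len(x) - m + 1)]

def rms(x):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))

# 10 ms of a 10 kHz tone at 48 kHz. 10 kHz is above the 8 kHz Nyquist
# limit of a 16 kHz signal, so it cannot be represented after downsampling.
x48 = tone(10_000, 48_000, 480)

# "No filtering": keep every 3rd sample.
naive16 = x48[::3]

# Low-pass below 8 kHz first, then keep every 3rd sample.
filtered16 = filter_valid(x48, lowpass_fir(7_000, 48_000))[::3]

print(dominant_freq(x48, 48_000))      # 10000.0
print(dominant_freq(naive16, 16_000))  # 6000.0 (aliased: 16000 - 10000)
print(rms(naive16), rms(filtered16))   # full-strength alias vs. nearly silent
```

So with plain sample dropping, the 10 kHz tone does not disappear; it reappears at full strength as a 6 kHz tone that was never in the recording. With the filter in front, it is almost entirely removed, which is what a genuine 16 kHz recording (made with an anti-aliasing filter) would contain.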
And then for future readers' sake, how about the other way? Say I had an 8 kHz file, could I increase the sample rate to 16 kHz without filtering?
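For the upsampling case, the two "no filtering" options I can think of are inserting a zero after every sample, or repeating every sample. Here is a rough sketch of both on a synthetic 3 kHz tone; the helper `bin_magnitude` and the test signal are my own assumptions for illustration only.

```python
import math

def bin_magnitude(x, freq_hz, rate_hz):
    """Magnitude of the DFT of x at freq_hz (chosen to fall exactly on a bin)."""
    re = sum(v * math.cos(2 * math.pi * freq_hz * i / rate_hz) for i, v in enumerate(x))
    im = sum(v * math.sin(2 * math.pi * freq_hz * i / rate_hz) for i, v in enumerate(x))
    return math.hypot(re, im)

# 10 ms of a 3 kHz tone sampled at 8 kHz.
x8 = [math.sin(2 * math.pi * 3_000 * i / 8_000) for i in range(80)]

# Option 1: insert a zero after every sample (8 kHz -> 16 kHz).
zero_stuffed = []
for v in x8:
    zero_stuffed += [v, 0.0]

# Option 2: repeat every sample (8 kHz -> 16 kHz).
repeated = []
for v in x8:
    repeated += [v, v]

# Compare the real tone (3 kHz) with its mirror image (8000 - 3000 = 5 kHz).
z3 = bin_magnitude(zero_stuffed, 3_000, 16_000)
z5 = bin_magnitude(zero_stuffed, 5_000, 16_000)
r3 = bin_magnitude(repeated, 3_000, 16_000)
r5 = bin_magnitude(repeated, 5_000, 16_000)

print("zero-stuffed:", z3, z5)  # image at 5 kHz as strong as the tone
print("repeated:    ", r3, r5)  # image weaker, but still clearly present
```

In both cases a 5 kHz component shows up that the 8 kHz file could never have contained, so I assume a low-pass filter after upsampling is what removes it.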
Topic audio-recognition data-cleaning
Category Data Science