How should I process the output from this neural network?
I have a neural network that takes a mel spectrogram (as an np.array) of a 3-second audio clip from a song as input, and outputs a vector of predictions, one for each of 494 given artists, scoring how likely the clip is to be by that artist.
At first, I was taking whole songs, splitting them into 3-second clips, feeding each clip into the network, and averaging the outputs. But the results turned out to be unreliable.
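To make the split-and-average approach concrete, here is a minimal sketch of it. It is an illustration under assumptions, not the actual pipeline: `predict_clip` is a hypothetical stand-in for the network's forward pass, the spectrogram is assumed to have shape `(n_mels, n_frames)`, and `frames_per_clip` is assumed to be the number of spectrogram frames that make up 3 seconds.

```python
import numpy as np

def split_into_clips(spectrogram, frames_per_clip):
    """Split a (n_mels, n_frames) spectrogram into non-overlapping clips.

    Any trailing frames that do not fill a whole clip are dropped.
    """
    n_clips = spectrogram.shape[1] // frames_per_clip
    return [
        spectrogram[:, i * frames_per_clip:(i + 1) * frames_per_clip]
        for i in range(n_clips)
    ]

def average_predictions(spectrogram, predict_clip, frames_per_clip):
    """Run the (hypothetical) predict_clip on every clip and average the outputs."""
    clips = split_into_clips(spectrogram, frames_per_clip)
    if not clips:
        raise ValueError("song is shorter than one clip")
    # Stack per-clip output vectors into shape (n_clips, n_artists).
    outputs = np.stack([predict_clip(clip) for clip in clips])
    # Every clip gets equal weight, including silent intros and outros.
    return outputs.mean(axis=0)
```

Note that a plain mean like this gives silent or atypical clips the same weight as every other clip.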
I got advice that I should only need one 3-second clip, but the person who suggested it had not worked with audio before. If I do that, which 3-second clip should I use? For many songs, the first or last 3 seconds are silence, or do not sound like the rest of the song at all, and that could throw off artist classification.
What do you all advise?
Topic audio-recognition neural-network
Category Data Science