Are there voice or audio weights for VGG or Inception?

  • I want to use VGG16 (or VGG19) for a voice clustering task.
  • I read some articles that suggest using VGG (16 or 19) to build the embedding vectors for the clustering algorithm.
  • The process is to convert the wav file into MFCCs or a plot (amplitude vs. time) and use this as input to the VGG model.
  • I tried it out with VGG19 (and weights='imagenet').
  • I got bad results, and I assume it's because I'm using VGG with the wrong weights (image weights from ImageNet).
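The conversion step above (MFCC matrix → VGG-shaped input) can be sketched with plain numpy. This is a hedged illustration, not the exact pipeline from the articles mentioned: the MFCC matrix here is a random stand-in (in practice it would come from a library such as librosa), and the resize is simple nearest-neighbour sampling.

```python
import numpy as np

# Stand-in MFCC matrix: 40 coefficients x 300 frames.
# (In practice this would be computed from the wav file,
# e.g. with librosa.feature.mfcc.)
mfcc = np.random.randn(40, 300)

# Scale to [0, 255], mimicking pixel intensities
scaled = (mfcc - mfcc.min()) / (mfcc.max() - mfcc.min()) * 255.0

# Nearest-neighbour resize to VGG's expected 224x224 spatial size
rows = np.linspace(0, scaled.shape[0] - 1, 224).astype(int)
cols = np.linspace(0, scaled.shape[1] - 1, 224).astype(int)
resized = scaled[np.ix_(rows, cols)]

# Replicate the single channel three times to get an RGB-shaped tensor
vgg_input = np.stack([resized] * 3, axis=-1)
print(vgg_input.shape)  # (224, 224, 3)
```

Note that even with the shapes matching, ImageNet filters were learned on natural images, not spectrogram-like inputs, which is consistent with the poor results described above.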

So:

  1. Are there any audio/voice pre-trained weights for VGG?
  2. If not, are there other pre-trained audio/voice models?

Topic vgg16 transfer-learning inception feature-engineering deep-learning

Category Data Science


Besides VGGish mentioned by @Ubikuity, there are other pre-trained audio models:

  • PANNs by Qiuqiang Kong. As of July 2021, one of the best on the general audio classification benchmark AudioSet. PANNs @ Github. Based on PyTorch.
  • YAMNet, by the same team at Google as VGGish. YAMNet @ TfHub. Based on TensorFlow.
  • OpenL3, by the Music and Audio Research Laboratory at NYU. Very easy to get started with. OpenL3 @ Github. Based on TensorFlow/Keras.
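Whichever model is chosen, the downstream clustering step looks roughly the same: extract one embedding per clip, L2-normalize, and cluster. A minimal numpy sketch, with random stand-in vectors in place of real model embeddings and a tiny hand-rolled k-means (in practice you would use scikit-learn or similar):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for real embeddings: 20 clips, 512-dim vectors,
# drawn from two clearly separated "speakers".
emb = np.vstack([rng.normal(0, 1, (10, 512)) + 5,
                 rng.normal(0, 1, (10, 512)) - 5])

# L2-normalize so Euclidean distance tracks cosine distance
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

# Minimal k-means with k=2
k = 2
centers = emb[rng.choice(len(emb), k, replace=False)]
for _ in range(10):
    # Assign each clip to its nearest center
    labels = np.argmin(((emb[:, None] - centers) ** 2).sum(-1), axis=1)
    # Move each center to the mean of its assigned clips
    centers = np.stack([emb[labels == j].mean(0) for j in range(k)])

print(labels)
```

With embeddings from an audio-trained model, clips of the same speaker should land in the same cluster far more reliably than with ImageNet-weight VGG features.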

As far as I know, VGGish is VGG adapted to audio processing. I remember using it with MFCC input, though not with an amplitude-vs-time plot.
