Are there voice or audio weights for VGG or Inception?

  • I want to use VGG16 (or VGG19) for a voice clustering task.
  • I read some articles that suggest using VGG (16 or 19) to build the embedding vectors for the clustering algorithm.
  • The process is to convert the wav file into MFCCs or a plot (amplitude vs. time) and use this as input to the VGG model.
  • I tried it out with VGG19 (and weights='imagenet').
  • I got bad results, and I assume it's because I'm using VGG with the wrong weights (image weights from ImageNet).
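The conversion step above (MFCC matrix → VGG-shaped input) can be sketched with plain numpy. This is a hedged illustration, not the exact pipeline from the articles mentioned: the MFCC matrix here is a random stand-in (in practice it would come from a library such as librosa), and the resize is simple nearest-neighbour sampling.

```python
import numpy as np

# Stand-in MFCC matrix: 40 coefficients x 300 frames.
# (In practice this would be computed from the wav file,
# e.g. with librosa.feature.mfcc.)
mfcc = np.random.randn(40, 300)

# Scale to [0, 255], mimicking pixel intensities
scaled = (mfcc - mfcc.min()) / (mfcc.max() - mfcc.min()) * 255.0

# Nearest-neighbour resize to VGG's expected 224x224 spatial size
rows = np.linspace(0, scaled.shape[0] - 1, 224).astype(int)
cols = np.linspace(0, scaled.shape[1] - 1, 224).astype(int)
resized = scaled[np.ix_(rows, cols)]

# Replicate the single channel three times to get an RGB-shaped tensor
vgg_input = np.stack([resized] * 3, axis=-1)
print(vgg_input.shape)  # (224, 224, 3)
```

Note that even with the shapes matching, ImageNet filters were learned on natural images, not spectrogram-like inputs, which is consistent with the poor results described above.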

So:

  1. Are there any audio/voice pre-trained weights for VGG?
  2. If not, are there other pre-trained audio/voice models?

Topic vgg16 transfer-learning inception feature-engineering deep-learning

Category Data Science


Besides VGGish mentioned by @Ubikuity, there are other pre-trained audio models:

  • PANNs by Qiuqiang Kong. As of July 2021, one of the best on the general audio classification benchmark AudioSet. PANNs @ Github. Based on PyTorch.
  • YAMNet, by the same team at Google as VGGish. YAMNet @ TfHub. Based on TensorFlow.
  • OpenL3, by the Music and Audio Research Laboratory at NYU. Very easy to get started with. OpenL3 @ Github. Based on TensorFlow/Keras.
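Whichever model is chosen, the downstream clustering step looks roughly the same: extract one embedding per clip, L2-normalize, and cluster. A minimal numpy sketch, with random stand-in vectors in place of real model embeddings and a tiny hand-rolled k-means (in practice you would use scikit-learn or similar):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for real embeddings: 20 clips, 512-dim vectors,
# drawn from two clearly separated "speakers".
emb = np.vstack([rng.normal(0, 1, (10, 512)) + 5,
                 rng.normal(0, 1, (10, 512)) - 5])

# L2-normalize so Euclidean distance tracks cosine distance
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

# Minimal k-means with k=2
k = 2
centers = emb[rng.choice(len(emb), k, replace=False)]
for _ in range(10):
    # Assign each clip to its nearest center
    labels = np.argmin(((emb[:, None] - centers) ** 2).sum(-1), axis=1)
    # Move each center to the mean of its assigned clips
    centers = np.stack([emb[labels == j].mean(0) for j in range(k)])

print(labels)
```

With embeddings from an audio-trained model, clips of the same speaker should land in the same cluster far more reliably than with ImageNet-weight VGG features.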

As far as I know, VGGish is VGG adapted to audio processing. I remember using it with MFCC input, though not with an amplitude-vs-time plot.
