What is the suggested way to create features (Mel-Spectrograms) from speech signal for classification with ResNet?

At the moment I have this piece of code, which cuts a spectrogram into fixed-length tensors: def chunks(l, n): """Yield successive n-sized chunks from l.""" for i in range(0, len(l[0][0]), n): if(i+n < len(l[0][0])): yield X_sample.narrow(2, i, n) The following piece of code downsamples the audio, creates Mel spectrograms and takes the log of them, applies cepstral mean and variance normalization, then cuts the spectrogram into fixed-length segments with the code above and appends it …
Category: Data Science
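
The pipeline described in the question above (log-mel spectrogram, cepstral mean and variance normalization, fixed-length chunks) can be sketched roughly as follows. This is a minimal NumPy sketch, not the asker's code: the array shape (1 channel, 40 mel bands, 250 frames), the `cmvn` helper, and the random input are assumptions made for illustration. Note that the strict `i+n < len(...)` test in the excerpt would also skip a final chunk that ends exactly on the last frame; the sketch keeps it.

```python
import numpy as np

def cmvn(spec, eps=1e-8):
    """Normalize each mel band to zero mean and unit variance over time (last axis)."""
    mean = spec.mean(axis=-1, keepdims=True)
    std = spec.std(axis=-1, keepdims=True)
    return (spec - mean) / (std + eps)

def chunks(spec, n):
    """Yield successive n-frame chunks along the last (time) axis.

    A trailing remainder shorter than n frames is dropped.
    """
    num_frames = spec.shape[-1]
    for start in range(0, num_frames - n + 1, n):
        yield spec[..., start:start + n]

# Hypothetical log-mel spectrogram: 1 channel, 40 mel bands, 250 frames.
rng = np.random.default_rng(0)
log_mel = rng.normal(size=(1, 40, 250))

# Normalize first, then cut into 100-frame segments and stack them into a batch.
segments = np.stack(list(chunks(cmvn(log_mel), n=100)))
print(segments.shape)  # (2, 1, 40, 100)
```

Each stacked segment has the same shape, so the batch can be fed to a network that expects fixed-size inputs.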

log mel energies

I want to convert a mel spectrogram to log mel energies. What I used is: y, sr = librosa.load(filename, sr=16000) mel_spectrogram = librosa.feature.melspectrogram( y=y, sr=sr, n_mels=128, n_fft=1024, hop_length=512, power=2) log_mel_spectrogram = librosa.power_to_db(mel_spectrogram) I thought this converts to mel energies, but I found this line of code: log_mel_spectrogram = 20.0 / power * np.log10(np.maximum(mel_spectrogram, sys.float_info.epsilon)) My question is: what is the difference between log-mel spectrograms and log mel energies, and which line of code should I use?
Category: Data Science
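
For the question above, it may help to compare the two lines numerically. With `power=2`, the factor `20.0 / power` equals 10, so both expressions compute 10·log10 of the mel power values; they differ only in the floor applied before the logarithm and in the dynamic-range clipping that `librosa.power_to_db` applies by default. The sketch below uses NumPy only; `power_to_db_sketch` is a simplified stand-in written for illustration (assuming `ref=1.0`, `amin=1e-10`, `top_db=80.0`), and the random input is hypothetical.

```python
import numpy as np

def log_mel_generic(mel, power):
    """Formula quoted in the question: 20 / power * log10(max(mel, eps))."""
    eps = np.finfo(float).eps  # same value as sys.float_info.epsilon
    return 20.0 / power * np.log10(np.maximum(mel, eps))

def power_to_db_sketch(mel_power, amin=1e-10, top_db=80.0):
    """Simplified stand-in for librosa.power_to_db with ref=1.0 (an assumption)."""
    log_spec = 10.0 * np.log10(np.maximum(mel_power, amin))
    # Clip everything more than top_db below the maximum.
    return np.maximum(log_spec, log_spec.max() - top_db)

# Hypothetical mel power spectrogram: 128 bands, 50 frames.
rng = np.random.default_rng(0)
mel_power = rng.uniform(1e-3, 10.0, size=(128, 50))

a = log_mel_generic(mel_power, power=2)
b = power_to_db_sketch(mel_power)
print(np.allclose(a, b))  # True for this data: no value hits a floor or the clip
```

The results diverge only for values near zero or more than `top_db` below the peak, so for a power spectrogram the choice mostly comes down to whether that clipping is wanted.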

Downsampling audio files for use in Machine Learning

I'm trying to use the work (neural networks) done in this repo: https://github.com/jtkim-kaist/VAD It says this: Note: To apply this toolkit to other speech data, the speech data should be sampled with 16kHz sampling frequency. I've got speech data at 48 kHz. I've read in places that reducing the sampling rate is a complicated process: you can't just remove every nth datapoint, you have to filter things... Is this necessary if I only intend to use the data in the neural network …
Category: Data Science
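
The concern raised in the question above can be illustrated with a small experiment. Keeping every third sample of a 48 kHz signal gives a 16 kHz signal, but any content above the new Nyquist frequency (8 kHz) folds back into the audible band unless it is low-pass filtered first. The sketch below is a hand-rolled windowed-sinc filter in NumPy, written only to show the principle; the function names, the 101-tap length, and the 0.45 cutoff factor are assumptions. In practice a library resampler (for example `librosa.load(filename, sr=16000)`, which resamples on load) would normally be used instead.

```python
import numpy as np

def lowpass_fir(cutoff_hz, sr, num_taps=101):
    """Windowed-sinc low-pass FIR filter (Hamming window), unity gain at DC."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    fc = cutoff_hz / sr  # cutoff in cycles per sample
    h = 2 * fc * np.sinc(2 * fc * n) * np.hamming(num_taps)
    return h / h.sum()

def downsample(x, sr_in=48000, sr_out=16000):
    """Low-pass filter below the new Nyquist frequency, then keep every k-th sample."""
    factor, remainder = divmod(sr_in, sr_out)
    if remainder:
        raise ValueError("this sketch only handles integer rate ratios")
    h = lowpass_fir(0.45 * sr_out, sr_in)
    return np.convolve(x, h, mode="same")[::factor]

def rms(x):
    return float(np.sqrt(np.mean(np.square(x))))

# One second of a 12 kHz tone sampled at 48 kHz (above the 8 kHz target Nyquist).
sr = 48000
t = np.arange(sr) / sr
tone_12k = np.sin(2 * np.pi * 12000 * t)

naive = tone_12k[::3]            # aliases to a strong 4 kHz tone
filtered = downsample(tone_12k)  # the out-of-band tone is attenuated instead
print(rms(naive), rms(filtered))
```

The naive version keeps the full energy of a tone that should not exist at 16 kHz, which is the kind of artifact a network trained on properly resampled data would never have seen.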

How should I process the output from this neural network?

I have a neural network that takes an np.array of a mel spectrogram of a 3 second audio clip from a song as input, and outputs a vector of predictions, one for each of 494 given artists, of how likely the clip is to be by that artist. At first, I was getting whole songs, splitting them into 3 second clips, inputting each clip into the nn, and averaging the outputs. But this proved to be wonky. I got advice that I should only need one 3 second clip, but …
Category: Data Science
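
The averaging step described in the question above can be sketched as follows, alongside a majority vote over per-clip decisions as one possible alternative. The probabilities are made up for illustration (4 classes instead of the 494 artists in the question), and both helper names are hypothetical.

```python
import numpy as np

def average_clip_predictions(clip_probs):
    """Average per-clip class probabilities; return (predicted class, mean probabilities)."""
    clip_probs = np.asarray(clip_probs, dtype=float)
    mean_probs = clip_probs.mean(axis=0)
    return int(mean_probs.argmax()), mean_probs

def majority_vote(clip_probs):
    """Take the most frequent per-clip argmax as the song-level prediction."""
    votes = np.asarray(clip_probs).argmax(axis=1)
    return int(np.bincount(votes).argmax())

# Hypothetical softmax outputs for 3 clips of one song over 4 artists.
clip_probs = [
    [0.10, 0.60, 0.20, 0.10],
    [0.30, 0.40, 0.20, 0.10],
    [0.50, 0.20, 0.20, 0.10],
]

artist, mean_probs = average_clip_predictions(clip_probs)
print(artist, majority_vote(clip_probs))
```

Averaging keeps the confidence of each clip, while voting discards it; which behaves better depends on how well calibrated the per-clip outputs are.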

What's the best way to validate a rare event detection model during training?

When training a deep model for rare event detection (e.g. sound of an alarm in a home device audio stream), is it best to use a balanced validation set (50% alarm, 50% normal) to determine early stopping etc., or a validation set representative of reality? If an unbalanced, realistic validation set is used it may have to be huge to contain only a few positive event examples, so I'm wondering how this is typically dealt with. In the given example …
Category: Data Science

Error while Pre-processing Audio Data using Librosa (audio analysis library in python) for DL model

I am a beginner in the audio classification field in DL. I followed a YouTube Music Genre Classification series, which is working fine and has been very helpful, but I have a problem/error in the pre-processing part. I get this error repeatedly. The picture of the error and the code is attached. I don't understand what the error is because I've never worked with Librosa (an audio analysis library in Python). Kindly help me with that. Thank you. import json import os import …
Category: Data Science

Segment 5-7 min audio into sentence-wise audio clips for creating speech recognition dataset

I am trying to create a speech recognition dataset, especially for Indian accents. I am collecting recordings from colleagues to build this. Daily, I send an article link and ask them to record it and upload the recording to Google Drive. I have a problem with this approach: all the audio recordings are 5-7 min long. I am using the DeepSpeech model for this, and it requires 10-sec audio sentences. Please suggest any approach, if possible, to segment audio files into corresponding sentence …
Category: Data Science

Training a sound localization neural network

I am trying to train a neural network to estimate the location (in degrees from 0 to 180) a sound is coming from. I am using TensorFlow Keras in Python to train the model. The input data are two binaural cues, specifically the ILD (Interaural Level Difference) and the ITD (Interaural Time Difference). Each vector, consisting of the two features described above, has dimensions [1,71276]. I have a total of 2639 measurements, 10% of which are used as validation …
Category: Data Science

Why normalization kills my accuracy

I have a binary sound classifier. I have a feature set of size 48 that is extracted from audio. I have a model (a multi-layer neural network) that has around 90% accuracy on the test and validation sets (without normalization or standardization). I see that the feature values are mostly within [-10, +10], but certain features have a mean of 4000. Seeing disproportionate values within the features, I thought some feature scaling might improve things. So using scikit-learn tools I …
Category: Data Science

How to deal with different audio formats for audio classification?

I am working on an audio classification problem to classify between two audio classes. I have collected samples from Jotform. They provide an audio widget to collect .wav audio, but it turned out that the widget stores data in .mp3 format. In my problem statement, the classes are in different formats: class A: all 100 samples are in .mp3 format (Jotform collection); class B: all the samples are in .wav format …
Category: Data Science

Audio Classification with Counter

I'm trying to create a model that can identify one particular sound, and every time it hears that sound, it increases a counter by 1. So for example, if it hears a specific bird chirping ten times, the counter should display the number 10. I'm looking for a bit of guidance here as to how to go about this. I know that I will need to use audio classification and for my data, I only have .wav files of that …
Category: Data Science

Why is GTZAN dataset so widely used without copyright permission

I am hoping to use the GTZAN music dataset to evaluate the performance of several noise-cancelling algorithms as part of a project for my undergrad. I notice that GTZAN is widely used across the literature for audio classification and even has exposure within the TensorFlow and PyTorch APIs. Unfortunately, I cannot find any information about the copyright status of data within GTZAN besides on the marsyas website itself, where it is revealed that no permissions to redistribute the data have been …
Category: Data Science

Tool for labeling audio

I have a few thousand audio signals to label into 2 different classes and save to a numpy array for further training of models. MATLAB recently released Signal Labeler for their Signal Analyzer, which could help to label time series, but for certain reasons I can't use it. Is there any specific tool for analysis and labeling of time series for Python? It is not necessary to save data and labels into numpy arrays; .csv format or anything similar is suitable …
Category: Data Science

TensorFlow Speech Emotion Recognition Model gives same prediction for all inputs

Dataset used: RAVDESS (I've only used the audio only files) Here's a sample after I've processed the data: And the code for the label encoding: #encode labels as ints lb = LabelEncoder() y_train = np_utils.to_categorical(lb.fit_transform(y_train)) y_test = np_utils.to_categorical(lb.fit_transform(y_test)) #Not sure if this is needed x_train =np.expand_dims(x_train, axis=2) x_test= np.expand_dims(x_test, axis=2) Model: model.add(Conv1D(16, 5,strides=2 ,padding='same', input_shape=(259,1))) model.add(Conv1D(16, 5,padding='same', activation="relu")) model.add(Dropout(0.1)) model.add(MaxPooling1D(pool_size=(6))) model.add(LSTM(1)) model.add(Flatten()) model.add(Dense(10, activation="relu")) model.add(Dense(10,activation="softmax")) model.summary() opt = keras.optimizers.RMSprop(lr=0.00001, decay=1e-6) model.compile(metrics=['accuracy'], optimizer=opt, loss='categorical_crossentropy') history = model.fit(x_train, y_train, batch_size=1,epochs=15, validation_data=(x_train, y_train)) …
Category: Data Science
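
One detail in the excerpt above is that the label encoder is fitted separately on `y_train` and `y_test` (`lb.fit_transform` is called on both). Since scikit-learn's `LabelEncoder` assigns integer codes in sorted order of the labels it was fitted on, refitting on a test set that lacks some classes can map the same label to a different integer. The sketch below reproduces that effect with NumPy only; the helper names and the emotion labels are hypothetical, and whether this contributes to the problem in the question cannot be told from the truncated excerpt.

```python
import numpy as np

def fit_classes(labels):
    """Return the sorted unique labels, mirroring what a fitted label encoder stores."""
    return np.unique(labels)

def encode(labels, classes):
    """Map labels to integer codes using a fixed, previously fitted class list."""
    labels = np.asarray(labels)
    idx = np.searchsorted(classes, labels)
    idx = np.minimum(idx, len(classes) - 1)
    if np.any(classes[idx] != labels):
        raise ValueError("label not seen when the class list was fitted")
    return idx

def one_hot(indices, num_classes):
    """One-hot encode integer codes with a fixed number of classes."""
    return np.eye(num_classes)[indices]

# Hypothetical labels: the test split happens to contain only two of the classes.
y_train = ["angry", "calm", "happy", "sad", "happy", "calm"]
y_test = ["happy", "sad"]

classes = fit_classes(y_train)
consistent = encode(y_test, classes)                     # codes from the training fit
refit = np.unique(y_test, return_inverse=True)[1]        # codes from refitting on test

print(consistent.tolist(), refit.tolist())  # [2, 3] [0, 1]
print(one_hot(consistent, len(classes)).shape)  # (2, 4)
```

Fitting once on the training labels and reusing that mapping (in scikit-learn terms, `fit` on train and `transform` on test) keeps the codes and the one-hot width identical across splits.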

Detecting Data Drift in Audio Data

For a given set of audio files collected from an industrial process via a microphone, I have extracted suitable features and fed them into a neural network for training a binary classifier as depicted below. The model has been performing quite well on unseen data. I am at the stage of developing a sub-product to monitor data drift, forecasting the inevitable, i.e. data changes (namely microphone position changes, product material changes that produce a distinct signal, background noise prevailing …
Category: Data Science

Trim left tail of music in audio file

I have audio files; most of them start with the same music, and then a conversation begins. I want to trim the music part (which can vary in length). I have no labels. I can transcribe the whole file using off-the-shelf models, but the music itself contains words, which result in false positives. However, I know how to extract features from the audio, such as Mel spectrogram, pitch, etc. The music at the beginning of the file …
Category: Data Science

Smooth transition of music notes using music processing

I need advice regarding a small dataset of individual music notes played on a harmonica that I created a while ago. I want to build a system that reads notations in a text file and creates realistic audio by combining audio files with near-perfect transitions between notes. What should I look into for reference or as a starting point for the project? Thanks in advance! PS: I had to provide a tag so I chose 'audio-recognition'. I do not …
Category: Data Science

Augmentation for sound recognition of dog barks for CNNs

I am training CNNs to recognize dog barking, and for this I would like to augment the data sets I have (~30'000 10s clips with either barks or no barks in them). The straightforward idea was to mix the barking audio clips with the no-barking clips (maybe some leaves rustling or whatever), such that the resulting remix is again a barking audio clip. I did this by simply adding up the two waveforms (from .wav files) in a random ratio, …
Category: Data Science
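
The mixing step described in the question above can be made more controllable by choosing the ratio as a signal-to-noise ratio in decibels rather than a raw weight. The following is a minimal NumPy sketch under stated assumptions: `mix_at_snr` is a hypothetical helper, the clips are synthetic stand-ins for 10 s recordings at 16 kHz, and the 0 to 20 dB range is an arbitrary illustrative choice.

```python
import numpy as np

def mix_at_snr(foreground, background, snr_db):
    """Add background to foreground so that their power ratio equals snr_db (in dB)."""
    fg = np.asarray(foreground, dtype=float)
    bg = np.asarray(background, dtype=float)[: len(fg)]
    if len(bg) < len(fg):
        raise ValueError("background clip is shorter than the foreground clip")
    fg_power = np.mean(fg ** 2)
    bg_power = np.mean(bg ** 2)
    if bg_power == 0:
        raise ValueError("background clip is silent")
    # Scale the background so that fg_power / scaled_bg_power == 10 ** (snr_db / 10).
    target_bg_power = fg_power / (10 ** (snr_db / 10))
    mixture = fg + bg * np.sqrt(target_bg_power / bg_power)
    # Rescale only if the sum would clip when written back to a .wav file.
    peak = np.max(np.abs(mixture))
    if peak > 1.0:
        mixture = mixture / peak
    return mixture

# Synthetic stand-ins for a bark clip and a no-bark clip (10 s at 16 kHz).
rng = np.random.default_rng(0)
sr = 16000
t = np.arange(10 * sr) / sr
bark = 0.1 * np.sin(2 * np.pi * 440 * t)
rustle = rng.normal(scale=0.05, size=len(t))

snr_db = rng.uniform(0.0, 20.0)  # random mixing ratio, drawn per augmented example
augmented = mix_at_snr(bark, rustle, snr_db)
print(augmented.shape, round(float(snr_db), 2))
```

Drawing the SNR per example keeps the bark audible at a known level, which makes it easier to reason about whether the augmented clip should still carry the "bark" label.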

Discouraging values or smoothing out results when model fitting

I'm working on training a network to do direction of arrival prediction and I'm having the issue that no matter what my network is (ResNet 18 - 101, CRNN, CNN, etc...) my results tend toward one small range of values as seen in the image below which leads obviously to the following errors: I have attempted to just "wait it out" until my network finally learns, but my validation loss diverges pretty much immediately. An example can be seen below. …
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.