Where to start with natural language processing for a low-resource language

My native language is a regional language spoken by few people. I have some assignments in a machine learning course and I was thinking about doing some natural language processing on my native language, but I don't know where to start since there is almost no research about this language (no corpus, no research papers, ...) and I'm new to machine learning. I want to start doing everything from the bottom up and I want to do …
Category: Data Science

Adding words to the vocabulary of a pre-trained ASR model

I have a pre-trained ASR model but want to add some missing words to the vocabulary. Can I do this, or will it invalidate the entire training? Let's say I use the pretrained model wav2vec2-base-960h and want to use it on sports commentary, but a lot of the players' names are missing from the vocabulary. Is there any way I can add the names and maybe train on a few clips where the names appear, or do I have to …
Category: Data Science
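
A minimal sketch of how one might first check what that checkpoint's vocabulary actually looks like before deciding anything (this assumes the Hugging Face transformers library; the checkpoint name is the one from the question):

```python
# Sketch: inspect the CTC vocabulary of the pretrained checkpoint
# (assumes the Hugging Face transformers library is installed).
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
vocab = processor.tokenizer.get_vocab()

print(len(vocab))      # a small vocabulary for this checkpoint
print(sorted(vocab))   # letters, "|" as the word delimiter, and a few special tokens
```

If the vocabulary turns out to be character-level (as it is for this checkpoint), unseen names are already spellable, and the question becomes one of fine-tuning on a few labelled clips rather than extending the vocabulary; only word- or subword-level output layers would need new units.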

Using a synthetic dataset to train NVIDIA NeMo MatchboxNet

Has anyone had success training small command recognition models on a synthetic dataset? The full details are as follows: I need a small model to run command recognition (about 30 commands) on an embedded device. It looks like NVIDIA NeMo MatchboxNet is a good solution, but I have no standard dataset covering my set of commands. The model should handle a broad variation of speakers. Obtaining a real dataset seems difficult. I am considering using NVIDIA models like Waveglow/Flowtron to …
Category: Data Science

How to prepare audio-text data for speech recognition

I have gathered some raw audio from all the conferences, meetings, lectures & casual conversations that I was part of. Machine transcription (from Azure, AWS, etc.) did not give good results, so I would transcribe it myself to have both data and label (audio + text) for ML training. My question is whether to make small (3-10 sec.) audio files (split at silences) and then transcribe each small file, or one large file with timestamps in subtitle .srt format. What if I have a long …
Category: Data Science
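
If the short-clip route is chosen, here is a minimal sketch of splitting a long recording at silences (assuming the pydub package; "meeting.wav" is a placeholder file name, and the thresholds need tuning per recording):

```python
# Sketch: cut a long recording into short chunks at silent points
# (assumes pydub is installed; "meeting.wav" is a placeholder file name).
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_wav("meeting.wav")
chunks = split_on_silence(
    audio,
    min_silence_len=500,              # silence of at least 0.5 s marks a cut point
    silence_thresh=audio.dBFS - 16,   # threshold relative to the average level; tune per recording
    keep_silence=200,                 # keep a little padding around each chunk
)

for i, chunk in enumerate(chunks):
    chunk.export(f"chunk_{i:04d}.wav", format="wav")
```

The .srt route keeps more context, but most ASR training pipelines still end up cutting the audio into short utterances before training, so the timestamps mainly save work later.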

ValueError: Error when checking input: expected the_input to have 3 dimensions, but got array with shape (14174, 1)

Hope you're all doing well! I am working on Automatic Speech Recognition in Python with the LibriSpeech dataset. After preprocessing the audio data and applying MFCC featurization, I append everything into a list and get a shape of (14174,). Each sample has a different length but the same number of features, for example: print(X[0].shape) print(X[12000].shape) >> (615, 13) >> (301, 13). Now when I feed the data into my network with an Input layer defined …
Category: Data Science
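
The error usually means the variable-length MFCC arrays were collected into an object array instead of a single 3-D tensor. A minimal sketch (assuming Keras/TensorFlow and the X list from the question) of zero-padding all samples to a common length so the batch has shape (num_samples, max_frames, 13):

```python
# Sketch: pad variable-length (frames, 13) MFCC arrays into one fixed-size 3-D tensor.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# X is assumed to be the Python list of arrays shaped (num_frames_i, 13) from the question.
X_padded = pad_sequences(X, padding="post", dtype="float32", value=0.0)

print(X_padded.shape)   # (14174, max_frames, 13) -- now matches a 3-D Input layer
```

For CTC-style training the original (unpadded) lengths usually need to be kept and passed along as well, so the loss can ignore the padded frames.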

Segment 5-7 min audio into sentence-wise audio clips for creating a speech recognition dataset

I am trying to create a speech recognition dataset, especially for Indian accents, with help from colleagues. Every day I send an article link and ask them to record it and upload the recording to Google Drive. I have a problem with this approach: all the audio recordings are 5-7 minutes long. I am using the DeepSpeech model, which requires audio sentences of about 10 seconds. Please suggest an approach, if possible, to segment the audio files into corresponding sentence …
Category: Data Science
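
A minimal sketch of detecting non-silent regions and writing them out as separate clips (assuming librosa and soundfile are installed; "article_001.wav" is a placeholder name, and the threshold needs tuning per recording):

```python
# Sketch: split a 5-7 minute recording into clips at silent gaps.
# Assumes librosa and soundfile are installed; "article_001.wav" is a placeholder.
import librosa
import soundfile as sf

y, sr = librosa.load("article_001.wav", sr=16000)

# Intervals (in samples) of audio that is louder than 30 dB below the peak.
intervals = librosa.effects.split(y, top_db=30)

for i, (start, end) in enumerate(intervals):
    clip = y[start:end]
    if len(clip) / sr <= 10.0:          # keep only clips short enough for DeepSpeech
        sf.write(f"clip_{i:04d}.wav", clip, sr)
```

Clips that come out longer than the target length can be re-split with a stricter top_db, or the recordings can be aligned to the article text with a forced aligner such as aeneas or Gentle, which also gives the sentence-level transcripts.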

GMM in speech recognition using HMM-GMM

I am trying to solve/understand ASR using HMM-GMM. At the abstract level I do understand what's happening, but I don't understand how the GMM fits into it. My data has 5K hours of speech from a single user. I took the above picture from this article. I know what a GMM is, but I am unable to wrap my head around its role here. Can somebody explain with a simple example?
Category: Data Science
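
A short formula may make the GMM's role concrete: the HMM supplies the hidden states (typically sub-phone units) and the transitions between them, while each state $j$ uses a GMM as its emission density, i.e. as the probability of observing an acoustic feature frame $\mathbf{o}_t$ (such as an MFCC vector) while in that state:

$$
b_j(\mathbf{o}_t) = p(\mathbf{o}_t \mid s_t = j)
= \sum_{m=1}^{M} c_{jm}\,
\mathcal{N}\!\left(\mathbf{o}_t;\, \boldsymbol{\mu}_{jm},\, \boldsymbol{\Sigma}_{jm}\right),
\qquad \sum_{m=1}^{M} c_{jm} = 1 .
$$

So the GMM answers "how likely is this single frame, given that the model is in state $j$?", and the HMM chains those per-frame scores together over time; the mixture weights, means and covariances $c_{jm}, \boldsymbol{\mu}_{jm}, \boldsymbol{\Sigma}_{jm}$ are what training (e.g. Baum-Welch) re-estimates from the speech data.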

How does Wav2Vec 2.0 feed the output of the Convolutional Feature Encoder as input to the Transformer Context Network?

I was reading the Wav2Vec 2.0 paper and trying to understand the model architecture, but I have trouble understanding how raw audio inputs of variable lengths can be fed through the model, especially from the Convolutional Feature Encoder to the Transformer Context Network. During fine-tuning (from what I have read), even though raw audio inputs within a batch will be padded to the length of the longest input in the batch, the length of inputs can differ across batches. Therefore …
Category: Data Science
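
One way to see why variable lengths are not a problem: the feature encoder is a stack of 1-D convolutions, so its output length is simply a deterministic function of the input length, and the Transformer consumes however many frames come out (with padded positions masked). A small sketch that computes the encoder output length for any raw-audio length; the kernel widths and strides below are the ones reported in the Wav2Vec 2.0 paper, so treat them as an assumption about the exact configuration:

```python
# Sketch: length of the Convolutional Feature Encoder output for a raw waveform.
# Kernel widths / strides are those reported in the Wav2Vec 2.0 paper (assumption).
def encoder_output_length(num_samples: int) -> int:
    kernels = (10, 3, 3, 3, 3, 2, 2)
    strides = (5, 2, 2, 2, 2, 2, 2)
    length = num_samples
    for k, s in zip(kernels, strides):
        length = (length - k) // s + 1   # standard conv output-length formula, no padding
    return length

print(encoder_output_length(16000))   # 49 frames for 1 s of 16 kHz audio (~20 ms per frame)
```

So a longer waveform just produces more encoder frames, and the Transformer, being sequence-length agnostic apart from the attention mask, handles whatever length each batch happens to have.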

Why are observation probabilities modelled as Gaussian distributions in HMM?

An HMM is a statistical model with unobserved (i.e. hidden) states, used in recognition algorithms (speech, handwriting, gesture, ...). What distinguishes a DHMM from a CHMM is how the observation probabilities are modelled: in a DHMM they form a discrete probability matrix, while in a CHMM the state space of the hidden variable is still discrete but the observation probabilities are modelled as Gaussian distributions over continuous feature vectors. Why are observation probabilities modelled as Gaussian distributions in a CHMM? Why are they the (best) distributions for recognition systems based on HMMs?
Category: Data Science
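
For reference, the density in question is the multivariate Gaussian over a $d$-dimensional feature vector $\mathbf{o}$ (in practice usually extended to a mixture, as in the HMM-GMM question above); its appeal is largely practical rather than fundamental, since it has closed-form maximum-likelihood updates inside Baum-Welch and fits real-valued acoustic features reasonably well:

$$
\mathcal{N}(\mathbf{o};\, \boldsymbol{\mu}, \boldsymbol{\Sigma})
= \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}}
\exp\!\left(-\tfrac{1}{2}\,(\mathbf{o}-\boldsymbol{\mu})^{\top}\boldsymbol{\Sigma}^{-1}(\mathbf{o}-\boldsymbol{\mu})\right)
$$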

How can I build my own voice speech-to-text model?

I found instructions for building this kind of custom model on Azure ("Prepare data for Custom Speech"). However, I would like to do the fine-tuning or transfer learning on Google Colaboratory or in Docker. In that case, what machine learning framework do you recommend? If you know of GitHub repos or articles for this challenge, could you share them with me?
Category: Data Science
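
As one concrete option (an assumption on my part, not the only possible framework): the Hugging Face transformers library runs on Colab and in Docker, and fine-tuning a pretrained wav2vec2 checkpoint for CTC is a common starting point. A minimal loading sketch:

```python
# Sketch: load a pretrained wav2vec2 checkpoint as a starting point for fine-tuning.
# Assumes the Hugging Face transformers library; the checkpoint name is one example.
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# From here, (audio, transcript) pairs are prepared with the processor and the model
# is trained with the Trainer API or a plain PyTorch loop on your own recordings.
```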

Evaluating text-to-speech without humans involved?

I've explored text-to-speech evaluation metrics, and they seem to use the Mean Opinion Score (MOS) to evaluate a particular model. This metric requires humans to judge the model on a scale (bad, moderate, good, etc.). Are there other evaluation metrics that estimate TTS quality algorithmically, without requiring any humans, but still give results that correlate with human evaluation?
Category: Data Science
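
One common human-free proxy (an assumption about your setup, not something the question mentions) is to run the synthesized audio through a pretrained ASR model and score the transcript against the text that was synthesized; intelligibility then shows up as word error rate. A minimal sketch using jiwer for the WER part, with asr_transcribe() as a placeholder for whatever ASR call is available:

```python
# Sketch: score a TTS system without listeners by transcribing its output with a
# pretrained ASR model and comparing against the text that was synthesized.
# Assumes the jiwer package; asr_transcribe() is a hypothetical placeholder function.
import jiwer

def tts_intelligibility(input_texts, synthesized_wavs, asr_transcribe):
    hypotheses = [asr_transcribe(wav) for wav in synthesized_wavs]
    return jiwer.wer(input_texts, hypotheses)   # lower WER ~ more intelligible speech
```

Other proxies people use include mel cepstral distortion against reference recordings and learned MOS predictors such as MOSNet, which are trained to correlate with human ratings.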

NeMo Conformer-CTC Predicts Same Word Repeatedly When Fine-Tuning

I'm using the NeMo Conformer-CTC small model on the LibriSpeech dataset (the clean subset, around 29K inputs, using 90% for training and 10% for testing). I use PyTorch Lightning. When I try to train, the model learns 1 or 2 sentences in 50 epochs and gets stuck at a loss of 60-something (I trained it for 200 epochs too and it didn't budge). But when I try to fine-tune it using a pre-trained model from the toolkit, it predicts correctly …
Category: Data Science

How does the pretraining part actually work in Wav2vec models? Which data qualifies as adequate for the fine-tuning part of a speech2text model?

I'm looking at pretraining and fine-tuning in the wav2vec 2.0 algorithm, the new approach used at Facebook AI to do speech-to-text for low-resource languages. I didn't actually get how the model does the pretraining part, if someone can help me. I read the paper https://arxiv.org/abs/2006.11477, but I ended up not getting the notion of pre-training in this regard. The question is: HOW do we do pretraining?! Note: I'm a beginner in ML; so far I've done some projects with NLP. I have …
Category: Data Science
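
Very roughly, pre-training needs no transcripts at all: spans of the latent speech representation are masked, and the model learns to identify the true quantized latent for each masked step among a set of distractors (plus a diversity term encouraging all codebook entries to be used). The contrastive part of the objective, in the notation of the paper (arXiv:2006.11477), is:

$$
\mathcal{L}_m = -\log
\frac{\exp\!\big(\operatorname{sim}(\mathbf{c}_t, \mathbf{q}_t)/\kappa\big)}
{\sum_{\tilde{\mathbf{q}} \in \mathbf{Q}_t} \exp\!\big(\operatorname{sim}(\mathbf{c}_t, \tilde{\mathbf{q}})/\kappa\big)}
$$

where $\mathbf{c}_t$ is the context-network output at a masked time step, $\mathbf{q}_t$ the true quantized latent, $\mathbf{Q}_t$ the set containing $\mathbf{q}_t$ plus $K$ distractors, and $\operatorname{sim}$ cosine similarity. Because this objective only needs raw audio, fine-tuning afterwards can get by with comparatively little paired audio + text in the target language.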

How to evaluate the quality of speech-to-text data without access to the true labels?

I am dealing with a data set of transcribed call center data, where customers are being recorded when interacting with the agent. This is then automatically transcribed by an external transcription system. I want to automatically assess the quality of these transcriptions. Sadly, the quality seems to be disastrous. In some cases it's little more than gibberish, often due to different dialects the machine is not able to handle. We have no access to the original recordings (data privacy), so …
Category: Data Science
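
Without reference transcripts, one imperfect proxy (an assumption, not an established standard) is how plausible each transcript looks to a language model: gibberish tends to get much higher perplexity than fluent call-center dialogue. A minimal sketch with GPT-2 via the Hugging Face transformers library:

```python
# Sketch: rank transcripts by language-model perplexity as a rough quality proxy.
# Assumes the Hugging Face transformers library; GPT-2 is just one convenient LM.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss       # mean cross-entropy per token
    return float(torch.exp(loss))

print(perplexity("I would like to change my billing address please"))
print(perplexity("uh the the buh lling adress plz chan"))  # expect a much higher value
```

This only ranks transcripts relative to each other and will also penalize legitimate dialectal phrasing, so it is probably best used to flag candidates for manual review rather than as an absolute quality score.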

How to train a speech recognition model on unpaired speech and text?

Having unpaired data means we have a dataset of audio and a dataset of texts, BUT they are not associated. As we know, to build a speech recognition model we normally have to pair each utterance in the input dataset with its corresponding text, and then train the model on that, so that it is capable of converting new audio to text. In my case, while doing data collection, I wound up with audio and text …
Category: Data Science

Hidden Markov models in Speech Recognition

My first question here. I am trying to build a sign language translator (from signs to text) and noticed that the problem is quite similar to speech recognition, so I started researching that. Right now, the one thing I can't figure out is how exactly Hidden Markov models are used in speech recognition. I can understand how an HMM can be used, for example, in part-of-speech tagging, where we get one of the states for each word. …
Category: Data Science
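
The POS-tagging analogy carries over, except that the observations are acoustic feature frames rather than words, and the hidden states are sub-word units (phones or sub-phone states) rather than tags. Decoding then searches for the word sequence $W$ that best explains the observed frames $O$, with the HMM-based acoustic model providing $P(O \mid W)$ via per-word state sequences and a language model providing $P(W)$:

$$
\hat{W} = \arg\max_{W} \; P(W \mid O) = \arg\max_{W} \; P(O \mid W)\, P(W)
$$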

What are the hidden and observed states when building an acoustic model?

I have been trying to learn how to build ASRs and have been researching for a while now, but I can't seem to get a straight answer. From what I understand, an ASR requires an acoustic model. That acoustic model can be trained via Baum-Welch or Viterbi training. Those algorithms train the parameters of a Hidden Markov Model. From what I gather, to train the parameters, we need the WAV files, from which the MFCC feature vectors can be obtained. On …
Category: Data Science
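
In HMM terms (a standard formulation, not specific to any one toolkit), the observed variables are the acoustic feature vectors (e.g. the MFCC frames computed from the WAV files) and the hidden variables are the sub-phone HMM states; Baum-Welch or Viterbi training fits the parameters of the joint

$$
P(O, Q) = \pi_{q_1}\, b_{q_1}(\mathbf{o}_1)\, \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(\mathbf{o}_t),
$$

where $O = \mathbf{o}_1, \dots, \mathbf{o}_T$ are the observed feature frames, $Q = q_1, \dots, q_T$ the hidden states, $a$ the transition probabilities and $b$ the emission densities.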

How to double an audio dataset?

I am trying to develop a mispronunciation detection model for English speech. I use the TIMIT dataset, which is a phoneme-labeled audio dataset. A phoneme is any of the perceptually distinct units of sound. So, my dataset looks like an audio file and a string of phonemes corresponding to that audio. Ex: SX141.wav -> p l eh zh tcl t ax-h pcl p axr tcl t ih s pcl p ey dx ih n ax v aa dx ix z ix kcl …
Category: Data Science
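
One way to roughly double the data without new recordings (an assumption about what "double" means here) is label-preserving audio augmentation, such as adding low-level noise or mild time-stretching, so the phoneme string stays valid for the augmented copy. A minimal librosa sketch:

```python
# Sketch: create a second, label-preserving copy of each TIMIT utterance.
# Assumes librosa and soundfile; "SX141.wav" is the example file from the question.
import librosa
import numpy as np
import soundfile as sf

y, sr = librosa.load("SX141.wav", sr=None)

# 1) Add low-level Gaussian noise -- phoneme string and time alignments stay valid.
noisy = y + 0.005 * np.random.randn(len(y)).astype(y.dtype)

# 2) Mild time stretch -- the phoneme string stays valid, but TIMIT's phone boundary
#    times shift, so prefer noise injection if those alignments are used directly.
stretched = librosa.effects.time_stretch(y, rate=0.9)

sf.write("SX141_noisy.wav", noisy, sr)
sf.write("SX141_stretched.wav", stretched, sr)
```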
