How to evaluate the quality of speech-to-text data without access to the true labels?

I am working with a data set of transcribed call center interactions: customers are recorded while talking to an agent, and the audio is then automatically transcribed by an external transcription system. I want to automatically assess the quality of these transcriptions.

Sadly, the quality seems to be disastrous. In some cases it's little more than gibberish, often due to different dialects the machine is not able to handle. We have no access to the original recordings (data privacy), so there is no way whatsoever to get or create the true labels. The system cannot be replaced as we are committed to it.

Back to the question: is there any way to automatically assess the quality of transcriptions with NLP methods? We want to quantify and compare transcription quality in order to filter out the best samples for semantic inference on our customers' input in a downstream task. I am thinking of something like a coherence measure to find the sentences that make the most sense, grammatically or semantically. Sadly, metrics such as BLEU, WER, or ROUGE do not work in this case, since they all require reference transcripts.
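
To illustrate what I mean by a coherence measure, here is a rough sketch of one idea I am considering: ranking transcripts by the perplexity a pre-trained language model assigns them (this assumes the Hugging Face transformers library and GPT-2 as a stand-in; in practice a model matching the transcripts' language would be needed).

```python
# Sketch: rank transcripts by language-model perplexity (lower = more fluent).
# Assumes: pip install torch transformers; GPT-2 is only a placeholder model.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids the model returns the mean cross-entropy
        # per token; exponentiating that gives perplexity.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

transcripts = [
    "thanks for calling, how can I help you today",
    "grah ble voo teh kss",  # gibberish should score much worse
]
ranked = sorted(transcripts, key=perplexity)  # most coherent first
```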

I'd be grateful for anything pointing in the right direction. To repeat the key constraints: we have no labels, and the approach needs to be scalable.

Thanks a lot!

Topics: transformer, speech-to-text, text-mining, nlp

Category: Data Science


I have recently worked on an ASR (speech-to-text) system covering several dialects. The current state of the art suggests that the best way to handle dialects is the XLSR approach: pre-train a model on N languages, then fine-tune it to recognize the target dialect. Such systems are built with Transformers, which lets you assess WER and PER without having to inspect the data or even understand the language in the first place; a dialect is often hard to follow if you are not a native speaker.

That said, I mainly want to help with getting your model to handle dialects well. These papers helped me a lot: https://arxiv.org/abs/2006.11477 and https://arxiv.org/abs/2006.13979.
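
For reference, a minimal sketch of that fine-tuning setup with the Hugging Face transformers library (checkpoint name from the papers above; the vocab.json for the target dialect and the training loop itself are assumed and omitted):

```python
# Sketch: XLSR fine-tuning setup per the wav2vec 2.0 / XLSR papers linked above.
# Assumes: pip install transformers; a vocab.json built from your dialect's text.
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

# The XLSR checkpoint ships without a tokenizer, so you create one from the
# character vocabulary of your target dialect.
tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16000, padding_value=0.0,
    do_normalize=True, return_attention_mask=True,
)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",  # multilingual pre-trained checkpoint
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
model.freeze_feature_encoder()  # keep the pre-trained feature encoder fixed
# ... then train with CTC loss on (audio, transcript) pairs for the dialect.
```

Note that fine-tuning needs labelled audio, so this addresses the dialect problem rather than the label-free scoring problem.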


There is at least one way:

  1. Create or acquire a grammar model for the language spoken (there are several such models for various languages used in NLP).
  2. Test the transcripts for being grammatically/syntactically correct (see the sketch after this list).
  3. This assessment will at least rule out gibberish and most of the transcripts that do not correspond to valid sentences of the language spoken.
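
A minimal sketch of steps 1 and 2, assuming the language_tool_python package (a Python wrapper around the LanguageTool grammar checker; it requires Java and downloads LanguageTool on first use) as the grammar model:

```python
# Sketch: flag gibberish by counting grammar/spelling rule violations per word.
# Assumes: pip install language-tool-python (needs a Java runtime installed).
import language_tool_python

tool = language_tool_python.LanguageTool("en-US")  # pick the transcripts' language

def error_rate(text: str) -> float:
    """Grammar/spelling issues per word; higher suggests gibberish."""
    words = text.split()
    if not words:
        return float("inf")
    return len(tool.check(text)) / len(words)

transcripts = [
    "I would like to change my billing address.",
    "blorp address the the kljh zz",
]
# Keep the transcripts with the fewest rule violations per word.
ranked = sorted(transcripts, key=error_rate)
```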
