Evaluate Text-to-speech without Human Involved?

Question

Nontawat Wutticome

October 9, 2021, 11:57

I've explored text-to-speech evaluation metrics, and they seem to use the Mean Opinion Score (MOS) to evaluate a particular model. This metric requires humans to judge the model on a scale (Bad, Moderate, Good, etc.).

Are there other evaluation metrics that assess a TTS system algorithmically, without requiring any human, but still give results that correlate with human evaluation?

Topic model-evaluations speech-to-text audio-recognition evaluation machine-learning

Category Data Science

Skill slash · Accepted Answer · October 4, 2021, 06:41

The Mean Opinion Score (MOS) is a numerical expression of the overall quality of a system or stimulus as rated by humans. It is the arithmetic mean of the individual human ratings, which are usually given on a scale from 1 to 5.

A Mean Opinion Score is a measure of the quality of audio and data sessions in telecommunication services. An objective measurement technique that approximates human ratings can also generate an estimated MOS. This is frequently used in practice to assess digital approximations of real-world phenomena.

Human involvement might be the most effective way to obtain a MOS for a test of reasonable size, but it is not necessarily the most practical. The final MOS is calculated by averaging the scores of all individuals, so it falls between 1 and 5. A score of 5 indicates a high-quality call, while a score of 1 indicates incomprehensible communication.
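As a rough illustration of the averaging step, here is a minimal Python sketch. The function name `mean_opinion_score` and the sample ratings are made up for this example, and the 95% interval uses a simple normal approximation, which is an assumption and not part of any standard quoted in this answer.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    Ratings are assumed to be on the usual 1-5 opinion scale.
    """
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")

    mos = mean(ratings)  # arithmetic mean of all listener scores

    # Normal-approximation interval; only defined for two or more ratings.
    if len(ratings) > 1:
        half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    else:
        half_width = 0.0
    return mos, half_width


# Hypothetical scores from five listeners for one synthesized utterance.
ratings = [4, 5, 3, 4, 4]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting the interval next to the mean makes it easier to see whether the listener panel was large enough for the score to be meaningful.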

The R-Factor (Rating Factor) is a computed estimate of the human-rated MOS. Measures of this kind describe conversation quality, including impairments that are not caused by network errors. Clarity, latency, packet loss, jitter, and other parameters are measured using real audio signals.
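For reference, the R-Factor is usually converted to an estimated MOS with the mapping given in ITU-T G.107 (the E-model). The sketch below assumes that mapping. The function name `r_factor_to_mos` is made up for this example, and the final clamp to a minimum of 1.0 is an added safeguard, not part of the formula itself. Note that this model targets telephony, so it is not a direct measure of TTS naturalness.

```python
def r_factor_to_mos(r):
    """Map an E-model R-Factor to an estimated MOS (ITU-T G.107 mapping)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    mos = 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6
    # The cubic dips slightly below 1 for very small R; the clamp is an
    # addition for this sketch so the result stays on the opinion scale.
    return max(1.0, mos)


# Illustrative R values only; real values come from network measurements.
for r in (50, 70, 93.2):
    print(f"R = {r:5.1f} -> estimated MOS = {r_factor_to_mos(r):.2f}")
```

The mapping saturates at 4.5, not 5, so an estimated MOS from the R-Factor never reaches the top of the 1-5 scale.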
