Evaluating Text-to-Speech without Human Involvement?
I've explored text-to-speech evaluation metrics, and they seem to use the Mean Opinion Score (MOS) to evaluate a particular model. This metric requires human listeners to judge the model's output on a scale (Bad, Moderate, Good, etc.).
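To make concrete what I mean by MOS, here is a minimal sketch of how I understand it is computed: each listener's label is mapped to a number and the numbers are averaged. The five labels and their 1 to 5 values are my assumption (the usual absolute category rating scale), and the function name and ratings are hypothetical, made up for illustration.

```python
from statistics import mean

# Assumed 5-point absolute category rating scale (labels are an assumption).
SCALE = {"Bad": 1, "Poor": 2, "Fair": 3, "Good": 4, "Excellent": 5}


def mean_opinion_score(ratings):
    """Average the numeric values of the listener ratings for one system."""
    if not ratings:
        raise ValueError("need at least one rating")
    return mean(SCALE[label] for label in ratings)


# Hypothetical ratings from five listeners for one synthesized utterance.
ratings = ["Good", "Fair", "Good", "Excellent", "Poor"]
print(mean_opinion_score(ratings))
```

Every number going into that average comes from a person listening to the audio, which is the part I would like to avoid.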
Are there other evaluation metrics that assess a TTS system algorithmically, without requiring any human listeners, but still give results that correlate with human evaluation?