Advances in Speech Quality Assessment and Generation

Speech processing research is moving toward quality assessment that tracks human perception more closely and toward generation that sounds more human. Recent work on evaluation emphasizes metrics that reflect how listeners actually judge speech. On the generation side, there is a push toward fine-grained control over speech emotion and toward integrating paralinguistic vocalizations into speech recognition and synthesis systems. Noteworthy papers in this area include:

  • EmoSteer-TTS, which achieves fine-grained, training-free speech emotion control via activation steering.
  • NVSpeech, which presents a scalable pipeline for recognizing and synthesizing paralinguistic vocalizations.
  • The State Of TTS, which introduces a metric to directly measure how often machine-generated speech is mistaken for human.

Together, these advances could make generated speech more natural and expressive, and make it easier to evaluate and compare speech generation systems.
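
To make the idea of a "fooling rate" concrete, here is a minimal sketch of how such a metric could be computed from listener judgments. It is an illustration under stated assumptions, not the definition used in The State Of TTS: the function name fooling_rate, the (is_machine, judged_human) input format, and the example data are all hypothetical.

```python
from typing import Iterable, Tuple


def fooling_rate(judgments: Iterable[Tuple[bool, bool]]) -> float:
    """Fraction of machine-generated samples that listeners judged as human.

    Each judgment is a hypothetical (is_machine, judged_human) pair.
    Judgments of genuinely human samples are ignored here; the paper's
    actual protocol may differ.
    """
    machine_votes = [
        judged_human for is_machine, judged_human in judgments if is_machine
    ]
    if not machine_votes:
        raise ValueError("no machine-generated samples to score")
    # True counts as 1, so the sum is the number of "fooled" judgments.
    return sum(machine_votes) / len(machine_votes)


# Made-up data: four machine samples, two mistaken for human,
# plus one human sample that does not count toward the rate.
example = [
    (True, True),
    (True, False),
    (True, True),
    (True, False),
    (False, True),
]
print(f"{fooling_rate(example):.2f}")  # prints 0.50
```

A rate near 0 would mean listeners almost always spot the synthetic speech. How a high rate should be read depends on how often the same listeners label real human speech as human, which this sketch does not model.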

Sources

Advancing Speech Quality Assessment Through Scientific Challenges and Open-source Activities

Neural Speech Extraction with Human Feedback

EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering

The State Of TTS: A Case Study with Human Fooling Rates

NVSpeech: An Integrated and Scalable Pipeline for Human-Like Speech Modeling with Paralinguistic Vocalizations

What Do Humans Hear When Interacting? Experiments on Selective Listening for Evaluating ASR of Spoken Dialogue Systems

ESDD 2026: Environmental Sound Deepfake Detection Challenge Evaluation Plan

A Scalable Pipeline for Enabling Non-Verbal Speech Generation and Understanding
