Advances in Speech Synthesis and Evaluation

The field of speech synthesis and evaluation is evolving rapidly, with a focus on improving the naturalness and intelligibility of generated speech. Researchers are exploring new architectures and techniques, such as conditional diffusion models and consistency Schrödinger bridges, to improve the quality of singing voice synthesis and text-to-speech systems. There is also growing emphasis on developing more robust evaluation metrics that can accurately assess the intelligibility and quality of synthesized speech. Noteworthy papers include VS-Singer, which generates stereo singing voices with room reverberation inferred from scene images, and SmoothSinger, which synthesizes high-quality singing voices with a conditional diffusion model. TTSDS2 has also been proposed as a more robust benchmark for evaluating human-quality text-to-speech systems.
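To make the conditional-diffusion idea concrete, the sketch below shows one reverse (denoising) step of DDPM-style ancestral sampling, conditioned on an auxiliary feature vector. This is a toy illustration only: the `toy_denoiser` stand-in, the shapes, and the noise schedule are all hypothetical, whereas systems like SmoothSinger use a trained neural network conditioned on musical-score features.

```python
import numpy as np

def toy_denoiser(x_t, cond, t):
    """Stand-in for a trained network predicting eps_theta(x_t, cond, t).

    Toy rule (hypothetical): treat the residual toward the conditioning
    vector as the noise estimate. A real model is a learned network.
    """
    return x_t - cond

def reverse_step(x_t, cond, t, betas, rng):
    """One ancestral sampling step x_t -> x_{t-1} of DDPM-style sampling."""
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    abar_t = np.cumprod(1.0 - betas)[t]  # cumulative product up to step t

    eps = toy_denoiser(x_t, cond, t)
    # Posterior mean of x_{t-1} given the predicted noise.
    mean = (x_t - beta_t / np.sqrt(1.0 - abar_t) * eps) / np.sqrt(alpha_t)
    if t > 0:
        return mean + np.sqrt(beta_t) * rng.normal(size=x_t.shape)
    return mean  # final step adds no noise

# Toy usage: denoise an 8-dim "frame" from Gaussian noise toward a condition.
rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 50)      # hypothetical linear schedule
cond = np.zeros(8)                       # e.g. encoded score/lyric features
x = rng.normal(size=8)                   # start from pure noise
for t in reversed(range(len(betas))):
    x = reverse_step(x, cond, t, betas, rng)
```

In a real singing-voice system the sample would be a mel-spectrogram rather than a small vector, and the conditioning would carry pitch, phoneme, and timing information; the control flow, however, follows this same iterative denoising loop.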

Sources

VS-Singer: Vision-Guided Stereo Singing Voice Synthesis with Consistency Schrödinger Bridge

Improved Intelligibility of Dysarthric Speech using Conditional Flow Matching

Optimizing Multilingual Text-To-Speech with Accents & Emotions

Aligning ASR Evaluation with Human and LLM Judgments: Intelligibility Metrics Using Phonetic, Semantic, and NLI Approaches

Streaming Non-Autoregressive Model for Accent Conversion and Pronunciation Improvement

IndieFake Dataset: A Benchmark Dataset for Audio Deepfake Detection

Learning to assess subjective impressions from speech

TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems

Vo-Ve: An Explainable Voice-Vector for Speaker Identity Evaluation

SmoothSinger: A Conditional Diffusion Model for Singing Voice Synthesis with Multi-Resolution Architecture
