Advancements in Speech Recognition and Generation

The field of speech recognition and generation is witnessing significant developments, with a focus on improving model interpretability, robustness, and controllability. Researchers are exploring innovative techniques, such as adaptating interpretability methods from large language models to automatic speech recognition systems, to gain insights into linguistic representations and model behaviors. Moreover, novel frameworks are being proposed to enhance voice timbre attribute detection, controllable speech and singing voice generation, and speech restoration tasks. Noteworthy papers in this area include: QvTAD, which presents a pairwise comparison framework for voice timbre attribute detection, achieving substantial improvements across multiple timbre descriptors. Vevo2, a unified framework for controllable speech and singing voice generation, enabling flexible controllability over text, prosody, and style. Multi-Metric Preference Alignment for Generative Speech Restoration, which investigates the application of preference-based post-training to speech restoration tasks, resulting in consistent and significant performance gains.

Sources

Beyond Transcription: Mechanistic Interpretability in ASR

QvTAD: Differential Relative Attribute Learning for Voice Timbre Attribute Detection

Vevo2: Bridging Controllable Speech and Singing Voice Generation via Unified Prosody Learning

Revisiting Rule-Based Stuttering Detection: A Comprehensive Analysis of Interpretable Models for Clinical Applications

RephraseTTS: Dynamic Length Text based Speech Insertion with Speaker Style Transfer

Multi-Metric Preference Alignment for Generative Speech Restoration

Improving French Synthetic Speech Quality via SSML Prosody Control

SwiftF0: Fast and Accurate Monophonic Pitch Detection

Improving Noise Robust Audio-Visual Speech Recognition via Router-Gated Cross-Modal Feature Fusion

Flowing Straighter with Conditional Flow Matching for Accurate Speech Enhancement

SincQDR-VAD: A Noise-Robust Voice Activity Detection Framework Leveraging Learnable Filters and Ranking-Aware Optimization

Built with on top of