Advances in Text-to-Speech Synthesis and Speech Representation Learning

The field of speech processing is moving toward more expressive and controllable text-to-speech (TTS) systems, with a focus on emotional expression, duration control, and speaker identity. Recent work has also highlighted fairness and privacy in speech representation learning, with efforts to reduce the leakage of sensitive attributes such as speaker identity and demographic information from learned embeddings. There is also growing interest in mitigating the conflict between semantic and acoustic capabilities in speech codecs, and in improving speaker-similarity assessment for speech synthesis. Noteworthy papers include IndexTTS2, which disentangles emotional expression from speaker identity; WavShape, which optimizes embeddings for fairness and privacy while preserving task-relevant information; XY-Tokenizer, which proposes a codec balancing semantic richness with acoustic fidelity; and Efficient Interleaved Speech Modeling through Knowledge Distillation, which builds compact yet expressive speech generation models via layer-aligned distillation.
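To make "layer-aligned distillation" concrete, the sketch below shows one common way such an objective can be set up: each student layer is paired with an evenly spaced teacher layer, and the mean squared error between their hidden states is averaged. This is a minimal illustration of the general idea, not the specific method of the cited paper; the function name, the uniform layer mapping, and the MSE objective are all assumptions.

```python
import numpy as np

def layer_aligned_distillation_loss(teacher_layers, student_layers):
    """Sketch of a layer-aligned distillation objective (hypothetical).

    Each student layer i is paired with teacher layer
    floor((i + 1) * ratio) - 1, i.e. student layers are aligned to
    evenly spaced teacher layers, and the hidden-state MSE is averaged.
    Each entry is an array of shape (time, hidden_dim).
    """
    ratio = len(teacher_layers) / len(student_layers)
    losses = []
    for i, student_h in enumerate(student_layers):
        teacher_h = teacher_layers[int((i + 1) * ratio) - 1]
        losses.append(np.mean((student_h - teacher_h) ** 2))
    return float(np.mean(losses))

# Example: a 12-layer teacher distilled into a 6-layer student,
# with random hidden states standing in for real activations.
rng = np.random.default_rng(0)
teacher = [rng.standard_normal((4, 8)) for _ in range(12)]
student = [rng.standard_normal((4, 8)) for _ in range(6)]
loss = layer_aligned_distillation_loss(teacher, student)
```

In a real training loop this loss would be computed on framework tensors and combined with the usual task loss (e.g. next-token prediction for a speech language model), with the teacher frozen.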

Sources

IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech

Identifying Speaker Information in Feed-Forward Layers of Self-Supervised Speech Transformers

WavShape: Information-Theoretic Speech Representation Learning for Fair and Privacy-Aware Audio Processing

XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs

You Sound a Little Tense: L2 Tailored Clear TTS Using Durational Vowel Properties

Efficient Interleaved Speech Modeling through Knowledge Distillation

Multi-interaction TTS toward professional recording reproduction

Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis

Measurement of the Granularity of Vowel Production Space By Just Producible Different (JPD) Limens
