Advances in Text-to-Speech Synthesis and Speech Representation Learning

The field of speech processing is moving toward more expressive and controllable text-to-speech (TTS) systems, with a focus on emotional expression, duration control, and speaker identity. Recent work has also highlighted fairness and privacy in speech representation learning, with efforts to reduce the leakage of sensitive attributes such as speaker identity and demographic information from learned embeddings. There is also growing interest in mitigating the conflict between semantic and acoustic capabilities in speech codecs, and in improving speaker-similarity assessment for speech synthesis. Noteworthy papers include IndexTTS2, which disentangles emotional expression from speaker identity; WavShape, which optimizes embeddings for fairness and privacy while preserving task-relevant information; XY-Tokenizer, which proposes a codec balancing semantic richness with acoustic fidelity; and Efficient Interleaved Speech Modeling through Knowledge Distillation, which builds compact yet expressive speech generation models via layer-aligned distillation.
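To make "layer-aligned distillation" concrete, the sketch below shows one common way such an objective can be set up: each student layer is paired with an evenly spaced teacher layer, and the mean squared error between their hidden states is averaged. This is a minimal illustration of the general idea, not the specific method of the cited paper; the function name, the uniform layer mapping, and the MSE objective are all assumptions.

```python
import numpy as np

def layer_aligned_distillation_loss(teacher_layers, student_layers):
    """Sketch of a layer-aligned distillation objective (hypothetical).

    Each student layer i is paired with teacher layer
    floor((i + 1) * ratio) - 1, i.e. student layers are aligned to
    evenly spaced teacher layers, and the hidden-state MSE is averaged.
    Each entry is an array of shape (time, hidden_dim).
    """
    ratio = len(teacher_layers) / len(student_layers)
    losses = []
    for i, student_h in enumerate(student_layers):
        teacher_h = teacher_layers[int((i + 1) * ratio) - 1]
        losses.append(np.mean((student_h - teacher_h) ** 2))
    return float(np.mean(losses))

# Example: a 12-layer teacher distilled into a 6-layer student,
# with random hidden states standing in for real activations.
rng = np.random.default_rng(0)
teacher = [rng.standard_normal((4, 8)) for _ in range(12)]
student = [rng.standard_normal((4, 8)) for _ in range(6)]
loss = layer_aligned_distillation_loss(teacher, student)
```

In a real training loop this loss would be computed on framework tensors and combined with the usual task loss (e.g. next-token prediction for a speech language model), with the teacher frozen.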

Sources

IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech

Identifying Speaker Information in Feed-Forward Layers of Self-Supervised Speech Transformers

WavShape: Information-Theoretic Speech Representation Learning for Fair and Privacy-Aware Audio Processing

XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs

You Sound a Little Tense: L2 Tailored Clear TTS Using Durational Vowel Properties

Efficient Interleaved Speech Modeling through Knowledge Distillation

Multi-interaction TTS toward professional recording reproduction

Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis

Measurement of the Granularity of Vowel Production Space By Just Producible Different (JPD) Limens
