The field of voice conversion and text-to-speech synthesis is moving toward more controllable and expressive models. Recent work focuses on disentangling speaker identity from linguistic content, enabling finer control over the prosody and style of generated speech. This is achieved through frameworks that incorporate in-context learning, flow-matching transformers, and masked-autoencoded style-rich representations. Noteworthy papers include Discl-VC, which introduces a mask-generative transformer to predict discrete prosody tokens; StarVC, a unified autoregressive framework that leverages structured semantic features to improve conversion performance; Controllable Text-to-Speech Synthesis with Masked-Autoencoded Style-Rich Representation, which proposes a two-stage style-controllable TTS system; and Towards Better Disentanglement in Non-Autoregressive Zero-Shot Expressive Voice Conversion, which refines a self-supervised framework to reduce source-timbre leakage and improve linguistic-acoustic disentanglement.
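To make the mask-generative idea behind prosody-token prediction concrete, the sketch below trains a small transformer to fill in randomly masked discrete prosody tokens conditioned on content features. The module names, dimensions, masking ratio, and conditioning scheme are illustrative assumptions, not the actual Discl-VC architecture.

```python
# Minimal sketch of masked prosody-token prediction (illustrative only;
# sizes and masking schedule are assumptions, not the Discl-VC design).
import torch
import torch.nn as nn

class MaskedProsodyPredictor(nn.Module):
    def __init__(self, num_prosody_tokens=256, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.mask_id = num_prosody_tokens  # extra id reserved for the [MASK] token
        self.token_emb = nn.Embedding(num_prosody_tokens + 1, d_model)
        self.content_proj = nn.Linear(d_model, d_model)  # assumed content-feature size
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, num_prosody_tokens)

    def forward(self, prosody_tokens, content_feats, mask_ratio=0.5):
        # Randomly mask a fraction of the prosody tokens, as in mask-generative training.
        mask = torch.rand_like(prosody_tokens, dtype=torch.float) < mask_ratio
        inputs = prosody_tokens.masked_fill(mask, self.mask_id)
        # Condition the masked token sequence on content features.
        x = self.token_emb(inputs) + self.content_proj(content_feats)
        logits = self.head(self.encoder(x))
        # Cross-entropy is computed on masked positions only.
        loss = nn.functional.cross_entropy(logits[mask], prosody_tokens[mask])
        return loss, logits

# Toy usage: batch of 2 utterances, 100 frames each.
model = MaskedProsodyPredictor()
prosody = torch.randint(0, 256, (2, 100))
content = torch.randn(2, 100, 256)
loss, _ = model(prosody, content, mask_ratio=0.5)
loss.backward()
```

In mask-generative models of this kind, inference typically starts from a fully masked sequence and fills in tokens over several iterations, keeping the most confident predictions at each step.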