Advancements in Speaker Modeling and Voice Conversion

The field of speaker modeling and voice conversion is moving toward more nuanced approaches to representing and manipulating speaker attributes and voice characteristics. Researchers are exploring new architectures and techniques to improve the accuracy and naturalness of speaker diarization, voice impression control, and voice conversion. One notable trend is the use of transformer-based models and attention mechanisms to better capture local dependencies and contextual information in speech signals. Another area of focus is the development of more effective methods for controlling and modifying para-linguistic information in speech, such as voice impression and speaking style. Together, these advances stand to improve both the performance and the versatility of speaker modeling and voice conversion systems.

Noteworthy papers include:

Improving Neural Diarization through Speaker Attribute Attractors and Local Dependency Modeling, which introduces a speaker diarization approach built on speaker attribute attractors and a conformer-based architecture.

CoLMbo: Speaker Language Model for Descriptive Profiling, which presents a speaker language model that generates detailed, structured descriptions of speaker characteristics such as dialect, gender, and age.
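To make the "local dependencies" idea concrete, here is a minimal sketch (not from any of the cited papers, and simplified relative to real conformer blocks) of windowed self-attention: each speech frame attends only to frames within a small neighborhood, so the layer models local context rather than the full sequence. All function and variable names here are illustrative assumptions.

```python
# Sketch only: windowed (local) self-attention over a sequence of frame
# features, illustrating how an attention mask can restrict each frame
# to nearby context. Not the architecture of any specific cited paper.
import numpy as np

def local_self_attention(x, w_q, w_k, w_v, window=2):
    """x: (T, d) frame features; attend only within +/- `window` frames."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # scaled dot-product
    T = x.shape[0]
    idx = np.arange(T)
    mask = np.abs(idx[:, None] - idx[None, :]) > window
    scores[mask] = -1e9                              # block distant frames
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
T, d = 10, 8
x = rng.standard_normal((T, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
out = local_self_attention(x, w_q, w_k, w_v, window=2)
print(out.shape)  # (10, 8)
```

Widening `window` (or stacking several such layers, as transformer stacks do) lets the model trade locality for broader context.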

Sources

Improving Neural Diarization through Speaker Attribute Attractors and Local Dependency Modeling

Voice Impression Control in Zero-Shot TTS

Pureformer-VC: Non-parallel Voice Conversion with Pure Stylized Transformer Blocks and Triplet Discriminative Training

CoLMbo: Speaker Language Model for Descriptive Profiling

Training-Free Voice Conversion with Factorized Optimal Transport
