Emotion-Aware Speech Processing

The field of speech processing is moving toward a more nuanced treatment of human emotion, with a focus on systems that can recognize, generate, and manipulate emotional speech. Recent work explores dimensionally defined emotions, such as arousal, dominance, and valence, to improve the controllability and expressiveness of emotional speech synthesis. Another trend is the integration of personality traits into speech emotion recognition, which has been shown to improve emotion-detection accuracy. Large-scale emotional speech datasets and multimodal annotation frameworks are also accelerating progress in this area. Noteworthy papers include UDDETTS, which introduces a neural codec language model for controllable emotional text-to-speech, and ClapFM-EVC, a framework for high-fidelity emotional voice conversion with flexible control. Newly released datasets such as CAMEO and The Super Emotion Dataset provide further resources for researchers in this field.
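To make the dimensional-control idea concrete, the sketch below shows one common way continuous arousal-dominance-valence (ADV) values can condition a synthesis model: the three-dimensional emotion point is projected into the encoder's hidden space and added to the text-encoder states. This is a minimal illustration in PyTorch, not the UDDETTS architecture; the ADVConditioner name, dimensions, and additive-fusion choice are all assumptions for demonstration.

import torch
import torch.nn as nn

class ADVConditioner(nn.Module):
    """Hypothetical module: maps a continuous (arousal, dominance, valence)
    vector into a conditioning embedding added to text-encoder states."""

    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        # Small MLP lifting the 3-D ADV point into the encoder's hidden space.
        self.proj = nn.Sequential(
            nn.Linear(3, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, text_states: torch.Tensor, adv: torch.Tensor) -> torch.Tensor:
        # text_states: (batch, seq_len, hidden_dim) from a text encoder
        # adv: (batch, 3), each dimension typically normalized to [0, 1]
        emotion_emb = self.proj(adv)                   # (batch, hidden_dim)
        return text_states + emotion_emb.unsqueeze(1)  # broadcast over time steps

# Usage: steer synthesis toward high arousal, neutral dominance, positive valence.
encoder_out = torch.randn(1, 50, 256)      # stand-in for text-encoder output
adv = torch.tensor([[0.9, 0.5, 0.8]])      # (arousal, dominance, valence)
conditioned = ADVConditioner(256)(encoder_out, adv)
print(conditioned.shape)                   # torch.Size([1, 50, 256])

Because the ADV point is continuous, interpolating between two emotion settings yields a smooth trajectory in the conditioning space, which is the usual argument for dimensional over categorical emotion control; alternatives to additive fusion, such as concatenation or FiLM-style modulation, follow the same pattern.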
Sources
ClapFM-EVC: High-Fidelity and Flexible Emotional Voice Conversion with Dual Control from Natural Language and Speech
Bridging Speech Emotion Recognition and Personality: Dataset and Temporal Interaction Condition Network
PersonaTAB: Predicting Personality Traits using Textual, Acoustic, and Behavioral Cues in Fully-Duplex Speech Dialogs
MIKU-PAL: An Automated and Standardized Multi-Modal Method for Speech Paralinguistic and Affect Labeling