Emotion-Aware Speech Processing

The field of speech processing is moving toward a more nuanced treatment of human emotion, focusing on systems that can recognize, generate, and manipulate emotional speech. Recent work explores dimensionally defined emotions, such as arousal, dominance, and valence, to improve the controllability and expressiveness of emotional speech synthesis (a minimal sketch of this conditioning idea follows below). Another trend is the integration of personality traits into speech emotion recognition, which has been shown to improve emotion-detection accuracy. Large-scale emotional speech datasets and multimodal frameworks are also accelerating progress in this area. Noteworthy papers include UDDETTS, which introduces a neural codec language model for controllable emotional text-to-speech, and ClapFM-EVC, a framework for high-fidelity emotional voice conversion with flexible control. Datasets such as CAMEO and The Super Emotion Dataset provide further valuable resources for researchers in this field.
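To make the dimensional-control idea concrete, the sketch below shows one generic way a synthesizer could accept an arousal-dominance-valence (ADV) vector as a conditioning signal: project the three-dimensional emotion vector into the text-embedding space and add it at every time step. The module and parameter names are illustrative assumptions for this digest, not the UDDETTS architecture or any paper's actual implementation.

```python
# Minimal sketch: conditioning a TTS-style text encoder on an
# arousal-dominance-valence (ADV) vector. All names are hypothetical;
# this is NOT the UDDETTS model, just the general conditioning pattern.
import torch
import torch.nn as nn

class ADVConditionedEncoder(nn.Module):
    def __init__(self, vocab_size: int, hidden_dim: int = 256):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, hidden_dim)
        # Project the 3-D (arousal, dominance, valence) vector into the
        # same space as the text embeddings.
        self.adv_proj = nn.Sequential(
            nn.Linear(3, hidden_dim),
            nn.Tanh(),
        )
        self.encoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids: torch.Tensor, adv: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len); adv: (batch, 3), values in [-1, 1].
        x = self.token_embed(token_ids)
        # Broadcast the emotion embedding across every time step so the
        # whole utterance is steered by the same ADV target.
        x = x + self.adv_proj(adv).unsqueeze(1)
        out, _ = self.encoder(x)
        return out  # a downstream acoustic decoder would consume this

# Usage: high arousal, neutral dominance, positive valence.
enc = ADVConditionedEncoder(vocab_size=100)
tokens = torch.randint(0, 100, (1, 12))
adv = torch.tensor([[0.8, 0.0, 0.6]])
hidden = enc(tokens, adv)
print(hidden.shape)  # torch.Size([1, 12, 256])
```

Because the ADV vector is continuous, this kind of conditioning allows interpolation between emotional states rather than switching among a fixed set of discrete labels, which is the expressiveness benefit the dimensional approaches aim for.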

Sources

UDDETTS: Unifying Discrete and Dimensional Emotions for Controllable Emotional Text-to-Speech

CAMEO: Collection of Multilingual Emotional Speech Corpora

ClapFM-EVC: High-Fidelity and Flexible Emotional Voice Conversion with Dual Control from Natural Language and Speech

Bridging Speech Emotion Recognition and Personality: Dataset and Temporal Interaction Condition Network

PersonaTAB: Predicting Personality Traits using Textual, Acoustic, and Behavioral Cues in Fully-Duplex Speech Dialogs

The Super Emotion Dataset

MIKU-PAL: An Automated and Standardized Multi-Modal Method for Speech Paralinguistic and Affect Labeling

MM-MovieDubber: Towards Multi-Modal Learning for Multi-Modal Movie Dubbing

University of Indonesia at SemEval-2025 Task 11: Evaluating State-of-the-Art Encoders for Multi-Label Emotion Detection

Action2Dialogue: Generating Character-Centric Narratives from Scene-Level Prompts
