Advancements in Multimodal Emotion Recognition and Speech Processing
The field of multimodal emotion recognition and speech processing is evolving rapidly, with a focus on more accurate and robust models for emotion recognition, speech synthesis, and dialogue systems. Recent work applies reinforcement learning, multimodal fusion, and hierarchical soft prompt models to improve speech emotion recognition and rumor detection, and integrating visual, audio, and text data has been shown to improve both emotion recognition and fake news detection. Novel architectures such as EmoQ and HadaSmileNet report state-of-the-art results in speech emotion recognition and facial emotion recognition, respectively: EmoQ proposes a multimodal large language model-based framework for speech emotion recognition, while HadaSmileNet introduces a feature fusion framework for recognizing genuine smiles. Overall, the field is moving towards more comprehensive and effective multimodal models, with potential applications in human-computer interaction, healthcare, and the social sciences.
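As a concrete illustration of the kind of multimodal fusion described above, the sketch below combines precomputed audio, text, and visual features through per-modality projections and an element-wise (Hadamard) interaction term before classification. This is a minimal, hypothetical PyTorch example, not the architecture of any of the cited papers; the feature dimensions, the four-class label set, and the LateFusionEmotionClassifier name are assumptions made for illustration.

```python
# Illustrative sketch only: a minimal late-fusion emotion classifier that projects
# audio, text, and visual features into a shared space and combines them with
# concatenation plus an element-wise (Hadamard) interaction term.
# All dimensions and the 4-class label set are assumptions, not taken from the papers.
import torch
import torch.nn as nn


class LateFusionEmotionClassifier(nn.Module):
    def __init__(self, audio_dim=768, text_dim=768, visual_dim=512,
                 hidden_dim=256, num_classes=4):
        super().__init__()
        # Per-modality projections into a shared hidden space.
        self.audio_proj = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.ReLU())
        self.text_proj = nn.Sequential(nn.Linear(text_dim, hidden_dim), nn.ReLU())
        self.visual_proj = nn.Sequential(nn.Linear(visual_dim, hidden_dim), nn.ReLU())
        # Classifier over the concatenated modality embeddings plus their
        # element-wise (Hadamard) product, which captures cross-modal interactions.
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 4, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, audio_feat, text_feat, visual_feat):
        a = self.audio_proj(audio_feat)
        t = self.text_proj(text_feat)
        v = self.visual_proj(visual_feat)
        hadamard = a * t * v  # element-wise interaction of the three modalities
        fused = torch.cat([a, t, v, hadamard], dim=-1)
        return self.classifier(fused)


if __name__ == "__main__":
    model = LateFusionEmotionClassifier()
    # Batch of 8 utterances with precomputed (e.g., encoder-derived) features.
    audio = torch.randn(8, 768)
    text = torch.randn(8, 768)
    visual = torch.randn(8, 512)
    logits = model(audio, text, visual)
    print(logits.shape)  # torch.Size([8, 4])
```

Concatenation preserves each modality's own embedding while the Hadamard term adds a cheap cross-modal interaction; the cited papers use more elaborate fusion strategies (e.g., multirepresentation or handcrafted-plus-deep feature fusion), which are not reproduced here.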
Sources
Thinking in cocktail party: Chain-of-Thought and reinforcement learning for target speaker automatic speech recognition
Multimodal Learning for Fake News Detection in Short Videos Using Linguistically Verified Data and Heterogeneous Modality Fusion
EMO-RL: Emotion-Rule-Based Reinforcement Learning Enhanced Audio-Language Model for Generalized Speech Emotion Recognition
HadaSmileNet: Hadamard fusion of handcrafted and deep-learning features for enhancing facial emotion recognition of genuine smiles
M4SER: Multimodal, Multirepresentation, Multitask, and Multistrategy Learning for Speech Emotion Recognition