Advancements in Multimodal Emotion Recognition and Speech Processing
The field of multimodal emotion recognition and speech processing is evolving rapidly, with a focus on more accurate and robust models for emotion recognition, speech synthesis, and dialogue systems. Recent work applies reinforcement learning, multimodal fusion, and hierarchical soft prompt models to improve speech emotion recognition and rumor detection, and integrating visual, audio, and text data has been shown to improve both emotion recognition and fake news detection. Novel architectures such as EmoQ and HadaSmileNet report state-of-the-art results in speech emotion recognition and facial emotion recognition, respectively: EmoQ proposes a multimodal large language model-based framework for speech emotion recognition, while HadaSmileNet introduces a feature fusion framework for recognizing genuine smiles. Overall, the field is moving towards more comprehensive and effective multimodal models, with potential applications in human-computer interaction, healthcare, and the social sciences.
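As a concrete illustration of the kind of multimodal fusion described above, the sketch below combines precomputed audio, text, and visual features through per-modality projections and an element-wise (Hadamard) interaction term before classification. This is a minimal, hypothetical PyTorch example, not the architecture of any of the cited papers; the feature dimensions, the four-class label set, and the LateFusionEmotionClassifier name are assumptions made for illustration.

```python
# Illustrative sketch only: a minimal late-fusion emotion classifier that projects
# audio, text, and visual features into a shared space and combines them with
# concatenation plus an element-wise (Hadamard) interaction term.
# All dimensions and the 4-class label set are assumptions, not taken from the papers.
import torch
import torch.nn as nn


class LateFusionEmotionClassifier(nn.Module):
    def __init__(self, audio_dim=768, text_dim=768, visual_dim=512,
                 hidden_dim=256, num_classes=4):
        super().__init__()
        # Per-modality projections into a shared hidden space.
        self.audio_proj = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.ReLU())
        self.text_proj = nn.Sequential(nn.Linear(text_dim, hidden_dim), nn.ReLU())
        self.visual_proj = nn.Sequential(nn.Linear(visual_dim, hidden_dim), nn.ReLU())
        # Classifier over the concatenated modality embeddings plus their
        # element-wise (Hadamard) product, which captures cross-modal interactions.
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 4, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, audio_feat, text_feat, visual_feat):
        a = self.audio_proj(audio_feat)
        t = self.text_proj(text_feat)
        v = self.visual_proj(visual_feat)
        hadamard = a * t * v  # element-wise interaction of the three modalities
        fused = torch.cat([a, t, v, hadamard], dim=-1)
        return self.classifier(fused)


if __name__ == "__main__":
    model = LateFusionEmotionClassifier()
    # Batch of 8 utterances with precomputed (e.g., encoder-derived) features.
    audio = torch.randn(8, 768)
    text = torch.randn(8, 768)
    visual = torch.randn(8, 512)
    logits = model(audio, text, visual)
    print(logits.shape)  # torch.Size([8, 4])
```

Concatenation preserves each modality's own embedding while the Hadamard term adds a cheap cross-modal interaction; the cited papers use more elaborate fusion strategies (e.g., multirepresentation or handcrafted-plus-deep feature fusion), which are not reproduced here.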
Sources
Thinking in cocktail party: Chain-of-Thought and reinforcement learning for target speaker automatic speech recognition
Multimodal Learning for Fake News Detection in Short Videos Using Linguistically Verified Data and Heterogeneous Modality Fusion
EMO-RL: Emotion-Rule-Based Reinforcement Learning Enhanced Audio-Language Model for Generalized Speech Emotion Recognition
HadaSmileNet: Hadamard fusion of handcrafted and deep-learning features for enhancing facial emotion recognition of genuine smiles
M4SER: Multimodal, Multirepresentation, Multitask, and Multistrategy Learning for Speech Emotion Recognition