The field of multimodal emotion recognition is moving towards more effective fusion strategies, greater use of large-scale pre-trained models, and psychologically meaningful priors that guide multimodal alignment. Researchers are exploring approaches that integrate visual, audio, and textual signals to improve recognition performance. Noteworthy papers include ECMF, which proposes a multimodal emotion recognition framework built on large-scale pre-trained models and achieves a substantial improvement over the official baseline, and VEGA, which introduces a Visual Emotion Guided Anchoring mechanism that constructs emotion-specific visual anchors from facial exemplars and reaches state-of-the-art performance on IEMOCAP and MELD. A generic fusion sketch follows below.
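To make the fusion idea concrete, here is a minimal sketch of late fusion over pre-extracted visual, audio, and text embeddings. It is an illustrative baseline only, not the ECMF or VEGA architecture; the class name, feature dimensions, and number of emotion classes are assumptions chosen for the example.

```python
# Minimal late-fusion sketch for multimodal emotion recognition.
# Assumes pre-extracted per-utterance features for each modality;
# all dimensions and the 7-class label space are illustrative.
import torch
import torch.nn as nn


class LateFusionEmotionClassifier(nn.Module):
    def __init__(self, vis_dim=512, aud_dim=768, txt_dim=768,
                 hidden_dim=256, num_emotions=7):
        super().__init__()
        # One projection head per modality maps features into a shared space.
        self.vis_proj = nn.Sequential(nn.Linear(vis_dim, hidden_dim), nn.ReLU())
        self.aud_proj = nn.Sequential(nn.Linear(aud_dim, hidden_dim), nn.ReLU())
        self.txt_proj = nn.Sequential(nn.Linear(txt_dim, hidden_dim), nn.ReLU())
        # Fusion by concatenating the three projected vectors.
        self.classifier = nn.Linear(3 * hidden_dim, num_emotions)

    def forward(self, vis, aud, txt):
        fused = torch.cat([self.vis_proj(vis),
                           self.aud_proj(aud),
                           self.txt_proj(txt)], dim=-1)
        return self.classifier(fused)  # unnormalised emotion logits


if __name__ == "__main__":
    model = LateFusionEmotionClassifier()
    # Batch of 4 utterances with pre-extracted features per modality.
    logits = model(torch.randn(4, 512), torch.randn(4, 768), torch.randn(4, 768))
    print(logits.shape)  # torch.Size([4, 7])
```

The papers surveyed here go beyond this simple concatenation, e.g. by leveraging large pre-trained encoders or emotion-specific visual anchors, but the sketch shows the basic structure such fusion models refine.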