The field of emotion recognition and analysis is moving toward a more nuanced, fine-grained understanding of human emotions, with an emphasis on multimodal information and large-scale datasets. Recent work highlights the value of incorporating textual context and semantic information into visual understanding, along with the need for frameworks that can harness the rich supervision available in natural-language captions. Large language models and contrastive learning have likewise shown promise for improving recognition performance. Noteworthy papers include MGHFT, which proposes a multi-granularity hierarchical fusion transformer for cross-modal sticker emotion recognition, and AU-LLM, which pioneers the use of large language models for micro-expression action unit detection. In addition, FED-PsyAU advances privacy-preserving micro-expression recognition, while Learning Transferable Facial Emotion Representations from Large-Scale Semantically Rich Captions shows how transferable facial emotion representations can be learned from caption supervision.
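To make the contrastive-learning direction concrete, the sketch below shows a CLIP-style symmetric InfoNCE objective that pairs face images with emotion-describing captions. This is a minimal illustration of the general technique, not the method of any cited paper; the encoders, embedding dimension, batch size, and temperature are assumed for the example.

```python
import torch
import torch.nn.functional as F

def contrastive_emotion_loss(image_emb, caption_emb, temperature=0.07):
    """Symmetric InfoNCE loss pairing face images with emotion captions.

    image_emb:   (N, D) embeddings from a vision encoder (hypothetical)
    caption_emb: (N, D) embeddings from a text encoder (hypothetical)
    """
    # L2-normalize so dot products become cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    caption_emb = F.normalize(caption_emb, dim=-1)

    # Pairwise image-caption similarity matrix, scaled by temperature
    logits = image_emb @ caption_emb.t() / temperature

    # The i-th image in the batch matches the i-th caption
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    # Average the image-to-text and text-to-image directions
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Usage with dummy embeddings standing in for real encoder outputs
images = torch.randn(32, 512)
captions = torch.randn(32, 512)
loss = contrastive_emotion_loss(images, captions)
```

The symmetric form encourages each face image to sit closest to its own caption in the shared embedding space and vice versa, which is what lets caption-level semantics (e.g., fine-grained emotion descriptions) transfer to the visual representation.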