The field of emotion understanding and human behavior analysis is moving toward more nuanced multimodal approaches that pair large language models with cross-modal fusion to better capture subtle emotional cues and behavioral signals. Recent work has focused on challenges such as the entanglement of static and dynamic cues, the semantic gap between text and physical motion, and the need for finer-grained multimodal fusion strategies. Noteworthy papers include DEFT-LLM, which achieves motion-semantic alignment through multi-expert disentanglement, and MemoDetector, which introduces a dual-stage modal fusion strategy to better capture nuanced cross-modal emotional cues; a generic sketch of this two-stage pattern appears below. Other notable works include GazeInterpreter, which parses eye-gaze data to generate eye-body-coordinated narrations, and Unveiling Intrinsic Dimension of Texts, which presents a comprehensive study grounding intrinsic dimension in interpretable text properties.
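MemoDetector's dual-stage fusion is described here only at a high level. As a rough illustration of the general pattern, the minimal sketch below first projects each modality into a shared space (stage one) and then lets the modalities attend to one another (stage two). All module names, dimensions, and the choice of cross-attention are assumptions for illustration, not the paper's actual design.

```python
# Minimal sketch of a generic dual-stage multimodal fusion block.
# This illustrates the two-stage pattern only; it is NOT MemoDetector's
# actual architecture. Layer names, dimensions, and the use of
# cross-attention in stage two are assumptions.
import torch
import torch.nn as nn

class DualStageFusion(nn.Module):
    def __init__(self, text_dim=768, audio_dim=128, d_model=256,
                 n_heads=4, n_classes=7):
        super().__init__()
        # Stage 1: project each modality into a shared d_model space
        self.text_proj = nn.Linear(text_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        # Stage 2: cross-modal attention, each modality queries the other
        self.text_to_audio = nn.MultiheadAttention(d_model, n_heads,
                                                   batch_first=True)
        self.audio_to_text = nn.MultiheadAttention(d_model, n_heads,
                                                   batch_first=True)
        self.classifier = nn.Linear(2 * d_model, n_classes)

    def forward(self, text_feats, audio_feats):
        # text_feats: (B, T_text, text_dim); audio_feats: (B, T_audio, audio_dim)
        t = self.text_proj(text_feats)
        a = self.audio_proj(audio_feats)
        # Each modality attends to the other for complementary emotional cues
        t_enh, _ = self.text_to_audio(t, a, a)
        a_enh, _ = self.audio_to_text(a, t, t)
        # Pool over time and fuse for a sequence-level emotion prediction
        fused = torch.cat([t_enh.mean(dim=1), a_enh.mean(dim=1)], dim=-1)
        return self.classifier(fused)

model = DualStageFusion()
logits = model(torch.randn(2, 16, 768), torch.randn(2, 50, 128))
print(logits.shape)  # torch.Size([2, 7])
```

Cross-attention here stands in for whatever second-stage interaction the paper actually uses; the point of the pattern is only that coarse alignment and fine-grained cross-modal interaction happen in separate stages.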