The field of multimodal sentiment analysis and emotion recognition is moving toward more sophisticated and effective models that capture complex cross-modal interactions and integrate diverse opinion-bearing modalities. Researchers are proposing frameworks and architectures that adaptively integrate multi-level features, regulate cross-layer information flow, and achieve balanced representation learning. Geometric deep learning, dynamic fusion, and multi-level fusion methods are becoming increasingly popular. In addition, supervisory documentation assistance and privileged information are being explored as ways to enrich text feature extraction and improve prediction performance.
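As a concrete illustration of the gated, adaptive fusion these trends describe, the sketch below shows a minimal gated cross-modal fusion module in PyTorch. It is a generic example rather than any specific paper's architecture; the class name GatedCrossModalFusion, the hidden size, and the input dimensions (chosen to resemble common BERT, COVAREP, and FACET feature sizes) are illustrative assumptions.

```python
import torch
import torch.nn as nn


class GatedCrossModalFusion(nn.Module):
    """Fuse text, audio, and visual features with learned sigmoid gates.

    Each modality is projected to a shared dimension; a gate computed from
    the concatenated projections decides how much of each modality flows
    into the fused representation.
    """

    def __init__(self, text_dim: int, audio_dim: int, visual_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.proj_t = nn.Linear(text_dim, hidden_dim)
        self.proj_a = nn.Linear(audio_dim, hidden_dim)
        self.proj_v = nn.Linear(visual_dim, hidden_dim)
        # One gate per modality, conditioned on all three projections.
        self.gate = nn.Linear(3 * hidden_dim, 3)
        self.head = nn.Linear(hidden_dim, 1)  # e.g. a sentiment regression head

    def forward(self, text, audio, visual):
        h_t = torch.tanh(self.proj_t(text))
        h_a = torch.tanh(self.proj_a(audio))
        h_v = torch.tanh(self.proj_v(visual))
        # Sigmoid gates softly regulate each modality's contribution.
        gates = torch.sigmoid(self.gate(torch.cat([h_t, h_a, h_v], dim=-1)))  # (batch, 3)
        fused = gates[:, 0:1] * h_t + gates[:, 1:2] * h_a + gates[:, 2:3] * h_v
        return self.head(fused), gates


if __name__ == "__main__":
    # Toy usage with random tensors standing in for utterance-level features.
    model = GatedCrossModalFusion(text_dim=768, audio_dim=74, visual_dim=35)
    score, gates = model(torch.randn(4, 768), torch.randn(4, 74), torch.randn(4, 35))
    print(score.shape, gates.shape)  # torch.Size([4, 1]) torch.Size([4, 3])
```

The sigmoid gates act as a soft regulator of how much each modality contributes to the fused representation, which is the basic mechanism behind the "regulated information flow" many of these frameworks build on.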
Noteworthy papers include RecruitView, which introduces a multimodal dataset for predicting personality and interview performance together with a geometric deep learning framework that achieves superior performance while training fewer parameters; DyFuLM, a multimodal sentiment analysis framework that introduces a hierarchical dynamic fusion module and a gated feature aggregation module, achieving state-of-the-art results on multi-task sentiment datasets; and PSA-MF, a personality-sentiment aligned multi-level fusion framework that integrates sentiment-related information from different modalities and reaches state-of-the-art results on two widely used datasets.
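To make the multi-level fusion idea more concrete, the following minimal sketch (again generic, not the PSA-MF or DyFuLM implementation) combines hidden states from several encoder layers with learned softmax weights; the layer count and hidden size are arbitrary placeholders.

```python
import torch
import torch.nn as nn


class MultiLevelFeatureFusion(nn.Module):
    """Combine hidden states from several encoder layers with learned weights.

    A softmax over per-layer scalars lets the model decide how much each
    level of abstraction (lower vs. higher layers) contributes to the
    final representation used for sentiment prediction.
    """

    def __init__(self, num_layers: int, hidden_dim: int):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.head = nn.Linear(hidden_dim, 3)  # e.g. negative / neutral / positive

    def forward(self, layer_states):
        # layer_states: (num_layers, batch, hidden_dim)
        weights = torch.softmax(self.layer_logits, dim=0)          # (num_layers,)
        fused = torch.einsum("l,lbh->bh", weights, layer_states)   # weighted sum over layers
        return self.head(fused)


if __name__ == "__main__":
    states = torch.randn(12, 4, 768)  # e.g. 12 transformer layers, batch of 4 utterances
    model = MultiLevelFeatureFusion(num_layers=12, hidden_dim=768)
    print(model(states).shape)  # torch.Size([4, 3])
```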