Multimodal Learning and Facial Expression Recognition

The field of multimodal learning is moving toward more efficient and effective models that can handle diverse tasks and datasets. Researchers are exploring new architectures and techniques, such as unified models for image understanding and generation and multimodal prompt alignment for facial expression recognition. Noteworthy papers include:

  • MM-LG, which proposes a framework for extracting and leveraging generalizable components from CLIP, achieving performance gains while reducing parameter storage and pre-training costs.
  • UniFork, which introduces a Y-shaped architecture that balances shared learning and task specialization for unified multimodal understanding and generation (architecture sketched after this list).
  • MIDAS, which proposes a soft-label data augmentation method that improves dynamic facial expression recognition on ambiguous expression data (augmentation sketched after this list).
  • MPA-FER, which provides fine-grained semantic guidance to the learning of prompted visual features for facial expression recognition, yielding more precise and interpretable representations (prompt-based classification sketched after this list).
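
The Y-shaped layout described for UniFork can be pictured as a shared trunk of early layers feeding two task-specific branches. The snippet below is a minimal PyTorch sketch of that general idea only; the module names, depths, and dimensions are illustrative assumptions, not UniFork's actual configuration.

```python
import torch
import torch.nn as nn

class YShapedBackbone(nn.Module):
    """Illustrative Y-shaped layout: shared early transformer blocks,
    then task-specific branches for understanding vs. generation.
    Depths and sizes are placeholders, not UniFork's real config."""

    def __init__(self, dim=512, shared_depth=4, branch_depth=2, nhead=8):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(dim, nhead, batch_first=True)
        self.shared = nn.ModuleList([layer() for _ in range(shared_depth)])
        self.understand_branch = nn.ModuleList([layer() for _ in range(branch_depth)])
        self.generate_branch = nn.ModuleList([layer() for _ in range(branch_depth)])

    def forward(self, tokens, task="understand"):
        for blk in self.shared:          # shared cross-task learning
            tokens = blk(tokens)
        branch = self.understand_branch if task == "understand" else self.generate_branch
        for blk in branch:               # task-specialized layers
            tokens = blk(tokens)
        return tokens

x = torch.randn(2, 16, 512)              # (batch, tokens, dim)
print(YShapedBackbone()(x, task="generate").shape)  # torch.Size([2, 16, 512])
```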
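
For the soft-label augmentation idea, a generic mixup-style blend of video clips and their label distributions shows how ambiguity can be encoded as a soft target. This is a sketch of the general technique only, not the MIDAS procedure; all shapes and names are illustrative.

```python
import torch

def soft_label_mix(clips, labels, num_classes, alpha=0.4):
    """Blend pairs of clips and their labels into soft targets.
    A generic mixup-style sketch, not the MIDAS algorithm itself.
    clips: (B, T, C, H, W) video tensors; labels: (B,) class indices."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(clips.size(0))
    mixed_clips = lam * clips + (1.0 - lam) * clips[perm]
    one_hot = torch.nn.functional.one_hot(labels, num_classes).float()
    soft_labels = lam * one_hot + (1.0 - lam) * one_hot[perm]  # ambiguity as a distribution
    return mixed_clips, soft_labels

clips = torch.randn(8, 16, 3, 112, 112)   # 8 clips of 16 frames each
labels = torch.randint(0, 7, (8,))        # 7 basic expression classes
x, y = soft_label_mix(clips, labels, num_classes=7)
print(x.shape, y.shape)                   # mixed clips and (8, 7) soft labels
```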
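
Prompt-based recognition of this kind typically classifies image features by their similarity to per-class text prompts with learnable context. The sketch below shows that generic CoOp-style mechanism rather than MPA-FER's alignment method; the class embeddings, context handling, and dimensions are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptedFERHead(nn.Module):
    """Generic prompted classifier sketch: learnable context vectors are
    combined with frozen expression-class embeddings, and image features
    are classified by cosine similarity to the resulting class prompts.
    Illustrates prompt-based FER in general, not MPA-FER's method."""

    def __init__(self, class_embeds, ctx_len=4, dim=512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(ctx_len, dim) * 0.02)  # learnable context
        self.register_buffer("class_embeds", class_embeds)         # (num_classes, dim), frozen
        self.logit_scale = nn.Parameter(torch.tensor(4.6))         # ~log(100), CLIP-style

    def forward(self, image_feats):                                # (B, dim)
        # Simplification: a real text encoder would consume the full prompt
        # token sequence; here the context is reduced to its mean.
        prompts = self.class_embeds + self.ctx.mean(dim=0)
        img = F.normalize(image_feats, dim=-1)
        txt = F.normalize(prompts, dim=-1)
        return self.logit_scale.exp() * img @ txt.t()              # (B, num_classes) logits

class_embeds = torch.randn(7, 512)        # placeholder embeddings for 7 expressions
logits = PromptedFERHead(class_embeds)(torch.randn(4, 512))
print(logits.shape)                       # torch.Size([4, 7])
```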

Sources

Extracting Multimodal Learngene in CLIP: Unveiling the Multimodal Generalizable Knowledge

UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation

Enhancing Ambiguous Dynamic Facial Expression Recognition with Soft Label-based Data Augmentation

Multimodal Prompt Alignment for Facial Expression Recognition
