Advances in Multimodal Learning and Robustness

The field of multimodal learning is advancing rapidly, with a focus on improving robustness and cultural understanding in vision-language models. Recent work highlights the importance of handling non-additive perturbations, dialectal variation, and cultural bias in facial expression recognition, and researchers are exploring methodologies such as contrastive learning and diffusion-based denoising to strengthen multimodal models. Two papers stand out: CoDefend proposes a supervised diffusion-based denoising framework for defending multimodal models against adversarial threats, while DialectGen benchmarks and improves dialect robustness in multimodal generation, achieving significant performance gains with a general encoder-based mitigation strategy.
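
To make the diffusion-based denoising idea concrete, below is a minimal sketch of the general purification recipe that defenses in this vein (e.g., CoDefend, NAPPure) build on: forward-noise a possibly adversarial input to an intermediate timestep, then run the reverse diffusion process back to a clean image before classification. The `denoiser` network, the choice of `t_star`, and the linear beta schedule are illustrative assumptions, not the implementation from either paper.

```python
import torch

def diffusion_purify(x_adv, denoiser, t_star=100, num_steps=1000):
    """Purify a (possibly adversarial) image via diffusion denoising.

    Assumes a DDPM-style linear beta schedule and a hypothetical
    `denoiser(x_t, t)` network that predicts the added noise
    (epsilon-prediction parameterization).
    """
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    # Forward process: diffuse the input to timestep t_star, which
    # drowns out small adversarial perturbations in Gaussian noise.
    noise = torch.randn_like(x_adv)
    a_bar = alpha_bars[t_star]
    x_t = a_bar.sqrt() * x_adv + (1.0 - a_bar).sqrt() * noise

    # Reverse process: ancestral sampling from t_star back down to 0,
    # recovering a clean image that a downstream classifier can use.
    for t in reversed(range(t_star)):
        eps = denoiser(x_t, torch.tensor([t]))
        alpha, a_bar = alphas[t], alpha_bars[t]
        mean = (x_t - (1.0 - alpha) / (1.0 - a_bar).sqrt() * eps) / alpha.sqrt()
        x_t = mean + betas[t].sqrt() * torch.randn_like(x_t) if t > 0 else mean
    return x_t
```

The key design trade-off is `t_star`: too small and the adversarial perturbation survives the noising step; too large and the reverse process discards semantic content along with the attack.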

Sources

Training-Free In-Context Forensic Chain for Image Manipulation Detection and Localization

VOLTAGE: A Versatile Contrastive Learning based OCR Methodology for ultra low-resource scripts through Auto Glyph Feature Extraction

CoDefend: Cross-Modal Collaborative Defense via Diffusion Purification and Prompt Optimization

BLEnD-Vis: Benchmarking Multimodal Cultural Understanding in Vision Language Models

Modeling Cultural Bias in Facial Expression Recognition with Adaptive Agents

NAPPure: Adversarial Purification for Robust Image Classification under Non-Additive Perturbations

DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation
