Vision-Language Models for Multimodal Understanding

Vision-language models (VLMs) are advancing rapidly, with current work focused on improving multimodal understanding and interaction. Recent work emphasizes fine-grained vision-language alignment, which researchers are addressing with new datasets, benchmarks, and models. Applications in augmented reality (AR), document understanding, and human-object interaction (HOI) detection are also gaining traction. In addition, researchers are investigating ways to mitigate object hallucination in large vision-language models, which is critical to their safe deployment. Noteworthy papers in this area include:

  • Fine-Grained Vision-Language Modeling for Multimodal Training Assistants in Augmented Reality, which introduces a comprehensive dataset for AR training and evaluates state-of-the-art VLMs on it.
  • Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection, which proposes a novel framework for open vocabulary HOI detection and achieves superior performance on two popular benchmarks.
  • Energy-Guided Decoding for Object Hallucination Mitigation, which proposes an energy-based decoding method that reduces object hallucination in VLMs and improves accuracy and F1 score on three VQA datasets.

Sources

Fine-Grained Vision-Language Modeling for Multimodal Training Assistants in Augmented Reality

PaddleOCR 3.0 Technical Report

Bridging Perception and Language: A Systematic Benchmark for LVLMs' Understanding of Amodal Completion Reports

Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection

Finetuning Vision-Language Models as OCR Systems for Low-Resource Languages: A Case Study of Manchu

Multi-level Mixture of Experts for Multimodal Entity Linking

Robust Multimodal Large Language Models Against Modality Conflict

Energy-Guided Decoding for Object Hallucination Mitigation

CLIP Won't Learn Object-Attribute Binding from Natural Data and Here is Why
