Compositional Learning in Vision-Language Models

The field of vision-language models is moving toward stronger compositional understanding and temporal alignment. Researchers are developing new benchmarks and frameworks to evaluate and improve fine-grained, temporally coherent video-text alignment. Noteworthy papers in this area include VideoComp, which introduces a benchmark and learning framework for advancing video-text compositionality understanding, and SCRAMBLe, which preference-tunes open-weight MLLMs on synthetic preference data to improve compositional reasoning. SVLTA is also notable for benchmarking temporal alignment through synthetic video situations, while the work on human-like compositional learning of visually-grounded concepts using synthetic environments demonstrates that human-like learning strategies can improve the learning efficiency of artificial systems.
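
To give a flavor of preference tuning on synthetic preference pairs, the sketch below shows a minimal DPO-style objective in PyTorch: given log-probabilities a policy and a frozen reference model assign to a preferred and a dispreferred caption for the same visual input, the loss widens the margin between them. The function name, signature, and the choice of a DPO-style loss are illustrative assumptions for this summary, not SCRAMBLe's published recipe.

```python
import torch
import torch.nn.functional as F

def preference_loss(policy_chosen_logps: torch.Tensor,
                    policy_rejected_logps: torch.Tensor,
                    ref_chosen_logps: torch.Tensor,
                    ref_rejected_logps: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    """DPO-style loss over a batch of (chosen, rejected) caption pairs.

    Each *_logps tensor holds the summed log-probability that the policy
    (or the frozen reference) model assigns to a caption, shape (batch,).
    """
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the preferred caption's margin above the dispreferred one's.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```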

Sources

VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models

Enhancing Compositional Reasoning in Vision-Language Models with Synthetic Preference Data

SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation

Human-like compositional learning of visually-grounded concepts using synthetic environments
