Advances in Concept-Based Models and Vision-Language Integration

The field of concept-based models and vision-language integration is evolving rapidly, with a focus on improving model interpretability, robustness, and performance. Recent research highlights the importance of addressing concept mislabeling and leakage poisoning, as well as the need for more efficient and effective methods for learning visual concepts. Notable advances include new loss functions, such as the Concept Preference Optimization objective, and novel approaches to visual clue learning, such as Multi-grained Compositional visual Clue Learning. Researchers have also introduced new benchmarks, such as VCBENCH for evaluating multimodal mathematical reasoning, and methods such as Focus-Centric Visual Chains for improving vision-language models in multi-image scenarios.

Noteworthy papers include:

Avoiding Leakage Poisoning, which introduces MixCEM, a concept-based model that learns to dynamically exploit leaked information.

Addressing Concept Mislabeling in Concept Bottleneck Models Through Preference Optimization, which proposes the Concept Preference Optimization objective to mitigate the negative impact of concept mislabeling.

Multi-Grained Compositional Visual Clue Learning for Image Intent Recognition, which decomposes intent recognition into visual clue composition and integrates multi-grained features.
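To make the concept-bottleneck setting concrete, the sketch below shows the basic architecture these papers build on: inputs are mapped to a vector of predicted concepts, and the label head sees only that vector, so an expert can intervene by overwriting a mispredicted concept with its ground-truth value. This is a minimal illustrative sketch, not an implementation of MixCEM or Concept Preference Optimization; all function names and weights here are hypothetical.

```python
# Minimal concept-bottleneck sketch (hypothetical, for illustration only):
# input -> predicted binary concepts -> label. An intervention replaces a
# predicted concept with its ground-truth value before the label head.

def predict_concepts(x, concept_weights):
    """Predict each binary concept by thresholding a weighted sum of inputs."""
    return [1 if sum(w * xi for w, xi in zip(ws, x)) > 0 else 0
            for ws in concept_weights]

def predict_label(concepts, label_weights):
    """Label head sees only the concept vector (the bottleneck)."""
    score = sum(w * c for w, c in zip(label_weights, concepts))
    return 1 if score > 0 else 0

def forward(x, concept_weights, label_weights, interventions=None):
    """interventions: optional dict {concept_index: ground_truth_value}."""
    concepts = predict_concepts(x, concept_weights)
    if interventions:
        for i, value in interventions.items():
            concepts[i] = value  # expert overrides a mispredicted concept
    return predict_label(concepts, label_weights)

# Toy usage: concept 0 is mispredicted on this input, so intervening on it
# flips the downstream label.
concept_weights = [[1.0, -1.0], [-1.0, 1.0]]
label_weights = [1.0, -0.5]
x = [0.2, 0.8]
print(forward(x, concept_weights, label_weights))          # -> 0
print(forward(x, concept_weights, label_weights, {0: 1}))  # -> 1
```

Because the label depends only on the concept vector, a mislabeled or leaked concept directly corrupts the downstream prediction, which is the failure mode the leakage-poisoning and concept-mislabeling papers above target.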

Sources

Avoiding Leakage Poisoning: Concept Interventions Under Distribution Shifts

Addressing Concept Mislabeling in Concept Bottleneck Models Through Preference Optimization

Multi-Grained Compositional Visual Clue Learning for Image Intent Recognition

Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency

Rethinking Label-specific Features for Label Distribution Learning

VCM: Vision Concept Modeling Based on Implicit Contrastive Learning with Vision-Language Instruction Fine-Tuning

If Concept Bottlenecks are the Question, are Foundation Models the Answer?

Weaving Context Across Images: Improving Vision-Language Models through Focus-Centric Visual Chains

Approximate Lifted Model Construction

COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning
