Advancements in Vision-Language Models

The field of Vision-Language Models (VLMs) is moving toward addressing the challenges of trust, low-resource data, and noisy labels. Researchers are exploring user-centered approaches to understand how trust in VLMs is built and evolves, and developing methods to make these models more reliable when labeled data is scarce or imperfect. One key direction is the use of multi-agent systems and self-training frameworks to detect offensive content and refine annotations. Another area of focus is new training strategies, such as curriculum learning and soft label refinement, for handling subjective and noisy labels. Notably, Vision Large Language Models are being leveraged to handle label noise and improve engagement analysis. Some noteworthy papers in this area include:

  • Trust in Vision-Language Models: Insights from a Participatory User Workshop, which presents preliminary results from a user-centered workshop to inform future studies on trust metrics.
  • Multi-Agent VLMs Guided Self-Training with PNU Loss for Low-Resource Offensive Content Detection, which proposes a self-training framework that leverages abundant unlabeled data through collaborative pseudo-labeling; a rough sketch of this kind of pipeline appears after the list.
  • Vision Large Language Models Are Good Noise Handlers in Engagement Analysis, which demonstrates the benefits of using VLMs to refine annotations and guide the training process in engagement recognition tasks; a second sketch below illustrates the general idea.
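
To make the second item concrete, here is a minimal sketch of agreement-based pseudo-labeling across several VLM "agents" combined with a positive-negative-unlabeled (PNU) risk. It illustrates the general technique only, not the paper's implementation: the class prior `pi_p`, the mixing weight `eta`, the confidence threshold `tau`, and all function names are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def agreement_pseudo_labels(agent_probs, tau=0.9):
    """Keep an unlabeled sample only when every VLM agent is confident and they all agree."""
    p = torch.stack(agent_probs)              # (n_agents, n_samples) offensive-class probabilities
    all_pos = (p > tau).all(dim=0)            # unanimously and confidently offensive
    all_neg = (p < 1.0 - tau).all(dim=0)      # unanimously and confidently benign
    keep = all_pos | all_neg
    return keep, all_pos.float()              # pseudo-label: 1 = offensive, 0 = benign

def pnu_risk(logits_p, logits_n, logits_u, pi_p=0.3, eta=0.5):
    """PNU risk: a convex mix of the supervised (PN) risk and a non-negative PU risk
    estimated from unlabeled data; pi_p is the (assumed known) positive-class prior."""
    bce = lambda z, y: F.binary_cross_entropy_with_logits(z, torch.full_like(z, y))
    r_pn = pi_p * bce(logits_p, 1.0) + (1.0 - pi_p) * bce(logits_n, 0.0)
    r_neg_from_u = bce(logits_u, 0.0) - pi_p * bce(logits_p, 0.0)
    r_pu = pi_p * bce(logits_p, 1.0) + torch.clamp(r_neg_from_u, min=0.0)
    return (1.0 - eta) * r_pn + eta * r_pu
```

In a full self-training loop, confidently pseudo-labeled examples would be added to the positive and negative pools behind `logits_p` and `logits_n`, while the remaining unlabeled pool feeds the PU term.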

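The third item, together with the curriculum-learning and soft-label-refinement strategies mentioned above, can be sketched in a similar spirit. The snippet below is a hedged illustration under assumed interfaces (integer human labels, a matrix of VLM class probabilities, a blending weight `alpha`), not the method described in the paper.

```python
import numpy as np

def refine_soft_labels(human_labels, vlm_probs, alpha=0.7):
    """Blend possibly noisy human annotations with VLM-predicted class probabilities
    into soft training targets; alpha is the weight given to the human label."""
    n_classes = vlm_probs.shape[1]
    human_onehot = np.eye(n_classes)[human_labels]
    return alpha * human_onehot + (1.0 - alpha) * vlm_probs

def curriculum_order(human_labels, vlm_probs):
    """Order samples from clean-looking to noisy-looking by how strongly the VLM
    disagrees with the human label, so training can start on the cleaner end."""
    p_of_label = vlm_probs[np.arange(len(human_labels)), human_labels]
    return np.argsort(1.0 - p_of_label)       # ascending disagreement: cleanest first
```

Sorting minibatches by `curriculum_order` and training against `refine_soft_labels` targets gives one simple clean-to-noisy schedule in the spirit of the strategies summarized above.
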
Sources

Trust in Vision-Language Models: Insights from a Participatory User Workshop

Multi-Agent VLMs Guided Self-Training with PNU Loss for Low-Resource Offensive Content Detection

Vision Large Language Models Are Good Noise Handlers in Engagement Analysis
