Advancements in Multimodal Large Language Models

The field of multimodal large language models is advancing rapidly, with a focus on improving performance on tasks such as video question answering, open-world egocentric activity recognition, and referring video segmentation. Researchers are exploring techniques including reinforcement fine-tuning, probabilistic jump diffusion, and gaze consensus adaptation to strengthen these models' ability to reason over complex visual and temporal information.

Notable papers include CrowdVLM-R1, which proposes a crowd-counting framework trained with a fuzzy group relative policy reward, and ProbRes, which introduces a jump-diffusion-based probabilistic residual search framework for open-world egocentric activity recognition. Other significant contributions include InstructionBench, a benchmark for instructional video understanding, and RAVEN, an agentic framework for multimodal entity discovery from large-scale video collections. Together, these advances stand to improve multimodal systems across applications such as video question answering, activity recognition, and content retrieval.
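The R1-style entries above (CrowdVLM-R1, VideoChat-R1) build on group relative policy optimization (GRPO), in which each sampled response is scored against the other responses drawn for the same prompt rather than against a learned value model. Below is a minimal sketch of that group-relative advantage computation, assuming one scalar reward per sampled answer; the function name, reward values, and crowd-counting scenario are illustrative assumptions, not taken from the papers.

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: score each reward relative to its group.

    The policy samples a group of responses for one prompt; each
    response's advantage is its reward minus the group mean, scaled
    by the group's standard deviation.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four sampled answers to one crowd-counting
# query, e.g. scored by closeness to the ground-truth count.
print(group_relative_advantages([0.2, 0.9, 0.4, 0.7]))
```

Because the baseline comes from the sampled group itself, this style of reinforcement fine-tuning avoids training a separate critic network, which helps keep it tractable for large multimodal models.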

Sources

CrowdVLM-R1: Expanding R1 Ability to Vision Language Model for Crowd Counting using Fuzzy Group Relative Policy Reward

ProbRes: Probabilistic Jump Diffusion for Open-World Egocentric Activity Recognition

Advancing Egocentric Video Question Answering with Multimodal Large Language Models

Unsupervised Ego- and Exo-centric Dense Procedural Activity Captioning via Gaze Consensus Adaptation

InstructionBench: An Instructional Video Understanding Benchmark

The 1st Solution for 4th PVUW MeViS Challenge: Unleashing the Potential of Large Multimodal Models for Referring Video Segmentation

Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting

On the Suitability of Reinforcement Fine-Tuning to Visual Tasks

RAVEN: An Agentic Framework for Multimodal Entity Discovery from Large-Scale Video Collections

MovSAM: A Single-image Moving Object Segmentation Framework Based on Deep Thinking

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning