Advances in Multimodal Language Models and Dialogue Systems

The field of multimodal language models and dialogue systems is evolving rapidly, with a focus on improving factuality, social intelligence, and interpretability. Recent work emphasizes verifying truthfulness in multi-party social interactions, detecting hallucinations in conversational AI systems, and developing more transparent, human-aligned measures of factual reliability. There is also growing interest in multimodal reasoning, modality decomposition, and sensor fusion, particularly in applications such as autonomous driving and clinical gait analysis. Noteworthy papers include VISTA Score, which introduces a framework for evaluating conversational factuality, and Can MLLMs Read the Room?, which presents a multimodal benchmark for verifying truthfulness in multi-party social interactions; Layer-Wise Modality Decomposition for Interpretable Multimodal Sensor Fusion and When One Modality Sabotages the Others: A Diagnostic Lens on Multimodal Reasoning offer complementary perspectives on how individual modalities contribute to fused predictions.
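
To make the idea of turn-level factuality evaluation concrete, the sketch below shows a minimal, hypothetical pipeline that splits each assistant turn into claims and checks them against the preceding dialogue context. This is an illustration of the general pattern, not the actual VISTA Score method; `extract_claims` and the `verify` callback are assumed stand-ins for real claim-extraction and verification components (e.g. an NLI model or a retrieval-backed checker).

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Turn:
    speaker: str
    text: str

def extract_claims(turn_text: str) -> List[str]:
    # Hypothetical claim splitter: one claim per sentence.
    # Real systems would use an LLM or a dedicated claim-extraction model.
    return [s.strip() for s in turn_text.split(".") if s.strip()]

def turnwise_factuality_score(
    dialogue: List[Turn],
    verify: Callable[[str, List[str]], bool],
) -> float:
    """Score a dialogue by checking each assistant claim against the
    preceding context. `verify` is an assumed, pluggable checker."""
    supported, total = 0, 0
    context: List[str] = []
    for turn in dialogue:
        if turn.speaker == "assistant":
            for claim in extract_claims(turn.text):
                total += 1
                supported += int(verify(claim, context))
        context.append(turn.text)
    return supported / total if total else 1.0

# Toy usage with a trivial word-overlap "verifier".
if __name__ == "__main__":
    dialogue = [
        Turn("user", "The meeting is on Tuesday at 3pm."),
        Turn("assistant", "Noted, the meeting is on Tuesday. It will be held in Paris."),
    ]
    naive_verify = lambda claim, ctx: any(
        set(claim.lower().split()) & set(c.lower().split()) for c in ctx
    )
    print(turnwise_factuality_score(dialogue, naive_verify))  # 0.5: the Paris claim is unsupported
```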
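
Similarly, the diagnostic view of multimodal reasoning (asking whether one modality is dragging down the fused prediction) can be illustrated with a simple leave-one-modality-out ablation. The scorer, mask convention, and modality names below are assumptions chosen for illustration, not the procedure of the cited papers.

```python
from typing import Callable, Dict

def leave_one_modality_out(
    score_fn: Callable[[Dict[str, object]], float],
    inputs: Dict[str, object],
    mask_value: object = None,
) -> Dict[str, float]:
    """Attribute a multimodal model's score to individual modalities by
    masking one modality at a time and recording the score change.
    A positive delta suggests the masked modality was helping;
    a negative delta suggests it was hurting (i.e. "sabotaging") the fusion.
    `score_fn` is an assumed black-box scorer (e.g. gold-answer likelihood)."""
    full = score_fn(inputs)
    deltas = {}
    for name in inputs:
        ablated = dict(inputs)
        ablated[name] = mask_value
        deltas[name] = full - score_fn(ablated)
    return deltas

# Toy usage with a synthetic scorer in which the audio channel is misleading.
if __name__ == "__main__":
    def toy_scorer(x):
        score = 0.5
        if x["vision"] is not None:
            score += 0.4   # vision is informative
        if x["audio"] is not None:
            score -= 0.3   # audio actively misleads the fused prediction
        return score

    print(leave_one_modality_out(toy_scorer, {"vision": "img.png", "audio": "clip.wav"}))
    # -> {'vision': 0.4, 'audio': -0.3}: audio has a negative contribution
```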

Sources

VISTA Score: Verification In Sequential Turn-based Assessment

Can MLLMs Read the Room? A Multimodal Benchmark for Verifying Truthfulness in Multi-Party Social Interactions

Calibration Across Layers: Understanding Calibration Evolution in LLMs

Federated Dialogue-Semantic Diffusion for Emotion Recognition under Incomplete Modalities

Layer-Wise Modality Decomposition for Interpretable Multimodal Sensor Fusion

A Dual-Use Framework for Clinical Gait Analysis: Attention-Based Sensor Optimization and Automated Dataset Auditing

When Modalities Conflict: How Unimodal Reasoning Uncertainty Governs Preference Dynamics in MLLMs

When One Modality Sabotages the Others: A Diagnostic Lens on Multimodal Reasoning

To See or To Read: User Behavior Reasoning in Multimodal LLMs

Detecting Silent Failures in Multi-Agentic AI Trajectories

On Joint Regularization and Calibration in Deep Ensembles

Towards Aligning Multimodal LLMs with Human Experts: A Focus on Parent-Child Interaction
