Multimodal Learning and Conversational AI

The field of multimodal learning and conversational AI is moving toward more comprehensive and reliable systems, with a focus on strengthening learner engagement and trust. Recent work emphasizes grounding conversational AI in reliable, verifiable sources, and highlights the potential of multimodal data to improve the diagnosis of collaborative problem-solving skills. Two papers stand out: "Towards a Multimodal Document-grounded Conversational AI System for Education" presents a conversational AI system that grounds its responses in both the text and the visuals of source documents, while "Zero-Shot, But at What Cost?" reveals the hidden computational overhead of MILS, a recently published LLM-CLIP framework for zero-shot image captioning, underscoring the need for more efficient multimodal models. Minimal illustrative sketches of both ideas follow.
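The document-grounding idea can be illustrated with a small retrieval sketch. This is a minimal, hypothetical example assuming a generic CLIP-based retriever over a document's text chunks and extracted figures; the sample document, query, and prompt format are placeholders, not the paper's actual pipeline.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Toy document: text chunks plus extracted figures (placeholder images here).
chunks = ["Photosynthesis converts light into chemical energy.",
          "The Calvin cycle fixes CO2 into sugars."]
figures = [Image.new("RGB", (224, 224)), Image.new("RGB", (224, 224))]

def embed_texts(texts):
    inputs = processor(text=texts, return_tensors="pt", padding=True)
    with torch.no_grad():
        return model.get_text_features(**inputs)

def embed_images(images):
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        return model.get_image_features(**inputs)

query = "How do plants store energy from sunlight?"
q = torch.nn.functional.normalize(embed_texts([query]), dim=-1)
t = torch.nn.functional.normalize(embed_texts(chunks), dim=-1)
v = torch.nn.functional.normalize(embed_images(figures), dim=-1)

# Retrieve the most relevant text chunk and figure via cosine similarity.
best_chunk = chunks[int((q @ t.T).argmax())]
best_figure = int((q @ v.T).argmax())

# The grounded prompt (chunk + figure reference) would then go to a
# multimodal LLM; that call is omitted here.
prompt = f"Answer using this source: {best_chunk} [figure {best_figure}]\nQ: {query}"
print(prompt)
```

Because each answer is assembled from specific retrieved passages and figures, the response stays traceable to its source material, which is what supports the verifiability the digest highlights.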
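The cost critique is easiest to see in code. The sketch below assumes a generic MILS-style generate-and-score loop (the actual framework's prompts, models, and hyperparameters differ), and llm_propose is a hypothetical stand-in for a full LLM generation call.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

def llm_propose(seed_captions, n_candidates):
    """Hypothetical stand-in for an LLM call that rewrites the current
    best captions into n_candidates new ones. In a MILS-style loop this
    is a full LLM generation per iteration."""
    return [f"{c} (variant {i})"
            for i, c in enumerate(seed_captions * n_candidates)][:n_candidates]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))  # placeholder; use a real image in practice
captions = ["a photo"]
iterations, n_candidates, clip_scorings = 5, 32, 0

for _ in range(iterations):
    candidates = llm_propose(captions, n_candidates)
    inputs = processor(text=candidates, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(0)  # image-text similarity
    clip_scorings += len(candidates)
    # Keep the best-scoring candidates as seeds for the next iteration.
    top = torch.topk(scores, k=3).indices
    captions = [candidates[i] for i in top]

print(captions[0])
print(f"{iterations} LLM calls and {clip_scorings} CLIP scorings for one image")
```

Even this toy loop performs 5 LLM generations and 160 CLIP image-text scorings to caption a single image; the per-image cost scales as iterations times candidates, which is the kind of hidden overhead the paper quantifies.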

Sources

Towards a Multimodal Document-grounded Conversational AI System for Education

Closing the Evaluation Gap: Developing a Behavior-Oriented Framework for Assessing Virtual Teamwork Competency

Rethinking the Potential of Multimodality in Collaborative Problem Solving Diagnosis with Large Language Models

Zero-Shot, But at What Cost? Unveiling the Hidden Overhead of MILS's LLM-CLIP Framework for Image Captioning

Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark

Multifaceted Evaluation of Audio-Visual Capability for MLLMs: Effectiveness, Efficiency, Generalizability and Robustness

Visual and Textual Prompts for Enhancing Emotion Recognition in Video