Multimodal Learning and Conversational AI

The field of multimodal learning and conversational AI is moving toward more comprehensive and reliable systems, with a focus on enhancing learner engagement and trust. Recent work emphasizes grounding conversational AI in reliable, verifiable sources, and explores how multimodal data can improve model performance in diagnosing collaborative problem-solving skills. Noteworthy papers include "Towards a Multimodal Document-grounded Conversational AI System for Education", which leverages both the text and the visuals of source documents to generate responses, and "Zero-Shot, But at What Cost?", which quantifies the hidden computational overhead of a recently published zero-shot image-captioning framework, highlighting the need for more efficient multimodal models.
Sources
Closing the Evaluation Gap: Developing a Behavior-Oriented Framework for Assessing Virtual Teamwork Competency
Rethinking the Potential of Multimodality in Collaborative Problem Solving Diagnosis with Large Language Models
Zero-Shot, But at What Cost? Unveiling the Hidden Overhead of MILS's LLM-CLIP Framework for Image Captioning