The field of multimodal large language models (MLLMs) is advancing rapidly, with growing attention to education and assessment. Recent work introduces new MLLM architectures and training methods that enable more effective evaluation of student responses and automated grading. MLLMs deployed in educational settings have shown promise in enhancing student engagement and understanding, while advances in explainable AI and transparency have improved the reliability and trustworthiness of AI-driven assessment systems.
Noteworthy papers in this area include VideoJudge, which introduces MLLM judges at 3B and 7B scales for evaluating video understanding models, outperforming larger MLLM judge baselines; EduVidQA, which explores using MLLMs to automatically answer student questions about online lectures, introducing a novel question-answering task of real-world significance; and ProfVLM, which presents a compact vision-language model for multi-view proficiency estimation, achieving superior accuracy with up to 20x fewer parameters.