Advances in Multimodal Learning and Human-Computer Interaction

The field of multimodal learning and human-computer interaction is evolving rapidly, with a focus on more efficient and effective models for integrating visual and linguistic information. Recent work explores large language models and augmented reality to support conversation and communication, along with novel architectures for lipreading and vision-language modeling. Notably, internal feature modulation and latent visual tokens have shown promise for improving both the performance and the efficiency of multimodal models. Researchers have also applied multimodal learning to real-world problems such as airway skill assessment and surgical education. Overall, the field is moving toward more seamless and intuitive human-computer interaction, built on models that can integrate and reason over multiple sources of information.

Noteworthy papers include LaVi, which proposes an efficient vision-language fusion method based on internal feature modulation; Machine Mental Imagery, which introduces a framework for multimodal reasoning via latent visual tokens rather than explicit image generation; and UniCode^2, which presents a cascaded codebook framework for unified multimodal understanding and generation.
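
To make the idea of internal feature modulation more concrete, the sketch below shows one common way such fusion can work: pooled vision features are projected into per-channel scale and shift parameters that modulate a language model's hidden states (FiLM-style conditioning), instead of prepending long sequences of visual tokens to the input. This is a minimal illustrative sketch under those assumptions, not the specific architecture of LaVi; all module and parameter names are hypothetical.

```python
import torch
import torch.nn as nn

class VisionConditionedModulation(nn.Module):
    """Illustrative FiLM-style fusion: vision features modulate LLM hidden
    states via per-channel scale and shift, rather than being prepended as
    extra visual tokens. Names and shapes are hypothetical, not LaVi's."""

    def __init__(self, vision_dim: int, hidden_dim: int):
        super().__init__()
        # Project pooled vision features to a scale (gamma) and shift (beta).
        self.to_gamma = nn.Linear(vision_dim, hidden_dim)
        self.to_beta = nn.Linear(vision_dim, hidden_dim)

    def forward(self, hidden: torch.Tensor, vision: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, hidden_dim) LLM hidden states
        # vision: (batch, vision_dim) pooled image features
        gamma = self.to_gamma(vision).unsqueeze(1)  # (batch, 1, hidden_dim)
        beta = self.to_beta(vision).unsqueeze(1)    # (batch, 1, hidden_dim)
        # Modulate every token's hidden state; 1 + gamma keeps the
        # transformation near identity at initialization.
        return hidden * (1.0 + gamma) + beta

# Usage sketch: apply between transformer blocks with dummy tensors.
mod = VisionConditionedModulation(vision_dim=768, hidden_dim=4096)
hidden = torch.randn(2, 16, 4096)   # dummy LLM hidden states
vision = torch.randn(2, 768)        # dummy pooled vision features
fused = mod(hidden, vision)
print(fused.shape)  # torch.Size([2, 16, 4096])
```

Because the sequence length never grows, this style of fusion avoids the quadratic attention cost of injecting hundreds of visual tokens, which is one plausible source of the efficiency gains the summary describes.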

Sources

ChatAR: Conversation Support using Large Language Model and Augmented Reality

TD3Net: A Temporal Densely Connected Multi-Dilated Convolutional Network for Lipreading

LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation

Language-Informed Synthesis of Rational Agent Models for Grounded Theory-of-Mind Reasoning On-The-Fly

Do We Need Large VLMs for Spotting Soccer Actions?

Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens

Airway Skill Assessment with Spatiotemporal Attention Mechanisms Using Human Gaze

Emergence of Text Readability in Vision Language Models

Integrating AIs With Body Tracking Technology for Human Behaviour Analysis: Challenges and Opportunities

UniCode^2: Cascaded Large-scale Codebooks for Unified Multimodal Understanding and Generation

Critical Anatomy-Preserving & Terrain-Augmenting Navigation (CAPTAiN): Application to Laminectomy Surgical Education

Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models
