Multimodal Advances in AI and Neuroscience

The field of AI and neuroscience is moving toward a more integrated understanding of multimodal processing and its applications. Recent work shows that combining modalities such as vision, audio, and text improves performance on tasks including speech recognition, brain encoding, and audio classification, with large language models and transformers proving particularly effective at reaching state-of-the-art results. Noteworthy papers include TRIBE, which introduced a deep neural network that predicts brain responses to stimuli across multiple modalities and took first place in the Algonauts 2025 brain encoding competition, and MIRAGE, which presented a framework that achieves systematic generalization on compositional tasks by explicitly equipping its Transformer component with actively managed schematic structures.
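To make the trimodal brain-encoding idea concrete, the sketch below shows one common way such a model can be wired up: pretrained video, audio, and text features are projected to a shared width, concatenated per timestep, passed through a transformer over time, and regressed onto fMRI responses. This is an illustrative assumption, not the actual TRIBE architecture; all module names, dimensions, and the concatenation-based fusion strategy are hypothetical.

```python
# Minimal sketch of a trimodal fusion encoder for fMRI response prediction.
# NOT the TRIBE implementation: dimensions, fusion strategy, and layer
# choices are illustrative assumptions only.
import torch
import torch.nn as nn


class TrimodalBrainEncoder(nn.Module):
    def __init__(self, d_video=768, d_audio=512, d_text=768,
                 d_model=256, n_voxels=1000, n_layers=4, n_heads=8):
        super().__init__()
        # Project each modality's pretrained features to a shared width.
        self.proj_video = nn.Linear(d_video, d_model)
        self.proj_audio = nn.Linear(d_audio, d_model)
        self.proj_text = nn.Linear(d_text, d_model)
        # Fuse by concatenating the three projections at each timestep.
        self.fuse = nn.Linear(3 * d_model, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Linear regression head mapping fused features to voxel responses.
        self.head = nn.Linear(d_model, n_voxels)

    def forward(self, video, audio, text):
        # Each input: (batch, time, feature_dim), assumed pre-aligned to
        # the fMRI sampling rate (e.g. one timestep per TR).
        x = torch.cat([self.proj_video(video),
                       self.proj_audio(audio),
                       self.proj_text(text)], dim=-1)
        x = self.fuse(x)
        x = self.temporal(x)   # mix information across time
        return self.head(x)    # (batch, time, n_voxels)


if __name__ == "__main__":
    model = TrimodalBrainEncoder()
    v = torch.randn(2, 20, 768)   # video features
    a = torch.randn(2, 20, 512)   # audio features
    t = torch.randn(2, 20, 768)   # text features
    print(model(v, a, t).shape)   # torch.Size([2, 20, 1000])
```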

Sources

A Neuroscience-Inspired Dual-Process Model of Compositional Generalization

MLLM-based Speech Recognition: When and How is Multimodality Beneficial?

Probing Multimodal Fusion in the Brain: The Dominance of Audiovisual Streams in Naturalistic Encoding

The Eloquence team submission for task 1 of MLC-SLM challenge

Data Augmentation for Spoken Grammatical Error Correction

Predicting Brain Responses To Natural Movies With Multimodal LLMs

Improving Audio Classification by Transitioning from Zero- to Few-Shot

Sound Source Localization for Human-Robot Interaction in Outdoor Environments

Whilter: A Whisper-based Data Filter for "In-the-Wild" Speech Corpora Using Utterance-level Multi-Task Classification

TRIBE: TRImodal Brain Encoder for whole-brain fMRI response prediction

Evaluating and Improving the Robustness of Speech Command Recognition Models to Noise and Distribution Shifts
