Efficient Multimodal Processing in Large Language Models

The field of large language models is moving toward more efficient multimodal processing, with a focus on reducing computational cost and improving performance on long-context tasks. Recent work has introduced frameworks and techniques for compressing and selecting tokens, such as adaptive token compression, shot-aware token compression, and hierarchical token prepending. These methods deliver notable gains in efficiency and accuracy, enabling large language models to handle longer inputs and more complex tasks. Notable papers include Virtual Width Networks, a framework that decouples representational width from backbone width, and TimeAudio, which incorporates unique temporal markers to improve time-sensitive reasoning. In addition, OmniSparse and CORE introduce training-aware fine-grained sparse attention and compact object-centric representations, respectively, to further improve efficiency and performance.
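
To make the general keep-then-merge pattern behind these token compression methods concrete, below is a minimal, hedged sketch in Python. It is not the algorithm from any specific paper listed under Sources: the function name, the keep_ratio parameter, and the importance-score-based merging rule are assumptions chosen purely for illustration of how multimodal tokens might be pruned and summarized before reaching the language model.

```python
# Illustrative sketch of score-based token compression for a multimodal LLM.
# NOT the method of any paper cited below; names and parameters are assumed.
import torch

def compress_tokens(tokens: torch.Tensor, scores: torch.Tensor, keep_ratio: float = 0.25):
    """Keep the highest-scoring modality tokens and merge the rest.

    tokens: (N, D) token embeddings from a vision/audio encoder.
    scores: (N,) per-token importance (e.g., attention received from text).
    Returns (K+1, D): K kept tokens plus one summary token for the remainder.
    """
    n = tokens.size(0)
    k = max(1, int(n * keep_ratio))
    keep_idx = scores.topk(k).indices          # most important tokens survive
    drop_mask = torch.ones(n, dtype=torch.bool)
    drop_mask[keep_idx] = False

    kept = tokens[keep_idx]
    dropped = tokens[drop_mask]
    if dropped.numel() == 0:
        return kept
    # Merge dropped tokens into one score-weighted summary token, so their
    # information is compressed rather than discarded outright.
    w = scores[drop_mask].softmax(dim=0).unsqueeze(-1)
    summary = (w * dropped).sum(dim=0, keepdim=True)
    return torch.cat([kept, summary], dim=0)

# Example: compress 1024 visual tokens to ~257 before the LLM consumes them.
vis = torch.randn(1024, 768)
imp = torch.rand(1024)
print(compress_tokens(vis, imp).shape)  # torch.Size([257, 768])
```

Variants of this idea differ mainly in where the importance scores come from (query-aware attention, audio guidance, object-centric grouping) and in whether dropped tokens are merged, cached, or discarded.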

Sources

Virtual Width Networks

AccKV: Towards Efficient Audio-Video LLMs Inference via Adaptive-Focusing and Cross-Calibration KV Cache Optimization

Speech-Aware Long Context Pruning and Integration for Contextualized Automatic Speech Recognition

Discovering Meaningful Units with Visually Grounded Semantics from Image Captions

TimeAudio: Bridging Temporal Gaps in Large Audio-Language Models

Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models

OmniSparse: Training-Aware Fine-Grained Sparse Attention for Long-Video MLLMs

TabFlash: Efficient Table Understanding with Progressive Question Conditioning and Token Focusing

CORE: Compact Object-centric REpresentations as a New Paradigm for Token Merging in LVLMs

SMART: Shot-Aware Multimodal Video Moment Retrieval with Audio-Enhanced MLLM

AdaTok: Adaptive Token Compression with Object-Aware Representations for Efficient Multimodal LLMs

EBind: a practical approach to space binding

Segmentwise Pruning in Audio-Language Models

Audio Question Answering with GRPO-Based Fine-Tuning and Calibrated Segment-Level Predictions

Jasper-Token-Compression-600M Technical Report

OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models

Hierarchical Token Prepending: Enhancing Information Flow in Decoder-based LLM Embeddings

Context Cascade Compression: Exploring the Upper Limits of Text Compression

TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
