Efficient Models for Reasoning and Multimodal Tasks

The field of large language models and multimodal learning is moving towards more efficient models that can perform complex reasoning and handle multimodal inputs without incurring substantial computational cost. Recent research focuses on pruning and compression techniques that reduce model size and computational requirements while preserving performance, including dynamic pruning, knowledge distillation, and information-theoretic compression. Several papers propose novel approaches to pruning and compressing large language models and multimodal models, such as selective self-generated calibration, interleaved layer pruning, and hierarchical communication graph pruning. These advances could enable the deployment of large language and multimodal models in resource-constrained or latency-sensitive settings.

Noteworthy papers include:

Efficient Mathematical Reasoning Models via Dynamic Pruning and Knowledge Distillation proposes a lightweight optimization method that integrates dynamic attention-head pruning with knowledge distillation.

FastMMoE: Accelerating Multimodal Large Language Models through Dynamic Expert Activation and Routing-Aware Token Pruning achieves substantial efficiency gains while preserving strong reasoning performance.

Think Before You Prune: Selective Self-Generated Calibration for Pruning Large Reasoning Models improves pruning performance by calibrating on self-generated reasoning data.

Towards Efficient VLMs: Information-Theoretic Driven Compression via Adaptive Structural Pruning proposes an information-theoretic framework for adaptive structural compression of VLMs.

INTERLACE: Interleaved Layer Pruning and Efficient Adaptation in Large Vision-Language Models enables rapid convergence with minimal data after pruning.

M$^3$Prune: Hierarchical Communication Graph Pruning for Efficient Multi-Modal Multi-Agent Retrieval-Augmented Generation eliminates redundant communication edges across modalities.

MLPMoE: Zero-Shot Architectural Metamorphosis of Dense LLM MLPs into Static Mixture-of-Experts restructures the dense MLP in transformer blocks into a static, high-cardinality mixture of experts.
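To make the first of these directions more concrete, the sketch below illustrates one generic way attention-head pruning can be combined with a distillation objective. It is a minimal, hypothetical illustration rather than the method of any paper listed above: the importance score (mean activation magnitude per head), the keep ratio, and the temperature and weighting constants are all assumptions chosen for clarity.

```python
# Minimal sketch (assumed, illustrative only): score attention heads by the mean
# magnitude of their output activations, keep the highest-scoring ones, and
# recover accuracy with a KL-based distillation loss from the unpruned teacher.
import torch
import torch.nn.functional as F


def head_importance(attn_output: torch.Tensor, num_heads: int) -> torch.Tensor:
    """attn_output: (batch, seq, hidden). Returns one importance score per head."""
    b, s, h = attn_output.shape
    per_head = attn_output.reshape(b, s, num_heads, h // num_heads)
    return per_head.abs().mean(dim=(0, 1, 3))  # shape: (num_heads,)


def head_mask(scores: torch.Tensor, keep_ratio: float = 0.75) -> torch.Tensor:
    """Keep the top keep_ratio fraction of heads; return a {0,1} mask of shape (num_heads,)."""
    k = max(1, int(scores.numel() * keep_ratio))
    mask = torch.zeros_like(scores)
    mask[scores.topk(k).indices] = 1.0
    return mask


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard KD objective: cross-entropy on labels plus KL to the teacher's softened logits."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1 - alpha) * kd
```

In a typical pipeline of this kind, the mask would be applied inside each attention module before the heads are concatenated, and the pruned student would then be fine-tuned with the distillation loss against the frozen, unpruned teacher.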

Sources

Efficient Mathematical Reasoning Models via Dynamic Pruning and Knowledge Distillation

FastMMoE: Accelerating Multimodal Large Language Models through Dynamic Expert Activation and Routing-Aware Token Pruning

Think Before You Prune: Selective Self-Generated Calibration for Pruning Large Reasoning Models

On the Optimality of Discrete Object Naming: a Kinship Case Study

Towards Efficient VLMs: Information-Theoretic Driven Compression via Adaptive Structural Pruning

INTERLACE: Interleaved Layer Pruning and Efficient Adaptation in Large Vision-Language Models

M$^3$Prune: Hierarchical Communication Graph Pruning for Efficient Multi-Modal Multi-Agent Retrieval-Augmented Generation

MLPMoE: Zero-Shot Architectural Metamorphosis of Dense LLM MLPs into Static Mixture-of-Experts

Common Knowledge, Sailboats, and Publicity
