Efficient Model Compression and Multimodal Learning

Research in natural language processing and multimodal learning is converging on efficient model compression and scalable architectures. Recent work focuses on reducing the computational cost and memory footprint of large language models while preserving performance, using techniques such as layer concatenation, token pruning, and knowledge distillation. In parallel, there is growing interest in multimodal learning, where models process and integrate multiple input modalities such as text and images. Noteworthy papers in this area include Layer as Puzzle Pieces, which proposes a progressive layer-pruning framework based on layer concatenation, and ParaFormer, which introduces a shallow parallel Transformer architecture with progressive approximation. Other notable works include FrugalPrompt, which reduces contextual overhead in large language models via token attribution, and VisionSelector, which learns end-to-end visual token compression for efficient multimodal LLMs. A minimal sketch of the token-pruning idea follows.
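The sketch below illustrates the general idea behind prompt-aware visual token pruning, in the spirit of methods such as ZSPAPrune and VisionSelector: score each visual token by its relevance to the text prompt and keep only the top fraction before the LLM sees them. The function name `prune_visual_tokens`, the scoring rule, and all shapes are illustrative assumptions, not the procedure of any specific paper listed here.

```python
# Hedged sketch: prompt-aware visual token pruning (not any paper's exact method).
import torch

def prune_visual_tokens(visual_tokens: torch.Tensor,
                        prompt_tokens: torch.Tensor,
                        keep_ratio: float = 0.25) -> torch.Tensor:
    """Keep the visual tokens most relevant to the text prompt.

    visual_tokens: (num_visual, dim) image patch embeddings
    prompt_tokens: (num_text, dim) text prompt embeddings
    keep_ratio: fraction of visual tokens to retain
    """
    # Score each visual token by its maximum cosine similarity to any prompt token.
    v = torch.nn.functional.normalize(visual_tokens, dim=-1)
    t = torch.nn.functional.normalize(prompt_tokens, dim=-1)
    relevance = (v @ t.T).max(dim=-1).values          # (num_visual,)

    # Retain the top-k most prompt-relevant tokens, preserving their original order.
    k = max(1, int(keep_ratio * visual_tokens.size(0)))
    keep_idx = relevance.topk(k).indices.sort().values
    return visual_tokens[keep_idx]

# Example: prune 576 patch tokens down to 25% before feeding them to the LLM.
if __name__ == "__main__":
    vis = torch.randn(576, 1024)   # hypothetical ViT patch embeddings
    txt = torch.randn(32, 1024)    # hypothetical prompt embeddings
    pruned = prune_visual_tokens(vis, txt, keep_ratio=0.25)
    print(pruned.shape)            # torch.Size([144, 1024])
```

Actual methods typically derive the relevance scores from the model's own attention maps and may merge rather than drop tokens; this sketch only conveys the compute-versus-coverage trade-off that these papers target.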

Sources

Layer as Puzzle Pieces: Compressing Large Language Models through Layer Concatenation

Dimension Mask Layer: Optimizing Embedding Efficiency for Scalable ID-based Models

ParaFormer: Shallow Parallel Transformers with Progressive Approximation

FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution

Uncertain Knowledge Graph Completion via Semi-Supervised Confidence Distribution Learning

VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs

Improving Model Representation and Reducing KV Cache via Skip Connections with First Value Heads

ELMM: Efficient Lightweight Multimodal Large Language Models for Multimodal Knowledge Graph Completion

ZSPAPrune: Zero-Shot Prompt-Aware Token Pruning for Vision-Language Models

VisiPruner: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs

Elastic ViTs from Pretrained Models without Retraining

Glyph: Scaling Context Windows via Visual-Text Compression

SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference

CoRECT: A Framework for Evaluating Embedding Compression Techniques at Scale

HyperET: Efficient Training in Hyperbolic Space for Multi-modal Large Language Models

ARC-Encoder: Learning Compressed Text Representations for Large Language Models

Simple Context Compression: Mean-Pooling and Multi-Ratio Training
