The field of large language models (LLMs) is moving toward more efficient models, with a focus on reducing token usage and improving reasoning efficiency. The driver is the need to balance accuracy against efficiency in practical applications, where longer chain-of-thought traces and higher token counts translate directly into higher inference latency and memory consumption. Researchers are exploring novel compression methods, such as abstractive token-level compression and entropy-guided training frameworks, that condense reasoning paths while preserving performance. There is also growing interest in defining and optimizing LLM agent efficiency, at both the step level and the trajectory level, to improve interaction efficiency in real-world scenarios.

Notable papers in this area:

- Cmprsr presents a novel prompt compression paradigm and achieves significant improvements in compression ability and downstream task performance (the general pattern is sketched below).
- TokenSqueeze proposes a Long2Short method that condenses reasoning paths while preserving performance and relies exclusively on self-generated data (see the data-selection sketch below).
- Entropy-Guided Reasoning Compression addresses the entropy conflict in compression training and achieves strong compression ratios while maintaining or surpassing baseline accuracy (an entropy-regularized loss is sketched below).
- DEPO introduces a dual-efficiency preference optimization method that jointly rewards succinct responses and fewer action steps, yielding substantial reductions in both token usage and step count (see the preference-scoring sketch below).
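As a rough illustration of the prompt-compression pattern that Cmprsr-style methods target, the sketch below uses a small compressor model to rewrite a long context down to a token budget before the main model answers. The `compressor` and `llm` callables and the instruction wording are placeholders for illustration, not Cmprsr's actual interface or training recipe.

```python
# Schematic of abstractive prompt compression: a small compressor model
# rewrites a long context to a target token budget, and only the compressed
# context is sent to the (more expensive) main model.
# `compressor` and `llm` are placeholder callables (str -> str); they are
# assumptions for this sketch, not part of any published API.
def answer_with_compression(context: str, question: str,
                            compressor, llm, budget: int = 256) -> str:
    compressed = compressor(
        f"Rewrite the following context in at most {budget} tokens, "
        f"keeping only the facts needed to answer questions about it:\n"
        f"{context}"
    )
    return llm(f"Context:\n{compressed}\n\nQuestion: {question}")
```

The design point is that compression cost is paid once by a cheap model, while the savings accrue on every downstream call to the large model.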
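A minimal sketch of the self-generated-data idea behind Long2Short training: sample several chain-of-thought traces from the model itself, discard the incorrect ones, and fine-tune on the shortest correct trace. The `Trace` structure and the selection rule are illustrative assumptions, not TokenSqueeze's published pipeline.

```python
# Long2Short data selection (illustrative): keep the shortest trace that
# still reaches the gold answer, so fine-tuning pulls the model toward
# concise but correct reasoning using only its own outputs.
from dataclasses import dataclass

@dataclass
class Trace:
    text: str       # full chain-of-thought plus final answer
    answer: str     # extracted final answer
    n_tokens: int   # token count of the trace

def select_short_correct(traces: list[Trace], gold: str) -> Trace | None:
    """Return the shortest self-generated trace whose answer matches gold."""
    correct = [t for t in traces if t.answer == gold]
    return min(correct, key=lambda t: t.n_tokens) if correct else None

# Usage: for each training question, sample k traces from the current model,
# then add (question, selected.text) to the fine-tuning set when a trace
# is returned; questions with no correct trace are skipped.
```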
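One plausible way to couple a compression objective with an entropy term, offered only as a stand-in for an entropy-guided training framework: cross-entropy on the shortened trace plus an entropy bonus that keeps the output distribution from collapsing as traces get shorter. The coefficient `beta` and the sign of the entropy term are assumptions; the actual Entropy-Guided Reasoning Compression objective may differ.

```python
# Entropy-regularized compression loss (a minimal PyTorch sketch, assuming
# an entropy *bonus* counteracts the confidence collapse that length
# reduction tends to induce; this is not the paper's exact formulation).
import torch
import torch.nn.functional as F

def compression_loss(logits: torch.Tensor, targets: torch.Tensor,
                     beta: float = 0.01) -> torch.Tensor:
    """logits: (batch, seq, vocab) outputs on the compressed trace;
    targets: (batch, seq) token ids of the compressed trace."""
    ce = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    probs = logits.softmax(dim=-1)
    # Mean per-token entropy of the predictive distribution.
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
    return ce - beta * entropy  # subtracting rewards higher entropy
```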
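To make the dual-efficiency idea concrete, the sketch below scores agent trajectories by correctness, action-step count, and token count, and builds (chosen, rejected) pairs for a DPO-style optimizer. The weights and the scoring rule are illustrative assumptions, not DEPO's actual preference construction or training objective.

```python
# Dual-efficiency preference construction (illustrative): among correct
# trajectories, prefer the one with fewer steps and fewer tokens; never
# prefer an incorrect trajectory. Weights w_step and w_token are assumed.
def efficiency_score(correct: bool, n_steps: int, n_tokens: int,
                     w_step: float = 1.0, w_token: float = 0.001) -> float:
    if not correct:
        return float("-inf")
    return -(w_step * n_steps + w_token * n_tokens)

def build_preference_pair(traj_a: tuple[bool, int, int],
                          traj_b: tuple[bool, int, int]):
    """Return (chosen, rejected) for a DPO-style preference dataset.

    Each trajectory is (correct, n_steps, n_tokens)."""
    sa, sb = efficiency_score(*traj_a), efficiency_score(*traj_b)
    return (traj_a, traj_b) if sa >= sb else (traj_b, traj_a)
```

Scoring at both the response level (tokens) and the trajectory level (steps) is what distinguishes this setup from length-only preference methods.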