Efficient Large Language Models

The field of large language models is shifting toward efficiency and scalability. Researchers are exploring methods to compress and accelerate these models, reducing their computational requirements and energy consumption without sacrificing accuracy. One notable direction is the development of compression techniques that efficiently explore the compression solution space, supporting both single- and multi-objective evolutionary search. Another area of focus is incorporating energy awareness into model evaluation, enabling users to make informed model-selection decisions based on energy efficiency.

Noteworthy papers include ODIA, which presents a novel approach to accelerating function calling in LLMs, reducing response latency by 45% while maintaining accuracy; GeLaCo, which introduces an evolutionary approach to layer compression that outperforms state-of-the-art alternatives in perplexity-based and generative evaluations; and the Generative Energy Arena, which incorporates energy awareness into human evaluations of LLMs and shows that users favor smaller, more energy-efficient models when they are aware of energy consumption.
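
To make the compression-search idea concrete, the sketch below shows a minimal single-objective evolutionary search over a binary layer-keep mask. This is not the GeLaCo implementation; the layer count, population settings, and the surrogate fitness function are assumptions for illustration, and a real search would score candidates by evaluating perplexity (or a multi-objective combination with energy or latency) on the actual compressed model.

```python
import random

# Hypothetical sketch of an evolutionary layer-compression search.
# A candidate is a binary mask over transformer layers: 1 = keep, 0 = drop.
# The fitness function is a toy surrogate; a real system would measure
# perplexity (and optionally energy or latency) of the pruned model.

NUM_LAYERS = 32           # assumed model depth
POP_SIZE = 20
GENERATIONS = 30
TARGET_KEEP_RATIO = 0.75  # assumed compression target: keep ~75% of layers


def fitness(mask):
    """Toy surrogate: reward hitting the target size and keeping boundary layers."""
    keep_ratio = sum(mask) / len(mask)
    size_penalty = abs(keep_ratio - TARGET_KEEP_RATIO)
    # crude proxy: assume the first and last layers matter most
    boundary_bonus = 0.1 * (mask[0] + mask[-1])
    return -size_penalty + boundary_bonus


def mutate(mask, rate=0.05):
    # flip each bit with a small probability
    return [bit ^ 1 if random.random() < rate else bit for bit in mask]


def crossover(a, b):
    # single-point crossover between two parent masks
    point = random.randrange(1, len(a))
    return a[:point] + b[point:]


def search():
    population = [[random.randint(0, 1) for _ in range(NUM_LAYERS)]
                  for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        # keep the fitter half, refill with mutated offspring
        population.sort(key=fitness, reverse=True)
        parents = population[:POP_SIZE // 2]
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(POP_SIZE - len(parents))]
        population = parents + children
    return max(population, key=fitness)


if __name__ == "__main__":
    best = search()
    print("best layer mask:", best, "kept", sum(best), "of", NUM_LAYERS, "layers")
```

A multi-objective variant would replace the scalar fitness with, for example, a Pareto ranking over perplexity and energy consumption rather than a single weighted score.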

Sources

Accuracy and Consumption analysis from a compressed model by CompactifAI from Multiverse Computing

ODIA: Oriented Distillation for Inline Acceleration of LLM-based Function Calling

GeLaCo: An Evolutionary Approach to Layer Compression

The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations
