Advancements in Efficient Large Language Models

The field of large language models (LLMs) is evolving rapidly, with a strong focus on improving efficiency and reducing computational demands. Much of the recent work centers on quantization techniques, which reduce the precision of model weights and activations while preserving accuracy, making it feasible to deploy LLMs on resource-constrained devices.

A second thread is the development of novel hardware architectures, such as photonic chips and near-memory processing, which can accelerate LLM inference and training. In parallel, software-hardware co-design has emerged as a key strategy for optimizing LLM performance and efficiency.

Noteworthy papers include 'What Is Next for LLMs? Next-Generation AI Computing Hardware Using Photonic Chips', which explores photonic hardware for accelerating LLMs, and 'LightNobel: Improving Sequence Length Limitation in Protein Structure Prediction Model via Adaptive Activation Quantization', which presents a hardware-software co-designed accelerator that eases sequence-length limits in protein structure prediction through adaptive activation quantization. On the quantization side, 'Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-aware Cache Compression' relieves memory bandwidth and capacity pressure through entropy-aware cache compression, while 'GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance' steers quantization using guidance from the end loss. Overall, the field is moving toward more efficient and scalable LLMs, enabled by advances in quantization, hardware, and software-hardware co-design.
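
To make the core idea behind weight quantization concrete, the sketch below performs symmetric per-tensor int8 quantization of a weight matrix. It is a minimal illustration, not a method from any of the papers listed below; the function names and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization: map float weights to [-127, 127]."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 values and the stored scale."""
    return q.astype(np.float32) * scale

# Example: quantize a random weight matrix and measure the reconstruction error.
w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
print("max abs error:", np.max(np.abs(w - w_hat)))
```

Practical schemes such as the block-wise and guided methods listed below refine this basic recipe with finer-grained scales, calibration data, or loss-aware objectives to retain accuracy at lower bit widths.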

Sources

Low-bit Model Quantization for Deep Neural Networks: A Survey

Design of a molecular Field Effect Transistor (mFET)

What Is Next for LLMs? Next-Generation AI Computing Hardware Using Photonic Chips

Sparse Attention Remapping with Clustering for Efficient LLM Decoding on PIM

LightNobel: Improving Sequence Length Limitation in Protein Structure Prediction Model via Adaptive Activation Quantization

Improving Block-Wise LLM Quantization by 4-bit Block-Wise Optimal Float (BOF4): Analysis and Variations

Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-aware Cache Compression

GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance

Semantic Retention and Extreme Compression in LLMs: Can We Have Both?

QuantX: A Framework for Hardware-Aware Quantization of Generative AI Workloads

NMP-PaK: Near-Memory Processing Acceleration of Scalable De Novo Genome Assembly

Resource-Efficient Language Models: Quantization for Fast and Accessible Inference

An Extra RMSNorm is All You Need for Fine Tuning to 1.58 Bits

ITERA-LLM: Boosting Sub-8-Bit Large Language Model Inference via Iterative Tensor Decomposition

Zero-shot Quantization: A Comprehensive Survey

Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures

Analog Foundation Models

VQ-Logits: Compressing the Output Bottleneck of Large Language Models via Vector Quantized Logits
