Efficient Computing in AI Applications

Research on efficient computing for artificial intelligence is advancing rapidly, particularly around large language models (LLMs) and low-precision computation. Recent work pursues three broad goals: making better use of limited compute, cutting memory consumption, and improving inference performance.

One notable direction is collaborative edge computing: Jupiter introduces a flexible pipelined architecture for fast, resource-efficient inference of generative LLMs across edge devices. Another is memory-efficient algorithm and system design: ActiveFlow adapts DRAM usage for on-device LLMs by actively swapping weights between DRAM and flash, while MOM reduces peak memory for long-context inference by partitioning critical layers into smaller mini-sequences and integrating seamlessly with KV cache offloading. Further contributions span pseudorandom generators, distributed retrieval-augmented generation, and a virtual machine for arbitrary low-precision GPGPU computation in LLM serving. Together, these developments promise substantially more efficient AI inference, broadening the range of devices and platforms on which such models can be deployed.
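To make the mini-sequence idea concrete, the sketch below shows chunked prefill for a single attention layer in PyTorch: the prompt is processed in small chunks so peak activation memory stays bounded, and the growing KV cache is parked in host memory between chunks. This is an illustrative sketch under simplifying assumptions, not the MOM implementation; the `TinyAttention` module, the `chunk_size` parameter, and the offload logic are all hypothetical placeholders.

```python
import torch
import torch.nn.functional as F


class TinyAttention(torch.nn.Module):
    """Single-head causal self-attention, kept minimal for illustration."""

    def __init__(self, dim: int):
        super().__init__()
        self.qkv = torch.nn.Linear(dim, 3 * dim, bias=False)
        self.out = torch.nn.Linear(dim, dim, bias=False)

    def forward_chunk(self, x_chunk, k_cache, v_cache):
        # x_chunk: (batch, chunk_len, dim); caches hold keys/values of earlier tokens.
        q, k, v = self.qkv(x_chunk).chunk(3, dim=-1)
        k_all = torch.cat([k_cache, k], dim=1) if k_cache is not None else k
        v_all = torch.cat([v_cache, v], dim=1) if v_cache is not None else v
        q_len, k_len = q.size(1), k_all.size(1)
        # Causal mask that accounts for the cached prefix: query i (global position
        # k_len - q_len + i) may attend to keys 0 .. k_len - q_len + i.
        mask = torch.ones(q_len, k_len, device=q.device).tril(
            diagonal=k_len - q_len).bool()
        attn = F.scaled_dot_product_attention(q, k_all, v_all, attn_mask=mask)
        return self.out(attn), k, v


def chunked_prefill(layer, hidden, chunk_size=256, offload_device="cpu"):
    """Prefill a long sequence in mini-sequences, offloading the KV cache between chunks."""
    k_off = v_off = None                      # KV cache resident in host memory
    outputs = []
    for start in range(0, hidden.size(1), chunk_size):
        chunk = hidden[:, start:start + chunk_size]
        # Bring the cache onto the compute device only while it is needed.
        k_dev = k_off.to(chunk.device) if k_off is not None else None
        v_dev = v_off.to(chunk.device) if v_off is not None else None
        out, k_new, v_new = layer.forward_chunk(chunk, k_dev, v_dev)
        outputs.append(out)
        # Extend the cache with the new keys/values, then push it back to host memory.
        k_full = torch.cat([k_dev, k_new], dim=1) if k_dev is not None else k_new
        v_full = torch.cat([v_dev, v_new], dim=1) if v_dev is not None else v_new
        k_off, v_off = k_full.to(offload_device), v_full.to(offload_device)
    return torch.cat(outputs, dim=1), (k_off, v_off)


# Example: an 8k-token prefill processed in 256-token mini-sequences.
layer = TinyAttention(dim=64)
hidden = torch.randn(1, 8192, 64)
out, (k_cache, v_cache) = chunked_prefill(layer, hidden, chunk_size=256)
print(out.shape, k_cache.shape)   # torch.Size([1, 8192, 64]) torch.Size([1, 8192, 64])
```

Because only one chunk's activations live on the accelerator at a time, peak device memory scales with the chunk length rather than the full context length, at the cost of extra host-device transfers for the cache.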

Sources

Ozaki Scheme II: A GEMM-oriented emulation of floating-point matrix multiplication using an integer modular technique

Jupiter: Fast and Resource-Efficient Collaborative Inference of Generative LLMs on Edge Devices

All-in-Memory Stochastic Computing using ReRAM

Scaling Up On-Device LLMs via Active-Weight Swapping Between DRAM and Flash

Frozen Layers: Memory-efficient Many-fidelity Hyperparameter Optimization

A Pseudorandom Generator for Functions of Low-Degree Polynomial Threshold Functions

Efficient Distributed Retrieval-Augmented Generation for Enhancing Language Model Performance

Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints

You Don't Need All Attentions: Distributed Dynamic Fine-Tuning for Foundation Models

MOM: Memory-Efficient Offloaded Mini-Sequence Inference for Long Context Language Models

A Virtual Machine for Arbitrary Low-Precision GPGPU Computation in LLM Serving
