Efficient Deployment of Large Language Models

The field of large language models (LLMs) is moving towards more efficient deployment strategies that balance performance and cost. Recent research has explored routing techniques, load balancing, and optimization methods to improve the scalability and affordability of LLMs. In particular, new approaches address the challenge of selecting the optimal LLM for a given task, reducing inference costs while preserving response quality. Noteworthy papers in this area include: SkyLB, a locality-aware cross-region load balancer that achieves higher throughput and lower latency than existing load balancers; RadialRouter, a novel framework for LLM routing that significantly outperforms existing routing methods; and Cascadia, a cascade serving framework that co-optimizes system deployment and routing strategy for fast, quality-preserving LLM serving.
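The routing idea described above, picking the cheapest model that is still expected to answer well, can be sketched in a few lines. This is a generic illustration, not the method of any paper listed here: the model names, prices, and static quality scores are invented, and real routers (e.g. RadialRouter) learn per-query quality predictions rather than using a fixed per-model score.

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    cost_per_1k_tokens: float  # hypothetical pricing
    expected_quality: float    # hypothetical score in [0, 1] from an offline benchmark

def route(prompt: str, models: list[Model], min_quality: float) -> Model:
    """Return the cheapest model whose expected quality clears the bar.

    If no model clears the bar, fall back to the highest-quality one.
    """
    eligible = [m for m in models if m.expected_quality >= min_quality]
    if eligible:
        return min(eligible, key=lambda m: m.cost_per_1k_tokens)
    return max(models, key=lambda m: m.expected_quality)

models = [
    Model("small-llm", cost_per_1k_tokens=0.1, expected_quality=0.72),
    Model("large-llm", cost_per_1k_tokens=1.0, expected_quality=0.93),
]
print(route("summarize this paragraph", models, min_quality=0.7).name)  # small-llm
print(route("prove this theorem", models, min_quality=0.9).name)        # large-llm
```

A production router would replace the static `expected_quality` field with a learned predictor conditioned on the prompt, which is where the routing frameworks above differ.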

Sources

SkewRoute: Training-Free LLM Routing for Knowledge Graph Retrieval-Augmented Generation via Score Skewness of Retrieved Context

SkyLB: A Locality-Aware Cross-Region Load Balancer for LLM Inference

Optimizing the Interface Between Knowledge Graphs and LLMs for Complex Reasoning

A Learned Cost Model-based Cross-engine Optimizer for SQL Workloads

RadialRouter: Structured Representation for Efficient and Robust Large Language Models Routing

GORACS: Group-level Optimal Transport-guided Coreset Selection for LLM-based Recommender Systems

Cascadia: A Cascade Serving System for Large Language Models
