The field of Large Language Models (LLMs) is moving towards more efficient and reliable inference serving systems. Recent developments focus on optimizing throughput, latency, and resource utilization, as well as on determinism and verification. Deterministic inference is becoming increasingly critical for LLM applications, since floating-point reductions whose order varies with batching or parallelism can make outputs irreproducible, and researchers are proposing solutions to eliminate training-inference mismatches. There is also a growing need for inference verification methods that can detect errors or tampering without incurring significant additional cost. Noteworthy papers in this area include:
- A performance study comparing two prominent LLM serving frameworks, showing that the right choice depends on the requirements of the specific use case.
- A proposal for Tree-Based Invariant Kernels, which fix the order of floating-point reductions so that results are bit-wise identical regardless of tensor parallel (TP) size; a minimal sketch of such an order-invariant reduction appears after this list.
- The introduction of Token-DiFR and Activation-DiFR, methods for verifying inference outputs by checking generated tokens (and, in the latter case, intermediate activations) against a trusted reference implementation; a sketch of token-level reference scoring follows below.
- A dynamic architecture for maximizing goodput in LLM inference serving, which reallocates instances to maintain an optimal prefill-to-decode ratio based on real-time load monitoring (see the allocation sketch at the end of this section).
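
To make the determinism problem concrete, the sketch below shows the core idea behind an order-invariant reduction in plain NumPy: partial sums are always formed over canonical fixed-width chunks of the reduction axis and combined in a fixed binary tree, so the floating-point addition order, and hence the result, is identical whether the work is done by one worker or eight. The chunk width and the pure-Python tree are illustrative assumptions, not the paper's actual kernels.

```python
import numpy as np

CHUNK = 64  # canonical chunk width along the reduction axis (assumed constant)

def chunk_partials(x: np.ndarray) -> list[np.ndarray]:
    """Sum each canonical CHUNK-wide slice of the last axis."""
    n = x.shape[-1]
    return [x[..., i:i + CHUNK].sum(axis=-1) for i in range(0, n, CHUNK)]

def tree_reduce(parts: list[np.ndarray]) -> np.ndarray:
    """Combine partial sums pairwise in a fixed binary-tree order."""
    while len(parts) > 1:
        parts = [parts[i] + parts[i + 1] if i + 1 < len(parts) else parts[i]
                 for i in range(0, len(parts), 2)]
    return parts[0]

def invariant_sum(x: np.ndarray) -> np.ndarray:
    return tree_reduce(chunk_partials(x))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4096)).astype(np.float32)

# Simulate "TP=8": each rank owns a contiguous block of canonical chunks,
# reduces them locally, and the flattened partials feed the same fixed tree.
full = invariant_sum(x)
per_rank = [chunk_partials(shard) for shard in np.split(x, 8, axis=-1)]
sharded = tree_reduce([p for rank in per_rank for p in rank])
assert np.array_equal(full, sharded)  # bit-wise identical across "TP sizes"

# A naive reduction order generally does not match across split factors:
naive_1 = x.sum(axis=-1)
naive_8 = sum(shard.sum(axis=-1) for shard in np.split(x, 8, axis=-1))
print("naive reductions match:", np.array_equal(naive_1, naive_8))
```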
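For the verification line of work, here is a minimal sketch of token-level reference checking: re-score the claimed output teacher-forced under a trusted reference model and flag tokens whose reference probability is implausibly low. The HuggingFace-style `ref_model(ids).logits` interface and the `p_floor` threshold are assumptions made for illustration; the exact statistic used by Token-DiFR may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def token_divergence(ref_model, prompt_ids: torch.Tensor,
                     output_ids: torch.Tensor, p_floor: float = 1e-4) -> float:
    """Fraction of output tokens the reference model finds implausible.

    ref_model maps a [1, T] id tensor to logits [1, T, V] (GPT-style, assumed).
    p_floor is a hypothetical plausibility threshold, chosen for illustration.
    """
    ids = torch.cat([prompt_ids, output_ids], dim=-1)   # [1, P + N]
    logits = ref_model(ids).logits                      # [1, P + N, V]
    # Logits at position t predict token t + 1, so the N output tokens are
    # predicted by positions P - 1 .. P + N - 2.
    start = prompt_ids.shape[-1] - 1
    pred = logits[:, start:-1, :]                       # [1, N, V]
    logp = F.log_softmax(pred.float(), dim=-1)
    tok_logp = logp.gather(-1, output_ids.unsqueeze(-1)).squeeze(-1)
    suspicious = tok_logp < torch.log(torch.tensor(p_floor))
    return suspicious.float().mean().item()
```

In practice the resulting mismatch rate would be compared against a threshold calibrated on known-good runs before flagging a response as erroneous or tampered with.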
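Finally, for the goodput item, an illustrative allocation policy (not the paper's algorithm): express measured prefill and decode load in instance-units using offline-profiled per-instance capacities, then split a fixed pool in proportion. All capacity and load numbers below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class LoadStats:
    prefill_tokens_per_s: float   # observed incoming prompt tokens
    decode_tokens_per_s: float    # observed generated tokens

# Assumed per-instance capacities, measured offline (hypothetical numbers).
PREFILL_CAP = 40_000.0   # prompt tokens/s one prefill instance can sustain
DECODE_CAP = 4_000.0     # output tokens/s one decode instance can sustain

def allocate(total_instances: int, load: LoadStats) -> tuple[int, int]:
    """Split the pool so each stage's utilization is balanced.

    Demand is expressed in instance-units; the pool is divided in proportion,
    keeping at least one instance per stage so neither side starves.
    """
    prefill_demand = load.prefill_tokens_per_s / PREFILL_CAP
    decode_demand = load.decode_tokens_per_s / DECODE_CAP
    total_demand = max(prefill_demand + decode_demand, 1e-9)
    prefill_n = round(total_instances * prefill_demand / total_demand)
    prefill_n = min(max(prefill_n, 1), total_instances - 1)
    return prefill_n, total_instances - prefill_n

# Example: a prompt-heavy traffic window shifts capacity toward prefill.
print(allocate(8, LoadStats(prefill_tokens_per_s=200_000,
                            decode_tokens_per_s=12_000)))  # -> (5, 3)
```

Re-running this controller on a monitoring interval gives the kind of load-driven prefill-to-decode rebalancing the paper describes, though the real system must also account for the cost of migrating instances between roles.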