The field of high-performance interconnection networks and AI inference services is moving toward greater reliability, efficiency, and scalability. Current work focuses on refined congestion control mechanisms, reliability engineering for AI inference services, and identification of performance interference in production datacenters. Notable papers include:
- One paper refines the DCQCN closed-loop mechanism to provide more accurate congestion detection and signaling, paired with injection throttling at the senders (a simplified rate-control sketch in the DCQCN style appears after this list).
- Another paper presents an empirical study of real production incidents in AI inference services, achieving high labeling consistency and identifying the dominant failure modes.
- A third paper introduces PANDA, a noise-resilient antagonist identification framework for production-scale datacenters; it improves the average suspicion percentile and identifies antagonists consistently under multi-victim scenarios (a hypothetical suspicion-scoring sketch follows the DCQCN example below).
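To make the congestion-control discussion concrete, here is a minimal sketch of DCQCN-style sender rate control, assuming the standard reaction-point behavior: multiplicative decrease when a Congestion Notification Packet (CNP) arrives, an EWMA congestion estimate, and fast recovery followed by additive increase. The class name and constants are illustrative rather than taken from the paper, and hyper increase and timer plumbing are omitted.

```python
# Illustrative sketch of DCQCN-style sender rate control (not the paper's
# refined variant). Constants are placeholders; real deployments tune them
# and implement this logic in the NIC.

class DcqcnRateControl:
    def __init__(self, line_rate_gbps: float,
                 g: float = 1 / 256,          # EWMA gain for alpha (illustrative)
                 rate_ai_gbps: float = 5.0,   # additive-increase step (illustrative)
                 fast_recovery_rounds: int = 5):
        self.rc = line_rate_gbps   # current sending rate
        self.rt = line_rate_gbps   # target rate to recover toward
        self.alpha = 1.0           # congestion estimate
        self.g = g
        self.rate_ai = rate_ai_gbps
        self.fast_recovery_rounds = fast_recovery_rounds
        self.increase_rounds = 0

    def on_cnp(self) -> None:
        """CNP received: remember the old rate, cut multiplicatively, raise alpha."""
        self.rt = self.rc
        self.rc = self.rc * (1 - self.alpha / 2)
        self.alpha = (1 - self.g) * self.alpha + self.g
        self.increase_rounds = 0

    def on_quiet_period(self) -> None:
        """No CNP during an alpha-update period: decay the congestion estimate."""
        self.alpha = (1 - self.g) * self.alpha

    def on_increase_timer(self) -> None:
        """Periodic increase: fast recovery toward rt, then additive increase of rt."""
        if self.increase_rounds >= self.fast_recovery_rounds:
            self.rt += self.rate_ai
        self.rc = (self.rt + self.rc) / 2
        self.increase_rounds += 1
```

In practice, `on_cnp()` is driven by ECN marks echoed back from the receiver, while `on_quiet_period()` and `on_increase_timer()` fire on byte-count or time triggers; the refined scheme described in the paper changes how and when those congestion signals are generated.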
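PANDA's own algorithm is not reproduced here. Purely as a hypothetical illustration of antagonist scoring in general, one could rank the jobs co-located with a victim by how strongly each job's resource usage correlates with the victim's observed slowdown over aligned time windows. All names below (`suspicion_scores`, `neighbor_usage`) are invented for exposition.

```python
# Hypothetical correlation-based antagonist scoring; NOT PANDA's method.
from statistics import correlation, StatisticsError  # Python 3.10+

def suspicion_scores(victim_slowdown: list[float],
                     neighbor_usage: dict[str, list[float]]) -> dict[str, float]:
    """Rank co-located jobs by how closely their resource usage tracks the
    victim's slowdown; series must be aligned to the same time windows."""
    scores: dict[str, float] = {}
    for job, usage in neighbor_usage.items():
        try:
            scores[job] = correlation(victim_slowdown, usage)
        except StatisticsError:   # constant or too-short series
            scores[job] = 0.0
    # Highest-scoring neighbors are the most suspicious antagonist candidates.
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

A production framework must additionally cope with measurement noise, shared-resource indirection, and multiple simultaneous victims, which is precisely where the paper's noise-resilient design claims its improvements.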