Advancements in Distributed Systems and Networking

Recent work in distributed systems and networking is driven by the need for efficient resource allocation, better performance, and improved sustainability. Researchers are exploring new approaches to constructing accurate service dependency graphs, optimizing resource allocation in heterogeneous clusters, and building adaptive telemetry systems for performance diagnosis. Machine learning and artificial intelligence are increasingly integrated into these systems, enabling predictive models for auto-scaling, performance prediction, and load balancing. Together, these advances have the potential to improve the efficiency, scalability, and reliability of distributed systems and networks.

Noteworthy papers include GOGH, which proposes a learning-based architecture for managing machine learning workloads in heterogeneous clusters; Host-Side Telemetry for Performance Diagnosis in Cloud and HPC GPU Infrastructure, which introduces an eBPF-based telemetry system for unified host-side monitoring of GPU workloads; and Morpheus, which develops lightweight and accurate RTT predictors for performance-aware load balancing.

Sources

Retrofitting Service Dependency Discovery in Distributed Systems

GOGH: Correlation-Guided Orchestration of GPUs in Heterogeneous Clusters

Host-Side Telemetry for Performance Diagnosis in Cloud and HPC GPU Infrastructure

On the Power Saving in High-Speed Ethernet-based Networks for Supercomputers and Data Centers

Rediscovering Recurring Routing Results

FLAS: a combination of proactive and reactive auto-scaling architecture for distributed services

Accurate Performance Predictors for Edge Computing Applications

Morpheus: Lightweight RTT Prediction for Performance-Aware Load Balancing
