Title of the Talk
High-Precision Load Balancing: Reducing Tail Latency in Distributed Systems
Abstract
As distributed systems scale toward clusters composed of thousands of microservices, load balancing has emerged as a dominant factor in determining tail latency and overall system stability. Traditional strategies such as weighted round-robin and least-connections rely on delayed or aggregated load signals, which become increasingly inaccurate in high-QPS environments. This stale metadata effect leads to server hotspots, uneven request distribution, and unpredictable spikes at the p99 and p99.9 latency percentiles, particularly in latency-sensitive workloads such as RPC-based services.
This talk introduces a state-aware load-balancing architecture designed to address both routing accuracy and measurement fidelity. At its core is a high-precision experimental methodology that enables the detection of millisecond-level changes in tail latency that are typically invisible to standard A/B testing. By recursively splitting client and server tasks into mirrored replicas, the system enables side-by-side evaluation of competing routing policies under identical traffic conditions. This split-task methodology allows for precise attribution of latency behavior at the extreme tail, even when improvements are subtle and cluster-wide noise is significant.
Building on this foundation, the system incorporates real-time state-aware probing to guide routing decisions. Using a combination of piggybacked telemetry and asynchronous probes, the load balancer continuously evaluates server queue depth, CPU availability, and memory pressure. Requests are routed based on instantaneous capacity rather than historical averages, enabling proactive avoidance of saturated tasks and reducing long-tail amplification.
The integration of high-resolution experimentation with real-time capacity mapping demonstrates a practical path toward predictable tail latency control. This approach is especially impactful for large-scale inference, search, and other high-stakes distributed workloads, and represents a shift toward data-driven, debuggable orchestration embedded directly within the RPC stack.
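As a rough illustration of routing on instantaneous capacity rather than historical averages, the following Python sketch picks a server from recent load reports and ignores reports that have gone stale. It is not the system described in the talk: the names ServerState, load_score, and pick_server, the 0.5/0.3/0.2 weights, the queue limit of 100, and the 50 ms freshness window are all assumptions chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class ServerState:
    """Latest load report for one server task (hypothetical fields)."""
    name: str
    queue_depth: int        # requests currently waiting on the server
    cpu_utilization: float  # fraction of CPU in use, 0.0 to 1.0
    memory_pressure: float  # fraction of memory in use, 0.0 to 1.0
    reported_at: float      # time the report was produced, in seconds


def load_score(state, queue_limit=100):
    """Combine the three signals into one score; lower means more headroom.

    The weights and queue_limit are illustrative assumptions, not tuned values.
    """
    queue_fraction = min(state.queue_depth / queue_limit, 1.0)
    return (0.5 * queue_fraction
            + 0.3 * state.cpu_utilization
            + 0.2 * state.memory_pressure)


def pick_server(states, now, max_age=0.05, rng=random):
    """Route to the least-loaded server among those with a fresh report.

    Reports older than max_age seconds are treated as stale and skipped.
    If every report is stale, fall back to a uniformly random choice.
    """
    fresh = [s for s in states if now - s.reported_at <= max_age]
    if not fresh:
        return rng.choice(states)
    return min(fresh, key=load_score)


if __name__ == "__main__":
    states = [
        ServerState("a", queue_depth=10, cpu_utilization=0.5,
                    memory_pressure=0.5, reported_at=0.99),
        # "b" looks idle, but its report is 500 ms old and is ignored.
        ServerState("b", queue_depth=0, cpu_utilization=0.1,
                    memory_pressure=0.1, reported_at=0.50),
        ServerState("c", queue_depth=80, cpu_utilization=0.9,
                    memory_pressure=0.7, reported_at=1.00),
    ]
    print(pick_server(states, now=1.0).name)
```

In this example the apparently idle server "b" is passed over because its report is too old to trust, which is the stale-signal problem the abstract attributes to delayed or aggregated load metrics. A production balancer would more likely sample a few candidates per request than scan every server, and would need a more careful fallback than a random choice.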
Brief Profile
Varun Raj is a software engineer specializing in performance engineering, distributed systems, and large-scale infrastructure. He has built and led foundational systems that improve reliability, scalability, and efficiency across mission-critical platforms at globally recognized technology organizations. His expertise spans intelligent request routing, latency optimization, observability frameworks, and production-grade infrastructure designed to support highly demanding workloads. At Google, Varun has contributed to core infrastructure powering globally scaled services. He has designed adaptive architectures that respond dynamically to real-time system conditions, enabling predictable performance and operational efficiency under heavy load. His work has also strengthened machine learning deployments by creating the performance capacity required for advanced models to operate reliably. In addition, he has improved observability and deployment systems, helping engineering teams detect regressions faster and streamline production readiness. Previously, Varun worked at Oracle on developer infrastructure and large-scale build systems. He also conducted research in theoretical cryptography at the National University of Singapore. He holds a degree in Computer Science and Engineering from IIT Guwahati.
