Uber Moves from Static Limits to Priority-Aware Load Control for Distributed Storage
Briefly

"Uber engineers have described how they evolved their distributed storage platform from static rate limiting to a priority-aware load management system to protect their in-house databases. The change addressed the limitations of QPS-based rate limiting in large, stateful, multi-tenant systems, which did not reflect actual load, handle noisy neighbors, or protect tail latency."
"The design protects Docstore and Schemaless, built on MySQL® and serving traffic through thousands of microservices supporting over 170 million monthly active users, including riders, Uber Eats users, drivers, and couriers. By prioritizing critical traffic and adapting dynamically to system conditions, the system prevents cascading overloads and maintains performance at scale. Uber engineers noted that early quota-based approaches relied on static limits enforced through centralized tracking but proved ineffective."
"To address this, Uber colocated load management with stateful storage nodes, combining Controlled Delay (CoDel) queuing with a per-tenant Scorecard. CoDel adjusted queue behavior based on latency, while Scorecard enforced concurrency limits, and additional regulators monitored I/O, memory, goroutines, and hotspots. CoDel treated all requests equally, dropping both low-priority and user-facing traffic, which increased the on-call load and negatively impacted user experience."
QPS-based static rate limiting failed to reflect real partition-level load, handle noisy neighbors, or protect tail latency in large, stateful, multi-tenant systems. Docstore and Schemaless, backed by MySQL® and serving traffic from thousands of microservices for over 170 million monthly active users, required a dynamic approach. Uber colocated load management with the storage nodes, combining Controlled Delay (CoDel) queuing with a per-tenant Scorecard that enforced concurrency limits, and added regulators for I/O, memory, goroutines, and hotspots. Early centralized quota systems and stateless routing lacked timely visibility into that load, so operators had to retune limits, and healthy traffic was sometimes shed while overloaded partitions remained exposed.
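The two node-local mechanisms named above can be illustrated with a minimal sketch. This is not Uber's implementation: the class names `TenantScorecard` and `SimpleCoDelQueue`, their parameters, and the millisecond thresholds are all hypothetical, and the queue is a simplified CoDel-style policy that leaves out the control law real CoDel uses to space successive drops. It assumes a single-threaded caller that passes in the current time.

```python
from collections import defaultdict, deque


class TenantScorecard:
    """Hypothetical per-tenant cap on requests executing at once."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = defaultdict(int)  # tenant -> requests in progress

    def try_acquire(self, tenant):
        """Admit the request only if the tenant is below its cap."""
        if self.in_flight[tenant] >= self.max_in_flight:
            return False
        self.in_flight[tenant] += 1
        return True

    def release(self, tenant):
        """Mark one of the tenant's requests as finished."""
        if self.in_flight[tenant] > 0:
            self.in_flight[tenant] -= 1


class SimpleCoDelQueue:
    """Simplified CoDel-style queue (assumption: not full CoDel).

    Requests are shed once queueing delay has stayed at or above
    target_ms for at least interval_ms.
    """

    def __init__(self, target_ms, interval_ms):
        self.target_ms = target_ms
        self.interval_ms = interval_ms
        self.items = deque()        # (enqueue time in ms, request)
        self.first_above_ms = None  # when delay first reached the target

    def enqueue(self, request, now_ms):
        self.items.append((now_ms, request))

    def dequeue(self, now_ms):
        """Return (request to serve or None, list of shed requests)."""
        dropped = []
        while self.items:
            enqueued_at, request = self.items.popleft()
            if now_ms - enqueued_at < self.target_ms:
                self.first_above_ms = None  # delay is healthy again
                return request, dropped
            if self.first_above_ms is None:
                self.first_above_ms = now_ms
            if now_ms - self.first_above_ms < self.interval_ms:
                return request, dropped  # tolerate a short burst
            dropped.append(request)  # sustained delay: shed, whatever the priority
        self.first_above_ms = None  # an empty queue resets delay tracking
        return None, dropped


scorecard = TenantScorecard(max_in_flight=2)
# A noisy tenant is capped at two concurrent requests; the third is refused.
admitted = [scorecard.try_acquire("noisy-tenant") for _ in range(3)]
# A different tenant is unaffected by the noisy one.
other_admitted = scorecard.try_acquire("quiet-tenant")
```

Note that `SimpleCoDelQueue` has no notion of request priority, so under sustained delay it sheds whatever is at the head of the queue. That is the behavior the summary describes as the motivation for moving to priority-aware load control.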
Read at InfoQ