Google Cloud Demonstrates Massive Kubernetes Scale with 130,000-Node GKE Cluster
Briefly

"The feat was achieved by re-architecting key components of Kubernetes' control plane and storage backend, replacing the traditional etcd data store with a custom Spanner-based system that can support massive scale, and optimizing cluster APIs and scheduling logic to reduce load from constant node and pod updates. The engineering team also introduced new tooling for automated, parallelized node pool provisioning and faster resizing, helping overcome typical bottlenecks that would hinder responsiveness at such a scale."
"As AI training and inference workloads grow, often requiring hundreds or thousands of GPUs or high-throughput CPU clusters, the ability to run vast, unified Kubernetes clusters becomes a critical enabler. With a 130,000-node cluster, workloads such as large-scale model training, distributed data processing, or global microservice fleets can be managed under a single control plane, simplifying orchestration and resource sharing./p> At the core of the scale breakthrough was Google's replacement of etcd as the primary control-plane datastore with a custom, Spanner-backed storage layer."
"Traditional Kubernetes relies on etcd for strongly consistent state management, but etcd becomes a scaling bottleneck at very high node and pod counts due to write amplification, watch fan-out, and leader election overhead. By offloading cluster state into Spanner, Google gained horizontal scalability, global consistency, and automatic sharding of API objects such as nodes, pods, and resource leases. This dramatically reduced API server pressure and eliminated the consensus bott"
Google's GKE team built and operated a 130,000-node Kubernetes cluster, the largest publicly disclosed. The effort re-architected core control plane components and the storage backend, replacing etcd with a custom Spanner-backed datastore for horizontal scalability and automatic sharding. Cluster APIs and scheduling logic were optimized to reduce load from frequent node and pod updates. New tooling enabled automated, parallelized node pool provisioning and faster resizing to avoid responsiveness bottlenecks. The scale allows unified management of massive AI training and inference workloads, distributed data processing, and global microservice fleets under a single control plane.
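The server-side API and scheduling optimizations are not detailed in the article, but the load problem they address is the same one the standard client-go informer pattern tackles from the client side: rather than polling the API server for node and pod state, controllers keep a local cache fed by a single watch stream. The sketch below shows that standard pattern for pods; it is illustrative only, not Google's change, and assumes a kubeconfig at the default location.

```go
package main

import (
	"log"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig (placeholder; in-cluster config also works).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// A shared informer keeps a local, watch-fed cache of pods, so
	// controllers read from memory instead of hammering the API server.
	factory := informers.NewSharedInformerFactory(clientset, 10*time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			pod := newObj.(*corev1.Pod)
			log.Printf("pod %s/%s updated", pod.Namespace, pod.Name)
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	<-stop // block forever; a real controller would wire this to a shutdown signal
}
```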
Read at InfoQ