Salesforce Migrates 1,000+ EKS Clusters to Karpenter to Improve Scaling Speed and Efficiency
Briefly

"Salesforce has completed a phased migration of more than 1,000 Amazon Elastic Kubernetes Service (EKS) clusters from the Kubernetes Cluster Autoscaler to Karpenter, AWS's open-source node-provisioning and autoscaling solution. The large-scale transition aimed to reduce scaling latency, simplify operations, cut costs, and enable more flexible, self-service infrastructure for internal developers across the company's extensive Kubernetes fleet."
"Facing limitations with Auto Scaling group-based autoscaling and the Cluster Autoscaler, including slow scale-up times, poor utilization across availability zones, and a proliferation of thousands of node groups, Salesforce's platform team built custom tooling to automate and manage the migration safely and reliably. This approach combined carefully orchestrated node transitions with automation that respected Pod Disruption Budgets (PDBs), supported rollback paths, and integrated with the company's CI/CD provisioning pipelines."
"The migration journey began in mid-2025 with lower-risk environments and progressed through testing and validation phases before production adoption in early 2026. Salesforce's engineers developed an in-house Karpenter transition tool and patching checks that handled node rotation, Amazon Machine Image (AMI) validation, and graceful pod eviction, enabling repeatable and consistent conversion across diverse node pool configurations. Through this transition, the team resolved operational challenges such as misconfigured PDBs that blocked node replacements ..."
Salesforce migrated more than 1,000 EKS clusters from the Kubernetes Cluster Autoscaler to Karpenter to reduce scaling latency, simplify operations, lower costs, and give internal developers self-service infrastructure. The platform team had run into limitations with Auto Scaling group-based autoscaling: slow scale-up, poor utilization across availability zones, and a proliferation of thousands of node groups. Engineers built custom automation that honored Pod Disruption Budgets (PDBs), provided rollback paths, and integrated with the company's CI/CD provisioning pipelines. The migration began in mid-2025 in lower-risk environments, moved through testing and validation, and reached production in early 2026. An in-house transition tool handled node rotation, AMI validation, and graceful pod eviction, and the team resolved operational issues along the way, such as misconfigured PDBs that blocked node replacements.
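To illustrate why a misconfigured PDB can block node replacement, here is a minimal Python sketch of the kind of pre-flight check a migration tool might run. It is not Salesforce's tool, and the function names (`allowed_disruptions`, `blocks_node_drain`) are hypothetical. It assumes every replica is currently healthy and that percentage values are rounded up, as described in the Kubernetes PDB documentation.

```python
import math


def _scaled(value, total):
    """Resolve an int or an 'N%' string against total, rounding percentages up."""
    if isinstance(value, str) and value.endswith("%"):
        return math.ceil(int(value[:-1]) * total / 100)
    return int(value)


def allowed_disruptions(pdb_spec, healthy_replicas):
    """Estimate how many pods may be evicted at once under a PDB spec.

    pdb_spec is a plain dict holding either 'minAvailable' or 'maxUnavailable'
    (the two mutually exclusive PDB fields). Assumption: all replicas are
    healthy, so the expected pod count equals healthy_replicas.
    """
    if "minAvailable" in pdb_spec:
        desired_healthy = _scaled(pdb_spec["minAvailable"], healthy_replicas)
    elif "maxUnavailable" in pdb_spec:
        max_unavailable = _scaled(pdb_spec["maxUnavailable"], healthy_replicas)
        desired_healthy = healthy_replicas - max_unavailable
    else:
        raise ValueError("PDB spec needs minAvailable or maxUnavailable")
    return max(0, healthy_replicas - desired_healthy)


def blocks_node_drain(pdb_spec, healthy_replicas):
    """True if the PDB permits no voluntary evictions, so a drain would stall."""
    return allowed_disruptions(pdb_spec, healthy_replicas) == 0


if __name__ == "__main__":
    # A PDB requiring all 3 of 3 replicas leaves no room to evict any pod.
    print(blocks_node_drain({"minAvailable": 3}, 3))
    # Allowing one unavailable pod lets a node rotation proceed pod by pod.
    print(blocks_node_drain({"maxUnavailable": 1}, 3))
```

Flagging such PDBs before rotating nodes lets owners fix them ahead of time instead of discovering a stalled drain mid-migration.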
Read at InfoQ