
"Pinterest has published a detailed technical account of how its engineers tracked down intermittent CPU starvation that was crashing machine learning training jobs. By identifying what the team termed "zombies" (leaked memory cgroups left behind by a crashlooping default agent), the engineers restored stability to their distributed computing platform."
"The issue manifested as intermittent network failures and job crashes on PinCompute, the Kubernetes-based platform where Pinterest runs more than half of its offline machine learning workload. Tens of thousands of Ray clusters are provisioned monthly for these tasks, and some use cases saw training job success rates drop by more than 25% due to Elastic Network Adapter (ENA) device resets and dropped packets. Initial investigations were hampered because aggregate CPU utilisation looked healthy, masking the failures underneath."
"Forced off high-level dashboards, the infrastructure team dropped to per-core analysis using mpstat. That investigation revealed individual cores hitting 100% system CPU for seconds at a time. This behaviour was particularly problematic because if a core handling ENA network interrupts became saturated, the driver's NAPI poll thread could be starved of cycles, triggering ENA device resets, a self-healing mechanism that fires when Tx completions stall for more than five seconds, and the connectivity loss that crashed Ray jobs."
"To pinpoint the source of this core saturation, the team utilised rolling two-minute perf captures run over a 12-hour reproduction window. Visualised in Netflix's Flamescope, the captures let the engineers zoom into the exact moments when network resets are fired. They discovered that the kubelet process, which typically consumes less than 1 per cent of CPU, was spiking to approximately 6.5 per cent. Most of this time was spent in the kernel function mem_cgroup_nr_lru_pages."