Airbnb Executes Istio Upgrades at Massive Scale
Briefly

"Airbnb engineering has published a detailed account of how it maintains high availability during Istio upgrades across tens of thousands of pods and thousands of VMs, all without downtime. The company's service mesh infrastructure supports workloads in both Kubernetes and VM environments, handling tens of millions of queries per second at peak. Despite the complexity, Airbnb has completed Istio upgrades 14 times to date."
"The key challenge lies in coordinating upgrades across diverse workloads owned by different teams. To address this, Airbnb designed an upgrade pipeline that 'guarantees' zero downtime, enables gradual rollouts, supports failback, and ensures all workloads are updated within a fixed timeframe. Technically, the process relies on a canary-style dual-version deployment of Istio control planes, each distinguished by a revision label (e.g., 1-24-5, 1-25-2). Workloads are pinned to specific revisions via the mutating webhook, which injects the appropriate istio-proxy sidecar."
Airbnb maintains high availability during Istio upgrades across tens of thousands of pods and thousands of VMs without downtime. The service mesh supports Kubernetes and VM workloads and handles tens of millions of queries per second at peak. The main challenge is coordinating upgrades across diverse, team-owned workloads. The upgrade pipeline guarantees zero downtime, enables gradual rollouts, supports failback, and enforces a fixed completion timeframe. The process uses canary-style dual control planes with revision labels and a mutating webhook to pin workloads. Krispr automates label injection and admission-time migration, while mxagent and mxrc coordinate safe VM upgrades respecting health checks and safety thresholds.
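To make the revision-pinning idea concrete, here is a minimal Python sketch of how an admission-time mutation could assign a workload to one of two control-plane revisions and set Istio's `istio.io/rev` label. It is an illustration under stated assumptions, not Airbnb's or Krispr's actual implementation: the function names, the hash-based percentage bucketing, and the example workload name are all hypothetical.

```python
import hashlib

# Istio's label for selecting a control-plane revision at injection time.
REV_LABEL = "istio.io/rev"


def rollout_bucket(workload: str) -> int:
    """Map a workload name to a stable bucket in [0, 100). (Hypothetical scheme.)"""
    digest = hashlib.sha256(workload.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def pick_revision(workload: str, old_rev: str, new_rev: str, percent: int) -> str:
    """Return new_rev for roughly `percent` of workloads, old_rev for the rest."""
    return new_rev if rollout_bucket(workload) < percent else old_rev


def revision_patch(pod: dict, revision: str) -> list:
    """Build a JSON Patch (RFC 6902) that sets the revision label on a pod.

    Assumes the pod object already has a "metadata" field.
    """
    labels = pod.get("metadata", {}).get("labels")
    if labels is None:
        # No labels map yet: create it with the single revision label.
        return [{"op": "add", "path": "/metadata/labels",
                 "value": {REV_LABEL: revision}}]
    # JSON Pointer escaping: "~" becomes "~0" and "/" becomes "~1".
    key = REV_LABEL.replace("~", "~0").replace("/", "~1")
    return [{"op": "add", "path": "/metadata/labels/" + key, "value": revision}]


if __name__ == "__main__":
    pod = {"metadata": {"name": "search-api", "labels": {"app": "search-api"}}}
    rev = pick_revision("search-api", "1-24-5", "1-25-2", percent=25)
    print(revision_patch(pod, rev))
```

Because the bucket depends only on the workload name, raising `percent` moves more workloads to the new revision without flipping any that have already moved, and setting it back to 0 returns every workload to the old revision, which mirrors the gradual rollout and failback properties described above.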
Read at InfoQ