Slack Enhances Chef Infrastructure to Improve Safety and Reduce Blast Radius in Deployments
Briefly

"Slack's engineering team has published an in-depth look at recent improvements to its Chef-based configuration management system, aimed at making deployments safer and more resilient without disrupting existing workflows. The updated infrastructure reduces the risk of widespread failures during provisioning and configuration changes by eliminating single points of failure and introducing staggered, environment‑aware rollout processes across availability zones. Previously, Slack's EC2 provisioning relied on a single shared Chef production environment."
"With staggered environments no longer compatible with fixed cron schedules, engineers built a service called Chef Summoner. Chef Summoner runs on every node, listens for signals (via S3 events populated by an enhanced version of the existing Chef Librarian service), and schedules Chef runs only when new artifacts are available. To avoid load spikes and contention, the service uses a splay value to stagger execution across nodes in isolation."
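The splay idea can be illustrated with a minimal sketch. The article does not say how Chef Summoner computes its splay, so the hash-based approach, the 900-second bound, and the function names below are assumptions chosen for illustration only.

```python
import hashlib
from typing import Optional

# Assumed upper bound for the splay window; the source gives no value.
MAX_SPLAY_SECONDS = 900


def splay_for(node_name: str, max_splay: int = MAX_SPLAY_SECONDS) -> int:
    """Map a node name to a stable delay in the range [0, max_splay).

    Hashing the name gives each node its own offset without any
    coordination between nodes (an assumed technique, not Slack's).
    """
    digest = hashlib.sha256(node_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % max_splay


def run_delay(node_name: str, new_artifact: bool) -> Optional[int]:
    """Return seconds to wait before a Chef run, or None if no run is due."""
    if not new_artifact:
        return None  # nothing new was signalled, so do not schedule a run
    return splay_for(node_name)
```

Because each node derives its delay locally, nodes that receive the same signal at the same moment would still start their runs at different times.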
Slack reduced the risk of widespread failures during provisioning and configuration changes by eliminating single points of failure and introducing staggered, environment-aware rollouts across availability zones. Previously, EC2 provisioning relied on a single shared Chef production environment in which scheduled cron jobs only partially staggered runs, so a bad change could propagate immediately to newly provisioned nodes during rapid scale-outs. To limit blast radius, the monolithic Chef environment was split into multiple buckets tied to specific availability zones. Engineers implemented Chef Summoner, a per-node service that listens for S3-based artifact signals, schedules a run when a new artifact arrives, and applies a splay to avoid load spikes. For compliance, Chef Summoner also ensures that Chef runs at least once every 12 hours on each node. The rollout follows a release-train pattern that promotes cookbook changes first to sandbox and develop, then to the staggered production buckets.
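The release-train ordering described above can be sketched as a simple promotion rule. The sandbox and develop names come from the summary; the production bucket names, the health flag, and the function name are hypothetical, and the article does not describe how Slack gates each promotion.

```python
from typing import Optional

# sandbox and develop are named in the text; the prod bucket names are hypothetical.
TRAIN = ["sandbox", "develop", "prod-1", "prod-2", "prod-3"]


def next_stop(current: str, healthy: bool) -> Optional[str]:
    """Return the next environment to promote to, or None to stop the train.

    A change only advances when the current stop looks healthy, so a bad
    cookbook version stays confined to the environments it has already reached.
    """
    if not healthy:
        return None
    position = TRAIN.index(current)
    if position + 1 >= len(TRAIN):
        return None  # already at the last production bucket
    return TRAIN[position + 1]
```

Under this rule a change that fails in one production bucket never reaches the buckets after it, which is the blast-radius property the staggered environments are meant to provide.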
Read at InfoQ