Slack Migrates to Cell-Based Architecture on AWS to Mitigate Gray Failures
Briefly

The move was prompted by networking outages in a single availability zone (AZ) that degraded service for users.
As it turns out, detecting failure in distributed systems is a hard problem. The team at Slack decided to adopt a cell-based approach in which each AZ hosts a completely siloed backend deployment whose components are constrained to that AZ.
Read at InfoQ