How Netflix Ensures Highly-Reliable Online Stateful Systems
Briefly

A reliable stateful system has these properties. Instead of asking, how many nines do I have, instead, ask, how often do my systems fail? When they fail, how large is the blast radius? Then, finally, how long does it take us to recover from an outage. Then you're going to spend money to make these go to zero. Either you're going to overprovision computers, you're going to pay a cloud provider to solve these problems for you, or you're going to employ engineers to build cool resiliency techniques. Either way, this is the way that you make stateful systems reliable.
To drive this point home of why nines are insufficient, I want to consider three hypothetical stateful services. They all have equal number of nines. The first service fails just a little bit all the time, it never recovers. Maybe we need to invest in request hedging, or improving our tests to cover some bug. Service B occasionally fails cataclysmically.
Read at InfoQ
[
add
]
[
|
|
]