
"When your cabbie asks you what you do for a living, and you answer "tech journalist," you never get asked about cloud infrastructure in return. Bitcoin, mobile phones, AI, yes. Until last week: "What's this AWS thing, then?" You already knew a lot of people were having a very bad day in Bezosville, but if the news had reached an Edinburgh black cab driver, new adjectives were needed."
"As the world reluctantly touched grass, the AWS outage of October 20 made the top of the mainstream news. It beautifully illustrated the success of the cloud concept as it took out banking services, gaming platforms, messaging apps, and cat litter trays. Things got better after a few hours, and the nature of the collapse gradually revealed itself. A DNS failure led to a core database dropping off, leading to a control plane malfunction that broke load balancing."
"Why this cascade was both possible and unexpected, and why it took so long to find and fix, is even more interesting. Here's a clue: this kind of event had been predicted by an ex-Amazonian based on their perception that key engineering talent had been fleeing the company for years, removing irreplaceable wisdom built from knowledge and experience. Such a prediction, backed by the observation that AWS techs had to grope their way to the big picture, is compelling."
The October 20 AWS outage disrupted banking services, gaming platforms, messaging apps, and smart devices, showing how far the cloud's systemic reach extends. A DNS failure caused a core database to drop offline, which led to a control-plane malfunction that broke load balancing and produced cascading failures across dependent services. An ex-Amazonian had predicted this kind of event, arguing that key engineering talent had been leaving the company for years and taking institutional knowledge with it; on that account, the loss of expertise made the cascade harder to anticipate and slower to fix. Remediation took hours as teams struggled to see the big picture and trace how the failure spread through dependencies. Preventing a recurrence requires redesigning dependency chains, building safeguards for similar failure classes, and improving visibility into complex cloud systems.
Read at The Register