The Long Tail of the AWS Outage
Briefly

"A sprawling Amazon Web Services cloud outage that began early Monday morning illustrated the fragile interdependencies of the internet as major communication, financial, health care, education, and government platforms around the world suffered disruptions. As the day wore on, AWS diagnosed and began working to correct the issue, which stemmed from the company's critical US-EAST-1 region based in northern Virginia. But the cascade of impacts took time to fully resolve."
"Researchers reflecting on the incident particularly highlighted the length of Monday's outage, which started around 3 am ET on Monday, October 20. AWS said in status updates that by 6:01 pm ET on Monday "all AWS services returned to normal operations." The outage directly stemmed from Amazon's DynamoDB database application programming interfaces and, according to the company, "impacted" 141 other AWS services."
""The word 'hindsight' is key. It's easy to find out what went wrong after the fact, but the overall reliability of AWS shows how difficult it is to prevent every failure," says Ira Winkler, chief information security officer of the reliability and cybersecurity firm CYE. "Ideally, this will be a lesson learned, and Amazon will implement more redundancies that would prevent a disaster like this from happening in the future, or at least prevent them staying down as long as they did.""
A major Amazon Web Services outage began in the US-EAST-1 region in northern Virginia and disrupted major communication, financial, health care, education, and government platforms worldwide. The outage started around 3 am ET on October 20, and AWS reported that all services had returned to normal operations by 6:01 pm ET. The root cause involved failures in DynamoDB APIs that affected 141 other AWS services, and the cascading impacts produced a prolonged recovery period for many customers. Network engineers and infrastructure specialists noted that failures are understandable for hyperscalers given their complexity and scale, but that prolonged downtime underscores the need for greater redundancy. AWS plans to publish a post-event summary.
Read at WIRED