AWS Outage And Why O11y Is Non-Negotiable
Briefly

"Many internal AWS services depend on DynamoDB to store critical data, so the initial DNS failure triggered a cascade of secondary disruptions: EC2 Launch Issues: Although the DNS issue was resolved around 2:24 AM PDT on October 20, a new problem arose in EC2's internal subsystem responsible for launching instances. This system's reliance on DynamoDB caused errors when attempting to launch new instances, often resulting in "Insufficient Capacity" errors."
"Network Connectivity Problems: While working on the EC2 issue, AWS discovered that health checks for Network Load Balancers were failing. This led to widespread network connectivity issues across multiple services, including DynamoDB, SQS, and Amazon Connect. Mitigation Efforts and Backlogs: To contain the cascading failures, AWS temporarily throttled certain operations, such as new EC2 instance launches, SQS polling via Lambda Event Source Mappings, and asynchronous Lambda invocations."
An initial DNS failure triggered cascading disruptions because many internal AWS services depend on DynamoDB for critical data storage. Even after the DNS issue was resolved, EC2's internal launch subsystem, which itself relies on DynamoDB, produced "Insufficient Capacity" errors when launching new instances. Health checks for Network Load Balancers then began failing, causing network connectivity problems across DynamoDB, SQS, and Amazon Connect. To stabilize its systems, AWS throttled operations such as new EC2 instance launches, SQS polling via Lambda Event Source Mappings, and asynchronous Lambda invocations; that throttling created processing backlogs in AWS Config, Redshift, and Amazon Connect that took hours to clear. With multi-hour, multi-service outages incurring median losses of around $2.2 million per hour, full-stack observability is a key investment in resilience.
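The failure chain above began with DNS resolution for a single dependency, which is exactly the kind of signal a basic synthetic probe can surface before the cascade spreads. Below is a minimal sketch of such a probe in Python; the endpoint name, check interval, and printed "alert" are illustrative assumptions rather than details from the incident report, and a real setup would ship the records to a telemetry backend instead of printing them.

```python
import socket
import time

# Illustrative values only; the real outage involved the DynamoDB regional
# endpoint in us-east-1, but the endpoint and interval here are assumptions.
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"
CHECK_INTERVAL_SECONDS = 30

def dns_probe(hostname: str) -> dict:
    """Resolve a hostname and return a metric-style record that an
    observability backend could ingest (status, latency, addresses)."""
    start = time.monotonic()
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        addresses = sorted({info[4][0] for info in infos})
        return {
            "check": "dns.resolve",
            "host": hostname,
            "ok": True,
            "latency_ms": (time.monotonic() - start) * 1000,
            "addresses": addresses,
        }
    except socket.gaierror as exc:
        return {
            "check": "dns.resolve",
            "host": hostname,
            "ok": False,
            "latency_ms": (time.monotonic() - start) * 1000,
            "error": str(exc),
        }

if __name__ == "__main__":
    while True:
        record = dns_probe(ENDPOINT)
        # Printing stands in for exporting the record to a telemetry pipeline.
        print(record)
        if not record["ok"]:
            print("ALERT: DNS resolution failing for", ENDPOINT)
        time.sleep(CHECK_INTERVAL_SECONDS)
```

The point of the sketch is the shape of the signal: a per-check record with success, latency, and error detail gives an on-call engineer the "DNS is the root cause" clue directly, rather than leaving it to be inferred from downstream symptoms like failed instance launches or load-balancer health checks.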
Read at New Relic