AWS Disruption Exposes Fragility in Critical Cloud Infrastructure
Briefly

"On October 20, 2025, Amazon Web Services (AWS) experienced a major outage that disrupted global internet services, affecting millions of users and thousands of companies across more than 60 countries. The incident originated in the US-EAST-1 region and was traced to a DNS resolution failure affecting the DynamoDB endpoint, which cascaded into outages across multiple dependent services. According to AWS's official incident report, the fault began when a DNS subsystem failed to update domain resolution records within the affected region correctly."
"The incident stemmed from a latent race condition in DynamoDB's automated DNS-management system. AWS uses two subnet components to manage its DNS records: a DNS Planner, which tracks load-balancer health and proposes changes, and a DNS Enactor, which applies those changes via Route 53. When one Enactor lagged, a cleanup job mistakenly deleted active DNS records, leaving the dynamodb.us-east-1.amazonaws.com endpoint pointing to no IP addresses."
"Because the broken DNS record went uncorrected, clients, whether AWS services or customer applications, couldn't resolve the DynamoDB endpoint. Even though DynamoDB itself remained internally healthy, the loss of DNS reachability made it effectively unreachable. The outage didn't stop there. Internal AWS subsystems that relied on DynamoDB, including the control planes for EC2 and Lambda, began to malfunction. As customer SDKs retried failed requests, they created a retry storm that further overwhelmed AWS's internal resolver infrastructure."
On October 20, 2025, Amazon Web Services suffered a major outage originating in US-EAST-1 that disrupted internet services across more than 60 countries. A latent race condition in DynamoDB's automated DNS-management system allowed a cleanup job to delete active DNS records after one Enactor fell behind, leaving the dynamodb.us-east-1.amazonaws.com endpoint with no IP addresses. Clients and dependent AWS services could not resolve the endpoint, rendering DynamoDB effectively unreachable despite being internally healthy. Dependent control planes for EC2 and Lambda malfunctioned, customer SDK retries created a retry storm, and internal resolver infrastructure and NLB health checks were overwhelmed, generating over 17 million user outage reports.
Read at InfoQ