Race Condition in DynamoDB DNS System: Analyzing the AWS US-EAST-1 Outage
Briefly

"According to the post-mortem, which provides details on the DynamoDB DNS management architecture, the incident was triggered by a latent defect in the service's automated DNS management system, leading to endpoint resolution failures for DynamoDB. Other popular services that rely on DynamoDB, including new EC2 instance launches, Lambda invocations, and Fargate task launches, were also impacted during the outage. In the summary of the Amazon DynamoDB service disruption in the US-EAST-1 region, the team acknowledges that the reliability issue significantly affected many customers."
"While in many online threads developers joked about "it's always DNS", Yan Cui, AWS Hero and serverless expert, highlights in his newsletter: "The DNS failure was the first symptom, not the root cause of the recent AWS outage. The root cause was a race condition in an internal DynamoDB microservice that automates DNS record management for the regional cells of DynamoDB.""
A defect in Amazon DynamoDB's automated DNS management produced an empty DNS record for the regional endpoint dynamodb.us-east-1.amazonaws.com, which the automation then failed to repair. The resulting endpoint resolution failures blocked access to DynamoDB and disrupted services that depend on it, including new EC2 instance launches, Lambda invocations, and Fargate task launches. Many customers experienced prolonged effects, with some reporting issues for up to 15 hours despite an official incident window of a few hours. The outage also prevented newly created EC2 instances from completing their networking configuration and introduced delays in network state propagation. The incident prompted renewed discussions about cloud redundancy and multi-region strategies.
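One pattern raised in those discussions is client-side regional failover. The sketch below assumes a table already replicated across regions (for example with DynamoDB Global Tables); the table name, key, and region list are placeholders rather than details from the incident.

```python
# A minimal sketch of client-side regional failover for DynamoDB reads.
# Assumes the table is replicated to every listed region; table name, key,
# and regions are hypothetical.
import boto3
from botocore.config import Config
from botocore.exceptions import ConnectTimeoutError, EndpointConnectionError

REGIONS = ["us-east-1", "us-west-2"]   # primary region first, then replicas
TABLE_NAME = "orders"                  # hypothetical replicated table

def get_item_with_fallback(key):
    """Try each region in order; skip regions whose endpoint is unreachable."""
    for region in REGIONS:
        client = boto3.client(
            "dynamodb",
            region_name=region,
            config=Config(connect_timeout=2, retries={"max_attempts": 1}),
        )
        try:
            return client.get_item(TableName=TABLE_NAME, Key=key)
        except (EndpointConnectionError, ConnectTimeoutError):
            continue   # endpoint cannot be resolved or reached; try next region
    raise RuntimeError("all configured regions are unreachable")

if __name__ == "__main__":
    response = get_item_with_fallback({"order_id": {"S": "12345"}})
    print(response.get("Item"))
```

This only addresses read availability from the client's side; cross-region data consistency and write failover involve trade-offs that the fallback loop does not solve.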
Read at InfoQ