Microsoft Azure OpenAI Service goes down in Sweden
Briefly

"Microsoft first acknowledged the issues at 0900 UTC (although the status page for the service stated it spotted the problem at 0922 UTC). At the time, Microsoft blamed the Azure OpenAI Service's availability issues on "an unhealthy backend dependent service, which led to cascading failures." The Windows behemoth noted problems when using models such as GPT-5.2, GPT-5 Mini, GPT-4.1, and related APIs."
"The team took mitigating steps, Microsoft said. In other words, that old IT standby was deployed, and the offending IRM service was turned off and turned back on again at 1236 UTC. The problem didn't go away. At 1246 UTC, Microsoft said pods were crashing with out-of-memory errors in the Sweden cluster. It began scaling out nodes in the cluster "to improve request handling and resilience" and at 1530 UTC began increasing the memory available to the pods, which completed at 1553 UTC."
"Finally, at 1612 UTC, when many Swedes were shutting down for the day, Microsoft confirmed the problem had been resolved. While Microsoft's transparency in acknowledging the problem is to be applauded, the length of time it took to deal with what appears to be a software issue is not. One wag commented on social media: "EU resilience is getting another live exercise," while others treated it as a learning experience: "Used this as a forcing function: deployed to multiple regions with automatic failover.""
Azure OpenAI Service in the Sweden Central region experienced availability issues beginning around 0900 UTC, with the status page detecting the problem at 0922 UTC. Microsoft attributed the outage to "an unhealthy backend dependent service, which led to cascading failures," affecting models such as GPT-5.2, GPT-5 Mini, GPT-4.1, and related APIs. Teams restarted the IRM service at 1236 UTC, but pods then crashed with out-of-memory errors at 1246 UTC. Microsoft scaled out cluster nodes and increased pod memory between 1530 and 1553 UTC, and the incident was resolved at 1612 UTC. Users noted the outage prompted resilience measures such as multi-region deployments with automatic failover.
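The two mitigations Microsoft described, scaling out nodes and raising the memory available to crashing pods, map onto standard Kubernetes knobs. A minimal sketch of what such a change can look like is below; the names, replica counts, and memory figures are purely illustrative assumptions, not Microsoft's actual configuration.

```yaml
# Hypothetical Kubernetes Deployment fragment illustrating the described fixes.
# All names and numbers are invented for illustration.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aoai-backend          # hypothetical backend service
spec:
  replicas: 12                # scaled out "to improve request handling and resilience"
  template:
    spec:
      containers:
        - name: inference-router
          resources:
            requests:
              memory: "8Gi"
            limits:
              memory: "16Gi"  # raised after pods were OOM-killed under load
```

Raising the memory limit stops the kubelet from OOM-killing pods that exceed their previous cap, while the extra replicas spread request load so no single pod approaches that cap as quickly.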
Read at The Register