Google pins weekend outage on "unexercised" feature
Briefly

Google has apologized for a significant Cloud outage impacting multiple clients last week, which was traced back to an inadequately tested new feature in a policy check system. The incident saw widespread outages across services like Spotify and Discord. An internal review identified a lack of proper error handling and the absence of feature flag protection, leading to system crashes. Moving forward, Google has committed to better testing protocols and enhanced communication regarding service disruptions, promising improvements to their operational strategy to prevent future outages.
"This policy data contained unintended blank fields,” Google said. “Service Control, then regionally exercised quota checks on policies in each regional datastore. This pulled in blank fields for this respective policy change and exercised the code path that hit the null pointer causing the binaries to go into a crash loop."
"As a safety precaution, this code change came with a red-button to turn off that particular policy serving path."
"Without the appropriate error handling, the null pointer caused the binary to crash. Feature flags are used to gradually enable the feature region by region per project, starting with internal projects, to enable us to catch issues."
Read at IT Pro
[
|
]