Cloudflare broke the internet with a bad DB query
Briefly

"Prince has penned a late Tuesday post that explains the incident was "triggered by a change to one of our database systems' permissions which caused the database to output multiple entries into a 'feature file' used by our Bot Management system." The file describes malicious bot activity and Cloudflare distributes it so the software that runs its routing infrastructure is aware of emerging threats."
"And then it recovered - for a while - because when the incident started Cloudflare was updating permissions management on a ClickHouse database cluster it uses to generate a new version of the feature file. The permission change aimed to give users access to underlying data and metadata, but Cloudflare made mistakes in the query it used to retrieve data, so it returned extra info"
"'Bad data was only generated if the query ran on a part of the cluster which had been updated. As a result, every five minutes there was a chance of either a good or a bad set of configuration files being generated and rapidly propagated across the network,' Prince wrote. For a couple of hours starting at around 11:20 UTC on Tuesday, Cloudflare's services therefore experienced intermittent outages."
A change to database permissions caused ClickHouse queries to return extra information that doubled the size of a Bot Management 'feature file'. The file, which describes malicious bot activity, is distributed to routing software and is generated every five minutes. The oversized feature file exceeded software-imposed limits and triggered failures in the routing code. Because only updated parts of the cluster produced the bad data, every five-minute generation could be good or bad, causing rapid propagation of alternating configuration files. Services experienced intermittent outages for a couple of hours, and the symptoms were initially mistaken for a hyper-scale DDoS attack.
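The mechanism is easiest to see in miniature. The sketch below is a hypothetical illustration, not Cloudflare's actual code: the names, the row-to-feature mapping, and the hard cap are all assumed for the example. It shows how a metadata query that suddenly matches a second, newly visible schema returns every row twice, so the generated feature file roughly doubles in size and a consumer that enforces a fixed limit refuses to load it.

```python
# Hypothetical sketch of the failure mode described above. All identifiers and
# the MAX_FEATURES value are illustrative assumptions, not Cloudflare's code.

MAX_FEATURES = 200  # assumed hard cap built into the consuming routing software


def generate_feature_file(rows):
    """Build the feature list from query rows; duplicate rows are kept as-is."""
    return [row["feature_name"] for row in rows]


def load_feature_file(features):
    """Consumer-side check: refuse any file that exceeds the fixed limit."""
    if len(features) > MAX_FEATURES:
        raise RuntimeError(
            f"feature file has {len(features)} entries, limit is {MAX_FEATURES}"
        )
    return set(features)


# Before the permissions change: one schema visible, one row per feature.
good_rows = [{"feature_name": f"feature_{i}"} for i in range(150)]

# After the change: the same query also matches the newly exposed underlying
# schema, so each feature comes back twice and the file roughly doubles.
# Per the report, only updated parts of the cluster produced this, so each
# five-minute regeneration could come out either good or bad.
bad_rows = good_rows + [dict(r) for r in good_rows]

load_feature_file(generate_feature_file(good_rows))  # fine: 150 <= 200

try:
    load_feature_file(generate_feature_file(bad_rows))  # 300 > 200
except RuntimeError as err:
    print("refused to load:", err)
```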
Read at The Register