
Cloudflare's proxy service has limits to prevent excessive memory consumption, with the bot management system having "a limit on the number of machine learning features that can be used at runtime." This limit is 200, well above the actual number of features used. "When the bad file with more than 200 features was propagated to our servers, this limit was hit, resulting in the system panicking" and outputting errors, Prince wrote.
"The file was being generated every five minutes by a query running on a ClickHouse database cluster, which was being gradually updated to improve permissions management," Prince wrote. "Bad data was only generated if the query ran on a part of the cluster which had been updated. As a result, every five minutes there was a chance of either a good or a bad set of configuration files being generated and rapidly propagated across the network."
Cloudflare's proxy enforces a 200-feature runtime limit in its bot management system to prevent excessive memory use. A bad feature file containing more than 200 features tripped that limit, causing the system to panic and emit many 5xx errors. The bad file appeared intermittently because a ClickHouse query regenerated the feature file every five minutes while the cluster was being gradually updated, and only runs that landed on an already-updated part of the cluster produced bad data, so good and bad files propagated across the network by turns. The failures stopped fluctuating and settled into a steady failing state once every node was producing the bad file. Recovery involved stopping generation of the bad file, inserting a known-good file, forcing a restart of the core proxy, and then restarting the remaining services.
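To make the failure mode concrete, here is a minimal sketch of a feature-count check that rejects an oversized file and keeps the last known-good set instead of crashing. It is not Cloudflare's code: only the limit of 200 comes from the article, while `FeatureStore`, `validate_features`, the feature names, and the file sizes are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field

FEATURE_LIMIT = 200  # runtime cap quoted in the article


class FeatureLimitExceeded(Exception):
    """Raised when a feature file lists more features than the cap allows."""


def validate_features(features: list[str], limit: int = FEATURE_LIMIT) -> list[str]:
    """Return the features unchanged, or raise if the list exceeds the cap."""
    if len(features) > limit:
        raise FeatureLimitExceeded(
            f"{len(features)} features exceeds the limit of {limit}"
        )
    return features


@dataclass
class FeatureStore:
    """Hypothetical holder for the feature set currently in use."""

    active: list[str] = field(default_factory=list)

    def load(self, candidate: list[str]) -> bool:
        """Swap in the candidate if it is valid; otherwise keep the old set."""
        try:
            self.active = validate_features(candidate)
            return True
        except FeatureLimitExceeded:
            # Assumption for this sketch: fall back to the last good file
            # rather than letting the error take the process down.
            return False


store = FeatureStore()
good = [f"feature_{i}" for i in range(60)]            # arbitrary size under the cap
bad = good + [f"extra_{i}" for i in range(150)]       # 210 entries, over the cap

accepted_good = store.load(good)  # the good file becomes the active set
accepted_bad = store.load(bad)    # the bad file is rejected, the good set stays
```

In this sketch the oversized file is treated as a recoverable error, so the store keeps serving the previous feature set. That is one possible design choice, shown here only for contrast with the panic described above.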
Read at Ars Technica