PagerDuty's Kafka Outage Silences Alerts for Thousands of Companies
Briefly

"PagerDuty, the incident management platform used by thousands of organisations to alert them to problems on their systems, suffered a major outage itself on 28th August 2025. The incident disrupted or delayed the processing of incoming events to customers in PagerDuty's US service region. Significant service degradations affected PagerDuty for more than nine hours. At its peak, approximately 95% of events were rejected over a 38-minute period, and 18% of create requests generated errors for 130 minutes."
"According to the outage report, the cause was a bug in a new feature being rolled out to improve the auditing and logging of API and key usage. As the incremental rollout progressed, usage on PagerDuty's Kafka clusters erroneously grew past the system's capacity. Due to a logical error in the aforementioned feature, a new Kafka producer was instantiated for every API request, rather than using a single Kafka producer to produce messages."
"The report explains that PagerDuty's interpretation of how to use the pekko-connectors-kafka Scala library caused this coding error. The report details the scope of the extra load: "Kafka ended up tracking nearly 4.2 million extra producers per hour at peak. This is 84 times higher than our typical number of new producers." It goes on to explain how Kafka started thrashing and then exhausting the JVM heap available to it, causing a cascading failure of the cluster."
PagerDuty suffered a major outage on 28 August 2025 that disrupted processing of incoming events in its US service region for over nine hours. At peak, roughly 95% of events were rejected for 38 minutes, and 18% of create requests returned errors for 130 minutes. The trigger was a bug in a new feature for auditing and logging API key usage: a logical error caused a new Kafka producer to be instantiated for every API request instead of reusing a single producer. As the rollout progressed, Kafka ended up tracking roughly 4.2 million extra producers per hour at peak, causing it to thrash, exhaust its JVM heap, and fail in a cascade across the cluster.
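The outage report does not include the offending code, so the sketch below is only an illustration of the antipattern it describes, using the SendProducer helper from pekko-connectors-kafka. The handler names, topic, and broker address are hypothetical, and whether PagerDuty used this particular abstraction is an assumption; the point is the contrast between building a producer per request and sharing one long-lived producer.

```scala
import org.apache.pekko.actor.ActorSystem
import org.apache.pekko.kafka.ProducerSettings
import org.apache.pekko.kafka.scaladsl.SendProducer
import org.apache.kafka.clients.producer.{ProducerRecord, RecordMetadata}
import org.apache.kafka.common.serialization.StringSerializer
import scala.concurrent.Future

object AuditProducerSketch {
  implicit val system: ActorSystem = ActorSystem("api-audit")

  val producerSettings: ProducerSettings[String, String] =
    ProducerSettings(system, new StringSerializer, new StringSerializer)
      .withBootstrapServers("localhost:9092") // placeholder broker address

  // Antipattern: every API request builds its own SendProducer, i.e. its own
  // underlying KafkaProducer with separate broker connections and metadata the
  // brokers must track. Under production traffic this multiplies producer counts
  // by orders of magnitude and leaks resources, since nothing ever closes them.
  def recordApiUsageBuggy(auditEvent: String): Future[RecordMetadata] = {
    val producer = SendProducer(producerSettings) // new Kafka producer per call
    producer.send(new ProducerRecord("api-usage-audit", auditEvent))
  }

  // Fix: create one long-lived producer at startup and share it across requests.
  val sharedProducer: SendProducer[String, String] = SendProducer(producerSettings)

  def recordApiUsage(auditEvent: String): Future[RecordMetadata] =
    sharedProducer.send(new ProducerRecord("api-usage-audit", auditEvent))
}
```

In this shared-producer arrangement, the only remaining lifecycle concern is closing the producer once at shutdown (SendProducer.close returns a Future) rather than per request.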
Read at InfoQ