When Kafka Lag Lies: A Production Debugging Story
Briefly

"Kafka lag is usually boring.Producer faster than consumer? Lag grows.Consumer slower? Lag grows. This story wasn't that. This was a production system that had been running for years, suddenly showing persistent lag, even though: data ingestion was way lower than before databases were idle no errors were logged commits appeared to be configured correctly Here's how we debugged it - and the subtle lesson hiding behind a single collect."
"The Key Insight Kafka lag is not "how much work remains". Kafka lag is: how many offsets have not been committed That sounds obvious - until you realize: Offsets are only committed if something reaches the sink. The Innocent-Looking Code Here's the original filtering logic (simplified): Flow[KafkaMessage].mapAsync(1) { message => regionOfInterestFuture.map { roi => Some(message).filter(isOfInterest(regionOfInterest)) } } .collect { case Some(message) => message } At first glance, this looks perfectly fine."
An Akka Streams + Kafka production system with 24 partitions exhibited persistent consumer-group lag of 10k–60k despite much lower ingestion than usual, idle databases, no logged errors, and commit configuration that appeared correct. Standard checks ruled out unexpected producers and database slowdowns. The essential insight is that Kafka lag measures uncommitted offsets, not remaining work, and offsets are only committed when a record reaches the downstream sink. A filtering pattern built from mapAsync and collect dropped uninteresting messages before they reached the sink, so their offsets were never committed. A change in the incoming data distribution exposed this latent behavior after years of correct operation.
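One way to avoid the trap - sketched here under the same hypothetical setup as the previous snippet, and not necessarily what the article's team shipped - is to filter only the payload while passing every message's committable offset along, so offsets for skipped messages still reach the committer sink and lag reflects real backlog again.

    // Reuses consumerSettings, regionOfInterestFuture, isOfInterest and writeToSink
    // from the sketch above (all hypothetical stand-ins).
    Consumer
      .committableSource(consumerSettings, Subscriptions.topics("events"))
      .mapAsync(1) { msg =>
        regionOfInterestFuture.map { roi =>
          // Filter the payload, but keep the committable offset either way.
          (Some(msg.record.value).filter(v => isOfInterest(roi)(v)), msg.committableOffset)
        }
      }
      .mapAsync(1) {
        case (Some(payload), offset) => writeToSink(payload).map(_ => offset)
        case (None, offset)          => Future.successful(offset) // no work, still commit
      }
      .runWith(Committer.sink(CommitterSettings(system)))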
Read at Medium