
Short-lived, recurring database outages made LinkedIn's user feed unavailable for 10–15 seconds at a time; the database then recovered without leaving useful logs or showing a clear external trigger. Conventional monitoring and metrics did not reveal the root cause, so engineers investigated OS and runtime behavior during the freezes. Each incident coincided with a momentary spike in memory allocation, after which the system stabilized at a higher memory baseline; CPU throttling, memory fragmentation and compaction, and file I/O were ruled out. Engineers then built an automated "trap" that detected a freeze and immediately captured an off-CPU profile using eBPF. A BCC-based script continuously monitored database health and triggered BCC's offcputime.py to record kernel stack traces of blocked or sleeping threads for 15 seconds, which let them observe the system during a live freeze.
"When LinkedIn engineers encountered short-lived, recurring outages where the database powering their user feed became unavailable and then recovered without leaving helpful traces, they had to devise a novel approach to uncover the root cause using off-CPU profiling with eBPF."
"Investigating those incidents was especially challenging because they were ephemeral, lasting only 10-15 seconds, and left no useful logs. Additionally, they recurred with no clear pattern and showed no clear external trigger. A first clue emerged by correlating the incidents with the system memory behavior, which showed that each event coincided with a momentary spike in memory allocation, quickly resolved with the system stabilizing at a higher baseline."
"Thus, the analysis based on conventional monitoring and metrics provided no hints at the root cause of the issue, which prompted LinkedIn engineers to dig deeper into the OS and runtime-level behavior during the freezes. Their approach turned to off-CPU profiling to understand what threads were blocked at the time."
"Our solution was to build a trap. We wrote a monitoring script that would automatically capture an off-CPU profile the instant a freeze was detected. The script used an eBPF toolkit, BCC, to continuously monitor database health and immediately trigger the BCC offcputime.py profiler to record kernel stack traces of blocked or sleeping threads for 15 seconds. This allowed LinkedIn engineers to capture an off-CPU profile during a live freeze."
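The trap described above might be sketched roughly as follows. This is a minimal illustration, not LinkedIn's script: the function names (`build_offcputime_cmd`, `run_trap`, `capture_to_file`), the one-second polling interval, and the install path of the tool are all assumptions, and the health probe is left as a caller-supplied function because the source does not say how database health was checked. The only parts taken from BCC itself are the `offcputime` options: `-K` (kernel stacks only), `-f` (folded output), `-p PID`, and a trailing duration in seconds.

```python
import subprocess
import time
from typing import Callable, List, Optional

# Assumed install path for the BCC tools; it varies by distribution
# (some packages ship the tool as `offcputime-bpfcc` instead).
OFFCPUTIME = "/usr/share/bcc/tools/offcputime"


def build_offcputime_cmd(pid: int, seconds: int = 15, tool: str = OFFCPUTIME) -> List[str]:
    """Build an offcputime command line: kernel stacks only, folded output."""
    return [tool, "-K", "-f", "-p", str(pid), str(seconds)]


def run_trap(
    probe: Callable[[], bool],
    capture: Callable[[], None],
    interval_s: float = 1.0,
    max_checks: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe`; on the first failed check, call `capture` once.

    Returns True if the trap fired, False if `max_checks` healthy checks
    passed without a failure. With max_checks=None it polls indefinitely.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        if not probe():
            capture()
            return True
        checks += 1
        sleep(interval_s)
    return False


def capture_to_file(pid: int, out_path: str, seconds: int = 15) -> None:
    """Run offcputime (needs root) and write the folded stacks to out_path."""
    with open(out_path, "w") as out:
        subprocess.run(build_offcputime_cmd(pid, seconds), stdout=out, check=True)
```

A caller would pass a hypothetical probe (for example, a function that runs a trivial query with a short timeout and returns False on timeout) and `lambda: capture_to_file(db_pid, "freeze.folded")` as the capture step. Because the freezes last only 10 to 15 seconds, the polling interval and the profiler's own startup time both eat into the window that can actually be observed, which is why the capture is triggered on the first failed check rather than after several.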
Read at InfoQ