
"On-call duty is one of the most important - and most mismanaged responsibilities in engineering. If done right, it protects your systems and distributes the load fairly. If done wrong, it destroys team morale and drives your best engineers to the door."
"According to the 2024 State of Engineering Management Report, 65% of engineers reported experiencing burnout in the past year. On-call stress is a major contributing factor, and it compounds quickly when rotations are poorly designed, alert noise is high and there's no automation to catch the easy stuff."
"The core problem is rarely the of being on-call - it's the accumulation of bad patterns that make it unbearable. On-call engineers typically allocate 30 - 40% of their bandwidth during an on-call period to incident responsibilities. When that load spikes beyond sustainable thresholds, or when rotations are unfair, the effects cascade fast."
On-call responsibilities are essential for maintaining system reliability, but they frequently become unsustainable because of poor rotation design, high alert volume, and insufficient automation. Engineers typically allocate 30-40% of their bandwidth during on-call periods to incident responsibilities, and 65% of engineers reported experiencing burnout in the past year, according to the 2024 State of Engineering Management Report. The core issue is not on-call itself but accumulated bad patterns: unfair rotations, excessive alert noise, and no automation for routine issues. High-performing SRE and platform engineering teams address this through sound rotation models, fair compensation, better alert hygiene, careful tooling selection, and strategic automation. Designing a sustainable on-call program requires accounting for team size, geographic distribution, and service criticality.
Read at DevOps.com