Fix SLO Breaches Before They Repeat: An SRE AI Agent for Application Workloads
Briefly

"My name is Bruno Borges. I work at Microsoft. Today we're going to talk about SLO breaches. Who here deals with that? Who doesn't want to deal with that? We're going to talk about defining performance, setting SLOs, setting objectives. What is performance diagnostics? How do we do it? Then we're going to talk about SRE agents. If it can be automated, it should be automated. That's our mantra. This talk is about troubleshooting performance needs and SLO breaches. It's not about performance tuning."
"Advanced performance tuning, I think that really depends on the type of workload that you have, the type of stack that you have, the type of language runtime that you're using. Who here deals with JVMs? Who here deals with .NET CLR? Go? Node? I'm not going to give you a lecture about resilient architecture. There are many more people and many books published over decades on that."
Performance means meeting user expectations for responsiveness, reliability, duration, and cost-efficiency. Establish measurable SLIs and SLOs to quantify acceptable behavior and detect breaches. Troubleshooting SLO breaches is distinct from advanced performance tuning, which depends on the workload, stack, and language runtime. Automate diagnostics and remediation wherever possible, using SRE agents to reduce manual effort. The emphasis is on defining objectives, observability, and repeatable diagnostic workflows. Complement automated diagnostics with study of resilient architecture and SRE practices to handle load, scaling, and platform-specific tuning. Measure cost and failure modes, and prioritize fixes by user-facing impact and SLO violations.
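To make the SLI/SLO idea concrete, here is a minimal sketch in Python. It is not taken from the talk: the `Slo` type, the function names, the 300 ms threshold, and the 99% target are all hypothetical values chosen for illustration. It treats the SLI as the fraction of requests that complete within a latency threshold and flags a breach when that fraction falls below the SLO target.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Slo:
    """Hypothetical latency SLO: `target` is the required fraction of good requests."""
    name: str
    latency_threshold_ms: float
    target: float


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """SLI: fraction of requests served at or under the threshold."""
    if not latencies_ms:
        return 1.0  # assumption: no traffic counts as compliant
    good = sum(1 for value in latencies_ms if value <= threshold_ms)
    return good / len(latencies_ms)


def is_breached(slo: Slo, latencies_ms: Sequence[float]) -> bool:
    """A breach occurs when the measured SLI drops below the SLO target."""
    return latency_sli(latencies_ms, slo.latency_threshold_ms) < slo.target


# Illustrative data: 95 fast requests and 5 slow ones.
sample = [100.0] * 95 + [500.0] * 5
checkout_slo = Slo(name="checkout-latency", latency_threshold_ms=300.0, target=0.99)

if __name__ == "__main__":
    print(latency_sli(sample, checkout_slo.latency_threshold_ms))
    print(is_breached(checkout_slo, sample))
```

In practice the latencies would come from an observability backend over a rolling window rather than an in-memory list, and a breach signal like this is the kind of trigger an automated diagnostic workflow could start from.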
Read at InfoQ