Change as Metrics: Measuring System Reliability Through Change Delivery Signals
System changes cause 60-80% of production incidents, so change-delivery metrics deserve to be treated as first-class reliability signals, in line with DORA framework principles.
QConAI NY 2025 - Designing AI Platforms for Reliability: Tools for Certainty, Agents for Discovery
Reliable agentic AI combines probabilistic model components with deterministic boundaries, and it layers models over operational systems rather than replacing them.
AWS Debuts "DevOps Agent" to Automate Incident Response and Improve System Reliability
AWS DevOps Agent acts as an autonomous, always-on on-call engineer, integrating with observability, deployment, and ticketing tools to automate incident response and improve reliability.
From Grassroots to Enterprise: Vanguard's Journey in SRE Transformation
Vanguard grew its SRE program from a grassroots effort with minimal resources into an organization-wide job family, emphasizing performance, resilience, coaching, and technical solutions.