
"Welcome to the Architects Podcast where we discuss what it means to be an architect and how architects actually do their job. Today we are going to talk about something that is very important to architects but is not often explicitly discussed. We have spoken quite a bit on this podcast about reliability and designing for failure, but we have not spoken about what we do to make our system design more robust, not just fixed after it has a failure."
"He is the program lead for Microsoft's SRE Academy, the program for onboarding and training of Azure SREs and others who strive to improve reliability and quality. He has roughly 40 years of experience in the operations space and he's the co-founder of the SREcon Conference, and the curator of Seeking SRE by O'Reilly and is the author of Becoming SRE, also by O'Reilly."
Site reliability engineering (SRE) approaches reliability from operational and service perspectives, emphasizing proactive robustness rather than only fixing systems after they fail. SRE practice focuses on serving users and enabling others to get the best from technology. Reliability engineering includes designing for failure, improving system robustness, and applying operational lessons to cloud services. Knowledge in the field spreads through onboarding and training programs for SREs, long operations experience, conferences, curated resources, and published guidance. Formal training programs institutionalize that knowledge transfer and help teams apply consistent approaches to reliability and quality.
Read at InfoQ