Site Reliability Engineering
What does it mean?
Site Reliability Engineering (SRE) comes from the same philosophical and technical pedigree as DevOps. DevOps is a business capability that has gained importance because it integrates automated testing, security, and automated deployment into one functional set within the Software Development Lifecycle (SDLC).
SRE delivers performance and resilience by ensuring that support is automated and tested, and that the same errors do not recur once a service transitions to business as usual (BAU). At its core, SRE uses automation and proactive monitoring to fix the root causes of problems, rather than treating symptoms after an incident.
Ben Treynor (VP, Google Engineering) states that Site Reliability Engineering (SRE) is “what happens when you ask a software engineer to design an operations function”, and Google defines SRE as treating an operations function as if it were a software problem.
Some SREs work to an “error budget” - the amount of unreliability a system is allowed over a given period, derived from its reliability target. If the reliability of a system for which they have responsibility falls below that target, they are responsible for making sure the underlying errors are engineered out of the system permanently.
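To make the idea of an error budget concrete, the arithmetic can be sketched as below. This is a minimal illustration rather than a prescribed method: the function names, the 99.9% availability target, and the 30-day window are all assumptions chosen for the example, not figures from any particular team.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability target over a window.

    slo_target is a fraction such as 0.999 (an assumed example target).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    # Hypothetical service: 99.9% target, 30-day window, 21.6 minutes of downtime so far.
    print(round(error_budget_minutes(0.999), 1))    # 43.2 minutes allowed
    print(round(budget_remaining(0.999, 21.6), 2))  # 0.5, i.e. half the budget left
```

Once the remaining fraction reaches zero or goes negative, the budget is spent, which is the point at which reliability work would typically take priority.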
Why do we believe it's important?
Often there is a disconnect between the business and the engineering teams, which can create unnecessary friction between the technology function and the rest of the organisation. With DevOps and SRE we see these barriers dissolve, as the business itself becomes closer to the software development process. Whilst “you build it, you run it” is a valid statement for new development, we also need to automate as much manual “drudge” out of BAU as possible, and help engineers continually improve both the development and support processes as roles and operations continue to evolve. As engineers are exposed to recurring support issues, they will naturally try to streamline the process and automate as much as possible so they do not have to keep fixing the same problem - they “automate out” what they can at the root cause.
How do we put it into practice?
A leading FTSE 150 entertainment company was looking to improve its customer-facing digital services: to gain greater efficiency and consistency in its current processes, and to improve the detection and diagnosis of major incidents (critical issues that could directly impact revenue).
We worked with the client to develop an SRE function that provides proactive monitoring, diagnostics, and automation capabilities that help reduce downtime whilst continuously improving service levels. This resulted in the development of common processes and skill sets that worked across traditional silos, and a more consistent SRE-led Major Incident Management (MIM) process that reduced overall Mean Time To Repair (MTTR).