As previously seen in TechBeacon
Site reliability engineering (SRE), the practice of taking software engineering concepts and applying them to IT operations, originated within Google. Thanks to the company's openness around the discipline—including publishing two books about it—SRE has gained in popularity across a wide variety of enterprises and industries.
But is it right for you? What's it like getting started with SRE, and is it a worthwhile exercise or just a misguided case of cargo-cult imitation?
The reality is that the approaches recommended by SRE can be very effective beyond Google, in the cloud and on-premises, and with both old and new IT systems. Here's an example of how our team adopted SRE within Accenture, and what you can learn from our experiences.
At Google, the process of adopting SRE means transitioning support responsibility out of development and into IT Ops, making it easier to adopt for a traditional IT organization. For this reason, my team uses the term "retrofitting SRE" to describe our approach of adopting SRE techniques within an IT operations team that's already running production software.
The two SRE aspects that were most successful for our team were service-level objectives (SLOs) with consequences, and "toil" work measurement. (I'll define these terms in more detail below.) Here's a rundown.
Our team decided to start our SRE experiment with a large internal DevOps system known as the Accenture DevOps Platform (ADOP). Initially, there were around 30 people on the team. We started here because it was fully under our control. The platform currently supports over 1,000 developers working on more than 100 client projects.
The irony of running a DevOps platform with effectively separate IT dev and ops teams was not lost on us. While we'd previously experimented with various things such as people rotation, we had never found a solution we were entirely comfortable with.
Our hypothesis was that by following the guidance in the SRE books, along with direct advice from some friendly Google customer reliability engineers (CREs), we might be able to improve customer experience, security, and the lives of team members.
My initial approach was to talk to the team members about what SRE is in theory, and then try to sell them on the idea that they might try some new ways of working that were self-led rather than driven by others outside the team.
The reaction was surprisingly positive. The stated intentions of SRE resonated well with the team members, and they seemed energized at the thought of trying out some new ways of working.
Our team decided to consider two elements of SRE:
Service-level objectives (SLOs) with consequences
Measuring with a view to eliminating toil
SLOs with consequences, and SLIs
SLIs (service-level indicators) and SLOs (service-level objectives) measure reliability in ways that aspire to be universally recognized across all stakeholders and possible silos in an organization.
The advice "if you measure it, you can affect it" then comes into play—and that is where the consequences come in.
With a consensus about what defines reliability, what level you want to achieve, and what you expect to do if reliability falls short of aspirations, you have, in theory, a self-balancing system that invests enough in resilience.
Our first task was some fairly deep self-reflection about which qualities of our service define a good experience in the eyes of our customers. This empathetic exercise was quite revealing and provoked a lot of debate.
Despite at times feeling somewhat overwhelmed, we persisted until we had defined a collection of SLIs. Each SLI consists of a measurable aspect of system performance, an agreed-upon threshold that determines what we consider acceptable versus unacceptable, and a planned approach for how to measure it.
Measuring login time
The first SLI we decided to focus on was login time. While we were keen to measure what matters rather than just something that is easy to measure, we also wanted to take an SLI end to end as quickly as possible.
We decided that, to keep users happy, logins should take less than four seconds. Since SLIs are best expressed as percentages, we would measure the number of logins taking less than four seconds divided by the total number of logins.
We also decided that we would measure this over a 28-day window.
The next step was to determine our target SLI value, i.e., what percentage of logins should take less than four seconds. This is called the SLO. We decided to aim for 99%, meaning that if we have 300,000 logins per month, we would accept up to 3,000 of them taking four seconds or longer. Those 3,000 logins are our error budget.
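The arithmetic behind the login SLI and its error budget can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tooling our team used: the `Login` record, the function names, and the idea that login durations are available as a simple list are all hypothetical. Only the four-second threshold, the 99% target, and the 28-day window come from the text above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Values taken from the article.
THRESHOLD_SECONDS = 4.0
SLO_TARGET = 0.99
WINDOW = timedelta(days=28)


@dataclass(frozen=True)
class Login:
    """Hypothetical record of one login attempt."""
    timestamp: datetime
    duration_seconds: float


def login_sli(logins, now):
    """Fraction of logins in the 28-day window that took under four seconds.

    Returns None when there were no logins in the window.
    """
    recent = [l for l in logins if now - WINDOW <= l.timestamp <= now]
    if not recent:
        return None
    good = sum(1 for l in recent if l.duration_seconds < THRESHOLD_SECONDS)
    return good / len(recent)


def error_budget(total_logins, slo_target=SLO_TARGET):
    """Number of logins that may miss the threshold while still meeting the SLO."""
    return round(total_logins * (1 - slo_target))
```

With 300,000 logins and a 99% target, `error_budget(300_000)` gives the 3,000 slow logins mentioned above.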
Having set an SLI and an SLO, the final exercise was to agree to a specific consequence about what to do in the event that all of the error budget is used up and we have failed to meet the SLO.
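One way to decide when an agreed consequence should kick in is to track how much of the error budget remains. The sketch below is an assumption-laden illustration: the `BudgetStatus` type and `budget_status` function are hypothetical names, and the actual consequence is left to whatever the team has agreed on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetStatus:
    allowed_bad: int   # slow logins the SLO tolerates in the window
    remaining: int     # budget still unspent, never below zero
    slo_missed: bool   # True once slow logins exceed the budget


def budget_status(total_logins, bad_logins, slo_target=0.99):
    """Compare observed slow logins against the error budget (hypothetical helper)."""
    allowed = round(total_logins * (1 - slo_target))
    return BudgetStatus(
        allowed_bad=allowed,
        remaining=max(allowed - bad_logins, 0),
        slo_missed=bad_logins > allowed,
    )


status = budget_status(total_logins=300_000, bad_logins=3_200)
if status.slo_missed:
    # Placeholder: trigger whatever consequence was agreed in advance.
    print("Error budget exhausted: apply the agreed consequence")
```

Reporting the remaining budget, rather than only a pass or fail flag, lets the team see a miss approaching before it happens.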
With our SLI and SLO process proven end to end, we started to repeat it for progressively more areas of reliability. One example was wait time when requesting a new Jenkins Master.
Creating meaningful measurements has proved to be an interesting challenge. But the team has a much greater sense of clarity around reliability and purpose now, and we haven't looked back.
Reducing toil and controlling SRE team workloads
So we were using SLOs with consequences to elevate the importance of reliability, treating it as a feature worthy of prioritization against functional features, and this felt like a great step forward.
But if SLOs are fundamentally about protecting live services, the second part of our adoption has been about protecting the newly re-monikered SRE team.
Toil, according to Vivek Rau in the book Site Reliability Engineering, is "the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows."
Google wants its SREs to spend less than 50 percent of their time doing toil. To achieve that, toil—which can be thought of as the result of operational technical debt—needs to be identified and paid back through building more automation.
So our initial exercise was to come up with a view about what constitutes toil for the ADOP operations team. Perhaps unsurprisingly, it wasn't easy to track and visualize what work was having the biggest impact on the team's workload. So we had to start by making toil more visible, and here's what we did:
We implemented a 100 percent use of Jira tickets for tracking work. No gaps. Of course, to do this we needed a very lightweight ticket to avoid creating high administrative overhead.
We changed our Jira workflow to pop up the worklog window on every state transition. This allowed us to capture the amount of time spent on each ticket.
Toil vs. no toil
This proved quite successful at generating raw data. With this we were able to either map common tasks to the Google categories for toil or, alternatively, define tasks as not toil.
This categorization enabled us to start labeling our tickets, and finally we had elevated visibility of our toil.
What did we do with this information? We built automation to remove the need for toil-related work. But, crucially, because we were able to demonstrate how far our time spent on toil exceeded 50 percent, we had a clear mandate to prioritize toil payback stories and to actually work on them.
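To make the toil measurement concrete, here is a minimal Python sketch of turning labeled worklogs into a toil share. It assumes the worklog data has already been exported from the ticketing system into simple records. The `Worklog` type, the `"toil"` and `"not-toil"` label values, and the ticket keys are hypothetical; only the 50 percent ceiling comes from the text.

```python
from collections import defaultdict
from dataclasses import dataclass

TOIL_CEILING = 0.50  # the 50 percent limit cited from Google


@dataclass(frozen=True)
class Worklog:
    """Hypothetical exported worklog entry."""
    ticket: str
    label: str    # "toil" or "not-toil" (assumed label values)
    minutes: int


def toil_share(worklogs):
    """Fraction of all logged time that was spent on toil."""
    total = sum(w.minutes for w in worklogs)
    if total == 0:
        return 0.0
    toil = sum(w.minutes for w in worklogs if w.label == "toil")
    return toil / total


def toil_by_ticket(worklogs):
    """Toil minutes per ticket, largest first, to suggest automation candidates."""
    totals = defaultdict(int)
    for w in worklogs:
        if w.label == "toil":
            totals[w.ticket] += w.minutes
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


logs = [
    Worklog("OPS-1", "toil", 120),
    Worklog("OPS-2", "not-toil", 120),
    Worklog("OPS-3", "toil", 60),
]
if toil_share(logs) > TOIL_CEILING:
    print("Toil exceeds the ceiling; biggest sources:", toil_by_ticket(logs))
```

Ranking tickets by toil minutes is one simple way to pick which payback stories to automate first.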
Other teams had to respect the importance of us doing this. And just as SLIs and SLOs tempered the effects on stability that continuously releasing code can have, measuring toil tempered the fact that constantly supporting new systems can lead manual work to spiral out of control.
We still have further to go to reduce toil and to improve our use of SLIs and SLOs. But overall, retrofitting SRE dramatically exceeded our expectations and was a hugely positive experience for both the operations team and the consumers of the DevOps platform.
We found the terminology and techniques helpful in increasing the focus on reliability across silos. Most of all, it provided vital feedback loops that helped successfully prioritize the right level of focus on performance, resilience, and reliability engineering.