Tech Insights
Keet Sugathadasa
March 23, 2020

What is Site Reliability Engineering (SRE)?

When it comes to SRE, short for Site Reliability Engineering, the resources available online are largely limited to the books published by Google themselves. They do share some useful case studies that help us understand what SRE is and the concepts behind it, but they do not clearly explain how to build your own SRE team for your organisation. The concept of SRE was cooked up within the walls of Google and later released to the general public as a practice for anyone to follow.

In this article I would like to give a brief introduction to SRE and why it is important to any Software Engineering organisation. This is based on my experience and learning from leading a Site Reliability Engineering team for one of my former clients.

Introduction

In the early stages of any new software product, the user base is very small and the primary focus is on delivering features as quickly as you can, to reach a stable position in the market. During this period, you might get a few tickets which can be handled by the developers themselves, and the same goes for DevOps tasks. But as the System grows, the developers will have to focus primarily on Development, and you will have to start hiring Support Engineers and SysAdmins to take care of the operational tasks. But what will happen when it grows further, when the SysAdmins are no longer able to cope by themselves? You will have to hire more SysAdmins and Support Engineers to take care of the reliability of the system. As the system grows, the cost of operational tasks grows linearly with it. Where will it stop?

The day-to-day responsibilities of Software Engineers and Operations Engineers keep increasing, and growing organisations need to find approaches that keep the system as stable as possible. You need your Site to become more and more reliable as it grows, in terms of Scalability, Availability and other aspects. If you fail to meet the customers' expectations, your product will fail in the industry and lose its traction completely. How can we tackle problems like this in the real world whilst ensuring the operational costs stay within our budgets? The burning question that was asked a long time back was:

It is a truth universally acknowledged that systems do not run themselves. How, then, should a system—particularly a complex computing system that operates at a large scale—be run?

This is the basic need for Site Reliability Engineering: a dedicated set of engineers who build their own practices to ensure that the Site is reliable at any given point in time. In any growing system, you need a set of engineers who will look for new ways to improve the stability of production systems with proper monitoring and automation-first practices.

What is Site Reliability Engineering?

SRE is a framework introduced by Google in 2003 for operating large-scale production systems in a reliable manner. This may sound like an operations function, but it is not. According to the founder of Google's SRE team, Ben Treynor:

SRE is what happens when you ask a Software Engineer to design an operations function for your system.

It's a very versatile approach which allows you to reliably run mission-critical systems, no matter how big or small the system is.

SRE is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. A Site Reliability Engineer spends up to 50% of their time on operational work such as being on-call, manual workloads and documentation. For the remaining 50% of the time, an SRE is expected to do actual development, such as new features, deployments and automation tasks. A system managed by SREs is meant to be self-healing and very proactive. SRE owns the entire production environment and has to ensure that the Site is reliable, no matter what gets released to production.
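The 50% split can be checked against a simple time log. The following is a minimal sketch, assuming a hypothetical `TimeEntry` record in which each block of hours is marked as operational or not; the names and the sample week are illustrative assumptions, not part of any SRE tooling.

```python
from dataclasses import dataclass

OPS_CAP = 0.50  # upper bound on operational work mentioned in the text


@dataclass
class TimeEntry:
    hours: float
    operational: bool  # True for on-call, manual work, documentation, etc.


def operational_share(entries):
    """Return the fraction of logged hours spent on operational work."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0
    ops = sum(e.hours for e in entries if e.operational)
    return ops / total


# Hypothetical week: 22 operational hours out of 40 logged hours.
week = [TimeEntry(12, True), TimeEntry(10, True), TimeEntry(18, False)]
share = operational_share(week)
over_cap = share > OPS_CAP
print(f"operational share: {share:.0%}, over cap: {over_cap}")
```

A team that sees this share drift above the cap has a signal that operational work is crowding out engineering work.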

In my opinion, an ideal SRE is a software engineer with a strong background in administering and operating production systems. From what I see, you can do Site Reliability Engineering without having a Site Reliability Engineer, and you may already have engineers playing the role of SRE without even having an SRE Team. Site reliability engineering is a cross-functional role, assuming responsibilities traditionally siloed off to development, operations, and other IT groups. SREs will seek to automate everything that comes their way, to make room for actual engineering work rather than manual labor.

Demand for Site Reliability Engineering

The usual question about SRE is whether it is suitable for small organisations. This is highly debatable, but my belief is that it is. Even in a small organisation, there is always someone taking care of the operations work from time to time. As I said earlier, you may already have SREs working under you without even knowing it. SRE has grown as a practice for larger organisations, but small organisations would do well to adopt the practices even without establishing an SRE team.

The mindset of an SRE is different from that of a Software Engineer or an Operations Engineer. SREs always think of ways to automate most of the operations work rather than doing it manually. This mindset is something that needs to grow from within, where you think of ways and tools to alert, monitor, and automate most of the tasks at hand, in order to make the system more reliable.

As a Software Engineer, you will gain in-depth knowledge of a single area. But as an SRE, you will broaden your knowledge across a vast area by learning about the different technologies available in the industry.

The demand for Site Reliability Engineers has grown rapidly throughout the world, and as a result, the average salary of an SRE is higher than that of a Software Engineer. If you search for SRE positions on Glassdoor, you will find over 70k positions available worldwide. The demand keeps growing as organisations start to understand the value of SREs in keeping the Site reliable. If you really want to be a Site Reliability Engineer, ask yourself the following questions.

· Do you want to improve your coding skills in terms of scripting, dashboarding, monitoring and alerting?

· Are you interested in learning about how complex production systems work?

· Do you possess the leadership and communication skills to communicate with different stakeholders?

· Are you willing to research and read about new technologies in the market? (You need to become a Jack of all Trades, in terms of the Software Industry)

DevOps vs SRE

DevOps is a set of practices that combines software development (Dev) and IT operations (Ops), aiming to shorten the systems development life cycle and provide continuous delivery with high software quality. SRE and DevOps are strongly related to each other, because they both work towards the same targets. But the way SRE sees the system is different from a traditional DevOps culture. There is a common saying in software terms, as follows.

SRE Implements DevOps

First let's understand the 5 key pillars of success of DevOps.

DevOps - 5 Key Pillars of Success

How SRE satisfies these 5 Pillars

1) Reduce organisational silos

· SRE shares ownership with developers to create a shared responsibility

· SREs use the same tools that developers use, and vice versa

2) Accept failure as normal

· SREs embrace risk

· SRE quantifies failure and availability in a prescriptive manner using Service Level Indicators (SLIs) and Service Level Objectives (SLOs)

· SRE mandates blameless postmortems

3) Implement gradual changes

· SRE encourages developers and product owners to do small deployments gradually to reduce the cost of failure

4) Leverage tooling and automation

· SREs have a charter to automate menial tasks (called "toil") away. I will explain a bit more on toil later.

5) Measure everything

· SRE defines prescriptive ways to measure values

· SRE fundamentally believes that systems operation is a software problem

What does a Site Reliability Engineer Do?

The role of a Site Reliability Engineer (SRE) is not properly defined anywhere. It's more of a culture and a set of norms built by organisations to tackle production-related matters on their own. Hence, the role of an SRE differs from organisation to organisation. But there is a common set of practices that SREs follow, which includes, but is not limited to, the following.

1) Monitor Almost Everything

Most of the systems we see today are highly distributed, and it is very rare to see non-distributed, monolithic architectures. In my understanding, the role of an SRE is not limited to monitoring the Distributed System itself, but extends to monitoring almost everything. Monitoring can and should cover the production applications, deployment servers, underlying infrastructure, code quality, and even the Mean Time to Deliver a system.

2) Ensuring System Compliance with industry standards

The system you are maintaining might have agreed to comply with industry standards like the ISO 27001 security standard and the ISO 9001 quality standard. In this case, there should be a way to monitor whether the system is in line with these standards or not.

3) Measuring Service Level Objectives (SLOs)

SLOs are a key aspect of any System, describing the overall behaviour of a production system. I will explain more on this in a section below. For now, just assume that this is about measuring the uptime of a System. Have you seen systems stating that they are available 99 percent of the time? The more 9s are added to this figure, the stricter the target becomes. The following table will give you the idea.
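What each additional 9 means in practice can be worked out from the availability figure itself. The following is a minimal sketch that converts an availability target into the downtime it permits per year; it assumes a 365-day year and the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Return the downtime (in minutes) that an availability target permits."""
    return period_minutes * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:>7}% -> {minutes:10.2f} minutes/year ({minutes / 60:7.2f} hours)")
```

Each extra 9 cuts the permitted downtime by a factor of ten, from roughly 3.65 days per year at 99% to a little over 5 minutes per year at 99.999%.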

4) Provide Compensation for SLA breaches

Many production applications have licensed or paying customers, and we need to provide a reliable system to them. If the system is not reliable enough for the customers, they will question why they should keep paying for this software. Even highly reliable systems can go down unexpectedly. But based on past metrics, every company defines its availability, and a breach of this number will cause the company to pay the customers back in the form of cash, credit, or discounts.

But this is something the customers cannot monitor. This can only be monitored by SLOs, and this is where SRE comes into play. Have a look at how Google Cloud Platform penalises its own services if they fail to adhere to the SLAs. The below screenshot is from Google Compute Engine. (Compute Engine SLA).
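The mechanics of such a compensation scheme can be sketched as a lookup from measured uptime to a credit percentage. The tiers below are hypothetical values invented for illustration; they are not taken from the Compute Engine SLA or from any other real agreement.

```python
# Hypothetical credit schedule: (minimum monthly uptime %, credit as % of the bill).
# These tiers are invented for illustration and do not come from any real SLA.
CREDIT_TIERS = [
    (99.99, 0),
    (99.0, 10),
    (95.0, 25),
    (0.0, 50),
]


def service_credit_percent(measured_uptime_percent):
    """Return the credit percentage owed for a given measured monthly uptime."""
    for min_uptime, credit in CREDIT_TIERS:
        if measured_uptime_percent >= min_uptime:
            return credit
    return CREDIT_TIERS[-1][1]


def credit_amount(monthly_bill, measured_uptime_percent):
    """Return the amount credited back to the customer for the month."""
    return monthly_bill * service_credit_percent(measured_uptime_percent) / 100


print(credit_amount(2000, 99.5))
```

The measured uptime fed into such a calculation is exactly the kind of number the SRE team is expected to produce and stand behind.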

5) Automate Everything

As an SRE, you should try to reduce the amount of manual work. You will have to build a lot of automation scripts so that you can just sit back and have a cup of coffee while your system is running smoothly. So, as an SRE, your first question should always be the following.

Can I automate this task as well?

6) Provide On-Call Support for Major Incidents

SRE is not a support engineer position. The first call regarding complaints should come to the support engineers. Then, if it is a High Severity incident, the Support Team decides to wake up the SRE Team. In this case the SRE is responsible for analysing the incident and waking up others who are required to solve this crisis. I use the word crisis because this process should not be triggered unless the incident is defined to be organisation-wide. SRE will take care of the incident from top to bottom and, after it is resolved, SRE will create a Postmortem document and hold a retrospective with the leads to ensure that this will not happen again. These postmortems should be blameless, and serve only as a learning exercise.
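The escalation path described above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical three-level severity scale and an in-memory `Incident` record; real paging and ticketing systems are not modelled.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Incident:
    title: str
    severity: Severity
    timeline: list = field(default_factory=list)


def route(incident):
    """Support takes the first call; only high-severity incidents page the SRE team."""
    incident.timeline.append("support: triaged")
    if incident.severity >= Severity.HIGH:
        incident.timeline.append("sre: paged")
        return "sre"
    return "support"


def resolve(incident, owner):
    """Record the resolution and, for SRE-owned incidents, the follow-up postmortem."""
    incident.timeline.append(f"{owner}: resolved")
    if owner == "sre":
        # Blameless postmortem: the record covers what happened, not who did it.
        incident.timeline.append("sre: blameless postmortem scheduled")


outage = Incident("checkout unavailable", Severity.HIGH)
owner = route(outage)
resolve(outage, owner)
print(outage.timeline)
```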

7) Communication with Development Teams and Management

SRE needs to ensure they build a good rapport with the development teams, and that they provide the necessary details to the management when needed. The management will depend on the metrics provided by the SRE to make many business decisions in the organisation.

Service Level Terminology

Have you ever wondered how to measure the behaviour of a service? How could you actually measure whether a production application is running smoothly or not? We sometimes go with gut feeling: if the users seem happy, we conclude that the service is running smoothly. These applications could be internal APIs, or even Public Applications used by the general public. Nevertheless, the service should have proper metrics that we can investigate to measure the quality of the application. Some applications might behave as intended for some users, and not for others. This is where we need to define levels of service for the users, so that they understand what to expect when using an application. This does not describe the actual features or requirements provided by the application, but defines how the application behaves in a live production environment.

This is where we need to introduce proper metrics and keep monitoring them so that stakeholders are aware of the behaviour of the application over time. In Site Reliability Engineering, there are three main concepts around which metrics need to be collected.

· Service Level Indicators

· Service Level Objectives

· Service Level Agreements

These measurements describe the basic properties the applications should have (SLI), what values we want these metrics to have or maintain (SLO), and how we should react if we are not able to provide the expected service (SLA). Defining these metrics is very important for SREs to understand the behaviour of the application and to be confident about the production environment.

The term Service Level Agreement (SLA) is something we are all familiar with, but the word has taken different forms in the Software Industry, based on the context. This section explains the terminology in depth, so that readers have the exact definitions and defining these metrics becomes crystal clear.

Service Level Indicator - SLI

This is a carefully defined quantitative measure of some aspect of the level of service that is provided. Some values that we actually need to measure might not be directly available for us to monitor. For example, network delays on the client side might not be directly measurable by our monitoring tools. In such cases, these values cannot serve as metrics and other aspects have to be measured instead. The most commonly used SLIs are given below.

· Request Latency: How long it takes to return a response to a request.

· Error Rate: Number of failed requests over the total number of requests.

· System Throughput: This is measured as requests per second. E.g. the service can handle 20 requests per second.

· Availability: The fraction of the time the service is usable. This is directly related to the other SLIs defined here, and how they add up to the definition of "Availability" is a separate discussion.
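The first three indicators above can be derived from a plain request log. The following is a minimal sketch, assuming a hypothetical `Request` record and a nearest-rank 95th percentile for latency; availability is left out because, as noted, its definition is a separate discussion.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    timestamp: float   # seconds since the start of the measurement window
    latency_ms: float  # time taken to return the response
    ok: bool           # False if the request failed


def percentile(values, pct):
    """Nearest-rank percentile; pct is in the range (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def compute_slis(requests, window_seconds):
    """Compute latency, error rate and throughput SLIs for one window."""
    if not requests:
        raise ValueError("no requests in the window")
    total = len(requests)
    failed = sum(1 for r in requests if not r.ok)
    return {
        "latency_p95_ms": percentile([r.latency_ms for r in requests], 95),
        "error_rate": failed / total,
        "throughput_rps": total / window_seconds,
    }


# Hypothetical window: ten requests in ten seconds, one of which failed.
sample = [Request(t, latency_ms=100 + 10 * t, ok=(t != 7)) for t in range(10)]
slis = compute_slis(sample, window_seconds=10)
print(slis)
```

A percentile is used for latency rather than a mean because a handful of very slow requests can hide behind a healthy average.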

Service Level Objective - SLO

This is a target value or a range of values for a service level that is directly measurable by an SLI. Deciding on a proper Service Level Objective is a bit tough and opinion-based. Monitoring is important here; SLOs you cannot monitor won't have any value at all. For example, measuring the network delays on the user side is impossible unless you maintain a frontend client app to do so, so having that as an SLO is not that relevant. Having a proper SLO defined for the application is critical not only for the management, but also for the users. It sets the expectations for everyone on how the application will perform. If someone complains that the application is running very slowly, we can correlate this with the metrics gathered for the SLOs to see whether the degradation affecting that user was actually captured. Without an explicit SLO, users often develop their own beliefs about desired performance, which may be unrelated to the beliefs held by the people designing and operating the service. This dynamic can lead to over-reliance on the service, when users incorrectly believe that a service will be more available than it actually is.
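Since an SLO is a target value or range for an SLI, checking it amounts to comparing a measured value against bounds. The following is a minimal sketch, assuming a hypothetical `SLO` record with optional lower and upper bounds; the targets and measurements are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SLO:
    sli_name: str
    lower: Optional[float] = None  # inclusive lower bound, if any
    upper: Optional[float] = None  # inclusive upper bound, if any

    def is_met(self, measured):
        """Return True if the measured SLI value lies within the target range."""
        if self.lower is not None and measured < self.lower:
            return False
        if self.upper is not None and measured > self.upper:
            return False
        return True


# Hypothetical objectives and one window of measured SLI values.
slos = [
    SLO("latency_p95_ms", upper=300),
    SLO("error_rate", upper=0.01),
    SLO("availability", lower=0.999),
]
measured = {"latency_p95_ms": 240, "error_rate": 0.02, "availability": 0.9995}

breached = [s.sli_name for s in slos if not s.is_met(measured[s.sli_name])]
print(breached)
```

Keeping the objective separate from the measurement makes it possible to tighten or relax a target without touching the monitoring code.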

Service Level Agreement - SLA

This is the legally binding agreement that answers the question, "what happens if the SLOs are not met?" This agreement speaks directly to the customer and communicates the consequences of failing to maintain a defined SLO. If the SLOs are met, the customer is happy, and if they are not met, the service provider will have to pay a penalty (in money or any other form) to the customer. This mostly concerns applications involving licenses and paid subscriptions.

The SRE Team does not get involved in deciding the SLAs, because SLAs are closely tied to business and product decisions. But SRE will get involved in taking action if the SLOs behind the SLA are not met.

Some organisations might not have a direct SLA with their customers, but an implicit one. For example, Google Search does not have an agreement with its users. But still, if the search results are generated slowly or incorrectly, the organisation will end up paying a penalty in the form of its reputation. Nevertheless, SLOs and SLIs are important, and later you can decide on how to implement an SLA for a service provided.

What is Toil?

This section will explain the concept of Toil in Site Reliability Engineering. As Site Reliability Engineers, we are required to perform a certain amount of operational activities in our day-to-day processes. That being said, if these operational activities turn into Toil, they should be eliminated by the SREs themselves. As SREs, we have many more crucial, long-running engineering tasks to carry out than spending most of our time on Toil. So, this section will try to give a definition for Toil and explain how SREs should tackle Toil in their day-to-day processes.

We often see the misuse of the word Toil in the Engineering domain. Toil is not just work that we have to do regularly, or work we get bored with. Tasks like writing documentation, conducting meetings and sending out emails cannot be considered Toil. These are merely administrative work, and in management terms this is simply called Overhead. So, when it comes to understanding Toil, it is definitely not just work that irritates and discomforts us. Feelings of this kind are highly subjective and can be interpreted in different ways.

Toil is work which possesses the following characteristics in general. It does not necessarily need to have all of the below properties, but at least a combination of them.

1) Manual Work

Executing a script or triggering a script. If a person needs to manually trigger a script in order to execute its steps, that is Manual Work. The time spent on it can be considered Toil time.

2) Repetitive Work

If you are performing a task once or twice, this is not Toil. But if you have to do it continuously, then it becomes Toil. For example, sending out a daily email to stakeholders is definitely Toil.

3) Possible to Automate

If the manual work you are doing can simply be converted to a script or automated programme, then that work is definitely Toil. By automating it, you will reduce the need for Human Effort to execute the task. But if it needs human judgement, like deciding whether something is a Bug or a Feature, then it is not Toil. This statement is still arguable, since you can use sophisticated tools like Machine Learning to support the judgement. In that case, it can be called Toil as well.

4) Tactical

Toil is interrupt-driven and reactive, rather than strategy-driven and proactive. For example, if and when an incident happens, we have to create a channel, pages, postmortems etc., which involves a lot of work, and what we do will differ from incident to incident. So, it will be hard to fully eliminate the process, but we should definitely work towards reducing it.

5) No Enduring Value

If the operational task did not change the state of the system, then it is definitely Toil. If the work you did changed the performance of the application or added a new feature to the System, then it cannot be considered Toil.

6) O(n) with service growth

If the work you do grows with the size of the system, requiring more resources and taking more time, then it is considered Toil. For example, if you are supposed to send a daily mail on the new incidents you get, and you suddenly get around 50 incidents overnight, you will have to manually analyse all of them and summarise them in an email. This is something the SREs should try to automate.
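The daily incident mail is a good candidate for automation because the summary can be generated from the incident records themselves. The following is a minimal sketch, assuming a hypothetical `IncidentRecord` with an id, a service name and a severity label; fetching the incidents and sending the email are left out.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class IncidentRecord:
    id: str
    service: str
    severity: str


def summarise(incidents):
    """Build the body of the daily summary from a list of incident records."""
    if not incidents:
        return "No new incidents."
    by_severity = Counter(i.severity for i in incidents)
    by_service = Counter(i.service for i in incidents)
    lines = [f"{len(incidents)} new incident(s)"]
    lines += [f"  severity {sev}: {n}" for sev, n in sorted(by_severity.items())]
    lines += [f"  service {svc}: {n}" for svc, n in by_service.most_common()]
    return "\n".join(lines)


# Hypothetical overnight incidents.
overnight = [
    IncidentRecord("INC-1", "api", "high"),
    IncidentRecord("INC-2", "api", "low"),
    IncidentRecord("INC-3", "db", "high"),
]
print(summarise(overnight))
```

With the summary generated this way, 50 incidents overnight cost the same human effort as 5, which is exactly what removes the O(n) growth.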
