How they SRE: Insights from the Cloudflare SRE team
Introduction
Cloudflare is a global cloud services provider with offices around the world, from San Francisco to London to Sydney. Their mission, as stated front and center on their homepage, is to help build a better Internet. While that may read like hyperbole, their numbers are impressive: Cloudflare has over 126,000 paying customers, and 95% of Internet users in the developed world are within 50ms of their network.
We spoke with Vignesh from the SRE (Site Reliability Engineering) team at Cloudflare to understand the practices that help make this achievement possible. Some of the key points are below.
Vignesh: “We build our roadmap similar to any other product or engineering team.”
As an SRE team, our customers are internal engineering teams. To best meet the needs of these users, we try to understand their pain points by sending out questionnaires and holding discussions with them. It's a similar process to the one product teams at most organizations follow. Based on these, we write a spec or RFC (Request for Comments) to get feedback from other engineering and SRE teams. This results in a formal spec that can be added to the quarterly roadmap. The process can take from a few weeks to a few months.
“Maintaining infrastructure is a big part of the job”
Many companies today use public cloud providers like Amazon Web Services (AWS) or Microsoft Azure. Cloudflare, however, maintains all of their infrastructure in-house, meaning it's up to the SRE teams to keep things working smoothly. To help them manage this, they use Salt for infrastructure automation. Salt is an open source framework for configuring large numbers of systems; one of its core functions is remote execution, which lets you run commands across many machines at once.
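To make that remote-execution model concrete, here is a minimal sketch using Salt's Python client API. It assumes the code runs on a Salt master with minions already registered; the targets, commands, and state names are illustrative, not Cloudflare's actual configuration.

```python
# Minimal sketch of Salt remote execution; assumes this runs on a Salt master
# with minions already registered. Targets and commands are illustrative only.
import salt.client

local = salt.client.LocalClient()

# Ping every registered minion to confirm connectivity.
print(local.cmd("*", "test.ping"))

# Run a shell command across a hypothetical group of web servers.
print(local.cmd("web*", "cmd.run", ["uptime"]))

# Apply a hypothetical Salt state (e.g. an nginx config) to the same group.
print(local.cmd("web*", "state.apply", ["nginx"]))
```

The same operations are available from the salt command line; the Python client is just one way to drive them programmatically.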
The SRE team is also responsible for making sure the infrastructure complies with the standards and certifications the company follows. They use tools that alert them to compliance issues so the team can take action.
“I can assure you that cost optimization of infrastructure is being done at most companies using public cloud.”
Running your own infrastructure is difficult, very difficult. Because Cloudflare runs its own infrastructure, cost optimization projects are not as high a priority. But Vignesh mentions that for many companies using a public cloud like AWS, cost optimization is part of the SRE team's responsibilities.
The topic of cloud cost optimization is so relevant that it has driven active funding for startups in this space. And a16z wrote a widely read article in which they claimed (based on their research) that "it's clear that when you factor in the impact to market cap in addition to near term savings, scaling companies can justify nearly any level of work that will help keep cloud costs low."
“Defining and managing SLOs is very important for us”
Cloudflare customers use their services for a variety of things, among them DNS and DDoS protection. Availability and latency need to be held to a certain standard, so it's important to maintain internal Service Level Objectives (SLOs). And since Cloudflare can incur penalties for breaking Service Level Agreements (SLAs) with external customers, tracking service levels becomes very important.
For example, they might have an internal target that the API can have a maximum downtime of 1 minute per month. That is the maximum allowed downtime for scheduled maintenance and unexpected incidents combined, also known as an "error budget" in SRE terminology.
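To make the arithmetic concrete, here is a short sketch of how an availability SLO translates into a monthly error budget. The 99.99% target is purely illustrative, not a published Cloudflare SLO:

```python
# Illustrative error-budget calculation; the 99.99% target is an example,
# not a published Cloudflare SLO.
SLO_TARGET = 0.9999                 # 99.99% availability
MINUTES_PER_MONTH = 30 * 24 * 60    # 43,200 minutes in a 30-day month

error_budget_minutes = (1 - SLO_TARGET) * MINUTES_PER_MONTH
print(f"Allowed downtime per month: {error_budget_minutes:.1f} minutes")
# Allowed downtime per month: 4.3 minutes
```

A 1-minute-per-month budget, as in the example above, corresponds to an availability target of roughly 99.998%.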
Cloudflare relies on monitoring tools like Grafana and Prometheus to generate alerts for any issues that impact their metrics. If they want to track a custom metric, they create a custom exporter for Prometheus.
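As a rough sketch of what a custom exporter can look like, here is a minimal example using the official prometheus_client Python library. The metric name and collection logic are invented for illustration and are not Cloudflare's:

```python
# Sketch of a custom Prometheus exporter using the prometheus_client library.
# The metric name and collection logic are invented for illustration.
import random
import time

from prometheus_client import Gauge, start_http_server

# Hypothetical gauge tracking in-flight requests per service.
IN_FLIGHT = Gauge(
    "myservice_in_flight_requests",
    "Number of requests currently being processed",
    ["service"],
)

def collect_metrics() -> None:
    # A real exporter would query the system being monitored here;
    # a random value stands in for that measurement.
    IN_FLIGHT.labels(service="api").set(random.randint(0, 100))

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics on port 8000
    while True:
        collect_metrics()
        time.sleep(15)
```

Prometheus is then configured to scrape the exporter's /metrics endpoint, and Grafana dashboards and alert rules can be built on top of the resulting series.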
“Cloudflare has been a pioneer in writing honest and detailed public post-mortems”
Like most SRE teams, the Cloudflare team follows incident management best practices to deal with major issues. To distribute the load of responding to technical issues, the team follows an on-call schedule in which one member is the primary responder for any technical issue.
When an alert that has been seen before comes in, it contains information from previous responders about how to deal with it. This usually takes the form of steps to resolve the issue and links to monitoring dashboards for further investigation.
After a major incident, the SRE team helps write a post-mortem for the incident. Although this has become standard practice today, Cloudflare was one of the first companies to publish public post-mortems for major technical incidents. The post-mortems usually detail the timeline of the incident, the root causes, and the steps the company is taking to avoid similar situations in the future.
“We try to balance the effort and reward for building automations”
One of the principles of SRE is to "eliminate toil". Toil is described as work that tends to be manual, repetitive, automatable, tactical, or devoid of enduring value. That definition casts a very wide net, but simply put, if a task doesn't require human judgment or produce a lasting benefit, it's probably toil.
Cloudflare SRE teams take this principle seriously and use products like Apache Airflow to build automations and self-service tools for the engineering team.
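As a hedged illustration of the kind of self-service automation Airflow enables, here is a minimal DAG sketch (Airflow 2.4+ API assumed). The job itself, cleaning up stale logs, is hypothetical and not one of Cloudflare's actual automations:

```python
# Minimal Airflow DAG sketch (Airflow 2.4+ API assumed); the task and schedule
# are hypothetical toil-reduction examples, not Cloudflare's real automations.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="cleanup_stale_logs",      # hypothetical toil-reduction job
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Delete logs older than 14 days so nobody has to do it by hand.
    cleanup = BashOperator(
        task_id="delete_old_logs",
        bash_command="find /var/log/myapp -mtime +14 -delete",
    )
```

Once a chore like this is captured as a DAG, engineers can trigger it themselves and the schedule takes care of the rest, which is exactly the kind of repetitive work the toil principle targets.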
“Chaos engineering is not just a buzzword for us”
Chaos engineering, the brainchild of Greg Orzell and other engineers at Netflix, is the practice of deliberately injecting failures into a system so weaknesses can be identified and repaired before they become disastrous outages. In a nutshell, it's the process of carefully and systematically breaking your system to see how it reacts and, hopefully, engineering an automated solution.
At Cloudflare, engineers use their own homegrown tools to perform chaos engineering on the staging environments. These are usually bash scripts that randomize actions like database failovers, connection pool refreshes, or load balancer restarts. At the moment, they are evaluating whether the chaos experiments can be moved to production environments. For more on chaos engineering, this guide from Gremlin is a good place to start.
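To give a flavor of what such randomized failure injection can look like, here is a rough sketch in Python (Cloudflare's own tools are bash scripts, and the commands below are harmless placeholders rather than real operational actions):

```python
# Rough sketch of a randomized chaos action, not Cloudflare's actual tooling.
# The echo commands are placeholders; real actions would target staging only.
import random
import subprocess

ACTIONS = {
    "failover_database": ["echo", "triggering database failover"],
    "refresh_conn_pool": ["echo", "refreshing connection pool"],
    "restart_load_balancer": ["echo", "restarting load balancer"],
}

def run_random_chaos_action() -> None:
    name = random.choice(list(ACTIONS))
    print(f"Injecting failure: {name}")
    # Swap the echo placeholders for real operational commands in practice.
    subprocess.run(ACTIONS[name], check=True)

if __name__ == "__main__":
    run_random_chaos_action()
```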
“I’m curious about how seriously companies are pursuing a multi-cloud strategy”
Although developing a multi-cloud strategy is a priority for many companies these days, Vignesh is skeptical about how seriously companies are pursuing it. In his experience, companies are locked into multi-year contracts with cloud providers like AWS, and end up using custom services and products from those providers, locking themselves in further.
He does see some companies getting smarter about using the public cloud as a commodity, consuming only compute, bandwidth, networking, and storage, and staying away from vendor-specific offerings.
A developing field
The SRE role is relatively new, and emerging ideas and technologies (chaos engineering, for example) are helping to shape what it entails. As the world moves increasingly online, we see SRE becoming even more important and in demand at startups and enterprises alike.
To keep up to date with how this plays out, you should follow Vignesh on Twitter. If you want to read more about SRE at other top companies, read our blog post containing research from 30 SRE job listings.