An on-call schedule tells everyone on the team who the first responder is when an issue occurs in production. The on-call team member is responsible for investigating the issue and either fixing it herself or bringing in other people who can help. An on-call schedule is important for building reliable systems because making someone explicitly responsible for production issues ensures that they are not ignored.
Coverage
Many on-call schedules have 24x7 coverage, which means that some team member is always on-call. When deciding the coverage for your on-call schedule, take into account the nature of your product, the business you are in, the impact of outages, and the SLAs promised to customers.
Shift length
The ideal shift length depends on the type of team, the team size, and the frequency of alerts. If you have a small team and alerts are infrequent, each team member can take a one-week shift. For a six-person team, this means that one person is on-call for a week and then gets five weeks off. If you have teams in multiple locations, a “follow the sun” model of 12-hour shifts can work well, so that everyone is on-call roughly during their daylight hours.
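To make the one-week rotation concrete, here is a minimal sketch in Python. The team names, the start date, and the `weekly_rotation` helper are all hypothetical and chosen only for illustration; a real schedule would also need to handle swaps, holidays, and time zones.

```python
from datetime import date, timedelta

def weekly_rotation(members, start, weeks):
    """Return (shift_start, shift_end, member) tuples for a round-robin
    rotation in which each member takes one full week in turn."""
    schedule = []
    for i in range(weeks):
        shift_start = start + timedelta(weeks=i)
        shift_end = shift_start + timedelta(weeks=1)
        # The modulo wraps back to the first member after everyone has had a turn.
        schedule.append((shift_start, shift_end, members[i % len(members)]))
    return schedule

# Hypothetical six-person team: each person is on-call one week out of six.
team = ["ana", "ben", "chen", "dee", "eli", "fay"]
schedule = weekly_rotation(team, date(2024, 1, 1), weeks=12)

for shift_start, shift_end, member in schedule:
    print(f"{shift_start} to {shift_end}: {member}")
```

With six people, the same person comes up again exactly six weeks after their previous shift started, which is the "one week on, five weeks off" pattern described above.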
Who should be on-call?
Ideally, every team member responsible for developing the software and keeping it running should be part of the on-call schedule. This includes members of operations, DevOps, SRE and development teams, and it should include both junior team members and team leads. On-call should not be relegated to senior members only, because that creates information silos and prevents new members from developing their skills and knowledge of the system. Bringing newer and junior members on-call sooner will also push you to develop a good training process and educational materials (like documentation and playbooks). Developers going on-call also builds empathy for the issues their software can create in production and helps them make better design and programming decisions in the future.
What should be the escalation policy during on-call?
The escalation policy decides how much time a team member has to respond to an alert during on-call before it is escalated. It will depend on the criticality of the service (whether it’s a user- or revenue-impacting service or just an internal tool), and on your SLOs (Service Level Objectives) and SLAs (Service Level Agreements). If you have SLOs and SLAs, the escalation policy should be designed accordingly.
For example, consider a service with a 99.99% SLA, which allows only about 52 minutes of downtime per year. The escalation policy for that service will be very tight, requiring a team member to respond to the alert within 5 minutes before the issue is escalated. If a service does not have a strict SLA or is not critical, it is better to allow a longer response time, e.g. 30 minutes, which gives the team member time to finish her current task.
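The arithmetic behind the 52-minute figure, and the way an escalation policy plays out over time, can be sketched in a few lines of Python. The 5-minute and 30-minute response times come from the example above; the second and third escalation levels ("secondary on-call", "team lead") and their timings are assumptions added only to illustrate the idea.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

def downtime_budget_minutes(sla_percent):
    """Minutes of downtime per year that an availability SLA allows."""
    return MINUTES_PER_YEAR * (100 - sla_percent) / 100

def paged_so_far(policy, minutes_unacknowledged):
    """Return everyone who has been paged once an alert has gone
    unacknowledged for the given number of minutes."""
    return [who for after, who in policy if minutes_unacknowledged >= after]

budget = downtime_budget_minutes(99.99)  # roughly 52.56 minutes per year

# Each entry is (minutes without acknowledgement, who gets paged).
# Levels beyond the primary are hypothetical.
critical_policy = [(0, "primary on-call"), (5, "secondary on-call"), (10, "team lead")]
relaxed_policy = [(0, "primary on-call"), (30, "secondary on-call")]

# Share of the yearly budget consumed by a single 5-minute wait.
share_of_budget = 5 / budget

print(f"Yearly downtime budget: {budget:.2f} minutes")
print(f"One 5-minute wait uses {share_of_budget:.1%} of it")
print(paged_so_far(critical_policy, 6))
print(paged_so_far(relaxed_policy, 6))
```

A single unanswered 5-minute window already consumes close to a tenth of the yearly budget at 99.99%, which is why the response time for such a service has to be so short.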
On-call and team health
Being on-call can create stress and lead to team unhappiness if not managed properly. To reduce on-call fatigue, you should be mindful of pager load, that is, the number of urgent alerts fired during a typical shift of 12-24 hours. You should also be careful about multiple alerts for the same event, which cause fatigue and, even worse, lead members to start ignoring alerts. Finally, you should allow team members to swap on-call shifts with each other to handle unplanned events.
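One way to picture both ideas, pager load and duplicate alerts for the same event, is the sketch below. It is only an illustration: the alert keys, the timestamps, and the 10-minute suppression window are made up, and real alerting tools group alerts in more sophisticated ways.

```python
from datetime import datetime, timedelta

def deduplicate(alerts, window=timedelta(minutes=10)):
    """Collapse repeated alerts into pages.

    An alert becomes a page only if no page with the same key was sent
    within `window` before it. `alerts` is a list of (timestamp, key) tuples.
    """
    last_page = {}
    pages = []
    for timestamp, key in sorted(alerts):
        previous = last_page.get(key)
        if previous is None or timestamp - previous > window:
            pages.append((timestamp, key))
            last_page[key] = timestamp
    return pages

# Hypothetical alerts fired during one shift.
t0 = datetime(2024, 1, 1, 2, 0)
alerts = [
    (t0, "db-cpu-high"),
    (t0 + timedelta(minutes=1), "db-cpu-high"),   # same event, suppressed
    (t0 + timedelta(minutes=2), "db-cpu-high"),   # same event, suppressed
    (t0 + timedelta(minutes=3), "api-5xx"),
    (t0 + timedelta(minutes=30), "db-cpu-high"),  # recurrence, pages again
]

pages = deduplicate(alerts)
pager_load = len(pages)  # pages the on-call member actually receives

print(f"{len(alerts)} alerts fired, {pager_load} pages sent")
```

Tracking the number of pages per shift over time gives you a simple measure of pager load to watch when deciding whether a rotation is sustainable.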
Conclusion
Having on-call schedules can be a great way to build a culture of reliability without burdening individual members too much. If you would like to talk about how to create better on-call schedules, please contact us at hello@spike.sh.