Escalation policies are essential for making sure that incidents are quickly addressed and resolved. They provide a systematic approach to automate alerts, guaranteeing that no incident goes unnoticed.
Let’s get you started, shall we?
An escalation policy defines who is alerted, on which channel, and in what order once an incident is triggered, so that incidents are not missed. The alert it sends is the first point of contact for an incident.
Before setting up an escalation policy, you also need to know the alerting channels through which you will receive alerts. The two go hand in hand: an escalation policy without alerting channels has no way to reach anyone.
The multiple ways to get alerts
When it comes to incident management, having multiple alert channels helps ensure that incidents do not go unnoticed. Here are several ways you can receive alerts:
- Phone calls - Phone calls are ideal for handling critical incidents, particularly those that occur in the middle of the night or on weekends.
- WhatsApp and Telegram - WhatsApp and Telegram are excellent platforms for receiving alerts. The main advantage is that most people regularly check their messages on these platforms. This allows you to take immediate action on incidents without having to open the Spike dashboard.
- Slack and Microsoft Teams (chat ops) - Using your team workspace is one of the most efficient ways to manage incidents. With instant alerts on dedicated channels, you can take immediate action. Your team members can acknowledge and resolve incidents, create new channels for specific incidents, and collaborate on resolving them in a separate space. More on this in our ChatOps blog.
- SMS - The SMS channel is an excellent alert channel, particularly designed for people who do not use WhatsApp or Telegram. It is also a great way to separate your work-related messaging from your personal life. SMS alerts are commonly used for minor incidents, typically categorized as "Good to know" incidents.
- Mobile App - Mobile notifications are a fantastic way to stay on top of everything happening in your system. They provide updates on any issues that may arise, making it easier for you to quickly access more information about incidents, regardless of your location.
- Email - Email alerts are often underestimated, but they are powerful when executed correctly. With our email alerts, users can reply to an email and take action without having to visit the dashboard.
The structure of an escalation policy
The structure of an escalation policy is the blueprint of an efficient incident alerting framework.
It consists of well-defined steps that ensure the right people are notified and the necessary actions are taken whenever incidents occur:
- Every escalation policy consists of multiple steps, with each step containing multiple alerts.
- Between steps, you have the option to introduce a time gap. For example, in the first step, you can choose to receive a notification from your Microsoft Teams or Slack bot. You can then wait for a specified amount of time, such as 5 minutes, 20 minutes, 30 minutes, or any other duration, before escalating to a specific person on either of the mentioned channels.
- There is no limit to the number of steps in an escalation policy.
The nature and structure of an escalation policy are crucial to understanding how you want to design your alerts. Different incident scenarios may require different structures for your escalation policy.
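To make the step-and-gap structure concrete, here is a minimal sketch in Python. It is not Spike's API or configuration format: the class names (`Alert`, `Step`, `EscalationPolicy`), the `wait_minutes` field, and the channel and target labels are all assumptions made up for illustration. It models a policy as an ordered list of steps, each holding one or more alerts and an optional wait before it fires.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    channel: str  # illustrative label, e.g. "slack", "phone", "sms"
    target: str   # hypothetical recipient: a channel, person, or team


@dataclass
class Step:
    alerts: list[Alert]
    wait_minutes: int = 0  # time gap before this step fires


@dataclass
class EscalationPolicy:
    name: str
    steps: list[Step] = field(default_factory=list)

    def schedule(self) -> list[tuple[int, Alert]]:
        """Return (minutes after trigger, alert) pairs in firing order."""
        elapsed = 0
        timeline = []
        for step in self.steps:
            elapsed += step.wait_minutes  # gaps accumulate across steps
            for alert in step.alerts:
                timeline.append((elapsed, alert))
        return timeline


# Hypothetical policy: chat alert first, then a person, then a phone call.
policy = EscalationPolicy(
    name="example",
    steps=[
        Step(alerts=[Alert("slack", "#incidents")]),
        Step(alerts=[Alert("slack", "@on-call-engineer")], wait_minutes=5),
        Step(alerts=[Alert("phone", "team-lead")], wait_minutes=20),
    ],
)

for minute, alert in policy.schedule():
    print(f"+{minute} min: {alert.channel} -> {alert.target}")
```

Because the gaps accumulate, the third step in this example fires 25 minutes after the incident is triggered, assuming nobody has acted on it by then.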
Getting started is easy when you use one of our escalation policy templates.
Examples of multiple structures
Different escalation policy structures ensure that the right people are alerted at the right time, based on the severity of the incident.
Here are examples of critical, medium, and low-severity incident policies to help you understand how they work and determine the most appropriate alerting channel for each type of incident.
1. Critical incident escalation policy:
During a critical incident, it is crucial to alert your teammates immediately when it is triggered, regardless of the time or day of the week. We suggest starting with phone call alerts.
Phone call alerts should be used only for critical incidents. Otherwise, alerts will accumulate and critical incidents may be overlooked.
Any alerting mechanism that does not immediately wake up your teammates or require them to answer a phone call would be suitable for teams not directly involved in resolving your incidents.
For example, engineers working on critical incidents may receive phone call alerts, while the support team can receive alerts directly on Slack. This applies to the Legal team as well, for instance.
To determine who should be alerted during an incident, consider who you want to involve in resolving the incident. Remember - incident response is a team effort.
💡 Spike suggests: One simple approach is to send phone call alerts to critical team members, followed by SMS or simultaneous alerts via WhatsApp to team leads or managers.
There is no hiding when a critical incident occurs. Eventually, everyone needs to know and be aware of the situation.
2. Medium severity incident escalation policy:
A medium-severity incident is the type of incident that requires people to stay informed and proactive, as it can potentially cause disruptions in various aspects. Typically, medium-severity incidents are early indicators of underlying issues and have the potential to escalate rapidly into critical incidents.
For medium-severity incidents, it is recommended to send personalized alerts via WhatsApp, Slack, or Telegram directly to the designated response team members. It is then their responsibility to assess if immediate attention is required. If not, they can acknowledge the alert and proceed with their tasks.
💡 Spike suggests: Set up an acknowledge timeout to ensure that people do not forget to resolve the incident after acknowledging it.
Another great example of this would be to have alerts sent directly to your Slack or Microsoft Teams.
💡 Spike suggests: Create separate channels for critical incidents, medium-severity incidents, and low-severity incidents.
Carefully select members to be included in each channel to avoid overwhelming them with unnecessary alerts.
For medium-severity incidents, it is not recommended to use phone call alerts.
Who should be involved in handling medium-severity incidents?
It’s best to involve primary and secondary responders only. The primary responders are usually individuals who are readily available to assist in resolving incidents. Secondary responders are those who occasionally contribute to incident resolution. It is not recommended to include support teams or any other teams that cannot actively contribute to resolving a medium-severity incident in the escalation policy.
3. Low-severity incident escalation policy:
A low-severity incident is usually informational and does not impact any specific system, but it may contain warnings.
We still recommend monitoring these incidents over time to measure and understand the frequency of these warnings.
The general rule for low-severity incidents is that no one should receive phone or email alerts for them. It is best to create a separate channel on Slack or Microsoft Teams to receive alerts for these incidents.
Keep your low-severity incidents separate from all the main alert channels and keep their alerts as simple as possible. It is also not necessary to escalate these incidents if no one takes action.
💡 Spike suggests: Set up automation to resolve these incidents automatically. You can also choose to ignore an incident so that it never triggers again.
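The three examples above can be summarized as a small lookup table. The sketch below is illustrative only: the dictionary layout, the `escalate` flag, and the `channels_for` helper are assumptions, not a Spike configuration format. Treating an unrecognized severity as medium is also an assumption made for this example.

```python
# Channel names are plain labels, not identifiers from any real API.
POLICIES = {
    "critical": {
        # Phone calls first; SMS/WhatsApp for leads; Slack for supporting teams.
        "channels": ["phone", "sms", "whatsapp", "slack"],
        "escalate": True,
    },
    "medium": {
        # Direct messages to responders; no phone calls.
        "channels": ["whatsapp", "slack", "telegram"],
        "escalate": True,  # assumption: the text does not state this explicitly
    },
    "low": {
        # A dedicated chat channel only; no phone or email.
        "channels": ["slack"],
        "escalate": False,
    },
}


def channels_for(severity: str) -> list[str]:
    """Look up alert channels; unknown severities fall back to medium."""
    policy = POLICIES.get(severity, POLICIES["medium"])
    return policy["channels"]


print(channels_for("critical"))
print(channels_for("low"))
```

Keeping the mapping in one place makes it easy to review whether a noisy channel, such as phone calls, has crept into a severity level where it does not belong.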
Where to begin?
If you’re new to incident management, we recommend using just one of the templates and getting started with it.
Here are some things to keep in mind:
- Don't overthink it. A great approach is to set non-obtrusive alerts and integrations that won't trigger critical incidents.
- If you are starting directly with critical incidents and want to receive alerts for them, we recommend using phone calls, Slack, and SMS.
- Don't worry about the timing or whom to involve in your escalation policies. Feel free to experiment; it is a continuous improvement process. Keep in mind that the most important thing is to receive the alerts.
- It is crucial to include the right team members. By ensuring they receive these alerts, you will gradually learn when and how to create different escalation policies for different types of incidents.
- Understanding your system is one thing, but knowing when it encounters specific incidents and issues can be a completely different challenge at times.
💡 Spike suggests: Keep it simple by selecting one of these channels and using our templates. To receive alerts, set up integration with either Slack or Microsoft Teams.
This way, if you happen to miss an alert, other team members and extended team members will also be notified. For more insights, read our article on “Different escalation policies for different scenarios”.
General best practices for escalation policies
Here are some best practices to follow when getting started with escalation policies for efficient incident resolution:
- Do not overdo phone call alerts: Remember that excessive phone call alerts can quickly become noisy and lose their impact. While it is important to set up phone call alerts for critical incidents, if they keep triggering repeatedly, they will lose their effectiveness. To prevent this, stay proactive and resolve incidents promptly with robust solutions. You don't want to repeatedly end up in critical situations.
- Ignorance is not bliss: Alerts for repeated incidents, whether immediate or delayed, will inevitably be ignored by everyone sooner or later. Such warnings and symptoms ultimately lead to more significant incidents that escalate rapidly and become critically severe. It is crucial to ensure that alerts are not ignored.
- Wait time is your friend: Not all incidents require an instant alert. If you have set up any form of automation, such as a triggered script, there is a chance that the incident will resolve itself within seconds or minutes. Consider how long incidents usually take to resolve automatically. If it typically takes three minutes, for example, add a slightly longer wait time, such as five minutes, as the first step in your escalation policy. This means that when an incident is triggered, the system will wait for five minutes before sending any alerts. If the incident is resolved automatically within that time, you will not receive any alerts.
- Embrace acknowledge timeout: A great way to prevent people from forgetting about an incident after acknowledging it is to set an "acknowledge timeout". With a timeout of one hour, for example, an incident that is not resolved within one hour of being acknowledged will trigger again and start firing alerts once more.
- Repeat your escalations: Over time, you will identify incidents that simply cannot be ignored. In any scenario where alerts are coming through but no action has been taken, it is important to set the escalation policy to repeat. This ensures that alerts will continue until someone takes action on the incident.
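The wait time, acknowledge timeout, and repeat practices can be sketched as two small functions. This is a simplified model under stated assumptions, not how Spike implements these features: the function names, parameters, and default values (a 5-minute wait, a repeat every 10 minutes, at most 3 rounds, a 60-minute acknowledge timeout) are all hypothetical.

```python
from typing import Optional


def alerts_sent(
    resolved_after_min: Optional[float],
    wait_min: float = 5,
    repeat_every_min: float = 10,
    max_repeats: int = 3,
) -> list[float]:
    """Return the minutes (after trigger) at which alerts would fire.

    resolved_after_min is None if nothing resolves the incident.
    """
    times = []
    for i in range(max_repeats):
        fire_at = wait_min + i * repeat_every_min
        if resolved_after_min is not None and resolved_after_min <= fire_at:
            break  # already resolved, so nothing more is sent
        times.append(fire_at)
    return times


def should_retrigger(
    acknowledged_at_min: float,
    now_min: float,
    resolved: bool,
    ack_timeout_min: float = 60,
) -> bool:
    """True if an acknowledged but unresolved incident should fire again."""
    return (not resolved) and (now_min - acknowledged_at_min >= ack_timeout_min)


print(alerts_sent(resolved_after_min=3))     # auto-resolved during the wait
print(alerts_sent(resolved_after_min=None))  # nobody acts, alerts repeat
print(should_retrigger(acknowledged_at_min=0, now_min=61, resolved=False))
```

In this model, an incident that resolves itself after three minutes never produces an alert, while an incident nobody touches keeps alerting on each repeat until the assumed limit is reached.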
Utilize different escalation policies with automation
For any integration that is already associated with an escalation policy, we highly recommend setting alert rules and automation. This will help determine if the incident is critical and then add the appropriate responders based on the severity of the incident.
It is important to reroute alerts that arrive on a medium-severity or unknown-severity escalation policy to a high-severity or critical-severity escalation policy when the incident turns out to be critical.
Automation can be achieved by setting up regular expressions and rules to understand and reroute these alerts.
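As a rough illustration of regex-based rerouting, the sketch below checks an alert message against a list of patterns and picks an escalation policy name. The patterns, the policy names, and the `route` function are hypothetical examples; they do not reflect Spike's actual alert rule syntax.

```python
import re

# Hypothetical rules: (pattern, policy name). The first match wins.
CRITICAL_RULES = [
    (re.compile(r"\b(outage|down|data loss)\b", re.IGNORECASE), "critical"),
    (re.compile(r"\b5\d\d\b.*\bcheckout\b", re.IGNORECASE), "critical"),
]


def route(message: str, default_policy: str = "medium") -> str:
    """Return the policy for an alert, rerouting to critical on a match."""
    for pattern, policy in CRITICAL_RULES:
        if pattern.search(message):
            return policy
    return default_policy  # no rule matched, keep the integration's default


print(route("Database is DOWN in eu-west"))
print(route("Disk usage at 60%"))
```

Rules like these are only as good as their patterns, so it is worth reviewing them against real alert messages from your integrations before relying on them.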
💡 Spike suggests: Set up outbound webhooks. This will automatically trigger scripts within your system to try to resolve these specific incidents as quickly as possible.
Start by using an escalation policy template to receive alerts. Remember, you can customize these policies to meet your specific requirements. By continually refining and improving your escalation policies, you will establish a robust incident response framework.
Stay connected to our blog for more insights on getting started with escalation policies.