Spike is an incident management service, so an on-call feature is especially important for it. On-call makes handling new incidents easy and efficient so that no incident is missed.
An incident is an event outside normal operations that disrupts operational processes. Without effective incident management, an incident can disrupt business operations, information security, IT systems, employees, customers, or other vital business functions.
So whenever we get an incident, it is assigned to the right person and an alert is sent to them.
Having on-call management prevents overloading a sole assignee and evenly distributes the responsibility of handling alerts amongst dedicated on-call members.
Here we will talk about how the on-call feature was designed, keeping in mind that we wanted to make a Minimum Viable Product (MVP) that would be easy to use.
The design process involves 1) understanding and defining the problem, 2) research, 3) coming up with different solutions, 4) designing, and 5) testing.
To create the on-call feature, we first had to understand and define the problem statement. At the time, whenever an incident occurred, Spike alerted the assignees associated with that incident.
So the problem was defined as creating an on-call management feature that ensures an even distribution of responsibilities amongst team members, so that no single person is overloaded with the stress of responding to incidents.
The next step involved competitive research. I looked into the likes of PagerDuty and Opsgenie to understand the basic requirements of an efficient on-call feature.
An on-call schedule ensures that the right person is always available to respond to incidents immediately. The MVP version of on-call would have just enough features to meet basic user needs, such as setting the rotation type and adding members. This would provide ample feedback for the future development of the on-call feature. After figuring out the basic needs, we came up with a simple user flow that covered the prerequisites of an on-call schedule: choosing the on-call members, selecting the rotation type, and setting the starting and ending date and time. After deciding on the user flow, we worked on the wireframes.
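To make the prerequisites above concrete, here is a minimal sketch of how a schedule built from members, a rotation type, and a start and end date could decide who is on call. It is only an illustration: the class name, the field names, and the "daily" and "weekly" rotation types are assumptions made for this example, not Spike's actual data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Hypothetical rotation types for illustration; Spike's real options may differ.
ROTATION_LENGTHS = {"daily": timedelta(days=1), "weekly": timedelta(weeks=1)}


@dataclass
class OnCallSchedule:
    members: List[str]   # on-call members, in rotation order
    rotation_type: str   # key into ROTATION_LENGTHS
    start: datetime      # when the schedule begins
    end: datetime        # when the schedule stops (exclusive)

    def on_call_at(self, moment: datetime) -> Optional[str]:
        """Return the member on call at `moment`, or None outside the schedule."""
        if not self.members or not (self.start <= moment < self.end):
            return None
        shift_length = ROTATION_LENGTHS[self.rotation_type]
        # Count how many full shifts have passed, then wrap around the member list.
        shifts_elapsed = (moment - self.start) // shift_length
        return self.members[shifts_elapsed % len(self.members)]


schedule = OnCallSchedule(
    members=["asha", "ben", "chen"],
    rotation_type="daily",
    start=datetime(2024, 1, 1, 9, 0),
    end=datetime(2024, 1, 31, 9, 0),
)
first_day = schedule.on_call_at(datetime(2024, 1, 1, 10, 0))
second_day = schedule.on_call_at(datetime(2024, 1, 2, 9, 0))
```

With a daily rotation, each member takes one day in turn and the list wraps around once everyone has had a shift.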
Where can the user view their on-call schedule?
Users can view their on-call schedules by clicking on “When am I on call?” inside the on-call dropdown in the header. The dropdown also shows who is currently on call, so everyone can see the current on-call member at a glance.
The calendar view in Spike enables all team members to view the ongoing and upcoming on-call schedules. It is displayed inside schedule details.
In some instances, the present on-call members might not be available to respond to incidents because they are on leave or on vacation. To solve this problem, Spike lets you swap existing shifts between members, i.e., overriding. Overriding lets you modify part of the schedule without altering the schedule as a whole.
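The idea of an override can be sketched as a time window that temporarily replaces whoever the schedule would normally pick, while the underlying rotation stays untouched. The names below (Override, resolve_on_call) and the rule that the most recently added override wins are assumptions for this sketch, not a description of how Spike implements it.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class Override:
    start: datetime     # when the override begins
    end: datetime       # when the override stops (exclusive)
    replacement: str    # member covering the shift during this window


def resolve_on_call(scheduled_member: str, moment: datetime,
                    overrides: List[Override]) -> str:
    """Return the covering member if an override applies, else the scheduled one."""
    # Assumption for this sketch: the most recently added override wins on overlap.
    for override in reversed(overrides):
        if override.start <= moment < override.end:
            return override.replacement
    return scheduled_member


# "ben" is scheduled, but "chen" covers one day while ben is on leave.
overrides = [Override(datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 3, 9, 0), "chen")]
during_leave = resolve_on_call("ben", datetime(2024, 1, 2, 12, 0), overrides)
after_leave = resolve_on_call("ben", datetime(2024, 1, 3, 12, 0), overrides)
```

Because the override is stored separately from the rotation, removing it restores the original schedule with no other changes.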
Adding escalation policies to the On-call schedule
After an on-call schedule is created, users can add it to an escalation policy. While creating the escalation policy, select the on-call schedule name from the users dropdown in the escalation steps.
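As a rough illustration of an escalation step that points at a schedule instead of a fixed user, the sketch below resolves each step's target at alert time. EscalationStep, wait_minutes, and resolve_targets are hypothetical names chosen for this example, and the lookup-by-name approach is an assumption, not Spike's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EscalationStep:
    target: str        # either a user name or an on-call schedule name
    wait_minutes: int  # hypothetical delay before escalating to the next step


def resolve_targets(steps: List[EscalationStep],
                    schedules: Dict[str, Callable[[], str]]) -> List[str]:
    """Replace schedule names with whoever is currently on call for them."""
    resolved = []
    for step in steps:
        lookup = schedules.get(step.target)
        # If the target names a schedule, ask it who is on call; otherwise it is a user.
        resolved.append(lookup() if lookup else step.target)
    return resolved


# "primary-rota" is a schedule whose current on-call member is "ben"; "dana" is a user.
steps = [EscalationStep("primary-rota", 10), EscalationStep("dana", 0)]
schedules = {"primary-rota": lambda: "ben"}
people_to_alert = resolve_targets(steps, schedules)
```

Resolving the schedule only when the alert fires means the policy never has to be edited when the rotation moves on to the next member.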
After designing the first approved iteration, we wanted to test the feature. A major issue that arose was that on-call didn’t support 12-hour shifts or the ability to add two members to a single day's shift. This feedback suggests that having two people on call at the same time is the right approach, as it would take a lot of the stress off the primary on-call member. It would also ensure that there is always a backup when the primary on-call member misses a notification.
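One way to picture the two requests from testing, 12-hour shifts and a second person on call, is the sketch below. It assumes the backup is simply the next member in the rotation, which is an illustrative choice and not something the tests or Spike's design prescribe. The function name and parameters are also hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def primary_and_backup(members: List[str], start: datetime, moment: datetime,
                       shift_length: timedelta = timedelta(hours=12)) -> Tuple[str, str]:
    """Return (primary, backup) for the shift that contains `moment`."""
    if len(members) < 2:
        raise ValueError("need at least two members to have a backup")
    if moment < start:
        raise ValueError("moment is before the schedule starts")
    index = ((moment - start) // shift_length) % len(members)
    # Assumption: the next member in the rotation acts as backup for this shift.
    return members[index], members[(index + 1) % len(members)]


team = ["asha", "ben", "chen"]
rota_start = datetime(2024, 1, 1, 9, 0)
day_shift = primary_and_backup(team, rota_start, datetime(2024, 1, 1, 10, 0))
night_shift = primary_and_backup(team, rota_start, datetime(2024, 1, 1, 21, 0))
```

If the primary misses a notification, the alert can fall through to the backup for the same shift.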