Our dashboard went down at 6:55 PM UTC on 30 May 2023 for a total of 195 minutes. This critical incident impacted only the dashboard; all other services, including Hooks, Alerts, API, Escalations, and Status Page, were unaffected. Because those services remained operational, incidents and alerts continued to be triggered throughout the dashboard's downtime.
tl;dr
Our process manager (pm2) was automatically restarting the dashboard because temporary upload files were being written to a directory covered by pm2's watch mode. Nothing a quick patch couldn't fix, but it took some time to understand why it was happening.
What went wrong?
Our dashboard is built primarily on NodeJS, and we use PM2 as a process manager. We believe incident management is as much about humans as it is about incidents, which is why we recently started personalising your dashboard. The first thing we did was add profile pictures (not enough, but definitely a start).
While uploading a picture, we first write it to a temp folder before uploading it to S3. Unfortunately, we accidentally changed the upload path from the temp folder to the root folder, so the process manager's watcher detected the new files and restarted the process.
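To make the failure mode concrete, here is a minimal sketch of an upload-staging step. The function and file names are illustrative, not our actual code; the point is the difference between writing to the OS temp directory and writing inside the project folder that pm2 watches.

```js
const os = require('os');
const path = require('path');
const fs = require('fs/promises');

async function stageAvatarForUpload(userId, imageBuffer) {
  // Correct: stage the file in the OS temp directory, outside the app root,
  // so pm2's file watcher never sees it.
  const tmpPath = path.join(os.tmpdir(), `avatar-${userId}.png`);

  // The bug, roughly: a path like the one below lands inside the watched
  // project folder, so every upload touched a watched file and pm2
  // restarted the dashboard process.
  // const tmpPath = path.join(__dirname, 'uploads', `avatar-${userId}.png`);

  await fs.writeFile(tmpPath, imageBuffer);
  return tmpPath; // handed off to the S3 upload step afterwards
}

module.exports = { stageAvatarForUpload };
```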
Uploading pictures is not an everyday activity, mainly because we have made it harder than it should be to upload one the moment you see your initials (we are working on improving this). To upload a picture, visit your profile settings.
Last night and early this morning, our dashboard service kept getting restarted, causing 504 timeout errors. Our uptime monitoring solution is hooked up to Twilio, somewhat painfully, via Zapier. Some of you also emailed and created tickets; thanks for reaching out. We connected over a quick call with some of you after the incident was resolved, but we couldn't reach everyone. Please accept our apologies.
How did we fix the issue?
Understanding the issue took most of our time. The fix itself was easy: a small patch and a redeploy did the trick.
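For context on why the restarts happened at all: pm2's watch mode restarts a process whenever files under the watched paths change, and ignore_watch excludes paths from that behaviour. The ecosystem config below is a minimal, illustrative sketch (names and paths are assumptions, not our real config) showing how writes outside the watched tree, such as the OS temp directory, never trigger a restart.

```js
// ecosystem.config.js (illustrative: names and paths are assumptions)
module.exports = {
  apps: [
    {
      name: 'dashboard',
      script: 'server.js',
      // Restart only when source files change...
      watch: ['src'],
      // ...and never because of runtime artifacts written at request time.
      ignore_watch: ['node_modules', 'uploads', 'tmp'],
    },
  ],
};
```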
Did it impact our services?
No other services were impacted: Alerts, Escalations, Incidents, API, Status Pages, and Hooks all remained operational.
What are we doing to prevent this from happening again?
Test: Better testing is on the way, with more integration and E2E tests being put in place. This is something a test suite would have caught early and prevented from happening (a sketch of the kind of test we mean follows this list).
Process: We believe in being honest and transparent. We are setting up a process, a checklist of sorts, for what to do during an incident so we can maintain that transparency with all of you.
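As an example of the kind of regression test we mean, here is a minimal sketch using Node's built-in test runner. The module path and helper name are hypothetical; the assertion simply pins down that upload staging happens under the OS temp directory and never inside the project folder that pm2 watches.

```js
// upload.test.js (illustrative module path and helper name)
const test = require('node:test');
const assert = require('node:assert');
const os = require('os');
const path = require('path');

const { stageAvatarForUpload } = require('./upload'); // hypothetical module

test('avatar uploads are staged outside the app root', async () => {
  const tmpPath = await stageAvatarForUpload('user-123', Buffer.from('fake-image'));

  // Staged file must live under the OS temp directory...
  assert.ok(tmpPath.startsWith(os.tmpdir() + path.sep));
  // ...and never inside the project folder that pm2 watches.
  assert.ok(!tmpPath.startsWith(path.resolve(__dirname) + path.sep));
});
```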
Closing notes
Managing incidents and triggering alerts is a major responsibility, and one we take very seriously. The reality is that during this outage we did not update our status page immediately to reflect the critical outage. We should have, and we are better than this. Our sincere apologies. Many of you are far more experienced in this industry than we are, and we want to maintain this transparency and bring in better processes.
Going forward, we will keep our status page updated (and automate it). You can also follow these incidents as they happen on Twitter, LinkedIn, and Reddit.
We are sorry for the disruption this caused. We are actively making these improvements to ensure better stability going forward and to prevent this problem from happening again.