What the heck is an incident?

Incident management is easily one of the most annoying things anyone ever has to deal with. Only a handful of people will ever want to walk into a burning building to put out the fire, and the same is true of most engineering teams: only a handful are willing to get in, find the root cause, and mitigate the incident.

What the heck is an incident anyway?

Formal definition: An incident is an event that is not part of normal operations and that disrupts operational processes.

Website down? Yeah, that’s an incident.
Server running out of space? Yup, that one too.
An increasing number of transactions failing? Definitely an incident.

Basically, anything that interrupts smooth operations and needs you to look into it qualifies as an incident. Put another way, if it’s not ideal, then it’s probably an incident.

Below are some examples of incidents:

Severity	Incident
Revenue impacting	Website / app crashes
	Security vulnerabilities
	Server crashes and burnouts
	Booking and transaction failures
Leaving customers furious	SLA breaches
	Delayed response times
	Dashboards not loading
Incidents needing more attention	DB backups failing
	Queue memory overloads
	Slow DB queries
	Application errors
Good to know incidents	CI/CD alerts
	CPU, Memory, I/O alerts
	Disk space alerts
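
If you want to use these tiers in tooling, the table can be written down as a simple lookup. The sketch below is only an illustration: the incident-type keys (such as `website_crash`) are made-up names for a few of the rows, and falling back to the "needs more attention" tier for unknown types is an assumption, not a rule from this post.

```python
from enum import IntEnum


class Severity(IntEnum):
    """The four tiers from the table; a lower number means more urgent."""
    REVENUE_IMPACTING = 1
    CUSTOMERS_FURIOUS = 2
    NEEDS_ATTENTION = 3
    GOOD_TO_KNOW = 4


# Hypothetical incident-type keys covering a few rows of the table.
SEVERITY_BY_INCIDENT = {
    "website_crash": Severity.REVENUE_IMPACTING,
    "security_vulnerability": Severity.REVENUE_IMPACTING,
    "transaction_failure": Severity.REVENUE_IMPACTING,
    "sla_breach": Severity.CUSTOMERS_FURIOUS,
    "dashboard_not_loading": Severity.CUSTOMERS_FURIOUS,
    "db_backup_failure": Severity.NEEDS_ATTENTION,
    "slow_db_query": Severity.NEEDS_ATTENTION,
    "ci_cd_alert": Severity.GOOD_TO_KNOW,
    "disk_space_alert": Severity.GOOD_TO_KNOW,
}


def classify(incident_type: str) -> Severity:
    """Look up the tier for an incident type.

    Unknown types fall back to NEEDS_ATTENTION (an assumption) so that
    somebody at least looks at them.
    """
    return SEVERITY_BY_INCIDENT.get(incident_type, Severity.NEEDS_ATTENTION)


if __name__ == "__main__":
    for name in ("website_crash", "disk_space_alert", "something_new"):
        print(name, "->", classify(name).name)
```

Using an `IntEnum` means tiers can be compared and sorted, so a list of open incidents can be ordered with the most urgent first.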

A good rule of thumb for working out which incidents you could get: imagine the most critical part of your application, and then imagine it failing to do its job. If this ever happens, you want to make sure that you get instant alerts by phone call, SMS, email, Slack, etc. The last thing anyone wants is for these incidents to go unnoticed until work starts the next day.
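
As a rough sketch of that idea, the snippet below turns a failed check on a critical path into one alert per channel. The channel names come from the list above, but which channels fire at which urgency level is an assumption, as are the check name `checkout_payment` and the `build_alerts` helper. Actually delivering the alerts (placing the call, posting to Slack) is left out on purpose.

```python
from dataclasses import dataclass
from typing import List

# Assumed mapping: the more urgent the failure, the more intrusive the channels.
CHANNELS_BY_URGENCY = {
    "critical": ["phone_call", "sms", "slack", "email"],
    "high": ["sms", "slack", "email"],
    "low": ["slack", "email"],
}


@dataclass
class Alert:
    channel: str
    message: str


def build_alerts(check_name: str, healthy: bool, urgency: str = "critical") -> List[Alert]:
    """Return one alert per channel for a failing check, or nothing if it is healthy."""
    if healthy:
        return []
    # Unknown urgency levels are treated as critical, so nothing is silently dropped.
    channels = CHANNELS_BY_URGENCY.get(urgency, CHANNELS_BY_URGENCY["critical"])
    message = f"[{urgency.upper()}] {check_name} is failing"
    return [Alert(channel, message) for channel in channels]


if __name__ == "__main__":
    # Hypothetical check on the most critical part of the application.
    for alert in build_alerts("checkout_payment", healthy=False):
        print(f"{alert.channel}: {alert.message}")
```

Keeping the channel mapping in one dictionary makes it easy to review who gets woken up for what, which matters more than the code around it.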