How do you actually measure MTTR?
First step? Figure out which MTTR you’re talking about.
MTTR usually stands for Mean Time To Recovery, or Mean Time To Restore. This metric tracks how long it takes to fix a defect or outage in customer-facing systems. It could also stand for Mean Time To Resolve (how long was that ticket open?) or Mean Time To Repair (how long did it take to perform a repair on the system?), so if you’re referencing something, make sure to check which flavor of MTTR it’s talking about.
I’m going to use Mean Time To Recovery, as it’s most common and most relevant for software engineering teams.
At its most basic, MTTR is the average (mean) time elapsed between when a defect first impacts production and when that defect is repaired for all customers.
MTTR = total downtime / number of incidents
The hard part here is figuring out what counts as downtime. In some cases, the clock will start running as soon as the offending PR is deployed to production. In other cases, you’ll want to pinpoint the time of “first failure,” or when the defect appears for a customer in prod for the first time. It’s okay to have a different internal definition for downtime than what is in your customer SLAs.
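To make the formula concrete, here’s a minimal sketch in Python. The incident records, their field names, and the `mttr` helper are all made up for illustration, so swap in whatever your incident tracker actually gives you. It shows how the same two incidents produce a different MTTR depending on when you start the clock.

```python
from datetime import datetime, timedelta

# Hypothetical incident records. The field names are assumptions,
# not the schema of any particular incident tool.
incidents = [
    {
        "deployed_at": datetime(2024, 1, 8, 14, 0),
        "first_failure_at": datetime(2024, 1, 8, 14, 20),
        "recovered_at": datetime(2024, 1, 8, 15, 0),
    },
    {
        "deployed_at": datetime(2024, 1, 17, 9, 0),
        "first_failure_at": datetime(2024, 1, 17, 9, 10),
        "recovered_at": datetime(2024, 1, 17, 11, 0),
    },
]


def mttr(incidents, clock_start="first_failure_at"):
    """Total downtime divided by the number of incidents.

    clock_start names the field that starts the downtime clock.
    """
    if not incidents:
        raise ValueError("need at least one incident")
    total_downtime = sum(
        (i["recovered_at"] - i[clock_start] for i in incidents),
        timedelta(),
    )
    return total_downtime / len(incidents)


print(mttr(incidents, "deployed_at"))       # 1:30:00
print(mttr(incidents, "first_failure_at"))  # 1:15:00
```

Whichever starting point you choose, use it for every incident so the numbers stay comparable over time.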
But to improve this metric, you’ll want to break things down a little more.
See if you can identify timestamps for each of these distinct phases:
How long it takes from the release/first failure to when your team realizes something is wrong
How long it takes your team to identify what’s happening
How long it takes your team to develop a fix
How long it takes your team to write tests for the fix
How long it takes to complete the PR/deployment process
How long it takes for the change to propagate to all impacted systems
How long it takes to verify the fix works
Breaking an incident down this way may reveal bottlenecks, process issues with your on-call routines, or alerts that are missing altogether.
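If you can collect those timestamps, the phase breakdown is just the gap between consecutive events. Here’s a rough sketch with an invented timeline for a single incident. The event names and times are placeholders that loosely mirror the phases listed above, and `phase_durations` is a hypothetical helper, not part of any tool.

```python
from datetime import datetime

# Hypothetical timeline for one incident, in chronological order.
# Event names and timestamps are illustrative assumptions.
timeline = [
    ("first_failure", datetime(2024, 1, 8, 14, 20)),
    ("detected", datetime(2024, 1, 8, 14, 45)),
    ("diagnosed", datetime(2024, 1, 8, 15, 5)),
    ("fix_developed", datetime(2024, 1, 8, 15, 35)),
    ("tests_written", datetime(2024, 1, 8, 15, 50)),
    ("deployed", datetime(2024, 1, 8, 16, 5)),
    ("propagated", datetime(2024, 1, 8, 16, 15)),
    ("verified", datetime(2024, 1, 8, 16, 25)),
]


def phase_durations(timeline):
    """Return the time elapsed between each pair of consecutive events."""
    durations = {}
    for (start_name, start), (end_name, end) in zip(timeline, timeline[1:]):
        durations[f"{start_name} -> {end_name}"] = end - start
    return durations


durations = phase_durations(timeline)

# Print the longest phases first, since those are the likeliest bottlenecks.
for phase, elapsed in sorted(durations.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{phase}: {elapsed}")
```

In this made-up example the longest stretch is between diagnosis and having a fix, but the 25 minutes before anyone noticed the failure would also be worth a look, since that’s the kind of gap a missing alert tends to cause.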
Have a Management Query? Let me know at questions@lauratacho.com or on Twitter.