Unplannable Work and Queueing Theory
Here’s a question I’m asked often:
What’s the best way to deal with bugs and unplanned work? We’re under a lot of pressure to deliver, so our sprints are pretty maxed out. Bugs and customer requests just add to the load and delay the other work. Is there a better way to plan our sprints?
This is a pretty universal and evergreen question. The current hiring landscape just amplifies the pain. So many teams have growing backlogs and roles that have been sitting vacant, maybe for months.
You might feel like you’re constantly reacting to everything, without having time to catch your breath and make a plan for how you’re going to be proactive.
And based on this Twitter thread, a lot of you feel like work just keeps piling up.
Scheduling your team at 100% capacity is a great way to ensure that nothing will be delivered on time.
— Laura Tacho 🌮 (@rhein_wein) May 16, 2022
Queueing Theory and Slack Time
First, some math.
Let’s say you have a bank with just one teller. Each customer takes an average of 10 minutes to be served, and they arrive at the bank at an average rate of 5.8 per hour, or about once every 10 minutes.
When you look at the average values, this shapes up to be a pretty efficient system. A customer arrives, and just when they’re wrapping up, someone else shows up.
But there are ebbs and flows to foot traffic, and the average service time of around 10 minutes is just an average. There might be a customer that takes 35 minutes. Then what?
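In the bank I described above, the average wait time is close to 5 hours. That’s because as a system nears 100% utilization, the queueing time climbs steeply: each step closer to full utilization adds more wait than the last, and the wait grows without bound as utilization approaches 100%.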
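If you want to check the 5-hour figure yourself, here is a minimal sketch. It assumes the textbook single-server model (random Poisson arrivals and exponentially distributed service times, usually written M/M/1), which is my assumption for illustration and not something a real bank is guaranteed to follow. The function name `mm1_wait_hours` is made up for this example.

```python
def mm1_wait_hours(arrivals_per_hour: float, service_minutes: float) -> float:
    """Average time a customer spends in line (before service) in an M/M/1 queue."""
    service_rate = 60.0 / service_minutes            # customers one teller can serve per hour
    utilization = arrivals_per_hour / service_rate   # fraction of time the teller is busy
    if utilization >= 1.0:
        raise ValueError("utilization must be below 100%, or the line grows forever")
    # Standard M/M/1 result: Wq = utilization / (service_rate - arrival_rate)
    return utilization / (service_rate - arrivals_per_hour)


if __name__ == "__main__":
    # Same 10-minute average service time, increasingly busy teller.
    for arrivals in (3.0, 4.8, 5.4, 5.8, 5.9):
        wait = mm1_wait_hours(arrivals, service_minutes=10)
        print(f"{arrivals / 6.0:6.1%} busy: {wait * 60:6.1f} minutes average wait")
```

Under these assumptions, a teller who is busy half the time produces a 10-minute average wait, while the 5.8-customers-per-hour bank from the example lands at roughly 4.8 hours.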
We think about this often when it comes to servers working on jobs from a queue, but the same applies for systems of people, too.
Adding another bank teller reduces the wait time from 5 hours to about 3 minutes. It sounds unbelievable, but the math checks out.
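The two-teller number can be reproduced with the Erlang C formula for a multi-server queue. This is a sketch under the same textbook assumptions as before (Poisson arrivals, exponential service times, one shared line feeding all tellers); `mmc_wait_hours` is a hypothetical helper name, not part of any library.

```python
from math import factorial


def mmc_wait_hours(arrivals_per_hour: float, service_minutes: float, tellers: int) -> float:
    """Average time in line for an M/M/c queue, using the Erlang C formula."""
    service_rate = 60.0 / service_minutes             # per teller, per hour
    offered_load = arrivals_per_hour / service_rate   # average number of busy tellers
    utilization = offered_load / tellers
    if utilization >= 1.0:
        raise ValueError("arrivals exceed total capacity")

    # Erlang C: probability that an arriving customer finds every teller busy.
    all_busy_term = offered_load ** tellers / (factorial(tellers) * (1.0 - utilization))
    fewer_busy_terms = sum(offered_load ** k / factorial(k) for k in range(tellers))
    prob_wait = all_busy_term / (fewer_busy_terms + all_busy_term)

    # Average wait = chance of waiting / spare capacity per hour.
    return prob_wait / (tellers * service_rate - arrivals_per_hour)


if __name__ == "__main__":
    for tellers in (1, 2):
        wait = mmc_wait_hours(5.8, service_minutes=10, tellers=tellers)
        print(f"{tellers} teller(s): {wait * 60:.1f} minutes average wait")
```

With one teller this gives the same answer as the single-server formula, and with two tellers the average wait comes out at about 3 minutes. The second teller drops utilization from roughly 97% to 48%, and that slack is what buys the improvement.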
When your team plans sprints at close to 100% capacity, you are sharply increasing response time for any unplanned work that comes in.
A system with no slack is just going to snap.
Unplanned work vs. unplannable work
A lot of what we consider unplanned work is actually plannable. There will be bugs. There will be customer requests. You can plan for these, even if you don’t know the specifics.
And after enough sprints, you’ll likely have some decent data to show trends in how much time these types of unplanned work pull away from your core projects.
If you don’t have this insight yet, a good first step is to write a ticket for everything. Fight the urge to grumble about the administrative overhead.
Not only will tracking this type of work give you insight into how much slack you should build into each sprint, but it will also make sure that you have a record of why you made all of those changes.
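Once the tickets exist, turning them into a slack estimate can be simple arithmetic. The sketch below uses made-up numbers: `sprint_history` is hypothetical data, and both rules of thumb (reserve the average share, or reserve the worst recent share) are just options to start from, not a recommendation from any particular framework.

```python
from statistics import mean

# Hypothetical history: hours logged per sprint, split by ticket type.
sprint_history = [
    {"planned": 210, "unplanned": 54},
    {"planned": 198, "unplanned": 71},
    {"planned": 225, "unplanned": 40},
    {"planned": 190, "unplanned": 83},
]


def unplanned_share(sprint: dict) -> float:
    """Fraction of a sprint's logged hours that went to unplanned tickets."""
    total = sprint["planned"] + sprint["unplanned"]
    return sprint["unplanned"] / total


def average_slack(history: list) -> float:
    """Slack to reserve if you plan for a typical sprint."""
    return mean(unplanned_share(s) for s in history)


def worst_case_slack(history: list) -> float:
    """Slack to reserve if you plan for the busiest recent sprint."""
    return max(unplanned_share(s) for s in history)


if __name__ == "__main__":
    print(f"average unplanned share: {average_slack(sprint_history):.0%}")
    print(f"worst recent sprint:     {worst_case_slack(sprint_history):.0%}")
```

Whichever rule you pick, the point is that the number comes from your own ticket history instead of a guess.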
Customers are great at finding edge cases, so make it easy for your future self to remember why you made this one weird change.
Unplannable work isn’t a bug or a customer support ticket. These are emergencies: a flooded data center, us-east-1 having issues, a bad deploy, or a db migration that went sideways. It’s the truly unplannable stuff that is also drop-everything-and-get-to-work urgent.
If you don’t have any slack time, there is absolutely no way you’ll be able to stick to your initial timelines on your in-progress projects. A 5-hour emergency shouldn’t delay a project by a week, but that can very much happen when your system is oversubscribed.
Two things to do tomorrow
Make sure your team is tracking all the different kinds of unplanned work to make it easier for you to understand the strain it puts on them. No more quick PRs after someone Slacks a request.
Adjust your own expectations for what an “ambitious sprint” looks like on your team. You might see a reduction in story points, but you’ll also see cycle time decrease. Story points alone are not a great metric of performance and productivity.