The preventable problem paradox

Posted: 2023-01-29

The preventable problem paradox, popularized by Shreyas Doshi, is the observation that it’s more natural to reward those who put out fires than those who prevent them.1 Absent conscious effort, organizations will reward people who like firefighting and don’t mind setting up dangerous conditions. This can make systems less resilient, even as leaders feel like they have built a can-do engineering culture.

Instead, organizations should combat availability bias and balance praise for fixing the inevitable problems that arise in complex systems with rewards for the preventive work that makes problems less frequent.

The paradox

The original post is excellent, and I won’t retread it all here. It’s worth both watching the scene from Superman II and reading the Rolf Dobelli quote which sums up the fundamental problem:

… Successes achieved through prevention (i.e., failures successfully dodged) are invisible to the outside world.

This is availability bias (the tendency to overweight examples that come easily to mind) applied to software engineering management. When an organization is deciding who to reward, it is easier to think of people who handled dramatic, unplanned events than of those who designed or improved systems without causing outages.

As in the Superman II example, though, this is not just a leadership bias! The whole crowd at Niagara Falls was enraptured by Superman’s daring rescue of the falling boy, but only the boy’s mother noticed his efforts to prevent a problem in the first place. Even she couldn’t stay focused on it long enough to prevent the fall!

Preventing problems

Should your organization want to prevent problems? The question might sound silly, but it’s important to answer it honestly. In software engineering, there are some well-known risks, e.g. bugs, protocol incompatibilities, or hardware failures. Some proactive prevention for these is almost always worthwhile. But there are other possible problems too, e.g. a solar flare affecting 50% of the earth or global thermonuclear war. It might not be worth preventing problems with your software in those cases!2

Problems inevitably happen and require corrective intervention. You can still take proactive steps to reduce risk or create options for your team.

An engineering manager once told me that they prefer not to release new code, since problems occur most often during software releases. I side with a former colleague’s view that software releases should be frequent, but boring.3 So boring, in fact, that they happen every month or week4 and nobody thinks about them at all. This gives your team the power to commit small code changes and have high confidence that they will be released successfully, including in the middle of an outage. That is a powerful option for a responder!
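
To make “boring” concrete: a release like this usually reduces to a small, fully automated promotion step gated on health signals, so the decision to ship is mechanical rather than heroic. The sketch below is mine, and every name and threshold in it is hypothetical; think of it as one way a boring release gate could look, not the way.

    # Minimal sketch of an automated release gate (hypothetical names and thresholds).
    # The point: promotion is a boring, mechanical decision based on canary health,
    # so small changes can ship with confidence, even in the middle of an incident.
    from dataclasses import dataclass


    @dataclass
    class CanaryReport:
        error_rate: float      # fraction of failed requests in the canary window
        latency_p99_ms: float  # 99th-percentile latency observed in the canary


    def should_promote(report: CanaryReport,
                       max_error_rate: float = 0.001,
                       max_latency_p99_ms: float = 250.0) -> bool:
        """Promote everywhere only if the canary looks healthy."""
        return (report.error_rate <= max_error_rate
                and report.latency_p99_ms <= max_latency_p99_ms)


    # A healthy canary ships without drama; an unhealthy one stops itself.
    print(should_promote(CanaryReport(error_rate=0.0002, latency_p99_ms=180.0)))  # True
    print(should_promote(CanaryReport(error_rate=0.02, latency_p99_ms=400.0)))    # False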

Giving credit

Dobelli is right that prevention is invisible to the outside world. Your customers will never know about the outages they did not experience, and gloating about a competitor’s outage is both rude and risky.5 However, inside the organization there are folks whose job is to guide the ship, to soberly assess what went well and what didn’t, and to reward the crew for good work.

If you’re a senior engineer, this is you! You have the power to shape the culture that develops around you. To support healthy engineering work, you should publicly celebrate quiet rollouts, pre-mortems, and other preventive work just as you would crisis-response work of equivalent quality. You should do so in the same all-hands meetings where you talk about retrospective successes and failures and prospective plans. You should plant it in your teams’ minds that you value the kind of work that avoids failure and reduces risk.

Shaping culture

Supporting a healthy engineering culture in a large organization is difficult, and most of the work is not technical. For these purposes, “large” is much smaller than you might think, and is probably in the range of 100 to 200 people. It’s roughly when leadership can no longer form a complete picture of every single individual’s contributions and has to depend on delegates.

Imagine an organization that operates several software systems. A senior manager doesn’t want the organization to fall into the preventable problem paradox, which is laudable! They read Carla Geisser’s post and want to incentivize boring rollouts. They ask their line managers to track releases and schedule a quarterly meeting to review how many had issues. Their goal is to increase the rollout count and decrease the number that have problems.

Without continuous effort, this can be a Goodhart’s Law trap! It can be very tempting for someone to win praise and social capital by gaming the metric,6 e.g. by downplaying incidents, refusing to roll back, or reducing testing and monitoring.
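
To see the trap in miniature, consider what the quarterly metric actually measures. The toy sketch below is mine (the field names are hypothetical, not anything from the scenario above): the reported problem rate depends entirely on whether someone labels a rollout as having had an issue, so relabeling improves the number without improving the system.

    # Toy sketch of the quarterly rollout metric (hypothetical field names).
    from dataclasses import dataclass


    @dataclass
    class Rollout:
        name: str
        had_issue: bool  # set by whoever does (or quietly doesn't) file the incident


    def quarterly_summary(rollouts):
        """Return (rollout count, fraction of rollouts with issues)."""
        total = len(rollouts)
        with_issues = sum(r.had_issue for r in rollouts)
        return total, (with_issues / total if total else 0.0)


    rollouts = [
        Rollout("api-v41", had_issue=False),
        Rollout("api-v42", had_issue=True),   # a real incident
        Rollout("web-v17", had_issue=False),
    ]
    print(quarterly_summary(rollouts))  # (3, 0.333...)

    # Goodhart's Law in one line: relabel the incident and the metric "improves",
    # even though the system is no more reliable than before.
    rollouts[1].had_issue = False
    print(quarterly_summary(rollouts))  # (3, 0.0)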

Senior folks have special responsibility for shaping the culture and putting in the effort to avoid such traps. However, people at all levels of the company contribute to its standards of behavior. If you’re a junior engineer who wants reliable systems, you should celebrate preventive work in peer feedback, since otherwise it can be invisible to senior folks!

Avoiding biases

Like Simpson’s paradox, the preventable problem paradox is not a paradox at all. Instead, it’s an expression of availability bias, which might work against the long-term goals of an organization. While leadership has special responsibility for setting incentives, people at all levels of the organization are responsible for managing biases.

--Chris


  1. This wonderful talk by Tanya Reilly dives deeper into the fire analogy. ↩︎

  2. These deliberately ignored risks are subtly different from Black Swan events. If we choose to ignore software problems when cities are being destroyed by nuclear weapons, that’s fine and, in my opinion, correct (except maybe for NORAD systems). But such destruction is not a Black Swan: we know it is possible. ↩︎

  3. Carla Geisser’s other writings are also well worth your time. ↩︎

  4. I’ve never seen daily pushes to prod in practice, but good on you if you can get there. ↩︎

  5. It’s possible to turn a reputation for reliability into money, but a reputation-based plan is a long one with high downside risk. ↩︎

  6. You may not be surprised to learn that I have Opinions on the closely related topic of error budgets and measuring availability with “nines.” ↩︎

