Good efforts preserve bad systems

For many systems, constant limping is the status quo. The extra effort of well-meaning employees often prevents the system from failing completely; it also prevents it from improving.

By a system, I mean both a “software system” and the company as a whole.

A software system can be limping by being unreliable, slow, and hard to work with; a company can be limping by barely breaking even, being full of dysfunction, or being in regular crisis mode.

Incentives

In every work environment, some things are valued more than others. By “valued,” I mean what is actually rewarded, not merely given lip service. These incentives shape the system, and they are never perfect, because management always has a limited understanding of the system it manages.

I think most workers are aware of that. They know there is a dissonance between the work that “should” be done (in a sense, the work most helpful to the system) and the work that is valued. Different people respond to that dissonance slightly differently, but for most, it's a mix of doing both. Most people aren't so cynical that they ignore everything except what's valued, but few ignore the incentives completely either.

In many cases, doing only what management values would not allow the system to survive. For example, if management values new features more than bug fixing, and everybody responds to that incentive perfectly, the system will crumble. Thus, the status quo is only possible because people put in extra effort to do what's needed for the system to survive.

That extra effort is frustrating because it is:

  1. Tiring, especially when done constantly
  2. Rarely rewarded

However, it is usually seen as something positive, heroic even; going against the bad system is praised by those who see through it.

I disagree with that notion.

By doing that, we are not only solving problems; we are also preventing those problems from being acknowledged by management, and that has bad long-term effects.

The same problems will come back again and again, with no additional resources or appreciation for solving them.

For example, if a company doesn’t really value reliability and underpaid sysadmins are working extra hours to patch the system with ad-hoc fixes, this will continue until uptime drops enough for management to notice. If uptime never reaches that critical level, nothing will change, because why would it? On the other hand, if the number of embarrassing outages crosses some threshold, management will recognize that systemic changes need to happen; that might mean hiring more sysadmins, adopting new practices, or rewarding prevention more than before.

We’re not ready yet

Of course, the big problem is that people don’t want to be seen failing; failure is often punished in some way (it might be harder to get a promotion or a raise), and it just plain feels bad.

Unless there is a cultural change in how we perceive failures, I don’t expect anybody to act differently.

What would it take to change the culture? First, we need to learn to embrace failure.

Embrace failure

The solution is to make failure safe, or at least non-catastrophic, both individually and systemically.

On a personal level, it is hard to admit a failure even without incentives aligned against it; it bruises the ego. You need to shift the framing from failure being bad to failure being a valuable lesson. You also need cold reasoning to recognize whether the failure should be attributed to your own mistake, to bad luck alone, or to the system you’re in.

On the company level, you must ensure that people are not punished for bringing bad news. You need a shared attitude of readiness for improvement; too often, it is assumed that the company is already an optimized system and that failures should be attributed to individual mistakes or bad luck.

Get rid of the culture of internal competition; it invites pushing blame around.

Have you experienced a good (or bad!) systemic change after a visible failure at your company? I would love to hear about it.