Am I solving the problem or just coping with it?

One of the most valuable questions I ask myself.

Solving problems and putting in processes that eliminate them is a core part of the job of a manager. Knowing ahead of time whether or not your solution is going to work can be tricky, and time pressures and the urgency that pervades startups can make quick solutions seem really attractive. However, those are often a trap that papers over the real issue enough to look like it’s solved, only to have it come roaring back worse later.

If you’ve been around the world of incident response and the subsequent activity of retrospectives/post-mortems, you’ve probably heard of the “5 whys” method of root-cause analysis. In the world of complex systems, attributing failures to a single root cause has fallen out of favour and 5 whys has gone with it. Nevertheless, there is value in digging deeper than what initially presents itself as the challenge to uncover deeper, more systemic issues that might be contributing factors.

Rather than “why,” I like to ask myself if I’m really solving this dysfunction or just coping with it. There’s a lot of temptation to cope. Coping strategies have some features that make them attractive.

Imagine your problem is that your team is “always behind schedule.” That’s a problem I’m sure most people are familiar with. There are plenty of easy-to-grab band-aid strategies for that pattern that are very attractive:

They can be contained within your team; you don’t need to influence people outside your sphere. (We’ll just work through the weekend and get the project back on track.)
They’re highly visible. (Look at Jason’s team putting in the extra effort! What a good manager!)
They can move vanity metrics in the short term, which is often something rewarded. (We increased our velocity and did 20% more story points!)

All of these can make you feel like you’re solving the problem but:

They come with a cost. (The team will burn out and quit if they have to work through too many weekends.)
If you stop doing them, the situation comes right back. (The next project wasn’t different and we’re back behind schedule again.)
They only address the symptoms, not the causes. (Improving your culture around estimations and deadlines would be a better fix.)

This dichotomy is probably more familiar and easy to recognize in the technical realm. If your code has a memory leak you can either proactively restart it every now and then (coping) or you can find the leak and fix it (actual solution). The reasons you should prefer the actual solution are the same in both scenarios. The 'strategy' of restarting the service will work for a while, but you know in the back of your mind that eventually it’s going to manage to crash itself between restarts. Same goes for your people processes.

A story about coping

At one of my jobs we had a fairly chaotic release process. This was a while ago and “Continuous Deliveryr” was still a fairly newfangled idea that wasn’t widely adopted, so we did what most companies did and released on a schedule. It was a SaaS product, so releasing was really just pushing a batch of changes to production. The changes had ostensibly been tested (we still had human QA people then) and everything looked good, but nevertheless nearly every release had some catastrophic issues that needed immediate hotfixes. We didn’t have set working hours, except on release day. We expected all of the developers to be in the office when the code went out in case they were needed for fixes. We released around 9am and firefighting often continued throughout the morning and into the afternoon. When that happened, the office manager would usually order some pizzas so people could have something to eat while fixing prod.

Eventually this happened so often that it just became routine. Release day meant lunch was proactively ordered in for the dev team, then chaos ensued and was fixed.

Of course, we did eventually tackle the real pain points by increasing the frequency of releases (when something hurts, do it more, it’ll give you incentives to fix the real problems), adding more automated testing, and generally just getting better at it all.

The lunches continued well past the point where the majority of the people on the team remembered or ever knew why they were there.

How to identify coping

Some other practical questions you can ask yourself to try to recognize stopgap strategies that aren’t addressing the real problems.

The examples are just examples; this isn’t meant to be a comprehensive guide to fixing every situation you encounter. That would be a lot to ask from a free newsletter. 😜

Does this require constant attention/input/energy to keep working?

As mentioned before, anything that stops working when you stop doing it is likely not a good solution. If you’ve got a system where someone has to push the “don’t crash production” button every day or else production crashes, eventually someone’s going to fail to push the button and production is going to crash. The button is not a solution; it’s a coping strategy.

Does it change the behaviour of the system or just reduce symptom pain?

If you’re having a lot of off-hours incidents and your incident response team is complaining about the burden, you could train more people in incident response to reduce their burden (and maybe you should), but that’s not solving the problem, which is really that you’re having too many incidents. A true solution would be understanding why and addressing that. Code not sufficiently tested? Production infra not sized appropriately? Who knows, but the answer is not simply throwing more bodies at it. That will reduce the acute problem of burnout in your responders (symptom) but not the chronic issues that are causing the incidents in the first place.

If you’re not changing the underlying conditions you’re probably not fixing the problem.

Does it actually remove a failure mode or just add ongoing work?

If all your meetings are running over time you could appoint someone to act as the timing police and give warnings and scold people who talk too long, but that’s just adding work for someone and not touching the real reasons, which could be unclear agendas, poor facilitation, or the fact that the VP can’t manage to stay on topic for any amount of time. The solution is to fix your meeting culture. Require agendas, limit the number of topics, train people on facilitation, give some candid feedback to the VP. These solutions actually remove the failure mode that causes meetings to run long. The timekeeper “strategy” doesn’t do anything about that.

Does toil go down over time?

This is similar to the first question, but it’s something you can think of like a metric, even though it’s probably more a heuristic than something directly measurable. Treating your releases like pre-scheduled incidents like we did would be a good warning sign that you’re coping, but if each one is less dramatic than the last and wraps up sooner, those are signs that you’re on the right track. Getting to the point where releases are unremarkable and you don’t need the whole team on standby would be a good indicator that you’ve got solid solutions in place.

Sometimes coping is the right call

When production is down, anything you can do to get it back up again is the right move. If you’re bleeding and you have to make a tourniquet with your own shirt, you do it. There are lots of scenarios where a short-term fix is the best thing for the situation. But keep in mind, this is effectively debt. Like technical debt, it should be repaid — ideally sooner rather than later. A good incident process will recover production by any means necessary. Once, during an incident related to a global DNS provider’s outage, we were literally uploading /etc/hosts files to our infrastructure to keep the part of our product that relied on some 3rd parties working. Needless to say, once the DNS incident was resolved, we went back and cleaned those up.

You can do the same with processes. When there’s immediate pain that can be relieved with a quick patch-up, you should do it, and use the fact that the pain has abated to fix the problem permanently.

You also can’t fix them all. You might not have enough influence or political capital to make the kinds of changes that real solutions require. In that case, your job is to advocate for the right things to happen, point out the ways that coping is hurting the organization, and exert influence over the parts that are in your control.

Knowing which is which is most of the battle

I’ve tried to give you some strategies for recognizing when you’re just coping with a situation rather than fixing it. The differences can be subtle, but once you start to spot them it gets easier and you’ll find things start to get better in more permanent, sustainable ways.

Image:

"Temporary Fix" by reader of the pack is licensed under CC BY-ND 2.0 .

Like this? Please feel free to share it on your favourite social media or link site! Share it with friends!

Hit subscribe to get new posts delivered to your inbox automatically.

Feedback? Get in touch!