As a software developer, is it okay to know where the bug in the code is and fix it, without fully understanding all of the code surrounding it?

OK, this happened to me just today.

There was an obvious bug in the code, code I wrote almost 4 years ago, that could lead to hanging conditions.

It was a pretty obvious typo and hadn’t caused problems in production, because it was a path only used by developer tools in the lab. So… easy fix, right?

Well, I verified I could make things die in the expected manner with the existing code. I made the trivial fix, and ran the same test case against it….

….and it still died. It took a little longer, but it still died.

It turns out, there was another bug lurking just behind this one.

So, in the short run, I can make my fix for the obvious bug, but I can’t walk away until I know I’ve solved the whole problem.

I’ve got a new bug to chase.

What’s interesting about this particular bug (the original, trivial one) is that it had been there ever since I first checked the code in. It just remained dormant until we decided to harden the code for a new deployment scenario. That hardening process discovered it and gave it relevance.

The bug-behind-the-bug is more subtle. It’s actually related to a bug in some firmware from our vendor. We implemented a guard against that vendor firmware bug, but there’s no guard on this particular path, nor is there an obvious place to put such a guard.

In fact, the right answer may be to redesign some other state machines elsewhere and solve a bigger-picture design issue. This dovetails with some other development work that’s going on.

Now imagine a junior dev came in, saw the trivial bug, fixed that, and left it at that. Is that OK?

Let’s break it down:

It’s not actively bad, and in fact, does improve the codebase. The trivial bug does need to be fixed, and fixing it doesn’t introduce regressions. We’re strictly less buggy than we were before.
It does improve the reliability of the code. However, it might give a false sense of security if you don’t measure the actual reliability after making the change. We can still crash, but we crash less often.
It misses an opportunity to fix a deeper problem in the code and reduce overall technical debt.

This isn’t the best-case scenario for a naive bug fix, but it’s not the worst-case either. I’d rate it a “not uncommon” case. Without knowing the code or testing for further issues, you miss an opportunity to find and fix other bugs, but you didn’t actively cause harm.

A worse-case scenario is that your localized bug fix fixes the proximate cause of a problem, but introduces new regressions elsewhere that aren’t immediately caught by regression tests.

You fixed one bug today and introduced 3 or 4 more bugs that’ll be discovered in the coming months.

Ideally, you should have had tests in place that prevented that. Realistically, that doesn’t always happen.

(That’s especially common when dealing with the lowest level code, where the layers underneath you are physical hardware, and the environment is ever-changing.)

I labeled the previous a “worse-case scenario,” because I really don’t want to sell anyone short on what a “worst-case” might be. I reckon my imagination isn’t strong enough to come up with a great candidate for that.

And yeah, I acknowledge there might be a better way to phrase it. I’m amused by “worse-case,” though.

Postscript: I implemented the more complete fix for the bug-behind-the-bug. After some refactoring, the API I misused no longer exists, and a clearer, harder-to-misuse API is in its place.

I also refactored the state machine slightly, so I could move all the distributed guards to a common guard point in the state machine itself. Now that part is much harder to get wrong as well. And I was able to restructure some other state to simplify some upcoming feature development. Wins all around.
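
To make the "common guard point" idea concrete, here is a minimal sketch in Python. It is not the actual code from this story: the states, the event names, and the `device_ok` check are all hypothetical stand-ins. The point is only that when every transition goes through one method, the guard lives in one place and no path can skip it.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    REQUESTING = auto()
    WAITING = auto()
    DONE = auto()
    FAULT = auto()


# Hypothetical transition table: (current state, event) -> next state.
TRANSITIONS = {
    (State.IDLE, "start"): State.REQUESTING,
    (State.REQUESTING, "sent"): State.WAITING,
    (State.WAITING, "reply"): State.DONE,
    (State.DONE, "reset"): State.IDLE,
    (State.FAULT, "reset"): State.IDLE,
}


class Machine:
    def __init__(self, device_ok):
        # device_ok is a hypothetical callable standing in for whatever
        # check detects the misbehaving-firmware condition.
        self._device_ok = device_ok
        self.state = State.IDLE

    def handle(self, event):
        """Single entry point for every transition, so the guard runs on every path."""
        if not self._device_ok():
            # Common guard point: any path that hits the bad condition
            # lands in FAULT instead of proceeding (and possibly hanging).
            self.state = State.FAULT
            return self.state
        next_state = TRANSITIONS.get((self.state, event))
        if next_state is None:
            raise ValueError(f"unexpected event {event!r} in state {self.state.name}")
        self.state = next_state
        return self.state
```

With guards scattered across individual handlers, a rarely used path (like one exercised only by lab tools) can easily end up without one. Routing everything through `handle` trades a little flexibility for the guarantee that a new path gets the guard for free.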

Now to just get the code through review. I’ll know soon whether the reviewers agree it’s good code. (In progress….)
