Solution is null
February 7, 2011 8:51 AM

How can I explain this software bug to management?

I am the primary developer for a financial system that appears to have produced a buggy outcome one time in over a million transactions.

The errant transaction was discovered months after the fact and is wholly unreproducible. To my knowledge, the code that generated the amount has not changed since then.

Management is not happy with my lack of explanation, and understandably is concerned that this could happen again, which is important since it is a monetary issue.

So the question is: how does a developer explain to management a bug that can't be solved or repaired? What strategy, short of inventing a problem (and saying that I've since solved it), can I use to basically say "I'm sorry, but I can't do anything about this"?

posted by eas98 to Computers & Internet (13 answers total) 4 users marked this as a favorite
In similar situations I try to find some kind of monitoring I can implement so that, if the weird one-off bug should recur, I'll have some logging or output I can look at so that maybe I'll have more success in diagnosing the problem on the second try (and so that the next time it happens it doesn't go undiscovered until months after the fact). Doing something similar might give you, and your management, a little peace of mind that this bug is not going to suddenly start happening every other day and costing your company a ton of money before it is noticed.
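
A minimal sketch of that kind of monitoring, assuming the amount can be recomputed independently from the stored inputs. The names check_transaction and recompute_fee, and the fee rule itself, are hypothetical and only illustrate the shape of the check:

```python
import logging
from decimal import Decimal

logger = logging.getLogger("txn_monitor")


def check_transaction(txn_id, inputs, computed_amount, recompute):
    """Return True if computed_amount matches an independent recomputation.

    On a mismatch, log the full inputs so the case can be examined later.
    """
    expected = recompute(inputs)
    if computed_amount == expected:
        return True
    logger.error(
        "amount mismatch txn=%s computed=%s expected=%s inputs=%r",
        txn_id, computed_amount, expected, inputs,
    )
    return False


def recompute_fee(inputs):
    """Hypothetical fee rule, used only for this illustration."""
    return (inputs["principal"] * inputs["rate"]).quantize(Decimal("0.01"))
```

Run after every transaction, or as a nightly pass over the day's records, a check like this would surface a recurrence within hours instead of months.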
posted by enn at 8:58 AM on February 7, 2011 [3 favorites]

I can only think of a few reasons why you cannot do something about a software error, but I am sure there are more:

1) I do not know the inputs that created this problem, so I cannot test it and reproduce it. We will have to log all inputs and wait for the error to occur again.

2) The code is compiled code from a vendor. This black box problem must be reported to the vendor. (Note: this is not relevant to you, but is provided for completeness.)

3) I do not know what the error looks like, so I cannot stick code in after that section of processing to serve as a sanity check. I require more information about modeling the process so I can build a sanity check in.

4) Other projects are taking priority over tracing and solving this error.

Essentially, give someone something actionable. Something you can do but need permission for, something they can do ...
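
Point 1 (log all inputs and wait) could look roughly like the sketch below. It assumes the calculation is an ordinary function with JSON-friendly arguments; record_calls, call_log, apply_fee, and replay are hypothetical names, and the in-memory list stands in for a durable log:

```python
import functools
import json
import time


def record_calls(sink):
    """Decorator factory: append each call's inputs and result to `sink` as JSON."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            sink.append(json.dumps({
                "func": func.__name__,
                "args": args,
                "kwargs": kwargs,
                "result": result,
                "ts": time.time(),
            }, default=str))
            return result
        return wrapper
    return decorator


call_log = []  # stand-in for a durable log file or table


@record_calls(call_log)
def apply_fee(amount_cents, fee_cents):
    """Hypothetical calculation being recorded."""
    return amount_cents + fee_cents


def replay(record_json, func):
    """Re-run a recorded call and report whether the result still matches."""
    record = json.loads(record_json)
    return func(*record["args"], **record["kwargs"]) == record["result"]
```

With records like these, a suspect transaction can be replayed against the undecorated function (apply_fee.__wrapped__) using the exact inputs that were seen in production.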
posted by adipocere at 9:00 AM on February 7, 2011 [4 favorites]

Yeah, short of some kind of weird bitrot, there's always a root cause. And if you do have memory issues that are causing random corruption, that would need to be addressed too. Logging and automated monitoring would definitely be a good idea.
posted by kmz at 9:02 AM on February 7, 2011

I would immediately want a sound explanation of how it is wholly unreproducible. I say this because management's thinking is "if it happened once, it can happen again." In addition, this may be not simply an isolated bug but one of a class of related issues.

So you should be able to enumerate the condition or conditions which caused the bug and how each and every one of those conditions can't happen again.

I would also be prepared to present an audit trail of unit tests that touch on those conditions and pass now and will properly fail if things change badly in the future.
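
As a rough illustration of tests that pass now and fail if behavior drifts, here is a sketch using Python's unittest. The function compute_interest and its rounding rule are assumptions made up for the example, not the poster's actual calculation:

```python
import unittest
from decimal import Decimal, ROUND_HALF_UP


def compute_interest(principal, annual_rate, days):
    """Hypothetical calculation under test: simple interest, rounded to cents."""
    raw = principal * annual_rate * Decimal(days) / Decimal(365)
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


class InterestRegressionTest(unittest.TestCase):
    def test_known_good_value(self):
        # 1000.00 at 5% for 73 days (one fifth of a year) is exactly 10.00.
        self.assertEqual(
            compute_interest(Decimal("1000.00"), Decimal("0.05"), 73),
            Decimal("10.00"),
        )

    def test_zero_days_yields_zero(self):
        self.assertEqual(
            compute_interest(Decimal("1000.00"), Decimal("0.05"), 0),
            Decimal("0.00"),
        )

    def test_half_cent_rounds_up(self):
        # Raw value is exactly 0.005; this fails if the rounding mode changes.
        self.assertEqual(
            compute_interest(Decimal("1.00"), Decimal("0.005"), 365),
            Decimal("0.01"),
        )
```

Saved in a file, these can be run with python -m unittest. The last test is the kind that "properly fails": switching to banker's rounding would turn the expected 0.01 into 0.00.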

I would want code coverage of at least 80% for that system.

I would also be prepared to provide a coherent 5 Whys analysis of the circumstances and what has been done or needs to be done to remedy them, including what can be done so that future errant transactions are caught ASAP.
posted by plinth at 9:02 AM on February 7, 2011 [1 favorite]

I don't really understand. If you know what code caused the error then you should be able to fix it.

I would only use the phrase "wholly unreproducible" to mean that you do not know where the error comes from, and more specifically, that you don't know how to make it happen again. If that is the case, then simply tell them that it is unreproducible. You can only solve problems that can be reproduced, since they are the only problems you can investigate.

But if you know the source of the error, then tell management what the source of the error is, and why you can or cannot affect it.
posted by molecicco at 9:02 AM on February 7, 2011

How big a dollar amount are we talking about? Rounding errors used to drive me nuts years ago, even though they'd throw off results by only pennies.
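
For anyone who hasn't been bitten by this, a small Python sketch of how penny-level rounding errors arise with binary floats, and how decimal arithmetic avoids them. The helper to_cents is a hypothetical name, and nothing here is a claim about the cause of the poster's bug:

```python
from decimal import Decimal, ROUND_HALF_UP

# Binary floats cannot represent 0.10 or 0.20 exactly, so the sum drifts.
float_total = 0.10 + 0.20

# Decimal values built from strings keep the exact decimal digits.
decimal_total = Decimal("0.10") + Decimal("0.20")


def to_cents(amount):
    """Round a Decimal amount to whole cents, with halves rounding up."""
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# The float 2.675 is stored slightly below 2.675, so it rounds down to 2.67,
# while the exact Decimal 2.675 rounds up to 2.68.
float_rounded = round(2.675, 2)
decimal_rounded = to_cents(Decimal("2.675"))
```

A one-cent discrepancy between two code paths, one using floats and one using decimals, is a common shape for this class of error.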

You can do something like what enn said and offer to create a monitoring mechanism that goes above and beyond what is in place already. You could tell them that doing this may require X hours of work at Y cost. Then if they think the one-in-a-million error is worth spending the resources to figure out, you can go ahead with that.

It's not worth your reputation to create an error from thin air that you then have to defend.
posted by thorny at 9:04 AM on February 7, 2011

The problem you need to solve here is "management is unhappy with my lack of explanation." I suggest that you walk them through the problem in terms they can relate to, which may mean breaking it down into simple component parts.

Moreover, when I hear "wholly unreproducible," I hear someone saying, "With the tools I have now, I cannot create a test in which this result can be recreated within a reasonable amount of time." If the code hasn't changed, this bug can be reproduced. It may not be reasonable to attempt to do so, because of its frequency.

If you were to point this all out, this would help management make a decision. Hand-waving about being unable to reproduce something won't help (I'm sure you're not just hand-waving, but don't miss my point). You need to document, document, document for people that don't have your level of knowledge or intimacy with the systems.
posted by Cool Papa Bell at 9:10 AM on February 7, 2011

Response by poster: Thanks for the answers so far!

The amount was small that time. I imagine it could be larger, but not something in the tens of thousands or anything larger than that.

By unreproducible, I say that I could (and did) roll back the data and re-do the transaction (it came out correctly), but this was a multi-user environment, which is impossible for me to replicate. I do not doubt there was 'a reason'. I only state that it's not a reason that I can replicate.
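
To illustrate why a single-user replay can come out correct while a multi-user run does not, here is a deterministic sketch of one well-known class of concurrency bug, the lost update. It is only an example of the category, not a diagnosis of this system; Account, run_serial, and run_interleaved are hypothetical names:

```python
class Account:
    def __init__(self, balance_cents):
        self.balance_cents = balance_cents


def read_balance(account):
    return account.balance_cents


def write_balance(account, value):
    account.balance_cents = value


def run_serial(account, deposits):
    """One user at a time: each deposit reads, then writes, before the next starts."""
    for deposit in deposits:
        current = read_balance(account)
        write_balance(account, current + deposit)
    return account.balance_cents


def run_interleaved(account, deposit_a, deposit_b):
    """Two users overlap: both read before either writes, so A's write is lost."""
    seen_by_a = read_balance(account)
    seen_by_b = read_balance(account)
    write_balance(account, seen_by_a + deposit_a)
    write_balance(account, seen_by_b + deposit_b)  # overwrites A's update
    return account.balance_cents
```

Replaying the two deposits one after the other always gives the right total, which matches the experience of rolling back the data and getting a correct result. Only the specific overlap in timing produces the wrong one.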
posted by eas98 at 9:11 AM on February 7, 2011

You sound like you don't want to invest any more time in this issue. If management/the business considers this the highest priority, perhaps a change in your own attitude to the problem might yield better results than some explanation of inability or intractability.

As a developer, I feel your pain: some bugs are freakin' hard to track down... and when your best ideas don't seem to bear any fruit, it's really enticing to shrug and move on. However this is not a good long-term strategy—you need to be able to figure out not just easy problems but hard ones for future professional growth.

For debugging, I've found the scientific method to be the best tool:
  1. Use your experience or the experience of others on your team to try to make sense of the problem domain. How has this system failed in the past? What are its operating parameters? (Sometimes the problem isn't that the code has changed but that the deployment or machine has.)
  2. Form one or more hypotheses that explain the observed behavior.
  3. If a hypothesis is true, what other evidence should also be observable? How could you instrument the system to find out?
  4. Test and report on the results of the test. The fact that your hypotheses may fail is not a failure. (This takes some communication skill in reporting to management.)
Good luck in tracking this issue down.
posted by Exonym at 9:17 AM on February 7, 2011

The problem is your language.

For example: "I only state that it's not a reason that I can replicate."

This is false. You can replicate it somehow. You just don't have the time, money or knowledge to do it. All of these are legit reasons and you'd do better with management if you explained why you can't reproduce/fix the bug rather than categorically denying that it can be fixed.
posted by Loto at 9:49 AM on February 7, 2011 [1 favorite]

The problem is that "I can't do anything about this" is only literally true given various constraints, so management is going to want an answer that makes that clear. You could accommodate that by listing the possible causes and making the cost of removing the constraints explicit. For example, in case the error is not in the code but in the underlying execution environment, the fix you could propose is to run multiple instances of the system in parallel on different hardware and OSes and cross-check their output. In case the error was the result of a bit flipping in bad memory, you could propose (or verify) that the server have ECC RAM installed. The answer isn't "there's no fix for this"; it's "how much money do you want to spend to reduce the possibility of this recurring by an unknown amount?"
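
A scaled-down sketch of the cross-checking idea: instead of separate hardware and OSes, it compares two independently written implementations in one process, which is a much weaker stand-in but shows the mechanism. All names here (fee_decimal, fee_integer, cross_checked, CrossCheckError) and the fee rule are hypothetical:

```python
from decimal import Decimal


class CrossCheckError(Exception):
    """Raised when independent implementations disagree."""


def fee_decimal(amount_cents, rate_bps):
    """First implementation: Decimal arithmetic, truncated to whole cents."""
    return int(Decimal(amount_cents) * Decimal(rate_bps) / Decimal(10000))


def fee_integer(amount_cents, rate_bps):
    """Second implementation: pure integer arithmetic (non-negative inputs)."""
    return amount_cents * rate_bps // 10000


def cross_checked(inputs, implementations):
    """Run every implementation on the same inputs; refuse to return on disagreement."""
    results = [impl(*inputs) for impl in implementations]
    if any(result != results[0] for result in results[1:]):
        raise CrossCheckError(f"disagreement for inputs={inputs!r}: {results!r}")
    return results[0]
```

The cost side of the trade-off is visible even here: every transaction is computed at least twice, and someone has to decide what happens when the results disagree.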
posted by nicwolff at 10:06 AM on February 7, 2011 [2 favorites]

nicwolff hit it on the nose.

You see, it's not a bug that you can't reproduce. It's a bug that you can't reproduce given your resource constraints. What I would expect from you if you reported to me is a proposal with a budget and list of resources necessary in order to reproduce the bug. Then I can make an educated decision on what to do given the cost of the bug's existence vs. the cost of finding it and fixing it.
posted by analogue at 10:36 AM on February 7, 2011

Response by poster: Thanks for all the great answers everyone. This is exactly what I was looking for. Basically, it'll be something very similar to odinsdream's answer.
posted by eas98 at 11:45 AM on February 7, 2011
