How to write a diagnostic manual for software
July 14, 2021 12:35 AM

Technical writers and software people of Metafilter: I'm looking for advice, tips and best practices on how to write a diagnostic manual for a software application. The document would be similar to what you get with your home appliances. How does one go about writing one like that? How should the material be ordered, and how should the diagnostic steps be structured?

A complicated piece of software with lots of dependencies can fail in many ways. Even if you get back a 404 from an API call, you don't know if your URL was wrong, or the service itself is down. (This is the example that prompted this question).

I thought that it'd be nice to have a document that collects the common failure modes and acts as a reminder of what to check and in what order. With so many moving pieces, I don't really know how to start.

Let's say I collect all the different problems that our software can encounter and that you have to solve outside your software. For example, if you got an exception because a service you depend on is down, then you have to take an action - this is "outside" your software.

How should one go about organizing and presenting all this collected information so that it minimizes troubleshooting time? Is there a NASA or Army manual on how to write these kinds of documents? I'd appreciate any pointers.
posted by kmt to Technology (11 answers total) 1 user marked this as a favorite
Best answer: I think the search term you're looking for is "runbook".
posted by hoyland at 1:22 AM on July 14, 2021

Best answer: This may be more involved than you originally imagined.

I've written QMSs and SOPs involving several highly customized 3rd party ERP-like software/ database suites (which I had a major hand in guiding the development of as the non-dev end user) intended for regulated environments.

I wish the developers were able to produce something like a diagnostic manual, but that was a naively misguided pipe dream. I still mock myself terribly for my former naivete.

If you don't have to do this, I wouldn't volunteer to do it.

What ended up happening was a 3 person team (or myself only) wrote end-to-end SOPs and integrated the software interfaces with each of our existing interconnected systems processes.

Of course we picked up bugs during the process and tried to get the devs to fix the bugs. Some bugs weren't fixable/ weren't worth even trying to fix. This was under the incompatible pressures of deadlines/ milestones and regulatory compliance.

So, all unresolved bugs were compiled and warnings/ notices/ explanations were included when possible in the SOPs to avoid doing certain things or to reinforce that the particular procedures must be followed to the letter.

The rest we discovered during implementation, and I would go back and update the SOPs at the points where failures were noticed to be possible, with disclaimers, warnings and interdictions.

Day to day, new/ undiscovered bugs/ errors required in situ troubleshooting - what worked well enough went into the SOPs/ troubleshooting guide. Showstoppers got kicked back to the devs (and more often than not, other unintended bugs/ weirdness/ interdictions cropped up).

If it's a multi-end user system where the end-user's expertise or conscientiousness isn't guaranteed, never underestimate the incompetency (or carelessness - i.e., they don't give a fuck) of the end-user.

Seriously, if you aren't forced to do this, don't volunteer to do it.


The recent Tacit Knowledge fpp is pretty relevant to your question.

Writing an end-user guide for troubleshooting (beyond the very, very, very basic) is fraught, since you don't know what the technical background of your end users will be. End users change, too.

If you really want to do this:

- Identify your target audience
- Understand your target's knowledge breadth and depth of the software and their QMS
- Think like that target audience, and assume that they're hungover/ heartbroken/ about to be laid off
- Write to that audience
- Expect more grief from writing a troubleshooting handbook than fielding tech support requests/ demands over the course of the lifespan of the software/ your tenure with the company

(in seriousness, it might work better for you to issue tech bulletins as deficiencies/ errors are reported and are resolved (or not) or somesuch rather than offer a troubleshooting guide)


I'm overthinking this and if it's just "you don't know if your URL was wrong, or the service itself is down" - that's really low level diagnostic stuff and not the responsibility of the devs of a complex bit of software.

For this level of troubleshooting, I insert a handwave of something along the lines of "contact your local IT department" even if you suspect/ know that none such exists.
posted by porpoise at 1:29 AM on July 14, 2021 [8 favorites]

Writing a troubleshooting guide along the lines of one for a commodity appliance isn't going to do you much career progression cred - unless you can frame it as an example of "writing to a lay audience" and have some stats to back up benefits analysis for your company.

If you do write something like this, you'll likely get more blowback than if the complaints went somewhere else without you trying to be helpful.

If you're getting paid hourly to do this (instead of part of a salaried job with other higher priority projects), by all means.

I'd separate the document by failure mode/ error return. Then copy and paste the most common solutions. "Hard reboot ...." etc.

Give that document dates and revision numbers. Keep updating it. If you're in the job that I think you're in, this could be good for your stats.
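A minimal sketch of that layout, in Python: one entry per failure mode, fixes listed most-common first, and a revision number and date that get bumped on every update. The names (`Entry`, `Guide`) and the sample 404 entry are made up for illustration, not taken from any real guide.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List


@dataclass
class Entry:
    """One section of the guide, keyed by failure mode or error return."""
    failure_mode: str
    fixes: List[str]  # most common solutions, most likely first


@dataclass
class Guide:
    revision: int
    revised_on: date
    entries: Dict[str, Entry] = field(default_factory=dict)

    def add(self, entry: Entry, today: date) -> None:
        """Add or replace an entry and bump the revision stamp."""
        self.entries[entry.failure_mode] = entry
        self.revision += 1
        self.revised_on = today

    def render(self) -> str:
        """Plain-text guide: revision header, then one block per failure mode."""
        lines = [f"Troubleshooting guide, rev {self.revision} ({self.revised_on.isoformat()})"]
        for mode in sorted(self.entries):
            lines.append("")
            lines.append(mode)
            for n, fix in enumerate(self.entries[mode].fixes, start=1):
                lines.append(f"  {n}. {fix}")
        return "\n".join(lines)


# Hypothetical sample content, for illustration only.
guide = Guide(revision=0, revised_on=date(2021, 7, 14))
guide.add(
    Entry("HTTP 404 from API", ["Check the request URL", "Check the API version in the path"]),
    today=date(2021, 7, 15),
)
print(guide.render())
```

Keeping the revision stamp inside the same structure as the entries means nobody can update the content without also updating the date.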

If you want examples, look to your internet provider and their "troubleshooting" guides (which are almost completely useless except for someone who's savvy enough to ditch dial-up AOL at this late time, but not savvy enough to do a modem reset or a hard reboot).
posted by porpoise at 1:57 AM on July 14, 2021 [2 favorites]

Sounds like you're trying to create a knowledgebase / FAQ / Wiki
posted by kschang at 3:21 AM on July 14, 2021 [1 favorite]

Best answer: In college we discussed this within the topic and course on Technical Writing. That should be a useful term to search and learn from. It focuses on exactly what you need. In fact, we often brought up examples like a manual for a nuclear reactor and how to write instructions so anyone could operate it in an emergency.
posted by Crystalinne at 4:32 AM on July 14, 2021

There are a lot of excellent tech writers who are out of work or poorly paid because software companies tend to view documentation as extraneous to a good product these days. Many of them spent years or decades honing their craft.

If you have the budget to hire one, please consider doing so. The Write The Docs job board is a great place to find an equally great writer.
posted by Sheydem-tants at 4:47 AM on July 14, 2021 [6 favorites]

Best answer: (Disclaimer: I'm a dev not a technical writer) Like most pieces of writing, I think just getting something drafted into a document is the best place to start. Getting bullet point notes on the information you personally know down on paper is a significant improvement on where it sounds like you are now. That document can then be shared and other folks can add their bullet points of knowledge. Once you have more concrete pieces of the content, then you can start to organize it into actual documentation.

I do think some of this should fall on devs themselves. In the case of the API example, for instance, there really should be a list somewhere of all the HTTP status codes it might return and what they mean exactly within the context of the API. If there is no such list, I would characterize the software not so much as "complicated" as "just plain bad". If there is such a list, then gathering up the existing documents for the pieces that interact with each other (or heck even making a drawing of all the pieces) will again put you in a better place than you are now.

(FWIW the 404 response should generally indicate the service is UP... after all, it did respond by saying it did not like the request. If the service were down, it's more likely the response would either be null or in the 5xx range indicating server error.)
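As a rough sketch of that rule of thumb, here is a small Python function that turns a status code (or the absence of any response) into a first guess about where to look. The function name and the wording of the hints are invented for illustration, and a proxy or gateway sitting in front of the service can muddy the picture, so treat the result as a starting point only.

```python
from typing import Optional


def triage(status: Optional[int]) -> str:
    """Map an HTTP outcome to a first guess about where to look.

    status is the response code, or None when no response arrived at all
    (connection refused, DNS failure, timeout).
    """
    if status is None:
        return "no response: service or network may be down"
    if 500 <= status <= 599:
        return "server error: the service answered but is failing"
    if status == 404:
        return "service is up: check the URL or resource id"
    if 400 <= status <= 499:
        return "service is up: check the request (auth, parameters)"
    return "no error indicated by the status code"


for outcome in (404, 503, None):
    print(outcome, "->", triage(outcome))
```

A per-API list of status codes and their meanings would replace the generic hints here with specific ones.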
posted by Press Butt.on to Check at 7:11 AM on July 14, 2021 [2 favorites]

There are at least three phases to the recovery:
* How would you know that the system not doing what you want means it's broken?
* (alternatively: What even is this error message?)
* What does that suggest you should check to understand what's failed?
* What to do with it to return the system to functionality?

The domains these each cross make this effort a terrifying fuckton of information to capture. Thankfully, mapping the status information, the diagnostic information and the recovery effort are all automatable, too, for a self-recovering system. Start small and keep on taking bites out of the problem, and before you realise it, you have saved a lot of time and effort.
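To make the three phases concrete, here is a minimal Python sketch in which each procedure carries a detection check, a diagnosis step and a recovery action, and a small runner walks through them in order. Everything in it (`Procedure`, `run`, the fake "queue worker" state) is hypothetical and stands in for real health checks and restart commands.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Procedure:
    name: str
    is_broken: Callable[[], bool]   # phase 1: is it actually broken?
    diagnose: Callable[[], str]     # phase 2: what has failed?
    recover: Callable[[str], None]  # phase 3: return it to functionality


def run(proc: Procedure, log: List[str]) -> None:
    """Walk one procedure through detect, diagnose, recover, logging each step."""
    if not proc.is_broken():
        log.append(f"{proc.name}: ok")
        return
    cause = proc.diagnose()
    log.append(f"{proc.name}: broken, cause={cause}")
    proc.recover(cause)
    outcome = "still broken" if proc.is_broken() else "recovered"
    log.append(f"{proc.name}: {outcome}")


# Fake system state so the example runs without any real service.
state = {"queue_worker_running": False}


def restart_worker(cause: str) -> None:
    state["queue_worker_running"] = True


proc = Procedure(
    name="queue worker",
    is_broken=lambda: not state["queue_worker_running"],
    diagnose=lambda: "worker process not running",
    recover=restart_worker,
)
log: List[str] = []
run(proc, log)
print("\n".join(log))
```

The same three fields work as headings in a written guide before any of it is automated, which fits the "start small" advice.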
posted by k3ninho at 1:04 PM on July 14, 2021 [1 favorite]

Addendum: Designing the system for diagnosis (or so you can replay inputs and trace the path of failure) and capturing the workflows needed to recover the system are like planting trees: the best time to start was 20 years ago and the second-best time is today.
posted by k3ninho at 1:06 PM on July 14, 2021

Even the largest software companies don't have something like this. Basically the best you get is a knowledge base article somewhere for each individual problem. Each article takes a specific scenario, describes the symptoms of the problem, then walks through troubleshooting steps. You rely on full-text search capabilities (either of some knowledge base CMS or of something like Google for public documents) to allow people to find what they need; there usually isn't much organization.
posted by Aleyn at 3:00 PM on July 14, 2021

My area of expertise is building teams focused on troubleshooting software bugs reported by users and internal support agents. Over the years I've seen many similar things attempted for use by devs, agents, and users. No matter what tools or organizational methods are used, the place they all eventually fail is: there's no explicit ownership for keeping the documentation maintained and up-to-date.

When the latest issues aren't captured, or when people find errors because the new process wasn't updated, people stop trusting it and stop using it, and it's really difficult to get them to start again even if you fix that underlying problem.

Beyond that, my main pointers are to make sure it's searchable and write the content with searchability in mind, and include examples of real similar problems for reference.
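A toy illustration of writing with searchability in mind, in Python: each article stores its symptoms in the words a user would actually type (including literal error text), and a naive word-overlap search ranks articles against a query. The `Article` shape, the `search` function and the two sample articles are assumptions for the sake of the example; a real knowledge base would lean on its own full-text search.

```python
import re
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Article:
    title: str
    symptoms: str     # phrased the way users describe the problem
    steps: List[str]  # troubleshooting steps, in order


def tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(articles: List[Article], query: str) -> List[Article]:
    """Rank articles by how many query words appear in title or symptoms."""
    q = tokens(query)
    scored = [(len(q & tokens(a.title + " " + a.symptoms)), a) for a in articles]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [a for score, a in ranked if score > 0]


# Hypothetical sample articles.
kb = [
    Article("API returns 404", "404 not found when calling the API",
            ["Check the request URL"]),
    Article("Connection timed out", "request hangs then connection timed out",
            ["Check whether the service is reachable"]),
]
for hit in search(kb, "why do I get 404 from the api"):
    print(hit.title)
```

An article whose symptoms never mention the literal error text will not be found by someone pasting that text into a search box, however good the article is.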
posted by rhiannonstone at 8:13 PM on July 14, 2021 [1 favorite]
