Best practices for shell script output
November 6, 2020 11:21 AM

Are there style guides or best practices for the readability of the output of shell scripts? I am writing a number of scripts that should generate easily-scanned results, where users can run their eye down the lines and instantly spot trouble. I would like to not reinvent the wheel.

These scripts have to be parsable by people (DBAs, sysadmins, and their managers). They should be succinct on success and specific on failures.

They're going to check on the availability of a dozen to twenty servers & services, and produce a whole screen's output each time -- so brevity is vital. No color is required, no images will be used.

I was thinking about placing square brackets around the OK/failure state (padded by spaces), with an explanation in case of failure.
posted by wenestvedt to Computers & Internet (13 answers total) 2 users marked this as a favorite
 
The old way would have been to produce no output for the servers that are up and providing the correct services. But no information at all is not very useful for people who want to find out how well each server/service is working. It's also unforgiving if your monitoring regime stops working: is no output at all perfect operations, or a broken monitor? I still cringe at the damage that arrogant neophyte scruss did to other people's systems by assuming that.

Can you use ANSI codes to highlight [ok] and [fail] entries?

Someone will try to use grep or awk to parse these results, so please try to include a line in the report that has a fixed number of fields, one of which is a timestamp. It won't give all of the information, but it should help identify which servers need to be looked at.
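
For example, a rough sketch of what such a line could look like (the host and service names here are invented, and the field widths are arbitrary):

    # one fixed-format line per check: timestamp, host, service, state, detail
    ts=$(date '+%Y-%m-%dT%H:%M:%S')
    printf '%s %-10s %-8s %-4s %s\n' "$ts" db01  oracle FAIL "no listener on 1521"
    printf '%s %-10s %-8s %-4s %s\n' "$ts" web01 httpd  OK   -

    # field positions stay fixed, so later:
    #   awk '$4 == "FAIL"' report.txt
    #   grep ' FAIL ' report.txt
    # and the ANSI highlighting for [ok]/[fail] could be as simple as:
    #   printf '\033[31m[FAIL]\033[0m %s\n' db01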

Pages and pages of output will always have detail that people miss.
posted by scruss at 11:45 AM on November 6, 2020 [5 favorites]


I would set up a cron job that will export to CSV and then email the results to the relevant parties.

For the CSV part, from a unix.com question without a date


For the email part, from a linuxhint.com post in 2018
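
Not from those links, but roughly what that could look like (the script path, schedule, filenames, and addresses below are placeholders):

    # check_services.sh emits one CSV row per check on stdout
    # (the variables are stand-ins for whatever your checks produce)
    printf '%s,%s,%s,%s\n' "$(date '+%F %T')" "$host" "$service" "$state"

    # crontab entry: run the checks every weekday at 07:00 and mail the result
    0 7 * * 1-5 /usr/local/bin/check_services.sh > /tmp/status.csv; mail -s "Service status" dba-team@example.com < /tmp/status.csv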
posted by bilabial at 11:53 AM on November 6, 2020


In case you want the output to be machine parseable but with extra things (e.g. ANSI colors, aligned columns, or whatever) when parsed by humans, you can detect whether stdout is connected to a terminal (as opposed to a pipe or a file) via something like this:
    # -t 1 tests whether file descriptor 1 (stdout) is a terminal
    if [ -t 1 ]
    then
        consumer="humans"
    else
        consumer="machines"
    fi
    echo "Here is output suitable for $consumer"

posted by smcameron at 11:54 AM on November 6, 2020 [5 favorites]


Best answer: For human parsing, make it easy for the important bits to line up in columns. The usual problem I have with this is really long lines getting wrapped in the terminal. There are a variety of ways to deal with this; one of the easiest is shorter lines. :-)

Color is really nice for scanning, but at a minimum turn it off when stdout isn't a terminal (you can also have always-on and always-off switches so you can pipe to less -R, etc.).
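
Something along these lines, roughly (the flag spelling and variable names are just one possible way to do it):

    # default: color only when stdout is a terminal; flags force it on or off
    use_color=auto
    case "$1" in
        --color)    use_color=always ;;
        --no-color) use_color=never  ;;
    esac

    if [ "$use_color" = always ] || { [ "$use_color" = auto ] && [ -t 1 ]; }; then
        red=$(printf '\033[31m') green=$(printf '\033[32m') reset=$(printf '\033[0m')
    else
        red='' green='' reset=''
    fi

    printf '%s[FAIL]%s db01\n'  "$red"   "$reset"
    printf '%s[ OK ]%s web01\n' "$green" "$reset"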

A somewhat non-traditional option would be to optionally include emoji in the output. If you're really catering to humans (and the terminal config supports it) it can really help, but there are good reasons why this may not be a good idea for your setup.
posted by RikiTikiTavi at 12:22 PM on November 6, 2020 [1 favorite]


Best answer: Use columns and whitespace to your advantage. Don't create a 1-character column with 'Y' or 'N'; create a 5-character column with [OK    ] or [ERROR], so the shape of the column changes when there is an issue. After a while looking at a particular report, I can identify if the shape is correct far faster than I can verify the correct data is present in each row. Make the good state and the bad state(s) visually different, so you can see at a glance that there's something that needs attention.
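
For instance (a sketch, with arbitrary widths and invented hosts):

    # %-5s pads the state to five characters, so OK and ERROR rows have
    # visibly different shapes but the later columns still line up
    printf '[%-5s] %-12s %s\n' OK    web01 "http 200 in 40 ms"
    printf '[%-5s] %-12s %s\n' ERROR db01  "no listener on port 1521"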

Think about the possible failures that could be ongoing, and try to make them visually different from those that need immediate action. If you have a warning that maintenance is needed within the next month, and a different warning that things are on fire, make sure that the user can tell the difference. Consider what happens when a system needs maintenance, it is deferred until next week, but then it starts on fire. Make sure that the most critical issues don't get masked, and make sure that the shape of the display changes when the severity of an issue changes.

Be honest and realistic about the severity of an issue. If the policy says that a particular issue needs to be fixed immediately, but you know that in practice it works well enough anyway so you can't get downtime approved until the weekend, don't put that issue in the same category as an error that shouldn't be ignored. If you're used to ignoring a sea of statuses that say 'CRITICAL ERROR: Metric below 100%', I guarantee you that sooner or later you're going to miss 'CRITICAL ERROR: Metric below 5%' or 'CRITICAL ERROR: System offline.'
posted by yuwtze at 2:04 PM on November 6, 2020 [8 favorites]


You could export data in key/value pairs so that it could be read and parsed in a variety of human-readable formats (and even by dashboard/charting tools). Then write the scripts to parse it into readable format or find something downloadable that would do that.
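
e.g. something like this (the keys and values are invented):

    # one record per line as key=value pairs; readable as-is, and trivial
    # for a later formatting/charting script to pick apart
    echo "time=2020-11-06T11:21:00 host=db01  service=oracle state=fail detail=no_listener_on_1521"
    echo "time=2020-11-06T11:21:00 host=web01 service=httpd  state=ok"

    # a quick failure filter, for example:
    #   grep 'state=fail' status.log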
posted by matildaben at 2:42 PM on November 6, 2020


> Be honest and realistic about the severity of an issue. If the policy says that a particular issue needs to be fixed immediately, but you know that in practice it works well enough anyway so you can't get downtime approved until the weekend, don't put that issue in the same category as an error that shouldn't be ignored.

It's also very healthy (and fair) to have the people who will be consuming these reports have input into how the reports are structured, and which kinds of issues will be regarded as critical and will trigger alerts. Being on call and getting paged to respond to false-positive alerts - and being unable to adjust the rules to reduce how often false-positive alerts will be generated in future - is the kind of thing that causes people to quit jobs.
posted by are-coral-made at 3:26 PM on November 6, 2020


You might ask yourself if this is really the thing you want to be building. There are SO MANY open source monitoring tools out there; is rolling your own really what you need to do? Arguably, a best practice would be to use an existing CLI tool like Monit, plus you get a web dashboard if you would like that as well (and you will end up building a dashboard, you just don't know it yet).
posted by rockindata at 4:33 PM on November 6, 2020 [2 favorites]


Best answer: Ncurses helps render ANSI tty output.

I'm going to assume that you can't log to a file and then post-process the text in elasticsearch+logstash+kibana/grafana, but in 2020 that's the suite of tools for monitoring status and creating near-real-time dashboards.

With or without dashboards, aim for the following: output things that are symptoms of the current state and which together make diagnosis of failure quick and easy.
posted by k3ninho at 1:55 AM on November 7, 2020 [1 favorite]


I second the "no news is good news" approach. Only print problems, and be silent on success.
If this isn't possible, then consider splitting output between stdout and stderr so it's easy to silence the [OK] flannel.

Testing for a terminal and switching to CSV on piped output is a good one as well.
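
A tiny sketch of that stdout/stderr split (the hosts are invented):

    # OK chatter on stdout, problems on stderr, so
    #   ./checks.sh >/dev/null
    # leaves only the failures on screen
    echo "[ OK ] web01 httpd"
    echo "[FAIL] db01  oracle: no listener on 1521" >&2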
posted by rum-soaked space hobo at 8:18 AM on November 7, 2020


Best answer: Don't put [] around things; you have to escape them somehow when you're grepping. Adding them only when the tty test passes just leads to confusion when the data you're grepping isn't the same data that you normally see.

You can use flags for everything else. By default, just show what's ok. A flag to show what's nok. A flag to show both, with the ok/nok first, then the server, and (on nok lines) comments from the third column on. A flag to just list the servers that get checked. A flag for csv or tsv or json etc. (This is easier to do in a scripting language other than sh, where you can do your stuff and build a data structure and then transform it at the end as desired.)

It's best if the ok/nok is at the beginning of the line so you can grep /^o/ or /^n/ or some similar thing with awk/perl/python, whatever.

Try to be simple space-separated for the leading columns of data, and leave any comment at the end. People can easily limit their split-into-fields step to take care of that last space-filled column, but it's a PITA if you have something like a yy/mm/dd <space> hh:mm:ss that you have to patch together yourself as a special case.

command | wc -l # number of up machines
command -d | wc -l # number of down machines
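
As a sketch, output like this keeps those one-liners (and the ^o/^n greps) working; the hosts and the comment are invented:

    # example output: status first, then host, then any comment
    #   ok  web01
    #   nok db01  listener not answering on 1521
    #
    # which makes the filters trivial:
    command | grep -c '^ok'      # count of up machines
    command | awk '$1 == "nok"'  # just the problem lines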

Maybe add a bells-and-whistles flag for the pointy-head types. Most of the backend types I've known would much prefer the simple flag-based version; they'll make a script or alias to get it to do the thing they want.
posted by zengargoyle at 11:44 AM on November 7, 2020 [1 favorite]


Response by poster: Thanks, everyone!! I should have expected that there's no industry standard for plaintext reporting, but this was a good discussion that gives me lots to think about.

No news is actually no news, not good news.

This is a flint axe that management asked for, to be run manually, so it will never require the refinements of running headless, or passing its output to another program, or storing results over time, or anything else that's actually good. :7) They want positive confirmation that a suite of services on different hosts are all running, and this satisfies them. *shrug*

After a decade of running Nagios (R.I.P.), and another decade with Orion and a SIEM (ugh), I'm pretty well versed in monitoring platforms. We have Graylog, we have Oracle Grid Control, we have LogRhythm, we have secret scripts in DBAs' homedirs, we have it all. We have a shedload of cron jobs that verify specific things. But I do appreciate the reminder to check my assumptions! :7)

I did add flags to run the checks against the TEST and PROD environments. It's smart to have another one that dumps the list of hosts & services: that way, I can have the administrators regularly confirm that they haven't changed anything!

I really do like the idea of using the flame emoji when the database doesn't answer port 1521, though -- that's hilarious. I used to use it as my PS1 after I `sudo su -` to root. 🔥
posted by wenestvedt at 7:18 AM on November 9, 2020 [1 favorite]


Response by poster: Holy cats, it's a month later, and I think the very thing I wanted was just published!

https://clig.dev
Command Line Interface Guidelines
An open-source guide to help you write better command-line programs, taking traditional UNIX principles and updating them for the modern day.
(Direct link to the section on output is https://clig.dev/#output)
posted by wenestvedt at 1:26 PM on December 6, 2020 [2 favorites]

