CRITICAL ERROR
February 13, 2010 2:18 PM

We monitor our systems with Nagios (via Opsview). How should I set WARNING and CRITICAL disk notifications on a server to make these notifications informative or productive?

Nagios offers two levels of notification: WARNING and CRITICAL. For disk usage on a particularly popular file server, we have it set to warn at 85 percent full and go critical at 90 percent full, and this is clearly too low; critical notices go unacknowledged for weeks because everyone knows it's not a big deal.

What I aim to do is calibrate these alerts so that they translate into reasonable cues to take action. I have a plan to determine a better number for deciding when to act, but the problem I think I'm having is deciding whether that number should be a WARNING or a CRITICAL level.

So I'm wondering, what meanings do MeFites give to WARNING and CRITICAL messages?

posted by pwnguin to Computers & Internet (7 answers total)

Warning means open a ticket, get it resolved in a week or so. Critical means service impacting outage ongoing or imminent.
posted by iamabot at 3:15 PM on February 13, 2010 [1 favorite]

I've done a lot of monitoring - if people aren't responding to your set limits for warning and critical messages, I let them know that I will be removing or disabling their monitor. In my company, warnings need to be cleared in 2 hours or else they will be getting a call.

Stale alerts clutter the view of whoever's monitoring and, in my experience, make for missed problems and downtime.
posted by wongcorgi at 3:48 PM on February 13, 2010

We use Nagios at work too. Echoing iamabot, warnings are things that need to get taken care of in the short term, but do not cause downtime or impact service. Crits are reserved for the things that make your phone/pager go crazy and get people out of bed at 4AM. Basically for us crit is for anything anybody outside of IT would be affected by, and warning is for things that are trending towards crit.
posted by tracert at 4:03 PM on February 13, 2010

In some cases, particularly with very large SANs, I set alerts not at a percentage full (where on a big enough SAN the last few percent can still be 100GB), but to start warning when only a fixed amount of free space remains (say 2GB for warning, 500MB for critical). To do this, you need to mess with the check_disk script call so it doesn't use percentages. You can use the -u flag to specify the units (MB, GB, etc).

Obviously, your thresholds will vary, so you'll want to assess how your users are storing stuff and how volatile your storage rates are.
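A rough sketch of the fixed-free-space idea in Python, not the actual check_disk plugin. It is roughly what a call along the lines of `check_disk -w 2048 -c 500 -u MB` is meant to do. The function names are made up for illustration, and the 2GB/500MB defaults are just the example numbers above; the 0/1/2 return values follow the usual Nagios plugin exit codes.

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

MB = 1024 ** 2
GB = 1024 ** 3


def classify_free_space(free_bytes, warn_bytes=2 * GB, crit_bytes=500 * MB):
    """Map an absolute amount of free space to a Nagios state.

    The defaults are the example thresholds from this thread, not
    recommendations; tune them to how fast your users fill the disk.
    """
    if free_bytes < crit_bytes:
        return CRITICAL
    if free_bytes < warn_bytes:
        return WARNING
    return OK


def check_path(path="/"):
    """Check one mount point and print a Nagios-style status line."""
    usage = shutil.disk_usage(path)
    state = classify_free_space(usage.free)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[state]
    print(f"DISK {label} - {usage.free // MB} MB free on {path}")
    return state
```

Because the thresholds are absolute, growing the volume does not silently move the point at which the alert fires, which is the main advantage over a percentage.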
posted by jenkinsEar at 4:04 PM on February 13, 2010

Hmm. I think jenkinsEar is on to something there. My plan is in fact to plot the disk usage over time and see how fast users are filling it up; at the end of every semester we usually do a cleanup sweep of accounts. Using fixed free-space warnings sounds more robust if we have to increase disk size. I'll need to check if the SNMP version of disk check supports this mode.
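One way the "how fast is it filling up" part might be sketched, assuming usage is sampled as (day, used_bytes) pairs and that growth is roughly linear between cleanup sweeps. Both function names are hypothetical, and a straight-line fit is only one possible model.

```python
def fill_rate_per_day(samples):
    """Least-squares slope of used bytes against time in days.

    samples: list of (day, used_bytes) pairs, at least two distinct days.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    num = sum((t - mean_t) * (u - mean_u) for t, u in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one point in time")
    return num / den


def days_until_free_below(samples, capacity, min_free):
    """Estimate days until free space drops below min_free.

    Returns None when usage is flat or shrinking, since a linear
    projection never reaches the threshold in that case.
    """
    rate = fill_rate_per_day(samples)
    if rate <= 0:
        return None
    last_used = samples[-1][1]
    remaining = (capacity - min_free) - last_used
    return max(remaining, 0) / rate
```

If the estimate for the CRITICAL threshold comes out shorter than the time it takes to react (or shorter than the gap until the next cleanup sweep), that suggests the threshold is set too tight to be useful.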

It sounds like people use CRITICAL to mean 'helpdesk should expect calls about this', and WARNING to be somewhere between 'take care of it in the next scheduled maintenance window' and 'drop what you're doing and fix it.'
posted by pwnguin at 6:00 PM on February 13, 2010

"helpdesk will expect calls about this if you don't do something NOW" is a better way to approach CRITICAL IMO.
posted by flaterik at 12:20 AM on February 14, 2010

Warning = *frown* but Critical = "AAiiieee!!!" *log in immediately to fix while yelling over cube wall*
posted by wenestvedt at 9:03 AM on February 15, 2010

This thread is closed to new comments.

Ask MetaFilter
