admins: intelligent rate-limiting of error notifications?
September 26, 2008 2:55 PM

I have a web service that generates errors. Currently, each time there's an error, it gets sent to an admin mailing list. Is there an open-source tool that will summarize repeated duplicate errors, rather than generating a separate notification for each?

For example, say there are 100 'service foo threw exception bar' errors in a 15 minute window.

What I'd like to see is one initial 'service foo threw exception bar' notification, followed by a single second notification 'service foo threw exception bar repeated 99 times.'

This seems like it must be a solved problem, where there's a single 'sink' for all of the notifications from various services that can summarize duplicates and in doing so prevent alert fatigue (and enormous pager/SMS bills).

So I turn to the collective wisdom of ask.mefi - does this exist?
posted by zippy to Computers & Internet (6 answers total)
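
For illustration, the behavior asked for above (one initial notification, then a single "repeated N times" summary per window) might be sketched roughly like this in Python. Everything here is hypothetical: the `DedupSink` class, its `send` callback, and the injectable `clock` are made-up names for the sketch, not part of any existing tool.

```python
import time


class DedupSink:
    """Forward the first occurrence of a message, then summarize repeats per window."""

    def __init__(self, send, window_seconds=15 * 60, clock=time.time):
        self.send = send              # callable that delivers one notification (assumed)
        self.window = window_seconds  # 15 minutes, as in the example above
        self.clock = clock            # injectable so the sketch is easy to test
        self.open = {}                # message -> [window_start, repeat_count]

    def notify(self, message):
        now = self.clock()
        self.flush(now)               # close any windows that have expired
        entry = self.open.get(message)
        if entry is None:
            self.open[message] = [now, 0]
            self.send(message)        # first occurrence goes out immediately
        else:
            entry[1] += 1             # duplicates are only counted

    def flush(self, now=None):
        """Send one summary for each expired window that saw repeats."""
        now = self.clock() if now is None else now
        for message, (start, repeats) in list(self.open.items()):
            if now - start >= self.window:
                del self.open[message]
                if repeats:
                    self.send("%s repeated %d times" % (message, repeats))
```

In this sketch a summary only goes out when `flush()` runs, so something (a timer or cron-style loop) would need to call it periodically; otherwise the summary waits until the next `notify()`.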
 
I'd just dump the errors in a database with timestamps and run simple SQL queries, like counting the number of alerts that were generated in the last 15 minutes. If it's greater than your threshold, send out an email.
posted by wongcorgi at 3:08 PM on September 26, 2008
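
A minimal sketch of the database approach described above, using Python's built-in `sqlite3` as a stand-in for MySQL. The `alerts` table, the function names, and the threshold of 10 are all assumptions made up for the example.

```python
import sqlite3
import time

WINDOW_SECONDS = 15 * 60   # "between now and 15 minutes ago"
THRESHOLD = 10             # hypothetical; pick whatever suits your service


def open_db(path=":memory:"):
    """Open the database and create the (assumed) alerts table if needed."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS alerts ("
        "message TEXT NOT NULL, created_at REAL NOT NULL)"
    )
    return db


def record_alert(db, message, now=None):
    """Store one alert with its timestamp instead of emailing it right away."""
    now = time.time() if now is None else now
    db.execute(
        "INSERT INTO alerts (message, created_at) VALUES (?, ?)",
        (message, now),
    )
    db.commit()


def recent_count(db, message, now=None):
    """Count alerts with this message generated inside the window."""
    now = time.time() if now is None else now
    row = db.execute(
        "SELECT COUNT(*) FROM alerts WHERE message = ? AND created_at > ?",
        (message, now - WINDOW_SECONDS),
    ).fetchone()
    return row[0]


def should_email(db, message, now=None):
    """True when the count in the window is greater than the threshold."""
    return recent_count(db, message, now) > THRESHOLD
```

A periodic job could call `should_email()` for each distinct message and send a single email when it returns true.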


You'll probably need to be more specific in terms of the operating system and the web service technologies you're using to get any kind of meaningful answer.
posted by le morte de bea arthur at 3:09 PM on September 26, 2008


Sorry, Debian/Ubuntu is the OS, and this is a Python-based web service on top of MySQLdb.

The tool I'm looking for would consume notifications directly - like how I imagine syslog works - and then have the ability to de-dupe and generate notifications (email, for now).

A friend mentioned Zenoss as a possibility: http://www.zenoss.com/
posted by zippy at 3:32 PM on September 26, 2008


Maybe Nagios with some custom plugins?
posted by ydnagaj at 4:07 PM on September 26, 2008


Is your Python-based web service using Django by any chance? If so, check out django-db-log, which cleverly uses an MD5 hash of the traceback to aggregate many reports of the same error. Not sure if it does emailing, but it's a 1-line mod in the "error batches" model.
posted by cogat at 4:10 PM on September 26, 2008
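
The traceback-hashing idea does not depend on Django. A rough framework-independent sketch, using only the standard library, could look like the following. This is not django-db-log's actual code; `error_fingerprint`, `report`, and the in-memory `counts` are hypothetical names for illustration.

```python
import hashlib
import traceback
from collections import Counter


def error_fingerprint(exc):
    """Return the MD5 hex digest of the formatted traceback of `exc`."""
    text = "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__)
    )
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# fingerprint -> number of times seen (in memory only, for the sketch)
counts = Counter()


def report(exc):
    """Count the error under its fingerprint.

    Returns True only the first time a fingerprint is seen, which is the
    point where a notification would be sent.
    """
    key = error_fingerprint(exc)
    counts[key] += 1
    return counts[key] == 1
```

Because the formatted traceback includes the exception message, errors whose messages carry variable data (IDs, timestamps) would hash differently; hashing only the stack frames might group those better, at the cost of lumping distinct messages together.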


I should be clearer - I'm really interested in a general solution rather than something specific to a particular framework or app (Python-specific solutions, however, would be OK).

ydnagaj, thanks for the Nagios tip. I'll take a look.

cogat, MD5 hash on tracebacks is a really interesting idea. Any idea if there's an implementation/library that isn't django-specific?
posted by zippy at 8:18 PM on September 26, 2008

