Warning System for Server Failure
January 15, 2013 9:07 AM
Sometimes, web sites go down. Our employees and clients are pretty good at noticing downtime and messaging me; then I open a ticket with the hosting company or programmer and get it resolved. I am looking for some kind of downtime warning system.
Ideally there would be a giant magical red button that, when pressed by an employee at the office, would light up a giant spinning police siren at my home and sound a loud noise to pull me away from whatever I'm doing and signal the urgency of the matter. Currently I rely on texts, e-mails, or instant messages, but that does not really help if my phone is on silent and I am asleep at home. A two-part system that plugs into each user's USB port would be ideal, but I'm open to any type of alarm system that warns me about web site downtime.
If it's public facing, Pingdom is a fairly good monitoring site. They give you some free monitoring (which I've been using for years) and notify you by a number of different channels.
If you want an alarm, anything that can be triggered by Twitter, email, etc. would work.
posted by leo_r at 9:11 AM on January 15, 2013 [2 favorites]
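To make "anything that can be triggered by email" concrete, here is a minimal sketch of the filtering step: deciding whether an incoming message is a downtime alert that should set off the alarm. The sender address `alerts@example.com` and the `DOWN` subject keyword are hypothetical placeholders, not Pingdom's actual alert format; check what your monitoring service really sends and adjust both.

```python
from email import message_from_string
from email.message import Message

# Hypothetical assumptions: alerts come from this sender and carry "DOWN"
# in the subject line. Replace both with your monitor's real alert format.
ALERT_SENDER = "alerts@example.com"
ALERT_KEYWORD = "DOWN"


def should_sound_alarm(raw_email: str) -> bool:
    """Return True if a raw RFC 822 message looks like a downtime alert."""
    msg: Message = message_from_string(raw_email)
    sender = (msg.get("From") or "").lower()
    subject = (msg.get("Subject") or "").upper()
    # Both conditions must hold, so ordinary mail with "down" in the
    # subject from someone else does not wake you up.
    return ALERT_SENDER in sender and ALERT_KEYWORD in subject
```

Fetching the mail itself (for example with the standard `imaplib` module) and driving the actual alarm hardware are left out; this only shows the decision of which messages count as emergencies.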
Set up a server ping service and a filter on your phone so that only messages from the ping service ring after whatever time you typically want the phone to be quiet.
posted by COD at 9:12 AM on January 15, 2013
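A hosted ping service does the checking for you, but the core of such a check is small. This is a minimal sketch of a single HTTP availability probe using only the Python standard library, assuming that any 2xx or 3xx response (after redirects) means "up" and that anything else, including timeouts and DNS failures, means "down". Real services add retries and checks from several locations, which this does not.

```python
import urllib.error
import urllib.request


def site_is_up(url: str, timeout: float = 10.0) -> bool:
    """Return True if url answers with an HTTP 2xx/3xx status within timeout."""
    try:
        # urlopen follows redirects, so resp.status is the final status code.
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Covers connection refused, DNS failure, timeouts, HTTP 4xx/5xx
        # (HTTPError is a subclass of URLError) and malformed URLs.
        return False
```

Run from a scheduler such as cron every minute or so, a `False` result is the point at which you would send the notification your phone filter lets through.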
Best answer: Nerdy, but I'd look at using a service like Are My Sites Up with a Belkin WeMo Switch, which integrates nicely with IFTTT.
Have Are My Sites Up send a message to IFTTT, which sends an alert to the WeMo to turn on a light or alarm, or whathaveyou.
posted by backwards guitar at 9:14 AM on January 15, 2013
In my experience the alarm fails just as often as the service being monitored.
I'd settle on one or two services like Pingdom and make sure you have multiple notification paths (e.g. SMS + push notification). The iPhone's new Do Not Disturb feature might keep you from getting woken up by non-emergency calls.
posted by RobotVoodooPower at 9:46 AM on January 15, 2013
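The "multiple paths" advice matters most when one path fails, so the sender should try every channel independently. Below is a minimal sketch of that fan-out, assuming each channel is wrapped as a plain function taking the message text; the channel names and functions are hypothetical stand-ins for whatever SMS gateway or push service you actually use.

```python
from typing import Callable, Dict, List

# A notifier is any function that delivers one message or raises on failure.
Notifier = Callable[[str], None]


def notify_all(message: str, channels: Dict[str, Notifier]) -> List[str]:
    """Send message through every channel; return the names that failed.

    A failure in one channel (e.g. the SMS gateway is down) must not stop
    the remaining channels from being tried.
    """
    failed: List[str] = []
    for name, send in channels.items():
        try:
            send(message)
        except Exception:
            failed.append(name)
    return failed
```

If the returned list is non-empty, that is itself worth logging: an alerting path that silently breaks is the "alarm fails as often as the service" problem described above.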
I read the title of your post and thought Nagios. I read the rest of your post and was still thinking Nagios. It's what people use when they want to know when systems are breaking or broken. It may be more complex to set up than "Are My Sites Up", but it has a vast array of capabilities beyond just website monitoring.
posted by seanmpuckett at 9:46 AM on January 15, 2013 [3 favorites]
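Part of why Nagios is so flexible is its plugin convention: a check is any program that prints a one-line status and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The stock `check_http` plugin already covers plain website checks, so the following is only a sketch of that convention for a custom check, assuming "down" should be CRITICAL and "slower than 2 seconds" a WARNING; the threshold and the script name `check_site.py` are illustrative choices, not Nagios defaults.

```python
import time
import urllib.error
import urllib.request
from typing import List, Tuple

# Exit codes defined by the Nagios plugin API.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(up: bool, elapsed: float, warn_after: float) -> int:
    """Map a check result to a Nagios state: down is CRITICAL, slow is WARNING."""
    if not up:
        return CRITICAL
    return WARNING if elapsed > warn_after else OK


def check(url: str, warn_after: float = 2.0, timeout: float = 10.0) -> Tuple[int, str]:
    """Fetch url once and return (exit code, one-line status text)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            up = 200 <= resp.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        up = False
    elapsed = time.monotonic() - start
    code = classify(up, elapsed, warn_after)
    return code, f"HTTP {LABELS[code]} - {url} ({elapsed:.2f}s)"


def main(argv: List[str]) -> int:
    """Print the status line Nagios displays and return the exit code."""
    if len(argv) != 2:
        print("UNKNOWN - usage: check_site.py URL")
        return UNKNOWN
    code, text = check(argv[1])
    print(text)
    return code
```

To use it as a real plugin, end the file with `import sys; sys.exit(main(sys.argv))` so Nagios receives the exit code, then reference the script from a command definition in your Nagios configuration.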
Two problems really:
1) If you're on call and expected to be responsive 24x7, your phone shouldn't be on silent, basically ever, and especially not when you're likely to be asleep. There are smartphone apps that can set up profiles based on the day or override the default settings for specific messages. This sounds horrible, but that's the truth. Or, hopefully, you are staffed to the point where on-call duty is shift based.
2) Humans should not be in the middle of availability monitoring, ticketing and notification. It doesn't work out too well and it doesn't scale _at all_. Get the humans out of the way unless it's a personal escalation.
Tackling (2) is really the best way to handle this; human behavior is very hard to train. There are dozens of off-the-shelf apps for this, depending on budget and scale, and some of them are even free and, with a little time, fully scriptable. Relying on humans to detect downtime is a bad idea: you need to detect it through automation, follow that automation up with tickets to the necessary providers, and THEN escalate to you or multiple people. The typical way this is handled is with some form of synthetic transactions; a common commercial app for this is SiteScope.
So start by solving (2). It makes being called during an outage far less onerous, because you only get called when there is a truly value-added action you can take. Really, the hosting provider should offer some sort of availability monitoring as well.
Finally, there are plenty of companies that do this type of thing and handle the human angle professionally. They tend to be expensive and tend to want to operate the whole deal, generally applying something based on the ITIL framework. Again, pretty expensive, but something to consider, depending on what downtime costs you.
posted by iamabot at 9:51 AM on January 15, 2013 [1 favorite]
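The "automate detection, ticket the provider, THEN escalate to people" flow described above can be sketched as two small pieces: confirming an outage only after several consecutive failed checks (so one dropped packet doesn't page anyone), and a time-based escalation list. The threshold of 3 failures, the delays, and the action names are all hypothetical policy choices for illustration, not taken from SiteScope or any specific product.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical escalation policy: (minutes since outage confirmed, action).
ESCALATION: List[Tuple[int, str]] = [
    (0, "open ticket with hosting provider"),
    (10, "page on-call person"),
    (30, "page backup contact"),
]


@dataclass
class OutageTracker:
    failures_to_confirm: int = 3  # consecutive failed checks before acting
    consecutive_failures: int = 0

    def record(self, check_ok: bool) -> bool:
        """Record one check result; return True while an outage is confirmed."""
        if check_ok:
            self.consecutive_failures = 0  # any success clears the streak
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failures_to_confirm


def due_actions(minutes_down: int) -> List[str]:
    """Return every escalation step whose delay has elapsed, in order."""
    return [action for delay, action in ESCALATION if minutes_down >= delay]
```

A scheduler would call `record()` after each automated check and, once an outage is confirmed, run any actions from `due_actions()` that have not yet been performed; humans only enter at the later steps.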
iamabot has some great suggestions.
I like the WeMo approach. If you're feeling techy and fancy a DIY project, maybe build your own bedroom alarm system using a task-specific system like a Raspberry Pi running Python that monitors your off-site service for alerts and then switches on a red light and screaming siren via an Arduino on the serial port.
You could implement a detonation button at work using a similar method.
posted by urbanwhaleshark at 12:47 PM on January 15, 2013
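The Raspberry Pi side of that DIY alarm mostly comes down to sending the Arduino a command when the site's state changes. This is a minimal sketch, assuming a hypothetical one-byte protocol in which your Arduino sketch turns the siren on when it reads `1` and off when it reads `0`; the port is any object with a `write(bytes)` method, which a pySerial `serial.Serial` port provides.

```python
from typing import Protocol


class Writable(Protocol):
    """Anything with write(bytes), e.g. a pySerial port or io.BytesIO."""

    def write(self, data: bytes) -> int: ...


# Hypothetical one-byte protocol; the Arduino sketch must be written to match.
SIREN_ON = b"1"
SIREN_OFF = b"0"


def update_siren(port: Writable, site_up: bool, siren_on: bool) -> bool:
    """Switch the siren to match the site's state; return the new siren state.

    Only writes to the port when the state changes, so the Arduino is not
    sent a redundant command on every poll.
    """
    want_on = not site_up
    if want_on != siren_on:
        port.write(SIREN_ON if want_on else SIREN_OFF)
    return want_on
```

On the Pi you might open the port with something like `serial.Serial("/dev/ttyACM0", 9600)` (the device path and baud rate depend on your board), then loop: run your check, call `siren_on = update_siren(port, up, siren_on)`, and `time.sleep(60)`. The same pattern in reverse works for the button at the office.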
I came to recommend Nagios as well. It's a really capable, reliable system, and you can have it email you when something (you tell it what the thing is) happens.
posted by nosila at 1:38 PM on January 15, 2013
You're out of the picture and the process is then automated.
posted by zombieApoc at 9:11 AM on January 15, 2013 [1 favorite]