Calculating web app downtime on a per-user basis?
April 2, 2010 8:17 AM

My employer is looking for a more accurate way to calculate downtime in web applications. Right now, we calculate in minutes of overall downtime, but I feel it would be more accurate to count "user minutes" that are lost - 20 mins of downtime when there were 10 users on the system is far less important than 20 mins of downtime with 500 users online. Is there a standard out there for doing this?
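To make the idea concrete, here is a rough sketch of the "user minutes" arithmetic. The function name is just for illustration, and it assumes a concurrent-user count is available for the time of each outage:

```python
def user_minutes_lost(outage_minutes, users_online):
    """User-minutes lost in one outage: duration times concurrent users.

    Both inputs are assumptions: this presumes the number of users
    online at the time of the outage can be measured or estimated.
    """
    return outage_minutes * users_online


# The two outages from the question: same duration, very different impact.
quiet_outage = user_minutes_lost(20, 10)
busy_outage = user_minutes_lost(20, 500)
print(quiet_outage, busy_outage)
```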

We have about a dozen applications, all of which have different total numbers of users, with different usage patterns. Has anybody done something like this, and can recommend how we can get more useful metrics?

posted by um_maverick to Computers & Internet (6 answers total)
You run a (maybe acceptable to you) risk when you try to count downtime on a per-user basis. In your example you count 20 mins of downtime with 500 users as more "countable" than 20 mins of downtime with 10 users online.

Imagine the scenario of 5 mins of downtime when there is only one user online. Unfortunately that one user was the CEO demonstrating your product to a venture capitalist. This hypothetical is meant to highlight the fact that you open yourself to political risks when you consider all use and users of your system as exactly equivalent. Whether this is relevant or important depends on the service your app supplies and the user base that might be using it.
posted by Babblesort at 8:30 AM on April 2, 2010

I'm not sure condensing the description of an outage to a single number is that helpful. Typical outage descriptions IMO should include details like "50% of users had a degraded experience for 4 hours. During the outage, affected users were unable to log in or view images (or whatever). The revenue impact was 50% revenue loss for the outage period (due to lost pageviews). Pageviews following the incident seem unaffected, no loss in traffic or unique users is seen in metrics the following day or week."

Outages have a lot of characteristics. It seems like you want to use "user minutes" as a single measurement of how bad an outage was so different outages can be effectively compared. Rather than ranking outages by this specific metric, which excludes all other details (such as the nature of the impact, revenue, and impact on traffic patterns), why not just establish somewhat arbitrary severity classes of incidents and group incidents of similar severity?
posted by doteatop at 9:02 AM on April 2, 2010

Hey, re-reading this, your question was about the specific measure of 'downtime' as opposed to outage ranking. I guess what you want to measure for this depends on what you want the measurement for; if it's for determining SLA compliance or defining SLAs, I would go with something like the above, defining your SLA in terms of the kinds of outages and the allowable percentage of failures of each given kind (or severity). For commerce-driven sites, another good metric is revenue - defining an SLA in terms of amount of money the operator of the site is allowed to lose is nicely concrete.

With external customers these definitions can be pretty vague, but internally it makes sense to be as explicit as possible. Without knowing more about what kinds of sites you run, or what you want the measurement of 'downtime' for, it would be hard to say more.
posted by doteatop at 9:33 AM on April 2, 2010

I have never heard of using user-downtime as a metric, though I'm sure it's done somewhere. I think it can only be a distraction, and unless your employer really, really likes numbers and stats, such that you only need to worry about supplying them with more and more "inside baseball" figures, "app downtime" is the only one you need to worry about.

I can't think of anything that is actionable based on user-downtime and in fact it may be detrimental: if there's too much downtime and users stop using the app, the stats for that app become less important since downtime for popular apps is more important than for apps that aren't used as much. I'm not sure that's a good direction to go in.

"The standard" is app downtime. Rather than coming up with new numbers to distract the boss, someone needs to crack the whip on the developers.
posted by rhizome at 10:47 AM on April 2, 2010

It clearly is beneficial to weight downtime at peak usage hours more heavily than downtime at nadir usage hours. Otherwise, why would anybody bother scheduling planned outages at inconvenient times like 3am on a Sunday? They'd do it midday during the work week.
posted by hattifattener at 12:20 PM on April 2, 2010

Best answer: At one of the companies I worked for I was part of a team that was tasked with coming up with appropriate SLAs with our vendors for uptime. I can give you the highlights of the approach we took.

First, each component was evaluated for the number of users it *served* (not just active users, i.e., the number of users that *could* be affected by an outage). Second, we broke the week out into times of heavy, active, and low usage. So Saturdays and Sundays would be active, work hours were heavy, and nights were low use. We established a percentage for each of those times. So heavy might have been 50%, active 30%, and low 5%. This factor would be multiplied against the number of users being "served" to get an "impacted users" number.

In our system, each user was generally equivalent to any other, but if you have some users who are more important than others (like customers, or executives, or what have you) you may want to establish a weighting pattern for them. Something like: a paying customer or executive is equivalent to 3 regular users. Now you have a calculation that gives you the number of users theoretically affected by an outage. The SLA quantified uptime in these numbers. So component X could only have the equivalent of Y hours of outage per user per year. This worked out to be a much fairer system for all involved.
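A minimal sketch of that calculation in Python, for anyone who wants to play with it. The percentages are the example ones above; the function names, the VIP weight of 3, and the way outage hours are rolled up into a per-user figure are my assumptions for illustration, not the exact formula from our SLA:

```python
# Example usage-period factors from the description above.
PERIOD_FACTOR = {"heavy": 0.50, "active": 0.30, "low": 0.05}


def weighted_served_users(regular_users, vip_users=0, vip_weight=3):
    """Served-user count, with each VIP counted as `vip_weight` regular users."""
    return regular_users + vip_users * vip_weight


def impacted_users(served_users, period):
    """Users theoretically affected by an outage in the given usage period."""
    return served_users * PERIOD_FACTOR[period]


def equivalent_outage_hours_per_user(outages):
    """Roll outages up into weighted outage hours per served user.

    `outages` is an iterable of (duration_hours, period) pairs. Dividing
    impacted user-hours by served users leaves duration times the factor.
    """
    return sum(hours * PERIOD_FACTOR[period] for hours, period in outages)


served = weighted_served_users(regular_users=900, vip_users=100)
print(impacted_users(served, "heavy"))
print(equivalent_outage_hours_per_user([(2, "heavy"), (4, "low")]))
```

The per-user figure is what you would compare against the "Y hours of outage per user per year" allowance for a component.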

An important lesson we learned that you might want to keep in mind: some customers didn't align with our heavy/active/low usage times, and these customers were kind of upset at the increase in downtime they experienced. One solution was to set aside specific components for those customers, with those components' SLAs having different weighting factors than the rest of the system.

The solution I was involved with might be a bit more complex than web services need, but here is an idea that might help for estimating web service uptime in a meaningful way. Assuming the web service has registered users, I would have two classes of users: registered users and unregistered visitors. The base number of registered users in the uptime calculation would be just that, the number of registered users, and I'd count each registered user as equivalent to 3 or 4 unregistered visitors. For a guesstimate of unregistered visitors, I'd take the number of main page views in a month and multiply it by some percentage that makes sense for the web service.
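As a rough sketch of that estimate (the function name, the default registered-user weight of 3, and the 10% page-view fraction are placeholders I made up; pick values that fit your service):

```python
def estimated_user_base(registered_users, monthly_main_page_views,
                        visitor_fraction=0.10, registered_weight=3):
    """Estimate the user base in unregistered-visitor equivalents.

    visitor_fraction: assumed share of monthly main page views that
        represents distinct unregistered visitors (a guesstimate).
    registered_weight: how many unregistered visitors one registered
        user counts as (3 or 4 in the description above).
    """
    unregistered_visitors = monthly_main_page_views * visitor_fraction
    return registered_users * registered_weight + unregistered_visitors


# Hypothetical service: 2,000 registered users, 50,000 main page views a month.
print(estimated_user_base(2000, 50000))
```

The result can stand in for the "served users" number when weighting outages.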

Although number of users is the basic metric being used, the point of the exercise is not really the number of users. The user count is only a means of getting to some rational process for estimating the importance of a given service at different times. So now that I've made you read all this, I'll share a shortcut that can work in a more open-minded shop than I had the luxury of working in. Just pick weighting factors that make reasonable sense: a service outage in the middle of a peak time is X times as bad as one during normal operation, which in turn is Y times as bad as one in the middle of the night (or other low period).
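The shortcut can be sketched in a few lines. The values chosen for X and Y below are arbitrary examples, and the names are only for illustration:

```python
# Assumed example ratios: tune these to whatever feels right for your shop.
X = 2.0  # peak outage is X times as bad as a normal-hours outage
Y = 3.0  # normal-hours outage is Y times as bad as a low-period outage

# Express every period relative to a low-period minute (weight 1).
WEIGHTS = {"low": 1.0, "normal": Y, "peak": X * Y}


def weighted_downtime(outages):
    """Sum outage minutes, each scaled by the weight of its period.

    `outages` is an iterable of (minutes, period) pairs.
    """
    return sum(minutes * WEIGHTS[period] for minutes, period in outages)


# Twenty minutes down at peak counts far more than twenty minutes at night.
print(weighted_downtime([(20, "peak")]), weighted_downtime([(20, "low")]))
```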
posted by forforf at 1:58 PM on April 2, 2010 [1 favorite]
