99.999% uptime, simplified.
August 17, 2011 4:29 AM

How do companies achieve extremely low downtime for websites and network applications?

I'm just asking out of interest - I'm not a server admin and I'm not trying to design a high availability system, so “in general” answers are fine. I was just wondering how enterprises that need three, four, five, or more nines of availability accomplish that. I tried googling "high availability," but the Wikipedia article is really vague and everything else was either too technical or a sales pitch. So I'm looking for a "slightly more in-depth than howstuffworks.com" explanation that someone (me) who has a passing understanding of how servers, load balancing, and backups work can understand.

For instance, banks will suffer chaos if their systems to process electronic transactions are down. Hospitals will have difficulty caring for patients if they can't see the electronic chart. Amazon will lose millions of dollars in orders if their site is down. But things break all the time, so I can't see how the above situations can be avoided 99.9999% of the time. Hard drives fail. An unusual, untested application state occurs and the application crashes. An upgrade goes awry and the server is out of commission for a few hours. Someone spills beer on the server.

How do they prevent / mitigate the risk of downtimes? Is everything mirrored in a data center across the country? Doesn't failing over to a redundant system result in data integrity issues? Give me the whole picture, I think it'll be interesting.
posted by Tehhund to Computers & Internet (17 answers total) 7 users marked this as a favorite
I think it's mostly about redundancy. Any single server is going to fail eventually. A hard drive will die, a power supply will blow, etc. But if you are load balancing across three servers, or 30, then a single server failure basically goes unnoticed by the end user. Same thing with storage. Amazon claims 11 nines of durability (99.999999999%) for the S3 storage service. They claim it was designed to survive catastrophic failures at two separate datacenters at the same time. Again, same thing with the network. Most datacenters have at least two separate bandwidth providers. If AT&T goes down, Sprint is still up.
posted by COD at 4:50 AM on August 17, 2011 [1 favorite]

hard drives fail, but RAID lets you keep going even if you lose a drive. machines crash, but having several servers with a load balancer handing out connections to the servers that are working keeps you going. load balancers, firewalls, and routers crash, but a high-availability pair lets you keep going even if you lose one.

ginormous databases are mirrored and synchronized across two different locations with a big huge data pipe between them.

posted by rmd1023 at 4:53 AM on August 17, 2011

An unusual, untested application state occurs and the application crashes. An upgrade goes awry and the server is out of commission for a few hours. Someone spills beer on the server.

There is not one server or one copy of the application. I think that's the picture you've got to get out of your mind. There are tens or hundreds of servers and copies of the application providing service. The applications are replicating user state across to each other. The database is split into many databases transmitting data back and forth. Knock any one of these out and...the whole thing keeps running. Dynamic routing algorithms at every level ensure that a "dead" server is no longer sent any requests.
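
To make the "dead server is no longer sent any requests" idea concrete, here is a minimal sketch in Python. The `LoadBalancer` class, its method names, and the server names are made up for illustration; real load balancers run their own health checks and are far more involved.

```python
from itertools import cycle


class LoadBalancer:
    """Round-robin balancer that skips servers marked unhealthy (illustrative only)."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._ring = cycle(self.servers)

    def mark_down(self, server):
        # In a real system a failed health check would trigger this.
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        """Return the next healthy server, or raise if none are left."""
        for _ in range(len(self.servers)):
            server = next(self._ring)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")


lb = LoadBalancer(["web1", "web2", "web3"])  # hypothetical server names
lb.mark_down("web2")                         # pretend web2 just died
picks = [lb.pick() for _ in range(4)]
print(picks)
```

Requests keep flowing to the two remaining servers; the end user only notices if every server is down at once.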

How do they prevent / mitigate the risk of downtimes? Is everything mirrored in a data center across the country? Doesn't failing over to a redundant system result in data integrity issues? Give me the whole picture, I think it'll be interesting.

Now, you're talking Disaster Recovery. Yes, there's often one or more passive datacenters to which all data is getting mirrored. If anything happens to the primary datacenter, the load balancers, the DNS, can redirect users in an instant.

There are no data integrity issues, by definition. In these large enterprise systems, the fundamental unit of data is not bits or bytes but transactions. It is complete transactions that are replicated.
posted by vacapinta at 4:59 AM on August 17, 2011 [1 favorite]

If you look at the "downtime per year" calculation in the article you linked to you will see that even a crazily specific looking figure like "99.9999%" maps to an actual interval of allowed downtime: 31.5 seconds per year.
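
The arithmetic behind that figure is short enough to sketch. This assumes a 365-day year; the function name is just an illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000, assuming a 365-day year


def allowed_downtime_seconds(availability_percent):
    """Seconds of downtime per year permitted at a given availability."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


for pct in (99.9, 99.99, 99.999, 99.9999):
    print(f"{pct}% -> {allowed_downtime_seconds(pct):.1f} seconds/year")
```

Each extra nine divides the downtime budget by ten, which is why the last few nines are so expensive.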

It is possible to arrive at this figure from an engineering perspective: by analysing each of the components in the hosting system and then working out how long it would take to restore service in the event of a reasonably foreseeable "worst case scenario". This should then be seen to mesh with the client's calculation of the longest outage they could bear without enduring serious problems.

It is also possible to arrive at such a figure purely from a sales perspective: we want to produce a really impressive looking figure that reassures our client. We know that if we break our impressive sounding SLA (and get found out) then we will have to pay out $x. Let us make sure that some of the (hefty) premium we charge for this level of availability is set apart to pay this sum (or to pay our own insurance firm). A craftier tactic would be to have an SLA which is only seen to be in breach after a mean down time over several years exceeds Y. By the time we run into problems on that one the person in charge of monitoring at the client side will probably have left or stopped counting.

In short: "high availability" can sometimes be something which is actually worked out. But a lot of the time it is about using client paranoia to boost hosting profits.
posted by rongorongo at 5:32 AM on August 17, 2011 [1 favorite]

I can propose a list for web sites at least. Not in any particular order:
  • Redundancy (network, hard drives, servers, load-balancing, etc. etc.).
  • Automated backups, which are monitored, and are tested by deliberately failing and restoring regularly.
  • Load-testing and unit-testing as an integral part of the development cycle.
  • Well-managed code versioning systems with protocols understood and complied with by the whole development team.
  • Well-understood and tested lines of communication throughout the organization.
  • Tested, easily rolled-back deployments, preferably with staging pre-production testing after the development cycle.
  • Good security (network and server monitoring and intrusion detection, admins up-to-date with the latest security issues and attack strategies, protocols for emergency patching, etc.).
  • Happy well-respected well-compensated personnel at every level.
  • Good management willing to let the tech team tell them how to run things rather than the other way around, when appropriate.
  • Redundancy.

I'm probably forgetting some things. And some of this is my opinion, and I'm sure you can get pretty strong uptime without all of these factors, but if you have them all you're in good shape.

And of course, what rongorongo said about marketing, too.
posted by dubitable at 6:02 AM on August 17, 2011

I think it's mostly about redundancy.

This is the answer. Have servers in multiple geographically separated locations, have multiple uplinks to the internet through different providers, back up everything constantly.
posted by empath at 6:49 AM on August 17, 2011

Oh, and also doing everything during maintenance windows (i.e. at 2am) so when you do break something during an upgrade or a repair, nobody notices it, and that time window is accounted for in the service agreement.
posted by empath at 6:51 AM on August 17, 2011

When I worked for [very large company you've heard of], the web product I worked on handled ~300M requests a month, and had like 3 or 4 nines. The Apaches that served the front end were in two different colo facilities, one on the west coast US, and one on the east.

In each colo, we had like 15 or 20 machines, of which at any time 10 or so (per colo) were serving live traffic.

To do an update, we'd send all the traffic to one colo, do the release in the down colo, then send all the traffic back and update the other one.
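
A rough sketch of that drain, release, restore sequence in Python. The `rolling_release` function, the site names, and the version strings are invented for illustration and are not the tooling described here.

```python
def rolling_release(sites, new_version):
    """Update one site at a time while the other sites carry all traffic.

    `sites` maps a site name to a dict with "version" and "live" keys.
    Returns the list of steps taken, for inspection.
    """
    steps = []
    for name, site in sites.items():
        others_live = [n for n, s in sites.items() if n != name and s["live"]]
        if not others_live:
            raise RuntimeError(f"cannot drain {name}: no other live site")
        site["live"] = False                 # drain: send all traffic elsewhere
        steps.append(f"drain {name}")
        site["version"] = new_version        # release into the idle site
        steps.append(f"release {new_version} to {name}")
        site["live"] = True                  # bring it back into rotation
        steps.append(f"restore {name}")
    return steps


sites = {
    "west": {"version": "1.0", "live": True},   # hypothetical colo names
    "east": {"version": "1.0", "live": True},
}
steps = rolling_release(sites, "1.1")
print("\n".join(steps))
```

At no point are both sites out of rotation, so users never see the release happen. The catch is that each site has to be able to carry the full load on its own.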

The data subsystem we used had a similar architecture, so they could handle their upgrades, etc, without us (the frontend) noticing.
posted by colin_l at 6:59 AM on August 17, 2011

Oh, Netflix has a great blog post about this - their Chaos Monkey, which introduces regular disruption into their systems, *forcing* them to treat these situations as normal rather than extraordinary.
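The idea can be sketched in a few lines of Python. This is a toy model and not Netflix's tool: the instance names and both functions are invented for illustration.

```python
import random


def serve(instances):
    """Answer a request if at least one instance is still running."""
    return any(instances.values())


def chaos_round(instances, rng):
    """Kill one randomly chosen running instance and return its name."""
    running = [name for name, up in instances.items() if up]
    victim = rng.choice(running)
    instances[victim] = False
    return victim


rng = random.Random(0)  # seeded so the run is repeatable
instances = {"app1": True, "app2": True, "app3": True}
victim = chaos_round(instances, rng)
still_serving = serve(instances)
print(f"killed {victim}; service still answering: {still_serving}")
```

If killing one instance ever makes `serve` fail, you have found a single point of failure on your own schedule rather than at 3am.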
posted by colin_l at 7:02 AM on August 17, 2011 [1 favorite]

Best answer: I work for a telecommunications software and hardware company that sells products that measurably deliver over six nines of availability (under 32 seconds down time per year).

Firstly, let me agree with rongorongo: at a certain level of availability customers simply stop caring about whether the brochure said six or seven nines. Particularly in the telecommunications industry, any "significant" downtime is sufficient to kill your customer's confidence, full stop. Sounds arduous? Damn straight it is. But the word "significant" is important, and I'll come back to it later.

Redundancy is important and is a necessary but not sufficient condition for high availability computer systems. If one assumes that any given component of a computer system may fail at any time, or accepts the more realistic assumption that multiple related components will fail at any time due to common dependencies (for example network equipment, power supplies, natural disasters in geographical locations), then, even to someone with no technical background whatsoever, it's obvious we need "more than one" of everything. This is easier said than done, but at least we've all acknowledged this as a requirement; it continually amazes me how many people don't even acknowledge this requirement.

I've stated that redundancy is not sufficient. This is because, particularly from the perspective of the military, components of a computer system could be "up", hence superficially "available", but outputting incorrect or inappropriate values. How does one guarantee the availability of a system where components are "available" but incorrect? This is also known as the Byzantine Generals' problem, and a commonly accepted solution is that such a system must satisfy n > 3t, where n is the number of processes and t is the number of faulty processes that the system can cope with. This is why, for example, the control software on European Tornado fighters is written by several different contractors and deployed on seven individual, distinct pieces of hardware (sorry, don't have a reference for that), and decisions are made by "consensus".
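
As a small worked example of the n > 3t bound, here is a Python sketch. The function names and the sample readings are made up, and the vote is a plain majority over outputs, which is much simpler than a real Byzantine agreement protocol (those exchange several rounds of messages).

```python
from collections import Counter


def max_faulty(n):
    """Largest t that satisfies n > 3t for n processes."""
    return (n - 1) // 3


def majority_value(outputs):
    """Return the value reported by a strict majority of units, else None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


units = 7
print(f"{units} units tolerate up to {max_faulty(units)} faulty ones")

readings = [10, 10, 10, 10, 10, 99, -1]  # two units report garbage
print(f"agreed value: {majority_value(readings)}")
```

With seven units the bound allows two arbitrary failures, which fits the seven pieces of hardware mentioned above.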

(Civilians often disregard the Byzantine Generals' problem, and assume incorrect components will always go down, and then write their software to deliberately crash when they think they're incorrect. This is also known as the "crash-only" philosophy of software reliability, and is a Good Idea).

Besides expanding upon redundancy to broaden it beyond a purely numeric perspective to a more qualitative perspective, you get a better understanding of availability by not thinking of availability in binary terms. What is availability? What happens when this service is unavailable? Do I necessarily have to throw up my arms, sing songs to myself, and shut off all my interactions with the world? Or are there gradations to availability; can I effectively divide my system into blocks and maintain different levels of service depending on the severity of the failure? This, inevitably, leads to a technical discussion, but it helps to acknowledge you need to think about availability in greys, not black and white, and design your software architecture into little blobs, not one massive blob.
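
One way to picture "little blobs" is a page assembled from independent blocks, where a failed block is replaced by a placeholder instead of taking the whole page down. This is only a sketch; `render_page`, the block names, and the failing service are hypothetical.

```python
def render_page(blocks):
    """Call each block; substitute a placeholder for any block that fails."""
    page, failed = {}, []
    for name, fetch in blocks.items():
        try:
            page[name] = fetch()
        except Exception:
            # Degrade this one block; the rest of the page still renders.
            page[name] = "(temporarily unavailable)"
            failed.append(name)
    return page, failed


def broken_recommendations():
    raise ConnectionError("recommendation service is down")


page, failed = render_page({
    "catalog": lambda: ["book", "lamp"],
    "cart": lambda: {"items": 2},
    "recommendations": broken_recommendations,
})
print(page)
print("degraded blocks:", failed)
```

The shop can still sell things while recommendations are missing, which is a grey level of availability rather than a black one.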

Also, RE: your point that:
Doesn't failing over to a redundant system result in data integrity issues?
You're self-discovering Eric Brewer's "CAP Theorem", stated as:
a shared-data system can have at most two of the three following properties: Consistency, Availability, and tolerance to network Partitions.
My last reference below has more details about this theorem. It's well worth understanding the theorem, my reference is a decent start.

Other references that I've found interesting:
posted by asymptotic at 7:05 AM on August 17, 2011 [6 favorites]

99.999% uptime = 0.0876 hrs/year downtime = 5.3 minutes
99.99% uptime = 0.876 hrs/year downtime = 52.6 minutes
99.9% uptime = 8.76 hrs/year downtime

Based on those numbers it's pretty clear that 99.999% uptime is not possible with only one server if you're doing any serious maintenance work (hardware replacement, software updates, reboots). That means they're running multiple servers, all of which can be cycled independently.

Don't think of it as 99.999% uptime, think of it as a whole lot of 99% uptimes combined.
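To put rough numbers on that, here is a sketch in Python. It assumes the servers fail independently and that any single working server is enough to serve users; both are simplifying assumptions, and the function name is just for illustration.

```python
def combined_availability(single, replicas):
    """Availability of `replicas` independent servers when any one suffices."""
    return 1 - (1 - single) ** replicas


for n in range(1, 5):
    print(f"{n} server(s) at 99% each -> {combined_availability(0.99, n):.8f}")
```

Under those assumptions three 99% servers already reach about six nines. Shared dependencies such as power, network, or a bad software release break the independence assumption, so real systems do worse than this formula suggests.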
posted by blue_beetle at 7:45 AM on August 17, 2011 [1 favorite]

Beyond the obvious "redundancy"- software engineering has gotten orders of magnitude better at handling load and threading, as well as distributing work.

Message-based architectures, which used to be a royal pain in the ass to implement, are now practically a commodity, and are used to offload a lot of heavy lifting and work around failures. The traditional model for a web app for ages has been that a single server does all the work, but now when you go to, say Facebook, there are a ton of "agent" processes that assemble that page for you. A lot of the content you see was generated previously and cached away before you even arrived.

Web and app servers also used to crash constantly under load. Now, they're much better at throttling themselves and queuing requests- you'll get a really slow response, but it's technically still "up".
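A toy version of that throttling in Python, using a bounded queue. The `ThrottledServer` class and the request names are invented for illustration, and rejecting work when the queue is full is one possible policy, not necessarily what any particular server does.

```python
import queue


class ThrottledServer:
    """Accepts work into a bounded queue instead of falling over under load."""

    def __init__(self, max_pending):
        self.pending = queue.Queue(maxsize=max_pending)

    def accept(self, request):
        """Queue the request, or shed it when the queue is full."""
        try:
            self.pending.put_nowait(request)
            return "queued"
        except queue.Full:
            return "rejected: try again later"

    def work(self):
        """Process one queued request, if any."""
        try:
            return f"done: {self.pending.get_nowait()}"
        except queue.Empty:
            return None


server = ThrottledServer(max_pending=2)
results = [server.accept(f"req{i}") for i in range(3)]
first_done = server.work()
print(results, first_done)
```

Queued requests wait longer but still get answered, so the server stays "up" in the sense described here.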
posted by mkultra at 8:16 AM on August 17, 2011 [1 favorite]

Here's a good write-up on designing high availability applications.

Another thing to consider is how performance metrics are determined. PEPCO, the power company in the DC area, has been in a lot of hot water due to long outages after big storms, yet they reported good performance because their metrics didn't include outages due to major events.
posted by hoppytoad at 10:48 AM on August 17, 2011

The numbers that companies publish are typically for unscheduled downtime. If you read closely then they might have a significant amount of scheduled downtime.
posted by gregr at 12:49 PM on August 17, 2011

Separate test systems where changes are vetted before they are applied to live systems.
Proactive monitoring that alerts appropriate personnel of potential problems before they become critical.
Having a process in place to investigate why you had downtime so that you can learn from your mistakes.
posted by itheearl at 7:06 PM on August 17, 2011

"Big iron" servers have features that you don't find in common desktop computers. For example, you can hot swap just about everything: you can add or remove hard drives, memory, even CPUs without having to shut down. You can even change out the power supply without rebooting, as they have multiple redundant units, so that you can switch over to unit B while you take out and replace unit A, and then switch back. On the software side, there are further tricks to reduce downtime, things like ksplice which lets you update a Linux kernel live without having to reboot (a kernel update is pretty much the only piece of software on a server that traditionally requires a reboot.) Virtualization also allows you to move the virtualized image of a server from one physical place to another without ever shutting it down. This means that even if all of the above fail you and you need to reboot, you can migrate services from one physical machine to another without ever having to stop anything.
posted by Rhomboid at 11:27 AM on August 18, 2011 [1 favorite]

Response by poster: I should have responded earlier, but wow asymptotic - you knocked it out of the park. I could more or less wrap my head around strong test procedures, redundancy, and failover, but what was really nagging at me was the split second (or seconds, or minutes) as failover happens - for example, if you send data, receive a response, and the receiving server fails before that data is copied to its redundant servers, then even if the redundant server comes up immediately you have a situation where the sending system thinks its message was handled but in fact the current receiving system has not handled the message. Even if the redundant server received all the same messages as the main server, the main server could fail while processing the message, and we're in the same boat again. Sure, a good failover is quick, but the tiny amount of data that we lost could be an order to sell a million dollars of stock or a lab result that says a patient is very ill - we have to assume that the lost info, no matter how small, matters.

Anyway, those articles (especially the last one) are excellent and really get at the heart of the problem. Learning about the constraints on CAP has pretty much answered my question: at some point, if you are really seeking 5 9's of uptime, you're sacrificing something else.
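
The failover window I was worried about can be sketched in Python. Everything here (`Replica`, `handle`, the `sync` flag, the sample message) is a made-up illustration of the trade-off, not how any real system is built.

```python
class Replica:
    """A server that keeps a log of the messages it has stored."""

    def __init__(self):
        self.log = []
        self.alive = True

    def append(self, message):
        if not self.alive:
            raise ConnectionError("replica is down")
        self.log.append(message)


def handle(message, primary, standby, sync=True):
    """Store a message and return True once it is acknowledged.

    With sync=True the sender is acknowledged only after both copies exist,
    so a primary crash right after the ack cannot lose the message.
    With sync=False the ack goes out after the primary write alone.
    """
    primary.append(message)
    if sync:
        standby.append(message)
    return True


# Asynchronous: acknowledged, then the primary dies before copying.
primary, standby = Replica(), Replica()
handle("SELL 1000 shares", primary, standby, sync=False)
primary.alive = False
lost = "SELL 1000 shares" not in standby.log

# Synchronous: the standby already has the message when the ack goes out.
primary2, standby2 = Replica(), Replica()
handle("SELL 1000 shares", primary2, standby2, sync=True)
primary2.alive = False
kept = "SELL 1000 shares" in standby2.log

print(f"async lost the order: {lost}; sync kept the order: {kept}")
```

The synchronous version never loses an acknowledged message, but if the standby is unreachable `handle` raises and the write is refused. That is the "sacrificing something else" part: consistency is bought with availability.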
posted by Tehhund at 6:07 AM on May 8, 2012
